Setup for today's exercise
First, delete the experimental index left over from last time:
curl -XDELETE http://127.0.0.1:9200/synctest/article
output:
{"acknowledged":true}
Create the new mapping:
curl -XPUT 'http://127.0.0.1:9200/servcie/_mapping/massage' -d '
{
"massage":{
"properties":{
"location":{
"type":"geo_point"
},
"name":{
"type":"string"
},
"age":{
"type":"integer"
},
"address":{
"type":"string"
},
"price":{
"type":"double",
"index":"not_analyzed"
},
"is_open":{
"type":"boolean"
}
}
}
}'
View the newly created mapping (note that the "index":"not_analyzed" setting on price does not come back; for numeric fields that value is already the default, so the response omits it):
curl -XGET http://127.0.0.1:9200/servcie/massage/_mapping?pretty
{
"servcie" : {
"mappings" : {
"massage" : {
"properties" : {
"address" : {
"type" : "string"
},
"age" : {
"type" : "integer"
},
"is_open" : {
"type" : "boolean"
},
"location" : {
"type" : "geo_point"
},
"name" : {
"type" : "string"
},
"price" : {
"type" : "double"
}
}
}
}
}
}
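As a quick sanity check, we can index one sample document against the new mapping (all field values below are made up for illustration):
curl -XPUT 'http://127.0.0.1:9200/servcie/massage/1' -d '
{
"name":"test shop",
"age":3,
"address":"Hangzhou, Zhejiang",
"price":99.5,
"is_open":true,
"location":{"lat":30.27,"lon":120.16}
}'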
Now on to our tokenization test:
curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"text":"波多菠萝蜜"}'
{
"tokens" : [ {
"token" : "波",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
}, {
"token" : "多",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
}, {
"token" : "菠",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 2
}, {
"token" : "萝",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<IDEOGRAPHIC>",
"position" : 3
}, {
"token" : "蜜",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<IDEOGRAPHIC>",
"position" : 4
} ]
}
Notice that the default standard analyzer has split the Chinese text into individual ideographic characters. An analyzer is made up of a single tokenizer plus zero or more token filters; the next request runs the same default analysis over plain English text:
curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"text":"abc dsf,sdsf"}'
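The _analyze API can also assemble a one-off analyzer from these parts, which makes the composition explicit. A sketch for this 2.x-era cluster, where the body field is named filters (newer releases renamed it to filter):
curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '
{
"tokenizer":"standard",
"filters":["lowercase"],
"text":"abc DSF,sdsf"
}'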
Chinese search
To search Chinese text, you also need a Chinese analyzer; the most widely used one is probably the IK analyzer.
Install the IK plugin
./bin/plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v1.9.3/elasticsearch-analysis-ik-1.9.3.zip
After restarting, list the plugins to confirm it loaded:
curl -XGET http://localhost:9200/_cat/plugins
Marrow analysis-ik 1.9.3 j
Analyzing with the ik analyzer:
curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"analyzer":"ik","text":"波多菠萝蜜"}'
{
"tokens" : [ {
"token" : "波",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_WORD",
"position" : 0
}, {
"token" : "多",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
}, {
"token" : "菠萝蜜",
"start_offset" : 2,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 2
}, {
"token" : "菠萝",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 3
}, {
"token" : "菠",
"start_offset" : 2,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 4
}, {
"token" : "萝",
"start_offset" : 3,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 5
}, {
"token" : "蜜",
"start_offset" : 4,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 6
} ]
}
You can see that 菠萝 (pineapple) and 菠萝蜜 (jackfruit) are now produced as whole tokens.
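The ik analyzer used above is the fine-grained variant and emits overlapping tokens. The plugin also registers a coarser-grained ik_smart analyzer that keeps only one split per span; if you want fewer, non-overlapping tokens, try (output omitted):
curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"analyzer":"ik_smart","text":"波多菠萝蜜"}'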
As language evolves and new business jargon appears, some words are simply not in IK's dictionary, so even a match_phrase query can fail to retrieve matching documents. What can we do about that?
For example, suppose we want to be able to retrieve the slang word 吊炸天 (roughly "awesome"), which IK 1.9.3 has not collected:
curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"analyzer":"ik","text":"吊炸天天不容"}'
{
"tokens" : [ {
"token" : "吊",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_WORD",
"position" : 0
}, {
"token" : "炸",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
}, {
"token" : "天天",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 2
}, {
"token" : "不容",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 3
} ]
}
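This shows why the phrase can be unfindable: in this context the characters were indexed as 吊 / 炸 / 天天 / 不容, so the positions no longer line up with the query tokens 吊 / 炸 / 天, and a match_phrase query comes up empty. A hypothetical example, assuming an ik-analyzed string field named content:
curl -XPOST 'http://127.0.0.1:9200/servcie/massage/_search?pretty' -d '
{
"query":{
"match_phrase":{"content":"吊炸天"}
}
}'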
If we really do need those matches, this is the point where we have to extend IK's dictionary.
Edit the mydict.dic file under analysis-ik/config/ik/custom; this file exists precisely for our custom vocabulary. Append the new word at the end, save, and restart Elasticsearch.
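For example, from the plugin's config layout described above (adjust the path for your install, one word per line, and keep the file UTF-8 encoded):
echo "吊炸天" >> analysis-ik/config/ik/custom/mydict.dic
# then restart Elasticsearch so IK reloads its dictionaries
After the restart, re-run the analysis: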
curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"analyzer":"ik","text":"吊炸天天不容"}'
{
"tokens" : [ {
"token" : "吊炸天",
"start_offset" : 0,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 0
}, {
"token" : "吊",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_WORD",
"position" : 1
}, {
"token" : "炸",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 2
}, {
"token" : "天天",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 3
}, {
"token" : "不容",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 4
} ]
}
We can see that 吊炸天 is now emitted as a single token.
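To put IK to work beyond the _analyze API, declare it on the string fields that should be segmented with it. A minimal sketch against a fresh index (the names cn_test, article, and content are made up for illustration):
curl -XPUT 'http://127.0.0.1:9200/cn_test' -d '
{
"mappings":{
"article":{
"properties":{
"content":{
"type":"string",
"analyzer":"ik"
}
}
}
}
}'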