Installing the IK Chinese analyzer for Elasticsearch 7.13.4


Download the IK analyzer

IK analyzer on GitHub: https://github.com/medcl/elasticsearch-analysis-ik

Baidu Netdisk mirror, extraction code: fnq0

Unzip the package into the /plugins/ik directory under your Elasticsearch installation, then start the service: ./bin/elasticsearch
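Once the node starts cleanly, a quick way to confirm the plugin was picked up is to list installed plugins via the _cat API (the IK plugin should appear as analysis-ik):

# List installed plugins; the name column should include analysis-ik
GET /_cat/plugins?v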

Testing

Use Kibana's Dev Tools console.

Test the tokenization:

GET /_analyze
{
  "analyzer": "standard",
  "text": "王者荣耀"
}

Result with the standard analyzer (each Chinese character becomes its own token):

{
  "tokens" : [
    {
      "token" : "王",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "者",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "荣",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "耀",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    }
  ]
}

GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "王者荣耀"
}

Result with the ik_smart analyzer (the text is split into whole words):

{
  "tokens" : [
    {
      "token" : "王者",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "荣耀",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    }
  ]
}
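The plugin ships two analyzers: ik_smart produces the coarsest split, as above, while ik_max_word produces the finest-grained split, including overlapping sub-words. The same request can be used to compare them (output omitted here):

GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "王者荣耀"
}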

Using the analyzer


# Create the index
PUT /indexik

# Define the index mapping and specify analyzers for the field
PUT /indexik/_mapping
{
  "properties": {
      "content": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
      }
  }
}

# View the mapping
GET /indexik/_mapping

# Add test data
POST /indexik/_doc
{"content":"美国留给伊拉克的是个烂摊子吗"}

POST /indexik/_doc
{"content":"公安部:各地校车将享最高路权"}

POST /indexik/_doc
{"content":"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"}

POST /indexik/_doc
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}

POST /indexik/_doc
{"content":"中华人民共和国"}


# Search
GET /indexik/_search
{
  "query": {
    "match": {
      "content": "中国"
    }
  }
}

# Delete the index
DELETE /indexik

Searching for 中国 now gives more accurate results than the default analyzer would: only the two documents that actually contain the word 中国 match, whereas the standard analyzer, which indexes single characters, would also match documents such as 中华人民共和国.

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.79423964,
    "hits" : [
      {
        "_index" : "indexik",
        "_type" : "_doc",
        "_id" : "3ObjKnsBEq3c_HSrSrAr",
        "_score" : 0.79423964,
        "_source" : {
          "content" : "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
        }
      },
      {
        "_index" : "indexik",
        "_type" : "_doc",
        "_id" : "3ebjKnsBEq3c_HSrUbCx",
        "_score" : 0.79423964,
        "_source" : {
          "content" : "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
        }
      }
    ]
  }
}
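To see why exactly these two documents match, you can inspect how the content field tokenizes its text; with the field parameter, the _analyze API applies the index-time analyzer from the mapping (ik_max_word here):

# Analyze text the way the content field indexes it
GET /indexik/_analyze
{
  "field": "content",
  "text": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
}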

Making the IK configuration take effect

Pitfalls

Adding this to elasticsearch.yml causes an error on startup:

# ik analyzer
index.analysis.analyzer.default.type: ik_max_word

Since elasticsearch 5.x index level settings can NOT be set on the nodes configuration like the elasticsearch.yaml, in system properties or command line arguments.In order to upgrade all indices the settings must be updated via the /${index}/_settings API. Unless all settings are dynamic all indices must be closed in order to apply the upgradeIndices created in the future should use index templates to set default values.

Please ensure all required values are updated on all indices by executing:

curl -XPUT 'http://localhost:9200/_all/_settings?preserve_existing=true' -d '{ "index.analysis.analyzer.default.type" : "ik_max_word" }'

Since Elasticsearch 5.x, index-level settings cannot be set in the node configuration; they must be changed through the /${index}/_settings API or supplied when the index is created.

Set a default analyzer on the index

PUT indexik
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "ik_smart"
        }
      }
    }
  }
}
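With this setting, any text field in indexik that does not declare its own analyzer falls back to ik_smart. A quick way to check, since the index-level _analyze API uses the index's default analyzer when none is specified:

GET /indexik/_analyze
{
  "text": "中华人民共和国"
}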

Configuring the IK analyzer

IKAnalyzer.cfg.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- Configure your own extension dictionaries here -->
	<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
	<!-- Configure your own extension stopword dictionary here -->
	<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
	<!-- Configure a remote extension dictionary here -->
	<entry key="remote_ext_dict">location</entry>
	<!-- Configure a remote extension stopword dictionary here -->
	<entry key="remote_ext_stopwords">http://xxx.com/xxx.dic</entry>
</properties>
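Each local dictionary is a plain UTF-8 text file with one entry per line, with paths resolved relative to this config file; local dictionaries are loaded at startup, while remote dictionaries are polled over HTTP. As a rough check (assuming 王者荣耀 has been added to custom/mydict.dic and the node restarted), ik_smart should now keep the word as a single token:

# Assumes 王者荣耀 was added to custom/mydict.dic (one word per line, UTF-8)
# and Elasticsearch was restarted so IK reloads the local dictionaries
GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "王者荣耀"
}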