Elasticsearch (Part 2): Using a Chinese Analyzer in Elasticsearch



The IK analyzer offers solid support for Chinese text. Compared with Elasticsearch's built-in analyzers, it is far better suited to the rich and subtle Chinese language.

The plugin provides two analyzers and two tokenizers, both named ik_smart and ik_max_word.

Since v5.0.0, the analyzer and tokenizer named ik have been removed; use ik_smart and ik_max_word instead.
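Once the plugin is installed (see below), the two are commonly combined in a field mapping: index with ik_max_word for maximum recall, and search with ik_smart for precision. A minimal sketch, assuming ES 7.x mapping syntax; the index name my_index and field name content are hypothetical:

PUT	http://localhost:9200/my_index
{
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart"
            }
        }
    }
}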

1. Download:

Option 1:

  1. Plugin home page: github.com/medcl/elasticsearch-analysis-ik
  2. Download a release zip that matches your Elasticsearch version from github.com/medcl/elasticsearch-analysis-ik/releases
  3. Unzip the downloaded file.
  4. Create an ik directory under es/plugins and copy all of the extracted files into it.
  5. Restart the Elasticsearch service.
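Put together, the manual install looks roughly like this. A sketch only: the ES_HOME variable and the download path are assumptions, and 6.3.0 stands in for your own version:

# assumes ES_HOME points at your Elasticsearch install directory
cd $ES_HOME
mkdir -p plugins/ik
# assumes the release zip was downloaded to ~/Downloads
unzip ~/Downloads/elasticsearch-analysis-ik-6.3.0.zip -d plugins/ik
# restart Elasticsearch afterwards so the plugin is picked up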

Note: the plugin version must match your own Elasticsearch version.

| IK version | ES version |
| --- | --- |
| master | 7.x |
| 6.x | 6.x |
| 5.x | 5.x |
| 1.10.6 | 2.4.6 |
| 1.9.5 | 2.3.5 |
| 1.8.1 | 2.2.1 |
| 1.7.0 | 2.1.1 |
| 1.5.0 | 2.0.0 |
| 1.2.6 | 1.0.0 |
| 1.2.5 | 0.90.x |
| 1.1.3 | 0.20.x |
| 1.0.0 | 0.16.2 -> 0.19.0 |

Option 2:

Install via elasticsearch-plugin (supported since v5.5.1):

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip
Note: replace 6.3.0 with your own Elasticsearch version.

2. Testing:

1. First, create an index:

PUT	http://localhost:9200/test

2. Use this index to test tokenization with the _analyze API.
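The request itself is omitted here; judging from the token output, it was an _analyze call along these lines (the sentence and the choice of ik_smart are inferred from the result below):

POST	http://localhost:9200/test/_analyze
{
    "analyzer": "ik_smart",
    "text": "我自己长得可真好看"
}

Result: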

{
    "tokens": [
        {
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "CN_CHAR",
            "position": 0
        },
        {
            "token": "自己",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "长得",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "可",
            "start_offset": 5,
            "end_offset": 6,
            "type": "CN_CHAR",
            "position": 3
        },
        {
            "token": "真",
            "start_offset": 6,
            "end_offset": 7,
            "type": "CN_CHAR",
            "position": 4
        },
        {
            "token": "好看",
            "start_offset": 7,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 5
        }
    ]
}

3. Custom dictionary:

POST	http://localhost:9200/test/_analyze
{
    "analyzer": "ik_smart",
    "text": "公众号博思奥园"
}

By default, with no custom dictionary of our own, the analyzer splits 博思奥园 apart. If we do not want it split, we can define a custom dictionary. The default result:

{
    "tokens": [
        {
            "token": "公众",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "号",
            "start_offset": 2,
            "end_offset": 3,
            "type": "CN_CHAR",
            "position": 1
        },
        {
            "token": "博",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_CHAR",
            "position": 2
        },
        {
            "token": "思",
            "start_offset": 4,
            "end_offset": 5,
            "type": "CN_CHAR",
            "position": 3
        },
        {
            "token": "奥",
            "start_offset": 5,
            "end_offset": 6,
            "type": "CN_CHAR",
            "position": 4
        },
        {
            "token": "园",
            "start_offset": 6,
            "end_offset": 7,
            "type": "CN_CHAR",
            "position": 5
        }
    ]
}

1. Create a dictionary file myext.dic under ik/config and add the words you need:

公众号
博思奥园

2. Register the file under the ext_dict entry in ik/config/IKAnalyzer.cfg.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- Users can configure their own extension dictionary here -->
	<entry key="ext_dict">myext.dic</entry>
	<!-- Users can configure their own extension stopword dictionary here -->
	<entry key="ext_stopwords"></entry>
	<!-- Users can configure a remote extension dictionary here -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Users can configure a remote extension stopword dictionary here -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

3. Restart Elasticsearch so the new dictionary is loaded; when we send the same request again, we get the result we wanted:

{
    "tokens": [
        {
            "token": "公众号",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "博思奥园",
            "start_offset": 3,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}

4. Extension: hot updates of the IK dictionary

The plugin currently supports hot updates of the IK dictionary, through the following entries in the IK configuration file shown above:

<!-- Users can configure a remote extension dictionary here -->
<entry key="remote_ext_dict">location</entry>
<!-- Users can configure a remote extension stopword dictionary here -->
<entry key="remote_ext_stopwords">location</entry>

Here location is a URL, for example http://localhost:8080/myext.dic. A hot dictionary update happens whenever the request satisfies the following two requirements:

1. The HTTP response must return two headers: Last-Modified and ETag, both strings. As soon as either of them changes, the plugin fetches the file again and updates the dictionary.

2. The response body must contain one word per line, using \n as the line separator.

Meet these two requirements and words are hot-reloaded without restarting the Elasticsearch instance.

You can keep the hot words in a UTF-8 encoded .txt file served by nginx or any other simple HTTP server; when the .txt file is modified, the server automatically returns fresh Last-Modified and ETag values to clients requesting the file. You can also build a small tool that extracts relevant terms from your business system and updates this .txt file.
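For illustration, a response that satisfies both requirements could look like the following; the header values here are invented, and the body reuses the two words from myext.dic above:

HTTP/1.1 200 OK
Last-Modified: Wed, 03 Nov 2021 08:00:00 GMT
ETag: "686897696a7c876b7e"
Content-Type: text/plain; charset=utf-8

公众号
博思奥园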

5. Comparison

Elasticsearch ships with a default analyzer of its own, so we can compare it against the IK analyzers.

Using the same test index, analyze the same text with the built-in analyzer and with the IK analyzers:

POST	http://localhost:9200/test/_analyze
1. Built-in standard analyzer:
{
    "analyzer": "standard",
    "text": "公众号博思奥园,中国文化博大精深"
}
2. IK's ik_max_word:
{
    "analyzer": "ik_max_word",
    "text": "公众号博思奥园,中国文化博大精深"
}
3. IK's ik_smart, with the custom dictionary:
{
    "analyzer": "ik_smart",
    "text": "公众号博思奥园,中国文化博大精深"
}

The three results are shown below. Comparing them, we can see:

  1. The built-in analyzer is not very friendly to Chinese: it breaks all of the text into single characters.
  2. The IK analyzers recognize real words, including idioms, doing better justice to the depth of Chinese. As for the difference between ik_max_word and ik_smart:

ik_max_word splits the text at the finest possible granularity; ik_smart produces the coarsest-grained split.

1. standard
{
    "tokens": [
        {
            "token": "公",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "众",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "号",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "博",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "思",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "奥",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        },
        {
            "token": "园",
            "start_offset": 6,
            "end_offset": 7,
            "type": "<IDEOGRAPHIC>",
            "position": 6
        },
        {
            "token": "中",
            "start_offset": 8,
            "end_offset": 9,
            "type": "<IDEOGRAPHIC>",
            "position": 7
        },
        {
            "token": "国",
            "start_offset": 9,
            "end_offset": 10,
            "type": "<IDEOGRAPHIC>",
            "position": 8
        },
        {
            "token": "文",
            "start_offset": 10,
            "end_offset": 11,
            "type": "<IDEOGRAPHIC>",
            "position": 9
        },
        {
            "token": "化",
            "start_offset": 11,
            "end_offset": 12,
            "type": "<IDEOGRAPHIC>",
            "position": 10
        },
        {
            "token": "博",
            "start_offset": 12,
            "end_offset": 13,
            "type": "<IDEOGRAPHIC>",
            "position": 11
        },
        {
            "token": "大",
            "start_offset": 13,
            "end_offset": 14,
            "type": "<IDEOGRAPHIC>",
            "position": 12
        },
        {
            "token": "精",
            "start_offset": 14,
            "end_offset": 15,
            "type": "<IDEOGRAPHIC>",
            "position": 13
        },
        {
            "token": "深",
            "start_offset": 15,
            "end_offset": 16,
            "type": "<IDEOGRAPHIC>",
            "position": 14
        }
    ]
}

2. ik_max_word
{
    "tokens": [
        {
            "token": "公众",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "号",
            "start_offset": 2,
            "end_offset": 3,
            "type": "CN_CHAR",
            "position": 1
        },
        {
            "token": "博",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_CHAR",
            "position": 2
        },
        {
            "token": "思",
            "start_offset": 4,
            "end_offset": 5,
            "type": "CN_CHAR",
            "position": 3
        },
        {
            "token": "奥",
            "start_offset": 5,
            "end_offset": 6,
            "type": "CN_CHAR",
            "position": 4
        },
        {
            "token": "园",
            "start_offset": 6,
            "end_offset": 7,
            "type": "CN_CHAR",
            "position": 5
        },
        {
            "token": "中国文化",
            "start_offset": 8,
            "end_offset": 12,
            "type": "CN_WORD",
            "position": 6
        },
        {
            "token": "中国",
            "start_offset": 8,
            "end_offset": 10,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "国文",
            "start_offset": 9,
            "end_offset": 11,
            "type": "CN_WORD",
            "position": 8
        },
        {
            "token": "文化",
            "start_offset": 10,
            "end_offset": 12,
            "type": "CN_WORD",
            "position": 9
        },
        {
            "token": "博大精深",
            "start_offset": 12,
            "end_offset": 16,
            "type": "CN_WORD",
            "position": 10
        },
        {
            "token": "博大",
            "start_offset": 12,
            "end_offset": 14,
            "type": "CN_WORD",
            "position": 11
        },
        {
            "token": "精深",
            "start_offset": 14,
            "end_offset": 16,
            "type": "CN_WORD",
            "position": 12
        }
    ]
}

3. ik_smart
{
    "tokens": [
        {
            "token": "公众号",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "博思奥园",
            "start_offset": 3,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "中国文化",
            "start_offset": 8,
            "end_offset": 12,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "博大精深",
            "start_offset": 12,
            "end_offset": 16,
            "type": "CN_WORD",
            "position": 3
        }
    ]
}