Let's build a writing habit together! This is day 3 of my participation in the 「掘金日新计划 · 4 月更文挑战」 event on Juejin.
Preface
In the previous article we finished integrating Elasticsearch with Spring Boot; in this article we introduce the ik analyzer.
The analyzer concept and the default es analyzer
Analyzers
Put simply, an analyzer splits text into words according to its own fixed algorithm. Take the phrase "程序员的第一段代码为:Hello World" ("a programmer's first piece of code is: Hello World"). It might be split into the following fragments: 程序, 程序员, 第一段, 第一段代码, 代码, Hello, Hello World. If we then search for 程序员 ("programmer"), the search term matches this phrase.
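To make the idea concrete, here is a toy sketch of a dictionary-based tokenizer (the `DICTIONARY` set and `toy_tokenize` function are hypothetical illustrations, not how ik actually works — real analyzers are far more sophisticated):

```python
# Toy dictionary-based tokenizer: at every position in the text, emit every
# dictionary entry that starts there. This mimics the "emit all plausible
# words" idea from the example above.
DICTIONARY = {"程序", "程序员", "第一段", "第一段代码", "代码",
              "Hello", "World", "Hello World"}

def toy_tokenize(text: str) -> list[str]:
    tokens = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in DICTIONARY:
                tokens.append(text[start:end])
    return tokens

print(toy_tokenize("程序员的第一段代码为:Hello World"))
# → ['程序', '程序员', '第一段', '第一段代码', '代码', 'Hello', 'Hello World', 'World']
```

Note how overlapping tokens (程序 inside 程序员, Hello inside Hello World) are all emitted — this is exactly what lets a search for any one of them match the original phrase.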
The default es analyzer
The default analyzer in es is standard, and for Chinese text it splits character by character: "今天是周末" ("today is the weekend") becomes 今, 天, 是, 周, 末. This character-level splitting is far too crude for Chinese search, so we bring in the ik analyzer to handle Chinese word segmentation.
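The effect of this per-character splitting is easy to picture — for a CJK string it is essentially the same as splitting the string into single characters (a sketch of the effect only; the real standard analyzer also lowercases and keeps Latin words whole):

```python
# What character-level splitting does to a Chinese phrase:
# each character becomes its own token in the index.
text = "今天是周末"
tokens = list(text)
print(tokens)  # → ['今', '天', '是', '周', '末']
```

A search for the word 周末 ("weekend") would then have to match via the individual characters 周 and 末, which also appear in many unrelated words — hence the poor relevance.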
The ik analyzer
If you are interested, have a look at the official site: the ik analyzer project page. It offers two modes: coarsest granularity (ik_smart) and finest granularity (ik_max_word). Below we will work through examples of both.
Installing the analyzer
On the GitHub releases page, find and download the analyzer version matching your es version. I am using 8.1.2; the download is here: ik analyzer 8.1.2 download. After downloading and unpacking it, go to the plugins folder under the es installation directory, create an ik directory, and copy all of the extracted files into it.
Then restart es and try out the ik analysis.
Trying out es analysis
Native analysis
First we create an index named stander with a PUT request to ip:port/stander and the following request body:
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}
Then we use the native es analyzer on the phrase "我是路人甲" ("I am a passer-by"). Send a POST request to ip:port/stander/_analyze with the following body:
{"text":"我是路人甲"}
The response shows that it indeed produced 5 tokens:
{"tokens":[
{"token":"我","start_offset":0,"end_offset":1,"type":"<IDEOGRAPHIC>","position":0},
{"token":"是","start_offset":1,"end_offset":2,"type":"<IDEOGRAPHIC>","position":1},
{"token":"路","start_offset":2,"end_offset":3,"type":"<IDEOGRAPHIC>","position":2},
{"token":"人","start_offset":3,"end_offset":4,"type":"<IDEOGRAPHIC>","position":3},
{"token":"甲","start_offset":4,"end_offset":5,"type":"<IDEOGRAPHIC>","position":4}
]}
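The _analyze response is plain JSON, so extracting just the token strings — for example to inspect an analyzer's behaviour from a script — is a one-liner; a quick Python sketch over the response body shown above:

```python
import json

# The _analyze response from above, embedded as a string for the example.
response = '''{"tokens":[
{"token":"我","start_offset":0,"end_offset":1,"type":"<IDEOGRAPHIC>","position":0},
{"token":"是","start_offset":1,"end_offset":2,"type":"<IDEOGRAPHIC>","position":1},
{"token":"路","start_offset":2,"end_offset":3,"type":"<IDEOGRAPHIC>","position":2},
{"token":"人","start_offset":3,"end_offset":4,"type":"<IDEOGRAPHIC>","position":3},
{"token":"甲","start_offset":4,"end_offset":5,"type":"<IDEOGRAPHIC>","position":4}
]}'''

# Pull out only the token text from each entry.
tokens = [t["token"] for t in json.loads(response)["tokens"]]
print(tokens)  # → ['我', '是', '路', '人', '甲']
```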
ik finest-granularity analysis
In plain terms, ik's finest-granularity mode splits out as many matching words as possible. Take the phrase "四月份的尾巴你是狮子座" ("the tail of April, you are a Leo"). Send a POST request to ip:port/stander/_analyze with the following body:
{"text":"四月份的尾巴你是狮子座","analyzer":"ik_max_word"}
The response looks like this:
{"tokens":[
{"token":"四月份","start_offset":0,"end_offset":3,"type":"CN_WORD","position":0},
{"token":"四月","start_offset":0,"end_offset":2,"type":"CN_WORD","position":1},
{"token":"四","start_offset":0,"end_offset":1,"type":"TYPE_CNUM","position":2},
{"token":"月份","start_offset":1,"end_offset":3,"type":"COUNT","position":3},
{"token":"月","start_offset":1,"end_offset":2,"type":"COUNT","position":4},
{"token":"份","start_offset":2,"end_offset":3,"type":"COUNT","position":5},
{"token":"的","start_offset":3,"end_offset":4,"type":"CN_CHAR","position":6},
{"token":"尾巴","start_offset":4,"end_offset":6,"type":"CN_WORD","position":7},
{"token":"你","start_offset":6,"end_offset":7,"type":"CN_CHAR","position":8},
{"token":"是","start_offset":7,"end_offset":8,"type":"CN_CHAR","position":9},
{"token":"狮子座","start_offset":8,"end_offset":11,"type":"CN_WORD","position":10},
{"token":"狮子","start_offset":8,"end_offset":10,"type":"CN_WORD","position":11},
{"token":"座","start_offset":10,"end_offset":11,"type":"CN_CHAR","position":12}
]}
Very fine-grained indeed.
ik coarsest-granularity analysis
The coarsest-granularity mode does the opposite. Analyzing the same phrase, send a POST request to ip:port/stander/_analyze with the following body:
{"text":"四月份的尾巴你是狮子座","analyzer":"ik_smart"}
The response is:
{"tokens":[
{"token":"四月份","start_offset":0,"end_offset":3,"type":"CN_WORD","position":0},
{"token":"的","start_offset":3,"end_offset":4,"type":"CN_CHAR","position":1},
{"token":"尾巴","start_offset":4,"end_offset":6,"type":"CN_WORD","position":2},
{"token":"你","start_offset":6,"end_offset":7,"type":"CN_CHAR","position":3},
{"token":"是","start_offset":7,"end_offset":8,"type":"CN_CHAR","position":4},
{"token":"狮子座","start_offset":8,"end_offset":11,"type":"CN_WORD","position":5}
]}
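Comparing the two responses makes the relationship clear: for this phrase, every token ik_smart produced also appears in the ik_max_word output, which simply adds the overlapping sub-words. A quick Python check over the token lists copied from the two responses above:

```python
# Token text copied from the ik_max_word and ik_smart responses above.
max_word = ["四月份", "四月", "四", "月份", "月", "份", "的",
            "尾巴", "你", "是", "狮子座", "狮子", "座"]
smart = ["四月份", "的", "尾巴", "你", "是", "狮子座"]

# Every ik_smart token is also an ik_max_word token for this phrase...
assert set(smart) <= set(max_word)
# ...but ik_max_word emits more than twice as many tokens.
print(len(max_word), len(smart))  # → 13 6
```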
Based on the above, we recommend the finest-granularity mode when indexing: the extra tokens give search queries more chances to match.
Creating an index that uses ik analysis
Create the index with a PUT request to ip:port/index_name (your index name) and the following request body:
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "ik": {
          "tokenizer": "ik_max_word"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "ik"
      },
      "keyWord": {
        "type": "text",
        "analyzer": "ik"
      }
    }
  }
}
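If you provision indexes from a script rather than by hand, the request body is just nested dictionaries; here is a hypothetical Python helper (the `ik_index_body` name and `fields` parameter are illustrative inventions) that assembles a body like the one above, defining a custom analyzer named "ik" and applying it to every listed text field:

```python
import json

def ik_index_body(fields, tokenizer="ik_max_word"):
    """Build an index-creation body whose custom "ik" analyzer wraps the
    given ik tokenizer and is applied to each listed text field."""
    return {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 0,
            "analysis": {"analyzer": {"ik": {"tokenizer": tokenizer}}},
        },
        "mappings": {
            "properties": {f: {"type": "text", "analyzer": "ik"} for f in fields}
        },
    }

body = ik_index_body(["name", "keyWord"])
print(json.dumps(body, indent=2))
```

Sending this body with a PUT to ip:port/index_name creates the index exactly as in the manual request above.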
Conclusion
This article showed how to integrate the ik analyzer into es 8.1.2, but everything was done through the native REST API. In the next article we will use the es client from the Spring Boot project we set up earlier to perform searches, and look at the underlying principles.
- Coming next: integrating the ik analyzer with the es client in Spring Boot. Stay tuned!