Integrating the IK Analyzer with Elasticsearch 8.1.2

Preface

In the previous article we finished integrating ES into Spring Boot; in this one we introduce the IK analyzer.

The concept of an analyzer, and the default ES analyzer

Analyzers

Put simply, an analyzer splits text into terms according to a fixed algorithm of its own. Take the phrase “程序员的第一段代码为:Hello World” (“a programmer's first piece of code is: Hello World”). It might be split into fragments such as 程序 (program), 程序员 (programmer), 第一段 (first piece), 第一段代码 (first piece of code), 代码 (code), Hello, and Hello World. If a user then searches for 程序员, the phrase is matched.

The default ES analyzer

The default analyzer in ES is standard, and for Chinese it splits character by character: “今天是周末” (“today is the weekend”) becomes 今, 天, 是, 周, 末. That mechanism is far too crude for Chinese text, so we bring in the IK analyzer to handle Chinese word segmentation.

The IK analyzer

If you are interested, have a look at the project page: the IK analyzer on GitHub. It offers two modes: coarsest-grained (ik_smart) and finest-grained (ik_max_word). Below we will work through examples of both.

Installing the analyzer

On the GitHub releases page, find the build that matches your ES version. I am using 8.1.2; the download is here: 8.1.2 IK analyzer download. After downloading and unzipping, go to the plugins folder under the ES installation directory, create an ik directory, and copy all of the extracted files into it. The resulting layout looks like this:

(screenshot of the plugins/ik directory)

Then restart ES and we can try out the IK segmentation.
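Before going further, it is worth checking that the node actually loaded the plugin after the restart. Below is a minimal Python sketch, assuming a local node at http://localhost:9200 with security disabled (a default ES 8 install would additionally need credentials and TLS settings):

import requests

# List installed plugins via the _cat API; the IK plugin should
# show up as "analysis-ik".
# Assumes a local node with security disabled.
resp = requests.get("http://localhost:9200/_cat/plugins?v")
resp.raise_for_status()
print(resp.text)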

Trying out ES analysis

Built-in analysis

First, create an index named stander with a PUT request to ip:port/stander and the following request body:

{
	"settings": {
		"number_of_shards": 1,
		"number_of_replicas": 0
	}
}

Now let's use the built-in ES analyzer to segment the phrase “我是路人甲” (“I am a passer-by”). Send a POST request to ip:port/stander/_analyze with the following body:

{"text":"我是路人甲"}

Looking at the response, we can see it was indeed split into five tokens:

{"tokens":[
{"token":"我","start_offset":0,"end_offset":1,"type":"<IDEOGRAPHIC>","position":0},
{"token":"是","start_offset":1,"end_offset":2,"type":"<IDEOGRAPHIC>","position":1},
{"token":"路","start_offset":2,"end_offset":3,"type":"<IDEOGRAPHIC>","position":2},
{"token":"人","start_offset":3,"end_offset":4,"type":"<IDEOGRAPHIC>","position":3},
{"token":"甲","start_offset":4,"end_offset":5,"type":"<IDEOGRAPHIC>","position":4}
]}
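
For completeness, the two calls above can also be scripted. Here is a minimal Python sketch using the requests library, under the same assumption of an unsecured local node at http://localhost:9200:

import requests

BASE = "http://localhost:9200"  # assumed: local node, security disabled

# Create the stander index with one shard and no replicas
# (PUT ip:port/stander).
requests.put(f"{BASE}/stander", json={
    "settings": {"number_of_shards": 1, "number_of_replicas": 0}
}).raise_for_status()

# Run the default (standard) analyzer over the sample text
# (POST ip:port/stander/_analyze).
resp = requests.post(f"{BASE}/stander/_analyze", json={"text": "我是路人甲"})
resp.raise_for_status()
for t in resp.json()["tokens"]:
    print(t["token"], t["type"])  # one single-character token per line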

IK finest-grained segmentation

IK's finest-grained mode, in plain terms, splits the text into as many matching terms as possible. Let's segment the phrase “四月份的尾巴你是狮子座” (“the tail of April, you are a Leo”): send a POST request to ip:port/stander/_analyze with the following body:

{"text":"四月份的尾巴你是狮子座","analyzer":"ik_max_word"}

The generated response looks like this:

{"tokens":[
{"token":"四月份","start_offset":0,"end_offset":3,"type":"CN_WORD","position":0},
{"token":"四月","start_offset":0,"end_offset":2,"type":"CN_WORD","position":1},
{"token":"四","start_offset":0,"end_offset":1,"type":"TYPE_CNUM","position":2},
{"token":"月份","start_offset":1,"end_offset":3,"type":"COUNT","position":3},{"token":"月","start_offset":1,"end_offset":2,"type":"COUNT","position":4},
{"token":"份","start_offset":2,"end_offset":3,"type":"COUNT","position":5},
{"token":"的","start_offset":3,"end_offset":4,"type":"CN_CHAR","position":6},
{"token":"尾巴","start_offset":4,"end_offset":6,"type":"CN_WORD","position":7},
{"token":"你","start_offset":6,"end_offset":7,"type":"CN_CHAR","position":8},
{"token":"是","start_offset":7,"end_offset":8,"type":"CN_CHAR","position":9},
{"token":"狮子座","start_offset":8,"end_offset":11,"type":"CN_WORD","position":10},
{"token":"狮子","start_offset":8,"end_offset":10,"type":"CN_WORD","position":11},
{"token":"座","start_offset":10,"end_offset":11,"type":"CN_CHAR","position":12}
]}

Very fine-grained indeed.

IK coarsest-grained segmentation

Unlike the finest-grained mode, the coarsest-grained mode has the opposite effect. Segmenting the same phrase as above, send a POST request to ip:port/stander/_analyze with the following body:

{"text":"四月份的尾巴你是狮子座","analyzer":"ik_smart"}

The generated response is as follows:

{"tokens":[
{"token":"四月份","start_offset":0,"end_offset":3,"type":"CN_WORD","position":0},
{"token":"的","start_offset":3,"end_offset":4,"type":"CN_CHAR","position":1},
{"token":"尾巴","start_offset":4,"end_offset":6,"type":"CN_WORD","position":2},
{"token":"你","start_offset":6,"end_offset":7,"type":"CN_CHAR","position":3},
{"token":"是","start_offset":7,"end_offset":8,"type":"CN_CHAR","position":4},
{"token":"狮子座","start_offset":8,"end_offset":11,"type":"CN_WORD","position":5}
]}

Given the above, we recommend the finest-grained mode (ik_max_word) when analyzing text, since the extra tokens give queries more to match against.
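
To compare the two modes side by side, a small sketch (same unsecured-local-node assumption as before) can print how many tokens each analyzer emits for the sample phrase; with the responses above, that is 13 for ik_max_word versus 6 for ik_smart:

import requests

BASE = "http://localhost:9200"  # assumed: local node, security disabled
TEXT = "四月份的尾巴你是狮子座"

# Run the same text through both IK modes and compare token counts.
for analyzer in ("ik_max_word", "ik_smart"):
    resp = requests.post(f"{BASE}/stander/_analyze",
                         json={"text": TEXT, "analyzer": analyzer})
    resp.raise_for_status()
    tokens = [t["token"] for t in resp.json()["tokens"]]
    print(f"{analyzer}: {len(tokens)} tokens -> {tokens}")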

Creating an index that uses IK analysis

Create the index with a PUT request to ip:port/index_name (your index name) and the body below. Note that each text field must reference the custom ik analyzer in its mapping; otherwise the field falls back to the standard analyzer:

{
	"settings": {
		"number_of_shards": "1",
		"number_of_replicas": "0",
		"analysis": {
			"analyzer": {
				"ik": {
					"tokenizer": "ik_max_word"
				}
			}
		}
	},
	"mappings": {
		"properties": {
			"name": {
				"type": "text",
				"analyzer": "ik"
			},
			"keyWord": {
				"type": "text",
				"analyzer": "ik"
			}
		}
	}
}
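
To see the custom analyzer working end to end, here is a hedged Python sketch (the index name ik_demo is hypothetical, and the same unsecured local node is assumed) that creates the index with the body above, indexes one document, and confirms that a match query on the sub-word 程序员 finds it:

import requests

BASE = "http://localhost:9200"  # assumed: local node, security disabled

index_body = {
    "settings": {
        "number_of_shards": "1",
        "number_of_replicas": "0",
        "analysis": {"analyzer": {"ik": {"tokenizer": "ik_max_word"}}}
    },
    "mappings": {
        "properties": {
            "name": {"type": "text", "analyzer": "ik"},
            "keyWord": {"type": "text", "analyzer": "ik"}
        }
    }
}
# "ik_demo" is a hypothetical index name for this example.
requests.put(f"{BASE}/ik_demo", json=index_body).raise_for_status()

# Index one document; ?refresh makes it searchable immediately.
doc = {"name": "程序员的第一段代码为:Hello World", "keyWord": "程序员"}
requests.post(f"{BASE}/ik_demo/_doc?refresh", json=doc).raise_for_status()

# ik_max_word emitted 程序员 as an index-time token, so a match query hits.
resp = requests.post(f"{BASE}/ik_demo/_search",
                     json={"query": {"match": {"name": "程序员"}}})
print(resp.json()["hits"]["total"])  # e.g. {'value': 1, 'relation': 'eq'}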

Conclusion

This article showed how to integrate the IK analyzer into ES 8.1.2, working entirely through the raw REST APIs. In the next article we will build on the Spring Boot project we already set up, use the ES client to run searches, and dig into how it works underneath.

  • Coming up next: using the IK analyzer through the ES client in Spring Boot. Stay tuned!