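Elasticsearch (7): The IK Analyzer

Here we take a detailed look at the basic usage of the IK analyzer plugin for Elasticsearch.

Everything below assumes the IK analyzer plugin has already been installed successfully.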

I: The two tokenization modes of the IK analyzer

We use Kibana to run a quick test.

1: ik_smart: fewest splits (coarsest-grained segmentation)

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "时间里的SpringBoot2.6"
}

The result is as follows:

{
  "tokens" : [
    {
      "token" : "时",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "间里",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "的",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "springboot2.6",
      "start_offset" : 4,
      "end_offset" : 17,
      "type" : "LETTER",
      "position" : 3
    }
  ]
}
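The start_offset and end_offset fields point back into the original text, which is why the token "springboot2.6" is lowercase while the input was "SpringBoot2.6". The sketch below is only an illustration: the helper name source_span is hypothetical, the token list is copied by hand from the response above, and it assumes every character is in the Basic Multilingual Plane, so that the offsets reported by Elasticsearch line up with Python string indices.

```python
# Tokens copied by hand from the ik_smart response above (only the fields we need).
text = "时间里的SpringBoot2.6"
tokens = [
    {"token": "时", "start_offset": 0, "end_offset": 1},
    {"token": "间里", "start_offset": 1, "end_offset": 3},
    {"token": "的", "start_offset": 3, "end_offset": 4},
    {"token": "springboot2.6", "start_offset": 4, "end_offset": 17},
]


def source_span(text, token):
    """Return the slice of the original text that a token covers (hypothetical helper)."""
    return text[token["start_offset"]:token["end_offset"]]


spans = [source_span(text, t) for t in tokens]
for t, span in zip(tokens, spans):
    # The token is the normalized (lowercased) form; the span is the raw input.
    print(f'{t["token"]!r} covers {span!r}')
```

The offsets always describe the raw input, so they remain usable for things like highlighting even though the token text itself has been normalized.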

2: ik_max_word: finest-grained segmentation (emits every word the dictionary can match)

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "时间里的SpringBoot2.6"
}

The result is as follows:

{
  "tokens" : [
    {
      "token" : "时间",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "间里",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "的",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "springboot2.6",
      "start_offset" : 4,
      "end_offset" : 17,
      "type" : "LETTER",
      "position" : 3
    },
    {
      "token" : "springboot",
      "start_offset" : 4,
      "end_offset" : 14,
      "type" : "ENGLISH",
      "position" : 4
    },
    {
      "token" : "2.6",
      "start_offset" : 14,
      "end_offset" : 17,
      "type" : "ARABIC",
      "position" : 5
    }
  ]
}
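One way to see the difference between the two modes is to look for tokens whose character ranges overlap: in the responses above, ik_max_word emits overlapping tokens while ik_smart does not. The following is a minimal sketch under that reading; the function name overlapping_pairs is hypothetical and the tuples are copied by hand from the two responses.

```python
# (token, start_offset, end_offset) triples copied from the two responses above.
smart_tokens = [
    ("时", 0, 1), ("间里", 1, 3), ("的", 3, 4), ("springboot2.6", 4, 17),
]
max_word_tokens = [
    ("时间", 0, 2), ("间里", 1, 3), ("的", 3, 4),
    ("springboot2.6", 4, 17), ("springboot", 4, 14), ("2.6", 14, 17),
]


def overlapping_pairs(tokens):
    """Return every pair of tokens whose [start, end) ranges share a character."""
    pairs = []
    for i, (a, a_start, a_end) in enumerate(tokens):
        for b, b_start, b_end in tokens[i + 1:]:
            if a_start < b_end and b_start < a_end:
                pairs.append((a, b))
    return pairs


print(overlapping_pairs(smart_tokens))
print(overlapping_pairs(max_word_tokens))
```

For this input, ik_smart yields no overlapping pairs, while ik_max_word covers the same characters several times (for example "时间" and "间里" share the character "间").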

II: Adding custom words to the extension dictionary

Sometimes the default segmentation does not give us what we want, as in the following case:

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "耗子尾汁SpringBoot2.6"
}

The result is as follows:

{
  "tokens" : [
    {
      "token" : "耗子",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "尾",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "汁",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "springboot2.6",
      "start_offset" : 4,
      "end_offset" : 17,
      "type" : "LETTER",
      "position" : 3
    },
    {
      "token" : "springboot",
      "start_offset" : 4,
      "end_offset" : 14,
      "type" : "ENGLISH",
      "position" : 4
    },
    {
      "token" : "2.6",
      "start_offset" : 14,
      "end_offset" : 17,
      "type" : "ARABIC",
      "position" : 5
    }
  ]
}

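I do not want the four characters 耗子尾汁 to be split apart, but the analyzer breaks them into 耗子, 尾 and 汁. How do we fix this?

It is simple: we manually add the word to the analyzer's dictionary.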

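Location of the analyzer's dictionary configuration:

[Elasticsearch directory]/plugins/ik/config/IKAnalyzer.cfg.xml

Depending on how the plugin was installed, the file may instead be under the config/analysis-ik/ directory of Elasticsearch.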

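Open IKAnalyzer.cfg.xml and register the extension dictionary by putting its file name (My.dic here) in the ext_dict entry.

My.dic sits in the same directory as IKAnalyzer.cfg.xml.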
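To make the registration step concrete, here is a small sketch that parses a sample IKAnalyzer.cfg.xml and lists the dictionaries named in its ext_dict entry. The XML is an assumption: it follows the layout of the default file as far as I know, with the comments translated, so compare it against the file in your own installation. The function name ext_dicts is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed layout of IKAnalyzer.cfg.xml with My.dic registered.
# Check the file shipped with your plugin version before copying anything.
SAMPLE_CFG = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- extension dictionaries; several files are separated with ';' -->
    <entry key="ext_dict">My.dic</entry>
    <!-- extension stop-word dictionaries -->
    <entry key="ext_stopwords"></entry>
</properties>
"""


def ext_dicts(cfg_xml: str) -> list:
    """Return the dictionary paths listed in the ext_dict entry (hypothetical helper)."""
    root = ET.fromstring(cfg_xml.encode("utf-8"))
    for entry in root.iter("entry"):
        if entry.get("key") == "ext_dict":
            # An empty entry has text None, hence the fallback to "".
            return [p.strip() for p in (entry.text or "").split(";") if p.strip()]
    return []


print(ext_dicts(SAMPLE_CFG))
```

The path in ext_dict is relative to the directory that holds IKAnalyzer.cfg.xml, which is why a bare file name is enough when My.dic sits next to it.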

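IK extension dictionaries are plain text files: UTF-8 encoded, one word per line. The .dic suffix is only a naming convention; container formats such as csv, xml or zip are not part of the dictionary format documented by the IK plugin.

My.dic therefore just needs one word per line. For this example it contains a single line:

耗子尾汁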
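If the dictionary is generated rather than edited by hand, the main things to get right are the UTF-8 encoding and the one-word-per-line layout. The sketch below writes such a file into a temporary directory; the helper name write_dictionary and the temporary location are assumptions for illustration, and in a real setup the target would be the My.dic next to IKAnalyzer.cfg.xml.

```python
import tempfile
from pathlib import Path


def write_dictionary(path, words):
    """Write words to path as UTF-8, one per line, skipping blanks and duplicates."""
    unique = []
    for word in words:
        word = word.strip()
        if word and word not in unique:
            unique.append(word)
    # Write bytes directly so the file gets plain "\n" line endings on every platform.
    Path(path).write_bytes(("\n".join(unique) + "\n").encode("utf-8"))
    return unique


# Demo in a throwaway directory instead of the real plugin directory.
dic_path = Path(tempfile.mkdtemp()) / "My.dic"
written = write_dictionary(dic_path, ["耗子尾汁", " 耗子尾汁 ", ""])
print(written)
```

Stripping whitespace matters because a trailing space would become part of the dictionary entry and the word would then never match.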

Restart Elasticsearch, then run the query from above again:

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "耗子尾汁SpringBoot2.6"
}

The result is as follows:

{
  "tokens" : [
    {
      "token" : "耗子尾汁",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "耗子",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "尾汁",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "springboot2.6",
      "start_offset" : 4,
      "end_offset" : 17,
      "type" : "LETTER",
      "position" : 3
    },
    {
      "token" : "springboot",
      "start_offset" : 4,
      "end_offset" : 14,
      "type" : "ENGLISH",
      "position" : 4
    },
    {
      "token" : "2.6",
      "start_offset" : 14,
      "end_offset" : 17,
      "type" : "ARABIC",
      "position" : 5
    }
  ]
}
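The same check can be scripted instead of read off the Kibana console. The sketch below is an assumption-heavy illustration: it presumes Elasticsearch is reachable at http://localhost:9200 without authentication, and the function names analyze and token_strings are hypothetical. The analyze function is defined but not called here, so the example runs without a cluster; the sample response is abbreviated from the output above.

```python
import json
import urllib.request


def token_strings(analyze_response):
    """Extract just the token text from an _analyze response body."""
    return [t["token"] for t in analyze_response["tokens"]]


def analyze(text, analyzer="ik_max_word", base_url="http://localhost:9200"):
    """POST to the _analyze API and return the token strings.

    base_url is an assumption; adjust it (and add authentication) for your cluster.
    """
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return token_strings(json.load(response))


# Abbreviated copy of the response shown above, used here instead of a live call.
sample_response = {
    "tokens": [
        {"token": "耗子尾汁"}, {"token": "耗子"}, {"token": "尾汁"},
        {"token": "springboot2.6"}, {"token": "springboot"}, {"token": "2.6"},
    ]
}
custom_word_kept = "耗子尾汁" in token_strings(sample_response)
print(custom_word_kept)
```

Against a running cluster, calling analyze("耗子尾汁SpringBoot2.6") and checking for "耗子尾汁" in the result would confirm that the extension dictionary was loaded after the restart.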

That covers the basic configuration and usage of the IK analyzer.

If you have any suggestions, please leave a comment below.