6. Elasticsearch Analyzers


Concept

An analyzer takes a string as input, splits it into individual words, or tokens (possibly discarding characters such as punctuation along the way), and outputs a token stream.
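As a quick illustration, the _analyze API (used throughout this section) shows the token stream produced for any piece of text; here the standard analyzer splits on word boundaries and lowercases:

POST _analyze
{
  "analyzer": "standard",
  "text": "The QUICK brown foxes"
}
# produces the tokens: the, quick, brown, foxes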

Normalization

Normalization includes tense conversion, singular/plural conversion, synonym conversion, case conversion, and so on.
For example, suppose a document contains "His mom likes small dogs":
① At index time, normalization processes the document for tense, singular/plural, synonyms, etc.;
② A user searching with the approximate query "mother liked little dog" can then still find the relevant document.
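As a rough illustration, the built-in english analyzer applies lowercasing, stop-word removal and stemming to the sentence above; note that mapping mother to mom or little to small would additionally require a synonym token filter, which is not configured here:

POST _analyze
{
  "analyzer": "english",
  "text": "His mom likes small dogs"
}
# stemming reduces likes -> like and dogs -> dog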

Components of an analyzer

  • character filter: preprocessing before tokenization (filters out useless characters, HTML tags, etc., and converts things like & => and or 《Elasticsearch》 => Elasticsearch). Built-in character filters: html_strip (removes HTML tags), mapping (string replacement), pattern_replace (regex replacement)
  • tokenizer: splits the text into tokens
  • token filter: stop words, tense conversion, case conversion, synonyms, handling of modal particles, and so on. Built-in token filters include lowercase, stop, synonym
# Strip HTML tags
POST _analyze
{
	"tokenizer":"keyword",
	"char_filter":["html_strip"],
	"text":"<b>hello world<b>"
}
 
# Character mapping: replace - with _
POST _analyze
{
	"tokenizer":"standard",
	"char_filter":[
			{
				"type":"mapping",
				"mappings":["- => _"]
			}
		],
	"text":"123-456-789,i-love-u"
}
 
# Replace emoticons
POST _analyze
{
	"tokenizer":"standard",
	"char_filter":[
		{
			"type":"mapping",
			"mappings":[":) => happy"]
		}
	],
	"text":"i am felling :),i-love-u"
}
 
 
# Regex replacement
POST _analyze
{
	"tokenizer":"standard",
	"char_filter":[
		{
			"type":"pattern_replace",
			"pattern":"http://(.*)",
			"replacement":"$1_haha"
		}
	],
	"text":"http://www.elastic.co"
}

Built-in Elasticsearch analyzers

Standard Analyzer: the default; splits on word boundaries and lowercases
Simple Analyzer: splits on non-letter characters (symbols are dropped) and lowercases
Stop Analyzer: lowercases and removes stop words (the, a, is, ...)
Whitespace Analyzer: splits on whitespace, does not lowercase
Keyword Analyzer: no tokenization; the whole input becomes a single token
Pattern Analyzer: splits with a regular expression, \W+ (non-word characters) by default
Language: analyzers for 30+ common languages
Custom Analyzer: a user-defined analyzer
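The differences are easiest to see by pushing the same text through _analyze with different analyzers; for instance, standard lowercases and drops punctuation, while whitespace keeps the original tokens untouched:

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes!"
}
# standard: the, 2, quick, brown, foxes

POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes!"
}
# whitespace: The, 2, QUICK, Brown-Foxes!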

#HTML Strip Character Filter
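# escaped_tags: the <a> tag is preserved while all other HTML markup is stripped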
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <a>happy</a>!</p>"
}


#Mapping Character Filter
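# maps Arabic-Indic digits (٠..٩) to their ASCII equivalents before tokenization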
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "٠ => 0",
            "١ => 1",
            "٢ => 2",
            "٣ => 3",
            "٤ => 4",
            "٥ => 5",
            "٦ => 6",
            "٧ => 7",
            "٨ => 8",
            "٩ => 9"
          ]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My license plate is ٢٥٠١٥"
}


#Pattern Replace Character Filter
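# rewrites the dash between digits to an underscore, so 123-456-789 is indexed as the single token 123_456_789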
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}



#**************************************************************************
#token filter: tense conversion, case conversion, synonyms, handling of modal particles, etc.
#e.g. has => have, him => he, apples => apple; the/oh/a => dropped
#Case: lowercase token filter
GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["lowercase"],
  "text" : "THE Quick FoX JUMPs"
}

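# Conditional token filter: only tokens shorter than 5 characters are lowercased (THE and FOX become the, fox; QUICK and BROWN are left as-is)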
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "condition",
      "filter": [ "lowercase" ],
      "script": {
        "source": "token.getTerm().length() < 5"
      }
    }
  ],
  "text": "THE QUICK BROWN FOX"
}
 
#Stop words: stop token filter / stopwords parameter
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer":{
          "type":"standard",
          "stopwords":"_english_"
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Teacher Ma is in the restroom"
}

#Tokenizer: standard (note that it splits Chinese text into individual characters)
GET /my_index/_analyze
{
  "text": "江山如此多娇,小姐姐哪里可以撩",
  "analyzer": "standard"
}



#Custom analyzer
#Setting type to custom tells Elasticsearch that we are defining a custom analyzer. Compare this with how a built-in analyzer is configured: there, type is set to the name of the built-in analyzer, such as standard or simple
PUT /test_analysis
{
  "settings": {
    "analysis": {
      "char_filter": {
        "test_char_filter": {
          "type": "mapping",
          "mappings": [
            "& => and",
            "| => or"
          ]
        }
      },
      "filter": {
        "test_stopwords": {
          "type": "stop",
          "stopwords": ["is","in","at","the","a","for"]
        }
      },
      "tokenizer": {
        "punctuation": { 
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [
            "html_strip",
            "test_char_filter"
          ],
          "tokenizer": "standard",
          "filter": ["lowercase","test_stopwords"]
        }
      }
    }
  }
}

GET /test_analysis/_analyze
{
  "text": "Teacher ma & zhang also thinks [mother's friends] is good | nice!!!",
  "analyzer": "my_analyzer"
}

#Specify the analyzer when creating the mapping
PUT /test_analysis/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "test_analysis"
    }
  }
}

#**************************************************************************
#Chinese analysis (IK)
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "ik_max_word"
        }
      }
    }
  }
}
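# Alternative: set the analyzer per field in the mapping (this second PUT only succeeds if my_index does not already exist)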
PUT /my_index
{
  "mappings": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        }
    }
  }
}
POST /my_index/_bulk
{ "index": { "_id": "1"} }
{ "text": "城管打电话喊商贩去摆摊" }
{ "index": { "_id": "2"} }
{ "text": "笑果文化回应商贩老农去摆摊" }
{ "index": { "_id": "3"} }
{ "text": "老农耗时17年种出椅子树" }
{ "index": { "_id": "4"} }
{ "text": "夫妻结婚30多年AA制,被城管抓" }
{ "index": { "_id": "5"} }
{ "text": "黑人见义勇为阻止抢劫反被铐住" }

GET /my_index/_analyze
{
  "text": "中华人民共和国国歌",
  "analyzer": "ik_max_word"
}
GET /my_index/_analyze
{
  "text": "中华人民共和国国歌",
  "analyzer": "ik_smart"
}

GET /my_index/_search 
{
  "query": {
    "match": {
      "text": "关关雎鸠"
    }
  }
}

GET /my_index/_analyze
{
  "text": "超级赛亚人",
  "analyzer": "ik_max_word"
}

GET /my_index/_analyze
{
  "text": "碰瓷是一种敲诈, 应该被判刑",
  "analyzer": "ik_max_word"
}

Chinese analyzer (IK)

github.com/medcl/elast…

Download it, create an ik directory under plugins in the Elasticsearch home directory, unzip the plugin there, and restart Elasticsearch.

What is the difference between ik_max_word and ik_smart? ik_max_word performs the finest-grained segmentation; for example, it splits “中华人民共和国国歌” into “中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌”, exhausting every possible combination, which suits term queries.
ik_smart performs the coarsest-grained segmentation; for example, it splits “中华人民共和国国歌” into “中华人民共和国, 国歌”, which suits phrase queries.

IK files:
1) IKAnalyzer.cfg.xml: the IK analyzer configuration file
2) Main dictionary: main.dic
3) English stop words: stopword.dic; these terms are not put into the inverted index
4) Special dictionaries:
a. quantifier.dic: units of measurement
b. suffix.dic: suffixes
c. surname.dic: Chinese surnames
d. preposition.dic: modal/function words

Hot updates of the dictionaries:
a. Modify the IK analyzer source code, or
b. Use IK's native hot-update mechanism: deploy a web server that exposes an HTTP endpoint serving the dictionary and signals changes through the Last-Modified and ETag response headers.

Complete Elasticsearch series

1. What Is Elasticsearch
2. Elasticsearch Basic Usage
3. Elasticsearch Mapping
4. Elasticsearch Cluster Principles
5. Elasticsearch Scripts and Read/Write Internals
6. Elasticsearch Analyzers
7. Elasticsearch TF-IDF and Advanced Queries
8. Elasticsearch Geo Search
9. Elasticsearch ELK