Concepts
A tokenizer accepts a string as input, splits it into individual words, or tokens (possibly discarding characters such as punctuation along the way), and outputs a token stream.
Normalization
Normalization covers operations such as tense conversion, singular/plural conversion, synonym conversion, and case conversion.
For example, suppose a document contains "His mom likes small dogs":
① at index time, normalization handles tense, plurality, synonyms, and so on;
② a user can then search with the approximate phrase "mother liked little dog" and still find the document.
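A quick sketch with the _analyze API shows the idea, using the built-in lowercase and stemmer token filters (the exact output depends on the stemmer implementation):
# Normalization demo: "likes" => "like", "dogs" => "dog"
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stemmer"],
  "text": "His mom likes small dogs"
}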
Components of an analyzer (applied in this order)
- character filter: preprocessing before tokenization, e.g. stripping useless characters and tags, or converting & => and and 《Elasticsearch》 => Elasticsearch. Built-ins include html_strip (removes HTML tags), mapping (string replacement), and pattern_replace (regex replacement)
- tokenizer: splits the text into tokens
- token filter: stop words, tense conversion, case conversion, synonyms, interjection handling, etc. Built-ins include lowercase, stop, and synonym
# Strip HTML tags
POST _analyze
{
"tokenizer":"keyword",
"char_filter":["html_strip"],
"text":"<b>hello world<b>"
}
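# Expected output: a single token "hello world" (the keyword tokenizer emits the whole filtered text as one token)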
# Character mapping
POST _analyze
{
"tokenizer":"standard",
"char_filter":[
{
"type":"mapping",
"mappings":["- => _"]
}
],
"text":"123-456-789,i-love-u"
}
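# Expected tokens: [123_456_789, i_love_u] (the standard tokenizer splits on "-" but treats "_" as part of a token, which is why the mapping helps)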
# Replace emoticons
POST _analyze
{
"tokenizer":"standard",
"char_filter":[
{
"type":"mapping",
"mappings":[":) => happy"]
}
],
"text":"i am felling :),i-love-u"
}
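# Expected tokens: [i, am, feeling, happy, i, love, u] (":)" becomes "happy"; the standard tokenizer splits on "-")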
# Regex replacement
POST _analyze
{
"tokenizer":"standard",
"char_filter":[
{
"type":"pattern_replace",
"pattern":"http://(.*)",
"replacement":"$1_haha"
}
],
"text":"http://www.elastic.co"
}
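# Expected token: [www.elastic.co_haha] (the whole URL matches the pattern, so it is rewritten before tokenization)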
Built-in ES analyzers
Standard Analyzer: the default; splits on word boundaries and lowercases
Simple Analyzer: splits on anything that is not a letter (symbols are dropped) and lowercases
Stop Analyzer: lowercases and removes stop words (the, a, is, ...)
Whitespace Analyzer: splits on whitespace, no lowercasing
Keyword Analyzer: no tokenization; the input is emitted as a single token
Pattern Analyzer: regex-based, default \W+ (splits on non-word characters)
Language analyzers: provided for 30+ common languages
Custom Analyzer: a user-defined combination of the components above
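To compare them, run the same arbitrary sample text through a couple of them:
# standard: [the, 2, quick, brown, foxes]
GET _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes!"
}
# whitespace: [The, 2, QUICK, Brown-Foxes!]
GET _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes!"
}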
#HTML Strip Character Filter
PUT my_index
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "html_strip",
"escaped_tags": ["a"]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"char_filter": ["my_char_filter"]
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "<p>I'm so <a>happy</a>!</p>"
}
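# Expected output: one token "I'm so <a>happy</a>!" (escaped_tags preserves the <a> tags; <p> is stripped)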
#Mapping Character Filter
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"char_filter": [
"my_char_filter"
]
}
},
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"٠ => 0",
"١ => 1",
"٢ => 2",
"٣ => 3",
"٤ => 4",
"٥ => 5",
"٦ => 6",
"٧ => 7",
"٨ => 8",
"٩ => 9"
]
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "My license plate is ٢٥٠١٥"
}
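# Expected tokens: [My, license, plate, is, 25015] (Arabic-Indic digits mapped to Latin digits)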
#Pattern Replace Character Filter
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": ["my_char_filter"]
}
},
"char_filter": {
"my_char_filter": {
"type": "pattern_replace",
"pattern": "(\\d+)-(?=\\d)",
"replacement": "$1_"
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "My credit card is 123-456-789"
}
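# Expected tokens: [My, credit, card, is, 123_456_789] (the hyphens are rewritten before tokenization, so the number stays one token)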
#**************************************************************************
#token filter: tense conversion, case conversion, synonym conversion, interjection handling, etc.
#e.g. has => have, him => he, apples => apple; the/oh/a => dropped
#lowercase token filter
GET _analyze
{
"tokenizer" : "standard",
"filter" : ["lowercase"],
"text" : "THE Quick FoX JUMPs"
}
GET /_analyze
{
"tokenizer": "standard",
"filter": [
{
"type": "condition",
"filter": [ "lowercase" ],
"script": {
"source": "token.getTerm().length() < 5"
}
}
],
"text": "THE QUICK BROWN FOX"
}
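# The condition filter lowercases only tokens shorter than 5 characters: [the, QUICK, BROWN, fox]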
#stop words (the stopwords parameter on a standard-type analyzer)
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer":{
"type":"standard",
"stopwords":"_english_"
}
}
}
}
}
GET my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "Teacher Ma is in the restroom"
}
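# Expected tokens: [teacher, ma, restroom] (lowercased, with the _english_ stop words "is", "in", "the" removed)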
#tokenizer: standard, on Chinese text
GET /my_index/_analyze
{
"text": "江山如此多娇,小姐姐哪里可以撩",
"analyzer": "standard"
}
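# The standard tokenizer breaks CJK text into single characters, which is poor for Chinese; see the IK analyzer below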
#Custom analyzer
#Setting type to custom tells Elasticsearch that we are defining a custom analyzer. Compare this with how a built-in analyzer is configured: there, type is set to the name of the built-in analyzer, such as standard or simple.
PUT /test_analysis
{
"settings": {
"analysis": {
"char_filter": {
"test_char_filter": {
"type": "mapping",
"mappings": [
"& => and",
"| => or"
]
}
},
"filter": {
"test_stopwords": {
"type": "stop",
"stopwords": ["is","in","at","the","a","for"]
}
},
"tokenizer": {
"punctuation": {
"type": "pattern",
"pattern": "[ .,!?]"
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"char_filter": [
"html_strip",
"test_char_filter"
],
"tokenizer": "standard",
"filter": ["lowercase","test_stopwords"]
}
}
}
}
}
GET /test_analysis/_analyze
{
"text": "Teacher ma & zhang also thinks [mother's friends] is good | nice!!!",
"analyzer": "my_analyzer"
}
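# Expected tokens: [teacher, ma, and, zhang, also, thinks, mother's, friends, good, or, nice]
# ("&" and "|" are mapped first, then lowercase and the custom stop list run; note that the "punctuation" tokenizer defined above is never referenced, since my_analyzer uses standard)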
#Specify the analyzer when creating a mapping
PUT /test_analysis/_mapping
{
"properties": {
"content": {
"type": "text",
"analyzer": "test_analysis"
}
}
}
#**************************************************************************
#Chinese analysis
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"default": {
"type": "ik_max_word"
}
}
}
}
}
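# Note: this request and the next PUT /my_index are alternatives; run DELETE /my_index in between, otherwise the second PUT fails with resource_already_exists_exception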
PUT /my_index
{
"mappings": {
"properties": {
"text": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
}
}
}
}
POST /my_index/_bulk
{ "index": { "_id": "1"} }
{ "text": "城管打电话喊商贩去摆摊" }
{ "index": { "_id": "2"} }
{ "text": "笑果文化回应商贩老农去摆摊" }
{ "index": { "_id": "3"} }
{ "text": "老农耗时17年种出椅子树" }
{ "index": { "_id": "4"} }
{ "text": "夫妻结婚30多年AA制,被城管抓" }
{ "index": { "_id": "5"} }
{ "text": "黑人见义勇为阻止抢劫反被铐住" }
GET /my_index/_analyze
{
"text": "中华人民共和国国歌",
"analyzer": "ik_max_word"
}
GET /my_index/_analyze
{
"text": "中华人民共和国国歌",
"analyzer": "ik_smart"
}
GET /my_index/_search
{
"query": {
"match": {
"text": "关关雎鸠"
}
}
}
GET /my_index/_analyze
{
"text": "超级赛亚人",
"analyzer": "ik_max_word"
}
GET /my_index/_analyze
{
"text": "碰瓷是一种敲诈, 应该被判刑",
"analyzer": "ik_max_word"
}
Chinese analyzer (IK)
Download the IK plugin, create an ik directory under plugins in the ES root, unzip the plugin there, and restart ES.
What is the difference between ik_max_word and ik_smart?
ik_max_word: the finest-grained segmentation; for example, "中华人民共和国国歌" is split into 中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌, exhausting every possible combination; suited to term queries.
ik_smart: the coarsest-grained segmentation; for example, "中华人民共和国国歌" is split into 中华人民共和国 and 国歌; suited to phrase queries.
IK files:
1) IKAnalyzer.cfg.xml: the IK configuration file
2) main.dic: the main dictionary
3) stopword.dic: English stop words; these never enter the inverted index
4) special dictionaries:
a. quantifier.dic: units of measure
b. suffix.dic: suffixes
c. surname.dic: Chinese surnames
d. preposition.dic: function and modal words
Hot updates:
a. modify the IK plugin source code, or
b. use IK's built-in hot-update mechanism: deploy a web server that exposes an HTTP endpoint and signals dictionary changes through the Last-Modified and ETag response headers
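For option b, the remote dictionary is declared in IKAnalyzer.cfg.xml. A minimal sketch based on the plugin's sample config (the dictionary path and URL are placeholders you must provide yourself):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- local extension dictionary, relative to the config directory (placeholder) -->
    <entry key="ext_dict">custom/my_words.dic</entry>
    <!-- remote dictionary polled for hot updates; the endpoint must return Last-Modified and ETag headers (placeholder URL) -->
    <entry key="remote_ext_dict">http://example.com/hot_words.dic</entry>
</properties>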