Analyzer
Standard Analyzer
(lowercases terms, splits text into terms on word boundaries, keeps digits and apostrophes) based on the Unicode Text Segmentation algorithm
Tokenizer:
standard Tokenizer
Token Filters:
Lower Case Token Filter
Stop Token Filter (disabled by default)
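The _analyze API is a quick way to check what each built-in analyzer actually emits; a minimal sketch (the sample text is made up):

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

This should return the lowercased terms [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]: digits and apostrophes are kept, punctuation is dropped, and no stop words are removed by default.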
Simple Analyzer
(splits on any non-letter character, so digits and other symbols are dropped; lowercases terms)
Tokenizer:
Lowercase Tokenizer
Whitespace Analyzer
(splits on whitespace only; no lowercasing)
Tokenizer:
Whitespace Tokenizer
Stop Analyzer
(the simple analyzer plus stop-word removal)
lowercases terms and filters stop words (the, a, is, ...)
Tokenizer:
lower case tokenizer
Token Filters:
Stop Token Filter
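Running the same made-up sample text through the stop analyzer shows the difference from the standard analyzer:

POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Expected terms: [ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]; the digit disappears because the lowercase tokenizer splits on non-letters, and the English stop words are filtered out.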
Keyword Analyzer
no tokenization; the entire input is returned as a single token
Tokenizer:
Keyword tokenizer
Pattern Analyzer
splits text with a regular expression; the default pattern is \W+ (split on non-word characters)
Tokenizer:
pattern Tokenizer
Token Filters:
Lower case token filter
stop Token Filter
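A non-default pattern has to be configured in the index settings; a minimal sketch, assuming a made-up index my_pattern_index and comma-separated values:

PUT my_pattern_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_analyzer": {
          "type": "pattern",
          "pattern": ",",
          "lowercase": true
        }
      }
    }
  }
}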
Language Analyzers
Chinese analyzers (plugins)
ik
- ik provides two modes: fine-grained ik_max_word and coarse-grained ik_smart (see the mapping sketch after this list)
- updating the ik dictionary only requires appending new terms to the end of the dictionary file; both local and remote dictionaries are supported
- the ik plugin is updated quickly and keeps pace with the latest Elasticsearch releases
ANSJ
jieba (结巴)
HanLP
THULAC (Tsinghua University)
Stanford Word Segmenter
HIT segmenter (Harbin Institute of Technology)
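A sketch of how ik is usually wired into a mapping, assuming the analysis-ik plugin is installed (the index and field names here are made up); indexing with ik_max_word and searching with ik_smart is a common combination:

PUT my_ik_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}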
Fingerprint Analyzer
(lowercases, normalizes by removing extended characters, sorts and deduplicates the tokens, and concatenates them into a single token)
Tokenizer:
Standard Tokenizer
Token Filters:
Lower Case Token Filter
ASCII Folding Token Filter
Stop Token Filter
Fingerprint Token Filter
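The fingerprint analyzer is easiest to understand from an example; a small sketch using the _analyze API (sample text made up):

POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}

The expected output is a single token, [ and consistent godel is said sentence this yes ]: lowercased, ASCII-folded, sorted, deduplicated, and joined into one term.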
Custom Analyzer
- character filters: zero or more
- tokenizer: exactly one
- token filters: zero or more
Tokenizer
[Standard Tokenizer]
The standard tokenizer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation symbols. It is the best choice for most languages.
[Letter Tokenizer]
The letter tokenizer divides text into terms whenever it encounters a character which is not a letter.
[Lowercase Tokenizer]
The lowercase tokenizer, like the letter tokenizer, divides text into terms whenever it encounters a character which is not a letter, but it also lowercases all terms.
[Whitespace Tokenizer]
The whitespace tokenizer divides text into terms whenever it encounters any whitespace character.
[UAX URL Email Tokenizer]
The uax_url_email tokenizer is like the standard tokenizer except that it recognises URLs and email addresses as single tokens.
[Classic Tokenizer]
The classic tokenizer is a grammar based tokenizer for the English Language.
[Thai Tokenizer]
The thai tokenizer segments Thai text into words.
[N-Gram Tokenizer]
The ngram tokenizer breaks text down into words when it encounters certain characters (e.g. whitespace or punctuation), then emits N-grams of each word of the specified lengths.
[Edge N-Gram Tokenizer]
The edge_ngram tokenizer breaks text down into words in the same way, but emits only N-grams anchored to the start of each word, which is useful for search-as-you-type.
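Any of these tokenizers can be tried on its own by passing a tokenizer instead of an analyzer to _analyze; a small sketch with uax_url_email (sample text made up):

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Email me at john.smith@example.com or visit https://example.com/docs"
}

Unlike the standard tokenizer, this keeps john.smith@example.com and https://example.com/docs as single tokens.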
Document analysis pipeline
Character filters
A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters.
Character filter types:
- HTML Strip Character Filter: removes HTML elements such as <b> and decodes HTML entities such as &amp;
- Mapping Character Filter: replaces specified characters according to a configured mapping
- Pattern Replace Character Filter: replaces characters that match a regular expression
Tokenizers: split the text into tokens
A tokenizer receives the character stream and breaks it into individual tokens, recording each token's order or position (position) as well as the character offsets of its start (start_offset) and end (end_offset) in the original text.
Token filters: post-process the tokens
Token filters further process the token stream produced by the tokenizer, for example lowercasing, removing, or adding tokens.
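All three stages can also be combined ad hoc in an _analyze call, which is handy for debugging a pipeline before writing it into index settings; a sketch (sample text made up):

POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": "<b>Déjà vu</b> &amp; more"
}

The char filter strips the <b> tags and decodes &amp;, the tokenizer splits on word boundaries and drops the &, and the filters lowercase and ASCII-fold the terms, giving roughly [ deja, vu, more ].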
Custom analyzer
Template
PUT my_custom_filter
{
  "settings": {
    "analysis": {
      "char_filter": {},
      "tokenizer": {},
      "filter": {},
      "analyzer": {}
    }
  }
}
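A filled-in version of the template; all the names here are made up for illustration, while the components (html_strip, whitespace, lowercase, stop) are standard built-ins:

PUT my_custom_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_html": {
          "type": "html_strip"
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["strip_html"],
          "tokenizer": "whitespace",
          "filter": ["lowercase", "english_stop"]
        }
      }
    }
  }
}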
Business question: my requirement is as follows. There is an author field whose values look like
Li,LeiLei;Han,MeiMei, and some look like LeiLei Li.... I need exact matching. My idea was to use a custom analyzer that splits on the semicolon. But then a search for
Li,LeiLei will not find LeiLei Li; what I want is for LeiLei Li to be found as well. Also, with this tokenization,
Li,LeiLei without the comma does not match either. And I don't understand why adding stop words in the mapping has no effect?
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            ", => "
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ";"
        }
      },
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "expand": true,
          "synonyms": [
            "lileilei => leileili",
            "hanmeimei => meimeihan"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "char_filter": ["my_char_filter"],
          "filter": ["my_synonym_filter"]
        }
      }
    }
  }
}
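To make this take effect, the analyzer still has to be attached to the author field in the mapping, and _analyze can be used to check what actually lands in the index; a sketch assuming the index above (the field name author comes from the question, everything else is as defined above):

PUT my_index/_mapping
{
  "properties": {
    "author": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Li,LeiLei;Han,MeiMei"
}

One thing worth checking: the synonym entries are lowercase, but nothing in this chain lowercases the tokens, so a lowercase token filter placed before my_synonym_filter is probably needed for the synonyms to match at all.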