Elasticsearch Analyzers

Analyzer

Standard Analyzer

(Lowercases terms and splits text on word boundaries, as defined by the Unicode Text Segmentation algorithm; keeps digits and apostrophes, removes most punctuation)

Tokenizer:

standard Tokenizer

Token Filters:

Lower Case Token Filter

Stop Token Filter (disabled by default)
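
As a quick illustration (the sample text is made up), the _analyze API shows what the standard analyzer emits:

```json
POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes."
}
```

This should return the tokens [the, 2, quick, brown, foxes]: lowercased, split on word boundaries, with digits kept and punctuation dropped.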

Simple Analyzer

(Splits text on any non-letter character, so digits and symbols are dropped; lowercases all terms)

Tokenizer:

Lowercase Tokenizer

Whitespace Analyzer

(Splits text on whitespace only; does not lowercase)

Tokenizer:

Whitespace Tokenizer

Stop Analyzer

(Builds on the simple analyzer and additionally removes stop words)

Lowercases terms and filters out stop words (the, a, is, ...)

Tokenizer:

Lowercase Tokenizer

Token Filters

Stop Token Filter

Keyword Analyzer

Does not tokenize at all; returns the entire input as a single token

Tokenizer:

Keyword tokenizer
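
A minimal sketch of the difference: text that the standard analyzer would split into two tokens comes back from the keyword analyzer as one:

```json
POST _analyze
{
  "analyzer": "keyword",
  "text": "New York"
}
```

This should yield the single token New York, unchanged.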

Pattern Analyzer

Splits text with a regular expression; the default pattern is \W+ (split on non-word characters)

Tokenizer:

pattern Tokenizer

Token Filters

Lower Case Token Filter

Stop Token Filter (disabled by default)
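
As a sketch, a custom pattern analyzer can be configured with its own regex — the index and analyzer names below are made up:

```json
PUT csv_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_analyzer": {
          "type": "pattern",
          "pattern": ",",
          "lowercase": true
        }
      }
    }
  }
}
```

With this definition, a value such as a,B,c would be split on commas and lowercased into [a, b, c].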

Language Analyzers

Chinese tokenizers

IK

github.com/medcl/elast…

  • IK offers two granularities: fine-grained ik_max_word and coarse-grained ik_smart
  • To extend the dictionary, just append keywords at the end of the dictionary file; both local and remote dictionaries are supported
  • The IK plugin is updated quickly and keeps pace with the latest Elasticsearch versions
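
Once the plugin is installed, IK's two analyzers can be compared directly with _analyze (sample text arbitrary):

```json
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国"
}
```

ik_max_word emits every plausible sub-word, while swapping in ik_smart returns only the coarsest segmentation; a common setup is ik_max_word as the index-time analyzer and ik_smart as the search_analyzer.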

ANSJ

github.com/NLPchina/el…

Jieba

github.com/sing1ee/ela…

HanLP

github.com/KennFalcon/…

THULAC (Tsinghua University)

github.com/microbun/el…

Stanford

github.com/stanfordnlp…

LTP (Harbin Institute of Technology)

github.com/HIT-SCIR/lt…

Fingerprint Analyzer

(Lowercases, ASCII-folds extended characters, sorts and de-duplicates terms, then concatenates them into a single token)

Tokenizer:

Standard Tokenizer

Token Filters

Lower Case Token Filter

ASCII Folding Token Filter

Stop Token Filter

Fingerprint Token Filter
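
The Elasticsearch reference illustrates the fingerprint analyzer with a request like this:

```json
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
```

It produces the single token and consistent godel is said sentence this yes: lowercased, accent-folded, de-duplicated, sorted, and joined with spaces.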

Custom Analyzer

  • character filters: zero or more
  • tokenizer: exactly one
  • token filters: zero or more

Tokenizer

[Standard Tokenizer]

The standard tokenizer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation symbols. It is the best choice for most languages.

[Letter Tokenizer]

The letter tokenizer divides text into terms whenever it encounters a character which is not a letter.

[Lowercase Tokenizer]

The lowercase tokenizer, like the letter tokenizer, divides text into terms whenever it encounters a character which is not a letter, but it also lowercases all terms.

[Whitespace Tokenizer]

The whitespace tokenizer divides text into terms whenever it encounters any whitespace character.

[UAX URL Email Tokenizer]

The uax_url_email tokenizer is like the standard tokenizer except that it recognises URLs and email addresses as single tokens.
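
For example (request adapted from the Elasticsearch reference):

```json
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Email me at john.smith@global-international.com"
}
```

This should yield [Email, me, at, john.smith@global-international.com], whereas the standard tokenizer would break the address into several tokens.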

[Classic Tokenizer]

The classic tokenizer is a grammar based tokenizer for the English Language.

[Thai Tokenizer]

The thai tokenizer segments Thai text into words.

[N-Gram Tokenizer]

The ngram tokenizer breaks text into words when it encounters certain characters, then emits N-grams of each word of the specified lengths.

[Edge N-Gram Tokenizer]

The edge_ngram tokenizer breaks text into words when it encounters certain characters, then emits N-grams of each word that are anchored to the start of the word.
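
A typical use of edge_ngram is search-as-you-type. A minimal sketch of an autocomplete analyzer (index and analyzer names are illustrative):

```json
PUT autocomplete_demo
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

With this configuration, Quick would be indexed as [qu, qui, quic, quick], so prefix queries match without wildcards.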

The analysis pipeline

Character filters

A character filter receives the original text as a character stream and can transform the stream by adding, removing, or changing characters.

Built-in character filters:

  • HTML Strip Character Filter

    Removes HTML elements such as <b> and decodes HTML entities such as &amp;

  • Mapping Character Filter

    Replaces specified characters

  • Pattern Replace Character Filter

    Replaces characters that match a regular expression
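
Character filters can be tried out in isolation via _analyze; for instance, html_strip (example adapted from the Elasticsearch reference):

```json
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
```

The tags are removed and &apos; is decoded, leaving the single token I'm so happy! (with surrounding newlines where the block-level <p> was).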

Tokenizers: splitting text into tokens

A tokenizer receives a character stream and breaks it into individual tokens, recording each token's order (position) as well as the character offsets of its start (start_offset) and end (end_offset) in the original text.

Token filters: post-processing tokens

Token filters further process the token stream produced by the tokenizer, e.g. lowercasing, deleting, or adding tokens.
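
A tokenizer and a chain of token filters can likewise be combined ad hoc in _analyze:

```json
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "The Quick Brown Fox"
}
```

This should return [quick, brown, fox]: the standard tokenizer splits the text, lowercase normalizes case, and stop drops the English stop word the.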

Custom analyzers

Template

 PUT my_custom_filter
 {
     "settings":{
         "analysis":{
             "char_filter":{},
             "tokenizer":{},
             "filter":{},
             "analyzer":{}
         }
     }
 }

Business problem: my requirement is as follows. There is an author field whose values look like Li,LeiLei;Han,MeiMei; while others look like LeiLei Li...

I need exact matching. My idea was to use a custom analyzer that splits on semicolons. But then a search for Li,LeiLei cannot find LeiLei Li, and I would like LeiLei Li to be found as well.

Moreover, with this tokenization, Li,LeiLei without the comma does not match either. And I don't know why adding stop words in the mapping has no effect?

 PUT my_index
 {
     "settings":{
         "analysis":{
             "char_filter":{
                 "my_char_filter":{
                     "type": "mapping",
                     "mappings": [
                         ", => "
                     ]
                 }
             },
             "tokenizer":{
                 "my_tokenizer":{
                     "type": "pattern",
                     "pattern": ";"
                 }
             },
             "filter":{
                 "my_synonym_filter":{
                     "type": "synonym",
                     "expand": true,
                     "synonyms": [
                         "lileilei => leileili",
                         "hanmeimei => meimeihan"
                     ]
                 }
             },
             "analyzer": {
                 "my_analyzer":{
                     "tokenizer": "my_tokenizer",
                     "char_filter": ["my_char_filter"],
                     "filter": ["my_synonym_filter"]
                 }
             }
         }
     }
 }
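
One way to debug an analyzer definition like this is to run _analyze against the index and inspect the tokens the chain actually produces (my_index and my_analyzer here are the names defined above):

```json
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Li,LeiLei;Han,MeiMei"
}
```

Seeing the real token output makes problems easy to spot — for example, without a lowercase filter ahead of the synonym filter, the token LiLeiLei will likely never match the all-lowercase rule lileilei => leileili.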