Analyzers
1. Analysis and Analyzer
- Analysis: the process of breaking full text into a series of terms (tokens), also called tokenization; the whole process is carried out by an Analyzer;
- Analyzer: an analyzer is made up of the following three parts (a combined sketch follows this list):
- Character Filter: pre-processes the raw text by adding, removing, or replacing characters, e.g. stripping tags such as HTML tags; multiple character filters can be configured, but they affect the position and offset information seen by the later Tokenizer;
  Some character filters built into ES:
  - HTML strip - removes HTML tags
  - Mapping - string replacement
  - Pattern replace - regex-based replacement
- Tokenizer: splits the text into terms according to a set of rules;
  - Tokenizers built into ES: whitespace (split on whitespace) / standard / uax_url_email (emails and URLs) / pattern (regular expressions) / keyword (no splitting at all) / path_hierarchy (file paths);
  - You can also write a plugin in Java to implement your own Tokenizer;
- Token Filter: post-processes the terms emitted by the Tokenizer, e.g. lowercasing (lowercase), removing stop words (stop), adding synonyms (synonym), and so on;
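The three stages can be exercised together with an ad-hoc _analyze request; a minimal sketch (no index needed, the pipeline is assembled on the fly, and the sample text is arbitrary):
GET /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p>Some <b>HTML</b> Content</p>"
}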
2. Built-in Analyzers
- Standard Analyzer: the default analyzer; splits on word boundaries (whitespace and non-letter/non-digit characters, which are dropped; note that letters and digits are both kept) and lowercases;
- Simple Analyzer: splits on non-letter characters (non-letters are dropped, so digits are removed too) and lowercases;
- Stop Analyzer: lowercases, splits on non-letters, and additionally removes stop words (is, the, a, etc.) along with digits and symbols;
- Whitespace Analyzer: splits on whitespace only, does not lowercase;
- Keyword Analyzer: no tokenization; the input is kept as a single term;
- Pattern Analyzer: splits with a regular expression; by default it splits on non-word characters;
- Language analyzers: analyzers for specific languages; for Chinese there are analysis-icu, IK, THULAC, etc.;
- Custom Analyzer: build your own analyzer, see section 4;
A quick comparison of two of the built-in analyzers follows.
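For example, running the same text through simple and whitespace gives quite different terms (a sketch; any sample text works):
GET /_analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK brown-foxes"
}
GET /_analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK brown-foxes"
}
simple drops the digit and lowercases everything (the, quick, brown, foxes), while whitespace only splits on spaces and keeps the original casing (The, 2, QUICK, brown-foxes).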
3. Testing tokenization with the _analyze API
- Test by specifying an analyzer and a text
GET /_analyze
{
"analyzer": "standard",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
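With the standard analyzer this should yield lowercased terms with punctuation stripped, roughly: 2, running, quick, brown, foxes, leap, over, lazy, dogs, in, the, summer, evening.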
- Test how a specific field (title) of an index (books) is tokenized by that field's configured analyzer
POST books/_analyze
{
"filed": "title",
"text": "xxxxxxxxxxx"
}
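This only shows something other than the default behaviour if the title field has an analyzer configured in its mapping, e.g. when the index is created; a hypothetical sketch (the english analyzer is just an example choice):
PUT books
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}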
- Test an analyzer assembled on the fly (specify the tokenizer and filters directly)
POST /_analyze
{
"tokenizer": "standard",
"filter":["lowercase"],
"text": "xxxxxxxxxxx"
}
- Demo file with the test requests
4. Custom Analyzers
- Create an index with a custom analyzer defined in its settings
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_customer_analyzer": {
"type": "custom",
// custom character filter "emoticons"
"char_filter": [
"emoticons"
],
// custom tokenizer "punctuation"
"tokenizer": "punctuation",
"filter": [
"lowercase",
// custom token filter "english_stop"
"english_stop"
]
}
},
"tokenizer": {
"punctuation": {
"type": "pattern",
"pattern": "[ .,!?]"
}
},
"char_filter": {
"emoticons": {
"type": "mapping",
"mappings": [
":) => _happay_",
":( => _sad_"
]
}
},
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
}
}
}
}
}
// test the custom analyzer against a sample text
POST my_index/_analyze
{
"analyzer": "my_customer_analyzer",
"text": "I'm a :) person, and you?"
}
// tokenization result
{
"tokens" : [
{
"token" : "i'm",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "_happay_",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "person",
"start_offset" : 9,
"end_offset" : 15,
"type" : "word",
"position" : 3
},
{
"token" : "you",
"start_offset" : 21,
"end_offset" : 24,
"type" : "word",
"position" : 5
}
]
}
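To actually use the custom analyzer when indexing documents, reference it in a field mapping; a minimal sketch assuming a hypothetical content field:
PUT my_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_customer_analyzer"
    }
  }
}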