There is no perfect program in the world, but that should not discourage us: writing programs is a continual pursuit of perfection.
Custom analyzers:
- Character filters:
  1. Purpose: add, remove, or transform characters
  2. Cardinality: zero or more
  3. Built-in character filters:
    1. HTML Strip Character Filter: removes HTML tags
    2. Mapping Character Filter: replaces characters via a mapping table
    3. Pattern Replace Character Filter: regex-based replacement
- Tokenizer:
  1. Purpose:
    1. split text into tokens
    2. record token order and position (used by phrase queries)
    3. record each token's start and end character offsets (used by highlighting)
    4. record the token type (used for classification)
  2. Cardinality: exactly one
  3. Categories:
    1. Word-oriented tokenizers:
      1. Standard
      2. Letter
      3. Lowercase
      4. Whitespace
      5. UAX URL Email
      6. Classic
      7. Thai
    2. Partial-word tokenizers:
      1. N-Gram
      2. Edge N-Gram
    3. Structured-text tokenizers:
      1. Keyword
      2. Pattern
      3. Simple Pattern
      4. Char Group
      5. Simple Pattern Split
      6. Path Hierarchy
- Token filters:
  1. Purpose: add, remove, or transform tokens
  2. Cardinality: zero or more
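The pipeline above (character filters → tokenizer → token filters) can be sketched in Python. This is only an illustrative approximation, not the actual Elasticsearch implementation; the `strip_html`, `whitespace`, and `lowercase` helpers are hypothetical stand-ins for the corresponding built-ins.

```python
import re

# Illustrative sketch of an analyzer pipeline: 0+ character filters,
# exactly 1 tokenizer, 0+ token filters (not the real ES implementation).
def analyze(text, char_filters, tokenizer, token_filters):
    for cf in char_filters:      # character-level add/remove/replace
        text = cf(text)
    tokens = tokenizer(text)     # exactly one tokenizer splits into tokens
    for tf in token_filters:     # token-level add/remove/replace
        tokens = tf(tokens)
    return tokens

# Example: strip HTML tags, split on whitespace, lowercase each token
strip_html = lambda t: re.sub(r"<[^>]+>", "", t)
whitespace = lambda t: t.split()
lowercase  = lambda ts: [t.lower() for t in ts]

print(analyze("<b>Hello</b> World", [strip_html], whitespace, [lowercase]))
# → ['hello', 'world']
```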
Today we demonstrate the structured-text tokenizers:
# keyword tokenizer
# returns the whole input as a single token, unchanged
GET /_analyze
{
"tokenizer": "keyword",
"text": ["hello world", "我是中国人"]
}
# Response
{
"tokens" : [
{
"token" : "hello world",
"start_offset" : 0,
"end_offset" : 11,
"type" : "word",
"position" : 0
},
{
"token" : "我是中国人",
"start_offset" : 12,
"end_offset" : 17,
"type" : "word",
"position" : 101
}
]
}
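Note the second token's position of 101: when analyzing an array of strings, Elasticsearch inserts a position increment gap (default 100) between values. A rough Python approximation of this behavior (`keyword_tokenize` is a hypothetical helper, not an ES API):

```python
# Illustrative approximation of the keyword tokenizer over an array of
# strings: each string becomes one token, and a position gap (default 100)
# separates consecutive array values.
def keyword_tokenize(texts, gap=100):
    tokens, pos, offset = [], 0, 0
    for i, text in enumerate(texts):
        if i > 0:
            pos += gap               # position_increment_gap between values
        tokens.append({"token": text,
                       "start_offset": offset,
                       "end_offset": offset + len(text),
                       "position": pos})
        offset += len(text) + 1      # +1 for the implicit separator
        pos += 1
    return tokens

print(keyword_tokenize(["hello world", "我是中国人"]))
```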
# pattern tokenizer
# regex-based tokenization: by default (group = -1) the regex matches act as
# separators; when a capture group is given, the matched group becomes the token
# Options:
# 1. pattern : the regular expression (Java regex syntax)
# 2. flags : regex flags
# 3. group : which capture group to emit as the token; default -1 (split on matches)
GET /_analyze
{
"tokenizer": {
"type" : "pattern",
"pattern" : "((?:[a-z0-9])+)",
"group" : 1
},
"text": ["hello 23456"]
}
# Response
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "23456",
"start_offset" : 6,
"end_offset" : 11,
"type" : "word",
"position" : 1
}
]
}
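With `group` set, the tokenizer emits the captured group of every match. Python's `re` module behaves the same way for this pattern (ES uses Java regex syntax, which differs in some corners):

```python
import re

# Illustrative approximation of the pattern tokenizer with group >= 0:
# emit the text captured by that group for every non-overlapping match.
def pattern_tokenize(text, pattern, group):
    return [m.group(group) for m in re.finditer(pattern, text)]

print(pattern_tokenize("hello 23456", r"((?:[a-z0-9])+)", 1))
# → ['hello', '23456']
```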
# simple_pattern tokenizer
# uses Lucene's restricted regex syntax and emits the matching text as tokens
# pattern must be specified (it defaults to the empty string, which matches nothing)
# Options: pattern
GET /_analyze
{
"tokenizer": {
"type" : "simple_pattern",
"pattern" : "[0-9]{3}"
},
"text": ["3456786544433 fsdfsd"]
}
# Response
{
"tokens" : [
{
"token" : "345",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "678",
"start_offset" : 3,
"end_offset" : 6,
"type" : "word",
"position" : 1
},
{
"token" : "654",
"start_offset" : 6,
"end_offset" : 9,
"type" : "word",
"position" : 2
},
{
"token" : "443",
"start_offset" : 9,
"end_offset" : 12,
"type" : "word",
"position" : 3
}
]
}
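The behavior amounts to "collect every non-overlapping match, discard the rest" — the trailing `3` and `fsdfsd` never match and so produce no tokens. In Python this is simply:

```python
import re

# Illustrative approximation of simple_pattern: emit every non-overlapping
# match of the pattern and discard everything else. (ES uses Lucene's
# restricted regex syntax, which trades features for speed.)
print(re.findall(r"[0-9]{3}", "3456786544433 fsdfsd"))
# → ['345', '678', '654', '443']
```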
# char_group tokenizer
# splits on a configurable set of characters
# Options:
# 1. tokenize_on_chars : characters (or classes such as "whitespace") to split on
# 2. max_token_length : maximum token length, default 255
GET /_analyze
{
"tokenizer": {
"type" : "char_group",
"tokenize_on_chars" : [
"-", "whitespace", "_"
]
},
"text": ["sdjflds sdfsd-sdf-7879 fsd_us9098"]
}
# Response
{
"tokens" : [
{
"token" : "sdjflds",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 0
},
{
"token" : "sdfsd",
"start_offset" : 8,
"end_offset" : 13,
"type" : "word",
"position" : 1
},
{
"token" : "sdf",
"start_offset" : 14,
"end_offset" : 17,
"type" : "word",
"position" : 2
},
{
"token" : "7879",
"start_offset" : 18,
"end_offset" : 22,
"type" : "word",
"position" : 3
},
{
"token" : "fsd",
"start_offset" : 23,
"end_offset" : 26,
"type" : "word",
"position" : 4
},
{
"token" : "us9098",
"start_offset" : 27,
"end_offset" : 33,
"type" : "word",
"position" : 5
}
]
}
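A Python sketch of the same splitting rule, assuming `"whitespace"` in `tokenize_on_chars` stands for the whitespace character class (`char_group_tokenize` is a hypothetical helper):

```python
import re

# Illustrative approximation of char_group: split on any character in the
# configured set; the keyword "whitespace" maps to the \s character class.
def char_group_tokenize(text, chars):
    cls = "".join(r"\s" if c == "whitespace" else re.escape(c) for c in chars)
    return [t for t in re.split("[" + cls + "]", text) if t]

print(char_group_tokenize("sdjflds sdfsd-sdf-7879 fsd_us9098",
                          ["-", "whitespace", "_"]))
# → ['sdjflds', 'sdfsd', 'sdf', '7879', 'fsd', 'us9098']
```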
# simple_pattern_split tokenizer
# splits on text matching a Lucene regex; the matches themselves are discarded
# Options: pattern
GET /_analyze
{
"tokenizer": {
"type" : "simple_pattern_split",
"pattern" : "[0-9]{3}"
},
"text": ["sdfsd23243sdfsd890sdfs"]
}
# Response
{
"tokens" : [
{
"token" : "sdfsd",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "43sdfsd",
"start_offset" : 8,
"end_offset" : 15,
"type" : "word",
"position" : 1
},
{
"token" : "sdfs",
"start_offset" : 18,
"end_offset" : 22,
"type" : "word",
"position" : 2
}
]
}
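Here the matches act as separators: `232` and `890` are consumed, leaving `43sdfsd` as a single token between them. The Python equivalent is a regex split:

```python
import re

# Illustrative approximation of simple_pattern_split: the regex matches
# are separators; only the text between them is kept.
print([t for t in re.split(r"[0-9]{3}", "sdfsd23243sdfsd890sdfs") if t])
# → ['sdfsd', '43sdfsd', 'sdfs']
```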
# path_hierarchy tokenizer
# splits a path into its hierarchy levels and can replace the delimiter
# Options:
# 1. delimiter : the character to split on, default "/"
# 2. replacement : replacement for the delimiter, defaults to delimiter
# 3. buffer_size : internal buffer size; changing it is not recommended
# 4. reverse : default false; when true:
#    1. tokens are emitted as path suffixes instead of prefixes
#    2. skip counts from the end of the path instead of the beginning
# 5. skip : number of initial tokens to skip, default 0
GET /_analyze
{
"tokenizer": {
"type" : "path_hierarchy",
"delimiter" : "-",
"replacement" : "/",
"reverse" : true,
"skip" : 1
},
"text": ["hello-good-this-is-me"]
}
# Response
{
"tokens" : [
{
"token" : "hello/good/this/is/",
"start_offset" : 0,
"end_offset" : 19,
"type" : "word",
"position" : 0
},
{
"token" : "good/this/is/",
"start_offset" : 6,
"end_offset" : 19,
"type" : "word",
"position" : 0
},
{
"token" : "this/is/",
"start_offset" : 11,
"end_offset" : 19,
"type" : "word",
"position" : 0
},
{
"token" : "is/",
"start_offset" : 16,
"end_offset" : 19,
"type" : "word",
"position" : 0
}
]
}
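With `reverse: true` and `skip: 1`, the trailing element `me` is dropped and each token is a suffix of what remains, ending in the replacement character. A minimal sketch of exactly this reverse-mode behavior (`path_hierarchy_reverse` is a hypothetical helper matching the example above, not the full tokenizer):

```python
# Illustrative sketch of path_hierarchy with reverse=true: split on the
# delimiter, drop `skip` trailing elements, then emit every suffix of the
# remaining path joined with the replacement character.
def path_hierarchy_reverse(text, delimiter, replacement, skip):
    parts = text.split(delimiter)[:-skip] if skip else text.split(delimiter)
    tail = replacement if skip else ""   # skip leaves a trailing delimiter
    return [replacement.join(parts[i:]) + tail for i in range(len(parts))]

print(path_hierarchy_reverse("hello-good-this-is-me", "-", "/", 1))
# → ['hello/good/this/is/', 'good/this/is/', 'this/is/', 'is/']
```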