There is no perfect program in this world, but that is no reason to be discouraged: writing programs is a never-ending pursuit of perfection.
Custom analyzers (a wiring sketch follows this outline):
- Character filters :
    1. Purpose : add, remove, or transform characters
    2. Cardinality : zero or more
    3. Built-in character filters :
        1. HTML Strip Character Filter : strips HTML tags
        2. Mapping Character Filter : replaces characters via a mapping table
        3. Pattern Replace Character Filter : replaces characters via a regular expression
- Tokenizer :
    1. Purpose :
        1. splits the text into tokens
        2. records each token's order and position (used by phrase queries)
        3. records each token's start and end character offsets (used by highlighting)
        4. records each token's type (used for classification)
    2. Cardinality : exactly one
    3. Categories :
        1. Word-oriented tokenizers :
            1. Standard
            2. Letter
            3. Lowercase
            4. Whitespace
            5. UAX URL Email
            6. Classic
            7. Thai
        2. Partial-word tokenizers :
            1. N-Gram
            2. Edge N-Gram
        3. Structured-text tokenizers :
            1. Keyword
            2. Pattern
            3. Simple Pattern
            4. Char Group
            5. Simple Pattern Split
            6. Path
- Token filters :
    1. Purpose : add, remove, or transform tokens
    2. Cardinality : zero or more
    3. Filters :
1. apostrophe
2. asciifolding
3. cjk bigram
4. cjk width
5. classic
6. common grams
7. conditional
8. decimal digit
9. delimited payload
10. dictionary decompounder
11. edge ngram
12. elision
13. fingerprint
14. flatten_graph
15. hunspell
16. hyphenation decompounder
17. keep types
18. keep words
19. keyword marker
20. keyword repeat
21. kstem
22. length
23. limit token count
24. lowercase
25. min_hash
26. multiplexer
27. ngram
28. normalization
29. pattern_capture
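
Before the demos, here is a minimal sketch of how the three parts are wired together in index settings. The index name my-index, the analyzer name my_custom_analyzer, and the particular filters chosen are illustrative assumptions, not part of these notes:
# custom analyzer = 0+ char_filter, exactly 1 tokenizer, 0+ filter
PUT /my-index
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "my_custom_analyzer" : {
          "type" : "custom",
          "char_filter" : ["html_strip"],
          "tokenizer" : "whitespace",
          "filter" : ["lowercase", "asciifolding"]
        }
      }
    }
  }
}
# exercise the analyzer against the index it was defined in
GET /my-index/_analyze
{
  "analyzer" : "my_custom_analyzer",
  "text" : ["<p>Héllo World</p>"]
}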
Today's demo covers filters 26-29; the most important is 26 (multiplexer).
# multiplexer token filter
# Purpose :
# 1. runs several token filters on the same token in parallel, emitting one token per filter chain
# 2. the chains must not contain shingle or multi-word synonym filters
# Options :
# 1. filters : a list of filter chains; within a chain, filters are comma-separated, e.g. "lowercase, kstem"
# 2. preserve_original : whether to also emit the original token, default true
GET /_analyze
{
"tokenizer" : "whitespace",
"filter" : [{
"type" : "multiplexer",
"filters" : ["edge_ngram", "lowercase, kstem"],
"preserve_original" : false
}],
"text" : ["hello gooding Me"]
}
# Result
{
"tokens" : [
{
"token" : "h",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "g",
"start_offset" : 6,
"end_offset" : 13,
"type" : "word",
"position" : 1
},
{
"token" : "good",
"start_offset" : 6,
"end_offset" : 13,
"type" : "word",
"position" : 1
},
{
"token" : "M",
"start_offset" : 14,
"end_offset" : 16,
"type" : "word",
"position" : 2
},
{
"token" : "me",
"start_offset" : 14,
"end_offset" : 16,
"type" : "word",
"position" : 2
}
]
}
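
For contrast, a sketch of the same request with preserve_original left at its default of true: the untouched tokens hello, gooding and Me should then appear as well, except that the multiplexer removes duplicate tokens at the same position, so hello is emitted only once:
GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter" : [{
    "type" : "multiplexer",
    "filters" : ["edge_ngram", "lowercase, kstem"]
  }],
  "text" : ["hello gooding Me"]
}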
# ngram token filter
# Purpose : splits each token into n-grams
# Options :
# 1. max_gram : maximum gram length, default 2
# 2. min_gram : minimum gram length, default 1
# 3. preserve_original : whether to also emit the original token, default false
GET /_analyze
{
"tokenizer" : "whitespace",
"filter" : ["ngram"],
"text" : ["hello gooding me"]
}
# Result
{
"tokens" : [
{
"token" : "h",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "he",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "e",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "el",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "l",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "ll",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "l",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "lo",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "o",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "g",
"start_offset" : 6,
"end_offset" : 13,
"type" : "word",
"position" : 1
},
{
"token" : "go",
"start_offset" : 6,
"end_offset" : 13,
"type" : "word",
"position" : 1
},
{
"token" : "o",
"start_offset" : 6,
"end_offset" : 13,
"type" : "word",
"position" : 1
},
{
"token" : "oo",
"start_offset" : 6,
"end_offset" : 13,
"type" : "word",
"position" : 1
},
{
"token" : "o",
"start_offset" : 6,
"end_offset" : 13,
"type" : "word",
"position" : 1
},
{
"token" : "od",
"start_offset" : 6,
"end_offset" : 13,
"type" : "word",
"position" : 1
},
{
"token" : "d",
"start_offset" : 6,
"end_offset" : 13,
"type" : "word",
"position" : 1
},
{
"token" : "di",
"start_offset" : 6,
"end_offset" : 13,
"type" : "word",
"position" : 1
},
{
"token" : "i",
"start_offset" : 6,
"end_offset" : 13,
"type" : "word",
"position" : 1
},
{
"token" : "in",
"start_offset" : 6,
"end_offset" : 13,
"type" : "word",
"position" : 1
},
{
"token" : "n",
"start_offset" : 6,
"end_offset" : 13,
"type" : "word",
"position" : 1
},
{
"token" : "ng",
"start_offset" : 6,
"end_offset" : 13,
"type" : "word",
"position" : 1
},
{
"token" : "g",
"start_offset" : 6,
"end_offset" : 13,
"type" : "word",
"position" : 1
},
{
"token" : "m",
"start_offset" : 14,
"end_offset" : 16,
"type" : "word",
"position" : 2
},
{
"token" : "me",
"start_offset" : 14,
"end_offset" : 16,
"type" : "word",
"position" : 2
},
{
"token" : "e",
"start_offset" : 14,
"end_offset" : 16,
"type" : "word",
"position" : 2
}
]
}
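
The run above relies on the defaults (min_gram 1, max_gram 2). A sketch overriding the options: with min_gram 2 and max_gram 3, hello should yield he, hel, el, ell, ll, llo, lo, plus hello itself because preserve_original is set. Note that when an ngram filter is defined in index settings, max_gram - min_gram is additionally capped by index.max_ngram_diff:
GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter" : [{
    "type" : "ngram",
    "min_gram" : 2,
    "max_gram" : 3,
    "preserve_original" : true
  }],
  "text" : ["hello"]
}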
# normalization token filters
# A family of filters that normalize the special characters of a particular language
# Includes :
# 1. arabic_normalization : Arabic
# 2. german_normalization : German
# 3. hindi_normalization : Hindi
# 4. indic_normalization : Indic languages
# 5. sorani_normalization : Sorani Kurdish
# 6. persian_normalization : Persian
# 7. scandinavian_normalization : Scandinavian languages
# 8. scandinavian_folding : Scandinavian languages (folding variant)
# 9. serbian_normalization : Serbian
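
These filters take no options. A quick sketch using german_normalization, which should fold umlauts and ß (über becoming uber, Straße becoming Strasse):
GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["german_normalization"],
  "text" : ["über Straße"]
}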
# pattern capture token filter
# Purpose : emits a token for every capture group matched by a regular expression (Java regex syntax)
# Options :
# 1. patterns : the list of regular expressions
# 2. preserve_original : whether to also emit the original token, default true
GET /_analyze
{
"tokenizer": "keyword",
"filter": [{
"type" : "pattern_capture",
"patterns" : ["([a-z]+)(\\d+)"],
"preserve_original" : false
}],
"text": ["hellogoodingme23r4234"]
}
# Result
{
"tokens" : [
{
"token" : "hellogoodingme",
"start_offset" : 0,
"end_offset" : 21,
"type" : "word",
"position" : 0
},
{
"token" : "23",
"start_offset" : 0,
"end_offset" : 21,
"type" : "word",
"position" : 0
},
{
"token" : "r",
"start_offset" : 0,
"end_offset" : 21,
"type" : "word",
"position" : 0
},
{
"token" : "4234",
"start_offset" : 0,
"end_offset" : 21,
"type" : "word",
"position" : 0
}
]
}
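
Since preserve_original defaults to true for this filter, dropping that option from the request above should additionally emit the whole input hellogoodingme23r4234 as a token alongside the four capture-group tokens; a sketch:
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [{
    "type" : "pattern_capture",
    "patterns" : ["([a-z]+)(\\d+)"]
  }],
  "text": ["hellogoodingme23r4234"]
}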