There is no such thing as a perfect program, but that is no reason to be discouraged: writing programs is a continual pursuit of perfection.
Custom analyzer (the three building blocks are combined in the sketch after this outline) :
- Character filters :
  1. Purpose : add, remove, or transform characters
  2. Count : zero or more
  3. Built-in character filters :
     1. HTML Strip Character Filter : strips HTML tags
     2. Mapping Character Filter : replaces characters according to a mapping
     3. Pattern Replace Character Filter : replaces characters matching a regular expression
- Tokenizer :
  1. Purpose :
     1. Split the text into individual tokens
     2. Record the order and position of each token (phrase queries)
     3. Record the start and end character offsets of each token (highlighting)
     4. Record the token type (classification)
  2. Count : exactly one
  3. Categories :
     1. Word-oriented tokenizers :
        1. Standard
        2. Letter
        3. Lowercase
        4. Whitespace
        5. UAX URL Email
        6. Classic
        7. Thai
     2. Partial-word tokenizers :
        1. N-Gram
        2. Edge N-Gram
     3. Structured-text tokenizers :
        1. Keyword
        2. Pattern
        3. Simple Pattern
        4. Char Group
        5. Simple Pattern Split
        6. Path
- Token filters :
  1. Purpose : add, remove, or transform tokens
  2. Count : zero or more
  3. Types :
     1. apostrophe
     2. asciifolding
     3. cjk bigram
     4. cjk width
     5. classic
     6. common grams
     7. conditional
     8. decimal digit
     9. delimited payload
     10. dictionary decompounder
     11. edge ngram
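# A minimal sketch of how the three building blocks combine into one custom analyzer.
# The index name my_custom_analyzer_demo, the analyzer name my_analyzer, and the
# and_mapping / short_edge_ngram names are made up for this example; html_strip,
# mapping, whitespace, lowercase, and edge_ngram are the built-ins listed above.
PUT /my_custom_analyzer_demo
{
  "settings": {
    "analysis": {
      "char_filter": {
        "and_mapping": {
          "type": "mapping",
          "mappings": ["& => and"]
        }
      },
      "filter": {
        "short_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 3
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "and_mapping"],
          "tokenizer": "whitespace",
          "filter": ["lowercase", "short_edge_ngram"]
        }
      }
    }
  }
}
# Once the index exists, the analyzer can be checked the same way as the demos below,
# using GET /my_custom_analyzer_demo/_analyze with "analyzer": "my_analyzer" and a "text" value.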
Today's demo : items 9, 10, and 11 (delimited payload, dictionary decompounder, edge ngram)
# delimited payload token filter
# Splits each token on a delimiter and stores the text after the delimiter as the token's payload
# A payload is user-defined, position-related data attached to a token
# Options :
# 1. delimiter : the delimiter character, defaults to |
# 2. encoding : the data type of the stored payload
#    1. float
#    2. identity : characters
#    3. int
GET /_analyze
{
"tokenizer": "whitespace",
"filter": {
"type" : "delimited_payload",
"delimiter" : "-",
"encoding" : "int"
},
"text": ["h-1 hello-3 good-2 hello"]
}
# Result
{
"tokens" : [
{
"token" : "h",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "hello",
"start_offset" : 4,
"end_offset" : 11,
"type" : "word",
"position" : 1
},
{
"token" : "good",
"start_offset" : 12,
"end_offset" : 18,
"type" : "word",
"position" : 2
},
{
"token" : "hello",
"start_offset" : 19,
"end_offset" : 24,
"type" : "word",
"position" : 3
}
]
}
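# Note : the _analyze result above only shows the stripped tokens; the payloads themselves
# are not printed. To keep the payloads, the filter is usually registered in the index
# settings and the field indexed with term vectors. A sketch (the payload_demo index,
# content field, and dash_payload filter names are made up for this example) :
PUT /payload_demo
{
  "settings": {
    "analysis": {
      "filter": {
        "dash_payload": {
          "type": "delimited_payload",
          "delimiter": "-",
          "encoding": "int"
        }
      },
      "analyzer": {
        "payload_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["dash_payload"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "payload_analyzer",
        "term_vector": "with_positions_payloads"
      }
    }
  }
}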
# dictionary decompounder token filter
# Checks each token against a word list; every word from the list found inside the token is added as an extra token
# Options :
# 1. word_list : the subword dictionary, given inline
# 2. word_list_path : path to a subword dictionary file (absolute or relative to the config directory)
# 3. max_subword_size : defaults to 15
# 4. min_subword_size : defaults to 2
# 5. min_word_size : defaults to 5
# 6. only_longest_match : defaults to false
GET /_analyze
{
"tokenizer": "keyword",
"filter": [{
"type" : "dictionary_decompounder",
"word_list" : ["hello", "good", "me", "中国", "啦啦啦"]
}],
"text": ["hellogoodmeddddd我是中国人"]
}
# Result
{
"tokens" : [
{
"token" : "hellogoodmeddddd我是中国人",
"start_offset" : 0,
"end_offset" : 21,
"type" : "word",
"position" : 0
},
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 21,
"type" : "word",
"position" : 0
},
{
"token" : "good",
"start_offset" : 0,
"end_offset" : 21,
"type" : "word",
"position" : 0
},
{
"token" : "me",
"start_offset" : 0,
"end_offset" : 21,
"type" : "word",
"position" : 0
},
{
"token" : "中国",
"start_offset" : 0,
"end_offset" : 21,
"type" : "word",
"position" : 0
}
]
}
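# For a larger dictionary the word list usually lives in a file instead of the request body.
# A sketch (the decompound_demo index, my_decompounder filter, and the file path
# analysis/word_list.txt are made up for this example; the path is resolved against the
# Elasticsearch config directory) :
PUT /decompound_demo
{
  "settings": {
    "analysis": {
      "filter": {
        "my_decompounder": {
          "type": "dictionary_decompounder",
          "word_list_path": "analysis/word_list.txt",
          "only_longest_match": true
        }
      },
      "analyzer": {
        "decompound_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_decompounder"]
        }
      }
    }
  }
}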
# edge n-gram token filter
# Emits n-grams that all start at the first character of each token
# Options :
# 1. min_gram
# 2. max_gram
# 3. preserve_original : also emit the original token, defaults to false
GET /_analyze
{
"tokenizer": "whitespace",
"filter": [{
"type" : "edge_ngram",
"min_gram" : 1,
"max_gram" : 3,
"preserve_original" : false
}],
"text": ["hello good me 我是中国人"]
}
# Result
{
"tokens" : [
{
"token" : "h",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "he",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "hel",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "g",
"start_offset" : 6,
"end_offset" : 10,
"type" : "word",
"position" : 1
},
{
"token" : "go",
"start_offset" : 6,
"end_offset" : 10,
"type" : "word",
"position" : 1
},
{
"token" : "goo",
"start_offset" : 6,
"end_offset" : 10,
"type" : "word",
"position" : 1
},
{
"token" : "m",
"start_offset" : 11,
"end_offset" : 13,
"type" : "word",
"position" : 2
},
{
"token" : "me",
"start_offset" : 11,
"end_offset" : 13,
"type" : "word",
"position" : 2
},
{
"token" : "我",
"start_offset" : 14,
"end_offset" : 19,
"type" : "word",
"position" : 3
},
{
"token" : "我是",
"start_offset" : 14,
"end_offset" : 19,
"type" : "word",
"position" : 3
},
{
"token" : "我是中",
"start_offset" : 14,
"end_offset" : 19,
"type" : "word",
"position" : 3
}
]
}
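# A typical use of the edge n-gram filter is search-as-you-type : the edge n-grams are
# produced only at index time, while the query is analyzed as whole words. A sketch
# (the autocomplete_demo index, autocomplete_index analyzer, and title field are made up
# for this example) :
PUT /autocomplete_demo
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_edge": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 3
        }
      },
      "analyzer": {
        "autocomplete_index": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "autocomplete_edge"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete_index",
        "search_analyzer": "standard"
      }
    }
  }
}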