es - elasticsearch custom analyzers - built-in token filters - 3


There is no perfect program in the world, but that does not discourage us, because writing programs is a process of constantly striving for perfection.

Custom analyzer (a minimal example that combines all three building blocks is sketched right after the list below):

  1. Character filters:
    1. Purpose: add, remove, or transform characters
    2. Count: zero or more
    3. Built-in character filters:
      1. HTML Strip Character Filter: strips HTML tags
      2. Mapping Character Filter: replaces characters by a mapping
      3. Pattern Replace Character Filter: replaces characters by a regular expression
  2. Tokenizer:
    1. Purpose:
      1. Split the text into tokens
      2. Record the order and position of each token (phrase queries)
      3. Record the start and end offsets of each token (highlighting)
      4. Record the type of each token (classification)
    2. Count: exactly one
    3. Categories:
      1. Word oriented:
        1. Standard
        2. Letter
        3. Lowercase
        4. Whitespace
        5. UAX URL Email
        6. Classic
        7. Thai
      2. Partial word:
        1. N-Gram
        2. Edge N-Gram
      3. Structured text:
        1. Keyword
        2. Pattern
        3. Simple Pattern
        4. Char Group
        5. Simple Pattern Split
        6. Path
  3. Token filters:
    1. Purpose: add, remove, or transform tokens
    2. Count: zero or more
    3. Categories:
      1. apostrophe
      2. asciifolding
      3. cjk bigram
      4. cjk width
      5. classic
      6. common grams
      7. conditional
      8. decimal digit
      9. delimited payload
      10. dictionary decompounder
      11. edge ngram
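
A minimal sketch of how the three parts fit together when a custom analyzer is defined on an index. The index name my-custom-index and the names my_mapping, my_edge_ngram, my_analyzer are made up for this example; html_strip, whitespace, and lowercase are built-ins from the list above.

# custom analyzer = 0+ character filters -> exactly 1 tokenizer -> 0+ token filters
PUT /my-custom-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": ["& => and"]
        }
      },
      "filter": {
        "my_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 3
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "my_mapping"],
          "tokenizer": "whitespace",
          "filter": ["lowercase", "my_edge_ngram"]
        }
      }
    }
  }
}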

Today's demo covers items 9, 10, and 11: delimited payload, dictionary decompounder, and edge ngram.

# delimited payload token filter
# Splits each token on a delimiter; the text after the delimiter is stored as the token's payload
# A payload is user-defined, position-related data attached to a token
# Options:
#   1. delimiter : the delimiter, defaults to |
#   2. encoding  : data type of the stored payload
#      1. float
#      2. identity : characters
#      3. int
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [{
    "type" : "delimited_payload",
    "delimiter" : "-",
    "encoding"  : "int"
  }],
  "text": ["h-1 hello-3 good-2 hello"]
}

# Result
{
  "tokens" : [
    {
      "token" : "h",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "hello",
      "start_offset" : 4,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "good",
      "start_offset" : 12,
      "end_offset" : 18,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "hello",
      "start_offset" : 19,
      "end_offset" : 24,
      "type" : "word",
      "position" : 3
    }
  ]
}
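
The _analyze call above only shows which tokens come out; the payloads themselves are kept in the index only when the field also stores term vectors with payloads. A minimal sketch, assuming a made-up index payload-demo and field title:

# register the filter in a custom analyzer and keep payloads via term vectors
PUT /payload-demo
{
  "settings": {
    "analysis": {
      "filter": {
        "my_payload": {
          "type": "delimited_payload",
          "delimiter": "-",
          "encoding": "int"
        }
      },
      "analyzer": {
        "payload_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["my_payload"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "payload_analyzer",
        "term_vector": "with_positions_payloads"
      }
    }
  }
}
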
# dictionary decompounder token filter
# Matches each token against a word list; every word from the list found inside the token is added as an extra token
# Options:
#   1. word_list
#   2. word_list_path
#   3. max_subword_size   : defaults to 15
#   4. min_subword_size   : defaults to 2
#   5. min_word_size      : defaults to 5
#   6. only_longest_match : defaults to false
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [{
    "type" : "dictionary_decompounder",
    "word_list" : ["hello", "good", "me", "中国", "啦啦啦"]
  }],
  "text": ["hellogoodmeddddd我是中国人"]
}

# Result
{
  "tokens" : [
    {
      "token" : "hellogoodmeddddd我是中国人",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "good",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "me",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "中国",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "word",
      "position" : 0
    }
  ]
}
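
When the dictionary gets large, the word list is usually kept in a file instead of inline. A minimal sketch, assuming a made-up index decompound-demo and a file analysis/word_list.txt (one word per line) under the Elasticsearch config directory:

# word_list_path is resolved relative to the config directory
PUT /decompound-demo
{
  "settings": {
    "analysis": {
      "filter": {
        "my_decompounder": {
          "type": "dictionary_decompounder",
          "word_list_path": "analysis/word_list.txt",
          "only_longest_match": true
        }
      },
      "analyzer": {
        "decompound_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["my_decompounder"]
        }
      }
    }
  }
}
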
# edge n-gram token filter
# Emits n-grams that start at the first character of each token
# Options:
#   1. min_gram
#   2. max_gram
#   3. preserve_original : also emit the original token, defaults to false
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [{
    "type" : "edge_ngram",
    "min_gram" : 1,
    "max_gram" : 3,
    "preserve_original" : false
  }],
  "text": ["hello good me 我是中国人"]
}

# Result
{
  "tokens" : [
    {
      "token" : "h",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "he",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "hel",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "g",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "go",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "goo",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "m",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "me",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "我",
      "start_offset" : 14,
      "end_offset" : 19,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "我是",
      "start_offset" : 14,
      "end_offset" : 19,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "我是中",
      "start_offset" : 14,
      "end_offset" : 19,
      "type" : "word",
      "position" : 3
    }
  ]
}
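
A typical use of the edge_ngram token filter is prefix search / autocomplete: apply it at index time so that prefixes get indexed, and use a plain analyzer at search time so the query itself is not split into n-grams. A minimal sketch with made-up index, analyzer, and field names:

# index with edge n-grams, search with the standard analyzer
PUT /autocomplete-demo
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "autocomplete_index": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete_index",
        "search_analyzer": "standard"
      }
    }
  }
}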