es - elasticsearch custom analyzer - built-in tokenizers - 2

No program in the world is perfect, but that does not discourage us, because writing programs is a continuous pursuit of perfection.

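Custom analyzer :

Character filters :
    1. Purpose : add, remove, or change characters before tokenization
    2. Count : zero or more
    3. Built-in character filters :
        1. HTML Strip Character filter : strips HTML tags
        2. Mapping Character filter : replaces strings according to a mapping
        3. Pattern Replace Character filter : replaces text matching a regular expression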
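Tokenizer :
    1. Purpose :
        1. Split the text into tokens
        2. Record the order and position of each token (used by phrase queries)
        3. Record the start and end character offsets of each token (used by highlighting)
        4. Record the type of each token (classification)
    2. Count : exactly one
    3. Categories :
        1. Word-oriented tokenizers :
            1. Standard
            2. Letter
            3. Lowercase
            4. Whitespace
            5. UAX URL Email
            6. Classic
            7. Thai
        2. Partial-word tokenizers :
            1. N-Gram
            2. Edge N-Gram
        3. Structured-text tokenizers :
            1. Keyword
            2. Pattern
            3. Simple Pattern
            4. Char Group
            5. Simple Pattern Split
            6. Path Hierarchy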
Token filters :
    1. Purpose : add, remove, or change tokens after tokenization
    2. Count : zero or more
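
To make the three-part structure concrete, here is a minimal sketch in Python that builds the request body for an index with one custom analyzer. The analyzer name `my_analyzer` and the helper `check_custom_analyzer` are hypothetical names made up for this example; `html_strip`, `standard`, and `lowercase` are built-in Elasticsearch components. The helper only encodes the count rules from the list above, and requiring lists for the two filter keys is this sketch's own convention.

```python
import json

# "my_analyzer" is a hypothetical name chosen for this illustration.
analyzer = {
    "type": "custom",
    "char_filter": ["html_strip"],  # zero or more character filters
    "tokenizer": "standard",        # exactly one tokenizer
    "filter": ["lowercase"],        # zero or more token filters
}

body = {"settings": {"analysis": {"analyzer": {"my_analyzer": analyzer}}}}


def check_custom_analyzer(definition: dict) -> None:
    """Check the count rules: one tokenizer, zero or more filters."""
    if not isinstance(definition.get("tokenizer"), str):
        raise ValueError("a custom analyzer needs exactly one tokenizer name")
    for key in ("char_filter", "filter"):
        # This sketch expects a list here; the list may be empty or absent.
        if not isinstance(definition.get(key, []), list):
            raise ValueError(f"{key} should be a list in this sketch")


check_custom_analyzer(analyzer)
# The printed JSON could be sent as the body of an index-creation request.
print(json.dumps(body, indent=2))
```

Text flows through the analyzer in that order: character filters first, then the single tokenizer, then the token filters.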

Today's demo covers the partial-word tokenizers (N-Gram and Edge N-Gram) :

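# ngram tokenizer
# Slides a window over the text and emits every run of consecutive characters of the configured lengths
# Settings :
#   1. min_gram           : minimum gram length
#   2. max_gram           : maximum gram length
#   3. token_chars        : character classes kept inside a token; the text is split on any character outside these classes
#   4. custom_token_chars : custom characters treated as part of a token; only takes effect when token_chars includes "custom"
# In the request below token_chars is ["letter"] only, so "-", "+" and the space all act as split points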
GET /_analyze
{
  "tokenizer": {
    "type" : "ngram",
    "min_gram" : 4,
    "max_gram" : 4,
    "token_chars" : ["letter"],
    "custom_token_chars" : ["+"]
  },
  "text": ["hello-go+odwewq "]
}

# Result
{
  "tokens" : [
    {
      "token" : "hell",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ello",
      "start_offset" : 1,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "odwe",
      "start_offset" : 9,
      "end_offset" : 13,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "dwew",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "wewq",
      "start_offset" : 11,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    }
  ]
}
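
The result above can be reproduced with a small pure-Python sketch. This is not how Elasticsearch implements the tokenizer; it only mimics the observable behavior for this request, assuming that the `letter` class is close enough to Python's `str.isalpha` and that grams are emitted by start position, then by length. The names `Token`, `letter_runs`, and `ngram_tokens` are made up for the example.

```python
from typing import Iterator, List, NamedTuple, Tuple


class Token(NamedTuple):
    token: str
    start_offset: int
    end_offset: int
    position: int


def letter_runs(text: str) -> Iterator[Tuple[int, str]]:
    """Yield (start index, run) for each maximal run of letters."""
    start = None
    for i, ch in enumerate(text):
        if ch.isalpha():
            if start is None:
                start = i
        elif start is not None:
            yield start, text[start:i]
            start = None
    if start is not None:
        yield start, text[start:]


def ngram_tokens(text: str, min_gram: int, max_gram: int) -> List[Token]:
    """Emit every gram of length min_gram..max_gram inside each letter run."""
    tokens: List[Token] = []
    for run_start, run in letter_runs(text):
        for i in range(len(run)):
            for n in range(min_gram, max_gram + 1):
                if i + n > len(run):
                    break  # longer grams would cross the end of the run
                tokens.append(
                    Token(run[i:i + n], run_start + i, run_start + i + n, len(tokens))
                )
    return tokens


for tok in ngram_tokens("hello-go+odwewq ", 4, 4):
    print(tok)
# Tokens, in order: hell, ello, odwe, dwew, wewq
```

The run "go" is shorter than min_gram, so it produces no token at all, which is why the response jumps from "ello" straight to "odwe".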

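# edge_ngram tokenizer
# Emits grams of increasing length, all anchored at the start of each word
# Settings :
#   1. min_gram           : minimum gram length
#   2. max_gram           : maximum gram length
#   3. token_chars        : character classes kept inside a token; the text is split on any character outside these classes
#   4. custom_token_chars : custom characters treated as part of a token; only takes effect when token_chars includes "custom"
# In the request below token_chars is ["letter"] only, so "+" still splits "go" from "od"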
GET /_analyze
{
  "tokenizer": {
    "type" : "edge_ngram",
    "min_gram" : 1,
    "max_gram" : 5,
    "token_chars" : ["letter"],
    "custom_token_chars" : ["+"]
  },
  "text": ["hello-go+od "]
}

# Result
{
  "tokens" : [
    {
      "token" : "h",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "he",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "hel",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "hell",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "g",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "go",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "o",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "od",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "word",
      "position" : 8
    }
  ]
}
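
The same result can be sketched in plain Python, again only as an approximation of the observable behavior and not Elasticsearch's implementation. The `extra_chars` parameter is a hypothetical stand-in for adding "custom" to token_chars together with custom_token_chars; `split_runs` and `edge_ngram_tokens` are names made up for the example, and `str.isalpha` is assumed to approximate the `letter` class.

```python
from typing import List, Tuple


def split_runs(text: str, extra_chars: str = "") -> List[Tuple[int, str]]:
    """Return (start index, run) for each run of letters or extra_chars."""
    runs: List[Tuple[int, str]] = []
    start = None
    for i, ch in enumerate(text):
        keep = ch.isalpha() or ch in extra_chars
        if keep and start is None:
            start = i
        elif not keep and start is not None:
            runs.append((start, text[start:i]))
            start = None
    if start is not None:
        runs.append((start, text[start:]))
    return runs


def edge_ngram_tokens(
    text: str, min_gram: int, max_gram: int, extra_chars: str = ""
) -> List[Tuple[str, int, int]]:
    """Emit prefixes of each run as (token, start_offset, end_offset)."""
    tokens: List[Tuple[str, int, int]] = []
    for start, run in split_runs(text, extra_chars):
        for n in range(min_gram, min(max_gram, len(run)) + 1):
            tokens.append((run[:n], start, start + n))
    return tokens


# Letters only, as in the request above.
print([t[0] for t in edge_ngram_tokens("hello-go+od ", 1, 5)])
# ['h', 'he', 'hel', 'hell', 'hello', 'g', 'go', 'o', 'od']

# Treating "+" as a token character keeps "go+od" together in this sketch.
print([t[0] for t in edge_ngram_tokens("hello-go+od ", 1, 5, extra_chars="+")])
# ['h', 'he', 'hel', 'hell', 'hello', 'g', 'go', 'go+', 'go+o', 'go+od']
```

The second call shows what the sketch does once "+" counts as a token character. In a real request that would presumably mean "token_chars" : ["letter", "custom"], which is worth verifying with GET /_analyze on your own cluster.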