es - elasticsearch custom analyzers - built-in token filters - 3


There is no perfect program in the world, but that does not discourage us, because writing programs is a process of constantly striving for perfection.

Custom analyzer (a minimal example that combines all three building blocks is sketched right after the list below):

  1. Character filters:
    1. Purpose: add, remove, or transform characters
    2. Count: zero or more
    3. Built-in character filters:
      1. HTML Strip Character Filter: strips HTML tags
      2. Mapping Character Filter: replaces characters by a mapping
      3. Pattern Replace Character Filter: replaces characters by a regular expression
  2. Tokenizer:
    1. Purpose:
      1. Split the text into tokens
      2. Record the order and position of each token (phrase queries)
      3. Record the start and end offsets of each token (highlighting)
      4. Record the type of each token (classification)
    2. Count: exactly one
    3. Categories:
      1. Word oriented:
        1. Standard
        2. Letter
        3. Lowercase
        4. Whitespace
        5. UAX URL Email
        6. Classic
        7. Thai
      2. Partial word:
        1. N-Gram
        2. Edge N-Gram
      3. Structured text:
        1. Keyword
        2. Pattern
        3. Simple Pattern
        4. Char Group
        5. Simple Pattern Split
        6. Path
  3. Token filters:
    1. Purpose: add, remove, or transform tokens
    2. Count: zero or more
    3. Categories:
      1. apostrophe
      2. asciifolding
      3. cjk bigram
      4. cjk width
      5. classic
      6. common grams
      7. conditional
      8. decimal digit
      9. delimited payload
      10. dictionary decompounder
      11. edge ngram
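
A minimal sketch of how the three parts fit together when a custom analyzer is defined on an index. The index name my-custom-index and the names my_mapping, my_edge_ngram, my_analyzer are made up for this example; html_strip, whitespace, and lowercase are built-ins from the list above.

# custom analyzer = 0+ character filters -> exactly 1 tokenizer -> 0+ token filters
PUT /my-custom-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": ["& => and"]
        }
      },
      "filter": {
        "my_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 3
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "my_mapping"],
          "tokenizer": "whitespace",
          "filter": ["lowercase", "my_edge_ngram"]
        }
      }
    }
  }
}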

Today's demo covers items 9, 10, and 11: delimited payload, dictionary decompounder, and edge ngram.

# delimited payload token filter
# Splits each token on a delimiter; the text after the delimiter is stored as the token's payload
# A payload is user-defined, position-related data attached to a token
# Options:
#   1. delimiter : the delimiter, defaults to |
#   2. encoding  : data type of the stored payload
#      1. float
#      2. identity : characters
#      3. int
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [{
    "type" : "delimited_payload",
    "delimiter" : "-",
    "encoding"  : "int"
  }],
  "text": ["h-1 hello-3 good-2 hello"]
}

# Result
{
  "tokens" : [
    {
      "token" : "h",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "hello",
      "start_offset" : 4,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "good",
      "start_offset" : 12,
      "end_offset" : 18,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "hello",
      "start_offset" : 19,
      "end_offset" : 24,
      "type" : "word",
      "position" : 3
    }
  ]
}
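
The _analyze call above only shows which tokens come out; the payloads themselves are kept in the index only when the field also stores term vectors with payloads. A minimal sketch, assuming a made-up index payload-demo and field title:

# register the filter in a custom analyzer and keep payloads via term vectors
PUT /payload-demo
{
  "settings": {
    "analysis": {
      "filter": {
        "my_payload": {
          "type": "delimited_payload",
          "delimiter": "-",
          "encoding": "int"
        }
      },
      "analyzer": {
        "payload_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["my_payload"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "payload_analyzer",
        "term_vector": "with_positions_payloads"
      }
    }
  }
}
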
# dictionary decompounder token filter
# Matches each token against a word list; every word from the list found inside the token is added as an extra token
# Options:
#   1. word_list
#   2. word_list_path
#   3. max_subword_size   : defaults to 15
#   4. min_subword_size   : defaults to 2
#   5. min_word_size      : defaults to 5
#   6. only_longest_match : defaults to false
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [{
    "type" : "dictionary_decompounder",
    "word_list" : ["hello", "good", "me", "中国", "啦啦啦"]
  }],
  "text": ["hellogoodmeddddd我是中国人"]
}

# Result
{
  "tokens" : [
    {
      "token" : "hellogoodmeddddd我是中国人",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "good",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "me",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "中国",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "word",
      "position" : 0
    }
  ]
}
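
When the dictionary gets large, the word list is usually kept in a file instead of inline. A minimal sketch, assuming a made-up index decompound-demo and a file analysis/word_list.txt (one word per line) under the Elasticsearch config directory:

# word_list_path is resolved relative to the config directory
PUT /decompound-demo
{
  "settings": {
    "analysis": {
      "filter": {
        "my_decompounder": {
          "type": "dictionary_decompounder",
          "word_list_path": "analysis/word_list.txt",
          "only_longest_match": true
        }
      },
      "analyzer": {
        "decompound_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["my_decompounder"]
        }
      }
    }
  }
}
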
# edge n-gram token filter
# Emits n-grams that start at the first character of each token
# Options:
#   1. min_gram
#   2. max_gram
#   3. preserve_original : also emit the original token, defaults to false
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [{
    "type" : "edge_ngram",
    "min_gram" : 1,
    "max_gram" : 3,
    "preserve_original" : false
  }],
  "text": ["hello good me 我是中国人"]
}

# Result
{
  "tokens" : [
    {
      "token" : "h",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "he",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "hel",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "g",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "go",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "goo",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "m",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "me",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "我",
      "start_offset" : 14,
      "end_offset" : 19,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "我是",
      "start_offset" : 14,
      "end_offset" : 19,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "我是中",
      "start_offset" : 14,
      "end_offset" : 19,
      "type" : "word",
      "position" : 3
    }
  ]
}
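
A typical use of the edge_ngram token filter is prefix search / autocomplete: apply it at index time so that prefixes get indexed, and use a plain analyzer at search time so the query itself is not split into n-grams. A minimal sketch with made-up index, analyzer, and field names:

# index with edge n-grams, search with the standard analyzer
PUT /autocomplete-demo
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "autocomplete_index": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete_index",
        "search_analyzer": "standard"
      }
    }
  }
}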