es - elasticsearch custom analyzers - built-in token filters - 11


There is no perfect program in this world, but that does not discourage us, because writing programs is a process of constantly pursuing perfection.

Custom analyzer (a minimal assembly sketch follows this list):

  1. Character filters:
    1. Purpose: add, remove, or transform characters
    2. Count: zero or more
    3. Built-in character filters:
      1. HTML Strip Character Filter: strips HTML tags
      2. Mapping Character Filter: replaces strings via a mapping
      3. Pattern Replace Character Filter: replaces via a regular expression
  2. Tokenizer:
    1. Purpose:
      1. splits text into tokens
      2. records each token's order and position (for phrase queries)
      3. records each token's start and end character offsets (for highlighting)
      4. records each token's type (for classification)
    2. Count: exactly one
    3. Categories:
      1. Word-oriented tokenizers:
        1. Standard
        2. Letter
        3. Lowercase
        4. Whitespace
        5. UAX URL Email
        6. Classic
        7. Thai
      2. Partial-word tokenizers:
        1. N-Gram
        2. Edge N-Gram
      3. Structured-text tokenizers:
        1. Keyword
        2. Pattern
        3. Simple Pattern
        4. Char Group
        5. Simple Pattern Split
        6. Path Hierarchy
  3. Token filters:
    1. Purpose: add, remove, or transform tokens
    2. Count: zero or more
    3. Built-in token filters:
      1. apostrophe
      2. asciifolding
      3. cjk bigram
      4. cjk width
      5. classic
      6. common grams
      7. conditional
      8. decimal digit
      9. delimited payload
      10. dictionary decompounder
      11. edge ngram
      12. elision
      13. fingerprint
      14. flatten_graph
      15. hunspell
      16. hyphenation decompounder
      17. keep types
      18. keep words
      19. keyword marker
      20. keyword repeat
      21. kstem
      22. length
      23. limit token count
      24. lowercase
      25. min_hash
      26. multiplexer
      27. ngram
      28. normalization
      29. pattern_capture
      30. pattern replace
      31. porter stem
      32. predicate script
      33. remove duplicates
      34. reverse
      35. shingle
      36. snowball
      37. stemmer
      38. stemmer override
      39. stop
      40. synonym
      41. synonym graph
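
To make the three-part assembly concrete, here is a minimal sketch that wires one character filter, one tokenizer, and two token filters into a custom analyzer. The index name my_index and the analyzer name my_analyzer are placeholders, not from the original series:

# A custom analyzer = 0..n character filters + exactly 1 tokenizer + 0..n token filters
PUT /my_index
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "my_analyzer" : {
          "type"        : "custom",
          "char_filter" : ["html_strip"],
          "tokenizer"   : "whitespace",
          "filter"      : ["lowercase", "stop"]
        }
      }
    }
  }
}

# Try it out: the HTML tags are stripped, tokens are lowercased, stop words removed
GET /my_index/_analyze
{
  "analyzer" : "my_analyzer",
  "text"     : ["<p>This IS a Test</p>"]
}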

Today we demonstrate filters 38 through 41.
Focus: stop and synonym.

# stemmer override token filter
# Purpose     : custom stem mappings
# Requirement : must be placed before any stemmer filters
# Options     :
#   1. rules      : override rules, inline
#   2. rules_path : path to a rules file

GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter"    : [{
    "type"  : "stemmer_override",
    "rules" : [
      "gooding, goodly => good",
      "hello => hi",
      "中国 => 中华人民共和国"
    ]
  }],
  "text" : ["hello gooding me 中国"]
}

# Result
{
  "tokens" : [
    {
      "token" : "hi",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "good",
      "start_offset" : 6,
      "end_offset" : 13,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "me",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "中华人民共和国",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "word",
      "position" : 3
    }
  ]
}
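
The "before any stemmer" requirement is easiest to see alongside a real stemmer: stemmer_override marks the tokens it rewrites as keywords, so a later stemmer skips them. A hedged sketch; the rule and the sample text are illustrative, not from the original:

GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter"    : [
    {
      "type"  : "stemmer_override",
      "rules" : ["running => run"]
    },
    {
      "type"     : "stemmer",
      "language" : "english"
    }
  ],
  "text" : ["running flies"]
}

# "running" is rewritten to "run" by the override and then left alone by the
# english stemmer, while "flies" is still stemmed by the normal algorithm.
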
# stop token filter
# Purpose : removes stop words
# Options :
#   1. stopwords       : a list of stop words, or a predefined language set such as _english_
#   2. stopwords_path  : path to a stop-word file
#   3. ignore_case     : match case-insensitively; defaults to false
#   4. remove_trailing : remove a trailing stop word; defaults to true, but set it to
#      false with the completion suggester so a partially typed final word can still match

GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter"    : [{
    "type"      : "stop",
    "stopwords" : ["this", "is", "a"]
  }],
  "text" : ["this is a good boy"]
}

# Result
{
  "tokens" : [
    {
      "token" : "good",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "boy",
      "start_offset" : 15,
      "end_offset" : 18,
      "type" : "word",
      "position" : 4
    }
  ]
}
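
Instead of an explicit list, stopwords also accepts a predefined language set. A small sketch combining the built-in _english_ set with ignore_case; the sample text is illustrative:

GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter"    : [{
    "type"        : "stop",
    "stopwords"   : "_english_",
    "ignore_case" : true
  }],
  "text" : ["This is a good boy"]
}

# With ignore_case set to true, the capitalized "This" is removed as well.
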
# synonym token filter
# Purpose : adds synonyms
# Options :
#   1. synonyms      : inline synonym list, in Solr or WordNet format
#   2. synonyms_path : path to a synonym file
# Recommendation : apply synonyms at search time, not at index time

GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter"    : [{
    "type"     : "synonym",
    "synonyms" : ["hello, hi => hell, he", "中国, 中国人, 我是中国人"]
  }],
  "text" : ["hello gooding me hi this is me 中国"]
}

# Result
{
  "tokens" : [
    {
      "token" : "hell",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "he",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "gooding",
      "start_offset" : 6,
      "end_offset" : 13,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "me",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "hell",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "SYNONYM",
      "position" : 3
    },
    {
      "token" : "he",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "SYNONYM",
      "position" : 3
    },
    {
      "token" : "this",
      "start_offset" : 20,
      "end_offset" : 24,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "is",
      "start_offset" : 25,
      "end_offset" : 27,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "me",
      "start_offset" : 28,
      "end_offset" : 30,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "中国",
      "start_offset" : 31,
      "end_offset" : 33,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "中国人",
      "start_offset" : 31,
      "end_offset" : 33,
      "type" : "SYNONYM",
      "position" : 7
    },
    {
      "token" : "我是中国人",
      "start_offset" : 31,
      "end_offset" : 33,
      "type" : "SYNONYM",
      "position" : 7
    }
  ]
}
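
The search-time recommendation above translates into a mapping where the field is indexed without synonyms but searched with them. A minimal sketch; my_index, my_synonyms, and my_search_analyzer are placeholder names:

PUT /my_index
{
  "settings" : {
    "analysis" : {
      "filter" : {
        "my_synonyms" : {
          "type"     : "synonym",
          "synonyms" : ["hello, hi"]
        }
      },
      "analyzer" : {
        "my_search_analyzer" : {
          "type"      : "custom",
          "tokenizer" : "whitespace",
          "filter"    : ["lowercase", "my_synonyms"]
        }
      }
    }
  },
  "mappings" : {
    "properties" : {
      "title" : {
        "type"            : "text",
        "analyzer"        : "standard",
        "search_analyzer" : "my_search_analyzer"
      }
    }
  }
}

# Because synonyms are applied only at query time, the synonym list can change
# without reindexing the documents that are already stored.
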
# synonym graph token filter
# Purpose : adds synonyms as a token graph; supported only in search analyzers
# Options :
#   1. synonyms      : inline synonym list, in Solr or WordNet format
#   2. synonyms_path : path to a synonym file

GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter"    : [{
    "type"      : "synonym_graph",
    "synonyms" : ["hello, hi => hell, he", "中国, 中国人, 我是中国人"]
  }],
  "text" : ["hello gooding me hi this is me 中国"]
}

# Result
{
  "tokens" : [
    {
      "token" : "hell",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "he",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "gooding",
      "start_offset" : 6,
      "end_offset" : 13,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "me",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "hell",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "SYNONYM",
      "position" : 3
    },
    {
      "token" : "he",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "SYNONYM",
      "position" : 3
    },
    {
      "token" : "this",
      "start_offset" : 20,
      "end_offset" : 24,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "is",
      "start_offset" : 25,
      "end_offset" : 27,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "me",
      "start_offset" : 28,
      "end_offset" : 30,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "中国人",
      "start_offset" : 31,
      "end_offset" : 33,
      "type" : "SYNONYM",
      "position" : 7
    },
    {
      "token" : "我是中国人",
      "start_offset" : 31,
      "end_offset" : 33,
      "type" : "SYNONYM",
      "position" : 7
    },
    {
      "token" : "中国",
      "start_offset" : 31,
      "end_offset" : 33,
      "type" : "word",
      "position" : 7
    }
  ]
}
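
The graph variant matters mainly for multi-word synonyms: it records a positionLength for each token so that phrase queries can match either spelling. A hedged sketch with a multi-word pair; "ny, new york" is illustrative, not from the original:

GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter"    : [{
    "type"     : "synonym_graph",
    "synonyms" : ["ny, new york"]
  }],
  "text" : ["ny pizza"]
}

# The plain synonym filter would flatten this graph and can break phrase matching
# on the multi-word form; synonym_graph keeps the graph intact, which is also why
# it is limited to search analyzers: the inverted index cannot store token graphs.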