Elasticsearch, a Powerful Big Data Tool: Search-as-You-Type with edge_ngram

The Elasticsearch version used in this article is 7.4.2.

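Preparing the data for partial matching ahead of time, before any search runs, improves search performance. N-grams are one tool for partial matching. Take the word road as an example. The n-grams it yields depend on the value of n: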

  • n=1: [r, o, a, d]
  • n=2: [ro, oa, ad]
  • n=3: [roa, oad]
  • n=4: [road]
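
The list above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name ngrams is made up for this example and is not part of Elasticsearch.

```python
def ngrams(word: str, n: int) -> list[str]:
    """Return every contiguous substring of length n, from left to right."""
    # A window of width n can start at indexes 0 .. len(word) - n.
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# Print the n-grams of "road" for n = 1 .. 4.
for n in range(1, 5):
    print(n, ngrams("road", n))
```

If n is larger than the word length, the sketch returns an empty list.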

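For search-as-you-type use cases we can use edge n-grams. Unlike plain n-grams, an edge n-gram is anchored at the left edge of the word and grows one character at a time until it reaches the end of the word. For example, the edge n-grams of road are [r, ro, roa, road].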

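These results appear in the same order in which a user types road, which is what makes search-as-you-type possible and lets us quickly offer candidates in a dropdown suggestion list.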
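
To make the "anchored at the left edge" idea concrete, here is a minimal Python sketch. The function name edge_ngrams is hypothetical. The parameters min_gram and max_gram are borrowed from the names of the edge_ngram filter settings, and the defaults of 1 and 20 are simply the values this article uses, not Elasticsearch's own defaults.

```python
def edge_ngrams(word: str, min_gram: int = 1, max_gram: int = 20) -> list[str]:
    """Return the prefixes of word whose length is between min_gram and max_gram."""
    longest = min(len(word), max_gram)
    return [word[:n] for n in range(min_gram, longest + 1)]


# Simulate a user typing "road" one character at a time:
# every intermediate input is itself one of the edge n-grams.
typed = ""
for ch in "road":
    typed += ch
    print(typed, typed in edge_ngrams("road"))
```

Each keystroke produces a string that is already in the precomputed list, so in this model a lookup only needs an exact term match rather than a scan over all words.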

To take advantage of edge n-grams, we can have the analyzer generate them, which we configure in the settings and mapping when creating the index.

  1. First, define an edge n-gram token filter;
  2. Then put that token filter into a custom analyzer;
  3. Finally, apply the custom analyzer to a field.

PUT /edge_ngrams_test_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "instant_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "instant_search_analyzer"
      }
    }
  }
}

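Above, we define an edge n-gram filter with a minimum length of 1 and a maximum length of 20, and apply it to the title field through the custom analyzer. We can use the _analyze API to inspect the tokens it produces.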

GET /edge_ngrams_test_index/_analyze
{
  "analyzer": "instant_search_analyzer",
  "text": "riding road"
}

The tokens produced with the edge n-gram filter:

{
  "tokens" : [
    {
      "token" : "r",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "ri",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "rid",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "ridi",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "ridin",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "riding",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "r",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "ro",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "roa",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "road",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
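
The token list above can be approximated in plain Python to see where the offsets and positions come from. This is a rough sketch under stated assumptions: the regular expression \w+ is only a stand-in for the standard tokenizer (which follows Unicode text segmentation rules), the function name analyze is made up for this example, and the type field is omitted.

```python
import re


def analyze(text: str, min_gram: int = 1, max_gram: int = 20) -> list[dict]:
    """Approximate: tokenize on word characters, lowercase, then emit edge n-grams."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        word = match.group().lower()
        for n in range(min_gram, min(len(word), max_gram) + 1):
            tokens.append({
                "token": word[:n],
                # Every n-gram keeps the offsets and position of the word it came from.
                "start_offset": match.start(),
                "end_offset": match.end(),
                "position": position,
            })
    return tokens


for tok in analyze("riding road"):
    print(tok)
```

Note that all n-grams derived from one word share that word's offsets and position, which matches the response shown above: the six tokens from riding sit at position 0 and the four tokens from road at position 1.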

Next, let's index a few example documents and run a search against them:

POST /edge_ngrams_test_index/_doc/1
{
  "title": "elasticsearch"
}

POST /edge_ngrams_test_index/_doc/2
{
  "title": "python web"
}

POST /edge_ngrams_test_index/_doc/3
{
  "title": "golang"
}

When the user types py, the following query is enough to suggest the keyword python:

GET /edge_ngrams_test_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "py"
      }
    }
  }
}

The response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.7312057,
    "hits" : [
      {
        "_index" : "edge_ngrams_test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.7312057,
        "_source" : {
          "title" : "python web"
        }
      }
    ]
  }
}

Why is this document returned?
Because when doc 2 is indexed, its title produces the following terms in the inverted index:

GET /edge_ngrams_test_index/_analyze
{
  "analyzer": "instant_search_analyzer",
  "text": "python web"
}

{
  "tokens" : [
    {
      "token" : "p",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "py",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "pyt",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "pyth",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "pytho",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "python",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "w",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "we",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "web",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

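Then, when the user searches for py, the query text goes through the same analyzer, so the search looks for documents whose title terms in the inverted index include any of [p, py] (the inverted index for doc 2's title contains both p and py):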

POST /edge_ngrams_test_index/_validate/query?explain
{
  "query": {
    "match": {
      "title": {
        "query": "py"
      }
    }
  }
}

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "edge_ngrams_test_index",
      "valid" : true,
      "explanation" : "Synonym(title:p title:py)"
    }
  ]
}
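
The whole flow (index the three titles, analyze the query with the same analyzer, match any shared term) can be modeled with a toy inverted index in Python. This is a simplified sketch, not how Elasticsearch is implemented: the names analyze, index, and search are hypothetical, \w+ stands in for the standard tokenizer, and scoring is ignored.

```python
import re
from collections import defaultdict


def analyze(text: str, min_gram: int = 1, max_gram: int = 20) -> list[str]:
    """Lowercase, split into words, and expand each word into its edge n-grams."""
    terms = []
    for word in re.findall(r"\w+", text.lower()):
        terms.extend(word[:n] for n in range(min_gram, min(len(word), max_gram) + 1))
    return terms


# The three documents indexed earlier in the article.
docs = {1: "elasticsearch", 2: "python web", 3: "golang"}

# Toy inverted index: term -> set of document ids containing that term.
index = defaultdict(set)
for doc_id, title in docs.items():
    for term in analyze(title):
        index[term].add(doc_id)


def search(query: str) -> list[int]:
    """Return ids of documents that share at least one analyzed term with the query."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return sorted(hits)


print(search("py"))  # [2]
print(search("go"))  # [3]
print(search("pz"))  # [2]
```

The last call shows a side effect of analyzing the query with the edge n-gram analyzer: in this model, pz also matches doc 2, because the query expands to [p, pz] and p alone is enough for a hit. Elasticsearch has a search_analyzer mapping parameter for using a different analyzer at query time. This article does not set it, so the same analyzer is used for both indexing and searching.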