The Elasticsearch examples in this article were written against version 7.4.2.
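Preparing data for partial matching at index time, before any search runs, improves search performance. An n-gram is one tool for partial matching: it is a contiguous run of n characters taken from a word. Take the word road as an example. Its n-grams depend on n:
- n=1: [r, o, a, d]
- n=2: [ro, oa, ad]
- n=3: [roa, oad]
- n=4: [road]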
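The list above can be reproduced with a sliding window over the word. This is only a minimal Python sketch to illustrate the idea; the function name `ngrams` is made up for this example and is not part of Elasticsearch.

```python
from typing import List


def ngrams(word: str, n: int) -> List[str]:
    """Return every contiguous substring of length n, from left to right."""
    # A word of length L has L - n + 1 windows of width n.
    return [word[i:i + n] for i in range(len(word) - n + 1)]


for n in range(1, 5):
    print(n, ngrams("road", n))
```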
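For search-as-you-type scenarios we can use edge n-grams. Unlike plain n-grams, edge n-grams are anchored to the left edge of the word and grow one character at a time until the end of the word. For example, the edge n-grams of road are [r, ro, roa, road].
These tokens appear in the same order in which a user types road, so whatever prefix has been typed so far matches an indexed token directly. That is what makes search-as-you-type possible and lets a drop-down list of suggestions be produced quickly.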
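As a rough illustration of the difference, edge n-grams are simply the prefixes of a word. The sketch below assumes the same meaning for `min_gram` and `max_gram` as the filter settings used later in this article; the function name `edge_ngrams` is hypothetical.

```python
from typing import List


def edge_ngrams(word: str, min_gram: int = 1, max_gram: int = 20) -> List[str]:
    """Return the prefixes of word whose length is between min_gram and max_gram."""
    longest = min(len(word), max_gram)
    return [word[:n] for n in range(min_gram, longest + 1)]


grams = edge_ngrams("road")
print(grams)

# Every prefix a user types on the way to "road" is one of the grams.
typed_so_far = ["r", "ro", "roa", "road"]
print(all(prefix in grams for prefix in typed_so_far))
```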
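To take advantage of edge n-grams, we can apply them during analysis by configuring the settings and mapping when the index is created:
- first, define an edge n-gram token filter;
- then, add this token filter to a custom analyzer;
- finally, apply the custom analyzer to a field.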
PUT /edge_ngrams_test_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "instant_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "instant_search_analyzer"
      }
    }
  }
}
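Above we define an edge n-gram filter with a minimum length of 1 and a maximum length of 20, and apply it to the title field through the custom analyzer. We can use the _analyze API to inspect the resulting tokens.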
GET /edge_ngrams_test_index/_analyze
{
  "analyzer": "instant_search_analyzer",
  "text": "riding road"
}
The tokens produced with the edge n-gram filter:
{
  "tokens" : [
    {
      "token" : "r",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "ri",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "rid",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "ridi",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "ridin",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "riding",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "r",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "ro",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "roa",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "road",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
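To make the shape of this output easier to follow, here is a small Python approximation of what instant_search_analyzer does to "riding road". It is a sketch under stated assumptions, not Elasticsearch's implementation: the regular expression `\w+` is a simplified stand-in for the standard tokenizer, the `type` field is left out, and the function name `analyze` is invented for the example.

```python
import re
from typing import Dict, List


def analyze(text: str, min_gram: int = 1, max_gram: int = 20) -> List[Dict]:
    """Approximate the custom analyzer: tokenize, lowercase, then edge n-gram."""
    tokens = []
    # Assumption: \w+ stands in for the standard tokenizer on simple ASCII text.
    for position, match in enumerate(re.finditer(r"\w+", text)):
        word = match.group().lower()
        for n in range(min_gram, min(len(word), max_gram) + 1):
            tokens.append({
                "token": word[:n],
                # Each gram keeps the offsets and position of the word it came from,
                # which is why all grams of "riding" report 0..6 and position 0.
                "start_offset": match.start(),
                "end_offset": match.end(),
                "position": position,
            })
    return tokens


for tok in analyze("riding road"):
    print(tok)
```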
Next, let's index a few documents to search against:
POST /edge_ngrams_test_index/_doc/1
{
  "title": "elasticsearch"
}
POST /edge_ngrams_test_index/_doc/2
{
  "title": "python web"
}
POST /edge_ngrams_test_index/_doc/3
{
  "title": "golang"
}
When the user types py, the following query is enough to suggest the document containing the keyword python:
GET /edge_ngrams_test_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "py"
      }
    }
  }
}
The response:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.7312057,
    "hits" : [
      {
        "_index" : "edge_ngrams_test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.7312057,
        "_source" : {
          "title" : "python web"
        }
      }
    ]
  }
}
Why is this document returned?
Because when doc 2 is written to the inverted index, its title is analyzed into the following tokens:
GET /edge_ngrams_test_index/_analyze
{
  "analyzer": "instant_search_analyzer",
  "text": "python web"
}
{
  "tokens" : [
    {
      "token" : "p",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "py",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "pyt",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "pyth",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "pytho",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "python",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "w",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "we",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "web",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
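Then, when the user searches for py, the query string is analyzed with the same analyzer as the field, so py itself becomes the terms [p, py]. The search therefore looks for documents whose title terms in the inverted index include p or py, and doc 2's title terms include both: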
POST /edge_ngrams_test_index/_validate/query?explain
{
  "query": {
    "match": {
      "title": {
        "query": "py"
      }
    }
  }
}
{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "edge_ngrams_test_index",
      "valid" : true,
      "explanation" : "Synonym(title:p title:py)"
    }
  ]
}
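The lookup described by Synonym(title:p title:py) can be imitated with a toy inverted index. The Python below is only a sketch under simplifying assumptions: `\w+` stands in for the standard tokenizer, scoring is ignored, a document matches if it contains any of the query terms, and the names `analyze`, `index` and `search` are invented for the example.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def analyze(text: str, min_gram: int = 1, max_gram: int = 20) -> List[str]:
    """Lowercase, split into words, and expand every word into its edge n-grams."""
    terms: List[str] = []
    for word in re.findall(r"\w+", text.lower()):
        terms.extend(word[:n] for n in range(min_gram, min(len(word), max_gram) + 1))
    return terms


# The three documents indexed earlier in this article.
docs = {"1": "elasticsearch", "2": "python web", "3": "golang"}

# Toy inverted index: term -> ids of the documents whose title produced that term.
index: Dict[str, Set[str]] = defaultdict(set)
for doc_id, title in docs.items():
    for term in analyze(title):
        index[term].add(doc_id)


def search(query: str) -> List[str]:
    """Analyze the query with the same analyzer and match any resulting term."""
    matched: Set[str] = set()
    for term in analyze(query):
        matched |= index.get(term, set())
    return sorted(matched)


print(analyze("py"))  # ['p', 'py']
print(search("py"))   # ['2']
print(search("pe"))   # ['2']
```

The last call shows a side effect of analyzing the query with the same analyzer: pe is split into p and pe, so it still reaches doc 2 through the shared term p.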