Elasticsearch Advanced Notes, Part 12

Elasticsearch Expert Series (23)

Search techniques in depth: `search-time` search suggestions with match_phrase_prefix

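Search suggestions

Search suggestions, also known as search-as-you-type or search hints, mean that matching entries are offered while the user is still typing, before the query has been completed.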

hello w --> searched against these titles:
- hello world
- hello we
- hello win
- hello wind
- hello dog
- hello cat
hello w --> suggestions returned:
- hello world
- hello we
- hello win
- hello wind
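The filtering shown above can be sketched in a few lines of Python. This is only an illustration of the expected behaviour, not how Elasticsearch implements it; the function name `suggest` is hypothetical.

```python
def suggest(titles, typed):
    """Return the titles that start with what the user has typed so far."""
    typed = typed.lower()
    return [t for t in titles if t.lower().startswith(typed)]


titles = ["hello world", "hello we", "hello win", "hello wind", "hello dog", "hello cat"]

# "hello dog" and "hello cat" are dropped because they do not continue "hello w"
print(suggest(titles, "hello w"))
# ['hello world', 'hello we', 'hello win', 'hello wind']
```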

What search suggestions look like

Baidu --> elas --> elasticsearch --> elasticsearch权威指南 (Elasticsearch: The Definitive Guide)

Loading test data

 PUT /waws_index/waws_type/1
 {
   "title":"hello world"
 }
 
 PUT /waws_index/waws_type/2
 {
   "title":"hello wind"
 }
 
 PUT /waws_index/waws_type/3
 {
   "title":"hello dark"
 }
 
 PUT /waws_index/waws_type/4
 {
   "title":"hello pig"
 }
 
 PUT /waws_index/waws_type/5
 {
   "title":"hello www.baidu.com"
 }

Running the suggestion query

 GET /waws_index/waws_type/_search 
 {
   "query": {
     "match_phrase_prefix": {
       "title": "hello w"
     }
   }
 }
 
 {
   "took": 7,
   "timed_out": false,
   "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
   },
   "hits": {
     "total": 3,
     "max_score": 0.7854939,
     "hits": [
       {
         "_index": "waws_index",
         "_type": "waws_type",
         "_id": "2",
         "_score": 0.7854939,
         "_source": {
           "title": "hello wind"
         }
       },
       {
         "_index": "waws_index",
         "_type": "waws_type",
         "_id": "5",
         "_score": 0.51623213,
         "_source": {
           "title": "hello www.baidu.com"
         }
       },
       {
         "_index": "waws_index",
         "_type": "waws_type",
         "_id": "1",
         "_score": 0.51623213,
         "_source": {
           "title": "hello world"
         }
       }
     ]
   }
 }

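The principle is similar to match_phrase; the only difference is that the last term is treated as a prefix (this is the key point).

hello is matched as an ordinary term to find the docs that contain it.

w is treated as a prefix: the terms of the inverted index are scanned for those that begin with w, and the docs containing those terms are collected.

From these, only the docs that contain both hello and a term beginning with w are kept.

Finally slop is taken into account: a doc matches only if, within the slop range, the positions of hello and of the word beginning with w line up with the phrase hello w.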
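The steps above can be simulated with a small positional inverted index. This is a simplified sketch, not Elasticsearch's implementation: a whitespace split stands in for the standard analyzer, slop is fixed at 0, and the helper names `build_index` and `match_phrase_prefix` are hypothetical.

```python
from collections import defaultdict

docs = {
    1: "hello world",
    2: "hello wind",
    3: "hello dark",
    4: "hello pig",
    5: "hello www.baidu.com",
}


def build_index(docs):
    """Map term -> {doc_id: [positions]} (whitespace split as a stand-in analyzer)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def match_phrase_prefix(index, query):
    *exact_terms, prefix = query.lower().split()
    # Step 1: expand the last term by scanning the term dictionary for the prefix
    expansions = [t for t in sorted(index) if t.startswith(prefix)]
    hits = set()
    for last in expansions:
        phrase = exact_terms + [last]
        postings = [index.get(t, {}) for t in phrase]
        # Step 2: candidate docs must contain every term of the phrase
        candidates = set(postings[0]).intersection(*postings[1:])
        for doc_id in candidates:
            # Step 3 (slop 0): term i must sit exactly i positions after the first term
            if any(
                all(start + i in postings[i][doc_id] for i in range(len(phrase)))
                for start in postings[0][doc_id]
            ):
                hits.add(doc_id)
    return sorted(hits)


print(match_phrase_prefix(build_index(docs), "hello w"))
# [1, 2, 5]
```

The scan in step 1 is the expensive part: it has to walk the term dictionary at query time, which is why this approach is described as search-time suggestion.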

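slop can be specified as well, but only the last term is ever treated as a prefix.

max_expansions: the maximum number of terms the prefix may expand to; once that many terms have been found, expansion stops, which bounds the cost.
- Without a limit, the prefix would have to be checked against every term in the inverted index to find the words beginning with w, which performs poorly. max_expansions caps how many terms the w prefix may match (the default is 50); once the cap is reached, the inverted index is not searched any further.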
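A rough sketch of what the cap does, followed by the shape of a request body that sets slop and max_expansions on match_phrase_prefix. The helper `expand_prefix` and the sample term list are hypothetical; the values 10 and 1 in the request body are arbitrary example settings, not recommendations.

```python
def expand_prefix(terms, prefix, max_expansions=50):
    """Collect terms starting with prefix, stopping once max_expansions are found."""
    matched = []
    for term in sorted(terms):
        if term.startswith(prefix):
            matched.append(term)
            if len(matched) >= max_expansions:
                break  # stop scanning the term dictionary early
    return matched


terms = ["we", "win", "wind", "world", "www.baidu.com", "hello", "dog"]
print(expand_prefix(terms, "w", max_expansions=2))
# ['we', 'win']

# Request body for GET /waws_index/waws_type/_search with both options set
query = {
    "query": {
        "match_phrase_prefix": {
            "title": {
                "query": "hello w",
                "slop": 1,
                "max_expansions": 10,
            }
        }
    }
}
```

Note that a small cap trades completeness for speed: terms beyond the cap (here wind, world and www.baidu.com) are never considered, so docs containing only those terms would be missed.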

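Avoid this query where possible: the final prefix always has to scan a large number of terms in the index, so performance can be very poor.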

Elasticsearch Expert Series (24)

Search techniques in depth: index-time search suggestions with ngram analysis

How ngram and index-time search suggestions work

ngram

The ngrams of quick at each of its five possible lengths:

ngram length=1: q u i c k

ngram length=2: qu ui ic ck

ngram length=3: qui uic ick

ngram length=4: quic uick

ngram length=5: quick
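The table above is just a sliding window of fixed width over the word. A minimal sketch (the function name `ngrams` is hypothetical):

```python
def ngrams(word, n):
    """All substrings of length n, taken with a sliding window."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


for n in range(1, 6):
    print(n, ngrams("quick", n))
# 1 ['q', 'u', 'i', 'c', 'k']
# 2 ['qu', 'ui', 'ic', 'ck']
# 3 ['qui', 'uic', 'ick']
# 4 ['quic', 'uick']
# 5 ['quick']
```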

edge ngram

For quick, the ngrams are anchored at the first letter, so every one of them is a prefix of the word:

q

qu

qui

quic

quick
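Edge ngrams are simply the prefixes of a word, optionally bounded by a minimum and maximum length. A minimal sketch; the function name `edge_ngrams` is hypothetical, and its `min_gram` / `max_gram` parameters are only modelled on the token filter settings of the same name.

```python
def edge_ngrams(word, min_gram=1, max_gram=None):
    """Prefixes of word whose length lies between min_gram and max_gram."""
    longest = len(word) if max_gram is None else min(len(word), max_gram)
    return [word[:n] for n in range(min_gram, longest + 1)]


print(edge_ngrams("quick"))
# ['q', 'qu', 'qui', 'quic', 'quick']
print(edge_ngrams("quick", max_gram=3))
# ['q', 'qu', 'qui']
```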

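With edge ngram, every word is split further into its prefixes at index time, and these ngrams are what make prefix-based search suggestions possible.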

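For example, with doc1 = hello world and doc2 = hello we, the inverted index contains:

 h        doc1, doc2
 he       doc1, doc2
 hel      doc1, doc2
 hell     doc1, doc2
 hello    doc1, doc2
 w        doc1, doc2
 wo       doc1
 wor      doc1
 worl     doc1
 world    doc1
 we       doc2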

The ngram lengths can be bounded. For helloworld with min ngram = 1 and max ngram = 3, only these are produced:

h

he

hel

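Searching for hello w:

hello --> matches the term hello, doc1

w --> matches the term w, doc1

doc1 contains both hello and w, and their positions match as well, so doc1 (hello world) is returned; doc2 (hello we) matches in the same way.

At search time there is no longer any need to take a prefix and scan the whole inverted index. The typed prefix is simply looked up in the inverted index like any other term, and if it is found, the doc matches. This is a plain match, that is, ordinary full-text search.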
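The index-time approach can be simulated as follows. This is a simplified sketch under stated assumptions: a whitespace split stands in for the standard tokenizer, the helper names are hypothetical, and doc3 (hello dog) is an extra document added here only to show a non-match.

```python
from collections import defaultdict


def edge_ngrams(word, min_gram=1, max_gram=20):
    return [word[:n] for n in range(min_gram, min(len(word), max_gram) + 1)]


def build_index(docs):
    """Index every edge ngram of every word at that word's position."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            for gram in edge_ngrams(word):
                index[gram][doc_id].add(pos)
    return index


def phrase_search(index, query):
    """Exact term lookups only: no prefix scan over the term dictionary."""
    terms = query.lower().split()
    postings = [index.get(t, {}) for t in terms]
    hits = []
    for doc_id in set(postings[0]).intersection(*postings[1:]):
        # each query term must appear at consecutive positions
        if any(
            all(start + i in postings[i][doc_id] for i in range(len(terms)))
            for start in postings[0][doc_id]
        ):
            hits.append(doc_id)
    return sorted(hits)


docs = {"doc1": "hello world", "doc2": "hello we", "doc3": "hello dog"}
print(phrase_search(build_index(docs), "hello w"))
# ['doc1', 'doc2']
```

The cost has moved from query time to index time: the index is larger because every prefix is stored, but each query term is a single dictionary lookup.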

Trying out ngram

 PUT /waws_index
 {
     "settings": {
         "analysis": {
             "filter": {
                 "autocomplete_filter": { 
                     "type":"edge_ngram",
                     "min_gram": 1,
                     "max_gram": 20
                 }
             },
             "analyzer": {
                 "autocomplete": {
                     "type":"custom",
                     "tokenizer": "standard",
                     "filter": [
                         "lowercase",
                         "autocomplete_filter" 
                     ]
                 }
             }
         }
     }
 }

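Adding data (the mapping for title given later in this section has to be in place before these documents are indexed; otherwise title is mapped dynamically and does not use the autocomplete analyzer)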

 PUT /waws_index/waws_type/1
 {
  "title":"hello world"
 }
 
 PUT /waws_index/waws_type/2
 {
  "title":"hello win98"
 }
 
 PUT /waws_index/waws_type/3
 {
  "title":"hello pig"
 }
 
 PUT /waws_index/waws_type/4
 {
  "title":"hello dog"
 }
 
 PUT /waws_index/waws_type/5
 {
  "title":"hello wink"
 }
 
 PUT /waws_index/waws_type/6
 {
  "title":"hello www.baidu.com"
 }

Testing the analyzer

 GET /waws_index/_analyze
 {
   "analyzer": "autocomplete",
   "text": "hello world"
 }

 {
   "tokens": [
     {
       "token": "h",
       "start_offset": 0,
       "end_offset": 5,
       "type": "word",
       "position": 0
     },
     {
       "token": "he",
       "start_offset": 0,
       "end_offset": 5,
       "type": "word",
       "position": 0
     },
     {
       "token": "hel",
       "start_offset": 0,
       "end_offset": 5,
       "type": "word",
       "position": 0
     },
     {
       "token": "hell",
       "start_offset": 0,
       "end_offset": 5,
       "type": "word",
       "position": 0
     },
     {
       "token": "hello",
       "start_offset": 0,
       "end_offset": 5,
       "type": "word",
       "position": 0
     },
     {
       "token": "w",
       "start_offset": 6,
       "end_offset": 11,
       "type": "word",
       "position": 1
     },
     {
       "token": "wo",
       "start_offset": 6,
       "end_offset": 11,
       "type": "word",
       "position": 1
     },
     {
       "token": "wor",
       "start_offset": 6,
       "end_offset": 11,
       "type": "word",
       "position": 1
     },
     {
       "token": "worl",
       "start_offset": 6,
       "end_offset": 11,
       "type": "word",
       "position": 1
     },
     {
       "token": "world",
       "start_offset": 6,
       "end_offset": 11,
       "type": "word",
       "position": 1
     }
   ]
 }

Defining the mapping

 # At search time the standard analyzer is used; at index time the field is analyzed into edge ngrams (the mapping must be defined while the index holds no data)
 PUT /waws_index/_mapping/waws_type
 {
   "properties": {
       "title": {
           "type":"string",
           "analyzer": "autocomplete",
           "search_analyzer": "standard"
       }
   }
 }

Searching the data

 GET /waws_index/waws_type/_search 
 {
   "query": {
     "match_phrase": {
       "title": "hello w"
     }
   }
 }
 
 {
   "took": 1,
   "timed_out": false,
   "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
   },
   "hits": {
     "total": 4,
     "max_score": 0.8899311,
     "hits": [
       {
         "_index": "waws_index",
         "_type": "waws_type",
         "_id": "2",
         "_score": 0.8899311,
         "_source": {
           "title": "hello win98"
         }
       },
       {
         "_index": "waws_index",
         "_type": "waws_type",
         "_id": "6",
         "_score": 0.8899311,
         "_source": {
           "title": "hello www.baidu.com"
         }
       },
       {
         "_index": "waws_index",
         "_type": "waws_type",
         "_id": "1",
         "_score": 0.8271048,
         "_source": {
           "title": "hello world"
         }
       },
       {
         "_index": "waws_index",
         "_type": "waws_type",
         "_id": "5",
         "_score": 0.8134969,
         "_source": {
           "title": "hello wink"
         }
       }
     ]
   }
 }

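With match, docs that contain only hello are returned too, because it is an ordinary full-text search; they just get a lower score.
match_phrase is recommended instead: it requires every term to be present and their positions to be exactly one apart, which is the behaviour we want.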
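The difference between the two queries on an edge-ngram field can be sketched like this. It is a simplification that ignores scoring; a whitespace split stands in for the standard tokenizer, and the function names are hypothetical.

```python
def edge_ngrams(word):
    return {word[:n] for n in range(1, len(word) + 1)}


def analyze(text):
    """Index-time analysis: one set of edge ngrams per word position."""
    return [edge_ngrams(w) for w in text.lower().split()]


def match(doc_tokens, query):
    # match (default OR): any query term found at any position is enough
    return any(q in grams for q in query.split() for grams in doc_tokens)


def match_phrase(doc_tokens, query):
    # match_phrase: every term present, at consecutive positions
    terms = query.split()
    return any(
        all(terms[i] in doc_tokens[s + i] for i in range(len(terms)))
        for s in range(len(doc_tokens) - len(terms) + 1)
    )


titles = ["hello world", "hello win98", "hello pig",
          "hello dog", "hello wink", "hello www.baidu.com"]

print([t for t in titles if match(analyze(t), "hello w")])
# all six titles, including 'hello pig' and 'hello dog'
print([t for t in titles if match_phrase(analyze(t), "hello w")])
# ['hello world', 'hello win98', 'hello wink', 'hello www.baidu.com']
```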
