Elasticsearch Advanced Notes, Part 11

Elasticsearch Expert Series (21)

In-Depth Search Techniques: Using Rescoring to Speed Up Proximity Match Searches

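Differences between match and phrase match (proximity match)

match
- As soon as a single term matches, the docs listed for that term in the inverted index can be returned as results; the lookup is done once the term has been found.

phrase match
- First, fetch the doc list for every term and keep only the docs that contain all of the terms. Then, for each of those docs, look up the position of every term and check whether the positions fall within the allowed range. With slop, further computation is needed to decide whether the terms can be brought together within slop moves so that the doc matches.

A match query performs much better than phrase match or proximity match (phrase match with slop), because the latter two both have to compute the distance between term positions.

As a rough rule of thumb, a match query is about 10 times faster than phrase match and about 20 times faster than proximity match.

Don't worry too much, though: es generally responds within milliseconds. A match query usually takes a few milliseconds to a few tens of milliseconds, while phrase match and proximity match take tens to hundreds of milliseconds, which is still acceptable.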
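
To make the extra work of the position check concrete, here is a minimal Python sketch of the two strategies described above. It is a simplified model, not Lucene's implementation: tokenization is a plain whitespace split, terms must appear in query order, and slop is treated as the number of extra positions allowed between them. All function names and sample docs are made up for illustration.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} using naive whitespace tokenization."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def match_any(index, terms):
    """match: a doc qualifies if it contains at least one of the terms."""
    hits = set()
    for term in terms:
        hits |= set(index.get(term, {}))
    return hits


def min_gap(position_lists):
    """Smallest number of extra positions between the in-order terms, or None."""
    best = None
    for start in position_lists[0]:
        prev = start
        for positions in position_lists[1:]:
            # Earliest occurrence of the next term after the previous one.
            nxt = min((p for p in positions if p > prev), default=None)
            if nxt is None:
                break
            prev = nxt
        else:
            gap = prev - start - (len(position_lists) - 1)
            best = gap if best is None else min(best, gap)
    return best


def match_phrase(index, terms, slop=0):
    """phrase match: docs containing all terms, close enough together."""
    # Step 1: keep only docs that contain every term.
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))
    # Step 2: per-doc position check, the expensive part.
    hits = set()
    for doc_id in candidates:
        gap = min_gap([index[t][doc_id] for t in terms])
        if gap is not None and gap <= slop:
            hits.add(doc_id)
    return hits


# Hypothetical sample docs, not taken from the index used in these notes.
docs = {
    1: "java spark",
    2: "java is similar to spark",
    3: "i think java is the best programming language",
}
index = build_positional_index(docs)
```

With these docs, `match_any(index, ["java", "spark"])` returns all three ids after simple lookups, while `match_phrase` has to walk the position lists of every candidate doc before it can decide.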

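Optimizing proximity match performance

The usual approach is to reduce the number of documents that proximity match has to process.

The main idea:
- Use a match query to filter out the docs you need first, then use proximity match to boost doc scores based on term distance. Proximity match is applied only to the top n docs by score on each shard and adjusts their scores. This process is called rescoring. Users generally page through results and only look at the first few pages, so there is no need to run proximity match on every result.

As discussed earlier, combining match and proximity match achieves both recall and precision (important).

By default, match might hit 1000 docs, and proximity match then has to run on every one of them, checking whether the terms can be matched within slop moves and contributing its score. In most cases, however, users page through results and only look at the first few pages: at 10 results per page and at most 5 pages, that is 50 results.
Proximity match only needs to do the slop matching and contribute its score for the top 50 docs; there is no need to compute and contribute scores for all 1000.

rescore: re-scoring

match: 1000 docs, each of which already has a score. proximity match: take the top 50 docs and rescore them, so that among those 50, the closer the terms are to each other, the higher the doc ranks.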

 GET /waws/article/_search 
 {
   "query": {
     "match": {
       "content": "java spark"
     }
   },
   "rescore": {
     "window_size": 50,
     "query": {
       "rescore_query": {
         "match_phrase": {
           "content": {
             "query": "java spark",
             "slop": 50
           }
         }
       }
     }
   }
 }
 
 {
   "took": 56,
   "timed_out": false,
   "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
   },
   "hits": {
     "total": 2,
     "max_score": 1.258609,
     "hits": [
       {
         "_index": "waws",
         "_type": "article",
         "_id": "5",
         "_score": 1.258609,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5",
           "userID": 3,
           "hidden": false,
           "postDate": "2017-03-01",
           "tag": [
             "elasticsearch"
           ],
           "tag_cnt": 1,
           "view_cnt": 10,
           "title": "this is spark blog",
           "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
           "sub_title": "haha, hello world",
           "author_first_name": "Tonny",
           "author_last_name": "Peter Smith",
           "new_author_last_name": "Peter Smith",
           "new_author_first_name": "Tonny"
         }
       },
       {
         "_index": "waws",
         "_type": "article",
         "_id": "2",
         "_score": 0.68640786,
         "_source": {
           "articleID": "KDKE-B-9947-#kL5",
           "userID": 1,
           "hidden": false,
           "postDate": "2017-01-02",
           "tag": [
             "java"
           ],
           "tag_cnt": 1,
           "view_cnt": 50,
           "title": "this is java blog",
           "content": "i think java is the best programming language",
           "sub_title": "learned a lot of course",
           "author_first_name": "Smith",
           "author_last_name": "Williams",
           "new_author_last_name": "Williams",
           "new_author_first_name": "Smith"
         }
       }
     ]
   }
 }
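
The rescoring flow above can be sketched in a few lines of Python. This is a single-shard toy model under stated assumptions: the rescore score is simply added to the original score (the sketch does not model any weighting options), and `rescore`, `proximity_bonus`, and the sample scores are hypothetical names and values chosen for illustration.

```python
def rescore(first_pass, rescore_fn, window_size):
    """Re-rank only the top `window_size` hits of a first-pass result.

    first_pass: list of (doc_id, score) pairs from the cheap match query.
    rescore_fn: returns the extra (proximity) score for a doc id.
    """
    ranked = sorted(first_pass, key=lambda hit: hit[1], reverse=True)
    window, rest = ranked[:window_size], ranked[window_size:]
    # The expensive function runs only for docs inside the window.
    rescored = [(doc_id, score + rescore_fn(doc_id)) for doc_id, score in window]
    rescored.sort(key=lambda hit: hit[1], reverse=True)
    # Docs outside the window keep their first-pass score and order.
    return rescored + rest


# Hypothetical first-pass scores and proximity bonuses.
first_pass = [("a", 1.0), ("b", 0.75), ("c", 0.5), ("d", 0.25)]
bonuses = {"b": 0.5, "d": 5.0}
calls = []


def proximity_bonus(doc_id):
    calls.append(doc_id)  # record how often the expensive step runs
    return bonuses.get(doc_id, 0.0)


result = rescore(first_pass, proximity_bonus, window_size=3)
```

Here `b` overtakes `a` because of its proximity bonus, while `d` stays last even though its bonus would be large: it is outside the window, so the expensive function is never called for it. That is the trade-off behind `window_size`.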

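Elasticsearch Expert Series (22)

In-Depth Search Techniques: `Prefix Search`, `Wildcard Search`, and `Regexp Search` in Practice

Prefix search

C3D0-KD345
C3K5-DFG65
C4I8-UI365

C3 --> returns the first two values above --> search by string prefix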

 PUT /waws_index
 {
   "mappings": {
     "waws_type": {
       "properties": {
         "title": {
           "type": "keyword"
         }
       }
     }
   }
 }

Add test data

 PUT /waws_index/waws_type/1
 {
   "title":"C3D0-KD345"
 }
 
 PUT /waws_index/waws_type/2
 {
   "title":"C3K5-DFG65"
 }
 
 PUT /waws_index/waws_type/3
 {
   "title":"C4I8-UI365"
 }

Fetch data by prefix

 GET /waws_index/waws_type/_search
 {
   "query": {
     "prefix": {
       "title": {
         "value": "C3"
       }
     }
   }
 }
 
 {
   "took": 51,
   "timed_out": false,
   "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
   },
   "hits": {
     "total": 2,
     "max_score": 1,
     "hits": [
       {
         "_index": "waws_index",
         "_type": "waws_type",
         "_id": "2",
         "_score": 1,
         "_source": {
           "title": "C3K5-DFG65"
         }
       },
       {
         "_index": "waws_index",
         "_type": "waws_type",
         "_id": "1",
         "_score": 1,
         "_source": {
           "title": "C3D0-KD345"
         }
       }
     ]
   }
 }

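How prefix search works

A prefix query does not compute a relevance score. Its only difference from a prefix filter is that the filter caches a bitset.

It scans the entire inverted index, as the example below illustrates.

The shorter the prefix, the more docs have to be processed and the worse the performance, so use the longest prefix you can.

How is a prefix search executed, and why is it slow?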

match

C3-D0-KD345

C3-K5-DFG65

C4-I8-UI365

Full-text search (every string gets tokenized)

c3 doc1,doc2

d0

kd345

k5

dfg65

c4

i8

ui365

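c3 --> scan the inverted index --> once c3 is found, the scan can stop, because only 2 docs contain c3 and both have been found --> no need to look at any other term

match performance is usually very high

Not tokenized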

C3-D0-KD345

C3-K5-DFG65

C4-I8-UI365

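c3 --> the scan first reaches C3-D0-KD345: good, that is a string with the prefix c3 --> the scan still has to continue, because C3-K5-DFG65 comes later, and there may be many more strings with the prefix c3 --> finding one term that matches the prefix is not a reason to stop --> the scan ends only after the whole inverted index has been covered

In practice, though, there are cases that full-text search cannot handle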

C3D0-KD345

C3K5-DFG65

C4I8-UI365

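c3 --> match --> scan the whole inverted index: can it be found? No, because the stored terms are the full, untokenized strings

c3 --> only prefix works

prefix performance is poor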
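
The contrast between a term lookup and a prefix scan can be sketched as follows. This is a simplified model of the behaviour described above, using plain Python dicts rather than Elasticsearch's actual on-disk structures; `analyzed_index`, `keyword_index`, and both functions are hypothetical names for illustration.

```python
# Tokenized (full-text) view of C3-D0-KD345, C3-K5-DFG65, C4-I8-UI365.
analyzed_index = {
    "c3": {1, 2}, "d0": {1}, "kd345": {1},
    "k5": {2}, "dfg65": {2},
    "c4": {3}, "i8": {3}, "ui365": {3},
}

# Untokenized (keyword) view: one term per document.
keyword_index = {
    "C3D0-KD345": {1},
    "C3K5-DFG65": {2},
    "C4I8-UI365": {3},
}


def match_term(index, term):
    """match: a single exact lookup, no need to look at other terms."""
    return index.get(term, set())


def prefix_search(index, prefix):
    """prefix: every term is examined, because any of them may match."""
    hits = set()
    examined = 0
    for term, doc_ids in index.items():
        examined += 1
        if term.startswith(prefix):
            hits |= doc_ids
    return hits, examined


print(match_term(analyzed_index, "c3"))    # exact term exists in the tokenized index
print(match_term(keyword_index, "c3"))     # no such whole term in the keyword index
print(prefix_search(keyword_index, "C3"))  # hits plus the number of terms examined
```

In this model the cost of `prefix_search` grows with the number of terms in the index, while `match_term` stays a single lookup.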

Wildcard search

Similar to prefix search, but more powerful

C3D0-KD345

C3K5-DFG65

C4I8-UI365

5, any single character, -D, any number of characters, 5

5?-*5: wildcards can express more complex fuzzy-search semantics

 GET /waws_index/waws_type/_search
 {
   "query": {
     "wildcard": {
       "title": {
         "value": "C?K*5"
       }
     }
   }
 }
 
 {
   "took": 8,
   "timed_out": false,
   "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
   },
   "hits": {
     "total": 1,
     "max_score": 1,
     "hits": [
       {
         "_index": "waws_index",
         "_type": "waws_type",
         "_id": "2",
         "_score": 1,
         "_source": {
           "title": "C3K5-DFG65"
         }
       }
     ]
   }
 }

?: any single character
*: zero or more characters

Performance is just as poor: the whole inverted index has to be scanned
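
The `?` and `*` semantics can be tried out locally with Python's standard `fnmatch` module, which happens to use the same two wildcards. This is only an approximation for experimenting with patterns: `fnmatch` also treats `[...]` specially, and `keyword_index` and `wildcard_search` are hypothetical names for illustration.

```python
from fnmatch import fnmatchcase

# Untokenized (keyword) terms from the test data above.
keyword_index = {
    "C3D0-KD345": {1},
    "C3K5-DFG65": {2},
    "C4I8-UI365": {3},
}


def wildcard_search(index, pattern):
    """Test every term against the pattern; the whole term must match."""
    hits = set()
    for term, doc_ids in index.items():
        if fnmatchcase(term, pattern):  # case-sensitive, like a keyword field
            hits |= doc_ids
    return hits


print(wildcard_search(keyword_index, "C?K*5"))  # same pattern as the query above
print(wildcard_search(keyword_index, "C*5"))    # a much looser pattern
```

As with the prefix sketch, every term in the index is tested once, which is why the cost grows with the size of the index.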

Regexp search

 GET /waws_index/waws_type/_search 
 {
   "query": {
     "regexp": {
       "title": "C[0-9].+"
     }
   }
 }
 
 {
   "took": 11,
   "timed_out": false,
   "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
   },
   "hits": {
     "total": 3,
     "max_score": 1,
     "hits": [
       {
         "_index": "waws_index",
         "_type": "waws_type",
         "_id": "2",
         "_score": 1,
         "_source": {
           "title": "C3K5-DFG65"
         }
       },
       {
         "_index": "waws_index",
         "_type": "waws_type",
         "_id": "1",
         "_score": 1,
         "_source": {
           "title": "C3D0-KD345"
         }
       },
       {
         "_index": "waws_index",
         "_type": "waws_type",
         "_id": "3",
         "_score": 1,
         "_source": {
           "title": "C4I8-UI365"
         }
       }
     ]
   }
 }

C[0-9].+
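- [0-9]: a digit in the given range
- [a-z]: a letter in the given range
- .: any single character
- +: the preceding expression may occur one or more times

wildcard and regexp work on the same principle as prefix: both scan the whole index, so performance is poor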
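
The pattern `C[0-9].+` can be checked locally with Python's `re` module. This is a sketch under assumptions: Python's regex dialect is not identical to the one Elasticsearch uses, but the constructs listed above behave the same way in both, and `fullmatch` is used because the pattern has to cover the whole term. `keyword_index` and `regexp_search` are hypothetical names for illustration.

```python
import re

# Untokenized (keyword) terms from the test data above.
keyword_index = {
    "C3D0-KD345": {1},
    "C3K5-DFG65": {2},
    "C4I8-UI365": {3},
}


def regexp_search(index, pattern):
    """Test every term against the regular expression (whole-term match)."""
    compiled = re.compile(pattern)
    hits = set()
    for term, doc_ids in index.items():
        if compiled.fullmatch(term):
            hits |= doc_ids
    return hits


print(regexp_search(keyword_index, "C[0-9].+"))  # same pattern as the query above
print(regexp_search(keyword_index, "C3.+"))      # narrower: only terms starting with C3
```

Note that a pattern such as `C[0-9]` on its own matches nothing here, because it would have to account for the entire term.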

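This is mainly an introduction to some advanced search syntax. In real applications, avoid these queries whenever you can, because they are too slow.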
