Elasticsearch Advanced Series (21)
Deep Dive into Search: Using the Rescoring Mechanism to Speed Up Proximity Match Searches
Differences between match and phrase match (proximity match)
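match
- As soon as a single term matches, the doc containing that term can be returned as a result. The query only scans the inverted index; once the term is found, it is done.
phrase match
- First it fetches the doc list of every term and keeps only the docs that contain all of the terms. Then, for each of those docs, it looks up the position of every term and checks whether the positions fall within the allowed range. With slop, it needs a more expensive computation to decide whether the terms can be moved within slop steps to match the doc.
match query performs much better than phrase match and proximity match (phrase match with slop), because the latter two both have to compute distances between positions.
As a rough rule of thumb, match query is about 10 times faster than phrase match and about 20 times faster than proximity match.
But don't worry too much: Elasticsearch usually responds within milliseconds. A match query typically takes a few milliseconds to a few tens of milliseconds, while phrase match and proximity match take tens to hundreds of milliseconds, which is still acceptable.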
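To make the extra work of phrase match concrete, here is a minimal Python sketch over a toy positional index. It is not how Lucene implements it: the index contents, the function names, and the slop rule (spread of the positions after subtracting each term's offset in the phrase) are simplifying assumptions for illustration only.

```python
from itertools import product

# Toy positional index: term -> {doc_id: [positions]} (hypothetical data).
INDEX = {
    "java":  {1: [0], 2: [2], 3: [5], 4: [0]},
    "spark": {1: [1], 2: [0], 3: [12]},
}

def match(terms):
    """match: a doc qualifies as soon as it contains any one of the terms."""
    docs = set()
    for term in terms:
        docs |= set(INDEX.get(term, {}))
    return docs

def phrase_match(terms, slop=0):
    """phrase/proximity match: all terms present AND positions close enough."""
    postings = [INDEX.get(term, {}) for term in terms]
    # Step 1: keep only docs that contain every term.
    candidates = set.intersection(*(set(p) for p in postings))
    hits = set()
    for doc in candidates:
        # Step 2: compare positions, one position per term at a time.
        for combo in product(*(p[doc] for p in postings)):
            # Subtract each term's offset in the phrase. An exact phrase
            # collapses to one value, so the spread approximates the
            # amount of movement needed (simplified slop rule).
            shifted = [pos - i for i, pos in enumerate(combo)]
            if max(shifted) - min(shifted) <= slop:
                hits.add(doc)
                break
    return hits

print(match(["java", "spark"]))                 # {1, 2, 3, 4}
print(phrase_match(["java", "spark"]))          # {1}
print(phrase_match(["java", "spark"], slop=3))  # {1, 2}
```

match only unions doc lists, while phrase_match has to intersect them and then inspect positions per doc, which is where the additional cost comes from.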
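Optimizing proximity match performance
The usual approach is to reduce the number of documents that proximity match has to run against.
The main idea
- Use a match query to filter out the data you need first, then use proximity match to boost doc scores based on the distance between terms. The proximity match only applies to the top n docs by score on each shard and readjusts their scores. This process is called rescoring. Because users usually page through results and only look at the first few pages, there is no need to run proximity match on every result.
In the terms used earlier, match + proximity match gives both recall and precision (important)
- By default, match might hit 1000 docs, and proximity match has to run on each of them, checking whether the terms can be matched within slop moves and then contributing its score. But even if match returns 1000 docs, users mostly page through results and look at the first few pages at most. With 10 results per page and perhaps 5 pages viewed, that is 50 docs.
- proximity match only needs to do the slop matching and contribute its score for the top 50 docs; there is no need to compute and contribute scores for all 1000 docs.
rescore
match: 1000 docs, each of which already has a score at this point. proximity match: rescore only the top 50 docs, so that among those 50, the docs whose terms are closer together rank higher.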
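The rescoring idea can be sketched in a few lines of Python for a single shard. This is only an illustration under stated assumptions: the function name `rescore`, the doc ids, and the scores are made up, the two scores are simply added, and docs outside the window are appended unchanged after the rescored ones.

```python
def rescore(first_pass, proximity_score, window_size):
    """Re-rank only the top `window_size` hits of a cheap first pass.

    first_pass      -- list of (doc_id, score) from the cheap query
    proximity_score -- callable doc_id -> float, the expensive second pass
    """
    ranked = sorted(first_pass, key=lambda hit: hit[1], reverse=True)
    window, rest = ranked[:window_size], ranked[window_size:]
    # The expensive scorer runs on the window only, never on `rest`.
    rescored = [(doc, score + proximity_score(doc)) for doc, score in window]
    rescored.sort(key=lambda hit: hit[1], reverse=True)
    return rescored + rest

# Hypothetical first-pass scores and proximity bonuses.
hits = [("d1", 1.0), ("d2", 0.75), ("d3", 0.5), ("d4", 0.25)]
bonus = {"d1": 0.0, "d2": 0.5, "d3": 0.25, "d4": 2.0}

print(rescore(hits, bonus.get, window_size=2))
# [('d2', 1.25), ('d1', 1.0), ('d3', 0.5), ('d4', 0.25)]
```

Note that d4 would have ranked first if every doc had been rescored, but it falls outside the window. That is the trade-off: the window has to be large enough to cover the pages users actually look at.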
GET /waws/article/_search
{
  "query": {
    "match": {
      "content": "java spark"
    }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "match_phrase": {
          "content": {
            "query": "java spark",
            "slop": 50
          }
        }
      }
    }
  }
}
{
  "took": 56,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.258609,
    "hits": [
      {
        "_index": "waws",
        "_type": "article",
        "_id": "5",
        "_score": 1.258609,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      },
      {
        "_index": "waws",
        "_type": "article",
        "_id": "2",
        "_score": 0.68640786,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams",
          "new_author_last_name": "Williams",
          "new_author_first_name": "Smith"
        }
      }
    ]
  }
}
Elasticsearch Advanced Series (22)
Deep Dive into Search: Prefix, Wildcard, and Regexp Search in Practice
Prefix search
- C3D0-KD345
- C3K5-DFG65
- C4I8-UI365
C3 --> returns the first two above --> search by the prefix of the string
PUT /waws_index
{
  "mappings": {
    "waws_type": {
      "properties": {
        "title": {
          "type": "keyword"
        }
      }
    }
  }
}
- Add test data
PUT /waws_index/waws_type/1
{
  "title": "C3D0-KD345"
}
PUT /waws_index/waws_type/2
{
  "title": "C3K5-DFG65"
}
PUT /waws_index/waws_type/3
{
  "title": "C4I8-UI365"
}
- Fetch data by prefix
GET /waws_index/waws_type/_search
{
  "query": {
    "prefix": {
      "title": {
        "value": "C3"
      }
    }
  }
}
{
  "took": 51,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "waws_index",
        "_type": "waws_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "title": "C3K5-DFG65"
        }
      },
      {
        "_index": "waws_index",
        "_type": "waws_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "title": "C3D0-KD345"
        }
      }
    ]
  }
}
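How prefix search works
prefix query does not compute a relevance score. The only difference from prefix filter is that the filter caches its bitset.
It has to scan the terms in the inverted index; an example follows.
- The shorter the prefix, the more docs have to be processed and the worse the performance, so use as long a prefix as possible.
How is a prefix search executed, and why is it slow?
- match
- C3-D0-KD345
- C3-K5-DFG65
- C4-I8-UI365
Full-text search (every string is analyzed into terms)
- c3 doc1,doc2
- d0
- kd345
- k5
- dfg65
- c4
- i8
- ui365
c3 --> scan the inverted index --> as soon as c3 is found, the scan can stop, because only 2 docs contain c3 and both have been found --> there is no need to look at any other term
match usually performs very well
- Not analyzed
- C3-D0-KD345
- C3-K5-DFG65
- C4-I8-UI365
c3 --> the scan first reaches C3-D0-KD345; good, that is a string with the prefix c3 --> but it has to continue, because C3-K5-DFG65 comes next, and there may be many more strings with the prefix c3 --> after finding one term that matches the prefix it cannot stop and must keep going --> it can only finish once every term that could match has been scanned
In practice, some cases cannot be solved with full-text search
- C3D0-KD345
- C3K5-DFG65
- C4I8-UI365
c3 --> match --> even after scanning the whole inverted index, no term is exactly c3, so nothing is found
c3 --> only prefix works
prefix performs poorly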
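The difference between an exact term lookup and a prefix scan can be sketched in Python. This is a simplified model, assuming the terms of a not-analyzed field are kept in a sorted list; the function name `prefix_terms` and the term list are illustrative, not Elasticsearch internals.

```python
from bisect import bisect_left

# Terms of a not-analyzed (keyword) field, kept sorted.
TERMS = sorted(["C3D0-KD345", "C3K5-DFG65", "C4I8-UI365"])

def prefix_terms(sorted_terms, prefix):
    """Collect every term that starts with `prefix`."""
    matched = []
    i = bisect_left(sorted_terms, prefix)    # jump to the first candidate
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matched.append(sorted_terms[i])      # cannot stop at the first hit
        i += 1
    return matched

# An exact lookup finds nothing: no whole term equals "C3".
print("C3" in TERMS)                 # False
# A prefix scan keeps going until terms stop matching.
print(prefix_terms(TERMS, "C3"))     # ['C3D0-KD345', 'C3K5-DFG65']
# A shorter prefix visits more terms, a longer one fewer.
print(prefix_terms(TERMS, "C"))      # all three terms
print(prefix_terms(TERMS, "C3K"))    # ['C3K5-DFG65']
```

With millions of distinct terms, a one- or two-character prefix can match a very large share of them, which is why longer prefixes are recommended.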
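Wildcard search
- Similar to prefix search, but more powerful
- C3D0-KD345
- C3K5-DFG65
- C4I8-UI365
5, any one character, -, any number of characters, 5
5?-*5: wildcards can express more complex fuzzy-search semantics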
GET /waws_index/waws_type/_search
{
  "query": {
    "wildcard": {
      "title": {
        "value": "C?K*5"
      }
    }
  }
}
{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "waws_index",
        "_type": "waws_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "title": "C3K5-DFG65"
        }
      }
    ]
  }
}
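- ?: any single character
- *: zero or more characters
Performance is just as poor: a large part of the inverted index has to be scanned before the search is done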
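The meaning of ? and * can be checked quickly in Python against the three sample titles. This sketch uses the standard library's `fnmatchcase`, whose ? and * behave the same way for these patterns; it is only a stand-in for the wildcard query (fnmatch also gives `[...]` a special meaning, which is not relied on here), and the helper name `wildcard` is made up.

```python
from fnmatch import fnmatchcase

TITLES = ["C3D0-KD345", "C3K5-DFG65", "C4I8-UI365"]

def wildcard(pattern, terms=TITLES):
    """Return the terms that match the whole pattern, case-sensitively."""
    # Every term is tested against the pattern, one by one.
    return [term for term in terms if fnmatchcase(term, pattern)]

print(wildcard("C?K*5"))   # ['C3K5-DFG65']
print(wildcard("C?D*"))    # ['C3D0-KD345']
print(wildcard("C*5"))     # all three titles
```

The pattern has to match the entire term, and each candidate term is tested individually, which is where the cost comes from.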
Regexp search
GET /waws_index/waws_type/_search
{
  "query": {
    "regexp": {
      "title": "C[0-9].+"
    }
  }
}
{
  "took": 11,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "waws_index",
        "_type": "waws_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "title": "C3K5-DFG65"
        }
      },
      {
        "_index": "waws_index",
        "_type": "waws_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "title": "C3D0-KD345"
        }
      },
      {
        "_index": "waws_index",
        "_type": "waws_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "title": "C4I8-UI365"
        }
      }
    ]
  }
}
C[0-9].+
- [0-9]: a digit in the given range
- [a-z]: a letter in the given range
- .: any single character
- +: the preceding expression occurs one or more times
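The same pattern can be tried in Python against the sample titles. This is only an approximation: Python's `re` syntax is not identical to the syntax used by the regexp query, but `[0-9]`, `.` and `+` mean the same thing in both. `fullmatch` is used on the assumption that the pattern must match the whole term; the helper name `regexp` is made up.

```python
import re

TITLES = ["C3D0-KD345", "C3K5-DFG65", "C4I8-UI365"]

def regexp(pattern, terms=TITLES):
    """Return the terms that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [term for term in terms if compiled.fullmatch(term)]

print(regexp("C[0-9].+"))   # all three titles
print(regexp("C3.+"))       # ['C3D0-KD345', 'C3K5-DFG65']
print(regexp("C[0-9]"))     # [] because the rest of the term is not covered
```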
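wildcard and regexp work on the same principle as prefix: they have to scan the terms in the index, so they perform poorly
This part is mainly an introduction to some advanced search syntax. In real applications, avoid these queries whenever you can; their performance is too poor.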