Hi, I'm Danta (蛋挞), a backend developer who is just starting out. I hope we can all work hard and improve together!

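Start marker -> Deep Dive into Search (13 lectures): "26 | Relevance Scoring in Search"
End marker -> Deep Dive into Search (13 lectures): "27 | Query & Filtering and Multi-String, Multi-Field Queries"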

Relevance Scoring in Search

Term Frequency (TF)

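Term Frequency: how often a search term appears in a single document
- The number of times the term occurs divided by the total number of words in the document
A simple way to measure how relevant a result document is to a query: add up the TF of every term in the query
- For the example query "区块链的应用" ("applications of blockchain"): TF(blockchain) + TF(of) + TF(applications)
Stop Word
- "of" (的) appears many times in the document but contributes almost nothing to relevance, so its TF should not be counted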
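The TF sum described above can be sketched in a few lines of Python. This is only an illustration: the tokenized document, the query, and the helper names `term_frequency` and `tf_sum` are made up for this example and are not part of Elasticsearch.

```python
def term_frequency(term, tokens):
    """Occurrences of `term` divided by the total number of tokens."""
    if not tokens:
        return 0.0
    return tokens.count(term) / len(tokens)


def tf_sum(query_terms, tokens, stop_words=frozenset()):
    """Sum the TF of every query term, skipping any stop words."""
    return sum(term_frequency(t, tokens) for t in query_terms if t not in stop_words)


# Hypothetical pre-tokenized document (9 tokens).
doc = "the application of blockchain and the future of blockchain".split()
query = ["blockchain", "of", "application"]

naive = tf_sum(query, doc)                         # also counts the stop word "of"
filtered = tf_sum(query, doc, stop_words={"of"})   # ignores the stop word
print(naive, filtered)
```

In this toy document "of" is as frequent as "blockchain", so leaving it in inflates the score without saying anything about what the document is about.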

Inverse Document Frequency (IDF)

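DF (document frequency): how often the search term appears across all documents
- "blockchain" appears in relatively few documents
- "applications" appears in comparatively many documents
- Stop words appear in a very large number of documents
Inverse Document Frequency: simply put, IDF = log(total number of documents / number of documents that contain the term)
TF-IDF essentially turns the plain sum of TFs into a weighted sum
- TF(blockchain)*IDF(blockchain) + TF(of)*IDF(of) + TF(applications)*IDF(applications)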
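As a rough sketch of the weighted sum, the snippet below uses the simplified IDF formula from this note, log(N / df). The four tiny documents and the function names are hypothetical; real engines use smoothed variants of the formula.

```python
import math


def tf(term, doc):
    """Term frequency: occurrences divided by document length."""
    return doc.count(term) / len(doc) if doc else 0.0


def idf(term, docs):
    """Simplified IDF: log(total docs / docs containing the term)."""
    containing = sum(1 for d in docs if term in d)
    if containing == 0:
        return 0.0  # term never occurs; treat its weight as zero
    return math.log(len(docs) / containing)


def tf_idf_score(query_terms, doc, docs):
    """Weighted sum: each term's TF is multiplied by its IDF."""
    return sum(tf(t, doc) * idf(t, docs) for t in query_terms)


# Hypothetical corpus, already tokenized.
docs = [
    ["blockchain", "of", "things"],
    ["application", "of", "search"],
    ["application", "of", "maps"],
    ["history", "of", "music"],
]
query = ["blockchain", "of", "application"]
scores = [tf_idf_score(query, d, docs) for d in docs]
print(scores)
```

Because "of" occurs in every document, its IDF is log(4/4) = 0, so it drops out of the sum without an explicit stop-word list, while the rarer "blockchain" gets the largest weight.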

The Concept of TF-IDF

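TF-IDF is widely regarded as the most important invention in information retrieval
Beyond information retrieval, it is used very widely in document classification and other related fields
The concept of IDF was first proposed by Karen Spärck Jones of the University of Cambridge
- 1972: "A statistical interpretation of term specificity and its application in retrieval"
- However, the paper did not explain theoretically why IDF should be log(total number of documents / number of documents that contain the term) rather than some other function, and did not pursue the question further
In the 1970s and 1980s, Salton and Robertson carried out further research and proofs, and justified the formula using Shannon's information theory
- www.staff.city.ac.uk/~sb317/pape…
Modern search engines have made many fine-grained optimizations to TF-IDF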

The TF-IDF Scoring Formula in Lucene
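No formula survives under this heading in these notes. For reference, here is a sketch of the "practical scoring function" of Lucene's classic TF-IDF similarity. It is written from memory of the Lucene `TFIDFSimilarity` documentation, not taken from the course material, so treat the exact form as an assumption:

```latex
\mathrm{score}(q, d) = \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)^{2} \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t, d) \Big)
```

In the classic similarity, tf(t, d) is the square root of the term's frequency in the document, idf(t) = 1 + ln(numDocs / (docFreq + 1)), and norm(t, d) shrinks as the field gets longer, so shorter fields score higher for the same match.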

BM25

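Starting with ES 5, the default scoring algorithm is BM25
Compared with classic TF-IDF, the BM25 score levels off toward a fixed value as TF grows without bound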
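The saturation effect can be seen with a small sketch that compares the TF component of the classic similarity (square root of the raw count) with the textbook BM25 TF component. The parameters `k1 = 1.2` and `b = 0.75` are the usual defaults. The function names and the document lengths are assumptions for illustration, and the exact constants inside Lucene may differ between versions.

```python
import math


def classic_tf(freq):
    """TF part of the classic similarity: keeps growing with the raw count."""
    return math.sqrt(freq)


def bm25_tf(freq, k1=1.2, b=0.75, doc_len=100.0, avg_doc_len=100.0):
    """TF part of textbook BM25: bounded above by k1 + 1."""
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return freq * (k1 + 1.0) / (freq + k1 * length_norm)


for freq in (1, 2, 5, 10, 100, 1000):
    print(f"freq={freq:5d}  classic={classic_tf(freq):8.3f}  bm25={bm25_tf(freq):6.3f}")
```

However large `freq` becomes, `bm25_tf` stays below `k1 + 1 = 2.2`, whereas the classic value grows without limit.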

Viewing TF-IDF with the Explain API

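4 documents + 4 term queries
Things to think about
- What are the TF and IDF for each query?
- How are the results ordered?
- How do document length, TF, and IDF affect the relevance score?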
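To reason about these questions before running the requests, here is a minimal Python sketch that scores the four demo documents with a textbook BM25 formula. The tokenizer (lowercase, split on non-word characters) and the IDF variant ln(1 + (N - n + 0.5) / (n + 0.5)) are assumptions meant to approximate the standard analyzer and Lucene's BM25. The absolute numbers are not expected to match the `explain` output exactly, only the ordering.

```python
import math
import re

# The four documents indexed in the CodeDemo section.
DOCS = {
    1: "we use Elasticsearch to power the search",
    2: "we like elasticsearch",
    3: "The scoring of documents is calculated by the scoring formula",
    4: "you know, for search",
}


def tokenize(text):
    """Rough stand-in for the standard analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return [(doc_id, score)] sorted by descending score."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    avg_len = sum(len(t) for t in tokenized.values()) / len(tokenized)
    scores = {}
    for term in tokenize(query):
        n = sum(1 for toks in tokenized.values() if term in toks)
        if n == 0:
            continue  # term occurs nowhere, contributes nothing
        idf = math.log(1 + (len(tokenized) - n + 0.5) / (n + 0.5))
        for doc_id, toks in tokenized.items():
            freq = toks.count(term)
            if freq == 0:
                continue
            norm = 1 - b + b * len(toks) / avg_len  # longer docs are penalized
            scores[doc_id] = scores.get(doc_id, 0.0) + idf * freq * (k1 + 1) / (freq + k1 * norm)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


for q in ("you", "elasticsearch", "the", "the elasticsearch"):
    print(q, "->", [doc_id for doc_id, _ in bm25_rank(q, DOCS)])
```

Under these assumptions, "elasticsearch" ranks document 2 above document 1 because both contain the term once and document 2 is shorter. "the" ranks document 3 first because the term occurs twice there.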

Boosting Relevance

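Boosting is one way to control relevance
- It can be applied to an index, a field, or a query sub-clause
Meaning of the boost parameter
- When boost > 1, the relative weight of the score is raised
- When 0 < boost < 1, the relative weight of the score is lowered
- When boost < 0, the clause contributes a negative score (note that Elasticsearch 7.0 and later reject negative boost values)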

Section Review

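What relevance is, and an introduction to relevance scoring
- TF-IDF / BM25
Tuning the parameters of the relevance algorithm in Elasticsearch
In ES, boosting parameters can be set separately for indices and fields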

CodeDemo

PUT testscore
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text"
      }
    }
  }
}

PUT testscore/_bulk
{ "index": { "_id": 1 }}
{ "content":"we use Elasticsearch to power the search" }
{ "index": { "_id": 2 }}
{ "content":"we like elasticsearch" }
{ "index": { "_id": 3 }}
{ "content":"The scoring of documents is calculated by the scoring formula" }
{ "index": { "_id": 4 }}
{ "content":"you know, for search" }

POST /testscore/_search
{
  //"explain": true,
  "query": {
    "match": {
      "content":"you"
      //"content": "elasticsearch"
      //"content":"the"
      //"content": "the elasticsearch"
    }
  }
}

POST testscore/_search
{
  "query": {
    "boosting": {
      "positive": {
        "term": { "content": "elasticsearch" }
      },
      "negative": {
        "term": { "content": "like" }
      },
      "negative_boost": 0.2
    }
  }
}

POST tmdb/_search
{
  "_source": ["title","overview"],
  "query": {
    "more_like_this": {
      "fields": [ "title^10","overview" ],
      "like": [{"_id":"14191"}],
      "min_term_freq": 1,
      "max_query_terms": 12
    }
  }
}

Query & Filtering and Multi-String, Multi-Field Queries

Query Context & Filter Context

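Advanced search features: support multiple text inputs and search across multiple fields.
Search engines generally also offer filtering on conditions such as time and price
Elasticsearch has two different contexts: Query and Filter
- Query Context: computes a relevance score
- Filter Context: no scoring needed (yes or no); results can be cached for better performance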
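The difference between the two contexts can be illustrated with a toy Python model. It is only an analogy: Elasticsearch does not use `lru_cache`, and the in-memory `PRODUCTS` tuple and both helper functions are hypothetical stand-ins.

```python
from functools import lru_cache

# Hypothetical in-memory stand-in for the products index used in the CodeDemo.
PRODUCTS = (
    {"id": 1, "price": 10, "avaliable": True},
    {"id": 2, "price": 20, "avaliable": True},
    {"id": 3, "price": 30, "avaliable": True},
    {"id": 4, "price": 30, "avaliable": False},
)


@lru_cache(maxsize=None, typed=True)
def filter_term(field, value):
    """Filter context: each document is simply in or out, so the id set is cacheable."""
    return frozenset(p["id"] for p in PRODUCTS if p[field] == value)


def query_term(field, value):
    """Query context: every match also carries a score (a placeholder 1.0 here)."""
    return {p["id"]: 1.0 for p in PRODUCTS if p[field] == value}
```

Calling `filter_term("avaliable", True)` twice computes the id set once and then reuses it, which is the kind of saving a yes/no filter allows. A scored query has to produce a number per document, and that number depends on the query itself.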

Combining Conditions

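Suppose you want to search for a movie that meets the following conditions
- The reviews contain "Guitar", the user rating is higher than 3, and the release date is between 1993 and 2000
This search actually contains three pieces of logic, each targeting a different field
- The review field must contain "Guitar" / the user rating must be greater than 3 / the release date must fall within the given range
How can all three be combined while still getting good performance?
- Compound query: the bool query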

The bool Query

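A bool query is a combination of one or more query clauses
- There are 4 kinds of clauses in total: two affect scoring (must, should) and two do not (must_not, filter)
Relevance is not exclusive to full-text search. It also applies to yes|no clauses: the more clauses match, the higher the relevance score. When multiple query clauses are merged into one compound query, such as a bool query, the score computed by each clause is combined into the overall relevance score.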

bool Query Syntax

Sub-queries can appear in any order
Multiple queries can be nested
If a bool query has no must (or filter) clause, at least one should clause must match
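How the four clause types combine can be sketched with a toy evaluator in Python. It is a simplified model, not how Elasticsearch executes queries: every clause is a plain function that returns 1.0 on a match and 0.0 otherwise, and the names `term`, `range_lte`, `bool_query`, and the `filter_` parameter are made up for this illustration.

```python
def term(field, value):
    """Clause scoring 1.0 when doc[field] equals value, else 0.0."""
    return lambda doc: 1.0 if doc.get(field) == value else 0.0


def range_lte(field, limit):
    """Clause scoring 1.0 when doc[field] <= limit, else 0.0."""
    return lambda doc: 1.0 if field in doc and doc[field] <= limit else 0.0


def bool_query(doc, must=(), filter_=(), should=(), must_not=(), minimum_should_match=None):
    """Return the combined score, or None when the document does not match."""
    if minimum_should_match is None:
        # With no must/filter clause, at least one should clause has to match.
        minimum_should_match = 1 if should and not must and not filter_ else 0
    if any(clause(doc) > 0 for clause in must_not):
        return None                      # excluded, no effect on score
    if any(clause(doc) == 0 for clause in filter_):
        return None                      # yes/no only, no effect on score
    must_scores = [clause(doc) for clause in must]
    if any(score == 0 for score in must_scores):
        return None
    should_scores = [s for s in (clause(doc) for clause in should) if s > 0]
    if len(should_scores) < minimum_should_match:
        return None
    return sum(must_scores, 0.0) + sum(should_scores, 0.0)


# Same data as the products index in the CodeDemo (field name spelled as in the demo).
PRODUCTS = {
    1: {"price": 10, "avaliable": True, "productID": "XHDK-A-1293-#fJ3"},
    2: {"price": 20, "avaliable": True, "productID": "KDKE-B-9947-#kL5"},
    3: {"price": 30, "avaliable": True, "productID": "JODL-X-1937-#pV7"},
    4: {"price": 30, "avaliable": False, "productID": "QQPX-R-3956-#aD8"},
}

clauses = dict(
    must=[term("price", 30)],
    filter_=[term("avaliable", True)],
    must_not=[range_lte("price", 10)],
    should=[term("productID", "JODL-X-1937-#pV7"), term("productID", "XHDK-A-1293-#fJ3")],
)
for doc_id, doc in PRODUCTS.items():
    print(doc_id, bool_query(doc, minimum_should_match=1, **clauses))
```

Only product 3 survives: the must and should matches each add to its score, while the filter and must_not clauses only decide whether a document is kept.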

How to solve the "contains rather than equals" problem in structured queries. Solution: add a genre_count field that stores the number of values

Adding a Count Field and Solving It with a bool Query

Improve the Elasticsearch data model as needed, driven by business requirements
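The "contains rather than equals" behaviour and the count-field workaround can be mimicked in Python. The `term_matches` helper is a hypothetical stand-in for a term query on a multi-valued field, and the two movies mirror the demo data.

```python
MOVIES = [
    {"title": "Father of the Bride Part II", "genre": ["Comedy"], "genre_count": 1},
    {"title": "Dave", "genre": ["Comedy", "Romance"], "genre_count": 2},
]


def term_matches(doc, field, value):
    """A term query on a multi-valued field matches when ANY value equals the term."""
    values = doc[field]
    if not isinstance(values, list):
        values = [values]
    return value in values


# "Contains Comedy": both movies match.
contains_comedy = [m["title"] for m in MOVIES if term_matches(m, "genre", "Comedy")]

# "Is exactly Comedy": additionally require genre_count == 1.
only_comedy = [
    m["title"]
    for m in MOVIES
    if term_matches(m, "genre", "Comedy") and term_matches(m, "genre_count", 1)
]
print(contains_comedy, only_comedy)
```

The extra field turns an "exactly these values" question into two ordinary term conditions, which is what the bool query in the CodeDemo expresses.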

Filter Context: Does Not Affect Scoring

Query Context: Affects Scoring

Nested bool Queries

The Structure of a Query Affects Relevance Scoring

Competing fields at the same level carry the same weight
Nesting bool queries changes how much each clause contributes to the score

Controlling Field Boosting

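Boosting is one way to control relevance
- It can be applied to an index, a field, or a query sub-clause
Meaning of the boost parameter
- When boost > 1, the relative weight of the score is raised
- When 0 < boost < 1, the relative weight of the score is lowered
- When boost < 0, the clause contributes a negative score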

Not Quite Not

Requirement: product information about Apple (the company) should rank first
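A boosting query keeps documents that match the negative clause but multiplies their score by `negative_boost`, instead of dropping them as `must_not` would. The sketch below models that on the three news documents from the CodeDemo. The positive score is approximated with a textbook BM25 TF component, and the tokenizer and function names are assumptions for illustration.

```python
import re

NEWS = {
    1: "Apple Mac",
    2: "Apple iPad",
    3: "Apple employee like Apple Pie and Apple Juice",
}


def tokenize(text):
    """Rough stand-in for the standard analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def bm25_tf(freq, doc_len, avg_len, k1=1.2, b=0.75):
    """Textbook BM25 TF component (IDF omitted: it is identical for every match)."""
    return freq * (k1 + 1) / (freq + k1 * (1 - b + b * doc_len / avg_len))


def boosting_rank(positive, negative, negative_boost, docs):
    """Rank docs matching `positive`; demote those that also contain `negative`."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    avg_len = sum(len(t) for t in tokens.values()) / len(tokens)
    scores = {}
    for doc_id, toks in tokens.items():
        freq = toks.count(positive)
        if freq == 0:
            continue  # the positive clause is required
        score = bm25_tf(freq, len(toks), avg_len)
        if negative in toks:
            score *= negative_boost  # demoted, not excluded
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


print(boosting_rank("apple", "pie", 1.0, NEWS))  # no demotion
print(boosting_rank("apple", "pie", 0.5, NEWS))  # "pie" documents pushed down
```

With no demotion the apple-pie document comes first because "apple" occurs three times in it. With `negative_boost` set to 0.5 it is still returned, but below the two product documents.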

Section Review

Query Context vs. Filter Context
Bool Query: combining more conditions
Query structure and relevance scoring
How to control query precision
- Boosting & Boosting Query

CodeDemo

POST /products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10,"avaliable":true,"date":"2018-01-01", "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20,"avaliable":true,"date":"2019-01-01", "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30,"avaliable":true, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30,"avaliable":false, "productID" : "QQPX-R-3956-#aD8" }

# Basic syntax
POST /products/_search
{
  "query": {
    "bool": {
      "must": {
        "term": { "price": "30" }
      },
      "filter": {
        "term": { "avaliable": "true" }
      },
      "must_not": {
        "range": {
          "price": { "lte": 10 }
        }
      },
      "should": [
        { "term": { "productID.keyword": "JODL-X-1937-#pV7" } },
        { "term": { "productID.keyword": "XHDK-A-1293-#fJ3" } }
      ],
      "minimum_should_match": 1
    }
  }
}

# Change the data model by adding a field, to handle "array contains" rather than exact match
POST /newmovies/_bulk
{ "index": { "_id": 1 }}
{ "title" : "Father of the Bride Part II","year":1995, "genre":"Comedy","genre_count":1 }
{ "index": { "_id": 2 }}
{ "title" : "Dave","year":1993,"genre":["Comedy","Romance"],"genre_count":2 }

# must: scored
POST /newmovies/_search
{
  "query": {
    "bool": {
      "must": [
        {"term": {"genre.keyword": {"value": "Comedy"}}},
        {"term": {"genre_count": {"value": 1}}}
      ]
    }
  }
}

# filter: not scored, so the resulting score is 0
POST /newmovies/_search
{
  "query": {
    "bool": {
      "filter": [
        {"term": {"genre.keyword": {"value": "Comedy"}}},
        {"term": {"genre_count": {"value": 1}}}
      ]
    }
  }
}

# Filtering Context
POST _search
{
  "query": {
    "bool": {
      "filter": {
        "term": { "avaliable": "true" }
      },
      "must_not": {
        "range": {
          "price": { "lte": 10 }
        }
      }
    }
  }
}

# Query Context
POST /products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10,"avaliable":true,"date":"2018-01-01", "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20,"avaliable":true,"date":"2019-01-01", "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30,"avaliable":true, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30,"avaliable":false, "productID" : "QQPX-R-3956-#aD8" }

POST /products/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "productID.keyword": { "value": "JODL-X-1937-#pV7" } } },
        { "term": { "avaliable": { "value": true } } }
      ]
    }
  }
}

# Nesting, to implement "should not" logic
POST /products/_search
{
  "query": {
    "bool": {
      "must": {
        "term": { "price": "30" }
      },
      "should": [
        {
          "bool": {
            "must_not": {
              "term": { "avaliable": "false" }
            }
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}

# Control the precision
POST _search
{
  "query": {
    "bool": {
      "must": {
        "term": { "price": "30" }
      },
      "filter": {
        "term": { "avaliable": "true" }
      },
      "must_not": {
        "range": {
          "price": { "lte": 10 }
        }
      },
      "should": [
        { "term": { "productID.keyword": "JODL-X-1937-#pV7" } },
        { "term": { "productID.keyword": "XHDK-A-1293-#fJ3" } }
      ],
      "minimum_should_match": 2
    }
  }
}

POST /animals/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "text": "brown" }},
        { "term": { "text": "red" }},
        { "term": { "text": "quick" }},
        { "term": { "text": "dog" }}
      ]
    }
  }
}

# The original repeated "brown" in the nested clause; "red" is assumed here to mirror the flat query above
POST /animals/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "text": "quick" }},
        { "term": { "text": "dog" }},
        {
          "bool": {
            "should": [
              { "term": { "text": "brown" }},
              { "term": { "text": "red" }}
            ]
          }
        }
      ]
    }
  }
}

DELETE blogs
POST /blogs/_bulk
{ "index": { "_id": 1 }}
{"title":"Apple iPad", "content":"Apple iPad,Apple iPad" }
{ "index": { "_id": 2 }}
{"title":"Apple iPad,Apple iPad", "content":"Apple iPad" }

# The boost value for content is missing in the original; 1 (the default) is assumed here
POST blogs/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {
          "title": {
            "query": "apple,ipad",
            "boost": 1.1
          }
        }},
        {"match": {
          "content": {
            "query": "apple,ipad",
            "boost": 1
          }
        }}
      ]
    }
  }
}

DELETE news
POST /news/_bulk
{ "index": { "_id": 1 }}
{ "content":"Apple Mac" }
{ "index": { "_id": 2 }}
{ "content":"Apple iPad" }
{ "index": { "_id": 3 }}
{ "content":"Apple employee like Apple Pie and Apple Juice" }

POST news/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { "content": "apple" }
      }
    }
  }
}

POST news/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { "content": "apple" }
      },
      "must_not": {
        "match": { "content": "pie" }
      }
    }
  }
}

POST news/_search
{
  "query": {
    "boosting": {
      "positive": {
        "match": { "content": "apple" }
      },
      "negative": {
        "match": { "content": "pie" }
      },
      "negative_boost": 0.5
    }
  }
}

Elasticsearch Study Notes, Day 12
