Elasticsearch Core Knowledge (55)
Getting to Know the Search Engine: Relevance Scoring and the TF/IDF Algorithm Explained (Key Topic)
Algorithm overview
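The relevance score algorithm, put simply, computes how closely a piece of text in the index matches the search text.
Elasticsearch uses the term frequency/inverse document frequency algorithm, TF/IDF for short. (Since version 5.0 the default similarity is BM25, a refinement of TF/IDF; the explain output further down shows its formulas.)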
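Term frequency: how many times each term of the search text appears in the field text. The more occurrences, the more relevant the document.
- Search request: hello world
- doc1: hello you, and world is very good
- doc2: hello, how are you
doc1 is more relevant: it contains both hello and world, while doc2 contains only hello.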
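The term frequency comparison above can be sketched in a few lines of Python. This is only an illustration: the function name term_frequency and the regex-based tokenizer are assumptions made for this example, not how Elasticsearch analyzes text.

```python
import re
from collections import Counter


def term_frequency(query, field_text):
    """Return {query term: number of occurrences in field_text}."""
    # Lowercase and split on non-word characters; a rough stand-in for an analyzer.
    tokens = re.findall(r"\w+", field_text.lower())
    counts = Counter(tokens)
    return {term: counts[term] for term in query.lower().split()}


doc1_tf = term_frequency("hello world", "hello you, and world is very good")
doc2_tf = term_frequency("hello world", "hello, how are you")
print(doc1_tf)
print(doc2_tf)
```

doc1 has a non-zero count for both query terms, while doc2 has none for world, which is why doc1 ranks higher on term frequency alone.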
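Inverse document frequency: how many times each term of the search text appears across all documents in the index. The more occurrences, the less relevant the term.
- Search request: hello world
- doc1: hello, today is very good
- doc2: hi world, how are you
Summary: suppose the index holds 10,000 documents, and across all of them the word hello appears 1,000 times while world appears only 100 times. world is the rarer term, so a match on it counts for more, and doc2 is more relevant.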
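To put numbers on the summary above, here is a minimal sketch using the textbook form idf = ln(N / df). Two assumptions are made for the example: the occurrence counts (1,000 and 100) are treated as the number of documents containing each term, and the formula is the plain textbook one rather than the exact expression Elasticsearch evaluates.

```python
import math


def inverse_document_frequency(doc_count, doc_freq):
    """Textbook IDF: the rarer the term, the larger the value."""
    return math.log(doc_count / doc_freq)


DOC_COUNT = 10_000  # documents in the index, from the example above
idf_hello = inverse_document_frequency(DOC_COUNT, 1_000)
idf_world = inverse_document_frequency(DOC_COUNT, 100)
print("hello:", idf_hello)
print("world:", idf_world)
```

world gets the larger weight, so the document that matches world (doc2) comes out ahead of the one that matches only hello.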
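Field-length norm: the length of the field. The longer the field, the weaker the relevance.
- Search request: hello world
- doc1: { "title": "hello article", "content": "blablabla (10,000 words)" }
- doc2: { "title": "my article", "content": "blablabla (10,000 words), hi world" }
Assume hello and world appear equally often across the whole index. doc1 is more relevant, because its match is in the much shorter title field.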
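The effect of field length can be illustrated with the classic norm 1 / sqrt(number of terms in the field). This is a sketch under assumptions: the function name field_length_norm is made up for this example, the term counts (2 for the title, 10,002 for the content) are read off the example above, and the BM25 explain output further down handles field length differently, through fieldLength / avgFieldLength.

```python
import math


def field_length_norm(num_terms):
    """Classic length norm: shorter fields get a larger value."""
    return 1.0 / math.sqrt(num_terms)


title_norm = field_length_norm(2)          # "hello article"
content_norm = field_length_norm(10_002)   # 10,000 words plus "hi world"
print("title:", title_norm)
print("content:", content_norm)
```

A single matching term in a two-word title carries far more weight than a single matching term buried in a field of roughly ten thousand words.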
How _score is calculated
GET /weblalala/article/_search?explain
{
"query": {
"match": {
"title": "first"
}
}
}
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.25811607,
"hits": [
{
"_shard": "[weblalala][3]",
"_node": "w85Pu0ftS7-vrZOwmdAE9g",
"_index": "weblalala",
"_type": "article",
"_id": "1",
"_score": 0.25811607,
"_source": {
"title": "first article",
"content": "this is my first article",
"post_date": "2017-01-01",
"author_id": 110
},
"_explanation": {
"value": 0.25811607,
"description": "sum of:",
"details": [
{
"value": 0.25811607,
"description": "weight(title:first in 0) [PerFieldSimilarity], result of:",
"details": [
{
# Combined value: TF * IDF
"value": 0.25811607,
"description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
# IDF calculation
{
"value": 0.2876821,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 1,
"description": "docFreq",
"details": []
},
{
"value": 1,
"description": "docCount",
"details": []
}
]
},
# TF calculation (tfNorm, which also factors in field length)
{
"value": 0.89722675,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 2,
"description": "avgFieldLength",
"details": []
},
{
"value": 2.56,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
},
{
"value": 0,
"description": "match on required clause, product of:",
"details": [
{
"value": 0,
"description": "# clause",
"details": []
},
{
"value": 1,
"description": "*:*, product of:",
"details": [
{
"value": 1,
"description": "boost",
"details": []
},
{
"value": 1,
"description": "queryNorm",
"details": []
}
]
}
]
}
]
}
}
]
}
}
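The numbers in the explain output above can be checked by hand. The sketch below copies the two formulas from the "description" strings and plugs in the values listed under "details"; the function names bm25_idf and bm25_tf_norm are made up for this example. Small differences in the last digits are expected, since the output above is shown in single precision.

```python
import math


def bm25_idf(doc_count, doc_freq):
    # log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf_norm(freq, k1, b, field_length, avg_field_length):
    # (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )


# Values taken from the "details" entries of the explain output.
idf = bm25_idf(doc_count=1, doc_freq=1)
tf_norm = bm25_tf_norm(freq=1.0, k1=1.2, b=0.75,
                       field_length=2.56, avg_field_length=2.0)
score = idf * tf_norm

print("idf    =", idf)      # compare with 0.2876821
print("tfNorm =", tf_norm)  # compare with 0.89722675
print("score  =", score)    # compare with 0.25811607
```

The final _score is simply the product of the two parts, which matches the "product of:" node in the output.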
Analyzing how a document was matched
GET /weblalala/article/1/_explain
{
"query": {
"match": {
"title": "first"
}
}
}
{
"_index": "weblalala",
"_type": "article",
"_id": "1",
"matched": true,
"explanation": {
"value": 0.25811607,
"description": "sum of:",
"details": [
{
"value": 0.25811607,
"description": "weight(title:first in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.25811607,
"description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 0.2876821,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 1,
"description": "docFreq",
"details": []
},
{
"value": 1,
"description": "docCount",
"details": []
}
]
},
{
"value": 0.89722675,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 2,
"description": "avgFieldLength",
"details": []
},
{
"value": 2.56,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
},
{
"value": 0,
"description": "match on required clause, product of:",
"details": [
{
"value": 0,
"description": "# clause",
"details": []
},
{
"value": 1,
"description": "*:*, product of:",
"details": [
{
"value": 1,
"description": "boost",
"details": []
},
{
"value": 1,
"description": "queryNorm",
"details": []
}
]
}
]
}
]
}
}
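The nested explanation is easier to read once it is flattened. Below is a minimal sketch that walks the tree and looks up a node by the start of its description. The dictionary is a hand-trimmed copy of the response above (leaf details removed), and the helper names iter_nodes and find_value are assumptions made for this example, not part of any Elasticsearch client.

```python
# Hand-trimmed copy of the "explanation" object from the response above.
explanation = {
    "value": 0.25811607,
    "description": "sum of:",
    "details": [{
        "value": 0.25811607,
        "description": "weight(title:first in 0) [PerFieldSimilarity], result of:",
        "details": [{
            "value": 0.25811607,
            "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
            "details": [
                {"value": 0.2876821,
                 "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                 "details": []},
                {"value": 0.89722675,
                 "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                 "details": []},
            ],
        }],
    }],
}


def iter_nodes(node, depth=0):
    """Yield (depth, value, description) for every node, parents first."""
    yield depth, node["value"], node["description"]
    for child in node.get("details", []):
        yield from iter_nodes(child, depth + 1)


def find_value(node, prefix):
    """Return the value of the first node whose description starts with prefix."""
    for _, value, description in iter_nodes(node):
        if description.startswith(prefix):
            return value
    return None


for depth, value, description in iter_nodes(explanation):
    # Keep only the first line of each description for a compact outline.
    print("  " * depth + f"{value}  {description.splitlines()[0]}")

idf_value = find_value(explanation, "idf")
tf_norm_value = find_value(explanation, "tfNorm")
```

Multiplying idf_value by tf_norm_value gives the top-level value again, up to single-precision rounding in the response.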