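Preface

Most of the time Elasticsearch returns the documents we are looking for, but occasionally the order is not what we expect: a loosely related document is ranked ahead of a more relevant one. How do we track down the cause in that situation?

That requires some understanding of Elasticsearch's scoring algorithm. Before Elasticsearch 5 the default was TF-IDF; from version 5 onward it is BM25.

Setting aside the mathematical derivation of these formulas, have you ever wondered how Elasticsearch actually turns them into a relevance score, and what each parameter means?

Let's first look at what TF-IDF and BM25 are.

TF-IDF

TF

TF stands for Term Frequency. Put simply, it is the number of times a term occurs in a document divided by the total number of terms in that document.

TF = occurrences of the term in the document / total number of terms in the document

How, then, do we measure the relevance between a query and a result document? The simplest approach is to add up the TF of each query term in the document.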
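As a rough illustration of this naive measure, here is a minimal Python sketch. The `tokenize` helper is an assumption that only approximates the standard analyzer (lowercasing and splitting on non-alphanumeric characters), and the function names are hypothetical, not part of Elasticsearch.

```python
import re

def tokenize(text):
    # Rough stand-in for the standard analyzer (an assumption):
    # lowercase the text and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())

def term_frequency(term, text):
    # Occurrences of the term divided by the total number of terms.
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return tokens.count(term) / len(tokens)

title = "To school, everywhere is the white one, school"

tf_school = term_frequency("school", title)  # 2 occurrences out of 8 tokens
tf_the = term_frequency("the", title)        # 1 occurrence out of 8 tokens

# Naive relevance for the two-term query "the school": the sum of the TFs.
query_score = tf_school + tf_the
print(tf_school, tf_the, query_score)
```

Note that a very common word such as "the" contributes to the sum just like a meaningful term does.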

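At this point you have probably spotted a problem: function words such as 的, 地 and 得 in Chinese (or "the" and "of" in English) occur many times in almost every document, so their TF should not count for much.

IDF

Besides removing such stop words outright, there is another tool: the inverse document frequency, IDF (Inverse Document Frequency), which measures how important a term is after analysis.

Put simply:

IDF = log(total number of documents / number of documents that contain the term)

It lowers the weight of common terms and raises the weight of rare ones.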
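The effect can be seen with a small Python sketch of this simplified IDF. The four-document toy corpus and the function name are made up for illustration, and this is the textbook formula, not the exact variant Elasticsearch uses.

```python
import math

def inverse_document_frequency(term, documents):
    # documents: a list of sets, each holding the terms of one document.
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        return 0.0  # the term is absent from the corpus; avoid dividing by zero
    return math.log(len(documents) / containing)

# Hypothetical toy corpus, already analyzed into terms.
docs = [
    {"the", "school", "is", "white"},
    {"the", "snow", "is", "heavy"},
    {"the", "path", "was", "curved"},
    {"home", "school"},
]

idf_the = inverse_document_frequency("the", docs)        # in 3 of 4 documents
idf_school = inverse_document_frequency("school", docs)  # in 2 of 4 documents
idf_path = inverse_document_frequency("path", docs)      # in 1 of 4 documents
print(idf_the, idf_school, idf_path)
```

The rarer the term, the larger its IDF, so "path" outweighs "school", which in turn outweighs "the".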

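BM25

BM stands for Best Match; the 25 is commonly explained as the 25th iteration of the algorithm.

Elasticsearch has used BM25 since version 5. The formula looks intimidating, but deriving it is not our goal here. What matters is understanding how the algorithm works, so that we can troubleshoot scoring problems in search. The main symbols in the formula are:

D: the document
Q: the query
k1: a free parameter, 1.2 by default
b: controls how much document length affects relevance, 0.75 by default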
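For reference, a commonly cited form of BM25 using the parameters above is written out below. The extra notation is an assumption introduced here: q_i is the i-th query term, f(q_i, D) is its frequency in D, |D| is the length of D in terms, and avgdl is the average document length.

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
```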

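Don't worry about these parameters yet; let's see them in practice.

In practice

Creating the index

Create an index named test with one shard and the default standard analyzer, then insert 7 documents.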

PUT test
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "content": {
        "type": "text"
      },
      "remark": {
        "type": "text"
      }
    }
  },
  "settings": {
    "number_of_shards": 1
  }
}

PUT test/_bulk
{"index":{"_id":1}}
{"title":"To school, everywhere is the white one, school","content":" the snow is still one child to jump from the sky"}
{"index":{"_id":2}}
{"title":"First of the big brothers and sisters are braving the cold","content":"braving heavy snow snow yet"}
{"index":{"_id":3}}
{"title":"Behind them there was a curved path","content":" junior high school English composition"}
{"index":{"_id":4}}
{"title":" we walked convenient","content":"small writing on the National Day is not smooth"}
{"index":{"_id":5}}
{"title":"but they must be tired","content":"very hard."}
{"index":{"_id":6}}
{"title":"Home school","content":"Iove made several small partner"}
{"index":{"_id":7}}
{"remark":"remark school"}

Run a query for documents whose title field matches school.

POST test/_search
{
  "query": {
    "match": {
      "title": "school"
    }
  }
}

Result

id	score	title
6	1.4157268	Home school
1	1.2943789	To school, everywhere is the white one, school

Take the second hit, document id 1, as an example (only the title field is shown here). How is its score of 1.2943789 computed? Use explain: true to break down the scoring.

Score analysis

POST test/_search
{
  "explain": true, 
  "query": {
    "match": {
      "title": "school"
    }
  }
}

Only the _explanation of the second hit is shown here.

        "_explanation" : {
          "value" : 1.2943789,
          ......
          "details" : [
            {
              "value" : 1.2943789,
              "description" : "score(freq=2.0), computed as boost * idf * tf from:",
              "details" : [
                {
                  "value" : 2.2,
                  "description" : "boost",
                },
                {
                  "value" : 1.0296195,
                  "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                  ......
                },
                {
                  "value" : 0.5714286,
                  "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                  ......
                }
              ]
            }
          ]
        }

With explain: true enabled, each hit gains an _explanation section, which also states that score = boost * idf * tf.

Part 1: boost

            {
              "value" : 2.2,
              "description" : "boost",
              "details" : [ ]
            }

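The first part of _explanation is the boost. Its value here is 2.2 because the query does not specify a boost; this matches (k1 + 1) with the default k1 = 1.2, multiplied by the default query boost of 1. If you specify a boost of X, this value becomes 2.2 times X.

Part 2: IDF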

            {
              "value" : 1.0296195,
              "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
              "details" : [
                {
                  "value" : 2,
                  "description" : "n, number of documents containing term",
                  "details" : [ ]
                },
                {
                  "value" : 6,
                  "description" : "N, total number of documents with field",
                  "details" : [ ]
                }
              ]
            }

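n: the number of documents whose title field contains the term school
N: the total number of documents that have a title field

This is easy to check: only documents 6 and 1 contain school in the title, so n = 2. Document 7 has no title field, so N, the number of documents with a title field, is 6.

idf = log(1 + (N - n + 0.5) / (n + 0.5))
    = log(1 + (6 - 2 + 0.5) / (2 + 0.5))
    = 1.0296195

The logarithm here is the natural logarithm (base e).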
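The arithmetic can be checked with a few lines of Python. This is only a sketch of the formula printed in the explain output; the function name is hypothetical, and Python computes in double precision, so the last digit may differ from the value Elasticsearch reports.

```python
import math

def bm25_idf(n, N):
    # idf = log(1 + (N - n + 0.5) / (n + 0.5)), natural logarithm
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

# n = 2 documents contain "school" in title; N = 6 documents have a title field.
idf = bm25_idf(n=2, N=6)
print(idf)  # approximately 1.02962
```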

Part 3: TF

	{
              "value" : 0.5714286,
              "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
              "details" : [
                {
                  "value" : 2.0,
                  "description" : "freq, occurrences of term within document",
                  "details" : [ ]
                },
                {
                  "value" : 1.2,
                  "description" : "k1, term saturation parameter",
                  "details" : [ ]
                },
                {
                  "value" : 0.75,
                  "description" : "b, length normalization parameter",
                  "details" : [ ]
               },
                {
                  "value" : 8.0,
                  "description" : "dl, length of field",
                  "details" : [ ]
                },
                {
                  "value" : 6.0,
                  "description" : "avgdl, average length of field",
                  "details" : [ ]
                }
              ]
            }

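freq: the number of times the term occurs in the field of this document
k1: term saturation parameter, 1.2 by default
b: length normalization parameter, 0.75 by default
dl: the length of the title field after analysis, in tokens
avgdl: the average length of the title field after analysis

We can verify this with the analyze API by looking at how the title field is tokenized.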

POST test/_analyze
{
  "field": "title",
  "text": ["To school, everywhere is the white one, school"]
}

Result (the other attributes are omitted; only the token values are shown)

{
  "tokens" : [
    {
      "token" : "to",
    },
    {
      "token" : "school",
    },
    {
      "token" : "everywhere",
    },
    {
      "token" : "is",
    },
    {
      "token" : "the",
    },
    {
      "token" : "white",
    },
    {
      "token" : "one",
    },
    {
      "token" : "school",
    }
  ]
}

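From this it is clear that:

school occurs twice, so freq = 2
the analyzed title has 8 tokens, so dl = 8

How, then, is avgdl, the average field length, computed?

avgdl is the total number of tokens in the title field across all documents, divided by the number of documents that have a title field.

A simple way to check this is to pass the title content of all documents to the analyze API together and count the resulting tokens.

The result contains 36 tokens. Dividing by the 6 documents that have a title field gives avgdl = 6.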
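The same count can be reproduced offline with a short Python sketch. The `tokenize` helper is an assumption that only approximates the standard analyzer (lowercasing and splitting on non-alphanumeric characters); for these simple English titles it should yield the same tokens.

```python
import re

# The six title values from the test index (document 7 has no title field).
titles = [
    "To school, everywhere is the white one, school",
    "First of the big brothers and sisters are braving the cold",
    "Behind them there was a curved path",
    " we walked convenient",
    "but they must be tired",
    "Home school",
]

def tokenize(text):
    # Rough stand-in for the standard analyzer (an assumption).
    return re.findall(r"[a-z0-9]+", text.lower())

lengths = [len(tokenize(title)) for title in titles]  # dl of each document
total_tokens = sum(lengths)
avgdl = total_tokens / len(titles)
print(lengths, total_tokens, avgdl)
```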

tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
   = 2 / (2 + 1.2 * (1 - 0.75 + 0.75 * 8 / 6)) 
   = 0.57142857142857

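Total score

No boost is specified in the query, so the default value 2.2 applies.

score = boost * idf * tf = 2.2 * 1.0296195 * 0.57142857142857 = 1.2943788

(The last digit differs from the reported 1.2943789 because of floating-point precision, but the values agree.)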
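Putting the three parts together, the following Python sketch recomputes the scores of both hits. The function names are hypothetical, and treating the boost as the query boost times (k1 + 1) is an assumption that matches the default value 2.2 seen in the explain output. Results are in double precision, so the last digit may differ from Elasticsearch's.

```python
import math

def bm25_idf(n, N):
    # idf = log(1 + (N - n + 0.5) / (n + 0.5))
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    # tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

def bm25_score(freq, dl, avgdl, n, N, query_boost=1.0, k1=1.2, b=0.75):
    # Assumption: the explained boost is query_boost * (k1 + 1), i.e. 2.2 by default.
    boost = query_boost * (k1 + 1)
    return boost * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl, k1, b)

# Document 1: "school" occurs twice in an 8-token title.
score_doc1 = bm25_score(freq=2, dl=8, avgdl=6, n=2, N=6)
# Document 6: "school" occurs once in a 2-token title.
score_doc6 = bm25_score(freq=1, dl=2, avgdl=6, n=2, N=6)
print(score_doc1, score_doc6)
```

Document 6 scores higher even though the term occurs only once, because its title is much shorter than the average.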

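Summary

As the analysis shows, even a complicated formula is just a combination of individual parameters. Once we understand the scoring rules, we can reason confidently about ranking problems in day-to-day work.

Further notes

All the statistics above assume the index has a single shard. With many shards, scoring is affected, because each shard computes its statistics locally.

This time we set the number of shards to 10 and use the routing parameter to place the seven documents on seven different shards.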

PUT test2
{
  "mappings": {
    "properties": {
      "title":{
        "type": "text"
      },
      "content":{
        "type": "text"
      },
      "remark":{
        "type": "text"
      }
    }
  },
  "settings": {
    "number_of_shards": 10
  }
}

POST test2/_doc/1?routing=1
{"title":"To school, everywhere is the white one, school","content":" the snow is still one child to jump from the sky"}
POST test2/_doc/2?routing=2
{"title":"First of the big brothers and sisters are braving the cold","content":"braving heavy snow snow yet"}
POST test2/_doc/3?routing=3
{"title":"Behind them there was a curved path","content":" junior high school English composition"}
POST test2/_doc/4?routing=4
{"title":" we walked convenient","content":"small writing on the National Day is not smooth"}
POST test2/_doc/5?routing=5
{"title":"but they must be tired","content":"very hard."}
POST test2/_doc/6?routing=6
{"title":"Home school","content":"Iove made several small partner"}
POST test2/_doc/7?routing=7
{"remark":"remark school"}

Run the query again.

POST test2/_search
{
  "explain": true, 
  "query": {
    "match": {
      "title": "school"
    }
  }
}

Again we take the explain output of the second hit as an example.

        ......
            {
              "value" : 0.39556286,
              "description" : "score(freq=2.0), computed as boost * idf * tf from:",
              "details" : [
                {
                  "description" : "boost",
                },
                {
                  "value" : 0.2876821,
                  "description" : "idf",
                  "details" : [
                    {
                      "value" : 1,
                      "description" : "n, number of documents containing term",
                    },
                    {
                      "value" : 1,
                      "description" : "N, total number of documents with field",
                    }
                  ]
                },
                {
                  "value" : 0.625,
                  "description" : "tf",
                  "details" : [
                    {
                      "value" : 2.0,
                      "description" : "freq, occurrences of term within document",
                    },
                    {
                      "value" : 1.2,
                      "description" : "k1, term saturation parameter",
                    },
                    {
                      "value" : 0.75,
                      "description" : "b, length normalization parameter",
                    },
                    {
                      "value" : 8.0,
                      "description" : "dl, length of field",
                    },
                    {
                      "value" : 8.0,
                      "description" : "avgdl, average length of field",
                    }
                  ]
                }
              ]
            }

Both n and N are 1 here, because this document is the only one on its shard.

For the same reason, avgdl and dl are both 8: they are computed within the same shard.
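To see how much the shard-local statistics change the result, here is a Python sketch that scores the same document under both sets of statistics. The function and dictionary names are hypothetical, and the boost is assumed to be the query boost times (k1 + 1), which matches the default 2.2.

```python
import math

def bm25_score(freq, dl, stats, query_boost=1.0, k1=1.2, b=0.75):
    # stats carries the statistics the shard (or index) uses: n, N and avgdl.
    n, N, avgdl = stats["n"], stats["N"], stats["avgdl"]
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return query_boost * (k1 + 1) * idf * tf

# Statistics seen by the shard that holds only document 1.
shard_local = {"n": 1, "N": 1, "avgdl": 8.0}
# Statistics from the earlier single-shard index.
single_shard_index = {"n": 2, "N": 6, "avgdl": 6.0}

local_score = bm25_score(freq=2, dl=8, stats=shard_local)
index_wide_score = bm25_score(freq=2, dl=8, stats=single_shard_index)
print(local_score, index_wide_score)
```

The document and the query are identical in both cases; only the statistics differ, and the score drops from about 1.29 to about 0.40.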

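So when the data volume is small, avoid using too many shards; a single shard is often enough. If you need accurate statistics, you can pass search_type=dfs_query_then_fetch so that Elasticsearch gathers the statistics from all shards before scoring, at the cost of lower performance.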

POST test2/_search?search_type=dfs_query_then_fetch
{
  ......
}
