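1. Relevance score algorithm

Every search result returned by Elasticsearch carries a relevance score, which is based on TF/IDF and the vector space model. Under the hood, Elasticsearch calls Lucene's practical scoring function to compute it. (This describes the classic TF/IDF similarity; since version 5.0 Elasticsearch uses BM25 by default.) The computation has three core steps:

The boolean model filters the candidate documents.
The TF/IDF algorithm scores each individual term.
The vector space model combines the term scores into the final relevance score.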

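1.1. Boolean model

Elasticsearch offers various document-filtering constructs, such as the must / should / must_not clauses of the bool query. Their core purpose here is to narrow the result set to the documents that contain the search keywords, so that the later score calculation touches fewer documents and runs faster. This step only filters; it computes no score.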
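
The filtering step can be pictured with a toy sketch. This is not how Lucene implements it (Lucene uses an inverted index); the tokenizer and the two sample documents are assumptions made only for illustration.

```python
# Toy boolean-model filter: keep only documents that contain at least
# one query term. No score is computed at this stage.
docs = {
    "doc1": "hello you, and world is very good",
    "doc2": "hi, how are you",
}

def tokenize(text):
    # Naive analyzer (an assumption): split on whitespace, strip punctuation, lowercase.
    return [w.strip(",.!?").lower() for w in text.split()]

def boolean_filter(query, documents):
    terms = set(tokenize(query))
    # A document survives if it shares at least one term with the query.
    return [doc_id for doc_id, text in documents.items()
            if terms & set(tokenize(text))]

candidates = boolean_filter("hello world", docs)
print(candidates)
```

Only the surviving candidates would be passed on to the scoring steps.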

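1.2. TF/IDF algorithm

The term frequency/inverse document frequency algorithm is the foundation of relevance scoring in Elasticsearch. It computes the score of a single term in a document. For example, suppose the search keywords are "hello world" and the index contains the following document: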

# doc1
hello you, and world is very good

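TF/IDF computes one doc1 score for the term "hello" and another doc1 score for the term "world". It does not care about the combined score of the whole phrase "hello world" in doc1. The core ideas:

term frequency: how often the search term appears in a single document. The more often it appears, the more relevant that document is;
inverse document frequency: how often the search term appears across all documents in the index. The rarer the term, the more relevant the documents that contain it;
Field-length norm: the shorter the field that contains the term, the more relevant the document.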

term frequency

The search keywords are "hello world", and the index contains the following two documents:

# doc1
hello you, and world is very good
# doc2
hello, how are you

After analysis, "hello world" is split into two terms, "hello" and "world". doc1 contains both "hello" and "world", while doc2 contains only "hello", so doc1 is more relevant.
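
As a rough sketch, Lucene's classic similarity defines tf as the square root of the number of times a term occurs in the field. The tokenizer below is a simplifying assumption, not the real Elasticsearch analyzer.

```python
import math

def tf(term, field_text):
    # Classic Lucene similarity: tf = sqrt(occurrences of the term in the field).
    tokens = [w.strip(",.!?").lower() for w in field_text.split()]
    return math.sqrt(tokens.count(term))

doc1 = "hello you, and world is very good"
doc2 = "hello, how are you"

# Per-term tf for each document; a missing term scores 0.
tf_scores = {
    name: {term: tf(term, text) for term in ("hello", "world")}
    for name, text in (("doc1", doc1), ("doc2", doc2))
}
print(tf_scores)
```

doc1 gets a non-zero tf for both terms, while doc2 gets one only for "hello".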

inverse document frequency

The search request is still "hello world". Suppose the index contains 1000 documents in total, of which we list only two:

# doc1
hello, today is very good
# doc2
hi world, how are you

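By term frequency alone, doc1 and doc2 would be equally relevant. But if "hello" appears in 800 of the 1000 documents in the index while "world" appears in only 200, doc2 is more relevant than doc1.
The more documents a term appears in, the less it contributes to the relevance of any document containing it; a rarer term discriminates better between documents.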
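
The numbers above can be plugged into the idf formula of Lucene's classic similarity, idf = 1 + ln(numDocs / (docFreq + 1)). The sketch below is only an illustration of that formula with the document counts from this example.

```python
import math

def idf(doc_freq, num_docs):
    # Classic Lucene similarity: idf = 1 + ln(numDocs / (docFreq + 1)).
    return 1 + math.log(num_docs / (doc_freq + 1))

num_docs = 1000
idf_hello = idf(800, num_docs)  # common term, low idf
idf_world = idf(200, num_docs)  # rarer term, higher idf
print(round(idf_hello, 4), round(idf_world, 4))
```

Because "world" is rarer, its idf is higher, which is why doc2 ends up ahead of doc1.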

Field-length norm

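The search request is still "hello world". Suppose the index contains 2 documents:

# doc1
{ "title": "hello article", "content": "babaaba (10,000 words)" }
# doc2
{ "title": "my article", "content": "blablabala (10,000 words), hi world" }

In doc1, "hello" appears in the title field; in doc2, "world" appears in the content field. The title field is far shorter than the content field, so doc1 is more relevant than doc2.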
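
A minimal sketch of the field-length norm in Lucene's classic similarity, norm = 1 / sqrt(numTerms), using the field lengths from this example (a 2-term title and a roughly 10,000-term content field):

```python
import math

def field_length_norm(num_terms):
    # Classic Lucene similarity: norm = 1 / sqrt(number of terms in the field).
    return 1 / math.sqrt(num_terms)

title_norm = field_length_norm(2)         # "hello article"
content_norm = field_length_norm(10_000)  # a 10,000-word body
print(round(title_norm, 4), round(content_norm, 4))
```

A match in the short title field is weighted far more heavily than a match buried in the long content field.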

Example

Use the explain parameter to see how Elasticsearch computes the score for a query:

curl -X GET "ip:port/test_index/_search?explain=true" -H 'Content-Type: application/json' -d '
{
  "query": {
    "match": {
      "test_field": "test hello"
    }
  }
}'

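Use the following request to see how a particular document is matched by a query. The example below explains how the document with id 6 is matched:

curl -X GET "ip:port/test_index/_explain/6" -H 'Content-Type: application/json' -d '
{
  "query": {
    "match": {
      "test_field": "test hello"
    }
  }
}'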

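1.3. Vector space model algorithm

The TF/IDF algorithm only scores a single term in a document. The vector space model computes the combined score of the whole search phrase in each doc. The core idea is to build two vectors and compare them to get the final score:

query vector: the score of each term across all documents.
doc vector: the score of each term in one particular document.
Compute the angle between each doc vector and the query vector (plain linear-algebra vector arithmetic); the smaller the angle, the more relevant the doc.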

query vector

The query vector is derived from the TF/IDF results and holds each term's overall score across all documents. Suppose the index contains 3 documents and the search keywords are "hello world":

# doc1
hello, today is very good
# doc2
hi world, how are you
# doc3
hello world

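For the term "hello", the vector space model computes a score across all docs, say 2; for the term "world", the score across all docs is 5. The query vector is then [2, 5].
There is no need to dig into how the query vector is computed; the underlying math is linear algebra. It is enough to know that the vector space model produces such a vector, holding each term's score across all documents.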

doc vector

A doc vector is made up of the scores of each term in one document. For example, if "hello" scores 2 in doc1, 0 in doc2, and 2 in doc3, and "world" scores 0 in doc1, 5 in doc2, and 5 in doc3, the resulting doc vectors are:

[2 , 0]
[0 , 5]
[2 , 5]

Angle calculation

Vector arithmetic on each doc vector and the query vector yields each doc's total score for the multiple terms.
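
One common way to compare the vectors is cosine similarity: the smaller the angle between a doc vector and the query vector, the closer the value is to 1. The sketch below applies it to the example vectors from this section; it illustrates the idea and is not Lucene's actual code path.

```python
import math

def cosine(u, v):
    # cos(angle) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query_vector = [2, 5]
doc_vectors = {"doc1": [2, 0], "doc2": [0, 5], "doc3": [2, 5]}

similarity = {doc: cosine(query_vector, vec) for doc, vec in doc_vectors.items()}
ranking = sorted(similarity, key=similarity.get, reverse=True)
print(ranking)
```

doc3 points in the same direction as the query vector and ranks first, followed by doc2 and then doc1.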

Lucene relevance score function

Lucene's practical scoring function combines the TF/IDF algorithm and the vector space model described above:

score(q,d)  =
  queryNorm(q)
  · coord(q,d)
  · ∑ (
    tf(t in d)
      · idf(t)²
      · t.getBoost()
      · norm(t,d)
  ) (t in q)

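This is the final score of a query (argument q) against a doc (argument d), that is, the relevance of the search keywords to a given document:

queryNorm(q) normalizes the query so that a doc's score stays within a reasonable range.
coord(q,d) rewards docs that match more of the query terms by multiplying their score.
tf(t in d) is the term frequency step of TF/IDF for term t in doc d.
idf(t)² is the inverse document frequency step of TF/IDF for term t, squared.
t.getBoost() factors in the boost (weight) applied to term t in the query.
norm(t,d) is the field-length norm step of TF/IDF.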
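
To make the formula concrete, here is a simplified Python sketch of it. It assumes the classic similarity formulas (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), norm = 1 / sqrt(numTerms)), coord as the fraction of query terms matched, and queryNorm as 1 / sqrt(sum of squared term weights). The function and variable names are hypothetical, and real Lucene differs in details such as lossy norm encoding.

```python
import math

def tf(freq):
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    return 1 + math.log(num_docs / (doc_freq + 1))

def norm(num_terms):
    return 1 / math.sqrt(num_terms)

def score(query_terms, doc_tokens, doc_freqs, num_docs, boosts=None):
    boosts = boosts or {}
    # queryNorm(q): 1 / sqrt(sum over query terms of (idf * boost)^2)
    sum_sq = sum((idf(doc_freqs[t], num_docs) * boosts.get(t, 1.0)) ** 2
                 for t in query_terms)
    query_norm = 1 / math.sqrt(sum_sq)
    # coord(q,d): fraction of the query terms found in the doc
    matched = [t for t in query_terms if t in doc_tokens]
    coord = len(matched) / len(query_terms)
    # Sum of tf * idf^2 * boost * norm over the matched terms
    total = sum(tf(doc_tokens.count(t))
                * idf(doc_freqs[t], num_docs) ** 2
                * boosts.get(t, 1.0)
                * norm(len(doc_tokens))
                for t in matched)
    return query_norm * coord * total

# The three example documents from section 1.3, already tokenized.
index = {
    "doc1": ["hello", "today", "is", "very", "good"],
    "doc2": ["hi", "world", "how", "are", "you"],
    "doc3": ["hello", "world"],
}
doc_freqs = {"hello": 2, "world": 2}  # number of docs containing each term
scores = {d: score(["hello", "world"], toks, doc_freqs, len(index))
          for d, toks in index.items()}
print(scores)
```

doc3 matches both terms in a very short field, so it scores well above doc1 and doc2, which each match a single term in a longer field.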

2. Relevance score tuning

There are four tuning approaches: query-time boost, negative boost, constant_score, and function_score.

2.1. query-time boost

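Use boost to raise the weight of one query clause. For example, the query below has two search conditions; adding the boost parameter to the title clause gives it more weight, so title matches account for a larger share of a doc's score:

curl -X GET "ip:port/forum/_search" -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "java spark",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "content": "java spark"
          }
        }
      ]
    }
  }
}'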

2.2. negative boost

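A negative boost lowers the weight of certain matches; it is the reverse of a query-time boost. First, consider a search for documents whose content field contains "java" but not "spark":

curl -X GET "ip:port/forum/_search" -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "java"
          }
        }
      ],
      "must_not": [
        {
          "match": {
            "content": "spark"
          }
        }
      ]
    }
  }
}'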

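Sometimes you do not want to exclude a keyword outright but only lower the score of documents that contain it, such as "spark" above. For that, use negative_boost: the score of any document matching the negative clause is multiplied by negative_boost:

curl -X GET "ip:port/forum/_search" -H 'Content-Type: application/json' -d '
{
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "content": "java"
        }
      },
      "negative": {
        "match": {
          "content": "spark"
        }
      },
      "negative_boost": 0.2
    }
  }
}'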
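
The effect of negative_boost is simple arithmetic. The sketch below uses made-up scores for two hypothetical documents to show how a document matching the negative clause is demoted rather than removed:

```python
def boosting_score(positive_score, matches_negative, negative_boost=0.2):
    # Docs matching the negative clause keep their positive score,
    # multiplied by negative_boost; all other docs are unchanged.
    if matches_negative:
        return positive_score * negative_boost
    return positive_score

# Hypothetical scores from the positive "java" clause.
java_only = boosting_score(1.0, matches_negative=False)
java_and_spark = boosting_score(2.0, matches_negative=True)
print(java_only, java_and_spark)
```

Even though the second document scored higher on "java", it now ranks below the first because it also mentions "spark".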

2.3. constant_score

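When relevance scoring is not needed, wrap a filter in constant_score. Each constant_score clause then gives every matching doc a fixed score of 1:

curl -X GET "ip:port/forum/_search" -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "filter": {
              "match": {
                "title": "java"
              }
            }
          }
        },
        {
          "constant_score": {
            "filter": {
              "match": {
                "title": "spark"
              }
            }
          }
        }
      ]
    }
  }
}'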

2.4. function_score

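function_score lets you customize the relevance score calculation. For example, the more people a post attracts, the higher its score should be; the post's view count is stored in the follower_num field:

curl -X GET "ip:port/forum/_search" -H 'Content-Type: application/json' -d '
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "java spark",
          "fields": ["title", "content"]
        }
      },
      "field_value_factor": {
        "field": "follower_num",
        "modifier": "log1p",
        "factor": 0.5
      },
      "boost_mode": "sum",
      "max_boost": 2
    }
  }
}'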

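Request breakdown

log1p is a modifier function applied to the field value: the function result is log10(1 + factor * follower_num). With boost_mode set to sum, as here, new_score = old_score + that result.
boost_mode decides how the doc's query score and the function result are combined: multiply, sum, min, max, or replace.
max_boost caps the function result (not the final score) at the given value.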
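
The arithmetic of this request can be sketched in a few lines of Python. This is a minimal model of field_value_factor with the log1p modifier, boost_mode sum, and max_boost as configured above; the query scores and follower counts are made-up inputs.

```python
import math

def function_score(query_score, follower_num, factor=0.5, max_boost=2.0):
    # field_value_factor with modifier log1p: log10(1 + factor * field_value)
    func = math.log10(1 + factor * follower_num)
    # max_boost caps the function result, not the final score
    func = min(func, max_boost)
    # boost_mode "sum": add the function result to the query score
    return query_score + func

no_followers = function_score(1.0, 0)           # log10(1) = 0, score unchanged
some_followers = function_score(1.0, 18)        # log10(10) = 1
huge_followers = function_score(1.0, 1_000_000) # capped at max_boost
print(no_followers, some_followers, huge_followers)
```

The logarithm dampens large counts, and max_boost keeps a very popular post from drowning out the text relevance entirely.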

3. fuzzy

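Fuzzy search automatically corrects misspelled search text and then tries to match the corrected text against the indexed data.

There are two main ways to use it. The more common one is to set fuzziness directly in a match query.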
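
As an illustration of the match-query form, the sketch below builds a request body in Python and prints it as JSON. The field name title and the misspelled search text are assumptions for the example; "AUTO" lets Elasticsearch pick the allowed edit distance based on term length.

```python
import json

# Request body for a match query with fuzziness enabled.
fuzzy_match_body = {
    "query": {
        "match": {
            "title": {                  # hypothetical field name
                "query": "helo world",  # "helo" is a deliberate typo
                "fuzziness": "AUTO",
            }
        }
    }
}

print(json.dumps(fuzzy_match_body, indent=2))
```

The printed JSON can be sent as the body of a _search request in the same way as the earlier examples.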