Elasticsearch: When hybrid search truly shines

Author: Gustavo Llermaly, Elastic

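Demonstrating when hybrid search is better than lexical or semantic search on their own.

In this article, we will explore hybrid search through examples and show its real advantages over using lexical or semantic search techniques alone.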

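What is hybrid search?

Hybrid search is a technique that combines different search methods, such as traditional lexical matching and semantic search.

Lexical search is great when users know the exact words. It finds the relevant documents and ranks them in a sensible way using TF-IDF, which means: the more common a search term is across the dataset, the less it contributes to the score; the more often it appears in a particular document, the more it contributes.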

For further reading on TF-IDF, see the article "Elasticsearch:分布式计分" (Elasticsearch: Distributed Scoring).
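As a rough illustration of the two effects described above, here is a minimal Python sketch of a textbook TF-IDF variant. It is not Elasticsearch's actual scoring (Elasticsearch's default similarity is BM25, which builds on the same idea); the `tf_idf` function, the smoothing in the IDF term, and the three-document toy corpus are assumptions made up for this example.

```python
import math


def tf_idf(term, doc_tokens, corpus):
    """Toy TF-IDF: raw term frequency times a smoothed inverse document frequency."""
    # Term frequency: how often the term occurs in this document.
    tf = doc_tokens.count(term)
    # Document frequency: how many documents in the corpus contain the term.
    docs_with_term = sum(1 for tokens in corpus if term in tokens)
    # Smoothed IDF: terms found in many documents get a smaller weight.
    idf = math.log((1 + len(corpus)) / (1 + docs_with_term)) + 1
    return tf * idf


# Hypothetical mini corpus, already tokenized and lowercased.
corpus = [
    "quiet apartment in pinewood".split(),
    "apartment near nightlife in pinewood".split(),
    "apartment with balcony in maplewood".split(),
]

common = tf_idf("apartment", corpus[0], corpus)  # term present in every document
rare = tf_idf("quiet", corpus[0], corpus)        # term present in one document only

print(f"apartment: {common:.3f}, quiet: {rare:.3f}")
```

Both terms occur once in the first document, but `quiet` appears in only one of the three documents while `apartment` appears in all of them, so `quiet` contributes more to the score.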

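But what if the words in the query are not in the documents? Sometimes users are not looking for something specific but for a concept. They may not be looking for a particular restaurant but for "a nice place to eat with the family". Semantic search is useful for this kind of query because it takes the context of the search query into account and returns similar documents. Compared with the previous approach, you can expect more relevant documents, but in return this approach struggles with precision, especially with numbers.

Hybrid search combines the precision of term matching with the context-aware matching of semantic search, giving us the best of both worlds.

You can read more about hybrid search in this article, and learn more about the differences between lexical and semantic search in this one.

Let's create an example using real estate units.

The query will be: quiet place in Pinewood with 2 rooms, where "quiet place" is the semantic part of the query and "Pinewood with 2 rooms" is the textual, or lexical, part.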

Configuring ELSER

We will use ELSER as our model provider.

Start by creating the inference endpoint:

```
PUT _inference/sparse_embedding/my-elser-model
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  }
}
```

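If this is your first time using ELSER, you may get a 502 Bad Gateway error while the model loads in the background. You can check the model's status in Kibana under Machine Learning > Trained Models. Once it is deployed, you can move on to the next step.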
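If you script this step instead of running it by hand in the console, one way to cope with the temporary 502 is to retry the request after a short pause. The sketch below is only an illustration: `call_with_retry`, its parameters, and the `send_request` callable (assumed to return a `(status_code, body)` tuple) are hypothetical names, not part of any Elastic client.

```python
import time


def call_with_retry(send_request, attempts=5, delay_seconds=2.0, sleep=time.sleep):
    """Call send_request() until it stops returning 502 or attempts run out.

    send_request is any zero-argument callable returning (status_code, body).
    The last response is returned even if it is still a 502.
    """
    status, body = None, None
    for attempt in range(1, attempts + 1):
        status, body = send_request()
        if status != 502:
            return status, body
        # Wait before the next try, but not after the final attempt.
        if attempt < attempts:
            sleep(delay_seconds)
    return status, body


# Simulated endpoint: two 502 responses while the model loads, then success.
fake_responses = iter([(502, "loading"), (502, "loading"), (200, "ok")])
print(call_with_retry(lambda: next(fake_responses), delay_seconds=0.0))
```

In a real script, `send_request` would wrap whichever HTTP client you use to call the inference endpoint.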

Configuring the index

For the index, we will use text fields, plus semantic_text for the semantic field. We will copy the descriptions into the semantic field because we want to use them for both match and semantic queries.

```
PUT properties-hybrid
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english"
      },
      "description": {
        "type": "text",
        "analyzer": "english",
        "copy_to": "semantic_field"
      },
      "neighborhood": {
        "type": "keyword"
      },
      "bedrooms": {
        "type": "integer"
      },
      "bathrooms": {
        "type": "integer"
      },
      "semantic_field": {
        "type": "semantic_text",
        "inference_id": "my-elser-model"
      }
    }
  }
}
```

Indexing data

```
POST _bulk
{ "index" : { "_index" : "properties-hybrid" , "_id": "1"} }
{ "title": "2 Bed 2 Bath in Sunnydale", "description": "Spacious apartment with modern amenities, perfect for urban dwellers.", "neighborhood": "Sunnydale", "bedrooms": 2, "bathrooms": 2 }
{ "index" : { "_index" : "properties-hybrid", "_id": "2" } }
{ "title": "1 Bed 1 Bath in Sunnydale", "description": "Compact apartment with easy access to downtown, ideal for singles or couples.", "neighborhood": "Sunnydale", "bedrooms": 1, "bathrooms": 1 }
{ "index" : { "_index" : "properties-hybrid", "_id": "3" } }
{ "title": "2 Bed 2 Bath in Pinewood", "description": "New apartment with modern bedrooms, located in a restaurant and bar area. Suitable for active people who enjoy nightlife.", "neighborhood": "Pinewood", "bedrooms": 2, "bathrooms": 2 }
{ "index" : { "_index" : "properties-hybrid", "_id": "4" } }
{ "title": "3 Bed 2 Bath in Pinewood", "description": "Secluded and private family unit with a practical layout with three total rooms. Near schools and shops. Perfect for raising kids.", "neighborhood": "Pinewood", "bedrooms": 3, "bathrooms": 2 }
{ "index" : { "_index" : "properties-hybrid", "_id": "5" } }
{ "title": "2 Bed 2 Bath in Pinewood", "description": "Retired apartment in a serene neighborhood, perfect for those seeking a retreat. This well-maintained residence offers two bedrooms with abundant natural light and silence.", "neighborhood": "Pinewood", "bedrooms": 2, "bathrooms": 2 }
{ "index" : { "_index" : "properties-hybrid", "_id": "6" } }
{ "title": "1 Bed 1 Bath in Pinewood", "description": "Apartment with a scenic view, ideal for those seeking an energetic environment.", "neighborhood": "Pinewood", "bedrooms": 1, "bathrooms": 1 }
{ "index" : { "_index" : "properties-hybrid", "_id": "7" } }
{ "title": "2 Bed 2 Bath in Maplewood", "description": "Nice apartment with a large balcony, offering a relaxed and comfortable living experience.", "neighborhood": "Maplewood", "bedrooms": 2, "bathrooms": 2 }
{ "index" : { "_index" : "properties-hybrid", "_id": "8" } }
{ "title": "1 Bed 1 Bath in Maplewood", "description": "Charming apartment with modern interiors, situated in a peaceful neighborhood.", "neighborhood": "Maplewood", "bedrooms": 1, "bathrooms": 1 }
```
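If the listings come from a script rather than being pasted into the console, the same `_bulk` body can be generated programmatically. The sketch below only builds the newline-delimited JSON payload (one action line followed by one source line per document, ending with a newline, as the bulk API expects); `build_bulk_body` is a hypothetical helper name, and the two sample documents are trimmed versions of the ones above.

```python
import json


def build_bulk_body(index_name, documents):
    """Build an NDJSON _bulk payload from (doc_id, source) pairs."""
    lines = []
    for doc_id, source in documents:
        # Action line: tells Elasticsearch where to index the next line.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(source))
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"


# Two listings from the article, with fields trimmed for brevity.
sample_docs = [
    ("1", {"title": "2 Bed 2 Bath in Sunnydale", "neighborhood": "Sunnydale", "bedrooms": 2}),
    ("2", {"title": "1 Bed 1 Bath in Sunnydale", "neighborhood": "Sunnydale", "bedrooms": 1}),
]

print(build_bulk_body("properties-hybrid", sample_docs), end="")
```

The resulting string is what would be sent as the body of `POST _bulk`.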

Querying data

Let's start with a classic match query, which searches on the contents of the title and description:

```
GET properties-hybrid/_search
{
  "query": {
    "multi_match": {
      "query": "quiet home 2 bedroom in Pinewood",
      "fields": ["title", "description"]
    }
  }
}
```

This is the first result:

```
{
  "description": "New apartment with modern bedrooms, located in a restaurant and bar area. Suitable for active people who enjoy nightlife.",
  "title": "2 Bed 2 Bath in Pinewood"
}
```

Not bad. It captured the Pinewood neighborhood and the 2-bedroom requirement; however, this is not a quiet place at all.

Now, a pure semantic query:

```
GET properties-hybrid/_search
{
  "query": {
    "semantic": {
      "field": "semantic_field",
      "query": "quiet home in Pinewood with 2 rooms"
    }
  }
}
```

This is the first result:

```
{
  "description": "Secluded and private family unit with a practical layout with three total rooms. Near schools and shops. Perfect for raising kids.",
  "title": "3 Bed 2 Bath in Pinewood"
}
```

Now the result takes the quiet home part into account, relating it to things like "secluded and private", but this unit has 3 bedrooms and we are looking for 2.

Now let's run a hybrid search. We will use RRF (Reciprocal Rank Fusion) for this and combine the two previous queries. RRF merges the two result lists for us, scoring each document by its rank in each list.

```
GET properties-hybrid/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "semantic_field",
                "query": "quiet home 2 bedroom in Pinewood"
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "multi_match": {
                "query": "quiet home 2 bedroom in Pinewood",
                "fields": ["title", "description"]
              }
            }
          }
        }
      ],
      "rank_window_size": 50,
      "rank_constant": 20
    }
  }
}
```

This is the first result:

```
{
  "description": "Retired apartment in a serene neighborhood, perfect for those seeking a retreat. This well-maintained residence offers two bedrooms with abundant natural light and silence.",
  "title": "2 Bed 2 Bath in Pinewood"
}
```

Now the result is both a quiet place and has 2 bedrooms.
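To make the fusion step more concrete, here is a small Python sketch of the RRF formula: each document receives 1 / (rank_constant + rank) from every list it appears in, and the contributions are summed. The `rrf` function is an illustration, not Elasticsearch's implementation, and the two input lists are an assumption: they contain only the top two document IDs of each ranking, which is all the article reveals, so the real lists would be longer.

```python
from collections import defaultdict


def rrf(rankings, rank_constant=20):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each list has rank 1.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Assumed, truncated rankings (document IDs from the bulk request).
lexical = ["3", "5"]   # nightlife apartment first, quiet apartment second
semantic = ["4", "5"]  # 3-bedroom unit first, quiet apartment second

for doc_id, score in rrf([lexical, semantic], rank_constant=20):
    print(doc_id, round(score, 4))
```

Document 5 is second in both lists, so it collects 1/22 twice and overtakes documents 3 and 4, which each score 1/21 from a single list. This is why a document that is good but not best in both rankings can come out on top after fusion.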

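Evaluating the results

For the evaluation, we will use the Ranking Evaluation API, which lets us automate the process of running the queries and then checking the position of the relevant results. You can choose between different evaluation metrics. In this example, I will use Mean Reciprocal Rank (MRR), which takes the position of the result into account: each query scores 1/position of its first relevant result, so the score drops as that result appears lower in the list.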
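The metric itself is simple enough to sketch in a few lines of Python. This is only an illustration of the formula, not the API's implementation; the function names and the document IDs `a`, `b`, `c` are made up, and `k` plays the role of the cutoff on how many top results are inspected.

```python
def reciprocal_rank(ranked_ids, relevant_ids, k=20):
    """Return 1/position of the first relevant document in the top k, else 0."""
    for position, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs, relevant_ids, k=20):
    """Average the reciprocal rank over several ranked result lists."""
    return sum(reciprocal_rank(run, relevant_ids, k) for run in runs) / len(runs)


print(reciprocal_rank(["a", "b", "c"], {"a"}))  # 1.0: relevant document is first
print(reciprocal_rank(["a", "b", "c"], {"b"}))  # 0.5: relevant document is second
print(reciprocal_rank(["a", "b", "c"], {"z"}))  # 0.0: relevant document not returned
```

A score of 1 therefore means the expected document came first, and 0.5 means it came second.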

For this scenario, we will test our 3 queries (multi_match, semantic, hybrid) against the initial question:

`quiet home 2 bedroom in Pinewood`

We expect the following apartment to come first, since it meets all the criteria.

"Retired apartment in a serene neighborhood, perfect for those seeking a retreat. This well-maintained residence offers two bedrooms with abundant natural light and silence."

We can configure as many queries as we need, and add to the ratings the ID of the document we expect to come first:

```
GET /properties-hybrid/_rank_eval
{
  "requests": [
    {
      "id": "hybrid",
      "request": {
        "retriever": {
          "rrf": {
            "retrievers": [
              {
                "standard": {
                  "query": {
                    "semantic": {
                      "field": "semantic_field",
                      "query": "quiet home 2 bedroom in Pinewood"
                    }
                  }
                }
              },
              {
                "standard": {
                  "query": {
                    "multi_match": {
                      "query": "quiet home 2 bedroom in Pinewood",
                      "fields": [
                        "title",
                        "description"
                      ]
                    }
                  }
                }
              }
            ],
            "rank_window_size": 50,
            "rank_constant": 20
          }
        }
      },
      "ratings": [
        {
          "_index": "properties-hybrid",
          "_id": "5",
          "rating": 1
        }
      ]
    },
    {
      "id": "lexical",
      "request": {
        "query": {
          "multi_match": {
            "query": "quiet home 2 bedroom in Pinewood",
            "fields": [
              "title",
              "description"
            ]
          }
        }
      },
      "ratings": [
        {
          "_index": "properties-hybrid",
          "_id": "5",
          "rating": 1
        }
      ]
    },
    {
      "id": "semantic",
      "request": {
        "query": {
          "semantic": {
            "field": "semantic_field",
            "query": "quiet place in Pinewood with 2 rooms"
          }
        }
      },
      "ratings": [
        {
          "_index": "properties-hybrid",
          "_id": "5",
          "rating": 1
        }
      ]
    }
  ],
  "metric": {
    "mean_reciprocal_rank": {
      "k": 20,
      "relevant_rating_threshold": 1
    }
  }
}
```

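As you can see in the image, the query scored 1 for the hybrid search (first position) and 0.5 for the other two, which means the expected result was returned in second position.

Conclusion

Full-text search techniques (which find terms and rank results by term frequency) and semantic search (which searches by semantic proximity) are useful in different situations. On the one hand, text search shines when users know exactly what they want to search for, for example when they provide the exact SKU of an item or words from a technical manual. On the other hand, semantic search is useful when users are looking for concepts or ideas that are not explicitly stated in the documents. Combining the two approaches in a hybrid search gives you full-text search capabilities plus semantically related documents, which is useful in scenarios that need both keyword matching and contextual understanding. This dual approach improves search accuracy and relevance, making it well suited to complex queries and diverse content types.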

Original article: When hybrid search truly shines - Elasticsearch Labs