Elasticsearch: Are synonyms important in RAG?


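By Jeffrey Rengifo and Tomás Murúa, Elastic

Exploring the capabilities of Elasticsearch synonyms in a RAG application.

Synonyms let us search documents using different words that share the same meaning, so that users find what they are looking for regardless of the exact words they use. You might think that, since RAG applications use semantic/vector search, part of what synonyms do is already covered by semantic search (after all, synonyms are by definition semantically related words).

Is that true? Can semantic search really replace synonyms? In this article, we analyze the impact of using synonyms in a RAG application.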

Steps

  • Configure the inference endpoint
  • Configure synonyms
  • Index documents
  • Hybrid search
  • Synonyms and RAG

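Configure the inference endpoint

For this example, we will implement a RAG (Retrieval-Augmented Generation) system in an HR setting, with and without synonyms. We will index different documents that use variations of the term PTO (Paid Time Off), such as “vacation” or “holiday”. Then we will configure synonyms to show how these relationships improve the relevance and accuracy of search.

First, let's create an endpoint for the ELSER model with the inference API by running the following command in Kibana DevTools: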



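# model_id is an assumption: the "elasticsearch" service requires one, and .elser_model_2 is the built-in ELSER v2 model.
PUT _inference/sparse_embedding/code-wave_inference
{
  "service": "elasticsearch",
  "service_settings": {
    "model_id": ".elser_model_2",
    "num_allocations": 1,
    "num_threads": 1
  }
}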


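Configure synonyms

What are synonyms in Elasticsearch?

In Elasticsearch, synonyms are words or phrases with the same or similar meaning, stored as synonym sets that can be managed as files or through an API. They let users find relevant information even when they use different terms to refer to the same concept.

For example, if we create a synonym set in which “holiday” and “vacation” are synonyms of “Paid Time Off”, an employee who searches for any of these terms will find the documents related to all of them.

You can read more about them in this article.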
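As a rough illustration of the idea, the sketch below expands a query with an equivalent-synonyms rule written in the comma-separated form this article uses ("holidays, paid time off"). The helper names `parse_synonym_rule` and `expand_query` are hypothetical, and the matching is done on the whole query string; Elasticsearch's own synonym filters work on analyzed token streams, so this is only a toy model of the behavior, not how the engine implements it.

```python
def parse_synonym_rule(rule: str) -> set[str]:
    """Split an equivalent-synonyms rule such as "holidays, paid time off"."""
    return {term.strip().lower() for term in rule.split(",") if term.strip()}


def expand_query(query: str, rules: list[str]) -> set[str]:
    """Return the query plus every phrase that a matching rule declares equivalent."""
    normalized = query.strip().lower()
    expanded = {normalized}
    for rule in rules:
        terms = parse_synonym_rule(rule)
        if normalized in terms:
            # The query is one side of the rule, so all sides become searchable.
            expanded |= terms
    return expanded


rules = ["holidays, paid time off"]
print(sorted(expand_query("Holidays", rules)))    # ['holidays', 'paid time off']
print(sorted(expand_query("sick leave", rules)))  # ['sick leave']
```

A query that appears in no rule is left unchanged, which is why terms outside the synonym set only match documents that contain them literally.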

Let's create a synonym set using the synonyms API:



PUT _synonyms/code-wave_synonyms
{
  "synonyms_set": [
    {
      "synonyms": "holidays, paid time off"
    }
  ]
}


Note that a synonym set must be configured before it can be applied to an index.

Now, let's define the settings and mappings for our data:



PUT /code-wave_index
{
  "settings": {
    "analysis": {
      "filter": {
        "synonyms_filter": {
          "type": "synonym_graph",
          "synonyms_set": "code-wave_synonyms",
          "updateable": true
        }
      },
      "analyzer": {
        "my_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "synonyms_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text_field": {
        "type": "text",
        "analyzer": "standard",
        "copy_to": "semantic_field",
        "fields": {
          "synonyms": {
            "type": "text",
            "analyzer": "standard",
            "search_analyzer": "my_search_analyzer"
          }
        }
      },
      "semantic_field": {
        "type": "semantic_text",
        "inference_id": "code-wave_inference"
      }
    }
  }
}


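We will use the semantic_text field for semantic search, and the synonym_graph token filter to handle multi-word synonyms.

We also created two versions of the text field, text_field and text_field.synonyms (both are of type text, and each can be searched separately), to better control whether a query against the field takes synonyms into account.

Finally, we use copy_to to copy the value of text_field into the semantic_text version of the field, enabling both full-text and semantic queries.

Index documents

We will now index our documents using the bulk API: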



POST _bulk
{"index":{"_index":"code-wave_index","_id":"1"}}
{"semantic_field":"Paid time off: All employees receive 20 days of paid vacation annually, with additional days earned for tenure milestones.","text_field":"Paid time off: All employees receive 20 days of paid vacation annually, with additional days earned for tenure milestones."}
{"index":{"_index":"code-wave_index","_id":"2"}}
{"semantic_field":"Holidays: Paid public holidays recognized each calendar year.","text_field":"Holidays: Paid public holidays recognized each calendar year."}
{"index":{"_index":"code-wave_index","_id":"3"}}
{"semantic_field":"Sick leave: Paid sick leave of up to 15 days per year.","text_field":"Sick leave: Paid sick leave of up to 15 days per year."}
{"index":{"_index":"code-wave_index","_id":"4"}}
{"semantic_field":"Holidays sale: Enjoy discounts up to 50% during our exclusive holidays sale event!","text_field":"Holidays sale: Enjoy discounts up to 50% during our exclusive holidays sale event!"}
{"index":{"_index":"code-wave_index","_id":"5"}}
{"semantic_field":"Holidays recipes: Try our top 10 holidays dessert recipes, perfect for family gatherings and celebrations.","text_field":"Holidays recipes: Try our top 10 holidays dessert recipes, perfect for family gatherings and celebrations."}
{"index":{"_index":"code-wave_index","_id":"6"}}
{"semantic_field":"Holidays travel: Find the best deals for your holidays flights and accommodations this season.","text_field":"Holidays travel: Find the best deals for your holidays flights and accommodations this season."}
{"index":{"_index":"code-wave_index","_id":"7"}}
{"semantic_field":"Holidays music: Stream your favorite holidays classics and discover new seasonal hits.","text_field":"Holidays music: Stream your favorite holidays classics and discover new seasonal hits."}
{"index":{"_index":"code-wave_index","_id":"8"}}
{"semantic_field":"Holidays decorations: Our store offers a wide range of holidays decorations to make your home festive.","text_field":"Holidays decorations: Our store offers a wide range of holidays decorations to make your home festive."}
{"index":{"_index":"code-wave_index","_id":"9"}}
{"semantic_field":"Holidays movies: Check out our list of must-watch holidays movies for cozy winter nights.","text_field":"Holidays movies: Check out our list of must-watch holidays movies for cozy winter nights."}
{"index":{"_index":"code-wave_index","_id":"10"}}
{"semantic_field":"Holidays festival: Join us at the city's annual holidays festival featuring lights, music, and local food.","text_field":"Holidays festival: Join us at the city's annual holidays festival featuring lights, music, and local food."}
{"index":{"_index":"code-wave_index","_id":"11"}}
{"semantic_field":"Holidays weather: Stay updated with our holidays weather forecast to plan your activities.","text_field":"Holidays weather: Stay updated with our holidays weather forecast to plan your activities."}
{"index":{"_index":"code-wave_index","_id":"12"}}
{"semantic_field":"Holidays gift guide: Browse our ultimate holidays gift guide for everyone on your list.","text_field":"Holidays gift guide: Browse our ultimate holidays gift guide for everyone on your list."}
{"index":{"_index":"code-wave_index","_id":"13"}}
{"semantic_field":"Holidays traditions: Explore unique holidays traditions celebrated around the world.","text_field":"Holidays traditions: Explore unique holidays traditions celebrated around the world."}
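The bulk body above alternates an action line with a document line, and every document repeats the same text in semantic_field and text_field. As a minimal sketch, assuming you would rather generate that body than type it by hand, the hypothetical helper `build_bulk_body` below produces the same newline-delimited JSON shape from a list of texts. It only builds the string; sending it to Elasticsearch is left out.

```python
import json


def build_bulk_body(index: str, texts: list[str]) -> str:
    """Build a newline-delimited bulk body with sequential string IDs starting at 1."""
    lines = []
    for doc_id, text in enumerate(texts, start=1):
        # Action line: which index and ID the next document goes to.
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc_id)}}))
        # Document line: the same text in both fields, as in the request above.
        lines.append(json.dumps({"semantic_field": text, "text_field": text}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


texts = [
    "Holidays: Paid public holidays recognized each calendar year.",
    "Sick leave: Paid sick leave of up to 15 days per year.",
]
print(build_bulk_body("code-wave_index", texts))
```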


We're now ready to start searching! But first, let's make sure the synonyms work by searching for holidays:



GET code-wave_index/_search
{
  "_source": {
    "excludes": [
      "*embeddings",
      "*chunks"
    ]
  },
  "query": {
    "multi_match": {
      "query": "holidays",
      "fields": [
        "text_field^10",
        "text_field.synonyms^0.6"
      ]
    }
  }
}


We adjust the boosts so that synonym matches score lower than matches on the original word.
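To see how per-field boosts shape the final score, here is a small sketch of the default multi_match behavior (type best_fields), which takes the best boosted field score and adds the other fields scaled by a tie_breaker that defaults to 0.0. The function name `best_fields_score` and the raw per-field scores are made up for illustration; real scores come from BM25 and depend on the indexed data.

```python
def best_fields_score(
    field_scores: dict[str, float],
    boosts: dict[str, float],
    tie_breaker: float = 0.0,
) -> float:
    """Combine per-field scores the way a best_fields multi_match does."""
    # A boost multiplies the raw score of its field; unboosted fields use 1.0.
    boosted = [score * boosts.get(field, 1.0) for field, score in field_scores.items()]
    if not boosted:
        return 0.0
    best = max(boosted)
    # With the default tie_breaker of 0.0 only the best field counts.
    return best + tie_breaker * (sum(boosted) - best)


boosts = {"text_field": 10.0, "text_field.synonyms": 0.6}

# Hypothetical raw scores: one document contains the query word itself,
# the other only matches through a synonym on text_field.synonyms.
exact_match = {"text_field": 0.5, "text_field.synonyms": 0.5}
synonym_only = {"text_field.synonyms": 0.5}

print(best_fields_score(exact_match, boosts))
print(best_fields_score(synonym_only, boosts))
```

With equal raw scores, the document that matches the original word ends up well ahead of the synonym-only match. In practice the raw scores differ between fields and documents, so the boosts set a tendency rather than a guaranteed order.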

Check the response:



{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 12,
      "relation": "eq"
    },
    "max_score": 5.2014494,
    "hits": [
      {
        "_index": "code-wave_index",
        "_id": "2",
        "_score": 3.0596757,
        "_source": {
          "text_field": "Holidays: Paid public holidays recognized each calendar year.",
          "semantic_field": {
            "inference": {
              "inference_id": "code-wave_inference",
              "model_settings": {
                "task_type": "sparse_embedding"
              }
            },
            "text": "Holidays: Paid public holidays recognized each calendar year."
          }
        }
      },
      {
        "_index": "code-wave_index",
        "_id": "1",
        "_score": 3.023004,
        "_source": {
          "text_field": "Paid time off: All employees receive 20 days of paid vacation annually, with additional days earned for tenure milestones.",
          "semantic_field": {
            "inference": {
              "inference_id": "code-wave_inference",
              "model_settings": {
                "task_type": "sparse_embedding"
              }
            },
            "text": "Paid time off: All employees receive 20 days of paid vacation annually, with additional days earned for tenure milestones."
          }
        }
      },
      {
        "_index": "code-wave_index",
        "_id": "13",
        "_score": 2.9230676,
        "_source": {
          "text_field": "Holidays traditions: Explore unique holidays traditions celebrated around the world.",
          "semantic_field": {
            "inference": {
              "inference_id": "code-wave_inference",
              "model_settings": {
                "task_type": "sparse_embedding"
              }
            },
            "text": "Holidays traditions: Explore unique holidays traditions celebrated around the world."
          }
        }
      },
      ...
    ]
  }
}


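We can see that when we search for “holidays”, the second hit (document 1) is matched through the synonym “Paid Time Off”.

Hybrid search

Hybrid search lets us combine the results of full-text and semantic search queries into a single normalized result set, using RRF (Reciprocal Rank Fusion) to merge the rankings produced by the different retrievers.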



GET code-wave_index/_search
{
  "_source": "text_field",
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "nested": {
                "path": "semantic_field.inference.chunks",
                "query": {
                  "sparse_vector": {
                    "inference_id": "code-wave_inference",
                    "field": "semantic_field.inference.chunks.embeddings",
                    "query": "holidays"
                  }
                }
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "multi_match": {
                "query": "holidays",
                "fields": [
                  "text_field.synonyms"
                ]
              }
            }
          }
        }
      ]
    }
  }
}


Response:



{
  "took": 11,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 13,
      "relation": "eq"
    },
    "max_score": 0.03175403,
    "hits": [
      {
        "_index": "code-wave_index",
        "_id": "7",
        "_score": 0.03175403,
        "_source": {
          "text_field": "Holidays music: Stream your favorite holidays classics and discover new seasonal hits."
        }
      },
      {
        "_index": "code-wave_index",
        "_id": "13",
        "_score": 0.031257633,
        "_source": {
          "text_field": "Holidays traditions: Explore unique holidays traditions celebrated around the world."
        }
      },
      {
        "_index": "code-wave_index",
        "_id": "4",
        "_score": 0.031009614,
        "_source": {
          "text_field": "Holidays sale: Enjoy discounts up to 50% during our exclusive holidays sale event!"
        }
      },
      {
        "_index": "code-wave_index",
        "_id": "2",
        "_score": 0.030834913,
        "_source": {
          "text_field": "Holidays: Paid public holidays recognized each calendar year."
        }
      },
      {
        "_index": "code-wave_index",
        "_id": "6",
        "_score": 0.03079839,
        "_source": {
          "text_field": "Holidays travel: Find the best deals for your holidays flights and accommodations this season."
        }
      },
      {
        "_index": "code-wave_index",
        "_id": "11",
        "_score": 0.02964427,
        "_source": {
          "text_field": "Holidays weather: Stay updated with our holidays weather forecast to plan your activities."
        }
      },
      {
        "_index": "code-wave_index",
        "_id": "5",
        "_score": 0.029418126,
        "_source": {
          "text_field": "Holidays recipes: Try our top 10 holidays dessert recipes, perfect for family gatherings and celebrations."
        }
      },
      {
        "_index": "code-wave_index",
        "_id": "12",
        "_score": 0.028991597,
        "_source": {
          "text_field": "Holidays gift guide: Browse our ultimate holidays gift guide for everyone on your list."
        }
      },
      {
        "_index": "code-wave_index",
        "_id": "1",
        "_score": 0.016393442,
        "_source": {
          "text_field": "Paid time off: All employees receive 20 days of paid vacation annually, with additional days earned for tenure milestones."
        }
      },
      {
        "_index": "code-wave_index",
        "_id": "10",
        "_score": 0.016393442,
        "_source": {
          "text_field": "Holidays festival: Join us at the city's annual holidays festival featuring lights, music, and local food."
        }
      }
    ]
  }
}


This query returns both semantically and textually relevant documents.
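The RRF scores in this response can be reproduced by hand. Each retriever contributes 1 / (rank_constant + rank) for every document it returns, and the contributions are summed. The sketch below assumes the default rank_constant of 60 (the query above does not set one) and uses hypothetical per-retriever orderings, since the response only shows the fused result; `rrf_scores` is an illustrative helper, not an Elasticsearch API.

```python
def rrf_scores(rankings: list[list[str]], rank_constant: int = 60) -> dict[str, float]:
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        # Ranks are 1-based: the top document of a list has rank 1.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    return scores


# Hypothetical orderings from the two retrievers (not taken from the article).
semantic_ranking = ["4", "7", "13"]
lexical_ranking = ["1", "13", "6", "7"]

scores = rrf_scores([semantic_ranking, lexical_ranking])
for doc_id, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(doc_id, round(score, 6))

print(round(scores["7"], 6))  # 0.031754, i.e. 1/62 + 1/64
print(round(scores["1"], 6))  # 0.016393, i.e. 1/61
```

These two values line up with the response: the top score 0.03175403 is what a document gets when it ranks second in one retriever and fourth in the other, and 0.016393442 (documents 1 and 10) is what a document gets when it ranks first in one retriever and is absent from the other.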

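Synonyms and RAG

In this section, we will evaluate how synonyms and semantic search can improve queries in a RAG system. We will use a common question about days off as the example: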

How many vacation days are provided for holidays?

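For this question, the information we are interested in is in document 1. Document 2 is close to what we want but not exact, and it is the result we get when we search without synonyms. Let's look at their contents: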

  • [1] Paid time off: All employees receive 20 days of paid vacation annually, with additional days earned for tenure milestones.
  • [2] Holidays: Paid public holidays recognized each calendar year.

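Both documents contain information related to days off, but only document 2 specifically uses the term “holidays”, so we can use them to test how synonyms and semantic search work in Playground.

You can access Playground from Search > Playground. From there, you need to configure the LLM you want to use and select the index we created to be sent as context. You can read more about Playground and its configuration here.

After configuring Playground, if we click the query button, we can see that synonyms are deactivated: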

For each question, the top three results of that query are sent to the LLM as context.
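The effect of that top-three cutoff can be sketched in a few lines. The article does not show the prompt Playground actually builds, so `build_context` and `build_prompt` below are hypothetical helpers that only illustrate the principle: a document outside the first three hits never reaches the LLM. The sample hits are shaped like the search responses shown earlier.

```python
def build_context(hits: list[dict], top_k: int = 3) -> str:
    """Join the text of the first top_k hits, each labeled with its document ID."""
    snippets = []
    for hit in hits[:top_k]:
        snippets.append(f"[{hit['_id']}] {hit['_source']['text_field']}")
    return "\n".join(snippets)


def build_prompt(question: str, hits: list[dict], top_k: int = 3) -> str:
    """Assemble a simple prompt: retrieved context first, then the question."""
    return f"Context:\n{build_context(hits, top_k)}\n\nQuestion: {question}"


# Sample hits in ranked order; document 1 holds the answer but is ranked fourth.
hits = [
    {"_id": "7", "_source": {"text_field": "Holidays music: Stream your favorite holidays classics and discover new seasonal hits."}},
    {"_id": "13", "_source": {"text_field": "Holidays traditions: Explore unique holidays traditions celebrated around the world."}},
    {"_id": "2", "_source": {"text_field": "Holidays: Paid public holidays recognized each calendar year."}},
    {"_id": "1", "_source": {"text_field": "Paid time off: All employees receive 20 days of paid vacation annually, with additional days earned for tenure milestones."}},
]

prompt = build_prompt("How many vacation days are provided for holidays?", hits)
print(prompt)
print("20 days" in prompt)  # False: the answer never reaches the LLM
```

Ranking the right document inside the cutoff is therefore what matters for the answer, more than whether it was retrieved at all.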

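Now, let's ask Playground the question and check the results with synonyms deactivated:

Since the document stating how many vacation days employees get per year is not among the top three search results, the LLM cannot answer the question. In this case, the closest result is document [2].

Note: by clicking “Snippet”, we can see the exact content in Elasticsearch that the answer is based on.

Let's clear the chat history, activate synonyms, and ask the same question again:

Note that when both a semantic_text field and a text field are enabled, Playground automatically generates a hybrid search query.

Let's repeat the question, now with synonyms activated:

Now the answer does include the document we were looking for, because synonyms allowed document [1] to be sent to the LLM.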

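Conclusion

In this article, we found that synonyms remain a fundamental component of search systems, and that semantic search alone does not necessarily cover what synonyms provide.

Synonyms let us control which documents to boost according to the use case, and improve precision by tuning relevance. Semantic search, on the other hand, is useful for recall: it can bring in potentially relevant results without requiring us to add a synonym for every related term.

With hybrid search, we can run synonym and semantic search together and get the best of both worlds. In Playground, if we select a combination of semantic and text fields as search fields, it will build the hybrid query for us automatically.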


Original article: Are synonyms important in RAG? - Elasticsearch Labs