Elasticsearch: Introducing Retrievers - Python Jupyter notebook

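In today's article, following up on the previous post "Elasticsearch: Introducing retrievers - search all the things", I use an Elasticsearch cluster that can be set up locally to demonstrate how to use retrievers. In this article, you will learn how to: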

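Download the IMDB dataset from Kaggle
Create two inference services
Deploy ELSER
Deploy e5-small
Create an ingest pipeline
Create the mappings
Ingest the IMDB data, creating embeddings during ingestion
Scale down the models for the query load
Run sample retrievers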

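Installation

Elasticsearch and Kibana

If you have not yet installed Elasticsearch and Kibana, please refer to the following links to install them:

We choose Elastic Stack 8.x for the installation. When Elasticsearch starts for the first time, we see the following output:

The output above shows the password of the elastic superuser. We note it down and use it in the code below.

We can also find the certificates for accessing Elasticsearch in the Elasticsearch installation directory: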



$ pwd
/Users/liuxg/elastic/elasticsearch-8.14.1/config/certs
$ ls
http.p12      http_ca.crt   transport.p12

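Above, http_ca.crt is the certificate we need in order to access Elasticsearch.

We first clone the code that has already been written:

git clone https://github.com/liu-xiao-guo/elasticsearch-labs

We then change into the directory that contains the notebook: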



$ pwd
/Users/liuxg/python/elasticsearch-labs/supporting-blog-content/introducing-retrievers
$ ls
retrievers_intro_notebook.ipynb

As shown above, retrievers_intro_notebook.ipynb is the notebook we work with today.

We copy the required certificate with the following command:



$ pwd
/Users/liuxg/python/elasticsearch-labs/supporting-blog-content/introducing-retrievers
$ cp ~/elastic/elasticsearch-8.14.1/config/certs/http_ca.crt .
$ ls
http_ca.crt                     retrievers_intro_notebook.ipynb

Install the required Python dependencies

pip3 install -qqq pandas elasticsearch python-dotenv

We can check the installed version of the elasticsearch package as follows:



$ pip3 list | grep elasticsearch
elasticsearch                           8.14.0

Create environment variables

So that the code below runs smoothly, run the following commands in the current project directory:



export ES_ENDPOINT="localhost"
export ES_USER="elastic"
export ES_PASSWORD="uK+7WbkeXMzwk9YvP-H3"

You need to adjust these values to match your own Elasticsearch setup.
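Since the notebook silently receives None for any variable that was not exported, it can help to check the settings before connecting. The following is a minimal sketch; the helper name missing_env_vars is hypothetical and not part of the original notebook.

```python
import os

# The three variables exported above.
REQUIRED_VARS = ("ES_ENDPOINT", "ES_USER", "ES_PASSWORD")


def missing_env_vars(env=os.environ, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Please export: " + ", ".join(missing))
else:
    print("All connection settings are present.")
```

Passing a plain dictionary as env makes the check easy to try without touching the real environment.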

Download the dataset

We go to IMDB movies dataset | Kaggle, download the dataset, and unzip it.



$ pwd
/Users/liuxg/python/elasticsearch-labs/supporting-blog-content/introducing-retrievers
$ ls
archive (13).zip                http_ca.crt                     retrievers_intro_notebook.ipynb
$ unzip archive\ \(13\).zip
Archive:  archive (13).zip
  inflating: imdb_movies.csv
$ mkdir -p content
$ mv imdb_movies.csv content/



$ tree -L 2
.
├── archive\ (13).zip
├── content
│   └── imdb_movies.csv
├── http_ca.crt
└── retrievers_intro_notebook.ipynb

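As shown above, we place the imdb_movies.csv file in the content directory under the current working directory.

Code walkthrough

In the current project directory, we run the following command:

Setup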



import os
import zipfile
import pandas as pd
from elasticsearch import Elasticsearch, helpers
from elasticsearch.exceptions import ConnectionTimeout
from elastic_transport import ConnectionError
from time import sleep
import time
import logging

# Get the logger for 'elastic_transport.node_pool'
logger = logging.getLogger("elastic_transport.node_pool")

# Set its level to ERROR
logger.setLevel(logging.ERROR)

# Suppress warnings from the elastic_transport module
logging.getLogger("elastic_transport").setLevel(logging.ERROR)

Connect to Elasticsearch



from dotenv import load_dotenv

# Load variables from a .env file, if one exists
load_dotenv()

ES_USER = os.getenv("ES_USER")
ES_PASSWORD = os.getenv("ES_PASSWORD")
ES_ENDPOINT = os.getenv("ES_ENDPOINT")
COHERE_API_KEY = os.getenv("COHERE_API_KEY")

url = f"https://{ES_USER}:{ES_PASSWORD}@{ES_ENDPOINT}:9200"
# Note: this prints the password in clear text
print(url)

# Verify the server certificate against the CA file copied earlier
es = Elasticsearch(url, ca_certs="./http_ca.crt", verify_certs=True)
print(es.info())

As shown above, our client connects to Elasticsearch successfully.
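Embedding the credentials in the URL means the password ends up in anything that prints the URL. As an alternative, the 8.x Python client accepts the credentials through its basic_auth parameter. The sketch below only builds the keyword arguments; the helper name client_settings is hypothetical, and the port 9200 and certificate path are the ones used in this article.

```python
import os


def client_settings(env):
    """Build keyword arguments for Elasticsearch() without putting the password in the URL."""
    return {
        "hosts": f"https://{env['ES_ENDPOINT']}:9200",
        "basic_auth": (env["ES_USER"], env["ES_PASSWORD"]),
        "ca_certs": "./http_ca.crt",
        "verify_certs": True,
    }


# Usage, assuming the environment variables above are set:
# es = Elasticsearch(**client_settings(os.environ))
```

With this variant the host URL can be logged safely, since it no longer contains the password.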

Deploy ELSER and E5

The following two code blocks deploy the embedding models and automatically scale ML capacity.

Deploy and start ELSER



from elasticsearch.exceptions import BadRequestError

try:
    resp = es.options(request_timeout=5).inference.put_model(
        task_type="sparse_embedding",
        inference_id="my-elser-model",
        body={
            "service": "elser",
            "service_settings": {"num_allocations": 64, "num_threads": 1},
        },
    )
except ConnectionTimeout:
    # The model download continues in the background after the timeout
    pass
except BadRequestError as e:
    print(e)

If you have deployed ELSER before, you may get a "resource already exists" error. You can delete the previous inference endpoint with the following command:

DELETE /_inference/my-elser-model

After running the code above, it takes some time to download the ELSER model, depending on your network speed. We can check the progress in Kibana:

Deploy and start e5-small



try:
    resp = es.inference.put_model(
        task_type="text_embedding",
        inference_id="my-e5-model",
        body={
            "service": "elasticsearch",
            "service_settings": {
                "num_allocations": 8,
                "num_threads": 1,
                "model_id": ".multilingual-e5-small",
            },
        },
    )
except ConnectionTimeout:
    pass
except BadRequestError as e:
    print(e)

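After running the code above, we go to the Kibana UI:

Click "Add trained model" above to install the .multilingual-e5-small model.

In the end, we see this:

The whole download and deployment takes a long time, so please be patient!

Tip: if your machine runs on the x86 architecture, you can choose .multilingual-e5-small_linux-x86_64 as the model_id above.

Check the model deployment status

This loops until both ELSER and e5 are fully deployed. If you have already waited long enough above, the code below finishes quickly.

If additional capacity has to be allocated to run the models, this may take a few minutes.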



from time import sleep
from elasticsearch.exceptions import ConnectionTimeout


def wait_for_models_to_start(es, models):
    model_status_map = {model: False for model in models}

    while not all(model_status_map.values()):
        try:
            model_status = es.ml.get_trained_models_stats()
        except ConnectionTimeout:
            print("A connection timeout error occurred.")
            continue

        for x in model_status["trained_model_stats"]:
            model_id = x["model_id"]
            # Skip this model if it's not in our list or it has already started
            if model_id not in models or model_status_map[model_id]:
                continue
            if "deployment_stats" in x:
                if (
                    "nodes" in x["deployment_stats"]
                    and len(x["deployment_stats"]["nodes"]) > 0
                ):
                    if (
                        x["deployment_stats"]["nodes"][0]["routing_state"][
                            "routing_state"
                        ]
                        == "started"
                    ):
                        print(f"{model_id} model deployed and started")
                        model_status_map[model_id] = True

        if not all(model_status_map.values()):
            sleep(0.5)


models = [".elser_model_2", ".multilingual-e5-small"]
wait_for_models_to_start(es, models)



.elser_model_2 model deployed and started
.multilingual-e5-small model deployed and started
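The nested checks in the polling loop can be condensed into one small predicate. This is a sketch only: the function name deployment_started is hypothetical, and it assumes each entry of trained_model_stats has the same shape that the loop above relies on.

```python
def deployment_started(stats_entry):
    """Return True if the first node of a deployment reports routing_state 'started'."""
    nodes = stats_entry.get("deployment_stats", {}).get("nodes", [])
    if not nodes:
        # Not deployed yet, or no node has been assigned so far
        return False
    return nodes[0].get("routing_state", {}).get("routing_state") == "started"


# Example entry, shaped like one element of model_status["trained_model_stats"]
example = {
    "model_id": ".elser_model_2",
    "deployment_stats": {"nodes": [{"routing_state": {"routing_state": "started"}}]},
}
print(deployment_started(example))
```

Using .get with defaults means a model that has no deployment yet simply counts as not started instead of raising a KeyError.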

Create the index template and link it to the ingest pipeline



template_body = {
    "index_patterns": ["imdb_movies*"],
    "template": {
        "settings": {"index": {"default_pipeline": "elser_e5_embed"}},
        "mappings": {
            "properties": {
                "budget_x": {"type": "double"},
                "country": {"type": "keyword"},
                "crew": {"type": "text"},
                "date_x": {"type": "date", "format": "MM/dd/yyyy||MM/dd/yyyy[ ]"},
                "genre": {"type": "keyword"},
                "names": {"type": "text"},
                "names_sparse": {"type": "sparse_vector"},
                "names_dense": {"type": "dense_vector"},
                "orig_lang": {"type": "keyword"},
                "orig_title": {"type": "text"},
                "overview": {"type": "text"},
                "overview_sparse": {"type": "sparse_vector"},
                "overview_dense": {"type": "dense_vector"},
                "revenue": {"type": "double"},
                "score": {"type": "double"},
                "status": {"type": "keyword"},
            }
        },
    },
}

# Create the template. put_index_template requires a name;
# "imdb_movies" is an assumed template name, any unique name works.
es.indices.put_index_template(name="imdb_movies", body=template_body)

Create the ingest pipeline



# Define the pipeline configuration
pipeline_body = {
    "processors": [
        {
            "inference": {
                "model_id": ".multilingual-e5-small",
                "description": "embed names with e5 to names_dense nested field",
                "input_output": [
                    {"input_field": "names", "output_field": "names_dense"}
                ],
            }
        },
        {
            "inference": {
                "model_id": ".multilingual-e5-small",
                "description": "embed overview with e5 to overview_dense nested field",
                "input_output": [
                    {"input_field": "overview", "output_field": "overview_dense"}
                ],
            }
        },
        {
            "inference": {
                "model_id": ".elser_model_2",
                "description": "embed overview with .elser_model_2 to overview_sparse nested field",
                "input_output": [
                    {"input_field": "overview", "output_field": "overview_sparse"}
                ],
            }
        },
        {
            "inference": {
                "model_id": ".elser_model_2",
                "description": "embed names with .elser_model_2 to names_sparse nested field",
                "input_output": [
                    {"input_field": "names", "output_field": "names_sparse"}
                ],
            }
        },
    ],
    # Record inference failures on the document instead of rejecting it
    "on_failure": [
        {
            "append": {
                "field": "_source._ingest.inference_errors",
                "value": [
                    {
                        "message": "{{ _ingest.on_failure_message }}",
                        "pipeline": "{{_ingest.pipeline}}",
                        "timestamp": "{{{ _ingest.timestamp }}}",
                    }
                ],
            }
        }
    ],
}


# Create the pipeline
es.ingest.put_pipeline(id="elser_e5_embed", body=pipeline_body)
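The four inference processors differ only in model, input field, and output field, so they can also be generated from a small table, which keeps the descriptions in sync with the fields. This is a sketch; the helper name inference_processor and the EMBEDDINGS list are hypothetical, and the resulting structure mirrors the processors defined above.

```python
def inference_processor(model_id, input_field, output_field):
    """Build one inference processor that embeds input_field into output_field."""
    return {
        "inference": {
            "model_id": model_id,
            "description": f"embed {input_field} with {model_id} to {output_field}",
            "input_output": [
                {"input_field": input_field, "output_field": output_field}
            ],
        }
    }


# (model, input field, output field) for each embedding created at ingest time
EMBEDDINGS = [
    (".multilingual-e5-small", "names", "names_dense"),
    (".multilingual-e5-small", "overview", "overview_dense"),
    (".elser_model_2", "overview", "overview_sparse"),
    (".elser_model_2", "names", "names_sparse"),
]

processors = [inference_processor(*spec) for spec in EMBEDDINGS]
print(len(processors))
```

The generated list could be used as the "processors" value of pipeline_body in place of the hand-written entries.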

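Ingest the documents

This will:

Do some preprocessing
Bulk ingest 10,178 IMDB records
Generate sparse vector embeddings for the overview and names fields using the ELSER model
Generate dense vector embeddings for the overview and names fields using the e5 model

With the allocation settings above, this usually takes a while to complete, depending on the configuration of your own computer.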



# Load CSV data into a pandas DataFrame
df = pd.read_csv("./content/imdb_movies.csv")

# Replace all NaN values in DataFrame with None
df = df.where(pd.notnull(df), None)

# Convert DataFrame into a list of dictionaries
# Each dictionary represents a document to be indexed
documents = df.to_dict(orient="records")


# Define a function to generate actions for bulk API
def generate_bulk_actions(documents):
    for doc in documents:
        yield {
            "_index": "imdb_movies",
            "_source": doc,
        }


# Use the bulk helper to insert documents, 20 at a time
start_time = time.time()
helpers.bulk(es, generate_bulk_actions(documents), chunk_size=20)
end_time = time.time()

print(f"The function took {end_time - start_time} seconds to run")
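Depending on the pandas version, numeric columns may still contain NaN after the df.where(...) call, because float columns cannot always hold None. One way to guard against this is to clean each record just before indexing. The sketch below is an assumption about how one might do that, and the helper name drop_nan_fields is hypothetical.

```python
import math


def drop_nan_fields(record):
    """Return a copy of record without keys whose value is None or a float NaN."""
    return {
        key: value
        for key, value in record.items()
        if value is not None
        and not (isinstance(value, float) and math.isnan(value))
    }


# Example record with one missing numeric value and one missing text value
record = {"names": "Some Movie", "budget_x": float("nan"), "crew": None, "score": 73.0}
print(drop_nan_fields(record))
```

In generate_bulk_actions, the line "_source": doc could then become "_source": drop_nan_fields(doc).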

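We can check the progress in Kibana:

We need to wait a while for the ingestion above to complete. Note that in the code above I set chunk_size to 20 in order to avoid "Connection timeout" errors. If this value is large, each bulk request takes longer to process, because every document goes through four inference calls, and a "Connection timeout" error can occur. We therefore send fewer documents per bulk request. For how to configure the timeout itself, see the article "Scaling ML inference pipelines in Elasticsearch: how to avoid issues and fix bottlenecks".

On my computer, ingesting the 10,178 documents took the following time:

The function took 1292.8102316856384 seconds to run

That is about 21.5 minutes.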
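To get a feel for what the chunk size means, the short calculation below works out how many bulk requests a given chunk_size produces and what throughput the reported run time corresponds to. It is a sketch using only the numbers quoted above; the function names are hypothetical, and the request count ignores the byte limit that the bulk helper also applies per chunk.

```python
import math

TOTAL_DOCS = 10_178        # number of IMDB records ingested
ELAPSED_SECONDS = 1292.81  # run time reported above, rounded


def bulk_request_count(total_docs, chunk_size):
    """Number of bulk requests needed when sending chunk_size documents per request."""
    return math.ceil(total_docs / chunk_size)


def docs_per_second(total_docs, seconds):
    """Average ingest throughput."""
    return total_docs / seconds


print(bulk_request_count(TOTAL_DOCS, 20))    # 509 requests
print(bulk_request_count(TOTAL_DOCS, 200))   # 51 requests
print(round(ELAPSED_SECONDS / 60, 1))        # 21.5 minutes
print(round(docs_per_second(TOTAL_DOCS, ELAPSED_SECONDS), 1))
```

A smaller chunk size means more round trips, but each request finishes sooner, which is what keeps it under the client timeout.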

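Scale down the ELSER and e5 models

We do not need a large number of model allocations for test queries, so we scale each model down to a single allocation.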



for model_id in [".elser_model_2", "my-e5-model"]:
    result = es.perform_request(
        "POST",
        f"/_ml/trained_models/{model_id}/deployment/_update",
        headers={"content-type": "application/json", "accept": "application/json"},
        body={"number_of_allocations": 1},
    )

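Retriever tests

We use the search input clueless slackers to search for movies in the overview field (text or embeddings) of the dataset.

Feel free to change the movie_search variable below to something else.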

movie_search = "clueless slackers"

Standard - search all the text! - bm25



response = es.search(
    index="imdb_movies",
    body={
        "query": {"match": {"overview": movie_search}},
        "size": 3,
        "fields": ["names", "overview"],
        "_source": False,
    },
)

for hit in response["hits"]["hits"]:
    print(f"{hit['fields']['names'][0]}\n- {hit['fields']['overview'][0]}\n")
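The request above uses the classic top-level query. The same BM25 search can be written with a standard retriever, which wraps a regular query in the same way the text_expansion example later in this article does. The following sketch only builds the request body; the function name standard_retriever_body is hypothetical.

```python
def standard_retriever_body(text, field="overview", size=3):
    """Build a search body that runs a match query through a standard retriever."""
    return {
        "retriever": {"standard": {"query": {"match": {field: text}}}},
        "size": size,
        "fields": ["names", "overview"],
        "_source": False,
    }


body = standard_retriever_body("clueless slackers")
print(body["retriever"])

# Usage against the cluster from this article:
# response = es.search(index="imdb_movies", body=body)
```

The results should match the query version, since the standard retriever simply delegates to the wrapped query.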

kNN - search all the dense vectors!



response = es.search(
    index="imdb_movies",
    body={
        "retriever": {
            "knn": {
                "field": "overview_dense",
                "query_vector_builder": {
                    "text_embedding": {
                        "model_id": "my-e5-model",
                        "model_text": movie_search,
                    }
                },
                "k": 5,
                "num_candidates": 5,
            }
        },
        "size": 3,
        "fields": ["names", "overview"],
        "_source": False,
    },
)

for hit in response["hits"]["hits"]:
    print(f"{hit['fields']['names'][0]}\n- {hit['fields']['overview'][0]}\n")
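Every example in this section prints its hits with the same loop, and that loop raises a KeyError when a hit has no value for names or overview. A small helper can centralize the formatting and tolerate missing fields. This is a sketch; the name format_hits and the placeholder strings are hypothetical.

```python
def format_hits(response):
    """Return one printable string per hit, tolerating missing fields."""
    lines = []
    for hit in response["hits"]["hits"]:
        fields = hit.get("fields", {})
        name = fields.get("names", ["<no name>"])[0]
        overview = fields.get("overview", ["<no overview>"])[0]
        lines.append(f"{name}\n- {overview}\n")
    return lines


# A hand-made response with the same shape as the search results
fake_response = {
    "hits": {
        "hits": [
            {"fields": {"names": ["Movie A"], "overview": ["Two friends get lost."]}},
            {"fields": {"names": ["Movie B"]}},
        ]
    }
}
for line in format_hits(fake_response):
    print(line)
```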

text_expansion - search all the sparse vectors! - elser



response = es.search(
    index="imdb_movies",
    body={
        "retriever": {
            "standard": {
                "query": {
                    "text_expansion": {
                        "overview_sparse": {
                            "model_id": ".elser_model_2",
                            "model_text": movie_search,
                        }
                    }
                }
            }
        },
        "size": 3,
        "fields": ["names", "overview"],
        "_source": False,
    },
)

for hit in response["hits"]["hits"]:
    print(f"{hit['fields']['names'][0]}\n- {hit['fields']['overview'][0]}\n")

rrf - combine all the things!



response = es.search(
    index="imdb_movies",
    body={
        "retriever": {
            "rrf": {
                "retrievers": [
                    # match instead of term: a term query is not analyzed, so a
                    # multi-word string would not match the analyzed overview text
                    {"standard": {"query": {"match": {"overview": movie_search}}}},
                    {
                        "knn": {
                            "field": "overview_dense",
                            "query_vector_builder": {
                                "text_embedding": {
                                    "model_id": "my-e5-model",
                                    "model_text": movie_search,
                                }
                            },
                            "k": 5,
                            "num_candidates": 5,
                        }
                    },
                    {
                        "standard": {
                            "query": {
                                "text_expansion": {
                                    "overview_sparse": {
                                        "model_id": ".elser_model_2",
                                        "model_text": movie_search,
                                    }
                                }
                            }
                        }
                    },
                ],
                "window_size": 5,
                "rank_constant": 1,
            }
        },
        "size": 3,
        "fields": ["names", "overview"],
        "_source": False,
    },
)

for hit in response["hits"]["hits"]:
    print(f"{hit['fields']['names'][0]}\n- {hit['fields']['overview'][0]}\n")
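To see what the rrf retriever does with the three result lists, here is a small pure-Python sketch of reciprocal rank fusion. It follows the documented formula, in which every document receives 1 / (rank_constant + rank) from each list it appears in. It is an illustration rather than Elasticsearch's implementation, and the function name and sample document IDs are made up.

```python
from collections import defaultdict


def rrf_scores(rankings, rank_constant=1):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1 for the best hit of each list
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Three hypothetical result lists: bm25, kNN, and ELSER
rankings = [["a", "b", "c"], ["b", "a"], ["c", "b"]]
for doc_id, score in rrf_scores(rankings, rank_constant=1):
    print(doc_id, round(score, 4))
```

With rank_constant set to 1, as in the query above, top-ranked hits dominate the fused score; a larger constant flattens the difference between ranks.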

All of the source code can be downloaded from elasticsearch-labs/supporting-blog-content/introducing-retrievers/retrievers_intro_notebook.ipynb at main · liu-xiao-guo/elasticsearch-labs · GitHub.