Elasticsearch: How to index documents into Elasticsearch with the bulk API in Python


When we need to create an Elasticsearch index, the source data is usually not normalized and cannot be imported directly. The raw data may live in a database, in raw CSV/XML files, or may even come from a third-party API. In such cases we need to preprocess the data so it can be used with the Bulk API. In this tutorial we demonstrate how to index Elasticsearch documents from a CSV file with simple Python code, using both the native Elasticsearch bulk API and the API from the helpers module. You will learn how to pick the right tool for indexing Elasticsearch documents in different situations.

In the previous article "Elasticsearch: Everything you need to know about using Elasticsearch in Python - 8.x", I showed how to use the bulk API to index documents into Elasticsearch. Careful readers may have noticed that the approach does not scale well when there are many documents and the data volume is large, because all of the operations are built up in memory. If the raw data is very large, this can easily run out of memory. In today's article, I explore how to solve this with a Python generator.
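To make the idea concrete, here is a minimal, generic sketch (not the article's code; the document values are hypothetical) of the difference: a list materializes every element in memory at once, while a generator produces elements lazily, one at a time.

# A generic illustration of lists vs. generators (hypothetical example, not part of main.py)
def all_docs_in_memory(n):
    # All n dictionaries exist in memory at the same time
    return [{"id": i} for i in range(n)]

def docs_as_generator(n):
    # Only one dictionary exists at a time; the next one is produced on demand
    for i in range(n):
        yield {"id": i}

for doc in docs_as_generator(3):
    print(doc)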

For testing convenience, the data can be obtained from github.com/liu-xiao-gu… . The file data.csv is the raw data we will use.

Installation

For ease of testing, we will deploy Elasticsearch as described in my earlier article "Elasticsearch: How to run Elasticsearch 8.x on Docker for local development". There we use docker compose to install Elasticsearch and Kibana, without security enabled. For details on connecting to Elasticsearch from Python with security enabled, please refer to the earlier article "Elasticsearch: Everything you need to know about using Elasticsearch in Python - 8.x". We can also follow that article to install the required Python package (the elasticsearch client).

Creating the index in Python

We will create the same laptops-demo index demonstrated in the previous article. First, we create the index directly with the Elasticsearch client. Note that settings and mappings are passed as top-level parameters rather than through the body parameter. The code to create the index is:

main.py

# Import Elasticsearch package
from elasticsearch import Elasticsearch
import csv
import json

# Connect to Elasticsearch cluster
es = Elasticsearch("http://localhost:9200")
resp = es.info()
print(resp)

settings = {
    "index": {"number_of_replicas": 2},
    "analysis": {
        "filter": {
            "ngram_filter": {
                "type": "edge_ngram",
                "min_gram": 2,
                "max_gram": 15,
            }
        },
        "analyzer": {
            "ngram_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "ngram_filter"],
            }
        }
    }
}

mappings = {
    "properties": {
        "id": {"type": "long"},
        "name": {
            "type": "text",
            "analyzer": "standard",
            "fields": {
                "keyword": {"type": "keyword"},
                "ngrams": {"type": "text", "analyzer": "ngram_analyzer"},
            }
        },
        "brand": {
            "type": "text",
            "fields": {
                "keyword": {"type": "keyword"},
            }
        },
        "price": {"type": "float"},
        "attributes": {
            "type": "nested",
            "properties": {
                "attribute_name": {"type": "text"},
                "attribute_value": {"type": "text"},
            }
        }
    }
}

configurations = {
    "settings": {
        "index": {"number_of_replicas": 2},
        "analysis": {
            "filter": {
                "ngram_filter": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 15,
                }
            },
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "ngram_filter"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "id": {"type": "long"},
            "name": {
                "type": "text",
                "analyzer": "standard",
                "fields": {
                    "keyword": {"type": "keyword"},
                    "ngrams": {"type": "text", "analyzer": "ngram_analyzer"},
                }
            },
            "brand": {
                "type": "text",
                "fields": {
                    "keyword": {"type": "keyword"},
                }
            },
            "price": {"type": "float"},
            "attributes": {
                "type": "nested",
                "properties": {
                    "attribute_name": {"type": "text"},
                    "attribute_value": {"type": "text"},
                }
            }
        }
    }
}

INDEX_NAME = "laptops-demo"

# Check the existence of the index. If yes, remove it
if es.indices.exists(index=INDEX_NAME):
    print("The index has already existed, going to remove it")
    es.options(ignore_status=404).indices.delete(index=INDEX_NAME)

# Create the index with the correct configurations
res = es.indices.create(index=INDEX_NAME, settings=settings, mappings=mappings)
print(res)

# The following is another way to create the index, but it is deprecated
# es.indices.create(index=INDEX_NAME, body=configurations)

The index is now created. We can check it in Kibana with the following command:

GET _cat/indices

We can now start adding documents to it.

Using the native Elasticsearch bulk API

When you have a small dataset to load, the native Elasticsearch bulk API is convenient, because its syntax is the same as native Elasticsearch queries and can be run directly in the Dev Tools Console. You don't need to learn anything new.
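To illustrate the format, here is a minimal sketch (the document values are hypothetical) of how the operations list passed to es.bulk alternates an action metadata line with its document source, mirroring the NDJSON body you would send to the _bulk endpoint in the Dev Tools Console:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Each document is described by a pair of entries: action metadata, then the source
actions = [
    {"index": {"_index": "laptops-demo", "_id": 1}},                           # action metadata line
    {"id": 1, "name": "Demo Laptop", "brand": "DemoBrand", "price": 999.0},    # document source line
]
es.bulk(index="laptops-demo", operations=actions, refresh=True)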

The data file to load can be downloaded from this link. Save it as data.csv; it is used in the Python code below:

main.py



# Import Elasticsearch package
from elasticsearch import Elasticsearch
import csv
import json

# Connect to Elasticsearch cluster
es = Elasticsearch("http://localhost:9200")
resp = es.info()
# print(resp)

settings = {
    "index": {"number_of_replicas": 2},
    "analysis": {
        "filter": {
            "ngram_filter": {
                "type": "edge_ngram",
                "min_gram": 2,
                "max_gram": 15,
            }
        },
        "analyzer": {
            "ngram_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "ngram_filter"],
            }
        }
    }
}

mappings = {
    "properties": {
        "id": {"type": "long"},
        "name": {
            "type": "text",
            "analyzer": "standard",
            "fields": {
                "keyword": {"type": "keyword"},
                "ngrams": {"type": "text", "analyzer": "ngram_analyzer"},
            }
        },
        "brand": {
            "type": "text",
            "fields": {
                "keyword": {"type": "keyword"},
            }
        },
        "price": {"type": "float"},
        "attributes": {
            "type": "nested",
            "properties": {
                "attribute_name": {"type": "text"},
                "attribute_value": {"type": "text"},
            }
        }
    }
}

configurations = {
    "settings": {
        "index": {"number_of_replicas": 2},
        "analysis": {
            "filter": {
                "ngram_filter": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 15,
                }
            },
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "ngram_filter"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "id": {"type": "long"},
            "name": {
                "type": "text",
                "analyzer": "standard",
                "fields": {
                    "keyword": {"type": "keyword"},
                    "ngrams": {"type": "text", "analyzer": "ngram_analyzer"},
                }
            },
            "brand": {
                "type": "text",
                "fields": {
                    "keyword": {"type": "keyword"},
                }
            },
            "price": {"type": "float"},
            "attributes": {
                "type": "nested",
                "properties": {
                    "attribute_name": {"type": "text"},
                    "attribute_value": {"type": "text"},
                }
            }
        }
    }
}

INDEX_NAME = "laptops-demo"

# Check the existence of the index. If yes, remove it
if es.indices.exists(index=INDEX_NAME):
    print("The index has already existed, going to remove it")
    es.options(ignore_status=404).indices.delete(index=INDEX_NAME)

# Create the index with the correct configurations
res = es.indices.create(index=INDEX_NAME, settings=settings, mappings=mappings)
print(res)

# The following is another way to create the index, but it is deprecated
# es.indices.create(index=INDEX_NAME, body=configurations)

with open("data.csv", "r") as fi:
    reader = csv.DictReader(fi, delimiter=",")

    actions = []
    for row in reader:
        action = {"index": {"_index": INDEX_NAME, "_id": int(row["id"])}}
        doc = {
            "id": int(row["id"]),
            "name": row["name"],
            "price": float(row["price"]),
            "brand": row["brand"],
            "attributes": [
                {"attribute_name": "cpu", "attribute_value": row["cpu"]},
                {"attribute_name": "memory", "attribute_value": row["memory"]},
                {
                    "attribute_name": "storage",
                    "attribute_value": row["storage"],
                },
            ],
        }
        actions.append(action)
        actions.append(doc)

    es.bulk(index=INDEX_NAME, operations=actions, refresh=True)

# Check the results:
result = es.count(index=INDEX_NAME)
print(result)
print(result.body['count'])


Run the code above:



$ python main.py 
The index has already existed, going to remove it
{'acknowledged': True, 'shards_acknowledged': True, 'index': 'laptops-demo'}
{'count': 200, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}
200


Note: in the bulk call above we need refresh=True; otherwise, when we read the count immediately afterwards, it may still be 0.
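Alternatively, instead of passing refresh=True on the bulk request itself, we can issue an explicit refresh before counting; this is the approach the bulk-helper version below takes. A minimal sketch, reusing the es client, INDEX_NAME, and actions list from main.py above:

es.bulk(index=INDEX_NAME, operations=actions)    # no refresh on the request itself
es.indices.refresh(index=INDEX_NAME)             # make the newly indexed documents searchable
print(es.count(index=INDEX_NAME).body["count"])  # now reports the expected count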

The code above has one serious problem: the entire actions list is built in memory. If the data is large, the memory needed for actions grows accordingly, so this approach clearly does not suit very large datasets.

Note that we use the csv library to conveniently read the data from the CSV file. As you can see, the syntax of the native bulk API is very simple and can be used across different languages (including the Dev Tools Console).

Using the bulk helper

As mentioned above, one problem with the native bulk API is that all the data needs to be loaded into memory before it can be indexed. This can be problematic and very inefficient with a large dataset. To solve this, we can use the bulk helper, which can index Elasticsearch documents from iterators or generators. It therefore does not need to load all the data into memory first, which makes it very memory efficient. However, the syntax is slightly different, as we will see shortly.

Before we index documents with the bulk helper, we should delete the documents in the index to confirm that the bulk helper really works; this is already done in the code above (the index is deleted and re-created). We can then run the following code to load the data into Elasticsearch with the bulk helper:

main.py



# Import Elasticsearch package
from elasticsearch import Elasticsearch
from elasticsearch import helpers
import csv
import json

# Connect to Elasticsearch cluster
es = Elasticsearch("http://localhost:9200")
resp = es.info()
# print(resp)

settings = {
    "index": {"number_of_replicas": 2},
    "analysis": {
        "filter": {
            "ngram_filter": {
                "type": "edge_ngram",
                "min_gram": 2,
                "max_gram": 15,
            }
        },
        "analyzer": {
            "ngram_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "ngram_filter"],
            }
        }
    }
}

mappings = {
    "properties": {
        "id": {"type": "long"},
        "name": {
            "type": "text",
            "analyzer": "standard",
            "fields": {
                "keyword": {"type": "keyword"},
                "ngrams": {"type": "text", "analyzer": "ngram_analyzer"},
            }
        },
        "brand": {
            "type": "text",
            "fields": {
                "keyword": {"type": "keyword"},
            }
        },
        "price": {"type": "float"},
        "attributes": {
            "type": "nested",
            "properties": {
                "attribute_name": {"type": "text"},
                "attribute_value": {"type": "text"},
            }
        }
    }
}

configurations = {
    "settings": {
        "index": {"number_of_replicas": 2},
        "analysis": {
            "filter": {
                "ngram_filter": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 15,
                }
            },
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "ngram_filter"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "id": {"type": "long"},
            "name": {
                "type": "text",
                "analyzer": "standard",
                "fields": {
                    "keyword": {"type": "keyword"},
                    "ngrams": {"type": "text", "analyzer": "ngram_analyzer"},
                }
            },
            "brand": {
                "type": "text",
                "fields": {
                    "keyword": {"type": "keyword"},
                }
            },
            "price": {"type": "float"},
            "attributes": {
                "type": "nested",
                "properties": {
                    "attribute_name": {"type": "text"},
                    "attribute_value": {"type": "text"},
                }
            }
        }
    }
}

INDEX_NAME = "laptops-demo"

# Check the existence of the index. If yes, remove it
if es.indices.exists(index=INDEX_NAME):
    print("The index has already existed, going to remove it")
    es.options(ignore_status=404).indices.delete(index=INDEX_NAME)

# Create the index with the correct configurations
res = es.indices.create(index=INDEX_NAME, settings=settings, mappings=mappings)
print(res)

# The following is another way to create the index, but it is deprecated
# es.indices.create(index=INDEX_NAME, body=configurations)

def generate_docs():
    with open("data.csv", "r") as fi:
        reader = csv.DictReader(fi, delimiter=",")

        for row in reader:
            doc = {
                "_index": INDEX_NAME,
                "_id": int(row["id"]),
                "_source": {
                    "id": int(row["id"]),
                    "name": row["name"],
                    "price": float(row["price"]),
                    "brand": row["brand"],
                    "attributes": [
                        {
                            "attribute_name": "cpu",
                            "attribute_value": row["cpu"],
                        },
                        {
                            "attribute_name": "memory",
                            "attribute_value": row["memory"],
                        },
                        {
                            "attribute_name": "storage",
                            "attribute_value": row["storage"],
                        },
                    ],
                },
            }
            yield doc


helpers.bulk(es, generate_docs())
# (200, [])   -- 200 indexed, no errors.

es.indices.refresh()

# Check the results:
result = es.count(index=INDEX_NAME)
print(result.body['count'])


Run the code above. The output is as follows:



$ python main.py 
The index has already existed, going to remove it
{'acknowledged': True, 'shards_acknowledged': True, 'index': 'laptops-demo'}
200


From the output above we can see that we have successfully ingested 200 documents.
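As an extra sanity check, we can fetch a single document back and inspect the nested attributes built from the CSV columns. This is a minimal sketch; the id value 1 is just an assumed example, and any id present in data.csv works:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.get(index="laptops-demo", id=1)
print(resp["_source"]["name"])        # the laptop name from the CSV row
print(resp["_source"]["attributes"])  # cpu / memory / storage as nested objects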