When we need to create an Elasticsearch index, the source data is usually not normalized and cannot be imported directly. The raw data may live in a database, in raw CSV/XML files, or may even come from a third-party API. In such cases we need to preprocess the data so that it can be used with the Bulk API. In this tutorial we will demonstrate how to index Elasticsearch documents from a CSV file with simple Python code, using both the native Elasticsearch bulk API and the API from the helpers module. You will learn which tool is the right one for indexing Elasticsearch documents in which situation.
In the previous article “Elasticsearch:关于在 Python 中使用 Elasticsearch 你需要知道的一切 - 8.x”, I showed how to index documents into Elasticsearch with the bulk API. Careful developers may have noticed that this approach is not suitable when we have many documents and a large amount of data, because all the operations are built up in memory. If the raw data set is large, this can easily lead to running out of memory. In today's article I will explore an implementation that uses a Python generator instead.
For convenience, the test data can be obtained from github.com/liu-xiao-gu… . data.csv is the raw data file we will use.
Installation
To make testing easy, we will deploy Elasticsearch as described in my earlier article “Elasticsearch:如何在 Docker 上运行 Elasticsearch 8.x 进行本地开发”, using docker compose to install Elasticsearch and Kibana. We will not enable security. For more details on connecting to Elasticsearch from Python with security enabled, please refer to the earlier article “Elasticsearch:关于在 Python 中使用 Elasticsearch 你需要知道的一切 - 8.x”, which also covers how to install the required Python packages.
Creating the index in Python
We will create the same laptops-demo index that was demonstrated in the previous article. First, we create the index directly with the Elasticsearch client. Note that settings and mappings are passed as top-level parameters rather than through the body parameter. The code to create the index is:
main.py
# Import Elasticsearch package
from elasticsearch import Elasticsearch
import csv
import json

# Connect to Elasticsearch cluster
es = Elasticsearch("http://localhost:9200")
resp = es.info()
print(resp)

settings = {
    "index": {"number_of_replicas": 2},
    "analysis": {
        "filter": {
            "ngram_filter": {
                "type": "edge_ngram",
                "min_gram": 2,
                "max_gram": 15,
            }
        },
        "analyzer": {
            "ngram_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "ngram_filter"],
            }
        },
    },
}

mappings = {
    "properties": {
        "id": {"type": "long"},
        "name": {
            "type": "text",
            "analyzer": "standard",
            "fields": {
                "keyword": {"type": "keyword"},
                "ngrams": {"type": "text", "analyzer": "ngram_analyzer"},
            },
        },
        "brand": {
            "type": "text",
            "fields": {
                "keyword": {"type": "keyword"},
            },
        },
        "price": {"type": "float"},
        "attributes": {
            "type": "nested",
            "properties": {
                "attribute_name": {"type": "text"},
                "attribute_value": {"type": "text"},
            },
        },
    }
}

configurations = {
    "settings": {
        "index": {"number_of_replicas": 2},
        "analysis": {
            "filter": {
                "ngram_filter": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 15,
                }
            },
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "ngram_filter"],
                }
            },
        },
    },
    "mappings": {
        "properties": {
            "id": {"type": "long"},
            "name": {
                "type": "text",
                "analyzer": "standard",
                "fields": {
                    "keyword": {"type": "keyword"},
                    "ngrams": {"type": "text", "analyzer": "ngram_analyzer"},
                },
            },
            "brand": {
                "type": "text",
                "fields": {
                    "keyword": {"type": "keyword"},
                },
            },
            "price": {"type": "float"},
            "attributes": {
                "type": "nested",
                "properties": {
                    "attribute_name": {"type": "text"},
                    "attribute_value": {"type": "text"},
                },
            },
        }
    },
}

INDEX_NAME = "laptops-demo"

# Check the existence of the index. If yes, remove it
if es.indices.exists(index=INDEX_NAME):
    print("The index has already existed, going to remove it")
    es.options(ignore_status=404).indices.delete(index=INDEX_NAME)

# Create the index with the correct configurations
res = es.indices.create(index=INDEX_NAME, settings=settings, mappings=mappings)
print(res)

# The following is another way to create the index, but it is deprecated
# es.indices.create(index=INDEX_NAME, body=configurations)
The index is now created. We can view it in Kibana with the following command:
GET _cat/indices
We can now start adding documents to it.
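If you prefer to verify the index from Python instead of the Kibana console, a minimal sketch that reuses the es client from the code above:

# List the index from Python, equivalent to GET _cat/indices in Kibana (sketch)
print(es.cat.indices(index="laptops-demo", format="json"))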
Using the native Elasticsearch bulk API
When you have a small dataset to load, the native Elasticsearch bulk API is convenient, because the syntax is the same as that of native Elasticsearch requests and can be run directly in the Dev Tools console. You don't need to learn anything new.
The data file to be loaded can be downloaded from this link. Save it as data.csv; it will be used in the Python code below:
main.py
# Import Elasticsearch package
from elasticsearch import Elasticsearch
import csv
import json

# Connect to Elasticsearch cluster
es = Elasticsearch("http://localhost:9200")
resp = es.info()
# print(resp)

settings = {
    "index": {"number_of_replicas": 2},
    "analysis": {
        "filter": {
            "ngram_filter": {
                "type": "edge_ngram",
                "min_gram": 2,
                "max_gram": 15,
            }
        },
        "analyzer": {
            "ngram_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "ngram_filter"],
            }
        },
    },
}

mappings = {
    "properties": {
        "id": {"type": "long"},
        "name": {
            "type": "text",
            "analyzer": "standard",
            "fields": {
                "keyword": {"type": "keyword"},
                "ngrams": {"type": "text", "analyzer": "ngram_analyzer"},
            },
        },
        "brand": {
            "type": "text",
            "fields": {
                "keyword": {"type": "keyword"},
            },
        },
        "price": {"type": "float"},
        "attributes": {
            "type": "nested",
            "properties": {
                "attribute_name": {"type": "text"},
                "attribute_value": {"type": "text"},
            },
        },
    }
}

configurations = {
    "settings": {
        "index": {"number_of_replicas": 2},
        "analysis": {
            "filter": {
                "ngram_filter": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 15,
                }
            },
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "ngram_filter"],
                }
            },
        },
    },
    "mappings": {
        "properties": {
            "id": {"type": "long"},
            "name": {
                "type": "text",
                "analyzer": "standard",
                "fields": {
                    "keyword": {"type": "keyword"},
                    "ngrams": {"type": "text", "analyzer": "ngram_analyzer"},
                },
            },
            "brand": {
                "type": "text",
                "fields": {
                    "keyword": {"type": "keyword"},
                },
            },
            "price": {"type": "float"},
            "attributes": {
                "type": "nested",
                "properties": {
                    "attribute_name": {"type": "text"},
                    "attribute_value": {"type": "text"},
                },
            },
        }
    },
}

INDEX_NAME = "laptops-demo"

# Check the existence of the index. If yes, remove it
if es.indices.exists(index=INDEX_NAME):
    print("The index has already existed, going to remove it")
    es.options(ignore_status=404).indices.delete(index=INDEX_NAME)

# Create the index with the correct configurations
res = es.indices.create(index=INDEX_NAME, settings=settings, mappings=mappings)
print(res)

# The following is another way to create the index, but it is deprecated
# es.indices.create(index=INDEX_NAME, body=configurations)

# Read the CSV file and build the list of bulk operations in memory
with open("data.csv", "r") as fi:
    reader = csv.DictReader(fi, delimiter=",")

    actions = []
    for row in reader:
        action = {"index": {"_index": INDEX_NAME, "_id": int(row["id"])}}
        doc = {
            "id": int(row["id"]),
            "name": row["name"],
            "price": float(row["price"]),
            "brand": row["brand"],
            "attributes": [
                {"attribute_name": "cpu", "attribute_value": row["cpu"]},
                {"attribute_name": "memory", "attribute_value": row["memory"]},
                {
                    "attribute_name": "storage",
                    "attribute_value": row["storage"],
                },
            ],
        }
        actions.append(action)
        actions.append(doc)

es.bulk(index=INDEX_NAME, operations=actions, refresh=True)

# Check the results:
result = es.count(index=INDEX_NAME)
print(result)
print(result.body['count'])
We run the above code:
$ python main.py
The index has already existed, going to remove it
{'acknowledged': True, 'shards_acknowledged': True, 'index': 'laptops-demo'}
{'count': 200, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}
200
Note: in the bulk call above we need to pass refresh=True; otherwise the count we read back right afterwards may still be 0.
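An alternative, which the bulk-helper example later in this article uses, is to refresh the index explicitly after the bulk request. A minimal sketch, reusing the es client, INDEX_NAME and actions list from the code above:

# Instead of refresh=True on the bulk call, refresh explicitly before counting
es.bulk(index=INDEX_NAME, operations=actions)
es.indices.refresh(index=INDEX_NAME)
print(es.count(index=INDEX_NAME)["count"])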
The code above has one fatal flaw: the entire actions list is built in memory. If the data set is large, the memory required for actions grows accordingly, so this approach is clearly not suitable for very large data sets.
Note that we use the csv library to conveniently read the data from the CSV file. As you can see, the syntax of the native bulk API is very simple and can be used across different languages (including the Dev Tools console).
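Under the hood, the operations list is serialized into the newline-delimited body that the _bulk endpoint expects: one JSON object for the action metadata, then one for the document source. A small sketch for illustration only (the sample values are made up):

import json

action = {"index": {"_index": "laptops-demo", "_id": 1}}
doc = {"id": 1, "name": "Sample laptop", "price": 999.0, "brand": "SampleBrand"}
# Each line of the bulk body is a standalone JSON object; a real bulk body also ends with a final newline
print(json.dumps(action) + "\n" + json.dumps(doc))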
Using the bulk helper
As mentioned above, one problem with the native bulk API is that all the data needs to be loaded into memory before it can be indexed. This can be problematic and very inefficient for a large dataset. To solve this we can use the bulk helper, which can index Elasticsearch documents from iterators or generators. It therefore does not need to load all the data into memory first, which makes it very memory-efficient. The syntax, however, is slightly different, as we will see shortly.
Before indexing documents with the bulk helper, we should delete the documents already in the index so we can confirm that the bulk helper really works; this is already handled in the code above, which deletes and recreates the index. Then we can run the following code to load the data into Elasticsearch with the bulk helper:
main.py
# Import Elasticsearch package
from elasticsearch import Elasticsearch
from elasticsearch import helpers
import csv
import json

# Connect to Elasticsearch cluster
es = Elasticsearch("http://localhost:9200")
resp = es.info()
# print(resp)

settings = {
    "index": {"number_of_replicas": 2},
    "analysis": {
        "filter": {
            "ngram_filter": {
                "type": "edge_ngram",
                "min_gram": 2,
                "max_gram": 15,
            }
        },
        "analyzer": {
            "ngram_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "ngram_filter"],
            }
        },
    },
}

mappings = {
    "properties": {
        "id": {"type": "long"},
        "name": {
            "type": "text",
            "analyzer": "standard",
            "fields": {
                "keyword": {"type": "keyword"},
                "ngrams": {"type": "text", "analyzer": "ngram_analyzer"},
            },
        },
        "brand": {
            "type": "text",
            "fields": {
                "keyword": {"type": "keyword"},
            },
        },
        "price": {"type": "float"},
        "attributes": {
            "type": "nested",
            "properties": {
                "attribute_name": {"type": "text"},
                "attribute_value": {"type": "text"},
            },
        },
    }
}

configurations = {
    "settings": {
        "index": {"number_of_replicas": 2},
        "analysis": {
            "filter": {
                "ngram_filter": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 15,
                }
            },
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "ngram_filter"],
                }
            },
        },
    },
    "mappings": {
        "properties": {
            "id": {"type": "long"},
            "name": {
                "type": "text",
                "analyzer": "standard",
                "fields": {
                    "keyword": {"type": "keyword"},
                    "ngrams": {"type": "text", "analyzer": "ngram_analyzer"},
                },
            },
            "brand": {
                "type": "text",
                "fields": {
                    "keyword": {"type": "keyword"},
                },
            },
            "price": {"type": "float"},
            "attributes": {
                "type": "nested",
                "properties": {
                    "attribute_name": {"type": "text"},
                    "attribute_value": {"type": "text"},
                },
            },
        }
    },
}

INDEX_NAME = "laptops-demo"

# Check the existence of the index. If yes, remove it
if es.indices.exists(index=INDEX_NAME):
    print("The index has already existed, going to remove it")
    es.options(ignore_status=404).indices.delete(index=INDEX_NAME)

# Create the index with the correct configurations
res = es.indices.create(index=INDEX_NAME, settings=settings, mappings=mappings)
print(res)

# The following is another way to create the index, but it is deprecated
# es.indices.create(index=INDEX_NAME, body=configurations)

# Generator that yields one document at a time, so the whole file never sits in memory
def generate_docs():
    with open("data.csv", "r") as fi:
        reader = csv.DictReader(fi, delimiter=",")

        for row in reader:
            doc = {
                "_index": INDEX_NAME,
                "_id": int(row["id"]),
                "_source": {
                    "id": int(row["id"]),
                    "name": row["name"],
                    "price": float(row["price"]),
                    "brand": row["brand"],
                    "attributes": [
                        {
                            "attribute_name": "cpu",
                            "attribute_value": row["cpu"],
                        },
                        {
                            "attribute_name": "memory",
                            "attribute_value": row["memory"],
                        },
                        {
                            "attribute_name": "storage",
                            "attribute_value": row["storage"],
                        },
                    ],
                },
            }
            yield doc

helpers.bulk(es, generate_docs())
# (200, []) -- 200 indexed, no errors.

es.indices.refresh()

# Check the results:
result = es.count(index=INDEX_NAME)
print(result.body['count'])
Run the above code. The output looks like this:
$ python main.py
The index has already existed, going to remove it
{'acknowledged': True, 'shards_acknowledged': True, 'index': 'laptops-demo'}
200
From the output above we can see that we have successfully ingested 200 documents.
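If you want per-document feedback or control over the chunk size when loading very large files, the helpers module also provides streaming_bulk, which consumes the same generator lazily. A minimal sketch under the same assumptions (the es client, INDEX_NAME and generate_docs() from the code above):

from elasticsearch import helpers

success, errors = 0, 0
# streaming_bulk yields an (ok, item) tuple per document as it walks the generator
for ok, item in helpers.streaming_bulk(
    es, generate_docs(), chunk_size=500, raise_on_error=False
):
    if ok:
        success += 1
    else:
        errors += 1
        print("Failed to index:", item)

print(f"Indexed {success} documents, {errors} errors")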