The keep words token filter keeps only the tokens that appear in a specified word list, even though the text may contain many more tokens. In some cases a field contains many words, but we may not want every word in the field to become a token. This filter uses Lucene's KeepWordFilter. Its behavior is exactly the opposite of the commonly used stop filter. For the stop filter, see my earlier article "Elasticsearch:分词器中的 token 过滤器使用示例".
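To make that contrast concrete, here is a minimal sketch (not from the original article; the word list and text are made up for illustration) that applies a stop filter with a small word list. Where keep would retain only the listed words, stop removes them and keeps everything else:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "stop",
      "stopwords": [ "thief", "project" ]
    }
  ],
  "text": "the thief may doom the project"
}

This should return the tokens the, may, doom and the, while a keep filter configured with the same list would return only thief and project.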
Example

The following _analyze API request uses the keep filter to keep only the "thief", "corporate", "technology" and "project" tokens (the word "elephant" is also in the list, but it does not occur in the text):
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep",
      "keep_words": [ "thief", "corporate", "technology", "project", "elephant" ]
    }
  ],
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster."
}
The above command returns:
{
  "tokens": [
    {
      "token": "thief",
      "start_offset": 2,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "corporate",
      "start_offset": 19,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "technology",
      "start_offset": 70,
      "end_offset": 80,
      "type": "<ALPHANUM>",
      "position": 12
    },
    {
      "token": "project",
      "start_offset": 187,
      "end_offset": 194,
      "type": "<ALPHANUM>",
      "position": 35
    }
  ]
}
As we can see, even though the text field contains a long passage, the result includes only those tokens that appear in the keep filter's keep_words list. Without the keep filter, the request looks like this:
GET _analyze
{
  "tokenizer": "standard",
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster."
}
This command returns:
{
  "tokens": [
    {
      "token": "A",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "thief",
      "start_offset": 2,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "who",
      "start_offset": 8,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "steals",
      "start_offset": 12,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "corporate",
      "start_offset": 19,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "secrets",
      "start_offset": 29,
      "end_offset": 36,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "through",
      "start_offset": 37,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 45,
      "end_offset": 48,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "use",
      "start_offset": 49,
      "end_offset": 52,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "of",
      "start_offset": 53,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "dream",
      "start_offset": 56,
      "end_offset": 61,
      "type": "<ALPHANUM>",
      "position": 10
    },
    {
      "token": "sharing",
      "start_offset": 62,
      "end_offset": 69,
      "type": "<ALPHANUM>",
      "position": 11
    },
    {
      "token": "technology",
      "start_offset": 70,
      "end_offset": 80,
      "type": "<ALPHANUM>",
      "position": 12
    },
    {
      "token": "is",
      "start_offset": 81,
      "end_offset": 83,
      "type": "<ALPHANUM>",
      "position": 13
    },
    {
      "token": "given",
      "start_offset": 84,
      "end_offset": 89,
      "type": "<ALPHANUM>",
      "position": 14
    },
    {
      "token": "the",
      "start_offset": 90,
      "end_offset": 93,
      "type": "<ALPHANUM>",
      "position": 15
    },
    {
      "token": "inverse",
      "start_offset": 94,
      "end_offset": 101,
      "type": "<ALPHANUM>",
      "position": 16
    },
    {
      "token": "task",
      "start_offset": 102,
      "end_offset": 106,
      "type": "<ALPHANUM>",
      "position": 17
    },
    {
      "token": "of",
      "start_offset": 107,
      "end_offset": 109,
      "type": "<ALPHANUM>",
      "position": 18
    },
    {
      "token": "planting",
      "start_offset": 110,
      "end_offset": 118,
      "type": "<ALPHANUM>",
      "position": 19
    },
    {
      "token": "an",
      "start_offset": 119,
      "end_offset": 121,
      "type": "<ALPHANUM>",
      "position": 20
    },
    {
      "token": "idea",
      "start_offset": 122,
      "end_offset": 126,
      "type": "<ALPHANUM>",
      "position": 21
    },
    {
      "token": "into",
      "start_offset": 127,
      "end_offset": 131,
      "type": "<ALPHANUM>",
      "position": 22
    },
    {
      "token": "the",
      "start_offset": 132,
      "end_offset": 135,
      "type": "<ALPHANUM>",
      "position": 23
    },
    {
      "token": "mind",
      "start_offset": 136,
      "end_offset": 140,
      "type": "<ALPHANUM>",
      "position": 24
    },
    {
      "token": "of",
      "start_offset": 141,
      "end_offset": 143,
      "type": "<ALPHANUM>",
      "position": 25
    },
    {
      "token": "a",
      "start_offset": 144,
      "end_offset": 145,
      "type": "<ALPHANUM>",
      "position": 26
    },
    {
      "token": "C.E.O",
      "start_offset": 146,
      "end_offset": 151,
      "type": "<ALPHANUM>",
      "position": 27
    },
    {
      "token": "but",
      "start_offset": 154,
      "end_offset": 157,
      "type": "<ALPHANUM>",
      "position": 28
    },
    {
      "token": "his",
      "start_offset": 158,
      "end_offset": 161,
      "type": "<ALPHANUM>",
      "position": 29
    },
    {
      "token": "tragic",
      "start_offset": 162,
      "end_offset": 168,
      "type": "<ALPHANUM>",
      "position": 30
    },
    {
      "token": "past",
      "start_offset": 169,
      "end_offset": 173,
      "type": "<ALPHANUM>",
      "position": 31
    },
    {
      "token": "may",
      "start_offset": 174,
      "end_offset": 177,
      "type": "<ALPHANUM>",
      "position": 32
    },
    {
      "token": "doom",
      "start_offset": 178,
      "end_offset": 182,
      "type": "<ALPHANUM>",
      "position": 33
    },
    {
      "token": "the",
      "start_offset": 183,
      "end_offset": 186,
      "type": "<ALPHANUM>",
      "position": 34
    },
    {
      "token": "project",
      "start_offset": 187,
      "end_offset": 194,
      "type": "<ALPHANUM>",
      "position": 35
    },
    {
      "token": "and",
      "start_offset": 195,
      "end_offset": 198,
      "type": "<ALPHANUM>",
      "position": 36
    },
    {
      "token": "his",
      "start_offset": 199,
      "end_offset": 202,
      "type": "<ALPHANUM>",
      "position": 37
    },
    {
      "token": "team",
      "start_offset": 203,
      "end_offset": 207,
      "type": "<ALPHANUM>",
      "position": 38
    },
    {
      "token": "to",
      "start_offset": 208,
      "end_offset": 210,
      "type": "<ALPHANUM>",
      "position": 39
    },
    {
      "token": "disaster",
      "start_offset": 211,
      "end_offset": 219,
      "type": "<ALPHANUM>",
      "position": 40
    }
  ]
}
Clearly, this list is much longer than the one we got with the keep filter.

Adding the filter to an index

We can define an index as follows:
PUT keep_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_keep"
          ]
        }
      },
      "filter": {
        "my_keep": {
          "type": "keep",
          "keep_words": [
            "thief",
            "corporate",
            "technology",
            "project",
            "elephant"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
We can test it with the following command:
GET keep_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster."
}
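This should return the same four tokens as the first example: thief, corporate, technology and project. As a further sanity check, here is a small sketch that is not part of the original article (the document ID and query texts are just examples): index a document and verify that only the kept words are searchable.

PUT keep_example/_doc/1?refresh
{
  "text": "A thief who steals corporate secrets"
}

GET keep_example/_search
{
  "query": {
    "match": {
      "text": "thief"
    }
  }
}

GET keep_example/_search
{
  "query": {
    "match": {
      "text": "secrets"
    }
  }
}

The first search should return the document, while the second should return no hits, because "secrets" is not in keep_words and is therefore never written to the inverted index.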
Configurable parameters

Parameter | Description |
---|---|
keep_words | (Required*, array of strings) List of words to keep. Only tokens that match words in this list are included in the output. Either this parameter or keep_words_path must be specified. |
keep_words_path | (Required*, string) Path to a file that contains a list of words to keep. Only tokens that match words in this list are included in the output. The path must be absolute or relative to the Elasticsearch config location, and the file must be UTF-8 encoded. Each word in the file must be separated by a line break. Either this parameter or keep_words must be specified. |
keep_words_case | (Optional, Boolean) If true, lowercase all words in keep_words. Defaults to false. |
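As I read it, keep_words_case lowercases the entries of keep_words themselves, not the incoming tokens. The following sketch is my own illustration and is not from the original article; with keep_words_case set to true, the uppercase entries should still match the lowercase tokens produced by the standard tokenizer:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep",
      "keep_words": [ "THIEF", "PROJECT" ],
      "keep_words_case": true
    }
  ],
  "text": "the thief may doom the project"
}

This should return thief and project; with the default keep_words_case of false, the uppercase keep words would not match and no tokens would be returned.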
In practice, our keep_words list can be quite long. Putting it inline in the request is inconvenient and hard to read. Instead, we can place the list in a file referenced by keep_words_path, for example:
PUT keep_words_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_keep_word_array": {
          "tokenizer": "standard",
          "filter": [ "keep_word_array" ]
        },
        "standard_keep_word_file": {
          "tokenizer": "standard",
          "filter": [ "keep_word_file" ]
        }
      },
      "filter": {
        "keep_word_array": {
          "type": "keep",
          "keep_words": [ "one", "two", "three" ]
        },
        "keep_word_file": {
          "type": "keep",
          "keep_words_path": "analysis/example_word_list.txt"
        }
      }
    }
  }
}
As shown below, we place example_word_list.txt at the following location inside the Elasticsearch installation directory (under config/analysis):
$ pwd
/Users/liuxg/elastic/elasticsearch-8.6.1/config/analysis
$ ls
example_word_list.txt
$ cat example_word_list.txt
thief
corporate
technology
project
elephant
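With the file in place, we can test the file-based analyzer the same way as the inline one. The following request is a small sketch I added for completeness; it assumes the keep_words_example index above was created after the file was put in place, since the word list is read when the analyzer is built:

GET keep_words_example/_analyze
{
  "analyzer": "standard_keep_word_file",
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology"
}

Only thief, corporate and technology should come back, since they are the only tokens in this text that also appear in example_word_list.txt.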