Elasticsearch: Keep words token filter

The keep words token filter keeps only the tokens that appear in a specified word list, even though the text may produce many more tokens than the list contains. In some cases we have a field with many words, but it is not useful to index every single word in that field as a token. This filter uses Lucene's KeepWordFilter. Its behavior is exactly the opposite of the commonly used stop token filter, as the short sketch below illustrates. For the usage of the stop filter, you can read my previous article "Elasticsearch:分词器中的 token 过滤器使用示例".
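As a quick contrast, a stop filter configured with the same word list removes exactly the tokens that a keep filter would retain. A minimal sketch (the word list and text here are only for illustration):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "stop",
      "stopwords": [ "thief", "corporate", "technology", "project", "elephant" ]
    }
  ],
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology"
}

Here the five listed words are removed from the token stream, whereas the keep filter shown below keeps only those words.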

Example

The following _analyze API request uses the keep filter to retain only the "thief", "corporate", "technology" and "project" tokens (the list also contains "elephant", which does not occur in the text):



GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep",
      "keep_words": [ "thief", "corporate", "technology", "project", "elephant" ]
    }
  ],
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster."
}



The above command returns:



{
  "tokens": [
    {
      "token": "thief",
      "start_offset": 2,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "corporate",
      "start_offset": 19,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "technology",
      "start_offset": 70,
      "end_offset": 80,
      "type": "<ALPHANUM>",
      "position": 12
    },
    {
      "token": "project",
      "start_offset": 187,
      "end_offset": 194,
      "type": "<ALPHANUM>",
      "position": 35
    }
  ]
}



From the result above we can see that although the text field contains a long passage, the response contains only those tokens from the keep filter's keep_words list that actually occur in the text ("elephant" is in the list but not in the text, so it is not returned). Note that the kept tokens retain the positions and offsets they had in the original token stream. Without the keep filter, the result looks like this:



GET _analyze
{
  "tokenizer": "standard",
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster."
}



The above command returns:



{
  "tokens": [
    {
      "token": "A",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "thief",
      "start_offset": 2,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "who",
      "start_offset": 8,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "steals",
      "start_offset": 12,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "corporate",
      "start_offset": 19,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "secrets",
      "start_offset": 29,
      "end_offset": 36,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "through",
      "start_offset": 37,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 45,
      "end_offset": 48,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "use",
      "start_offset": 49,
      "end_offset": 52,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "of",
      "start_offset": 53,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "dream",
      "start_offset": 56,
      "end_offset": 61,
      "type": "<ALPHANUM>",
      "position": 10
    },
    {
      "token": "sharing",
      "start_offset": 62,
      "end_offset": 69,
      "type": "<ALPHANUM>",
      "position": 11
    },
    {
      "token": "technology",
      "start_offset": 70,
      "end_offset": 80,
      "type": "<ALPHANUM>",
      "position": 12
    },
    {
      "token": "is",
      "start_offset": 81,
      "end_offset": 83,
      "type": "<ALPHANUM>",
      "position": 13
    },
    {
      "token": "given",
      "start_offset": 84,
      "end_offset": 89,
      "type": "<ALPHANUM>",
      "position": 14
    },
    {
      "token": "the",
      "start_offset": 90,
      "end_offset": 93,
      "type": "<ALPHANUM>",
      "position": 15
    },
    {
      "token": "inverse",
      "start_offset": 94,
      "end_offset": 101,
      "type": "<ALPHANUM>",
      "position": 16
    },
    {
      "token": "task",
      "start_offset": 102,
      "end_offset": 106,
      "type": "<ALPHANUM>",
      "position": 17
    },
    {
      "token": "of",
      "start_offset": 107,
      "end_offset": 109,
      "type": "<ALPHANUM>",
      "position": 18
    },
    {
      "token": "planting",
      "start_offset": 110,
      "end_offset": 118,
      "type": "<ALPHANUM>",
      "position": 19
    },
    {
      "token": "an",
      "start_offset": 119,
      "end_offset": 121,
      "type": "<ALPHANUM>",
      "position": 20
    },
    {
      "token": "idea",
      "start_offset": 122,
      "end_offset": 126,
      "type": "<ALPHANUM>",
      "position": 21
    },
    {
      "token": "into",
      "start_offset": 127,
      "end_offset": 131,
      "type": "<ALPHANUM>",
      "position": 22
    },
    {
      "token": "the",
      "start_offset": 132,
      "end_offset": 135,
      "type": "<ALPHANUM>",
      "position": 23
    },
    {
      "token": "mind",
      "start_offset": 136,
      "end_offset": 140,
      "type": "<ALPHANUM>",
      "position": 24
    },
    {
      "token": "of",
      "start_offset": 141,
      "end_offset": 143,
      "type": "<ALPHANUM>",
      "position": 25
    },
    {
      "token": "a",
      "start_offset": 144,
      "end_offset": 145,
      "type": "<ALPHANUM>",
      "position": 26
    },
    {
      "token": "C.E.O",
      "start_offset": 146,
      "end_offset": 151,
      "type": "<ALPHANUM>",
      "position": 27
    },
    {
      "token": "but",
      "start_offset": 154,
      "end_offset": 157,
      "type": "<ALPHANUM>",
      "position": 28
    },
    {
      "token": "his",
      "start_offset": 158,
      "end_offset": 161,
      "type": "<ALPHANUM>",
      "position": 29
    },
    {
      "token": "tragic",
      "start_offset": 162,
      "end_offset": 168,
      "type": "<ALPHANUM>",
      "position": 30
    },
    {
      "token": "past",
      "start_offset": 169,
      "end_offset": 173,
      "type": "<ALPHANUM>",
      "position": 31
    },
    {
      "token": "may",
      "start_offset": 174,
      "end_offset": 177,
      "type": "<ALPHANUM>",
      "position": 32
    },
    {
      "token": "doom",
      "start_offset": 178,
      "end_offset": 182,
      "type": "<ALPHANUM>",
      "position": 33
    },
    {
      "token": "the",
      "start_offset": 183,
      "end_offset": 186,
      "type": "<ALPHANUM>",
      "position": 34
    },
    {
      "token": "project",
      "start_offset": 187,
      "end_offset": 194,
      "type": "<ALPHANUM>",
      "position": 35
    },
    {
      "token": "and",
      "start_offset": 195,
      "end_offset": 198,
      "type": "<ALPHANUM>",
      "position": 36
    },
    {
      "token": "his",
      "start_offset": 199,
      "end_offset": 202,
      "type": "<ALPHANUM>",
      "position": 37
    },
    {
      "token": "team",
      "start_offset": 203,
      "end_offset": 207,
      "type": "<ALPHANUM>",
      "position": 38
    },
    {
      "token": "to",
      "start_offset": 208,
      "end_offset": 210,
      "type": "<ALPHANUM>",
      "position": 39
    },
    {
      "token": "disaster",
      "start_offset": 211,
      "end_offset": 219,
      "type": "<ALPHANUM>",
      "position": 40
    }
  ]
}



Clearly, this token list is much longer than the one produced with the keep filter.

Adding the filter to an index

We can define an index as follows:



PUT keep_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_keep"
          ]
        }
      },
      "filter": {
        "my_keep": {
          "type": "keep",
          "keep_words": [
            "thief",
            "corporate",
            "technology",
            "project",
            "elephant"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}



We can test it with the following command:



GET keep_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster."
}
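If the index was created successfully, this request should return the same four tokens as the first _analyze example. To see the effect at search time, here is a minimal sketch (the document ID and the queries are only for illustration):

# Index a sample document; only the kept words end up in the inverted index
PUT keep_example/_doc/1
{
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology"
}

# Matches, because "corporate" is in the keep_words list
GET keep_example/_search
{
  "query": {
    "match": {
      "text": "corporate"
    }
  }
}

# Should return no hits, because "secrets" was dropped by the keep filter
GET keep_example/_search
{
  "query": {
    "match": {
      "text": "secrets"
    }
  }
}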



Configuration parameters

Parameter          Description

keep_words         (Required*, array of strings) A list of words to keep. Only tokens that
                   match words in this list are included in the output.
                   Either this parameter or keep_words_path must be specified.

keep_words_path    (Required*, string) Path to a file that contains a list of words to keep.
                   Only tokens that match words in this list are included in the output.
                   This path must be absolute or relative to the Elasticsearch config
                   location, and the file must be UTF-8 encoded. Each word in the file must
                   be separated by a line break.
                   Either this parameter or keep_words must be specified.

keep_words_case    (Optional, Boolean) If true, lowercase all keep words. Defaults to false.
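For example, if the keep list is maintained in mixed case, keep_words_case can be combined with a lowercase filter so that matching becomes case-insensitive. A minimal sketch (the word list and sample text are only for illustration):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "keep",
      "keep_words": [ "Thief", "Corporate" ],
      "keep_words_case": true
    }
  ],
  "text": "A THIEF steals CORPORATE secrets"
}

Here the lowercase filter normalizes the incoming tokens and keep_words_case lowercases the keep list, so the request should return only the thief and corporate tokens.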

In practice, the keep_words list can be quite long. Putting it inline in the request is inconvenient and hard to read. Instead, we can put the list in a file referenced by keep_words_path, for example:



PUT keep_words_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_keep_word_array": {
          "tokenizer": "standard",
          "filter": [ "keep_word_array" ]
        },
        "standard_keep_word_file": {
          "tokenizer": "standard",
          "filter": [ "keep_word_file" ]
        }
      },
      "filter": {
        "keep_word_array": {
          "type": "keep",
          "keep_words": [ "one", "two", "three" ]
        },
        "keep_word_file": {
          "type": "keep",
          "keep_words_path": "analysis/example_word_list.txt"
        }
      }
    }
  }
}



As referenced above, we place example_word_list.txt under the config directory of our Elasticsearch installation:



$ pwd
/Users/liuxg/elastic/elasticsearch-8.6.1/config/analysis
$ ls
example_word_list.txt
$ cat example_word_list.txt
thief
corporate
technology
project
elephant
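With the file in place, the file-backed analyzer can be verified in the same way as before; a minimal sketch:

GET keep_words_example/_analyze
{
  "analyzer": "standard_keep_word_file",
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology"
}

It should return only the tokens listed in example_word_list.txt that actually occur in the text.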

