Elasticsearch: Removing tokens by type

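In my earlier article "Elasticsearch: Examples of using token filters in analyzers", I showed many examples of using an analyzer's token filters to filter tokens. In today's article, I will show how to use another filter to keep or remove tokens based on their type.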

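The keep types token filter (keep_types) keeps or removes tokens of specific types. Consider an item description field, which usually receives text containing both words and numbers. Generating tokens for all of that text may not make sense; to avoid this, we will use the keep types token filter.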

Removing numeric tokens

To remove numeric tokens, set the "types" parameter to "<NUM>"; this parameter accepts a list of token types. Set the "mode" parameter to "exclude".

Example:



GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [ "<NUM>" ],
      "mode": "exclude"
    },
    {
      "type": "stop"
    }
  ],
  "text": "The German philosopher and economist Karl Marx was born on May 5, 1818."
}


The command above returns the following tokens:



{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "German",
      "start_offset": 4,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "philosopher",
      "start_offset": 11,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "economist",
      "start_offset": 27,
      "end_offset": 36,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "Karl",
      "start_offset": 37,
      "end_offset": 41,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "Marx",
      "start_offset": 42,
      "end_offset": 46,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "born",
      "start_offset": 51,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "May",
      "start_offset": 59,
      "end_offset": 62,
      "type": "<ALPHANUM>",
      "position": 10
    }
  ]
}


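From the output above, we can see that all the numeric tokens have been removed. The stop filter also removed "and", "was" and "on"; "The" remains because the stop filter is case-sensitive by default and no lowercase filter runs before it.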

We can also try the following command to keep only the numbers:



GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [ "<NUM>" ],
      "mode": "include"
    },
    {
      "type": "stop"
    }
  ],
  "text": "The German philosopher and economist Karl Marx was born on May 5, 1818."
}


The resulting tokens are:



{
  "tokens": [
    {
      "token": "5",
      "start_offset": 63,
      "end_offset": 64,
      "type": "<NUM>",
      "position": 11
    },
    {
      "token": "1818",
      "start_offset": 66,
      "end_offset": 70,
      "type": "<NUM>",
      "position": 12
    }
  ]
}
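
To make the difference between the two modes concrete, here is a minimal Python sketch that models the filtering rule on a hand-written list of (token, type) pairs. It is only an illustration of the behavior seen in the responses above, not Elasticsearch's implementation; the `keep_types` function and the `tokens` list are hypothetical names made up for this example.

```python
from typing import Iterable, List, Set, Tuple

Token = Tuple[str, str]  # (token text, token type)


def keep_types(tokens: Iterable[Token], types: Set[str], mode: str = "include") -> List[Token]:
    """Simplified model: keep (include) or drop (exclude) tokens whose type is listed."""
    if mode not in ("include", "exclude"):
        raise ValueError(f"unknown mode: {mode!r}")
    keep_listed = mode == "include"
    # A token survives when "its type is listed" agrees with "listed types are kept".
    return [tok for tok in tokens if (tok[1] in types) == keep_listed]


# Hand-written sample, typed the way the responses above label words and numbers.
tokens = [("born", "<ALPHANUM>"), ("May", "<ALPHANUM>"), ("5", "<NUM>"), ("1818", "<NUM>")]

print(keep_types(tokens, {"<NUM>"}, mode="exclude"))  # only the word tokens remain
print(keep_types(tokens, {"<NUM>"}, mode="include"))  # only the number tokens remain
```

The sketch ignores offsets and positions, which the real filter carries through unchanged for the tokens it keeps.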


Removing alphanumeric tokens

To remove the text tokens, we only need to set the "types" parameter to "<ALPHANUM>" while keeping "mode" set to "exclude".



GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [ "<ALPHANUM>" ],
      "mode": "exclude"
    },
    {
      "type": "stop"
    }
  ],
  "text": "The German philosopher and economist Karl Marx was born on May 5, 1818."
}


Now we only have the numeric tokens.



{
  "tokens": [
    {
      "token": "5",
      "start_offset": 63,
      "end_offset": 64,
      "type": "<NUM>",
      "position": 11
    },
    {
      "token": "1818",
      "start_offset": 66,
      "end_offset": 70,
      "type": "<NUM>",
      "position": 12
    }
  ]
}
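
This response is identical to the one from the "include" request with "<NUM>", which is expected for a sentence that contains only words and numbers. The two settings are not interchangeable in general, though. The Python sketch below models the rule on hand-written (token, type) pairs to show where they diverge. The `keep_types` and `same_result` helpers are hypothetical, and the extra "<IDEOGRAPHIC>" token is an assumed input added only for illustration (it is one of the other types the standard tokenizer can assign).

```python
from typing import Iterable, List, Set, Tuple

Token = Tuple[str, str]  # (token text, token type)


def keep_types(tokens: Iterable[Token], types: Set[str], mode: str = "include") -> List[Token]:
    """Simplified model: keep (include) or drop (exclude) tokens whose type is listed."""
    keep_listed = mode == "include"
    return [tok for tok in tokens if (tok[1] in types) == keep_listed]


def same_result(tokens: List[Token]) -> bool:
    """Compare 'exclude <ALPHANUM>' with 'include <NUM>' on the same token list."""
    excluded = keep_types(tokens, {"<ALPHANUM>"}, mode="exclude")
    included = keep_types(tokens, {"<NUM>"}, mode="include")
    return excluded == included


# Only words and numbers, as in the article's sentence.
sentence_tokens = [("May", "<ALPHANUM>"), ("5", "<NUM>"), ("1818", "<NUM>")]
# Assumed extra token of a third type, added for illustration.
mixed_tokens = sentence_tokens + [("哲", "<IDEOGRAPHIC>")]

print(same_result(sentence_tokens))  # True
print(same_result(mixed_tokens))     # False: exclude keeps the ideographic token
```

If the goal is "numbers only", including "<NUM>" states that intent directly; excluding "<ALPHANUM>" lets any other token type through.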