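My former colleague Medcl has also published an analysis plugin on GitHub that converts between Simplified and Traditional Chinese. It is very useful in many real-world applications, for example when the documents are written in Traditional Chinese but we want to search them with Simplified Chinese.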

Installation

We can install the plugin as follows:

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-stconvert/releases/download/v8.2.3/elasticsearch-analysis-stconvert-8.2.3.zip

Pick the release that matches your own Elasticsearch version.
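The download URL in the install command follows a regular pattern, so choosing the matching build can be sketched as below. This is only a minimal sketch: the helper name `stconvert_plugin_url` is hypothetical, and it assumes that a release with exactly that version tag has been published, which should be checked on the releases page first.

```python
# Hypothetical helper: builds the plugin download URL for a given version,
# following the pattern of the URL used in the install command above.
BASE = "https://github.com/medcl/elasticsearch-analysis-stconvert/releases/download"


def stconvert_plugin_url(es_version: str) -> str:
    """Return the plugin zip URL for a version string such as "8.2.3"."""
    parts = es_version.split(".")
    # Expect a plain major.minor.patch version string.
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unexpected version string: {es_version!r}")
    return f"{BASE}/v{es_version}/elasticsearch-analysis-stconvert-{es_version}.zip"


if __name__ == "__main__":
    # Print the full install command for the version used in this article.
    print("./bin/elasticsearch-plugin install " + stconvert_plugin_url("8.2.3"))
```

Whether a build exists for a given version is not something the helper can know; it only formats the URL.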

After installing the plugin, remember to restart the Elasticsearch cluster. We can then check the installed plugins with the following command:

$ ./bin/elasticsearch-plugin list
analysis-stconvert

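The plugin includes the following components:

analyzer: stconvert
tokenizer: stconvert
token-filter: stconvert
char-filter: stconvert

It also supports the following configuration options:

convert_type: defaults to s2t. The available values are:
- s2t: convert from Simplified Chinese to Traditional Chinese
- t2s: convert from Traditional Chinese to Simplified Chinese
keep_both: defaults to false
delimiter: defaults to ,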
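To make the options above concrete, here is a small sketch that assembles the settings for one stconvert tokenizer or token filter with the defaults listed above. The helper `stconvert_component` is hypothetical and not part of the plugin or of any Elasticsearch client; it only builds the JSON fragment that would go under `settings.analysis`.

```python
import json

# The two conversion directions listed above.
VALID_CONVERT_TYPES = {"s2t", "t2s"}


def stconvert_component(convert_type="s2t", keep_both=False, delimiter=","):
    """Build the settings for one stconvert tokenizer or token filter.

    Hypothetical helper; the default values mirror the ones listed above.
    """
    if convert_type not in VALID_CONVERT_TYPES:
        raise ValueError(f"convert_type must be one of {sorted(VALID_CONVERT_TYPES)}")
    return {
        "type": "stconvert",
        "convert_type": convert_type,
        "keep_both": keep_both,
        "delimiter": delimiter,
    }


if __name__ == "__main__":
    # Traditional-to-Simplified conversion with "#" as the delimiter.
    component = stconvert_component(convert_type="t2s", delimiter="#")
    print(json.dumps(component, indent=2))
```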

Example

We use the following example to demonstrate:



PUT /stconvert/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "tsconvert": {
          "tokenizer": "tsconvert"
        }
      },
      "tokenizer": {
        "tsconvert": {
          "type": "stconvert",
          "delimiter": "#",
          "keep_both": false,
          "convert_type": "t2s"
        }
      },
      "filter": {
        "tsconvert": {
          "type": "stconvert",
          "delimiter": "#",
          "keep_both": false,
          "convert_type": "t2s"
        }
      },
      "char_filter": {
        "tsconvert": {
          "type": "stconvert",
          "convert_type": "t2s"
        }
      }
    }
  }
}

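In the request above, we create an index named stconvert. It defines an analyzer named tsconvert, together with a tokenizer, a token filter, and a char filter of the same name. If you want to learn more about how to build a custom analyzer, please read my earlier article “Elasticsearch: analyzer”.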

We run the following analysis test:



GET stconvert/_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "char_filter" : ["tsconvert"],
  "text" : "国际國際"
}

The command above returns:



{
  "tokens" : [
    {
      "token" : "国际国际",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    }
  ]
}
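Outside of Kibana, the same analysis test could be sent from a script. The sketch below uses only the Python standard library; the function names are hypothetical, and the base URL (a cluster reachable over plain HTTP without authentication, for example `http://localhost:9200`) is an assumption that will not hold for a default secured 8.x installation.

```python
import json
import urllib.request


def build_analyze_request(base_url: str, index: str, text: str) -> urllib.request.Request:
    """Build (but do not send) the _analyze request used in the test above."""
    body = {
        "tokenizer": "keyword",
        "filter": ["lowercase"],
        "char_filter": ["tsconvert"],  # char filter defined in the stconvert index
        "text": text,
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # the _analyze endpoint accepts POST as well as GET
    )


def analyze(base_url: str, index: str, text: str) -> list:
    """Send the request and return only the token strings from the response."""
    request = build_analyze_request(base_url, index, text)
    with urllib.request.urlopen(request) as response:
        return [token["token"] for token in json.load(response)["tokens"]]
```

Calling `analyze("http://localhost:9200", "stconvert", "国际國際")` against a cluster set up as described here would be expected to yield the single token shown in the response above.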

We can use the following custom normalizer to handle Traditional Chinese characters in a keyword field:



PUT index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "tsconvert": {
          "type": "stconvert",
          "convert_type": "t2s"
        }
      },
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [
            "tsconvert"
          ],
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "foo": {
        "type": "keyword",
        "normalizer": "my_normalizer"
      }
    }
  }
}

We index a few documents with the following commands:



PUT index/_doc/1
{
  "foo": "國際"
}

PUT index/_doc/2
{
  "foo": "国际"
}

Above, we set the normalizer of the foo field to my_normalizer, so the Traditional Chinese “國際” is converted to “国际” by the char_filter. When we search with the following command:



GET index/_search
{
  "query": {
    "term": {
      "foo": "国际"
    }
  }
}

It returns:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.18232156,
    "hits" : [
      {
        "_index" : "index",
        "_id" : "1",
        "_score" : 0.18232156,
        "_source" : {
          "foo" : "國際"
        }
      },
      {
        "_index" : "index",
        "_id" : "2",
        "_score" : 0.18232156,
        "_source" : {
          "foo" : "国际"
        }
      }
    ]
  }
}


If we instead run the term search with the Traditional Chinese form:



GET index/_search
{
  "query": {
    "term": {
      "foo": "國際"
    }
  }
}

It returns:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.18232156,
    "hits" : [
      {
        "_index" : "index",
        "_id" : "1",
        "_score" : 0.18232156,
        "_source" : {
          "foo" : "國際"
        }
      },
      {
        "_index" : "index",
        "_id" : "2",
        "_score" : 0.18232156,
        "_source" : {
          "foo" : "国际"
        }
      }
    ]
  }
}
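Both queries return both documents because the normalizer is applied to the stored keyword at index time and to the term query value at search time. The toy simulation below illustrates that idea only. It is not the plugin's implementation: the two-character mapping table and the function names are made up for this example, whereas the real plugin ships a full conversion table.

```python
# Tiny illustrative mapping; just enough for the two characters in this example.
T2S = {"國": "国", "際": "际"}


def normalize(value: str) -> str:
    """Mimic my_normalizer: a t2s char filter followed by lowercase."""
    return "".join(T2S.get(ch, ch) for ch in value).lower()


# The two documents written earlier, keyed by _id.
docs = {"1": "國際", "2": "国际"}

# A keyword field with a normalizer indexes the normalized value.
indexed = {doc_id: normalize(value) for doc_id, value in docs.items()}


def term_search(query: str) -> list:
    """Normalize the query value too, then match it exactly."""
    needle = normalize(query)
    return sorted(doc_id for doc_id, term in indexed.items() if term == needle)


if __name__ == "__main__":
    print(term_search("国际"))  # Simplified query
    print(term_search("國際"))  # Traditional query
```

In this simulation both calls return the ids of both documents, which mirrors the two search responses above.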

