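The Elasticsearch version used in this article is 7.4.2.
What the shingle token filter does: it gives you phrase-search-like behavior that preserves word-order context, without requiring every query term to appear in the document the way a phrase query does.
For example, suppose we have the following documents: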
POST /my_index/_doc/1
{"title": "ridingroad likes elasticsearch"}
POST /my_index/_doc/2
{"title": "elasticsearch likes ridingroad"}
But the user types: handsome ridingroad likes elasticsearch. If we use a phrase query:
GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "title": "handsome ridingroad likes elasticsearch"
    }
  }
}
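The response contains no hits, because a phrase query requires every query term to appear in the document, in the same order, and handsome does not appear in either title: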
{
  "took" : 748,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
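To see why the phrase query finds nothing, here is a minimal Python sketch of phrase matching. It is a simplification and not Elasticsearch's implementation: the `tokenize` and `phrase_matches` helpers are hypothetical, tokenization is a plain lowercase whitespace split, and a phrase counts as a match only when all query tokens occur consecutively and in order.

```python
def tokenize(text):
    # Rough stand-in for the standard tokenizer plus lowercase filter:
    # lowercase and split on whitespace (real analysis also handles punctuation).
    return text.lower().split()


def phrase_matches(query, document):
    """Return True if the query tokens occur consecutively, in order, in the document."""
    q, d = tokenize(query), tokenize(document)
    if not q:
        return False
    # Slide a window of len(q) over the document tokens.
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


docs = {
    1: "ridingroad likes elasticsearch",
    2: "elasticsearch likes ridingroad",
}
query = "handsome ridingroad likes elasticsearch"

for doc_id, title in docs.items():
    # Neither title contains "handsome", so both checks print False.
    print(doc_id, phrase_matches(query, title))
```

In this sketch the shorter phrase "ridingroad likes" would match document 1 but not document 2; the extra word handsome is what makes the full query fail on both.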
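So what happens if we use the shingle token filter instead? First, what is a shingle?
Take the text ridingroad likes elasticsearch. If we split it into word groups of length 1 and 2, we get:
Word groups of length 1 (unigrams): [ridingroad, likes, elasticsearch]
Word groups of length 2 (bigrams): [ridingroad likes, likes elasticsearch]
Each of these word groups is called a shingle.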
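The definition above can be sketched in a few lines of Python. This is an illustration under assumptions, not the filter's actual code: the `shingles` function is hypothetical, and its parameter names merely mirror the filter options `min_shingle_size`, `max_shingle_size`, and `output_unigrams`.

```python
def shingles(tokens, min_size=2, max_size=2, output_unigrams=True):
    """Build word shingles from a token list (illustrative helper, not an ES API)."""
    if min_size < 2 or max_size < min_size:
        raise ValueError("expected 2 <= min_size <= max_size")
    result = []
    for start in range(len(tokens)):
        if output_unigrams:
            # Emit the single token first, then the shingles that start at it.
            result.append(tokens[start])
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                result.append(" ".join(tokens[start:start + size]))
    return result


tokens = "ridingroad likes elasticsearch".split()

print(shingles(tokens))
# ['ridingroad', 'ridingroad likes', 'likes', 'likes elasticsearch', 'elasticsearch']

print(shingles(tokens, output_unigrams=False))
# ['ridingroad likes', 'likes elasticsearch']
```

The second call corresponds to a bigram-only setup, where single words are left out of the output.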
Shingles preserve the order of the words, which matters because ridingroad likes elasticsearch and elasticsearch likes ridingroad mean very different things.
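A quick check makes the point concrete. In the sketch below (the `bigrams` helper is hypothetical and uses a plain whitespace split), the two titles contain exactly the same words but share no bigram at all, so only the bigrams can tell them apart.

```python
def bigrams(text):
    # Pair each token with its successor to form two-word shingles.
    tokens = text.lower().split()
    return {" ".join(pair) for pair in zip(tokens, tokens[1:])}


a = "ridingroad likes elasticsearch"
b = "elasticsearch likes ridingroad"

same_words = set(a.split()) == set(b.split())
shared_bigrams = bigrams(a) & bigrams(b)

print(same_words)      # True: the unigram sets are identical
print(shared_bigrams)  # set(): no bigram appears in both titles
```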
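To make a search for handsome ridingroad likes elasticsearch match document 1, which contains ridingroad likes elasticsearch, we need to:
- Build an analyzer that uses the shingle filter;
- Apply the analyzer to a field;
- Use both unigrams and bigrams to improve matching (here the title field indexes unigrams and the title.shingles sub-field indexes bigrams).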
PUT /shingle_test_index
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "my_shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false
        }
      },
      "analyzer": {
        "my_shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_shingle_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "shingles": {
            "type": "text",
            "analyzer": "my_shingle_analyzer"
          }
        }
      }
    }
  }
}
POST /shingle_test_index/_doc/1
{"title": "ridingroad likes elasticsearch"}
POST /shingle_test_index/_doc/2
{"title": "elasticsearch likes ridingroad"}
Now run the query:
GET /shingle_test_index/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "title": "handsome ridingroad likes elasticsearch"
        }
      },
      "should": {
        "match": {
          "title.shingles": "handsome ridingroad likes elasticsearch"
        }
      }
    }
  }
}
The response:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.933259,
    "hits" : [
      {
        "_index" : "shingle_test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.933259,
        "_source" : {
          "title" : "ridingroad likes elasticsearch"
        }
      },
      {
        "_index" : "shingle_test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.5469647,
        "_source" : {
          "title" : "elasticsearch likes ridingroad"
        }
      }
    ]
  }
}
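The results show that document 1 scores higher than document 2, which is what we wanted. The should clause on title.shingles adds to the relevance score: document 1 contains the bigrams ridingroad likes and likes elasticsearch, which appear in the query in the same order, whereas document 2's bigrams elasticsearch likes and likes ridingroad match no bigram in the query.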
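The ranking can be reproduced with a toy overlap count. This sketch is only an intuition aid and makes strong assumptions: the `overlap_score` function is hypothetical and simply counts shared unigrams plus shared bigrams, whereas Elasticsearch computes real relevance scores with BM25, so the numbers will not match the `_score` values above.

```python
def unigrams(text):
    return set(text.lower().split())


def bigrams(text):
    tokens = text.lower().split()
    return {" ".join(pair) for pair in zip(tokens, tokens[1:])}


def overlap_score(query, title):
    # Shared single words stand in for the must clause on title,
    # shared bigrams stand in for the should clause on title.shingles.
    word_hits = len(unigrams(query) & unigrams(title))
    shingle_hits = len(bigrams(query) & bigrams(title))
    return word_hits + shingle_hits


query = "handsome ridingroad likes elasticsearch"
doc1 = "ridingroad likes elasticsearch"
doc2 = "elasticsearch likes ridingroad"

print(overlap_score(query, doc1))  # 5: three shared words plus two shared bigrams
print(overlap_score(query, doc2))  # 3: three shared words, no shared bigram
```

Both documents get the same credit for single words; the bigram matches alone put document 1 ahead.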