Elasticsearch: Creating a simple "Did you mean?" search suggestion


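"Did you mean?" is a very important feature in search engines: it helps users by showing suggested terms so that they can run a more accurate search. For example, when we search on Baidu, it usually shows more commonly used suggested queries for us to choose from.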

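To create "Did you mean?", we will use the phrase suggester, because it lets us suggest corrections for whole phrases rather than only for individual terms. I covered this topic in my earlier article "Elasticsearch:如何实现短语建议 - phrase suggester".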

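First, we will use a shingle filter, because it produces the tokens (word n-grams) that the phrase suggester uses to match against and return corrections. For a description of the shingle filter, see the earlier article "Elasticsearch: Ngrams, edge ngrams, and shingles".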
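To make the idea of shingles concrete, here is a minimal Python sketch that approximates what a lowercase filter followed by a shingle filter (sizes 2 to 3, unigrams kept) emits for a title. It is only an illustration: the function name `shingles` is my own, and the regex split is a rough stand-in for the standard tokenizer, not Elasticsearch's actual implementation.

```python
import re


def shingles(text, min_size=2, max_size=3, output_unigrams=True):
    """Approximate a lowercase + shingle filter chain in plain Python."""
    # Rough stand-in for the standard tokenizer: split on non-word characters.
    tokens = [t for t in re.split(r"\W+", text.lower()) if t]
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        # Emit every word n-gram of the allowed sizes that starts at position i.
        for n in range(min_size, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


print(shingles("Guardians of the Galaxy"))
```

Each title therefore contributes single words as well as two-word and three-word sequences, which is what lets the suggester score a whole corrected phrase instead of isolated terms.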

Preparing the data

We first define the mapping:



PUT movies
{
  "settings": {
    "analysis": {
      "analyzer": {
        "en_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stop"
          ]
        },
        "shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "shingle_filter"
          ]
        }
      },
      "filter": {
        "shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "en_analyzer",
        "fields": {
          "suggest": {
            "type": "text",
            "analyzer": "shingle_analyzer"
          }
        }
      },
      "actors": {
        "type": "text",
        "analyzer": "en_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "description": {
        "type": "text",
        "analyzer": "en_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "director": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "genre": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "metascore": {
        "type": "long"
      },
      "rating": {
        "type": "float"
      },
      "revenue": {
        "type": "float"
      },
      "runtime": {
        "type": "long"
      },
      "votes": {
        "type": "long"
      },
      "year": {
        "type": "long"
      },
      "title_suggest": {
        "type": "completion",
        "analyzer": "simple",
        "preserve_separators": true,
        "preserve_position_increments": true,
        "max_input_length": 50
      }
    }
  }
}


Next, we use the _bulk API to write some documents into this index. We use the content from this link, in the following way:



POST movies/_bulk
{"index": {}}
{"title": "Guardians of the Galaxy", "genre": "Action,Adventure,Sci-Fi", "director": "James Gunn", "actors": "Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana", "description": "A group of intergalactic criminals are forced to work together to stop a fanatical warrior from taking control of the universe.", "year": 2014, "runtime": 121, "rating": 8.1, "votes": 757074, "revenue": 333.13, "metascore": 76}
{"index": {}}
{"title": "Prometheus", "genre": "Adventure,Mystery,Sci-Fi", "director": "Ridley Scott", "actors": "Noomi Rapace, Logan Marshall-Green, Michael Fassbender, Charlize Theron", "description": "Following clues to the origin of mankind, a team finds a structure on a distant moon, but they soon realize they are not alone.", "year": 2012, "runtime": 124, "rating": 7, "votes": 485820, "revenue": 126.46, "metascore": 65}
....


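In the request above, I have left out the remaining documents for brevity. You need to copy the entire movies.txt file and write all of it into Elasticsearch. It contains 1000 documents in total.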
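If your copy of the data happens to be a list of plain JSON documents rather than ready-made bulk lines, a small helper can produce the action/document pairs that the _bulk API expects. This is a sketch under that assumption; `build_bulk_body` is a hypothetical name of mine, and the two short documents are trimmed-down versions of the ones shown above.

```python
import json


def build_bulk_body(docs):
    """Turn a list of dicts into an NDJSON body for POST movies/_bulk."""
    lines = []
    for doc in docs:
        # Action line: index the document and let Elasticsearch assign the _id.
        lines.append(json.dumps({"index": {}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc, ensure_ascii=False))
    if not lines:
        return ""
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"


docs = [
    {"title": "Guardians of the Galaxy", "year": 2014},
    {"title": "Prometheus", "year": 2012},
]
print(build_bulk_body(docs), end="")
```

The resulting string can be pasted after `POST movies/_bulk` in the console or sent by an HTTP client with the `application/x-ndjson` content type.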

Searching the data

Now let's run a basic query to see the suggest results:



GET movies/_search?filter_path=suggest
{
  "suggest": {
    "text": "transformers revenge of the falen",
    "did_you_mean": {
      "phrase": {
        "field": "title.suggest",
        "size": 5
      }
    }
  }
}


The command above returns:



{
  "suggest": {
    "did_you_mean": [
      {
        "text": "transformers revenge of the falen",
        "offset": 0,
        "length": 33,
        "options": [
          {
            "text": "transformers revenge of the fallen",
            "score": 0.004467494
          },
          {
            "text": "transformers revenge of the fall",
            "score": 0.00020402104
          },
          {
            "text": "transformers revenge of the face",
            "score": 0.00006419608
          }
        ]
      }
    ]
  }
}


Note that with just a few lines we already get some promising results.
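In an application, the usual next step is to pull the top option out of this response and show it as "Did you mean: ...". The following Python sketch does that on the parsed JSON; the helper name `best_suggestion` is my own, and it relies on the options being listed in descending score order, as they are in the output above.

```python
def best_suggestion(response, name="did_you_mean"):
    """Return the text of the top option of a named suggester, or None."""
    for entry in response.get("suggest", {}).get(name, []):
        options = entry.get("options", [])
        if options:
            # Options are listed best first, so the first one is the top suggestion.
            return options[0]["text"]
    return None


# The response shown above, as a Python dict.
response = {
    "suggest": {
        "did_you_mean": [
            {
                "text": "transformers revenge of the falen",
                "offset": 0,
                "length": 33,
                "options": [
                    {"text": "transformers revenge of the fallen", "score": 0.004467494},
                    {"text": "transformers revenge of the fall", "score": 0.00020402104},
                    {"text": "transformers revenge of the face", "score": 0.00006419608},
                ],
            }
        ]
    }
}

print(best_suggestion(response))
```

When the suggester returns no options, the helper returns None, which is a natural signal not to show a "Did you mean?" line at all.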

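Now let's extend our query with more of the phrase suggester's features. The input text now contains two misspelled terms, "transformer" and "falen", so we set max_errors = 2, which allows at most two terms in the phrase to be treated as misspellings and corrected. We also add highlight to mark the suggested terms.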



GET movies/_search?filter_path=suggest
{
  "suggest": {
    "text": "transformer revenge of the falen",
    "did_you_mean": {
      "phrase": {
        "field": "title.suggest",
        "size": 5,
        "confidence": 1,
        "max_errors": 2,
        "highlight": {
          "pre_tag": "<strong>",
          "post_tag": "</strong>"
        }
      }
    }
  }
}


The command above returns:



{
  "suggest": {
    "did_you_mean": [
      {
        "text": "transformer revenge of the falen",
        "offset": 0,
        "length": 32,
        "options": [
          {
            "text": "transformers revenge of the fallen",
            "highlighted": "<strong>transformers</strong> revenge of the <strong>fallen</strong>",
            "score": 0.004382903
          },
          {
            "text": "transformers revenge of the fall",
            "highlighted": "<strong>transformers</strong> revenge of the <strong>fall</strong>",
            "score": 0.00020015794
          },
          {
            "text": "transformers revenge of the face",
            "highlighted": "<strong>transformers</strong> revenge of the <strong>face</strong>",
            "score": 0.00006298054
          },
          {
            "text": "transformers revenge of the falen",
            "highlighted": "<strong>transformers</strong> revenge of the falen",
            "score": 0.00006159308
          },
          {
            "text": "transformer revenge of the fallen",
            "highlighted": "transformer revenge of the <strong>fallen</strong>",
            "score": 0.000048000533
          }
        ]
      }
    ]
  }
}
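Since the request body is the same for every user query apart from the text, it can be convenient to build it in code. Below is a minimal Python sketch that produces the same body as the request above; the function name `phrase_suggest_body` and its parameter names are my own choices, while the keys inside the dict are the ones used in this article.

```python
def phrase_suggest_body(text, field="title.suggest", size=5, confidence=1,
                        max_errors=2, pre_tag="<strong>", post_tag="</strong>"):
    """Build the body of a phrase-suggester request with highlighting."""
    return {
        "suggest": {
            "text": text,
            "did_you_mean": {
                "phrase": {
                    "field": field,
                    "size": size,
                    "confidence": confidence,
                    # Upper bound on how many terms may be corrected.
                    "max_errors": max_errors,
                    "highlight": {"pre_tag": pre_tag, "post_tag": post_tag},
                }
            },
        }
    }


body = phrase_suggest_body("transformer revenge of the falen")
print(body["suggest"]["did_you_mean"]["phrase"]["max_errors"])
```

The returned dict can be serialized to JSON and sent to `movies/_search?filter_path=suggest` with whichever HTTP or Elasticsearch client you use.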


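Shall we improve it a bit more? We add "collate", which runs a query for each suggestion and so lets us check the suggestions against the indexed documents. I use a match query with the "and" operator so that all terms of a suggestion must match in the same title. If I still want the suggestions that do not satisfy the query to be returned, I use prune = true.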



GET movies/_search?filter_path=suggest
{
  "suggest": {
    "text": "transformer revenge of the falen",
    "did_you_mean": {
      "phrase": {
        "field": "title.suggest",
        "size": 5,
        "confidence": 1,
        "max_errors": 2,
        "collate": {
          "query": {
            "source": {
              "match": {
                "{{field_name}}": {
                  "query": "{{suggestion}}",
                  "operator": "and"
                }
              }
            }
          },
          "params": {"field_name": "title"},
          "prune": true
        },
        "highlight": {
          "pre_tag": "<strong>",
          "post_tag": "</strong>"
        }
      }
    }
  }
}


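The result is now:

Note that the response has changed: each option now has a new field, "collate_match", which indicates whether that option matched the collate query (the field appears because prune = true).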
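With prune = true the client receives every option and can decide what to do with the ones that did not match. As a sketch of that client-side step, the Python helper below splits a list of options on the collate_match flag. The name `split_by_collate` is mine, and the sample options are made-up placeholders for illustration, not the actual output of the query above.

```python
def split_by_collate(options):
    """Separate options that matched the collate query from those that did not."""
    matched = [o for o in options if o.get("collate_match")]
    unmatched = [o for o in options if not o.get("collate_match")]
    return matched, unmatched


# Hypothetical options, shaped like the ones a pruned collate response contains.
sample_options = [
    {"text": "suggestion a", "score": 0.004, "collate_match": True},
    {"text": "suggestion b", "score": 0.002, "collate_match": False},
    {"text": "suggestion c", "score": 0.001, "collate_match": False},
]

matched, unmatched = split_by_collate(sample_options)
print([o["text"] for o in matched])
print([o["text"] for o in unmatched])
```

One possible use is to show the matched options as the main "Did you mean?" line and keep the unmatched ones only for logging or tuning.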

Let's set prune to false:



GET movies/_search?filter_path=suggest
{
  "suggest": {
    "text": "transformer revenge of the falen",
    "did_you_mean": {
      "phrase": {
        "field": "title.suggest",
        "size": 5,
        "confidence": 1,
        "max_errors": 2,
        "collate": {
          "query": {
            "source": {
              "match": {
                "{{field_name}}": {
                  "query": "{{suggestion}}",
                  "operator": "and"
                }
              }
            }
          },
          "params": {"field_name": "title"},
          "prune": false
        },
        "highlight": {
          "pre_tag": "<strong>",
          "post_tag": "</strong>"
        }
      }
    }
  }
}


This time the result we get is:

We can see that only one result is returned: the most relevant suggestion.
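The two collate requests differ only in the prune flag, so the clause is easy to generate. Here is a small Python sketch, assuming the same match query with the "and" operator used in this article; `collate_clause` is a hypothetical helper name, and the returned dict is meant to be placed under the "collate" key of the phrase suggester.

```python
def collate_clause(field_name="title", prune=False):
    """Build the "collate" section of a phrase-suggester request."""
    return {
        "query": {
            "source": {
                "match": {
                    # Both placeholders are template variables filled in by Elasticsearch.
                    "{{field_name}}": {
                        "query": "{{suggestion}}",
                        "operator": "and",
                    }
                }
            }
        },
        "params": {"field_name": field_name},
        # True: keep all options and flag each with collate_match.
        # False: keep only the options that match the query.
        "prune": prune,
    }


strict = collate_clause(prune=False)
flagged = collate_clause(prune=True)
print(strict["prune"], flagged["prune"])
```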