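"Did you mean" is a very important feature in search engines: it helps users by showing suggested terms so that they can search more accurately. For example, when we search on Baidu, it usually shows some more commonly used search options for us to choose from.
To build "Did you mean", we will use the phrase suggester, because it lets us suggest corrections for whole phrases and not only for single terms. I covered this topic in my previous article “Elasticsearch:如何实现短语建议 - phrase suggester”.
First, we will use a shingle filter, because it produces the tokens that the phrase suggester uses to match and return corrections. For a description of the shingle filter, please read the earlier article “Elasticsearch: Ngrams, edge ngrams, and shingles”.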
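To get a feel for what a shingle filter emits, here is a minimal Python sketch. It is only an approximation written for illustration, not Elasticsearch's implementation: the function name `shingles` is made up, whitespace splitting stands in for the standard tokenizer, and I assume unigrams are emitted alongside the shingles, which is the filter's default.

```python
def shingles(text, min_size=2, max_size=3, output_unigrams=True):
    """Rough word-level shingles (an illustration, not the real filter)."""
    # Whitespace split + lowercase is a crude stand-in for the
    # standard tokenizer followed by the lowercase filter.
    tokens = text.lower().split()
    result = []
    for start in range(len(tokens)):
        if output_unigrams:
            result.append(tokens[start])
        # Emit every shingle of min_size..max_size words starting here.
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                result.append(" ".join(tokens[start:start + size]))
    return result


for token in shingles("Guardians of the Galaxy"):
    print(token)
```

With sizes 2 and 3, every title contributes its word pairs and triples, and these are the statistics the phrase suggester relies on when it scores candidate corrections.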
Preparing the data
We first define the mapping:
PUT movies
{
  "settings": {
    "analysis": {
      "analyzer": {
        "en_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stop"
          ]
        },
        "shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "shingle_filter"
          ]
        }
      },
      "filter": {
        "shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "en_analyzer",
        "fields": {
          "suggest": {
            "type": "text",
            "analyzer": "shingle_analyzer"
          }
        }
      },
      "actors": {
        "type": "text",
        "analyzer": "en_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "description": {
        "type": "text",
        "analyzer": "en_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "director": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "genre": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "metascore": {
        "type": "long"
      },
      "rating": {
        "type": "float"
      },
      "revenue": {
        "type": "float"
      },
      "runtime": {
        "type": "long"
      },
      "votes": {
        "type": "long"
      },
      "year": {
        "type": "long"
      },
      "title_suggest": {
        "type": "completion",
        "analyzer": "simple",
        "preserve_separators": true,
        "preserve_position_increments": true,
        "max_input_length": 50
      }
    }
  }
}
Next, we use the _bulk command to write some documents into this index. We use the content from this link, in the following way:
POST movies/_bulk
{"index": {}}
{"title": "Guardians of the Galaxy", "genre": "Action,Adventure,Sci-Fi", "director": "James Gunn", "actors": "Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana", "description": "A group of intergalactic criminals are forced to work together to stop a fanatical warrior from taking control of the universe.", "year": 2014, "runtime": 121, "rating": 8.1, "votes": 757074, "revenue": 333.13, "metascore": 76}
{"index": {}}
{"title": "Prometheus", "genre": "Adventure,Mystery,Sci-Fi", "director": "Ridley Scott", "actors": "Noomi Rapace, Logan Marshall-Green, Michael Fassbender, Charlize Theron", "description": "Following clues to the origin of mankind, a team finds a structure on a distant moon, but they soon realize they are not alone.", "year": 2012, "runtime": 124, "rating": 7, "votes": 485820, "revenue": 126.46, "metascore": 65}
....
Above, I have omitted the remaining documents for brevity. You need to copy the entire movies.txt file and write all of it into Elasticsearch. It contains 1000 documents in total.
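If your documents are available as Python dictionaries rather than as a ready-made bulk file, the body of a _bulk request can be assembled as in the sketch below. The helper name `build_bulk_body` and the shortened sample documents are my own assumptions for illustration; the only fixed requirements are one action line per document and a newline at the end of the body.

```python
import json


def build_bulk_body(docs):
    """Build an NDJSON body for `POST movies/_bulk` (hypothetical helper)."""
    lines = []
    for doc in docs:
        # Action line: index the document and let Elasticsearch pick the _id.
        lines.append(json.dumps({"index": {}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Shortened sample documents, reduced to two fields for brevity.
docs = [
    {"title": "Guardians of the Galaxy", "year": 2014},
    {"title": "Prometheus", "year": 2012},
]
print(build_bulk_body(docs), end="")
```

The resulting string can be sent with any HTTP client using the `application/x-ndjson` content type.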
Searching the data
Now let's run a basic query to see the suggest results:
GET movies/_search?filter_path=suggest
{
  "suggest": {
    "text": "transformers revenge of the falen",
    "did_you_mean": {
      "phrase": {
        "field": "title.suggest",
        "size": 5
      }
    }
  }
}
The command above returns:
{
  "suggest": {
    "did_you_mean": [
      {
        "text": "transformers revenge of the falen",
        "offset": 0,
        "length": 33,
        "options": [
          {
            "text": "transformers revenge of the fallen",
            "score": 0.004467494
          },
          {
            "text": "transformers revenge of the fall",
            "score": 0.00020402104
          },
          {
            "text": "transformers revenge of the face",
            "score": 0.00006419608
          }
        ]
      }
    ]
  }
}
Note that with just a few lines you already get some promising results.
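On the application side, the "Did you mean" text is usually just the highest-scoring option. A minimal sketch of pulling it out of the response could look like this; the function name `top_suggestion` is hypothetical, and the sample data is a trimmed copy of the response shown above.

```python
def top_suggestion(response, name="did_you_mean"):
    """Return the text of the highest-scoring option, or None if there is none."""
    best = None
    for entry in response.get("suggest", {}).get(name, []):
        for option in entry.get("options", []):
            if best is None or option["score"] > best["score"]:
                best = option
    return best["text"] if best else None


# Trimmed copy of the response shown above.
response = {
    "suggest": {
        "did_you_mean": [
            {
                "text": "transformers revenge of the falen",
                "options": [
                    {"text": "transformers revenge of the fallen", "score": 0.004467494},
                    {"text": "transformers revenge of the fall", "score": 0.00020402104},
                    {"text": "transformers revenge of the face", "score": 0.00006419608},
                ],
            }
        ]
    }
}

print(top_suggestion(response))
```

Returning `None` when there are no options lets the caller simply skip the "Did you mean" line.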
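Now let's extend our query with more of the phrase suggester's features. We set max_errors = 2, so that at most two terms in the phrase can be treated as misspellings and corrected. We also add highlight to emphasize the suggested terms.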
GET movies/_search?filter_path=suggest
{
  "suggest": {
    "text": "transformer revenge of the falen",
    "did_you_mean": {
      "phrase": {
        "field": "title.suggest",
        "size": 5,
        "confidence": 1,
        "max_errors": 2,
        "highlight": {
          "pre_tag": "<strong>",
          "post_tag": "</strong>"
        }
      }
    }
  }
}
The command above returns:
{
  "suggest": {
    "did_you_mean": [
      {
        "text": "transformer revenge of the falen",
        "offset": 0,
        "length": 32,
        "options": [
          {
            "text": "transformers revenge of the fallen",
            "highlighted": "<strong>transformers</strong> revenge of the <strong>fallen</strong>",
            "score": 0.004382903
          },
          {
            "text": "transformers revenge of the fall",
            "highlighted": "<strong>transformers</strong> revenge of the <strong>fall</strong>",
            "score": 0.00020015794
          },
          {
            "text": "transformers revenge of the face",
            "highlighted": "<strong>transformers</strong> revenge of the <strong>face</strong>",
            "score": 0.00006298054
          },
          {
            "text": "transformers revenge of the falen",
            "highlighted": "<strong>transformers</strong> revenge of the falen",
            "score": 0.00006159308
          },
          {
            "text": "transformer revenge of the fallen",
            "highlighted": "transformer revenge of the <strong>fallen</strong>",
            "score": 0.000048000533
          }
        ]
      }
    ]
  }
}
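Shall we improve it a bit more? We add "collate", which runs a query for each suggestion and lets us refine the suggested results. I use a match query with the "and" operator so that all terms must match within the same title. If I still want the suggestions that do not satisfy the query to be returned, I use prune = true.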
GET movies/_search?filter_path=suggest
{
  "suggest": {
    "text": "transformer revenge of the falen",
    "did_you_mean": {
      "phrase": {
        "field": "title.suggest",
        "size": 5,
        "confidence": 1,
        "max_errors": 2,
        "collate": {
          "query": {
            "source": {
              "match": {
                "{{field_name}}": {
                  "query": "{{suggestion}}",
                  "operator": "and"
                }
              }
            }
          },
          "params": {"field_name": "title"},
          "prune": true
        },
        "highlight": {
          "pre_tag": "<strong>",
          "post_tag": "</strong>"
        }
      }
    }
  }
}
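The result is now:
Note that the response has changed: there is a new field, "collate_match", which indicates whether each result matches the collate query (it appears because prune = true).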
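To make the collate step more concrete: for each candidate suggestion, the template under "source" is rendered with the values from "params" plus the suggestion text, and the resulting query is run. The sketch below imitates only the rendering part with plain string replacement. It is a simplified stand-in for the Mustache rendering that Elasticsearch performs on the server, and `render_collate_query` is a made-up name.

```python
import json

# The collate template from the request above.
COLLATE_SOURCE = {
    "match": {
        "{{field_name}}": {"query": "{{suggestion}}", "operator": "and"}
    }
}


def render_collate_query(source, suggestion, params):
    """Fill in {{...}} placeholders (simplified; not a real Mustache engine)."""
    text = json.dumps(source)
    variables = dict(params, suggestion=suggestion)
    for key, value in variables.items():
        # json.dumps(...)[1:-1] escapes quotes and backslashes in the value
        # so that the rendered text is still valid JSON.
        text = text.replace("{{" + key + "}}", json.dumps(value)[1:-1])
    return json.loads(text)


query = render_collate_query(
    COLLATE_SOURCE,
    "transformers revenge of the fallen",
    {"field_name": "title"},
)
print(json.dumps(query, indent=2))
```

The rendered query is an ordinary match query on the title field, so a suggestion only counts as a match when a title contains all of its terms.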
Let's set prune to false:
GET movies/_search?filter_path=suggest
{
  "suggest": {
    "text": "transformer revenge of the falen",
    "did_you_mean": {
      "phrase": {
        "field": "title.suggest",
        "size": 5,
        "confidence": 1,
        "max_errors": 2,
        "collate": {
          "query": {
            "source": {
              "match": {
                "{{field_name}}": {
                  "query": "{{suggestion}}",
                  "operator": "and"
                }
              }
            }
          },
          "params": {"field_name": "title"},
          "prune": false
        },
        "highlight": {
          "pre_tag": "<strong>",
          "post_tag": "</strong>"
        }
      }
    }
  }
}
This time the result is:
We can see that only one result is returned: the most relevant suggestion.
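The difference between the two prune settings can be imitated with a toy model. Everything in this sketch is an assumption made for illustration: the function names, the tiny stop-word set, and an "index" that contains the terms of a single title. It is not the real index and not Elasticsearch's code. The candidate texts are the five options from the earlier response.

```python
STOPWORDS = {"of", "the"}                            # tiny stand-in for the stop filter
TITLE_TERMS = {"transformers", "revenge", "fallen"}  # hypothetical one-title index


def toy_match(text):
    """Imitate a match query with operator "and" against the toy index."""
    terms = [t for t in text.split() if t not in STOPWORDS]
    return all(t in TITLE_TERMS for t in terms)


def apply_collate(options, matches, prune):
    """prune=True keeps every option and flags it; prune=False drops non-matches."""
    result = []
    for option in options:
        matched = matches(option["text"])
        if prune:
            result.append(dict(option, collate_match=matched))
        elif matched:
            result.append(dict(option))
    return result


options = [
    {"text": "transformers revenge of the fallen"},
    {"text": "transformers revenge of the fall"},
    {"text": "transformers revenge of the face"},
    {"text": "transformers revenge of the falen"},
    {"text": "transformer revenge of the fallen"},
]

print(apply_collate(options, toy_match, prune=True))
print(apply_collate(options, toy_match, prune=False))
```

In this toy model, prune = true keeps all five candidates and marks only the first one as a match, while prune = false returns that single candidate.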