Pagination with a standard query works very well when you search documents that do not change often; otherwise, paginating over live data returns unpredictable results. To work around this problem, Elasticsearch provides an extra query parameter: scroll. If you are not familiar with paginating search results, please refer to my earlier article "Elasticsearch: using the scroll API for better pagination over large data sets".
Preparing the data
For today's exercise, to keep the illustration simple, we use the following data:
```
POST _bulk
{ "index" : { "_index" : "twitter", "_id": 1} }
{"user":"双榆树-张三","message":"今儿天气不错啊,出去转转去","uid":2,"age":20,"city":"北京","province":"北京","country":"中国","address":"中国北京市海淀区","location":{"lat":"39.970718","lon":"116.325747"}}
{ "index" : { "_index" : "twitter", "_id": 2 }}
{"user":"东城区-老刘","message":"出发,下一站云南!","uid":3,"age":30,"city":"北京","province":"北京","country":"中国","address":"中国北京市东城区台基厂三条3号","location":{"lat":"39.904313","lon":"116.412754"}}
{ "index" : { "_index" : "twitter", "_id": 3} }
{"user":"东城区-李四","message":"happy birthday!","uid":4,"age":30,"city":"北京","province":"北京","country":"中国","address":"中国北京市东城区","location":{"lat":"39.893801","lon":"116.408986"}}
{ "index" : { "_index" : "twitter", "_id": 4} }
{"user":"朝阳区-老贾","message":"123,gogogo","uid":5,"age":35,"city":"北京","province":"北京","country":"中国","address":"中国北京市朝阳区建国门","location":{"lat":"39.718256","lon":"116.367910"}}
{ "index" : { "_index" : "twitter", "_id": 5} }
{"user":"朝阳区-老王","message":"Happy BirthDay My Friend!","uid":6,"age":50,"city":"北京","province":"北京","country":"中国","address":"中国北京市朝阳区国贸","location":{"lat":"39.918256","lon":"116.467910"}}
{ "index" : { "_index" : "twitter", "_id": 6} }
{"user":"虹桥-老吴","message":"好友来了都今天我生日,好友来了,什么 birthday happy 就成!","uid":7,"age":90,"city":"上海","province":"上海","country":"中国","address":"中国上海市闵行区","location":{"lat":"31.175927","lon":"121.383328"}}
```
Above, we write 6 documents into Elasticsearch. For this exercise I will set the page size to 2 documents. We can run the following search:
```
GET twitter/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "city": "北京"
          }
        }
      ],
      "filter": [
        {
          "range": {
            "age": {
              "gte": 0,
              "lte": 100
            }
          }
        }
      ]
    }
  },
  "size": 2
}
```
The search above returns the first two hits:
```
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": 0.48232412,
    "hits": [
      {
        "_index": "twitter",
        "_id": "1",
        "_score": 0.48232412,
        "_source": {
          "user": "双榆树-张三",
          "message": "今儿天气不错啊,出去转转去",
          "uid": 2,
          "age": 20,
          "city": "北京",
          "province": "北京",
          "country": "中国",
          "address": "中国北京市海淀区"
        }
      },
      {
        "_index": "twitter",
        "_id": "2",
        "_score": 0.48232412,
        "_source": {
          "user": "东城区-老刘",
          "message": "出发,下一站云南!",
          "uid": 3,
          "age": 30,
          "city": "北京",
          "province": "北京",
          "country": "中国",
          "address": "中国北京市东城区台基厂三条3号"
        }
      }
    ]
  }
}
```
From the output above we can see that 5 documents in total match the query. At 2 documents per page, that makes 3 pages. So how do we paginate the results? We can use the scroll parameter:
```
GET twitter/_search?scroll=2m
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "city": "北京"
          }
        }
      ],
      "filter": [
        {
          "range": {
            "age": {
              "gte": 0,
              "lte": 100
            }
          }
        }
      ]
    }
  },
  "size": 2
}
```
Above, 2m means the scroll context stays valid for 2 minutes. The response is:
```
{
  "_scroll_id": "FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFi1rOUlBMFdGU2tLSS0yTlMyUkdRdUEAAAAAAAFeHBZReU4zSnhXVlR5eW5WQW5Yb09RSHNR",
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": 0.48232412,
    "hits": [
      {
        "_index": "twitter",
        "_id": "1",
        "_score": 0.48232412,
        "_source": {
          "user": "双榆树-张三",
          "message": "今儿天气不错啊,出去转转去",
          "uid": 2,
          "age": 20,
          "city": "北京",
          "province": "北京",
          "country": "中国",
          "address": "中国北京市海淀区"
        }
      },
      {
        "_index": "twitter",
        "_id": "2",
        "_score": 0.48232412,
        "_source": {
          "user": "东城区-老刘",
          "message": "出发,下一站云南!",
          "uid": 3,
          "age": 30,
          "city": "北京",
          "province": "北京",
          "country": "中国",
          "address": "中国北京市东城区台基厂三条3号"
        }
      }
    ]
  }
}
```
Clearly it returns the two hits of the first page, but it also returns a _scroll_id. We can use this _scroll_id to fetch the second page of results:
```
GET _search/scroll
{
  "scroll": "2m",
  "scroll_id": "FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFi1rOUlBMFdGU2tLSS0yTlMyUkdRdUEAAAAAAAFeHBZReU4zSnhXVlR5eW5WQW5Yb09RSHNR"
}
```
The response to the request above is:
```
{
  "_scroll_id": "FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFi1rOUlBMFdGU2tLSS0yTlMyUkdRdUEAAAAAAAFeHBZReU4zSnhXVlR5eW5WQW5Yb09RSHNR",
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": 0.48232412,
    "hits": [
      {
        "_index": "twitter",
        "_id": "3",
        "_score": 0.48232412,
        "_source": {
          "user": "东城区-李四",
          "message": "happy birthday!",
          "uid": 4,
          "age": 30,
          "city": "北京",
          "province": "北京",
          "country": "中国",
          "address": "中国北京市东城区"
        }
      },
      {
        "_index": "twitter",
        "_id": "4",
        "_score": 0.48232412,
        "_source": {
          "user": "朝阳区-老贾",
          "message": "123,gogogo",
          "uid": 5,
          "age": 35,
          "city": "北京",
          "province": "北京",
          "country": "中国",
          "address": "中国北京市朝阳区建国门"
        }
      }
    ]
  }
}
```
We can keep feeding the returned _scroll_id into the scroll API to fetch the following pages, until the hits array comes back empty.
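Each scroll keeps a search context open on the cluster for its keep-alive duration. Once we have read all the pages we need, it is good practice to release the context explicitly rather than wait for the 2-minute timeout to expire:

```
DELETE _search/scroll
{
  "scroll_id": "FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFi1rOUlBMFdGU2tLSS0yTlMyUkdRdUEAAAAAAAFeHBZReU4zSnhXVlR5eW5WQW5Yb09RSHNR"
}
```

`DELETE _search/scroll/_all` releases all open scroll contexts at once, and the Java API client exposes the same operation as `client.clearScroll(...)`.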
Paginating with the Java client APIs
Next, we design a Java application that paginates the search results. To make the code easier to follow, I have uploaded the final project to GitHub: github.com/liu-xiao-gu…
First we create a class called Twitter:
Twitter.java
```java
public class Twitter {
    private String user;
    private long uid;
    private String province;
    private String message;
    private String country;
    private String city;
    private long age;
    private String address;

    public Twitter() {
    }

    public Twitter(String user, long uid, String province, String message,
                   String country, String city, long age, String address) {
        this.user = user;
        this.uid = uid;
        this.province = province;
        this.message = message;
        this.country = country;
        this.city = city;
        this.age = age;
        this.address = address;
    }

    public String getUser() {
        return user;
    }

    public long getUid() {
        return uid;
    }

    public String getProvince() {
        return province;
    }

    public String getMessage() {
        return message;
    }

    public String getCountry() {
        return country;
    }

    public String getCity() {
        return city;
    }

    public long getAge() {
        return age;
    }

    public String getAddress() {
        return address;
    }

    public void setUser(String user) {
        this.user = user;
    }

    public void setUid(long uid) {
        this.uid = uid;
    }

    public void setProvince(String province) {
        this.province = province;
    }

    public void setMessage(String message) {
        this.message = message;
    }

    public void setCountry(String country) {
        this.country = country;
    }

    public void setCity(String city) {
        this.city = city;
    }

    public void setAge(long age) {
        this.age = age;
    }

    public void setAddress(String address) {
        this.address = address;
    }
}
```
This class mirrors the twitter documents above; the Java API client deserializes each hit's _source into it.
Next we connect to the Elasticsearch cluster. You can refer to my earlier article "Elasticsearch: using a truststore in the Java client to create an HTTPS connection". Once connected to Elasticsearch, we can write the following code to paginate the search results:
ElasticsearchJava.java
```java
final String INDEX_NAME = "twitter";

SearchRequest searchRequest = new SearchRequest.Builder()
        .index(INDEX_NAME)
        .query(q -> q.bool(b -> b
                .must(must -> must.match(m -> m.field("city").query("北京")))
                .filter(f -> f.range(r -> r.field("age").gte(JsonData.of(0)).lte(JsonData.of(100))))
        ))
        .size(2)
        .scroll(Time.of(t -> t.time("2m")))
        .build();

SearchResponse<Twitter> response = client.search(searchRequest, Twitter.class);

do {
    System.out.println("size: " + response.hits().hits().size());

    for (Hit<Twitter> hit : response.hits().hits()) {
        System.out.println("hit: " + hit.index() + ": " + hit.id());
    }

    final SearchResponse<Twitter> old_response = response;
    System.out.println("scrollId: " + old_response.scrollId());

    response = client.scroll(s -> s.scrollId(old_response.scrollId())
                                   .scroll(Time.of(t -> t.time("2m"))),
                             Twitter.class);

    System.out.println("=================================");
} while (response.hits().hits().size() != 0); // 0 hits marks the end of the scroll and of the while loop
```
After running the code above, we see the following output:
```
size: 2
hit: twitter: 1
hit: twitter: 2
scrollId: FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFi1rOUlBMFdGU2tLSS0yTlMyUkdRdUEAAAAAAAFAnxZReU4zSnhXVlR5eW5WQW5Yb09RSHNR
=================================
size: 2
hit: twitter: 3
hit: twitter: 4
scrollId: FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFi1rOUlBMFdGU2tLSS0yTlMyUkdRdUEAAAAAAAFAnxZReU4zSnhXVlR5eW5WQW5Yb09RSHNR
=================================
size: 1
hit: twitter: 5
scrollId: FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFi1rOUlBMFdGU2tLSS0yTlMyUkdRdUEAAAAAAAFAnxZReU4zSnhXVlR5eW5WQW5Yb09RSHNR
=================================
```
From this output we can see that there are three pages, and that 5 documents in total matched the search.
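The three pages seen above follow from a simple ceiling division of the hit total by the page size; here is a minimal sketch (the class and method names are my own, not part of any Elasticsearch API):

```java
public class PageMath {
    // Number of scroll pages needed to cover `total` hits at `size` hits per page.
    static long pagesFor(long total, long size) {
        return (total + size - 1) / size; // ceiling division for positive values
    }

    public static void main(String[] args) {
        // 5 matching documents at 2 per page -> pages of 2, 2 and 1 hits.
        System.out.println(pagesFor(5, 2)); // prints 3
    }
}
```

Note that the do-while loop above still issues one extra scroll request after the last page, because only an empty hits array tells it to stop.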