Elasticsearch: Using scroll in the Java client to iterate over search results - Elastic Stack 8.x


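Paginating with standard queries works well when the documents you search rarely change. With live data, however, pagination can return unpredictable results. To work around this, Elasticsearch provides an additional query parameter: scroll. If you are not familiar with paginating search results, see my earlier article “Elasticsearch:运用 scroll 接口对大量数据实现更好的分页” (on using the scroll API to paginate large result sets).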
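To make the "unpredictable results" concrete, here is a minimal sketch that makes no Elasticsearch calls at all. It models offset-style paging over a plain in-memory list, and the class and method names (`PagingDriftDemo`, `page`, and so on) are hypothetical. A document inserted between two page reads causes a duplicate under offset paging, while paging over a frozen copy, which is roughly what a scroll context gives you, is unaffected.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model; it only imitates from/size paging and a frozen snapshot.
class PagingDriftDemo {

    // Offset-based paging: reads `size` items starting at `from` from whatever the list holds now.
    static List<String> page(List<String> docs, int from, int size) {
        int start = Math.min(from, docs.size());
        int end = Math.min(start + size, docs.size());
        return new ArrayList<>(docs.subList(start, end));
    }

    // Pages through the live list while a new document arrives after the first page was read.
    static List<String> readLiveWithInsert() {
        List<String> live = new ArrayList<>(List.of("1", "2", "3", "4", "5"));
        List<String> seen = new ArrayList<>(page(live, 0, 2)); // reads 1 and 2
        live.add(0, "0");                                      // new document sorts first
        seen.addAll(page(live, 2, 2));                         // reads 2 and 3: "2" repeats
        seen.addAll(page(live, 4, 2));                         // reads 4 and 5
        return seen;
    }

    // Same sequence of events, but every page is read from a copy taken up front.
    static List<String> readSnapshotWithInsert() {
        List<String> live = new ArrayList<>(List.of("1", "2", "3", "4", "5"));
        List<String> snapshot = List.copyOf(live);             // frozen view of the data
        List<String> seen = new ArrayList<>(page(snapshot, 0, 2));
        live.add(0, "0");                                      // does not affect the snapshot
        seen.addAll(page(snapshot, 2, 2));
        seen.addAll(page(snapshot, 4, 2));
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("live:     " + readLiveWithInsert());     // [1, 2, 2, 3, 4, 5]
        System.out.println("snapshot: " + readSnapshotWithInsert()); // [1, 2, 3, 4, 5]
    }
}
```

The snapshot here is only an analogy: a real scroll keeps a server-side search context alive rather than copying documents to the client.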

Preparing the data

For convenience, today's exercise uses the following data:



```
POST _bulk
{ "index" : { "_index" : "twitter", "_id": 1} }
{"user":"双榆树-张三","message":"今儿天气不错啊,出去转转去","uid":2,"age":20,"city":"北京","province":"北京","country":"中国","address":"中国北京市海淀区","location":{"lat":"39.970718","lon":"116.325747"}}
{ "index" : { "_index" : "twitter", "_id": 2 }}
{"user":"东城区-老刘","message":"出发,下一站云南!","uid":3,"age":30,"city":"北京","province":"北京","country":"中国","address":"中国北京市东城区台基厂三条3号","location":{"lat":"39.904313","lon":"116.412754"}}
{ "index" : { "_index" : "twitter", "_id": 3} }
{"user":"东城区-李四","message":"happy birthday!","uid":4,"age":30,"city":"北京","province":"北京","country":"中国","address":"中国北京市东城区","location":{"lat":"39.893801","lon":"116.408986"}}
{ "index" : { "_index" : "twitter", "_id": 4} }
{"user":"朝阳区-老贾","message":"123,gogogo","uid":5,"age":35,"city":"北京","province":"北京","country":"中国","address":"中国北京市朝阳区建国门","location":{"lat":"39.718256","lon":"116.367910"}}
{ "index" : { "_index" : "twitter", "_id": 5} }
{"user":"朝阳区-老王","message":"Happy BirthDay My Friend!","uid":6,"age":50,"city":"北京","province":"北京","country":"中国","address":"中国北京市朝阳区国贸","location":{"lat":"39.918256","lon":"116.467910"}}
{ "index" : { "_index" : "twitter", "_id": 6} }
{"user":"虹桥-老吴","message":"好友来了都今天我生日,好友来了,什么 birthday happy 就成!","uid":7,"age":90,"city":"上海","province":"上海","country":"中国","address":"中国上海市闵行区","location":{"lat":"31.175927","lon":"121.383328"}}
```


Above, we indexed 6 documents into Elasticsearch. In this exercise, I will set the page size to 2 documents. We can run the following search:

```
GET twitter/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "city": "北京"
          }
        }
      ],
      "filter": [
        {
          "range": {
            "age": {
              "gte": 0,
              "lte": 100
            }
          }
        }
      ]
    }
  },
  "size": 2
}
```

The search above returns the first two results:

```json
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": 0.48232412,
    "hits": [
      {
        "_index": "twitter",
        "_id": "1",
        "_score": 0.48232412,
        "_source": {
          "user": "双榆树-张三",
          "message": "今儿天气不错啊,出去转转去",
          "uid": 2,
          "age": 20,
          "city": "北京",
          "province": "北京",
          "country": "中国",
          "address": "中国北京市海淀区"
        }
      },
      {
        "_index": "twitter",
        "_id": "2",
        "_score": 0.48232412,
        "_source": {
          "user": "东城区-老刘",
          "message": "出发,下一站云南!",
          "uid": 3,
          "age": 30,
          "city": "北京",
          "province": "北京",
          "country": "中国",
          "address": "中国北京市东城区台基厂三条3号"
        }
      }
    ]
  }
}
```

The result shows that 5 documents match the search criteria. At 2 documents per page, that makes 3 pages. So how do we page through the search results? We can use the scroll parameter:

```
GET twitter/_search?scroll=2m
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "city": "北京"
          }
        }
      ],
      "filter": [
        {
          "range": {
            "age": {
              "gte": 0,
              "lte": 100
            }
          }
        }
      ]
    }
  },
  "size": 2
}
```

Here, 2m means that the scroll context stays valid for 2 minutes. The request returns:

```json
{
  "_scroll_id": "FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFi1rOUlBMFdGU2tLSS0yTlMyUkdRdUEAAAAAAAFeHBZReU4zSnhXVlR5eW5WQW5Yb09RSHNR",
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": 0.48232412,
    "hits": [
      {
        "_index": "twitter",
        "_id": "1",
        "_score": 0.48232412,
        "_source": {
          "user": "双榆树-张三",
          "message": "今儿天气不错啊,出去转转去",
          "uid": 2,
          "age": 20,
          "city": "北京",
          "province": "北京",
          "country": "中国",
          "address": "中国北京市海淀区"
        }
      },
      {
        "_index": "twitter",
        "_id": "2",
        "_score": 0.48232412,
        "_source": {
          "user": "东城区-老刘",
          "message": "出发,下一站云南!",
          "uid": 3,
          "age": 30,
          "city": "北京",
          "province": "北京",
          "country": "中国",
          "address": "中国北京市东城区台基厂三条3号"
        }
      }
    ]
  }
}
```

Clearly, it returns the two results of the first page, along with a _scroll_id. We can use this _scroll_id to fetch the second page of results:



```
GET _search/scroll
{
  "scroll": "2m",
  "scroll_id": "FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFi1rOUlBMFdGU2tLSS0yTlMyUkdRdUEAAAAAAAFeHBZReU4zSnhXVlR5eW5WQW5Yb09RSHNR"
}
```


The request above returns:

```json
{
  "_scroll_id": "FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFi1rOUlBMFdGU2tLSS0yTlMyUkdRdUEAAAAAAAFeHBZReU4zSnhXVlR5eW5WQW5Yb09RSHNR",
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": 0.48232412,
    "hits": [
      {
        "_index": "twitter",
        "_id": "3",
        "_score": 0.48232412,
        "_source": {
          "user": "东城区-李四",
          "message": "happy birthday!",
          "uid": 4,
          "age": 30,
          "city": "北京",
          "province": "北京",
          "country": "中国",
          "address": "中国北京市东城区"
        }
      },
      {
        "_index": "twitter",
        "_id": "4",
        "_score": 0.48232412,
        "_source": {
          "user": "朝阳区-老贾",
          "message": "123,gogogo",
          "uid": 5,
          "age": 35,
          "city": "北京",
          "province": "北京",
          "country": "中国",
          "address": "中国北京市朝阳区建国门"
        }
      }
    ]
  }
}
```

We can keep passing the returned _scroll_id to fetch the subsequent results, until the hits array comes back empty.
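The stopping rule described above (keep requesting pages until one comes back empty) can be sketched without a cluster. This is only an in-memory stand-in: `ScrollLoopSketch`, `Cursor`, and `drain` are hypothetical names, and the list of IDs plays the role of the five matching documents.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a scroll context; it only mimics the
// "ask for the next page until it is empty" protocol, not Elasticsearch itself.
class ScrollLoopSketch {

    static final class Cursor {
        private final List<String> snapshot; // frozen copy of the matching IDs
        private final int size;              // hits per page
        private int position = 0;            // how far we have read so far

        Cursor(List<String> matches, int size) {
            this.snapshot = List.copyOf(matches);
            this.size = size;
        }

        // Returns the next page and advances; an empty list means the scroll is exhausted.
        List<String> next() {
            int end = Math.min(position + size, snapshot.size());
            List<String> page = snapshot.subList(position, end);
            position = end;
            return page;
        }
    }

    // Collects every non-empty page, stopping at the first empty one.
    static List<List<String>> drain(List<String> matches, int size) {
        Cursor cursor = new Cursor(matches, size);
        List<List<String>> pages = new ArrayList<>();
        for (List<String> page = cursor.next(); !page.isEmpty(); page = cursor.next()) {
            pages.add(page);
        }
        return pages;
    }

    public static void main(String[] args) {
        // Five matching document IDs, two per page, as in the example above.
        for (List<String> page : drain(List.of("1", "2", "3", "4", "5"), 2)) {
            System.out.println(page);
        }
    }
}
```

With these inputs the sketch yields three pages, the last of which holds a single ID. The empty fourth page is what ends the loop.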

Paginating with the Java client APIs

Next, let's build a Java application that pages through the search results. To make the code easier to follow, I have uploaded the final project to GitHub: github.com/liu-xiao-gu…

First, we create a class called Twitter:

Twitter.java

```java
public class Twitter {
    private String user;
    private long uid;
    private String province;
    private String message;
    private String country;
    private String city;
    private long age;
    private String address;

    public Twitter() {
    }

    public Twitter(String user, long uid, String province, String message,
                   String country, String city, long age, String address) {
        this.user = user;
        this.uid = uid;
        this.province = province;
        this.message = message;
        this.country = country;
        this.city = city;
        this.age = age;
        this.address = address;
    }

    public String getUser() {
        return user;
    }

    public long getUid() {
        return uid;
    }

    public String getProvince() {
        return province;
    }

    public String getMessage() {
        return message;
    }

    public String getCountry() {
        return country;
    }

    public String getCity() {
        return city;
    }

    public long getAge() {
        return age;
    }

    public String getAddress() {
        return address;
    }

    public void setUser(String user) {
        this.user = user;
    }

    public void setUid(long uid) {
        this.uid = uid;
    }

    public void setProvince(String province) {
        this.province = province;
    }

    public void setMessage(String message) {
        this.message = message;
    }

    public void setCountry(String country) {
        this.country = country;
    }

    public void setCity(String city) {
        this.city = city;
    }

    public void setAge(long age) {
        this.age = age;
    }

    public void setAddress(String address) {
        this.address = address;
    }
}
```

This class corresponds to the twitter documents above.

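Next, we connect to the Elasticsearch cluster; see my earlier article “Elasticsearch:在 Java 客户端中使用 truststore 来创建 HTTPS 连接” (on creating HTTPS connections with a truststore in the Java client). Once connected, we can write the following code to page through the search results: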

ElasticsearchJava.java

```java
// Imports needed at the top of ElasticsearchJava.java:
import co.elastic.clients.elasticsearch._types.Time;
import co.elastic.clients.elasticsearch.core.SearchRequest;
import co.elastic.clients.elasticsearch.core.SearchResponse;
import co.elastic.clients.elasticsearch.core.search.Hit;
import co.elastic.clients.json.JsonData;

// The code below runs inside a method; `client` is the ElasticsearchClient connected earlier.
        final String INDEX_NAME = "twitter";

        // Initial search: 2 hits per page, scroll context kept alive for 2 minutes.
        SearchRequest searchRequest = new SearchRequest.Builder()
                .index(INDEX_NAME)
                .query(q -> q.bool(b -> b
                        .must(must -> must.match(m -> m.field("city").query("北京")))
                        .filter(f -> f.range(r -> r.field("age").gte(JsonData.of(0)).lte(JsonData.of(100))))
                ))
                .size(2)
                .scroll(Time.of(t -> t.time("2m")))
                .build();

        SearchResponse<Twitter> response = client.search(searchRequest, Twitter.class);

        do {
            // Print the current page.
            System.out.println("size: " + response.hits().hits().size());

            for (Hit<Twitter> hit : response.hits().hits()) {
                System.out.println("hit: " + hit.index() + ": " + hit.id());
            }

            // The lambda below needs an effectively final reference to the previous response.
            final SearchResponse<Twitter> old_response = response;
            System.out.println("scrollId: " + old_response.scrollId());

            // Fetch the next page with the scroll ID, renewing the 2-minute keep-alive.
            response = client.scroll(s -> s.scrollId(old_response.scrollId()).scroll(Time.of(t -> t.time("2m"))),
                    Twitter.class);

            System.out.println("=================================");

        } while (response.hits().hits().size() != 0); // 0 hits mark the end of the scroll and the while loop.
```

Running the code above produces the following output:



```
size: 2
hit: twitter: 1
hit: twitter: 2
scrollId: FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFi1rOUlBMFdGU2tLSS0yTlMyUkdRdUEAAAAAAAFAnxZReU4zSnhXVlR5eW5WQW5Yb09RSHNR
=================================
size: 2
hit: twitter: 3
hit: twitter: 4
scrollId: FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFi1rOUlBMFdGU2tLSS0yTlMyUkdRdUEAAAAAAAFAnxZReU4zSnhXVlR5eW5WQW5Yb09RSHNR
=================================
size: 1
hit: twitter: 5
scrollId: FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFi1rOUlBMFdGU2tLSS0yTlMyUkdRdUEAAAAAAAFAnxZReU4zSnhXVlR5eW5WQW5Yb09RSHNR
=================================
```


The output shows three pages, with five documents found in total.
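The page count follows from the hit total and the page size by ceiling division. A tiny sketch of that arithmetic (the class and method names `PageMath` and `pageCount` are made up for illustration):

```java
// Hypothetical helper: works out how many pages a result set needs.
class PageMath {

    // Ceiling division: the last page may be only partly filled.
    static long pageCount(long totalHits, long pageSize) {
        if (totalHits < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("totalHits must be >= 0 and pageSize must be > 0");
        }
        return (totalHits + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        // 5 matching documents at 2 per page.
        System.out.println(pageCount(5, 2)); // 3
    }
}
```

In the run above, the do/while loop sends one request more than this number: the initial search plus three scroll calls, the last of which returns an empty page and ends the loop.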