Elasticsearch: Composite Aggregation in Elasticsearch


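Composite aggregation is a powerful feature introduced in Elasticsearch 6.1. To demonstrate everything it can do, we will build, step by step, an analytics service for Sliceline, a fictional pizza delivery company.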

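Composite aggregations let us:

  • Page quickly through aggregation results
  • Build new indices from aggregation results
  • Develop APIs backed by Elasticsearch aggregations that have consistent response times for large result sets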

Composite aggregations are especially useful when we need to aggregate a large amount of data. Consider the following aggregation:

GET /pizza/_search
{
  "size": 0,
  "aggs": {
    "group_by_deliveries": {
      "terms": {
        "field": "full_address",
        "size" : 3000
      }
    }
  }
}

With 10 shards, the query above may produce a warning like this:

#! Deprecation: This aggregation creates too many buckets (10001) and will throw an error in future versions. You should update the [search.max_buckets] cluster setting or use the [composite] aggregation to paginate all buckets in multiple requests.

 

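Paginating aggregations for pizza

Sliceline delivers pizza to millions of customers. As with any pizza delivery service, each pizza is unique and made to the customer's order. The business now wants to view and analyze its customers' purchasing habits, and many of those views require well-formatted results.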

Address                              Deliveries
355 First St, San Francisco, CA      6
961 Union St, San Francisco, CA      8
123 Baker Way, San Francisco, CA     4
1700 Powell St, San Francisco, CA    10
900 Union St, San Francisco, CA      5

Results 101-105 of 10,000,000 | Next page

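Producing the table above has always been possible with an Elasticsearch query, but a limitation in versions before 6.1 meant that every result preceding the displayed page also had to be returned to the application. For example, to build the table above (results 101-105), we have to request the first 105 results and discard the first 100.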

// Standard terms aggregation to create the table above
GET /pizza/_search
{
  "size": 0,
  "aggs": {
    "group_by_deliveries": {
      "terms": {
        "field": "full_address",
        "size" : 105
      }
    }
  }
}
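
To make the cost of this approach concrete, here is a minimal Python sketch of offset-style paging on top of a terms aggregation. The helper names `terms_page_request` and `slice_page` are hypothetical and only illustrate the idea: the request size grows with the page number, and everything before the requested page is thrown away on the client.

```python
def terms_page_request(page: int, page_size: int) -> dict:
    """Build a terms aggregation body that must cover pages 1..page."""
    return {
        "size": 0,
        "aggs": {
            "group_by_deliveries": {
                "terms": {"field": "full_address", "size": page * page_size}
            }
        },
    }


def slice_page(buckets: list, page: int, page_size: int) -> list:
    """Keep only the buckets that belong to the requested page."""
    start = (page - 1) * page_size
    return buckets[start:start + page_size]


# Page 21 with 5 rows per page corresponds to results 101-105.
body = terms_page_request(page=21, page_size=5)
requested = body["aggs"]["group_by_deliveries"]["terms"]["size"]
print(requested)  # 105
```

Only 5 of the 105 requested buckets are displayed, and the waste grows with every later page.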

 

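Introducing composite aggregation, a better way to paginate aggregations

Composite aggregation lets us paginate the aggregation query above and gives users a fast way to walk through the result set.

Importing the sample dataset

Let's start by importing the sample dataset: pizza deliveries! We recommend running these commands in the Console found under Dev Tools in Kibana.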

// Create our pizza index
PUT /pizza
{
  "mappings" : {
    "properties": {
      "full_address": {
        "type": "keyword"
      },
      "order" : {
        "type" : "text"
      },
      "num_pizzas" : {
        "type" : "integer"
      },
      "timestamp" : {
        "type" : "date"
      }
    }
  }
}

// Insert sample pizza deliveries
POST /pizza/_bulk
{ "index": { "_id": 1 }}
{"full_address" : "355 First St, San Francisco, CA", "order" : "cheese", "num_pizzas" : 2, "timestamp": "2018-04-10T12:25" }
{ "index": { "_id": 2 }}
{"full_address" : "961 Union St, San Francisco, CA", "order" : "cheese", "num_pizzas" : 3, "timestamp": "2018-04-11T12:25" }
{ "index": { "_id": 3 }}
{"full_address" : "123 Baker St, San Francisco, CA", "order" : "vegan", "num_pizzas" : 1, "timestamp": "2018-04-18T12:25" }
{ "index": { "_id": 4 }}
{"full_address" : "1700 Powell St, San Francisco, CA", "order" : "cheese", "num_pizzas" : 5, "timestamp": "2018-04-18T12:25" }
{ "index": { "_id": 5 }}
{"full_address" : "900 Union St, San Francisco, CA", "order" : "pepperoni", "num_pizzas" : 4, "timestamp": "2018-04-18T12:25" }
{ "index": { "_id": 6 }}
{"full_address" : "355 First St, San Francisco, CA", "order" : "pepperoni", "num_pizzas" : 3, "timestamp": "2018-04-10T12:25" }
{ "index": { "_id": 7 }}
{"full_address" : "961 Union St, San Francisco, CA", "order" : "cheese", "num_pizzas" : 1, "timestamp": "2018-04-12T12:25" }
{ "index": { "_id": 8 }}
{"full_address" : "100 First St, San Francisco, CA", "order" : "pepperoni", "num_pizzas" : 3, "timestamp": "2018-04-11T12:25" }
{ "index": { "_id": 9 }}
{"full_address" : "101 First St, San Francisco, CA", "order" : "cheese", "num_pizzas" : 5, "timestamp": "2018-04-11T12:25" }
{ "index": { "_id": 10 }}
{"full_address" : "355 First St, San Francisco, CA", "order" : "cheese", "num_pizzas" : 10, "timestamp": "2018-04-10T12:25" }
{ "index": { "_id": 11 }}
{"full_address" : "100 First St, San Francisco, CA", "order" : "pepperoni", "num_pizzas" : 4, "timestamp": "2018-04-11T14:25" }

The first command above creates the mapping for the pizza index, and the bulk request that follows imports 11 documents.

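Paginating results with composite aggregation

Let's revisit the original terms aggregation and replace it with a composite aggregation, which gives us a more efficient way to paginate.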

GET /pizza/_search
{
  "size": 0,
  "track_total_hits": false,
  "aggs": {
    "group_by_deliveries": {
      "composite": {
        "size": 3,
        "sources": [
          {
            "by_address": {
              "terms": {
                "field": "full_address"
              }
            }
          }
        ]
      }
    }
  }
}

The response is:

  "aggregations" : {
    "group_by_deliveries" : {
      "after_key" : {
        "by_address" : "123 Baker St, San Francisco, CA"
      },
      "buckets" : [
        {
          "key" : {
            "by_address" : "100 First St, San Francisco, CA"
          },
          "doc_count" : 2
        },
        {
          "key" : {
            "by_address" : "101 First St, San Francisco, CA"
          },
          "doc_count" : 1
        },
        {
          "key" : {
            "by_address" : "123 Baker St, San Francisco, CA"
          },
          "doc_count" : 1
        }
      ]
    }
  }
}

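The aggregation above gives us the first page of results (note that the size parameter is set to 3):

Address                              Deliveries
100 First St, San Francisco, CA      2
101 First St, San Francisco, CA      1
123 Baker St, San Francisco, CA      1

Now let's provide a fast way to display the next page of results with the composite aggregation. Because "123 Baker St, San Francisco, CA" is the last result of the previous query (it is also returned as "after_key"), we pass that value in the "after" field of the composite aggregation.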

GET /pizza/_search
{
  "size": 0,
  "track_total_hits": false,
  "aggs": {
    "group_by_deliveries": {
      "composite": {
        "after": {
          "by_address": "123 Baker St, San Francisco, CA"
        },
        "size": 3,
        "sources": [
          {
            "by_address": {
              "terms": {
                "field": "full_address"
              }
            }
          }
        ]
      }
    }
  }
}

Running the command above returns:

  "aggregations" : {
    "group_by_deliveries" : {
      "after_key" : {
        "by_address" : "900 Union St, San Francisco, CA"
      },
      "buckets" : [
        {
          "key" : {
            "by_address" : "1700 Powell St, San Francisco, CA"
          },
          "doc_count" : 1
        },
        {
          "key" : {
            "by_address" : "355 First St, San Francisco, CA"
          },
          "doc_count" : 3
        },
        {
          "key" : {
            "by_address" : "900 Union St, San Francisco, CA"
          },
          "doc_count" : 1
        }
      ]
    }
  }

We also included a parameter named "track_total_hits" and set it to false, which allows Elasticsearch to terminate the query early once it has found enough buckets for the result.
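
The two requests above differ only in the "after" value, so the paging can be automated. The following Python sketch shows one way to do it. It assumes a hypothetical `search` callable that sends a request body to the _search endpoint and returns the parsed JSON response; the function name `iter_composite_buckets` is also made up for this example.

```python
import copy
from typing import Callable, Iterator


def iter_composite_buckets(search: Callable[[dict], dict], body: dict,
                           agg_name: str) -> Iterator[dict]:
    """Yield every bucket of a composite aggregation, page by page.

    `search` is a hypothetical callable: it takes the request body and
    returns the parsed _search response as a dict.
    """
    body = copy.deepcopy(body)  # do not mutate the caller's request
    composite = body["aggs"][agg_name]["composite"]
    while True:
        agg = search(body)["aggregations"][agg_name]
        yield from agg["buckets"]
        after_key = agg.get("after_key")
        if not agg["buckets"] or after_key is None:
            break  # an empty page means there is nothing left to fetch
        composite["after"] = after_key  # resume right after the last bucket
```

Called with the first request body from this section, the generator would yield the buckets of every page in key order, and it stops when a page comes back empty.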

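Creating a date histogram for each address

Let's use multiple sources in the composite aggregation to group the results by address and then create a date histogram for each address!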

GET /pizza/_search
{
  "size": 0,
  "track_total_hits": false,
  "aggs": {
    "group_by_deliveries": {
      "composite": {
        "size": 3,
        "sources": [
          {
            "by_address": {
              "terms": {
                "field": "full_address"
              }
            }
          },
          {
            "histogram": {
              "date_histogram": {
                "field": "timestamp",
                "interval": "1d"
              }
            }
          }
        ]
      }
    }
  }
}

The response is:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_deliveries" : {
      "after_key" : {
        "by_address" : "123 Baker St, San Francisco, CA",
        "histogram" : 1524009600000
      },
      "buckets" : [
        {
          "key" : {
            "by_address" : "100 First St, San Francisco, CA",
            "histogram" : 1523404800000
          },
          "doc_count" : 2
        },
        {
          "key" : {
            "by_address" : "101 First St, San Francisco, CA",
            "histogram" : 1523404800000
          },
          "doc_count" : 1
        },
        {
          "key" : {
            "by_address" : "123 Baker St, San Francisco, CA",
            "histogram" : 1524009600000
          },
          "doc_count" : 1
        }
      ]
    }
  }
}

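The aggregation above creates a histogram with a one-day interval for each address. (On Elasticsearch 7.2 and later, "interval" is deprecated in date_histogram in favor of "calendar_interval" or "fixed_interval".) We can now move quickly through the result set with the "after" parameter described in the previous example. The request below passes the key of the last bucket on the second page (900 Union St on 2018-04-18), so it returns the third and final page: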

GET /pizza/_search
{
  "size": 0,
  "track_total_hits": false,
  "aggs": {
    "group_by_deliveries": {
      "composite": {
        "after": {
          "by_address": "900 Union St, San Francisco, CA",
          "histogram": 1524009600000
        },
        "size": 3,
        "sources": [
          {
            "by_address": {
              "terms": {
                "field": "full_address"
              }
            }
          },
          {
            "histogram": {
              "date_histogram": {
                "field": "timestamp",
                "interval": "1d"
              }
            }
          }
        ]
      }
    }
  }
}

It produces the following result:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_deliveries" : {
      "after_key" : {
        "by_address" : "961 Union St, San Francisco, CA",
        "histogram" : 1523491200000
      },
      "buckets" : [
        {
          "key" : {
            "by_address" : "961 Union St, San Francisco, CA",
            "histogram" : 1523404800000
          },
          "doc_count" : 1
        },
        {
          "key" : {
            "by_address" : "961 Union St, San Francisco, CA",
            "histogram" : 1523491200000
          },
          "doc_count" : 1
        }
      ]
    }
  }
}
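
The "histogram" keys in the responses above are epoch timestamps in milliseconds, and date_histogram buckets are aligned to UTC unless a time zone is configured. As a small sketch, they can be turned into readable dates like this (`bucket_day` is a hypothetical helper name):

```python
from datetime import datetime, timezone


def bucket_day(epoch_millis: int) -> str:
    """Convert a date_histogram key (epoch milliseconds, UTC) to YYYY-MM-DD."""
    moment = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")


for key in (1523404800000, 1523491200000, 1524009600000):
    print(key, bucket_day(key))
# 1523404800000 2018-04-11
# 1523491200000 2018-04-12
# 1524009600000 2018-04-18
```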

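Average pizzas per day

It would be nice to know the average number of pizzas for each address on each day in the index. Composite aggregation lets us specify sub-aggregations, such as an average: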

GET /pizza/_search
{
  "size": 0,
  "track_total_hits": false,
  "aggs": {
    "group_by_deliveries": {
      "composite": {
        "size": 3,
        "sources": [
          {
            "by_address": {
              "terms": {
                "field": "full_address"
              }
            }
          },
          {
            "histogram": {
              "date_histogram": {
                "field": "timestamp",
                "interval": "1d"
              }
            }
          }
        ]
      },
      "aggregations": {
        "avg_pizzas_per_day": {
          "avg": {
            "field": "num_pizzas"
          }
        }
      }
    }
  }
}

The aggregation above returns:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_deliveries" : {
      "after_key" : {
        "by_address" : "123 Baker St, San Francisco, CA",
        "histogram" : 1524009600000
      },
      "buckets" : [
        {
          "key" : {
            "by_address" : "100 First St, San Francisco, CA",
            "histogram" : 1523404800000
          },
          "doc_count" : 2,
          "avg_pizzas_per_day" : {
            "value" : 3.5
          }
        },
        {
          "key" : {
            "by_address" : "101 First St, San Francisco, CA",
            "histogram" : 1523404800000
          },
          "doc_count" : 1,
          "avg_pizzas_per_day" : {
            "value" : 5.0
          }
        },
        {
          "key" : {
            "by_address" : "123 Baker St, San Francisco, CA",
            "histogram" : 1524009600000
          },
          "doc_count" : 1,
          "avg_pizzas_per_day" : {
            "value" : 1.0
          }
        }
      ]
    }
  }
}

As the results above show, the orders delivered to 100 First St on 2018-04-11 averaged 3.5 pizzas!
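
To double-check these numbers without a cluster, the same grouping can be reproduced in plain Python from the sample documents. This is only a sanity-check sketch: the `ORDERS` list is copied from the bulk request above, and `avg_pizzas_by_address_and_day` is a hypothetical helper, not something Elasticsearch provides.

```python
from collections import defaultdict
from statistics import mean

# (full_address, timestamp, num_pizzas), copied from the sample bulk request
ORDERS = [
    ("355 First St, San Francisco, CA", "2018-04-10T12:25", 2),
    ("961 Union St, San Francisco, CA", "2018-04-11T12:25", 3),
    ("123 Baker St, San Francisco, CA", "2018-04-18T12:25", 1),
    ("1700 Powell St, San Francisco, CA", "2018-04-18T12:25", 5),
    ("900 Union St, San Francisco, CA", "2018-04-18T12:25", 4),
    ("355 First St, San Francisco, CA", "2018-04-10T12:25", 3),
    ("961 Union St, San Francisco, CA", "2018-04-12T12:25", 1),
    ("100 First St, San Francisco, CA", "2018-04-11T12:25", 3),
    ("101 First St, San Francisco, CA", "2018-04-11T12:25", 5),
    ("355 First St, San Francisco, CA", "2018-04-10T12:25", 10),
    ("100 First St, San Francisco, CA", "2018-04-11T14:25", 4),
]


def avg_pizzas_by_address_and_day(orders):
    """Group orders by (address, day) and average num_pizzas per group."""
    groups = defaultdict(list)
    for address, timestamp, num_pizzas in orders:
        day = timestamp[:10]  # the YYYY-MM-DD part of the timestamp
        groups[(address, day)].append(num_pizzas)
    return {key: mean(values) for key, values in sorted(groups.items())}


averages = avg_pizzas_by_address_and_day(ORDERS)
print(averages[("100 First St, San Francisco, CA", "2018-04-11")])  # 3.5
```

The 3.5 is the mean of the two orders (3 and 4 pizzas) placed at that address on that day, which matches the "avg_pizzas_per_day" value in the first bucket.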

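Sequential access

There is no way to jump directly to a particular page of aggregation results; pages must be accessed sequentially. You may have seen this approach in other systems, where it is called "range-based pagination". Even in an RDBMS, range-based pagination is usually more efficient than a skip-and-limit strategy, which becomes more expensive the further the user moves into the result set.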
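
The difference between the two strategies can be sketched over a sorted list of keys in Python. This is only an analogy under simple assumptions: `page_by_offset` and `page_by_range` are hypothetical names, and the binary search stands in for the index seek a database or Elasticsearch would perform.

```python
from bisect import bisect_right


def page_by_offset(sorted_keys, offset, size):
    """skip/limit: the caller says how many rows to skip before the page."""
    return sorted_keys[offset:offset + size]


def page_by_range(sorted_keys, after, size):
    """Range-based: seek to the first key greater than `after`."""
    start = 0 if after is None else bisect_right(sorted_keys, after)
    return sorted_keys[start:start + size]


keys = ["100 First St", "101 First St", "123 Baker St",
        "1700 Powell St", "355 First St", "900 Union St", "961 Union St"]

first = page_by_range(keys, None, 3)
second = page_by_range(keys, first[-1], 3)  # continue after the last key seen
print(second)  # ['1700 Powell St', '355 First St', '900 Union St']
```

With range-based paging, the client only remembers the last key it saw, so the work needed to locate the next page does not depend on how many pages came before it.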

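Sorting

With a composite aggregation, results cannot be sorted by anything other than the keys used in the aggregation (for example, the "by_address" value in the previous example). Restricting the sort options ensures that responses to composite aggregations stay fast and accurate.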
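
Within that limit, the direction of each key can still be chosen: every source accepts an "order" setting of "asc" (the default) or "desc". Here is a small Python sketch that builds such a request body; the function name `composite_request` is hypothetical.

```python
import json


def composite_request(order: str = "asc", size: int = 3) -> dict:
    """Build a composite aggregation body sorted by address in `order`."""
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    return {
        "size": 0,
        "track_total_hits": False,
        "aggs": {
            "group_by_deliveries": {
                "composite": {
                    "size": size,
                    "sources": [
                        {
                            "by_address": {
                                "terms": {"field": "full_address", "order": order}
                            }
                        }
                    ],
                }
            }
        },
    }


# Print the body so it can be pasted into the Kibana Console.
print(json.dumps(composite_request("desc"), indent=2))
```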