An Introduction to Querying Data from Druid

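I have been really busy lately, but I still wanted to write down what I recently learned about querying data from Druid. Most of this is drawn from the official Druid website.

1. Introduction to Druid

Apache Druid is a real-time analytics database designed for fast slice-and-dice analytics ("OLAP" queries) on large data sets. Druid is most often used as a database for use cases where real-time ingestion, fast query performance, and high uptime are important. As such, it commonly powers the GUIs of analytical applications, or serves as the backend for highly concurrent APIs that need fast aggregations. Druid works best with event-oriented data.

OLTP (On-Line Transaction Processing) handles transactions, while OLAP (On-Line Analytical Processing) handles analysis, as the names suggest. In terms of database operations, OLTP workloads are dominated by inserts, deletes, and updates, whereas OLAP workloads are dominated by queries.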

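Common application areas for Druid (these line up with my internship work)

Clickstream analytics (web and mobile analytics);
Network telemetry analytics (network performance monitoring);
Server metrics storage;
Supply chain analytics (manufacturing metrics);
Application performance metrics;
Digital marketing/advertising analytics;
Business intelligence/OLAP.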

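When to use Druid

Druid is a good fit when:

insert rates are high but updates are rare;
most queries are aggregations, searches, or scans;
target query latencies range from 100 ms to a few seconds;
the data has a time component (see Timeseries below);
you need fast counting and ranking over high-cardinality columns such as URLs or user IDs (see TopN below);
data is loaded from Kafka, HDFS, or object storage.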

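Druid is not a good fit when:

you need low-latency updates of existing records by primary key. Druid supports streaming inserts but not streaming updates (updates are done with background batch jobs);
you are building an offline reporting system where query latency is not very important;
you want "big" joins (joining one large fact table to another large fact table) and can accept that these queries take a long time to complete.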

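2. Druid query methods

Druid supports two ways of querying: SQL statements and JSON-based native queries. This post covers only the JSON form; see the official documentation for more detail. The official documentation lists eight query types. Only the first two are covered here, since the other six are used in much the same way.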

2.1 Timeseries queries

{
  "queryType": "timeseries",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "descending": "true",
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "sample_dimension1", "value": "sample_value1" },
      { "type": "or",
        "fields": [
          { "type": "selector", "dimension": "sample_dimension2", "value": "sample_value2" },
          { "type": "selector", "dimension": "sample_dimension3", "value": "sample_value3" }
        ]
      }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" },
    { "type": "doubleSum", "name": "sample_name2", "fieldName": "sample_fieldName2" }
  ],
  "postAggregations": [
    { "type": "arithmetic",
      "name": "sample_divide",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "name": "postAgg__sample_name1", "fieldName": "sample_name1" },
        { "type": "fieldAccess", "name": "postAgg__sample_name2", "fieldName": "sample_name2" }
      ]
    }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
}

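The properties used in the JSON above are described below.

Property	Description
queryType	Must be "timeseries"
dataSource	The name of the table to query
descending	Whether to sort in descending order. Defaults to false (ascending)
intervals	A JSON object representing ISO-8601 intervals. Defines the time ranges the query runs over
granularity	Aggregation granularity: all, none, second, minute, fifteen_minute, thirty_minute, hour, day, week, month, quarter, and year
filter	One of the built-in filters (see section 3.1)
aggregations	One or more of the built-in aggregators (see section 3.3)
postAggregations	Post-aggregation operations applied after the data has been aggregated (see section 3.4)
limit	Maximum number of result rows
context	Modifies query behavior, including grand totals and zero-filling. Grand total: "context": { "grandTotal": true } adds an extra "grand total" row as the last row of the timeseries result set. Zero-filling: timeseries queries normally fill empty interior time buckets with zeroes; "context" : { "skipEmptyBuckets": "true" } disables this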

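So the query above returns 2 data points from the "sample_datasource" table, one for each day between 2012-01-01 and 2012-01-03. Each data point holds the (long) sum of sample_fieldName1, the (double) sum of sample_fieldName2, and the (double) result of sample_fieldName1 divided by sample_fieldName2, computed over the rows that match the filter. The result looks like this: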

[
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": { "sample_name1": <some_value>, "sample_name2": <some_value>, "sample_divide": <some_value> }
  },
  {
    "timestamp": "2012-01-02T00:00:00.000Z",
    "result": { "sample_name1": <some_value>, "sample_name2": <some_value>, "sample_divide": <some_value> }
  }
]
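To make the "2 data points" claim concrete, here is a minimal Python sketch that assembles a cut-down version of the query above as a plain dict and lists the day buckets its interval covers. The helper names `build_timeseries_query` and `day_buckets` are my own, not part of any Druid client, and the filter and `descending` fields are left out for brevity. It assumes the interval end is exclusive and that both ends fall on UTC day boundaries.

```python
import json
from datetime import datetime, timedelta


def build_timeseries_query(data_source, interval, granularity="day"):
    """Assemble a native timeseries query as a plain dict (hypothetical helper)."""
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": granularity,
        "aggregations": [
            {"type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1"},
            {"type": "doubleSum", "name": "sample_name2", "fieldName": "sample_fieldName2"},
        ],
        "postAggregations": [
            {
                "type": "arithmetic",
                "name": "sample_divide",
                "fn": "/",
                "fields": [
                    {"type": "fieldAccess", "name": "postAgg__sample_name1", "fieldName": "sample_name1"},
                    {"type": "fieldAccess", "name": "postAgg__sample_name2", "fieldName": "sample_name2"},
                ],
            }
        ],
        "intervals": [interval],
    }


def day_buckets(interval):
    """List the day buckets covered by a start/end interval (end exclusive).

    Assumption: both ends already sit on UTC day boundaries.
    """
    start_text, end_text = interval.split("/")
    start = datetime.fromisoformat(start_text)
    end = datetime.fromisoformat(end_text)
    buckets = []
    while start < end:
        buckets.append(start.strftime("%Y-%m-%dT%H:%M:%S.000Z"))
        start += timedelta(days=1)
    return buckets


query = build_timeseries_query(
    "sample_datasource", "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000"
)
payload = json.dumps(query)  # JSON text that would be sent as the request body
buckets = day_buckets(query["intervals"][0])
```

Here `buckets` holds one timestamp per expected result row, and `payload` is the JSON string you would send to Druid as the native query body.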

2.2 TopN

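Note that TopN results are approximate: each data process ranks its own top K results and returns only those top K to the Broker. By default, K is max(1000, threshold). In practice, this means that if you ask for the top 1000 items, the first ~900 or so will be 100% correct, and the ordering of the results after that is not guaranteed. Increasing the threshold makes TopN more accurate.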

{
  "queryType": "topN",
  "dataSource": "sample_data",
  "dimension": "sample_dim",
  "threshold": 5,
  "metric": "count",
  "granularity": "all",
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "selector",
        "dimension": "dim1",
        "value": "some_value"
      },
      {
        "type": "selector",
        "dimension": "dim2",
        "value": "some_other_val"
      }
    ]
  },
  "aggregations": [
    {
      "type": "longSum",
      "name": "count",
      "fieldName": "count"
    },
    {
      "type": "doubleSum",
      "name": "some_metric",
      "fieldName": "some_metric"
    }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "average",
      "fn": "/",
      "fields": [
        {
          "type": "fieldAccess",
          "name": "some_metric",
          "fieldName": "some_metric"
        },
        {
          "type": "fieldAccess",
          "name": "count",
          "fieldName": "count"
        }
      ]
    }
  ],
  "intervals": [
    "2013-08-31T00:00:00.000/2013-09-03T00:00:00.000"
  ]
}

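Property	Description
queryType	Must be "topN"
... (same as above)	... (same as above)
dimension	The dimension to compute the TopN over
threshold	An integer giving the N in TopN, that is, how many results to return
metric	The metric to sort by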

The query above returns results like the following:

[
  {
    "timestamp": "2013-08-31T00:00:00.000Z",
    "result": [
      {
        "sample_dim": "dim1_val",
        "count": 111,
        "some_metric": 10669,
        "average": 96.11711711711712
      },
      {
        "sample_dim": "another_dim1_val",
        "count": 88,
        "some_metric": 28344,
        "average": 322.09090909090907
      },
      {
        "sample_dim": "dim1_val3",
        "count": 70,
        "some_metric": 871,
        "average": 12.442857142857143
      },
      {
        "sample_dim": "dim1_val4",
        "count": 62,
        "some_metric": 815,
        "average": 13.14516129032258
      },
      {
        "sample_dim": "dim1_val5",
        "count": 60,
        "some_metric": 2787,
        "average": 46.45
      }
    ]
  }
]
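The approximation mentioned at the start of this section is easier to see with a toy model. The Python sketch below is not Druid's implementation. It only mimics the idea that each data process keeps its own top K entries before the results are merged. The function names and the tiny data set are made up, and K is passed in directly instead of being computed as max(1000, threshold).

```python
from collections import Counter


def local_top_k(counts, k):
    """Rank one data process's values and keep only its top k entries."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:k])


def merged_top_n(per_process_counts, threshold, k):
    """Merge the truncated per-process results and return the top `threshold`."""
    merged = Counter()
    for counts in per_process_counts:
        merged.update(local_top_k(counts, k))  # sums counts for shared keys
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:threshold]


# Two hypothetical data processes; "c" is only third on each of them,
# but it has the largest combined count (8 + 8 = 16).
processes = [
    {"a": 10, "b": 9, "c": 8},
    {"d": 11, "e": 9, "c": 8},
]

too_small_k = merged_top_n(processes, threshold=2, k=2)
large_enough_k = merged_top_n(processes, threshold=2, k=3)
```

With k=2 the value "c" is cut off on both processes and never reaches the merge step, so the merged top 2 is wrong. With k=3 it survives and the result matches the exact answer. This is the same reason a larger threshold improves accuracy.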

3. Native query components

3.1 Filters (filter)

(1) Selector filter (selector)

"filter": { "type": "selector", "dimension": <dimension_string>, "value": <dimension_value_string> }

This is equivalent to WHERE <dimension_string> = '<dimension_value_string>'.

(2) Column comparison filter (columnComparison)

"filter": { "type": "columnComparison", "dimensions": [<dimension_a>, <dimension_b>] }

This is equivalent to WHERE <dimension_a> = <dimension_b>.

(3) Regular expression filter (regex)

"filter": { "type": "regex", "dimension": <dimension_string>, "pattern": <pattern_string> }

(4) Logical expression filters (and, or, not)

"filter": { "type": "and", "fields": [<filter>, <filter>, ...] }
"filter": { "type": "or", "fields": [<filter>, <filter>, ...] }
"filter": { "type": "not", "field": <filter> }

(5) JavaScript filter (javascript)

{
  "type" : "javascript",
  "dimension" : <dimension_string>,
  "function" : "function(value) { <...> }"
}

(6) Extraction filter (extraction): now deprecated and no longer recommended.

(7) Search filter (search): filters on partial string matches.

{
    "filter": {
        "type": "search",
        "dimension": "product",
        "query": {
          "type": "insensitive_contains",
          "value": "foo"
        }
    }
}

(8) In filter (in)

{
    "type": "in",
    "dimension": "outlaw",
    "values": ["Good", "Bad", "Ugly"]
}

(9) Like filter (like), for wildcard matching

{
    "type": "like",
    "dimension": "last_name",
    "pattern": "D%"
}
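To show what a pattern such as "D%" matches, here is a small Python sketch that translates a LIKE pattern into a regular expression. It assumes the usual SQL LIKE wildcards (% for any run of characters, _ for exactly one character) and case-sensitive matching, and it ignores escape characters. The function names are my own.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")  # any run of characters, including none
        elif ch == "_":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


def like_matches(pattern, value):
    """Return True when the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None


starts_with_d = like_matches("D%", "Doe")
lowercase_d = like_matches("D%", "doe")
single_wildcard = like_matches("D_e", "Doe")
```

Under these assumptions, the filter above keeps every last_name that starts with an uppercase "D".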

(10) Bound filter (bound): the following bound filter expresses the condition 21 <= age <= 31:

{
    "type": "bound",
    "dimension": "age",
    "lower": "21",
    "upper": "31" ,
    "ordering": "numeric"
}

This filter expresses the condition foo <= name <= hoo, using the default lexicographic ordering:

{
    "type": "bound",
    "dimension": "name",
    "lower": "foo",
    "upper": "hoo"
}

With strict bounds, this filter expresses the condition 21 < age < 31:

{
    "type": "bound",
    "dimension": "age",
    "lower": "21",
    "lowerStrict": true,
    "upper": "31" ,
    "upperStrict": true,
    "ordering": "numeric"
}
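The "ordering" field matters more than it looks, because dimension values are strings. Below is a rough Python sketch of the bound check. The function and parameter names are made up, and Python string comparison stands in for lexicographic ordering. It shows how the same bounds behave under both orderings, with and without strict limits.

```python
def in_bound(value, lower=None, upper=None,
             lower_strict=False, upper_strict=False,
             ordering="lexicographic"):
    """Check one dimension value against the limits of a bound filter."""
    # Numeric ordering compares numbers; otherwise compare as strings.
    convert = float if ordering == "numeric" else str
    v = convert(value)
    if lower is not None:
        lo = convert(lower)
        if v < lo or (lower_strict and v == lo):
            return False
    if upper is not None:
        hi = convert(upper)
        if v > hi or (upper_strict and v == hi):
            return False
    return True


# Inclusive numeric bounds: 21 <= age <= 31
inclusive_edge = in_bound("21", "21", "31", ordering="numeric")
# Strict numeric bounds: 21 < age < 31
strict_edge = in_bound("21", "21", "31", True, True, ordering="numeric")
# The same limits compared as strings: "21" <= "3" <= "31" holds.
age_3_as_string = in_bound("3", "21", "31")
age_3_as_number = in_bound("3", "21", "31", ordering="numeric")
```

The last two lines are the reason to set "ordering": "numeric" on numeric columns: compared as strings, an age of "3" falls between "21" and "31".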

(11) Interval filter (interval)

{
    "type" : "interval",
    "dimension" : "__time",
    "intervals" : [
      "2014-10-01T00:00:00.000Z/2014-10-07T00:00:00.000Z",
      "2014-11-15T00:00:00.000Z/2014-11-16T00:00:00.000Z"
    ]
}
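Since the logical filters in (4) simply nest other filters, their semantics can be sketched as a small recursive function. The Python below is an illustration only, not how Druid evaluates filters internally. It covers just selector, in, and, or, and not, and the function name `matches` is my own. The example filter is the one from the timeseries query in section 2.1.

```python
def matches(filter_spec, row):
    """Evaluate a filter spec (a dict) against one row of dimension values."""
    kind = filter_spec["type"]
    if kind == "selector":
        return row.get(filter_spec["dimension"]) == filter_spec["value"]
    if kind == "in":
        return row.get(filter_spec["dimension"]) in filter_spec["values"]
    if kind == "and":
        return all(matches(f, row) for f in filter_spec["fields"])
    if kind == "or":
        return any(matches(f, row) for f in filter_spec["fields"])
    if kind == "not":
        return not matches(filter_spec["field"], row)
    raise ValueError(f"unsupported filter type: {kind}")


# dim1 = value1 AND (dim2 = value2 OR dim3 = value3)
timeseries_filter = {
    "type": "and",
    "fields": [
        {"type": "selector", "dimension": "sample_dimension1", "value": "sample_value1"},
        {"type": "or", "fields": [
            {"type": "selector", "dimension": "sample_dimension2", "value": "sample_value2"},
            {"type": "selector", "dimension": "sample_dimension3", "value": "sample_value3"},
        ]},
    ],
}

kept = matches(timeseries_filter, {
    "sample_dimension1": "sample_value1",
    "sample_dimension3": "sample_value3",
})
dropped = matches(timeseries_filter, {"sample_dimension1": "sample_value1"})
```

The first row satisfies the selector and one branch of the "or", so it is kept. The second row satisfies neither branch of the "or", so it is dropped.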

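3.2 Query granularity (granularity)

(1) Simple granularities: specified as a string; timestamps are bucketed by their UTC time (for example, days start at 00:00 UTC). The supported granularity strings are: all, none, second, minute, fifteen_minute, thirty_minute, hour, day, week, month, quarter, and year.

(2) Duration granularity (duration)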

{"type": "duration", "duration": 3600000, "origin": "2012-01-01T00:30:00Z"}

(3) Period granularity (period)

{"type": "period", "period": "P3M", "timeZone": "America/Los_Angeles", "origin": "2012-02-01T00:00:00-08:00"}
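As a rough illustration of how the duration granularity above buckets timestamps, the Python sketch below floors a timestamp to the start of its bucket, counting fixed-length buckets from the origin. This is my reading of the "duration" and "origin" fields, not Druid's code, and the function name is made up. Period granularities are not covered, since they also depend on calendar and time zone rules.

```python
from datetime import datetime, timedelta, timezone


def duration_bucket_start(timestamp, duration_ms, origin):
    """Floor a timestamp to the start of its bucket, counted from origin."""
    step = timedelta(milliseconds=duration_ms)
    # Floor division also works for timestamps that fall before the origin.
    buckets_since_origin = (timestamp - origin) // step
    return origin + buckets_since_origin * step


# Values from the duration example: one-hour buckets starting at 00:30 UTC.
origin = datetime(2012, 1, 1, 0, 30, tzinfo=timezone.utc)
one_hour_ms = 3600000

after_origin = duration_bucket_start(
    datetime(2012, 1, 1, 3, 10, tzinfo=timezone.utc), one_hour_ms, origin
)
before_origin = duration_bucket_start(
    datetime(2012, 1, 1, 0, 10, tzinfo=timezone.utc), one_hour_ms, origin
)
```

With this origin, an event at 03:10 lands in the bucket that starts at 02:30, not 03:00. Shifting where the buckets begin is what "origin" is for.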

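3.3 Aggregations (aggregations)

(1) Count (count): { "type" : "count", "name" : <output_name> }

(2) Sum (longSum, doubleSum, floatSum): { "type" : "floatSum", "name" : <output_name>, "fieldName" : <metric_name> }

(3) Min/max aggregators: doubleMin, doubleMax, floatMin, floatMax, longMin, longMax

(4) Arithmetic mean: doubleMean

(5) First/last aggregators: doubleFirst, doubleLast, floatFirst, floatLast, longFirst, longLast, stringFirst, stringLast

(6) Any aggregators: doubleAny, floatAny, longAny, stringAny

(7) Miscellaneous aggregators: the filtered aggregator (filtered)

A filtered aggregator wraps any given aggregator but only aggregates the values for which the given dimension filter matches. This makes it possible to compute filtered and unfiltered aggregations in the same query, without issuing multiple queries, and to use both results in post-aggregations. Note: if only the filtered results are needed, consider putting the filter on the query itself, which is faster because it does not need to scan all the data.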

{
  "type" : "filtered",
  "filter" : {
    "type" : "selector",
    "dimension" : <dimension>,
    "value" : <dimension value>
  },
  "aggregator" : <aggregation>
}
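The idea of a filtered aggregator can be sketched in a few lines of Python: a plain sum and a filtered sum are updated in the same pass over the rows, so both totals come out of one scan. The column names and rows here are invented for illustration.

```python
def aggregate(rows):
    """Compute an unfiltered and a filtered longSum in a single pass."""
    totals = {"clicks": 0, "us_clicks": 0}
    for row in rows:
        totals["clicks"] += row["clicks"]          # plain longSum
        if row["country"] == "US":                 # selector filter on country
            totals["us_clicks"] += row["clicks"]   # longSum wrapped by "filtered"
    return totals


rows = [
    {"country": "US", "clicks": 3},
    {"country": "DE", "clicks": 5},
    {"country": "US", "clicks": 4},
]

totals = aggregate(rows)
us_share = totals["us_clicks"] / totals["clicks"]  # what a post-aggregation would compute
```

In a real query, the ratio on the last line would be an arithmetic post-aggregation over the two aggregator outputs.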

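3.4 Post-aggregations (postAggregations)

(1) Arithmetic post-aggregator (arithmetic): applies the provided function to the given fields from left to right. The fields can be aggregators or other post-aggregators. Supported functions are +, -, *, /, and quotient.

/ division always returns 0 when dividing by 0, regardless of the numerator.
quotient division behaves like regular floating-point division.

An arithmetic post-aggregator can also specify an ordering, which defines the order of the resulting values when sorting results (useful for topN queries, for example):

If no ordering (or null) is specified, the default floating-point ordering is used.
numericFirst ordering always returns finite values first, followed by NaN, with infinite values last.

The JSON form of an arithmetic post-aggregator is: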

postAggregation : {
  "type"  : "arithmetic",
  "name"  : <output_name>,
  "fn"    : <arithmetic_function>,
  "fields": [<post_aggregator>, <post_aggregator>, ...],
  "ordering" : <null (default), or "numericFirst">
}
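The difference between "/" and "quotient" is easy to miss, so here is a small Python sketch of the rules described above. It is an approximation: Python raises an exception on float division by zero, so the IEEE-style result is emulated by hand, and signed zeros are ignored. All names are my own.

```python
import math
from functools import reduce


def divide_or_zero(a, b):
    """The "/" function as described above: dividing by 0 always yields 0."""
    return 0.0 if b == 0 else a / b


def float_quotient(a, b):
    """The "quotient" function: regular floating-point division (emulated)."""
    if b != 0:
        return a / b
    if a == 0 or math.isnan(a):
        return math.nan
    return math.copysign(math.inf, a)


FUNCTIONS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": divide_or_zero,
    "quotient": float_quotient,
}


def apply_arithmetic(fn, values):
    """Fold the function over the field values from left to right."""
    return reduce(FUNCTIONS[fn], [float(v) for v in values])


left_to_right = apply_arithmetic("-", [10, 3, 2])   # (10 - 3) - 2
safe_division = apply_arithmetic("/", [5, 0])
ieee_division = apply_arithmetic("quotient", [5, 0])
```

So "/" is the safer choice for ratios such as averages, where a zero denominator should not turn into an infinity in the result.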

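(2) Field accessor post-aggregator (fieldAccess): returns the value produced by the specified aggregator. fieldName refers to the output name of an aggregator given in the aggregations section of the query. Use type "fieldAccess" to return the raw aggregation object, or type "finalizingFieldAccess" to return the finalized value.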

{ "type" : "fieldAccess", "name": <output_name>, "fieldName" : <aggregator_name> }

(3) Constant post-aggregator (constant): always returns the specified value.

{ "type"  : "constant", "name"  : <output_name>, "value" : <numerical_value> }

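(4) Greatest/least post-aggregators (doubleGreatest): the difference between the doubleMax aggregator and the doubleGreatest post-aggregator is that doubleMax returns the highest value of one specific column across all rows, while doubleGreatest returns the highest value among multiple columns within one row. The example below is a separate one: it nests arithmetic, fieldAccess, and constant post-aggregators to compute a percentage, (part / tot) * 100.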

{
  ...
  "aggregations" : [
    { "type" : "doubleSum", "name" : "tot", "fieldName" : "total" },
    { "type" : "doubleSum", "name" : "part", "fieldName" : "part" }
  ],
  "postAggregations" : [{
    "type"   : "arithmetic",
    "name"   : "part_percentage",
    "fn"     : "*",
    "fields" : [
       { "type"   : "arithmetic",
         "name"   : "ratio",
         "fn"     : "/",
         "fields" : [
           { "type" : "fieldAccess", "name" : "part", "fieldName" : "part" },
           { "type" : "fieldAccess", "name" : "tot", "fieldName" : "tot" }
         ]
       },
       { "type" : "constant", "name": "const", "value" : 100 }
    ]
  }]
  ...
}
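Because post-aggregators can nest, the example above is really a small expression tree. The Python sketch below walks such a tree for one row of aggregated values. It is a simplified illustration that handles only fieldAccess, constant, and the four basic arithmetic functions. The function name and the sample values for tot and part are invented.

```python
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: 0.0 if b == 0 else a / b,  # division by zero yields 0
}


def evaluate(post_agg, aggregated):
    """Recursively evaluate a post-aggregator against one row of aggregates."""
    kind = post_agg["type"]
    if kind == "fieldAccess":
        return aggregated[post_agg["fieldName"]]
    if kind == "constant":
        return post_agg["value"]
    if kind == "arithmetic":
        values = [evaluate(field, aggregated) for field in post_agg["fields"]]
        result = values[0]
        for value in values[1:]:  # apply the function left to right
            result = OPS[post_agg["fn"]](result, value)
        return result
    raise ValueError(f"unsupported post-aggregator type: {kind}")


part_percentage = {
    "type": "arithmetic", "name": "part_percentage", "fn": "*",
    "fields": [
        {"type": "arithmetic", "name": "ratio", "fn": "/", "fields": [
            {"type": "fieldAccess", "name": "part", "fieldName": "part"},
            {"type": "fieldAccess", "name": "tot", "fieldName": "tot"},
        ]},
        {"type": "constant", "name": "const", "value": 100},
    ],
}

percentage = evaluate(part_percentage, {"tot": 200.0, "part": 50.0})
```

For these made-up aggregates the inner "ratio" node gives 0.25 and the outer node multiplies it by the constant 100.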

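These are the commonly used native query components; see the official documentation for the rest.

4. Wrapping JSON queries in Java classes

A reasonable approach is to first write a base query class that holds the common properties such as queryType and dataSource, then have one subclass per query type extend it. Native components such as aggregators and post-aggregators each need their own entity class as well, which the subclasses then hold in List-typed fields. The right design depends on the business requirements, so I won't go into specifics here.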
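Purely as an illustration of that layering, here is a rough sketch written with Python dataclasses rather than Java, to keep it short. A Java version would have the same shape: a base class, one subclass per query type, and a list of component objects. All class names, field choices, and defaults are my own assumptions, and only a few of the query properties are modeled.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Aggregation:
    """Entity class for one aggregator component."""
    type: str
    name: str
    fieldName: str


@dataclass
class BaseQuery:
    """Properties shared by every query type."""
    dataSource: str
    intervals: List[str]
    granularity: str = "all"
    aggregations: List[Aggregation] = field(default_factory=list)

    def to_json(self):
        # asdict() also converts the nested Aggregation objects to dicts.
        return json.dumps(asdict(self))


@dataclass
class TimeseriesQuery(BaseQuery):
    queryType: str = "timeseries"
    descending: bool = False


@dataclass
class TopNQuery(BaseQuery):
    queryType: str = "topN"
    dimension: str = ""
    metric: str = ""
    threshold: int = 10


query = TopNQuery(
    dataSource="sample_data",
    intervals=["2013-08-31T00:00:00.000/2013-09-03T00:00:00.000"],
    aggregations=[Aggregation("longSum", "count", "count")],
    dimension="sample_dim",
    metric="count",
    threshold=5,
)
body = json.loads(query.to_json())
```

Filters and post-aggregators would get their own entity classes in the same way and be added to the base class or the subclasses as needed.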
