Study Notes on the MongoDB Aggregation Stage Operator $bucketAuto

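Overview

New in MongoDB 3.4.

$bucketAuto categorizes incoming documents into a specific number of groups, called buckets, based on a specified expression. Bucket boundaries are determined automatically in an attempt to distribute the documents evenly across the specified number of buckets. Each bucket is represented as one document in the output. The document for each bucket contains an _id field, whose value specifies the inclusive lower bound and the exclusive upper bound of the bucket, and a count field containing the number of documents in the bucket. The count field is included by default when output is not specified.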

Syntax

{
  $bucketAuto: {
      groupBy: <expression>,
      buckets: <number>,
      output: {
         <output1>: { <$accumulator expression> },
         ...
      },
      granularity: <string>
  }
}

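Fields

Field	Type	Description
groupBy	expression	An expression to group documents by. To specify a field path, prefix the field name with a dollar sign $ and enclose it in quotes.
buckets	integer	A positive 32-bit integer that specifies the number of buckets into which input documents are grouped.
output	document	Optional. A document that specifies the fields to include in the output documents in addition to the _id field. To specify a field to include, you must use accumulator expressions.
granularity	string	Optional. A string that specifies the preferred number series to use, so that the calculated boundary edges end on preferred round numbers or their powers of 10. Available only if all groupBy values are numeric and none of them are NaN.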
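To make the output field more concrete, here is a minimal sketch of a pipeline that uses accumulator expressions. The collection name orders and the field price are hypothetical and only serve as an illustration. Note that once output is specified, count has to be listed explicitly if it is still wanted.

```javascript
// Hypothetical example: "orders" and "price" are assumed names, not from the notes.
const pipeline = [
  {
    $bucketAuto: {
      groupBy: "$price",            // field path: "$" prefix, in quotes
      buckets: 4,                   // requested number of buckets
      output: {
        count: { $sum: 1 },         // count must be requested explicitly here
        avgPrice: { $avg: "$price" },
        prices: { $push: "$price" }
      }
    }
  }
];

// In the mongo shell this would be run as: db.orders.aggregate(pipeline)
console.log(JSON.stringify(pipeline, null, 2));
```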

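The number of buckets produced may be smaller than the number requested in any of the following cases:

The number of input documents is less than the specified number of buckets.
The number of unique values of the groupBy expression is less than the specified number of buckets.
The granularity has fewer intervals than the number of buckets.
The granularity is not fine enough to distribute the documents evenly into the specified number of buckets.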
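The first two cases can be illustrated with a small JavaScript model of even bucketing without granularity. This is only a sketch under my own assumptions (roughly N/buckets values per bucket, equal values never split across buckets); the function name autoBuckets is made up, and this is not MongoDB's actual implementation.

```javascript
// Simplified model of $bucketAuto without granularity (an assumption, not MongoDB code).
function autoBuckets(values, buckets) {
  const sorted = [...values].sort((a, b) => a - b);
  const size = Math.max(1, Math.round(sorted.length / buckets));
  const result = [];
  let i = 0;
  while (i < sorted.length) {
    // The last requested bucket takes everything that is left.
    let end = result.length === buckets - 1
      ? sorted.length
      : Math.min(i + size, sorted.length);
    // Equal groupBy values must land in the same bucket.
    while (end < sorted.length && sorted[end] === sorted[end - 1]) end++;
    result.push({
      min: sorted[i],
      // Upper bound is the next bucket's lower bound, or the last value itself.
      max: end < sorted.length ? sorted[end] : sorted[sorted.length - 1],
      count: end - i
    });
    i = end;
  }
  return result;
}

// Fewer documents than buckets: 3 documents, 5 requested buckets.
console.log(autoBuckets([1, 2, 3], 5).length);
// Fewer unique values than buckets: 2 unique values, 4 requested buckets.
console.log(autoBuckets([7, 7, 7, 7, 9, 9], 4));
```

In this model the first call yields 3 buckets and the second yields 2, in both cases fewer than requested.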

Granularity

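$bucketAuto accepts an optional granularity parameter, which ensures that the boundaries of all buckets adhere to a specified preferred number series. Using a preferred number series gives more control over where the bucket boundaries are placed within the range of values of the groupBy expression. When the range of the groupBy expression scales exponentially, such a series can also help place the bucket boundaries evenly on a logarithmic scale.

The following granularity values are supported: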

R5
R10
R20
R40
R80
E6
E12
E24
E48
E96
E192
1-2-5
POWERSOF2

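Renard Series in Detail

The Renard number series are sets of numbers derived by taking the 5th, 10th, 20th, 40th, or 80th root of 10 and then including the powers of that root that fall between 1.0 and 10.0 (10.3 in the case of R80). Set granularity to R5, R10, R20, R40, or R80 to restrict bucket boundaries to values in the series. When the groupBy values lie outside the range 1.0 to 10.0 (10.3 for R80), the values of the series are multiplied by a power of 10.

For example, the R5 series is based on the fifth root of 10, which is about 1.58, and contains the successive powers of this root (rounded) until 10 is reached. The R5 series is derived as follows: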

10^(0/5) = 1
10^(1/5) = 1.584 ≈ 1.6
10^(2/5) = 2.511 ≈ 2.5
10^(3/5) = 3.981 ≈ 4.0
10^(4/5) = 6.309 ≈ 6.3
10^(5/5) = 10
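The derivation above can be reproduced numerically. The sketch below simply evaluates 10^(k/n) for k = 0..n and rounds to three decimals; the helper name rootPowers is my own, and the values are the raw roots, not the standardized preferred numbers.

```javascript
// Raw powers of the n-th root of 10, rounded to three decimals.
function rootPowers(n) {
  const values = [];
  for (let k = 0; k <= n; k++) {
    values.push(Number(Math.pow(10, k / n).toFixed(3)));
  }
  return values;
}

console.log(rootPowers(5));  // [ 1, 1.585, 2.512, 3.981, 6.31, 10 ]
console.log(rootPowers(10)); // the same idea for the tenth root of 10
```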

Create 100 documents with _id running from 1 to 100:

{ _id: 1 }
{ _id: 2 }
...
{ _id: 100 }

Run the following aggregation:

db.things.aggregate( [
  {
    $bucketAuto: {
      groupBy: "$_id",
      buckets: 5,
      granularity: "R5"
    }
  }
] )

Result:

{
    "_id" : {
        "min" : 0.63,
        "max" : 25.0
    },
    "count" : 24
}
{
    "_id" : {
        "min" : 25.0,
        "max" : 63.0
    },
    "count" : 38
}
{
    "_id" : {
        "min" : 63.0,
        "max" : 100.0
    },
    "count" : 37
}
{
    "_id" : {
        "min" : 100.0,
        "max" : 160.0
    },
    "count" : 1
}

What about R10?

db.things.aggregate( [
  {
    $bucketAuto: {
      groupBy: "$_id",
      buckets: 5,
      granularity: "R10"
    }
  }
] )

The result is as follows:

{
    "_id" : {
        "min" : 0.8,
        "max" : 25.0
    },
    "count" : 24
}
{
    "_id" : {
        "min" : 25.0,
        "max" : 50.0
    },
    "count" : 25
}
{
    "_id" : {
        "min" : 50.0,
        "max" : 80.0
    },
    "count" : 30
}
{
    "_id" : {
        "min" : 80.0,
        "max" : 100.0
    },
    "count" : 20
}
{
    "_id" : {
        "min" : 100.0,
        "max" : 125.0
    },
    "count" : 1
}

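Where do these boundary values come from? Following the R5 derivation, R10 is based on the tenth root of 10, which is about 1.26 (listed in the series as 1.25), and contains the successive powers of this root (rounded) until 10 is reached. R10 is derived as follows: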

10^(0/10) = 1
10^(1/10) = 1.25892541 ≈ 1.25
10^(2/10) = 1.58489319 ≈ 1.6
10^(3/10) = 1.99526231 ≈ 2.0
10^(4/10) = 2.51188643 ≈ 2.5
10^(5/10) = 3.16227766 ≈ 3.16
10^(6/10) = 3.98107171 ≈ 4.0
10^(7/10) = 5.01187234 ≈ 5.0
10^(8/10) = 6.30957344 ≈ 6.3
10^(9/10) = 7.94328235 ≈ 8.0

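Because _id runs from 1 to 100, 1.25 is scaled up by a factor of 100 to 125, which is the last boundary. However, some values of the series do not appear at all. By what rule are the values picked from the series?

Increase buckets to 10 and look at the R10 result again: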

{
    "_id" : {
        "min" : 0.8,
        "max" : 12.5
    },
    "count" : 12
}
{
    "_id" : {
        "min" : 12.5,
        "max" : 25.0
    },
    "count" : 12
}
{
    "_id" : {
        "min" : 25.0,
        "max" : 40.0
    },
    "count" : 15
}
{
    "_id" : {
        "min" : 40.0,
        "max" : 50.0
    },
    "count" : 10
}
{
    "_id" : {
        "min" : 50.0,
        "max" : 63.0
    },
    "count" : 13
}
{
    "_id" : {
        "min" : 63.0,
        "max" : 80.0
    },
    "count" : 17
}
{
    "_id" : {
        "min" : 80.0,
        "max" : 100.0
    },
    "count" : 20
}
{
    "_id" : {
        "min" : 100.0,
        "max" : 125.0
    },
    "count" : 1
}

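This does bring in more values from the R10 series. Adjusting buckets further, once the number of buckets reaches 67 the result tops out at 18 groups, and no matter how much larger buckets is made, no additional groups appear. The 18 groups are as follows: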

/* 1 */
{
    "_id" : {
        "min" : 0.8,
        "max" : 1.25
    },
    "count" : 1
}

/* 2 */
{
    "_id" : {
        "min" : 1.25,
        "max" : 2.5
    },
    "count" : 1
}

/* 3 */
{
    "_id" : {
        "min" : 2.5,
        "max" : 3.15
    },
    "count" : 1
}

/* 4 */
{
    "_id" : {
        "min" : 3.15,
        "max" : 5.0
    },
    "count" : 1
}

/* 5 */
{
    "_id" : {
        "min" : 5.0,
        "max" : 6.3
    },
    "count" : 2
}

/* 6 */
{
    "_id" : {
        "min" : 6.3,
        "max" : 8.0
    },
    "count" : 1
}

/* 7 */
{
    "_id" : {
        "min" : 8.0,
        "max" : 10.0
    },
    "count" : 2
}

/* 8 */
{
    "_id" : {
        "min" : 10.0,
        "max" : 12.5
    },
    "count" : 3
}

/* 9 */
{
    "_id" : {
        "min" : 12.5,
        "max" : 16.0
    },
    "count" : 3
}

/* 10 */
{
    "_id" : {
        "min" : 16.0,
        "max" : 20.0
    },
    "count" : 4
}

/* 11 */
{
    "_id" : {
        "min" : 20.0,
        "max" : 25.0
    },
    "count" : 5
}

/* 12 */
{
    "_id" : {
        "min" : 25.0,
        "max" : 31.5
    },
    "count" : 7
}

/* 13 */
{
    "_id" : {
        "min" : 31.5,
        "max" : 40.0
    },
    "count" : 8
}

/* 14 */
{
    "_id" : {
        "min" : 40.0,
        "max" : 50.0
    },
    "count" : 10
}

/* 15 */
{
    "_id" : {
        "min" : 50.0,
        "max" : 63.0
    },
    "count" : 13
}

/* 16 */
{
    "_id" : {
        "min" : 63.0,
        "max" : 80.0
    },
    "count" : 17
}

/* 17 */
{
    "_id" : {
        "min" : 80.0,
        "max" : 100.0
    },
    "count" : 20
}

/* 18 */
{
    "_id" : {
        "min" : 100.0,
        "max" : 125.0
    },
    "count" : 1
}

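The series that finally appears is still not complete. The reason is that _id only takes integer values from 1 to 100, so no value can fall between 1.25 and 1.99 (or between 3.15 and 4.0), and the boundaries 1.6, 2.0, and 4.0 are skipped.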

10^(0/10) = 1
10^(1/10) = 1.25892541 ≈ 1.25
10^(2/10) = 1.58489319 ≈ 1.6
10^(3/10) = 1.99526231 ≈ 2.0
10^(4/10) = 2.51188643 ≈ 2.5
10^(5/10) = 3.16227766 ≈ 3.15
10^(6/10) = 3.98107171 ≈ 4.0
10^(7/10) = 5.01187234 ≈ 5.0
10^(8/10) = 6.30957344 ≈ 6.3
10^(9/10) = 7.94328235 ≈ 8.0

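The values that finally appear are listed above. Plain rounding of 10^(5/10) = 3.16227766 would give 3.16, yet the boundary MongoDB returns is 3.15. I am not sure of the exact reason, but 3.15 (like 1.25 rather than 1.26) is the value listed in the standardized R10 preferred number series, so MongoDB appears to take its boundaries from the standard series rather than from an exact calculation of the roots.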
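The question of how boundaries are picked can be explored with a small model. The sketch below is an assumption inferred from the outputs in these notes, not taken from MongoDB's source: each bucket first receives about N/buckets values, its upper edge is then rounded up to the next series value strictly greater than the last of those values, and the bucket absorbs every value below that edge. The names R10_X100, R10, and modelBucketAuto are made up, and the lower edge (min) is not modeled.

```javascript
// Standardized R10 values for one decade, stored x100 so they stay exact integers.
const R10_X100 = [100, 125, 160, 200, 250, 315, 400, 500, 630, 800];

// R10 boundaries from 1 up to 800, enough for _id values 1..100.
const R10 = [1, 10, 100].flatMap(scale => R10_X100.map(b => (b * scale) / 100));

// Simplified model of $bucketAuto with granularity (an assumption, see lead-in).
function modelBucketAuto(sortedValues, buckets, series) {
  const size = Math.max(1, Math.round(sortedValues.length / buckets));
  const result = [];
  let i = 0;
  while (i < sortedValues.length) {
    // The last requested bucket takes everything that is left.
    const end = result.length === buckets - 1
      ? sortedValues.length
      : Math.min(i + size, sortedValues.length);
    const last = sortedValues[end - 1];
    // Round the upper edge up to the next series value above `last`.
    const max = series.find(b => b > last);
    let count = 0;
    while (i < sortedValues.length && sortedValues[i] < max) {
      i++;
      count++;
    }
    result.push({ max, count });
  }
  return result;
}

const ids = Array.from({ length: 100 }, (_, k) => k + 1);
console.log(modelBucketAuto(ids, 5, R10));          // 5 buckets
console.log(modelBucketAuto(ids, 10, R10).length);  // 8
console.log(modelBucketAuto(ids, 67, R10).length);  // 18
```

For buckets set to 5, 10, and 67 the model gives the same upper boundaries and counts as the results shown above. In this model the 18 groups first appear at 67 because round(100/67) is 1, so every bucket starts from a single value, and 1.6, 2.0, and 4.0 never become edges because no integer lies just below them.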

E Series

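The E number series are similar to the Renard series: they subdivide the interval from 1.0 to 10.0 by the 6th, 12th, 24th, 48th, 96th, or 192nd root of 10, with a particular relative error. The sixth roots behind E6 are listed below with plain rounding; note that the standardized E6 values are 1.0, 1.5, 2.2, 3.3, 4.7, and 6.8, so two of them (3.3 and 4.7) differ from the plain roundings.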

10^(0/6) = 1
10^(1/6) = 1.46779927 ≈ 1.5
10^(2/6) = 2.15443469 ≈ 2.2
10^(3/6) = 3.16227766 ≈ 3.2
10^(4/6) = 4.64158883 ≈ 4.6
10^(5/6) = 6.81292069 ≈ 6.8
10^(6/6) = 10

1-2-5 Series

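The 1-2-5 series behaves much like a Renard series, except that it has just three fixed values per decade: 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, ...

With buckets set to 3, the result is as follows: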

/* 1 */
{
    "_id" : {
        "min" : 0.5,
        "max" : 50.0
    },
    "count" : 49
}

/* 2 */
{
    "_id" : {
        "min" : 50.0,
        "max" : 100.0
    },
    "count" : 50
}

/* 3 */
{
    "_id" : {
        "min" : 100.0,
        "max" : 200.0
    },
    "count" : 1
}
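As a quick check of the boundaries above, the 1-2-5 values can be generated per decade. This is only an illustrative sketch; the helper name series125 is my own, and it starts at 1 rather than 0.1 to keep the arithmetic in exact integers.

```javascript
// 1-2-5 values for the decades 10^0 .. 10^maxExp.
function series125(maxExp) {
  const out = [];
  for (let e = 0; e <= maxExp; e++) {
    for (const m of [1, 2, 5]) out.push(m * 10 ** e);
  }
  return out;
}

const s = series125(2);
console.log(s); // [ 1, 2, 5, 10, 20, 50, 100, 200, 500 ]

// The upper boundaries 50, 100, 200 in the result are consecutive series members.
console.log(s.filter(v => v >= 50 && v <= 200)); // [ 50, 100, 200 ]
```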

Powers of Two Series

This is the series of powers of two: the bucket boundaries are taken from the successive powers of 2.

2^0 = 1
2^1 = 2
2^2 = 4
2^3 = 8
2^4 = 16 ...

With buckets set to 3, the result is:

/* 1 */
{
    "_id" : {
        "min" : 0.5,
        "max" : 64
    },
    "count" : 63
}

/* 2 */
{
    "_id" : {
        "min" : 64,
        "max" : 128
    },
    "count" : 37
}
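The two upper boundaries above, 64 and 128, are consistent with rounding up to the next power of two strictly greater than a value. The helper below is a hypothetical illustration of that rounding, assuming positive inputs; it is not MongoDB's code.

```javascript
// Smallest power of two strictly greater than v (assumes v >= 1).
function nextPowerOfTwoAbove(v) {
  let p = 1;
  while (p <= v) p *= 2;
  return p;
}

console.log(nextPowerOfTwoAbove(33));  // 64
console.log(nextPowerOfTwoAbove(64));  // 128
console.log(nextPowerOfTwoAbove(100)); // 128
```

Under the assumption that a bucket first receives about 100/3 ≈ 33 values, its edge rounds up to 64 and it absorbs the 63 values below that; the next bucket's edge rounds up to 128 and takes the remaining 37, which would explain why only two buckets are returned.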

Note: the test results above were produced with MongoDB v4.2.2.