ClickHouse Parametric Aggregate Functions 带参聚合函数

2,208 阅读10分钟

参考文献

Parametric Aggregate Functions

clickhouse聚合函数的整体描述

Aggregate functions work in the normal way as expected by database experts.

ClickHouse also supports:

  • Parametric aggregate functions, which accept other parameters in addition to columns.
  • Combinators, which change the behavior of aggregate functions.

During aggregation, all NULLs are skipped.

clickhouse支持常见的聚合函数,同时也有自己的特色函数Parametrice aggretate functions和combinators两种,在聚合的过程中所有的NULL值都会被忽略,本文将具体讨论Parametric aggregate functions.

Histogram

Calculates an adaptive histogram. It doesn’t guarantee precise results.
聚合成直方图,且不保证精确的结果

用法

histogram(number_of_bins)(values)

The functions uses A Streaming Parallel Decision Tree Algorithm. The borders of histogram bins are adjusted as new data enters a function. In common case, the widths of bins are not equal.

这里的算法大意是计算大致有多少个点落在直方图左侧bin和右侧bin之间的区间里,与我们直观的计算不同,一般情况下直方图的条宽度并不相同

参数

  • number_of_bins: Upper limit for the number of bins in the histogram. The function automatically calculates the number of bins. It tries to reach the specified number of bins, but if it fails, it uses fewer bins.数据条的最大数量
  • values Expression resulting in input values.

返回值

Returned values Array of Tuples of the following format:

[(lower_1, upper_1, height_1), ... (lower_N, upper_N, height_N)]
  • lower — Lower bound of the bin.
  • upper — Upper bound of the bin.
  • height — Calculated height of the bin.

举例

SELECT histogram(5)(number + 1)
FROM (
    SELECT *
    FROM system.numbers
    LIMIT 20
)
┌─histogram(5)(plus(number, 1))───────────────────────────────────────────┐
│ [(1,4.5,4),(4.5,8.5,4),(8.5,12.75,4.125),(12.75,17,4.625),(17,20,3.25)] │
└─────────────────────────────────────────────────────────────────────────┘

其中,number是system.numbers表的一个列名.

直接在console中展示直方图

通过bar函数可以直接可视化直方图,但是这种情况你并不会直观获取直方图的条宽度.

In this case, you should remember that you don’t know the histogram bin borders.

WITH histogram(5)(rand() % 100) AS hist
SELECT
    arrayJoin(hist).3 AS height,
    bar(height, 0, 6, 5) AS bar
FROM
(
    SELECT *
    FROM system.numbers
    LIMIT 20
)
┌─height─┬─bar───┐
│  2.125 │ █▋    │
│   3.25 │ ██▌   │
│  5.625 │ ████▏ │
│  5.625 │ ████▏ │
│  3.375 │ ██▌   │
└────────┴───────┘

sequenceMatch

Checks whether the sequence contains an event chain that matches the pattern.

检查序列中是否有匹配条件的event chain。

Warning: Events that occur at the same second may lay in the sequence in an undefined order affecting the result.

即使同时刻发生的事件,也会因为在序列中的无序位置而导致结果的不同

用法

sequenceMatch(pattern)(timestamp, cond1, cond2, …)

参数

  • pattern— Pattern string. SeePattern syntax.
  • timestamp— Column considered to contain time data. Typical data types are Date and DateTime. You can also use any of the supportedUIntdata types.
  • cond1,cond2— Conditions that describe the chain of events. Data type: UInt8. You can pass up to 32 condition arguments. The function takes only the events described in these conditions into account. If the sequence contains data that isn’t described in a condition, the function skips them. 函数只考虑当前时刻符合条件的事件。

返回值

  • 1, if the pattern is matched.
  • 0, if the pattern isn’t matched. Type:UInt8.

Pattern Syntax

  • (?N)— Matches the condition argument at position N. Conditions are numbered in the [1, 32] range. For example,(?1)matches the argument passed to the cond1 parameter.
  • .*— Matches any number of events. You don’t need conditional arguments to match this element of the pattern.
  • (?t operator value) — Sets the time in seconds that should separate two events. For example, pattern (?1)(?t>1800)(?2) matches events that occur more than 1800 seconds from each other. An arbitrary number of any events can lay between these events. You can use the>=, >, <, <=operators. 时间分割符

举例

Consider data in the t table:
┌─time─┬─number─┐
│    1 │      1 │
│    2 │      3 │
│    3 │      2 │
└──────┴────────┘
Perform the query:
SELECT sequenceMatch('(?1)(?2)')(time, number = 1, number = 2) FROM t
┌─sequenceMatch('(?1)(?2)')(time, equals(number, 1), equals(number, 2))─┐
│                                                                     1 │
└───────────────────────────────────────────────────────────────────────┘
The function found the event chain where number 2 follows number 1.
It skipped number 3 between them, because the number is not described as an event. 
If we want to take this number into account when searching for the event chain given in the example,
we should make a condition for it.
如栗子,因为2顺序在1之后,所以返回1,表示模式匹配
SELECT sequenceMatch('(?1)(?2)')(time, number = 1, number = 2, number = 3) FROM t
┌─sequenceMatch('(?1)(?2)')(time, equals(number, 1), equals(number, 2), equals(number, 3))─┐
│                                                                                        0 │
└──────────────────────────────────────────────────────────────────────────────────────────┘
In this case, the function couldn’t find the event chain matching the pattern, 
because the event for number 3 occured between 1 and 2.
If in the same case we checked the condition for number 4, the sequence would match the pattern.
这样就无法感知,因为1-2-3的顺序不匹配,但是如果寻找的顺序是1-2-4,那么则匹配(只要是匹配集的子集也符合满足条件)
SELECT sequenceMatch('(?1)(?2)')(time, number = 1, number = 2, number = 4) FROM t
┌─sequenceMatch('(?1)(?2)')(time, equals(number, 1), equals(number, 2), equals(number, 4))─┐
│                                                                                        1 │
└──────────────────────────────────────────────────────────────────────────────────────────┘

sequenceCount

Counts the number of event chains that matched the pattern. The function searches event chains that don’t overlap. It starts to search for the next chain after the current chain is matched.

统计匹配的次数

Warning: Events that occur at the same second may lay in the sequence in an undefined order affecting the result.

sequenceMatch

用法

sequenceCount(pattern)(time, cond1, cond2, …)

参数

  • pattern — Pattern string. See Pattern syntax.
  • timestamp — Column considered to contain time data. Typical data types are DateandDateTime. You can also use any of the supportedUIntdata types.
  • cond1, cond2— Conditions that describe the chain of events. Data type:UInt8. You can pass up to 32 condition arguments. The function takes only the events described in these conditions into account. If the sequence contains data that isn’t described in a condition, the function skips them.

返回值

  • Number of non-overlapping event chains that are matched.
    不重复的事件链出现的次数

Type:UInt64.

举例

Consider data in the t table:
┌─time─┬─number─┐
│    1 │      1 │
│    2 │      3 │
│    3 │      2 │
│    4 │      1 │
│    5 │      3 │
│    6 │      2 │
└──────┴────────┘
Count how many times the number 2 occurs after the number 1 with any amount of other numbers between them:
SELECT sequenceCount('(?1).*(?2)')(time, number = 1, number = 2) FROM t
┌─sequenceCount('(?1).*(?2)')(time, equals(number, 1), equals(number, 2))─┐
│                                                                       2 │
└─────────────────────────────────────────────────────────────────────────┘

windowFunnel

Searches for event chains in a sliding time window and calculates the maximum number of events that occurred from the chain.
滑动窗口周期中符合要求的事件链条的最大次数
The function works according to the algorithm:

  • The function searches for data that triggers the first condition in the chain and sets the event counter to 1. This is the moment when the sliding window starts.
  • If events from the chain occur sequentially within the window, the counter is incremented. If the sequence of events is disrupted, the counter isn’t incremented.
  • If the data has multiple event chains at varying points of completion, the function will only output the size of the longest chain.

用法

windowFunnel(window, [mode])(timestamp, cond1, cond2, ..., condN)

参数

  • window— Length of the sliding window in seconds.
  • mode- It is an optional argument.
  • 'strict'- When the'strict'is set, the windowFunnel() applies conditions only for the unique values.
  • timestamp— Name of the column containing the timestamp. Data types supported:Date,DateTimeand other unsigned integer types (note that even though timestamp supports theUInt64type, it’s value can’t exceed the Int64 maximum, which is 2^63 - 1).
  • cond— Conditions or data describing the chain of events.UInt8.

返回值

The maximum number of consecutive triggered conditions from the chain within the sliding time window. All the chains in the selection are analyzed. Type:Integer.

举例

Determine if a set period of time is enough for the user to select a phone and purchase it twice in the online store.
Set the following chain of events:
1. The user logged in to their account on the store (eventID = 1003).
2. The user searches for a phone (eventID = 1007, product = 'phone').
3. The user placed an order (eventID = 1009).
4. The user made the order again (eventID = 1010).
Input table:
┌─event_date─┬─user_id─┬───────────timestamp─┬─eventID─┬─product─┐
│ 2019-01-28 │       1 │ 2019-01-29 10:00:00 │    1003 │ phone   │
└────────────┴─────────┴─────────────────────┴─────────┴─────────┘
┌─event_date─┬─user_id─┬───────────timestamp─┬─eventID─┬─product─┐
│ 2019-01-31 │       1 │ 2019-01-31 09:00:00 │    1007 │ phone   │
└────────────┴─────────┴─────────────────────┴─────────┴─────────┘
┌─event_date─┬─user_id─┬───────────timestamp─┬─eventID─┬─product─┐
│ 2019-01-30 │       1 │ 2019-01-30 08:00:00 │    1009 │ phone   │
└────────────┴─────────┴─────────────────────┴─────────┴─────────┘
┌─event_date─┬─user_id─┬───────────timestamp─┬─eventID─┬─product─┐
│ 2019-02-01 │       1 │ 2019-02-01 08:00:00 │    1010 │ phone   │
└────────────┴─────────┴─────────────────────┴─────────┴─────────┘
Find out how far the user user_id could get through the chain in a period in January-February of 2019.
Query:
SELECT
    level,
    count() AS c
FROM
(
    SELECT
        user_id,
        windowFunnel(6048000000000000)(timestamp, eventID = 1003, eventID = 1009, eventID = 1007, eventID = 1010) AS level
    FROM trend
    WHERE (event_date >= '2019-01-01') AND (event_date <= '2019-02-02')
    GROUP BY user_id
)
GROUP BY level
ORDER BY level ASC
Result:
┌─level─┬─c─┐
│     4 │ 1 │
└───────┴───┘

retention

The function takes as arguments a set of conditions from 1 to 32 arguments of type UInt8 that indicate whether a certain condition was met for the event.
检查情况和条件是否匹配
Any condition can be specified as an argument (as inWHERE).
The conditions, except the first, apply in pairs: the result of the second will be true if the first and second are true, of the third if the first and fird are true, etc.

用法

retention(cond1, cond2, ..., cond32)

参数

  • cond— an expression that returns aUInt8result (1 or 0).

返回值

The array of 1 or 0.

  • 1 — condition was met for the event.
  • 0 — condition wasn’t met for the event. Type:UInt8.

举例

Let’s consider an example of calculating the retention function to determine site traffic.
1. Сreate a table to illustrate an example.
CREATE TABLE retention_test(date Date, uid Int32) ENGINE = Memory;

INSERT INTO retention_test SELECT '2020-01-01', number FROM numbers(5);
INSERT INTO retention_test SELECT '2020-01-02', number FROM numbers(10);
INSERT INTO retention_test SELECT '2020-01-03', number FROM numbers(15);
Input table:
Query:
SELECT * FROM retention_test
Result:
┌───────date─┬─uid─┐
│ 2020-01-01 │   0 │
│ 2020-01-01 │   1 │
│ 2020-01-01 │   2 │
│ 2020-01-01 │   3 │
│ 2020-01-01 │   4 │
└────────────┴─────┘
┌───────date─┬─uid─┐
│ 2020-01-02 │   0 │
│ 2020-01-02 │   1 │
│ 2020-01-02 │   2 │
│ 2020-01-02 │   3 │
│ 2020-01-02 │   4 │
│ 2020-01-02 │   5 │
│ 2020-01-02 │   6 │
│ 2020-01-02 │   7 │
│ 2020-01-02 │   8 │
│ 2020-01-02 │   9 │
└────────────┴─────┘
┌───────date─┬─uid─┐
│ 2020-01-03 │   0 │
│ 2020-01-03 │   1 │
│ 2020-01-03 │   2 │
│ 2020-01-03 │   3 │
│ 2020-01-03 │   4 │
│ 2020-01-03 │   5 │
│ 2020-01-03 │   6 │
│ 2020-01-03 │   7 │
│ 2020-01-03 │   8 │
│ 2020-01-03 │   9 │
│ 2020-01-03 │  10 │
│ 2020-01-03 │  11 │
│ 2020-01-03 │  12 │
│ 2020-01-03 │  13 │
│ 2020-01-03 │  14 │
└────────────┴─────┘
2. Group users by unique ID uid using the retention function.
Query:
SELECT
    uid,
    retention(date = '2020-01-01', date = '2020-01-02', date = '2020-01-03') AS r
FROM retention_test
WHERE date IN ('2020-01-01', '2020-01-02', '2020-01-03')
GROUP BY uid
ORDER BY uid ASC
Result:
┌─uid─┬─r───────┐
│   0 │ [1,1,1] │
│   1 │ [1,1,1] │
│   2 │ [1,1,1] │
│   3 │ [1,1,1] │
│   4 │ [1,1,1] │
│   5 │ [0,0,0] │
│   6 │ [0,0,0] │
│   7 │ [0,0,0] │
│   8 │ [0,0,0] │
│   9 │ [0,0,0] │
│  10 │ [0,0,0] │
│  11 │ [0,0,0] │
│  12 │ [0,0,0] │
│  13 │ [0,0,0] │
│  14 │ [0,0,0] │
└─────┴─────────┘
只有三个满足rentention中所有的情况才会出1,否则出0.rentention必须要有两个及以上的如参数
3. Calculate the total number of site visits per day.
Query:
SELECT
    sum(r[1]) AS r1,
    sum(r[2]) AS r2,
    sum(r[3]) AS r3
FROM
(
    SELECT
        uid,
        retention(date = '2020-01-01', date = '2020-01-02', date = '2020-01-03') AS r
    FROM retention_test
    WHERE date IN ('2020-01-01', '2020-01-02', '2020-01-03')
    GROUP BY uid
)
Result:
┌─r1─┬─r2─┬─r3─┐
│  5 │  5 │  5 │
└────┴────┴────┘
Where:
* r1- the number of unique visitors who visited the site during 2020-01-01 (the cond1 condition).
* r2- the number of unique visitors who visited the site during a specific time period between 2020-01-01 and 2020-01-02 (cond1 and cond2 conditions).
* r3- the number of unique visitors who visited the site during a specific time period between 2020-01-01 and 2020-01-03 (cond1 and cond3 conditions).

uniqUpTo

uniqUpTo(N)(x)

Calculates the number of different argument values if it is less than or equal to N. If the number of different argument values is greater than N, it returns N + 1.
统计所有不同参数小于N的值,如果超过,则为N+1

Recommended for use with small Ns, up to 10. The maximum value of N is 100. For the state of an aggregate function, it uses the amount of memory equal to 1 + N * the size of one value of bytes.

For strings, it stores a non-cryptographic hash of 8 bytes. That is, the calculation is approximated for strings.

The function also works for several arguments.

It works as fast as possible, except for cases when a large N value is used and the number of unique values is slightly less than N.

Usage example:

Problem: Generate a report that shows only keywords that produced at least 5 unique users.

Solution: Write in the GROUP BY query SearchPhrase HAVING uniqUpTo(4)(UserID) >= 5

sumMapFiltered

sumMapFiltered(keys_to_keep)(keys, values)

Same behavior assumMapexcept that an array of keys is passed as a parameter. This can be especially useful when working with a high cardinality of keys.

与sumMap类似,只是将key作为参数传入,适合有大量key的环境