Clickhouse物化视图

294 阅读8分钟

概念

  1. 物化视图是查询结果集的一份持久化存储,与关系型数据库的普通视图不同,且非常趋近于表。物化视图不会随着基础表的变化而变化,所以它也称为快照(snapshot)。如果要更新数据的话,需要用户手动进行,如周期性执行SQL,或利用触发器等机制。产生物化视图的过程叫做物化;
  2. 物化视图也是数据库中的预计算逻辑+显示缓存,典型的空间换时间思路。所以利用得当,它可以避免对基础表的频繁查询并复用结果,从而显著提升查询性能;
  3. 物化视图可以计算聚合,从kafka读取数据,实现最后点查询以及重组表主索引和排序顺序。实例化视图还可以在大量节点上很好地扩展,并可以处理大型数据集。它们是clickhouse的独特功能之一;

语法

CREARE [MATERIALIZED] VIEW [IF NOT EXIST] [db.]TABLE_NAME
[TO[db.]name]  [ENGINE=engine] [POPULATE] AS 
SELECT ...
  1. 必须指定物化视图的engine用于数据存储;
  2. 使用TO [db].name语法的时候,不得使用POPULATE;
  3. SELECT后面可以包含:DISTNCT,GROUP BY,ORDER BY,LIMIT...;
  4. 物化视图的ALTER操作有些限制,操作不便;

数据更新

  1. 物化视图创建好后,若源表被写入新数据则物化视图也会同步更新;
  2. POPULATE关键字决定了物化视图的更新策略: 若有POPULATE则在创建视图的过程中会将源表已经存在的数据一并导入,类似于create table as...; 若无POPULATE则物化视图在创建之后没有数据,只会在创建只有同步之后写入源表的数据;
  3. 物化视图支持同步更新但不支持同步删除,若源表的数据不存在了,则物化视图的数据仍然保留;
  4. 物化视图是种特殊的数据表,可以用show tables查看;

案例

基于单机表引擎MergeTree

假设有一个表download来记录用户下载的信息,并且可以追踪用户每天下载的信息

创建download表

CREATE TABLE download
(
	`when` DateTime,
	`userid` UInt32,
	`bytes` Float32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(when)
ORDER BY (userid, when)

插入数据

INSERT INTO download SELECT 
	now() + (number * 60) AS when,
	25,
	rand() % 100000000
FROM system.numbers
LIMIT 20

查看数据

SELECT *
FROM download

┌────────────────when─┬─userid─┬────bytes─┐
│ 2021-06-16 10:02:522537577424 │
│ 2021-06-16 10:03:522583676350 │
│ 2021-06-16 10:04:522529040676 │
│ 2021-06-16 10:05:522519840790 │
│ 2021-06-16 10:06:522556336120 │
│ 2021-06-16 10:07:522524582186 │
│ 2021-06-16 10:08:522573516080 │
│ 2021-06-16 10:09:52254801462 │
│ 2021-06-16 10:10:522554939668 │
│ 2021-06-16 10:11:522543684268 │
│ 2021-06-16 10:12:522512542577 │
│ 2021-06-16 10:13:522562006156 │
│ 2021-06-16 10:14:522576839750 │
│ 2021-06-16 10:15:522582842560 │
│ 2021-06-16 10:16:522511130308 │
│ 2021-06-16 10:17:522553049730 │
│ 2021-06-16 10:18:522557856370 │
│ 2021-06-16 10:19:522569018670 │
│ 2021-06-16 10:20:522590462900 │
│ 2021-06-16 10:21:522539850290 │
└─────────────────────┴────────┴──────────┘

创建物化视图

CREATE MATERIALIZED VIEW download_mv
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(day)
ORDER BY (userid, day) POPULATE AS
SELECT 
	toStartOfDay(when) AS day,
	userid,
	count() AS downloads,
	sum(bytes) AS bytes
FROM download
GROUP BY 
	userid,
	day
				

1. 因为旨在简化总和计数,所以使用SummingMergeTree。对于计算聚合的物化视图,推荐使用此引擎; 2. 使用POPULATE,即告诉clickhouse将计算download表中现有的数据,就像刚插入一样,新加入的数据一样会被计算

查看视图

SELECT *
FROM download_mv
ORDER BY 
	day ASC,
	userid ASC

┌─────────────────day─┬─userid─┬─downloads─┬─────bytes─┐
│ 2021-06-16 00:00:002520983594331 │
└─────────────────────┴────────┴───────────┴───────────┘

模拟产生新数据

INSERT INTO download SELECT 
	now() + (number * 60) AS when,
	22,
	rand() % 100000000
FROM system.numbers
LIMIT 50

查询验证试图

SELECT *
FROM download_mv
ORDER BY 
	day ASC,
	userid ASC

┌─────────────────day─┬─userid─┬─downloads─┬──────bytes─┐
│ 2021-06-16 00:00:0022502536221861 │
└─────────────────────┴────────┴───────────┴────────────┘
┌─────────────────day─┬─userid─┬─downloads─┬─────bytes─┐
│ 2021-06-16 00:00:002520983594331 │
└─────────────────────┴────────┴───────────┴───────────┘

聚合计算

SELECT 
	toStartOfMonth(day) AS month,
	userid,
	sum(downloads),
	sum(bytes)
FROM download_mv
GROUP BY 
	userid,
	month
	WITH TOTALS
ORDER BY 
	userid ASC,
	month ASC

┌──────month─┬─userid─┬─sum(downloads)─┬───sum(bytes)─┐
│ 2021-06-01225050252141221435 │
│ 2021-06-01235000249287398287 │
│ 2021-06-012520983594331 │
└────────────┴────────┴────────────────┴──────────────┘

基于集群分布式表引擎ReplicatedMergeTree

创建本地表

CREATE TABLE download_rep
(
	`when` DateTime,
	`userid` UInt32,
	`bytes` Float32
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/download_rep', '{replica}')
PARTITION BY toYYYYMM(when)
ORDER BY (userid, when)

创建分布式表

CREATE TABLE download_rep_dis AS download_rep
ENGINE = Distributed(ck_cluster, cc_test, download_rep, rand())

插入数据

INSERT INTO download_rep_dis SELECT 
	now() + (number * 60) AS when,
	23,
	rand() % 100000000
FROM system.numbers
LIMIT 50

查看数据

SELECT *
FROM download_rep_dis

┌────────────────when─┬─userid─┬────bytes─┐
│ 2021-06-16 14:58:132330161356 │
│ 2021-06-16 14:59:132334822964 │
│ 2021-06-16 15:01:132390626264 │
│ 2021-06-16 15:02:132347314428 │
│ 2021-06-16 15:09:132390441950 │
│ 2021-06-16 15:11:13236823396 │
│ 2021-06-16 15:13:132366999120 │
│ 2021-06-16 15:16:132346866404 │
│ 2021-06-16 15:17:132390831976 │
│ 2021-06-16 15:18:132356102428 │
│ 2021-06-16 15:19:132340785110 │
│ 2021-06-16 15:22:132350626640 │
│ 2021-06-16 15:30:132347742570 │
│ 2021-06-16 15:31:132346140064 │
│ 2021-06-16 15:32:132369070200 │
│ 2021-06-16 15:33:132383270024 │
│ 2021-06-16 15:34:132372469496 │
│ 2021-06-16 15:39:132332683200 │
│ 2021-06-16 15:42:132317278020 │
│ 2021-06-16 15:45:132369946130 │
│ 2021-06-16 15:46:132348463776 │
│ 2021-06-16 15:47:132336671556 │
└─────────────────────┴────────┴──────────┘

创建物化视图

CREATE MATERIALIZED VIEW download_rep_dis_mv
ENGINE = ReplicatedSummingMergeTree('/clickhouse/tables/{shard}/cc_test/download_rep_dis_mv', '{replica}')
PARTITION BY toYYYYMM(day)
ORDER BY (userid, day) POPULATE AS
SELECT 
	toStartOfDay(when) AS day,
	userid,
	count() AS downloads,
	sum(bytes) AS bytes
FROM download_rep_dis
GROUP BY 
	userid,
	day

查询验证试图

SELECT *
FROM download_rep_dis_mv
ORDER BY 
	day ASC,
	userid ASC

┌─────────────────day─┬─userid─┬─downloads─┬──────bytes─┐
│ 2021-06-16 00:00:0023221176137072 │
└─────────────────────┴────────┴───────────┴────────────┘

聚合计算

SELECT 
	toStartOfMonth(day) AS month,
	userid,
	sum(downloads),
	sum(bytes)
FROM download_rep_dis_mv
GROUP BY 
	userid,
	month
	WITH TOTALS
ORDER BY 
	userid ASC,
	month ASC

┌──────month─┬─userid─┬─sum(downloads)─┬──sum(bytes)─┐
│ 2021-06-0123221176137072 │
│ 2021-06-012450025121398114 │
└────────────┴────────┴────────────────┴─────────────┘

Totals:
┌──────month─┬─userid─┬─sum(downloads)─┬──sum(bytes)─┐
│ 1970-01-01052226297535186 │
└────────────┴────────┴────────────────┴─────────────┘