ClickHouse Data Lifecycle

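Preface

This morning an email alert from the CDH cluster reported that a disk was running out of space. Our production pipeline does not write much data into ClickHouse, so it was puzzling that ClickHouse had all but filled the disk.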

Locating the fault

# Check disk usage
[root@hadoop-prod-datanode1 /]# df -h
Filesystem               Size  Used Avail Use% Mounted on
devtmpfs                  32G     0   32G    0% /dev
tmpfs                     32G   16K   32G    1% /dev/shm
tmpfs                     32G  221M   32G    1% /run
tmpfs                     32G     0   32G    0% /sys/fs/cgroup
/dev/mapper/centos-root  492G  416G   77G   85% /
/dev/sda1                509M  144M  366M   29% /boot
tmpfs                    6.3G     0  6.3G    0% /run/user/0
cm_processes              32G   42M   32G    1% /run/cloudera-scm-agent/process

# Check how much space each top-level directory on the mounted disk uses
[root@hadoop-prod-datanode1 /]# du -sh *
0	bin
118M	boot
233M	data
0	dev
135G	dfs
33M	etc
32K	home
0	impala
0	lib
0	lib64
0	media
0	mnt
9.4G	opt
0	proc
23M	root
262M	run
0	sbin
109M	script
0	srv
0	sys
131M	tmp
2.2G	usr
258G	var
11G	yarn

# /var is using far more space than anything else
# Repeat du -sh * level by level to find the subdirectory that holds the data
[root@hadoop-prod-datanode1 ece]# pwd
/var/lib/clickhouse/store/ece
[root@hadoop-prod-datanode1 ece]# du -sh *
229G	ece02706-70f8-43c9-b342-c634cd93471b
[root@hadoop-prod-datanode1 ece]# 

# The space turns out to be consumed by the query.bin file
[root@hadoop-prod-datanode1 ece02706-70f8-43c9-b342-c634cd93471b]# cd 202103_287170_288764_12
[root@hadoop-prod-datanode1 202103_152820_240492_74]# du -sh *
...
72M	ProfileEvents.size0.bin
1.8M	ProfileEvents.size0.mrk2
31G	query.bin
40M	query_duration_ms.bin
1.8M	query_duration_ms.mrk2
...
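
The manual drill-down above can be shortened by sorting the du output. The following is a minimal sketch, not part of the original session: the function name largest_entries is hypothetical, sizes are printed in KiB, and it assumes GNU du (for -x, which keeps the scan on one filesystem).

```shell
#!/bin/sh
# largest_entries DIR [N]: list the N largest direct children of DIR.
# Sizes are in KiB. The glob skips hidden entries, and -x stops du from
# crossing into other mounted filesystems.
largest_entries() {
  dir=${1:-.}
  count=${2:-10}
  du -skx -- "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

# Example (run as a user that can read the directory):
#   largest_entries /var/lib/clickhouse/store 5
```

Running it once per level replaces reading the full unsorted du -sh * listing by eye.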

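Fault handling

A .bin file stores the data of a single ClickHouse column. Query the ClickHouse system tables to find out which table this column belongs to: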

select * from system.columns where name='query'

select * from system.query_thread_log limit 100
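
Searching system.columns by column name works here, but a column name such as query can exist in several tables. As a hedged alternative sketch: in an Atomic database the directory name under store/ is the table UUID, so it can be looked up in system.tables, and system.parts can rank tables by on-disk size. The function names below are hypothetical; the helpers only print SQL, which is then piped to clickhouse-client.

```shell
#!/bin/sh
# table_for_store_uuid UUID: print a query that resolves a store/ directory
# name (assumed to be the table UUID of an Atomic database) to its table.
table_for_store_uuid() {
  cat <<SQL
SELECT database, name
FROM system.tables
WHERE toString(uuid) = '$1'
SQL
}

# largest_tables_query [N]: print a query that ranks tables by the on-disk
# size of their active parts (default: top 10).
largest_tables_query() {
  cat <<SQL
SELECT database, table, formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC
LIMIT ${1:-10}
SQL
}

# Usage, assuming clickhouse-client can reach the server:
#   table_for_store_uuid ece02706-70f8-43c9-b342-c634cd93471b | clickhouse-client
#   largest_tables_query 5 | clickhouse-client
```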

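Because production workloads are running on this cluster, we first need to confirm what this table stores and whether deleting its data carries any unforeseen risk.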

clickhouse.tech/docs/en/ope… clickhouse.tech/docs/en/ope…

Explanation from the official documentation

Unlike other system tables, the system log tables metric_log, query_log, query_thread_log, trace_log, part_log, crash_log and text_log are served by MergeTree table engine and store their data in a filesystem by default. If you remove a table from a filesystem, the ClickHouse server creates the empty one again at the time of the next data writing. If system table schema changed in a new release, then ClickHouse renames the current table and creates a new one.

By default, table growth is unlimited. To control a size of a table, you can use TTL settings for removing outdated log records. Also you can use the partitioning feature of MergeTree-engine tables.

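In other words, query_thread_log is one of the server's own log tables: it is backed by the MergeTree engine, it is stored on disk by default, and ClickHouse recreates it empty on the next write if it is removed.

Because these tables grow without limit by default, their size has to be controlled explicitly, either with a TTL that removes outdated log records or with the partitioning feature of MergeTree tables.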

Resolving the problem

TTL syntax reference: blog.csdn.net/vkingnew/ar…

Setting the data lifecycle

-- Inspect the table schema
desc system.query_thread_log;

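-- Set a 7-day TTL on the query column (once expired, its values are reset to the column default)
-- INTERVAL accepts the units: SECOND, MINUTE, HOUR, DAY, WEEK, MONTH, QUARTER, YEAR.
ALTER TABLE system.query_thread_log MODIFY COLUMN query String TTL event_time + INTERVAL 7 DAY;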

-- Trigger a merge so that the TTL cleanup is applied
OPTIMIZE TABLE system.query_thread_log;
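
To check whether the cleanup actually freed space, the table's active parts can be summed before and after the merge. The sketch below is an illustration, not part of the original procedure, and its function names are hypothetical. It also prints a table-level TTL statement as a possible alternative: a column TTL only resets expired values of one column, whereas a table-level TTL deletes the expired rows entirely. The use of event_date as the TTL column is an assumption based on the standard query_thread_log schema.

```shell
#!/bin/sh
# table_size_query DB TABLE: print a query reporting the on-disk size and
# row count of a table's active parts.
table_size_query() {
  cat <<SQL
SELECT formatReadableSize(sum(bytes_on_disk)) AS size, sum(rows) AS total_rows
FROM system.parts
WHERE active AND database = '$1' AND table = '$2'
SQL
}

# row_ttl_statement TABLE DAYS: print a table-level TTL statement that
# deletes whole rows older than DAYS days (assumes an event_date column).
row_ttl_statement() {
  cat <<SQL
ALTER TABLE $1 MODIFY TTL event_date + INTERVAL $2 DAY
SQL
}

# Usage, assuming clickhouse-client can reach the server:
#   table_size_query system query_thread_log | clickhouse-client
#   row_ttl_statement system.query_thread_log 7 | clickhouse-client
```

Comparing the output of table_size_query before and after OPTIMIZE shows how much space was reclaimed.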