Data Storage
ClickHouse provides many table engines with different characteristics for different needs. Among them, the MergeTree engine is the core of ClickHouse: thanks to its strong performance and rich feature set it is widely used, and it serves as the foundation for the other specialized engines in the family.
Creating a table
CREATE TABLE test.users
(
name String,
sex String,
age Int8,
birthday Date
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(birthday)
ORDER BY age
SETTINGS min_rows_for_wide_part = 0, min_bytes_for_wide_part = 0, index_granularity = 2;
Notes:
The CREATE TABLE statement includes a SETTINGS clause because newer ClickHouse versions support the compact part format, which improves performance for small, frequent inserts. In that format the engine no longer writes one .bin file per column; everything goes into a single data.bin. To keep one file per column for this walkthrough, the compact format is disabled here (min_rows_for_wide_part = 0, min_bytes_for_wide_part = 0).
index_granularity: the index granularity, 8192 by default, meaning the MergeTree index generates one entry every 8192 rows. This setting rarely needs changing; here index_granularity = 2 generates an index entry every 2 rows.
Insert one row:
insert into test.users values ('zhangsan', 'female', 20, '2023-03-07');
Go to the /var/lib/clickhouse/data/test/users path and look at the generated files:
A partition directory named 202303_1_1_0 has been created.
Partition directory naming
202303_1_1_0
PartitionId_MinBlockNum_MaxBlockNum_Level
PartitionId: the partition ID, determined by the partition key specified at table creation. The partition key here is toYYYYMM(birthday) and the inserted birthday is 2023-03-07, so the value is 202303.
MinBlockNum, MaxBlockNum: the minimum and maximum data block numbers. The counter increases monotonically within a single table, starting at 1, and every newly created partition directory increments it. A freshly created directory has MinBlockNum equal to MaxBlockNum; the two only diverge once partition directories are merged. Since this is the table's first insert, both are 1.
Level: the number of times the partition has been merged. Unlike MinBlockNum and MaxBlockNum, it is counted not per table but per partition within the table. Every newly created partition directory starts at 0, and the value only exceeds 0 after a merge.
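The naming rule can be captured in a few lines of Python; parse_part_name is a hypothetical helper written for illustration, not part of ClickHouse:

```python
def parse_part_name(name: str) -> dict:
    """Split a MergeTree part directory name into its four components."""
    partition_id, min_block, max_block, level = name.split("_")
    return {
        "partition_id": partition_id,     # from the partition key, e.g. toYYYYMM(birthday)
        "min_block_num": int(min_block),  # per-table counter
        "max_block_num": int(max_block),  # equals min_block_num for a fresh insert
        "level": int(level),              # number of merges, 0 for a fresh insert
    }

print(parse_part_name("202303_1_1_0"))  # a freshly inserted part
print(parse_part_name("202303_1_3_1"))  # the part produced by merging blocks 1..3 once
```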
Insert two more rows to experiment:
insert into test.users values ('lisi', 'qqq', 21, '2023-03-08');
insert into test.users values ('wangwu', 'www', 22, '2023-03-09');
Each insert creates a new partition directory, and MinBlockNum and MaxBlockNum keep incrementing.
MergeTree merges partitions at unscheduled times, so to see the result of a merge sooner, force one with the OPTIMIZE command.
optimize table users final;
After the command runs, a new partition directory 202303_1_3_1 appears: the merged directory. The 1 and 3 are MinBlockNum and MaxBlockNum; across the three merged parts, the minimum block number is 1 and the maximum is 3. The final digit is the Level, i.e. the number of merges, and since the partition has been merged once, Level = 1.
After a while the other partition directories are deleted, leaving only the merged one.
File structure
Enter the 202303_1_3_1 directory and inspect its contents with the tree command:
[root@bogon 202303_1_3_1]# tree
.
├── age.bin
├── age.cmrk2
├── birthday.bin
├── birthday.cmrk2
├── checksums.txt
├── columns.txt
├── count.txt
├── default_compression_codec.txt
├── metadata_version.txt
├── minmax_birthday.idx
├── name.bin
├── name.cmrk2
├── partition.dat
├── primary.cidx
├── serialization.json
├── sex.bin
└── sex.cmrk2
Notes:
checksums.txt: a checksum file storing, in binary form, the size and hash of each file, used to quickly verify storage integrity.
columns.txt: the column names and their data types.
count.txt: the total number of rows; 3 in this example.
primary.idx: the primary key index file, holding the sparse index entries and the positions of their data granules. Query conditions are matched against the sparse index to quickly filter out irrelevant data and reduce the amount that must be loaded.
{column}.bin: the actual column data, one file per column named after it, LZ4-compressed by default. One file per column is the wide part format; the alternative, the compact part format, puts all column data into a single data.bin (newer versions need SETTINGS min_rows_for_wide_part = 0, min_bytes_for_wide_part = 0 to force the wide format).
{column}.cmrk2: the column's marks, recording the offsets of data blocks within the .bin file. The mark file is aligned with the column data file, recording each compressed block's relative position in the .bin file; it is also aligned with the index file, recording where the data for each sparse index entry sits inside the column file. ClickHouse first locates the mark via the index file, then reads the data directly from the .bin file according to the mark.
partition.dat: stores the computed value of the partition key for this part.
minmax_{column}.idx: the minmax index, recording the minimum and maximum values of the partition column within this partition.
(Recent ClickHouse versions compress the marks and the primary key index by default, which is why the listing shows .cmrk2 and .cidx rather than .mrk2 and .idx; the structures described below are otherwise the same.)
Binary file formats in detail
primary.idx
A MergeTree table generates primary.idx from the sorting key. From the users CREATE statement, the sorting key is age and index_granularity = 2, so primary.idx should contain three entries: 20, 25 and 30. Inspect the file:
hexdump displays the hexadecimal encoding of a binary file. Since age is declared as Int8 and occupies one byte, each hex value corresponds to one age. Converting the hex output:
0x14 (hex) -> 20 (decimal)
0x19 (hex) -> 25 (decimal)
0x1e (hex) -> 30 (decimal)
The decimal values match the logical analysis, which further confirms how the sparse index in ClickHouse's primary.idx works.
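The same conversion in Python, treating the three bytes from the hexdump as signed 8-bit integers (the raw byte string here is typed in by hand, not read from the file):

```python
import struct

# Hand-copied bytes from the primary.idx hexdump; each index entry is one
# signed byte because the sort key age is an Int8.
raw = bytes([0x14, 0x19, 0x1e])

ages = list(struct.unpack(f"<{len(raw)}b", raw))  # "b" = signed 8-bit integer
print(ages)  # [20, 25, 30]
```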
{column}.mrk2
A {column}.bin file consists of one or more compressed data blocks. The mrk2 mark file has a fixed layout: each index entry in primary.idx has a corresponding mark in this file, with three columns:
Offset in compressed file, 8 bytes: the offset, within the .bin file, of the compressed block this mark points to.
Offset in decompressed block, 8 bytes: the offset of the marked data inside the decompressed block.
Rows count, 8 bytes: the number of rows, normally equal to index_granularity.
Each mrk2 row therefore occupies 24 bytes, which makes locating the mark for a primary.idx entry trivial: for index n (counting from 0), the mark starts at offset n * 24 in the mrk2 file, and reading 24 bytes from there yields the mark.
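A sketch of that n * 24 lookup, using hypothetical mark bytes that match the age.mrk2 dump analysed below:

```python
import struct

MARK_SIZE = 24  # three little-endian UInt64s

def read_mark(mrk2: bytes, n: int) -> tuple:
    """Return (offset in compressed file, offset in decompressed block, rows)
    for the n-th mark; it starts at byte offset n * 24."""
    off = n * MARK_SIZE
    return struct.unpack("<QQQ", mrk2[off:off + MARK_SIZE])

# Hypothetical mrk2 contents mirroring the age.mrk2 dump in this section.
marks = struct.pack("<QQQQQQQQQ", 0, 0, 2, 0, 2, 2, 0, 4, 1)

print(read_mark(marks, 0))  # (0, 0, 2)
print(read_mark(marks, 2))  # (0, 4, 1)
```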
age.mrk2
$ hexdump -C age.mrk2
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000010 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000020 02 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00 |................|
00000030 00 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00 |................|
00000040 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000050 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000060
$ od -An -l age.mrk2
0 0
2 0
2 2
0 4
1 0
5 0
Now analyse the file contents. Each mark occupies 24 bytes, so splitting the output into 24-byte rows gives the following table:
Mark # | Offset in compressed file | Offset in decompressed block | Rows count
0      | 0                         | 0                            | 2
1      | 0                         | 2                            | 2
2      | 0                         | 4                            | 1
(The trailing 24 bytes, 0 / 5 / 0, are a final mark written at the end of the file.)
The offsets into the decompressed block line up with the age values in the table, matching the three index entries derived from primary.idx:
index_granularity = 2 and age is declared Int8 (one byte), so every two rows share one index entry, and mrk2 likewise writes one mark per two rows.
All of the stored data is smaller than the default 64 KB minimum compression block size, so everything fits into one compressed block and every offset into the compressed file is 0.
The last mark covers only one row, so its row count is 1.
{column}.bin
A {column}.bin file consists of a number of data blocks, compressed with LZ4 by default, and stores the data of a single column.
A compressed data block has two parts: a header and the compressed data. The header is a fixed 9 bytes, made up of one UInt8 (1 byte) and two UInt32s (4 bytes each), encoding the compression method, the compressed data size, and the uncompressed data size respectively.
For example, a header might decode as:
0x82: the compression method (LZ4)
12000: the compressed data size
65536: the uncompressed data size
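The 9-byte header can be modeled with Python's struct; the values below are the illustrative ones from the example above:

```python
import struct

# UInt8 method + UInt32 compressed size + UInt32 uncompressed size = 9 bytes
HEADER = struct.Struct("<BII")

header = HEADER.pack(0x82, 12000, 65536)  # 0x82 is the LZ4 method byte
method, compressed_size, uncompressed_size = HEADER.unpack(header)

print(hex(method), compressed_size, uncompressed_size)  # 0x82 12000 65536
```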
A {column}.bin file is laid out as follows:
minmax_birthday.idx
The minmax file stores the minimum and maximum values of the partition column within the partition. The partition column birthday has type Date, implemented as a UInt16 holding the number of days elapsed since 1970-01-01. From the INSERT statements above, the minimum birthday is 2023-03-07 and the maximum is 2023-03-09; converted to day counts these are 19423 and 19425, or 0x4bdf and 0x4be1 in hex.
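The day-count conversion is easy to verify with Python's datetime; date_to_days is a hypothetical helper mirroring how a Date value is stored:

```python
from datetime import date

EPOCH = date(1970, 1, 1)

def date_to_days(d: date) -> int:
    """A ClickHouse Date is a UInt16 count of days since 1970-01-01."""
    return (d - EPOCH).days

print(date_to_days(date(1970, 1, 2)))  # 1, sanity check
print(date_to_days(date(2023, 3, 9)), hex(date_to_days(date(2023, 3, 9))))  # 19425 0x4be1
```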
The minmax file lets ClickHouse intelligently skip blocks and partitions that cannot contain matching data, shrinking the amount that must be scanned. For example: a table has a timestamp column, and minmax_timestamp.idx records a minimum of "2023-01-01" and a maximum of "2023-12-31". A query for "2023-01-01" through "2023-06-30" falls entirely inside that range, so this partition must be read; any partition or data block whose minmax range does not intersect the queried range can be skipped outright, which speeds up the query.
Indexes and data storage
ClickHouse accelerates locating and querying data by sorting it and building a sparse index. A sparse index records only starting positions, so a single index entry can mark a large stretch of data, and the advantage grows with data volume. For example, with MergeTree's default index granularity (index_granularity) of 8192, marking one hundred million rows takes only 12208 index entries. Fewer entries take less space, which is why the contents of primary.idx can stay resident in memory.
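The arithmetic behind the one-hundred-million-row figure:

```python
import math

index_granularity = 8192
rows = 100_000_000

# one index entry per granule, rounded up for the partial last granule
marks = math.ceil(rows / index_granularity)
print(marks)  # 12208
```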
An example from the official documentation:
Here the index is on the two columns (CounterID, Date). For the query CounterID in ('a', 'h'), the server reads the data in mark ranges [0, 3) and [6, 8); for CounterID in ('a', 'h') and Date = 3, it reads marks [1, 3) and [7, 8); for Date = 3 alone, it reads marks [1, 10].
A sparse index cannot pinpoint data exactly; it only narrows the search to an approximate range. For a system designed to process massive data sets, that overhead is negligible.
As the analysis above shows, index performance depends on the data distribution of the indexed columns, so table design should take the real shape of the business data into account and avoid indexing columns with very low selectivity. The index granularity (index_granularity) also matters; ClickHouse's default of 8192 delivers good performance in most cases.
Indexing process
From the analysis of the primary.idx, mrk2 and bin files above: every index entry in primary.idx has a corresponding record in mrk2, and from mrk2 one can locate both the compressed block inside the bin file and the position of the data after decompression.
If the data is split into batches of index_granularity (8192) rows, each batch is written out according to these rules:
(1) If the accumulated batch data is < 64 KB, continue with the next batch; once the total exceeds 64 KB, emit a compressed block.
(2) If a single batch is between 64 KB and 1 MB, emit a compressed block directly.
(3) If a single batch exceeds 1 MB, cut it into 1 MB blocks, and handle the remainder by the rules above.
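A toy model of the three rules, assuming the per-batch byte sizes are already known (64 KB and 1 MB correspond to the default min_compress_block_size and max_compress_block_size settings):

```python
MIN_BLOCK = 64 * 1024    # min_compress_block_size, 64 KB
MAX_BLOCK = 1024 * 1024  # max_compress_block_size, 1 MB

def split_blocks(batch_sizes):
    """Toy model of the three rules: feed per-batch byte sizes,
    get back the sizes of the compressed blocks that would be cut."""
    blocks, acc = [], 0
    for size in batch_sizes:
        acc += size
        while acc > MAX_BLOCK:   # rule 3: cut 1 MB pieces
            blocks.append(MAX_BLOCK)
            acc -= MAX_BLOCK
        if acc >= MIN_BLOCK:     # rules 1 and 2: flush once >= 64 KB
            blocks.append(acc)
            acc = 0
        # otherwise rule 1: below 64 KB, keep accumulating
    if acc:                      # whatever is left forms the last block
        blocks.append(acc)
    return blocks

print(split_blocks([8192] * 10))  # [65536, 16384]
```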
Query process
When MergeTree reads data, it must locate the required data through the position information in the marks. The lookup has roughly two steps: reading the compressed block, then reading the data.
In the figure above, taking the JavaEnable column as an example:
JavaEnable occupies one byte per row, so 8192 rows take 8192 bytes and 8 such granules make 64 KB; in other words, eight batches of 8192 rows form one compressed block. The second block's offset in the compressed file starts at 12016 bytes: assuming the first block compresses to 12000 bytes, the difference is accounted for by the blocks' header sections, and so on for later blocks. The uncompressed size of 65536 means each block holds exactly 64 KB of raw data.
With mrk2, MergeTree does not have to load the data into memory in one go. Locating a compressed block and reading data works as follows:
Reading the compressed block: when querying a column, MergeTree need not load the entire .bin file at once; it loads the specific compressed block using the compressed-file offset stored in the mark file.
Reading the data: after decompression, MergeTree need not scan the whole decompressed block; using the in-block offset stored in the mark file, it loads just the small segment covered by one index_granularity.
How data marks map to compressed blocks
One to one
Here every compressed block holds 64 KB of raw data. In the figure, the first value in the block header, 65536, is the size after decompression, and the second, 43311, is the compressed size.
Many to one
One to many
That is, one mrk2 row corresponds to data spread across two compressed blocks.
Index usage and data loading overview
The figure below sketches the correspondence between primary.idx, .mrk2 and .bin; the split of the *.bin file into compressed blocks Block0 and Block1 is only illustrative, not the actual layout:
In a ClickHouse query, the ideal case is that the query conditions, combined with the predeclared partitioning and indexes, shrink the set of data marks to load to a minimum, reducing the scan range as far as possible. SQL analysis determines which columns are required, and only the data within the relevant mark ranges is loaded into memory for further processing. For example, take the following SQL:
select
StringID,
avg(Score)
from
id_test
where
StringID between '607589cdcbc95900017ddf03' and '607589cdcbc95900017ddf06'
group by StringID;
A simplified view of the ClickHouse data loading process:
(1) Using the StringID condition and the index, determine the mark range:
mark0 [607589cdcbc95900017ddf01, 607589cdcbc95900017ddf04)
mark1 [607589cdcbc95900017ddf04, 607589cdcbc95900017ddf07)
(2) The SQL needs the StringID and Score columns, so, given the mark range from step (1), the data of mark0 and mark1 must be loaded from StringID.bin and Score.bin. Reading StringID.mrk2 and Score.mrk2 yields the offsets of the compressed blocks for mark0 and mark1 inside the *.bin files. Read those blocks out of StringID.bin and Score.bin, decompress them, and apply the SQL conditions for the precise filtering.
(3) Post-process the data; this step runs almost entirely in memory.
In short, when a query executes, ClickHouse first uses the index file (primary.idx) to find the positions of the data granules that can satisfy the conditions (avoiding a scan over masses of irrelevant blocks), then uses the mark files ({column}.mrk2) to locate where those granules begin, and finally reads the actual column data from the binary data files ({column}.bin).
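The whole chain can be condensed into a small sketch; index_keys and marks are hypothetical stand-ins for the contents of primary.idx and a .mrk2 file:

```python
import bisect

index_keys = [20, 25, 30]                  # primary.idx: first sort-key value of each granule
marks = [(0, 0, 2), (0, 2, 2), (0, 4, 1)]  # mrk2: (compressed offset, row offset, rows)

def granule_for(key: int) -> int:
    """Granule whose first key is the rightmost index entry <= key; a real scan
    widens this to a range when the key repeats across granule boundaries."""
    return max(bisect.bisect_right(index_keys, key) - 1, 0)

mark = marks[granule_for(22)]
print(mark)  # (0, 0, 2): seek to compressed offset 0, skip to row offset 0, read 2 rows
```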
Data backup
Cold backup
Copying the physical files is also a valid way to back up, but note that writes must be avoided while the files are being copied.
Physical file backup
Backing up the physical files has two parts:
(1) Metadata backup: the CREATE statements of each database and table live under {datadir}/metadata and can be backed up from there.
(2) Data backup: copy the complete clickhouse data directory; its path is the value of the path setting in config.xml.
Restoring is just as simple: replace the data directory with the backed-up copy.
Hot backup
Export the chosen data to files with a query. This approach is flexible about what gets backed up, and the data can be imported back through the client tools.
-- export a txt file with a header row (column names)
clickhouse-client -h 127.0.0.1 --database="default" -u default --password "" --format_csv_delimiter="|" --query="select * from default.users FORMAT CSVWithNames" > local.txt
-- export a txt file without a header row
clickhouse-client -h 127.0.0.1 --database="default" -u default --password "" --format_csv_delimiter="|" --query="select * from default.users FORMAT CSV" > local.txt
-- export a csv file with a header row
clickhouse-client -h 127.0.0.1 --database="default" -u default --password "" --format_csv_delimiter="|" --query="select * from default.users FORMAT CSVWithNames" > local.csv
-- export a csv file without a header row
clickhouse-client -h 127.0.0.1 --database="default" -u default --password "" --format_csv_delimiter="|" --query="select * from default.users FORMAT CSV" > local.csv
-- import a txt file with a header row
clickhouse-client -h 127.0.0.1 --database="test01" -u default --password "123456" --format_csv_delimiter='|' --query="insert into local FORMAT CSVWithNames" < local.txt
-- import a txt file without a header row
clickhouse-client --password 123456 -d test01 -q " insert into local FORMAT CSV" --format_csv_delimiter='|' < local.txt
-- import a csv file with a header row
clickhouse-client --password 123456 -d default -q " insert into local FORMAT CSVWithNames" --format_csv_delimiter='|' < /data/local.csv
-- import a csv file without a header row
clickhouse-client --password 123456 -d default -q " insert into local FORMAT CSV" --format_csv_delimiter='|' < /data/local.csv
Partition backup
For partition-level backups, ClickHouse currently provides two mechanisms: FREEZE and FETCH.
FREEZE backup
Prepare some data:
CREATE TABLE default.tett
(
id UInt64,
name String,
age UInt8,
loc String
)
ENGINE = MergeTree
PARTITION BY loc
ORDER BY id;
insert into default.tett values(1, 'zs', 18, 'beijing');
insert into default.tett values(2, 'ls', 19, 'shanghai');
insert into default.tett values(3, 'ww', 20, 'guangzhou');
insert into default.tett values(4, 'ml', 21, 'beijing');
insert into default.tett values(5, 'tq', 22, 'shanghai');
insert into default.tett values(6, 'gb', 23, 'guangzhou');
Syntax:
ALTER TABLE table_name FREEZE PARTITION partition_expr
Frozen partitions are stored under /var/lib/clickhouse/shadow/N, where N is an incrementing integer counting the backups (FREEZE executions); the current count is recorded in the increment.txt file inside the shadow directory.
For example:
ALTER TABLE default.tett FREEZE PARTITION 'beijing';
After it completes, look under /var/lib/clickhouse/shadow:
Inside the numbered directory, the layout is /store/[database]/[table]/[partition_folder], as shown:
Restoring the backup relies on attaching the partition with ATTACH: copy the partition files from the shadow subdirectory into the detached directory of the target table, then load them with an ATTACH statement.
-- drop the beijing partition of table tett
alter table tett drop partition 'beijing';
-- copy the backed-up partition data into the detached directory under the tett table's data directory
cp -r /var/lib/clickhouse/shadow/1/store/xxx/xxxxx/* /var/lib/clickhouse/data/default/tett/detached/
-- inspect the data in the detached directory
ls /var/lib/clickhouse/data/default/tett/detached
-- change the owner of the partition data to the clickhouse user
chown -R clickhouse:clickhouse xxxxxx/
-- attach the partition back to table tett
alter table tett attach partition 'beijing';
-- query table tett; the data is restored
select * from tett;
FETCH backup
FETCH only supports the ReplicatedMergeTree family of table engines.
ALTER TABLE table_name FETCH PARTITION partition_id FROM zk_path
It works much like ReplicatedMergeTree's own data synchronization: FETCH finds all replicas of the ReplicatedMergeTree table through the given zk_path, picks the most suitable replica, and downloads the partition data from it.
Prepare some data:
Create table default.test_f(
id UInt32,
name String,
age UInt32,
gender String,
loc String
) engine = ReplicatedMergeTree('/ClickHouse/tables/01/test_f','node1')
partition by loc
order by id;
insert into default.test_f values (1,'zs',18,'m','beijing'),(2,'ls',19,'f','shanghai'),(3,'ww',20,'m','beijing'),(4,'ml',21,'m','shanghai');
Back up the beijing partition of table test_f:
ALTER TABLE test_f FETCH PARTITION 'beijing' FROM '/ClickHouse/tables/01/test_f'
This downloads the partition data into the local /var/lib/clickhouse/data/{database}/{table}/detached directory on the current node.
As with FREEZE, restoring the partition data relies on attaching it with ATTACH:
-- drop the beijing partition of table test_f
alter table test_f drop partition 'beijing';
-- data fetched into detached is already owned by the clickhouse user, so there is no need to change ownership; simply attach the partition to restore the backup
alter table test_f attach partition 'beijing';
Although FREEZE and FETCH both back up partition files, neither backs up the table metadata (the CREATE statements). For a truly complete backup, also back up the metadata: find the table's .sql file under /var/lib/clickhouse/metadata/{database} and save it separately.
Backing up with clickhouse-backup
Features:
(1) Easily create and restore backups of all or specific tables
(2) Store multiple backups efficiently on the file system
(3) Upload and download with streaming compression
(4) Support incremental backups on remote storage
Installation:
tar -xzvf clickhouse-backup-linux-amd64.tar.gz
mv build/linux/amd64/clickhouse-backup /usr/bin/
mkdir /etc/clickhouse-backup
chown -R clickhouse:clickhouse /etc/clickhouse-backup
Default configuration:
[root@bogon opt]# clickhouse-backup default-config
general:
remote_storage: none # REMOTE_STORAGE, if `none` then `upload` and `download` command will fail
max_file_size: 1073741824 # MAX_FILE_SIZE, 1G by default, useless when upload_by_part is true, use for split data parts files by archives
disable_progress_bar: true # DISABLE_PROGRESS_BAR, show progress bar during upload and download, makes sense only when `upload_concurrency` and `download_concurrency` is 1
backups_to_keep_local: 0 # BACKUPS_TO_KEEP_LOCAL, how many latest local backup should be kept, 0 means all created backups will be stored on local disk
# -1 means backup will keep after `create` but will delete after `create_remote` command
# You shall run `clickhouse-backup delete local <backup_name>` command to remove temporary backup files from the local disk
backups_to_keep_remote: 0 # BACKUPS_TO_KEEP_REMOTE, how many latest backup should be kept on remote storage, 0 means all uploaded backups will be stored on remote storage.
# If old backups are required for newer incremental backup then it won't be deleted. Be careful with long incremental backup sequences.
log_level: info # LOG_LEVEL, a choice from `debug`, `info`, `warn`, `error`
allow_empty_backups: false # ALLOW_EMPTY_BACKUPS
# concurrency means parallel tables and parallel parts inside tables
# for example 4 means max 4 parallel tables and 4 parallel parts inside one table, so equals 16 concurrent streams
download_concurrency: 1 # DOWNLOAD_CONCURRENCY, max 255, by default, the value is round(sqrt(AVAILABLE_CPU_CORES / 2))
upload_concurrency: 1 # UPLOAD_CONCURRENCY, max 255, by default, the value is round(sqrt(AVAILABLE_CPU_CORES / 2))
# RESTORE_SCHEMA_ON_CLUSTER, execute all schema related SQL queries with `ON CLUSTER` clause as Distributed DDL.
# Check `system.clusters` table for the correct cluster name, also `system.macros` can be used.
# This isn't applicable when `use_embedded_backup_restore: true`
restore_schema_on_cluster: ""
upload_by_part: true # UPLOAD_BY_PART
download_by_part: true # DOWNLOAD_BY_PART
use_resumable_state: true # USE_RESUMABLE_STATE, allow resume upload and download according to the <backup_name>.resumable file
# RESTORE_DATABASE_MAPPING, restore rules from backup databases to target databases, which is useful when changing destination database, all atomic tables will be created with new UUIDs.
# The format for this env variable is "src_db1:target_db1,src_db2:target_db2". For YAML please continue using map syntax
restore_database_mapping: {}
retries_on_failure: 3 # RETRIES_ON_FAILURE, how many times to retry after a failure during upload or download
retries_pause: 30s # RETRIES_PAUSE, duration time to pause after each download or upload failure
watch_interval: 1h # WATCH_INTERVAL, use only for `watch` command, backup will create every 1h
full_interval: 24h # FULL_INTERVAL, use only for `watch` command, full backup will create every 24h
watch_backup_name_template: "shard{shard}-{type}-{time:20060102150405}" # WATCH_BACKUP_NAME_TEMPLATE, used only for `watch` command, macros values will apply from `system.macros` for time:XXX, look format in https://go.dev/src/time/format.go
sharded_operation_mode: none # SHARDED_OPERATION_MODE, how different replicas will shard backing up data for tables. Options are: none (no sharding), table (table granularity), database (database granularity), first-replica (on the lexicographically sorted first active replica). If left empty, then the "none" option will be set as default.
cpu_nice_priority: 15 # CPU niceness priority, to allow throttling CPU intensive operation, more details https://manpages.ubuntu.com/manpages/xenial/man1/nice.1.html
io_nice_priority: "idle" # IO niceness priority, to allow throttling disk intensive operation, more details https://manpages.ubuntu.com/manpages/xenial/man1/ionice.1.html
clickhouse:
username: default # CLICKHOUSE_USERNAME
password: "" # CLICKHOUSE_PASSWORD
host: localhost # CLICKHOUSE_HOST, To make backup data `clickhouse-backup` requires access to the same file system as clickhouse-server, so `host` should localhost or address of another docker container on the same machine, or IP address bound to some network interface on the same host.
port: 9000 # CLICKHOUSE_PORT, don't use 8123, clickhouse-backup doesn't support HTTP protocol
# CLICKHOUSE_DISK_MAPPING, use this mapping when your `system.disks` are different between the source and destination clusters during backup and restore process
# The format for this env variable is "disk_name1:disk_path1,disk_name2:disk_path2". For YAML please continue using map syntax
disk_mapping: {}
# CLICKHOUSE_SKIP_TABLES, the list of tables (pattern are allowed) which are ignored during backup and restore process
# The format for this env variable is "pattern1,pattern2,pattern3". For YAML please continue using list syntax
skip_tables:
- system.*
- INFORMATION_SCHEMA.*
- information_schema.*
# CLICKHOUSE_SKIP_TABLE_ENGINES, the list of tables engines which are ignored during backup, upload, download, restore process
# The format for this env variable is "Engine1,Engine2,engine3". For YAML please continue using list syntax
skip_table_engines: []
timeout: 5m # CLICKHOUSE_TIMEOUT
freeze_by_part: false # CLICKHOUSE_FREEZE_BY_PART, allow freezing by part instead of freezing the whole table
freeze_by_part_where: "" # CLICKHOUSE_FREEZE_BY_PART_WHERE, allow parts filtering during freezing when freeze_by_part: true
secure: false # CLICKHOUSE_SECURE, use TLS encryption for connection
skip_verify: false # CLICKHOUSE_SKIP_VERIFY, skip certificate verification and allow potential certificate warnings
sync_replicated_tables: true # CLICKHOUSE_SYNC_REPLICATED_TABLES
tls_key: "" # CLICKHOUSE_TLS_KEY, filename with TLS key file
tls_cert: "" # CLICKHOUSE_TLS_CERT, filename with TLS certificate file
tls_ca: "" # CLICKHOUSE_TLS_CA, filename with TLS custom authority file
log_sql_queries: true # CLICKHOUSE_LOG_SQL_QUERIES, enable logging `clickhouse-backup` SQL queries on `system.query_log` table inside clickhouse-server
debug: false # CLICKHOUSE_DEBUG
config_dir: "/etc/clickhouse-server" # CLICKHOUSE_CONFIG_DIR
# CLICKHOUSE_RESTART_COMMAND, use this command when restoring with --rbac, --rbac-only or --configs, --configs-only options
# will split command by ; and execute one by one, all errors will logged and ignore
# available prefixes
# - sql: will execute SQL query
# - exec: will execute command via shell
restart_command: "sql:SYSTEM SHUTDOWN"
ignore_not_exists_error_during_freeze: true # CLICKHOUSE_IGNORE_NOT_EXISTS_ERROR_DURING_FREEZE, helps to avoid backup failures when running frequent CREATE / DROP tables and databases during backup, `clickhouse-backup` will ignore `code: 60` and `code: 81` errors during execution of `ALTER TABLE ... FREEZE`
check_replicas_before_attach: true # CLICKHOUSE_CHECK_REPLICAS_BEFORE_ATTACH, helps avoiding concurrent ATTACH PART execution when restoring ReplicatedMergeTree tables
use_embedded_backup_restore: false # CLICKHOUSE_USE_EMBEDDED_BACKUP_RESTORE, use BACKUP / RESTORE SQL statements instead of regular SQL queries to use features of modern ClickHouse server versions
backup_mutations: true # CLICKHOUSE_BACKUP_MUTATIONS, allow backup mutations from system.mutations WHERE is_done AND apply it during restore
restore_as_attach: false # CLICKHOUSE_RESTORE_AS_ATTACH, allow restore tables which have inconsistent data parts structure and mutations in progress
check_parts_columns: true # CLICKHOUSE_CHECK_PARTS_COLUMNS, check data types from system.parts_columns during create backup to guarantee mutation is complete
azblob:
endpoint_suffix: "core.windows.net" # AZBLOB_ENDPOINT_SUFFIX
account_name: "" # AZBLOB_ACCOUNT_NAME
account_key: "" # AZBLOB_ACCOUNT_KEY
sas: "" # AZBLOB_SAS
use_managed_identity: false # AZBLOB_USE_MANAGED_IDENTITY
container: "" # AZBLOB_CONTAINER
path: "" # AZBLOB_PATH, `system.macros` values could be applied as {macro_name}
object_disk_path: "" # AZBLOB_OBJECT_DISK_PATH, path for backup of part from `azure_blob_storage` object disk, if disk present, then shall not be zero and shall not be prefixed by `path`
compression_level: 1 # AZBLOB_COMPRESSION_LEVEL
compression_format: tar # AZBLOB_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brortli, zstd, `none` for upload data part folders as is
sse_key: "" # AZBLOB_SSE_KEY
buffer_size: 0 # AZBLOB_BUFFER_SIZE, if less or eq 0 then it is calculated as max_file_size / max_parts_count, between 2Mb and 4Mb
max_parts_count: 10000 # AZBLOB_MAX_PARTS_COUNT, number of parts for AZBLOB uploads, for properly calculate buffer size
max_buffers: 3 # AZBLOB_MAX_BUFFERS
s3:
access_key: "" # S3_ACCESS_KEY
secret_key: "" # S3_SECRET_KEY
bucket: "" # S3_BUCKET
endpoint: "" # S3_ENDPOINT
region: us-east-1 # S3_REGION
acl: private # S3_ACL
assume_role_arn: "" # S3_ASSUME_ROLE_ARN
force_path_style: false # S3_FORCE_PATH_STYLE
path: "" # S3_PATH, `system.macros` values could be applied as {macro_name}
object_disk_path: "" # S3_OBJECT_DISK_PATH, path for backup of part from `s3` object disk, if disk present, then shall not be zero and shall not be prefixed by `path`
disable_ssl: false # S3_DISABLE_SSL
compression_level: 1 # S3_COMPRESSION_LEVEL
compression_format: tar # S3_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brortli, zstd, `none` for upload data part folders as is
# look details in https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingKMSEncryption.html
sse: "" # S3_SSE, empty (default), AES256, or aws:kms
sse_kms_key_id: "" # S3_SSE_KMS_KEY_ID, if S3_SSE is aws:kms then specifies the ID of the Amazon Web Services Key Management Service
sse_customer_algorithm: "" # S3_SSE_CUSTOMER_ALGORITHM, encryption algorithm, for example, AES256
sse_customer_key: "" # S3_SSE_CUSTOMER_KEY, customer-provided encryption key
sse_customer_key_md5: "" # S3_SSE_CUSTOMER_KEY_MD5, 128-bit MD5 digest of the encryption key according to RFC 1321
sse_kms_encryption_context: "" # S3_SSE_KMS_ENCRYPTION_CONTEXT, base64-encoded UTF-8 string holding a JSON with the encryption context
# Specifies the Amazon Web Services KMS Encryption Context to use for object encryption.
# This is a collection of non-secret key-value pairs that represent additional authenticated data.
# When you use an encryption context to encrypt data, you must specify the same (an exact case-sensitive match)
# encryption context to decrypt the data. An encryption context is supported only on operations with symmetric encryption KMS keys
disable_cert_verification: false # S3_DISABLE_CERT_VERIFICATION
use_custom_storage_class: false # S3_USE_CUSTOM_STORAGE_CLASS
storage_class: STANDARD # S3_STORAGE_CLASS, by default allow only from list https://github.com/aws/aws-sdk-go-v2/blob/main/service/s3/types/enums.go#L787-L799
concurrency: 1 # S3_CONCURRENCY
part_size: 0 # S3_PART_SIZE, if less or eq 0 then it is calculated as max_file_size / max_parts_count, between 5MB and 5Gb
max_parts_count: 10000 # S3_MAX_PARTS_COUNT, number of parts for S3 multipart uploads
allow_multipart_download: false # S3_ALLOW_MULTIPART_DOWNLOAD, allow faster download and upload speeds, but will require additional disk space, download_concurrency * part size in worst case
# S3_OBJECT_LABELS, allow setup metadata for each object during upload, use {macro_name} from system.macros and {backupName} for current backup name
# The format for this env variable is "key1:value1,key2:value2". For YAML please continue using map syntax
object_labels: {}
# S3_CUSTOM_STORAGE_CLASS_MAP, allow setup storage class depending on the backup name regexp pattern, format nameRegexp > className
custom_storage_class_map: {}
debug: false # S3_DEBUG
gcs:
credentials_file: "" # GCS_CREDENTIALS_FILE
credentials_json: "" # GCS_CREDENTIALS_JSON
credentials_json_encoded: "" # GCS_CREDENTIALS_JSON_ENCODED
bucket: "" # GCS_BUCKET
path: "" # GCS_PATH, `system.macros` values could be applied as {macro_name}
object_disk_path: "" # GCS_OBJECT_DISK_PATH, path for backup of part from `s3` object disk (clickhouse support only gcs over s3 protocol), if disk present, then shall not be zero and shall not be prefixed by `path`
compression_level: 1 # GCS_COMPRESSION_LEVEL
compression_format: tar # GCS_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brortli, zstd, `none` for upload data part folders as is
storage_class: STANDARD # GCS_STORAGE_CLASS
client_pool_size: 500 # GCS_CLIENT_POOL_SIZE, default max(upload_concurrency, download concurrency) * 3, should be at least 3 times bigger than `UPLOAD_CONCURRENCY` or `DOWNLOAD_CONCURRENCY` in each upload and download case to avoid stuck
# GCS_OBJECT_LABELS, allow setup metadata for each object during upload, use {macro_name} from system.macros and {backupName} for current backup name
# The format for this env variable is "key1:value1,key2:value2". For YAML please continue using map syntax
object_labels: {}
# GCS_CUSTOM_STORAGE_CLASS_MAP, allow setup storage class depends on backup name regexp pattern, format nameRegexp > className
custom_storage_class_map: {}
debug: false # GCS_DEBUG
cos:
url: "" # COS_URL
timeout: 2m # COS_TIMEOUT
secret_id: "" # COS_SECRET_ID
secret_key: "" # COS_SECRET_KEY
path: "" # COS_PATH, `system.macros` values could be applied as {macro_name}
compression_format: tar # COS_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brortli, zstd, `none` for upload data part folders as is
compression_level: 1 # COS_COMPRESSION_LEVEL
ftp:
address: "" # FTP_ADDRESS in format `host:port`
timeout: 2m # FTP_TIMEOUT
username: "" # FTP_USERNAME
password: "" # FTP_PASSWORD
tls: false # FTP_TLS
tls_skip_verify: false # FTP_TLS_SKIP_VERIFY
path: "" # FTP_PATH, `system.macros` values could be applied as {macro_name}
compression_format: tar # FTP_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brortli, zstd, `none` for upload data part folders as is
compression_level: 1 # FTP_COMPRESSION_LEVEL
debug: false # FTP_DEBUG
sftp:
address: "" # SFTP_ADDRESS
username: "" # SFTP_USERNAME
password: "" # SFTP_PASSWORD
port: 22 # SFTP_PORT
key: "" # SFTP_KEY
path: "" # SFTP_PATH, `system.macros` values could be applied as {macro_name}
concurrency: 1 # SFTP_CONCURRENCY
compression_format: tar # SFTP_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brortli, zstd, `none` for upload data part folders as is
compression_level: 1 # SFTP_COMPRESSION_LEVEL
debug: false # SFTP_DEBUG
custom:
upload_command: "" # CUSTOM_UPLOAD_COMMAND
download_command: "" # CUSTOM_DOWNLOAD_COMMAND
delete_command: "" # CUSTOM_DELETE_COMMAND
list_command: "" # CUSTOM_LIST_COMMAND
command_timeout: "4h" # CUSTOM_COMMAND_TIMEOUT
api:
listen: "localhost:7171" # API_LISTEN
enable_metrics: true # API_ENABLE_METRICS
enable_pprof: false # API_ENABLE_PPROF
username: "" # API_USERNAME, basic authorization for API endpoint
password: "" # API_PASSWORD
secure: false # API_SECURE, use TLS for listen API socket
ca_cert_file: "" # API_CA_CERT_FILE
# openssl genrsa -out /etc/clickhouse-backup/ca-key.pem 4096
# openssl req -subj "/O=altinity" -x509 -new -nodes -key /etc/clickhouse-backup/ca-key.pem -sha256 -days 365 -out /etc/clickhouse-backup/ca-cert.pem
private_key_file: "" # API_PRIVATE_KEY_FILE, openssl genrsa -out /etc/clickhouse-backup/server-key.pem 4096
certificate_file: "" # API_CERTIFICATE_FILE,
# openssl req -subj "/CN=localhost" -addext "subjectAltName = DNS:localhost,DNS:*.cluster.local" -new -key /etc/clickhouse-backup/server-key.pem -out /etc/clickhouse-backup/server-req.csr
# openssl x509 -req -days 365000 -extensions SAN -extfile <(printf "\n[SAN]\nsubjectAltName=DNS:localhost,DNS:*.cluster.local") -in /etc/clickhouse-backup/server-req.csr -out /etc/clickhouse-backup/server-cert.pem -CA /etc/clickhouse-backup/ca-cert.pem -CAkey /etc/clickhouse-backup/ca-key.pem -CAcreateserial
integration_tables_host: "" # API_INTEGRATION_TABLES_HOST, allow using DNS name to connect in `system.backup_list` and `system.backup_actions`
allow_parallel: false # API_ALLOW_PARALLEL, enable parallel operations, this allows for significant memory allocation and spawns go-routines, don't enable it if you are not sure
create_integration_tables: false # API_CREATE_INTEGRATION_TABLES, create `system.backup_list` and `system.backup_actions`
complete_resumable_after_restart: true # API_COMPLETE_RESUMABLE_AFTER_RESTART, after API server startup, if `/var/lib/clickhouse/backup/*/(upload|download).state` present, then operation will continue in the background
Configuration file:
vim /etc/clickhouse-backup/config.yml
general:
remote_storage: none # none by default; to upload to a remote server over sftp, set this to sftp
disable_progress_bar: false
backups_to_keep_local: 7 # number of local backups to keep; beyond 7 the oldest are deleted automatically (default 0 keeps all backups)
backups_to_keep_remote: 7 # number of remote backups to keep
log_level: info
allow_empty_backups: false
clickhouse:
username: default
password: "PeQLg45tJcxZk"
host: 192.168.12.91
port: 9000
skip_tables: # databases and tables excluded from backups
- system.*
- default.*
- INFORMATION_SCHEMA.*
- information_schema.*
timeout: 5m
freeze_by_part: false
Backup commands:
-- create a backup with an explicit name
clickhouse-backup create my_backup --config /etc/clickhouse-backup/config.yml
-- create a backup with the default name (a timestamp)
clickhouse-backup create --config /etc/clickhouse-backup/config.yml
-- list the backups already created
clickhouse-backup list --config /etc/clickhouse-backup/config.yml
-- delete a backup (local or remote)
clickhouse-backup delete local my_backup --config /etc/clickhouse-backup/config.yml
-- restore a backup
clickhouse-backup restore my_backup --config /etc/clickhouse-backup/config.yml
-- list the tables that can be backed up
clickhouse-backup tables --config /etc/clickhouse-backup/config.yml
CLI commands
tables
List the tables that can be backed up.
NAME:
clickhouse-backup tables - List of tables, exclude skip_tables
USAGE:
clickhouse-backup tables [-t, --tables=<db>.<table>] [--all]
OPTIONS:
--config value, -c value                   path to the config file
--all, -a                                  print all tables, including skipped ones
--table value, --tables value, -t value    list only tables matching the name patterns, comma separated; ? and * are allowed as wildcards
create
Create a new backup. If no name is given, the backup directory is named after the timestamp, e.g. 2023-11-10T02-00-10.
NAME:
clickhouse-backup create - Create new backup
USAGE:
clickhouse-backup create [-t, --tables=<db>.<table>] [--partitions=<partition_names>] [-s, --schema] [--rbac] [--configs] [--skip-check-parts-columns] <backup_name>
DESCRIPTION:
Create new backup
OPTIONS:
--config value, -c value                   path to the config file
--table value, --tables value, -t value    back up only tables matching the name patterns, comma separated; ? and * are allowed as wildcards
--partitions partition_id                  back up only the selected partition names, comma separated
--schema, -s                               back up schemas only
--rbac, --backup-rbac, --do-backup-rbac    back up RBAC-related objects
--configs, --backup-configs, --do-backup-configs   back up clickhouse-server configuration files
--skip-check-parts-columns                 skip the system.parts_columns check, allowing backup of data parts with inconsistent column types
create_remote
Create and upload a new backup.
NAME:
clickhouse-backup create_remote - Create and upload new backup
USAGE:
clickhouse-backup create_remote [-t, --tables=<db>.<table>] [--partitions=<partition_names>] [--diff-from=<local_backup_name>] [--diff-from-remote=<local_backup_name>] [--schema] [--rbac] [--configs] [--resumable] [--skip-check-parts-columns] <backup_name>
DESCRIPTION:
Create and upload
OPTIONS:
--config value, -c value                   path to the config file
--table value, --tables value, -t value    back up only tables matching the name patterns, comma separated; ? and * are allowed as wildcards
--partitions partition_id                  back up only the selected partition names, comma separated
--diff-from value                          local backup name to diff against, uploading the current backup as an increment
--diff-from-remote value                   remote backup name to diff against, uploading the current backup as an increment
--schema, -s                               back up schemas only
--rbac, --backup-rbac, --do-backup-rbac    back up RBAC-related objects
--configs, --backup-configs, --do-backup-configs   back up clickhouse-server configuration files
--resume, --resumable                      save intermediate upload state and resume if the backup already exists on remote storage; ignored when remote_storage: custom or use_embedded_backup_restore: true
--skip-check-parts-columns                 skip the system.parts_columns check, allowing backup of data parts with inconsistent column types
upload
Upload a backup to remote storage.
NAME:
clickhouse-backup upload - Upload backup to remote storage
USAGE:
clickhouse-backup upload [-t, --tables=<db>.<table>] [--partitions=<partition_names>] [-s, --schema] [--diff-from=<local_backup_name>] [--diff-from-remote=<remote_backup_name>] [--resumable] <backup_name>
OPTIONS:
--config value, -c value                     Path to the config file
--diff-from value                            Name of a local backup to diff against, so the current backup is uploaded as an incremental backup
--diff-from-remote value                     Name of a remote backup to diff against, so the current backup is uploaded as an incremental backup
--table value, --tables value, -t value      Upload only tables matching the given name patterns, comma separated; ? and * are allowed as wildcards
--partitions partition_id                    Upload only the selected partition names, comma separated
--schema, -s                                 Upload the schema only
--resume, --resumable                        Save intermediate upload state and resume the upload if the backup already exists on remote storage; ignored when remote_storage: custom or use_embedded_backup_restore: true
--skip-check-parts-columns                   Skip the system.parts_columns check, allowing backup of data parts with inconsistent column types
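A local backup created earlier can also be pushed to remote storage as a separate step, for instance (backup names are illustrative, taken from the walkthrough below):

```shell
# Hypothetical example: upload a local backup as an increment
# over a previous local backup.
clickhouse-backup upload --diff-from 2023-11-09T06-32-30 2023-11-10T02-00-10
```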
list
List backups
NAME:
clickhouse-backup list - List of backups
USAGE:
clickhouse-backup list [all|local|remote] [latest|previous]
OPTIONS:
--config value, -c value                     Path to the config file
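Following the USAGE line above, the list can be filtered by location and recency, e.g.:

```shell
clickhouse-backup list remote          # all remote backups
clickhouse-backup list local latest    # only the most recent local backup
```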
download
Download a backup from remote storage
NAME:
clickhouse-backup download - Download backup from remote storage
USAGE:
clickhouse-backup download [-t, --tables=<db>.<table>] [--partitions=<partition_names>] [-s, --schema] [--resumable] <backup_name>
OPTIONS:
--config value, -c value                     Path to the config file
--table value, --tables value, -t value      Download only tables matching the given name patterns, comma separated; ? and * are allowed as wildcards
--partitions partition_id                    Download only the selected partition names, comma separated
--schema, -s                                 Download the backup schema only
--resume, --resumable                        Save intermediate download state and resume the download if the backup already exists on local storage; ignored when remote_storage: custom or use_embedded_backup_restore: true
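A typical invocation downloads one table's data from a remote backup, resuming if interrupted (the backup name is illustrative):

```shell
# Hypothetical example: fetch a single table from a remote backup.
clickhouse-backup download --tables 'default.tett' --resumable 2023-11-10T02-00-10
```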
restore
Restore a backup: create the schema and restore data from the backup
NAME:
clickhouse-backup restore - Create schema and restore data from backup
USAGE:
clickhouse-backup restore [-t, --tables=<db>.<table>] [-m, --restore-database-mapping=<originDB>:<targetDB>[,<...>]] [--partitions=<partitions_names>] [-s, --schema] [-d, --data] [--rm, --drop] [-i, --ignore-dependencies] [--rbac] [--configs] <backup_name>
OPTIONS:
--config value, -c value                     Path to the config file
--table value, --tables value, -t value      Restore only databases and objects matching the given table-name patterns, comma separated; ? and * are allowed as wildcards
--restore-database-mapping value, -m value   Rules for mapping source databases to target databases when restoring; databases not listed in the mapping are not processed
--partitions partition_id                    Restore only the selected partition names, comma separated
--schema, -s                                 Restore the schema only
--data, -d                                   Restore the data only
--rm, --drop                                 Drop existing schema objects before restoring
-i, --ignore-dependencies                    Ignore dependencies when dropping existing schema objects
--rbac, --restore-rbac, --do-restore-rbac    Restore RBAC-related objects
--configs, --restore-configs, --do-restore-configs   Restore 'clickhouse-server' configuration files
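For example, a restore into a different target database, dropping any conflicting objects first, might look like this (the target database name default_restored is an arbitrary example):

```shell
# Hypothetical example: restore default.tett into database default_restored,
# dropping existing objects before the restore.
clickhouse-backup restore \
  --tables 'default.tett' \
  --restore-database-mapping default:default_restored \
  --rm \
  2023-11-10T02-00-10
```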
restore_remote
Download and restore a backup
NAME:
clickhouse-backup restore_remote - Download and restore
USAGE:
clickhouse-backup restore_remote [--schema] [--data] [-t, --tables=<db>.<table>] [-m, --restore-database-mapping=<originDB>:<targetDB>[,<...>]] [--partitions=<partitions_names>] [--rm, --drop] [-i, --ignore-dependencies] [--rbac] [--configs] [--skip-rbac] [--skip-configs] [--resumable] <backup_name>
OPTIONS:
--config value, -c value                     Path to the config file
--table value, --tables value, -t value      Download and restore only objects matching the given table-name patterns, comma separated; ? and * are allowed as wildcards
--restore-database-mapping value, -m value   Rules for mapping source databases to target databases when restoring; databases not listed in the mapping are not processed
--partitions partition_id                    Download and restore the backup only for the selected partition names, comma separated
--schema, -s                                 Download and restore the schema only
--data, -d                                   Download and restore the data only
--rm, --drop                                 Drop existing schema objects before restoring
-i, --ignore-dependencies                    Ignore dependencies when dropping existing schema objects
--rbac, --restore-rbac, --do-restore-rbac    Download and restore RBAC-related objects
--configs, --restore-configs, --do-restore-configs   Download and restore 'clickhouse-server' configuration files
--resume, --resumable                        Save intermediate transfer state and resume if the backup already exists; ignored when remote_storage: custom or use_embedded_backup_restore: true
delete
Delete a backup
NAME:
clickhouse-backup delete - Delete specific backup
USAGE:
clickhouse-backup delete <local|remote> <backup_name>
OPTIONS:
--config value, -c value                     Path to the config file
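Per the USAGE line above, the location (local or remote) must be specified, so the local copy of a backup can be removed while keeping the remote one:

```shell
clickhouse-backup delete local 2023-11-10T02-00-10
# clickhouse-backup delete remote 2023-11-10T02-00-10   # remove the remote copy as well
```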
default-config
Print the default config
NAME:
clickhouse-backup default-config - Print default config
USAGE:
clickhouse-backup default-config [command options] [arguments...]
OPTIONS:
--config value, -c value                     Path to the config file
print-config
Print the current config, merged with environment variables
NAME:
clickhouse-backup print-config - Print current config merged with environment variables
USAGE:
clickhouse-backup print-config [command options] [arguments...]
OPTIONS:
--config value, -c value                     Path to the config file
clean
Remove data in the 'shadow' folder from all available 'path' folders
NAME:
clickhouse-backup clean - Remove data in 'shadow' folder from all 'path' folders available from 'system.disks'
USAGE:
clickhouse-backup clean [command options] [arguments...]
OPTIONS:
--config value, -c value                     Path to the config file
clean_remote_broken
Remove all broken remote backups
NAME:
clickhouse-backup clean_remote_broken - Remove all broken remote backups
USAGE:
clickhouse-backup clean_remote_broken [command options] [arguments...]
OPTIONS:
--config value, -c value                     Path to the config file
watch
Watch: run a continuous full + incremental backup loop
NAME:
clickhouse-backup watch - Run infinite loop which create full + incremental backup sequence to allow efficient backup sequences
USAGE:
clickhouse-backup watch [--watch-interval=1h] [--full-interval=24h] [--watch-backup-name-template=shard{shard}-{type}-{time:20060102150405}] [-t, --tables=<db>.<table>] [--partitions=<partitions_names>] [--schema] [--rbac] [--configs] [--skip-check-parts-columns]
DESCRIPTION:
Execute create_remote + delete local, create full backup every `--full-interval`, create and upload incremental backup every `--watch-interval` use previous backup as base with `--diff-from-remote` option, use `backups_to_keep_remote` config option for properly deletion remote backups, will delete old backups which not have references from other backups
OPTIONS:
--config value, -c value                     Path to the config file
--watch-interval value                       Interval at which to run create_remote + delete local for incremental backups
--full-interval value                        Interval at which to stop the incremental sequence and run create_remote + delete local for a full backup
--watch-backup-name-template value           Template for naming new backups
--table value, --tables value, -t value      Create and upload only objects matching the given table-name patterns, comma separated; ? and * are allowed as wildcards
--partitions partition_id                    Partition names, comma separated
--schema, -s                                 Schema only
--rbac, --backup-rbac, --do-backup-rbac      Back up only RBAC-related objects
--configs, --backup-configs, --do-backup-configs   Back up only 'clickhouse-server' configuration files
--skip-check-parts-columns                   Skip the system.parts_columns check, allowing backup of data parts with inconsistent column types
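A watch loop with hourly increments and a daily full backup might be started like this (the interval values mirror the USAGE line above; the table filter is an arbitrary example):

```shell
# Hypothetical example: hourly incremental backups, a full backup every 24h,
# restricted to a single table. Runs until interrupted.
clickhouse-backup watch \
  --watch-interval=1h \
  --full-interval=24h \
  --tables 'default.tett'
```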
A simple example
List the tables that can currently be backed up:
[root@bogon clickhouse-backup]# clickhouse-backup tables
2023/11/09 16:55:15.414994 info clickhouse connection prepared: tcp://localhost:9000 run ping logger=clickhouse
2023/11/09 16:55:15.436732 info clickhouse connection open: tcp://localhost:9000 logger=clickhouse
2023/11/09 16:55:15.436826 info SELECT name, count(*) as is_present FROM system.settings WHERE name IN (?, ?) GROUP BY name with args [show_table_uuid_in_table_create_query_if_not_nil display_secrets_in_show_and_select] logger=clickhouse
2023/11/09 16:55:15.441067 info SELECT name FROM system.databases WHERE engine IN ('MySQL','PostgreSQL','MaterializedPostgreSQL') logger=clickhouse
2023/11/09 16:55:15.443948 info SELECT countIf(name='data_path') is_data_path_present, countIf(name='data_paths') is_data_paths_present, countIf(name='uuid') is_uuid_present, countIf(name='create_table_query') is_create_table_query_present, countIf(name='total_bytes') is_total_bytes_present FROM system.columns WHERE database='system' AND table='tables' logger=clickhouse
2023/11/09 16:55:15.498436 info SELECT database, name, engine , data_paths , uuid , create_table_query , coalesce(total_bytes, 0) AS total_bytes FROM system.tables WHERE is_temporary = 0 ORDER BY total_bytes DESC SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1 logger=clickhouse
2023/11/09 16:55:15.511604 info SELECT metadata_path FROM system.tables WHERE database = 'system' AND metadata_path!='' LIMIT 1; logger=clickhouse
2023/11/09 16:55:15.513617 info SELECT value FROM `system`.`build_options` where name='VERSION_INTEGER' logger=clickhouse
2023/11/09 16:55:15.514306 info SELECT count() is_disk_type_present FROM system.columns WHERE database='system' AND table='disks' AND name='type' logger=clickhouse
2023/11/09 16:55:15.515825 info SELECT path, any(name) AS name, any(type) AS type FROM system.disks GROUP BY path logger=clickhouse
default.sq_ck_data 1.99MiB default full
default.tett 1.64KiB default full
default.users 746B default full
2023/11/09 16:55:15.517788 info clickhouse connection closed logger=clickhouse
Back up a single table:
[root@bogon clickhouse-backup]# clickhouse-backup create --table default.tett
2023/11/10 10:00:10.565466 info clickhouse connection prepared: tcp://localhost:9000 run ping logger=clickhouse
2023/11/10 10:00:10.567876 info clickhouse connection open: tcp://localhost:9000 logger=clickhouse
2023/11/10 10:00:10.567942 info SELECT metadata_path FROM system.tables WHERE database = 'system' AND metadata_path!='' LIMIT 1; logger=clickhouse
2023/11/10 10:00:10.572487 info SELECT name, engine FROM system.databases WHERE NOT match(name,'^(system|INFORMATION_SCHEMA|information_schema|_temporary_and_external_tables)$') AND match(name,'^(default)$') logger=clickhouse
2023/11/10 10:00:10.574905 info SHOW CREATE DATABASE `default` logger=clickhouse
2023/11/10 10:00:10.575946 info SELECT name, count(*) as is_present FROM system.settings WHERE name IN (?, ?) GROUP BY name with args [show_table_uuid_in_table_create_query_if_not_nil display_secrets_in_show_and_select] logger=clickhouse
2023/11/10 10:00:10.579002 info SELECT name FROM system.databases WHERE engine IN ('MySQL','PostgreSQL','MaterializedPostgreSQL') logger=clickhouse
2023/11/10 10:00:10.579982 info SELECT countIf(name='data_path') is_data_path_present, countIf(name='data_paths') is_data_paths_present, countIf(name='uuid') is_uuid_present, countIf(name='create_table_query') is_create_table_query_present, countIf(name='total_bytes') is_total_bytes_present FROM system.columns WHERE database='system' AND table='tables' logger=clickhouse
2023/11/10 10:00:10.584187 info SELECT database, name, engine , data_paths , uuid , create_table_query , coalesce(total_bytes, 0) AS total_bytes FROM system.tables WHERE is_temporary = 0 AND match(concat(database,'.',name),'^default.tett$') ORDER BY total_bytes DESC SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1 logger=clickhouse
2023/11/10 10:00:10.595985 info SELECT metadata_path FROM system.tables WHERE database = 'system' AND metadata_path!='' LIMIT 1; logger=clickhouse
2023/11/10 10:00:10.598130 info SELECT count() as cnt FROM system.columns WHERE database='system' AND table='functions' AND name='create_query' SETTINGS empty_result_for_aggregation_by_empty_set=0 logger=clickhouse
2023/11/10 10:00:10.600149 info SELECT name, create_query FROM system.functions WHERE create_query!='' logger=clickhouse
2023/11/10 10:00:10.602970 info SELECT value FROM `system`.`build_options` where name='VERSION_INTEGER' logger=clickhouse
2023/11/10 10:00:10.603787 info SELECT count() is_disk_type_present FROM system.columns WHERE database='system' AND table='disks' AND name='type' logger=clickhouse
2023/11/10 10:00:10.606071 info SELECT path, any(name) AS name, any(type) AS type FROM system.disks GROUP BY path logger=clickhouse
2023/11/10 10:00:10.608060 info SELECT column, groupUniqArray(type) AS uniq_types FROM system.parts_columns WHERE active AND database=? AND table=? AND type NOT LIKE 'Enum%' AND type NOT LIKE 'Tuple(%' GROUP BY column HAVING length(uniq_types) > 1 with args [default tett] logger=clickhouse
2023/11/10 10:00:10.611245 info ALTER TABLE `default`.`tett` FREEZE WITH NAME '97b3ebeae36e4625a18d03c268c762e4'; logger=clickhouse
2023/11/10 10:00:10.616292 info ALTER TABLE `default`.`tett` UNFREEZE WITH NAME '97b3ebeae36e4625a18d03c268c762e4' logger=clickhouse
2023/11/10 10:00:10.616993 info SELECT mutation_id, command FROM system.mutations WHERE is_done=0 AND database=? AND table=? with args [default tett] logger=clickhouse
2023/11/10 10:00:10.619054 info done backup=2023-11-10T02-00-10 logger=backuper operation=create table=default.tett
2023/11/10 10:00:10.619090 info SELECT value FROM `system`.`build_options` where name='VERSION_DESCRIBE' logger=clickhouse
2023/11/10 10:00:10.620433 info done backup=2023-11-10T02-00-10 duration=55ms logger=backuper operation=create
2023/11/10 10:00:10.620496 info clickhouse connection closed logger=clickhouse
View the list of backups:
[root@bogon clickhouse-backup]# clickhouse-backup list
2023/11/10 10:00:23.397790 info clickhouse connection prepared: tcp://localhost:9000 run ping logger=clickhouse
2023/11/10 10:00:23.399892 info clickhouse connection open: tcp://localhost:9000 logger=clickhouse
2023/11/10 10:00:23.399919 info SELECT value FROM `system`.`build_options` where name='VERSION_INTEGER' logger=clickhouse
2023/11/10 10:00:23.403206 info SELECT count() is_disk_type_present FROM system.columns WHERE database='system' AND table='disks' AND name='type' logger=clickhouse
2023/11/10 10:00:23.405522 info SELECT path, any(name) AS name, any(type) AS type FROM system.disks GROUP BY path logger=clickhouse
2023-11-09T06-32-30 2.00MiB 09/11/2023 06:32:30 local
2023-11-10T02-00-10 3.44KiB 10/11/2023 02:00:10 local
2023/11/10 10:00:23.407513 info clickhouse connection closed logger=clickhouse
The second entry, 2023-11-10T02-00-10, is the backup of the default.tett table that we just created.
Delete the data from the original table:
bogon :) select * from default.tett;
SELECT *
FROM default.tett
Query id: 418c25c1-d383-481d-b0a5-563f03926295
Connecting to database default at localhost:9000 as user default.
Connected to ClickHouse server version 23.8.2 revision 54465.
┌─id─┬─name─┬─age─┬─loc──────┐
│ 2 │ ls │ 19 │ shanghai │
│ 5 │ tq │ 22 │ shanghai │
└────┴──────┴─────┴──────────┘
┌─id─┬─name─┬─age─┬─loc─────┐
│ 1 │ zs │ 18 │ beijing │
│ 4 │ ml │ 21 │ beijing │
└────┴──────┴─────┴─────────┘
┌─id─┬─name─┬─age─┬─loc───────┐
│ 3 │ ww │ 20 │ guangzhou │
│ 6 │ gb │ 23 │ guangzhou │
└────┴──────┴─────┴───────────┘
6 rows in set. Elapsed: 0.023 sec.
bogon :) truncate default.tett;
TRUNCATE TABLE default.tett
Query id: 30d538dd-929d-4ca3-9653-da5124f4a37e
Ok.
0 rows in set. Elapsed: 0.011 sec.
bogon :) select * from default.tett;
SELECT *
FROM default.tett
Query id: 1ad2ebf8-08cb-4cb9-9dfe-92bf33b34daf
Ok.
0 rows in set. Elapsed: 0.020 sec.
Restore the backup:
[root@bogon clickhouse-backup]# clickhouse-backup restore 2023-11-10T02-00-10
2023/11/10 10:02:17.483029 info clickhouse connection prepared: tcp://localhost:9000 run ping logger=clickhouse
2023/11/10 10:02:17.503304 info clickhouse connection open: tcp://localhost:9000 logger=clickhouse
2023/11/10 10:02:17.503343 info SELECT value FROM `system`.`build_options` where name='VERSION_INTEGER' logger=clickhouse
2023/11/10 10:02:17.507371 info SELECT count() is_disk_type_present FROM system.columns WHERE database='system' AND table='disks' AND name='type' logger=clickhouse
2023/11/10 10:02:17.510458 info SELECT path, any(name) AS name, any(type) AS type FROM system.disks GROUP BY path logger=clickhouse
2023/11/10 10:02:17.514702 info CREATE DATABASE IF NOT EXISTS `default` ENGINE = Atomic with args [[]] logger=clickhouse
2023/11/10 10:02:17.516893 info SELECT engine FROM system.databases WHERE name = 'default' logger=clickhouse
2023/11/10 10:02:17.518091 info DROP TABLE IF EXISTS `default`.`tett` NO DELAY logger=clickhouse
2023/11/10 10:02:17.525491 info CREATE DATABASE IF NOT EXISTS `default` logger=clickhouse
2023/11/10 10:02:17.526077 info CREATE TABLE default.tett UUID '5d622b96-7e59-4586-aa60-1b1f992fa85c' (`id` UInt64, `name` String, `age` UInt8, `loc` String) ENGINE = MergeTree PARTITION BY loc ORDER BY id SETTINGS index_granularity = 8192 logger=clickhouse
2023/11/10 10:02:17.546341 info SELECT name, count(*) as is_present FROM system.settings WHERE name IN (?, ?) GROUP BY name with args [show_table_uuid_in_table_create_query_if_not_nil display_secrets_in_show_and_select] logger=clickhouse
2023/11/10 10:02:17.550424 info SELECT name FROM system.databases WHERE engine IN ('MySQL','PostgreSQL','MaterializedPostgreSQL') logger=clickhouse
2023/11/10 10:02:17.551540 info SELECT countIf(name='data_path') is_data_path_present, countIf(name='data_paths') is_data_paths_present, countIf(name='uuid') is_uuid_present, countIf(name='create_table_query') is_create_table_query_present, countIf(name='total_bytes') is_total_bytes_present FROM system.columns WHERE database='system' AND table='tables' logger=clickhouse
2023/11/10 10:02:17.554243 info SELECT database, name, engine , data_paths , uuid , create_table_query , coalesce(total_bytes, 0) AS total_bytes FROM system.tables WHERE is_temporary = 0 ORDER BY total_bytes DESC SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1 logger=clickhouse
2023/11/10 10:02:17.568021 info SELECT metadata_path FROM system.tables WHERE database = 'system' AND metadata_path!='' LIMIT 1; logger=clickhouse
2023/11/10 10:02:17.571347 info SELECT sum(bytes_on_disk) as size FROM system.parts WHERE active AND database='default' AND table='tett' GROUP BY database, table logger=clickhouse
2023/11/10 10:02:17.575051 info ALTER TABLE `default`.`tett` ATTACH PART 'c0e10d07728763da6c6f9ee4adb2fe46_3_6_1' logger=clickhouse
2023/11/10 10:02:17.578985 info ALTER TABLE `default`.`tett` ATTACH PART 'd3cb49ee6c8532140eabb67773fa583d_2_5_1' logger=clickhouse
2023/11/10 10:02:17.580333 info ALTER TABLE `default`.`tett` ATTACH PART 'e8ccea18d4053730ef65315d1781f8c6_1_4_1' logger=clickhouse
2023/11/10 10:02:17.581493 info done backup=2023-11-10T02-00-10 operation=restore table=default.tett
2023/11/10 10:02:17.581509 info done backup=2023-11-10T02-00-10 duration=35ms operation=restore
2023/11/10 10:02:17.581513 info done backup=2023-11-10T02-00-10 operation=restore
2023/11/10 10:02:17.581552 info clickhouse connection closed logger=clickhouse
Check the data in the table:
bogon :) select * from default.tett;
SELECT *
FROM default.tett
Query id: f8a5e78e-723e-4cd0-8a9e-38316fd6882e
┌─id─┬─name─┬─age─┬─loc───────┐
│ 3 │ ww │ 20 │ guangzhou │
│ 6 │ gb │ 23 │ guangzhou │
└────┴──────┴─────┴───────────┘
┌─id─┬─name─┬─age─┬─loc─────┐
│ 1 │ zs │ 18 │ beijing │
│ 4 │ ml │ 21 │ beijing │
└────┴──────┴─────┴─────────┘
┌─id─┬─name─┬─age─┬─loc──────┐
│ 2 │ ls │ 19 │ shanghai │
│ 5 │ tq │ 22 │ shanghai │
└────┴──────┴─────┴──────────┘
6 rows in set. Elapsed: 0.006 sec.
The restore succeeded.