ClickHouse Data Storage and Backup


Data Storage

ClickHouse offers many table engines with different characteristics for different needs. Among them, the MergeTree engine is the core of ClickHouse: thanks to its strong performance and rich feature set it is widely used and forms the basis for the other engines in the family.

Creating a table

CREATE TABLE test.users
(
    name String,
    sex String,
    age Int8,
    birthday Date
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(birthday)
ORDER BY age
SETTINGS min_rows_for_wide_part = 0, min_bytes_for_wide_part = 0, index_granularity = 2

Note:
    The CREATE TABLE statement includes a SETTINGS clause. Newer ClickHouse versions support the Compact part format, which improves performance when inserts are small and frequent; in that format the storage layer no longer writes one .bin file per column but a single shared data.bin. To keep one file per column for this walkthrough, the Compact format is disabled here (min_rows_for_wide_part = 0, min_bytes_for_wide_part = 0).
    index_granularity: the index granularity. The default is 8192, i.e. MergeTree generates one index entry every 8192 rows; this parameter normally does not need to be changed. Here index_granularity = 2, so an index entry is generated every 2 rows.
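Once some rows have been inserted, you can confirm which format ClickHouse actually chose by querying system.parts. This is a quick sketch against the example table; part_type is Wide when every column has its own .bin file and Compact when all columns share a single data.bin:

# Check the storage format (Wide vs Compact) of each data part of test.users.
clickhouse-client --query "
    SELECT name, part_type, rows, active
    FROM system.parts
    WHERE database = 'test' AND table = 'users'
"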

Insert one row:

insert into test.users values ('zhangsan', 'female', 20, '2023-03-07');

Go to /var/lib/clickhouse/data/test/users and look at the files that were generated:

A partition directory named 202303_1_1_0 has been created.

Partition directory naming convention

202303_1_1_0
PartitionId_MinBlockNum_MaxBlockNum_Level

PartitionId: the partition ID. Its value is determined by the partition key specified when the table was created. Since the partition key is toYYYYMM(birthday) and the inserted birthday is 2023-03-07, the value is 202303.

MinBlockNum, MaxBlockNum: the minimum and maximum data block numbers. The counter is incremented per table starting from 1: every newly created part directory takes the next value, and for a new directory MinBlockNum equals MaxBlockNum; they only differ once parts have been merged. Since this is the table's first insert, both MinBlockNum and MaxBlockNum are 1.

Level: the number of times the partition has been merged. Unlike MinBlockNum and MaxBlockNum, Level is not counted per table but per partition within the table. A newly created part directory starts at 0; the value only becomes greater than 0 after the part takes part in a merge.
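The same metadata can be read back from system.parts, which is a convenient way to cross-check the directory names against PartitionId, the block-number range and the merge level (a sketch against the example table):

# Compare the part directory names with the metadata ClickHouse records about them.
clickhouse-client --query "
    SELECT name, partition_id, min_block_number, max_block_number, level, active
    FROM system.parts
    WHERE database = 'test' AND table = 'users'
    ORDER BY name
"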

Insert two more rows to experiment:

insert into test.users values ('lisi', 'qqq', 21, '2023-03-08');
insert into test.users values ('wangwu', 'www', 22, '2023-03-09');

Each insert creates a new part directory, and MinBlockNum and MaxBlockNum keep incrementing.

Because MergeTree merges parts at unscheduled times, use the OPTIMIZE command to force a merge so the result can be observed sooner.

optimize table test.users final;

After the command runs, a new part directory 202303_1_3_1 appears: the merged part. The first 1 and the 3 are MinBlockNum and MaxBlockNum, because among the three merged parts MinBlockNum = 1 and MaxBlockNum = 3. The last number is the Level, i.e. how many times the partition has been merged; since only one merge has taken place, Level = 1.

After a while the other part directories are removed and only the merged part is kept.

File structure

Enter the 202303_1_3_1 directory and use the tree command to view its contents:

[root@bogon 202303_1_3_1]# tree
.
├── age.bin
├── age.cmrk2
├── birthday.bin
├── birthday.cmrk2
├── checksums.txt
├── columns.txt
├── count.txt
├── default_compression_codec.txt
├── metadata_version.txt
├── minmax_birthday.idx
├── name.bin
├── name.cmrk2
├── partition.dat
├── primary.cidx
├── serialization.json
├── sex.bin
└── sex.cmrk2

Notes:

checksums.txt: the checksum file; it stores, in binary form, the sizes and hashes of the other files and is used to quickly verify the integrity of the stored data.

columns.txt: the column names and their data types; in this example it lists the table's columns.

count.txt: the total number of rows; for the example above its content is 3.

primary.idx: the primary-key index file, which holds the sparse index entries together with the positions of the corresponding data blocks. Matching the query conditions against the sparse index quickly filters out irrelevant data and reduces the amount of data that has to be loaded.

{column}.bin: the file that stores the actual column data, named <column>.bin and compressed with LZ4 by default. One file per column is the Wide part format; the alternative is the Compact part format, in which all column data is stored in a single data.bin (on newer versions the SETTINGS min_rows_for_wide_part = 0, min_bytes_for_wide_part = 0 parameters are needed to force skipping the Compact format).

{column}.cmrk2: the column's mark file, which records the offsets of data blocks inside the .bin file. The mark file is aligned with the column data file, recording the relative position of each compressed block in the .bin file, and it is also aligned with the index file, recording where the data referenced by each sparse index entry lives inside the column file. ClickHouse first uses the index file to locate the mark, and then uses the mark to read the data directly from the .bin file.

partition.dat: stores a single value, the partition's ID.

minmax_{column}.idx: the MinMax index, which records the minimum and maximum values of the partition column within this part.

Binary file formats in detail

primary.idx

MergeTree generates primary.idx from the sorting key. From the CREATE TABLE statement for users, the sorting key is age and index_granularity = 2, so primary.idx should contain the three entries 20, 25 and 30. Inspect primary.idx:
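A sketch of inspecting the file on disk, assuming the merged part directory from earlier (on recent server versions the index may be stored compressed as primary.cidx instead of primary.idx):

cd /var/lib/clickhouse/data/test/users/202303_1_3_1
# Dump the primary index in hexadecimal; each byte corresponds to one age value here.
hexdump -C primary.idx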

hexdump can be used to view the hexadecimal encoding of a binary file. Since age is defined as Int8 and occupies one byte, each hexadecimal value corresponds to one age value. Converting the hexadecimal output:

  0x14 (hex) --> 20 (decimal)

  0x19 (hex) --> 25 (decimal)

  0x1e (hex) --> 30 (decimal)

The decimal values match the result of our reasoning, which further confirms how the sparse index in primary.idx works.

{column}.mrk2

A {column}.bin file consists of one or more compressed data blocks. The format of the .mrk2 mark file is fixed: every index entry in primary.idx has a corresponding mark in this file, and each mark has three fields:

  Offset in compressed file, 8 bytes: the offset of the compressed block this mark points to inside the .bin file.

  Offset in decompressed block, 8 bytes: the offset of the data this mark points to inside the decompressed block.

  Rows count, 8 bytes: the number of rows, normally equal to index_granularity.

Each mark in the .mrk2 file therefore occupies 24 bytes, so finding the mark for an index entry in primary.idx is trivial: for index n (counting from 0), the mark starts at offset n*24 in the .mrk2 file, and reading 24 bytes from that offset yields the mark.
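As a sketch of that arithmetic, the n-th mark can be read directly with od by skipping n*24 bytes; here n = 2 reads the third mark of the age.mrk2 file from the example part:

n=2
# Skip n*24 bytes, then print three 8-byte integers:
# offset in compressed file, offset in decompressed block, rows count.
od -An -l -j $((n * 24)) -N 24 age.mrk2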

age.mrk2

$ hexdump -C age.mrk2
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000010  02 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020  02 00 00 00 00 00 00 00  02 00 00 00 00 00 00 00  |................|
00000030  00 00 00 00 00 00 00 00  04 00 00 00 00 00 00 00  |................|
00000040  01 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000050  05 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000060

$ od -An -l age.mrk2
                    0                    0
                    2                    0
                    2                    2
                    0                    4
                    1                    0
                    5                    0

Let's analyse the file contents. Since 24 bytes form one mark, splitting the output into 24-byte rows gives the following table:

Mark  Offset in compressed file  Offset in decompressed block  Rows count
0     0                          0                             2
1     0                          2                             2
2     0                          4                             1

The table shows that the offsets in the decompressed block map onto the age values of the original table, matching the three index entries we derived from primary.idx:

  index_granularity = 2 and age is an Int8 occupying 1 byte, so every two rows correspond to one index entry, and likewise every two rows produce one mark in the .mrk2 file.

  Because the data stored here is smaller than the default compressed block size of 64 KB, all of it fits in a single compressed block, so every offset in the compressed file is 0.

  The last mark covers only one row, so its row count is 1.

{column}.bin

A {column}.bin file consists of a number of compressed data blocks and stores the data of one column, using the LZ4 compression format by default.

A compressed data block consists of a header and the compressed data. The header is a fixed 9 bytes: one UInt8 (1 byte) and two UInt32 (4 bytes each), encoding the compression algorithm used, the size of the data after compression, and the size of the data before compression.

For example, for the header values 0x82, 12000 and 65536:

  0x82: compression method

  12000: size after compression

  65536: size before compression

A {column}.bin file therefore stores its data as a sequence of such header-plus-compressed-data blocks.

minmax_birthday.idx

The minmax file stores the minimum and maximum values of the partition column within the part. The partition column birthday has type Date, which is implemented as a UInt16 holding the number of days since 1970-01-01. From the INSERT statements above, the minimum birthday is 2023-03-06 and the maximum is 2023-03-09; converted to day counts these are 19422 and 19425, i.e. 0x4bde and 0x4be1 in hexadecimal.

The purpose of the minmax file is to let ClickHouse intelligently skip blocks and parts that cannot possibly contain matching data, reducing the amount of data that has to be scanned. For example: suppose a table has a timestamp column and minmax_timestamp.idx records a minimum of "2023-01-01" and a maximum of "2023-12-31", and we query the range "2023-01-01" to "2023-06-30". The range is fully covered by the recorded minimum and maximum, but because the data may span several parts, and the data in a .bin file is made up of many blocks, the minmax files allow ClickHouse to skip the parts and blocks that fall outside the range, speeding up the query.
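To watch this pruning happen, recent ClickHouse versions can report which parts and granules survive the MinMax and primary-key checks. A sketch against the example table (the exact output depends on the server version):

# EXPLAIN indexes = 1 shows how the MinMax (partition) index and the primary key
# narrow down the parts and granules before any column data is read.
clickhouse-client --query "
    EXPLAIN indexes = 1
    SELECT count()
    FROM test.users
    WHERE birthday BETWEEN '2023-03-08' AND '2023-03-09'
"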

Indexes and data storage

ClickHouse accelerates locating and querying data by sorting it and building a sparse index over it. A sparse index only records starting positions, so one index entry marks a large range of rows, and the advantage grows with data volume. For example, with MergeTree's index granularity (index_granularity) of 8192, marking one hundred million rows takes only 12208 index entries. Fewer entries take up less space, so for ClickHouse the contents of primary.idx can stay resident in memory.

An example from the official documentation:

Here (CounterID, Date) is used as the index. For the query CounterID in ('a', 'h'), the server reads marks [0, 3) and [6, 8); for CounterID in ('a', 'h') and Date = 3 it reads marks [1, 3) and [7, 8); for Date = 3 it reads marks [1, 10].

A sparse index cannot locate data exactly; it only narrows things down to an approximate range. For a system designed to process huge volumes of data, that overhead is negligible.

The analysis above shows that index performance depends on how the indexed column's values are distributed, so table design should take the actual business data into account and avoid indexing on columns with very low selectivity. The index granularity (index_granularity) is an important property as well; the ClickHouse default of 8192 gives good performance in most situations.
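The effect of index_granularity can also be read from system.parts, which records how many marks (granules) each part carries; rows divided by marks is roughly the effective granularity (a sketch against the example table):

# rows / marks approximates the index granularity of each active part.
clickhouse-client --query "
    SELECT name, rows, marks, round(rows / marks) AS rows_per_mark
    FROM system.parts
    WHERE database = 'test' AND table = 'users' AND active
"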

Indexing process

From the analysis of primary.idx, .mrk2 and .bin above: every index entry in primary.idx has a corresponding mark in the .mrk2 file, and from that mark we can find the compressed block in the .bin file and the offset of the data inside the decompressed block.

When data is split into batches of index_granularity (8192) rows, each batch is written into compressed blocks according to the following rules (the related settings appear in the sketch after the list):

  (1) If a single batch is smaller than 64 KB, keep accumulating with the next batch until the total exceeds 64 KB, then emit the next data block.

  (2) If a single batch is between 64 KB and 1 MB, emit a data block directly.

  (3) If a single batch is larger than 1 MB, split it into 1 MB chunks; the remainder continues to follow the rules above.
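The 64 KB and 1 MB thresholds above correspond to the min_compress_block_size and max_compress_block_size settings; a quick sketch to check their current values:

# Defaults are 65536 (64 KB) and 1048576 (1 MB).
clickhouse-client --query "
    SELECT name, value
    FROM system.settings
    WHERE name IN ('min_compress_block_size', 'max_compress_block_size')
"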

Query process

When MergeTree reads data, it has to locate the required data through the position information in the marks. The lookup roughly consists of two steps: reading the compressed block and reading the data:

Taking the JavaEnable column as an example:

  JavaEnable is one byte per row, so a granule of 8192 rows occupies 8192 B, and 8192 B * 8 = 64 KB, i.e. eight granules of 8192 rows fill one data block. The second block's offset in the compressed file starts at 12016 bytes because the block headers are added on top of the compressed data (assuming the first block compresses to 12000 bytes, with an 8-byte header per block), and so on. The uncompressed size of 65536 indicates that every block here holds exactly 64 KB of data.

With the .mrk2 marks, MergeTree does not need to load all the data into memory at once when reading. Locating the compressed block and reading the data works as follows:

  Reading the compressed block: when querying a column, MergeTree does not need to load the whole .bin file; using the compressed-file offset from the mark it loads only the required compressed block.

  Reading the data: after decompression, MergeTree does not need to scan the whole decompressed segment; using the in-block offset stored in the mark it loads only the specific small range, at index_granularity granularity.

Correspondence between data marks and compressed blocks

One to one

In this case every compressed block holds 64 KB of data. In the block header shown here, the first value, 65536, is the size after decompression and the second, 43311, is the size after compression.

Many to one

One to many

That is, one row of the .mrk2 file refers to data that spans two compressed blocks.

Overview of index usage and data loading

The figure below loosely illustrates the relationship between primary.idx, .mrk2 and .bin; the division of the *.bin file into compressed blocks Block0 and Block1 is only conceptual and does not reflect the real layout:

In a ClickHouse query, the ideal case is that the query conditions, combined with the predefined partitioning information and the indexes, reduce the set of marks that must be loaded to a minimum, keeping the scanned range as small as possible. SQL analysis determines which columns are required, and based on the mark ranges only the necessary data is loaded into memory for further processing. For example, given the following SQL:

select
    StringID,
    avg(Score)
from
    id_test
where
    StringID between '607589cdcbc95900017ddf03' and '607589cdcbc95900017ddf06'
group by StringID;

A simplified view of how ClickHouse loads the data:

  (1) Using the StringID condition and the index, determine the mark ranges:

mark0 [607589cdcbc95900017ddf01, 607589cdcbc95900017ddf04)
mark1 [607589cdcbc95900017ddf04, 607589cdcbc95900017ddf07)

  (2) From the SQL we know the StringID and Score columns must be loaded, so based on the mark range from step (1) we need to load mark0 and mark1 from StringID.bin and Score.bin. By reading StringID.mrk2 and Score.mrk2 we obtain the offsets of the compressed blocks for mark0 and mark1 inside the *.bin files. The corresponding compressed blocks are read from StringID.bin and Score.bin, decompressed, and filtered precisely according to the SQL conditions.

  (3) The data is processed further, essentially entirely in memory.

In short, when a query runs, ClickHouse first uses the index file (primary.idx) to find the positions of the data blocks that can satisfy the query conditions (avoiding reading large numbers of irrelevant blocks), then uses the mark files ({column}.mrk2) to locate where those blocks start, and finally reads the actual column data from the binary data files ({column}.bin).

Data Backup

Cold backup

Copying the physical files also achieves a backup, but note that data writes must be avoided while the physical files are being copied.

Physical file backup

A physical file backup consists of two parts:

  (1) Table data files: under {datadir}/data you can back up a database, a table or an individual partition directory. (2) Metadata: under {datadir}/metadata you can back up the CREATE statements of a database or a table.

Backing up the database files

Copy the complete ClickHouse data directory; its path is the value of the path setting in config.xml.

Restoring is just as simple: replace the data directory with the backed-up copy.
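A minimal cold-backup sketch, assuming the default data path and that the server can be stopped for the duration of the copy; the backup destination is illustrative:

# Stop the server so no writes happen while the files are copied.
systemctl stop clickhouse-server

# Copy the whole data directory (the value of <path> in config.xml).
rsync -a /var/lib/clickhouse/ /backup/clickhouse-$(date +%F)/

systemctl start clickhouse-server

# To restore: stop the server, replace /var/lib/clickhouse/ with the backed-up copy,
# restore ownership with chown -R clickhouse:clickhouse /var/lib/clickhouse, then start the server.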

Hot backup

Export the specified data to files with queries. This kind of backup is flexible, and the data can be re-imported (restored) with the client tool.

-- Export a txt file with a header row (column names)
clickhouse-client -h 127.0.0.1 --database="default" -u default --password "" --format_csv_delimiter="|" --query="select * from default.users FORMAT CSVWithNames" > local.txt

-- Export a txt file without a header row (column names)
clickhouse-client -h 127.0.0.1 --database="default" -u default --password "" --format_csv_delimiter="|" --query="select * from default.users FORMAT CSV" > local.txt

-- Export a csv file with a header row (column names)
clickhouse-client -h 127.0.0.1 --database="default" -u default --password "" --format_csv_delimiter="|" --query="select * from default.users FORMAT CSVWithNames" > local.csv

-- Export a csv file without a header row (column names)
clickhouse-client -h 127.0.0.1 --database="default" -u default --password "" --format_csv_delimiter="|" --query="select * from default.users FORMAT CSV" > local.csv

-- Import a txt file with a header row (column names)
clickhouse-client -h 127.0.0.1 --database="test01" -u default --password "123456" --format_csv_delimiter='|' --query="insert into local FORMAT CSVWithNames" < local.txt

-- Import a txt file without a header row (column names)
clickhouse-client --password 123456 -d test01 -q " insert into local FORMAT CSV" --format_csv_delimiter='|' < local.txt

-- Import a csv file with a header row (column names)
clickhouse-client --password 123456 -d default -q " insert into local FORMAT CSVWithNames" --format_csv_delimiter='|' < /data/local.csv

-- Import a csv file without a header row (column names)
clickhouse-client --password 123456 -d default -q " insert into local FORMAT CSV" --format_csv_delimiter='|' < /data/local.csv

Partition backup

For backups based on data partitions, ClickHouse currently provides two mechanisms: FREEZE and FETCH.

FREEZE backup

Prepare some data:

CREATE TABLE default.tett
(
    id UInt64,
    name String,
    age UInt8,
    loc String
)
ENGINE = MergeTree
PARTITION BY loc
ORDER BY id

insert into default.tett values(1, 'zs', 18, 'beijing');
insert into default.tett values(2, 'ls', 19, 'shanghai');
insert into default.tett values(3, 'ww', 20, 'guangzhou');
insert into default.tett values(4, 'ml', 21, 'beijing');
insert into default.tett values(5, 'tq', 22, 'shanghai');
insert into default.tett values(6, 'gb', 23, 'guangzhou');

Syntax:

ALTER TABLE table_name FREEZE PARTITION partition_expr

After a partition is frozen, the backup is stored under /var/lib/clickhouse/shadow/N, where N is an auto-incrementing integer representing the number of backups made (the number of FREEZE executions); the counter is recorded in the increment.txt file in the shadow subdirectory.
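For example, the counter and the backups created so far can be checked directly on disk (a sketch; the directory only exists after the first FREEZE has run):

# N in /var/lib/clickhouse/shadow/N comes from this counter file.
cat /var/lib/clickhouse/shadow/increment.txt
ls /var/lib/clickhouse/shadow/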

For example:

ALTER TABLE default.tett FREEZE PARTITION 'beijing';

After it completes, the /var/lib/clickhouse/shadow directory contains:

Inside the corresponding numeric directory the structure is /store/[database]/[table]/[partition_folder], as follows:

Restoring the backed-up data relies on loading the partition back with ATTACH: copy the partition files from the shadow subdirectory into the detached directory of the target table, then run the ATTACH statement.

-- Drop the beijing partition of table tett
alter table tett drop partition 'beijing';

-- Copy the backed-up partition data into the detached directory of table tett
cp -r /var/lib/clickhouse/shadow/1/store/xxx/xxxxx/* /var/lib/clickhouse/data/default/tett/detached/

-- Check the data in the detached directory
ls /var/lib/clickhouse/data/default/tett/detached

-- Change the owner of the partition data to the clickhouse user
chown -R clickhouse:clickhouse xxxxxx/

-- Attach the partition back to table tett
alter table tett attach partition 'beijing';

-- Query table tett; the data has been restored
select * from tett;

FETCH backup

FETCH is only supported by the ReplicatedMergeTree family of table engines.

ALTER TABLE table_name FETCH PARTITION partition_id FROM zk_path

It works similarly to how ReplicatedMergeTree synchronises data: FETCH uses the given zk_path to find all replicas of the ReplicatedMergeTree table, picks the most suitable one, and downloads the corresponding partition data from it.

Prepare some data:

Create table default.test_f(
    id UInt32,
    name String,
    age UInt32,
    gender String,
    loc String
) engine = ReplicatedMergeTree('/ClickHouse/tables/01/test_f','node1')
partition by loc
order by id;

insert into default.test_f values (1,'zs',18,'m','beijing'),(2,'ls',19,'f','shanghai'),(3,'ww',20,'m','beijing'),(4,'ml',21,'m','shanghai')

Back up the beijing partition of table test_f:

ALTER TABLE test_f FETCH PARTITION 'beijing' FROM '/ClickHouse/tables/01/test_f'

The partition data is downloaded to the local /var/lib/clickhouse/data/{database}/{table}/detached directory on the current node.

As with FREEZE, restoring the partition data relies on loading the partition with ATTACH:

-- Drop the beijing partition of table test_f
alter table test_f drop partition 'beijing';

-- The data placed in detached by FETCH is already owned by the clickhouse user, so there is
-- no need to change ownership; just attach the partition to restore the backup
alter table test_f attach partition 'beijing';

Although FREEZE and FETCH can both back up partition files, neither backs up the table's metadata (the CREATE statement). For a truly complete backup, the metadata must be backed up too: it lives under /var/lib/clickhouse/metadata/{database}, and it is enough to save the corresponding table's .sql file separately.
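A minimal sketch of saving the table definition alongside the frozen partition data; paths follow the default.tett example above:

# Keep a copy of the CREATE TABLE statement together with the data backup.
mkdir -p /backup/metadata/default
cp /var/lib/clickhouse/metadata/default/tett.sql /backup/metadata/default/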

Backing up with the clickhouse-backup tool

Features:

  (1) Easily create and restore backups of all or specific tables

  (2) Store multiple backups efficiently on the file system

  (3) Upload and download with streaming compression

  (4) Support for incremental backups on remote storage

Download: github.com/Altinity/cl…

Installation:

tar -xzvf clickhouse-backup-linux-amd64.tar.gz

mv build/linux/amd64/clickhouse-backup /usr/bin/
mkdir /etc/clickhouse-backup
chown -R clickhouse:clickhouse /etc/clickhouse-backup

Default configuration

[root@bogon opt]# clickhouse-backup default-config
general:
    remote_storage: none           # REMOTE_STORAGE, if `none` then `upload` and  `download` command will fail
    max_file_size: 1073741824      # MAX_FILE_SIZE, 1G by default, useless when upload_by_part is true, use for split data parts files by archives
    disable_progress_bar: true     # DISABLE_PROGRESS_BAR, show progress bar during upload and download, makes sense only when `upload_concurrency` and `download_concurrency` is 1
    backups_to_keep_local: 0       # BACKUPS_TO_KEEP_LOCAL, how many latest local backup should be kept, 0 means all created backups will be stored on local disk
                                 # -1 means backup will keep after `create` but will delete after `create_remote` command
                                 # You shall run `clickhouse-backup delete local <backup_name>` command to remove temporary backup files from the local disk
    backups_to_keep_remote: 0      # BACKUPS_TO_KEEP_REMOTE, how many latest backup should be kept on remote storage, 0 means all uploaded backups will be stored on remote storage.
                                 # If old backups are required for newer incremental backup then it won't be deleted. Be careful with long incremental backup sequences.
    log_level: info                # LOG_LEVEL, a choice from `debug`, `info`, `warn`, `error`
    allow_empty_backups: false     # ALLOW_EMPTY_BACKUPS
    
    # concurrency means parallel tables and parallel parts inside tables
    # for example 4 means max 4 parallel tables and 4 parallel parts inside one table, so equals 16 concurrent streams
    download_concurrency: 1        # DOWNLOAD_CONCURRENCY, max 255, by default, the value is round(sqrt(AVAILABLE_CPU_CORES / 2))
    upload_concurrency: 1          # UPLOAD_CONCURRENCY, max 255, by default, the value is round(sqrt(AVAILABLE_CPU_CORES / 2))
    
    # RESTORE_SCHEMA_ON_CLUSTER, execute all schema related SQL queries with `ON CLUSTER` clause as Distributed DDL.
    # Check `system.clusters` table for the correct cluster name, also `system.macros` can be used.
    # This isn't applicable when `use_embedded_backup_restore: true`
    restore_schema_on_cluster: ""
    upload_by_part: true           # UPLOAD_BY_PART
    download_by_part: true         # DOWNLOAD_BY_PART
    use_resumable_state: true      # USE_RESUMABLE_STATE, allow resume upload and download according to the <backup_name>.resumable file
    
    # RESTORE_DATABASE_MAPPING, restore rules from backup databases to target databases, which is useful when changing destination database, all atomic tables will be created with new UUIDs.
    # The format for this env variable is "src_db1:target_db1,src_db2:target_db2". For YAML please continue using map syntax
    restore_database_mapping: {}
    retries_on_failure: 3          # RETRIES_ON_FAILURE, how many times to retry after a failure during upload or download
    retries_pause: 30s             # RETRIES_PAUSE, duration time to pause after each download or upload failure
    
    watch_interval: 1h       # WATCH_INTERVAL, use only for `watch` command, backup will create every 1h
    full_interval: 24h       # FULL_INTERVAL, use only for `watch` command, full backup will create every 24h
    watch_backup_name_template: "shard{shard}-{type}-{time:20060102150405}" # WATCH_BACKUP_NAME_TEMPLATE, used only for `watch` command, macros values will apply from `system.macros` for time:XXX, look format in https://go.dev/src/time/format.go
    
    sharded_operation_mode: none       # SHARDED_OPERATION_MODE, how different replicas will shard backing up data for tables. Options are: none (no sharding), table (table granularity), database (database granularity), first-replica (on the lexicographically sorted first active replica). If left empty, then the "none" option will be set as default.
    
    cpu_nice_priority: 15    # CPU niceness priority, to allow throttling CPU intensive operation, more details https://manpages.ubuntu.com/manpages/xenial/man1/nice.1.html
    io_nice_priority: "idle" # IO niceness priority, to allow throttling disk intensive operation, more details https://manpages.ubuntu.com/manpages/xenial/man1/ionice.1.html
clickhouse:
    username: default                # CLICKHOUSE_USERNAME
    password: ""                     # CLICKHOUSE_PASSWORD
    host: localhost                  # CLICKHOUSE_HOST, To make backup data `clickhouse-backup` requires access to the same file system as clickhouse-server, so `host` should localhost or address of another docker container on the same machine, or IP address bound to some network interface on the same host.
    port: 9000                       # CLICKHOUSE_PORT, don't use 8123, clickhouse-backup doesn't support HTTP protocol
    # CLICKHOUSE_DISK_MAPPING, use this mapping when your `system.disks` are different between the source and destination clusters during backup and restore process
    # The format for this env variable is "disk_name1:disk_path1,disk_name2:disk_path2". For YAML please continue using map syntax
    disk_mapping: {}
    # CLICKHOUSE_SKIP_TABLES, the list of tables (pattern are allowed) which are ignored during backup and restore process
    # The format for this env variable is "pattern1,pattern2,pattern3". For YAML please continue using list syntax
    skip_tables:
    - system.*
    - INFORMATION_SCHEMA.*
    - information_schema.*
    # CLICKHOUSE_SKIP_TABLE_ENGINES, the list of tables engines which are ignored during backup, upload, download, restore process
    # The format for this env variable is "Engine1,Engine2,engine3". For YAML please continue using list syntax
    skip_table_engines: []
    timeout: 5m                  # CLICKHOUSE_TIMEOUT
    freeze_by_part: false        # CLICKHOUSE_FREEZE_BY_PART, allow freezing by part instead of freezing the whole table
    freeze_by_part_where: ""     # CLICKHOUSE_FREEZE_BY_PART_WHERE, allow parts filtering during freezing when freeze_by_part: true
    secure: false                # CLICKHOUSE_SECURE, use TLS encryption for connection
    skip_verify: false           # CLICKHOUSE_SKIP_VERIFY, skip certificate verification and allow potential certificate warnings
    sync_replicated_tables: true # CLICKHOUSE_SYNC_REPLICATED_TABLES
    tls_key: ""                  # CLICKHOUSE_TLS_KEY, filename with TLS key file
    tls_cert: ""                 # CLICKHOUSE_TLS_CERT, filename with TLS certificate file
    tls_ca: ""                   # CLICKHOUSE_TLS_CA, filename with TLS custom authority file
    log_sql_queries: true        # CLICKHOUSE_LOG_SQL_QUERIES, enable logging `clickhouse-backup` SQL queries on `system.query_log` table inside clickhouse-server
    debug: false                 # CLICKHOUSE_DEBUG
    config_dir:      "/etc/clickhouse-server"              # CLICKHOUSE_CONFIG_DIR
    # CLICKHOUSE_RESTART_COMMAND, use this command when restoring with --rbac, --rbac-only or --configs, --configs-only options
    # will split command by ; and execute one by one, all errors will logged and ignore
    # available prefixes
    # - sql: will execute SQL query
    # - exec: will execute command via shell
    restart_command: "sql:SYSTEM SHUTDOWN"
    ignore_not_exists_error_during_freeze: true # CLICKHOUSE_IGNORE_NOT_EXISTS_ERROR_DURING_FREEZE, helps to avoid backup failures when running frequent CREATE / DROP tables and databases during backup, `clickhouse-backup` will ignore `code: 60` and `code: 81` errors during execution of `ALTER TABLE ... FREEZE`
    check_replicas_before_attach: true # CLICKHOUSE_CHECK_REPLICAS_BEFORE_ATTACH, helps avoiding concurrent ATTACH PART execution when restoring ReplicatedMergeTree tables
    use_embedded_backup_restore: false # CLICKHOUSE_USE_EMBEDDED_BACKUP_RESTORE, use BACKUP / RESTORE SQL statements instead of regular SQL queries to use features of modern ClickHouse server versions
    backup_mutations: true # CLICKHOUSE_BACKUP_MUTATIONS, allow backup mutations from system.mutations WHERE is_done AND apply it during restore
    restore_as_attach: false # CLICKHOUSE_RESTORE_AS_ATTACH, allow restore tables which have inconsistent data parts structure and mutations in progress
    check_parts_columns: true # CLICKHOUSE_CHECK_PARTS_COLUMNS, check data types from system.parts_columns during create backup to guarantee mutation is complete
azblob:
    endpoint_suffix: "core.windows.net" # AZBLOB_ENDPOINT_SUFFIX
    account_name: ""             # AZBLOB_ACCOUNT_NAME
    account_key: ""              # AZBLOB_ACCOUNT_KEY
    sas: ""                      # AZBLOB_SAS
    use_managed_identity: false  # AZBLOB_USE_MANAGED_IDENTITY
    container: ""                # AZBLOB_CONTAINER
    path: ""                     # AZBLOB_PATH, `system.macros` values could be applied as {macro_name}
    object_disk_path: ""         # AZBLOB_OBJECT_DISK_PATH, path for backup of part from `azure_blob_storage` object disk, if disk present, then shall not be zero and shall not be prefixed by `path`
    compression_level: 1         # AZBLOB_COMPRESSION_LEVEL
    compression_format: tar      # AZBLOB_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brortli, zstd, `none` for upload data part folders as is
    sse_key: ""                  # AZBLOB_SSE_KEY
    buffer_size: 0               # AZBLOB_BUFFER_SIZE, if less or eq 0 then it is calculated as max_file_size / max_parts_count, between 2Mb and 4Mb
    max_parts_count: 10000       # AZBLOB_MAX_PARTS_COUNT, number of parts for AZBLOB uploads, for properly calculate buffer size
    max_buffers: 3               # AZBLOB_MAX_BUFFERS
s3:
    access_key: ""                   # S3_ACCESS_KEY
    secret_key: ""                   # S3_SECRET_KEY
    bucket: ""                       # S3_BUCKET
    endpoint: ""                     # S3_ENDPOINT
    region: us-east-1                # S3_REGION
    acl: private                     # S3_ACL
    assume_role_arn: ""              # S3_ASSUME_ROLE_ARN
    force_path_style: false          # S3_FORCE_PATH_STYLE
    path: ""                         # S3_PATH, `system.macros` values could be applied as {macro_name}
    object_disk_path: ""             # S3_OBJECT_DISK_PATH, path for backup of part from `s3` object disk, if disk present, then shall not be zero and shall not be prefixed by `path`
    disable_ssl: false               # S3_DISABLE_SSL
    compression_level: 1             # S3_COMPRESSION_LEVEL
    compression_format: tar          # S3_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brortli, zstd, `none` for upload data part folders as is
    # look details in https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingKMSEncryption.html
    sse: ""                          # S3_SSE, empty (default), AES256, or aws:kms
    sse_kms_key_id: ""               # S3_SSE_KMS_KEY_ID, if S3_SSE is aws:kms then specifies the ID of the Amazon Web Services Key Management Service
    sse_customer_algorithm: ""       # S3_SSE_CUSTOMER_ALGORITHM, encryption algorithm, for example, AES256
    sse_customer_key: ""             # S3_SSE_CUSTOMER_KEY, customer-provided encryption key
    sse_customer_key_md5: ""         # S3_SSE_CUSTOMER_KEY_MD5, 128-bit MD5 digest of the encryption key according to RFC 1321
    sse_kms_encryption_context: ""   # S3_SSE_KMS_ENCRYPTION_CONTEXT, base64-encoded UTF-8 string holding a JSON with the encryption context
                                   # Specifies the Amazon Web Services KMS Encryption Context to use for object encryption.
                                   # This is a collection of non-secret key-value pairs that represent additional authenticated data.
                                   # When you use an encryption context to encrypt data, you must specify the same (an exact case-sensitive match)
                                   # encryption context to decrypt the data. An encryption context is supported only on operations with symmetric encryption KMS keys
    disable_cert_verification: false # S3_DISABLE_CERT_VERIFICATION
    use_custom_storage_class: false  # S3_USE_CUSTOM_STORAGE_CLASS
    storage_class: STANDARD          # S3_STORAGE_CLASS, by default allow only from list https://github.com/aws/aws-sdk-go-v2/blob/main/service/s3/types/enums.go#L787-L799
    concurrency: 1                   # S3_CONCURRENCY
    part_size: 0                     # S3_PART_SIZE, if less or eq 0 then it is calculated as max_file_size / max_parts_count, between 5MB and 5Gb
    max_parts_count: 10000           # S3_MAX_PARTS_COUNT, number of parts for S3 multipart uploads
    allow_multipart_download: false  # S3_ALLOW_MULTIPART_DOWNLOAD, allow faster download and upload speeds, but will require additional disk space, download_concurrency * part size in worst case
    
    # S3_OBJECT_LABELS, allow setup metadata for each object during upload, use {macro_name} from system.macros and {backupName} for current backup name
    # The format for this env variable is "key1:value1,key2:value2". For YAML please continue using map syntax
    object_labels: {}
    # S3_CUSTOM_STORAGE_CLASS_MAP, allow setup storage class depending on the backup name regexp pattern, format nameRegexp > className
    custom_storage_class_map: {}
    debug: false                     # S3_DEBUG
gcs:
    credentials_file: ""         # GCS_CREDENTIALS_FILE
    credentials_json: ""         # GCS_CREDENTIALS_JSON
    credentials_json_encoded: "" # GCS_CREDENTIALS_JSON_ENCODED
    bucket: ""                   # GCS_BUCKET
    path: ""                     # GCS_PATH, `system.macros` values could be applied as {macro_name}
    object_disk_path: ""         # GCS_OBJECT_DISK_PATH, path for backup of part from `s3` object disk (clickhouse support only gcs over s3 protocol), if disk present, then shall not be zero and shall not be prefixed by `path`
    compression_level: 1         # GCS_COMPRESSION_LEVEL
    compression_format: tar      # GCS_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brortli, zstd, `none` for upload data part folders as is
    storage_class: STANDARD      # GCS_STORAGE_CLASS
    client_pool_size: 500        # GCS_CLIENT_POOL_SIZE, default max(upload_concurrency, download concurrency) * 3, should be at least 3 times bigger than `UPLOAD_CONCURRENCY` or `DOWNLOAD_CONCURRENCY` in each upload and download case to avoid stuck
    # GCS_OBJECT_LABELS, allow setup metadata for each object during upload, use {macro_name} from system.macros and {backupName} for current backup name
    # The format for this env variable is "key1:value1,key2:value2". For YAML please continue using map syntax
    object_labels: {}
    # GCS_CUSTOM_STORAGE_CLASS_MAP, allow setup storage class depends on backup name regexp pattern, format nameRegexp > className
    custom_storage_class_map: {}
    debug: false                 # GCS_DEBUG
cos:
    url: ""                      # COS_URL
    timeout: 2m                  # COS_TIMEOUT
    secret_id: ""                # COS_SECRET_ID
    secret_key: ""               # COS_SECRET_KEY
    path: ""                     # COS_PATH, `system.macros` values could be applied as {macro_name}
    compression_format: tar      # COS_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brortli, zstd, `none` for upload data part folders as is
    compression_level: 1         # COS_COMPRESSION_LEVEL
ftp:
    address: ""                  # FTP_ADDRESS in format `host:port`
    timeout: 2m                  # FTP_TIMEOUT
    username: ""                 # FTP_USERNAME
    password: ""                 # FTP_PASSWORD
    tls: false                   # FTP_TLS
    tls_skip_verify: false       # FTP_TLS_SKIP_VERIFY
    path: ""                     # FTP_PATH, `system.macros` values could be applied as {macro_name}
    compression_format: tar      # FTP_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brortli, zstd, `none` for upload data part folders as is
    compression_level: 1         # FTP_COMPRESSION_LEVEL
    debug: false                 # FTP_DEBUG
sftp:
    address: ""                  # SFTP_ADDRESS
    username: ""                 # SFTP_USERNAME
    password: ""                 # SFTP_PASSWORD
    port: 22                     # SFTP_PORT
    key: ""                      # SFTP_KEY
    path: ""                     # SFTP_PATH, `system.macros` values could be applied as {macro_name}
    concurrency: 1               # SFTP_CONCURRENCY
    compression_format: tar      # SFTP_COMPRESSION_FORMAT, allowed values tar, lz4, bzip2, gzip, sz, xz, brortli, zstd, `none` for upload data part folders as is
    compression_level: 1         # SFTP_COMPRESSION_LEVEL
    debug: false                 # SFTP_DEBUG
custom:
    upload_command: ""           # CUSTOM_UPLOAD_COMMAND
    download_command: ""         # CUSTOM_DOWNLOAD_COMMAND
    delete_command: ""           # CUSTOM_DELETE_COMMAND
    list_command: ""             # CUSTOM_LIST_COMMAND
    command_timeout: "4h"          # CUSTOM_COMMAND_TIMEOUT
api:
    listen: "localhost:7171"     # API_LISTEN
    enable_metrics: true         # API_ENABLE_METRICS
    enable_pprof: false          # API_ENABLE_PPROF
    username: ""                 # API_USERNAME, basic authorization for API endpoint
    password: ""                 # API_PASSWORD
    secure: false                # API_SECURE, use TLS for listen API socket
    ca_cert_file: ""             # API_CA_CERT_FILE
                               # openssl genrsa -out /etc/clickhouse-backup/ca-key.pem 4096
                               # openssl req -subj "/O=altinity" -x509 -new -nodes -key /etc/clickhouse-backup/ca-key.pem -sha256 -days 365 -out /etc/clickhouse-backup/ca-cert.pem
    private_key_file: ""         # API_PRIVATE_KEY_FILE, openssl genrsa -out /etc/clickhouse-backup/server-key.pem 4096
    certificate_file: ""         # API_CERTIFICATE_FILE,
                               # openssl req -subj "/CN=localhost" -addext "subjectAltName = DNS:localhost,DNS:*.cluster.local" -new -key /etc/clickhouse-backup/server-key.pem -out /etc/clickhouse-backup/server-req.csr
                               # openssl x509 -req -days 365000 -extensions SAN -extfile <(printf "\n[SAN]\nsubjectAltName=DNS:localhost,DNS:*.cluster.local") -in /etc/clickhouse-backup/server-req.csr -out /etc/clickhouse-backup/server-cert.pem -CA /etc/clickhouse-backup/ca-cert.pem -CAkey /etc/clickhouse-backup/ca-key.pem -CAcreateserial
    integration_tables_host: ""  # API_INTEGRATION_TABLES_HOST, allow using DNS name to connect in `system.backup_list` and `system.backup_actions`
    allow_parallel: false        # API_ALLOW_PARALLEL, enable parallel operations, this allows for significant memory allocation and spawns go-routines, don't enable it if you are not sure
    create_integration_tables: false # API_CREATE_INTEGRATION_TABLES, create `system.backup_list` and `system.backup_actions`
    complete_resumable_after_restart: true # API_COMPLETE_RESUMABLE_AFTER_RESTART, after API server startup, if `/var/lib/clickhouse/backup/*/(upload|download).state` present, then operation will continue in the background

Configuration file:

vim /etc/clickhouse-backup/config.yml

general:
  remote_storage: none  # defaults to none; set to sftp if backups should be uploaded to a remote server over sftp
  disable_progress_bar: false
  backups_to_keep_local: 7  # number of local backups to keep; older ones beyond 7 are removed automatically (default 0 keeps all backups)
  backups_to_keep_remote: 7 # number of remote backups to keep
  log_level: info
  allow_empty_backups: false
clickhouse:
  username: default
  password: "PeQLg45tJcxZk"
  host: 192.168.12.91
  port: 9000
  skip_tables: # databases/tables that should not be backed up
    - system.*
    - default.*
    - INFORMATION_SCHEMA.*
    - information_schema.*
  timeout: 5m
  freeze_by_part: false

Backup commands:

clickhouse-backup create my_backup --config /etc/clickhouse-backup/config.yml

-- create: name the backup; if no name is given, the name defaults to a timestamp
clickhouse-backup create --config /etc/clickhouse-backup/config.yml
-- list the backups that have been created
clickhouse-backup list --config /etc/clickhouse-backup/config.yml
-- delete a backup (local or remote)
clickhouse-backup delete local my_backup --config /etc/clickhouse-backup/config.yml
-- restore a backup
clickhouse-backup restore my_backup
-- list the tables that can be backed up
clickhouse-backup tables --config /etc/clickhouse-backup/config.yml

CLI commands

tables

List the tables that can be backed up (excluding skip_tables)

NAME:
   clickhouse-backup tables - List of tables, exclude skip_tables

USAGE:
   clickhouse-backup tables [-t, --tables=<db>.<table>]] [--all]

OPTIONS:
   --config value, -c value                 path to the config file
   --all, -a                                print all tables
   --table value, --tables value, -t value  list only tables matching the table name patterns, comma separated; ? and * are allowed as wildcards
create

Create a new backup. If no name is given, the backup folder is named with a timestamp, e.g. 2023-11-10T02-00-10.

NAME:
   clickhouse-backup create - Create new backup

USAGE:
   clickhouse-backup create [-t, --tables=<db>.<table>] [--partitions=<partition_names>] [-s, --schema] [--rbac] [--configs] [--skip-check-parts-columns] <backup_name>

DESCRIPTION:
   Create new backup

OPTIONS:
   --config value, -c value                 path to the config file
   --table value, --tables value, -t value  back up only tables matching the table name patterns, comma separated; ? and * are allowed as wildcards
   --partitions partition_id                create a backup only for the selected partition names, comma separated
   --schema, -s                                      back up schema only
   --rbac, --backup-rbac, --do-backup-rbac           back up RBAC-related objects
   --configs, --backup-configs, --do-backup-configs  back up clickhouse-server configuration files
   --skip-check-parts-columns                        skip the system.parts_columns check to allow backing up data parts with inconsistent column types
create_remote

Create and upload a new backup

NAME:
   clickhouse-backup create_remote - Create and upload new backup

USAGE:
   clickhouse-backup create_remote [-t, --tables=<db>.<table>] [--partitions=<partition_names>] [--diff-from=<local_backup_name>] [--diff-from-remote=<local_backup_name>] [--schema] [--rbac] [--configs] [--resumable] [--skip-check-parts-columns] <backup_name>

DESCRIPTION:
   Create and upload

OPTIONS:
   --config value, -c value                 path to the config file
   --table value, --tables value, -t value  back up only tables matching the table name patterns, comma separated; ? and * are allowed as wildcards
   --partitions partition_id                create a backup only for the selected partition names, comma separated
   --diff-from value                                 local backup name to diff from, so the current backup is uploaded as an incremental one
   --diff-from-remote value                          remote backup name to diff from, so the current backup is uploaded as an incremental one
   --schema, -s                                      back up schema only
   --rbac, --backup-rbac, --do-backup-rbac           back up RBAC-related objects
   --configs, --backup-configs, --do-backup-configs  back up clickhouse-server configuration files
   --resume, --resumable                             save intermediate upload state and resume the upload if the backup already exists on remote storage; ignored when remote_storage: custom or use_embedded_backup_restore: true
   --skip-check-parts-columns                        skip the system.parts_columns check to allow backing up data parts with inconsistent column types
upload

  Upload a backup to remote storage

NAME:
   clickhouse-backup upload - Upload backup to remote storage

USAGE:
   clickhouse-backup upload [-t, --tables=<db>.<table>] [--partitions=<partition_names>] [-s, --schema] [--diff-from=<local_backup_name>] [--diff-from-remote=<remote_backup_name>] [--resumable] <backup_name>

OPTIONS:
   --config value, -c value                 path to the config file
   --diff-from value                        local backup name to diff from, so the current backup is uploaded as an incremental one
   --diff-from-remote value                 remote backup name to diff from, so the current backup is uploaded as an incremental one
   --table value, --tables value, -t value  upload only tables matching the table name patterns, comma separated; ? and * are allowed as wildcards
   --partitions partition_id                upload only the selected partition names, comma separated
   --schema, -s               upload schema only
   --resume, --resumable      save intermediate upload state and resume the upload if the backup already exists on remote storage; ignored when remote_storage: custom or use_embedded_backup_restore: true
   --skip-check-parts-columns    skip the system.parts_columns check to allow backing up data parts with inconsistent column types
list

  List backups

NAME:
   clickhouse-backup list - List of backups

USAGE:
   clickhouse-backup list [all|local|remote] [latest|previous]

OPTIONS:
   --config value, -c value  path to the config file
download

Download a backup from remote storage

NAME:
   clickhouse-backup download - Download backup from remote storage

USAGE:
   clickhouse-backup download [-t, --tables=<db>.<table>] [--partitions=<partition_names>] [-s, --schema] [--resumable] <backup_name>

OPTIONS:
   --config value, -c value                 path to the config file
   --table value, --tables value, -t value  download only tables matching the table name patterns, comma separated; ? and * are allowed as wildcards
   --partitions partition_id                download only the selected partition names, comma separated
   --schema, -s           download the schema only
   --resume, --resumable  save intermediate download state and resume the download if the backup already exists locally; ignored when remote_storage: custom or use_embedded_backup_restore: true
restore

Restore a backup: create the schema and restore the data from the backup

NAME:
   clickhouse-backup restore - Create schema and restore data from backup

USAGE:
   clickhouse-backup restore  [-t, --tables=<db>.<table>] [-m, --restore-database-mapping=<originDB>:<targetDB>[,<...>]] [--partitions=<partitions_names>] [-s, --schema] [-d, --data] [--rm, --drop] [-i, --ignore-dependencies] [--rbac] [--configs] <backup_name>

OPTIONS:
   --config value, -c value                    path to the config file
   --table value, --tables value, -t value     restore only databases and objects matching the table name patterns, comma separated; ? and * are allowed as wildcards
   --restore-database-mapping value, -m value  rules for where to restore the data; databases not defined in this mapping are not processed
   --partitions partition_id                   restore only the selected partition names, comma separated
   --schema, -s                                        restore the schema only
   --data, -d                                          restore the data only
   --rm, --drop                                        drop existing schema objects before restoring
   -i, --ignore-dependencies                           ignore dependencies when dropping existing schema objects
   --rbac, --restore-rbac, --do-restore-rbac           restore RBAC-related objects
   --configs, --restore-configs, --do-restore-configs  restore clickhouse-server configuration files
restore_remote

Download and restore a backup

NAME:
   clickhouse-backup restore_remote - Download and restore

USAGE:
   clickhouse-backup restore_remote [--schema] [--data] [-t, --tables=<db>.<table>] [-m, --restore-database-mapping=<originDB>:<targetDB>[,<...>]] [--partitions=<partitions_names>] [--rm, --drop] [-i, --ignore-dependencies] [--rbac] [--configs] [--skip-rbac] [--skip-configs] [--resumable] <backup_name>

OPTIONS:
   --config value, -c value                    path to the config file
   --table value, --tables value, -t value     download and restore only objects matching the table name patterns, comma separated; ? and * are allowed as wildcards
   --restore-database-mapping value, -m value  rules for where to restore the data; databases not defined in this mapping are not processed
   --partitions partition_id                   download and restore only the selected partition names, comma separated
   --schema, -s                                        download and restore the schema only
   --data, -d                                          download and restore the data only
   --rm, --drop                                        drop existing schema objects before restoring
   -i, --ignore-dependencies                           ignore dependencies when dropping existing schema objects
   --rbac, --restore-rbac, --do-restore-rbac           download and restore RBAC-related objects
   --configs, --restore-configs, --do-restore-configs  download and restore clickhouse-server configuration files
   --resume, --resumable                               save intermediate upload state and resume if the backup already exists on remote storage; ignored when remote_storage: custom or use_embedded_backup_restore: true
delete

Delete a backup

NAME:
   clickhouse-backup delete - Delete specific backup

USAGE:
   clickhouse-backup delete <local|remote> <backup_name>

OPTIONS:
   --config value, -c value  path to the config file
default-config

Print the default configuration

NAME:
   clickhouse-backup default-config - Print default config

USAGE:
   clickhouse-backup default-config [command options] [arguments...]

OPTIONS:
   --config value, -c value  path to the config file
print-config

Print the current configuration

NAME:
   clickhouse-backup print-config - Print current config merged with environment variables

USAGE:
   clickhouse-backup print-config [command options] [arguments...]

OPTIONS:
   --config value, -c value  path to the config file
clean

Remove the data in the 'shadow' folders from all 'path' folders listed in system.disks

NAME:
   clickhouse-backup clean - Remove data in 'shadow' folder from all 'path' folders available from 'system.disks'

USAGE:
   clickhouse-backup clean [command options] [arguments...]

OPTIONS:
   --config value, -c value  path to the config file
clean_remote_broken

Remove all broken remote backups

NAME:
   clickhouse-backup clean_remote_broken - Remove all broken remote backups

USAGE:
   clickhouse-backup clean_remote_broken [command options] [arguments...]

OPTIONS:
   --config value, -c value  path to the config file
watch

Backup watch loop

NAME:
   clickhouse-backup watch - Run infinite loop which create full + incremental backup sequence to allow efficient backup sequences

USAGE:
   clickhouse-backup watch [--watch-interval=1h] [--full-interval=24h] [--watch-backup-name-template=shard{shard}-{type}-{time:20060102150405}] [-t, --tables=<db>.<table>] [--partitions=<partitions_names>] [--schema] [--rbac] [--configs] [--skip-check-parts-columns]

DESCRIPTION:
   Execute create_remote + delete local, create full backup every `--full-interval`, create and upload incremental backup every `--watch-interval` use previous backup as base with `--diff-from-remote` option, use `backups_to_keep_remote` config option for properly deletion remote backups, will delete old backups which not have references from other backups

OPTIONS:
   --config value, -c value                 path to the config file
   --watch-interval value                   interval for running create_remote + delete local for incremental backups
   --full-interval value                    interval at which the incremental sequence is stopped and create_remote + delete local runs for a full backup
   --watch-backup-name-template value       template for new backup names
   --table value, --tables value, -t value  create and upload only objects matching the table name patterns, comma separated; ? and * are allowed as wildcards
   --partitions partition_id                partition names, comma separated
   --schema, -s                                      schema only
   --rbac, --backup-rbac, --do-backup-rbac           back up RBAC-related objects only
   --configs, --backup-configs, --do-backup-configs  back up clickhouse-server configuration files only
   --skip-check-parts-columns                        skip the system.parts_columns check to allow backing up data parts with inconsistent column types
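A hedged example of running the watch loop with the flags listed above (the intervals are illustrative and the remote storage must be configured in config.yml):

# Create a full backup every 24h and an incremental backup every 1h,
# uploading to the remote storage configured in the config file.
clickhouse-backup watch \
    --config /etc/clickhouse-backup/config.yml \
    --watch-interval=1h \
    --full-interval=24h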

A simple example

List the tables that can currently be backed up:

[root@bogon clickhouse-backup]# clickhouse-backup tables
2023/11/09 16:55:15.414994  info clickhouse connection prepared: tcp://localhost:9000 run ping logger=clickhouse
2023/11/09 16:55:15.436732  info clickhouse connection open: tcp://localhost:9000 logger=clickhouse
2023/11/09 16:55:15.436826  info SELECT name, count(*) as is_present FROM system.settings WHERE name IN (?, ?) GROUP BY name with args [show_table_uuid_in_table_create_query_if_not_nil display_secrets_in_show_and_select] logger=clickhouse
2023/11/09 16:55:15.441067  info SELECT name FROM system.databases WHERE engine IN ('MySQL','PostgreSQL','MaterializedPostgreSQL') logger=clickhouse
2023/11/09 16:55:15.443948  info    SELECT     countIf(name='data_path') is_data_path_present,     countIf(name='data_paths') is_data_paths_present,     countIf(name='uuid') is_uuid_present,     countIf(name='create_table_query') is_create_table_query_present,     countIf(name='total_bytes') is_total_bytes_present    FROM system.columns WHERE database='system' AND table='tables'   logger=clickhouse
2023/11/09 16:55:15.498436  info SELECT database, name, engine , data_paths , uuid , create_table_query , coalesce(total_bytes, 0) AS total_bytes   FROM system.tables WHERE is_temporary = 0 ORDER BY total_bytes DESC SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1 logger=clickhouse
2023/11/09 16:55:15.511604  info SELECT metadata_path FROM system.tables WHERE database = 'system' AND metadata_path!='' LIMIT 1; logger=clickhouse
2023/11/09 16:55:15.513617  info SELECT value FROM `system`.`build_options` where name='VERSION_INTEGER' logger=clickhouse
2023/11/09 16:55:15.514306  info SELECT count() is_disk_type_present FROM system.columns WHERE database='system' AND table='disks' AND name='type' logger=clickhouse
2023/11/09 16:55:15.515825  info SELECT path, any(name) AS name, any(type) AS type FROM system.disks GROUP BY path logger=clickhouse
default.sq_ck_data  1.99MiB  default  full
default.tett        1.64KiB  default  full
default.users       746B     default  full
2023/11/09 16:55:15.517788  info clickhouse connection closed logger=clickhouse

Back up a single table:

[root@bogon clickhouse-backup]# clickhouse-backup create --table default.tett
2023/11/10 10:00:10.565466  info clickhouse connection prepared: tcp://localhost:9000 run ping logger=clickhouse
2023/11/10 10:00:10.567876  info clickhouse connection open: tcp://localhost:9000 logger=clickhouse
2023/11/10 10:00:10.567942  info SELECT metadata_path FROM system.tables WHERE database = 'system' AND metadata_path!='' LIMIT 1; logger=clickhouse
2023/11/10 10:00:10.572487  info SELECT name, engine FROM system.databases WHERE NOT match(name,'^(system|INFORMATION_SCHEMA|information_schema|_temporary_and_external_tables)$') AND match(name,'^(default)$') logger=clickhouse
2023/11/10 10:00:10.574905  info SHOW CREATE DATABASE `default` logger=clickhouse
2023/11/10 10:00:10.575946  info SELECT name, count(*) as is_present FROM system.settings WHERE name IN (?, ?) GROUP BY name with args [show_table_uuid_in_table_create_query_if_not_nil display_secrets_in_show_and_select] logger=clickhouse
2023/11/10 10:00:10.579002  info SELECT name FROM system.databases WHERE engine IN ('MySQL','PostgreSQL','MaterializedPostgreSQL') logger=clickhouse
2023/11/10 10:00:10.579982  info    SELECT     countIf(name='data_path') is_data_path_present,     countIf(name='data_paths') is_data_paths_present,     countIf(name='uuid') is_uuid_present,     countIf(name='create_table_query') is_create_table_query_present,     countIf(name='total_bytes') is_total_bytes_present    FROM system.columns WHERE database='system' AND table='tables'   logger=clickhouse
2023/11/10 10:00:10.584187  info SELECT database, name, engine , data_paths , uuid , create_table_query , coalesce(total_bytes, 0) AS total_bytes   FROM system.tables WHERE is_temporary = 0 AND match(concat(database,'.',name),'^default.tett$')  ORDER BY total_bytes DESC SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1 logger=clickhouse
2023/11/10 10:00:10.595985  info SELECT metadata_path FROM system.tables WHERE database = 'system' AND metadata_path!='' LIMIT 1; logger=clickhouse
2023/11/10 10:00:10.598130  info SELECT count() as cnt FROM system.columns WHERE database='system' AND table='functions' AND name='create_query' SETTINGS empty_result_for_aggregation_by_empty_set=0 logger=clickhouse
2023/11/10 10:00:10.600149  info SELECT name, create_query FROM system.functions WHERE create_query!='' logger=clickhouse
2023/11/10 10:00:10.602970  info SELECT value FROM `system`.`build_options` where name='VERSION_INTEGER' logger=clickhouse
2023/11/10 10:00:10.603787  info SELECT count() is_disk_type_present FROM system.columns WHERE database='system' AND table='disks' AND name='type' logger=clickhouse
2023/11/10 10:00:10.606071  info SELECT path, any(name) AS name, any(type) AS type FROM system.disks GROUP BY path logger=clickhouse
2023/11/10 10:00:10.608060  info SELECT column, groupUniqArray(type) AS uniq_types FROM system.parts_columns WHERE active AND database=? AND table=? AND type NOT LIKE 'Enum%' AND type NOT LIKE 'Tuple(%' GROUP BY column HAVING length(uniq_types) > 1 with args [default tett] logger=clickhouse
2023/11/10 10:00:10.611245  info ALTER TABLE `default`.`tett` FREEZE WITH NAME '97b3ebeae36e4625a18d03c268c762e4'; logger=clickhouse
2023/11/10 10:00:10.616292  info ALTER TABLE `default`.`tett` UNFREEZE WITH NAME '97b3ebeae36e4625a18d03c268c762e4' logger=clickhouse
2023/11/10 10:00:10.616993  info SELECT mutation_id, command FROM system.mutations WHERE is_done=0 AND database=? AND table=? with args [default tett] logger=clickhouse
2023/11/10 10:00:10.619054  info done                      backup=2023-11-10T02-00-10 logger=backuper operation=create table=default.tett
2023/11/10 10:00:10.619090  info SELECT value FROM `system`.`build_options` where name='VERSION_DESCRIBE' logger=clickhouse
2023/11/10 10:00:10.620433  info done                      backup=2023-11-10T02-00-10 duration=55ms logger=backuper operation=create
2023/11/10 10:00:10.620496  info clickhouse connection closed logger=clickhouse

List the backups:

[root@bogon clickhouse-backup]# clickhouse-backup list
2023/11/10 10:00:23.397790  info clickhouse connection prepared: tcp://localhost:9000 run ping logger=clickhouse
2023/11/10 10:00:23.399892  info clickhouse connection open: tcp://localhost:9000 logger=clickhouse
2023/11/10 10:00:23.399919  info SELECT value FROM `system`.`build_options` where name='VERSION_INTEGER' logger=clickhouse
2023/11/10 10:00:23.403206  info SELECT count() is_disk_type_present FROM system.columns WHERE database='system' AND table='disks' AND name='type' logger=clickhouse
2023/11/10 10:00:23.405522  info SELECT path, any(name) AS name, any(type) AS type FROM system.disks GROUP BY path logger=clickhouse
2023-11-09T06-32-30   2.00MiB   09/11/2023 06:32:30   local      
2023-11-10T02-00-10   3.44KiB   10/11/2023 02:00:10   local      
2023/11/10 10:00:23.407513  info clickhouse connection closed logger=clickhouse

The second entry, 2023-11-10T02-00-10, is the backup of the default.tett table we just created.

  Delete the data from the original table:

bogon :) select * from default.tett;

SELECT *
FROM default.tett

Query id: 418c25c1-d383-481d-b0a5-563f03926295

Connecting to database default at localhost:9000 as user default.
Connected to ClickHouse server version 23.8.2 revision 54465.

┌─id─┬─name─┬─age─┬─loc──────┐
│  2 │ ls   │  19 │ shanghai │
│  5 │ tq   │  22 │ shanghai │
└────┴──────┴─────┴──────────┘
┌─id─┬─name─┬─age─┬─loc─────┐
│  1 │ zs   │  18 │ beijing │
│  4 │ ml   │  21 │ beijing │
└────┴──────┴─────┴─────────┘
┌─id─┬─name─┬─age─┬─loc───────┐
│  3 │ ww   │  20 │ guangzhou │
│  6 │ gb   │  23 │ guangzhou │
└────┴──────┴─────┴───────────┘

6 rows in set. Elapsed: 0.023 sec. 

bogon :) truncate default.tett;

TRUNCATE TABLE default.tett

Query id: 30d538dd-929d-4ca3-9653-da5124f4a37e

Ok.

0 rows in set. Elapsed: 0.011 sec. 

bogon :) select * from default.tett;

SELECT *
FROM default.tett

Query id: 1ad2ebf8-08cb-4cb9-9dfe-92bf33b34daf

Ok.

0 rows in set. Elapsed: 0.020 sec. 

Restore the backup:

[root@bogon clickhouse-backup]# clickhouse-backup restore 2023-11-10T02-00-10
2023/11/10 10:02:17.483029  info clickhouse connection prepared: tcp://localhost:9000 run ping logger=clickhouse
2023/11/10 10:02:17.503304  info clickhouse connection open: tcp://localhost:9000 logger=clickhouse
2023/11/10 10:02:17.503343  info SELECT value FROM `system`.`build_options` where name='VERSION_INTEGER' logger=clickhouse
2023/11/10 10:02:17.507371  info SELECT count() is_disk_type_present FROM system.columns WHERE database='system' AND table='disks' AND name='type' logger=clickhouse
2023/11/10 10:02:17.510458  info SELECT path, any(name) AS name, any(type) AS type FROM system.disks GROUP BY path logger=clickhouse
2023/11/10 10:02:17.514702  info CREATE DATABASE IF NOT EXISTS `default` ENGINE = Atomic with args [[]] logger=clickhouse
2023/11/10 10:02:17.516893  info SELECT engine FROM system.databases WHERE name = 'default' logger=clickhouse
2023/11/10 10:02:17.518091  info DROP TABLE IF EXISTS `default`.`tett` NO DELAY logger=clickhouse
2023/11/10 10:02:17.525491  info CREATE DATABASE IF NOT EXISTS `default` logger=clickhouse
2023/11/10 10:02:17.526077  info CREATE TABLE default.tett UUID '5d622b96-7e59-4586-aa60-1b1f992fa85c' (`id` UInt64, `name` String, `age` UInt8, `loc` String) ENGINE = MergeTree PARTITION BY loc ORDER BY id SETTINGS index_granularity = 8192 logger=clickhouse
2023/11/10 10:02:17.546341  info SELECT name, count(*) as is_present FROM system.settings WHERE name IN (?, ?) GROUP BY name with args [show_table_uuid_in_table_create_query_if_not_nil display_secrets_in_show_and_select] logger=clickhouse
2023/11/10 10:02:17.550424  info SELECT name FROM system.databases WHERE engine IN ('MySQL','PostgreSQL','MaterializedPostgreSQL') logger=clickhouse
2023/11/10 10:02:17.551540  info    SELECT     countIf(name='data_path') is_data_path_present,     countIf(name='data_paths') is_data_paths_present,     countIf(name='uuid') is_uuid_present,     countIf(name='create_table_query') is_create_table_query_present,     countIf(name='total_bytes') is_total_bytes_present    FROM system.columns WHERE database='system' AND table='tables'   logger=clickhouse
2023/11/10 10:02:17.554243  info SELECT database, name, engine , data_paths , uuid , create_table_query , coalesce(total_bytes, 0) AS total_bytes   FROM system.tables WHERE is_temporary = 0 ORDER BY total_bytes DESC SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1 logger=clickhouse
2023/11/10 10:02:17.568021  info SELECT metadata_path FROM system.tables WHERE database = 'system' AND metadata_path!='' LIMIT 1; logger=clickhouse
2023/11/10 10:02:17.571347  info SELECT sum(bytes_on_disk) as size FROM system.parts WHERE active AND database='default' AND table='tett' GROUP BY database, table logger=clickhouse
2023/11/10 10:02:17.575051  info ALTER TABLE `default`.`tett` ATTACH PART 'c0e10d07728763da6c6f9ee4adb2fe46_3_6_1' logger=clickhouse
2023/11/10 10:02:17.578985  info ALTER TABLE `default`.`tett` ATTACH PART 'd3cb49ee6c8532140eabb67773fa583d_2_5_1' logger=clickhouse
2023/11/10 10:02:17.580333  info ALTER TABLE `default`.`tett` ATTACH PART 'e8ccea18d4053730ef65315d1781f8c6_1_4_1' logger=clickhouse
2023/11/10 10:02:17.581493  info done                      backup=2023-11-10T02-00-10 operation=restore table=default.tett
2023/11/10 10:02:17.581509  info done                      backup=2023-11-10T02-00-10 duration=35ms operation=restore
2023/11/10 10:02:17.581513  info done                      backup=2023-11-10T02-00-10 operation=restore
2023/11/10 10:02:17.581552  info clickhouse connection closed logger=clickhouse

Check the data in the table:

bogon :) select * from default.tett;

SELECT *
FROM default.tett

Query id: f8a5e78e-723e-4cd0-8a9e-38316fd6882e

┌─id─┬─name─┬─age─┬─loc───────┐
│  3 │ ww   │  20 │ guangzhou │
│  6 │ gb   │  23 │ guangzhou │
└────┴──────┴─────┴───────────┘
┌─id─┬─name─┬─age─┬─loc─────┐
│  1 │ zs   │  18 │ beijing │
│  4 │ ml   │  21 │ beijing │
└────┴──────┴─────┴─────────┘
┌─id─┬─name─┬─age─┬─loc──────┐
│  2 │ ls   │  19 │ shanghai │
│  5 │ tq   │  22 │ shanghai │
└────┴──────┴─────┴──────────┘

6 rows in set. Elapsed: 0.006 sec. 

The restore succeeded.