通过时间线变更恢复
摘要: 本文演示了在 PostgreSQL 中可以通过时间线变更(timeline change)执行时间点恢复(Point-In-Time Recovery,PITR),这与常见假设相反。作者通过一个实际示例展示了如何备份 PostgreSQL 集群、提升副本,以及成功恢复包括提升后发生的事务在内的数据。
我总是听说,一旦执行了切换(switchover)或故障转移(failover),就需要创建新的 Postgres 集群备份。我猜这就是为什么我以为无法通过时间线变更进行 PITR 的原因。今天,我学会了可以通过时间线变更进行 PITR!
一些背景
假设你有一个女王(elinor)和一个公主(merida)。如果你不熟悉女王/公主/工作节点这些术语,请阅读 Tapoueh 的博客文章 了解相关信息。
创建备份并添加一些数据
首先,使用 pg_basebackup 创建 Postgres 集群的简单备份:
elinor:$ pg_basebackup -D /var/lib/postgresql/backup -Ft -z -P
并将其发送到公主那里,以防万一:
elinor:$ rsync -r /var/lib/postgresql/backup merida:/var/lib/postgresql/backup
在 elinor 上创建一个简单的测试表:
test=# create table test(id integer generated always as identity, value text);
CREATE TABLE
在表中添加一些数据:
test=# insert into test(value) values ('a');
INSERT 0 1
test=# insert into test(value) values ('a');
INSERT 0 1
test=# insert into test(value) values ('a');
INSERT 0 1
你可以在 merida(公主)上查询这些数据:
test=# select * from test;
id | value
----+-------
1 | a
2 | a
3 | a
(3 rows)
提升公主
现在,关闭女王并提升公主:
elinor:$ pg_ctlcluster 12 main stop
merida:$ pg_ctlcluster 12 main promote
然后从你的新女王(merida)的测试表中添加一些行:
test=# insert into test(value) values ('b');
INSERT 0 1
test=# insert into test(value) values ('b');
INSERT 0 1
恢复备份
现在,让我们停止 merida 并清理集群。
merida:$ pg_dropcluster 12 main --stop
让我们恢复备份:
merida$: mkdir /var/lib/postgresql/12/main
merida$: tar xzf base.tar.gz -C /var/lib/postgresql/12/main
merida$: touch /var/lib/postgresql/12/main/recovery.signal
merida$: chown -R postgres:postgres /var/lib/postgresql/12/main
我们只需要启动 Postgres:
merida:$ pg_ctlcluster 12 main start
我的数据回来了吗?
test=# select * from test;
id | value
----+-------
1 | a
2 | a
3 | a
4 | b
5 | b
(5 rows)
是的!一切都回来了。没有数据丢失!
Postgres 做了什么?
这是我的启动日志(是的,我的日志前缀应该更好一些):
2020-03-24 13:27:36.835 UTC [7826] LOG: starting PostgreSQL 12.2 (Ubuntu 12.2-2.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0, 64-bit
2020-03-24 13:27:36.835 UTC [7826] LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-03-24 13:27:36.835 UTC [7826] LOG: listening on IPv6 address "::", port 5432
2020-03-24 13:27:36.842 UTC [7826] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-03-24 13:27:36.869 UTC [7827] LOG: database system was interrupted; last known up at 2020-03-24 10:07:35 UTC
2020-03-24 13:27:36.889 UTC [7827] LOG: restored log file "00000002.history" from archive
rsync: link_stat "/var/lib/postgresql/wals/00000003.history" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:36.891 UTC [7827] LOG: starting archive recovery
2020-03-24 13:27:36.894 UTC [7827] LOG: reststartored log file "00000002.history" from archive
rsync: link_stat "/var/lib/postgresql/wals/00000002000000000000000E" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:36.955 UTC [7827] LOG: restored log file "00000001000000000000000E" from archive
2020-03-24 13:27:36.985 UTC [7827] LOG: redo starts at 0/E000028
2020-03-24 13:27:36.994 UTC [7827] LOG: consistent recovery state reached at 0/E000100
2020-03-24 13:27:36.995 UTC [7826] LOG: database system is ready to accept read only connections
rsync: link_stat "/var/lib/postgresql/wals/00000002000000000000000F" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:37.046 UTC [7827] LOG: restored log file "00000001000000000000000F" from archive
rsync: link_stat "/var/lib/postgresql/wals/000000020000000000000010" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:37.120 UTC [7827] LOG: restored log file "000000010000000000000010" from archive
2020-03-24 13:27:37.198 UTC [7827] LOG: restored log file "000000020000000000000011" from archive
rsync: link_stat "/var/lib/postgresql/wals/000000020000000000000012" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:37.211 UTC [7827] LOG: redo done at 0/11000400
2020-03-24 13:27:37.211 UTC [7827] LOG: last completed transaction was at log time 2020-03-24 10:09:38.087444+00
2020-03-24 13:27:37.274 UTC [7827] LOG: restored log file "000000020000000000000011" from archive
rsync: link_stat "/var/lib/postgresql/wals/00000003.history" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:37.288 UTC [7827] LOG: selected new timeline ID: 3
2020-03-24 13:27:37.314 UTC [7827] LOG: archive recovery complete
2020-03-24 13:27:37.317 UTC [7827] LOG: restored log file "00000002.history" from archive
2020-03-24 13:27:37.382 UTC [7826] LOG: database system is ready to accept connections
- Postgres 发现发生了时间线变更(通过
00000002.history文件) - Postgres 发现没有
00000003.history文件,这意味着当前时间线是 2(所以在此次恢复结束时其时间线编号将是 3) - Postgres 在时间线 2 中查找 WAL 文件
000000000000000E,但找不到 - Postgres 在时间线 1 中查找 WAL 文件
000000000000000E,找到了它并恢复它,到达恢复点 0/E000100 - Postgres 继续恢复 WAL 文件
00000001000000000000000F、000000010000000000000010和000000020000000000000011 - Postgres 找不到文件
000000020000000000000012 - Postgres 已完成恢复
- Postgres 更改时间线编号
- 数据库就绪!
为什么需要在提升后备份?
首先,故障转移意味着生产环境中的异常。一旦发生异常,备份总是明智之举。如果是切换,情况会有所不同,备份仍然是一个好主意。
其次,我见过太多次公主身上没有 archive_command 设置,如果你的公主死了,你无法获取 .history 文件,而这些文件告诉 Postgres 发生了时间线变更。没有这些文件,你就无法恢复数据。
所以请确保安全,备份你的数据库(并测试你的备份)!