通过时间线变更恢复

3 阅读7分钟

通过时间线变更恢复

摘要: 本文演示了在 PostgreSQL 中可以通过时间线变更(timeline change)执行时间点恢复(Point-In-Time Recovery,PITR),这与常见假设相反。作者通过一个实际示例展示了如何备份 PostgreSQL 集群、提升副本,以及成功恢复包括提升后发生的事务在内的数据。

原文链接

我总是听说,一旦执行了切换(switchover)或故障转移(failover),就需要创建新的 Postgres 集群备份。我猜这就是为什么我以为无法通过时间线变更进行 PITR 的原因。今天,我学会了可以通过时间线变更进行 PITR!

一些背景

假设你有一个女王(elinor)和一个公主(merida)。如果你不熟悉女王/公主/工作节点这些术语,请阅读 Tapoueh 的博客文章 了解相关信息。

创建备份并添加一些数据

首先,使用 pg_basebackup 创建 Postgres 集群的简单备份:

elinor:$ pg_basebackup -D /var/lib/postgresql/backup -Ft -z -P

并将其发送到公主那里,以防万一:

elinor:$ rsync -r /var/lib/postgresql/backup merida:/var/lib/postgresql/backup

在 elinor 上创建一个简单的测试表:

test=# create table test(id integer generated always as identity, value text);
CREATE TABLE

在表中添加一些数据:

test=# insert into test(value) values ('a');
INSERT 0 1
test=# insert into test(value) values ('a');
INSERT 0 1
test=# insert into test(value) values ('a');
INSERT 0 1

你可以在 merida(公主)上查询这些数据:

test=# select * from test;
 id | value 
 ----+-------
  1 | a
  2 | a
  3 | a
(3 rows)

提升公主

现在,关闭女王并提升公主:

elinor:$ pg_ctlcluster 12 main stop

merida:$ pg_ctlcluster 12 main promote

然后从你的新女王(merida)的测试表中添加一些行:

test=# insert into test(value) values ('b');
INSERT 0 1
test=# insert into test(value) values ('b');
INSERT 0 1

恢复备份

现在,让我们停止 merida 并清理集群。

merida:$ pg_dropcluster 12 main --stop

让我们恢复备份:

merida$: mkdir /var/lib/postgresql/12/main
merida$: tar xzf base.tar.gz -C /var/lib/postgresql/12/main
merida$: touch /var/lib/postgresql/12/main/recovery.signal
merida$: chown -R postgres:postgres /var/lib/postgresql/12/main

我们只需要启动 Postgres:

merida:$ pg_ctlcluster 12 main start

我的数据回来了吗?

test=# select * from test;
 id | value 
----+-------
  1 | a
  2 | a
  3 | a
  4 | b
  5 | b
(5 rows) 

是的!一切都回来了。没有数据丢失!

Postgres 做了什么?

这是我的启动日志(是的,我的日志前缀应该更好一些):

2020-03-24 13:27:36.835 UTC [7826] LOG:  starting PostgreSQL 12.2 (Ubuntu 12.2-2.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0, 64-bit
2020-03-24 13:27:36.835 UTC [7826] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2020-03-24 13:27:36.835 UTC [7826] LOG:  listening on IPv6 address "::", port 5432
2020-03-24 13:27:36.842 UTC [7826] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-03-24 13:27:36.869 UTC [7827] LOG:  database system was interrupted; last known up at 2020-03-24 10:07:35 UTC
2020-03-24 13:27:36.889 UTC [7827] LOG:  restored log file "00000002.history" from archive
rsync: link_stat "/var/lib/postgresql/wals/00000003.history" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:36.891 UTC [7827] LOG:  starting archive recovery
2020-03-24 13:27:36.894 UTC [7827] LOG:  reststartored log file "00000002.history" from archive
rsync: link_stat "/var/lib/postgresql/wals/00000002000000000000000E" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:36.955 UTC [7827] LOG:  restored log file "00000001000000000000000E" from archive
2020-03-24 13:27:36.985 UTC [7827] LOG:  redo starts at 0/E000028
2020-03-24 13:27:36.994 UTC [7827] LOG:  consistent recovery state reached at 0/E000100
2020-03-24 13:27:36.995 UTC [7826] LOG:  database system is ready to accept read only connections
rsync: link_stat "/var/lib/postgresql/wals/00000002000000000000000F" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:37.046 UTC [7827] LOG:  restored log file "00000001000000000000000F" from archive
rsync: link_stat "/var/lib/postgresql/wals/000000020000000000000010" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:37.120 UTC [7827] LOG:  restored log file "000000010000000000000010" from archive
2020-03-24 13:27:37.198 UTC [7827] LOG:  restored log file "000000020000000000000011" from archive
rsync: link_stat "/var/lib/postgresql/wals/000000020000000000000012" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:37.211 UTC [7827] LOG:  redo done at 0/11000400
2020-03-24 13:27:37.211 UTC [7827] LOG:  last completed transaction was at log time 2020-03-24 10:09:38.087444+00
2020-03-24 13:27:37.274 UTC [7827] LOG:  restored log file "000000020000000000000011" from archive
rsync: link_stat "/var/lib/postgresql/wals/00000003.history" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:37.288 UTC [7827] LOG:  selected new timeline ID: 3
2020-03-24 13:27:37.314 UTC [7827] LOG:  archive recovery complete
2020-03-24 13:27:37.317 UTC [7827] LOG:  restored log file "00000002.history" from archive
2020-03-24 13:27:37.382 UTC [7826] LOG:  database system is ready to accept connections
  1. Postgres 发现发生了时间线变更(通过 00000002.history 文件)
  2. Postgres 发现没有 00000003.history 文件,这意味着当前时间线是 2(所以在此次恢复结束时其时间线编号将是 3)
  3. Postgres 在时间线 2 中查找 WAL 文件 000000000000000E,但找不到
  4. Postgres 在时间线 1 中查找 WAL 文件 000000000000000E,找到了它并恢复它,到达恢复点 0/E000100
  5. Postgres 继续恢复 WAL 文件 00000001000000000000000F000000010000000000000010000000020000000000000011
  6. Postgres 找不到文件 000000020000000000000012
  7. Postgres 已完成恢复
  8. Postgres 更改时间线编号
  9. 数据库就绪!

为什么需要在提升后备份?

首先,故障转移意味着生产环境中的异常。一旦发生异常,备份总是明智之举。如果是切换,情况会有所不同,备份仍然是一个好主意。

其次,我见过太多次公主身上没有 archive_command 设置,如果你的公主死了,你无法获取 .history 文件,而这些文件告诉 Postgres 发生了时间线变更。没有这些文件,你就无法恢复数据。

所以请确保安全,备份你的数据库(并测试你的备份)!