通过时间线变更恢复本文演示了在 PostgreSQL 中可以通过时间线变更（timeline change）执行时间点恢

通过时间线变更恢复

摘要： 本文演示了在 PostgreSQL 中可以通过时间线变更（timeline change）执行时间点恢复（Point-In-Time Recovery，PITR），这与常见假设相反。作者通过一个实际示例展示了如何备份 PostgreSQL 集群、提升副本，以及成功恢复包括提升后发生的事务在内的数据。

原文链接

我总是听说，一旦执行了切换（switchover）或故障转移（failover），就需要创建新的 Postgres 集群备份。我猜这就是为什么我以为无法通过时间线变更进行 PITR 的原因。今天，我学会了可以通过时间线变更进行 PITR！

一些背景

假设你有一个女王（elinor）和一个公主（merida）。如果你不熟悉女王/公主/工作节点这些术语，请阅读 Tapoueh 的博客文章了解相关信息。

创建备份并添加一些数据

首先，使用 pg_basebackup 创建 Postgres 集群的简单备份：

elinor:$ pg_basebackup -D /var/lib/postgresql/backup -Ft -z -P

并将其发送到公主那里，以防万一：

elinor:$ rsync -r /var/lib/postgresql/backup merida:/var/lib/postgresql/backup

在 elinor 上创建一个简单的测试表：

test=# create table test(id integer generated always as identity, value text);
CREATE TABLE

在表中添加一些数据：

test=# insert into test(value) values ('a');
INSERT 0 1
test=# insert into test(value) values ('a');
INSERT 0 1
test=# insert into test(value) values ('a');
INSERT 0 1

你可以在 merida（公主）上查询这些数据：

test=# select * from test;
 id | value 
 ----+-------
  1 | a
  2 | a
  3 | a
(3 rows)

提升公主

现在，关闭女王并提升公主：

elinor:$ pg_ctlcluster 12 main stop

merida:$ pg_ctlcluster 12 main promote

然后从你的新女王（merida）的测试表中添加一些行：

test=# insert into test(value) values ('b');
INSERT 0 1
test=# insert into test(value) values ('b');
INSERT 0 1

恢复备份

现在，让我们停止 merida 并清理集群。

merida:$ pg_dropcluster 12 main --stop

让我们恢复备份：

merida$: mkdir /var/lib/postgresql/12/main
merida$: tar xzf base.tar.gz -C /var/lib/postgresql/12/main
merida$: touch /var/lib/postgresql/12/main/recovery.signal
merida$: chown -R postgres:postgres /var/lib/postgresql/12/main

我们只需要启动 Postgres：

merida:$ pg_ctlcluster 12 main start

我的数据回来了吗？

test=# select * from test;
 id | value 
----+-------
  1 | a
  2 | a
  3 | a
  4 | b
  5 | b
(5 rows)

是的！一切都回来了。没有数据丢失！

Postgres 做了什么？

这是我的启动日志（是的，我的日志前缀应该更好一些）：

2020-03-24 13:27:36.835 UTC [7826] LOG:  starting PostgreSQL 12.2 (Ubuntu 12.2-2.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0, 64-bit
2020-03-24 13:27:36.835 UTC [7826] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2020-03-24 13:27:36.835 UTC [7826] LOG:  listening on IPv6 address "::", port 5432
2020-03-24 13:27:36.842 UTC [7826] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-03-24 13:27:36.869 UTC [7827] LOG:  database system was interrupted; last known up at 2020-03-24 10:07:35 UTC
2020-03-24 13:27:36.889 UTC [7827] LOG:  restored log file "00000002.history" from archive
rsync: link_stat "/var/lib/postgresql/wals/00000003.history" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:36.891 UTC [7827] LOG:  starting archive recovery
2020-03-24 13:27:36.894 UTC [7827] LOG:  reststartored log file "00000002.history" from archive
rsync: link_stat "/var/lib/postgresql/wals/00000002000000000000000E" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:36.955 UTC [7827] LOG:  restored log file "00000001000000000000000E" from archive
2020-03-24 13:27:36.985 UTC [7827] LOG:  redo starts at 0/E000028
2020-03-24 13:27:36.994 UTC [7827] LOG:  consistent recovery state reached at 0/E000100
2020-03-24 13:27:36.995 UTC [7826] LOG:  database system is ready to accept read only connections
rsync: link_stat "/var/lib/postgresql/wals/00000002000000000000000F" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:37.046 UTC [7827] LOG:  restored log file "00000001000000000000000F" from archive
rsync: link_stat "/var/lib/postgresql/wals/000000020000000000000010" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:37.120 UTC [7827] LOG:  restored log file "000000010000000000000010" from archive
2020-03-24 13:27:37.198 UTC [7827] LOG:  restored log file "000000020000000000000011" from archive
rsync: link_stat "/var/lib/postgresql/wals/000000020000000000000012" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:37.211 UTC [7827] LOG:  redo done at 0/11000400
2020-03-24 13:27:37.211 UTC [7827] LOG:  last completed transaction was at log time 2020-03-24 10:09:38.087444+00
2020-03-24 13:27:37.274 UTC [7827] LOG:  restored log file "000000020000000000000011" from archive
rsync: link_stat "/var/lib/postgresql/wals/00000003.history" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:37.288 UTC [7827] LOG:  selected new timeline ID: 3
2020-03-24 13:27:37.314 UTC [7827] LOG:  archive recovery complete
2020-03-24 13:27:37.317 UTC [7827] LOG:  restored log file "00000002.history" from archive
2020-03-24 13:27:37.382 UTC [7826] LOG:  database system is ready to accept connections

Postgres 发现发生了时间线变更（通过 00000002.history 文件）
Postgres 发现没有 00000003.history 文件，这意味着当前时间线是 2（所以在此次恢复结束时其时间线编号将是 3）
Postgres 在时间线 2 中查找 WAL 文件 000000000000000E，但找不到
Postgres 在时间线 1 中查找 WAL 文件 000000000000000E，找到了它并恢复它，到达恢复点 0/E000100
Postgres 继续恢复 WAL 文件 00000001000000000000000F、000000010000000000000010 和 000000020000000000000011
Postgres 找不到文件 000000020000000000000012
Postgres 已完成恢复
Postgres 更改时间线编号
数据库就绪！

为什么需要在提升后备份？

首先，故障转移意味着生产环境中的异常。一旦发生异常，备份总是明智之举。如果是切换，情况会有所不同，备份仍然是一个好主意。

其次，我见过太多次公主身上没有 archive_command 设置，如果你的公主死了，你无法获取 .history 文件，而这些文件告诉 Postgres 发生了时间线变更。没有这些文件，你就无法恢复数据。