某生产系统KingbaseES读写分离集群主库出现故障,导致集群主备发生切换。
KingbaseES读写分离集群基本信息:
| KingbaseES集群信息 | |
|---|---|
| 操作系统 | Linux7.5 |
| CPU | 海光x86 |
| 数据库版本 | V008R006C006B0021 |
| 集群方案 | 读写分离集群 |
| 节点数 | 2 |
| 主节点 | 10.10.10.11 |
| 备节点 | 10.10.10.12 |
问题现象:
生产环境KingbaseES读写分离集群主库突然发生故障,集群发生切换(主备切换).
数据库主节点日志信息:
2023-01-09 04:15:24 CST [2707066]: [164-1] user=,db=,app=,client=LOG: server process (PID 1761621) was terminated by signal 6: Aborted
操作系统日志信息:
Jan 9 04:15:24 xyfwoadb1 systemd-coredump[1764978]: Process 1761621 (kingbase) of user 1001 dumped core.#012#012Stack trace of thread 1761621:#012#0 0x00007efe049995ab n/a (n/a)
通过数据库主节点日志、操作系统日志可以确定数据库PID 1761621进程异常中止并发生coredump.
# coredump产生的文件.
core.kingbase.1001.9e5401f0dac843a6b07236782e840f1d.1761621.1673208915000000000000
故障分析:
通过主节点hamgr.log日志可以看出,在2023-01-09 04:12:32至2023-01-09 04:15:26之间,服务器进程发生crash.
# 数据库主节点hamgr.log
[2023-01-09 04:02:29] [INFO] monitoring primary node "node11" (ID: 1) in normal state
[2023-01-09 04:07:31] [INFO] monitoring primary node "node11" (ID: 1) in normal state
[2023-01-09 04:12:32] [INFO] monitoring primary node "node11" (ID: 1) in normal state
WARNING: terminating connection because of crash of another server process
DETAIL: The kingbase has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
[2023-01-09 04:15:26] [WARNING] unable to ping "host=10.10.10.11 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2023-01-09 04:15:26] [DETAIL] PQping() returned "PQPING_REJECT"
[2023-01-09 04:15:26] [WARNING] connection to node "node11" (ID: 1) lost
[2023-01-09 04:15:26] [DETAIL]
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
[2023-01-09 04:15:26] [INFO] attempting to reconnect to node "node11" (ID: 1)
[2023-01-09 04:15:26] [ERROR] connection to database failed
[2023-01-09 04:15:26] [DETAIL]
FATAL: the database system is in recovery mode
[2023-01-09 04:15:26] [DETAIL] attempted to connect using:
user=esrep connect_timeout=10 dbname=esrep host=10.10.10.11 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr options=-csearch_path=
[2023-01-09 04:15:26] [WARNING] reconnection to node "node11" (ID: 1) failed
[2023-01-09 04:15:26] [WARNING] unable to connect to local node
[2023-01-09 04:15:26] [INFO] sleeping 6 seconds until next reconnection attempt
[2023-01-09 04:15:32] [INFO] checking state of node 1, 1 of 10 attempts
[2023-01-09 04:15:33] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=10.10.10.11 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2023-01-09 04:15:33] [DETAIL] PQping() returned "PQPING_REJECT"
[2023-01-09 04:15:33] [INFO] sleeping 6 seconds until next reconnection attempt
[2023-01-09 04:15:39] [INFO] checking state of node 1, 2 of 10 attempts
[2023-01-09 04:15:39] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=10.10.10.11 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2023-01-09 04:15:39] [DETAIL] PQping() returned "PQPING_REJECT"
[2023-01-09 04:15:39] [INFO] sleeping 6 seconds until next reconnection attempt
[2023-01-09 04:15:45] [INFO] checking state of node 1, 3 of 10 attempts
[2023-01-09 04:15:45] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=10.10.10.11 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
通过数据库主节点kbha.log发现在2023-01-09 04:15:25主节点10.10.10.11不能正常访问
# 数据库主节点kbha.log
[2023-01-09 04:15:25] [NOTICE] PING 10.10.10.1 (10.10.10.1) 56(84) bytes of data.
--- 10.10.10.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1016ms
rtt min/avg/max/mdev = 0.225/0.242/0.259/0.017 ms
[2023-01-09 04:15:25] [WARNING] unable to ping "host=10.10.10.11 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2023-01-09 04:15:25] [DETAIL] PQping() returned "PQPING_REJECT"
[2023-01-09 04:15:28] [NOTICE] PING 10.10.10.1 (10.10.10.1) 56(84) bytes of data.
--- 10.10.10.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1014ms
rtt min/avg/max/mdev = 0.241/0.254/0.267/0.013 ms
[2023-01-09 04:15:33] [WARNING] unable to ping "host=10.10.10.11 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2023-01-09 04:15:33] [DETAIL] PQping() returned "PQPING_REJECT"
[2023-01-09 04:15:36] [NOTICE] PING 10.10.10.1 (10.10.10.1) 56(84) bytes of data.
--- 10.10.10.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1020ms
rtt min/avg/max/mdev = 0.211/0.285/0.360/0.074 ms
[2023-01-09 04:15:36] [WARNING] unable to ping "host=10.10.10.11 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2023-01-09 04:15:36] [DETAIL] PQping() returned "PQPING_REJECT"
[2023-01-09 04:15:39] [NOTICE] PING 10.10.10.1 (10.10.10.1) 56(84) bytes of data.
--- 10.10.10.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1023ms
rtt min/avg/max/mdev = 0.113/0.183/0.254/0.070 ms
[2023-01-09 04:15:39] [WARNING] unable to ping "host=10.10.10.11 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2023-01-09 04:15:39] [DETAIL] PQping() returned "PQPING_REJECT"
[2023-01-09 04:15:42] [NOTICE] PING 10.10.10.1 (10.10.10.1) 56(84) bytes of data.
--- 10.10.10.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1026ms
rtt min/avg/max/mdev = 0.186/0.212/0.238/0.026 ms
[2023-01-09 04:15:42] [WARNING] unable to ping "host=10.10.10.11 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2023-01-09 04:15:42] [DETAIL] PQping() returned "PQPING_REJECT"
[2023-01-09 04:15:45] [NOTICE] PING 10.10.10.1 (10.10.10.1) 56(84) bytes of data.
主节点数据库日志:
通过主节点数据库日志发现如下信息:
1.主节点在2023-01-09 04:15:15的时候发生的故障,由1761621进程导致.
2.The failed archive command was.数据库归档失败.
3.unable to write ... No space left on device.服务器磁盘没有空闲空间,不能正常写文件.
4.磁盘没有空闲空间数据库备份一直失败.
5.应用执行sql,写临时文件的时候没有空间.could not write to file "sys_wal/xlogtemp.1761621": No space left on device.由于data空间被撑满.(任何数据修改操作,都会先写wal临时文件)无法写wal/xlogtemp.pid临时文件.
6.没有空闲空间写临时文件.数据库进行关闭操作.kingbase ran into a problem it couldn't handle,it needs to be shutdown to prevent damage to your data.并发生coredump.
关于"sys_wal/xlogtemp.1761621"临时文件
XlogFileCopy函数调用:调用XLogFilePath函数获取源XLOG文件名,OpenTransientFile打开该文件,创建并打开一个临时XLOG文件pg_wal/xlogtemp.pid,pid为startup进程的ID号。sizeof(buffer)为一页大小8192字节,从源文件每次read一页数据并将之write到xlogtemp文件,最后一页数据如果不够8192字节,则有多少读取多少并写入文件。当文件拷贝完成后,执行一次sync。最后调用InstallXLogFileSegment将文件重命名。
完整的问题分析结果:
1.询问现场同事备份失败原因.专门存放备份的/backup磁盘空间满了,备份任务连续几天都备份失败,没有及时进行处理.
2.由于备份磁盘满了,连续几天备份任务一直失败、不能及时清理wal日志.导致wal日志堆积,并把data空间撑满.
3.应用执行update更新sql,由于data空间没有空闲空间,无法写wal/xlogtemp.pid临时文件(任何数据修改操作,都会先写wal临时文件).could not write to file "sys_wal/xlogtemp.1761621": No space left on device.
4.data磁盘空间满不能写临时wal日志文件.应用更新进程crash数据库自动关闭.kingbase ran into a problem it couldn't handle,it needs to be shutdown to prevent damage to your data.并发生coredump.
5.以上问题导致数据库进程崩溃,并发生切换.server process (PID 1761621) was terminated by signal 6: Aborted.
总结
1.小问题没有及时处理差点变成一次严重的生产事故(故障如果在备份空间满的时候及时进行处理,就不会出现后面的主库故障导致集群发生切换).同时也警醒我们小问题也可能会导致更加严重的问题.
2.每天按时查看系统运行情况,定时任务执行情况.避免数据库系统发生死锁、定时任务没有执行或者执行失败.
3.同时也提醒我们,必须对重要的系统进行及时有效的监控,出现问题的时候必须及时的进行处理(避免小问题变成大故障). 更多信息,参见help.kingbase.com.cn/v8/index.ht…