南大通用 GBase 8c: Guide to Recovering a Standby Node After an Operating System Failure in a Primary/Standby Cluster


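Original link: www.gbase.cn/community/p…
More content is available in the 南大通用 GBase technical community. 南大通用 is committed to becoming the database vendor that users trust most.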

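1. Background

Unexpected failures are a routine part of database operations. This article covers one common case: in a primary/standby cluster, the standby node loses all of its data because of an operating system problem. How do you recover?

This can happen when system files are corrupted, when the system cannot boot and has to be reinstalled, or when an operator reinstalls the system by mistake. The question is how to restore the primary/standby cluster quickly and keep the business running.

The sections below walk through the recovery steps on a real environment so that you can handle a similar failure with confidence.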

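2. Solution

Once the standby node's operating system is damaged and all of its data is gone, the primary/standby relationship can no longer be maintained. The failed standby must first be removed from the cluster so that the primary runs as a single node. The standby is then added back to restore the primary/standby cluster.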

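3. Procedure

3.1 Environment

The environment has one primary node and one standby node:

Node      Hostname     IP address      Current role
Node 1    gbasedb12    192.168.1.12    Primary
Node 2    gbasedb11    192.168.1.11    Standby

The failed node is assumed to be gbasedb11, with IP address 192.168.1.11.

Note: the cluster in this environment is deployed without the CM component.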

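3.2 Restore single-node operation

3.2.1 Edit the XML configuration file

First, modify the cluster's XML configuration file: remove the failed standby node's entries and keep only the primary's, turning it into a single-node configuration.

On the primary node, locate the original cluster configuration file. It is usually in the directory used for the original installation.

-- cluster_conf.xml

[Image: contents of cluster_conf.xml]

Make a copy of this XML file, remove the standby node's entries from the copy, and save it as single.xml:

[Image: contents of single.xml]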
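
Before moving on, it may be worth confirming that the edited copy no longer mentions the failed standby. The sketch below is only an illustration: the helper name `xml_mentions_node` is made up for this example, and it does a plain substring search for the standby's hostname and IP address, so it errs on the side of reporting a match.

```shell
# Hypothetical helper (not part of GBase): succeeds if FILE still contains
# the given hostname or IP address as a fixed string.
xml_mentions_node() {
    local file=$1 host=$2 ip=$3
    grep -qF -e "$host" -e "$ip" "$file"
}

# Example usage, run in the directory that holds single.xml:
#   if xml_mentions_node single.xml gbasedb11 192.168.1.11; then
#       echo "single.xml still references the standby node" >&2
#   fi
```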

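3.2.2 Generate and distribute the static configuration file

Use the gs_om tool to generate a new static configuration file. The --distribute option is required; without it, the later gs_om -t refreshconf step fails.

[gbase@gbasedb12 gbase_package]$ gs_om -t generateconf -X single.xml --distribute

On success, the output is: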

Generating static configuration files for all nodes.

Creating temp directory to store static configuration files.

Successfully created the temp directory.

Generating static configuration files.

Successfully generated static configuration files.

Static configuration files for all nodes are saved in /data/install/om/script/static_config_files.

Distributing static configuration files to all nodes.

Successfully distributed static configuration files.

Without --distribute, an error is reported:

Generating dynamic configuration file for all nodes.

[GBASE-50205] : Failed to write dynamic configuration file. Error: 

[GBASE-51230] : The number of master dn must equal to 1.

3.2.3 Refresh the configuration

[gbase@gbasedb12 gbase_package]$ gs_om -t refreshconf

No need to generate dynamic configuration file for one node.
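
The generate and refresh steps can be chained so that the refresh only runs when generation succeeded. This is a minimal sketch: the function name is hypothetical, the two gs_om invocations are the ones shown in sections 3.2.2 and 3.2.3, and it assumes gs_om exits with a non-zero status on failure.

```shell
# Sketch: regenerate and refresh the configuration in one go, stopping at
# the first failure. Assumes gs_om returns a non-zero exit status on error.
rebuild_single_node_conf() {
    local xml=$1
    gs_om -t generateconf -X "$xml" --distribute || return 1
    gs_om -t refreshconf
}

# Example usage as the gbase user on the primary node:
#   rebuild_single_node_conf single.xml && echo "configuration refreshed"
```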

3.2.4 Restart the database

[gbase@gbasedb12 gbase_package]$ gs_om -t start

Starting cluster.

=========================================

[SUCCESS] gbasedb12

2026-03-13 18:44:36.006 69b3ea93.1 [unknown] 140409960601472 [unknown] 0 dn_6001_6002 01000  0 [BACKEND] WARNING:  could not create any HA TCP/IP sockets

=========================================

Successfully started.

3.2.5 Check the cluster status

[gbase@gbasedb12 gbase_package]$ gs_om -t status --detail

[   Cluster State   ]

cluster_state   : Normal

redistributing  : No

current_az      : AZ_ALL

[  Datanode State   ]

    node     node_ip         port      instance                      state

------------------------------------------------------------------------------------------

1  gbasedb12 192.168.1.12    15400      6001 /data/install/data/dn   P Primary Normal

The cluster is now back in single-node mode, and applications can access the primary node normally.
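
For scripted checks, the status output can be tested instead of read by eye. The helper below is a hypothetical sketch that assumes the `cluster_state   : Normal` line format shown in the output above.

```shell
# Hypothetical helper: reads `gs_om -t status --detail` output on stdin and
# succeeds only if the cluster_state line reports Normal.
cluster_state_is_normal() {
    grep -Eq '^[[:space:]]*cluster_state[[:space:]]*:[[:space:]]*Normal[[:space:]]*$'
}

# Example usage:
#   gs_om -t status --detail | cluster_state_is_normal && echo "cluster is Normal"
```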

3.3 Re-add the standby node

3.3.1 Operating system configuration on the standby node

1. Configure the hosts file

cat >> /etc/hosts << EOF
192.168.1.12  gbasedb12
192.168.1.11  gbasedb11
EOF
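
Because `>>` appends every time it runs, repeating the step above leaves duplicate entries in /etc/hosts. A small guard avoids that; the helper name below is made up for illustration.

```shell
# Hypothetical helper: append LINE to FILE only if an identical line is not
# already present (-x matches the whole line, -F treats it as a fixed string).
append_line_once() {
    local line=$1 file=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example usage as root on the standby node:
#   append_line_once '192.168.1.12  gbasedb12' /etc/hosts
#   append_line_once '192.168.1.11  gbasedb11' /etc/hosts
```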

2. Disable the firewall

systemctl status firewalld

systemctl disable firewalld.service

systemctl stop firewalld.service

3. Disable SELinux

sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config

setenforce 0

getenforce
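
The sed command above rewrites the file only when it currently says SELINUX=enforcing; a host set to permissive would be left unchanged. A slightly more general variant is sketched below, with a hypothetical function name, using the same `sed -i` in-place editing as the original command.

```shell
# Hypothetical helper: force the SELINUX= setting in CONFIG to "disabled",
# whatever its current value. Comment lines and SELINUXTYPE= are untouched
# because the pattern is anchored at "SELINUX=" from the start of the line.
selinux_set_disabled() {
    local config=$1
    sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$config"
}

# Example usage as root (the file setting takes effect after a reboot):
#   selinux_set_disabled /etc/selinux/config
```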

4. Configure the locale

echo 'export LANG=en_US.UTF-8' >> /etc/profile

source /etc/profile

echo $LANG

5. Disable swap

swapoff -a
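
Note that `swapoff -a` only lasts until the next reboot: any swap entry in /etc/fstab is activated again at boot. One way to make the change stick is to comment those entries out. The function below is a sketch with a hypothetical name; it keeps a `.bak` copy of the file it edits.

```shell
# Hypothetical helper: comment out active swap entries in an fstab-style
# file so that swap stays off after a reboot. A line counts as a swap entry
# when it is not already a comment and has "swap" as a whitespace-delimited
# field. The original file is kept as FILE.bak.
fstab_disable_swap() {
    local fstab=$1
    sed -i.bak -E '/^[[:space:]]*#/! s/^(.*[[:space:]]swap[[:space:]].*)$/#\1/' "$fstab"
}

# Example usage as root:
#   fstab_disable_swap /etc/fstab
```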

6. Disable transparent huge pages

echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo 'echo never > /sys/kernel/mm/transparent_hugepage/defrag' >> /etc/rc.d/rc.local
echo 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' >> /etc/rc.d/rc.local
chmod +x /etc/rc.d/rc.local
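
The kernel marks the active setting in these files with square brackets, for example `always madvise [never]`, so the result can be checked directly. The helper name below is made up for illustration.

```shell
# Hypothetical helper: succeeds if the given transparent_hugepage file shows
# "never" as the selected value, i.e. enclosed in square brackets.
thp_is_never() {
    grep -qF '[never]' "$1"
}

# Example usage:
#   for f in enabled defrag; do
#       thp_is_never "/sys/kernel/mm/transparent_hugepage/$f" \
#           || echo "transparent huge pages: $f is still active" >&2
#   done
```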

7. Disable RemoveIPC

sed -i '/^RemoveIPC/d' /etc/systemd/logind.conf
sed -i '/^RemoveIPC/d' /usr/lib/systemd/system/systemd-logind.service
echo "RemoveIPC=no" >> /etc/systemd/logind.conf
echo "RemoveIPC=no" >> /usr/lib/systemd/system/systemd-logind.service

systemctl daemon-reload
systemctl restart systemd-logind
# Verify the setting
loginctl show-session | grep RemoveIPC
systemctl show systemd-logind | grep RemoveIPC

8. Install dependency packages

yum install -y libaio-devel flex bison ncurses-devel glibc-devel patch \
readline-devel expect ntp sudo openssh-clients cronie ethtool file \
libtool-ltdl gdb psmisc lsof dnf libffi libffi-devel bzip2

9. Configure kernel parameters

Append the following to `/etc/sysctl.conf`:

echo "# add for gbase
net.ipv4.ip_forward = 1
net.ipv4.tcp_max_tw_buckets = 10000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_retries1 = 5
net.ipv4.tcp_syn_retries = 5
net.ipv4.tcp_synack_retries = 5
net.ipv4.tcp_retries2 = 12
vm.overcommit_memory = 0
net.ipv4.tcp_rmem = 8192 250000 16777216
net.ipv4.tcp_wmem = 8192 250000 16777216
net.core.wmem_max = 21299200
net.core.rmem_max = 21299200
net.core.wmem_default = 21299200
net.core.rmem_default = 21299200
net.ipv4.ip_local_port_range = 26000 65535
kernel.sem = 250 6400000 1000 25600
net.core.somaxconn = 65535
net.ipv4.tcp_syncookies = 1
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_fin_timeout = 60
kernel.shmall = 4066499
kernel.shmmax = 16656379903
net.ipv4.tcp_sack = 1
net.ipv4.tcp_timestamps = 1
vm.extfrag_threshold = 500
vm.overcommit_ratio = 90
vm.swappiness = 0
" >> /etc/sysctl.conf

sysctl -p
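
The kernel.shmall and kernel.shmmax values in the block above look host-specific: 4066499 pages of 4096 bytes come to 16656379904 bytes, exactly one more than the kernel.shmmax shown. On a machine with a different amount of memory, the two values can be kept consistent in the same way. The sketch below only does that arithmetic; the function name is made up, and the 4096-byte default page size is an assumption that can be checked with `getconf PAGE_SIZE`.

```shell
# Hypothetical helper: given kernel.shmall (in pages) and the page size in
# bytes, print the matching kernel.shmmax following the pattern seen in the
# values above (shmall * page_size - 1).
shmmax_for_shmall() {
    local pages=$1 page_size=${2:-4096}
    echo $(( pages * page_size - 1 ))
}

# Example usage:
#   shmmax_for_shmall 4066499 "$(getconf PAGE_SIZE)"
```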

10. Configure resource limits

echo "# add for gbase
* soft nofile 1000000
* hard nofile 1000000
* soft nproc 655360
* hard nproc 655360
* soft memlock unlimited
* hard memlock unlimited
* soft core unlimited
* hard core unlimited
* soft stack unlimited
* hard stack unlimited
" >> /etc/security/limits.d/90-nproc.conf

11. Create the gbase user

groupadd gbase

useradd -m -d /home/gbase gbase -g gbase

echo "Database@123" | passwd gbase --stdin

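12. Configure sudo privileges

# -E enables extended regular expressions so that "+" means "one or more";
# the parentheses are escaped so that they match literally.
sed -i.bak -E '/^root[[:space:]]+ALL=\(ALL\)[[:space:]]+ALL$/a gbase   ALL=(ALL)       NOPASSWD:ALL' /etc/sudoers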
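
sed exits with status 0 even when no line matches its address, so a quick check that the rule is really present can catch a silent miss. The helper below is a hypothetical sketch that looks for the exact rule added in this step.

```shell
# Hypothetical helper: succeeds if FILE contains an
# "ALL=(ALL) NOPASSWD:ALL" rule for USER.
sudoers_has_nopasswd_rule() {
    local user=$1 file=$2
    grep -Eq "^${user}[[:space:]]+ALL=\(ALL\)[[:space:]]+NOPASSWD:[[:space:]]*ALL[[:space:]]*$" "$file"
}

# Example usage as root:
#   sudoers_has_nopasswd_rule gbase /etc/sudoers || echo "sudo rule for gbase is missing" >&2
```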

13. Configure SSH trust
Run the following as both the root user and the gbase user:

ssh-keygen -t rsa
ssh-copy-id gbasedb11
ssh-copy-id gbasedb12
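
Once the keys are copied, passwordless login can be confirmed without waiting for a later step to fail. This is a minimal sketch with a hypothetical function name; `BatchMode=yes` makes ssh fail rather than prompt for a password.

```shell
# Hypothetical helper: succeeds only if every listed host accepts a
# key-based login without prompting.
ssh_trust_ok() {
    local host
    for host in "$@"; do
        ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true || return 1
    done
}

# Example usage (run once as root and once as gbase):
#   ssh_trust_ok gbasedb11 gbasedb12 && echo "SSH trust is working"
```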

14. Create the data directory

mkdir -p /data/install/data/

chmod 755 -R /data/

chown gbase:gbase -R /data/

3.3.2 Run preinstallation on the primary node

On the primary node, run the preinstallation command as root:

cd /home/gbase/gbase_package
./script/gs_preinstall -U gbase -G gbase -X /home/gbase/gbase_package/cluster_conf.xml

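3.3.3 Run expansion on the primary node

As root, run the expansion command to add the new standby node:

cd script/

./gs_expansion -U gbase -G gbase -X /home/gbase/gbase_package/cluster_conf.xml -h 192.168.1.11

# Note: 192.168.1.11 is the IP address of the standby node

On success, the output is: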

The cluster no need create SSH trust

Start expansion without cluster manager component.

Start to preinstall database on new nodes.

Start to send soft to each standby nodes.

End to send soft to each standby nodes.

Success to send XML to new nodes

Start to preinstall database step.

Preinstall 192.168.1.11 success

End to preinstall database step.

End to preinstall database on new nodes.

Start to install database on new nodes.

192.168.1.11 install success.

Finish to install database on all nodes.

Database on standby nodes installed finished.

Checking gbased and gs_om version.

End to check gbased and gs_om version.

Start to establish the relationship.

Start to build standby 192.168.1.11.

Build standby 192.168.1.11 success.

Start to generate and send cluster static file.

End to generate and send cluster static file.

Expansion results:

192.168.1.11:   Success

Expansion Finish.

3.3.4 Verify the cluster status

gs_om -t status --detail

Confirm that the primary/standby relationship is back to normal:

[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
current_az      : AZ_ALL

[  Datanode State   ]

    node     node_ip         port      instance                      state
------------------------------------------------------------------------------------------
1  gbasedb12 192.168.1.12    15400      6001 /data/install/data/dn   P Primary Normal
2  gbasedb11 192.168.1.11    15400      6002 /data/install/data/dn   S Standby Normal
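
The same check can be scripted by counting datanode rows that end in `S Standby Normal`. The helper below is a hypothetical sketch that assumes the row format shown in the output above.

```shell
# Hypothetical helper: reads `gs_om -t status --detail` output on stdin and
# prints how many datanode rows end in "S Standby Normal".
count_normal_standbys() {
    # grep -c prints 0 but exits non-zero when nothing matches; "|| true"
    # keeps the function usable under "set -e".
    grep -Ec 'S[[:space:]]+Standby[[:space:]]+Normal[[:space:]]*$' || true
}

# Example usage:
#   [ "$(gs_om -t status --detail | count_normal_standbys)" -eq 1 ] \
#       && echo "standby is attached and Normal"
```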

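4. Summary

When the standby node of a primary/standby cluster loses all of its data because of an operating system problem, recovery comes down to two main steps:

  1. Remove the standby node: edit the XML configuration, generate a new static configuration file, and refresh the configuration so that the primary runs as a single node and the business keeps running.
  2. Add the standby node back: complete the basic operating system configuration on the standby node, then run preinstallation and expansion to re-establish the primary/standby relationship.

Throughout the standby's recovery, the primary node keeps running normally, so its workload is unaffected and no data is lost. With this procedure in hand, you can handle similar failures calmly and quickly restore the cluster's high availability.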
