Deploying a Highly Available RKE2 Cluster in an Offline Environment


Introduction

This guide describes the concrete method and steps for deploying an RKE2 cluster in an offline (air-gapped) environment, and uses keepalived with the VRRP protocol to provide high availability for the control-plane services.

Planning

Domain name

| Name | Purpose |
| --- | --- |
| rke.tech.spaceicloud.com | Resolves the master (control-plane) address |

Virtual IP address

| Address | Purpose |
| --- | --- |
| 172.16.255.19 | High availability for the 3 master servers |

Servers

| Name | IP | CPU | Memory | Disk | OS |
| --- | --- | --- | --- | --- | --- |
| rke-master-1 | 172.16.255.11 | 4C | 8GB | 100GB | openEuler-24.03 |
| rke-master-2 | 172.16.255.12 | 4C | 8GB | 100GB | openEuler-24.03 |
| rke-master-3 | 172.16.255.13 | 4C | 8GB | 100GB | openEuler-24.03 |
| rke-worker-1 | 172.16.255.21 | 8C | 16GB | 100GB | openEuler-24.03 |
| rke-worker-2 | 172.16.255.22 | 8C | 16GB | 100GB | openEuler-24.03 |

Related materials

openEuler-24.03

www.openeuler.org/zh/download…

rke2

github.com/rancher/rke…

Preparation

Prepare the offline materials

1. Prepare the openEuler-24.03-LTS-everything-x86_64-dvd.iso image file, which provides a reasonably complete yum repository for the operating system in the offline environment.
2. Prepare the RKE2 offline artifact files (rke2-images.linux-amd64.tar.zst, rke2.linux-amd64.tar.gz, sha256sum-amd64.txt), which will be used to install RKE2 offline.

System initialization (all servers)

Stop and disable the firewall
1. Run the following commands to stop and disable the system firewall:
sudo systemctl stop firewalld
sudo systemctl disable firewalld
Configure NetworkManager to ignore calico/flannel network interfaces
1. Create the file /etc/NetworkManager/conf.d/rke2-canal.conf with the following content:
[keyfile]  
unmanaged-devices=interface-name:cali*;interface-name:flannel*
2. Run the following command to apply the change:
sudo systemctl reload NetworkManager
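Once RKE2 and Canal are actually running (later in this guide), you can confirm the rule took effect: the cali*/flannel* interfaces should show up as unmanaged.
nmcli device status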
[Optional] Install the iscsi and nfs tool packages

If you plan to deploy Longhorn in the cluster later, you must also install iscsi-initiator-utils and nfs-utils on all nodes:

sudo yum install iscsi-initiator-utils nfs-utils -y
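Longhorn also expects the iSCSI daemon to be running on every node; a hedged addition (verify against the Longhorn requirements for your version):
# Start iscsid now and at every boot
sudo systemctl enable --now iscsid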
Configure a local yum repository
1. Mount the openEuler-24.03-LTS-everything-x86_64-dvd.iso image:
sudo mkdir /mnt/cdrom
sudo mount /dev/cdrom /mnt/cdrom
2. Remove the online yum repository configuration:
sudo rm /etc/yum.repos.d/openEuler.repo
3. Add the local yum repository by creating /etc/yum.repos.d/everything-media.repo with the following content:
[everything-media]
name=everything
baseurl=file:///mnt/cdrom
enabled=1
gpgcheck=0
gpgkey=file:///mnt/cdrom/RPM-GPG-KEY-openEuler
4. Rebuild the yum cache:
sudo yum makecache
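Two optional follow-ups: the mount above does not survive a reboot, so you may want an fstab entry (the /dev/sr0 device name is an assumption; adjust for your hardware), and yum repolist confirms the local repository is active.
# Persist the ISO mount across reboots (device name is an assumption)
echo '/dev/sr0 /mnt/cdrom iso9660 defaults,ro 0 0' | sudo tee -a /etc/fstab
# Verify the everything-media repo is visible
sudo yum repolist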
Install the tar utility
1. Install the tool:
sudo yum install tar -y
Configure the virtual domain name
1. Add the following hosts record to /etc/hosts, as shown in the command sketch below:
172.16.255.19 rke.tech.spaceicloud.com
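For example, the record can be appended non-interactively:
# Append the HA entry point to the hosts file
echo '172.16.255.19 rke.tech.spaceicloud.com' | sudo tee -a /etc/hosts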
Upload the RKE2 artifacts
1. Create the artifacts directory:
mkdir ~/rke2-artifacts
2. Upload the artifact files:
scp rke2-images.linux-amd64.tar.zst rke2.linux-amd64.tar.gz sha256sum-amd64.txt wangkuan@<SERVERIP>:~/rke2-artifacts/
3. Upload the install script:
scp install.sh wangkuan@<SERVERIP>:~
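Since sha256sum-amd64.txt ships with the release, it is worth verifying the uploaded files before installing (--ignore-missing skips checksum entries for artifacts you did not download):
cd ~/rke2-artifacts
# Each uploaded file should report "OK"
sha256sum --check --ignore-missing sha256sum-amd64.txt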

System initialization (3 master servers)

Provide control-plane high availability with keepalived
1. Install keepalived:
sudo yum install -y keepalived
2. Create the configuration file /etc/keepalived/keepalived.conf with the following content:

Note: on rke-master-2 and rke-master-3, state should be BACKUP and priority should be 80 and 50 respectively (see the example after the configuration).

global_defs {
   router_id LVS_DEVEL
}
vrrp_instance VI_1 {
    state MASTER
    interface ens18
    virtual_router_id 86
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 3333
    }
    virtual_ipaddress {
        172.16.255.19
    }
}
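As noted above, the BACKUP nodes differ only in state and priority. For example, the vrrp_instance section on rke-master-2 would be (rke-master-3 is identical except priority 50):
vrrp_instance VI_1 {
    state BACKUP
    interface ens18
    virtual_router_id 86
    priority 80
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 3333
    }
    virtual_ipaddress {
        172.16.255.19
    }
}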
3. Start the service:
sudo systemctl start keepalived
4. Enable the service at boot:
sudo systemctl enable keepalived

Installation

The first master node

Installation configuration
1. Create the directory:
sudo mkdir -p /etc/rancher/rke2
2. Create /etc/rancher/rke2/config.yaml with the following content:
token: my-shared-secret
tls-san:  
  - rke.tech.spaceicloud.com
node-taint:  
  - "CriticalAddonsOnly=true:NoExecute"
Install RKE2
1. Run the installer:
sudo INSTALL_RKE2_ARTIFACT_PATH=$HOME/rke2-artifacts sh install.sh
2. Start the service:
sudo systemctl start rke2-server.service
3. Enable the service at boot:
sudo systemctl enable rke2-server.service
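The first start can take several minutes while the offline image tarball is imported; you can follow progress in the service log:
sudo journalctl -u rke2-server.service -f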

The second and third master nodes

Installation configuration
1. Create the directory:
sudo mkdir -p /etc/rancher/rke2
2. Create /etc/rancher/rke2/config.yaml with the following content:
server: https://rke.tech.spaceicloud.com:9345
token: my-shared-secret
tls-san:
  - rke.tech.spaceicloud.com
node-taint:
  - "CriticalAddonsOnly=true:NoExecute"
Install RKE2
1. Run the installer:
sudo INSTALL_RKE2_ARTIFACT_PATH=$HOME/rke2-artifacts sh install.sh
2. Start the service:
sudo systemctl start rke2-server.service
3. Enable the service at boot:
sudo systemctl enable rke2-server.service

Joining the worker nodes

Installation configuration
1. Create the directory:
sudo mkdir -p /etc/rancher/rke2
2. Create /etc/rancher/rke2/config.yaml with the following content:
server: https://rke.tech.spaceicloud.com:9345
token: my-shared-secret
Install RKE2
1. Run the installer:
sudo INSTALL_RKE2_ARTIFACT_PATH=/home/wangkuan/rke2-artifacts INSTALL_RKE2_TYPE="agent" sh install.sh
2. Start the service:
sudo systemctl start rke2-agent.service
3. Enable the service at boot:
sudo systemctl enable rke2-agent.service
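To confirm the agent started and registered with the cluster, check the unit state and follow its log:
sudo systemctl status rke2-agent.service
sudo journalctl -u rke2-agent.service -f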

Post-installation checks

Status checks
1. Check the node status with the kubectl tool:
$ sudo /var/lib/rancher/rke2/bin/kubectl get nodes --kubeconfig /etc/rancher/rke2/rke2.yaml
NAME           STATUS   ROLES                       AGE   VERSION
rke-master-1   Ready    control-plane,etcd,master   64m   v1.31.2+rke2r1
rke-master-2   Ready    control-plane,etcd,master   27m   v1.31.2+rke2r1
rke-master-3   Ready    control-plane,etcd,master   19m   v1.31.2+rke2r1
rke-worker-1   Ready    <none>                      61s   v1.31.2+rke2r1
rke-worker-2   Ready    <none>                      54s   v1.31.2+rke2r1
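To avoid repeating the full binary path and the --kubeconfig flag, you can export both for the current (root) shell; a convenience sketch, not part of the original steps:
export PATH=$PATH:/var/lib/rancher/rke2/bin
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
kubectl get nodes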
2. Check the Pod status with the kubectl tool:
$ sudo /var/lib/rancher/rke2/bin/kubectl get pod -A --kubeconfig /etc/rancher/rke2/rke2.yaml
NAMESPACE     NAME                                                    READY   STATUS      RESTARTS   AGE
kube-system   cloud-controller-manager-rke-master-1                   1/1     Running     0          65m
kube-system   cloud-controller-manager-rke-master-2                   1/1     Running     0          30m
kube-system   cloud-controller-manager-rke-master-3                   1/1     Running     0          21m
kube-system   etcd-rke-master-1                                       1/1     Running     0          65m
kube-system   etcd-rke-master-2                                       1/1     Running     0          29m
kube-system   etcd-rke-master-3                                       1/1     Running     0          21m
kube-system   helm-install-rke2-canal-qdbrz                           0/1     Completed   0          66m
kube-system   helm-install-rke2-coredns-x9qcw                         0/1     Completed   0          66m
kube-system   helm-install-rke2-ingress-nginx-gbqz4                   0/1     Completed   0          66m
kube-system   helm-install-rke2-metrics-server-qtgkn                  0/1     Completed   0          66m
kube-system   helm-install-rke2-snapshot-controller-crd-t77ln         0/1     Completed   0          66m
kube-system   helm-install-rke2-snapshot-controller-s42hm             0/1     Completed   2          66m
kube-system   helm-install-rke2-snapshot-validation-webhook-vjf5d     0/1     Completed   0          66m
kube-system   kube-apiserver-rke-master-1                             1/1     Running     0          65m
kube-system   kube-apiserver-rke-master-2                             1/1     Running     0          30m
kube-system   kube-apiserver-rke-master-3                             1/1     Running     0          21m
kube-system   kube-controller-manager-rke-master-1                    1/1     Running     0          65m
kube-system   kube-controller-manager-rke-master-2                    1/1     Running     0          30m
kube-system   kube-controller-manager-rke-master-3                    1/1     Running     0          21m
kube-system   kube-proxy-rke-master-1                                 1/1     Running     0          66m
kube-system   kube-proxy-rke-master-2                                 1/1     Running     0          29m
kube-system   kube-proxy-rke-master-3                                 1/1     Running     0          21m
kube-system   kube-proxy-rke-worker-1                                 1/1     Running     0          3m20s
kube-system   kube-proxy-rke-worker-2                                 1/1     Running     0          3m13s
kube-system   kube-scheduler-rke-master-1                             1/1     Running     0          65m
kube-system   kube-scheduler-rke-master-2                             1/1     Running     0          30m
kube-system   kube-scheduler-rke-master-3                             1/1     Running     0          21m
kube-system   rke2-canal-49fb6                                        2/2     Running     0          3m20s
kube-system   rke2-canal-5bhwk                                        2/2     Running     0          66m
kube-system   rke2-canal-d7zrw                                        2/2     Running     0          22m
kube-system   rke2-canal-rm49d                                        2/2     Running     0          3m12s
kube-system   rke2-canal-zc5ts                                        2/2     Running     0          30m
kube-system   rke2-coredns-rke2-coredns-6dbd4f7dd4-78mt7              1/1     Running     0          66m
kube-system   rke2-coredns-rke2-coredns-6dbd4f7dd4-m5rcq              1/1     Running     0          3m
kube-system   rke2-coredns-rke2-coredns-autoscaler-84766cf644-wxlxv   1/1     Running     0          66m
kube-system   rke2-ingress-nginx-controller-46txl                     1/1     Running     0          2m51s
kube-system   rke2-ingress-nginx-controller-v5stm                     1/1     Running     0          2m51s
kube-system   rke2-metrics-server-7c85d458bd-bppdq                    1/1     Running     0          3m
kube-system   rke2-snapshot-controller-65bc6fbd57-t24fr               1/1     Running     0          2m45s
kube-system   rke2-snapshot-validation-webhook-859c7896df-kpn2s       1/1     Running     0          2m56s

High availability test

1. Run the ip address command on rke-master-1. You can see that the VIP 172.16.255.19 currently resides on rke-master-1; you can also run ip address on the other two master nodes to check their IP state.
$ ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute 
       valid_lft forever preferred_lft forever
2: ens18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether bc:24:11:3f:a8:32 brd ff:ff:ff:ff:ff:ff
    inet 172.16.255.11/24 brd 172.16.255.255 scope global noprefixroute ens18
       valid_lft forever preferred_lft forever
    inet 172.16.255.19/32 scope global proto 0x12 ens18
       valid_lft forever preferred_lft forever
    inet6 fe80::be24:11ff:fe3f:a832/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
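If the client machine has not been set up yet, a minimal sketch: copy the kubeconfig from any master and point its server entry at the HA domain (the source node, user, and paths are assumptions; the client also needs the rke.tech.spaceicloud.com hosts record or equivalent DNS):
mkdir -p ~/.kube
scp root@172.16.255.11:/etc/rancher/rke2/rke2.yaml ~/.kube/config
# rke2.yaml points at https://127.0.0.1:6443 by default; redirect it to the VIP domain
sed -i 's/127.0.0.1/rke.tech.spaceicloud.com/' ~/.kube/config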
2. Now use the kubectl tool on a client machine to access the cluster through the VIP 172.16.255.19 and check the node status.
$ kubectl get nodes
NAME           STATUS   ROLES                       AGE   VERSION
rke-master-1   Ready    control-plane,etcd,master   18h   v1.31.2+rke2r1
rke-master-2   Ready    control-plane,etcd,master   17h   v1.31.2+rke2r1
rke-master-3   Ready    control-plane,etcd,master   17h   v1.31.2+rke2r1
rke-worker-1   Ready    <none>                      17h   v1.31.2+rke2r1
rke-worker-2   Ready    <none>                      17h   v1.31.2+rke2r1
3. Run the following command on rke-master-1 to shut it down, simulating a node going offline:
sudo shutdown -h now
4. Check the node status from the client again. rke-master-1 is now NotReady, but the cluster control plane remains available.
$ kubectl get nodes
NAME           STATUS     ROLES                       AGE   VERSION
rke-master-1   NotReady   control-plane,etcd,master   18h   v1.31.2+rke2r1
rke-master-2   Ready      control-plane,etcd,master   17h   v1.31.2+rke2r1
rke-master-3   Ready      control-plane,etcd,master   17h   v1.31.2+rke2r1
rke-worker-1   Ready      <none>                      17h   v1.31.2+rke2r1
rke-worker-2   Ready      <none>                      17h   v1.31.2+rke2r1
5. Run ip address on rke-master-2: the VIP 172.16.255.19 has failed over to rke-master-2, which means the control plane is now being served by rke-master-2.
$ ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute 
       valid_lft forever preferred_lft forever
2: ens18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether bc:24:11:9b:9d:26 brd ff:ff:ff:ff:ff:ff
    inet 172.16.255.12/24 brd 172.16.255.255 scope global noprefixroute ens18
       valid_lft forever preferred_lft forever
    inet 172.16.255.19/32 scope global proto 0x12 ens18
       valid_lft forever preferred_lft forever
    inet6 fe80::be24:11ff:fe9b:9d26/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
6. Run the following command on rke-master-2 to shut it down, simulating a second node going offline:
sudo shutdown -h now
7. Check the node status from the client again. The control plane is now unavailable and the server returns an etcdserver request timeout. This is expected: a three-member etcd cluster tolerates the loss of only one member, and with two masters down the remaining member cannot reach quorum.
$ kubectl get nodes
Error from server: etcdserver: request timed out
8. Power rke-master-1 back on. Once it has finished booting, check the node status from the client again.
$ kubectl get nodes
NAME           STATUS     ROLES                       AGE   VERSION
rke-master-1   Ready      control-plane,etcd,master   18h   v1.31.2+rke2r1
rke-master-2   NotReady   control-plane,etcd,master   17h   v1.31.2+rke2r1
rke-master-3   Ready      control-plane,etcd,master   17h   v1.31.2+rke2r1
rke-worker-1   Ready      <none>                      17h   v1.31.2+rke2r1
rke-worker-2   Ready      <none>                      17h   v1.31.2+rke2r1
9. Power rke-master-2 back on. Once it has finished booting, check the node status from the client again.
$ kubectl get nodes
NAME           STATUS   ROLES                       AGE   VERSION
rke-master-1   Ready    control-plane,etcd,master   18h   v1.31.2+rke2r1
rke-master-2   Ready    control-plane,etcd,master   17h   v1.31.2+rke2r1
rke-master-3   Ready    control-plane,etcd,master   17h   v1.31.2+rke2r1
rke-worker-1   Ready    <none>                      17h   v1.31.2+rke2r1
rke-worker-2   Ready    <none>                      17h   v1.31.2+rke2r1
10. This concludes the high availability test; all servers are back to normal.