## Introduction

This guide walks through deploying an RKE2 cluster in an offline (air-gapped) environment, step by step, and uses keepalived with the VRRP protocol to provide high availability for the cluster's control plane.
## Planning

### Domain name

| Name | Purpose |
|---|---|
| rke.tech.spaceicloud.com | Resolves to the masters' shared (virtual) address |
### Virtual IP address

| Address | Purpose |
|---|---|
| 172.16.255.19 | High availability across the 3 master servers |
### Servers

| Name | IP | CPU | Memory | Disk | OS |
|---|---|---|---|---|---|
| rke-master-1 | 172.16.255.11 | 4C | 8GB | 100GB | openEuler-24.03 |
| rke-master-2 | 172.16.255.12 | 4C | 8GB | 100GB | openEuler-24.03 |
| rke-master-3 | 172.16.255.13 | 4C | 8GB | 100GB | openEuler-24.03 |
| rke-worker-1 | 172.16.255.21 | 8C | 16GB | 100GB | openEuler-24.03 |
| rke-worker-2 | 172.16.255.22 | 8C | 16GB | 100GB | openEuler-24.03 |
### Materials

- openEuler-24.03: www.openeuler.org/zh/download…
- rke2
## Preparation

### Prepare the offline materials

- Prepare the `openEuler-24.03-LTS-everything-x86_64-dvd.iso` image file, which serves as a reasonably complete yum repository for the offline environment.
- Prepare the RKE2 offline artifacts (`rke2-images.linux-amd64.tar.zst`, `rke2.linux-amd64.tar.gz`, `sha256sum-amd64.txt`), used to install RKE2 offline.
## System initialization (all servers)

### Stop and disable the firewall

- Run the following commands to stop and disable the system firewall:

```bash
sudo systemctl stop firewalld
sudo systemctl disable firewalld
```
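You can quickly confirm the firewall is off and will stay off across reboots:

```bash
# Both checks are informational; "inactive" and "disabled" are the expected answers
systemctl is-active firewalld    # expected: inactive
systemctl is-enabled firewalld   # expected: disabled
```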
### Configure NetworkManager to ignore calico/flannel network interfaces

- Create the file `/etc/NetworkManager/conf.d/rke2-canal.conf` with the following content:

```ini
[keyfile]
unmanaged-devices=interface-name:cali*;interface-name:flannel*
```

- Run the following command to apply the change:

```bash
sudo systemctl reload NetworkManager
```
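Later, once RKE2 is running and canal has created its `cali*`/`flannel*` interfaces, you can verify that NetworkManager leaves them alone:

```bash
# The CNI interfaces should appear with STATE "unmanaged"
nmcli device status
```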
### [Optional] Install the iSCSI and NFS packages

If you plan to deploy Longhorn in the cluster later, you must also install iscsi-initiator-utils and nfs-utils on every node:

```bash
sudo yum install iscsi-initiator-utils nfs-utils -y
```
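If you do install them, Longhorn also expects the iSCSI daemon to be running; a minimal follow-up, assuming the service is named `iscsid` as on most RPM-based distributions:

```bash
# Start the iSCSI initiator daemon now and enable it at every boot
sudo systemctl enable --now iscsid
```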
### Configure a local yum repository

- Mount the `openEuler-24.03-LTS-everything-x86_64-dvd.iso` image:

```bash
sudo mkdir /mnt/cdrom
sudo mount /dev/cdrom /mnt/cdrom
```

- Remove the online yum repository configuration:

```bash
sudo rm /etc/yum.repos.d/openEuler.repo
```

- Add a local yum repository by creating `/etc/yum.repos.d/everything-media.repo` with the following content:

```ini
[everything-media]
name=everything
baseurl=file:///mnt/cdrom
enabled=1
gpgcheck=0
gpgkey=file:///mnt/cdrom/RPM-GPG-KEY-openEuler
```

- Rebuild the yum cache:

```bash
sudo yum makecache
```
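Before moving on, it is worth confirming that the local repository actually answers queries:

```bash
# List enabled repositories, then query a package that lives on the DVD
sudo yum repolist
sudo yum info tar
```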
### Install tar

- Install the tool:

```bash
sudo yum install tar -y
```
### Configure the virtual domain name

- Add the following hosts record to `/etc/hosts`:

```
172.16.255.19 rke.tech.spaceicloud.com
```
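A quick check that the name now resolves through `/etc/hosts`:

```bash
getent hosts rke.tech.spaceicloud.com   # expected: 172.16.255.19   rke.tech.spaceicloud.com
```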
### Upload the RKE2 artifacts

- Create a directory for the artifacts:

```bash
mkdir ~/rke2-artifacts
```

- Upload the artifact files (a checksum verification sketch follows this list):

```bash
scp rke2-images.linux-amd64.tar.zst rke2.linux-amd64.tar.gz sha256sum-amd64.txt wangkuan@<SERVERIP>:~/rke2-artifacts/
```

- Upload the install script:

```bash
scp install.sh wangkuan@<SERVERIP>:~
```
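After uploading, you can verify the artifacts on the server against the checksum file, assuming `sha256sum-amd64.txt` lists the artifact filenames as released:

```bash
# --ignore-missing skips checksum entries for files that were not uploaded
cd ~/rke2-artifacts
sha256sum -c sha256sum-amd64.txt --ignore-missing
```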
## System initialization (3 master servers)

### Control-plane high availability with keepalived

- Install keepalived:

```bash
sudo yum install -y keepalived
```

- Create the configuration file `/etc/keepalived/keepalived.conf` with the following content:

> Tip: on the rke-master-2 and rke-master-3 nodes, `state` should be `BACKUP` and `priority` should be 80 and 50 respectively; a full example for a backup node follows the configuration below.
```
global_defs {
    router_id LVS_DEVEL
}

vrrp_instance VI_1 {
    state MASTER
    interface ens18
    virtual_router_id 86
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 3333
    }
    virtual_ipaddress {
        172.16.255.19
    }
}
```
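For reference, the matching configuration for rke-master-2 (rke-master-3 is identical apart from `priority 50`); only `state` and `priority` differ from the master node:

```
global_defs {
    router_id LVS_DEVEL
}

vrrp_instance VI_1 {
    state BACKUP
    interface ens18
    virtual_router_id 86
    priority 80
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 3333
    }
    virtual_ipaddress {
        172.16.255.19
    }
}
```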
- Start the service:

```bash
sudo systemctl start keepalived
```

- Enable it at boot:

```bash
sudo systemctl enable keepalived
```
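To confirm that keepalived has elected a master and bound the VIP:

```bash
# On the current MASTER node this prints the VIP line; on BACKUP nodes it prints nothing
ip -4 addr show dev ens18 | grep 172.16.255.19
```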
## Installation

### First master node

#### Configuration

- Create the configuration directory:

```bash
sudo mkdir -p /etc/rancher/rke2
```

- Create `/etc/rancher/rke2/config.yaml` with the following content. The `node-taint` entry keeps ordinary workloads off the control-plane nodes:

```yaml
token: my-shared-secret
tls-san:
  - rke.tech.spaceicloud.com
node-taint:
  - "CriticalAddonsOnly=true:NoExecute"
```
#### Install RKE2

- Run the installer:

```bash
sudo INSTALL_RKE2_ARTIFACT_PATH=$HOME/rke2-artifacts sh install.sh
```

- Start the service:

```bash
sudo systemctl start rke2-server.service
```

- Enable it at boot:

```bash
sudo systemctl enable rke2-server.service
```
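The first start can take several minutes because RKE2 has to import the container images from the airgap tarball; you can follow progress in the service log:

```bash
sudo journalctl -u rke2-server -f
```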
### Second and third master nodes

#### Configuration

- Create the configuration directory:

```bash
sudo mkdir -p /etc/rancher/rke2
```

- Create `/etc/rancher/rke2/config.yaml` with the following content; `server` points at the existing cluster's supervisor port (9345) through the virtual domain name:

```yaml
server: https://rke.tech.spaceicloud.com:9345
token: my-shared-secret
tls-san:
  - rke.tech.spaceicloud.com
node-taint:
  - "CriticalAddonsOnly=true:NoExecute"
```
#### Install RKE2

- Run the installer:

```bash
sudo INSTALL_RKE2_ARTIFACT_PATH=$HOME/rke2-artifacts sh install.sh
```

- Start the service:

```bash
sudo systemctl start rke2-server.service
```

- Enable it at boot:

```bash
sudo systemctl enable rke2-server.service
```
### Join the worker nodes

#### Configuration

- Create the configuration directory:

```bash
sudo mkdir -p /etc/rancher/rke2
```

- Create `/etc/rancher/rke2/config.yaml` with the following content:

```yaml
server: https://rke.tech.spaceicloud.com:9345
token: my-shared-secret
```
#### Install RKE2

- Run the installer in agent mode:

```bash
sudo INSTALL_RKE2_ARTIFACT_PATH=/home/wangkuan/rke2-artifacts INSTALL_RKE2_TYPE="agent" sh install.sh
```

- Start the service:

```bash
sudo systemctl start rke2-agent.service
```

- Enable it at boot:

```bash
sudo systemctl enable rke2-agent.service
```
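If a worker fails to join, the agent log usually says why (for example a token mismatch or an unreachable `server` URL):

```bash
sudo journalctl -u rke2-agent -f
```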
## Post-installation checks

### Status check

- Check node status with the bundled `kubectl`:

```
$ sudo /var/lib/rancher/rke2/bin/kubectl get nodes --kubeconfig /etc/rancher/rke2/rke2.yaml
NAME           STATUS   ROLES                       AGE   VERSION
rke-master-1   Ready    control-plane,etcd,master   64m   v1.31.2+rke2r1
rke-master-2   Ready    control-plane,etcd,master   27m   v1.31.2+rke2r1
rke-master-3   Ready    control-plane,etcd,master   19m   v1.31.2+rke2r1
rke-worker-1   Ready    <none>                      61s   v1.31.2+rke2r1
rke-worker-2   Ready    <none>                      54s   v1.31.2+rke2r1
```
- Check pod status with the bundled `kubectl`:

```
$ sudo /var/lib/rancher/rke2/bin/kubectl get pod -A --kubeconfig /etc/rancher/rke2/rke2.yaml
NAMESPACE     NAME                                                    READY   STATUS      RESTARTS   AGE
kube-system   cloud-controller-manager-rke-master-1                  1/1     Running     0          65m
kube-system   cloud-controller-manager-rke-master-2                  1/1     Running     0          30m
kube-system   cloud-controller-manager-rke-master-3                  1/1     Running     0          21m
kube-system   etcd-rke-master-1                                      1/1     Running     0          65m
kube-system   etcd-rke-master-2                                      1/1     Running     0          29m
kube-system   etcd-rke-master-3                                      1/1     Running     0          21m
kube-system   helm-install-rke2-canal-qdbrz                          0/1     Completed   0          66m
kube-system   helm-install-rke2-coredns-x9qcw                        0/1     Completed   0          66m
kube-system   helm-install-rke2-ingress-nginx-gbqz4                  0/1     Completed   0          66m
kube-system   helm-install-rke2-metrics-server-qtgkn                 0/1     Completed   0          66m
kube-system   helm-install-rke2-snapshot-controller-crd-t77ln        0/1     Completed   0          66m
kube-system   helm-install-rke2-snapshot-controller-s42hm            0/1     Completed   2          66m
kube-system   helm-install-rke2-snapshot-validation-webhook-vjf5d    0/1     Completed   0          66m
kube-system   kube-apiserver-rke-master-1                            1/1     Running     0          65m
kube-system   kube-apiserver-rke-master-2                            1/1     Running     0          30m
kube-system   kube-apiserver-rke-master-3                            1/1     Running     0          21m
kube-system   kube-controller-manager-rke-master-1                   1/1     Running     0          65m
kube-system   kube-controller-manager-rke-master-2                   1/1     Running     0          30m
kube-system   kube-controller-manager-rke-master-3                   1/1     Running     0          21m
kube-system   kube-proxy-rke-master-1                                1/1     Running     0          66m
kube-system   kube-proxy-rke-master-2                                1/1     Running     0          29m
kube-system   kube-proxy-rke-master-3                                1/1     Running     0          21m
kube-system   kube-proxy-rke-worker-1                                1/1     Running     0          3m20s
kube-system   kube-proxy-rke-worker-2                                1/1     Running     0          3m13s
kube-system   kube-scheduler-rke-master-1                            1/1     Running     0          65m
kube-system   kube-scheduler-rke-master-2                            1/1     Running     0          30m
kube-system   kube-scheduler-rke-master-3                            1/1     Running     0          21m
kube-system   rke2-canal-49fb6                                       2/2     Running     0          3m20s
kube-system   rke2-canal-5bhwk                                       2/2     Running     0          66m
kube-system   rke2-canal-d7zrw                                       2/2     Running     0          22m
kube-system   rke2-canal-rm49d                                       2/2     Running     0          3m12s
kube-system   rke2-canal-zc5ts                                       2/2     Running     0          30m
kube-system   rke2-coredns-rke2-coredns-6dbd4f7dd4-78mt7             1/1     Running     0          66m
kube-system   rke2-coredns-rke2-coredns-6dbd4f7dd4-m5rcq             1/1     Running     0          3m
kube-system   rke2-coredns-rke2-coredns-autoscaler-84766cf644-wxlxv  1/1     Running     0          66m
kube-system   rke2-ingress-nginx-controller-46txl                    1/1     Running     0          2m51s
kube-system   rke2-ingress-nginx-controller-v5stm                    1/1     Running     0          2m51s
kube-system   rke2-metrics-server-7c85d458bd-bppdq                   1/1     Running     0          3m
kube-system   rke2-snapshot-controller-65bc6fbd57-t24fr              1/1     Running     0          2m45s
kube-system   rke2-snapshot-validation-webhook-859c7896df-kpn2s      1/1     Running     0          2m56s
```
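The HA test below runs `kubectl` from a client machine through the VIP. One way to set that up, sketched here under the assumption that the client can resolve rke.tech.spaceicloud.com (for example via its own `/etc/hosts` entry), is to copy the admin kubeconfig from a master and point it at the virtual domain name, which is covered by the `tls-san` entry above; `<CLIENT_IP>` and the client-side path are placeholders:

```bash
# On a master node (as root, since /etc/rancher/rke2/rke2.yaml is readable only by root):
scp /etc/rancher/rke2/rke2.yaml wangkuan@<CLIENT_IP>:~/.kube/config

# On the client: replace the default loopback endpoint with the virtual domain name
sed -i 's/127.0.0.1/rke.tech.spaceicloud.com/' ~/.kube/config
kubectl get nodes
```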
### High availability test

- Run `ip address` on rke-master-1; the VIP 172.16.255.19 currently sits on rke-master-1 (you can also run `ip address` on the other two master nodes to compare):

```
$ ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: ens18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether bc:24:11:3f:a8:32 brd ff:ff:ff:ff:ff:ff
    inet 172.16.255.11/24 brd 172.16.255.255 scope global noprefixroute ens18
       valid_lft forever preferred_lft forever
    inet 172.16.255.19/32 scope global proto 0x12 ens18
       valid_lft forever preferred_lft forever
    inet6 fe80::be24:11ff:fe3f:a832/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
```
- From a client machine, use `kubectl` (with the kubeconfig prepared above) to access the cluster through the VIP 172.16.255.19 and check node status:

```
$ kubectl get nodes
NAME           STATUS   ROLES                       AGE   VERSION
rke-master-1   Ready    control-plane,etcd,master   18h   v1.31.2+rke2r1
rke-master-2   Ready    control-plane,etcd,master   17h   v1.31.2+rke2r1
rke-master-3   Ready    control-plane,etcd,master   17h   v1.31.2+rke2r1
rke-worker-1   Ready    <none>                      17h   v1.31.2+rke2r1
rke-worker-2   Ready    <none>                      17h   v1.31.2+rke2r1
```
- Run the following command on rke-master-1 to shut it down, simulating a node failure:

```bash
sudo shutdown -h now
```
- Check node status from the client again. rke-master-1 is now NotReady, but the control plane remains available:

```
$ kubectl get nodes
NAME           STATUS     ROLES                       AGE   VERSION
rke-master-1   NotReady   control-plane,etcd,master   18h   v1.31.2+rke2r1
rke-master-2   Ready      control-plane,etcd,master   17h   v1.31.2+rke2r1
rke-master-3   Ready      control-plane,etcd,master   17h   v1.31.2+rke2r1
rke-worker-1   Ready      <none>                      17h   v1.31.2+rke2r1
rke-worker-2   Ready      <none>                      17h   v1.31.2+rke2r1
```
- Run `ip address` on rke-master-2; the VIP 172.16.255.19 has failed over to rke-master-2, meaning the control plane is now served by rke-master-2:

```
$ ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: ens18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether bc:24:11:9b:9d:26 brd ff:ff:ff:ff:ff:ff
    inet 172.16.255.12/24 brd 172.16.255.255 scope global noprefixroute ens18
       valid_lft forever preferred_lft forever
    inet 172.16.255.19/32 scope global proto 0x12 ens18
       valid_lft forever preferred_lft forever
    inet6 fe80::be24:11ff:fe9b:9d26/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
```
- Run the following command on rke-master-2 to shut it down, simulating a second node failure:

```bash
sudo shutdown -h now
```
- Check node status from the client once more. The control plane is now unavailable and the server returns an etcdserver timeout error. This is expected: with two of the three etcd members down, etcd has lost quorum (a 3-member cluster needs at least 2 healthy members), so the API server can no longer serve requests:

```
$ kubectl get nodes
Error from server: etcdserver: request timed out
```
- Power rke-master-1 back on; once it has finished booting, check node status from the client again:

```
$ kubectl get nodes
NAME           STATUS     ROLES                       AGE   VERSION
rke-master-1   Ready      control-plane,etcd,master   18h   v1.31.2+rke2r1
rke-master-2   NotReady   control-plane,etcd,master   17h   v1.31.2+rke2r1
rke-master-3   Ready      control-plane,etcd,master   17h   v1.31.2+rke2r1
rke-worker-1   Ready      <none>                      17h   v1.31.2+rke2r1
rke-worker-2   Ready      <none>                      17h   v1.31.2+rke2r1
```
- Power rke-master-2 back on; once it has finished booting, check node status from the client again:

```
$ kubectl get nodes
NAME           STATUS   ROLES                       AGE   VERSION
rke-master-1   Ready    control-plane,etcd,master   18h   v1.31.2+rke2r1
rke-master-2   Ready    control-plane,etcd,master   17h   v1.31.2+rke2r1
rke-master-3   Ready    control-plane,etcd,master   17h   v1.31.2+rke2r1
rke-worker-1   Ready    <none>                      17h   v1.31.2+rke2r1
rke-worker-2   Ready    <none>                      17h   v1.31.2+rke2r1
```
- This concludes the high availability test; all servers are back to normal.