Deploying Ceph for a Kubernetes Cluster with Rook: A Hands-On Validation

Background

I have recently been looking into building a production Kubernetes cluster. A core question is how to set up a CSI that provides efficient, highly available storage for the whole cluster. After some research I decided to try the Rook + Ceph approach, with the following goals:

  1. Provide persistent-volume provisioning for stateful middleware (MySQL, Redis, MQ, etc.);
  2. Store attachments and other basic files;

To validate the feasibility of this approach, I need a basic understanding of Rook and Ceph, focusing on the following:

  1. Functionality: how to build a new Ceph cluster and make it available to other Pods in the cluster;
  2. Performance: using MySQL as an example, whether the network latency introduced by Ceph has an acceptable impact on SQL performance;
  3. High availability: capacity overcommitment and multi-node fault tolerance from the distributed design, and testing how node failures affect business continuity;
  4. Backup and data safety: how to recover data manually if the cluster is destroyed, producing an emergency response plan;

These capabilities are then validated in a development environment; for the environment and software versions, see:

"Building a Kubernetes Cluster in an Ubuntu 22.04 Development Environment"

The Ubuntu 22.04 development environment runs in a virtual machine on Win10 + VMware Workstation 16.

The main procedure follows the official Rook documentation:

rook.github.io/docs/rook/l…

1. Setting Up Rook

1.1. Prerequisites

According to the prerequisites in the official docs (rook.github.io/docs/rook/l…), a few spare disks need to be available, so I first add three 10 GB disks in VMware:

(Screenshots: adding a new 10 GB virtual disk in VMware Workstation)

Repeat the configuration above three times, creating ceph-0 / ceph-1 / ceph-2.

(Screenshot: the VM settings now list the three new disks)

That adds three more disks; click OK to confirm.

After rebooting Ubuntu, fdisk -l shows that the three raw disks sdb / sdc / sdd have been recognized. Following the prerequisites doc, lsblk -f also lists the three raw disks with nothing on them, which meets the requirements.

Then install lvm2:

sudo apt-get install -y lvm2

Check that the Linux kernel supports the RBD module: run sudo modprobe rbd and make sure it does not report "not found".
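
For reference, a quick pre-flight check along these lines (assuming the new disks show up as sdb / sdc / sdd, as on my VM):

# Raw disks should show an empty FSTYPE column (no filesystem, no partitions)
lsblk -f /dev/sdb /dev/sdc /dev/sdd

# The kernel RBD module should load without an error; lsmod confirms it is present
sudo modprobe rbd
lsmod | grep rbd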

1.2. Installation

1.2.1. Pre-pulling the images

The required images are listed below; due to network restrictions they have to be downloaded in advance:

docker.io/rook/ceph:v1.13.1
quay.io/ceph/ceph:v18.2.1
quay.io/cephcsi/cephcsi:v3.10.1
registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.9.1
registry.k8s.io/sig-storage/csi-resizer:v1.9.2
registry.k8s.io/sig-storage/csi-attacher:v4.4.2
registry.k8s.io/sig-storage/csi-snapshotter:v6.3.2
registry.k8s.io/sig-storage/csi-provisioner:v3.6.2

To download them, pull through a proxy with external access, for example:

# If the current user has no permission to talk to containerd, run as root
https_proxy=socks5://192.168.154.1:1080 ctr -n=k8s.io i pull registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.9.1

Once pulled, to make things easier in production later, be sure to keep a copy: export these images as archive files and store them safely:

ctr -n=k8s.io i export ceph:v1.13.1.img docker.io/rook/ceph:v1.13.1
ctr -n=k8s.io i export ceph:v18.2.1.img quay.io/ceph/ceph:v18.2.1
ctr -n=k8s.io i export cephcsi:v3.10.1.img quay.io/cephcsi/cephcsi:v3.10.1
ctr -n=k8s.io i export csi-node-driver-registrar:v2.9.1.img registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.9.1
ctr -n=k8s.io i export csi-resizer:v1.9.2.img registry.k8s.io/sig-storage/csi-resizer:v1.9.2
ctr -n=k8s.io i export csi-attacher:v4.4.2.img registry.k8s.io/sig-storage/csi-attacher:v4.4.2
ctr -n=k8s.io i export csi-snapshotter:v6.3.2.img registry.k8s.io/sig-storage/csi-snapshotter:v6.3.2
ctr -n=k8s.io i export csi-provisioner:v3.6.2.img registry.k8s.io/sig-storage/csi-provisioner:v3.6.2

Later, on the production cluster, these images can then be imported offline:

ctr -n=k8s.io i import ceph:v1.13.1.img
ctr -n=k8s.io i import ceph:v18.2.1.img
ctr -n=k8s.io i import cephcsi:v3.10.1.img
ctr -n=k8s.io i import csi-node-driver-registrar:v2.9.1.img
ctr -n=k8s.io i import csi-resizer:v1.9.2.img
ctr -n=k8s.io i import csi-attacher:v4.4.2.img
ctr -n=k8s.io i import csi-snapshotter:v6.3.2.img
ctr -n=k8s.io i import csi-provisioner:v3.6.2.img
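
Since the list is long, a small loop can do the pull and export in one pass. This is just a sketch: the proxy address is the one used above, and the tarball names are illustrative:

#!/usr/bin/env bash
# Pull each required image through the proxy, then export it for offline import later
IMAGES=(
  docker.io/rook/ceph:v1.13.1
  quay.io/ceph/ceph:v18.2.1
  quay.io/cephcsi/cephcsi:v3.10.1
  registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.9.1
  registry.k8s.io/sig-storage/csi-resizer:v1.9.2
  registry.k8s.io/sig-storage/csi-attacher:v4.4.2
  registry.k8s.io/sig-storage/csi-snapshotter:v6.3.2
  registry.k8s.io/sig-storage/csi-provisioner:v3.6.2
)
for img in "${IMAGES[@]}"; do
  file="$(basename "$img" | tr ':' '-').img"   # e.g. ceph-v1.13.1.img
  https_proxy=socks5://192.168.154.1:1080 ctr -n=k8s.io i pull "$img"
  ctr -n=k8s.io i export "$file" "$img"
done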

1.2.2. Running the installation

Note that the Rook release and the git tag correspond; don't change them arbitrarily, or the image versions may also need to change accordingly. (If you want the latest release, that is fine too: just follow the same procedure against the latest official instructions.)

$ git clone --single-branch --branch v1.13.1 https://github.com/rook/rook.git
cd rook/deploy/examples
kubectl create -f crds.yaml -f common.yaml -f operator.yaml
kubectl create -f cluster.yaml

When applying cluster.yaml above, in a mainland-China environment the batch of images under registry.k8s.io cannot be pulled; import them into containerd in advance as described above.

cd deploy/examples
kubectl create -f crds.yaml -f common.yaml -f operator.yaml
# verify the rook-ceph-operator is in the `Running` state before proceeding
kubectl -n rook-ceph get pod
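
Rather than polling the pod list by hand, you can also block until the operator reports Ready (filtering on the app=rook-ceph-operator label the operator pod carries):

# Wait up to five minutes for the rook-ceph-operator pod to become Ready
kubectl -n rook-ceph wait --for=condition=Ready pod -l app=rook-ceph-operator --timeout=300s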

Looking at the pods created after running this, the output is:

NAME                                           READY   STATUS    RESTARTS      AGE
csi-cephfsplugin-7jq8m                         2/2     Running   0             33m
csi-cephfsplugin-provisioner-cc86b75b8-bmccf   5/5     Running   2 (33m ago)   33m
csi-rbdplugin-ngf97                            2/2     Running   0             33m
csi-rbdplugin-provisioner-5bc465bbcc-mfl7s     5/5     Running   0             33m
rook-ceph-operator-5d9455bdcd-vz8j4            1/1     Running   0             3h1m

Compared with the official docs, three kinds of pods are missing: rook-ceph-mon, rook-ceph-mgr, and rook-ceph-osd. The docs suggest consulting the common issues page; my guess is that the dev machine currently has only one master node, which cannot accommodate them.

The docs also give three conditions that a healthy cluster must satisfy; meeting them is the next task:

  1. Enough mon (monitor) daemons are running;
  2. There is an active mgr (manager) daemon;
  3. At least three OSDs are up and in;

See this section: rook.github.io/docs/rook/l…

1.2.3. Creating worker nodes

With only one master node, this state obviously cannot be reached, so I deployed three additional worker nodes, detached the three raw disks created earlier from the master node, and attached one to each worker node.

For the deployment steps, see: juejin.cn/post/732047…

After bringing them up and sorting out the usual basic issues, we effectively have one master node and three worker nodes.

Most of the Pods have come up; roughly, things look like this:

$ k get po -n=rook-ceph -owide
NAME                                                  READY   STATUS             RESTARTS         AGE     IP                NODE        NOMINATED NODE   READINESS GATES
csi-cephfsplugin-7jq8m                                2/2     Running            2 (9h ago)       2d16h   192.168.154.128   alfred-pc   <none>           <none>
csi-cephfsplugin-db4n5                                2/2     Running            4 (8h ago)       9h      192.168.154.131   k8s-wk1     <none>           <none>
csi-cephfsplugin-gxbqp                                2/2     Running            4 (8h ago)       37h     192.168.154.130   k8s-wk0     <none>           <none>
csi-cephfsplugin-provisioner-cc86b75b8-b4s2h          5/5     Running            2 (7h56m ago)    8h      192.168.74.149    k8s-wk1     <none>           <none>
csi-cephfsplugin-provisioner-cc86b75b8-bmccf          5/5     Running            7 (9h ago)       2d16h   192.168.179.244   alfred-pc   <none>           <none>
csi-cephfsplugin-qwnt6                                2/2     Running            6 (8h ago)       9h      192.168.154.132   k8s-wk2     <none>           <none>
csi-rbdplugin-5rqtw                                   2/2     Running            4 (8h ago)       37h     192.168.154.130   k8s-wk0     <none>           <none>
csi-rbdplugin-lvglc                                   2/2     Running            6 (8h ago)       9h      192.168.154.132   k8s-wk2     <none>           <none>
csi-rbdplugin-ngf97                                   2/2     Running            2 (9h ago)       2d16h   192.168.154.128   alfred-pc   <none>           <none>
csi-rbdplugin-provisioner-5bc465bbcc-hw7sh            5/5     Running            1 (7h56m ago)    8h      192.168.74.147    k8s-wk1     <none>           <none>
csi-rbdplugin-provisioner-5bc465bbcc-mfl7s            5/5     Running            5 (9h ago)       2d16h   192.168.179.247   alfred-pc   <none>           <none>
csi-rbdplugin-rj5ng                                   2/2     Running            4 (8h ago)       9h      192.168.154.131   k8s-wk1     <none>           <none>
rook-ceph-crashcollector-alfred-pc-5477c9b8f7-vfn8d   1/1     Running            0                9h      192.168.179.255   alfred-pc   <none>           <none>
rook-ceph-crashcollector-k8s-wk0-78c58c869b-2rdmx     1/1     Running            0                8h      192.168.129.213   k8s-wk0     <none>           <none>
rook-ceph-crashcollector-k8s-wk1-7b4859f6cc-64nr7     1/1     Running            0                8h      192.168.74.145    k8s-wk1     <none>           <none>
rook-ceph-crashcollector-k8s-wk2-79c56475c8-vjftb     1/1     Running            0                8h      192.168.36.11     k8s-wk2     <none>           <none>
rook-ceph-exporter-alfred-pc-cb476c5d5-nl9sv          1/1     Running            0                9h      192.168.179.199   alfred-pc   <none>           <none>
rook-ceph-exporter-k8s-wk0-c6787bc7b-n4w5v            1/1     Running            0                8h      192.168.129.212   k8s-wk0     <none>           <none>
rook-ceph-exporter-k8s-wk1-6c4868c665-w9zmw           1/1     Running            1 (7h35m ago)    8h      192.168.74.146    k8s-wk1     <none>           <none>
rook-ceph-exporter-k8s-wk2-dbb57f6f8-ch7w5            1/1     Running            0                8h      192.168.36.10     k8s-wk2     <none>           <none>
rook-ceph-mgr-a-78d945bf94-v9z9l                      3/3     Running            0                9h      192.168.179.254   alfred-pc   <none>           <none>
rook-ceph-mgr-b-6d979dbd4f-979rd                      2/3     CrashLoopBackOff   8 (7h35m ago)    5h33m   192.168.36.19     k8s-wk2     <none>           <none>
rook-ceph-mon-a-6c8d6bb8b6-qbhbg                      2/2     Running            0                9h      192.168.179.251   alfred-pc   <none>           <none>
rook-ceph-mon-c-75645448db-vnlkb                      2/2     Running            0                8h      192.168.74.144    k8s-wk1     <none>           <none>
rook-ceph-mon-d-5578547686-fn9jd                      2/2     Running            1 (8h ago)       8h      192.168.36.13     k8s-wk2     <none>           <none>
rook-ceph-operator-5d9455bdcd-jtg2x                   1/1     Running            1 (9h ago)       37h     192.168.179.246   alfred-pc   <none>           <none>
rook-ceph-osd-0-7456b4f497-r67sl                      1/2     CrashLoopBackOff   21 (7h35m ago)   8h      192.168.36.12     k8s-wk2     <none>           <none>
rook-ceph-osd-1-785d897f79-k87tw                      1/2     CrashLoopBackOff   16 (7h31m ago)   8h      192.168.74.148    k8s-wk1     <none>           <none>
rook-ceph-osd-2-6f6dd79b78-sx2g9                      1/2     CrashLoopBackOff   16 (7h35m ago)   8h      192.168.129.211   k8s-wk0     <none>           <none>
rook-ceph-osd-prepare-alfred-pc-ncsp6                 0/1     Completed          0                8h      192.168.179.201   alfred-pc   <none>           <none>
rook-ceph-osd-prepare-k8s-wk0-hhg94                   0/1     Completed          0                8h      192.168.129.215   k8s-wk0     <none>           <none>
rook-ceph-osd-prepare-k8s-wk1-h9jkl                   0/1     Completed          0                8h      192.168.74.151    k8s-wk1     <none>           <none>
rook-ceph-osd-prepare-k8s-wk2-krxmh                   0/1     Completed          0                8h      192.168.36.14     k8s-wk2     <none>           <none>

Checking against the three health conditions:

First: enough mon daemons

mon-a, mon-c, and mon-d are healthy, which is enough.

As for why there is no mon-b, let's also check which nodes mon-a, mon-c, and mon-d ended up on.
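
One way to check is to filter on the app=rook-ceph-mon label that Rook puts on the mon pods:

k get po -n=rook-ceph -l app=rook-ceph-mon -owide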

It turns out that a, c, and d run on the master, wk1, and wk2 respectively; no mon was scheduled onto wk0. For now let's assume three mons are enough, which is why no more were started.

Second: an active mgr daemon

There are two mgr pods: mgr-a (on the master) and mgr-b (on wk2). mgr-b is in CrashLoopBackOff, but since mgr-a is active this condition technically counts as met.

Still, let's check why mgr-b keeps crashing: k describe po -n=rook-ceph rook-ceph-mgr-b-6d979dbd4f-979rd

The Events show that the real problem is:

Warning Unhealthy 7h44m (x2 over 7h56m) kubelet Startup probe failed: command "env -i sh -c \noutp=\"$(ceph --admin-daemon /run/ceph/ceph-mgr.b.asok status 2>&1)\"\nrc=$?\nif [ $rc -ne 0 ]; then\n\techo \"ceph daemon health check failed with the following output:\"\n\techo \"$outp\" | sed -e 's/^/> /g'\n\texit $rc\nfi\n" timed out

Looking more closely at the node's state, there was an OOM kill: still not enough memory. After increasing each of the three workers from 1 GB to 2 GB of RAM, both mgrs recovered.
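
As a side note, the OOM kill itself can be confirmed on the affected worker from the kernel log, for example:

# OOM-killer events show up in the kernel ring buffer
dmesg -T | grep -iE 'killed process|out of memory'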

Third: at least three healthy OSDs

There are three OSD pods: osd-0, osd-1, and osd-2, scheduled on wk2, wk1, and wk0 respectively.

But all three are in CrashLoopBackOff, so let's look at that too: k describe po -n=rook-ceph rook-ceph-osd-1-785d897f79-k87tw

Warning BackOff 7h29m (x370 over 8h) kubelet Back-off restarting failed container osd in pod rook-ceph-osd-1-785d897f79-k87tw_rook-ceph(a7de3be1-d118-4919-a547-8ba6c392a494)

Nothing obvious there. Logging into wk0 and checking the kubelet with systemctl status kubelet -l, a problem shows up:

1月 07 01:33:43 k8s-wk0 kubelet[17787]: I0107 01:33:43.198786 17787 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"host-dev\" (UniqueName: \"kubernetes.io/host-path/2d25b699-6c99-4f53-863f-9c61161c831c-host-dev\") pod \"csi-cephfsplugin-gxbqp\" (UID: \"2d25b699-6c99-4f53-863f-9c61161c831c\") " pod="rook-ceph/csi-cephfsplugin-gxbqp"

It looks like verifying the volume attachments ran into some trouble. Since memory was just added, let's try restarting the kubelet with systemctl restart kubelet; checking its status again afterwards, it turns out to be healthy:

1月 07 01:33:43 k8s-wk0 kubelet[17787]: I0107 01:33:43.410776   17787 scope.go:117] "RemoveContainer" containerID="ece95d18c5be16767ff35af8f7a120844b126c6b9ec04f28c2ee09dbea54296f"
1月 07 01:33:43 k8s-wk0 kubelet[17787]: I0107 01:33:43.412747   17787 kubelet_resources.go:45] "Allocatable" allocatable={"cpu":"1","ephemeral-storage":"17394Mi","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"1941632Ki","pods":"110"}
1月 07 01:33:43 k8s-wk0 kubelet[17787]: I0107 01:33:43.412857   17787 kubelet_resources.go:45] "Allocatable" allocatable={"cpu":"1","ephemeral-storage":"17394Mi","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"1941632Ki","pods":"110"}
1月 07 01:33:43 k8s-wk0 kubelet[17787]: I0107 01:33:43.412882   17787 kubelet_resources.go:45] "Allocatable" allocatable={"cpu":"1","ephemeral-storage":"17394Mi","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"1941632Ki","pods":"110"}
1月 07 01:33:43 k8s-wk0 kubelet[17787]: I0107 01:33:43.412901   17787 kubelet_resources.go:45] "Allocatable" allocatable={"cpu":"1","ephemeral-storage":"17394Mi","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"1941632Ki","pods":"110"}

Back on the master, running k get po -owide again shows that this OSD pod is now up:

rook-ceph          rook-ceph-osd-2-6f6dd79b78-sx2g9                      2/2     Running                  22 (7h36m ago)   9h      192.168.129.211   k8s-wk0     <none>           <none>

Doing the same on wk1 and wk2 brings everything online.

1.2.4. Debugging and management tools

With the Ceph cluster created, we need the ceph command to test it and debug the Ceph services.

This requires rook-ceph-tools; install it as follows.

cd rook/deploy/examples
kubectl create -f toolbox.yaml

deployment.apps/rook-ceph-tools created

$ k get po -A | grep ceph-tools
rook-ceph          rook-ceph-tools-66b77b8df5-m2fj8                      1/1     Running     0                77s

A new rook-ceph-tools pod is now running; once inside it, the ceph command can be used to operate on the Ceph cluster.

Enter the pod:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
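
One-off commands also work without an interactive shell, for example:

kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status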

Inside, the following CLI commands can be run:

  • ceph status
  • ceph health detail
  • ceph osd status
  • ceph osd df
  • ceph osd utilization
  • ceph osd pool stats
  • ceph osd tree
  • ceph pg stat

Here is some sample output from these commands:

1. ceph status
bash-4.4$ ceph status
  cluster:
    id:     77a08ac4-65a9-406e-b983-3ce1ab8b4ffb
    health: HEALTH_WARN
            clock skew detected on mon.c, mon.d
            Degraded data redundancy: 2/6 objects degraded (33.333%), 1 pg degraded, 1 pg undersized
            2 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum a,c,d (age 12m)
    mgr: a(active, since 13h), standbys: b
    osd: 3 osds: 2 up (since 11m), 2 in (since 19s)
 
  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 449 KiB
    usage:   55 MiB used, 20 GiB / 20 GiB avail
    pgs:     2/6 objects degraded (33.333%)
             1 active+undersized+degraded

There are warnings; let's read them literally first, then work out how to fix them:

  1. Clock skew detected
  2. Degraded data redundancy
  3. Two daemons have recently crashed

Interesting: the pods all look up, yet the cluster is still unhealthy. So the three health conditions above are only necessary conditions; to be sufficient, you still have to watch the status reported by the ceph commands.

The clock skew problem

This is mainly NTP not being synchronized. Check time synchronization on each node with timedatectl status:

$ timedatectl status
               Local time: 日 2024-01-07 14:15:34 CST
           Universal time: 日 2024-01-07 06:15:34 UTC
                 RTC time: 日 2024-01-07 06:15:34
                Time zone: Asia/Shanghai (CST, +0800)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

The output above, from the master node, looks correct.

[root@k8s-wk1 ~]# timedatectl status
      Local time:  2024-01-07 11:17:28 CST
  Universal time:  2024-01-07 03:17:28 UTC
        RTC time:  2024-01-07 05:35:52
       Time zone: Asia/Shanghai (CST, +0800)
     NTP enabled: yes
NTP synchronized: no
 RTC in local TZ: no
      DST active: n/a

Checking the worker nodes, wk0 and wk2 are fine; only wk1 shows the situation above: NTP is enabled but not synchronized, and the clock is off by quite a margin.

So install ntp on that node and get it working properly:

yum install -y ntp ntpdate
systemctl enable ntpd
systemctl stop ntpd
ntpd -gq
hwclock -w
systemctl start ntpd

After that, the warning is cleared.
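
To double-check, the mon clock situation can be inspected from the toolbox pod with standard Ceph commands (not specific to Rook):

ceph time-sync-status
ceph health detail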

The two crashed daemons

This means that one or more Ceph daemons have crashed recently and an administrator has not yet archived (acknowledged) the crash. It may indicate a software bug, a hardware problem (for example, a failing disk), or some other issue.

The fix is to review these crashes and archive (acknowledge) them manually, after which the status returns to normal.

bash-4.4$ ceph crash ls-new
ID                                                                ENTITY                NEW  
2024-01-06T15:43:06.114309Z_e5f1efcd-6430-4077-af9c-85b47c364f11  client.ceph-exporter   *   
2024-01-06T17:07:14.789138Z_0e3836e1-aca4-4d41-835e-57c2ef69a028  client.ceph-exporter   *   
bash-4.4$ ceph crash archive-all

After this, ceph status returns to health: HEALTH_OK.

2. ceph osd status
bash-4.4$ ceph osd status
ID  HOST      USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE      
 0  k8s-wk2  27.6M  9.97G      0        0       0        0   exists,up  
 1  k8s-wk1  28.1M  9.97G      0        0       0        0   exists,up  
 2  k8s-wk0  28.0M  9.97G      0        0       0        0   exists,up  

It looks like all three 10 GB disks are up and running.

3. ceph osd df
bash-4.4$ ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA     OMAP     META    AVAIL   %USE  VAR   PGS  STATUS
 2    hdd  0.00980   1.00000  10 GiB   28 MiB  1.1 MiB    1 KiB  27 MiB  10 GiB  0.27  1.00    1      up
 1    hdd  0.00980   1.00000  10 GiB   28 MiB  1.1 MiB    1 KiB  27 MiB  10 GiB  0.27  1.01    1      up
 0    hdd  0.00980   1.00000  10 GiB   28 MiB  1.1 MiB    1 KiB  27 MiB  10 GiB  0.27  0.99    1      up
                       TOTAL  30 GiB   84 MiB  3.2 MiB  3.5 KiB  81 MiB  30 GiB  0.27                   
MIN/MAX VAR: 0.99/1.01  STDDEV: 0.00

This also looks fine; all three disks are recognized.

4. ceph osd utilization
bash-4.4$ ceph osd utilization
avg 1
stddev 0 (expected baseline 0.816497)
min osd.0 with 1 pgs (1 * mean)
max osd.0 with 1 pgs (1 * mean)

This shows the OSD utilization.

5. ceph osd pool stats
bash-4.4$ ceph osd pool stats
pool .mgr id 1
  nothing is going on
6. ceph osd tree
bash-4.4$ ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME         STATUS  REWEIGHT  PRI-AFF
-1         0.02939  root default                               
-3         0.00980      host k8s-wk0                           
 2    hdd  0.00980          osd.2         up   1.00000  1.00000
-5         0.00980      host k8s-wk1                           
 1    hdd  0.00980          osd.1         up   1.00000  1.00000
-7         0.00980      host k8s-wk2                           
 0    hdd  0.00980          osd.0         up   1.00000  1.00000
7. ceph pg stat
bash-4.4$ ceph pg stat
1 pgs: 1 active+clean; 449 KiB data, 88 MiB used, 30 GiB / 30 GiB avail

Together, these give a reasonable overall picture of the Ceph cluster's state.

1.2.5. Ceph Dashboard

TODO:
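
For the record, the usual way to reach the dashboard per the Rook docs (a sketch, assuming the dashboard stays enabled in cluster.yaml, which is the default):

# The dashboard service created by the operator (HTTPS on port 8443 by default)
kubectl -n rook-ceph get svc rook-ceph-mgr-dashboard

# The generated password for the default "admin" user lives in this secret
kubectl -n rook-ceph get secret rook-ceph-dashboard-password -o jsonpath="{.data.password}" | base64 --decode && echo

# Forward the dashboard locally for a quick look
kubectl -n rook-ceph port-forward svc/rook-ceph-mgr-dashboard 8443:8443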

1.2.6. StorageClass

The official docs provide an example StorageClass, storageclass.yaml:

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
   name: rook-ceph-block
# Change "rook-ceph" provisioner prefix to match the operator namespace if needed
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
    # clusterID is the namespace where the rook cluster is running
    clusterID: rook-ceph
    # Ceph pool into which the RBD image shall be created
    pool: replicapool

    # (optional) mapOptions is a comma-separated list of map options.
    # For krbd options refer
    # https://docs.ceph.com/docs/master/man/8/rbd/#kernel-rbd-krbd-options
    # For nbd options refer
    # https://docs.ceph.com/docs/master/man/8/rbd-nbd/#options
    # mapOptions: lock_on_read,queue_depth=1024

    # (optional) unmapOptions is a comma-separated list of unmap options.
    # For krbd options refer
    # https://docs.ceph.com/docs/master/man/8/rbd/#kernel-rbd-krbd-options
    # For nbd options refer
    # https://docs.ceph.com/docs/master/man/8/rbd-nbd/#options
    # unmapOptions: force

    # RBD image format. Defaults to "2".
    imageFormat: "2"

    # RBD image features
    # Available for imageFormat: "2". Older releases of CSI RBD
    # support only the `layering` feature. The Linux kernel (KRBD) supports the
    # full complement of features as of 5.4
    # `layering` alone corresponds to Ceph's bitfield value of "2" ;
    # `layering` + `fast-diff` + `object-map` + `deep-flatten` + `exclusive-lock` together
    # correspond to Ceph's OR'd bitfield value of "63". Here we use
    # a symbolic, comma-separated format:
    # For 5.4 or later kernels:
    #imageFeatures: layering,fast-diff,object-map,deep-flatten,exclusive-lock
    # For 5.3 or earlier kernels:
    imageFeatures: layering

    # The secrets contain Ceph admin credentials.
    csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
    csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
    csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
    csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
    csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
    csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph

    # Specify the filesystem type of the volume. If not specified, csi-provisioner
    # will set default as `ext4`. Note that `xfs` is not recommended due to potential deadlock
    # in hyperconverged settings where the volume is mounted on the same node as the osds.
    csi.storage.k8s.io/fstype: ext4

# Delete the rbd volume when a PVC is deleted
reclaimPolicy: Delete

# Optional, if you want to add dynamic resize for PVC.
# For now only ext3, ext4, xfs resize support provided, like in Kubernetes itself.
allowVolumeExpansion: true

Edit it by hand, save it, and run:

kubectl create -f storageclass.yaml

Once done, the StorageClass has been created:

$ k get sc -owide
NAME              PROVISIONER                  RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
rook-ceph-block   rook-ceph.rbd.csi.ceph.com   Delete          Immediate           true                   73s

The replicapool object created along with it is also visible:

$ k get cephblockpool.ceph.rook.io/replicapool -nrook-ceph
NAME          PHASE
replicapool   Ready
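
As a cross-check from the Ceph side, the new pool should also be visible through the toolbox pod:

# The output should now list "replicapool" next to the built-in .mgr pool
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd pool ls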

1.3. Putting it to use with PVCs

In principle, if the StorageClass is working properly, every PVC that gets declared will automatically be bound to a PV. Let's build an example to test this:

TODO:
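
In the meantime, here is a minimal sketch of what such a test could look like (the PVC name is illustrative; the storageClassName matches the one created above):

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-test-pvc
spec:
  storageClassName: rook-ceph-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF

# If dynamic provisioning works, the claim should reach Bound with an
# automatically created PV behind it
kubectl get pvc rbd-test-pvc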
