Deploying a Ceph cluster on Rocky 9 (with ceph-ansible)


Note: this is not the officially recommended installation method; the official recommendation is cephadm.

Server preparation

OS        Hostname  Public IP        Cluster IP
Rocky9.2  node1     192.168.202.129  192.168.142.128
Rocky9.2  node2     192.168.202.130  192.168.142.129
Rocky9.2  node3     192.168.202.131  192.168.142.130
Rocky9.2  node4     192.168.202.132  192.168.142.131
Rocky9.2  node5     192.168.202.134  192.168.142.132
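
The playbooks, and the ceph.conf they render, refer to these machines by hostname, so it helps if every node can resolve every other node. A minimal /etc/hosts fragment for the public network (an assumption of this walkthrough; adjust to your own addresses):

```
192.168.202.129 node1
192.168.202.130 node2
192.168.202.131 node3
192.168.202.132 node4
192.168.202.134 node5
```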

The disks on each server are as follows

[root@node2 ~]# lsblk
NAME            MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sr0              11:0    1  1.5G  0 rom  
nvme0n1         259:0    0   60G  0 disk 
├─nvme0n1p1     259:1    0    1G  0 part /boot
└─nvme0n1p2     259:2    0   59G  0 part 
  ├─rl_192-root 253:0    0   57G  0 lvm  /
  └─rl_192-swap 253:1    0    2G  0 lvm  [SWAP]
nvme0n2         259:3    0   20G  0 disk 
nvme0n3         259:4    0   20G  0 disk

Disable the firewall and SELinux

# Stop the firewall and remove it from boot autostart
systemctl disable firewalld --now

# Disable SELinux (takes effect after a reboot)
sed -i 's#SELINUX=enforcing#SELINUX=disabled#g' /etc/selinux/config 

Ports each Ceph component listens on

Service        Port             Description
Monitor        6789/TCP         communication within the Ceph cluster (3300/TCP is also used for the v2 wire protocol)
Manager        7000/TCP         Ceph Manager dashboard
               8003/TCP         Ceph Manager RESTful API over HTTPS
               9283/TCP         Ceph Manager Prometheus module
OSD            6800-7300/TCP    each OSD uses three ports in this range: one on the public network for talking to clients and monitors, one on the cluster network for sending data to other OSDs, and a third on the cluster network for heartbeats
RADOS Gateway  7480/TCP         default port; configurable, e.g. 80/TCP or 443/TCP
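
This guide simply disables firewalld, but if you would rather keep it running, the table above translates into firewall rules along these lines. This is a sketch: the echo makes it a dry run that only prints the commands; remove it to apply them, then run firewall-cmd --reload.

```shell
# Ports from the table above; trim the list to what each node actually runs
ports=(6789/tcp 3300/tcp 6800-7300/tcp 7000/tcp 8003/tcp 9283/tcp 7480/tcp)
for p in "${ports[@]}"; do
  echo firewall-cmd --permanent --add-port="$p"   # dry run: prints the command
done
```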

Keep server clocks synchronized

Time synchronization matters a great deal in a cluster; clock drift between nodes easily pushes the Ceph cluster into an unhealthy state.

# Enable chronyd at boot and start it now for automatic time synchronization
systemctl enable chronyd --now

Pre-installation preparation

In this example, node1 serves as the control node.

# Install the EPEL repository; some of the packages Ceph needs come from it
dnf install epel-release -y

Clone ceph-ansible

Each stable branch of ceph-ansible supports only specific Ceph releases, so check out the branch matching the release you plan to install (reef in this walkthrough).

git clone https://github.com/ceph/ceph-ansible.git

Install the third-party collections ceph-ansible needs

# Rocky 9 does not ship pip3 by default, so install it first
dnf install python3-pip -y
# Point pip at a domestic mirror (Tsinghua) to speed up downloads
pip3 config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
# Install the required Python packages, ansible among them
pip3 install -r requirements.txt
# requirements.txt pins the ansible version, so ansible does not need to be installed separately

# Install the Ansible collections ceph-ansible depends on
ansible-galaxy install -r requirements.yml

Enable passwordless SSH from node1 to all hosts

ssh-copy-id root@192.168.202.129
ssh-copy-id root@192.168.202.130
ssh-copy-id root@192.168.202.131
ssh-copy-id root@192.168.202.132
ssh-copy-id root@192.168.202.134
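
The five ssh-copy-id calls can equally be driven by a small loop (this assumes a key pair already exists, e.g. from ssh-keygen; the echo keeps it a dry run, drop it to actually copy the keys):

```shell
# All cluster nodes reachable from node1 over the public network
nodes=(192.168.202.129 192.168.202.130 192.168.202.131 192.168.202.132 192.168.202.134)
for ip in "${nodes[@]}"; do
  echo ssh-copy-id "root@${ip}"   # dry run: prints the command for each node
done
```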

Define the Ansible inventory

Note: the group names in the inventory must be mons and mgrs, because the ceph-ansible playbooks reference these exact names (see, for example, the configuration in site.yml.sample).

# cat /etc/ansible/hosts
[mons]
192.168.202.129
192.168.202.130
192.168.202.131

[mgrs]
192.168.202.129
192.168.202.130
192.168.202.131

# If the dashboard is enabled, a monitoring host group must be defined, and monitoring must not be deployed on a node that already runs mgrs
[monitoring]
192.168.202.134

Copy the playbook entry file

[root@node1 ceph-ansible]# cp site.yml.sample site.yml

Edit the group variable files

Only the OSD settings are closely tied to the hardware, so strictly speaking only osds.yml needs customizing; the remaining group files can simply be copied from their samples.

# Enter the group_vars directory
[root@node1 group_vars]# pwd
/root/ceph-ansible/group_vars

# Copy the sample files to .yml files
[root@node1 group_vars]# cp mons.yml.sample mons.yml
[root@node1 group_vars]# cp mgrs.yml.sample mgrs.yml

Edit the Ceph cluster configuration file

all.yml is the cluster-wide configuration file: it defines where the Ceph packages come from (the yum repository) along with a number of Ceph settings.

# Copy the sample configuration file

[root@node1 group_vars]# pwd
/root/ceph-ansible/group_vars
[root@node1 group_vars]# cp all.yml.sample all.yml

# Edit all.yml
[root@node1 group_vars]# grep -vE '^$|^#'  all.yml
---
dummy:
  #fetch_directory: ~/ceph-ansible-keys # directory the cluster keys are fetched to
ntp_service_enabled: false # disable the playbook's NTP handling; chronyd was already configured earlier
ceph_origin: repository  # install Ceph from a package repository
ceph_repository: community  # which repository to use
ceph_mirror: http://mirrors.aliyun.com/ceph
ceph_stable_key: http://mirrors.aliyun.com/ceph/keys/release.asc
ceph_stable_release: reef  # Ceph release to install; must match the ceph-ansible branch checked out earlier, since each branch supports only specific releases
ceph_stable_repo: "{{ ceph_mirror }}/rpm-{{ ceph_stable_release }}"
rbd_cache: "true"
rbd_cache_writethrough_until_flush: "false"
rbd_client_directories: false # this will create rbd_client_log_path and rbd_client_admin_socket_path directories with proper permissions
dashboard_enabled: false
monitor_interface: ens160  # interface the monitors listen on
journal_size: 5120 # OSD journal size in MB
public_network: 192.168.202.0/24    # network clients access the cluster over
cluster_network: 192.168.142.0/24   # internal cluster (replication) network
ceph_conf_overrides:
   global:
     mon_osd_allow_primary_affinity: 1
     mon_clock_drift_allowed: 0.5
     osd_pool_default_size: 2
     osd_pool_default_min_size: 1
     mon_pg_warn_min_per_osd: 0
     mon_pg_warn_max_per_osd: 0
     mon_pg_warn_max_object_skew: 0
   client:
     rbd_default_features: 1
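
monitor_interface assumes every node uses the same NIC name (ens160 here). If interface names differ between nodes, the value can be overridden per host with standard Ansible host_vars layering; a hypothetical sketch:

```yaml
# host_vars/192.168.202.130.yml (hypothetical per-host override)
monitor_interface: ens192
```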

Run the playbook

# Without -e yes_i_know=true the play refuses to run and tells you to migrate to the newer cephadm tooling
ansible-playbook site.yml -e yes_i_know=true

Check the installation result

[root@node1 ~]# ceph -s
  cluster:
    id:     a82e91cb-dedc-4115-b44a-2bc8b7190afe
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
            1 daemons have recently crashed
            6 mgr modules have recently crashed
            OSD count 0 < osd_pool_default_size 2
 
  services:
    mon: 3 daemons, quorum node1,node2,node3 (age 38m)
    mgr: node3(active, since 36m), standbys: node2, node1
    osd: 0 osds: 0 up, 0 in
 
  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     

Add and edit the OSD variable file

# Copy osds.yml from its sample
[root@node1 ceph-ansible]# cp group_vars/osds.yml.sample group_vars/osds.yml

# Edit osds.yml
[root@node1 ceph-ansible]# grep -vE '^$|^#' group_vars/osds.yml
---
dummy:
devices:
  - /dev/nvme0n2
  - /dev/nvme0n3
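
ceph-ansible hands the devices listed above to ceph-volume, which expects clean, unpartitioned disks. A dry-run sketch of sanity checks worth running on each OSD node first (the echo only prints the commands; note that wipefs --no-act itself merely reports signatures without erasing anything):

```shell
# Disks that will become OSDs, matching the devices list above
osd_disks=(/dev/nvme0n2 /dev/nvme0n3)
for d in "${osd_disks[@]}"; do
  echo lsblk "$d"             # should show no partitions or LVM children
  echo wipefs --no-act "$d"   # lists leftover filesystem signatures, writes nothing
done
```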

Add the OSDs to the Ansible inventory

[root@node1 ceph-ansible]# cat /etc/ansible/hosts 
[mons]
192.168.202.129
192.168.202.130
192.168.202.131

[mgrs]
192.168.202.129
192.168.202.130
192.168.202.131

[osds]
192.168.202.129
192.168.202.130
192.168.202.131
192.168.202.132

Run the playbook again

ansible-playbook site.yml -e yes_i_know=true

Check the result

[root@node1 ceph-ansible]# ceph status
  cluster:
    id:     a82e91cb-dedc-4115-b44a-2bc8b7190afe
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
            1 daemons have recently crashed
            9 mgr modules have recently crashed
 
  services:
    mon: 3 daemons, quorum node1,node2,node3 (age 2m)
    mgr: node3(active, since 112s), standbys: node1, node2
    osd: 8 osds: 8 up (since 46s), 8 in (since 57s)
 
  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 449 KiB
    usage:   213 MiB used, 160 GiB / 160 GiB avail
    pgs:     1 active+clean

Check the addresses the daemons are listening on

[root@node1 ceph-ansible]# ss -tunlp
Netid     State       Recv-Q      Send-Q             Local Address:Port           Peer Address:Port     Process                                   
udp       UNCONN      0           0                      127.0.0.1:323                 0.0.0.0:*         users:(("chronyd",pid=800,fd=5))         
udp       UNCONN      0           0                          [::1]:323                    [::]:*         users:(("chronyd",pid=800,fd=6))         
tcp       LISTEN      0           512              192.168.202.129:6789                0.0.0.0:*         users:(("ceph-mon",pid=37151,fd=28))     
tcp       LISTEN      0           512              192.168.202.129:6802                0.0.0.0:*         users:(("ceph-osd",pid=47203,fd=22))     
tcp       LISTEN      0           512              192.168.202.129:6803                0.0.0.0:*         users:(("ceph-osd",pid=47203,fd=23))     
tcp       LISTEN      0           512              192.168.202.129:6800                0.0.0.0:*         users:(("ceph-osd",pid=47203,fd=18))     
tcp       LISTEN      0           512              192.168.202.129:6801                0.0.0.0:*         users:(("ceph-osd",pid=47203,fd=19))     
tcp       LISTEN      0           512              192.168.202.129:6806                0.0.0.0:*         users:(("ceph-osd",pid=48625,fd=22))     
tcp       LISTEN      0           512              192.168.202.129:6807                0.0.0.0:*         users:(("ceph-osd",pid=48625,fd=23))     
tcp       LISTEN      0           512              192.168.202.129:6804                0.0.0.0:*         users:(("ceph-osd",pid=48625,fd=18))     
tcp       LISTEN      0           512              192.168.202.129:6805                0.0.0.0:*         users:(("ceph-osd",pid=48625,fd=19))     
tcp       LISTEN      0           512              192.168.202.129:3300                0.0.0.0:*         users:(("ceph-mon",pid=37151,fd=27))     
tcp       LISTEN      0           128                      0.0.0.0:22                  0.0.0.0:*         users:(("sshd",pid=813,fd=3))            
tcp       LISTEN      0           512              192.168.142.128:6802                0.0.0.0:*         users:(("ceph-osd",pid=47203,fd=24))     
tcp       LISTEN      0           512              192.168.142.128:6803                0.0.0.0:*         users:(("ceph-osd",pid=47203,fd=25))     
tcp       LISTEN      0           512              192.168.142.128:6800                0.0.0.0:*         users:(("ceph-osd",pid=47203,fd=20))     
tcp       LISTEN      0           512              192.168.142.128:6801                0.0.0.0:*         users:(("ceph-osd",pid=47203,fd=21))     
tcp       LISTEN      0           512              192.168.142.128:6806                0.0.0.0:*         users:(("ceph-osd",pid=48625,fd=24))     
tcp       LISTEN      0           512              192.168.142.128:6807                0.0.0.0:*         users:(("ceph-osd",pid=48625,fd=25))     
tcp       LISTEN      0           512              192.168.142.128:6804                0.0.0.0:*         users:(("ceph-osd",pid=48625,fd=20))     
tcp       LISTEN      0           512              192.168.142.128:6805                0.0.0.0:*         users:(("ceph-osd",pid=48625,fd=21))     
tcp       LISTEN      0           128                         [::]:22                     [::]:*         users:(("sshd",pid=813,fd=4))  

Install the client

node5 serves as the client.

Edit the clients variable file

# Copy clients.yml from its sample
[root@node1 ceph-ansible]# cp group_vars/clients.yml.sample group_vars/clients.yml

# Edit clients.yml as follows
[root@node1 ceph-ansible]# grep -vE '^$|^#' group_vars/clients.yml
---
dummy:
copy_admin_key: true  # copy the admin keyring to the client; without the key, authentication fails

Update the Ansible inventory

[root@node1 ceph-ansible]# cat /etc/ansible/hosts 
[mons]
192.168.202.129
192.168.202.130
192.168.202.131

[mgrs]
192.168.202.129
192.168.202.130
192.168.202.131

[osds]
192.168.202.129
192.168.202.130
192.168.202.131
192.168.202.132

[clients]
192.168.202.134

Run the playbook

ansible-playbook site.yml -e yes_i_know=true

Problems encountered

Problem 1

fatal: [192.168.202.131]: FAILED! => changed=false 
  attempts: 3
  failures: []
  msg: |-
    Depsolve Error occurred:
     Problem 1: conflicting requests
      - nothing provides libtcmalloc.so.4()(64bit) needed by ceph-common-2:17.2.6-0.el9.x86_64
      - nothing provides libthrift-0.14.0.so()(64bit) needed by ceph-common-2:17.2.6-0.el9.x86_64
      - nothing provides liboath.so.0()(64bit) needed by ceph-common-2:17.2.6-0.el9.x86_64
      - nothing provides liboath.so.0(LIBOATH_1.10.0)(64bit) needed by ceph-common-2:17.2.6-0.el9.x86_64
      - nothing provides liboath.so.0(LIBOATH_1.2.0)(64bit) needed by ceph-common-2:17.2.6-0.el9.x86_64
     Problem 2: conflicting requests
      - nothing provides libtcmalloc.so.4()(64bit) needed by ceph-mon-2:17.2.6-0.el9.x86_64
  rc: 1
  results: []
fatal: [192.168.202.130]: FAILED! => changed=false 
  attempts: 3
  failures: []
  msg: |-
    Depsolve Error occurred:
     Problem 1: conflicting requests
      - nothing provides libtcmalloc.so.4()(64bit) needed by ceph-common-2:17.2.6-0.el9.x86_64
      - nothing provides libthrift-0.14.0.so()(64bit) needed by ceph-common-2:17.2.6-0.el9.x86_64
      - nothing provides liboath.so.0()(64bit) needed by ceph-common-2:17.2.6-0.el9.x86_64
      - nothing provides liboath.so.0(LIBOATH_1.10.0)(64bit) needed by ceph-common-2:17.2.6-0.el9.x86_64
      - nothing provides liboath.so.0(LIBOATH_1.2.0)(64bit) needed by ceph-common-2:17.2.6-0.el9.x86_64
     Problem 2: conflicting requests
      - nothing provides libtcmalloc.so.4()(64bit) needed by ceph-mon-2:17.2.6-0.el9.x86_64
  rc: 1
  results: []
fatal: [192.168.202.129]: FAILED! => changed=false 
  attempts: 3
  failures: []
  msg: |-
    Depsolve Error occurred:
     Problem 1: conflicting requests
      - nothing provides libthrift-0.14.0.so()(64bit) needed by ceph-common-2:17.2.6-0.el9.x86_64
     Problem 2: package ceph-mon-2:17.2.6-0.el9.x86_64 requires ceph-base = 2:17.2.6-0.el9, but none of the providers can be installed
      - package ceph-base-2:17.2.6-0.el9.x86_64 requires librgw2 = 2:17.2.6-0.el9, but none of the providers can be installed
      - conflicting requests
      - nothing provides libthrift-0.14.0.so()(64bit) needed by librgw2-2:17.2.6-0.el9.x86_64
  rc: 1
  results: []

Solution: install the EPEL repository

dnf install epel-release -y

Problem 2

TASK [ceph-infra : install firewalld python binding] *********************************************************************************************An exception occurred during task execution. To see the full traceback, use -vvv. The error was: AttributeError: module 'lib' has no attribute 'OpenSSL_add_all_algorithms'
fatal: [192.168.202.129]: FAILED! => changed=false 
  module_stderr: |-
    Shared connection to 192.168.202.129 closed.
  module_stdout: |-
    Traceback (most recent call last):
      File "/root/.ansible/tmp/ansible-tmp-1694568395.6913803-54958-96914655721864/AnsiballZ_dnf.py", line 107, in <module>
        _ansiballz_main()
      File "/root/.ansible/tmp/ansible-tmp-1694568395.6913803-54958-96914655721864/AnsiballZ_dnf.py", line 99, in _ansiballz_main
        invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)
      File "/root/.ansible/tmp/ansible-tmp-1694568395.6913803-54958-96914655721864/AnsiballZ_dnf.py", line 47, in invoke_module
        runpy.run_module(mod_name='ansible.modules.dnf', init_globals=dict(_module_fqn='ansible.modules.dnf', _modlib_path=modlib_path),
      File "/usr/lib64/python3.9/runpy.py", line 225, in run_module
        return _run_module_code(code, init_globals, run_name, mod_spec)
      File "/usr/lib64/python3.9/runpy.py", line 97, in _run_module_code
        _run_code(code, mod_globals, init_globals,
      File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/tmp/ansible_ansible.legacy.dnf_payload_o8_ihnkm/ansible_ansible.legacy.dnf_payload.zip/ansible/modules/dnf.py", line 359, in <module>      File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
      File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 664, in _load_unlocked
      File "<frozen importlib._bootstrap>", line 627, in _load_backward_compatible
      File "<frozen zipimport>", line 259, in load_module
      File "/tmp/ansible_ansible.legacy.dnf_payload_o8_ihnkm/ansible_ansible.legacy.dnf_payload.zip/ansible/module_utils/urls.py", line 115, in <module>
      File "/usr/lib/python3.9/site-packages/urllib3/contrib/pyopenssl.py", line 50, in <module>
        import OpenSSL.SSL
      File "/usr/lib/python3.9/site-packages/OpenSSL/__init__.py", line 8, in <module>
        from OpenSSL import crypto, SSL
      File "/usr/lib/python3.9/site-packages/OpenSSL/crypto.py", line 3279, in <module>
        _lib.OpenSSL_add_all_algorithms()
    AttributeError: module 'lib' has no attribute 'OpenSSL_add_all_algorithms'
  msg: |-
    MODULE FAILURE
    See stdout/stderr for the exact error
  rc: 1

Solution: upgrade the pyOpenSSL library

python3 -m pip install --upgrade pyOpenSSL

Problem 3

TASK [ceph-mgr : add modules to ceph-mgr] ********************************************************************************************************failed: [192.168.202.131 -> 192.168.202.129] (item=dashboard) => changed=true 
  ansible_loop_var: item
  cmd:
  - ceph
  - -n
  - client.admin
  - -k
  - /etc/ceph/ceph.client.admin.keyring
  - --cluster
  - ceph
  - mgr
  - module
  - enable
  - dashboard
  delta: '0:00:00.210799'
  end: '2023-09-13 09:32:06.953160'
  item: dashboard
  rc: 2
  start: '2023-09-13 09:32:06.742361'
  stderr: 'Error ENOENT: module ''dashboard'' reports that it cannot run on the active manager daemon: PyO3 modules may only be initialized once per interpreter process (pass --force to force enablement)'
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>

Solution: as the error message itself suggests, the module can be force-enabled with ceph mgr module enable dashboard --force; alternatively, disable the dashboard entirely (see problem 5 below).

Problem 4

[WARNING]: log file at /root/ansible/ansible.log is not writeable and we cannot create it, aborting

[DEPRECATION WARNING]: "include" is deprecated, use include_tasks/import_tasks instead. This feature will be removed in version 2.16. Deprecation
 warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
ERROR! couldn't resolve module/action 'openstack.config_template.config_template'. This often indicates a misspelling, missing collection, or incorrect module path.

The error appears to be in '/root/ceph-ansible/roles/ceph-config/tasks/main.yml': line 137, column 3, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:


- name: "generate {{ cluster }}.conf configuration file"
  ^ here
We could be wrong, but this one looks like it might be an issue with
missing quotes. Always quote template expression brackets when they
start a value. For instance:

    with_items:
      - {{ foo }}

Should be written as:

    with_items:
      - "{{ foo }}"

Solution: install the third-party collections ceph-ansible needs

# Install the collections ceph-ansible depends on
ansible-galaxy install -r requirements.yml

Problem 5

TASK [ceph-validate : fail if monitoring group doesn't exist] ************************************************************************************fatal: [192.168.202.131]: FAILED! => changed=false 
  msg: you must add a monitoring group and add at least one node.
fatal: [192.168.202.129]: FAILED! => changed=false 
  msg: you must add a monitoring group and add at least one node.
fatal: [192.168.202.130]: FAILED! => changed=false 
  msg: you must add a monitoring group and add at least one node.

Solution 1: disable the dashboard

# Add the following to group_vars/all.yml
dashboard_enabled: false
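
Alternative: keep the dashboard enabled and satisfy the check by giving the inventory a monitoring group with at least one node, as noted earlier (monitoring should not share a node with mgrs):

```
[monitoring]
192.168.202.134
```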
