Kubernetes 1.23+ GPU Sharing


Reference: gitcode.net/mirrors/Ali…


一、Install the NVIDIA driver on the GPU node (using an RTX 2080 Ti as the example)

1. Run a full system update, install the required build components, and reboot:

```shell
yum -y update
yum -y install gcc dkms kernel-devel
reboot
```

2. Disable the built-in Nouveau driver:

```shell
echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/blacklist.conf
```

3. Rebuild the initramfs image:

```shell
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
dracut /boot/initramfs-$(uname -r).img $(uname -r)
```

4. Install the driver (replace the path suffix with your kernel version):

```shell
chmod +x NVIDIA-Linux-x86_64-470.63.01.run
./NVIDIA-Linux-x86_64-470.63.01.run --kernel-source-path=/usr/src/kernels/<kernel-version>
```

5. Verify that the driver installed correctly:

```shell
nvidia-smi
# If this prints the GPU status table, the installation succeeded
```
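Beyond eyeballing the `nvidia-smi` table, a couple of optional sanity checks can be sketched (the query flags are standard `nvidia-smi` options; the output depends on your hardware, and the checks are written to fall through gracefully on a machine without the driver):

```shell
# Confirm the driver version and total GPU memory that nvidia-smi reports
# (prints a note instead when no NVIDIA driver is present)
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv 2>/dev/null \
  || echo "nvidia-smi not available"

# Confirm that the blacklisted nouveau module really stayed unloaded
if lsmod 2>/dev/null | grep -iq nouveau; then
  echo "WARNING: nouveau still loaded"
else
  echo "nouveau not loaded"
fi
```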


二、Install nvidia-docker2

1. If the older nvidia-docker (v1) is already installed, remove it first:

```shell
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
yum remove nvidia-docker -y
```

2. Add the nvidia-docker2 yum repository:

```shell
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
```

3. Install nvidia-docker2 and reload the Docker daemon configuration (downloading may require a proxy; set only the HTTPS proxy):

```shell
yum install -y nvidia-docker2
pkill -SIGHUP dockerd
```
4. Set the NVIDIA runtime as Docker's default runtime:

```shell
vim /etc/docker/daemon.json
```

Fill in the following content:

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```
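A malformed daemon.json will keep Docker from starting at all, so it is worth validating the file before restarting the daemon. A minimal sketch, using a copy under /tmp so it can be rehearsed anywhere (on the real node, point the check at /etc/docker/daemon.json instead):

```shell
# Write the runtime config to a scratch path and check that it parses as JSON
cat > /tmp/daemon.json <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
python3 -m json.tool /tmp/daemon.json > /dev/null && echo "daemon.json: valid JSON"

# After restarting Docker (systemctl restart docker), this should print "nvidia":
#   docker info --format '{{.DefaultRuntime}}'
```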

三、Set up the GPU share scheduler

1. Deploy the GPU share scheduler extender on the master node:

```shell
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/gpushare-schd-extender.yaml
```

2. Download the scheduler policy configuration into /etc/kubernetes:

```shell
cd /etc/kubernetes
curl -O https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/scheduler-policy-config.yaml
```

3. Add the policy configuration file to the scheduler's arguments:

```shell
vim /etc/kubernetes/manifests/kube-scheduler.yaml
```

Add the following line to the scheduler's command section:

```yaml
- --config=/etc/kubernetes/scheduler-policy-config.yaml
```

4. Add the volume mount and volume to the scheduler's Pod spec:

```shell
vim /etc/kubernetes/manifests/kube-scheduler.yaml
```

Add the following under `volumeMounts`:

```yaml
- mountPath: /etc/kubernetes/scheduler-policy-config.yaml
  name: scheduler-policy-config
  readOnly: true
```

And under `volumes`:

```yaml
- hostPath:
    path: /etc/kubernetes/scheduler-policy-config.yaml
    type: FileOrCreate
  name: scheduler-policy-config
```
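Putting steps 3 and 4 together, the edited kube-scheduler.yaml ends up looking roughly like the sketch below (trimmed to the added lines; the omitted flags, mounts, and volumes from your existing manifest stay as they are):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --config=/etc/kubernetes/scheduler-policy-config.yaml
    # ... existing flags ...
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler-policy-config.yaml
      name: scheduler-policy-config
      readOnly: true
    # ... existing mounts ...
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler-policy-config.yaml
      type: FileOrCreate
    name: scheduler-policy-config
  # ... existing volumes ...
```

Because kube-scheduler runs as a static pod, the kubelet recreates it automatically once the manifest file is saved; there is no separate restart step.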

5. Deploy the device plugin:

```shell
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-rbac.yaml
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-ds.yaml
```
6. Add the gpushare label to every node whose GPUs should be shared:

```shell
kubectl label node <target_node> gpushare=true
```

7. Check which nodes support GPU sharing:

```shell
kubectl inspect gpushare
```
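Note that `kubectl inspect gpushare` is not a built-in kubectl command; it comes from a standalone plugin binary published on the gpushare-device-plugin releases page. An install sketch (the v0.3.0 version number is an assumption; check the repository's releases page for the current one):

```shell
# Download the kubectl plugin binary onto a machine with kubectl access;
# any directory on $PATH works, /usr/bin is used here for simplicity
cd /usr/bin/
wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
chmod u+x /usr/bin/kubectl-inspect-gpushare
```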

8. To request GPU memory, a container only needs `aliyun.com/gpu-mem: <N>` under its resource limits (the amount is in GiB), for example:

```yaml
apiVersion: apps/v1  # apps/v1beta1 was removed long before Kubernetes 1.23
kind: StatefulSet
metadata:
  name: binpack-1
  labels:
    app: binpack-1
spec:
  replicas: 3
  serviceName: "binpack-1"
  podManagementPolicy: "Parallel"
  selector: # define how the controller finds the pods it manages
    matchLabels:
      app: binpack-1
  template: # define the pod specification
    metadata:
      labels:
        app: binpack-1
    spec:
      containers:
      - name: binpack-1
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # GiB
            aliyun.com/gpu-mem: 3
```
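Once the StatefulSet is applied, a few checks can confirm that the shared allocation took effect (a sketch; the label and node name follow the example above, and the `-d` detail flag assumes the inspect plugin is installed):

```shell
# The pods should all be scheduled, possibly packed onto the same physical GPU
kubectl get pods -l app=binpack-1 -o wide

# The extended resource should show up in the labeled node's capacity
kubectl describe node <target_node> | grep -A 5 "aliyun.com/gpu-mem"

# Cluster-wide per-GPU memory allocation summary
kubectl inspect gpushare -d
```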