Reference: gitcode.net/mirrors/Ali…
1. Install the NVIDIA driver on GPU nodes (using an RTX 2080 Ti as the example)
- Upgrade the system with yum update and install the required build components
yum -y update
yum -y install gcc dkms kernel-devel
reboot
- Disable the default Nouveau driver
echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/blacklist.conf
- Rebuild the initramfs image
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
dracut /boot/initramfs-$(uname -r).img $(uname -r)
- Install the driver
chmod +x NVIDIA-Linux-x86_64-470.63.01.run
./NVIDIA-Linux-x86_64-470.63.01.run --kernel-source-path=/usr/src/kernels/<kernel version>
- Verify the driver installation
nvidia-smi
# If this prints the GPU status table, the driver is installed
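The nouveau-blacklist prerequisite above is easy to get wrong before rebooting; a small pre-flight check can be scripted. This is a hedged sketch — the function name and the default path are my own, the file contents match the echo command above:

```shell
# Pre-flight check before running the NVIDIA .run installer (sketch).
# Verifies that the blacklist file written above actually disables nouveau.
check_nouveau_blacklisted() {
  local conf="${1:-/etc/modprobe.d/blacklist.conf}"
  grep -q '^blacklist nouveau' "$conf" && grep -q '^options nouveau modeset=0' "$conf"
}

# After the reboot, nouveau should also no longer be loaded:
#   lsmod | grep -q nouveau && echo "nouveau still loaded"
```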
2. Install nvidia-docker2
- If the older nvidia-docker (v1) is already installed, remove it first
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
yum remove nvidia-docker -y
- Add the nvidia-docker2 repo
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
- Install nvidia-docker2 and reload the Docker daemon configuration (reaching the repo may require a proxy; setting only the https proxy is enough)
yum install -y nvidia-docker2
pkill -SIGHUP dockerd
- Set the NVIDIA runtime as Docker's default runtime
vim /etc/docker/daemon.json
# Add the following content
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
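A malformed daemon.json will keep Docker from starting, so it is worth sanity-checking the edit before reloading. A minimal sketch — the function name and default path are assumptions, the keys match the file above:

```python
import json


def nvidia_is_default_runtime(path="/etc/docker/daemon.json"):
    """Return True if daemon.json makes nvidia the default runtime and
    points it at the nvidia-container-runtime binary."""
    with open(path) as f:
        cfg = json.load(f)  # raises ValueError on malformed JSON
    nvidia = cfg.get("runtimes", {}).get("nvidia", {})
    return (cfg.get("default-runtime") == "nvidia"
            and nvidia.get("path", "").endswith("nvidia-container-runtime"))
```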
3. Set up GPU share scheduling
- Deploy the GPU share scheduler extender on the master node
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/gpushare-schd-extender.yaml
- Download the scheduler policy file to /etc/kubernetes
cd /etc/kubernetes
curl -O https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/scheduler-policy-config.yaml
- Add the policy config file parameter to the scheduler arguments
vim /etc/kubernetes/manifests/kube-scheduler.yaml
# Add the following
- --config=/etc/kubernetes/scheduler-policy-config.yaml
- Add the volume mount in the Pod spec
vim /etc/kubernetes/manifests/kube-scheduler.yaml
# Add the following (the mount goes under the container's volumeMounts, the hostPath under the pod's volumes)
- mountPath: /etc/kubernetes/scheduler-policy-config.yaml
  name: scheduler-policy-config
  readOnly: true
- hostPath:
    path: /etc/kubernetes/scheduler-policy-config.yaml
    type: FileOrCreate
  name: scheduler-policy-config
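Taken together, the two edits above leave the static pod manifest looking roughly like this (a sketch showing only the fields touched; the real kube-scheduler.yaml contains many more flags and fields):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
    - name: kube-scheduler
      command:
        - kube-scheduler
        - --config=/etc/kubernetes/scheduler-policy-config.yaml
        # ...existing flags...
      volumeMounts:
        - mountPath: /etc/kubernetes/scheduler-policy-config.yaml
          name: scheduler-policy-config
          readOnly: true
  volumes:
    - hostPath:
        path: /etc/kubernetes/scheduler-policy-config.yaml
        type: FileOrCreate
      name: scheduler-policy-config
```

Since kube-scheduler is a static pod, kubelet restarts it automatically once the manifest is saved.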
- Deploy the device plugin
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-rbac.yaml
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-ds.yaml
- Label the nodes that should share GPUs with the gpushare label
kubectl label node <target_node> gpushare=true
- Check which nodes support GPU sharing (this uses the kubectl-inspect-gpushare plugin shipped with the gpushare project; install it separately if the command is missing)
kubectl inspect gpushare
- To request GPU memory, add aliyun.com/gpu-mem: under the resource limits; the amount is in GiB. For example:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: binpack-1
  labels:
    app: binpack-1
spec:
  replicas: 3
  serviceName: "binpack-1"
  podManagementPolicy: "Parallel"
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-1
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-1
    spec:
      containers:
        - name: binpack-1
          image: cheyang/gpu-player:v2
          resources:
            limits:
              # GiB
              aliyun.com/gpu-mem: 3
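As the workload name suggests, the extender places these gpu-mem requests with a binpack policy: it keeps filling the most-loaded GPU that still fits the request, so shared pods crowd onto as few GPUs as possible. A toy illustration of the idea — this is not the extender's actual code, and all names here are made up:

```python
def binpack(requests, gpus):
    """Place each gpu-mem request (GiB) on the fullest GPU that still
    fits it.  `requests` is a list of GiB amounts, `gpus` maps GPU id
    -> total memory.  Returns {request index: GPU id}."""
    free = dict(gpus)  # remaining memory per GPU
    placement = {}
    for i, req in enumerate(requests):
        # GPUs with enough free memory, tightest fit first
        fit = [g for g in free if free[g] >= req]
        if not fit:
            raise RuntimeError(f"request of {req} GiB does not fit on any GPU")
        g = min(fit, key=lambda g: free[g])
        free[g] -= req
        placement[i] = g
    return placement
```

With the StatefulSet above (3 replicas of 3 GiB) on two 11 GiB cards, all three pods land on one GPU, leaving the other card entirely free for larger requests.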