Reference: gitcode.net/mirrors/Ali…
1. Install the NVIDIA driver on GPU nodes (using an RTX 2080 Ti as the example)
- Upgrade the system with yum update and install the required build components
yum -y update
yum -y install gcc dkms kernel-devel
reboot
- Disable the default Nouveau driver
echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/blacklist.conf
- Rebuild the initramfs image
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
dracut /boot/initramfs-$(uname -r).img $(uname -r)
- Install the driver
chmod +x NVIDIA-Linux-x86_64-470.63.01.run
./NVIDIA-Linux-x86_64-470.63.01.run --kernel-source-path=/usr/src/kernels/<kernel version>
- Verify the driver installation
nvidia-smi
# If this prints the GPU status table, the driver is installed
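The nouveau-blacklist prerequisite above is easy to get wrong before rebooting; a small pre-flight check can be scripted. This is a hedged sketch — the function name and the default path are my own, the file contents match the echo command above:

```shell
# Pre-flight check before running the NVIDIA .run installer (sketch).
# Verifies that the blacklist file written above actually disables nouveau.
check_nouveau_blacklisted() {
  local conf="${1:-/etc/modprobe.d/blacklist.conf}"
  grep -q '^blacklist nouveau' "$conf" && grep -q '^options nouveau modeset=0' "$conf"
}

# After the reboot, nouveau should also no longer be loaded:
#   lsmod | grep -q nouveau && echo "nouveau still loaded"
```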
2. Install nvidia-docker2
- If the older nvidia-docker (v1) is already installed, remove it first
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
yum remove nvidia-docker -y
- Add the nvidia-docker2 repo
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
- Install nvidia-docker2 and reload the Docker daemon configuration (reaching the repo may require a proxy; setting only the https proxy is enough)
yum install -y nvidia-docker2
pkill -SIGHUP dockerd
- Set the NVIDIA runtime as Docker's default runtime
vim /etc/docker/daemon.json
# Add the following content
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
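A malformed daemon.json will keep Docker from starting, so it is worth sanity-checking the edit before reloading. A minimal sketch — the function name and default path are assumptions, the keys match the file above:

```python
import json


def nvidia_is_default_runtime(path="/etc/docker/daemon.json"):
    """Return True if daemon.json makes nvidia the default runtime and
    points it at the nvidia-container-runtime binary."""
    with open(path) as f:
        cfg = json.load(f)  # raises ValueError on malformed JSON
    nvidia = cfg.get("runtimes", {}).get("nvidia", {})
    return (cfg.get("default-runtime") == "nvidia"
            and nvidia.get("path", "").endswith("nvidia-container-runtime"))
```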
3. Set up GPU share scheduling
- Deploy the GPU share scheduler extender on the master node
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/gpushare-schd-extender.yaml
- Download the scheduler policy file to /etc/kubernetes
cd /etc/kubernetes
curl -O https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/scheduler-policy-config.yaml
- Add the policy config file parameter to the scheduler arguments
vim /etc/kubernetes/manifests/kube-scheduler.yaml
# Add the following
- --config=/etc/kubernetes/scheduler-policy-config.yaml
- Add the volume mount in the Pod spec
vim /etc/kubernetes/manifests/kube-scheduler.yaml
# Add the following (the mount goes under the container's volumeMounts, the hostPath under the pod's volumes)
- mountPath: /etc/kubernetes/scheduler-policy-config.yaml
  name: scheduler-policy-config
  readOnly: true
- hostPath:
    path: /etc/kubernetes/scheduler-policy-config.yaml
    type: FileOrCreate
  name: scheduler-policy-config
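Taken together, the two edits above leave the static pod manifest looking roughly like this (a sketch showing only the fields touched; the real kube-scheduler.yaml contains many more flags and fields):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
    - name: kube-scheduler
      command:
        - kube-scheduler
        - --config=/etc/kubernetes/scheduler-policy-config.yaml
        # ...existing flags...
      volumeMounts:
        - mountPath: /etc/kubernetes/scheduler-policy-config.yaml
          name: scheduler-policy-config
          readOnly: true
  volumes:
    - hostPath:
        path: /etc/kubernetes/scheduler-policy-config.yaml
        type: FileOrCreate
      name: scheduler-policy-config
```

Since kube-scheduler is a static pod, kubelet restarts it automatically once the manifest is saved.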
- Deploy the device plugin
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-rbac.yaml
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-ds.yaml
- Label the nodes that should share GPUs with the gpushare label
kubectl label node <target_node> gpushare=true
- Check which nodes support GPU sharing (this uses the kubectl-inspect-gpushare plugin shipped with the gpushare project; install it separately if the command is missing)
kubectl inspect gpushare
- To request GPU memory, add aliyun.com/gpu-mem: under the resource limits; the amount is in GiB. For example:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: binpack-1
  labels:
    app: binpack-1
spec:
  replicas: 3
  serviceName: "binpack-1"
  podManagementPolicy: "Parallel"
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-1
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-1
    spec:
      containers:
        - name: binpack-1
          image: cheyang/gpu-player:v2
          resources:
            limits:
              # GiB
              aliyun.com/gpu-mem: 3
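As the workload name suggests, the extender places these gpu-mem requests with a binpack policy: it keeps filling the most-loaded GPU that still fits the request, so shared pods crowd onto as few GPUs as possible. A toy illustration of the idea — this is not the extender's actual code, and all names here are made up:

```python
def binpack(requests, gpus):
    """Place each gpu-mem request (GiB) on the fullest GPU that still
    fits it.  `requests` is a list of GiB amounts, `gpus` maps GPU id
    -> total memory.  Returns {request index: GPU id}."""
    free = dict(gpus)  # remaining memory per GPU
    placement = {}
    for i, req in enumerate(requests):
        # GPUs with enough free memory, tightest fit first
        fit = [g for g in free if free[g] >= req]
        if not fit:
            raise RuntimeError(f"request of {req} GiB does not fit on any GPU")
        g = min(fit, key=lambda g: free[g])
        free[g] -= req
        placement[i] = g
    return placement
```

With the StatefulSet above (3 replicas of 3 GiB) on two 11 GiB cards, all three pods land on one GPU, leaving the other card entirely free for larger requests.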