Installing and Configuring Docker + CUDA + NVIDIA-Docker on Ubuntu 18.04
My research requires a development environment based on CUDA 10.0 + NVIDIA-Docker. This post walks through the setup on a custom-built machine with an i7-8700K CPU and a GeForce RTX 2080 Ti GPU.
The setup order is: Ubuntu → Language → firefox account → obs → atom/vim → git → anaconda → pycharm → packages → openblas → nvidia-driver → cuda → docker (engine → compose) → nvidia-docker → openblas → maxwell (→ Klayout (GDSII Viewer)) → teamviewer
Preparation
# Check the graphics card
$ lspci | grep -i vga
# 01:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
# Check the OS version to make sure it is supported (a 64-bit Linux system is required)
$ uname -m && cat /etc/*release
# x86_64
# Check the GPU model:
$ lspci | grep -i nvidia
# Refresh the package index
$ sudo apt-get update
# Install GCC
$ sudo apt-get install build-essential
# Debian / Ubuntu linux install kernel headers package
# Search for kernel version (optional)
$ apt-cache search linux-headers-$(uname -r)
# Install linux-header package
$ sudo apt-get install linux-headers-$(uname -r)
$ sudo apt install lightdm (reported as an invalid operation in tty1; the install worked from tty2)
<OK> → <lightdm> (use Tab to select, Enter to confirm)
$ sudo apt-get remove lightdm
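The architecture and kernel-header checks above can be bundled into a small pre-flight script. This is a minimal sketch rather than part of the original setup: the function names `is_supported_arch` and `headers_package` are made up for illustration, and the script only reports what it finds without installing anything.

```shell
#!/usr/bin/env bash
# Pre-flight sketch: confirm a 64-bit system and name the kernel-headers package.
# Function names are illustrative, not from any official tool.

# Succeeds only for the 64-bit x86 machine name printed by `uname -m`.
is_supported_arch() {
    [ "$1" = "x86_64" ]
}

# Builds the kernel-headers package name for a given kernel release string.
headers_package() {
    printf 'linux-headers-%s\n' "$1"
}

arch=$(uname -m)
if is_supported_arch "$arch"; then
    echo "architecture OK: $arch"
else
    echo "unsupported architecture: $arch"
fi
echo "kernel headers package: $(headers_package "$(uname -r)")"
```

The printed package name is the one passed to `sudo apt-get install` in the step above.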
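Installing CUDA
Installing the NVIDIA graphics driver on Ubuntu
- Reference: cnblogs, "Ubuntu安装NVIDA显卡驱动" (installing the NVIDIA graphics driver on Ubuntu)
- Download the graphics driver
- Uninstall the existing graphics driver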
# for case1: original driver installed by apt-get:
$ sudo apt-get remove --purge 'nvidia*'
# for case2: original driver installed by runfile:
$ sudo chmod +x *.run
$ sudo ./NVIDIA-Linux-x86_64-384.59.run --uninstall
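- Disable the nouveau driver
$ sudo gedit /etc/modprobe.d/blacklist.conf
Append the following at the end of the file:
blacklist nouveau
options nouveau modeset=0
Save and close the editor. Then run in a terminal:
$ sudo update-initramfs -u
Reboot, then run:
$ lsmod | grep nouveau
If nothing is printed, nouveau has been disabled successfully.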
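The same check can be scripted so that it prints an explicit message instead of relying on empty output. A minimal sketch: `nouveau_loaded` is a hypothetical helper name, and it reads `lsmod`-style text on stdin so it can also be tried on sample input.

```shell
#!/usr/bin/env bash
# Reads `lsmod`-style output on stdin; succeeds if a nouveau module line is present.
nouveau_loaded() {
    grep -q '^nouveau[[:space:]]'
}

if command -v lsmod >/dev/null 2>&1; then
    if lsmod | nouveau_loaded; then
        echo "nouveau is still loaded: recheck the blacklist file and rerun update-initramfs"
    else
        echo "nouveau is not loaded"
    fi
fi
```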
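- Stop the X Window service
$ sudo service lightdm stop
This shuts down the graphical interface.
Ctrl + Alt + F1 through F6 switch to the text consoles tty1 through tty6
Ctrl + Alt + F7 switches back to the GUI
To restore the graphical interface, run sudo service lightdm start on the command line, then press Ctrl + Alt + F7.
- Install the driver from tty1
Press Ctrl + Alt + F1 to enter tty1
Log in: login > password
Change to the directory containing the runfile, by default: $ cd Downloads
List the files in the current directory: $ ls
Install NVIDIA-Linux-x86_64-435.21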
$ sudo chmod +x NVIDIA-Linux-x86_64-435.21.run
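$ sudo ./NVIDIA-Linux-x86_64-435.21.run --no-opengl-files
Choose Continue installation > OK > No > OK
$ reboot
--no-opengl-files: installs only the driver files, without the OpenGL files. Do not omit this option; otherwise you can end up in an endless loop at the login screen, usually called a "login loop" or "stuck in login". By default the NVIDIA installer also installs its own OpenGL libraries, while Ubuntu already ships OpenGL libraries that the GUI depends on. Once the NVIDIA driver overwrites them, problems arise when the GUI dynamically links against the OpenGL libraries.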
WARNING: Unable to find suitable destination to install 32-bit compatibility libraries.
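- Verify the installation
Open a terminal in the GUI
$ nvidia-smi
The output shows the highest supported CUDA version: 10.1
$ nvidia-settings
Installing CUDA Toolkit 10.0 from the CUDA Toolkit Archive on Ubuntu
- Uninstall any previous CUDA installation: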
$ cd /usr/local/cuda/bin
$ sudo ./uninstall_cuda_10.0.pl
$ cd /usr/local
$ sudo rm -rf cuda-10.0/
- Download the installer from the CUDA Toolkit Archive. This post installs from the runfile; the version chosen is cuda_10.0.130_410.48_linux.run
$ sudo service lightdm stop
Ctrl+Alt+F1
$ cd Downloads
$ sudo chmod +x cuda_10.0.130_410.48_linux.run
$ sudo ./cuda_10.0.130_410.48_linux.run --no-opengl-libs
Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?
(y)es/(n)o/(q)uit: n
Install the CUDA 10.0 Toolkit?
(y)es/(n)o/(q)uit: y
Enter Toolkit Location
[ default is /usr/local/cuda-10.0 ]:
Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: y
Install the CUDA 10.0 Samples?
(y)es/(n)o/(q)uit: n
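Choose accept > n > y > <Enter> > y > n
accept   # agree to the installation
n        # do not install the driver, since a newer one is already installed
y        # install the CUDA Toolkit
<Enter>  # install to the default location
y        # create the symbolic link to the installation directory
n        # do not copy the Samples, since the installation directory already contains /samples
- Check that the CUDA, driver, and gcc versions are compatible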
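One way to make the driver part of this check concrete is to compare the installed driver version against the minimum the toolkit needs. The sketch below assumes 410.48 as that minimum for CUDA 10.0, taken from the runfile name cuda_10.0.130_410.48_linux.run; `version_ge` is a hypothetical helper, and it relies on GNU `sort -V`, which Ubuntu ships.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when version A >= version B (uses GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

min_driver="410.48"   # assumed minimum for CUDA 10.0, taken from the runfile name

if command -v nvidia-smi >/dev/null 2>&1; then
    installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if version_ge "$installed" "$min_driver"; then
        echo "driver $installed is new enough for CUDA 10.0"
    else
        echo "driver $installed is older than $min_driver"
    fi
fi

# Print the gcc version so it can be compared with the CUDA installation guide.
if command -v gcc >/dev/null 2>&1; then
    gcc -dumpversion
fi
```

The gcc line only prints the version; which gcc releases a given toolkit accepts should be looked up in NVIDIA's installation guide for that CUDA version.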
- Configure environment variables
$ gedit ~/.bashrc
Append at the end:
export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
Save, exit, and reload the file:
$ source ~/.bashrc
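The `${PATH:+:${PATH}}` form in the two export lines is worth a note: it expands to `:` plus the old value only when the variable is already set and non-empty, which avoids a stray trailing colon (an empty entry that is generally treated as the current directory). A small sketch with a hypothetical helper, `prepend_dir`, shows both cases:

```shell
#!/usr/bin/env bash
# prepend_dir DIR CURRENT: prints DIR, followed by ":CURRENT" only if CURRENT is non-empty.
prepend_dir() {
    local dir=$1 current=$2
    printf '%s\n' "${dir}${current:+:${current}}"
}

# Variable empty or unset: no colon is added.
prepend_dir /usr/local/cuda-10.0/lib64 ""
# Variable already set: the old value follows after a single colon.
prepend_dir /usr/local/cuda-10.0/bin "/usr/bin:/bin"
```

The first call prints `/usr/local/cuda-10.0/lib64` and the second prints `/usr/local/cuda-10.0/bin:/usr/bin:/bin`.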
- Run the check commands
$ cat /proc/driver/nvidia/version
$ nvcc -V
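To confirm from a script that `nvcc` reports the expected release, the version line can be parsed. This sketch assumes the usual `release X.Y` wording in the `nvcc -V` output; `nvcc_release` is a hypothetical helper that reads that output on stdin.

```shell
#!/usr/bin/env bash
# Extracts the "X.Y" that follows "release" from `nvcc -V` output read on stdin.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

if command -v nvcc >/dev/null 2>&1; then
    found=$(nvcc -V | nvcc_release)
    if [ "$found" = "10.0" ]; then
        echo "nvcc reports CUDA $found"
    else
        echo "nvcc reports CUDA $found, expected 10.0"
    fi
fi
```

If `nvcc` is not found at all, the PATH line in ~/.bashrc is the first thing to recheck.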
- Test with a CUDA sample
$ cd /usr/local/cuda-10.0/samples/1_Utilities/deviceQuery
$ sudo make
$ ./deviceQuery
The output should end with Result = PASS
Install the official Patch 1 (Released May 10, 2019)
$ sudo chmod +x cuda_10.0.130.1_linux.run
$ sudo ./cuda_10.0.130.1_linux.run
Installing the GPU build of PyTorch with Anaconda
- Check the glibc version
$ ldd --version
$ bash ~/Downloads/Anaconda3-[version]-Linux-x86_64.sh
yes > <enter> > yes
$ conda create -n Pytorch python=3.6
$ conda activate Pytorch
$ conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
If the installation is slow, you can add the Tsinghua mirror
# Show the configured sources
$ conda config --show-sources
# Add a channel
$ conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
$ conda config --set show_channel_urls yes
Following the official website:
$ conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
Verification: confirm that PyTorch is installed correctly
$ python
>>> from __future__ import print_function
>>> import torch
>>> x = torch.rand(5, 3)
>>> print(x)
The output should be something similar to:
tensor([[0.3380, 0.3845, 0.3217],
        [0.8337, 0.9050, 0.2650],
        [0.2979, 0.7141, 0.9069],
        [0.1449, 0.1132, 0.1375],
        [0.4675, 0.3947, 0.1426]])
Additionally, to check if your GPU driver and CUDA are enabled and accessible by PyTorch, run the following commands to return whether or not the CUDA driver is enabled:
>>> import torch
>>> torch.cuda.is_available()
Output is True
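- Common Anaconda commands
# List the environments on this system
$ conda info -e
# Switch to an environment (older conda versions on Linux/macOS use: source activate env_name)
$ conda activate env_name
# Leave the current environment; `conda activate base` (`activate root` on older versions) switches back to the root environment
$ conda deactivate
# Remove an environment
$ conda remove -n env_name --all
# Install Jupyter Notebook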
$ conda install -c anaconda jupyter
Installing Docker
- Reference: Get Docker Engine - Community for Ubuntu
- Uninstall old versions (if installed)
$ sudo apt-get remove docker docker-engine docker.io containerd runc
- Install using the repository
$ sudo apt-get update
$ sudo apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
$ sudo apt-key fingerprint 0EBFCD88
Set up the stable repository for your architecture; mine is x86_64 (amd64)
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
To check the architecture, run:
$ uname -a
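`uname -a` prints the kernel's name for the architecture (x86_64), while the `arch=` field in the repository line expects the Debian name (amd64). Below is a minimal sketch of that mapping for a few common architectures; `deb_arch` is a hypothetical helper, and on a real Debian or Ubuntu system `dpkg --print-architecture` reports the Debian name directly.

```shell
#!/usr/bin/env bash
# Maps a `uname -m` machine name to the Debian architecture name used in "deb [arch=...]".
deb_arch() {
    case "$1" in
        x86_64)  echo amd64 ;;
        aarch64) echo arm64 ;;
        armv7l)  echo armhf ;;
        *)       echo "unknown"; return 1 ;;
    esac
}

deb_arch "$(uname -m)" || echo "no mapping for $(uname -m) in this sketch"
```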
- Install Docker Engine - Community
$ sudo apt-get update
- Install the latest version of Docker (to install a specific version, skip to the next step)
$ sudo apt-get install docker-ce docker-ce-cli containerd.io
- Install a specific version of Docker
$ apt-cache madison docker-ce
$ sudo apt-get install docker-ce=<VERSION_STRING> docker-ce-cli=<VERSION_STRING> containerd.io
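The `<VERSION_STRING>` placeholder is the second column of the `apt-cache madison` output. The sketch below pulls that column out; the helper name `madison_versions` is illustrative, and the values it prints depend on what your own `apt-cache madison docker-ce` returns.

```shell
#!/usr/bin/env bash
# Prints the version column (second "|"-separated field) of `apt-cache madison` output on stdin.
madison_versions() {
    awk -F'|' 'NF >= 2 { gsub(/^[ \t]+|[ \t]+$/, "", $2); print $2 }'
}

# Usage: list the candidate version strings in the order apt reports them.
if command -v apt-cache >/dev/null 2>&1; then
    apt-cache madison docker-ce | madison_versions
fi
```

Any one of the printed strings can then be substituted for `<VERSION_STRING>` in the install command.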
- Verify the installation
$ sudo docker run hello-world
- Check Docker's status
$ sudo service docker status
$ sudo docker info
# Check the Docker version
$ docker --version
- Change the Docker registry mirror
Edit /etc/docker/daemon.json and add:
{
  "registry-mirrors": ["https://docker.mirrors.ustc.edu.cn"]
}
Mirror hosts:
Docker official (China)  registry.docker-cn.com
NetEase                  hub-mirror.c.163.com
USTC                     docker.mirrors.ustc.edu.cn
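A syntax error in daemon.json can keep the Docker daemon from starting, so it can help to generate the file rather than type it by hand. A minimal sketch: `daemon_json` is a hypothetical helper that prints the JSON for one mirror URL. It only writes to stdout, and copying the result into /etc/docker is left as a commented manual step.

```shell
#!/usr/bin/env bash
# Prints a daemon.json body containing a single registry mirror.
daemon_json() {
    printf '{\n  "registry-mirrors": ["%s"]\n}\n' "$1"
}

daemon_json "https://docker.mirrors.ustc.edu.cn"

# To install it (this overwrites any existing settings in that file):
#   daemon_json "https://docker.mirrors.ustc.edu.cn" | sudo tee /etc/docker/daemon.json
```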
- Restart Docker
$ sudo systemctl restart docker.service
- Reference: Post-installation steps for Linux
Installing NVIDIA-Docker
- Reference: the project's GitHub page
# Add the package repositories
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
$ sudo systemctl restart docker
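The `distribution=...` line sources /etc/os-release in a subshell and concatenates two of its variables, which on Ubuntu 18.04 gives ubuntu18.04; that string selects the matching .list file. The sketch below reproduces the step against an arbitrary file so the result can be inspected before touching apt; `distribution_id` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Prints $ID$VERSION_ID from an os-release style file (defaults to /etc/os-release).
distribution_id() {
    ( . "${1:-/etc/os-release}" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

if [ -r /etc/os-release ]; then
    dist=$(distribution_id)
    echo "distribution: $dist"
    echo "list URL: https://nvidia.github.io/nvidia-docker/$dist/nvidia-docker.list"
fi
```

Once the toolkit is installed and Docker has been restarted, the smoke test from the project README at the time was `docker run --gpus all nvidia/cuda:10.0-base nvidia-smi`, which needs Docker 19.03 or later; the image tag is an assumption and may no longer be available.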
Installation (version 2.0) Ubuntu distributions
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
  sudo apt-key add -
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update
$ sudo apt-get install nvidia-docker2
$ sudo pkill -SIGHUP dockerd
Installing OpenBLAS
Download and install
$ cd ~/Downloads/OpenBLAS-0.3.6/OpenBLAS
$ sudo apt-get install gfortran
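The build commands themselves are not listed above. Assuming the standard OpenBLAS source build (plain `make`, then `make install`, whose default prefix /opt/OpenBLAS matches the include and library paths used in the test below), the steps would look like the following sketch. `openblas_build_steps` is a hypothetical helper that only prints the commands so they can be reviewed before being run.

```shell
#!/usr/bin/env bash
# Prints the assumed build-and-install commands for an OpenBLAS source tree.
openblas_build_steps() {
    local src=$1 prefix=${2:-/opt/OpenBLAS}
    printf 'cd %s\n' "$src"
    printf 'make\n'
    printf 'sudo make PREFIX=%s install\n' "$prefix"
}

openblas_build_steps "$HOME/Downloads/OpenBLAS-0.3.6/OpenBLAS"
```

Passing a second argument changes the install prefix; the `-I` and `-L` paths in the gcc command then need to change to match.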
Test
$ gedit c.c
$ gcc c.c -I /opt/OpenBLAS/include/ -L/opt/OpenBLAS/lib -lopenblas
$ ./a.out
Output
0.000000 1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000 8.000000 9.000000
90.000000 81.000000 72.000000 63.000000 54.000000 45.000000 36.000000 27.000000 18.000000 9.000000