Installing and Configuring Docker + CUDA + NVIDIA-Docker on Ubuntu 18.04

For my research I needed a development environment based on CUDA 10.0 + NVIDIA-Docker. This post is based on a self-built machine with an Intel i7-8700K CPU and a GeForce RTX 2080 Ti GPU.

The overall setup order is: Ubuntu → Language → firefox account → obs → atom/vim → git → anaconda → pycharm → packages → openblas → nvidia-driver → cuda → docker (engine → compose) → nvidia-docker → openblas → maxwell (→ Klayout (GDSII Viewer)) → teamviewer

Preparation

# Check the graphics card
$ lspci | grep -i vga
# 01:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
# Check the OS version and make sure it is supported (a 64-bit Linux system is required)
$ uname -m && cat /etc/*release
# x86_64
# Check the GPU model
$ lspci | grep -i nvidia
$ sudo apt-get update
# Install GCC
$ sudo apt-get install build-essential
# Install the kernel headers package (Debian / Ubuntu)
# Search for the headers matching the running kernel (optional)
$ apt-cache search linux-headers-$(uname -r)
# Install the linux-headers package
$ sudo apt-get install linux-headers-$(uname -r)
# Install lightdm. The author got "invalid operation" in tty1 and the install worked from tty2;
# apt prints that error when the subcommand is misspelled (for example "intall").
$ sudo apt install lightdm
# In the dialog, choose <OK> → <lightdm> (Tab to select, Enter to confirm)
# (To remove lightdm again, if needed)
$ sudo apt-get remove lightdm

Installing CUDA

Installing the NVIDIA graphics driver on Ubuntu

# Remove any previously installed driver first.
# Case 1: the original driver was installed with apt-get
$ sudo apt-get remove --purge 'nvidia*'
# Case 2: the original driver was installed from a runfile
$ sudo chmod +x *.run
$ sudo ./NVIDIA-Linux-x86_64-384.59.run --uninstall

Disabling the nouveau driver

$ sudo gedit /etc/modprobe.d/blacklist.conf

Append the following at the end of the file:

blacklist nouveau
options nouveau modeset=0

Save the file and close the editor, then run in a terminal:

$ sudo update-initramfs -u

Reboot, then run:

$ lsmod | grep nouveau

If nothing is printed, nouveau has been disabled successfully.
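
For use in a setup script, the same check can be wrapped in a small helper. This is only a sketch: the function name nouveau_loaded is made up for illustration, and it reads lsmod-style text from standard input instead of calling lsmod itself, so the sample text below stands in for real output.

```shell
# Hypothetical helper: succeeds (exit status 0) if a module named "nouveau"
# appears in lsmod-style output read from stdin.
nouveau_loaded() {
    # lsmod prints the module name in the first column
    awk '$1 == "nouveau" { found = 1 } END { if (found) exit 0; exit 1 }'
}

# Sample text standing in for real `lsmod` output (not taken from a real machine)
sample='Module                  Size  Used by
nouveau              1716224  1
video                  45056  1 nouveau'

if printf '%s\n' "$sample" | nouveau_loaded; then
    echo "nouveau is still loaded"
else
    echo "nouveau is disabled"
fi
```

On the real machine the call would be lsmod | nouveau_loaded.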

Stopping the X Window service

$ sudo service lightdm stop

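Leaving the graphical interface

Ctrl + Alt + F1 through F6 switch to a text console (tty1 to tty6); F1 is tty1, which is used below.
Ctrl + Alt + F7 switches back to the GUI.
To restore the graphical interface, run sudo service lightdm start on the command line and then press Ctrl + Alt + F7.

Installing the driver from tty1
Press Ctrl + Alt + F1 to enter tty1.
Log in: login > password
Change to the directory containing the runfile (Downloads by default): $ cd Downloads
List the files in the current directory: $ ls
Install NVIDIA-Linux-x86_64-435.21: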

$ sudo chmod +x NVIDIA-Linux-x86_64-435.21.run
$ sudo ./NVIDIA-Linux-x86_64-435.21.run --no-opengl-files

Select Continue installation > OK > No > OK.

$ reboot

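--no-opengl-files: install only the driver files, not the OpenGL files. Do not omit this option, otherwise you may get stuck in an endless loop at the login screen (usually called a "login loop" or "stuck in login"). By default the NVIDIA driver installs its own OpenGL libraries, but Ubuntu already ships OpenGL libraries that the GUI depends on; once the NVIDIA driver overwrites them, problems appear when the GUI dynamically links against OpenGL.
The installer may also show this warning:
WARNING: Unable to find suitable destination to install 32-bit compatibility libraries.

Verifying the installation
Open a terminal in the GUI: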

$ nvidia-smi

The output shows the highest CUDA version the driver supports: 10.1.
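
The "CUDA Version" reported by nvidia-smi is the newest CUDA release the installed driver can serve, so the toolkit installed next (10.0 in this post) must not be newer than it. A minimal sketch of that comparison follows; version_le is a hypothetical helper, it only understands major.minor strings, and the two version numbers are typed in by hand rather than read from nvidia-smi.

```shell
# Hypothetical helper: succeeds if version $1 <= version $2.
# Only "major.minor" strings such as 10.0 or 10.1 are handled.
version_le() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        if (x[1] + 0 < y[1] + 0) exit 0
        if (x[1] + 0 > y[1] + 0) exit 1
        if (x[2] + 0 <= y[2] + 0) exit 0
        exit 1
    }'
}

toolkit=10.0      # CUDA Toolkit version installed in this post
driver_max=10.1   # "CUDA Version" shown by nvidia-smi with driver 435.21

if version_le "$toolkit" "$driver_max"; then
    echo "driver supports CUDA $toolkit"
else
    echo "driver is too old for CUDA $toolkit"
fi
```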

$ nvidia-settings

Installing CUDA Toolkit 10.0 (from the archive) on Ubuntu

To uninstall an existing CUDA installation:

$ cd /usr/local/cuda/bin
$ sudo ./uninstall_cuda_10.0.pl
$ cd /usr/local
$ sudo rm -rf cuda-10.0/

Download the runfile from the CUDA Toolkit Archive. This post installs from the runfile; the version used is cuda_10.0.130_410.48_linux.run.

$ sudo service lightdm stop
# Press Ctrl + Alt + F1 to switch to tty1, then log in
$ cd Downloads
$ sudo chmod +x cuda_10.0.130_410.48_linux.run
$ sudo ./cuda_10.0.130_410.48_linux.run --no-opengl-libs

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?
(y)es/(n)o/(q)uit: n
Install the CUDA 10.0 Toolkit?
(y)es/(n)o/(q)uit: y
Enter Toolkit Location
 [ default is /usr/local/cuda-10.0 ]:
Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: y
Install the CUDA 10.0 Samples?
(y)es/(n)o/(q)uit: n

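Answer accept > n > y > <Enter> > y > n:

accept # accept the license and agree to install
n # do not install the driver, because a newer driver is already installed
y # install the CUDA Toolkit
<Enter> # install to the default directory
y # create the symbolic link to the install directory
n # do not copy the Samples, because the install directory already contains /samples

Check that the CUDA, driver and gcc versions are compatible with each other.
Setting environment variables

$ gedit ~/.bashrc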

Append at the end:

export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Save, exit, and reload the file:

$ source ~/.bashrc
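
The ${VAR:+:${VAR}} form in the two export lines appends a colon plus the old value only when the variable is already set and non-empty. That avoids a stray trailing colon, which matters because an empty entry in PATH or LD_LIBRARY_PATH is treated as the current directory. A small demonstration, using a helper named prepend_dir that is hypothetical and exists only for this illustration:

```shell
# Hypothetical helper: print "$1", followed by ":$2" only when $2 is non-empty.
# This mirrors: export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
prepend_dir() {
    old=$2
    printf '%s\n' "$1${old:+:${old}}"
}

prepend_dir /usr/local/cuda-10.0/lib64 ""          # old value empty
prepend_dir /usr/local/cuda-10.0/lib64 /opt/lib    # old value present
```

The first call prints /usr/local/cuda-10.0/lib64 with no trailing colon; the second prints /usr/local/cuda-10.0/lib64:/opt/lib.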

Run the check commands:

$ cat /proc/driver/nvidia/version

$ nvcc -V
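
To confirm in a script that the nvcc on PATH really is the 10.0 toolkit, the release number can be pulled out of the nvcc -V text. The sketch below assumes the last line of that output has the form "Cuda compilation tools, release 10.0, V10.0.130"; the helper name nvcc_release is hypothetical, and the sample line is typed in here rather than captured from a machine.

```shell
# Hypothetical helper: extract the "major.minor" release from `nvcc -V` text on stdin.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Sample line in the assumed format
sample='Cuda compilation tools, release 10.0, V10.0.130'

release=$(printf '%s\n' "$sample" | nvcc_release)
if [ "$release" = "10.0" ]; then
    echo "nvcc reports CUDA $release"
else
    echo "unexpected CUDA release: $release"
fi
```

On the real machine the call would be nvcc -V | nvcc_release.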

Testing with a CUDA sample

$ cd /usr/local/cuda-10.0/samples/1_Utilities/deviceQuery
$ sudo make
$ ./deviceQuery

The output should end with Result = PASS.
Install the official Patch 1 (released May 10, 2019):

$ sudo chmod +x cuda_10.0.130.1_linux.run
$ sudo ./cuda_10.0.130.1_linux.run

Installing the GPU build of PyTorch with Anaconda

Check the glibc version:

$ ldd --version

Installing Anaconda on Linux

$ bash ~/Downloads/Anaconda3-[version]-Linux-x86_64.sh

yes > <enter> > yes

$ conda create -n Pytorch python=3.6
$ conda activate Pytorch
$ conda install pytorch torchvision cudatoolkit=10.0 -c pytorch

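If the installation is slow, you can add the Tsinghua mirror:

# Show the configured sources
$ conda config --show-sources
# Add the mirror
$ conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
$ conda config --set show_channel_urls yes
# Install command from the official PyTorch website
$ conda install pytorch torchvision cudatoolkit=10.0 -c pytorch

Verification: confirm that PyTorch is installed correctly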

$ python
>>> from __future__ import print_function
>>> import torch
>>> x = torch.rand(5, 3)
>>> print(x)

The output should be something similar to:

tensor([[0.3380, 0.3845, 0.3217],
        [0.8337, 0.9050, 0.2650],
        [0.2979, 0.7141, 0.9069],
        [0.1449, 0.1132, 0.1375],
        [0.4675, 0.3947, 0.1426]])

Additionally, to check whether your GPU driver and CUDA are enabled and accessible by PyTorch, run the following commands:

>>> import torch
>>> torch.cuda.is_available()

The output should be True.

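Common Anaconda commands

# List the environments on this system
$ conda info -e
# Switch to an environment (conda versions before 4.4 used `source activate env_name` on Linux/macOS)
$ conda activate env_name
# Leave the environment; `conda activate base` also switches back to the base (root) environment
$ conda deactivate
# Remove an environment
$ conda remove -n env_name --all
# Install Jupyter Notebook
$ conda install -c anaconda jupyter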

Installing Docker

See "Get Docker Engine - Community for Ubuntu" in the official Docker documentation.
Uninstall old versions (if installed):

$ sudo apt-get remove docker docker-engine docker.io containerd runc

Install using the repository:

$ sudo apt-get update
$ sudo apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
$ sudo apt-key fingerprint 0EBFCD88

Set up the stable repository for your architecture; mine is x86_64 (amd64):

$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

To check the machine architecture, run:

$ uname -a
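
The arch= value in the repository line uses Debian architecture names, which differ from what uname prints. The sketch below maps a few common uname -m values to those names; the helper name docker_arch is hypothetical, and the table covers only three cases and is not a complete list of the architectures Docker supports.

```shell
# Hypothetical helper: map a `uname -m` value to the arch= name for the apt repository.
# Only three common cases are covered.
docker_arch() {
    case "$1" in
        x86_64)  echo amd64 ;;
        aarch64) echo arm64 ;;
        armv7l)  echo armhf ;;
        *)       echo "no mapping for: $1" >&2; return 1 ;;
    esac
}

# The machine in this post reports x86_64
arch=$(docker_arch x86_64)
echo "deb [arch=$arch] https://download.docker.com/linux/ubuntu bionic stable"
```

On a real system the argument would be "$(uname -m)" and the codename would come from lsb_release -cs (bionic on Ubuntu 18.04).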

Installing Docker Engine - Community

$ sudo apt-get update

Install the latest version of Docker (to install a specific version instead, skip to the next step):

$ sudo apt-get install docker-ce docker-ce-cli containerd.io

Install a specific version of Docker:

$ apt-cache madison docker-ce
$ sudo apt-get install docker-ce=<VERSION_STRING> docker-ce-cli=<VERSION_STRING> containerd.io

Verify the installation:

$ sudo docker run hello-world

Check Docker's status:

$ sudo service docker status
$ sudo docker info
# Show the Docker version
$ docker --version

To change the Docker registry mirror, edit /etc/docker/daemon.json and add:

{
"registry-mirrors": ["https://docker.mirrors.ustc.edu.cn"]
}

Commonly used mirrors:
Docker official China mirror: registry.docker-cn.com
NetEase: hub-mirror.c.163.com
USTC: docker.mirrors.ustc.edu.cn
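
If the file is generated from a script rather than edited by hand, a sketch like the following can produce the same content for any of the mirrors listed. The helper name daemon_json is hypothetical, and the example writes to a scratch file instead of /etc/docker/daemon.json so that it can be run without root; an existing daemon.json that already holds other settings should be merged by hand rather than overwritten.

```shell
# Hypothetical helper: print a daemon.json body that configures one registry mirror.
daemon_json() {
    printf '{\n  "registry-mirrors": ["https://%s"]\n}\n' "$1"
}

# Write to a scratch file; the real target would be /etc/docker/daemon.json (needs root)
out=$(mktemp)
daemon_json docker.mirrors.ustc.edu.cn > "$out"
cat "$out"
rm -f "$out"
```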

Restart Docker:

$ sudo systemctl restart docker.service

See also "Post-installation steps for Linux" in the Docker documentation.

Installing NVIDIA-Docker

See the nvidia-docker repository on GitHub.

# Add the package repositories
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
$ sudo systemctl restart docker

Debian-based distributions

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

Installation (version 2.0) Ubuntu distributions

$ sudo apt-get install nvidia-docker2
$ sudo pkill -SIGHUP dockerd
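
The two install paths above are used differently when starting a container: with nvidia-container-toolkit on Docker 19.03 or newer, GPUs are requested with --gpus all, while nvidia-docker2 registers an nvidia runtime that is selected with --runtime=nvidia. The sketch below picks the flag from a Docker version string. The helper name gpu_run_flag is hypothetical, and the image tag nvidia/cuda:10.0-base is an assumption based on the nvidia-docker README of that period, so it may no longer be published.

```shell
# Hypothetical helper: choose the GPU flag for `docker run` from a Docker version string.
gpu_run_flag() {
    awk -v v="$1" 'BEGIN {
        split(v, p, ".")
        if (p[1] + 0 > 19 || (p[1] + 0 == 19 && p[2] + 0 >= 3)) {
            print "--gpus all"          # native GPU support (nvidia-container-toolkit)
        } else {
            print "--runtime=nvidia"    # runtime registered by nvidia-docker2
        }
    }'
}

# Print the command that would be run; the version is typed in here. On a real
# system it could come from: docker version --format '{{.Server.Version}}'
echo "docker run $(gpu_run_flag 19.03.5) --rm nvidia/cuda:10.0-base nvidia-smi"
```

If the printed command runs and shows the same nvidia-smi table as on the host, the container can see the GPU.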

Installing OpenBLAS

Download and install:

$ cd ~/Downloads/OpenBLAS-0.3.6/OpenBLAS
$ sudo apt-get install gfortran

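Test (assuming the library was installed under /opt/OpenBLAS, the default prefix of OpenBLAS's make install):

$ gedit c.c
$ gcc c.c -I /opt/OpenBLAS/include/ -L/opt/OpenBLAS/lib -lopenblas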
$ ./a.out

Output:

0.000000 1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000 8.000000 9.000000
90.000000 81.000000 72.000000 63.000000 54.000000 45.000000 36.000000 27.000000 18.000000 9.000000