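What is cuDNN, and why install it? This article introduces NVIDIA hardware and drivers (including the NVIDIA driver), the CUDA Toolkit, the cuDNN libraries, and TensorRT, and explains how these hardware, driver, and software layers relate to one another and what each does. It uses Tencent Cloud Studio as the example environment and shows how to install and configure GPU acceleration for PyTorch.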
Introduction to Cloud Studio
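Cloud Studio (a cloud IDE) is a browser-based integrated development environment that gives developers a stable cloud workstation with access to both CPUs and GPUs. Nothing needs to be installed: open a browser anywhere and start working. Cloud Studio offers a free CPU environment (50,000 minutes per month) and a free GPU environment (one Tesla T4 16 GB, 10,000 minutes per month). This article uses the Cloud Studio GPU environment for its demonstrations.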
Starting a Cloud Studio GPU workspace
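- First register for and enable Cloud Studio via curl.qcloud.com/sdeIX8nx
- Go to ide.cloud.tencent.com/ to open the Cloud Studio home page
- Click Workspace Templates (空间模版) -> AI Templates (AI模版) -> Pytorch2.0.0
- Select Free Basic Edition (免费基础版) -> Confirm (确认)
- Click High-Performance Workspaces (高性能工作空间). The entry Pytorch2.0.0 gssrak is the GPU workspace that was just created; it shows a green dot and the status "Running".
- Click Pytorch2.0.0 gssrak to enter the workspace. Loading takes less than a minute.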
Nvidia driver
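The NVIDIA driver is the driver software for NVIDIA GPUs. Only with the NVIDIA driver installed can the GPU be driven correctly, both to output display signals (on Studio professional cards and gaming cards) and to accelerate scientific computing (on data-center cards and the like). It is also the foundation for installing the CUDA Toolkit or cuDNN later.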
- Because Cloud Studio is built on container technology, the same version of the NVIDIA driver is already available on the host and inside the GPU workspace (which is essentially a container). We can check it with nvidia-smi.
- Open a terminal and run nvidia-smi:
(base) root@VM-24-95-ubuntu:/workspace# nvidia-smi
Mon Mar 10 12:13:25 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:09.0 Off |                    0 |
| N/A   31C    P8    10W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
- Driver Version: 525.105.17 means the installed NVIDIA driver version is 525.105.17.
- CUDA Version: 12.0 means the highest CUDA version this driver supports is 12.0.
  - In other words, the machine currently supports CUDA 12.0 and earlier versions (CUDA 11.8, CUDA 11.7, CUDA 10.0, and so on). Versions newer than CUDA 12.0, such as CUDA 12.1 or CUDA 12.8, are not supported.
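The version rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `parse_smi_header`, `version_tuple`, and `driver_supports` are made up for this example, the header line is copied from the nvidia-smi output above, and the check applies the simple "toolkit version <= driver maximum" rule described in this article rather than NVIDIA's full compatibility matrix.

```python
import re


def parse_smi_header(text):
    """Return (driver_version, max_cuda_version) parsed from nvidia-smi output."""
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", text
    )
    if match is None:
        raise ValueError("nvidia-smi header line not found")
    return match.group(1), match.group(2)


def version_tuple(version):
    """Turn '11.7' into (11, 7) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))


def driver_supports(toolkit_version, max_cuda_version):
    """Apply the rule from the text: the toolkit must not exceed the driver maximum."""
    return version_tuple(toolkit_version) <= version_tuple(max_cuda_version)


# Header line copied from the nvidia-smi output shown in this article.
SAMPLE = "| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |"

driver, max_cuda = parse_smi_header(SAMPLE)
print("driver:", driver, "max CUDA:", max_cuda)
for toolkit in ("10.0", "11.7", "12.0", "12.1"):
    print(toolkit, "supported" if driver_supports(toolkit, max_cuda) else "not supported")
```

On a real machine the same function could be fed the text captured from running nvidia-smi instead of the hard-coded sample.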
CUDA toolkit
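The CUDA Toolkit is NVIDIA's complete set of development tools for building and optimizing CUDA programs. It includes a compiler (nvcc), debuggers, the runtime library (cudart), profiling tools, and a range of math and compute libraries.
Note that if you only need to run TensorFlow or PyTorch, you do not actually need to install the (full) CUDA Toolkit: the bundled subset of CUDA libraries, including cuDNN, that comes with a PyTorch or TensorFlow installation is enough for GPU-accelerated computing. The CUDA Toolkit is needed only when you develop CUDA operators, compile GPU-accelerated implementations (such as the Apex library), and so on.
Cloud Studio comes with CUDA Toolkit 11.7 installed and configured by default.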
- Run nvcc -V to check whether the CUDA Toolkit is installed:
(base) root@VM-24-95-ubuntu:/workspace# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
- Run echo $PATH and check that it contains /usr/local/cuda/bin:
(base) root@VM-24-95-ubuntu:/workspace# echo $PATH
/etc/.hai/cloud_studio/vendor/modules/code-oss-dev/bin/remote-cli:/root/miniforge3/bin:/root/miniforge3/condabin:/root/miniforge3/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
- Run echo $LD_LIBRARY_PATH and check that it contains /usr/local/cuda/lib64:
(base) root@VM-24-95-ubuntu:/workspace# echo $LD_LIBRARY_PATH
/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
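The nvcc check can also be scripted. The sketch below is an illustration, not part of the Cloud Studio tooling: `parse_nvcc_release` is a hypothetical helper name, the sample text is copied from the nvcc -V output above, and `shutil.which` is used only to report whether an nvcc binary is reachable through PATH.

```python
import re
import shutil


def parse_nvcc_release(text):
    """Extract the 'release X.Y' toolkit version from `nvcc -V` output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", text)
    return match.group(1) if match else None


# Output copied from the `nvcc -V` run shown in this article.
SAMPLE = (
    "nvcc: NVIDIA (R) Cuda compiler driver\n"
    "Copyright (c) 2005-2022 NVIDIA Corporation\n"
    "Built on Wed_Jun__8_16:49:14_PDT_2022\n"
    "Cuda compilation tools, release 11.7, V11.7.99\n"
    "Build cuda_11.7.r11.7/compiler.31442593_0\n"
)

print("toolkit release:", parse_nvcc_release(SAMPLE))
# shutil.which returns None when no nvcc executable is found on PATH,
# which usually means the full CUDA Toolkit is not installed or not configured.
print("nvcc on PATH:", shutil.which("nvcc"))
```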
cuDNN
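Introduction to cuDNN
The NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. It provides optimized implementations of operations that occur frequently in deep neural network (DNN) applications. cuDNN is what actually implements much of the GPU acceleration in TensorFlow, PyTorch, and large-model deployment platforms.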
- ref:
  - Official website: docs.nvidia.com/cudnn/index…
  - Official documentation: docs.nvidia.com/deeplearnin…
  - Official guide to installing cuDNN on Linux: docs.nvidia.com/deeplearnin…
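If you use the Pytorch2.0.0 workspace template as described above, there is no need to install cuDNN separately: Cloud Studio has already installed and configured the GPU build of PyTorch, which ships with the subset of cuDNN it needs.
Checking the cuDNN version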
- Check whether PyTorch can use CUDA: python -c "import torch;print(torch.cuda.is_available())"
- Check whether cuDNN is enabled: python -c "import torch;print(torch.backends.cudnn.enabled)"
- Check the cuDNN version: python -c "import torch;print(torch.backends.cudnn.version())"
(base) root@VM-24-95-ubuntu:/workspace# python -c "import torch;print(torch.cuda.is_available())"
True
(base) root@VM-24-95-ubuntu:/workspace# python -c "import torch;print(torch.backends.cudnn.enabled)"
True
(base) root@VM-24-95-ubuntu:/workspace# python -c "import torch;print(torch.backends.cudnn.version())"
8500
- Because this is the cuDNN subset bundled with PyTorch, locate its .so libraries with: find $(python -c "import torch; print(torch.__path__[0])") -name "*cudnn*so*"
(base) root@VM-24-95-ubuntu:/workspace# find $(python -c "import torch; print(torch.__path__[0])") -name "*cudnn*so*"
/root/miniforge3/lib/python3.10/site-packages/torch/lib/libcudnn.so.8
/root/miniforge3/lib/python3.10/site-packages/torch/lib/libcudnn_adv_infer.so.8
/root/miniforge3/lib/python3.10/site-packages/torch/lib/libcudnn_cnn_train.so.8
/root/miniforge3/lib/python3.10/site-packages/torch/lib/libcudnn_adv_train.so.8
/root/miniforge3/lib/python3.10/site-packages/torch/lib/libcudnn_ops_train.so.8
/root/miniforge3/lib/python3.10/site-packages/torch/lib/libcudnn_cnn_infer.so.8
/root/miniforge3/lib/python3.10/site-packages/torch/lib/libcudnn_ops_infer.so.8
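The integer 8500 printed above encodes cuDNN 8.5.0. A minimal sketch of the decoding follows; `decode_cudnn_version` is a hypothetical helper, and the rule for cuDNN 9 and later (major times 10000 instead of major times 1000) is stated here as an assumption based on how recent cuDNN headers define CUDNN_VERSION, so verify it against your own installation.

```python
def decode_cudnn_version(code):
    """Split the integer from torch.backends.cudnn.version() into (major, minor, patch)."""
    # Assumption: cuDNN 9+ encodes major * 10000 (e.g. 90800 for 9.8.0),
    # while cuDNN 8 and earlier encode major * 1000 (e.g. 8500 for 8.5.0).
    major_scale = 10000 if code >= 10000 else 1000
    major, rest = divmod(code, major_scale)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


for code in (8500, 90800):
    major, minor, patch = decode_cudnn_version(code)
    print(f"{code} -> {major}.{minor}.{patch}")
```

With PyTorch installed, the same function can be called as `decode_cudnn_version(torch.backends.cudnn.version())`.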
Verifying the cuDNN installation
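- Install the sample files and dependencies: apt -y install libcudnn8-samples libfreeimage-dev build-essential
  The cuDNN bundled with PyTorch in Cloud Studio was just shown to be version 8500 (8.5.0), so libcudnn8-samples is the package to install here.
- Build: cd /usr/src/cudnn_samples_v8/mnistCUDNN && make clean && make
- Run: ./mnistCUDNN. If Test passed! appears, cuDNN is installed correctly. Note that, as the build log shows (-L/usr/local/cuda/lib64 ... -lcudnn), the sample links against the libcudnn found on the system library path rather than the copy inside the PyTorch package.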
logs of `./mnistCUDNN`
(base) root@VM-24-95-ubuntu:/usr/src/cudnn_samples_v8/mnistCUDNN# make clean && make
rm -rf *o
rm -rf mnistCUDNN
CUDA_VERSION is 11070
Linking agains cublasLt = true
CUDA VERSION: 11070
TARGET ARCH: x86_64
HOST_ARCH: x86_64
TARGET OS: linux
SMS: 35 50 53 60 61 62 70 72 75 80 86 87
/usr/local/cuda/bin/nvcc -I/usr/local/cuda/include -I/usr/local/cuda/include -IFreeImage/include -ccbin g++ -m64 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_87,code=sm_87 -gencode arch=compute_87,code=compute_87 -o fp16_dev.o -c fp16_dev.cu
nvcc warning : The 'compute_35', 'compute_37', 'sm_35', and 'sm_37' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
g++ -I/usr/local/cuda/include -I/usr/local/cuda/include -IFreeImage/include -o fp16_emu.o -c fp16_emu.cpp
g++ -I/usr/local/cuda/include -I/usr/local/cuda/include -IFreeImage/include -o mnistCUDNN.o -c mnistCUDNN.cpp
/usr/local/cuda/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_87,code=sm_87 -gencode arch=compute_87,code=compute_87 -o mnistCUDNN fp16_dev.o fp16_emu.o mnistCUDNN.o -I/usr/local/cuda/include -I/usr/local/cuda/include -IFreeImage/include -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 -lcublasLt -LFreeImage/lib/linux/x86_64 -LFreeImage/lib/linux -lcudart -lcublas -lcudnn -lfreeimage -lstdc++ -lm
nvcc warning : The 'compute_35', 'compute_37', 'sm_35', and 'sm_37' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
(base) root@VM-24-95-ubuntu:/usr/src/cudnn_samples_v8/mnistCUDNN# ./mnistCUDNN
Executing: mnistCUDNN
cudnnGetVersion() : 8500 , CUDNN_VERSION from cudnn.h : 8500 (8.5.0)
Host compiler version : GCC 9.4.0
There are 1 CUDA capable devices on your machine :
device 0 : sms 40 Capabilities 7.5, SmClock 1590.0 Mhz, MemSize (Mb) 14928, MemClock 5001.0 Mhz, Ecc=1, boardGroupID=0
Using device 0
Testing single precision
Loading binary file data/conv1.bin
Loading binary file data/conv1.bias.bin
Loading binary file data/conv2.bin
Loading binary file data/conv2.bias.bin
Loading binary file data/ip1.bin
Loading binary file data/ip1.bias.bin
Loading binary file data/ip2.bin
Loading binary file data/ip2.bias.bin
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 57600 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.027136 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.027680 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.059392 time requiring 57600 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.095232 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.149504 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 5.357568 time requiring 184784 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 128000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.088064 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.088352 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.129024 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.135936 time requiring 1433120 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.144864 time requiring 128000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 5.752384 time requiring 2450080 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Resulting weights from Softmax:
0.0000000 0.9999399 0.0000000 0.0000000 0.0000561 0.0000000 0.0000012 0.0000017 0.0000010 0.0000000
Loading image data/three_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 57600 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.025984 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.030496 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.061536 time requiring 57600 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.085920 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.086048 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.118688 time requiring 2057744 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 128000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.080128 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.086432 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.087552 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.124960 time requiring 1433120 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.135456 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.143360 time requiring 128000 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 0.9999288 0.0000000 0.0000711 0.0000000 0.0000000 0.0000000 0.0000000
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 0.9999820 0.0000154 0.0000000 0.0000012 0.0000006
Result of classification: 1 3 5
Test passed!
Testing half precision (math in single precision)
Loading binary file data/conv1.bin
Loading binary file data/conv1.bias.bin
Loading binary file data/conv2.bin
Loading binary file data/conv2.bias.bin
Loading binary file data/ip1.bin
Loading binary file data/ip1.bias.bin
Loading binary file data/ip2.bin
Loading binary file data/ip2.bias.bin
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.028000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.030048 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.080224 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.086048 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.093568 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 2.026400 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 51584 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 64000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.104480 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.121888 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.129344 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.133152 time requiring 1433120 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.200096 time requiring 51584 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.919584 time requiring 64000 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Resulting weights from Softmax:
0.0000001 1.0000000 0.0000001 0.0000000 0.0000563 0.0000001 0.0000012 0.0000017 0.0000010 0.0000001
Loading image data/three_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.032352 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.036704 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.037408 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.079872 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.083968 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.085984 time requiring 2057744 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 51584 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 64000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.083360 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.120096 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.124992 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.127648 time requiring 1433120 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.193344 time requiring 51584 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.282880 time requiring 64000 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000714 0.0000000 0.0000000 0.0000000 0.0000000
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 1.0000000 0.0000154 0.0000000 0.0000012 0.0000006
Result of classification: 1 3 5
Test passed!
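When the sample is run unattended, the pass/fail check can be automated by scanning its output. The sketch below is an illustration only: `summarize_mnist_log` is a hypothetical helper and the sample text is a shortened excerpt of the log above (one "Test passed!" for single precision and one for half precision).

```python
import re


def summarize_mnist_log(text):
    """Return (number of 'Test passed!' lines, list of classification results)."""
    passed = len(re.findall(r"^Test passed!\s*$", text, flags=re.MULTILINE))
    results = [
        [int(digit) for digit in line.split()]
        for line in re.findall(
            r"^Result of classification:\s*(.+)$", text, flags=re.MULTILINE
        )
    ]
    return passed, results


# Shortened excerpt of the ./mnistCUDNN output shown in this article.
SAMPLE = """\
Testing single precision
Result of classification: 1 3 5
Test passed!
Testing half precision (math in single precision)
Result of classification: 1 3 5
Test passed!
"""

passed, results = summarize_mnist_log(SAMPLE)
print("tests passed:", passed)
print("classifications:", results)
```

In practice the text would come from capturing the output of ./mnistCUDNN, for example with `subprocess.run(..., capture_output=True, text=True)`.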
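Manually installing/upgrading cuDNN (optional)
Because most Cloud Studio AI templates rely on the cuDNN bundled with the AI framework, and Cloud Studio workspaces ship with conda, installing via pip install is recommended.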
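- For CUDA 11.7 (cu11): pip install nvidia-cudnn-cu11
  - If you need a specific version: pip install nvidia-cudnn-cu11==9.x.y.z
- You can still install from a tarball (see "Tarball Installation" in the NVIDIA cuDNN Installation guide):
  - Download the archive: wget https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-9.8.0.87_cuda11-archive.tar.xz
  - Extract it into the CUDA Toolkit directory: tar -xf cudnn-linux-x86_64-9.8.0.87_cuda11-archive.tar.xz --strip-components=1 -C /usr/local/cuda
  - After extracting, check where the libcudnn files ended up; if they are under a directory that is not on LD_LIBRARY_PATH, you may need to add it.
- Or install with conda (see "Conda Installation" in the NVIDIA cuDNN Installation guide): conda install cudnn cuda-version=<cuda-major-version> -c nvidia
- If some dependencies were installed with conda, keep using conda to install, upgrade, and manage dependencies; if they were installed with pip, keep using pip. Mixing the two is strongly discouraged: it can easily leave the dependencies in an inconsistent state, to the point where the environment has to be deleted and rebuilt.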
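Before installing or upgrading anything, it can help to see which NVIDIA packages pip already knows about, so that pip and conda installs are not mixed by accident. The following is a small sketch using the standard-library `importlib.metadata`; the helper names `is_nvidia_package` and `nvidia_pip_packages` are made up for this example, and the "nvidia-" prefix filter is an assumption based on the package names seen in this article (nvidia-cudnn-cu11, nvidia-cublas-cu11, nvidia-cuda-runtime-cu11).

```python
from importlib import metadata


def is_nvidia_package(name):
    """True for distribution names such as 'nvidia-cudnn-cu11' or 'nvidia_cudnn_cu11'."""
    return name.lower().replace("_", "-").startswith("nvidia-")


def nvidia_pip_packages():
    """Map installed distribution names starting with 'nvidia-' to their versions."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Skip distributions with broken metadata (no name recorded).
        if name and is_nvidia_package(name):
            found[name] = dist.version
    return found


packages = nvidia_pip_packages()
if not packages:
    print("no pip-installed nvidia-* packages found")
for name, version in sorted(packages.items()):
    print(f"{name}=={version}")
```

Packages installed with conda would be listed with `conda list` instead; this sketch only covers what is visible to the current Python environment.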
TensorRT (optional)
TensorRT is an inference acceleration library that can substantially speed up model inference in production.
- Install: pip install tensorrt-cu11
- Verify: python -c "import tensorrt;print(tensorrt.__version__);assert tensorrt.Builder(tensorrt.Logger())"
- Notes:
  - Cloud Studio ships with CUDA Toolkit 11.7 by default, so the cu11 build of TensorRT is used here.
  - Version 10 is the newer release and version 8 the older one (though version 8 is still the mainstream choice). In testing, both version 8 and version 10 install fine on Cloud Studio; see the logs below for details.
  - At the time of writing, pip install tensorrt-cu11 installs TensorRT version 10 for cu11 by default.
    - pip install tensorrt instead installs TensorRT version 10 for cu12.
    - To install a specific version: pip install tensorrt-cu11==10.0.1 or pip install tensorrt==8.5.3.1
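The package-selection notes above can be captured in a tiny helper. This is a sketch under assumptions: `tensorrt_requirement` is a hypothetical function, and the `tensorrt-cu<major>` naming pattern is taken from the tensorrt-cu11 package used in this article; whether a package exists for any other CUDA major version should be checked on PyPI before relying on it.

```python
def tensorrt_requirement(toolkit_version, version=None):
    """Build a pip requirement string for the TensorRT build matching a CUDA toolkit.

    toolkit_version: CUDA Toolkit release such as "11.7" (as reported by nvcc -V).
    version: optional exact TensorRT version to pin, such as "10.0.1".
    """
    cuda_major = int(toolkit_version.split(".")[0])
    # Assumption: packages follow the tensorrt-cu<major> naming seen in this article.
    package = f"tensorrt-cu{cuda_major}"
    return f"{package}=={version}" if version else package


print("pip install", tensorrt_requirement("11.7"))
print("pip install", tensorrt_requirement("11.7", "10.0.1"))
```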
logs of `pip install tensorrt-cu11`
(base) root@VM-24-95-ubuntu:/workspace# pip install tensorrt-cu11
Looking in indexes: http://mirrors.tencentyun.com/pypi/simple
Collecting tensorrt-cu11
Downloading http://mirrors.tencentyun.com/pypi/packages/ad/04/0d6cffca481309ca0f6904446b4a075ddbf759f249851b54938c43fa6982/tensorrt_cu11-10.9.0.34.tar.gz (18 kB)
Preparing metadata (setup.py) ... done
Collecting tensorrt_cu11_libs==10.9.0.34 (from tensorrt-cu11)
Downloading http://mirrors.tencentyun.com/pypi/packages/12/3f/8962914e14e265711f262ad961b437630acacbe794f730f1b6503fe1cec8/tensorrt_cu11_libs-10.9.0.34.tar.gz (704 bytes)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Collecting tensorrt_cu11_bindings==10.9.0.34 (from tensorrt-cu11)
Downloading http://mirrors.tencentyun.com/pypi/packages/6e/3c/056876197cf050b064fbc4a89a5f72e092ecf7a4f1454f0ca7c579fbc109/tensorrt_cu11_bindings-10.9.0.34-cp310-none-manylinux_2_28_x86_64.whl (1.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 28.1 MB/s eta 0:00:00
Collecting nvidia-cuda-runtime-cu11 (from tensorrt_cu11_libs==10.9.0.34->tensorrt-cu11)
Downloading http://mirrors.tencentyun.com/pypi/packages/a6/ec/a540f28b31de7bc1ed49eecc72035d4cb77db88ead1d42f7bfa5ae407ac6/nvidia_cuda_runtime_cu11-11.8.89-py3-none-manylinux2014_x86_64.whl (875 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 875.6/875.6 kB 24.6 MB/s eta 0:00:00
Building wheels for collected packages: tensorrt-cu11, tensorrt_cu11_libs
Building wheel for tensorrt-cu11 (setup.py) ... done
Created wheel for tensorrt-cu11: filename=tensorrt_cu11-10.9.0.34-py2.py3-none-any.whl size=17466 sha256=48b8117c9b58cef409a1838af20124df8e830c0f91ccb256ce68a34ccb8cbab7
Stored in directory: /root/.cache/pip/wheels/74/2a/8a/58fb3d73239359b35886927883f9ede3f874dfe000f4847afd
Building wheel for tensorrt_cu11_libs (pyproject.toml) ... done
Created wheel for tensorrt_cu11_libs: filename=tensorrt_cu11_libs-10.9.0.34-py2.py3-none-manylinux_2_28_x86_64.whl size=2053243630 sha256=bf85dc722a08f2b28bc206a147737f74c62bf24f93842ea0ab5b6b4094cb0af7
Stored in directory: /root/.cache/pip/wheels/50/fe/b9/a6137a71b76c0282920b71420d97a280aa7388573cbee6ec28
Successfully built tensorrt-cu11 tensorrt_cu11_libs
Installing collected packages: tensorrt_cu11_bindings, nvidia-cuda-runtime-cu11, tensorrt_cu11_libs, tensorrt-cu11
Successfully installed nvidia-cuda-runtime-cu11-11.8.89 tensorrt-cu11-10.9.0.34 tensorrt_cu11_bindings-10.9.0.34 tensorrt_cu11_libs-10.9.0.34
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
(base) root@VM-24-95-ubuntu:/workspace# python -c "import tensorrt;print(tensorrt.__version__);assert tensorrt.Builder(tensorrt.Logger())"
10.9.0.34
[03/11/2025-01:49:50] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
logs of `pip install tensorrt==8.5.3.1`
(base) root@VM-24-95-ubuntu:/workspace# pip install tensorrt==8.5.3.1
Looking in indexes: http://mirrors.tencentyun.com/pypi/simple
Collecting tensorrt==8.5.3.1
Downloading http://mirrors.tencentyun.com/pypi/packages/3e/d5/5f9dd454a89f5bf09c3740c649ba6c8dd685cae98a1255299a2e1dbac606/tensorrt-8.5.3.1-cp310-none-manylinux_2_17_x86_64.whl (549.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 549.5/549.5 MB 47.7 MB/s eta 0:00:00
Requirement already satisfied: nvidia-cuda-runtime-cu11 in /root/miniforge3/lib/python3.10/site-packages (from tensorrt==8.5.3.1) (11.8.89)
Collecting nvidia-cudnn-cu11 (from tensorrt==8.5.3.1)
Downloading http://mirrors.tencentyun.com/pypi/packages/22/32/6385ef0da5e01553e3b8ad55428fd4824cbff29ff941185082b17f030c9e/nvidia_cudnn_cu11-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl (434.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 434.5/434.5 MB 72.8 MB/s eta 0:00:00
Collecting nvidia-cublas-cu11 (from tensorrt==8.5.3.1)
Downloading http://mirrors.tencentyun.com/pypi/packages/ea/2e/9d99c60771d275ecf6c914a612e9a577f740a615bc826bec132368e1d3ae/nvidia_cublas_cu11-11.11.3.6-py3-none-manylinux2014_x86_64.whl (417.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 417.9/417.9 MB 63.4 MB/s eta 0:00:00
Installing collected packages: nvidia-cublas-cu11, nvidia-cudnn-cu11, tensorrt
Successfully installed nvidia-cublas-cu11-11.11.3.6 nvidia-cudnn-cu11-9.8.0.87 tensorrt-8.5.3.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
(base) root@VM-24-95-ubuntu:/workspace# python -c "import tensorrt;print(tensorrt.__version__);assert tensorrt.Builder(tensorrt.Logger())"
8.5.3.1
[03/11/2025-02:03:52] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
Troubleshooting
1 The NVIDIA driver got broken
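- Problem: after a series of configuration and installation steps, something appears to be broken; the symptoms are shown in the attachment.
- Cause:
  - The driver or the CUDA Toolkit was probably updated with apt install or bash NVIDIA-Linux-x86_64-XXX.XXX.XXX.run. Updating the driver this way cannot succeed in Cloud Studio.
  - Using pip install should not break the driver environment.
  - Because the Cloud Studio NVIDIA driver is mounted read-only into the container, uninstalling the user-installed driver restores the original one. (If you changed the $PATH or LD_LIBRARY_PATH environment variables, restore them to their original values as well.)
- Fix: apt remove '*nvidia*' -y
- Attachment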
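To confirm that the environment variables are back to their original state, the current values can be compared with the entries seen in the echo $PATH and echo $LD_LIBRARY_PATH output earlier in this article. The sketch below assumes those entries are the ones that matter for the driver and toolkit; `missing_entries` and `EXPECTED` are hypothetical names introduced for this example.

```python
import os

# Entries taken from the `echo $PATH` / `echo $LD_LIBRARY_PATH` output in this article.
EXPECTED = {
    "PATH": ["/usr/local/nvidia/bin", "/usr/local/cuda/bin"],
    "LD_LIBRARY_PATH": [
        "/usr/local/nvidia/lib",
        "/usr/local/nvidia/lib64",
        "/usr/local/cuda/lib64",
    ],
}


def missing_entries(value, expected):
    """Return the expected directories absent from a colon-separated variable."""
    present = {entry.rstrip("/") for entry in value.split(":") if entry}
    return [path for path in expected if path not in present]


for variable, expected in EXPECTED.items():
    missing = missing_entries(os.environ.get(variable, ""), expected)
    print(variable, "missing:", missing if missing else "nothing")
```

Any directory reported as missing would need to be added back, for example in the shell profile, before nvidia-smi and nvcc can be found again.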
Attachment 1: nvidia-smi error after updating the driver with apt install
(base) root@VM-24-95-ubuntu:/workspace# apt install nvidia-driver-535 -y
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
libnvidia-cfg1-535 libnvidia-common-535 libnvidia-compute-535
libnvidia-decode-535 libnvidia-encode-535 libnvidia-extra-535
libnvidia-fbc1-535 libnvidia-gl-535 nvidia-compute-utils-535 nvidia-dkms-535
nvidia-kernel-common-535 nvidia-kernel-source-535 nvidia-prime
nvidia-settings nvidia-utils-535 xserver-xorg-video-nvidia-535
Recommended packages:
libnvidia-compute-535:i386 libnvidia-decode-535:i386
libnvidia-encode-535:i386 libnvidia-fbc1-535:i386 libnvidia-gl-535:i386
The following NEW packages will be installed:
libnvidia-cfg1-535 libnvidia-common-535 libnvidia-compute-535
libnvidia-decode-535 libnvidia-encode-535 libnvidia-extra-535
libnvidia-fbc1-535 libnvidia-gl-535 nvidia-compute-utils-535 nvidia-dkms-535
nvidia-driver-535 nvidia-kernel-common-535 nvidia-kernel-source-535
nvidia-prime nvidia-settings nvidia-utils-535 xserver-xorg-video-nvidia-535
0 upgraded, 17 newly installed, 0 to remove and 4 not upgraded.
Need to get 308 MB of archives.
After this operation, 801 MB of additional disk space will be used.
Get:1 http://mirrors.cloud.tencent.com/ubuntu focal-updates/main amd64 nvidia-prime all 0.8.16~0.20.04.2 [9960 B]
Get:2 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64 libnvidia-cfg1-535 535.230.02-0ubuntu1 [98.9 kB]
Get:3 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64 libnvidia-common-535 535.230.02-0ubuntu1 [14.9 kB]
Get:4 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64 libnvidia-compute-535 535.230.02-0ubuntu1 [36.9 MB]
Get:5 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64 libnvidia-decode-535 535.230.02-0ubuntu1 [1660 kB]
Get:6 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64 libnvidia-encode-535 535.230.02-0ubuntu1 [90.0 kB]
Get:7 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64 libnvidia-extra-535 535.230.02-0ubuntu1 [256 kB]
Get:8 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64 libnvidia-fbc1-535 535.230.02-0ubuntu1 [51.3 kB]
Get:9 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64 libnvidia-gl-535 535.230.02-0ubuntu1 [183 MB]
Get:10 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64 nvidia-compute-utils-535 535.230.02-0ubuntu1 [285 kB]
Get:11 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64 nvidia-kernel-source-535 535.230.02-0ubuntu1 [44.5 MB]
Get:12 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64 nvidia-kernel-common-535 535.230.02-0ubuntu1 [38.4 MB]
Get:13 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64 nvidia-dkms-535 535.230.02-0ubuntu1 [34.2 kB]
Get:14 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64 nvidia-utils-535 535.230.02-0ubuntu1 [382 kB]
Get:15 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64 xserver-xorg-video-nvidia-535 535.230.02-0ubuntu1 [1504 kB]
Get:16 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64 nvidia-driver-535 535.230.02-0ubuntu1 [478 kB]
Get:17 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64 nvidia-settings 570.124.06-0ubuntu1 [951 kB]
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LC_CTYPE = "C.UTF-8",
LANG = "en_US.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
debconf: delaying package configuration, since apt-utils is not installed
Fetched 308 MB in 27s (11.5 MB/s)
Selecting previously unselected package libnvidia-cfg1-535:amd64.
(Reading database ... 84774 files and directories currently installed.)
Preparing to unpack .../00-libnvidia-cfg1-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking libnvidia-cfg1-535:amd64 (535.230.02-0ubuntu1) ...
Selecting previously unselected package libnvidia-common-535.
Preparing to unpack .../01-libnvidia-common-535_535.230.02-0ubuntu1_all.deb ...
Unpacking libnvidia-common-535 (535.230.02-0ubuntu1) ...
Selecting previously unselected package libnvidia-compute-535:amd64.
Preparing to unpack .../02-libnvidia-compute-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking libnvidia-compute-535:amd64 (535.230.02-0ubuntu1) ...
Selecting previously unselected package libnvidia-decode-535:amd64.
Preparing to unpack .../03-libnvidia-decode-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking libnvidia-decode-535:amd64 (535.230.02-0ubuntu1) ...
Selecting previously unselected package libnvidia-encode-535:amd64.
Preparing to unpack .../04-libnvidia-encode-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking libnvidia-encode-535:amd64 (535.230.02-0ubuntu1) ...
Selecting previously unselected package libnvidia-extra-535:amd64.
Preparing to unpack .../05-libnvidia-extra-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking libnvidia-extra-535:amd64 (535.230.02-0ubuntu1) ...
Selecting previously unselected package libnvidia-fbc1-535:amd64.
Preparing to unpack .../06-libnvidia-fbc1-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking libnvidia-fbc1-535:amd64 (535.230.02-0ubuntu1) ...
Selecting previously unselected package libnvidia-gl-535:amd64.
Preparing to unpack .../07-libnvidia-gl-535_535.230.02-0ubuntu1_amd64.deb ...
dpkg-query: no packages found matching libnvidia-gl-450
Unpacking libnvidia-gl-535:amd64 (535.230.02-0ubuntu1) ...
Preparing to unpack .../08-nvidia-compute-utils-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking nvidia-compute-utils-535 (535.230.02-0ubuntu1) ...
dpkg: error processing archive /tmp/apt-dpkg-install-weWcQR/08-nvidia-compute-utils-535_535.230.02-0ubuntu1_amd64.deb (--unpack):
unable to make backup link of './usr/bin/nvidia-cuda-mps-control' before installing new version: Invalid cross-device link
dpkg-deb: error: paste subprocess was killed by signal (Broken pipe)
Selecting previously unselected package nvidia-kernel-source-535.
Preparing to unpack .../09-nvidia-kernel-source-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking nvidia-kernel-source-535 (535.230.02-0ubuntu1) ...
Selecting previously unselected package nvidia-kernel-common-535.
Preparing to unpack .../10-nvidia-kernel-common-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking nvidia-kernel-common-535 (535.230.02-0ubuntu1) ...
Selecting previously unselected package nvidia-dkms-535.
Preparing to unpack .../11-nvidia-dkms-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking nvidia-dkms-535 (535.230.02-0ubuntu1) ...
Preparing to unpack .../12-nvidia-utils-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking nvidia-utils-535 (535.230.02-0ubuntu1) ...
dpkg: error processing archive /tmp/apt-dpkg-install-weWcQR/12-nvidia-utils-535_535.230.02-0ubuntu1_amd64.deb (--unpack):
unable to make backup link of './usr/bin/nvidia-debugdump' before installing new version: Invalid cross-device link
dpkg-deb: error: paste subprocess was killed by signal (Broken pipe)
Selecting previously unselected package xserver-xorg-video-nvidia-535.
Preparing to unpack .../13-xserver-xorg-video-nvidia-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking xserver-xorg-video-nvidia-535 (535.230.02-0ubuntu1) ...
Selecting previously unselected package nvidia-driver-535.
Preparing to unpack .../14-nvidia-driver-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking nvidia-driver-535 (535.230.02-0ubuntu1) ...
Selecting previously unselected package nvidia-prime.
Preparing to unpack .../15-nvidia-prime_0.8.16~0.20.04.2_all.deb ...
Unpacking nvidia-prime (0.8.16~0.20.04.2) ...
Selecting previously unselected package nvidia-settings.
Preparing to unpack .../16-nvidia-settings_570.124.06-0ubuntu1_amd64.deb ...
Unpacking nvidia-settings (570.124.06-0ubuntu1) ...
Errors were encountered while processing:
/tmp/apt-dpkg-install-weWcQR/08-nvidia-compute-utils-535_535.230.02-0ubuntu1_amd64.deb
/tmp/apt-dpkg-install-weWcQR/12-nvidia-utils-535_535.230.02-0ubuntu1_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)
(base) root@VM-24-95-ubuntu:/workspace# nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
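The `Driver/library version mismatch` error is consistent with the user-space libraries that apt unpacked (535.230.02) no longer matching the kernel module loaded on the host, which nvidia-smi reported as 525.105.17 at the start of this article. Below is a minimal Python sketch of that comparison. It assumes the kernel module version can be read from `/proc/driver/nvidia/version` and that the package version looks like `535.230.02-0ubuntu1` (no epoch prefix). The function names are hypothetical helpers for illustration, not part of any NVIDIA tool.

```python
import re
from typing import Optional


def kernel_module_version(proc_text: str) -> Optional[str]:
    """Extract the driver version from the text of /proc/driver/nvidia/version."""
    # Take the first dotted version number that follows "Kernel Module".
    match = re.search(r"Kernel Module.*?(\d+\.\d+(?:\.\d+)*)", proc_text)
    return match.group(1) if match else None


def package_upstream_version(deb_version: str) -> str:
    """Drop the Debian revision, e.g. '535.230.02-0ubuntu1' -> '535.230.02'."""
    return deb_version.split("-", 1)[0]


def versions_match(kernel_version: str, library_version: str) -> bool:
    """nvidia-smi only works when both sides come from the same driver release."""
    return kernel_version == library_version


if __name__ == "__main__":
    # Assumed sample line; the version number is the one nvidia-smi reported here.
    sample = "NVRM version: NVIDIA UNIX x86_64 Kernel Module  525.105.17"
    kernel = kernel_module_version(sample)
    library = package_upstream_version("535.230.02-0ubuntu1")
    print(kernel, library, versions_match(kernel, library))
    # prints: 525.105.17 535.230.02 False
```

On a real machine, the `sample` string would be replaced by the contents of `/proc/driver/nvidia/version`, and the package version could be taken from `dpkg -l` output. A result of `False` suggests removing the newly installed packages, as done in the next appendix, rather than reinstalling the driver inside the container.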
Appendix 2: nvidia-smi works again after repairing with apt remove
(base) root@VM-24-95-ubuntu:/workspace# apt remove *nvidia* -y
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
Reading package lists...
Building dependency tree...
Reading state information...
Package 'nvidia-304' is not installed, so not removed
(some log lines omitted here)
Package 'linux-objects-nvidia-535-server-5.15.0-1049-aws' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1049-azure' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1049-gcp' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1049-intel-iotg' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1049-oracle' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-105-generic' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-105-lowlatency' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1050-aws' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1050-azure' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1050-intel-iotg' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1050-oracle' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1051-aws' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1051-azure' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1051-gcp' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1051-oracle' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1052-aws' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1052-azure' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1052-gcp' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1052-intel-iotg' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1052-oracle' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1053-aws' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1053-azure' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1053-gcp' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1053-oracle' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1054-azure' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1055-aws' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1055-gcp' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1055-intel-iotg' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1055-oracle' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1056-azure' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1057-aws' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1057-azure' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1058-aws' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1058-azure' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1058-gcp' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1058-intel-iotg' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1058-oracle' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1059-gcp' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1059-intel-iotg' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1059-oracle' is not installed, so not removed
The following packages were automatically installed and are no longer required:
accountsservice acl apg apport apport-symptoms aptdaemon aptdaemon-data
aspell aspell-en avahi-daemon avahi-utils bind9-host bind9-libs bluez bolt
bubblewrap cheese-common colord colord-data cracklib-runtime crda
cups-pk-helper dconf-cli dctrl-tools desktop-file-utils dictionaries-common
dkms dns-root-data dnsmasq-base docbook-xml emacsen-common enchant-2
evolution-data-server evolution-data-server-common fprintd gdm3 geoclue-2.0
gettext-base gir1.2-accountsservice-1.0 gir1.2-atk-1.0 gir1.2-atspi-2.0
gir1.2-freedesktop gir1.2-gck-1 gir1.2-gcr-3 gir1.2-gdesktopenums-3.0
gir1.2-gdkpixbuf-2.0 gir1.2-gdm-1.0 gir1.2-geoclue-2.0
gir1.2-gnomebluetooth-1.0 gir1.2-gnomedesktop-3.0 gir1.2-graphene-1.0
gir1.2-gtk-3.0 gir1.2-gweather-3.0 gir1.2-ibus-1.0 gir1.2-json-1.0
gir1.2-mutter-6 gir1.2-nm-1.0 gir1.2-nma-1.0 gir1.2-notify-0.7
gir1.2-pango-1.0 gir1.2-polkit-1.0 gir1.2-rsvg-2.0 gir1.2-secret-1
gir1.2-soup-2.4 gir1.2-upowerglib-1.0 gir1.2-vte-2.91 gjs gkbd-capplet
gnome-control-center gnome-control-center-data gnome-control-center-faces
gnome-desktop3-data gnome-keyring gnome-keyring-pkcs11 gnome-menus
gnome-online-accounts gnome-session-bin gnome-session-common
gnome-settings-daemon gnome-settings-daemon-common gnome-shell
gnome-shell-common gnome-startup-applications gnome-user-docs groff-base
gstreamer1.0-clutter-3.0 gstreamer1.0-gl gstreamer1.0-plugins-base
gstreamer1.0-plugins-good gstreamer1.0-pulseaudio gstreamer1.0-x
hunspell-en-us ibus ibus-data ibus-gtk ibus-gtk3 iio-sensor-proxy im-config
ippusbxd iptables iw keyboard-configuration kmod language-selector-common
language-selector-gnome libaa1 libaccountsservice0 libappindicator3-1
libarchive13 libasound2-plugins libaspell15 libasyncns0 libavahi-core7
libavahi-glib1 libavc1394-0 libbluetooth3 libboost-thread1.71.0 libcaca0
libcamel-1.2-62 libcanberra-gtk3-0 libcanberra-gtk3-module libcanberra-pulse
libcdparanoia0 libcheese-gtk25 libcheese8 libclutter-1.0-0
libclutter-1.0-common libclutter-gst-3.0-0 libclutter-gtk-1.0-0
libcogl-common libcogl-pango20 libcogl-path20 libcogl20 libcolord-gtk1
libcolorhug2 libcrack2 libdaemon0 libdbusmenu-glib4 libdbusmenu-gtk3-4
libdrm-amdgpu1 libdrm-common libdrm-intel1 libdrm-nouveau2 libdrm-radeon1
libdrm2 libdv4 libebackend-1.2-10 libebook-1.2-20 libebook-contacts-1.2-3
libecal-2.0-1 libedata-book-1.2-26 libedata-cal-2.0-1 libedataserver-1.2-24
libedataserverui-1.2-2 libegl-mesa0 libegl1 libenchant-2-2 libevdev2
libexif12 libflac8 libfontenc1 libfprint-2-2 libgail-common libgail18
libgbm1 libgd3 libgdata-common libgdata22 libgdm1 libgee-0.8-2
libgeoclue-2-0 libgeocode-glib0 libgjs0g libgl1 libgl1-mesa-dri
libglapi-mesa libgles2 libglvnd0 libglx-mesa0 libglx0 libgnome-autoar-0-0
libgnome-bluetooth13 libgnome-desktop-3-19 libgnomekbd-common libgnomekbd8
libgoa-1.0-0b libgoa-1.0-common libgoa-backend-1.0-1 libgphoto2-6
libgphoto2-l10n libgphoto2-port12 libgraphene-1.0-0 libgsound0
libgssdp-1.2-0 libgstreamer-gl1.0-0 libgstreamer-plugins-base1.0-0
libgstreamer-plugins-good1.0-0 libgtk2.0-0 libgtk2.0-bin libgtk2.0-common
libgtop-2.0-11 libgtop2-common libgudev-1.0-0 libgupnp-1.2-0
libgupnp-av-1.0-2 libgupnp-dlna-2.0-3 libgusb2 libgweather-3-16
libgweather-common libharfbuzz-icu0 libhunspell-1.7-0 libhyphen0
libibus-1.0-5 libical3 libice6 libidn11 libiec61883-0 libieee1284-3
libimobiledevice6 libinput-bin libinput10 libip6tc2 libjack-jackd2-0
libjansson4 libjavascriptcoregtk-4.0-18 libldb2 libllvm12 libmaxminddb0
libmbim-glib4 libmbim-proxy libmediaart-2.0-0 libmm-glib0 libmnl0
libmozjs-68-0 libmp3lame0 libmpg123-0 libmtdev1 libmutter-6-0
libmysqlclient21 libndp0 libnetfilter-conntrack3 libnewt0.52 libnfnetlink0
libnftnl11 libnl-3-200 libnl-genl-3-200 libnl-route-3-200 libnm0 libnma0
libnotify4 libnspr4 libnss-mdns libnss3 libopengl0 libopus0 liborc-0.4-0
libpam-fprintd libpam-gnome-keyring libpangoxft-1.0-0 libpcap0.8 libpci3
libpciaccess0 libpcsclite1 libphonenumber7 libpipeline1 libplist3
libprotobuf17 libpulse-mainloop-glib0 libpulse0 libpulsedsp
libpwquality-common libpwquality1 libqmi-glib5 libqmi-proxy libraw1394-11
librygel-core-2.6-2 librygel-db-2.6-2 librygel-renderer-2.6-2
librygel-server-2.6-2 libsamplerate0 libsane libsane-common libsbc1
libsensors-config libsensors5 libshout3 libslang2 libsm6 libsmbclient
libsnapd-glib1 libsndfile1 libsnmp-base libsnmp35 libsodium23 libsoxr0
libspeex1 libspeexdsp1 libstartup-notification0 libtag1v5 libtag1v5-vanilla
libtalloc2 libteamdctl0 libtevent0 libtext-iconv-perl libtheora0 libtwolame0
libuchardet0 libudisks2-0 libunwind8 libupower-glib3 libusb-1.0-0
libusbmuxd6 libuv1 libv4l-0 libv4lconvert0 libvdpau1 libvisual-0.4-0
libvorbisenc2 libvpx6 libvte-2.91-0 libvte-2.91-common libvulkan1
libwacom-bin libwacom-common libwacom2 libwavpack1 libwayland-server0
libwbclient0 libwebkit2gtk-4.0-37 libwebpdemux2 libwebrtc-audio-processing1
libwhoopsie-preferences0 libwhoopsie0 libwoff1 libx11-xcb1 libxatracker2
libxaw7 libxcb-dri2-0 libxcb-dri3-0 libxcb-glx0 libxcb-icccm4 libxcb-image0
libxcb-keysyms1 libxcb-present0 libxcb-randr0 libxcb-render-util0
libxcb-res0 libxcb-shape0 libxcb-sync1 libxcb-util1 libxcb-xfixes0
libxcb-xkb1 libxcb-xv0 libxfont2 libxft2 libxkbcommon-x11-0 libxkbfile1
libxklavier16 libxmu6 libxnvctrl0 libxpm4 libxshmfence1 libxslt1.1 libxss1
libxt6 libxtables12 libxv1 libxvmc1 libxxf86vm1 libyelp0
linux-headers-5.4.0-208 linux-headers-5.4.0-208-generic
linux-headers-generic man-db mesa-vdpau-drivers mesa-vulkan-drivers
mobile-broadband-provider-info modemmanager mousetweaks mutter mutter-common
mysql-common network-manager network-manager-gnome network-manager-pptp
p11-kit p11-kit-modules pci.ids pkg-config ppp pptp-linux pulseaudio
pulseaudio-module-bluetooth pulseaudio-utils python3-apport
python3-aptdaemon python3-aptdaemon.gtk3widgets python3-blinker
python3-cairo python3-cffi-backend python3-cryptography python3-cups
python3-cupshelpers python3-defer python3-entrypoints python3-httplib2
python3-ibus-1.0 python3-jwt python3-keyring python3-launchpadlib
python3-lazr.restfulclient python3-lazr.uri python3-ldb
python3-macaroonbakery python3-nacl python3-oauthlib python3-problem-report
python3-protobuf python3-pymacaroons python3-rfc3339 python3-secretstorage
python3-simplejson python3-systemd python3-talloc python3-tz python3-wadllib
python3-xkit rtkit rygel samba-libs sane-utils screen-resolution-extra
session-migration sgml-base sgml-data sudo switcheroo-control
system-config-printer system-config-printer-common
system-config-printer-udev ubuntu-docs ubuntu-session ubuntu-wallpapers
ubuntu-wallpapers-focal udev update-inetd upower usb-modeswitch
usb-modeswitch-data usb.ids usbmuxd vdpau-driver-all wamerican
whoopsie-preferences wireless-regdb wpasupplicant x11-xkb-utils
x11-xserver-utils xdg-dbus-proxy xfonts-base xfonts-encodings xfonts-utils
xml-core xserver-common xserver-xephyr xserver-xorg xserver-xorg-core
xserver-xorg-input-all xserver-xorg-input-libinput xserver-xorg-input-wacom
xserver-xorg-legacy xserver-xorg-video-all xserver-xorg-video-amdgpu
xserver-xorg-video-ati xserver-xorg-video-fbdev xserver-xorg-video-intel
xserver-xorg-video-nouveau xserver-xorg-video-qxl xserver-xorg-video-radeon
xserver-xorg-video-vesa xserver-xorg-video-vmware xwayland
yaru-theme-gnome-shell yelp yelp-xsl zenity zenity-common
Use 'apt autoremove' to remove them.
The following packages will be REMOVED:
libnvidia-cfg1-535 libnvidia-common-535 libnvidia-compute-535
libnvidia-decode-535 libnvidia-encode-535 libnvidia-extra-535
libnvidia-fbc1-535 libnvidia-gl-535 nvidia-dkms-535 nvidia-driver-535
nvidia-kernel-common-535 nvidia-kernel-source-535 nvidia-prime
nvidia-settings xserver-xorg-video-nvidia-535
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LC_CTYPE = "C.UTF-8",
LANG = "en_US.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
0 upgraded, 0 newly installed, 15 to remove and 4 not upgraded.
15 not fully installed or removed.
After this operation, 798 MB disk space will be freed.
(Reading database ... 85474 files and directories currently installed.)
Removing nvidia-driver-535 (535.230.02-0ubuntu1) ...
Removing xserver-xorg-video-nvidia-535 (535.230.02-0ubuntu1) ...
Removing libnvidia-cfg1-535:amd64 (535.230.02-0ubuntu1) ...
Removing libnvidia-gl-535:amd64 (535.230.02-0ubuntu1) ...
Removing libnvidia-common-535 (535.230.02-0ubuntu1) ...
Removing libnvidia-encode-535:amd64 (535.230.02-0ubuntu1) ...
Removing libnvidia-decode-535:amd64 (535.230.02-0ubuntu1) ...
Removing libnvidia-compute-535:amd64 (535.230.02-0ubuntu1) ...
Removing libnvidia-extra-535:amd64 (535.230.02-0ubuntu1) ...
Removing libnvidia-fbc1-535:amd64 (535.230.02-0ubuntu1) ...
Removing nvidia-dkms-535 (535.230.02-0ubuntu1) ...
Removing nvidia-kernel-common-535 (535.230.02-0ubuntu1) ...
Removing nvidia-kernel-source-535 (535.230.02-0ubuntu1) ...
Removing nvidia-prime (0.8.16~0.20.04.2) ...
Removing nvidia-settings (570.124.06-0ubuntu1) ...
Processing triggers for mime-support (3.64ubuntu1) ...
Processing triggers for gnome-menus (3.36.0-1ubuntu1) ...
Processing triggers for libc-bin (2.31-0ubuntu9.17) ...
Processing triggers for man-db (2.9.1-1) ...
Processing triggers for dbus (1.12.16-2ubuntu2.3) ...
Processing triggers for desktop-file-utils (0.24-1ubuntu3) ...
(base) root@VM-24-95-ubuntu:/workspace# nvidia-smi
Fri Mar 14 03:24:46 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:09.0 Off | 0 |
| N/A 31C P8 9W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+