[阿维科普] What is cuDNN? How to install CUDA and cuDNN


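What is cuDNN, and why install it? This article introduces NVIDIA hardware and drivers (including the NVIDIA driver), the CUDA Toolkit, the cuDNN family of libraries, and TensorRT, and explains how these layers of hardware, driver, and software relate to one another and what each one does. It uses Tencent Cloud Studio as the example environment and walks through how GPU acceleration for PyTorch is installed and configured there.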

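Introduction to Cloud Studio

Cloud Studio (a cloud IDE) is a browser-based integrated development environment that gives developers a stable cloud workstation with access to both CPUs and GPUs. Nothing needs to be installed: open a browser anywhere, at any time, and start working. Cloud Studio offers a free CPU environment (50,000 minutes per month) and a free GPU environment (one Tesla T4 16G, 10,000 minutes per month). This article uses the Cloud Studio GPU environment for its demonstrations.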

Starting a Cloud Studio GPU workspace

  • First register for and activate Cloud Studio via curl.qcloud.com/sdeIX8nx
  • Go to ide.cloud.tencent.com/ to open the Cloud Studio home page
  • Click Workspace Templates -> AI Templates -> Pytorch2.0.0
  • Choose the free basic tier -> Confirm
  • Click High-Performance Workspaces. The entry Pytorch2.0.0 gssrak is the GPU workspace that was just created; it shows a green dot and the status "Running".
  • Click Pytorch2.0.0 gssrak to enter the workspace. Loading finishes in under a minute.

Nvidia driver

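The NVIDIA driver is the driver software for NVIDIA GPUs. Only with it installed can the GPU be driven correctly, whether to render display output (on studio/professional or gaming cards) or to accelerate scientific computing (on data-center cards and the like). It is also the foundation for installing the CUDA Toolkit or cuDNN afterwards.

  • Because Cloud Studio is built on container technology, the same version of the NVIDIA driver is already in place on both the host and the GPU workspace (which is essentially a container). It can be inspected with `nvidia-smi`.

  • Open a terminal and run `nvidia-smi`: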

    (base) root@VM-24-95-ubuntu:/workspace# nvidia-smi
    Mon Mar 10 12:13:25 2025       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:00:09.0 Off |                    0 |
    | N/A   31C    P8    10W /  70W |      2MiB / 15360MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    
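  • `Driver Version: 525.105.17` means the NVIDIA driver version is 525.105.17

  • `CUDA Version: 12.0` means the highest CUDA version this driver supports is 12.0

    • In other words, this machine supports CUDA 12.0 and any earlier version (CUDA 11.8, CUDA 11.7, CUDA 10.0, and so on). Versions above CUDA 12.0, such as CUDA 12.1 or CUDA 12.8, are not supported by this driver out of the box.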
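The rule above can be sketched as a small Python check. This is only an illustration under the simplifying assumption stated in the text (a CUDA version is usable when it does not exceed the driver's maximum); the function names `parse_smi_header` and `runtime_supported` are made up for this example and are not part of any NVIDIA tool.

```python
import re


def parse_smi_header(line: str) -> dict:
    """Extract the driver version and the maximum supported CUDA version
    from the header row printed by nvidia-smi."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", line)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", line)
    if not driver or not cuda:
        raise ValueError("not an nvidia-smi header line")
    return {"driver": driver.group(1), "max_cuda": cuda.group(1)}


def as_tuple(version: str) -> tuple:
    """Turn "11.7" into (11, 7) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def runtime_supported(max_cuda: str, wanted: str) -> bool:
    """True when the wanted CUDA version does not exceed the driver's maximum."""
    return as_tuple(wanted) <= as_tuple(max_cuda)


# Header row copied from the nvidia-smi output shown above.
HEADER = "| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |"
info = parse_smi_header(HEADER)
for wanted in ("11.7", "12.0", "12.1"):
    print(wanted, runtime_supported(info["max_cuda"], wanted))
```

Comparing integer tuples rather than strings avoids mistakes such as treating "12.10" as older than "12.2".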

CUDA toolkit

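The CUDA Toolkit is NVIDIA's complete set of development tools for building and optimizing CUDA programs. It includes a compiler (nvcc), a debugger, the runtime library (cudart), profiling tools, and a range of math and compute libraries. Note that merely running TensorFlow or PyTorch does not require the (full) CUDA Toolkit: the CUDA and cuDNN libraries that come with the framework's GPU build are enough for GPU-accelerated computation. The CUDA Toolkit is only needed for tasks such as developing CUDA operators or compiling GPU-accelerated extensions (for example the Apex library).

Cloud Studio comes with CUDA Toolkit 11.7 installed and configured by default.

  • Run `nvcc -V` to check whether the CUDA Toolkit is installed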

    (base) root@VM-24-95-ubuntu:/workspace# nvcc -V
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2022 NVIDIA Corporation
    Built on Wed_Jun__8_16:49:14_PDT_2022
    Cuda compilation tools, release 11.7, V11.7.99
    Build cuda_11.7.r11.7/compiler.31442593_0
    
  • Run `echo $PATH` and check that it contains the path /usr/local/cuda/bin

    (base) root@VM-24-95-ubuntu:/workspace# echo $PATH
    /etc/.hai/cloud_studio/vendor/modules/code-oss-dev/bin/remote-cli:/root/miniforge3/bin:/root/miniforge3/condabin:/root/miniforge3/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    
  • Run `echo $LD_LIBRARY_PATH` and check that it contains the path /usr/local/cuda/lib64

    (base) root@VM-24-95-ubuntu:/workspace# echo $LD_LIBRARY_PATH
    /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
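
The three manual checks above could also be scripted. The following is a minimal sketch, not an official tool: `parse_nvcc_release` and `path_contains` are hypothetical helper names, and the sample strings are copied from the terminal output shown in this section.

```python
import os
import re


def parse_nvcc_release(output: str):
    """Return the toolkit release (for example "11.7") from `nvcc -V` output,
    or None when no release line is present."""
    match = re.search(r"release\s+(\d+\.\d+)", output)
    return match.group(1) if match else None


def path_contains(value: str, entry: str) -> bool:
    """True when `entry` is one of the colon-separated items in `value`.
    Trailing slashes are ignored so /usr/local/cuda/bin/ also matches."""
    wanted = entry.rstrip("/")
    return any(item.rstrip("/") == wanted for item in value.split(":") if item)


# Sample line taken from the nvcc -V output above.
NVCC_OUTPUT = "Cuda compilation tools, release 11.7, V11.7.99"
print("toolkit release:", parse_nvcc_release(NVCC_OUTPUT))

# Check the variables of the current process.
print("nvcc on PATH:", path_contains(os.environ.get("PATH", ""), "/usr/local/cuda/bin"))
print("CUDA libs on LD_LIBRARY_PATH:",
      path_contains(os.environ.get("LD_LIBRARY_PATH", ""), "/usr/local/cuda/lib64"))
```

Splitting on `:` and comparing whole entries avoids false positives that a plain substring test would give, for instance matching /usr/local/cuda/bin inside /usr/local/cuda/bin-old.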
    

cuDNN

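Introduction to cuDNN

The NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. It provides optimized implementations of operations that occur frequently in DNN applications. cuDNN is the library that TensorFlow, PyTorch, and large-model serving platforms actually call into for these GPU-accelerated operations.

If the Pytorch2.0.0 workspace template is used as described above, cuDNN does not need to be installed separately: Cloud Studio has already installed and configured the GPU build of PyTorch, which includes the cuDNN libraries it needs.

Checking the cuDNN version

  • Check whether PyTorch can use CUDA: `python -c "import torch;print(torch.cuda.is_available())"`

  • Check whether cuDNN is enabled: `python -c "import torch;print(torch.backends.cudnn.enabled)"`

  • Check the cuDNN version: `python -c "import torch;print(torch.backends.cudnn.version())"`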

    (base) root@VM-24-95-ubuntu:/workspace# python -c "import torch;print(torch.cuda.is_available())"
    True
    (base) root@VM-24-95-ubuntu:/workspace# python -c "import torch;print(torch.backends.cudnn.enabled)"
    True
    (base) root@VM-24-95-ubuntu:/workspace# python -c "import torch;print(torch.backends.cudnn.version())"
    8500
    
  • Because this cuDNN is the copy bundled with PyTorch, its shared libraries can be located with `find $(python -c "import torch; print(torch.__path__[0])") -name "*cudnn*so*"`

    (base) root@VM-24-95-ubuntu:/workspace# find $(python -c "import torch; print(torch.__path__[0])") -name "*cudnn*so*"
    /root/miniforge3/lib/python3.10/site-packages/torch/lib/libcudnn.so.8
    /root/miniforge3/lib/python3.10/site-packages/torch/lib/libcudnn_adv_infer.so.8
    /root/miniforge3/lib/python3.10/site-packages/torch/lib/libcudnn_cnn_train.so.8
    /root/miniforge3/lib/python3.10/site-packages/torch/lib/libcudnn_adv_train.so.8
    /root/miniforge3/lib/python3.10/site-packages/torch/lib/libcudnn_ops_train.so.8
    /root/miniforge3/lib/python3.10/site-packages/torch/lib/libcudnn_cnn_infer.so.8
    /root/miniforge3/lib/python3.10/site-packages/torch/lib/libcudnn_ops_infer.so.8
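
The value 8500 printed by `torch.backends.cudnn.version()` is a packed integer, which the mnistCUDNN sample later in this article prints as 8.5.0. A small decoder makes that explicit. The sketch below uses a hypothetical helper name, `decode_cudnn_version`; the cuDNN 8 encoding follows from the 8500 / 8.5.0 pair in this article, while the cuDNN 9 encoding (major × 10000) is an assumption not covered by the article's own output.

```python
def decode_cudnn_version(value: int) -> tuple:
    """Split a packed cuDNN version integer into (major, minor, patch).

    cuDNN 8 and earlier: major*1000 + minor*100 + patch   (8500 -> 8.5.0)
    cuDNN 9 (assumed):    major*10000 + minor*100 + patch  (90100 -> 9.1.0)
    """
    base = 10000 if value >= 90000 else 1000
    major, rest = divmod(value, base)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


# 8500 is the value reported in the Cloud Studio workspace above.
print(".".join(str(part) for part in decode_cudnn_version(8500)))
```

The decoded major version is what determines the matching system package name, such as `libcudnn8-samples` for cuDNN 8.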
    

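Verifying the cuDNN installation

  • Install the samples and their dependencies: `apt -y install libcudnn8-samples libfreeimage-dev build-essential`. The PyTorch bundled with Cloud Studio reported cuDNN version 8500 above, so `libcudnn8-samples` is the matching package.
  • Build: `cd /usr/src/cudnn_samples_v8/mnistCUDNN && make clean && make`
  • Run `./mnistCUDNN`. If `Test passed!` appears, cuDNN is installed correctly.
Logs of `./mnistCUDNN`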
(base) root@VM-24-95-ubuntu:/usr/src/cudnn_samples_v8/mnistCUDNN# make clean && make
rm -rf *o
rm -rf mnistCUDNN
CUDA_VERSION is 11070
Linking agains cublasLt = true
CUDA VERSION: 11070
TARGET ARCH: x86_64
HOST_ARCH: x86_64
TARGET OS: linux
SMS: 35 50 53 60 61 62 70 72 75 80 86 87 
/usr/local/cuda/bin/nvcc -I/usr/local/cuda/include -I/usr/local/cuda/include -IFreeImage/include  -ccbin g++ -m64    -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_87,code=sm_87 -gencode arch=compute_87,code=compute_87 -o fp16_dev.o -c fp16_dev.cu
nvcc warning : The 'compute_35', 'compute_37', 'sm_35', and 'sm_37' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
g++ -I/usr/local/cuda/include -I/usr/local/cuda/include -IFreeImage/include   -o fp16_emu.o -c fp16_emu.cpp
g++ -I/usr/local/cuda/include -I/usr/local/cuda/include -IFreeImage/include   -o mnistCUDNN.o -c mnistCUDNN.cpp
/usr/local/cuda/bin/nvcc   -ccbin g++ -m64      -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_87,code=sm_87 -gencode arch=compute_87,code=compute_87 -o mnistCUDNN fp16_dev.o fp16_emu.o mnistCUDNN.o -I/usr/local/cuda/include -I/usr/local/cuda/include -IFreeImage/include -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 -lcublasLt -LFreeImage/lib/linux/x86_64 -LFreeImage/lib/linux -lcudart -lcublas -lcudnn -lfreeimage -lstdc++ -lm
nvcc warning : The 'compute_35', 'compute_37', 'sm_35', and 'sm_37' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
(base) root@VM-24-95-ubuntu:/usr/src/cudnn_samples_v8/mnistCUDNN# ./mnistCUDNN 
Executing: mnistCUDNN
cudnnGetVersion() : 8500 , CUDNN_VERSION from cudnn.h : 8500 (8.5.0)
Host compiler version : GCC 9.4.0

There are 1 CUDA capable devices on your machine :
device 0 : sms 40  Capabilities 7.5, SmClock 1590.0 Mhz, MemSize (Mb) 14928, MemClock 5001.0 Mhz, Ecc=1, boardGroupID=0
Using device 0

Testing single precision
Loading binary file data/conv1.bin
Loading binary file data/conv1.bias.bin
Loading binary file data/conv2.bin
Loading binary file data/conv2.bias.bin
Loading binary file data/ip1.bin
Loading binary file data/ip1.bias.bin
Loading binary file data/ip2.bin
Loading binary file data/ip2.bias.bin
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 57600 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.027136 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.027680 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.059392 time requiring 57600 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.095232 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.149504 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 5.357568 time requiring 184784 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 128000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.088064 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.088352 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.129024 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.135936 time requiring 1433120 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.144864 time requiring 128000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 5.752384 time requiring 2450080 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Resulting weights from Softmax:
0.0000000 0.9999399 0.0000000 0.0000000 0.0000561 0.0000000 0.0000012 0.0000017 0.0000010 0.0000000 
Loading image data/three_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 57600 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.025984 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.030496 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.061536 time requiring 57600 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.085920 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.086048 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.118688 time requiring 2057744 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 128000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.080128 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.086432 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.087552 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.124960 time requiring 1433120 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.135456 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.143360 time requiring 128000 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 0.9999288 0.0000000 0.0000711 0.0000000 0.0000000 0.0000000 0.0000000 
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 0.9999820 0.0000154 0.0000000 0.0000012 0.0000006 

Result of classification: 1 3 5

Test passed!

Testing half precision (math in single precision)
Loading binary file data/conv1.bin
Loading binary file data/conv1.bias.bin
Loading binary file data/conv2.bin
Loading binary file data/conv2.bias.bin
Loading binary file data/ip1.bin
Loading binary file data/ip1.bias.bin
Loading binary file data/ip2.bin
Loading binary file data/ip2.bias.bin
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.028000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.030048 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.080224 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.086048 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.093568 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 2.026400 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 51584 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 64000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.104480 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.121888 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.129344 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.133152 time requiring 1433120 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.200096 time requiring 51584 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.919584 time requiring 64000 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Resulting weights from Softmax:
0.0000001 1.0000000 0.0000001 0.0000000 0.0000563 0.0000001 0.0000012 0.0000017 0.0000010 0.0000001 
Loading image data/three_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.032352 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.036704 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.037408 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.079872 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.083968 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.085984 time requiring 2057744 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 51584 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 64000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.083360 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.120096 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.124992 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.127648 time requiring 1433120 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.193344 time requiring 51584 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.282880 time requiring 64000 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000714 0.0000000 0.0000000 0.0000000 0.0000000 
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 1.0000000 0.0000154 0.0000000 0.0000012 0.0000006 

Result of classification: 1 3 5

Test passed!
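
The log is long, and only two kinds of lines matter for the verdict: `Result of classification:` and `Test passed!`. As a rough sketch, a captured log could be summarized as follows; `summarize_mnist_log` is a hypothetical helper, and the sample text is abbreviated from the log above.

```python
import re


def summarize_mnist_log(log: str) -> dict:
    """Count the "Test passed!" markers and collect the predicted digits
    from each "Result of classification:" line."""
    classifications = [
        [int(digit) for digit in found.split()]
        for found in re.findall(r"Result of classification:\s*([\d ]+)", log)
    ]
    return {"passes": log.count("Test passed!"), "classifications": classifications}


# Abbreviated from the log above: one single-precision and one half-precision run.
SAMPLE_LOG = """Testing single precision
Result of classification: 1 3 5

Test passed!

Testing half precision (math in single precision)
Result of classification: 1 3 5

Test passed!
"""

summary = summarize_mnist_log(SAMPLE_LOG)
print(summary)
```

In the run shown above, both precisions classify the three test images as 1, 3, and 5 and each prints `Test passed!`, so a healthy run yields two passes.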

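Manually installing or upgrading cuDNN (optional)

Because Cloud Studio's AI templates mostly rely on the cuDNN that ships with the AI framework, and the workspace comes with a conda Python environment, installing with `pip install` is the recommended approach.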

  • For cu11.7: `pip install nvidia-cudnn-cu11`

    • If a specific release is needed: `pip install nvidia-cudnn-cu11==9.x.y.z`
  • A tarball installation also works (see "Tarball Installation" in the NVIDIA cuDNN Installation guide)

    • Download the archive: `wget https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-9.8.0.87_cuda11-archive.tar.xz`
    • Extract it into the CUDA Toolkit directory: `tar -xf cudnn-linux-x86_64-9.8.0.87_cuda11-archive.tar.xz --strip-components=1 -C /usr/local/cuda`
  • Or install with conda (see "Conda Installation" in the NVIDIA cuDNN Installation guide): `conda install cudnn cuda-version=<cuda-major-version> -c nvidia`

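  • If some dependencies were installed with conda, keep using conda to install, upgrade, and manage dependencies. If they were installed with pip, keep using pip. Mixing the two is strongly discouraged: it can easily leave dependencies in an inconsistent state, to the point where the env has to be deleted and rebuilt.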
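
One way to see whether an environment already mixes installers is to read the `INSTALLER` file that package managers record in each package's metadata. The sketch below uses only the standard-library `importlib.metadata` module; `installer_of` and `count_installers` are hypothetical helper names, and the assumption that conda-installed Python packages record `conda` in that file should be confirmed in your own environment.

```python
from collections import Counter
from importlib import metadata


def installer_of(dist) -> str:
    """Return the recorded installer name (for example "pip"), or "unknown"
    when the distribution has no INSTALLER file."""
    text = dist.read_text("INSTALLER")
    return text.strip().lower() if text and text.strip() else "unknown"


def count_installers() -> Counter:
    """Count the installed distributions in the current environment per installer."""
    return Counter(installer_of(dist) for dist in metadata.distributions())


counts = count_installers()
for name, number in counts.most_common():
    print(f"{name}: {number} packages")
```

If both `pip` and `conda` show up with sizable counts, the environment is already mixed and upgrades deserve extra care.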

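TensorRT (optional)

TensorRT is an inference acceleration library that can substantially speed up model inference in production.

  • Install: `pip install tensorrt-cu11`
  • Verify: `python -c "import tensorrt;print(tensorrt.__version__);assert tensorrt.Builder(tensorrt.Logger())"`
    • Notes:
      • Cloud Studio ships with CUDA Toolkit 11.7 by default, so the cu11 build of TensorRT is used here.
      • Version 10 is the newer release and version 8 the older one (though version 8 is still widely used). In testing, both version 8 and version 10 install on Cloud Studio; see the logs below.
      • At the time of writing, `pip install tensorrt-cu11` installs TensorRT cu11 version 10 by default
        • `pip install tensorrt` instead installs TensorRT cu12 version 10
        • To install a specific version: `pip install tensorrt-cu11==10.0.1` or `pip install tensorrt==8.5.3.1`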
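
The verification one-liner above stops with a traceback when TensorRT is missing. For a gentler report across several packages, a probe along the following lines may help. It is only a sketch: `probe_version` is a hypothetical helper, and it relies on each package exposing a `__version__` attribute, as `tensorrt` does in the command above.

```python
import importlib


def probe_version(module_name: str):
    """Return the module's __version__ string, "unknown" when the module
    imports but has no __version__, or None when it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except (ImportError, OSError):
        # OSError covers a shared library that fails to load during import.
        return None
    return getattr(module, "__version__", "unknown")


for name in ("torch", "tensorrt"):
    version = probe_version(name)
    print(f"{name}: {version if version is not None else 'not installed'}")
```

Unlike the one-liner, this probe does not create a `tensorrt.Builder`, so it confirms only that the package imports, not that the GPU runtime is usable.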
Logs of `pip install tensorrt-cu11`
(base) root@VM-24-95-ubuntu:/workspace# pip install  tensorrt-cu11
Looking in indexes: http://mirrors.tencentyun.com/pypi/simple
Collecting tensorrt-cu11
  Downloading http://mirrors.tencentyun.com/pypi/packages/ad/04/0d6cffca481309ca0f6904446b4a075ddbf759f249851b54938c43fa6982/tensorrt_cu11-10.9.0.34.tar.gz (18 kB)
  Preparing metadata (setup.py) ... done
Collecting tensorrt_cu11_libs==10.9.0.34 (from tensorrt-cu11)
  Downloading http://mirrors.tencentyun.com/pypi/packages/12/3f/8962914e14e265711f262ad961b437630acacbe794f730f1b6503fe1cec8/tensorrt_cu11_libs-10.9.0.34.tar.gz (704 bytes)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tensorrt_cu11_bindings==10.9.0.34 (from tensorrt-cu11)
  Downloading http://mirrors.tencentyun.com/pypi/packages/6e/3c/056876197cf050b064fbc4a89a5f72e092ecf7a4f1454f0ca7c579fbc109/tensorrt_cu11_bindings-10.9.0.34-cp310-none-manylinux_2_28_x86_64.whl (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 28.1 MB/s eta 0:00:00
Collecting nvidia-cuda-runtime-cu11 (from tensorrt_cu11_libs==10.9.0.34->tensorrt-cu11)
  Downloading http://mirrors.tencentyun.com/pypi/packages/a6/ec/a540f28b31de7bc1ed49eecc72035d4cb77db88ead1d42f7bfa5ae407ac6/nvidia_cuda_runtime_cu11-11.8.89-py3-none-manylinux2014_x86_64.whl (875 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 875.6/875.6 kB 24.6 MB/s eta 0:00:00
Building wheels for collected packages: tensorrt-cu11, tensorrt_cu11_libs
  Building wheel for tensorrt-cu11 (setup.py) ... done
  Created wheel for tensorrt-cu11: filename=tensorrt_cu11-10.9.0.34-py2.py3-none-any.whl size=17466 sha256=48b8117c9b58cef409a1838af20124df8e830c0f91ccb256ce68a34ccb8cbab7
  Stored in directory: /root/.cache/pip/wheels/74/2a/8a/58fb3d73239359b35886927883f9ede3f874dfe000f4847afd
  Building wheel for tensorrt_cu11_libs (pyproject.toml) ... done
  Created wheel for tensorrt_cu11_libs: filename=tensorrt_cu11_libs-10.9.0.34-py2.py3-none-manylinux_2_28_x86_64.whl size=2053243630 sha256=bf85dc722a08f2b28bc206a147737f74c62bf24f93842ea0ab5b6b4094cb0af7
  Stored in directory: /root/.cache/pip/wheels/50/fe/b9/a6137a71b76c0282920b71420d97a280aa7388573cbee6ec28
Successfully built tensorrt-cu11 tensorrt_cu11_libs
Installing collected packages: tensorrt_cu11_bindings, nvidia-cuda-runtime-cu11, tensorrt_cu11_libs, tensorrt-cu11
Successfully installed nvidia-cuda-runtime-cu11-11.8.89 tensorrt-cu11-10.9.0.34 tensorrt_cu11_bindings-10.9.0.34 tensorrt_cu11_libs-10.9.0.34
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
(base) root@VM-24-95-ubuntu:/workspace# python -c "import tensorrt;print(tensorrt.__version__);assert tensorrt.Builder(tensorrt.Logger())"
10.9.0.34
[03/11/2025-01:49:50] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
Logs of `pip install tensorrt==8.5.3.1`
(base) root@VM-24-95-ubuntu:/workspace# pip install  tensorrt==8.5.3.1
Looking in indexes: http://mirrors.tencentyun.com/pypi/simple
Collecting tensorrt==8.5.3.1
  Downloading http://mirrors.tencentyun.com/pypi/packages/3e/d5/5f9dd454a89f5bf09c3740c649ba6c8dd685cae98a1255299a2e1dbac606/tensorrt-8.5.3.1-cp310-none-manylinux_2_17_x86_64.whl (549.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 549.5/549.5 MB 47.7 MB/s eta 0:00:00
Requirement already satisfied: nvidia-cuda-runtime-cu11 in /root/miniforge3/lib/python3.10/site-packages (from tensorrt==8.5.3.1) (11.8.89)
Collecting nvidia-cudnn-cu11 (from tensorrt==8.5.3.1)
  Downloading http://mirrors.tencentyun.com/pypi/packages/22/32/6385ef0da5e01553e3b8ad55428fd4824cbff29ff941185082b17f030c9e/nvidia_cudnn_cu11-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl (434.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 434.5/434.5 MB 72.8 MB/s eta 0:00:00
Collecting nvidia-cublas-cu11 (from tensorrt==8.5.3.1)
  Downloading http://mirrors.tencentyun.com/pypi/packages/ea/2e/9d99c60771d275ecf6c914a612e9a577f740a615bc826bec132368e1d3ae/nvidia_cublas_cu11-11.11.3.6-py3-none-manylinux2014_x86_64.whl (417.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 417.9/417.9 MB 63.4 MB/s eta 0:00:00
Installing collected packages: nvidia-cublas-cu11, nvidia-cudnn-cu11, tensorrt
Successfully installed nvidia-cublas-cu11-11.11.3.6 nvidia-cudnn-cu11-9.8.0.87 tensorrt-8.5.3.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
(base) root@VM-24-95-ubuntu:/workspace# python -c "import tensorrt;print(tensorrt.__version__);assert tensorrt.Builder(tensorrt.Logger())"
8.5.3.1
[03/11/2025-02:03:52] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars

Troubleshooting

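1 The NVIDIA driver is broken

  • Problem: after a series of configuration and installation steps, something appears to have broken; see Attachment 1 for the symptoms
  • Cause:
    • The driver or the CUDA Toolkit was probably updated with `apt install` or `bash NVIDIA-Linux-x86_64-XXX.XXX.XXX.run`. Updating the driver this way cannot succeed on Cloud Studio.
    • `pip install` should not damage the driver environment
    • Cloud Studio mounts the NVIDIA driver into the container read-only, so removing the user-installed driver restores the original one. (If `$PATH` or `$LD_LIBRARY_PATH` was modified, restore those environment variables to their original values as well.)
  • Fix: `apt remove '*nvidia*' -y`
  • Attachments
Attachment 1: `nvidia-smi` error after updating the driver with `apt install`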
(base) root@VM-24-95-ubuntu:/workspace# apt install nvidia-driver-535 -y
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  libnvidia-cfg1-535 libnvidia-common-535 libnvidia-compute-535
  libnvidia-decode-535 libnvidia-encode-535 libnvidia-extra-535
  libnvidia-fbc1-535 libnvidia-gl-535 nvidia-compute-utils-535 nvidia-dkms-535
  nvidia-kernel-common-535 nvidia-kernel-source-535 nvidia-prime
  nvidia-settings nvidia-utils-535 xserver-xorg-video-nvidia-535
Recommended packages:
  libnvidia-compute-535:i386 libnvidia-decode-535:i386
  libnvidia-encode-535:i386 libnvidia-fbc1-535:i386 libnvidia-gl-535:i386
The following NEW packages will be installed:
  libnvidia-cfg1-535 libnvidia-common-535 libnvidia-compute-535
  libnvidia-decode-535 libnvidia-encode-535 libnvidia-extra-535
  libnvidia-fbc1-535 libnvidia-gl-535 nvidia-compute-utils-535 nvidia-dkms-535
  nvidia-driver-535 nvidia-kernel-common-535 nvidia-kernel-source-535
  nvidia-prime nvidia-settings nvidia-utils-535 xserver-xorg-video-nvidia-535
0 upgraded, 17 newly installed, 0 to remove and 4 not upgraded.
Need to get 308 MB of archives.
After this operation, 801 MB of additional disk space will be used.
Get:1 http://mirrors.cloud.tencent.com/ubuntu focal-updates/main amd64 nvidia-prime all 0.8.16~0.20.04.2 [9960 B]
Get:2 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64  libnvidia-cfg1-535 535.230.02-0ubuntu1 [98.9 kB]
Get:3 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64  libnvidia-common-535 535.230.02-0ubuntu1 [14.9 kB]
Get:4 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64  libnvidia-compute-535 535.230.02-0ubuntu1 [36.9 MB]
Get:5 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64  libnvidia-decode-535 535.230.02-0ubuntu1 [1660 kB]
Get:6 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64  libnvidia-encode-535 535.230.02-0ubuntu1 [90.0 kB]
Get:7 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64  libnvidia-extra-535 535.230.02-0ubuntu1 [256 kB]
Get:8 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64  libnvidia-fbc1-535 535.230.02-0ubuntu1 [51.3 kB]
Get:9 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64  libnvidia-gl-535 535.230.02-0ubuntu1 [183 MB]
Get:10 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64  nvidia-compute-utils-535 535.230.02-0ubuntu1 [285 kB]
Get:11 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64  nvidia-kernel-source-535 535.230.02-0ubuntu1 [44.5 MB]
Get:12 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64  nvidia-kernel-common-535 535.230.02-0ubuntu1 [38.4 MB]
Get:13 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64  nvidia-dkms-535 535.230.02-0ubuntu1 [34.2 kB]
Get:14 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64  nvidia-utils-535 535.230.02-0ubuntu1 [382 kB]
Get:15 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64  xserver-xorg-video-nvidia-535 535.230.02-0ubuntu1 [1504 kB]
Get:16 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64  nvidia-driver-535 535.230.02-0ubuntu1 [478 kB]
Get:17 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64  nvidia-settings 570.124.06-0ubuntu1 [951 kB]
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = (unset),
	LC_ALL = (unset),
	LC_CTYPE = "C.UTF-8",
	LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
debconf: delaying package configuration, since apt-utils is not installed
Fetched 308 MB in 27s (11.5 MB/s)
Selecting previously unselected package libnvidia-cfg1-535:amd64.
(Reading database ... 84774 files and directories currently installed.)
Preparing to unpack .../00-libnvidia-cfg1-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking libnvidia-cfg1-535:amd64 (535.230.02-0ubuntu1) ...
Selecting previously unselected package libnvidia-common-535.
Preparing to unpack .../01-libnvidia-common-535_535.230.02-0ubuntu1_all.deb ...
Unpacking libnvidia-common-535 (535.230.02-0ubuntu1) ...
Selecting previously unselected package libnvidia-compute-535:amd64.
Preparing to unpack .../02-libnvidia-compute-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking libnvidia-compute-535:amd64 (535.230.02-0ubuntu1) ...
Selecting previously unselected package libnvidia-decode-535:amd64.
Preparing to unpack .../03-libnvidia-decode-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking libnvidia-decode-535:amd64 (535.230.02-0ubuntu1) ...
Selecting previously unselected package libnvidia-encode-535:amd64.
Preparing to unpack .../04-libnvidia-encode-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking libnvidia-encode-535:amd64 (535.230.02-0ubuntu1) ...
Selecting previously unselected package libnvidia-extra-535:amd64.
Preparing to unpack .../05-libnvidia-extra-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking libnvidia-extra-535:amd64 (535.230.02-0ubuntu1) ...
Selecting previously unselected package libnvidia-fbc1-535:amd64.
Preparing to unpack .../06-libnvidia-fbc1-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking libnvidia-fbc1-535:amd64 (535.230.02-0ubuntu1) ...
Selecting previously unselected package libnvidia-gl-535:amd64.
Preparing to unpack .../07-libnvidia-gl-535_535.230.02-0ubuntu1_amd64.deb ...
dpkg-query: no packages found matching libnvidia-gl-450
Unpacking libnvidia-gl-535:amd64 (535.230.02-0ubuntu1) ...
Preparing to unpack .../08-nvidia-compute-utils-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking nvidia-compute-utils-535 (535.230.02-0ubuntu1) ...
dpkg: error processing archive /tmp/apt-dpkg-install-weWcQR/08-nvidia-compute-utils-535_535.230.02-0ubuntu1_amd64.deb (--unpack):
 unable to make backup link of './usr/bin/nvidia-cuda-mps-control' before installing new version: Invalid cross-device link
dpkg-deb: error: paste subprocess was killed by signal (Broken pipe)
Selecting previously unselected package nvidia-kernel-source-535.
Preparing to unpack .../09-nvidia-kernel-source-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking nvidia-kernel-source-535 (535.230.02-0ubuntu1) ...
Selecting previously unselected package nvidia-kernel-common-535.
Preparing to unpack .../10-nvidia-kernel-common-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking nvidia-kernel-common-535 (535.230.02-0ubuntu1) ...
Selecting previously unselected package nvidia-dkms-535.
Preparing to unpack .../11-nvidia-dkms-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking nvidia-dkms-535 (535.230.02-0ubuntu1) ...
Preparing to unpack .../12-nvidia-utils-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking nvidia-utils-535 (535.230.02-0ubuntu1) ...
dpkg: error processing archive /tmp/apt-dpkg-install-weWcQR/12-nvidia-utils-535_535.230.02-0ubuntu1_amd64.deb (--unpack):
 unable to make backup link of './usr/bin/nvidia-debugdump' before installing new version: Invalid cross-device link
dpkg-deb: error: paste subprocess was killed by signal (Broken pipe)
Selecting previously unselected package xserver-xorg-video-nvidia-535.
Preparing to unpack .../13-xserver-xorg-video-nvidia-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking xserver-xorg-video-nvidia-535 (535.230.02-0ubuntu1) ...
Selecting previously unselected package nvidia-driver-535.
Preparing to unpack .../14-nvidia-driver-535_535.230.02-0ubuntu1_amd64.deb ...
Unpacking nvidia-driver-535 (535.230.02-0ubuntu1) ...
Selecting previously unselected package nvidia-prime.
Preparing to unpack .../15-nvidia-prime_0.8.16~0.20.04.2_all.deb ...
Unpacking nvidia-prime (0.8.16~0.20.04.2) ...
Selecting previously unselected package nvidia-settings.
Preparing to unpack .../16-nvidia-settings_570.124.06-0ubuntu1_amd64.deb ...
Unpacking nvidia-settings (570.124.06-0ubuntu1) ...
Errors were encountered while processing:
 /tmp/apt-dpkg-install-weWcQR/08-nvidia-compute-utils-535_535.230.02-0ubuntu1_amd64.deb
 /tmp/apt-dpkg-install-weWcQR/12-nvidia-utils-535_535.230.02-0ubuntu1_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)
(base) root@VM-24-95-ubuntu:/workspace# nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
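The log above suggests why `nvidia-smi` breaks: the install unpacked 535.230.02 user-space libraries (for example `libnvidia-compute-535`), while the kernel module belongs to the host and most likely stays at 525.105.17, since a container cannot replace it. A minimal sketch of how one might confirm such a mismatch is shown below. The helper names are hypothetical, and both the path `/proc/driver/nvidia/version` and the layout of its `Kernel Module` line are assumptions about a typical Linux setup, not something shown in this log.

```python
import re
from pathlib import Path

# Assumed location of the kernel-module version file on Linux hosts.
PROC_VERSION_FILE = Path("/proc/driver/nvidia/version")


def kernel_module_version(text):
    """Return the first dotted version after 'Kernel Module', or None."""
    match = re.search(r"Kernel Module.*?(\d+\.\d+(?:\.\d+)*)", text)
    return match.group(1) if match else None


def package_version(unpack_line):
    """Return the upstream version from an apt 'Unpacking ... (version) ...' line."""
    match = re.search(r"\((\d+(?:\.\d+)+)", unpack_line)
    return match.group(1) if match else None


def versions_match(kernel, library):
    """True only when both versions are known and identical."""
    return kernel is not None and kernel == library


if __name__ == "__main__":
    # Line copied from the install log above.
    log_line = "Unpacking libnvidia-compute-535:amd64 (535.230.02-0ubuntu1) ..."
    library = package_version(log_line)
    if PROC_VERSION_FILE.exists():
        kernel = kernel_module_version(PROC_VERSION_FILE.read_text())
    else:
        kernel = None  # no NVIDIA kernel module visible from here
    print("kernel module:", kernel)
    print("user-space library:", library)
    print("match:", versions_match(kernel, library))
```

If the two versions differ, NVML refuses to initialize, which is consistent with the `Driver/library version mismatch` message.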
Appendix 2: nvidia-smi works again after the fix with apt remove
(base) root@VM-24-95-ubuntu:/workspace# apt remove *nvidia* -y

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Reading package lists...
Building dependency tree...
Reading state information...
Package 'nvidia-304' is not installed, so not removed
(some log output omitted here)
Package 'linux-objects-nvidia-535-server-5.15.0-1049-aws' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1049-azure' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1049-gcp' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1049-intel-iotg' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1049-oracle' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-105-generic' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-105-lowlatency' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1050-aws' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1050-azure' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1050-intel-iotg' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1050-oracle' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1051-aws' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1051-azure' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1051-gcp' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1051-oracle' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1052-aws' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1052-azure' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1052-gcp' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1052-intel-iotg' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1052-oracle' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1053-aws' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1053-azure' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1053-gcp' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1053-oracle' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1054-azure' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1055-aws' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1055-gcp' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1055-intel-iotg' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1055-oracle' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1056-azure' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1057-aws' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1057-azure' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1058-aws' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1058-azure' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1058-gcp' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1058-intel-iotg' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1058-oracle' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1059-gcp' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1059-intel-iotg' is not installed, so not removed
Package 'linux-objects-nvidia-535-server-5.15.0-1059-oracle' is not installed, 
The following packages were automatically installed and are no longer required:
  accountsservice acl apg apport apport-symptoms aptdaemon aptdaemon-data
  aspell aspell-en avahi-daemon avahi-utils bind9-host bind9-libs bluez bolt
  bubblewrap cheese-common colord colord-data cracklib-runtime crda
  cups-pk-helper dconf-cli dctrl-tools desktop-file-utils dictionaries-common
  dkms dns-root-data dnsmasq-base docbook-xml emacsen-common enchant-2
  evolution-data-server evolution-data-server-common fprintd gdm3 geoclue-2.0
  gettext-base gir1.2-accountsservice-1.0 gir1.2-atk-1.0 gir1.2-atspi-2.0
  gir1.2-freedesktop gir1.2-gck-1 gir1.2-gcr-3 gir1.2-gdesktopenums-3.0
  gir1.2-gdkpixbuf-2.0 gir1.2-gdm-1.0 gir1.2-geoclue-2.0
  gir1.2-gnomebluetooth-1.0 gir1.2-gnomedesktop-3.0 gir1.2-graphene-1.0
  gir1.2-gtk-3.0 gir1.2-gweather-3.0 gir1.2-ibus-1.0 gir1.2-json-1.0
  gir1.2-mutter-6 gir1.2-nm-1.0 gir1.2-nma-1.0 gir1.2-notify-0.7
  gir1.2-pango-1.0 gir1.2-polkit-1.0 gir1.2-rsvg-2.0 gir1.2-secret-1
  gir1.2-soup-2.4 gir1.2-upowerglib-1.0 gir1.2-vte-2.91 gjs gkbd-capplet
  gnome-control-center gnome-control-center-data gnome-control-center-faces
  gnome-desktop3-data gnome-keyring gnome-keyring-pkcs11 gnome-menus
  gnome-online-accounts gnome-session-bin gnome-session-common
  gnome-settings-daemon gnome-settings-daemon-common gnome-shell
  gnome-shell-common gnome-startup-applications gnome-user-docs groff-base
  gstreamer1.0-clutter-3.0 gstreamer1.0-gl gstreamer1.0-plugins-base
  gstreamer1.0-plugins-good gstreamer1.0-pulseaudio gstreamer1.0-x
  hunspell-en-us ibus ibus-data ibus-gtk ibus-gtk3 iio-sensor-proxy im-config
  ippusbxd iptables iw keyboard-configuration kmod language-selector-common
  language-selector-gnome libaa1 libaccountsservice0 libappindicator3-1
  libarchive13 libasound2-plugins libaspell15 libasyncns0 libavahi-core7
  libavahi-glib1 libavc1394-0 libbluetooth3 libboost-thread1.71.0 libcaca0
  libcamel-1.2-62 libcanberra-gtk3-0 libcanberra-gtk3-module libcanberra-pulse
  libcdparanoia0 libcheese-gtk25 libcheese8 libclutter-1.0-0
  libclutter-1.0-common libclutter-gst-3.0-0 libclutter-gtk-1.0-0
  libcogl-common libcogl-pango20 libcogl-path20 libcogl20 libcolord-gtk1
  libcolorhug2 libcrack2 libdaemon0 libdbusmenu-glib4 libdbusmenu-gtk3-4
  libdrm-amdgpu1 libdrm-common libdrm-intel1 libdrm-nouveau2 libdrm-radeon1
  libdrm2 libdv4 libebackend-1.2-10 libebook-1.2-20 libebook-contacts-1.2-3
  libecal-2.0-1 libedata-book-1.2-26 libedata-cal-2.0-1 libedataserver-1.2-24
  libedataserverui-1.2-2 libegl-mesa0 libegl1 libenchant-2-2 libevdev2
  libexif12 libflac8 libfontenc1 libfprint-2-2 libgail-common libgail18
  libgbm1 libgd3 libgdata-common libgdata22 libgdm1 libgee-0.8-2
  libgeoclue-2-0 libgeocode-glib0 libgjs0g libgl1 libgl1-mesa-dri
  libglapi-mesa libgles2 libglvnd0 libglx-mesa0 libglx0 libgnome-autoar-0-0
  libgnome-bluetooth13 libgnome-desktop-3-19 libgnomekbd-common libgnomekbd8
  libgoa-1.0-0b libgoa-1.0-common libgoa-backend-1.0-1 libgphoto2-6
  libgphoto2-l10n libgphoto2-port12 libgraphene-1.0-0 libgsound0
  libgssdp-1.2-0 libgstreamer-gl1.0-0 libgstreamer-plugins-base1.0-0
  libgstreamer-plugins-good1.0-0 libgtk2.0-0 libgtk2.0-bin libgtk2.0-common
  libgtop-2.0-11 libgtop2-common libgudev-1.0-0 libgupnp-1.2-0
  libgupnp-av-1.0-2 libgupnp-dlna-2.0-3 libgusb2 libgweather-3-16
  libgweather-common libharfbuzz-icu0 libhunspell-1.7-0 libhyphen0
  libibus-1.0-5 libical3 libice6 libidn11 libiec61883-0 libieee1284-3
  libimobiledevice6 libinput-bin libinput10 libip6tc2 libjack-jackd2-0
  libjansson4 libjavascriptcoregtk-4.0-18 libldb2 libllvm12 libmaxminddb0
  libmbim-glib4 libmbim-proxy libmediaart-2.0-0 libmm-glib0 libmnl0
  libmozjs-68-0 libmp3lame0 libmpg123-0 libmtdev1 libmutter-6-0
  libmysqlclient21 libndp0 libnetfilter-conntrack3 libnewt0.52 libnfnetlink0
  libnftnl11 libnl-3-200 libnl-genl-3-200 libnl-route-3-200 libnm0 libnma0
  libnotify4 libnspr4 libnss-mdns libnss3 libopengl0 libopus0 liborc-0.4-0
  libpam-fprintd libpam-gnome-keyring libpangoxft-1.0-0 libpcap0.8 libpci3
  libpciaccess0 libpcsclite1 libphonenumber7 libpipeline1 libplist3
  libprotobuf17 libpulse-mainloop-glib0 libpulse0 libpulsedsp
  libpwquality-common libpwquality1 libqmi-glib5 libqmi-proxy libraw1394-11
  librygel-core-2.6-2 librygel-db-2.6-2 librygel-renderer-2.6-2
  librygel-server-2.6-2 libsamplerate0 libsane libsane-common libsbc1
  libsensors-config libsensors5 libshout3 libslang2 libsm6 libsmbclient
  libsnapd-glib1 libsndfile1 libsnmp-base libsnmp35 libsodium23 libsoxr0
  libspeex1 libspeexdsp1 libstartup-notification0 libtag1v5 libtag1v5-vanilla
  libtalloc2 libteamdctl0 libtevent0 libtext-iconv-perl libtheora0 libtwolame0
  libuchardet0 libudisks2-0 libunwind8 libupower-glib3 libusb-1.0-0
  libusbmuxd6 libuv1 libv4l-0 libv4lconvert0 libvdpau1 libvisual-0.4-0
  libvorbisenc2 libvpx6 libvte-2.91-0 libvte-2.91-common libvulkan1
  libwacom-bin libwacom-common libwacom2 libwavpack1 libwayland-server0
  libwbclient0 libwebkit2gtk-4.0-37 libwebpdemux2 libwebrtc-audio-processing1
  libwhoopsie-preferences0 libwhoopsie0 libwoff1 libx11-xcb1 libxatracker2
  libxaw7 libxcb-dri2-0 libxcb-dri3-0 libxcb-glx0 libxcb-icccm4 libxcb-image0
  libxcb-keysyms1 libxcb-present0 libxcb-randr0 libxcb-render-util0
  libxcb-res0 libxcb-shape0 libxcb-sync1 libxcb-util1 libxcb-xfixes0
  libxcb-xkb1 libxcb-xv0 libxfont2 libxft2 libxkbcommon-x11-0 libxkbfile1
  libxklavier16 libxmu6 libxnvctrl0 libxpm4 libxshmfence1 libxslt1.1 libxss1
  libxt6 libxtables12 libxv1 libxvmc1 libxxf86vm1 libyelp0
  linux-headers-5.4.0-208 linux-headers-5.4.0-208-generic
  linux-headers-generic man-db mesa-vdpau-drivers mesa-vulkan-drivers
  mobile-broadband-provider-info modemmanager mousetweaks mutter mutter-common
  mysql-common network-manager network-manager-gnome network-manager-pptp
  p11-kit p11-kit-modules pci.ids pkg-config ppp pptp-linux pulseaudio
  pulseaudio-module-bluetooth pulseaudio-utils python3-apport
  python3-aptdaemon python3-aptdaemon.gtk3widgets python3-blinker
  python3-cairo python3-cffi-backend python3-cryptography python3-cups
  python3-cupshelpers python3-defer python3-entrypoints python3-httplib2
  python3-ibus-1.0 python3-jwt python3-keyring python3-launchpadlib
  python3-lazr.restfulclient python3-lazr.uri python3-ldb
  python3-macaroonbakery python3-nacl python3-oauthlib python3-problem-report
  python3-protobuf python3-pymacaroons python3-rfc3339 python3-secretstorage
  python3-simplejson python3-systemd python3-talloc python3-tz python3-wadllib
  python3-xkit rtkit rygel samba-libs sane-utils screen-resolution-extra
  session-migration sgml-base sgml-data sudo switcheroo-control
  system-config-printer system-config-printer-common
  system-config-printer-udev ubuntu-docs ubuntu-session ubuntu-wallpapers
  ubuntu-wallpapers-focal udev update-inetd upower usb-modeswitch
  usb-modeswitch-data usb.ids usbmuxd vdpau-driver-all wamerican
  whoopsie-preferences wireless-regdb wpasupplicant x11-xkb-utils
  x11-xserver-utils xdg-dbus-proxy xfonts-base xfonts-encodings xfonts-utils
  xml-core xserver-common xserver-xephyr xserver-xorg xserver-xorg-core
  xserver-xorg-input-all xserver-xorg-input-libinput xserver-xorg-input-wacom
  xserver-xorg-legacy xserver-xorg-video-all xserver-xorg-video-amdgpu
  xserver-xorg-video-ati xserver-xorg-video-fbdev xserver-xorg-video-intel
  xserver-xorg-video-nouveau xserver-xorg-video-qxl xserver-xorg-video-radeon
  xserver-xorg-video-vesa xserver-xorg-video-vmware xwayland
  yaru-theme-gnome-shell yelp yelp-xsl zenity zenity-common
Use 'apt autoremove' to remove them.
The following packages will be REMOVED:
  libnvidia-cfg1-535 libnvidia-common-535 libnvidia-compute-535
  libnvidia-decode-535 libnvidia-encode-535 libnvidia-extra-535
  libnvidia-fbc1-535 libnvidia-gl-535 nvidia-dkms-535 nvidia-driver-535
  nvidia-kernel-common-535 nvidia-kernel-source-535 nvidia-prime
  nvidia-settings xserver-xorg-video-nvidia-535
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = (unset),
	LC_ALL = (unset),
	LC_CTYPE = "C.UTF-8",
	LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
0 upgraded, 0 newly installed, 15 to remove and 4 not upgraded.
15 not fully installed or removed.
After this operation, 798 MB disk space will be freed.
(Reading database ... 85474 files and directories currently installed.)
Removing nvidia-driver-535 (535.230.02-0ubuntu1) ...
Removing xserver-xorg-video-nvidia-535 (535.230.02-0ubuntu1) ...
Removing libnvidia-cfg1-535:amd64 (535.230.02-0ubuntu1) ...
Removing libnvidia-gl-535:amd64 (535.230.02-0ubuntu1) ...
Removing libnvidia-common-535 (535.230.02-0ubuntu1) ...
Removing libnvidia-encode-535:amd64 (535.230.02-0ubuntu1) ...
Removing libnvidia-decode-535:amd64 (535.230.02-0ubuntu1) ...
Removing libnvidia-compute-535:amd64 (535.230.02-0ubuntu1) ...
Removing libnvidia-extra-535:amd64 (535.230.02-0ubuntu1) ...
Removing libnvidia-fbc1-535:amd64 (535.230.02-0ubuntu1) ...
Removing nvidia-dkms-535 (535.230.02-0ubuntu1) ...
Removing nvidia-kernel-common-535 (535.230.02-0ubuntu1) ...
Removing nvidia-kernel-source-535 (535.230.02-0ubuntu1) ...
Removing nvidia-prime (0.8.16~0.20.04.2) ...
Removing nvidia-settings (570.124.06-0ubuntu1) ...
Processing triggers for mime-support (3.64ubuntu1) ...
Processing triggers for gnome-menus (3.36.0-1ubuntu1) ...
Processing triggers for libc-bin (2.31-0ubuntu9.17) ...
Processing triggers for man-db (2.9.1-1) ...
Processing triggers for dbus (1.12.16-2ubuntu2.3) ...
Processing triggers for desktop-file-utils (0.24-1ubuntu3) ...
(base) root@VM-24-95-ubuntu:/workspace# nvidia-smi
Fri Mar 14 03:24:46 2025       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:09.0 Off |                    0 |
| N/A   31C    P8     9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
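To run the same sanity check from a script instead of reading the table by eye, one option is to parse the banner line of `nvidia-smi`. The sketch below is illustrative only: the function names are hypothetical, and the regular expressions assume the banner layout shown in the output above. Note that the `CUDA Version` field reports the highest CUDA version the driver supports, not the installed CUDA toolkit.

```python
import re
import shutil
import subprocess


def parse_smi_header(line):
    """Extract driver and CUDA versions from the nvidia-smi banner line."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", line)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", line)
    return {
        "driver": driver.group(1) if driver else None,
        "cuda": cuda.group(1) if cuda else None,
    }


def current_smi_header():
    """Run nvidia-smi and return its banner line, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    if result.returncode != 0:
        return None  # e.g. the NVML mismatch error
    for line in result.stdout.splitlines():
        if "Driver Version" in line:
            return line
    return None


if __name__ == "__main__":
    header = current_smi_header()
    if header is None:
        print("nvidia-smi is missing or failed")
    else:
        print(parse_smi_header(header))
```

On the workspace above, a healthy state would report driver `525.105.17`. A `None` result would point back to the mismatch case.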