NVIDIA Driver and CUDA: Installation and Verification Procedure

1) Check for and release held packages

List the held packages, then release the hold on any NVIDIA-related ones:

apt-mark showhold
for p in $(apt-mark showhold | grep -E 'nvidia|libnvidia|cuda-drivers' || true); do
  sudo apt-mark unhold "$p"
done
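
Before releasing any holds, it can help to preview which names the pattern would select. The following is a minimal sketch; the function name filter_nvidia_holds is hypothetical, and the pattern is the one used in the loop above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only NVIDIA-related names from a hold list on stdin.
# Uses the same extended regex as the unhold loop.
filter_nvidia_holds() {
  # "|| true" keeps the exit status at 0 when nothing matches.
  grep -E 'nvidia|libnvidia|cuda-drivers' || true
}

# Dry run on a real host (read-only, changes nothing):
#   apt-mark showhold | filter_nvidia_holds | sed 's/^/would unhold: /'
```

Because the filter only reads stdin, it can be tried against a saved hold list from another machine as well.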

2) Purge leftovers from the 550 branch

Remove the packages that belong to the old NVIDIA driver (550 branch):

sudo apt purge -y 'nvidia-*550*' 'libnvidia-*550*'
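
Since purge is destructive, a read-only preview of the affected packages may be worthwhile. The sketch below approximates the two globs with a shell case statement; match_550_globs is a hypothetical name, and apt's own pattern matching may differ in edge cases.

```shell
#!/usr/bin/env bash
# Hypothetical helper: from package names on stdin, keep those that the two
# purge globs ('nvidia-*550*' and 'libnvidia-*550*') would match.
match_550_globs() {
  local pkg
  while IFS= read -r pkg; do
    case "$pkg" in
      nvidia-*550*|libnvidia-*550*) printf '%s\n' "$pkg" ;;
    esac
  done
}

# Preview on a real host (read-only). dpkg-query may also list packages that
# were removed but still have configuration files left behind:
#   dpkg-query -W -f='${Package}\n' | match_550_globs
```

Alternatively, running apt-get -s purge with the same two patterns simulates the removal and prints what would be purged without changing the system.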

3) Fix dependencies and clean up

Repair broken dependencies, remove packages that are no longer needed, and refresh the package index:

sudo apt --fix-broken install -y
sudo apt autoremove -y
sudo apt update

4) Install the specified NVIDIA driver version and Fabric Manager

Install the 590 server driver and the matching Fabric Manager:

sudo apt install -y nvidia-driver-590-server nvidia-fabricmanager-590

5) Self-check

1) Check the status of the key packages

dpkg -l | grep -E 'nvidia-driver-590-server|nvidia-fabricmanager-590|libnvidia-extra-590-server|libnvidia-fbc1-590-server'

Example output:

== 1) Key package status ==
ii  libnvidia-extra-590-server:amd64            590.48.01-0ubuntu0.22.04.1                               amd64        Extra libraries for the NVIDIA Server Driver
ii  libnvidia-fbc1-590-server:amd64             590.48.01-0ubuntu0.22.04.1                               amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  nvidia-driver-590-server                    590.48.01-0ubuntu0.22.04.1                               amd64        NVIDIA Server Driver metapackage
ii  nvidia-fabricmanager-590                    590.48.01-0ubuntu0.22.04.1                               amd64        Fabric Manager for NVSwitch based systems.
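
Reading the listing by eye works for four packages, but the check can also be scripted. The sketch below reports any expected package that is not in the ii (installed and configured) state; missing_packages is a hypothetical name, and the column layout is assumed to be that of the dpkg -l output shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `dpkg -l` output on stdin and print every expected
# package (given as arguments) that is NOT listed in the "ii" state.
missing_packages() {
  local listing pkg
  listing="$(cat)"
  for pkg in "$@"; do
    # Column 1 is the state, column 2 the name, optionally suffixed ":amd64".
    if ! printf '%s\n' "$listing" | awk -v p="$pkg" '
        $1 == "ii" { n = $2; sub(/:.*/, "", n); if (n == p) found = 1 }
        END { exit !found }'; then
      printf '%s\n' "$pkg"
    fi
  done
}

# Usage on a real host; empty output means all four packages are installed:
#   dpkg -l | missing_packages nvidia-driver-590-server nvidia-fabricmanager-590 \
#       libnvidia-extra-590-server libnvidia-fbc1-590-server
```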

2) Check for remaining broken dependencies (simulation only; nothing is changed)

sudo apt -s --fix-broken install

Example output:

== 2) Any remaining broken dependencies ==
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
0 upgraded, 0 newly installed, 0 to remove and 269 not upgraded.
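
The line that matters in this output is the final summary: nothing to upgrade, install, or remove means the dependency state is consistent. A minimal sketch of that test follows; simulation_is_clean is a hypothetical name, and the summary wording is assumed to be the English one shown above, so the locale is forced to C in the usage line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the simulation summary on stdin plans
# no upgrades, no installs and no removals. The "not upgraded" count is ignored.
simulation_is_clean() {
  grep -Eq '^0 upgraded, 0 newly installed, 0 to remove'
}

# Usage on a real host (apt-get is used because its output is meant for scripts):
#   LC_ALL=C apt-get -s --fix-broken install | simulation_is_clean \
#       && echo "dependencies OK" || echo "apt still wants to change packages"
```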

3) Check whether a reboot is required

ls -l /var/run/reboot-required* 2>/dev/null || echo "no reboot-required file"

Example output:

== 3) Is a reboot required ==
no reboot-required file

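4) Currently loaded driver version

Check which NVIDIA driver is currently loaded. Before the reboot the old kernel module is usually still in memory, so nvidia-smi typically reports a driver/library version mismatch instead of the new version: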

nvidia-smi

Example output:

== 4) Currently loaded driver version (usually still the old one before reboot) ==
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 590.48
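
The mismatch means the user-space library is already 590 while the kernel module in memory is from an older branch. One way to confirm this without nvidia-smi is sketched below. It assumes the loaded module exposes its version at /sys/module/nvidia/version; both function names are hypothetical, and the comparison is deliberately coarse (major branch only).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the version of the loaded NVIDIA kernel module.
# The default path is an assumption about the host; pass another file to override.
loaded_module_version() {
  local f="${1:-/sys/module/nvidia/version}"
  [ -r "$f" ] && cat "$f"
}

# Succeed when two version strings share the same major branch (e.g. 590).
same_branch() {
  [ "${1%%.*}" = "${2%%.*}" ]
}

# Usage on a real host, comparing against the NVML library version shown above:
#   same_branch "$(loaded_module_version)" 590.48 || echo "old module loaded; reboot needed"
```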

6) Reboot and verify

Reboot the system:

sudo reboot

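Post-reboot verification

Check that the NVIDIA driver loaded correctly, that Fabric Manager is running, and that the GPU topology is reported: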

nvidia-smi
systemctl status nvidia-fabricmanager --no-pager
nvidia-smi topo -m

Example output (nvidia-smi):

Fri Mar  6 11:34:46 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H20                     Off |   00000000:09:00.0 Off |                    0 |
| N/A   29C    P0             73W /  500W |       0MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
...
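
On a multi-GPU host the table is long, so a scripted check may be easier to trust than a visual scan. The sketch below uses nvidia-smi's query mode, which prints one driver version per GPU; all_gpus_on_version is a hypothetical name, and 590.48.01 is the version from the output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when stdin holds at least one line and every
# non-empty line equals the expected driver version given as $1.
all_gpus_on_version() {
  local expected="$1" line count=0
  while IFS= read -r line; do
    [ -z "$line" ] && continue
    [ "$line" = "$expected" ] || return 1
    count=$((count + 1))
  done
  [ "$count" -gt 0 ]
}

# Usage on a real host:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader \
#       | all_gpus_on_version 590.48.01 && echo "all GPUs on 590.48.01"
```

The Fabric Manager state can be checked in the same scripted way with systemctl is-active --quiet nvidia-fabricmanager, which returns a zero exit status only while the service is active.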

Check the Fabric Manager service status:

systemctl status nvidia-fabricmanager --no-pager

Example output:

● nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2026-03-06 11:29:48 CST; 4min 59s ago
    Process: 3381 ExecStartPre=/usr/bin/nvidia-fabricmanager-start.sh --mode precheck (code=exited, status=0/SUCCESS)
    Process: 3557 ExecStart=/usr/bin/nvidia-fabricmanager-start.sh --mode start (code=exited, status=0/SUCCESS)
   Main PID: 3579 (nv-fabricmanage)
      Tasks: 18 (limit: 629145)
     Memory: 20.1M
        CPU: 1.213s
     CGroup: /system.slice/nvidia-fabricmanager.service
             └─3579 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabr…

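Check the NVCC version. The CUDA Version field in nvidia-smi is the highest CUDA version the driver supports, while nvcc reports the toolkit that is actually installed, so the two can differ: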

nvcc --version

Example output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
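
Here the installed toolkit (12.4) is older than the CUDA version the new driver reports (13.1). The sketch below pulls both numbers out of the respective outputs so they can be compared in a script; the two function names are hypothetical, and the patterns assume the output formats shown in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the toolkit release (e.g. 12.4) from
# `nvcc --version` output on stdin.
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Hypothetical helper: extract the driver-supported CUDA version (e.g. 13.1)
# from the `nvidia-smi` header on stdin.
driver_cuda_version() {
  sed -n 's/.*CUDA Version: \([0-9][0-9.]*\).*/\1/p'
}

# Usage on a real host:
#   echo "toolkit: $(nvcc --version | nvcc_release)"
#   echo "driver supports up to: $(nvidia-smi | driver_cuda_version)"
```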

7) Install the CUDA tools

Check which versions of the CUDA toolkit packages are installed and which are available:

apt-cache policy cuda-nvcc-13-1 cuda-compiler-13-1 cuda-toolkit-13-1

Example output:

cuda-nvcc-13-1:
  Installed: (none)
  Candidate: 13.1.115-1
  Version table:
     13.1.115-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
     13.1.80-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
...
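
The two fields of interest are Installed and Candidate. A minimal sketch for reading them in a script follows; policy_field is a hypothetical name, and it returns only the first match, so it is meant for the policy output of a single package.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of the "Installed" or "Candidate" field
# from `apt-cache policy <one-package>` output on stdin.
policy_field() {
  awk -v key="$1:" '$1 == key { print $2; exit }'
}

# Usage on a real host:
#   apt-cache policy cuda-nvcc-13-1 | policy_field Candidate
#   apt-cache policy cuda-nvcc-13-1 | policy_field Installed
```

A value of (none) for Installed, as in the output above, means the package is available from the repository but not yet installed.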

Install NVCC for CUDA 13.1:

sudo apt install -y cuda-nvcc-13-1
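
After the install, nvcc --version may still print 12.4 if the shell finds the older toolkit first on PATH. The sketch below puts the new toolkit's bin directory in front. The directory /usr/local/cuda-13.1/bin is an assumption based on the usual layout of NVIDIA's apt packages and should be verified on the host; prepended_path is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a PATH-style string ($2) with directory $1 placed
# in front, unless $1 is already one of its entries.
prepended_path() {
  case ":$2:" in
    *":$1:"*) printf '%s\n' "$2" ;;      # already present: leave unchanged
    *)        printf '%s\n' "$1:$2" ;;   # otherwise put the new directory first
  esac
}

# Usage on a real host (the directory is an assumption; check that it exists):
#   PATH="$(prepended_path /usr/local/cuda-13.1/bin "$PATH")"
#   export PATH
#   nvcc --version
```

Adding the same lines to the shell profile would make the change persistent across logins.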