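NVIDIA Driver and CUDA Installation and Verification Procedure
1) Check for and release held packages
List the packages currently on hold, then unhold any that are NVIDIA-related: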
apt-mark showhold
for p in $(apt-mark showhold | grep -E 'nvidia|libnvidia|cuda-drivers' || true); do
sudo apt-mark unhold "$p"
done
2) Purge leftovers from the 550 branch
Remove the packages that belong to the old NVIDIA driver branch (550):
sudo apt purge -y 'nvidia-*550*' 'libnvidia-*550*'
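Before running the purge, it can be useful to preview which installed packages the two glob patterns would catch. The following is a minimal sketch, not part of the original procedure; `branch_packages` is a hypothetical helper name, and the regular expression is only an approximation of the apt globs above.

```shell
# Filter a list of package names (one per line on stdin) down to the
# NVIDIA packages that belong to a given driver branch.
# branch_packages is a hypothetical helper, not an apt or dpkg command.
branch_packages() {
  local branch="$1"
  # Approximates the globs 'nvidia-*<branch>*' and 'libnvidia-*<branch>*'.
  grep -E "^(lib)?nvidia-.*${branch}" || true
}

# Typical use on a real host (requires dpkg):
#   dpkg-query -W -f='${Package}\n' | branch_packages 550
```

An empty result suggests there is nothing left from that branch to purge.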
3) Fix dependencies and clean up
Repair broken dependencies, remove packages that are no longer needed, and refresh the package index:
sudo apt --fix-broken install -y
sudo apt autoremove -y
sudo apt update
4) Install the NVIDIA driver and Fabric Manager at a specific version
Install the server variant of the NVIDIA 590 driver together with the matching Fabric Manager:
sudo apt install -y nvidia-driver-590-server nvidia-fabricmanager-590
5) Self-check
1) Check the status of the key packages
dpkg -l | grep -E 'nvidia-driver-590-server|nvidia-fabricmanager-590|libnvidia-extra-590-server|libnvidia-fbc1-590-server'
Sample output:
== 1) Key package status ==
ii libnvidia-extra-590-server:amd64 590.48.01-0ubuntu0.22.04.1 amd64 Extra libraries for the NVIDIA Server Driver
ii libnvidia-fbc1-590-server:amd64 590.48.01-0ubuntu0.22.04.1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii nvidia-driver-590-server 590.48.01-0ubuntu0.22.04.1 amd64 NVIDIA Server Driver metapackage
ii nvidia-fabricmanager-590 590.48.01-0ubuntu0.22.04.1 amd64 Fabric Manager for NVSwitch based systems.
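The same check can be scripted so that it fails loudly when a package is missing or only half-installed. This is a sketch under the assumption that the input is ordinary `dpkg -l` output; `all_installed` is a hypothetical helper name.

```shell
# Read `dpkg -l` lines on stdin; succeed only if every package named in the
# arguments appears with status "ii" (installed and configured).
# all_installed is a hypothetical helper, not a dpkg command.
all_installed() {
  local listing pkg
  listing="$(cat)"
  for pkg in "$@"; do
    # Column 2 may carry an architecture suffix such as ":amd64"; strip it
    # before comparing against the wanted package name.
    if ! printf '%s\n' "$listing" | awk -v want="$pkg" '
        $1 == "ii" { sub(/:.*/, "", $2); if ($2 == want) found = 1 }
        END { exit !found }'; then
      echo "missing or not fully installed: $pkg" >&2
      return 1
    fi
  done
}

# Typical use on a real host:
#   dpkg -l | all_installed nvidia-driver-590-server nvidia-fabricmanager-590
```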
2) Check for any remaining broken dependencies
sudo apt -s --fix-broken install
Sample output:
== 2) Any broken dependencies left? ==
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
0 upgraded, 0 newly installed, 0 to remove and 269 not upgraded.
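The summary line is what matters here: nothing should be queued for installation or removal. A small sketch of checking that automatically follows; `no_pending_fixes` is a hypothetical helper name, and it assumes the English (C locale) wording of the summary line.

```shell
# Succeed only if apt's summary line reports nothing to install or remove.
# Expects the output of a simulated fix-broken run on stdin.
# no_pending_fixes is a hypothetical helper, not an apt command.
no_pending_fixes() {
  grep -Eq '^[0-9]+ upgraded, 0 newly installed, 0 to remove'
}

# Typical use on a real host (apt-get is used because apt warns that its
# command-line interface is not stable for scripting; LC_ALL=C keeps the
# summary line in English):
#   LC_ALL=C sudo apt-get -s -f install | no_pending_fixes && echo "clean"
```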
3) Check whether a reboot is required
ls -l /var/run/reboot-required* 2>/dev/null || echo "no reboot-required file"
Sample output:
== 3) Is a reboot required? ==
no reboot-required file
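4) Currently loaded driver version
Check the NVIDIA driver version that is currently loaded. Before the reboot the old kernel module is usually still loaded, so nvidia-smi may report the old version or fail with a driver/library version mismatch: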
nvidia-smi
Sample output:
== 4) Currently loaded driver version (usually still the old one before reboot) ==
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 590.48
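The mismatch above is what one would expect when the user-space libraries are already 590 while the kernel still runs the previous module. A sketch of making that comparison explicit is shown below; `driver_state` is a hypothetical helper, and the real-host invocation in the comment assumes that the loaded module exposes its version under /sys/module/nvidia/version and that the on-disk module reported by modinfo matches the newly installed libraries.

```shell
# Compare the version of the loaded nvidia kernel module with the version of
# the installed driver. Both values are passed as arguments so the comparison
# itself has no hardware dependency.
# driver_state is a hypothetical helper, not an NVIDIA tool.
driver_state() {
  local loaded="$1" installed="$2"
  if [ -z "$loaded" ]; then
    echo "module-not-loaded"
  elif [ "$loaded" = "$installed" ]; then
    echo "match"
  else
    echo "mismatch"
  fi
}

# Typical use on a real host (paths and fields are assumptions, see above):
#   driver_state "$(cat /sys/module/nvidia/version 2>/dev/null)" \
#                "$(modinfo -F version nvidia 2>/dev/null)"
```

A "mismatch" result before the reboot is consistent with the nvidia-smi error shown above and should clear once the new module is loaded.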
6) Reboot and verify
Reboot the system:
sudo reboot
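Verify after the reboot
Check that the NVIDIA driver loaded correctly, that the Fabric Manager service is running, and inspect the GPU topology: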
nvidia-smi
systemctl status nvidia-fabricmanager --no-pager
nvidia-smi topo -m
Sample output (nvidia-smi):
Fri Mar 6 11:34:46 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H20                     Off |   00000000:09:00.0 Off |                    0 |
| N/A   29C    P0             73W /  500W |       0MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
...
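On a multi-GPU host it is easy to miss a single device in the table, so the driver version can also be checked per GPU in machine-readable form. This is a minimal sketch; `all_gpus_on_branch` is a hypothetical helper name, and it relies on nvidia-smi's `--query-gpu=driver_version --format=csv,noheader` output being one version string per line.

```shell
# Read one driver version per line on stdin (one line per GPU) and succeed
# only if at least one GPU is listed and every version starts with "<branch>.".
# all_gpus_on_branch is a hypothetical helper, not an NVIDIA tool.
all_gpus_on_branch() {
  local branch="$1"
  awk -v b="$branch" '
    { n++; if (index($0, b ".") != 1) bad = 1 }
    END { exit (n == 0 || bad) }
  '
}

# Typical use on a real host:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader | all_gpus_on_branch 590
```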
Check the Fabric Manager service status:
systemctl status nvidia-fabricmanager --no-pager
Sample output:
● nvidia-fabricmanager.service - NVIDIA fabric manager service
Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2026-03-06 11:29:48 CST; 4min 59s ago
Process: 3381 ExecStartPre=/usr/bin/nvidia-fabricmanager-start.sh --mode precheck (code=exited, status=0/SUCCESS)
Process: 3557 ExecStart=/usr/bin/nvidia-fabricmanager-start.sh --mode start (code=exited, status=0/SUCCESS)
Main PID: 3579 (nv-fabricmanage)
Tasks: 18 (limit: 629145)
Memory: 20.1M
CPU: 1.213s
CGroup: /system.slice/nvidia-fabricmanager.service
└─3579 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabr…
Check the NVCC version
nvcc --version
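Sample output (this host still has the 12.4 toolkit; the "CUDA Version: 13.1" reported by nvidia-smi is the highest CUDA version the driver supports, not the version of the installed toolkit):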
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
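To use the toolkit release in a script, for example to decide whether a newer nvcc still needs to be installed, the release number can be pulled out of the output above. This is a sketch; `nvcc_release` is a hypothetical helper name and it assumes the "release X.Y," wording shown in the sample.

```shell
# Extract the toolkit release (for example "12.4") from `nvcc --version`
# output supplied on stdin. Prints nothing if no release line is found.
# nvcc_release is a hypothetical helper, not part of the CUDA toolkit.
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\),.*/\1/p'
}

# Typical use on a real host:
#   nvcc --version | nvcc_release
```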
7) Install the CUDA tools
Check which CUDA toolkit packages are installed and which versions the repository offers:
apt-cache policy cuda-nvcc-13-1 cuda-compiler-13-1 cuda-toolkit-13-1
Sample output:
cuda-nvcc-13-1:
Installed: (none)
Candidate: 13.1.115-1
Version table:
13.1.115-1 600
600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 Packages
13.1.80-1 600
600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 Packages
...
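When several packages are queried at once, the candidate versions are easier to compare in a condensed form. The following sketch reduces `apt-cache policy` output to one "package version" pair per line; `policy_candidates` is a hypothetical helper name.

```shell
# Print "package candidate-version" pairs from `apt-cache policy` output
# supplied on stdin.
# policy_candidates is a hypothetical helper, not an apt command.
policy_candidates() {
  awk '
    # A package header is a single word followed by a colon, e.g. "cuda-nvcc-13-1:".
    /^[^ ]+:$/         { pkg = substr($0, 1, length($0) - 1) }
    $1 == "Candidate:" { print pkg, $2 }
  '
}

# Typical use on a real host:
#   apt-cache policy cuda-nvcc-13-1 cuda-compiler-13-1 cuda-toolkit-13-1 | policy_candidates
```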
Install CUDA NVCC 13.1:
sudo apt install -y cuda-nvcc-13-1
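After the install, `nvcc --version` may still report 12.4 if the older toolkit directory comes first on PATH. The sketch below prepends a directory to a PATH-style string only when it is not already present; `prepend_path` is a hypothetical helper, and /usr/local/cuda-13.1/bin is an assumed install location that should be confirmed on the host (for example with `dpkg -L cuda-nvcc-13-1`).

```shell
# Prepend a directory to a colon-separated PATH-style string unless it is
# already present. An entry that is already listed is left where it is.
# prepend_path is a hypothetical helper.
prepend_path() {
  local dir="$1" path="$2"
  case ":$path:" in
    *":$dir:"*) echo "$path" ;;
    *)          echo "$dir:$path" ;;
  esac
}

# Typical use on a real host (the directory is an assumption, see above):
#   PATH="$(prepend_path /usr/local/cuda-13.1/bin "$PATH")"
#   nvcc --version
```

Calling the new compiler by its full path is an alternative that avoids touching PATH at all.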