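[CUDA Basics] GPU Device Information
Abstract
This section shows how to query information about one or more devices.
Keywords: CUDA Device Information
GPU Device Information
We typically use CUDA in one of two situations. In the first, we write a program for our own use and run it on a local machine or a known server; a look at the spec sheet or configuration notes tells us the GPU model and everything about it. In the second, we write a general-purpose program or framework, and we must determine the hardware environment before using CUDA so that the program does not crash just because the device differs. This article covers two methods. The first, querying through the runtime API, suits general-purpose programs and frameworks. The second suits a local machine or a server you can log in to whose hardware rarely changes; there, a single command shipped with the NVIDIA driver is the convenient way to query device information.
Querying GPU Information Through the API
To query the information from inside a program, use code like the following: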
#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc,char** argv)
{
    printf("%s Starting ...\n",argv[0]);
    // Ask the runtime how many CUDA-capable devices are visible
    int deviceCount = 0;
    cudaError_t error_id = cudaGetDeviceCount(&deviceCount);
    if(error_id!=cudaSuccess)
    {
        printf("cudaGetDeviceCount returned %d\n ->%s\n",
               (int)error_id,cudaGetErrorString(error_id));
        printf("Result = FAIL\n");
        exit(EXIT_FAILURE);
    }
    if(deviceCount==0)
    {
        printf("There are no available device(s) that support CUDA\n");
        // Nothing to query, so stop before touching device 0
        exit(EXIT_SUCCESS);
    }
    else
    {
        printf("Detected %d CUDA Capable device(s)\n",deviceCount);
    }
    // Query the properties of device 0 only
    int dev=0,driverVersion=0,runtimeVersion=0;
    cudaSetDevice(dev);
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp,dev);
    printf("Device %d:\"%s\"\n",dev,deviceProp.name);
    // Versions are packed integers, e.g. 11040 for 11.4
    cudaDriverGetVersion(&driverVersion);
    cudaRuntimeGetVersion(&runtimeVersion);
    printf(" CUDA Driver Version / Runtime Version %d.%d / %d.%d\n",
           driverVersion/1000,(driverVersion%100)/10,
           runtimeVersion/1000,(runtimeVersion%100)/10);
    printf(" CUDA Capability Major/Minor version number: %d.%d\n",
           deviceProp.major,deviceProp.minor);
    // totalGlobalMem is a size_t in bytes; print it in GBytes and as raw bytes
    printf(" Total amount of global memory: %.2f GBytes (%llu bytes)\n",
           (float)deviceProp.totalGlobalMem/pow(1024.0,3),
           (unsigned long long)deviceProp.totalGlobalMem);
    // clockRate is reported in kHz
    printf(" GPU Clock rate: %.0f MHz (%0.2f GHz)\n",
           deviceProp.clockRate*1e-3f,deviceProp.clockRate*1e-6f);
    printf(" Memory Bus width: %d-bits\n",
           deviceProp.memoryBusWidth);
    if (deviceProp.l2CacheSize)
    {
        printf(" L2 Cache Size: %d bytes\n",
               deviceProp.l2CacheSize);
    }
    printf(" Max Texture Dimension Size (x,y,z) 1D=(%d),2D=(%d,%d),3D=(%d,%d,%d)\n",
           deviceProp.maxTexture1D,deviceProp.maxTexture2D[0],deviceProp.maxTexture2D[1],
           deviceProp.maxTexture3D[0],deviceProp.maxTexture3D[1],deviceProp.maxTexture3D[2]);
    printf(" Max Layered Texture Size (dim) x layers 1D=(%d) x %d,2D=(%d,%d) x %d\n",
           deviceProp.maxTexture1DLayered[0],deviceProp.maxTexture1DLayered[1],
           deviceProp.maxTexture2DLayered[0],deviceProp.maxTexture2DLayered[1],
           deviceProp.maxTexture2DLayered[2]);
    // The next fields are size_t, so print them with %zu
    printf(" Total amount of constant memory %zu bytes\n",
           deviceProp.totalConstMem);
    printf(" Total amount of shared memory per block: %zu bytes\n",
           deviceProp.sharedMemPerBlock);
    printf(" Total number of registers available per block:%d\n",
           deviceProp.regsPerBlock);
    printf(" Wrap size: %d\n",deviceProp.warpSize);
    printf(" Maximun number of thread per multiprocesser: %d\n",
           deviceProp.maxThreadsPerMultiProcessor);
    printf(" Maximun number of thread per block: %d\n",
           deviceProp.maxThreadsPerBlock);
    printf(" Maximun size of each dimension of a block: %d x %d x %d\n",
           deviceProp.maxThreadsDim[0],deviceProp.maxThreadsDim[1],deviceProp.maxThreadsDim[2]);
    printf(" Maximun size of each dimension of a grid: %d x %d x %d\n",
           deviceProp.maxGridSize[0],
           deviceProp.maxGridSize[1],
           deviceProp.maxGridSize[2]);
    printf(" Maximu memory pitch %zu bytes\n",deviceProp.memPitch);
    exit(EXIT_SUCCESS);
}
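The program mainly uses the APIs listed below. To learn what an API does, it is best not to rely on blog posts, because blogs do not keep up with new releases; consult the documentation instead. For that reason I will not explain each call here. If you do not understand an API, the fix is: read the docs, read the docs, read the docs!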
cudaSetDevice
cudaGetDeviceProperties
cudaDriverGetVersion
cudaRuntimeGetVersion
cudaGetDeviceCount
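One detail is worth a small example: cudaDriverGetVersion and cudaRuntimeGetVersion each write a single packed integer rather than separate major and minor numbers. The value is encoded as 1000 × major + 10 × minor, so 11040 means 11.4. Below is a minimal sketch of the decoding in plain C, with no CUDA headers needed. CudaVersion, decode_cuda_version and format_cuda_version are names made up for this illustration and are not part of the CUDA API.

```c
#include <stdio.h>

/* Hypothetical helper type for this example, not part of the CUDA API. */
typedef struct {
    int major;
    int minor;
} CudaVersion;

/* Decode the packed integer written by cudaDriverGetVersion or
 * cudaRuntimeGetVersion (encoding: 1000 * major + 10 * minor). */
CudaVersion decode_cuda_version(int packed)
{
    CudaVersion v;
    v.major = packed / 1000;
    v.minor = (packed % 1000) / 10;
    return v;
}

/* Write the version as "major.minor" into buf and return buf. */
char *format_cuda_version(int packed, char *buf, size_t len)
{
    CudaVersion v = decode_cuda_version(packed);
    snprintf(buf, len, "%d.%d", v.major, v.minor);
    return buf;
}
```

The listing above computes the minor number with `% 100` instead of `% 1000`; the two agree whenever the minor number is below 10.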
Running it produces the following output:
(base) ut@UT:~/yangbin/cudaLearn/build/7_device_information$ ./device_information
./device_information Starting ...
Detected 1 CUDA Capable device(s)
Device 0:"NVIDIA GeForce RTX 3080"
CUDA Driver Version / Runtime Version 11.4 / 10.0
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 9.78 GBytes (10503323648 bytes)
GPU Clock rate: 1800 MHz (1.80 GHz)
Memory Bus width: 320-bits
L2 Cache Size: 5242880 bytes
Max Texture Dimension Size (x,y,z) 1D=(131072),2D=(131072,65536),3D=(16384,16384,16384)
Max Layered Texture Size (dim) x layers 1D=(32768) x 2048,2D=(32768,32768) x 2048
Total amount of constant memory 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block:65536
Wrap size: 32
Maximun number of thread per multiprocesser: 1536
Maximun number of thread per block: 1024
Maximun size of each dimension of a block: 1024 x 1024 x 64
Maximun size of each dimension of a grid: 2147483647 x 65535 x 65535
Maximu memory pitch 2147483647 bytes
----------------------------------------------------------
Number of multiprocessors: 68
Total amount of constant memory: 64.00 KB
Total amount of shared memory per block: 48.00 KB
Total number of registers available per block: 65536
Warp size 32
Maximum number of threads per block: 1024
Maximum number of threads per multiprocessor: 1536
Maximum number of warps per multiprocessor: 48
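Many of these parameters will be introduced later, and each one affects performance:
- CUDA driver version
- Device compute capability
- Global memory size (9.78 GB)
- GPU clock rate
- Memory bus width
- L2 cache size
- Maximum texture dimensions, for each dimensionality
- Maximum layered texture dimensions
- Constant memory size
- Shared memory per block
- Registers per block
- Warp size
- Maximum number of threads per multiprocessor
- Maximum number of threads per block
- Maximum block dimensions
- Maximum grid dimensions
- Maximum memory pitch
All of the above are key parameters that we will use later, and they strongly affect efficiency. We will cover them one by one. Different devices call for different settings to maximize efficiency, so we must obtain every parameter we care about before the program runs.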
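As a small illustration of why these limits matter, the sketch below derives a few launch-related numbers from them in plain C. DeviceLimits is a hypothetical struct that only mirrors three cudaDeviceProp fields so the example compiles without the CUDA toolkit; in a real program the values would come from cudaGetDeviceProperties. The helper names are also invented here, and the rounding policy is just one reasonable choice.

```c
#include <stddef.h>

/* Hypothetical mirror of three cudaDeviceProp fields (assumption:
 * a real program copies them from cudaGetDeviceProperties). */
typedef struct {
    int warpSize;
    int maxThreadsPerBlock;
    int maxThreadsPerMultiProcessor;
} DeviceLimits;

/* Number of resident warps one multiprocessor can hold. */
int max_warps_per_sm(const DeviceLimits *d)
{
    return d->maxThreadsPerMultiProcessor / d->warpSize;
}

/* Clamp a requested block size to the device limit and round it down
 * to a whole number of warps, but never below one warp. */
int pick_block_size(const DeviceLimits *d, int requested)
{
    int size = requested;
    if (size > d->maxThreadsPerBlock)
        size = d->maxThreadsPerBlock;
    size -= size % d->warpSize;
    if (size < d->warpSize)
        size = d->warpSize;
    return size;
}

/* Number of blocks needed to cover n elements with one thread each. */
size_t blocks_for(size_t n, int block_size)
{
    return (n + (size_t)block_size - 1) / (size_t)block_size;
}
```

With the values printed above (warp size 32, 1024 threads per block, 1536 threads per multiprocessor), max_warps_per_sm gives 48, which matches the "Maximum number of warps per multiprocessor" line in the output.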
NVIDIA-SMI
nvidia-smi is a tool bundled with the NVIDIA driver that reports device information for the current environment:
(base) ut@UT:~/yangbin/cudaLearn/build/7_device_information$ nvidia-smi
Sun Oct 27 18:18:18 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.239.06 Driver Version: 470.239.06 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:03:00.0 On | N/A |
| 0% 50C P8 30W / 370W | 223MiB / 10016MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1190 G /usr/lib/xorg/Xorg 186MiB |
| 0 N/A N/A 1371 G /usr/bin/gnome-shell 24MiB |
| 0 N/A N/A 2054 G ...nlogin/bin/sunloginclient 10MiB |
+-----------------------------------------------------------------------------+
The command accepts various options, for example:
(base) ut@UT:~/yangbin/cudaLearn/build/7_device_information$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3080 (UUID: GPU-5fc6c894-c559-5a15-f951-28be6e6775b5)
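The options described below trim that large pile of information down, which lets a script collect device information for us. For example, when writing a general-purpose program we can run a script before compilation to gather device information and then bake the best parameters into the build, so the program wastes no resources querying the device at run time. In other words, there are two ways to write a general-purpose program:
- Query device information at run time:
  - Compile the program
  - Start the program
  - Query the information and store it in global variables
  - Functions read the global variables to identify the device and tune their parameters
  - The program finishes
- Query device information at compile time:
  - A script collects the device information
  - Compile the program, baking parameters tuned to the device into the binary
  - Run the program
  - The program finishes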
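The two workflows above can be sketched side by side in plain C. Everything named here is hypothetical: TUNED_BLOCK_SIZE stands for a macro that a pre-build script would pass on the compiler command line (for example `-DTUNED_BLOCK_SIZE=512`) after inspecting the device, and g_block_size with init_block_size stands for the global-variable path filled in at start-up.

```c
/* Compile-time path: a pre-build script is assumed to pass
 * -DTUNED_BLOCK_SIZE=<n>; fall back to 256 when it does not. */
#ifndef TUNED_BLOCK_SIZE
#define TUNED_BLOCK_SIZE 256
#endif

/* Run-time path: a global that is filled in once at start-up. */
static int g_block_size = 0;

/* Record the block size to use, given the device limit queried at
 * start-up (for example cudaDeviceProp.maxThreadsPerBlock) and the
 * size we would prefer if the device allows it. */
void init_block_size(int max_threads_per_block, int preferred)
{
    g_block_size = preferred < max_threads_per_block
                       ? preferred
                       : max_threads_per_block;
}

/* Block size seen by functions in the run-time variant. */
int runtime_block_size(void)
{
    return g_block_size;
}

/* Block size seen by functions in the compile-time variant: a
 * constant the compiler can fold directly into the binary. */
int compiled_block_size(void)
{
    return TUNED_BLOCK_SIZE;
}
```

The run-time variant adapts to whatever device it finds but pays for one query at start-up; the compile-time variant skips the query but the binary is only tuned for the machine the script ran on.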
For detailed information, use
nvidia-smi -q -i 0
which produces the following:
==============NVSMI LOG==============
Timestamp : Sun Oct 27 18:24:59 2024
Driver Version : 470.239.06
CUDA Version : 11.4
Attached GPUs : 1
GPU 00000000:03:00.0
Product Name : NVIDIA GeForce RTX 3080
Product Brand : GeForce
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-5fc6c894-c559-5a15-f951-28be6e6775b5
Minor Number : 0
VBIOS Version : 94.02.26.80.3C
MultiGPU Board : No
Board ID : 0x300
GPU Part Number : N/A
Module ID : 0
Inforom Version
Image Version : G001.0000.03.03
OEM Object : 2.0
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x03
Device : 0x00
Domain : 0x0000
Device Id : 0x220610DE
Bus Id : 00000000:03:00.0
Sub System Id : 0x404B1458
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 10016 MiB
Used : 223 MiB
Free : 9793 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 5 MiB
Free : 251 MiB
Compute Mode : Default
Utilization
Gpu : 1 %
Memory : 2 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 50 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 93 C
GPU Target Temperature : 83 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 23.69 W
Power Limit : 370.00 W
Default Power Limit : 370.00 W
Enforced Power Limit : 370.00 W
Min Power Limit : 100.00 W
Max Power Limit : 370.00 W
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 555 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 2115 MHz
SM : 2115 MHz
Memory : 9501 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 731.250 mV
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 1190
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 186 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 1371
Type : G
Name : /usr/bin/gnome-shell
Used GPU Memory : 24 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 2054
Type : G
Name : /usr/local/sunlogin/bin/sunloginclient
Used GPU Memory : 10 MiB
The following arguments to the -d option of nvidia-smi -q -i 0 extract just the information we want, so there is no need for regular expressions:
- MEMORY
- UTILIZATION
- ECC
- TEMPERATURE
- POWER
- CLOCK
- COMPUTE
- PIDS
- PERFORMANCE
- SUPPORTED_CLOCKS
- PAGE_RETIREMENT
- ACCOUNTING
For example, to get memory information:
nvidia-smi -q -i 0 -d MEMORY
This returns:
==============NVSMI LOG==============
Timestamp : Sun Oct 27 18:27:08 2024
Driver Version : 470.239.06
CUDA Version : 11.4
Attached GPUs : 1
GPU 00000000:03:00.0
FB Memory Usage
Total : 10016 MiB
Used : 223 MiB
Free : 9793 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 5 MiB
Free : 251 MiB
With multiple devices, just change the 0 above to the corresponding device index.
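Because the filtered output is so regular, a host program or build helper can read it with a few lines of string handling. Below is a minimal sketch in plain C that pulls the three FB Memory Usage values out of text like the log above. FbMemory and parse_fb_memory are hypothetical names, and the parser assumes the `Name : number MiB` layout shown here, which may differ between driver versions.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the three values of interest. */
typedef struct {
    long total_mib;
    long used_mib;
    long free_mib;
} FbMemory;

/* Parse the "Key : <n> MiB" lines that follow "FB Memory Usage".
 * Returns 1 when Total, Used and Free were all found, else 0. */
int parse_fb_memory(const char *text, FbMemory *out)
{
    const char *p = strstr(text, "FB Memory Usage");
    int seen = 0;                 /* bit 0: Total, bit 1: Used, bit 2: Free */

    if (p == NULL)
        return 0;
    /* Walk the lines that follow the section header. */
    while (seen != 7 && (p = strchr(p, '\n')) != NULL) {
        char key[32];
        long value;

        p++;                      /* first character of the next line */
        if (sscanf(p, " %31[A-Za-z] : %ld MiB", key, &value) != 2)
            break;                /* reached the next section or the end */
        if (strcmp(key, "Total") == 0) {
            out->total_mib = value;
            seen |= 1;
        } else if (strcmp(key, "Used") == 0) {
            out->used_mib = value;
            seen |= 2;
        } else if (strcmp(key, "Free") == 0) {
            out->free_mib = value;
            seen |= 4;
        }
    }
    return seen == 7;
}
```

The text to parse could come from a file written by a pre-build script or from a pipe opened on the nvidia-smi command; the function itself only needs the string.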
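Summary
This article contains no theory; it is all practical technique. The best way to solve a technical problem is to read the documentation, while the underlying principles are best learned from books and tutorials. That roughly covers the CUDA programming model: kernels, timing, memory, threads, and device parameters. These are enough to write programs much faster than their CPU counterparts, but in pursuit of even more speed we need to study every detail. From here on we go deep into the hardware and explore the secrets behind it.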