GPU-2-4 CUDA Basics: GPU Device Information

[CUDA Basics] GPU Device Information

Abstract

This section shows how to query information about one or more devices.

Keywords: CUDA Device Information

GPU Device Information

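There are generally two situations when we use CUDA. In the first, we write a program for our own use and run it on a local machine or a known server; the manual or the configuration notes then tell us which GPU model we have and everything about it. In the second, we write a general-purpose program or framework, and we must determine the hardware environment before using CUDA so that the program is less likely to crash because the device is different. This article introduces two methods. The first, querying through the runtime API, suits general-purpose programs and frameworks. The second suits a local machine or a server we can log in to, where the hardware generally does not change; there a single command provided by the NVIDIA driver is a convenient way to query device information.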

Querying GPU Information via the API

To query the information from inside a program, use the following code:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>   // exit, EXIT_SUCCESS, EXIT_FAILURE

int main(int argc,char** argv)
{
    printf("%s Starting ...\n",argv[0]);
    // Count the CUDA-capable devices visible to this process
    int deviceCount = 0;
    cudaError_t error_id = cudaGetDeviceCount(&deviceCount);
    if(error_id!=cudaSuccess)
    {
        printf("cudaGetDeviceCount returned %d\n ->%s\n",
              (int)error_id,cudaGetErrorString(error_id));
        printf("Result = FAIL\n");
        exit(EXIT_FAILURE);
    }
    if(deviceCount==0)
    {
        printf("There are no available device(s) that support CUDA\n");
        exit(EXIT_FAILURE);   // nothing to query, so stop here
    }
    else
    {
        printf("Detected %d CUDA Capable device(s)\n",deviceCount);
    }
    // Only device 0 is queried; loop dev over [0,deviceCount) to report every device
    int dev=0,driverVersion=0,runtimeVersion=0;
    cudaSetDevice(dev);
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp,dev);
    printf("Device %d:\"%s\"\n",dev,deviceProp.name);
    // Versions are encoded as 1000*major + 10*minor
    cudaDriverGetVersion(&driverVersion);
    cudaRuntimeGetVersion(&runtimeVersion);
    printf("  CUDA Driver Version / Runtime Version         %d.%d  /  %d.%d\n",
        driverVersion/1000,(driverVersion%100)/10,
        runtimeVersion/1000,(runtimeVersion%100)/10);
    printf("  CUDA Capability Major/Minor version number:   %d.%d\n",
        deviceProp.major,deviceProp.minor);
    // totalGlobalMem is a size_t byte count; print it in GBytes and in bytes
    printf("  Total amount of global memory:                %.2f GBytes (%llu bytes)\n",
            (double)deviceProp.totalGlobalMem/(1024.0*1024.0*1024.0),
            (unsigned long long)deviceProp.totalGlobalMem);
    // clockRate is reported in kHz
    printf("  GPU Clock rate:                               %.0f MHz (%0.2f GHz)\n",
            deviceProp.clockRate*1e-3f,deviceProp.clockRate*1e-6f);
    printf("  Memory Bus width:                             %d-bits\n",
            deviceProp.memoryBusWidth);
    if (deviceProp.l2CacheSize)
    {
        printf("  L2 Cache Size:                                %d bytes\n",
                deviceProp.l2CacheSize);
    }
    printf("  Max Texture Dimension Size (x,y,z)            1D=(%d),2D=(%d,%d),3D=(%d,%d,%d)\n",
            deviceProp.maxTexture1D,deviceProp.maxTexture2D[0],deviceProp.maxTexture2D[1],
            deviceProp.maxTexture3D[0],deviceProp.maxTexture3D[1],deviceProp.maxTexture3D[2]);
    printf("  Max Layered Texture Size (dim) x layers       1D=(%d) x %d,2D=(%d,%d) x %d\n",
            deviceProp.maxTexture1DLayered[0],deviceProp.maxTexture1DLayered[1],
            deviceProp.maxTexture2DLayered[0],deviceProp.maxTexture2DLayered[1],
            deviceProp.maxTexture2DLayered[2]);
    // The following three fields are size_t, so they are printed with %zu
    printf("  Total amount of constant memory               %zu bytes\n",
            deviceProp.totalConstMem);
    printf("  Total amount of shared memory per block:      %zu bytes\n",
            deviceProp.sharedMemPerBlock);
    printf("  Total number of registers available per block:%d\n",
            deviceProp.regsPerBlock);
    printf("  Warp size:                                    %d\n",deviceProp.warpSize);
    printf("  Maximum number of threads per multiprocessor: %d\n",
            deviceProp.maxThreadsPerMultiProcessor);
    printf("  Maximum number of threads per block:          %d\n",
            deviceProp.maxThreadsPerBlock);
    printf("  Maximum size of each dimension of a block:    %d x %d x %d\n",
            deviceProp.maxThreadsDim[0],deviceProp.maxThreadsDim[1],deviceProp.maxThreadsDim[2]);
    printf("  Maximum size of each dimension of a grid:     %d x %d x %d\n",
            deviceProp.maxGridSize[0],
            deviceProp.maxGridSize[1],
            deviceProp.maxGridSize[2]);
    printf("  Maximum memory pitch                          %zu bytes\n",deviceProp.memPitch);
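    // Summary block; reconstructed from the second half of the sample output
    printf("----------------------------------------------------------\n");
    printf("Number of multiprocessors:                      %d\n",deviceProp.multiProcessorCount);
    printf("Total amount of constant memory:                %4.2f KB\n",deviceProp.totalConstMem/1024.0);
    printf("Total amount of shared memory per block:        %4.2f KB\n",deviceProp.sharedMemPerBlock/1024.0);
    printf("Total number of registers available per block:  %d\n",deviceProp.regsPerBlock);
    printf("Warp size                                       %d\n",deviceProp.warpSize);
    printf("Maximum number of threads per block:            %d\n",deviceProp.maxThreadsPerBlock);
    printf("Maximum number of threads per multiprocessor:  %d\n",deviceProp.maxThreadsPerMultiProcessor);
    printf("Maximum number of warps per multiprocessor:     %d\n",
            deviceProp.maxThreadsPerMultiProcessor/deviceProp.warpSize);
    exit(EXIT_SUCCESS);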
}

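The main APIs used are listed below. To learn what an API does, it is best not to rely on blogs, because blogs do not keep up with changes; consult the documentation instead. For that reason I will not explain each call here. If you do not understand an API, the solution is: read the docs, read the docs, read the docs!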

cudaSetDevice

cudaGetDeviceProperties

cudaDriverGetVersion

cudaRuntimeGetVersion

cudaGetDeviceCount

The output looks like this:

(base) ut@UT:~/yangbin/cudaLearn/build/7_device_information$ ./device_information 
./device_information Starting ...
Detected 1 CUDA Capable device(s)
Device 0:"NVIDIA GeForce RTX 3080"
  CUDA Driver Version / Runtime Version         11.4  /  10.0
  CUDA Capability Major/Minor version number:   8.6
  Total amount of global memory:                9.78 GBytes (10503323648 bytes)
  GPU Clock rate:                               1800 MHz (1.80 GHz)
  Memory Bus width:                             320-bits
  L2 Cache Size:                            	5242880 bytes
  Max Texture Dimension Size (x,y,z)            1D=(131072),2D=(131072,65536),3D=(16384,16384,16384)
  Max Layered Texture Size (dim) x layers       1D=(32768) x 2048,2D=(32768,32768) x 2048
  Total amount of constant memory               65536 bytes
  Total amount of shared memory per block:      49152 bytes
  Total number of registers available per block:65536
  Wrap size:                                    32
  Maximun number of thread per multiprocesser:  1536
  Maximun number of thread per block:           1024
  Maximun size of each dimension of a block:    1024 x 1024 x 64
  Maximun size of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximu memory pitch                           2147483647 bytes
----------------------------------------------------------
Number of multiprocessors:                      68
Total amount of constant memory:                64.00 KB
Total amount of shared memory per block:        48.00 KB
Total number of registers available per block:  65536
Warp size                                       32
Maximum number of threads per block:            1024
Maximum number of threads per multiprocessor:  1536
Maximum number of warps per multiprocessor:     48

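Many of these parameters will be introduced later, and every one of them affects performance:

CUDA driver version
Device compute capability
Global memory size (9.78 GB)
GPU clock rate
Memory bus width
L2 cache size
Maximum texture dimensions, for each dimensionality
Maximum layered texture dimensions
Constant memory size
Shared memory per block
Registers per block
Warp size
Maximum number of threads per multiprocessor
Maximum number of threads per block
Maximum block dimensions
Maximum grid dimensions
Maximum memory pitch

All of these are key parameters that we will use later, and they strongly affect efficiency. We will cover them one by one. Different devices need different settings to get the most out of a program, so we must obtain every parameter we care about before the program runs.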

NVIDIA-SMI

nvidia-smi is a tool bundled with the NVIDIA driver that reports device information for the current environment:

(base) ut@UT:~/yangbin/cudaLearn/build/7_device_information$ nvidia-smi
Sun Oct 27 18:18:18 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.239.06   Driver Version: 470.239.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:03:00.0  On |                  N/A |
|  0%   50C    P8    30W / 370W |    223MiB / 10016MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1190      G   /usr/lib/xorg/Xorg                186MiB |
|    0   N/A  N/A      1371      G   /usr/bin/gnome-shell               24MiB |
|    0   N/A  N/A      2054      G   ...nlogin/bin/sunloginclient       10MiB |
+-----------------------------------------------------------------------------+

The command accepts various options, for example:

(base) ut@UT:~/yangbin/cudaLearn/build/7_device_information$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3080 (UUID: GPU-5fc6c894-c559-5a15-f951-28be6e6775b5)

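The options described below cut that large dump down to just what we need, which lets a script collect device information for us. For example, when writing a general-purpose program we can run a script before compilation to obtain the device information and then bake the optimal parameters in at compile time, so the program wastes no resources querying the device at run time. In other words, there are two ways to write a general-purpose program:

Query device information at run time:
- Compile the program
- Start the program
- Query the information and save it in global variables
- Functions read the global variables to identify the current device and tune their parameters
- The program finishes
Query device information at compile time:
- A script collects the device information
- Compile the program, baking parameters tuned for the device into the binary
- Run the program
- The program finishes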

For detailed information, use

nvidia-smi -q -i 0

which gives the following:

==============NVSMI LOG==============

Timestamp                                 : Sun Oct 27 18:24:59 2024
Driver Version                            : 470.239.06
CUDA Version                              : 11.4

Attached GPUs                             : 1
GPU 00000000:03:00.0
    Product Name                          : NVIDIA GeForce RTX 3080
    Product Brand                         : GeForce
    Display Mode                          : Enabled
    Display Active                        : Enabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-5fc6c894-c559-5a15-f951-28be6e6775b5
    Minor Number                          : 0
    VBIOS Version                         : 94.02.26.80.3C
    MultiGPU Board                        : No
    Board ID                              : 0x300
    GPU Part Number                       : N/A
    Module ID                             : 0
    Inforom Version
        Image Version                     : G001.0000.03.03
        OEM Object                        : 2.0
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x03
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x220610DE
        Bus Id                            : 00000000:03:00.0
        Sub System Id                     : 0x404B1458
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 0 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 10016 MiB
        Used                              : 223 MiB
        Free                              : 9793 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 5 MiB
        Free                              : 251 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 1 %
        Memory                            : 2 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 50 C
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 23.69 W
        Power Limit                       : 370.00 W
        Default Power Limit               : 370.00 W
        Enforced Power Limit              : 370.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 370.00 W
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2115 MHz
        SM                                : 2115 MHz
        Memory                            : 9501 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 731.250 mV
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1190
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 186 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1371
            Type                          : G
            Name                          : /usr/bin/gnome-shell
            Used GPU Memory               : 24 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2054
            Type                          : G
            Name                          : /usr/local/sunlogin/bin/sunloginclient
            Used GPU Memory               : 10 MiB

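The following values, passed to nvidia-smi -q -i 0 with the -d option, extract just the information we want (so there is no need for regular expressions):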

MEMORY
UTILIZATION
ECC
TEMPERATURE
POWER
CLOCK
COMPUTE
PIDS
PERFORMANCE
SUPPORTED_CLOCKS
PAGE_RETIREMENT
ACCOUNTING

For example, to get the memory information:

nvidia-smi -q -i 0 -d MEMORY

This prints:

==============NVSMI LOG==============

Timestamp                                 : Sun Oct 27 18:27:08 2024
Driver Version                            : 470.239.06
CUDA Version                              : 11.4

Attached GPUs                             : 1
GPU 00000000:03:00.0
    FB Memory Usage
        Total                             : 10016 MiB
        Used                              : 223 MiB
        Free                              : 9793 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 5 MiB
        Free                              : 251 MiB

With multiple devices, just change the 0 above to the index of the device you want.

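Summary

This article contains no theory; it is all practical. The best way to solve practical problems is to read the documentation, while for the underlying principles you need books and tutorials. That about covers the CUDA programming model: kernels, timing, memory, threads and device parameters are enough to write programs much faster than their CPU counterparts. In pursuit of even more speed, however, we need to study every detail in depth, so from the next article on we dig into the hardware and explore what happens underneath.

References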