CUDA C++ Best Practices Guide, Chapter 9: Memory Optimizations


MindMap

[Figure: Memory Optimizations mind map]

docs.nvidia.com/cuda/cuda-c…

9. Memory Optimizations

9.1. Data Transfer Between Host and Device

9.1.1. Pinned Memory

Page-locked (pinned) memory provides the highest bandwidth between the host and the device. On a PCIe x16 Gen3 card, for example, pinned memory can attain roughly a 12 GB/s transfer rate.

  • cudaHostAlloc() allocates pinned host memory directly.
  • cudaHostRegister() pins an existing allocation on the fly, without the need to allocate a separate buffer and copy the data into it (see the sketch below).
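A minimal sketch of both approaches (the buffer names and nBytes are placeholders, not from the guide):

float *h_pinned;
cudaHostAlloc(&h_pinned, nBytes, cudaHostAllocDefault);           // allocated directly as pinned memory

float *h_existing = (float*)malloc(nBytes);                       // ordinary pageable allocation
cudaHostRegister(h_existing, nBytes, cudaHostRegisterDefault);    // pinned in place, no staging copy needed

// ... use either buffer as the host side of cudaMemcpy / cudaMemcpyAsync ...

cudaHostUnregister(h_existing);                                   // unpin before free()
free(h_existing);
cudaFreeHost(h_pinned);                                           // cudaHostAlloc allocations are released with cudaFreeHost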

9.1.2. Asynchronous and Overlapping Transfers with Computation

Overlapping computation and data transfers

cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);  // asynchronous copy; a_h must be pinned
kernel<<<grid, block>>>(a_d);                                // runs after the copy (same default stream)
cpuFunction();                                               // host work overlaps with the device-side copy and kernel

Concurrent copy and execute

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1);  // copy in stream1 ...
kernel<<<grid, block, 0, stream2>>>(otherData_d);                  // ... overlaps with a kernel on independent data in stream2

Sequential copy and execute

cudaMemcpy(a_d, a_h, N*sizeof(float), dir);
kernel<<<N/nThreads, nThreads>>>(a_d);

Staged concurrent copy and execute

size=N*sizeof(float)/nStreams;
for (i=0; i<nStreams; i++) {
    offset = i*N/nStreams;
    cudaMemcpyAsync(a_d+offset, a_h+offset, size, dir, stream[i]);
    kernel<<<N/(nThreads*nStreams), nThreads, 0,  
             stream[i]>>>(a_d+offset);
}

[Figure: timeline comparison of sequential versus staged concurrent copy and execute]
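The staged example assumes a pinned a_h buffer and a pre-created array of streams; a minimal setup sketch (the choice of nStreams = 4 is arbitrary):

const int nStreams = 4;                                       // number of streams used for staging
cudaStream_t stream[nStreams];
for (int i = 0; i < nStreams; ++i)
    cudaStreamCreate(&stream[i]);
cudaHostAlloc(&a_h, N*sizeof(float), cudaHostAllocDefault);   // copies overlap only when the host buffer is pinned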

9.1.3. Zero Copy

In short, zero copy maps pinned host memory into the device's address space, so GPU threads can access the data directly without an explicit copy.

Zero-copy host code

float *a_h, *a_map;
cudaDeviceProp prop;
...
cudaGetDeviceProperties(&prop, 0);
if (!prop.canMapHostMemory)
    exit(0);
cudaSetDeviceFlags(cudaDeviceMapHost);              // must be set before the device is initialized
cudaHostAlloc(&a_h, nBytes, cudaHostAllocMapped);   // pinned, mapped host allocation
cudaHostGetDevicePointer(&a_map, a_h, 0);           // device pointer aliasing the host buffer
kernel<<<gridSize, blockSize>>>(a_map);             // the kernel accesses host memory directly

9.1.4. Unified Virtual Addressing

9.2. Device Memory Spaces

[Figure: memory spaces on a CUDA device]

The principal traits of the various memory types are summarized in Table 1 of the guide.

9.2.1. Coalesced Access to Global Memory

the concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of 32-byte transactions necessary to service all of the threads of the warp.

This sentence is the core idea: the concurrent accesses of the threads in a warp coalesce into 32-byte aligned transactions (on devices of compute capability 6.0 or higher).

On devices with GDDR memory, accessing memory in a coalesced way is even more important when ECC is turned on. Scattered accesses increase the ECC memory transfer overhead, especially when writing data to global memory.

9.2.1.1. A Simple Access Pattern

The k-th thread accesses the k-th element (a float) of a 32-byte aligned array. This access pattern corresponds to four 32-byte transactions, indicated by the red rectangles in the figure. If several threads access the same element, or if some threads do not access any data at all, the full segments are fetched anyway.

[Figure: coalesced access, a single warp serviced by four 32-byte transactions]
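A minimal sketch of this pattern (the kernel name coalescedCopy is illustrative, not from the guide):

__global__ void coalescedCopy(float *odata, float *idata)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // thread k accesses element k
    odata[tid] = idata[tid];                          // one warp is serviced by four consecutive 32-byte transactions
}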

9.2.1.2. A Sequential but Misaligned Access Pattern

Memory allocated through the CUDA Runtime API, for example via cudaMalloc(), is guaranteed to be aligned to at least 256 bytes. Therefore, choosing a sensible thread-block size, such as a multiple of the warp size (that is, 32 on current GPUs), facilitates memory accesses by warps that are properly aligned.

[Figure: misaligned sequential addresses that fall within five 32-byte segments]

9.2.1.3. Effects of Misaligned Accesses

A copy kernel that illustrates misaligned accesses

This experiment illustrates the impact of misaligned memory accesses.

__global__ void offsetCopy(float *odata, float* idata, int offset)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;  // shift every access by `offset` elements from perfect alignment
    odata[xid] = idata[xid];
}
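A hedged sketch of the host-side sweep behind the performance figure below; the block size, offset range, and array padding are assumptions about how the experiment is run, not code from the guide:

int nThreads = 256;                                   // a multiple of the warp size
// idata_d and odata_d are assumed to be allocated with at least 32 extra floats of slack
for (int offset = 0; offset <= 32; ++offset) {
    offsetCopy<<<N/nThreads, nThreads>>>(odata_d, idata_d, offset);
    cudaDeviceSynchronize();                          // time each offset separately, e.g. with CUDA events
}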

Figure. Performance of offsetCopy kernel


9.2.1.4. Strided Accesses

This experiment illustrates the impact of strided memory accesses.

A kernel to illustrate non-unit stride data copy

Threads within a warp access words in memory with a stride of 2. This causes each warp to load eight L2 cache segments on a Tesla V100 (compute capability 7.0).

__global__ void strideCopy(float *odata, float* idata, int stride)
{
    int xid = (blockIdx.x*blockDim.x + threadIdx.x)*stride;  // adjacent threads access elements `stride` words apart
    odata[xid] = idata[xid];
}

Figure. Adjacent threads accessing memory with a stride of 2


Figure. Performance of strideCopy kernel


9.2.2. L2 Cache

9.2.2.1. L2 Cache Access Window

When a CUDA kernel accesses a data region in global memory repeatedly, such accesses can be considered persisting. On the other hand, if the data is accessed only once, such accesses can be considered streaming. A portion of the L2 cache can be set aside for persisting accesses to a region of global memory. If persisting accesses do not use this set-aside portion, streaming or normal data accesses can use it.

The size of the L2 cache set aside for persisting accesses can be adjusted, within limits:

cudaGetDeviceProperties(&prop, device_id);
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, prop.persistingL2CacheMaxSize); /* set aside the maximum possible L2 cache size for persisting accesses */

Adjust hitRatio and num_bytes to avoid thrashing of the L2 set-aside region:

cudaStreamAttrValue stream_attribute;                                          // stream-level attribute data structure
stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(ptr);  // global memory data pointer
stream_attribute.accessPolicyWindow.num_bytes = num_bytes;                     // number of bytes for persisting accesses (must be less than cudaDeviceProp::accessPolicyMaxWindowSize)
stream_attribute.accessPolicyWindow.hitRatio  = 1.0;                           // hint for the L2 hit ratio of persisting accesses in the num_bytes region
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;  // access property on a cache hit
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;   // access property on a cache miss

// Attach the attributes to a cudaStream_t
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);
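A usage note not shown above: once the persisting phase is over, the window can be disabled and the persisting lines evicted so later kernels see a normal L2. A minimal sketch reusing stream_attribute and stream:

stream_attribute.accessPolicyWindow.num_bytes = 0;                                         // a zero-byte window disables the policy
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);
cudaCtxResetPersistingL2Cache();                                                           // evict any persisting lines from the set-aside L2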

9.2.2.2. Tuning the Access Window Hit-Ratio

[Figure: sliding-window experiment, mapping of the persistent and streaming data regions onto L2]

The sliding-window experiment:

__global__ void kernel(int *data_persistent, int *data_streaming, int dataSize, int freqSize) { 
    int tid = blockIdx.x * blockDim.x + threadIdx.x; 
    
    /*Each CUDA thread accesses one element in the persistent data section
      and one element in the streaming data section.
      Because the size of the persistent memory region (freqSize * sizeof(int) bytes) is much 
      smaller than the size of the streaming memory region (dataSize * sizeof(int) bytes), data 
      in the persistent region is accessed more frequently*/

    data_persistent[tid % freqSize] = 2 * data_persistent[tid % freqSize]; 
    data_streaming[tid % dataSize] = 2 * data_streaming[tid % dataSize];
}     

stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(data_persistent); 
stream_attribute.accessPolicyWindow.num_bytes = freqSize * sizeof(int);   //Number of bytes for persisting accesses in range 10-60 MB
stream_attribute.accessPolicyWindow.hitRatio  = 1.0;                      //Hint for cache hit ratio. Fixed value 1.0

[Figure: performance of the sliding-window benchmark with a fixed hit ratio of 1.0]
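With hitRatio fixed at 1.0, performance drops once the persistent region outgrows the set-aside L2, because the window keeps evicting its own persisting lines. A hedged sketch of the tuning this subsection refers to, capping the window and scaling the hit ratio (the 20 MB set-aside size is an assumption):

stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(data_persistent);
stream_attribute.accessPolicyWindow.num_bytes = 20*1024*1024;                                  // cap the window at the assumed 20 MB set-aside
stream_attribute.accessPolicyWindow.hitRatio  = (20*1024*1024)/((float)freqSize*sizeof(int));  // so that at most ~20 MB of the region is treated as persisting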

9.2.3. Shared Memory

Shared memory has much higher bandwidth and lower latency than local or global memory, provided there are no bank conflicts between the threads.
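A classic illustration (a minimal sketch, not one of this chapter's code samples): in a tiled matrix transpose, the threads of a warp read a column of the shared tile, which maps every access to the same bank unless each tile row is padded by one element.

__global__ void transposeTile(float *odata, const float *idata, int width)
{
    __shared__ float tile[32][32 + 1];                       // +1 pad so a column read hits 32 different banks
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = idata[y * width + x];   // coalesced load from global memory
    __syncthreads();
    int xt = blockIdx.y * 32 + threadIdx.x;
    int yt = blockIdx.x * 32 + threadIdx.y;
    odata[yt * width + xt] = tile[threadIdx.x][threadIdx.y]; // conflict-free column read thanks to the padding
}

Launched with 32x32 thread blocks and width a multiple of 32, the padded tile avoids the 32-way bank conflicts an unpadded tile[32][32] would cause.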

9.2.3.1. Shared Memory and Memory Banks

9.2.3.2. Shared Memory in Matrix Multiplication (C=AB)