CUDA Pinned Memory, Zero-Copy Memory, and Unified Memory


1. Pinned Memory

By default, memory allocated on the host is pageable (pageable memory means memory whose contents can be paged in and out between DRAM and secondary storage). The GPU cannot directly access data in pageable memory, so to transfer it the CUDA driver first allocates a page-locked (pinned) staging buffer in host memory, copies the pageable data into that pinned buffer, and only then copies it from the pinned buffer to GPU memory, as illustrated in the figure below.

[Figure: host-to-device transfer paths for pageable vs. pinned memory]

So if we allocate pinned memory directly, one copy is eliminated and transfers are faster. Note, however, that because pinned memory is locked in physical RAM and can never be swapped out to disk, allocating too much of it can hurt host performance.

Pinned host memory is allocated and freed with cudaMallocHost and cudaFreeHost. Example code:

#include <stdio.h>
#include <assert.h>

// Convenience function for checking CUDA runtime API results
// can be wrapped around any runtime API call. No-op in release builds.
inline
cudaError_t checkCuda(cudaError_t result)
{
#if defined(DEBUG) || defined(_DEBUG)
  if (result != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime Error: %s\n", 
            cudaGetErrorString(result));
    assert(result == cudaSuccess);
  }
#endif
  return result;
}

void profileCopies(float        *h_a, 
                   float        *h_b, 
                   float        *d, 
                   unsigned int  n,
                   const char   *desc)
{
  printf("\n%s transfers\n", desc);

  unsigned int bytes = n * sizeof(float);

  // events for timing
  cudaEvent_t startEvent, stopEvent; 

  checkCuda( cudaEventCreate(&startEvent) );
  checkCuda( cudaEventCreate(&stopEvent) );

  checkCuda( cudaEventRecord(startEvent, 0) );
  checkCuda( cudaMemcpy(d, h_a, bytes, cudaMemcpyHostToDevice) );
  checkCuda( cudaEventRecord(stopEvent, 0) );
  checkCuda( cudaEventSynchronize(stopEvent) );

  float time;
  checkCuda( cudaEventElapsedTime(&time, startEvent, stopEvent) );
  printf("  Host to Device bandwidth (GB/s): %f\n", bytes * 1e-6 / time);

  checkCuda( cudaEventRecord(startEvent, 0) );
  checkCuda( cudaMemcpy(h_b, d, bytes, cudaMemcpyDeviceToHost) );
  checkCuda( cudaEventRecord(stopEvent, 0) );
  checkCuda( cudaEventSynchronize(stopEvent) );

  checkCuda( cudaEventElapsedTime(&time, startEvent, stopEvent) );
  printf("  Device to Host bandwidth (GB/s): %f\n", bytes * 1e-6 / time);

  for (int i = 0; i < n; ++i) {
    if (h_a[i] != h_b[i]) {
      printf("*** %s transfers failed ***\n", desc);
      break;
    }
  }

  // clean up events
  checkCuda( cudaEventDestroy(startEvent) );
  checkCuda( cudaEventDestroy(stopEvent) );
}

int main()
{
  unsigned int nElements = 4*1024*1024;
  const unsigned int bytes = nElements * sizeof(float);

  // host arrays
  float *h_aPageable, *h_bPageable;   
  float *h_aPinned, *h_bPinned;

  // device array
  float *d_a;

  // allocate and initialize
  h_aPageable = (float*)malloc(bytes);                    // host pageable
  h_bPageable = (float*)malloc(bytes);                    // host pageable
  checkCuda( cudaMallocHost((void**)&h_aPinned, bytes) ); // host pinned
  checkCuda( cudaMallocHost((void**)&h_bPinned, bytes) ); // host pinned
  checkCuda( cudaMalloc((void**)&d_a, bytes) );           // device

  for (int i = 0; i < nElements; ++i) h_aPageable[i] = i;      
  memcpy(h_aPinned, h_aPageable, bytes);
  memset(h_bPageable, 0, bytes);
  memset(h_bPinned, 0, bytes);

  // output device info and transfer size
  cudaDeviceProp prop;
  checkCuda( cudaGetDeviceProperties(&prop, 0) );

  printf("\nDevice: %s\n", prop.name);
  printf("Transfer size (MB): %d\n", bytes / (1024 * 1024));

  // perform copies and report bandwidth
  profileCopies(h_aPageable, h_bPageable, d_a, nElements, "Pageable");
  profileCopies(h_aPinned, h_bPinned, d_a, nElements, "Pinned");

  printf("n");

  // cleanup
  cudaFree(d_a);
  cudaFreeHost(h_aPinned);
  cudaFreeHost(h_bPinned);
  free(h_aPageable);
  free(h_bPageable);

  return 0;
}

References:

developer.nvidia.com/blog/how-op…

s1nh.org/post/tx-1-z…

2. Zero-Copy Memory (also known as mapped pinned memory)

Zero-copy memory is built on pinned memory: the data is allocated in host memory, and the GPU accesses it over PCIe on every operation. PCIe access is slower than access to GPU memory, but it removes two steps, allocating device memory and copying the data from host to device, so it can save time. If the data is only read a few times, zero-copy memory has the advantage; if the data is read repeatedly (i.e. the access pattern would benefit from caching; note that zero-copy accesses are not cached), ordinary device memory is faster. As forums.developer.nvidia.com/t/regarding… puts it:

Regarding the article [url]arrayfire.com/zero-copy-o… from 2014 stating that zero-copy is faster than cudaMalloc, this article is mis-leading and generalizes the zero-copy case. This is not really accurate.
Zero copy is only faster in some cases where the access pattern does not benefit from caches. Zero-Copy memory on Tegra is CPU and GPU uncached. So every access by the CUDA kernel goes to DRAM. So if the kernel repeatedly accesses the same memory location from then it is likely that the cudaMalloc memory is faster.

On hardware with an integrated GPU, where the GPU and CPU physically share the same memory, zero-copy memory performs even better.

Example code:

// Set flag to enable zero copy access
cudaSetDeviceFlags(cudaDeviceMapHost);
 
// Host Arrays (CPU pointers)
float* h_in  = NULL;
float* h_out = NULL;
 
// Allocate mapped, pinned host memory using CUDA allocation calls
// (sizeIn and sizeOut are the buffer sizes in bytes)
cudaHostAlloc((void **)&h_in,  sizeIn,  cudaHostAllocMapped);
cudaHostAlloc((void **)&h_out, sizeOut, cudaHostAllocMapped);
 
// Fill h_in with input data here
 
// Device pointers (aliases of the same mapped host memory)
float *d_out, *d_in;
// Get device pointer from host memory. No allocation or memcpy
cudaHostGetDevicePointer((void **)&d_in,  (void *) h_in , 0);
cudaHostGetDevicePointer((void **)&d_out, (void *) h_out, 0);
 
// Launch the GPU kernel
kernel<<<blocks, threads>>>(d_out, d_in);

// Wait for the kernel to finish; there is no need to copy d_out back
cudaDeviceSynchronize();

// Continue processing on host using h_out
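
For completeness, here is a minimal sketch of a kernel the snippet above could launch; the kernel body and the assumption that blocks * threads exactly covers the arrays are mine, not part of the referenced example:

__global__ void kernel(float *out, const float *in)
{
  // Each thread handles one element; out and in dereference mapped pinned
  // host memory, so every access goes over PCIe and is not cached on the GPU.
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = 2.0f * in[i];
}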

The third argument of cudaHostAlloc accepts the following flags:

  • cudaHostAllocDefault
  • cudaHostAllocPortable
  • cudaHostAllocWriteCombined
  • cudaHostAllocMapped

Their usage is described in Professional CUDA C Programming:

cudaHostAllocDefault makes the behavior of cudaHostAlloc identical to cudaMallocHost. Setting cudaHostAllocPortable returns pinned memory that can be used by all CUDA contexts, not just the one that performed the allocation. The flag cudaHostAllocWriteCombined returns write-combined memory, which can be transferred across the PCI Express bus more quickly on some system configurations but cannot be read efficiently by most hosts. Therefore, write-combined memory is a good option for buffers that will be written by the host and read by the device using either mapped pinned memory or host-to-device transfers. The most relevant flag to zero-copy memory is cudaHostAllocMapped, which returns host memory that is mapped into the device address space.
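
As a small illustration of these flags (the buffer names and the 1 MiB size are made up for this sketch, not taken from the book), a buffer that the host only writes could be allocated write-combined, while a zero-copy buffer combines cudaHostAllocMapped with cudaHostAllocPortable:

float *h_staging = NULL, *h_mapped = NULL;
size_t bytes = 1 << 20;  // 1 MiB, arbitrary size for this sketch

// Written by the host, then copied to the device with cudaMemcpy:
// write-combined memory can transfer faster but is slow for host reads.
cudaHostAlloc((void **)&h_staging, bytes, cudaHostAllocWriteCombined);

// Mapped into the device address space for zero-copy access, and portable
// so that every CUDA context can use the same pinned allocation.
cudaHostAlloc((void **)&h_mapped, bytes, cudaHostAllocMapped | cudaHostAllocPortable);

cudaFreeHost(h_staging);
cudaFreeHost(h_mapped);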

References:

developer.ridgerun.com/wiki/index.…

leimao.github.io/blog/CUDA-Z…

3. Unified Virtual Addressing (UVA)

With zero-copy memory, you first call cudaHostAlloc to allocate the memory and obtain a host pointer to it, and then call cudaHostGetDevicePointer to obtain a device pointer to the same region. UVA, introduced in CUDA 4.0, removes the cudaHostGetDevicePointer step: CUDA performs that mapping itself, so the same pointer works on both sides. What happens underneath is the same; the code is simply shorter. As Professional CUDA C Programming puts it:

Generating the same results using less code improves both the readability and maintainability of your application.

Before UVA, zero-copy memory was used like this:

float *h_A, *h_B, *gpuRef;
// flags should include cudaHostAllocMapped so the memory is mapped into the device address space
cudaHostAlloc((void **)&h_A, nBytes, flags);
cudaHostAlloc((void **)&h_B, nBytes, flags);
gpuRef = (float *)malloc(nBytes);

float *d_A, *d_B, *d_C;
cudaHostGetDevicePointer((void **)&d_A, (void *)h_A, 0); 
cudaHostGetDevicePointer((void **)&d_B, (void *)h_B, 0);
cudaMalloc((float**)&d_C, nBytes);
sumArraysZeroCopy<<<grid, block>>>(d_A, d_B, d_C, nElem);

cudaMemcpy(gpuRef, d_C, nBytes, cudaMemcpyDeviceToHost);
cudaFree(d_C);
cudaFreeHost(h_A);
cudaFreeHost(h_B);
free(gpuRef);

With UVA, the code can be simplified to:

cudaHostAlloc((void **)&h_A, nBytes, cudaHostAllocMapped);
cudaHostAlloc((void **)&h_B, nBytes, cudaHostAllocMapped);
sumArraysZeroCopy<<<grid, block>>>(h_A, h_B, d_C, nElem);
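
Because UVA places host and device allocations in a single virtual address space, the runtime can also report where a pointer lives. A minimal sketch (the nBytes size and the printout are illustrative) using cudaPointerGetAttributes:

size_t nBytes = 1 << 20;
float *h_A = NULL;
cudaHostAlloc((void **)&h_A, nBytes, cudaHostAllocMapped);

// Under UVA the runtime can tell which space a pointer belongs to and
// return the matching host/device aliases of the same allocation.
cudaPointerAttributes attr;
cudaPointerGetAttributes(&attr, h_A);
printf("device %d, hostPointer %p, devicePointer %p\n",
       attr.device, attr.hostPointer, attr.devicePointer);

cudaFreeHost(h_A);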

4. Unified Memory

With Unified Memory the system creates a managed memory pool and moves data between host and device automatically. The runtime manages the memory for you, so explicit cudaMemcpy* calls are no longer needed. In other words, UM allocates memory and migrates data on your behalf; it is not faster than carefully hand-managed memory, just more convenient. The following explanation from forums.developer.nvidia.com/t/differenc… makes this clearer:

Unified Memory: data are moved between CPU and GPU RAM on demand. It’s similar to manual copying before/after kernel call, but automatically managed by the CUDA. You just allocates the single universal pointer and can access it on both sides. So it never faster than manual memory management, sometimes may be slower (when automatic heuristics sucks), but simplifies the program. Essentially, you can remove all explicit memory move operations, remove any separate allocations of device memory and alloc everything as UM arrays used by both CPU and GPU code.
Heuristic AFAIR is the following: when data, which are absent on CPU side, are accessed by CPU, they are moved from GPU on-demand with a page (4KB) granularity. When data that can be potentially accessed by kernel (i.e. available for the stream to which the kernel belongs) are absent on GPU side, entire array is copied from CPU to GPU prior to kernel start.

The CUDA C Programming Guide writes:

In simple terms, Unified Memory eliminates the need for explicit data movement via the cudaMemcpy*() routines without the performance penalty incurred by placing all data into zero-copy memory. Data movement, of course, still takes place, so a program’s run time typically does not decrease; Unified Memory instead enables the writing of simpler and more maintainable code. Unified Memory offers a “single-pointer-to-data” model that is conceptually similar to CUDA’s zero-copy memory. One key difference between the two is that with zero-copy allocations the physical location of memory is pinned in CPU system memory such that a program may have fast or slow access to it depending on where it is being accessed from. Unified Memory, on the other hand, decouples memory and execution spaces so that all data accesses are fast.

Professional CUDA C Programming writes:

Unified Memory offers a “single-pointer-to-data” model that is conceptually similar to zero-copy memory. However, zero-copy memory is allocated in host memory, and as a result kernel performance generally suffers from high-latency accesses to zero-copy memory over the PCIe bus. Unified Memory, on the other hand, decouples memory and execution spaces so that data can be transparently migrated on demand to the host or device to improve locality and performance.

In other words, Unified Memory and zero-copy memory are conceptually very similar, but zero-copy data always resides in host memory, so accesses from the device are slow, whereas Unified Memory can migrate data back and forth between host memory and device memory.
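
To make the contrast concrete, here is a minimal Unified Memory sketch (the kernel, sizes, and launch configuration are illustrative, not from the quoted sources): a single pointer returned by cudaMallocManaged is accessed by both the CPU and the GPU, and the runtime migrates the data on demand.

#include <cstdio>

__global__ void scale(float *data, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

int main()
{
  int n = 1 << 20;
  float *data = NULL;

  // A single allocation, a single pointer, usable by both host and device.
  cudaMallocManaged((void **)&data, n * sizeof(float));

  // Initialize on the CPU through the same pointer.
  for (int i = 0; i < n; ++i) data[i] = 1.0f;

  // The runtime migrates the data to the GPU as the kernel accesses it.
  scale<<<(n + 255) / 256, 256>>>(data, n);

  // The kernel launch is asynchronous: synchronize before the CPU reads data.
  cudaDeviceSynchronize();

  printf("data[0] = %f\n", data[0]);  // migrated back to the host on access

  cudaFree(data);
  return 0;
}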