Note: this is a beginner's article; corrections are welcome.
When allocating device memory in CUDA programming, the most commonly used functions are cudaMalloc(), cudaMemcpy(), and cudaFree(). When accessing a 2D array in CUDA, memory access is faster if the alignment requirements are satisfied. Plain cudaMalloc(), however, only allocates linear memory and does not handle alignment padding. This is where cudaMallocPitch() and cudaMemcpy2D() come in. As the CUDA_C_Programming_Guide puts it:
Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D(). These functions are recommended for allocations of 2D or 3D arrays as it makes sure that the allocation is appropriately padded to meet the alignment requirements described in Device Memory Accesses, therefore ensuring best performance when accessing the row addresses or performing copies between 2D arrays and other regions of device memory (using the cudaMemcpy2D() and cudaMemcpy3D() functions). The returned pitch (or stride) must be used to access array elements.
The following explains their usage, taking the allocation of a 2D array of float as an example.
Typical usage of cudaMallocPitch:
cudaMallocPitch(&devPtr, &devPitch, Ncols * sizeof(float), Nrows);
devPtr is the returned pointer to the allocated memory.
devPitch is a size_t giving the padded width of each row in bytes.
Nrows and Ncols are the dimensions of the 2D array to allocate.
The function therefore allocates Nrows * devPitch bytes in total, but only the first Ncols * sizeof(float) bytes of each row contain actual 2D-array data.
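As a concrete illustration, here is a minimal host-side sketch of the allocation step. The dimensions are made up for the example, and the printed pitch value depends on the device:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t Nrows = 100, Ncols = 30;   // example dimensions
    float *devPtr = nullptr;
    size_t devPitch = 0;

    // Allocate Nrows rows; each row is padded from Ncols * sizeof(float)
    // bytes up to devPitch bytes to satisfy the alignment requirements.
    cudaError_t err = cudaMallocPitch(&devPtr, &devPitch,
                                      Ncols * sizeof(float), Nrows);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocPitch failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // devPitch >= Ncols * sizeof(float); the total allocation is
    // Nrows * devPitch bytes.
    printf("requested row width: %zu bytes, pitch: %zu bytes\n",
           Ncols * sizeof(float), devPitch);

    cudaFree(devPtr);
    return 0;
}
```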
Correspondingly, the copy function cudaMemcpy2D is used as follows:
cudaMemcpy2D(devPtr, devPitch, hostPtr, hostPitch, Ncols * sizeof(float), Nrows, cudaMemcpyHostToDevice)
devPtr and hostPtr are float pointers to device and host memory, respectively.
devPitch and hostPitch are size_t values giving the padded row width in bytes on the device and host. If the host 2D array has no padding, i.e. it is an ordinary contiguous array, then hostPitch = Ncols * sizeof(float).
Nrows and Ncols are size_t values giving the number of rows and columns of the 2D array.
cudaMemcpy2D can of course also copy in the other direction:
cudaMemcpy2D(hostPtr, hostPitch, devPtr, devPitch, Ncols * sizeof(float), Nrows, cudaMemcpyDeviceToHost)
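Putting the two directions together, here is a hedged sketch of a host-to-device-to-host round trip, assuming an ordinary unpadded host array (so hostPitch = Ncols * sizeof(float)); error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>

// Sketch: copy an unpadded Nrows x Ncols host array to pitched device
// memory and back.
void roundTrip(float *hostPtr, size_t Nrows, size_t Ncols)
{
    float *devPtr;
    size_t devPitch;
    size_t hostPitch = Ncols * sizeof(float);  // host rows are contiguous

    cudaMallocPitch(&devPtr, &devPitch, Ncols * sizeof(float), Nrows);

    // Host -> device: each of the Nrows rows copies Ncols * sizeof(float)
    // bytes; the source advances by hostPitch per row, the destination
    // by devPitch.
    cudaMemcpy2D(devPtr, devPitch, hostPtr, hostPitch,
                 Ncols * sizeof(float), Nrows, cudaMemcpyHostToDevice);

    // ... launch kernels that access devPtr using devPitch ...

    // Device -> host: the pitches swap roles with the direction.
    cudaMemcpy2D(hostPtr, hostPitch, devPtr, devPitch,
                 Ncols * sizeof(float), Nrows, cudaMemcpyDeviceToHost);

    cudaFree(devPtr);
}
```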
When accessing a 2D array allocated with cudaMallocPitch, the address of the first element of each row is float *row_a = (float *)((char *)devPtr + row_index * pitch);, and the i-th element of that row is row_a[i], where i < Ncols.
If tidx and tidy represent the column and row indices respectively, the access code is:
int tidx = blockIdx.x * blockDim.x + threadIdx.x;
int tidy = blockIdx.y * blockDim.y + threadIdx.y;
if ((tidx < Ncols) && (tidy < Nrows))
{
    float *row_a = (float *)((char *)devPtr + tidy * pitch);
    row_a[tidx] = row_a[tidx] * tidx * tidy;
}
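For completeness, here is one possible launch configuration for such a kernel. The 16×16 block size is an arbitrary but common choice, and the kernel name myKernel is hypothetical, not prescribed by the source:

```cuda
// Cover an Nrows x Ncols array with a 2D grid of 2D blocks; the ceiling
// division ensures every element gets a thread, and the kernel's bounds
// check (tidx < Ncols && tidy < Nrows) discards the excess threads.
dim3 block(16, 16);
dim3 grid((Ncols + block.x - 1) / block.x,
          (Nrows + block.y - 1) / block.y);
myKernel<<<grid, block>>>(devPtr, pitch, Nrows, Ncols);
```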
The following code is from the CUDA_C_Programming_Guide:
// Host code
int width = 64, height = 64;
float *devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float *devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float *row = (float *)((char *)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}
How do you free memory allocated with cudaMallocPitch? Simply use cudaFree().