Shared Memory in CUDA Programming


Shared memory lives in the GPU's SMs and is used jointly by the threads of a thread block. Its latency is roughly 20 to 30 times lower than that of global memory, and its bandwidth is about 10 times higher. As described on page 232 of Professional CUDA C Programming:

Shared memory (SMEM) is one of the key components of the GPU. Physically, each SM contains a small low-latency memory pool shared by all threads in the thread block currently executing on that SM. Shared memory enables threads within the same thread block to cooperate, facilitates reuse of on-chip data, and can greatly reduce the global memory bandwidth needed by kernels. Because the contents of shared memory are explicitly managed by the application, it is often described as a program-managed cache. Fermi and Kepler GPUs have similar memory hierarchies, except Kepler includes an additional compiler-directed cache for read-only data. As illustrated in Figure 5-1, all load and store requests to global memory go through the L2 cache, which is the primary point of data unification between SM units. Note that shared memory and L1 cache are physically closer to the SM than both the L2 cache and global memory. As a result, shared memory latency is roughly 20 to 30 times lower than global memory, and bandwidth is nearly 10 times higher.

A fixed amount of shared memory is allocated to each thread block when it starts executing. This shared memory address space is shared by all threads in a thread block. Its contents have the same lifetime as the thread block in which it was created.

1. Allocating Shared Memory

Shared memory can be allocated either statically or dynamically. Static allocation means the size is known at compile time; dynamic allocation means the size is only known when the kernel function is launched. Static allocation looks like this (size_y and size_x must be compile-time constants):

__shared__ float tile[size_y][size_x];

Dynamic allocation looks like this:

extern __shared__ int tile[];

When the kernel is launched, the size of the dynamic shared memory is passed in bytes as the third launch-configuration argument:

kernel<<<grid, block, isize * sizeof(int)>>>(...)
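As a minimal sketch of the dynamic case (the kernel name and data layout are illustrative, and the grid is assumed to exactly cover the data), the array is declared with `extern __shared__` inside the kernel and its size is supplied at launch time:

// Reverse the elements of each block in place using dynamically allocated
// shared memory (sketch; the grid is assumed to exactly cover the data).
__global__ void blockReverse(int *d_data)
{
    extern __shared__ int tile[];          // size is fixed at launch time, not compile time

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = d_data[idx];       // stage the block's elements in shared memory
    __syncthreads();                       // wait until every element has been written
    d_data[idx] = tile[blockDim.x - 1 - threadIdx.x];   // write them back reversed
}

// Launch: the third configuration argument is the dynamic shared memory size in bytes.
// blockReverse<<<grid, block, block.x * sizeof(int)>>>(d_data);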

2. Shared Memory Banks and Access Modes

Shared memory is divided into 32 equally sized memory modules, each called a bank. GPUs of different compute capabilities map shared memory addresses to banks in different ways. When multiple threads of a warp access different addresses that fall into the same bank, a bank conflict occurs: the hardware splits the request into multiple transactions, which lowers effective bandwidth and slows the access down, so bank conflicts should be avoided whenever possible. When a warp accesses shared memory, one of the following three cases generally applies (a short code sketch illustrating them follows the list):

  1. Parallel access
    The threads of the warp access multiple addresses that are spread across different banks. When every address falls into a different bank, the access is conflict-free.
  2. Serial access
    All accessed addresses fall into a single bank. This is the worst pattern: if the 32 threads of a warp access 32 different words in the same bank, the request is serialized into 32 separate memory transactions.
  3. Broadcast access
    All threads of the warp access a single address, located in some bank. Only one memory transaction is performed, and the value read is broadcast to every requesting thread.
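A minimal sketch of the three patterns (the kernel and array names are illustrative; the default 4-byte bank width is assumed, so bank index = word index % 32):

__global__ void accessPatterns(int *out)
{
    __shared__ int smem[32 * 32];
    int lane = threadIdx.x % 32;                 // lane id within the warp

    // Fill the array cooperatively (conflict-free: consecutive threads hit
    // consecutive words, hence consecutive banks).
    for (int i = threadIdx.x; i < 32 * 32; i += blockDim.x)
        smem[i] = i;
    __syncthreads();

    int a = smem[lane];        // 1. parallel access: every lane hits a different bank
    int b = smem[lane * 32];   // 2. serial access: every lane hits a different word of bank 0, a 32-way conflict
    int c = smem[0];           // 3. broadcast access: one word is read and broadcast to all lanes

    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b + c;
}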

Take the Fermi architecture as an example: each bank word is 4 bytes wide (on compute capability 3.x devices it can be 8 bytes). The layout is illustrated below:

[Figure: Fermi shared memory bank layout]

Adjacent 4-byte words fall into adjacent banks, so the mapping from a byte address to its bank index is:

bank index = (byte address ÷ 4) % 32

If each bank word is 8 bytes wide (configurable on Kepler), the formula becomes:

bank index = (byte address ÷ 8) % 32

Note that if two threads of a warp access two sub-ranges of the same word (4 bytes / 8 bytes), no bank conflict occurs.
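On Kepler-class devices the 8-byte bank mode mentioned above can be requested through the CUDA runtime. A minimal host-side sketch (the call only has an effect on architectures with a configurable bank size; on others it is typically ignored):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Request 8-byte banks (effective on Kepler-class GPUs).
    cudaError_t err = cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    if (err != cudaSuccess)
        printf("cudaDeviceSetSharedMemConfig: %s\n", cudaGetErrorString(err));

    // Query the bank size configuration currently in effect.
    cudaSharedMemConfig cfg;
    cudaDeviceGetSharedMemConfig(&cfg);
    printf("shared memory bank size config: %d\n", (int)cfg);
    return 0;
}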

3. Memory Padding

Memory padding means inserting one extra word after each row of the shared memory array, which shifts the elements of every subsequent row over by one bank, as shown below:

[Figure: bank layout before (left) and after (right) padding] Suppose five elements that all sit in bank 0 need to be accessed. Before padding these accesses conflict; after padding, as the right half of the figure shows, the five elements are spread across different banks, so the conflict disappears.

Padding is simple: just add one extra column to the shared memory array:

__shared__ int tile[BDIMY][BDIMX+1];
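A sketch of how the padded tile is typically used, following the style of the examples in Professional CUDA C Programming (the block is assumed to be BDIMX × BDIMY threads):

#define BDIMX 32
#define BDIMY 32

// Store the tile row by row, then read it column by column. Without the "+1"
// padding, a column of a 32-wide tile falls entirely into one bank and the
// column read becomes a 32-way conflict; with padding, consecutive rows of
// the same column land in different banks.
__global__ void setRowReadColPad(int *out)
{
    __shared__ int tile[BDIMY][BDIMX + 1];   // one padded word per row

    unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;

    tile[threadIdx.y][threadIdx.x] = idx;        // row-wise store: conflict-free
    __syncthreads();
    out[idx] = tile[threadIdx.x][threadIdx.y];   // column-wise load: conflict-free thanks to padding
}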

4. Synchronization When Using Shared Memory

  1. __syncthreads():

The function signature is:

void __syncthreads()

Threads in the thread block are suspended until every thread in the block has reached this call. Note that when the function is used inside an if statement, all threads in the block must evaluate the condition to the same result, i.e. all take the if branch or all take the else branch. The following code leads to undefined behavior:

if (threadID % 2 == 0) {
    __syncthreads();   // barrier reached only by the even threads
} else {
    __syncthreads();   // barrier reached only by the odd threads: undefined behavior
}
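A safe alternative keeps the divergent work inside the branches and moves the barrier to a point that every thread of the block reaches unconditionally:

if (threadID % 2 == 0) {
    // ... work done only by even threads ...
} else {
    // ... work done only by odd threads ...
}
__syncthreads();   // executed by all threads of the block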
  2. Memory fences: a fence function guarantees that the memory writes issued by the calling thread before the fence() call are visible once the fence completes. There are three kinds of fence functions:

    • Block-scope fence
    void __threadfence_block();
    

    __threadfence_block ensures that all writes to shared memory and global memory made by a calling thread before the fence are visible to other threads in the same block after the fence. Recall that memory fences do not perform any thread synchronization, and so it is not necessary for all threads in a block to actually execute this instruction.

    • Grid-scope fence
    void __threadfence();
    

    __threadfence stalls the calling thread until all of its writes to global memory are visible to all threads in the same grid.

    • System-scope fence
    void __threadfence_system();
    

    __threadfence_system stalls the calling thread to ensure all its writes to global memory, page-locked host memory, and the memory of other devices are visible to all threads in all devices and host threads.
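A typical use of __threadfence is the "last block finishes the work" pattern from the CUDA C Programming Guide: each block publishes a partial result, and the fence guarantees the result is visible grid-wide before the block increments a completion counter. A sketch (the kernel and variable names are illustrative; the per-block partial sum is stubbed out):

__device__ unsigned int count = 0;   // how many blocks have published their result

__global__ void sumKernel(volatile float *result)
{
    __shared__ bool isLastBlockDone;
    float partial = (float)blockIdx.x;           // stub for a real per-block partial sum

    if (threadIdx.x == 0) {
        result[blockIdx.x] = partial;            // 1. write the data
        __threadfence();                         // 2. make the write visible to the whole grid
        unsigned int value = atomicInc(&count, gridDim.x);
        isLastBlockDone = (value == gridDim.x - 1);   // 3. only then signal completion
    }
    __syncthreads();                             // broadcast isLastBlockDone within the block

    if (isLastBlockDone && threadIdx.x == 0) {
        float total = 0.0f;
        for (unsigned int i = 0; i < gridDim.x; ++i)
            total += result[i];                  // every partial result is guaranteed visible here
        result[0] = total;
        count = 0;                               // reset the counter for the next launch
    }
}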

5. Accessing Shared Memory Efficiently

  1. Use square shared memory

[Figure: a 32×32 square shared memory tile] As the figure shows, a 32×32 two-dimensional integer array is allocated; note that multidimensional arrays in CUDA C are stored in row-major order.

__shared__ int tile[N][N];

If shared memory is allocated with the code above, then to avoid bank conflicts the threads of a warp should access data in different banks. Within a warp, threads are ordered into lanes by threadIdx.x first, then threadIdx.y, then threadIdx.z (as described in point 3 of juejin.cn/post/731490…). If consecutive threads access consecutive elements of shared memory, bank conflicts are largely avoided, because consecutive words reside in adjacent banks. The following access pattern is therefore preferable:

tile[threadIdx.y][threadIdx.x]

With this indexing, the data accessed by the threads of a warp is guaranteed to sit in different banks, and since bank conflicts only arise within a warp, different warps never conflict with each other.
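A sketch of this pattern, modeled on the square shared memory example in the book (the block is assumed to be 32 × 32 threads):

#define BDIM 32

// Each warp (consecutive threadIdx.x) writes and reads 32 consecutive words,
// which map to 32 different banks, so both the store and the load are
// conflict-free. Swapping the indices (tile[threadIdx.x][threadIdx.y]) would
// make a warp touch one column, i.e. 32 words in the same bank: a 32-way
// conflict, unless the padding from section 3 is applied.
__global__ void setRowReadRow(int *out)
{
    __shared__ int tile[BDIM][BDIM];

    unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;

    tile[threadIdx.y][threadIdx.x] = idx;        // row-wise store: conflict-free
    __syncthreads();
    out[idx] = tile[threadIdx.y][threadIdx.x];   // row-wise load: conflict-free
}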