cuda全局内存高效访问——对齐访问和合并内存访问高效的全局内存访问需要保证一点：相邻的线程访问相邻的数据。（This

高效的全局内存访问需要保证两点：

对齐访问和相邻的线程访问相邻的数据。（This is oversimple, but the correct way to do it is just have consecutive threads access consecutive memory addresses.）

GPU在访问全局内存的时候，通常一次性传输32 bytes或者128 bytes，如果L1和L2 cache都被使用了，那么一次性传输128 bytes，如果只有L2 cache被使用，那么一次性传输32 bytes，L1和L2 cache是否被使用和GPU的架构等都有关系，L1 cache行和设备内存中一个128字节对齐的内存区域对应。对齐访问(Aligned memory accesss)要求访问的设备内存的首地址是缓存大小的偶数倍（Aligned memory accesses occur when the first address of a device memory transaction is an even multiple of the cache granularity being used to service the transaction (either 32 bytes for L2 cache or 128 bytes for L1 cache). Performing a misaligned load will cause wasted bandwidth.）。 注：此处有问题，详情可见：stackoverflow.com/questions/7…

正确的应该是这样的：

Global memory resides in device memory and device memory is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned: Only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e., whose first address is a multiple of their size) can be read or written by memory transactions.

合并内存访问则要求warp内的线程访问内存的连续块（Coalesced memory accesses occur when all 32 threads in a warp access a contiguous chunk of memory.）。

如下图，就是一个既满足对齐访问又满足合并访问的示例：

下图则是不的对齐+不合并访问的内存访问示例：In this case, there may be as many as three 128-byte memory transactions to read the data from device memory: one starting at offset 0 to include the data being read below the contiguous region, one at offset 256 to read the data being read above the contiguous region, and one at offset 128 that fetches the bulk of the data. Note that most of the bytes fetched by the lower and upper memory transactions will not be used, leading to wasted bandwidth.

讲解一：

有线程0，1，2，3，有一个二维数组

0 1 2 3
4 5 6 7
8 9 a b

该数组行优先，所以内存模式为0 1 2 3 4 5 6 7 8 9 a b,这样就有如下两种访问模式：

thread 0:  0, 1, 2
thread 1:  3, 4, 5
thread 2:  6, 7, 8
thread 3:  9, a, b

或者

thread 0:  0, 4, 8
thread 1:  1, 5, 9
thread 2:  2, 6, a
thread 3:  3, 7, b

显然，第二种更符合合并访问的模式，因此效率更高。

来源：stackoverflow.com/questions/5…

讲解二：

①一个warp里面的所有线程在任何时刻执行相同的指令，注意一个warp里面的线程的线程id是连续递增的。

②如果相邻的数据被同时访问，那么就会引发DRAM bursts，此时相邻的数据会被以极高的速度传输，所谓的合并内存访问，就是引发DRAM bursts

③由于一个warp里面的所有线程执行相同的指令,因此任何时刻他们都在相同的循环中（也就是说如果线程内有for、whild循环等，那么他们在任意时刻的循环次数都是一样的，也就是都是第X次循环）

上图为两种内存访问方式： M：也就是图(A)的访问方式，这种访问方式中，对于每个线程，第i次循环会访问某一行的第i个元素。 N：也就是图(B)的访问方式，在这种访问方式中，对于每个线程，第i次循环会访问某一列的第i个元素，其详解图如下：

由于相同的循环，也就是相同的时间访问了相邻的数据，因此其速度会很快，符合合并内存访问模式。

矩阵M的访问模式则如下图，可见其在循环0的时候，T0、T1、T2和T3分别访问M[0]、M[4]、M[8]和M[12]，不符合相邻线程访问相邻内存的模式。

来源：

nichijou.co/cuda5-coale…

zhuanlan.zhihu.com/p/300785893

讲解三：二维的合并内存访问模式

如《CUDA C权威编程指南》所述：

从逻辑角度来看，线程块是线程的集合，它们可以被组织为一维、二维或三维布局。

从硬件角度来看，线程块是一维线程束的集合。在线程块中线程被组织成一维布局，每32个连续线程组成一个线程束。

在一个块中，每个线程都有一个唯一的ID。用x维度作为最内层的维度，y维度作为第二个维度，z作为最外层的维度，则二维或三维线程块的逻辑布局可以转化为一维物理布局。例如，对于一个给定的二维线程块，在一个块中每个线程的独特标识符都可以用内置变量threadIdx和blockDim来计算：对于一个三维线程块，计算如下：

而一个warp由全局ID连续的线程组成，因此二维情况下其访问模式为：

   tid_in_block = threadIdx.x + threadIdx.y * blockDim.x;     //线程在block中的线程全局ID
   bid_in_grid = blockIdx.x + blockIdx.y * gridDim.x;         //现在所在的block的block全局ID
   threads_per_block = blockDim.x * blockDim.y;               //一个block中的线程数

   tid_in_grid = tid_in_block + thread_per_block * bid_in_grid; //二维情形下的线程全局ID

   global_memory[tid_in_grid] = ...;

来源:stackoverflow.com/questions/1…