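The shuffle instruction lets a thread in a warp directly read a register of another thread in the same warp, so data can be exchanged without going through shared memory or global memory. As Professional CUDA C Programming puts it: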

Starting with the Kepler family of GPUs (compute capability 3.0 or higher), the shuffle instruction was introduced as a mechanism to allow threads to directly read another thread’s register, as long as both threads are in the same warp.
The shuffle instruction enables threads in a warp to exchange data with each other directly, rather than going through shared or global memory. The shuffle instruction has lower latency than shared memory and does not consume extra memory to perform a data exchange. The shuffle instruction therefore offers an attractive way for applications to rapidly interchange data among threads in a warp.

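The term lane refers to one thread within a warp ("A lane simply refers to a single thread within a warp."). Lane indices range over [0, 31], and every thread in a warp has its own lane index. Because a thread block may contain several warps, two threads in the same block can have the same lane index. In a 1D thread block, the lane index and warp index are computed as follows: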

laneID = threadIdx.x % 32
warpID = threadIdx.x / 32

For 2D thread blocks, you can convert a 2D thread coordinate into a 1D thread index, and apply the preceding formulas to determine the lane and warp indices. Note that the linearization is still row-major: x varies fastest, then y, then z.
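The lane and warp formulas above can be sketched for a 2D block as follows. This is a minimal illustration, not code from the book: the kernel name `laneWarpIds`, the 16x4 block shape, and the use of managed memory are assumptions chosen to keep the example short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread records its lane index and warp index within the block.
__global__ void laneWarpIds(int *lane, int *warp)
{
    // Row-major linearization: x varies fastest, then y, then z.
    int tid = threadIdx.x
            + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;
    lane[tid] = tid % warpSize;  // position inside the warp, 0..31
    warp[tid] = tid / warpSize;  // which warp of the block
}

int main()
{
    const dim3 block(16, 4);     // hypothetical 2D block: 64 threads = 2 warps
    const int n = block.x * block.y;
    int *lane = nullptr, *warp = nullptr;
    cudaMallocManaged(&lane, n * sizeof(int));
    cudaMallocManaged(&warp, n * sizeof(int));

    laneWarpIds<<<1, block>>>(lane, warp);
    cudaDeviceSynchronize();

    // Thread (x=3, y=2) has linear index 2*16 + 3 = 35, i.e. lane 3 of warp 1.
    printf("lane=%d warp=%d\n", lane[35], warp[35]);  // prints "lane=3 warp=1"

    cudaFree(lane);
    cudaFree(warp);
    return 0;
}
```

Two threads such as (x=3, y=0) and (x=3, y=2) end up with the same lane index but different warp indices, which is the situation described above.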

1. The __shfl function

int __shfl(int var, int srcLane, int width=warpSize);

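The function returns the value of var held by the thread that srcLane identifies. The optional width argument splits the warp into segments: width must be a power of 2 in the range [1, 32], and when the function executes the warp is divided into segments of width threads each. Within each segment, the thread selected by srcLane is the one that satisfies: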

(lane index of thread) % width = srcLane

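Note: as stated above, lane indices within a warp range over [0, 31]. For int y = __shfl(x, 3, 16); the threads of a warp are split into two segments of 16 threads each. Since srcLane is 3, the source threads are the ones with lane index 3 and lane index 19. Threads 0 through 15 therefore receive the value of var held by thread 3, and threads 16 through 31 receive the value of var held by thread 19.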
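The width = 16 case can be checked with a small kernel. The sketch below is an assumption-laden illustration: it uses `__shfl_sync`, the variant that CUDA 9 and later provide in place of the legacy `__shfl` (its extra first argument is a mask of participating lanes), and the kernel name `broadcastPerSegment` and the test values `lane * 10` are made up for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void broadcastPerSegment(int *out)
{
    int lane = threadIdx.x % warpSize;
    int x = lane * 10;  // value held in this lane's register

    // Legacy form used in the text: int y = __shfl(x, 3, 16);
    // All 32 lanes participate, hence the full mask.
    int y = __shfl_sync(0xffffffffu, x, 3, 16);

    out[threadIdx.x] = y;
}

int main()
{
    int *out = nullptr;
    cudaMallocManaged(&out, 32 * sizeof(int));

    broadcastPerSegment<<<1, 32>>>(out);  // one block containing one warp
    cudaDeviceSynchronize();

    // Lanes 0..15 read lane 3 (value 30); lanes 16..31 read lane 19 (value 190).
    for (int i = 0; i < 32; ++i) printf("%d ", out[i]);
    printf("\n");

    cudaFree(out);
    return 0;
}
```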

2. The __shfl_up function

int __shfl_up(int var, unsigned int delta, int width=warpSize);

__shfl_up calculates the source lane index by subtracting delta from the caller’s lane index. The value held by the source thread is returned. Hence, this instruction shifts var up the warp by delta lanes. There is no wrap around with __shfl_up, so the lowest delta threads in a warp will be unchanged.
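One common use of an upward shift is an inclusive prefix sum within a warp. The following is a sketch under assumptions, not the book's code: it uses `__shfl_up_sync` (the CUDA 9+ counterpart of `__shfl_up`, with a participation mask as first argument), and the kernel name `warpInclusiveScan` is hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Inclusive prefix sum across the 32 lanes of a single warp.
__global__ void warpInclusiveScan(int *data)
{
    int lane = threadIdx.x % warpSize;
    int v = data[threadIdx.x];

    for (int delta = 1; delta < warpSize; delta <<= 1) {
        // Every lane executes the shuffle; lanes below delta get their own v back.
        int up = __shfl_up_sync(0xffffffffu, v, delta);
        // Only lanes that really have a source lane delta positions below add it.
        if (lane >= delta) v += up;
    }
    data[threadIdx.x] = v;
}

int main()
{
    int *data = nullptr;
    cudaMallocManaged(&data, 32 * sizeof(int));
    for (int i = 0; i < 32; ++i) data[i] = 1;

    warpInclusiveScan<<<1, 32>>>(data);
    cudaDeviceSynchronize();

    // With an all-ones input, lane i ends up holding i + 1 (1, 2, ..., 32).
    for (int i = 0; i < 32; ++i) printf("%d ", data[i]);
    printf("\n");

    cudaFree(data);
    return 0;
}
```

The `lane >= delta` guard is needed precisely because there is no wrap-around: without it, the lowest delta lanes would add their own unchanged value to themselves.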

3. The __shfl_down function

int __shfl_down(int var, unsigned int delta, int width=warpSize);

__shfl_down calculates a source lane index by adding delta to the caller’s lane index. The value held by the source thread is returned. Hence, this instruction shifts var down the warp by delta lanes. There is no wrap around when using __shfl_down, so the upper delta lanes in a warp will remain unchanged.
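A downward shift is often used for a warp-level sum reduction. The sketch below assumes CUDA 9 or later and therefore uses `__shfl_down_sync` instead of the legacy `__shfl_down`; the names `warpReduceSum` and `sumOneWarp` are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Tree reduction: after the loop, lane 0 holds the sum of all 32 lanes.
__device__ int warpReduceSum(int v)
{
    for (int delta = warpSize / 2; delta > 0; delta >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, delta);
    return v;  // only lane 0's result is the complete sum
}

__global__ void sumOneWarp(const int *in, int *out)
{
    int lane = threadIdx.x % warpSize;
    int total = warpReduceSum(in[threadIdx.x]);
    if (lane == 0) *out = total;
}

int main()
{
    int *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, 32 * sizeof(int));
    cudaMallocManaged(&out, sizeof(int));
    for (int i = 0; i < 32; ++i) in[i] = i;  // 0 + 1 + ... + 31 = 496

    sumOneWarp<<<1, 32>>>(in, out);
    cudaDeviceSynchronize();

    printf("sum=%d\n", *out);  // prints "sum=496"

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Because the upper delta lanes receive their own value back, their partial sums are not meaningful; the reduction only guarantees the result in lane 0.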

4. The __shfl_xor function

int __shfl_xor(int var, int laneMask, int width=warpSize);

The intrinsic instruction calculates a source lane index by performing a bitwise XOR of the caller’s lane index with laneMask. The value held by the source thread is returned.
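An XOR shuffle pairs each lane with the lane whose index differs in the bits set in laneMask, which gives a butterfly exchange pattern. As a hedged sketch (assuming CUDA 9 or later, hence `__shfl_xor_sync`; the names `warpAllReduceSum` and `allReduceOneWarp` are hypothetical), a butterfly sum leaves the full total in every lane:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Butterfly reduction: every lane ends up with the sum of all 32 lanes.
__device__ int warpAllReduceSum(int v)
{
    for (int laneMask = warpSize / 2; laneMask > 0; laneMask >>= 1)
        v += __shfl_xor_sync(0xffffffffu, v, laneMask);  // partner = lane ^ laneMask
    return v;
}

__global__ void allReduceOneWarp(int *data)
{
    data[threadIdx.x] = warpAllReduceSum(data[threadIdx.x]);
}

int main()
{
    int *data = nullptr;
    cudaMallocManaged(&data, 32 * sizeof(int));
    for (int i = 0; i < 32; ++i) data[i] = i + 1;  // 1 + 2 + ... + 32 = 528

    allReduceOneWarp<<<1, 32>>>(data);
    cudaDeviceSynchronize();

    // Every lane holds 528.
    printf("lane0=%d lane31=%d\n", data[0], data[31]);  // prints "lane0=528 lane31=528"

    cudaFree(data);
    return 0;
}
```

Compared with a reduction built on a downward shift, the XOR pattern needs no separate broadcast step when all lanes need the result.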
