ref Best Practices Guide :: CUDA Toolkit Documentation (nvidia.com)

1. 支持调优的软硬件版本

从 CUDA 11.0 开始，计算能力 8.0 及更高版本的设备 (GeForce RTX 3070 及以上) 能够影响 L2 缓存中数据的持久性。由于 L2 高速缓存位于片上，因此它可能会提供更高的带宽和更低的全局内存访问延迟。

2. L2缓存在哪里

3. 使用L2缓存的方式

3.1. L2缓存的类型

cudaAccessPropertyStreaming：流式缓存。这些缓存上的数据优先被逐出。
cudaAccessPropertyPersisting：持久缓存。这些缓存上的数据更有可能保留在 L2 缓存中。
cudaAccessPropertyNormal：删除缓存的持久化属性。

3.2. L2持久缓的使用策略

3.2.1. 为什么需要使用策略

L2Cache 的总体大小是固定的。故，当所有 Streaming 需要的持久缓存大小大于 L2Cache 的总体大小时，会出现当前 Streaming 工作时上一个 Streaming 的持久缓存被大量覆盖的情况。此种情况下，持久缓存被频繁覆盖，失去它减少数据在 L2Cache 换入换出的功能。为减少这种 缓存行波动，我们应采用适当策略使所有 Streaming 需要的持久缓存大小等于 L2Cache的总体大小。

3.2.2. 设置策略的方法

通过设置 Streaming 的 hitProp 为 cudaAccessPropertyPersisting 的同时设置访问比例属性 hitRatio ( Hint for cache hit ratio ) 。

cudaStreamAttrValue stream_attribute;                                         // Stream level attributes data structure
stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(ptr); // Global Memory data pointer
stream_attribute.accessPolicyWindow.num_bytes = num_bytes;                    // Number of bytes for persistence access.
                                                                              // (Must be less than cudaDeviceProp::accessPolicyMaxWindowSize)
stream_attribute.accessPolicyWindow.hitRatio  = 0.6;                          // Hint for cache hit ratio
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // Type of access property on cache hit
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // Type of access property on cache miss.

//Set the attributes to a CUDA stream of type cudaStream_t
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);

解释 hitRatio 参数：
例如，如果 L2Cache 预留缓存大小为 16KB，而持久化数据输入 (num_byte) 为 32KB：

如果 hitRatio 为 0.5，则硬件将随机选择 32KB 窗口中的 16KB 指定为保留，并缓存在预留的 L2 缓存区域中。
如果 hitRatio 为 1.0，则硬件将尝试将整个 32KB 窗口缓存在预留的 L2 缓存区域中。由于预留区域小于窗口，因此将逐出缓存行，以将最近使用的 32KB 数据中的 16KB 保留在 L2 缓存的预留部分中。

3.2.3. 为什么设置策略可以减少缓存行波动

例如，让 L2 预留缓存大小为 16KB。两个不同 Streaming 中的两个并发内核（每个流的 num_bytes 为 16KB ，hitRatio 值均为 1.0）在争用共享 L2 资源时可能会逐出彼此的缓存行。但是，如果两者的 hitRatio 值均为 0.5，则它们将不太可能逐出自己或彼此的持久缓存行。

4. 示例代码

以下示例演示如何为持久访问预留 L2 缓存，如何通过 CUDA 流在 CUDA 内核中使用预留的 L2 缓存，然后重置 L2 缓存。详细信息请参考 cuda编程手册

cudaStream_t stream;
cudaStreamCreate(&stream);                                                                  // Create CUDA stream

cudaDeviceProp prop;                                                                        // CUDA device properties variable
cudaGetDeviceProperties( &prop, device_id);                                                 // Query GPU properties
size_t size = min( int(prop.l2CacheSize * 0.75) , prop.persistingL2CacheMaxSize );
cudaDeviceSetLimit( cudaLimitPersistingL2CacheSize, size);                                  // set-aside 3/4 of L2 cache for persisting accesses or the max allowed

size_t window_size = min(prop.accessPolicyMaxWindowSize, num_bytes);                        // Select minimum of user defined num_bytes and max window size.

cudaStreamAttrValue stream_attribute;                                                       // Stream level attributes data structure
stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(data1);               // Global Memory data pointer
stream_attribute.accessPolicyWindow.num_bytes = window_size;                                // Number of bytes for persistence access
stream_attribute.accessPolicyWindow.hitRatio  = 0.6;                                        // Hint for cache hit ratio
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;               // Persistence Property
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;                // Type of access property on cache miss

cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);   // Set the attributes to a CUDA Stream

for(int i = 0; i < 10; i++) {
    cuda_kernelA<<<grid_size,block_size,0,stream>>>(data1);                                 // This data1 is used by a kernel multiple times
}                                                                                           // [data1 + num_bytes) benefits from L2 persistence
cuda_kernelB<<<grid_size,block_size,0,stream>>>(data1);                                     // A different kernel in the same stream can also benefit
                                                                                            // from the persistence of data1

stream_attribute.accessPolicyWindow.num_bytes = 0;                                          // Setting the window size to 0 disable it
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);   // Overwrite the access policy attribute to a CUDA Stream
cudaCtxResetPersistingL2Cache();                                                            // Remove any persistent lines in L2 

cuda_kernelC<<<grid_size,block_size,0,stream>>>(data2);                                     // data2 can now benefit from full L2 in normal mode

4.1.2 【NVIDIA-GPU-CUDA】高速缓存的调优 —— L2Cache