1 llm.c
llm.c is Karpathy's from-scratch large-model project: training LLMs in pure C/CUDA.
github: github.com/karpathy/ll…
llm.c project overview: LLM training in simple, pure C/CUDA, with no need to install the 245MB of PyTorch or the 107MB of cPython. The project's main goal is to reproduce pretraining of GPT-2 and the GPT-3 mini series. train_gpt2.py is the PyTorch reference implementation; the bleeding-edge mainline is train_gpt2.cu, and train_gpt2.c is a simple CPU fp32 implementation (about 1000 lines of clean code). Karpathy wants this repo to maintain only the C and CUDA code; wrappers in other languages live in separate repos (see "notable forks").
Operator tutorial: the layernorm tutorial (see doc/layernorm/layernorm.md) is a step-by-step guide to implementing the LayerNorm layer of the GPT-2 model.
About the repo
- 1 llm.c is meant for teaching. dev/cuda is the kernel library: hand-written operator implementations of every layer, documented in detail, progressing from simple kernels to complex, optimized ones.
- 2 llm.c is highly efficient. llm.c reproduces GPT-2 (1.6B) training, so it uses accelerated kernels wherever possible, including cuBLAS, cuBLASLt, CUTLASS, cuDNN, etc. However, users are free to either rely on the optimized kernels or "drop in" any hand-written kernel they like, which makes the repo well suited for benchmarking hand-written kernels, e.g. measuring that one reaches 80% of cuBLAS.
- 3 llm.c is simple enough. The llm.c root directory stays small and readable. For example, cuBLAS is used as the default matmul implementation in the training loop; this keeps the mainline code concise and efficient, needing only a single easy-to-read line, and cuBLAS is a common dependency. Hand-written matmul implementations live in dev/cuda.
2 LayerNorm operator: definition and implementation
LayerNorm (layer normalization) normalizes the input along a given axis. Operator formula:
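This is the standard LayerNorm definition (it matches the Python reference below):

$$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}}\cdot \gamma + \beta$$

where E[x] and Var[x] are the mean and variance over the last (C) axis, γ and β are the learnable scale and shift (w and b in the code below), and ε = 1e-5 guards against division by zero.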
Layer Normalization vs Batch Normalization
Layer Norm standardizes each sample on its own rather than standardizing one feature channel across all samples; this preserves the comparability between different tokens within the same sentence, which makes it better suited to NLP tasks.
Batch Norm standardizes the same channel across different feature maps, i.e. across the batch. BN is very effective in vision, because it matches the intrinsic structure of CV feature maps.
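In [B, T, C] notation (and [N, C, H, W] for images), the difference is simply which axes the statistics are reduced over; written out (my notation, for illustration):

$$\text{LayerNorm: } \mu_{b,t} = \frac{1}{C}\sum_{c} x_{b,t,c}, \qquad \text{BatchNorm: } \mu_{c} = \frac{1}{N\,H\,W}\sum_{n,h,w} x_{n,c,h,w}$$

The variances are computed over the same axes as the respective means.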
The author, AK (Andrej Karpathy), gives a very complete walkthrough of the LayerNorm operator, with Python, C, and CUDA implementations.
Algorithm implementation (Python version)
import torch

def forward(x, w, b):
    B, T, C = x.size()
    eps = 1e-5
    mean = x.sum(-1, keepdim=True) / C # B,T,1
    xshift = x - mean # B,T,C
    var = (xshift**2).sum(-1, keepdim=True) / C # B,T,1
    rstd = (var + eps) ** -0.5 # B,T,1
    norm = xshift * rstd # B,T,C
    out = norm * w + b # B,T,C
    cache = (x, w, mean, rstd)
    return out, cache

def backward(dout, cache):
    x, w, mean, rstd = cache
    # recompute the norm (save memory at the cost of compute)
    norm = (x - mean) * rstd
    # gradients for weights, bias
    db = dout.sum((0, 1))
    dw = (dout * norm).sum((0, 1))
    # gradients for input
    dnorm = dout * w
    dx = dnorm - dnorm.mean(-1, keepdim=True) - norm * (dnorm * norm).mean(-1, keepdim=True)
    dx *= rstd
    return dx, dw, db
The Transformer input activation has shape [B, T, C]: B is the batch size, T is time (sequence length), and C is the number of channels.
Algorithm implementation (C version)
The C version is a direct C implementation of the Python algorithm. Note that the tensor [b,t,c] is laid out in memory as a flat 1-D array, so the offset of element [b,t,c] is b*T*C + t*C + c.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
void layernorm_forward(float* out, float* mean, float* rstd,
float* inp, float* weight, float* bias,
int B, int T, int C) {
float eps = 1e-5f;
for (int b = 0; b < B; b++) {
for (int t = 0; t < T; t++) {
// seek to the input position inp[b,t,:]
float* x = inp + b * T * C + t * C;
// calculate the mean
float m = 0.0f;
for (int i = 0; i < C; i++) {
m += x[i];
}
m = m/C;
// calculate the variance (without any bias correction)
float v = 0.0f;
for (int i = 0; i < C; i++) {
float xshift = x[i] - m;
v += xshift * xshift;
}
v = v/C;
// calculate the rstd
float s = 1.0f / sqrtf(v + eps);
// seek to the output position in out[b,t,:]
float* out_bt = out + b * T * C + t * C;
for (int i = 0; i < C; i++) {
float n = (s * (x[i] - m)); // normalized output
float o = n * weight[i] + bias[i]; // scale and shift it
out_bt[i] = o; // write
}
// cache the mean and rstd for the backward pass later
mean[b * T + t] = m;
rstd[b * T + t] = s;
}
}
}
void layernorm_backward(float* dinp, float* dweight, float* dbias,
float* dout, float* inp, float* weight, float* mean, float* rstd,
int B, int T, int C) {
for (int b = 0; b < B; b++) {
for (int t = 0; t < T; t++) {
float* dout_bt = dout + b * T * C + t * C;
float* inp_bt = inp + b * T * C + t * C;
float* dinp_bt = dinp + b * T * C + t * C;
float mean_bt = mean[b * T + t];
float rstd_bt = rstd[b * T + t];
// first: two reduce operations
float dnorm_mean = 0.0f;
float dnorm_norm_mean = 0.0f;
for (int i = 0; i < C; i++) {
float norm_bti = (inp_bt[i] - mean_bt) * rstd_bt;
float dnorm_i = weight[i] * dout_bt[i];
dnorm_mean += dnorm_i;
dnorm_norm_mean += dnorm_i * norm_bti;
}
dnorm_mean = dnorm_mean / C;
dnorm_norm_mean = dnorm_norm_mean / C;
// now iterate again and accumulate all the gradients
for (int i = 0; i < C; i++) {
float norm_bti = (inp_bt[i] - mean_bt) * rstd_bt;
float dnorm_i = weight[i] * dout_bt[i];
// gradient contribution to bias
dbias[i] += dout_bt[i];
// gradient contribution to weight
dweight[i] += norm_bti * dout_bt[i];
// gradient contribution to input
float dval = 0.0f;
dval += dnorm_i; // term 1
dval -= dnorm_mean; // term 2
dval -= norm_bti * dnorm_norm_mean; // term 3
dval *= rstd_bt; // final scale
dinp_bt[i] += dval;
}
}
}
}
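For reference, here is a minimal driver (my own sketch, not part of llm.c) showing how the two functions fit together; the gradient buffers must be zero-initialized because the backward pass accumulates with +=:

int main(void) {
    int B = 2, T = 3, C = 4;
    // activations and cached statistics
    float* inp    = (float*)malloc(B * T * C * sizeof(float));
    float* out    = (float*)malloc(B * T * C * sizeof(float));
    float* mean   = (float*)malloc(B * T * sizeof(float));
    float* rstd   = (float*)malloc(B * T * sizeof(float));
    float* weight = (float*)malloc(C * sizeof(float));
    float* bias   = (float*)malloc(C * sizeof(float));
    float* dout   = (float*)malloc(B * T * C * sizeof(float));
    // gradient buffers are zero-initialized (calloc) because backward uses +=
    float* dinp    = (float*)calloc(B * T * C, sizeof(float));
    float* dweight = (float*)calloc(C, sizeof(float));
    float* dbias   = (float*)calloc(C, sizeof(float));
    // fill with some data: random input, all-ones upstream gradient
    for (int i = 0; i < B * T * C; i++) { inp[i] = (float)rand() / RAND_MAX; dout[i] = 1.0f; }
    for (int i = 0; i < C; i++) { weight[i] = 1.0f; bias[i] = 0.0f; }
    layernorm_forward(out, mean, rstd, inp, weight, bias, B, T, C);
    layernorm_backward(dinp, dweight, dbias, dout, inp, weight, mean, rstd, B, T, C);
    printf("out[0] = %f, dinp[0] = %f\n", out[0], dinp[0]);
    // (free() calls omitted for brevity)
    return 0;
}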
3 CUDA optimization of the LayerNorm forward pass
CUDA v1
The LayerNorm computation itself runs along the C dimension; CUDA parallelizes over the [B, T] dimensions, i.e. the kernel flattens N = B*T and assigns one thread per (b, t) position.
void layernorm_forward1(float* out, float* mean, float* rstd,
const float* inp, const float* weight, const float* bias,
int B, int T, int C,
const int block_size) {
const int N = B * T;
const int grid_size = ceil_div(N, block_size); // 向上取整
layernorm_forward_kernel1<<<grid_size, block_size>>>(out, mean, rstd, inp, weight, bias, N, C);
cudaCheck(cudaGetLastError());
}
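The launcher refers to layernorm_forward_kernel1 from dev/cuda. A minimal sketch of the idea (my paraphrase, not the verbatim kernel): one thread owns one (b,t) row and simply repeats the loop of the C reference above.

__global__ void layernorm_forward_kernel1(float* out, float* mean, float* rstd,
                                          const float* inp, const float* weight, const float* bias,
                                          int N, int C) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // one thread per (b,t) row
    if (idx >= N) { return; }
    const float eps = 1e-5f;
    const float* x = inp + idx * C;
    // mean over the row
    float m = 0.0f;
    for (int i = 0; i < C; i++) { m += x[i]; }
    m = m / C;
    // variance (no bias correction)
    float v = 0.0f;
    for (int i = 0; i < C; i++) {
        float xshift = x[i] - m;
        v += xshift * xshift;
    }
    v = v / C;
    // reciprocal standard deviation
    float s = rsqrtf(v + eps);
    // normalize, scale and shift
    float* out_row = out + idx * C;
    for (int i = 0; i < C; i++) {
        out_row[i] = s * (x[i] - m) * weight[i] + bias[i];
    }
    // cache mean and rstd for the backward pass
    mean[idx] = m;
    rstd[idx] = s;
}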
block_size is typically set to 128 or 256.
The choice of block_size is subject to the following constraints:
1 blockDim.x and blockDim.y are capped at 1024, and blockDim.z at 64;
2 it is normally a multiple of the warp size (32 threads);
3 to make full use of the GPU, i.e. to reach high occupancy, block_size should be at least (max threads per SM / max resident blocks per SM); for V100 and A100 this ratio is 2048/32 = 64;
4 block_size should be a divisor of the maximum number of threads per SM;
5 register and shared memory usage must stay within the hardware limits, i.e. the registers and shared memory used per thread must be less than the per-SM limit divided by block_size.
CUDA v2
Computing the mean and standard deviation is a reduction: first compute the mean and variance with dedicated kernels, then compute the normalization itself (flattened over N = B*T*C).
void layernorm_forward2(float* out, float* mean, float* rstd,
const float* inp, const float* weight, const float* bias,
int B, int T, int C,
const int block_size) {
int N = B * T;
// in mean and rstd, threads cooperate within blocks via reductions
mean_kernel<<<N, block_size, block_size * sizeof(float)>>>(mean, inp, N, C, block_size);
cudaCheck(cudaGetLastError());
rstd_kernel<<<N, block_size, block_size * sizeof(float)>>>(rstd, inp, mean, N, C, block_size);
cudaCheck(cudaGetLastError());
// in the normalization, everything just gets flattened out
const int block_size2 = 256;
const int grid_size = ceil_div(B * T * C, block_size2);
normalization_kernel<<<grid_size, block_size2>>>(out, inp, mean, rstd, weight, bias, B, T, C);
cudaCheck(cudaGetLastError());
}
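The bodies of mean_kernel, rstd_kernel and normalization_kernel live in dev/cuda. A minimal sketch of the block-level reduction in mean_kernel (my paraphrase, assuming block_size is a power of two; rstd_kernel follows the same pattern over the squared deviations):

__global__ void mean_kernel(float* mean, const float* inp, int N, int C, int block_size) {
    extern __shared__ float shared[]; // block_size floats, passed at launch
    int idx = blockIdx.x;             // one block per (b,t) row
    int tid = threadIdx.x;
    const float* x = inp + idx * C;
    // each thread partially sums a strided slice of the row
    float sum = 0.0f;
    for (int i = tid; i < C; i += block_size) {
        sum += x[i];
    }
    shared[tid] = sum;
    __syncthreads();
    // tree reduction in shared memory (assumes block_size is a power of two)
    for (int stride = block_size / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            shared[tid] += shared[tid + stride];
        }
        __syncthreads();
    }
    if (tid == 0) {
        mean[idx] = shared[0] / C;
    }
}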
CUDA v3
Use cooperative groups (协作组) to perform the per-row reductions at warp level.
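As a sketch of the idea (assuming CUDA 11+ for the cooperative_groups reduce header, and a launch with grid_size = ceil_div(N * 32, block_size) so that there is one warp per (b,t) row; the exact kernel lives in dev/cuda):

#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

__global__ void layernorm_forward_kernel3(float* out, float* mean, float* rstd,
                                          const float* inp, const float* weight, const float* bias,
                                          int N, int C) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);
    // one warp (32-thread tile) per (b,t) row
    int idx = blockIdx.x * warp.meta_group_size() + warp.meta_group_rank();
    if (idx >= N) { return; }
    const float* x = inp + idx * C;
    // mean: each lane sums a strided slice of the row, then a warp-level reduce
    float sum = 0.0f;
    for (int i = warp.thread_rank(); i < C; i += warp.size()) {
        sum += x[i];
    }
    float m = cg::reduce(warp, sum, cg::plus<float>{}) / C;
    // variance with the same pattern
    float v = 0.0f;
    for (int i = warp.thread_rank(); i < C; i += warp.size()) {
        float xshift = x[i] - m;
        v += xshift * xshift;
    }
    v = cg::reduce(warp, v, cg::plus<float>{}) / C;
    float s = rsqrtf(v + 1e-5f);
    // normalize, scale and shift; lanes stride over C
    float* out_row = out + idx * C;
    for (int i = warp.thread_rank(); i < C; i += warp.size()) {
        out_row[i] = s * (x[i] - m) * weight[i] + bias[i];
    }
    // lane 0 caches mean and rstd for the backward pass
    if (warp.thread_rank() == 0) {
        mean[idx] = m;
        rstd[idx] = s;
    }
}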
CUDA v4
Simplify the variance computation: the variance can be obtained in the same pass that computes the mean, via the identity
var(x) == mean(x**2) - mean(x)**2
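A sketch of how this collapses the two loops into one (a hypothetical helper, not the actual v4 kernel): accumulate the sum and the sum of squares together, then derive the mean and rstd.

__device__ __forceinline__ void mean_rstd_onepass(const float* x, int C,
                                                  float* m_out, float* s_out) {
    float sum = 0.0f, sum2 = 0.0f;
    for (int i = 0; i < C; i++) {
        sum  += x[i];
        sum2 += x[i] * x[i];
    }
    float m = sum / C;              // mean(x)
    float v = sum2 / C - m * m;     // var(x) = mean(x^2) - mean(x)^2
    *m_out = m;
    *s_out = rsqrtf(v + 1e-5f);     // rstd
}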
Reference: PyTorch operator definition pytorch.org/docs/stable…
LayerNorm computation walkthrough blog.csdn.net/weixin_3922…
Understanding LayerNorm http://8.130.79.177/layer-normalization/
《动手自制大模型推理框架》 (a hands-on course on building an LLM inference framework from scratch): all CUDA operators written by hand; the course framework supports LLaMA 2 and 3.x
https://github.com/zjhellofss/KuiperInfer/blob/main/README.md