1 llm.c
llm.c is Karpathy's from-scratch large-model project: training LLMs in pure C/CUDA.
github: github.com/karpathy/ll…
llm.c project overview: LLM training in simple, pure C/CUDA, with no need to install the 245MB of PyTorch or the 107MB of cPython. The project's main goal is to reproduce pretraining of GPT-2 and the GPT-3 mini series. train_gpt2.py is the PyTorch reference implementation; the bleeding-edge mainline is train_gpt2.cu, and train_gpt2.c is a simple CPU fp32 implementation (about 1000 lines of clean code). Karpathy wants this repo to maintain only the C and CUDA code; wrappers in other languages live in separate repos (see "notable forks").
Operator tutorial: the layernorm tutorial (see doc/layernorm/layernorm.md) is a step-by-step guide to implementing the LayerNorm layer of the GPT-2 model.
About the repo
- 1 llm.c is meant for teaching. dev/cuda is the kernel library: hand-written operator implementations of every layer, documented in detail, progressing from simple kernels to complex, optimized ones.
- 2 llm.c is highly efficient. llm.c reproduces GPT-2 (1.6B) training, so it uses accelerated kernels wherever possible, including cuBLAS, cuBLASLt, CUTLASS, cuDNN, etc. However, users are free to either rely on the optimized kernels or "drop in" any hand-written kernel they like, which makes the repo well suited for benchmarking hand-written kernels, e.g. measuring that one reaches 80% of cuBLAS.
- 3 llm.c is simple enough. The llm.c root directory stays small and readable. For example, cuBLAS is used as the default matmul implementation in the training loop; this keeps the mainline code concise and efficient, needing only a single easy-to-read line, and cuBLAS is a common dependency. Hand-written matmul implementations live in dev/cuda.
2 LayerNorm operator: definition and implementation
LayerNorm (layer normalization) normalizes the input along a given axis. Operator formula:
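This is the standard LayerNorm definition (it matches the Python reference below):

$$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}}\cdot \gamma + \beta$$

where E[x] and Var[x] are the mean and variance over the last (C) axis, γ and β are the learnable scale and shift (w and b in the code below), and ε = 1e-5 guards against division by zero.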
Layer Normalization vs Batch Normalization
Layer Norm standardizes each sample on its own rather than standardizing one feature channel across all samples; this preserves the comparability between different tokens within the same sentence, which makes it better suited to NLP tasks.
Batch Norm standardizes the same channel across different feature maps, i.e. across the batch. BN is very effective in vision, because it matches the intrinsic structure of CV feature maps.
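In [B, T, C] notation (and [N, C, H, W] for images), the difference is simply which axes the statistics are reduced over; written out (my notation, for illustration):

$$\text{LayerNorm: } \mu_{b,t} = \frac{1}{C}\sum_{c} x_{b,t,c}, \qquad \text{BatchNorm: } \mu_{c} = \frac{1}{N\,H\,W}\sum_{n,h,w} x_{n,c,h,w}$$

The variances are computed over the same axes as the respective means.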
The author, AK (Andrej Karpathy), gives a very complete walkthrough of the LayerNorm operator, with Python, C, and CUDA implementations.
Algorithm implementation (Python version)
import torch

def forward(x, w, b):
    B, T, C = x.size()
    eps = 1e-5
    mean = x.sum(-1, keepdim=True) / C # B,T,1
    xshift = x - mean # B,T,C
    var = (xshift**2).sum(-1, keepdim=True) / C # B,T,1
    rstd = (var + eps) ** -0.5 # B,T,1
    norm = xshift * rstd # B,T,C
    out = norm * w + b # B,T,C
    cache = (x, w, mean, rstd)
    return out, cache

def backward(dout, cache):
    x, w, mean, rstd = cache
    # recompute the norm (save memory at the cost of compute)
    norm = (x - mean) * rstd
    # gradients for weights, bias
    db = dout.sum((0, 1))
    dw = (dout * norm).sum((0, 1))
    # gradients for input
    dnorm = dout * w
    dx = dnorm - dnorm.mean(-1, keepdim=True) - norm * (dnorm * norm).mean(-1, keepdim=True)
    dx *= rstd
    return dx, dw, db
The Transformer input activation has shape [B, T, C]: B is the batch size, T is time (sequence length), and C is the number of channels.
Algorithm implementation (C version)
The C version is a direct C implementation of the Python algorithm. Note that the tensor [b,t,c] is laid out in memory as a flat 1-D array, so the offset of element [b,t,c] is b*T*C + t*C + c.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
void layernorm_forward(float* out, float* mean, float* rstd,
float* inp, float* weight, float* bias,
int B, int T, int C) {
float eps = 1e-5f;
for (int b = 0; b < B; b++) {
for (int t = 0; t < T; t++) {
// seek to the input position inp[b,t,:]
float* x = inp + b * T * C + t * C;
// calculate the mean
float m = 0.0f;
for (int i = 0; i < C; i++) {
m += x[i];
}
m = m/C;
// calculate the variance (without any bias correction)
float v = 0.0f;
for (int i = 0; i < C; i++) {
float xshift = x[i] - m;
v += xshift * xshift;
}
v = v/C;
// calculate the rstd
float s = 1.0f / sqrtf(v + eps);
// seek to the output position in out[b,t,:]
float* out_bt = out + b * T * C + t * C;
for (int i = 0; i < C; i++) {
float n = (s * (x[i] - m)); // normalized output
float o = n * weight[i] + bias[i]; // scale and shift it
out_bt[i] = o; // write
}
// cache the mean and rstd for the backward pass later
mean[b * T + t] = m;
rstd[b * T + t] = s;
}
}
}
void layernorm_backward(float* dinp, float* dweight, float* dbias,
float* dout, float* inp, float* weight, float* mean, float* rstd,
int B, int T, int C) {
for (int b = 0; b < B; b++) {
for (int t = 0; t < T; t++) {
float* dout_bt = dout + b * T * C + t * C;
float* inp_bt = inp + b * T * C + t * C;
float* dinp_bt = dinp + b * T * C + t * C;
float mean_bt = mean[b * T + t];
float rstd_bt = rstd[b * T + t];
// first: two reduce operations
float dnorm_mean = 0.0f;
float dnorm_norm_mean = 0.0f;
for (int i = 0; i < C; i++) {
float norm_bti = (inp_bt[i] - mean_bt) * rstd_bt;
float dnorm_i = weight[i] * dout_bt[i];
dnorm_mean += dnorm_i;
dnorm_norm_mean += dnorm_i * norm_bti;
}
dnorm_mean = dnorm_mean / C;
dnorm_norm_mean = dnorm_norm_mean / C;
// now iterate again and accumulate all the gradients
for (int i = 0; i < C; i++) {
float norm_bti = (inp_bt[i] - mean_bt) * rstd_bt;
float dnorm_i = weight[i] * dout_bt[i];
// gradient contribution to bias
dbias[i] += dout_bt[i];
// gradient contribution to weight
dweight[i] += norm_bti * dout_bt[i];
// gradient contribution to input
float dval = 0.0f;
dval += dnorm_i; // term 1
dval -= dnorm_mean; // term 2
dval -= norm_bti * dnorm_norm_mean; // term 3
dval *= rstd_bt; // final scale
dinp_bt[i] += dval;
}
}
}
}
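For reference, here is a minimal driver (my own sketch, not part of llm.c) showing how the two functions fit together; the gradient buffers must be zero-initialized because the backward pass accumulates with +=:

int main(void) {
    int B = 2, T = 3, C = 4;
    // activations and cached statistics
    float* inp    = (float*)malloc(B * T * C * sizeof(float));
    float* out    = (float*)malloc(B * T * C * sizeof(float));
    float* mean   = (float*)malloc(B * T * sizeof(float));
    float* rstd   = (float*)malloc(B * T * sizeof(float));
    float* weight = (float*)malloc(C * sizeof(float));
    float* bias   = (float*)malloc(C * sizeof(float));
    float* dout   = (float*)malloc(B * T * C * sizeof(float));
    // gradient buffers are zero-initialized (calloc) because backward uses +=
    float* dinp    = (float*)calloc(B * T * C, sizeof(float));
    float* dweight = (float*)calloc(C, sizeof(float));
    float* dbias   = (float*)calloc(C, sizeof(float));
    // fill with some data: random input, all-ones upstream gradient
    for (int i = 0; i < B * T * C; i++) { inp[i] = (float)rand() / RAND_MAX; dout[i] = 1.0f; }
    for (int i = 0; i < C; i++) { weight[i] = 1.0f; bias[i] = 0.0f; }
    layernorm_forward(out, mean, rstd, inp, weight, bias, B, T, C);
    layernorm_backward(dinp, dweight, dbias, dout, inp, weight, mean, rstd, B, T, C);
    printf("out[0] = %f, dinp[0] = %f\n", out[0], dinp[0]);
    // (free() calls omitted for brevity)
    return 0;
}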
3 CUDA optimization of the LayerNorm forward pass
CUDA v1
The LayerNorm computation itself runs along the C dimension; CUDA parallelizes over the [B, T] dimensions, i.e. the kernel flattens N = B*T and assigns one thread per (b, t) position.
void layernorm_forward1(float* out, float* mean, float* rstd,
const float* inp, const float* weight, const float* bias,
int B, int T, int C,
const int block_size) {
const int N = B * T;
const int grid_size = ceil_div(N, block_size); // 向上取整
layernorm_forward_kernel1<<<grid_size, block_size>>>(out, mean, rstd, inp, weight, bias, N, C);
cudaCheck(cudaGetLastError());
}
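The launcher refers to layernorm_forward_kernel1 from dev/cuda. A minimal sketch of the idea (my paraphrase, not the verbatim kernel): one thread owns one (b,t) row and simply repeats the loop of the C reference above.

__global__ void layernorm_forward_kernel1(float* out, float* mean, float* rstd,
                                          const float* inp, const float* weight, const float* bias,
                                          int N, int C) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // one thread per (b,t) row
    if (idx >= N) { return; }
    const float eps = 1e-5f;
    const float* x = inp + idx * C;
    // mean over the row
    float m = 0.0f;
    for (int i = 0; i < C; i++) { m += x[i]; }
    m = m / C;
    // variance (no bias correction)
    float v = 0.0f;
    for (int i = 0; i < C; i++) {
        float xshift = x[i] - m;
        v += xshift * xshift;
    }
    v = v / C;
    // reciprocal standard deviation
    float s = rsqrtf(v + eps);
    // normalize, scale and shift
    float* out_row = out + idx * C;
    for (int i = 0; i < C; i++) {
        out_row[i] = s * (x[i] - m) * weight[i] + bias[i];
    }
    // cache mean and rstd for the backward pass
    mean[idx] = m;
    rstd[idx] = s;
}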
block_size is typically set to 128 or 256.
The choice of block_size is subject to the following constraints:
1 blockDim.x and blockDim.y are capped at 1024, and blockDim.z at 64;
2 it is normally a multiple of the warp size (32 threads);
3 to make full use of the GPU, i.e. to reach high occupancy, block_size should be at least (max threads per SM / max resident blocks per SM); for V100 and A100 this ratio is 2048/32 = 64;
4 block_size should be a divisor of the maximum number of threads per SM;
5 register and shared memory usage must stay within the hardware limits, i.e. the registers and shared memory used per thread must be less than the per-SM limit divided by block_size.
CUDA v2
Computing the mean and standard deviation is a reduction: first compute the mean and variance with dedicated kernels, then compute the normalization itself (flattened over N = B*T*C).
void layernorm_forward2(float* out, float* mean, float* rstd,
const float* inp, const float* weight, const float* bias,
int B, int T, int C,
const int block_size) {
int N = B * T;
// in mean and rstd, threads cooperate within blocks via reductions
mean_kernel<<<N, block_size, block_size * sizeof(float)>>>(mean, inp, N, C, block_size);
cudaCheck(cudaGetLastError());
rstd_kernel<<<N, block_size, block_size * sizeof(float)>>>(rstd, inp, mean, N, C, block_size);
cudaCheck(cudaGetLastError());
// in the normalization, everything just gets flattened out
const int block_size2 = 256;
const int grid_size = ceil_div(B * T * C, block_size2);
normalization_kernel<<<grid_size, block_size2>>>(out, inp, mean, rstd, weight, bias, B, T, C);
cudaCheck(cudaGetLastError());
}
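The bodies of mean_kernel, rstd_kernel and normalization_kernel live in dev/cuda. A minimal sketch of the block-level reduction in mean_kernel (my paraphrase, assuming block_size is a power of two; rstd_kernel follows the same pattern over the squared deviations):

__global__ void mean_kernel(float* mean, const float* inp, int N, int C, int block_size) {
    extern __shared__ float shared[]; // block_size floats, passed at launch
    int idx = blockIdx.x;             // one block per (b,t) row
    int tid = threadIdx.x;
    const float* x = inp + idx * C;
    // each thread partially sums a strided slice of the row
    float sum = 0.0f;
    for (int i = tid; i < C; i += block_size) {
        sum += x[i];
    }
    shared[tid] = sum;
    __syncthreads();
    // tree reduction in shared memory (assumes block_size is a power of two)
    for (int stride = block_size / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            shared[tid] += shared[tid + stride];
        }
        __syncthreads();
    }
    if (tid == 0) {
        mean[idx] = shared[0] / C;
    }
}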
CUDA v3
Use cooperative groups (协作组) to perform the per-row reductions at warp level.
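As a sketch of the idea (assuming CUDA 11+ for the cooperative_groups reduce header, and a launch with grid_size = ceil_div(N * 32, block_size) so that there is one warp per (b,t) row; the exact kernel lives in dev/cuda):

#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

__global__ void layernorm_forward_kernel3(float* out, float* mean, float* rstd,
                                          const float* inp, const float* weight, const float* bias,
                                          int N, int C) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);
    // one warp (32-thread tile) per (b,t) row
    int idx = blockIdx.x * warp.meta_group_size() + warp.meta_group_rank();
    if (idx >= N) { return; }
    const float* x = inp + idx * C;
    // mean: each lane sums a strided slice of the row, then a warp-level reduce
    float sum = 0.0f;
    for (int i = warp.thread_rank(); i < C; i += warp.size()) {
        sum += x[i];
    }
    float m = cg::reduce(warp, sum, cg::plus<float>{}) / C;
    // variance with the same pattern
    float v = 0.0f;
    for (int i = warp.thread_rank(); i < C; i += warp.size()) {
        float xshift = x[i] - m;
        v += xshift * xshift;
    }
    v = cg::reduce(warp, v, cg::plus<float>{}) / C;
    float s = rsqrtf(v + 1e-5f);
    // normalize, scale and shift; lanes stride over C
    float* out_row = out + idx * C;
    for (int i = warp.thread_rank(); i < C; i += warp.size()) {
        out_row[i] = s * (x[i] - m) * weight[i] + bias[i];
    }
    // lane 0 caches mean and rstd for the backward pass
    if (warp.thread_rank() == 0) {
        mean[idx] = m;
        rstd[idx] = s;
    }
}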
CUDA v4
Simplify the variance computation: the variance can be obtained in the same pass that computes the mean, via the identity
var(x) == mean(x**2) - mean(x)**2
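A sketch of how this collapses the two loops into one (a hypothetical helper, not the actual v4 kernel): accumulate the sum and the sum of squares together, then derive the mean and rstd.

__device__ __forceinline__ void mean_rstd_onepass(const float* x, int C,
                                                  float* m_out, float* s_out) {
    float sum = 0.0f, sum2 = 0.0f;
    for (int i = 0; i < C; i++) {
        sum  += x[i];
        sum2 += x[i] * x[i];
    }
    float m = sum / C;              // mean(x)
    float v = sum2 / C - m * m;     // var(x) = mean(x^2) - mean(x)^2
    *m_out = m;
    *s_out = rsqrtf(v + 1e-5f);     // rstd
}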
Reference: PyTorch operator definition pytorch.org/docs/stable…
LayerNorm computation walkthrough blog.csdn.net/weixin_3922…
Understanding LayerNorm http://8.130.79.177/layer-normalization/
《动手自制大模型推理框架》 (a hands-on course on building an LLM inference framework from scratch): all CUDA operators written by hand; the course framework supports LLaMA 2 and 3.x
https://github.com/zjhellofss/KuiperInfer/blob/main/README.md