🎓 AI Model Training, Fine-Tuning, and Deployment: A Complete Tutorial


A hands-on deep learning guide, from scratch to production


📚 Table of Contents

  1. Model File Formats in Depth
  2. Model Architecture and Layers
  3. Model Parameters and Weights
  4. Model Quantization
  5. Training Basics
  6. Fine-Tuning Techniques
  7. Deployment in Practice
  8. Performance Optimization Tips
  9. Common Problems and Solutions

1. Model File Formats in Depth

1.1 Format Evolution and Selection

Different model formats serve different purposes:

📦 The evolution of model file formats

Generation 1: Pickle (.pkl)
├── Pros: native Python support, simple to use
└── Cons: unsafe (can execute arbitrary code), Python-only

Generation 2: framework-native formats (.pth, .h5)
├── Pros: framework-optimized, full-featured
└── Cons: framework lock-in, hard to move across platforms

Generation 3: cross-framework formats (ONNX)
├── Pros: framework-agnostic, hardware optimizations
└── Cons: some operators unsupported

Generation 4: safe and efficient formats (SafeTensors)
├── Pros: safe, fast, general-purpose
└── Cons: relatively new

Generation 5: inference-optimized formats (GGUF, TensorRT)
├── Pros: maximum performance
└── Cons: tied to specific hardware/scenarios

1.2 Comparison of Mainstream Model Formats

Format         | Extension     | Cross-platform | Safety   | Load speed | File size         | Production use
---------------|---------------|----------------|----------|------------|-------------------|------------------------
PyTorch native | .pt, .pth     | ⚠️             | Low      | Medium     | Baseline          | —
SafeTensors    | .safetensors  | ✅             | High     | Fast       | Slightly smaller  | ✅ Strongly recommended
ONNX           | .onnx         | ✅             | High     | —          | Baseline          | ✅ Recommended
GGUF           | .gguf         | ✅             | High     | —          | Small (quantized) | ✅ CPU inference
TensorFlow     | / (directory) | ⚠️             | ✅       | Medium     | —                 | ⚠️ TF ecosystem
TorchScript    | .pt           | ⚠️             | ✅       | Medium     | Baseline          | ✅ PyTorch production
Pickle         | .pkl          | ❌             | Very low | —          | Baseline          | ❌ Development only

1.3 PyTorch Formats in Depth

1.3.1 The Three Ways to Save a Model

import torch
import torch.nn as nn

# Define an example model
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

model = SimpleNet()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# ============================================================
# Method 1: save the complete model (not recommended)
# ============================================================
torch.save(model, 'complete_model.pth')

# What gets saved:
# ├── a reference to the model class definition
# ├── the model structure
# ├── all parameter values
# └── Python object state

# Load
loaded_model = torch.load('complete_model.pth')

# ⚠️ Drawbacks:
# 1. Large files (they include Python class information)
# 2. The original class definition must be available at load time
# 3. Python version changes can break loading
# 4. Unsafe (may execute malicious code)

# ============================================================
# Method 2: save only the state_dict (strongly recommended)
# ============================================================
torch.save(model.state_dict(), 'model_weights.pth')

# What gets saved (parameter values only):
# ├── fc1.weight: Tensor(128, 784)
# ├── fc1.bias: Tensor(128)
# ├── fc2.weight: Tensor(10, 128)
# └── fc2.bias: Tensor(10)

# Load
new_model = SimpleNet()  # a model instance must be created first
new_model.load_state_dict(torch.load('model_weights.pth'))

# ✅ Advantages:
# 1. Small files (parameter values only)
# 2. Flexible (weights can be loaded into a different model instance)
# 3. Good version compatibility
# ============================================================
# Method 3: save a checkpoint (recommended during training)
# ============================================================
checkpoint = {
    'epoch': 10,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': 0.5,
    'accuracy': 0.92,
    'config': {
        'learning_rate': 0.001,
        'batch_size': 32
    }
}
torch.save(checkpoint, 'checkpoint.pth')

# Load the checkpoint
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch']

# ✅ Advantage: restores the full training state, so training can resume from a checkpoint

1.3.2 Inside a PyTorch File

import zipfile

# A PyTorch .pth file is actually a ZIP archive
# Inspect its internal structure
with zipfile.ZipFile('model_weights.pth', 'r') as zip_file:
    print("The PyTorch file contains:")
    for file_info in zip_file.filelist:
        print(f"  📄 {file_info.filename:30s} {file_info.file_size:>10,} bytes")

# PyTorch file layout:
# ├── data.pkl          → pickle-serialized metadata
# ├── data/0            → binary data of the first tensor
# ├── data/1            → binary data of the second tensor
# └── ...

# ⚠️ This is exactly why PyTorch files are unsafe: Pickle can execute arbitrary Python code!
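If you do have to load a pickle-based checkpoint from a source you don't fully trust, recent PyTorch versions (1.13+) can restrict unpickling to plain tensors and primitive types; a minimal sketch:

# Safer loading: refuse anything except tensors/primitives (PyTorch 1.13+)
state_dict = torch.load('model_weights.pth', weights_only=True)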

1.3.3 Advanced Saving Tricks

import os
from datetime import datetime

# Trick 1: conditional saving (keep only the best model)
best_loss = float('inf')

def save_if_best(model, current_loss, path='best_model.pth'):
    global best_loss
    if current_loss < best_loss:
        best_loss = current_loss
        torch.save({
            'model_state_dict': model.state_dict(),
            'loss': current_loss,
            'timestamp': datetime.now().isoformat()
        }, path)
        print(f"✓ New best model saved (loss: {current_loss:.4f})")

# Trick 2: versioned saving (keep a history)
def save_with_version(model, epoch):
    filename = f'model_epoch_{epoch:03d}.pth'
    torch.save(model.state_dict(), filename)

    # Keep only the 5 most recent versions
    import glob
    models = sorted(glob.glob('model_epoch_*.pth'))
    if len(models) > 5:
        os.remove(models[0])

# Trick 3: saving and loading across devices
# Train on GPU, load on CPU afterwards
model_cpu = SimpleNet()
model_cpu.load_state_dict(
    torch.load('model_weights.pth', map_location='cpu')
)

# Trick 4: partial loading (load a subset of the weights)
pretrained_dict = torch.load('pretrained.pth')
model_dict = model.state_dict()

# Filter out keys that don't match
pretrained_dict = {k: v for k, v in pretrained_dict.items()
                   if k in model_dict and v.size() == model_dict[k].size()}

# Merge into the current model's dict
model_dict.update(pretrained_dict)
model.load_state_dict(model_dict)

1.4 The SafeTensors Format

SafeTensors is a next-generation model storage format developed by HuggingFace, designed for safety and performance.

1.4.1 Core Advantages

from safetensors.torch import save_file, load_file
from safetensors import safe_open

"""
✅ Safety:
  - A pure data format with no executable code
  - Loading a model cannot run malicious code
  - Immune to pickle deserialization attacks

✅ Speed:
  - Zero-copy loading (no data copies needed)
  - Memory-mapping (mmap) support
  - Lazy loading

✅ Simplicity:
  - The file header contains complete metadata
  - The structure can be inspected without parsing the whole file
  - Easy to debug and examine

✅ Cross-platform:
  - Framework-agnostic (PyTorch, TensorFlow, JAX, ...)
  - Language-agnostic (Python, Rust, JavaScript, ...)
"""

# Basic usage
state_dict = {
    'layer1.weight': torch.randn(100, 50),
    'layer1.bias': torch.randn(100),
    'layer2.weight': torch.randn(10, 100),
}

# Save as SafeTensors
save_file(state_dict, 'model.safetensors')

# Load SafeTensors
loaded_state_dict = load_file('model.safetensors')

1.4.2 SafeTensors File Layout

"""
SafeTensors file format:

┌─────────────────────────────────────────────┐
│ Header Size (8 bytes)                       │  ← header length
├─────────────────────────────────────────────┤
│ Header (JSON metadata)                      │  ← metadata
│ {                                           │
│   "layer1.weight": {                        │
│     "dtype": "F32",                         │
│     "shape": [100, 50],                     │
│     "data_offsets": [0, 20000]              │
│   },                                        │
│   ...                                       │
│ }                                           │
├─────────────────────────────────────────────┤
│ Tensor Data (aligned)                       │  ← the actual data
│ ┌─────────────────┐                         │
│ │ layer1.weight   │                         │
│ ├─────────────────┤                         │
│ │ layer1.bias     │                         │
│ └─────────────────┘                         │
└─────────────────────────────────────────────┘
"""

# Inspect the metadata (without loading any data)
with safe_open('model.safetensors', framework="pt") as f:
    print("SafeTensors metadata:")
    for key in f.keys():
        tensor_slice = f.get_slice(key)
        print(f"  • {key:20s} {tensor_slice.get_shape()} {tensor_slice.get_dtype()}")

1.4.3 Performance Comparison

import time

# Create test data
large_state_dict = {
    f'layer{i}.weight': torch.randn(1000, 1000)
    for i in range(10)
}

# Save with PyTorch
start = time.time()
torch.save(large_state_dict, 'large_model.pth')
pytorch_save_time = time.time() - start

# Save with SafeTensors
start = time.time()
save_file(large_state_dict, 'large_model.safetensors')
safetensors_save_time = time.time() - start

print(f"Save performance:")
print(f"  PyTorch:     {pytorch_save_time:.3f}s")
print(f"  SafeTensors: {safetensors_save_time:.3f}s")
print(f"  Speedup: {pytorch_save_time / safetensors_save_time:.2f}x")

# Load performance
start = time.time()
_ = torch.load('large_model.pth')
pytorch_load_time = time.time() - start

start = time.time()
_ = load_file('large_model.safetensors')
safetensors_load_time = time.time() - start

print(f"\nLoad performance:")
print(f"  PyTorch:     {pytorch_load_time:.3f}s")
print(f"  SafeTensors: {safetensors_load_time:.3f}s")
print(f"  Speedup: {pytorch_load_time / safetensors_load_time:.2f}x")

1.4.4 Advanced Features

# Feature 1: lazy loading
with safe_open('large_model.safetensors', framework="pt", device="cpu") as f:
    # Nothing has been loaded into memory yet;
    # only the tensors you ask for are read
    layer0_weight = f.get_tensor('layer0.weight')
    print(f"✓ Only the required tensor was loaded, saving memory")

# Feature 2: metadata storage
metadata = {
    'model_name': 'MyModel',
    'version': '1.0.0',
    'accuracy': '95.5%'
}

save_file(state_dict, 'model_with_metadata.safetensors', metadata=metadata)

# Read the metadata back
with safe_open('model_with_metadata.safetensors', framework="pt") as f:
    stored_metadata = f.metadata()
    print(f"Metadata: {stored_metadata}")

# Feature 3: PyTorch ↔ SafeTensors conversion
def convert_pytorch_to_safetensors(pth_path, safetensors_path):
    state_dict = torch.load(pth_path, map_location='cpu')
    if isinstance(state_dict, nn.Module):
        state_dict = state_dict.state_dict()
    save_file(state_dict, safetensors_path)
    print(f"✓ Conversion finished: {pth_path} → {safetensors_path}")

1.5 The ONNX Format

ONNX (Open Neural Network Exchange) is an open model representation format that supports deployment across frameworks and hardware.

1.5.1 ONNX Core Concepts

import torch.onnx
import onnx
import onnxruntime as ort

"""
What is ONNX?
  It represents a neural network as a computation graph:
  - Nodes: operators (such as Conv, Add, ReLU)
  - Edges: data flow (tensors)
  - Attributes: operator parameters

Why use ONNX?
  ✅ Framework-agnostic: PyTorch → ONNX → TensorFlow
  ✅ Hardware optimization: acceleration targeted at specific hardware
  ✅ Deployment-friendly: supported by many runtimes
  ✅ Production-ready: widely used in industry
"""

# Export to ONNX
model = SimpleNet()
dummy_input = torch.randn(1, 784)

torch.onnx.export(
    model,                          # the model
    dummy_input,                    # an example input
    'model.onnx',                   # output path
    export_params=True,             # export the parameters
    opset_version=11,               # ONNX operator-set version
    do_constant_folding=True,       # constant-folding optimization
    input_names=['input'],          # input name
    output_names=['output'],        # output name
    dynamic_axes={                  # dynamic dimensions (variable batch size)
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)

# Validate the ONNX model
onnx_model = onnx.load('model.onnx')
onnx.checker.check_model(onnx_model)
print("✓ ONNX model check passed")

1.5.2 Analyzing the ONNX Graph

import numpy as np

# Inspect the computation graph
graph = onnx_model.graph

# Inputs
print("Inputs:")
for input_tensor in graph.input:
    shape = [dim.dim_value if dim.dim_value > 0 else 'dynamic'
             for dim in input_tensor.type.tensor_type.shape.dim]
    print(f"  • {input_tensor.name}: {shape}")

# Nodes (operations)
print(f"\nGraph nodes: ({len(graph.node)} operations)")
for i, node in enumerate(graph.node[:10]):
    print(f"  {i+1}. {node.op_type:15s} → {node.output[0]}")

# Initializers (weights)
print(f"\nParameters: ({len(graph.initializer)} tensors)")
total_params = 0
for initializer in graph.initializer:
    shape = list(initializer.dims)
    num_params = np.prod(shape) if shape else 0
    total_params += num_params
    print(f"  • {initializer.name:30s} {shape}")
print(f"\nTotal parameters: {total_params:,}")

1.5.3 Inference with ONNX Runtime

import numpy as np

# Create an inference session
ort_session = ort.InferenceSession(
    'model.onnx',
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

print(f"Execution providers: {ort_session.get_providers()}")

# Run inference
input_data = np.random.randn(1, 784).astype(np.float32)
outputs = ort_session.run(None, {'input': input_data})

# Performance comparison
import time

# PyTorch inference
model.eval()
pytorch_input = torch.from_numpy(input_data)

start = time.time()
for _ in range(100):
    with torch.no_grad():
        _ = model(pytorch_input)
pytorch_time = (time.time() - start) / 100

# ONNX Runtime inference
start = time.time()
for _ in range(100):
    _ = ort_session.run(None, {'input': input_data})
onnx_time = (time.time() - start) / 100

print(f"\nPerformance comparison:")
print(f"  PyTorch:      {pytorch_time*1000:.3f} ms")
print(f"  ONNX Runtime: {onnx_time*1000:.3f} ms")
print(f"  Speedup: {pytorch_time / onnx_time:.2f}x")

1.6 The GGUF Format (CPU Inference for Large Models)

GGUF (GPT-Generated Unified Format) is the model format used by the llama.cpp project, optimized for CPU inference.

1.6.1 GGUF Highlights

"""
✅ CPU-optimized:
  - Designed specifically for CPU inference
  - Accelerated with SIMD instructions such as AVX2 and AVX512
  - Memory-mapping (mmap) support

✅ Quantization support:
  - Q4_0, Q4_1: 4-bit quantization
  - Q5_0, Q5_1: 5-bit quantization
  - Q8_0: 8-bit quantization
  - Dramatically smaller models

✅ Easy to deploy:
  - A single file contains everything
  - No Python environment required
  - Cross-platform (Windows, Linux, macOS)

Typical use cases:
  - Running LLaMA, Mistral, and other LLMs on a CPU
  - Embedded-device deployment
  - Edge computing
  - AI applications on personal computers
"""

1.6.2 Quantization Types Compared

"""
┌─────────┬─────────────┬────────────┬──────────────┐
│ Type    │ Bits/weight │ Model size │ Quality      │
├─────────┼─────────────┼────────────┼──────────────┤
│ F16     │ 16 bit      │ Baseline   │ Best         │
│ Q8_0    │ 8 bit       │ 50%        │ Excellent    │
│ Q6_K    │ 6 bit       │ 37.5%      │ Very good    │
│ Q5_1    │ 5 bit       │ 31.25%     │ Very good    │
│ Q4_1    │ 4 bit       │ 25%        │ Good         │
│ Q4_0    │ 4 bit       │ 25%        │ Acceptable   │
│ Q3_K_S  │ 3 bit       │ 18.75%     │ Poor         │
│ Q2_K    │ 2 bit       │ 12.5%      │ Experimental │
└─────────┴─────────────┴────────────┴──────────────┘

Example: LLaMA-7B model sizes
- Original (FP32): ~28 GB
- F16: ~14 GB
- Q8_0: ~7 GB
- Q4_0: ~3.5 GB ← recommended
- Q2_K: ~1.8 GB
"""

1.6.3 Using a GGUF Model

# Command-line inference
./main -m llama-7b-q4_0.gguf \
    -p "What is artificial intelligence?" \
    -n 128 \
    -t 8 \
    --temp 0.7

# Python API (via llama-cpp-python)
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="./llama-7b-q4_0.gguf",
    n_ctx=2048,        # context window
    n_threads=8,       # number of threads
    n_gpu_layers=0,    # layers offloaded to GPU (0 = pure CPU)
)

# Run inference
output = llm(
    "What is the meaning of life?",
    max_tokens=128,
    temperature=0.7,
    top_p=0.9
)

print(output['choices'][0]['text'])
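To produce a GGUF file yourself, the usual route is llama.cpp's own conversion and quantization tooling; a rough sketch (the script and binary names below vary between llama.cpp versions, so treat them as placeholders and check your checkout):

# Convert a HuggingFace checkpoint to GGUF (F16), then quantize to Q4_0
# (script/binary names differ across llama.cpp releases)
python convert_hf_to_gguf.py ./Llama-2-7b-hf --outfile llama-7b-f16.gguf
./llama-quantize llama-7b-f16.gguf llama-7b-q4_0.gguf Q4_0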

1.7 A Decision Tree for Choosing a Format

"""
                  Start: pick a model format
                          │
                          ▼
           ┌──────────────┴──────────────┐
           │    What is the use case?    │
           └──────────────┬──────────────┘
                          │
        ┌─────────────────┼─────────────────┐
        │                 │                 │
        ▼                 ▼                 ▼
 [Develop/Train]  [Share/Distribute]  [Deploy/Infer]
        │                 │                 │
        ▼                 ▼                 ▼
  .pth/.ckpt       .safetensors      choose by platform
  (checkpoints)    (safe & fast)            │
                                            │
                          ┌─────────────────┴────────────────┐
                          │                                  │
                          ▼                                  ▼
                   ┌──────────────┐                   ┌─────────────┐
                   │ Cloud/server │                   │ Edge device │
                   └──────────────┘                   └─────────────┘
                          │                                  │
              ┌───────────┼───────────┐                      │
              │           │           │                      │
              ▼           ▼           ▼                      ▼
          PyTorch     TensorFlow   Generic             CPU  vs  GPU
            │             │          │                  │        │
            ▼             ▼          ▼                  ▼        ▼
      .pth/.pt      SavedModel    .onnx              .gguf    .onnx
    + TorchScript     .h5      (ONNX Runtime)    (llama.cpp) (TensorRT)


Concrete recommendations:

┌──────────────────────────┬──────────────┬──────────────────────────┐
│ Scenario                 │ Recommended  │ Why                      │
├──────────────────────────┼──────────────┼──────────────────────────┤
│ Model development        │ .pth         │ Flexible, easy to debug  │
│ Training checkpoints     │ .ckpt        │ Includes optimizer state │
│ Model sharing            │ .safetensors │ Safe, fast loading       │
│ HuggingFace releases     │ .safetensors │ Officially recommended   │
│ Cross-framework deploy   │ .onnx        │ Framework-agnostic       │
│ PyTorch production       │ TorchScript  │ Performance-optimized    │
│ CPU LLM inference        │ .gguf        │ Heavily optimized        │
│ Nvidia GPU inference     │ TensorRT     │ Maximum performance      │
└──────────────────────────┴──────────────┴──────────────────────────┘
"""

1.8 Format Conversion in Practice

# PyTorch → SafeTensors
from safetensors.torch import save_file
save_file(model.state_dict(), 'model.safetensors')

# PyTorch → ONNX
torch.onnx.export(model, dummy_input, 'model.onnx')

# ONNX → TensorFlow (requires onnx-tf)
# pip install onnx-tf
# onnx-tf convert -i model.onnx -o model_tf

# SafeTensors → PyTorch
from safetensors.torch import load_file
state_dict = load_file('model.safetensors')
torch.save(state_dict, 'model.pth')

print("✓ Format conversion finished")

2. Model Architecture and Layers

2.1 The Layered Structure of a Neural Network

Input layer → hidden layers (stacked) → output layer
     ↓               ↓                       ↓
feature input  feature extraction     final prediction
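In PyTorch this layering maps directly onto a container such as nn.Sequential; a minimal sketch (the layer sizes here are arbitrary, chosen purely for illustration):

import torch.nn as nn

# input layer -> hidden layer -> output layer
net = nn.Sequential(
    nn.Linear(4, 16),   # feature input (4 features in)
    nn.ReLU(),          # non-linearity for feature extraction
    nn.Linear(16, 2),   # final prediction (2 classes out)
)
print(net)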

2.2 Common Layer Types

2.2.1 Fully Connected Layers (Linear/Dense)

import torch
import torch.nn as nn

# Define a fully connected layer
fc = nn.Linear(in_features=512, out_features=256)

# Real-world example: an image classifier
class ImageClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # A 28x28 image flattens to 784 pixels
        self.fc1 = nn.Linear(784, 512)    # layer 1: 784→512
        self.fc2 = nn.Linear(512, 256)    # layer 2: 512→256
        self.fc3 = nn.Linear(256, 10)     # layer 3: 256→10 classes

    def forward(self, x):
        x = x.view(-1, 784)  # flatten
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Inspect the model structure
model = ImageClassifier()
print(model)

# Count the parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
# Output: Total parameters: 535,818

2.2.2 Convolutional Layers

# A 2D convolution (for images)
conv2d = nn.Conv2d(
    in_channels=3,      # input channels (RGB = 3)
    out_channels=64,    # output channels (number of feature maps)
    kernel_size=3,      # 3x3 kernel
    stride=1,           # stride
    padding=1           # padding
)

# Real-world example: a CNN image classifier
class CNNClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Conv block 1
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)   # 32x32x3 → 32x32x32
        self.pool1 = nn.MaxPool2d(2, 2)               # 32x32x32 → 16x16x32

        # Conv block 2
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)  # 16x16x32 → 16x16x64
        self.pool2 = nn.MaxPool2d(2, 2)               # 16x16x64 → 8x8x64

        # Fully connected layers
        self.fc1 = nn.Linear(8 * 8 * 64, 512)
        self.fc2 = nn.Linear(512, 10)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = self.pool1(x)
        x = torch.relu(self.conv2(x))
        x = self.pool2(x)
        x = x.view(-1, 8 * 8 * 64)  # flatten
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

2.2.3 Attention Layers - the Core of the Transformer

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.qkv = nn.Linear(d_model, d_model * 3)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch_size, seq_len, d_model = x.shape

        # Produce Q, K, V
        qkv = self.qkv(x).reshape(batch_size, seq_len, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)

        # Compute attention scores
        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        attn = torch.softmax(scores, dim=-1)

        # Apply attention to the values
        out = torch.matmul(attn, v)
        out = out.permute(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)
        out = self.out(out)
        return out

# Real-world example: a Transformer block as used in GPT
class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model)
        )

    def forward(self, x):
        # Self-attention + residual connection
        x = x + self.attention(self.norm1(x))
        # Feed-forward network + residual connection
        x = x + self.ffn(self.norm2(x))
        return x
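A quick shape check of the block above (a batch of 2 sequences of length 16):

block = TransformerBlock(d_model=512, num_heads=8)
x = torch.randn(2, 16, 512)   # (batch, seq_len, d_model)
print(block(x).shape)         # torch.Size([2, 16, 512]) - shape is preserved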

2.3 Model Depth and Scale

Model      | Layers | Parameters | Typical use
-----------|--------|------------|-----------------------------
Small CNN  | 5-10   | 1M-10M     | Simple image classification
ResNet-50  | 50     | 25M        | Image recognition
BERT-Base  | 12     | 110M       | Text understanding
GPT-2      | 12-48  | 117M-1.5B  | Text generation
GPT-3      | 96     | 175B       | Large language model
LLaMA-7B   | 32     | 7B         | Open-source LLM
LLaMA-70B  | 80     | 70B        | High-end LLM

3. Model Parameters and Weights

3.1 What Are Model Parameters?

Model parameters are the values a neural network learns during training. They mainly consist of:

  • Weights: the strength of the connections between neurons
  • Biases: offsets that shift a layer's output

3.2 Hands-on: Inspecting and Analyzing Model Parameters

import torch
import torch.nn as nn

# Define a model
model = nn.Sequential(
    nn.Linear(10, 5),
    nn.ReLU(),
    nn.Linear(5, 2)
)

# Method 1: list all parameters
print("=" * 50)
print("All parameters:")
for name, param in model.named_parameters():
    print(f"{name:20s} | Shape: {str(param.shape):15s} | Count: {param.numel():,}")

# Example output:
# 0.weight            | Shape: torch.Size([5, 10]) | Count: 50
# 0.bias              | Shape: torch.Size([5])     | Count: 5
# 2.weight            | Shape: torch.Size([2, 5])  | Count: 10
# 2.bias              | Shape: torch.Size([2])     | Count: 2

# Method 2: count the parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

# Method 3: look at actual weight values
first_layer_weights = model[0].weight.data
print(f"\nSample of the first layer's weights:\n{first_layer_weights[:2, :3]}")

3.3 Parameter-Count Formulas

# Parameter count of a fully connected layer
def linear_params(in_features, out_features, bias=True):
    params = in_features * out_features
    if bias:
        params += out_features
    return params

# Parameter count of a convolutional layer
def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    if isinstance(kernel_size, int):
        kernel_size = (kernel_size, kernel_size)
    params = in_channels * out_channels * kernel_size[0] * kernel_size[1]
    if bias:
        params += out_channels
    return params

# Examples
print(f"Linear(784, 128) parameters: {linear_params(784, 128):,}")
# Output: 100,480

print(f"Conv2d(3, 64, 3) parameters: {conv2d_params(3, 64, 3):,}")
# Output: 1,792
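You can sanity-check these formulas against PyTorch itself:

import torch.nn as nn

conv = nn.Conv2d(3, 64, 3)
print(sum(p.numel() for p in conv.parameters()))  # 1792 - matches conv2d_params(3, 64, 3)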

3.4 Quick Parameter Estimates for Large Models

# Estimating LLaMA parameter counts
def estimate_llama_params(
    n_layers,           # number of layers
    d_model,            # hidden dimension
    vocab_size=32000    # vocabulary size
):
    # Embedding layer
    embedding = vocab_size * d_model

    # Per Transformer layer:
    # attention: 4 * d_model * d_model (Q, K, V, O)
    # FFN: 8 * d_model * d_model (usually a 4x expansion)
    per_layer = 12 * d_model * d_model

    total_params = embedding + (n_layers * per_layer)
    return total_params

# LLaMA-7B
params_7b = estimate_llama_params(n_layers=32, d_model=4096)
print(f"LLaMA-7B estimated parameters: {params_7b / 1e9:.1f}B")
# Output: LLaMA-7B estimated parameters: 6.6B

# LLaMA-13B
params_13b = estimate_llama_params(n_layers=40, d_model=5120)
print(f"LLaMA-13B estimated parameters: {params_13b / 1e9:.1f}B")
# Output: LLaMA-13B estimated parameters: 12.7B

4. Model Quantization

4.1 What Is Quantization?

Quantization converts a model's high-precision values (e.g. FP32) into low-precision ones (e.g. INT8). The goals, with a sketch of the underlying arithmetic after this list:

  • ✅ Smaller models (typically 2-4x compression)
  • ✅ Faster inference (INT8 arithmetic is faster than FP32)
  • ✅ Lower memory usage
  • ❌ Possibly a slight accuracy drop (usually <1%)
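Conceptually, affine INT8 quantization maps a float range onto 256 integer levels via a scale and a zero-point; a minimal sketch of the math (illustrative only, not PyTorch's internal implementation):

import torch

def quantize_int8(x: torch.Tensor):
    # Affine quantization: q = round(x / scale) + zero_point
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = qmin - torch.round(x.min() / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return q.to(torch.int8), scale, zero_point

def dequantize_int8(q, scale, zero_point):
    return (q.float() - zero_point) * scale

x = torch.randn(4, 4)
q, s, z = quantize_int8(x)
print((x - dequantize_int8(q, s, z)).abs().max())  # small quantization error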

4.2 Data Types Compared

Type | Bits   | Range       | Memory   | Precision   | Typical use
-----|--------|-------------|----------|-------------|--------------------------
FP32 | 32-bit | ±3.4×10³⁸   | Baseline | High        | Training
FP16 | 16-bit | ±65,504     | 50%      | Medium      | Mixed-precision training
BF16 | 16-bit | ±3.4×10³⁸   | 50%      | Medium-high | Training (recommended)
INT8 | 8-bit  | -128 to 127 | 25%      | Medium-low  | Inference
INT4 | 4-bit  | -8 to 7     | 12.5%    | Low         | Extreme compression
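The memory column can be verified directly from each dtype's element size:

import torch

for dt in (torch.float32, torch.float16, torch.bfloat16, torch.int8):
    print(dt, torch.tensor(0, dtype=dt).element_size(), "byte(s) per element")
# float32: 4, float16/bfloat16: 2, int8: 1 (INT4 packs two values per byte)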

4.3 Hands-on: Quantization in PyTorch

Example 1: Dynamic Quantization (the simplest)

import os
import torch
import torch.nn as nn
import torch.quantization as quantization

# The original model
model = ImageClassifier()
model.eval()

# Dynamic quantization (no training data needed)
quantized_model = torch.quantization.quantize_dynamic(
    model,                              # the original model
    {nn.Linear},                        # layer types to quantize
    dtype=torch.qint8                   # quantized dtype
)

# Compare model sizes
def get_model_size(model):
    torch.save(model.state_dict(), "temp.pth")
    size_mb = os.path.getsize("temp.pth") / (1024 * 1024)
    os.remove("temp.pth")
    return size_mb

original_size = get_model_size(model)
quantized_size = get_model_size(quantized_model)

print(f"Original model size: {original_size:.2f} MB")
print(f"Quantized model size: {quantized_size:.2f} MB")
print(f"Compression ratio: {original_size / quantized_size:.2f}x")

Example 2: Static Quantization (requires calibration)

# Prepare calibration data
calibration_data = torch.randn(100, 1, 28, 28)

# Insert quantization/dequantization observers
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(model, inplace=True)

# Calibrate (run representative data through the model)
model.eval()
with torch.no_grad():
    for data in calibration_data:
        model(data.unsqueeze(0))

# Convert to a quantized model
torch.quantization.convert(model, inplace=True)

print("Static quantization finished!")

Example 3: 4-bit Quantization with BitsAndBytes (common for LLMs)

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # enable 4-bit quantization
    bnb_4bit_quant_type="nf4",               # quantization type (nf4 recommended)
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute dtype
    bnb_4bit_use_double_quant=True,          # double quantization (extra compression)
)

# Load the quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Memory footprint comparison:
# FP32: ~28GB
# FP16: ~14GB
# 8-bit: ~7GB
# 4-bit: ~3.5GB

4.4 A Quantization Benchmark

import time
import torch

# Prepare test data
test_input = torch.randn(1000, 784)

# Benchmark helper
def benchmark(model, input_data, num_runs=100):
    model.eval()
    with torch.no_grad():
        # Warm-up
        for _ in range(10):
            _ = model(input_data[:1])

        # Timed runs
        start = time.time()
        for _ in range(num_runs):
            _ = model(input_data)
        end = time.time()

    return (end - start) / num_runs

# Compare a fresh FP32 model against the dynamically quantized one from Example 1
original_model = ImageClassifier().eval()
fp32_time = benchmark(original_model, test_input)
int8_time = benchmark(quantized_model, test_input)

print(f"FP32 inference time: {fp32_time*1000:.2f} ms")
print(f"INT8 inference time: {int8_time*1000:.2f} ms")
print(f"Speedup: {fp32_time/int8_time:.2f}x")

5. Training Basics

5.1 A Complete Training Loop

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Step 1: prepare the data
# Generate synthetic data (use a real dataset in practice)
X_train = torch.randn(1000, 784)
y_train = torch.randint(0, 10, (1000,))

# Create a data loader
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Step 2: define the model
model = ImageClassifier()

# Step 3: define the loss function and optimizer
criterion = nn.CrossEntropyLoss()  # the standard classification loss
optimizer = optim.Adam(model.parameters(), lr=0.001)  # Adam optimizer

# Step 4: the training loop
num_epochs = 10
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

for epoch in range(num_epochs):
    model.train()  # switch to training mode
    total_loss = 0
    correct = 0
    total = 0

    for batch_idx, (data, target) in enumerate(train_loader):
        # Move the data to the GPU
        data, target = data.to(device), target.to(device)

        # Forward pass
        optimizer.zero_grad()  # clear the gradients
        output = model(data)   # compute the output
        loss = criterion(output, target)  # compute the loss

        # Backward pass
        loss.backward()        # compute gradients
        optimizer.step()       # update the weights

        # Statistics
        total_loss += loss.item()
        _, predicted = output.max(1)
        total += target.size(0)
        correct += predicted.eq(target).sum().item()

        # Progress output
        if (batch_idx + 1) % 10 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], '
                  f'Step [{batch_idx+1}/{len(train_loader)}], '
                  f'Loss: {loss.item():.4f}')

    # End-of-epoch statistics
    avg_loss = total_loss / len(train_loader)
    accuracy = 100. * correct / total
    print(f'Epoch {epoch+1} done - avg loss: {avg_loss:.4f}, accuracy: {accuracy:.2f}%')

# Step 5: save the model
torch.save({
    'epoch': num_epochs,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': avg_loss,
}, 'trained_model.pth')

print("Training finished! Model saved.")

5.2 Key Concepts

5.2.1 Batch Size

# The impact of different batch sizes
batch_sizes = [8, 32, 128, 512]

"""
Batch Size | Memory  | Speed     | Gradient stability | Generalization
-----------------------------------------------------------------
8          | Low     | Slow      | Low (noisy)        | Good
32         | Low-mid | Medium    | Medium             | Fairly good (recommended)
128        | Mid-high| Fast      | High               | Moderate
512        | High    | Very fast | Very high          | May overfit

Rules of thumb:
- Small GPU memory: batch_size=16-32
- Large GPU memory: batch_size=64-128
- Large models: use gradient accumulation to simulate a large batch
"""

# Gradient accumulation (simulating a large batch)
accumulation_steps = 4  # accumulate over 4 batches

for i, (data, target) in enumerate(train_loader):
    output = model(data)
    loss = criterion(output, target)
    loss = loss / accumulation_steps  # scale the loss
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

5.2.2 Learning Rate

# Learning-rate schedulers
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR

# Option 1: step decay
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
# Every 30 epochs, multiply the learning rate by 0.1

# Option 2: cosine annealing
scheduler = CosineAnnealingLR(optimizer, T_max=100)
# The learning rate decays along a cosine curve

# Inside the training loop
for epoch in range(num_epochs):
    train(...)
    scheduler.step()  # update the learning rate

    # Print the current learning rate
    current_lr = optimizer.param_groups[0]['lr']
    print(f'Epoch {epoch}, Learning Rate: {current_lr}')

5.2.3 Overfitting and Regularization

# Technique 1: Dropout
class ModelWithDropout(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 512)
        self.dropout1 = nn.Dropout(0.5)  # randomly zero 50% of activations
        self.fc2 = nn.Linear(512, 256)
        self.dropout2 = nn.Dropout(0.3)  # 30% dropout
        self.fc3 = nn.Linear(256, 10)

    def forward(self, x):
        x = x.view(-1, 784)
        x = torch.relu(self.fc1(x))
        x = self.dropout1(x)  # active during training, disabled automatically at eval time
        x = torch.relu(self.fc2(x))
        x = self.dropout2(x)
        x = self.fc3(x)
        return x

# Technique 2: weight decay (L2 regularization)
optimizer = optim.Adam(
    model.parameters(),
    lr=0.001,
    weight_decay=1e-4  # L2 regularization coefficient
)

# Technique 3: early stopping
best_loss = float('inf')
patience = 5
counter = 0

for epoch in range(num_epochs):
    val_loss = validate(...)

    if val_loss < best_loss:
        best_loss = val_loss
        counter = 0
        torch.save(model.state_dict(), 'best_model.pth')
    else:
        counter += 1
        if counter >= patience:
            print("Early stopping triggered!")
            break

5.3 A Complete Training-Script Template

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import DataLoader
from tqdm import tqdm  # progress bar

class Trainer:
    def __init__(self, model, train_loader, val_loader, device='cuda'):
        self.model = model.to(device)
        self.train_loader = train_loader
        self.val_loader = val_loader
        self.device = device

        self.criterion = nn.CrossEntropyLoss()
        self.optimizer = optim.Adam(model.parameters(), lr=0.001)
        self.scheduler = StepLR(self.optimizer, step_size=10, gamma=0.1)

        self.train_losses = []
        self.val_losses = []
        self.val_accuracies = []

    def train_epoch(self):
        self.model.train()
        total_loss = 0

        for data, target in tqdm(self.train_loader, desc="Training"):
            data, target = data.to(self.device), target.to(self.device)

            self.optimizer.zero_grad()
            output = self.model(data)
            loss = self.criterion(output, target)
            loss.backward()
            self.optimizer.step()

            total_loss += loss.item()

        return total_loss / len(self.train_loader)

    def validate(self):
        self.model.eval()
        total_loss = 0
        correct = 0
        total = 0

        with torch.no_grad():
            for data, target in tqdm(self.val_loader, desc="Validation"):
                data, target = data.to(self.device), target.to(self.device)

                output = self.model(data)
                loss = self.criterion(output, target)

                total_loss += loss.item()
                _, predicted = output.max(1)
                total += target.size(0)
                correct += predicted.eq(target).sum().item()

        avg_loss = total_loss / len(self.val_loader)
        accuracy = 100. * correct / total
        return avg_loss, accuracy

    def fit(self, num_epochs):
        for epoch in range(num_epochs):
            print(f"\nEpoch {epoch+1}/{num_epochs}")

            # Train
            train_loss = self.train_epoch()
            self.train_losses.append(train_loss)

            # Validate
            val_loss, val_acc = self.validate()
            self.val_losses.append(val_loss)
            self.val_accuracies.append(val_acc)

            # Update the learning rate
            self.scheduler.step()

            # Print results
            print(f"Train Loss: {train_loss:.4f}")
            print(f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.2f}%")

            # Save the best model
            if val_loss == min(self.val_losses):
                torch.save(self.model.state_dict(), 'best_model.pth')
                print("✓ Best model saved!")

# Usage
trainer = Trainer(model, train_loader, val_loader)
trainer.fit(num_epochs=50)

6. Fine-Tuning Techniques

6.1 What Is Fine-Tuning?

Fine-tuning continues training a pretrained model on a small amount of task-specific data, adapting it to a new task (a minimal sketch of the classic approach follows).

Pretrained model (general capability) + fine-tuning (specific task) = customized model
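The classic form freezes the pretrained backbone and trains only a small task head; a minimal sketch using torchvision's ResNet-18 (the model and the 5-class head are chosen purely for illustration; the weights API requires torchvision 0.13+):

import torch.nn as nn
from torchvision import models

# Load a pretrained backbone and freeze all of its weights
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head; only this new layer gets trained
model.fc = nn.Linear(model.fc.in_features, 5)  # e.g. 5 target classes

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")  # just the new head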

6.2 Fine-Tuning Methods Compared

Method           | Trainable params | Memory   | Speed     | Quality   | Best for
-----------------|------------------|----------|-----------|-----------|---------------------
Full fine-tuning | 100%             | High     | Slow      | Best      | Abundant data
LoRA             | 0.1%-1%          | Low      | Fast      | Excellent | Recommended default
QLoRA            | 0.1%-1%          | Very low | Fast      | Excellent | Consumer GPUs
Adapter          | 2%-5%            | Low      | Fast      | Good      | Multi-task switching
Prompt Tuning    | <0.01%           | Very low | Very fast | Moderate  | Few-shot learning
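Why LoRA trains so few parameters: instead of updating a full weight matrix W, it learns a low-rank update ΔW = BA with rank r much smaller than the layer width. A minimal, illustrative sketch (not the PEFT library's implementation):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base Linear plus a trainable low-rank update (illustrative)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze W
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # y = Wx + scale * B(Ax)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"{trainable:,} / {total:,} = {100*trainable/total:.2f}% trainable")  # ≈0.4%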

6.3 Hands-on: LoRA Fine-Tuning

Example 1: LoRA fine-tuning with the PEFT library

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset

# Step 1: load the base model
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,  # 8-bit quantization to save memory
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Step 2: configure LoRA
lora_config = LoraConfig(
    r=8,                              # LoRA rank
    lora_alpha=32,                    # LoRA scaling parameter
    target_modules=["q_proj", "v_proj"],  # layers to apply LoRA to
    lora_dropout=0.05,                # dropout rate
    bias="none",                      # how to handle biases
    task_type=TaskType.CAUSAL_LM      # task type
)

# Step 3: apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06%

# Step 4: prepare the data
dataset = load_dataset("json", data_files="your_data.jsonl")

def preprocess(examples):
    texts = [f"### Question: {q}\n### Answer: {a}"
             for q, a in zip(examples['question'], examples['answer'])]
    return tokenizer(texts, truncation=True, padding='max_length', max_length=512)

tokenized_dataset = dataset.map(preprocess, batched=True)

# Step 5: configure the training run
training_args = TrainingArguments(
    output_dir="./lora_output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size = 16
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,                      # mixed-precision training
    logging_steps=10,
    save_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
)

# Step 6: train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
)

trainer.train()

# Step 7: save the LoRA weights (only a few MB)
model.save_pretrained("./lora_weights")

Example 2: QLoRA Fine-Tuning (even less memory)

import torch
from transformers import BitsAndBytesConfig

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load the quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare the model for k-bit training
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)

# Apply LoRA (same as above)
model = get_peft_model(model, lora_config)

# Memory footprint comparison:
# Full fine-tuning (FP32): ~28GB VRAM + ~28GB gradients/optimizer = 56GB
# LoRA (FP16): ~14GB + ~2GB = 16GB
# QLoRA (4-bit): ~3.5GB + ~2GB = 5.5GB ✓ fits on a consumer GPU!

6.4 Preparing Fine-Tuning Data

# Example data format (JSONL)
"""
{"instruction": "Translate the following English into Chinese", "input": "Hello world", "output": "你好世界"}
{"instruction": "Summarize the following text", "input": "...", "output": "..."}
"""

# Preprocessing helper
def format_instruction(sample):
    """Render a sample with the instruction template"""
    return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{sample['instruction']}

### Input:
{sample['input']}

### Response:
{sample['output']}"""

# Data quality checks
def check_data_quality(dataset):
    print(f"Dataset size: {len(dataset)}")
    print(f"Average length: {sum(len(d['output']) for d in dataset) / len(dataset):.0f} characters")

    # Check for empty outputs
    empty_count = sum(1 for d in dataset if not d['output'].strip())
    print(f"Empty outputs: {empty_count}")

    # Check the length distribution
    lengths = [len(d['output']) for d in dataset]
    print(f"Shortest: {min(lengths)}, longest: {max(lengths)}")

# Recommended data volumes:
# - Full fine-tuning: 10,000+ samples
# - LoRA: 1,000+ samples
# - Few-shot: 100+ samples

6.5 Inference with the Fine-Tuned Model

from peft import PeftModel

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load the LoRA weights
model = PeftModel.from_pretrained(base_model, "./lora_weights")

# Merge the weights (optional, for deployment)
model = model.merge_and_unload()

# Inference
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
input_text = "### Question: What is AI?\n### Answer:"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(result)

7. Deployment in Practice

7.1 Deployment Options Compared

Method        | Best for              | Latency  | Throughput | Difficulty
--------------|-----------------------|----------|------------|-----------
Local Python  | Development & testing | —        | —          | Easy
Flask/FastAPI | Small-scale services  | —        | —          | Moderate
TorchServe    | Production            | —        | —          | Moderate
ONNX Runtime  | Cross-platform        | —        | —          | Moderate
TensorRT      | Nvidia GPUs           | Very low | Very high  | Hard
vLLM          | LLM inference         | —        | Very high  | Easy

7.2 Hands-on: Deploying with FastAPI

# app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import uvicorn

# Initialize FastAPI
app = FastAPI(title="LLM API", version="1.0")

# Load the model globally (once, at startup)
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print("Model loaded!")

# Request/response schemas
class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 100
    temperature: float = 0.7
    top_p: float = 0.9

class GenerateResponse(BaseModel):
    generated_text: str
    tokens_count: int

# API endpoint
@app.post("/generate", response_model=GenerateResponse)
async def generate(request: GenerateRequest):
    try:
        # Encode the input
        inputs = tokenizer(request.prompt, return_tensors="pt").to(model.device)

        # Generate
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=request.max_tokens,
                temperature=request.temperature,
                top_p=request.top_p,
                do_sample=True,
            )

        # Decode the output
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

        return GenerateResponse(
            generated_text=generated_text,
            tokens_count=len(outputs[0])
        )

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# Health-check endpoint
@app.get("/health")
async def health():
    return {"status": "healthy"}

# Run the server
if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

# Start the service
python app.py

# Test the API with curl
curl -X POST "http://localhost:8000/generate" \
     -H "Content-Type: application/json" \
     -d '{"prompt": "What is AI?", "max_tokens": 50}'

7.3 High-Performance Serving with vLLM

from vllm import LLM, SamplingParams

# Initialize vLLM (it optimizes batching and the KV cache automatically)
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=1,  # number of GPUs
    dtype="half",            # FP16
    max_model_len=2048,      # maximum sequence length
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=100
)

# Batch inference (vLLM schedules it optimally)
prompts = [
    "What is AI?",
    "Explain machine learning",
    "What is deep learning?"
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}")
    print("-" * 50)

# Performance comparison:
# HuggingFace (one request at a time): ~100 tokens/s
# vLLM (batched):                      ~500 tokens/s ✓ 5x faster!
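vLLM also ships an OpenAI-compatible HTTP server, which is often the easiest way to put it behind existing clients; a typical invocation (flags vary by vLLM version, so check --help):

# Launch an OpenAI-compatible endpoint on port 8000
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --port 8000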

7.4 Optimized Deployment with ONNX

# Step 1: export to ONNX
import torch.onnx

model.eval()
dummy_input = torch.randn(1, 784)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)

# Step 2: optimize the ONNX model
# (this optimizer targets Transformer architectures such as BERT/GPT-2;
#  for a plain MLP, step 3 alone is enough)
from onnxruntime.transformers import optimizer

optimized_model = optimizer.optimize_model(
    "model.onnx",
    model_type='bert',  # or 'gpt2'
    num_heads=12,
    hidden_size=768
)
optimized_model.save_model_to_file("model_optimized.onnx")

# Step 3: inference with ONNX Runtime
import onnxruntime as ort
import numpy as np

# Create an inference session
session = ort.InferenceSession(
    "model_optimized.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

# Inference
input_data = np.random.randn(1, 784).astype(np.float32)
outputs = session.run(None, {'input': input_data})

print(f"Output shape: {outputs[0].shape}")

# Typical gains:
# PyTorch: baseline
# ONNX Runtime: 1.5-3x faster ✓

7.5 Containerized Deployment with Docker

# Dockerfile
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

# Install Python
RUN apt-get update && apt-get install -y python3.10 python3-pip

# Set the working directory
WORKDIR /app

# Copy the dependency list
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy the application code
COPY . .

# Download the model (or mount it from outside)
RUN python3 download_model.py

# Expose the port
EXPOSE 8000

# Startup command
CMD ["python3", "app.py"]

# Build the image
docker build -t llm-api:latest .

# Run the container
docker run -d \
  --gpus all \
  -p 8000:8000 \
  -v /path/to/models:/models \
  --name llm-service \
  llm-api:latest

# Tail the logs
docker logs -f llm-service

8. Performance Optimization Tips

8.1 Speeding Up Inference

# Tip 1: torch.compile (PyTorch 2.0+)
import torch

model = ImageClassifier()
model = torch.compile(model)  # automatic optimization

# Typical inference speedup: 30%-50%

# Tip 2: half precision
model = model.half()  # FP16
input_data = input_data.half()

# Tip 3: torch.inference_mode()
with torch.inference_mode():  # faster than no_grad
    output = model(input_data)

# Tip 4: batched inference
# Bad: one sample at a time
for data in dataset:
    output = model(data.unsqueeze(0))

# Good: one batch at a time
batch_data = torch.stack([data for data in dataset[:32]])
outputs = model(batch_data)

# Tip 5: KV cache (Transformer models)
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    use_cache=True  # enable the KV cache
)

8.2 Memory Optimization

# Tip 1: gradient checkpointing
from torch.utils.checkpoint import checkpoint

class MemoryEfficientModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerBlock() for _ in range(24)
        ])

    def forward(self, x):
        for layer in self.layers:
            x = checkpoint(layer, x)  # recompute activations instead of storing them
        return x

# Memory saved: 30%-50%, at a 10%-20% speed cost

# Tip 2: mixed-precision training
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for data, target in train_loader:
    optimizer.zero_grad()

    with autocast():  # automatically mixes FP16/FP32
        output = model(data)
        loss = criterion(output, target)

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# Tip 3: freeing memory
import gc

del model  # drop the reference
gc.collect()  # garbage-collect
torch.cuda.empty_cache()  # release cached CUDA memory

8.3 Multi-GPU Training

# Option 1: DataParallel (simple but less efficient)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

# Option 2: DistributedDataParallel (recommended)
import os
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize the process group
dist.init_process_group(backend='nccl')

# Select the device
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Wrap the model
model = model.to(local_rank)
model = DDP(model, device_ids=[local_rank])

# Use a distributed sampler
from torch.utils.data.distributed import DistributedSampler
sampler = DistributedSampler(dataset)
train_loader = DataLoader(dataset, sampler=sampler)

# Launch with:
# torchrun --nproc_per_node=4 train.py

9. Common Problems and Solutions

9.1 Training Problems

Q1: The loss won't go down. What now?

# Troubleshooting checklist:

# 1. Check the learning rate
print(f"Current learning rate: {optimizer.param_groups[0]['lr']}")
# Too high: the loss oscillates → divide by 10
# Too low: the loss barely moves → multiply by 10

# 2. Check the data
for data, target in train_loader:
    print(f"Data range: [{data.min():.2f}, {data.max():.2f}]")
    print(f"Label distribution: {torch.bincount(target)}")
    break
# Make sure the data is normalized and the labels are balanced

# 3. Check the gradients
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: gradient norm = {param.grad.norm().item():.4f}")
# Exploding gradients: >10 → use gradient clipping
# Vanishing gradients: <0.0001 → check activations and initialization

# 4. Apply gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Q2: Out of GPU memory (OOM). What now?

# Remedies:

# 1. Reduce the batch size
train_loader = DataLoader(dataset, batch_size=16)  # down from 32

# 2. Use gradient accumulation
accumulation_steps = 4
for i, (data, target) in enumerate(train_loader):
    loss = criterion(model(data), target) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# 3. Enable gradient checkpointing
from torch.utils.checkpoint import checkpoint

# 4. Use mixed precision
from torch.cuda.amp import autocast
with autocast():
    output = model(data)

# 5. Free the cache
torch.cuda.empty_cache()

Q3: The model overfits. What now?

# Symptom: high training accuracy, low validation accuracy

# Remedies:

# 1. Increase dropout
model.dropout = nn.Dropout(0.5)  # raise from 0.3 to 0.5

# 2. Data augmentation
from torchvision import transforms
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

# 3. Weight decay
optimizer = optim.Adam(model.parameters(), weight_decay=1e-4)

# 4. Early stopping
# 5. Collect more data

9.2 Deployment Problems

Q1: Inference is slow. What now?

# Optimization strategies:

# 1. Quantize the model
quantized_model = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# 2. Batch the inference
inputs = torch.stack([data for data in batch])
outputs = model(inputs)

# 3. Use TorchScript
scripted_model = torch.jit.script(model)

# 4. Use ONNX Runtime
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")

# 5. Use a dedicated inference engine
from vllm import LLM
llm = LLM(model="your-model")

Q2: The model fails to load?

# Common errors and fixes:

# Error 1: RuntimeError: Error(s) in loading state_dict
# Cause: mismatched model structure
# Fix: check the model definition, or load with strict=False
model.load_state_dict(torch.load("model.pth"), strict=False)

# Error 2: model file not found
# Fix: use an absolute path
import os
model_path = os.path.join(os.getcwd(), "models", "model.pth")

# Error 3: CUDA out of memory
# Fix: load onto the CPU
model = torch.load("model.pth", map_location='cpu')

9.3 Handy Debugging Tricks

# Trick 1: print the model structure
from torchsummary import summary
summary(model, input_size=(3, 224, 224))

# Trick 2: plot the training curves
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(train_losses, label='Train Loss')
plt.plot(val_losses, label='Val Loss')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(val_accuracies, label='Val Accuracy')
plt.legend()
plt.show()

# Trick 3: TensorBoard
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('runs/experiment_1')
for epoch in range(num_epochs):
    train(...)
    writer.add_scalar('Loss/train', train_loss, epoch)
    writer.add_scalar('Loss/val', val_loss, epoch)
    writer.add_scalar('Accuracy/val', val_acc, epoch)

# Launch TensorBoard: tensorboard --logdir=runs

# Trick 4: monitoring parameters and gradients
def monitor_gradients(model):
    for name, param in model.named_parameters():
        if param.grad is not None:
            grad_norm = param.grad.norm().item()
            param_norm = param.norm().item()
            print(f"{name:30s} | Grad: {grad_norm:8.4f} | Param: {param_norm:8.4f}")

📖 Appendix: Quick Reference

Common Commands

# Install dependencies
pip install torch torchvision transformers peft bitsandbytes accelerate

# Inspect the GPU
nvidia-smi

# Check the PyTorch version
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"

# Clear the CUDA cache (run this inside your own process to have any effect)
python -c "import torch; torch.cuda.empty_cache()"

# Launch TensorBoard
tensorboard --logdir=runs --port=6006

# Multi-GPU training
torchrun --nproc_per_node=4 train.py

# Convert model formats: there is no torch.onnx CLI;
# call torch.onnx.export from Python instead (see sections 1.8 and 7.4)

Recommended Learning Resources

  1. Official documentation

  2. Hands-on tutorials

  3. Community resources

Hardware Recommendations

Task                    | Recommended GPU | VRAM | Notes
------------------------|-----------------|------|-------------------------
Small-model training    | GTX 1660 Ti     | 6GB  | Entry level
Mid-size model training | RTX 3060        | 12GB | Good value
Large-model fine-tuning | RTX 4090        | 24GB | Strongest consumer card
Large-model training    | A100            | 80GB | Professional grade
Inference serving       | T4              | 16GB | Common in the cloud

🎯 Where to Go Next

Beginner (1-2 weeks)

  1. ✅ Understand all the basic concepts in this guide
  2. ✅ Train a simple image-classification model
  3. ✅ Practice saving and loading models

Intermediate (1-2 months)

  1. ✅ Fine-tune a pretrained model
  2. ✅ Implement LoRA fine-tuning
  3. ✅ Deploy a simple API service

Advanced (3-6 months)

  1. ✅ Multi-GPU distributed training
  2. ✅ Model quantization and optimization
  3. ✅ Production deployment

Happy learning! Refer back to this guide whenever you get stuck 📚

Last updated: 2026-01-28