🎓 AI Model Training, Fine-Tuning, and Deployment: A Complete Tutorial


A hands-on deep learning guide, from scratch to production


📚 Table of Contents

  1. Model File Formats in Depth
  2. Model Architecture and Layers
  3. Model Parameters and Weights
  4. Model Quantization
  5. Training Basics
  6. Fine-Tuning Techniques
  7. Deployment in Practice
  8. Performance Optimization Tips
  9. Common Problems and Solutions

1. Model File Formats in Depth

1.1 Format Evolution and Selection

Different model formats serve different purposes:

📦 The evolution of model file formats

Generation 1: Pickle (.pkl)
├── Pros: native Python support, simple to use
└── Cons: unsafe (can execute arbitrary code), Python-only

Generation 2: framework-native formats (.pth, .h5)
├── Pros: framework-optimized, full-featured
└── Cons: framework lock-in, hard to move across platforms

Generation 3: cross-framework formats (ONNX)
├── Pros: framework-agnostic, hardware optimizations
└── Cons: some operators unsupported

Generation 4: safe and efficient formats (SafeTensors)
├── Pros: safe, fast, general-purpose
└── Cons: relatively new

Generation 5: inference-optimized formats (GGUF, TensorRT)
├── Pros: maximum performance
└── Cons: tied to specific hardware/scenarios

1.2 Comparison of Mainstream Model Formats

Format         | Extension     | Cross-platform | Safety   | Load speed | File size         | Production use
---------------|---------------|----------------|----------|------------|-------------------|------------------------
PyTorch native | .pt, .pth     | ⚠️             | Low      | Medium     | Baseline          | —
SafeTensors    | .safetensors  | ✅             | High     | Fast       | Slightly smaller  | ✅ Strongly recommended
ONNX           | .onnx         | ✅             | High     | —          | Baseline          | ✅ Recommended
GGUF           | .gguf         | ✅             | High     | —          | Small (quantized) | ✅ CPU inference
TensorFlow     | / (directory) | ⚠️             | ✅       | Medium     | —                 | ⚠️ TF ecosystem
TorchScript    | .pt           | ⚠️             | ✅       | Medium     | Baseline          | ✅ PyTorch production
Pickle         | .pkl          | ❌             | Very low | —          | Baseline          | ❌ Development only

1.3 PyTorch Formats in Depth

1.3.1 The Three Ways to Save a Model

import torch
import torch.nn as nn

# Define an example model
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

model = SimpleNet()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# ============================================================
# Method 1: save the complete model (not recommended)
# ============================================================
torch.save(model, 'complete_model.pth')

# What gets saved:
# ├── a reference to the model class definition
# ├── the model structure
# ├── all parameter values
# └── Python object state

# Load
loaded_model = torch.load('complete_model.pth')

# ⚠️ Drawbacks:
# 1. Large files (they include Python class information)
# 2. The original class definition must be available at load time
# 3. Python version changes can break loading
# 4. Unsafe (may execute malicious code)

# ============================================================
# Method 2: save only the state_dict (strongly recommended)
# ============================================================
torch.save(model.state_dict(), 'model_weights.pth')

# What gets saved (parameter values only):
# ├── fc1.weight: Tensor(128, 784)
# ├── fc1.bias: Tensor(128)
# ├── fc2.weight: Tensor(10, 128)
# └── fc2.bias: Tensor(10)

# Load
new_model = SimpleNet()  # a model instance must be created first
new_model.load_state_dict(torch.load('model_weights.pth'))

# ✅ Advantages:
# 1. Small files (parameter values only)
# 2. Flexible (weights can be loaded into a different model instance)
# 3. Good version compatibility
# ============================================================
# Method 3: save a checkpoint (recommended during training)
# ============================================================
checkpoint = {
    'epoch': 10,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': 0.5,
    'accuracy': 0.92,
    'config': {
        'learning_rate': 0.001,
        'batch_size': 32
    }
}
torch.save(checkpoint, 'checkpoint.pth')

# Load the checkpoint
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch']

# ✅ Advantage: restores the full training state, so training can resume from a checkpoint

1.3.2 Inside a PyTorch File

import zipfile

# A PyTorch .pth file is actually a ZIP archive
# Inspect its internal structure
with zipfile.ZipFile('model_weights.pth', 'r') as zip_file:
    print("The PyTorch file contains:")
    for file_info in zip_file.filelist:
        print(f"  📄 {file_info.filename:30s} {file_info.file_size:>10,} bytes")

# PyTorch file layout:
# ├── data.pkl          → pickle-serialized metadata
# ├── data/0            → binary data of the first tensor
# ├── data/1            → binary data of the second tensor
# └── ...

# ⚠️ This is exactly why PyTorch files are unsafe: Pickle can execute arbitrary Python code!
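If you do have to load a pickle-based checkpoint from a source you don't fully trust, recent PyTorch versions (1.13+) can restrict unpickling to plain tensors and primitive types; a minimal sketch:

# Safer loading: refuse anything except tensors/primitives (PyTorch 1.13+)
state_dict = torch.load('model_weights.pth', weights_only=True)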

1.3.3 Advanced Saving Tricks

import os
from datetime import datetime

# Trick 1: conditional saving (keep only the best model)
best_loss = float('inf')

def save_if_best(model, current_loss, path='best_model.pth'):
    global best_loss
    if current_loss < best_loss:
        best_loss = current_loss
        torch.save({
            'model_state_dict': model.state_dict(),
            'loss': current_loss,
            'timestamp': datetime.now().isoformat()
        }, path)
        print(f"✓ New best model saved (loss: {current_loss:.4f})")

# Trick 2: versioned saving (keep a history)
def save_with_version(model, epoch):
    filename = f'model_epoch_{epoch:03d}.pth'
    torch.save(model.state_dict(), filename)

    # Keep only the 5 most recent versions
    import glob
    models = sorted(glob.glob('model_epoch_*.pth'))
    if len(models) > 5:
        os.remove(models[0])

# Trick 3: saving and loading across devices
# Train on GPU, load on CPU afterwards
model_cpu = SimpleNet()
model_cpu.load_state_dict(
    torch.load('model_weights.pth', map_location='cpu')
)

# Trick 4: partial loading (load a subset of the weights)
pretrained_dict = torch.load('pretrained.pth')
model_dict = model.state_dict()

# Filter out keys that don't match
pretrained_dict = {k: v for k, v in pretrained_dict.items()
                   if k in model_dict and v.size() == model_dict[k].size()}

# Merge into the current model's dict
model_dict.update(pretrained_dict)
model.load_state_dict(model_dict)

1.4 The SafeTensors Format

SafeTensors is a next-generation model storage format developed by HuggingFace, designed for safety and performance.

1.4.1 Core Advantages

from safetensors.torch import save_file, load_file
from safetensors import safe_open

"""
✅ Safety:
  - A pure data format with no executable code
  - Loading a model cannot run malicious code
  - Immune to pickle deserialization attacks

✅ Speed:
  - Zero-copy loading (no data copies needed)
  - Memory-mapping (mmap) support
  - Lazy loading

✅ Simplicity:
  - The file header contains complete metadata
  - The structure can be inspected without parsing the whole file
  - Easy to debug and examine

✅ Cross-platform:
  - Framework-agnostic (PyTorch, TensorFlow, JAX, ...)
  - Language-agnostic (Python, Rust, JavaScript, ...)
"""

# Basic usage
state_dict = {
    'layer1.weight': torch.randn(100, 50),
    'layer1.bias': torch.randn(100),
    'layer2.weight': torch.randn(10, 100),
}

# Save as SafeTensors
save_file(state_dict, 'model.safetensors')

# Load SafeTensors
loaded_state_dict = load_file('model.safetensors')

1.4.2 SafeTensors File Layout

"""
SafeTensors file format:

┌─────────────────────────────────────────────┐
│ Header Size (8 bytes)                       │  ← header length
├─────────────────────────────────────────────┤
│ Header (JSON metadata)                      │  ← metadata
│ {                                           │
│   "layer1.weight": {                        │
│     "dtype": "F32",                         │
│     "shape": [100, 50],                     │
│     "data_offsets": [0, 20000]              │
│   },                                        │
│   ...                                       │
│ }                                           │
├─────────────────────────────────────────────┤
│ Tensor Data (aligned)                       │  ← the actual data
│ ┌─────────────────┐                         │
│ │ layer1.weight   │                         │
│ ├─────────────────┤                         │
│ │ layer1.bias     │                         │
│ └─────────────────┘                         │
└─────────────────────────────────────────────┘
"""

# Inspect the metadata (without loading any data)
with safe_open('model.safetensors', framework="pt") as f:
    print("SafeTensors metadata:")
    for key in f.keys():
        tensor_slice = f.get_slice(key)
        print(f"  • {key:20s} {tensor_slice.get_shape()} {tensor_slice.get_dtype()}")

1.4.3 Performance Comparison

import time

# Create test data
large_state_dict = {
    f'layer{i}.weight': torch.randn(1000, 1000)
    for i in range(10)
}

# Save with PyTorch
start = time.time()
torch.save(large_state_dict, 'large_model.pth')
pytorch_save_time = time.time() - start

# Save with SafeTensors
start = time.time()
save_file(large_state_dict, 'large_model.safetensors')
safetensors_save_time = time.time() - start

print(f"Save performance:")
print(f"  PyTorch:     {pytorch_save_time:.3f}s")
print(f"  SafeTensors: {safetensors_save_time:.3f}s")
print(f"  Speedup: {pytorch_save_time / safetensors_save_time:.2f}x")

# Load performance
start = time.time()
_ = torch.load('large_model.pth')
pytorch_load_time = time.time() - start

start = time.time()
_ = load_file('large_model.safetensors')
safetensors_load_time = time.time() - start

print(f"\nLoad performance:")
print(f"  PyTorch:     {pytorch_load_time:.3f}s")
print(f"  SafeTensors: {safetensors_load_time:.3f}s")
print(f"  Speedup: {pytorch_load_time / safetensors_load_time:.2f}x")

1.4.4 Advanced Features

# Feature 1: lazy loading
with safe_open('large_model.safetensors', framework="pt", device="cpu") as f:
    # Nothing has been loaded into memory yet;
    # only the tensors you ask for are read
    layer0_weight = f.get_tensor('layer0.weight')
    print(f"✓ Only the required tensor was loaded, saving memory")

# Feature 2: metadata storage
metadata = {
    'model_name': 'MyModel',
    'version': '1.0.0',
    'accuracy': '95.5%'
}

save_file(state_dict, 'model_with_metadata.safetensors', metadata=metadata)

# Read the metadata back
with safe_open('model_with_metadata.safetensors', framework="pt") as f:
    stored_metadata = f.metadata()
    print(f"Metadata: {stored_metadata}")

# Feature 3: PyTorch ↔ SafeTensors conversion
def convert_pytorch_to_safetensors(pth_path, safetensors_path):
    state_dict = torch.load(pth_path, map_location='cpu')
    if isinstance(state_dict, nn.Module):
        state_dict = state_dict.state_dict()
    save_file(state_dict, safetensors_path)
    print(f"✓ Conversion finished: {pth_path} → {safetensors_path}")

1.5 The ONNX Format

ONNX (Open Neural Network Exchange) is an open model representation format that supports deployment across frameworks and hardware.

1.5.1 ONNX Core Concepts

import torch.onnx
import onnx
import onnxruntime as ort

"""
What is ONNX?
  It represents a neural network as a computation graph:
  - Nodes: operators (such as Conv, Add, ReLU)
  - Edges: data flow (tensors)
  - Attributes: operator parameters

Why use ONNX?
  ✅ Framework-agnostic: PyTorch → ONNX → TensorFlow
  ✅ Hardware optimization: acceleration targeted at specific hardware
  ✅ Deployment-friendly: supported by many runtimes
  ✅ Production-ready: widely used in industry
"""

# Export to ONNX
model = SimpleNet()
dummy_input = torch.randn(1, 784)

torch.onnx.export(
    model,                          # the model
    dummy_input,                    # an example input
    'model.onnx',                   # output path
    export_params=True,             # export the parameters
    opset_version=11,               # ONNX operator-set version
    do_constant_folding=True,       # constant-folding optimization
    input_names=['input'],          # input name
    output_names=['output'],        # output name
    dynamic_axes={                  # dynamic dimensions (variable batch size)
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)

# Validate the ONNX model
onnx_model = onnx.load('model.onnx')
onnx.checker.check_model(onnx_model)
print("✓ ONNX model check passed")

1.5.2 Analyzing the ONNX Graph

import numpy as np

# Inspect the computation graph
graph = onnx_model.graph

# Inputs
print("Inputs:")
for input_tensor in graph.input:
    shape = [dim.dim_value if dim.dim_value > 0 else 'dynamic'
             for dim in input_tensor.type.tensor_type.shape.dim]
    print(f"  • {input_tensor.name}: {shape}")

# Nodes (operations)
print(f"\nGraph nodes: ({len(graph.node)} operations)")
for i, node in enumerate(graph.node[:10]):
    print(f"  {i+1}. {node.op_type:15s} → {node.output[0]}")

# Initializers (weights)
print(f"\nParameters: ({len(graph.initializer)} tensors)")
total_params = 0
for initializer in graph.initializer:
    shape = list(initializer.dims)
    num_params = np.prod(shape) if shape else 0
    total_params += num_params
    print(f"  • {initializer.name:30s} {shape}")
print(f"\nTotal parameters: {total_params:,}")

1.5.3 Inference with ONNX Runtime

import numpy as np

# Create an inference session
ort_session = ort.InferenceSession(
    'model.onnx',
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

print(f"Execution providers: {ort_session.get_providers()}")

# Run inference
input_data = np.random.randn(1, 784).astype(np.float32)
outputs = ort_session.run(None, {'input': input_data})

# Performance comparison
import time

# PyTorch inference
model.eval()
pytorch_input = torch.from_numpy(input_data)

start = time.time()
for _ in range(100):
    with torch.no_grad():
        _ = model(pytorch_input)
pytorch_time = (time.time() - start) / 100

# ONNX Runtime inference
start = time.time()
for _ in range(100):
    _ = ort_session.run(None, {'input': input_data})
onnx_time = (time.time() - start) / 100

print(f"\nPerformance comparison:")
print(f"  PyTorch:      {pytorch_time*1000:.3f} ms")
print(f"  ONNX Runtime: {onnx_time*1000:.3f} ms")
print(f"  Speedup: {pytorch_time / onnx_time:.2f}x")

1.6 The GGUF Format (CPU Inference for Large Models)

GGUF (GPT-Generated Unified Format) is the model format used by the llama.cpp project, optimized for CPU inference.

1.6.1 GGUF Highlights

"""
✅ CPU-optimized:
  - Designed specifically for CPU inference
  - Accelerated with SIMD instructions such as AVX2 and AVX512
  - Memory-mapping (mmap) support

✅ Quantization support:
  - Q4_0, Q4_1: 4-bit quantization
  - Q5_0, Q5_1: 5-bit quantization
  - Q8_0: 8-bit quantization
  - Dramatically smaller models

✅ Easy to deploy:
  - A single file contains everything
  - No Python environment required
  - Cross-platform (Windows, Linux, macOS)

Typical use cases:
  - Running LLaMA, Mistral, and other LLMs on a CPU
  - Embedded-device deployment
  - Edge computing
  - AI applications on personal computers
"""

1.6.2 Quantization Types Compared

"""
┌─────────┬─────────────┬────────────┬──────────────┐
│ Type    │ Bits/weight │ Model size │ Quality      │
├─────────┼─────────────┼────────────┼──────────────┤
│ F16     │ 16 bit      │ Baseline   │ Best         │
│ Q8_0    │ 8 bit       │ 50%        │ Excellent    │
│ Q6_K    │ 6 bit       │ 37.5%      │ Very good    │
│ Q5_1    │ 5 bit       │ 31.25%     │ Very good    │
│ Q4_1    │ 4 bit       │ 25%        │ Good         │
│ Q4_0    │ 4 bit       │ 25%        │ Acceptable   │
│ Q3_K_S  │ 3 bit       │ 18.75%     │ Poor         │
│ Q2_K    │ 2 bit       │ 12.5%      │ Experimental │
└─────────┴─────────────┴────────────┴──────────────┘

Example: LLaMA-7B model sizes
- Original (FP32): ~28 GB
- F16: ~14 GB
- Q8_0: ~7 GB
- Q4_0: ~3.5 GB ← recommended
- Q2_K: ~1.8 GB
"""

1.6.3 Using a GGUF Model

# Command-line inference
./main -m llama-7b-q4_0.gguf \
    -p "What is artificial intelligence?" \
    -n 128 \
    -t 8 \
    --temp 0.7

# Python API (via llama-cpp-python)
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="./llama-7b-q4_0.gguf",
    n_ctx=2048,        # context window
    n_threads=8,       # number of threads
    n_gpu_layers=0,    # layers offloaded to GPU (0 = pure CPU)
)

# Run inference
output = llm(
    "What is the meaning of life?",
    max_tokens=128,
    temperature=0.7,
    top_p=0.9
)

print(output['choices'][0]['text'])
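To produce a GGUF file yourself, the usual route is llama.cpp's own conversion and quantization tooling; a rough sketch (the script and binary names below vary between llama.cpp versions, so treat them as placeholders and check your checkout):

# Convert a HuggingFace checkpoint to GGUF (F16), then quantize to Q4_0
# (script/binary names differ across llama.cpp releases)
python convert_hf_to_gguf.py ./Llama-2-7b-hf --outfile llama-7b-f16.gguf
./llama-quantize llama-7b-f16.gguf llama-7b-q4_0.gguf Q4_0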

1.7 A Decision Tree for Choosing a Format

"""
                  Start: pick a model format
                          │
                          ▼
           ┌──────────────┴──────────────┐
           │    What is the use case?    │
           └──────────────┬──────────────┘
                          │
        ┌─────────────────┼─────────────────┐
        │                 │                 │
        ▼                 ▼                 ▼
 [Develop/Train]  [Share/Distribute]  [Deploy/Infer]
        │                 │                 │
        ▼                 ▼                 ▼
  .pth/.ckpt       .safetensors      choose by platform
  (checkpoints)    (safe & fast)            │
                                            │
                          ┌─────────────────┴────────────────┐
                          │                                  │
                          ▼                                  ▼
                   ┌──────────────┐                   ┌─────────────┐
                   │ Cloud/server │                   │ Edge device │
                   └──────────────┘                   └─────────────┘
                          │                                  │
              ┌───────────┼───────────┐                      │
              │           │           │                      │
              ▼           ▼           ▼                      ▼
          PyTorch     TensorFlow   Generic             CPU  vs  GPU
            │             │          │                  │        │
            ▼             ▼          ▼                  ▼        ▼
      .pth/.pt      SavedModel    .onnx              .gguf    .onnx
    + TorchScript     .h5      (ONNX Runtime)    (llama.cpp) (TensorRT)


Concrete recommendations:

┌──────────────────────────┬──────────────┬──────────────────────────┐
│ Scenario                 │ Recommended  │ Why                      │
├──────────────────────────┼──────────────┼──────────────────────────┤
│ Model development        │ .pth         │ Flexible, easy to debug  │
│ Training checkpoints     │ .ckpt        │ Includes optimizer state │
│ Model sharing            │ .safetensors │ Safe, fast loading       │
│ HuggingFace releases     │ .safetensors │ Officially recommended   │
│ Cross-framework deploy   │ .onnx        │ Framework-agnostic       │
│ PyTorch production       │ TorchScript  │ Performance-optimized    │
│ CPU LLM inference        │ .gguf        │ Heavily optimized        │
│ Nvidia GPU inference     │ TensorRT     │ Maximum performance      │
└──────────────────────────┴──────────────┴──────────────────────────┘
"""

1.8 Format Conversion in Practice

# PyTorch → SafeTensors
from safetensors.torch import save_file
save_file(model.state_dict(), 'model.safetensors')

# PyTorch → ONNX
torch.onnx.export(model, dummy_input, 'model.onnx')

# ONNX → TensorFlow (requires onnx-tf)
# pip install onnx-tf
# onnx-tf convert -i model.onnx -o model_tf

# SafeTensors → PyTorch
from safetensors.torch import load_file
state_dict = load_file('model.safetensors')
torch.save(state_dict, 'model.pth')

print("✓ Format conversion finished")

2. Model Architecture and Layers

2.1 The Layered Structure of a Neural Network

Input layer → hidden layers (stacked) → output layer
     ↓               ↓                       ↓
feature input  feature extraction     final prediction
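In PyTorch this layering maps directly onto a container such as nn.Sequential; a minimal sketch (the layer sizes here are arbitrary, chosen purely for illustration):

import torch.nn as nn

# input layer -> hidden layer -> output layer
net = nn.Sequential(
    nn.Linear(4, 16),   # feature input (4 features in)
    nn.ReLU(),          # non-linearity for feature extraction
    nn.Linear(16, 2),   # final prediction (2 classes out)
)
print(net)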

2.2 Common Layer Types

2.2.1 Fully Connected Layers (Linear/Dense)

import torch
import torch.nn as nn

# Define a fully connected layer
fc = nn.Linear(in_features=512, out_features=256)

# Real-world example: an image classifier
class ImageClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # A 28x28 image flattens to 784 pixels
        self.fc1 = nn.Linear(784, 512)    # layer 1: 784→512
        self.fc2 = nn.Linear(512, 256)    # layer 2: 512→256
        self.fc3 = nn.Linear(256, 10)     # layer 3: 256→10 classes

    def forward(self, x):
        x = x.view(-1, 784)  # flatten
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Inspect the model structure
model = ImageClassifier()
print(model)

# Count the parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
# Output: Total parameters: 535,818

2.2.2 Convolutional Layers

# A 2D convolution (for images)
conv2d = nn.Conv2d(
    in_channels=3,      # input channels (RGB = 3)
    out_channels=64,    # output channels (number of feature maps)
    kernel_size=3,      # 3x3 kernel
    stride=1,           # stride
    padding=1           # padding
)

# Real-world example: a CNN image classifier
class CNNClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Conv block 1
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)   # 32x32x3 → 32x32x32
        self.pool1 = nn.MaxPool2d(2, 2)               # 32x32x32 → 16x16x32

        # Conv block 2
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)  # 16x16x32 → 16x16x64
        self.pool2 = nn.MaxPool2d(2, 2)               # 16x16x64 → 8x8x64

        # Fully connected layers
        self.fc1 = nn.Linear(8 * 8 * 64, 512)
        self.fc2 = nn.Linear(512, 10)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = self.pool1(x)
        x = torch.relu(self.conv2(x))
        x = self.pool2(x)
        x = x.view(-1, 8 * 8 * 64)  # flatten
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

2.2.3 Attention Layers - the Core of the Transformer

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.qkv = nn.Linear(d_model, d_model * 3)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch_size, seq_len, d_model = x.shape

        # Produce Q, K, V
        qkv = self.qkv(x).reshape(batch_size, seq_len, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)

        # Compute attention scores
        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        attn = torch.softmax(scores, dim=-1)

        # Apply attention to the values
        out = torch.matmul(attn, v)
        out = out.permute(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)
        out = self.out(out)
        return out

# Real-world example: a Transformer block as used in GPT
class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model)
        )

    def forward(self, x):
        # Self-attention + residual connection
        x = x + self.attention(self.norm1(x))
        # Feed-forward network + residual connection
        x = x + self.ffn(self.norm2(x))
        return x
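A quick shape check of the block above (a batch of 2 sequences of length 16):

block = TransformerBlock(d_model=512, num_heads=8)
x = torch.randn(2, 16, 512)   # (batch, seq_len, d_model)
print(block(x).shape)         # torch.Size([2, 16, 512]) - shape is preserved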

2.3 Model Depth and Scale

Model      | Layers | Parameters | Typical use
-----------|--------|------------|-----------------------------
Small CNN  | 5-10   | 1M-10M     | Simple image classification
ResNet-50  | 50     | 25M        | Image recognition
BERT-Base  | 12     | 110M       | Text understanding
GPT-2      | 12-48  | 117M-1.5B  | Text generation
GPT-3      | 96     | 175B       | Large language model
LLaMA-7B   | 32     | 7B         | Open-source LLM
LLaMA-70B  | 80     | 70B        | High-end LLM

3. Model Parameters and Weights

3.1 What Are Model Parameters?

Model parameters are the values a neural network learns during training. They mainly consist of:

  • Weights: the strength of the connections between neurons
  • Biases: offsets that shift a layer's output

3.2 Hands-on: Inspecting and Analyzing Model Parameters

import torch
import torch.nn as nn

# Define a model
model = nn.Sequential(
    nn.Linear(10, 5),
    nn.ReLU(),
    nn.Linear(5, 2)
)

# Method 1: list all parameters
print("=" * 50)
print("All parameters:")
for name, param in model.named_parameters():
    print(f"{name:20s} | Shape: {str(param.shape):15s} | Count: {param.numel():,}")

# Example output:
# 0.weight            | Shape: torch.Size([5, 10]) | Count: 50
# 0.bias              | Shape: torch.Size([5])     | Count: 5
# 2.weight            | Shape: torch.Size([2, 5])  | Count: 10
# 2.bias              | Shape: torch.Size([2])     | Count: 2

# Method 2: count the parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

# Method 3: look at actual weight values
first_layer_weights = model[0].weight.data
print(f"\nSample of the first layer's weights:\n{first_layer_weights[:2, :3]}")

3.3 Parameter-Count Formulas

# Parameter count of a fully connected layer
def linear_params(in_features, out_features, bias=True):
    params = in_features * out_features
    if bias:
        params += out_features
    return params

# Parameter count of a convolutional layer
def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    if isinstance(kernel_size, int):
        kernel_size = (kernel_size, kernel_size)
    params = in_channels * out_channels * kernel_size[0] * kernel_size[1]
    if bias:
        params += out_channels
    return params

# Examples
print(f"Linear(784, 128) parameters: {linear_params(784, 128):,}")
# Output: 100,480

print(f"Conv2d(3, 64, 3) parameters: {conv2d_params(3, 64, 3):,}")
# Output: 1,792
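You can sanity-check these formulas against PyTorch itself:

import torch.nn as nn

conv = nn.Conv2d(3, 64, 3)
print(sum(p.numel() for p in conv.parameters()))  # 1792 - matches conv2d_params(3, 64, 3)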

3.4 Quick Parameter Estimates for Large Models

# Estimating LLaMA parameter counts
def estimate_llama_params(
    n_layers,           # number of layers
    d_model,            # hidden dimension
    vocab_size=32000    # vocabulary size
):
    # Embedding layer
    embedding = vocab_size * d_model

    # Per Transformer layer:
    # attention: 4 * d_model * d_model (Q, K, V, O)
    # FFN: 8 * d_model * d_model (usually a 4x expansion)
    per_layer = 12 * d_model * d_model

    total_params = embedding + (n_layers * per_layer)
    return total_params

# LLaMA-7B
params_7b = estimate_llama_params(n_layers=32, d_model=4096)
print(f"LLaMA-7B estimated parameters: {params_7b / 1e9:.1f}B")
# Output: LLaMA-7B estimated parameters: 6.6B

# LLaMA-13B
params_13b = estimate_llama_params(n_layers=40, d_model=5120)
print(f"LLaMA-13B estimated parameters: {params_13b / 1e9:.1f}B")
# Output: LLaMA-13B estimated parameters: 12.7B

4. Model Quantization

4.1 What Is Quantization?

Quantization converts a model's high-precision values (e.g. FP32) into low-precision ones (e.g. INT8). The goals, with a sketch of the underlying arithmetic after this list:

  • ✅ Smaller models (typically 2-4x compression)
  • ✅ Faster inference (INT8 arithmetic is faster than FP32)
  • ✅ Lower memory usage
  • ❌ Possibly a slight accuracy drop (usually <1%)
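Conceptually, affine INT8 quantization maps a float range onto 256 integer levels via a scale and a zero-point; a minimal sketch of the math (illustrative only, not PyTorch's internal implementation):

import torch

def quantize_int8(x: torch.Tensor):
    # Affine quantization: q = round(x / scale) + zero_point
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = qmin - torch.round(x.min() / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return q.to(torch.int8), scale, zero_point

def dequantize_int8(q, scale, zero_point):
    return (q.float() - zero_point) * scale

x = torch.randn(4, 4)
q, s, z = quantize_int8(x)
print((x - dequantize_int8(q, s, z)).abs().max())  # small quantization error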

4.2 Data Types Compared

Type | Bits   | Range       | Memory   | Precision   | Typical use
-----|--------|-------------|----------|-------------|--------------------------
FP32 | 32-bit | ±3.4×10³⁸   | Baseline | High        | Training
FP16 | 16-bit | ±65,504     | 50%      | Medium      | Mixed-precision training
BF16 | 16-bit | ±3.4×10³⁸   | 50%      | Medium-high | Training (recommended)
INT8 | 8-bit  | -128 to 127 | 25%      | Medium-low  | Inference
INT4 | 4-bit  | -8 to 7     | 12.5%    | Low         | Extreme compression
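The memory column can be verified directly from each dtype's element size:

import torch

for dt in (torch.float32, torch.float16, torch.bfloat16, torch.int8):
    print(dt, torch.tensor(0, dtype=dt).element_size(), "byte(s) per element")
# float32: 4, float16/bfloat16: 2, int8: 1 (INT4 packs two values per byte)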

4.3 Hands-on: Quantization in PyTorch

Example 1: Dynamic Quantization (the simplest)

import os
import torch
import torch.nn as nn
import torch.quantization as quantization

# The original model
model = ImageClassifier()
model.eval()

# Dynamic quantization (no training data needed)
quantized_model = torch.quantization.quantize_dynamic(
    model,                              # the original model
    {nn.Linear},                        # layer types to quantize
    dtype=torch.qint8                   # quantized dtype
)

# Compare model sizes
def get_model_size(model):
    torch.save(model.state_dict(), "temp.pth")
    size_mb = os.path.getsize("temp.pth") / (1024 * 1024)
    os.remove("temp.pth")
    return size_mb

original_size = get_model_size(model)
quantized_size = get_model_size(quantized_model)

print(f"Original model size: {original_size:.2f} MB")
print(f"Quantized model size: {quantized_size:.2f} MB")
print(f"Compression ratio: {original_size / quantized_size:.2f}x")

Example 2: Static Quantization (requires calibration)

# Prepare calibration data
calibration_data = torch.randn(100, 1, 28, 28)

# Insert quantization/dequantization observers
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(model, inplace=True)

# Calibrate (run representative data through the model)
model.eval()
with torch.no_grad():
    for data in calibration_data:
        model(data.unsqueeze(0))

# Convert to a quantized model
torch.quantization.convert(model, inplace=True)

print("Static quantization finished!")

Example 3: 4-bit Quantization with BitsAndBytes (common for LLMs)

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # enable 4-bit quantization
    bnb_4bit_quant_type="nf4",               # quantization type (nf4 recommended)
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute dtype
    bnb_4bit_use_double_quant=True,          # double quantization (extra compression)
)

# Load the quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Memory footprint comparison:
# FP32: ~28GB
# FP16: ~14GB
# 8-bit: ~7GB
# 4-bit: ~3.5GB

4.4 A Quantization Benchmark

import time
import torch

# Prepare test data
test_input = torch.randn(1000, 784)

# Benchmark helper
def benchmark(model, input_data, num_runs=100):
    model.eval()
    with torch.no_grad():
        # Warm-up
        for _ in range(10):
            _ = model(input_data[:1])

        # Timed runs
        start = time.time()
        for _ in range(num_runs):
            _ = model(input_data)
        end = time.time()

    return (end - start) / num_runs

# Compare a fresh FP32 model against the dynamically quantized one from Example 1
original_model = ImageClassifier().eval()
fp32_time = benchmark(original_model, test_input)
int8_time = benchmark(quantized_model, test_input)

print(f"FP32 inference time: {fp32_time*1000:.2f} ms")
print(f"INT8 inference time: {int8_time*1000:.2f} ms")
print(f"Speedup: {fp32_time/int8_time:.2f}x")

5. Training Basics

5.1 A Complete Training Loop

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Step 1: prepare the data
# Generate synthetic data (use a real dataset in practice)
X_train = torch.randn(1000, 784)
y_train = torch.randint(0, 10, (1000,))

# Create a data loader
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Step 2: define the model
model = ImageClassifier()

# Step 3: define the loss function and optimizer
criterion = nn.CrossEntropyLoss()  # the standard classification loss
optimizer = optim.Adam(model.parameters(), lr=0.001)  # Adam optimizer

# Step 4: the training loop
num_epochs = 10
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

for epoch in range(num_epochs):
    model.train()  # switch to training mode
    total_loss = 0
    correct = 0
    total = 0

    for batch_idx, (data, target) in enumerate(train_loader):
        # Move the data to the GPU
        data, target = data.to(device), target.to(device)

        # Forward pass
        optimizer.zero_grad()  # clear the gradients
        output = model(data)   # compute the output
        loss = criterion(output, target)  # compute the loss

        # Backward pass
        loss.backward()        # compute gradients
        optimizer.step()       # update the weights

        # Statistics
        total_loss += loss.item()
        _, predicted = output.max(1)
        total += target.size(0)
        correct += predicted.eq(target).sum().item()

        # Progress output
        if (batch_idx + 1) % 10 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], '
                  f'Step [{batch_idx+1}/{len(train_loader)}], '
                  f'Loss: {loss.item():.4f}')

    # End-of-epoch statistics
    avg_loss = total_loss / len(train_loader)
    accuracy = 100. * correct / total
    print(f'Epoch {epoch+1} done - avg loss: {avg_loss:.4f}, accuracy: {accuracy:.2f}%')

# Step 5: save the model
torch.save({
    'epoch': num_epochs,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': avg_loss,
}, 'trained_model.pth')

print("Training finished! Model saved.")

5.2 Key Concepts

5.2.1 Batch Size

# The impact of different batch sizes
batch_sizes = [8, 32, 128, 512]

"""
Batch Size | Memory  | Speed     | Gradient stability | Generalization
-----------------------------------------------------------------
8          | Low     | Slow      | Low (noisy)        | Good
32         | Low-mid | Medium    | Medium             | Fairly good (recommended)
128        | Mid-high| Fast      | High               | Moderate
512        | High    | Very fast | Very high          | May overfit

Rules of thumb:
- Small GPU memory: batch_size=16-32
- Large GPU memory: batch_size=64-128
- Large models: use gradient accumulation to simulate a large batch
"""

# Gradient accumulation (simulating a large batch)
accumulation_steps = 4  # accumulate over 4 batches

for i, (data, target) in enumerate(train_loader):
    output = model(data)
    loss = criterion(output, target)
    loss = loss / accumulation_steps  # scale the loss
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

5.2.2 Learning Rate

# Learning-rate schedulers
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR

# Option 1: step decay
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
# Every 30 epochs, multiply the learning rate by 0.1

# Option 2: cosine annealing
scheduler = CosineAnnealingLR(optimizer, T_max=100)
# The learning rate decays along a cosine curve

# Inside the training loop
for epoch in range(num_epochs):
    train(...)
    scheduler.step()  # update the learning rate

    # Print the current learning rate
    current_lr = optimizer.param_groups[0]['lr']
    print(f'Epoch {epoch}, Learning Rate: {current_lr}')

5.2.3 Overfitting and Regularization

# Technique 1: Dropout
class ModelWithDropout(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 512)
        self.dropout1 = nn.Dropout(0.5)  # randomly zero 50% of activations
        self.fc2 = nn.Linear(512, 256)
        self.dropout2 = nn.Dropout(0.3)  # 30% dropout
        self.fc3 = nn.Linear(256, 10)

    def forward(self, x):
        x = x.view(-1, 784)
        x = torch.relu(self.fc1(x))
        x = self.dropout1(x)  # active during training, disabled automatically at eval time
        x = torch.relu(self.fc2(x))
        x = self.dropout2(x)
        x = self.fc3(x)
        return x

# Technique 2: weight decay (L2 regularization)
optimizer = optim.Adam(
    model.parameters(),
    lr=0.001,
    weight_decay=1e-4  # L2 regularization coefficient
)

# Technique 3: early stopping
best_loss = float('inf')
patience = 5
counter = 0

for epoch in range(num_epochs):
    val_loss = validate(...)

    if val_loss < best_loss:
        best_loss = val_loss
        counter = 0
        torch.save(model.state_dict(), 'best_model.pth')
    else:
        counter += 1
        if counter >= patience:
            print("Early stopping triggered!")
            break

5.3 A Complete Training-Script Template

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import DataLoader
from tqdm import tqdm  # progress bar

class Trainer:
    def __init__(self, model, train_loader, val_loader, device='cuda'):
        self.model = model.to(device)
        self.train_loader = train_loader
        self.val_loader = val_loader
        self.device = device

        self.criterion = nn.CrossEntropyLoss()
        self.optimizer = optim.Adam(model.parameters(), lr=0.001)
        self.scheduler = StepLR(self.optimizer, step_size=10, gamma=0.1)

        self.train_losses = []
        self.val_losses = []
        self.val_accuracies = []

    def train_epoch(self):
        self.model.train()
        total_loss = 0

        for data, target in tqdm(self.train_loader, desc="Training"):
            data, target = data.to(self.device), target.to(self.device)

            self.optimizer.zero_grad()
            output = self.model(data)
            loss = self.criterion(output, target)
            loss.backward()
            self.optimizer.step()

            total_loss += loss.item()

        return total_loss / len(self.train_loader)

    def validate(self):
        self.model.eval()
        total_loss = 0
        correct = 0
        total = 0

        with torch.no_grad():
            for data, target in tqdm(self.val_loader, desc="Validation"):
                data, target = data.to(self.device), target.to(self.device)

                output = self.model(data)
                loss = self.criterion(output, target)

                total_loss += loss.item()
                _, predicted = output.max(1)
                total += target.size(0)
                correct += predicted.eq(target).sum().item()

        avg_loss = total_loss / len(self.val_loader)
        accuracy = 100. * correct / total
        return avg_loss, accuracy

    def fit(self, num_epochs):
        for epoch in range(num_epochs):
            print(f"\nEpoch {epoch+1}/{num_epochs}")

            # Train
            train_loss = self.train_epoch()
            self.train_losses.append(train_loss)

            # Validate
            val_loss, val_acc = self.validate()
            self.val_losses.append(val_loss)
            self.val_accuracies.append(val_acc)

            # Update the learning rate
            self.scheduler.step()

            # Print results
            print(f"Train Loss: {train_loss:.4f}")
            print(f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.2f}%")

            # Save the best model
            if val_loss == min(self.val_losses):
                torch.save(self.model.state_dict(), 'best_model.pth')
                print("✓ Best model saved!")

# Usage
trainer = Trainer(model, train_loader, val_loader)
trainer.fit(num_epochs=50)

6. Fine-Tuning Techniques

6.1 What Is Fine-Tuning?

Fine-tuning continues training a pretrained model on a small amount of task-specific data, adapting it to a new task (a minimal sketch of the classic approach follows).

Pretrained model (general capability) + fine-tuning (specific task) = customized model
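The classic form freezes the pretrained backbone and trains only a small task head; a minimal sketch using torchvision's ResNet-18 (the model and the 5-class head are chosen purely for illustration; the weights API requires torchvision 0.13+):

import torch.nn as nn
from torchvision import models

# Load a pretrained backbone and freeze all of its weights
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head; only this new layer gets trained
model.fc = nn.Linear(model.fc.in_features, 5)  # e.g. 5 target classes

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")  # just the new head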

6.2 Fine-Tuning Methods Compared

Method           | Trainable params | Memory   | Speed     | Quality   | Best for
-----------------|------------------|----------|-----------|-----------|---------------------
Full fine-tuning | 100%             | High     | Slow      | Best      | Abundant data
LoRA             | 0.1%-1%          | Low      | Fast      | Excellent | Recommended default
QLoRA            | 0.1%-1%          | Very low | Fast      | Excellent | Consumer GPUs
Adapter          | 2%-5%            | Low      | Fast      | Good      | Multi-task switching
Prompt Tuning    | <0.01%           | Very low | Very fast | Moderate  | Few-shot learning
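Why LoRA trains so few parameters: instead of updating a full weight matrix W, it learns a low-rank update ΔW = BA with rank r much smaller than the layer width. A minimal, illustrative sketch (not the PEFT library's implementation):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base Linear plus a trainable low-rank update (illustrative)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze W
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # y = Wx + scale * B(Ax)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"{trainable:,} / {total:,} = {100*trainable/total:.2f}% trainable")  # ≈0.4%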

6.3 Hands-on: LoRA Fine-Tuning

Example 1: LoRA fine-tuning with the PEFT library

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset

# Step 1: load the base model
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,  # 8-bit quantization to save memory
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Step 2: configure LoRA
lora_config = LoraConfig(
    r=8,                              # LoRA rank
    lora_alpha=32,                    # LoRA scaling parameter
    target_modules=["q_proj", "v_proj"],  # layers to apply LoRA to
    lora_dropout=0.05,                # dropout rate
    bias="none",                      # how to handle biases
    task_type=TaskType.CAUSAL_LM      # task type
)

# Step 3: apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06%

# Step 4: prepare the data
dataset = load_dataset("json", data_files="your_data.jsonl")

def preprocess(examples):
    texts = [f"### Question: {q}\n### Answer: {a}"
             for q, a in zip(examples['question'], examples['answer'])]
    return tokenizer(texts, truncation=True, padding='max_length', max_length=512)

tokenized_dataset = dataset.map(preprocess, batched=True)

# Step 5: configure the training run
training_args = TrainingArguments(
    output_dir="./lora_output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size = 16
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,                      # mixed-precision training
    logging_steps=10,
    save_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
)

# Step 6: train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
)

trainer.train()

# Step 7: save the LoRA weights (only a few MB)
model.save_pretrained("./lora_weights")

Example 2: QLoRA Fine-Tuning (even less memory)

import torch
from transformers import BitsAndBytesConfig

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load the quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare the model for k-bit training
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)

# Apply LoRA (same as above)
model = get_peft_model(model, lora_config)

# Memory footprint comparison:
# Full fine-tuning (FP32): ~28GB VRAM + ~28GB gradients/optimizer = 56GB
# LoRA (FP16): ~14GB + ~2GB = 16GB
# QLoRA (4-bit): ~3.5GB + ~2GB = 5.5GB ✓ fits on a consumer GPU!

6.4 Preparing Fine-Tuning Data

# Example data format (JSONL)
"""
{"instruction": "Translate the following English into Chinese", "input": "Hello world", "output": "你好世界"}
{"instruction": "Summarize the following text", "input": "...", "output": "..."}
"""

# Preprocessing helper
def format_instruction(sample):
    """Render a sample with the instruction template"""
    return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{sample['instruction']}

### Input:
{sample['input']}

### Response:
{sample['output']}"""

# Data quality checks
def check_data_quality(dataset):
    print(f"Dataset size: {len(dataset)}")
    print(f"Average length: {sum(len(d['output']) for d in dataset) / len(dataset):.0f} characters")

    # Check for empty outputs
    empty_count = sum(1 for d in dataset if not d['output'].strip())
    print(f"Empty outputs: {empty_count}")

    # Check the length distribution
    lengths = [len(d['output']) for d in dataset]
    print(f"Shortest: {min(lengths)}, longest: {max(lengths)}")

# Recommended data volumes:
# - Full fine-tuning: 10,000+ samples
# - LoRA: 1,000+ samples
# - Few-shot: 100+ samples

6.5 Inference with the Fine-Tuned Model

from peft import PeftModel

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load the LoRA weights
model = PeftModel.from_pretrained(base_model, "./lora_weights")

# Merge the weights (optional, for deployment)
model = model.merge_and_unload()

# Inference
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
input_text = "### Question: What is AI?\n### Answer:"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(result)

7. Deployment in Practice

7.1 Deployment Options Compared

Method        | Best for              | Latency  | Throughput | Difficulty
--------------|-----------------------|----------|------------|-----------
Local Python  | Development & testing | —        | —          | Easy
Flask/FastAPI | Small-scale services  | —        | —          | Moderate
TorchServe    | Production            | —        | —          | Moderate
ONNX Runtime  | Cross-platform        | —        | —          | Moderate
TensorRT      | Nvidia GPUs           | Very low | Very high  | Hard
vLLM          | LLM inference         | —        | Very high  | Easy

7.2 Hands-on: Deploying with FastAPI

# app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import uvicorn

# Initialize FastAPI
app = FastAPI(title="LLM API", version="1.0")

# Load the model globally (once, at startup)
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print("Model loaded!")

# Request/response schemas
class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 100
    temperature: float = 0.7
    top_p: float = 0.9

class GenerateResponse(BaseModel):
    generated_text: str
    tokens_count: int

# API endpoint
@app.post("/generate", response_model=GenerateResponse)
async def generate(request: GenerateRequest):
    try:
        # Encode the input
        inputs = tokenizer(request.prompt, return_tensors="pt").to(model.device)

        # Generate
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=request.max_tokens,
                temperature=request.temperature,
                top_p=request.top_p,
                do_sample=True,
            )

        # Decode the output
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

        return GenerateResponse(
            generated_text=generated_text,
            tokens_count=len(outputs[0])
        )

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# Health-check endpoint
@app.get("/health")
async def health():
    return {"status": "healthy"}

# Run the server
if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

# Start the service
python app.py

# Test the API with curl
curl -X POST "http://localhost:8000/generate" \
     -H "Content-Type: application/json" \
     -d '{"prompt": "What is AI?", "max_tokens": 50}'

7.3 High-Performance Serving with vLLM

from vllm import LLM, SamplingParams

# Initialize vLLM (it optimizes batching and the KV cache automatically)
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=1,  # number of GPUs
    dtype="half",            # FP16
    max_model_len=2048,      # maximum sequence length
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=100
)

# Batch inference (vLLM schedules it optimally)
prompts = [
    "What is AI?",
    "Explain machine learning",
    "What is deep learning?"
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}")
    print("-" * 50)

# Performance comparison:
# HuggingFace (one request at a time): ~100 tokens/s
# vLLM (batched):                      ~500 tokens/s ✓ 5x faster!
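vLLM also ships an OpenAI-compatible HTTP server, which is often the easiest way to put it behind existing clients; a typical invocation (flags vary by vLLM version, so check --help):

# Launch an OpenAI-compatible endpoint on port 8000
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --port 8000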

7.4 Optimized Deployment with ONNX

# Step 1: export to ONNX
import torch.onnx

model.eval()
dummy_input = torch.randn(1, 784)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)

# Step 2: optimize the ONNX model
# (this optimizer targets Transformer architectures such as BERT/GPT-2;
#  for a plain MLP, step 3 alone is enough)
from onnxruntime.transformers import optimizer

optimized_model = optimizer.optimize_model(
    "model.onnx",
    model_type='bert',  # or 'gpt2'
    num_heads=12,
    hidden_size=768
)
optimized_model.save_model_to_file("model_optimized.onnx")

# Step 3: inference with ONNX Runtime
import onnxruntime as ort
import numpy as np

# Create an inference session
session = ort.InferenceSession(
    "model_optimized.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

# Inference
input_data = np.random.randn(1, 784).astype(np.float32)
outputs = session.run(None, {'input': input_data})

print(f"Output shape: {outputs[0].shape}")

# Typical gains:
# PyTorch: baseline
# ONNX Runtime: 1.5-3x faster ✓

7.5 Containerized Deployment with Docker

# Dockerfile
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

# Install Python
RUN apt-get update && apt-get install -y python3.10 python3-pip

# Set the working directory
WORKDIR /app

# Copy the dependency list
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy the application code
COPY . .

# Download the model (or mount it from outside)
RUN python3 download_model.py

# Expose the port
EXPOSE 8000

# Startup command
CMD ["python3", "app.py"]

# Build the image
docker build -t llm-api:latest .

# Run the container
docker run -d \
  --gpus all \
  -p 8000:8000 \
  -v /path/to/models:/models \
  --name llm-service \
  llm-api:latest

# Tail the logs
docker logs -f llm-service

8. Performance Optimization Tips

8.1 Speeding Up Inference

# Tip 1: torch.compile (PyTorch 2.0+)
import torch

model = ImageClassifier()
model = torch.compile(model)  # automatic optimization

# Typical inference speedup: 30%-50%

# Tip 2: half precision
model = model.half()  # FP16
input_data = input_data.half()

# Tip 3: torch.inference_mode()
with torch.inference_mode():  # faster than no_grad
    output = model(input_data)

# Tip 4: batched inference
# Bad: one sample at a time
for data in dataset:
    output = model(data.unsqueeze(0))

# Good: one batch at a time
batch_data = torch.stack([data for data in dataset[:32]])
outputs = model(batch_data)

# Tip 5: KV cache (Transformer models)
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    use_cache=True  # enable the KV cache
)

8.2 Memory Optimization

# Tip 1: gradient checkpointing
from torch.utils.checkpoint import checkpoint

class MemoryEfficientModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerBlock() for _ in range(24)
        ])

    def forward(self, x):
        for layer in self.layers:
            x = checkpoint(layer, x)  # recompute activations instead of storing them
        return x

# Memory saved: 30%-50%, at a 10%-20% speed cost

# Tip 2: mixed-precision training
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for data, target in train_loader:
    optimizer.zero_grad()

    with autocast():  # automatically mixes FP16/FP32
        output = model(data)
        loss = criterion(output, target)

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# Tip 3: freeing memory
import gc

del model  # drop the reference
gc.collect()  # garbage-collect
torch.cuda.empty_cache()  # release cached CUDA memory

8.3 Multi-GPU Training

# Option 1: DataParallel (simple but less efficient)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

# Option 2: DistributedDataParallel (recommended)
import os
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize the process group
dist.init_process_group(backend='nccl')

# Select the device
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Wrap the model
model = model.to(local_rank)
model = DDP(model, device_ids=[local_rank])

# Use a distributed sampler
from torch.utils.data.distributed import DistributedSampler
sampler = DistributedSampler(dataset)
train_loader = DataLoader(dataset, sampler=sampler)

# Launch with:
# torchrun --nproc_per_node=4 train.py

9. Common Problems and Solutions

9.1 Training Problems

Q1: The loss won't go down. What now?

# Troubleshooting checklist:

# 1. Check the learning rate
print(f"Current learning rate: {optimizer.param_groups[0]['lr']}")
# Too high: the loss oscillates → divide by 10
# Too low: the loss barely moves → multiply by 10

# 2. Check the data
for data, target in train_loader:
    print(f"Data range: [{data.min():.2f}, {data.max():.2f}]")
    print(f"Label distribution: {torch.bincount(target)}")
    break
# Make sure the data is normalized and the labels are balanced

# 3. Check the gradients
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: gradient norm = {param.grad.norm().item():.4f}")
# Exploding gradients: >10 → use gradient clipping
# Vanishing gradients: <0.0001 → check activations and initialization

# 4. Apply gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Q2: Out of GPU memory (OOM). What now?

# Remedies:

# 1. Reduce the batch size
train_loader = DataLoader(dataset, batch_size=16)  # down from 32

# 2. Use gradient accumulation
accumulation_steps = 4
for i, (data, target) in enumerate(train_loader):
    loss = criterion(model(data), target) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# 3. Enable gradient checkpointing
from torch.utils.checkpoint import checkpoint

# 4. Use mixed precision
from torch.cuda.amp import autocast
with autocast():
    output = model(data)

# 5. Free the cache
torch.cuda.empty_cache()

Q3: The model overfits. What now?

# Symptom: high training accuracy, low validation accuracy

# Remedies:

# 1. Increase dropout
model.dropout = nn.Dropout(0.5)  # raise from 0.3 to 0.5

# 2. Data augmentation
from torchvision import transforms
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

# 3. Weight decay
optimizer = optim.Adam(model.parameters(), weight_decay=1e-4)

# 4. Early stopping
# 5. Collect more data

9.2 Deployment Problems

Q1: Inference is slow. What now?

# Optimization strategies:

# 1. Quantize the model
quantized_model = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# 2. Batch the inference
inputs = torch.stack([data for data in batch])
outputs = model(inputs)

# 3. Use TorchScript
scripted_model = torch.jit.script(model)

# 4. Use ONNX Runtime
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")

# 5. Use a dedicated inference engine
from vllm import LLM
llm = LLM(model="your-model")

Q2: The model fails to load?

# Common errors and fixes:

# Error 1: RuntimeError: Error(s) in loading state_dict
# Cause: mismatched model structure
# Fix: check the model definition, or load with strict=False
model.load_state_dict(torch.load("model.pth"), strict=False)

# Error 2: model file not found
# Fix: use an absolute path
import os
model_path = os.path.join(os.getcwd(), "models", "model.pth")

# Error 3: CUDA out of memory
# Fix: load onto the CPU
model = torch.load("model.pth", map_location='cpu')

9.3 Handy Debugging Tricks

# Trick 1: print the model structure
from torchsummary import summary
summary(model, input_size=(3, 224, 224))

# Trick 2: plot the training curves
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(train_losses, label='Train Loss')
plt.plot(val_losses, label='Val Loss')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(val_accuracies, label='Val Accuracy')
plt.legend()
plt.show()

# Trick 3: TensorBoard
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('runs/experiment_1')
for epoch in range(num_epochs):
    train(...)
    writer.add_scalar('Loss/train', train_loss, epoch)
    writer.add_scalar('Loss/val', val_loss, epoch)
    writer.add_scalar('Accuracy/val', val_acc, epoch)

# Launch TensorBoard: tensorboard --logdir=runs

# Trick 4: monitoring parameters and gradients
def monitor_gradients(model):
    for name, param in model.named_parameters():
        if param.grad is not None:
            grad_norm = param.grad.norm().item()
            param_norm = param.norm().item()
            print(f"{name:30s} | Grad: {grad_norm:8.4f} | Param: {param_norm:8.4f}")

📖 Appendix: Quick Reference

Common Commands

# Install dependencies
pip install torch torchvision transformers peft bitsandbytes accelerate

# Inspect the GPU
nvidia-smi

# Check the PyTorch version
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"

# Clear the CUDA cache (run this inside your own process to have any effect)
python -c "import torch; torch.cuda.empty_cache()"

# Launch TensorBoard
tensorboard --logdir=runs --port=6006

# Multi-GPU training
torchrun --nproc_per_node=4 train.py

# Convert model formats: there is no torch.onnx CLI;
# call torch.onnx.export from Python instead (see sections 1.8 and 7.4)

Recommended Learning Resources

  1. Official documentation

  2. Hands-on tutorials

  3. Community resources

Hardware Recommendations

Task                    | Recommended GPU | VRAM | Notes
------------------------|-----------------|------|-------------------------
Small-model training    | GTX 1660 Ti     | 6GB  | Entry level
Mid-size model training | RTX 3060        | 12GB | Good value
Large-model fine-tuning | RTX 4090        | 24GB | Strongest consumer card
Large-model training    | A100            | 80GB | Professional grade
Inference serving       | T4              | 16GB | Common in the cloud

🎯 Where to Go Next

Beginner (1-2 weeks)

  1. ✅ Understand all the basic concepts in this guide
  2. ✅ Train a simple image-classification model
  3. ✅ Practice saving and loading models

Intermediate (1-2 months)

  1. ✅ Fine-tune a pretrained model
  2. ✅ Implement LoRA fine-tuning
  3. ✅ Deploy a simple API service

Advanced (3-6 months)

  1. ✅ Multi-GPU distributed training
  2. ✅ Model quantization and optimization
  3. ✅ Production deployment

Happy learning! Refer back to this guide whenever you get stuck 📚

Last updated: 2026-01-28