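LLM Fine-Tuning in Practice: Building Your Own Customized AI Assistant from Scratch

🎯 Preface: let's talk about fine-tuning

When I first got into fine-tuning large language models, I felt the way many people do: the whole thing seemed rather mysterious. The material online was scattered; it was either too academic to follow or read like an advertisement. After hitting quite a few pitfalls, I decided to write down what I learned and share it as plainly as I can.

🤔 What problems does this article help you solve?

In short, it gets you to the point where you can take a general-purpose large model and turn it into an assistant that understands your specific needs. No lofty jargon, just the practical questions:

Choosing the tech: with so many models around, which one is best to start with?
Environment setup: which Python and CUDA versions work together without trouble?
Data preparation: what format does the training data need, and how many examples are enough?
Training: how do you tune the parameters, and what do you do when GPU memory runs out?
Deployment: once training is done, how do you let other people use the model?

📝 My approach

The workflow I worked out may not be optimal, but it does run end to end: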

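text

Pick a small model (e.g. TinyLlama-1.1B) → prepare some simple training data →
fine-tune with LoRA → test the result → convert to GGUF → run it with Ollama

Once you have gone through the whole flow, you will find it is less complicated than it looks. You will still run into problems along the way, and that is why I wrote this article: to mark the places where people tend to get stuck.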

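🔧 What do you need?

Hardware:

An NVIDIA GPU is best (4 GB of VRAM or more is enough to get started)
Without a GPU you can still run on the CPU, just more slowly
16 GB of RAM or more is recommended

Software:

Windows 10/11 or Linux
Basic experience with the command line
A little Python (if you have none, you can still follow along by copying the examples)

Time:

The first full run-through takes roughly 3-4 hours
Solving the problems you hit along the way is part of the learning

🗺️ Learning roadmap

Here is the roadmap, drawn the way technical people like it: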

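graph TD
    A[Step 1: Environment setup] --> B[Step 2: Model selection]
    B --> C[Step 3: Data preparation]
    C --> D[Step 4: Fine-tuning]
    D --> E[Step 5: Testing]
    E --> F[Step 6: Model conversion]
    F --> G[Step 7: Deployment]
    
    subgraph Common pitfalls
        H[Python version conflicts]
        I[CUDA version mismatch]
        J[Out of GPU memory]
        K[Wrong data format]
    end
    
    A -.-> H
    B -.-> I
    D -.-> J
    C -.-> K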

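📦 What do I provide?

Straight to the useful parts; all of this has been verified in real projects:

Full repository layout:

text

llm-finetune-guide/
├── scripts/           # scripts
├── data/             # sample training data
├── models/           # where models are stored
└── docs/             # documentation and notes

Key scripts:

setup_env.ps1 - environment setup script (for Windows users)
train_lora.py - core training code (with detailed comments)
api_server.py - a simple API service example
Each script notes the environment it runs in and the problems you may hit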

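🎪 Straight talk

Let's keep this practical:

On difficulty:
Honestly, if you have never touched Python or the command line, the first pass may be confusing. Don't worry: I explain every command, including which terminal to run it in, what output to expect, and how to troubleshoot when it fails.

On time:
I suggest setting aside one full afternoon and following along from start to finish. Don't skip steps; try every stage yourself. When something breaks, read the error message first, since most problems leave a clue there.

On expected results:
The first model you train probably won't amaze you, and that is normal. The goal is to get the pipeline running and understand the whole chain. Improving quality comes later.

🤝 A few personal words

My reason for writing this is simple: to lower the barrier to entry. AI is moving fast, and as developers we should not just be API callers; we should also understand the principles and implementation behind the API.

I still remember the "so that's how it works" excitement the first time my fine-tuning run succeeded. I hope you get to feel it too, not because it is hard, but because you turned an idea into something real with your own hands.

🚦 Ready?

If you have:

Python installed (3.10 recommended)
A code editor (VSCode is fine)
The willingness to follow along step by step

then we can start. I try to explain every step clearly, but I would rather you try it yourself: you only really understand code once it runs.

Don't panic when you hit a problem; learning a technology is a continuous process of solving problems. I added a "problems you may hit" part at each key step, and every one of them is a pitfall I ran into myself.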

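A Beginner's Guide to LLM Fine-Tuning: From Zero to Deployment

1. Core concepts (required reading for beginners)

1.1 Basic terms

Python: a high-level programming language known for its concise syntax and large ecosystem; it is the mainstream language for machine learning and data science.

Anaconda: a Python distribution that bundles the Python interpreter, the conda package manager, and a few hundred preinstalled scientific computing packages (with many more available from its repository); well suited to data science beginners.

Miniconda: a lighter Python distribution containing only conda, Python, and a small set of base packages. You install everything else on demand, which suits developers who want tight control over their environment.

PyTorch: an open-source, Python-based deep learning framework originally developed by Facebook's AI research team (now Meta AI). It offers dynamic computation graphs and automatic differentiation, which makes it a good fit for research and prototyping.

HuggingFace: a community and platform focused on natural language processing. It hosts models and datasets and maintains libraries such as Transformers, and is a key resource in the LLM field.

1.2 Model terms

Base model: a model pretrained on large-scale general data. It has broad knowledge but lacks specialized ability in any particular domain.

Fine-tuning: further training of a base model on domain-specific data so that it adapts to a particular task.

LoRA (Low-Rank Adaptation): a parameter-efficient fine-tuning method. Instead of updating all weights, it freezes the original weights and adds a pair of small trainable low-rank matrices to selected layers, so only a small number of parameters are trained.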
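
To make "a small number of trainable parameters" concrete, here is a rough worked calculation. It is only a sketch: the layer dimensions are assumptions taken from TinyLlama-1.1B's published configuration (hidden size 2048, 22 layers, 4 key/value heads of dimension 64), and it assumes LoRA with rank 8 on `q_proj` and `v_proj`, the settings used by the training script later in this guide.

```python
# Worked arithmetic: how many parameters LoRA trains.
# All dimensions below are assumptions based on TinyLlama-1.1B's config.

def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """LoRA learns an update B @ A for a frozen weight W (out x in),
    where A has shape (rank x in) and B has shape (out x rank)."""
    return rank * in_features + out_features * rank

HIDDEN = 2048   # hidden_size (assumed)
KV_DIM = 256    # num_key_value_heads (4) * head_dim (64) (assumed)
LAYERS = 22     # num_hidden_layers (assumed)
RANK = 8        # LoRA rank r

q_proj = lora_param_count(HIDDEN, HIDDEN, RANK)   # adapter on q_proj
v_proj = lora_param_count(HIDDEN, KV_DIM, RANK)   # adapter on v_proj
trainable = LAYERS * (q_proj + v_proj)
full_q_proj = HIDDEN * HIDDEN                     # size of one frozen q_proj weight

print(f"LoRA params per q_proj: {q_proj:,} (full weight: {full_q_proj:,})")
print(f"Total trainable LoRA params: {trainable:,}")
```

Under these assumptions the adapters add up to roughly 1.1 million trainable parameters, about a thousandth of the 1.1 billion in the full model, which is why LoRA fits on small GPUs.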

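Quantization: converting model weights from high precision (e.g. 32-bit floating point) to low precision (e.g. 8-bit integers) to reduce model size and memory usage, at the cost of a small loss in accuracy.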
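
The idea can be shown in a few lines of plain Python. This is a deliberately simplified sketch (symmetric, one scale for the whole list); real formats such as the GGUF quantization types work block by block and are more elaborate. The function names and the sample weights are made up for illustration.

```python
# Minimal sketch of symmetric int8 quantization (illustrative only).

def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.003, 1.0]          # hypothetical weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", q)
print("restored: ", restored)
print("max error:", max_error)
```

Each restored value is off by at most half a quantization step (`scale / 2`). Note that the very small weight 0.003 rounds to 0, which is the kind of precision loss quantization trades for a smaller model.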

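1.3 Deployment terms

GGUF (often expanded as "GPT-Generated Unified Format"): a binary file format designed for large language models. It supports quantized weights and is optimized for fast loading and inference.

llama.cpp: an inference framework written in C/C++ for running large language models efficiently on CPUs (with optional GPU acceleration). It uses the GGUF format.

Ollama: a tool that simplifies running large models locally; it can download, run, and manage models with single commands.

API (Application Programming Interface): an interface that lets different pieces of software communicate with each other.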

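2. Software and tools (download and installation guide)

2.1 Preparing the development environment

2.1.1 Install Miniconda (recommended)

Download: docs.conda.io/en/latest/m…

Installation steps on Windows:

Double-click the downloaded Miniconda3-latest-Windows-x86_64.exe
Choose "Just Me" (current user only)
Choose the install path (the default is recommended)
Check "Add Miniconda3 to my PATH environment variable" (the installer marks this option as not recommended; if you leave it unchecked, run the commands in this guide from the "Anaconda Prompt" or "Anaconda PowerShell Prompt" instead)
Click "Install" to start the installation

Verify the installation:

bash

# Open PowerShell or CMD
conda --version
# Should print something like: conda 24.x.x

python --version
# Should print something like: Python 3.x.x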

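2.1.2 Install Git

Download: git-scm.com/download/wi…

Installation steps:

Run the installer
Select components: check "Git Bash Here" and "Git GUI Here"
Choose the default editor: "Use Visual Studio Code as Git's default editor" is recommended
Leave the remaining options at their defaults

2.1.3 Install Visual Studio Build Tools (required on Windows)

Download: visualstudio.microsoft.com/downloads/#…

Installation steps:

Run the downloaded vs_BuildTools.exe
Select the workload "Desktop development with C++"
In the details pane on the right, make sure "Windows 10/11 SDK" is checked
Click "Install"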

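2.2 Setting up the fine-tuning environment

2.2.1 Create a dedicated environment

bash

# Open PowerShell (as administrator)
# Create an environment named llm-ft with Python 3.10
conda create -n llm-ft python=3.10

# Activate the environment
conda activate llm-ft

# Verify the environment
python --version
# Should print: Python 3.10.x

2.2.2 Install PyTorch

bash

# First check the CUDA version (if you have an NVIDIA GPU)
nvidia-smi
# The header shows "CUDA Version: 12.x". This is the highest CUDA version
# the installed driver supports, so pick a wheel at or below that version.

# Choose the install command that matches your CUDA version
# CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# CPU-only build (no GPU)
pip install torch torchvision torchaudio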
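
If you would rather not read the `nvidia-smi` header by eye, the choice above can be sketched as a small helper. This is only an illustration: the function name and the sample header string are made up, and the rule ("use the newest of the two wheels listed above that the driver supports") is a simplification of the guidance in this section.

```python
import re
from typing import Optional

# The two CUDA wheel indexes used in this guide.
CU121 = "https://download.pytorch.org/whl/cu121"
CU118 = "https://download.pytorch.org/whl/cu118"

def pick_torch_index_url(nvidia_smi_output: str) -> Optional[str]:
    """Return the wheel index URL to pass to pip, or None for the CPU build."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        return None  # no NVIDIA driver detected: use the default CPU install
    version = (int(match.group(1)), int(match.group(2)))
    if version >= (12, 1):
        return CU121
    if version >= (11, 8):
        return CU118
    return None  # driver too old for either wheel

# Hypothetical header line, shaped like real nvidia-smi output.
sample_header = "| NVIDIA-SMI 535.00    Driver Version: 535.00    CUDA Version: 12.2 |"
print(pick_torch_index_url(sample_header))
```

On a real machine you would feed it the text printed by `nvidia-smi` (for example captured with `subprocess.run`).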

2.2.3 Install the HuggingFace libraries

bash

# Core libraries
pip install transformers datasets accelerate

# Parameter-efficient fine-tuning
pip install peft

# Model management
pip install huggingface_hub

# Tokenizer support
pip install sentencepiece protobuf

# Evaluation tools (optional)
pip install evaluate

# Experiment tracking (optional)
pip install wandb

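2.3 Model conversion and deployment tools

2.3.1 Install llama.cpp

bash

# Open PowerShell (in your working directory)
# 1. Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# 2. Configure the build (requires CMake and Visual Studio Build Tools)
# Note: option and binary names have changed across llama.cpp versions.
# Recent versions use GGML_CUDA and llama-cli; older checkouts used
# LLAMA_CUBLAS and main.exe. Check the build docs of your checkout.

# With an NVIDIA GPU (also requires the CUDA Toolkit)
cmake -B build -DGGML_CUDA=ON

# Without a GPU, or with an AMD GPU (CPU-only build)
cmake -B build

# 3. Build
cmake --build build --config Release

# 4. Verify that the build succeeded
.\build\bin\Release\llama-cli.exe --help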

2.3.2 Install Ollama

Download: ollama.com/download

Installation on Windows:

Download OllamaSetup.exe
Double-click to run the installer
After installation, the ollama command is available in PowerShell

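2.4 When a VPN is needed

Operation	VPN needed?	Notes
Downloading HuggingFace models	Yes (overseas network)	In mainland China you can use the mirror hf-mirror.com
Accessing GitHub	Yes (overseas network)	May be slow from mainland China
Installing Python packages	Usually not	Use a domestic mirror to speed things up
Running model inference	No	Entirely local

Configuring domestic mirrors:

bash

# Configure a pip mirror
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

# Configure conda mirrors
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --set show_channel_urls yes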

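3. The complete workflow (from model to product)

3.1 Flowchart

3.2 Step-by-step description

Step 1: Get the base model

Pick a suitable model from HuggingFace (e.g. TinyLlama-1.1B)
Download it with huggingface-cli or from Python code
Save the model to a local directory

Step 2: Prepare the training data

Collect or create domain-specific data
Format it as a JSONL file (one JSON object per line)
Each object has three fields: instruction, input, and output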
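
Malformed data is one of the common pitfalls, so it can help to check the file before training. The following is a minimal sketch of such a check using only the standard library; the function name and the exact rules (all three fields present, non-empty `output`) are my own assumptions about what "valid" should mean here, not something the training libraries require.

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output")

def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem); an empty list means the data looks fine."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            problems.append((number, "missing fields: " + ", ".join(missing)))
        elif not str(record["output"]).strip():
            problems.append((number, "empty output"))
    return problems

# Hypothetical sample: one good line and two bad ones.
sample = [
    '{"instruction": "回答问题", "input": "中国的首都是哪里？", "output": "中国的首都是北京。"}',
    '{"instruction": "回答问题"}',
    'this is not json',
]
for line_number, problem in validate_jsonl_lines(sample):
    print(f"line {line_number}: {problem}")
```

To check a real file, pass an open file handle instead of the list, for example `validate_jsonl_lines(open("training_data.jsonl", encoding="utf-8"))`.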

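Step 3: Configure and run fine-tuning

Set the LoRA parameters (rank, alpha, etc.)
Configure the training parameters (learning rate, batch size, etc.)
Start fine-tuning
Save the fine-tuned model

Step 4: Deploy the model (two options)

Option A: Ollama (simple and quick)

Merge the LoRA adapter into the base model, then use llama.cpp to convert the PyTorch model to GGUF
Quantize the GGUF model (e.g. q4_0)
Create an Ollama Modelfile
Load and run the model with Ollama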
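
To get a feel for what quantizing to q4_0 buys you, here is a back-of-the-envelope size estimate. It is a sketch under stated assumptions: the parameter count is rounded to 1.1 billion, and q4_0 is taken as 4.5 bits per weight (blocks of 32 four-bit weights plus one 16-bit scale). Real GGUF files come out somewhat larger because of metadata and tensors stored at different precision.

```python
# Rough model size estimate: parameters * bits per weight.

def model_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

PARAMS = 1.1e9                                        # TinyLlama-1.1B, rounded (assumption)
Q4_0_BITS = (32 * 4 + 16) / 32                        # 4.5 bits per weight (assumption)

fp32_gb = model_size_gb(PARAMS, 32)
fp16_gb = model_size_gb(PARAMS, 16)
q4_0_gb = model_size_gb(PARAMS, Q4_0_BITS)

print(f"fp32: {fp32_gb:.2f} GB")
print(f"fp16: {fp16_gb:.2f} GB")
print(f"q4_0: {q4_0_gb:.2f} GB")
```

The fp16 estimate lines up with the roughly 2.2 GB download mentioned in the run guide, and the q4_0 estimate is well under a third of that.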

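Option B: API service (flexible and powerful)

Create a FastAPI application
Load the fine-tuned PyTorch model
Design the API endpoints (e.g. /chat, /generate)
Deploy to a server and configure a reverse proxy

Step 5: Client integration

Build a web interface or mobile app
Call the API service to get model responses
Add business logic and user management
Test and refine the user experience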
4. Hands-on example: complete, runnable fine-tuning scripts

4.1 Environment setup script

powershell

# setup_env.ps1 (Windows PowerShell script)
# Usage: run PowerShell as administrator, then execute: .\setup_env.ps1
# Note: "conda activate" inside a script only works after "conda init powershell" has been run once.

Write-Host "=== LLM fine-tuning environment setup ===" -ForegroundColor Green

# 1. Check for the conda environment and create it if needed
Write-Host "1. Checking conda environment..." -ForegroundColor Yellow
$envName = "llm-ft"

if (conda env list | Select-String $envName) {
    Write-Host "Environment $envName already exists" -ForegroundColor Green
} else {
    Write-Host "Creating environment $envName..." -ForegroundColor Yellow
    conda create -n $envName python=3.10 -y
}

# 2. Activate the environment and install packages
Write-Host "2. Activating environment and installing dependencies..." -ForegroundColor Yellow
conda activate $envName

# Install PyTorch (assumes CUDA 12.1)
Write-Host "Installing PyTorch..." -ForegroundColor Cyan
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install the HuggingFace libraries
Write-Host "Installing HuggingFace libraries..." -ForegroundColor Cyan
pip install transformers datasets accelerate peft huggingface_hub sentencepiece protobuf

# Needed only for the optional API server in section 4.7
pip install fastapi uvicorn

# 3. Verify the installation
Write-Host "3. Verifying installation..." -ForegroundColor Yellow
python -c "
import torch, transformers, peft, accelerate
print(f'✅ PyTorch: {torch.__version__}')
print(f'✅ CUDA available: {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'✅ GPU: {torch.cuda.get_device_name(0)}')
print(f'✅ Transformers: {transformers.__version__}')
print(f'✅ PEFT: {peft.__version__}')
"

Write-Host "=== Environment setup complete ===" -ForegroundColor Green
Write-Host "Activate the environment with: conda activate $envName" -ForegroundColor Cyan

4.2 Training data preparation script

python

# prepare_data.py
# Usage: python prepare_data.py

import json

# A few simple training examples (kept in Chinese: the tasks target Chinese output)
training_data = [
    {
        "instruction": "将以下英文翻译成中文",
        "input": "Hello, how are you?",
        "output": "你好，你怎么样？"
    },
    {
        "instruction": "将以下英文翻译成中文",
        "input": "What is your name?",
        "output": "你叫什么名字？"
    },
    {
        "instruction": "总结以下文本的主要内容",
        "input": "Python是一种高级编程语言，以其简洁的语法和强大的生态系统而闻名。",
        "output": "Python是一种简洁强大的高级编程语言。"
    },
    {
        "instruction": "回答问题",
        "input": "中国的首都是哪里？",
        "output": "中国的首都是北京。"
    },
    {
        "instruction": "编写一个简单的函数",
        "input": "用Python写一个计算两数之和的函数",
        "output": "def add(a, b):\n    return a + b"
    }
]

# Save as JSONL (one JSON object per line)
with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for item in training_data:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

print("✅ Training data saved to training_data.jsonl")
print(f"📊 Number of examples: {len(training_data)}")

4.3 Fine-tuning script

python

# train_lora.py
# Usage: python train_lora.py
# Note: downloading the model requires access to HuggingFace (VPN or a domestic mirror)

import os

# If HuggingFace is not reachable directly, uncomment the next line to use a mirror.
# It must run BEFORE transformers/datasets are imported, because the endpoint is read at import time.
# os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from peft import LoraConfig, get_peft_model, TaskType

# Select the device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# 1. Load the model and tokenizer
print("1. Loading model and tokenizer...")
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Add a padding token if the tokenizer has none
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# 2. Configure LoRA
print("2. Configuring LoRA...")
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,  # LoRA rank
    lora_alpha=32,  # scaling factor for the LoRA update
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # modules that receive LoRA adapters
    bias="none"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # print the number of trainable parameters

# 3. Load and preprocess the data
print("3. Loading training data...")
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

def preprocess_function(examples):
    # Build the training text: instruction + input + output
    texts = []
    for i in range(len(examples["instruction"])):
        instruction = examples["instruction"][i]
        input_text = examples["input"][i] if examples["input"][i] else ""
        output = examples["output"][i]
        
        # Build the prompt (adjust the format to match your model's chat template)
        text = f"<|user|>\n{instruction}"
        if input_text:
            text += f"\n{input_text}"
        text += f"</s>\n<|assistant|>\n{output}</s>"
        texts.append(text)
    
    # Tokenize the texts
    return tokenizer(texts, truncation=True, padding="max_length", max_length=512)

tokenized_dataset = dataset.map(preprocess_function, batched=True)
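# 4. Configure training parameters
# Note: with only the 5 sample records there are just a few optimizer steps in total,
# so warmup_steps and logging_steps below only become meaningful on a larger dataset.
print("4. Configuring training parameters...")
training_args = TrainingArguments(
    output_dir="./tinyllama_finetuned",  # output directory
    num_train_epochs=3,                  # number of epochs
    per_device_train_batch_size=2,       # batch size (adjust to your VRAM)
    gradient_accumulation_steps=4,       # gradient accumulation
    warmup_steps=10,                     # warmup steps
    logging_steps=10,                    # logging interval
    save_steps=100,                      # checkpoint interval
    # Evaluation is off by default. The original evaluation_strategy="no" argument
    # is omitted because newer transformers releases renamed it to eval_strategy.
    save_total_limit=2,                  # keep at most 2 checkpoints
    learning_rate=2e-4,                  # learning rate
    fp16=(device == "cuda"),             # mixed-precision training on GPU only
    push_to_hub=False,                   # do not upload to the HuggingFace Hub
    report_to="none",                    # do not report to wandb etc.
)

# 5. Create the Trainer and start training
print("5. Starting training...")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

trainer.train()

# 6. Save the model (for a PEFT model this saves only the LoRA adapter weights)
print("6. Saving model...")
model.save_pretrained("./tinyllama_finetuned")
tokenizer.save_pretrained("./tinyllama_finetuned")

print("✅ Training finished! Model saved to ./tinyllama_finetuned")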
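
One detail in the script above is easy to get wrong later: the prompt used at inference time has to match the text format used during training. A small sketch of how the two can share one helper follows; the function names are hypothetical, and the template is simply the one that `preprocess_function` builds.

```python
# Sketch: keep the training format and the inference prompt in one place.

def build_prompt(instruction: str, input_text: str = "") -> str:
    """The part of the text the model sees before it starts answering."""
    user_turn = instruction if not input_text else f"{instruction}\n{input_text}"
    return f"<|user|>\n{user_turn}</s>\n<|assistant|>\n"

def build_training_text(instruction: str, input_text: str, output: str) -> str:
    """Full training example: prompt, expected answer, end-of-sequence marker."""
    return build_prompt(instruction, input_text) + f"{output}</s>"

example = build_training_text("将以下英文翻译成中文", "Hello, how are you?", "你好，你怎么样？")
print(example)
```

Because every training text starts with exactly what `build_prompt` returns, using the same helper in the test script and the API server avoids subtle mismatches such as an extra newline before `</s>`.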

4.4 Model test script

python

# test_model.py
# Usage: python test_model.py

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the model
print("Loading model...")
model_path = "./tinyllama_finetuned"

# Load the base model
base_model = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Load the LoRA weights on top of the base model
model = PeftModel.from_pretrained(model, model_path)
model.eval()

# Generation helper
def generate_response(prompt, max_new_tokens=100):
    # Build the full prompt in the same format used for training
    full_prompt = f"<|user|>\n{prompt}</s>\n<|assistant|>\n"
    
    # Tokenize
    inputs = tokenizer(full_prompt, return_tensors="pt")
    
    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    
    # Decode only the newly generated tokens, i.e. everything after the prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Test cases (matching the Chinese training data)
test_cases = [
    "将以下英文翻译成中文: Hello, how are you?",
    "中国的首都是哪里？",
    "用Python写一个计算两数之和的函数",
]

print("\n=== Model test ===\n")
for i, test_case in enumerate(test_cases, 1):
    print(f"Test {i}:")
    print(f"Question: {test_case}")
    response = generate_response(test_case)
    print(f"Answer: {response}")
    print("-" * 50)

print("\n✅ Tests finished!")
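
One step is worth spelling out before moving on to GGUF conversion. The training script saves only the LoRA adapter, while llama.cpp's HuggingFace-to-GGUF converter works on a full model directory, so the adapter first has to be merged into the base model. Below is a minimal sketch using PEFT's `merge_and_unload`; the function name and the output directory `./tinyllama_merged` are hypothetical choices of mine.

```python
# merge_lora.py (hypothetical file name)
# Sketch: merge the LoRA adapter into the base model and save a standalone copy.

def merge_lora_adapter(base_model: str, adapter_dir: str, output_dir: str) -> None:
    # Heavy imports live inside the function so that defining it stays cheap.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    # Load the frozen base weights, then attach the trained adapter.
    base = AutoModelForCausalLM.from_pretrained(base_model)
    model = PeftModel.from_pretrained(base, adapter_dir)

    # Fold the low-rank updates into the base weights and drop the PEFT wrappers.
    merged = model.merge_and_unload()

    # Save a regular HuggingFace model directory, together with the tokenizer.
    merged.save_pretrained(output_dir)
    AutoTokenizer.from_pretrained(adapter_dir).save_pretrained(output_dir)
```

A typical call would be `merge_lora_adapter("TinyLlama/TinyLlama-1.1B-Chat-v1.0", "./tinyllama_finetuned", "./tinyllama_merged")`. The merged directory is saved in the precision the base model was loaded in, so expect it to be considerably larger than the adapter.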

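4.5 GGUF conversion script

powershell

# convert_to_gguf.ps1
# Usage: .\convert_to_gguf.ps1
# Note: the script and tool names below match recent llama.cpp versions
# (convert_hf_to_gguf.py, llama-quantize); older checkouts used convert.py and quantize.exe.

Write-Host "=== Model conversion script ===" -ForegroundColor Green

# Make sure llama.cpp is present
if (-Not (Test-Path "llama.cpp")) {
    Write-Host "llama.cpp not found, cloning..." -ForegroundColor Yellow
    git clone https://github.com/ggerganov/llama.cpp
} else {
    Write-Host "✅ llama.cpp already present" -ForegroundColor Green
}

# The converter is a Python script; install its dependencies
pip install -r llama.cpp/requirements.txt

# Quantization needs the compiled llama-quantize tool (a CPU-only build is enough)
$quantizeExe = "llama.cpp/build/bin/Release/llama-quantize.exe"
if (-Not (Test-Path $quantizeExe)) {
    Write-Host "Building llama.cpp..." -ForegroundColor Yellow
    cmake -S llama.cpp -B llama.cpp/build
    cmake --build llama.cpp/build --config Release
} else {
    Write-Host "✅ llama.cpp already built" -ForegroundColor Green
}

# The converter expects a full HuggingFace model directory. ./tinyllama_finetuned only
# holds the LoRA adapter, so merge it into the base model first (for example with PEFT's
# merge_and_unload) and point this variable at the merged copy. The name is just an example.
$mergedModel = "./tinyllama_merged"
$outputName = "my_finetuned_model"

if (Test-Path $mergedModel) {
    Write-Host "Converting to GGUF (f16)..." -ForegroundColor Yellow
    python llama.cpp/convert_hf_to_gguf.py $mergedModel `
        --outfile "${outputName}-f16.gguf" `
        --outtype f16

    # The converter does not produce q4_0 itself; quantize in a second step
    Write-Host "Quantizing to Q4_0..." -ForegroundColor Yellow
    & $quantizeExe "${outputName}-f16.gguf" "$outputName.gguf" Q4_0

    Write-Host "✅ Conversion finished: $outputName.gguf" -ForegroundColor Green
} else {
    Write-Host "❌ Merged model not found: $mergedModel" -ForegroundColor Red
    Write-Host "Run the training script and merge the LoRA adapter first" -ForegroundColor Yellow
}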

4.6 Creating the Ollama model

powershell

# create_ollama_model.ps1
# Usage: .\create_ollama_model.ps1

Write-Host "=== Create Ollama model ===" -ForegroundColor Green

# Check for the GGUF file
$ggufFile = "my_finetuned_model.gguf"
if (-Not (Test-Path $ggufFile)) {
    Write-Host "❌ GGUF file not found: $ggufFile" -ForegroundColor Red
    Write-Host "Run the conversion script first" -ForegroundColor Yellow
    exit 1
}

# Create the Modelfile.
# The template mirrors the training format; the system block is included
# so that the SYSTEM prompt below is actually sent to the model.
$modelfileContent = @"
FROM ./$ggufFile

TEMPLATE """{{ if .System }}<|system|>
{{ .System }}</s>
{{ end }}<|user|>
{{ .Prompt }}</s>
<|assistant|>
"""

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_predict 512
PARAMETER stop "</s>"

SYSTEM """你是一个经过微调的AI助手，擅长回答技术问题。"""
"@

$modelfileContent | Out-File -FilePath "Modelfile" -Encoding UTF8
Write-Host "✅ Modelfile created" -ForegroundColor Green

# Create the Ollama model
Write-Host "Creating Ollama model..." -ForegroundColor Yellow
ollama create my-ai -f ./Modelfile

Write-Host "`n=== Test the model with ===" -ForegroundColor Cyan
Write-Host "ollama run my-ai" -ForegroundColor White
Write-Host "`nType a question and press Enter; type /bye to exit" -ForegroundColor Yellow

4.7 A simple API service

python

# api_server.py
# Usage: python api_server.py
# Dependencies: pip install fastapi uvicorn
# URL: http://localhost:8000

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Optional
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import uvicorn

app = FastAPI(title="Fine-tuned AI assistant API", description="AI assistant fine-tuned from TinyLlama")

# Request/response models
class ChatRequest(BaseModel):
    message: str
    max_tokens: Optional[int] = 100
    temperature: Optional[float] = 0.7

class ChatResponse(BaseModel):
    response: str
    status: str

# Globals, filled in at startup
model = None
tokenizer = None

def load_model():
    """Load the base model and attach the LoRA adapter."""
    global model, tokenizer
    
    print("Loading model...")
    
    # Load the base model
    base_model = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    model_path = "./tinyllama_finetuned"
    
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(base_model)
    
    # Load the LoRA weights
    model = PeftModel.from_pretrained(model, model_path)
    model.eval()
    
    # Set the padding token if missing
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    print("✅ Model loaded")

@app.on_event("startup")
async def startup_event():
    """Load the model when the application starts."""
    load_model()

@app.get("/")
async def root():
    """Root path: basic service information."""
    return {
        "service": "LLM fine-tuning API service",
        "status": "running",
        "endpoints": {
            "chat": "POST /chat",
            "health": "GET /health"
        }
    }

@app.get("/health")
async def health_check():
    """Health check."""
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    return {"status": "healthy", "model_loaded": True}

# A plain "def" lets FastAPI run the blocking generation in a worker thread
# instead of stalling the event loop.
@app.post("/chat", response_model=ChatResponse)
def chat(request: ChatRequest):
    """Chat endpoint."""
    try:
        # Build the prompt in the same format used for training
        prompt = f"<|user|>\n{request.message}</s>\n<|assistant|>\n"
        
        # Tokenize
        inputs = tokenizer(prompt, return_tensors="pt")
        
        # Generate
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=request.max_tokens,
                temperature=request.temperature,
                do_sample=True,
                pad_token_id=tokenizer.pad_token_id,
                eos_token_id=tokenizer.eos_token_id,
            )
        
        # Decode
        full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        # Keep only the assistant part
        if "<|assistant|>" in full_response:
            response = full_response.split("<|assistant|>")[-1].strip()
        else:
            response = full_response
        
        return ChatResponse(response=response, status="success")
    
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Generation failed: {str(e)}")

# Start the service
if __name__ == "__main__":
    print("🚀 Starting API service...")
    print("📡 URL: http://localhost:8000")
    print("📚 API docs: http://localhost:8000/docs")
    print("🛑 Press Ctrl+C to stop")
    
    uvicorn.run(app, host="0.0.0.0", port=8000)

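5. Run guide

5.1 Commands, step by step

bash

# Step 1: open PowerShell as administrator

# Step 2: set up the environment
.\setup_env.ps1

# Step 3: activate the environment
conda activate llm-ft

# Step 4: prepare the data
python prepare_data.py

# Step 5: fine-tune (needs a VPN or a configured mirror)
# Note: this downloads about 2.2 GB of model files
python train_lora.py

# Step 6: test the model
python test_model.py

# Step 7: convert to GGUF (the LoRA adapter must be merged into the base model first)
.\convert_to_gguf.ps1

# Step 8: create the Ollama model
.\create_ollama_model.ps1

# Step 9: test the Ollama model
ollama run my-ai

# Step 10: start the API service (optional)
python api_server.py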

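5.2 Troubleshooting common problems

Problem 1: model download fails

bash

# Set the HuggingFace mirror (PowerShell)
$env:HF_ENDPOINT = "https://hf-mirror.com"
# In CMD use instead: set HF_ENDPOINT=https://hf-mirror.com

# Or set it in Python, before importing transformers/datasets
import os
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

Problem 2: out of GPU memory

python

# Reduce the batch size
per_device_train_batch_size=1  # changed from 2 to 1

# Increase gradient accumulation to keep the effective batch size the same
gradient_accumulation_steps=8  # changed from 4 to 8

# Use lower precision
fp16=True  # make sure mixed-precision training is on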
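
Why does halving the batch size and doubling the accumulation steps not change training behavior much? The effective batch size is the product of the two. The arithmetic can be sketched as below; the helper is hypothetical, and its rounding is a simplification of what the Trainer actually does.

```python
import math

def training_steps(num_samples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = max(1, math.ceil(num_samples / effective_batch))
    return effective_batch, steps_per_epoch * epochs

# Original settings vs. the memory-saving settings, on the 5 sample records
print(training_steps(num_samples=5, per_device_batch=2, grad_accum=4, epochs=3))
print(training_steps(num_samples=5, per_device_batch=1, grad_accum=8, epochs=3))

# The same settings on a hypothetical dataset of 1000 records
print(training_steps(num_samples=1000, per_device_batch=2, grad_accum=4, epochs=3))
```

Both configurations give an effective batch size of 8, so only peak memory changes. The calculation also shows that the 5-record sample dataset yields only about 3 optimizer steps in total, fewer than the `warmup_steps=10` set in the training script, which is one reason the first results are unimpressive.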

Problem 3: training is slow

bash

# Check whether CUDA is available
python -c "import torch; print(torch.cuda.is_available())"

# If this prints False, you may need to reinstall PyTorch
pip uninstall torch torchvision torchaudio -y
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

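6. Recommended learning resources

6.1 Official documentation

PyTorch tutorials: pytorch.org/tutorials/
HuggingFace course: huggingface.co/learn
Transformers documentation: huggingface.co/docs/transf…
PEFT documentation: huggingface.co/docs/peft

6.2 Chinese-language resources

LLM fine-tuning in practice: github.com/huggingface…
Chinese LLM community: github.com/HqWu-HITCS/…
Chinese LLaMA guide: github.com/ymcui/Chine…

6.3 Video tutorials

Li Mu's Bilibili channel (跟李沐学AI): space.bilibili.com/1567748478
Li Mu, Dive into Deep Learning (动手学深度学习) course: courses.d2l.ai/zh-v2/

Summary

After working through this guide, you should have a handle on:

Environment setup: configuring Python, PyTorch, HuggingFace, and the other required tools
Data preparation: creating and formatting training data
Fine-tuning: adapting a pretrained model with LoRA
Model conversion: converting the PyTorch model to GGUF
Deployment: serving the model through Ollama or FastAPI
Testing: verifying the fine-tuning result and fixing problems

Remember that LLM fine-tuning is a very hands-on skill, and the best way to learn it is by doing. Start with a small, simple model and work your way up to harder tasks and larger models.

Good luck with your learning! 🚀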