III. Fine-tuning QwQ-32B with unsloth


1. Installing and deploying unsloth

pip install unsloth
pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

2. Installing and registering wandb

2.1 wandb basics

Wandb (Weights & Biases, wandb.ai) is a tool for experiment tracking, visualization, and management in machine learning projects. It is designed to help users monitor the training process more effectively, optimize performance, and share and reproduce experiment results. From a user's perspective, wandb is essentially a logging SDK: it keeps one copy of the logs locally and another copy on the remote server.

2.2 Registering and using wandb

Register an account on the official site and save your API key.

3. Downloading and running the QwQ-32B 4-bit dynamically quantized model

Using Hugging Face as an example (safetensors format), on the AutoDL platform.

3.1 Downloading the model

  • Create a directory: mkdir ./QwQ-32B-unsloth-bnb-4bit
  • Install huggingface_hub: pip install huggingface_hub
  • (Optional) Use screen to keep a persistent session:
    • Install screen: sudo apt install screen -y
    • Start a new session: screen -S mysession
    • Reattach to a session: screen -r mysession
    • List all sessions: screen -ls
  • (Optional) Change the default Hugging Face download path:
    • First create a folder named HF_download under /root/autodl-tmp to serve as the Hugging Face download directory:
      • cd /root/autodl-tmp
      • mkdir HF_download
    • Then open the .bashrc file in the root home directory and append export HF_HOME="/root/autodl-tmp/HF_download" at the end
    • Save, exit, and run source ~/.bashrc
  • Download the model weights:
    • In Jupyter, run:
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id = "unsloth/QwQ-32B-unsloth-bnb-4bit",
    local_dir = "QwQ-32B-unsloth-bnb-4bit",
)

3.2 Calling the model with transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "./QwQ-32B-unsloth-bnb-4bit"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "你好,你是谁"
messages = [
    {"role": "user", "content": prompt}
    ]
# Apply the chat template to build the prompt text
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
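The list comprehension above strips the prompt tokens from each generated sequence, keeping only the newly generated tokens. The same idea on toy token lists (pure Python, illustrative values only):

```python
# Each generated sequence begins with the prompt tokens; slice them off
# to keep only the model's new tokens (made-up token ids for illustration).
input_ids_batch = [[1, 2, 3]]         # tokenized prompt
generated = [[1, 2, 3, 9, 8, 7]]      # prompt tokens followed by new tokens
new_tokens = [out[len(inp):] for inp, out in zip(input_ids_batch, generated)]
print(new_tokens)  # → [[9, 8, 7]]
```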

3.3 Downloading the reasoning fine-tuning dataset (Hugging Face)

  Here we download the medical-o1-reasoning-SFT dataset directly from Hugging Face: FreedomIntelligence/medical-o1-reasoning-SFT · Datasets at Hugging Face

3.3.1 Setting up the proxy environment

  Because network access to Hugging Face may be restricted, configure the network environment before downloading the dataset. On an AutoDL server, you can enable academic acceleration as follows to reach Hugging Face and download the dataset:

import subprocess
import os

result = subprocess.run('bash -c "source /etc/network_turbo && env | grep proxy"', shell=True, capture_output=True, text=True)
output = result.stdout
for line in output.splitlines():
    if '=' in line:
        var, value = line.split('=', 1)
        os.environ[var] = value
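The loop above copies every proxy variable reported by `env | grep proxy` into the current process's environment. Its parsing logic can be checked on a sample string (the proxy address below is made up):

```python
# Parse `VAR=value` lines the same way the snippet above does;
# split on the first '=' only, since the value itself may contain '='.
sample_output = "http_proxy=http://127.0.0.1:6006\nhttps_proxy=http://127.0.0.1:6006"
parsed = {}
for line in sample_output.splitlines():
    if "=" in line:
        var, value = line.split("=", 1)
        parsed[var] = value
print(parsed["https_proxy"])  # → http://127.0.0.1:6006
```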

3.3.2 Downloading the dataset

Install the library: pip install datasets

import os
from datasets import load_dataset

# Extract the end-of-sequence token (assumes the tokenizer from section 3.2 is loaded):
EOS_TOKEN = tokenizer.eos_token
# Define a function that reshapes the medical-o1-reasoning-SFT dataset: the Question,
# Complex_CoT, and Response columns are slotted into train_prompt_style (the template
# defined in section 4) and the end-of-sequence token is appended:
def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for input, cot, output in zip(inputs, cots, outputs):
        text = train_prompt_style.format(input, cot, output) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }
# Start with just the first 500 examples to check the pipeline
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT","en", split = "train[0:500]",trust_remote_code=True)
# Then apply the structured formatting:
dataset = dataset.map(formatting_prompts_func, batched = True,)
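To see what formatting_prompts_func actually produces, here is a self-contained sketch with a simplified template and a made-up EOS token (no datasets dependency; the real run uses train_prompt_style from section 4 and the tokenizer's own EOS token):

```python
# Simplified stand-ins for the real template and EOS token (assumptions).
train_prompt_style = """### Question:
{}

### Response:
<think>
{}
</think>
{}
"""
EOS_TOKEN = "<|endoftext|>"

def formatting_prompts_func(examples):
    texts = []
    for q, cot, ans in zip(examples["Question"], examples["Complex_CoT"], examples["Response"]):
        # Question, chain-of-thought, and answer are slotted into the template,
        # then the EOS token is appended so the model learns where to stop.
        texts.append(train_prompt_style.format(q, cot, ans) + EOS_TOKEN)
    return {"text": texts}

batch = {
    "Question": ["What causes fever?"],
    "Complex_CoT": ["Pyrogens raise the hypothalamic set point..."],
    "Response": ["Most commonly, infection."],
}
print(formatting_prompts_func(batch)["text"][0])
```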

3.4 A minimum viable experiment (initial fine-tuning)

Next we try fine-tuning the model. With the current dataset we can fine-tune on a subset of the original data, or iterate over the full dataset multiple times. For most fine-tuning work it pays to start from a minimum viable experiment: fine-tune on a small amount of data first and observe whether it has an effect. Only if the run completes smoothly and shows measurable improvement should you bring in more data for a larger-scale run.

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
import wandb


max_seq_length = 2048
dtype = None
load_in_4bit = True

# Now put the model into fine-tuning (PEFT) mode:

model = FastLanguageModel.get_peft_model(
    model,
    r=16,  
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,  
    bias="none",  
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,  
    loftq_config=None,
)
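With r=16, the LoRA adapters add only a tiny fraction of new trainable parameters: each adapted weight matrix W of shape (out × in) gains two low-rank matrices of shapes (out × r) and (r × in). A back-of-the-envelope calculation (the 5120 dimension below is an illustrative assumption, not a claim about QwQ-32B's actual hidden size):

```python
# Parameters added by a LoRA adapter on one (out x in) weight matrix:
# B is (out x r) and A is (r x in), so the count is r * (out + in).
def lora_added_params(out_features, in_features, r=16):
    return r * (out_features + in_features)

full = 5120 * 5120                       # full weight matrix, assumed 5120x5120
added = lora_added_params(5120, 5120)    # parameters the adapter introduces
print(added, f"{added / full:.4f}")      # the adapter is well under 1% of the full matrix
```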
# Create the supervised fine-tuning trainer:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        # Use num_train_epochs = 1, warmup_ratio for full training runs!
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)

# Then set up wandb (optional):
wandb.login(key="YOUR_WANDB_API_KEY")
run = wandb.init(project='Fine-tune-QwQ-32B-4bit on Medical COT Dataset', )
trainer_stats = trainer.train()

3.5 SFTTrainer

This code uses SFTTrainer for supervised fine-tuning (Supervised Fine-Tuning, SFT) and applies to model fine-tuning in the transformers / Unsloth ecosystem:
1. Importing the libraries

  • SFTTrainer (from the trl library):

  • trl (Transformer Reinforcement Learning) is a Hugging Face library that provides supervised fine-tuning (SFT) and reinforcement learning (RLHF) functionality.

  • SFTTrainer handles supervised fine-tuning and works with low-rank adaptation methods such as LoRA.

  • TrainingArguments (from the transformers library):

  • This class defines the training hyperparameters: batch size, learning rate, optimizer, number of training steps, and so on.

  • is_bfloat16_supported() (from unsloth):

  • This function checks whether the current GPU supports bfloat16 (BF16), returning True if it does and False otherwise.

  • bfloat16 is a more efficient numeric format that performs especially well on newer NVIDIA GPUs such as the A100/H100.

2. Initializing SFTTrainer for fine-tuning

Parameter breakdown

SFTTrainer arguments:

  • model=model — the pretrained model to fine-tune
  • tokenizer=tokenizer — the tokenizer used to process the text data
  • train_dataset=dataset — the training dataset
  • dataset_text_field="text" — which dataset column holds the training text (built in formatting_prompts_func)
  • max_seq_length=max_seq_length — maximum sequence length, capping the number of input tokens
  • dataset_num_proc=2 — number of parallel processes for data loading, speeding up preprocessing

TrainingArguments:

  • per_device_train_batch_size=2 — training batch size per GPU/device (small values suit large models)
  • gradient_accumulation_steps=4 — gradient accumulation steps (effective batch size = 2 × 4 = 8)
  • warmup_steps=5 — warmup steps (the learning rate starts low and ramps up)
  • max_steps=60 — maximum number of training steps (here roughly 60 × 8 = 480 samples are consumed in total)
  • learning_rate=2e-4 — learning rate (2e-4 = 0.0002, controls the size of weight updates)
  • fp16=not is_bfloat16_supported() — fall back to fp16 (16-bit floats) if the GPU does not support bfloat16
  • bf16=is_bfloat16_supported() — enable bfloat16 when supported (more stable training)
  • logging_steps=10 — log training metrics every 10 steps
  • optim="adamw_8bit" — 8-bit AdamW optimizer, reducing GPU memory usage
  • weight_decay=0.01 — weight decay (L2 regularization) to curb overfitting
  • lr_scheduler_type="linear" — learning-rate schedule (linear decay)
  • seed=3407 — random seed for reproducibility
  • output_dir="outputs" — output directory for training artifacts
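The effective batch size and data consumption figures above follow from simple arithmetic (a sanity check, not trainer code):

```python
# Effective batch size = per-device batch size * gradient accumulation steps.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60

effective_batch = per_device_train_batch_size * gradient_accumulation_steps
samples_seen = effective_batch * max_steps  # samples consumed over the whole run
print(effective_batch, samples_seen)  # → 8 480
```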

3.6 Checking the results

Note that after fine-tuning finishes, unsloth automatically updates the model weights in memory, so the fine-tuned model can be called directly without merging weights by hand:

FastLanguageModel.for_inference(model)
question_1 = "A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?"
# `prompt_style` is assumed to be an inference-time template (train_prompt_style without the final answer slot)
inputs = tokenizer([prompt_style.format(question_1, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=2048,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])

3.7 Merging the model

The locally saved model weights now live in the outputs folder.


  • Merge the model weights with: model.save_pretrained_merged("QwQ-Medical-COT-Tiny", tokenizer, save_method = "merged_4bit",)
  • Alternatively, save the model in GGUF format, which is convenient for inference with ollama. Exporting and merging here takes a while, so be patient: model.save_pretrained_gguf("QwQ-Medical-COT-Tiny-GGUF", tokenizer, quantization_method = "q4_k_m")

3.8 Model inference

To avoid variable-name clashes, restart the Jupyter kernel first (or start a new notebook).

from unsloth import FastLanguageModel

max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/root/autodl-tmp/QwQ-Medical-COT-Tiny",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

FastLanguageModel.for_inference(model) 

prompt_style_chat = """请写出一个恰当的回答来完成当前对话任务。

### Instruction:
你是一名助人为乐的助手。

### Question:
{}

### Response:
<think>{}"""

question = "你好,好久不见!"
inputs = tokenizer([prompt_style_chat.format(question, "")], return_tensors="pt").to("cuda")
outputs = model.generate(
    input_ids=inputs.input_ids,
    max_new_tokens=2048,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])
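The split on "### Response:" in the final print keeps only the text the model produced after the response marker; illustrated on a toy decoded string:

```python
# Everything after the first "### Response:" marker is the model's answer.
decoded = "### Question:\nhi\n\n### Response:\n<think>...</think>\nHello!"
answer = decoded.split("### Response:")[1]
print(answer.strip())
```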
  • Transformers inference
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "/root/autodl-tmp/QwQ-Medical-COT-Tiny"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "你好,好久不见!"
messages = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
  • Ollama inference

Copy the generated Q4_K_M.gguf weights into a separate directory, then write a ModelFile so that ollama can load the model:

cd /root/autodl-tmp/QwQ-Medical-COT-Tiny-GGUF
mkdir ./unsloth_Q4
cp unsloth.Q4_K_M.gguf ./unsloth_Q4

Then create the ModelFile and enter the following content:

FROM ./unsloth.Q4_K_M.gguf

TEMPLATE """
请写出一个恰当的回答来完成当前对话任务。

### Instruction:
你是一名助人为乐的助手。

### Question:
{{ .Prompt }}

### Response:
<think>{{ .Response }}<|im_end|>
"""

PARAMETER stop "<|im_end|>"
PARAMETER stop "<|end_of_text|>"
PARAMETER stop "<|reserved_special_token_>"
PARAMETER temperature 1.5
PARAMETER min_p 0.1

Save and exit, then register the model: ollama create unsloth_model -f /root/autodl-tmp/QwQ-Medical-COT-Tiny-GGUF/unsloth_Q4/ModelFile

Call the model:

from openai import OpenAI
client = OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='ollama',  # required but ignored
)
prompt = "你好,好久不见!"
messages = [
    {"role": "user", "content": prompt}
]
response = client.chat.completions.create(
    messages=messages,
    model='unsloth_model',
)

print(response.choices[0].message.content)
  • vLLM inference

First launch a vLLM server in a terminal (here across two GPUs):

CUDA_VISIBLE_DEVICES=0,1 vllm serve /root/autodl-tmp/QwQ-Medical-COT-Tiny --tensor-parallel-size 2
Then call the served model through its OpenAI-compatible API:
from openai import OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

prompt = "你好,好久不见!"
messages = [
    {"role": "user", "content": prompt}
]

response = client.chat.completions.create(
    model="/root/autodl-tmp/QwQ-Medical-COT-Tiny",
    messages=messages,
)

print(response.choices[0].message.content)

4. The full efficient fine-tuning experiment

from unsloth import FastLanguageModel
from modelscope import AutoModelForCausalLM, AutoTokenizer
import torch
from datasets import load_dataset

train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 

### Question:
{}

### Response:
<think>
{}
</think>
{}
"""

# Load the model using Unsloth (the weights were downloaded in advance)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "./QwQ-32B-unsloth-bnb-4bit",  # or another supported model
    max_seq_length = 4096,
    dtype = torch.float16,
    load_in_4bit = True,  # or False
)

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for input, cot, output in zip(inputs, cots, outputs):
        text = train_prompt_style.format(input, cot, output) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }

# medical_o1_sft.json was downloaded in advance from medical-o1-reasoning-SFT
dataset = load_dataset("json",data_files="medical_o1_sft.json", split = "train",trust_remote_code=True)
dataset = dataset.map(formatting_prompts_func, batched = True,)


# Then call get_peft_model
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

max_seq_length = 2048
dtype = None
load_in_4bit = True

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs = 3,
        warmup_steps=5,
        # max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)

trainer_stats = trainer.train()
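With num_train_epochs=3 instead of max_steps=60, the total number of optimizer steps scales with the dataset size. A rough estimate (the row count below is an assumption for illustration; the en split of medical-o1-reasoning-SFT is on the order of twenty thousand examples):

```python
import math

n_rows = 20000                       # assumed dataset size, illustrative only
effective_batch = 2 * 4              # per-device batch * gradient accumulation
num_train_epochs = 3

steps_per_epoch = math.ceil(n_rows / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)  # → 2500 7500
```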


# Test with two questions; both produce good answers:

question = "A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?"

FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!
# `prompt_style` is assumed to be an inference-time template (train_prompt_style without the final answer slot)
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])

question = "Given a patient who experiences sudden-onset chest pain radiating to the neck and left arm, with a past medical history of hypercholesterolemia and coronary artery disease, elevated troponin I levels, and tachycardia, what is the most likely coronary artery involved based on this presentation?"

FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])