III. Fine-tuning QwQ-32B with unsloth


1. Installing and deploying unsloth

pip install unsloth
pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

2. Installing and registering wandb

2.1 wandb basics

Wandb (Weights & Biases, wandb.ai) is a tool for experiment tracking, visualization, and management in machine learning projects. It is designed to help users monitor the training process more effectively, optimize performance, and share and reproduce experiment results. From a user's perspective, wandb is essentially a logging SDK: it keeps one copy of the logs locally and another copy on the remote server.

2.2 Registering and using wandb

Register an account on the official site and save your API key.

3. Downloading and running the QwQ-32B 4-bit dynamically quantized model

Using Hugging Face as an example (safetensors format), on the AutoDL platform.

3.1 Downloading the model

  • Create a directory: mkdir ./QwQ-32B-unsloth-bnb-4bit
  • Install huggingface_hub: pip install huggingface_hub
  • (Optional) Use screen to keep a persistent session:
    • Install screen: sudo apt install screen -y
    • Start a new session: screen -S mysession
    • Reattach to a session: screen -r mysession
    • List all sessions: screen -ls
  • (Optional) Change the default Hugging Face download path:
    • First create a folder named HF_download under /root/autodl-tmp to serve as the Hugging Face download directory:
      • cd /root/autodl-tmp
      • mkdir HF_download
    • Then open the .bashrc file in the root home directory and append export HF_HOME="/root/autodl-tmp/HF_download" at the end
    • Save, exit, and run source ~/.bashrc
  • Download the model weights:
    • In Jupyter, run:
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id = "unsloth/QwQ-32B-unsloth-bnb-4bit",
    local_dir = "QwQ-32B-unsloth-bnb-4bit",
)

3.2 Calling the model with transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "./QwQ-32B-unsloth-bnb-4bit"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "你好,你是谁"
messages = [
    {"role": "user", "content": prompt}
    ]
# Apply the chat template to build the prompt text
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
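The list comprehension above strips the prompt tokens from each generated sequence, keeping only the newly generated tokens. The same idea on toy token lists (pure Python, illustrative values only):

```python
# Each generated sequence begins with the prompt tokens; slice them off
# to keep only the model's new tokens (made-up token ids for illustration).
input_ids_batch = [[1, 2, 3]]         # tokenized prompt
generated = [[1, 2, 3, 9, 8, 7]]      # prompt tokens followed by new tokens
new_tokens = [out[len(inp):] for inp, out in zip(input_ids_batch, generated)]
print(new_tokens)  # → [[9, 8, 7]]
```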

3.3 Downloading the reasoning fine-tuning dataset (Hugging Face)

  Here we download the medical-o1-reasoning-SFT dataset directly from Hugging Face: FreedomIntelligence/medical-o1-reasoning-SFT · Datasets at Hugging Face

3.3.1 Setting up the proxy environment

  Because network access to Hugging Face may be restricted, configure the network environment before downloading the dataset. On an AutoDL server, you can enable academic acceleration as follows to reach Hugging Face and download the dataset:

import subprocess
import os

result = subprocess.run('bash -c "source /etc/network_turbo && env | grep proxy"', shell=True, capture_output=True, text=True)
output = result.stdout
for line in output.splitlines():
    if '=' in line:
        var, value = line.split('=', 1)
        os.environ[var] = value
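The loop above copies every proxy variable reported by `env | grep proxy` into the current process's environment. Its parsing logic can be checked on a sample string (the proxy address below is made up):

```python
# Parse `VAR=value` lines the same way the snippet above does;
# split on the first '=' only, since the value itself may contain '='.
sample_output = "http_proxy=http://127.0.0.1:6006\nhttps_proxy=http://127.0.0.1:6006"
parsed = {}
for line in sample_output.splitlines():
    if "=" in line:
        var, value = line.split("=", 1)
        parsed[var] = value
print(parsed["https_proxy"])  # → http://127.0.0.1:6006
```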

3.3.2 Downloading the dataset

Install the library: pip install datasets

import os
from datasets import load_dataset

# Extract the end-of-sequence token (assumes the tokenizer from section 3.2 is loaded):
EOS_TOKEN = tokenizer.eos_token
# Define a function that reshapes the medical-o1-reasoning-SFT dataset: the Question,
# Complex_CoT, and Response columns are slotted into train_prompt_style (the template
# defined in section 4) and the end-of-sequence token is appended:
def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for input, cot, output in zip(inputs, cots, outputs):
        text = train_prompt_style.format(input, cot, output) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }
# Start with just the first 500 examples to check the pipeline
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT","en", split = "train[0:500]",trust_remote_code=True)
# Then apply the structured formatting:
dataset = dataset.map(formatting_prompts_func, batched = True,)
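To see what formatting_prompts_func actually produces, here is a self-contained sketch with a simplified template and a made-up EOS token (no datasets dependency; the real run uses train_prompt_style from section 4 and the tokenizer's own EOS token):

```python
# Simplified stand-ins for the real template and EOS token (assumptions).
train_prompt_style = """### Question:
{}

### Response:
<think>
{}
</think>
{}
"""
EOS_TOKEN = "<|endoftext|>"

def formatting_prompts_func(examples):
    texts = []
    for q, cot, ans in zip(examples["Question"], examples["Complex_CoT"], examples["Response"]):
        # Question, chain-of-thought, and answer are slotted into the template,
        # then the EOS token is appended so the model learns where to stop.
        texts.append(train_prompt_style.format(q, cot, ans) + EOS_TOKEN)
    return {"text": texts}

batch = {
    "Question": ["What causes fever?"],
    "Complex_CoT": ["Pyrogens raise the hypothalamic set point..."],
    "Response": ["Most commonly, infection."],
}
print(formatting_prompts_func(batch)["text"][0])
```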

3.4 A minimum viable experiment (initial fine-tuning)

Next we try fine-tuning the model. With the current dataset we can fine-tune on a subset of the original data, or iterate over the full dataset multiple times. For most fine-tuning work it pays to start from a minimum viable experiment: fine-tune on a small amount of data first and observe whether it has an effect. Only if the run completes smoothly and shows measurable improvement should you bring in more data for a larger-scale run.

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
import wandb


max_seq_length = 2048
dtype = None
load_in_4bit = True

# Now put the model into fine-tuning (PEFT) mode:

model = FastLanguageModel.get_peft_model(
    model,
    r=16,  
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,  
    bias="none",  
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,  
    loftq_config=None,
)
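With r=16, the LoRA adapters add only a tiny fraction of new trainable parameters: each adapted weight matrix W of shape (out × in) gains two low-rank matrices of shapes (out × r) and (r × in). A back-of-the-envelope calculation (the 5120 dimension below is an illustrative assumption, not a claim about QwQ-32B's actual hidden size):

```python
# Parameters added by a LoRA adapter on one (out x in) weight matrix:
# B is (out x r) and A is (r x in), so the count is r * (out + in).
def lora_added_params(out_features, in_features, r=16):
    return r * (out_features + in_features)

full = 5120 * 5120                       # full weight matrix, assumed 5120x5120
added = lora_added_params(5120, 5120)    # parameters the adapter introduces
print(added, f"{added / full:.4f}")      # the adapter is well under 1% of the full matrix
```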
# Create the supervised fine-tuning trainer:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        # Use num_train_epochs = 1, warmup_ratio for full training runs!
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)

# Then set up wandb (optional):
wandb.login(key="YOUR_WANDB_API_KEY")
run = wandb.init(project='Fine-tune-QwQ-32B-4bit on Medical COT Dataset', )
trainer_stats = trainer.train()

3.5 SFTTrainer

This code uses SFTTrainer for supervised fine-tuning (Supervised Fine-Tuning, SFT) and applies to model fine-tuning in the transformers / Unsloth ecosystem:
1. Importing the libraries

  • SFTTrainer (from the trl library):

  • trl (Transformer Reinforcement Learning) is a Hugging Face library that provides supervised fine-tuning (SFT) and reinforcement learning (RLHF) functionality.

  • SFTTrainer handles supervised fine-tuning and works with low-rank adaptation methods such as LoRA.

  • TrainingArguments (from the transformers library):

  • This class defines the training hyperparameters: batch size, learning rate, optimizer, number of training steps, and so on.

  • is_bfloat16_supported() (from unsloth):

  • This function checks whether the current GPU supports bfloat16 (BF16), returning True if it does and False otherwise.

  • bfloat16 is a more efficient numeric format that performs especially well on newer NVIDIA GPUs such as the A100/H100.

2. Initializing SFTTrainer for fine-tuning

Parameter breakdown

SFTTrainer arguments:

  • model=model — the pretrained model to fine-tune
  • tokenizer=tokenizer — the tokenizer used to process the text data
  • train_dataset=dataset — the training dataset
  • dataset_text_field="text" — which dataset column holds the training text (built in formatting_prompts_func)
  • max_seq_length=max_seq_length — maximum sequence length, capping the number of input tokens
  • dataset_num_proc=2 — number of parallel processes for data loading, speeding up preprocessing

TrainingArguments:

  • per_device_train_batch_size=2 — training batch size per GPU/device (small values suit large models)
  • gradient_accumulation_steps=4 — gradient accumulation steps (effective batch size = 2 × 4 = 8)
  • warmup_steps=5 — warmup steps (the learning rate starts low and ramps up)
  • max_steps=60 — maximum number of training steps (here roughly 60 × 8 = 480 samples are consumed in total)
  • learning_rate=2e-4 — learning rate (2e-4 = 0.0002, controls the size of weight updates)
  • fp16=not is_bfloat16_supported() — fall back to fp16 (16-bit floats) if the GPU does not support bfloat16
  • bf16=is_bfloat16_supported() — enable bfloat16 when supported (more stable training)
  • logging_steps=10 — log training metrics every 10 steps
  • optim="adamw_8bit" — 8-bit AdamW optimizer, reducing GPU memory usage
  • weight_decay=0.01 — weight decay (L2 regularization) to curb overfitting
  • lr_scheduler_type="linear" — learning-rate schedule (linear decay)
  • seed=3407 — random seed for reproducibility
  • output_dir="outputs" — output directory for training artifacts
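The effective batch size and data consumption figures above follow from simple arithmetic (a sanity check, not trainer code):

```python
# Effective batch size = per-device batch size * gradient accumulation steps.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60

effective_batch = per_device_train_batch_size * gradient_accumulation_steps
samples_seen = effective_batch * max_steps  # samples consumed over the whole run
print(effective_batch, samples_seen)  # → 8 480
```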

3.6 Checking the results

Note that after fine-tuning finishes, unsloth automatically updates the model weights in memory, so the fine-tuned model can be called directly without merging weights by hand:

FastLanguageModel.for_inference(model)
question_1 = "A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?"
# `prompt_style` is assumed to be an inference-time template (train_prompt_style without the final answer slot)
inputs = tokenizer([prompt_style.format(question_1, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=2048,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])

3.7 Merging the model

The locally saved model weights now live in the outputs folder.


  • Merge the model weights with: model.save_pretrained_merged("QwQ-Medical-COT-Tiny", tokenizer, save_method = "merged_4bit",)
  • Alternatively, save the model in GGUF format, which is convenient for inference with ollama. Exporting and merging here takes a while, so be patient: model.save_pretrained_gguf("QwQ-Medical-COT-Tiny-GGUF", tokenizer, quantization_method = "q4_k_m")

3.8 Model inference

To avoid variable-name clashes, restart the Jupyter kernel first (or start a new notebook).

from unsloth import FastLanguageModel

max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/root/autodl-tmp/QwQ-Medical-COT-Tiny",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

FastLanguageModel.for_inference(model) 

prompt_style_chat = """请写出一个恰当的回答来完成当前对话任务。

### Instruction:
你是一名助人为乐的助手。

### Question:
{}

### Response:
<think>{}"""

question = "你好,好久不见!"
inputs = tokenizer([prompt_style_chat.format(question, "")], return_tensors="pt").to("cuda")
outputs = model.generate(
    input_ids=inputs.input_ids,
    max_new_tokens=2048,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])
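The split on "### Response:" in the final print keeps only the text the model produced after the response marker; illustrated on a toy decoded string:

```python
# Everything after the first "### Response:" marker is the model's answer.
decoded = "### Question:\nhi\n\n### Response:\n<think>...</think>\nHello!"
answer = decoded.split("### Response:")[1]
print(answer.strip())
```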
  • Transformers inference
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "/root/autodl-tmp/QwQ-Medical-COT-Tiny"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "你好,好久不见!"
messages = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
  • Ollama inference

Copy the generated Q4_K_M.gguf weights into a separate directory, then write a ModelFile so that ollama can load the model:

cd /root/autodl-tmp/QwQ-Medical-COT-Tiny-GGUF
mkdir ./unsloth_Q4
cp unsloth.Q4_K_M.gguf ./unsloth_Q4

Then create the ModelFile and enter the following content:

FROM ./unsloth.Q4_K_M.gguf

TEMPLATE """
请写出一个恰当的回答来完成当前对话任务。

### Instruction:
你是一名助人为乐的助手。

### Question:
{{ .Prompt }}

### Response:
<think>{{ .Response }}<|im_end|>
"""

PARAMETER stop "<|im_end|>"
PARAMETER stop "<|end_of_text|>"
PARAMETER stop "<|reserved_special_token_>"
PARAMETER temperature 1.5
PARAMETER min_p 0.1

Save and exit, then register the model: ollama create unsloth_model -f /root/autodl-tmp/QwQ-Medical-COT-Tiny-GGUF/unsloth_Q4/ModelFile

Call the model:

from openai import OpenAI
client = OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='ollama',  # required but ignored
)
prompt = "你好,好久不见!"
messages = [
    {"role": "user", "content": prompt}
]
response = client.chat.completions.create(
    messages=messages,
    model='unsloth_model',
)

print(response.choices[0].message.content)
  • vLLM inference

First launch a vLLM server in a terminal (here across two GPUs):

CUDA_VISIBLE_DEVICES=0,1 vllm serve /root/autodl-tmp/QwQ-Medical-COT-Tiny --tensor-parallel-size 2
Then call the served model through its OpenAI-compatible API:
from openai import OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

prompt = "你好,好久不见!"
messages = [
    {"role": "user", "content": prompt}
]

response = client.chat.completions.create(
    model="/root/autodl-tmp/QwQ-Medical-COT-Tiny",
    messages=messages,
)

print(response.choices[0].message.content)

4. The full efficient fine-tuning experiment

from unsloth import FastLanguageModel
from modelscope import AutoModelForCausalLM, AutoTokenizer
import torch
from datasets import load_dataset

train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 

### Question:
{}

### Response:
<think>
{}
</think>
{}
"""

# Load the model using Unsloth (the weights were downloaded in advance)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "./QwQ-32B-unsloth-bnb-4bit",  # or another supported model
    max_seq_length = 4096,
    dtype = torch.float16,
    load_in_4bit = True,  # or False
)

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for input, cot, output in zip(inputs, cots, outputs):
        text = train_prompt_style.format(input, cot, output) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }

# medical_o1_sft.json was downloaded in advance from medical-o1-reasoning-SFT
dataset = load_dataset("json",data_files="medical_o1_sft.json", split = "train",trust_remote_code=True)
dataset = dataset.map(formatting_prompts_func, batched = True,)


# Then call get_peft_model
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

max_seq_length = 2048
dtype = None
load_in_4bit = True

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs = 3,
        warmup_steps=5,
        # max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)

trainer_stats = trainer.train()
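With num_train_epochs=3 instead of max_steps=60, the total number of optimizer steps scales with the dataset size. A rough estimate (the row count below is an assumption for illustration; the en split of medical-o1-reasoning-SFT is on the order of twenty thousand examples):

```python
import math

n_rows = 20000                       # assumed dataset size, illustrative only
effective_batch = 2 * 4              # per-device batch * gradient accumulation
num_train_epochs = 3

steps_per_epoch = math.ceil(n_rows / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)  # → 2500 7500
```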


# Test with two questions; both produce good answers:

question = "A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?"

FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!
# `prompt_style` is assumed to be an inference-time template (train_prompt_style without the final answer slot)
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])

question = "Given a patient who experiences sudden-onset chest pain radiating to the neck and left arm, with a past medical history of hypercholesterolemia and coronary artery disease, elevated troponin I levels, and tachycardia, what is the most likely coronary artery involved based on this presentation?"

FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])