1. Installing and Deploying unsloth
pip install unsloth
pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
2. Installing and Registering wandb
2.1 wandb Basics
Wandb (Weights & Biases, at wandb.ai) is a tool for experiment tracking, visualization, and management in machine learning projects. It helps users monitor the training process, tune performance, and share and reproduce experiment results more effectively. From a user's point of view, wandb is simply a logging SDK: it keeps one copy of the logs locally and stores another copy remotely.
2.2 Registering and Using wandb
Register an account on the official site and note down your API key.
3. Downloading and Calling the QwQ-32B 4-bit Dynamic Quantization Model
- Hugging Face: huggingface.co/unsloth/QwQ…
- ModelScope: modelscope.cn/models/unsl…
The following takes Hugging Face (safetensors format) on the AutoDL platform as an example.
3.1 Download the Model
- Create a folder:
mkdir ./QwQ-32B-unsloth-bnb-4bit
- Install huggingface_hub:
pip install huggingface_hub
- (Optional) Use screen for a persistent session:
  - Install screen: sudo apt install screen -y
  - Start a new session: screen -S mysession
  - Reattach to a session: screen -r mysession
  - List all sessions: screen -ls
- (Optional) Change the default Hugging Face download path:
  - First create a folder named HF_download under /root/autodl-tmp to serve as the Hugging Face download directory:
    cd /root/autodl-tmp
    mkdir HF_download
  - Then open the .bashrc file under the root folder and append at the end:
    export HF_HOME="/root/autodl-tmp/HF_download"
  - Save, exit, and run:
    source ~/.bashrc
- Download the model weights by entering the following in Jupyter:
from huggingface_hub import snapshot_download
snapshot_download(
repo_id = "unsloth/QwQ-32B-unsloth-bnb-4bit",
local_dir = "QwQ-32B-unsloth-bnb-4bit"
)
3.2 Calling the Model with transformers
# The weights were downloaded from Hugging Face, so load them with transformers directly
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "./QwQ-32B-unsloth-bnb-4bit"
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "你好,你是谁"
messages = [
{"role": "user", "content": prompt}
]
# Apply the chat template to turn the messages into a single prompt string
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
**model_inputs,
max_new_tokens=32768
)
# Keep only the newly generated tokens, dropping the echoed prompt
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
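The list-comprehension slice near the end of the snippet is what separates newly generated tokens from the echoed prompt. A minimal sketch with plain Python lists standing in for tensors (the token ids below are made up for illustration):

```python
# Toy stand-ins for model_inputs.input_ids and the output of model.generate():
input_ids_batch = [[101, 2023, 2003]]             # prompt token ids
generated_batch = [[101, 2023, 2003, 7592, 102]]  # prompt + newly generated ids

# Same idiom as in the snippet above: drop the first len(prompt) tokens
new_tokens = [
    out[len(inp):] for inp, out in zip(input_ids_batch, generated_batch)
]
print(new_tokens)  # [[7592, 102]]
```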
3.3 Downloading the Reasoning Fine-Tuning Dataset (Hugging Face)
Here we download the medical-o1-reasoning-SFT dataset directly from Hugging Face: FreedomIntelligence/medical-o1-reasoning-SFT · Datasets at Hugging Face
3.3.1 Set Up the Proxy Environment
Because network access to Hugging Face is restricted, configure the network environment before downloading the dataset. On an AutoDL server, you can enable academic acceleration as follows to reach Hugging Face and download the dataset:
import subprocess
import os
result = subprocess.run('bash -c "source /etc/network_turbo && env | grep proxy"', shell=True, capture_output=True, text=True)
output = result.stdout
for line in output.splitlines():
if '=' in line:
var, value = line.split('=', 1)
os.environ[var] = value
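Note the maxsplit argument in split('=', 1): proxy values are URLs that may themselves contain '=', so only the first '=' should separate the variable name from its value. A quick illustration (the proxy address is a made-up example):

```python
# A proxy line whose value contains '=' characters of its own
line = "http_proxy=http://127.0.0.1:7890/?region=cn"
var, value = line.split("=", 1)  # split only on the first '='
print(var)    # http_proxy
print(value)  # http://127.0.0.1:7890/?region=cn
```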
3.3.2 Download the Dataset
Install the library: pip install datasets
import os
from datasets import load_dataset
# Extract and set the end-of-sequence token:
EOS_TOKEN = tokenizer.eos_token
# Define a function that reformats the medical-o1-reasoning-SFT dataset: fill the Question, Complex_CoT, and Response columns into the prompt template and append the end-of-sequence token (train_prompt_style is the training prompt template defined in Section 4):
def formatting_prompts_func(examples):
inputs = examples["Question"]
cots = examples["Complex_CoT"]
outputs = examples["Response"]
texts = []
for input, cot, output in zip(inputs, cots, outputs):
text = train_prompt_style.format(input, cot, output) + EOS_TOKEN
texts.append(text)
return {
"text": texts,
}
# Pull just the first 500 examples to check the effect
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT","en", split = "train[0:500]",trust_remote_code=True)
# Then apply the structured formatting:
dataset = dataset.map(formatting_prompts_func, batched = True,)
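To make the mapping concrete, here is a self-contained sketch of what formatting_prompts_func produces for a toy batch. The template and EOS token below are stand-ins for illustration only (the real train_prompt_style is defined in Section 4, and EOS_TOKEN comes from the tokenizer):

```python
# Stand-ins for illustration only:
train_prompt_style = "### Question:\n{}\n\n### Response:\n<think>\n{}\n</think>\n{}\n"
EOS_TOKEN = "<|endoftext|>"

def formatting_prompts_func(examples):
    # Concatenate Question, Complex_CoT, and Response into one training string
    texts = []
    for q, cot, resp in zip(examples["Question"], examples["Complex_CoT"], examples["Response"]):
        texts.append(train_prompt_style.format(q, cot, resp) + EOS_TOKEN)
    return {"text": texts}

batch = {
    "Question": ["What deficiency causes scurvy?"],
    "Complex_CoT": ["Scurvy involves bleeding gums; collagen synthesis needs vitamin C."],
    "Response": ["Vitamin C deficiency."],
}
out = formatting_prompts_func(batch)
print(out["text"][0])
```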
3.4 Running a Minimum Viable Experiment (Initial Fine-Tuning)
Next we try fine-tuning the model. For the current dataset, we can fine-tune on part of the original data, or take the full dataset and iterate over it multiple times. For most fine-tuning work it pays to start with a minimum viable experiment: bring in a small amount of data first and observe the effect. If the fine-tuning runs smoothly and produces a measurable improvement, then consider scaling up to more data and a larger run.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
import wandb
max_seq_length = 2048
dtype = None
load_in_4bit = True
# First load the base model with unsloth (using the variables defined above):
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "./QwQ-32B-unsloth-bnb-4bit",
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
)
# Then switch the model into fine-tuning (LoRA) mode:
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=[
"q_proj",
"k_proj",
"v_proj",
"o_proj",
"gate_proj",
"up_proj",
"down_proj",
],
lora_alpha=16,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth", # True or "unsloth" for very long context
random_state=3407,
use_rslora=False,
loftq_config=None,
)
# Create the supervised fine-tuning trainer:
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=max_seq_length,
dataset_num_proc=2,
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
# Use num_train_epochs = 1, warmup_ratio for full training runs!
warmup_steps=5,
max_steps=60,
learning_rate=2e-4,
fp16=not is_bfloat16_supported(),
bf16=is_bfloat16_supported(),
logging_steps=10,
optim="adamw_8bit",
weight_decay=0.01,
lr_scheduler_type="linear",
seed=3407,
output_dir="outputs",
),
)
# Then configure wandb (optional):
wandb.login(key="YOUR_WANDB_API_KEY")
run = wandb.init(project='Fine-tune-QwQ-32B-4bit on Medical COT Dataset', )
trainer_stats = trainer.train()
3.5 SFTTrainer Explained
This code performs supervised fine-tuning (SFT) with SFTTrainer, and applies to fine-tuning models in the transformers and Unsloth ecosystems:
1. Imported libraries
- SFTTrainer (from the trl library):
  - trl (Transformer Reinforcement Learning) is a Hugging Face library that provides supervised fine-tuning (SFT) and reinforcement learning (RLHF) functionality.
  - SFTTrainer is mainly used for supervised fine-tuning and suits low-rank adaptation approaches such as LoRA.
- TrainingArguments (from the transformers library):
  - This class defines the training hyperparameters, such as batch size, learning rate, optimizer, and number of training steps.
- is_bfloat16_supported() (from unsloth):
  - This function checks whether the current GPU supports bfloat16 (BF16), returning True if it does and False otherwise.
  - bfloat16 is a more efficient numeric format that performs especially well on newer NVIDIA GPUs such as the A100/H100.
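The practical difference is that bfloat16 keeps float32's 8-bit exponent (so its dynamic range matches float32) while giving up mantissa precision, whereas float16 keeps more mantissa but overflows above 65504. A pure-Python sketch that emulates bf16 rounding by truncating the float32 bit pattern (an approximation for illustration, not a library API):

```python
import struct

FP16_MAX = 65504.0  # largest finite float16 value

def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding,
    rounding to nearest on the dropped bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits = (bits + 0x8000) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(3e38 > FP16_MAX)   # True: 3e38 would overflow float16 to inf
print(to_bf16(3e38))     # still finite: bf16 inherits float32's exponent range
print(to_bf16(1.001))    # 1.0: only about 2-3 significant decimal digits survive
```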
2. Initializing SFTTrainer for model fine-tuning
Parameter reference
① SFTTrainer arguments
| Parameter | Purpose |
|---|---|
| model=model | The pretrained model to fine-tune |
| tokenizer=tokenizer | The tokenizer used to process the text data |
| train_dataset=dataset | The training dataset |
| dataset_text_field="text" | Which dataset column holds the training text (built in formatting_prompts_func) |
| max_seq_length=max_seq_length | Maximum sequence length, capping the number of input tokens |
| dataset_num_proc=2 | Number of parallel processes for data loading, speeding up preprocessing |
② TrainingArguments arguments
| Parameter | Purpose |
|---|---|
| per_device_train_batch_size=2 | Training batch size per GPU/device (smaller values suit large models) |
| gradient_accumulation_steps=4 | Gradient accumulation steps (effective batch size = 2 × 4 = 8) |
| warmup_steps=5 | Warmup steps (the learning rate starts low and ramps up) |
| max_steps=60 | Maximum number of training steps (caps the run; here about 60 × 8 = 480 examples are consumed) |
| learning_rate=2e-4 | Learning rate (2e-4 = 0.0002, controlling the size of weight updates) |
| fp16=not is_bfloat16_supported() | Use fp16 (16-bit floats) when the GPU does not support bfloat16 |
| bf16=is_bfloat16_supported() | Enable bfloat16 when the GPU supports it (more stable training) |
| logging_steps=10 | Log training metrics every 10 steps |
| optim="adamw_8bit" | Use the 8-bit AdamW optimizer to reduce GPU memory usage |
| weight_decay=0.01 | Weight decay (L2 regularization) to curb overfitting |
| lr_scheduler_type="linear" | Learning-rate schedule (linear decay) |
| seed=3407 | Random seed, making results reproducible |
| output_dir="outputs" | Output directory for training artifacts |
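The batch-size arithmetic implied by these arguments can be checked directly; a small worked example using the values from the table:

```python
# Values from the TrainingArguments above
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60

# One optimizer step sees batch_size x accumulation examples
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
# Total examples consumed over the whole run
examples_consumed = effective_batch_size * max_steps

print(effective_batch_size)  # 8
print(examples_consumed)     # 480, roughly one pass over the 500-example subset
```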
3.6 Checking the Results
Note that after fine-tuning finishes, unsloth automatically updates the model weights (in memory), so the fine-tuned model can be called directly without merging weights manually:
FastLanguageModel.for_inference(model)
question_1 = "A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?"
# prompt_style (an assumption here) is the inference-time prompt template: the same
# layout as the training template but ending at the opening <think> tag, so the
# model generates its own chain of thought:
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.

### Question:
{}

### Response:
<think>{}"""
inputs = tokenizer([prompt_style.format(question_1, "")], return_tensors="pt").to("cuda")
outputs = model.generate(
input_ids=inputs.input_ids,
attention_mask=inputs.attention_mask,
max_new_tokens=2048,
use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])
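The split on the "### Response:" marker is what discards the echoed prompt and keeps only the model's answer. A minimal illustration with a made-up decoded string:

```python
# A toy decoded output: the echoed prompt followed by the model's answer
decoded = "### Question:\nWhat is 2+2?\n### Response:\n<think>Add 2 and 2.</think>\n4"
answer = decoded.split("### Response:")[1]
print(answer)  # everything after the marker: the <think> block and the final answer
```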
3.7 Model Merging
The locally saved model weights are now in the outputs folder:
- Merge the model weights with the following code:
model.save_pretrained_merged("QwQ-Medical-COT-Tiny", tokenizer, save_method = "merged_4bit",)
- Alternatively, save the model in GGUF format so it can be served with ollama. Exporting and merging here take quite a while, so please be patient:
model.save_pretrained_gguf("QwQ-Medical-COT-Tiny-GGUF", tokenizer, quantization_method = "q4_k_m")
3.8 Model Inference
To avoid variable-name conflicts, restart the Jupyter kernel first (or start a separate notebook).
from unsloth import FastLanguageModel
max_seq_length = 2048
dtype = None
load_in_4bit = True
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "/root/autodl-tmp/QwQ-Medical-COT-Tiny",
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model)
prompt_style_chat = """请写出一个恰当的回答来完成当前对话任务。
### Instruction:
你是一名助人为乐的助手。
### Question:
{}
### Response:
<think>{}"""
question = "你好,好久不见!"
inputs = tokenizer([prompt_style_chat.format(question, "")], return_tensors="pt").to("cuda")
outputs = model.generate(
input_ids=inputs.input_ids,
max_new_tokens=2048,
use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])
- Inference with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "/root/autodl-tmp/QwQ-Medical-COT-Tiny"
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "你好,好久不见!"
messages = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
**model_inputs,
max_new_tokens=32768
)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
- Inference with ollama
Copy the generated Q4_K_M.gguf model weights into their own folder, then write a ModelFile so ollama can load them:
cd /root/autodl-tmp/QwQ-Medical-COT-Tiny-GGUF
mkdir ./unsloth_Q4
cp unsloth.Q4_K_M.gguf ./unsloth_Q4
Then create a ModelFile with the following content:
FROM ./unsloth.Q4_K_M.gguf
TEMPLATE """
请写出一个恰当的回答来完成当前对话任务。
### Instruction:
你是一名助人为乐的助手。
### Question:
{{ .Prompt }}
### Response:
<think>{{ .Response }}<|im_end|>
"""
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|end_of_text|>"
PARAMETER stop "<|reserved_special_token_>"
PARAMETER temperature 1.5
PARAMETER min_p 0.1
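The stop PARAMETERs tell ollama to cut generation as soon as one of those markers appears in the output. Conceptually (a toy sketch of the truncation behavior, not ollama's actual implementation):

```python
def apply_stops(text: str, stops: list[str]) -> str:
    """Truncate text at the earliest occurrence of any stop marker."""
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            text = text[:idx]
    return text

raw = "你好!很高兴再次见到你。<|im_end|>leftover tokens"
print(apply_stops(raw, ["<|im_end|>", "<|end_of_text|>"]))
```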
Save and exit, then register the model:
ollama create unsloth_model -f /root/autodl-tmp/QwQ-Medical-COT-Tiny-GGUF/unsloth_Q4/ModelFile
Then call it:
from openai import OpenAI
client = OpenAI(
base_url='http://localhost:11434/v1/',
api_key='ollama', # required but ignored
)
prompt = "你好,好久不见!"
messages = [
{"role": "user", "content": prompt}
]
response = client.chat.completions.create(
messages=messages,
model='unsloth_model',
)
print(response.choices[0].message.content)
- Inference with vLLM
CUDA_VISIBLE_DEVICES=0,1 vllm serve /root/autodl-tmp/QwQ-Medical-COT-Tiny --tensor-parallel-size 2
from openai import OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
prompt = "你好,好久不见!"
messages = [
{"role": "user", "content": prompt}
]
response = client.chat.completions.create(
model="/root/autodl-tmp/QwQ-Medical-COT-Tiny",
messages=messages,
)
print(response.choices[0].message.content)
4. Full High-Efficiency Fine-Tuning Experiment
from unsloth import FastLanguageModel
from modelscope import AutoModelForCausalLM, AutoTokenizer
import torch
from datasets import load_dataset
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.
### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.
### Question:
{}
### Response:
<think>
{}
</think>
{}
"""
# Load the model with unsloth; the weights were downloaded in advance
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "./QwQ-32B-unsloth-bnb-4bit", # or another supported model
max_seq_length = 4096,
dtype = torch.float16,
load_in_4bit = True, # or False
)
EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
inputs = examples["Question"]
cots = examples["Complex_CoT"]
outputs = examples["Response"]
texts = []
for input, cot, output in zip(inputs, cots, outputs):
text = train_prompt_style.format(input, cot, output) + EOS_TOKEN
texts.append(text)
return {
"text": texts,
}
The file medical_o1_sft.json (the full medical-o1-reasoning-SFT dataset) has been downloaded in advance:
dataset = load_dataset("json",data_files="medical_o1_sft.json", split = "train",trust_remote_code=True)
dataset = dataset.map(formatting_prompts_func, batched = True,)
# Then call get_peft_model
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
lora_alpha=16,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth",
random_state=3407,
use_rslora=False,
loftq_config=None,
)
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
max_seq_length = 2048
dtype = None
load_in_4bit = True
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=max_seq_length,
dataset_num_proc=2,
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
num_train_epochs = 3,
warmup_steps=5,
# max_steps=60,
learning_rate=2e-4,
fp16=not is_bfloat16_supported(),
bf16=is_bfloat16_supported(),
logging_steps=10,
optim="adamw_8bit",
weight_decay=0.01,
lr_scheduler_type="linear",
seed=3407,
output_dir="outputs",
),
)
trainer_stats = trainer.train()
# Test with two questions; both receive good answers:
question = "A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?"
FastLanguageModel.for_inference(model) # Unsloth has 2x faster inference!
# prompt_style (an assumption here) is the inference-time prompt template: the same
# layout as train_prompt_style but ending at the opening <think> tag, so the model
# generates its own chain of thought:
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.

### Question:
{}

### Response:
<think>{}"""
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")
outputs = model.generate(
input_ids=inputs.input_ids,
attention_mask=inputs.attention_mask,
max_new_tokens=1200,
use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])
question = "Given a patient who experiences sudden-onset chest pain radiating to the neck and left arm, with a past medical history of hypercholesterolemia and coronary artery disease, elevated troponin I levels, and tachycardia, what is the most likely coronary artery involved based on this presentation?"
FastLanguageModel.for_inference(model) # Unsloth has 2x faster inference!
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")
outputs = model.generate(
input_ids=inputs.input_ids,
attention_mask=inputs.attention_mask,
max_new_tokens=1200,
use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])