Question: Why is SimPO 5x faster than DPO?

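Asking the experts and veterans here for help. A colleague and I have recently been trying SimPO and DPO for alignment, and we ran into a result we can't explain.

Background:

Our team has a qwen2.5-7b model that has already been through SFT. The model currently has some problems, and we want to try aligning it with SimPO and DPO to see whether that fixes them. We trained both in LLaMA Factory with the same parameters (full configs below), and both runs reach 91% accuracy, but SimPO's training time is 1/6 of DPO's. I looked at the objective functions and cannot work out where a factor of $\frac{1}{6}$ would come from.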

First, a quick look at DPO:

The DPO loss function is:

$L_{DPO}(\theta) = -E_{(x,y_w,y_l) \sim D}\left[\log \sigma\left(\beta\left(r_{\theta}(x,y_w) - r_{\theta}(x,y_l)\right)\right)\right]$
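
To make the DPO objective concrete, here is a minimal per-example sketch in plain Python. The post does not define $r_{\theta}$; the sketch assumes the standard DPO choice, $r_{\theta}(x,y) = \log \pi_{\theta}(y|x) - \log \pi_{ref}(y|x)$, so every example needs log-probabilities from a frozen reference model as well as from the policy. The function names and the example numbers are hypothetical, and this is not LLaMA Factory's implementation.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.2) -> float:
    """Per-example DPO loss (sigmoid variant).

    Assumption: r_theta(x, y) is the policy/reference log-ratio, so the
    reference log-probs come from an extra forward pass of a frozen model.
    """
    r_w = policy_chosen_logp - ref_chosen_logp      # r_theta(x, y_w)
    r_l = policy_rejected_logp - ref_rejected_logp  # r_theta(x, y_l)
    return -log_sigmoid(beta * (r_w - r_l))


# Hypothetical summed log-probs: the policy prefers y_w slightly more than the reference does.
print(dpo_loss(-9.0, -12.5, -10.0, -12.0))
```

When the policy equals the reference, both log-ratios are zero and the loss is $\log 2$, which is a handy sanity check for the first training step.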

The DPO training config in LLaMA Factory is:

### model
model_name_or_path: /

### method
stage: dpo
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_offload_config.json

### dataset
dataset: query_summary
template: qwen
cutoff_len: 2048
max_samples: 100000
overwrite_cache: true
preprocessing_num_workers: 8

### output
output_dir: /
logging_steps: 10
save_steps: 30
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 2
learning_rate: 5e-7
num_train_epochs: 2
lr_scheduler_type: cosine
warmup_ratio: 0.03
bf16: true
ddp_timeout: 180000000

### beta coefficient and other hyperparameters

pref_beta: 0.2
pref_ftx: 0.0
pref_loss: sigmoid
pref_bco_loss: 0.1

### eval
val_size: 0.05
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 10

The SimPO loss function is:

$L_{SimPO}(\theta) = -E_{(x,y_w,y_l) \sim D}\left[\log \sigma\left(\beta\left(r_{\theta}(x,y_w) - r_{\theta}(x,y_l) - \gamma\right)\right)\right]$
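
For comparison, here is the same kind of per-example sketch for the SimPO formula as written in this post. It assumes $r_{\theta}(x,y)$ is the length-normalized log-probability $\frac{1}{|y|}\log \pi_{\theta}(y|x)$, as in the SimPO paper, so no reference model appears anywhere. Note that the SimPO paper writes the margin outside $\beta$, so the `gamma` below corresponds to the paper's $\gamma / \beta$. Function names and numbers are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_loss(chosen_logp_sum: float, chosen_len: int,
               rejected_logp_sum: float, rejected_len: int,
               beta: float, gamma: float) -> float:
    """Per-example loss following the formula in the post:
    -log sigmoid(beta * (r_w - r_l - gamma)).

    Assumption: r_theta is the average per-token log-prob under the
    policy only; there is no reference-model term.
    """
    r_w = chosen_logp_sum / chosen_len      # r_theta(x, y_w)
    r_l = rejected_logp_sum / rejected_len  # r_theta(x, y_l)
    return -log_sigmoid(beta * (r_w - r_l - gamma))


# Hypothetical example: average log-probs of -1.0 (chosen) and -1.5 (rejected).
print(simpo_loss(-10.0, 10, -30.0, 20, beta=0.1, gamma=0.5))
```

In this example the reward gap exactly equals the margin, so the argument of the sigmoid is zero and the loss is $\log 2$.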

The parameter settings in LLaMA Factory are:

### model
model_name_or_path: /

### simpo
stage: dpo
pref_loss: simpo
pref_beta: 0.1
simpo_gamma: 0.5

do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json


### dataset
dataset: query_summary
template: qwen
cutoff_len: 2048
max_samples: 100000
overwrite_cache: true
preprocessing_num_workers: 8

### output
output_dir: /
logging_steps: 10
save_steps: 30
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 2
learning_rate: 5.0e-7
num_train_epochs: 2
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.05
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 10
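
Since the comparison depends on the two runs being configured identically, it may help to diff the configs mechanically. The sketch below copies values from the two configs listed above (shared keys abbreviated) and uses a hypothetical helper, `diff_configs`, to list every key whose value differs. It only reports differences; whether any of them explains the timing gap is not established here.

```python
# Settings that are identical in both configs (a subset, copied from the post).
shared = {
    "stage": "dpo",
    "finetuning_type": "full",
    "cutoff_len": 2048,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 2,
    "learning_rate": 5e-7,
    "num_train_epochs": 2,
    "lr_scheduler_type": "cosine",
    "bf16": True,
}

dpo_cfg = {
    **shared,
    "deepspeed": "examples/deepspeed/ds_z3_offload_config.json",
    "pref_loss": "sigmoid",
    "pref_beta": 0.2,
    "pref_ftx": 0.0,
    "pref_bco_loss": 0.1,
    "warmup_ratio": 0.03,
}

simpo_cfg = {
    **shared,
    "deepspeed": "examples/deepspeed/ds_z3_config.json",
    "pref_loss": "simpo",
    "pref_beta": 0.1,
    "simpo_gamma": 0.5,
    "warmup_ratio": 0.1,
}


def diff_configs(a: dict, b: dict) -> dict:
    """Return {key: (a_value, b_value)} for every key whose value differs.

    A key missing from one side shows up as None on that side.
    """
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


differences = diff_configs(dpo_cfg, simpo_cfg)
for key, (dpo_value, simpo_value) in differences.items():
    print(f"{key}: DPO={dpo_value!r}  SimPO={simpo_value!r}")
```

Besides the loss-specific keys, the diff shows a different `deepspeed` file and a different `warmup_ratio`. Judging by the file names, the DPO run uses a ZeRO-3 config with offload while the SimPO run does not, which may be worth ruling out before attributing the whole gap to the loss function.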

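With the configs above, on the same model and in the same environment, SimPO consistently trains faster than DPO, and DPO takes about 6x as long as SimPO every time. Where does the saved time come from? Even allowing for SimPO having no reference model, and so fewer parameters to run, I would have guessed a factor of about 2. SimPO time: 4 min; DPO time: 25 min.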
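
The back-of-the-envelope reasoning can be written down as a rough compute estimate. The sketch below assumes the common rule of thumb that a backward pass costs about twice a forward pass, and that the reference model is the same size as the policy and only runs a forward pass. Both are assumptions rather than measurements, and the function name is hypothetical. The observed ratio uses the roughly 25 min and 4 min reported above.

```python
def relative_step_cost(uses_reference_model: bool,
                       backward_to_forward_ratio: float = 2.0) -> float:
    """Rough per-step compute in units of one policy forward pass.

    Assumptions: backward costs about 2x forward, and the reference model
    (if any) has the same size as the policy and is forward-only.
    """
    policy_cost = 1.0 + backward_to_forward_ratio   # forward + backward
    reference_cost = 1.0 if uses_reference_model else 0.0
    return policy_cost + reference_cost


dpo_cost = relative_step_cost(uses_reference_model=True)     # 4.0 units
simpo_cost = relative_step_cost(uses_reference_model=False)  # 3.0 units

expected_ratio = dpo_cost / simpo_cost  # what the extra forward pass alone predicts
observed_ratio = 25 / 4                 # DPO minutes / SimPO minutes from the post

print(f"expected DPO/SimPO ratio from compute alone: {expected_ratio:.2f}")
print(f"observed DPO/SimPO ratio: {observed_ratio:.2f}")
```

Under these assumptions the reference model alone accounts for a slowdown well below 2x, so an observed ratio above 6 suggests that something other than the loss function differs between the two runs.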