When running distributed training of the DeepSeek-R1-Distill-Qwen-14B model with LLaMA-Factory, the following error is raised: NotImplementedError: Using RTX 4000 series doesn't support faster communication broadband via P2P or IB. Please set NCCL_P2P_DISABLE="1" and NCCL_IB_DISABLE="1" or use accelerate launch which will do this automatically.
This error is caused by a compatibility issue between NVIDIA RTX 4000 series consumer GPUs (Ada Lovelace architecture) and the P2P (peer-to-peer) and IB (InfiniBand) transports in some versions of the NCCL communication library; these cards do not support those fast interconnects, so NCCL must be told to disable them.
Solution
Set system environment variables
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
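Exporting in the current shell only lasts for that session. A small sketch of making the settings persistent and verifying them (the profile path ~/.bashrc is an assumption; adjust for your shell, e.g. ~/.zshrc):

```shell
# Persist the settings for future sessions by appending to the shell profile
# (~/.bashrc is an assumption; use your shell's own profile file)
echo 'export NCCL_P2P_DISABLE=1' >> ~/.bashrc
echo 'export NCCL_IB_DISABLE=1' >> ~/.bashrc

# Export for the current session and confirm both variables are set
export NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1
echo "NCCL_P2P_DISABLE=${NCCL_P2P_DISABLE} NCCL_IB_DISABLE=${NCCL_IB_DISABLE}"
# → NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1
```

Any llamafactory-cli process started from a shell with these variables exported will inherit them, so the NCCL error no longer appears.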
Pass the variables when launching the WebUI
NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 llamafactory-cli webui
Pass the variables when launching via the CLI
NCCL_P2P_DISABLE="1" NCCL_IB_DISABLE="1" USE_MODELSCOPE_HUB="1" llamafactory-cli train \
--stage sft \
--do_train True \
--model_name_or_path /home/seven/LLM/DeepSeek-R1-Distill-Qwen-14B \
--preprocessing_num_workers 16 \
--finetuning_type lora \
--template deepseek3 \
--flash_attn auto \
--dataset_dir /home/seven/LLM/ruozhiba_R1 \
--dataset identity \
--cutoff_len 2048 \
--learning_rate 5e-05 \
--num_train_epochs 3.0 \
--max_samples 100000 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--max_grad_norm 1.0 \
--logging_steps 5 \
--save_steps 100 \
--warmup_steps 0 \
--packing False \
--report_to none \
--use_swanlab True \
--output_dir saves/DeepSeek-R1-14B-Distill/lora/train_2025-02-18-08-14-04 \
--bf16 True \
--plot_loss True \
--trust_remote_code True \
--ddp_timeout 180000000 \
--include_num_input_tokens_seen True \
--optim adamw_torch \
--quantization_bit 4 \
--quantization_method bitsandbytes \
--double_quantization True \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0 \
--lora_target all \
--deepspeed cache/ds_z2_config.json
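The command above points --deepspeed at cache/ds_z2_config.json. If you have not created that file yet, a minimal ZeRO stage-2 config in the style of LLaMA-Factory's bundled examples might look like the sketch below; the field values are assumptions that use DeepSpeed's "auto" placeholders so the HF Trainer fills them in from the CLI arguments, and should be checked against your DeepSpeed version:

```shell
# Write a minimal ZeRO-2 DeepSpeed config (a sketch; values use "auto"
# placeholders so they are resolved from the training arguments)
mkdir -p cache
cat > cache/ds_z2_config.json <<'EOF'
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "overlap_comm": true,
    "reduce_scatter": true,
    "contiguous_gradients": true
  }
}
EOF

# Sanity-check that the file is valid JSON before starting training
python3 -m json.tool cache/ds_z2_config.json > /dev/null && echo "ds_z2_config.json is valid JSON"
```

ZeRO-2 partitions optimizer states and gradients across GPUs while keeping a full copy of the (LoRA) parameters on each device, which pairs well with the per-device batch size 2 and gradient accumulation 8 used above.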