RTX 4000 GPUs: distributed training fails with "doesn't support P2P or IB"

When running distributed training of the DeepSeek-R1-Distill-Qwen-14B model with LLaMA-Factory, the following error is raised: NotImplementedError: Using RTX 4000 series doesn't support faster communication broadband via P2P or IB. Please set NCCL_P2P_DISABLE="1" and NCCL_IB_DISABLE="1" or use accelerate launch which will do this automatically.

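The error is caused by compatibility problems between NVIDIA RTX 4000 series GPUs (Ada Lovelace architecture) and P2P (peer-to-peer) or IB (InfiniBand) communication in some versions of the NCCL communication library. As the message itself suggests, the fix is to disable both transports through environment variables.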

Solution

Set the environment variables in the shell

export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
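
The training launcher runs as a child process, so the variables only help if they are actually exported. A quick check along the following lines may be useful. This is a minimal sketch: the function name `nccl_env_ok` is hypothetical and not part of LLaMA-Factory or NCCL.

```shell
# Export both variables so that child processes inherit them.
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1

# nccl_env_ok (hypothetical helper): succeeds only if a child process
# sees both variables set to 1.
nccl_env_ok() {
    sh -c '[ "$NCCL_IB_DISABLE" = "1" ] && [ "$NCCL_P2P_DISABLE" = "1" ]'
}

if nccl_env_ok; then
    echo "NCCL variables are visible to child processes"
else
    echo "NCCL variables are missing" >&2
fi
```

Note that `export` only affects the current shell session and the processes started from it. To make the setting permanent, the two `export` lines can be added to a shell startup file such as `~/.bashrc`.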

Specify the variables when launching the WebUI

NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 llamafactory-cli webui
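
The `VAR=value command` form used above sets the variable for that single command only and leaves the rest of the shell session unchanged. The sketch below illustrates this scoping with a harmless `sh -c` child in place of `llamafactory-cli`; the function name `show_prefix_scope` is hypothetical.

```shell
# show_prefix_scope (hypothetical helper): demonstrates that a
# VAR=value prefix applies only to the command it precedes.
# The body runs in a subshell so the caller's environment is untouched.
show_prefix_scope() {
    (
        unset NCCL_P2P_DISABLE
        # The child process sees the variable...
        NCCL_P2P_DISABLE=1 sh -c 'echo "inside command: ${NCCL_P2P_DISABLE:-unset}"'
        # ...but the surrounding shell does not keep it.
        echo "after command: ${NCCL_P2P_DISABLE:-unset}"
    )
}

show_prefix_scope
```

This is why the prefix form has to be repeated on every launch, whereas `export` covers every command started later in the same session.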

Specify the variables when launching from the CLI

NCCL_P2P_DISABLE="1" NCCL_IB_DISABLE="1" USE_MODELSCOPE_HUB="1" llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path /home/seven/LLM/DeepSeek-R1-Distill-Qwen-14B \
    --preprocessing_num_workers 16 \
    --finetuning_type lora \
    --template deepseek3 \
    --flash_attn auto \
    --dataset_dir /home/seven/LLM/ruozhiba_R1 \
    --dataset identity \
    --cutoff_len 2048 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --packing False \
    --report_to none \
    --use_swanlab True \
    --output_dir saves/DeepSeek-R1-14B-Distill/lora/train_2025-02-18-08-14-04 \
    --bf16 True \
    --plot_loss True \
    --trust_remote_code True \
    --ddp_timeout 180000000 \
    --include_num_input_tokens_seen True \
    --optim adamw_torch \
    --quantization_bit 4 \
    --quantization_method bitsandbytes \
    --double_quantization True \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --lora_target all \
    --deepspeed cache/ds_z2_config.json