Running ~/AFAC2023_GenerationAndComplianceDetection/BELLE/train/scripts/run.sh fails with the following error:
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
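Port 29500 is torchrun's default master port, and the error means another process on the machine is already using it. The fix is to pass --master_port to torchrun so that it uses a different master port. Note that this option must come before the script to run, i.e. before src/entry_point/train.py. torchrun forwards everything after the script path to the script itself, so a --master_port placed there is ignored by torchrun.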
torchrun --nproc_per_node 2 --master_port 15575 \
src/entry_point/train.py \
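
The port 15575 above is an arbitrary choice and could itself be taken on a busy machine. As a minimal sketch (not part of BELLE or torchrun; the helper name find_free_port is hypothetical), one way to pick a port that is free right now is to bind a socket to port 0 and read back the port the OS assigned:

```python
import socket


def find_free_port() -> int:
    """Return a TCP port that is unused at the moment of the call.

    Hypothetical helper for illustration; not part of BELLE or torchrun.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 asks the OS to pick any free port.
        sock.bind(("", 0))
        # getsockname() returns (host, port); the socket is closed on exit,
        # which releases the port again.
        return sock.getsockname()[1]


if __name__ == "__main__":
    print(find_free_port())
```

Assuming this is saved as find_free_port.py (a made-up file name), the result could be passed as --master_port $(python find_free_port.py), still placed before src/entry_point/train.py. The port is released before torchrun starts, so another process could in principle grab it in between; this lowers the chance of a collision but does not rule it out.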