0. TL;DR
Run `export WANDB_BASE_URL=https://api.bandw.top` to switch W&B's API endpoint; this fixes the original API URL connection timeouts. Log in again and training proceeds normally.
1. Overview
W&B (Weights & Biases) is a tool for logging machine-learning training runs. By tracking and visualizing every stage of the pipeline, from dataset processing to the trained model, it helps you optimize models faster.
W&B currently integrates with many frameworks and libraries: LangChain, LlamaIndex, PyTorch, HF Transformers, Lightning, TensorFlow, Keras, Scikit-Learn, XGBoost, PaddlePaddle, etc.
- W&B official site: wandb.ai/
- W&B GitHub repo: github.com/wandb/wandb
2. Demo: training a PyTorch model with W&B logging
2.1 Hardware environment
This tutorial uses Tencent Cloud CloudStudio, which offers 50,000 free CPU minutes and 10,000 free GPU minutes per month (16C / 32G RAM & Tesla T4 @ 16G VRAM).
- Open the CloudStudio link cloud.tencent.com/product/clo… and register a Tencent Cloud account.
- Go to the CloudStudio home page ide.cloud.tencent.com/, click the high-performance workspace (高性能工作空间) tab, click New (新建), pick the free basic tier in the lower left, confirm, and wait a few minutes.
- Refresh the page until the new workspace entry appears, then click it to start using the high-performance workspace.
- Below is what a successfully opened high-performance workspace looks like.
2.2 Software environment
The free basic environment ships with deepseek-related dependencies (unused here) and a conda setup, so we use conda's (base) environment directly.
- The (base) conda environment is activated automatically. Check the installed packages with `pip list`; PyTorch is not among them, so we install it manually. Environment-check log:
(base) root@VM-0-80-ubuntu:/workspace# which conda /root/miniforge3/bin/conda (base) root@VM-0-80-ubuntu:/workspace# pip list Package Version ----------------------- ----------- archspec 0.2.3 asttokens 3.0.0 boltons 24.0.0 Brotli 1.1.0 certifi 2024.12.14 cffi 1.17.1 charset-normalizer 3.4.1 colorama 0.4.6 comm 0.2.2 conda 24.11.2 conda-libmamba-solver 24.11.1 conda-package-handling 2.4.0 conda_package_streaming 0.11.0 debugpy 1.8.11 decorator 5.1.1 distro 1.9.0 exceptiongroup 1.2.2 executing 2.1.0 frozendict 2.4.6 h2 4.1.0 hpack 4.0.0 hyperframe 6.0.1 idna 3.10 importlib_metadata 8.5.0 ipykernel 6.29.5 ipython 8.31.0 jedi 0.19.2 jsonpatch 1.33 jsonpointer 3.0.0 jupyter_client 8.6.3 jupyter_core 5.7.2 libmambapy 2.0.5 matplotlib-inline 0.1.7 menuinst 2.2.0 nest_asyncio 1.6.0 packaging 24.2 parso 0.8.4 pexpect 4.9.0 pickleshare 0.7.5 pip 24.3.1 platformdirs 4.3.6 pluggy 1.5.0 prompt_toolkit 3.0.48 psutil 6.1.1 ptyprocess 0.7.0 pure_eval 0.2.3 pycosat 0.6.6 pycparser 2.22 Pygments 2.18.0 PySocks 1.7.1 python-dateutil 2.9.0.post0 pyzmq 26.2.0 requests 2.32.3 ruamel.yaml 0.18.8 ruamel.yaml.clib 0.2.8 setuptools 75.6.0 six 1.17.0 stack_data 0.6.3 tornado 6.4.2 tqdm 4.67.1 traitlets 5.14.3 truststore 0.10.0 typing_extensions 4.12.2 urllib3 2.3.0 wcwidth 0.2.13 wheel 0.45.1 zipp 3.21.0 zstandard 0.23.0 -
- Use `nvidia-smi` and related commands to check the GPU and NVIDIA driver versions, which determine which PyTorch build to install. Environment-check log:
(base) root@VM-0-80-ubuntu:/workspace# nvidia-smi Thu Jun 5 13:40:40 2025 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |===============================+======================+======================| | 0 Tesla T4 On | 00000000:00:09.0 Off | 0 | | N/A 33C P8 11W / 70W | 5MiB / 15360MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| | No running processes found | +-----------------------------------------------------------------------------+ (base) root@VM-0-80-ubuntu:/workspace# python --version Python 3.10.11 -
- Install the GPU build of PyTorch:
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118
PyTorch GPU install log:
(base) root@VM-0-80-ubuntu:/workspace# pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118 Looking in indexes: https://download.pytorch.org/whl/cu118 Collecting torch==2.4.1 Downloading https://download.pytorch.org/whl/cu118/torch-2.4.1%2Bcu118-cp310-cp310-linux_x86_64.whl (857.6 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 857.6/857.6 MB 8.6 MB/s eta 0:00:00 Collecting torchvision==0.19.1 Downloading https://download.pytorch.org/whl/cu118/torchvision-0.19.1%2Bcu118-cp310-cp310-linux_x86_64.whl (6.3 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 9.8 MB/s eta 0:00:00 Collecting torchaudio==2.4.1 Downloading https://download.pytorch.org/whl/cu118/torchaudio-2.4.1%2Bcu118-cp310-cp310-linux_x86_64.whl (3.3 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.3/3.3 MB 6.7 MB/s eta 0:00:00 Collecting filelock (from torch==2.4.1) Downloading https://download.pytorch.org/whl/filelock-3.13.1-py3-none-any.whl.metadata (2.8 kB) Requirement already satisfied: typing-extensions>=4.8.0 in /root/miniforge3/lib/python3.10/site-packages (from torch==2.4.1) (4.12.2) Collecting sympy (from torch==2.4.1) Downloading https://download.pytorch.org/whl/sympy-1.13.3-py3-none-any.whl.metadata (12 kB) Collecting networkx (from torch==2.4.1) Downloading https://download.pytorch.org/whl/networkx-3.3-py3-none-any.whl.metadata (5.1 kB) Collecting jinja2 (from torch==2.4.1) Downloading https://download.pytorch.org/whl/Jinja2-3.1.4-py3-none-any.whl.metadata (2.6 kB) Collecting fsspec (from torch==2.4.1) Downloading https://download.pytorch.org/whl/fsspec-2024.6.1-py3-none-any.whl.metadata (11 kB) Collecting nvidia-cuda-nvrtc-cu11==11.8.89 (from torch==2.4.1) Downloading https://download.pytorch.org/whl/cu118/nvidia_cuda_nvrtc_cu11-11.8.89-py3-none-manylinux1_x86_64.whl (23.2 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.2/23.2 MB 14.5 MB/s eta 0:00:00 Collecting nvidia-cuda-runtime-cu11==11.8.89 (from torch==2.4.1) Downloading 
https://download.pytorch.org/whl/cu118/nvidia_cuda_runtime_cu11-11.8.89-py3-none-manylinux1_x86_64.whl (875 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 875.6/875.6 kB 16.0 MB/s eta 0:00:00 Collecting nvidia-cuda-cupti-cu11==11.8.87 (from torch==2.4.1) Downloading https://download.pytorch.org/whl/cu118/nvidia_cuda_cupti_cu11-11.8.87-py3-none-manylinux1_x86_64.whl (13.1 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.1/13.1 MB 21.3 MB/s eta 0:00:00 Collecting nvidia-cudnn-cu11==9.1.0.70 (from torch==2.4.1) Downloading https://download.pytorch.org/whl/cu118/nvidia_cudnn_cu11-9.1.0.70-py3-none-manylinux2014_x86_64.whl (663.9 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 663.9/663.9 MB 10.0 MB/s eta 0:00:00 Collecting nvidia-cublas-cu11==11.11.3.6 (from torch==2.4.1) Downloading https://download.pytorch.org/whl/cu118/nvidia_cublas_cu11-11.11.3.6-py3-none-manylinux1_x86_64.whl (417.9 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 417.9/417.9 MB 9.7 MB/s eta 0:00:00 Collecting nvidia-cufft-cu11==10.9.0.58 (from torch==2.4.1) Downloading https://download.pytorch.org/whl/cu118/nvidia_cufft_cu11-10.9.0.58-py3-none-manylinux1_x86_64.whl (168.4 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 168.4/168.4 MB 12.5 MB/s eta 0:00:00 Collecting nvidia-curand-cu11==10.3.0.86 (from torch==2.4.1) Downloading https://download.pytorch.org/whl/cu118/nvidia_curand_cu11-10.3.0.86-py3-none-manylinux1_x86_64.whl (58.1 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58.1/58.1 MB 13.0 MB/s eta 0:00:00 Collecting nvidia-cusolver-cu11==11.4.1.48 (from torch==2.4.1) Downloading https://download.pytorch.org/whl/cu118/nvidia_cusolver_cu11-11.4.1.48-py3-none-manylinux1_x86_64.whl (128.2 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 128.2/128.2 MB 12.5 MB/s eta 0:00:00 Collecting nvidia-cusparse-cu11==11.7.5.86 (from torch==2.4.1) Downloading https://download.pytorch.org/whl/cu118/nvidia_cusparse_cu11-11.7.5.86-py3-none-manylinux1_x86_64.whl (204.1 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 204.1/204.1 MB 
11.9 MB/s eta 0:00:00 Collecting nvidia-nccl-cu11==2.20.5 (from torch==2.4.1) Downloading https://download.pytorch.org/whl/cu118/nvidia_nccl_cu11-2.20.5-py3-none-manylinux2014_x86_64.whl (142.9 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 142.9/142.9 MB 12.5 MB/s eta 0:00:00 Collecting nvidia-nvtx-cu11==11.8.86 (from torch==2.4.1) Downloading https://download.pytorch.org/whl/cu118/nvidia_nvtx_cu11-11.8.86-py3-none-manylinux1_x86_64.whl (99 kB) Collecting triton==3.0.0 (from torch==2.4.1) Downloading https://download.pytorch.org/whl/triton-3.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.4/209.4 MB 12.0 MB/s eta 0:00:00 Collecting numpy (from torchvision==0.19.1) Downloading https://download.pytorch.org/whl/numpy-2.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB) Collecting pillow!=8.3.*,>=5.3.0 (from torchvision==0.19.1) Downloading https://download.pytorch.org/whl/pillow-11.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (9.1 kB) Collecting MarkupSafe>=2.0 (from jinja2->torch==2.4.1) Downloading https://download.pytorch.org/whl/MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB) Collecting mpmath<1.4,>=1.1.0 (from sympy->torch==2.4.1) Downloading https://download.pytorch.org/whl/mpmath-1.3.0-py3-none-any.whl (536 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 7.5 MB/s eta 0:00:00 Downloading https://download.pytorch.org/whl/pillow-11.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (4.4 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.4/4.4 MB 9.4 MB/s eta 0:00:00 Downloading https://download.pytorch.org/whl/filelock-3.13.1-py3-none-any.whl (11 kB) Downloading https://download.pytorch.org/whl/fsspec-2024.6.1-py3-none-any.whl (177 kB) Downloading https://download.pytorch.org/whl/Jinja2-3.1.4-py3-none-any.whl (133 kB) Downloading https://download.pytorch.org/whl/networkx-3.3-py3-none-any.whl (1.7 MB) 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 6.6 MB/s eta 0:00:00 Downloading https://download.pytorch.org/whl/numpy-2.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.3 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.3/16.3 MB 13.0 MB/s eta 0:00:00 Downloading https://download.pytorch.org/whl/sympy-1.13.3-py3-none-any.whl (6.2 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 32.9 MB/s eta 0:00:00 Installing collected packages: mpmath, sympy, pillow, nvidia-nvtx-cu11, nvidia-nccl-cu11, nvidia-cusparse-cu11, nvidia-curand-cu11, nvidia-cufft-cu11, nvidia-cuda-runtime-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cuda-cupti-cu11, nvidia-cublas-cu11, numpy, networkx, MarkupSafe, fsspec, filelock, triton, nvidia-cusolver-cu11, nvidia-cudnn-cu11, jinja2, torch, torchvision, torchaudio Successfully installed MarkupSafe-2.1.5 filelock-3.13.1 fsspec-2024.6.1 jinja2-3.1.4 mpmath-1.3.0 networkx-3.3 numpy-2.1.2 nvidia-cublas-cu11-11.11.3.6 nvidia-cuda-cupti-cu11-11.8.87 nvidia-cuda-nvrtc-cu11-11.8.89 nvidia-cuda-runtime-cu11-11.8.89 nvidia-cudnn-cu11-9.1.0.70 nvidia-cufft-cu11-10.9.0.58 nvidia-curand-cu11-10.3.0.86 nvidia-cusolver-cu11-11.4.1.48 nvidia-cusparse-cu11-11.7.5.86 nvidia-nccl-cu11-2.20.5 nvidia-nvtx-cu11-11.8.86 pillow-11.0.0 sympy-1.13.3 torch-2.4.1+cu118 torchaudio-2.4.1+cu118 torchvision-0.19.1+cu118 triton-3.0.0 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning. (base) root@VM-0-80-ubuntu:/workspace# -
- Verify that the PyTorch GPU build installed correctly:
python -c "import torch;print(torch.cuda.get_device_name(torch.cuda.current_device()))"
(base) root@VM-0-80-ubuntu:/workspace# python -c "import torch;print(torch.cuda.get_device_name(torch.cuda.current_device()))"
Tesla T4
- Install W&B:
pip install wandb
`wandb` install log:
(base) root@VM-0-80-ubuntu:/workspace# pip install wandb Looking in indexes: http://mirrors.tencentyun.com/pypi/simple Collecting wandb Downloading http://mirrors.tencentyun.com/pypi/packages/88/c9/41b8bdb493e5eda32b502bc1cc49d539335a92cacaf0ef304d7dae0240aa/wandb-0.20.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23.2 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.2/23.2 MB 9.2 MB/s eta 0:00:00 Collecting click!=8.0.0,>=7.1 (from wandb) Downloading http://mirrors.tencentyun.com/pypi/packages/85/32/10bb5764d90a8eee674e9dc6f4db6a0ab47c8c4d0d83c27f7c39ac415a4d/click-8.2.1-py3-none-any.whl (102 kB) Collecting gitpython!=3.1.29,>=1.0.0 (from wandb) Using cached http://mirrors.tencentyun.com/pypi/packages/1d/9a/4114a9057db2f1462d5c8f8390ab7383925fe1ac012eaa42402ad65c2963/GitPython-3.1.44-py3-none-any.whl (207 kB) Requirement already satisfied: packaging in /root/miniforge3/lib/python3.10/site-packages (from wandb) (24.2) Requirement already satisfied: platformdirs in /root/miniforge3/lib/python3.10/site-packages (from wandb) (4.3.6) Collecting protobuf!=4.21.0,!=5.28.0,<7,>=3.19.0 (from wandb) Downloading http://mirrors.tencentyun.com/pypi/packages/fa/b1/b59d405d64d31999244643d88c45c8241c58f17cc887e73bcb90602327f8/protobuf-6.31.1-cp39-abi3-manylinux2014_x86_64.whl (321 kB) Requirement already satisfied: psutil>=5.0.0 in /root/miniforge3/lib/python3.10/site-packages (from wandb) (6.1.1) Collecting pydantic<3 (from wandb) Downloading http://mirrors.tencentyun.com/pypi/packages/b5/69/831ed22b38ff9b4b64b66569f0e5b7b97cf3638346eb95a2147fdb49ad5f/pydantic-2.11.5-py3-none-any.whl (444 kB) Collecting pyyaml (from wandb) Downloading http://mirrors.tencentyun.com/pypi/packages/6b/4e/1523cb902fd98355e2e9ea5e5eb237cbc5f3ad5f3075fa65087aa0ecb669/PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (751 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 751.2/751.2 kB 2.8 MB/s eta 0:00:00 Requirement already satisfied: requests<3,>=2.0.0 in 
/root/miniforge3/lib/python3.10/site-packages (from wandb) (2.32.3) Collecting sentry-sdk>=2.0.0 (from wandb) Downloading http://mirrors.tencentyun.com/pypi/packages/f0/e5/da07b0bd832cefd52d16f2b9bbbe31624d57552602c06631686b93ccb1bd/sentry_sdk-2.29.1-py2.py3-none-any.whl (341 kB) Collecting setproctitle (from wandb) Downloading http://mirrors.tencentyun.com/pypi/packages/67/2b/c3cbd4a4462c1143465d8c151f1d51bbfb418e60a96a754329d28d416575/setproctitle-1.3.6-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30 kB) Requirement already satisfied: typing-extensions<5,>=4.8 in /root/miniforge3/lib/python3.10/site-packages (from wandb) (4.12.2) Collecting gitdb<5,>=4.0.1 (from gitpython!=3.1.29,>=1.0.0->wandb) Using cached http://mirrors.tencentyun.com/pypi/packages/a0/61/5c78b91c3143ed5c14207f463aecfc8f9dbb5092fb2869baf37c273b2705/gitdb-4.0.12-py3-none-any.whl (62 kB) Collecting annotated-types>=0.6.0 (from pydantic<3->wandb) Using cached http://mirrors.tencentyun.com/pypi/packages/78/b6/6307fbef88d9b5ee7421e68d78a9f162e0da4900bc5f5793f6d3d0e34fb8/annotated_types-0.7.0-py3-none-any.whl (13 kB) Collecting pydantic-core==2.33.2 (from pydantic<3->wandb) Downloading http://mirrors.tencentyun.com/pypi/packages/31/0d/c8f7593e6bc7066289bbc366f2235701dcbebcd1ff0ef8e64f6f239fb47d/pydantic_core-2.33.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 3.7 MB/s eta 0:00:00 Collecting typing-inspection>=0.4.0 (from pydantic<3->wandb) Downloading http://mirrors.tencentyun.com/pypi/packages/17/69/cd203477f944c353c31bade965f880aa1061fd6bf05ded0726ca845b6ff7/typing_inspection-0.4.1-py3-none-any.whl (14 kB) Requirement already satisfied: charset_normalizer<4,>=2 in /root/miniforge3/lib/python3.10/site-packages (from requests<3,>=2.0.0->wandb) (3.4.1) Requirement already satisfied: idna<4,>=2.5 in /root/miniforge3/lib/python3.10/site-packages (from 
requests<3,>=2.0.0->wandb) (3.10) Requirement already satisfied: urllib3<3,>=1.21.1 in /root/miniforge3/lib/python3.10/site-packages (from requests<3,>=2.0.0->wandb) (2.3.0) Requirement already satisfied: certifi>=2017.4.17 in /root/miniforge3/lib/python3.10/site-packages (from requests<3,>=2.0.0->wandb) (2024.12.14) Collecting smmap<6,>=3.0.1 (from gitdb<5,>=4.0.1->gitpython!=3.1.29,>=1.0.0->wandb) Using cached http://mirrors.tencentyun.com/pypi/packages/04/be/d09147ad1ec7934636ad912901c5fd7667e1c858e19d355237db0d0cd5e4/smmap-5.0.2-py3-none-any.whl (24 kB) Installing collected packages: typing-inspection, smmap, setproctitle, sentry-sdk, pyyaml, pydantic-core, protobuf, click, annotated-types, pydantic, gitdb, gitpython, wandb Successfully installed annotated-types-0.7.0 click-8.2.1 gitdb-4.0.12 gitpython-3.1.44 protobuf-6.31.1 pydantic-2.11.5 pydantic-core-2.33.2 pyyaml-6.0.2 sentry-sdk-2.29.1 setproctitle-1.3.6 smmap-5.0.2 typing-inspection-0.4.1 wandb-0.20.1 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
3. Registering and configuring W&B
- Register: create an account on wandb.ai/ (signing in with GitHub also works).
- Get a token: click your avatar in the top-right corner of the page, choose 🔑 API key, and click copy.
- In the terminal where W&B will be used, run `wandb login`, paste the token you just copied, and press Enter.
- This logs in against the W&B site and saves the token to /root/.netrc. `wandb login` log:
(base) root@VM-0-80-ubuntu:/workspace# wandb login wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server) wandb: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: wandb: No netrc file found, creating one. wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin (base) root@VM-0-80-ubuntu:/workspace#
The `wandb` CLI commands and their descriptions:
(base) root@VM-0-80-ubuntu:/workspace# wandb --help Usage: wandb [OPTIONS] COMMAND [ARGS]... Options: --version Show the version and exit. --help Show this message and exit. Commands: agent Run the W&B agent artifact Commands for interacting with artifacts beta Beta versions of wandb CLI commands. controller Run the W&B local sweep controller disabled Disable W&B. docker Run your code in a docker container. docker-run Wrap `docker run` and adds WANDB_API_KEY and WANDB_DOCKER... enabled Enable W&B. init Configure a directory with Weights & Biases job Commands for managing and viewing W&B jobs launch Launch or queue a W&B Job. launch-agent Run a W&B launch agent. launch-sweep Run a W&B launch sweep (Experimental). login Login to Weights & Biases offline Disable W&B sync online Enable W&B sync pull Pull files from Weights & Biases restore Restore code, config and docker state for a run scheduler Run a W&B launch sweep scheduler (Experimental) server Commands for operating a local W&B server status Show configuration settings sweep Initialize a hyperparameter sweep. sync Upload an offline training directory to W&B verify Verify your local instance (base) root@VM-0-80-ubuntu:/workspace#
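As the `wandb login` log shows, the key is appended to /root/.netrc in standard netrc format, which Python's stdlib `netrc` module can parse. A minimal sketch against a temporary file with a fake entry (so it runs anywhere; the key below is made up):

```python
import netrc
import os
import tempfile

# A sample entry in the same shape `wandb login` writes (fake key).
sample = "machine api.wandb.ai\n  login user\n  password 0123456789abcdef\n"

with tempfile.NamedTemporaryFile("w", suffix=".netrc", delete=False) as f:
    f.write(sample)
    path = f.name

# netrc.netrc parses machine/login/password triples per host.
login, account, password = netrc.netrc(path).authenticators("api.wandb.ai")
print(password)  # the stored API key
os.unlink(path)
```

This is also why `wandb login --relogin` works: it simply rewrites the entry for api.wandb.ai in that file.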
4. Using W&B with PyTorch
4.1 Official demo with comments, from github.com/wandb/wandb
import wandb

# Start a W&B run with wandb.init
run = wandb.init(project="my_first_project")

# Save model inputs and hyperparameters in a wandb.config object
config = run.config
config.learning_rate = 0.01

# Model training code here ...

# Log metrics over time to visualize performance with run.log
for i in range(10):
    run.log({"loss": ...})

# Mark the run as finished, and finish uploading all data
run.finish()
4.2 Train an MLP on MNIST with Torchvision, logging training-set loss and accuracy
import os
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
import wandb

run = wandb.init(project="minimal-demo")

transform = transforms.ToTensor()
train_loader = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform),
    batch_size=64, shuffle=True
)

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

model.train()
for epoch in range(5):
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # Per-batch top-1 accuracy on the training batch
        _, predicted = outputs.max(1)
        accuracy = (predicted == labels).float().mean()
        run.log({"loss": loss.item(), "accuracy": accuracy.item()})

run.finish()
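The accuracy line in the loop is plain top-1 accuracy. A dependency-free sketch of the same computation (pure Python, with made-up illustrative values):

```python
# Top-1 accuracy, the quantity the training loop logs per batch:
# argmax over each row of logits, then the fraction of rows whose
# argmax matches the label.
def top1_accuracy(logits, labels):
    predicted = [row.index(max(row)) for row in logits]
    correct = sum(p == y for p, y in zip(predicted, labels))
    return correct / len(labels)

logits = [[0.1, 2.0, 0.3],   # argmax 1
          [1.5, 0.2, 0.1],   # argmax 0
          [0.0, 0.1, 0.9]]   # argmax 2
labels = [1, 0, 1]
print(top1_accuracy(logits, labels))  # 2 of 3 correct
```

In the torch version, `outputs.max(1)` returns both the max values and their indices along the class dimension, so `predicted` plays the same role as the `row.index(max(row))` here.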
5. Troubleshooting
5.0 Connection timeouts
- Symptoms: wandb: Network error (ConnectTimeout), entering retry loop. and CommError: Run initialization has timed out after 90.0 sec. Please try increasing the timeout with the `init_timeout` setting: `wandb.init(settings=wandb.Settings(init_timeout=120))`.
- Cause: connectivity problems reaching wandb's API servers.
- Fix: see section 6, "W&B network workarounds".
5.1 Missing dependency
- Symptom: wandb: ERROR The nbformat package was not found. It is required to save notebook history.
- Fix: pip install nbformat
- nbformat is the official Python library from the Jupyter project for reading, writing, and manipulating .ipynb Jupyter Notebook files. It provides standardized data structures (such as NotebookNode) so developers can work programmatically with a notebook's components: code cells, Markdown cells, metadata, and so on.
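As a minimal illustration of what nbformat operates on: an .ipynb file is just JSON, so a notebook's cell structure can be inspected with the stdlib `json` module alone (nbformat adds validation and NotebookNode objects on top). The notebook content below is hand-written for illustration:

```python
import json
import os
import tempfile

# A minimal notebook body in nbformat v4 shape (made up for illustration).
nb = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print('hi')"]},
    ],
}

with tempfile.NamedTemporaryFile("w", suffix=".ipynb", delete=False) as f:
    json.dump(nb, f)
    path = f.name

# Reading it back: cells are a plain list of dicts.
with open(path) as f:
    loaded = json.load(f)
cell_types = [c["cell_type"] for c in loaded["cells"]]
print(cell_types)
os.unlink(path)
```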
6. W&B network workarounds
6.1 (Recommended) switch to a third-party mirror
- Use the mirror API server shared by the 小红书 author Urara (xhs id: 5227964072).
- (Recommended) set it directly in Python code: import os; os.environ["WANDB_BASE_URL"] = "https://api.bandw.top"
- Or set an environment variable: export WANDB_BASE_URL=https://api.bandw.top
  - Note that export only affects the current shell session; to persist the variable, add it to .bashrc or the corresponding conda env file yourself.
  - Trivia: running export inside a Jupyter notebook does not change the shell's environment variables; in Jupyter, set the variable from Python code instead.
- After switching, you may need to log in to wandb again. (I am not sure this is always triggered; I'll keep an eye on it and post a once-and-for-all fix rather than re-logging-in every time.)
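The Python-side override has to happen before wandb is imported and initialized in the same process, since wandb reads the environment at startup. A minimal stdlib-only sketch:

```python
import os

# Must run before `import wandb` in this process, since wandb reads the
# environment when it starts up.
os.environ["WANDB_BASE_URL"] = "https://api.bandw.top"

# Child processes launched from here inherit the variable too, which is
# the same effect `export` has in a shell session.
print(os.environ["WANDB_BASE_URL"])
```

Unlike `export` in a shell, this only affects the current Python process and its children; it does not persist across sessions, which is why the .bashrc route is needed for a permanent setting.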
6.2 (Not yet tested) run your own reverse proxy on serverless or other cloud infrastructure
- Free tiers on Cloudflare, AWS, etc. should work; I have not tried this yet.
- Reference: www.bilibili.com/opus/924356…
6.3 Run offline, and upload W&B data when needed
- Offline mode is recommended when the machine cannot reach servers outside China, or is fully air-gapped: run in offline mode, inspect the data locally, and upload it to wandb later if needed.
- PS: examples of fully offline servers: the Beijing Supercomputing Center's offline mode, Kaggle's offline mode.
- PS: situations where online mode is inadvisable: on Baidu AI Studio, connecting to wandb and uploading gets the session cut off by Baidu's antivirus and the account banned, and neither email nor web-form feedback will get a response.
- Enabling offline mode:
  - (Recommended) via the wandb CLI: wandb offline
  - Or via an environment variable: export WANDB_MODE=offline (see docs.wandb.ai/support/run…)
  - Or pass a parameter when initializing in Python: wandb.init(mode="offline")
- Uploading wandb data:
  - (Recommended) run wandb sync ./wandb/offline-run-* (see docs.wandb.ai/ref/cli/wan…)
  - Here ./wandb/offline-run-* matches the run directories wandb saves in offline mode.
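Before syncing, the glob pattern that `wandb sync` receives can be previewed from Python. A stdlib-only sketch, with made-up run-directory names simulating what offline mode leaves behind:

```python
import glob
import os
import tempfile

# Simulate the layout wandb leaves behind in offline mode:
# ./wandb/offline-run-<timestamp>-<id>/  (names below are made up)
root = tempfile.mkdtemp()
for name in ["offline-run-20250605_134000-abc123",
             "offline-run-20250605_140000-def456"]:
    os.makedirs(os.path.join(root, "wandb", name))

# The same pattern the shell expands for `wandb sync ./wandb/offline-run-*`:
runs = sorted(glob.glob(os.path.join(root, "wandb", "offline-run-*")))
for r in runs:
    print(os.path.basename(r))
```

Each matched directory corresponds to one run; `wandb sync` uploads them individually, so you can also pass a single directory instead of the wildcard.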