[阿维笔记] Visualizing W&B Training Data, and Fixing a Network Issue Along the Way


0. TL;DR:

Run export WANDB_BASE_URL=https://api.bandw.top to switch W&B to a different API endpoint; this fixes the connection timeouts against the original API URL. Log in again and training works normally.

1. Overview

W&B (Weights & Biases) is a tool for logging machine learning training data. By tracking and visualizing every stage of the pipeline, from dataset processing to the final trained model, it helps users optimize their models faster.

W&B currently integrates with many frameworks and libraries: LangChain, LlamaIndex, PyTorch, HF Transformers, Lightning, TensorFlow, Keras, Scikit-Learn, XGBoost, PaddlePaddle, etc.

2. Demo: Training a Model in PyTorch with W&B Logging

2.1 Hardware environment

This tutorial uses Tencent Cloud CloudStudio, which provides 50,000 free CPU-environment minutes and 10,000 free GPU-environment minutes per month (16 cores / 32 GB RAM & Tesla T4 with 16 GB VRAM).

  • Open the CloudStudio link cloud.tencent.com/product/clo… and register a Tencent Cloud account.
  • Go to the CloudStudio home page ide.cloud.tencent.com/, click High-Performance Workspace, click New, choose the free basic tier at the bottom left, click Create, and wait a few minutes.
  • Refresh the page until the new workspace entry appears, then click it to start using the high-performance workspace.
  • A successfully opened high-performance workspace looks like this: (screenshot omitted)

2.2 Software environment

The free basic environment ships with dependencies such as deepseek (unused here) and a conda setup, so we will simply use conda's (base) environment.

  • The (base) conda virtual environment is already activated by default. Use pip list to check the installed packages. PyTorch is not among them, so we need to install it manually:

    Software environment check log:
    (base) root@VM-0-80-ubuntu:/workspace# which conda
    /root/miniforge3/bin/conda
    (base) root@VM-0-80-ubuntu:/workspace# pip list
    Package                 Version
    ----------------------- -----------
    archspec                0.2.3
    asttokens               3.0.0
    boltons                 24.0.0
    Brotli                  1.1.0
    certifi                 2024.12.14
    cffi                    1.17.1
    charset-normalizer      3.4.1
    colorama                0.4.6
    comm                    0.2.2
    conda                   24.11.2
    conda-libmamba-solver   24.11.1
    conda-package-handling  2.4.0
    conda_package_streaming 0.11.0
    debugpy                 1.8.11
    decorator               5.1.1
    distro                  1.9.0
    exceptiongroup          1.2.2
    executing               2.1.0
    frozendict              2.4.6
    h2                      4.1.0
    hpack                   4.0.0
    hyperframe              6.0.1
    idna                    3.10
    importlib_metadata      8.5.0
    ipykernel               6.29.5
    ipython                 8.31.0
    jedi                    0.19.2
    jsonpatch               1.33
    jsonpointer             3.0.0
    jupyter_client          8.6.3
    jupyter_core            5.7.2
    libmambapy              2.0.5
    matplotlib-inline       0.1.7
    menuinst                2.2.0
    nest_asyncio            1.6.0
    packaging               24.2
    parso                   0.8.4
    pexpect                 4.9.0
    pickleshare             0.7.5
    pip                     24.3.1
    platformdirs            4.3.6
    pluggy                  1.5.0
    prompt_toolkit          3.0.48
    psutil                  6.1.1
    ptyprocess              0.7.0
    pure_eval               0.2.3
    pycosat                 0.6.6
    pycparser               2.22
    Pygments                2.18.0
    PySocks                 1.7.1
    python-dateutil         2.9.0.post0
    pyzmq                   26.2.0
    requests                2.32.3
    ruamel.yaml             0.18.8
    ruamel.yaml.clib        0.2.8
    setuptools              75.6.0
    six                     1.17.0
    stack_data              0.6.3
    tornado                 6.4.2
    tqdm                    4.67.1
    traitlets               5.14.3
    truststore              0.10.0
    typing_extensions       4.12.2
    urllib3                 2.3.0
    wcwidth                 0.2.13
    wheel                   0.45.1
    zipp                    3.21.0
    zstandard               0.23.0
    
  • Use nvidia-smi and similar commands to check the GPU and NVIDIA driver versions, and decide which PyTorch build to install.

    GPU and driver check log:
    (base) root@VM-0-80-ubuntu:/workspace# nvidia-smi
    Thu Jun  5 13:40:40 2025       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:00:09.0 Off |                    0 |
    | N/A   33C    P8    11W /  70W |      5MiB / 15360MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    (base) root@VM-0-80-ubuntu:/workspace# python --version
    Python 3.10.11
    
  • Install the GPU build of PyTorch: pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118

    `PyTorch GPU` install log:
    (base) root@VM-0-80-ubuntu:/workspace# pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118
    Looking in indexes: https://download.pytorch.org/whl/cu118
    Collecting torch==2.4.1
      Downloading https://download.pytorch.org/whl/cu118/torch-2.4.1%2Bcu118-cp310-cp310-linux_x86_64.whl (857.6 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 857.6/857.6 MB 8.6 MB/s eta 0:00:00
    Collecting torchvision==0.19.1
      Downloading https://download.pytorch.org/whl/cu118/torchvision-0.19.1%2Bcu118-cp310-cp310-linux_x86_64.whl (6.3 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 9.8 MB/s eta 0:00:00
    Collecting torchaudio==2.4.1
      Downloading https://download.pytorch.org/whl/cu118/torchaudio-2.4.1%2Bcu118-cp310-cp310-linux_x86_64.whl (3.3 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.3/3.3 MB 6.7 MB/s eta 0:00:00
    Collecting filelock (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/filelock-3.13.1-py3-none-any.whl.metadata (2.8 kB)
    Requirement already satisfied: typing-extensions>=4.8.0 in /root/miniforge3/lib/python3.10/site-packages (from torch==2.4.1) (4.12.2)
    Collecting sympy (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/sympy-1.13.3-py3-none-any.whl.metadata (12 kB)
    Collecting networkx (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/networkx-3.3-py3-none-any.whl.metadata (5.1 kB)
    Collecting jinja2 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/Jinja2-3.1.4-py3-none-any.whl.metadata (2.6 kB)
    Collecting fsspec (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/fsspec-2024.6.1-py3-none-any.whl.metadata (11 kB)
    Collecting nvidia-cuda-nvrtc-cu11==11.8.89 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/cu118/nvidia_cuda_nvrtc_cu11-11.8.89-py3-none-manylinux1_x86_64.whl (23.2 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.2/23.2 MB 14.5 MB/s eta 0:00:00
    Collecting nvidia-cuda-runtime-cu11==11.8.89 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/cu118/nvidia_cuda_runtime_cu11-11.8.89-py3-none-manylinux1_x86_64.whl (875 kB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 875.6/875.6 kB 16.0 MB/s eta 0:00:00
    Collecting nvidia-cuda-cupti-cu11==11.8.87 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/cu118/nvidia_cuda_cupti_cu11-11.8.87-py3-none-manylinux1_x86_64.whl (13.1 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.1/13.1 MB 21.3 MB/s eta 0:00:00
    Collecting nvidia-cudnn-cu11==9.1.0.70 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/cu118/nvidia_cudnn_cu11-9.1.0.70-py3-none-manylinux2014_x86_64.whl (663.9 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 663.9/663.9 MB 10.0 MB/s eta 0:00:00
    Collecting nvidia-cublas-cu11==11.11.3.6 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/cu118/nvidia_cublas_cu11-11.11.3.6-py3-none-manylinux1_x86_64.whl (417.9 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 417.9/417.9 MB 9.7 MB/s eta 0:00:00
    Collecting nvidia-cufft-cu11==10.9.0.58 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/cu118/nvidia_cufft_cu11-10.9.0.58-py3-none-manylinux1_x86_64.whl (168.4 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 168.4/168.4 MB 12.5 MB/s eta 0:00:00
    Collecting nvidia-curand-cu11==10.3.0.86 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/cu118/nvidia_curand_cu11-10.3.0.86-py3-none-manylinux1_x86_64.whl (58.1 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58.1/58.1 MB 13.0 MB/s eta 0:00:00
    Collecting nvidia-cusolver-cu11==11.4.1.48 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/cu118/nvidia_cusolver_cu11-11.4.1.48-py3-none-manylinux1_x86_64.whl (128.2 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 128.2/128.2 MB 12.5 MB/s eta 0:00:00
    Collecting nvidia-cusparse-cu11==11.7.5.86 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/cu118/nvidia_cusparse_cu11-11.7.5.86-py3-none-manylinux1_x86_64.whl (204.1 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 204.1/204.1 MB 11.9 MB/s eta 0:00:00
    Collecting nvidia-nccl-cu11==2.20.5 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/cu118/nvidia_nccl_cu11-2.20.5-py3-none-manylinux2014_x86_64.whl (142.9 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 142.9/142.9 MB 12.5 MB/s eta 0:00:00
    Collecting nvidia-nvtx-cu11==11.8.86 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/cu118/nvidia_nvtx_cu11-11.8.86-py3-none-manylinux1_x86_64.whl (99 kB)
    Collecting triton==3.0.0 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/triton-3.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.4/209.4 MB 12.0 MB/s eta 0:00:00
    Collecting numpy (from torchvision==0.19.1)
      Downloading https://download.pytorch.org/whl/numpy-2.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
    Collecting pillow!=8.3.*,>=5.3.0 (from torchvision==0.19.1)
      Downloading https://download.pytorch.org/whl/pillow-11.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (9.1 kB)
    Collecting MarkupSafe>=2.0 (from jinja2->torch==2.4.1)
      Downloading https://download.pytorch.org/whl/MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
    Collecting mpmath<1.4,>=1.1.0 (from sympy->torch==2.4.1)
      Downloading https://download.pytorch.org/whl/mpmath-1.3.0-py3-none-any.whl (536 kB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 7.5 MB/s eta 0:00:00
    Downloading https://download.pytorch.org/whl/pillow-11.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (4.4 MB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.4/4.4 MB 9.4 MB/s eta 0:00:00
    Downloading https://download.pytorch.org/whl/filelock-3.13.1-py3-none-any.whl (11 kB)
    Downloading https://download.pytorch.org/whl/fsspec-2024.6.1-py3-none-any.whl (177 kB)
    Downloading https://download.pytorch.org/whl/Jinja2-3.1.4-py3-none-any.whl (133 kB)
    Downloading https://download.pytorch.org/whl/networkx-3.3-py3-none-any.whl (1.7 MB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 6.6 MB/s eta 0:00:00
    Downloading https://download.pytorch.org/whl/numpy-2.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.3 MB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.3/16.3 MB 13.0 MB/s eta 0:00:00
    Downloading https://download.pytorch.org/whl/sympy-1.13.3-py3-none-any.whl (6.2 MB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 32.9 MB/s eta 0:00:00
    Installing collected packages: mpmath, sympy, pillow, nvidia-nvtx-cu11, nvidia-nccl-cu11, nvidia-cusparse-cu11, nvidia-curand-cu11, nvidia-cufft-cu11, nvidia-cuda-runtime-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cuda-cupti-cu11, nvidia-cublas-cu11, numpy, networkx, MarkupSafe, fsspec, filelock, triton, nvidia-cusolver-cu11, nvidia-cudnn-cu11, jinja2, torch, torchvision, torchaudio
    Successfully installed MarkupSafe-2.1.5 filelock-3.13.1 fsspec-2024.6.1 jinja2-3.1.4 mpmath-1.3.0 networkx-3.3 numpy-2.1.2 nvidia-cublas-cu11-11.11.3.6 nvidia-cuda-cupti-cu11-11.8.87 nvidia-cuda-nvrtc-cu11-11.8.89 nvidia-cuda-runtime-cu11-11.8.89 nvidia-cudnn-cu11-9.1.0.70 nvidia-cufft-cu11-10.9.0.58 nvidia-curand-cu11-10.3.0.86 nvidia-cusolver-cu11-11.4.1.48 nvidia-cusparse-cu11-11.7.5.86 nvidia-nccl-cu11-2.20.5 nvidia-nvtx-cu11-11.8.86 pillow-11.0.0 sympy-1.13.3 torch-2.4.1+cu118 torchaudio-2.4.1+cu118 torchvision-0.19.1+cu118 triton-3.0.0
    WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
    (base) root@VM-0-80-ubuntu:/workspace# 
    
  • Verify that the GPU build of PyTorch installed correctly: python -c "import torch;print(torch.cuda.get_device_name(torch.cuda.current_device()))"

    (base) root@VM-0-80-ubuntu:/workspace# python -c "import torch;print(torch.cuda.get_device_name(torch.cuda.current_device()))"
    Tesla T4

  • Install W&B: pip install wandb

    `wandb` install log:
    (base) root@VM-0-80-ubuntu:/workspace# pip install wandb
    Looking in indexes: http://mirrors.tencentyun.com/pypi/simple
    Collecting wandb
      Downloading http://mirrors.tencentyun.com/pypi/packages/88/c9/41b8bdb493e5eda32b502bc1cc49d539335a92cacaf0ef304d7dae0240aa/wandb-0.20.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23.2 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.2/23.2 MB 9.2 MB/s eta 0:00:00
    Collecting click!=8.0.0,>=7.1 (from wandb)
      Downloading http://mirrors.tencentyun.com/pypi/packages/85/32/10bb5764d90a8eee674e9dc6f4db6a0ab47c8c4d0d83c27f7c39ac415a4d/click-8.2.1-py3-none-any.whl (102 kB)
    Collecting gitpython!=3.1.29,>=1.0.0 (from wandb)
      Using cached http://mirrors.tencentyun.com/pypi/packages/1d/9a/4114a9057db2f1462d5c8f8390ab7383925fe1ac012eaa42402ad65c2963/GitPython-3.1.44-py3-none-any.whl (207 kB)
    Requirement already satisfied: packaging in /root/miniforge3/lib/python3.10/site-packages (from wandb) (24.2)
    Requirement already satisfied: platformdirs in /root/miniforge3/lib/python3.10/site-packages (from wandb) (4.3.6)
    Collecting protobuf!=4.21.0,!=5.28.0,<7,>=3.19.0 (from wandb)
      Downloading http://mirrors.tencentyun.com/pypi/packages/fa/b1/b59d405d64d31999244643d88c45c8241c58f17cc887e73bcb90602327f8/protobuf-6.31.1-cp39-abi3-manylinux2014_x86_64.whl (321 kB)
    Requirement already satisfied: psutil>=5.0.0 in /root/miniforge3/lib/python3.10/site-packages (from wandb) (6.1.1)
    Collecting pydantic<3 (from wandb)
      Downloading http://mirrors.tencentyun.com/pypi/packages/b5/69/831ed22b38ff9b4b64b66569f0e5b7b97cf3638346eb95a2147fdb49ad5f/pydantic-2.11.5-py3-none-any.whl (444 kB)
    Collecting pyyaml (from wandb)
      Downloading http://mirrors.tencentyun.com/pypi/packages/6b/4e/1523cb902fd98355e2e9ea5e5eb237cbc5f3ad5f3075fa65087aa0ecb669/PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (751 kB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 751.2/751.2 kB 2.8 MB/s eta 0:00:00
    Requirement already satisfied: requests<3,>=2.0.0 in /root/miniforge3/lib/python3.10/site-packages (from wandb) (2.32.3)
    Collecting sentry-sdk>=2.0.0 (from wandb)
      Downloading http://mirrors.tencentyun.com/pypi/packages/f0/e5/da07b0bd832cefd52d16f2b9bbbe31624d57552602c06631686b93ccb1bd/sentry_sdk-2.29.1-py2.py3-none-any.whl (341 kB)
    Collecting setproctitle (from wandb)
      Downloading http://mirrors.tencentyun.com/pypi/packages/67/2b/c3cbd4a4462c1143465d8c151f1d51bbfb418e60a96a754329d28d416575/setproctitle-1.3.6-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30 kB)
    Requirement already satisfied: typing-extensions<5,>=4.8 in /root/miniforge3/lib/python3.10/site-packages (from wandb) (4.12.2)
    Collecting gitdb<5,>=4.0.1 (from gitpython!=3.1.29,>=1.0.0->wandb)
      Using cached http://mirrors.tencentyun.com/pypi/packages/a0/61/5c78b91c3143ed5c14207f463aecfc8f9dbb5092fb2869baf37c273b2705/gitdb-4.0.12-py3-none-any.whl (62 kB)
    Collecting annotated-types>=0.6.0 (from pydantic<3->wandb)
      Using cached http://mirrors.tencentyun.com/pypi/packages/78/b6/6307fbef88d9b5ee7421e68d78a9f162e0da4900bc5f5793f6d3d0e34fb8/annotated_types-0.7.0-py3-none-any.whl (13 kB)
    Collecting pydantic-core==2.33.2 (from pydantic<3->wandb)
      Downloading http://mirrors.tencentyun.com/pypi/packages/31/0d/c8f7593e6bc7066289bbc366f2235701dcbebcd1ff0ef8e64f6f239fb47d/pydantic_core-2.33.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 3.7 MB/s eta 0:00:00
    Collecting typing-inspection>=0.4.0 (from pydantic<3->wandb)
      Downloading http://mirrors.tencentyun.com/pypi/packages/17/69/cd203477f944c353c31bade965f880aa1061fd6bf05ded0726ca845b6ff7/typing_inspection-0.4.1-py3-none-any.whl (14 kB)
    Requirement already satisfied: charset_normalizer<4,>=2 in /root/miniforge3/lib/python3.10/site-packages (from requests<3,>=2.0.0->wandb) (3.4.1)
    Requirement already satisfied: idna<4,>=2.5 in /root/miniforge3/lib/python3.10/site-packages (from requests<3,>=2.0.0->wandb) (3.10)
    Requirement already satisfied: urllib3<3,>=1.21.1 in /root/miniforge3/lib/python3.10/site-packages (from requests<3,>=2.0.0->wandb) (2.3.0)
    Requirement already satisfied: certifi>=2017.4.17 in /root/miniforge3/lib/python3.10/site-packages (from requests<3,>=2.0.0->wandb) (2024.12.14)
    Collecting smmap<6,>=3.0.1 (from gitdb<5,>=4.0.1->gitpython!=3.1.29,>=1.0.0->wandb)
      Using cached http://mirrors.tencentyun.com/pypi/packages/04/be/d09147ad1ec7934636ad912901c5fd7667e1c858e19d355237db0d0cd5e4/smmap-5.0.2-py3-none-any.whl (24 kB)
    Installing collected packages: typing-inspection, smmap, setproctitle, sentry-sdk, pyyaml, pydantic-core, protobuf, click, annotated-types, pydantic, gitdb, gitpython, wandb
    Successfully installed annotated-types-0.7.0 click-8.2.1 gitdb-4.0.12 gitpython-3.1.44 protobuf-6.31.1 pydantic-2.11.5 pydantic-core-2.33.2 pyyaml-6.0.2 sentry-sdk-2.29.1 setproctitle-1.3.6 smmap-5.0.2 typing-inspection-0.4.1 wandb-0.20.1
    WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
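
Before moving on, it can help to confirm programmatically that the packages installed above are importable. A small stdlib-only sketch (the `installed` helper is my own name, not part of any library):

```python
import importlib.util

def installed(pkg: str) -> bool:
    """Return True if `pkg` can be imported in the current environment."""
    return importlib.util.find_spec(pkg) is not None

# The packages this tutorial relies on; run this after the pip installs above.
for pkg in ["torch", "torchvision", "wandb"]:
    print(pkg, "OK" if installed(pkg) else "MISSING -> pip install " + pkg)
```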
    

3. Registering and Configuring W&B

  • Register: go to the official site wandb.ai/ and create an account, or sign in with GitHub.

  • Get a token: click your avatar at the top right of the page, choose 🔑 API key, and click copy.

  • In the terminal where you will use W&B, run wandb login, paste the token you just copied, and press Enter.

  • This logs in against the W&B site and saves the token to /root/.netrc.

    `wandb login` log:
    (base) root@VM-0-80-ubuntu:/workspace# wandb login
    wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
    wandb: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
    wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
    wandb: No netrc file found, creating one.
    wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
    wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
    (base) root@VM-0-80-ubuntu:/workspace# 
    
    点击查看`wandb`的终端命令和解释
    (base) root@VM-0-80-ubuntu:/workspace# wandb --help
    Usage: wandb [OPTIONS] COMMAND [ARGS]...
    
    Options:
      --version  Show the version and exit.
      --help     Show this message and exit.
    
    Commands:
      agent         Run the W&B agent
      artifact      Commands for interacting with artifacts
      beta          Beta versions of wandb CLI commands.
      controller    Run the W&B local sweep controller
      disabled      Disable W&B.
      docker        Run your code in a docker container.
      docker-run    Wrap `docker run` and adds WANDB_API_KEY and WANDB_DOCKER...
      enabled       Enable W&B.
      init          Configure a directory with Weights & Biases
      job           Commands for managing and viewing W&B jobs
      launch        Launch or queue a W&B Job.
      launch-agent  Run a W&B launch agent.
      launch-sweep  Run a W&B launch sweep (Experimental).
      login         Login to Weights & Biases
      offline       Disable W&B sync
      online        Enable W&B sync
      pull          Pull files from Weights & Biases
      restore       Restore code, config and docker state for a run
      scheduler     Run a W&B launch sweep scheduler (Experimental)
      server        Commands for operating a local W&B server
      status        Show configuration settings
      sweep         Initialize a hyperparameter sweep.
      sync          Upload an offline training directory to W&B
      verify        Verify your local instance
    (base) root@VM-0-80-ubuntu:/workspace# 
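
As the log above shows, wandb login stores the API key in /root/.netrc. If you later need to check which key is configured, Python's standard netrc module can parse that file. A sketch using a temporary file with a made-up login and key so it runs anywhere (for the real file, point netrc.netrc at /root/.netrc):

```python
import netrc
import os
import tempfile

# A sample entry in the format `wandb login` appends (login and key are fake).
sample = "machine api.wandb.ai\n  login user\n  password 0123456789abcdef\n"

path = os.path.join(tempfile.mkdtemp(), "netrc")
with open(path, "w") as f:
    f.write(sample)

# authenticators() returns (login, account, password) for the given machine.
login, _account, password = netrc.netrc(path).authenticators("api.wandb.ai")
print(login, password)
```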
    

4. Using W&B with PyTorch

4.1 The official demo with comments, from github.com/wandb/wandb

import wandb

# Start a W&B run with wandb.init
run = wandb.init(project="my_first_project")

# Save model inputs and hyperparameters in a wandb.config object
config = run.config
config.learning_rate = 0.01

# Model training code here ...

# Log metrics over time to visualize performance with wandb.log
for i in range(10):
    run.log({"loss": ...})

# Mark the run as finished, and finish uploading all data
run.finish()

4.2 Training an MLP on MNIST with Torchvision, logging training loss and accuracy

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

import wandb
run = wandb.init(project="minimal-demo")

transform = transforms.ToTensor()
train_loader = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform),
    batch_size=64, shuffle=True
    )

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
    )

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

model.train()
for epoch in range(5):
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        _, predicted = outputs.max(1)
        accuracy = (predicted == labels).float().mean()

        run.log({"loss": loss.item(), "accuracy": accuracy.item()})
run.finish()
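
The loop above logs loss and accuracy for every batch, which can make the W&B charts noisy. A common variant is to also log per-epoch averages; a minimal stdlib-only helper for that (the `EpochAverager` class is hypothetical, not part of wandb or PyTorch):

```python
class EpochAverager:
    """Accumulate per-batch values and report their mean at epoch end."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value: float) -> None:
        self.total += value
        self.count += 1

    def mean(self) -> float:
        return self.total / self.count

    def reset(self) -> None:
        self.total, self.count = 0.0, 0

# Inside the training loop you would call avg.update(loss.item()) per batch,
# then run.log({"epoch_loss": avg.mean()}) and avg.reset() once per epoch.
avg = EpochAverager()
for batch_loss in [2.0, 4.0, 6.0]:
    avg.update(batch_loss)
print(avg.mean())  # 4.0
```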

5. Troubleshooting

5.0 Connection timeouts

  • Symptom: wandb: Network error (ConnectTimeout), entering retry loop. followed by CommError: Run initialization has timed out after 90.0 sec. Please try increasing the timeout with the `init_timeout` setting: `wandb.init(settings=wandb.Settings(init_timeout=120))`.
  • Cause: problems reaching wandb's API server.
  • Fix: see section 6, W&B Network Workarounds.

5.1 Missing dependencies

  1. wandb: ERROR The nbformat package was not found. It is required to save notebook history.
    • Fix: pip install nbformat

    nbformat is an official Python library from the Jupyter project for reading, writing, and manipulating .ipynb Jupyter Notebook files. It provides standardized data structures (such as NotebookNode) that let developers work programmatically with a notebook's components, including code cells, Markdown cells, and metadata.

6. W&B Network Workarounds

6.1 (Recommended) Switch to a third-party endpoint

  • Use the mirror API server shared by the 小红书 author Urara (xhs id: 5227964072):
    • (Recommended) Add import os; os.environ["WANDB_BASE_URL"] = "https://api.bandw.top" directly in your Python code.
    • Or set the environment variable: export WANDB_BASE_URL=https://api.bandw.top.
      • Note that export only affects the current shell session; to persist the variable, add it to .bashrc or the corresponding file of your conda env yourself.
      • After switching, you may need to log in to wandb again. (I am not sure this is always triggered; I will keep an eye on it and look for a once-and-for-all fix instead of re-logging-in every time.)
      • Fun fact: typing export in a Jupyter notebook does not actually change the shell environment variables; in Jupyter, set the variable from Python code instead!
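
Putting the recommended option together: set the variable at the very top of your script, before wandb initializes, so the client picks up the override. A minimal sketch (the wandb lines are commented out so the snippet also runs where wandb is not installed; the project name is just an example):

```python
import os

# Point the W&B client at the third-party endpoint BEFORE initializing wandb.
os.environ["WANDB_BASE_URL"] = "https://api.bandw.top"

# import wandb
# run = wandb.init(project="my_first_project")  # now goes through api.bandw.top

print(os.environ["WANDB_BASE_URL"])
```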

6.2 (Not yet explored) Build your own reverse proxy with serverless services or other cloud infrastructure

6.3 Run offline, and upload the W&B data when needed

  • Situations where offline mode is recommended: when the machine cannot reach servers abroad, or is a fully offline server, run in offline mode, then download the data to inspect it yourself and, if needed, upload it to wandb later.
    • PS 1: fully offline servers: Beijing Super Cloud Computing (北京超算) in offline mode, Kaggle in offline mode.
    • PS 2: situations where online mode is risky: on Baidu AI Studio, connecting to wandb and uploading gets cut off by Baidu's anti-virus system and your account banned, and emails or web-form feedback about it go unanswered.
  • Enabling offline mode:
    • (Recommended) Use the wandb command: wandb offline
    • Or set the environment variable: export WANDB_MODE=offline, see docs.wandb.ai/support/run…
    • Or pass mode="offline" when initializing in Python: wandb.init(mode="offline")
  • Uploading the wandb data:
    • (Recommended) Run wandb sync ./wandb/offline-run-*.
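
In offline mode each run is written to a local directory matching ./wandb/offline-run-*, and those directories are what wandb sync uploads. A small stdlib-only helper to list them (the function name offline_runs is my own):

```python
import glob
import os

def offline_runs(wandb_dir: str = "./wandb"):
    """List offline run directories; each can be passed to `wandb sync`."""
    return sorted(glob.glob(os.path.join(wandb_dir, "offline-run-*")))

# Example: print the sync command for every offline run found locally.
for run_dir in offline_runs():
    print("wandb sync", run_dir)
```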