[阿维笔记] Visualizing W&B Training Data, and Fixing a Network Issue Along the Way


0. TL;DR:

Run export WANDB_BASE_URL=https://api.bandw.top to switch W&B to a different API endpoint; this fixes the connection timeouts against the original API URL. Log in again and training works normally.

1. Overview

W&B (Weights & Biases) is a tool for logging machine learning training data. By tracking and visualizing every stage of the pipeline, from dataset processing to the final trained model, it helps users optimize their models faster.

W&B currently integrates with many frameworks and libraries: LangChain, LlamaIndex, PyTorch, HF Transformers, Lightning, TensorFlow, Keras, Scikit-Learn, XGBoost, PaddlePaddle, etc.

2. Demo: Training a Model in PyTorch with W&B Logging

2.1 Hardware environment

This tutorial uses Tencent Cloud CloudStudio, which provides 50,000 free CPU-environment minutes and 10,000 free GPU-environment minutes per month (16 cores / 32 GB RAM & Tesla T4 with 16 GB VRAM).

  • Open the CloudStudio link cloud.tencent.com/product/clo… and register a Tencent Cloud account.
  • Go to the CloudStudio home page ide.cloud.tencent.com/, click High-Performance Workspace, click New, choose the free basic tier at the bottom left, click Create, and wait a few minutes.
  • Refresh the page until the new workspace entry appears, then click it to start using the high-performance workspace.
  • A successfully opened high-performance workspace looks like this: (screenshot omitted)

2.2 Software environment

The free basic environment ships with dependencies such as deepseek (unused here) and a conda setup, so we will simply use conda's (base) environment.

  • The (base) conda virtual environment is already activated by default. Use pip list to check the installed packages. PyTorch is not among them, so we need to install it manually:

    Software environment check log:
    (base) root@VM-0-80-ubuntu:/workspace# which conda
    /root/miniforge3/bin/conda
    (base) root@VM-0-80-ubuntu:/workspace# pip list
    Package                 Version
    ----------------------- -----------
    archspec                0.2.3
    asttokens               3.0.0
    boltons                 24.0.0
    Brotli                  1.1.0
    certifi                 2024.12.14
    cffi                    1.17.1
    charset-normalizer      3.4.1
    colorama                0.4.6
    comm                    0.2.2
    conda                   24.11.2
    conda-libmamba-solver   24.11.1
    conda-package-handling  2.4.0
    conda_package_streaming 0.11.0
    debugpy                 1.8.11
    decorator               5.1.1
    distro                  1.9.0
    exceptiongroup          1.2.2
    executing               2.1.0
    frozendict              2.4.6
    h2                      4.1.0
    hpack                   4.0.0
    hyperframe              6.0.1
    idna                    3.10
    importlib_metadata      8.5.0
    ipykernel               6.29.5
    ipython                 8.31.0
    jedi                    0.19.2
    jsonpatch               1.33
    jsonpointer             3.0.0
    jupyter_client          8.6.3
    jupyter_core            5.7.2
    libmambapy              2.0.5
    matplotlib-inline       0.1.7
    menuinst                2.2.0
    nest_asyncio            1.6.0
    packaging               24.2
    parso                   0.8.4
    pexpect                 4.9.0
    pickleshare             0.7.5
    pip                     24.3.1
    platformdirs            4.3.6
    pluggy                  1.5.0
    prompt_toolkit          3.0.48
    psutil                  6.1.1
    ptyprocess              0.7.0
    pure_eval               0.2.3
    pycosat                 0.6.6
    pycparser               2.22
    Pygments                2.18.0
    PySocks                 1.7.1
    python-dateutil         2.9.0.post0
    pyzmq                   26.2.0
    requests                2.32.3
    ruamel.yaml             0.18.8
    ruamel.yaml.clib        0.2.8
    setuptools              75.6.0
    six                     1.17.0
    stack_data              0.6.3
    tornado                 6.4.2
    tqdm                    4.67.1
    traitlets               5.14.3
    truststore              0.10.0
    typing_extensions       4.12.2
    urllib3                 2.3.0
    wcwidth                 0.2.13
    wheel                   0.45.1
    zipp                    3.21.0
    zstandard               0.23.0
    
  • Use nvidia-smi and similar commands to check the GPU and NVIDIA driver versions, and decide which PyTorch build to install.

    GPU and driver check log:
    (base) root@VM-0-80-ubuntu:/workspace# nvidia-smi
    Thu Jun  5 13:40:40 2025       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:00:09.0 Off |                    0 |
    | N/A   33C    P8    11W /  70W |      5MiB / 15360MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    (base) root@VM-0-80-ubuntu:/workspace# python --version
    Python 3.10.11
    
  • Install the GPU build of PyTorch: pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118

    `PyTorch GPU` install log:
    (base) root@VM-0-80-ubuntu:/workspace# pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118
    Looking in indexes: https://download.pytorch.org/whl/cu118
    Collecting torch==2.4.1
      Downloading https://download.pytorch.org/whl/cu118/torch-2.4.1%2Bcu118-cp310-cp310-linux_x86_64.whl (857.6 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 857.6/857.6 MB 8.6 MB/s eta 0:00:00
    Collecting torchvision==0.19.1
      Downloading https://download.pytorch.org/whl/cu118/torchvision-0.19.1%2Bcu118-cp310-cp310-linux_x86_64.whl (6.3 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 9.8 MB/s eta 0:00:00
    Collecting torchaudio==2.4.1
      Downloading https://download.pytorch.org/whl/cu118/torchaudio-2.4.1%2Bcu118-cp310-cp310-linux_x86_64.whl (3.3 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.3/3.3 MB 6.7 MB/s eta 0:00:00
    Collecting filelock (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/filelock-3.13.1-py3-none-any.whl.metadata (2.8 kB)
    Requirement already satisfied: typing-extensions>=4.8.0 in /root/miniforge3/lib/python3.10/site-packages (from torch==2.4.1) (4.12.2)
    Collecting sympy (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/sympy-1.13.3-py3-none-any.whl.metadata (12 kB)
    Collecting networkx (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/networkx-3.3-py3-none-any.whl.metadata (5.1 kB)
    Collecting jinja2 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/Jinja2-3.1.4-py3-none-any.whl.metadata (2.6 kB)
    Collecting fsspec (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/fsspec-2024.6.1-py3-none-any.whl.metadata (11 kB)
    Collecting nvidia-cuda-nvrtc-cu11==11.8.89 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/cu118/nvidia_cuda_nvrtc_cu11-11.8.89-py3-none-manylinux1_x86_64.whl (23.2 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.2/23.2 MB 14.5 MB/s eta 0:00:00
    Collecting nvidia-cuda-runtime-cu11==11.8.89 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/cu118/nvidia_cuda_runtime_cu11-11.8.89-py3-none-manylinux1_x86_64.whl (875 kB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 875.6/875.6 kB 16.0 MB/s eta 0:00:00
    Collecting nvidia-cuda-cupti-cu11==11.8.87 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/cu118/nvidia_cuda_cupti_cu11-11.8.87-py3-none-manylinux1_x86_64.whl (13.1 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.1/13.1 MB 21.3 MB/s eta 0:00:00
    Collecting nvidia-cudnn-cu11==9.1.0.70 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/cu118/nvidia_cudnn_cu11-9.1.0.70-py3-none-manylinux2014_x86_64.whl (663.9 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 663.9/663.9 MB 10.0 MB/s eta 0:00:00
    Collecting nvidia-cublas-cu11==11.11.3.6 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/cu118/nvidia_cublas_cu11-11.11.3.6-py3-none-manylinux1_x86_64.whl (417.9 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 417.9/417.9 MB 9.7 MB/s eta 0:00:00
    Collecting nvidia-cufft-cu11==10.9.0.58 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/cu118/nvidia_cufft_cu11-10.9.0.58-py3-none-manylinux1_x86_64.whl (168.4 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 168.4/168.4 MB 12.5 MB/s eta 0:00:00
    Collecting nvidia-curand-cu11==10.3.0.86 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/cu118/nvidia_curand_cu11-10.3.0.86-py3-none-manylinux1_x86_64.whl (58.1 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58.1/58.1 MB 13.0 MB/s eta 0:00:00
    Collecting nvidia-cusolver-cu11==11.4.1.48 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/cu118/nvidia_cusolver_cu11-11.4.1.48-py3-none-manylinux1_x86_64.whl (128.2 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 128.2/128.2 MB 12.5 MB/s eta 0:00:00
    Collecting nvidia-cusparse-cu11==11.7.5.86 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/cu118/nvidia_cusparse_cu11-11.7.5.86-py3-none-manylinux1_x86_64.whl (204.1 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 204.1/204.1 MB 11.9 MB/s eta 0:00:00
    Collecting nvidia-nccl-cu11==2.20.5 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/cu118/nvidia_nccl_cu11-2.20.5-py3-none-manylinux2014_x86_64.whl (142.9 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 142.9/142.9 MB 12.5 MB/s eta 0:00:00
    Collecting nvidia-nvtx-cu11==11.8.86 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/cu118/nvidia_nvtx_cu11-11.8.86-py3-none-manylinux1_x86_64.whl (99 kB)
    Collecting triton==3.0.0 (from torch==2.4.1)
      Downloading https://download.pytorch.org/whl/triton-3.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.4/209.4 MB 12.0 MB/s eta 0:00:00
    Collecting numpy (from torchvision==0.19.1)
      Downloading https://download.pytorch.org/whl/numpy-2.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
    Collecting pillow!=8.3.*,>=5.3.0 (from torchvision==0.19.1)
      Downloading https://download.pytorch.org/whl/pillow-11.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (9.1 kB)
    Collecting MarkupSafe>=2.0 (from jinja2->torch==2.4.1)
      Downloading https://download.pytorch.org/whl/MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
    Collecting mpmath<1.4,>=1.1.0 (from sympy->torch==2.4.1)
      Downloading https://download.pytorch.org/whl/mpmath-1.3.0-py3-none-any.whl (536 kB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 7.5 MB/s eta 0:00:00
    Downloading https://download.pytorch.org/whl/pillow-11.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (4.4 MB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.4/4.4 MB 9.4 MB/s eta 0:00:00
    Downloading https://download.pytorch.org/whl/filelock-3.13.1-py3-none-any.whl (11 kB)
    Downloading https://download.pytorch.org/whl/fsspec-2024.6.1-py3-none-any.whl (177 kB)
    Downloading https://download.pytorch.org/whl/Jinja2-3.1.4-py3-none-any.whl (133 kB)
    Downloading https://download.pytorch.org/whl/networkx-3.3-py3-none-any.whl (1.7 MB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 6.6 MB/s eta 0:00:00
    Downloading https://download.pytorch.org/whl/numpy-2.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.3 MB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.3/16.3 MB 13.0 MB/s eta 0:00:00
    Downloading https://download.pytorch.org/whl/sympy-1.13.3-py3-none-any.whl (6.2 MB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 32.9 MB/s eta 0:00:00
    Installing collected packages: mpmath, sympy, pillow, nvidia-nvtx-cu11, nvidia-nccl-cu11, nvidia-cusparse-cu11, nvidia-curand-cu11, nvidia-cufft-cu11, nvidia-cuda-runtime-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cuda-cupti-cu11, nvidia-cublas-cu11, numpy, networkx, MarkupSafe, fsspec, filelock, triton, nvidia-cusolver-cu11, nvidia-cudnn-cu11, jinja2, torch, torchvision, torchaudio
    Successfully installed MarkupSafe-2.1.5 filelock-3.13.1 fsspec-2024.6.1 jinja2-3.1.4 mpmath-1.3.0 networkx-3.3 numpy-2.1.2 nvidia-cublas-cu11-11.11.3.6 nvidia-cuda-cupti-cu11-11.8.87 nvidia-cuda-nvrtc-cu11-11.8.89 nvidia-cuda-runtime-cu11-11.8.89 nvidia-cudnn-cu11-9.1.0.70 nvidia-cufft-cu11-10.9.0.58 nvidia-curand-cu11-10.3.0.86 nvidia-cusolver-cu11-11.4.1.48 nvidia-cusparse-cu11-11.7.5.86 nvidia-nccl-cu11-2.20.5 nvidia-nvtx-cu11-11.8.86 pillow-11.0.0 sympy-1.13.3 torch-2.4.1+cu118 torchaudio-2.4.1+cu118 torchvision-0.19.1+cu118 triton-3.0.0
    WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
    (base) root@VM-0-80-ubuntu:/workspace# 
    
  • Verify that the GPU build of PyTorch installed correctly: python -c "import torch;print(torch.cuda.get_device_name(torch.cuda.current_device()))"

    (base) root@VM-0-80-ubuntu:/workspace# python -c "import torch;print(torch.cuda.get_device_name(torch.cuda.current_device()))"
    Tesla T4

  • Install W&B: pip install wandb

    `wandb` install log:
    (base) root@VM-0-80-ubuntu:/workspace# pip install wandb
    Looking in indexes: http://mirrors.tencentyun.com/pypi/simple
    Collecting wandb
      Downloading http://mirrors.tencentyun.com/pypi/packages/88/c9/41b8bdb493e5eda32b502bc1cc49d539335a92cacaf0ef304d7dae0240aa/wandb-0.20.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23.2 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.2/23.2 MB 9.2 MB/s eta 0:00:00
    Collecting click!=8.0.0,>=7.1 (from wandb)
      Downloading http://mirrors.tencentyun.com/pypi/packages/85/32/10bb5764d90a8eee674e9dc6f4db6a0ab47c8c4d0d83c27f7c39ac415a4d/click-8.2.1-py3-none-any.whl (102 kB)
    Collecting gitpython!=3.1.29,>=1.0.0 (from wandb)
      Using cached http://mirrors.tencentyun.com/pypi/packages/1d/9a/4114a9057db2f1462d5c8f8390ab7383925fe1ac012eaa42402ad65c2963/GitPython-3.1.44-py3-none-any.whl (207 kB)
    Requirement already satisfied: packaging in /root/miniforge3/lib/python3.10/site-packages (from wandb) (24.2)
    Requirement already satisfied: platformdirs in /root/miniforge3/lib/python3.10/site-packages (from wandb) (4.3.6)
    Collecting protobuf!=4.21.0,!=5.28.0,<7,>=3.19.0 (from wandb)
      Downloading http://mirrors.tencentyun.com/pypi/packages/fa/b1/b59d405d64d31999244643d88c45c8241c58f17cc887e73bcb90602327f8/protobuf-6.31.1-cp39-abi3-manylinux2014_x86_64.whl (321 kB)
    Requirement already satisfied: psutil>=5.0.0 in /root/miniforge3/lib/python3.10/site-packages (from wandb) (6.1.1)
    Collecting pydantic<3 (from wandb)
      Downloading http://mirrors.tencentyun.com/pypi/packages/b5/69/831ed22b38ff9b4b64b66569f0e5b7b97cf3638346eb95a2147fdb49ad5f/pydantic-2.11.5-py3-none-any.whl (444 kB)
    Collecting pyyaml (from wandb)
      Downloading http://mirrors.tencentyun.com/pypi/packages/6b/4e/1523cb902fd98355e2e9ea5e5eb237cbc5f3ad5f3075fa65087aa0ecb669/PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (751 kB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 751.2/751.2 kB 2.8 MB/s eta 0:00:00
    Requirement already satisfied: requests<3,>=2.0.0 in /root/miniforge3/lib/python3.10/site-packages (from wandb) (2.32.3)
    Collecting sentry-sdk>=2.0.0 (from wandb)
      Downloading http://mirrors.tencentyun.com/pypi/packages/f0/e5/da07b0bd832cefd52d16f2b9bbbe31624d57552602c06631686b93ccb1bd/sentry_sdk-2.29.1-py2.py3-none-any.whl (341 kB)
    Collecting setproctitle (from wandb)
      Downloading http://mirrors.tencentyun.com/pypi/packages/67/2b/c3cbd4a4462c1143465d8c151f1d51bbfb418e60a96a754329d28d416575/setproctitle-1.3.6-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30 kB)
    Requirement already satisfied: typing-extensions<5,>=4.8 in /root/miniforge3/lib/python3.10/site-packages (from wandb) (4.12.2)
    Collecting gitdb<5,>=4.0.1 (from gitpython!=3.1.29,>=1.0.0->wandb)
      Using cached http://mirrors.tencentyun.com/pypi/packages/a0/61/5c78b91c3143ed5c14207f463aecfc8f9dbb5092fb2869baf37c273b2705/gitdb-4.0.12-py3-none-any.whl (62 kB)
    Collecting annotated-types>=0.6.0 (from pydantic<3->wandb)
      Using cached http://mirrors.tencentyun.com/pypi/packages/78/b6/6307fbef88d9b5ee7421e68d78a9f162e0da4900bc5f5793f6d3d0e34fb8/annotated_types-0.7.0-py3-none-any.whl (13 kB)
    Collecting pydantic-core==2.33.2 (from pydantic<3->wandb)
      Downloading http://mirrors.tencentyun.com/pypi/packages/31/0d/c8f7593e6bc7066289bbc366f2235701dcbebcd1ff0ef8e64f6f239fb47d/pydantic_core-2.33.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 3.7 MB/s eta 0:00:00
    Collecting typing-inspection>=0.4.0 (from pydantic<3->wandb)
      Downloading http://mirrors.tencentyun.com/pypi/packages/17/69/cd203477f944c353c31bade965f880aa1061fd6bf05ded0726ca845b6ff7/typing_inspection-0.4.1-py3-none-any.whl (14 kB)
    Requirement already satisfied: charset_normalizer<4,>=2 in /root/miniforge3/lib/python3.10/site-packages (from requests<3,>=2.0.0->wandb) (3.4.1)
    Requirement already satisfied: idna<4,>=2.5 in /root/miniforge3/lib/python3.10/site-packages (from requests<3,>=2.0.0->wandb) (3.10)
    Requirement already satisfied: urllib3<3,>=1.21.1 in /root/miniforge3/lib/python3.10/site-packages (from requests<3,>=2.0.0->wandb) (2.3.0)
    Requirement already satisfied: certifi>=2017.4.17 in /root/miniforge3/lib/python3.10/site-packages (from requests<3,>=2.0.0->wandb) (2024.12.14)
    Collecting smmap<6,>=3.0.1 (from gitdb<5,>=4.0.1->gitpython!=3.1.29,>=1.0.0->wandb)
      Using cached http://mirrors.tencentyun.com/pypi/packages/04/be/d09147ad1ec7934636ad912901c5fd7667e1c858e19d355237db0d0cd5e4/smmap-5.0.2-py3-none-any.whl (24 kB)
    Installing collected packages: typing-inspection, smmap, setproctitle, sentry-sdk, pyyaml, pydantic-core, protobuf, click, annotated-types, pydantic, gitdb, gitpython, wandb
    Successfully installed annotated-types-0.7.0 click-8.2.1 gitdb-4.0.12 gitpython-3.1.44 protobuf-6.31.1 pydantic-2.11.5 pydantic-core-2.33.2 pyyaml-6.0.2 sentry-sdk-2.29.1 setproctitle-1.3.6 smmap-5.0.2 typing-inspection-0.4.1 wandb-0.20.1
    WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
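
Before moving on, it can help to confirm programmatically that the packages installed above are importable. A small stdlib-only sketch (the `installed` helper is my own name, not part of any library):

```python
import importlib.util

def installed(pkg: str) -> bool:
    """Return True if `pkg` can be imported in the current environment."""
    return importlib.util.find_spec(pkg) is not None

# The packages this tutorial relies on; run this after the pip installs above.
for pkg in ["torch", "torchvision", "wandb"]:
    print(pkg, "OK" if installed(pkg) else "MISSING -> pip install " + pkg)
```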
    

3. Registering and Configuring W&B

  • Register: go to the official site wandb.ai/ and create an account, or sign in with GitHub.

  • Get a token: click your avatar at the top right of the page, choose 🔑 API key, and click copy.

  • In the terminal where you will use W&B, run wandb login, paste the token you just copied, and press Enter.

  • This logs in against the W&B site and saves the token to /root/.netrc.

    `wandb login` log:
    (base) root@VM-0-80-ubuntu:/workspace# wandb login
    wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
    wandb: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
    wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
    wandb: No netrc file found, creating one.
    wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
    wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
    (base) root@VM-0-80-ubuntu:/workspace# 
    
    点击查看`wandb`的终端命令和解释
    (base) root@VM-0-80-ubuntu:/workspace# wandb --help
    Usage: wandb [OPTIONS] COMMAND [ARGS]...
    
    Options:
      --version  Show the version and exit.
      --help     Show this message and exit.
    
    Commands:
      agent         Run the W&B agent
      artifact      Commands for interacting with artifacts
      beta          Beta versions of wandb CLI commands.
      controller    Run the W&B local sweep controller
      disabled      Disable W&B.
      docker        Run your code in a docker container.
      docker-run    Wrap `docker run` and adds WANDB_API_KEY and WANDB_DOCKER...
      enabled       Enable W&B.
      init          Configure a directory with Weights & Biases
      job           Commands for managing and viewing W&B jobs
      launch        Launch or queue a W&B Job.
      launch-agent  Run a W&B launch agent.
      launch-sweep  Run a W&B launch sweep (Experimental).
      login         Login to Weights & Biases
      offline       Disable W&B sync
      online        Enable W&B sync
      pull          Pull files from Weights & Biases
      restore       Restore code, config and docker state for a run
      scheduler     Run a W&B launch sweep scheduler (Experimental)
      server        Commands for operating a local W&B server
      status        Show configuration settings
      sweep         Initialize a hyperparameter sweep.
      sync          Upload an offline training directory to W&B
      verify        Verify your local instance
    (base) root@VM-0-80-ubuntu:/workspace# 
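
As the log above shows, wandb login stores the API key in /root/.netrc. If you later need to check which key is configured, Python's standard netrc module can parse that file. A sketch using a temporary file with a made-up login and key so it runs anywhere (for the real file, point netrc.netrc at /root/.netrc):

```python
import netrc
import os
import tempfile

# A sample entry in the format `wandb login` appends (login and key are fake).
sample = "machine api.wandb.ai\n  login user\n  password 0123456789abcdef\n"

path = os.path.join(tempfile.mkdtemp(), "netrc")
with open(path, "w") as f:
    f.write(sample)

# authenticators() returns (login, account, password) for the given machine.
login, _account, password = netrc.netrc(path).authenticators("api.wandb.ai")
print(login, password)
```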
    

4. Using W&B with PyTorch

4.1 The official demo with comments, from github.com/wandb/wandb

import wandb

# Start a W&B run with wandb.init
run = wandb.init(project="my_first_project")

# Save model inputs and hyperparameters in a wandb.config object
config = run.config
config.learning_rate = 0.01

# Model training code here ...

# Log metrics over time to visualize performance with wandb.log
for i in range(10):
    run.log({"loss": ...})

# Mark the run as finished, and finish uploading all data
run.finish()

4.2 Training an MLP on MNIST with Torchvision, logging training loss and accuracy

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

import wandb
run = wandb.init(project="minimal-demo")

transform = transforms.ToTensor()
train_loader = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform),
    batch_size=64, shuffle=True
    )

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
    )

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

model.train()
for epoch in range(5):
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        _, predicted = outputs.max(1)
        accuracy = (predicted == labels).float().mean()

        run.log({"loss": loss.item(), "accuracy": accuracy.item()})
run.finish()
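
The loop above logs loss and accuracy for every batch, which can make the W&B charts noisy. A common variant is to also log per-epoch averages; a minimal stdlib-only helper for that (the `EpochAverager` class is hypothetical, not part of wandb or PyTorch):

```python
class EpochAverager:
    """Accumulate per-batch values and report their mean at epoch end."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value: float) -> None:
        self.total += value
        self.count += 1

    def mean(self) -> float:
        return self.total / self.count

    def reset(self) -> None:
        self.total, self.count = 0.0, 0

# Inside the training loop you would call avg.update(loss.item()) per batch,
# then run.log({"epoch_loss": avg.mean()}) and avg.reset() once per epoch.
avg = EpochAverager()
for batch_loss in [2.0, 4.0, 6.0]:
    avg.update(batch_loss)
print(avg.mean())  # 4.0
```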

5. Troubleshooting

5.0 Connection timeouts

  • Symptom: wandb: Network error (ConnectTimeout), entering retry loop. followed by CommError: Run initialization has timed out after 90.0 sec. Please try increasing the timeout with the `init_timeout` setting: `wandb.init(settings=wandb.Settings(init_timeout=120))`.
  • Cause: problems reaching wandb's API server.
  • Fix: see section 6, W&B Network Workarounds.

5.1 Missing dependencies

  1. wandb: ERROR The nbformat package was not found. It is required to save notebook history.
    • Fix: pip install nbformat

    nbformat is an official Python library from the Jupyter project for reading, writing, and manipulating .ipynb Jupyter Notebook files. It provides standardized data structures (such as NotebookNode) that let developers work programmatically with a notebook's components, including code cells, Markdown cells, and metadata.

6. W&B Network Workarounds

6.1 (Recommended) Switch to a third-party endpoint

  • Use the mirror API server shared by the 小红书 author Urara (xhs id: 5227964072):
    • (Recommended) Add import os; os.environ["WANDB_BASE_URL"] = "https://api.bandw.top" directly in your Python code.
    • Or set the environment variable: export WANDB_BASE_URL=https://api.bandw.top.
      • Note that export only affects the current shell session; to persist the variable, add it to .bashrc or the corresponding file of your conda env yourself.
      • After switching, you may need to log in to wandb again. (I am not sure this is always triggered; I will keep an eye on it and look for a once-and-for-all fix instead of re-logging-in every time.)
      • Fun fact: typing export in a Jupyter notebook does not actually change the shell environment variables; in Jupyter, set the variable from Python code instead!
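
Putting the recommended option together: set the variable at the very top of your script, before wandb initializes, so the client picks up the override. A minimal sketch (the wandb lines are commented out so the snippet also runs where wandb is not installed; the project name is just an example):

```python
import os

# Point the W&B client at the third-party endpoint BEFORE initializing wandb.
os.environ["WANDB_BASE_URL"] = "https://api.bandw.top"

# import wandb
# run = wandb.init(project="my_first_project")  # now goes through api.bandw.top

print(os.environ["WANDB_BASE_URL"])
```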

6.2 (Not yet explored) Build your own reverse proxy with serverless services or other cloud infrastructure

6.3 Run offline, and upload the W&B data when needed

  • Situations where offline mode is recommended: when the machine cannot reach servers abroad, or is a fully offline server, run in offline mode, then download the data to inspect it yourself and, if needed, upload it to wandb later.
    • PS 1: fully offline servers: Beijing Super Cloud Computing (北京超算) in offline mode, Kaggle in offline mode.
    • PS 2: situations where online mode is risky: on Baidu AI Studio, connecting to wandb and uploading gets cut off by Baidu's anti-virus system and your account banned, and emails or web-form feedback about it go unanswered.
  • Enabling offline mode:
    • (Recommended) Use the wandb command: wandb offline
    • Or set the environment variable: export WANDB_MODE=offline, see docs.wandb.ai/support/run…
    • Or pass mode="offline" when initializing in Python: wandb.init(mode="offline")
  • Uploading the wandb data:
    • (Recommended) Run wandb sync ./wandb/offline-run-*.
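
In offline mode each run is written to a local directory matching ./wandb/offline-run-*, and those directories are what wandb sync uploads. A small stdlib-only helper to list them (the function name offline_runs is my own):

```python
import glob
import os

def offline_runs(wandb_dir: str = "./wandb"):
    """List offline run directories; each can be passed to `wandb sync`."""
    return sorted(glob.glob(os.path.join(wandb_dir, "offline-run-*")))

# Example: print the sync command for every offline run found locally.
for run_dir in offline_runs():
    print("wandb sync", run_dir)
```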