Hugging Face in Action: Getting started

This chapter covers

Using Anaconda
Creating virtual environments with conda
Using the GPU in the pipeline() function
Using the Hugging Face Hub package

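In chapter 1, you saw some of the interesting projects you will build with Hugging Face pretrained models and services such as AutoTrain. The main programming language you will use is Python, together with my favorite IDE, Jupyter Notebook. Jupyter Notebook is an open source web application that lets you create and share documents containing live code, equations, visualizations, and narrative text. Its interactive, exploratory nature has made it widely used in data science, scientific research, machine learning, and education. Next, you will learn how to set up Jupyter Notebook and create a virtual environment for running all the examples in this book.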

2.1 Downloading Anaconda

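The easiest way to install Jupyter Notebook is to download Anaconda, a Python/R distribution that bundles common libraries and tools for data science, machine learning, and scientific computing. It ships with the conda package manager, which simplifies package management and environment creation. Jupyter Notebook is one of Anaconda's core components, so installing Anaconda gives you Jupyter Notebook along with many commonly used packages.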

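To get Anaconda (free for individual use), go to www.anaconda.com/download/su… . Then click the download icon for your operating system (see figure 2.1). When the download finishes, double-click the installer and follow the prompts to complete the installation.

Figure 2.1 Downloading Anaconda for Windows, macOS, and Linux (omitted)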

2.1.1 Creating a virtual environment

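Once Anaconda is installed, you can start using it. Before launching Jupyter Notebook to write code, though, it is a good idea to create a virtual environment: a self-contained environment that lets you install and manage a project's dependencies independently, without affecting the system-wide Python. This is useful for isolating dependencies and managing the requirements of different projects.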

To create a virtual environment, open Terminal (macOS) or Anaconda Prompt (Windows). Figure 2.2 shows how to launch Anaconda Prompt in Windows.

Figure 2.2 Launching Anaconda Prompt in Windows (omitted)

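Anaconda Prompt vs. Command Prompt
For Windows users, Anaconda Prompt looks identical to the regular Command Prompt. The two are in fact the same program; the only difference is that Anaconda Prompt comes preconfigured with Anaconda's environment variables and paths. That means you can use the Python interpreter, the conda package manager, and other tools as soon as you open it, with no extra configuration.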

Next, use the conda command to create a new virtual environment:

conda create -n HuggingFaceBook python=3.11 anaconda

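This command creates a virtual environment named HuggingFaceBook that uses Python 3.11 and includes the Anaconda distribution. You will then be prompted to install a number of packages (see figure 2.3); type Y and press Enter.

Figure 2.3 Creating a new virtual environment and installing the required packages (omitted)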

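The Anaconda distribution
Anaconda combines a package manager, an environment manager, and a Python distribution, and aims to simplify package management and deployment for Python/R data science and machine learning applications. It bundles or makes readily installable common libraries such as NumPy, pandas, Matplotlib, scikit-learn, TensorFlow, PyTorch, and Jupyter Notebook.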

When the installation finishes, use conda activate to activate (switch to) the virtual environment:

conda activate HuggingFaceBook

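Once the environment is active, the prompt in Terminal or Anaconda Prompt is prefixed with the environment name (see figure 2.4).

Figure 2.4 The prompt prefix shows the name of the current virtual environment (omitted)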
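
Besides reading the prompt prefix, you can confirm from Python itself which environment is active. The following is a minimal sketch, not part of the original chapter; describe_environment is a hypothetical helper name, and it relies on conda setting the CONDA_DEFAULT_ENV environment variable when an environment is activated.

```python
import os
import sys


def describe_environment():
    """Return basic facts about the Python environment running this code."""
    return {
        # conda sets CONDA_DEFAULT_ENV on activation; it is None outside conda
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Full path of the interpreter, which normally lives inside the environment
        "python_executable": sys.executable,
        "python_version": f"{sys.version_info.major}.{sys.version_info.minor}",
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

If the environment was activated as described, conda_env should read HuggingFaceBook and python_version should read 3.11.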

2.1.2 Launching Jupyter Notebook

With the virtual environment created, you can launch Jupyter Notebook. I prefer to launch it from Terminal or Anaconda Prompt.

First, create a folder in the terminal to hold your projects, for example HF_Projects:

(HuggingFaceBook) weimenglee@WeiMengacStudio ~ % mkdir HF_Projects

Change to that directory:

(HuggingFaceBook) weimenglee@WeiMengacStudio ~ % cd HF_Projects

Enter the following command to launch Jupyter Notebook:

(HuggingFaceBook) weimenglee@WeiMengacStudio HF_Projects % jupyter notebook

Your browser opens automatically and displays the Jupyter Notebook home page (see figure 2.5).

Figure 2.5 The browser showing the Jupyter Notebook home page (omitted)

To create a new notebook, click the New button and then click Notebook in the menu (see figure 2.6).

Figure 2.6 Creating a notebook (omitted)

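A new tab named Untitled appears. (If you don't see it, your browser may have blocked the pop-up; clicking the address bar usually reveals the tab.) From the drop-down menu, select Python 3 (ipykernel) (see figure 2.7), and then click Select. You will then see a notebook that is ready for writing code (see figure 2.8).

Figure 2.7 Selecting a kernel for the notebook (omitted)

Figure 2.8 The notebook is ready (omitted)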

TIP If you are new to Jupyter Notebook, the official documentation is a good place to start: mng.bz/eB4P .

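Finally, rename the notebook by clicking its default filename and entering a new name in the Rename File dialog (see figure 2.9). For this example, enter Chapter 2.ipynb and click Rename. The file is saved in the HF_Projects directory.

Figure 2.9 Renaming your notebook (omitted)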

2.2 Installing the Transformers library

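Throughout this book, you will make heavy use of Hugging Face's Transformers library. Transformers is an open source library from Hugging Face that provides an easy-to-use interface for working with a wide range of state-of-the-art pretrained models for NLP tasks such as text classification, named entity recognition (NER), text generation, and question answering. Chapter 3 covers the library in more detail. For now, install it. You can run the following directly in Jupyter Notebook: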

!pip install transformers

Or run the following in Terminal or Anaconda Prompt:

pip install transformers
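
After installing, it can be useful to confirm that the package landed in the environment your notebook is actually using. The sketch below uses only the standard library (importlib.metadata); installed_version is a hypothetical helper name introduced here for illustration, not something from the Transformers library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string if the install succeeded in this environment,
# or None if the notebook kernel is running in a different environment.
print("transformers:", installed_version("transformers"))
```

A result of None usually means the notebook kernel and the pip command refer to different Python environments.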

2.2.1 GPU support

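With Hugging Face, you will use Transformers for machine learning tasks such as NLP and image recognition. Under the hood, the library is built mainly on PyTorch (a deep learning framework developed by Facebook AI Research), relying on it to build networks, train models, and optimize performance on NLP and other tasks.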

NOTE Transformers also supports TensorFlow (a widely used deep learning framework developed by Google), so you can use Transformers within a TensorFlow workflow for NLP and other tasks. This book, however, focuses on PyTorch.

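One of PyTorch's key features is its strong GPU support: it integrates with CUDA (NVIDIA's parallel computing platform and programming model) so that tensor operations and computations can be offloaded to the GPU, greatly speeding up the training and inference of deep learning models. PyTorch's API also makes it straightforward to move tensors and models between the CPU and the GPU, although you typically request the move explicitly (for example, with the .to() method). PyTorch supports model parallelism as well, which splits a large model across multiple GPUs so that a model too big for a single card can still run in a distributed fashion.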
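
To make the idea of moving data between the CPU and the GPU concrete, here is a minimal sketch that assumes torch is already installed. It is an illustration added here, not code from the original chapter; it falls back to the CPU when CUDA is unavailable, so it runs on any machine.

```python
import torch

# Pick the GPU when CUDA is available; otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.ones(2, 3)       # tensors are created on the CPU by default
x_dev = x.to(device)       # explicit move (a no-op if device is the CPU)

# The computation runs on whichever device holds the tensor.
total = (x_dev * 2).sum()

# .item() copies the scalar result back into a regular Python float.
result = total.item()
print(f"Computed {result} on {x_dev.device.type}")
```

Tensors that take part in the same operation need to live on the same device, which is why the move is requested explicitly before the computation.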

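If you have a CUDA-capable GPU, enabling it is recommended to speed up training and inference. To do so, install PyTorch (and a few related packages) from a CUDA-compatible source. The following command installs torch, torchvision, and torchaudio from https://download.pytorch.org/whl/cu121 (the CUDA 12.1 builds):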

pip install torch torchvision torchaudio \
--index-url https://download.pytorch.org/whl/cu121 -U

After the installation finishes, you can check in Jupyter Notebook whether CUDA is available:

import torch
print(torch.cuda.is_available())

If the output is True, PyTorch can use CUDA on your system. The following code prints more detailed GPU information.

Listing 2.1 Getting GPU details with torch

import torch
use_cuda = torch.cuda.is_available()

if use_cuda:
    print('__CUDNN VERSION:', torch.backends.cudnn.version())
    print('__Number CUDA Devices:', torch.cuda.device_count())
    print('__CUDA Device Name:', torch.cuda.get_device_name(0))
    print('__CUDA Device Total Memory [GB]:',
          torch.cuda.get_device_properties(0).total_memory/1e9)

On a laptop with an NVIDIA RTX 4060, the sample output looks like this:

__CUDNN VERSION: 8801
__Number CUDA Devices: 1
__CUDA Device Name: NVIDIA GeForce RTX 4060 Laptop GPU
__CUDA Device Total Memory [GB]: 8.585216
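
The reported 8.585216 may look odd for a card sold as having 8 GB of memory. Listing 2.1 divides the byte count by 1e9 (decimal gigabytes), whereas memory sizes are usually quoted in binary units (GiB, 2**30 bytes). The sketch below shows the conversion; it assumes the underlying byte count was exactly 8,585,216,000, which is inferred from the printed value rather than stated in the output, and bytes_to_units is a hypothetical helper name.

```python
def bytes_to_units(num_bytes):
    """Express a byte count in decimal gigabytes and binary gibibytes."""
    return {
        "GB": num_bytes / 1e9,      # decimal: 1 GB = 1,000,000,000 bytes
        "GiB": num_bytes / 2**30,   # binary: 1 GiB = 1,073,741,824 bytes
    }


# Byte count inferred from the sample output above (an assumption).
sizes = bytes_to_units(8_585_216_000)
print(f"{sizes['GB']:.3f} GB = {sizes['GiB']:.3f} GiB")
# prints: 8.585 GB = 7.996 GiB
```

In binary units the card reports just under 8 GiB, which matches its advertised capacity.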

You can also use the GPUtil package to view GPU details (count, utilization, temperature, memory usage, and so on). First, install it:

!pip install GPUtil

Get the list of indices of available GPUs:

import GPUtil
GPUtil.getAvailable()

On a single-GPU machine, this returns:

[0]

Get detailed information for each GPU (name, utilization, memory usage, temperature, total memory, and so on):

Listing 2.2 Getting the details of each GPU with GPUtil

import GPUtil

gpus = GPUtil.getGPUs()

for gpu in gpus:
    print("GPU ID:", gpu.id)
    print("GPU Name:", gpu.name)
    print("GPU Utilization:", gpu.load * 100, "%")
    print("GPU Memory Utilization:", gpu.memoryUtil * 100, "%")
    print("GPU Temperature:", gpu.temperature, "C")
    print("GPU Total Memory:", gpu.memoryTotal, "MB")

2.2.2 Using the GPU in a pipeline object

To make a pipeline use the GPU, explicitly specify the device parameter when calling pipeline(). The following example uses the first (and only) GPU in the system:

from transformers import pipeline
question_classifier = pipeline("text-classification",
                               model="huaen/question_detection",
                               device=0)

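Transformers pipeline
In Transformers, a pipeline is a high-level, easy-to-use API that lets you carry out NLP workflows such as text classification, NER, translation, and summarization with very little code.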

The preceding code creates a question-classification pipeline: it takes a string as input and reports whether that string is a question.

Instead of using a number for device (to select which GPU), you can also use a string:

question_classifier = pipeline("text-classification",
                               model="huaen/question_detection",
                               device="cuda:0")

On a Mac, you can use MPS (Apple Metal Performance Shaders) to accelerate inference on Apple Silicon (M1/M2/M3) by setting device to "mps:0":

question_classifier = pipeline("text-classification",
                               model="huaen/question_detection",
                               device="mps:0")

Table 2.1 Values of the device parameter in pipeline()

Number	String	Description
-1	`"cpu"`	Use the CPU (the default device for `pipeline()`)
0	`"cuda"` / `"cuda:0"`	Use the first GPU
1	`"cuda:1"`	Use the second GPU
n	`"cuda:n"`	Use GPU number n+1
0	`"mps:0"`	Use PyTorch's MPS backend (Apple's built-in GPU: M1/M2/M3)
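
The correspondence between the numeric and string forms in table 2.1 can be summarized in a few lines of Python. This is a sketch for illustration only: device_to_string is a hypothetical helper, not a Transformers function, and the backend argument is an assumption added here so the same rule can describe both the CUDA and the MPS rows.

```python
def device_to_string(device, backend="cuda"):
    """Translate a numeric device value into its string form, following table 2.1."""
    if isinstance(device, str):
        return device                 # already a string such as "cuda:1" or "mps:0"
    if device < 0:
        return "cpu"                  # -1 selects the CPU
    return f"{backend}:{device}"      # n selects accelerator number n+1


for value in (-1, 0, 1):
    print(value, "->", device_to_string(value))
print(0, "->", device_to_string(0, backend="mps"))
```

Writing the string form explicitly in your own code avoids any ambiguity about which backend a bare integer refers to.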

Not sure whether a pipeline is running on the CPU or the GPU? Print its device type:

print(question_classifier.device.type)

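A robust approach is to detect CUDA or MPS automatically and fall back to the CPU if neither is available:

Listing 2.3 Automatically detecting CUDA, MPS, or CPU for inference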

from transformers import pipeline
import torch

if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

question_classifier = pipeline("text-classification",
                               model="huaen/question_detection",
                               device=device)
print(f"Using device: {device}")

2.3 Installing the Hugging Face Hub package

The Hugging Face Hub (huggingface.co, see figure 2.10) is the go-to platform for everything related to Hugging Face: pretrained models, demos, datasets, and more.

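Although you can access the Hugging Face Hub in a browser, you can also interact with the Hub directly through the huggingface_hub package. With this package, you can perform tasks such as the following:

Managing project repositories
Uploading and downloading files
Retrieving models

Figure 2.10 Hugging Face Hub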

To install the huggingface_hub package, use the pip command:

!pip install huggingface_hub

2.3.1 Downloading files

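On the Hugging Face Hub, you will find many pretrained models that are free to use. Normally, the first time you use a model, the Transformers library automatically downloads the model's files and caches them locally. Sometimes, however, you may want to download the files manually so that you can run your code offline.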

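To download a file from the Hugging Face Hub, go to the model page and click the Download button. Take the model google/pegasus-xsum (mng.bz/gmXx) as an example: to download the config.json file from the model page, click its download icon (see figure 2.11).

Figure 2.11 Downloading a file directly from the model page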

If you use the huggingface_hub package, you can download files programmatically with the hf_hub_download() function:

from huggingface_hub import hf_hub_download

hf_hub_download(repo_id="google/pegasus-xsum",
                filename="config.json")

config.json is downloaded to the following directory:

<home_directory>/.cache/huggingface/hub/
models--google--pegasus-xsum/snapshots/
8d8ffc158a3bee9fbb03afacdfc347c823c5ec8b/
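
The directory name above follows a predictable pattern: the repository ID with each slash replaced by two hyphens, prefixed with the repository type. The sketch below reconstructs the repository's cache folder using only the standard library. It is an approximation of the default layout, not the library's own logic; default_repo_cache_dir is a hypothetical helper, and it assumes the cache location has only been customized, if at all, through the HF_HOME or HF_HUB_CACHE environment variables.

```python
import os
from pathlib import Path


def default_repo_cache_dir(repo_id, repo_type="model"):
    """Approximate where the Hub cache stores a given repository."""
    hub_cache = os.environ.get("HF_HUB_CACHE")
    if hub_cache is None:
        # HF_HOME, when set, replaces ~/.cache/huggingface as the base directory
        default_home = Path.home() / ".cache" / "huggingface"
        hf_home = os.environ.get("HF_HOME", str(default_home))
        hub_cache = os.path.join(hf_home, "hub")
    # "google/pegasus-xsum" becomes "models--google--pegasus-xsum"
    folder = f"{repo_type}s--" + repo_id.replace("/", "--")
    return Path(hub_cache) / folder


print(default_repo_cache_dir("google/pegasus-xsum"))
```

Inside that folder, the snapshots subdirectory holds one folder per downloaded revision, named after the commit hash, as in the path shown above.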

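By default, the latest version of the file is downloaded from the main branch. In some cases, though, you may want a specific version of a file (for example, from a particular branch, PR, tag, or commit hash). To do that, first click the file you want to download (see figure 2.12).

Figure 2.12 Selecting the file to download

Then click the file's history link (see figure 2.13).

Figure 2.13 Viewing the project's commit history

Finally, copy the commit hash that corresponds to the version of the file you want (see figure 2.14).

After copying the commit hash, pass it as the revision parameter of hf_hub_download():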

hf_hub_download(
    repo_id="google/pegasus-xsum",
    filename="config.json",
    revision="a0aa5531c00f59a32a167b75130805098b046f9c"
)
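
hf_hub_download() returns the local path of the cached file as a string, so you can open the file right away. The following sketch wraps that in a small function; load_json_from_hub is a hypothetical helper introduced for illustration, and the import is done inside the function so that the definition itself does not require the package or a network connection.

```python
import json


def load_json_from_hub(repo_id, filename, revision=None):
    """Download a JSON file from the Hub (or reuse the cached copy) and parse it."""
    # Imported lazily; calling this function requires huggingface_hub.
    from huggingface_hub import hf_hub_download

    # revision=None means the latest version on the main branch.
    local_path = hf_hub_download(repo_id=repo_id,
                                 filename=filename,
                                 revision=revision)
    with open(local_path, encoding="utf-8") as f:
        return json.load(f)


# Example usage (needs network access the first time it runs):
# config = load_json_from_hub("google/pegasus-xsum", "config.json")
# print(sorted(config)[:5])
```

Pinning revision to a commit hash, as shown above, makes the result reproducible even if the repository's main branch changes later.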

Figure 2.14 Copying the commit hash of a file

2.3.2 Using the Hugging Face CLI

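The huggingface_hub package also includes the Hugging Face CLI, a command-line tool that, among other things, authenticates your applications using a token. In Terminal or Anaconda Prompt, enter huggingface-cli to see the available options: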

huggingface-cli

Output (excerpt):

usage: huggingface-cli <command> [<args>]

positional arguments:
  {env,login,whoami,logout,repo,upload,download,
   lfs-enable-largefiles,lfs-multipart-upload,
   scan-cache,delete-cache}
    env                 Print information about the environment.
    login               Log in using a token from
                        huggingface.co/settings/tokens
    whoami              Find out which huggingface.co account 
                        you are logged in as.
    logout              Log out
    repo                {create} Commands to interact with 
                        your huggingface.co repos.
    upload              Upload a file or a folder to a repo on the Hub
    download            Download files from the Hub
    lfs-enable-largefiles
                        Configure your repository to enable 
                        upload of files > 5GB.
    scan-cache          Scan cache directory.
    delete-cache        Delete revisions from the cache directory.

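With the CLI, you can log in to the Hugging Face Hub programmatically. First, create an account at huggingface.co/join (see figure 2.15).

Figure 2.15 Signing up for a Hugging Face account

Hugging Face uses access tokens to authenticate users for operations such as downloading from private repositories, uploading files, and creating PRs. After signing up, create an access token at huggingface.co/settings/to… . Then log in with the following command: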

$ huggingface-cli login

    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

A token is already saved on your machine. Run `huggingface-cli 
whoami` to get more information or `huggingface-cli logout` if 
you want to log out.
    Setting a new token will erase the existing one.
To login, `huggingface_hub` requires a token generated from
https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): <HuggingFaceAccessToken>
Add token as git credential? (Y/n) n
Token is valid (permission: read).
Your token has been saved to /Users/weimenglee/.cache/huggingface/token
Login successful

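When you type (or paste) the token, it is not echoed to the screen. The token is saved in a file named token in the <home_directory>/.cache/huggingface/ directory. To see which account you are currently logged in as, use: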

huggingface-cli whoami

Another way to log in is to use the login() function in Python:

from huggingface_hub import login

login()

login() displays a form (see figure 2.16); enter your token and click Login.

NOTE If you run into an ipywidgets-related error in Jupyter Notebook, upgrading the package may fix it:
!pip install -U ipywidgets

Figure 2.16 Logging in to the Hugging Face Hub in Jupyter Notebook
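
For scripts that run unattended, an interactive form is inconvenient. login() also accepts the token directly through its token parameter, and a common pattern is to read the token from an environment variable instead of hard-coding it. The sketch below is an illustration rather than code from the original chapter; token_from_env is a hypothetical helper, and HF_TOKEN is assumed to be the variable where you stored the token.

```python
import os


def token_from_env(var_name="HF_TOKEN"):
    """Return the token stored in an environment variable, or None if it is unset."""
    token = os.environ.get(var_name, "").strip()
    return token or None


token = token_from_env()
if token is None:
    print("HF_TOKEN is not set; use the interactive login() instead.")
else:
    print("Found a token in HF_TOKEN.")
    # Uncomment to log in without a prompt (requires huggingface_hub):
    # from huggingface_hub import login
    # login(token=token)
```

Keeping the token out of the notebook also makes it safer to share the notebook file with others.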

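Summary

Anaconda comes with the conda package manager, which simplifies package management and environment creation, and it includes Jupyter Notebook.
Creating a virtual environment isolates a project's dependencies from the system Python, making it easier to manage the requirements of different projects.
The easiest way to launch Jupyter Notebook is from Terminal or Anaconda Prompt.
The Transformers library is built mainly on PyTorch, a deep learning framework developed primarily by FAIR (Facebook AI Research).
PyTorch supports GPUs and integrates with CUDA to speed up training and inference.
The Hugging Face Hub package supports operations such as downloading and uploading files and authenticating through the CLI.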