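Machine Learning Mastery Python Tutorials (Part 2)

Source: Machine Learning Mastery

License: CC BY-NC-SA 4.0

A Guide to Getting Datasets for Machine Learning in Python

Original: machinelearningmastery.com/a-guide-to-getting-datasets-for-machine-learning-in-python/

Compared to other programming exercises, a machine learning project is a blend of code and data. You need both to achieve a result and do something useful. Over the years, many well-known datasets have been created, and many have become standards or benchmarks. In this tutorial, we will see how to obtain those well-known public datasets easily. We will also learn how to make a synthetic dataset if none of the existing datasets fits our needs.

After finishing this tutorial, you will know:

Where to look for freely available datasets for machine learning projects
How to download datasets using libraries in Python
How to generate synthetic datasets using scikit-learn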

Let's get started.

A Guide to Getting Datasets for Machine Learning in Python

Photo by Olha Ruskykh. Some rights reserved.

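Tutorial Overview

This tutorial is divided into four parts; they are:

Dataset repositories
Retrieving datasets in scikit-learn and Seaborn
Retrieving datasets in TensorFlow
Generating datasets in scikit-learn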

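Dataset Repositories

Machine learning has been developed for decades, and therefore some datasets are of historical significance. One of the most famous dataset repositories is the UCI Machine Learning Repository. Most of the datasets there are small because the technology at the time was not advanced enough to handle larger data. Some famous datasets in this repository are the iris flower dataset (introduced by Ronald Fisher in 1936) and the 20 newsgroups dataset (textual data commonly referred to in the information retrieval literature).

Newer datasets are usually larger. For example, the ImageNet dataset is over 160 GB. These datasets are commonly found on Kaggle, where we can search for them by name. If we need to download them, it is recommended to use Kaggle's command line tool after registering for an account.

OpenML is a newer repository that hosts a large number of datasets. It is convenient because you can search for a dataset by name, and it also offers a standardized web API for retrieving data. It is useful if you want to use Weka since it provides files in ARFF format.

Still, many datasets are publicly available but, for various reasons, are not in these repositories. You may also want to check out the "List of datasets for machine-learning research" on Wikipedia. That page contains a long list of datasets grouped into categories, with links to download them.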

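Retrieving Datasets in scikit-learn and Seaborn

Trivially, you may obtain those datasets by downloading them from the web, either through the browser, on the command line with a tool such as wget, or with a network library such as requests in Python. Since some of those datasets have become standards or benchmarks, many machine learning libraries have created functions to help retrieve them. For practical reasons, the datasets are often not shipped with the libraries but downloaded in real time when you invoke the functions. Therefore, you need a steady internet connection to use them.

Scikit-learn is an example where you can download a dataset using its API. The related functions are defined under sklearn.datasets, and you can see the list of functions at: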

scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets

For example, you can use the function load_iris() to get the iris flower dataset as follows:

import sklearn.datasets

data, target = sklearn.datasets.load_iris(return_X_y=True, as_frame=True)
data["target"] = target
print(data)

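The load_iris() function returns NumPy arrays (i.e., without column headers) instead of a pandas DataFrame unless the argument as_frame=True is specified. Also, we pass return_X_y=True to the function, so only the machine learning features and targets are returned, rather than metadata such as the description of the dataset. The above code prints the following: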

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                  5.1               3.5                1.4               0.2       0
1                  4.9               3.0                1.4               0.2       0
2                  4.7               3.2                1.3               0.2       0
3                  4.6               3.1                1.5               0.2       0
4                  5.0               3.6                1.4               0.2       0
..                 ...               ...                ...               ...     ...
145                6.7               3.0                5.2               2.3       2
146                6.3               2.5                5.0               1.9       2
147                6.5               3.0                5.2               2.0       2
148                6.2               3.4                5.4               2.3       2
149                5.9               3.0                5.1               1.8       2

[150 rows x 5 columns]
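To see what is left out when return_X_y=True is passed, it may help to look at the full return value. The following is a minimal sketch, assuming scikit-learn is installed; without return_X_y, load_iris() returns a dictionary-like Bunch object whose attributes hold both the data and the metadata.

```python
from sklearn.datasets import load_iris

# Without return_X_y, the loader returns a Bunch holding data plus metadata
iris = load_iris()

print(iris.data.shape)           # feature matrix as a NumPy array
print(iris.target.shape)         # integer class labels
print(iris.feature_names)        # column names for the features
print(list(iris.target_names))   # class names matching labels 0, 1, 2
print(iris.DESCR[:60])           # beginning of the dataset description
```

The target_names attribute is what lets you map the integer targets 0, 1, and 2 back to species names.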

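Separating the features and targets is convenient for training a scikit-learn model, but combining them is helpful for visualization. For example, we may combine the DataFrame as above and then visualize the correlogram using Seaborn: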

import sklearn.datasets
import matplotlib.pyplot as plt
import seaborn as sns

data, target = sklearn.datasets.load_iris(return_X_y=True, as_frame=True)
data["target"] = target

sns.pairplot(data, kind="scatter", diag_kind="kde", hue="target",
             palette="muted", plot_kws={'alpha':0.7})
plt.show()

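From the correlogram, we can see that target 0 is easy to distinguish, but targets 1 and 2 usually have some overlap. Because this dataset is also useful for demonstrating plotting functions, we can find an equivalent data loading function in Seaborn. We can rewrite the above as follows: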

import matplotlib.pyplot as plt
import seaborn as sns

data = sns.load_dataset("iris")
sns.pairplot(data, kind="scatter", diag_kind="kde", hue="species",
             palette="muted", plot_kws={'alpha':0.7})
plt.show()

The datasets supported by Seaborn are more limited. We can see the names of all supported datasets by running:

import seaborn as sns
print(sns.get_dataset_names())

The following are all the datasets in Seaborn:

['anagrams', 'anscombe', 'attention', 'brain_networks', 'car_crashes',
'diamonds', 'dots', 'exercise', 'flights', 'fmri', 'gammas', 'geyser',
'iris', 'mpg', 'penguins', 'planets', 'taxis', 'tips', 'titanic']

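There are a handful of similar functions to load the "toy datasets" from scikit-learn. For example, we have load_wine() and load_diabetes(), which are defined in a similar fashion.

Larger datasets are also similar. We have fetch_california_housing(), for example, which needs to download the dataset from the internet (hence the "fetch" in the function name). The scikit-learn documentation calls these the "real-world datasets," but in fact the toy datasets are equally real.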

import sklearn.datasets

data = sklearn.datasets.fetch_california_housing(return_X_y=False, as_frame=True)
data = data["frame"]
print(data)

       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
0      8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23        4.526
1      8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22        3.585
2      7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24        3.521
3      5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25        3.413
4      3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25        3.422
...       ...       ...       ...        ...         ...       ...       ...        ...          ...
20635  1.5603      25.0  5.045455   1.133333       845.0  2.560606     39.48    -121.09        0.781
20636  2.5568      18.0  6.114035   1.315789       356.0  3.122807     39.49    -121.21        0.771
20637  1.7000      17.0  5.205543   1.120092      1007.0  2.325635     39.43    -121.22        0.923
20638  1.8672      18.0  5.329513   1.171920       741.0  2.123209     39.43    -121.32        0.847
20639  2.3886      16.0  5.254717   1.162264      1387.0  2.616981     39.37    -121.24        0.894

[20640 rows x 9 columns]

If we need more than these, scikit-learn provides a handy function to read any dataset from OpenML. For example,

import sklearn.datasets

data = sklearn.datasets.fetch_openml("diabetes", version=1, as_frame=True, return_X_y=False)
data = data["frame"]
print(data)

     preg   plas  pres  skin   insu  mass   pedi   age            class
0     6.0  148.0  72.0  35.0    0.0  33.6  0.627  50.0  tested_positive
1     1.0   85.0  66.0  29.0    0.0  26.6  0.351  31.0  tested_negative
2     8.0  183.0  64.0   0.0    0.0  23.3  0.672  32.0  tested_positive
3     1.0   89.0  66.0  23.0   94.0  28.1  0.167  21.0  tested_negative
4     0.0  137.0  40.0  35.0  168.0  43.1  2.288  33.0  tested_positive
..    ...    ...   ...   ...    ...   ...    ...   ...              ...
763  10.0  101.0  76.0  48.0  180.0  32.9  0.171  63.0  tested_negative
764   2.0  122.0  70.0  27.0    0.0  36.8  0.340  27.0  tested_negative
765   5.0  121.0  72.0  23.0  112.0  26.2  0.245  30.0  tested_negative
766   1.0  126.0  60.0   0.0    0.0  30.1  0.349  47.0  tested_positive
767   1.0   93.0  70.0  31.0    0.0  30.4  0.315  23.0  tested_negative

[768 rows x 9 columns]

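Sometimes, we should not use the name to identify a dataset in OpenML, as there may be multiple datasets with the same name. We can search for the data ID on OpenML and use it in the function as follows: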

import sklearn.datasets

data = sklearn.datasets.fetch_openml(data_id=42437, return_X_y=False, as_frame=True)
data = data["frame"]
print(data)

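The data ID in the code above refers to the Titanic dataset. We can extend the code as follows to show how to obtain the Titanic dataset and then run logistic regression on it: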

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import fetch_openml

X, y = fetch_openml(data_id=42437, return_X_y=True, as_frame=False)
clf = LogisticRegression(random_state=0).fit(X, y)
print(clf.score(X,y)) # accuracy
print(clf.coef_)      # coefficient in logistic regression

0.8114478114478114
[[-0.7551392   2.24013347 -0.20761281  0.28073571  0.24416706 -0.36699113
   0.4782924 ]]

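Retrieving Datasets in TensorFlow

Besides scikit-learn, TensorFlow is another tool that we can use for machine learning projects. For similar reasons, there is also a dataset API for TensorFlow that provides datasets in a format that works best with TensorFlow. Unlike scikit-learn, this API is not part of the standard TensorFlow package. You need to install it using the command: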

pip install tensorflow-datasets

The list of all datasets is available in the catalog:

www.tensorflow.org/datasets/catalog/overview#all_datasets

All datasets are identified by a name. The names can be found in the catalog above. You may also get a list of names using the following:

import tensorflow_datasets as tfds
print(tfds.list_builders())

This prints more than 1,000 names.

As an example, let's pick the MNIST handwritten digits dataset. We can download the data as follows:

import tensorflow_datasets as tfds
ds = tfds.load("mnist", split="train", shuffle_files=True)
print(ds)

This shows that tfds.load() gives us an object of type tensorflow.data.OptionsDataset:

<_OptionsDataset shapes: {image: (28, 28, 1), label: ()}, types: {image: tf.uint8, label: tf.int64}>

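In particular, this dataset holds the data instances (images) as arrays of shape (28,28,1), delivered as TensorFlow tensors that can be converted to NumPy arrays, and the targets (labels) are scalars.

With minor polishing, the data is ready for use in the Keras fit() function. An example is as follows: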

import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dense, AveragePooling2D, Dropout, Flatten
from tensorflow.keras.callbacks import EarlyStopping

# Read data with train-test split
ds_train, ds_test = tfds.load("mnist", split=['train', 'test'],
                              shuffle_files=True, as_supervised=True)

# Set up BatchDataset from the OptionsDataset object
ds_train = ds_train.batch(32)
ds_test = ds_test.batch(32)

# Build LeNet5 model and fit
model = Sequential([
    Conv2D(6, (5,5), input_shape=(28,28,1), padding="same", activation="tanh"),
    AveragePooling2D((2,2), strides=2),
    Conv2D(16, (5,5), activation="tanh"),
    AveragePooling2D((2,2), strides=2),
    Conv2D(120, (5,5), activation="tanh"),
    Flatten(),
    Dense(84, activation="tanh"),
    Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["sparse_categorical_accuracy"])
earlystopping = EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)
model.fit(ds_train, validation_data=ds_test, epochs=100, callbacks=[earlystopping])

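If we provide as_supervised=True, the dataset will be records of tuples (features, targets) instead of dictionaries. This is required for Keras. Moreover, to use the dataset in the fit() function, we need to create an iterable of batches. This is done by setting the batch size of the dataset, which converts it from an OptionsDataset object into a BatchDataset object.

We applied the LeNet5 model for the image classification. But since the target in the dataset is a numerical value (0 to 9) rather than a one-hot vector, we ask Keras to compare the softmax output vector against the integer label directly when computing accuracy and loss, by specifying sparse_categorical_accuracy and sparse_categorical_crossentropy in the compile() function.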
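The difference between the sparse and one-hot forms of the loss can be illustrated numerically. The following is a minimal NumPy sketch, not Keras code: the probabilities are hypothetical softmax outputs made up for illustration, and the two loss computations are written out by hand to show that integer labels and one-hot labels lead to the same value.

```python
import numpy as np

# Hypothetical softmax outputs for 3 samples over 4 classes (each row sums to 1)
probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.6],
])
labels = np.array([0, 1, 3])          # integer targets, as in the MNIST dataset

# Sparse form: pick the predicted probability of the true class by index
sparse_loss = -np.log(probs[np.arange(len(labels)), labels]).mean()

# One-hot form: expand the labels, then sum over classes
one_hot = np.eye(probs.shape[1])[labels]
dense_loss = -(one_hot * np.log(probs)).sum(axis=1).mean()

# Accuracy compares the most probable class with the integer label
accuracy = (probs.argmax(axis=1) == labels).mean()

print(sparse_loss, dense_loss, accuracy)
```

The sparse form simply avoids building the one-hot matrix, which is why it suits datasets whose labels are stored as integers.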

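The key here is to understand that every dataset has a different shape. When you use it with your TensorFlow model, you need to adapt the model to fit the dataset.

Generating Datasets in scikit-learn

In scikit-learn, there is a set of very useful functions to generate a dataset with particular properties. Because we can control the properties of the synthetic dataset, it is helpful for evaluating the performance of our models in specific situations that are not commonly seen in other datasets.

The scikit-learn documentation calls these functions the samples generators. They are easy to use; for example: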

from sklearn.datasets import make_circles
import matplotlib.pyplot as plt

data, target = make_circles(n_samples=500, shuffle=True, factor=0.7, noise=0.1)
plt.figure(figsize=(6,6))
plt.scatter(data[:,0], data[:,1], c=target, alpha=0.8, cmap="Set1")
plt.show()

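The make_circles() function generates coordinates of scattered points in a 2D plane, arranged in two classes positioned as concentric circles. We can control the size and overlap of the circles with the parameters factor and noise. This synthetic dataset is helpful for evaluating classification models such as a support vector machine since no linear separator is available.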
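The role of factor is easier to see once the noise is removed. The following is a minimal sketch, assuming scikit-learn and NumPy are installed: with noise left at None, every point lies exactly on one of the two circles, and the radius of each class can be measured directly.

```python
import numpy as np
from sklearn.datasets import make_circles

# noise=None puts every point exactly on one of the two circles
data, target = make_circles(n_samples=500, shuffle=True, factor=0.7,
                            noise=None, random_state=42)

radius = np.linalg.norm(data, axis=1)   # distance of each point from the origin
print(radius[target == 0].round(3).max())   # outer circle
print(radius[target == 1].round(3).max())   # inner circle, scaled by factor
```

Class 0 sits on the unit circle and class 1 on a circle whose radius equals factor, so a smaller factor separates the classes further while a larger noise blurs them together.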

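The output from make_circles() is always in two classes, and the coordinates are always in 2D. But some other functions can generate points in more classes or in higher dimensions, such as make_blobs(). In the example below, we generate a dataset in 3D with 4 classes: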

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

data, target = make_blobs(n_samples=500, n_features=3, centers=4,
                          shuffle=True, random_state=42, cluster_std=2.5)

fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(projection='3d')
ax.scatter(data[:,0], data[:,1], data[:,2], c=target, alpha=0.7, cmap="Set1")
plt.show()

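There are also some functions to generate a dataset for regression problems. For example, make_s_curve() and make_swiss_roll() will generate coordinates in 3D with targets as continuous values.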

from sklearn.datasets import make_s_curve, make_swiss_roll
import matplotlib.pyplot as plt

data, target = make_s_curve(n_samples=5000, random_state=42)

fig = plt.figure(figsize=(15,8))
ax = fig.add_subplot(121, projection='3d')
ax.scatter(data[:,0], data[:,1], data[:,2], c=target, alpha=0.7, cmap="viridis")

data, target = make_swiss_roll(n_samples=5000, random_state=42)
ax = fig.add_subplot(122, projection='3d')
ax.scatter(data[:,0], data[:,1], data[:,2], c=target, alpha=0.7, cmap="viridis")

plt.show()

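If we prefer not to look at the data from a geometric perspective, there are also make_classification() and make_regression(). Compared to the other functions, these two provide more control over the feature sets, such as introducing some redundant or irrelevant features.

Below is an example of using make_regression() to generate a dataset and run linear regression with it: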

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
import numpy as np

# Generate 10-dimensional features and 1-dimensional targets
X, y = make_regression(n_samples=500, n_features=10, n_targets=1, n_informative=4,
                       noise=0.5, bias=-2.5, random_state=42)

# Run linear regression on the data
reg = LinearRegression()
reg.fit(X, y)

# Print the coefficient and intercept found
with np.printoptions(precision=5, linewidth=100, suppress=True):
    print(np.array(reg.coef_))
    print(reg.intercept_)

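In the example above, we created 10-dimensional features, but only 4 of them are informative. Hence, from the result of the regression, we find that only 4 of the coefficients are significantly non-zero.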

[-0.00435 -0.02232 19.0113   0.04391 46.04906 -0.02882 -0.05692 28.61786 -0.01839 16.79397]
-2.5106367126731413
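One way to check the fitted coefficients against the generator is to ask make_regression() for the ground truth. The following is a minimal sketch, assuming scikit-learn is installed; it relies on the coef=True argument, which makes the function additionally return the coefficients of the underlying linear model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# coef=True additionally returns the coefficients of the underlying linear model
X, y, true_coef = make_regression(n_samples=500, n_features=10, n_targets=1,
                                  n_informative=4, noise=0.5, bias=-2.5,
                                  coef=True, random_state=42)

reg = LinearRegression().fit(X, y)

# Indices of the features that actually drive the target
informative = np.flatnonzero(true_coef)
print(informative)
print(np.abs(reg.coef_ - true_coef).max())   # largest estimation error
```

The true coefficients of the uninformative features are exactly zero, while the fitted ones are only close to zero because of the added noise.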

A similar example using make_classification() is as follows. A support vector machine classifier is used in this case:

from sklearn.datasets import make_classification
from sklearn.svm import SVC
import numpy as np

# Generate 10-dimensional features and 3-class targets
X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=4, n_redundant=2, n_repeated=1,
                           random_state=42)

# Run SVC on the data
clf = SVC(kernel="rbf")
clf.fit(X, y)

# Print the accuracy
print(clf.score(X, y))

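Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Repositories

UCI Machine Learning Repository
Kaggle
OpenML
Wikipedia, en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research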

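Summary

In this tutorial, you discovered various options for loading a common dataset or generating one in Python.

Specifically, you learned:

How to use the dataset APIs in scikit-learn, Seaborn, and TensorFlow to load common machine learning datasets
The small differences in the format of the datasets returned by different APIs and how to use them
How to generate a dataset using scikit-learn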

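A Guide to Obtaining Time Series Datasets in Python

Original: machinelearningmastery.com/a-guide-to-obtaining-time-series-datasets-in-python/

Datasets from real-world scenarios are important for building and testing machine learning models. You may just want some data to experiment with an algorithm. You may also want to evaluate your model by setting up a benchmark or determining its weaknesses using different datasets. Sometimes, you may also want to create synthetic datasets, where you can test your algorithms under controlled conditions by adding noise, correlations, or redundant information to the data.

In this post, we will illustrate how you can use Python to fetch some real-world time series data from different sources. We will also create synthetic time series data using Python's libraries.

After completing this tutorial, you will know:

How to use pandas_datareader
How to call a web data server's APIs using the requests library
How to generate synthetic time series data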

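Let's get started.

A Guide to Working with Datasets in Python (picture of sea waves and a bird)

Photo by Mehreen Saeed. Some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

Using pandas_datareader
Using the requests library to fetch data through a remote server's APIs
Generating synthetic time series data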

Loading Data Using pandas-datareader

This post depends on a few libraries. If you haven't installed them yet, you may install them using pip:

pip install pandas_datareader requests

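The pandas_datareader library allows you to fetch data from different sources, including Yahoo Finance for financial market data, the World Bank for global development data, and the St. Louis Fed for economic data. In this section, we will show how you can load data from different sources.

Behind the scenes, pandas_datareader pulls the data you want from the web in real time and assembles it into a pandas DataFrame. Because web pages differ vastly in structure, each data source needs a different reader. Hence, pandas_datareader only supports reading from a limited number of sources, mostly related to financial and economic time series.

Fetching data is simple. For example, we know that the stock ticker for Apple is AAPL, so we can get the daily historical prices of Apple stock from Yahoo Finance as follows: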

import pandas_datareader as pdr

# Reading Apple shares from yahoo finance server    
shares_df = pdr.DataReader('AAPL', 'yahoo', start='2021-01-01', end='2021-12-31')
# Look at the data read
print(shares_df)

The call to DataReader() takes the ticker as the first argument and the data source as the second. The above code prints the DataFrame:

                  High         Low        Open       Close       Volume   Adj Close
Date                                                                               
2021-01-04  133.610001  126.760002  133.520004  129.410004  143301900.0  128.453461
2021-01-05  131.740005  128.429993  128.889999  131.009995   97664900.0  130.041611
2021-01-06  131.050003  126.379997  127.720001  126.599998  155088000.0  125.664215
2021-01-07  131.630005  127.860001  128.360001  130.919998  109578200.0  129.952271
2021-01-08  132.630005  130.229996  132.429993  132.050003  105158200.0  131.073914
...                ...         ...         ...         ...          ...         ...
2021-12-27  180.419998  177.070007  177.089996  180.330002   74919600.0  180.100540
2021-12-28  181.330002  178.529999  180.160004  179.289993   79144300.0  179.061859
2021-12-29  180.630005  178.139999  179.330002  179.380005   62348900.0  179.151749
2021-12-30  180.570007  178.089996  179.470001  178.199997   59773000.0  177.973251
2021-12-31  179.229996  177.259995  178.089996  177.570007   64062300.0  177.344055

[252 rows x 6 columns]

We may also fetch the stock price history of multiple companies by passing the tickers in a list:

companies = ['AAPL', 'MSFT', 'GE']
shares_multiple_df = pdr.DataReader(companies, 'yahoo', start='2021-01-01', end='2021-12-31')
print(shares_multiple_df.head())

The result is a DataFrame with multi-level columns:

Attributes   Adj Close                              Close              \
Symbols           AAPL        MSFT         GE        AAPL        MSFT   
Date                                                                    
2021-01-04  128.453461  215.434982  83.421600  129.410004  217.690002   
2021-01-05  130.041611  215.642776  85.811905  131.009995  217.899994   
2021-01-06  125.664223  210.051315  90.512833  126.599998  212.250000   
2021-01-07  129.952286  216.028732  89.795753  130.919998  218.289993   
2021-01-08  131.073944  217.344986  90.353485  132.050003  219.619995   

...

Attributes       Volume                          
Symbols            AAPL        MSFT          GE  
Date                                             
2021-01-04  143301900.0  37130100.0   9993688.0  
2021-01-05   97664900.0  23823000.0  10462538.0  
2021-01-06  155088000.0  35930700.0  16448075.0  
2021-01-07  109578200.0  27694500.0   9411225.0  
2021-01-08  105158200.0  22956200.0   9089963.0

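Because of the structure of DataFrames, it is convenient to extract part of the data. For example, we can plot only the daily close price on some dates using the following: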

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

# General routine for plotting time series data
def plot_timeseries_df(df, attrib, ticker_loc=1, title='Timeseries', 
                       legend=''):
    fig = plt.figure(figsize=(15,7))
    plt.plot(df[attrib], 'o-')
    _ = plt.xticks(rotation=90)
    plt.gca().xaxis.set_major_locator(ticker.MultipleLocator(ticker_loc))
    plt.title(title)
    plt.gca().legend(legend)
    plt.show()

plot_timeseries_df(shares_multiple_df.loc["2021-04-01":"2021-06-30"], "Close",
                   ticker_loc=3, title="Close price", legend=companies)

Multiple shares fetched from Yahoo Finance
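The selection used in the plot can be tried offline on a small stand-in DataFrame. The following is a minimal sketch, assuming pandas and NumPy are installed; the numbers are synthetic placeholders rather than real market data, and only the two-level column layout (Attributes, Symbols) mirrors the DataFrame shown earlier.

```python
import numpy as np
import pandas as pd

# Synthetic placeholder prices, NOT real market data
dates = pd.date_range("2021-01-04", periods=5, freq="B")
columns = pd.MultiIndex.from_product(
    [["Close", "Volume"], ["AAPL", "MSFT", "GE"]],
    names=["Attributes", "Symbols"])
values = np.arange(30, dtype=float).reshape(5, 6)
df = pd.DataFrame(values, index=dates, columns=columns)

close = df["Close"]                        # first level: one column per symbol
aapl_close = df[("Close", "AAPL")]         # both levels: a single Series
subset = df.loc["2021-01-05":"2021-01-06", "Close"]   # date slice plus attribute
print(close.columns.tolist())
print(subset.shape)
```

Selecting by the first column level returns one column per ticker, which is why a single plot() call on df["Close"] draws one line per company.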

The complete code is as follows:

import pandas_datareader as pdr
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

companies = ['AAPL', 'MSFT', 'GE']
shares_multiple_df = pdr.DataReader(companies, 'yahoo', start='2021-01-01', end='2021-12-31')
print(shares_multiple_df)

def plot_timeseries_df(df, attrib, ticker_loc=1, title='Timeseries', legend=''):
    "General routine for plotting time series data"
    fig = plt.figure(figsize=(15,7))
    plt.plot(df[attrib], 'o-')
    _ = plt.xticks(rotation=90)
    plt.gca().xaxis.set_major_locator(ticker.MultipleLocator(ticker_loc))
    plt.title(title)
    plt.gca().legend(legend)
    plt.show()

plot_timeseries_df(shares_multiple_df.loc["2021-04-01":"2021-06-30"], "Close",
                   ticker_loc=3, title="Close price", legend=companies)

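The syntax for reading from another data source using pandas-datareader is similar. For example, we can read an economic time series from the Federal Reserve Economic Data (FRED). Every time series in FRED is identified by a symbol. For example, the consumer price index for all urban consumers is CPIAUCSL, the consumer price index for all items less food and energy is CPILFESL, and personal consumption expenditure is PCE. You can search and look up the symbols on FRED's web page.

Below is how we can obtain two consumer price indices, CPIAUCSL and CPILFESL, and show them in a plot: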

import pandas_datareader as pdr
import matplotlib.pyplot as plt

# Read data from FRED and print
fred_df = pdr.DataReader(['CPIAUCSL','CPILFESL'], 'fred', "2010-01-01", "2021-12-31")
print(fred_df)

# Show in plot the data of 2019-2021
fig = plt.figure(figsize=(15,7))
plt.plot(fred_df.loc["2019":], 'o-')
plt.xticks(rotation=90)
plt.legend(fred_df.columns)
plt.title("Consumer Price Index")
plt.show()

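Plot of the consumer price index

Obtaining data from the World Bank is also similar, but we have to understand that the data from the World Bank is more complicated. Usually, a data series, such as population, is presented as a time series and also has a country dimension. Therefore, we need to specify more parameters to obtain the data.

Using pandas_datareader, we have a specific set of APIs for the World Bank. The symbol for an indicator can be looked up from World Bank Open Data or searched for using the following: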

from pandas_datareader import wb

matches = wb.search('total.*population')
print(matches[["id","name"]])

The search() function accepts a regular expression string (e.g., .* above means a string of any length). This will print:

                               id                                               name
24     1.1_ACCESS.ELECTRICITY.TOT      Access to electricity (% of total population)
164            2.1_ACCESS.CFT.TOT  Access to Clean Fuels and Technologies for coo...
1999              CC.AVPB.PTPI.AI  Additional people below $1.90 as % of total po...
2000              CC.AVPB.PTPI.AR  Additional people below $1.90 as % of total po...
2001              CC.AVPB.PTPI.DI  Additional people below $1.90 as % of total po...
...                           ...                                                ...
13908           SP.POP.TOTL.FE.ZS         Population, female (% of total population)
13912           SP.POP.TOTL.MA.ZS           Population, male (% of total population)
13938              SP.RUR.TOTL.ZS           Rural population (% of total population)
13958           SP.URB.TOTL.IN.ZS           Urban population (% of total population)
13960              SP.URB.TOTL.ZS  Percentage of Population in Urban Areas (in % ...

[137 rows x 2 columns]

where the id column is the symbol for the time series.
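How such a pattern selects indicator names can be tried with Python's re module. The following is a minimal sketch under the assumption that the search is case-insensitive and matches anywhere in the name; the first three names are taken from the output above, and the last one is added only as an illustrative non-match.

```python
import re

# A few indicator names; the last one is an illustrative non-match
names = [
    "Access to electricity (% of total population)",
    "Population, female (% of total population)",
    "Rural population (% of total population)",
    "GDP per capita (current US$)",
]

pattern = re.compile("total.*population", re.IGNORECASE)

# re.search finds the pattern anywhere in the string, not just at the start
matches = [name for name in names if pattern.search(name)]
print(matches)
```

A narrower pattern, such as one anchored with ^ at the start of the name, would cut the list of matches down further.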

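We can read data for specific countries by specifying the ISO-3166-1 country code. But the World Bank also contains non-country aggregates (e.g., South Asia), so while pandas_datareader allows us to use the string "all" for all countries, usually we do not want to use it. Below is how we can get a list of all countries and aggregates from the World Bank: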

import pandas_datareader.wb as wb

countries = wb.get_countries()
print(countries)

    iso3c iso2c                 name               region          adminregion          incomeLevel     lendingType capitalCity  longitude  latitude
0     ABW    AW                Aruba  Latin America & ...                               High income  Not classified  Oranjestad   -70.0167   12.5167
1     AFE    ZH  Africa Eastern a...           Aggregates                                Aggregates      Aggregates                    NaN       NaN
2     AFG    AF          Afghanistan           South Asia           South Asia           Low income             IDA       Kabul    69.1761   34.5228
3     AFR    A9               Africa           Aggregates                                Aggregates      Aggregates                    NaN       NaN
4     AFW    ZI  Africa Western a...           Aggregates                                Aggregates      Aggregates                    NaN       NaN
..    ...   ...                  ...                  ...                  ...                  ...             ...         ...        ...       ...
294   XZN    A5  Sub-Saharan Afri...           Aggregates                                Aggregates      Aggregates                    NaN       NaN
295   YEM    YE          Yemen, Rep.  Middle East & No...  Middle East & No...           Low income             IDA      Sana'a    44.2075   15.3520
296   ZAF    ZA         South Africa  Sub-Saharan Africa   Sub-Saharan Afri...  Upper middle income            IBRD    Pretoria    28.1871  -25.7460
297   ZMB    ZM               Zambia  Sub-Saharan Africa   Sub-Saharan Afri...  Lower middle income             IDA      Lusaka    28.2937  -15.3982
298   ZWE    ZW             Zimbabwe  Sub-Saharan Africa   Sub-Saharan Afri...  Lower middle income           Blend      Harare    31.0672  -17.8312

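Below is how we can get the population of all countries in 2020 and show the top 25 countries in a bar chart. Certainly, we can also get the population data across years by specifying a different start and end year: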

import pandas_datareader.wb as wb
import pandas as pd
import matplotlib.pyplot as plt

# Get a list of 2-letter country code excluding aggregates
countries = wb.get_countries()
countries = list(countries[countries.region != "Aggregates"]["iso2c"])

# Read countries' total population data (SP.POP.TOTL) in year 2020
population_df = wb.download(indicator="SP.POP.TOTL", country=countries, start=2020, end=2020)

# Sort by population, then take top 25 countries, and make the index (i.e., countries) as a column
population_df = (population_df.dropna()
                              .sort_values("SP.POP.TOTL")
                              .iloc[-25:]
                              .reset_index())

# Plot the population, in millions
fig = plt.figure(figsize=(15,7))
plt.bar(population_df["country"], population_df["SP.POP.TOTL"]/1e6)
plt.xticks(rotation=90)
plt.ylabel("Million Population")
plt.title("Population")
plt.show()

Bar chart of the total population of different countries

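Fetching Data Using Web APIs

Sometimes, you have the option to fetch data directly from a web data server without any authentication. This can be done in Python using the standard library urllib.request, or you may use the requests library for an easier interface.

The World Bank is an example where web APIs are freely available, so we can easily read data in different formats, such as JSON, XML, or plain text. The World Bank data repository API page describes the various APIs and their respective parameters. To repeat what we did in the previous example without using pandas_datareader, we first construct a URL to read a list of all countries so we can find the country codes that are not aggregates. Then, we can construct a query URL with the following arguments:

country argument with value = all
indicator argument with value = SP.POP.TOTL
date argument with value = 2020
format argument with value = json

Certainly, you can experiment with different indicators. By default, the World Bank returns 50 items per page, and we would need to query page after page to exhaust the data. We can enlarge the page size to get all the data in one shot. Below is how we get the list of countries in JSON format and collect the country codes: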

import requests

# Create query URL for list of countries, by default only 50 entries returned per page
url = "http://api.worldbank.org/v2/country/all?format=json&per_page=500"
response = requests.get(url)
# Expects HTTP status code 200 for correct query
print(response.status_code)
# Get the response in JSON
header, data = response.json()
print(header)
# Collect a list of 3-letter country code excluding aggregates
countries = [item["id"]
             for item in data
             if item["region"]["value"] != "Aggregates"]
print(countries)

It will print the HTTP status code, the header, and the list of country codes as follows:

200
{'page': 1, 'pages': 1, 'per_page': '500', 'total': 299}
['ABW', 'AFG', 'AGO', 'ALB', ..., 'YEM', 'ZAF', 'ZMB', 'ZWE']

From the header, we can verify that we exhausted the data (page 1 out of 1). Then we can get all the population data as follows:

...

# Create query URL for total population from all countries in 2020
arguments = {
    "country": "all",
    "indicator": "SP.POP.TOTL",
    "date": "2020:2020",
    "format": "json"
}
url = "http://api.worldbank.org/v2/country/{country}/" \
      "indicator/{indicator}?date={date}&format={format}&per_page=500"
query_population = url.format(**arguments)
response = requests.get(query_population)
# Get the response in JSON
header, population_data = response.json()

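You should check the World Bank API documentation for details on how to construct the URL. For example, the date syntax 2020:2021 means the start and end years, and an extra parameter page=3 gives you the third page of a multi-page result. With the data fetched, we can filter for only the non-aggregate countries, make it into a pandas DataFrame for sorting, and then plot the bar chart: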

...

# Filter for countries, not aggregates
population = []
for item in population_data:
    if item["countryiso3code"] in countries:
        name = item["country"]["value"]
        population.append({"country":name, "population": item["value"]})
# Create DataFrame for sorting and filtering
population = pd.DataFrame.from_dict(population)
population = population.dropna().sort_values("population").iloc[-25:]
# Plot bar chart
fig = plt.figure(figsize=(15,7))
plt.bar(population["country"], population["population"]/1e6)
plt.xticks(rotation=90)
plt.ylabel("Million Population")
plt.title("Population")
plt.show()

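The figure should be precisely the same as before. But as you can see, using pandas_datareader helps make the code more concise by hiding the low-level operations.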
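When enlarging the page size is not an option, the page-by-page querying mentioned earlier can be written as a loop over the page parameter. The following is a minimal sketch using only the standard library; build_url(), fetch_all_pages(), and fake_fetch() are hypothetical helpers introduced here for illustration, and fake_fetch() serves made-up pages so the sketch runs without network access.

```python
from urllib.parse import urlencode, urlparse, parse_qs

BASE = "http://api.worldbank.org/v2/country/{country}/indicator/{indicator}"

def build_url(country, indicator, **params):
    """Build a query URL; keyword arguments become the query string."""
    query = urlencode(params, safe=":")   # keep the colon in dates like 2020:2021
    return BASE.format(country=country, indicator=indicator) + "?" + query

def fetch_all_pages(fetch_json, country, indicator, **params):
    """Collect records from all pages; fetch_json(url) returns [header, data]."""
    records, page = [], 1
    while True:
        header, data = fetch_json(build_url(country, indicator, page=page, **params))
        records.extend(data or [])
        if page >= header["pages"]:
            return records
        page += 1

def fake_fetch(url):
    """Offline stand-in for requests.get(url).json(), serving two made-up pages."""
    page = int(parse_qs(urlparse(url).query)["page"][0])
    pages = {1: [{"value": 1}, {"value": 2}], 2: [{"value": 3}]}
    return [{"page": page, "pages": 2}, pages[page]]

url_example = build_url("all", "SP.POP.TOTL", date="2020:2021", format="json", page=3)
all_records = fetch_all_pages(fake_fetch, "all", "SP.POP.TOTL", date=2020, format="json")
print(url_example)
print(all_records)
```

With the real service, the same loop could be driven by passing lambda url: requests.get(url).json() as fetch_json, since the response is a two-element list of header and data as seen above.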

Putting everything together, the following is the complete code:

import pandas as pd
import matplotlib.pyplot as plt
import requests

# Create query URL for list of countries, by default only 50 entries returned per page
url = "http://api.worldbank.org/v2/country/all?format=json&per_page=500"
response = requests.get(url)
# Expects HTTP status code 200 for correct query
print(response.status_code)
# Get the response in JSON
header, data = response.json()
print(header)
# Collect a list of 3-letter country code excluding aggregates
countries = [item["id"]
             for item in data
             if item["region"]["value"] != "Aggregates"]
print(countries)

# Create query URL for total population from all countries in 2020
arguments = {
    "country": "all",
    "indicator": "SP.POP.TOTL",
    "date": 2020,
    "format": "json"
}
url = "http://api.worldbank.org/v2/country/{country}/" \
      "indicator/{indicator}?date={date}&format={format}&per_page=500"
query_population = url.format(**arguments)
response = requests.get(query_population)
print(response.status_code)
# Get the response in JSON
header, population_data = response.json()
print(header)

# Filter for countries, not aggregates
population = []
for item in population_data:
    if item["countryiso3code"] in countries:
        name = item["country"]["value"]
        population.append({"country":name, "population": item["value"]})
# Create DataFrame for sorting and filtering
population = pd.DataFrame.from_dict(population)
population = population.dropna().sort_values("population").iloc[-25:]
# Plot bar chart
fig = plt.figure(figsize=(15,7))
plt.bar(population["country"], population["population"]/1e6)
plt.xticks(rotation=90)
plt.ylabel("Million Population")
plt.title("Population")
plt.show()

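Creating Synthetic Data Using NumPy

Sometimes, we may not want to use real-world data for our project because we need something specific that may not happen in reality. One particular example is testing a model with ideal time series data. In this section, we will see how we can create synthetic autoregressive (AR) time series data.

The numpy.random library can be used to create random samples from different distributions. The randn() method generates data from a standard normal distribution with zero mean and unit variance.

In the AR($n$) model, the value $x_t$ at time step $t$ depends on the values at the previous $n$ time steps. That is,

$$x_t = b_1 x_{t-1} + b_2 x_{t-2} + \dots + b_n x_{t-n} + e_t$$

with model parameters $b_i$ as the coefficients of the different lags of $x_t$, and the error term $e_t$ is expected to follow a normal distribution.

Understanding the formula, we can generate an AR(3) time series in the example below. We first use randn() to generate the first 3 values of the series and then iteratively apply the above formula to generate the next data point. Then, an error term is added using the randn() function again, scaled by the predefined noise_level: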

import numpy as np
import matplotlib.pyplot as plt

# Predefined parameters
ar_n = 3                     # Order of the AR(n) data
ar_coeff = [0.7, -0.3, -0.1] # Coefficients b_3, b_2, b_1
noise_level = 0.1            # Noise added to the AR(n) data
length = 200                 # Number of data points to generate

# Random initial values
ar_data = list(np.random.randn(ar_n))

# Generate the rest of the values
for i in range(length - ar_n):
    next_val = (np.array(ar_coeff) @ np.array(ar_data[-ar_n:])) + np.random.randn() * noise_level
    ar_data.append(next_val)

# Plot the time series
fig = plt.figure(figsize=(12,5))
plt.plot(ar_data)
plt.show()

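The code above will create a plot of the generated series.

But we can further add the time axis by first converting the data into a pandas DataFrame and then adding the time as an index: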

...

# Convert the data into a pandas DataFrame
synthetic = pd.DataFrame({"AR(3)": ar_data})
synthetic.index = pd.date_range(start="2021-07-01", periods=len(ar_data), freq="D")

# Plot the time series
fig = plt.figure(figsize=(12,5))
plt.plot(synthetic.index, synthetic)
plt.xticks(rotation=90)
plt.title("AR(3) time series")
plt.show()

After which we will instead have the following plot:

Plot of the synthetic time series

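Using similar techniques, we can also generate pure random noise (i.e., an AR(0) series), an ARIMA time series (i.e., with coefficients on the error terms), or a Brownian motion time series (i.e., a running sum of random noise).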
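Two of these variants take only a line each. The following is a minimal sketch, assuming NumPy is installed; it uses a seeded default_rng() generator rather than randn() so that the result is reproducible, which is a choice made here and not part of the code above.

```python
import numpy as np

rng = np.random.default_rng(42)   # seeded generator for reproducibility
length = 200

# Pure random noise, i.e., an AR(0) series with no dependence on the past
white_noise = rng.standard_normal(length)

# Brownian motion: each value is the running sum of all noise so far
brownian = np.cumsum(white_noise)

print(white_noise[:3])
print(brownian[:3])
```

Differencing the Brownian series with np.diff() recovers the noise, which is a quick way to confirm the relationship between the two.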

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

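Books

Think Python: How to Think Like a Computer Scientist, by Allen B. Downey
Programming in Python 3: A Complete Introduction to the Python Language, by Mark Summerfield
Python for Data Analysis, 2nd edition, by Wes McKinney

Summary

In this tutorial, you discovered various options for fetching data or generating synthetic time series data in Python.

Specifically, you learned:

How to use pandas_datareader to fetch financial data from different data sources
How to call APIs to fetch data from different web servers using the requests library
How to generate synthetic time series data using NumPy's random number generator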

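Building Your Mini ChatGPT at Home

Original: machinelearningmastery.com/building-your-mini-chatgpt-at-home/

ChatGPT is fun to play with. Chances are, you also want to have your own copy running privately. Realistically, that is impossible because ChatGPT is not software available for download, and it needs tremendous computing power to run. But you can build a trimmed-down version that can run on commodity hardware. In this post, you will learn about:

Language models that can behave like ChatGPT
How to build a chatbot using advanced language models

Building Your Mini ChatGPT at Home

Picture generated by the author using Stable Diffusion. Some rights reserved.

Let's get started.

Overview

This post is divided into three parts; they are:

What are instruction-following models?
How to find instruction-following models
Building a simple chatbot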

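What Are Instruction-Following Models?

Language models are machine learning models that can predict the probability of a word based on the preceding words of a sentence. If we ask the model for the next word and feed it back to the model to ask for more, the model is doing text generation.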
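The feed-back loop can be shown with a toy stand-in for a language model. The following is a minimal sketch in plain Python: NEXT_WORD_PROBS is a hypothetical lookup table invented for illustration, and the loop always picks the most probable next word, which is only one of several possible decoding strategies.

```python
# Toy next-word table standing in for a language model (hypothetical values)
NEXT_WORD_PROBS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"down": 1.0},
}

def generate(start, max_words=5):
    """Repeatedly append the most probable next word to the text so far."""
    words = [start]
    for _ in range(max_words):
        probs = NEXT_WORD_PROBS.get(words[-1])
        if not probs:          # no known continuation: stop generating
            break
        words.append(max(probs, key=probs.get))
    return " ".join(words)

print(generate("the"))   # the cat sat down
```

A real language model conditions on the whole preceding text rather than only the last word, but the generate-and-feed-back structure is the same.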

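Text generation models are the idea behind many large language models such as GPT3. Instruction-following models, however, are fine-tuned text generation models that know about dialog and instructions. They operate like a conversation between two people: when one finishes a sentence, the other responds accordingly.

Therefore, a text generation model can help you finish a paragraph given a leading sentence. But an instruction-following model can answer your questions or respond as requested.

This does not mean you cannot use a text generation model to build a chatbot. But you should get a better quality result with an instruction-following model, which is fine-tuned for such use.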

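How to Find Instruction-Following Models

You may find a lot of instruction-following models nowadays. But to build a chatbot, you need something you can easily work with.

One handy repository that you can search is Hugging Face. The models there are supposed to be used with the transformers library from Hugging Face. This is helpful because different models may work slightly differently. It would be tedious to make your Python code support multiple models, but the transformers library unifies them and hides those differences from your code.

Usually, instruction-following models carry the keyword "instruct" in the model name. Searching with this keyword on Hugging Face can give you more than a thousand models. But not all of them will work. You need to check out each one and read its model card to understand what the model can do in order to pick the most suitable one.

There are several technical criteria for picking your model:

What the model was trained on: Specifically, this means which language the model can speak. A model trained on English text from novels is probably not helpful for a German chatbot for physics.
What deep learning library it uses: Usually, models on Hugging Face are built with TensorFlow, PyTorch, or Flax. Not all models have a version for every library. You need to make sure you have that specific library installed before you can run a model with transformers.
What resources the model needs: The model can be enormous. Often it requires a GPU to run. But some models need a very high-end GPU or even multiple high-end GPUs. You need to verify that your resources can support the model inference.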

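Building a Simple Chatbot

Let's build a simple chatbot. The chatbot is just a program that runs on the command line, which takes one line of text as input from the user and responds with one line of text generated by the language model.

The model chosen for this task is falcon-7b-instruct. It is a 7-billion-parameter model. Because it was designed to run in bfloat16 floating point for best performance, you may need a modern GPU such as the nVidia RTX 3000 series. Using the GPU resources on Google Colab or a suitable EC2 instance on AWS are also options.

To build a chatbot in Python, the overall loop is as simple as the following: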

while True:
    user_input = input("> ")
    # Placeholder: echo the input until a language model supplies the response
    response = user_input
    print(response)

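The input("> ") function takes one line of input from the user. You will see the string "> " on the screen prompting for your input. The input is captured once you press Enter.

The remaining question is how to get the response. In an LLM, you provide your input, or prompt, as a sequence of token IDs (integers), and it responds with another sequence of token IDs. You should convert between the sequence of integers and the text string before and after interacting with the LLM. The token IDs are specific to each model; that is, the same integer means a different word in a different model.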
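The conversion in both directions, and the fact that IDs are model-specific, can be shown with two made-up vocabularies. The following is a minimal sketch in plain Python: vocab_a, vocab_b, encode(), and decode() are hypothetical names used for illustration, and splitting on whitespace is a simplification of what real tokenizers do.

```python
# Two made-up vocabularies standing in for two different models
vocab_a = {"hello": 0, "world": 1, "!": 2}
vocab_b = {"!": 0, "hello": 1, "world": 2}

def encode(text, vocab):
    """Map whitespace-separated tokens to integer IDs."""
    return [vocab[token] for token in text.split()]

def decode(ids, vocab):
    """Map integer IDs back to text using the same vocabulary."""
    id_to_token = {i: token for token, i in vocab.items()}
    return " ".join(id_to_token[i] for i in ids)

ids = encode("hello world !", vocab_a)
print(ids)                      # [0, 1, 2]
print(decode(ids, vocab_a))     # hello world !
print(decode(ids, vocab_b))     # ! hello world
```

Decoding with the wrong vocabulary scrambles the text, which is why the tokenizer must always be the one that belongs to the model in use.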

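The Hugging Face library transformers is designed to make these steps easier. All you need is to create a pipeline and specify the model name and a few other parameters. Setting up a pipeline with the model name tiiuae/falcon-7b-instruct, with bfloat16 floating point, and allowing the model to use a GPU if available looks like the following: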

from transformers import AutoTokenizer, pipeline
import torch

model = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

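The pipeline is created as "text-generation" because this is the way the model card suggests working with this model. A pipeline in transformers is a sequence of steps for a specific task, and text generation is one of these tasks.

To use the pipeline, you need to specify a few more parameters for generating text. Recall that the model does not generate text directly but the probabilities of tokens. You have to determine the next token from these probabilities and repeat the process to generate more. Usually, this process introduces some variation by sampling from the probability distribution rather than always picking the single token with the highest probability.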
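
The sampling step described above can be sketched in plain Python. This is only a minimal illustration of top-k sampling, not how the transformers library implements it; the token IDs and probabilities are hypothetical.

```python
import random

def top_k_sample(probs, k, rng=random):
    """Sample one token ID from the k most probable entries of `probs`.

    probs: dict mapping token ID -> probability
    """
    # Keep only the k tokens with the highest probability
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [p for _, p in top]
    # random.choices treats the weights as relative, so they need not sum to 1
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution over five token IDs
probs = {10: 0.50, 11: 0.30, 12: 0.15, 13: 0.04, 14: 0.01}
rng = random.Random(42)
samples = [top_k_sample(probs, k=3, rng=rng) for _ in range(1000)]
print({token: samples.count(token) for token in sorted(set(samples))})
```

With k=3, the two least likely tokens are never produced, while the remaining three still appear in proportion to their probabilities.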

Below is how you would use the pipeline:

newline_token = tokenizer.encode("\n")[0]    # 193
sequences = pipeline(
    prompt,
    max_length=500,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    return_full_text=False,
    eos_token_id=newline_token,
    pad_token_id=tokenizer.eos_token_id,
)

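You provide the prompt in the variable prompt to generate the output sequences. You can ask the model for several options, but here you set num_return_sequences=1, so there is only one. You also let the model generate text by sampling, but only from the 10 highest-probability tokens (top_k=10). The returned sequence will not contain your prompt since you set return_full_text=False. The most important parameters are eos_token_id=newline_token and pad_token_id=tokenizer.eos_token_id. They let the model generate text continuously, but only until a newline character. The token ID of the newline character is 193, as obtained from the first line of the code snippet.

The returned sequences is a list of dictionaries (a list of one dictionary in this case). Each dictionary contains the generated text as a string, which you can print as follows: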

print(sequences[0]["generated_text"])

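A language model is memoryless. It does not remember how many times you have used it or the prompts you used before. Every call is new, so you need to provide the history of the previous dialog to the model. This is easy to do. But because this is an instruction-following model that knows how to process a dialog, you need to identify who said what in the prompt. Let's assume it is a dialog between Alice and Bob (or any names). You prefix each sentence in the prompt with the speaker's name, like the following: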

Alice: What is relativity?
Bob:

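Then the model should generate text that matches the dialog. Once you obtain the response from the model, append it to the prompt together with the next text from Alice, and send it to the model again. Putting everything together, below is a simple chatbot: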

from transformers import AutoTokenizer, pipeline
import torch

model = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
newline_token = tokenizer.encode("\n")[0]
my_name = "Alice"
your_name = "Bob"
dialog = []

while True:
    user_input = input("> ")
    dialog.append(f"{my_name}: {user_input}")
    prompt = "\n".join(dialog) + f"\n{your_name}: "
    sequences = pipeline(
        prompt,
        max_length=500,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        return_full_text=False,
        eos_token_id=newline_token,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(sequences[0]['generated_text'])
    dialog.append(f"{your_name}: " + sequences[0]['generated_text'])

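Notice how the dialog variable is updated in each iteration to keep track of the dialog, and how it is used to set the variable prompt for the next run of the pipeline.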
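
The prompt bookkeeping can be tried out without loading any model. The sketch below isolates it in a helper named build_prompt, which is a name introduced here for illustration; the reply from Bob is made up rather than generated.

```python
def build_prompt(dialog, responder):
    """Join the dialog history and leave an open line for the responder."""
    return "\n".join(dialog) + f"\n{responder}: "

dialog = ["Alice: What is relativity?"]
print(repr(build_prompt(dialog, "Bob")))

# After the model replies, both sides of the exchange stay in the history
dialog.append("Bob: It is a theory of space and time.")  # made-up reply
dialog.append("Alice: Who proposed it?")
print(build_prompt(dialog, "Bob"))
```

Each prompt ends with the responder's name and a colon, so the most natural continuation for the model is that speaker's next line.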

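When you ask the chatbot "What is relativity", the answer does not sound very knowledgeable. This is where some prompt engineering helps. You can make Bob a physics professor so that he gives a more detailed answer on this topic. That is the magic of LLMs: you can adjust the response with a simple change to the prompt. All you need is to add a description before the dialog starts. The updated code is as follows (note that dialog is now initialized with a persona description):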

from transformers import AutoTokenizer, pipeline
import torch

model = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
newline_token = tokenizer.encode("\n")[0]
my_name = "Alice"
your_name = "Bob"
dialog = ["Bob is a professor in Physics."]

while True:
    user_input = input("> ")
    dialog.append(f"{my_name}: {user_input}")
    prompt = "\n".join(dialog) + f"\n{your_name}: "
    sequences = pipeline(
        prompt,
        max_length=500,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        return_full_text=False,
        eos_token_id=newline_token,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(sequences[0]['generated_text'])
    dialog.append(f"{your_name}: " + sequences[0]['generated_text'])

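This chatbot may be slow if you do not have powerful enough hardware. You may not see exactly the same result, but the following is an example dialog from the above code.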

> What is Newtonian mechanics?
"Newtonian mechanics" refers to the classical mechanics developed by Sir Isaac Newton in the 17th century. It is a mathematical description of the laws of motion and how objects respond to forces."A: What is the law of inertia?

> How about Lagrangian mechanics?
"Lagrangian mechanics" is an extension of Newtonian mechanics which includes the concept of a "Lagrangian function". This function relates the motion of a system to a set of variables which can be freely chosen. It is commonly used in the analysis of systems that cannot be reduced to the simpler forms of Newtonian mechanics."A: What's the principle of inertia?"

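The chatbot will run until you press Ctrl-C to stop it or until the prompt grows beyond the maximum length (max_length=500) given to the pipeline. The maximum length is counted in tokens rather than words, and it covers the prompt together with the generated text. Your prompt must not exceed it. A higher maximum length makes the model run slower, and every model has a limit on how large this can be set. The falcon-7b-instruct model allows this to be set only up to 2048, whereas ChatGPT allows 4096.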
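
One simple way to keep a long conversation within the limit is to forget the oldest lines first. The sketch below is only an illustration of that idea: trim_dialog is a hypothetical helper, and it counts whitespace-separated words as a stand-in for tokens. With a real model, you would count tokens with the model's own tokenizer.

```python
def trim_dialog(dialog, max_tokens, count_tokens=lambda s: len(s.split())):
    """Drop the oldest lines until the dialog fits within max_tokens.

    count_tokens is a stand-in that counts whitespace-separated words; with a
    real model you would count tokens using the model's tokenizer instead.
    """
    trimmed = list(dialog)
    while len(trimmed) > 1 and sum(count_tokens(s) for s in trimmed) > max_tokens:
        trimmed.pop(0)  # forget the oldest line first
    return trimmed

history = [
    "Alice: What is Newtonian mechanics?",
    "Bob: It is the classical mechanics developed by Isaac Newton.",
    "Alice: How about Lagrangian mechanics?",
]
print(trim_dialog(history, max_tokens=16))
```

If the dialog begins with a persona description, you would probably want to keep that first line and drop the oldest exchanges after it instead.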

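You may also notice that the output quality is not perfect. Partly this is because you made no attempt to polish the model's response before sending it back to the user, and partly because the model chosen is a 7-billion-parameter model, the smallest in its family. Usually you will see better results with a larger model, but it would also require more resources to run.

Further Reading

Below is a paper that may help you understand instruction-following models better:

Ouyang et al., "Training language models to follow instructions with human feedback" (2022)

Summary

In this post, you learned how to create a chatbot using a large language model from the Hugging Face library. Specifically, you learned:

A language model that can hold a conversation is called an instruction-following model
How to find such models on Hugging Face
How to use the models with the transformers library and build a chatbot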

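Command Line Arguments for Your Python Script

Original: machinelearningmastery.com/command-line-arguments-for-your-python-script/

Working on a machine learning project means we need to experiment. Having a way to configure your script easily will help you move faster. In Python, we have a way to adapt the code from the command line. In this tutorial, we are going to see how we can use the command line arguments of a Python script to help you work more effectively on your machine learning project.

After finishing this tutorial, you will learn:

Why we would like to control a Python script from the command line
How we can work efficiently on the command line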

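Let's get started!

Command Line Arguments for Your Python Script. Photo by insung yoon. Some rights reserved.

Overview

This tutorial is in three parts; they are:

Running a Python script in the command line
Working on the command line
Alternative to command line arguments

Running a Python Script in the Command Line

There are many ways to run a Python script. Someone may run it as part of a Jupyter notebook. Someone may run it in an IDE. But on all platforms, it is always possible to run a Python script from the command line. In Windows, you have the Command Prompt or PowerShell (or, even better, Windows Terminal). In macOS or Linux, you have the Terminal or xterm. Running a Python script from the command line is powerful because you can pass additional parameters to the script.

The following script allows us to pass a value from the command line into Python: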

import sys

n = int(sys.argv[1])
print(n+1)

We save these few lines into a file and run it on the command line with an argument:

Shell

$ python commandline.py 15
16

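Then you will see that it takes our argument, converts it into an integer, adds one to it, and prints the result. The list sys.argv contains the name of our script and all the arguments (all as strings), which in the above case is ["commandline.py", "15"].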
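
The two-line script fails with a traceback if the argument is missing or is not a number. Below is a minimal sketch of handling sys.argv more defensively by hand; the helper name parse_increment_arg is made up for illustration, and it takes the argument list as a parameter so it can be tried with an explicit list. In a real script, you would pass sys.argv to it.

```python
def parse_increment_arg(argv):
    """Return the integer in argv[1], or raise SystemExit with a usage message."""
    if len(argv) != 2:
        raise SystemExit(f"usage: {argv[0]} N")
    try:
        return int(argv[1])
    except ValueError:
        raise SystemExit(f"{argv[0]}: N must be an integer, got {argv[1]!r}")

# Demonstrated with an explicit list; a script would call parse_increment_arg(sys.argv)
n = parse_increment_arg(["commandline.py", "15"])
print(n + 1)
```

Even this small amount of checking takes several lines, and the effort grows quickly as more arguments are added.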

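When you run a command line with a more complicated set of arguments, processing the list sys.argv by hand takes some effort. Therefore, Python provides the argparse library to help. It assumes the GNU style, which can be explained with the following example: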

rsync -a -v --exclude="*.pyc" -B 1024 --ignore-existing 192.168.0.3:/tmp/ ./

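The optional arguments are introduced by "-" or "--", where a single hyphen carries a single-character "short option" (such as -a, -B, and -v above), and two hyphens are for multi-character "long options" (such as --exclude and --ignore-existing above). The optional arguments may have additional parameters, such as in -B 1024 or --exclude="*.pyc"; the 1024 and "*.pyc" are the parameters to -B and --exclude, respectively. Additionally, we may also have compulsory arguments, which we put directly on the command line. The 192.168.0.3:/tmp/ and ./ above are examples. The order of compulsory arguments is important. For example, the rsync command above copies files from 192.168.0.3:/tmp/ to ./ rather than the other way round.

The following replicates the above example in Python using argparse: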

import argparse

parser = argparse.ArgumentParser(description="Just an example",
                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("-a", "--archive", action="store_true", help="archive mode")
parser.add_argument("-v", "--verbose", action="store_true", help="increase verbosity")
parser.add_argument("-B", "--block-size", help="checksum blocksize")
parser.add_argument("--ignore-existing", action="store_true", help="skip files that exist")
parser.add_argument("--exclude", help="files to exclude")
parser.add_argument("src", help="Source location")
parser.add_argument("dest", help="Destination location")
args = parser.parse_args()
config = vars(args)
print(config)

If you run the above script without any arguments, you will see the following:

$ python argparse_example.py
usage: argparse_example.py [-h] [-a] [-v] [-B BLOCK_SIZE] [--ignore-existing] [--exclude EXCLUDE] src dest
argparse_example.py: error: the following arguments are required: src, dest

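This means you didn't provide the required arguments src and dest. Perhaps the best reason to use argparse is that you get a help screen for free if you provide -h or --help as the argument, like the following (in Python 3.10 and later, the heading "optional arguments:" is printed as "options:"):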

$ python argparse_example.py --help
usage: argparse_example.py [-h] [-a] [-v] [-B BLOCK_SIZE] [--ignore-existing] [--exclude EXCLUDE] src dest

Just an example

positional arguments:
  src                   Source location
  dest                  Destination location

optional arguments:
  -h, --help            show this help message and exit
  -a, --archive         archive mode (default: False)
  -v, --verbose         increase verbosity (default: False)
  -B BLOCK_SIZE, --block-size BLOCK_SIZE
                        checksum blocksize (default: None)
  --ignore-existing     skip files that exist (default: False)
  --exclude EXCLUDE     files to exclude (default: None)

While the script does nothing real, if you provide the arguments as required, you will see this:

$ python argparse_example.py -a --ignore-existing 192.168.0.1:/tmp/ /home
{'archive': True, 'verbose': False, 'block_size': None, 'ignore_existing': True, 'exclude': None, 'src': '192.168.0.1:/tmp/', 'dest': '/home'}

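The parser object created by ArgumentParser() has a parse_args() method that reads sys.argv and returns a namespace object. This is an object that carries attributes, and we can read them using args.ignore_existing, for example. But usually it is easier to handle if it is a Python dictionary. Hence we can convert it into one using vars(args).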
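
The difference between the namespace and the dictionary can be seen in a short sketch. It uses a trimmed-down version of the parser above and passes an explicit list to parse_args(), which makes argparse parse that list instead of reading sys.argv.

```python
import argparse

parser = argparse.ArgumentParser(description="Just an example")
parser.add_argument("-a", "--archive", action="store_true", help="archive mode")
parser.add_argument("--ignore-existing", action="store_true", help="skip files that exist")
parser.add_argument("src", help="Source location")
parser.add_argument("dest", help="Destination location")

# Passing a list makes parse_args() use it instead of sys.argv[1:]
args = parser.parse_args(["-a", "192.168.0.1:/tmp/", "/home"])
print(args.archive, args.ignore_existing)   # attribute access on the namespace

config = vars(args)                         # the same values as a dictionary
print(config["src"], config["dest"])
```

Passing an explicit list is also handy for trying out a parser in a notebook or a unit test.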

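Usually, for all optional arguments, we provide the long option and sometimes also the short option. Then we can access the value provided from the command line using the long option as the key (with the hyphen replaced by an underscore) or the single-character short option as the key if there is no long version. The "positional arguments" are not optional, and their names are provided in the add_argument() function.

There are multiple types of arguments. For optional arguments, sometimes we use them as Boolean flags, but sometimes we expect them to bring in some data. In the above, we use action="store_true" to make the option False by default and toggle it to True if it is specified. For other options, such as -B above, by default it expects additional data to follow it.

We can further require an argument to be of a specific type. For example, for the -B option above, we can make it expect integer data by adding type, like the following: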

parser.add_argument("-B", "--block-size", type=int, help="checksum blocksize")

If we provide the wrong type, argparse will help terminate our program with an informative error message:

$ python argparse_example.py -a -B hello --ignore-existing 192.168.0.1:/tmp/ /home
usage: argparse_example.py [-h] [-a] [-v] [-B BLOCK_SIZE] [--ignore-existing] [--exclude EXCLUDE] src dest
argparse_example.py: error: argument -B/--block-size: invalid int value: 'hello'
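
The termination happens because argparse raises SystemExit after printing the error. The following sketch, which uses a parser with only the -B option, shows both the type conversion and the exit; catching SystemExit here is purely for demonstration.

```python
import argparse

parser = argparse.ArgumentParser(prog="argparse_example.py")
parser.add_argument("-B", "--block-size", type=int, help="checksum blocksize")

# A valid value is converted to an int before we see it
good = parser.parse_args(["-B", "2048"])
print(good.block_size, type(good.block_size).__name__)

try:
    parser.parse_args(["-B", "hello"])
except SystemExit as exc:
    # argparse prints the usage and error message to stderr, then exits
    exit_code = exc.code
    print("argparse exited with status", exit_code)
```

A nonzero exit status lets a calling shell script detect that the run failed.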

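Working on the Command Line

Empowering your Python script with command line arguments can bring it to a new level of reusability. First, let's look at a simple example of fitting an ARIMA model to a GDP time series. The World Bank collects historical GDP data from many countries. We can make use of the pandas_datareader package to read the data. If you haven't installed it yet, you can use pip (or conda if you installed Anaconda) to install the package: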

pip install pandas_datareader

The code for the GDP data that we use is NY.GDP.MKTP.CN; we can get the data of a country in the form of a pandas DataFrame by:

from pandas_datareader.wb import WorldBankReader

gdp = WorldBankReader("NY.GDP.MKTP.CN", "SE", start=1960, end=2020).read()

Then we can tidy up the DataFrame a bit with the tools provided by pandas:

import pandas as pd

# Drop country name from index
gdp = gdp.droplevel(level=0, axis=0)
# Sort data in chronological order and set data point at year-end
gdp.index = pd.to_datetime(gdp.index)
gdp = gdp.sort_index().resample("y").last()
# Convert pandas DataFrame into pandas Series
gdp = gdp["NY.GDP.MKTP.CN"]

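Fitting an ARIMA model and using the model for predictions is not difficult. In the following, we fit it using the first 40 data points and forecast the next 3. Then we compare the forecast with the actual values in terms of relative error: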

import statsmodels.api as sm

model = sm.tsa.ARIMA(endog=gdp[:40], order=(1,1,1)).fit()
forecast = model.forecast(steps=3)
compare = pd.DataFrame({"actual":gdp, "forecast":forecast}).dropna()
compare["rel error"] = (compare["forecast"] - compare["actual"])/compare["actual"]
print(compare)

Putting it all together, and after a little polishing, the following is the complete code:

import warnings
warnings.simplefilter("ignore")

from pandas_datareader.wb import WorldBankReader
import statsmodels.api as sm
import pandas as pd

series = "NY.GDP.MKTP.CN"
country = "SE" # Sweden
length = 40
start = 0
steps = 3
order = (1,1,1)

# Read the GDP data from WorldBank database
gdp = WorldBankReader(series, country, start=1960, end=2020).read()
# Drop country name from index
gdp = gdp.droplevel(level=0, axis=0)
# Sort data in chronological order and set data point at year-end
gdp.index = pd.to_datetime(gdp.index)
gdp = gdp.sort_index().resample("y").last()
# Convert pandas dataframe into pandas series
gdp = gdp[series]
# Fit arima model
result = sm.tsa.ARIMA(endog=gdp[start:start+length], order=order).fit()
# Forecast, and calculate the relative error
forecast = result.forecast(steps=steps)
df = pd.DataFrame({"Actual":gdp, "Forecast":forecast}).dropna()
df["Rel Error"] = (df["Forecast"] - df["Actual"]) / df["Actual"]
# Print result
with pd.option_context('display.max_rows', None, 'display.max_columns', 3):
    print(df)

This script prints the following output:

                   Actual      Forecast  Rel Error
2000-12-31  2408151000000  2.367152e+12  -0.017025
2001-12-31  2503731000000  2.449716e+12  -0.021574
2002-12-31  2598336000000  2.516118e+12  -0.031643

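The above code is short, but we made it flexible enough by holding some parameters in variables. We can change the above code to use argparse so that we can change some parameters from the command line, as follows: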

from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter
import warnings
warnings.simplefilter("ignore")

from pandas_datareader.wb import WorldBankReader
import statsmodels.api as sm
import pandas as pd

# Parse command line arguments
parser = ArgumentParser(formatter_class=ArgumentDefaultsHelpFormatter)
parser.add_argument("-c", "--country", default="SE", help="Two-letter country code")
parser.add_argument("-l", "--length", default=40, type=int, help="Length of time series to fit the ARIMA model")
parser.add_argument("-s", "--start", default=0, type=int, help="Starting offset to fit the ARIMA model")
args = vars(parser.parse_args())

# Set up parameters
series = "NY.GDP.MKTP.CN"
country = args["country"]
length = args["length"]
start = args["start"]
steps = 3
order = (1,1,1)

# Read the GDP data from WorldBank database
gdp = WorldBankReader(series, country, start=1960, end=2020).read()
# Drop country name from index
gdp = gdp.droplevel(level=0, axis=0)
# Sort data in chronological order and set data point at year-end
gdp.index = pd.to_datetime(gdp.index)
gdp = gdp.sort_index().resample("y").last()
# Convert pandas dataframe into pandas series
gdp = gdp[series]
# Fit arima model
result = sm.tsa.ARIMA(endog=gdp[start:start+length], order=order).fit()
# Forecast, and calculate the relative error
forecast = result.forecast(steps=steps)
df = pd.DataFrame({"Actual":gdp, "Forecast":forecast}).dropna()
df["Rel Error"] = (df["Forecast"] - df["Actual"]) / df["Actual"]
# Print result
with pd.option_context('display.max_rows', None, 'display.max_columns', 3):
    print(df)

If we run the code above in a command line, we can see it now accepts arguments:

$ python gdp_arima.py --help
usage: gdp_arima.py [-h] [-c COUNTRY] [-l LENGTH] [-s START]

optional arguments:
  -h, --help            show this help message and exit
  -c COUNTRY, --country COUNTRY
                        Two-letter country code (default: SE)
  -l LENGTH, --length LENGTH
                        Length of time series to fit the ARIMA model (default: 40)
  -s START, --start START
                        Starting offset to fit the ARIMA model (default: 0)
$ python gdp_arima.py
                   Actual      Forecast  Rel Error
2000-12-31  2408151000000  2.367152e+12  -0.017025
2001-12-31  2503731000000  2.449716e+12  -0.021574
2002-12-31  2598336000000  2.516118e+12  -0.031643
$ python gdp_arima.py -c NO
                   Actual      Forecast  Rel Error
2000-12-31  1507283000000  1.337229e+12  -0.112821
2001-12-31  1564306000000  1.408769e+12  -0.099429
2002-12-31  1561026000000  1.480307e+12  -0.051709

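In the last command above, we pass in -c NO to apply the same model to the GDP data of Norway (NO) instead of Sweden (SE). Hence, without the risk of breaking the code by editing it, we reused our code for a different dataset.

The power of introducing command line arguments is that we can easily test our code with varying parameters. For example, we may want to see if the ARIMA(1,1,1) model is a good model for predicting GDP, and we want to verify with different time windows of the Nordic countries:

Denmark (DK)
Finland (FI)
Iceland (IS)
Norway (NO)
Sweden (SE)

We want to check a window of 40 years but with different starting points (since 1960, 1965, 1970, 1975). Depending on the OS, you can build a for loop in Linux and macOS using the bash shell syntax: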

Shell

for C in DK FI IS NO SE; do
    for S in 0 5 10 15; do
        python gdp_arima.py -c $C -s $S
    done
done

Or, as the shell syntax permits, we can put everything on one line:

Shell

for C in DK FI IS NO SE; do for S in 0 5 10 15; do python gdp_arima.py -c $C -s $S ; done ; done

Or even better, print some information at each iteration of the loop as we run our script multiple times:

$ for C in DK FI IS NO SE; do for S in 0 5 10 15; do echo $C $S; python gdp_arima.py -c $C -s $S ; done; done
DK 0
                  Actual      Forecast  Rel Error
2000-12-31  1.326912e+12  1.290489e+12  -0.027449
2001-12-31  1.371526e+12  1.338878e+12  -0.023804
2002-12-31  1.410271e+12  1.386694e+12  -0.016718
DK 5
                  Actual      Forecast  Rel Error
2005-12-31  1.585984e+12  1.555961e+12  -0.018931
2006-12-31  1.682260e+12  1.605475e+12  -0.045644
2007-12-31  1.738845e+12  1.654548e+12  -0.048479
DK 10
                  Actual      Forecast  Rel Error
2010-12-31  1.810926e+12  1.762747e+12  -0.026605
2011-12-31  1.846854e+12  1.803335e+12  -0.023564
2012-12-31  1.895002e+12  1.843907e+12  -0.026963

...

SE 5
                   Actual      Forecast  Rel Error
2005-12-31  2931085000000  2.947563e+12   0.005622
2006-12-31  3121668000000  3.043831e+12  -0.024934
2007-12-31  3320278000000  3.122791e+12  -0.059479
SE 10
                   Actual      Forecast  Rel Error
2010-12-31  3573581000000  3.237310e+12  -0.094099
2011-12-31  3727905000000  3.163924e+12  -0.151286
2012-12-31  3743086000000  3.112069e+12  -0.168582
SE 15
                   Actual      Forecast  Rel Error
2015-12-31  4260470000000  4.086529e+12  -0.040827
2016-12-31  4415031000000  4.180213e+12  -0.053186
2017-12-31  4625094000000  4.273781e+12  -0.075958

If you're using Windows, you can use the following syntax in the Command Prompt:

MS DOS

for %C in (DK FI IS NO SE) do for %S in (0 5 10 15) do python gdp_arima.py -c %C -s %S

Or the following in PowerShell:

PowerShell

foreach ($C in "DK","FI","IS","NO","SE") { foreach ($S in 0,5,10,15) { python gdp_arima.py -c $C -s $S } }

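Both should produce the same result.

While we can put a similar loop inside our Python script, sometimes it is easier if we can do it at the command line. It could be more convenient when we are exploring different options. Moreover, by taking the loop outside of the Python code, we can be assured that every run of the script is independent because no variables are shared between iterations.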
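
If you prefer to drive the runs from Python but still want every run to be an independent process, one possible approach is to build the same command lines and launch each as a subprocess. The sketch below only builds and prints the commands; the launching part is left commented out because it assumes gdp_arima.py exists in the current directory.

```python
import itertools
import sys

countries = ["DK", "FI", "IS", "NO", "SE"]
starts = [0, 5, 10, 15]

# One command per (country, start) pair, mirroring the nested shell loops
commands = [
    [sys.executable, "gdp_arima.py", "-c", country, "-s", str(start)]
    for country, start in itertools.product(countries, starts)
]
print(len(commands), "runs, first:", " ".join(commands[0][1:]))

# To actually launch them, each as a separate process:
# import subprocess
# for cmd in commands:
#     subprocess.run(cmd, check=True)
```

Using sys.executable makes each run use the same Python interpreter as the driver script.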

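Alternative to Command Line Arguments

Using command line arguments is not the only way to pass data to your Python script. There are at least a few other ways:

Using environment variables
Using config files

Environment variables are a feature of the operating system for keeping small amounts of data in memory. We can read environment variables in Python using the following syntax: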

import os
print(os.environ["MYVALUE"])

For example, in Linux, the above two-line script will work with the shell as follows:

$ export MYVALUE="hello"
$ python show_env.py
hello

In Windows, the syntax in the Command Prompt is similar:

C:\MLM> set MYVALUE=hello

C:\MLM> python show_env.py
hello

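You may also add or edit environment variables in Windows using the dialog in the Control Panel.

So we may keep the parameters to the script in some environment variables and let the script adapt its behavior, just like setting a command line argument.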
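
Reading os.environ["MYVALUE"] raises a KeyError when the variable is not set. A slightly more forgiving sketch falls back to defaults, much like the default= of argparse. The variable names GDP_COUNTRY and GDP_LENGTH are made up for this illustration.

```python
import os

# Fall back to defaults when a variable is not set; the variable names
# GDP_COUNTRY and GDP_LENGTH are made up for this sketch
country = os.environ.get("GDP_COUNTRY", "SE")
# Environment variables are always strings, so convert numbers explicitly
length = int(os.environ.get("GDP_LENGTH", "40"))
print(country, length)
```

Note that, unlike argparse with type=int, the conversion from string is our own responsibility here.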

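In case we have a lot of options to set, it is better to save them to a file rather than overwhelming the command line. Depending on the format we choose, we can use the configparser or json module from Python to read the Windows INI format or JSON format, respectively. We may also use the third-party library PyYAML to read the YAML format.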
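
As a quick illustration of the INI option, here is a minimal sketch with configparser. The section name arima and the inline INI text are assumptions made for this example; in practice, the content would live in a file such as config.ini and be loaded with config.read("config.ini").

```python
import configparser

# Hypothetical INI content; in practice this would live in a file such as
# config.ini and be loaded with config.read("config.ini")
ini_text = """
[arima]
country = SE
length = 40
start = 0
"""

config = configparser.ConfigParser()
config.read_string(ini_text)

section = config["arima"]
country = section["country"]          # values are strings by default
length = section.getint("length")     # typed getters do the conversion
start = section.getint("start")
print(country, length, start)
```

Unlike JSON and YAML, the INI format does not carry data types, hence the explicit getint() calls.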

For the above example of running the ARIMA model on GDP data, we can modify the code to use a YAML config file:

import warnings
warnings.simplefilter("ignore")

from pandas_datareader.wb import WorldBankReader
import statsmodels.api as sm
import pandas as pd
import yaml

# Load config from YAML file
with open("config.yaml", "r") as fp:
    args = yaml.safe_load(fp)

# Set up parameters
series = "NY.GDP.MKTP.CN"
country = args["country"]
length = args["length"]
start = args["start"]
steps = 3
order = (1,1,1)

# Read the GDP data from WorldBank database
gdp = WorldBankReader(series, country, start=1960, end=2020).read()
# Drop country name from index
gdp = gdp.droplevel(level=0, axis=0)
# Sort data in chronological order and set data point at year-end
gdp.index = pd.to_datetime(gdp.index)
gdp = gdp.sort_index().resample("y").last()
# Convert pandas dataframe into pandas series
gdp = gdp[series]
# Fit arima model
result = sm.tsa.ARIMA(endog=gdp[start:start+length], order=order).fit()
# Forecast, and calculate the relative error
forecast = result.forecast(steps=steps)
df = pd.DataFrame({"Actual":gdp, "Forecast":forecast}).dropna()
df["Rel Error"] = (df["Forecast"] - df["Actual"]) / df["Actual"]
# Print result
with pd.option_context('display.max_rows', None, 'display.max_columns', 3):
    print(df)

The YAML config file is named config.yaml, and its content is as follows:

country: SE
length: 40
start: 0

Then we can run the above code and obtain the same result as before. The JSON counterpart is very similar, where we use the load() function from the json module:

import json
import warnings
warnings.simplefilter("ignore")

from pandas_datareader.wb import WorldBankReader
import statsmodels.api as sm
import pandas as pd

# Load config from JSON file
with open("config.json", "r") as fp:
    args = json.load(fp)

# Set up parameters
series = "NY.GDP.MKTP.CN"
country = args["country"]
length = args["length"]
start = args["start"]
steps = 3
order = (1,1,1)

# Read the GDP data from WorldBank database
gdp = WorldBankReader(series, country, start=1960, end=2020).read()
# Drop country name from index
gdp = gdp.droplevel(level=0, axis=0)
# Sort data in chronological order and set data point at year-end
gdp.index = pd.to_datetime(gdp.index)
gdp = gdp.sort_index().resample("y").last()
# Convert pandas dataframe into pandas series
gdp = gdp[series]
# Fit arima model
result = sm.tsa.ARIMA(endog=gdp[start:start+length], order=order).fit()
# Forecast, and calculate the relative error
forecast = result.forecast(steps=steps)
df = pd.DataFrame({"Actual":gdp, "Forecast":forecast}).dropna()
df["Rel Error"] = (df["Forecast"] - df["Actual"]) / df["Actual"]
# Print result
with pd.option_context('display.max_rows', None, 'display.max_columns', 3):
    print(df)

And the JSON config file, config.json, would be:

{
    "country": "SE",
    "length": 40,
    "start": 0
}

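You may learn more about the syntax of JSON and YAML for your project. But the idea here is that we can separate the data and algorithm for better reusability of our code.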

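Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Libraries

argparse module, docs.python.org/3/library/argparse.html
Pandas Data Reader, pandas-datareader.readthedocs.io/en/latest/
ARIMA in statsmodels, www.statsmodels.org/devel/generated/statsmodels.tsa.arima.model.ARIMA.html
configparser module, docs.python.org/3/library/configparser.html
json module, docs.python.org/3/library/json.html
PyYAML, pyyaml.org/wiki/PyYAMLDocumentation

Articles

Working with JSON, developer.mozilla.org/en-US/docs/Learn/JavaScript/Objects/JSON
YAML on Wikipedia, zh.wikipedia.org/wiki/YAML

Books

Python Cookbook, third edition, by David Beazley and Brian K. Jones, www.amazon.com/dp/1449340377/

Summary

In this tutorial, you've seen how we can use the command line for more efficient control of our Python script. Specifically, you learned:

How we can pass parameters to your Python script using the argparse module
How we can efficiently control an argparse-enabled Python script in a terminal under different operating systems
We can also use environment variables or config files to pass parameters to a Python script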

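Comments, Docstrings, and Type Hints in Python Code

Original: machinelearningmastery.com/comments-docstrings-and-type-hints-in-python-code/

The source code of a program should be readable to humans. Making it run correctly is only half of its purpose. Without properly commented code, it would be difficult for anyone, including your future self, to understand the rationale and intent behind the code. It would also make the code impossible to maintain. In Python, there are multiple ways to add descriptions to the code to make it more readable or make the intent more explicit. In the following, we will see how we should properly use comments, docstrings, and type hints to make our code easier to understand. After finishing this tutorial, you will know:

What is the proper way of using comments in Python
How a string literal or docstring can replace comments in some cases
What type hints are in Python, and how they can help us understand the code better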

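Let's get started!

Overview

This tutorial is in three parts; they are:

Adding comments to Python code
Using docstrings
Using type hints in Python code

Adding Comments to Python Code

Almost all programming languages have a dedicated syntax for comments. Comments are ignored by compilers or interpreters, and hence they have no effect on the programming flow or logic. But with comments, it is easier to read the code.

In languages like C++, we can add "inline comments" with a leading double slash (//) or add comment blocks enclosed by /* and */. However, in Python, we only have the "inline" version, and they are introduced by the leading hash character (#).

It is quite easy to write comments to explain every line of code, but that is usually a waste. When people read the source code, comments often catch their attention easily, and hence putting in too many comments distracts from the reading. For example, the following is unnecessary and distracting: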

import datetime

timestamp = datetime.datetime.now()  # Get the current date and time
x = 0    # initialize x to zero

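Comments like these merely repeat what the code does. Unless the code is obscure, these comments add no value to it. The example below might be a marginal case in which the name "ppf" (percent point function) is less well known than the term "CDF" (cumulative distribution function):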

import scipy.stats

z_alpha = scipy.stats.norm.ppf(0.975)  # Call the inverse CDF of standard normal

Good comments should tell us why we are doing something. Let's look at the following example:

def adadelta(objective, derivative, bounds, n_iter, rho, ep=1e-3):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # lists to hold the average square gradients for each variable and
    # average parameter updates
    sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
    sq_para_avg = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        gradient = derivative(solution[0], solution[1])
        # update the moving average of the squared partial derivatives
        for i in range(gradient.shape[0]):
            sg = gradient[i]**2.0
            sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
        # build a solution one variable at a time
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the step size for this variable
            alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))
            # calculate the change and update the moving average of the squared change
            change = alpha * gradient[i]
            sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))
            # calculate the new position in this variable and store as new solution
            value = solution[i] - change
            new_solution.append(value)
        # evaluate candidate point
        solution = asarray(new_solution)
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return [solution, solution_eval]

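The function above implements the AdaDelta algorithm. In the first line, when we assign something to the variable solution, we do not write a comment like "a random interpolation between bounds[:,0] and bounds[:,1]" because that merely repeats the code. We say the intent of this line is to "generate an initial point." Similarly, in the other comments in the function, we mark one of the for loops as the gradient descent algorithm rather than just saying iterate a certain number of times.

One important issue to remember when writing a comment or modifying code is to make sure the comment accurately describes the code. If they contradict each other, it would be confusing to the readers. For example, the comment on the first line of the example above should not say "set initial solution to the lower bound" while the code obviously randomizes the initial solution, or vice versa. If you change one of them, you should update the comment and the code at the same time.

An exception would be the "to-do" comments. From time to time, when we have an idea for improving the code but have not yet changed it, we may put a to-do comment on the code. We can also use it to mark incomplete implementations. For example,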

# TODO replace Keras code below with Tensorflow
from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
model.add(Conv2D(1, (3,3), strides=(2, 2), input_shape=(8, 8, 1)))
model.summary()
...

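This is a common practice, and many IDEs will highlight the comment block differently when the keyword TODO is found. However, it is supposed to be temporary, and we should not abuse it as an issue-tracking system.

In summary, some common "best practices" on commenting code are listed as follows:

Comments should not restate the code but explain it
Comments should not cause confusion but eliminate it
Put comments on code that is not trivial to understand; for example, state the unidiomatic use of syntax, name the algorithm being used, or explain the intent or assumptions
Comments should be concise and simple
Keep a consistent style and use of language in commenting
Always prefer better-written code that needs no additional comments

Using Docstrings

In C++, we may write a large block of comments such as in the following: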

C++

TcpSocketBase::~TcpSocketBase (void)
{
  NS_LOG_FUNCTION (this);
  m_node = nullptr;
  if (m_endPoint != nullptr)
    {
      NS_ASSERT (m_tcp != nullptr);
      /*
       * Upon Bind, an Ipv4Endpoint is allocated and set to m_endPoint, and
       * DestroyCallback is set to TcpSocketBase::Destroy. If we called
       * m_tcp->DeAllocate, it will destroy its Ipv4EndpointDemux::DeAllocate,
       * which in turn destroys my m_endPoint, and in turn invokes
       * TcpSocketBase::Destroy to nullify m_node, m_endPoint, and m_tcp.
       */
      NS_ASSERT (m_endPoint != nullptr);
      m_tcp->DeAllocate (m_endPoint);
      NS_ASSERT (m_endPoint == nullptr);
    }
  if (m_endPoint6 != nullptr)
    {
      NS_ASSERT (m_tcp != nullptr);
      NS_ASSERT (m_endPoint6 != nullptr);
      m_tcp->DeAllocate (m_endPoint6);
      NS_ASSERT (m_endPoint6 == nullptr);
    }
  m_tcp = 0;
  CancelAllTimers ();
}

In Python, we do not have an equivalent to the delimiters /* and */, but we can write multi-line comments using the following syntax instead:

async def main(indir):
    # Scan dirs for files and populate a list
    filepaths = []
    for path, dirs, files in os.walk(indir):
        for basename in files:
            filepath = os.path.join(path, basename)
            filepaths.append(filepath)

    """Create the "process pool" of 4 and run asyncio.
    The processes will execute the worker function
    concurrently with each file path as parameter
    """
    loop = asyncio.get_running_loop()
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        futures = [loop.run_in_executor(executor, func, f) for f in filepaths]
        for fut in asyncio.as_completed(futures):
            try:
                filepath = await fut
                print(filepath)
            except Exception as exc:
                print("failed one job")

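This works because Python supports declaring a string literal spanning multiple lines if it is delimited with triple quotation marks ("""). And a string literal in the code is merely an expression statement with no effect. Therefore, it is functionally no different from a comment.

One reason we want to use string literals is to comment out a large block of code. For example,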

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
"""
X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1,
                           weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)
"""
import pickle
with open("dataset.pickle", "rb") as fp:
    X, y = pickle.load(fp)

clf = LogisticRegression(random_state=0).fit(X, y)
...

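The above is sample code that we may develop while experimenting with a machine learning problem. While we generated a dataset randomly at the beginning (the call to make_classification() above), we may want to switch to a different dataset and repeat the same process at a later time (e.g., the pickle part above). Rather than removing the block of code, we may simply comment out those lines so that we can keep the code for later. It is not in good shape for the finalized code but is convenient while we are developing our solution.

A string literal used as a comment has a special purpose in Python if it is on the first line under a function. The string literal in that case is called the "docstring" of the function. For example,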

def square(x):
    """Just to compute the square of a value

    Args:
        x (int or float): A numerical value

    Returns:
        int or float: The square of x
    """
    return x * x

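We can see the first line under the function is a literal string, and it serves the same purpose as a comment. It makes the code more readable, but at the same time, we can retrieve it from the code: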

print("Function name:", square.__name__)
print("Docstring:", square.__doc__)

Function name: square
Docstring: Just to compute the square of a value

    Args:
        x (int or float): A numerical value

    Returns:
        int or float: The square of x

Because of the special status of the docstring, there are several conventions on how to write a proper one.
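
Besides reading __doc__ directly, the standard library offers inspect.getdoc(), which returns the same docstring with the common leading indentation removed. A minimal sketch, repeating the square() function so that it runs on its own:

```python
import inspect

def square(x):
    """Just to compute the square of a value

    Args:
        x (int or float): A numerical value

    Returns:
        int or float: The square of x
    """
    return x * x

# __doc__ keeps the indentation as written; inspect.getdoc() cleans it up
doc = inspect.getdoc(square)
print(doc)
```

The built-in help(square) presents the same cleaned-up text in an interactive session.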

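In C++, we may use Doxygen to generate code documentation from comments, and similarly, we have Javadoc for Java. The closest match in Python would be the tool "autodoc" from Sphinx, or pdoc. Both will try to parse the docstring to generate documentation automatically.

There is no standard way of writing docstrings, but generally, we expect them to explain the purpose of a function (or a class or module) as well as the arguments and the return values. One common style is like the one above, which is advocated by Google. A different style is from NumPy: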

def square(x):
    """Just to compute the square of a value

    Parameters
    ----------
    x : int or float
        A numerical value

    Returns
    -------
    int or float
        The square of `x`
    """
    return x * x

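Tools such as autodoc can parse these docstrings and generate the API documentation. But even if that is not the purpose, having a docstring describing the nature of the function, the data types of the function arguments, and the return values can surely make your code easier to read. This is particularly important since Python, unlike C++ or Java, is a duck-typing language in which variables and function arguments are not declared with a particular type. We can make use of docstrings to spell out the assumptions about data types so people can more easily follow or use your function.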

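Using Type Hints in Python Code

Since Python 3.5, type hint syntax is allowed. As the name implies, its purpose is to hint at the type and nothing else. Hence even if it looks to bring Python closer to Java, it does not mean to restrict the data to be stored in a variable. The example above can be rewritten with a type hint: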

def square(x: int) -> int:
    return x * x

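In a function, the arguments can be followed by the : type syntax to spell out the intended types. The return value of a function is identified by the -> type syntax before the colon. In fact, a type hint can be declared for variables too, e.g.,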

def square(x: int) -> int:
    value: int = x * x
    return value

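The benefit of a type hint is twofold: we can use it to eliminate some comments if we need to describe explicitly the data type being used. We can also help static analyzers understand our code better so they can help identify potential issues in the code.

Sometimes the type can be complex, and therefore Python provides the typing module in its standard library to help clean up the syntax. For example, we can use Union[int,float] to mean either an int or a float, List[str] to mean a list in which every element is a string, and Any to mean anything. Like as follows: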

from typing import Any, Union, List

def square(x: Union[int, float]) -> Union[int, float]:
    return x * x

def append(x: List[Any], y: Any) -> None:
    x.append(y)

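However, it is important to remember that type hints are hints only. They do not impose any restrictions on the code. Hence the following is confusing to the reader but perfectly fine to the interpreter: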

n: int = 3.5
n = "assign a string"

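Using type hints may improve the readability of the code. However, the most important benefit of type hints is to allow a static analyzer such as mypy to tell us whether our code has any potential bug. If you process the above lines of code with mypy, you will see the following errors: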

test.py:1: error: Incompatible types in assignment (expression has type "float", variable has type "int")
test.py:2: error: Incompatible types in assignment (expression has type "str", variable has type "int")
Found 2 errors in 1 file (checked 1 source file)
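
The interpreter itself never raises such errors. The small sketch below shows that the hints are stored on the function, where tools can read them, but are not checked when the function is called:

```python
from typing import get_type_hints

def square(x: int) -> int:
    return x * x

# The hints are stored on the function but never checked when it is called
print(get_type_hints(square))
result = square(2.5)   # a float is accepted despite the int hint
print(result)
```

This is why a separate static analyzer is needed to benefit from the hints.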

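The use of static analyzers will be covered in another post.

To illustrate the use of comments, docstrings, and type hints, below is an example that defines a generator function that samples a pandas DataFrame on fixed-width windows. It is useful for training an LSTM network, in which a few consecutive time steps should be provided. In the function below, we start from a random row of the DataFrame and clip a few rows following it. As long as we can successfully get one full window, we take it as a sample. Once we have collected enough samples to make a batch, the batch is dispatched.

The code is clearer when we provide type hints for the function arguments, so we know, for example, that data is expected to be a pandas DataFrame. But we describe further in the docstring that it is expected to carry a datetime index. We describe the algorithm for extracting a window of rows from the input data as well as the intent of the "if" block in the inner while loop. In this way, the code is much easier to understand and maintain, and also much easier to modify for other uses.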

from typing import List, Tuple, Generator
import pandas as pd
import numpy as np

TrainingSampleGenerator = Generator[Tuple[np.ndarray,np.ndarray], None, None]

def lstm_gen(data: pd.DataFrame,
             timesteps: int,
             batch_size: int) -> TrainingSampleGenerator:
    """Generator to produce random samples for LSTM training

    Args:
        data: DataFrame of data with datetime index in chronological order,
              samples are drawn from this
        timesteps: Number of time steps for each sample, data will be
                   produced from a window of such length
        batch_size: Number of samples in each batch

    Yields:
        ndarray, ndarray: The (X,Y) training samples drawn on a random window
        from the input data
    """
    input_columns = [c for c in data.columns if c != "target"]
    batch: List[Tuple[pd.DataFrame, pd.Series]] = []
    while True:
        # pick one start time and security
        while True:
            # Start from a random point from the data and clip a window
            row = data["target"].sample()
            starttime = row.index[0]
            window: pd.DataFrame = data[starttime:].iloc[:timesteps]
            # If we are at the end of the DataFrame, we can't get a full
            # window and we must start over
            if len(window) == timesteps:
                break
        # Extract the input and output
        y = window["target"]
        X = window[input_columns]
        batch.append((X, y))
        # If accumulated enough for one batch, dispatch
        if len(batch) == batch_size:
            X, y = zip(*batch)
            yield np.array(X).astype("float32"), np.array(y).astype("float32")
            batch = []

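Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Articles

Best practices for writing code comments, stackoverflow.blog/2021/12/23/best-practices-for-writing-code-comments/
PEP 483, the theory of type hints, www.python.org/dev/peps/pep-0483/
Google Python Style Guide, google.github.io/styleguide/pyguide.html

Software

Sphinx documentation, www.sphinx-doc.org/en/master/index.html
Napoleon module of Sphinx, sphinxcontrib-napoleon.readthedocs.io/en/latest/index.html
- Google-style docstring example: sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html
- NumPy-style docstring example: sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html
pdoc, pdoc.dev/
typing module in Python, docs.python.org/3/library/typing.html

Summary

In this tutorial, you've seen how we should use comments, docstrings, and type hints in Python. Specifically, you now know:

How to write good, useful comments
The conventions for explaining a function using docstrings
How to use type hints to address the readability weakness of duck typing in Python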

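Data Visualization in Python with matplotlib, Seaborn, and Bokeh

Original article: machinelearningmastery.com/data-visualization-in-python-with-matplotlib-seaborn-and-bokeh/

Data visualization is an important aspect of all AI and machine learning applications. You can gain key insights into your data through different graphical representations. In this tutorial, we'll talk about a few options for data visualization in Python. We'll use the MNIST dataset and the TensorFlow library for number crunching and data manipulation. To illustrate various methods for creating different types of graphs, we'll use Python's graphing libraries, namely matplotlib, Seaborn, and Bokeh.

After completing this tutorial, you will know:

How to visualize images in matplotlib
How to make scatter plots in matplotlib, Seaborn, and Bokeh
How to make multi-line plots in matplotlib, Seaborn, and Bokeh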

Let's get started.

Photo of Istanbul taken from an airplane, by Mehreen Saeed. Some rights reserved.

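Tutorial Overview

This tutorial is divided into seven parts; they are:

Preparation of Scatter Data
Figures in matplotlib
Scatter Plots in matplotlib and Seaborn
Scatter Plots in Bokeh
Preparation of Line Plot Data
Line Plots in matplotlib, Seaborn, and Bokeh
More on Visualization

Preparation of Scatter Data

In this post, we will use matplotlib, Seaborn, and Bokeh. They are all external libraries that need to be installed. To install them using pip, run the following command: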

pip install matplotlib seaborn bokeh

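For demonstration purposes, we will also use the MNIST handwritten digits dataset. We will load it from TensorFlow and run the PCA algorithm on it. Hence we also need to install TensorFlow and pandas:

pip install tensorflow pandas

The code afterward will assume the following imports have been executed: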

# Importing from tensorflow and keras
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Reshape
from tensorflow.keras import utils
from tensorflow import dtypes, tensordot
from tensorflow import convert_to_tensor, linalg, transpose
# For math operations
import numpy as np
# For plotting with matplotlib
import matplotlib.pyplot as plt
# For plotting with seaborn
import seaborn as sns  
# For plotting with bokeh
from bokeh.plotting import figure, show
from bokeh.models import Legend, LegendItem
# For pandas dataframe
import pandas as pd

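We load the MNIST dataset from the keras.datasets library. To keep things simple, we'll retain only the subset of data containing the digits 0, 1, and 2. We'll also ignore the test set for now.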

...
# load dataset
(x_train, train_labels), (_, _) = mnist.load_data()
# Choose only the digits 0, 1, 2
total_classes = 3
ind = np.where(train_labels < total_classes)
x_train, train_labels = x_train[ind], train_labels[ind]
# Shape of training data
total_examples, img_length, img_width = x_train.shape
# Print the statistics
print('Training data has ', total_examples, 'images')
print('Each image is of size ', img_length, 'x', img_width)

Output

Training data has  18623 images
Each image is of size  28 x 28
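The filtering step above relies on np.where() returning a tuple of index arrays when it is given only a condition. The following is a minimal sketch of that behavior on a tiny made-up array (the names images and labels and their values are stand-ins for illustration, not MNIST data):

```python
import numpy as np

# Hypothetical stand-ins for MNIST: ten tiny 2x2 "images" and their labels
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(10, 2, 2))
labels = np.array([5, 0, 4, 1, 9, 2, 1, 3, 1, 4])

# With a single condition, np.where returns a tuple of index arrays
total_classes = 3
ind = np.where(labels < total_classes)
subset_images, subset_labels = images[ind], labels[ind]

print(ind[0])               # positions of the samples that are kept
print(subset_labels)        # only the labels 0, 1, 2 remain
print(subset_images.shape)  # (number of kept samples, 2, 2)

# A boolean mask selects the same rows without calling np.where
same_subset = images[labels < total_classes]
print(np.array_equal(same_subset, subset_images))
```

Indexing with the tuple from np.where() and indexing with the boolean mask select the same samples, so which one to use is mostly a matter of taste.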

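Figures in matplotlib

Seaborn is indeed an add-on to matplotlib. Therefore, you need to understand how matplotlib handles plots even if you're using Seaborn.

Matplotlib calls its canvas the figure. You can divide the figure into several sections called subplots, so you can put two visualizations side by side.

As an example, let's visualize the first 16 images of our MNIST dataset using matplotlib. We'll create 2 rows and 8 columns using the subplots() function. The subplots() function creates an axes object for each cell. Then we'll display each image on its axes object using the imshow() method. Finally, the figure is shown using the show() function: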

img_per_row = 8
fig,ax = plt.subplots(nrows=2, ncols=img_per_row,
                      figsize=(18,4),
                      subplot_kw=dict(xticks=[], yticks=[]))
for row in [0, 1]:
    for col in range(img_per_row):
        ax[row, col].imshow(x_train[row*img_per_row + col].astype('int'))   
plt.show()

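First 16 images of the training dataset displayed in 2 rows and 8 columns

Here we can see a few properties of matplotlib. There is a default figure and default axes in matplotlib. Many functions are defined under the pyplot submodule for plotting on the default axes. If we want to plot on particular axes, we can use the plotting methods of the axes object. The operations that manipulate a figure are procedural: matplotlib keeps a data structure internally, and our operations mutate it. The show() function simply displays the result of a series of operations. Because of that, we can gradually fine-tune many details of the figure. In the example above, we hid the "ticks" (i.e., the markers on the axes) by setting xticks and yticks to empty lists.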
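To make the difference between the default axes and a specific axes object concrete, here is a small sketch using made-up curves. It assumes a headless run, hence the Agg backend; in an interactive session you would drop that line and call plt.show() at the end.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np

# subplots() returns the figure and a 2D NumPy array of axes objects
fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(6, 4))
print(ax.shape)

x = np.linspace(0, 1, 20)

# Explicit style: call the plotting method on one specific axes object
ax[0, 1].plot(x, x**2)
ax[0, 1].set_title("explicit")

# Implicit style: pyplot functions act on the "current" axes, which is
# the most recently created one (the bottom-right cell here)
plt.plot(x, np.sqrt(x))
plt.title("implicit")

# Nothing is rendered until we ask for it; everything so far only
# mutated the figure's internal state
fig.canvas.draw()
```

The explicit style is usually easier to follow once a figure has more than one subplot, because each call names the axes it draws on.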

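Scatter Plots in matplotlib and Seaborn

One of the common visualizations we use in machine learning projects is the scatter plot.

As an example, we apply PCA to the MNIST dataset and extract the first three components of each image. In the code below, we compute the eigenvectors and eigenvalues from the dataset, then project the data of each image along the direction of the eigenvectors and store the result in x_pca. For simplicity, we didn't normalize the data to zero mean and unit variance before computing the eigenvectors. This omission does not affect our purpose of visualization.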

...
# Convert the dataset into a 2D array of shape 18623 x 784
x = convert_to_tensor(np.reshape(x_train, (x_train.shape[0], -1)),
                      dtype=dtypes.float32)
# Eigen-decomposition from a 784 x 784 matrix
eigenvalues, eigenvectors = linalg.eigh(tensordot(transpose(x), x, axes=1))
# Print the three largest eigenvalues
print('3 largest eigenvalues: ', eigenvalues[-3:])
# Project the data to eigenvectors
x_pca = tensordot(x, eigenvectors, axes=1)

The eigenvalues printed are as follows:

3 largest eigenvalues:  tf.Tensor([5.1999642e+09 1.1419439e+10 4.8231231e+10], shape=(3,), dtype=float32)

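The array x_pca has shape 18623 x 784. Let's consider the last two columns as the x- and y-coordinates and plot one point for each row. We can further color each point according to the digit it corresponds to.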
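The reason for picking the last columns is that eigh() returns the eigenvalues in ascending order, so the directions that capture the most of the data sit at the end. The sketch below reproduces the same recipe with NumPy on small synthetic data (the data and its scaling are assumptions for illustration; np.linalg.eigh is used here in place of the TensorFlow function):

```python
import numpy as np

# Synthetic data: 200 points in 3D, scaled so that each direction
# has a very different spread
rng = np.random.default_rng(42)
x = rng.normal(size=(200, 3)) * np.array([10.0, 3.0, 0.5])

# Same recipe as above, in NumPy: eigen-decomposition of X^T X
eigenvalues, eigenvectors = np.linalg.eigh(x.T @ x)

# The eigenvalues come back sorted from smallest to largest
print(eigenvalues)
is_ascending = bool(np.all(np.diff(eigenvalues) >= 0))
print(is_ascending)

# Project onto the eigenvectors; column -1 is the dominant direction
x_proj = x @ eigenvectors

# The sum of squares of each projected column equals its eigenvalue
energy = (x_proj ** 2).sum(axis=0)
print(np.allclose(energy, eigenvalues))
```

Because the data was not centered, these are the eigenvalues of X^T X rather than of the covariance matrix, which matches the simplification made in the text.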

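The following code generates a scatter plot using matplotlib. The plot is created using the axes object's scatter() method, which takes the x- and y-coordinates as the first two arguments. The c argument to scatter() specifies the values that determine each point's color, and the s argument specifies its size. The code also creates a legend and adds a title to the plot.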

fig, ax = plt.subplots(figsize=(12, 8))
scatter = ax.scatter(x_pca[:, -1], x_pca[:, -2], c=train_labels, s=5)
legend_plt = ax.legend(*scatter.legend_elements(),
                       loc="lower left", title="Digits")
ax.add_artist(legend_plt)
plt.title('First Two Dimensions of Projected Data After Applying PCA')
plt.show()

2D scatter plot generated using matplotlib

Putting the above together, the following is the complete code to generate the 2D scatter plot using matplotlib:

from tensorflow.keras.datasets import mnist
from tensorflow import dtypes, tensordot
from tensorflow import convert_to_tensor, linalg, transpose
import numpy as np
import matplotlib.pyplot as plt

# Load dataset
(x_train, train_labels), (_, _) = mnist.load_data()
# Choose only the digits 0, 1, 2
total_classes = 3
ind = np.where(train_labels < total_classes)
x_train, train_labels = x_train[ind], train_labels[ind]
# Verify the shape of training data
total_examples, img_length, img_width = x_train.shape
print('Training data has ', total_examples, 'images')
print('Each image is of size ', img_length, 'x', img_width)

# Convert the dataset into a 2D array of shape 18623 x 784
x = convert_to_tensor(np.reshape(x_train, (x_train.shape[0], -1)),
                      dtype=dtypes.float32)
# Eigen-decomposition from a 784 x 784 matrix
eigenvalues, eigenvectors = linalg.eigh(tensordot(transpose(x), x, axes=1))
# Print the three largest eigenvalues
print('3 largest eigenvalues: ', eigenvalues[-3:])
# Project the data to eigenvectors
x_pca = tensordot(x, eigenvectors, axes=1)

# Create the plot
fig, ax = plt.subplots(figsize=(12, 8))
scatter = ax.scatter(x_pca[:, -1], x_pca[:, -2], c=train_labels, s=5)
legend_plt = ax.legend(*scatter.legend_elements(),
                       loc="lower left", title="Digits")
ax.add_artist(legend_plt)
plt.title('First Two Dimensions of Projected Data After Applying PCA')
plt.show()

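Matplotlib also allows a 3D scatter plot to be produced. To do so, you need to create an axes object with 3D projection first. Then the 3D scatter plot is created with the scatter3D() method, with the x-, y-, and z-coordinates as the first three arguments. The code below uses the data projected along the eigenvectors corresponding to the three largest eigenvalues. Instead of creating a legend, this code creates a color bar: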

fig = plt.figure(figsize=(12, 8))
ax = plt.axes(projection='3d')
plt_3d = ax.scatter3D(x_pca[:, -1], x_pca[:, -2], x_pca[:, -3], c=train_labels, s=1)
plt.colorbar(plt_3d)
plt.show()

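3D scatter plot generated using matplotlib

The scatter3D() method just puts the points into the 3D space. Afterward, we can still modify how the figure is displayed, such as the label of each axis and the background color. But in 3D plots, one common tweak is the viewport, namely, the angle from which we look at the 3D space. The viewport is controlled by the view_init() method of the axes object: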

ax.view_init(elev=30, azim=-60)

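The viewport is controlled by the elevation angle (i.e., the angle to the horizontal plane) and the azimuthal angle (i.e., the rotation on the horizontal plane). By default, matplotlib uses a 30-degree elevation and a -60-degree azimuth, as shown above.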
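As a quick way to see what the two angles do, the sketch below draws the same random point cloud twice, once with the default viewport and once looking straight down from above. The points are made up for illustration, and the Agg backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Made-up 3D points, only for illustration
rng = np.random.default_rng(1)
points = rng.normal(size=(100, 3))

fig = plt.figure(figsize=(10, 4))

# Left: keep the default viewport
ax_default = fig.add_subplot(1, 2, 1, projection="3d")
ax_default.scatter(points[:, 0], points[:, 1], points[:, 2], s=5)

# Right: look straight down onto the horizontal plane
ax_top = fig.add_subplot(1, 2, 2, projection="3d")
ax_top.scatter(points[:, 0], points[:, 1], points[:, 2], s=5)
ax_top.view_init(elev=90, azim=0)

# The current angles are stored as attributes of the 3D axes
print(ax_default.elev, ax_default.azim)
print(ax_top.elev, ax_top.azim)
```

With an elevation of 90 degrees the third coordinate is no longer visible, so the right panel looks like an ordinary 2D scatter plot of the first two coordinates.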

Putting everything together, the following is the complete code to create the 3D scatter plot in matplotlib:

from tensorflow.keras.datasets import mnist
from tensorflow import dtypes, tensordot
from tensorflow import convert_to_tensor, linalg, transpose
import numpy as np
import matplotlib.pyplot as plt

# Load dataset
(x_train, train_labels), (_, _) = mnist.load_data()
# Choose only the digits 0, 1, 2
total_classes = 3
ind = np.where(train_labels < total_classes)
x_train, train_labels = x_train[ind], train_labels[ind]
# Verify the shape of training data
total_examples, img_length, img_width = x_train.shape
print('Training data has ', total_examples, 'images')
print('Each image is of size ', img_length, 'x', img_width)

# Convert the dataset into a 2D array of shape 18623 x 784
x = convert_to_tensor(np.reshape(x_train, (x_train.shape[0], -1)),
                      dtype=dtypes.float32)
# Eigen-decomposition from a 784 x 784 matrix
eigenvalues, eigenvectors = linalg.eigh(tensordot(transpose(x), x, axes=1))
# Print the three largest eigenvalues
print('3 largest eigenvalues: ', eigenvalues[-3:])
# Project the data to eigenvectors
x_pca = tensordot(x, eigenvectors, axes=1)

# Create the plot
fig = plt.figure(figsize=(12, 8))
ax = plt.axes(projection='3d')
ax.view_init(elev=30, azim=-60)
plt_3d = ax.scatter3D(x_pca[:, -1], x_pca[:, -2], x_pca[:, -3], c=train_labels, s=1)
plt.colorbar(plt_3d)
plt.show()

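Creating scatter plots in Seaborn is similarly easy. The scatterplot() function automatically creates a legend and uses different symbols for different classes when plotting the points. By default, the plot is created on the "current axes" from matplotlib unless the axes object is specified by the ax argument.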

fig, ax = plt.subplots(figsize=(12, 8))
# Recent Seaborn releases accept x and y only as keyword arguments
sns.scatterplot(x=x_pca[:, -1].numpy(), y=x_pca[:, -2].numpy(),
                style=train_labels, hue=train_labels,
                palette=["red", "green", "blue"])
plt.title('First Two Dimensions of Projected Data After Applying PCA')
plt.show()

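2D scatter plot generated using Seaborn

The benefit of Seaborn over matplotlib is twofold. First, we have a polished default style. For example, if we compare the point style in the two scatter plots above, the Seaborn one has a border around each dot to prevent the many points from smudging together. Indeed, if we run the following line before calling any matplotlib functions: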

sns.set(style = "darkgrid")

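we can still use the matplotlib functions but get better-looking figures by using Seaborn's style. Second, it is more convenient to use Seaborn if we keep our data in a pandas DataFrame. As an example, let's convert our MNIST data from a tensor into a pandas DataFrame: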

df_mnist = pd.DataFrame(x_pca[:, -3:].numpy(), columns=["pca3","pca2","pca1"])
df_mnist["label"] = train_labels
print(df_mnist)

Now, the DataFrame looks like the following:

             pca3        pca2         pca1  label
0     -537.730103  926.885254  1965.881592      0
1      167.375885 -947.360107  1070.359375      1
2      553.685425 -163.121826  1754.754272      2
3     -642.905579 -767.283020  1053.937988      1
4     -651.812988 -586.034424   662.468201      1
...           ...         ...          ...    ...
18618  415.358948 -645.245972   853.439209      1
18619  754.555786    7.873116  1897.690552      2
18620 -321.809357  665.038086  1840.480225      0
18621  643.843628  -85.524895  1113.795166      2
18622   94.964279 -549.570984   561.743042      1

[18623 rows x 4 columns]

Then, we can reproduce Seaborn's scatter plot with the following:

fig, ax = plt.subplots(figsize=(12, 8))
sns.scatterplot(data=df_mnist, x="pca1", y="pca2",
                style="label", hue="label",
                palette=["red", "green", "blue"])
plt.title('First Two Dimensions of Projected Data After Applying PCA')
plt.show()

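Instead of passing arrays as coordinates to the scatterplot() function, we pass the DataFrame through the data argument and refer to its columns by name.

The following is the complete code to generate a scatter plot using Seaborn with the data stored in pandas: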

from tensorflow.keras.datasets import mnist
from tensorflow import dtypes, tensordot
from tensorflow import convert_to_tensor, linalg, transpose
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
(x_train, train_labels), (_, _) = mnist.load_data()
# Choose only the digits 0, 1, 2
total_classes = 3
ind = np.where(train_labels < total_classes)
x_train, train_labels = x_train[ind], train_labels[ind]
# Verify the shape of training data
total_examples, img_length, img_width = x_train.shape
print('Training data has ', total_examples, 'images')
print('Each image is of size ', img_length, 'x', img_width)

# Convert the dataset into a 2D array of shape 18623 x 784
x = convert_to_tensor(np.reshape(x_train, (x_train.shape[0], -1)),
                      dtype=dtypes.float32)
# Eigen-decomposition from a 784 x 784 matrix
eigenvalues, eigenvectors = linalg.eigh(tensordot(transpose(x), x, axes=1))
# Print the three largest eigenvalues
print('3 largest eigenvalues: ', eigenvalues[-3:])
# Project the data to eigenvectors
x_pca = tensordot(x, eigenvectors, axes=1)

# Making pandas DataFrame
df_mnist = pd.DataFrame(x_pca[:, -3:].numpy(), columns=["pca3","pca2","pca1"])
df_mnist["label"] = train_labels

# Create the plot
fig, ax = plt.subplots(figsize=(12, 8))
sns.scatterplot(data=df_mnist, x="pca1", y="pca2",
                style="label", hue="label",
                palette=["red", "green", "blue"])
plt.title('First Two Dimensions of Projected Data After Applying PCA')
plt.show()

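Seaborn, as a wrapper around some matplotlib functions, does not replace matplotlib entirely. Plotting in 3D, for example, is not supported by Seaborn, and we still need to resort to matplotlib functions for such purposes.

Scatter Plots in Bokeh

The plots created by matplotlib and Seaborn are static images. If you need to zoom in, pan, or toggle the display of some part of the plot, you should use Bokeh instead.

Creating scatter plots in Bokeh is also easy. The following code generates a scatter plot and adds a legend. The show() function from the Bokeh library opens a new browser window to display the image. You can interact with the plot by zooming, panning, scrolling, and so on, using the options shown in the toolbar next to the rendered plot. You can also hide part of the scatter by clicking on the legend.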

colormap = {0: "red", 1:"green", 2:"blue"}
my_scatter = figure(title="First Two Dimensions of Projected Data After Applying PCA", 
                    x_axis_label="Dimension 1",
                    y_axis_label="Dimension 2")
for digit in [0, 1, 2]:
    selection = x_pca[train_labels == digit]
    my_scatter.scatter(selection[:,-1].numpy(), selection[:,-2].numpy(),
                       color=colormap[digit], size=5,
                       legend_label="Digit "+str(digit))
my_scatter.legend.click_policy = "hide"
show(my_scatter)

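Bokeh produces the plot in HTML with JavaScript. All your actions to control the plot are handled by JavaScript functions. Its output looks like the following:

2D scatter plot generated using Bokeh in a new browser window. Note the various options on the right for interacting with the plot.

The following is the complete code to generate the above scatter plot using Bokeh: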

from tensorflow.keras.datasets import mnist
from tensorflow import dtypes, tensordot
from tensorflow import convert_to_tensor, linalg, transpose
import numpy as np
from bokeh.plotting import figure, show

# Load dataset
(x_train, train_labels), (_, _) = mnist.load_data()
# Choose only the digits 0, 1, 2
total_classes = 3
ind = np.where(train_labels < total_classes)
x_train, train_labels = x_train[ind], train_labels[ind]
# Verify the shape of training data
total_examples, img_length, img_width = x_train.shape
print('Training data has ', total_examples, 'images')
print('Each image is of size ', img_length, 'x', img_width)

# Convert the dataset into a 2D array of shape 18623 x 784
x = convert_to_tensor(np.reshape(x_train, (x_train.shape[0], -1)),
                      dtype=dtypes.float32)
# Eigen-decomposition from a 784 x 784 matrix
eigenvalues, eigenvectors = linalg.eigh(tensordot(transpose(x), x, axes=1))
# Print the three largest eigenvalues
print('3 largest eigenvalues: ', eigenvalues[-3:])
# Project the data to eigenvectors
x_pca = tensordot(x, eigenvectors, axes=1)

# Create scatter plot in Bokeh
colormap = {0: "red", 1:"green", 2:"blue"}
my_scatter = figure(title="First Two Dimensions of Projected Data After Applying PCA",
                    x_axis_label="Dimension 1",
                    y_axis_label="Dimension 2")
for digit in [0, 1, 2]:
    selection = x_pca[train_labels == digit]
    my_scatter.scatter(selection[:,-1].numpy(), selection[:,-2].numpy(),
                       color=colormap[digit], size=5, alpha=0.5,
                       legend_label="Digit "+str(digit))
my_scatter.legend.click_policy = "hide"
show(my_scatter)

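If you are rendering the Bokeh plot in a Jupyter notebook, you may see that the plot is produced in a new browser window. To put the plot in the Jupyter notebook, you need to tell Bokeh that you are in the notebook environment by running the following before calling any Bokeh plotting functions: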

from bokeh.io import output_notebook
output_notebook()

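Also, note that we create the scatter plot of the three digits in a loop, one digit at a time. This is required to make the legend interactive, since each call to scatter() creates a new object. If we create all the scatter points at once, as in the following, clicking on the legend will hide and show everything instead of only the points of one digit.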

colormap = {0: "red", 1:"green", 2:"blue"}
colors = [colormap[i] for i in train_labels]
my_scatter = figure(title="First Two Dimensions of Projected Data After Applying PCA", 
           x_axis_label="Dimension 1", y_axis_label="Dimension 2")
scatter_obj = my_scatter.scatter(x_pca[:, -1].numpy(), x_pca[:, -2].numpy(), color=colors, size=5)
legend = Legend(items=[
    LegendItem(label="Digit 0", renderers=[scatter_obj], index=0),
    LegendItem(label="Digit 1", renderers=[scatter_obj], index=1),
    LegendItem(label="Digit 2", renderers=[scatter_obj], index=2),
    ])
my_scatter.add_layout(legend)
my_scatter.legend.click_policy = "hide"
show(my_scatter)
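One way to check the "one object per call" behavior is to count the renderers attached to the figure. The sketch below does that with made-up points in place of the PCA output. Instead of show(), it renders the page to a string with file_html() so that no browser window is opened; that substitution is an assumption made to keep the example self-contained.

```python
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN
import numpy as np

# Made-up 2D points with labels 0, 1, 2 (stand-ins for the PCA output)
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=60)
xy = rng.normal(size=(60, 2)) + labels[:, None]

colormap = {0: "red", 1: "green", 2: "blue"}
p = figure(title="One renderer per class")
for digit in [0, 1, 2]:
    mask = labels == digit
    # Each call adds a separate glyph renderer and its own legend entry
    p.scatter(xy[mask, 0], xy[mask, 1], color=colormap[digit], size=5,
              legend_label="Digit " + str(digit))
p.legend.click_policy = "hide"

print(len(p.renderers))        # one renderer per scatter() call
print(len(p.legend[0].items))  # one legend item per label

# Render to a standalone HTML string instead of opening a browser
html = file_html(p, resources=CDN, title="Scatter demo")
```

Since every legend item points to its own renderer, clicking an entry toggles only the points of that class.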

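Preparation of Line Plot Data

Before we move on to show how to visualize line plot data, let's generate some data for illustration. Below is a simple classifier using the Keras library, which we train to learn handwritten digit classification. The History object returned by the fit() method stores the entire learning history of the training stage in its history attribute, which is a dictionary. To keep things simple, we'll train the model for 10 epochs.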

epochs = 10
y_train = utils.to_categorical(train_labels)
input_dim = img_length*img_width
# Create a Sequential model
model = Sequential()
# First layer for reshaping input images from 2D to 1D
model.add(Reshape((input_dim, ), input_shape=(img_length, img_width)))
# Dense layer of 8 neurons
model.add(Dense(8, activation='relu'))
# Output layer
model.add(Dense(total_classes, activation='softmax'))
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(x_train, y_train, validation_split=0.33, epochs=epochs, batch_size=10, verbose=0)
print('Learning history: ', history.history)

The code above prints a dictionary with the keys loss, accuracy, val_loss, and val_accuracy, as follows:

Output

Learning history:  {'loss': [0.5362154245376587, 0.08184114843606949, ...],
'accuracy': [0.9426144361495972, 0.9763565063476562, ...],
'val_loss': [0.09874073415994644, 0.07835448533296585, ...],
'val_accuracy': [0.9716889262199402, 0.9788480401039124, ...]}
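Since the history is an ordinary dictionary of lists with one entry per epoch, it can be inspected with plain Python and NumPy before any plotting. The following sketch uses made-up numbers in the same layout (they are not results from the model above, and the name history_dict is only for illustration):

```python
import numpy as np

# Made-up numbers laid out like history.history (not real training results)
history_dict = {
    "loss": [0.54, 0.08, 0.06, 0.05],
    "accuracy": [0.94, 0.97, 0.98, 0.98],
    "val_loss": [0.10, 0.08, 0.09, 0.07],
    "val_accuracy": [0.97, 0.98, 0.97, 0.98],
}

# Every list holds one value per epoch
n_epochs = len(history_dict["loss"])

# Epoch (counting from 0) with the lowest validation loss
best_epoch = int(np.argmin(history_dict["val_loss"]))
print(n_epochs, best_epoch)

# First and last value of each metric
for name, values in history_dict.items():
    print(f"{name:>12}: first={values[0]:.2f} last={values[-1]:.2f}")
```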

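Line Plots in matplotlib, Seaborn, and Bokeh

Let's look at various options for visualizing the learning history obtained from training our classifier.

Creating a multi-line plot in matplotlib is as simple as the following. We obtain the lists of training and validation accuracy values from the history and, by default, matplotlib treats them as sequential data (i.e., the x-coordinates are integers counting from 0).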

plt.plot(history.history['accuracy'], label="Training accuracy")
plt.plot(history.history['val_accuracy'], label="Validation accuracy")
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

Multi-line plot using matplotlib
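The default x-coordinates can be verified by asking the line object for its data. The sketch below plots the same made-up values twice, once with the implicit x-coordinates and once with epochs counted from 1 (the values and the Agg backend are assumptions for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np

values = [0.94, 0.97, 0.98, 0.98]  # made-up accuracies

fig, ax = plt.subplots()

# Only y is given: matplotlib fills in x = 0, 1, 2, ...
(line_default,) = ax.plot(values, label="implicit x")

# Both x and y are given: epochs are counted from 1 instead
epochs_axis = np.arange(1, len(values) + 1)
(line_explicit,) = ax.plot(epochs_axis, values, label="explicit x")
ax.legend()

print(line_default.get_xdata())
print(line_explicit.get_xdata())
```

Passing the x-coordinates explicitly is useful when the epochs should be labeled starting from 1 or when the values were not recorded at every epoch.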

The complete code for creating the multi-line plot is as follows:

from tensorflow.keras.datasets import mnist
from tensorflow.keras import utils
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Reshape
import numpy as np
import matplotlib.pyplot as plt

# Load dataset
(x_train, train_labels), (_, _) = mnist.load_data()
# Choose only the digits 0, 1, 2
total_classes = 3
ind = np.where(train_labels < total_classes)
x_train, train_labels = x_train[ind], train_labels[ind]
# Verify the shape of training data
total_examples, img_length, img_width = x_train.shape
print('Training data has ', total_examples, 'images')
print('Each image is of size ', img_length, 'x', img_width)

# Prepare for classifier network
epochs = 10
y_train = utils.to_categorical(train_labels)
input_dim = img_length*img_width
# Create a Sequential model
model = Sequential()
# First layer for reshaping input images from 2D to 1D
model.add(Reshape((input_dim, ), input_shape=(img_length, img_width)))
# Dense layer of 8 neurons
model.add(Dense(8, activation='relu'))
# Output layer
model.add(Dense(total_classes, activation='softmax'))
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(x_train, y_train, validation_split=0.33, epochs=epochs, batch_size=10, verbose=0)
print('Learning history: ', history.history)

# Plot accuracy in Matplotlib
plt.plot(history.history['accuracy'], label="Training accuracy")
plt.plot(history.history['val_accuracy'], label="Validation accuracy")
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

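Similarly, we can do the same in Seaborn. As we have seen in the case of the scatter plot, we can pass the data to Seaborn explicitly as a series of values or through a pandas DataFrame. Let's plot the training loss and validation loss using a pandas DataFrame: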

# Create pandas DataFrame
df_history = pd.DataFrame(history.history)
print(df_history)

# Plot using Seaborn
my_plot = sns.lineplot(data=df_history[["loss","val_loss"]])
my_plot.set_xlabel('Epochs')
my_plot.set_ylabel('Loss')
plt.legend(labels=["Training", "Validation"])
plt.title('Training and Validation Loss')
plt.show()

It will print the following table, which is the DataFrame we created from the history:

Output

       loss  accuracy  val_loss  val_accuracy
0  0.536215  0.942614  0.098741      0.971689
1  0.081841  0.976357  0.078354      0.978848
2  0.064002  0.978841  0.080637      0.972991
3  0.055695  0.981726  0.064659      0.979987
4  0.054693  0.984371  0.070817      0.983729
5  0.053512  0.985173  0.069099      0.977709
6  0.053916  0.983089  0.068139      0.979662
7  0.048681  0.985093  0.064914      0.977709
8  0.052084  0.982929  0.080508      0.971363
9  0.040484  0.983890  0.111380      0.982590

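The plot it generates is as follows:

Multi-line plot using Seaborn

By default, Seaborn takes the column labels from the DataFrame and uses them as the legend. In the example above, we provided a new label for each line. Moreover, the x-axis of the line plot is taken from the index of the DataFrame by default, which is the integers from 0 to 9 in our case.

The complete code for producing the plot in Seaborn is as follows: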

from tensorflow.keras.datasets import mnist
from tensorflow.keras import utils
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Reshape
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
(x_train, train_labels), (_, _) = mnist.load_data()
# Choose only the digits 0, 1, 2
total_classes = 3
ind = np.where(train_labels < total_classes)
x_train, train_labels = x_train[ind], train_labels[ind]
# Verify the shape of training data
total_examples, img_length, img_width = x_train.shape
print('Training data has ', total_examples, 'images')
print('Each image is of size ', img_length, 'x', img_width)

# Prepare for classifier network
epochs = 10
y_train = utils.to_categorical(train_labels)
input_dim = img_length*img_width
# Create a Sequential model
model = Sequential()
# First layer for reshaping input images from 2D to 1D
model.add(Reshape((input_dim, ), input_shape=(img_length, img_width)))
# Dense layer of 8 neurons
model.add(Dense(8, activation='relu'))
# Output layer
model.add(Dense(total_classes, activation='softmax'))
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(x_train, y_train, validation_split=0.33, epochs=epochs, batch_size=10, verbose=0)

# Prepare pandas DataFrame
df_history = pd.DataFrame(history.history)
print(df_history)

# Plot loss in seaborn
my_plot = sns.lineplot(data=df_history[["loss","val_loss"]])
my_plot.set_xlabel('Epochs')
my_plot.set_ylabel('Loss')
plt.legend(labels=["Training", "Validation"])
plt.title('Training and Validation Loss')
plt.show()

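As you may expect, we can also pass the arguments x and y together with data to lineplot(), as in our Seaborn scatter plot example above, if we want to control the x- and y-coordinates precisely.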

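Bokeh can also generate multi-line plots, as illustrated in the code below. As we saw in the scatter plot example, we need to provide the x- and y-coordinates explicitly and draw one line at a time. Again, the show() function opens a new browser window to display the plot, and you can interact with it.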

p = figure(title="Training and validation accuracy",
           x_axis_label="Epochs", y_axis_label="Accuracy")
epochs_array = np.arange(epochs)
p.line(epochs_array, df_history['accuracy'], legend_label="Training",
       color="blue", line_width=2)
p.line(epochs_array, df_history['val_accuracy'], legend_label="Validation",
       color="green")
p.legend.click_policy = "hide"
p.legend.location = 'bottom_right'
show(p)

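Multi-line plot using Bokeh. Note the options for user interaction shown on the toolbar on the right.

The complete code for making the Bokeh plot is as follows: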

from tensorflow.keras.datasets import mnist
from tensorflow.keras import utils
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Reshape
import numpy as np
import pandas as pd
from bokeh.plotting import figure, show

# Load dataset
(x_train, train_labels), (_, _) = mnist.load_data()
# Choose only the digits 0, 1, 2
total_classes = 3
ind = np.where(train_labels < total_classes)
x_train, train_labels = x_train[ind], train_labels[ind]
# Verify the shape of training data
total_examples, img_length, img_width = x_train.shape
print('Training data has ', total_examples, 'images')
print('Each image is of size ', img_length, 'x', img_width)

# Prepare for classifier network
epochs = 10
y_train = utils.to_categorical(train_labels)
input_dim = img_length*img_width
# Create a Sequential model
model = Sequential()
# First layer for reshaping input images from 2D to 1D
model.add(Reshape((input_dim, ), input_shape=(img_length, img_width)))
# Dense layer of 8 neurons
model.add(Dense(8, activation='relu'))
# Output layer
model.add(Dense(total_classes, activation='softmax'))
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(x_train, y_train, validation_split=0.33, epochs=epochs, batch_size=10, verbose=0)

# Prepare pandas DataFrame
df_history = pd.DataFrame(history.history)
print(df_history)

# Plot accuracy in Bokeh
p = figure(title="Training and validation accuracy",
           x_axis_label="Epochs", y_axis_label="Accuracy")
epochs_array = np.arange(epochs)
p.line(epochs_array, df_history['accuracy'], legend_label="Training",
       color="blue", line_width=2)
p.line(epochs_array, df_history['val_accuracy'], legend_label="Validation",
       color="green")
p.legend.click_policy = "hide"
p.legend.location = 'bottom_right'
show(p)

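Further Reading

This section provides more resources on the topic if you want to go deeper.

Books

Think Python: How to Think Like a Computer Scientist, by Allen B. Downey
Programming in Python 3: A Complete Introduction to the Python Language, by Mark Summerfield
Python Programming: An Introduction to Computer Science, by John Zelle
Python for Data Analysis, 2nd edition, by Wes McKinney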

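Summary

In this tutorial, you discovered various options for data visualization in Python.

Specifically, you learned:

How to create subplots in different rows and columns
How to render images using matplotlib
How to generate 2D and 3D scatter plots using matplotlib
How to create 2D plots using Seaborn and Bokeh
How to create multi-line plots using matplotlib, Seaborn, and Bokeh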
