NumPy-秘籍中文第二版-三-NumPy 秘籍中文第二版（三）九、使用 Cython 加速代码在本章中，我们将介绍

NumPy 秘籍中文第二版（三）

原文：NumPy Cookbook - Second Edition

协议：CC BY-NC-SA 4.0

九、使用 Cython 加速代码

在本章中，我们将介绍以下秘籍：

安装 Cython
构建 HelloWorld 程序
将 Cython 与 NumPy 结合使用
调用 C 函数
分析 Cython 代码
用 Cython 近似阶乘

简介

Cython 是基于 Python 的相对年轻的编程语言。它允许编码人员将 C 的速度与 Python 的功能混合在一起。与 Python 的区别在于我们可以选择声明静态类型。许多编程语言（例如 C）具有静态类型，这意味着我们必须告诉 C 变量的类型，函数参数和返回值类型。另一个区别是 C 是一种编译语言，而 Python 是一种解释语言。根据经验，可以说 C 比 Python 更快，但灵活性更低。通过 Cython 代码，我们可以生成 C 或 C++ 代码。之后，我们可以将生成的代码编译为 Python 扩展模块。

在本章中，您将学习 Cython。我们将获得一些与 NumPy 一起运行的简单 Cython 程序。另外，我们将介绍 Cython 代码。

安装 Cython

为了使用 Cython，我们需要安装它。 Enthought Canopy，Anaconda 和 Sage 发行版包括 Cython。有关更多信息，请参见这里，这里和这里。我们将不在这里讨论如何安装这些发行版。显然，我们需要一个 C 编译器来编译生成的 C 代码。在某些操作系统（例如 Linux）上，编译器将已经存在。在本秘籍中，我们将假定您已经安装了编译器。

操作步骤

我们可以使用以下任何一种方法来安装 Cython：

通过执行以下步骤从源存档中安装 Cython ：
- 下载源归档文件。
- 打开包装。
- 使用cd命令浏览到目录。
- 运行以下命令：
```
$ python setup.py install
```
使用以下任一命令从 PyPI 存储库安装 Cython：
```
$ easy_install cython
$ sudo pip install cython
```
使用非官方 Windows 安装程序，在 Windows 上安装 Cython。

另见

相关的 Cython 在线文档

构建 HelloWorld 程序

与编程语言的传统一样，我们将从 HelloWorld 示例开始。与 Python 不同，我们需要编译 Cython 代码。我们从.pyx文件开始，然后从中生成 C 代码。可以编译此.c文件，然后将其导入 Python 程序中。

操作步骤

本节介绍如何构建 Cython HelloWorld 程序：

首先，编写一些非常简单的代码以显示Hello World。这只是普通的 Python 代码，但文件具有pyx扩展名：
```
def say_hello():
  print "Hello World!"
```

创建一个名为setup.py的文件来帮助构建 Cython 代码：

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules = [Extension("hello", ["hello.pyx"])]

setup(
        name = 'Hello world app',
        cmdclass = {'build_ext': build_ext},
        ext_modules = ext_modules
     )

如您所见，我们在上一步中指定了文件，并为应用命名。

使用以下命令进行构建：

$ python setup.py build_ext --inplace

这将生成 C 代码，将其编译为您的平台，并产生以下输出：

running build_ext
cythoning hello.pyx to hello.c
building 'hello' extension
creating build

现在，我们可以使用以下语句导入模块：

from hello import say_hello

工作原理

在此秘籍中，我们创建了一个传统的 HelloWorld 示例。 Cython 是一种编译语言，因此我们需要编译代码。我们编写了一个包含Hello World代码的.pyx文件和一个用于生成和构建 C 代码的setup.py文件。

另见

相关的 Cython 的在线文档

将 Cython 与 NumPy 结合使用

我们可以集成 Cython 和 NumPy 代码，就像可以集成 Cython 和 Python 代码一样。让我们来看一个示例，该示例分析股票的上涨天数（股票收盘价高于前一日的天数）的比率。我们将应用二项式比值置信度的公式。您可以参考这里了解更多信息。以下公式说明该比率的重要性：

Using Cython with NumPy

式中，p是概率，n是观察数。

操作步骤

本节通过以下步骤介绍如何将 Cython 与 NumPy 结合使用：

编写一个.pyx文件，其中包含一个函数，该函数可计算上升天数的比率和相关的置信度。首先，此函数计算价格之间的差异。然后，它计算出正差的数量，从而得出上升天数的比率。最后，在引言中的维基百科页面上应用置信度公式：
```
import numpy as np

def pos_confidence(numbers):
   diffs = np.diff(numbers)
   n = float(len(diffs))
   p = len(diffs[diffs > 0])/n
   confidence = np.sqrt(p * (1 - p)/ n)

   return (p, confidence)
```

使用上一个示例中的setup.py文件作为模板。更改明显的内容，例如.pyx文件的名称：

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules = [Extension("binomial_proportion", ["binomial_proportion.pyx"])]

setup(
        name = 'Binomial proportion app',
        cmdclass = {'build_ext': build_ext},
        ext_modules = ext_modules
     )

我们现在可以建立；有关更多详细信息，请参见前面的秘籍。

构建后，通过导入使用上一步中的 Cython 模块。我们将编写一个 Python 程序，使用matplotlib下载股价数据。然后我们将confidence()函数应用于收盘价：

from matplotlib.finance import quotes_historical_yahoo
from datetime import date
import numpy
import sys
from binomial_proportion import pos_confidence

#1\. Get close prices.
today = date.today()
start = (today.year - 1, today.month, today.day)

quotes = quotes_historical_yahoo(sys.argv[1], start, today)
close =  numpy.array([q[4] for q in quotes])
print pos_confidence(close)

AAPL程序的输出如下：

(0.56746031746031744, 0.031209043355655924)

工作原理

我们计算了APL股价上涨的可能性和相应的置信度。我们通过创建 Cython 模块，将 NumPy 代码放入.pyx文件中，并按照上一教程中的步骤进行构建。最后，我们导入并使用了 Cython 模块。

另见

有关 Cython 的在线文档

调用 C 函数

我们可以从 Cython 调用 C 函数。在此示例中，我们调用 C log()函数。此函数仅适用于单个数字。请记住，NumPy log()函数也可以与数组一起使用。我们将计算股票价格的所谓对数回报。

操作步骤

我们首先编写一些 Cython 代码：

首先，从libc命名空间导入 C 的log()函数。然后，将此函数应用于for循环中的数字。最后，使用 NumPy diff()函数在第二步中获取对数值之间的一阶差：
```
from libc.math cimport log
import numpy as np

def logrets(numbers):
   logs = [log(x) for x in numbers] 
   return np.diff(logs)
```
先前的秘籍中已经介绍了架构。我们只需要更改setup.py文件中的某些值。

再次使用 matplotlib 下载股价数据。应用您刚刚在价格上创建的 Cython logrets()函数并绘制结果：

from matplotlib.finance import quotes_historical_yahoo
from datetime import date
import numpy as np
from log_returns import logrets
import matplotlib.pyplot as plt

today = date.today()
start = (today.year - 1, today.month, today.day)

quotes = quotes_historical_yahoo('AAPL', start, today)
close =  np.array([q[4] for q in quotes])
plt.plot(logrets(close))
plt.title('Logreturns of AAPL for the previous year')
plt.xlabel('Days')
plt.ylabel('Log returns')
plt.grid()
plt.show()

的AAPL对数回报结果图类似于以下屏幕截图所示：

How to do it...

工作原理

我们从 Cython 代码中调用了 C log()函数。该函数与 NumPy 函数一起用于计算股票的对数收益。这样，我们可以创建自己的包含便利函数的专用 API。令人高兴的是，我们的代码应该或多或少地像 Python 代码一样，以与 C 代码差不多的速度执行。

另见

相关的 Cython 在线文档位于

分析 Cython 代码

我们将使用以下公式对 Cython 和 NumPy 代码进行剖析，这些代码试图近似于欧拉常数：

Profiling the Cython code

有关更多背景信息，请参见。

操作步骤

本节演示如何通过以下步骤来分析 Cython 代码：

对于e的 NumPy 近似值，请按照下列步骤操作：

首先，我们将创建一个1到n的数组（在我们的示例中n是40）。

然后，我们将计算该数组的累积乘积，该乘积恰好是阶乘。在那之后，我们采取阶乘的倒数。最后，我们从维基百科页面应用公式。最后，我们放入标准配置代码，为我们提供以下程序：

from __future__ import print_function
import numpy as np
import cProfile
import pstats

def approx_e(n=40, display=False):
   # array of [1, 2, ... n-1]
   arr = np.arange(1, n)

   # calculate the factorials and convert to floats
   arr = arr.cumprod().astype(float)

   # reciprocal 1/n
   arr = np.reciprocal(arr)

   if display:
    print(1 + arr.sum())

# Repeat multiple times because NumPy is so fast
def run(repeat=2000):
    for i in range(repeat):
        approx_e()

cProfile.runctx("run()", globals(), locals(), "Profile.prof")

s = pstats.Stats("Profile.prof")
s.strip_dirs().sort_stats("time").print_stats()

approx_e(display=True)

以下代码段显示了e的近似值的分析输出和结果。有关概要分析输出的更多信息，请参见第 7 章，“分析和调试”。

 8004 function calls in 0.016 seconds

 Ordered by: internal time

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 2000    0.007    0.000    0.015    0.000 numpy_approxe.py:6(approx_e)
 2000    0.004    0.000    0.004    0.000 {method 'cumprod' of 'numpy.ndarray' objects}
 2000    0.002    0.000    0.002    0.000 {numpy.core.multiarray.arange}
 2000    0.002    0.000    0.002    0.000 {method 'astype' of 'numpy.ndarray' objects}
 1    0.001    0.001    0.016    0.016 numpy_approxe.py:20(run)
 1    0.000    0.000    0.000    0.000 {range}
 1    0.000    0.000    0.016    0.016 <string>:1(<module>)
 1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

2.71828182846

Cython 代码使用与上一步所示相同的算法，但是实现方式不同。便利函数较少，实际上我们现在需要一个for循环。另外，我们需要为某些变量指定类型。 .pyx文件的代码如下所示：

def approx_e(int n=40, display=False):
    cdef double sum = 0.
    cdef double factorial = 1.
    cdef int k

    for k in xrange(1,n+1):
        factorial *= k
        sum += 1/factorial

    if display:
        print(1 + sum)

以下 Python 程序导入 Cython 模块并进行一些分析：

import pstats
import cProfile
import pyximport
pyximport.install()

import approxe

# Repeat multiple times because Cython is so fast
def run(repeat=2000):
    for i in range(repeat):
        approxe.approx_e()

cProfile.runctx("run()", globals(), locals(), "Profile.prof")

s = pstats.Stats("Profile.prof")
s.strip_dirs().sort_stats("time").print_stats()

approxe.approx_e(display=True)

这是 Cython 代码的分析输出：

 2004 function calls in 0.001 seconds

 Ordered by: internal time

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 2000    0.001    0.000    0.001    0.000 {approxe.approx_e}
 1    0.000    0.000    0.001    0.001 cython_profile.py:9(run)
 1    0.000    0.000    0.000    0.000 {range}
 1    0.000    0.000    0.001    0.001 <string>:1(<module>)
 1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

2.71828182846

工作原理

我们分析了 NumPy 和 Cython 代码。 NumPy 已针对速度进行了优化，因此 NumPy 和 Cython 程序都是高性能程序，我们对此不会感到惊讶。但是，当比较 2,000 次近似代码的总时间时，我们意识到 NumPy 需要 0.016 秒，而 Cython 仅需要 0.001 秒。显然，实际时间取决于硬件，操作系统和其他因素，例如计算机上运行的其他进程。同样，提速取决于代码类型，但我希望您同意，根据经验，Cython 代码会更快。

另见

相关的 Cython 在线文档

Cython 的近似阶乘

最后一个示例是 Cython 的近似阶乘。我们将使用两种近似方法。首先，我们将应用斯特林近似方法。斯特林近似的公式如下：

Approximating factorials with Cython

其次，我们将使用 Ramanujan 的近似值，并使用以下公式：

Approximating factorials with Cython

操作步骤

本节介绍如何使用 Cython 近似阶乘。您可能还记得，在本秘籍中，我们使用在 Cython 中可选的类型。从理论上讲，声明静态类型应加快速度。静态类型化提供了一些有趣的挑战，这些挑战在编写 Python 代码时可能不会遇到，但请不要担心。我们将尝试使其简单：

除了将函数参数和一个局部变量声明为ndarray数组外，我们将编写的 Cython 代码类似于常规的 Python 代码。为了使静态类型起作用，我们需要cimport NumPy。另外，我们必须使用cdef关键字声明局部变量的类型：

import numpy
cimport numpy

def ramanujan_factorial(numpy.ndarray n):
   sqrt_pi = numpy.sqrt(numpy.pi, dtype=numpy.float64)
   cdef numpy.ndarray root = (8 * n + 4) * n + 1 
   root = root * n + 1/30.
   root = root ** (1/6.)
   return sqrt_pi * calc_eton(n) * root

def stirling_factorial(numpy.ndarray n):
    return numpy.sqrt(2 * numpy.pi * n) * calc_eton(n)

def calc_eton(numpy.ndarray n):
    return (n/numpy.e) ** n

如先前的教程所示，构建需要我们创建一个setup.py文件，但是现在我们通过调用get_include()函数来包含与 NumPy 相关的目录。通过此修订，setup.py文件具有以下内容：

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy

ext_modules = [Extension("factorial", ["factorial.pyx"], include_dirs = [numpy.get_include()])] 

setup(
        name = 'Factorial app',
        cmdclass = {'build_ext': build_ext},
        ext_modules = ext_modules
     )

绘制两种近似方法的相对误差。像我们在整本书中所做的那样，将使用 NumPy cumprod()函数计算相对于阶乘值的误差：

from factorial import ramanujan_factorial
from factorial import stirling_factorial
import numpy as np
import matplotlib.pyplot as plt

N = 50
numbers = np.arange(1, N)
factorials = np.cumprod(numbers, dtype=float)

def error(approximations):
   return (factorials - approximations)/factorials

plt.plot(error(ramanujan_factorial(numbers)), 'b-', label='Ramanujan')
plt.plot(error(stirling_factorial(numbers)), 'ro', label='Stirling')
plt.title('Factorial approximation relative errors')
plt.xlabel('n')
plt.ylabel('Relative error')
plt.grid()
plt.legend(loc='best')
plt.show()

下图显示了 Ramanujan 近似值（点）和 Stirling 近似值（线）的相对误差：

How to do it...

工作原理

在此示例中，我们看到了 Cython 静态类型的演示。该秘籍的主要成分如下：

cimport，它导入 C 声明
包含具有get_include()函数的目录
cdef关键字，用于定义局部变量的类型

另见

相关的 Cython 在线文档

十、Scikits 的乐趣

在本章中，我们将介绍以下秘籍：

安装 scikit-learn
加载示例数据集
用 scikit-learn 对道琼斯股票进行聚类
安装 Statsmodels
使用 Statsmodels 执行正态性检验
安装 scikit-image
检测角点
检测边界
安装 Pandas
使用 Pandas 估计股票收益的相关性
从 Statsmodels 中将数据作为 pandas 对象加载
重采样时间序列数据

简介

Scikits 是小型的独立项目，以某种方式与 SciPy 相关，但不属于 SciPy。这些项目不是完全独立的，而是作为一个联合体在伞下运行的。在本章中，我们将讨论几个 Scikits 项目，例如：

scikit-learn，机器学习包
Statsmodels，统计数据包
scikit-image，图像处理包
Pandas，数据分析包

安装 scikit-learn

scikit-learn 项目旨在提供一种用于机器学习的 API。我最喜欢的是令人惊叹的文档。我们可以使用操作系统的包管理器安装 scikit-learn。根据操作系统的不同，此选项可能可用也可能不可用，但它应该是最方便的方法。

Windows 用户只需从项目网站下载安装程序即可。在 Debian 和 Ubuntu 上，该项目称为python-sklearn。在 MacPorts 上，这些端口称为py26-scikits-learn和py27-scikits-learn。我们也可以从源代码或使用easy_install安装。 PythonXY，Enthought 和 NetBSD 提供了第三方发行版。

准备

您需要安装 SciPy 和 NumPy。返回第 1 章，“使用 IPython”，以获取必要的说明。

操作步骤

现在让我们看一下如何安装 scikit-learn 项目：

使用easy_install进行安装：在命令行中键入以下命令之一：
```
$ pip install -U scikit-learn
$ easy_install -U scikit-learn
```
由于权限问题，这个可能无法工作，因此您可能需要在命令前面编写sudo，或以管理员身份登录。
从源代码安装：下载源代码，解压缩并使用cd进入下载的文件夹。然后发出以下命令：
```
$ python setup.py install
```

加载示例数据集

scikit-learn 项目附带了许多我们可以尝试的数据集和样例图像。在本秘籍中，我们将加载 scikit-learn 分发中包含的示例数据集。数据集将数据保存为 NumPy 二维数组，并将元数据链接到该数据。

操作步骤

我们将加载波士顿房价样本数据集。这是一个很小的数据集，因此，如果您要在波士顿寻找房子，请不要太兴奋！其他数据集在这个页面中进行了描述。

我们将查看原始数据的形状及其最大值和最小值。形状是一个元组，表示 NumPy 数组的大小。我们将对目标数组执行相同的操作，其中包含作为学习目标（确定房价）的值。以下sample_data.py 中的代码实现了我们的目标：

from __future__ import print_function
from sklearn import datasets

boston_prices = datasets.load_boston()
print("Data shape", boston_prices.data.shape)
print("Data max=%s min=%s" % (boston_prices.data.max(), boston_prices.data.min()))
print("Target shape", boston_prices.target.shape)
print("Target max=%s min=%s" % (boston_prices.target.max(), boston_prices.target.min()))

我们计划的结果如下：

Data shape (506, 13)
Data max=711.0 min=0.0
Target shape (506,)
Target max=50.0 min=5.0

利用 scikits-learn 对道琼斯股票进行聚类

聚类是一种机器学习算法，旨在基于相似度对项目进行分组。在此示例中，我们将使用道琼斯工业平均指数（DJI 或 DJIA）进行聚类。本秘籍的大多数步骤已通过前面各章的审查。

操作步骤

首先，我们将从 Yahoo 金融下载这些股票的 EOD 价格数据。然后，我们将计算平方亲和矩阵。最后，我们将股票与AffinityPropagation类聚类：

使用 DJI 指数的股票代码下载 2011 年价格数据。在此示例中，我们仅对收盘价感兴趣：

# 2011 to 2012
start = datetime.datetime(2011, 01, 01)
end = datetime.datetime(2012, 01, 01)

#Dow Jones symbols
symbols = ["AA", "AXP", "BA", "BAC", "CAT",
   "CSCO", "CVX", "DD", "DIS", "GE", "HD",
   "HPQ", "IBM", "INTC", "JNJ", "JPM", 
   "KO", "MCD", "MMM", "MRK", "MSFT", "PFE",
   "PG", "T", "TRV", "UTX", "VZ", "WMT", "XOM"]

quotes = []

for symbol in symbols:
    try:
        quotes.append(finance.quotes_historical_yahoo(symbol, start, end, asobject=True))
    except urllib2.HTTPError as e:
        print(symbol, e)

close = np.array([q.close for q in quotes]).astype(np.float)
print(close.shape)

使用对数收益率作为指标来计算不同股票之间的相似度。我们正在要做的是计算数据点的欧几里德距离：

logreturns = np.diff(np.log(close))
print(logreturns.shape)

logreturns_norms = np.sum(logreturns ** 2, axis=1)
S = - logreturns_norms[:, np.newaxis] - logreturns_norms[np.newaxis, :] + 2 * np.dot(logreturns, logreturns.T)

给AffinityPropagation类上一步的结果。此类为数据点或在我们的情况下的股票标记了适当的群集编号：

aff_pro = sklearn.cluster.AffinityPropagation().fit(S)
labels = aff_pro.labels_

for symbol, label in zip(symbols, labels):
    print('%s in Cluster %d' % (symbol, label))

完整的群集程序如下：

from __future__ import print_function
import datetime
import numpy as np
import sklearn.cluster
from matplotlib import finance
import urllib2

#1\. Download price data

# 2011 to 2012
start = datetime.datetime(2011, 01, 01)
end = datetime.datetime(2012, 01, 01)

#Dow Jones symbols
symbols = ["AA", "AXP", "BA", "BAC", "CAT",
   "CSCO", "CVX", "DD", "DIS", "GE", "HD",
   "HPQ", "IBM", "INTC", "JNJ", "JPM", 
   "KO", "MCD", "MMM", "MRK", "MSFT", "PFE",
   "PG", "T", "TRV", "UTX", "VZ", "WMT", "XOM"]

quotes = []

for symbol in symbols:
    try:
        quotes.append(finance.quotes_historical_yahoo(symbol, start, end, asobject=True))
    except urllib2.HTTPError as e:
        print(symbol, e)

close = np.array([q.close for q in quotes]).astype(np.float)
print(close.shape)

#2\. Calculate affinity matrix
logreturns = np.diff(np.log(close))
print(logreturns.shape)

logreturns_norms = np.sum(logreturns ** 2, axis=1)
S = - logreturns_norms[:, np.newaxis] - logreturns_norms[np.newaxis, :] + 2 * np.dot(logreturns, logreturns.T)

#3\. Cluster using affinity propagation
aff_pro = sklearn.cluster.AffinityPropagation().fit(S)
labels = aff_pro.labels_

for symbol, label in zip(symbols, labels):
    print('%s in Cluster %d' % (symbol, label))

具有每个股票的簇号的输出如下：

(29, 252)
(29, 251)
AA in Cluster 0
AXP in Cluster 6
BA in Cluster 6
BAC in Cluster 1
CAT in Cluster 6
CSCO in Cluster 2
CVX in Cluster 7
DD in Cluster 6
DIS in Cluster 6
GE in Cluster 6
HD in Cluster 5
HPQ in Cluster 3
IBM in Cluster 5
INTC in Cluster 6
JNJ in Cluster 5
JPM in Cluster 4
KO in Cluster 5
MCD in Cluster 5
MMM in Cluster 6
MRK in Cluster 5
MSFT in Cluster 5
PFE in Cluster 7
PG in Cluster 5
T in Cluster 5
TRV in Cluster 5
UTX in Cluster 6
VZ in Cluster 5
WMT in Cluster 5
XOM in Cluster 7

工作原理

下表是在秘籍中使用的函数的概述：

函数	描述
`sklearn.cluster.AffinityPropagation()`	创建一个`AffinityPropagation`对象。
`sklearn.cluster.AffinityPropagation.fit()`	从欧几里得距离计算亲和度矩阵，并应用亲和度传播聚类。
`diff()`	计算 NumPy 数组中数字的差。如果未指定，则计算一阶差。
`log()`	计算 NumPy 数组中元素的自然对数。
`sum()`	对 NumPy 数组的元素求和。
`dot()`	这对二维数组执行矩阵乘法。它还计算一维数组的内积。

另见

相关文档

安装 Statsmodels

statsmodels包专注于统计建模。我们可以将其与 NumPy 和 pandas 集成（在本章稍后的内容中将有更多关于 pandas 的信息）。

操作步骤

可以从这里下载源码和二进制文件。如果要从源代码安装，请运行以下命令：

$ python setup.py install

如果使用setuptools，则命令如下：

$ easy_install statsmodels

使用 Statsmodels 执行正态性检验

statsmodels包具有许多统计检验。我们将看到这样一个检验的示例 -- Anderson-Darling 检验用于正态性。

操作步骤

我们将像以前的秘籍一样下载价格数据，但这一次是单只股票。再次，我们将计算该股票收盘价的对数收益，并将其用作正态性检验函数的输入。

此函数返回一个包含第二个元素的元组，即 p 值，介于 0 和 1 之间。本教程的完整代码如下：

from __future__ import print_function
import datetime
import numpy as np
from matplotlib import finance
from statsmodels.stats.adnorm import normal_ad

#1\. Download price data

# 2011 to 2012
start = datetime.datetime(2011, 01, 01)
end = datetime.datetime(2012, 01, 01)

quotes = finance.quotes_historical_yahoo('AAPL', start, end, asobject=True)

close = np.array(quotes.close).astype(np.float)
print(close.shape)

print(normal_ad(np.diff(np.log(close))))

#Retrieving data for AAPL
#(252,)
#(0.57103805516803163, 0.13725944999430437)

下面显示了0.13为p-value的脚本的输出：

Retrieving data for AAPL
(252,)
(0.57103805516803163, 0.13725944999430437)

工作原理

如statsmodels中所示，此秘籍证明了 Anderson-Darling 统计检验的正态性。我们使用没有正态分布的股票价格数据作为输入。对于数据，我们获得了0.13的 p 值。由于概率在 0 到 1 之间，这证实了我们的假设。

安装 scikit-image

scikit-image 是用于图像处理的工具包，它依赖 PIL，SciPy，Cython 和 NumPy。 Windows 安装程序也可用。该工具包是 Enthought Python 发行版以及 PythonXY 发行版的一部分。

操作步骤

与往常一样，使用以下两个命令之一安装 scikit-image：

$ pip install -U scikit-image
$ easy_install -U scikit-image

同样，您可能需要以超级用户身份运行这些命令。

另一种选择是通过克隆 Git 存储库或从 Github 下载该存储库作为源归档来获取最新的开发版本。然后运行以下命令：

$ python setup.py install

检测角点

角点检测是计算机视觉中的标准技术。 scikit-image 提供了 Harris 角点检测器，它很棒，因为角检测非常复杂。显然，我们可以从头开始做，但是这违反了不重新发明轮子的基本原则。

准备

您可能需要在系统上安装jpeglib，才能加载 scikit-learn 图像（是 JPEG 文件）。如果您使用的是 Windows，请使用安装程序。否则，下载发行版，解压缩它，并使用以下命令从顶部文件夹中进行构建：

$ ./configure
$  make
$  sudo make install

操作步骤

我们将从 scikit-learn 加载示例图像。对于此示例，这不是绝对必要的；您可以改用其他任何图像：

scikit-learn 当前在数据集结构中有两个样例 JPEG 图像。只看第一张图片：
```
dataset = load_sample_images()
img = dataset.images[0]
```
自本书第一版以来，API 发生了变化。例如，对于 scikit-image 0.11.2，我们需要首先将彩色图像的值转换为灰度值。图像灰度如下：
```
gray_img = rgb2gray(img)
```

调用corner_harris()函数获取角点的坐标：

harris_coords = corner_peaks(corner_harris(gray_img))
y, x = np.transpose(harris_coords)

拐角检测的代码如下：

from sklearn.datasets import load_sample_images
import matplotlib.pyplot as plt
import numpy as np
from skimage.feature import corner_harris
from skimage.feature import corner_peaks
from skimage.color import rgb2gray

dataset = load_sample_images()
img = dataset.images[0]
gray_img = rgb2gray(img)
harris_coords = corner_peaks(corner_harris(gray_img))
y, x = np.transpose(harris_coords)
plt.axis('off')
plt.imshow(img)
plt.plot(x, y, 'ro')
plt.show()

我们得到一个带有点的图像，脚本在其中检测角点，如以下屏幕截图所示：

How to do it...

工作原理

我们对 scikit-image 的样例图像应用了 Harris 角点检测。如您所见，结果非常好。我们只能使用 NumPy 做到这一点，因为它只是一个简单的线性代数类型的计算。仍然，可能会变得凌乱。 scikit-image 工具包具有更多类似的功能，因此，如果需要图像处理例程，请查看 scikit-image 文档。另外请记住，API 可能会发生快速变化。

另见

相关的 scikit-image 文档

检测边界

边界检测是另一种流行的图像处理技术。 scikit-image 具有基于高斯分布的标准差的 Canny 过滤器实现，可以开箱即用地执行边界检测。除了将图像数据作为 2D 数组外，此过滤器还接受以下参数：

高斯分布的标准差
下限阈值
上限阈值

操作步骤

我们将使用与先前秘籍相同的图像。代码几乎相同（请参见edge_detection.py）。请特别注意我们称为 Canny 筛选器函数的行：

from sklearn.datasets import load_sample_images
import matplotlib.pyplot as plt
import skimage.feature

dataset = load_sample_images()
img = dataset.images[0] 
edges = skimage.feature.canny(img[..., 0])
plt.axis('off')
plt.imshow(edges)
plt.show()

该代码生成原始图像中边界的图像，如以下屏幕截图所示：

How to do it...

另见

相关文档

安装 Pandas

Pandas 是用于数据分析的 Python 库。它与 R 编程语言有一些相似之处，并非偶然。 R 是一种受数据科学家欢迎的专业编程语言。例如，R 启发了 Pandas 的核心DataFrame对象。

操作步骤

在 PyPi 上，该项目称为pandas。因此，您可以运行以下命令之一：

$ sudo easy_install -U pandas
$ pip install pandas

如果使用 Linux 包管理器，则需要安装python-pandas项目。在 Ubuntu 上，执行以下操作：

$ sudo apt-get install python-pandas

您也可以从源代码安装（除非下载源代码存档，否则需要 Git）：

$ git clone git://github.com/pydata/pandas.git
$ cd pandas
$ python setup.py install

另见

相关文档

使用 Pandas 估计股票收益的相关性

Pandas DataFrame是类似矩阵和字典的数据结构，类似于 R 中提供的功能。实际上，它是 Pandas 的中心数据结构，您可以应用各种操作。例如，查看投资组合的相关矩阵是很常见的，所以让我们开始吧。

操作步骤

首先，我们将为每个符号的每日对数回报创建带有 Pandas 的DataFrame。然后，我们将在约会中加入这些。最后，将打印相关性，并显示一个图：

要创建数据框，请创建一个包含股票代码作为键的字典，并将相应的日志作为值返回。数据框本身以日期作为索引，将股票代码作为列标签：

data = {}

for i, symbol in enumerate(symbols):
   data[symbol] = np.diff(np.log(close[i]))

# Convention: import pandas as pd
df = pd.DataFrame(data, index=dates[0][:-1], columns=symbols)

现在，我们可以执行诸如计算相关矩阵或在数据帧上绘制等操作：

print(df.corr())
df.plot()

完整的源代码（也可以下载价格数据）如下：

from __future__ import print_function
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
from matplotlib import finance
import numpy as np

# 2011 to 2012
start = datetime(2011, 01, 01)
end = datetime(2012, 01, 01)

symbols = ["AA", "AXP", "BA", "BAC", "CAT"]

quotes = [finance.quotes_historical_yahoo(symbol, start, end, asobject=True)
          for symbol in symbols]

close = np.array([q.close for q in quotes]).astype(np.float)
dates = np.array([q.date for q in quotes])

data = {}

for i, symbol in enumerate(symbols):
   data[symbol] = np.diff(np.log(close[i]))

df = pd.DataFrame(data, index=dates[0][:-1], columns=symbols)

print(df.corr())
df.plot()
plt.legend(symbols)
plt.show()

#           AA       AXP        BA       BAC       CAT
#AA   1.000000  0.768484  0.758264  0.737625  0.837643
#AXP  0.768484  1.000000  0.746898  0.760043  0.736337
#BA   0.758264  0.746898  1.000000  0.657075  0.770696
#BAC  0.737625  0.760043  0.657075  1.000000  0.657113
#CAT  0.837643  0.736337  0.770696  0.657113  1.000000

这是相关矩阵的输出：

 AA       AXP        BA       BAC       CAT
AA   1.000000  0.768484  0.758264  0.737625  0.837643
AXP  0.768484  1.000000  0.746898  0.760043  0.736337
BA   0.758264  0.746898  1.000000  0.657075  0.770696
BAC  0.737625  0.760043  0.657075  1.000000  0.657113
CAT  0.837643  0.736337  0.770696  0.657113  1.000000

显示了五只股票的对数收益图：

How to do it...

工作原理

我们使用了以下DataFrame方法：

函数	描述
`pandas.DataFrame()`	此函数使用指定的数据，索引（行）和列标签构造`DataFrame`。
`pandas.DataFrame.corr()`	该函数计算列的成对相关，而忽略缺失值。默认情况下，使用 Pearson 相关。
`pandas.DataFrame.plot()`	此函数使用`matplotlib`绘制数据帧。

另见

相关文档
第 4 章，“Pandas 入门书”，摘自 Ivan Idris 的书“Python 数据分析”， Packt Publishing

从 Statsmodels 中将数据作为 pandas 对象加载

statsmodels 在其发行版中有很多样本数据集。完整列表可以在这个页面中找到。

在本教程中，我们将专注于铜数据集，其中包含有关铜价，世界消费量和其他参数的信息。

准备

在开始之前，我们可能需要安装 patsy。 patsy 是描述统计模型的库。很容易看出这个库是否是必需的。只需运行代码。如果您收到与 patsy 相关的错误，请执行以下任一命令：

$ sudo easy_install patsy
$ pip install --upgrade patsy

操作步骤

在本节中，我们将从 statsmodels 中加载数据集作为 Pandas DataFrame或Series对象。

我们需要调用的函数是load_pandas()。如下加载数据：
```
data = statsmodels.api.datasets.copper.load_pandas()
```
这会将数据加载到包含 Pandas 对象的DataSet对象中。
DataSet对象具有名为exog的属性，当作为 Pandas 对象加载时，该属性将成为具有多个列的DataFrame对象。在我们的案例中，它还有一个endog属性，其中包含世界铜消费量的值。

通过创建OLS对象并调用其fit()方法来执行普通的最小二乘计算，如下所示：
```
x, y = data.exog, data.endog

fit = statsmodels.api.OLS(y, x).fit()
print("Fit params", fit.params)
```
这应该打印拟合过程的结果：
```
Fit params COPPERPRICE         14.222028
INCOMEINDEX       1693.166242
ALUMPRICE          -60.638117
INVENTORYINDEX    2515.374903
TIME               183.193035
```

OLS 拟合的结果可以通过summary()方法总结如下：

print(fit.summary())

这将为我们提供以下回归结果输出：

How to do it...

加载铜数据集所需的代码如下：

from __future__ import print_function
import statsmodels.api

# See https://github.com/statsmodels/statsmodels/tree/master/statsmodels/datasets
data = statsmodels.api.datasets.copper.load_pandas()

x, y = data.exog, data.endog

fit = statsmodels.api.OLS(y, x).fit()
print("Fit params", fit.params)
print()
print("Summary")
print()
print(fit.summary())

工作原理

Statsmodels 的Dataset类中的数据遵循特殊格式。其中，此类具有endog和exog属性。 Statsmodels 具有load()函数，该函数将数据作为 NumPy 数组加载。相反，我们使用了load_pandas()方法，该方法将数据加载为pandas对象。我们进行了 OLS 拟合，基本上为我们提供了铜价和消费量的统计模型。

另见

相关文档

重采样时间序列数据

在此教程中，您将学习如何使用 Pandas 对时间序列进行重新采样。

操作步骤

我们将下载AAPL的每日价格时间序列数据，然后通过计算平均值将其重新采样为每月数据。我们将通过创建 Pandas DataFrame并调用其resample() 方法来做到这一点：

在创建 Pandas DataFrame之前，我们需要创建一个DatetimeIndex对象传递给DataFrame构造器。根据下载的报价数据创建索引，如下所示：
```
dt_idx = pandas.DatetimeIndex(quotes.date)
```
获得日期时间索引后，我们将其与收盘价一起使用以创建数据框：
```
df = pandas.DataFrame (quotes.close, index=dt_idx, columns=[symbol])
```

通过计算平均值，将时间序列重新采样为每月频率：

resampled = df.resample('M', how=numpy.mean)
print(resampled)

如下行所示，重新采样的时间序列每个月都有一个值：

 AAPL
2011-01-31  336.932500
2011-02-28  349.680526
2011-03-31  346.005652
2011-04-30  338.960000
2011-05-31  340.324286
2011-06-30  329.664545
2011-07-31  370.647000
2011-08-31  375.151304
2011-09-30  390.816190
2011-10-31  395.532381
2011-11-30  383.170476
2011-12-31  391.251429

使用DataFrame plot()方法绘制数据：

df.plot()
resampled.plot()
plt.show()

原始时间序列的图如下：

How to do it...

重采样的数据具有较少的数据点，因此，生成的图更加混乱，如以下屏幕截图所示：

How to do it...

完整的重采样代码如下：

from __future__ import print_function
import pandas
import matplotlib.pyplot as plt
from datetime import datetime
from matplotlib import finance
import numpy as np

# Download AAPL data for 2011 to 2012
start = datetime(2011, 01, 01)
end = datetime(2012, 01, 01)

symbol = "AAPL"
quotes = finance.quotes_historical_yahoo(symbol, start, end, asobject=True)

# Create date time index
dt_idx = pandas.DatetimeIndex(quotes.date)

#Create data frame
df = pandas.DataFrame(quotes.close, index=dt_idx, columns=[symbol])

# Resample with monthly frequency
resampled = df.resample('M', how=np.mean)
print(resampled) 

# Plot
df.plot()
plt.title('AAPL prices')
plt.ylabel('Price')

resampled.plot()
plt.title('Monthly resampling')
plt.ylabel('Price')
plt.grid(True)
plt.show()

工作原理

我们根据日期和时间列表创建了日期时间索引。然后，该索引用于创建 Pandas DataFrame。然后，我们对时间序列数据进行了重新采样。单个字符给出重采样频率，如下所示：

每天D
每月M
每年A

resample()方法的how参数指示如何采样数据。默认为计算平均值。

另见

相关 Pandas 文档

十一、最新最强的 NumPy

在本章中，我们涵盖以下秘籍：

用at()方法用花式索引代替 ufuncs
通过使用partition()函数选择快速中位数进行部分排序
使用nanmean()，nanvar()和nanstd()函数跳过 NaN
使用full()和full_like()函数创建值初始化的数组
numpy.random.choice()随机抽样
使用datetime64类型和相关的 API

简介

自《NumPy 秘籍》第一版以来，NumPy 团队引入了新功能；我将在本章中对其进行描述。您可能不太可能阅读本书的第一版，而现在正在阅读第二版。我在 2012 年撰写了第一版，并使用了当时可用的功能。 NumPy 具有许多功能，因此您不能期望涵盖所有功能，但是我在本章中介绍的功能相对重要。

使用`at()`方法为 ufuncs 建立花式索引

at()方法已添加到 NumPy 1.8 的 NumPy 通用函数类中。此方法允许就地进行花式索引。花式索引是不涉及整数或切片的索引，这是正常的索引。 “就地”是指将更改输入数组的数据。

at()方法的签名为ufunc.at(a, indices[, b])。索引数组对应于要操作的元素。我们仅必须为具有两个操作数的通用函数指定b数组。

操作步骤

以下步骤演示了at()方法的工作方式：

创建一个具有种子44的7个从-4到4的随机整数的数组。

np.random.seed(44)
a = np.random.random_integers(-4, 4, 7)
print(a)

该数组如下所示：

[ 0 -1 -3 -1 -4  0 -1]

将sign通用函数的at()方法应用于第三和第五个数组元素：
```
np.sign.at(a, [2, 4])
print(a)
```
我们得到以下更改后的数组：
```
[ 0 -1 -1 -1 -1  0 -1]
```

另见

NumPy 通用函数文档

使用`partition()`函数通过快速中位数的选择进行部分排序

partition()子例程进行部分排序。这应该比正常的分类工作少。

注意

有关更多信息，请参见这里。有用的情况是选择组中的前五项（或其他一些数字）。部分排序不能在顶部元素集中保留正确的顺序。

子例程的第一个参数是要排序的输入数组。第二个参数是整数或与数组元素的索引相对应的整数列表。 partition()子例程正确地对那些索引处的项目进行排序。一个指定的索引给出两个分区。多个索自举致两个以上的分区。该算法保证分区中小于正确排序项目的项目位于该项目之前。否则，它们将放在该项目的后面。

操作步骤

让我们举例说明此解释：

创建一个具有随机数的数组以进行排序：

np.random.seed(20)
a = np.random.random_integers(0, 7, 9)
print(a)

该数组具有以下元素：

[3 2 7 7 4 2 1 4 3]

通过将数组划分为两个大致相等的部分，对数组进行部分排序：
```
print(np.partition(a, 4))
```
我们得到以下结果：
```
[2 3 1 2 3 7 7 4 4]
```

工作原理

我们对 9 个元素的数组进行了部分排序。该函数保证索引4,的中间只有一个元素在正确的位置。这对应于尝试选择数组的前五项而不关心前五组中的顺序。由于正确排序的项目位于中间，因此这也将返回数组的中位数。

另见

相关 NumPy 文档

使用`nanmean()`，`nanvar()`和`nanstd()`函数跳过 NaN

试图估计一组数据的算术平均值，方差和标准差是很常见的。

一种简单但有效的方法称为 Jackknife 重采样。 Jackknife 重采样的想法是通过每次都遗漏一个值来从原始数据创建数据集。本质上，我们试图估计如果至少一个值不正确会发生什么。对于每个新数据集，我们都会重新计算我们感兴趣的统计估计量。这有助于我们了解估计量的变化方式。

操作步骤

我们将折刀重采样应用于随机数据。通过将其设置为 NaN（非数字），我们将跳过每个数组元素一次。然后，可以使用nanmean()，nanvar()和nanstd()计算算术平均值，方差和标准差：

首先为估算值初始化一个30 x 3的数组，如下所示：
```
estimates = np.zeros((len(a), 3))
```
遍历数组并通过在循环的每次迭代中将一个值设置为 NaN 来创建新的数据集。对于每个新数据集，计算估计值：
```
for i in xrange(len(a)):
   b = a.copy()
   b[i] = np.nan
   estimates[i,] = [np.nanmean(b), np.nanvar(b), np.nanstd(b)]
```

打印每个估计量的方差：

print("Estimator variance", estimates.var(axis=0))

屏幕上显示以下输出：

Estimator variance [ 0.00079905  0.00090129  0.00034604]

工作原理

我们用折刀重采样估计了数据集的算术平均值，方差和标准差的方差。这表明算术平均值，方差和标准差有多少变化。该秘籍的代码在本书的代码包的jackknife.py 文件中：

from __future__ import print_function
import numpy as np

np.random.seed(46)
a = np.random.randn(30)
estimates = np.zeros((len(a), 3))

for i in xrange(len(a)):
   b = a.copy()
   b[i] = np.nan

   estimates[i,] = [np.nanmean(b), np.nanvar(b), np.nanstd(b)]

print("Estimator variance", estimates.var(axis=0))

另见

使用`full()`和`full_like()`函数创建初始化值的数组

full()和full_like()函数是 NumPy 的新增函数，旨在促进初始化。这是文档中关于它们的内容：

>>> help(np.full)
Return a new array of given shape and type, filled with `fill_value`.
>>> help(np.full_like)
Return a full array with the same shape and type as a given array.

操作步骤

让我们看一下full()和full_like()函数：

用full()创建一个1x2数组，并填充幸运数字7：
```
print(np.full((1, 2), 7))
```
因此，我们得到以下数组：
```
array([[ 7.,  7.]])
```
数组元素是浮点数。
指定整数数据类型，如下所示：
```
print(np.full((1, 2), 7, dtype=np.int))
```
输出相应地更改：
```
array([[7, 7]])
```
full_like()函数检查数组的元数据，并将其重新用于新数组。例如，使用linspace()创建一个数组，并将其用作full_like()函数的模板：
```
a = np.linspace(0, 1, 5)
print(a)
array([ 0\.  ,  0.25,  0.5 ,  0.75,  1\.  ])
print(np.full_like(a, 7))
array([ 7.,  7.,  7.,  7.,  7.])
```
同样，我们用幸运数字7填充数组。要将数据类型修改为整数，请在以下行中使用：
```
print(np.full_like(a, 7, dtype=np.int))
array([7, 7, 7, 7, 7])
```

工作原理

我们用full()和full_like()产生了数组。 full()函数用数字7填充数组。 full_like()函数重新使用了数组的元数据来创建新的数组。这两个函数都可以指定数组的数据类型。

使用`numpy.random.choice()`进行随机采样

自举的过程类似于粗加工。基本的自举方法包括以下步骤：

从大小为 N 的原始数据生成样本。将原始数据样本可视化为一碗数字。我们通过从碗中随机抽取数字来创建新样本。取一个数字后，我们将其放回碗中。
对于每个生成的样本，我们计算感兴趣的统计估计量（例如，算术平均值）。

操作步骤

我们将应用numpy.random.choice()进行自举：

按照二项式分布生成数据样本，该数据样本模拟五次抛掷公平硬币：
```
N = 400
np.random.seed(28)
data = np.random.binomial(5, .5, size=N)
```

生成 30 个样本并计算其平均值（更多样本将得到更好的结果）：

bootstrapped = np.random.choice(data, size=(N, 30))
means = bootstrapped.mean(axis=0)

使用 matplotlib 箱形图可视化算术平均值分布：

plt.title('Bootstrapping demo')
plt.grid()
plt.boxplot(means)
plt.plot(3 * [data.mean()], lw=3, label='Original mean')
plt.legend(loc='best')
plt.show()

有关最终结果，请参考以下带注释的图：

How to do it...

工作原理

我们模拟了一个实验，该实验涉及掷出五次公平硬币。我们通过创建样本并计算相应的方法来自举数据。然后，我们使用numpy.random.choice()进行自举。我们用matplotlib箱形图直观地表示了均值。如果您不熟悉箱形图，图中的注释将对您有所帮助。箱形图中的以下元素很重要：

中位数由框中的一条线表示。
上下四分位数显示为框的边界。
胡须指示异常值的边界。默认情况下，这些值从框的边界设置为1.5 * (Q3 - Q1)，也称为四分位间距。

另见

使用`datetime64`类型和相关的 API

datetime64类型表示相应的日期和时间。您需要 NumPy 1.7.0 或更高版本才能使用此数据类型。

操作步骤

要熟悉datetime64，请按照下列步骤操作：

从字符串创建一个datetime64，如下所示：
```
print(np.datetime64('2015-05-21'))
```
前一行输出以下输出：
```
numpy.datetime64('2015-05-21')
```
我们使用YYYY-MM-DD格式在 2015 年 5 月 21 日创建了datetime64类型，其中Y对应于年份，M对应于月份，D对应于每月的一天。 NumPy 符合 ISO 8601 标准 -- 一种表示日期和时间的国际标准。

ISO 8601 还定义了YYYY-MM-DD，YYYY-MM和YYYYMMDD格式。自己检查这些，如下所示：

print(np.datetime64('20150521'))
print(np.datetime64('2015-05'))

该代码显示以下行：

numpy.datetime64('20150521')
numpy.datetime64('2015-05')

默认情况下，ISO 8601 使用本地时区。可以使用T[hh:mm:ss]格式指定时间。例如，我们可以定义 1578 年 1 月 1 日和晚上 9:18。如下：
```
local = np.datetime64('1578-01-01T21:18')
print(local)
```
以下行显示了结果：
```
numpy.datetime64('1578-01-01T21:18Z')
```
格式为-[hh:mm]的字符串定义相对于 UTC 时区的偏移量。我们可以使用 8 个小时的偏移量创建一个datetime64类型，如下所示：
```
with_offset = np.datetime64('1578-01-01T21:18-0800')
print(with_offset)
```
然后，我们在屏幕上看到以下行：
```
numpy.datetime64('1578-01-02T05:18Z')
```
最后的Z代表 Zulu 时间，有时也称为 UTC。
相互减去两个datetime64对象：
```
print(local - with_offset)
```
结果显示如下：
```
numpy.timedelta64(-480,'m')
```
减法创建一个timedelta64 NumPy 对象，在这种情况下，它表示 480 分钟的增量。

工作原理

您了解了datetime64 NumPy 类型。这种数据类型使我们可以轻松地操纵日期和时间。它的功能包括简单的算术运算和使用常规 NumPy 函数创建数组。请参考本书代码包中的datetime_demo.py文件：

import numpy as np

print(np.datetime64('2015-05-21'))
#numpy.datetime64('2015-05-21')

print(np.datetime64('20150521'))
print(np.datetime64('2015-05'))
#numpy.datetime64('20150521')
#numpy.datetime64('2015-05')

local = np.datetime64('1578-01-01T21:18')
print(local)
#numpy.datetime64('1578-01-01T21:18Z')

with_offset = np.datetime64('1578-01-01T21:18-0800')
print(with_offset)
#numpy.datetime64('1578-01-02T05:18Z')

print(local - with_offset)

另见

十二、使用 NumPy 进行探索性和预测性数据分析

在本章中，我们涵盖以下秘籍：

探索气压
探索日常气压范围
研究年度气压平均值
分析最大可见度
用自回归模型预测气压
使用移动平均模型预测气压
研究年内平均气压
研究气压的极值

简介

数据分析是 NumPy 最重要的用例之一。根据我们的目标，我们可以区分数据分析的许多阶段和类型。在本章中，我们将讨论探索性和预测性数据分析。探索性数据分析可探查数据的线索。在此阶段，我们可能不熟悉数据集。预测分析试图使用模型来预测有关数据的某些信息。

数据来自荷兰气象局 KNMI。特别是 KNMI 总部位于 De Bilt 的气象站。在这些秘籍中，我们将检查气压和最大可见度。

我修改了文本数据并将其从 KNMI 转换为 NumPy 特定的.npy格式，并保存为40996 x 5数组。该数组包含五个变量的每日值：

YYYYMMDD格式的日期
每日平均气压
每日最高气压
每日最低气压
每日最大可见度

探索气压

在此秘籍中，我们将查看由 24 小时值计算出的每日平均海平面气压（0.1 hPa）。这包括打印描述性统计数据和可视化概率分布。在自然界中，我们经常处理正态分布，因此使用第 10 章，“使用 Scikits 进行正态性检验”会派上用场。

完整的代码在本书代码包的exploring.py文件中：

from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.adnorm import normal_ad

data = np.load('cbk12.npy')

# Multiply to get hPa values
meanp = .1 * data[:,1]

# Filter out 0 values
meanp = meanp[ meanp > 0]

# Get descriptive statistics
print("Max", meanp.max())
print("Min", meanp.min())
mean = meanp.mean()
print("Mean", mean)
print("Median", np.median(meanp))
std = meanp.std()
print("Std dev", std)

# Check for normality
print("Normality", normal_ad(meanp))

#histogram with Gaussian PDF
plt.subplot(211)
plt.title('Histogram of average atmospheric pressure')
_, bins, _ = plt.hist(meanp, np.sqrt(len(meanp)), normed=True)
plt.plot(bins, 1/(std * np.sqrt(2 * np.pi)) * np.exp(- (bins - mean)**2/(2 * std**2)), 'r-', label="Gaussian PDF")
plt.grid()
plt.legend(loc='best')
plt.xlabel('Average atmospheric pressure (hPa)')
plt.ylabel('Frequency')

# boxplot
plt.subplot(212)
plt.boxplot(meanp)
plt.title('Boxplot of average atmospheric pressure')
plt.ylabel('Average atmospheric pressure (hPa)')
plt.grid()

# Improves spacing of subplots
plt.tight_layout()
plt.show()

准备

如果尚未安装，请安装statsmodels以进行正态性测试（请参阅第 10 章，“Scikits 的乐趣”的“安装 scikits-statsmodels”秘籍）。

操作步骤

请按照以下步骤探索每日气压：

使用load()函数加载数据：
```
data = np.load('cbk12.npy')
```
通常，数据需要进行处理和清理。在这种情况下，将这些值相乘以获得hPa中的值，然后删除与缺失值相对应的0值：
```
# Multiply to get hPa values
meanp = .1 * data[:,1]

# Filter out 0 values
meanp = meanp[ meanp > 0]
```

获取描述性统计信息，包括最大值，最小值，算术平均值，中位数和标准差：

print("Max", meanp.max())
print("Min", meanp.min())
mean = meanp.mean()
print("Mean", mean)
print("Median", np.median(meanp))
std = meanp.std()
print("Std dev", std)

您应该看到以下值：

Max 1048.3
Min 962.1
Mean 1015.14058231
Median 1015.8
Std dev 9.85889134337

如下应用第 10 章“Scikits 的乐趣”的正态性检验：
```
print("Normality", normal_ad(meanp))
```
屏幕上显示以下值：
```
Normality (72.685781095773564, 0.0)
```
用直方图和箱形图可视化值的分布也很好。最终结果请参考以下图表：

另见

“使用 Statsmodels”秘籍执行正态性测试，该秘籍来自第 10 章，“Scikits 的乐趣”
有关箱形图的说明，请参见第 11 章“最新最强的 NumPy”
load()函数的文档

探索日常气压范围

每日气压范围是每日最高点和最低点之差。使用实际数据，有时我们会缺少一些值。在这里，我们可能会缺少给定一天的高压和/或低压值。可以通过智能算法来填补这些空白。但是，让我们保持简单，而忽略它们。在计算了范围之后，我们将进行与前面的秘籍类似的分析，但是我们将使用可以处理NaN值的函数。另外，我们将研究月份和范围之间的关系。

本书代码包中的day_range.py文件位于对应的代码中：

from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
import calendar as cal

data = np.load('cbk12.npy')

# Multiply to get hPa values
highs = .1 * data[:,2]
lows = .1 * data[:,3]

# Filter out 0 values
highs[highs == 0] = np.nan
lows[lows == 0] = np.nan

# Calculate range and stats
ranges = highs - lows
print("Minimum daily range", np.nanmin(ranges))
print("Maximum daily range", np.nanmax(ranges))

print("Average daily range", np.nanmean(ranges))
print("Standard deviation", np.nanstd(ranges))

# Get months
dates = data[:,0]
months = (dates % 10000)/100
months = months[~np.isnan(ranges)]

monthly = []
month_range = np.arange(1, 13)

for month in month_range:
   indices = np.where(month == months)
   monthly.append(np.nanmean(ranges[indices]))

plt.bar(month_range, monthly)
plt.title('Monthly average of daily pressure ranges')
plt.xticks(month_range, cal.month_abbr[1:13])
plt.ylabel('Monthly Average (hPa)')
plt.grid()
plt.show()

操作步骤

此秘籍的第一步与先前的秘籍几乎相同，因此我们将跳过它们。继续分析每日气压范围：

我们可以将缺失值保留为当前的0值。但是，通常将它们设置为NaN较为安全，以避免造成混淆。将缺失值设置为NaN，如下所示：
```
highs[highs == 0] = np.nan
lows[lows == 0] = np.nan
```

使用nanmin()，nanmax()，nanmean()和nanstd()函数计算范围，最小值，最大值，均值和标准差：

ranges = highs - lows
print("Minimum daily range", np.nanmin(ranges))
print("Maximum daily range", np.nanmax(ranges))

print("Average daily range", np.nanmean(ranges))
print("Standard deviation", np.nanstd(ranges))

结果出现在屏幕上：

Minimum daily range 0.4
Maximum daily range 41.7
Average daily range 6.11945360571
Standard deviation 4.42162136692

如前所述，日期以YYYYMMDD格式给出。有了一点算术，我们就可以轻松地得到几个月。此外，我们忽略与NaN范围值相对应的月份值：
```
dates = data[:,0]
months = (dates % 10000)/100
months = months[~np.isnan(ranges)]
```
按月对范围进行平均，如下所示：
```
monthly = []
month_range = np.arange(1, 13)

for month in month_range:
   indices = np.where(month == months)
   monthly.append(np.nanmean(ranges[indices]))
```
在最后一步中，我们绘制了matplotlib条形图，显示了每日气压范围的每月平均值。最终结果请参考以下图表：

工作原理

我们分析了气压的每日范围。此外，我们可视化了每日范围的每月平均值。似乎有一种模式导致夏季的每日气压范围变小。当然，需要做更多的工作来确定。

另见

“探索气压”秘籍

研究年度气压平均值

您可能听说过，全球变暖表明温度逐年稳定上升。由于气压是另一个热力学变量，因此我们可以预期气压也会跟随趋势。该秘籍的完整代码在本书代码包的annual.py文件中：

import numpy as np
import matplotlib.pyplot as plt

data = np.load('cbk12.npy')

# Multiply to get hPa values
avgs = .1 * data[:,1]
highs = .1 * data[:,2]
lows = .1 * data[:,3]

# Filter out 0 values
avgs = np.ma.array(avgs, mask = avgs == 0)
lows = np.ma.array(lows, mask = lows == 0)
highs = np.ma.array(highs, mask = highs == 0)

# Get years
years = data[:,0]/10000

# Initialize annual stats arrays
y_range = np.arange(1901, 2014)
nyears = len(y_range)
y_avgs = np.zeros(nyears)
y_highs = np.zeros(nyears)
y_lows = np.zeros(nyears)

# Compute stats
for year in y_range:
   indices = np.where(year == years)
   y_avgs[year - 1901] = np.mean(avgs[indices])
   y_highs[year - 1901] = np.max(highs[indices])
   y_lows[year - 1901] = np.min(lows[indices])

plt.title('Annual atmospheric pressure for De Bilt(NL)')
plt.ticklabel_format(useOffset=900, axis='y')

plt.plot(y_range, y_avgs, label='Averages')

# Plot ignoring NaNs
h_mask = np.isfinite(y_highs)
plt.plot(y_range[h_mask], y_highs[h_mask], '^', label='Highs')

l_mask = np.isfinite(y_lows)
plt.plot(y_range[l_mask], y_lows[l_mask], 'v', label='Lows')

plt.xlabel('Year')
plt.ylabel('Atmospheric pressure (hPa)')
plt.grid()
plt.legend(loc='best')
plt.show()

操作步骤

要检查趋势，让我们按以下步骤绘制平均，最大和最小年度气气压：

初始化年度统计数据数组：

y_range = np.arange(1901, 2014)
nyears = len(y_range)
y_avgs = np.zeros(nyears)
y_highs = np.zeros(nyears)
y_lows = np.zeros(nyears)

计算年度统计数据：

for year in y_range:
   indices = np.where(year == years)
   y_avgs[year - 1901] = np.mean(avgs[indices])
   y_highs[year - 1901] = np.max(highs[indices])
   y_lows[year - 1901] = np.min(lows[indices])

绘制，忽略NaN值，如下所示：

h_mask = np.isfinite(y_highs)
plt.plot(y_range[h_mask], y_highs[h_mask], '^', label='Highs')

l_mask = np.isfinite(y_lows)
plt.plot(y_range[l_mask], y_lows[l_mask], 'v', label='Lows')

最终结果请参考以下图表：

How to do it...

工作原理

年平均气压似乎持平或有所波动，但没有任何趋势。我们使用isfinite()函数忽略了最终图中的NaN值。此函数检查无限值和NaN值。

另见

“探索气压”秘籍
isfinite()函数的文档

分析最大可见度

如果到目前为止，您在本章的所有秘籍中均未使用，则可能需要突破气压。因此，让我们来看看能见度。数据文件具有一列，以提供最大的可视性，KNMI 描述如下：

Maximum visibility; 0: <100 m, 1:100-200 m, 2:200-300 m,..., 49:4900-5000 m, 50:5-6 km, 56:6-7 km, 57:7-8 km,..., 79:29-30 km, 80:30-35 km, 81:35-40 km,..., 89: >70 km)

这里的能见度是一个离散变量，因此取平均值可能没有意义。此外，似乎我们几乎每天都有 1901 年到 1950 年之间的0值。我不相信 De Bilt 在那个时期会更加迷雾重重。出于本秘籍的目的，我们将雾定义为 1 至 2km 之间的能见度，这对应于数据文件中的10和20值。让我们还将雾度定义为 2 至 5 公里之间的能见度。这又对应于我们数据文件中的20和50。

空气污染会降低能见度，尤其是在晴天。我们可以将晴天定义为能见度高于 30km 的晴天，或数据文件中79的值。理想情况下，我们应该使用空气污染数据，但不幸的是，我们没有这些数据。据我所知，这个特殊气象站周围的空气污染水平不是很高。知道每年的晴天数很有趣。分析的代码在本书的代码包的visibility.py文件中：

import numpy as np
import matplotlib.pyplot as plt

data = np.load('cbk12.npy')

# Get minimum visibility
visibility = data[:,4]

# doy
doy = data[:,0] % 10000

doy_range = np.unique(doy)

# Initialize arrays
ndoy = len(doy_range)
mist = np.zeros(ndoy)
haze = np.zeros(ndoy)

# Compute frequencies
for i, d in enumerate(doy_range):
   indices = np.where(d == doy)
   selection = visibility[indices]

   mist_truth = (10 < selection) & (selection < 20)
   mist[i] = len(selection[mist_truth])/(1\. * len(selection))

   haze_truth = (20 < selection) & (selection < 50)
   haze[i] = len(selection[haze_truth])/(1\. * len(selection))

# Get years
years = data[:,0]/10000

# Initialize annual stats arrays
y_range = np.arange(1901, 2014)
nyears = len(y_range)
y_counts = np.zeros(nyears)

# Get annual counts
for year in y_range:
   indices = np.where(year == years)
   selection = visibility[indices]
   y_counts[year - 1901] = len(selection[selection > 79])

plt.subplot(211)
plt.plot(np.arange(1, 367), mist, color='.25', label='mist')
plt.plot(np.arange(1, 367), haze, color='0.75', lw=2, label='haze')
plt.title('Probability of mist and haze')
plt.xlabel('Day of the year')
plt.ylabel('Probability')
plt.grid()
plt.legend(loc='best')

plt.subplot(212)
plt.plot(y_range, y_counts)
plt.xlabel('Year')
plt.ylabel('Number of clear days')
plt.title('Annual counts of clear days')
plt.grid()
plt.tight_layout()
plt.show()

操作步骤

请按照以下步骤来绘制晴天的年度计数，即一年中的某一天（1-366）对霾和薄雾的可能性：

使用以下代码块计算雾霾的可能性：

for i, d in enumerate(doy_range):
   indices = np.where(d == doy)
   selection = visibility[indices]

   mist_truth = (10 < selection) & (selection < 20)
   mist[i] = len(selection[mist_truth])/(1\. * len(selection))

   haze_truth = (20 < selection) & (selection < 50)
   haze[i] = len(selection[haze_truth])/(1\. * len(selection))

使用此代码段获取年度计数：

for year in y_range:
   indices = np.where(year == years)
   selection = visibility[indices]
   y_counts[year - 1901] = len(selection[selection > 79])

Refer to the following plot for the end result:

How to do it...

工作原理

如您所见，我们在 1950 年以后开始变得晴朗。这不是由于 1950 年之前的大雾天气，而是由于数据丢失或无效的现象。去年的下降也是由于数据不完整。 1980 年以后，晴天明显增加。这应该是全球变暖和气候变化也加剧的时期。不幸的是，我们没有与空气污染直接相关的数据，但是我们的探索性分析表明存在趋势。

薄雾似乎主要发生在一年的头两个月中。您可以得出有关雾霾的类似结论。显然，雾气比雾气更有可能，这可能是一件好事。您也可以绘制直方图以确保。但是，请记住，正如我之前提到的，您需要忽略0值。

另见

“探索气压”秘籍
“研究年度气压平均值”秘籍
相关维基百科页面

使用自回归模型预测气压

非常简单的预测模型会获取变量的当前值，并将其外推到下一个周期。为了推断，我们可以使用一个简单的数学函数。由于可以通过多项式近似泰勒级数中的多项式，因此低阶多项式可以解决问题。归结为将先前值回归到下一个值。因此，相应的模型称为自回归。

我们必须对此谨慎，以免过拟合。 交叉验证是一种常见的方法，将数据分为训练和测试集。我们使用训练集拟合数据，并使用测试集测试拟合度。这样可以减少偏差（请参见本书代码包中的autoregressive.py文件）：

from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt

data = np.load('cbk12.npy')

# Load average pressure
meanp = .1 * data[:,1]

# Split point for test and train data
cutoff = 0.9 * len(meanp)

for degree, marker in zip(xrange(1, 4), ['o', 'x','.']):
   poly = np.polyfit(meanp[:cutoff - 1], meanp[1:cutoff], degree)
   print('Polynomial coefficients', poly)

   fit = np.polyval(poly, meanp[cutoff:-1])
   error = np.abs(meanp[cutoff + 1:] - fit)/fit
   plt.plot(error, marker, color=str(.25* degree), label='Degree ' + str(degree))
   plt.plot(np.full(len(error), error.mean()), lw=degree, label='Mean for degree ' + str(degree))

   print("Absolute mean relative error", error.mean(), 'for polynomial of degree', degree)
   print()

plt.title('Relative test errors for polynomial fits')
plt.ylabel('Relative error')
plt.grid()
plt.legend(loc='best')
plt.show()

操作步骤

通过执行以下，我们将使用不同程度的多项式拟合气压：

定义测试和训练集的截止：
```
cutoff = 0.9 * len(meanp)
```

使用polyfit()和polyval()函数拟合数据：

poly = np.polyfit(meanp[:cutoff - 1], meanp[1:cutoff], degree)
print('Polynomial coefficients', poly)

fit = np.polyval(poly, meanp[cutoff:-1])

计算相对误差：

error = np.abs(meanp[cutoff + 1:] - fit)/fit

此代码输出以下输出：

Polynomial coefficients [ 0.995542    4.50866543]
Absolute mean relative error 0.00442472512506 for polynomial of degree 1

Polynomial coefficients [ -1.79946321e-04   1.17995347e+00   2.77195814e+00]
Absolute mean relative error 0.00421276856088 for polynomial of degree 2

Polynomial coefficients [  3.17914507e-06  -6.62444552e-03   4.44558056e+00   2.76520065e+00]
Absolute mean relative error 0.0041906802632 for polynomial of degree 3

Refer to the following plot for the end result:

How to do it...

工作原理

这三个多项式的平均相对误差非常接近 -- 约 .004 -- 因此我们在图中看到一条线（很有趣的是知道气压的典型测量误差是多少），小于百分之一。我们看到了一些潜在的异常值，但是没有太多。大部分繁重的工作都是通过polyfit()和polyval()函数完成的，它们分别将数据拟合到多项式并求值该多项式。

另见

使用移动平均模型预测气压

模拟气压的简单方法是假设值围绕均值μ跳动。然后，在最简单的情况下，我们假设连续值与平均值的偏差ε遵循以下公式：

Predicting pressure with a moving average model

该关系是线性的，在最简单的情况下，我们只需要估计一个参数 - θ。为此，我们将需要 SciPy 功能。该秘籍的完整代码在本书代码包的moving_average.py文件中：

from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime as dt
from scipy.optimize import leastsq

data = np.load('cbk12.npy')

# Load average pressure
meanp = .1 * data[:,1]

cutoff = 0.9 * len(meanp)

def model(p, ma1):
   return p * ma1

def error(p, t, ma1):
   return t - model(p, ma1)

p0 = [.9]
mu = meanp[:cutoff].mean()
params = leastsq(error, p0, args=(meanp[1:cutoff] - mu, meanp[:cutoff-1] - mu))[0]
print(params)

abs_error = np.abs(error(params, meanp[cutoff+1:] - mu, meanp[cutoff:-1] - mu))
plt.plot(abs_error, label='Absolute error')
plt.plot(np.full_like(abs_error, abs_error.mean()), lw=2, label='Absolute mean error')
plt.title('Absolute error for the moving average model')
plt.ylabel('Absolute error (hPa)')
plt.grid()
plt.legend(loc='best')
plt.show()

入门

如有必要，请按照第 2 章，“高级索引和数组概念”的“安装 SciPy”秘籍中的说明安装 SciPy。

操作步骤

以下步骤将移动平均模型应用于气压。

定义以下函数：

def model(p, ma1):
   return p * ma1

def error(p, t, ma1):
   return t - model(p, ma1)

使用上一步中的函数将移动平均模型与leastsq()函数拟合，并将0.9的初始猜测作为模型参数：

p0 = [.9]
mu = meanp[:cutoff].mean()
params = leastsq(error, p0, args=(meanp[1:cutoff] - mu, meanp[:cutoff-1] - mu))[0]

使用测试数据集拟合后计算绝对误差：
```
abs_error = np.abs(error(params, meanp[cutoff+1:] - mu, meanp[cutoff:-1] - mu))
```
请参考以下数据集中每个数据点的绝对误差图：

工作原理

leastsq()函数通过最小化误差来拟合模型。它需要一个函数来计算拟合误差和模型参数的初始猜测。

另见

研究年内平均气压

在一年内探索气压很有趣。特别地，检查与变异性相关的模式可能是有益的，因此，与可预测性相关。原因是几个月中的气气压会发生很大变化，从而降低了可预测性。在此秘籍中，我们将绘制每月的箱形图和每月的气气压方差。

秘籍代码在本书代码包的intrayear.py文件中。请特别注意突出显示的部分：

import numpy as np
import matplotlib.pyplot as plt
import calendar as cal

data = np.load('cbk12.npy')

# Multiply to get hPa values
meanp = .1 * data[:,1]

# Get months
dates = data[:,0]
months = (dates % 10000)/100

monthly = []
vars = np.zeros(12)
month_range = np.arange(1, 13)

for month in month_range:
 indices = np.where(month == months)
 selection = meanp[indices]

 # Filter out 0 values
 selection = selection[selection > 0]

 monthly.append(selection)
 vars[month - 1] = np.var(selection)

def plot():
    plt.xticks(month_range, cal.month_abbr[1:13])
    plt.grid()
    plt.xlabel('Month')

plt.subplot(211)
plot()
plt.title('Atmospheric pressure box plots')
plt.boxplot(monthly)
plt.ylabel('Atmospheric pressure (hPa)')

plt.subplot(212)
plot()

# Display error bars using standard deviation
plt.errorbar(month_range, vars, yerr=vars.std())
plt.plot(month_range, np.full_like(month_range, np.median(vars)), lw=3, label='Median')

# Shades the region above the median
plt.fill_between(month_range, vars, where=vars>np.median(vars), color='0.5')
plt.title('Variance of atmospheric pressure')
plt.ylabel('Variance')
plt.legend(loc='best')

plt.show()

操作步骤

在我们探索时，往往会重复这些步骤，并且此秘籍与本书中的其他秘籍之间存在重叠。以下是此秘籍中的新步骤：

使用标准差显示误差线：

plt.errorbar(month_range, vars, yerr=vars.std())

用高于中值的值对图的区域进行着色：
```
plt.fill_between(month_range, vars, where=vars>np.median(vars), color='0.5')
```
Refer to the following plot for the end result:

工作原理

我们将几个月与气气压的测量相匹配。我们使用匹配来绘制箱形图并可视化每月变化。这项研究表明，在 1 月，2 月，11 月和 12 月的最冷月份，气压变化高于中值。从图中可以看出，在温暖的夏季，气压范围狭窄。这与其他秘籍的结果一致。

另见

“探索气压”秘籍
“研究年度气压平均值”秘籍
var()的文档

研究气压的极值

异常值是一个问题，因为它们会影响我们对数据的理解。在此秘籍中，我们将异常值定义为与数据的第一或第三四分位数相差至少四分位数范围的 1.5 倍。四分位数间距是第一和第三四分位数之间的距离。让我们计算一年中每个月的异常值。完整的代码在本书代码包的extreme.py文件中：

import numpy as np
import matplotlib.pyplot as plt
import calendar as cal

data = np.load('cbk12.npy')

# Multiply to get hPa values
meanp = .1 * data[:,1]

# Filter out 0 values
meanp = np.ma.array(meanp, mask = meanp == 0)

# Calculate quartiles and irq
q1 = np.percentile(meanp, 25)
median = np.percentile(meanp, 50)
q3 = np.percentile(meanp, 75)

irq = q3 - q1

# Get months
dates = data[:,0]
months = (dates % 10000)/100

m_low = np.zeros(12)
m_high = np.zeros(12)
month_range = np.arange(1, 13)

for month in month_range:
   indices = np.where(month == months)
   selection = meanp[indices]
   m_low[month - 1] = len(selection[selection < (q1 - 1.5 * irq)])
   m_high[month - 1] = len(selection[selection > (q3 + 1.5 * irq)])

plt.xticks(month_range, cal.month_abbr[1:13])
plt.bar(month_range, m_low, label='Low outliers', color='.25')
plt.bar(month_range, m_high, label='High outliers', color='0.5')
plt.title('Atmospheric pressure outliers')
plt.xlabel('Month')
plt.ylabel('# of outliers')
plt.grid()
plt.legend(loc='best')
plt.show()

操作步骤

要绘制一年中每个月的异常数，请执行以下步骤：

使用percentile()函数计算四分位数和四分位数间距：

q1 = np.percentile(meanp, 25)
median = np.percentile(meanp, 50)
q3 = np.percentile(meanp, 75)

irq = q3 - q1

计算异常值的数量，如下所示：

for month in month_range:
   indices = np.where(month == months)
   selection = meanp[indices]
   m_low[month - 1] = len(selection[selection < (q1 - 1.5 * irq)])
   m_high[month - 1] = len(selection[selection > (q3 + 1.5 * irq)])

Refer to the following plot for the end result:

How to do it...

工作原理

看起来异常值大多在下侧，并且夏天不太可能出现。较高端的异常值似乎仅在某些月份内发生。我们发现四分位数具有percentile()函数，因为四分之一对应 25%。

另见

“探索气压”秘籍
percentile()函数的文档

NumPy-秘籍中文第二版-三-