This is an example of a concise NumPy guide.

Array essentials

Module imports

# NumPy module
import numpy as np

# Only used for statistics
import scipy.stats as stats

# Only for pretty table formatting in Jupyter Notebook
from IPython.display import HTML, display
import tabulate
from matplotlib import pyplot

Declaring an array

# one-dimensional array with three elements
np.array([1,2,3])

array([1, 2, 3])

Declaring a matrix

# two rows, three columns
np.array([[1,2,3],
          [4,5,6]])

array([[1, 2, 3],
       [4, 5, 6]])

Number of array dimensions

# one-dimensional array
np.array([1,2,3]).ndim

# two-dimensional array (matrix)
np.array([[1,2,3],[4,5,6]]).ndim

Array rows and columns

# one dimension with three elements
np.array([1,2,3]).shape

(3,)

# two rows, three columns
np.array([[1,2,3],[4,5,6]]).shape

(2, 3)
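The relationship between `ndim`, `shape`, and `size` can be sketched as follows. The variable names are illustrative only; the point is that a one-dimensional array has shape `(3,)`, which is different from a matrix with one row and three columns.

```python
import numpy as np

flat = np.array([1, 2, 3])            # one dimension, three elements
one_row = flat.reshape(1, 3)          # explicit matrix: one row, three columns
m = np.array([[1, 2, 3],
              [4, 5, 6]])

# shape is a tuple with one entry per dimension
rows, cols = m.shape
# ndim always equals the length of the shape tuple
dims = m.ndim
# size is the total number of elements (rows * cols here)
total = m.size

print(flat.shape, one_row.shape)
print(rows, cols, dims, total)
```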

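Downcasting element types

# Convert numbers to unsigned 8-bit ints (fractions are truncated).
# 300.0 and -5 do not fit in the range 0..255, which is why they
# appear as 44 and 251 in the output below.
np.array([1.5, 300.0, -5]).astype(np.uint8)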

array([  1,  44, 251], dtype=uint8)
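If wrapped values like these are not wanted, one possible approach is to clamp the data to the target type's range before casting. This is a minimal sketch, not the guide's own method; `values` and `clipped` are names introduced here for illustration.

```python
import numpy as np

values = np.array([1.5, 300.0, -5])

# Look up the representable range of the target type (0..255 for uint8)
info = np.iinfo(np.uint8)

# Clamp to that range first, then cast; astype truncates the fraction
clipped = np.clip(values, info.min, info.max).astype(np.uint8)
print(clipped)
```

With clamping, values above the range saturate at the maximum and negative values at zero instead of wrapping around.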

Referencing elements by index

# 2 x 2 matrix
a = np.array([[1,2],
              [3,4]])

# row 0, column 0
a[0,0]

# row 1, column 0
a[1,0]

# row 1, column 1
a[1,1]

Slicing (selecting columns)

a = np.array([[1,2,3],
              [4,5,6],
              [7,8,9]])

# all rows, just column 1 (as array)
a[:,1]

array([2, 5, 8])

# all rows, just column 1 (as rows containing a single column)
a[:,[1]]

array([[2],
       [5],
       [8]])

# all rows, columns starting from 1
a[:,1:]

array([[2, 3],
       [5, 6],
       [8, 9]])

# all rows, specific columns 1 and 2
a[:,[1,2]]

array([[2, 3],
       [5, 6],
       [8, 9]])

# all rows, columns from 0 to <2
a[:,0:2]

array([[1, 2],
       [4, 5],
       [7, 8]])

Slicing (selecting rows)

a = np.array([[1,2,3],
              [4,5,6],
              [7,8,9]])

# specific row 1 (as array), all columns
a[1,]

array([4, 5, 6])

# specific row 1 (as one row), all columns
a[[1],]

array([[4, 5, 6]])

# specific rows 0 and 2, all columns
a[[0,2],]

array([[1, 2, 3],
       [7, 8, 9]])

# rows from 0 to <2, all columns
a[0:2,]

array([[1, 2, 3],
       [4, 5, 6]])
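The two selection styles used in the slicing examples (a range such as `0:2` versus a list such as `[0,2]`) differ in one way that is easy to miss: a range slice returns a view onto the original data, while a list of indices returns a copy. A small sketch, with `basic` and `fancy` as illustrative names:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

basic = a[0:2, ]      # range slice: a view sharing memory with a
fancy = a[[0, 1], ]   # index list: an independent copy

print(np.shares_memory(a, basic))
print(np.shares_memory(a, fancy))

# Writing through the view changes the original array;
# the copy keeps the old value
basic[0, 0] = 100
print(a[0, 0], fancy[0, 0])
```

When an independent array is needed from a range slice, calling `.copy()` on the result avoids modifying the source by accident.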

Array generation

With zeros

# 2 rows by 3 columns
np.zeros((2,3),dtype=int)

array([[0, 0, 0],
       [0, 0, 0]])

With ones

# 2 rows by 3 columns
np.ones((2,3),dtype=int)

array([[1, 1, 1],
       [1, 1, 1]])

With random integers

# 2 rows by 3 columns, values from 0 to <5
np.random.randint(5,size=(2,3))

array([[4, 4, 0],
       [4, 0, 4]])

With random floats

# 2 rows by 3 columns, values from 0.0 to <1.0
np.random.rand(2,3)

array([[0.61656079, 0.25462431, 0.13681125],
       [0.82952057, 0.1369984 , 0.23413243]])
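The random outputs above change on every run. Where repeatable results matter, a seeded generator is one option. The sketch below uses `np.random.default_rng`, which is a different interface from the `np.random.randint` and `np.random.rand` calls in this guide; the seed value 42 is arbitrary.

```python
import numpy as np

# Seeded generator: the seed 42 is an arbitrary choice for illustration
rng = np.random.default_rng(42)

ints = rng.integers(5, size=(2, 3))   # integers from 0 to <5
floats = rng.random((2, 3))           # floats from 0.0 to <1.0

# Re-creating the generator with the same seed reproduces the same draws
rng_again = np.random.default_rng(42)
ints_again = rng_again.integers(5, size=(2, 3))
print(np.array_equal(ints, ints_again))
```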

With an integer range

# from 1 to <10
np.arange(1,10)

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# from 1 to <10 in +2 increments
np.arange(1,10,2)

array([1, 3, 5, 7, 9])

With a float range

# from 0 to 1 (inclusive), 5 evenly spaced values
np.linspace(0,1,5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])
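The difference between the two range functions is what the last argument means: `np.arange` takes a step size and excludes the end, while `np.linspace` takes a number of points and includes the end by default. A short sketch of how the spacing is derived (variable names are illustrative):

```python
import numpy as np

start, stop = 0, 1

# retstep=True also returns the spacing: (stop - start) / (num - 1)
points, step = np.linspace(start, stop, 5, retstep=True)
print(points, step)

# endpoint=False drops the final value, giving a half-open range
half_open = np.linspace(start, stop, 4, endpoint=False)

# arange with the same step produces the same half-open range
ranged = np.arange(start, stop, 0.25)
print(half_open, ranged)
```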

With a specific value

# two rows by three columns, all cells filled with 5
np.full((2,3),5)

array([[5, 5, 5],
       [5, 5, 5]])

Array arithmetic

Addition

# element-wise addition
a = np.array([2,4,6])
b = np.array([1,2,3])
a+b

array([3, 6, 9])

Subtraction

# element-wise subtraction
a = np.array([2,4,6])
b = np.array([1,2,3])
a-b

array([1, 2, 3])

Product (element-wise)

# element-wise product
a = np.array([2,4,6])
b = np.array([1,2,3])
a*b

array([ 2,  8, 18])

Product (matrix)

# vector of length 2 times a 2 x 2 matrix
np.array([1,2])@np.array([[2,3],[4,5]])

array([10, 13])
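To see where 10 and 13 come from, the matrix product can be spelled out by hand: each output entry is the vector multiplied with one column of the matrix and then summed. The names `v`, `m`, and `manual` are introduced here for illustration.

```python
import numpy as np

v = np.array([1, 2])
m = np.array([[2, 3],
              [4, 5]])

# Entry j of v @ m is the sum over i of v[i] * m[i, j]
manual = np.array([
    v[0] * m[0, 0] + v[1] * m[1, 0],   # 1*2 + 2*4
    v[0] * m[0, 1] + v[1] * m[1, 1],   # 1*3 + 2*5
])

print(manual)
print(v @ m)

# By contrast, * multiplies element by element (v is broadcast over the rows)
print(v * m)
```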

Scalar arithmetic

# multiply every element by 3
a = np.array([2,4,6])
a*3

array([ 6, 12, 18])

Array predicates

Applying a predicate to elements

# check which elements are divisible by two
a = np.array([1,2,3,4,5])
l = a % 2 == 0
l

array([False,  True, False,  True, False])

# use 'l' as a filter to obtain even numbers
# equivalent to a[a % 2 == 0]
a[l]

array([2, 4])
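Boolean masks like `l` can also be combined, counted, or used to replace values rather than filter them. The following sketch shows a few such uses under the same input; `even`, `even_and_big`, and `zeroed` are illustrative names.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
even = a % 2 == 0

# Combine predicates with & (and) or | (or); the parentheses are required
even_and_big = a[even & (a > 2)]

# True counts as 1, so summing a mask counts the matching elements
even_count = even.sum()

# np.where keeps the array length: take a where the mask holds, else 0
zeroed = np.where(even, a, 0)

print(even_and_big, even_count, zeroed)
```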

Array aggregation

Sum

np.array([1,2,3]).sum()

Maximum

np.array([1,2,3]).max()

Minimum

np.array([1,2,3]).min()

Mean

np.array([1,2,3]).mean()

2.0
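On a matrix, the same aggregation methods accept an `axis` argument to aggregate per column or per row instead of over all elements. A minimal sketch, reusing the two-by-three matrix from earlier in this guide:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

total = m.sum()               # all elements
col_sums = m.sum(axis=0)      # collapse the rows: one sum per column
row_sums = m.sum(axis=1)      # collapse the columns: one sum per row
row_max = m.max(axis=1)       # largest value in each row
overall_mean = m.mean()       # mean over all elements

print(total, col_sums, row_sums, row_max, overall_mean)
```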

CSV files

Loading CSV data

Sample file used in the examples:

resources/countries_small.csv

# this is just to show the raw contents of the file
!cat resources/countries_small.csv

"country","population","gdp_in_trillions"
"China",1439323776,12.238
"India",1380004385,2.651
"USA",331002651,19.485

# just the data, ignore the headers
np.genfromtxt('resources/countries_small.csv',delimiter=',',skip_header=1,usecols=(1,2))

array([[1.43932378e+09, 1.22380000e+01],
       [1.38000438e+09, 2.65100000e+00],
       [3.31002651e+08, 1.94850000e+01]])

# specify the headers so they can be used to reference columns
a = np.genfromtxt('resources/countries_small.csv',delimiter=',',
                  skip_header=1,names=('country','population','gdp'),
                  dtype=None,encoding=None)
print(a['population'])

[1439323776 1380004385  331002651]
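Once the columns have names, they can be combined like ordinary arrays. The sketch below repeats the named-column load on an in-memory copy of the sample data (via `io.StringIO`, so it runs without the file) and derives GDP per capita. The `per_capita` calculation and the quote stripping are additions for illustration, not part of the original example.

```python
import io
import numpy as np

# In-memory copy of resources/countries_small.csv
csv_text = '''"country","population","gdp_in_trillions"
"China",1439323776,12.238
"India",1380004385,2.651
"USA",331002651,19.485
'''

a = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1,
                  names=('country', 'population', 'gdp'),
                  dtype=None, encoding=None)

# genfromtxt keeps the double quotes around text fields, so strip them
countries = np.char.strip(a['country'], '"')

# gdp is in trillions; convert to units before dividing by population
per_capita = a['gdp'] * 1e12 / a['population']

richest = countries[np.argmax(per_capita)]
print(richest)
```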

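Pearson correlation

The Pearson coefficient helps determine how strongly two data sets are linearly correlated.

Here we use Python lists rather than NumPy arrays, for simplicity.

100% positive correlation

In this case, any positive change in data set (a) is matched by a positive change in data set (b), and vice versa.

a = [0,1,2,3]
b = [5,6,7,8]
display(stats.pearsonr(a,b)) # (pearson coefficient, p-value)

# Plot rendering only
pyplot.ioff()
fig = pyplot.figure()
pyplot.plot(a)
pyplot.plot(b)
pyplot.savefig("plot_p.png")
pyplot.close(fig)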

(1.0, 0.0)

100% negative correlation

In this case, every positive change in data set (a) is matched by an equivalent negative change in data set (b), and vice versa.

a = [10,9,8,7]
b = [1,2,3,4]
display(stats.pearsonr(a,b)) # (pearson coefficient, p-value)

# Plot rendering only
pyplot.ioff()
fig = pyplot.figure()
pyplot.plot(a)
pyplot.plot(b)
pyplot.savefig("plot_n.png")
pyplot.close(fig)

(-1.0, 0.0)

No correlation

In this case, the positive and negative changes in data set (b) cancel out as (a) increases, so there is no linear correlation between data sets (b) and (a).

a = [0,1,2,3,4]
b = [6,7,8,7,6]
display(stats.pearsonr(a,b)) # (pearson coefficient, p-value)

# Plot rendering only
pyplot.ioff()
fig = pyplot.figure()
pyplot.plot(a)
pyplot.plot(b)
pyplot.savefig("plot_nc.png")
pyplot.close(fig)

(0.0, 1.0000000000000002)
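The coefficients reported in the three Pearson cases can be reproduced with plain NumPy, which may help show what the number measures: the summed product of the deviations from each mean, divided by the product of their magnitudes. The helper `pearson` below is a hypothetical function written for this sketch; it returns only the coefficient, not the p-value.

```python
import numpy as np

def pearson(x, y):
    """Pearson coefficient of two equal-length sequences (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()                 # deviations from the mean of x
    dy = y - y.mean()                 # deviations from the mean of y
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

positive = pearson([0, 1, 2, 3], [5, 6, 7, 8])
negative = pearson([10, 9, 8, 7], [1, 2, 3, 4])
none = pearson([0, 1, 2, 3, 4], [6, 7, 8, 7, 6])
print(positive, negative, none)

# np.corrcoef returns the full correlation matrix; entry [0, 1] is the
# coefficient between the two inputs
builtin = np.corrcoef([0, 1, 2, 3], [5, 6, 7, 8])[0, 1]
print(builtin)
```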

T-test (related)

This is a two-sided test for the null hypothesis that two related samples have the same mean.
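As a rough illustration of what such a test computes, the paired t statistic can be derived with NumPy alone: take the per-pair differences, then divide their mean by their standard error. The `before` and `after` samples are made-up numbers for this sketch, and only the statistic is computed here; SciPy's `stats.ttest_rel` also reports the p-value.

```python
import numpy as np

# Hypothetical paired measurements (invented for illustration)
before = np.array([10, 12, 14, 16])
after = np.array([11, 14, 15, 18])

# Differences within each pair
d = after - before
n = d.size

# Standard error of the mean difference (sample std, hence ddof=1)
std_err = d.std(ddof=1) / np.sqrt(n)

# Paired t statistic: how many standard errors the mean difference is from 0
t_stat = d.mean() / std_err
print(t_stat)
```

A statistic far from zero suggests the two related samples do not share the same mean; how far is far enough depends on the p-value for n - 1 degrees of freedom.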