A Guide to NumPy in Python by Example

This is a concise guide to NumPy, presented through examples.

Array essentials

Module imports

# NumPy module
import numpy as np

# Only used for statistics
import scipy.stats as stats

# Only for pretty table formatting in Jupyter Notebook
from IPython.display import HTML, display
import tabulate

# Only for rendering plots
from matplotlib import pyplot

Declaring an array

# 1-D array with three elements
np.array([1,2,3])
array([1, 2, 3])

Declaring a matrix

# two rows, three columns
np.array([[1,2,3],
          [4,5,6]])
array([[1, 2, 3],
       [4, 5, 6]])

Number of array dimensions

np.array([1,2,3]).ndim
1

np.array([[1,2,3],[4,5,6]]).ndim
2

Array rows and columns (shape)

# 1-D array with three elements
np.array([1,2,3]).shape
(3,)

# two rows, three columns
np.array([[1,2,3],[4,5,6]]).shape
(2, 3)

Downcasting element types

# Convert numbers to unsigned 8-bit ints
# (1.5 is truncated to 1; 300 and -5 fall outside 0..255 and wrap around here)
np.array([1.5, 300.0, -5]).astype(np.uint8)
array([  1,  44, 251], dtype=uint8)
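The altered values above come from the narrowing cast. As a minimal sketch that is not part of the original example, the snippet below starts from an integer array, where astype(np.uint8) wraps values modulo 256, and contrasts that with clipping to the valid range first using np.clip. The integer input is an assumption made to keep the output predictable, since results for out-of-range floating-point inputs may depend on the platform.

```python
import numpy as np

values = np.array([1, 300, -5])  # default integer dtype

# Integer-to-integer casts wrap around modulo 256 for uint8
wrapped = values.astype(np.uint8)

# Clip to the representable range first to saturate instead of wrapping
clipped = np.clip(values, 0, 255).astype(np.uint8)

print(wrapped)  # [  1  44 251]
print(clipped)  # [  1 255   0]
```

Clipping is probably the safer choice when the numbers represent quantities such as pixel intensities, where wrapping would silently turn a very bright value into a dark one.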

Referencing elements by index

# 2 x 2 matrix
a = np.array([[1,2],
              [3,4]])

# row 0, column 0
a[0,0]
1

# row 1, column 0
a[1,0]
3

# row 1, column 1
a[1,1]
4

Slicing (selecting columns)

a = np.array([[1,2,3],
              [4,5,6],
              [7,8,9]])

# all rows, just column 1 (as a 1-D array)
a[:,1]
array([2, 5, 8])

# all rows, just column 1 (as rows containing a single column)
a[:,[1]]
array([[2],
       [5],
       [8]])

# all rows, columns starting from 1
a[:,1:]
array([[2, 3],
       [5, 6],
       [8, 9]])

# all rows, specific columns 1 and 2
a[:,[1,2]]
array([[2, 3],
       [5, 6],
       [8, 9]])

# all rows, columns from 0 to <2
a[:,0:2]
array([[1, 2],
       [4, 5],
       [7, 8]])

Slicing (selecting rows)

a = np.array([[1,2,3],
              [4,5,6],
              [7,8,9]])

# specific row 1 (as a 1-D array), all columns
a[1,]
array([4, 5, 6])

# specific row 1 (as one row), all columns
a[[1],]
array([[4, 5, 6]])

# specific rows 0 and 2, all columns
a[[0,2],]
array([[1, 2, 3],
       [7, 8, 9]])

# rows from 0 to <2, all columns
a[0:2,]
array([[1, 2, 3],
       [4, 5, 6]])
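The difference between selecting "as an array" and "as a row or column" in the slicing examples is easiest to see in the resulting shapes. The following is a small sketch of that idea. The variable names are illustrative only.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# An integer index drops that axis; a list index keeps it
col_1d = a[:, 1]     # 1-D array
col_2d = a[:, [1]]   # 3 rows by 1 column
row_1d = a[1, :]     # 1-D array
row_2d = a[[1], :]   # 1 row by 3 columns

print(col_1d.shape, col_2d.shape, row_1d.shape, row_2d.shape)
```

Keeping the axis is usually what you want when the result is fed into code that expects a two-dimensional matrix.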

Array generation

With zeros

# 2 rows by 3 columns
np.zeros((2,3),dtype=int)
array([[0, 0, 0],
       [0, 0, 0]])

With ones

# 2 rows by 3 columns
np.ones((2,3),dtype=int)
array([[1, 1, 1],
       [1, 1, 1]])

With random integers

# 2 rows by 3 columns, values from 0 to <5
np.random.randint(5,size=(2,3))
array([[4, 4, 0],
       [4, 0, 4]])
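Random output like the above changes on every run. If reproducible values are needed, one option is a seeded generator. The sketch below uses np.random.default_rng, which is not used in the original article, and the seed 42 is an arbitrary choice.

```python
import numpy as np

# Two generators created with the same seed produce the same sequence
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

x = rng_a.integers(5, size=(2, 3))  # 2 rows by 3 columns, values from 0 to <5
y = rng_b.integers(5, size=(2, 3))

print(np.array_equal(x, y))  # True
```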

With random floats

# 2 rows by 3 columns, values from 0 to <1
np.random.rand(2,3)
array([[0.61656079, 0.25462431, 0.13681125],
       [0.82952057, 0.1369984 , 0.23413243]])

With an integer range

# from 1 to <10
np.arange(1,10)
array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# from 1 to <10 in +2 increments
np.arange(1,10,2)
array([1, 3, 5, 7, 9])

With a float range

# 5 evenly spaced values from 0 to 1 (inclusive)
np.linspace(0,1,5)
array([0.  , 0.25, 0.5 , 0.75, 1.  ])
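Because np.linspace takes a number of points rather than a step, it can help to ask it for the step it derived. The sketch below uses the retstep and endpoint parameters of np.linspace. The variable names are illustrative.

```python
import numpy as np

# retstep=True also returns the spacing between consecutive values
points, step = np.linspace(0, 1, 5, retstep=True)
print(step)  # 0.25

# endpoint=False excludes the stop value, like np.arange does
half_open = np.linspace(0, 1, 5, endpoint=False)
print(half_open)
```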

With a specific value

# two rows by three columns, all cells filled with 5
np.full((2,3),5)
array([[5, 5, 5],
       [5, 5, 5]])

Array arithmetic

Addition

# element-wise addition
a = np.array([2,4,6])
b = np.array([1,2,3])
a+b
array([3, 6, 9])

Subtraction

# element-wise subtraction
a = np.array([2,4,6])
b = np.array([1,2,3])
a-b
array([1, 2, 3])

Product (element-wise)

# element-wise product
a = np.array([2,4,6])
b = np.array([1,2,3])
a*b
array([ 2,  8, 18])

Product (matrix)

# vector with 2 elements times a 2 x 2 matrix
np.array([1,2])@np.array([[2,3],[4,5]])
array([10, 13])
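To see where 10 and 13 come from, the matrix product can be spelled out by hand: each output element is the sum of the vector's elements multiplied by the matching column of the matrix. This is only an illustrative sketch, and the names v, m and manual are not from the original.

```python
import numpy as np

v = np.array([1, 2])
m = np.array([[2, 3],
              [4, 5]])

# result[j] = sum over i of v[i] * m[i, j]
# column 0: 1*2 + 2*4 = 10, column 1: 1*3 + 2*5 = 13
manual = np.array([sum(v[i] * m[i, j] for i in range(2)) for j in range(2)])

print(manual)  # [10 13]
print(v @ m)   # [10 13]
```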

Arithmetic with a scalar

# multiply every element by 3
a = np.array([2,4,6])
a*3
array([ 6, 12, 18])

Array predicates

Applying a predicate to elements

# check which elements are divisible by two
a = np.array([1,2,3,4,5])
l = a % 2 == 0
l
array([False,  True, False,  True, False])

# use 'l' as a filter to obtain even numbers
# equivalent to a[a % 2 == 0]
a[l]
array([2, 4])
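Predicates can also be combined before filtering. The sketch below assumes the same array as above and uses NumPy's element-wise operators & (and), | (or) and ~ (not). The variable names are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

# Each comparison needs its own parentheses because & and | bind tighter
even_and_large = a[(a % 2 == 0) & (a > 2)]
small_or_even = a[(a < 2) | (a % 2 == 0)]
not_even = a[~(a % 2 == 0)]

print(even_and_large)  # [4]
print(small_or_even)   # [1 2 4]
print(not_even)        # [1 3 5]
```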

Array aggregation

Sum

np.array([1,2,3]).sum()
6

Maximum

np.array([1,2,3]).max()
3

Minimum

np.array([1,2,3]).min()
1

Mean

np.array([1,2,3]).mean()
2.0
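The same aggregations also work on matrices, where the axis argument chooses the direction. As a brief sketch, with a sample matrix m that does not appear in this aggregation section: axis=0 aggregates down the rows (one result per column) and axis=1 aggregates across the columns (one result per row).

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

col_sums = m.sum(axis=0)    # one value per column
row_sums = m.sum(axis=1)    # one value per row
row_max = m.max(axis=1)     # largest value in each row
col_means = m.mean(axis=0)  # average of each column

print(col_sums)   # [5 7 9]
print(row_sums)   # [ 6 15]
print(row_max)    # [3 6]
print(col_means)  # [2.5 3.5 4.5]
```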

CSV files

Loading CSV data

Sample file used in the examples:

resources/countries_small.csv

# this is just to show the raw contents of the file
!cat resources/countries_small.csv
"country","population","gdp_in_trillions"
"China",1439323776,12.238
"India",1380004385,2.651
"USA",331002651,19.485

# just the data, ignore the headers
np.genfromtxt('resources/countries_small.csv',delimiter=',',skip_header=1,usecols=(1,2))
array([[1.43932378e+09, 1.22380000e+01],
       [1.38000438e+09, 2.65100000e+00],
       [3.31002651e+08, 1.94850000e+01]])

# specify the column names so they can be used to reference columns
a = np.genfromtxt('resources/countries_small.csv',delimiter=',',
                  skip_header=1,names=('country','population','gdp'),dtype=None,encoding=None)
print(a['population'])
[1439323776 1380004385  331002651]
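To try the loading step without the file on disk, the same rows can be fed to np.genfromtxt from an in-memory buffer. The sketch below copies the sample rows into a string and, as an illustrative follow-up that is not in the original, derives GDP per capita from the two numeric columns. The names csv_text and per_capita are assumptions.

```python
import io
import numpy as np

# Same contents as resources/countries_small.csv, embedded for a self-contained run
csv_text = '''"country","population","gdp_in_trillions"
"China",1439323776,12.238
"India",1380004385,2.651
"USA",331002651,19.485
'''

# Load only the numeric columns (population, GDP in trillions), skipping the header
data = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1, usecols=(1, 2))

population = data[:, 0]
gdp_trillions = data[:, 1]

# Element-wise arithmetic: GDP per person
per_capita = gdp_trillions * 1e12 / population
print(per_capita.round(0))
```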

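Pearson correlation

The Pearson coefficient helps determine the linear correlation between two datasets.

For simplicity, we use plain Python lists here rather than NumPy arrays.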
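For reference, the coefficient itself can be computed with a few NumPy operations: subtract each dataset's mean, then divide the sum of the products of the deviations by the square root of the product of the summed squared deviations. The helper name pearson_r below is hypothetical, and this is a minimal sketch rather than a replacement for stats.pearsonr, which also returns a p-value.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pos = pearson_r([0, 1, 2, 3], [5, 6, 7, 8])         # close to 1
r_neg = pearson_r([10, 9, 8, 7], [1, 2, 3, 4])        # close to -1
r_none = pearson_r([0, 1, 2, 3, 4], [6, 7, 8, 7, 6])  # close to 0
print(r_pos, r_neg, r_none)
```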

100% positive correlation

In this case, any increase in dataset (a) is matched by a proportional increase in dataset (b), and vice versa.

a = [0,1,2,3]
b = [5,6,7,8]
display(stats.pearsonr(a,b)) # (pearson coefficient, p-value)

# Plot rendering only
pyplot.ioff()
fig = pyplot.figure()
pyplot.plot(a)
pyplot.plot(b)
pyplot.savefig("plot_p.png")
pyplot.close(fig)
(1.0, 0.0)

100% negative correlation

In this case, every increase in dataset (a) is matched by an equivalent decrease in dataset (b), and vice versa.

a = [10,9,8,7]
b = [1,2,3,4]
display(stats.pearsonr(a,b)) # (pearson coefficient, p-value)

# Plot rendering only
pyplot.ioff()
fig = pyplot.figure()
pyplot.plot(a)
pyplot.plot(b)
pyplot.savefig("plot_n.png")
pyplot.close(fig)
(-1.0, 0.0)

No correlation

In this case, the increases and decreases in dataset (b) cancel out as dataset (a) rises, so there is no linear correlation between (a) and (b).

a = [0,1,2,3,4]
b = [6,7,8,7,6]
display(stats.pearsonr(a,b)) # (pearson coefficient, p-value)

# Plot rendering only
pyplot.ioff()
fig = pyplot.figure()
pyplot.plot(a)
pyplot.plot(b)
pyplot.savefig("plot_nc.png")
pyplot.close(fig)
(0.0, 1.0000000000000002)

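T-test (related samples)

This is a two-sided test of the null hypothesis that two related (paired) samples have the same mean.

The examples here pass NumPy arrays to stats.ttest_rel, which also accepts plain Python lists.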
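The statistic behind this test is simple to reproduce: take the element-wise differences of the two samples and divide their mean by its standard error. The helper name paired_t_statistic and the sample values below are hypothetical, and the sketch only computes the t statistic. The p-value additionally requires Student's t distribution with n - 1 degrees of freedom, which stats.ttest_rel handles.

```python
import numpy as np

def paired_t_statistic(x, y):
    """t statistic of the paired (related-samples) t-test."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    n = d.size
    # mean difference divided by its standard error (sample std uses ddof=1)
    return d.mean() / (d.std(ddof=1) / np.sqrt(n))

t = paired_t_statistic([1, 2, 3, 4], [2, 4, 6, 8])
print(t)  # about -3.873, that is -sqrt(15)
```

A statistic far from zero, relative to the sample size, is what drives the p-value toward 0.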

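P-value for two sets of random data

In theory, two sets of random data drawn from the same distribution should have similar means, so the test usually yields a p-value well above common significance levels such as 0.05 rather than one close to 0. In practice, any two particular random samples can still differ by chance, so a small p-value occasionally appears.

a = np.random.randint(1,high=50,size=100)
b = np.random.randint(1,high=50,size=100)
display(stats.ttest_rel(a,b)) # (t statistic, p-value)

# Plot rendering only
pyplot.ioff()
fig = pyplot.figure()
pyplot.plot(a)
pyplot.plot(b)
pyplot.savefig("plot_random.png")
pyplot.close(fig)
Ttest_relResult(statistic=1.55684703917969, pvalue=0.12269795013854484)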

P-value for non-random data

Here, one dataset is random, but the other is generated deterministically as the sequence 0 to 99. Their means differ clearly, so the p-value is very close to 0.

a = np.random.randint(1,high=50,size=100)
b = np.arange(0,100)
display(stats.ttest_rel(a,b)) # (t statistic, p-value)

# Plot rendering only
pyplot.ioff()
fig = pyplot.figure()
pyplot.plot(a)
pyplot.plot(b)
pyplot.savefig("plot_one-random.png")
pyplot.close(fig)
Ttest_relResult(statistic=-8.324218327435583, pvalue=4.824950835820826e-13)