This is an example of a concise NumPy guide.
Array Essentials
Module imports
# NumPy module
import numpy as np

# Only used for statistics
import scipy.stats as stats

# Only for pretty table formatting in Jupyter Notebook
from IPython.display import HTML, display
import tabulate

# Only used to render the plots in the statistics examples
from matplotlib import pyplot
Declaring an array
# one dimension, three elements
np.array([1,2,3])
array([1, 2, 3])
Declaring a matrix
# two rows, three columns
np.array([[1,2,3],
          [4,5,6]])
array([[1, 2, 3],
       [4, 5, 6]])
Number of array dimensions
np.array([1,2,3]).ndim
1
np.array([[1,2,3],[4,5,6]]).ndim
2
Array shape (rows and columns)
# one dimension, three elements
np.array([1,2,3]).shape
(3,)
# two rows, three columns
np.array([[1,2,3],[4,5,6]]).shape
(2, 3)
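Downcasting element types
# Convert numbers to unsigned 8-bit ints. The fractional part is truncated
# (1.5 -> 1). Out-of-range values wrapped around in this run (300 -> 44,
# -5 -> 251), but that result can depend on the platform and NumPy version.
np.array([1.5, 300.0, -5]).astype(np.uint8)
array([  1,  44, 251], dtype=uint8)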
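Because the wrapped values above are easy to misread, here is a minimal sketch of one way to avoid them: clamp the values to the range of the target type before casting. The variable names are illustrative only, and clamping is just one possible policy (raising an error is another).

```python
import numpy as np

values = np.array([1.5, 300.0, -5])

# Look up the representable range of the target type (0..255 for uint8).
info = np.iinfo(np.uint8)

# Clamp first, then cast. The fractional part is still truncated (1.5 -> 1).
clipped = np.clip(values, info.min, info.max)
safe = clipped.astype(np.uint8)

print(safe)  # [  1 255   0]
```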
Referencing elements by index
# 2 x 2 matrix
a = np.array([[1,2],
              [3,4]])
# row 0, column 0
a[0,0]
1
# row 1, column 0
a[1,0]
3
# row 1, column 1
a[1,1]
4
Slicing (selecting columns)
a = np.array([[1,2,3],
              [4,5,6],
              [7,8,9]])
# all rows, just column 1 (as a 1-D array)
a[:,1]
array([2, 5, 8])
# all rows, just column 1 (as rows containing a single column)
a[:,[1]]
array([[2],
       [5],
       [8]])
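The difference between the two column selections above is easiest to see in the resulting shapes. The following short sketch (with illustrative variable names) compares them: an integer index drops the column axis, while a list index keeps it.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

flat = a[:, 1]      # integer index: the column axis is dropped
column = a[:, [1]]  # list index: the column axis is kept

print(flat.shape)    # (3,)
print(column.shape)  # (3, 1)
```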
# all rows, columns starting from 1
a[:,1:]
array([[2, 3],
       [5, 6],
       [8, 9]])
# all rows, specific columns 1 and 2
a[:,[1,2]]
array([[2, 3],
       [5, 6],
       [8, 9]])
# all rows, columns from 0 to <2
a[:,0:2]
array([[1, 2],
       [4, 5],
       [7, 8]])
Slicing (selecting rows)
a = np.array([[1,2,3],
              [4,5,6],
              [7,8,9]])
# specific row 1 (as a 1-D array), all columns
a[1,]
array([4, 5, 6])
# specific row 1 (as one row), all columns
a[[1],]
array([[4, 5, 6]])
# specific rows 0 and 2, all columns
a[[0,2],]
array([[1, 2, 3],
       [7, 8, 9]])
# rows from 0 to <2, all columns
a[0:2,]
array([[1, 2, 3],
       [4, 5, 6]])
Array Generation
With zeros
# 2 rows by 3 columns
np.zeros((2,3),dtype=int)
array([[0, 0, 0],
       [0, 0, 0]])
With ones
# 2 rows by 3 columns
np.ones((2,3),dtype=int)
array([[1, 1, 1],
       [1, 1, 1]])
With random integers
# 2 rows by 3 columns, values from 0 to <5
np.random.randint(5,size=(2,3))
array([[4, 4, 0],
       [4, 0, 4]])
With random floats
# 2 rows by 3 columns, values from 0 to <1
np.random.rand(2,3)
array([[0.61656079, 0.25462431, 0.13681125],
       [0.82952057, 0.1369984 , 0.23413243]])
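The random outputs above change on every run. As a hedged alternative to the calls used in this guide, the sketch below uses NumPy's `np.random.default_rng` generator with a fixed seed so that results can be reproduced. The seed value 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

# A seeded generator: the same seed yields the same sequence of numbers.
rng = np.random.default_rng(42)

ints = rng.integers(5, size=(2, 3))  # integers from 0 to <5
floats = rng.random((2, 3))          # floats from 0 to <1

# A second generator with the same seed reproduces the first draw.
rng_again = np.random.default_rng(42)
same_ints = rng_again.integers(5, size=(2, 3))

print(np.array_equal(ints, same_ints))  # True
```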
With an integer range
# from 1 to <10
np.arange(1,10)
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
# from 1 to <10 in +2 increments
np.arange(1,10,2)
array([1, 3, 5, 7, 9])
With a float range
# 5 evenly spaced values from 0 to 1 (both inclusive)
np.linspace(0,1,5)
array([0.  , 0.25, 0.5 , 0.75, 1.  ])
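To make the relationship between the two range functions concrete, this small sketch derives the spacing that `np.linspace` uses when both endpoints are included, namely (stop - start) / (num - 1), and rebuilds the same values from `np.arange`. The variable names are illustrative.

```python
import numpy as np

start, stop, num = 0, 1, 5

# With both endpoints included there are num - 1 gaps between the values.
step = (stop - start) / (num - 1)

points = np.linspace(start, stop, num)
manual = start + step * np.arange(num)

print(step)                         # 0.25
print(np.allclose(points, manual))  # True
```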
With a specific value
# two rows by three columns, all cells filled with 5
np.full((2,3),5)
array([[5, 5, 5],
       [5, 5, 5]])
Array Arithmetic
Addition
# element-wise addition
a = np.array([2,4,6])
b = np.array([1,2,3])
a+b
array([3, 6, 9])
Subtraction
# element-wise subtraction
a = np.array([2,4,6])
b = np.array([1,2,3])
a-b
array([1, 2, 3])
Product (element-wise)
# element-wise product
a = np.array([2,4,6])
b = np.array([1,2,3])
a*b
array([ 2,  8, 18])
Product (matrix)
# 2-element vector times a 2 x 2 matrix
np.array([1,2])@np.array([[2,3],[4,5]])
array([10, 13])
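As a worked check of the result above, the sketch below spells out the arithmetic: each output element is the dot product of the vector with one column of the matrix. The names `v`, `m`, and `manual` are introduced here for illustration.

```python
import numpy as np

v = np.array([1, 2])
m = np.array([[2, 3],
              [4, 5]])

# Each output element pairs v with one column of m.
manual = np.array([
    v[0] * m[0, 0] + v[1] * m[1, 0],  # 1*2 + 2*4 = 10
    v[0] * m[0, 1] + v[1] * m[1, 1],  # 1*3 + 2*5 = 13
])

print(manual)                         # [10 13]
print(np.array_equal(manual, v @ m))  # True
```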
Scalar arithmetic
# multiply every element by a scalar
a = np.array([2,4,6])
a*3
array([ 6, 12, 18])
Array Predicates
Applying a predicate to elements
# check which elements are divisible by two
a = np.array([1,2,3,4,5])
l = a % 2 == 0
l
array([False,  True, False,  True, False])
# use 'l' as a filter to obtain even numbers
# equivalent to a[a % 2 == 0]
a[l]
array([2, 4])
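Predicates can also be combined before filtering. The following is a minimal sketch, with illustrative variable names, of the element-wise operators `&` and `|`. Each comparison needs its own parentheses, because these operators bind more tightly than `==` or `>`.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

# Element-wise "and": even numbers greater than 2.
even_and_big = a[(a % 2 == 0) & (a > 2)]

# Element-wise "or": odd numbers, or numbers greater than 3.
odd_or_big = a[(a % 2 == 1) | (a > 3)]

# Count how many elements satisfy a predicate.
count_even = np.count_nonzero(a % 2 == 0)

print(even_and_big)  # [4]
print(odd_or_big)    # [1 3 4 5]
print(count_even)    # 2
```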
Array Aggregation
Sum
np.array([1,2,3]).sum()
6
Maximum
np.array([1,2,3]).max()
3
Minimum
np.array([1,2,3]).min()
1
Mean
np.array([1,2,3]).mean()
2.0
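The aggregations above operate on a one-dimensional array. For a matrix, the same methods accept an `axis` argument. The sketch below, using an illustrative 2 x 3 matrix `m`, shows how `axis=0` aggregates down each column and `axis=1` across each row.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

total = m.sum()             # all elements
column_sums = m.sum(axis=0) # one value per column
row_sums = m.sum(axis=1)    # one value per row
row_means = m.mean(axis=1)  # mean of each row
column_max = m.max(axis=0)  # maximum of each column

print(total)        # 21
print(column_sums)  # [5 7 9]
print(row_sums)     # [ 6 15]
print(row_means)    # [2. 5.]
print(column_max)   # [4 5 6]
```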
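CSV Files
Loading CSV data
Sample file used in the examples
# this is just to show the raw contents of the file
# (the leading "!" runs a shell command from a Jupyter Notebook cell)
!cat resources/countries_small.csv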
"country","population","gdp_in_trillions"
"China",1439323776,12.238
"India",1380004385,2.651
"USA",331002651,19.485
# just the data, ignore the headers
np.genfromtxt('resources/countries_small.csv',delimiter=',',skip_header=1,usecols=(1,2))
array([[1.43932378e+09, 1.22380000e+01],
       [1.38000438e+09, 2.65100000e+00],
       [3.31002651e+08, 1.94850000e+01]])
# specify the column names so they can be used to reference columns
a = np.genfromtxt('resources/countries_small.csv',delimiter=',',
                  skip_header=1,names=('country','population','gdp'),dtype=None,encoding=None)
print(a['population'])
[1439323776 1380004385 331002651]
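The examples above depend on a file at `resources/countries_small.csv`. For experimenting without that file, here is a self-contained sketch that feeds the same three rows to `np.genfromtxt` through an in-memory `io.StringIO` buffer. The variable names are illustrative.

```python
import io
import numpy as np

# Same contents as the sample file, held in memory instead of on disk.
csv_text = '''"country","population","gdp_in_trillions"
"China",1439323776,12.238
"India",1380004385,2.651
"USA",331002651,19.485
'''

# Skip the header row and read only the two numeric columns.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                     skip_header=1, usecols=(1, 2))

population = data[:, 0]
gdp = data[:, 1]

# Row index of the largest GDP (row 2 is the USA in this sample).
largest_gdp_row = int(gdp.argmax())
print(data.shape)       # (3, 2)
print(largest_gdp_row)  # 2
```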
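Pearson Correlation
The Pearson coefficient helps determine how strongly two data sets are linearly correlated.
Here we use Python lists rather than NumPy arrays for simplicity.
100% positive correlation
In this case, any positive change in data set (a) is accompanied by a positive change in data set (b), and vice versa.
a = [0,1,2,3]
b = [5,6,7,8]
display(stats.pearsonr(a,b)) # (Pearson coefficient, p-value)

# Plot rendering only
pyplot.ioff()
fig = pyplot.figure()
pyplot.plot(a)
pyplot.plot(b)
pyplot.savefig("plot_p.png")
pyplot.close(fig)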
(1.0, 0.0)
100% negative correlation
In this case, every positive change in data set (a) is accompanied by an equivalent negative change in data set (b), and vice versa.
a = [10,9,8,7]
b = [1,2,3,4]
display(stats.pearsonr(a,b)) # (Pearson coefficient, p-value)

# Plot rendering only
pyplot.ioff()
fig = pyplot.figure()
pyplot.plot(a)
pyplot.plot(b)
pyplot.savefig("plot_n.png")
pyplot.close(fig)
(-1.0, 0.0)
No correlation
In this case, the positive and negative changes in (b) cancel out, so there is no linear correlation between data sets (a) and (b).
a = [0,1,2,3,4]
b = [6,7,8,7,6]
display(stats.pearsonr(a,b)) # (Pearson coefficient, p-value)

# Plot rendering only
pyplot.ioff()
fig = pyplot.figure()
pyplot.plot(a)
pyplot.plot(b)
pyplot.savefig("plot_nc.png")
pyplot.close(fig)
(0.0, 1.0000000000000002)
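To show where the three coefficients above come from, the sketch below computes the Pearson coefficient directly from its textbook definition: the sum of the products of the deviations from the mean, divided by the square root of the product of the summed squared deviations. The helper name `pearson` is hypothetical and not part of NumPy or SciPy. It returns only the coefficient, not the p-value. Note that a coefficient of zero only rules out a linear relationship.

```python
import numpy as np

def pearson(x, y):
    """Pearson coefficient of two equal-length sequences (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

positive = pearson([0, 1, 2, 3], [5, 6, 7, 8])        # approximately  1
negative = pearson([10, 9, 8, 7], [1, 2, 3, 4])       # approximately -1
none = pearson([0, 1, 2, 3, 4], [6, 7, 8, 7, 6])      # approximately  0

# NumPy's own correlation matrix gives the same value off the diagonal.
check = np.corrcoef([0, 1, 2, 3], [5, 6, 7, 8])[0, 1]
print(np.isclose(positive, check))  # True
```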
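T-Test (related samples)
This is a two-sided test of the null hypothesis that two related samples have the same mean.
Unlike the Pearson examples, the samples here are generated as NumPy arrays.
P-value for two random data sets
In theory, two random data sets drawn from the same distribution have similar means, so the test will usually fail to reject the null hypothesis (a p-value above the customary 0.05 threshold). In practice, any particular pair of random samples may still differ by chance.
a = np.random.randint(1,high=50,size=100)
b = np.random.randint(1,high=50,size=100)
display(stats.ttest_rel(a,b)) # (t statistic, p-value)

# Plot rendering only
pyplot.ioff()
fig = pyplot.figure()
pyplot.plot(a)
pyplot.plot(b)
pyplot.savefig("plot_random.png")
pyplot.close(fig)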
Ttest_relResult(statistic=1.55684703917969, pvalue=0.12269795013854484)
P-value for non-random data
Here, one data set is random, but the other is generated deterministically as the increasing sequence 0 to 99.
a = np.random.randint(1,high=50,size=100)
b = np.arange(0,100)
display(stats.ttest_rel(a,b)) # (t statistic, p-value)

# Plot rendering only
pyplot.ioff()
fig = pyplot.figure()
pyplot.plot(a)
pyplot.plot(b)
pyplot.savefig("plot_one-random.png")
pyplot.close(fig)
Ttest_relResult(statistic=-8.324218327435583, pvalue=4.824950835820826e-13)
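For readers who want to see what the statistic measures, here is a minimal sketch of the standard paired t statistic: the mean of the pairwise differences divided by their standard error. The helper name `paired_t_statistic` is hypothetical, the seed is arbitrary, and the sketch does not compute the p-value, which requires the t distribution from `scipy.stats`.

```python
import numpy as np

def paired_t_statistic(x, y):
    """t statistic for two related samples (illustrative helper)."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    n = d.size
    # Standard error of the mean difference, using the sample
    # standard deviation (ddof=1).
    standard_error = d.std(ddof=1) / np.sqrt(n)
    return d.mean() / standard_error

# Small hand-checkable case: differences are [1, 2, 2, 3].
small = paired_t_statistic([1, 2, 3, 4], [0, 0, 1, 1])

# Same setup as the non-random example, with a seeded generator.
rng = np.random.default_rng(0)
a = rng.integers(1, 50, size=100)  # random integers from 1 to <50
b = np.arange(0, 100)              # deterministic sequence 0..99
t = paired_t_statistic(a, b)

print(small)
print(t < 0)  # True: the mean of a is below the mean of b
```

The sign of the statistic reflects which sample has the larger mean, which is why the non-random example yields a negative value.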