24.9.26 Introduction to Machine Learning, Day 5

ndarray

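1-D array: np.array([1,2,3])
2-D array: np.array([[1,2,3],[4,5,6]]), a single data table
3-D array: np.array([[[1,2,3],[1,2,3]]]), several data tables stacked together

ndarray.shape: tuple of the array's dimensions
ndarray.ndim: number of dimensions
ndarray.size: number of elements in the array
ndarray.itemsize: length of one array element, in bytes
ndarray.dtype: type of the array's elements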
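The attributes listed above can be read straight off any array. The following is a small sketch; the array `table` and its explicit `int64` dtype are illustrative choices (not from the notes), picked so that `itemsize` is predictable.

```python
import numpy as np

# A 2x3 table of 64-bit integers (dtype set explicitly so itemsize is 8 bytes)
table = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)

print(table.shape)     # (2, 3)
print(table.ndim)      # 2
print(table.size)      # 6
print(table.itemsize)  # 8
print(table.dtype)     # int64
```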

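The dtype attribute of ndarray

Ways to create arrays: arrays of zeros and ones
np.ones([rows, columns])
one = np.ones([4,8])
np.zeros_like(one)

np.array() copies its input by default (a deep copy). np.asarray() returns the input itself when it is already an ndarray of the requested dtype, so no copy is made (a shallow copy).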
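The difference between the two constructors shows up once the source array is modified afterwards. This is a minimal sketch; the variable names are made up for illustration.

```python
import numpy as np

source = np.array([1, 2, 3])
copied = np.array(source)     # independent copy of the data
aliased = np.asarray(source)  # source is already an ndarray, so it is returned as-is

source[0] = 100               # modify the original afterwards

print(copied[0])          # 1, the copy did not follow the change
print(aliased[0])         # 100, the alias sees the change
print(aliased is source)  # True
```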

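np.linspace(start, stop, num): num evenly spaced values from start to stop, with stop included

np.linspace(0,100,11)

np.arange(start, stop, step, dtype): values from start up to but not including stop, spaced by step, with the given data type

np.arange(0,100,2)

np.logspace(start, stop, num): num values evenly spaced on a log scale, from 10**start to 10**stop

np.logspace(0,2,3)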
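Running the three calls above side by side makes the endpoint behaviour easier to see. A short sketch, with variable names chosen for illustration:

```python
import numpy as np

evenly_spaced = np.linspace(0, 100, 11)  # 11 points: 0, 10, 20, ..., 100 (endpoint included)
stepped = np.arange(0, 100, 2)           # 0, 2, 4, ..., 98 (endpoint excluded)
powers = np.logspace(0, 2, 3)            # 10**0, 10**1, 10**2

print(evenly_spaced[-1])          # 100.0
print(stepped[-1], len(stepped))  # 98 50
print(powers.tolist())            # [1.0, 10.0, 100.0]
```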

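Normal distribution

Example 1: generate 100,000,000 samples from a normal distribution with mean 1.75 and standard deviation 1
import numpy as np
import matplotlib.pyplot as plt
# Generate normally distributed random numbers
x1 = np.random.normal(1.75,1,100000000)
# Plot the data to inspect the distribution
# 1) Create the canvas
plt.figure(figsize=(20,10),dpi=100)
# 2) Draw the histogram
plt.hist(x1,1000)
# 3) Show the figure
plt.show()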

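np.random.rand(d0, d1, ..., dn) returns an array of the given shape filled with uniformly distributed values in [0.0, 1.0).

np.random.uniform(low=0.0, high=1.0, size=None) draws samples from a uniform distribution over [low, high). Note that the interval is closed on the left and open on the right: low is included, high is not.

Parameters:
low: lower bound of the sampling range, float, default 0
high: upper bound of the sampling range, float, default 1
size: output shape, an int or a tuple; for example, size=(m,n,k) yields m*n*k samples; when omitted, a single value is returned
Return value: an ndarray whose shape matches size (a single float when size is omitted).

np.random.randint(low, high=None, size=None, dtype=int) draws from a discrete uniform distribution and returns an integer or an N-dimensional integer array. If high is not None, the values are random integers in [low, high); otherwise they are random integers in [0, low).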
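The three samplers described above can be compared on small shapes before generating millions of values. This is a sketch only; the fixed seed is an addition for reproducibility and the variable names are illustrative.

```python
import numpy as np

np.random.seed(0)  # fixed seed so reruns give the same numbers (not in the original notes)

unit = np.random.rand(2, 3)                     # shape (2, 3), values in [0.0, 1.0)
spread = np.random.uniform(-1, 1, size=(2, 3))  # shape (2, 3), values in [-1, 1)
ints = np.random.randint(0, 10, size=5)         # 5 integers in [0, 10)
below_low = np.random.randint(10, size=5)       # high omitted: 5 integers in [0, 10)

print(unit.shape, spread.shape, ints.shape, below_low.shape)
```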

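# Generate uniformly distributed random numbers
x2 = np.random.uniform(-1,1,100000000)
x2

import matplotlib.pyplot as plt
# Generate uniformly distributed random numbers
x2 = np.random.uniform(-1,1,100000000)
# Plot the data to inspect the distribution
# 1) Create the canvas
plt.figure(figsize=(10,10),dpi=100)
# 2) Draw the histogram
plt.hist(x=x2,bins=1000)  # x is the data to plot, bins is the number of intervals to split it into
# 3) Show the figure
plt.show()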

Machine learning: using data to adjust a model so that its output gets closer and closer to a given target

Array indexing and slicing
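Indexing and slicing a 2-D array take one index or slice per axis, rows first and columns second. A minimal sketch on a made-up 3x4 array named `grid`:

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)
# grid holds:
#   0  1  2  3
#   4  5  6  7
#   8  9 10 11

single = grid[1, 2]    # row 1, column 2 -> 6
first_row = grid[0]    # whole first row -> 0, 1, 2, 3
second_col = grid[:, 1]  # every row, column 1 -> 1, 5, 9
block = grid[:2, 1:3]  # rows 0-1, columns 1-2 -> (1, 2) and (5, 6)

print(single, first_row, second_col, block, sep="\n")
```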

When changing an array's shape, make sure the new shape holds exactly the same number of elements as the original.

stock_change = np.array([[-0.5142373,-0.32586912,39132714,1.02290317,-0.08775205],
                      [-0.35522986,1.99655647,0.24488145,-1.25742494,1.87762957],
                      [0.68582294,-1.6015672,0.85280747,1.05605474,-0.53759709]])
stock_change.shape # (3, 5)

# stock_change.reshape returns a view of the same data with a different shape; rows and columns are not swapped
stock_change.reshape([5,3]) # array([[-5.14237300e-01, -3.25869120e-01,  3.91327140e+07],
                            #        [ 1.02290317e+00, -8.77520500e-02, -3.55229860e-01],
                            #        [ 1.99655647e+00,  2.44881450e-01, -1.25742494e+00],
                            #        [ 1.87762957e+00,  6.85822940e-01, -1.60156720e+00],
                            #        [ 8.52807470e-01,  1.05605474e+00, -5.37597090e-01]])

# The shape becomes (-1, 3): -1 means that dimension is computed automatically, 3 means three columns; the original array is not modified
stock_change.reshape([-1,3]) # array([[-5.14237300e-01, -3.25869120e-01,  3.91327140e+07],
                             #        [ 1.02290317e+00, -8.77520500e-02, -3.55229860e-01],
                             #        [ 1.99655647e+00,  2.44881450e-01, -1.25742494e+00],
                             #        [ 1.87762957e+00,  6.85822940e-01, -1.60156720e+00],
                             #        [ 8.52807470e-01,  1.05605474e+00, -5.37597090e-01]])

stock_change.resize([1,15]) # Resize stock_change in place to 1 row and 15 columns; the original array is modified


# stock_change.T transposes the array, swapping its rows and columns; the original array is not modified
stock_change.T # array([[-5.14237300e-01, -3.55229860e-01,  6.85822940e-01],
               #        [-3.25869120e-01,  1.99655647e+00, -1.60156720e+00],
               #        [ 3.91327140e+07,  2.44881450e-01,  8.52807470e-01],
               #        [ 1.02290317e+00, -1.25742494e+00,  1.05605474e+00],
               #        [-8.77520500e-02,  1.87762957e+00, -5.37597090e-01]])
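The difference between reshaping and transposing, and the element-count rule, are easier to check on a tiny array. The following sketch uses a made-up 2x3 array `m` in place of stock_change.

```python
import numpy as np

m = np.arange(6).reshape(2, 3)  # rows: (0, 1, 2) and (3, 4, 5)

reshaped = m.reshape(3, 2)   # same element order, new row length: (0, 1), (2, 3), (4, 5)
transposed = m.T             # rows and columns swapped: (0, 3), (1, 4), (2, 5)
inferred = m.reshape(-1, 2)  # -1 lets NumPy work out the row count (3 here)

try:
    m.reshape(4, 2)          # 8 slots for 6 elements: the counts do not match
except ValueError as err:
    print("reshape failed:", err)

print(m.shape)  # (2, 3), reshape and T left the original untouched
```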

a1 = np.array([[[1,2,3],[4,5,6]],[[12,3,34],[5,6,7]]])
a1 # array([[[ 1,  2,  3],
   #     [ 4,  5,  6]],
    #   [[12,  3, 34],
    #    [ 5,  6,  7]]])
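a1.tobytes() # Serialize the array to raw bytes; a1 itself is unchanged (tostring() is a deprecated alias of tobytes())

np.unique(a1) # Remove duplicates and return the sorted unique values in a new array; a1 itself is unchanged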

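np.where (ternary operator)

# score holds the grades of 10 students in 5 courses (same construction as in the pandas examples of these notes)
score = np.random.randint(40,100,(10,5))
# For the first four students and first four courses, set grades above 60 to 1 and the rest to 0
temp = score[:4, :4]
np.where(temp>60, 1, 0)
# Compound conditions need np.logical_and and np.logical_or
# For the first four students and first four courses, set grades above 60 and below 90 to 1 and the rest to 0
np.where(np.logical_and(temp > 60, temp < 90), 1, 0)
# For the first four students and first four courses, set grades above 90 or below 60 to 1 and the rest to 0
np.where(np.logical_or(temp>90, temp<60), 1, 0)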
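To see exactly which cells each condition selects, it helps to run the same three calls on fixed numbers. The grades below are hypothetical values chosen to sit on and around the thresholds; they are not data from the notes.

```python
import numpy as np

# Hypothetical grades: 4 students x 4 courses
temp = np.array([[59, 61, 90, 75],
                 [60, 95, 45, 88],
                 [91, 30, 62, 89],
                 [70, 100, 58, 66]])

passed = np.where(temp > 60, 1, 0)
mid_band = np.where(np.logical_and(temp > 60, temp < 90), 1, 0)
extremes = np.where(np.logical_or(temp > 90, temp < 60), 1, 0)

print(passed[0])    # [0 1 1 1]  (59 is not above 60)
print(mid_band[0])  # [0 1 0 1]  (90 is not below 90)
print(extremes[0])  # [1 0 0 0]  (only 59 is below 60)
```

Note that the comparisons are strict, so a grade of exactly 60 or 90 fails both `> 60` and `< 60` style tests.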

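Summary statistics: in data mining and machine learning, the values of summary statistics are one of the ways we analyze a problem. Commonly used statistics:

Minimum: np.min(a, axis)

Maximum: np.max(a, axis)

Median: np.median(a, axis)

Mean: np.mean(a, axis, dtype)

Standard deviation: np.std(a, axis, dtype)

Variance: np.var(a, axis, dtype)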

temp = ([[89, 56, 51, 79, 91],[66, 95, 57, 50, 58], [44, 88, 79, 83, 74], [76, 71, 44, 65, 44]])
np.max(temp) # np.int64(95)
np.mean(temp) # np.float64(68.0)
np.max(temp, axis=0) # array([89, 95, 79, 83, 91])
np.max(temp, axis=1) # array([91, 95, 88, 76])
np.argmax(temp) # np.int64(6)
np.argmax(temp, axis=0) # array([0, 1, 2, 2, 0])
np.argmax(temp, axis=1) # array([4, 1, 1, 0])
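The calls above cover max, mean, and argmax; the remaining statistics from the list follow the same axis convention. A short sketch on the same grades, computing one value per row (axis=1); the variable names are illustrative.

```python
import numpy as np

temp = np.array([[89, 56, 51, 79, 91],
                 [66, 95, 57, 50, 58],
                 [44, 88, 79, 83, 74],
                 [76, 71, 44, 65, 44]])

row_min = np.min(temp, axis=1)        # 51, 50, 44, 44
row_median = np.median(temp, axis=1)  # 79.0, 58.0, 79.0, 65.0
row_mean = np.mean(temp, axis=1)      # 73.2, 65.2, 73.6, 60.0
row_std = np.std(temp, axis=1)        # spread of each row around its mean
row_var = np.var(temp, axis=1)        # variance is the square of the standard deviation

print(row_min, row_median, row_mean, row_std, row_var, sep="\n")
```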

Array arithmetic

a = np.array([[1,2,3],[4,5,6]])
a + 3 # array([[4,5,6],[7,8,9]])
a/2 # array([[0.5,1,1.5],[2,2.5,3]])
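Arithmetic between two arrays is also element-wise, and a 1-D array whose length matches the number of columns is applied to every row (broadcasting). A minimal sketch; the arrays `b` and `row` are made up for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[10, 20, 30], [40, 50, 60]])
row = np.array([1, 0, -1])

total = a + b       # (11, 22, 33) and (44, 55, 66)
product = a * b     # element-wise, not a matrix product: (10, 40, 90) and (160, 250, 360)
shifted = a + row   # row is added to each row of a: (2, 2, 2) and (5, 5, 5)

print(total, product, shifted, sep="\n")
```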

pandas

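Advantages of pandas

Makes tabular data easier to read

Convenient data-processing capabilities

Easy file reading

Wraps the plotting of Matplotlib and the computation of NumPy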

pd.Series(data=None,index=None,dtype=None)

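data: the data passed in; can be an ndarray, a list, and so on

index: the index labels; must have the same length as the data. pandas does not require the labels to be unique, although unique labels keep lookups unambiguous. If no index is passed, an integer index from 0 to N-1 is created automatically.

dtype: the data type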

# Import pandas
import pandas as pd
pd.Series([6.7,5.6,3,10,2],index=[1,2,3,4,5])
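The effect of the index argument can be checked by looking labels up afterwards. This is a small sketch using the same data; `s` and `default_index` are illustrative names.

```python
import pandas as pd

s = pd.Series([6.7, 5.6, 3, 10, 2], index=[1, 2, 3, 4, 5])

print(s.index.tolist())  # [1, 2, 3, 4, 5]
print(s.loc[2])          # 5.6, looked up by label 2 rather than by position
print(s.dtype)           # float64, because the data mixes floats and ints

# Without an index argument, labels default to 0..N-1
default_index = pd.Series([6.7, 5.6, 3])
print(default_index.index.tolist())  # [0, 1, 2]
```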

pd.DataFrame(data=None, index=None, columns=None)

# Import numpy and pandas
import numpy as np
import pandas as pd
pd.DataFrame(np.random.randn(2,3)) # 2 rows, 3 columns

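# Generate scores for 10 students in 5 subjects
score = np.random.randint(40,100,(10,5))
# Wrap the array in a pandas data structure
score_df = pd.DataFrame(score)
# Build the column labels (one per subject)
subjects = ["Chinese","Math","English","Politics","PE"]
# Build the row labels (one per student)
stu = ['Student'+ str(i) for i in range(score_df.shape[0])]
# Attach the row and column labels
data = pd.DataFrame(score,columns=subjects, index=stu)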

Attributes

shape: shape of the table

index: row labels

columns: column labels

values: the underlying values

T: transpose
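Each of the attributes above can be inspected on a small frame. The following sketch uses a made-up 3x2 table; the labels `s0`..`s2`, `math`, and `art` are hypothetical.

```python
import numpy as np
import pandas as pd

scores = np.array([[90, 80], [70, 60], [50, 40]])
df = pd.DataFrame(scores, index=["s0", "s1", "s2"], columns=["math", "art"])

print(df.shape)             # (3, 2)
print(df.index.tolist())    # ['s0', 's1', 's2']
print(df.columns.tolist())  # ['math', 'art']
print(df.values)            # the data as a 2-D ndarray
print(df.T.shape)           # (2, 3), rows and columns swapped
```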