Applications of Python in Data Analysis (matplotlib, numpy, pandas)

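The concept of data analysis

What is data analysis

Data analysis means applying appropriate methods to large amounts of collected data, helping people make judgments so that they can take appropriate action.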

Using jupyter

Installing jupyter

pip install jupyter

jupyter notebook  # start jupyter

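matplotlib

1. It visualizes data so that it is presented more intuitively

2. It makes data more objective and more persuasive

Key points

from matplotlib import pyplot as plt  # import pyplot

x = range(2,26,2)  # positions of the data on the x axis; an iterable

y = [15,13,14.5,17,20,25,26,26,24,22,18,15]  # positions of the data on the y axis; an iterable. The x and y data together make up all the coordinates to draw: (2,15),(4,13),(6,14.5),(8,17)......

plt.plot(x,y)  # pass in x and y; plot draws the line chart

plt.show()  # display the figure when the program runs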

Drawing a line chart

from matplotlib import pyplot as plt
x = range(2,26,2)
y = [15,13,14.5,17,20,25,26,26,24,22,18,15]
plt.plot(x,y)# draw the line chart
plt.show()# display it

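However, this basic chart still leaves several things to address:
1. Setting the figure size (we want a large, high-resolution image)
2. Saving the figure locally
3. Descriptive information, such as what the x and y axes represent and what the chart shows
4. Adjusting the spacing of the x or y ticks
5. Line style (such as color and transparency)
6. Marking special points (for example, showing where the highest and lowest points are)
7. Adding a watermark to the figure (to prevent forgery and unauthorized use)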

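Setting the figure size

import matplotlib.pyplot as plt

# "figure" here refers to the chart we are drawing.
# Instantiating a figure and passing parameters makes pyplot use that
# figure instance automatically behind the scenes.
# If the image looks blurry, pass the dpi parameter to make it sharper.
fig = plt.figure(figsize=(20,8),dpi=80)
x = range(2,26,2)
y = [15,13,14.5,17,20,25,26,26,24,22,18,15]

plt.plot(x,y)
# Set the x-axis ticks: candidate positions every 0.5, keeping every third one
_xtick_labels = [i/2 for i in range(4,49)]
plt.xticks(_xtick_labels[::3])
plt.yticks(range(min(y),max(y)+1))

# Save the figure; a vector format such as svg shows no jagged edges when zoomed in
plt.savefig("./sig_size.png")
plt.show()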

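Displaying Chinese text

Why Chinese cannot be displayed: by default matplotlib does not render Chinese characters, because its default (English) font has no glyphs for them.

Listing the fonts available on linux/mac: fc-list lists the supported fonts; fc-list :lang=zh lists the supported Chinese fonts (note the space before the colon).

So how do we change matplotlib's default font? It can be changed with matplotlib.rc; see the source code for details (windows/linux). It can also be solved with font_manager under matplotlib (windows/linux/mac).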
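Before choosing a font name for matplotlib.rc, it can help to see which fonts matplotlib has actually found on the machine. The following is a minimal sketch, not the author's method: the keyword tuple is a hypothetical list of common Chinese font names and will need adjusting for your system.

```python
from matplotlib import font_manager

# Names of every font matplotlib has discovered on this machine.
font_names = sorted({f.name for f in font_manager.fontManager.ttflist})

# Hypothetical keywords for spotting common Chinese fonts (an assumption,
# not an exhaustive list); adjust them for your own system.
cjk_keywords = ("YaHei", "SimHei", "PingFang", "Noto Sans CJK", "WenQuanYi")
cjk_fonts = [n for n in font_names if any(k in n for k in cjk_keywords)]

print(len(font_names), "fonts found")
print("Candidate Chinese fonts:", cjk_fonts)
```

Any name printed in the second line can be used as the 'family' value passed to matplotlib.rc. If the list is empty, a Chinese font has to be installed first.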

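If list a holds the temperature for every minute from 10:00 to 12:00, how do we draw a line chart to observe how the temperature changes minute by minute? a = [random.randint(20,35) for i in range(120)]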

# Author: Aloha
# Created: 2023/7/8 16:33
from matplotlib import pyplot as plt
import random
import matplotlib

# Change the font (this is how to do it on windows)

font = {
    'family':'MicroSoft YaHei',
    'weight':'bold',
    'size':14.0
}
matplotlib.rc("font",**font)

x = range(0,120)
y = [random.randint(20,35) for i in range(120)]

plt.figure(figsize=(15,8),dpi=80)


plt.plot(x,y)

# Adjust the x-axis ticks
_xtick_lables = ["10点{}分".format(i) for i in range(60)]
_xtick_lables += ["11点{}分".format(i) for i in range(60)]
# Take every 10th position so that positions and label strings pair up one to one
plt.xticks(list(x)[::10],_xtick_lables[::10],rotation=45)
print(type(_xtick_lables))

# Add descriptive information
plt.xlabel("时间")
plt.ylabel("温度 单位（℃）")
plt.title("10点到12点每分钟的变化情况")

plt.show()

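Suppose that at age 30 you tally up, from your own experience, how many girlfriends (boyfriends) you had each year from age 11 to age 30, as in list a. Draw a line chart of this data to analyze the yearly trend. a = [1,0,1,1,2,4,3,2,3,4,4,5,6,5,4,3,3,1,1,1] Requirements: the y axis shows the count; the x axis shows the age, such as 11岁, 12岁 and so on.

# Author: Aloha
# Created: 2023/7/10 10:29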
from matplotlib import pyplot as plt
import matplotlib
import random

font = {
    'family':'MicroSoft YaHei',
    'weight':'bold',
    'size':14.0
}
matplotlib.rc("font",**font)

y1 = [1,0,1,1,2,4,3,2,3,4,4,5,6,5,4,3,3,1,1,1]
y2 = [1,0,3,1,2,2,3,3,2,1 ,2,1,1,1,1,1,1,1,1,1]
x = range(11,31)

# Set the figure size
plt.figure(figsize=(15,8),dpi=80)

# Draw the lines, setting the color and line style
plt.plot(x,y1,label="自己",color="skyblue",linestyle=":")
plt.plot(x,y2,label="同桌",color="green",linestyle="--")

# Set the x-axis ticks
_xtick_labels = ['{}岁'.format(i) for i in x]
plt.xticks(x,_xtick_labels)
plt.yticks(range(0,9))

# Draw the grid
plt.grid(alpha=0.6,linestyle=":")

# Add the legend
plt.legend(loc="upper left")

plt.show()

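Summary

Drew a line chart (plt.plot)

Set the figure size and resolution (plt.figure)

Saved the figure (plt.savefig)

Set the tick positions and label strings on the x axis (xticks)

Solved the problem of ticks being too sparse or too dense (xticks)

Set the title and the x/y axis labels (title, xlabel, ylabel)

Set the font (font_manager.FontProperties, matplotlib.rc)

Drew several lines on one figure (call plt.plot multiple times)

Added a legend for the different lines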

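Comparing common statistical charts

Line chart: a chart that shows increases and decreases in a quantity through the rise and fall of a line. Feature: shows the trend of the data and reflects how things change. (change)

Histogram: a series of vertical bars or line segments of varying height showing how data is distributed. Generally the horizontal axis shows the data range and the vertical axis shows the distribution. Feature: drawn for continuous data, showing the distribution of one or more groups of data. (statistics)

Bar chart: data arranged in the columns or rows of a worksheet can be drawn as a bar chart. Feature: drawn for discrete data; the size of each value can be seen at a glance and the differences between values compared. (statistics)

Scatter plot: two sets of data form a number of coordinate points; examining how the points are distributed shows whether the two variables are related, or summarizes the distribution pattern of the points. Feature: reveals whether variables show a quantitative trend together, and exposes outliers. (distribution patterns)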

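Drawing a scatter plot

Suppose a crawler has given you the daily daytime high temperature in Beijing for March and October 2016 (in lists a and b respectively). How can you look for a pattern in how the temperature changes over time (days)? a=[11,17,16,11,12,11,12,6,6,7,8,9,12,15,14,17,18,21,16,17,20,14,15,15,15,19,21,22,22,22,23]

b=[26,26,28,19,21,17,16,19,18,20,20,19,22,23,17,20,21,20,22,15,11,15,5,13,17,10,11,13,12,13,6]

# Author: Aloha
# Created: 2023/7/10 12:23
from matplotlib import pyplot as plt
import matplotlib

# Display Chinese fonts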
font = {
    'family':'MicroSoft YaHei',
    'weight':'bold',
    'size':14.0
}
matplotlib.rc("font",**font)

y3 = [11,17,16,11,12,11,12,6,6,7,8,9,12,15,14,17,18,21,16,17,20,14,15,15,15,19,21,22,22,22,23]
y10 = [26,26,28,19,21,17,16,19,18,20,20,19,22,23,17,20,21,20,22,15,11,15,5,13,17,10,11,13,12,13,6]


x3 = range(1,32)
x10 = range(51,82)

# Set the figure size
plt.figure(figsize=(15,8),dpi=80)

# Draw the scatter plots with the scatter method
plt.scatter(x3,y3,label="3月份")
plt.scatter(x10,y10,label="10月份")

# Adjust the x-axis ticks
_x = list(x3)+list(x10)
_xtick_label = ["3月{}日".format(i) for i in x3]
# x10 starts at 51 so the October points sit to the right of March;
# subtract 50 to recover the day of the month for the label
_xtick_label += ["10月{}日".format(i-50) for i in x10]
plt.xticks(_x[::3],_xtick_label[::3],rotation=45)


# Add the legend
plt.legend()

# Add descriptive information
plt.xlabel("时间")
plt.ylabel("温度")
plt.title("温度随时间变化图")


# Show
plt.show()

More use cases for scatter plots

Intrinsic relationships between different conditions (dimensions)

Observing how dispersed or clustered the data is

Drawing a bar chart

Suppose you have the 20 highest-grossing films in mainland China in 2017 (list a) and their box-office figures (list b, unit: 100 million yuan). How can this data be shown more intuitively? Both lists are given in full in the code below.

# Author: Aloha
# Created: 2023/7/10 13:01
from matplotlib import pyplot as plt
import matplotlib

# Display Chinese fonts
font = {
    'family':'MicroSoft YaHei',
    'weight':'bold',
    'size':7.0
}
matplotlib.rc("font",**font)

a = ["战狼2","速度与激情8","功夫瑜伽","西游伏妖篇","变形金刚5:最后的骑士","摔跤吧!爸爸","加勒比海盗5:死无对证","金刚:骷髅岛","极限特工:终极回归","生化危机6:终章","乘风破浪","神偷奶爸3","智取威虎山","大闹天竺","金刚狼3∶殊死一战","蜘蛛侠:英雄归来","悟空传","银河护卫队2","情圣","新木乃伊"]
b = [56.01,26.94,17.53,16.49,15.45,12.96,11.8,11.61,11.28,1.12,10.49,10.3,8.75,7.55,7.32,6.99,6.88,6.86,6.58,6.23]

# Set the figure size
plt.figure(figsize=(20,8),dpi=80)
# Draw the bar chart
plt.bar(range(len(a)),b,width=0.4)
# Put the title strings on the x axis
plt.xticks(range(len(a)),a,rotation=45)


plt.show()

# Author: Aloha
# Created: 2023/7/10 14:31
# Horizontal bar chart
from matplotlib import pyplot as plt
import matplotlib

# Display Chinese fonts
font = {
    'family':'MicroSoft YaHei',
    'weight':'bold',
    'size':7.0
}
matplotlib.rc("font",**font)

a = ["战狼2","速度与激情8","功夫瑜伽","西游伏妖篇","变形金刚5:最后的骑士","摔跤吧!爸爸","加勒比海盗5:死无对证","金刚:骷髅岛","极限特工:终极回归","生化危机6:终章","乘风破浪","神偷奶爸3","智取威虎山","大闹天竺","金刚狼3∶殊死一战","蜘蛛侠:英雄归来","悟空传","银河护卫队2","情圣","新木乃伊"]
b = [56.01,26.94,17.53,16.49,15.45,12.96,11.8,11.61,11.28,1.12,10.49,10.3,8.75,7.55,7.32,6.99,6.88,6.86,6.58,6.23]

# Set the figure size
plt.figure(figsize=(20,8),dpi=80)
# Draw the horizontal bar chart
plt.barh(range(len(a)),b,height=0.4)
# Put the title strings on the y axis
plt.yticks(range(len(a)),a)


plt.show()

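Suppose you know the box-office takings of the films in list a on the three days 2017-09-14 (b_14), 2017-09-15 (b_15) and 2017-09-16 (b_16). To show each film's own takings as well as how it compares with the other films, how can this data be presented more intuitively? a=["猩球崛起3:终极之战","敦刻尔克","蜘蛛侠:英雄归来","战狼2"] b_16=[15746,312,4497,319] b_15=[12357,156,2045,168] b_14=[2358,399,2358,362]

# Author: Aloha
# Created: 2023/7/11 15:20
from matplotlib import pyplot as plt
import matplotlib

# Display Chinese fonts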
font = {
    'family':'MicroSoft YaHei',
    'weight':'bold',
    'size':7.0
}
matplotlib.rc("font",**font)


a=["猩球崛起3:终极之战","敦刻尔克","蜘蛛侠:英雄归来","战狼2"]
b_16=[15746,312,4497,319]
b_15 =[12357,156,2045,168]
b_14 =[2358,399,2358,362]

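# Bar width; each day's bars are shifted right by one bar width
# so that the three days sit side by side without overlapping
bar_width = 0.2
x_14=list(range(len(a)))
x_15=[i+bar_width for i in x_14]
x_16=[i+bar_width*2 for i in x_14]

# Set the figure size
plt.figure(figsize=(20,8),dpi=80)

plt.bar(x_14,b_14,width=bar_width,label="9月14日")
plt.bar(x_15,b_15,width=bar_width,label="9月15日")
plt.bar(x_16,b_16,width=bar_width,label="9月16日")


# Set the x-axis ticks (centered under the middle bar of each group)
plt.xticks(x_15,a)


# Add the legend
plt.legend()

plt.show()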

Drawing a histogram

Suppose you have the running times of 250 films (in list a) and want to know how these running times are distributed (for example, how many films run between 100 and 120 minutes, and how frequently). How should you present this data? The list a is given in full in the code below.

# Author: Aloha
# Created: 2023/7/12 10:30
from matplotlib import pyplot as plt
import matplotlib

# Display Chinese fonts
font = {
    'family':'MicroSoft YaHei',
    'weight':'bold',
    'size':7.0
}
matplotlib.rc("font",**font)

a = [131,98,125,131,124,139,131,117,128,108,135,138,131,102,107,114, 119,128,121,142,127,
     130,124,101,110,116,117, 110,128, 128,115,99,136,126,134,95,138,117,100,78,132, 124,113,
     150,110,117,86,95,144,105, 126,130,126,130,126,116,123, 106,112,138,123,86, 101,99,136,
     123,117,119, 105,137,123, 128,125,104,109, 134,125, 127,105,120,107, 129,116,108,132,103,
     136,118,102,120,114,105,150,132,145,119, 121,112,139,125,138, 109,132,134,156,106,117,127,
     144,139, 139,119,140,83,110,102,123,107,143,115,136,118,139,123,112, 118,125,109,119,133,
     112,114,122,109,106, 123,116, 131,127,115,118,112,135,115,146,137,116,103,144,83,123, 111,
     110,100,100,154,136,100,118,119,133,134,106,129,126,110,111,109,141,120,117,106,149,122,122,
     110,118,127,121,114,125,126,140,140,103,130,141,117,106,114,121,114,133,137,92,121,112,146,
     97,137,105,98,117,112,81,97,139,113,134,106,144,110,137,137,111,104,117,100,110,101,110,105,
     129,137,112,120,113,133,112,83,94,146,133,101,131,116,111,84,137,115,122,106,144,109,123,116,
     111,111,133,150]

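# Compute the number of bins
d = 3 # bin width
# max(a)-min(a) is 78 here, an exact multiple of d, so every bin is exactly d wide
num_bins = (max(a)-min(a))//d

# Set the figure size
plt.figure(figsize=(20,8),dpi=80)
plt.hist(a,num_bins,density=True)# density=True draws a frequency-distribution histogram
print(max(a),min(a))
# Set the x-axis ticks
plt.xticks(range(min(a),max(a)+d,d))

plt.grid()

plt.show()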

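The 2004 US census found that 124 million people worked relatively far from home. Based on the time they needed to travel from home to work, sampling produced the data below (quantity is the last column of the table). Can this data be drawn as a histogram?

interval=[0,5,10,15,20,25,30,35,40,45,60,90] width = [5,5,5,5,5,5,5,5,5,15,30,60] quantity = [836,2737,3723,3926,3596,1438,3273,642,824,613,215,47]

What was the earlier question really asking? It asked which kinds of data can be drawn as a histogram. The data given here has already been tallied, so to get the effect of a histogram we have to draw a bar chart. In general, plt.hist is meant for raw data that has not yet been tallied.

More use cases for histograms

The age distribution of users

The distribution of user clicks over a period of time

The distribution of the times at which users are active

# Author: Aloha
# Created: 2023/7/13 14:49
from matplotlib import pyplot as plt
import matplotlib

# Display Chinese fonts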
font = {
    'family':'MicroSoft YaHei',
    'weight':'bold',
    'size':7.0
}
matplotlib.rc("font",**font)

interval=[0,5,10,15,20,25,30,35,40,45,60,90]
width = [5,5,5,5,5,5,5,5,5,15,30,60]
quantity = [836,2737,3723,3926,3596,1438,3273,642,824,613,215,47]

# Set the figure size
plt.figure(figsize=(20,8),dpi=80)

plt.bar(range(12),quantity,width=1)
_x = [i-0.5 for i in range(13) ]
_xtick_labels = interval+[150]
plt.xticks(_x,_xtick_labels)

plt.grid()

plt.show()
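The code above draws every interval with the same bar width. As a variant, the width list can be passed to plt.bar so that each bar spans its real interval. This is only a sketch of one option, not the author's code; note that bar heights here are raw counts, so wide intervals look larger than a density-scaled histogram would show.

```python
from matplotlib import pyplot as plt

interval = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 60, 90]
width = [5, 5, 5, 5, 5, 5, 5, 5, 5, 15, 30, 60]
quantity = [836, 2737, 3723, 3926, 3596, 1438, 3273, 642, 824, 613, 215, 47]

plt.figure(figsize=(20, 8), dpi=80)

# align="edge" makes each bar start at its interval's left edge,
# and width=width gives each bar the true length of its interval.
bars = plt.bar(interval, quantity, width=width, align="edge")

# Tick at every interval boundary, including the right edge of the last one.
plt.xticks(interval + [150])
plt.grid()

# plt.show()  # uncomment to display the chart
```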

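numpy

Why learn numpy

Fast

Convenient

The foundation library for scientific computing

What is numpy: a foundation library for scientific computing in Python that focuses on numerical computation. Most Python scientific computing libraries are built on it, and it is mostly used for numerical operations on large, multi-dimensional arrays.

Creating arrays (matrices) with numpy

# Author: Aloha
# Created: 2023/7/14 8:58
import numpy as np


# Create arrays with numpy; the result has type ndarray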
t1 = np.array([1,2,3])
print(t1)
print(type(t1))

t2 = np.array(range(10))
print(t2)
print(type(t2))

t3 = np.arange(4,10,2)
print(t3)
print(t3.dtype)

Data types in numpy

# Data types in numpy
import random

t4= np.array(range(1,4),dtype=float)# float
print(t4)
print(t4.dtype)

t5= np.array(range(1,4),dtype="i1")# 8-bit integer
print(t5)
print(t5.dtype)

# bool type
t6 = np.array([1,10,3,0,1,0],dtype=bool)
print(t6)
print(t6.dtype)

# Change the data type
t7 = t6.astype("int8")
print(t7)
print(t7.dtype)

# Floating-point numbers in numpy
t8 = np.array([random.random() for i in range(10)])
print(t8)
print(t8.dtype)

Array shapes

# Array shapes
t9= np.array([[1,2,3],[4,5,6]])
print(t9)
print(t9.shape)

t10=np.array([[[1,2,3],[4,5,6]],[[7,8,9],[10,11,12]]])
print(t10.shape)

# Change the shape of an array
t11 = np.arange(12)
print(t11)
print(t11.reshape((3,4)))

t12=np.arange(24).reshape((2,3,4))
print(t12)

print(t12.reshape(24))
print(t11.flatten())# convert to a one-dimensional array
print(t12.reshape(np.size(t12)))

Array calculations

# Author: Aloha
# Created: 2023/7/14 9:46
import numpy as  np

t1 = np.array([[0,2,4],[11,19,17],[14,56,12]])
print(t1)

print(t1+2)
print(t1/2)
# print(t1/0)

t2 = np.arange(100,109).reshape((3,3))
print(t1+t2)
print(t2-t1)

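Broadcasting rules

If the trailing dimensions of two arrays (the dimensions counted from the end) have matching axis lengths, or one of the two lengths is 1, the arrays are broadcast-compatible. Broadcasting happens along the missing dimensions and (or) the dimensions of length 1. How should this be understood? Think of "dimensions" as the numbers listed in the shape. So: can an array of shape (3,3,3) be computed with an array of shape (3,2)? Can an array of shape (3,3,2) be computed with an array of shape (3,2)? What is the benefit? One example: subtracting each column's mean from the data in that column.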

# Author: Aloha
# Created: 2023/7/14 9:46
import numpy as  np

t1 = np.array([[0,2,4],[11,19,17],[14,56,12]])# 3*3
print(t1)

print(t1+2)
print(t1/2)
# print(t1/0)  # division by zero: numpy warns and yields inf/nan

t2 = np.arange(100,109).reshape((3,3))
print(t1+t2)
print(t2-t1)

t3 = np.arange(0,6).reshape((3,2))# 3*2
print(t3)
# print(t1+t3)  # cannot be computed
# print(t1*t3)  # cannot be computed

t4 = np.arange(0,6).reshape((2,3))# 2*3
print(t4)
# print(t1+t4)  # cannot be computed
# print(t1*t4)  # cannot be computed

t5 = np.arange(0,2).reshape((1,2))# 1*2
print(t5)
# print(t1+t5)  # cannot be computed

t6 = np.arange(0,3).reshape((1,3))
print(t6)
print(t1+t6)# can be computed
print(t1*t6)# can be computed

print('**********************')

d1 = np.arange(0,27).reshape((3,3,3))
print(d1)

d2 = np.arange(0,6).reshape((3,2))
print(d2)
# print(d1+d2)  # cannot be computed

d3 = np.arange(0,6).reshape((2,3))
# print(d1+d3)  # cannot be computed

d4 = np.arange(0,9).reshape((3,3))
print(d4)
print(d1+d4)

'''
1: Computable when all dimensions match, e.g. 2*3*4 and 2*3*4
2: Computable when the dimensions match comparing from back to front and one of the leading dimensions is 1, e.g. 1*3*4 and 4*3*4
3: Computable when the dimensions match comparing from back to front and one array has no leading dimension, e.g. 3*4*5 and 4*5
'''
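The example mentioned in the broadcasting rules, subtracting each column's mean from that column, can be sketched as follows. The array t and the variable names are illustrative assumptions; np.broadcast_shapes (available in numpy 1.20 and later) is used only to check the two shape questions asked above.

```python
import numpy as np

t = np.arange(12).reshape((4, 3)).astype(float)

# Column means have shape (3,), which matches the trailing dimension of (4, 3),
# so the subtraction broadcasts across every row.
col_mean = t.mean(axis=0)
centered_cols = t - col_mean

# Row means have shape (4,); reshape to (4, 1) so the length-1 axis broadcasts.
row_mean = t.mean(axis=1)
centered_rows = t - row_mean.reshape((4, 1))

print(centered_cols.mean(axis=0))  # every column now averages to 0
print(centered_rows.mean(axis=1))  # every row now averages to 0

# The two shape questions from the text:
print(np.broadcast_shapes((3, 3, 2), (3, 2)))  # compatible
try:
    np.broadcast_shapes((3, 3, 3), (3, 2))
except ValueError as err:
    print("not compatible:", err)
```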

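Axes

In numpy an axis can be understood as a direction, denoted by the numbers 0, 1, 2, ... A one-dimensional array has only axis 0; a two-dimensional array (shape (2,2)) has axes 0 and 1; a three-dimensional array (shape (2,2,3)) has axes 0, 1 and 2.

With the concept of axes, calculations become more convenient. For example, to average a two-dimensional array along one direction, we must specify along which axis the numbers are averaged.

So where did axes appear in what we covered earlier? Recall np.arange(0,10).reshape((2,5)): in reshape, 2 is the length of axis 0 (the number of rows of data) and 5 is the length of axis 1, 2x5 giving 10 values in total.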
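A small sketch of what choosing an axis means for the mean of the 2x5 array just described (variable names are illustrative):

```python
import numpy as np

t = np.arange(0, 10).reshape((2, 5))
print(t.shape)         # (2, 5)

# axis=0 collapses the rows: one mean per column.
mean_axis0 = t.mean(axis=0)
print(mean_axis0)      # [2.5 3.5 4.5 5.5 6.5]

# axis=1 collapses the columns: one mean per row.
mean_axis1 = t.mean(axis=1)
print(mean_axis1)      # [2. 7.]

# No axis: the mean of all ten values.
print(t.mean())        # 4.5
```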

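Reading data with numpy. CSV: Comma-Separated Values file. Displayed: as a table. Source file: formatted text in which newlines separate rows and commas separate columns; each line represents one record. Because csv is easy to display, read and write, it is widely used to store and transfer small and medium-sized data. For teaching convenience we will often work with csv files, but working with data from a database is also easy to do.

# Author: Aloha
# Created: 2023/7/14 14:36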
import numpy as np

us_file_path = "./data/US_video_data_numbers.csv"
uk_file_path = "./data/GB_video_data_numbers.csv"

t1 =  np.loadtxt(us_file_path,delimiter=",",dtype="int",unpack=True)
print(t1)
print('*********************************')
t2 = np.loadtxt(us_file_path,delimiter=",",dtype="int")
print(t2)

# Transposing in numpy
# Transposing is a transformation: for a numpy array it swaps the data across the diagonal, which makes the data more convenient to process

t3 = np.arange(24).reshape((4,6))
print(t3)
print(t3.transpose())
print(t3.T)
print(t3.swapaxes(1,0))

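Indexing and slicing in numpy

# Author: Aloha
# Created: 2023/7/14 15:03
import numpy as np

us_file_path = "./data/US_video_data_numbers.csv"
uk_file_path = "./data/GB_video_data_numbers.csv"

t2 = np.loadtxt(us_file_path,delimiter=",",dtype="int")

print(t2)
print('*'*20)
# Select a row
print(t2[2])
# Select several consecutive rows
print(t2[2:])
# Select several non-consecutive rows
print(t2[[2,8,10]])
# The same row selections with the column slice written explicitly
print(t2[1,:])
print(t2[2:,:])
print(t2[[2,10,3],:])
# Select a column
print(t2[:,0])
# Select several consecutive columns
print(t2[:,2:])
# Select several non-consecutive columns
print(t2[:,[0,2]])
# Select by row and column: the value in the third row, fourth column
print(t2[2,3])
# Select several rows and several columns
print(t2[2:5,1:4])
# Select several non-adjacent points
print(t2[[0,2,2],[0,1,3]])#(0,0)(2,1)(2,3)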

Modifying values in numpy

t3 = np.arange(24).reshape(4,6)
print(t3)
t3[t3<10]=3
print(t3<10)
print(t3)

The ternary operator in numpy

t3 = np.arange(24).reshape(4,6)
t4 = np.where(t3<10,0,10)
print(t4)

clip in numpy

t3 = np.arange(24).reshape(4,6)
print(t3.clip(10,18))# nan values are not replaced by clip

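nan and inf in numpy

nan (NAN, Nan): not a number

When does nan appear in numpy:

When we read a local file as float and some values are missing, nan appears

When an inappropriate calculation is performed (for example infinity (inf) minus infinity)

inf (-inf, inf): infinity; inf is positive infinity and -inf is negative infinity

When does inf (either -inf or +inf) appear? For example when a number is divided by 0 (python raises an error directly; numpy gives inf or -inf)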
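The cases just listed can be reproduced in a few lines. This is a minimal sketch; np.errstate is used only to silence the runtime warnings that numpy would otherwise print.

```python
import numpy as np

a = np.array([1.0, -1.0, 0.0])

# Plain Python raises ZeroDivisionError for 1.0 / 0, but numpy returns
# inf, -inf or nan (and emits a warning unless it is suppressed).
with np.errstate(divide="ignore", invalid="ignore"):
    q = a / 0

print(q)

# An "inappropriate" calculation: infinity minus infinity is nan.
diff = np.inf - np.inf
print(diff)
```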

# Author: Aloha
# Created: 2023/7/14 16:00
import numpy as np

# 1. Two nan values are not equal
print(np.nan==np.nan)
# 2. np.nan != np.nan is True
print(np.nan!=np.nan)
# 3. Use this property to count the nan values in an array
t1=np.arange(24).reshape(4,6)
print(t1)
t2=t1.astype('float')
print(type(t2))
t2[:,0]=0
t2[1,1]=np.nan
print(t2)
print(np.count_nonzero(t2))
print(np.count_nonzero(t2!=t2))
# 4. Given 2, how do we test whether a value is nan?
# Use np.isnan(a), which returns bool values; useful when, for example, nan should be replaced with 0
print(np.isnan(t2))
print(np.count_nonzero(np.isnan(t2)))
# 5. Any calculation involving nan gives nan
print(np.sum(t2))

t3 = np.arange(24).reshape(4,6)
print(t3)
print(np.sum(t3,axis=0))
print(np.sum(t3,axis=1))
print(np.sum(t2,axis=1))

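So the question is: is it appropriate to simply replace nan with 0 in a set of data? What effect does it have?

For example, if the mean before replacement is greater than 0, replacing every nan with 0 will certainly lower the mean. A more common approach is therefore to replace missing values with the mean (or median), or to delete the rows that contain missing values.

That raises two more questions: how do we compute the median or mean of a set of data, and how do we delete the row (column) containing missing data [covered in the pandas part]?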

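Common statistical functions in numpy

Sum: t.sum(axis=None)

Mean: t.mean(axis=None), strongly affected by outliers

Median: np.median(t, axis=None)

Maximum: t.max(axis=None)

Minimum: t.min(axis=None)

Range: np.ptp(t, axis=None), i.e. the difference between the maximum and the minimum

Standard deviation: t.std(axis=None) $\sigma=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_{i}-\mu\right)^{2}}$ The standard deviation measures how far a set of data is spread around its mean. A large standard deviation means most values differ considerably from the mean; a small one means the values are close to the mean. It reflects how stable the data is: the larger it is, the greater the fluctuation and the less stable the data.

By default these return the statistic over the whole multi-dimensional array; if axis is given, they return the result along that axis.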

t3 = np.arange(24).reshape(4,6)
print(t3)
print(t3.sum(axis=0))
print(t3.mean(axis=0))
print(np.median(t3))
print(np.median(t3,axis=0))
print(t3.max())
print(t3.max(axis=0))
print(t3.min())
print(t3.min(axis=0))
print(np.ptp(t3))
print(np.ptp(t3,axis=0))
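As a check on the standard deviation formula above, the following sketch computes it by hand and compares it with t.std(). The names mu and sigma_manual are illustrative; t.std() uses the same 1/N (population) form by default.

```python
import numpy as np

t = np.arange(24).reshape(4, 6)

# sigma = sqrt( (1/N) * sum( (x_i - mu)^2 ) )
mu = t.mean()
sigma_manual = np.sqrt(((t - mu) ** 2).sum() / t.size)

print(sigma_manual)
print(t.std())

# Per-column standard deviation, following the same formula along axis 0.
sigma_cols = np.sqrt(((t - t.mean(axis=0)) ** 2).mean(axis=0))
print(sigma_cols)
```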

# Author: Aloha
# Created: 2023/7/14 19:51
import numpy as np


def fill_ndarray(t1):
    for i in range(t1.shape[1]):
        temp_col = t1[:,i] # the current column (a view into t1)
        nan_num = np.count_nonzero(temp_col!=temp_col)
        if nan_num !=0: # nan_num != 0 means the current column contains nan
            # The values in the current column that are not nan
            temp_not_nan = temp_col[temp_col==temp_col]
            # Select the nan positions and assign them the mean of the non-nan values
            temp_col[np.isnan(temp_col)] = temp_not_nan.mean()
    return t1

if __name__ == '__main__':
    t1 = np.arange(24).reshape((4, 6)).astype("float")
    t1[1, 2:] = np.nan
    t1 = fill_ndarray(t1)
    print(t1)
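The same column-mean fill can also be written without an explicit loop. The following is an alternative sketch, not the author's implementation: it relies on np.nanmean, which averages while ignoring nan, and on np.where broadcasting the per-column means across the rows.

```python
import numpy as np

t = np.arange(24).reshape((4, 6)).astype("float")
t[1, 2:] = np.nan

# Mean of each column, ignoring nan entries; shape (6,).
col_means = np.nanmean(t, axis=0)

# Where t is nan take the column mean, otherwise keep the original value.
filled = np.where(np.isnan(t), col_means, t)

print(filled)
```

Unlike fill_ndarray, this leaves t itself unchanged and returns a new array.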

Summary

Using the top-1000 youtube data for the UK and the US together with the matplotlib covered earlier, draw a histogram of the number of comments for each country.

# Author: Aloha
# Created: 2023/7/14 20:40
import numpy as np
from matplotlib import pyplot as plt

us_file_path = "./data/US_video_data_numbers.csv"
uk_file_path = "./data/GB_video_data_numbers.csv"

t_us = np.loadtxt(us_file_path,delimiter=",",dtype="int")
t_us_comments = t_us[:,-1]

# Keep only the values no greater than 5000
t_us_comments = t_us_comments[t_us_comments<=5000]

print(t_us_comments.max(),t_us_comments.min())

d = 50
bins_nums = (t_us_comments.max()-t_us_comments.min())//d

# Plot

plt.figure(figsize=(20,8),dpi=80)
plt.hist(t_us_comments,bins_nums)

plt.show()

We want to understand the relationship between the number of comments and the number of likes for youtube videos in the UK. How should that chart be drawn?

# Author: Aloha
# Created: 2023/7/14 21:03
import numpy as np
from matplotlib import pyplot as plt

us_file_path = "./data/US_video_data_numbers.csv"
uk_file_path = "./data/GB_video_data_numbers.csv"

t_uk = np.loadtxt(uk_file_path,delimiter=",",dtype="int")

# Keep only the rows whose like count is at most 100000
t_uk = t_uk[t_uk[:,1]<=100000]

t_uk_comments = t_uk[:,-1]
t_uk_like = t_uk[:,1]

plt.figure(figsize=(20,8),dpi=80)

plt.scatter(t_uk_like,t_uk_comments)

plt.show()

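Concatenating arrays and swapping rows/columns

# Author: Aloha
# Created: 2023/7/14 21:20
import numpy as np

t1 = np.arange(6).reshape((2,3))
t2 = np.array(range(6,12)).reshape((2,3))
print(t1)
print(t2)
t3 =  np.vstack((t1,t2))# vertical stacking: the rows of t2 go below the rows of t1
print(t3)
t4 = np.hstack((t1,t2))# horizontal stacking: the columns of t2 go to the right of t1
print(t4)
print('-----------')
t1[[0,1],:]= t1[[1,0],:]# swap rows
print(t1)
print(t2)
t2[:,[0,1]]= t2[:,[1,0]]# swap columns
print(t2)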

Now we want to put the data of the two countries from the earlier example together for analysis while keeping the country information (which country each record comes from). What should we do?

# Author: Aloha
# Created: 2023/7/15 12:32
import numpy as np
from matplotlib import pyplot as plt

# Load the country data
us_file_path = "./data/US_video_data_numbers.csv"
uk_file_path = "./data/GB_video_data_numbers.csv"

t_us = np.loadtxt(us_file_path,delimiter=",",dtype="int")
t_uk = np.loadtxt(uk_file_path,delimiter=",",dtype="int")


# Add the country information
# Build a column of all zeros (US) and a column of all ones (UK)
zeros_data = np.zeros((t_us.shape[0],1)).astype(int)
ones_data = np.ones((t_uk.shape[0],1)).astype(int)

# Append the all-0 column to the US data and the all-1 column to the UK data
t_us = np.hstack((t_us,zeros_data))
t_uk = np.hstack((t_uk,ones_data))

# Concatenate the two data sets
final_data = np.vstack((t_us,t_uk))
print(final_data)

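Handy numpy methods

1. Getting the positions of the maximum and minimum values: np.argmax(t, axis=0), np.argmin(t, axis=1)
2. Creating an all-zeros array: np.zeros((3,4))
3. Creating an all-ones array: np.ones((3,4))
4. Creating a square array with ones on the diagonal (identity matrix): np.eye(3)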
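A short sketch of these methods on a small made-up array (the values in t are illustrative only):

```python
import numpy as np

t = np.array([[3, 9, 1],
              [7, 2, 8]])

# Row position of the maximum within each column.
max_pos = np.argmax(t, axis=0)
print(max_pos)   # [1 0 1]

# Column position of the minimum within each row.
min_pos = np.argmin(t, axis=1)
print(min_pos)   # [2 1]

zeros = np.zeros((3, 4))   # 3x4 array of 0.0
ones = np.ones((3, 4))     # 3x4 array of 1.0
eye = np.eye(3)            # 3x3 identity matrix
print(eye)
```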

Generating random numbers with numpy
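The notes give no example for this heading, so the following is only a sketch of one common way to generate random arrays, using numpy's Generator API (np.random.default_rng). The seed value and the shapes are arbitrary assumptions.

```python
import numpy as np

# A seeded generator produces the same numbers on every run.
rng = np.random.default_rng(seed=10)

# Random integers in [10, 20), as a 4x5 array.
ints = rng.integers(10, 20, size=(4, 5))

# Uniform floats in [0, 1), as a 2x3 array.
uniform = rng.random((2, 3))

# Samples from a normal distribution with mean 0 and standard deviation 1.
normal = rng.normal(loc=0, scale=1, size=(2, 3))

print(ints)
print(uniform)
print(normal)

# A second generator with the same seed repeats the first draw exactly.
rng2 = np.random.default_rng(seed=10)
same_ints = rng2.integers(10, 20, size=(4, 5))
```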

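copy and view

a = b: no copying at all; a and b are the same object, so a change to one affects the other

a = b[:]: a view operation, a kind of slice. A new object a is created, but its data is owned entirely by b, so their data changes together

a = b.copy(): a copy; a and b do not affect each other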
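The three cases can be checked directly. A minimal sketch with illustrative names follows; np.shares_memory reports whether two arrays use the same underlying data.

```python
import numpy as np

b = np.arange(6)

a_same = b          # plain assignment: the very same object
a_view = b[:]       # slice: a new array object that shares b's data
a_copy = b.copy()   # copy: a new object with its own data

b[0] = 100

print(a_same[0], a_view[0], a_copy[0])  # 100 100 0
print(a_same is b)                      # True
print(a_view is b)                      # False
print(np.shares_memory(a_view, b))      # True
print(np.shares_memory(a_copy, b))      # False
```

Note that this differs from Python lists, where b[:] produces an independent shallow copy.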

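pandas

Why learn pandas

So the question is: numpy already helps us process data, and together with matplotlib it can solve our data analysis problems, so what is the point of learning pandas?

numpy helps us process numeric data, but that is not enough. Often our data contains, besides numbers, strings, time series and so on. For example, data obtained by a crawler and stored in a database; or, in the earlier youtube example, besides the numbers there is also the country, the video category (tag), the title and so on.

So numpy helps us process numbers, while pandas, besides processing numbers (it is built on numpy), also helps us process other types of data.

Common pandas data types

Series: one-dimensional, labelled array

DataFrame: two-dimensional, a container of Series

Creating a Series with pandas

# Author: Aloha
# Created: 2023/7/15 13:23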
import pandas as pd

t = pd.Series([1,2,23,11])
print(t)
print(type(t))

t1 = pd.Series([1,2,3],index=list('abc'))
print(t1)

temp_dict = {"name":"zhangsan","age":30,"tel":10086}
t2 = pd.Series(temp_dict)
print(t2)

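Slicing and indexing a Series

Slicing: pass start, end or a step directly. Indexing: for a single element pass its position or index label; for several, pass a list of positions or labels. In recent pandas versions, positional access on a labelled Series should go through .iloc.

temp_dict = {"name":"zhangsan","age":30,"tel":10086}
t2 = pd.Series(temp_dict)
print(t2)

print(t2["age"])
print(t2.iloc[0])
print(t2[:2])
print(t2.iloc[[1,2]])
print(t[t>10])  # t is the numeric Series pd.Series([1,2,23,11]) created earlier
print(t2.index)
print(type(t2.index))
for i in t2.index:
    print(i)
print(list(t2.index)[:2])
print(t2.values)
print(type(t2.values))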

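Index and values of a Series

A Series object essentially consists of two arrays: one holds the keys of the object (index) and the other holds its values (values), mapping key --> value

Many ndarray methods can be applied to a Series, such as argmax and clip. A Series also has a where method, but its result differs from that of ndarray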
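The difference between the two where functions can be seen in a small sketch (the Series values are made up for illustration): Series.where keeps the values where the condition holds and replaces the rest, while np.where chooses between two given values.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 5, 12, 20])

# Series.where: keep values where the condition is True, NaN elsewhere.
kept = s.where(s > 10)
print(kept)

# With a second argument, the non-matching positions get that value instead.
replaced = s.where(s > 10, 0)
print(replaced)

# np.where: 0 where the condition is True, 10 where it is False.
arr = np.where(s.values > 10, 0, 10)
print(arr)
```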

Reading external data with pandas

# Author: Aloha
# Created: 2023/7/15 14:18
import pandas as pd

df = pd.read_csv('./data/US_video_data_numbers.csv')
print(df)

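DataFrame

# Author: Aloha
# Created: 2023/7/15 14:33
import numpy as np
import  pandas as pd


t1 = pd.DataFrame(np.arange(12).reshape(3,4))
print(t1)

'''
A DataFrame has both a row index and a column index
Row index: identifies the rows, the horizontal index, called index, axis 0, axis=0
Column index: identifies the columns, the vertical index, called columns, axis 1, axis=1
'''
t2 = pd.DataFrame(np.arange(12).reshape((3,4)),index=list('abc'),columns=list('wxyz'))
print(t2)

So the questions are:

What is the relationship between DataFrame and Series?

A Series can be built from a dictionary; can a DataFrame take a dictionary as its data too? And can data from mongodb be passed in the same way?

A DataFrame has both a row index and a column index; what operations can we perform on it?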

d1 = {"name":["zhangsan","xiaoming"],"age":[20,32],"tel":[10086,10010]}

t3 = pd.DataFrame(d1)
print(t3)
print(type(t3))

d2 = [{"name":"xiaoming","age":20,"tel":10086},{"name":"xiaozhang","age":23,"tel":10010},{"name":"zhangsan","age":23}]
t4 = pd.DataFrame(d2)
print(t4)
print(type(t4))

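Basic attributes of a DataFrame

df.shape # number of rows and columns

df.dtypes # data type of each column

df.ndim # number of dimensions

df.index # row index

df.columns # column index

df.values # the values, a two-dimensional ndarray

Overview queries on a DataFrame

df.head(3) # show the first rows, 5 by default

df.tail(3) # show the last rows, 5 by default

df.info() # overview: number of rows, number of columns, column index, non-null count per column, column types, memory usage

df.describe() # quick summary statistics: count, mean, standard deviation, maximum, quartiles, minimum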

d2 = [{"name":"xiaoming","age":20,"tel":10086},{"name":"xiaozhang","age":23,"tel":10010},{"name":"zhangsan","age":23}]
t4 = pd.DataFrame(d2)
print(t4)
print(type(t4))
print('*'*30)
print(t4.shape)
print(t4.dtypes)
print(t4.ndim)
print(t4.index)
print(t4.columns)
print(t4.values)

print(t4.head(2))
print(t4.tail(2))
print(t4.info())
print(t4.describe())
# Sorting
print(t4.sort_values(by="age",ascending=False))

t5 = pd.DataFrame(np.arange(12).reshape(3,4),index=list('abc'),columns=list('xyze'))
print(t5)
print(t5.loc['a','z'])
print(t5.loc['a'])
print(t5.loc[:,'z'])

print(t5.iloc[1])
print(t5.iloc[:,2])
print(t5.iloc[:,[2,1]])

String methods
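The notes list this heading without an example. As a hedged sketch, string columns expose vectorized methods through the .str accessor; the sample Series below is made up, and only a few of the available methods are shown.

```python
import pandas as pd

s = pd.Series(["Action,Adventure", "Comedy", "Drama,Romance"])

# Split every string on commas; each element becomes a list.
split_lists = s.str.split(",").tolist()
print(split_lists)

# Boolean mask: which strings contain "Drama"?
has_drama = s.str.contains("Drama").tolist()
print(has_drama)

# Lower-case every string, and measure every string's length.
lowered = s.str.lower().tolist()
lengths = s.str.len().tolist()
print(lowered)
print(lengths)
```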

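Handling missing data

How did we handle NaN data in numpy? In pandas it is very easy to handle.

Testing whether data is NaN: pd.isnull(df), pd.notnull(df)

Approach 1: drop the rows or columns containing NaN: dropna(axis=0, how='any', inplace=False) Approach 2: fill in data: t.fillna(t.mean()), t.fillna(t.median()), t.fillna(0)

Handling data that is 0: t[t==0]=np.nan Of course, data that is 0 does not always need handling. When computing means and the like, nan does not take part in the calculation, but 0 does.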

# Author: Aloha
# Created: 2023/7/16 12:33
import numpy as np
import pandas as pd


t1 =pd.DataFrame(np.arange(12).reshape(3,4),index=list('abc'),columns=list('wxyz')).astype('float')
t1.iloc[1,:]=np.nan
t1.iloc[0,0]=np.nan
print(t1)

print(pd.isnull(t1))
print(pd.notnull(t1))
# Show the rows in which column w is not nan
print(t1[pd.notnull(t1['w'])])
# True for rows where column w is not nan, otherwise False
print(pd.notnull(t1['w']))

# Drop data
print(t1.dropna(axis=0))
# Drop a row if it contains any nan
print(t1.dropna(axis=0,how='any'))
# Drop a row only if all of it is nan
print(t1.dropna(axis=0,how='all'))
# Modify in place
# t1.dropna(axis=0,how='any',inplace=True)
print(t1)

# Fill data


print(t1.fillna(100))
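Filling with the column mean, as mentioned under approach 2, can be sketched on the same small frame. The frame is rebuilt here so the example stands on its own; the variable names filled and only_x are illustrative.

```python
import numpy as np
import pandas as pd

t1 = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=list("abc"), columns=list("wxyz")).astype("float")
t1.iloc[1, :] = np.nan
t1.iloc[0, 0] = np.nan

# t1.mean() skips nan and returns one mean per column;
# fillna then fills each column's gaps with that column's mean.
filled = t1.fillna(t1.mean())
print(filled)

# Fill a single column only.
only_x = t1["x"].fillna(t1["x"].mean())
print(only_x)
```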

# Author: Aloha
# Created: 2023/7/16 14:14
import pandas as pd
from matplotlib import pyplot as plt
file_path = './data/IMDB-Movie-Data.csv'

df = pd.read_csv(file_path)

print(df.head(1))
print(df.info())

# Distribution of rating (the same approach works for runtime)

# Choose the chart type: a histogram
# Prepare the data (despite the variable names, this is the Rating column)
runtime_data = df["Rating"].values

max_runtime = runtime_data.max()
min_runtime = runtime_data.min()

# Compute the bins
print(max_runtime,min_runtime)

# Bin edges 0.5 apart, starting from the minimum value
num_bin_list = [min_runtime]
i= min_runtime
while i<=max_runtime:
    i += 0.5
    num_bin_list.append(i)

# Set the figure size
plt.figure(figsize=(20,8),dpi=80)
plt.hist(runtime_data,num_bin_list)

# Set the x-axis ticks
_x = [min_runtime]
i = min_runtime
while i<=max_runtime+0.5:
    i = i+0.5
    _x.append(i)

plt.xticks(_x)

plt.show()

Common statistical methods in pandas

# Mean rating
rating_mean = df["Rating"].mean()

# Set of distinct actors
temp_list = df["Actors"].str.split(",").tolist()
nums = set([i for j in temp_list for i in j])

# Maximum and minimum running time
max_runtime = df["Runtime (Minutes)"].max()
max_runtime_index = df["Runtime (Minutes)"].argmax()
min_runtime = df["Runtime (Minutes)"].min()
min_runtime_index = df["Runtime (Minutes)"].argmin()
runtime_median = df["Runtime (Minutes)"].median()

Suppose we now have data on the 1000 most popular films from 2006 to 2016 and want to know the mean rating, the number of directors and similar information. How do we get it?

# Author: Aloha
# Created: 2023/7/16 15:35
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
file_path = './data/IMDB-Movie-Data.csv'

df = pd.read_csv(file_path)

print(df.info())
print(df.head(1)["Actors"])

# Mean rating
print(df["Rating"].mean())

# Number of directors
print(len(set(df["Director"].tolist())))
print(len(df["Director"].unique()))

# Number of actors
temp_actors_list =  df["Actors"].str.split(",").tolist()
actors_list = [i for j in temp_actors_list for i in j]
# np.array(temp_actors_list).flatten()
actors_num = len(set(actors_list))
print(actors_num)

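For this set of film data, if we want statistics on the film categories (genre), how should the data be processed? Idea: build a new array of all zeros whose column names are the categories; whenever a category appears in a record, change that 0 to 1.

# Author: Aloha
# Created: 2023/7/16 16:03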
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt


file_path = './data/IMDB-Movie-Data.csv'

df = pd.read_csv(file_path)

# Build the list of categories
temp_list =  df["Genre"].str.split(",").tolist()
# The nested list comprehension reads left to right: the outer loop first, then the inner loop
genre_list = list(set([i for j in temp_list for i in j]))

# Build an all-zeros array
zeros_df = pd.DataFrame(np.zeros((df.shape[0],len(genre_list))),columns=genre_list)
print(zeros_df)

# Set 1 at the categories each film belongs to
for i in range(df.shape[0]):
    zeros_df.loc[i,temp_list[i]] = 1

print(zeros_df.head(3))

# Count the number of films in each category
genre_count = zeros_df.sum(axis=0)
print(genre_count)

# Sort
genre_count =  genre_count.sort_values()

_x = genre_count.index
_y = genre_count.values
# Plot
plt.figure(figsize=(20,8),dpi=80)
plt.bar(range(len(_x)),_y)
plt.xticks(range(len(_x)),_x)
plt.show()

Merging data with join

join: by default it merges data whose row index is the same
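A minimal sketch of join on two made-up frames (df1 and df2 are illustrative): the frame on which join is called decides which rows are kept, and rows missing from the other frame are filled with NaN.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((2, 4)), index=["A", "B"], columns=list("abcd"))
df2 = pd.DataFrame(np.zeros((3, 3)), index=["A", "B", "C"], columns=list("xyz"))

# Rows are matched on the row index; df1's rows (A, B) are kept.
left_joined = df1.join(df2)
print(left_joined)

# Called the other way round, df2's rows (A, B, C) are kept,
# and row C has NaN in the columns that come from df1.
right_joined = df2.join(df1)
print(right_joined)
```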

Merging data with merge

merge: merges data on the specified columns in the specified manner
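A minimal sketch of merge on a shared column; the frames and the column names key, left_val and right_val are made up for illustration. The how parameter selects the kind of merge.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["a", "c", "d"], "right_val": [10, 30, 40]})

# Inner merge (the default): only keys present in both frames.
inner = df1.merge(df2, on="key")
print(inner)

# Outer merge: keys from either frame, with NaN where a side is missing.
outer = df1.merge(df2, on="key", how="outer")
print(outer)

# Left merge: every key of df1, with NaN where df2 has no match.
left = df1.merge(df2, on="key", how="left")
print(left)
```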

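We now have statistics on Starbucks stores worldwide. If I want to know whether the US or China has more Starbucks stores, or how many stores each Chinese province has, what should I do?

In pandas this kind of grouping operation is done very simply with df.groupby(by="columns_name")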

# Author: Aloha
# Created: 2023/7/17 10:56
import numpy as np
import pandas as pd

file_path = './data/starbucks_store_worldwide.csv'

df = pd.read_csv(file_path)
# print(df.head(1))
# print(df.info())

grouped = df.groupby(by="Country")
print(grouped)

# grouped is a DataFrameGroupBy object
# It can be iterated over
# for i in grouped:
#     print(i)
#     print('*'*100)

# Call an aggregation method
country_count = grouped["Brand"].count()
print(country_count["US"])
print(country_count['CN'])

# Count the number of stores in each Chinese province

china_data = df[df["Country"] == "CN"]

grouped = china_data.groupby(by="State/Province").count()["Brand"]
print(grouped)

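Grouping and aggregation

What if we need to group by both country and province? grouped = df.groupby(by=[df["Country"],df["State/Province"]])

Often we want only part of the data after grouping, or we want to group only certain columns. What do we do then?

Taking part of the data after grouping: df.groupby(by=["Country","State/Province"])["Country"].count()

Grouping only certain columns: df["Country"].groupby(by=[df["Country"],df["State/Province"]]).count()

Looking at the result, because only one column was selected, the result is a Series. What if I want a DataFrame returned instead?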

grouped1 = df["Brand"].groupby(by=[df["Country"],df["State/Province"]]).count()
print(grouped1)
print(type(grouped1))
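One way to get a DataFrame rather than a Series is to select the column with a list, df[["Brand"]], before grouping. The sketch below uses a tiny made-up frame in place of the Starbucks file, so the values are illustrative only.

```python
import pandas as pd

# Made-up stand-in for the Starbucks data.
df = pd.DataFrame({
    "Brand": ["Starbucks"] * 5,
    "Country": ["CN", "CN", "CN", "US", "US"],
    "State/Province": ["11", "11", "31", "CA", "NY"],
})

# Single brackets select a Series, so the grouped count is a Series.
as_series = df["Brand"].groupby(by=[df["Country"], df["State/Province"]]).count()
print(type(as_series))

# Double brackets select a one-column DataFrame, so the result is a DataFrame.
as_frame = df[["Brand"]].groupby(by=[df["Country"], df["State/Province"]]).count()
print(type(as_frame))
print(as_frame)
```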

Indexes and multi-level indexes

Multi-level index on a Series

Multi-level index on a DataFrame
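The notes give only the headings here. As a hedged sketch on a made-up frame, a multi-level index can be created with set_index and then queried level by level, on both a DataFrame and a Series taken from it.

```python
import pandas as pd

df = pd.DataFrame({
    "a": range(7),
    "b": range(7, 0, -1),
    "c": ["one", "one", "one", "two", "two", "two", "two"],
    "d": list("hjklmno"),
})

# Two columns become a two-level row index.
b = df.set_index(["c", "d"])
print(b)

# Series with a multi-level index: pass one label per level.
s = b["a"]
value = s["one", "j"]
print(value)   # 1

# DataFrame with a multi-level index: loc with the outer label
# returns the sub-frame indexed by the inner level.
sub = b.loc["two"]
print(sub)

# swaplevel exchanges the two levels, so the inner label can be used first.
swapped = b.swaplevel()
print(swapped.loc["j"])
```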

Using matplotlib to show the 10 countries with the most stores

# Author: Aloha
# Created: 2023/7/17 16:32
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

file_path = './data/starbucks_store_worldwide.csv'
df = pd.read_csv(file_path)

# Use matplotlib to show the 10 countries with the most stores

# Prepare the data

data1 = df.groupby(by="Country").count()["Brand"].sort_values(ascending=False)[:10]

_x = data1.index
_y = data1.values

# Plot
plt.figure(figsize=(20,8),dpi=80)

plt.bar(range(len(_x)),_y)

plt.xticks(range(len(_x)),_x)

plt.show()

Using matplotlib to show the number of stores in each city in China

# Author: Aloha
# Created: 2023/7/18 8:56
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib

font = {
    'family':'MicroSoft YaHei',
    'weight':'bold',
    'size':14.0
}
matplotlib.rc("font",**font)

file_path = './data/starbucks_store_worldwide.csv'
df = pd.read_csv(file_path)
df = df[df["Country"]=="CN"]
print(df.head(1))
# Use matplotlib to show the 20 Chinese cities with the most stores

# Prepare the data
data1 = df.groupby(by="City").count()["Brand"].sort_values(ascending=False)[:20]

_x = data1.index
_y = data1.values

# Plot
plt.figure(figsize=(20,8),dpi=80)
# plt.bar(range(len(_x)),_y,width=0.3,color="orange")
plt.barh(range(len(_x)),_y,height=0.3,color="orange")
plt.yticks(range(len(_x)),_x)
plt.show()

We now have data on the 10000 top-ranked books worldwide. Answer the following questions: 1. The number of books for each year 2. The average rating of the books for each year

# Author: Aloha
# Created: 2023/7/18 8:56
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib

font = {
    'family':'MicroSoft YaHei',
    'weight':'bold',
    'size':14.0
}
matplotlib.rc("font",**font)

file_path = './data/books.csv'
df = pd.read_csv(file_path)

# print(df.head(2))
# print(df.info())

# The number of books for each year
# data1 = df[pd.notnull(df["original_publication_year"])]
# grouped = data1.groupby(by="original_publication_year").count()["title"]
# The average rating of the books for each year
# Drop the rows whose original_publication_year is nan
data1 = df[pd.notnull(df["original_publication_year"])]
grouped = data1["average_rating"].groupby(by=data1["original_publication_year"]).mean()
# print(grouped)


# Prepare the data
_x = grouped.index
_y = grouped.values

# Plot
plt.figure(figsize=(20,8),dpi=80)
plt.plot(range(len(_x)),_y)
plt.xticks(list(range(len(_x)))[::10],_x[::10].astype(int),rotation=45)
plt.show()

Summary

Time series in pandas

# Generate a range of dates
date = pd.date_range(start="20171230",end="20180120",freq="D")
print(date)
date1 = pd.date_range(start="20171231",periods=10,freq="D")
print(date1)

More abbreviations for frequencies

PeriodIndex

The DatetimeIndex we learned earlier can be thought of as timestamps; the PeriodIndex we learn now can be thought of as time spans. period = pd.PeriodIndex(year=df["year"],month=df["month"],day=df["day"],hour=df["hour"],freq="H") And how do we downsample over these periods? data = df.set_index(period).resample("10D").mean()
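The downsampling step can be tried on a small made-up frame. Note that this sketch swaps in a DatetimeIndex (built with pd.to_datetime from the year/month/day/hour columns) rather than the PeriodIndex described above, because that route behaves the same across pandas versions; the column named value and all the numbers are illustrative assumptions.

```python
import pandas as pd

# 20 daily records, one per day from 2018-01-01, with values 0..19.
df = pd.DataFrame({
    "year": [2018] * 20,
    "month": [1] * 20,
    "day": list(range(1, 21)),
    "hour": [0] * 20,
    "value": list(range(20)),
})

# Assemble timestamps from the separate date-part columns.
timestamps = pd.to_datetime(df[["year", "month", "day", "hour"]])
df.index = pd.DatetimeIndex(timestamps)

# Downsample: one mean per 10-day window.
resampled = df["value"].resample("10D").mean()
print(resampled)
```

With 20 consecutive days this produces two windows, one for each block of ten days.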