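Introduction to Pandas
- Built on top of NumPy and designed for data analysis
- Two core structures: the one-dimensional Series and the two-dimensional DataFrame
- Reads data directly for processing (efficient and simple)
- Works with a variety of databases
- Supports a wide range of analysis algorithms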
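1. Series data structure
A Series is a labeled one-dimensional array that can hold any data type (integers, strings, floats, Python objects, and so on). Its axis labels are collectively called the index.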
import pandas as pd
import numpy as np
s = pd.Series(np.random.rand(5))
print(s)
print(type(s))
print("-------------")
print(s.index, type(s.index))
print(s.values, type(s.values))
Output:
0 0.229773
1 0.357622
2 0.546116
3 0.734517
4 0.686645
dtype: float64
<class 'pandas.core.series.Series'>
-------------
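RangeIndex(start=0, stop=5, step=1)  # a lazy, range-like index; list(s.index) converts it to a list
<class 'pandas.indexes.range.RangeIndex'>
[ 0.22977307 0.35762236 0.54611623 0.73451707 0.68664496] <class 'numpy.ndarray'>
# Key point: compared with an ndarray, a Series is an array that carries its own index: a one-dimensional array plus matching labels
# So when you look only at the values of a Series, you get an ndarray
# Series and ndarray are similar; indexing and slicing work much the same way
# Compared with a dict, a Series behaves like an ordered dictionary that also supports positional access; lookup follows the same principle (a dict uses keys, a Series uses index labels)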
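The comparison with ndarray and dict can be checked on a tiny Series. This is only a sketch: the values and the labels "x", "y", "z" are made up for illustration and do not come from the run shown earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values and labels, chosen only for illustration.
s = pd.Series(np.array([10.0, 20.0, 30.0]), index=["x", "y", "z"])

# The values of a Series are a plain NumPy array.
values = s.to_numpy()
is_ndarray = isinstance(values, np.ndarray)

# Looking up a label works like looking up a key in a dict.
as_dict = s.to_dict()
same_lookup = s["y"] == as_dict["y"]

# Unlike a dict, a Series also supports positional access through .iloc.
second = s.iloc[1]

print(is_ndarray, same_lookup, second)
```

Using .iloc for positions and plain brackets for labels keeps the two kinds of access easy to tell apart.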
Creating a Series
- Method 1: from a dict; the dict keys become the index and the dict values become the values
dic = {"a": 1, 'b': 2, "c": 3}
s = pd.Series(dic)
print(s)
- Method 2: directly from a one-dimensional array
ar = np.random.rand(5)
s = pd.Series(ar)
print(s)
- Method 3: from a scalar
s = pd.Series(100, index=range(10))
print(s)
- Three parameters of the Series constructor
index=list("abcde")  # sets the index of the Series
dtype=np.float32     # sets the element type of the Series
name="text"          # every Series has a name; rename returns a renamed copy and leaves the original unchanged
s = pd.Series(ar, index=list("abcde"), dtype=np.float32, name="text")
print(s)
Output:
a 97.164497
b 0.458337
c 64.392853
d 71.175850
e 36.375008
Name: text, dtype: float32
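To make the effect of index, dtype and name easier to see, here is a small sketch with fixed values instead of random ones. The data and the new name "numbers" are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data; fixed values make the result reproducible.
data = [1, 2, 3]
s = pd.Series(data, index=list("abc"), dtype=np.float32, name="text")

# rename with a scalar returns a new Series carrying the new name;
# the original Series keeps its own name.
renamed = s.rename("numbers")

print(s)
print(s.name, renamed.name)
```

Because rename does not modify s, the result has to be assigned to a variable to be kept.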
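Indexing and slicing a Series
- Indexing
# 1. Integer indexing (with the default RangeIndex, label 0 is also position 0)
s = pd.Series(np.random.rand(5))
print(s)
print(s[0])
print(type(s[0]), s[0].dtype)
print()
print('--------------------------------')
# 2. Label indexing
s = pd.Series(np.random.rand(5), index=list("abcde"))
print(s)
# A single label is selected the same way as an integer position
print(s["a"])
# To select several labels, pass a list of labels; the result is a new Series
print(s[["a", "c", "b"]])
- Slicing
s1 = pd.Series(np.random.rand(5))
s2 = pd.Series(np.random.rand(5), index = ['a','b','c','d','e'])
print(s1[1:4])
print(s2['a':'c'])  # note: a label slice includes its end label
print(s2[0:3])      # an integer slice excludes its end position
print(s2[:-1])
print(s2[::2])
# Integer slices are written the same way as list slices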
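The difference between label slices and position slices is easiest to see with explicit .loc and .iloc calls. The following is a minimal sketch with made-up values; using .loc and .iloc here is a stylistic choice of this example, not something the code above requires.

```python
import pandas as pd

# Hypothetical values, chosen only for illustration.
s = pd.Series([10, 20, 30, 40, 50], index=list("abcde"))

# Label slice: the end label "c" is included.
by_label = s.loc["a":"c"]

# Position slice: the end position 2 is excluded, as with a list.
by_position = s.iloc[0:2]

# A list of labels returns a new Series in the requested order.
picked = s.loc[["a", "c", "b"]]

print(by_label)
print(by_position)
print(picked)
```

Writing .loc for labels and .iloc for positions avoids any ambiguity when the index itself contains integers.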
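Basic Series techniques: viewing data / reindexing / alignment / adding, modifying, and deleting values
- Viewing data
s = pd.Series(np.random.rand(15))
print(s.head())  # first five rows by default
print(s.tail())  # last five rows by default
- Reindexing with reindex
s = pd.Series(np.random.rand(5), index=list('abcde'))
print(s)
s1 = s.reindex(list("cdef"))
print(s1)
# Labels that exist in the original keep their values; labels that do not exist get NaN
# fill_value sets the value used in place of NaN
s2 = s.reindex(list("cdefghi"), fill_value = 0)
print(s2)
- Alignment: arithmetic between two Series matches values by index label
s1 = pd.Series(np.random.rand(3), index = ['Jack','Marry','Tom'])
s2 = pd.Series(np.random.rand(3), index = ['Wang','Jack','Marry'])
print(s1)
print(s2)
print(s1+s2)
# A missing value added to anything is still a missing value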
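- Deleting with drop
s = pd.Series(np.random.rand(5), index = list('ngjur'))
s1 = s.drop('n')
# inplace=False by default, so the original data is not modified
print(s)
print(s1)
- Adding
s1 = pd.Series(np.random.rand(5))
s2 = pd.Series(np.random.rand(5), index = list('ngjur'))
print(s1)
# print(s2)
# s1[5] = 100  # assign to a new integer label
# s2['a'] = 100  # assign to a new label
# print(s1)
# print(s2)
# Concatenate with pd.concat, which leaves the originals unchanged
# (Series.append was removed in pandas 2.0)
s3 = pd.concat([s1, s2])
print(s3)
print(s1)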
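The techniques in this section can be tried together on fixed values, which makes the NaN behaviour easy to follow. This is a sketch under assumptions: the data is made up, and Series.add with fill_value is an extra option not used in the text above, shown only as one way to avoid NaN during alignment.

```python
import pandas as pd

# Hypothetical values, chosen only for illustration.
s = pd.Series([1.0, 2.0, 3.0], index=list("abc"))
t = pd.Series([10.0, 20.0], index=list("bz"))

# reindex: "d" is not in s, so it becomes NaN unless fill_value is given.
re = s.reindex(list("bcd"))
re_filled = s.reindex(list("bcd"), fill_value=0)

# Alignment: only "b" exists on both sides, every other label gives NaN.
total = s + t

# Assumption of this sketch: treat a label missing on one side as 0.
total_filled = s.add(t, fill_value=0)

# drop and pd.concat both return new objects; s itself is unchanged.
dropped = s.drop("a")
combined = pd.concat([s, t])

print(total)
print(total_filled)
print(combined)
```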
2. DataFrame data structure
A DataFrame is a tabular data structure, a "labeled two-dimensional array". It has both an index (row labels) and columns (column labels).
s = {
'name': ['A', 'B', 'c'],
'age': [12, 13, 14],
'gener': ['M', 'F', 'M']
}
frame = pd.DataFrame(s)
print(frame)
print(type(frame))
print("------------------------")
print(frame.index)  # row labels
print("------------------------")
print(frame.columns)  # column labels
print("------------------------")
print(frame.values)  # values
print("------------------------")
print(type(frame.values))  # the values are a NumPy array
Output:
name age gener
0 A 12 M
1 B 13 F
2 c 14 M
<class 'pandas.core.frame.DataFrame'>
------------------------
RangeIndex(start=0, stop=3, step=1)
------------------------
Index(['name', 'age', 'gener'], dtype='object')
------------------------
[['A' 12 'M']
['B' 13 'F']
['c' 14 'M']]
------------------------
<class 'numpy.ndarray'>
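A short sketch of the same idea with fewer columns: each column of a DataFrame is itself a Series that shares the row index. The data below is made up for illustration.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration.
frame = pd.DataFrame({"name": ["A", "B", "C"], "age": [12, 13, 14]})

# Selecting one column returns a Series that shares the frame's index.
ages = frame["age"]

# shape is (number of rows, number of columns).
n_rows, n_cols = frame.shape

# to_numpy() returns the values as a NumPy array, like frame.values.
values = frame.to_numpy()

print(type(ages))
print(n_rows, n_cols)
print(values.shape)
```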
Creating a DataFrame
- Method 1: from a dict of arrays or lists
s = {
    'a': [1, 2, 3],
    'b': [4, 5, 6],
    'c': [7, 8, 9]
}
print(pd.DataFrame(s))
print("------------------------")
s1 = {
    'a': np.random.rand(3),
    'b': np.random.rand(3)
}
print(pd.DataFrame(s1))
print("------------------------")
# When a DataFrame is built from a dict, each key becomes a column label and each value becomes that column's data
# All values in the dict must have the same length!
print(pd.DataFrame(s, columns=list('abcde')))
# If columns lists labels that are not keys of the dict, those extra columns are filled with missing values
- Method 2: from a dict of Series
s = {
    'a': pd.Series(np.random.rand(5)),
    'b': pd.Series(np.random.rand(5))
}
print(pd.DataFrame(s))
s1 = {
    'a': pd.Series(np.random.rand(5), index=list('abcde')),
    'b': pd.Series(np.random.rand(4), index=list('abcd'))
}
print(pd.DataFrame(s1))
# For a DataFrame built from a dict of Series, the columns are the dict keys and the index comes from the Series indexes (the default integer index if none was set)
# If the Series differ in length, the gaps are filled with NaN
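- Method 3: directly from a two-dimensional array
ar = np.arange(10).reshape(2, 5)
print(ar)
frame = pd.DataFrame(ar, index=list('ab'), columns=list('ABCDE'))
print(frame)
# A DataFrame built from a two-dimensional array has the same shape as the array
# The lengths of index and columns must match the array's number of rows and columns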
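- Method 4: from a list of dicts (for example, JSON records to a DataFrame)
list1 = [
    {'a': 1, 'b': 2, 'c': 3, 'f': 8},
    {'a': 4, 'b': 5, 'c': 6, 'e': 7}
]
frame = pd.DataFrame(list1)
print(frame)
# Each dict becomes one row: its position in the list is the default index and its keys are the columns
# Keys that appear in only some of the dicts are kept as columns, with missing values filling the gaps
Output:
   a  b  c    e    f
0  1  2  3  NaN  8.0
1  4  5  6  7.0  NaN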
- Method 5: from a dict of dicts
dict_a = {
    'xyb': {'age': 20, 'gener': 'M', 'score': 100, 'a': 55},
    'gkx': {'age': 19, 'gener': 'F', 'score': 98},
    'hmj': {'age': 88, 'gener': 'M', 'score': 10}
}
frame = pd.DataFrame(dict_a)
print(frame)
# The outer keys become the columns and the inner keys become the index
# Inner keys missing from some of the dicts are filled with missing values
# Unlike the earlier methods, index here selects among the inner keys instead of relabeling rows; labels that match no inner key give all-NaN rows
print(pd.DataFrame(dict_a, index=list('abcd')))
Output:
       xyb  gkx  hmj
a       55  NaN  NaN
age     20   19   88
gener    M    F    M
score  100   98   10
    xyb  gkx  hmj
a  55.0  NaN  NaN
b   NaN  NaN  NaN
c   NaN  NaN  NaN
d   NaN  NaN  NaN
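Methods 4 and 5 differ mainly in whether the dicts become rows or columns. The sketch below uses made-up records to show both, plus pd.DataFrame.from_dict with orient="index" as one possible way to turn the outer keys of a nested dict into rows; that last call is an addition of this example and not part of the methods listed above.

```python
import pandas as pd

# Hypothetical records, chosen only for illustration.
records = [{"a": 1, "b": 2}, {"a": 3, "c": 4}]
frame = pd.DataFrame(records)  # one row per dict, gaps become NaN

nested = {
    "xyb": {"age": 20, "score": 100},
    "gkx": {"age": 19, "score": 98},
}
by_col = pd.DataFrame(nested)  # outer keys become columns

# Assumption of this sketch: orient="index" makes the outer keys the rows.
by_row = pd.DataFrame.from_dict(nested, orient="index")

print(frame)
print(by_col)
print(by_row)
```

The by_row layout, with one record per row, is usually the more convenient shape for filtering and sorting.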
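DataFrame indexing
- Selecting rows and columns
frame = pd.DataFrame(
    np.random.rand(16).reshape(4, 4)*100,
    index=['one', 'two', 'three', 'four'],
    columns=list('abcd')
)
print(frame)
print('-----------')
# Selecting columns
print(frame['a'])  # a single column label returns a Series
print(frame[['a', 'c','b']])  # a list of column labels returns a DataFrame
print('-----------')
# Selecting rows
print(frame.loc['one'])  # a single row label returns a Series
print(frame.loc[['one', 'three', 'four']])  # a list of row labels returns a DataFrame
# Note: an integer slice inside frame[] selects rows instead of columns (not recommended)
print(frame[:2])
Note the difference between loc and iloc:
# df.loc[] selects rows by index label
print(frame.loc['one'])  # the row labeled 'one', as a Series
print(frame.loc[['one', 'two']])  # a DataFrame
print(frame.loc['one': 'four'])  # both ends are included
# df.iloc[] selects rows by integer position (from 0 to length-1 along the axis)
print(frame.iloc[0])  # first row, as with list indexing; Series
print(frame.iloc[-1])  # negative positions work; Series
print(frame.iloc[0:1])  # slice, end excluded; DataFrame
print(frame.iloc[[0, 2, 3]])  # several rows; DataFrame
# Main difference: a loc slice includes both ends, an iloc slice excludes its end
Note the difference between frame.iloc[0] and frame.iloc[0:1] (or frame[0:1]): a single integer position returns the row as a Series, while a slice returns a one-row DataFrame.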
- Boolean indexing
frame = pd.DataFrame(np.random.rand(25).reshape(5, 5)*100,
                     index=['one', 'two', 'three', 'four', 'five'],
                     columns=list('abcde'))
print(frame)
b1 = frame > 50
print(b1)  # a boolean DataFrame
print(frame[b1])  # can also be written frame[frame > 50]
b2 = frame['a'] > 50
print(frame[b2])  # condition on a single column
# Tests every value in column a: rows where it is True are kept whole, rows where it is False are dropped
b3 = frame[['a', 'b']] > 50
print(frame[b3])  # condition on several columns
# The result keeps the full shape: True cells keep their value, False cells become NaN, and so do cells that were not tested
b4 = frame.loc[['one', 'two']] > 50
print(frame[b4])  # condition on several rows
# The result keeps the full shape: True cells keep their value, False cells become NaN, and so do cells that were not tested
Output:
               a          b          c          d          e
one    24.618148  63.399686   6.798379  43.962008  45.552317
two    94.591154  72.643779  66.825324  63.110716  61.176136
three  69.951802  92.393260  43.727457   8.785210  20.742149
four   82.501806  53.260059  61.655564   4.670043  76.228869
five   26.690782  46.703855  60.726235  16.030582  85.548820
           a      b      c      d      e
one    False   True  False  False  False
two     True   True   True   True   True
three   True   True  False  False  False
four    True   True   True  False   True
five   False  False   True  False   True
               a          b          c          d          e
one          NaN  63.399686        NaN        NaN        NaN
two    94.591154  72.643779  66.825324  63.110716  61.176136
three  69.951802  92.393260        NaN        NaN        NaN
four   82.501806  53.260059  61.655564        NaN  76.228869
five         NaN        NaN  60.726235        NaN  85.548820
               a          b          c          d          e
two    94.591154  72.643779  66.825324  63.110716  61.176136
three  69.951802  92.393260  43.727457   8.785210  20.742149
four   82.501806  53.260059  61.655564   4.670043  76.228869
               a          b          c          d          e
one          NaN  63.399686        NaN        NaN        NaN
two    94.591154  72.643779        NaN        NaN        NaN
three  69.951802  92.393260        NaN        NaN        NaN
four   82.501806  53.260059        NaN        NaN        NaN
five         NaN        NaN        NaN        NaN        NaN
               a          b          c          d          e
one          NaN  63.399686        NaN        NaN        NaN
two    94.591154  72.643779  66.825324  63.110716  61.176136
three        NaN        NaN        NaN        NaN        NaN
four         NaN        NaN        NaN        NaN        NaN
five         NaN        NaN        NaN        NaN        NaN
- Chained indexing (rows and columns together)
# Select rows and columns together
# Pick the columns first, then the rows: like choosing the fields of a data set first and then the records
frame = pd.DataFrame(np.random.rand(25).reshape(5, 5)*100,
                     index=['one', 'two', 'three', 'four', 'five'],
                     columns=list('abcde'))
print(frame['a'].loc[['one', 'two']])
print('-------------')
print(frame[['a', 'c', 'd']].loc[['one', 'three']])
print('-------------')
print(frame[frame > 50].loc[['one', 'three', 'four']])
Output:
one     5.229978
two    22.471440
Name: a, dtype: float64
-------------
               a          c          d
one     5.229978  35.075185  32.674318
three  85.967856  93.905559  55.819677
-------------
               a         b          c          d          e
one          NaN       NaN        NaN        NaN        NaN
three  85.967856  88.27458  93.905559  55.819677  57.813916
four   84.678590       NaN  70.645585        NaN        NaN
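Rows and columns can also be selected in a single .loc or .iloc call by passing both selectors at once. The sketch below uses a small made-up frame so the results can be read off by eye; it is one possible alternative to selecting the columns first and the rows second, not a replacement for it.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration.
frame = pd.DataFrame(
    {"a": [10, 60, 70], "b": [55, 20, 90]},
    index=["one", "two", "three"],
)

# Rows and columns in one call: .loc[row labels, column labels].
sub = frame.loc[["one", "three"], ["a"]]

# The same idea by position: .iloc[row position, column position].
cell = frame.iloc[0, 1]

# A boolean Series keeps whole rows where the condition is True.
rows = frame[frame["a"] > 50]

# A boolean DataFrame keeps the shape and puts NaN where it is False.
masked = frame[frame > 50]

# A row condition and a column label can be combined in .loc as well.
b_where_a_large = frame.loc[frame["a"] > 50, "b"]

print(sub)
print(cell)
print(rows)
print(masked)
print(b_where_a_large)
```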
Basic DataFrame techniques: viewing data and transposing / adding, modifying, and deleting values / alignment / sorting
- Viewing data and transposing
frame = pd.DataFrame(np.random.rand(16).reshape(4, 4)*100,
                     columns=list('abcd'),
                     index=['one', 'two', 'three', 'four'])
print(frame)
print(frame.T)  # transpose
print(frame.head(2))  # first 2 rows
print(frame.tail(2))  # last 2 rows
- Adding and modifying
frame = pd.DataFrame(np.random.rand(16).reshape(4, 4)*100,
                     columns=list('abcd'),
                     index=['one', 'two', 'three', 'four'])
print(frame)
frame['e'] = 0  # add a new column directly
frame.loc['five'] = 0  # add a new row directly
print(frame)
# Adding a column or row and assigning its values happen in one step
frame['e'] = 100  # modify by column label
frame.loc[['one', 'two']] = 1000  # modify by row label
print(frame)
# Select first, then assign to modify the values
- Deleting
frame = pd.DataFrame(np.random.rand(16).reshape(4, 4)*100,
                     columns=list('abcd'),
                     index=['one', 'two', 'three', 'four'])
print(frame)
del frame['a']
print(frame)
print('---------------')
# del removes a column and modifies the original data
print(frame.drop(['three','one']))
print(frame)
print('---------------')
# drop() removes rows; inplace=False by default, so the original data is unchanged
print(frame.drop(['d'], axis = 1))
print(frame)
# drop() with axis = 1 removes columns; inplace=False by default, so the original data is unchanged
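- Alignment
df1 = pd.DataFrame(np.random.randn(10, 4), columns=list('abcd'))
df2 = pd.DataFrame(np.random.randn(7, 3), columns=list('abc'))
print(df1)
print(df2)
print(df1 + df2)
# Arithmetic between DataFrame objects aligns automatically on both columns and index (row labels)
# Where one side has no value the result is NaN; a missing value added to anything is NaN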
- Sorting
# Sorting 1: by value with .sort_values
# Also works on a Series
df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100, columns = ['a','b','c','d'])
print(df1)
print(df1.sort_values(['a']))  # whole rows move together with column a
print(df1.sort_values(['a'], ascending = False))  # descending
print('-----------------------')
# ascending sets the sort direction; the default is ascending
# The calls above sort by a single column
df2 = pd.DataFrame({'a':[1,1,1,1,2,2,2,2],
                    'b':list(range(8)),
                    'c':list(range(8,0,-1))})
print(df2)
print(df2.sort_values(['a','c']))
# Sorting by several columns follows the order given: sort by a first, then by c within equal values of a
Output:
           a          b          c          d
0  29.266607   0.305069  67.024592  11.961605
1  84.371752   8.702451  11.278948  15.154947
2  36.447298  99.487098  45.994845  11.623095
3  99.368811  78.956778  90.963022  53.165712
           a          b          c          d
0  29.266607   0.305069  67.024592  11.961605
2  36.447298  99.487098  45.994845  11.623095
1  84.371752   8.702451  11.278948  15.154947
3  99.368811  78.956778  90.963022  53.165712
           a          b          c          d
3  99.368811  78.956778  90.963022  53.165712
1  84.371752   8.702451  11.278948  15.154947
2  36.447298  99.487098  45.994845  11.623095
0  29.266607   0.305069  67.024592  11.961605
-----------------------
   a  b  c
0  1  0  8
1  1  1  7
2  1  2  6
3  1  3  5
4  2  4  4
5  2  5  3
6  2  6  2
7  2  7  1
   a  b  c
3  1  3  5
2  1  2  6
1  1  1  7
0  1  0  8
7  2  7  1
6  2  6  2
5  2  5  3
4  2  4  4
# Sorting 2: by index with .sort_index
df1 = pd.DataFrame(np.random.rand(16).reshape(4, 4)*100, index=list('dcab'))
df2 = pd.DataFrame(np.random.rand(25).reshape(5, 5)*100, index=list('54321'))
print(df1)
print(df1.sort_index())
print(df2)
print(df2.sort_index())
Output:
           0          1          2          3
d  41.135343  77.253278  36.886300  92.532598
c  50.130950  82.863338   5.692255  36.134360
a  83.997284  73.834330  45.764539  12.149822
b   9.857439  11.782583  49.340666  73.912567
           0          1          2          3
a  83.997284  73.834330  45.764539  12.149822
b   9.857439  11.782583  49.340666  73.912567
c  50.130950  82.863338   5.692255  36.134360
d  41.135343  77.253278  36.886300  92.532598
           0          1          2          3          4
5  57.074088  86.690920  89.375103  59.925863  58.614896
4  39.561612  32.114331  72.069420  33.207265  23.431688
3  53.559025  98.972040   4.027947  58.884601  77.840003
2  67.495174  57.055822  31.545504  37.005847  40.818169
1  81.765456  77.883286  42.279947  80.506095  95.405828
           0          1          2          3          4
1  81.765456  77.883286  42.279947  80.506095  95.405828
2  67.495174  57.055822  31.545504  37.005847  40.818169
3  53.559025  98.972040   4.027947  58.884601  77.840003
4  39.561612  32.114331  72.069420  33.207265  23.431688
5  57.074088  86.690920  89.375103  59.925863  58.614896
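Sorting and deleting are easier to verify on a frame with fixed values. The sketch below is an illustration on made-up data; passing a list to ascending (one direction per column) and using drop(columns=...) are options this example adds on top of the calls shown above.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration.
frame = pd.DataFrame(
    {"a": [2, 1, 1, 2], "c": [5, 7, 6, 8]},
    index=list("dcab"),
)

# Sort by a ascending, then by c descending within equal values of a.
by_value = frame.sort_values(["a", "c"], ascending=[True, False])

# Sort the rows by their index labels.
by_index = frame.sort_index()

# drop(columns=...) is equivalent to drop(..., axis=1); it returns a new
# frame and leaves the original unchanged.
without_c = frame.drop(columns=["c"])

print(by_value)
print(by_index)
print(without_c)
```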