Python Data Analysis: Pandas (1)

This article is part of the 「新人创作礼」 (Newcomer Creation Gift) event on Juejin.

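Introduction to Pandas

Pandas is built on NumPy and designed for data analysis:
- Two core structures: the one-dimensional Series and the two-dimensional DataFrame
- Reads data directly for processing (efficient and simple)
- Works with a wide range of databases
- Supports a wide range of analysis algorithms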

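1. Series Data Structure

A Series is a labeled one-dimensional array that can hold any data type (integers, strings, floats, Python objects, and so on). Its axis labels are collectively called the index.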

import pandas as pd
import numpy as np

s = pd.Series(np.random.rand(5))
print(s)
print(type(s))
print("-------------")

print(s.index, type(s.index))
print(s.values, type(s.values))

0    0.229773
1    0.357622
2    0.546116
3    0.734517
4    0.686645
dtype: float64
<class 'pandas.core.series.Series'>
-------------
RangeIndex(start=0, stop=5, step=1)			# a range-like index object (not a generator); can be converted to a list
<class 'pandas.indexes.range.RangeIndex'>
[ 0.22977307  0.35762236  0.54611623  0.73451707  0.68664496] <class 'numpy.ndarray'>

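# Key point: compared with an ndarray, a Series is an array that carries its own index → a one-dimensional array plus matching labels
# So the values of a Series on their own are just an ndarray
# Series and ndarray are similar; indexing and slicing work in much the same way
# Compared with a dict, a Series behaves like an ordered dictionary with similar lookups (a dict uses keys, a Series uses index labels); unlike a dict, it also supports positional access and slicing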
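The comparison with ndarray and dict can be checked directly. The following is a minimal sketch with made-up values (the labels `x`, `y`, `z` are illustrative only), showing that the values are a NumPy array and that label lookup resembles a dict key lookup:

```python
import numpy as np
import pandas as pd

# Illustrative data; labels and values are made up for this sketch
s = pd.Series([10.0, 20.0, 30.0], index=["x", "y", "z"])

values = s.to_numpy()   # the underlying values as a NumPy array
as_dict = s.to_dict()   # label -> value mapping, like a dict

print(type(values))
print(s["y"], as_dict["y"])   # label lookup works like a dict key lookup
print("y" in s)               # membership tests check the index, as with dict keys
```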

Creating a Series

Creating a Series, method 1: from a dict. The dict's keys become the index and its values become the values.
```
dic = {"a": 1, 'b': 2, "c": 3}
s = pd.Series(dic)
print(s)
```

Creating a Series, method 2: directly from a one-dimensional array

ar = np.random.rand(5)
s = pd.Series(ar)
print(s)

Creating a Series, method 3: from a scalar

s = pd.Series(100, index=range(10))
print(s)

Three parameters of Series

index=list("abcde")		# sets the index of the Series
dtype=np.float32		# sets the element type of the Series
name="text"				# every Series has a name; rename returns a renamed copy (the original is not overwritten)

s = pd.Series(ar, index=list("abcde"), dtype=np.float32, name="text")
print(s)

a    97.164497
b     0.458337
c    64.392853
d    71.175850
e    36.375008
Name: text, dtype: float32
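To illustrate that `rename` does not overwrite the original, here is a small sketch with made-up values; the new name `score` is only an example:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=list("abc"), name="text")
renamed = s.rename("score")   # returns a new Series with a different name

print(s.name)         # the original keeps its name
print(renamed.name)
```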

Indexing and slicing a Series

Indexing

# 1. Indexing by position number

s = pd.Series(np.random.rand(5))
print(s)
print(s[0])
print(type(s[0]), s[0].dtype)
print()
print('--------------------------------')

# 2. Label indexing
s = pd.Series(np.random.rand(5), index=list("abcde"))
print(s)
# A single label is selected the same way as a position number
print(s["a"])
# To select multiple labels, pass a list inside the brackets; the result is a new Series
print(s[["a", "c", "b"]])

Slicing

s1 = pd.Series(np.random.rand(5))
s2 = pd.Series(np.random.rand(5), index = ['a','b','c','d','e'])
print(s1[1:4])
print(s2['a':'c'])
print(s2[0:3])
# Note: when slicing by index label, the end label is included

print(s2[:-1])
print(s2[::2])
# Slicing by position uses the same syntax as a list
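The two slicing behaviors can be made explicit with `.loc` (labels) and `.iloc` (positions). A minimal sketch with made-up values:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=list("abcde"))

by_label = s.loc["a":"c"]     # label slice: both endpoints included
by_position = s.iloc[0:3]     # positional slice: end excluded

print(by_label.tolist())
print(by_position.tolist())
```

Both slices select the same three elements here, even though the label slice names its last element and the positional slice names the position after it.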

Basic Series techniques: viewing data / reindexing / alignment / adding, modifying, and deleting values

Viewing data

s = pd.Series(np.random.rand(15))
print(s.head())    # first five rows by default
print(s.tail())    # last five rows by default

Reindexing with reindex

s = pd.Series(np.random.rand(5), index=list('abcde'))
print(s)
s1 = s.reindex(list("cdef"))
print(s1)

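# reindex matches labels against the existing index: labels that already exist keep their values, and new labels are filled with the missing value NaN

# Fill new labels with a chosen value instead of NaN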
s2 = s.reindex(list("cdefghi"), fill_value = 0)
print(s2)
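Because the examples above use random numbers, the effect of `reindex` can be hard to see. The sketch below repeats the idea with fixed, made-up values:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=list("abc"))

r1 = s.reindex(list("bcd"))                 # 'd' is new -> NaN
r2 = s.reindex(list("bcd"), fill_value=0)   # 'd' is new -> 0

print(r1)
print(r2)
print(s)   # reindex returns a new Series; s itself is unchanged
```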

Alignment: arithmetic between two Series is matched up by index label

s1 = pd.Series(np.random.rand(3), index = ['Jack','Marry','Tom'])
s2 = pd.Series(np.random.rand(3), index = ['Wang','Jack','Marry'])
print(s1)
print(s2)
print(s1+s2)    # a missing value plus anything is still a missing value

Deleting with drop
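If NaN for unmatched labels is not wanted, one option is the `add` method with `fill_value`, which substitutes a value for the side that lacks the label. A sketch with made-up numbers:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["Jack", "Marry", "Tom"])
s2 = pd.Series([10, 20, 30], index=["Wang", "Jack", "Marry"])

plain = s1 + s2                      # labels present on one side only -> NaN
filled = s1.add(s2, fill_value=0)    # treat the missing side as 0 instead

print(plain)
print(filled)
```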

s = pd.Series(np.random.rand(5), index = list('ngjur'))
s1 = s.drop('n')    # inplace=False by default: the original data is not modified
print(s)
print(s1)

Adding values

s1 = pd.Series(np.random.rand(5))
s2 = pd.Series(np.random.rand(5), index = list('ngjur'))
print(s1)
# print(s2)
# s1[5] = 100        # add a value under a new integer label
# s2['a'] = 100      # add a value under a new label
# print(s1)
# print(s2)

# Concatenate with pd.concat (Series.append was removed in pandas 2.0); the original data is unchanged
s3 = pd.concat([s1, s2])
print(s3)
print(s1)
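When two Series are combined with `pd.concat`, the original labels are kept, so labels may repeat. If a fresh 0..n-1 index is preferred, `ignore_index=True` is one way to get it. A small sketch with made-up values:

```python
import pandas as pd

s1 = pd.Series([1, 2])
s2 = pd.Series([3, 4])

kept = pd.concat([s1, s2])                            # index: 0, 1, 0, 1
renumbered = pd.concat([s1, s2], ignore_index=True)   # index: 0, 1, 2, 3

print(kept)
print(renumbered)
```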

2. DataFrame Data Structure

A DataFrame is a tabular data structure, a "labeled two-dimensional array". It has an index (row labels) and columns (column labels).

s = {
    'name': ['A', 'B', 'c'],
    'age': [12, 13, 14],
    'gener': ['M', 'F', 'M']
}
frame = pd.DataFrame(s)
print(frame)
print(type(frame))
print("------------------------")
print(frame.index)        # row labels
print("------------------------")
print(frame.columns)      # column labels
print("------------------------")
print(frame.values)       # values
print("------------------------")
print(type(frame.values)) # check the type: a NumPy array

  name  age gener
0    A   12     M
1    B   13     F
2    c   14     M
<class 'pandas.core.frame.DataFrame'>
------------------------
RangeIndex(start=0, stop=3, step=1)
------------------------
Index(['name', 'age', 'gener'], dtype='object')
------------------------
[['A' 12 'M']
 ['B' 13 'F']
 ['c' 14 'M']]
------------------------
<class 'numpy.ndarray'>

Creating a DataFrame

Method 1: from a dict of arrays/lists

s = {
    'a': [1, 2, 3],
    'b': [4, 5, 6],
    'c': [7, 8, 9]
}

print(pd.DataFrame(s))
print("------------------------")

s1 = {
    'a': np.random.rand(3),
    'b': np.random.rand(3)
}

print(pd.DataFrame(s1))
print("------------------------")

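# When a DataFrame is created from a dict, the dict's keys become the column labels and the dict's values become the column values
# All values in the dict must have the same length!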

print(pd.DataFrame(s, columns=list('abcde')))
# If columns is specified explicitly, any label not present in the dict becomes a column of NaN
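The two rules above (explicit `columns`, equal lengths) can be tried out as follows. This is a sketch with made-up data; the mismatched dict is constructed only to trigger the error:

```python
import pandas as pd

data = {"a": [1, 2, 3], "b": [4, 5, 6]}

# 'c' is not a key of data, so that column is all NaN; 'a' is left out
subset = pd.DataFrame(data, columns=["b", "c"])
print(subset)

try:
    pd.DataFrame({"a": [1, 2, 3], "b": [4, 5]})   # lists of different lengths
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```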

Method 2: from a dict of Series

# DataFrame creation, method 2: from a dict of Series
s = {
    'a': pd.Series(np.random.rand(5)),
    'b': pd.Series(np.random.rand(5))
}

print(pd.DataFrame(s))

s1 = {
    'a': pd.Series(np.random.rand(5), index=list('abcde')),
    'b': pd.Series(np.random.rand(4), index=list('abcd'))
}

print(pd.DataFrame(s1))

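# For a DataFrame built from a dict of Series, the columns are the dict's keys and the index comes from the Series' indexes (the default integer index if none was given)
# If the Series differ in length, the gaps are filled with NaN

Method 3: directly from a two-dimensional array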

ar = np.arange(10).reshape(2, 5)
print(ar)

frame = pd.DataFrame(ar, index=list('ab'), columns=list('ABCDE'))
print(frame)
# A DataFrame created from a 2D array has the same shape as the array
# The lengths of index and columns must match the array's number of rows and columns
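As a quick check of the shape rule, the sketch below builds the same array and then deliberately passes one row label too many:

```python
import numpy as np
import pandas as pd

ar = np.arange(10).reshape(2, 5)
frame = pd.DataFrame(ar, index=list("ab"), columns=list("ABCDE"))
print(frame.shape == ar.shape)

try:
    pd.DataFrame(ar, index=list("abc"))   # 3 labels for 2 rows
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(mismatch_raised)
```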

Method 4: from a list of dicts (JSON to DataFrame)

list1 = [
    {'a': 1, 'b': 2, 'c': 3, 'f': 8},
    {'a': 4, 'b': 5, 'c': 6, 'e': 7}
]

frame = pd.DataFrame(list1)
print(frame)

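# For a DataFrame built from a list of dicts, each dict's position in the list becomes the default index and the dict keys become the columns
# Keys that appear in only some of the dicts are still kept as columns, with NaN where a dict lacks them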

   a  b  c    e    f
0  1  2  3  NaN  8.0
1  4  5  6  7.0  NaN
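Since this method is described as the route from JSON to a DataFrame, here is a sketch of that path using the standard `json` module. The JSON string is a made-up payload for illustration:

```python
import json
import pandas as pd

raw = '[{"a": 1, "b": 2}, {"a": 3, "c": 4}]'   # hypothetical JSON payload
records = json.loads(raw)                      # parsed into a list of dicts

frame = pd.DataFrame(records)
print(frame)
```

Each record becomes one row, and a key missing from a record shows up as NaN in that row.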

Method 5: from a dict of dicts

# DataFrame creation, method 5: from a dict of dicts
dict_a = {
    'xyb': {'age': 20, 'gener': 'M', 'score': 100, 'a': 55},
    'gkx': {'age': 19, 'gener': 'F', 'score': 98},
    'hmj': {'age': 88, 'gener': 'M', 'score': 10}
}

frame = pd.DataFrame(dict_a)
print(frame)

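# The outer dict's keys become the DataFrame's columns, and the inner dicts' keys become the index
# Keys missing from an inner dict are filled with NaN
# Unlike the earlier methods, index here does not relabel rows: it selects rows by the inner-dict keys, so any label not among those keys gives a row of NaN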
print(pd.DataFrame(dict_a, index=list('abcd')))

       xyb  gkx  hmj
a       55  NaN  NaN
age     20   19   88
gener    M    F    M
score  100   98   10
    xyb  gkx  hmj
a  55.0  NaN  NaN
b   NaN  NaN  NaN
c   NaN  NaN  NaN
d   NaN  NaN  NaN
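If the outer keys should become rows rather than columns, `pd.DataFrame.from_dict` with `orient="index"` is one option. The sketch below uses a shortened, made-up version of the dict above and also shows `index` acting as a row selector:

```python
import pandas as pd

dict_a = {
    "xyb": {"age": 20, "score": 100},
    "gkx": {"age": 19, "score": 98},
}

by_column = pd.DataFrame(dict_a)                          # outer keys -> columns
by_row = pd.DataFrame.from_dict(dict_a, orient="index")   # outer keys -> rows
picked = pd.DataFrame(dict_a, index=["age", "missing"])   # 'missing' is not an inner key

print(by_column)
print(by_row)
print(picked)
```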

Indexing a DataFrame

Selecting rows and columns

frame = pd.DataFrame(
    np.random.rand(16).reshape(4, 4)*100,
    index=['one', 'two', 'three', 'four'],
    columns=list('abcd')
)
print(frame)
print('-----------')
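# Select columns
print(frame['a'])        	# selects a column only; returns a Series
print(frame[['a', 'c','b']]) # selects multiple columns; returns a DataFrame
print('-----------')
# Select rows
print(frame.loc['one'])  					# selects a row (by index label) only; returns a Series
print(frame.loc[['one', 'three', 'four']])    # selects multiple rows; returns a DataFrame

# Note: a numeric slice inside frame[] selects rows instead of columns (not recommended)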
print(frame[:2])

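The difference between loc and iloc

# df.loc[] - select rows by index label
print(frame.loc['one'])    # the row labeled 'one'; returns a Series
print(frame.loc[['one', 'two']])    # returns a DataFrame
print(frame.loc['one': 'four'])     # label slice: both ends included

# df.iloc[] - select rows by integer position (0 to length-1 along the axis)
print(frame.iloc[0])        	# same as list indexing: the first row   Series
print(frame.iloc[-1])       	# negative positions work				Series
print(frame.iloc[0:1])      	# slice, end excluded 			   DataFrame
print(frame.iloc[[0, 2, 3]]) 	# multiple rows by position		 DataFrame

# Main difference: loc slices include both ends; iloc slices include the start and exclude the end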

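Note the difference between frame.iloc[0] and frame.iloc[0:1]: a single position such as frame.iloc[0] returns the row as a Series, whereas a slice such as frame.iloc[0:1] (like frame[0:1]) returns a one-row DataFrame.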
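The return types and slice endpoints of `loc` and `iloc` can be verified with a small deterministic frame. This is a sketch; the values are simply 0 to 15:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape(4, 4),
                     index=["one", "two", "three", "four"],
                     columns=list("abcd"))

row = frame.iloc[0]                 # single position -> Series
one_row = frame.iloc[0:1]           # slice -> DataFrame with one row
by_label = frame.loc["one":"two"]   # label slice includes 'two'
by_pos = frame.iloc[0:2]            # positional slice excludes position 2

print(type(row).__name__, type(one_row).__name__)
print(by_label.equals(by_pos))
```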

Boolean indexing

frame = pd.DataFrame(np.random.rand(25).reshape(5, 5)*100, index=['one', 'two', 'three', 'four', 'five'],columns=list('abcde'))
print(frame)
b1 = frame > 50
print(b1)  		# a boolean DataFrame
print(frame[b1])   # can also be written as frame[frame > 50]

b2 = frame['a'] > 50
print(frame[b2])
# Condition on a single column
# Tests every value in column a: rows where it is True are kept whole, rows where it is False are dropped

b3 = frame[['a', 'b']] > 50
print(frame[b3])
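# Condition on several columns
# The result keeps the full shape: True cells keep their value, False cells become NaN, and cells in columns that were not tested are NaN as well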

b4 = frame.loc[['one', 'two']] > 50
print(frame[b4])
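# Condition on several rows
# The result keeps the full shape: True cells keep their value, False cells become NaN, and cells in rows that were not tested are NaN as well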

               a          b          c          d          e
one    24.618148  63.399686   6.798379  43.962008  45.552317
two    94.591154  72.643779  66.825324  63.110716  61.176136
three  69.951802  92.393260  43.727457   8.785210  20.742149
four   82.501806  53.260059  61.655564   4.670043  76.228869
five   26.690782  46.703855  60.726235  16.030582  85.548820
           a      b      c      d      e
one    False   True  False  False  False
two     True   True   True   True   True
three   True   True  False  False  False
four    True   True   True  False   True
five   False  False   True  False   True
               a          b          c          d          e
one          NaN  63.399686        NaN        NaN        NaN
two    94.591154  72.643779  66.825324  63.110716  61.176136
three  69.951802  92.393260        NaN        NaN        NaN
four   82.501806  53.260059  61.655564        NaN  76.228869
five         NaN        NaN  60.726235        NaN  85.548820
               a          b          c          d          e
two    94.591154  72.643779  66.825324  63.110716  61.176136
three  69.951802  92.393260  43.727457   8.785210  20.742149
four   82.501806  53.260059  61.655564   4.670043  76.228869
               a          b   c   d   e
one          NaN  63.399686 NaN NaN NaN
two    94.591154  72.643779 NaN NaN NaN
three  69.951802  92.393260 NaN NaN NaN
four   82.501806  53.260059 NaN NaN NaN
five         NaN        NaN NaN NaN NaN
               a          b          c          d          e
one          NaN  63.399686        NaN        NaN        NaN
two    94.591154  72.643779  66.825324  63.110716  61.176136
three        NaN        NaN        NaN        NaN        NaN
four         NaN        NaN        NaN        NaN        NaN
five         NaN        NaN        NaN        NaN        NaN
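A common follow-up is filtering rows on more than one column at once. One way, sketched below with made-up values, is to combine single-column conditions with `&` (each condition needs its own parentheses); it is contrasted with the cell-level mask shown above:

```python
import pandas as pd

frame = pd.DataFrame({"a": [10, 60, 80], "b": [70, 20, 90]},
                     index=["one", "two", "three"])

rows = frame[(frame["a"] > 50) & (frame["b"] > 50)]   # row filter: both conditions hold
masked = frame[frame > 50]                            # cell mask: failing cells become NaN

print(rows)
print(masked)
```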

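Combined row and column indexing

# Select rows and columns together
# Select columns first, then rows: like choosing the fields of a dataset first and then the records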
frame = pd.DataFrame(np.random.rand(25).reshape(5, 5)*100,
                    index=['one', 'two', 'three', 'four', 'five'],
                     columns=list('abcde'))

print(frame['a'].loc[['one', 'two']])
print('-------------')
print(frame[['a', 'c', 'd']].loc[['one', 'three']])
print('-------------')
print(frame[frame > 50].loc[['one', 'three', 'four']])

one     5.229978
two    22.471440
Name: a, dtype: float64
-------------
               a          c          d
one     5.229978  35.075185  32.674318
three  85.967856  93.905559  55.819677
-------------
               a         b          c          d          e
one          NaN       NaN        NaN        NaN        NaN
three  85.967856  88.27458  93.905559  55.819677  57.813916
four   84.678590       NaN  70.645585        NaN        NaN
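The same selection can also be written as a single `loc` call that takes row labels and column labels together. A sketch on a deterministic frame (values 0 to 24), compared with the two-step form:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(25).reshape(5, 5),
                     index=["one", "two", "three", "four", "five"],
                     columns=list("abcde"))

chained = frame[["a", "c"]].loc[["one", "three"]]    # columns first, then rows
direct = frame.loc[["one", "three"], ["a", "c"]]     # rows and columns in one call

print(direct)
print(chained.equals(direct))
```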

DataFrame basics: viewing and transposing data / adding, modifying, and deleting values / alignment / sorting

Viewing and transposing data

frame = pd.DataFrame(np.random.rand(16).reshape(4, 4)*100, columns=list('abcd'), index=['one', 'two', 'three', 'four'])
print(frame)
print(frame.T)				# transpose
print(frame.head(2))		# first 2 rows
print(frame.tail(2))		# last 2 rows

Adding and modifying

frame = pd.DataFrame(np.random.rand(16).reshape(4, 4)*100, columns=list('abcd'), index=['one', 'two', 'three', 'four'])
print(frame)
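frame['e'] = 0				# add a new column directly
frame.loc['five'] = 0        # add a new row (index label) directly
print(frame)
# Add a column/row and assign values

frame['e'] = 100				# modify by column label
frame.loc[['one', 'two']] = 1000 # modify by row label (index)
print(frame)
# Select, then assign to modify values in place

Deleting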

frame = pd.DataFrame(np.random.rand(16).reshape(4, 4)*100,
                     columns=list('abcd'), 
                     index=['one', 'two', 'three', 'four'])
print(frame)
del frame['a']
print(frame)
print('---------------')
# del removes a column and modifies the original data in place

print(frame.drop(['three','one']))
print(frame)
print('---------------')
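# drop() removes rows; inplace=False by default, so the original data is unchanged

print(frame.drop(['d'], axis = 1))
print(frame)
# drop() removes columns when axis = 1 is given; inplace=False by default, so the original data is unchanged

Alignment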
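As an alternative to remembering `axis`, `drop` also accepts the keyword arguments `index=` and `columns=`. A sketch on a small deterministic frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(3, 4),
                     index=["one", "two", "three"],
                     columns=list("abcd"))

no_d = frame.drop(columns=["d"])     # same result as frame.drop(["d"], axis=1)
no_one = frame.drop(index=["one"])   # same result as frame.drop(["one"])

print(no_d)
print(no_one)
print(frame)   # unchanged, since inplace defaults to False
```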

df1 = pd.DataFrame(np.random.randn(10, 4), columns=list('abcd'))
df2 = pd.DataFrame(np.random.randn(7, 3), columns=list('abc'))
print(df1)
print(df2)
print(df1 + df2)
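# Arithmetic between DataFrame objects aligns automatically on both columns and index (row labels)
# Where one side has no value, the result is NaN; NaN plus any value is NaN

Sorting

# Sorting 1: by value with .sort_values
# Also works on a Series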
df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   columns = ['a','b','c','d'])
print(df1)
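print(df1.sort_values(['a']))    # whole rows move together, ordered by column a
print(df1.sort_values(['a'], ascending = False))  # descending
print('-----------------------')
# ascending parameter: ascending or descending order; ascending by default
# Sorting by a single column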

df2 = pd.DataFrame({'a':[1,1,1,1,2,2,2,2],
                  'b':list(range(8)),
                  'c':list(range(8,0,-1))})
print(df2) 
print(df2.sort_values(['a','c']))
# Sorting by multiple columns in the order given: sort by a first, then by c within equal values of a

           a          b          c          d
0  29.266607   0.305069  67.024592  11.961605
1  84.371752   8.702451  11.278948  15.154947
2  36.447298  99.487098  45.994845  11.623095
3  99.368811  78.956778  90.963022  53.165712
           a          b          c          d
0  29.266607   0.305069  67.024592  11.961605
2  36.447298  99.487098  45.994845  11.623095
1  84.371752   8.702451  11.278948  15.154947
3  99.368811  78.956778  90.963022  53.165712
           a          b          c          d
3  99.368811  78.956778  90.963022  53.165712
1  84.371752   8.702451  11.278948  15.154947
2  36.447298  99.487098  45.994845  11.623095
0  29.266607   0.305069  67.024592  11.961605
-----------------------
   a  b  c
0  1  0  8
1  1  1  7
2  1  2  6
3  1  3  5
4  2  4  4
5  2  5  3
6  2  6  2
7  2  7  1
   a  b  c
3  1  3  5
2  1  2  6
1  1  1  7
0  1  0  8
7  2  7  1
6  2  6  2
5  2  5  3
4  2  4  4
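With several sort columns, `ascending` may also be a list, giving one direction per column. A sketch with made-up values, sorting `a` ascending and `c` descending within each value of `a`:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2], "c": [5, 8, 1, 4]})

mixed = df.sort_values(["a", "c"], ascending=[True, False])

print(mixed)
print(mixed.index.tolist())   # original row labels in their new order
```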

# Sorting 2: by index with .sort_index
df1 = pd.DataFrame(np.random.rand(16).reshape(4, 4)*100, index=list('dcab'))

df2 = pd.DataFrame(np.random.rand(25).reshape(5, 5)*100, index=list('54321'))

print(df1)
print(df1.sort_index())
print(df2)
print(df2.sort_index())

           0          1          2          3
d  41.135343  77.253278  36.886300  92.532598
c  50.130950  82.863338   5.692255  36.134360
a  83.997284  73.834330  45.764539  12.149822
b   9.857439  11.782583  49.340666  73.912567
           0          1          2          3
a  83.997284  73.834330  45.764539  12.149822
b   9.857439  11.782583  49.340666  73.912567
c  50.130950  82.863338   5.692255  36.134360
d  41.135343  77.253278  36.886300  92.532598
           0          1          2          3          4
5  57.074088  86.690920  89.375103  59.925863  58.614896
4  39.561612  32.114331  72.069420  33.207265  23.431688
3  53.559025  98.972040   4.027947  58.884601  77.840003
2  67.495174  57.055822  31.545504  37.005847  40.818169
1  81.765456  77.883286  42.279947  80.506095  95.405828
           0          1          2          3          4
1  81.765456  77.883286  42.279947  80.506095  95.405828
2  67.495174  57.055822  31.545504  37.005847  40.818169
3  53.559025  98.972040   4.027947  58.884601  77.840003
4  39.561612  32.114331  72.069420  33.207265  23.431688
5  57.074088  86.690920  89.375103  59.925863  58.614896
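`sort_index` takes the same `ascending` parameter as `sort_values`, and `axis=1` sorts the column labels instead of the row labels. A small sketch with made-up labels:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
                  index=list("cab"), columns=["y", "x"])

rows_desc = df.sort_index(ascending=False)   # row labels in descending order
cols_sorted = df.sort_index(axis=1)          # column labels in ascending order

print(rows_desc)
print(cols_sorted)
```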