Python数据分析:pandas-tutorial-student

67 阅读27分钟

Pandas基础

Pandas简介

  • python数据分析library
  • 基于numpy (对ndarray的操作)
  • 有一种用python做Excel/SQL/R的感觉

目录

  • Series
  • DataFrame
  • Index, reindex and hierarchical indexing
  • Merge, Join, Concatenate, Groupby and Aggregate
  • Read from csv
  • 随堂案例:bikes routes counts

数据结构Series

构造和初始化Series

import pandas as pd
import numpy as np

Series是一个一维的数据结构

s = pd.Series([7, 'Beijing', 2.17, -1232, 'Happy birthday!'])
s
0                  7
1            Beijing
2               2.17
3              -1232
4    Happy birthday!
dtype: object

pandas会默认用0到n来作为Series的index,但是我们也可以自己指定index。

s = pd.Series([7, 'Beijing', 2.17, -1232, 'Happy birthday!'],
             index = ['A', 'B', 'C', 'D', 'E'])
type(s)
pandas.core.series.Series

还可以用dictionary来构造一个Series,因为Series本来就是key value pairs。

cities = {'Beijing': 55000, 'Shanghai': 60000, 'Shenzhen': 50000, 'Hangzhou':20000, \
         'Guangzhou': 25000, 'Suzhou': None}
apts = pd.Series(cities)
print(apts)
print(type(apts))
Beijing      55000.0
Shanghai     60000.0
Shenzhen     50000.0
Hangzhou     20000.0
Guangzhou    25000.0
Suzhou           NaN
dtype: float64
<class 'pandas.core.series.Series'>

选择数据

前面定义的index就是用来选择数据的

apts["Hangzhou"]
20000.0
apts[["Hangzhou", "Beijing", "Shenzhen"]]
# type(apts[["Hangzhou", "Beijing", "Shenzhen"]])
Hangzhou    20000.0
Beijing     55000.0
Shenzhen    50000.0
dtype: float64

大家还记得上次numpy讲到的boolean indexing吗?pandas当中也可以使用。

apts[apts < 50000]
Hangzhou     20000.0
Guangzhou    25000.0
dtype: float64
less_than_50000 = apts < 50000
print(less_than_50000)
Beijing      False
Shanghai     False
Shenzhen     False
Hangzhou      True
Guangzhou     True
Suzhou       False
dtype: bool
apts[less_than_50000]
Hangzhou     20000.0
Guangzhou    25000.0
dtype: float64

下面我再详细展示一下这个boolean indexing是如何工作的

Series元素赋值

Series的元素可以被赋值

print("Old value:", apts['Shenzhen'])
Old value: 50000.0
apts['Shenzhen'] = 55000
apts['Shenzhen']
55000.0

前面讲过的boolean indexing在赋值的时候也可以用

print(apts[apts < 50000])
Hangzhou     20000.0
Guangzhou    25000.0
dtype: float64
apts[apts <= 50000] = 40000
apts
Beijing      55000.0
Shanghai     60000.0
Shenzhen     55000.0
Hangzhou     40000.0
Guangzhou    40000.0
Suzhou           NaN
dtype: float64

数学运算

下面我们来讲一些基本的数学运算。

apts / 2
Beijing      27500.0
Shanghai     30000.0
Shenzhen     27500.0
Hangzhou     20000.0
Guangzhou    20000.0
Suzhou           NaN
dtype: float64
apts * 2
Beijing      110000.0
Shanghai     120000.0
Shenzhen     110000.0
Hangzhou      80000.0
Guangzhou     80000.0
Suzhou            NaN
dtype: float64
np.square(apts)
Beijing      3.025000e+09
Shanghai     3.600000e+09
Shenzhen     3.025000e+09
Hangzhou     1.600000e+09
Guangzhou    1.600000e+09
Suzhou                NaN
dtype: float64

square也可以写成 **

apts ** 2
Beijing      3.025000e+09
Shanghai     3.600000e+09
Shenzhen     3.025000e+09
Hangzhou     1.600000e+09
Guangzhou    1.600000e+09
Suzhou                NaN
dtype: float64

我们再定义一个新的Series做加法

cars = pd.Series({'Beijing': 300000, 'Shanghai': 400000, 'Shenzhen': 300000, \
                      'Tianjin': 200000, 'Guangzhou': 200000, 'Chongqing': 150000})
cars
Beijing      300000
Shanghai     400000
Shenzhen     300000
Tianjin      200000
Guangzhou    200000
Chongqing    150000
dtype: int64
print(cars + apts * 100)
Beijing      5800000.0
Chongqing          NaN
Guangzhou    4200000.0
Hangzhou           NaN
Shanghai     6400000.0
Shenzhen     5800000.0
Suzhou             NaN
Tianjin            NaN
dtype: float64

数据缺失

'Hangzhou' in apts
True
'Hangzhou' in cars
False
apts.notnull()
Beijing       True
Shanghai      True
Shenzhen      True
Hangzhou      True
Guangzhou     True
Suzhou       False
dtype: bool
apts.isnull()
Beijing      False
Shanghai     False
Shenzhen     False
Hangzhou     False
Guangzhou    False
Suzhou        True
dtype: bool
apts[apts.isnull()]
Suzhou   NaN
dtype: float64
apts[apts.notnull()]
Beijing      55000.0
Shanghai     60000.0
Shenzhen     55000.0
Hangzhou     40000.0
Guangzhou    40000.0
dtype: float64
apts[apts.isnull() == False]
Beijing      55000.0
Shanghai     60000.0
Shenzhen     55000.0
Hangzhou     40000.0
Guangzhou    40000.0
dtype: float64

数据结构Dataframe

一个Dataframe就是一张表格,Series表示的是一维数组,Dataframe则是一个二维数组,可以类比成一张excel的spreadsheet。也可以把Dataframe当做一组Series的集合。

创建一个DataFrame

dataframe可以由一个dictionary构造得到。

data = {'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Hangzhou', 'Chongqing'],
       'year': [2016,2017,2016,2017,2016, 2016],
       'population': [2100, 2300, 1000, 700, 500, 500]}
pd.DataFrame(data)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
city year population
0 Beijing 2016 2100
1 Shanghai 2017 2300
2 Guangzhou 2016 1000
3 Shenzhen 2017 700
4 Hangzhou 2016 500
5 Chongqing 2016 500

columns的名字和顺序可以指定

pd.DataFrame(data, columns = ['year', 'city', 'population'])
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
year city population
0 2016 Beijing 2100
1 2017 Shanghai 2300
2 2016 Guangzhou 1000
3 2017 Shenzhen 700
4 2016 Hangzhou 500
5 2016 Chongqing 500
pd.DataFrame(data, columns = ['year', 'city', 'population', 'debt'])
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
year city population debt
0 2016 Beijing 2100 NaN
1 2017 Shanghai 2300 NaN
2 2016 Guangzhou 1000 NaN
3 2017 Shenzhen 700 NaN
4 2016 Hangzhou 500 NaN
5 2016 Chongqing 500 NaN
frame2 = pd.DataFrame(data, columns = ['year', 'city', 'population', 'debt'],
            index=['one', 'two', 'three', 'four', 'five', 'six'])
print(frame2)
       year       city  population debt
one    2016    Beijing        2100  NaN
two    2017   Shanghai        2300  NaN
three  2016  Guangzhou        1000  NaN
four   2017   Shenzhen         700  NaN
five   2016   Hangzhou         500  NaN
six    2016  Chongqing         500  NaN

从DataFrame里选择数据

frame2['city']
one        Beijing
two       Shanghai
three    Guangzhou
four      Shenzhen
five      Hangzhou
six      Chongqing
Name: city, dtype: object
type(frame2['city'])
pandas.core.series.Series
print(frame2.city)
one        Beijing
two       Shanghai
three    Guangzhou
four      Shenzhen
five      Hangzhou
six      Chongqing
Name: city, dtype: object
frame2.loc['three']
year               2016
city          Guangzhou
population         1000
debt                NaN
Name: three, dtype: object
type(frame2.loc['three'])
pandas.core.series.Series

下面这种方法默认用来选列而不是选行

print(frame2.iloc[2])
year               2016
city          Guangzhou
population         1000
debt                NaN
Name: three, dtype: object

DataFrame元素赋值

frame2["population"]["one"] = 2200
C:\Users\zhangxf\AppData\Local\Temp\ipykernel_120184\1394222887.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  frame2["population"]["one"] = 2200
frame2
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
year city population debt
one 2016 Beijing 2200 NaN
two 2017 Shanghai 2300 NaN
three 2016 Guangzhou 1000 NaN
four 2017 Shenzhen 700 NaN
five 2016 Hangzhou 500 NaN
six 2016 Chongqing 500 NaN

可以给一整列赋值

frame2['debt'] = 100000000
frame2
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
year city population debt
one 2016 Beijing 2200 100000000
two 2017 Shanghai 2300 100000000
three 2016 Guangzhou 1000 100000000
four 2017 Shenzhen 700 100000000
five 2016 Hangzhou 500 100000000
six 2016 Chongqing 500 100000000
frame2.loc['six'] = 0
frame2
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
year city population debt
one 2016 Beijing 2200 100000000
two 2017 Shanghai 2300 100000000
three 2016 Guangzhou 1000 100000000
four 2017 Shenzhen 700 100000000
five 2016 Hangzhou 500 100000000
six 0 0 0 0
frame2 = pd.DataFrame(data, \
                     columns = ['year', 'city', 'population', 'debt'],
                     index = ['one', 'two', 'three', 'four', 'five', 'six'])
print(frame2)
       year       city  population debt
one    2016    Beijing        2100  NaN
two    2017   Shanghai        2300  NaN
three  2016  Guangzhou        1000  NaN
four   2017   Shenzhen         700  NaN
five   2016   Hangzhou         500  NaN
six    2016  Chongqing         500  NaN
frame2.debt = np.arange(6)
frame2
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
year city population debt
one 2016 Beijing 2100 0
two 2017 Shanghai 2300 1
three 2016 Guangzhou 1000 2
four 2017 Shenzhen 700 3
five 2016 Hangzhou 500 4
six 2016 Chongqing 500 5

还可以用Series来指定需要修改的index以及相对应的value,没有指定的默认用NaN.

val  = pd.Series([200, 300, 500], index=['two', 'three', 'five'])
val
frame2['debt'] = val
print(frame2)
       year       city  population   debt
one    2016    Beijing        2100    NaN
two    2017   Shanghai        2300  200.0
three  2016  Guangzhou        1000  300.0
four   2017   Shenzhen         700    NaN
five   2016   Hangzhou         500  500.0
six    2016  Chongqing         500    NaN
frame2['western'] = (frame2.city == 'Chongqing')
print(frame2)
       year       city  population   debt  western
one    2016    Beijing        2100    NaN    False
two    2017   Shanghai        2300  200.0    False
three  2016  Guangzhou        1000  300.0    False
four   2017   Shenzhen         700    NaN    False
five   2016   Hangzhou         500  500.0    False
six    2016  Chongqing         500    NaN     True
frame2.columns
Index(['year', 'city', 'population', 'debt', 'western'], dtype='object')
frame2.index
Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')

一个DataFrame就和一个numpy 2d array一样,可以被转置

frame2.T
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
one two three four five six
year 2016 2017 2016 2017 2016 2016
city Beijing Shanghai Guangzhou Shenzhen Hangzhou Chongqing
population 2100 2300 1000 700 500 500
debt NaN 200.0 300.0 NaN 500.0 NaN
western False False False False False True
pop = {'Beijing': {2016: 2100, 2017:2200},
      'Shanghai': {2015:2400, 2016:2500, 2017:2600}}
frame3 = pd.DataFrame(pop)
print(frame3)
      Beijing  Shanghai
2016   2100.0      2500
2017   2200.0      2600
2015      NaN      2400
print(frame3.T)
            2016    2017    2015
Beijing   2100.0  2200.0     NaN
Shanghai  2500.0  2600.0  2400.0

指定index的顺序,以及使用切片初始化数据

frame3['Beijing'][1:3]
2017    2200.0
2015       NaN
Name: Beijing, dtype: float64
frame3['Shanghai'][:-1]
2016    2500
2017    2600
Name: Shanghai, dtype: int64
pdata = {'Beijing': frame3['Beijing'][:-1], 'Shanghai':frame3['Shanghai'][:-1]}
print(pd.DataFrame(pdata))
      Beijing  Shanghai
2016   2100.0      2500
2017   2200.0      2600

我们还可以指定index的名字和列的名字

frame3.index.name = 'year'
frame3.columns.name = 'city'
frame3
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
city Beijing Shanghai
year
2016 2100.0 2500
2017 2200.0 2600
2015 NaN 2400
type(frame3.values)
numpy.ndarray

Index

index object

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
print(index)
print(index[1:])
Index(['a', 'b', 'c'], dtype='object')
Index(['b', 'c'], dtype='object')

index的值是不能被更改的

index[1] = 'd'
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

~\AppData\Local\Temp\ipykernel_120184\3159586693.py in <module>
----> 1 index[1] = 'd'


e:\Anaconda\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   5033     @final
   5034     def __setitem__(self, key, value):
-> 5035         raise TypeError("Index does not support mutable operations")
   5036 
   5037     def __getitem__(self, key):


TypeError: Index does not support mutable operations
index = pd.Index(np.arange(3))
index
obj2 = pd.Series([2,5,7], index=index)
print(obj2)
print(obj2.index is index)
0    2
1    5
2    7
dtype: int64
True
pop = {'Beijing': {2016: 2100, 2017:2200},
      'Shanghai': {2015:2400, 2016:2500, 2017:2600}}
frame3 = pd.DataFrame(pop)
print('Shanghai' in frame3.columns)
print(2015 in frame3.columns)
True
False

针对index进行索引和切片

obj = pd.Series(np.arange(4), index=['a','b','c','d'])
print(obj)

a    0
b    1
c    2
d    3
dtype: int32
obj[['b', 'a']]
b    1
a    0
dtype: int32
obj[[0, 2]]
a    0
c    2
dtype: int32

默认的数字index依旧可以使用

下面介绍如何对Series进行切片

obj[1:3]
b    1
c    2
dtype: int32
obj['b':'d'] = 5
print(obj)
a    0
b    5
c    5
d    5
dtype: int32

对DataFrame进行Indexing与Series基本相同

frame = pd.DataFrame(np.arange(9).reshape(3,3),
                    index = ['a', 'c', 'd'],
                    columns = ['Hangzhou', 'Shenzhen', 'Nanjing'])
print(frame)
   Hangzhou  Shenzhen  Nanjing
a         0         1        2
c         3         4        5
d         6         7        8
frame['Hangzhou']
a    0
c    3
d    6
Name: Hangzhou, dtype: int32
frame[:2]
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Hangzhou Shenzhen Nanjing
a 0 1 2
c 3 4 5
frame.loc['a':'d']
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Hangzhou Shenzhen Nanjing
a 0 1 2
c 3 4 5
d 6 7 8
frame['Hangzhou':'Najing']
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Hangzhou Shenzhen Nanjing
frame.loc[:, 'Shenzhen':'Nanjing']
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Shenzhen Nanjing
a 1 2
c 4 5
d 7 8
frame.loc[:'c', 'Hangzhou']
a    0
c    3
Name: Hangzhou, dtype: int32

DataFrame也可以用condition selection

frame[frame.Hangzhou > 1]
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Hangzhou Shenzhen Nanjing
c 3 4 5
d 6 7 8
frame[frame < 5] = 0
print(frame)
   Hangzhou  Shenzhen  Nanjing
a         0         0        0
c         0         0        5
d         6         7        8

reindex

把一个Series或者DataFrame按照新的index顺序进行重排


obj = pd.Series([4.5, 7.2, -5.3, 3.2], index=['d', 'b', 'a', 'c'])
print(obj)
d    4.5
b    7.2
a   -5.3
c    3.2
dtype: float64
obj.reindex(['a', 'b', 'c', 'd', 'e'])
a   -5.3
b    7.2
c    3.2
d    4.5
e    NaN
dtype: float64
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value = 0)
a   -5.3
b    7.2
c    3.2
d    4.5
e    0.0
dtype: float64
obj3 = pd.Series(['blue', 'purple', 'yellow'], index = [0,2,4])
print(obj3)
0      blue
2    purple
4    yellow
dtype: object
obj3.reindex(range(6), method='ffill')
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
obj3.reindex(range(6), method='bfill')
0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
dtype: object

既然我们可以对Series进行reindex,相应地,我们也可以用同样的方法对DataFrame进行reindex。

frame = pd.DataFrame(np.arange(9).reshape(3,3), 
                    index = ['a', 'c', 'd'],
                    columns = ['Hangzhou', 'Shenzhen', 'Nanjing'])
print(frame)
   Hangzhou  Shenzhen  Nanjing
a         0         1        2
c         3         4        5
d         6         7        8
frame.reindex(['a' , 'b', 'c', 'd'])
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Hangzhou Shenzhen Nanjing
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0

在reindex的同时,我们还可以重新指定columns

frame.reindex(columns = ['Shenzhen', 'Hangzhou', 'Chongqing'])
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Shenzhen Hangzhou Chongqing
a 1 0 NaN
c 4 3 NaN
d 7 6 NaN

下面介绍如何用drop来删除Series和DataFrame中的index

obj4 = obj3.drop(2)
print(obj4)
0      blue
4    yellow
dtype: object
obj3.drop([2, 4])
0    blue
dtype: object
frame
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Hangzhou Shenzhen Nanjing
a 0 1 2
c 3 4 5
d 6 7 8
frame.drop(['a', 'c'])
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Hangzhou Shenzhen Nanjing
d 6 7 8
frame.drop('Shenzhen', axis=1)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Hangzhou Nanjing
a 0 2
c 3 5
d 6 8
frame.drop(['Shenzhen', 'Hangzhou'], axis=1)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Nanjing
a 2
c 5
d 8

drop不仅仅可以删除行,还可以删除列

hierarchical index

Series的hierarchical indexing

data = pd.Series(np.random.randn(10), 
                 index=[['a','a','a','b','b','c','c','c','d','d'], \
                        [1,2,3,1,2,1,2,3,1,2]])
data
a  1   -0.445826
   2    0.656269
   3   -0.337780
b  1    0.867805
   2   -0.438495
c  1   -0.311051
   2   -0.217434
   3   -1.547993
d  1   -1.152814
   2   -0.516443
dtype: float64
data.index
MultiIndex([('a', 1),            ('a', 2),            ('a', 3),            ('b', 1),            ('b', 2),            ('c', 1),            ('c', 2),            ('c', 3),            ('d', 1),            ('d', 2)],
           )
print(data["b"])
1    0.867805
2   -0.438495
dtype: float64
data['b':'d']
b  1    0.867805
   2   -0.438495
c  1   -0.311051
   2   -0.217434
   3   -1.547993
d  1   -1.152814
   2   -0.516443
dtype: float64
data[1:4]
a  2    0.656269
   3   -0.337780
b  1    0.867805
dtype: float64

unstack和stack可以帮助我们在hierarchical indexing和DataFrame之间进行切换。

data.unstack()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
1 2 3
a -0.445826 0.656269 -0.337780
b 0.867805 -0.438495 NaN
c -0.311051 -0.217434 -1.547993
d -1.152814 -0.516443 NaN
type(data.unstack())
pandas.core.frame.DataFrame
data.unstack().stack()
a  1   -0.445826
   2    0.656269
   3   -0.337780
b  1    0.867805
   2   -0.438495
c  1   -0.311051
   2   -0.217434
   3   -1.547993
d  1   -1.152814
   2   -0.516443
dtype: float64

DataFrame的hierarchical indexing

frame = pd.DataFrame(np.arange(12).reshape((4,3)),
                    index = [['a','a','b','b'], [1,2,1,2]],
                    columns = [['Beijing', 'Beijing', 'Shanghai'], ['apts', 'cars', 'apts']])
print(frame)
    Beijing      Shanghai
       apts cars     apts
a 1       0    1        2
  2       3    4        5
b 1       6    7        8
  2       9   10       11
frame.index.names = ['alpha', 'number']
frame.columns.names = ['city', 'type']
print(frame)
city         Beijing      Shanghai
type            apts cars     apts
alpha number                      
a     1            0    1        2
      2            3    4        5
b     1            6    7        8
      2            9   10       11
frame.loc['a', 1]
#print(type(frame.loc['a', 1]))
city      type
Beijing   apts    0
          cars    1
Shanghai  apts    2
Name: (a, 1), dtype: int32
frame.loc['a', 2]['Beijing']['apts']
3

一些建议

学习过程中碰到问题,直接自己写代码进行尝试。

学会使用google和stackoverflow解决技术问题

关于Merge, Join和Concatenate

concatenate

df1 = pd.DataFrame({'apts': [55000, 60000],
                   'cars': [200000, 300000],},
                  index = ['Shanghai', 'Beijing'])
df1
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars
Shanghai 55000 200000
Beijing 60000 300000
df2 = pd.DataFrame({'apts': [25000, 20000],
                   'cars': [150000, 120000],},
                  index = ['Hangzhou', 'Najing'])
print(df2)
           apts    cars
Hangzhou  25000  150000
Najing    20000  120000
df3 = pd.DataFrame({'apts': [30000, 10000],
                   'cars': [180000, 100000],},
                  index = ['Guangzhou', 'Chongqing'])
print(df3)
            apts    cars
Guangzhou  30000  180000
Chongqing  10000  100000
frames = [df1, df2, df3]
result = pd.concat(frames)
result 
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars
Shanghai 55000 200000
Beijing 60000 300000
Hangzhou 25000 150000
Najing 20000 120000
Guangzhou 30000 180000
Chongqing 10000 100000

在concatenate的时候可以指定keys,这样可以给每一个部分加上一个Key。

以下的例子就构造了一个hierarchical index。

result2 = pd.concat(frames, keys=['x', 'y', 'z'])
print(result2)
              apts    cars
x Shanghai   55000  200000
  Beijing    60000  300000
y Hangzhou   25000  150000
  Najing     20000  120000
z Guangzhou  30000  180000
  Chongqing  10000  100000
df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000]},
                  index = ['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin'])
print(df4)
           salaries
Suzhou        10000
Beijing       30000
Shanghai      30000
Guangzhou     20000
Tianjin       15000
result3 = pd.concat([result, df4], axis=1)
result3
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars salaries
Shanghai 55000.0 200000.0 30000.0
Beijing 60000.0 300000.0 30000.0
Hangzhou 25000.0 150000.0 NaN
Najing 20000.0 120000.0 NaN
Guangzhou 30000.0 180000.0 20000.0
Chongqing 10000.0 100000.0 NaN
Suzhou NaN NaN 10000.0
Tianjin NaN NaN 15000.0

复习一下前面讲过的stack

result3.stack()
Shanghai   apts         55000.0
           cars        200000.0
           salaries     30000.0
Beijing    apts         60000.0
           cars        300000.0
           salaries     30000.0
Hangzhou   apts         25000.0
           cars        150000.0
Najing     apts         20000.0
           cars        120000.0
Guangzhou  apts         30000.0
           cars        180000.0
           salaries     20000.0
Chongqing  apts         10000.0
           cars        100000.0
Suzhou     salaries     10000.0
Tianjin    salaries     15000.0
dtype: float64

用inner可以去掉NaN

result3 = pd.concat([result, df4], axis=1, join='inner')
result3
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars salaries
Shanghai 55000 200000 30000
Beijing 60000 300000 30000
Guangzhou 30000 180000 20000
用append来做concatenation
df1.append(df2)
C:\Users\zhangxf\AppData\Local\Temp\ipykernel_120184\3062608662.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  df1.append(df2)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars
Shanghai 55000 200000
Beijing 60000 300000
Hangzhou 25000 150000
Najing 20000 120000
df1.append(df4)
C:\Users\zhangxf\AppData\Local\Temp\ipykernel_120184\940018064.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  df1.append(df4)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars salaries
Shanghai 55000.0 200000.0 NaN
Beijing 60000.0 300000.0 NaN
Suzhou NaN NaN 10000.0
Beijing NaN NaN 30000.0
Shanghai NaN NaN 30000.0
Guangzhou NaN NaN 20000.0
Tianjin NaN NaN 15000.0

Series和DataFrame还可以被一起concatenate,这时候Series会先被转成DataFrame然后做Join,因为Series本来就是一个只有一维的DataFrame对吧。

s1 = pd.Series([60, 50], index=['Shanghai', 'Beijing'], name='meal')
print(s1)
Shanghai    60
Beijing     50
Name: meal, dtype: int64
df1
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars
Shanghai 55000 200000
Beijing 60000 300000
pd.concat([df1, s1], axis=1)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars meal
Shanghai 55000 200000 60
Beijing 60000 300000 50

下面讲如何append一个row到DataFrame里。

s2 = pd.Series([18000, 120000], index=['apts', 'cars'], name='Xiamen')
s2
apts     18000
cars    120000
Name: Xiamen, dtype: int64
df1.append(s2)
C:\Users\zhangxf\AppData\Local\Temp\ipykernel_120184\3390175239.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  df1.append(s2)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars
Shanghai 55000 200000
Beijing 60000 300000
Xiamen 18000 120000

Merge(Join)


df1 = pd.DataFrame({'apts': [55000, 60000, 58000],
                   'cars': [200000, 300000,250000],
                  'cities': ['Shanghai', 'Beijing','Shenzhen']})
print(df1)
    apts    cars    cities
0  55000  200000  Shanghai
1  60000  300000   Beijing
2  58000  250000  Shenzhen
df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000],
                  'cities': ['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin']})
print(df4)
   salaries     cities
0     10000     Suzhou
1     30000    Beijing
2     30000   Shanghai
3     20000  Guangzhou
4     15000    Tianjin
result = pd.merge(df1, df4, on='cities')
result
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars cities salaries
0 55000 200000 Shanghai 30000
1 60000 300000 Beijing 30000
result2 = pd.merge(df1, df4, on='cities', how='outer')
result2
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars cities salaries
0 55000.0 200000.0 Shanghai 30000.0
1 60000.0 300000.0 Beijing 30000.0
2 58000.0 250000.0 Shenzhen NaN
3 NaN NaN Suzhou 10000.0
4 NaN NaN Guangzhou 20000.0
5 NaN NaN Tianjin 15000.0
join on index
df1 = pd.DataFrame({'apts': [55000, 60000, 58000],
                   'cars': [200000, 300000,250000]},
                  index=['Shanghai', 'Beijing','Shenzhen'])
print(df1)
           apts    cars
Shanghai  55000  200000
Beijing   60000  300000
Shenzhen  58000  250000
df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000]},
                  index=['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin'])
print(df4)
           salaries
Suzhou        10000
Beijing       30000
Shanghai      30000
Guangzhou     20000
Tianjin       15000
df1.join(df4)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars salaries
Shanghai 55000 200000 30000.0
Beijing 60000 300000 30000.0
Shenzhen 58000 250000 NaN
df4.join(df1)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
salaries apts cars
Suzhou 10000 NaN NaN
Beijing 30000 60000.0 300000.0
Shanghai 30000 55000.0 200000.0
Guangzhou 20000 NaN NaN
Tianjin 15000 NaN NaN
df1.join(df4, how='outer')
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars salaries
Beijing 60000.0 300000.0 30000.0
Guangzhou NaN NaN 20000.0
Shanghai 55000.0 200000.0 30000.0
Shenzhen 58000.0 250000.0 NaN
Suzhou NaN NaN 10000.0
Tianjin NaN NaN 15000.0

也可以用merge来写

pd.merge(df1, df4, left_index=True, right_index=True, how='outer')
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars salaries
Beijing 60000.0 300000.0 30000.0
Guangzhou NaN NaN 20000.0
Shanghai 55000.0 200000.0 30000.0
Shenzhen 58000.0 250000.0 NaN
Suzhou NaN NaN 10000.0
Tianjin NaN NaN 15000.0

Group By

举个栗子,假设我们现在有一张公司每个员工的收入流水。


import pandas as pd
salaries = pd.DataFrame({
    'Name': ['July', 'Chu', 'Chu', 'Lin', 'July', 'July', 'Chu', 'July'],
    'Year': [2016,2016,2016,2016,2017,2017,2017,2017],
    'Salary': [10000,2000,4000,5000,18000,25000,3000,4000],
    'Bonus': [3000,1000,1000,1200,4000,2300,500,1000]
})
print(salaries)
   Name  Year  Salary  Bonus
0  July  2016   10000   3000
1   Chu  2016    2000   1000
2   Chu  2016    4000   1000
3   Lin  2016    5000   1200
4  July  2017   18000   4000
5  July  2017   25000   2300
6   Chu  2017    3000    500
7  July  2017    4000   1000

接下来我给大家演示一下什么叫做Group By

group_by_name = salaries.groupby('Name')
group_by_name
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001A95711B1C0>

groupby经常和aggregate一起使用

group_by_name.aggregate(sum)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Year Salary Bonus
Name
Chu 6049 9000 2500
July 8067 57000 10300
Lin 2016 5000 1200
group_by_name.sum()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Year Salary Bonus
Name
Chu 6049 9000 2500
July 8067 57000 10300
Lin 2016 5000 1200

也可以这么写

group_by_name_year = salaries.groupby(['Name', 'Year'])
group_by_name_year.sum()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Salary Bonus
Name Year
Chu 2016 6000 2000
2017 3000 500
July 2016 10000 3000
2017 47000 7300
Lin 2016 5000 1200
group_by_name_year.size()
Name  Year
Chu   2016    2
      2017    1
July  2016    1
      2017    3
Lin   2016    1
dtype: int64
group_by_name_year.max() # min()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Salary Bonus
Name Year
Chu 2016 4000 1000
2017 3000 500
July 2016 10000 3000
2017 25000 4000
Lin 2016 5000 1200

describe这个function可以为我们展示各种有用的统计信息

# group_by_name_year.describe()

READ FROM CSV

我们先从CSV文件中读取一些数据。

bike.csv记录了Montreal自行车路线的数据,具体有7条路线,数据记录了每条自行车路线每天分别有多少人。

bikes = pd.read_csv('bikes.csv', encoding='latin1')
# bikes
bikes = pd.read_csv('bikes.csv', encoding='latin1', sep=';', 
                    parse_dates=['Date'], dayfirst=True, index_col='Date')
bikes.head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Berri 1 Brébeuf (données non disponibles) Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1 St-Urbain (données non disponibles)
Date
2012-01-01 35 NaN 0 38 51 26 10 16 NaN
2012-01-02 83 NaN 1 68 153 53 6 43 NaN
2012-01-03 135 NaN 2 104 248 89 3 58 NaN
2012-01-04 144 NaN 1 116 318 111 8 61 NaN
2012-01-05 197 NaN 2 124 330 97 13 95 NaN

更多关于read_csv的说明请查阅documentation

read_csv

dropna会删除所有带NA的行

bikes.dropna()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Berri 1 Brébeuf (données non disponibles) Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1 St-Urbain (données non disponibles)
Date
bikes.dropna(how='all').head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Berri 1 Brébeuf (données non disponibles) Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1 St-Urbain (données non disponibles)
Date
2012-01-01 35 NaN 0 38 51 26 10 16 NaN
2012-01-02 83 NaN 1 68 153 53 6 43 NaN
2012-01-03 135 NaN 2 104 248 89 3 58 NaN
2012-01-04 144 NaN 1 116 318 111 8 61 NaN
2012-01-05 197 NaN 2 124 330 97 13 95 NaN
bikes.dropna(how='all', axis=1).head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1
Date
2012-01-01 35 0 38 51 26 10 16
2012-01-02 83 1 68 153 53 6 43
2012-01-03 135 2 104 248 89 3 58
2012-01-04 144 1 116 318 111 8 61
2012-01-05 197 2 124 330 97 13 95
berri_bikes = bikes[['Berri 1']].copy()
berri_bikes.head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Berri 1
Date
2012-01-01 35
2012-01-02 83
2012-01-03 135
2012-01-04 144
2012-01-05 197
berri_bikes.index.weekday
Int64Index([6, 0, 1, 2, 3, 4, 5, 6, 0, 1,
            ...
            5, 6, 0, 1, 2, 3, 4, 5, 6, 0],
           dtype='int64', name='Date', length=310)
berri_bikes.loc[:, 'weekday'] = berri_bikes.index.weekday
berri_bikes.head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Berri 1 weekday
Date
2012-01-01 35 6
2012-01-02 83 0
2012-01-03 135 1
2012-01-04 144 2
2012-01-05 197 3
weekday_counts = berri_bikes.groupby('weekday').sum()
weekday_counts
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Berri 1
weekday
0 134298
1 135305
2 152972
3 160131
4 141771
5 101578
6 99310

下面给大家介绍如何填充缺失的数据

有了weekday信息之后,我们就可以用上我们前面学过的.groupyby把骑车人数按照weekday分类,然后用aggregate算出每个工作日的骑车人数之和。

bikes.head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Berri 1 Brébeuf (données non disponibles) Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1 St-Urbain (données non disponibles)
Date
2012-01-01 35 NaN 0 38 51 26 10 16 NaN
2012-01-02 83 NaN 1 68 153 53 6 43 NaN
2012-01-03 135 NaN 2 104 248 89 3 58 NaN
2012-01-04 144 NaN 1 116 318 111 8 61 NaN
2012-01-05 197 NaN 2 124 330 97 13 95 NaN
bikes = bikes.dropna(axis=1, how='all')
bikes.head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 Maisonneuve 2 du Parc Pierre-Dupuy Rachel1
Date
2012-01-01 35 0 38 51 26 10 16
2012-01-02 83 1 68 153 53 6 43
2012-01-03 135 2 104 248 89 3 58
2012-01-04 144 1 116 318 111 8 61
2012-01-05 197 2 124 330 97 13 95
bikes_sum = bikes.sum(axis=1).to_frame()
bikes_sum.head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
0
Date
2012-01-01 176
2012-01-02 407
2012-01-03 639
2012-01-04 759
2012-01-05 858
bikes_sum.loc[:, 'weekday'] = bikes_sum.index.weekday
bikes_sum.head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
0 weekday
Date
2012-01-01 176 6
2012-01-02 407 0
2012-01-03 639 1
2012-01-04 759 2
2012-01-05 858 3

接下来我们试试能不能把每条路线都加起来,然后算出一天骑自行车出门的人数之和。

weekday_counts = bikes_sum.groupby('weekday').aggregate(sum)
weekday_counts.index = ['Monday', 'Tuesday', 
                        'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekday_counts
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
0
Monday 714963
Tuesday 698582
Wednesday 789722
Thursday 829069
Friday 738772
Saturday 516701
Sunday 518047