Pandas基础
Pandas简介
- python数据分析library
- 基于numpy (对ndarray的操作)
- 有一种用python做Excel/SQL/R的感觉
目录
- Series
- DataFrame
- Index, reindex and hierarchical indexing
- Merge, Join, Concatenate, Groupby and Aggregate
- Read from csv
- 随堂案例:bikes routes counts
数据结构Series
构造和初始化Series
import pandas as pd
import numpy as np
Series是一个一维的数据结构
s = pd.Series([7, 'Beijing', 2.17, -1232, 'Happy birthday!'])
s
0 7
1 Beijing
2 2.17
3 -1232
4 Happy birthday!
dtype: object
pandas会默认用0到n来作为Series的index,但是我们也可以自己指定index。
s = pd.Series([7, 'Beijing', 2.17, -1232, 'Happy birthday!'],
index = ['A', 'B', 'C', 'D', 'E'])
type(s)
pandas.core.series.Series
还可以用dictionary来构造一个Series,因为Series本来就是key value pairs。
cities = {'Beijing': 55000, 'Shanghai': 60000, 'Shenzhen': 50000, 'Hangzhou':20000, \
'Guangzhou': 25000, 'Suzhou': None}
apts = pd.Series(cities)
print(apts)
print(type(apts))
Beijing 55000.0
Shanghai 60000.0
Shenzhen 50000.0
Hangzhou 20000.0
Guangzhou 25000.0
Suzhou NaN
dtype: float64
<class 'pandas.core.series.Series'>
选择数据
前面定义的index就是用来选择数据的
apts["Hangzhou"]
20000.0
apts[["Hangzhou", "Beijing", "Shenzhen"]]
# type(apts[["Hangzhou", "Beijing", "Shenzhen"]])
Hangzhou 20000.0
Beijing 55000.0
Shenzhen 50000.0
dtype: float64
大家还记得上次numpy讲到的boolean indexing吗?pandas当中也可以使用。
apts[apts < 50000]
Hangzhou 20000.0
Guangzhou 25000.0
dtype: float64
less_than_50000 = apts < 50000
print(less_than_50000)
Beijing False
Shanghai False
Shenzhen False
Hangzhou True
Guangzhou True
Suzhou False
dtype: bool
apts[less_than_50000]
Hangzhou 20000.0
Guangzhou 25000.0
dtype: float64
下面我再详细展示一下这个boolean indexing是如何工作的
Series元素赋值
Series的元素可以被赋值
print("Old value:", apts['Shenzhen'])
Old value: 50000.0
apts['Shenzhen'] = 55000
apts['Shenzhen']
55000.0
前面讲过的boolean indexing在赋值的时候也可以用
print(apts[apts < 50000])
Hangzhou 20000.0
Guangzhou 25000.0
dtype: float64
apts[apts <= 50000] = 40000
apts
Beijing 55000.0
Shanghai 60000.0
Shenzhen 55000.0
Hangzhou 40000.0
Guangzhou 40000.0
Suzhou NaN
dtype: float64
数学运算
下面我们来讲一些基本的数学运算。
apts / 2
Beijing 27500.0
Shanghai 30000.0
Shenzhen 27500.0
Hangzhou 20000.0
Guangzhou 20000.0
Suzhou NaN
dtype: float64
apts * 2
Beijing 110000.0
Shanghai 120000.0
Shenzhen 110000.0
Hangzhou 80000.0
Guangzhou 80000.0
Suzhou NaN
dtype: float64
np.square(apts)
Beijing 3.025000e+09
Shanghai 3.600000e+09
Shenzhen 3.025000e+09
Hangzhou 1.600000e+09
Guangzhou 1.600000e+09
Suzhou NaN
dtype: float64
square也可以写成 **
apts ** 2
Beijing 3.025000e+09
Shanghai 3.600000e+09
Shenzhen 3.025000e+09
Hangzhou 1.600000e+09
Guangzhou 1.600000e+09
Suzhou NaN
dtype: float64
我们再定义一个新的Series做加法
cars = pd.Series({'Beijing': 300000, 'Shanghai': 400000, 'Shenzhen': 300000, \
'Tianjin': 200000, 'Guangzhou': 200000, 'Chongqing': 150000})
cars
Beijing 300000
Shanghai 400000
Shenzhen 300000
Tianjin 200000
Guangzhou 200000
Chongqing 150000
dtype: int64
print(cars + apts * 100)
Beijing 5800000.0
Chongqing NaN
Guangzhou 4200000.0
Hangzhou NaN
Shanghai 6400000.0
Shenzhen 5800000.0
Suzhou NaN
Tianjin NaN
dtype: float64
数据缺失
'Hangzhou' in apts
True
'Hangzhou' in cars
False
apts.notnull()
Beijing True
Shanghai True
Shenzhen True
Hangzhou True
Guangzhou True
Suzhou False
dtype: bool
apts.isnull()
Beijing False
Shanghai False
Shenzhen False
Hangzhou False
Guangzhou False
Suzhou True
dtype: bool
apts[apts.isnull()]
Suzhou NaN
dtype: float64
apts[apts.notnull()]
Beijing 55000.0
Shanghai 60000.0
Shenzhen 55000.0
Hangzhou 40000.0
Guangzhou 40000.0
dtype: float64
apts[apts.isnull() == False]
Beijing 55000.0
Shanghai 60000.0
Shenzhen 55000.0
Hangzhou 40000.0
Guangzhou 40000.0
dtype: float64
数据结构Dataframe
一个Dataframe就是一张表格,Series表示的是一维数组,Dataframe则是一个二维数组,可以类比成一张excel的spreadsheet。也可以把Dataframe当做一组Series的集合。
创建一个DataFrame
dataframe可以由一个dictionary构造得到。
data = {'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Hangzhou', 'Chongqing'],
'year': [2016,2017,2016,2017,2016, 2016],
'population': [2100, 2300, 1000, 700, 500, 500]}
pd.DataFrame(data)
| city | year | population | |
|---|---|---|---|
| 0 | Beijing | 2016 | 2100 |
| 1 | Shanghai | 2017 | 2300 |
| 2 | Guangzhou | 2016 | 1000 |
| 3 | Shenzhen | 2017 | 700 |
| 4 | Hangzhou | 2016 | 500 |
| 5 | Chongqing | 2016 | 500 |
columns的名字和顺序可以指定
pd.DataFrame(data, columns = ['year', 'city', 'population'])
| year | city | population | |
|---|---|---|---|
| 0 | 2016 | Beijing | 2100 |
| 1 | 2017 | Shanghai | 2300 |
| 2 | 2016 | Guangzhou | 1000 |
| 3 | 2017 | Shenzhen | 700 |
| 4 | 2016 | Hangzhou | 500 |
| 5 | 2016 | Chongqing | 500 |
pd.DataFrame(data, columns = ['year', 'city', 'population', 'debt'])
| year | city | population | debt | |
|---|---|---|---|---|
| 0 | 2016 | Beijing | 2100 | NaN |
| 1 | 2017 | Shanghai | 2300 | NaN |
| 2 | 2016 | Guangzhou | 1000 | NaN |
| 3 | 2017 | Shenzhen | 700 | NaN |
| 4 | 2016 | Hangzhou | 500 | NaN |
| 5 | 2016 | Chongqing | 500 | NaN |
frame2 = pd.DataFrame(data, columns = ['year', 'city', 'population', 'debt'],
index=['one', 'two', 'three', 'four', 'five', 'six'])
print(frame2)
year city population debt
one 2016 Beijing 2100 NaN
two 2017 Shanghai 2300 NaN
three 2016 Guangzhou 1000 NaN
four 2017 Shenzhen 700 NaN
five 2016 Hangzhou 500 NaN
six 2016 Chongqing 500 NaN
从DataFrame里选择数据
frame2['city']
one Beijing
two Shanghai
three Guangzhou
four Shenzhen
five Hangzhou
six Chongqing
Name: city, dtype: object
type(frame2['city'])
pandas.core.series.Series
print(frame2.city)
one Beijing
two Shanghai
three Guangzhou
four Shenzhen
five Hangzhou
six Chongqing
Name: city, dtype: object
frame2.loc['three']
year 2016
city Guangzhou
population 1000
debt NaN
Name: three, dtype: object
type(frame2.loc['three'])
pandas.core.series.Series
下面这种方法默认用来选列而不是选行
print(frame2.iloc[2])
year 2016
city Guangzhou
population 1000
debt NaN
Name: three, dtype: object
DataFrame元素赋值
frame2["population"]["one"] = 2200
C:\Users\zhangxf\AppData\Local\Temp\ipykernel_120184\1394222887.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
frame2["population"]["one"] = 2200
frame2
| year | city | population | debt | |
|---|---|---|---|---|
| one | 2016 | Beijing | 2200 | NaN |
| two | 2017 | Shanghai | 2300 | NaN |
| three | 2016 | Guangzhou | 1000 | NaN |
| four | 2017 | Shenzhen | 700 | NaN |
| five | 2016 | Hangzhou | 500 | NaN |
| six | 2016 | Chongqing | 500 | NaN |
可以给一整列赋值
frame2['debt'] = 100000000
frame2
| year | city | population | debt | |
|---|---|---|---|---|
| one | 2016 | Beijing | 2200 | 100000000 |
| two | 2017 | Shanghai | 2300 | 100000000 |
| three | 2016 | Guangzhou | 1000 | 100000000 |
| four | 2017 | Shenzhen | 700 | 100000000 |
| five | 2016 | Hangzhou | 500 | 100000000 |
| six | 2016 | Chongqing | 500 | 100000000 |
frame2.loc['six'] = 0
frame2
| year | city | population | debt | |
|---|---|---|---|---|
| one | 2016 | Beijing | 2200 | 100000000 |
| two | 2017 | Shanghai | 2300 | 100000000 |
| three | 2016 | Guangzhou | 1000 | 100000000 |
| four | 2017 | Shenzhen | 700 | 100000000 |
| five | 2016 | Hangzhou | 500 | 100000000 |
| six | 0 | 0 | 0 | 0 |
frame2 = pd.DataFrame(data, \
columns = ['year', 'city', 'population', 'debt'],
index = ['one', 'two', 'three', 'four', 'five', 'six'])
print(frame2)
year city population debt
one 2016 Beijing 2100 NaN
two 2017 Shanghai 2300 NaN
three 2016 Guangzhou 1000 NaN
four 2017 Shenzhen 700 NaN
five 2016 Hangzhou 500 NaN
six 2016 Chongqing 500 NaN
frame2.debt = np.arange(6)
frame2
| year | city | population | debt | |
|---|---|---|---|---|
| one | 2016 | Beijing | 2100 | 0 |
| two | 2017 | Shanghai | 2300 | 1 |
| three | 2016 | Guangzhou | 1000 | 2 |
| four | 2017 | Shenzhen | 700 | 3 |
| five | 2016 | Hangzhou | 500 | 4 |
| six | 2016 | Chongqing | 500 | 5 |
还可以用Series来指定需要修改的index以及相对应的value,没有指定的默认用NaN.
val = pd.Series([200, 300, 500], index=['two', 'three', 'five'])
val
frame2['debt'] = val
print(frame2)
year city population debt
one 2016 Beijing 2100 NaN
two 2017 Shanghai 2300 200.0
three 2016 Guangzhou 1000 300.0
four 2017 Shenzhen 700 NaN
five 2016 Hangzhou 500 500.0
six 2016 Chongqing 500 NaN
frame2['western'] = (frame2.city == 'Chongqing')
print(frame2)
year city population debt western
one 2016 Beijing 2100 NaN False
two 2017 Shanghai 2300 200.0 False
three 2016 Guangzhou 1000 300.0 False
four 2017 Shenzhen 700 NaN False
five 2016 Hangzhou 500 500.0 False
six 2016 Chongqing 500 NaN True
frame2.columns
Index(['year', 'city', 'population', 'debt', 'western'], dtype='object')
frame2.index
Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')
一个DataFrame就和一个numpy 2d array一样,可以被转置
frame2.T
| one | two | three | four | five | six | |
|---|---|---|---|---|---|---|
| year | 2016 | 2017 | 2016 | 2017 | 2016 | 2016 |
| city | Beijing | Shanghai | Guangzhou | Shenzhen | Hangzhou | Chongqing |
| population | 2100 | 2300 | 1000 | 700 | 500 | 500 |
| debt | NaN | 200.0 | 300.0 | NaN | 500.0 | NaN |
| western | False | False | False | False | False | True |
pop = {'Beijing': {2016: 2100, 2017:2200},
'Shanghai': {2015:2400, 2016:2500, 2017:2600}}
frame3 = pd.DataFrame(pop)
print(frame3)
Beijing Shanghai
2016 2100.0 2500
2017 2200.0 2600
2015 NaN 2400
print(frame3.T)
2016 2017 2015
Beijing 2100.0 2200.0 NaN
Shanghai 2500.0 2600.0 2400.0
指定index的顺序,以及使用切片初始化数据
frame3['Beijing'][1:3]
2017 2200.0
2015 NaN
Name: Beijing, dtype: float64
frame3['Shanghai'][:-1]
2016 2500
2017 2600
Name: Shanghai, dtype: int64
pdata = {'Beijing': frame3['Beijing'][:-1], 'Shanghai':frame3['Shanghai'][:-1]}
print(pd.DataFrame(pdata))
Beijing Shanghai
2016 2100.0 2500
2017 2200.0 2600
我们还可以指定index的名字和列的名字
frame3.index.name = 'year'
frame3.columns.name = 'city'
frame3
| city | Beijing | Shanghai |
|---|---|---|
| year | ||
| 2016 | 2100.0 | 2500 |
| 2017 | 2200.0 | 2600 |
| 2015 | NaN | 2400 |
type(frame3.values)
numpy.ndarray
Index
index object
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
print(index)
print(index[1:])
Index(['a', 'b', 'c'], dtype='object')
Index(['b', 'c'], dtype='object')
index的值是不能被更改的
index[1] = 'd'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_120184\3159586693.py in <module>
----> 1 index[1] = 'd'
e:\Anaconda\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
5033 @final
5034 def __setitem__(self, key, value):
-> 5035 raise TypeError("Index does not support mutable operations")
5036
5037 def __getitem__(self, key):
TypeError: Index does not support mutable operations
index = pd.Index(np.arange(3))
index
obj2 = pd.Series([2,5,7], index=index)
print(obj2)
print(obj2.index is index)
0 2
1 5
2 7
dtype: int64
True
pop = {'Beijing': {2016: 2100, 2017:2200},
'Shanghai': {2015:2400, 2016:2500, 2017:2600}}
frame3 = pd.DataFrame(pop)
print('Shanghai' in frame3.columns)
print(2015 in frame3.columns)
True
False
针对index进行索引和切片
obj = pd.Series(np.arange(4), index=['a','b','c','d'])
print(obj)
a 0
b 1
c 2
d 3
dtype: int32
obj[['b', 'a']]
b 1
a 0
dtype: int32
obj[[0, 2]]
a 0
c 2
dtype: int32
默认的数字index依旧可以使用
下面介绍如何对Series进行切片
obj[1:3]
b 1
c 2
dtype: int32
obj['b':'d'] = 5
print(obj)
a 0
b 5
c 5
d 5
dtype: int32
对DataFrame进行Indexing与Series基本相同
frame = pd.DataFrame(np.arange(9).reshape(3,3),
index = ['a', 'c', 'd'],
columns = ['Hangzhou', 'Shenzhen', 'Nanjing'])
print(frame)
Hangzhou Shenzhen Nanjing
a 0 1 2
c 3 4 5
d 6 7 8
frame['Hangzhou']
a 0
c 3
d 6
Name: Hangzhou, dtype: int32
frame[:2]
| Hangzhou | Shenzhen | Nanjing | |
|---|---|---|---|
| a | 0 | 1 | 2 |
| c | 3 | 4 | 5 |
frame.loc['a':'d']
| Hangzhou | Shenzhen | Nanjing | |
|---|---|---|---|
| a | 0 | 1 | 2 |
| c | 3 | 4 | 5 |
| d | 6 | 7 | 8 |
frame['Hangzhou':'Najing']
| Hangzhou | Shenzhen | Nanjing |
|---|
frame.loc[:, 'Shenzhen':'Nanjing']
| Shenzhen | Nanjing | |
|---|---|---|
| a | 1 | 2 |
| c | 4 | 5 |
| d | 7 | 8 |
frame.loc[:'c', 'Hangzhou']
a 0
c 3
Name: Hangzhou, dtype: int32
DataFrame也可以用condition selection
frame[frame.Hangzhou > 1]
| Hangzhou | Shenzhen | Nanjing | |
|---|---|---|---|
| c | 3 | 4 | 5 |
| d | 6 | 7 | 8 |
frame[frame < 5] = 0
print(frame)
Hangzhou Shenzhen Nanjing
a 0 0 0
c 0 0 5
d 6 7 8
reindex
把一个Series或者DataFrame按照新的index顺序进行重排
obj = pd.Series([4.5, 7.2, -5.3, 3.2], index=['d', 'b', 'a', 'c'])
print(obj)
d 4.5
b 7.2
a -5.3
c 3.2
dtype: float64
obj.reindex(['a', 'b', 'c', 'd', 'e'])
a -5.3
b 7.2
c 3.2
d 4.5
e NaN
dtype: float64
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value = 0)
a -5.3
b 7.2
c 3.2
d 4.5
e 0.0
dtype: float64
obj3 = pd.Series(['blue', 'purple', 'yellow'], index = [0,2,4])
print(obj3)
0 blue
2 purple
4 yellow
dtype: object
obj3.reindex(range(6), method='ffill')
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
obj3.reindex(range(6), method='bfill')
0 blue
1 purple
2 purple
3 yellow
4 yellow
5 NaN
dtype: object
既然我们可以对Series进行reindex,相应地,我们也可以用同样的方法对DataFrame进行reindex。
frame = pd.DataFrame(np.arange(9).reshape(3,3),
index = ['a', 'c', 'd'],
columns = ['Hangzhou', 'Shenzhen', 'Nanjing'])
print(frame)
Hangzhou Shenzhen Nanjing
a 0 1 2
c 3 4 5
d 6 7 8
frame.reindex(['a' , 'b', 'c', 'd'])
| Hangzhou | Shenzhen | Nanjing | |
|---|---|---|---|
| a | 0.0 | 1.0 | 2.0 |
| b | NaN | NaN | NaN |
| c | 3.0 | 4.0 | 5.0 |
| d | 6.0 | 7.0 | 8.0 |
在reindex的同时,我们还可以重新指定columns
frame.reindex(columns = ['Shenzhen', 'Hangzhou', 'Chongqing'])
| Shenzhen | Hangzhou | Chongqing | |
|---|---|---|---|
| a | 1 | 0 | NaN |
| c | 4 | 3 | NaN |
| d | 7 | 6 | NaN |
下面介绍如何用drop来删除Series和DataFrame中的index
obj4 = obj3.drop(2)
print(obj4)
0 blue
4 yellow
dtype: object
obj3.drop([2, 4])
0 blue
dtype: object
frame
| Hangzhou | Shenzhen | Nanjing | |
|---|---|---|---|
| a | 0 | 1 | 2 |
| c | 3 | 4 | 5 |
| d | 6 | 7 | 8 |
frame.drop(['a', 'c'])
| Hangzhou | Shenzhen | Nanjing | |
|---|---|---|---|
| d | 6 | 7 | 8 |
frame.drop('Shenzhen', axis=1)
| Hangzhou | Nanjing | |
|---|---|---|
| a | 0 | 2 |
| c | 3 | 5 |
| d | 6 | 8 |
frame.drop(['Shenzhen', 'Hangzhou'], axis=1)
| Nanjing | |
|---|---|
| a | 2 |
| c | 5 |
| d | 8 |
drop不仅仅可以删除行,还可以删除列
hierarchical index
Series的hierarchical indexing
data = pd.Series(np.random.randn(10),
index=[['a','a','a','b','b','c','c','c','d','d'], \
[1,2,3,1,2,1,2,3,1,2]])
data
a 1 -0.445826
2 0.656269
3 -0.337780
b 1 0.867805
2 -0.438495
c 1 -0.311051
2 -0.217434
3 -1.547993
d 1 -1.152814
2 -0.516443
dtype: float64
data.index
MultiIndex([('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 2), ('c', 1), ('c', 2), ('c', 3), ('d', 1), ('d', 2)],
)
print(data["b"])
1 0.867805
2 -0.438495
dtype: float64
data['b':'d']
b 1 0.867805
2 -0.438495
c 1 -0.311051
2 -0.217434
3 -1.547993
d 1 -1.152814
2 -0.516443
dtype: float64
data[1:4]
a 2 0.656269
3 -0.337780
b 1 0.867805
dtype: float64
unstack和stack可以帮助我们在hierarchical indexing和DataFrame之间进行切换。
data.unstack()
| 1 | 2 | 3 | |
|---|---|---|---|
| a | -0.445826 | 0.656269 | -0.337780 |
| b | 0.867805 | -0.438495 | NaN |
| c | -0.311051 | -0.217434 | -1.547993 |
| d | -1.152814 | -0.516443 | NaN |
type(data.unstack())
pandas.core.frame.DataFrame
data.unstack().stack()
a 1 -0.445826
2 0.656269
3 -0.337780
b 1 0.867805
2 -0.438495
c 1 -0.311051
2 -0.217434
3 -1.547993
d 1 -1.152814
2 -0.516443
dtype: float64
DataFrame的hierarchical indexing
frame = pd.DataFrame(np.arange(12).reshape((4,3)),
index = [['a','a','b','b'], [1,2,1,2]],
columns = [['Beijing', 'Beijing', 'Shanghai'], ['apts', 'cars', 'apts']])
print(frame)
Beijing Shanghai
apts cars apts
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame.index.names = ['alpha', 'number']
frame.columns.names = ['city', 'type']
print(frame)
city Beijing Shanghai
type apts cars apts
alpha number
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame.loc['a', 1]
#print(type(frame.loc['a', 1]))
city type
Beijing apts 0
cars 1
Shanghai apts 2
Name: (a, 1), dtype: int32
frame.loc['a', 2]['Beijing']['apts']
3
一些建议
学习过程中碰到问题,直接自己写代码进行尝试。
学会使用google和stackoverflow解决技术问题
关于Merge, Join和Concatenate
concatenate
df1 = pd.DataFrame({'apts': [55000, 60000],
'cars': [200000, 300000],},
index = ['Shanghai', 'Beijing'])
df1
| apts | cars | |
|---|---|---|
| Shanghai | 55000 | 200000 |
| Beijing | 60000 | 300000 |
df2 = pd.DataFrame({'apts': [25000, 20000],
'cars': [150000, 120000],},
index = ['Hangzhou', 'Najing'])
print(df2)
apts cars
Hangzhou 25000 150000
Najing 20000 120000
df3 = pd.DataFrame({'apts': [30000, 10000],
'cars': [180000, 100000],},
index = ['Guangzhou', 'Chongqing'])
print(df3)
apts cars
Guangzhou 30000 180000
Chongqing 10000 100000
frames = [df1, df2, df3]
result = pd.concat(frames)
result
| apts | cars | |
|---|---|---|
| Shanghai | 55000 | 200000 |
| Beijing | 60000 | 300000 |
| Hangzhou | 25000 | 150000 |
| Najing | 20000 | 120000 |
| Guangzhou | 30000 | 180000 |
| Chongqing | 10000 | 100000 |
在concatenate的时候可以指定keys,这样可以给每一个部分加上一个Key。
以下的例子就构造了一个hierarchical index。
result2 = pd.concat(frames, keys=['x', 'y', 'z'])
print(result2)
apts cars
x Shanghai 55000 200000
Beijing 60000 300000
y Hangzhou 25000 150000
Najing 20000 120000
z Guangzhou 30000 180000
Chongqing 10000 100000
df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000]},
index = ['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin'])
print(df4)
salaries
Suzhou 10000
Beijing 30000
Shanghai 30000
Guangzhou 20000
Tianjin 15000
result3 = pd.concat([result, df4], axis=1)
result3
| apts | cars | salaries | |
|---|---|---|---|
| Shanghai | 55000.0 | 200000.0 | 30000.0 |
| Beijing | 60000.0 | 300000.0 | 30000.0 |
| Hangzhou | 25000.0 | 150000.0 | NaN |
| Najing | 20000.0 | 120000.0 | NaN |
| Guangzhou | 30000.0 | 180000.0 | 20000.0 |
| Chongqing | 10000.0 | 100000.0 | NaN |
| Suzhou | NaN | NaN | 10000.0 |
| Tianjin | NaN | NaN | 15000.0 |
复习一下前面讲过的stack
result3.stack()
Shanghai apts 55000.0
cars 200000.0
salaries 30000.0
Beijing apts 60000.0
cars 300000.0
salaries 30000.0
Hangzhou apts 25000.0
cars 150000.0
Najing apts 20000.0
cars 120000.0
Guangzhou apts 30000.0
cars 180000.0
salaries 20000.0
Chongqing apts 10000.0
cars 100000.0
Suzhou salaries 10000.0
Tianjin salaries 15000.0
dtype: float64
用inner可以去掉NaN
result3 = pd.concat([result, df4], axis=1, join='inner')
result3
| apts | cars | salaries | |
|---|---|---|---|
| Shanghai | 55000 | 200000 | 30000 |
| Beijing | 60000 | 300000 | 30000 |
| Guangzhou | 30000 | 180000 | 20000 |
用append来做concatenation
df1.append(df2)
C:\Users\zhangxf\AppData\Local\Temp\ipykernel_120184\3062608662.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
df1.append(df2)
| apts | cars | |
|---|---|---|
| Shanghai | 55000 | 200000 |
| Beijing | 60000 | 300000 |
| Hangzhou | 25000 | 150000 |
| Najing | 20000 | 120000 |
df1.append(df4)
C:\Users\zhangxf\AppData\Local\Temp\ipykernel_120184\940018064.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
df1.append(df4)
| apts | cars | salaries | |
|---|---|---|---|
| Shanghai | 55000.0 | 200000.0 | NaN |
| Beijing | 60000.0 | 300000.0 | NaN |
| Suzhou | NaN | NaN | 10000.0 |
| Beijing | NaN | NaN | 30000.0 |
| Shanghai | NaN | NaN | 30000.0 |
| Guangzhou | NaN | NaN | 20000.0 |
| Tianjin | NaN | NaN | 15000.0 |
Series和DataFrame还可以被一起concatenate,这时候Series会先被转成DataFrame然后做Join,因为Series本来就是一个只有一维的DataFrame对吧。
s1 = pd.Series([60, 50], index=['Shanghai', 'Beijing'], name='meal')
print(s1)
Shanghai 60
Beijing 50
Name: meal, dtype: int64
df1
| apts | cars | |
|---|---|---|
| Shanghai | 55000 | 200000 |
| Beijing | 60000 | 300000 |
pd.concat([df1, s1], axis=1)
| apts | cars | meal | |
|---|---|---|---|
| Shanghai | 55000 | 200000 | 60 |
| Beijing | 60000 | 300000 | 50 |
下面讲如何append一个row到DataFrame里。
s2 = pd.Series([18000, 120000], index=['apts', 'cars'], name='Xiamen')
s2
apts 18000
cars 120000
Name: Xiamen, dtype: int64
df1.append(s2)
C:\Users\zhangxf\AppData\Local\Temp\ipykernel_120184\3390175239.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
df1.append(s2)
| apts | cars | |
|---|---|---|
| Shanghai | 55000 | 200000 |
| Beijing | 60000 | 300000 |
| Xiamen | 18000 | 120000 |
Merge(Join)
df1 = pd.DataFrame({'apts': [55000, 60000, 58000],
'cars': [200000, 300000,250000],
'cities': ['Shanghai', 'Beijing','Shenzhen']})
print(df1)
apts cars cities
0 55000 200000 Shanghai
1 60000 300000 Beijing
2 58000 250000 Shenzhen
df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000],
'cities': ['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin']})
print(df4)
salaries cities
0 10000 Suzhou
1 30000 Beijing
2 30000 Shanghai
3 20000 Guangzhou
4 15000 Tianjin
result = pd.merge(df1, df4, on='cities')
result
| apts | cars | cities | salaries | |
|---|---|---|---|---|
| 0 | 55000 | 200000 | Shanghai | 30000 |
| 1 | 60000 | 300000 | Beijing | 30000 |
result2 = pd.merge(df1, df4, on='cities', how='outer')
result2
| apts | cars | cities | salaries | |
|---|---|---|---|---|
| 0 | 55000.0 | 200000.0 | Shanghai | 30000.0 |
| 1 | 60000.0 | 300000.0 | Beijing | 30000.0 |
| 2 | 58000.0 | 250000.0 | Shenzhen | NaN |
| 3 | NaN | NaN | Suzhou | 10000.0 |
| 4 | NaN | NaN | Guangzhou | 20000.0 |
| 5 | NaN | NaN | Tianjin | 15000.0 |
join on index
df1 = pd.DataFrame({'apts': [55000, 60000, 58000],
'cars': [200000, 300000,250000]},
index=['Shanghai', 'Beijing','Shenzhen'])
print(df1)
apts cars
Shanghai 55000 200000
Beijing 60000 300000
Shenzhen 58000 250000
df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000]},
index=['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin'])
print(df4)
salaries
Suzhou 10000
Beijing 30000
Shanghai 30000
Guangzhou 20000
Tianjin 15000
df1.join(df4)
| apts | cars | salaries | |
|---|---|---|---|
| Shanghai | 55000 | 200000 | 30000.0 |
| Beijing | 60000 | 300000 | 30000.0 |
| Shenzhen | 58000 | 250000 | NaN |
df4.join(df1)
| salaries | apts | cars | |
|---|---|---|---|
| Suzhou | 10000 | NaN | NaN |
| Beijing | 30000 | 60000.0 | 300000.0 |
| Shanghai | 30000 | 55000.0 | 200000.0 |
| Guangzhou | 20000 | NaN | NaN |
| Tianjin | 15000 | NaN | NaN |
df1.join(df4, how='outer')
| apts | cars | salaries | |
|---|---|---|---|
| Beijing | 60000.0 | 300000.0 | 30000.0 |
| Guangzhou | NaN | NaN | 20000.0 |
| Shanghai | 55000.0 | 200000.0 | 30000.0 |
| Shenzhen | 58000.0 | 250000.0 | NaN |
| Suzhou | NaN | NaN | 10000.0 |
| Tianjin | NaN | NaN | 15000.0 |
也可以用merge来写
pd.merge(df1, df4, left_index=True, right_index=True, how='outer')
| apts | cars | salaries | |
|---|---|---|---|
| Beijing | 60000.0 | 300000.0 | 30000.0 |
| Guangzhou | NaN | NaN | 20000.0 |
| Shanghai | 55000.0 | 200000.0 | 30000.0 |
| Shenzhen | 58000.0 | 250000.0 | NaN |
| Suzhou | NaN | NaN | 10000.0 |
| Tianjin | NaN | NaN | 15000.0 |
Group By
举个栗子,假设我们现在有一张公司每个员工的收入流水。
import pandas as pd
salaries = pd.DataFrame({
'Name': ['July', 'Chu', 'Chu', 'Lin', 'July', 'July', 'Chu', 'July'],
'Year': [2016,2016,2016,2016,2017,2017,2017,2017],
'Salary': [10000,2000,4000,5000,18000,25000,3000,4000],
'Bonus': [3000,1000,1000,1200,4000,2300,500,1000]
})
print(salaries)
Name Year Salary Bonus
0 July 2016 10000 3000
1 Chu 2016 2000 1000
2 Chu 2016 4000 1000
3 Lin 2016 5000 1200
4 July 2017 18000 4000
5 July 2017 25000 2300
6 Chu 2017 3000 500
7 July 2017 4000 1000
接下来我给大家演示一下什么叫做Group By
group_by_name = salaries.groupby('Name')
group_by_name
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001A95711B1C0>
groupby经常和aggregate一起使用
group_by_name.aggregate(sum)
| Year | Salary | Bonus | |
|---|---|---|---|
| Name | |||
| Chu | 6049 | 9000 | 2500 |
| July | 8067 | 57000 | 10300 |
| Lin | 2016 | 5000 | 1200 |
group_by_name.sum()
| Year | Salary | Bonus | |
|---|---|---|---|
| Name | |||
| Chu | 6049 | 9000 | 2500 |
| July | 8067 | 57000 | 10300 |
| Lin | 2016 | 5000 | 1200 |
也可以这么写
group_by_name_year = salaries.groupby(['Name', 'Year'])
group_by_name_year.sum()
| Salary | Bonus | ||
|---|---|---|---|
| Name | Year | ||
| Chu | 2016 | 6000 | 2000 |
| 2017 | 3000 | 500 | |
| July | 2016 | 10000 | 3000 |
| 2017 | 47000 | 7300 | |
| Lin | 2016 | 5000 | 1200 |
group_by_name_year.size()
Name Year
Chu 2016 2
2017 1
July 2016 1
2017 3
Lin 2016 1
dtype: int64
group_by_name_year.max() # min()
| Salary | Bonus | ||
|---|---|---|---|
| Name | Year | ||
| Chu | 2016 | 4000 | 1000 |
| 2017 | 3000 | 500 | |
| July | 2016 | 10000 | 3000 |
| 2017 | 25000 | 4000 | |
| Lin | 2016 | 5000 | 1200 |
describe这个function可以为我们展示各种有用的统计信息
# group_by_name_year.describe()
READ FROM CSV
我们先从CSV文件中读取一些数据。
bike.csv记录了Montreal自行车路线的数据,具体有7条路线,数据记录了每条自行车路线每天分别有多少人。
bikes = pd.read_csv('bikes.csv', encoding='latin1')
# bikes
bikes = pd.read_csv('bikes.csv', encoding='latin1', sep=';',
parse_dates=['Date'], dayfirst=True, index_col='Date')
bikes.head()
| Berri 1 | Brébeuf (données non disponibles) | Côte-Sainte-Catherine | Maisonneuve 1 | Maisonneuve 2 | du Parc | Pierre-Dupuy | Rachel1 | St-Urbain (données non disponibles) | |
|---|---|---|---|---|---|---|---|---|---|
| Date | |||||||||
| 2012-01-01 | 35 | NaN | 0 | 38 | 51 | 26 | 10 | 16 | NaN |
| 2012-01-02 | 83 | NaN | 1 | 68 | 153 | 53 | 6 | 43 | NaN |
| 2012-01-03 | 135 | NaN | 2 | 104 | 248 | 89 | 3 | 58 | NaN |
| 2012-01-04 | 144 | NaN | 1 | 116 | 318 | 111 | 8 | 61 | NaN |
| 2012-01-05 | 197 | NaN | 2 | 124 | 330 | 97 | 13 | 95 | NaN |
更多关于read_csv的说明请查阅documentation
dropna会删除所有带NA的行
bikes.dropna()
| Berri 1 | Brébeuf (données non disponibles) | Côte-Sainte-Catherine | Maisonneuve 1 | Maisonneuve 2 | du Parc | Pierre-Dupuy | Rachel1 | St-Urbain (données non disponibles) | |
|---|---|---|---|---|---|---|---|---|---|
| Date |
bikes.dropna(how='all').head()
| Berri 1 | Brébeuf (données non disponibles) | Côte-Sainte-Catherine | Maisonneuve 1 | Maisonneuve 2 | du Parc | Pierre-Dupuy | Rachel1 | St-Urbain (données non disponibles) | |
|---|---|---|---|---|---|---|---|---|---|
| Date | |||||||||
| 2012-01-01 | 35 | NaN | 0 | 38 | 51 | 26 | 10 | 16 | NaN |
| 2012-01-02 | 83 | NaN | 1 | 68 | 153 | 53 | 6 | 43 | NaN |
| 2012-01-03 | 135 | NaN | 2 | 104 | 248 | 89 | 3 | 58 | NaN |
| 2012-01-04 | 144 | NaN | 1 | 116 | 318 | 111 | 8 | 61 | NaN |
| 2012-01-05 | 197 | NaN | 2 | 124 | 330 | 97 | 13 | 95 | NaN |
bikes.dropna(how='all', axis=1).head()
| Berri 1 | Côte-Sainte-Catherine | Maisonneuve 1 | Maisonneuve 2 | du Parc | Pierre-Dupuy | Rachel1 | |
|---|---|---|---|---|---|---|---|
| Date | |||||||
| 2012-01-01 | 35 | 0 | 38 | 51 | 26 | 10 | 16 |
| 2012-01-02 | 83 | 1 | 68 | 153 | 53 | 6 | 43 |
| 2012-01-03 | 135 | 2 | 104 | 248 | 89 | 3 | 58 |
| 2012-01-04 | 144 | 1 | 116 | 318 | 111 | 8 | 61 |
| 2012-01-05 | 197 | 2 | 124 | 330 | 97 | 13 | 95 |
berri_bikes = bikes[['Berri 1']].copy()
berri_bikes.head()
| Berri 1 | |
|---|---|
| Date | |
| 2012-01-01 | 35 |
| 2012-01-02 | 83 |
| 2012-01-03 | 135 |
| 2012-01-04 | 144 |
| 2012-01-05 | 197 |
berri_bikes.index.weekday
Int64Index([6, 0, 1, 2, 3, 4, 5, 6, 0, 1,
...
5, 6, 0, 1, 2, 3, 4, 5, 6, 0],
dtype='int64', name='Date', length=310)
berri_bikes.loc[:, 'weekday'] = berri_bikes.index.weekday
berri_bikes.head()
| Berri 1 | weekday | |
|---|---|---|
| Date | ||
| 2012-01-01 | 35 | 6 |
| 2012-01-02 | 83 | 0 |
| 2012-01-03 | 135 | 1 |
| 2012-01-04 | 144 | 2 |
| 2012-01-05 | 197 | 3 |
weekday_counts = berri_bikes.groupby('weekday').sum()
weekday_counts
| Berri 1 | |
|---|---|
| weekday | |
| 0 | 134298 |
| 1 | 135305 |
| 2 | 152972 |
| 3 | 160131 |
| 4 | 141771 |
| 5 | 101578 |
| 6 | 99310 |
下面给大家介绍如何填充缺失的数据
有了weekday信息之后,我们就可以用上我们前面学过的.groupyby把骑车人数按照weekday分类,然后用aggregate算出每个工作日的骑车人数之和。
bikes.head()
| Berri 1 | Brébeuf (données non disponibles) | Côte-Sainte-Catherine | Maisonneuve 1 | Maisonneuve 2 | du Parc | Pierre-Dupuy | Rachel1 | St-Urbain (données non disponibles) | |
|---|---|---|---|---|---|---|---|---|---|
| Date | |||||||||
| 2012-01-01 | 35 | NaN | 0 | 38 | 51 | 26 | 10 | 16 | NaN |
| 2012-01-02 | 83 | NaN | 1 | 68 | 153 | 53 | 6 | 43 | NaN |
| 2012-01-03 | 135 | NaN | 2 | 104 | 248 | 89 | 3 | 58 | NaN |
| 2012-01-04 | 144 | NaN | 1 | 116 | 318 | 111 | 8 | 61 | NaN |
| 2012-01-05 | 197 | NaN | 2 | 124 | 330 | 97 | 13 | 95 | NaN |
bikes = bikes.dropna(axis=1, how='all')
bikes.head()
| Berri 1 | Côte-Sainte-Catherine | Maisonneuve 1 | Maisonneuve 2 | du Parc | Pierre-Dupuy | Rachel1 | |
|---|---|---|---|---|---|---|---|
| Date | |||||||
| 2012-01-01 | 35 | 0 | 38 | 51 | 26 | 10 | 16 |
| 2012-01-02 | 83 | 1 | 68 | 153 | 53 | 6 | 43 |
| 2012-01-03 | 135 | 2 | 104 | 248 | 89 | 3 | 58 |
| 2012-01-04 | 144 | 1 | 116 | 318 | 111 | 8 | 61 |
| 2012-01-05 | 197 | 2 | 124 | 330 | 97 | 13 | 95 |
bikes_sum = bikes.sum(axis=1).to_frame()
bikes_sum.head()
| 0 | |
|---|---|
| Date | |
| 2012-01-01 | 176 |
| 2012-01-02 | 407 |
| 2012-01-03 | 639 |
| 2012-01-04 | 759 |
| 2012-01-05 | 858 |
bikes_sum.loc[:, 'weekday'] = bikes_sum.index.weekday
bikes_sum.head()
| 0 | weekday | |
|---|---|---|
| Date | ||
| 2012-01-01 | 176 | 6 |
| 2012-01-02 | 407 | 0 |
| 2012-01-03 | 639 | 1 |
| 2012-01-04 | 759 | 2 |
| 2012-01-05 | 858 | 3 |
接下来我们试试能不能把每条路线都加起来,然后算出一天骑自行车出门的人数之和。
weekday_counts = bikes_sum.groupby('weekday').aggregate(sum)
weekday_counts.index = ['Monday', 'Tuesday',
'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekday_counts
| 0 | |
|---|---|
| Monday | 714963 |
| Tuesday | 698582 |
| Wednesday | 789722 |
| Thursday | 829069 |
| Friday | 738772 |
| Saturday | 516701 |
| Sunday | 518047 |