Pandas用法最全整理(上)

211 阅读9分钟

前言

Pandas 是一个Python语言实现的,开源,易于使用的数据架构以及数据分析工具。在Pandas中主要有两种数据类型,可以简单的理解为:

  • Series:一维数组(列表)
  • DateFrame:二维数组(矩阵)

在线实验:Pandas完全指南.ipynb

学习资料:

导入pandas

import pandas as pd
import numpy as np
from IPython.display import Image

创建列表

创建普通列表

s = pd.Series([1, 3, 6, np.nan, 23, 3]) # type(s) === 'pandas.core.series.Series'

创建时间列表

dates = pd.date_range('20200101', periods=6)

创建矩阵

根据列表(Series)创建矩阵

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=['a', 'b', 'c', 'd'])
df
abcd
2020-01-01-1.3657741.1698990.607591-2.029687
2020-01-02-0.967683-0.8004480.1236730.700337
2020-01-031.7906090.5606660.3440510.799520
2020-01-042.0686630.320610-1.6606310.416631
2020-01-05-0.956351-0.6570501.241433-0.652496
2020-01-06-1.1358701.8880930.5333640.080852
df2 = pd.DataFrame({
  'a':pd.Series([1, 2, 3, 4]),
  'b':pd.Timestamp('20180708'),
  'c':pd.Categorical(['cate1', 'cate2', 'cate3', 'cate4'])
})
df2
abc
012018-07-08cate1
122018-07-08cate2
232018-07-08cate3
342018-07-08cate4

根据字典创建矩阵

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy', 'Jack', 'Tim'], 
        'age': [20, 32, 36, 24, 23, 18, 27], 
        'gender': np.random.choice(['M','F'],size=7),
        'score': [25, 94, 57, 62, 70, 88, 67],
        'country': np.random.choice(['US','CN'],size=7),
        }
df3 = pd.DataFrame(data, columns = ['name', 'age', 'gender', 'score', 'country'])
df3
nameagegenderscorecountry
0Jason20F25US
1Molly32F94US
2Tina36F57US
3Jake24M62CN
4Amy23F70US
5Jack18M88CN
6Tim27F67CN

矩阵属性、检视数据

行数列数

df.shape
(6, 4)

索引

df.index
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06'],
              dtype='datetime64[ns]', freq='D')

列名

df.columns
Index(['a', 'b', 'c', 'd'], dtype='object')

df.values
array([[-1.36577441,  1.16989918,  0.60759059, -2.02968684],
       [-0.96768326, -0.80044798,  0.12367311,  0.70033731],
       [ 1.79060939,  0.56066552,  0.34405077,  0.79952019],
       [ 2.06866329,  0.32060998, -1.6606308 ,  0.41663058],
       [-0.95635134, -0.65704975,  1.24143335, -0.65249624],
       [-1.1358703 ,  1.88809265,  0.53336403,  0.08085195]])

矩阵信息

df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2020-01-01 to 2020-01-06
Freq: D
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       6 non-null      float64
 1   b       6 non-null      float64
 2   c       6 non-null      float64
 3   d       6 non-null      float64
dtypes: float64(4)
memory usage: 240.0 bytes

矩阵描述信息

df.describe()
abcd
count6.0000006.0000006.0000006.000000
mean-0.0944010.4136280.198247-0.114141
std1.5772601.0389030.9849211.074899
min-1.365774-0.800448-1.660631-2.029687
25%-1.093824-0.4126350.178768-0.469159
50%-0.9620170.4406380.4387070.248741
75%1.1038691.0175910.5890340.629411
max2.0686631.8880931.2414330.799520
### 更改索引
df.index = pd.date_range('2020/06/01', periods=df.shape[0])

df
abcd
2020-06-01-1.3657741.1698990.607591-2.029687
2020-06-02-0.967683-0.8004480.1236730.700337
2020-06-031.7906090.5606660.3440510.799520
2020-06-042.0686630.320610-1.6606310.416631
2020-06-05-0.956351-0.6570501.241433-0.652496
2020-06-06-1.1358701.8880930.5333640.080852

top5 数据

df.head(1)
abcd
2020-06-01-1.3657741.1698990.607591-2.029687

tail5 数据

df.tail(5)
abcd
2020-06-02-0.967683-0.8004480.1236730.700337
2020-06-031.7906090.5606660.3440510.799520
2020-06-042.0686630.320610-1.6606310.416631
2020-06-05-0.956351-0.6570501.241433-0.652496
2020-06-06-1.1358701.8880930.5333640.080852

某一列值统计

df['a'].value_counts(dropna=False)
 1.790609    1
-1.135870    1
 2.068663    1
-0.967683    1
-1.365774    1
-0.956351    1
Name: a, dtype: int64

查看每一列唯一值统计

df.apply(pd.Series.value_counts)
abcd
-2.029687NaNNaNNaN1.0
-1.660631NaNNaN1.0NaN
-1.3657741.0NaNNaNNaN
-1.1358701.0NaNNaNNaN
-0.9676831.0NaNNaNNaN
-0.9563511.0NaNNaNNaN
-0.800448NaN1.0NaNNaN
-0.657050NaN1.0NaNNaN
-0.652496NaNNaNNaN1.0
0.080852NaNNaNNaN1.0
0.123673NaNNaN1.0NaN
0.320610NaN1.0NaNNaN
0.344051NaNNaN1.0NaN
0.416631NaNNaNNaN1.0
0.533364NaNNaN1.0NaN
0.560666NaN1.0NaNNaN
0.607591NaNNaN1.0NaN
0.700337NaNNaNNaN1.0
0.799520NaNNaNNaN1.0
1.169899NaN1.0NaNNaN
1.241433NaNNaN1.0NaN
1.7906091.0NaNNaNNaN
1.888093NaN1.0NaNNaN
2.0686631.0NaNNaNNaN

排序

根据索引(index)排序

# sort_index(axis=, ascending=)
# axis:0-行排序,1-列排序; ascending:True-升序,False-降序
df.sort_index(axis=0, ascending=False)
abcd
2020-06-06-1.1358701.8880930.5333640.080852
2020-06-05-0.956351-0.6570501.241433-0.652496
2020-06-042.0686630.320610-1.6606310.416631
2020-06-031.7906090.5606660.3440510.799520
2020-06-02-0.967683-0.8004480.1236730.700337
2020-06-01-1.3657741.1698990.607591-2.029687
df.sort_index(axis=1, ascending=False)
dcba
2020-06-01-2.0296870.6075911.169899-1.365774
2020-06-020.7003370.123673-0.800448-0.967683
2020-06-030.7995200.3440510.5606661.790609
2020-06-040.416631-1.6606310.3206102.068663
2020-06-05-0.6524961.241433-0.657050-0.956351
2020-06-060.0808520.5333641.888093-1.135870

根据值排序

df.sort_values(by='a', ascending=False)
abcd
2020-06-042.0686630.320610-1.6606310.416631
2020-06-031.7906090.5606660.3440510.799520
2020-06-05-0.956351-0.6570501.241433-0.652496
2020-06-02-0.967683-0.8004480.1236730.700337
2020-06-06-1.1358701.8880930.5333640.080852
2020-06-01-1.3657741.1698990.607591-2.029687
df.sort_values(by=['a','b'], ascending=True)
abcd
2020-06-01-1.3657741.1698990.607591-2.029687
2020-06-06-1.1358701.8880930.5333640.080852
2020-06-02-0.967683-0.8004480.1236730.700337
2020-06-05-0.956351-0.6570501.241433-0.652496
2020-06-031.7906090.5606660.3440510.799520
2020-06-042.0686630.320610-1.6606310.416631

选取数据

选取某一列

df['a'] # 等效于df.a
2020-06-01   -1.365774
2020-06-02   -0.967683
2020-06-03    1.790609
2020-06-04    2.068663
2020-06-05   -0.956351
2020-06-06   -1.135870
Freq: D, Name: a, dtype: float64

根据索引选取某几行数据

df['2020-06-01':'2020-06-02'] # 选取索引以2020-06-01开始,到2020-06-02结束的数据
abcd
2020-06-01-1.3657741.1698990.607591-2.029687
2020-06-02-0.967683-0.8004480.1236730.700337

根据列名选择某几列数据

df[['c', 'b']]
cb
2020-06-010.6075911.169899
2020-06-020.123673-0.800448
2020-06-030.3440510.560666
2020-06-04-1.6606310.320610
2020-06-051.241433-0.657050
2020-06-060.5333641.888093

根据索引和列名选择数据

loc[行名选择, 列名选择],未指定行名或列名,或者指定为:则表示选择当前所有行,或列

df.loc['2020-06-01']
a   -1.365774
b    1.169899
c    0.607591
d   -2.029687
Name: 2020-06-01 00:00:00, dtype: float64
df.loc['2020-06-01', 'b']
1.1698991845802456
df.loc[:, 'b'] # type(df.loc[:, 'b']) === 'pandas.core.series.Series',而type(df.loc[:, ['b']]) === ’pandas.core.frame.DataFrame‘
2020-06-01    1.169899
2020-06-02   -0.800448
2020-06-03    0.560666
2020-06-04    0.320610
2020-06-05   -0.657050
2020-06-06    1.888093
Freq: D, Name: b, dtype: float64
df.loc[:, ['a', 'b']]
ab
2020-06-01-1.3657741.169899
2020-06-02-0.967683-0.800448
2020-06-031.7906090.560666
2020-06-042.0686630.320610
2020-06-05-0.956351-0.657050
2020-06-06-1.1358701.888093

根据行索引和列索引取数据

df.iloc[0,0] # === df.loc['2020-06-01', 'a']
-1.3657744117360429
df.iloc[0, :] # ==== df.loc['2020-06-01', :]
a   -1.365774
b    1.169899
c    0.607591
d   -2.029687
Name: 2020-06-01 00:00:00, dtype: float64

根据布尔表达式表达式取数据

只有当布尔表达式为真时的数据才会被选择

df[df.a > 1]
abcd
2020-06-031.7906090.5606660.3440510.799520
2020-06-042.0686630.320610-1.6606310.416631
df[(df['a'] > 1) & (df['d'] <0)]
abcd

添加/删除列、更新、替换数据

设置某矩阵项值

df.loc['2020-06-01', 'a'] = np.nan
df.loc['2020-06-06', 'c'] = np.nan
df
abcd
2020-06-01NaN1.1698990.607591-2.029687
2020-06-02-0.967683-0.8004480.1236730.700337
2020-06-031.7906090.5606660.3440510.799520
2020-06-042.0686630.320610-1.6606310.416631
2020-06-05-0.956351-0.6570501.241433-0.652496
2020-06-06-1.1358701.888093NaN0.080852

根据条件创建新列

df['e'] = np.where((df['a'] > 1) & (df['d']<0), 1, 0)
df
abcde
2020-06-01NaN1.1698990.607591-2.0296870
2020-06-02-0.967683-0.8004480.1236730.7003370
2020-06-031.7906090.5606660.3440510.7995200
2020-06-042.0686630.320610-1.6606310.4166310
2020-06-05-0.956351-0.6570501.241433-0.6524960
2020-06-06-1.1358701.888093NaN0.0808520

根据已有列创建新列

tmp = df.copy()
df.loc[:,'f'] = tmp.apply(lambda row: row['b']+ row['d'], axis=1)
df
abcdef
2020-06-01NaN1.1698990.607591-2.0296870-0.859788
2020-06-02-0.967683-0.8004480.1236730.7003370-0.100111
2020-06-031.7906090.5606660.3440510.79952001.360186
2020-06-042.0686630.320610-1.6606310.41663100.737241
2020-06-05-0.956351-0.6570501.241433-0.6524960-1.309546
2020-06-06-1.1358701.888093NaN0.08085201.968945

替换数据

# 将所有等于1的值替换成20
df.replace(1,20)
abcdef
2020-06-01NaN1.1698990.607591-2.0296870-0.859788
2020-06-02-0.967683-0.8004480.1236730.7003370-0.100111
2020-06-031.7906090.5606660.3440510.79952001.360186
2020-06-042.0686630.320610-1.6606310.41663100.737241
2020-06-05-0.956351-0.6570501.241433-0.6524960-1.309546
2020-06-06-1.1358701.888093NaN0.08085201.968945
# 使用one替换1,three替换3
df.replace([1,3],['one','three'])
abcdef
2020-06-01NaN1.1698990.607591-2.0296870-0.859788
2020-06-02-0.967683-0.8004480.1236730.7003370-0.100111
2020-06-031.7906090.5606660.3440510.79952001.360186
2020-06-042.0686630.320610-1.6606310.41663100.737241
2020-06-05-0.956351-0.6570501.241433-0.6524960-1.309546
2020-06-06-1.1358701.888093NaN0.08085201.968945

列名重命名

df.rename(columns={'c':'cc'})
abccdef
2020-06-01NaN1.1698990.607591-2.0296870-0.859788
2020-06-02-0.967683-0.8004480.1236730.7003370-0.100111
2020-06-031.7906090.5606660.3440510.79952001.360186
2020-06-042.0686630.320610-1.6606310.41663100.737241
2020-06-05-0.956351-0.6570501.241433-0.6524960-1.309546
2020-06-06-1.1358701.888093NaN0.08085201.968945

重设索引

# 将a设置为索引
df.set_index('a')
bcdef
a
NaN1.1698990.607591-2.0296870-0.859788
-0.967683-0.8004480.1236730.7003370-0.100111
1.7906090.5606660.3440510.79952001.360186
2.0686630.320610-1.6606310.41663100.737241
-0.956351-0.6570501.241433-0.6524960-1.309546
-1.1358701.888093NaN0.08085201.968945

删除列

df.drop(columns=['a', 'f'])
bcde
2020-06-011.1698990.607591-2.0296870
2020-06-02-0.8004480.1236730.7003370
2020-06-030.5606660.3440510.7995200
2020-06-040.320610-1.6606310.4166310
2020-06-05-0.6570501.241433-0.6524960
2020-06-061.888093NaN0.0808520

处理Nan数据

检查是否Nan值

df.isnull()
abcdef
2020-06-01TrueFalseFalseFalseFalseFalse
2020-06-02FalseFalseFalseFalseFalseFalse
2020-06-03FalseFalseFalseFalseFalseFalse
2020-06-04FalseFalseFalseFalseFalseFalse
2020-06-05FalseFalseFalseFalseFalseFalse
2020-06-06FalseFalseTrueFalseFalseFalse
df.notnull() # df.isnull()反操作
abcdef
2020-06-01FalseTrueTrueTrueTrueTrue
2020-06-02TrueTrueTrueTrueTrueTrue
2020-06-03TrueTrueTrueTrueTrueTrue
2020-06-04TrueTrueTrueTrueTrueTrue
2020-06-05TrueTrueTrueTrueTrueTrue
2020-06-06TrueTrueFalseTrueTrueTrue

删除掉包含null值的行

### dropna(axis=, how=):丢弃NaN数据,
# axis:0-按行丢弃),1-按列丢弃; how:'any'-只要含有NaN数据就丢弃,'all'-所有数据都为NaN时丢弃

df.dropna(axis=0)
abcdef
2020-06-02-0.967683-0.8004480.1236730.7003370-0.100111
2020-06-031.7906090.5606660.3440510.79952001.360186
2020-06-042.0686630.320610-1.6606310.41663100.737241
2020-06-05-0.956351-0.6570501.241433-0.6524960-1.309546

替换Nan

#### 使用1000替换Nan
df.fillna(1000)
abcdef
2020-06-011000.0000001.1698990.607591-2.0296870-0.859788
2020-06-02-0.967683-0.8004480.1236730.7003370-0.100111
2020-06-031.7906090.5606660.3440510.79952001.360186
2020-06-042.0686630.320610-1.6606310.41663100.737241
2020-06-05-0.956351-0.6570501.241433-0.6524960-1.309546
2020-06-06-1.1358701.8880931000.0000000.08085201.968945
# 使用平均值替换所有null值
df.fillna(df.mean())
abcdef
2020-06-010.1598741.1698990.607591-2.0296870-0.859788
2020-06-02-0.967683-0.8004480.1236730.7003370-0.100111
2020-06-031.7906090.5606660.3440510.79952001.360186
2020-06-042.0686630.320610-1.6606310.41663100.737241
2020-06-05-0.956351-0.6570501.241433-0.6524960-1.309546
2020-06-06-1.1358701.8880930.1312230.08085201.968945