前言
Pandas 是一个Python语言实现的,开源,易于使用的数据架构以及数据分析工具。在Pandas中主要有两种数据类型,可以简单的理解为:
- Series:一维数组(列表)
- DateFrame:二维数组(矩阵)
在线实验:Pandas完全指南.ipynb
学习资料:
导入pandas
import pandas as pd
import numpy as np
from IPython.display import Image
创建列表
创建普通列表
s = pd.Series([1, 3, 6, np.nan, 23, 3])
创建时间列表
dates = pd.date_range('20200101', periods=6)
创建矩阵
根据列表(Series)创建矩阵
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=['a', 'b', 'c', 'd'])
df
| a | b | c | d |
---|
2020-01-01 | -1.365774 | 1.169899 | 0.607591 | -2.029687 |
---|
2020-01-02 | -0.967683 | -0.800448 | 0.123673 | 0.700337 |
---|
2020-01-03 | 1.790609 | 0.560666 | 0.344051 | 0.799520 |
---|
2020-01-04 | 2.068663 | 0.320610 | -1.660631 | 0.416631 |
---|
2020-01-05 | -0.956351 | -0.657050 | 1.241433 | -0.652496 |
---|
2020-01-06 | -1.135870 | 1.888093 | 0.533364 | 0.080852 |
---|
df2 = pd.DataFrame({
'a':pd.Series([1, 2, 3, 4]),
'b':pd.Timestamp('20180708'),
'c':pd.Categorical(['cate1', 'cate2', 'cate3', 'cate4'])
})
df2
| a | b | c |
---|
0 | 1 | 2018-07-08 | cate1 |
---|
1 | 2 | 2018-07-08 | cate2 |
---|
2 | 3 | 2018-07-08 | cate3 |
---|
3 | 4 | 2018-07-08 | cate4 |
---|
根据字典创建矩阵
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy', 'Jack', 'Tim'],
'age': [20, 32, 36, 24, 23, 18, 27],
'gender': np.random.choice(['M','F'],size=7),
'score': [25, 94, 57, 62, 70, 88, 67],
'country': np.random.choice(['US','CN'],size=7),
}
df3 = pd.DataFrame(data, columns = ['name', 'age', 'gender', 'score', 'country'])
df3
| name | age | gender | score | country |
---|
0 | Jason | 20 | F | 25 | US |
---|
1 | Molly | 32 | F | 94 | US |
---|
2 | Tina | 36 | F | 57 | US |
---|
3 | Jake | 24 | M | 62 | CN |
---|
4 | Amy | 23 | F | 70 | US |
---|
5 | Jack | 18 | M | 88 | CN |
---|
6 | Tim | 27 | F | 67 | CN |
---|
矩阵属性、检视数据
行数列数
df.shape
(6, 4)
索引
df.index
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
'2020-01-05', '2020-01-06'],
dtype='datetime64[ns]', freq='D')
列名
df.columns
Index(['a', 'b', 'c', 'd'], dtype='object')
值
df.values
array([[-1.36577441, 1.16989918, 0.60759059, -2.02968684],
[-0.96768326, -0.80044798, 0.12367311, 0.70033731],
[ 1.79060939, 0.56066552, 0.34405077, 0.79952019],
[ 2.06866329, 0.32060998, -1.6606308 , 0.41663058],
[-0.95635134, -0.65704975, 1.24143335, -0.65249624],
[-1.1358703 , 1.88809265, 0.53336403, 0.08085195]])
矩阵信息
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2020-01-01 to 2020-01-06
Freq: D
Data columns (total 4 columns):
# Column Non-Null Count Dtype
0 a 6 non-null float64
1 b 6 non-null float64
2 c 6 non-null float64
3 d 6 non-null float64
dtypes: float64(4)
memory usage: 240.0 bytes
矩阵描述信息
df.describe()
| a | b | c | d |
---|
count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
---|
mean | -0.094401 | 0.413628 | 0.198247 | -0.114141 |
---|
std | 1.577260 | 1.038903 | 0.984921 | 1.074899 |
---|
min | -1.365774 | -0.800448 | -1.660631 | -2.029687 |
---|
25% | -1.093824 | -0.412635 | 0.178768 | -0.469159 |
---|
50% | -0.962017 | 0.440638 | 0.438707 | 0.248741 |
---|
75% | 1.103869 | 1.017591 | 0.589034 | 0.629411 |
---|
max | 2.068663 | 1.888093 | 1.241433 | 0.799520 |
---|
df.index = pd.date_range('2020/06/01', periods=df.shape[0])
df
| a | b | c | d |
---|
2020-06-01 | -1.365774 | 1.169899 | 0.607591 | -2.029687 |
---|
2020-06-02 | -0.967683 | -0.800448 | 0.123673 | 0.700337 |
---|
2020-06-03 | 1.790609 | 0.560666 | 0.344051 | 0.799520 |
---|
2020-06-04 | 2.068663 | 0.320610 | -1.660631 | 0.416631 |
---|
2020-06-05 | -0.956351 | -0.657050 | 1.241433 | -0.652496 |
---|
2020-06-06 | -1.135870 | 1.888093 | 0.533364 | 0.080852 |
---|
top5 数据
df.head(1)
| a | b | c | d |
---|
2020-06-01 | -1.365774 | 1.169899 | 0.607591 | -2.029687 |
---|
tail5 数据
df.tail(5)
| a | b | c | d |
---|
2020-06-02 | -0.967683 | -0.800448 | 0.123673 | 0.700337 |
---|
2020-06-03 | 1.790609 | 0.560666 | 0.344051 | 0.799520 |
---|
2020-06-04 | 2.068663 | 0.320610 | -1.660631 | 0.416631 |
---|
2020-06-05 | -0.956351 | -0.657050 | 1.241433 | -0.652496 |
---|
2020-06-06 | -1.135870 | 1.888093 | 0.533364 | 0.080852 |
---|
某一列值统计
df['a'].value_counts(dropna=False)
1.790609 1
-1.135870 1
2.068663 1
-0.967683 1
-1.365774 1
-0.956351 1
Name: a, dtype: int64
查看每一列唯一值统计
df.apply(pd.Series.value_counts)
| a | b | c | d |
---|
-2.029687 | NaN | NaN | NaN | 1.0 |
---|
-1.660631 | NaN | NaN | 1.0 | NaN |
---|
-1.365774 | 1.0 | NaN | NaN | NaN |
---|
-1.135870 | 1.0 | NaN | NaN | NaN |
---|
-0.967683 | 1.0 | NaN | NaN | NaN |
---|
-0.956351 | 1.0 | NaN | NaN | NaN |
---|
-0.800448 | NaN | 1.0 | NaN | NaN |
---|
-0.657050 | NaN | 1.0 | NaN | NaN |
---|
-0.652496 | NaN | NaN | NaN | 1.0 |
---|
0.080852 | NaN | NaN | NaN | 1.0 |
---|
0.123673 | NaN | NaN | 1.0 | NaN |
---|
0.320610 | NaN | 1.0 | NaN | NaN |
---|
0.344051 | NaN | NaN | 1.0 | NaN |
---|
0.416631 | NaN | NaN | NaN | 1.0 |
---|
0.533364 | NaN | NaN | 1.0 | NaN |
---|
0.560666 | NaN | 1.0 | NaN | NaN |
---|
0.607591 | NaN | NaN | 1.0 | NaN |
---|
0.700337 | NaN | NaN | NaN | 1.0 |
---|
0.799520 | NaN | NaN | NaN | 1.0 |
---|
1.169899 | NaN | 1.0 | NaN | NaN |
---|
1.241433 | NaN | NaN | 1.0 | NaN |
---|
1.790609 | 1.0 | NaN | NaN | NaN |
---|
1.888093 | NaN | 1.0 | NaN | NaN |
---|
2.068663 | 1.0 | NaN | NaN | NaN |
---|
排序
根据索引(index)排序
df.sort_index(axis=0, ascending=False)
| a | b | c | d |
---|
2020-06-06 | -1.135870 | 1.888093 | 0.533364 | 0.080852 |
---|
2020-06-05 | -0.956351 | -0.657050 | 1.241433 | -0.652496 |
---|
2020-06-04 | 2.068663 | 0.320610 | -1.660631 | 0.416631 |
---|
2020-06-03 | 1.790609 | 0.560666 | 0.344051 | 0.799520 |
---|
2020-06-02 | -0.967683 | -0.800448 | 0.123673 | 0.700337 |
---|
2020-06-01 | -1.365774 | 1.169899 | 0.607591 | -2.029687 |
---|
df.sort_index(axis=1, ascending=False)
| d | c | b | a |
---|
2020-06-01 | -2.029687 | 0.607591 | 1.169899 | -1.365774 |
---|
2020-06-02 | 0.700337 | 0.123673 | -0.800448 | -0.967683 |
---|
2020-06-03 | 0.799520 | 0.344051 | 0.560666 | 1.790609 |
---|
2020-06-04 | 0.416631 | -1.660631 | 0.320610 | 2.068663 |
---|
2020-06-05 | -0.652496 | 1.241433 | -0.657050 | -0.956351 |
---|
2020-06-06 | 0.080852 | 0.533364 | 1.888093 | -1.135870 |
---|
根据值排序
df.sort_values(by='a', ascending=False)
| a | b | c | d |
---|
2020-06-04 | 2.068663 | 0.320610 | -1.660631 | 0.416631 |
---|
2020-06-03 | 1.790609 | 0.560666 | 0.344051 | 0.799520 |
---|
2020-06-05 | -0.956351 | -0.657050 | 1.241433 | -0.652496 |
---|
2020-06-02 | -0.967683 | -0.800448 | 0.123673 | 0.700337 |
---|
2020-06-06 | -1.135870 | 1.888093 | 0.533364 | 0.080852 |
---|
2020-06-01 | -1.365774 | 1.169899 | 0.607591 | -2.029687 |
---|
df.sort_values(by=['a','b'], ascending=True)
| a | b | c | d |
---|
2020-06-01 | -1.365774 | 1.169899 | 0.607591 | -2.029687 |
---|
2020-06-06 | -1.135870 | 1.888093 | 0.533364 | 0.080852 |
---|
2020-06-02 | -0.967683 | -0.800448 | 0.123673 | 0.700337 |
---|
2020-06-05 | -0.956351 | -0.657050 | 1.241433 | -0.652496 |
---|
2020-06-03 | 1.790609 | 0.560666 | 0.344051 | 0.799520 |
---|
2020-06-04 | 2.068663 | 0.320610 | -1.660631 | 0.416631 |
---|
选取数据
选取某一列
df['a']
2020-06-01 -1.365774
2020-06-02 -0.967683
2020-06-03 1.790609
2020-06-04 2.068663
2020-06-05 -0.956351
2020-06-06 -1.135870
Freq: D, Name: a, dtype: float64
根据索引选取某几行数据
df['2020-06-01':'2020-06-02']
| a | b | c | d |
---|
2020-06-01 | -1.365774 | 1.169899 | 0.607591 | -2.029687 |
---|
2020-06-02 | -0.967683 | -0.800448 | 0.123673 | 0.700337 |
---|
根据列名选择某几列数据
df[['c', 'b']]
| c | b |
---|
2020-06-01 | 0.607591 | 1.169899 |
---|
2020-06-02 | 0.123673 | -0.800448 |
---|
2020-06-03 | 0.344051 | 0.560666 |
---|
2020-06-04 | -1.660631 | 0.320610 |
---|
2020-06-05 | 1.241433 | -0.657050 |
---|
2020-06-06 | 0.533364 | 1.888093 |
---|
根据索引和列名选择数据
loc[行名选择, 列名选择],未指定行名或列名,或者指定为:则表示选择当前所有行,或列
df.loc['2020-06-01']
a -1.365774
b 1.169899
c 0.607591
d -2.029687
Name: 2020-06-01 00:00:00, dtype: float64
df.loc['2020-06-01', 'b']
1.1698991845802456
df.loc[:, 'b']
2020-06-01 1.169899
2020-06-02 -0.800448
2020-06-03 0.560666
2020-06-04 0.320610
2020-06-05 -0.657050
2020-06-06 1.888093
Freq: D, Name: b, dtype: float64
df.loc[:, ['a', 'b']]
| a | b |
---|
2020-06-01 | -1.365774 | 1.169899 |
---|
2020-06-02 | -0.967683 | -0.800448 |
---|
2020-06-03 | 1.790609 | 0.560666 |
---|
2020-06-04 | 2.068663 | 0.320610 |
---|
2020-06-05 | -0.956351 | -0.657050 |
---|
2020-06-06 | -1.135870 | 1.888093 |
---|
根据行索引和列索引取数据
df.iloc[0,0]
-1.3657744117360429
df.iloc[0, :]
a -1.365774
b 1.169899
c 0.607591
d -2.029687
Name: 2020-06-01 00:00:00, dtype: float64
根据布尔表达式表达式取数据
只有当布尔表达式为真时的数据才会被选择
df[df.a > 1]
| a | b | c | d |
---|
2020-06-03 | 1.790609 | 0.560666 | 0.344051 | 0.799520 |
---|
2020-06-04 | 2.068663 | 0.320610 | -1.660631 | 0.416631 |
---|
df[(df['a'] > 1) & (df['d'] <0)]
添加/删除列、更新、替换数据
设置某矩阵项值
df.loc['2020-06-01', 'a'] = np.nan
df.loc['2020-06-06', 'c'] = np.nan
df
| a | b | c | d |
---|
2020-06-01 | NaN | 1.169899 | 0.607591 | -2.029687 |
---|
2020-06-02 | -0.967683 | -0.800448 | 0.123673 | 0.700337 |
---|
2020-06-03 | 1.790609 | 0.560666 | 0.344051 | 0.799520 |
---|
2020-06-04 | 2.068663 | 0.320610 | -1.660631 | 0.416631 |
---|
2020-06-05 | -0.956351 | -0.657050 | 1.241433 | -0.652496 |
---|
2020-06-06 | -1.135870 | 1.888093 | NaN | 0.080852 |
---|
根据条件创建新列
df['e'] = np.where((df['a'] > 1) & (df['d']<0), 1, 0)
df
| a | b | c | d | e |
---|
2020-06-01 | NaN | 1.169899 | 0.607591 | -2.029687 | 0 |
---|
2020-06-02 | -0.967683 | -0.800448 | 0.123673 | 0.700337 | 0 |
---|
2020-06-03 | 1.790609 | 0.560666 | 0.344051 | 0.799520 | 0 |
---|
2020-06-04 | 2.068663 | 0.320610 | -1.660631 | 0.416631 | 0 |
---|
2020-06-05 | -0.956351 | -0.657050 | 1.241433 | -0.652496 | 0 |
---|
2020-06-06 | -1.135870 | 1.888093 | NaN | 0.080852 | 0 |
---|
根据已有列创建新列
tmp = df.copy()
df.loc[:,'f'] = tmp.apply(lambda row: row['b']+ row['d'], axis=1)
df
| a | b | c | d | e | f |
---|
2020-06-01 | NaN | 1.169899 | 0.607591 | -2.029687 | 0 | -0.859788 |
---|
2020-06-02 | -0.967683 | -0.800448 | 0.123673 | 0.700337 | 0 | -0.100111 |
---|
2020-06-03 | 1.790609 | 0.560666 | 0.344051 | 0.799520 | 0 | 1.360186 |
---|
2020-06-04 | 2.068663 | 0.320610 | -1.660631 | 0.416631 | 0 | 0.737241 |
---|
2020-06-05 | -0.956351 | -0.657050 | 1.241433 | -0.652496 | 0 | -1.309546 |
---|
2020-06-06 | -1.135870 | 1.888093 | NaN | 0.080852 | 0 | 1.968945 |
---|
替换数据
df.replace(1,20)
| a | b | c | d | e | f |
---|
2020-06-01 | NaN | 1.169899 | 0.607591 | -2.029687 | 0 | -0.859788 |
---|
2020-06-02 | -0.967683 | -0.800448 | 0.123673 | 0.700337 | 0 | -0.100111 |
---|
2020-06-03 | 1.790609 | 0.560666 | 0.344051 | 0.799520 | 0 | 1.360186 |
---|
2020-06-04 | 2.068663 | 0.320610 | -1.660631 | 0.416631 | 0 | 0.737241 |
---|
2020-06-05 | -0.956351 | -0.657050 | 1.241433 | -0.652496 | 0 | -1.309546 |
---|
2020-06-06 | -1.135870 | 1.888093 | NaN | 0.080852 | 0 | 1.968945 |
---|
df.replace([1,3],['one','three'])
| a | b | c | d | e | f |
---|
2020-06-01 | NaN | 1.169899 | 0.607591 | -2.029687 | 0 | -0.859788 |
---|
2020-06-02 | -0.967683 | -0.800448 | 0.123673 | 0.700337 | 0 | -0.100111 |
---|
2020-06-03 | 1.790609 | 0.560666 | 0.344051 | 0.799520 | 0 | 1.360186 |
---|
2020-06-04 | 2.068663 | 0.320610 | -1.660631 | 0.416631 | 0 | 0.737241 |
---|
2020-06-05 | -0.956351 | -0.657050 | 1.241433 | -0.652496 | 0 | -1.309546 |
---|
2020-06-06 | -1.135870 | 1.888093 | NaN | 0.080852 | 0 | 1.968945 |
---|
列名重命名
df.rename(columns={'c':'cc'})
| a | b | cc | d | e | f |
---|
2020-06-01 | NaN | 1.169899 | 0.607591 | -2.029687 | 0 | -0.859788 |
---|
2020-06-02 | -0.967683 | -0.800448 | 0.123673 | 0.700337 | 0 | -0.100111 |
---|
2020-06-03 | 1.790609 | 0.560666 | 0.344051 | 0.799520 | 0 | 1.360186 |
---|
2020-06-04 | 2.068663 | 0.320610 | -1.660631 | 0.416631 | 0 | 0.737241 |
---|
2020-06-05 | -0.956351 | -0.657050 | 1.241433 | -0.652496 | 0 | -1.309546 |
---|
2020-06-06 | -1.135870 | 1.888093 | NaN | 0.080852 | 0 | 1.968945 |
---|
重设索引
df.set_index('a')
| b | c | d | e | f |
---|
a | | | | | |
---|
NaN | 1.169899 | 0.607591 | -2.029687 | 0 | -0.859788 |
---|
-0.967683 | -0.800448 | 0.123673 | 0.700337 | 0 | -0.100111 |
---|
1.790609 | 0.560666 | 0.344051 | 0.799520 | 0 | 1.360186 |
---|
2.068663 | 0.320610 | -1.660631 | 0.416631 | 0 | 0.737241 |
---|
-0.956351 | -0.657050 | 1.241433 | -0.652496 | 0 | -1.309546 |
---|
-1.135870 | 1.888093 | NaN | 0.080852 | 0 | 1.968945 |
---|
删除列
df.drop(columns=['a', 'f'])
| b | c | d | e |
---|
2020-06-01 | 1.169899 | 0.607591 | -2.029687 | 0 |
---|
2020-06-02 | -0.800448 | 0.123673 | 0.700337 | 0 |
---|
2020-06-03 | 0.560666 | 0.344051 | 0.799520 | 0 |
---|
2020-06-04 | 0.320610 | -1.660631 | 0.416631 | 0 |
---|
2020-06-05 | -0.657050 | 1.241433 | -0.652496 | 0 |
---|
2020-06-06 | 1.888093 | NaN | 0.080852 | 0 |
---|
处理Nan数据
检查是否Nan值
df.isnull()
| a | b | c | d | e | f |
---|
2020-06-01 | True | False | False | False | False | False |
---|
2020-06-02 | False | False | False | False | False | False |
---|
2020-06-03 | False | False | False | False | False | False |
---|
2020-06-04 | False | False | False | False | False | False |
---|
2020-06-05 | False | False | False | False | False | False |
---|
2020-06-06 | False | False | True | False | False | False |
---|
df.notnull()
| a | b | c | d | e | f |
---|
2020-06-01 | False | True | True | True | True | True |
---|
2020-06-02 | True | True | True | True | True | True |
---|
2020-06-03 | True | True | True | True | True | True |
---|
2020-06-04 | True | True | True | True | True | True |
---|
2020-06-05 | True | True | True | True | True | True |
---|
2020-06-06 | True | True | False | True | True | True |
---|
删除掉包含null值的行
df.dropna(axis=0)
| a | b | c | d | e | f |
---|
2020-06-02 | -0.967683 | -0.800448 | 0.123673 | 0.700337 | 0 | -0.100111 |
---|
2020-06-03 | 1.790609 | 0.560666 | 0.344051 | 0.799520 | 0 | 1.360186 |
---|
2020-06-04 | 2.068663 | 0.320610 | -1.660631 | 0.416631 | 0 | 0.737241 |
---|
2020-06-05 | -0.956351 | -0.657050 | 1.241433 | -0.652496 | 0 | -1.309546 |
---|
替换Nan
df.fillna(1000)
| a | b | c | d | e | f |
---|
2020-06-01 | 1000.000000 | 1.169899 | 0.607591 | -2.029687 | 0 | -0.859788 |
---|
2020-06-02 | -0.967683 | -0.800448 | 0.123673 | 0.700337 | 0 | -0.100111 |
---|
2020-06-03 | 1.790609 | 0.560666 | 0.344051 | 0.799520 | 0 | 1.360186 |
---|
2020-06-04 | 2.068663 | 0.320610 | -1.660631 | 0.416631 | 0 | 0.737241 |
---|
2020-06-05 | -0.956351 | -0.657050 | 1.241433 | -0.652496 | 0 | -1.309546 |
---|
2020-06-06 | -1.135870 | 1.888093 | 1000.000000 | 0.080852 | 0 | 1.968945 |
---|
df.fillna(df.mean())
| a | b | c | d | e | f |
---|
2020-06-01 | 0.159874 | 1.169899 | 0.607591 | -2.029687 | 0 | -0.859788 |
---|
2020-06-02 | -0.967683 | -0.800448 | 0.123673 | 0.700337 | 0 | -0.100111 |
---|
2020-06-03 | 1.790609 | 0.560666 | 0.344051 | 0.799520 | 0 | 1.360186 |
---|
2020-06-04 | 2.068663 | 0.320610 | -1.660631 | 0.416631 | 0 | 0.737241 |
---|
2020-06-05 | -0.956351 | -0.657050 | 1.241433 | -0.652496 | 0 | -1.309546 |
---|
2020-06-06 | -1.135870 | 1.888093 | 0.131223 | 0.080852 | 0 | 1.968945 |
---|