11.4 Handling Missing Values
1. Detecting missing values
import pandas as pd
import numpy as np
data = pd.DataFrame(np.array([[1, np.nan, 2],
[np.nan, 3, 4],
[5, 6, None]]), columns=["A", "B", "C"])
data
|   | A | B | C |
|---|---|---|---|
| 0 | 1 | NaN | 2 |
| 1 | NaN | 3 | 4 |
| 2 | 5 | 6 | None |
Note: when a column contains None, strings, and so on, its dtype becomes object, which is more resource-intensive than int or float.
np.nan is a special floating-point value (its type is float), so NaN is the preferred way to represent missing values.
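A quick check (a minimal sketch) shows why: np.nan is an ordinary Python float, so NumPy can keep the array numeric, while None forces a fallback to object dtype.

```python
import numpy as np

# np.nan is just a float, so the array stays numeric
print(type(np.nan))                  # <class 'float'>
print(np.array([1, np.nan]).dtype)   # float64
# None has no numeric type, so NumPy falls back to object
print(np.array([1, None]).dtype)     # object
```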
data.dtypes
A object
B object
C object
dtype: object
data.isnull()
|   | A | B | C |
|---|---|---|---|
| 0 | False | True | False |
| 1 | True | False | False |
| 2 | False | False | True |
data.notnull()
|   | A | B | C |
|---|---|---|---|
| 0 | True | False | True |
| 1 | False | True | True |
| 2 | True | True | False |
2. Dropping missing values
data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
[np.nan, 4, 5, 6],
[7, 8, np.nan, 9],
[10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | NaN | 2.0 | 3.0 |
| 1 | NaN | 4.0 | 5.0 | 6.0 |
| 2 | 7.0 | 8.0 | NaN | 9.0 |
| 3 | 10.0 | 11.0 | 12.0 | 13.0 |
Note: np.nan is a special floating-point value, so the columns stay float64.
data.dtypes
A float64
B float64
C float64
D float64
dtype: object
(1) Dropping entire rows
data.dropna()
|   | A | B | C | D |
|---|---|---|---|---|
| 3 | 10.0 | 11.0 | 12.0 | 13.0 |
(2) Dropping entire columns
data.dropna(axis="columns")
|   | D |
|---|---|
| 0 | 3.0 |
| 1 | 6.0 |
| 2 | 9.0 |
| 3 | 13.0 |
data["D"] = np.nan
data
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | NaN | 2.0 | NaN |
| 1 | NaN | 4.0 | 5.0 | NaN |
| 2 | 7.0 | 8.0 | NaN | NaN |
| 3 | 10.0 | 11.0 | 12.0 | NaN |
data.dropna(axis="columns", how="all")
|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | NaN | 2.0 |
| 1 | NaN | 4.0 | 5.0 |
| 2 | 7.0 | 8.0 | NaN |
| 3 | 10.0 | 11.0 | 12.0 |
how="all" drops a row or column only when all of its values are missing.
data.dropna(axis="columns", how="any")
Every column now contains at least one missing value, so how="any" drops them all and only the index remains:

|   |
|---|
| 0 |
| 1 |
| 2 |
| 3 |
data.loc[3] = np.nan
data
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | NaN | 2.0 | NaN |
| 1 | NaN | 4.0 | 5.0 | NaN |
| 2 | 7.0 | 8.0 | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN |
data.dropna(how="all")
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | NaN | 2.0 | NaN |
| 1 | NaN | 4.0 | 5.0 | NaN |
| 2 | 7.0 | 8.0 | NaN | NaN |
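Besides how, dropna() also accepts a thresh parameter, not used above: a row (or column) is kept only if it has at least thresh non-missing values. A minimal sketch:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, np.nan, 2, 3],
                     [np.nan, 4, 5, 6],
                     [7, 8, np.nan, 9],
                     [10, 11, 12, 13]], columns=list("ABCD"))

# keep only rows with at least 4 non-missing values
kept = data.dropna(thresh=4)
print(kept)  # only row 3 survives
```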
3. Filling missing values
data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
[np.nan, 4, 5, 6],
[7, 8, np.nan, 9],
[10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | NaN | 2.0 | 3.0 |
| 1 | NaN | 4.0 | 5.0 | 6.0 |
| 2 | 7.0 | 8.0 | NaN | 9.0 |
| 3 | 10.0 | 11.0 | 12.0 | 13.0 |
data.fillna(value=5)
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | 5.0 | 2.0 | 3.0 |
| 1 | 5.0 | 4.0 | 5.0 | 6.0 |
| 2 | 7.0 | 8.0 | 5.0 | 9.0 |
| 3 | 10.0 | 11.0 | 12.0 | 13.0 |
- Fill with the mean
fill = data.mean()
fill
A 6.000000
B 7.666667
C 6.333333
D 7.750000
dtype: float64
data.fillna(value=fill)
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | 7.666667 | 2.000000 | 3.0 |
| 1 | 6.0 | 4.000000 | 5.000000 | 6.0 |
| 2 | 7.0 | 8.000000 | 6.333333 | 9.0 |
| 3 | 10.0 | 11.000000 | 12.000000 | 13.0 |
To fill with the mean of the entire dataset, first flatten the data with stack(), then compute the mean and fill.
fill = data.stack().mean()
fill
7.0
data.fillna(value=fill)
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | 7.0 | 2.0 | 3.0 |
| 1 | 7.0 | 4.0 | 5.0 | 6.0 |
| 2 | 7.0 | 8.0 | 7.0 | 9.0 |
| 3 | 10.0 | 11.0 | 12.0 | 13.0 |
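fillna() is not limited to constant values; pandas can also propagate neighboring values forward or backward. A minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
print(s.ffill().tolist())  # forward fill: [1.0, 1.0, 1.0, 4.0]
print(s.bfill().tolist())  # backward fill: [1.0, 4.0, 4.0, 4.0]
```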
11.5 Merging Data
- A helper function for generating DataFrames
import pandas as pd
import numpy as np
def make_df(cols, ind):
    """Make a simple DataFrame."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)
make_df("ABC", range(3))
|   | A | B | C |
|---|---|---|---|
| 0 | A0 | B0 | C0 |
| 1 | A1 | B1 | C1 |
| 2 | A2 | B2 | C2 |
- Vertical concatenation
df_1 = make_df("AB", [1, 2])
df_2 = make_df("AB", [3, 4])
print(df_1)
print(df_2)
A B
1 A1 B1
2 A2 B2
A B
3 A3 B3
4 A4 B4
pd.concat([df_1, df_2])
|   | A | B |
|---|---|---|
| 1 | A1 | B1 |
| 2 | A2 | B2 |
| 3 | A3 | B3 |
| 4 | A4 | B4 |
- Horizontal concatenation
df_3 = make_df("AB", [0, 1])
df_4 = make_df("CD", [0, 1])
print(df_3)
print(df_4)
A B
0 A0 B0
1 A1 B1
C D
0 C0 D0
1 C1 D1
pd.concat([df_3, df_4], axis=1)
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | A0 | B0 | C0 | D0 |
| 1 | A1 | B1 | C1 | D1 |
- Overlapping indices
Row overlap
df_5 = make_df("AB", [1, 2])
df_6 = make_df("AB", [1, 2])
print(df_5)
print(df_6)
A B
1 A1 B1
2 A2 B2
A B
1 A1 B1
2 A2 B2
pd.concat([df_5, df_6])
|   | A | B |
|---|---|---|
| 1 | A1 | B1 |
| 2 | A2 | B2 |
| 1 | A1 | B1 |
| 2 | A2 | B2 |
pd.concat([df_5, df_6],ignore_index=True)
|   | A | B |
|---|---|---|
| 0 | A1 | B1 |
| 1 | A2 | B2 |
| 2 | A1 | B1 |
| 3 | A2 | B2 |
Column overlap
df_7 = make_df("ABC", [1, 2])
df_8 = make_df("BCD", [1, 2])
print(df_7)
print(df_8)
A B C
1 A1 B1 C1
2 A2 B2 C2
B C D
1 B1 C1 D1
2 B2 C2 D2
pd.concat([df_7, df_8], axis=1)
|   | A | B | C | B | C | D |
|---|---|---|---|---|---|---|
| 1 | A1 | B1 | C1 | B1 | C1 | D1 |
| 2 | A2 | B2 | C2 | B2 | C2 | D2 |
pd.concat([df_7, df_8],axis=1, ignore_index=True)
|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 1 | A1 | B1 | C1 | B1 | C1 | D1 |
| 2 | A2 | B2 | C2 | B2 | C2 | D2 |
- Aligned merging with merge()
df_9 = make_df("AB", [1, 2])
df_10 = make_df("BC", [1, 2])
print(df_9)
print(df_10)
A B
1 A1 B1
2 A2 B2
B C
1 B1 C1
2 B2 C2
pd.merge(df_9, df_10)
|   | A | B | C |
|---|---|---|---|
| 0 | A1 | B1 | C1 |
| 1 | A2 | B2 | C2 |
df_9 = make_df("AB", [1, 2])
df_10 = make_df("CB", [2, 1])
print(df_9)
print(df_10)
A B
1 A1 B1
2 A2 B2
C B
2 C2 B2
1 C1 B1
pd.merge(df_9, df_10)
|   | A | B | C |
|---|---|---|---|
| 0 | A1 | B1 | C1 |
| 1 | A2 | B2 | C2 |
[Example] Merging city information
population_dict = {"city": ("BeiJing", "HangZhou", "ShenZhen"),
"pop": (2154, 981, 1303)}
population = pd.DataFrame(population_dict)
population
|   | city | pop |
|---|---|---|
| 0 | BeiJing | 2154 |
| 1 | HangZhou | 981 |
| 2 | ShenZhen | 1303 |
GDP_dict = {"city": ("BeiJing", "ShangHai", "HangZhou"),
"GDP": (30320, 32680, 13468)}
GDP = pd.DataFrame(GDP_dict)
GDP
|   | city | GDP |
|---|---|---|
| 0 | BeiJing | 30320 |
| 1 | ShangHai | 32680 |
| 2 | HangZhou | 13468 |
city_info = pd.merge(population, GDP)
city_info
|   | city | pop | GDP |
|---|---|---|---|
| 0 | BeiJing | 2154 | 30320 |
| 1 | HangZhou | 981 | 13468 |
The default merge is an inner join (the intersection of keys); how="outer" takes the union instead.
city_info = pd.merge(population, GDP, how="outer")
city_info
|   | city | pop | GDP |
|---|---|---|---|
| 0 | BeiJing | 2154.0 | 30320.0 |
| 1 | HangZhou | 981.0 | 13468.0 |
| 2 | ShenZhen | 1303.0 | NaN |
| 3 | ShangHai | NaN | 32680.0 |
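Besides outer, merge() also supports left and right joins, which keep every key from one side. A sketch reusing the city data from above:

```python
import pandas as pd

population = pd.DataFrame({"city": ["BeiJing", "HangZhou", "ShenZhen"],
                           "pop": [2154, 981, 1303]})
GDP = pd.DataFrame({"city": ["BeiJing", "ShangHai", "HangZhou"],
                    "GDP": [30320, 32680, 13468]})

# keep every row of population; unmatched cities get NaN GDP
left = pd.merge(population, GDP, how="left")
print(left)
```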
11.6 Grouping and Pivot Tables
df = pd.DataFrame({"key":["A", "B", "C", "C", "B", "A"],
"data1": range(6),
"data2": np.random.randint(0, 10, size=6)})
df
|   | key | data1 | data2 |
|---|---|---|---|
| 0 | A | 0 | 1 |
| 1 | B | 1 | 4 |
| 2 | C | 2 | 9 |
| 3 | C | 3 | 9 |
| 4 | B | 4 | 1 |
| 5 | A | 5 | 9 |
(1) Grouping
- Lazy evaluation
df.groupby("key")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002276795A240>
This shows the groups have been formed; the result is only computed and displayed once we apply an aggregation method.
df.groupby("key").sum()
| key | data1 | data2 |
|---|---|---|
| A | 5 | 10 |
| B | 5 | 6 |
| C | 5 | 11 |
df.groupby("key").mean()
| key | data1 | data2 |
|---|---|---|
| A | 2.5 | 5.0 |
| B | 2.5 | 3.0 |
| C | 2.5 | 5.5 |
We can print the groups to see what they contain:
for i in df.groupby("key"):
    print(str(i))
('A', key data1 data2
0 A 0 2
5 A 5 8)
('B', key data1 data2
1 B 1 2
4 B 4 4)
('C', key data1 data2
2 C 2 8
3 C 3 3)
- Selecting a column
df.groupby("key")["data2"].sum()
key
A 10
B 6
C 11
Name: data2, dtype: int32
- Iterating over groups
for key, group in df.groupby("key"):
    print("{0:5} shape={1}".format(key, group.shape))
A shape=(2, 3)
B shape=(2, 3)
C shape=(2, 3)
- Calling methods
df.groupby("key")["data1"].describe()
| key | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| A | 2.0 | 2.5 | 3.535534 | 0.0 | 1.25 | 2.5 | 3.75 | 5.0 |
| B | 2.0 | 2.5 | 2.121320 | 1.0 | 1.75 | 2.5 | 3.25 | 4.0 |
| C | 2.0 | 2.5 | 0.707107 | 2.0 | 2.25 | 2.5 | 2.75 | 3.0 |
- More complex aggregation
df.groupby("key").aggregate(["min", "median", "max"])
| key | data1 min | data1 median | data1 max | data2 min | data2 median | data2 max |
|---|---|---|---|---|---|---|
| A | 0 | 2.5 | 5 | 2 | 5.0 | 8 |
| B | 1 | 2.5 | 4 | 2 | 3.0 | 4 |
| C | 2 | 2.5 | 3 | 3 | 5.5 | 8 |
- Filtering
def filter_func(x):
    return x["data2"].std() > 3
df.groupby("key")["data2"].std()
key
A 4.242641
B 1.414214
C 3.535534
Name: data2, dtype: float64
df.groupby("key").filter(filter_func)
|   | key | data1 | data2 |
|---|---|---|---|
| 0 | A | 0 | 2 |
| 2 | C | 2 | 8 |
| 3 | C | 3 | 3 |
| 5 | A | 5 | 8 |
- Transformation
df
|   | key | data1 | data2 |
|---|---|---|---|
| 0 | A | 0 | 2 |
| 1 | B | 1 | 2 |
| 2 | C | 2 | 8 |
| 3 | C | 3 | 3 |
| 4 | B | 4 | 4 |
| 5 | A | 5 | 8 |
df.groupby("key").transform(lambda x: x-x.mean())
|   | data1 | data2 |
|---|---|---|
| 0 | -2.5 | -3.0 |
| 1 | -1.5 | -1.0 |
| 2 | -0.5 | 2.5 |
| 3 | 0.5 | -2.5 |
| 4 | 1.5 | 1.0 |
| 5 | 2.5 | 3.0 |
df
|   | key | data1 | data2 |
|---|---|---|---|
| 0 | A | 0 | 1 |
| 1 | B | 1 | 4 |
| 2 | C | 2 | 9 |
| 3 | C | 3 | 9 |
| 4 | B | 4 | 1 |
| 5 | A | 5 | 9 |
df.groupby("key").apply(lambda x: x-x.mean())
|   | data1 | data2 |
|---|---|---|
| 0 | -2.5 | -4.0 |
| 1 | -1.5 | 1.5 |
| 2 | -0.5 | 0.0 |
| 3 | 0.5 | 0.0 |
| 4 | 1.5 | -1.5 |
| 5 | 2.5 | 4.0 |
- The apply() method
df
|   | key | data1 | data2 |
|---|---|---|---|
| 0 | A | 0 | 2 |
| 1 | B | 1 | 2 |
| 2 | C | 2 | 8 |
| 3 | C | 3 | 3 |
| 4 | B | 4 | 4 |
| 5 | A | 5 | 8 |
def norm_by_data2(x):
    # normalize data1 by the sum of the group's data2
    x["data1"] /= x["data2"].sum()
    return x
df.groupby("key").apply(norm_by_data2)
|   | key | data1 | data2 |
|---|---|---|---|
| 0 | A | 0.000000 | 2 |
| 1 | B | 0.166667 | 2 |
| 2 | C | 0.181818 | 8 |
| 3 | C | 0.272727 | 3 |
| 4 | B | 0.666667 | 4 |
| 5 | A | 0.500000 | 8 |
- Using a list or array as the grouping key
Here L acts as a new set of labels that replaces the original row index.
L = [0, 1, 0, 1, 2, 0]
df
|   | key | data1 | data2 |
|---|---|---|---|
| 0 | A | 0 | 2 |
| 1 | B | 1 | 2 |
| 2 | C | 2 | 8 |
| 3 | C | 3 | 3 |
| 4 | B | 4 | 4 |
| 5 | A | 5 | 8 |
df.groupby(L).sum()
|   | data1 | data2 |
|---|---|---|
| 0 | 7 | 18 |
| 1 | 4 | 5 |
| 2 | 4 | 4 |
- Mapping the index to groups with a dictionary
df2 = df.set_index("key")
df2
| key | data1 | data2 |
|---|---|---|
| A | 0 | 2 |
| B | 1 | 2 |
| C | 2 | 8 |
| C | 3 | 3 |
| B | 4 | 4 |
| A | 5 | 8 |
mapping = {"A": "first", "B": "constant", "C": "constant"}
df2.groupby(mapping).sum()
|   | data1 | data2 |
|---|---|---|
| constant | 10 | 17 |
| first | 5 | 10 |
- An arbitrary Python function
df2.groupby(str.lower).mean()
|   | data1 | data2 |
|---|---|---|
| a | 2.5 | 5.0 |
| b | 2.5 | 3.0 |
| c | 2.5 | 5.5 |
- A list of multiple valid keys
Rows end up in the same group only when all of the keys agree.
df2.groupby([str.lower, mapping]).mean()
|   |   | data1 | data2 |
|---|---|---|---|
| a | first | 2.5 | 5.0 |
| b | constant | 2.5 | 3.0 |
| c | constant | 2.5 | 5.5 |
[Example 1] Processing planet-observation data
import seaborn as sns
planets = sns.load_dataset("planets")
planets.shape
(1035, 6)
planets.head()
|   | method | number | orbital_period | mass | distance | year |
|---|---|---|---|---|---|---|
| 0 | Radial Velocity | 1 | 269.300 | 7.10 | 77.40 | 2006 |
| 1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
| 2 | Radial Velocity | 1 | 763.000 | 2.60 | 19.84 | 2011 |
| 3 | Radial Velocity | 1 | 326.030 | 19.40 | 110.62 | 2007 |
| 4 | Radial Velocity | 1 | 516.220 | 10.50 | 119.47 | 2009 |
planets.describe()
|   | number | orbital_period | mass | distance | year |
|---|---|---|---|---|---|
| count | 1035.000000 | 992.000000 | 513.000000 | 808.000000 | 1035.000000 |
| mean | 1.785507 | 2002.917596 | 2.638161 | 264.069282 | 2009.070531 |
| std | 1.240976 | 26014.728304 | 3.818617 | 733.116493 | 3.972567 |
| min | 1.000000 | 0.090706 | 0.003600 | 1.350000 | 1989.000000 |
| 25% | 1.000000 | 5.442540 | 0.229000 | 32.560000 | 2007.000000 |
| 50% | 1.000000 | 39.979500 | 1.260000 | 55.250000 | 2010.000000 |
| 75% | 2.000000 | 526.005000 | 3.040000 | 178.500000 | 2012.000000 |
| max | 7.000000 | 730000.000000 | 25.000000 | 8500.000000 | 2014.000000 |
planets.head()
|   | method | number | orbital_period | mass | distance | year |
|---|---|---|---|---|---|---|
| 0 | Radial Velocity | 1 | 269.300 | 7.10 | 77.40 | 2006 |
| 1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
| 2 | Radial Velocity | 1 | 763.000 | 2.60 | 19.84 | 2011 |
| 3 | Radial Velocity | 1 | 326.030 | 19.40 | 110.62 | 2007 |
| 4 | Radial Velocity | 1 | 516.220 | 10.50 | 119.47 | 2009 |
decade = 10 * (planets["year"] // 10)
decade.head()
0 2000
1 2000
2 2010
3 2000
4 2000
Name: year, dtype: int64
decade = decade.astype(str) + "s"
decade.name = "decade"
decade.head()
0 2000s
1 2000s
2 2010s
3 2000s
4 2000s
Name: decade, dtype: object
planets.head()
|   | method | number | orbital_period | mass | distance | year |
|---|---|---|---|---|---|---|
| 0 | Radial Velocity | 1 | 269.300 | 7.10 | 77.40 | 2006 |
| 1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
| 2 | Radial Velocity | 1 | 763.000 | 2.60 | 19.84 | 2011 |
| 3 | Radial Velocity | 1 | 326.030 | 19.40 | 110.62 | 2007 |
| 4 | Radial Velocity | 1 | 516.220 | 10.50 | 119.47 | 2009 |
planets.groupby(["method", decade]).sum()
| method | decade | number | orbital_period | mass | distance | year |
|---|---|---|---|---|---|---|
| Astrometry | 2010s | 2 | 1.262360e+03 | 0.00000 | 35.75 | 4023 |
| Eclipse Timing Variations | 2000s | 5 | 1.930800e+04 | 6.05000 | 261.44 | 6025 |
|   | 2010s | 10 | 2.345680e+04 | 4.20000 | 1000.00 | 12065 |
| Imaging | 2000s | 29 | 1.350935e+06 | 0.00000 | 956.83 | 40139 |
|   | 2010s | 21 | 6.803750e+04 | 0.00000 | 1210.08 | 36208 |
| Microlensing | 2000s | 12 | 1.732500e+04 | 0.00000 | 0.00 | 20070 |
|   | 2010s | 15 | 4.750000e+03 | 0.00000 | 41440.00 | 26155 |
| Orbital Brightness Modulation | 2010s | 5 | 2.127920e+00 | 0.00000 | 2360.00 | 6035 |
| Pulsar Timing | 1990s | 9 | 1.900153e+02 | 0.00000 | 0.00 | 5978 |
|   | 2000s | 1 | 3.652500e+04 | 0.00000 | 0.00 | 2003 |
|   | 2010s | 1 | 9.070629e-02 | 0.00000 | 1200.00 | 2011 |
| Pulsation Timing Variations | 2000s | 1 | 1.170000e+03 | 0.00000 | 0.00 | 2007 |
| Radial Velocity | 1980s | 1 | 8.388800e+01 | 11.68000 | 40.57 | 1989 |
|   | 1990s | 52 | 1.091561e+04 | 68.17820 | 723.71 | 55943 |
|   | 2000s | 475 | 2.633526e+05 | 945.31928 | 15201.16 | 619775 |
|   | 2010s | 424 | 1.809630e+05 | 316.47890 | 11382.67 | 432451 |
| Transit | 2000s | 64 | 2.897102e+02 | 0.00000 | 31823.31 | 124462 |
|   | 2010s | 712 | 8.087813e+03 | 1.47000 | 102419.46 | 673999 |
| Transit Timing Variations | 2010s | 9 | 2.393505e+02 | 0.00000 | 3313.00 | 8050 |
Double brackets [[ ]] return a DataFrame, while single brackets [ ] return a Series; the former renders more nicely.
planets.groupby(["method", decade])[["number"]].sum().unstack().fillna(0)
| method | 1980s | 1990s | 2000s | 2010s |
|---|---|---|---|---|
| Astrometry | 0.0 | 0.0 | 0.0 | 2.0 |
| Eclipse Timing Variations | 0.0 | 0.0 | 5.0 | 10.0 |
| Imaging | 0.0 | 0.0 | 29.0 | 21.0 |
| Microlensing | 0.0 | 0.0 | 12.0 | 15.0 |
| Orbital Brightness Modulation | 0.0 | 0.0 | 0.0 | 5.0 |
| Pulsar Timing | 0.0 | 9.0 | 1.0 | 1.0 |
| Pulsation Timing Variations | 0.0 | 0.0 | 1.0 | 0.0 |
| Radial Velocity | 1.0 | 52.0 | 475.0 | 424.0 |
| Transit | 0.0 | 0.0 | 64.0 | 712.0 |
| Transit Timing Variations | 0.0 | 0.0 | 0.0 | 9.0 |
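The [[ ]] versus [ ] distinction mentioned above can be verified directly. A minimal sketch with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B"], "number": [1, 2]})
print(type(df["number"]))    # a Series
print(type(df[["number"]]))  # a DataFrame
```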
(2) Pivot tables
[Example 2] Analyzing Titanic passenger data
import seaborn as sns
titanic = sns.load_dataset("titanic")
titanic.head()
|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
T = titanic[titanic.age.notnull()].copy()
T.age = T.age.apply(lambda x: 60 if x >= 60 else x)  # cap ages at 60; assign the result back
T.age.value_counts()
24.00 30
22.00 27
60.00 26
18.00 26
28.00 25
30.00 25
19.00 25
21.00 24
25.00 23
36.00 22
29.00 20
35.00 18
32.00 18
27.00 18
26.00 18
31.00 17
16.00 17
34.00 15
20.00 15
33.00 15
23.00 15
39.00 14
40.00 13
17.00 13
42.00 13
45.00 12
38.00 11
4.00 10
50.00 10
2.00 10
..
8.00 4
5.00 4
11.00 4
6.00 3
7.00 3
46.00 3
30.50 2
57.00 2
0.83 2
55.00 2
10.00 2
59.00 2
13.00 2
28.50 2
40.50 2
45.50 2
0.75 2
32.50 2
34.50 1
55.50 1
0.92 1
36.50 1
12.00 1
53.00 1
14.50 1
0.67 1
20.50 1
23.50 1
24.50 1
0.42 1
Name: age, Length: 77, dtype: int64
Age = 10*(T["age"]//10)
Age = Age.astype(int)
Age.head()
Age.value_counts()
20 220
30 167
10 102
40 89
0 62
50 48
60 26
Name: age, dtype: int64
Age.astype(str)+"s"
0 20s
1 30s
2 20s
3 30s
4 30s
6 50s
7 0s
8 20s
9 10s
10 0s
11 50s
12 20s
13 30s
14 10s
15 50s
16 0s
18 30s
20 30s
21 30s
22 10s
23 20s
24 0s
25 30s
27 10s
30 40s
33 60s
34 20s
35 40s
37 20s
38 10s
...
856 40s
857 50s
858 20s
860 40s
861 20s
862 40s
864 20s
865 40s
866 20s
867 30s
869 0s
870 20s
871 40s
872 30s
873 40s
874 20s
875 10s
876 20s
877 10s
879 50s
880 20s
881 30s
882 20s
883 20s
884 20s
885 30s
886 20s
887 10s
889 20s
890 30s
Name: age, Length: 714, dtype: object
T.groupby(["sex", Age])["survived"].mean().unstack()
| sex | 0 | 10 | 20 | 30 | 40 | 50 | 60 |
|---|---|---|---|---|---|---|---|
| female | 0.633333 | 0.755556 | 0.722222 | 0.833333 | 0.687500 | 0.888889 | 1.000000 |
| male | 0.593750 | 0.122807 | 0.168919 | 0.214953 | 0.210526 | 0.133333 | 0.136364 |
T.age = Age
T.pivot_table("survived", index="sex", columns="age")
| sex | 0 | 10 | 20 | 30 | 40 | 50 | 60 |
|---|---|---|---|---|---|---|---|
| female | 0.633333 | 0.755556 | 0.722222 | 0.833333 | 0.687500 | 0.888889 | 1.000000 |
| male | 0.593750 | 0.122807 | 0.168919 | 0.214953 | 0.210526 | 0.133333 | 0.136364 |
titanic.describe()
|   | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
titanic.groupby("sex")[["survived"]].mean()
| sex | survived |
|---|---|
| female | 0.742038 |
| male | 0.188908 |
titanic.groupby("sex")["survived"].mean()
sex
female 0.742038
male 0.188908
Name: survived, dtype: float64
titanic.groupby(["sex", "class"])["survived"].aggregate("mean").unstack()
| sex | First | Second | Third |
|---|---|---|---|
| female | 0.968085 | 0.921053 | 0.500000 |
| male | 0.368852 | 0.157407 | 0.135447 |
- Pivot table: a more intuitive way to get the same result.
titanic.pivot_table("survived", index="sex", columns="class")  # returns the mean by default
| sex | First | Second | Third |
|---|---|---|---|
| female | 0.968085 | 0.921053 | 0.500000 |
| male | 0.368852 | 0.157407 | 0.135447 |
titanic.pivot_table("survived", index="sex", columns="class", aggfunc="mean", margins=True)  # aggfunc="mean" is the default; margins=True adds overall row and column totals
| sex | First | Second | Third | All |
|---|---|---|---|---|
| female | 0.968085 | 0.921053 | 0.500000 | 0.742038 |
| male | 0.368852 | 0.157407 | 0.135447 | 0.188908 |
| All | 0.629630 | 0.472826 | 0.242363 | 0.383838 |
titanic.pivot_table(index="sex", columns="class", aggfunc={"survived": "sum", "fare": "mean"})  # map each target column to its aggregation method
| sex | fare First | fare Second | fare Third | survived First | survived Second | survived Third |
|---|---|---|---|---|---|---|
| female | 106.125798 | 21.970121 | 16.118810 | 91 | 70 | 72 |
| male | 67.226127 | 19.741782 | 12.661633 | 45 | 17 | 47 |
11.7 Miscellaneous
(1) Vectorized string operations
(2) Handling time series
(3) Multi-level indexing: for multidimensional data
base_data = np.array([[1771, 11115 ],
[2154, 30320],
[2141, 14070],
[2424, 32680],
[1077, 7806],
[1303, 24222],
[798, 4789],
[981, 13468]])
data = pd.DataFrame(base_data,
                    index=[["BeiJing", "BeiJing", "ShangHai", "ShangHai",
                            "ShenZhen", "ShenZhen", "HangZhou", "HangZhou"],
                           [2008, 2018] * 4],
                    columns=["population", "GDP"])
data
|   |   | population | GDP |
|---|---|---|---|
| BeiJing | 2008 | 1771 | 11115 |
|   | 2018 | 2154 | 30320 |
| ShangHai | 2008 | 2141 | 14070 |
|   | 2018 | 2424 | 32680 |
| ShenZhen | 2008 | 1077 | 7806 |
|   | 2018 | 1303 | 24222 |
| HangZhou | 2008 | 798 | 4789 |
|   | 2018 | 981 | 13468 |
data.index.names = ["city", "year"]
data
| city | year | population | GDP |
|---|---|---|---|
| BeiJing | 2008 | 1771 | 11115 |
|   | 2018 | 2154 | 30320 |
| ShangHai | 2008 | 2141 | 14070 |
|   | 2018 | 2424 | 32680 |
| ShenZhen | 2008 | 1077 | 7806 |
|   | 2018 | 1303 | 24222 |
| HangZhou | 2008 | 798 | 4789 |
|   | 2018 | 981 | 13468 |
data["GDP"]
city year
BeiJing 2008 11115
2018 30320
ShangHai 2008 14070
2018 32680
ShenZhen 2008 7806
2018 24222
HangZhou 2008 4789
2018 13468
Name: GDP, dtype: int32
data.loc["ShangHai", "GDP"]
year
2008 14070
2018 32680
Name: GDP, dtype: int32
data.loc["ShangHai", 2018]["GDP"]
32680
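Cross-sections of a MultiIndex can also be taken with xs(), which selects on an inner level directly. A sketch that rebuilds the same frame (the index construction here is an equivalent reconstruction, not the code above):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["BeiJing", "ShangHai", "ShenZhen", "HangZhou"], [2008, 2018]],
    names=["city", "year"])
data = pd.DataFrame({"population": [1771, 2154, 2141, 2424, 1077, 1303, 798, 981],
                     "GDP": [11115, 30320, 14070, 32680, 7806, 24222, 4789, 13468]},
                    index=index)

# all cities for a single year, selected on the inner "year" level
print(data.xs(2018, level="year"))
```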
(4) High-performance pandas: eval()
df1, df2, df3, df4 = (pd.DataFrame(np.random.random((10000,100))) for i in range(4))
%timeit (df1+df2)/(df3+df4)
17.6 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
- Reduces memory allocated to intermediate results in compound expressions
%timeit pd.eval("(df1+df2)/(df3+df4)")
10.5 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
np.allclose((df1+df2)/(df3+df4), pd.eval("(df1+df2)/(df3+df4)"))
True
- Column-wise operations
df = pd.DataFrame(np.random.random((1000, 3)), columns=list("ABC"))
df.head()
|   | A | B | C |
|---|---|---|---|
| 0 | 0.418071 | 0.381836 | 0.500556 |
| 1 | 0.059432 | 0.749066 | 0.302429 |
| 2 | 0.489147 | 0.739153 | 0.777161 |
| 3 | 0.175441 | 0.016556 | 0.348979 |
| 4 | 0.766534 | 0.559252 | 0.310635 |
res_1 = pd.eval("(df.A+df.B)/(df.C-1)")
res_2 = df.eval("(A+B)/(C-1)")
np.allclose(res_1, res_2)
True
df["D"] = pd.eval("(df.A+df.B)/(df.C-1)")
df.head()
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.418071 | 0.381836 | 0.500556 | -1.601593 |
| 1 | 0.059432 | 0.749066 | 0.302429 | -1.159019 |
| 2 | 0.489147 | 0.739153 | 0.777161 | -5.512052 |
| 3 | 0.175441 | 0.016556 | 0.348979 | -0.294917 |
| 4 | 0.766534 | 0.559252 | 0.310635 | -1.923199 |
df.eval("D=(A+B)/(C-1)", inplace=True)
df.head()
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.418071 | 0.381836 | 0.500556 | -1.601593 |
| 1 | 0.059432 | 0.749066 | 0.302429 | -1.159019 |
| 2 | 0.489147 | 0.739153 | 0.777161 | -5.512052 |
| 3 | 0.175441 | 0.016556 | 0.348979 | -0.294917 |
| 4 | 0.766534 | 0.559252 | 0.310635 | -1.923199 |
- Using local variables (referenced with @)
column_mean = df.mean(axis=1)
res = df.eval("A+@column_mean")
res.head()
0 0.342788
1 0.047409
2 -0.387501
3 0.236956
4 0.694839
dtype: float64
(5) High-performance pandas: query()
df.head()
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.418071 | 0.381836 | 0.500556 | -1.601593 |
| 1 | 0.059432 | 0.749066 | 0.302429 | -1.159019 |
| 2 | 0.489147 | 0.739153 | 0.777161 | -5.512052 |
| 3 | 0.175441 | 0.016556 | 0.348979 | -0.294917 |
| 4 | 0.766534 | 0.559252 | 0.310635 | -1.923199 |
%timeit df[(df.A < 0.5) & (df.B > 0.5)]
1.11 ms ± 9.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.query("(A < 0.5)&(B > 0.5)")
2.55 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df.query("(A < 0.5)&(B > 0.5)").head()
|   | A | B | C | D |
|---|---|---|---|---|
| 1 | 0.059432 | 0.749066 | 0.302429 | -1.159019 |
| 2 | 0.489147 | 0.739153 | 0.777161 | -5.512052 |
| 7 | 0.073950 | 0.730144 | 0.646190 | -2.272672 |
| 10 | 0.393200 | 0.610467 | 0.697096 | -3.313485 |
| 11 | 0.065734 | 0.764699 | 0.179380 | -1.011958 |
np.allclose(df[(df.A < 0.5) & (df.B > 0.5)], df.query("(A < 0.5)&(B > 0.5)"))
True
(6) When to use eval() and query()
For small arrays, ordinary methods are actually faster.
df.values.nbytes
32000
df1.values.nbytes
8000000