Python基础(十一) | 超详细的Pandas库三万字总结(下)

291 阅读15分钟

持续创作,加速成长!这是我参与「掘金日新计划 · 10 月更文挑战」的第9天,点击查看活动详情

11.4 缺失值处理

1、发现缺失值

import pandas as pd
import numpy as np
​
data = pd.DataFrame(np.array([[1, np.nan, 2],
                              [np.nan, 3, 4],
                              [5, 6, None]]), columns=["A", "B", "C"])
data

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABC
01NaN2
1NaN34
256None

注意:有None、字符串等,数据类型全部变为object,它比int和float更消耗资源

np.nan是一个特殊的浮点数,类型是浮点类型,所以表示缺失值时最好使用NaN。

data.dtypes
A    object
B    object
C    object
dtype: object
data.isnull()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABC
0FalseTrueFalse
1TrueFalseFalse
2FalseFalseTrue

data.notnull()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABC
0TrueFalseTrue
1FalseTrueTrue
2TrueTrueFalse

2、删除缺失值

data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                              [np.nan, 4, 5, 6],
                              [7, 8, np.nan, 9],
                              [10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
01.0NaN2.03.0
1NaN4.05.06.0
27.08.0NaN9.0
310.011.012.013.0

注意:np.nan是一种特殊的浮点数

data.dtypes
A    float64
B    float64
C    float64
D    float64
dtype: object

(1)删除整行

data.dropna()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
310.011.012.013.0

(2)删除整列

data.dropna(axis="columns")

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

D
03.0
16.0
29.0
313.0

data["D"] = np.nan
data

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
01.0NaN2.0NaN
1NaN4.05.0NaN
27.08.0NaNNaN
310.011.012.0NaN

data.dropna(axis="columns", how="all")

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABC
01.0NaN2.0
1NaN4.05.0
27.08.0NaN
310.011.012.0

all表示都是缺失值时才删除。

data.dropna(axis="columns", how="any")

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

0
1
2
3

data.loc[3] = np.nan
data

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
01.0NaN2.0NaN
1NaN4.05.0NaN
27.08.0NaNNaN
3NaNNaNNaNNaN

data.dropna(how="all")

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
01.0NaN2.0NaN
1NaN4.05.0NaN
27.08.0NaNNaN

3、填充缺失值

data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                              [np.nan, 4, 5, 6],
                              [7, 8, np.nan, 9],
                              [10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
01.0NaN2.03.0
1NaN4.05.06.0
27.08.0NaN9.0
310.011.012.013.0

data.fillna(value=5)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
01.05.02.03.0
15.04.05.06.0
27.08.05.09.0
310.011.012.013.0

  • 用均值进行替换
fill = data.mean()
fill
A    6.000000
B    7.666667
C    6.333333
D    7.750000
dtype: float64
data.fillna(value=fill)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
01.07.6666672.0000003.0
16.04.0000005.0000006.0
27.08.0000006.3333339.0
310.011.00000012.00000013.0

全部数据的平均值,先进行摊平,再进行填充即可。

fill = data.stack().mean()
fill
7.0
data.fillna(value=fill)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
01.07.02.03.0
17.04.05.06.0
27.08.07.09.0
310.011.012.013.0

11.5 合并数据

  • 构造一个生产DataFrame的函数
import pandas as pd
import numpy as np
​
def make_df(cols, ind):
    "一个简单的DataFrame"
    data = {c: [str(c)+str(i) for i in ind]  for c in cols}
    return pd.DataFrame(data, ind)
​
make_df("ABC", range(3))

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABC
0A0B0C0
1A1B1C1
2A2B2C2

  • 垂直合并
df_1 = make_df("AB", [1, 2])
df_2 = make_df("AB", [3, 4])
print(df_1)
print(df_2)
    A   B
1  A1  B1
2  A2  B2
    A   B
3  A3  B3
4  A4  B4
pd.concat([df_1, df_2])

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

AB
1A1B1
2A2B2
3A3B3
4A4B4

  • 水平合并
df_3 = make_df("AB", [0, 1])
df_4 = make_df("CD", [0, 1])
print(df_3)
print(df_4)
    A   B
0  A0  B0
1  A1  B1
    C   D
0  C0  D0
1  C1  D1
pd.concat([df_3, df_4], axis=1)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
0A0B0C0D0
1A1B1C1D1

  • 索引重叠

行重叠

df_5 = make_df("AB", [1, 2])
df_6 = make_df("AB", [1, 2])
print(df_5)
print(df_6)
    A   B
1  A1  B1
2  A2  B2
    A   B
1  A1  B1
2  A2  B2
pd.concat([df_5, df_6])

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

AB
1A1B1
2A2B2
1A1B1
2A2B2

pd.concat([df_5, df_6],ignore_index=True)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

AB
0A1B1
1A2B2
2A1B1
3A2B2

列重叠

df_7 = make_df("ABC", [1, 2])
df_8 = make_df("BCD", [1, 2])
print(df_7)
print(df_8)
    A   B   C
1  A1  B1  C1
2  A2  B2  C2
    B   C   D
1  B1  C1  D1
2  B2  C2  D2
pd.concat([df_7, df_8], axis=1)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCBCD
1A1B1C1B1C1D1
2A2B2C2B2C2D2

pd.concat([df_7, df_8],axis=1, ignore_index=True)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

012345
1A1B1C1B1C1D1
2A2B2C2B2C2D2

  • 对齐合并merge()
df_9 = make_df("AB", [1, 2])
df_10 = make_df("BC", [1, 2])
print(df_9)
print(df_10)
    A   B
1  A1  B1
2  A2  B2
    B   C
1  B1  C1
2  B2  C2
pd.merge(df_9, df_10)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABC
0A1B1C1
1A2B2C2

df_9 = make_df("AB", [1, 2])
df_10 = make_df("CB", [2, 1])
print(df_9)
print(df_10)
    A   B
1  A1  B1
2  A2  B2
    C   B
2  C2  B2
1  C1  B1
pd.merge(df_9, df_10)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABC
0A1B1C1
1A2B2C2

【例】 合并城市信息

population_dict = {"city": ("BeiJing", "HangZhou", "ShenZhen"),
                   "pop": (2154, 981, 1303)}
population = pd.DataFrame(population_dict)
population

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

citypop
0BeiJing2154
1HangZhou981
2ShenZhen1303

GDP_dict = {"city": ("BeiJing", "ShangHai", "HangZhou"),
            "GDP": (30320, 32680, 13468)}
GDP = pd.DataFrame(GDP_dict)
GDP

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

cityGDP
0BeiJing30320
1ShangHai32680
2HangZhou13468

city_info = pd.merge(population, GDP)
city_info

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

citypopGDP
0BeiJing215430320
1HangZhou98113468

这里outer是求并集

city_info = pd.merge(population, GDP, how="outer")
city_info

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

citypopGDP
0BeiJing2154.030320.0
1HangZhou981.013468.0
2ShenZhen1303.0NaN
3ShangHaiNaN32680.0

11.6 分组和数据透视表

image.png

df = pd.DataFrame({"key":["A", "B", "C", "C", "B", "A"],
                  "data1": range(6),
                  "data2": np.random.randint(0, 10, size=6)})
df

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

keydata1data2
0A01
1B14
2C29
3C39
4B41
5A59

(1)分组

  • 延迟计算
df.groupby("key")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002276795A240>

这说明已经分好了,等待我们用什么样的方法进行处理后,再显示。

df.groupby("key").sum()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

data1data2
key
A510
B56
C511

df.groupby("key").mean()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

data1data2
key
A2.55.0
B2.53.0
C2.55.5

可以打印看看这是什么东西:

for i in df.groupby("key"):
    print(str(i))
('A',   key  data1  data2
0   A      0      2
5   A      5      8)
('B',   key  data1  data2
1   B      1      2
4   B      4      4)
('C',   key  data1  data2
2   C      2      8
3   C      3      3)
  • 按列取值
df.groupby("key")["data2"].sum()
key
A    10
B     6
C    11
Name: data2, dtype: int32
  • 按组迭代
for data, group in df.groupby("key"):
    print("{0:5} shape={1}".format(data, group.shape))
A     shape=(2, 3)
B     shape=(2, 3)
C     shape=(2, 3)
  • 调用方法
df.groupby("key")["data1"].describe()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

countmeanstdmin25%50%75%max
key
A2.02.53.5355340.01.252.53.755.0
B2.02.52.1213201.01.752.53.254.0
C2.02.50.7071072.02.252.52.753.0

  • 支持更复杂的操作
df.groupby("key").aggregate(["min", "median", "max"])

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead tr th {
    text-align: left;
}
​
.dataframe thead tr:last-of-type th {
    text-align: right;
}

data1data2
minmedianmaxminmedianmax
key
A02.5525.08
B12.5423.04
C22.5335.58

  • 过滤
def filter_func(x):
    return x["data2"].std() > 3
df.groupby("key")["data2"].std()
key
A    4.242641
B    1.414214
C    3.535534
Name: data2, dtype: float64
df.groupby("key").filter(filter_func)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

keydata1data2
0A02
2C28
3C33
5A58

  • 转换
df

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

keydata1data2
0A02
1B12
2C28
3C33
4B44
5A58

df.groupby("key").transform(lambda x: x-x.mean())

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

data1data2
0-2.5-3.0
1-1.5-1.0
2-0.52.5
30.5-2.5
41.51.0
52.53.0

df

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

keydata1data2
0A01
1B14
2C29
3C39
4B41
5A59

df.groupby("key").apply(lambda x: x-x.mean())

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

data1data2
0-2.5-4.0
1-1.51.5
2-0.50.0
30.50.0
41.5-1.5
52.54.0

  • apply()方法
df

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

keydata1data2
0A02
1B12
2C28
3C33
4B44
5A58

def norm_by_data2(x):
    x["data1"] /= x["data2"].sum()
    return x
df.groupby("key").apply(norm_by_data2)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

keydata1data2
0A0.0000002
1B0.1666672
2C0.1818188
3C0.2727273
4B0.6666674
5A0.5000008

  • 将列表、数组设为分组键

这里的L相当于一个新的标签替代原来的行标签。

L = [0, 1, 0, 1, 2, 0]
df

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

keydata1data2
0A02
1B12
2C28
3C33
4B44
5A58

df.groupby(L).sum()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

data1data2
0718
145
244

  • 用字典将索引映射到分组
df2 = df.set_index("key")
df2

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

data1data2
key
A02
B12
C28
C33
B44
A58

mapping = {"A": "first", "B": "constant", "C": "constant"}
df2.groupby(mapping).sum()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

data1data2
constant1017
first510

  • 任意Python函数
df2.groupby(str.lower).mean()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

data1data2
a2.55.0
b2.53.0
c2.55.5

  • 多个有效值组成的列表

只有这两个数都相等,才会分到同一个组。

df2.groupby([str.lower, mapping]).mean()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

data1data2
afirst2.55.0
bconstant2.53.0
cconstant2.55.5

【例1】 行星观测数据处理

import seaborn as sns
​
planets = sns.load_dataset("planets")
planets.shape
(1035, 6)
planets.head()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

methodnumberorbital_periodmassdistanceyear
0Radial Velocity1269.3007.1077.402006
1Radial Velocity1874.7742.2156.952008
2Radial Velocity1763.0002.6019.842011
3Radial Velocity1326.03019.40110.622007
4Radial Velocity1516.22010.50119.472009

planets.describe()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

numberorbital_periodmassdistanceyear
count1035.000000992.000000513.000000808.0000001035.000000
mean1.7855072002.9175962.638161264.0692822009.070531
std1.24097626014.7283043.818617733.1164933.972567
min1.0000000.0907060.0036001.3500001989.000000
25%1.0000005.4425400.22900032.5600002007.000000
50%1.00000039.9795001.26000055.2500002010.000000
75%2.000000526.0050003.040000178.5000002012.000000
max7.000000730000.00000025.0000008500.0000002014.000000

planets.head()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

methodnumberorbital_periodmassdistanceyear
0Radial Velocity1269.3007.1077.402006
1Radial Velocity1874.7742.2156.952008
2Radial Velocity1763.0002.6019.842011
3Radial Velocity1326.03019.40110.622007
4Radial Velocity1516.22010.50119.472009

decade = 10 * (planets["year"] // 10)
decade.head()
0    2000
1    2000
2    2010
3    2000
4    2000
Name: year, dtype: int64
decade = decade.astype(str) + "s"
decade.name = "decade"
decade.head()
0    2000s
1    2000s
2    2010s
3    2000s
4    2000s
Name: decade, dtype: object
planets.head()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

methodnumberorbital_periodmassdistanceyear
0Radial Velocity1269.3007.1077.402006
1Radial Velocity1874.7742.2156.952008
2Radial Velocity1763.0002.6019.842011
3Radial Velocity1326.03019.40110.622007
4Radial Velocity1516.22010.50119.472009

planets.groupby(["method", decade]).sum()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

numberorbital_periodmassdistanceyear
methoddecade
Astrometry2010s21.262360e+030.0000035.754023
Eclipse Timing Variations2000s51.930800e+046.05000261.446025
2010s102.345680e+044.200001000.0012065
Imaging2000s291.350935e+060.00000956.8340139
2010s216.803750e+040.000001210.0836208
Microlensing2000s121.732500e+040.000000.0020070
2010s154.750000e+030.0000041440.0026155
Orbital Brightness Modulation2010s52.127920e+000.000002360.006035
Pulsar Timing1990s91.900153e+020.000000.005978
2000s13.652500e+040.000000.002003
2010s19.070629e-020.000001200.002011
Pulsation Timing Variations2000s11.170000e+030.000000.002007
Radial Velocity1980s18.388800e+0111.6800040.571989
1990s521.091561e+0468.17820723.7155943
2000s4752.633526e+05945.3192815201.16619775
2010s4241.809630e+05316.4789011382.67432451
Transit2000s642.897102e+020.0000031823.31124462
2010s7128.087813e+031.47000102419.46673999
Transit Timing Variations2010s92.393505e+020.000003313.008050

这里使用两个中括号[[]],取出来是DF类型的数据,而一个中括号[]取出来是Serios的数据,前者更美观一点。

planets.groupby(["method", decade])[["number"]].sum().unstack().fillna(0)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead tr th {
    text-align: left;
}
​
.dataframe thead tr:last-of-type th {
    text-align: right;
}

number
decade1980s1990s2000s2010s
method
Astrometry0.00.00.02.0
Eclipse Timing Variations0.00.05.010.0
Imaging0.00.029.021.0
Microlensing0.00.012.015.0
Orbital Brightness Modulation0.00.00.05.0
Pulsar Timing0.09.01.01.0
Pulsation Timing Variations0.00.01.00.0
Radial Velocity1.052.0475.0424.0
Transit0.00.064.0712.0
Transit Timing Variations0.00.00.09.0

(2)数据透视表

【例2】泰坦尼克号乘客数据分析

import seaborn as sns
​
titanic = sns.load_dataset("titanic")
titanic.head()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

survivedpclasssexagesibspparchfareembarkedclasswhoadult_maledeckembark_townalivealone
003male22.0107.2500SThirdmanTrueNaNSouthamptonnoFalse
111female38.01071.2833CFirstwomanFalseCCherbourgyesFalse
213female26.0007.9250SThirdwomanFalseNaNSouthamptonyesTrue
311female35.01053.1000SFirstwomanFalseCSouthamptonyesFalse
403male35.0008.0500SThirdmanTrueNaNSouthamptonnoTrue

T = titanic[titanic.age.notnull()].copy()
T.age.apply(lambda x: 60 if x>=60 else x)
T.age.value_counts()
24.00    30
22.00    27
60.00    26
18.00    26
28.00    25
30.00    25
19.00    25
21.00    24
25.00    23
36.00    22
29.00    20
35.00    18
32.00    18
27.00    18
26.00    18
31.00    17
16.00    17
34.00    15
20.00    15
33.00    15
23.00    15
39.00    14
40.00    13
17.00    13
42.00    13
45.00    12
38.00    11
4.00     10
50.00    10
2.00     10
         ..
8.00      4
5.00      4
11.00     4
6.00      3
7.00      3
46.00     3
30.50     2
57.00     2
0.83      2
55.00     2
10.00     2
59.00     2
13.00     2
28.50     2
40.50     2
45.50     2
0.75      2
32.50     2
34.50     1
55.50     1
0.92      1
36.50     1
12.00     1
53.00     1
14.50     1
0.67      1
20.50     1
23.50     1
24.50     1
0.42      1
Name: age, Length: 77, dtype: int64
Age = 10*(T["age"]//10)
Age = Age.astype(int)
Age.head()
Age.value_counts()
20    220
30    167
10    102
40     89
0      62
50     48
60     26
Name: age, dtype: int64
Age.astype(str)+"s"
0      20s
1      30s
2      20s
3      30s
4      30s
6      50s
7       0s
8      20s
9      10s
10      0s
11     50s
12     20s
13     30s
14     10s
15     50s
16      0s
18     30s
20     30s
21     30s
22     10s
23     20s
24      0s
25     30s
27     10s
30     40s
33     60s
34     20s
35     40s
37     20s
38     10s
      ... 
856    40s
857    50s
858    20s
860    40s
861    20s
862    40s
864    20s
865    40s
866    20s
867    30s
869     0s
870    20s
871    40s
872    30s
873    40s
874    20s
875    10s
876    20s
877    10s
879    50s
880    20s
881    30s
882    20s
883    20s
884    20s
885    30s
886    20s
887    10s
889    20s
890    30s
Name: age, Length: 714, dtype: object
T.groupby(["sex", Age])["survived"].mean().unstack()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

age0102030405060
sex
female0.6333330.7555560.7222220.8333330.6875000.8888891.000000
male0.5937500.1228070.1689190.2149530.2105260.1333330.136364

T.age = Age
T.pivot_table("survived", index="sex", columns="age")

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

age0102030405060
sex
female0.6333330.7555560.7222220.8333330.6875000.8888891.000000
male0.5937500.1228070.1689190.2149530.2105260.1333330.136364

titanic.describe()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

survivedpclassagesibspparchfare
count891.000000891.000000714.000000891.000000891.000000891.000000
mean0.3838382.30864229.6991180.5230080.38159432.204208
std0.4865920.83607114.5264971.1027430.80605749.693429
min0.0000001.0000000.4200000.0000000.0000000.000000
25%0.0000002.00000020.1250000.0000000.0000007.910400
50%0.0000003.00000028.0000000.0000000.00000014.454200
75%1.0000003.00000038.0000001.0000000.00000031.000000
max1.0000003.00000080.0000008.0000006.000000512.329200

titanic.groupby("sex")[["survived"]].mean()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

survived
sex
female0.742038
male0.188908

titanic.groupby("sex")["survived"].mean()
sex
female    0.742038
male      0.188908
Name: survived, dtype: float64
titanic.groupby(["sex", "class"])["survived"].aggregate("mean").unstack()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

classFirstSecondThird
sex
female0.9680850.9210530.500000
male0.3688520.1574070.135447

  • 数据透视表:用更直观的方式实现上面的功能。
titanic.pivot_table("survived", index="sex", columns="class") # 默认返回平均值

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

classFirstSecondThird
sex
female0.9680850.9210530.500000
male0.3688520.1574070.135447

titanic.pivot_table("survived", index="sex", columns="class", aggfunc="mean", margins=True) # aggfunc="mean"即为默认值 margins=True 会加一个总的列和总的行。

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

classFirstSecondThirdAll
sex
female0.9680850.9210530.5000000.742038
male0.3688520.1574070.1354470.188908
All0.6296300.4728260.2423630.383838

titanic.pivot_table(index="sex", columns="class", aggfunc={"survived": "sum", "fare": "mean"}) # 要处理的那一列和要处理的方法组成一个键值对。

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead tr th {
    text-align: left;
}
​
.dataframe thead tr:last-of-type th {
    text-align: right;
}

faresurvived
classFirstSecondThirdFirstSecondThird
sex
female106.12579821.97012116.118810917072
male67.22612719.74178212.661633451747

11.7 其他

(1)向量化字符串操作

(2) 处理时间序列

(3) 多级索引:用于多维数据

base_data = np.array([[1771, 11115 ],
                      [2154, 30320],
                      [2141, 14070],
                      [2424, 32680],
                      [1077, 7806],
                      [1303, 24222],
                      [798, 4789],
                      [981, 13468]]) 
data = pd.DataFrame(base_data, index=[["BeiJing","BeiJing","ShangHai","ShangHai","ShenZhen","ShenZhen","HangZhou","HangZhou"]\
                                     , [2008, 2018]*4], columns=["population", "GDP"])
data

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

populationGDP
BeiJing2008177111115
2018215430320
ShangHai2008214114070
2018242432680
ShenZhen200810777806
2018130324222
HangZhou20087984789
201898113468

data.index.names = ["city", "year"]
data

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

populationGDP
cityyear
BeiJing2008177111115
2018215430320
ShangHai2008214114070
2018242432680
ShenZhen200810777806
2018130324222
HangZhou20087984789
201898113468

data["GDP"]
city      year
BeiJing   2008    11115
          2018    30320
ShangHai  2008    14070
          2018    32680
ShenZhen  2008     7806
          2018    24222
HangZhou  2008     4789
          2018    13468
Name: GDP, dtype: int32
data.loc["ShangHai", "GDP"]
year
2008    14070
2018    32680
Name: GDP, dtype: int32
data.loc["ShangHai", 2018]["GDP"]
32680

(4) 高性能的Pandas:eval()

df1, df2, df3, df4 = (pd.DataFrame(np.random.random((10000,100))) for i in range(4))
%timeit (df1+df2)/(df3+df4)
17.6 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  • 减少了复合代数式计算中间过程的内存分配
%timeit pd.eval("(df1+df2)/(df3+df4)")
10.5 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
np.allclose((df1+df2)/(df3+df4), pd.eval("(df1+df2)/(df3+df4)"))
True
  • 实现列间运算
df = pd.DataFrame(np.random.random((1000, 3)), columns=list("ABC"))
df.head()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABC
00.4180710.3818360.500556
10.0594320.7490660.302429
20.4891470.7391530.777161
30.1754410.0165560.348979
40.7665340.5592520.310635

res_1 = pd.eval("(df.A+df.B)/(df.C-1)")
res_2 = df.eval("(A+B)/(C-1)")
np.allclose(res_1, res_2)
True
df["D"] = pd.eval("(df.A+df.B)/(df.C-1)")
df.head()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
00.4180710.3818360.500556-1.601593
10.0594320.7490660.302429-1.159019
20.4891470.7391530.777161-5.512052
30.1754410.0165560.348979-0.294917
40.7665340.5592520.310635-1.923199

df.eval("D=(A+B)/(C-1)", inplace=True)
df.head()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
00.4180710.3818360.500556-1.601593
10.0594320.7490660.302429-1.159019
20.4891470.7391530.777161-5.512052
30.1754410.0165560.348979-0.294917
40.7665340.5592520.310635-1.923199

  • 使用局部变量
column_mean = df.mean(axis=1)
res = df.eval("A+@column_mean")
res.head()
0    0.342788
1    0.047409
2   -0.387501
3    0.236956
4    0.694839
dtype: float64

(4) 高性能的Pandas:query()

df.head()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
00.4180710.3818360.500556-1.601593
10.0594320.7490660.302429-1.159019
20.4891470.7391530.777161-5.512052
30.1754410.0165560.348979-0.294917
40.7665340.5592520.310635-1.923199

%timeit df[(df.A < 0.5) & (df.B > 0.5)]
1.11 ms ± 9.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.query("(A < 0.5)&(B > 0.5)")
2.55 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df.query("(A < 0.5)&(B > 0.5)").head()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
10.0594320.7490660.302429-1.159019
20.4891470.7391530.777161-5.512052
70.0739500.7301440.646190-2.272672
100.3932000.6104670.697096-3.313485
110.0657340.7646990.179380-1.011958

np.allclose(df[(df.A < 0.5) & (df.B > 0.5)], df.query("(A < 0.5)&(B > 0.5)"))
True

(5)eval()和query()的使用时机

小数组时,普通方法反而更快

df.values.nbytes
32000
df1.values.nbytes
8000000