Python基础(十一) | 超详细的Pandas库三万字总结(上)

354 阅读34分钟

持续创作,加速成长!这是我参与「掘金日新计划 · 10 月更文挑战」的第8天,点击查看活动详情

image-20221002210956895

image-20221002211020709

引子

Numpy 在向量化的数值计算中表现优异

但是在处理更灵活、复杂的数据任务:

如为数据添加标签、处理缺失值、分组和透视表等方面

Numpy显得力不从心

而基于Numpy构建的Pandas库,提供了使得数据分析变得更快更简单的高级数据结构和操作工具

11.1 对象创建

11.1.1 Pandas Series对象

Series 是带标签数据的一维数组

Series对象的创建

通用结构: pd.Series(data, index=index, dtype=dtype)

data:数据,可以是列表,字典或Numpy数组

index:索引,为可选参数

dtype: 数据类型,为可选参数

1、用列表创建

  • index缺省,默认为整数序列
import pandas as pd
​
data = pd.Series([1.5, 3, 4.5, 6])
data
0    1.5
1    3.0
2    4.5
3    6.0
dtype: float64
  • 增加index
data = pd.Series([1.5, 3, 4.5, 6], index=["a", "b", "c", "d"])
data
a    1.5
b    3.0
c    4.5
d    6.0
dtype: float64
  • 增加数据类型

    缺省则从传入的数据自动判断

data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])    
data
a    1
b    2
c    3
d    4
dtype: int64
data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"], dtype="float")
data
a    1.0
b    2.0
c    3.0
d    4.0
dtype: float64

注意:数据支持多种类型

  • 混合后数据类型变为object
data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"])
data
a    1
b    2
c    3
d    4
dtype: object
data["a"]
1
data["c"]
'3'

数据类型可被强制改变

data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"], dtype=float)
data
a    1.0
b    2.0
c    3.0
d    4.0
dtype: float64
data["c"]
3.0

不能转为浮点数则会报错

data = pd.Series([1, 2, "a", 4], index=["a", "b", "c", "d"], dtype=float)
data
---------------------------------------------------------------------------
​
NameError                                 Traceback (most recent call last)
​
~\AppData\Local\Temp/ipykernel_9236/4046912764.py in <module>
----> 1 data = pd.Series([1, 2, "a", 4], index=["a", "b", "c", "d"], dtype=float)
      2 data
NameError: name 'pd' is not defined

2、用一维numpy数组创建

import numpy as np
​
x = np.arange(5)
pd.Series(x)
0    0
1    1
2    2
3    3
4    4
dtype: int32

3、用字典创建

  • 默认以键为index 值为data
population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981 }
population = pd.Series(population_dict)    
population
BeiJing     2154
ShangHai    2424
ShenZhen    1303
HangZhou     981
dtype: int64
  • 字典创建,如果指定index,则会到字典的键中筛选,找不到的,值设为NaN
population = pd.Series(population_dict, index=["BeiJing", "HangZhou", "c", "d"])    
population
BeiJing     2154.0
HangZhou     981.0
c              NaN
d              NaN
dtype: float64

4、data为标量的情况

pd.Series(5, index=[100, 200, 300])
100    5
200    5
300    5
dtype: int64

11.1.2 Pandas DataFrame对象

DataFrame 是带标签数据的多维数组

DataFrame对象的创建

通用结构: pd.DataFrame(data, index=index, columns=columns)

data:数据,可以是列表,字典或Numpy数组

index:索引,为可选参数

columns: 列标签,为可选参数

1、通过Series对象创建

population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981 }

population = pd.Series(population_dict)    
pd.DataFrame(population)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

0
BeiJing2154
ShangHai2424
ShenZhen1303
HangZhou981

pd.DataFrame(population, columns=["population"])

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

population
BeiJing2154
ShangHai2424
ShenZhen1303
HangZhou981

2、通过Series对象字典创建

GDP_dict = {"BeiJing": 30320,
            "ShangHai": 32680,
            "ShenZhen": 24222,
            "HangZhou": 13468 }
​
GDP = pd.Series(GDP_dict)
GDP
BeiJing     30320
ShangHai    32680
ShenZhen    24222
HangZhou    13468
dtype: int64
pd.DataFrame({"population": population,
              "GDP": GDP})

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

populationGDP
BeiJing215430320
ShangHai242432680
ShenZhen130324222
HangZhou98113468

注意:数量不够的会自动补齐

pd.DataFrame({"population": population,
              "GDP": GDP,
              "country": "China"})

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

populationGDPcountry
BeiJing215430320China
ShangHai242432680China
ShenZhen130324222China
HangZhou98113468China

3、通过字典列表对象创建

  • 字典索引作为index,字典键作为columns
import numpy as np
import pandas as pd
​
data = [{"a": i, "b": 2*i} for i in range(3)]
data
[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]
data = pd.DataFrame(data)
data

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ab
000
112
224

行的标签没有排,因此行从0开始,列的标签延续。

  • 从中取出一列数据
data1 = data["a"].copy()
data1
0    0
1    1
2    2
Name: a, dtype: int64
data1[0] = 10
data1
0    10
1     1
2     2
Name: a, dtype: int64
data

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ab
000
112
224

  • 不存在的键,会默认值为NaN
data = [{"a": 1, "b":1},{"b": 3, "c":4}]
data
[{'a': 1, 'b': 1}, {'b': 3, 'c': 4}]
pd.DataFrame(data)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

abc
01.01NaN
1NaN34.0

4、通过Numpy二维数组创建

data = np.random.randint(10, size=(3, 2))
data
array([[1, 6],
       [2, 9],
       [4, 0]])
pd.DataFrame(data, columns=["foo", "bar"], index=["a", "b", "c"])

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

foobar
a16
b29
c40

11.2 DataFrame性质

1、属性

data = pd.DataFrame({"pop": population, "GDP": GDP})
data

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

popGDP
BeiJing215430320
ShangHai242432680
ShenZhen130324222
HangZhou98113468

(1)df.values 返回numpy数组表示的数据

data.values
array([[ 2154, 30320],
       [ 2424, 32680],
       [ 1303, 24222],
       [  981, 13468]], dtype=int64)

(2)df.index 返回行索引

data.index
Index(['BeiJing', 'ShangHai', 'ShenZhen', 'HangZhou'], dtype='object')

(3)df.columns 返回列索引

data.columns
Index(['pop', 'GDP'], dtype='object')

(4)df.shape 形状

data.shape
(4, 2)

(5) pd.size 大小

data.size
8

(6)pd.dtypes 返回每列数据类型

data.dtypes
pop    int64
GDP    int64
dtype: object

2、索引

data

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

popGDP
BeiJing215430320
ShangHai242432680
ShenZhen130324222
HangZhou98113468

(1)获取列

  • 字典式
data["pop"]
BeiJing     2154
ShangHai    2424
ShenZhen    1303
HangZhou     981
Name: pop, dtype: int64
data[["GDP", "pop"]]

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

GDPpop
BeiJing303202154
ShangHai326802424
ShenZhen242221303
HangZhou13468981

  • 对象属性式
data.GDP
BeiJing     30320
ShangHai    32680
ShenZhen    24222
HangZhou    13468
Name: GDP, dtype: int64

(2)获取行

  • 绝对索引 df.loc
data.loc["BeiJing"]
pop     2154
GDP    30320
Name: BeiJing, dtype: int64
data.loc[["BeiJing", "HangZhou"]]

popGDP
BeiJing215430320
HangZhou98113468

  • 相对索引 df.iloc
data

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

popGDP
BeiJing215430320
ShangHai242432680
ShenZhen130324222
HangZhou98113468

data.iloc[0]
pop     2154
GDP    30320
Name: BeiJing, dtype: int64
data.iloc[[1, 3]]

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

popGDP
ShangHai242432680
HangZhou98113468

(3)获取标量

data

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

popGDP
BeiJing215430320
ShangHai242432680
ShenZhen130324222
HangZhou98113468

data.loc["BeiJing", "GDP"]
30320
data.iloc[0, 1]
30320
data.values[0][1]
30320

(4)Series对象的索引

type(data.GDP)
pandas.core.series.Series
GDP
BeiJing     30320
ShangHai    32680
ShenZhen    24222
HangZhou    13468
dtype: int64
GDP["BeiJing"]
30320

3、切片

dates = pd.date_range(start='2019-01-01', periods=6)
dates
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06'],
              dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
df

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
2019-01-01-0.935378-0.1907420.925984-0.818969
2019-01-02-0.234414-1.1946741.080779-2.294395
2019-01-03-0.1415720.0581181.1022481.207726
2019-01-040.3050880.535920-0.9784340.177251
2019-01-050.3133830.2340410.163155-0.296649
2019-01-060.250613-0.904400-0.858240-1.573342

(1)行切片

df["2019-01-01": "2019-01-03"]

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
2019-01-01-0.935378-0.1907420.925984-0.818969
2019-01-02-0.234414-1.1946741.080779-2.294395
2019-01-03-0.1415720.0581181.1022481.207726

df.loc["2019-01-01": "2019-01-03"]

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
2019-01-01-0.935378-0.1907420.925984-0.818969
2019-01-02-0.234414-1.1946741.080779-2.294395
2019-01-03-0.1415720.0581181.1022481.207726

df.iloc[0: 3]

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
2019-01-01-0.935378-0.1907420.925984-0.818969
2019-01-02-0.234414-1.1946741.080779-2.294395
2019-01-03-0.1415720.0581181.1022481.207726

注意:这里的3是取不到的。

(2)列切片

df

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
2019-01-01-0.935378-0.1907420.925984-0.818969
2019-01-02-0.234414-1.1946741.080779-2.294395
2019-01-03-0.1415720.0581181.1022481.207726
2019-01-040.3050880.535920-0.9784340.177251
2019-01-050.3133830.2340410.163155-0.296649
2019-01-060.250613-0.904400-0.858240-1.573342

df.loc[:, "A": "C"]

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABC
2019-01-01-0.935378-0.1907420.925984
2019-01-02-0.234414-1.1946741.080779
2019-01-03-0.1415720.0581181.102248
2019-01-040.3050880.535920-0.978434
2019-01-050.3133830.2340410.163155
2019-01-060.250613-0.904400-0.858240

df.iloc[:, 0: 3]

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABC
2019-01-01-0.935378-0.1907420.925984
2019-01-02-0.234414-1.1946741.080779
2019-01-03-0.1415720.0581181.102248
2019-01-040.3050880.535920-0.978434
2019-01-050.3133830.2340410.163155
2019-01-060.250613-0.904400-0.858240

(3)多种多样的取值

df

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
2019-01-01-0.935378-0.1907420.925984-0.818969
2019-01-02-0.234414-1.1946741.080779-2.294395
2019-01-03-0.1415720.0581181.1022481.207726
2019-01-040.3050880.535920-0.9784340.177251
2019-01-050.3133830.2340410.163155-0.296649
2019-01-060.250613-0.904400-0.858240-1.573342

  • 行、列同时切片
df.loc["2019-01-02": "2019-01-03", "C":"D"]

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

CD
2019-01-021.080779-2.294395
2019-01-031.1022481.207726

df.iloc[1: 3, 2:]

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

CD
2019-01-021.080779-2.294395
2019-01-031.1022481.207726

  • 行切片,列分散取值
df.loc["2019-01-04": "2019-01-06", ["A", "C"]]

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

AC
2019-01-040.305088-0.978434
2019-01-050.3133830.163155
2019-01-060.250613-0.858240

df.iloc[3:, [0, 2]]

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

AC
2019-01-040.305088-0.978434
2019-01-050.3133830.163155
2019-01-060.250613-0.858240

  • 行分散取值,列切片
df.loc[["2019-01-02", "2019-01-06"], "C": "D"]

上面这种方式是行不通的。

df.iloc[[1, 5], 0: 3]

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABC
2019-01-02-0.234414-1.1946741.080779
2019-01-060.250613-0.904400-0.858240

  • 行、列均分散取值
df.loc[["2019-01-04", "2019-01-06"], ["A", "D"]]

同样,上面这种方式是行不通的。

df.iloc[[1, 5], [0, 3]]

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

AD
2019-01-02-0.234414-2.294395
2019-01-060.250613-1.573342

4、布尔索引

相当于numpy当中的掩码操作。

df

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
2019-01-01-0.935378-0.1907420.925984-0.818969
2019-01-02-0.234414-1.1946741.080779-2.294395
2019-01-03-0.1415720.0581181.1022481.207726
2019-01-040.3050880.535920-0.9784340.177251
2019-01-050.3133830.2340410.163155-0.296649
2019-01-060.250613-0.904400-0.858240-1.573342

df > 0

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
2019-01-01FalseFalseTrueFalse
2019-01-02FalseFalseTrueFalse
2019-01-03FalseTrueTrueTrue
2019-01-04TrueTrueFalseTrue
2019-01-05TrueTrueTrueFalse
2019-01-06TrueFalseFalseFalse

df[df > 0]

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
2019-01-01NaNNaN0.925984NaN
2019-01-02NaNNaN1.080779NaN
2019-01-03NaN0.0581181.1022481.207726
2019-01-040.3050880.535920NaN0.177251
2019-01-050.3133830.2340410.163155NaN
2019-01-060.250613NaNNaNNaN

可以观察到,为true的部分都被取到了,而false没有。

df.A > 0
2019-01-01    False
2019-01-02    False
2019-01-03    False
2019-01-04     True
2019-01-05     True
2019-01-06     True
Freq: D, Name: A, dtype: bool
df[df.A > 0]

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
2019-01-040.3050880.535920-0.9784340.177251
2019-01-050.3133830.2340410.163155-0.296649
2019-01-060.250613-0.904400-0.858240-1.573342

  • isin()方法
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCDE
2019-01-01-0.935378-0.1907420.925984-0.818969one
2019-01-02-0.234414-1.1946741.080779-2.294395one
2019-01-03-0.1415720.0581181.1022481.207726two
2019-01-040.3050880.535920-0.9784340.177251three
2019-01-050.3133830.2340410.163155-0.296649four
2019-01-060.250613-0.904400-0.858240-1.573342three

ind = df2["E"].isin(["two", "four"])
ind     
2019-01-01    False
2019-01-02    False
2019-01-03     True
2019-01-04    False
2019-01-05     True
2019-01-06    False
Freq: D, Name: E, dtype: bool
df2[ind]

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCDE
2019-01-03-0.1415720.0581181.1022481.207726two
2019-01-050.3133830.2340410.163155-0.296649four

(5)赋值

df
  • DataFrame 增加新列
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20190101', periods=6))
s1
2019-01-01    1
2019-01-02    2
2019-01-03    3
2019-01-04    4
2019-01-05    5
2019-01-06    6
Freq: D, dtype: int64
df["E"] = s1
df

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCDE
2019-01-01-0.935378-0.1907420.925984-0.8189691
2019-01-02-0.234414-1.1946741.080779-2.2943952
2019-01-03-0.1415720.0581181.1022481.2077263
2019-01-040.3050880.535920-0.9784340.1772514
2019-01-050.3133830.2340410.163155-0.2966495
2019-01-060.250613-0.904400-0.858240-1.5733426

  • 修改赋值
df.loc["2019-01-01", "A"] = 0
df

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCDE
2019-01-010.000000-0.1907420.925984-0.8189691
2019-01-02-0.234414-1.1946741.080779-2.2943952
2019-01-03-0.1415720.0581181.1022481.2077263
2019-01-040.3050880.535920-0.9784340.1772514
2019-01-050.3133830.2340410.163155-0.2966495
2019-01-060.250613-0.904400-0.858240-1.5733426

df.iloc[0, 1] = 0
df

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCDE
2019-01-010.0000000.0000000.925984-0.8189691
2019-01-02-0.234414-1.1946741.080779-2.2943952
2019-01-03-0.1415720.0581181.1022481.2077263
2019-01-040.3050880.535920-0.9784340.1772514
2019-01-050.3133830.2340410.163155-0.2966495
2019-01-060.250613-0.904400-0.858240-1.5733426

df["D"] = np.array([5]*len(df))   # 可简化成df["D"] = 5
df

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCDE
2019-01-010.0000000.0000000.92598451
2019-01-02-0.234414-1.1946741.08077952
2019-01-03-0.1415720.0581181.10224853
2019-01-040.3050880.535920-0.97843454
2019-01-050.3133830.2340410.16315555
2019-01-060.250613-0.904400-0.85824056

  • 修改index和columns
df.index = [i for i in range(len(df))]
df

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCDE
00.0000000.0000000.92598451
1-0.234414-1.1946741.08077952
2-0.1415720.0581181.10224853
30.3050880.535920-0.97843454
40.3133830.2340410.16315555
50.250613-0.904400-0.85824056

df.columns = [i for i in range(df.shape[1])]
df

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

01234
00.0000000.0000000.92598451
1-0.234414-1.1946741.08077952
2-0.1415720.0581181.10224853
30.3050880.535920-0.97843454
40.3133830.2340410.16315555
50.250613-0.904400-0.85824056

11.3 数值运算及统计分析

image-20221002211052367

1、数据的查看

import pandas as pd
import numpy as np
​
dates = pd.date_range(start='2019-01-01', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
df

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
2019-01-01-0.8540430.412345-2.296051-0.048964
2019-01-021.371364-0.121454-0.2996531.095375
2019-01-03-0.714591-1.1032240.9792500.319455
2019-01-04-1.3975570.4260080.233861-1.651887
2019-01-050.4340260.459830-0.0954441.220302
2019-01-06-0.1338760.074500-1.0281470.605402

(1)查看前面的行

df.head()    # 默认5行,也可以进行设置

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
2019-01-01-0.8540430.412345-2.296051-0.048964
2019-01-021.371364-0.121454-0.2996531.095375
2019-01-03-0.714591-1.1032240.9792500.319455
2019-01-04-1.3975570.4260080.233861-1.651887
2019-01-050.4340260.459830-0.0954441.220302

df.head(2)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
2019-01-01-0.8540430.412345-2.296051-0.048964
2019-01-021.371364-0.121454-0.2996531.095375

(2)查看后面的行

df.tail()    # 默认5行

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
2019-01-021.371364-0.121454-0.2996531.095375
2019-01-03-0.714591-1.1032240.9792500.319455
2019-01-04-1.3975570.4260080.233861-1.651887
2019-01-050.4340260.459830-0.0954441.220302
2019-01-06-0.1338760.074500-1.0281470.605402

df.tail(3) 

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
2019-01-04-1.3975570.4260080.233861-1.651887
2019-01-050.4340260.459830-0.0954441.220302
2019-01-06-0.1338760.074500-1.0281470.605402

(3)查看总体信息

df.iloc[0, 3] = np.nan
df

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
2019-01-01-0.8540430.412345-2.296051NaN
2019-01-021.371364-0.121454-0.2996531.095375
2019-01-03-0.714591-1.1032240.9792500.319455
2019-01-04-1.3975570.4260080.233861-1.651887
2019-01-050.4340260.459830-0.0954441.220302
2019-01-06-0.1338760.074500-1.0281470.605402

df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2019-01-01 to 2019-01-06
Freq: D
Data columns (total 4 columns):
A    6 non-null float64
B    6 non-null float64
C    6 non-null float64
D    5 non-null float64
dtypes: float64(4)
memory usage: 240.0 bytes

2、Numpy通用函数同样适用于Pandas

(1)向量化运算

x = pd.DataFrame(np.arange(4).reshape(1, 4))
x

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

0123
00123

x+5

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

0123
05678

np.exp(x)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

0123
01.02.7182827.38905620.085537

y = pd.DataFrame(np.arange(4,8).reshape(1, 4))
y

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

0123
04567

x*y

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

0123
0051221

(2)矩阵化运算

np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(30, 30)))
x

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

0123456789...20212223242526272829
06374692674...4095809263
18242648613...2031731559
23519193768...6870772072
32049698687...0242049668
49926033466...9686008838
52657840297...2040700115
66400214956...5085233292
72363807617...3010446882
82237570730...1152830304
93776200256...4232004528
104704203460...5619190708
115696921879...6528959950
123955407446...0729694946
138409901587...5840349946
143046995431...6103712002
154200791212...6394173848
163948720231...8003852038
172863294428...6942618990
185679819144...3525699262
191937860280...4322381800
204552689757...3508043251
212481971467...0182046504
224524644499...1769915521
230548064412...8507692043
249709037415...3782219224
254195450489...9307023759
266719726261...0659803839
272813517702...8045455637
286862274375...1792459532
293030095432...1304808756

30 rows × 30 columns

  • 转置
z = x.T
z

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

0123456789...20212223242526272829
06832926223...4240946263
13250964327...5455717880
27414250637...5824091163
34299670376...2148959320
46616082852...6960047520
59499341070...8746352179
62838304700...9144706745
76676429672...7444442734
87168695135...5691186073
94387676706...7792591252
103871353075...4026419910
117140676835...7505105835
127916287855...9041292431
132846530172...3185882557
145977105635...9001698359
154494907927...7421686904
161182894281...9931584176
177387433624...1802759759
185605561980...4501376521
191782315810...8073708487
204260925314...3018390811
210082600012...5175736073
229374848153...0860805490
235102605022...8297279524
248770072480...0096208448
250374003430...4419120550
269129803604...3652933598
272506812835...2550278657
286576319802...5024253335
293928852248...1413499726

30 rows × 30 columns

np.random.seed(1)
y = pd.DataFrame(np.random.randint(10, size=(30, 30)))
y

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

0123456789...20212223242526272829
05895001769...1706997691
10188398736...9204927798
26937745936...7711308645
36257844774...0198231272
46092662770...1540789570
59391446889...1870342035
61243060728...4336735324
74033835675...1731669696
80029606703...6795495256
96877726052...7062436763
100647629599...4939125408
112399448216...0598660473
120160616425...8807207119
135159649875...2432004250
140385314732...8557591393
153361305052...7177383063
160659646622...3686513263
176728018600...5625430621
189440987761...7997114656
194115126233...0009859340
209863990816...2901394888
212864905561...6756874240
220359036511...6253939519
237708612044...1960283725
246042310570...1127529473
255021494693...5535927416
269818162618...2512533618
271864695472...9315117126
280774327852...0283739238
298026836497...6785725345

30 rows × 30 columns

x.dot(y)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

0123456789...20212223242526272829
0616560723739612457681799575590...523739613580668602733585657700
1520438691600612455666764707592...555681503679641506779494633590
2557570786807690469804828704573...563675712758793672754550756638
3605507664701660496698806651575...582685668586629534678484591626
4599681753873721563754770620654...633747661677726649716610735706
5422354602627613396617627489423...456572559537499384589436574507
6359446599599481357577572451464...449550495532633554663476565602
7531520698590607537665696571472...576588551665652527742528650599
8449322547533593399584638587424...402596523523447362561386529484
9373433525601522345551521434447...508498438478459418488407503496
10500427574607667477652656615477...622702531610558532598471582561
11664694772841779574730810711608...591760616638721676846678754708
12545547687701721576689724710532...674684648694710564757571671656
13574586723750691494696787667523...618681568682715644756557690604
14502382645557570403538677500501...369650507576546531554437616463
15510505736651649510719733694557...605717574642678576755455598654
16567376614612643514598724547464...456639520560569442596517659532
17626716828765740603809852692591...664716655721742612819593744712
18600559667664641556624815638564...581701559677710554748597614657
19445431661681641552690719602474...515637576620572512599455622538
20523569784725713501740772638640...589775664686726672747548723645
21487465553639517449592609454398...492567534404554417561466498492
22479449574686583377566614563455...453539491501596520722478565501
23483386476526550426492585536482...322541438456487408502426474481
24523551658767537444663731576577...522590525664691548635526641538
25652656738753853508752815669576...694833693606575616704559728672
26578577744856699497779800733587...630754704834760680765592731629
27554494665689630574695703636599...554685532658649554693577634668
28498552659784552492690775544551...567636518599742521733533605604
29513491563642477367589647516484...428574504548553483540407547455

30 rows × 30 columns

%timeit x.dot(y)
218 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.dot(x, y)
81.1 µs ± 2.85 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  • 执行相同运算,Numpy与Pandas的对比
x1 = np.array(x)
x1
y1 = np.array(y)
y1
%timeit x1.dot(y1)
22.1 µs ± 992 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.dot(x1, y1)
22.6 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.dot(x.values, y.values)
42.9 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
x2 = list(x1)
y2 = list(y1)
x3 = []
y3 = []
for i in x2:
    res = []
    for j in i:
        res.append(int(j))
    x3.append(res)
for i in y2:
    res = []
    for j in i:
        res.append(int(j))
    y3.append(res)
def f(x, y):
    res = []
    for i in range(len(x)):
        row = []
        for j in range(len(y[0])):
            sum_row = 0
            for k in range(len(x[0])):
                sum_row += x[i][k]*y[k][j]
            row.append(sum_row)
        res.append(row)
    return res          
%timeit f(x3, y3)
4.29 ms ± 207 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

一般来说,纯粹的计算在Numpy里执行的更快

Numpy更侧重于计算,Pandas更侧重于数据处理

(3)广播运算

np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=list("ABC"))
x

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABC
0637
1469
2267

  • 按行广播
x.iloc[0]
A    6
B    3
C    7
Name: 0, dtype: int32
x/x.iloc[0]

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABC
01.0000001.01.000000
10.6666672.01.285714
20.3333332.01.000000

  • 按列广播
x.A
0    6
1    4
2    2
Name: A, dtype: int32
x.div(x.A, axis=0)             # add sub div mul

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABC
01.00.51.166667
11.01.52.250000
21.03.03.500000

x.div(x.iloc[0], axis=1)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABC
01.0000001.01.000000
10.6666672.01.285714
20.3333332.01.000000

3、新的用法

(1)索引对齐

A = pd.DataFrame(np.random.randint(0, 20, size=(2, 2)), columns=list("AB"))
A

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

AB
037
121

B = pd.DataFrame(np.random.randint(0, 10, size=(3, 3)), columns=list("ABC"))
B

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABC
0751
1409
2580

  • pandas会自动对齐两个对象的索引,没有的值用np.nan表示
A+B

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABC
010.012.0NaN
16.01.0NaN
2NaNNaNNaN

  • 缺省值也可用fill_value来填充
A.add(B, fill_value=0)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABC
010.012.01.0
16.01.09.0
25.08.00.0

A*B

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABC
021.035.0NaN
18.00.0NaN
2NaNNaNNaN

(2)统计相关

  • 数据种类统计
y = np.random.randint(3, size=20)
y
array([2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 1])
np.unique(y)
array([0, 1, 2])

用Counter方法统计数据

from collections import Counter
Counter(y)
Counter({2: 11, 1: 5, 0: 4})
y1 = pd.DataFrame(y, columns=["A"])
y1

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

A
02
12
22
31
42
51
61
72
81
92
102
110
122
130
142
152
160
170
182
191

np.unique(y1)

有value counter的方法

y1["A"].value_counts()
2    11
1     5
0     4
Name: A, dtype: int64
  • 产生新的结果,并进行排序
population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981 }
population = pd.Series(population_dict) 
​
GDP_dict = {"BeiJing": 30320,
            "ShangHai": 32680,
            "ShenZhen": 24222,
            "HangZhou": 13468 }
GDP = pd.Series(GDP_dict)
​
city_info = pd.DataFrame({"population": population,"GDP": GDP})
city_info

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

populationGDP
BeiJing215430320
ShangHai242432680
ShenZhen130324222
HangZhou98113468

city_info["per_GDP"] = city_info["GDP"]/city_info["population"]
city_info

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

populationGDPper_GDP
BeiJing21543032014.076137
ShangHai24243268013.481848
ShenZhen13032422218.589409
HangZhou9811346813.728848

递增排序

city_info.sort_values(by="per_GDP")

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

populationGDPper_GDP
ShangHai24243268013.481848
HangZhou9811346813.728848
BeiJing21543032014.076137
ShenZhen13032422218.589409

递减排序

city_info.sort_values(by="per_GDP", ascending=False)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

populationGDPper_GDP
ShenZhen13032422218.589409
BeiJing21543032014.076137
HangZhou9811346813.728848
ShangHai24243268013.481848

按轴进行排序

data = pd.DataFrame(np.random.randint(20, size=(3, 4)), index=[2, 1, 0], columns=list("CBAD"))
data

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

CBAD
2313178
1119146
0117142

行排序

data.sort_index()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

CBAD
0117142
1119146
2313178

列排序

data.sort_index(axis=1)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
2171338
1141916
0147112

data.sort_index(axis=1, ascending=False)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

DCBA
2831317
1611914
0211714

  • 统计方法
df = pd.DataFrame(np.random.normal(2, 4, size=(6, 4)),columns=list("ABCD"))
df

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
01.0821983.557396-3.0604766.367969
113.1132526.7745592.8745535.527044
2-2.036341-4.3331775.094802-0.152567
3-3.386712-1.522365-2.5222092.537716
44.3284915.5509945.5773295.019991
51.171336-0.493910-4.0326136.398588

非空个数

df.count()
A    6
B    6
C    6
D    6
dtype: int64

求和

df.sum()
A    14.272224
B     9.533497
C     3.931385
D    25.698741
dtype: float64
df.sum(axis=1)
0     7.947086
1    28.289408
2    -1.427283
3    -4.893571
4    20.476806
5     3.043402
dtype: float64

最大值 最小值

df.min()
A   -3.386712
B   -4.333177
C   -4.032613
D   -0.152567
dtype: float64
df.max(axis=1)
0     6.367969
1    13.113252
2     5.094802
3     2.537716
4     5.577329
5     6.398588
dtype: float64
df

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
01.0821983.557396-3.0604766.367969
113.1132526.7745592.8745535.527044
2-2.036341-4.3331775.094802-0.152567
3-3.386712-1.522365-2.5222092.537716
44.3284915.5509945.5773295.019991
51.171336-0.493910-4.0326136.398588

df.idxmax()
A    1
B    1
C    4
D    5
dtype: int64

均值

df.mean()
A    2.378704
B    1.588916
C    0.655231
D    4.283124
dtype: float64

方差

df.var()
A    34.980702
B    19.110656
C    18.948144
D     6.726776
dtype: float64

标准差

df.std()
A    5.914449
B    4.371574
C    4.352947
D    2.593603
dtype: float64

中位数

df.median()
A    1.126767
B    1.531743
C    0.176172
D    5.273518
dtype: float64

众数

data = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list("AB"))
data

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

AB
042
132
220
324
420
541
620
711
834
920

data.mode()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

AB
020

75%分位数

df.quantile(0.75)
A    3.539202
B    5.052594
C    4.539740
D    6.157738
Name: 0.75, dtype: float64
  • 用describe()可以获取所有属性
df.describe()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
count6.0000006.0000006.0000006.000000
mean2.3787041.5889160.6552314.283124
std5.9144494.3715744.3529472.593603
min-3.386712-4.333177-4.032613-0.152567
25%-1.256706-1.265251-2.9259103.158284
50%1.1267671.5317430.1761725.273518
75%3.5392025.0525944.5397406.157738
max13.1132526.7745595.5773296.398588

data_2 = pd.DataFrame([["a", "a", "c", "d"],
                       ["c", "a", "c", "b"],
                       ["a", "a", "d", "c"]], columns=list("ABCD"))
data_2

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
0aacd
1cacb
2aadc

  • 字符串类型的describe
data_2.describe()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
count3333
unique2123
topaacd
freq2321

相关性系数和协方差

df.corr()

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
A1.0000000.8310630.3310600.510821
B0.8310631.0000000.1792440.719112
C0.3310600.1792441.000000-0.450365
D0.5108210.719112-0.4503651.000000

df.corrwith(df["A"])
A    1.000000
B    0.831063
C    0.331060
D    0.510821
dtype: float64

自定义输出

apply(method)的用法:使用method方法默认对每一列进行相应的操作

df

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
01.0821983.557396-3.0604766.367969
113.1132526.7745592.8745535.527044
2-2.036341-4.3331775.094802-0.152567
3-3.386712-1.522365-2.5222092.537716
44.3284915.5509945.5773295.019991
51.171336-0.493910-4.0326136.398588

df.apply(np.cumsum)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
01.0821983.557396-3.0604766.367969
114.19545010.331955-0.18592311.895013
212.1591095.9987784.90887811.742447
38.7723974.4764132.38666914.280162
413.10088810.0274067.96399919.300153
514.2722249.5334973.93138525.698741

df.apply(np.cumsum, axis=1)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
01.0821984.6395941.5791177.947086
113.11325219.88781122.76236428.289408
2-2.036341-6.369518-1.274717-1.427283
3-3.386712-4.909077-7.431287-4.893571
44.3284919.87948515.45681420.476806
51.1713360.677427-3.3551863.043402

df.apply(sum)
A    14.272224
B     9.533497
C     3.931385
D    25.698741
dtype: float64
df.sum()
A    14.272224
B     9.533497
C     3.931385
D    25.698741
dtype: float64
df.apply(lambda x: x.max()-x.min())
A    16.499965
B    11.107736
C     9.609942
D     6.551155
dtype: float64
def my_describe(x):
    return pd.Series([x.count(), x.mean(), x.max(), x.idxmin(), x.std()], \
                     index=["Count", "mean", "max", "idxmin", "std"])
df.apply(my_describe)

​​

.dataframe tbody tr th {
    vertical-align: top;
}
​
.dataframe thead th {
    text-align: right;
}

ABCD
Count6.0000006.0000006.0000006.000000
mean2.3787041.5889160.6552314.283124
max13.1132526.7745595.5773296.398588
idxmin3.0000002.0000005.0000002.000000
std5.9144494.3715744.3529472.593603