本节为pandas进阶内容,核心还是DataFrame数据处理,注意包括缺失数据处理、函数的应用和映射、数据规整等。
第七节:pandas进阶
本节为pandas进阶内容,核心还是DataFrame数据处理,注意包括缺失数据处理、函数的应用和映射、数据规整等。
开始之前首先导入库:numpy和pandas
In [1]:
import pandas as pd
import numpy as np
一、缺失数据处理
还是获取MindGo平台的数据来演示:
In [4]:
# 获取招商银行近10个工作日的开盘价、最高价、最低价、收盘价。并将部分数据赋值为NAN,假设为缺失部分。
price= get_price('600036.SH', None, '20180125', '1d', ['open', 'high', 'low', 'close'], False, 'pre', 10, is_panel=1)
price[price > 34] = np.nan
price
Out[4]:
| close | high | low | open | |
|---|---|---|---|---|
| 2018-01-12 | 31.51 | 31.58 | 31.02 | 31.12 |
| 2018-01-15 | 31.94 | 32.40 | 31.30 | 31.48 |
| 2018-01-16 | 31.89 | 32.28 | 31.54 | 31.80 |
| 2018-01-17 | 31.69 | 33.11 | 31.50 | 32.10 |
| 2018-01-18 | 32.32 | 32.75 | 32.10 | 32.10 |
| 2018-01-19 | 32.46 | 33.35 | 32.21 | 32.66 |
| 2018-01-22 | 33.08 | 33.64 | 32.15 | 32.18 |
| 2018-01-23 | NaN | NaN | 33.20 | 33.20 |
| 2018-01-24 | 33.85 | NaN | 33.45 | NaN |
| 2018-01-25 | 33.41 | NaN | 32.90 | NaN |
1.去掉包含缺失值的行
In [6]:
price.dropna()
Out[6]:
| close | high | low | open | |
|---|---|---|---|---|
| 2018-01-12 | 31.51 | 31.58 | 31.02 | 31.12 |
| 2018-01-15 | 31.94 | 32.40 | 31.30 | 31.48 |
| 2018-01-16 | 31.89 | 32.28 | 31.54 | 31.80 |
| 2018-01-17 | 31.69 | 33.11 | 31.50 | 32.10 |
| 2018-01-18 | 32.32 | 32.75 | 32.10 | 32.10 |
| 2018-01-19 | 32.46 | 33.35 | 32.21 | 32.66 |
| 2018-01-22 | 33.08 | 33.64 | 32.15 | 32.18 |
2.对缺失值进行填充为30
In [7]:
price.fillna(value=30)
Out[7]:
| close | high | low | open | |
|---|---|---|---|---|
| 2018-01-12 | 31.51 | 31.58 | 31.02 | 31.12 |
| 2018-01-15 | 31.94 | 32.40 | 31.30 | 31.48 |
| 2018-01-16 | 31.89 | 32.28 | 31.54 | 31.80 |
| 2018-01-17 | 31.69 | 33.11 | 31.50 | 32.10 |
| 2018-01-18 | 32.32 | 32.75 | 32.10 | 32.10 |
| 2018-01-19 | 32.46 | 33.35 | 32.21 | 32.66 |
| 2018-01-22 | 33.08 | 33.64 | 32.15 | 32.18 |
| 2018-01-23 | 30.00 | 30.00 | 33.20 | 33.20 |
| 2018-01-24 | 33.85 | 30.00 | 33.45 | 30.00 |
| 2018-01-25 | 33.41 | 30.00 | 32.90 | 30.00 |
3.判断数据是否为nan
In [8]:
pd.isnull(price)
Out[8]:
| close | high | low | open | |
|---|---|---|---|---|
| 2018-01-12 | False | False | False | False |
| 2018-01-15 | False | False | False | False |
| 2018-01-16 | False | False | False | False |
| 2018-01-17 | False | False | False | False |
| 2018-01-18 | False | False | False | False |
| 2018-01-19 | False | False | False | False |
| 2018-01-22 | False | False | False | False |
| 2018-01-23 | True | True | False | False |
| 2018-01-24 | False | True | False | True |
| 2018-01-25 | False | True | False | True |
二、函数的应用和映射
再次获取MindGo平台的数据来演示:
In [13]:
# 获取招商银行近10个工作日的开盘价、最高价、最低价、收盘价。
price= get_price('600036.SH', None, '20180125', '1d', ['open', 'high', 'low', 'close'], False, 'pre', 10, is_panel=1)
price
Out[13]:
| close | high | low | open | |
|---|---|---|---|---|
| 2018-01-12 | 31.51 | 31.58 | 31.02 | 31.12 |
| 2018-01-15 | 31.94 | 32.40 | 31.30 | 31.48 |
| 2018-01-16 | 31.89 | 32.28 | 31.54 | 31.80 |
| 2018-01-17 | 31.69 | 33.11 | 31.50 | 32.10 |
| 2018-01-18 | 32.32 | 32.75 | 32.10 | 32.10 |
| 2018-01-19 | 32.46 | 33.35 | 32.21 | 32.66 |
| 2018-01-22 | 33.08 | 33.64 | 32.15 | 32.18 |
| 2018-01-23 | 34.05 | 34.17 | 33.20 | 33.20 |
| 2018-01-24 | 33.85 | 35.35 | 33.45 | 34.25 |
| 2018-01-25 | 33.41 | 34.02 | 32.90 | 34.01 |
1.列计算平均值
In [14]:
price.mean()
Out[14]:
close 32.620
high 33.265
low 32.137
open 32.490
dtype: float64
2.行计算平均值
In [15]:
price.mean(1)
Out[15]:
2018-01-12 31.3075
2018-01-15 31.7800
2018-01-16 31.8775
2018-01-17 32.1000
2018-01-18 32.3175
2018-01-19 32.6700
2018-01-22 32.7625
2018-01-23 33.6550
2018-01-24 34.2250
2018-01-25 33.5850
dtype: float64
如果你担心你求均值时受到缺失值的影响,你可以:
In [16]:
price.mean(axis = 1,skipna = True) # skipna参数默认是 True 表示排除缺失值 axis=1是按行 axis=0是按列
Out[16]:
2018-01-12 31.3075
2018-01-15 31.7800
2018-01-16 31.8775
2018-01-17 32.1000
2018-01-18 32.3175
2018-01-19 32.6700
2018-01-22 32.7625
2018-01-23 33.6550
2018-01-24 34.2250
2018-01-25 33.5850
dtype: float64
三、数据规整
Pandas提供了大量的方法能够轻松的对Series,DataFrame和Panel对象进行各种符合各种逻辑关系的合并操作,主要介绍三个常用操作。
| 操作方式 | 释义 |
|---|---|
| concat | 可以沿一条轴将多个对象堆叠到一起。 |
| append | 将一行连接到一个DataFrame上 |
| duplicated | 移除重复数据 |
1.concat
首先我们分别获取两个DataFrame。
In [18]:
# 获取招商银行20180125日的前5个工作日的开盘价、最高价、最低价、收盘价。
price1= get_price('600036.SH', None, '20180125', '1d', ['open', 'high', 'low', 'close'], False, 'pre', 5, is_panel=1)
price1
Out[18]:
| close | high | low | open | |
|---|---|---|---|---|
| 2018-01-19 | 32.46 | 33.35 | 32.21 | 32.66 |
| 2018-01-22 | 33.08 | 33.64 | 32.15 | 32.18 |
| 2018-01-23 | 34.05 | 34.17 | 33.20 | 33.20 |
| 2018-01-24 | 33.85 | 35.35 | 33.45 | 34.25 |
| 2018-01-25 | 33.41 | 34.02 | 32.90 | 34.01 |
In [19]:
# 获取招商银行20170125日的前5个工作日的开盘价、最高价、最低价、收盘价。
price2= get_price('600036.SH', None, '20170125', '1d', ['open', 'high', 'low', 'close'], False, 'pre', 5, is_panel=1)
price2
Out[19]:
| close | high | low | open | |
|---|---|---|---|---|
| 2017-01-19 | 18.52 | 18.74 | 18.51 | 18.55 |
| 2017-01-20 | 18.59 | 18.65 | 18.47 | 18.54 |
| 2017-01-23 | 18.50 | 18.76 | 18.41 | 18.68 |
| 2017-01-24 | 18.87 | 18.88 | 18.50 | 18.59 |
| 2017-01-25 | 18.88 | 18.95 | 18.68 | 18.81 |
纵向拼接(默认):
In [21]:
pd.concat([price1,price2],axis=0)
Out[21]:
| close | high | low | open | |
|---|---|---|---|---|
| 2018-01-19 | 32.46 | 33.35 | 32.21 | 32.66 |
| 2018-01-22 | 33.08 | 33.64 | 32.15 | 32.18 |
| 2018-01-23 | 34.05 | 34.17 | 33.20 | 33.20 |
| 2018-01-24 | 33.85 | 35.35 | 33.45 | 34.25 |
| 2018-01-25 | 33.41 | 34.02 | 32.90 | 34.01 |
| 2017-01-19 | 18.52 | 18.74 | 18.51 | 18.55 |
| 2017-01-20 | 18.59 | 18.65 | 18.47 | 18.54 |
| 2017-01-23 | 18.50 | 18.76 | 18.41 | 18.68 |
| 2017-01-24 | 18.87 | 18.88 | 18.50 | 18.59 |
| 2017-01-25 | 18.88 | 18.95 | 18.68 | 18.81 |
横向拼接,index对不上的会用 NaN 填充:
In [24]:
pd.concat([price1,price2],axis=1)
Out[24]:
| close | high | low | open | close | high | low | open | |
|---|---|---|---|---|---|---|---|---|
| 2017-01-19 | NaN | NaN | NaN | NaN | 18.52 | 18.74 | 18.51 | 18.55 |
| 2017-01-20 | NaN | NaN | NaN | NaN | 18.59 | 18.65 | 18.47 | 18.54 |
| 2017-01-23 | NaN | NaN | NaN | NaN | 18.50 | 18.76 | 18.41 | 18.68 |
| 2017-01-24 | NaN | NaN | NaN | NaN | 18.87 | 18.88 | 18.50 | 18.59 |
| 2017-01-25 | NaN | NaN | NaN | NaN | 18.88 | 18.95 | 18.68 | 18.81 |
| 2018-01-19 | 32.46 | 33.35 | 32.21 | 32.66 | NaN | NaN | NaN | NaN |
| 2018-01-22 | 33.08 | 33.64 | 32.15 | 32.18 | NaN | NaN | NaN | NaN |
| 2018-01-23 | 34.05 | 34.17 | 33.20 | 33.20 | NaN | NaN | NaN | NaN |
| 2018-01-24 | 33.85 | 35.35 | 33.45 | 34.25 | NaN | NaN | NaN | NaN |
| 2018-01-25 | 33.41 | 34.02 | 32.90 | 34.01 | NaN | NaN | NaN | NaN |
2.append
首先获取数据
In [25]:
# 获取招商银行20180125日的前5个工作日的开盘价、最高价、最低价、收盘价。
price= get_price('600036.SH', None, '20180125', '1d', ['open', 'high', 'low', 'close'], False, 'pre', 5, is_panel=1)
price
Out[25]:
| close | high | low | open | |
|---|---|---|---|---|
| 2018-01-19 | 32.46 | 33.35 | 32.21 | 32.66 |
| 2018-01-22 | 33.08 | 33.64 | 32.15 | 32.18 |
| 2018-01-23 | 34.05 | 34.17 | 33.20 | 33.20 |
| 2018-01-24 | 33.85 | 35.35 | 33.45 | 34.25 |
| 2018-01-25 | 33.41 | 34.02 | 32.90 | 34.01 |
In [26]:
s = price.iloc[0]
print(s)
close 32.46
high 33.35
low 32.21
open 32.66
Name: 2018-01-19 00:00:00, dtype: float64
In [27]:
price.append(s, ignore_index=False) # ignore_index=False 表示索引不变
Out[27]:
| close | high | low | open | |
|---|---|---|---|---|
| 2018-01-19 | 32.46 | 33.35 | 32.21 | 32.66 |
| 2018-01-22 | 33.08 | 33.64 | 32.15 | 32.18 |
| 2018-01-23 | 34.05 | 34.17 | 33.20 | 33.20 |
| 2018-01-24 | 33.85 | 35.35 | 33.45 | 34.25 |
| 2018-01-25 | 33.41 | 34.02 | 32.90 | 34.01 |
| 2018-01-19 | 32.46 | 33.35 | 32.21 | 32.66 |
In [28]:
price.append(s, ignore_index=True) # ignore_index=True 表示索引重置
Out[28]:
| close | high | low | open | |
|---|---|---|---|---|
| 0 | 32.46 | 33.35 | 32.21 | 32.66 |
| 1 | 33.08 | 33.64 | 32.15 | 32.18 |
| 2 | 34.05 | 34.17 | 33.20 | 33.20 |
| 3 | 33.85 | 35.35 | 33.45 | 34.25 |
| 4 | 33.41 | 34.02 | 32.90 | 34.01 |
| 5 | 32.46 | 33.35 | 32.21 | 32.66 |
3.移除重复数据duplicated
延续append示例
In [31]:
price2=price.append(s, ignore_index=False) # ignore_index=False 表示索引不变
price2
Out[31]:
| close | high | low | open | |
|---|---|---|---|---|
| 2018-01-19 | 32.46 | 33.35 | 32.21 | 32.66 |
| 2018-01-22 | 33.08 | 33.64 | 32.15 | 32.18 |
| 2018-01-23 | 34.05 | 34.17 | 33.20 | 33.20 |
| 2018-01-24 | 33.85 | 35.35 | 33.45 | 34.25 |
| 2018-01-25 | 33.41 | 34.02 | 32.90 | 34.01 |
| 2018-01-19 | 32.46 | 33.35 | 32.21 | 32.66 |
查看重复数据:
In [32]:
price2.duplicated()
Out[32]:
2018-01-19 False
2018-01-22 False
2018-01-23 False
2018-01-24 False
2018-01-25 False
2018-01-19 True
dtype: bool
移除重复数据:
In [33]:
price2.drop_duplicates()
Out[33]:
| close | high | low | open | |
|---|---|---|---|---|
| 2018-01-19 | 32.46 | 33.35 | 32.21 | 32.66 |
| 2018-01-22 | 33.08 | 33.64 | 32.15 | 32.18 |
| 2018-01-23 | 34.05 | 34.17 | 33.20 | 33.20 |
| 2018-01-24 | 33.85 | 35.35 | 33.45 | 34.25 |
| 2018-01-25 | 33.41 | 34.02 | 32.90 | 34.01 |
可以看到'2018-01-19'的重复行被删除了
查看以上策略详情请到supermind量化交易官网查看:同花顺Supermind量化交易 Python基础编程--pandas进阶