pandas 是基于 Numpy 构建的,让以 Numpy 为中心的应用变得更加简单。pandas是公认的数据处理利器,本章内容主要介绍DataFrame数据结构,在此基础上进行数据处理。除了DataFrame格式,pandas 还包括series、Panel。
第六节:pandas基础
pandas 是基于 Numpy 构建的,让以 Numpy 为中心的应用变得更加简单。pandas是公认的数据处理利器,本章内容主要介绍DataFrame数据结构,在此基础上进行数据处理。除了DataFrame格式,pandas 还包括series、Panel。
| 格式 | 数组 | 释义 |
|---|---|---|
| Series | 一维数组 | 与Numpy中的一维array类似。 |
| DataFrame | 二维的表格型数据结构 | 可以将DataFrame理解为Series的容器 |
| Panel | 三维的数组 | 可以理解为DataFrame的容器 |
开始之前,我们首先掌握导入pandas库,方式如下:
In [ ]:
import pandas as pd
注意:以下内容必须在导入pandas库之后才能运行。
一、Series和DataFrame介绍
1.Series
由一组数据和与之相关的索引组成。可通过传递一个list对象来创建一个Series,pandas会默认创建整型索引。
创建一个Series:
In [2]:
s = pd.Series([1,3,5,7,6,8])
print(s)
0 1
1 3
2 5
3 7
4 6
5 8
dtype: int64
获取 Series 的索引:
In [3]:
s.index
Out[3]:
RangeIndex(start=0, stop=6, step=1)
2.DataFrame
DataFrame是一个表格型的数据结构,它含有一组有序的列,每一列的数据结构都是相同的,而不同的列之间则可以是不同的数据结构。DataFrame中的每一行是一个记录,名称为Index的一个元素,而每一列则为一个字段,是这个记录的一个属性,DataFrame既有行索引也有列索引。
创建DataFrame
首先来看如何从字典创建DataFrame。
In [6]:
d = {'one': [1, 2, 3], 'two': [1, 2, 3]}
df = pd.DataFrame(d,index=['a', 'b', 'c'])
print(df)
one two
a 1 1
b 2 2
c 3 3
可以使用dataframe.index和dataframe.columns来查看DataFrame的行和列,dataframe.values则以数组的形式返回DataFrame的元素:
In [12]:
print(df.index) #查看行
print(df.columns) #查看列
print(df.values) #查看元素
Index(['a', 'b', 'c'], dtype='object')
Index(['one', 'two'], dtype='object')
[[1 1]
[2 2]
[3 3]]
DataFrame从值是数组的字典创建时,其各个数组的长度需要相同,加强印象,可参考以下报错的例子。
In [13]:
d = {'one': [1, 2], 'two': [1, 2, 3]}
df = pd.DataFrame(d,index=['a', 'b', 'c'])
print(df)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/opt/conda/lib/python3.5/site-packages/pandas/core/internals.py in create_block_manager_from_arrays(arrays, names, axes)
4308 try:
-> 4309 blocks = form_blocks(arrays, names, axes)
4310 mgr = BlockManager(blocks, axes)
/opt/conda/lib/python3.5/site-packages/pandas/core/internals.py in form_blocks(arrays, names, axes)
4380 if len(int_items):
-> 4381 int_blocks = _multi_blockify(int_items)
4382 blocks.extend(int_blocks)
/opt/conda/lib/python3.5/site-packages/pandas/core/internals.py in _multi_blockify(tuples, dtype)
4449
-> 4450 values, placement = _stack_arrays(list(tup_block), dtype)
4451
/opt/conda/lib/python3.5/site-packages/pandas/core/internals.py in _stack_arrays(tuples, dtype)
4494 for i, arr in enumerate(arrays):
-> 4495 stacked[i] = _asarray_compat(arr)
4496
ValueError: could not broadcast input array from shape (3) into shape (2)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-13-26bdeb89a47c> in <module>()
1 d = {'one': [1, 2], 'two': [1, 2, 3]}
----> 2 df = pd.DataFrame(d,index=['a', 'b', 'c'])
3 print(df)
/opt/conda/lib/python3.5/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
273 dtype=dtype, copy=copy)
274 elif isinstance(data, dict):
--> 275 mgr = self._init_dict(data, index, columns, dtype=dtype)
276 elif isinstance(data, ma.MaskedArray):
277 import numpy.ma.mrecords as mrecords
/opt/conda/lib/python3.5/site-packages/pandas/core/frame.py in _init_dict(self, data, index, columns, dtype)
409 arrays = [data[k] for k in keys]
410
--> 411 return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
412
413 def _init_ndarray(self, values, index, columns, dtype=None, copy=False):
/opt/conda/lib/python3.5/site-packages/pandas/core/frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
5602 axes = [_ensure_index(columns), _ensure_index(index)]
5603
-> 5604 return create_block_manager_from_arrays(arrays, arr_names, axes)
5605
5606
/opt/conda/lib/python3.5/site-packages/pandas/core/internals.py in create_block_manager_from_arrays(arrays, names, axes)
4312 return mgr
4313 except ValueError as e:
-> 4314 construction_error(len(arrays), arrays[0].shape, axes, e)
4315
4316
/opt/conda/lib/python3.5/site-packages/pandas/core/internals.py in construction_error(tot_items, block_shape, axes, e)
4278 raise ValueError("Empty data passed with indices specified.")
4279 raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 4280 passed, implied))
4281
4282
ValueError: Shape of passed values is (2, 2), indices imply (2, 3)
如果DataFrame的值是非数组时,没有这一限制,且自动将缺失值补成NaN。如下示例
In [14]:
d= [{'a': 1.6, 'b': 2}, {'a': 3, 'b': 6, 'c': 9}]
df = pd.DataFrame(d)
print(df)
a b c
0 1.6 2 NaN
1 3.0 6 9.0
在实际处理数据时,有时需要创建一个空的DataFrame,可以这么做:
In [15]:
df = pd.DataFrame()
print(df)
Empty DataFrame
Columns: []
Index: []
另一种创建DataFrame的方法十分有用,那就是使用concat函数创建DataFrame,其主要是通过两个行或列相同的DataFrame链接成一个。
In [16]:
a= [{'a': 1.6, 'b': 2}, {'a': 3, 'b': 6}]
df1 = pd.DataFrame(a)
b= [{'a': 4, 'b': 5}]
df2 = pd.DataFrame(b)
df = pd.concat([df1, df2], axis=0)
print(df1)
print(df2)
print(df)
a b
0 1.6 2
1 3.0 6
a b
0 4 5
a b
0 1.6 2
1 3.0 6
0 4.0 5
注意:concat函数内有axis参数,其中的axis=1表示按列进行合并,axis=0表示按行合并
二、数据查看
MindGO量化交易平台上大部分获取数据的函数,最终以DataFrame或Dict(字典)格式呈现。接下来重点介绍DataFrame格式的数据查看,数据处理。以MindGO平台获取的数据为例进行讲解:
该部分内容需在MindGo研究环境中练习。
In [18]:
# 获取贵州茅台近10个工作日的开盘价、最高价、最低价、收盘价,获取格式即为DataFrame
price= get_price('600519.SH', None, '20180125', '1d', ['open', 'high', 'low', 'close'], False, 'pre', 20, is_panel=1)
print(price)
close high low open
2017-12-28 718.69 719.90 671.32 687.00
2017-12-29 697.49 726.50 691.60 718.00
2018-01-02 703.85 710.16 689.89 700.00
2018-01-03 715.86 721.40 699.74 701.50
2018-01-04 737.07 743.50 719.33 721.40
2018-01-05 738.36 746.03 728.22 741.00
2018-01-08 752.13 756.50 735.02 735.02
2018-01-09 782.52 783.00 752.21 752.21
2018-01-10 785.71 788.88 773.48 785.00
2018-01-11 774.81 788.00 772.00 787.00
2018-01-12 788.42 788.80 767.02 773.77
2018-01-15 785.37 799.06 779.02 793.46
2018-01-16 772.94 788.61 768.00 780.48
2018-01-17 747.93 774.00 738.51 770.00
2018-01-18 750.74 765.00 744.09 747.93
2018-01-19 750.18 758.90 739.02 752.90
2018-01-22 773.64 774.00 751.81 751.81
2018-01-23 773.78 780.00 768.60 777.81
2018-01-24 764.46 776.46 758.60 776.44
2018-01-25 769.16 776.00 751.00 761.00
以下为数据查看常用的八项操作:
1.查看前几条数据:
In [19]:
price.head()
Out[19]:
| close | high | low | open | |
|---|---|---|---|---|
| 2017-12-28 | 718.69 | 719.90 | 671.32 | 687.0 |
| 2017-12-29 | 697.49 | 726.50 | 691.60 | 718.0 |
| 2018-01-02 | 703.85 | 710.16 | 689.89 | 700.0 |
| 2018-01-03 | 715.86 | 721.40 | 699.74 | 701.5 |
| 2018-01-04 | 737.07 | 743.50 | 719.33 | 721.4 |
2.查看后几条数据:
In [20]:
price.tail()
Out[20]:
| close | high | low | open | |
|---|---|---|---|---|
| 2018-01-19 | 750.18 | 758.90 | 739.02 | 752.90 |
| 2018-01-22 | 773.64 | 774.00 | 751.81 | 751.81 |
| 2018-01-23 | 773.78 | 780.00 | 768.60 | 777.81 |
| 2018-01-24 | 764.46 | 776.46 | 758.60 | 776.44 |
| 2018-01-25 | 769.16 | 776.00 | 751.00 | 761.00 |
3.查看 DataFrame 的索引
In [21]:
price.index
Out[21]:
DatetimeIndex(['2017-12-28', '2017-12-29', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05', '2018-01-08', '2018-01-09', '2018-01-10', '2018-01-11', '2018-01-12', '2018-01-15', '2018-01-16', '2018-01-17', '2018-01-18', '2018-01-19', '2018-01-22', '2018-01-23', '2018-01-24', '2018-01-25'],
dtype='datetime64[ns]', freq=None)
4.查看 DataFrame 的列名
In [22]:
price.columns
Out[22]:
Index(['close', 'high', 'low', 'open'], dtype='object')
5.查看 DataFrame 的值
In [23]:
price.values
Out[23]:
array([[ 718.69, 719.9 , 671.32, 687. ],
[ 697.49, 726.5 , 691.6 , 718. ],
[ 703.85, 710.16, 689.89, 700. ],
[ 715.86, 721.4 , 699.74, 701.5 ],
[ 737.07, 743.5 , 719.33, 721.4 ],
[ 738.36, 746.03, 728.22, 741. ],
[ 752.13, 756.5 , 735.02, 735.02],
[ 782.52, 783. , 752.21, 752.21],
[ 785.71, 788.88, 773.48, 785. ],
[ 774.81, 788. , 772. , 787. ],
[ 788.42, 788.8 , 767.02, 773.77],
[ 785.37, 799.06, 779.02, 793.46],
[ 772.94, 788.61, 768. , 780.48],
[ 747.93, 774. , 738.51, 770. ],
[ 750.74, 765. , 744.09, 747.93],
[ 750.18, 758.9 , 739.02, 752.9 ],
[ 773.64, 774. , 751.81, 751.81],
[ 773.78, 780. , 768.6 , 777.81],
[ 764.46, 776.46, 758.6 , 776.44],
[ 769.16, 776. , 751. , 761. ]])
6.使用 describe() 函数对于数据的快速统计汇总:
In [24]:
price.describe()
Out[24]:
| close | high | low | open | |
|---|---|---|---|---|
| count | 20.000000 | 20.000000 | 20.000000 | 20.000000 |
| mean | 754.155500 | 763.235000 | 739.924000 | 750.686500 |
| std | 28.005539 | 26.794003 | 31.251968 | 31.581411 |
| min | 697.490000 | 710.160000 | 671.320000 | 687.000000 |
| 25% | 738.037500 | 745.397500 | 725.997500 | 731.615000 |
| 50% | 758.295000 | 774.000000 | 747.545000 | 752.555000 |
| 75% | 774.037500 | 784.250000 | 767.265000 | 776.782500 |
| max | 788.420000 | 799.060000 | 779.020000 | 793.460000 |
7.对数据的转置:
In [25]:
price.T
Out[25]:
| 2017-12-28 00:00:00 | 2017-12-29 00:00:00 | 2018-01-02 00:00:00 | 2018-01-03 00:00:00 | 2018-01-04 00:00:00 | 2018-01-05 00:00:00 | 2018-01-08 00:00:00 | 2018-01-09 00:00:00 | 2018-01-10 00:00:00 | 2018-01-11 00:00:00 | 2018-01-12 00:00:00 | 2018-01-15 00:00:00 | 2018-01-16 00:00:00 | 2018-01-17 00:00:00 | 2018-01-18 00:00:00 | 2018-01-19 00:00:00 | 2018-01-22 00:00:00 | 2018-01-23 00:00:00 | 2018-01-24 00:00:00 | 2018-01-25 00:00:00 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| close | 718.69 | 697.49 | 703.85 | 715.86 | 737.07 | 738.36 | 752.13 | 782.52 | 785.71 | 774.81 | 788.42 | 785.37 | 772.94 | 747.93 | 750.74 | 750.18 | 773.64 | 773.78 | 764.46 | 769.16 |
| high | 719.90 | 726.50 | 710.16 | 721.40 | 743.50 | 746.03 | 756.50 | 783.00 | 788.88 | 788.00 | 788.80 | 799.06 | 788.61 | 774.00 | 765.00 | 758.90 | 774.00 | 780.00 | 776.46 | 776.00 |
| low | 671.32 | 691.60 | 689.89 | 699.74 | 719.33 | 728.22 | 735.02 | 752.21 | 773.48 | 772.00 | 767.02 | 779.02 | 768.00 | 738.51 | 744.09 | 739.02 | 751.81 | 768.60 | 758.60 | 751.00 |
| open | 687.00 | 718.00 | 700.00 | 701.50 | 721.40 | 741.00 | 735.02 | 752.21 | 785.00 | 787.00 | 773.77 | 793.46 | 780.48 | 770.00 | 747.93 | 752.90 | 751.81 | 777.81 | 776.44 | 761.00 |
8.按列对 DataFrame 进行排序
In [26]:
print(price.sort_values(by='open' , ascending=False))
close high low open
2018-01-15 785.37 799.06 779.02 793.46
2018-01-11 774.81 788.00 772.00 787.00
2018-01-10 785.71 788.88 773.48 785.00
2018-01-16 772.94 788.61 768.00 780.48
2018-01-23 773.78 780.00 768.60 777.81
2018-01-24 764.46 776.46 758.60 776.44
2018-01-12 788.42 788.80 767.02 773.77
2018-01-17 747.93 774.00 738.51 770.00
2018-01-25 769.16 776.00 751.00 761.00
2018-01-19 750.18 758.90 739.02 752.90
2018-01-09 782.52 783.00 752.21 752.21
2018-01-22 773.64 774.00 751.81 751.81
2018-01-18 750.74 765.00 744.09 747.93
2018-01-05 738.36 746.03 728.22 741.00
2018-01-08 752.13 756.50 735.02 735.02
2018-01-04 737.07 743.50 719.33 721.40
2017-12-29 697.49 726.50 691.60 718.00
2018-01-03 715.86 721.40 699.74 701.50
2018-01-02 703.85 710.16 689.89 700.00
2017-12-28 718.69 719.90 671.32 687.00
注意:sort_values函数内置参数有by和ascending,by参数是排序指定列,ascending是排序顺序,False是从大到小,True是从小到大。
三、选择数据
依旧采用上个小节案例,继续讲述选择数据的八项基本操作。
1.选择一列数据,选取开盘价这列数据:
In [28]:
price['open']
Out[28]:
2017-12-28 687.00
2017-12-29 718.00
2018-01-02 700.00
2018-01-03 701.50
2018-01-04 721.40
2018-01-05 741.00
2018-01-08 735.02
2018-01-09 752.21
2018-01-10 785.00
2018-01-11 787.00
2018-01-12 773.77
2018-01-15 793.46
2018-01-16 780.48
2018-01-17 770.00
2018-01-18 747.93
2018-01-19 752.90
2018-01-22 751.81
2018-01-23 777.81
2018-01-24 776.44
2018-01-25 761.00
Name: open, dtype: float64
同学们动手试试price.open~
它与price['open']是等效的!
2.选择多列数据:
In [30]:
price[['open','close']]
Out[30]:
| open | close | |
|---|---|---|
| 2017-12-28 | 687.00 | 718.69 |
| 2017-12-29 | 718.00 | 697.49 |
| 2018-01-02 | 700.00 | 703.85 |
| 2018-01-03 | 701.50 | 715.86 |
| 2018-01-04 | 721.40 | 737.07 |
| 2018-01-05 | 741.00 | 738.36 |
| 2018-01-08 | 735.02 | 752.13 |
| 2018-01-09 | 752.21 | 782.52 |
| 2018-01-10 | 785.00 | 785.71 |
| 2018-01-11 | 787.00 | 774.81 |
| 2018-01-12 | 773.77 | 788.42 |
| 2018-01-15 | 793.46 | 785.37 |
| 2018-01-16 | 780.48 | 772.94 |
| 2018-01-17 | 770.00 | 747.93 |
| 2018-01-18 | 747.93 | 750.74 |
| 2018-01-19 | 752.90 | 750.18 |
| 2018-01-22 | 751.81 | 773.64 |
| 2018-01-23 | 777.81 | 773.78 |
| 2018-01-24 | 776.44 | 764.46 |
| 2018-01-25 | 761.00 | 769.16 |
注意:price[['open','close']]中['open','close']是一个由两个字符串(列名)组成的列表,会自动对应到整个DataFrame表结构中,获取到相应的数据。
同学们试试price['open','close'],看看能不能获取到数据~
3.选择多行:
In [31]:
price[0:3]
Out[31]:
| close | high | low | open | |
|---|---|---|---|---|
| 2017-12-28 | 718.69 | 719.90 | 671.32 | 687.0 |
| 2017-12-29 | 697.49 | 726.50 | 691.60 | 718.0 |
| 2018-01-02 | 703.85 | 710.16 | 689.89 | 700.0 |
4.按index选取多行:
In [32]:
price['2018-01-24':'2018-01-25']
Out[32]:
| close | high | low | open | |
|---|---|---|---|---|
| 2018-01-24 | 764.46 | 776.46 | 758.6 | 776.44 |
| 2018-01-25 | 769.16 | 776.00 | 751.0 | 761.00 |
5.使用标签选取数据:
price.loc[行标签,列标签]
price.loc['a':'b'] #选取 ab 两行数据
price.loc[:,'open'] #选取 open 列的数据
price.loc 的第一个参数是行标签,第二个参数为列标签,两个参数既可以是列表也可以是单个字符,如果两个参数都为列表则返回的是 DataFrame,否则,则为 Series。
In [33]:
price.loc['2018-01-24','open']
Out[33]:
776.44000000000005
In [34]:
price.loc['2018-01-24':'2018-01-25']
Out[34]:
| close | high | low | open | |
|---|---|---|---|---|
| 2018-01-24 | 764.46 | 776.46 | 758.6 | 776.44 |
| 2018-01-25 | 769.16 | 776.00 | 751.0 | 761.00 |
In [35]:
price.loc[:, 'open']
Out[35]:
2017-12-28 687.00
2017-12-29 718.00
2018-01-02 700.00
2018-01-03 701.50
2018-01-04 721.40
2018-01-05 741.00
2018-01-08 735.02
2018-01-09 752.21
2018-01-10 785.00
2018-01-11 787.00
2018-01-12 773.77
2018-01-15 793.46
2018-01-16 780.48
2018-01-17 770.00
2018-01-18 747.93
2018-01-19 752.90
2018-01-22 751.81
2018-01-23 777.81
2018-01-24 776.44
2018-01-25 761.00
Name: open, dtype: float64
In [36]:
price.loc['2018-01-24':'2018-01-25','open']
Out[36]:
2018-01-24 776.44
2018-01-25 761.00
Name: open, dtype: float64
6..使用位置选取数据:
price.iloc[行位置,列位置]
price.iloc[1,1] #选取第二行,第二列的值,返回的为单个值
price.iloc[[0,2],:] #选取第一行及第三行的数据
price.iloc[0:2,:] #选取第一行到第三行(不包含)的数据
price.iloc[:,1] #选取所有记录的第二列的值,返回的为一个Series
price.iloc[1,:] #选取第一行数据,返回的为一个Series
In [38]:
price.iloc[1,1] # 选取第二行,第二列的值,返回的为单个值
Out[38]:
726.5
In [39]:
price.iloc[[0,2],:] # 选取第一行及第三行的数据
Out[39]:
| close | high | low | open | |
|---|---|---|---|---|
| 2017-12-28 | 718.69 | 719.90 | 671.32 | 687.0 |
| 2018-01-02 | 703.85 | 710.16 | 689.89 | 700.0 |
In [40]:
price.iloc[0:2,:] # 选取第一行到第三行(不包含)的数据
Out[40]:
| close | high | low | open | |
|---|---|---|---|---|
| 2017-12-28 | 718.69 | 719.9 | 671.32 | 687.0 |
| 2017-12-29 | 697.49 | 726.5 | 691.60 | 718.0 |
In [41]:
price.iloc[:,1] # 选取所有记录的第一列的值,返回的为一个Series
Out[41]:
2017-12-28 719.90
2017-12-29 726.50
2018-01-02 710.16
2018-01-03 721.40
2018-01-04 743.50
2018-01-05 746.03
2018-01-08 756.50
2018-01-09 783.00
2018-01-10 788.88
2018-01-11 788.00
2018-01-12 788.80
2018-01-15 799.06
2018-01-16 788.61
2018-01-17 774.00
2018-01-18 765.00
2018-01-19 758.90
2018-01-22 774.00
2018-01-23 780.00
2018-01-24 776.46
2018-01-25 776.00
Name: high, dtype: float64
In [42]:
price.iloc[1,:] # 选取第一行数据,返回的为一个Series
Out[42]:
close 697.49
high 726.50
low 691.60
open 718.00
Name: 2017-12-29 00:00:00, dtype: float64
7.更广义的切片方式是使用.ix,它自动根据给到的索引类型判断是使用位置还是标签进行切片
price.ix[1,1]
price.ix['a':'b']
In [43]:
price.ix[1,1]
Out[43]:
726.5
In [44]:
price.ix['2018-01-24':'2018-01-25']
Out[44]:
| close | high | low | open | |
|---|---|---|---|---|
| 2018-01-24 | 764.46 | 776.46 | 758.6 | 776.44 |
| 2018-01-25 | 769.16 | 776.00 | 751.0 | 761.00 |
In [45]:
price.ix['2018-01-24','open']
Out[45]:
776.44000000000005
In [46]:
price.ix[1,'open']
Out[46]:
718.0
In [47]:
price.ix['2018-01-24',0]
Out[47]:
764.46000000000004
8.通过逻辑指针进行数据切片:
price[逻辑条件]
price[price.one >= 2] #单个逻辑条件
price[(price.one >=1 ) & (df.one < 3) ] #多个逻辑条件组合
In [49]:
#筛选出 open 大于 750的数据
price[price.open > 750]
Out[49]:
| close | high | low | open | |
|---|---|---|---|---|
| 2018-01-09 | 782.52 | 783.00 | 752.21 | 752.21 |
| 2018-01-10 | 785.71 | 788.88 | 773.48 | 785.00 |
| 2018-01-11 | 774.81 | 788.00 | 772.00 | 787.00 |
| 2018-01-12 | 788.42 | 788.80 | 767.02 | 773.77 |
| 2018-01-15 | 785.37 | 799.06 | 779.02 | 793.46 |
| 2018-01-16 | 772.94 | 788.61 | 768.00 | 780.48 |
| 2018-01-17 | 747.93 | 774.00 | 738.51 | 770.00 |
| 2018-01-19 | 750.18 | 758.90 | 739.02 | 752.90 |
| 2018-01-22 | 773.64 | 774.00 | 751.81 | 751.81 |
| 2018-01-23 | 773.78 | 780.00 | 768.60 | 777.81 |
| 2018-01-24 | 764.46 | 776.46 | 758.60 | 776.44 |
| 2018-01-25 | 769.16 | 776.00 | 751.00 | 761.00 |
In [50]:
#筛选出 open 大于 750 的数据,并且 close 小于 770 的数据
price[(price.open > 750) & (price.close < 770)]
Out[50]:
| close | high | low | open | |
|---|---|---|---|---|
| 2018-01-17 | 747.93 | 774.00 | 738.51 | 770.00 |
| 2018-01-19 | 750.18 | 758.90 | 739.02 | 752.90 |
| 2018-01-24 | 764.46 | 776.46 | 758.60 | 776.44 |
| 2018-01-25 | 769.16 | 776.00 | 751.00 | 761.00 |
In [51]:
#使用 条件过来更改数据。
price[price>780]
Out[51]:
| close | high | low | open | |
|---|---|---|---|---|
| 2017-12-28 | NaN | NaN | NaN | NaN |
| 2017-12-29 | NaN | NaN | NaN | NaN |
| 2018-01-02 | NaN | NaN | NaN | NaN |
| 2018-01-03 | NaN | NaN | NaN | NaN |
| 2018-01-04 | NaN | NaN | NaN | NaN |
| 2018-01-05 | NaN | NaN | NaN | NaN |
| 2018-01-08 | NaN | NaN | NaN | NaN |
| 2018-01-09 | 782.52 | 783.00 | NaN | NaN |
| 2018-01-10 | 785.71 | 788.88 | NaN | 785.00 |
| 2018-01-11 | NaN | 788.00 | NaN | 787.00 |
| 2018-01-12 | 788.42 | 788.80 | NaN | NaN |
| 2018-01-15 | 785.37 | 799.06 | NaN | 793.46 |
| 2018-01-16 | NaN | 788.61 | NaN | 780.48 |
| 2018-01-17 | NaN | NaN | NaN | NaN |
| 2018-01-18 | NaN | NaN | NaN | NaN |
| 2018-01-19 | NaN | NaN | NaN | NaN |
| 2018-01-22 | NaN | NaN | NaN | NaN |
| 2018-01-23 | NaN | NaN | NaN | NaN |
| 2018-01-24 | NaN | NaN | NaN | NaN |
| 2018-01-25 | NaN | NaN | NaN | NaN |
观察可以发现,price 中小于等于 780 的数都变为 NaN。
In [54]:
#我们还可以把大于 780 的数赋值为1.
price[price > 780] = 1
price
Out[54]:
| close | high | low | open | |
|---|---|---|---|---|
| 2017-12-28 | 718.69 | 719.90 | 671.32 | 687.00 |
| 2017-12-29 | 697.49 | 726.50 | 691.60 | 718.00 |
| 2018-01-02 | 703.85 | 710.16 | 689.89 | 700.00 |
| 2018-01-03 | 715.86 | 721.40 | 699.74 | 701.50 |
| 2018-01-04 | 737.07 | 743.50 | 719.33 | 721.40 |
| 2018-01-05 | 738.36 | 746.03 | 728.22 | 741.00 |
| 2018-01-08 | 752.13 | 756.50 | 735.02 | 735.02 |
| 2018-01-09 | 1.00 | 1.00 | 752.21 | 752.21 |
| 2018-01-10 | 1.00 | 1.00 | 773.48 | 1.00 |
| 2018-01-11 | 774.81 | 1.00 | 772.00 | 1.00 |
| 2018-01-12 | 1.00 | 1.00 | 767.02 | 773.77 |
| 2018-01-15 | 1.00 | 1.00 | 779.02 | 1.00 |
| 2018-01-16 | 772.94 | 1.00 | 768.00 | 1.00 |
| 2018-01-17 | 747.93 | 774.00 | 738.51 | 770.00 |
| 2018-01-18 | 750.74 | 765.00 | 744.09 | 747.93 |
| 2018-01-19 | 750.18 | 758.90 | 739.02 | 752.90 |
| 2018-01-22 | 773.64 | 774.00 | 751.81 | 751.81 |
| 2018-01-23 | 773.78 | 780.00 | 768.60 | 777.81 |
| 2018-01-24 | 764.46 | 776.46 | 758.60 | 776.44 |
| 2018-01-25 | 769.16 | 776.00 | 751.00 | 761.00 |
使用isin()方法来过滤在指定列中的数据,案例延续上面赋值后的price
In [55]:
# 选取 high 列中数为 1 和 774.00的数。
price[price['high'].isin([1,774.00])]
Out[55]:
| close | high | low | open | |
|---|---|---|---|---|
| 2018-01-09 | 1.00 | 1.0 | 752.21 | 752.21 |
| 2018-01-10 | 1.00 | 1.0 | 773.48 | 1.00 |
| 2018-01-11 | 774.81 | 1.0 | 772.00 | 1.00 |
| 2018-01-12 | 1.00 | 1.0 | 767.02 | 773.77 |
| 2018-01-15 | 1.00 | 1.0 | 779.02 | 1.00 |
| 2018-01-16 | 772.94 | 1.0 | 768.00 | 1.00 |
| 2018-01-17 | 747.93 | 774.0 | 738.51 | 770.00 |
| 2018-01-22 | 773.64 | 774.0 | 751.81 | 751.81 |
四、 Panel
MindGo量化交易平台的get_price函数,如果是获取多支股票数据, 则返回pandas.Panel对象。pane其实就是一张一张DataFrame整合。
In [72]:
# 获取贵州茅台,招商银行,中信证券这三只股票近10个工作日的开盘价、最高价、最低价、收盘价,获取格式即为DataFrame
price= get_price(['600519.SH','600036.SH','600030.SH'], None, '20180125', '1d', ['open', 'high', 'low', 'close'], False, 'pre', 20,is_panel=1)
print(price)
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 20 (major_axis) x 3 (minor_axis)
Items axis: close to open
Major_axis axis: 2017-12-28 00:00:00 to 2018-01-25 00:00:00
Minor_axis axis: 600030.SH to 600519.SH
注意:现在这个price不是一张DataFrame,而是四个对应字段的DataFrame了,那么我们需要通过数据字段下标,来分别获取多张DataFrame,之后的操作就是操作单张DataFrame了。
In [73]:
price['close']#获取三个股票的收盘价,注意获取后是个DataFrame
Out[73]:
| 600030.SH | 600036.SH | 600519.SH | |
|---|---|---|---|
| 2017-12-28 | 18.12 | 28.63 | 718.69 |
| 2017-12-29 | 18.10 | 29.02 | 697.49 |
| 2018-01-02 | 18.44 | 29.62 | 703.85 |
| 2018-01-03 | 18.61 | 29.97 | 715.86 |
| 2018-01-04 | 18.67 | 29.65 | 737.07 |
| 2018-01-05 | 18.88 | 30.10 | 738.36 |
| 2018-01-08 | 19.54 | 29.47 | 752.13 |
| 2018-01-09 | 19.44 | 29.77 | 782.52 |
| 2018-01-10 | 19.61 | 30.53 | 785.71 |
| 2018-01-11 | 19.28 | 30.92 | 774.81 |
| 2018-01-12 | 19.33 | 31.51 | 788.42 |
| 2018-01-15 | 19.45 | 31.94 | 785.37 |
| 2018-01-16 | 20.25 | 31.89 | 772.94 |
| 2018-01-17 | 20.94 | 31.69 | 747.93 |
| 2018-01-18 | 21.41 | 32.32 | 750.74 |
| 2018-01-19 | 21.29 | 32.46 | 750.18 |
| 2018-01-22 | 21.20 | 33.08 | 773.64 |
| 2018-01-23 | 21.21 | 34.05 | 773.78 |
| 2018-01-24 | 22.92 | 33.85 | 764.46 |
| 2018-01-25 | 22.33 | 33.41 | 769.16 |
In [75]:
print(price['open'])#获取开盘价股票数据,注意获取的还是DataFrame
600030.SH 600036.SH 600519.SH
2017-12-28 18.06 28.75 687.00
2017-12-29 18.12 28.63 718.00
2018-01-02 18.13 29.02 700.00
2018-01-03 18.36 29.74 701.50
2018-01-04 18.64 30.28 721.40
2018-01-05 18.68 29.87 741.00
2018-01-08 19.00 29.92 735.02
2018-01-09 19.55 29.52 752.21
2018-01-10 19.47 29.66 785.00
2018-01-11 19.46 30.52 787.00
2018-01-12 19.25 31.12 773.77
2018-01-15 19.25 31.48 793.46
2018-01-16 19.26 31.80 780.48
2018-01-17 20.50 32.10 770.00
2018-01-18 21.15 32.10 747.93
2018-01-19 21.36 32.66 752.90
2018-01-22 21.10 32.18 751.81
2018-01-23 21.37 33.20 777.81
2018-01-24 21.40 34.25 776.44
2018-01-25 22.50 34.01 761.00
In [59]:
price['2018-01-11']#获取2018-01-11日期的三个股票的数据
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/opt/conda/lib/python3.5/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2392 try:
-> 2393 return self._engine.get_loc(key)
2394 except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20405)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20359)()
KeyError: '2018-01-11'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-59-dddd7b500893> in <module>()
----> 1 price['2018-01-11']#获取2018-01-11日期的三个股票的数据
/opt/conda/lib/python3.5/site-packages/pandas/core/panel.py in __getitem__(self, key)
284 return self._getitem_multilevel(key)
285 if not (is_list_like(key) or isinstance(key, slice)):
--> 286 return super(Panel, self).__getitem__(key)
287 return self.loc[key]
288
/opt/conda/lib/python3.5/site-packages/pandas/core/generic.py in __getitem__(self, item)
1525
1526 def __getitem__(self, item):
-> 1527 return self._get_item_cache(item)
1528
1529 def _get_item_cache(self, item):
/opt/conda/lib/python3.5/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
1532 res = cache.get(item)
1533 if res is None:
-> 1534 values = self._data.get(item)
1535 res = self._box_item_values(item, values)
1536 cache[item] = res
/opt/conda/lib/python3.5/site-packages/pandas/core/internals.py in get(self, item, fastpath)
3588
3589 if not isnull(item):
-> 3590 loc = self.items.get_loc(item)
3591 else:
3592 indexer = np.arange(len(self.items))[isnull(self.items)]
/opt/conda/lib/python3.5/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2393 return self._engine.get_loc(key)
2394 except KeyError:
-> 2395 return self._engine.get_loc(self._maybe_cast_indexer(key))
2396
2397 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20405)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20359)()
KeyError: '2018-01-11'
注意这是不可行的,思考下为什么无法指定日期?
查看以上策略详情请到supermind量化交易官网查看同花顺Supermind量化交易 Python基础编程--pandas基础