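Background
I recently started learning the AutoGluon machine learning framework and was amazed that it can complete a machine learning task in just a few lines of code.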
AutoML for Image, Text, Time Series, and Tabular Data
Quick Prototyping
Build machine learning solutions on raw data in a few lines of code.
State-of-the-art Techniques
Automatically utilize SOTA models without expert knowledge.
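Compared with traditional machine learning frameworks, it has many built-in automatic mechanisms (such as automatic problem-type detection and feature-type detection), so we can get started on real-world problems quickly with almost no machine learning background.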
- Classification
from autogluon.tabular import TabularDataset, TabularPredictor

data_root = 'https://autogluon.s3.amazonaws.com/datasets/Inc/'
train_data = TabularDataset(data_root + 'train.csv')
test_data = TabularDataset(data_root + 'test.csv')
predictor = TabularPredictor(label='class').fit(train_data=train_data)
predictions = predictor.predict(test_data)
- Regression
- Time series forecasting
from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor

data = TimeSeriesDataFrame('https://autogluon.s3.amazonaws.com/datasets/timeseries/m4_hourly/train.csv')
predictor = TimeSeriesPredictor(target='target', prediction_length=48).fit(data)
predictions = predictor.predict(data)
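After reading the AutoGluon Time Series - Forecasting Quick Start example, we know that the main goal of time series forecasting is to predict the future values of a series from its past observations.
Stock prices are a typical kind of time series data, so can we forecast a stock's price trend?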
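Approach
- Get the data
  Fetch the historical stock price data
- Split the data
  - Training data: the historical data before the range to be predicted
  - Test data: the full historical data
- Train the model
  Train a model on the training data
- Predict
  Feed the training data to the model to forecast future stock prices
- Evaluate accuracy
  Use the test data to evaluate the accuracy of the forecast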
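The split described above can be sketched with a plain Python list. This is only an illustration: the prices and the `split_for_backtest` helper are hypothetical and are not part of AutoGluon, which provides its own `train_test_split` for this purpose.

```python
# Hypothetical toy series standing in for daily closing prices.
prices = [10.0, 10.2, 10.1, 10.4, 10.6, 10.5, 10.8, 11.0]
horizon = 3  # number of final points held out as the forecast range


def split_for_backtest(series, horizon):
    """Return (train, test): train drops the last `horizon` points, test keeps everything."""
    if not 0 < horizon < len(series):
        raise ValueError("horizon must be between 1 and len(series) - 1")
    return list(series[:-horizon]), list(series)


train, test = split_for_backtest(prices, horizon)
# The values the forecast will be compared against.
actual_future = test[-horizon:]
print(len(train), len(test), actual_future)
```

The model only ever sees `train`; the last `horizon` values of `test` play the role of the "future" when the forecast is scored.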
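Implementation (Jupyter notebook)
PS: Since I don't know Python, much of the code was written with the help of ChatGPT. Once again I have to say: harnessing AI makes us more capable!
Install dependencies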
!python -m pip install --upgrade pip
!python -m pip install tushare # 1.2.89
!python -m pip install autogluon # 0.8.2
Import dependencies
import tushare as ts
from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor
import pandas as pd
# Stock data is not continuous (markets are closed on weekends and holidays),
# so resample it into a continuous daily series
def fill_stock_data_by_freq_d(df):
    df_copy = df.copy()
    # Convert the trade date column to datetime
    df_copy['trade_date'] = pd.to_datetime(df_copy['trade_date'], format='%Y%m%d')
    # Sort ascending by date as a safeguard, since the API returns the newest rows first
    df_copy = df_copy.set_index('trade_date').sort_index()
    # Resample to daily frequency; non-trading days become NaN rows
    df_copy = df_copy.asfreq('D')
    df_copy = df_copy.reset_index()
    # Fill in the missing data
    # Timestamp column
    df_copy['timestamp'] = df_copy['trade_date'].copy()
    df_copy['ts_code'] = df_copy['ts_code'].ffill()
    # Fill NaN prices: a non-trading day opens at the last known close and stays flat
    df_copy['open'] = df_copy['open'].fillna(df_copy['close'].ffill())
    for col in ['high', 'low', 'close', 'pre_close']:
        df_copy[col] = df_copy[col].fillna(df_copy['open'])
    # No price change and no trading volume on the filled days
    for col in ['change', 'pct_chg', 'vol', 'amount']:
        df_copy[col] = df_copy[col].fillna(0)
    df_copy = df_copy.set_index('trade_date')
    # Data frequency
    print('Data frequency', pd.infer_freq(df_copy.index))
    return df_copy
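To see what the resampling step does, here is a minimal sketch on a hypothetical three-row frame that spans a weekend. The values are made up, and only the `close` and `vol` columns are kept to stay short.

```python
import pandas as pd

# Hypothetical trading days around a weekend: Friday, Monday, Tuesday.
toy = pd.DataFrame(
    {
        "trade_date": pd.to_datetime(["2023-08-04", "2023-08-07", "2023-08-08"]),
        "close": [6.10, 6.20, 6.15],
        "vol": [100.0, 120.0, 90.0],
    }
).set_index("trade_date")

# asfreq("D") inserts rows for Saturday and Sunday, filled with NaN.
daily = toy.asfreq("D")
# Carry the last known close across the gap.
daily["close"] = daily["close"].ffill()
# Nothing was traded on the inserted days.
daily["vol"] = daily["vol"].fillna(0)

print(daily)
print(pd.infer_freq(daily.index))
```

After filling, the index has a regular daily frequency, which is what the time series predictor needs to infer the data frequency.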
Initialization
# Initialize the tushare API
ts.set_token('{YOUR_API_TOKEN}')
pro = ts.pro_api()
Set parameters
# Stock code
ts_code = '600028.SH'
# Number of days to forecast
prediction_length = 30
# Column to forecast
prediction_target = 'close'
# ID column
id_column = 'ts_code'
Get the data
# Fetch the historical stock price data
stock_df = pro.daily(ts_code=ts_code)
end_date = stock_df.iloc[0]['trade_date']
start_date = stock_df.iloc[-1]['trade_date']
print(f'Retrieved daily A-share quotes for {ts_code} from {start_date} to {end_date}, {len(stock_df)} rows in total')
stock_df.head()
Retrieved daily A-share quotes for 600028.SH from 20010808 to 20230811, 5290 rows in total
Fill the data to daily frequency
stock_df_filled = fill_stock_data_by_freq_d(stock_df)
stock_df_filled[prediction_target].plot()
stock_df_filled.head()
Data frequency D
Split the data
data = TimeSeriesDataFrame.from_data_frame(
    stock_df_filled,
    id_column=id_column
)
print('Data frequency', data.freq)
# Training data: the historical data before the period to be predicted
# Test data: the full historical data
train_data, test_data = data.train_test_split(prediction_length)
print(f'Training data: from {train_data.index.tolist()[0][1]} to {train_data.index.tolist()[-1][1]}, {len(train_data)} rows in total')
print(f'Test data: from {test_data.index.tolist()[0][1]} to {test_data.index.tolist()[-1][1]}, {len(test_data)} rows in total')
train_data.head()
Data frequency D
Training data: from 2001-08-08 00:00:00 to 2023-07-12 00:00:00, 8009 rows in total
Test data: from 2001-08-08 00:00:00 to 2023-08-11 00:00:00, 8039 rows in total
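The row counts and dates printed above can be sanity-checked with plain date arithmetic. This is a small sketch using only the standard library; the variable names are illustrative.

```python
from datetime import date, timedelta

first_day = date(2001, 8, 8)
last_day = date(2023, 8, 11)
prediction_length = 30

# Inclusive count of calendar days in the filled daily series.
total_days = (last_day - first_day).days + 1
# The training set drops the final `prediction_length` days.
train_days = total_days - prediction_length
train_end = last_day - timedelta(days=prediction_length)

print(total_days, train_days, train_end)
```

The results agree with the 8039 test rows, the 8009 training rows, and the 2023-07-12 end of the training data shown in the output.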
Visualize and compare the data
import matplotlib.pyplot as plt
import numpy as np
train_ts = train_data.loc[ts_code][prediction_target]
test_ts = test_data.loc[ts_code][prediction_target]
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=[10, 4], sharex=True)
ax1.set_title("Train data (past time series values)")
ax1.plot(train_ts)
ax2.set_title("Test data (past + future time series values)")
ax2.plot(test_ts)
for ax in (ax1, ax2):
    ax.fill_between(np.array([train_ts.index[-1], test_ts.index[-1]]), test_ts.min(), test_ts.max(), color="C1", alpha=0.3, label="Forecast horizon")
plt.legend()
plt.show()
Visualize and compare the data (with less data)
import matplotlib.pyplot as plt
import numpy as np
limit = int(len(train_data) * 0.01)
train_ts = train_data.loc[ts_code][prediction_target][-limit:]
test_ts = test_data.loc[ts_code][prediction_target][-(limit + prediction_length):]
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=[10, 4], sharex=True)
ax1.set_title("Train data (past time series values)")
ax1.plot(train_ts)
ax2.set_title("Test data (past + future time series values)")
ax2.plot(test_ts)
for ax in (ax1, ax2):
    ax.fill_between(np.array([train_ts.index[-1], test_ts.index[-1]]), test_ts.min(), test_ts.max(), color="C1", alpha=0.3, label="Forecast horizon")
plt.legend()
plt.show()
Train the model
predictor = TimeSeriesPredictor(
    prediction_length=prediction_length,
    path='autogluon-stock',
    target=prediction_target,
    eval_metric='MASE',
)
# Train the model on the training data
predictor.fit(
    train_data,
    presets='best_quality',
    time_limit=60 * 20,
)
predictor.leaderboard(silent=True)
================ TimeSeriesPredictor ================
TimeSeriesPredictor.fit() called
Setting presets to: best_quality
Fitting with arguments:
{'enable_ensemble': True,
'evaluation_metric': 'MASE',
'excluded_model_types': None,
'hyperparameter_tune_kwargs': {'num_trials': 3,
'scheduler': 'local',
'searcher': 'auto'},
'hyperparameters': 'best_quality',
'num_val_windows': 1,
'prediction_length': 30,
'random_seed': None,
'target': 'close',
'time_limit': 1200,
'verbosity': 2}
Provided training data set with 8009 rows, 1 items (item = single time series). Average time series length is 8009.0. Data frequency is 'D'.
=====================================================
AutoGluon will save models to autogluon-stock/
AutoGluon will gauge predictive performance using evaluation metric: 'MASE'
This metric's sign has been flipped to adhere to being 'higher is better'. The reported score can be multiplied by -1 to get the metric value.
Provided dataset contains following columns:
target: 'close'
past covariates: ['open', 'high', 'low', 'pre_close', 'change', 'pct_chg', 'vol', 'amount']
Starting training. Start time is 2023-08-14 03:16:31
Models that will be trained: ['Naive', 'SeasonalNaive', 'Theta', 'AutoETS', 'RecursiveTabular', 'DeepAR', 'TemporalFusionTransformer', 'PatchTST', 'DirectTabular', 'AutoARIMA']
Hyperparameter tuning model: Naive. Tuning model for up to 120.00s of the 1199.99s remaining.
-0.9547 = Validation score (-MASE)
0.04 s = Training runtime
4.14 s = Validation (prediction) runtime
...output for the other models omitted...
Training complete. Models trained: ['Naive', 'SeasonalNaive', 'Theta', 'AutoETS', 'RecursiveTabular', 'DeepAR/T1', 'TemporalFusionTransformer', 'PatchTST', 'DirectTabular', 'AutoARIMA', 'WeightedEnsemble']
Total runtime: 512.44 s
Best model: WeightedEnsemble
Best model score: -0.4967
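The log reports scores as negative MASE. As a rough illustration of what the metric measures, here is a minimal sketch of the textbook definition: the forecast's mean absolute error divided by the in-sample mean absolute error of a naive forecast. The `mase` helper and the numbers are hypothetical, and a seasonal period of 1 is assumed for simplicity; AutoGluon's own implementation may pick a different period based on the data frequency.

```python
def mase(history, actual, forecast, season=1):
    """Mean absolute scaled error (textbook form, assumed seasonal period `season`)."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    # In-sample errors of the naive forecast "repeat the value from `season` steps ago".
    naive_errors = [abs(history[i] - history[i - season]) for i in range(season, len(history))]
    scale = sum(naive_errors) / len(naive_errors)
    # Mean absolute error of the forecast over the horizon.
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


history = [10.0, 11.0, 10.0, 12.0, 11.0]  # hypothetical past values
actual = [12.0, 13.0]                     # hypothetical true future values
forecast = [11.5, 12.0]                   # hypothetical forecast

value = mase(history, actual, forecast)
reported = -value  # sign flipped so that higher is better, as the log describes
print(value, reported)
```

A MASE below 1 means the forecast's error is smaller than the naive forecast's average in-sample error, so the best model score of -0.4967 in the log corresponds to a MASE of 0.4967.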
Predict
# Feed the training data to the model to forecast future stock prices
predictions = predictor.predict(train_data)
print(f'Forecast data: from {predictions.index.tolist()[0][1]} to {predictions.index.tolist()[-1][1]}, {len(predictions)} rows in total')
predictions.head()
Model not specified in predict, will default to the model with the best validation score: WeightedEnsemble
Forecast data: from 2023-07-13 00:00:00 to 2023-08-11 00:00:00, 30 rows in total
Evaluate accuracy
import matplotlib.pyplot as plt
# Use the test data to evaluate the accuracy of the forecast
score = predictor.evaluate(test_data)
print('Forecast accuracy score', score)
# Stock prices in the training data
y_past = train_data.loc[ts_code][prediction_target]
# Stock prices in the forecast
y_pred = predictions.loc[ts_code]
# Stock prices in the test data (only the forecast range)
y_test = test_data.loc[ts_code][prediction_target][-prediction_length:]
plt.figure(figsize=(20, 3))
# Show only the last 1% of the training data so the forecast range is easy to see in the plot
limit = int(len(y_past) * 0.01)
plt.plot(y_past[-limit:], label="Past time series values")
plt.plot(y_pred["mean"], label="Mean forecast")
plt.plot(y_test, label="Future time series values")
plt.fill_between(
    y_pred.index, y_pred["0.1"], y_pred["0.9"], color="red", alpha=0.1, label="10%-90% confidence interval"
)
plt.legend()
Model not specified in predict, will default to the model with the best validation score: WeightedEnsemble
Forecast accuracy score -0.8391059970944484
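Besides the single score, one way to quantify how well the shaded band in the plot contains the actual prices is to count the share of actual values that fall between the 10% and 90% quantile forecasts. The sketch below uses hypothetical numbers and a hypothetical `interval_coverage` helper; it is not an AutoGluon function.

```python
def interval_coverage(actual, lower, upper):
    """Fraction of actual values lying inside [lower, upper] at each step."""
    if not (len(actual) == len(lower) == len(upper)):
        raise ValueError("all inputs must have the same length")
    inside = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return inside / len(actual)


# Hypothetical actual prices and quantile forecasts for four days.
actual = [6.1, 6.3, 6.0, 6.6]
lower = [5.9, 6.0, 6.1, 6.2]   # stands in for the "0.1" quantile column
upper = [6.4, 6.5, 6.5, 6.7]   # stands in for the "0.9" quantile column

coverage = interval_coverage(actual, lower, upper)
print(coverage)
```

In the notebook, the same helper could presumably be called with `y_test`, `y_pred["0.1"]`, and `y_pred["0.9"]`. For a well-calibrated 10%-90% band, roughly 80% of the actual values would be expected to fall inside it.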
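Summary
Judging from the results, the actual price movement falls within the forecast range, which is broadly in line with expectations.
However, this is for reference only and does not constitute a basis or recommendation for any security, financial product, other investment instrument, or trading strategy.
Gains and losses from investing depend on your own independent thinking and decisions. The views in this article do not constitute investment or any other advice, and no investment product is offered or recommended. The stock market carries risk; invest with caution.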
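References
- Tushare
  Offers rich data content, including market data for stocks, funds, futures, and digital currencies, as well as fundamentals such as company financials and fund managers
- A free and easy-to-use notebook: Amazon SageMaker Studio Lab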
Amazon SageMaker Studio Lab is absolutely free – no credit card or AWS account required.
The Amazon SageMaker Studio Lab is based on the open-source and extensible JupyterLab IDE. Skip the complicated setup and author Jupyter notebooks right in your browser.
- You can have one project with at least 15 GB of storage, 16 GB of RAM and a CPU or GPU runtime.
- CPU runtime is limited to 4 hours per session and no more than a total of 8 hours in a 24-hour period
- GPU runtime is limited to 4 hours per session and no more than a total of 4 hours in a 24-hour period.