Revealed: Forecasting Stock Price Trends with Zero Background!


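Background

I recently learned the AutoGluon machine learning framework and was amazed that it can complete a machine learning task in just a few lines of code.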

AutoML for Image, Text, Time Series, and Tabular Data

  • Quick Prototyping

    Build machine learning solutions on raw data in a few lines of code.

  • State-of-the-art Techniques

    Automatically utilize SOTA models without expert knowledge.

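Compared with traditional machine learning frameworks, it has many built-in automatic mechanisms (such as automatic problem-type detection and feature-type detection), so we can get started on real-world problems quickly with almost no machine learning background.

  • Classification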

    from autogluon.tabular import TabularDataset, TabularPredictor
    
    data_root = 'https://autogluon.s3.amazonaws.com/datasets/Inc/'
    train_data = TabularDataset(data_root + 'train.csv')
    test_data = TabularDataset(data_root + 'test.csv')
    
    predictor = TabularPredictor(label='class').fit(train_data=train_data)
    predictions = predictor.predict(test_data)
    
  • Regression

  • Time series forecasting

    from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor
    
    data = TimeSeriesDataFrame('https://autogluon.s3.amazonaws.com/datasets/timeseries/m4_hourly/train.csv')
    
    predictor = TimeSeriesPredictor(target='target', prediction_length=48).fit(data)
    predictions = predictor.predict(data)
    

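After reading the AutoGluon Time Series - Forecasting Quick Start example, we know that the main goal of time series forecasting is to predict the future values of a time series from its past observations.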
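As a minimal, dependency-free illustration of that goal, the sketch below forecasts future values from past observations with two simple baselines: repeat the last value (naive) and repeat the last full season (seasonal naive). The function names and the toy numbers are assumptions made for illustration; this is not AutoGluon's API, and AutoGluon's own models are far more elaborate.

```python
from typing import List


def naive_forecast(history: List[float], horizon: int) -> List[float]:
    """Repeat the last observed value for every future step."""
    if not history:
        raise ValueError("history must not be empty")
    return [history[-1]] * horizon


def seasonal_naive_forecast(history: List[float], horizon: int, season: int) -> List[float]:
    """Repeat the last full season of observations over the horizon."""
    if len(history) < season:
        raise ValueError("history must cover at least one full season")
    last_season = history[-season:]
    return [last_season[i % season] for i in range(horizon)]


# Toy series with made-up values (not real stock prices)
past = [10.0, 12.0, 11.0, 10.5, 12.5, 11.5]
print(naive_forecast(past, 3))                     # [11.5, 11.5, 11.5]
print(seasonal_naive_forecast(past, 4, season=3))  # [10.5, 12.5, 11.5, 10.5]
```

Baselines like these are useful mainly as a yardstick: a more complex model is only worth its cost if it beats them.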


That brings stock prices to mind: they are typical time series data, so can we forecast where a stock's price is heading?

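Approach

  1. Fetch the data

    Fetch the historical stock price data

  2. Split the data

    • Training data: the part of the history that precedes the period to be predicted
    • Test data: the full history
  3. Train the model

    Train a model on the training data

  4. Predict

    Use the training data to have the model forecast future stock prices

  5. Evaluate accuracy

    Use the test data to evaluate how accurate the forecast is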
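The split in step 2 can look odd at first, because the test data contains the training data. A minimal sketch of the idea on a plain list follows; the helper name `split_last_window` and the values are hypothetical and only mirror the behavior described above, not any library function.

```python
from typing import List, Tuple


def split_last_window(series: List[float], prediction_length: int) -> Tuple[List[float], List[float]]:
    """Return (train, test): train drops the last window, test keeps everything."""
    if not 0 < prediction_length < len(series):
        raise ValueError("prediction_length must be between 1 and len(series) - 1")
    train = series[:-prediction_length]
    test = list(series)
    return train, test


prices = [float(p) for p in range(1, 11)]  # made-up values 1.0 .. 10.0
train, test = split_last_window(prices, prediction_length=3)
held_out = test[len(train):]  # the values the forecast is compared against
print(len(train), len(test), held_out)  # 7 10 [8.0, 9.0, 10.0]
```

The model only ever sees `train`; the last `prediction_length` values of `test` are the "future" used to score the forecast.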

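Implementation (Jupyter notebook)

PS: Since I don't know Python, much of the code was written with ChatGPT's help. Once again I can't help but marvel: harnessing AI makes us more capable!

Install dependencies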

!python -m pip install --upgrade pip
!python -m pip install tushare # 1.2.89
!python -m pip install autogluon # 0.8.2

Import dependencies

import tushare as ts
from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor
import pandas as pd

# Stock data is not continuous (the market is closed on weekends and holidays),
# so it has to be resampled into a continuous daily series
def fill_stock_data_by_freq_d(df):
    df_copy = df.copy()

    # Convert the column to a datetime type
    df_copy['trade_date'] = pd.to_datetime(df_copy['trade_date'], format='%Y%m%d')

    df_copy = df_copy.set_index('trade_date')
    # Resample to daily frequency
    df_copy = df_copy.asfreq('D')

    df_copy = df_copy.reset_index()
    # Fill in the missing data
    # Timestamp column
    df_copy['timestamp'] = df_copy['trade_date'].copy()
    df_copy['ts_code'] = df_copy['ts_code'].ffill()
    # Fill NaN values: a non-trading day opens at the previous close and stays flat
    df_copy['open'] = df_copy['open'].fillna(df_copy['close'].ffill())
    df_copy['high'] = df_copy['high'].fillna(df_copy['open'])
    df_copy['low'] = df_copy['low'].fillna(df_copy['open'])
    df_copy['close'] = df_copy['close'].fillna(df_copy['open'])
    df_copy['pre_close'] = df_copy['pre_close'].fillna(df_copy['open'])
    df_copy['change'] = df_copy['change'].fillna(0)
    df_copy['pct_chg'] = df_copy['pct_chg'].fillna(0)
    df_copy['vol'] = df_copy['vol'].fillna(0)
    df_copy['amount'] = df_copy['amount'].fillna(0)

    df_copy = df_copy.set_index('trade_date')

    # Data frequency
    print('Data frequency', pd.infer_freq(df_copy.index))
    return df_copy
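The function above treats a non-trading day as a day on which the price did not move: the missing prices take the previous close, and volume, amount and change become 0. To make the effect easy to see without pandas, here is a small standard-library sketch of the same carry-forward idea for the close price only. The helper `fill_daily_close` and the sample prices are made up for illustration.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple


def fill_daily_close(closes: Dict[date, float]) -> List[Tuple[date, float]]:
    """Expand trading-day closes to every calendar day, carrying the last close forward."""
    if not closes:
        return []
    day, last_day = min(closes), max(closes)
    last_close = closes[day]
    filled = []
    while day <= last_day:
        # Use the day's own close if it traded, otherwise keep the previous one
        last_close = closes.get(day, last_close)
        filled.append((day, last_close))
        day += timedelta(days=1)
    return filled


# Made-up closes: Friday 2023-08-04 and Monday 2023-08-07, the weekend is missing
closes = {date(2023, 8, 4): 6.10, date(2023, 8, 7): 6.05}
filled = fill_daily_close(closes)
for day, close in filled:
    print(day.isoformat(), close)
```

Saturday and Sunday come out with Friday's close, so the series has one row per calendar day. Note that this padding adds flat stretches that never occurred in the market, which a model may pick up as a weekly pattern.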

Initialization

# Initialize the tushare API
ts.set_token('{YOUR_API_TOKEN}')
pro = ts.pro_api()

Set parameters

# Stock code
ts_code = '600028.SH'
# Number of days to forecast
prediction_length = 30
# Column to forecast
prediction_target = 'close'
# ID column
id_column = 'ts_code'

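Fetch data

# Fetch the historical stock price data
stock_df = pro.daily(ts_code=ts_code)
# The rows come back newest first, so row 0 holds the latest trade date
end_date = stock_df.iloc[0]['trade_date']
start_date = stock_df.iloc[-1]['trade_date']

print(f'Fetched daily A-share quotes for {ts_code} from {start_date} to {end_date}, {len(stock_df)} rows in total')
stock_df.head()
Fetched daily A-share quotes for 600028.SH from 20010808 to 20230811, 5290 rows in total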


Fill the data at daily frequency

stock_df_filled = fill_stock_data_by_freq_d(stock_df)

stock_df_filled[prediction_target].plot()
stock_df_filled.head()
Data frequency D


Split the data

data = TimeSeriesDataFrame.from_data_frame(
    stock_df_filled,
    id_column=id_column
)

print('Data frequency', data.freq)

# Training data: the part of the history that precedes the period to be predicted
# Test data: the full history
train_data, test_data = data.train_test_split(prediction_length)

print(f'Training data: from {train_data.index.tolist()[0][1]} to {train_data.index.tolist()[-1][1]}, {len(train_data)} rows in total')
print(f'Test data: from {test_data.index.tolist()[0][1]} to {test_data.index.tolist()[-1][1]}, {len(test_data)} rows in total')

train_data.head()
Data frequency D
Training data: from 2001-08-08 00:00:00 to 2023-07-12 00:00:00, 8009 rows in total
Test data: from 2001-08-08 00:00:00 to 2023-08-11 00:00:00, 8039 rows in total
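The row counts above can be sanity-checked with plain calendar arithmetic, since daily frequency means one row per calendar day. The sketch below uses only the dates and the `prediction_length` printed in this notebook; the variable names are chosen here for illustration.

```python
from datetime import date, timedelta

first_day = date(2001, 8, 8)   # first timestamp in the printed output
last_day = date(2023, 8, 11)   # last timestamp in the printed output
prediction_length = 30

# One row per calendar day, both endpoints included
total_rows = (last_day - first_day).days + 1
# The training data is the full history minus the held-out window
train_rows = total_rows - prediction_length
train_last_day = last_day - timedelta(days=prediction_length)

print(total_rows, train_rows, train_last_day.isoformat())  # 8039 8009 2023-07-12
```

So 8039 daily rows were built from 5290 trading-day rows, which means roughly a third of the rows are filled-in non-trading days.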


Visualize and compare the data

import matplotlib.pyplot as plt
import numpy as np

train_ts = train_data.loc[ts_code][prediction_target]
test_ts = test_data.loc[ts_code][prediction_target]

fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=[10, 4], sharex=True)
ax1.set_title("Train data (past time series values)")
ax1.plot(train_ts)
ax2.set_title("Test data (past + future time series values)")
ax2.plot(test_ts)

for ax in (ax1, ax2):
    ax.fill_between(np.array([train_ts.index[-1], test_ts.index[-1]]), test_ts.min(), test_ts.max(), color="C1", alpha=0.3, label="Forecast horizon")

plt.legend()
plt.show()


Visualize and compare the data (with less data)

import matplotlib.pyplot as plt
import numpy as np

limit = int(len(train_data) * 0.01)
train_ts = train_data.loc[ts_code][prediction_target][-limit:]
test_ts = test_data.loc[ts_code][prediction_target][-(limit + prediction_length):]

fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=[10, 4], sharex=True)
ax1.set_title("Train data (past time series values)")
ax1.plot(train_ts)
ax2.set_title("Test data (past + future time series values)")
ax2.plot(test_ts)

for ax in (ax1, ax2):
    ax.fill_between(np.array([train_ts.index[-1], test_ts.index[-1]]), test_ts.min(), test_ts.max(), color="C1", alpha=0.3, label="Forecast horizon")

plt.legend()
plt.show()


Train the model

predictor = TimeSeriesPredictor(
    prediction_length=prediction_length,
    path='autogluon-stock',
    target=prediction_target,
    eval_metric='MASE',
)

# Train the model on the training data
predictor.fit(
    train_data,
    presets='best_quality',
    time_limit=60 * 20,
)

predictor.leaderboard(silent=True)
================ TimeSeriesPredictor ================
TimeSeriesPredictor.fit() called
Setting presets to: best_quality
Fitting with arguments:
{'enable_ensemble': True,
 'evaluation_metric': 'MASE',
 'excluded_model_types': None,
 'hyperparameter_tune_kwargs': {'num_trials': 3,
                                'scheduler': 'local',
                                'searcher': 'auto'},
 'hyperparameters': 'best_quality',
 'num_val_windows': 1,
 'prediction_length': 30,
 'random_seed': None,
 'target': 'close',
 'time_limit': 1200,
 'verbosity': 2}
Provided training data set with 8009 rows, 1 items (item = single time series). Average time series length is 8009.0. Data frequency is 'D'.
=====================================================
AutoGluon will save models to autogluon-stock/
AutoGluon will gauge predictive performance using evaluation metric: 'MASE'
	This metric's sign has been flipped to adhere to being 'higher is better'. The reported score can be multiplied by -1 to get the metric value.

Provided dataset contains following columns:
	target:           'close'
	past covariates:  ['open', 'high', 'low', 'pre_close', 'change', 'pct_chg', 'vol', 'amount']

Starting training. Start time is 2023-08-14 03:16:31
Models that will be trained: ['Naive', 'SeasonalNaive', 'Theta', 'AutoETS', 'RecursiveTabular', 'DeepAR', 'TemporalFusionTransformer', 'PatchTST', 'DirectTabular', 'AutoARIMA']

Hyperparameter tuning model: Naive. Tuning model for up to 120.00s of the 1199.99s remaining.
	-0.9547       = Validation score (-MASE)
	0.04    s     = Training runtime
	4.14    s     = Validation (prediction) runtime
...output for the other models omitted...

Training complete. Models trained: ['Naive', 'SeasonalNaive', 'Theta', 'AutoETS', 'RecursiveTabular', 'DeepAR/T1', 'TemporalFusionTransformer', 'PatchTST', 'DirectTabular', 'AutoARIMA', 'WeightedEnsemble']
Total runtime: 512.44 s
Best model: WeightedEnsemble
Best model score: -0.4967
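The log reports scores as -MASE because AutoGluon flips the sign so that higher is better. MASE (mean absolute scaled error) divides the forecast's mean absolute error by the mean absolute error of a naive forecast on the history, so a value below 1 means the forecast errors are smaller than the typical naive step in the past. The sketch below follows that textbook definition with a seasonal period `season` (1 by default here, which is an assumption); AutoGluon's own implementation and default seasonal period may differ.

```python
from typing import List


def mase(history: List[float], actual: List[float], forecast: List[float], season: int = 1) -> float:
    """Mean absolute forecast error scaled by the in-sample (seasonal) naive error."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and equally long")
    if len(history) <= season:
        raise ValueError("history must be longer than the seasonal period")
    forecast_mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    naive_errors = [abs(history[i] - history[i - season]) for i in range(season, len(history))]
    naive_mae = sum(naive_errors) / len(naive_errors)
    if naive_mae == 0:
        raise ValueError("history is constant, so the scale is zero")
    return forecast_mae / naive_mae


# Made-up numbers for illustration only
history = [10.0, 11.0, 10.0, 12.0, 11.0]
actual = [12.0, 13.0]
forecast = [11.5, 12.0]
score = mase(history, actual, forecast)
print(score, -score)  # MASE and the sign-flipped value a leaderboard would show
```

Here the forecast error averages 0.75 and the naive in-sample error averages 1.25, giving a MASE of about 0.6.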


Predict

# Use the training data to have the model forecast future stock prices
predictions = predictor.predict(train_data)

print(f'Forecast: from {predictions.index.tolist()[0][1]} to {predictions.index.tolist()[-1][1]}, {len(predictions)} rows in total')
predictions.head()
Model not specified in predict, will default to the model with the best validation score: WeightedEnsemble
Forecast: from 2023-07-13 00:00:00 to 2023-08-11 00:00:00, 30 rows in total


Evaluate accuracy

import matplotlib.pyplot as plt

# Evaluate forecast accuracy against the test data
# (the score is -MASE, so values closer to 0 are better)
score = predictor.evaluate(test_data)
print('Forecast accuracy score', score)

# Stock prices in the training data
y_past = train_data.loc[ts_code][prediction_target]
# Forecast stock prices
y_pred = predictions.loc[ts_code]
# Stock prices in the test data (only the forecast window)
y_test = test_data.loc[ts_code][prediction_target][-prediction_length:]

plt.figure(figsize=(20, 3))

# Show only the last 1% of the training data so the forecast window is easy to see
limit = int(len(y_past) * 0.01)
plt.plot(y_past[-limit:], label="Past time series values")
plt.plot(y_pred["mean"], label="Mean forecast")
plt.plot(y_test, label="Future time series values")

plt.fill_between(
    y_pred.index, y_pred["0.1"], y_pred["0.9"], color="red", alpha=0.1, label="10%-90% confidence interval"
)
plt.legend()
Model not specified in predict, will default to the model with the best validation score: WeightedEnsemble
Forecast accuracy score -0.8391059970944484
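The plot shades the band between the 0.1 and 0.9 quantile forecasts. One way to put a number on how much of the actual price path falls inside that band is to compute its coverage, that is, the fraction of steps where the actual value lies between the two quantiles. The helper below is a hypothetical sketch with made-up values, not part of AutoGluon.

```python
from typing import List


def interval_coverage(actual: List[float], lower: List[float], upper: List[float]) -> float:
    """Fraction of actual values that lie inside [lower, upper] at each step."""
    if not (len(actual) == len(lower) == len(upper)) or not actual:
        raise ValueError("inputs must be non-empty and equally long")
    inside = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return inside / len(actual)


# Made-up values for illustration only
actual = [6.0, 6.2, 6.5, 6.1]
lower = [5.8, 5.9, 6.0, 6.2]
upper = [6.3, 6.4, 6.4, 6.6]
coverage = interval_coverage(actual, lower, upper)
print(coverage)  # 0.5
```

With the notebook's variables, something like `interval_coverage(list(y_test), list(y_pred["0.1"]), list(y_pred["0.9"]))` should give the coverage over the 30-day window, assuming both series cover the same timestamps in the same order. For a well-calibrated 10%-90% band the coverage would be expected to sit around 0.8 over many forecasts; a single 30-day window is too small a sample to draw firm conclusions.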


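Summary

Judging from the results, the actual price movement stays within the forecast range (the 10%-90% band in the plot), which is broadly in line with expectations.

That said, this is for reference only and does not constitute a basis or recommendation for any security, financial product, other investment instrument, or trading strategy.

Gains and losses from investing depend on your own independent thinking and decisions. The views in this article do not constitute investment or any other advice, and no investment product is offered or recommended. The stock market carries risk; invest with caution.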

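References

  • Tushare big data open community

    Offers a rich set of data, including market data for stocks, funds, futures, and digital currencies, as well as fundamentals such as company financials and fund managers

  • A free and handy notebook: Amazon SageMaker Studio Lab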

    Amazon SageMaker Studio Lab is absolutely free – no credit card or AWS account required.

    The Amazon SageMaker Studio Lab is based on the open-source and extensible JupyterLab IDE. Skip the complicated setup and author Jupyter notebooks right in your browser.

    • You can have one project with at least 15 GB of storage, 16 GB of RAM and a CPU or GPU runtime.
    • CPU runtime is limited to 4 hours per session and no more than a total of 8 hours in a 24-hour period
    • GPU runtime is limited to 4 hours per session and no more than a total of 4 hours in a 24-hour period.