Tutorial: Predicting Stock Prices with a Hidden Markov Model (HMM)

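A hidden Markov model (HMM) is a special case of a state space model in which the latent variables are discrete multinomial variables. From a graphical perspective, you can think of an HMM as a doubly stochastic process: a hidden Markov process (the latent variables) that you cannot observe directly, and a second stochastic process that produces the sequence of observations given the first.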
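
To make the "doubly stochastic" picture concrete, here is a minimal sketch that samples from a toy two-state HMM with one-dimensional Gaussian emissions. The number of states, the probabilities, the means, and the standard deviations are all invented for illustration and are not taken from any stock data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical two-state model; every number here is made up.
start_prob = np.array([0.6, 0.4])      # pi: initial state distribution
trans_mat = np.array([[0.9, 0.1],      # A: row i = P(next state | state i)
                      [0.2, 0.8]])
means = np.array([-0.01, 0.02])        # emission mean per hidden state
stds = np.array([0.005, 0.01])         # emission std per hidden state


def sample_hmm(n_steps):
    """Draw a hidden state path and the observations it emits."""
    states = np.empty(n_steps, dtype=int)
    observations = np.empty(n_steps)
    states[0] = rng.choice(2, p=start_prob)
    for t in range(n_steps):
        if t > 0:
            # First process: the hidden Markov chain moves between states
            states[t] = rng.choice(2, p=trans_mat[states[t - 1]])
        # Second process: an observation is drawn given the current state
        observations[t] = rng.normal(means[states[t]], stds[states[t]])
    return states, observations


states, observations = sample_hmm(10)
print(states)
print(observations.round(4))
```

In a real application you only see `observations`; the `states` array is exactly the part that stays hidden.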

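HMMs can model and predict time-dependent phenomena, which makes them useful in fields such as speech recognition, natural language processing, and financial market prediction. In this article, you will look at applications of HMMs in financial market analysis, mainly stock price prediction.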

In this article, we cover:

Stock price prediction
Collecting stock price data
Features for stock price prediction
Predicting prices using an HMM

1. Stock price prediction

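Stock market prediction has been one of the more active research areas in the past, given the obvious interest of many large companies. Historically, various machine learning algorithms have been applied, with varying degrees of success.

However, stock forecasting is still severely limited by the non-stationary, seasonal, and unpredictable nature of stocks. Predicting from previous stock data alone is even more challenging, because it ignores several outside factors.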

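HMMs can model hidden state transitions from sequential observed data. The stock prediction problem can be thought of as following the same pattern: the price of a stock depends on a multitude of factors that generally remain invisible to the investor (the hidden variables).

The transitions between these underlying factors change with company policy and decisions, its financial condition, and management decisions, and these affect the price of the stock (the observed data). HMMs are therefore a natural fit for the price prediction problem.

Now, you can put this to the test by using an HMM to predict the stock prices of Alphabet Inc. (GOOGL), Facebook (FB), and Apple Inc. (AAPL).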

1.1. Collecting stock price data

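Use pystock data (http://data.pystock.com) to get historical stock prices. Every day, before the US stock exchanges open at 9:30 a.m. US Eastern Time, the pystock crawler collects stock prices and financial reports and pushes the data, such as the previous day's opening price, closing price, highest price, and lowest price, to the repository. The data is daily, so there is no hourly or minute-level data.

Download the pystock data for a given year. Because the dataset is large, create a Python script that downloads the data for a given year, and run it for three different years at the same time to download all the data in parallel: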

"""
Usage: get_data.py --year=<year>
"""
import os

import requests
from docopt import docopt

# docopt parses the command-line arguments from the usage string above
# (http://docopt.org/)
args = docopt(doc=__doc__, argv=None,
              help=True, version=None,
              options_first=False)

year = args['--year']

# Create the directory if not present
year_directory_name = 'data/{year}'.format(year=year)
if not os.path.exists(year_directory_name):
    os.makedirs(year_directory_name)

# Fetch the file list for the corresponding year
year_data_files = requests.get(
    'http://data.pystock.com/{year}/index.txt'.format(year=year)
).text.strip().split('\n')

# Download every file in the list into the year directory
for data_file_name in year_data_files:
    file_location = '{year_directory_name}/{data_file_name}'.format(
        year_directory_name=year_directory_name,
        data_file_name=data_file_name)

    with open(file_location, 'wb+') as data_file:
        print('>>> Downloading \t {file_location}'.format(file_location=file_location))
        data_file_content = requests.get(
            'http://data.pystock.com/{year}/{data_file_name}'.format(
                year=year, data_file_name=data_file_name)
        ).content
        print('<<< Download Completed \t {file_location}'.format(file_location=file_location))
        data_file.write(data_file_content)

Run the following commands for three different years simultaneously (for example, each in its own terminal):

python get_data.py --year 2015
python get_data.py --year 2016
python get_data.py --year 2017

Once the data is downloaded, get all the data for each of the stocks mentioned earlier by combining the data for all the years.

"""
Usage: parse_data.py --company=<company>
"""
import os
import tarfile

import pandas as pd
from pandas import errors as pd_errors
from docopt import docopt

args = docopt(doc=__doc__, argv=None,
              help=True, version=None,
              options_first=False)

years = [2015, 2016, 2017]
company = args['--company']


# Get the list of data files
data_files_list = []
for year in years:
    year_directory = 'data/{year}'.format(year=year)
    for file in os.listdir(year_directory):
        data_files_list.append('{year_directory}/{file}'.format(year_directory=year_directory, file=file))


def parse_data(file_name, company_symbol):
    """
    Returns data for the corresponding company

    :param file_name: name of the tar file
    :param company_symbol: company symbol
    :type file_name: str
    :type company_symbol: str
    :return: dataframe for the corresponding company data
    :rtype: pd.DataFrame
    """
    try:
        with tarfile.open(file_name) as tar:
            price_report = pd.read_csv(tar.extractfile('prices.csv'))
        return price_report[price_report['symbol'] == company_symbol]
    except (KeyError, pd_errors.EmptyDataError):
        # No prices.csv in the archive, or the file is empty
        return pd.DataFrame()


# Get the complete data for a given company
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
company_data = pd.concat(
    [parse_data(file_name, company) for file_name in data_files_list])
company_data = company_data.sort_values(by=['date'])

# Create the folder for company data if it does not exist
if not os.path.exists('data/company_data'):
    os.makedirs('data/company_data')

# Write data to a CSV file
company_data.to_csv('data/company_data/{company}.csv'.format(company=company),
                    columns=['date', 'open', 'high', 'low', 'close', 'volume', 'adj_close'],
                    index=False)

Run the following commands to create one .csv file per company, containing all the historical data for the GOOGL, FB, and AAPL stocks:

python parse_data.py --company GOOGL
python parse_data.py --company FB
python parse_data.py --company AAPL

2. Features for stock price prediction

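You have very limited features for each day: the opening price, the closing price, the highest price, and the lowest price of the stock for that day. So, use them to compute the stock prices. You can compute the closing price for a day, given the opening price for that day and the data for the previous d days. Your predictor will have a latency of d days.

Now, create a predictor called StockPredictor, which will contain all the logic to predict the stock price for a given company on a given day.

Instead of using the opening, closing, low, and high prices directly, extract the fractional change in each of them (measured relative to the day's opening price) and use those to train your HMM.

For the stock price predictor HMM, you can represent a single observation as a vector of these parameters: $X_t = \langle frac_{change}, frac_{high}, frac_{low} \rangle$.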

import logging

import numpy as np
import pandas as pd
 
class StockPredictor(object):
    def __init__(self, company, n_latency_days=10):
        self._init_logger()
 
        self.company = company
        self.n_latency_days = n_latency_days
        self.data = pd.read_csv(
            'data/company_data/{company}.csv'.format(company=self.company))
 
 
    def _init_logger(self):
        self._logger = logging.getLogger(__name__)
        handler = logging.StreamHandler()
        formatter = logging.Formatter(
            '%(asctime)s %(name)-12s %(levelname)-8s %(message)s')
        handler.setFormatter(formatter)
        self._logger.addHandler(handler)
        self._logger.setLevel(logging.DEBUG)
 
 
    @staticmethod
    def _extract_features(data):
        open_price = np.array(data['open'])
        close_price = np.array(data['close'])
        high_price = np.array(data['high'])
        low_price = np.array(data['low'])
 
        # Compute the fractional change in the close, high and low prices,
        # which are used as features
        frac_change = (close_price - open_price) / open_price
        frac_high = (high_price - open_price) / open_price
        frac_low = (open_price - low_price) / open_price
 
        return np.column_stack((frac_change, frac_high, frac_low))
 
 
# Predictor for GOOGL stocks
stock_predictor = StockPredictor(company='GOOGL')
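
To make the feature definitions concrete, here is a small worked example with made-up prices for two trading days (not real market data). It repeats the same three formulas as `_extract_features`, outside the class, so it can be run on its own.

```python
import numpy as np

# Made-up prices for two trading days
open_price = np.array([100.0, 200.0])
close_price = np.array([102.0, 196.0])
high_price = np.array([103.0, 202.0])
low_price = np.array([99.0, 194.0])

# Every feature is measured relative to the day's opening price
frac_change = (close_price - open_price) / open_price   # day 1: 0.02, day 2: -0.02
frac_high = (high_price - open_price) / open_price      # day 1: 0.03, day 2: 0.01
frac_low = (open_price - low_price) / open_price        # day 1: 0.01, day 2: 0.03

# One row per day, one column per feature
features = np.column_stack((frac_change, frac_high, frac_low))
print(features)
```

Note that frac_change can be negative, while frac_high and frac_low are non-negative whenever the high is at or above the open and the low is at or below it.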

3. Predicting prices using an HMM

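The first step in predicting the price is to train an HMM, computing its parameters from a given sequence of observations. Because the observations are vectors of continuous random variables, assume that the emission probability distribution is continuous.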

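For simplicity, assume that it is a multivariate Gaussian distribution with parameters μ and Σ. You therefore have to determine the following parameters: the transition matrix A, the prior probabilities π, and the μ and Σ that define the multivariate Gaussian distribution.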
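
To see how A, π, μ, and Σ combine into the likelihood of an observation sequence, here is a minimal NumPy sketch of the forward algorithm in log space. It is an illustration only, not hmmlearn's implementation, and the toy parameter values and observations are invented.

```python
import numpy as np


def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian N(mean, cov) at point x."""
    d = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)


def log_likelihood(X, pi, A, mus, sigmas):
    """Forward algorithm: returns log P(X_1, ..., X_T | theta)."""
    n_states = len(pi)
    # log_b[t, k] = log emission density of observation t under state k
    log_b = np.array([[gaussian_logpdf(x, mus[k], sigmas[k])
                       for k in range(n_states)] for x in X])
    log_alpha = np.log(pi) + log_b[0]
    for t in range(1, len(X)):
        # log-sum-exp over the previous state, for each current state
        scores = log_alpha[:, None] + np.log(A)
        m = scores.max(axis=0)
        log_alpha = m + np.log(np.exp(scores - m).sum(axis=0)) + log_b[t]
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())


# Toy parameters theta = (pi, A, mu, Sigma); all values are invented
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
mus = np.array([[0.00, 0.01, 0.01],
                [0.01, 0.02, 0.005]])
sigmas = np.array([np.eye(3) * 1e-4, np.eye(3) * 1e-4])

# Three made-up observations <frac_change, frac_high, frac_low>
X = np.array([[0.002, 0.012, 0.008],
              [0.011, 0.019, 0.004],
              [-0.003, 0.009, 0.012]])

ll = log_likelihood(X, pi, A, mus, sigmas)
print(ll)
```

Training goes in the opposite direction: given only the observations, it searches for the A, π, μ, and Σ under which this likelihood is high.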

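For now, assume that you have four hidden states. In the coming sections, you will look at how to find the optimal number of hidden states. Use the GaussianHMM class provided by the hmmlearn package as your HMM, and estimate the parameters with its fit() method.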

from hmmlearn.hmm import GaussianHMM
 
class StockPredictor(object):
    def __init__(self, company, n_latency_days=10, n_hidden_states=4):
        self._init_logger()
 
        self.company = company
        self.n_latency_days = n_latency_days
 
        self.hmm = GaussianHMM(n_components=n_hidden_states)
 
        self.data = pd.read_csv(
            'data/company_data/{company}.csv'.format(company=self.company))
 
    def fit(self):
        self._logger.info('>>> Extracting Features')
        feature_vector = StockPredictor._extract_features(self.data)
        self._logger.info('Features extraction Completed <<<')
 
        self.hmm.fit(feature_vector)

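In machine learning, you split the whole dataset into two parts. The first, the training dataset, is used to train the model. The second, the test dataset, provides an unbiased evaluation of the final model fitted on the training dataset.

Keeping the training dataset separate from the test dataset lets you detect when the model has overfit the training data. So, in this case, split the dataset into two parts: train_data for training the model and test_data for evaluating it. To do so, use the train_test_split function provided by the sklearn.model_selection module.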

from sklearn.model_selection import train_test_split

class StockPredictor(object):
    def __init__(self, company, test_size=0.33,
                 n_latency_days=10, n_hidden_states=4):
        self._init_logger()
 
        self.company = company
        self.n_latency_days = n_latency_days
 
        self.hmm = GaussianHMM(n_components=n_hidden_states)
 
        self._split_train_test_data(test_size)
 
    def _split_train_test_data(self, test_size):
        data = pd.read_csv(
            'data/company_data/{company}.csv'.format(company=self.company))
        _train_data, test_data = train_test_split(
            data, test_size=test_size, shuffle=False)
 
        self._train_data = _train_data
        self._test_data = test_data
 
    def fit(self):
        self._logger.info('>>> Extracting Features')
        feature_vector = StockPredictor._extract_features(self._train_data)
        self._logger.info('Features extraction Completed <<<')
 
        self.hmm.fit(feature_vector)

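By default, train_test_split splits arrays or matrices into random train and test subsets. Because you train your HMM on sequential data, you do not want to split the data randomly. To prevent random splitting, pass shuffle=False as an argument.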
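
The effect of an unshuffled split can be sketched in a few lines of plain Python. The helper name `chronological_split` is hypothetical, and rounding the test size up is an assumption about how the split is sized; the point is only that the test rows are the most recent ones, so the model never trains on days that come after the days it is tested on.

```python
import math


def chronological_split(rows, test_size=0.33):
    """Split an ordered sequence into (train, test) without shuffling."""
    n_test = math.ceil(len(rows) * test_size)  # assumed rounding: round up
    n_train = len(rows) - n_test
    return rows[:n_train], rows[n_train:]


days = list(range(10))  # stand-in for 10 consecutive trading days
train, test = chronological_split(days)
print(train)  # [0, 1, 2, 3, 4, 5]
print(test)   # [6, 7, 8, 9]
```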

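Once your model is trained, you need to predict the closing price of the stock. As mentioned earlier, you want to predict the closing price for a day, given that you know the opening price. This means that if you can predict frac_change for a given day, you can compute the closing price as close = open × (1 + frac_change).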

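So, your problem boils down to computing the observation vector $X_{t+1} = \langle frac_{change}, frac_{high}, frac_{low} \rangle$ for day t+1, given the observations for the previous t days, $X_1, \ldots, X_t$, and the parameters θ of the HMM.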
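
Written out, this is a maximum a posteriori problem. The following is a reconstruction from the surrounding text rather than a formula quoted from the source; θ stands for the HMM parameters (A, π, μ, Σ). The denominator in the second line does not depend on $X_{t+1}$, so it drops out of the maximization:

```latex
\begin{aligned}
\hat{X}_{t+1}
  &= \operatorname*{arg\,max}_{X_{t+1}} P(X_{t+1} \mid X_1, \ldots, X_t, \theta) \\
  &= \operatorname*{arg\,max}_{X_{t+1}}
     \frac{P(X_1, \ldots, X_{t+1} \mid \theta)}{P(X_1, \ldots, X_t \mid \theta)} \\
  &= \operatorname*{arg\,max}_{X_{t+1}} P(X_1, \ldots, X_{t+1} \mid \theta)
\end{aligned}
```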

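Once you remove all the terms that are independent of $X_{t+1}$ from the maximization, you are left with the problem of finding the value of $X_{t+1}$ that maximizes the probability $P(X_1, \ldots, X_{t+1} \mid \theta)$. If you treat frac_change as a continuous variable, this optimization is computationally difficult.

So, divide these fractional changes into a number of discrete values between two finite bounds (as listed in the following table), and find the set of fractional changes, $\langle frac_{change}, frac_{high}, frac_{low} \rangle$, that maximizes the probability $P(X_1, \ldots, X_{t+1} \mid \theta)$.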

Observation	Minimum value	Maximum value	Number of points
frac_change	-0.1	0.1	20
frac_high	0	0.1	10
frac_low	0	0.1	10

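With this set of discrete values, you need to evaluate (20 × 10 × 10 =) 2,000 candidate outcomes. Note that the complete script later in this article defaults to n_steps_frac_change=50, which gives 50 × 10 × 10 = 5,000 candidates.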

    def _compute_all_possible_outcomes(self, n_steps_frac_change,
                                       n_steps_frac_high, n_steps_frac_low):
        frac_change_range = np.linspace(-0.1, 0.1, n_steps_frac_change)
        frac_high_range = np.linspace(0, 0.1, n_steps_frac_high)
        frac_low_range = np.linspace(0, 0.1, n_steps_frac_low)
 
        self._possible_outcomes = np.array(list(itertools.product(
            frac_change_range, frac_high_range, frac_low_range)))
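
As a quick check of the grid size, the same construction can be run on its own with the point counts from the table (20, 10, and 10). This standalone snippet only mirrors the method above for illustration.

```python
import itertools

import numpy as np

# Same bounds as the table, with 20 / 10 / 10 points per feature
frac_change_range = np.linspace(-0.1, 0.1, 20)
frac_high_range = np.linspace(0, 0.1, 10)
frac_low_range = np.linspace(0, 0.1, 10)

# Cartesian product: one row per candidate observation vector
possible_outcomes = np.array(list(itertools.product(
    frac_change_range, frac_high_range, frac_low_range)))

print(possible_outcomes.shape)  # (2000, 3)
print(possible_outcomes[0])     # first candidate: frac_change=-0.1, others 0
print(frac_change_range[1] - frac_change_range[0])  # grid spacing, about 0.0105
```

One consequence of this grid is that predictions are limited to its resolution: the predicted frac_change can only take one of the grid values, and with an even number of points between -0.1 and 0.1, a change of exactly 0 is not among them.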

Now, implement the methods that predict the closing price, as follows:

    def _get_most_probable_outcome(self, day_index):
        previous_data_start_index = max(0, day_index - self.n_latency_days)
        previous_data_end_index = max(0, day_index - 1)
        # Rows of the latency window before day_index (slice end is exclusive)
        previous_data = self._test_data.iloc[previous_data_start_index: previous_data_end_index]
        previous_data_features = StockPredictor._extract_features(
            previous_data)
 
        outcome_score = []
        for possible_outcome in self._possible_outcomes:
            total_data = np.vstack(
                (previous_data_features, possible_outcome))
            outcome_score.append(self.hmm.score(total_data))
        most_probable_outcome = self._possible_outcomes[np.argmax(
            outcome_score)]
 
        return most_probable_outcome
 
    def predict_close_price(self, day_index):
        open_price = self._test_data.iloc[day_index]['open']
        predicted_frac_change, _, _ = self._get_most_probable_outcome(
            day_index)
        return open_price * (1 + predicted_frac_change)

Predict the closing price for a number of days and plot both curves, actual and predicted:

"""
Usage: analyse_data.py --company=<company>
"""
import warnings
import logging
import itertools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from hmmlearn.hmm import GaussianHMM
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from docopt import docopt
 
args = docopt(doc=__doc__, argv=None, help=True,
              version=None, options_first=False)
 
# Suppress warnings from hmmlearn
warnings.filterwarnings("ignore")
# Change plot style to ggplot (for better and more aesthetic visualisation)
plt.style.use('ggplot')
 
 
class StockPredictor(object):
    def __init__(self, company, test_size=0.33,
                 n_hidden_states=4, n_latency_days=10,
                 n_steps_frac_change=50, n_steps_frac_high=10,
                 n_steps_frac_low=10):
        self._init_logger()
 
        self.company = company
        self.n_latency_days = n_latency_days
 
        self.hmm = GaussianHMM(n_components=n_hidden_states)
 
        self._split_train_test_data(test_size)
 
        self._compute_all_possible_outcomes(
            n_steps_frac_change, n_steps_frac_high, n_steps_frac_low)
 
    def _init_logger(self):
        self._logger = logging.getLogger(__name__)
        handler = logging.StreamHandler()
        formatter = logging.Formatter(
            '%(asctime)s %(name)-12s %(levelname)-8s %(message)s')
        handler.setFormatter(formatter)
        self._logger.addHandler(handler)
        self._logger.setLevel(logging.DEBUG)
 
    def _split_train_test_data(self, test_size):
        data = pd.read_csv(
            'data/company_data/{company}.csv'.format(company=self.company))
        _train_data, test_data = train_test_split(
            data, test_size=test_size, shuffle=False)
 
        self._train_data = _train_data
        self._test_data = test_data
 
    @staticmethod
    def _extract_features(data):
        open_price = np.array(data['open'])
        close_price = np.array(data['close'])
        high_price = np.array(data['high'])
        low_price = np.array(data['low'])
 
        # Compute the fractional change in the close, high and low prices,
        # which are used as features
        frac_change = (close_price - open_price) / open_price
        frac_high = (high_price - open_price) / open_price
        frac_low = (open_price - low_price) / open_price
 
        return np.column_stack((frac_change, frac_high, frac_low))
 
    def fit(self):
        self._logger.info('>>> Extracting Features')
        feature_vector = StockPredictor._extract_features(self._train_data)
        self._logger.info('Features extraction Completed <<<')
 
        self.hmm.fit(feature_vector)
 
    def _compute_all_possible_outcomes(self, n_steps_frac_change,
                                       n_steps_frac_high, n_steps_frac_low):
        frac_change_range = np.linspace(-0.1, 0.1, n_steps_frac_change)
        frac_high_range = np.linspace(0, 0.1, n_steps_frac_high)
        frac_low_range = np.linspace(0, 0.1, n_steps_frac_low)
 
        self._possible_outcomes = np.array(list(itertools.product(
            frac_change_range, frac_high_range, frac_low_range)))
 
    def _get_most_probable_outcome(self, day_index):
        previous_data_start_index = max(0, day_index - self.n_latency_days)
        previous_data_end_index = max(0, day_index - 1)
        # Rows of the latency window before day_index (slice end is exclusive)
        previous_data = self._test_data.iloc[previous_data_start_index: previous_data_end_index]
        previous_data_features = StockPredictor._extract_features(
            previous_data)
 
        outcome_score = []
        for possible_outcome in self._possible_outcomes:
            total_data = np.vstack(
                (previous_data_features, possible_outcome))
            outcome_score.append(self.hmm.score(total_data))
        most_probable_outcome = self._possible_outcomes[np.argmax(
            outcome_score)]
 
        return most_probable_outcome
 
    def predict_close_price(self, day_index):
        open_price = self._test_data.iloc[day_index]['open']
        predicted_frac_change, _, _ = self._get_most_probable_outcome(
            day_index)
        return open_price * (1 + predicted_frac_change)
 
    def predict_close_prices_for_days(self, days, with_plot=False):
        predicted_close_prices = []
        for day_index in tqdm(range(days)):
            predicted_close_prices.append(self.predict_close_price(day_index))
 
        if with_plot:
            test_data = self._test_data[0: days]
            days = np.array(test_data['date'], dtype="datetime64[ms]")
            actual_close_prices = test_data['close']
 
            fig = plt.figure()
 
            axes = fig.add_subplot(111)
            axes.plot(days, actual_close_prices, 'bo-', label="actual")
            axes.plot(days, predicted_close_prices, 'r+-', label="predicted")
            axes.set_title('{company}'.format(company=self.company))
 
            fig.autofmt_xdate()
 
            plt.legend()
            plt.show()
 
        return predicted_close_prices
 
 
stock_predictor = StockPredictor(company=args['--company'])
stock_predictor.fit()
stock_predictor.predict_close_prices_for_days(500, with_plot=True)
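
The plot gives a visual comparison of the two curves; a single number can complement it. The helper below is hypothetical and not part of the original script: a minimal sketch of the mean absolute percentage error (MAPE) between actual and predicted closing prices, shown here on made-up numbers.

```python
import numpy as np


def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / |actual|, as a percentage."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)


# Made-up closing prices, only to show the call
actual_close = [100.0, 102.0, 101.0, 105.0]
predicted_close = [101.0, 101.0, 102.0, 103.0]

mape = mean_absolute_percentage_error(actual_close, predicted_close)
print(round(mape, 2))
```

With the predictor above, the inputs would be the actual closing prices of the test days and the list returned by predict_close_prices_for_days.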

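Conclusion

You have now used an HMM to predict stock prices. You applied parameter estimation and model evaluation methods to determine the closing price of a stock. Using HMMs in stock market analysis is just one more example of applying HMMs to time series data.

Thanks for reading!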