A Complete Guide to Imputing Missing Values in Time Series with Python

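Missing values are a common challenge when working with real-world data. Time series data differs from the typical machine learning dataset: observations are collected under different conditions at different points in time, and various mechanisms can cause records to be missing in different periods. These are known as missingness mechanisms.

There are three types of missingness: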

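Missing Completely at Random (MCAR). Put simply, MCAR means there is no relationship between the missing data and the observed data. The probability that a value is missing is entirely random and does not depend on the data, i.e. $P(\text{Missing} \mid \text{Complete data}) = P(\text{Missing})$.
Missing at Random (MAR). A variable is missing at random if the probability of missingness depends only on the information that was observed, i.e. $P(\text{Missing} \mid \text{Complete data}) = P(\text{Missing} \mid \text{Observed data})$.
Missing Not at Random (MNAR). In this case the probability of missingness depends on the variable itself, that is, on the values that were not observed.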
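To make the three mechanisms concrete, here is a minimal sketch that produces each kind of missingness on synthetic data. The column names (`temperature`, `reading`), the thresholds and the drop probabilities are all assumptions chosen for illustration; they are not part of the dataset used in this tutorial.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 daily values of an always-observed variable
# (temperature) and a variable that will receive missing values (reading).
n = 200
df = pd.DataFrame(
    {
        "temperature": rng.normal(20, 5, n),
        "reading": rng.normal(100, 10, n),
    },
    index=pd.date_range("2020-01-01", periods=n, freq="D"),
)

# MCAR: every reading has the same 10% chance of being missing.
mcar_mask = rng.random(n) < 0.10

# MAR: missingness depends only on an observed variable.
# Readings taken on hot days (temperature > 25) are dropped half of the time.
mar_mask = (df["temperature"].to_numpy() > 25) & (rng.random(n) < 0.5)

# MNAR: missingness depends on the value that goes missing.
# Readings above 110 are dropped 70% of the time.
mnar_mask = (df["reading"].to_numpy() > 110) & (rng.random(n) < 0.7)

# Series.mask replaces the values where the condition is True with NaN.
masked = {
    "MCAR": df["reading"].mask(mcar_mask),
    "MAR": df["reading"].mask(mar_mask),
    "MNAR": df["reading"].mask(mnar_mask),
}

for name, series in masked.items():
    print(name, "missing:", int(series.isna().sum()))
```

In real data the mechanism is usually not known and has to be argued from how the data was collected; a sketch like this is mainly useful for testing how an imputation method behaves under each assumption.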

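Time series models work on complete data, so missing data must be replaced with meaningful values before the actual analysis. At a high level, missing values in a time series can be handled in two ways: drop them or replace them. However, because the data is ordered in time and observations in adjacent periods are correlated, dropping missing values may not be an appropriate solution.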
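One way to see the problem with dropping is to look at what happens to the time index. The sketch below uses a made-up six-month series (the values are illustrative only) and shows that removing a missing month leaves an irregular index.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series with one missing month (March 2020).
idx = pd.date_range("2020-01-01", periods=6, freq="MS")
toy = pd.Series([10.0, 12.0, np.nan, 15.0, 17.0, 18.0], index=idx)

# Dropping the missing value removes the whole time step.
dropped = toy.dropna()

# Spacing between consecutive remaining observations.
gaps = dropped.index.to_series().diff().dropna()

print(pd.infer_freq(toy.index))      # frequency inferred for the regular index
print(pd.infer_freq(dropped.index))  # None, because the spacing is no longer regular
print(gaps.max())                    # the largest gap now spans two months
```

Methods that assume equally spaced observations (lags, seasonal differences, rolling windows counted in rows) would silently treat February and April as neighbours in the dropped series.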

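Estimating a reasonable value, one that does not distort the components of the series, is a good way to handle missing values in a time series. Imputation replaces missing values with values estimated from the same data, or with values observed in an environment under the same conditions.

This article walks through how to deal with this problem in time series data.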

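Prerequisites

To follow along with this tutorial, it helps to have the following:

A good understanding of how to work with [time series data] in NumPy.
A dataset ready to use. I use this [dataset] in this project.
Access to a [Jupyter Notebook] or [Google Colab].

Let's look at the various imputation techniques used for time series in Python.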

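Python implementation

Step 1: Importing the libraries

We will use the following libraries in this project:

Pandas, for working with data frames.
NumPy, for numerical analysis.
Matplotlib, for visualization.
Warnings, for controlling warning messages (here, suppressing them).

Let's import these libraries.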

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

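The data used in this implementation records the customers who visited a certain shop between 1949 and 1960. The download link is given in the prerequisites section. Make sure you have downloaded it and loaded it into your workspace. Let's do that together.

Step 2: Importing the dataset

Besides reading the data into our workspace, we will also convert it into a time series format.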

# import the data (the file has no header row)
dataset = pd.read_csv('/content/Time-Series.csv', header=None)
# name the columns
dataset.columns = ['Date', 'Customers']
# parse the Date column as dates given in year-month format
dataset['Date'] = pd.to_datetime(dataset['Date'], format='%Y-%m')
# set the Date column as the index of our dataset
dataset = dataset.set_index('Date')
# now check the data shape
dataset.shape

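Output:

(144, 1)

This output shows that our data has 144 observations and 1 column. A time series of this kind is called univariate, as opposed to a multivariate time series, which has more than one column of interest.

Using head(), we can look at the first five observations, as shown below.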

dataset.head()

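Output:

[Figure: first five rows of the dataset]

Our data is in the correct format. Next, let's use the isnull() method to check whether the data has missing values.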

# creating series True or False for NaN data and present data respectively. 
nul_data = pd.isnull(dataset['Customers']) 
    
# print only the data, Customers = NaN 
dataset[nul_data]

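Output:

[Figure: rows with missing values]

Indeed, our data has missing values. The code returned four instances of missing data together with their dates. Since we are dealing with a univariate time series and our data is not too large, we can plot the series to see where these NaN points fall and to get a rough idea of what kind of time series we are working with. Let's run the code below to do this.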

# set the size of our plot
plt.rcParams['figure.figsize'] = (15, 7)
# plot our series
plt.plot(dataset, color='blue')
# add a title to our time series plot
plt.title('Customers visiting the shop, 1949-1960')
# display the plot
plt.show()

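[Figure: plot of the customer series]

The breaks in the curve indicate the missing values in our data. We can also see that the seasonal component of the data is not the same from year to year. Now let's complete our data by applying the techniques used to impute time series data. The techniques are as follows.

Step 3: Imputing the missing values

1. Mean imputation

This technique imputes the missing values with the mean of all the observed data in the time series. In Python, we implement it as follows.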

# declare the size of the plot
plt.rcParams['figure.figsize'] = (15, 7)
# fill the missing data using the mean of the observed values
dataset = dataset.assign(FillMean=dataset.Customers.fillna(dataset.Customers.mean()))
# plot the mean-imputed series as a blue curve
plt.plot(dataset['FillMean'], color='blue')
# add a title to the plot
plt.title('Mean Imputation')
# display the plot
plt.show()

Output:

[Figure: mean imputation]

2. Median imputation

In this technique, we replace the missing values with the median of the data. We implement it as follows.

# declare the size of the plot
plt.rcParams['figure.figsize'] = (15, 7)
# fill the missing data using the median of the observed values
dataset = dataset.assign(FillMedian=dataset.Customers.fillna(dataset.Customers.median()))
# plot the median-imputed series as a blue curve
plt.plot(dataset['FillMedian'], color='blue')
# add a title to the plot
plt.title('Median Imputation')
# display the plot
plt.show()

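Output:

[Figure: median imputation]

Plotting the data after each of the two methods makes it clear that all the missing values were imputed. However, these techniques share a problem: they do not work well when the time series has seasonal and trend components, because neither component is taken into account when estimating the missing data. They are therefore only suitable when the observed time series has no seasonality or trend.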
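A small worked example shows how far off these estimates can be when there is a trend. The series below is a made-up, strictly increasing one (values 10, 20, ..., 100), and the ninth value, whose true value is 90, is hidden.

```python
import numpy as np
import pandas as pd

# Hypothetical trending series: 10, 20, ..., 100 at monthly steps.
idx = pd.date_range("2020-01-01", periods=10, freq="MS")
truth = pd.Series(np.arange(10, 110, 10, dtype=float), index=idx)

# Hide the ninth observation (true value 90).
observed = truth.copy()
observed.iloc[8] = np.nan

# Impute with the mean and the median of the remaining nine values.
mean_filled = observed.fillna(observed.mean())
median_filled = observed.fillna(observed.median())

print("true value:   ", truth.iloc[8])
print("mean fill:    ", round(mean_filled.iloc[8], 2))
print("median fill:  ", median_filled.iloc[8])
```

The mean of the nine observed values is 460 / 9, about 51.1, and their median is 50, so both fills land near the middle of the series rather than near the neighbouring values 80 and 100.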

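If the time series does have these components, the following methods impute its missing values better.

3. Last Observation Carried Forward (LOCF)

With this technique, a missing value is imputed with the value that immediately precedes it in the time series. The code below demonstrates how to implement LOCF.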

# figure size
plt.rcParams['figure.figsize'] = (15, 7)
# on the Customers column, impute the missing values with LOCF (forward fill)
dataset['Customers_locf'] = dataset['Customers'].ffill()
# plot our time series with imputed values
plt.plot(dataset['Customers_locf'], color='blue')
# plot title
plt.title('Last Observation Carried Forward')
# show the plot
plt.show()

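Output:

[Figure: LOCF plot]

4. Next Observation Carried Backward (NOCB)

With this technique, a missing value is imputed with the value that immediately follows it in the time series. We can implement this method as follows.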

# figure size
plt.rcParams['figure.figsize'] = (15, 7)
# on the Customers column, impute the missing values with NOCB (backward fill)
dataset['Customers_nocb'] = dataset['Customers'].bfill()
# plot our time series with imputed values
plt.plot(dataset['Customers_nocb'], color='blue')
# plot title
plt.title('Next Observation Carried Backward')
# show the plot
plt.show()

Output:

[Figure: NOCB plot]
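The difference between the two fill directions is easiest to see side by side on a tiny series. The sketch below uses made-up values and the pandas methods `ffill()` (forward fill, i.e. LOCF) and `bfill()` (backward fill, i.e. NOCB).

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series with a two-step gap and a one-step gap.
idx = pd.date_range("2020-01-01", periods=6, freq="MS")
toy = pd.Series([10.0, np.nan, np.nan, 40.0, np.nan, 60.0], index=idx)

# LOCF: each gap takes the last value observed before it.
locf = toy.ffill()

# NOCB: each gap takes the next value observed after it.
nocb = toy.bfill()

print(pd.DataFrame({"original": toy, "LOCF": locf, "NOCB": nocb}))
```

Here LOCF gives 10, 10, 10, 40, 40, 60 and NOCB gives 10, 40, 40, 40, 60, 60. Note that LOCF cannot fill a gap at the very start of a series and NOCB cannot fill one at the very end, because there is no value to carry.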

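5. Linear interpolation

Next, let's look at linear interpolation. This technique comes from numerical analysis: it estimates unknown values by assuming a linear relationship within the range of the known data points, unlike linear extrapolation, which estimates values outside that range. To estimate a missing value with linear interpolation, we look at the data both before and after it.

The missing value is therefore expected to lie between two points whose values are known, which gives a known range within which our estimate can fall. Below is the Python code implementing linear interpolation on our data.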

# setting the plot size
plt.rcParams['figure.figsize'] = (15, 7)
# impute the missing values using linear interpolation
dataset['Customers_L'] = dataset['Customers'].interpolate(method='linear')
# plot the complete dataset
plt.plot(dataset['Customers_L'], color='blue')
# add the title of our plot
plt.title('Linear interpolation')
# display the plot
plt.show()

[Figure: linear interpolation]
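As a check on what linear interpolation computes, the sketch below fills a two-step gap in a made-up series with pandas and then reproduces the same values by hand. For a gap between known values $y_0$ and $y_1$ that are $m$ steps apart, the $k$-th missing point is estimated as $y_0 + (y_1 - y_0)\,k/m$.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with a gap between the values 10 and 40.
idx = pd.date_range("2020-01-01", periods=5, freq="D")
toy = pd.Series([10.0, np.nan, np.nan, 40.0, 50.0], index=idx)

# pandas: method="linear" treats the observations as equally spaced.
filled = toy.interpolate(method="linear")

# By hand: the known endpoints are 3 steps apart, the gap holds steps 1 and 2.
y0, y1, m = 10.0, 40.0, 3
by_hand = [y0 + (y1 - y0) * k / m for k in (1, 2)]

print(filled.tolist())
print(by_hand)
```

Both routes give 20 and 30 for the two missing points. If the observations are not equally spaced, `method="time"` on a DatetimeIndex weights the estimate by the actual time distance instead.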

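6. Spline interpolation

This method uses a mathematical function to estimate values that minimize overall curvature, giving a smooth curve that passes through the input points. The code below implements this method.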

# setting the plot size
plt.rcParams['figure.figsize'] = (15, 7)
# impute the missing values using spline interpolation
# (requires SciPy; order=3, a cubic spline, is an assumed choice)
dataset['Customers_Spline'] = dataset['Customers'].interpolate(method='spline', order=3)
# plot the complete dataset
plt.plot(dataset['Customers_Spline'], color='blue')
# add the title of our plot
plt.title('Spline Interpolation')
# display the plot
plt.show()

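Output:

[Figure: spline interpolation]

All of the methods discussed here assume that adjacent data points are similar, which is not always the case. When that assumption does not hold, more advanced methods are used; they are beyond the scope of this tutorial. Each of the methods covered in this tutorial performs best in a different situation, depending on the underlying components and the type of the time series.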

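That said, linear and spline interpolation tend to give imputed values with the smallest mean squared error, so they can be considered the best techniques at this level.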
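A claim like this can be checked on your own data by hiding values that are actually known, imputing them, and comparing the estimates with the truth. The sketch below does this on a synthetic series with a trend and a yearly cycle; the series, the number of hidden points and the random seed are all assumptions made for illustration, and spline interpolation is left out so that the example needs only pandas and NumPy.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical monthly series: linear trend plus a 12-month seasonal cycle.
n = 144
t = np.arange(n)
idx = pd.date_range("1949-01-01", periods=n, freq="MS")
truth = pd.Series(100 + 2.0 * t + 20 * np.sin(2 * np.pi * t / 12), index=idx)

# Hide 15 interior points so every method has a neighbour on both sides.
hidden = rng.choice(np.arange(1, n - 1), size=15, replace=False)
observed = truth.copy()
observed.iloc[hidden] = np.nan

# Impute with each technique from this tutorial.
candidates = {
    "mean": observed.fillna(observed.mean()),
    "median": observed.fillna(observed.median()),
    "LOCF": observed.ffill(),
    "NOCB": observed.bfill(),
    "linear": observed.interpolate(method="linear"),
}

# Mean squared error on the hidden points only.
mse = {
    name: float(((filled.iloc[hidden] - truth.iloc[hidden]) ** 2).mean())
    for name, filled in candidates.items()
}

for name, value in sorted(mse.items(), key=lambda kv: kv[1]):
    print(f"{name:>6}: {value:.2f}")
```

The ranking depends on the series and on where the gaps fall, so it is worth repeating the experiment with several random masks before settling on a method.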

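Conclusion

In this article, we covered various methods for handling missing values in time series appropriately, and we saw how these methods are implemented in Python. That brings our session to a close. I hope you found this content helpful, and thank you for reading to the end.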
