Official account: 尤而小屋
Author: Peter
Editor: Peter
Hi everyone, I'm Peter~
This article is a hands-on deep learning case study: predicting lithium-ion (Li-ion) battery life with a Long Short-Term Memory (LSTM) model.
Importing Libraries
Machine learning modeling in Python typically relies on the following libraries:
- NumPy: the fundamental Python library for large multi-dimensional arrays and matrices, essential for machine learning algorithms.
- Pandas: provides flexible data structures for handling and analyzing data, including cleaning and preprocessing.
- Matplotlib: a data visualization library for creating all kinds of charts and figures.
- Seaborn: a higher-level wrapper around Matplotlib that offers more advanced visualization features.
- scikit-learn: one of the most widely used machine learning libraries, containing many classic algorithms and utilities.
- TensorFlow or PyTorch: deep learning frameworks for building and training neural networks; which one you need depends on your project.
- Keras: a high-level neural network API (originally able to run on top of TensorFlow, Theano, and CNTK) that makes it easy to build and train deep learning models.
- Statsmodels: for fitting and predicting with statistical models, including linear regression, logistic regression, and more.
In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.8f' % x)
# Visualization
import matplotlib.pyplot as plt
import matplotlib.ticker as tkr
%matplotlib inline
import seaborn as sns
sns.set_context("paper", font_scale=1.3)
sns.set_style('white')
import plotly_express as px
import plotly.graph_objects as go
from time import time
import math
from scipy import stats
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import pacf
# sklearn
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
# TensorFlow-Keras
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.callbacks import EarlyStopping
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
Loading the Data
In [2]:
df = pd.read_csv('data.csv')
df.head()
Out[2]:
|  | t | v | q | T |
|---|---|---|---|---|
| 0 | 736374.62276940 | 2.74435500 | -0.00004110 | 41.49570100 |
| 1 | 736374.62278098 | 2.87145950 | 0.19542568 | 41.54560500 |
| 2 | 736374.62279255 | 2.90456840 | 0.40105436 | 41.62048000 |
| 3 | 736374.62280412 | 2.92857840 | 0.60668314 | 41.55807100 |
| 4 | 736374.62281570 | 2.94793870 | 0.81231222 | 41.62048000 |
| ... | ... | ... | ... | ... |
| 398325 | 736088.79050114 | 2.75599150 | -673.31840507 | 40.98415400 |
| 398326 | 736088.79051271 | 2.74112060 | -673.52392141 | 41.03405400 |
| 398327 | 736088.79052429 | 2.72440580 | -673.72943744 | 41.00908700 |
| 398328 | 736088.79053586 | 2.70612760 | -673.93495363 | 41.04652000 |
| 398329 | 736088.79053983 | 2.69991470 | -674.00544561 | 41.03405400 |

398330 rows × 4 columns
Overall size of the dataset:
In [3]:
df.shape
Out[3]:
(398330, 4)
In [4]:
df.dtypes
Out[4]:
t float64
v float64
q float64
T float64
dtype: object
Checking for missing values:
In [5]:
df.isnull().sum()
Out[5]:
t 0
v 0
q 0
T 0
dtype: int64
The result shows that there are no missing values in this dataset.
In [6]:
df.describe()  # descriptive statistics of the data
Out[6]:
|  | t | v | q | T |
|---|---|---|---|---|
| count | 398330.00000000 | 398330.00000000 | 398330.00000000 | 398330.00000000 |
| mean | 736242.01817497 | 3.79064920 | -12.73245818 | 40.37094253 |
| std | 185.23179234 | 0.23120589 | 357.85993286 | 0.50859680 |
| min | 735954.81583877 | 2.69965200 | -734.81896570 | 38.20174400 |
| 25% | 736071.60333095 | 3.68081355 | -318.52437892 | 40.06082500 |
| 50% | 736230.62326857 | 3.81470610 | -13.14100665 | 40.28541600 |
| 75% | 736407.73875342 | 3.94288035 | 292.40127446 | 40.67221800 |
| max | 736564.74362440 | 4.20021440 | 718.74129793 | 42.03221900 |
- t: test time
- v: measured voltage
- q: positive/negative electrode capacity (charge)
- T: ambient temperature during the test
Sort the data in ascending order by test time t:
In [7]:
df.sort_values("t", inplace=True, ignore_index=True)
df.head()
Out[7]:
|  | t | v | q | T |
|---|---|---|---|---|
| 0 | 735954.81583877 | 4.18963380 | 0.00000000 | 40.11072900 |
| 1 | 735954.81585035 | 4.17384050 | -0.19524245 | 40.12323400 |
| 2 | 735954.81586192 | 4.17215730 | -0.40063316 | 40.13570000 |
| 3 | 735954.81587350 | 4.17087460 | -0.60602176 | 40.13570000 |
| 4 | 735954.81588507 | 4.16947170 | -0.81141119 | 40.02338800 |
Data Distribution
Distribution of time t
In [8]:
# sort by index in ascending order
df["t"].value_counts().sort_index()
Out[8]:
735954.81583877 1
735954.81585035 1
735954.81586192 1
735954.81587350 1
735954.81588507 1
..
736564.74358940 1
736564.74360097 1
736564.74361255 1
736564.74362412 1
736564.74362440 1
Name: t, Length: 398330, dtype: int64
As we can see, every timestamp in the data is unique.
In [9]:
print("start time: ", df["t"].min())
print("end time: ", df["t"].max())
start time: 735954.8158387741
end time: 736564.743624398
Distribution of voltage v
In [10]:
df["v"].describe()
Out[10]:
count 398330.00000000
mean 3.79064920
std 0.23120589
min 2.69965200
25% 3.68081355
50% 3.81470610
75% 3.94288035
max 4.20021440
Name: v, dtype: float64
In [11]:
print("V-min: ", df["v"].min())
print("V-max: ", df["v"].max())
V-min: 2.699652
V-max: 4.2002144
In [12]:
fig = px.box(df, y="v")
fig.show()
Check whether the distribution is roughly Gaussian:
In [13]:
sns.histplot(df.v, kde=True)  # distplot is deprecated in recent seaborn versions
plt.show()
Distribution of temperature T
In [14]:
print("T-min: ", df["T"].min())
print("T-max: ", df["T"].max())
T-min: 38.201744
T-max: 42.032219
In [15]:
fig = px.box(df, y="T")
fig.show()
Distribution of capacity q
Plot the density of the distribution, and compute kurtosis and skewness to check for normality.
- Kurtosis measures the "peakedness" of the probability distribution of a real-valued random variable.
- It characterizes the height of the density curve around the mean and describes how heavy or light the tails of the distribution are.
- For a normal distribution, (Fisher) kurtosis is close to 0. A kurtosis above 0 indicates heavier tails; below 0, lighter tails.
- Skewness measures asymmetry.
- It is the statistic that quantifies the direction and degree of asymmetry of a distribution of a real-valued random variable.
- Rule of thumb: a skewness between -0.5 and 0.5 means the data are fairly symmetric; between -1 and -0.5 or between 0.5 and 1, moderately skewed; below -1 or above 1, highly skewed.
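As a quick sanity check of these rules of thumb, here is a small sketch (on synthetic data, not the battery dataset) comparing a normal sample with a right-skewed one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_sample = rng.normal(0, 1, 100_000)    # symmetric, bell-shaped
skewed_sample = rng.exponential(1, 100_000)  # right-skewed (theoretical skewness = 2)

# For a normal sample, Fisher kurtosis and skewness are both close to 0
print("normal kurtosis:", stats.kurtosis(normal_sample))
print("normal skewness:", stats.skew(normal_sample))

# The exponential sample counts as highly skewed by the |skew| > 1 rule
print("exponential skewness:", stats.skew(skewed_sample))
```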
In [16]:
print( 'Kurtosis of normal distribution: {}'.format(stats.kurtosis(df.q)))
print( 'Skewness of normal distribution: {}'.format(stats.skew(df.q)))
Kurtosis of normal distribution: -1.1309840344769533
Skewness of normal distribution: 0.0058767955337298994
In [17]:
sns.histplot(df.q, kde=True)
plt.show()
One way to test whether data follow a normal distribution is D'Agostino's K^2 test.
This test combines the sample skewness and kurtosis into a single statistic and measures how far they depart from what a normal distribution would produce. It is implemented in the scipy library as scipy.stats.normaltest.
In [18]:
import numpy as np
from scipy.stats import normaltest
# generate a random sample
data = np.random.normal(0, 1, 1000)
# run D'Agostino's K^2 test; it returns the K^2 statistic and a p-value
# if the p-value exceeds the significance level (typically 0.05), we cannot
# reject the hypothesis that the data come from a normal distribution
k2, p = normaltest(data)
# 输出结果
print("K^2 statistic:", k2)
print("P-value:", p)
K^2 statistic: 1.5222953698760753
P-value: 0.4671300011784085
In [19]:
sns.histplot(data, kde=True, stat="density")  # clearly normal
Out[19]:
<AxesSubplot:ylabel='Density'>
Distribution of t & q
The relationship between t and q:
In [20]:
df1 = df.loc[:, ['t', 'q']]
df1.set_index('t', inplace=True)
df1.plot(figsize=(12, 5))
plt.ylabel('q')
plt.legend().set_visible(False)  # hide the legend
plt.tight_layout()
plt.title('Lithium-ion battery Time Series')
sns.despine(top=True)  # remove the top spine
plt.show()
LSTM Modeling and Prediction
Extracting the data
In [21]:
dataset = df.q.values
dataset = dataset.astype("float32")
dataset = np.reshape(dataset, (-1,1))
dataset
Out[21]:
array([[ 0.0000000e+00],
[-1.9524245e-01],
[-4.0063316e-01],
...,
[ 5.0550665e+02],
[ 5.0571231e+02],
[ 5.0571719e+02]], dtype=float32)
Scaling the Data
In [22]:
mm = MinMaxScaler(feature_range=(0,1)) # scale values into the [0, 1] range
dataset = mm.fit_transform(dataset)
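A minimal sketch (on toy values) of what MinMaxScaler does, and why the fitted `mm` object must be kept around: we will need its `inverse_transform` later to map predictions back to the original scale.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

toy = np.array([[-5.0], [0.0], [5.0], [15.0]])
scaler = MinMaxScaler(feature_range=(0, 1))

scaled = scaler.fit_transform(toy)           # (x - min) / (max - min)
restored = scaler.inverse_transform(scaled)  # undo the scaling

print(scaled.ravel())    # [0.   0.25 0.5  1.  ]
print(restored.ravel())  # [-5.  0.  5. 15.]
```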
Train/Test Split
Split the data into training and test sets:
In [23]:
train_size = int(len(dataset) * 0.8) # training set size
test_size = len(dataset) - train_size # test set size
train, test = dataset[0:train_size,:], dataset[train_size:len(dataset),:]
Data Conversion
Convert the array into sliding-window matrices:
In [24]:
def convert_dataset(dataset, k=1):
X,Y = [],[]
for i in range(len(dataset) - k - 1):
x = dataset[i:(i+k), 0]
y = dataset[i+k, 0]
X.append(x)
Y.append(y)
return np.array(X), np.array(Y)
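To see what `convert_dataset` produces, here is a toy run on the values 0..7 with a window of k=3. Each row of X is a window of k consecutive values, and Y is the value immediately after that window (note that the `- 1` in the loop bound also drops the last possible window):

```python
import numpy as np

def convert_dataset(dataset, k=1):
    X, Y = [], []
    for i in range(len(dataset) - k - 1):
        X.append(dataset[i:(i + k), 0])
        Y.append(dataset[i + k, 0])
    return np.array(X), np.array(Y)

toy = np.arange(8, dtype=float).reshape(-1, 1)  # column vector 0..7
X, Y = convert_dataset(toy, k=3)

print(X.shape, Y.shape)  # (4, 3) (4,)
print(X[0], Y[0])        # [0. 1. 2.] 3.0
print(X[-1], Y[-1])      # [3. 4. 5.] 6.0
```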
Apply the conversion to the training and test sets:
In [25]:
k = 30
X_train, Y_train = convert_dataset(train, k)
X_test, Y_test = convert_dataset(test, k)
In [26]:
X_train.shape
Out[26]:
(318633, 30)
In [27]:
Y_train.shape
Out[27]:
(318633,)
In [28]:
X_test.shape
Out[28]:
(79635, 30)
In [29]:
Y_test.shape
Out[29]:
(79635,)
In [30]:
# reshape: insert a middle dimension of size 1
X_train = np.reshape(X_train, (X_train.shape[0], 1, X_train.shape[1]))
X_test = np.reshape(X_test, (X_test.shape[0], 1, X_test.shape[1]))
In [31]:
X_train.shape
Out[31]:
(318633, 1, 30)
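Keras LSTM layers expect 3-D input of shape (samples, timesteps, features). In the layout used here, each window of length k = 30 is fed as a single timestep with 30 features; an alternative layout, (samples, k, 1), would feed one value per timestep. A minimal sketch of both reshapes on toy data:

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(4, 3)  # 4 windows of length 3

# layout used in this article: 1 timestep, k features
X_one_step = np.reshape(X, (X.shape[0], 1, X.shape[1]))
print(X_one_step.shape)  # (4, 1, 3)

# alternative layout: k timesteps, 1 feature per step
X_seq = np.reshape(X, (X.shape[0], X.shape[1], 1))
print(X_seq.shape)       # (4, 3, 1)
```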
Building the Model
In [32]:
model = Sequential()
model.add(LSTM(100, input_shape=(X_train.shape[1], X_train.shape[2]))) # input_shape excludes the batch dimension
model.add(Dropout(0.2))
model.add(Dense(1))
Compiling the Model
In [33]:
model.compile(loss="mean_squared_error", optimizer="adam")
Training the Model
In [34]:
history = model.fit(X_train, Y_train,
epochs=20,
batch_size=70,
validation_data=(X_test, Y_test),
callbacks=[EarlyStopping(monitor="val_loss", patience=10)], # early stopping on validation loss
verbose=1,
shuffle=False
)
model.summary()
Epoch 1/20
4552/4552 [==============================] - 12s 2ms/step - loss: 0.0071 - val_loss: 0.0043
Epoch 2/20
4552/4552 [==============================] - 11s 2ms/step - loss: 0.0045 - val_loss: 0.0059
Epoch 3/20
4552/4552 [==============================] - 11s 2ms/step - loss: 0.0029 - val_loss: 0.0041
Epoch 4/20
4552/4552 [==============================] - 10s 2ms/step - loss: 0.0020 - val_loss: 0.0028
Epoch 5/20
4552/4552 [==============================] - 10s 2ms/step - loss: 0.0016 - val_loss: 0.0028
Epoch 6/20
4552/4552 [==============================] - 10s 2ms/step - loss: 0.0014 - val_loss: 0.0022
Epoch 7/20
4552/4552 [==============================] - 10s 2ms/step - loss: 0.0011 - val_loss: 0.0040
Epoch 8/20
4552/4552 [==============================] - 10s 2ms/step - loss: 9.6566e-04 - val_loss: 0.0014
Epoch 9/20
4552/4552 [==============================] - 10s 2ms/step - loss: 9.4363e-04 - val_loss: 0.0036
Epoch 10/20
4552/4552 [==============================] - 10s 2ms/step - loss: 8.5776e-04 - val_loss: 2.2590e-04
Epoch 11/20
4552/4552 [==============================] - 10s 2ms/step - loss: 8.4321e-04 - val_loss: 3.7745e-04
Epoch 12/20
4552/4552 [==============================] - 10s 2ms/step - loss: 7.1527e-04 - val_loss: 7.2602e-04
Epoch 13/20
4552/4552 [==============================] - 8s 2ms/step - loss: 7.3660e-04 - val_loss: 7.1454e-04
Epoch 14/20
4552/4552 [==============================] - 9s 2ms/step - loss: 6.6346e-04 - val_loss: 4.9174e-04
Epoch 15/20
4552/4552 [==============================] - 9s 2ms/step - loss: 6.3772e-04 - val_loss: 0.0011
Epoch 16/20
4552/4552 [==============================] - 10s 2ms/step - loss: 5.4255e-04 - val_loss: 5.4382e-04
Epoch 17/20
4552/4552 [==============================] - 9s 2ms/step - loss: 6.4824e-04 - val_loss: 3.9280e-04
Epoch 18/20
4552/4552 [==============================] - 10s 2ms/step - loss: 5.3917e-04 - val_loss: 3.5105e-04
Epoch 19/20
4552/4552 [==============================] - 9s 2ms/step - loss: 5.4563e-04 - val_loss: 3.9737e-04
Epoch 20/20
4552/4552 [==============================] - 10s 2ms/step - loss: 5.1043e-04 - val_loss: 2.2002e-04
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm (LSTM) (None, 100) 52400
dropout (Dropout) (None, 100) 0
dense (Dense) (None, 1) 101
=================================================================
Total params: 52,501
Trainable params: 52,501
Non-trainable params: 0
_________________________________________________________________
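The 52,400 parameters reported for the LSTM layer can be checked by hand: an LSTM has four gates, each with an input weight matrix, a recurrent weight matrix, and a bias vector.

```python
units, features = 100, 30  # 100 LSTM units, 30-feature input

# 4 gates x (input weights + recurrent weights + bias)
lstm_params = 4 * (features * units + units * units + units)
dense_params = units * 1 + 1  # Dense(1): 100 weights + 1 bias

print(lstm_params)                 # 52400
print(dense_params)                # 101
print(lstm_params + dense_params)  # 52501
```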
Prediction
In [35]:
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)
# map predictions back to the original scale with inverse_transform
train_pred = mm.inverse_transform(train_pred)
Y_train = mm.inverse_transform([Y_train])
test_pred = mm.inverse_transform(test_pred)
Y_test = mm.inverse_transform([Y_test])
9958/9958 [==============================] - 8s 753us/step
2489/2489 [==============================] - 2s 779us/step
In [36]:
train_pred[:5]
Out[36]:
array([[-25.417528],
[-25.618053],
[-25.818405],
[-26.018715],
[-26.219025]], dtype=float32)
In [37]:
test_pred[:5]
Out[37]:
array([[339.21115],
[339.40436],
[339.59738],
[339.7904 ],
[339.98343]], dtype=float32)
Evaluation Metrics
In [38]:
print("Train-MAE:", mean_absolute_error(Y_train[0], train_pred[:, 0]))
print("Train-RMSE:", np.sqrt(mean_squared_error(Y_train[0], train_pred[:, 0])))
print("Test-MAE:", mean_absolute_error(Y_test[0], test_pred[:, 0]))
print("Test-RMSE:", np.sqrt(mean_squared_error(Y_test[0], test_pred[:, 0])))
Train-MAE: 22.344867164491458
Train-RMSE: 32.17790188891644
Test-MAE: 15.700488038320737
Test-RMSE: 21.560603609538187
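As a reminder of what these metrics compute, here is a toy check that the sklearn functions match the definitions MAE = mean(|y - ŷ|) and RMSE = sqrt(mean((y - ŷ)²)):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 5.0])

mae = np.mean(np.abs(y_true - y_pred))           # mean absolute error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # root mean squared error

assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
print(mae, rmse)  # 0.625 0.75
```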
In [39]:
Y_test.shape
Out[39]:
(1, 79635)
Visualizing the Training Loss
In [40]:
history_dict = history.history
for k in history_dict.keys():
print(k)
loss
val_loss
In [41]:
plt.figure(figsize=(10,6))
plt.plot(history_dict['loss'], label='Train Loss')
plt.plot(history_dict['val_loss'], label='Test Loss')
plt.title('LSTM Loss')
plt.ylabel('Loss')
plt.xlabel('Epochs')
plt.legend(loc='upper right')
plt.show()
Actual vs. Predicted
Compare actual and predicted values on the test set:
In [42]:
x = [i for i in range(70000)]
plt.figure(figsize=(10,5))
plt.plot(x, Y_test[0][:70000], marker=".",label="actual")
plt.plot(x, test_pred[:,0][:70000], "r",label="predict")
plt.tight_layout()
sns.despine(top=True)
plt.subplots_adjust(left=0.07)
plt.xlabel("Time",size=15)
plt.ylabel("Charge",size=15)
plt.legend(fontsize=15)
plt.show()