Industrial Steam Volume Prediction


Preface

In thermal power generation, burning fuel heats water into steam, the steam drives a turbine, and the turbine in turn drives a generator to produce electricity. In this process the boiler's combustion efficiency is the key factor in generation efficiency. Many variables affect combustion efficiency, including the boiler's operating inputs, such as fuel feed rate, primary and secondary air, induced draft, and return-feed air, as well as its operating state, such as bed temperature, furnace temperature, and bed pressure. This article uses an industrial steam dataset to predict the amount of steam produced and thereby analyze the efficiency of thermal power generation.

1.1 Jupyter Setup, Imports, and Dataset Loading

Import the required modules.

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import sklearn
from sklearn.exceptions import ConvergenceWarning
import pandas_profiling

Suppress warnings.

warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)

Set a Chinese-capable font for matplotlib and seaborn so Chinese labels render correctly.

mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
sns.set(font='SimHei')

Configure how many rows and columns pandas displays in Jupyter.

pd.options.display.min_rows = None
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 30)

Load the data.

df_train = pd.read_csv('zhengqi_train.txt',sep='\t',encoding='utf-8')
df_test = pd.read_csv('zhengqi_test.txt',sep='\t',encoding='utf-8')

1.2 Exploratory Analysis

1.2.1 Examining the Dataset

1.2.1.1 Previewing the data
  • Preview the first and last five rows.
df_train.head(5).append(df_train.tail(5))
————————————————————————————————————————————————————————
	V0	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12	V13	V14	...	V24	V25	V26	V27	V28	V29	V30	V31	V32	V33	V34	V35	V36	V37	target
0	0.566	0.016	-0.143	0.407	0.452	-0.901	-1.812	-2.360	-0.436	-2.114	-0.940	-0.307	-0.073	0.550	-0.484	...	0.800	-0.223	0.796	0.168	-0.450	0.136	0.109	-0.615	0.327	-4.627	-4.789	-5.101	-2.608	-3.508	0.175
1	0.968	0.437	0.066	0.566	0.194	-0.893	-1.566	-2.360	0.332	-2.114	0.188	-0.455	-0.134	1.109	-0.488	...	0.801	-0.144	1.057	0.338	0.671	-0.128	0.124	0.032	0.600	-0.843	0.160	0.364	-0.335	-0.730	0.676
2	1.013	0.568	0.235	0.370	0.112	-0.797	-1.367	-2.360	0.396	-2.114	0.874	-0.051	-0.072	0.767	-0.493	...	0.961	-0.067	0.915	0.326	1.287	-0.009	0.361	0.277	-0.116	-0.843	0.160	0.364	0.765	-0.589	0.633
3	0.733	0.368	0.283	0.165	0.599	-0.679	-1.200	-2.086	0.403	-2.114	0.011	0.102	-0.014	0.769	-0.371	...	1.435	0.113	0.898	0.277	1.298	0.015	0.417	0.279	0.603	-0.843	-0.065	0.364	0.333	-0.112	0.206
4	0.684	0.638	0.260	0.209	0.337	-0.454	-1.073	-2.086	0.314	-2.114	-0.251	0.570	0.199	-0.349	-0.342	...	0.881	0.221	0.386	0.332	1.289	0.183	1.078	0.328	0.418	-0.843	-0.215	0.364	-0.280	-0.028	0.384
2883	0.190	-0.025	-0.138	0.161	0.600	-0.212	0.757	0.584	-0.026	0.904	0.355	-0.066	0.436	0.141	-0.560	...	-1.310	0.094	-0.461	0.189	-0.449	0.128	-0.208	0.809	-0.173	0.247	-0.027	-0.349	0.576	0.686	0.235
2884	0.507	0.557	0.296	0.183	0.530	-0.237	0.749	0.584	0.537	0.904	-0.061	0.033	0.414	-0.634	-0.626	...	-1.314	-0.066	-0.892	0.372	-0.439	0.291	-0.287	0.465	-0.310	0.763	0.498	-0.349	-0.615	-0.380	1.042
2885	-0.394	-0.721	-0.485	0.084	0.136	0.034	0.655	0.614	-0.818	0.904	0.240	0.287	-0.185	0.389	-0.725	...	-1.310	-0.360	-0.349	0.058	-0.445	0.291	-0.179	0.268	0.552	0.763	0.498	-0.349	0.951	0.748	0.005
2886	-0.219	-0.282	-0.344	-0.049	0.449	-0.140	0.560	0.583	-0.596	0.904	-0.395	-0.023	-0.053	-0.310	-0.258	...	-1.313	-0.603	-0.677	0.133	-0.448	0.216	1.061	-0.051	1.023	0.878	0.610	-0.230	-0.301	0.555	0.350
2887	0.368	0.380	-0.225	-0.049	0.379	0.092	0.550	0.551	0.244	0.904	-0.419	0.515	0.346	-0.114	-0.204	...	-1.314	-0.662	-0.596	0.208	-0.449	0.047	0.057	-0.042	0.847	0.534	-0.009	-0.190	-0.567	0.388	0.417
10 rows × 39 columns
1.2.1.2 Summary statistics
df_train.describe()
————————————————————————————————————————————————————————

V0	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12	V13	V14	...	V24	V25	V26	V27	V28	V29	V30	V31	V32	V33	V34	V35	V36	V37	target
count	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	...	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000	2888.000000
mean	0.123048	0.056068	0.289720	-0.067790	0.012921	-0.558565	0.182892	0.116155	0.177856	-0.169452	0.034319	-0.364465	0.023177	0.195738	0.016081	...	-0.021813	-0.051679	0.072092	0.272407	0.137712	0.097648	0.055477	0.127791	0.020806	0.007801	0.006715	0.197764	0.030658	-0.130330	0.126353
std	0.928031	0.941515	0.911236	0.970298	0.888377	0.517957	0.918054	0.955116	0.895444	0.953813	0.968272	0.858504	0.894092	0.922757	1.015585	...	1.033403	0.915957	0.889771	0.270374	0.929899	1.061200	0.901934	0.873028	0.902584	1.006995	1.003291	0.985675	0.970812	1.017196	0.983966
min	-4.335000	-5.122000	-3.420000	-3.956000	-4.742000	-2.182000	-4.576000	-5.048000	-4.692000	-12.891000	-2.584000	-3.160000	-5.165000	-3.675000	-2.455000	...	-1.344000	-3.808000	-5.131000	-1.164000	-2.435000	-2.912000	-4.507000	-5.859000	-4.053000	-4.627000	-4.789000	-5.695000	-2.608000	-3.630000	-3.044000
25%	-0.297000	-0.226250	-0.313000	-0.652250	-0.385000	-0.853000	-0.310000	-0.295000	-0.159000	-0.390000	-0.420500	-0.803250	-0.419000	-0.398000	-0.668000	...	-1.191000	-0.557250	-0.452000	0.157750	-0.455000	-0.664000	-0.283000	-0.170250	-0.407250	-0.499000	-0.290000	-0.202500	-0.413000	-0.798250	-0.350250
50%	0.359000	0.272500	0.386000	-0.044500	0.110000	-0.466000	0.388000	0.344000	0.362000	0.042000	0.157000	-0.112000	0.123000	0.289500	-0.161000	...	0.095000	-0.076000	0.075000	0.325000	-0.447000	-0.023000	0.053500	0.299500	0.039000	-0.040000	0.160000	0.364000	0.137000	-0.185500	0.313000
75%	0.726000	0.599000	0.918250	0.624000	0.550250	-0.154000	0.831250	0.782250	0.726000	0.042000	0.619250	0.247000	0.616000	0.864250	0.829750	...	0.931250	0.356000	0.644250	0.442000	0.730000	0.745250	0.488000	0.635000	0.557000	0.462000	0.273000	0.602000	0.644250	0.495250	0.793250
max	2.121000	1.918000	2.828000	2.457000	2.689000	0.489000	1.895000	1.918000	2.245000	1.335000	4.830000	1.455000	2.657000	2.475000	2.558000	...	2.423000	7.284000	2.980000	0.925000	4.671000	4.580000	2.689000	2.013000	2.395000	5.465000	5.110000	2.324000	5.238000	3.000000	2.538000
8 rows × 39 columns
1.2.1.3 Data types
  • All 39 columns are of type float64.
df_train.info()
————————————————————————————————————————————————————————
dtypes: float64(39)
memory usage: 880.1 KB
1.2.1.4 Dimensions of the training and test sets
df_train.shape,df_test.shape
————————————————————————————————————————————————————————
((2888, 39), (1925, 38))
1.2.1.5 Missing values
  • No missing values were found in the training set.
df_train.isnull().sum()
# percentage of missing values per column
missing_pct = df_train.isnull().sum() * 100 / len(df_train)
missing = pd.DataFrame({
    'name': df_train.columns,
    'missing_pct': missing_pct,
})
missing.sort_values(by='missing_pct', ascending=False).head()
————————————————————————————————————————————————————————
name	missing_pct
V0	V0	0.0
V29	V29	0.0
V22	V22	0.0
V23	V23	0.0
V24	V24	0.0
1.2.1.6 Target distribution
  • Mean, median, maximum, and minimum of the target.
(df_train['target'].mean(),
 df_train['target'].median(),
 df_train['target'].max(),
 df_train['target'].min())
————————————————————————————————————————————————————————
(0.12635283933517938, 0.313, 2.5380000000000003, -3.0439999999999996)
  • Target values plotted over the sample index.
plt.figure()
df_train['target'].plot()
plt.ylabel('target')
plt.xlabel('id')
plt.show()

[Figure: target value plotted against sample id]
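  • As a supplementary check (not in the original write-up), the sketch below looks at the shape of the target distribution; it assumes scipy is available.
# a hedged sketch: histogram/KDE and a Q-Q plot of the target, plus skewness
from scipy import stats

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.distplot(df_train['target'], ax=axes[0])       # histogram + KDE
stats.probplot(df_train['target'], plot=axes[1])   # Q-Q plot against a normal
plt.show()
print('skewness:', df_train['target'].skew())
print('kurtosis:', df_train['target'].kurt())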

1.2.1.7 Generating a data report
pfr = pandas_profiling.ProfileReport(df_train)
pfr.to_file("./example.html")
————————————————————————————————————————————————————————
Summarize dataset: 100%
53/53 [02:42<00:00, 3.06s/it, Completed]

Generate report structure: 100%
1/1 [00:11<00:00, 11.36s/it]

Render HTML: 100%
1/1 [00:32<00:00, 32.81s/it]

Export report to file: 100%
1/1 [00:00<00:00, 3.85it/s]

[Figure: pandas_profiling report preview]

1.2.2 Feature Distribution Curves

  • The plots show whether each feature is approximately normally distributed.
  • V9, V17, V18, V22, V23, V24, V28, and V35 have very uneven value distributions.
  • V14, V17, V19, V22, V24, and V28 show multiple peaks.
df = pd.melt(df_train, value_vars=df_train.columns)
sp = sns.FacetGrid(df, col='variable', col_wrap=5, sharex=False, sharey=False)
sp = sp.map(sns.distplot, 'value', color='m', rug=True)

[Figure: distribution (distplot) of each feature]
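  • As a supplementary check (not in the original write-up), per-feature skewness quantifies the visual impression from the plots above; the sketch below lists the most skewed columns.
# a hedged sketch: absolute skewness of each feature, largest first
skewness = df_train.drop('target', axis=1).skew().abs().sort_values(ascending=False)
print(skewness.head(10))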

1.3 Feature Engineering

  • Append the test set to the training set so that both go through identical feature-engineering transformations.
target = df_train['target']
combined = df_train.drop('target', axis=1).append(df_test)
combined.head()
————————————————————————————————————————————————————————

V0	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12	V13	V14	...	V23	V24	V25	V26	V27	V28	V29	V30	V31	V32	V33	V34	V35	V36	V37
0	0.566	0.016	-0.143	0.407	0.452	-0.901	-1.812	-2.360	-0.436	-2.114	-0.940	-0.307	-0.073	0.550	-0.484	...	0.356	0.800	-0.223	0.796	0.168	-0.450	0.136	0.109	-0.615	0.327	-4.627	-4.789	-5.101	-2.608	-3.508
1	0.968	0.437	0.066	0.566	0.194	-0.893	-1.566	-2.360	0.332	-2.114	0.188	-0.455	-0.134	1.109	-0.488	...	0.357	0.801	-0.144	1.057	0.338	0.671	-0.128	0.124	0.032	0.600	-0.843	0.160	0.364	-0.335	-0.730
2	1.013	0.568	0.235	0.370	0.112	-0.797	-1.367	-2.360	0.396	-2.114	0.874	-0.051	-0.072	0.767	-0.493	...	0.355	0.961	-0.067	0.915	0.326	1.287	-0.009	0.361	0.277	-0.116	-0.843	0.160	0.364	0.765	-0.589
3	0.733	0.368	0.283	0.165	0.599	-0.679	-1.200	-2.086	0.403	-2.114	0.011	0.102	-0.014	0.769	-0.371	...	0.352	1.435	0.113	0.898	0.277	1.298	0.015	0.417	0.279	0.603	-0.843	-0.065	0.364	0.333	-0.112
4	0.684	0.638	0.260	0.209	0.337	-0.454	-1.073	-2.086	0.314	-2.114	-0.251	0.570	0.199	-0.349	-0.342	...	0.352	0.881	0.221	0.386	0.332	1.289	0.183	1.078	0.328	0.418	-0.843	-0.215	0.364	-0.280	-0.028
5 rows × 38 columns

1.3.1 Feature Correlation Analysis

  • The pairplot below shows the pairwise relationships of all features; by inspection they fall into several groups:
  • Linearly related to V0: V1, V4, V8, V12, V27, V31, target
  • Linearly related to V2: V6, V7, V16
  • Linearly related to V5: V11
  • Linearly related to V10: V36
  • Linearly related to V15: V29
  • Linearly related to V33: V34
features = [f'V{i}' for i in range(38)] + ['target']
sns.pairplot(df_train[features])

[Figure: pairplot of all features and target]

  • Having done this initial correlation analysis on the training set, we now use the results to remove highly correlated features from combined.
plt.figure(figsize=(15,8))
sns.heatmap(combined[features].corr(),annot=True)
plt.show()

[Figure: correlation heatmap of all features]

  • Examine V0, V1, V4, V8, V12, V27, and V31.
features = ['V0','V1','V4','V8','V12','V27','V31','V16']
plt.figure(figsize=(15,8))
sns.heatmap(combined[features].corr(),annot=True)
plt.show()

[Figure: correlation heatmap of V0, V1, V4, V8, V12, V27, V31, V16]

  • Examine V2, V6, V7, and V16.
features = ['V2','V6','V7','V16']
plt.figure(figsize=(15,8))
sns.heatmap(combined[features].corr(),annot=True)
plt.show()

[Figure: correlation heatmap of the V2 group]

  • Drop features whose Pearson correlation with another feature exceeds 0.75: V1, V4, V5, V6, V8, V16, V29, V36 (the sketch below lists such pairs programmatically).
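  • As a supplementary check (not in the original write-up), the highly correlated pairs can be listed instead of being read off the heatmaps:
# a hedged sketch: list feature pairs with |Pearson correlation| > 0.75
corr = combined.corr().abs()
# keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs > 0.75].sort_values(ascending=False))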

features = ['V0','V2','V7','V10','V11','V12',
            'V15','V27','V31','V33','V34']
plt.figure(figsize=(15,8))
sns.heatmap(combined[features].corr(),annot=True)
plt.show()
rv_features = ['V1','V4','V5','V6','V8','V16','V29','V36']
combined.drop(rv_features,axis=1,inplace=True)
combined.head()
————————————————————————————————————————————————————————
	V0	V2	V3	V7	V9	V10	V11	V12	V13	V14	V15	V17	V18	V19	V20	V21	V22	V23	V24	V25	V26	V27	V28	V30	V31	V32	V33	V34	V35	V37
0	0.566	-0.143	0.407	-2.360	-2.114	-0.940	-0.307	-0.073	0.550	-0.484	0.000	-1.162	-0.573	-0.991	0.610	-0.400	-0.063	0.356	0.800	-0.223	0.796	0.168	-0.450	0.109	-0.615	0.327	-4.627	-4.789	-5.101	-3.508
1	0.968	0.066	0.566	-2.360	-2.114	0.188	-0.455	-0.134	1.109	-0.488	0.000	-1.162	-0.571	-0.836	0.588	-0.802	-0.063	0.357	0.801	-0.144	1.057	0.338	0.671	0.124	0.032	0.600	-0.843	0.160	0.364	-0.730
2	1.013	0.235	0.370	-2.360	-2.114	0.874	-0.051	-0.072	0.767	-0.493	-0.212	-0.897	-0.564	-0.558	0.576	-0.477	-0.063	0.355	0.961	-0.067	0.915	0.326	1.287	0.361	0.277	-0.116	-0.843	0.160	0.364	-0.589
3	0.733	0.283	0.165	-2.086	-2.114	0.011	0.102	-0.014	0.769	-0.371	-0.162	-0.897	-0.574	-0.564	0.272	-0.491	-0.063	0.352	1.435	0.113	0.898	0.277	1.298	0.417	0.279	0.603	-0.843	-0.065	0.364	-0.112
4	0.684	0.260	0.209	-2.086	-2.114	-0.251	0.570	0.199	-0.349	-0.342	-0.138	-0.897	-0.572	-0.394	0.106	0.309	-0.259	0.352	0.881	0.221	0.386	0.332	1.289	1.078	0.328	0.418	-0.843	-0.215	0.364	-0.028

1.3.2 Feature Distribution Consistency

  • Features whose distributions differ between the training and test sets hurt the model's ability to generalize, so such features are removed; the KDE comparison below shows that V17 and V22 should be dropped.
plt.figure(figsize=(42, 36))
n_train = len(df_train)
for i, feature in enumerate(combined.columns, start=1):
    ax = plt.subplot(6, 6, i)
    ax = sns.kdeplot(combined[feature][:n_train], color='r', shade=True)
    ax = sns.kdeplot(combined[feature][n_train:], color='k', shade=True)
    ax.legend(['train', 'test'])
plt.show()
combined.drop(['V17','V22'], axis=1, inplace=True)

[Figure: train vs. test KDE curves for each feature]
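  • As a supplementary check (not in the original write-up), a two-sample Kolmogorov-Smirnov test quantifies the train/test gap that the KDE plots show visually; run before dropping, it would be expected to flag V17 and V22. Assumes scipy is available.
# a hedged sketch: KS test per feature between the raw train and test sets
from scipy.stats import ks_2samp

for feature in df_test.columns:      # features present in both raw sets
    stat, p = ks_2samp(df_train[feature], df_test[feature])
    if p < 0.01:                     # significantly different distributions
        print(f'{feature}: KS statistic = {stat:.3f}, p-value = {p:.3g}')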

1.3.3 Outlier Handling

  • For normally distributed data, values within three standard deviations of the mean can be considered normal; values outside that range are treated as outliers.
def find_outliers_by_3segama(data, feature):
    data_std = np.std(data[feature])
    data_mean = np.mean(data[feature])
    # three standard deviations
    outliers_cut_off = data_std * 3
    # lower bound
    lower_rule = data_mean - outliers_cut_off
    # upper bound
    upper_rule = data_mean + outliers_cut_off
    # flag each value as outlier or normal
    data[feature + '_outliers'] = data[feature].apply(
        lambda x: 'outlier' if x > upper_rule or x < lower_rule else 'normal')
    return data

# flag outliers feature by feature (snapshot the columns so the loop
# does not pick up the newly added *_outliers columns)
for feature in df_train.columns.tolist():
    df_train = find_outliers_by_3segama(df_train, feature)
    print(df_train[feature + '_outliers'].value_counts())
    print('=' * 50)

[Output: outlier/normal value counts for each feature]

  • Remove the rows flagged as outliers (optional), then keep only the retained features plus the target (29 columns).
kept_cols = combined.columns.tolist() + ['target']
for fea in kept_cols:
    df_train = df_train[df_train[fea + '_outliers'] == 'normal']
df_train = df_train[kept_cols]
df_train.shape
————————————————————————————————————————————————————————
(2447, 29)

1.4 Model Training

  • Import the required modules.
import sklearn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn import metrics
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer, PolynomialFeatures
from sklearn.pipeline import Pipeline
import time
from bayes_opt import BayesianOptimization
  • Split the data.
train = combined[:2888]
test = combined[2888:]
x_train, x_val, y_train, y_val = train_test_split(train, target, test_size=0.2, random_state=42)
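  • As a supplementary note (not in the original write-up), a single hold-out split can be noisy; cross_val_score, already imported above, gives a steadier error estimate, e.g. for plain linear regression:
# a hedged sketch: 5-fold cross-validated MSE on the raw training features
scores = cross_val_score(LinearRegression(), train, target,
                         cv=5, scoring='neg_mean_squared_error')
print('CV MSE: %.4f (+/- %.4f)' % (-scores.mean(), scores.std()))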

1.4.1 Baseline Models and Scores

  • Create the regression models.
lr = LinearRegression()
rgcv = RidgeCV()
eltcv = ElasticNetCV()
lasso = LassoCV()
rf = RandomForestRegressor()
gbdt = GradientBoostingRegressor()
xgb = XGBRegressor()
lgbm = LGBMRegressor()
models = [lr, rgcv, eltcv, lasso, rf, gbdt, xgb, lgbm]
  • Data transformations: standardization, min-max scaling, normalization, and polynomial expansion were tried; polynomial expansion performed best and is used here.
# the transformer instantiations are missing in the original post; defaults are assumed
ss = StandardScaler()
mms = MinMaxScaler()
norm = Normalizer()
poly = PolynomialFeatures()   # default degree=2 assumed

# x_train = ss.fit_transform(x_train)
# x_val = ss.transform(x_val)

x_train = poly.fit_transform(x_train)
x_val = poly.transform(x_val)

# x_train = norm.fit_transform(x_train)
# x_val = norm.transform(x_val)

# x_train = mms.fit_transform(x_train)
# x_val = mms.transform(x_val)
for model in models:
    model=model.fit(x_train,y_train)
    predict_val=model.predict(x_val)
    print(model)
    print('val r2_score :',metrics.r2_score(y_val,predict_val))
    print('val mean_squared_error :',metrics.mean_squared_error(y_val,predict_val))
    print('**********************************')
————————————————————————————————————————————————————————
LinearRegression()
val r2_score : 0.8255887645766965
val mean_squared_error : 0.10567861144493183
**********************************
RidgeCV(alphas=array([ 0.1,  1. , 10. ]))
val r2_score : 0.8255973558177241
val mean_squared_error : 0.10567340587190682
**********************************
ElasticNetCV()
val r2_score : 0.826007552533511
val mean_squared_error : 0.10542486099325592
**********************************
LassoCV()
val r2_score : 0.8260362646728213
val mean_squared_error : 0.1054074638398755
**********************************
RandomForestRegressor()
val r2_score : 0.8142584766091285
val mean_squared_error : 0.11254381767306122
**********************************
GradientBoostingRegressor()
val r2_score : 0.8214384543762159
val mean_squared_error : 0.10819335206922724
**********************************
[00:29:40] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
XGBRegressor()
val r2_score : 0.8221341278990157
val mean_squared_error : 0.10777183213830037
**********************************
LGBMRegressor()
val r2_score : 0.8315534168942726
val mean_squared_error : 0.10206453134772146
**********************************

1.4.2 Hyperparameter Tuning

1.4.2.1 Tuning Ridge and Lasso
  • Based on the R² scores and MSE of the baseline models, several of them are selected for tuning. A Pipeline chains standardization, polynomial expansion, and the regression model, and the grid search tunes the parameters of all three steps together.
## a Pipeline lets GridSearchCV tune preprocessing and model parameters jointly
models = [
    Pipeline([
            ('ss', StandardScaler()),
            ('poly', PolynomialFeatures()),
            ('linear', RidgeCV(alphas=np.logspace(-3,1,10)))
        ]),
    Pipeline([
            ('ss', StandardScaler()),
            ('poly', PolynomialFeatures()),
            ('linear', LassoCV(alphas=np.logspace(-3,1,10)))
        ]),
]  

# parameter grid: keys are step__parameter names, values are candidate lists
parameters = {
    "poly__degree": [3,2,1], 
    "poly__interaction_only": [True, False],
    "poly__include_bias": [True, False],
    "linear__fit_intercept": [True, False]
}
for mode in models:
    model = GridSearchCV(mode, param_grid=parameters, cv=5,
                         scoring='neg_mean_squared_error')
    model.fit(x_train, y_train)
    print(mode[2])
    print("best params:", model.best_params_)
    print("best score:", model.best_score_)
    print('**************************************************************')
————————————————————————————————————————————————————————
RidgeCV(alphas=array([1.00000000e-03, 2.78255940e-03, 7.74263683e-03, 2.15443469e-02,
       5.99484250e-02, 1.66810054e-01, 4.64158883e-01, 1.29154967e+00,
       3.59381366e+00, 1.00000000e+01]))
best params: {'linear__fit_intercept': True, 'poly__degree': 1, 'poly__include_bias': True, 'poly__interaction_only': True}
best score: -0.10341087538163592
**************************************************************
LassoCV(alphas=array([1.00000000e-03, 2.78255940e-03, 7.74263683e-03, 2.15443469e-02,
       5.99484250e-02, 1.66810054e-01, 4.64158883e-01, 1.29154967e+00,
       3.59381366e+00, 1.00000000e+01]))
best params: {'linear__fit_intercept': False, 'poly__degree': 2, 'poly__include_bias': True, 'poly__interaction_only': False}
best score: -0.10087217338993411
********************************************************
  • Retrain with the best parameters and evaluate on the validation set.
models = [
    Pipeline([('ss', StandardScaler()),
              ('poly', PolynomialFeatures(degree=1, include_bias=True, interaction_only=True)),
              ('linear', RidgeCV(alphas=np.logspace(-3,1,10), fit_intercept=True))]),
    Pipeline([('ss', StandardScaler()),
              ('poly', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)),
              ('linear', LassoCV(alphas=np.logspace(-3,1,10), fit_intercept=False))])
]
for mode in models:
    model = mode.fit(x_train, y_train)
    predict_val = model.predict(x_val)
    print(model)
    print('val mean_squared_error :', metrics.mean_squared_error(y_val, predict_val))
    print('**********************************')
————————————————————————————————————————————————————————
val mean_squared_error : 0.10561696740544291
**********************************
val mean_squared_error : 0.10668960584119808
**********************************
1.4.2.2 Tuning LightGBM
model_lgb = LGBMRegressor(random_state=2021)
params_dic = dict(learning_rate=[0.01, 0.1, 1], n_estimators=[20, 50, 120, 300],
                  num_leaves=[10, 30], max_depth=[-1, 4, 10])
grid_search = GridSearchCV(model_lgb, cv=5,
                           param_grid=params_dic,
                           scoring='neg_mean_squared_error')
grid_search.fit(x_train, y_train)
print(f'best params: {grid_search.best_params_}')
print(f'best score: {-grid_search.best_score_}')
————————————————————————————————————————————————————————
# best params: {'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 300, 'num_leaves': 30}
  • Validate the final LightGBM model on the validation set. (The hyperparameters below differ from the grid-search optimum and appear to have been adjusted manually.)
lgb_final = LGBMRegressor(random_state=2021, learning_rate=0.1, max_depth=5,
                          n_estimators=200, num_leaves=50)
lgb_final.fit(x_train, y_train)
val_pred = lgb_final.predict(x_val)
print(f'mean_squared_error:{metrics.mean_squared_error(y_val, val_pred)}')
————————————————————————————————————————————————————————
mean_squared_error:0.10652795234920218
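  • As an alternative (not in the original write-up), BayesianOptimization is imported above but never used; it could replace the grid search. The parameter bounds below are illustrative assumptions, not tuned values.
# a hedged sketch: Bayesian optimization of a few LightGBM hyperparameters
def lgb_cv(num_leaves, max_depth, learning_rate):
    # objective: mean 5-fold CV score (neg MSE, so higher is better)
    model = LGBMRegressor(num_leaves=int(num_leaves), max_depth=int(max_depth),
                          learning_rate=learning_rate, n_estimators=300,
                          random_state=2021)
    return cross_val_score(model, x_train, y_train, cv=5,
                           scoring='neg_mean_squared_error').mean()

optimizer = BayesianOptimization(f=lgb_cv,
                                 pbounds={'num_leaves': (10, 60),      # assumed bounds
                                          'max_depth': (3, 10),
                                          'learning_rate': (0.01, 0.3)},
                                 random_state=2021)
optimizer.maximize(init_points=5, n_iter=20)
print(optimizer.max)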
1.4.2.3 Tuning XGBoost
xgb_re = XGBRegressor(seed=27, learning_rate=0.1, n_estimators=300, silent=0, objective='reg:linear',
                      gamma=0, subsample=0.8, colsample_bytree=0.8, nthread=4, scale_pos_weight=1)
xgb_params = {'n_estimators': [50, 100, 120], 'min_child_weight': list(range(1, 4, 2))}
best_model = GridSearchCV(xgb_re, param_grid=xgb_params, refit=True,
                          cv=5, scoring='neg_mean_squared_error')
best_model.fit(x_train, y_train)
print('best_parameters:', best_model.best_params_)
print(f'best score: {-best_model.best_score_}')
————————————————————————————————————————————————————————
# best_parameters: {'min_child_weight': 3, 'n_estimators': 120}
  • Validate the final XGBoost model on the validation set. (Again, the parameters below differ from the grid-search optimum.)
xgb_final = XGBRegressor(seed=27, learning_rate=0.1, objective='reg:linear',
                         gamma=0.2, subsample=0.5,
                         colsample_bytree=0.8, nthread=1, scale_pos_weight=1,
                         min_child_weight=0.3, n_estimators=300)
xgb_final.fit(x_train, y_train)
val_pred = xgb_final.predict(x_val)
print(f'mean_squared_error:{metrics.mean_squared_error(y_val, val_pred)}')
————————————————————————————————————————————————————————
mean_squared_error:0.10342130385311427

1.4.3 Final Prediction and Output File

  • Predict on the test set and write the submission txt file. The test features must go through the same polynomial expansion that was fitted on the training features.
xgb_final = XGBRegressor(seed=27, learning_rate=0.1, objective='reg:linear',
                         gamma=0.2, subsample=0.5,
                         colsample_bytree=0.8, nthread=1, scale_pos_weight=1,
                         min_child_weight=0.3, n_estimators=300)
xgb_final.fit(x_train, y_train)
test_poly = poly.transform(test)   # apply the same transformation as x_train
pre_test = xgb_final.predict(test_poly)
pred = pd.Series(pre_test)
pred.to_csv('submit.txt', sep='\t', index=False, header=False)
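  • As a final sanity check (not in the original write-up), read the submission back and confirm it holds one prediction per test row.
# a hedged sketch: the file should contain 1925 lines and a single column
check = pd.read_csv('submit.txt', sep='\t', header=None)
print(check.shape)   # expected: (1925, 1)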