00
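I spent a little over ten days on this competition and ran into plenty of difficulties along the way, learning as I went. Many of the harder problems left me at a loss at first, but with persistent exploration I worked through them one by one and finished in the top 12. The result is not outstanding, but since this was my first data mining competition and I worked entirely on my own without a team, I still feel a real sense of accomplishment.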
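01 Understanding the Task
Task: The data has been anonymized. Given the features and sales for week0 through week32, predict weekly_sales for week33 (the test set). There are only 5 features: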
['shop_id','item_id','week','item_category_id','item_price']
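Of these, 'item_price' is missing in about 70% of the rows in both train and test.
Evaluation metric: The competition is scored by MSE.
Approach: This is a regression problem with many missing values, so LightGBM is the first choice of model, since it handles missing values natively. The features include a time axis, week, which makes this a typical time-series forecasting problem, so time shifts (lags) and sliding windows should be considered when constructing features.
Difficulty: Both the training and test sets contain many missing values, and they all sit in a single variable (item_price) that cannot simply be dropped, so handling them is the main challenge of the task.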
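As a quick check of the missing-value claim above, the per-column missing ratio can be computed as follows. This is a minimal sketch on a made-up toy frame (the values are illustrative, not competition data); on the real data the same expression would be applied to train and test.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the competition files (values are made up)
toy = pd.DataFrame({
    "shop_id": [1, 1, 2, 2, 3],
    "item_price": [1.0, np.nan, np.nan, 4.0, np.nan],
})

# isna() marks missing cells; the column mean of that mask is the missing ratio
missing_ratio = toy.isna().mean()
print(missing_ratio)
```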
02 Complete Code
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
train = pd.read_csv("./train.csv")
test = pd.read_csv("./test.csv")
train_data = train.copy()
test_data = test.copy()
# ======================================== Preprocessing
# Clip the label at 200, then apply a log1p transform
train_data['weekly_sales'] = train_data['weekly_sales'].clip(upper=200)
train_data['weekly_sales'] = np.log1p(train_data['weekly_sales'])
# Concatenate train and test
all_data = pd.concat([train_data, test_data], axis=0, ignore_index=True)
# Clip item_price at 4000 and apply the same log1p transform,
# because many of the features built below are based on it
all_data['item_price'] = all_data['item_price'].clip(upper=4000)
all_data['item_price'] = np.log1p(all_data['item_price'])
# ======================================= Feature engineering
# 1. Sliding window
from scipy import stats

def _to_floats(x):
    '''Parse a space-separated string such as "1.0 2.0 nan" into a list of floats.'''
    return [float(i) for i in x.split(' ')]

def weekly_Max(x):
    return max(_to_floats(x))

def weekly_Min(x):
    return min(_to_floats(x))

def weekly_Mean(x):
    return np.mean(_to_floats(x))

def weekly_Mode(x):
    # stats.mode returns an array in older SciPy and a scalar in newer SciPy;
    # np.atleast_1d makes the indexing work in both cases
    return np.atleast_1d(stats.mode(_to_floats(x))[0])[0]

def weekly_Median(x):
    return np.median(_to_floats(x))
# --------------------------- Sliding-window function
def moving_windows(df, lags):
    '''Sliding window of size 3 and stride 1: weeks i, i+1, i+2 build the features of week i+3.'''
    frames = []
    for i in lags:
        f0 = df.loc[df['week'] == i, ['shop_id', 'item_id', 'weekly_sales']].copy()
        f1 = df.loc[df['week'] == i + 1, ['shop_id', 'item_id', 'weekly_sales']].copy()
        f2 = df.loc[df['week'] == i + 2, ['shop_id', 'item_id', 'weekly_sales']].copy()
        f3 = pd.merge(f0, f1, on=['shop_id', 'item_id'], how='left')
        f4 = pd.merge(f3, f2, on=['shop_id', 'item_id'], how='left')
        # Join the three weekly sales values into one space-separated string for the helpers above
        f4['weekly_all'] = f4['weekly_sales_x'].map(str) + ' ' + f4['weekly_sales_y'].map(str) + ' ' + f4['weekly_sales'].map(str)
        # -------------------------------------------------------
        f4['Max'] = f4['weekly_all'].apply(weekly_Max)
        f4['Min'] = f4['weekly_all'].apply(weekly_Min)
        f4['Mean'] = f4['weekly_all'].apply(weekly_Mean)
        f4['Mode'] = f4['weekly_all'].apply(weekly_Mode)
        f4['Median'] = f4['weekly_all'].apply(weekly_Median)
        # Gap becomes inf or NaN when Median equals Min
        f4['Gap'] = (f4['Max'] - f4['Median']) / (f4['Median'] - f4['Min'])
        # Note: "Std" is really the range (Max - Min) and "Var" is the unnormalized sum of squared deviations
        f4['Std'] = f4['Max'] - f4['Min']
        f4['Var'] = (f4['weekly_sales'] - f4['Mean'])**2 + (f4['weekly_sales_x'] - f4['Mean'])**2 + (f4['weekly_sales_y'] - f4['Mean'])**2
        # ---------------------------------------------------------
        f4.drop(['weekly_sales_x', 'weekly_sales_y', 'weekly_all', 'weekly_sales'], axis=1, inplace=True)
        # The window statistics become features of the following week
        f4['week'] = i + 3
        f5 = pd.merge(df.loc[df['week'] == i + 3, :], f4, on=['shop_id', 'item_id', 'week'], how='left')
        frames.append(f5)
    # Concatenate once at the end instead of growing an empty DataFrame, which would turn the columns into object dtype
    return pd.concat(frames, axis=0, ignore_index=True)
all_data = moving_windows(all_data, list(range(0, 31)))
# Convert data types
all_data['shop_id'] = pd.to_numeric(all_data['shop_id'],errors='coerce')
all_data['item_id'] = pd.to_numeric(all_data['item_id'],errors='coerce')
all_data['week'] = pd.to_numeric(all_data['week'],errors='coerce')
all_data['item_category_id'] = pd.to_numeric(all_data['item_category_id'],errors='coerce')
# ----------------------------------------- Lag (shift) features
def lack_feature(df, lags, col):
    '''Lag features: the value of col (here item_price) from each of the previous N weeks.'''
    tmp = df[['week', 'shop_id', 'item_id', col]]
    for i in lags:
        shifted = tmp.copy()
        shifted.columns = ['week', 'shop_id', 'item_id', col + str(i)]
        # Moving week forward by i aligns the value from week w - i with week w
        shifted['week'] += i
        df = pd.merge(df, shifted, on=['week', 'shop_id', 'item_id'], how='left')
    return df
all_data = lack_feature(all_data, [1, 2, 3], 'item_price')
# ==================================== Statistical features
# 2. Number of distinct prices of each item_id across all shops
all_data1 = all_data[["item_price", "item_id"]].copy()
all_data1.drop_duplicates(inplace=True)
all_data1.dropna(axis=0, how="any", inplace=True)
all_data1["item_all_shop_price"] = 1
all_data1 = all_data1.groupby(["item_id"])["item_all_shop_price"].agg("sum").reset_index()
all_data = all_data.merge(all_data1, on=["item_id"], how='left')
# 3. Number of distinct (item_id, item_price) pairs in each shop
all_data1 = all_data[["item_price", "item_id", "shop_id"]].copy()
all_data1.drop_duplicates(inplace=True)
all_data1.dropna(axis=0, how="any", inplace=True)
all_data1["item_every_shop_price"] = 1
all_data1 = all_data1.groupby(["shop_id"])["item_every_shop_price"].agg("sum").reset_index()
all_data = all_data.merge(all_data1, on=["shop_id"], how='left')
# 4. max, min, mean and median of each item_id's price within each shop
all_data1 = all_data[["item_price", "item_id", "shop_id"]].copy()
all_data1.drop_duplicates(inplace=True)
all_data1.dropna(axis=0, how="any", inplace=True)
# reset_index turns the group keys back into columns so the merge below can use them
all_data1 = all_data1.groupby(["shop_id", "item_id"])["item_price"].agg(
    Max_3="max", Min_3="min", Mean_3="mean", Median_3="median").reset_index()
all_data = all_data.merge(all_data1, on=["shop_id", "item_id"], how='left')
# 5. max, min, mean and median of each item_id's price across all shops
all_data1 = all_data[["item_price", "item_id"]].copy()
all_data1.drop_duplicates(inplace=True)
all_data1.dropna(axis=0, how="any", inplace=True)
all_data1 = all_data1.groupby(["item_id"])["item_price"].agg(
    Max_2="max", Min_2="min", Mean_2="mean", Median_2="median").reset_index()
all_data = all_data.merge(all_data1, on=["item_id"], how='left')
# ======================================================= Train/test split
X_train = all_data[all_data.week < 33].drop("weekly_sales", axis=1)
y_train = all_data[all_data.week < 33]["weekly_sales"]
X_test = all_data[all_data.week == 33].drop("weekly_sales", axis=1)
X_train.drop(['week'], axis=1, inplace=True)
X_test.drop(['week'], axis=1, inplace=True)
# No 'label' column is created anywhere above, so skip it if absent instead of raising KeyError
X_train.drop('label', axis=1, inplace=True, errors='ignore')
# Imports
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
# Import the regressor under its own name so that lgb keeps referring to the lightgbm module
from lightgbm import LGBMRegressor
# Offline training with a single model
clf2 = LGBMRegressor(n_estimators=100)
# Cross-validation
scores1 = cross_val_score(clf2, X_train, y_train, cv=5, scoring=make_scorer(mean_squared_error))
scores2 = cross_val_score(clf2, X_train, y_train, cv=5, scoring=make_scorer(mean_absolute_error))
print("MSE:", scores1)
print("MAE:", scores2)
print("MSE mean:", scores1.mean())
print("MAE mean:", scores2.mean())
# ============================== LightGBM model
def cv_model(clf, train_x, train_y, test_x, clf_name='lgb'):
    '''5-fold CV; clf is the lightgbm module. Returns out-of-fold and fold-averaged test predictions.'''
    folds = 5
    seed = 1024
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    train = np.zeros(train_x.shape[0])
    test = np.zeros(test_x.shape[0])
    categorical_feature = ['shop_id', 'item_id', 'item_category_id']
    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************ {} ****************************'.format(str(i + 1)))
        # Use positional indexing for both the features and the labels
        trn_x, trn_y = train_x.iloc[train_index], train_y.iloc[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y.iloc[valid_index]
        train_matrix = clf.Dataset(trn_x, label=trn_y, categorical_feature=categorical_feature)
        valid_matrix = clf.Dataset(val_x, label=val_y, reference=train_matrix)
        params = {
            'boosting_type': 'gbdt',
            'objective': 'mse',
            'metric': 'mse',
            'min_child_weight': 5,
            'num_leaves': 2 ** 7,
            'lambda_l2': 10,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.9,
            'bagging_freq': 4,
            'learning_rate': 0.05,
            'seed': 1024,
            'n_jobs': -1,
            'verbose': -1,
        }
        # early_stopping_rounds and verbose_eval were removed from train() in LightGBM 4.0; callbacks replace them
        model = clf.train(params, train_matrix, 5000, valid_sets=[train_matrix, valid_matrix],
                          callbacks=[clf.early_stopping(200), clf.log_evaluation(500)])
        val_pred = model.predict(val_x, num_iteration=model.best_iteration)
        test_pred = model.predict(test_x, num_iteration=model.best_iteration)
        train[valid_index] = val_pred
        test += test_pred / kf.n_splits
        cv_scores.append(mean_squared_error(val_y, val_pred))
        print(cv_scores)
    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test
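# Train the model; cv_model needs the lightgbm module itself (for Dataset and train)
import lightgbm
lgb_train, lgb_test = cv_model(lightgbm, X_train, y_train, X_test)
# Transform the predictions back to the original scale (inverse of log1p)
lgb_test = np.expm1(lgb_test)
lgb_test
# Save the output
sample_submit = pd.read_csv('./sample_submit.csv')
sample_submit['weekly_sales'] = lgb_test
# Sales cannot be negative, so floor the predictions at 0
sample_submit['weekly_sales'] = sample_submit['weekly_sales'].clip(lower=0)
sample_submit.to_csv('predict_result0.csv', index=False)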
03 Exploration Process
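1. Missing values: Because this dataset has so many missing values, I tried many imputation methods. The two I expected to help most were mode imputation and random-forest imputation, but neither produced a notable improvement in offline cross-validation. I later read a baseline by a strong competitor saying that imputation does not help much on this task and that, if you insist on filling, for time-series data you can try filling with the most recent week's value. With as much as 70% of the values missing, though, I eventually gave up on imputation. Other high scorers may have better solutions; please leave a comment if you do. Thanks!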
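The "fill with the most recent week's value" idea mentioned above can be sketched as a forward fill within each (shop_id, item_id) series. This is only an illustration on toy data: the helper name ffill_price_by_series and the output column item_price_ffill are my own, and it is not part of the final solution, which leaves the values missing.

```python
import numpy as np
import pandas as pd

def ffill_price_by_series(df: pd.DataFrame) -> pd.DataFrame:
    """Forward-fill item_price within each (shop_id, item_id) series, ordered by week."""
    out = df.sort_values(["shop_id", "item_id", "week"]).copy()
    # Each missing price takes the latest known earlier price of the same shop and item
    out["item_price_ffill"] = out.groupby(["shop_id", "item_id"])["item_price"].ffill()
    return out.sort_index()

# Toy data (made up): shop 1 has gaps after week 0, shop 2 starts with a gap
toy = pd.DataFrame({
    "shop_id": [1, 1, 1, 1, 2, 2],
    "item_id": [10, 10, 10, 10, 10, 10],
    "week": [0, 1, 2, 3, 0, 1],
    "item_price": [5.0, np.nan, np.nan, 7.0, np.nan, 3.0],
})
filled = ffill_price_by_series(toy)
print(filled)
```

A value with no earlier observation in its series (shop 2, week 0 here) stays missing, which is one reason this approach has limited reach when 70% of the column is empty.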
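2. Outlier handling: I started with weekly_sales and item_price, since many features are built from these two. Their distributions turn out to be far from normal and heavily skewed: most values cluster near 0, with a small number of outliers spread over a very wide range.
I first processed the label by clipping it at 200, so anything above 200 becomes 200. The plot looked somewhat better, but the data was still skewed, with a long right tail, so I applied an np.log1p transform, after which the label distribution was close to normal. After training, the score improved somewhat as well. I then tried other clipping thresholds and found that clipping at 200 worked best.
After that, I clipped item_price in the same way (at 4000 in the code) and applied the same log1p transform.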
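The clip-then-log1p step and its inverse can be illustrated on a few made-up values. This is a small sketch, not competition data; it only shows that the transform reduces skewness and that np.expm1 recovers the clipped values, which is why predictions are passed through expm1 before submission.

```python
import numpy as np
import pandas as pd

# Toy label values (made up): mostly small, with a couple of large outliers
sales = pd.Series([0.0, 1.0, 3.0, 12.0, 250.0, 1000.0])

clipped = sales.clip(upper=200)      # cap outliers at 200
transformed = np.log1p(clipped)      # log(1 + x) is defined at x = 0
restored = np.expm1(transformed)     # inverse transform, back to the clipped scale

skew_before = sales.skew()
skew_after = transformed.skew()
print(skew_before, skew_after)
```

Note that the inverse only recovers the clipped values, so a model trained this way cannot predict sales above the clipping threshold.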
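3. Prediction without feature construction: At this point I made a first prediction, which scored about 1.3.
4. Adding time-shift features: For time-series data, shifting the label forward in time and using it as a feature can improve prediction accuracy considerably. Positions that the shift does not reach are left as missing values without further handling. The score at this point was about 0.99, a large improvement.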
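A time-shift (lag) feature of this kind can be sketched with a merge on a shifted week column, the same mechanism as lack_feature in the full code. The helper name add_lags, the _lagN column naming and the toy data are illustrative assumptions.

```python
import pandas as pd

def add_lags(df: pd.DataFrame, lags, col: str) -> pd.DataFrame:
    """Add one column per lag holding the value of col from that many weeks earlier."""
    out = df.copy()
    base = df[["week", "shop_id", "item_id", col]]
    for lag in lags:
        shifted = base.rename(columns={col: f"{col}_lag{lag}"})
        # Moving week forward by lag aligns the value from week w - lag with week w
        shifted = shifted.assign(week=shifted["week"] + lag)
        out = out.merge(shifted, on=["week", "shop_id", "item_id"], how="left")
    return out

# Toy data (made up): one shop and one item over four weeks
toy = pd.DataFrame({
    "week": [0, 1, 2, 3],
    "shop_id": [1, 1, 1, 1],
    "item_id": [10, 10, 10, 10],
    "weekly_sales": [4.0, 6.0, 5.0, 9.0],
})
lagged = add_lags(toy, [1, 2], "weekly_sales")
print(lagged)
```

Rows with no earlier week to draw from (week 0 for lag 1, weeks 0 and 1 for lag 2) stay NaN, matching the choice above to leave unreached positions missing.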
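5. Sliding-window statistical features: With a window of 3 weeks and a stride of 1 week, week1, week2 and week3 are used to build the features of week4, and so on, which reaches exactly up to week33 in the test set. Within each window I compute statistics such as min, max, count, median, mode, mean, std and var. The score improved to about 0.73, a further gain of roughly 0.26, which is a sizable improvement.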
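The window statistics can also be computed by pivoting the three history weeks into columns, which avoids string parsing. This is an alternative sketch, not the code used for the reported scores: the helper name window_stats is hypothetical, Std here is the true sample standard deviation, and pandas reductions skip NaN values, so results may differ from the full code when a week is missing.

```python
import numpy as np
import pandas as pd

def window_stats(df: pd.DataFrame, target_week: int, size: int = 3) -> pd.DataFrame:
    """Statistics of weekly_sales over the `size` weeks before target_week, per (shop_id, item_id)."""
    weeks = list(range(target_week - size, target_week))
    hist = df[df["week"].isin(weeks)]
    # One row per (shop_id, item_id), one column per history week
    wide = hist.pivot_table(index=["shop_id", "item_id"], columns="week", values="weekly_sales")
    wide = wide.reindex(columns=weeks)
    stats = pd.DataFrame({
        "Max": wide.max(axis=1),
        "Min": wide.min(axis=1),
        "Mean": wide.mean(axis=1),
        "Median": wide.median(axis=1),
        "Std": wide.std(axis=1),
    }).reset_index()
    # The statistics describe the past and are attached to the target week
    stats["week"] = target_week
    return stats

# Toy data (made up): weeks 1 to 3 are history, week 4 is the row to be predicted
toy = pd.DataFrame({
    "shop_id": [1, 1, 1, 1],
    "item_id": [10, 10, 10, 10],
    "week": [1, 2, 3, 4],
    "weekly_sales": [2.0, 4.0, 9.0, np.nan],
})
feats = window_stats(toy, target_week=4)
print(feats)
```

The resulting frame can be left-merged onto the rows of the target week by shop_id, item_id and week.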
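6. Gap ratio feature inside the sliding window: Because this is time-series data, analysis shows that the data has an overall trend, and a ratio feature can capture it. I therefore built a Gap ratio inside the sliding window (in the code, (Max - Median) / (Median - Min)), and the score improved again, to about 0.70.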
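One detail of the Gap ratio worth making explicit: when Median equals Min the denominator is zero, so the raw division yields inf or NaN. The sketch below guards against that by treating a zero denominator as missing; this guard is my own assumption and is not in the full code, where LightGBM simply receives whatever the division produces.

```python
import numpy as np
import pandas as pd

def gap_ratio(mx: pd.Series, med: pd.Series, mn: pd.Series) -> pd.Series:
    """(Max - Median) / (Median - Min), with zero denominators mapped to NaN."""
    denom = (med - mn).replace(0, np.nan)
    return (mx - med) / denom

# Toy window statistics (made up)
win = pd.DataFrame({
    "Max": [9.0, 5.0, 3.0],
    "Median": [4.0, 5.0, 3.0],
    "Min": [2.0, 1.0, 3.0],
})
win["Gap"] = gap_ratio(win["Max"], win["Median"], win["Min"])
print(win)
```

A Gap above 1 means the window's maximum sits further from the median than the minimum does, and a Gap of 0 means the maximum equals the median.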
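7. General statistical features: After building these key features, some general statistical features remained unexplored. I explored them here, deciding which to keep based on each feature's correlation with the label (train.corr()), and after this filtering arrived at the final feature set.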
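A correlation-based filter of this kind might look like the following sketch. The helper name select_by_corr, the threshold of 0.5 and the toy columns are all assumptions for illustration; the actual cut-off used in the competition is not stated.

```python
import pandas as pd

def select_by_corr(df: pd.DataFrame, label: str, threshold: float) -> list:
    """Keep numeric features whose absolute Pearson correlation with the label reaches the threshold."""
    corr = df.select_dtypes("number").corr()[label].drop(label)
    return corr[corr.abs() >= threshold].index.tolist()

# Toy data (made up): feat_a tracks the label closely, feat_b only weakly
toy = pd.DataFrame({
    "weekly_sales": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "feat_a": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "feat_b": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
})
kept = select_by_corr(toy, "weekly_sales", threshold=0.5)
print(kept)
```

Pearson correlation only measures linear association, so a feature that scores low here can still be useful to a tree model; checking the cross-validation score after dropping a feature is a sensible complement.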
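04 Summary and Next Steps
After all this exploration and feature construction, the score went from 1.3 at the start to 0.6919, following the path 1.3 → 0.99 → 0.73 → 0.70 → 0.69. This is the best I can currently reach with a single model. To improve further, I think it is worth trying a deep learning model and then blending models, which may bring additional gains. I plan to explore this later, and I hope others will share their insights and better solutions. Thanks!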