The two-tower model is based on DSSM (Deep Structured Semantic Model), an encoder-training architecture that originated in natural-language matching models.
Its core idea is to encode user features and item features separately and map each into a high-dimensional space; under the learned mapping, a user's point (its embedding vector) and the points of matching items end up aligned, i.e. close together under a dot-product or cosine similarity.
Training a two-tower model is a bit like dominoes: the network is assembled from the bottom up, but the signal that drives training runs from the top down.
How it works
Training
The two towers are trained as a neural network; in essence:
1 The user and item features are encoded separately, mapped into their respective vector spaces, and the user encoding is matched against the item encoding with a dot product or cosine similarity (see the small sketch after this list);
2 When the two spaces do not yet map onto each other well (the dot-product/cosine score does not meet the threshold for the label), the deviation is back-propagated to the encoders, which then re-encode on that basis;
3 Once the user and item vector spaces map onto each other, the learned encoding of user and item features is saved, along with the two trained embedding spaces; otherwise, the steps above are repeated.
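A toy numeric illustration of the matching step in (1), with made-up 3-dimensional embeddings (real towers produce higher-dimensional vectors; this snippet is not part of the production script):

import numpy as np
u = np.array([0.2, 0.7, 0.1])                          # user embedding (illustrative values)
v = np.array([0.3, 0.6, 0.4])                          # item embedding (illustrative values)
dot = float(u @ v)                                     # dot-product match score
cos = dot / (np.linalg.norm(u) * np.linalg.norm(v))    # cosine match score
print(dot, cos)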
Prediction
At prediction time the input is the user's features; the user data enters the model, is encoded with the trained encoder, and is then mapped into (matched against) the embedding space, and the items whose encodings correspond to the user encoding are returned (optionally ranked by predicted probability);
The advantage of building an embedding space is that, because behaviour and item information are encoded and projected into it during training, latent relationships are mined implicitly; at recommendation time the model therefore captures not only the user's explicit needs but also latent ones.
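A minimal sketch of that serving step, assuming item embeddings have already been precomputed offline; the names (top_k_items, item_matrix, item_ids) are illustrative and not taken from the production code:

import numpy as np

def top_k_items(user_vec, item_matrix, item_ids, k=5):
    # user_vec: (d,) encoded user; item_matrix: (n_items, d) precomputed item embeddings
    scores = item_matrix @ user_vec          # dot-product score against every item
    order = np.argsort(-scores)[:k]          # indices of the k best-scoring items
    return [(item_ids[i], float(scores[i])) for i in order]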
Script walkthrough
Tools
Time utilities: load a timestamp string used to name the model, so new models do not overwrite old ones;
Progress bar: progress display for manual runs; when the script is put into production, drop this import and change the tqdm.trange calls back to range;
Deep-learning framework: TensorFlow is used here; for PyTorch/PaddlePaddle, adapt the encoder and dense-layer functions to the same flow;
Machine-learning utilities: data splitting; skip the split when training on the full data set.
import time
import tqdm
import pymysql
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
Data loading
The extraction matches each user against the full item list (a Cartesian product); here that gives 1 user x 18 items (1 positive and 17 negatives per user). This creates a modelling problem that is summarised later in the article. Also, the data is sensitive and really cannot be shared, for which we apologise; random data with the same schema is enough to wire up the framework (see the sketch after the extraction code).
# Create the database connection
config = pymysql.connect(host="localhost", user="root", password="asdf", database="bonc")
print('Database connection created')
# Extraction query
sql_str = """
select t.user_id
,t.age
,t.sex
,t.urban_rural
,t.ronghe_flag
,t.prim_chrg_level
,t.ofr_bxl_limit_flag
,t.include_kd
,t.ofr_risk_level
,t.xxzx_label74
,t.if_5g_term_n_flag
,t.ofr_total_num
,t.ofr_basic_num
,t.ofr_flux_use
,t.ofr_flux_saturation
,t.zs_remain
,t.m_flux_5g
,t.day_main
,t.pre_future_flux
,t.pay_flux_pac_num_3mon
,t.bag_type_cnt
,t.flux3_yc_fee
,t.ofr_flux_yc
,t.busi_flux_overflow_chrg
,t.prefer_all
,t.ofr_id
,t.bag_price
,t.bag_amount
,t.flux_price_per
,t.effective_length
,t.if_5g_flux
,case when t.targets is null then 0 else 1 end as rating
from (
select t1.*,t2.targets
from (
select a.user_id
,a.age
,a.sex
,a.urban_rural
,a.ronghe_flag
,a.prim_chrg_level
,a.ofr_bxl_limit_flag
,a.include_kd
,a.ofr_risk_level
,a.xxzx_label74
,a.if_5g_term_n_flag
,a.ofr_total_num
,a.ofr_basic_num
,a.ofr_flux_use
,a.ofr_flux_saturation
,a.zs_remain
,a.m_flux_5g
,a.day_main
,a.pre_future_flux
,a.pay_flux_pac_num_3mon
,a.bag_type_cnt
,a.flux3_yc_fee
,a.ofr_flux_yc
,a.busi_flux_overflow_chrg
,a.prefer_all
,b.ofr_id
,b.bag_price
,b.bag_amount
,b.flux_price_per
,b.effective_length
,b.if_5g_flux
from user_data a
cross join
item_data b
) t1
left join (
select a.user_id
,a.age
,a.sex
,a.urban_rural
,a.ronghe_flag
,a.prim_chrg_level
,a.ofr_bxl_limit_flag
,a.include_kd
,a.ofr_risk_level
,a.xxzx_label74
,a.if_5g_term_n_flag
,a.ofr_total_num
,a.ofr_basic_num
,a.ofr_flux_use
,a.ofr_flux_saturation
,a.zs_remain
,a.m_flux_5g
,a.day_main
,a.pre_future_flux
,a.pay_flux_pac_num_3mon
,a.bag_type_cnt
,a.flux3_yc_fee
,a.ofr_flux_yc
,a.busi_flux_overflow_chrg
,a.prefer_all
,b.ofr_id
,b.bag_price
,b.bag_amount
,b.flux_price_per
,b.effective_length
,b.if_5g_flux
,1 targets
from user_data a
left join
item_data b
on a.ofr_id = b.ofr_id
) t2
on t1.user_id = t2.user_id
and t1.ofr_id = t2.ofr_id
) t
"""
# Data extraction
df = pd.read_sql(sql_str, config)
print('Data extraction complete')
config.close()
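Since the real tables cannot be shared, the extracted df above can be replaced by a random stand-in with the same columns when you only want to wire up the framework. A sketch (uniform random values, 1 positive and 17 negatives per user; the value ranges and distributions are arbitrary assumptions):

import numpy as np
rng = np.random.default_rng(18)
n_users, n_items = 1000, 18
user_cols = ['age', 'sex', 'urban_rural', 'ronghe_flag', 'prim_chrg_level', 'ofr_bxl_limit_flag',
             'include_kd', 'ofr_risk_level', 'xxzx_label74', 'if_5g_term_n_flag', 'ofr_total_num',
             'ofr_basic_num', 'ofr_flux_use', 'ofr_flux_saturation', 'zs_remain', 'm_flux_5g',
             'day_main', 'pre_future_flux', 'pay_flux_pac_num_3mon', 'bag_type_cnt', 'flux3_yc_fee',
             'ofr_flux_yc', 'busi_flux_overflow_chrg', 'prefer_all']
item_cols = ['bag_price', 'bag_amount', 'flux_price_per', 'effective_length', 'if_5g_flux']
users = pd.DataFrame(rng.integers(0, 100, size=(n_users, len(user_cols))), columns=user_cols)
users.insert(0, 'user_id', np.arange(n_users))
items = pd.DataFrame(rng.integers(0, 100, size=(n_items, len(item_cols))), columns=item_cols)
items.insert(0, 'ofr_id', np.arange(n_items))
df = users.merge(items, how='cross')                                   # 1 user x 18 items
df['rating'] = 0
pos_idx = df.groupby('user_id').sample(n=1, random_state=18).index     # one positive per user
df.loc[pos_idx, 'rating'] = 1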
Data preprocessing
1 Missing-value filling; this can be skipped if it is already handled in the data-engineering stage;
2 Type casting, to stop the original database types from leaking through (the fields pulled from MySQL are varchar, i.e. strings); a single forced cast avoids type errors later;
3 Normalisation. The earliest version used index encoding to bring the data into a standard form, but extensive testing showed that this approach was seriously flawed:
Error 1:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:
Error 2:
InvalidArgumentError: indices[n1,n2] = -n3 is not in [x1,x2)
The first message is a generic graph-execution failure (not, as it may look, a GPU problem); the second reveals the actual cause: an embedding lookup index fell outside the valid range.
Both have the same root cause, namely that the encoded values overran the embedding table. Replacing index encoding with min-max compression removed the problem at its source.
4 Data splitting.
df = df.fillna(0)
y_col = ['rating']
x_col = [v for v in df.columns if v not in y_col]
# Cast every column to float (fields come back from MySQL as strings)
for i in df.columns:
    df[i] = df[i].astype(float)
# Min-max normalisation
def data_0_1(df):
    # Normalise every feature column to [0, 1]; the id columns are left untouched.
    id_col = ['user_id', 'ofr_id']
    df_col = [v for v in df.columns if v not in id_col]
    for i in df_col:
        i_max = max(df[i])
        i_min = min(df[i])
        i_len = i_max - i_min
        if i_len == 0:          # constant column: avoid division by zero
            df[i] = 0.0
            continue
        new_lst = []
        print(i)
        for j in tqdm.trange(len(df)):
            i_new = (df[i][j] - i_min) / i_len
            new_lst.append(i_new)
        df[i] = new_lst
    return df

df = data_0_1(df)
x_df = df[x_col]
y_df = df[y_col]
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=18)
Model implementation
Vocabulary sizes
These limits cap the input_dim of each Embedding layer so that lookup indices cannot run out of range during training, which is exactly the failure mode shown in the preprocessing section. An Embedding of size num accepts indices 0 .. num-1, so max()+1 guarantees that the largest observed value is still valid; the +1 can be replaced by a larger margin, but 1 is the smallest value that keeps every observed value safely in range.
# User features: number of distinct values per field (used as Embedding input_dim)
num_user_id = int(x_train['user_id'].max()) + 1
num_age = int(x_train['age'].max()) + 1
num_sex = int(x_train['sex'].max()) + 1
num_urban_rural = int(x_train['urban_rural'].max()) + 1
num_ronghe_flag = int(x_train['ronghe_flag'].max()) + 1
num_prim_chrg_level = int(x_train['prim_chrg_level'].max()) + 1
num_ofr_bxl_limit_flag = int(x_train['ofr_bxl_limit_flag'].max()) + 1
num_include_kd = int(x_train['include_kd'].max()) + 1
num_ofr_risk_level = int(x_train['ofr_risk_level'].max()) + 1
num_xxzx_label74 = int(x_train['xxzx_label74'].max()) + 1
num_if_5g_term_n_flag = int(x_train['if_5g_term_n_flag'].max()) + 1
num_ofr_total_num = int(x_train['ofr_total_num'].max()) + 1
num_ofr_basic_num = int(x_train['ofr_basic_num'].max()) + 1
num_ofr_flux_use = int(x_train['ofr_flux_use'].max()) + 1
num_ofr_flux_saturation = int(x_train['ofr_flux_saturation'].max()) + 1
num_zs_remain = int(x_train['zs_remain'].max()) + 1
num_m_flux_5g = int(x_train['m_flux_5g'].max()) + 1
num_day_main = int(x_train['day_main'].max()) + 1
num_pre_future_flux = int(x_train['pre_future_flux'].max()) + 1
num_pay_flux_pac_num_3mon = int(x_train['pay_flux_pac_num_3mon'].max()) + 1
num_bag_type_cnt = int(x_train['bag_type_cnt'].max()) + 1
num_flux3_yc_fee = int(x_train['flux3_yc_fee'].max()) + 1
num_ofr_flux_yc = int(x_train['ofr_flux_yc'].max()) + 1
num_busi_flux_overflow_chrg = int(x_train['busi_flux_overflow_chrg'].max()) + 1
num_prefer_all = int(x_train['prefer_all'].max()) + 1
# Item features: number of distinct values per field
num_ofr_id = int(x_train['ofr_id'].max()) + 1
num_bag_price = int(x_train['bag_price'].max()) + 1
num_bag_amount = int(x_train['bag_amount'].max()) + 1
num_flux_price_per = int(x_train['flux_price_per'].max()) + 1
num_effective_length = int(x_train['effective_length'].max()) + 1
num_if_5g_flux = int(x_train['if_5g_flux'].max()) + 1
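The same vocabulary sizes can be computed more compactly; a sketch that collects them in a dictionary instead of individual variables (the explicit variables above are kept because the model definition below refers to them by name):

vocab_size = {col: int(x_train[col].max()) + 1 for col in x_col}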
Model construction
1 Create the inputs, each declared as a length-1 tensor (shape=(1,));
2 Encode the user and item features separately, then combine the user encoding with the item encoding via a dot product;
3 Add a dense layer after the dot product to compress it into the final score;
4 Train on this output so that the gradients correct the encoders, i.e. the encoding itself is what gets trained.
def get_model():
    # Item-side inputs
    ofr_id = keras.layers.Input(shape=(1,), name='ofr_id')
    bag_price = keras.layers.Input(shape=(1,), name='bag_price')
    bag_amount = keras.layers.Input(shape=(1,), name='bag_amount')
    flux_price_per = keras.layers.Input(shape=(1,), name='flux_price_per')
    effective_length = keras.layers.Input(shape=(1,), name='effective_length')
    if_5g_flux = keras.layers.Input(shape=(1,), name='if_5g_flux')
    # User-side inputs
    user_id = keras.layers.Input(shape=(1,), name='user_id')
    age = keras.layers.Input(shape=(1,), name='age')
    sex = keras.layers.Input(shape=(1,), name='sex')
    urban_rural = keras.layers.Input(shape=(1,), name='urban_rural')
    ronghe_flag = keras.layers.Input(shape=(1,), name='ronghe_flag')
    prim_chrg_level = keras.layers.Input(shape=(1,), name='prim_chrg_level')
    ofr_bxl_limit_flag = keras.layers.Input(shape=(1,), name='ofr_bxl_limit_flag')
    include_kd = keras.layers.Input(shape=(1,), name='include_kd')
    ofr_risk_level = keras.layers.Input(shape=(1,), name='ofr_risk_level')
    xxzx_label74 = keras.layers.Input(shape=(1,), name='xxzx_label74')
    if_5g_term_n_flag = keras.layers.Input(shape=(1,), name='if_5g_term_n_flag')
    ofr_total_num = keras.layers.Input(shape=(1,), name='ofr_total_num')
    ofr_basic_num = keras.layers.Input(shape=(1,), name='ofr_basic_num')
    ofr_flux_use = keras.layers.Input(shape=(1,), name='ofr_flux_use')
    ofr_flux_saturation = keras.layers.Input(shape=(1,), name='ofr_flux_saturation')
    zs_remain = keras.layers.Input(shape=(1,), name='zs_remain')
    m_flux_5g = keras.layers.Input(shape=(1,), name='m_flux_5g')
    day_main = keras.layers.Input(shape=(1,), name='day_main')
    pre_future_flux = keras.layers.Input(shape=(1,), name='pre_future_flux')
    pay_flux_pac_num_3mon = keras.layers.Input(shape=(1,), name='pay_flux_pac_num_3mon')
    bag_type_cnt = keras.layers.Input(shape=(1,), name='bag_type_cnt')
    flux3_yc_fee = keras.layers.Input(shape=(1,), name='flux3_yc_fee')
    ofr_flux_yc = keras.layers.Input(shape=(1,), name='ofr_flux_yc')
    busi_flux_overflow_chrg = keras.layers.Input(shape=(1,), name='busi_flux_overflow_chrg')
    prefer_all = keras.layers.Input(shape=(1,), name='prefer_all')
    # Item tower: embed each item field, concatenate, then compress to an 8-dim item vector
    item_vector = tf.keras.layers.concatenate([
        keras.layers.Embedding(num_ofr_id, 16)(ofr_id),
        keras.layers.Embedding(num_bag_price, 16)(bag_price),
        keras.layers.Embedding(num_bag_amount, 16)(bag_amount),
        keras.layers.Embedding(num_flux_price_per, 16)(flux_price_per),
        keras.layers.Embedding(num_effective_length, 16)(effective_length),
        keras.layers.Embedding(num_if_5g_flux, 16)(if_5g_flux)])
    item_vector = keras.layers.Dense(16, activation='relu')(item_vector)
    item_vector = keras.layers.Dense(8, activation='relu', name="item_embedding", kernel_regularizer='l2')(item_vector)
    # User tower: embed each user field, concatenate, then compress to an 8-dim user vector
    user_vector = tf.keras.layers.concatenate([
        keras.layers.Embedding(num_user_id, 16)(user_id),
        keras.layers.Embedding(num_age, 16)(age),
        keras.layers.Embedding(num_sex, 16)(sex),
        keras.layers.Embedding(num_urban_rural, 16)(urban_rural),
        keras.layers.Embedding(num_ronghe_flag, 16)(ronghe_flag),
        keras.layers.Embedding(num_prim_chrg_level, 16)(prim_chrg_level),
        keras.layers.Embedding(num_ofr_bxl_limit_flag, 16)(ofr_bxl_limit_flag),
        keras.layers.Embedding(num_include_kd, 16)(include_kd),
        keras.layers.Embedding(num_ofr_risk_level, 16)(ofr_risk_level),
        keras.layers.Embedding(num_xxzx_label74, 16)(xxzx_label74),
        keras.layers.Embedding(num_if_5g_term_n_flag, 16)(if_5g_term_n_flag),
        keras.layers.Embedding(num_ofr_total_num, 16)(ofr_total_num),
        keras.layers.Embedding(num_ofr_basic_num, 16)(ofr_basic_num),
        keras.layers.Embedding(num_ofr_flux_use, 16)(ofr_flux_use),
        keras.layers.Embedding(num_ofr_flux_saturation, 16)(ofr_flux_saturation),
        keras.layers.Embedding(num_zs_remain, 16)(zs_remain),
        keras.layers.Embedding(num_m_flux_5g, 16)(m_flux_5g),
        keras.layers.Embedding(num_day_main, 16)(day_main),
        keras.layers.Embedding(num_pre_future_flux, 16)(pre_future_flux),
        keras.layers.Embedding(num_pay_flux_pac_num_3mon, 16)(pay_flux_pac_num_3mon),
        keras.layers.Embedding(num_bag_type_cnt, 16)(bag_type_cnt),
        keras.layers.Embedding(num_flux3_yc_fee, 16)(flux3_yc_fee),
        keras.layers.Embedding(num_ofr_flux_yc, 16)(ofr_flux_yc),
        keras.layers.Embedding(num_busi_flux_overflow_chrg, 16)(busi_flux_overflow_chrg),
        keras.layers.Embedding(num_prefer_all, 16)(prefer_all)])
    user_vector = keras.layers.Dense(16, activation='relu')(user_vector)
    user_vector = keras.layers.Dense(8, activation='relu', name="user_vector", kernel_regularizer='l2')(user_vector)
    # Dot product of the user and item vectors; both tensors are (batch, 1, 8),
    # so summing over the last axis yields one (batch, 1) match score per pair
    dot_user_item = tf.reduce_sum(user_vector * item_vector, axis=-1)
    output_1 = keras.layers.Dense(1, activation='sigmoid')(dot_user_item)
    return keras.models.Model(
        inputs=[ofr_id, bag_price, bag_amount, flux_price_per, effective_length, if_5g_flux,
                user_id, age, sex, urban_rural, ronghe_flag, prim_chrg_level, ofr_bxl_limit_flag,
                include_kd, ofr_risk_level, xxzx_label74, if_5g_term_n_flag, ofr_total_num, ofr_basic_num, ofr_flux_use,
                ofr_flux_saturation, zs_remain, m_flux_5g, day_main, pre_future_flux, pay_flux_pac_num_3mon,
                bag_type_cnt, flux3_yc_fee, ofr_flux_yc, busi_flux_overflow_chrg, prefer_all], outputs=[output_1])
Model training
# Instantiate the model and choose loss and optimiser
model = get_model()
model.compile(loss=tf.keras.losses.MeanSquaredError(), optimizer=keras.optimizers.RMSprop())
# Separate features and target
x = x_train[x_col]
y = y_train.pop('rating')
print('Preparing training data')
# Build the per-field training inputs (order must match the model inputs)
fit_x_train = [x['ofr_id'],
x['bag_price'],
x['bag_amount'],
x['flux_price_per'],
x['effective_length'],
x['if_5g_flux'],
x['user_id'],
x['age'],
x['sex'],
x['urban_rural'],
x['ronghe_flag'],
x['prim_chrg_level'],
x['ofr_bxl_limit_flag'],
x['include_kd'],
x['ofr_risk_level'],
x['xxzx_label74'],
x['if_5g_term_n_flag'],
x['ofr_total_num'],
x['ofr_basic_num'],
x['ofr_flux_use'],
x['ofr_flux_saturation'],
x['zs_remain'],
x['m_flux_5g'],
x['day_main'],
x['pre_future_flux'],
x['pay_flux_pac_num_3mon'],
x['bag_type_cnt'],
x['flux3_yc_fee'],
x['ofr_flux_yc'],
x['busi_flux_overflow_chrg'],
x['prefer_all']]
print('Training data ready')
# Fit the model
history = model.fit(
x=fit_x_train,
y=y,
batch_size=16,
epochs=2,
verbose=1)
print('Model training finished')
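The rest of the script scores user-item pairs with the full model. If, as described in the prediction section, item vectors need to be precomputed offline, the two towers can also be split out of the trained model; a sketch using the layer names defined in get_model (this is not part of the original pipeline):

item_input_names = ['ofr_id', 'bag_price', 'bag_amount', 'flux_price_per', 'effective_length', 'if_5g_flux']
user_input_names = ['user_id', 'age', 'sex', 'urban_rural', 'ronghe_flag', 'prim_chrg_level',
                    'ofr_bxl_limit_flag', 'include_kd', 'ofr_risk_level', 'xxzx_label74',
                    'if_5g_term_n_flag', 'ofr_total_num', 'ofr_basic_num', 'ofr_flux_use',
                    'ofr_flux_saturation', 'zs_remain', 'm_flux_5g', 'day_main', 'pre_future_flux',
                    'pay_flux_pac_num_3mon', 'bag_type_cnt', 'flux3_yc_fee', 'ofr_flux_yc',
                    'busi_flux_overflow_chrg', 'prefer_all']
item_tower = keras.models.Model(
    inputs=[model.get_layer(n).output for n in item_input_names],
    outputs=model.get_layer('item_embedding').output)   # item fields -> 8-dim item vector
user_tower = keras.models.Model(
    inputs=[model.get_layer(n).output for n in user_input_names],
    outputs=model.get_layer('user_vector').output)      # user fields -> 8-dim user vector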
Model saving
The model is saved with a timestamp in its name, which makes the output of later incremental-training runs easy to identify and prevents a new model from overwriting an old one.
# Save and reload the model
time_str = time.strftime("%Y%m%d_%H%M%S", time.localtime())
model.save("./model/dssm_model_{}.h5".format(time_str))
new_model = tf.keras.models.load_model("./model/dssm_model_{}.h5".format(time_str))
print("Model saved as dssm_model_{}.h5".format(time_str))
Prediction output
inputs = x_test
predict_lst = new_model.predict([ inputs['ofr_id'],
inputs['bag_price'],
inputs['bag_amount'],
inputs['flux_price_per'],
inputs['effective_length'],
inputs['if_5g_flux'],
inputs['user_id'],
inputs['age'],
inputs['sex'],
inputs['urban_rural'],
inputs['ronghe_flag'],
inputs['prim_chrg_level'],
inputs['ofr_bxl_limit_flag'],
inputs['include_kd'],
inputs['ofr_risk_level'],
inputs['xxzx_label74'],
inputs['if_5g_term_n_flag'],
inputs['ofr_total_num'],
inputs['ofr_basic_num'],
inputs['ofr_flux_use'],
inputs['ofr_flux_saturation'],
inputs['zs_remain'],
inputs['m_flux_5g'],
inputs['day_main'],
inputs['pre_future_flux'],
inputs['pay_flux_pac_num_3mon'],
inputs['bag_type_cnt'],
inputs['flux3_yc_fee'],
inputs['ofr_flux_yc'],
inputs['busi_flux_overflow_chrg'],
inputs['prefer_all']
])
# Flatten the (n, 1) prediction array into a plain list of floats
pred_lst = []
for i in predict_lst:
    pred_lst.append(float(i))
x_test['rating'] = y_test['rating']
x_test['pred'] = pred_lst
x_test.to_csv('./test_pred.csv', index=False)
Common issues
1 The data here is 1 positive : 17 negatives, so the negatives dominate training and recall on the positive class suffers; rebalance to roughly 1:1 up to 1:3 (see the downsampling sketch after this list);
2 Keep batch_size at 32 or below during training; the detailed reasons are well documented elsewhere;
3 The two-tower setup does noticeably ease prediction skew (in classic machine learning, when one class makes up ~50% of the positives, predictions collapse almost entirely onto that class), but it only eases it; actively rebalance the positive samples in the training data;
4 More epochs are not automatically better; 5-10 epochs are recommended, beyond that the return on extra training is low.
5 The two-tower architecture has a serious limitation: although the two encodings validate each other, user behaviour and item information stay strictly independent, so there are no cross features for the model to learn from in between. A later version will introduce an FM-style cross network; contributions and discussion are welcome!
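A sketch of the rebalancing suggested in point 1, downsampling to roughly 1 positive : 3 negatives per user before the train/test split (the placement, ratio, and random seed are assumptions, not part of the original pipeline):

neg = df[df['rating'] == 0].groupby('user_id').sample(n=3, random_state=18)   # 3 negatives per user
pos = df[df['rating'] == 1]                                                    # keep all positives
df_balanced = pd.concat([pos, neg]).sample(frac=1, random_state=18).reset_index(drop=True)  # shuffle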
Results preview
F1 score
Confusion matrix
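The evaluation figures are not reproduced here; a sketch for recomputing the same metrics from the saved predictions (assuming predicted probabilities are binarised at a 0.5 threshold):

from sklearn.metrics import f1_score, confusion_matrix
pred_df = pd.read_csv('./test_pred.csv')
pred_label = (pred_df['pred'] >= 0.5).astype(int)       # threshold the sigmoid output
print('F1 score:', f1_score(pred_df['rating'], pred_label))
print(confusion_matrix(pred_df['rating'], pred_label))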