A DSSM-Based Two-Tower Recommendation Framework


The two-tower model is based on DSSM (Deep Structured Semantic Model), an encoder-training model that grew out of natural-language modeling.

Its core idea: map user features and item features, each through its own encoder, into a high-dimensional space; under the learned mapping, the points for users (users' multi-dimensional coordinates) and the points for items (items' multi-dimensional coordinates) become entangled in dot-product/cosine space, i.e. matching pairs end up overlapping or tightly clustered.

Training a two-tower model is like a run of dominoes: it is built from the bottom up, but it executes from the top down.

How It Is Built

Training Principle


The two towers are trained as a neural network; in essence it:

1 Encodes user and item information separately, maps each into its own vector space, and matches the user encoding against the item encoding by dot product / cosine similarity;

2 When the two spaces fail to map onto each other (the dot-product/cosine score misses the threshold), backpropagates the error into the encoders, which re-encode on that basis;

3 Once the user and item vector spaces map onto each other, saves the learned user and item encoders along with the two trained embedding spaces; otherwise, repeats the steps above.
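The three steps above can be sketched in plain NumPy. Everything here (feature counts, weights, labels) is made up for illustration; the real towers are the deeper Keras networks built later in the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-layer "towers" projecting into a shared 8-dim space
W_user = rng.normal(size=(5, 8))      # user tower: 5 raw features -> 8 dims
W_item = rng.normal(size=(3, 8))      # item tower: 3 raw features -> 8 dims

user_feats = rng.normal(size=(4, 5))  # 4 users
item_feats = rng.normal(size=(4, 3))  # the 4 items paired with them

u = user_feats @ W_user               # step 1: encode users, shape (4, 8)
v = item_feats @ W_item               # step 1: encode items, shape (4, 8)

dot = (u * v).sum(axis=1)             # step 1: dot-product match per pair
cos = dot / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))

# Step 2: the gap between prediction and label is what backprop
# would push into the two encoders to re-train them.
labels = np.array([1.0, 0.0, 1.0, 0.0])
pred = 1.0 / (1.0 + np.exp(-dot))     # sigmoid squashes the dot product
loss = np.mean((pred - labels) ** 2)
```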

Prediction Principle


At prediction time the input is user information. It enters the model, is encoded with the trained encoder, and the encoding is then mapped into the item embedding space for matching; the items corresponding to the user's encoding are returned (optionally ranked by probability);

The advantage of building an embedding space is that once behavior/item information has been encoded and projected into it, latent associations are mined implicitly: at recommendation time, beyond satisfying the user's explicit needs, latent needs are surfaced as well.
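That prediction flow reduces to a nearest-neighbor lookup in the item space. A sketch, assuming the item tower has already been run offline over the catalog (all numbers below are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)

# Precomputed item embeddings from the trained item tower (made up here)
item_emb = rng.normal(size=(100, 8))   # 100 catalog items in the shared space
user_emb = rng.normal(size=(8,))       # one user's vector from the user tower

scores = item_emb @ user_emb           # dot-product match against every item
top_k = np.argsort(scores)[::-1][:10]  # the 10 best-matching item indices
```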

Script Walkthrough

Tools

Time utility: loads a timestamp string used to name the model, preventing model files from being overwritten;

Progress-bar utility: shows progress during manual runs; when productionizing the model, drop this import and change the script's tqdm.trange calls back to range;

Deep-learning framework: TensorFlow is used here; with PyTorch/PaddlePaddle, adapt the encoder and dense-layer functions to the same flow;

Machine-learning utility: splits the data; skip the split when training on the full set.

import time
import tqdm
import pymysql
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split

Data Loading

The extraction matches every user against the full item table (a Cartesian product), yielding 18 rows per user (1 positive, 17 negatives). This introduces a modeling problem that is summarized later in the article. Also, because the data is sensitive it genuinely cannot be shared; apologies to readers (you can generate random data yourself to stand up the framework).

# Open the database connection
config = pymysql.connect(host="localhost", user="root", password="asdf", database="bonc")
print('connection established')

# Query
sql_str = """
select t.user_id
,t.age
,t.sex
,t.urban_rural
,t.ronghe_flag
,t.prim_chrg_level
,t.ofr_bxl_limit_flag
,t.include_kd
,t.ofr_risk_level
,t.xxzx_label74
,t.if_5g_term_n_flag
,t.ofr_total_num
,t.ofr_basic_num
,t.ofr_flux_use
,t.ofr_flux_saturation
,t.zs_remain
,t.m_flux_5g
,t.day_main
,t.pre_future_flux
,t.pay_flux_pac_num_3mon
,t.bag_type_cnt
,t.flux3_yc_fee
,t.ofr_flux_yc
,t.busi_flux_overflow_chrg
,t.prefer_all
,t.ofr_id
,t.bag_price
,t.bag_amount
,t.flux_price_per
,t.effective_length
,t.if_5g_flux
,case when t.targets is null then 0 else 1 end as rating
from (
    select t1.*,t2.targets
    from (
        select a.user_id
        ,a.age
        ,a.sex
        ,a.urban_rural
        ,a.ronghe_flag
        ,a.prim_chrg_level
        ,a.ofr_bxl_limit_flag
        ,a.include_kd
        ,a.ofr_risk_level
        ,a.xxzx_label74
        ,a.if_5g_term_n_flag
        ,a.ofr_total_num
        ,a.ofr_basic_num
        ,a.ofr_flux_use
        ,a.ofr_flux_saturation
        ,a.zs_remain
        ,a.m_flux_5g
        ,a.day_main
        ,a.pre_future_flux
        ,a.pay_flux_pac_num_3mon
        ,a.bag_type_cnt
        ,a.flux3_yc_fee
        ,a.ofr_flux_yc
        ,a.busi_flux_overflow_chrg
        ,a.prefer_all
        ,b.ofr_id
        ,b.bag_price
        ,b.bag_amount
        ,b.flux_price_per
        ,b.effective_length
        ,b.if_5g_flux
        from user_data a
        cross join 
        item_data b
    ) t1
    left join (
        select a.user_id
        ,a.age
        ,a.sex
        ,a.urban_rural
        ,a.ronghe_flag
        ,a.prim_chrg_level
        ,a.ofr_bxl_limit_flag
        ,a.include_kd
        ,a.ofr_risk_level
        ,a.xxzx_label74
        ,a.if_5g_term_n_flag
        ,a.ofr_total_num
        ,a.ofr_basic_num
        ,a.ofr_flux_use
        ,a.ofr_flux_saturation
        ,a.zs_remain
        ,a.m_flux_5g
        ,a.day_main
        ,a.pre_future_flux
        ,a.pay_flux_pac_num_3mon
        ,a.bag_type_cnt
        ,a.flux3_yc_fee
        ,a.ofr_flux_yc
        ,a.busi_flux_overflow_chrg
        ,a.prefer_all
        ,b.ofr_id
        ,b.bag_price
        ,b.bag_amount
        ,b.flux_price_per
        ,b.effective_length
        ,b.if_5g_flux
        ,1 targets
        from user_data a
        left join 
        item_data b
        on a.ofr_id = b.ofr_id
    ) t2
    on t1.user_id = t2.user_id
    and t1.ofr_id = t2.ofr_id
) t
"""

# Extract the data
df = pd.read_sql(sql_str, config)
print('data extracted')
config.close()
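Since the real data cannot be shared, a random stand-in frame with the same row structure (1 positive + 17 negatives per user) can replace the pd.read_sql call while building the framework. The column names below are a small subset of the real schema, and all value ranges are invented:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

n_users, n_items = 200, 18   # 1 user x 18 items, as in the extraction above

# A few stand-in feature columns (the real schema has many more)
feat_cols = ['age', 'sex', 'bag_price', 'bag_amount']
rows = n_users * n_items
df = pd.DataFrame({c: rng.integers(0, 100, size=rows).astype(float)
                   for c in feat_cols})
df['user_id'] = np.repeat(np.arange(n_users), n_items).astype(float)
df['ofr_id'] = np.tile(np.arange(n_items), n_users).astype(float)

# 1 positive + 17 negatives per user: one random item gets rating = 1
pos_item = rng.integers(0, n_items, size=n_users)
df['rating'] = (df['ofr_id'].values == np.repeat(pos_item, n_items)).astype(int)
```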

Data Preprocessing

1 Fill missing values; this can be skipped if it is already handled upstream during data development;

2 Cast types, since data keeps the format it arrives in (the fields pulled from MySQL here are raw database varchar, i.e. strings, so force one cast to avoid errors later);

3 Normalize. The earliest version used index encoding, with the intent of bringing the data into a standard form, but extensive testing showed this approach was seriously flawed:

Error 1:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:

Error 2:

InvalidArgumentError: indices[n1,n2] = -n3 is not in [x1,x2)

The first message looks like a GPU error; the second says a lookup index fell outside the embedding table's valid range.

Both have the same root cause: the encoded values overflowed the table. Replacing index encoding with min-max compression eliminated the problem at the source.

4 Split the data.
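The embedding-range failure described in step 3 can be isolated in a few lines. The values of x and N below are made up, but the [0, N) rule is exactly what the second error message enforces for any Embedding(input_dim=N) lookup:

```python
import numpy as np

x = np.array([3.0, 250.0, -7.0, 40.0])   # raw feature values (made up)

# Index-style encoding fails whenever a value falls outside the embedding
# table's valid range [0, N) -- exactly the second error message above.
N = 100
out_of_range = (x < 0) | (x >= N)
assert out_of_range.any()                # -7 and 250 would crash the lookup

# Min-max compression maps every value into [0, 1], so whatever table size
# is chosen later, no lookup index can fall outside it.
x_norm = (x - x.min()) / (x.max() - x.min())
```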

df = df.fillna(0)

y_col = ['rating']
x_col = [v for v in df.columns if v not in y_col]

# Cast every column to float
for i in df.columns:
    df[i] = df[i].astype(float)


# Min-max normalization (ID columns are left untouched)
def data_0_1(df):
    id_col = ['user_id', 'ofr_id']
    df_col = [v for v in df.columns if v not in id_col]
    for i in df_col:
        i_max = df[i].max()
        i_min = df[i].min()
        i_len = i_max - i_min
        if i_len == 0:  # constant column: avoid division by zero
            df[i] = 0.0
            continue
        new_lst = []
        print(i)
        for j in tqdm.trange(len(df)):
            i_new = (df[i].iloc[j] - i_min) / i_len
            new_lst.append(i_new)
        df[i] = new_lst
    return df

df = data_0_1(df)

x_df = df[x_col]
y_df = df[y_col]

x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, train_size=0.7, random_state=18)

Model Implementation

Length Limits

The length limits cap each Embedding layer's input range (input_dim) during the later encoding step, so embedding results cannot run out of control during training and crash it. Each limit is the column maximum plus 1; since lookup indices must fall in [0, input_dim), the +1 could be replaced with a larger margin, but 1 is the smallest value that guarantees every observed value stays in range.

# User-side data: vocabulary size per field
num_user_id = int(x_train['user_id'].max()) + 1
num_age = int(x_train['age'].max()) + 1
num_sex = int(x_train['sex'].max()) + 1
num_urban_rural = int(x_train['urban_rural'].max()) + 1
num_ronghe_flag = int(x_train['ronghe_flag'].max()) + 1
num_prim_chrg_level = int(x_train['prim_chrg_level'].max()) + 1
num_ofr_bxl_limit_flag = int(x_train['ofr_bxl_limit_flag'].max()) + 1
num_include_kd = int(x_train['include_kd'].max()) + 1
num_ofr_risk_level = int(x_train['ofr_risk_level'].max()) + 1
num_xxzx_label74 = int(x_train['xxzx_label74'].max()) + 1
num_if_5g_term_n_flag = int(x_train['if_5g_term_n_flag'].max()) + 1
num_ofr_total_num = int(x_train['ofr_total_num'].max()) + 1
num_ofr_basic_num = int(x_train['ofr_basic_num'].max()) + 1
num_ofr_flux_use = int(x_train['ofr_flux_use'].max()) + 1
num_ofr_flux_saturation = int(x_train['ofr_flux_saturation'].max()) + 1
num_zs_remain = int(x_train['zs_remain'].max()) + 1
num_m_flux_5g = int(x_train['m_flux_5g'].max()) + 1
num_day_main = int(x_train['day_main'].max()) + 1
num_pre_future_flux = int(x_train['pre_future_flux'].max()) + 1
num_pay_flux_pac_num_3mon = int(x_train['pay_flux_pac_num_3mon'].max()) + 1
num_bag_type_cnt = int(x_train['bag_type_cnt'].max()) + 1
num_flux3_yc_fee = int(x_train['flux3_yc_fee'].max()) + 1
num_ofr_flux_yc = int(x_train['ofr_flux_yc'].max()) + 1
num_busi_flux_overflow_chrg = int(x_train['busi_flux_overflow_chrg'].max()) + 1
num_prefer_all = int(x_train['prefer_all'].max()) + 1

# Item-side data: vocabulary size per field
num_ofr_id = int(x_train['ofr_id'].max()) + 1
num_bag_price = int(x_train['bag_price'].max()) + 1
num_bag_amount = int(x_train['bag_amount'].max()) + 1
num_flux_price_per = int(x_train['flux_price_per'].max()) + 1
num_effective_length = int(x_train['effective_length'].max()) + 1
num_if_5g_flux = int(x_train['if_5g_flux'].max()) + 1

Model Construction

1 Create the inputs, each declared as a flat one-dimensional (shape=(1,)) tensor;

2 Encode users and items separately, then take the dot product of the user encoding and the item encoding;

3 Pass the dot product through a further dense layer that compresses it;

4 Train on the result, steering the encoding direction and thereby training the encoders.

def get_model():
    ofr_id = keras.layers.Input(shape=(1,), name='ofr_id')
    bag_price = keras.layers.Input(shape=(1,), name='bag_price')
    bag_amount = keras.layers.Input(shape=(1,), name='bag_amount')
    flux_price_per = keras.layers.Input(shape=(1,), name='flux_price_per')
    effective_length = keras.layers.Input(shape=(1,), name='effective_length')
    if_5g_flux = keras.layers.Input(shape=(1,), name='if_5g_flux')

    user_id = keras.layers.Input(shape=(1,), name='user_id')
    age = keras.layers.Input(shape=(1,), name='age')
    sex = keras.layers.Input(shape=(1,), name='sex')
    urban_rural = keras.layers.Input(shape=(1,), name='urban_rural')
    ronghe_flag = keras.layers.Input(shape=(1,), name='ronghe_flag')
    prim_chrg_level = keras.layers.Input(shape=(1,), name='prim_chrg_level')
    ofr_bxl_limit_flag = keras.layers.Input(shape=(1,), name='ofr_bxl_limit_flag')
    include_kd = keras.layers.Input(shape=(1,), name='include_kd')
    ofr_risk_level = keras.layers.Input(shape=(1,), name='ofr_risk_level')
    xxzx_label74 = keras.layers.Input(shape=(1,), name='xxzx_label74')
    if_5g_term_n_flag = keras.layers.Input(shape=(1,), name='if_5g_term_n_flag')
    ofr_total_num = keras.layers.Input(shape=(1,), name='ofr_total_num')
    ofr_basic_num = keras.layers.Input(shape=(1,), name='ofr_basic_num')
    ofr_flux_use = keras.layers.Input(shape=(1,), name='ofr_flux_use')
    ofr_flux_saturation = keras.layers.Input(shape=(1,), name='ofr_flux_saturation')
    zs_remain = keras.layers.Input(shape=(1,), name='zs_remain')
    m_flux_5g = keras.layers.Input(shape=(1,), name='m_flux_5g')
    day_main = keras.layers.Input(shape=(1,), name='day_main')
    pre_future_flux = keras.layers.Input(shape=(1,), name='pre_future_flux')
    pay_flux_pac_num_3mon = keras.layers.Input(shape=(1,), name='pay_flux_pac_num_3mon')
    bag_type_cnt = keras.layers.Input(shape=(1,), name='bag_type_cnt')
    flux3_yc_fee = keras.layers.Input(shape=(1,), name='flux3_yc_fee')
    ofr_flux_yc = keras.layers.Input(shape=(1,), name='ofr_flux_yc')
    busi_flux_overflow_chrg = keras.layers.Input(shape=(1,), name='busi_flux_overflow_chrg')
    prefer_all = keras.layers.Input(shape=(1,), name='prefer_all')

    item_vector = tf.keras.layers.concatenate([
        keras.layers.Embedding(num_ofr_id, 16)(ofr_id),
        keras.layers.Embedding(num_bag_price, 16)(bag_price),
        keras.layers.Embedding(num_bag_amount, 16)(bag_amount),
        keras.layers.Embedding(num_flux_price_per, 16)(flux_price_per),
        keras.layers.Embedding(num_effective_length, 16)(effective_length),
        keras.layers.Embedding(num_if_5g_flux, 16)(if_5g_flux)])

    item_vector = keras.layers.Dense(16, activation='relu')(item_vector)
    item_vector = keras.layers.Dense(8, activation='relu', name="item_embedding", kernel_regularizer='l2')(item_vector)

    user_vector = tf.keras.layers.concatenate([
        keras.layers.Embedding(num_user_id, 16)(user_id),
        keras.layers.Embedding(num_age, 16)(age),
        keras.layers.Embedding(num_sex, 16)(sex),
        keras.layers.Embedding(num_urban_rural, 16)(urban_rural),
        keras.layers.Embedding(num_ronghe_flag, 16)(ronghe_flag),
        keras.layers.Embedding(num_prim_chrg_level, 16)(prim_chrg_level),
        keras.layers.Embedding(num_ofr_bxl_limit_flag, 16)(ofr_bxl_limit_flag),
        keras.layers.Embedding(num_include_kd, 16)(include_kd),
        keras.layers.Embedding(num_ofr_risk_level, 16)(ofr_risk_level),
        keras.layers.Embedding(num_xxzx_label74, 16)(xxzx_label74),
        keras.layers.Embedding(num_if_5g_term_n_flag, 16)(if_5g_term_n_flag),
        keras.layers.Embedding(num_ofr_total_num, 16)(ofr_total_num),
        keras.layers.Embedding(num_ofr_basic_num, 16)(ofr_basic_num),
        keras.layers.Embedding(num_ofr_flux_use, 16)(ofr_flux_use),
        keras.layers.Embedding(num_ofr_flux_saturation, 16)(ofr_flux_saturation),
        keras.layers.Embedding(num_zs_remain, 16)(zs_remain),
        keras.layers.Embedding(num_m_flux_5g, 16)(m_flux_5g),
        keras.layers.Embedding(num_day_main, 16)(day_main),
        keras.layers.Embedding(num_pre_future_flux, 16)(pre_future_flux),
        keras.layers.Embedding(num_pay_flux_pac_num_3mon, 16)(pay_flux_pac_num_3mon),
        keras.layers.Embedding(num_bag_type_cnt, 16)(bag_type_cnt),
        keras.layers.Embedding(num_flux3_yc_fee, 16)(flux3_yc_fee),
        keras.layers.Embedding(num_ofr_flux_yc, 16)(ofr_flux_yc),
        keras.layers.Embedding(num_busi_flux_overflow_chrg, 16)(busi_flux_overflow_chrg),
        keras.layers.Embedding(num_prefer_all, 16)(prefer_all)])

    user_vector = keras.layers.Dense(16, activation='relu')(user_vector)
    user_vector = keras.layers.Dense(8, activation='relu', name="user_vector", kernel_regularizer='l2')(user_vector)

    dot_user_item = tf.reduce_sum(user_vector * item_vector, axis=1)
    dot_user_item = tf.expand_dims(dot_user_item, 1)

    output_1 = keras.layers.Dense(1, activation='sigmoid')(dot_user_item)

    return keras.models.Model(
        inputs=[ofr_id, bag_price, bag_amount, flux_price_per, effective_length, if_5g_flux,
                user_id, age, sex, urban_rural, ronghe_flag, prim_chrg_level, ofr_bxl_limit_flag,
                include_kd, ofr_risk_level, xxzx_label74, if_5g_term_n_flag, ofr_total_num, ofr_basic_num, ofr_flux_use,
                ofr_flux_saturation, zs_remain, m_flux_5g, day_main, pre_future_flux, pay_flux_pac_num_3mon,
                bag_type_cnt, flux3_yc_fee, ofr_flux_yc, busi_flux_overflow_chrg, prefer_all], outputs=[output_1])

Model Training

# Instantiate the model and choose the loss and optimizer
model = get_model()
model.compile(loss=tf.keras.losses.MeanSquaredError(), optimizer=keras.optimizers.RMSprop())

# Separate features and label
x = x_train[x_col]
y = y_train.pop('rating')
print('training data ready')

# Assemble the per-input training columns
fit_x_train = [x['ofr_id'],
               x['bag_price'],
               x['bag_amount'],
               x['flux_price_per'],
               x['effective_length'],
               x['if_5g_flux'],
               x['user_id'],
               x['age'],
               x['sex'],
               x['urban_rural'],
               x['ronghe_flag'],
               x['prim_chrg_level'],
               x['ofr_bxl_limit_flag'],
               x['include_kd'],
               x['ofr_risk_level'],
               x['xxzx_label74'],
               x['if_5g_term_n_flag'],
               x['ofr_total_num'],
               x['ofr_basic_num'],
               x['ofr_flux_use'],
               x['ofr_flux_saturation'],
               x['zs_remain'],
               x['m_flux_5g'],
               x['day_main'],
               x['pre_future_flux'],
               x['pay_flux_pac_num_3mon'],
               x['bag_type_cnt'],
               x['flux3_yc_fee'],
               x['ofr_flux_yc'],
               x['busi_flux_overflow_chrg'],
               x['prefer_all']]
print('training inputs prepared')

# Fit the model
history = model.fit(
    x=fit_x_train,
    y=y,
    batch_size=16,
    epochs=2,
    verbose=1)
print('training finished')

Saving the Model

Timestamped filenames make later incremental-training outputs easy to track and prevent a new model from overwriting an old one.

# Save the model, then reload it to verify
time_str = time.strftime("%Y%m%d_%H%M%S", time.localtime())
model.save("./model/dssm_model_{}.h5".format(time_str))
new_model = tf.keras.models.load_model("./model/dssm_model_{}.h5".format(time_str))
print("model saved as dssm_model_{}.h5".format(time_str))

Prediction Output

inputs = x_test

predict_lst = new_model.predict([
    inputs['ofr_id'],
    inputs['bag_price'],
    inputs['bag_amount'],
    inputs['flux_price_per'],
    inputs['effective_length'],
    inputs['if_5g_flux'],
    inputs['user_id'],
    inputs['age'],
    inputs['sex'],
    inputs['urban_rural'],
    inputs['ronghe_flag'],
    inputs['prim_chrg_level'],
    inputs['ofr_bxl_limit_flag'],
    inputs['include_kd'],
    inputs['ofr_risk_level'],
    inputs['xxzx_label74'],
    inputs['if_5g_term_n_flag'],
    inputs['ofr_total_num'],
    inputs['ofr_basic_num'],
    inputs['ofr_flux_use'],
    inputs['ofr_flux_saturation'],
    inputs['zs_remain'],
    inputs['m_flux_5g'],
    inputs['day_main'],
    inputs['pre_future_flux'],
    inputs['pay_flux_pac_num_3mon'],
    inputs['bag_type_cnt'],
    inputs['flux3_yc_fee'],
    inputs['ofr_flux_yc'],
    inputs['busi_flux_overflow_chrg'],
    inputs['prefer_all']
    ])

# Flatten predictions to plain floats
pred_lst = []
for i in predict_lst:
    pred_lst.append(float(i))

x_test = x_test.copy()  # avoid SettingWithCopyWarning on the split slice
x_test['rating'] = y_test['rating']
x_test['pred'] = pred_lst

x_test.to_csv('./test_pred.csv', index=False)
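The F1 score and confusion matrix previewed at the end can be recomputed from the saved rating/pred columns. A sketch with hypothetical stand-in values; the 0.5 cutoff is an assumption, not something fixed by the article:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Stand-ins for the 'rating' and 'pred' columns written to test_pred.csv
rating = np.array([1, 0, 0, 1, 0, 1, 0, 0])
pred = np.array([0.9, 0.2, 0.6, 0.7, 0.1, 0.4, 0.3, 0.05])

pred_label = (pred >= 0.5).astype(int)     # 0.5 cutoff is an assumption
f1 = f1_score(rating, pred_label)
cm = confusion_matrix(rating, pred_label)  # rows: true 0/1, cols: predicted 0/1
```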

Common Issues

1 The article uses 1 positive to 17 negatives, which over-trains on negatives and degrades recall in the predictions; trim the ratio to somewhere between 1:1 and 1:3;

2 Keep batch_size at or below 32 during training; the detailed reasons are easy to look up;

3 Two-tower models do mitigate prediction skew (in classical ML, when one class makes up 50% of the positives, predictions collapse almost entirely onto that class), but they only mitigate it; actively rebalance the positive samples in the training data;

4 More epochs is not better: 5-10 is recommended, beyond which the return on extra training drops off.

5 The two-tower design has a serious weakness: although the encodings are validated bidirectionally, user behavior and item information remain strictly independent, with no crossed features for the model to learn from in the middle. A later version will introduce FM-style crossing networks; research and sharing from everyone is welcome!
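Point 1 (trimming 1:17 down to roughly 1:3) can be done per user with pandas before training. The frame below is a toy stand-in in the article's shape, and neg_per_pos is a tunable knob, not a prescribed value:

```python
import numpy as np
import pandas as pd

# Toy frame in the article's shape: 10 users x 18 items, 1 positive each
df = pd.DataFrame({
    'user_id': np.repeat(np.arange(10), 18),
    'ofr_id': np.tile(np.arange(18), 10),
})
df['rating'] = (df['ofr_id'] == 0).astype(int)

neg_per_pos = 3   # target roughly 1:3, per point 1 above
pos = df[df['rating'] == 1]
neg = df[df['rating'] == 0].groupby('user_id').sample(n=neg_per_pos, random_state=0)
balanced = pd.concat([pos, neg]).sample(frac=1, random_state=0)  # shuffle rows
```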

Results Preview

F1 score

Confusion matrix