【Torch-RecHub Study Notes】Feature Layer Processing

1. Torch-RecHub Repository

github.com/datawhalech… An open-source recommendation-system toolkit released by the DataWhale team; it can be installed with the pip install command or downloaded in full from GitHub. It currently implements components for classic models such as LR, FM, and MLP, but still lacks GCN-based components; its approach of splitting models into reusable components is worth studying as a basis for implementing a GCN component.

2. Torch-RecHub Framework

(Figure: Torch-RecHub framework overview, Torch-RecHub.PNG)

For this round of study, my focus is on how the feature layer is processed and how the embedding layer is constructed.

2.1 Data Layer

torch-rechub/basic/features.py provides handling for three common feature types: SequenceFeature (sequence features), SparseFeature (sparse categorical features), and DenseFeature (dense features). The initialization of these three feature classes is analyzed below:

2.1.1 DenseFeature (Dense Features)

class DenseFeature(object):
    """The Feature Class for Dense feature.

    Args:
        name (str): feature's name.
        embed_dim (int): embedding vector's length; the value is fixed to `1`.
    """

    def __init__(self, name):
        self.name = name
        self.embed_dim = 1

    def __repr__(self):
        return f'<DenseFeature {self.name}>'

  1. name: defines the feature's name.

  2. embed_dim: the embedding dimension. Because a dense feature is a single continuous, meaningful value, the dimension can be set directly to 1 (see the sketch below).
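
As a quick usage illustration (my own sketch, not from the repository, assuming the class is importable from the installed package as torch_rechub.basic.features):

from torch_rechub.basic.features import DenseFeature  # assumed import path

# One object per continuous column; no vocabulary or lookup table is needed.
age = DenseFeature("age")

print(age)            # <DenseFeature age>
print(age.embed_dim)  # 1: each dense feature contributes a single value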

2.1.2 SparseFeature (Sparse Features)

class SparseFeature(object):
    """The Feature Class for Sparse feature.

    Args:
        name (str): feature's name.
        vocab_size (int): vocabulary size of embedding table.
        embed_dim (int): embedding vector's length
        shared_with (str): the name of another feature with which this feature shares an embedding table.
        initializer (Initializer): initializer for the embedding layer weights.
    """

    def __init__(self, name, vocab_size, embed_dim=None, shared_with=None, initializer=RandomNormal(0, 0.0001)):
        self.name = name
        self.vocab_size = vocab_size
        if embed_dim is None:
            self.embed_dim = get_auto_embedding_dim(vocab_size)     # derive embed_dim from vocab_size: 6 * vocab_size ** 0.26
        else:
            self.embed_dim = embed_dim
        self.shared_with = shared_with
        self.initializer = initializer

    def __repr__(self):
        return f'<SparseFeature {self.name} with Embedding shape ({self.vocab_size}, {self.embed_dim})>'

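    # construct the embedding layer weights with the configured initializer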
    def get_embedding_layer(self):
        return self.initializer(self.vocab_size, self.embed_dim)
  1. vocab_size: the vocabulary size of the embedding table, i.e., the number of distinct categories this feature takes (gender has two classes, male and female, so vocab_size=2); in other words, the dictionary size of the corpus.

  2. embed_dim: the length of each embedding vector, i.e., the dimensionality of the dense vector that each category in the embedding table is mapped to. If embed_dim is not specified, util.data.get_auto_embedding_dim() computes it from vocab_size (the formula comes from the Deep & Cross paper and can be understood as a dimensionality-reduction step; see the usage sketch after the RandomNormal code below).

  3. shared_with: a flag naming another feature whose embedding this feature shares; defaults to None.

  4. initializer: initializes the weight matrix of the embedding layer (the lookup table); here basic.initializers.RandomNormal() is used, with the mean of the generated random values set to 0 and the standard deviation set to 0.0001. When a model is actually declared, the get_embedding_layer() method of this class is called, which leads back into RandomNormal(), entering RandomNormal().__call__(), where PyTorch's Embedding() is used to create normally distributed initial parameters.
class RandomNormal(object):
    """Returns an embedding initialized with a normal distribution.

    Args:
        mean (float): the mean of the normal distribution
        std (float): the standard deviation of the normal distribution
    """

    def __init__(self, mean=0.0, std=1.0):
        self.mean = mean
        self.std = std

    def __call__(self, vocab_size, embed_dim):
        embed = torch.nn.Embedding(vocab_size, embed_dim)  # vocabulary size and the length of each embedding vector
        torch.nn.init.normal_(embed.weight, self.mean, self.std)  # initialize embed.weight from a normal distribution with the given mean and std
        return embed
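
Putting the pieces together, here is a minimal usage sketch (my own illustration; the import path is assumed, and auto_embed_dim is a hypothetical reimplementation of the 6 * vocab_size ** 0.26 rule quoted above, whose actual version is util.data.get_auto_embedding_dim()):

import numpy as np
from torch_rechub.basic.features import SparseFeature  # assumed import path

# Hypothetical reimplementation of the auto-dimension rule quoted above.
def auto_embed_dim(vocab_size):
    return int(np.floor(6 * np.power(vocab_size, 0.26)))

print(auto_embed_dim(2))     # a gender-like feature -> 7
print(auto_embed_dim(1000))  # 1000 categories -> 36

# With embed_dim omitted, SparseFeature falls back to the same rule.
item_id = SparseFeature("item_id", vocab_size=1000)
print(item_id)  # <SparseFeature item_id with Embedding shape (1000, 36)>

# get_embedding_layer() asks RandomNormal for an nn.Embedding(1000, 36)
# whose weights are drawn from N(mean=0, std=0.0001).
emb = item_id.get_embedding_layer()
print(emb.weight.shape)  # torch.Size([1000, 36])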

2.1.3 SequenceFeature (Sequence Features)

class SequenceFeature(object):
    """The Feature Class for Sequence feature or multi-hot feature.
    In recommendation, there are many user behaviour features which we want to feed into a sequence model,
    and tag features (multi-hot) which we want to pool. Note that if you use this feature, you must pad
    the feature values before training.

    Args:
        name (str): feature's name.
        vocab_size (int): vocabulary size of embedding table.  
        embed_dim (int): embedding vector's length  
        pooling (str): pooling method, support `["mean", "sum", "concat"]` (default=`"mean"`)
        shared_with (str): the name of another feature with which this feature shares an embedding table.
        initializer (Initializer): initializer for the embedding layer weights.
    """

    def __init__(self, name, vocab_size, embed_dim=None, pooling="mean", shared_with=None, initializer=RandomNormal(0, 0.0001)):
        self.name = name
        self.vocab_size = vocab_size
        if embed_dim is None:
            # util.data
            self.embed_dim = get_auto_embedding_dim(vocab_size)  # derive embed_dim from vocab_size: 6 * vocab_size ** 0.26
        else:
            self.embed_dim = embed_dim
        self.pooling = pooling
        self.shared_with = shared_with
        self.initializer = initializer

    # human-readable string representation of the feature
    def __repr__(self):
        return f'<SequenceFeature {self.name} with Embedding shape ({self.vocab_size}, {self.embed_dim})>'

    # construct the embedding layer weights with the configured initializer
    def get_embedding_layer(self):
        return self.initializer(self.vocab_size, self.embed_dim)

Initialization is similar to the two feature classes above, with the following addition:

  1. pooling: specifies the pooling method used to fuse the sequence of embeddings; supported values are "mean", "sum", and "concat" (default "mean"), as illustrated in the sketch below.
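
Below is a minimal sketch of what the three pooling modes do to a batch of padded sequence embeddings (my own illustration of the idea, not torch-rechub's actual pooling layers):

import torch

# Hypothetical batch: 4 users, behaviour sequences padded to length 10,
# each behaviour embedded into a 16-dimensional vector.
seq_emb = torch.randn(4, 10, 16)

mean_pooled = seq_emb.mean(dim=1)                     # "mean"   -> (4, 16)
sum_pooled = seq_emb.sum(dim=1)                       # "sum"    -> (4, 16)
concat_pooled = seq_emb.reshape(seq_emb.size(0), -1)  # "concat" -> (4, 160)

print(mean_pooled.shape, sum_pooled.shape, concat_pooled.shape)

Note that mean and sum keep the output dimension independent of the (padded) sequence length, while concat multiplies it by the sequence length.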