[Torch-RecHub Study] Analysis of EmbeddingLayer Construction

1. Torch-RecHub Repository

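github.com/datawhalech… is an open-source recommender-system toolkit released by the DataWhale team. It can be installed with pip install, or the full toolkit can be downloaded via git. It currently implements components for classic models such as LR, FM, and MLP, but still lacks a component for GCN; its idea of splitting models into components is a useful starting point for implementing a GCN component.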

Building on the earlier post about Feature handling, this post continues with how the toolkit builds its Embedding layer.

2. The EmbeddingLayer

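This component is defined in torch_rechub/basic/layers.py and inherits from PyTorch's Module class. Because the original implementation is fairly long, this post excerpts only the important parts.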

2.1 Initialization method

class EmbeddingLayer(nn.Module):
    def __init__(self, features):
        super().__init__()
        self.features = features
        self.embed_dict = nn.ModuleDict()  # container that holds the embedding modules
        self.n_dense = 0

        for fea in features:    # iterate over every Feature used to build the EmbeddingLayer
            if fea.name in self.embed_dict:  # exist
                continue
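            # isinstance checks whether fea is a SparseFeature,
            # and shared_with == None means it does not reuse another feature's embedding
            if isinstance(fea, SparseFeature) and fea.shared_with == None: 
                # build a normally initialized lookup table from fea's vocab_size and embed_dim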
                self.embed_dict[fea.name] = fea.get_embedding_layer()
            elif isinstance(fea, SequenceFeature) and fea.shared_with == None:
                self.embed_dict[fea.name] = fea.get_embedding_layer()
            elif isinstance(fea, DenseFeature):
                self.n_dense += 1  # dense features are used directly, so only count them

All embeddings are stored in a dictionary-style container (nn.ModuleDict).

The features parameter declares which features need an embedding lookup table.

get_embedding_layer() is described in detail in the earlier post on Feature handling; in essence it calls torch.nn.Embedding and then initializes the weights with nn.init.normal_().
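As a rough illustration of that description, the sketch below builds a lookup table and fills it from a normal distribution. The helper name make_embedding and the std value are assumptions made for this example and are not taken from the toolkit.

```python
import torch
import torch.nn as nn


def make_embedding(vocab_size: int, embed_dim: int, std: float = 1e-4) -> nn.Embedding:
    """Hypothetical helper: create a lookup table and fill it from N(0, std^2)."""
    embed = nn.Embedding(vocab_size, embed_dim)
    # In-place normal initialization of the weight matrix (std is an assumed value)
    nn.init.normal_(embed.weight, mean=0.0, std=std)
    return embed


table = make_embedding(vocab_size=2, embed_dim=16)
print(table)               # Embedding(2, 16)
print(table.weight.shape)  # torch.Size([2, 16])
```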

The resulting dictionary has the form embed_dict: {feature_name: embedding table}, where each lookup table is initialized from a normal distribution according to the feature's vocab_size and embed_dim.
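The dictionary construction can be sketched in isolation. ToySparseFeature below is a stand-in written for this example, not the toolkit's SparseFeature class, and the feature names are made up; the point is only that a feature whose shared_with is set gets no table of its own.

```python
from dataclasses import dataclass
from typing import Optional

import torch.nn as nn


@dataclass
class ToySparseFeature:
    """Stand-in for SparseFeature (hypothetical, for illustration only)."""
    name: str
    vocab_size: int
    embed_dim: int
    shared_with: Optional[str] = None


features = [
    ToySparseFeature("user_id", vocab_size=100, embed_dim=16),
    ToySparseFeature("item_id", vocab_size=50, embed_dim=16),
    # Reuses item_id's table, so no separate table is created for it
    ToySparseFeature("last_item_id", vocab_size=50, embed_dim=16, shared_with="item_id"),
]

embed_dict = nn.ModuleDict()
for fea in features:
    if fea.name in embed_dict:   # already registered
        continue
    if fea.shared_with is None:  # only unshared features own a table
        embed_dict[fea.name] = nn.Embedding(fea.vocab_size, fea.embed_dim)

print(list(embed_dict.keys()))  # ['user_id', 'item_id']
```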

An example call (from torch_rechub/models/ranking/widedeep.py):

self.embedding = EmbeddingLayer(wide_features + deep_features)

2.2 The layer's forward method

    def forward(self, x, features, squeeze_dim=False):
        sparse_emb, dense_values = [], []
        sparse_exists, dense_exists = False, False
        for fea in features:
            if isinstance(fea, SparseFeature):
                if fea.shared_with == None:
                    sparse_emb.append(self.embed_dict[fea.name](x[fea.name].long()).unsqueeze(1))
                else:
                    sparse_emb.append(self.embed_dict[fea.shared_with](x[fea.name].long()).unsqueeze(1))
            elif isinstance(fea, SequenceFeature):
                if fea.pooling == "sum":
                    pooling_layer = SumPooling()
                elif fea.pooling == "mean":
                    pooling_layer = AveragePooling()
                elif fea.pooling == "concat":
                    pooling_layer = ConcatPooling()
                else:
                    raise ValueError("Sequence pooling method supports only pooling in %s, got %s." %
                                     (["sum", "mean", "concat"], fea.pooling))
                if fea.shared_with == None:
                    sparse_emb.append(pooling_layer(self.embed_dict[fea.name](x[fea.name].long())).unsqueeze(1))
                else:
                    sparse_emb.append(pooling_layer(self.embed_dict[fea.shared_with](
                        x[fea.name].long())).unsqueeze(1))  # shared specific sparse feature embedding
            else:
                dense_values.append(x[fea.name].float().unsqueeze(1))  #.unsqueeze(1).unsqueeze(1)

        if len(dense_values) > 0:
            dense_exists = True
            dense_values = torch.cat(dense_values, dim=1)
        if len(sparse_emb) > 0:
            sparse_exists = True
            sparse_emb = torch.cat(sparse_emb, dim=1)  # [batch_size, num_features, embed_dim]

        if squeeze_dim:  #Note: if the emb_dim of sparse features is different, we must squeeze_dim
            if dense_exists and not sparse_exists:  # only input dense features
                return dense_values
            elif not dense_exists and sparse_exists:
                return sparse_emb.flatten(start_dim=1)  # squeeze dim to : [batch_size, num_features*embed_dim]
            elif dense_exists and sparse_exists:
                return torch.cat((sparse_emb.flatten(start_dim=1), dense_values),
                                 dim=1)  #concat dense value with sparse embedding
            else:
                raise ValueError("The input features can not be empty")
        else:
            if sparse_exists:
                return sparse_emb  #[batch_size, num_features, embed_dim]
            else:
                raise ValueError(
                    "If keep the original shape:[batch_size, num_features, embed_dim], expected %s in feature list, got %s" %
                    ("SparseFeatures", features))
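The SequenceFeature branch of forward() pools the embedded sequence before appending it. The sketch below assumes that sum and mean pooling simply reduce over the sequence dimension; the toolkit's SumPooling and AveragePooling classes may do more (for example, handle padding), which is not modeled here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(10, 4)                  # vocab_size=10, embed_dim=4
seq_ids = torch.tensor([[1, 2, 3], [4, 5, 6]])   # [batch_size=2, seq_len=3]

seq_emb = embedding(seq_ids)        # [2, 3, 4]: one vector per sequence position
sum_pooled = seq_emb.sum(dim=1)     # [2, 4]: assumed behaviour of "sum" pooling
mean_pooled = seq_emb.mean(dim=1)   # [2, 4]: assumed behaviour of "mean" pooling

# unsqueeze(1) adds a feature axis so the pooled sequence lines up with sparse features
print(sum_pooled.unsqueeze(1).shape)  # torch.Size([2, 1, 4])
```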

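For each feature passed in, forward() produces either an embedding that is appended to sparse_emb (sparse and sequence features) or a continuous value that is appended to dense_values (dense features).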
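The effect of squeeze_dim on the output shape can be checked with random tensors. This is a minimal sketch with made-up sizes (two sparse features, three dense features); it reproduces only the final concatenation logic, not the whole layer.

```python
import torch

batch_size, embed_dim = 4, 8

# Two sparse features, each already embedded and unsqueezed to [batch_size, 1, embed_dim]
sparse_emb = torch.cat([torch.randn(batch_size, 1, embed_dim) for _ in range(2)], dim=1)
# Three dense features, each of shape [batch_size, 1]
dense_values = torch.cat([torch.randn(batch_size, 1) for _ in range(3)], dim=1)

# squeeze_dim=False: keep [batch_size, num_features, embed_dim]
kept = sparse_emb
# squeeze_dim=True: flatten the embeddings, then append the dense values
flat = torch.cat((sparse_emb.flatten(start_dim=1), dense_values), dim=1)

print(kept.shape)  # torch.Size([4, 2, 8])
print(flat.shape)  # torch.Size([4, 19]), i.e. 2 * 8 + 3 columns
```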

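First, note the argument x. It is passed in by the forward() of the models defined under torch_rechub/models, which in turn receive it from the trainers defined under torch_rechub/trainers. Taking the x supplied in torch_rechub/trainers/ctr_trainer.py as an example: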

x is the dataset in dictionary form: each key is a column name of the dataset, and the corresponding value is that column's values converted to a tensor.
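A dictionary of this shape can be built by hand. The column names and values below are made up for illustration; in the toolkit the dictionary is produced by its own data pipeline.

```python
import torch

# Hypothetical toy dataset: column name -> list of values
columns = {
    "101": [1, 1, 0, 1],              # a sparse (categorical) feature
    "price": [0.5, 1.2, 3.3, 0.9],    # a dense (numeric) feature
}

# Convert every column to a tensor, keyed by its column name
x = {name: torch.tensor(values) for name, values in columns.items()}

print(x["101"].dtype, x["price"].dtype)  # torch.int64 torch.float32
```

Inside forward(), sparse columns are then cast with .long() before the embedding lookup, and dense columns with .float().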

Take one sparse feature (101) from the sampled dataset as an example:

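embed_dict[fea.name] retrieves the value stored under key fea.name in the initialized embed_dict, namely Embedding(2, 16): a lookup table with 2 rows (one per vocabulary index), each row a 16-dimensional vector.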

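x[fea.name].long() extracts the values of feature 101 from the dataset, here a tensor of 100 elements that are all 1, and casts it to long (int64), the integer index type that nn.Embedding expects.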

Combining the two yields a 100*16 tensor; unsqueeze(1) adds one dimension, giving 100*1*16, and the result is appended to sparse_emb.

Finally, all tensors in sparse_emb are concatenated along dim=1 (the feature dimension), so the merged sparse_emb has shape [batch_size, num_features, embed_dim].
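The shapes in this walk-through can be reproduced directly. In the sketch below, table stands in for embed_dict[fea.name], and the second feature in the concatenation is simply a copy of the first, added only to show how the feature dimension grows.

```python
import torch
import torch.nn as nn

table = nn.Embedding(2, 16)    # stands in for embed_dict[fea.name]
ids = torch.ones(100).long()   # 100 samples, every index equal to 1

looked_up = table(ids)                  # [100, 16]: row 1 of the table, repeated
one_feature = looked_up.unsqueeze(1)    # [100, 1, 16]

# Pretend a second sparse feature with the same embed_dim exists
sparse_emb = torch.cat([one_feature, one_feature], dim=1)  # [100, 2, 16]

print(looked_up.shape, one_feature.shape, sparse_emb.shape)
```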

The figure above shows the Embedding(2, 16) that corresponds to embed_dict[fea.name]; the trailing requires_grad=True means that gradients are computed for this tensor.
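That the lookup table is a trainable parameter can be verified in a few lines. This is a standalone check on a freshly created nn.Embedding, not on the toolkit's own layer.

```python
import torch
import torch.nn as nn

table = nn.Embedding(2, 16)
print(table.weight.requires_grad)  # True

# A backward pass fills in a gradient with the same shape as the table
loss = table(torch.tensor([0, 1])).sum()
loss.backward()
print(table.weight.grad.shape)     # torch.Size([2, 16])
```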