Implementing the Transformer yourself, for interview preparation: the Transformer as a sequence transduction model (sequence transd

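First, the paper: Attention Is All You Need

Most current text generation models are variants of the Transformer architecture. Calling a library is convenient, but it leaves the details only loosely understood, so I wanted to implement the model once myself to deepen my understanding.

A complete PyTorch source implementation is available.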

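The implementation of the whole model is split into the following parts:

Encoder
Decoder
Transformer
Optimizer-warmup
Train
Autoregression-Beam search (not finished)

Transformer

When reproducing the model, it is best to build the overall framework first (part three, Transformer) once the architecture is understood, and only then fill in the Encoder and Decoder. This makes it clear which parameters have to be passed around and avoids running into missing arguments later.

To make the meaning of the parameters clear, here is the Transformer framework code first.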

import torch
from torch import nn


class Transformer(nn.Module):
    def __init__(self, hidden_dim, n_head, dropout_rate, pad_idx, seqs_len, src_vocab_size, trg_vocab_size, n_layer, fnn_dim):
        """
        :param hidden_dim: embedding dimension, 512
        :param n_head: number of attention heads, 8
        :param dropout_rate: dropout probability, 0.1
        :param pad_idx: padding token id, 0 here; it is determined by the tokenizer from the transformers library
        :param seqs_len: maximum sequence length, in tokens
        :param src_vocab_size: source vocabulary size
        :param trg_vocab_size: target vocabulary size
        :param n_layer: number of encoder layers and of decoder layers
        :param fnn_dim: inner dimension of the Feed Forward MLP, 2048 (the vector is expanded and then compressed, which can be thought of as a denoising step)
        """
        super(Transformer, self).__init__()
        self.padding_idx = pad_idx
        self.encoder = Encoder(
            src_vocab_size=src_vocab_size,
            hidden_emb=hidden_dim, dropout_rate=dropout_rate,
            src_seqs_len=seqs_len,
            padding_idx=pad_idx, n_head=n_head,
            enc_layer_number=n_layer,
            hidden_fnn=fnn_dim,
        )

        self.decoder = Decoder(
            trg_vocab_size=trg_vocab_size, hidden_dim=hidden_dim,
            padding_idx=pad_idx, dropout_rate=dropout_rate,
            seqs_len=seqs_len, n_layer=n_layer, n_head=n_head,
            hidden_fnn=fnn_dim)  # output: [b, seqs_len, hidden]
        self.l1 = nn.Linear(hidden_dim, trg_vocab_size)
        self.softmax = nn.Softmax(dim=-1)
        # Share the embedding parameters.
        # Paper: "In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation"
        # What if the source and target vocabularies differ in size?
        # The source and target languages share one large vocabulary,
        # and only the embeddings of the language in use are activated during lookup.
        # The benefit is that semantic information is shared when both languages contain the same subwords.
        # For Chinese-English translation this probably matters little, because there are no common subwords,
        # and a shared (larger) vocabulary makes the softmax more expensive, so whether to use it is a trade-off.
        # Note: the assignment below assumes src_vocab_size == trg_vocab_size.
        self.decoder.embeding.embed.weight = self.encoder.embd.embed.weight
        # self.l1.weight = self.decoder.embeding.embed.weight

    def forward(self, inputs, target):
        """
        :param inputs: src_seqs, token ids of shape [b, src_len]
        :param target: trg_seqs, token ids of shape [b, trg_len]
        :return: probabilities over the target vocabulary, shape [b, trg_len, trg_vocab_size]
        """
        trg_mask = _mask_dec_inputs(target)  # causal mask for the decoder
        enc_output = self.encoder(inputs)
        dec_outputs = self.decoder(enc_output=enc_output, dec_input=target, mask=trg_mask)
        prob_res = self.softmax(self.l1(dec_outputs))

        return prob_res

Encoder

The paper starts describing the model in its third section. Reading the model figure together with the text: "Here, the encoder maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n)..."

"The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers." (The encoder is a stackable hierarchy made of N encoder layers plus the embedding structure.)

TokenEmbeding, PositionEmbeding and the Encoder Layer are combined into the complete Encoder module. During training, src_seqs_len and trg_seqs_len should be merged by taking the longer of the two.


class Encoder(nn.Module):
    def __init__(self, dropout_rate, src_vocab_size, hidden_emb, padding_idx, src_seqs_len, n_head=8, enc_layer_number=2, hidden_fnn=2048):
        """
        Parameters:
            dropout_rate: dropout probability
            src_vocab_size: source-language vocabulary size
            hidden_emb: embedding dimension of each token
            padding_idx: id of the padding token, usually 0
            src_seqs_len: input sequence length; during training it is also the output length
            n_head: number of attention heads
            enc_layer_number: number of stacked encoder layers
            hidden_fnn: inner dimension of the FeedForward layer
        """
        super(Encoder, self).__init__()
        self.dropout = dropout_rate
        self.vocab_size = src_vocab_size
        self.hidden_emb = hidden_emb
        self.src_seqs_len = src_seqs_len
        self.embd = TokenEmbeding(vocab_size=src_vocab_size, hidden_dim=hidden_emb, padding_idx=padding_idx, drop_rate=dropout_rate)
        self.position_embd = PositionEmbeding(seqs_len=src_seqs_len, hidden_emb=hidden_emb)
        self.ModuleList = nn.ModuleList([EncodeLayer(hidden_emb=hidden_emb, n_head=n_head, dropout_rate=dropout_rate, hidden_fnn=hidden_fnn) for i in range(enc_layer_number)])

    def forward(self, inputs):
        assert len(inputs.shape) == 2, "inputs must have shape [batch_size, seqs_len]; when batch_size=1 the shape should be [1, seqs_len]"
        x = self.embd(inputs)  # token embedding
        x = self.position_embd(x)  # position encoding
        for layer in self.ModuleList:  # stacked attention and MLP
            x = layer(x)
        return x

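Embedding

The first step in the Encoder is an embedding that encodes the input tokens. It has two main parameters: the vocabulary size and the embedding dimension.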

class TokenEmbeding(nn.Module):
    def __init__(self, vocab_size, hidden_dim=512, padding_idx=0, drop_rate=0.1):
        super(TokenEmbeding, self).__init__()
        self.embed = nn.Embedding(embedding_dim=hidden_dim, padding_idx=padding_idx, num_embeddings=vocab_size)
        self.dropout = nn.Dropout(drop_rate)

    def forward(self, inputs):
        """
        :param inputs: token ids of shape [b, seqs_len] (integer indices, not one-hot vectors)
        :return: embeddings of shape [b, seqs_len, hidden_dim]
        """
        embed = self.embed(inputs)
        return self.dropout(embed)

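Position Embedding

$$\begin{aligned} PE_{(pos, 2i)} &= \sin\left(pos / 10000^{2i / d_{\text{model}}}\right) \\ PE_{(pos, 2i+1)} &= \cos\left(pos / 10000^{2i / d_{\text{model}}}\right) \end{aligned}$$

Here $pos$ is the position of the token in the sequence, $d_{model}$ is the dimension of the token embedding vector (the hidden_dim=512 used in the token embedding), and $i$ indexes pairs of dimensions. According to the formula, even dimensions are encoded with sin and odd dimensions with cos, and for an odd dimension $2i+1$ the exponent still uses $2i$.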

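import numpy as np


class PositionEmbeding(nn.Module):

    def __init__(self, seqs_len, hidden_emb, device=None):
        # seqs_len: maximum sentence length
        # hidden_emb: embedding dimension of each token
        # device: optional; by default the table follows the device of the inputs
        super(PositionEmbeding, self).__init__()
        self.seqs_len = seqs_len
        self.hidden = hidden_emb
        self.device = device

    def forward(self, inputs):
        # angle[pos, d] = pos / 10000^(2 * (d // 2) / hidden), so dimensions 2i and 2i+1 share one frequency
        pos_emb = np.array([[pos / np.power(10000, 2 * (d // 2) / self.hidden) for d in range(self.hidden)] for pos in range(self.seqs_len)])
        pos_emb[:, ::2] = np.sin(pos_emb[:, ::2])  # even dimensions
        pos_emb[:, 1::2] = np.cos(pos_emb[:, 1::2])  # odd dimensions
        device = self.device if self.device is not None else inputs.device
        pos_emb = torch.tensor(pos_emb, dtype=torch.float, device=device)
        return inputs + pos_emb[:inputs.size(1), :]  # [b, seqs_len, hidden] + [seqs_len, hidden]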

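I first considered merging the arrays with a for loop, but that is too slow. This part follows someone else's code and uses slicing instead, which is very convenient.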
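As a sanity check on the slicing approach, the sketch below builds the table in a fully vectorized way and compares it with entries computed one at a time from the formula. The function names `sinusoid_table` and `sinusoid_entry` are made up for this illustration and are not part of the model code.

```python
import numpy as np


def sinusoid_table(seqs_len, hidden):
    """Vectorized sinusoidal table of shape [seqs_len, hidden]."""
    pos = np.arange(seqs_len)[:, None]   # [seqs_len, 1]
    dim = np.arange(hidden)[None, :]     # [1, hidden]
    # Dimensions 2i and 2i+1 share the exponent 2i / hidden.
    angle = pos / np.power(10000.0, 2 * (dim // 2) / hidden)
    table = np.zeros((seqs_len, hidden))
    table[:, 0::2] = np.sin(angle[:, 0::2])  # even dimensions
    table[:, 1::2] = np.cos(angle[:, 1::2])  # odd dimensions
    return table


def sinusoid_entry(pos, d, hidden):
    """A single entry computed directly from the formula."""
    angle = pos / 10000 ** (2 * (d // 2) / hidden)
    return np.sin(angle) if d % 2 == 0 else np.cos(angle)


table = sinusoid_table(seqs_len=6, hidden=8)
reference = np.array([[sinusoid_entry(p, d, 8) for d in range(8)] for p in range(6)])
print(np.allclose(table, reference))
```

Note that the slices run over the last axis (`[:, 0::2]`), the embedding dimension; slicing the first axis would alternate over positions instead.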

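Multi-Head Attention

The computation here is not complicated; the most important thing is to understand the shape transformations. Attention is computed over the 512-dimensional axis, that is, over the vector of each token. Since the linear projections are straightforward, we first look at how to reshape the tensor and split it into heads.

Notation for the shapes:

$b$: batch_size (number of sentences in one batch)

seqs_len: length of a sentence

hidden_size: feature dimension of each token

n_head: number of attention heads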

    def split_head(self, inputs):
        """
        The split is applied to the features of each token, i.e. along the hidden_size dimension.
        Parameters:
            inputs: shape [b, seqs_len, hidden_size]
        Returns:
            tensor of shape [b, n_head, seqs_len, hidden_size/n_head]
        """
        inputs = inputs.transpose(1, 2)  # move hidden_size forward for the split below: [b, hidden_size, seqs_len]
        origin_shape = inputs.shape
        # hidden_size must be divisible by n_head so it can be split evenly into n heads
        assert origin_shape[1] % self.n_head == 0, "hidden dim not divisible by n_head"

        vec = inputs.reshape(origin_shape[0], self.n_head, origin_shape[1] // self.n_head, origin_shape[2])  # shape: [b, n_head, hidden_size/n_head, seqs_len]
        return vec.transpose(-1, -2)  # shape: [b, n_head, seqs_len, hidden_size/n_head]
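To make the shape bookkeeping concrete, here is a minimal standalone sketch of splitting and merging heads. It uses the more common reshape-then-transpose order, which yields the same tensor as the method above. `merge_head` is a hypothetical helper, written here as a free function rather than a method.

```python
import torch


def split_head(x, n_head):
    """[b, seqs_len, hidden_size] -> [b, n_head, seqs_len, hidden_size // n_head]."""
    b, seqs_len, hidden_size = x.shape
    assert hidden_size % n_head == 0, "hidden_size must be divisible by n_head"
    # Cut the feature axis into n_head chunks, then move the head axis forward.
    return x.reshape(b, seqs_len, n_head, hidden_size // n_head).transpose(1, 2)


def merge_head(x):
    """[b, n_head, seqs_len, d_head] -> [b, seqs_len, n_head * d_head]."""
    b, n_head, seqs_len, d_head = x.shape
    return x.transpose(1, 2).reshape(b, seqs_len, n_head * d_head)


x = torch.arange(2 * 5 * 16, dtype=torch.float).reshape(2, 5, 16)
heads = split_head(x, n_head=4)
print(heads.shape)                       # torch.Size([2, 4, 5, 4])
print(torch.equal(merge_head(heads), x))  # True: merging undoes the split
```

Head h of a token holds features h * d_head to (h + 1) * d_head - 1 of that token, so merging is just the inverse transpose and reshape.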


class MutilHeadAttention(nn.Module):

    def __init__(self, hidden_size, n_head, dropout_rate=0.1):
        """
        Computes multi-head attention on the features produced after PositionEmbeding
        and adds the residual connection between those features and the attention output.
        Init parameters:
            hidden_size: feature dimension of each token
            n_head: number of attention heads
            dropout_rate: dropout probability, 0.1
        """
        super(MutilHeadAttention, self).__init__()
        self.emb_size = hidden_size
        self.n_head = n_head
        self.w_k = nn.Linear(hidden_size, hidden_size)
        self.w_v = nn.Linear(hidden_size, hidden_size)
        self.w_q = nn.Linear(hidden_size, hidden_size)
        self.self_attention = SelfAttention(hidden_size // n_head)  # d_k is the per-head dimension
        self.dropout = nn.Dropout(dropout_rate)
        self.l1 = nn.Linear(hidden_size, hidden_size)

    # split_head (shown above) is a method of this class.

    def forward(self, query, value, mask=None):
        """
        Forward pass of multi-head attention: project first, then split into heads.

        Parameters:
            query: source of Q, also used as the residual
            value: source of K and V. For self-attention (encoder, masked decoder attention) it is the same
                   tensor as query; for cross attention query comes from the decoder and value is the encoder output
            mask: None for unmasked attention; the causal mask for the decoder's masked self-attention
        Returns:
            tensor: shape [b, seqs_len, hidden_size]
        """
        res = query  # keep the input for the residual connection
        q = self.split_head(self.w_q(query))  # project with W_q, then split into heads
        k = self.split_head(self.w_k(value))  # project with W_k, then split into heads
        v = self.split_head(self.w_v(value))  # project with W_v, then split into heads

        mutil_head_att = self.self_attention(q=q, k=k, v=v, mask=mask)  # [b, n_head, seqs_len, hidden_size/n_head]
        att = mutil_head_att.transpose(-2, -3).reshape(*res.shape)  # merge the heads back into the shape of the input query
        x = self.l1(self.dropout(att))  # dropout, then the output projection
        return x + res  # residual connection

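Self Attention

Self-attention computation; read the formula and the code together.

The main steps are:

The inputs Q, K and V all have shape [b, n_head, seqs_len, hidden_size/n_head].
For each head, $Q K^T$ has shape [seqs_len, hidden_size/n_head] x [hidden_size/n_head, seqs_len] = [seqs_len, seqs_len].
The result is divided by $\sqrt{d_{k}}$, which brings the variance of the scores back to about 1 when the components of Q and K have unit variance, and so keeps the softmax away from its saturated region (somewhat similar in spirit to Layer-Norm).
Finally, $Softmax$ is applied along dimension -1 and the result is multiplied by $V$.

$$\operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$$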
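The effect of the $\sqrt{d_{k}}$ scaling can be checked numerically. The sketch below assumes the components of q and k are independent with zero mean and unit variance (random stand-ins, not outputs of the model); under that assumption the dot product has variance close to $d_k$, and the scaled score has variance close to 1.

```python
import torch

torch.manual_seed(0)
d_k = 64
q = torch.randn(10000, d_k)  # components with zero mean and unit variance
k = torch.randn(10000, d_k)

scores = (q * k).sum(dim=-1)  # 10000 independent dot products q . k
scaled = scores / d_k ** 0.5  # the scaling used inside the attention

print(scores.var().item())  # close to d_k
print(scaled.var().item())  # close to 1
```

Without the scaling, scores with a standard deviation of about 8 would push the softmax toward one-hot outputs with very small gradients.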


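import math


class SelfAttention(nn.Module):
    def __init__(self, d_k=None, padding_idx=0, dropout_rate=0.1):
        super(SelfAttention, self).__init__()
        # d_k is accepted so that existing callers keep working; the scale is read from the shape of q in forward
        self.softmax = nn.Softmax(dim=-1)
        self.padding_idx = padding_idx
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, q, k, v, mask=None):
        """
        Parameters:
            q: [b, n_head, seqs_len, hidden_size/n_head]
            k: same layout as q
            v: same layout as q
            mask: broadcastable to [b, n_head, seqs_len, seqs_len]; positions where mask == 0 are hidden
        """
        d_k = q.size(-1)  # per-head dimension
        att = q.matmul(k.transpose(-1, -2)) / math.sqrt(d_k)
        if mask is not None:
            att = att.masked_fill(mask == 0, -1e9)  # fill the masked positions before the softmax
        # Padding positions are not masked here: comparing scores with padding_idx would not do it.
        # A padding mask has to be built from the token ids and passed in through `mask`.
        return self.dropout(self.softmax(att)).matmul(v)  # dropout on the attention weights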

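Feed Forward

The formula is:

$$\operatorname{FFN}(x)=\max \left(0, x W_{1}+b_{1}\right) W_{2}+b_{2}$$

This can be understood as an MLP: two linear maps with one nonlinearity (ReLU activation) in between. A residual connection is added on top.

Why expand the dimension first and then compress it back to the original dimension?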

class FeedForward(nn.Module):
    """
    Parameters:
        hidden_fnn: inner (expanded) dimension, 2048
        hidden_emb: input and output dimension, 512
    """
    def __init__(self, hidden_fnn, hidden_emb, dropout_rate=0.1):
        super(FeedForward, self).__init__()
        self.l1 = nn.Linear(hidden_emb, hidden_fnn, bias=True)  # expand: 512 -> 2048
        self.relu = nn.ReLU()
        self.l2 = nn.Linear(hidden_fnn, hidden_emb, bias=True)  # compress: 2048 -> 512
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, inputs):
        res_x = inputs  # keep the input for the residual connection
        x = self.relu(self.l1(inputs))
        x = self.dropout(self.l2(x))
        return res_x + x
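The paper wraps every sub-layer as LayerNorm(x + Sublayer(x)), while the code in this post keeps only the residual. Below is a minimal sketch of that Add & Norm wrapper, for readers who want to add it. `AddNorm` is a hypothetical helper that the model code here does not use, and the small `nn.Sequential` feed-forward is a stand-in.

```python
import torch
from torch import nn


class AddNorm(nn.Module):
    """Hypothetical helper: LayerNorm(x + Dropout(sublayer(x)))."""

    def __init__(self, hidden_emb, dropout_rate=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_emb)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x, sublayer):
        # Residual connection followed by layer normalization over the feature axis.
        return self.norm(x + self.dropout(sublayer(x)))


hidden_emb, hidden_fnn = 16, 64
ffn = nn.Sequential(nn.Linear(hidden_emb, hidden_fnn), nn.ReLU(), nn.Linear(hidden_fnn, hidden_emb))
add_norm = AddNorm(hidden_emb).eval()  # eval() disables dropout for the check below
x = torch.randn(2, 5, hidden_emb)
y = add_norm(x, ffn)
print(y.shape)  # torch.Size([2, 5, 16])
```

With such a wrapper, the sub-layers themselves would return only Sublayer(x) and leave the residual addition to the wrapper.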

Encoder Layer

With the modules above in place, we combine Multi-Head Attention and FeedForward into the Encoder Layer, which is then stacked to build the N=6 layer Encoder.

class EncodeLayer(nn.Module):
    """
    Stackable encoder layer made of two parts: MutilHeadAttention and FeedForward.
    """
    def __init__(self, hidden_emb, dropout_rate, hidden_fnn=2048, n_head=8):
        super(EncodeLayer, self).__init__()
        self.att = MutilHeadAttention(hidden_size=hidden_emb, n_head=n_head, dropout_rate=dropout_rate)
        self.feed_forward = FeedForward(hidden_fnn=hidden_fnn, hidden_emb=hidden_emb, dropout_rate=dropout_rate)

    def forward(self, inputs):
        att = self.att(inputs, inputs)  # self-attention: Q, K and V all come from inputs
        x = self.feed_forward(att)
        return x

Decoder

Apart from slightly different arguments, the processing flow is the same as in the Encoder. The main differences are the masked attention and the cross attention.

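# Build the causal MASK matrix.
def _mask_dec_inputs(inputs_seqs):

    b_size, seqs_len = inputs_seqs.shape

    # triu(..., diagonal=1) keeps the entries strictly above the main diagonal (the future positions);
    # 1 - triu gives a lower-triangular matrix in which 1 means "may attend". Shape: [1, seqs_len, seqs_len]
    mask = 1 - torch.triu(torch.ones(1, seqs_len, seqs_len), diagonal=1)
    return mask.to(inputs_seqs.device)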
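A small worked example shows what the mask looks like and what it does to the attention weights. This sketch uses a sequence length of 4 and all-zero scores, both chosen only for illustration.

```python
import torch

seqs_len = 4
mask = 1 - torch.triu(torch.ones(seqs_len, seqs_len), diagonal=1)
print(mask)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])

scores = torch.zeros(seqs_len, seqs_len)  # equal scores everywhere
weights = torch.softmax(scores.masked_fill(mask == 0, -1e9), dim=-1)
print(weights)  # row t spreads its weight evenly over positions 0..t
```

Row t of the mask marks the positions that token t may attend to, so each token sees itself and earlier tokens only.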


class Decoder(nn.Module):
    def __init__(self,trg_vocab_size, hidden_dim, padding_idx, dropout_rate,seqs_len,n_layer,n_head,hidden_fnn):
        super(Decoder, self).__init__()
        self.embeding = TokenEmbeding(vocab_size=trg_vocab_size, hidden_dim=hidden_dim, padding_idx=padding_idx, drop_rate=dropout_rate)
        self.position_embeding = PositionEmbeding(seqs_len=seqs_len, hidden_emb=hidden_dim)
        self.decoder_layers = nn.ModuleList(DecodeLayer(dropout_rate=dropout_rate, hidden_size=hidden_dim, n_head=n_head,hidden_fnn=hidden_fnn) for i in range(n_layer))

    def forward(self, enc_output, mask, dec_input):
        dec_embeding = self.embeding(dec_input)
        dec_input = self.position_embeding(dec_embeding)
        for dec_layer in self.decoder_layers:
            dec_input = dec_layer(enc_outputs=enc_output, dec_input=dec_input, att_mask=mask)
        return dec_input

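Decoder Layer

A Decoder Layer contains a masked attention, an attention that takes the Encoder output (usually called cross attention), and an MLP layer. Both attention blocks reuse the attention module above: the masked one is passed the MASK matrix and the cross one is passed the encoder output, so they can be used here directly.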

class DecodeLayer(nn.Module):
    """
    1. The input is encoded with the same kind of token and position embeddings as in the encoder.
    2. The stackable layer has three parts:
        1. Masked Multi-Head Attention
        2. Cross Attention (Multi-Head Attention over the encoder output)
        3. FNN / MLP module
    """
    def __init__(self, dropout_rate=0.1, hidden_size=512, n_head=8, hidden_fnn=2048):
        super(DecodeLayer, self).__init__()
        self.self_attention = MutilHeadAttention(hidden_size=hidden_size, n_head=n_head, dropout_rate=dropout_rate)
        self.cross_attention = MutilHeadAttention(hidden_size=hidden_size, n_head=n_head, dropout_rate=dropout_rate)
        self.mlp = FeedForward(dropout_rate=dropout_rate, hidden_fnn=hidden_fnn, hidden_emb=hidden_size)

    def forward(self, enc_outputs, dec_input, att_mask):
        """
        :param enc_outputs: encoder output, used as K and V in the cross attention
        :param dec_input: embedded target sequence; it is masked so that a position cannot see later ones (teacher forcing, or only the model's own predictions, can be used to compute the loss)
        :param att_mask: causal mask for the masked self-attention
        :return: tensor of shape [b, seqs_len, hidden_size]
        """
        att = self.self_attention(query=dec_input, value=dec_input, mask=att_mask)  # masked self-attention
        att = self.cross_attention(query=att, value=enc_outputs, mask=None)  # Q from the decoder, K and V from the encoder
        x = self.mlp(att)
        return x

Train

A short piece of training code to check that the model works.

import torch
from transformers import AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

src = [
        "岱宗夫如何",
        "造化钟神秀",
        "荡胸生曾云",
        "会当凌绝顶"
    ]
trg = ['齐鲁青未了。','阴阳割昏晓。','决眦入归鸟。','一览众山小。']
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
src_enc = tokenizer(src, max_length=100, padding="max_length", truncation=True, return_tensors='pt')
trg_enc = tokenizer(trg, max_length=100, truncation=True, padding='max_length', return_tensors='pt')
src_ids = src_enc['input_ids'].to(device)
trg_ids = trg_enc['input_ids'].to(device)
dec_input = trg_ids[:, :-1]  # teacher forcing: the decoder sees tokens 0..T-2
labels = trg_ids[:, 1:]      # and is trained to predict tokens 1..T-1
model = Transformer(
    hidden_dim=512,
    n_head=8,
    dropout_rate=0.1,
    pad_idx=0,
    seqs_len=100,
    src_vocab_size=tokenizer.vocab_size,
    trg_vocab_size=tokenizer.vocab_size,
    n_layer=2,
    fnn_dim=2048,
)
model.to(device)
optim = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.98), eps=1e-09)
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=0, label_smoothing=0.1)
model.train()
for step in range(1000):
    optim.zero_grad()
    predict = model(src_ids, dec_input)  # probabilities, [b, seqs_len - 1, vocab_size]
    predict_token = torch.argmax(predict, dim=-1).cpu().numpy().tolist()
    # The model already applies softmax, while CrossEntropyLoss expects unnormalized scores,
    # so pass log-probabilities: log_softmax(log p) is log p again.
    log_prob = torch.log(predict.reshape(-1, predict.size(-1)).clamp_min(1e-9))
    loss = loss_fn(log_prob, labels.reshape(-1))
    print(loss.item())
    loss.backward()
    optim.step()
    print([tokenizer.decode(ids, skip_special_tokens=True) for ids in predict_token])

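WarmUp

Borrowed from someone else's reproduction: attention-is-all-you-need-pytorch. In addition, seq2seq tasks like this one and large models such as BERT and GPT (many parameters, large amounts of data) are all optimized by train_step rather than by epoch. The main reason is that a single epoch contains so much data that a better result may appear in the middle of it, while training to the end of the epoch overfits and lowers the metrics.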

optimizer = ScheduledOptim(
    optim.Adam(transformer.parameters(), betas=(0.9, 0.98), eps=1e-09),
    opt.lr_mul, opt.d_model, opt.n_warmup_steps)
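For reference, the schedule from the paper is lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5). The sketch below expresses it with PyTorch's built-in `LambdaLR` instead of `ScheduledOptim`. The function name `noam_lr` is made up here, `lr_mul` is assumed to be a plain multiplier like the `opt.lr_mul` argument above, and the `nn.Linear` model is only a stand-in for the Transformer.

```python
import torch


def noam_lr(step, d_model=512, n_warmup_steps=4000, lr_mul=1.0):
    """lr_mul * d_model^-0.5 * min(step^-0.5, step * n_warmup_steps^-1.5)."""
    step = max(step, 1)  # avoid 0 ** -0.5 on the very first call
    return lr_mul * d_model ** -0.5 * min(step ** -0.5, step * n_warmup_steps ** -1.5)


model = torch.nn.Linear(4, 4)  # stand-in for the Transformer
# lr=1.0 so that the value returned by the lambda is the actual learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda s: noam_lr(s + 1))

for _ in range(3):
    optimizer.step()   # gradients are omitted in this sketch
    scheduler.step()   # update the learning rate once per training step
print(optimizer.param_groups[0]["lr"])
```

The learning rate grows linearly for the first `n_warmup_steps` steps, peaks there, and then decays proportionally to the inverse square root of the step number.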