Building My Little Transformer Model with Trae

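I am taking part in the Trae "Super Experience Officer" creative practice essay contest. Free download link for the Trae used in this article: www.trae.com.cn/?utm_source…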

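Preface

After reading the paper "Attention Is All You Need", I learned about the Transformer: the first sequence transduction model built entirely on attention, dropping recurrence and convolution altogether. As a complete beginner in machine translation, implementing a simple Transformer model is a must!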

Building the model

Download Trae and put the AI to the test: let's smoothly build a Transformer-style EncoderLayer example model!

I gave the AI a simple prompt:

attention is all you need

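The AI quickly connected this to the paper. I followed up with:

I want to implement some simple code based on this

A simplified version of the code appeared right away!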

import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledDotProductAttention(nn.Module):
    def __init__(self):
        super(ScaledDotProductAttention, self).__init__()

    def forward(self, query, key, value, mask=None):
        d_k = query.size(-1)
        scores = torch.matmul(query, key.transpose(-2, -1)) / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
        
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        attn_weights = F.softmax(scores, dim=-1)
        output = torch.matmul(attn_weights, value)
        return output, attn_weights

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, d_model):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)

    def combine_heads(self, x):
        batch_size, num_heads, seq_length, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_length, -1)

    def forward(self, query, key, value, mask=None):
        Q = self.split_heads(self.W_q(query))
        K = self.split_heads(self.W_k(key))
        V = self.split_heads(self.W_v(value))
        
        output, attn_weights = ScaledDotProductAttention()(Q, K, V, mask)
        output = self.combine_heads(output)
        output = self.W_o(output)
        
        return output, attn_weights

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionwiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

class EncoderLayer(nn.Module):
    def __init__(self, num_heads, d_model, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(num_heads, d_model)
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output, _ = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        
        return x

# Example usage
if __name__ == "__main__":
    num_heads = 8
    d_model = 512
    d_ff = 2048
    dropout = 0.1
    seq_length = 10
    batch_size = 32

    encoder_layer = EncoderLayer(num_heads, d_model, d_ff, dropout)
    input_tensor = torch.randn(batch_size, seq_length, d_model)
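    # NOTE: a (batch_size, 1, seq_length) mask does not broadcast against the
    # (batch_size, num_heads, seq_length, seq_length) attention scores; this is
    # the shape mismatch discussed in the debugging part of the article.
    mask = torch.ones(batch_size, 1, seq_length)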

    output = encoder_layer(input_tensor, mask)
    print("Output shape:", output.shape)

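Configuring the environment in Trae

With the simple code in hand, the next step is debugging. For setting up the environment itself, see my earlier article; I won't go into detail here. To use Trae you need to link your virtual environment, which is easy!

The environment setting is in the bottom-right corner.

Select the virtual environment you have already configured.

Once that is done, Trae uses your existing virtual environment!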
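
To confirm that the editor really picked up the intended environment, a small script like the one below can be run from Trae. This is only a sketch: the helper name `describe_interpreter` is made up for illustration, and the `sys.prefix` comparison only detects venv/virtualenv-style environments. For a conda environment, compare the printed interpreter path against the environment directory instead.

```python
import sys


def describe_interpreter():
    """Return the interpreter path and whether it looks like a venv."""
    # Inside a venv/virtualenv, sys.prefix points at the environment while
    # sys.base_prefix points at the base installation.
    in_venv = sys.prefix != sys.base_prefix
    return sys.executable, in_venv


if __name__ == "__main__":
    path, in_venv = describe_interpreter()
    print("Interpreter:", path)
    print("Inside a venv:", in_venv)
```

If the printed path is not inside the environment you configured, the interpreter selection in the bottom-right corner is the first thing to recheck.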

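Debugging

Create a file, paste the code in, and run it.

An error! Don't panic, just ask the AI for help!

The error message points to mismatched inputs: in the multi-head attention implementation, the mask has shape (batch_size, 1, seq_length), which cannot be broadcast against attention scores of shape (batch_size, num_heads, seq_length, seq_length). The AI quickly modified the code and solved the problem.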
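
The article does not show the AI's exact patch, so the following is only one plausible way to resolve this kind of mismatch: reshape the padding mask so that it broadcasts over the head and query dimensions. The helper name `expand_padding_mask` and the choice to treat the last two tokens as padding are assumptions made for illustration.

```python
import torch


def expand_padding_mask(mask: torch.Tensor) -> torch.Tensor:
    """Add singleton dims so the mask broadcasts against 4-D attention scores.

    (batch, seq_len)        -> (batch, 1, 1, seq_len)
    (batch, 1, seq_len)     -> (batch, 1, 1, seq_len)
    (batch, q_len, seq_len) -> (batch, 1, q_len, seq_len)
    """
    if mask.dim() == 2:
        return mask[:, None, None, :]
    if mask.dim() == 3:
        return mask.unsqueeze(1)
    return mask  # assumed to be 4-D already


batch_size, num_heads, seq_length = 32, 8, 10
scores = torch.randn(batch_size, num_heads, seq_length, seq_length)

mask = torch.ones(batch_size, 1, seq_length)
mask[:, :, -2:] = 0  # pretend the last two tokens are padding

expanded = expand_padding_mask(mask)
masked_scores = scores.masked_fill(expanded == 0, -1e9)
weights = torch.softmax(masked_scores, dim=-1)
print(expanded.shape, weights.shape)
```

In the listing above, a call like this would sit at the top of `MultiHeadAttention.forward`, before the mask is handed to `ScaledDotProductAttention`.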

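Result

After the repeated debugging above, the code finally ran and produced a result.

Analysis

The AI really did produce a simple Transformer model quickly, but do the internal functions actually match the Transformer? Let's have the AI analyze the design of each one.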

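The ScaledDotProductAttention class
- forward method: implements scaled dot-product attention.
  - Compute d_k, the dimension of the query vectors.
  - Compute the dot product of the query and key vectors and divide it by sqrt(d_k) to scale it.
  - If a mask is provided, set the scores at positions where the mask is 0 to a very large negative value (-1e9 in the code, standing in for negative infinity), so that their weights become effectively 0 after softmax.
  - Apply softmax to the scores to obtain the attention weights.
  - Multiply the attention weights by the value vectors to obtain the output.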
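
The masking step above can be checked on a tiny input. The sketch below restates the same computation as a plain function (the `fill` parameter is added here purely for illustration) and also shows why a large finite value such as -1e9 is a common choice: if every key is masked, a true negative infinity makes softmax return NaN, while -1e9 falls back to uniform weights.

```python
import math

import torch
import torch.nn.functional as F


def scaled_dot_product_attention(query, key, value, mask=None, fill=-1e9):
    """softmax(Q K^T / sqrt(d_k)) V, with masked positions set to `fill`."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, fill)
    weights = F.softmax(scores, dim=-1)
    return weights @ value, weights


torch.manual_seed(0)
q = k = v = torch.randn(1, 3, 4)      # (batch, seq_len, d_k)
mask = torch.tensor([[[1, 1, 0]]])    # (batch, 1, seq_len): hide the last key

_, weights = scaled_dot_product_attention(q, k, v, mask)
print(weights)  # the last column is zero; each row still sums to 1

# Edge case: every key is masked.
all_masked = torch.zeros(1, 1, 3)
_, w_finite = scaled_dot_product_attention(q, k, v, all_masked)
_, w_inf = scaled_dot_product_attention(q, k, v, all_masked, fill=float("-inf"))
print(w_finite)  # uniform weights
print(w_inf)     # NaN everywhere
```
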
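The MultiHeadAttention class
- __init__ method: initializes the class and asserts that d_model is divisible by num_heads. It defines four linear layers, W_q, W_k, W_v and W_o, used to produce the query, key and value vectors and the final output.
- split_heads method: splits the input tensor into multiple heads.
- combine_heads method: merges the outputs of the heads back into a single tensor.
- forward method:
  - Produce the query, key and value vectors through the linear layers and split them into heads.
  - If a mask is provided, expand its dimensions to match the multi-head attention shape (this step belongs to the corrected version of the code).
  - Call the ScaledDotProductAttention class to compute the attention output.
  - Merge the multi-head outputs and pass them through the linear layer W_o to get the final output.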
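
The head split and merge described above are pure reshapes, which a short check makes concrete. The functions below mirror `split_heads` and `combine_heads` from the listing as free functions (taking `num_heads` as an argument is a change made here so the snippet stands alone); the tensor sizes are arbitrary.

```python
import torch


def split_heads(x, num_heads):
    """(batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)."""
    batch_size, seq_length, d_model = x.shape
    d_k = d_model // num_heads
    return x.view(batch_size, seq_length, num_heads, d_k).transpose(1, 2)


def combine_heads(x):
    """(batch, num_heads, seq_len, d_k) -> (batch, seq_len, num_heads * d_k)."""
    batch_size, num_heads, seq_length, d_k = x.shape
    return x.transpose(1, 2).contiguous().view(batch_size, seq_length, num_heads * d_k)


x = torch.randn(2, 5, 16)
heads = split_heads(x, num_heads=4)
restored = combine_heads(heads)

print(heads.shape)               # (batch, num_heads, seq_len, d_k)
print(torch.equal(restored, x))  # combine_heads undoes split_heads
# Head h simply holds feature slice [h * d_k, (h + 1) * d_k) of the input.
print(torch.equal(heads[:, 1], x[..., 4:8]))
```
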
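The PositionwiseFeedForward class
- __init__ method: initializes the class and defines two linear layers, fc1 and fc2, plus a ReLU activation.
- forward method: passes the input through fc1, applies ReLU, then passes it through fc2 to get the output.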
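
"Position-wise" means the same two-layer network is applied to every position independently. A minimal sketch of that property, using an `nn.Sequential` stand-in with small, arbitrarily chosen sizes rather than the class from the listing:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 32

# Stand-in for PositionwiseFeedForward: Linear -> ReLU -> Linear.
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 5, d_model)  # (batch, seq_len, d_model)
with torch.no_grad():
    full = ffn(x)                   # applied to the whole sequence at once
    one_position = ffn(x[:, 3, :])  # applied to position 3 on its own

# Position 3 of the full output matches the standalone result, so no
# information flows between positions inside the feed-forward block.
print(torch.allclose(full[:, 3, :], one_position, atol=1e-5))
```
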
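The EncoderLayer class
- __init__ method: initializes the class and defines the multi-head attention layer, the position-wise feed-forward layer, two layer normalization layers and one Dropout layer.
- forward method:
  - Call the multi-head attention layer to compute self-attention and get the attention output.
  - Add the attention output (after dropout) to the input and pass the sum through the layer normalization layer norm1.
  - Call the position-wise feed-forward layer to get the feed-forward output.
  - Add the feed-forward output (after dropout) to the previous step's output and pass the sum through the layer normalization layer norm2.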
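
Both halves of the encoder layer follow the same "add, then normalize" pattern, which matches the arrangement in the original paper (normalization after the residual sum). The sketch below isolates that pattern; the `AddAndNorm` class name and the `nn.Linear` stand-in for a sublayer are assumptions for illustration and are not part of the generated code.

```python
import torch
import torch.nn as nn


class AddAndNorm(nn.Module):
    """Residual connection followed by layer normalization (post-norm)."""

    def __init__(self, d_model, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer_output):
        return self.norm(x + self.dropout(sublayer_output))


torch.manual_seed(0)
d_model = 16
block = AddAndNorm(d_model, dropout=0.1).eval()  # eval() disables dropout
sublayer = nn.Linear(d_model, d_model)           # stand-in for attention or the FFN

x = torch.randn(4, 10, d_model)
with torch.no_grad():
    y = block(x, sublayer(x))

print(y.shape)                        # same shape as the input
print(y.mean(dim=-1).abs().max())     # close to 0: each position is re-centered
print(y.var(dim=-1, unbiased=False))  # close to 1 at every position
```

Because the output shape equals the input shape, layers built this way can be stacked one after another.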

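Conclusion

With Trae I quickly built a simple Transformer model (a single encoder layer, to be precise), and the efficiency was impressive!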