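Transformers and Large Language Models: From Self-Attention to ChatGPT

In the previous sections we covered classical machine learning methods, reinforcement learning, and a range of neural network architectures. Today we look closely at the technology behind most current AI systems: the Transformer architecture and large language models, including self-attention, BERT, and GPT.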

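Transformer Architecture Overview

The Transformer architecture was introduced by Vaswani et al. in 2017 and reshaped natural language processing. It relies entirely on attention mechanisms and drops the recurrent and convolutional structures used by earlier sequence models.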

graph TD
    A[Transformer] --> B[Self-attention]
    A --> C[Encoder-decoder]
    A --> D[Positional encoding]
    B --> E[Query]
    B --> F[Key]
    B --> G[Value]
    C --> H[Encoder]
    C --> I[Decoder]
    H --> J[Multi-head attention]
    H --> K[Feed-forward network]
    I --> L[Masked attention]
    I --> M[Encoder-decoder attention]

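Self-Attention in Detail

Self-attention is the core of the Transformer: when computing the representation of each position, the model can weigh every other position in the sequence.

How Attention Works

Scaled dot-product attention is computed as: Attention(Q, K, V) = softmax(QK^T / √d_k) V

Here Q is the query matrix, K is the key matrix, V is the value matrix, and d_k is the dimension of the key vectors. Dividing by √d_k keeps the dot products from growing with the dimension, which would otherwise push the softmax into a saturated region with very small gradients.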
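To get a feel for the √d_k factor in the formula, the following minimal sketch compares softmax weights with and without scaling. The dimension (256), the number of keys (10), and the random inputs are illustrative assumptions, not values from this course.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 256                              # assumed key dimension for the demo
q = rng.standard_normal(d_k)           # one query vector
K = rng.standard_normal((10, d_k))     # ten key vectors

raw_scores = K @ q                     # spread grows roughly like sqrt(d_k)
scaled_scores = raw_scores / np.sqrt(d_k)

w_raw = softmax(raw_scores)
w_scaled = softmax(scaled_scores)

print("score std without scaling:", raw_scores.std())
print("score std with scaling:   ", scaled_scores.std())
print("largest weight without scaling:", w_raw.max())
print("largest weight with scaling:   ", w_scaled.max())
```

Without scaling, the weights tend to concentrate on a single key; with scaling, they are spread more evenly, which keeps gradients through the softmax usable.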

import numpy as np
import matplotlib.pyplot as plt

# A simple self-attention implementation
class SimpleSelfAttention:
    """Simplified single-head self-attention for a (seq_len, d_model) input."""

    def __init__(self, d_model=64):
        self.d_model = d_model
        # Randomly initialized projection matrices (untrained)
        self.W_q = np.random.randn(d_model, d_model) * 0.1
        self.W_k = np.random.randn(d_model, d_model) * 0.1
        self.W_v = np.random.randn(d_model, d_model) * 0.1

    def softmax(self, x):
        """Numerically stable softmax over the last axis."""
        exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

    def scaled_dot_product_attention(self, Q, K, V):
        """Scaled dot-product attention for 2-D Q, K, V."""
        # Attention scores, scaled by sqrt(d_k)
        d_k = Q.shape[-1]
        scores = np.dot(Q, K.T) / np.sqrt(d_k)

        # Each row becomes a probability distribution over key positions
        attention_weights = self.softmax(scores)

        # Weighted sum of the value vectors
        output = np.dot(attention_weights, V)

        return output, attention_weights

    def forward(self, X):
        """Forward pass."""
        # Linear projections produce Q, K, V
        Q = np.dot(X, self.W_q)
        K = np.dot(X, self.W_k)
        V = np.dot(X, self.W_v)

        # Compute attention
        output, attention_weights = self.scaled_dot_product_attention(Q, K, V)

        return output, attention_weights

# Visualize the attention mechanism
def visualize_attention():
    """Plot the attention weights and compare input and output."""
    # Create an example sequence
    seq_length = 5
    d_model = 8
    np.random.seed(42)
    X = np.random.randn(seq_length, d_model)

    # Create the attention module
    attention = SimpleSelfAttention(d_model=d_model)

    # Compute attention
    output, attention_weights = attention.forward(X)

    print("Input sequence shape:", X.shape)
    print("Attention weights shape:", attention_weights.shape)
    print("Output sequence shape:", output.shape)

    # Plot the attention weights
    plt.figure(figsize=(12, 5))

    plt.subplot(1, 2, 1)
    plt.imshow(attention_weights, cmap='viridis', aspect='auto')
    plt.xlabel('Key position')
    plt.ylabel('Query position')
    plt.title('Attention weight matrix')
    plt.colorbar(label='Attention weight')

    # Annotate each cell with its value
    for i in range(seq_length):
        for j in range(seq_length):
            text_color = 'white' if attention_weights[i, j] < 0.5 else 'black'
            plt.text(j, i, f'{attention_weights[i, j]:.2f}',
                     ha='center', va='center', color=text_color)

    plt.subplot(1, 2, 2)
    # Compare the first feature dimension of input and output
    plt.plot(range(seq_length), X[:, 0], 'bo-', label='Input (dimension 0)')
    plt.plot(range(seq_length), output[:, 0], 'ro-', label='Output (dimension 0)')
    plt.xlabel('Sequence position')
    plt.ylabel('Value')
    plt.title('Input vs. output sequence')
    plt.legend()
    plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Report which position each of the first few queries attends to most
    print("\nAttention weight analysis:")
    for i in range(min(3, seq_length)):
        max_attention_idx = np.argmax(attention_weights[i])
        print(f"Position {i} attends most to position {max_attention_idx} (weight: {attention_weights[i, max_attention_idx]:.3f})")

visualize_attention()

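Multi-Head Attention

Multi-head attention lets the model attend to information in several representation subspaces in parallel, which increases its expressive power.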

# Multi-head attention implementation
class MultiHeadAttention:
    """Multi-head self-attention for inputs of shape (batch, seq_len, d_model)."""

    def __init__(self, d_model=64, num_heads=8):
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # One projection matrix per head, plus the output projection
        self.W_q = np.random.randn(num_heads, d_model, self.d_k) * 0.1
        self.W_k = np.random.randn(num_heads, d_model, self.d_k) * 0.1
        self.W_v = np.random.randn(num_heads, d_model, self.d_k) * 0.1
        self.W_o = np.random.randn(num_heads * self.d_k, d_model) * 0.1

    def scaled_dot_product_attention(self, Q, K, V):
        """Scaled dot-product attention for batched Q, K, V of shape (batch, seq_len, d_k)."""
        d_k = Q.shape[-1]
        # Swap only the last two axes so the batch axis is preserved;
        # K.T would reverse all three axes and np.dot would mix batches.
        scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)

        # Softmax over key positions
        exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
        attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

        # Batched weighted sum of the values: (batch, seq_len, d_k)
        output = attention_weights @ V

        return output, attention_weights

    def forward(self, X):
        """Forward pass."""
        batch_size, seq_length, d_model = X.shape
        heads_output = []
        all_attention_weights = []

        # Compute attention separately for each head
        for i in range(self.num_heads):
            Q = np.dot(X, self.W_q[i])
            K = np.dot(X, self.W_k[i])
            V = np.dot(X, self.W_v[i])

            head_output, attention_weights = self.scaled_dot_product_attention(Q, K, V)
            heads_output.append(head_output)
            all_attention_weights.append(attention_weights)

        # Concatenate the head outputs along the feature axis
        multi_head_output = np.concatenate(heads_output, axis=-1)

        # Output projection back to d_model
        final_output = np.dot(multi_head_output, self.W_o)

        return final_output, all_attention_weights

# Multi-head attention demo
def multihead_attention_demo():
    """Run multi-head attention on random data and plot the weights."""
    # Create example data
    batch_size = 2
    seq_length = 4
    d_model = 16
    num_heads = 4

    np.random.seed(42)
    X = np.random.randn(batch_size, seq_length, d_model)

    # Create the multi-head attention module
    mha = MultiHeadAttention(d_model=d_model, num_heads=num_heads)

    # Compute attention
    output, attention_weights = mha.forward(X)

    print("Multi-head attention demo:")
    print(f"Input shape: {X.shape}")
    print(f"Output shape: {output.shape}")
    print(f"Number of attention heads: {len(attention_weights)}")
    print(f"Attention weights shape per head: {attention_weights[0].shape}")

    # Plot the attention weights of the first sample for up to four heads
    plt.figure(figsize=(10, 8))

    for i in range(min(4, num_heads)):
        plt.subplot(2, 2, i+1)
        plt.imshow(attention_weights[i][0], cmap='viridis', aspect='auto')
        plt.xlabel('Key position')
        plt.ylabel('Query position')
        plt.title(f'Head {i+1} attention weights')
        plt.colorbar(label='Weight')

    plt.tight_layout()
    plt.show()

multihead_attention_demo()
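
The per-head Python loop in the class above is easy to read, but the same computation can be done for all heads at once. The following is a minimal sketch using `np.einsum`; the function name `multi_head_attention` and the weight shapes are assumptions chosen to mirror the class above, not a reference implementation.

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """Batched multi-head self-attention without a Python loop over heads.

    X: (batch, seq, d_model); W_q, W_k, W_v: (heads, d_model, d_k);
    W_o: (heads * d_k, d_model).
    """
    d_k = W_q.shape[-1]
    # Project to per-head queries, keys, values: (batch, heads, seq, d_k)
    Q = np.einsum('bsd,hdk->bhsk', X, W_q)
    K = np.einsum('bsd,hdk->bhsk', X, W_k)
    V = np.einsum('bsd,hdk->bhsk', X, W_v)

    # Scaled scores and softmax over key positions: (batch, heads, seq, seq)
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    heads = weights @ V                                # (batch, heads, seq, d_k)
    b, h, s, k = heads.shape
    # Move heads next to d_k, then merge them: (batch, seq, heads * d_k)
    concat = heads.transpose(0, 2, 1, 3).reshape(b, s, h * k)
    return concat @ W_o, weights

rng = np.random.default_rng(42)
batch, seq, d_model, num_heads = 2, 4, 16, 4
d_k = d_model // num_heads
X = rng.standard_normal((batch, seq, d_model))
W_q, W_k, W_v = (rng.standard_normal((num_heads, d_model, d_k)) * 0.1 for _ in range(3))
W_o = rng.standard_normal((num_heads * d_k, d_model)) * 0.1

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(out.shape, weights.shape)
```

The transpose before the reshape matters: it places the outputs of head 0 first, then head 1, and so on, which is the same layout that concatenating per-head outputs along the last axis produces.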

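Transformer Encoder Structure

The Transformer encoder is a stack of layers. Each layer contains a multi-head attention sublayer and a feed-forward network, and each sublayer is wrapped in a residual connection followed by layer normalization.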

# A simplified Transformer encoder layer
class TransformerEncoderLayer:
    """Transformer encoder layer (post-LayerNorm, forward pass only)."""

    def __init__(self, d_model=64, num_heads=8, d_ff=256, dropout=0.1):
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_ff = d_ff
        # Stored for completeness; dropout is not applied in this forward-only sketch
        self.dropout = dropout

        # Multi-head attention
        self.multihead_attention = MultiHeadAttention(d_model, num_heads)

        # Position-wise feed-forward network
        self.W1 = np.random.randn(d_model, d_ff) * 0.1
        self.b1 = np.zeros(d_ff)
        self.W2 = np.random.randn(d_ff, d_model) * 0.1
        self.b2 = np.zeros(d_model)

        # Layer normalization parameters
        self.ln1_gamma = np.ones(d_model)
        self.ln1_beta = np.zeros(d_model)
        self.ln2_gamma = np.ones(d_model)
        self.ln2_beta = np.zeros(d_model)

    def layer_norm(self, x, gamma, beta, eps=1e-5):
        """Layer normalization over the feature axis."""
        mean = np.mean(x, axis=-1, keepdims=True)
        var = np.var(x, axis=-1, keepdims=True)
        # eps goes inside the square root, as in the usual LayerNorm definition
        return gamma * (x - mean) / np.sqrt(var + eps) + beta

    def feed_forward(self, x):
        """Position-wise feed-forward network."""
        # First linear layer
        hidden = np.dot(x, self.W1) + self.b1
        # ReLU activation
        hidden = np.maximum(0, hidden)
        # Second linear layer
        output = np.dot(hidden, self.W2) + self.b2
        return output

    def forward(self, x):
        """Forward pass."""
        # Multi-head attention + residual connection + LayerNorm
        attn_output, _ = self.multihead_attention.forward(x)
        x = x + attn_output  # residual connection
        x = self.layer_norm(x, self.ln1_gamma, self.ln1_beta)  # LayerNorm

        # Feed-forward network + residual connection + LayerNorm
        ff_output = self.feed_forward(x)
        x = x + ff_output  # residual connection
        x = self.layer_norm(x, self.ln2_gamma, self.ln2_beta)  # LayerNorm

        return x

# Transformer encoder demo
def transformer_encoder_demo():
    """Run one encoder layer on random data and plot input and output."""
    # Create example data
    batch_size = 1
    seq_length = 5
    d_model = 16

    np.random.seed(42)
    X = np.random.randn(batch_size, seq_length, d_model)

    # Create the encoder layer
    encoder_layer = TransformerEncoderLayer(d_model=d_model, num_heads=4)

    # Forward pass
    output = encoder_layer.forward(X)

    print("Transformer encoder demo:")
    print(f"Input shape: {X.shape}")
    print(f"Output shape: {output.shape}")

    # Plot input and output side by side
    plt.figure(figsize=(12, 5))

    plt.subplot(1, 2, 1)
    plt.imshow(X[0].T, cmap='viridis', aspect='auto')
    plt.xlabel('Sequence position')
    plt.ylabel('Feature dimension')
    plt.title('Input sequence')
    plt.colorbar(label='Value')

    plt.subplot(1, 2, 2)
    plt.imshow(output[0].T, cmap='viridis', aspect='auto')
    plt.xlabel('Sequence position')
    plt.ylabel('Feature dimension')
    plt.title('Encoder output')
    plt.colorbar(label='Value')

    plt.tight_layout()
    plt.show()

transformer_encoder_demo()

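BERT and GPT

BERT and GPT are two influential Transformer-based language models. BERT uses an encoder with bidirectional attention, while GPT uses a decoder with left-to-right (causal) attention.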
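The difference between bidirectional and unidirectional modeling comes down to which positions each token may attend to. The sketch below applies a causal mask to a random score matrix; the helper name `attention_weights` and the 4x4 random scores are illustrative assumptions, not code from BERT or GPT.

```python
import numpy as np

def attention_weights(scores, causal=False):
    """Softmax over keys; with causal=True, position i cannot see positions j > i."""
    scores = scores.astype(float)  # astype returns a copy, so the input is not modified
    if causal:
        seq_len = scores.shape[-1]
        # True strictly above the diagonal, i.e. for "future" key positions
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores[future] = -np.inf
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 4))

bidirectional = attention_weights(scores)          # BERT-style: every token sees every token
causal = attention_weights(scores, causal=True)    # GPT-style: lower-triangular weights

print(np.round(bidirectional, 2))
print(np.round(causal, 2))
```

In the causal matrix every entry above the diagonal is zero, and the first token can only attend to itself. This is what lets a GPT-style model be trained to predict the next token without seeing it.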

# BERT vs. GPT comparison
def bert_gpt_comparison():
    """Render a comparison table of BERT and GPT."""

    comparison = {
        'Feature': ['Architecture', 'Attention', 'Pre-training objective', 'Input processing', 'Typical use'],
        'BERT': ['Bidirectional encoder', 'Bidirectional self-attention', 'Masked language modeling', 'Full sentence as input', 'Understanding tasks (NLU)'],
        'GPT': ['Unidirectional decoder', 'Causal attention', 'Autoregressive language modeling', 'Left-to-right generation', 'Generation tasks (NLG)']
    }

    fig, ax = plt.subplots(figsize=(12, 6))

    # Hide the axes
    ax.axis('tight')
    ax.axis('off')

    # Build the table rows
    table_data = []
    for i in range(len(comparison['Feature'])):
        table_data.append([
            comparison['Feature'][i],
            comparison['BERT'][i],
            comparison['GPT'][i]
        ])

    table = ax.table(cellText=table_data,
                     colLabels=['Feature', 'BERT', 'GPT'],
                     cellLoc='center',
                     loc='center')

    table.auto_set_font_size(False)
    table.set_fontsize(12)
    table.scale(1.2, 1.5)

    # Style the header row
    for i in range(3):
        table[(0, i)].set_facecolor('#4CAF50')
        table[(0, i)].set_text_props(weight='bold', color='white')

    plt.title('BERT vs. GPT', fontsize=16, pad=20)
    plt.tight_layout()
    plt.show()

bert_gpt_comparison()

# Timeline of large language models
def llm_timeline():
    """Plot a timeline of selected large language models."""
    models = {
        '2017': ['Transformer'],
        '2018': ['BERT', 'GPT-1'],
        '2019': ['GPT-2', 'XLNet'],
        '2020': ['GPT-3', 'T5'],
        '2021': ['Codex', 'Gopher'],
        '2022': ['ChatGPT', 'PaLM'],
        '2023': ['GPT-4', 'Claude', 'Tongyi Qianwen']
    }

    plt.figure(figsize=(14, 8))

    years = list(models.keys())
    year_nums = range(len(years))

    plt.hlines(1, 0, len(years)-1, alpha=0.3, linewidth=2)
    plt.scatter(year_nums, [1]*len(year_nums), s=150, color='red', zorder=5)

    # Alternate labels above and below the line to avoid overlap
    for i, (year, model_list) in enumerate(models.items()):
        models_text = '\n'.join(model_list)
        plt.annotate(f"{year}\n{models_text}", (i, 1),
                    xytext=(0, 40 if i % 2 == 0 else -60),
                    textcoords='offset points',
                    ha='center', va='bottom' if i % 2 == 0 else 'top',
                    bbox=dict(boxstyle='round,pad=0.5', fc='lightblue', alpha=0.8),
                    arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'),
                    fontsize=9)

    plt.xlim(-0.5, len(years)-0.5)
    plt.ylim(0.3, 1.7)
    plt.yticks([])
    plt.xlabel('Year', fontsize=12)
    plt.title('Timeline of large language models', fontsize=16, pad=20)
    plt.tight_layout()
    plt.show()

llm_timeline()

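Positional Encoding

Because the Transformer has no recurrence or convolution, self-attention by itself is insensitive to token order. Positional encodings are added to the input embeddings to give the model information about sequence order.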
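To see why order information has to be injected, the following sketch checks that plain self-attention is permutation-equivariant: shuffling the input tokens only shuffles the outputs in the same way. The function `self_attention` and the random weights are assumptions made for this illustration.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention on a (seq_len, d_model) input, with no positional information."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
X = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))

perm = rng.permutation(seq_len)
out = self_attention(X, W_q, W_k, W_v)
out_shuffled = self_attention(X[perm], W_q, W_k, W_v)

# If attention ignores order, the shuffled output is just the original output, shuffled.
print(np.allclose(out[perm], out_shuffled))
```

The check should print `True`: without positional encodings, the model cannot tell two orderings of the same tokens apart.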

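# Sinusoidal positional encoding
class PositionalEncoding:
    """Sinusoidal positional encoding (assumes an even d_model)."""

    def __init__(self, d_model, max_len=5000):
        self.d_model = d_model
        self.max_len = max_len

        # Positional encoding matrix of shape (max_len, d_model)
        pe = np.zeros((max_len, d_model))
        position = np.arange(0, max_len, dtype=np.float32).reshape(-1, 1)

        # Frequencies 1 / 10000^(2i / d_model), one per sin/cos pair
        div_term = np.exp(np.arange(0, d_model, 2, dtype=np.float32) *
                         -(np.log(10000.0) / d_model))

        # Fill in the encoding
        pe[:, 0::2] = np.sin(position * div_term)  # even feature dimensions
        pe[:, 1::2] = np.cos(position * div_term)  # odd feature dimensions

        self.pe = pe

    def forward(self, x):
        """Add positional encodings to x of shape (batch, seq_len, d_model)."""
        seq_len = x.shape[1]
        return x + self.pe[:seq_len, :]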

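# Visualize the positional encoding
def visualize_positional_encoding():
    """Plot the positional encoding matrix and a few of its dimensions."""
    d_model = 128
    max_len = 100

    pos_encoding = PositionalEncoding(d_model, max_len)
    pe_matrix = pos_encoding.pe[:50, :]  # show only the first 50 positions

    plt.figure(figsize=(14, 6))

    plt.subplot(1, 2, 1)
    plt.imshow(pe_matrix, cmap='RdBu', aspect='auto')
    plt.xlabel('Feature dimension')
    plt.ylabel('Sequence position')
    plt.title('Positional encoding matrix')
    plt.colorbar(label='Encoding value')

    plt.subplot(1, 2, 2)
    # Plot the encoding along a few feature dimensions
    dimensions = [0, 1, 10, 20, 50]
    for dim in dimensions:
        if dim < d_model:
            plt.plot(pos_encoding.pe[:50, dim], label=f'Dimension {dim}')

    plt.xlabel('Sequence position')
    plt.ylabel('Encoding value')
    plt.title('Positional encoding by dimension')
    plt.legend()
    plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    print("Properties of the positional encoding:")
    print("1. Built from sine and cosine functions at different frequencies, so each position gets a distinct encoding")
    print("2. Defined for any position, so it may extend to sequences longer than those seen in training")
    print("3. Gives the model information about token order")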

visualize_positional_encoding()

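Applications of Modern Large Language Models

# Application areas of large language models
def llm_applications():
    """Show application areas of large language models as labeled panels."""

    applications = {
        'Text generation': ['Article writing', 'Story writing', 'Poetry generation'],
        'Question answering': ['Customer support', 'Knowledge Q&A', 'Tutoring'],
        'Translation': ['Multilingual translation', 'Real-time translation', 'Document translation'],
        'Code generation': ['Code completion', 'Code explanation', 'Debugging help'],
        'Creative assistance': ['Brainstorming', 'Solution design', 'Content planning']
    }

    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.flatten()

    colors = plt.cm.Set3(np.linspace(0, 1, len(applications)))

    for i, (category, examples) in enumerate(applications.items()):
        if i < len(axes):
            ax = axes[i]
            y_pos = np.arange(len(examples))
            # All bars have length 1: the chart is a visual list, not measured data
            ax.barh(y_pos, [1]*len(examples), color=colors[i])
            ax.set_yticks(y_pos)
            ax.set_yticklabels(examples)
            ax.set_title(category, fontsize=12, fontweight='bold')
            ax.set_xlabel('Adoption (illustrative)')
            ax.set_xlim(0, 1.2)
            ax.grid(True, alpha=0.3)

    # Hide unused subplots
    for i in range(len(applications), len(axes)):
        axes[i].set_visible(False)

    plt.suptitle('Application areas of large language models', fontsize=16)
    plt.tight_layout()
    plt.show()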

llm_applications()

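# A scripted ChatGPT-style conversation (hard-coded text; no model is called)
def chatgpt_demo():
    """Print a scripted ChatGPT-style conversation."""
    print("Example ChatGPT conversation:")
    print("=" * 50)

    conversation = [
        ("User", "Hi, can you help me write a Python function that computes the Fibonacci sequence?"),
        ("ChatGPT", "Sure! Here is a Python function that computes the Fibonacci sequence:\n\n```python\ndef fibonacci(n):\n    if n <= 0:\n        return []\n    elif n == 1:\n        return [0]\n    elif n == 2:\n        return [0, 1]\n    \n    fib_seq = [0, 1]\n    for i in range(2, n):\n        fib_seq.append(fib_seq[i-1] + fib_seq[i-2])\n    \n    return fib_seq\n\n# Usage example\nprint(fibonacci(10))  # Output: [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]\n```"),
        ("User", "What is the time complexity of this function?"),
        ("ChatGPT", "The time complexity of this function is O(n), and the space complexity is also O(n).\n\nAnalysis:\n- Time: the loop runs from 2 to n-1, i.e. n-2 iterations, so the time complexity is O(n)\n- Space: we build a list of length n to hold the result, so the space complexity is O(n)\n\nIf you only need the n-th Fibonacci number rather than the whole sequence, the space can be reduced to O(1).")
    ]

    for speaker, message in conversation:
        print(f"\n{speaker}:")
        print(message)
        print("-" * 30)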

chatgpt_demo()

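Summary

Today we worked through the core techniques behind the Transformer architecture and large language models:

Self-attention
- Understood the math behind the attention mechanism
- Implemented simple self-attention and multi-head attention
Transformer architecture
- Studied the encoder-decoder structure and implemented one encoder layer
- Saw why positional encoding matters
Large language models
- Compared BERT and GPT
- Reviewed the development timeline of modern large language models
Practical applications
- Surveyed application areas for large language models
- Walked through a scripted ChatGPT-style conversation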

graph TD
    A[Transformers and LLMs] --> B[Self-attention]
    A --> C[Transformer architecture]
    A --> D[Large language models]
    B --> E[Query-Key-Value]
    B --> F[Multi-head attention]
    C --> G[Encoder]
    C --> H[Decoder]
    C --> I[Positional encoding]
    D --> J[BERT]
    D --> K[GPT]
    D --> L[ChatGPT]

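Exercises

- Run all the code examples in this section and make sure you understand how the Transformer works
- Modify the self-attention implementation and try different scaling factors
- Implement a simple text classifier on top of a Transformer encoder
- Read up on BERT variants such as RoBERTa and ALBERT

Coming Up Next

In the next section we move on to advanced deep reinforcement learning, including improvements to DQN, policy gradient methods, and PPO, which are key techniques in modern reinforcement learning. Stay tuned!

If you have questions, leave them in the discussion area; we reply regularly.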