Natural Language Processing (NLP) Applications


1. Text Classification with RNN/LSTM

1.1 Recurrent Neural Network Basics

1.1.1 RNN Mathematical Principles

For the input $x_t$ at time step $t$, the hidden state is updated as: $h_t = \sigma(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$
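
As a quick illustration, a single step of this recurrence can be computed with torch.nn.RNNCell (a minimal sketch; the batch size, input size, and hidden size are arbitrary assumptions, and RNNCell applies a tanh nonlinearity by default):

import torch
import torch.nn as nn

# One recurrence step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
rnn_cell = nn.RNNCell(input_size=300, hidden_size=128)
x_t = torch.randn(32, 300)       # batch of 32 inputs at time step t
h_prev = torch.zeros(32, 128)    # previous hidden state h_{t-1}
h_t = rnn_cell(x_t, h_prev)      # updated hidden state, shape (32, 128)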

1.1.2 LSTM Gating Mechanism

Forget gate: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
Input gate: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
Candidate cell state: $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
New cell state: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
Output gate: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
Hidden state: $h_t = o_t \odot \tanh(C_t)$
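
A single torch.nn.LSTMCell step computes exactly these gates internally; a minimal sketch (the shapes are illustrative assumptions):

import torch
import torch.nn as nn

# One LSTM step: the forget/input/output gates and the cell state update happen inside the cell
lstm_cell = nn.LSTMCell(input_size=300, hidden_size=128)
x_t = torch.randn(32, 300)                     # input at time step t
h_prev, c_prev = torch.zeros(32, 128), torch.zeros(32, 128)
h_t, c_t = lstm_cell(x_t, (h_prev, c_prev))    # new hidden and cell states

The TextLSTM model below uses the higher-level nn.LSTM, which runs this cell update over an entire sequence: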

import torch
import torch.nn as nn

class TextLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)
    
    def forward(self, x):
        # x shape: (batch_size, seq_length)
        embedded = self.embedding(x)  # (batch, seq, embed_dim)
        output, (hn, cn) = self.lstm(embedded)
        # Take the output of the last time step
        return self.fc(output[:, -1, :])

# Example hyperparameters
model = TextLSTM(
    vocab_size=10000,
    embed_dim=300,
    hidden_size=128,
    num_classes=5
)
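
A quick shape check (the dummy batch of token ids below is purely illustrative):

dummy_batch = torch.randint(0, 10000, (8, 50))   # 8 sequences of 50 token ids
logits = model(dummy_batch)                       # logits shape: (8, 5)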

1.2 Data Preprocessing for Text Classification

1.2.1 Processing Data with TorchText
# Note: the Field / TabularDataset API below is the legacy torchtext interface
# (torchtext <= 0.8, later torchtext.legacy); it has been removed from recent releases.
from torchtext.data import Field, LabelField, TabularDataset, BucketIterator

TEXT = Field(
    tokenize='spacy',
    lower=True,
    include_lengths=True,
    batch_first=True   # match the (batch, seq) input expected by TextLSTM
)
LABEL = LabelField()   # non-sequential label field without an <unk> token

train_data, test_data = TabularDataset.splits(
    path='data',
    train='train.csv',
    test='test.csv',
    format='csv',
    fields=[('text', TEXT), ('label', LABEL)]
)

TEXT.build_vocab(train_data, max_size=20000)
LABEL.build_vocab(train_data)

# Create iterators; BucketIterator batches examples of similar length to reduce padding
train_iter, test_iter = BucketIterator.splits(
    (train_data, test_data),
    batch_size=64,
    sort_within_batch=True,
    sort_key=lambda x: len(x.text)
)

1.3 Training and Evaluation

1.3.1 Training Loop
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(10):
    for batch in train_iter:
        text, lengths = batch.text   # include_lengths=True yields (token ids, lengths)
        labels = batch.label
        
        outputs = model(text)
        loss = criterion(outputs, labels)
        
        optimizer.zero_grad()
        loss.backward()
        # Clip gradients to mitigate exploding gradients in recurrent networks
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
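
For the evaluation half, accuracy on the held-out iterator can be computed with a loop like the following (a minimal sketch under the same data setup as above):

model.eval()
correct, total = 0, 0
with torch.no_grad():
    for batch in test_iter:
        text, lengths = batch.text
        preds = model(text).argmax(dim=1)
        correct += (preds == batch.label).sum().item()
        total += batch.label.size(0)
print(f"Test accuracy: {correct / total:.4f}")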

1.3.2 Performance Comparison

| Model | Accuracy | Training time / epoch |
| --- | --- | --- |
| Simple RNN | 82.3% | 3.2 min |
| BiLSTM | 89.7% | 4.5 min |
| LSTM + Attention | 91.2% | 5.1 min |

2. Transformer and BERT in Practice

2.1 Core Transformer Architecture

2.1.1 Self-Attention Mechanism

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
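
The formula maps almost line for line onto tensor operations; a minimal sketch (tensor shapes are illustrative assumptions):

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, seq_len, d_k)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                  # attention weights
    return weights @ V                                   # (batch, seq_len, d_k)

Q = K = V = torch.randn(2, 10, 64)
out = scaled_dot_product_attention(Q, K, V)              # self-attention over 10 positions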

2.1.2 Positional Encoding Formulas

$PE_{(pos,2i)} = \sin\left(pos/10000^{2i/d_{model}}\right)$
$PE_{(pos,2i+1)} = \cos\left(pos/10000^{2i/d_{model}}\right)$
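
The PositionalEncoding module referenced by the classifier below is not defined in this article; a minimal sketch implementing the sine/cosine formulas above for batch-first inputs (max_len is an assumption):

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
        pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
        self.register_buffer('pe', pe.unsqueeze(0))   # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1)]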

import math

class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, num_classes):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoder = PositionalEncoding(d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=3)
        self.classifier = nn.Linear(d_model, num_classes)
    
    def forward(self, src):
        # src: (batch, seq_len) token ids
        src = self.embedding(src) * math.sqrt(self.d_model)  # scale embeddings
        src = self.pos_encoder(src)
        output = self.transformer(src)               # (batch, seq_len, d_model)
        return self.classifier(output.mean(dim=1))   # mean-pool over the sequence
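
For example (parameter values are arbitrary assumptions):

clf = TransformerClassifier(vocab_size=10000, d_model=256, nhead=8, num_classes=5)
tokens = torch.randint(0, 10000, (4, 32))   # 4 sequences of 32 token ids
logits = clf(tokens)                        # logits shape: (4, 5)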

2.2 Fine-Tuning BERT in Practice

2.2.1 Using the Hugging Face Library
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', 
    num_labels=5
)

# Example input processing
text = "This movie was fantastic!"
inputs = tokenizer(
    text,
    padding='max_length',
    max_length=128,
    truncation=True,
    return_tensors="pt"
)
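
Passing the encoded inputs through the model yields classification logits:

import torch

with torch.no_grad():
    outputs = model(**inputs)
pred = outputs.logits.argmax(dim=-1)   # predicted class index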

2.2.2 Fine-Tuning Configuration
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)
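
The Trainer below expects train_dataset and val_dataset to yield dictionaries of tensors including a labels key; a minimal sketch of one way to build them (train_texts, train_labels, val_texts, and val_labels are assumed to be Python lists):

import torch

class ReviewDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.encodings = tokenizer(texts, padding='max_length',
                                   max_length=max_length, truncation=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

train_dataset = ReviewDataset(train_texts, train_labels, tokenizer)
val_dataset = ReviewDataset(val_texts, val_labels, tokenizer)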

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

trainer.train()

2.3 Model Performance Comparison

| Model | Accuracy | Training time (hours) |
| --- | --- | --- |
| LSTM | 89.3% | 0.5 |
| Transformer | 91.7% | 1.2 |
| BERT-base | 94.2% | 3.5 |
| BERT-large | 95.1% | 8.2 |

Appendix: Advanced Techniques

Attention Visualization

import matplotlib.pyplot as plt

# Extract BERT attention weights (requires output_attentions=True)
outputs = model(**inputs, output_attentions=True)
attentions = outputs.attentions   # tuple: one (batch, heads, seq, seq) tensor per layer

plt.figure(figsize=(12, 8))
for i in range(12):
    plt.subplot(3, 4, i + 1)
    plt.imshow(attentions[0][0, i].detach().numpy())   # first layer, head i
    plt.title(f'Head {i+1}')
plt.tight_layout()

Knowledge Distillation

import torch.nn.functional as F

# Distil BERT (teacher) into a BiLSTM (student)
teacher_model = BertForSequenceClassification.from_pretrained(...)
student_model = BiLSTMClassifier(...)

# Distillation loss: temperature-scaled KL divergence + hard-label cross-entropy
T = 2.0  # distillation temperature
loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                F.softmax(teacher_logits / T, dim=-1),
                reduction='batchmean') * T ** 2 \
       + F.cross_entropy(student_logits, labels)

Text Processing Toolchain

graph TD
    A[Raw text] --> B[Tokenization]
    B --> C[Build vocabulary]
    C --> D[Embedding representation]
    D --> E[Model input]
    style A fill:#9f9,stroke:#333
    style E fill:#f99,stroke:#333
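
In code, this pipeline can be as simple as the following sketch (whitespace tokenization and the toy corpus are assumptions for illustration):

import torch
import torch.nn as nn

corpus = ["the movie was great", "the plot was weak"]            # raw text
tokenized = [s.split() for s in corpus]                          # tokenization
vocab = {tok: i + 1 for i, tok in enumerate(sorted({t for s in tokenized for t in s}))}
vocab['<pad>'] = 0                                               # build vocabulary
ids = [[vocab[t] for t in s] for s in tokenized]                 # map tokens to ids
embedding = nn.Embedding(len(vocab), 50)
model_input = embedding(torch.tensor(ids))                       # (2, 4, 50) model input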

Note: the code in this article is based on PyTorch 2.1 and transformers 4.28 (the TorchText Field example additionally requires the legacy torchtext API). Fine-tuning BERT is best done with at least 16 GB of GPU memory, and keep the input length limit in mind (typically 512 tokens for BERT). The next chapter covers model deployment and productionization! 🚀