1. Seq2Seq
The original N-vs-N RNN requires the input and output sequences to have equal length, but most real problems involve sequences of different lengths; in machine translation, for example, the source and target sentences rarely have the same length. The Encoder-Decoder architecture, also called the Seq2Seq model, was introduced to handle such unequal-length sequence pairs.
The Encoder-Decoder structure first encodes the input into a context vector c. There are several ways to obtain c: the simplest is to take the Encoder's last hidden state as c; alternatively, a transformation can be applied to the last hidden state, or to all hidden states. Once c is available, another RNN (the Decoder) decodes it into the output sequence.
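To make the data flow concrete, here is a minimal NumPy sketch of a vanilla RNN Encoder-Decoder (all names, dimensions, and the random parameters are illustrative assumptions, not from the text): the Encoder's last hidden state is taken as the context vector c, which then initializes the Decoder.

import numpy as np

def rnn_step(x, h, Wx, Wh, b):
    # One vanilla RNN step: h_t = tanh(Wx x_t + Wh h_{t-1} + b)
    return np.tanh(Wx @ x + Wh @ h + b)

d_in, d_h, d_out, T_in, T_out = 8, 16, 8, 5, 3
rng = np.random.default_rng(0)

# Randomly initialized toy parameters (illustrative only)
enc = [rng.normal(size=s) for s in [(d_h, d_in), (d_h, d_h), (d_h,)]]
dec = [rng.normal(size=s) for s in [(d_h, d_out), (d_h, d_h), (d_h,)]]

# Encoder: consume the whole input sequence, keep only the final hidden state
h = np.zeros(d_h)
for x_t in rng.normal(size=(T_in, d_in)):
    h = rnn_step(x_t, h, *enc)
c = h  # simplest choice: context vector = last Encoder hidden state

# Decoder: start from c and unroll for a (possibly different) output length
s = c
y_prev = np.zeros(d_out)
outputs = []
for _ in range(T_out):
    s = rnn_step(y_prev, s, *dec)
    y_prev = s[:d_out]           # stand-in for an output projection
    outputs.append(y_prev)
print(np.stack(outputs).shape)    # (3, 8): output length need not match input length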
2. Attention Mechanism
In the Encoder-Decoder structure, the Encoder compresses the entire input sequence into a single semantic vector c before decoding. Because c must then carry all the information of the source sequence, its fixed size becomes a bottleneck for model performance.
The attention mechanism addresses this by feeding the Decoder a different context vector c at each time step; each c automatically selects the context information most relevant to the output y being generated at that step.
The weight a_{ij} depends on the Decoder's hidden state at step i-1 and the Encoder's hidden state at step j. (The figures from the original post, which illustrated the computation of a_{1j} and a_{2j}, are omitted here.)
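In place of the omitted figures, the weights can be written in the standard (Bahdanau-style) form; the score function and the symbol e_{ij} are standard notation rather than taken from the original figures:

e_{ij} = \mathrm{score}(s_{i-1}, h_j), \qquad
a_{ij} = \frac{\exp(e_{ij})}{\sum_{k}\exp(e_{ik})}, \qquad
c_i = \sum_{j} a_{ij}\, h_j

Here s_{i-1} is the Decoder hidden state at step i-1, h_j is the Encoder hidden state at step j, and c_i is the context vector used when producing output y_i.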
Reference: zhuanlan.zhihu.com/p/28054589
3. Transformer
The Transformer is a model that uses the attention mechanism to speed up training. It lends itself to parallel computation, and this, combined with its model capacity, allows it to outperform plain RNNs in both accuracy and speed.
(1). How to understand the positional encoding in the Transformer paper: www.zhihu.com/question/34…
(2). The Illustrated Transformer: jalammar.github.io/illustrated…
(3). Introduction to the Transformers library: medium.com/tensorflow/…
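As a companion to reference (1), here is a minimal NumPy sketch of the sinusoidal positional encoding described in the original Transformer paper (function and variable names are mine):

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

print(sinusoidal_positional_encoding(max_len=4096, d_model=512).shape)  # (4096, 512)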
4. GPT-2
GPT-2 is built from Transformer decoder blocks. Reference: jalammar.github.io/illustrated…
Transformer Summarizer Implementation
Part 1: Importing the dataset
import sys
import os
import numpy as np
import textwrap
wrapper = textwrap.TextWrapper(width=70)
import trax
from trax import layers as tl
from trax.fastmath import numpy as jnp
# to print the entire np array
np.set_printoptions(threshold=sys.maxsize)
# This will download the dataset if no data_dir is specified.
# Downloading and processing can take a bit of time,
# so we have the data already in 'data/' for you.
# Importing CNN/DailyMail articles dataset
train_stream_fn = trax.data.TFDS('cnn_dailymail',
                                 data_dir='data/',
                                 keys=('article', 'highlights'),
                                 train=True)

# This should be much faster as the data is downloaded already.
eval_stream_fn = trax.data.TFDS('cnn_dailymail',
                                data_dir='data/',
                                keys=('article', 'highlights'),
                                train=False)
def tokenize(input_str, EOS=1):
    """Input str to features dict, ready for inference"""
    # Use the trax.data.tokenize method. It takes streams and returns streams,
    # we get around it by making a 1-element stream with `iter`.
    inputs = next(trax.data.tokenize(iter([input_str]),
                                     vocab_dir='vocab_dir/',
                                     vocab_file='summarize32k.subword.subwords'))
    # Mark the end of the sentence with EOS
    return list(inputs) + [EOS]

def detokenize(integers):
    """List of ints to str"""
    s = trax.data.detokenize(integers,
                             vocab_dir='vocab_dir/',
                             vocab_file='summarize32k.subword.subwords')
    return wrapper.fill(s)

print(tokenize('Not only that'))
print(detokenize([1369, 86, 285, 1]))
---------------------------
Output:
[1369, 86, 285, 1]
Not only that<EOS>
# Special tokens
SEP = 0  # Padding or separator token
EOS = 1  # End of sentence token

# Concatenate tokenized inputs and targets using 0 as separator.
def preprocess(stream):
    for (article, summary) in stream:
        joint = np.array(list(article) + [EOS, SEP] + list(summary) + [EOS])
        mask = [0] * (len(list(article)) + 2) + [1] * (len(list(summary)) + 1)  # Accounting for EOS and SEP
        yield joint, joint, np.array(mask)

# You can combine a few data preprocessing steps into a pipeline like this.
input_pipeline = trax.data.Serial(
    # Tokenizes
    trax.data.Tokenize(vocab_dir='vocab_dir/',
                       vocab_file='summarize32k.subword.subwords'),
    # Uses the function defined above
    preprocess,
    # Filters out examples longer than 2048
    trax.data.FilterByLength(2048)
)

# Apply preprocessing to data streams.
train_stream = input_pipeline(train_stream_fn())
eval_stream = input_pipeline(eval_stream_fn())
train_input, train_target, train_mask = next(train_stream)
assert sum((train_input - train_target)**2) == 0 # They are the same in Language Model (LM).
# Bucketing to create batched generators.
# Buckets are defined in terms of boundaries and batch sizes.
# batch_sizes[i] determines the batch size for items with length < boundaries[i].
# So below, we'll take a batch of 16 sentences of length < 128, 8 of length < 256,
# 4 of length < 512, and so on.
boundaries = [128, 256, 512, 1024]
batch_sizes = [16, 8, 4, 2, 1]

# Create the streams.
train_batch_stream = trax.data.BucketByLength(
    boundaries, batch_sizes)(train_stream)
eval_batch_stream = trax.data.BucketByLength(
    boundaries, batch_sizes)(eval_stream)
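Optionally, you can peek at one bucketed batch to confirm the shapes (a usage sketch; the exact shape depends on which bucket the examples fall into):

input_batch, target_batch, mask_batch = next(train_batch_stream)
# All three arrays share the same (batch_size, padded_length) shape,
# with batch_size chosen according to the bucket the examples landed in.
print(input_batch.shape, target_batch.shape, mask_batch.shape)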
You can see that the data has the following structure:
[Article] -> <EOS> -> <pad> -> [Article Summary] -> <EOS> -> (possibly) multiple <pad>
The loss is computed only over the summary, using cross-entropy as the loss function.
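To make the joint-input/mask layout concrete, here is a toy example of what preprocess yields, using made-up token lists (not real tokenizer output):

article = [52, 1353, 28]   # hypothetical article tokens
summary = [86, 233]        # hypothetical summary tokens

joint = article + [1, 0] + summary + [1]   # [Article] <EOS> <SEP/pad> [Summary] <EOS>
mask  = [0] * (len(article) + 2) + [1] * (len(summary) + 1)

print(joint)  # [52, 1353, 28, 1, 0, 86, 233, 1]
print(mask)   # [0, 0, 0, 0, 0, 1, 1, 1]  -> the loss is only taken where mask == 1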
Part 2: Summarization with transformer
def DotProductAttention(query, key, value, mask):
    """Dot product self-attention.
    Args:
        query (jax.interpreters.xla.DeviceArray): array of query representations with shape (L_q by d)
        key (jax.interpreters.xla.DeviceArray): array of key representations with shape (L_k by d)
        value (jax.interpreters.xla.DeviceArray): array of value representations with shape (L_v by d) where L_v = L_k
        mask (jax.interpreters.xla.DeviceArray): attention-mask, gates attention with shape (L_q by L_k)
    Returns:
        jax.interpreters.xla.DeviceArray: Self-attention array for q, k, v arrays. (L_q by d)
    """
    assert query.shape[-1] == key.shape[-1] == value.shape[-1], "Embedding dimensions of q, k, v aren't all the same"

    # Save depth/dimension of the query embedding for scaling down the dot product
    depth = query.shape[-1]

    # Calculate the scaled query-key dot product
    dots = jnp.matmul(query, jnp.swapaxes(key, -1, -2)) / jnp.sqrt(depth)

    # Apply the mask
    if mask is not None:
        dots = jnp.where(mask, dots, jnp.full_like(dots, -1e9))

    # Softmax implementation:
    # use trax.fastmath.logsumexp of dots to avoid underflow from large exponents.
    # Note: softmax = e^(dots - logsumexp(dots)) = e^dots / sumexp(dots)
    logsumexp = trax.fastmath.logsumexp(dots, axis=-1, keepdims=True)

    # Take the exponential of dots minus logsumexp to get the softmax
    dots = jnp.exp(dots - logsumexp)

    # Multiply dots by value to get the self-attention output
    attention = jnp.matmul(dots, value)

    return attention
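A quick sanity check with toy tensors (the numbers are made up for illustration):

q = jnp.array([[1.0, 0.0], [0.0, 1.0]])     # (L_q=2, d=2)
k = jnp.array([[1.0, 0.0], [0.0, 1.0]])     # (L_k=2, d=2)
v = jnp.array([[10.0, 0.0], [0.0, 10.0]])   # (L_v=2, d=2)

# With no mask, each query attends mostly to the matching key.
print(DotProductAttention(q, k, v, mask=None))   # shape (2, 2)

# With a causal mask, position 0 can only attend to itself.
causal = jnp.tril(jnp.ones((2, 2), dtype=jnp.bool_))
print(DotProductAttention(q, k, v, mask=causal))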
Implement the following functions that will be needed for Causal Attention:
compute_attention_heads : Gets an input 𝑥 of dimension (batch_size, seqlen, n_heads × d_head) and splits the last (depth) dimension and stacks it to the zeroth dimension to allow matrix multiplication (batch_size × n_heads, seqlen, d_head).
dot_product_self_attention : Creates a mask matrix with False values above the diagonal and True values below and calls DotProductAttention which implements dot product self attention.
compute_attention_output : Undoes compute_attention_heads by splitting first (vertical) dimension and stacking in the last (depth) dimension (batch_size, seqlen, n_heads × d_head). These operations concatenate (stack/merge) the heads.
compute_attention_heads implementation:
def compute_attention_heads_closure(n_heads, d_head):
    """ Function that simulates environment inside CausalAttention function.
    Args:
        d_head (int): dimensionality of heads.
        n_heads (int): number of attention heads.
    Returns:
        function: compute_attention_heads function
    """
    def compute_attention_heads(x):
        """ Compute the attention heads.
        Args:
            x (jax.interpreters.xla.DeviceArray): tensor with shape (batch_size, seqlen, n_heads x d_head).
        Returns:
            jax.interpreters.xla.DeviceArray: reshaped tensor with shape (batch_size x n_heads, seqlen, d_head).
        """
        # Size of x's batch dimension
        batch_size = x.shape[0]
        # Length of the sequence (size of x's second dimension)
        seqlen = x.shape[1]

        # batch_size, seqlen, n_heads*d_head -> batch_size, seqlen, n_heads, d_head
        x = jnp.reshape(x, (batch_size, seqlen, n_heads, d_head))
        # batch_size, seqlen, n_heads, d_head -> batch_size, n_heads, seqlen, d_head
        x = jnp.transpose(x, (0, 2, 1, 3))
        # batch_size, n_heads, seqlen, d_head -> batch_size*n_heads, seqlen, d_head
        x = jnp.reshape(x, (-1, seqlen, d_head))

        return x

    return compute_attention_heads
dot_product_self_attention implementation:
def dot_product_self_attention(q, k, v):
    """ Masked dot product self attention.
    Args:
        q (jax.interpreters.xla.DeviceArray): queries.
        k (jax.interpreters.xla.DeviceArray): keys.
        v (jax.interpreters.xla.DeviceArray): values.
    Returns:
        jax.interpreters.xla.DeviceArray: masked dot product self attention tensor.
    """
    # Mask size should be equal to L_q. Remember that q has shape (batch_size, L_q, d)
    mask_size = q.shape[-2]

    # Create a matrix with ones below the diagonal and 0s above, of shape (1, mask_size, mask_size).
    # The 1's and 0's get cast to True/False by setting dtype to jnp.bool_.
    mask = jnp.tril(jnp.ones((1, mask_size, mask_size), dtype=jnp.bool_), k=0)

    return DotProductAttention(q, k, v, mask)
compute_attention_output implementation:
def compute_attention_output_closure(n_heads, d_head):
    """ Function that simulates environment inside CausalAttention function.
    Args:
        d_head (int): dimensionality of heads.
        n_heads (int): number of attention heads.
    Returns:
        function: compute_attention_output function
    """
    def compute_attention_output(x):
        """ Compute the attention output.
        Args:
            x (jax.interpreters.xla.DeviceArray): tensor with shape (batch_size x n_heads, seqlen, d_head).
        Returns:
            jax.interpreters.xla.DeviceArray: reshaped tensor with shape (batch_size, seqlen, n_heads x d_head).
        """
        # Length of the sequence (size of x's second dimension)
        seqlen = x.shape[1]

        # batch_size*n_heads, seqlen, d_head -> batch_size, n_heads, seqlen, d_head
        x = jnp.reshape(x, (-1, n_heads, seqlen, d_head))
        # batch_size, n_heads, seqlen, d_head -> batch_size, seqlen, n_heads, d_head
        x = jnp.transpose(x, (0, 2, 1, 3))
        # Reshape to concatenate the heads: batch_size, seqlen, n_heads*d_head
        return jnp.reshape(x, (-1, seqlen, n_heads * d_head))

    return compute_attention_output
def CausalAttention(d_feature,
                    n_heads,
                    compute_attention_heads_closure=compute_attention_heads_closure,
                    dot_product_self_attention=dot_product_self_attention,
                    compute_attention_output_closure=compute_attention_output_closure,
                    mode='train'):
    """Transformer-style multi-headed causal attention.
    Args:
        d_feature (int): dimensionality of feature embedding.
        n_heads (int): number of attention heads.
        compute_attention_heads_closure (function): Closure around compute_attention_heads.
        dot_product_self_attention (function): dot_product_self_attention function.
        compute_attention_output_closure (function): Closure around compute_attention_output.
        mode (str): 'train' or 'eval'.
    Returns:
        trax.layers.combinators.Serial: Multi-headed self-attention model.
    """
    assert d_feature % n_heads == 0
    d_head = d_feature // n_heads

    # The second argument to tl.Fn() is an uncalled function: calling the closure
    # with the correct parameters returns the actual uncalled inner function.
    ComputeAttentionHeads = tl.Fn('AttnHeads', compute_attention_heads_closure(n_heads, d_head), n_out=1)

    return tl.Serial(
        tl.Branch(  # creates three towers for one input, takes activations and creates queries, keys and values
            [tl.Dense(d_feature), ComputeAttentionHeads],  # queries
            [tl.Dense(d_feature), ComputeAttentionHeads],  # keys
            [tl.Dense(d_feature), ComputeAttentionHeads],  # values
        ),
        tl.Fn('DotProductAttn', dot_product_self_attention, n_out=1),  # takes QKV
        tl.Fn('AttnOutput', compute_attention_output_closure(n_heads, d_head), n_out=1),  # merge the heads
        tl.Dense(d_feature)  # Final dense layer
    )
def DecoderBlock(d_model, d_ff, n_heads,
                 dropout, mode, ff_activation):
    """Returns a list of layers that implements a Transformer decoder block.
    The input is an activation tensor.
    Args:
        d_model (int): depth of embedding.
        d_ff (int): depth of feed-forward layer.
        n_heads (int): number of attention heads.
        dropout (float): dropout rate (how much to drop out).
        mode (str): 'train' or 'eval'.
        ff_activation (function): the non-linearity in feed-forward layer.
    Returns:
        list: list of trax.layers.combinators.Serial that maps an activation tensor to an activation tensor.
    """
    # Create the masked multi-head attention block using the CausalAttention function
    causal_attention = CausalAttention(
        d_model,
        n_heads=n_heads,
        mode=mode
    )

    # Create the feed-forward block (list): two dense layers with dropout and normalized input
    feed_forward = [
        # Normalize layer inputs
        tl.LayerNorm(),
        # First feed-forward (dense) layer
        tl.Dense(d_ff),
        # Activation function passed in as a parameter (generally ReLU)
        ff_activation(),
        # Dropout with rate and mode specified (i.e., don't use dropout during evaluation)
        tl.Dropout(rate=dropout, mode=mode),
        # Second feed-forward layer, projecting back to d_model
        tl.Dense(d_model),
        # Dropout with rate and mode specified
        tl.Dropout(rate=dropout, mode=mode)
    ]

    # List of two Residual blocks: the attention (with normalization and dropout) and the feed-forward block
    return [
        tl.Residual(
            # Normalize layer input
            tl.LayerNorm(),
            # Causal attention block defined above
            causal_attention,
            # Dropout with rate and mode specified
            tl.Dropout(rate=dropout, mode=mode)
        ),
        tl.Residual(
            # Feed-forward block
            feed_forward
        ),
    ]
def TransformerLM(vocab_size=33300,
                  d_model=512,
                  d_ff=2048,
                  n_layers=6,
                  n_heads=8,
                  dropout=0.1,
                  max_len=4096,
                  mode='train',
                  ff_activation=tl.Relu):
    """Returns a Transformer language model.
    The input to the model is a tensor of tokens. (This model uses only the
    decoder part of the overall Transformer.)
    Args:
        vocab_size (int): vocab size.
        d_model (int): depth of embedding.
        d_ff (int): depth of feed-forward layer.
        n_layers (int): number of decoder layers.
        n_heads (int): number of attention heads.
        dropout (float): dropout rate (how much to drop out).
        max_len (int): maximum symbol length for positional encoding.
        mode (str): 'train', 'eval' or 'predict', predict mode is for fast inference.
        ff_activation (function): the non-linearity in feed-forward layer.
    Returns:
        trax.layers.combinators.Serial: A Transformer language model as a layer that maps from a tensor of tokens
        to activations over a vocab set.
    """
    # Embedding inputs and positional encoder
    positional_encoder = [
        # Embedding layer of dimension (vocab_size, d_model)
        tl.Embedding(vocab_size, d_model),
        # Dropout with rate and mode specified
        tl.Dropout(rate=dropout, mode=mode),
        # Positional encoding layer with maximum input length and mode specified
        tl.PositionalEncoding(max_len=max_len, mode=mode)]

    # Create a stack (list) of n_layers decoder blocks with the necessary parameters
    decoder_blocks = [
        DecoderBlock(d_model, d_ff, n_heads,
                     dropout, mode, ff_activation) for _ in range(n_layers)]

    # Create the complete model
    return tl.Serial(
        # Use teacher forcing (feed output of previous step to current step)
        tl.ShiftRight(mode=mode),
        # Embedding inputs and positional encoder
        positional_encoder,
        # Decoder blocks
        decoder_blocks,
        # Normalize layer
        tl.LayerNorm(),
        # Dense layer of vocab_size (a.k.a. the logits layer; we need to select a token to output)
        tl.Dense(vocab_size),
        # Get log probabilities with LogSoftmax
        tl.LogSoftmax(),
    )
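As an optional sanity check of the architecture (the tiny hyperparameters here are arbitrary), you can instantiate a small model and print its layer structure:

tiny_transformer = TransformerLM(vocab_size=100, d_model=4, d_ff=16,
                                 n_layers=1, n_heads=2, mode='train')
print(tiny_transformer)  # shows the Serial combinator with ShiftRight, embedding, decoder blocks, etc.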
Part 3: Training
from trax.supervised import training

def training_loop(TransformerLM, train_gen, eval_gen, output_dir="~/model"):
    '''
    Input:
        TransformerLM (trax.layers.combinators.Serial): The model you are building.
        train_gen (generator): Training stream of data.
        eval_gen (generator): Evaluation stream of data.
        output_dir (str): folder to save your file.
    Returns:
        trax.supervised.training.Loop: Training loop.
    '''
    output_dir = os.path.expanduser(output_dir)
    lr_schedule = trax.lr.warmup_and_rsqrt_decay(n_warmup_steps=1000, max_value=0.01)

    train_task = training.TrainTask(
        labeled_data=train_gen,                 # The training generator
        loss_layer=tl.CrossEntropyLoss(),       # Loss function
        optimizer=trax.optimizers.Adam(0.01),   # Optimizer with learning rate 0.01
        lr_schedule=lr_schedule,
        n_steps_per_checkpoint=10
    )

    eval_task = training.EvalTask(
        labeled_data=eval_gen,                           # The evaluation generator
        metrics=[tl.CrossEntropyLoss(), tl.Accuracy()]   # CrossEntropyLoss and Accuracy
    )

    loop = training.Loop(TransformerLM(d_model=4,
                                       d_ff=16,
                                       n_layers=1,
                                       n_heads=2,
                                       mode='train'),
                         train_task,
                         eval_tasks=[eval_task],
                         output_dir=output_dir)

    return loop
loop = training_loop(TransformerLM, train_batch_stream, eval_batch_stream)
loop.run(10)
Part 4: Evaluation
# Get the model architecture
model = TransformerLM(mode='eval')
# Load the pre-trained weights
model.init_from_file('model.pkl.gz', weights_only=True)
def next_symbol(cur_output_tokens, model):
    """Returns the next symbol for a given sentence.
    Args:
        cur_output_tokens (list): tokenized sentence with EOS and PAD tokens at the end.
        model (trax.layers.combinators.Serial): The transformer model.
    Returns:
        int: tokenized symbol.
    """
    # Current output tokens length
    token_length = len(cur_output_tokens)
    print(f'token_length: {token_length}')

    # Calculate the minimum power of 2 big enough to store token_length.
    # Add 1 to token_length so np.log2() doesn't receive 0 when token_length is 0.
    padded_length = 2**int(np.ceil(np.log2(token_length + 1)))
    print(f'padded_length: {padded_length}')

    # Fill cur_output_tokens with 0's until it reaches padded_length
    padded = cur_output_tokens + [0] * (padded_length - token_length)
    print(f'padded: {padded}')
    padded_with_batch = np.array(padded)[None, :]  # Add the batch dimension
    print(f'padded_with_batch: {padded_with_batch}')

    # The model expects a tuple containing two padded tensors (with batch)
    output, o = model((padded_with_batch, padded_with_batch))
    print(f'output.shape: {output.shape}')
    print(f'o.shape: {o.shape}')

    # output has shape (1, padded_length, vocab_size):
    # index with 0 in the batch dim, token_length in the position dim,
    # and keep all entries in the vocab dim to get the log probabilities.
    log_probs = output[0, token_length, :]
    print(f'log_probs.shape: {log_probs.shape}')

    return int(np.argmax(log_probs))
def greedy_decode(input_sentence, model):
    """Greedy decode function.
    Args:
        input_sentence (string): a sentence or article.
        model (trax.layers.combinators.Serial): Transformer model.
    Returns:
        string: summary of the input.
    """
    # Tokenize the input and append the separator token
    cur_output_tokens = tokenize(input_sentence) + [0]
    generated_output = []
    cur_output = 0
    EOS = 1

    while cur_output != EOS:
        # Get the next symbol
        print(f'cur_output_tokens: {cur_output_tokens}')
        cur_output = next_symbol(cur_output_tokens, model)
        print(f'cur_output: {cur_output}')
        # Append the next symbol to the full sequence
        cur_output_tokens.append(cur_output)
        # Append the next symbol to the generated summary
        generated_output.append(cur_output)
        print(detokenize(generated_output))
        print('\n')

    return detokenize(generated_output)
test_sentence = "It was a sunny day when I went to the market to buy some flowers. But I only found roses, not tulips."
print(greedy_decode(test_sentence, model))
----------------------------------
Output:
cur_output_tokens: [52, 1353, 28, 24421, 20, 194, 7511, 13, 533, 320, 213, 700, 320, 2444, 87, 5083, 3, 200, 13, 86, 233, 4267, 45, 2, 19, 3735, 18035, 4, 10, 1, 0]
token_length: 31
padded_length: 32
padded: [52, 1353, 28, 24421, 20, 194, 7511, 13, 533, 320, 213, 700, 320, 2444, 87, 5083, 3, 200, 13, 86, 233, 4267, 45, 2, 19, 3735, 18035, 4, 10, 1, 0, 0]
padded_with_batch: [[ 52 1353 28 24421 20 194 7511 13 533 320 213 700
320 2444 87 5083 3 200 13 86 233 4267 45 2
19 3735 18035 4 10 1 0 0]]
output.shape: (1, 32, 33300)
o.shape: (1, 32)
log_probs.shape: (33300,)
cur_output: 11
:
cur_output_tokens: [52, 1353, 28, 24421, 20, 194, 7511, 13, 533, 320, 213, 700, 320, 2444, 87, 5083, 3, 200, 13, 86, 233, 4267, 45, 2, 19, 3735, 18035, 4, 10, 1, 0, 11]
token_length: 32
padded_length: 64
padded: [52, 1353, 28, 24421, 20, 194, 7511, 13, 533, 320, 213, 700, 320, 2444, 87, 5083, 3, 200, 13, 86, 233, 4267, 45, 2, 19, 3735, 18035, 4, 10, 1, 0, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
padded_with_batch: [[ 52 1353 28 24421 20 194 7511 13 533 320 213 700
320 2444 87 5083 3 200 13 86 233 4267 45 2
19 3735 18035 4 10 1 0 11 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0]]
output.shape: (1, 64, 33300)
o.shape: (1, 64)
log_probs.shape: (33300,)
cur_output: 13
: I
cur_output_tokens: [52, 1353, 28, 24421, 20, 194, 7511, 13, 533, 320, 213, 700, 320, 2444, 87, 5083, 3, 200, 13, 86, 233, 4267, 45, 2, 19, 3735, 18035, 4, 10, 1, 0, 11, 13]
token_length: 33
padded_length: 64
padded: [52, 1353, 28, 24421, 20, 194, 7511, 13, 533, 320, 213, 700, 320, 2444, 87, 5083, 3, 200, 13, 86, 233, 4267, 45, 2, 19, 3735, 18035, 4, 10, 1, 0, 11, 13, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
padded_with_batch: [[ 52 1353 28 24421 20 194 7511 13 533 320 213 700
320 2444 87 5083 3 200 13 86 233 4267 45 2
19 3735 18035 4 10 1 0 11 13 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0]]
output.shape: (1, 64, 33300)
o.shape: (1, 64)
log_probs.shape: (33300,)
cur_output: 141
: I just
......
Part 5: tl.CausalAttention()
import sys
import os
import time
import numpy as np
import gin
import trax
from trax import layers as tl
from trax.fastmath import numpy as jnp
# to print the entire np array
np.set_printoptions(threshold=sys.maxsize)
def PositionalEncoder(vocab_size, d_model, dropout, max_len, mode):
    """Returns a list of layers that:
    1. takes a block of text as input,
    2. embeds the words in that text, and
    3. adds positional encoding,
       i.e. associates a number in range(max_len) with
       each word in each sentence of embedded input text.
    The input is a list of tokenized blocks of text.
    Args:
        vocab_size (int): vocab size.
        d_model (int): depth of embedding.
        dropout (float): dropout rate (how much to drop out).
        max_len (int): maximum symbol length for positional encoding.
        mode (str): 'train' or 'eval'.
    """
    # Embedding inputs and positional encoder
    return [
        # Embedding layer of dimension (vocab_size, d_model)
        tl.Embedding(vocab_size, d_model),
        # Dropout with rate and mode specified
        tl.Dropout(rate=dropout, mode=mode),
        # Positional encoding layer with maximum input length and mode specified
        tl.PositionalEncoding(max_len=max_len, mode=mode)]
def FeedForward(d_model, d_ff, dropout, mode, ff_activation):
    """Returns a list of layers that implements a feed-forward block.
    The input is an activation tensor.
    Args:
        d_model (int): depth of embedding.
        d_ff (int): depth of feed-forward layer.
        dropout (float): dropout rate (how much to drop out).
        mode (str): 'train' or 'eval'.
        ff_activation (function): the non-linearity in feed-forward layer.
    Returns:
        list: list of trax.layers.combinators.Serial that maps an activation tensor to an activation tensor.
    """
    # Feed-forward block (list): two dense layers with dropout and normalized input
    return [
        # Normalize layer inputs
        tl.LayerNorm(),
        # First feed-forward (dense) layer
        tl.Dense(d_ff),
        # Activation function passed in as a parameter (generally ReLU)
        ff_activation(),
        # Dropout with rate and mode specified (i.e., don't use dropout during evaluation)
        tl.Dropout(rate=dropout, mode=mode),
        # Second feed-forward layer, projecting back to d_model
        tl.Dense(d_model),
        # Dropout with rate and mode specified
        tl.Dropout(rate=dropout, mode=mode)
    ]
def DecoderBlock(d_model, d_ff, n_heads,
                 dropout, mode, ff_activation):
    """Returns a list of layers that implements a Transformer decoder block.
    The input is an activation tensor.
    Args:
        d_model (int): depth of embedding.
        d_ff (int): depth of feed-forward layer.
        n_heads (int): number of attention heads.
        dropout (float): dropout rate (how much to drop out).
        mode (str): 'train' or 'eval'.
        ff_activation (function): the non-linearity in feed-forward layer.
    Returns:
        list: list of trax.layers.combinators.Serial that maps an activation tensor to an activation tensor.
    """
    # List of two Residual blocks: the attention (with normalization) and the feed-forward block
    return [
        tl.Residual(
            # Normalize layer input
            tl.LayerNorm(),
            # Built-in causal attention layer
            tl.CausalAttention(d_model, n_heads=n_heads, dropout=dropout, mode=mode)
        ),
        tl.Residual(
            # Feed-forward block; it normalizes its own inputs, so no extra LayerNorm is needed here
            FeedForward(d_model, d_ff, dropout, mode, ff_activation)
        ),
    ]
def TransformerLM(vocab_size=33300,
                  d_model=512,
                  d_ff=2048,
                  n_layers=6,
                  n_heads=8,
                  dropout=0.1,
                  max_len=4096,
                  mode='train',
                  ff_activation=tl.Relu):
    """Returns a Transformer language model.
    The input to the model is a tensor of tokens. (This model uses only the
    decoder part of the overall Transformer.)
    Args:
        vocab_size (int): vocab size.
        d_model (int): depth of embedding.
        d_ff (int): depth of feed-forward layer.
        n_layers (int): number of decoder layers.
        n_heads (int): number of attention heads.
        dropout (float): dropout rate (how much to drop out).
        max_len (int): maximum symbol length for positional encoding.
        mode (str): 'train', 'eval' or 'predict', predict mode is for fast inference.
        ff_activation (function): the non-linearity in feed-forward layer.
    Returns:
        trax.layers.combinators.Serial: A Transformer language model as a layer that maps from a tensor of tokens
        to activations over a vocab set.
    """
    # Create a stack (list) of n_layers decoder blocks with the necessary parameters
    decoder_blocks = [
        DecoderBlock(d_model, d_ff, n_heads, dropout, mode, ff_activation) for _ in range(n_layers)]

    # Create the complete model
    return tl.Serial(
        # Use teacher forcing (feed output of previous step to current step)
        tl.ShiftRight(mode=mode),
        # Embedding inputs and positional encoder
        PositionalEncoder(vocab_size, d_model, dropout, max_len, mode),
        # Decoder blocks
        decoder_blocks,
        # Normalize layer
        tl.LayerNorm(),
        # Dense layer of vocab_size (a.k.a. the logits layer; we need to select a token to output)
        tl.Dense(vocab_size),
        # Get log probabilities with LogSoftmax
        tl.LogSoftmax()
    )
5. BERT
BERT stands for Bidirectional Encoder Representations from Transformers. Reference: jalammar.github.io/illustrated…
1. Pre-training Tasks
(1). Masked Language Model
(2). Two-sentence Tasks
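As a small illustration of the Masked Language Model objective, you can query a pre-trained BERT through the Hugging Face pipeline API (a sketch; the example sentence is mine):

from transformers import pipeline

# Masked Language Model: predict the token hidden behind [MASK]
fill_mask = pipeline('fill-mask', model='bert-base-uncased')
for pred in fill_mask("The man went to the [MASK] store."):
    print(pred['token_str'], pred['score'])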
2. Transfer Learning
(1). Feature-based
In the feature-based approach, you train word embeddings by running a different model and then use those features (i.e. word vectors) on a different task.
Reference: jalammar.github.io/a-visual-gu…
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import torch
import transformers as ppb
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)
batch_1 = df[:2000]
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')
## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')
# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)
# Tokenize the sentences, adding the special [CLS] and [SEP] tokens
tokenized = batch_1[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

# Pad every sentence to the length of the longest one
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)
padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

# Mask out the padding so the model ignores it
attention_mask = np.where(padded != 0, 1, 0)

input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

# Run the sentences through (Distil)BERT; no gradients needed for feature extraction
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

# Use the output embedding of the [CLS] token as the sentence feature vector
features = last_hidden_states[0][:,0,:].numpy()
labels = batch_1[1]
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
parameters = {'C': np.linspace(0.0001, 100, 20)}
grid_search = GridSearchCV(LogisticRegression(), parameters)
grid_search.fit(train_features, train_labels)
print('best parameters: ', grid_search.best_params_)
print('best score: ', grid_search.best_score_)
lr_clf = LogisticRegression(C=5.2)
lr_clf.fit(train_features, train_labels)
lr_clf.score(test_features, test_labels)
The bert-as-service project can load a pre-trained BERT model and output vector representations of input text.
Reference:
a). blog.csdn.net/ling620/art…
(2). Fine-tuning
When fine-tuning, you use the exact same model and simply run it on a different task. Sometimes you keep the pre-trained weights fixed and only train a newly added layer; other times you slowly unfreeze the layers one at a time. You can also use unlabelled data during pre-training, by masking words and trying to predict which word was masked. The sketch below illustrates the add-a-new-head variant.
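A minimal sketch of the "freeze the encoder, train a new classification head" variant, reusing the DistilBERT model, input_ids, attention_mask, and batch_1 from the feature-based example above (the head architecture and training details are illustrative assumptions, not taken from the referenced posts):

import torch
import torch.nn as nn

# Freeze the pre-trained encoder weights
for param in model.parameters():
    param.requires_grad = False

# New task-specific head: maps the 768-dim [CLS] embedding to 2 sentiment classes
classifier = nn.Linear(768, 2)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

labels_tensor = torch.tensor(batch_1[1].values)

# Train only the new head on top of the frozen encoder's [CLS] features
for epoch in range(3):
    optimizer.zero_grad()
    with torch.no_grad():
        cls_embeddings = model(input_ids, attention_mask=attention_mask)[0][:, 0, :]
    logits = classifier(cls_embeddings)
    loss = loss_fn(logits, labels_tensor)
    loss.backward()
    optimizer.step()
    print(f'epoch {epoch}: loss = {loss.item():.4f}')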
References:
a). Text classification fine-tuning: www.jianshu.com/p/34de39d52…
b). Text similarity fine-tuning: blog.csdn.net/ling620/art…
c). Compression and deployment via bert-base: www.jianshu.com/p/67f99e48f…
d). bert-base: github.com/macanv/BERT…