Learning the HuggingFace Transformers Library (Part 1: Basic Principles)


The Transformer is the workhorse model for NLP tasks, and we usually rely on HuggingFace's open-source Transformers library to carry them out. This series gives a detailed introduction to the principles and usage of the Transformers library. This first article covers the framework of the Transformers library and the basic principles of its models.

Framework and Principles

Hugging Face mainly maintains several open-source libraries (Transformers, Tokenizers, Datasets, and so on),

as well as a model hub: huggingface.co/models

Aside
Looking at the list of authors, two of them were also involved in creating the open-source machine learning library gradio (github.com/gradio-app/…). Gradio helps users quickly build a machine learning application with a user interface, and it was later acquired by HuggingFace.

NLP Tasks

Before introducing Transformers, let's first look at what problems NLP mainly solves. Here are some common NLP tasks:

  • Sentence classification: e.g. sentiment analysis of movie reviews, detecting whether an email is spam, determining whether a sentence is grammatically correct, or whether two sentences are logically related.
  • Classifying each word in a sentence: e.g. identifying the grammatical components of a sentence (noun, verb, adjective) or the named entities (person, location, organization).
  • Content generation: e.g. automatically writing poetry, or filling in the blanks in a sentence.
  • Answer extraction: e.g. given a question and a context, extracting the answer to the question from the information provided in the context.
  • Generating a new sentence from an input text: e.g. machine translation, text summarization.

NLP is not limited to written text, either: it also tackles complex problems in speech recognition and computer vision, such as producing a transcript of an audio clip or a description of an image.

A First Look at Pipeline

The Transformers library uses pipelines to perform the various NLP tasks, such as question answering, text classification and image classification; each pipeline will be covered in detail later in this series.

For now, let's get a feel for how a pipeline is used with an example.

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    ["I've been waiting for a HuggingFace course my whole life.",
     "I hate this so much!"]
)

Output:

[{'label': 'POSITIVE', 'score': 0.9598047137260437}, {'label': 'NEGATIVE', 'score': 0.9994558095932007}]

So how does the pipeline work? Let's guess first; roughly, the following steps are involved (a small sketch making them concrete follows this list):

  • Based on the task name "sentiment-analysis", select a specific pre-trained model, presumably one that has already been fine-tuned for sentiment classification on English text.
  • Download the model and cache it, so that later pipeline calls can use it directly.
  • Preprocess the text into a format the model understands and feed it to the model.
  • The model predicts on the input, and the prediction is post-processed into something we can read.
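
To make these guesses concrete, here is a minimal sketch that pins the checkpoint explicitly (it is the default "pt" model listed for "text-classification" in the SUPPORTED_TASKS table shown below); the behavior is the same as the earlier call that passed only the task name:

from transformers import pipeline

# Passing the checkpoint explicitly makes step 1 (model selection) visible.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("I've been waiting for a HuggingFace course my whole life."))
# [{'label': 'POSITIVE', 'score': 0.9598...}]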

The first step, picking the specific model and pipeline from the task name, is implemented by the following code:

TASK_ALIASES = {
	    "sentiment-analysis": "text-classification",
	    "ner": "token-classification",
	}
SUPPORTED_TASKS = {
  ...
  "text-classification": {
       "impl": TextClassificationPipeline,
       "tf": (TFAutoModelForSequenceClassification,) if is_tf_available() else (),
       "pt": (AutoModelForSequenceClassification,) if is_torch_available() else (),
       "default": {
           "model": {
               "pt": "distilbert-base-uncased-finetuned-sst-2-english",
               "tf": "distilbert-base-uncased-finetuned-sst-2-english",
           },
       },
       "type": "text",
   },
   ...
}
def pipeline(
	    task: str = None,
	    model: Optional = None,
	    config: Optional[Union[str, PretrainedConfig]] = None,
	    tokenizer: Optional[Union[str, PreTrainedTokenizer, PreTrainedTokenizerFast]] = None,
	    feature_extractor: Optional[Union[str, PreTrainedFeatureExtractor]] = None,
	    framework: Optional[str] = None,
	    revision: Optional[str] = None,
	    use_fast: bool = True,
	    use_auth_token: Optional[Union[str, bool]] = None,
	    model_kwargs: Dict[str, Any] = None,
	    pipeline_class: Optional[Any] = None,
	    **kwargs
	) -> Pipeline:
    """
    Pipelines are made of:
       - A [tokenizer](tokenizer) in charge of mapping raw textual input to token.
       - A [model](model) to make predictions from the inputs.
       - Some (optional) post processing for enhancing model's output.
    """
    if pipeline_class is None:
	    pipeline_class = targeted_task["impl"]
    if model is None:
        model = get_default_model(targeted_task, framework, task_options)
    if isinstance(config, str):
        config = AutoConfig.from_pretrained(config, revision=revision, _from_pipeline=task, **model_kwargs)
    elif config is None and isinstance(model, str):
        config = AutoConfig.from_pretrained(model, revision=revision, _from_pipeline=task, **model_kwargs)
    load_tokenizer = type(model_config) in TOKENIZER_MAPPING or model_config.tokenizer_class is not None
	load_feature_extractor = type(model_config) in FEATURE_EXTRACTOR_MAPPING or feature_extractor is not None
    ...
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_identifier, revision=revision, use_fast=use_fast, _from_pipeline=task, **tokenizer_kwargs)
    ...
    feature_extractor = AutoFeatureExtractor.from_pretrained(feature_extractor, revision=revision, _from_pipeline=task, **model_kwargs)
    if (feature_extractor._processor_class
        and feature_extractor._processor_class.endswith("WithLM")
	    and isinstance(model_name, str)):
        decoder = BeamSearchDecoderCTC.load_from_dir(model_name)
     ...
     return pipeline_class(model=model, framework=framework, task=task, **kwargs)

From the code above we learn that "sentiment-analysis" is automatically mapped to TextClassificationPipeline, and that when the pipeline is initialized it looks up the corresponding tokenizer, model and feature_extractor. Now let's look at the definition of TextClassificationPipeline:

def softmax(_outputs):
     maxes = np.max(_outputs, axis=-1, keepdims=True)
     shifted_exp = np.exp(_outputs - maxes)
     return shifted_exp / shifted_exp.sum(axis=-1, keepdims=True)

class TextClassificationPipeline(Pipeline):
    """
        Text classification pipeline using any `ModelForSequenceClassification`.
        If multiple classification labels are available (`model.config.num_labels >= 2`), the pipeline will run a softmax
        over the results. If there is a single label, the pipeline will run a sigmoid over the result.
        The models that this pipeline can use are models that have been fine-tuned on a sequence classification task. 
        See the up-to-date list of available models on [huggingface.co/models](https://huggingface.co/models?filter=text-classification).
    """
    
    def preprocess(self, inputs, **tokenizer_kwargs) -> Dict[str, GenericTensor]:
        return self.tokenizer(inputs, return_tensors=return_tensors, **tokenizer_kwargs)
    
    def _forward(self, model_inputs):
        return self.model(**model_inputs)

    def postprocess(self, model_outputs, function_to_apply=None, return_all_scores=False):
       outputs = model_outputs["logits"][0]
       outputs = outputs.numpy()

       if function_to_apply == ClassificationFunction.SIGMOID:
           scores = sigmoid(outputs)
       elif function_to_apply == ClassificationFunction.SOFTMAX:
           scores = softmax(outputs)
       elif function_to_apply == ClassificationFunction.NONE:
           scores = outputs
       else:
           raise ValueError(f"Unrecognized `function_to_apply` argument: {function_to_apply}")

       if return_all_scores:
           return [{"label": self.model.config.id2label[i], "score": score.item()} for i, score in enumerate(scores)]
       else:
           return {"label": self.model.config.id2label[scores.argmax().item()], "score": scores.max().item()}

The softmax function above is a numerically stable softmax; see numerically-stable-softmax for details. This code shows how TextClassificationPipeline does its preprocessing, prediction and post-processing. But how are these steps chained together? For that we need to look at its base class, Pipeline:

class Pipeline(_ScikitCompat):
   """
   Workflow:
     Input -> Tokenization -> Model Inference -> Post-Processing (task dependent) -> Output
   """
  
   def __init__(
            self,
	        model: Union["PreTrainedModel", "TFPreTrainedModel"],
	        tokenizer: Optional[PreTrainedTokenizer] = None,
	        feature_extractor: Optional[PreTrainedFeatureExtractor] = None,
	        modelcard: Optional[ModelCard] = None,
	        framework: Optional[str] = None,
	        task: str = "",
	        args_parser: ArgumentHandler = None,
	        device: int = -1,
	        binary_output: bool = False,
	        **kwargs,
   ):
       ...
   
   @abstractmethod
   def preprocess(self, input_: Any, **preprocess_parameters: Dict) -> Dict[str, GenericTensor]:
       """
       Preprocess will take the `input_` of a specific pipeline and return a dictionary of everything necessary for `_forward` to run properly.
       """

   @abstractmethod
   def _forward(self, input_tensors: Dict[str, GenericTensor], **forward_parameters: Dict) -> ModelOutput:
       """
       _forward will receive the prepared dictionary from `preprocess` and run it on the model.
       """

   @abstractmethod
   def postprocess(self, model_outputs: ModelOutput, **postprocess_parameters: Dict) -> Any:
       """
       Postprocess will receive the raw outputs of the `_forward` method, generally tensors, and reformat them into
       something more friendly. Generally it will output a list or a dict of results (containing just strings and
       numbers).
       """

   def forward(self, model_inputs, **forward_params):
       """
       forward is basically the same as `_forward` but contains additional code surrounding `_forward`, making sure tensors and models are on the same device and
       disabling the training part of the code (leading to faster inference).
       """
       
   def run_multi(self, inputs, preprocess_params, forward_params, postprocess_params):
	   return [self.run_single(item, preprocess_params, forward_params, postprocess_params) for item in inputs]
	
   def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
       model_inputs = self.preprocess(inputs, **preprocess_params)
       model_outputs = self.forward(model_inputs, **forward_params)
       outputs = self.postprocess(model_outputs, **postprocess_params)
       return outputs
   
   def get_iterator(self, inputs, num_workers: int, batch_size: int, preprocess_params, forward_params, postprocess_params):
       if isinstance(inputs, collections.abc.Sized):
           dataset = PipelineDataset(inputs, self.preprocess, preprocess_params)
       else:
           dataset = PipelineIterator(inputs, self.preprocess, preprocess_params)
       collate_fn = no_collate_fn if batch_size == 1 else pad_collate_fn(self.tokenizer, self.feature_extractor)
       dataloader = DataLoader(dataset, num_workers=num_workers, batch_size=batch_size, collate_fn=collate_fn)
       model_iterator = PipelineIterator(dataloader, self.forward, forward_params, loader_batch_size=batch_size)
       final_iterator = PipelineIterator(model_iterator, self.postprocess, postprocess_params)
       return final_iterator
  
  def __call__(self, inputs, *args, num_workers=None, batch_size=None, **kwargs):
      if is_list:
         if can_use_iterator:
             final_iterator = self.get_iterator(
                 inputs, num_workers, batch_size, preprocess_params, forward_params, postprocess_params
             )
             outputs = [output for output in final_iterator]
             return outputs
         else:
             return self.run_multi(inputs, preprocess_params, forward_params, postprocess_params)
      elif can_use_iterator:
         return self.get_iterator(
             inputs, num_workers, batch_size, preprocess_params, forward_params, postprocess_params
         )
      elif is_iterable:
         return self.iterate(inputs, preprocess_params, forward_params, postprocess_params)
      else:
         return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  

This code shows how Pipeline chains preprocess -> forward -> postprocess together, and it forms the basic skeleton of every pipeline.
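
As a sanity check on that skeleton, the following sketch reproduces run_single by hand for the sentiment-analysis example, calling the tokenizer, the model and a softmax directly (it assumes the same distilbert-base-uncased-finetuned-sst-2-english checkpoint used above):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# preprocess: raw text -> tensors of token ids
inputs = tokenizer(["I hate this so much!"], padding=True, truncation=True, return_tensors="pt")

# _forward: token ids -> logits
with torch.no_grad():
    logits = model(**inputs).logits

# postprocess: logits -> a human-readable label and score
probs = torch.softmax(logits, dim=-1)
idx = probs.argmax(dim=-1).item()
print({"label": model.config.id2label[idx], "score": probs[0, idx].item()})
# {'label': 'NEGATIVE', 'score': 0.9994...}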

That covers the basic principles and usage of Pipeline. Next, let's look at the most central part of a pipeline: the transformer model itself.

Transformer Models

Let's start with a brief history of Transformer models:


The Transformer architecture was introduced in June 2017 in Attention Is All You Need, originally to solve translation tasks. A series of related models followed:

  • June 2018, GPT: the first pre-trained Transformer model that could be used for a wide range of NLP tasks.
  • October 2018, BERT: another large pre-trained model.
  • February 2019, GPT-2: an improved (and larger) version of GPT.
  • October 2019, DistilBERT: a distilled version of BERT that is 60% faster and 40% lighter in memory while retaining 97% of BERT's performance.
  • October 2019, BART and T5: two large pre-trained models that use the same (encoder-decoder) architecture as the original Transformer.
  • May 2020, GPT-3: an even larger version of GPT-2 that can perform a variety of tasks without fine-tuning (zero-shot learning).

Broadly speaking, Transformer models fall into three categories:

  • GPT-like models: known as auto-regressive models.
  • BERT-like models: known as auto-encoding models.
  • BART/T5-like models: known as sequence-to-sequence models.

All of these are language models trained in a self-supervised way: the training labels are derived automatically from the raw data, so no human annotation is needed. Such a language model, however, is not directly useful for a concrete NLP task. To apply it to a specific task, we annotate data for that task and fine-tune the language model through transfer learning.

Aside
The large corpora used for pre-training are scraped from the web, so the resulting models can produce racist or sexist content. Even fine-tuning on a curated corpus does not fundamentally remove this problem.

A Transformer model consists mainly of two parts (as described below):

  • Encoder: understands the input and builds a representation of it.
  • Decoder: uses the encoder's representation, together with other inputs, to generate the target sequence.

Depending on the task, the encoder and decoder can also be used on their own:

  • Encoder-only models: good for tasks that require understanding the input, such as text classification and named entity recognition. Representative models include ALBERT, BERT, DistilBERT, ELECTRA and RoBERTa.
  • Decoder-only models: good for generative tasks such as text generation. Representative models include CTRL, GPT, GPT-2 and Transformer-XL.
  • Encoder-decoder models: good for generative tasks that require an input, such as translation, summarization and question answering. Representative models include BART, mBART, Marian and T5.

An important structure in the Transformer is the "attention layer". It tells the model which words in the sentence to pay special attention to. For example, when translating "You like this course" into French, the translation of "like" must attend to the adjacent word "You", because in French the verb "like" is conjugated differently depending on the subject, while the rest of the sentence matters little for translating that particular word. Similarly, when translating "this", the model needs to attend to "course", because "this" is translated differently depending on whether the associated noun is masculine or feminine. With more complex sentences, the model may need to attend to words that are very far from the word currently being translated. The notion of attention applies to many NLP tasks, because the precise meaning of a word is deeply influenced by its context.

Note that in a translation task, the encoder's attention layers may use all the words in the sentence, while each attention layer in the decoder can only see the words generated before the current position; in other words, the decoder's attention layers must not look at future words. An "attention mask" is used to enforce this.
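
A minimal sketch of such a mask, written as an additive mask in the same style as the attention_mask that the BERT code later in this article adds to the attention scores (the numbers here are random stand-ins, not taken from the library):

import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)  # stand-in for Q @ K^T / sqrt(d_k)

# Additive mask: 0 where attention is allowed, -inf where it is not,
# so softmax assigns zero weight to future positions.
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
probs = torch.softmax(scores + causal_mask, dim=-1)
print(probs)  # row i has non-zero weights only for positions j <= i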

The architecture figure from the original paper shows the detailed structure of the Transformer model:

From it we can see that the first attention layer of a decoder block attends over the decoder's past inputs, while its second attention layer uses the encoder's output, so the decoder can make use of the entire input sentence.

Model Code

Now let's dig into the HuggingFace code to see how a Transformer model works. From the pipeline code we know that the base class of every Transformer model is PreTrainedModel, whose main code is as follows:

class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMixin):
    def __init__(self, config: PretrainedConfig, *inputs, **kwargs):
  	    super().__init__()
        self.config = config
        self.name_or_path = config.name_or_path
        
    def post_init(self):
        self.init_weights()
	    self._backward_compatibility_gradient_checkpointing()


    @classmethod
	def _from_config(cls, config, **kwargs):
        if is_deepspeed_zero3_enabled():
            with deepspeed.zero.Init(config_dict_or_path=deepspeed_config()):
	            model = cls(config, **kwargs)
        else:
	        model = cls(config, **kwargs)


    @property
	def base_model(self) -> nn.Module:
        return getattr(self, self.base_model_prefix, self)
        
    def get_input_embeddings(self) -> nn.Module:
        """
        Returns:
            `nn.Module`: A torch module mapping vocabulary to hidden states.
        """
        return self.base_model.get_input_embeddings()

    def set_input_embeddings(self, value: nn.Module):
        self.base_model.set_input_embeddings(value)
        
    def get_output_embeddings(self) -> nn.Module:
        """
        Returns:
            `nn.Module`: A torch module mapping hidden states to vocabulary.
        """
        return None
        
    def tie_weights(self):
       """
       Tie the weights between the input embeddings and the output embeddings.
       """
       if getattr(self.config, "tie_word_embeddings", True):
           output_embeddings = self.get_output_embeddings()
           if output_embeddings is not None:
               self._tie_or_clone_weights(output_embeddings, self.get_input_embeddings())
      
       if getattr(self.config, "is_encoder_decoder", False) and getattr(self.config, "tie_encoder_decoder", False):
           if hasattr(self, self.base_model_prefix):
               self = getattr(self, self.base_model_prefix)
           self._tie_encoder_decoder_weights(self.encoder, self.decoder, self.base_model_prefix)
      
       for module in self.modules():
           if hasattr(module, "_tie_weights"):
               module._tie_weights()
               
    def resize_token_embeddings(self, new_num_tokens: Optional[int] = None) -> nn.Embedding:
        """
        Resizes input token embeddings matrix of the model if `new_num_tokens != config.vocab_size`.
        Takes care of tying weights embeddings afterwards if the model class has a `tie_weights()` method.
        Return:
            `torch.nn.Embedding`: Pointer to the input tokens Embeddings Module of the model.
        """
        model_embeds = self._resize_token_embeddings(new_num_tokens)
        self.config.vocab_size = new_num_tokens
        self.vocab_size = new_num_tokens
        self.tie_weights()
        return model_embeds
 
    def init_weights(self):
       """
       If needed prunes and maybe initializes weights.
       """
       if self.config.pruned_heads:
           self.prune_heads(self.config.pruned_heads)
      
       if _init_weights:
           self.apply(self._init_weights)
           self.tie_weights()


    def save_pretrained(
	        self,
	        save_directory: Union[str, os.PathLike],
	        is_main_process: bool = True,
	        state_dict: Optional[dict] = None,
	        save_function: Callable = torch.save,
	        push_to_hub: bool = False,
	        max_shard_size: Union[int, str] = "10GB",
	        **kwargs,
    ):
        """
        Save a model and its configuration file to a directory, so that it can be re-loaded using the
        [`~PreTrainedModel.from_pretrained`] class method.
        """
        # Shard the model if it is too big.
	    shards, index = shard_checkpoint(state_dict, max_shard_size=max_shard_size)
        # Save the model
        for shard_file, shard in shards.items():
            save_function(shard, os.path.join(save_directory, shard_file))
    
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.PathLike]], *model_args, **kwargs):
        """
        Instantiate a pretrained pytorch model from a pre-trained model configuration.
        """
        state_dict = load_state_dict(resolved_archive_file)
        loaded_state_dict_keys = [k for k in state_dict.keys()]
        # Instantiate model.
        model = cls(config, *model_args, **model_kwargs)
        # Load weights
        model, missing_keys, unexpected_keys, mismatched_keys, error_msgs = cls._load_pretrained_model(
	                model,
	                state_dict,
	                loaded_state_dict_keys,
	                resolved_archive_file,
	                pretrained_model_name_or_path,
	                ignore_mismatched_sizes=ignore_mismatched_sizes,
	                sharded_metadata=sharded_metadata,
	                _fast_init=_fast_init,
	                low_cpu_mem_usage=low_cpu_mem_usage,
	            )
        model.tie_weights()
        # Set model in evaluation mode to deactivate DropOut modules by default
	    model.eval()
        return model

From this code we learn that PreTrainedModel inherits from nn.Module and is the base class of all models. The model structure is built from a PretrainedConfig, and the weights are loaded from the resolved_archive_file checkpoint. PretrainedConfig holds the model's main configuration, including vocab_size, hidden_size, num_attention_heads, num_hidden_layers and so on.
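
A small usage sketch of this config/weights split, assuming the standard bert-base-uncased checkpoint is available:

from transformers import AutoConfig, AutoModel

# The structure comes from the config, the weights from the checkpoint files.
config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.vocab_size, config.hidden_size, config.num_hidden_layers, config.num_attention_heads)

model = AutoModel.from_pretrained("bert-base-uncased")  # loads the weights and calls eval()
model.save_pretrained("./my-bert")                      # writes config.json plus the weight file(s)
reloaded = AutoModel.from_pretrained("./my-bert")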

Next, let's take BERT as an example and see how it is implemented in Transformers (modeling_bert).

  • BertEmbeddings turns the input into an embedding representation. From its forward method we can see that the embedding is the sum of word_embeddings, token_type_embeddings and the absolute position_embeddings (with relative position embeddings, the positional information is added inside the attention layer instead, as shown in the code further below).
class BertEmbeddings(nn.Module):
    """Construct the embeddings from word, position and token_type embeddings."""

    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
	    self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
	    self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
	    self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
        
    def forward(
	        self,
	        input_ids: Optional[torch.LongTensor] = None,
	        token_type_ids: Optional[torch.LongTensor] = None,
	        position_ids: Optional[torch.LongTensor] = None,
	        inputs_embeds: Optional[torch.FloatTensor] = None,
	        past_key_values_length: int = 0,
	    ) -> torch.Tensor:
        input_shape = input_ids.size()
        seq_length = input_shape[1]
        inputs_embeds = self.word_embeddings(input_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)
        embeddings = inputs_embeds + token_type_embeddings
        if self.position_embedding_type == "absolute":
	            position_embeddings = self.position_embeddings(position_ids)
	            embeddings += position_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

Note that the embedding layer also applies LayerNorm, which operates along the last dimension of the input ([batch_size, seq_len, hidden_size]): for each hidden_size-long vector it computes the mean and variance and normalizes the vector as $\frac{x - E[x]}{\sqrt{\operatorname{Var}[x] + \epsilon}}$. After LayerNorm, each embedding vector is normalized to zero mean and unit variance.
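
A minimal sketch, using random tensors, that checks nn.LayerNorm against the formula above (LayerNorm's learnable weight and bias default to 1 and 0, so the two results should match):

import torch

batch_size, seq_len, hidden_size = 2, 8, 16
embeddings = torch.randn(batch_size, seq_len, hidden_size)

layer_norm = torch.nn.LayerNorm(hidden_size, eps=1e-12)
out = layer_norm(embeddings)

# Manual normalization over the last dimension, matching the formula above.
mean = embeddings.mean(dim=-1, keepdim=True)
var = embeddings.var(dim=-1, unbiased=False, keepdim=True)
manual = (embeddings - mean) / torch.sqrt(var + 1e-12)
print(torch.allclose(out, manual, atol=1e-5))  # True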

  • BertSelfAttention is the core module that computes attention, corresponding to the "Multi-head attention" block in the architecture figure above. Attention is computed as $\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$ (a small sketch of this formula appears after the code below).

In the BertSelfAttention module, hidden_states provides "Q"; in self-attention, "K" and "V" also come from hidden_states, while in cross-attention they come from encoder_hidden_states.

class BertSelfAttention(nn.Module):
    def __init__(self, config, position_embedding_type=None):
        super().__init__()
        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size
        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)
        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.position_embedding_type = position_embedding_type or getattr(config, "position_embedding_type", "absolute")
        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            self.max_position_embeddings = config.max_position_embeddings
            self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)
        self.is_decoder = config.is_decoder
        
    def forward(
	        self,
	        hidden_states: torch.Tensor,
	        attention_mask: Optional[torch.FloatTensor] = None,
	        head_mask: Optional[torch.FloatTensor] = None,
	        encoder_hidden_states: Optional[torch.FloatTensor] = None,
	        encoder_attention_mask: Optional[torch.FloatTensor] = None,
	        past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
	        output_attentions: Optional[bool] = False,
	    ) -> Tuple[torch.Tensor]:


        mixed_query_layer = self.query(hidden_states)
        is_cross_attention = encoder_hidden_states is not None
        if is_cross_attention and past_key_value is not None:
             # reuse k,v, cross_attentions
             key_layer = past_key_value[0]
             value_layer = past_key_value[1]
             attention_mask = encoder_attention_mask
        elif is_cross_attention:
             key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))
             value_layer = self.transpose_for_scores(self.value(encoder_hidden_states))
             attention_mask = encoder_attention_mask
        elif past_key_value is not None:
             key_layer = self.transpose_for_scores(self.key(hidden_states))
             value_layer = self.transpose_for_scores(self.value(hidden_states))
             key_layer = torch.cat([past_key_value[0], key_layer], dim=2)
             value_layer = torch.cat([past_key_value[1], value_layer], dim=2)
        else:
             key_layer = self.transpose_for_scores(self.key(hidden_states))
             value_layer = self.transpose_for_scores(self.value(hidden_states))
        
        query_layer = self.transpose_for_scores(mixed_query_layer)
        if self.is_decoder:
             past_key_value = (key_layer, value_layer) 
            
	    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            seq_length = hidden_states.size()[1]
            position_ids_l = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)
            position_ids_r = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(1, -1)
            distance = position_ids_l - position_ids_r
            positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1)
            positional_embedding = positional_embedding.to(dtype=query_layer.dtype)  # fp16 compatibility
            if self.position_embedding_type == "relative_key":
                relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores
            elif self.position_embedding_type == "relative_key_query":
                relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        if attention_mask is not None:
            attention_scores = attention_scores + attention_mask
        attention_probs = nn.functional.softmax(attention_scores, dim=-1)
        attention_probs = self.dropout(attention_probs)        
        if head_mask is not None:
            attention_probs = attention_probs * head_mask
        context_layer = torch.matmul(attention_probs, value_layer)
	
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(new_context_layer_shape)
        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
        
        if self.is_decoder:
           outputs = outputs + (past_key_value,)
        return outputs

The input is hidden_states, which may be the embeddings of either the encoder or the decoder. The first part of forward computes the Q, K and V vectors; note that when the module is used in a decoder, past_key_value caches the historical K and V vectors to reduce computation. The rest of forward computes the attention itself, including the branch that folds relative position embeddings into the attention scores.
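
A minimal sketch of the core computation above, ignoring masks, dropout and relative position embeddings, with made-up shapes:

import math
import torch

batch, heads, seq_len, head_dim = 1, 2, 4, 8
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(head_dim)  # (batch, heads, seq, seq)
probs = torch.softmax(scores, dim=-1)
context = torch.matmul(probs, v)                                     # (batch, heads, seq, head_dim)
print(context.shape)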

  • BertAttention adds the "Add & norm" block from the architecture figure above on top of BertSelfAttention; the "Add & norm" step is implemented by BertSelfOutput. The BertAttention module can also drop some attention heads to save memory (prune_heads); a short usage sketch follows the code below.
class BertSelfOutput(nn.Module):
    def __init__(self, config):
       super().__init__()
       self.dense = nn.Linear(config.hidden_size, config.hidden_size)
       self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
       self.dropout = nn.Dropout(config.hidden_dropout_prob)
    
    def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
       hidden_states = self.dense(hidden_states)
       hidden_states = self.dropout(hidden_states)
       hidden_states = self.LayerNorm(hidden_states + input_tensor)
       return hidden_states


class BertAttention(nn.Module):
    def __init__(self, config, position_embedding_type=None):
       super().__init__()
       self.self = BertSelfAttention(config, position_embedding_type=position_embedding_type)
       self.output = BertSelfOutput(config)
       self.pruned_heads = set()
    
    def prune_heads(self, heads):
       if len(heads) == 0:
           return
       heads, index = find_pruneable_heads_and_indices(
           heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads
       )
    
       # Prune linear layers
       self.self.query = prune_linear_layer(self.self.query, index)
       self.self.key = prune_linear_layer(self.self.key, index)
       self.self.value = prune_linear_layer(self.self.value, index)
       self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)
    
       # Update hyper params and store pruned heads
       self.self.num_attention_heads = self.self.num_attention_heads - len(heads)
       self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads
       self.pruned_heads = self.pruned_heads.union(heads)
    
    def forward(
       self,
       hidden_states: torch.Tensor,
       attention_mask: Optional[torch.FloatTensor] = None,
       head_mask: Optional[torch.FloatTensor] = None,
       encoder_hidden_states: Optional[torch.FloatTensor] = None,
       encoder_attention_mask: Optional[torch.FloatTensor] = None,
       past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
       output_attentions: Optional[bool] = False,
    ) -> Tuple[torch.Tensor]:
       self_outputs = self.self(
           hidden_states,
           attention_mask,
           head_mask,
           encoder_hidden_states,
           encoder_attention_mask,
           past_key_value,
           output_attentions,
       )
       attention_output = self.output(self_outputs[0], hidden_states)
       outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them
       return outputs
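
As a short usage sketch of head pruning, the snippet below calls the prune_heads API exposed by PreTrainedModel; the layer and head indices are arbitrary:

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
# Remove heads 0 and 2 of layer 0, and head 1 of layer 2.
model.prune_heads({0: [0, 2], 2: [1]})
print(model.encoder.layer[0].attention.self.num_attention_heads)  # 10 instead of 12
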
  • BertLayer defines one complete block from the architecture figure above: the self.attention call is the encoder's or decoder's self-attention "Multi-head attention + Add & norm" block, the self.crossattention call is the decoder's cross-attention "Multi-head attention + Add & norm" block, and feed_forward_chunk is the "Feed Forward + Add & norm" block.
class BertLayer(nn.Module):
     def __init__(self, config):
         super().__init__()
         self.chunk_size_feed_forward = config.chunk_size_feed_forward
         self.seq_len_dim = 1
         self.attention = BertAttention(config)
         self.is_decoder = config.is_decoder
         self.add_cross_attention = config.add_cross_attention
         if self.add_cross_attention:
             self.crossattention = BertAttention(config, position_embedding_type="absolute")
         self.intermediate = BertIntermediate(config)
         self.output = BertOutput(config)
    
     def forward(
         self,
         hidden_states: torch.Tensor,
         attention_mask: Optional[torch.FloatTensor] = None,
         head_mask: Optional[torch.FloatTensor] = None,
         encoder_hidden_states: Optional[torch.FloatTensor] = None,
         encoder_attention_mask: Optional[torch.FloatTensor] = None,
         past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
         output_attentions: Optional[bool] = False,
     ) -> Tuple[torch.Tensor]:
     
         self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None
         self_attention_outputs = self.attention(
             hidden_states,
             attention_mask,
             head_mask,
             output_attentions=output_attentions,
             past_key_value=self_attn_past_key_value,
         )
         attention_output = self_attention_outputs[0]
         # if decoder, the last output is tuple of self-attn cache
         if self.is_decoder:
             outputs = self_attention_outputs[1:-1]
             present_key_value = self_attention_outputs[-1]
         else:
             outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights
    
         cross_attn_present_key_value = None
         if self.is_decoder and encoder_hidden_states is not None:
             # cross_attn cached key/values tuple is at positions 3,4 of past_key_value tuple
             cross_attn_past_key_value = past_key_value[-2:] if past_key_value is not None else None
             cross_attention_outputs = self.crossattention(
                 attention_output,
                 attention_mask,
                 head_mask,
                 encoder_hidden_states,
                 encoder_attention_mask,
                 cross_attn_past_key_value,
                 output_attentions,
             )
             attention_output = cross_attention_outputs[0]
             outputs = outputs + cross_attention_outputs[1:-1]  # add cross attentions if we output attention weights
    
             # add cross-attn cache to positions 3,4 of present_key_value tuple
             cross_attn_present_key_value = cross_attention_outputs[-1]
             present_key_value = present_key_value + cross_attn_present_key_value
    
         layer_output = apply_chunking_to_forward(
             self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
         )
         outputs = (layer_output,) + outputs
    
         # if decoder, return the attn key/values as the last output
         if self.is_decoder:
             outputs = outputs + (present_key_value,)
    
         return outputs
    
     def feed_forward_chunk(self, attention_output):
         intermediate_output = self.intermediate(attention_output)
         layer_output = self.output(intermediate_output, attention_output)
         return layer_output
  • BertEncoder stacks N BertLayer modules to build the encoder (or decoder) stack.
class BertEncoder(nn.Module):
    def __init__(self, config):
       super().__init__()
       self.config = config
       self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])
       self.gradient_checkpointing = False
    
    def forward(
       self,
       hidden_states: torch.Tensor,
       attention_mask: Optional[torch.FloatTensor] = None,
       head_mask: Optional[torch.FloatTensor] = None,
       encoder_hidden_states: Optional[torch.FloatTensor] = None,
       encoder_attention_mask: Optional[torch.FloatTensor] = None,
       past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
       use_cache: Optional[bool] = None,
       output_attentions: Optional[bool] = False,
       output_hidden_states: Optional[bool] = False,
       return_dict: Optional[bool] = True,
    ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPastAndCrossAttentions]:
       all_hidden_states = () if output_hidden_states else None
       all_self_attentions = () if output_attentions else None
       all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None
    
       next_decoder_cache = () if use_cache else None
       for i, layer_module in enumerate(self.layer):
           if output_hidden_states:
               all_hidden_states = all_hidden_states + (hidden_states,)
    
           layer_head_mask = head_mask[i] if head_mask is not None else None
           past_key_value = past_key_values[i] if past_key_values is not None else None
    
           if self.gradient_checkpointing and self.training:
               def create_custom_forward(module):
                   def custom_forward(*inputs):
                       return module(*inputs, past_key_value, output_attentions)
                   return custom_forward
    
               layer_outputs = torch.utils.checkpoint.checkpoint(
                   create_custom_forward(layer_module),
                   hidden_states,
                   attention_mask,
                   layer_head_mask,
                   encoder_hidden_states,
                   encoder_attention_mask,
               )
           else:
               layer_outputs = layer_module(
                   hidden_states,
                   attention_mask,
                   layer_head_mask,
                   encoder_hidden_states,
                   encoder_attention_mask,
                   past_key_value,
                   output_attentions,
               )
    
           hidden_states = layer_outputs[0]
           if use_cache:
               next_decoder_cache += (layer_outputs[-1],)
           if output_attentions:
               all_self_attentions = all_self_attentions + (layer_outputs[1],)
               if self.config.add_cross_attention:
                   all_cross_attentions = all_cross_attentions + (layer_outputs[2],)
    
       if output_hidden_states:
           all_hidden_states = all_hidden_states + (hidden_states,)
    
       return BaseModelOutputWithPastAndCrossAttentions(
           last_hidden_state=hidden_states,
           past_key_values=next_decoder_cache,
           hidden_states=all_hidden_states,
           attentions=all_self_attentions,
           cross_attentions=all_cross_attentions,
       )
  • BertModel wires the inputs, the embeddings and the BertLayer stack into a complete model. BertPooler takes the hidden state of the first token from the [batch_size, seq_len, hidden_size] output of the BertLayer stack and uses it as the pooled output (a short usage sketch follows the code below).
class BertPooler(nn.Module):
    def __init__(self, config):
       super().__init__()
       self.dense = nn.Linear(config.hidden_size, config.hidden_size)
       self.activation = nn.Tanh()
    
    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
       # We "pool" the model by simply taking the hidden state corresponding
       # to the first token.
       first_token_tensor = hidden_states[:, 0]
       pooled_output = self.dense(first_token_tensor)
       pooled_output = self.activation(pooled_output)
       return pooled_output


class BertModel(BertPreTrainedModel):
    def __init__(self, config, add_pooling_layer=True):
        super().__init__(config)
        self.config = config
        self.embeddings = BertEmbeddings(config)
        self.encoder = BertEncoder(config)
        self.pooler = BertPooler(config)
        # Initialize weights and apply final processing
        self.post_init()
        
    def forward(
	        self,
	        input_ids: Optional[torch.Tensor] = None,
	        attention_mask: Optional[torch.Tensor] = None,
	        token_type_ids: Optional[torch.Tensor] = None,
	        position_ids: Optional[torch.Tensor] = None,
	        head_mask: Optional[torch.Tensor] = None,
	        inputs_embeds: Optional[torch.Tensor] = None,
	        encoder_hidden_states: Optional[torch.Tensor] = None,
	        encoder_attention_mask: Optional[torch.Tensor] = None,
	        past_key_values: Optional[List[torch.FloatTensor]] = None,
	        use_cache: Optional[bool] = None,
	        output_attentions: Optional[bool] = None,
	        output_hidden_states: Optional[bool] = None,
	        return_dict: Optional[bool] = None,
	    ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPoolingAndCrossAttentions]:
        
        embedding_output = self.embeddings(
	            input_ids=input_ids,
	            position_ids=position_ids,
	            token_type_ids=token_type_ids,
	            inputs_embeds=inputs_embeds,
	            past_key_values_length=past_key_values_length,
	        )
        encoder_outputs = self.encoder(
             embedding_output,
             attention_mask=extended_attention_mask,
             head_mask=head_mask,
             encoder_hidden_states=encoder_hidden_states,
             encoder_attention_mask=encoder_extended_attention_mask,
             past_key_values=past_key_values,
             use_cache=use_cache,
             output_attentions=output_attentions,
             output_hidden_states=output_hidden_states,
             return_dict=return_dict,
        )
        sequence_output = encoder_outputs[0]
        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
        return BaseModelOutputWithPoolingAndCrossAttentions(
	            last_hidden_state=sequence_output,
	            pooler_output=pooled_output,
	            past_key_values=encoder_outputs.past_key_values,
	            hidden_states=encoder_outputs.hidden_states,
	            attentions=encoder_outputs.attentions,
	            cross_attentions=encoder_outputs.cross_attentions,
	        )
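
A short usage sketch of BertModel itself, assuming the bert-base-uncased checkpoint:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello, Transformers!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch_size, seq_len, hidden_size)
print(outputs.pooler_output.shape)      # (batch_size, hidden_size), from the first ([CLS]) token
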
  • BertForPreTraining adds a BertPreTrainingHeads structure on top of BertModel for multi-task pre-training. BERT is pre-trained with two tasks: a masked language model task (the predictions head) and a next sentence prediction task (the seq_relationship head).
class BertPredictionHeadTransform(nn.Module):
     def __init__(self, config):
         super().__init__()
         self.dense = nn.Linear(config.hidden_size, config.hidden_size)
         if isinstance(config.hidden_act, str):
             self.transform_act_fn = ACT2FN[config.hidden_act]
         else:
             self.transform_act_fn = config.hidden_act
         self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
    
     def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
         hidden_states = self.dense(hidden_states)
         hidden_states = self.transform_act_fn(hidden_states)
         hidden_states = self.LayerNorm(hidden_states)
         return hidden_states
	
class BertLMPredictionHead(nn.Module):
     def __init__(self, config):
         super().__init__()
         self.transform = BertPredictionHeadTransform(config)
         self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
         self.bias = nn.Parameter(torch.zeros(config.vocab_size))  
         self.decoder.bias = self.bias
  
     def forward(self, hidden_states):
         hidden_states = self.transform(hidden_states)
         hidden_states = self.decoder(hidden_states)
         return hidden_states
      
class BertPreTrainingHeads(nn.Module):
    def __init__(self, config):
         super().__init__()
         self.predictions = BertLMPredictionHead(config)
         self.seq_relationship = nn.Linear(config.hidden_size, 2)
    
    def forward(self, sequence_output, pooled_output):
         prediction_scores = self.predictions(sequence_output)
         seq_relationship_score = self.seq_relationship(pooled_output)
         return prediction_scores, seq_relationship_score   


class BertForPreTraining(BertPreTrainedModel):
    def __init__(self, config):
         super().__init__(config)
         self.bert = BertModel(config)
         self.cls = BertPreTrainingHeads(config)
         self.post_init()
         
    def forward(
	        self,
	        input_ids: Optional[torch.Tensor] = None,
	        attention_mask: Optional[torch.Tensor] = None,
	        token_type_ids: Optional[torch.Tensor] = None,
	        position_ids: Optional[torch.Tensor] = None,
	        head_mask: Optional[torch.Tensor] = None,
	        inputs_embeds: Optional[torch.Tensor] = None,
	        labels: Optional[torch.Tensor] = None,
	        next_sentence_label: Optional[torch.Tensor] = None,
	        output_attentions: Optional[bool] = None,
	        output_hidden_states: Optional[bool] = None,
	        return_dict: Optional[bool] = None,
	    ) -> Union[Tuple[torch.Tensor], BertForPreTrainingOutput]:


        outputs = self.bert(
	            input_ids,
	            attention_mask=attention_mask,
	            token_type_ids=token_type_ids,
	            position_ids=position_ids,
	            head_mask=head_mask,
	            inputs_embeds=inputs_embeds,
	            output_attentions=output_attentions,
	            output_hidden_states=output_hidden_states,
	            return_dict=return_dict,
	        )
        sequence_output, pooled_output = outputs[:2]
	    prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)


        total_loss = None
        if labels is not None and next_sentence_label is not None:
            loss_fct = CrossEntropyLoss()
            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))
            total_loss = masked_lm_loss + next_sentence_loss
            
        return BertForPreTrainingOutput(
	            loss=total_loss,
	            prediction_logits=prediction_scores,
	            seq_relationship_logits=seq_relationship_score,
	            hidden_states=outputs.hidden_states,
	            attentions=outputs.attentions,
	        )

In the code above, forward first obtains the outputs of the two heads and then computes the loss from those outputs and the labels. Other task-specific models are built in the same spirit, for example BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification and BertForQuestionAnswering.
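
As a sketch of one such task head, the snippet below loads BertForSequenceClassification and lets it compute the loss from a label; the classification head is freshly initialized here, so the outputs are only illustrative:

import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("I love this movie!", return_tensors="pt")
outputs = model(**inputs, labels=torch.tensor([1]))
print(outputs.loss)    # cross-entropy between the logits and the given label
print(outputs.logits)  # shape (1, num_labels)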

The Transformers library supports many encoder, decoder and sequence-to-sequence models; we briefly introduce them below.

Encoder Models

  • BERT: uses only the encoder part of the Transformer and is trained with two objectives: next sentence prediction and masked token prediction.
  • ALBERT: improves on BERT in three ways. First, factorized embedding parameterization: with vocabulary size V, embedding dimension E and hidden dimension H, BERT sets E = H, while ALBERT makes E smaller and inserts a projection layer between the embeddings and the hidden layer to connect them. Second, cross-layer parameter sharing: the feed-forward and attention parameters are shared across layers, so ALBERT still has a deep stack of layers but every layer uses the same parameters. Third, the sentence-order prediction (SOP) objective, which is harder than BERT's next sentence prediction task.
  • DistilBERT: compresses the model with knowledge distillation. Compared with BERT-base, DistilBERT has only 6 layers, and the token-type embeddings and the pooler are removed. It is trained with three losses: the masked language modeling loss $L_{mlm}$, the same objective used to train BERT, computed on the masked tokens; the distillation loss $L_{ce}$, which compares the teacher's and the student's outputs before the softmax; and the cosine embedding loss $L_{cos}$, which compares the hidden vectors produced by the two models.

  • ELECTRA: proposes a new pre-training framework that combines a generator and a discriminator. The generator is a small masked language model trained to predict the masked tokens, producing "corrupted" tokens; its objective is the same as BERT's, namely recovering the original tokens at the masked positions. The discriminator receives the generator-corrupted input and must decide, for every token, whether it is original or replaced; this per-token binary classification gives the discriminator loss. Because masked language modeling learns contextual information effectively and therefore produces good embeddings, the generator's embeddings are shared with the discriminator (weight sharing). The small generator and the discriminator are trained jointly on the sum of the two losses, so as the generator improves, the discriminator's task gradually gets harder and it learns from harder tokens.
  • RoBERTa: improves BERT's pre-training procedure with larger mini-batches, more training data and dynamic masking. BERT randomly selects 15% of the tokens in each sequence to mask and, to reduce the mismatch with downstream tasks, replaces these tokens with [MASK] 80% of the time, leaves them unchanged 10% of the time, and replaces them with a random token the remaining 10% of the time. However, once chosen, the 15% of tokens never change during the whole training run of N epochs; this is static masking. RoBERTa instead copies the pre-training data 10 times and masks a different random 15% of tokens in each copy, so the same sentence is masked in 10 different ways, and each copy is trained for N/10 epochs. Over the N epochs the masked tokens of a given sequence therefore keep changing (a minimal dynamic-masking sketch follows this list).
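
A minimal sketch of dynamic masking using DataCollatorForLanguageModeling, which re-samples the masked positions on every call (the sentence and the masking probability are arbitrary):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

features = [tokenizer("The quick brown fox jumps over the lazy dog.")]
# Each call re-samples which tokens are replaced by [MASK].
print(tokenizer.decode(collator(features)["input_ids"][0]))
print(tokenizer.decode(collator(features)["input_ids"][0]))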

Decoder Models

  • GPT-2: GPT/GPT-2 use the decoder structure and predict a word from its left context only (BERT uses both left and right context), which makes them perform better than BERT on generation tasks.
  • CTRL: lets you control the type of generated text by prepending a description of that type to each sequence, so that the type interacts with every element of the sequence through attention.
  • Transformer-XL: uses a recurrence mechanism and a relative positional encoding scheme so that the model can accept much longer inputs.

Sequence-to-Sequence Models

  • BART: uses both the encoder and the decoder, so that performance on natural language understanding tasks does not degrade while natural language generation improves markedly. During pre-training, the original text is corrupted and then reconstructed, so the loss is the cross-entropy between the decoder's output and the original text. For downstream tasks, the decoder's final hidden state can be used for classification; the decoder's hidden states for all tokens can serve as per-token representations that are then classified individually; or the target text can be fed into the decoder for autoregressive generation.
  • mBART: a multilingual seq2seq translation system and the first method to pre-train a complete seq2seq model by denoising full texts in multiple languages. Denoising means taking a partially corrupted input and recovering the original, undistorted input, for example corrupting the sequence with [MASK] tokens and training the model to reconstruct it. mBART uses two noising functions: (1) sampling spans of tokens according to a Poisson distribution and masking them, and (2) permuting the order of the sentences within an input.
  • T5: short for "Text-to-Text Transfer Transformer". It proposes a unified framework that casts every NLP task as a text-to-text task. For English-German translation, you simply prepend "translate English to German:" to the input side of the training data: to translate "That is good", feed the model "translate English to German: That is good." and it directly outputs the German translation "Das ist gut". For sentiment classification, the input becomes "sentiment: This movie is terrible!" (just adding the "sentiment:" prefix) and the model outputs "negative". Most strikingly, even STS-B, a text semantic similarity task that requires a continuous-valued output, is handled by emitting text rather than adding a regression head: the range from 1 to 5 is split into 21 values at intervals of 0.2 and treated as classification targets. In this way every NLP task becomes a text-to-text problem, so the same model, the same loss function, the same training procedure and the same decoding procedure can be used for all of them.
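
A short sketch of this text-to-text usage with the t5-small checkpoint (the exact output text may vary slightly):

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The task is encoded as a plain-text prefix, so the same model, loss and
# decoding procedure handle translation, classification, and so on.
inputs = tokenizer("translate English to German: That is good.", return_tensors="pt")
outputs = model.generate(**inputs, max_length=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # e.g. "Das ist gut."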