This post is day 14 of my participation in the 掘金日新计划 (Juejin Daily Update Plan) August writing challenge.

Learning Hugging Face Transformers

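1. The importance of natural language processing

"Language understanding is the crown jewel of artificial intelligence." (Bill Gates)

Among unstructured data, text is the most abundant. It takes up far less storage than images or video, yet it carries the most information. Before artificial intelligence, machines could only process structured data (for example, data in an Excel sheet). Most data on the web, however, is unstructured: articles, images, audio, video, and so on. To analyze and make use of this text, we need **NLP** techniques that let machines understand the text and put it to use.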


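2. Why learn the transformers library

First: the arrival of the Transformer and BERT, two landmark models

Embracing the Transformer: of the three main feature extractors in NLP (CNN, RNN, and Transformer), the Transformer, the favorite of the last two years, is clearly on its way to becoming the mainstream feature extractor for NLP.
After Word2vec appeared, almost any target in AI could be turned into a vector, which is the familiar saying "everything can be embedded". The Transformer and BERT models were an even bigger milestone for NLP: combining the Transformer with a bidirectional language model yields BERT, a model that marked a new era in NLP and that, at the time of writing, achieves state-of-the-art results on a wide range of downstream NLP tasks.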


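Second: HuggingFace's transformers library is powerful

The GitHub repository currently has about 46k stars, with a very large and active user base and frequent maintenance and updates from the developers
It has a dedicated community and provides many pretrained models covering multiple languages and different frameworks
Many third-party pretrained-model frameworks are built on top of transformers
Its rich API is practical for both production use and research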

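3. Installing transformers

Installation commands:

Install with pip

pip install transformers

Install with conda

conda install -c huggingface transformers

In a notebook, install a pinned version from the Tsinghua PyPI mirror

!pip install -i https://pypi.tuna.tsinghua.edu.cn/simple transformers==4.3.1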

4. Loading the model


from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")
model

5. Environment information

import torch
from transformers import BertTokenizer
from IPython.display import clear_output

PRETRAINED_MODEL_NAME = "bert-base-chinese"  # pretrained BERT-Base model covering Traditional and Simplified Chinese

# Get the tokenizer used by this pretrained model
tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)

clear_output()
print("PyTorch version:", torch.__version__)

PyTorch version: 1.6.0

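6. Vocabulary information

Besides the Chinese BERT used in this article, the English bert-base-cased model is also commonly used in applications and research.

Now let's look at the vocabulary stored in the tokenizer:

vocab = tokenizer.vocab
print("Vocabulary size:", len(vocab))

Vocabulary size: 21128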

import random
random_tokens = random.sample(list(vocab), 10)
random_ids = [vocab[t] for t in random_tokens]

print("{0:20}{1:15}".format("token", "index"))
print("-" * 25)
for t, token_id in zip(random_tokens, random_ids):  # avoid shadowing the built-in id()
    print("{0:15}{1:10}".format(t, token_id))

token               index          
-------------------------
諸                    6328
##媛                 15113
##﹏                 21056
嶙                    2326
a6                  11716
##嫵                 15134
##颌                 20628
##搓                 16070
123456              10644
啥                    1567

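7. WordPieces

BERT uses the WordPiece tokenization originally proposed for Google NMT, which splits words into finer-grained wordpieces and so handles out-of-vocabulary (OOV) words effectively. For Chinese this works out to roughly character-level tokenization, and tokens carrying the ## prefix are wordpieces that continue a preceding piece.

Take the word fragment: it can be split into the two pieces frag and ##ment, and a whole word can also form a single wordpiece by itself. The wordpiece inventory is obtained by collecting a large amount of text and finding the common patterns in it.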
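The splitting of fragment into frag and ##ment can be sketched as a greedy longest-match-first lookup. This is only an illustration under stated assumptions: the function name `wordpiece_tokenize` and the four-entry `toy_vocab` are made up for this example, and the real BERT tokenizer adds further steps (text normalization, a maximum word length) that are left out here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into wordpieces.

    Illustrative sketch only; `vocab` is any set of known pieces, where
    pieces that continue a word carry the "##" prefix.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not taken from any real model.
toy_vocab = {"frag", "##ment", "##s", "the"}

print(wordpiece_tokenize("fragment", toy_vocab))   # ['frag', '##ment']
print(wordpiece_tokenize("fragments", toy_vocab))  # ['frag', '##ment', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))        # ['[UNK]']
```

Because unknown words fall back to smaller known pieces before falling back to [UNK], a modest vocabulary can still cover most text.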

Let's use the Chinese BERT tokenizer to split a Chinese sentence into tokens:

text = "[CLS] 等到潮水 [MASK] 了，就知道谁沒穿裤子。"
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)

print(text)
print(tokens[:10], '...')
print(ids[:10], '...')

[CLS] 等到潮水 [MASK] 了，就知道谁沒穿裤子。
['[CLS]', '等', '到', '潮', '水', '[MASK]', '了', '，', '就', '知'] ...
[101, 5023, 1168, 4060, 3717, 103, 749, 8024, 2218, 4761] ...

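Besides ordinary wordpieces, BERT has 5 special tokens, each with its own job:

[CLS]: in classification tasks, its final-layer representation is treated as the representation of the whole input sequence
[SEP]: a text with two sentences is concatenated into one input sequence, with this token inserted between the two sentences as a separator
[UNK]: characters that do not appear in the BERT vocabulary are replaced by this token
[PAD]: the zero-padding token, used to pad input sequences of different lengths to the same length for batch computation
[MASK]: the masking token for unknown positions, used only during pretraining

As the example above shows, [CLS] is normally placed at the very start of the input sequence; zero padding was covered in detail in the earlier Transformer article. The [MASK] token is normally not used during fine-tuning or feature extraction; it appears here only to demonstrate the masked-token task from pretraining.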
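To make the roles of [CLS], [SEP] and [PAD] concrete, here is a small pure-Python sketch of how a sentence pair is laid out as one input sequence. The helper name `build_bert_input` and the `max_len` parameter are assumptions made for this illustration; in practice the tokenizer builds these sequences for you.

```python
def build_bert_input(tokens_a, tokens_b=None, max_len=16):
    """Assemble tokens, segment ids and attention mask for one example.

    Hypothetical helper for illustration, not part of transformers.
    """
    # [CLS] opens the sequence; [SEP] closes the first sentence.
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b:
        # The second sentence and its closing [SEP] belong to segment 1.
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    if len(tokens) > max_len:
        raise ValueError("sequence is longer than max_len")
    # Real tokens get mask 1; padding positions get mask 0.
    attention_mask = [1] * len(tokens)
    pad_count = max_len - len(tokens)
    tokens += ["[PAD]"] * pad_count
    segment_ids += [0] * pad_count
    attention_mask += [0] * pad_count
    return tokens, segment_ids, attention_mask


tokens, segment_ids, attention_mask = build_bert_input(
    list("你好"), list("再见"), max_len=8
)
print(tokens)          # ['[CLS]', '你', '好', '[SEP]', '再', '见', '[SEP]', '[PAD]']
print(segment_ids)     # [0, 0, 0, 0, 1, 1, 1, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 0]
```

Padding to a common length is what allows several examples to be stacked into one batch tensor, while the attention mask tells the model to ignore the padded positions.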

Now let's see which character BERT fills in for the sentence above that contains [MASK]:

ids

from transformers import BertForMaskedLM
# Besides the token ids we also need segment ids that say which sentence each token belongs to
tokens_tensor = torch.tensor([ids])  # (1, seq_len)
segments_tensors = torch.zeros_like(tokens_tensor)  # (1, seq_len)
maskedLM_model = BertForMaskedLM.from_pretrained(PRETRAINED_MODEL_NAME)

# Use the masked LM to estimate the actual token at the [MASK] position
maskedLM_model.eval()
with torch.no_grad():
    # Pass segment ids by keyword: the second positional argument is attention_mask
    outputs = maskedLM_model(tokens_tensor, token_type_ids=segments_tensors)
    predictions = outputs[0]
    # (1, seq_len, vocab_size)
del maskedLM_model

# Take the top k most likely tokens from the probability distribution at the [MASK] position
masked_index = 5
k = 3
probs, indices = torch.topk(torch.softmax(predictions[0, masked_index], -1), k)
predicted_tokens = tokenizer.convert_ids_to_tokens(indices.tolist())

# Show the top k candidates. Normally we take the top 1 as the prediction
print("Input tokens:", tokens[:10], '...')
print('-' * 50)
for i, (t, p) in enumerate(zip(predicted_tokens, probs), 1):
    tokens[masked_index] = t
    print("Top {} ({:2}%): {}".format(i, int(p.item() * 100), tokens[:10]), '...')

Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Input tokens: ['[CLS]', '等', '到', '潮', '水', '[MASK]', '了', '，', '就', '知'] ...
--------------------------------------------------
Top 1 (65%): ['[CLS]', '等', '到', '潮', '水', '来', '了', '，', '就', '知'] ...
Top 2 ( 4%): ['[CLS]', '等', '到', '潮', '水', '过', '了', '，', '就', '知'] ...
Top 3 ( 4%): ['[CLS]', '等', '到', '潮', '水', '干', '了', '，', '就', '知'] ...

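When Google trained the Chinese BERT it evidently had not read what Buffett said, so the model cannot yet predict the character we most want, 退. The closest candidate, 过, gets a probability of only about 4% in the output above. Still, from the standpoint of a language representation model and natural language understanding, I would say this result is not bad. By attending to the two characters 潮 and 水, BERT picked 来 out of more than 20,000 possible wordpieces as its prediction for the [MASK] token in this context, which is reasonable enough.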
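The step that turns the model's raw scores at the [MASK] position into ranked candidates is a softmax followed by a top-k selection. The sketch below reproduces that step in plain Python on made-up numbers: `toy_vocab` and `toy_logits` are hypothetical values chosen for illustration, not real BERT outputs.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Return (index, probability) pairs for the k largest probabilities."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Hypothetical four-token vocabulary and scores, for illustration only.
toy_vocab = ["来", "过", "干", "退"]
toy_logits = [4.0, 1.2, 1.1, 0.5]

probs = softmax(toy_logits)
for rank, (index, prob) in enumerate(top_k(probs, 3), 1):
    print("Top {}: {} ({:.1%})".format(rank, toy_vocab[index], prob))
```

In the real model the same computation runs over all 21128 vocabulary entries, so even a plausible character can end up with a small probability.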

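As shown earlier, the Chinese BERT vocabulary has about 21,000 tokens. If I remember correctly, the English BERT vocabulary has about 30,000 tokens. The random sample printed in section 6 shows some of the tokens recorded in the Chinese BERT vocabulary together with their indices.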

8. Visualizing the model

# !pip install bertviz ipywidgets -i https://pypi.tuna.tsinghua.edu.cn/simple


from transformers import AutoTokenizer, AutoModelForMaskedLM
from bertviz import model_view

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese", output_attentions=True)
# encode() adds [CLS] and [SEP] itself, so they are not written into the text
inputs = tokenizer.encode("等到潮水 [MASK] 了，就知道谁沒穿裤子。", return_tensors='pt')
outputs = model(inputs)
attention = outputs[-1]  # attention weights, one tensor per layer
tokens = tokenizer.convert_ids_to_tokens(inputs[0])
model_view(attention, tokens)



from transformers import BertTokenizer, BertModel
from bertviz import head_view
from IPython.display import clear_output, display
# Helper for showing the visualization inside a Jupyter notebook
def call_html():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.8/d3.min",
              jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
            },
          });
        </script>
        '''))

clear_output()

model_version = 'bert-base-chinese'
model = BertModel.from_pretrained(model_version, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_version)

# Sentences for scenario 1
sentence_a = "胖虎叫大雄去买漫画，"
sentence_b = "回來慢了就打他。"

# Tokenize the sentences, then feed them to BERT to get the attention weights
inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt', add_special_tokens=False)
token_type_ids = inputs['token_type_ids']
input_ids = inputs['input_ids']
attention = model(input_ids, token_type_ids=token_type_ids)[-1]
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)
call_html()

# Hand the result to BertViz for visualization
head_view(attention, tokens)


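9. The ultimate guide to text data processing

[致Great, 文本数据处理的终极指南 (The ultimate guide to text data processing) - [NLP入门], www.jianshu.com/p/37e529c8b… ]

In this section we discuss different feature extraction methods, starting from some basic techniques and working up to more advanced natural language processing techniques. We also look at how to preprocess text data so that better features can be extracted from "clean" data.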

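Basic feature extraction from text data

Number of words
Number of characters
Average word length
Number of stopwords
Number of special characters
Number of numerics
Number of uppercase letters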
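The features listed above can each be computed in a line or two of plain Python. The sketch below is a minimal illustration: the function name `basic_features`, the tiny `STOPWORDS` set and the sample sentence are all assumptions made for this example, and "special character" is taken here to mean any character that is neither alphanumeric nor whitespace.

```python
# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "the", "is", "and", "for", "to", "of", "in"}


def basic_features(text):
    """Compute simple surface features for one piece of text."""
    words = text.split()
    return {
        "word_count": len(words),
        "char_count": len(text),  # includes spaces
        "avg_word_length": (
            sum(len(w) for w in words) / len(words) if words else 0.0
        ),
        "stopword_count": sum(w.lower() in STOPWORDS for w in words),
        "special_char_count": sum(
            not c.isalnum() and not c.isspace() for c in text
        ),
        "numeric_count": sum(w.isdigit() for w in words),
        "uppercase_letter_count": sum(c.isupper() for c in text),
    }


# Made-up sample sentence, not taken from the dataset.
sample = "We WON 3 games for the #cavs"
for name, value in basic_features(sample).items():
    print(name, value)
```

With pandas, the same function could be applied to a whole text column through `Series.apply`, which fits the DataFrame workflow used in the rest of this section.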

Basic preprocessing of text data

Lowercasing
Removing punctuation
Removing stopwords
Removing frequent words
Removing rare words
Spelling correction
Tokenization
Stemming
Lemmatization
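Several of the preprocessing steps above need nothing beyond the standard library. The following sketch covers lowercasing, punctuation removal, whitespace tokenization, stopword removal, and dropping frequent or rare words; spelling correction, stemming and lemmatization usually rely on dedicated NLP libraries and are left out. The function names, the `STOPWORDS` set and the two sample sentences are assumptions made for this illustration.

```python
import string
from collections import Counter

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "the", "is", "and", "to", "of", "who"}


def preprocess(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in STOPWORDS]


def remove_by_frequency(token_lists, top_n=1, min_count=1):
    """Drop the top_n most frequent words and words seen fewer than min_count times."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    frequent = {token for token, _ in counts.most_common(top_n)}
    rare = {token for token, count in counts.items() if count < min_count}
    drop = frequent | rare
    return [[t for t in tokens if t not in drop] for tokens in token_lists]


# Made-up sample documents, not taken from the dataset.
docs = ["The tide is OUT, the tide is low!", "Who forgot the swimsuit?"]
token_lists = [preprocess(doc) for doc in docs]
print(token_lists)
# [['tide', 'out', 'tide', 'low'], ['forgot', 'swimsuit']]
print(remove_by_frequency(token_lists, top_n=1, min_count=1))
# [['out', 'low'], ['forgot', 'swimsuit']]
```

Frequent and rare words are counted over the whole corpus, not per document, because the goal is to drop words that carry little signal for telling documents apart.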

Advanced text processing

N-gram language models
Term frequency
Inverse document frequency
TF-IDF
Bag of words
Sentiment analysis
Word embeddings
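N-grams, term frequency, inverse document frequency and TF-IDF are closely related, and a short pure-Python sketch shows how they fit together. It uses the plain unsmoothed form idf = log(N / df); libraries often use smoothed variants, so their numbers will differ. All function names and the two toy documents are assumptions made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_frequency(tokens):
    """Share of each token within one document."""
    counts = Counter(tokens)
    return {token: count / len(tokens) for token, count in counts.items()}


def inverse_document_frequency(docs):
    """Unsmoothed idf: log(number of docs / docs containing the token)."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return {token: math.log(len(docs) / df) for token, df in doc_freq.items()}


def tf_idf(docs):
    """Weight each token's term frequency by its idf, per document."""
    idf = inverse_document_frequency(docs)
    return [
        {token: tf * idf[token] for token, tf in term_frequency(doc).items()}
        for doc in docs
    ]


# Made-up tokenized documents for illustration.
docs = [["tide", "goes", "out"], ["tide", "comes", "in"]]
print(ngrams(docs[0], 2))  # [('tide', 'goes'), ('goes', 'out')]
for weights in tf_idf(docs):
    print(weights)
```

A word that appears in every document, such as tide here, gets an idf of 0 and therefore a TF-IDF weight of 0, which is exactly how the scheme suppresses uninformative words.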

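Basic feature extraction

Even without a deep background in NLP, we can use Python to extract a few basic features from text data. Before starting, we load the dataset with pandas so that it is available for the later tasks; the dataset is a Twitter sentiment text dataset.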

import pandas as pd
train=pd.read_csv("/home/mw/input/Twitter6754/train_E6oV3lV.csv")
train.head(10)


	id	tweet
0	1	@user when a father is dysfunctional and is s...
1	2	@user @user thanks for #lyft credit i can't us...
2	3	bihday your majesty
3	4	#model i love u take with u all the time in ...
4	5	factsguide: society now #motivation
5	6	[2/2] huge fan fare and big talking before the...
6	7	@user camping tomorrow @user @user @user @use...
7	8	the next school year is the year for exams.ð...
8	9	we won!!! love the land!!! #allin #cavs #champ...
9	10	@user @user welcome here ! i'm it's so #gr...