Hugging Face Overview

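Introduction

Hugging Face is a platform that provides state-of-the-art natural language processing (NLP) tools and supports the development and application of Transformer models. It hosts a large model hub and extensive community resources, covering needs from research to industrial use. It is often described as the GitHub of the AI world, and it can also be used as an inference tool.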

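Encoder and decoder models

Encoder

An encoder performs feature extraction and feature compression. It is suited to text-understanding tasks such as sentiment analysis, text classification, and extractive question answering. A typical application is classifying reviews as positive or negative, which can feed sentiment monitoring and public-opinion analysis. The representative model is BERT.

Decoder

A decoder performs the reverse, decompression-like operation and is mainly used for generation tasks such as writing articles. The representative model is GPT.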

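Environment setup

Anaconda: during installation choose "Just Me" and tick all the environment-variable options
PyTorch 2.3.0: do not add the Tsinghua mirror source; the GPU build can be downloaded
CUDA: versions 11.8 and 12.1 are relatively stable; run nvcc -V (or nvcc --version) to confirm the version and that the installation works
cuDNN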

Verification

import torch

print(torch.cuda.is_available())  # True confirms that the GPU setup works
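
A slightly fuller check can report the PyTorch version and the CUDA version it was built against in one go. The following is a minimal sketch; the function name `report_environment` is made up for illustration, and it deliberately tolerates a machine where PyTorch is not installed yet.

```python
import importlib.util
import platform


def report_environment():
    """Collect a few facts about the local Python / PyTorch / CUDA setup."""
    info = {
        "python": platform.python_version(),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_version": None,
        "cuda_available": False,
        "cuda_build": None,
        "gpu_name": None,
    }
    if info["torch_installed"]:
        import torch

        info["torch_version"] = torch.__version__
        info["cuda_available"] = torch.cuda.is_available()
        # torch.version.cuda is None for CPU-only builds of PyTorch
        info["cuda_build"] = torch.version.cuda
        if info["cuda_available"]:
            info["gpu_name"] = torch.cuda.get_device_name(0)
    return info


for key, value in report_environment().items():
    print(f"{key}: {value}")
```

If `cuda_build` is None while a GPU is present, the installed PyTorch is most likely a CPU-only build.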

Core components

Models: transformers
Datasets: datasets
Tokenizers: tokenizers

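Loading models

Online loading

Set the load path directly to the model's path on Hugging Face. This needs a reasonably stable network connection; otherwise the download may be interrupted or fail.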

from transformers import AutoModelForCausalLM, AutoTokenizer

# Download the model and tokenizer and save them to the given directory
model_name = "uer/gpt2-chinese-cluecorpussmall"
cache_dir = "./my_model_cache/uer/gpt2-chinese-cluecorpussmall"

# Download the model
AutoModelForCausalLM.from_pretrained(model_name, cache_dir=cache_dir)
# Download the tokenizer
AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)

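Loading from local disk

Download the model to local disk and load it from there. If the download command does not work, the individual files can be downloaded separately: config.json, model.safetensors, tokenizer.json (the tokenizer), tokenizer_config.json (the tokenizer configuration), and vocab.txt are sufficient.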


from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Point to the directory that actually contains config.json
model_dir = r"./my_model_cache/uer/gpt2-chinese-cluecorpussmall/models--uer--gpt2-chinese-cluecorpussmall/snapshots/c2c0249d8a2731f269414cc3b22dff021f8e07a3"

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Build a text-generation pipeline from the loaded model and tokenizer;
# device selects where the model runs: "cpu" or "cuda"
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device="cpu")
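
When the files are downloaded by hand, it can help to check that the directory is complete before loading from it. A minimal sketch, using only the standard library; the helper name `missing_model_files` is made up for illustration, and the file names are the ones listed in the paragraph on local loading.

```python
from pathlib import Path

# File names taken from the list of files needed for local loading.
REQUIRED_FILES = (
    "config.json",
    "model.safetensors",
    "tokenizer.json",
    "tokenizer_config.json",
    "vocab.txt",
)


def missing_model_files(model_dir):
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list from `missing_model_files(model_dir)` suggests the directory is ready to pass to `from_pretrained`; any names returned are the files still to download.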

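Hands-on practice

Sentiment analysis of Chinese reviews with BERT

The BERT model

The BERT model consists of three core parts: the embedding layer (embeddings), the encoder (encoder), and the pooling layer (pooler). The input text is first embedded, then passed through a stack of encoder layers for feature extraction, and finally reduced by the pooler to a fixed-length representation vector.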

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(21128, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)

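Embedding layer

This layer converts the input text into vector representations the model can process. It contains the following submodules:

word_embeddings: the word embedding layer, implemented with the Embedding class. Embedding(21128, 768, padding_idx=0) means the vocabulary has 21128 entries, each token is mapped into a 768-dimensional vector space, and index 0 is the padding token.
position_embeddings: the position embedding layer, also an Embedding. Embedding(512, 768) means the maximum sequence length is 512 and each position is mapped to a 768-dimensional vector, which lets the model perceive token order.
token_type_embeddings: the token type embedding layer. Embedding(2, 768) means there are two token types (normally used to tell apart the two sentences of a sentence-pair input), each mapped to a 768-dimensional vector.
LayerNorm: layer normalization, which normalizes its input and makes training more stable. eps=1e-12 is a small constant that prevents division by zero.
dropout: Dropout(p=0.1, inplace=False) sets each input element to 0 with probability 0.1 during training, to reduce overfitting.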
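
How the submodules combine can be sketched in plain PyTorch: the three embeddings are summed, then normalized, then passed through dropout. This is an illustrative re-implementation under the sizes printed above, not the library's own class; the class name `BertEmbeddingsSketch` and the token IDs in the example are arbitrary.

```python
import torch
from torch import nn


class BertEmbeddingsSketch(nn.Module):
    """Sum of word, position and token-type embeddings, then LayerNorm and dropout."""

    def __init__(self, vocab_size=21128, hidden_size=768, max_positions=512,
                 type_vocab_size=2, dropout=0.1):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size, padding_idx=0)
        self.position_embeddings = nn.Embedding(max_positions, hidden_size)
        self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)
        self.LayerNorm = nn.LayerNorm(hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_ids, token_type_ids):
        seq_len = input_ids.size(1)
        # Positions 0..seq_len-1, shared by every sequence in the batch
        position_ids = torch.arange(seq_len, device=input_ids.device).unsqueeze(0)
        summed = (self.word_embeddings(input_ids)
                  + self.position_embeddings(position_ids)
                  + self.token_type_embeddings(token_type_ids))
        return self.dropout(self.LayerNorm(summed))


embeddings = BertEmbeddingsSketch().eval()  # eval() switches dropout off
input_ids = torch.tensor([[101, 2769, 4263, 872, 102, 0, 0, 0]])  # arbitrary IDs, 0 = padding
token_type_ids = torch.zeros_like(input_ids)  # single-sentence input: all type 0
hidden = embeddings(input_ids, token_type_ids)
print(hidden.shape)  # torch.Size([1, 8, 768])
```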

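Encoder

The encoder is a stack of identical BertLayer modules, 12 of them here (0 to 11). Each BertLayer contains the following submodules:

attention: the attention module, made up of BertSdpaSelfAttention and BertSelfOutput.
- self (BertSdpaSelfAttention): the self-attention mechanism, with three linear layers, query, key and value, used to compute the attention scores. Linear(in_features=768, out_features=768, bias=True) means input and output are both 768-dimensional. Its dropout layer is applied to the attention weights.
- output (BertSelfOutput): a linear layer dense transforms the attention output, LayerNorm then normalizes it, and dropout is applied to reduce overfitting.
intermediate: the intermediate layer, whose linear layer dense expands the dimension from 768 to 3072, followed by the GELUActivation activation function to introduce non-linearity.
output: the output layer, whose linear layer dense brings the dimension back from 3072 to 768, followed by LayerNorm and dropout.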
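
The role of query, key and value is easier to see in code. The sketch below implements scaled dot-product self-attention by hand in PyTorch. It is not the library's BertSdpaSelfAttention; the class name is made up, and the head count of 12 is an assumption (it does not appear in the printed structure, though it is the usual value for a 768-dimensional BERT).

```python
import math

import torch
from torch import nn


class SelfAttentionSketch(nn.Module):
    """Multi-head scaled dot-product self-attention, written out step by step."""

    def __init__(self, hidden_size=768, num_heads=12, dropout=0.1):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)

    def _split_heads(self, x):
        # [batch, seq, hidden] -> [batch, heads, seq, head_dim]
        batch, seq_len, _ = x.shape
        return x.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, hidden_states, attention_mask=None):
        q = self._split_heads(self.query(hidden_states))
        k = self._split_heads(self.key(hidden_states))
        v = self._split_heads(self.value(hidden_states))
        scores = q @ k.transpose(-1, -2) / math.sqrt(self.head_dim)
        if attention_mask is not None:
            # attention_mask: [batch, seq], 1 = real token, 0 = padding
            scores = scores.masked_fill(attention_mask[:, None, None, :] == 0, float("-inf"))
        probs = self.dropout(torch.softmax(scores, dim=-1))
        context = probs @ v  # [batch, heads, seq, head_dim]
        batch, _, seq_len, _ = context.shape
        return context.transpose(1, 2).reshape(batch, seq_len, -1)


attention = SelfAttentionSketch().eval()
states = torch.randn(2, 5, 768)
mask = torch.tensor([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]])  # second sequence has 2 padding slots
context = attention(states, mask)
print(context.shape)  # torch.Size([2, 5, 768])
```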

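Pooler

The pooler turns the encoder output into a single fixed-length representation vector for downstream tasks. It does so by taking the hidden state of the first token, [CLS].

dense: a linear layer; Linear(in_features=768, out_features=768, bias=True) means input and output are both 768-dimensional.
activation: a Tanh activation applied to the output of the linear layer.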
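
As a sanity check on the printed structure, the number of trainable parameters can be worked out by hand from the layer shapes alone. The arithmetic below uses only the sizes shown in the structure (a Linear layer has in*out weights plus out biases; a LayerNorm has a weight and a bias per feature); the helper names are made up for illustration.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a Linear layer."""
    return in_features * out_features + out_features


def layer_norm_params(hidden):
    """One scale and one shift per feature."""
    return 2 * hidden


hidden, intermediate = 768, 3072
vocab, positions, types, layers = 21128, 512, 2, 12

embeddings = (vocab + positions + types) * hidden + layer_norm_params(hidden)
attention = 4 * linear_params(hidden, hidden) + layer_norm_params(hidden)  # query, key, value, dense
feed_forward = (linear_params(hidden, intermediate)
                + linear_params(intermediate, hidden)
                + layer_norm_params(hidden))
per_layer = attention + feed_forward
pooler = linear_params(hidden, hidden)
total = embeddings + layers * per_layer + pooler

print(embeddings)  # 16622592
print(per_layer)   # 7087872
print(total)       # 102267648
```

Roughly 85 million of the 102 million parameters sit in the 12 encoder layers, which is why freezing the backbone makes fine-tuning so much cheaper.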

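Loading datasets

There are two ways to load a dataset: online, or from disk.

Online loading always needs a network connection. Even when the dataset is already cached, a connection is still made to check whether the dataset has been updated.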

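from datasets import load_dataset

# path is the dataset's path on Hugging Face; split="train" selects the training split;
# cache_dir is the directory used for caching
dataset = load_dataset(path="NousResearch/hermes-function-calling-v1", split="train", cache_dir="dataset/")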

Loading from disk requires the dataset to have been saved to disk beforehand.

from datasets import load_from_disk

dataset = load_from_disk(r"/Users/chenyijing/Downloads/人工智能/学习资料/demo_14/dataset_test/dataset/ChnSentiCorp")

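Text encoding

An AI model is essentially a set of matrices, and its computation is matrix computation. Text therefore has to be encoded, that is, represented as a sequence of numbers, and the tool that does this is the tokenizer. A model and its tokenizer are bound one-to-one: a model must be used with the tokenizer it was trained with.

About the tokenizer

vocab_size: the size of the vocabulary; model_max_length: the maximum number of input tokens the model accepts; special tokens: tokens with a reserved meaning.

Process: the text is first split into tokens, and each token is then mapped to a numeric ID through the vocabulary.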

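Encoding steps:

1 Load the BERT model

2 Load the tokenizer with BertTokenizer

3 Encode the text with batch_encode_plus. To avoid ambiguity from word segmentation, Chinese text is encoded one character at a time; how characters combine into words is left to the model, which resolves it from semantics and context. Because the encoded batch is in effect a matrix, every row must have the same length: shorter sequences are filled with padding and longer ones are truncated.

4 After encoding: input_ids holds the numeric ID of each token; special_tokens_mask marks special tokens, with 1 for a special token and 0 for an ordinary one; attention_mask marks the positions that actually carry content and should be attended to, with 1 for a real token and 0 for padding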
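
What padding, truncation and the two masks look like can be shown with a toy character-level encoder in plain Python. This is a stand-in written for illustration, not BertTokenizer: the functions `build_vocab` and `encode_batch`, the tiny vocabulary, and the two sample sentences are all made up, and real token IDs will differ.

```python
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]


def build_vocab(texts):
    """Assign an ID to every special token, then to every character seen."""
    vocab = {token: idx for idx, token in enumerate(SPECIAL_TOKENS)}
    for text in texts:
        for char in text:
            vocab.setdefault(char, len(vocab))
    return vocab


def encode_batch(texts, vocab, max_length):
    """Encode character by character, truncating and padding to max_length."""
    batch = {"input_ids": [], "attention_mask": [], "special_tokens_mask": []}
    for text in texts:
        chars = list(text)[: max_length - 2]  # truncate, leaving room for [CLS] and [SEP]
        ids = [vocab["[CLS]"]] + [vocab.get(c, vocab["[UNK]"]) for c in chars] + [vocab["[SEP]"]]
        special = [1] + [0] * len(chars) + [1]
        pad = max_length - len(ids)
        batch["input_ids"].append(ids + [vocab["[PAD]"]] * pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * pad)
        batch["special_tokens_mask"].append(special + [1] * pad)  # padding counts as special
    return batch


texts = ["很好", "这家酒店非常不错"]
vocab = build_vocab(texts)
encoded = encode_batch(texts, vocab, max_length=6)

print(encoded["input_ids"])            # [[2, 4, 5, 3, 0, 0], [2, 6, 7, 8, 9, 3]]
print(encoded["attention_mask"])       # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]]
print(encoded["special_tokens_mask"])  # [[1, 0, 0, 1, 1, 1], [1, 0, 0, 0, 0, 1]]
```

The short sentence is padded with two zeros, while the long one loses its last four characters to truncation; every row ends up with exactly six entries, so the batch forms a matrix.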

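Incremental fine-tuning

Design of the fine-tuned model:

1 Add a fully connected layer of shape (768, 2)

2 Freeze the backbone network

3 From the pretrained model's output, extract the hidden-state vector of the [CLS] token for each sample, with shape [batch_size, hidden_size]

4 The fully connected layer applies a linear transformation to the extracted vector and outputs a two-class vector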
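
The four design steps can be sketched as a small PyTorch module. The class names are made up, and `StubBackbone` is a hypothetical stand-in that only mimics the `last_hidden_state` attribute of a BERT output so the example runs without downloading anything; in practice the backbone would be the loaded BERT model.

```python
from types import SimpleNamespace

import torch
from torch import nn


class SentimentClassifier(nn.Module):
    """Frozen backbone plus a trainable (768, 2) fully connected head."""

    def __init__(self, backbone, hidden_size=768, num_labels=2):
        super().__init__()
        self.backbone = backbone
        for param in self.backbone.parameters():  # step 2: freeze the backbone
            param.requires_grad = False
        self.fc = nn.Linear(hidden_size, num_labels)  # step 1: the new head

    def forward(self, input_ids, attention_mask, token_type_ids):
        with torch.no_grad():  # no gradients are needed for the frozen part
            outputs = self.backbone(input_ids=input_ids,
                                    attention_mask=attention_mask,
                                    token_type_ids=token_type_ids)
        cls_vector = outputs.last_hidden_state[:, 0]  # step 3: [batch_size, hidden_size]
        return self.fc(cls_vector)  # step 4: [batch_size, 2] logits


class StubBackbone(nn.Module):
    """Hypothetical stand-in for BERT: an embedding lookup with the same output field."""

    def __init__(self, vocab_size=100, hidden_size=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        return SimpleNamespace(last_hidden_state=self.embed(input_ids))


torch.manual_seed(0)
model = SentimentClassifier(StubBackbone())
input_ids = torch.randint(0, 100, (4, 16))
logits = model(input_ids, torch.ones_like(input_ids), torch.zeros_like(input_ids))
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(logits.shape)  # torch.Size([4, 2])
print(trainable)     # ['fc.weight', 'fc.bias']
```

Only the head's weight and bias remain trainable, which is the point of freezing: the optimizer updates 1,538 parameters instead of the whole backbone.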

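Training

1 Load the dataset

2 Use the tokenizer to process the dataset's texts and labels, producing the encoded inputs together with their labels

3 Set up the fine-tuned model, the loss function and the number of training epochs; the model outputs the two-class vector designed above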
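
A bare-bones version of such a training loop is sketched below. To stay self-contained it trains only a (768, 2) linear head on random vectors that stand in for [CLS] features, with synthetic labels; the function name `train`, the AdamW optimizer, the learning rate and the epoch count are illustrative choices, not values from the original notes.

```python
import torch
from torch import nn


def train(model, batches, epochs=5, lr=1e-3):
    """Run a plain supervised loop and return the mean loss of each epoch."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=lr)
    loss_fn = nn.CrossEntropyLoss()  # expects raw logits and integer class labels
    history = []
    model.train()
    for _ in range(epochs):
        running = 0.0
        for features, labels in batches:
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels)
            loss.backward()
            optimizer.step()
            running += loss.item()
        history.append(running / len(batches))
    return history


torch.manual_seed(0)
features = torch.randn(64, 768)            # stand-in for 64 [CLS] vectors
labels = (features[:, 0] > 0).long()       # synthetic binary labels
batches = [(features[i:i + 16], labels[i:i + 16]) for i in range(0, 64, 16)]

head = nn.Linear(768, 2)
history = train(head, batches)
print([round(loss, 4) for loss in history])  # mean loss per epoch, expected to fall
```

With the real model, each batch would carry input_ids, attention_mask and token_type_ids from the tokenizer instead of a single feature tensor, but the zero_grad, forward, loss, backward, step sequence stays the same.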

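Testing

1 Load the fine-tuned model after training

2 Encode the input text and return the encoded result

3 Call the fine-tuned model on the encoded input and return the binary classification decision

4 Output the result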
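
The last two steps amount to turning the model's two logits into a label. A minimal sketch follows; the function name and the hard-coded logits are made up, and the label order (0 = negative, 1 = positive) is an assumption that should be checked against the dataset actually used.

```python
import torch

LABELS = ("negative", "positive")  # assumed order: index 0 = negative, 1 = positive


def decode_predictions(logits):
    """Map [batch_size, 2] logits to (label, confidence) pairs."""
    probs = torch.softmax(logits, dim=-1)
    confidences, indices = probs.max(dim=-1)
    return [(LABELS[i], c) for i, c in zip(indices.tolist(), confidences.tolist())]


# Hard-coded logits standing in for the fine-tuned model's output
logits = torch.tensor([[2.0, -1.0], [-0.5, 1.5]])
results = decode_predictions(logits)
for label, confidence in results:
    print(f"{label} ({confidence:.2%})")
```

At inference time the model should be put in eval mode and called under torch.no_grad(), so that dropout is disabled and no gradients are tracked.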