Creating together, growing together! This is day 30 of my participation in the Juejin "Daily New Plan · August Article Challenge".
Introduction
The Huggingface Transformers library provides a shortcut for running inference with models, which is very convenient for lightweight everyday use. This post records my study notes on Pipeline; the main reference is the official documentation: huggingface.co/docs/transf…
Overview
A Pipeline is a simple way to run inference with a model. Pipelines are objects that abstract away most of the library's complex code and offer a simple API dedicated to a number of tasks, including named entity recognition, masked language modeling, sentiment analysis, feature extraction, and question answering. Huggingface has two main kinds of Pipeline:
- pipeline(): a wrapper around every other type of pipeline;
- task-specific pipelines: pipelines dedicated to a particular task, such as:
- AudioClassificationPipeline
- AutomaticSpeechRecognitionPipeline
- ConversationalPipeline
- FeatureExtractionPipeline
- FillMaskPipeline
- ImageClassificationPipeline
- ImageSegmentationPipeline
- ObjectDetectionPipeline
- QuestionAnsweringPipeline
- SummarizationPipeline
- TableQuestionAnsweringPipeline
- TextClassificationPipeline
- TextGenerationPipeline
- Text2TextGenerationPipeline
- TokenClassificationPipeline
- TranslationPipeline
- VisualQuestionAnsweringPipeline
- ZeroShotClassificationPipeline
- ZeroShotImageClassificationPipeline
The pipeline abstraction
The pipeline abstraction is a wrapper around all the other available pipelines. It can be instantiated like any of the task-specific pipelines, adapts to a wide range of tasks, and makes it easy to switch between tasks and models.
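Under the hood, pipeline() works like a factory: it maps a task string to the matching task-specific class and instantiates it. The sketch below is only a much-simplified illustration of that dispatch; the real registry in transformers also carries default models and framework information, and this registry lists just a few tasks:

```python
# Simplified task registry: task string -> task-specific pipeline class name.
# This is an illustrative sketch, not the actual transformers registry.
TASK_REGISTRY = {
    "text-classification": "TextClassificationPipeline",
    "question-answering": "QuestionAnsweringPipeline",
    "ner": "TokenClassificationPipeline",
    "summarization": "SummarizationPipeline",
}

def resolve_pipeline_class(task):
    """Look up the task-specific pipeline class for a task string."""
    if task not in TASK_REGISTRY:
        raise KeyError(f"Unknown task: {task!r}")
    return TASK_REGISTRY[task]

print(resolve_pipeline_class("text-classification"))  # TextClassificationPipeline
```

This is why the same `pipeline(...)` call can produce a text classifier, a question-answering model, or an NER tagger depending only on the task string.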
A simple example
Here is a simple example: we use pipeline to run text-classification inference. Since no model is specified, it downloads a default model for this task and uses it for inference:
```python
from transformers import pipeline

pipe = pipeline("text-classification")
pipe("This restaurant is awesome")
```
The output is:
```
[{'label': 'POSITIVE', 'score': 0.9998743534088135}]
```
As you can see, with just two or three lines of code the pipeline runs inference on our input directly, which is very convenient.
Specifying a model
If you want to use a particular model from the Huggingface Hub, you can select it with the model parameter:
```python
pipe = pipeline(model="roberta-large-mnli")
pipe("This restaurant is awesome")
```
The output is:
```
[{'label': 'NEUTRAL', 'score': 0.7313136458396912}]
```
Note that roberta-large-mnli is a natural language inference model, so its labels (CONTRADICTION, NEUTRAL, ENTAILMENT) differ from the sentiment labels of the default model.
Handling multiple inputs
pipeline also supports batching, processing several inputs in a single call:
```python
pipe = pipeline("text-classification")
pipe(["This restaurant is awesome", "This restaurant is awful"])
```
The output is:
```
[{'label': 'POSITIVE', 'score': 0.9998743534088135}, {'label': 'NEGATIVE', 'score': 0.9996669292449951}]
```
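The pipeline returns one dict per input, in input order. A small sketch of post-processing such a batch result, for example keeping only predictions above a confidence threshold (the helper name and the threshold value are arbitrary choices for illustration):

```python
# Batch output as returned by a text-classification pipeline
results = [
    {"label": "POSITIVE", "score": 0.9998743534088135},
    {"label": "NEGATIVE", "score": 0.9996669292449951},
]

def confident_labels(results, threshold=0.9):
    """Return the label of every prediction whose score clears the threshold."""
    return [r["label"] for r in results if r["score"] >= threshold]

print(confident_labels(results))  # ['POSITIVE', 'NEGATIVE']
```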
To iterate over a full dataset, it is recommended to use the datasets library directly. This means you don't need to allocate the whole dataset at once, nor do you need to handle batching yourself; the pipeline takes care of all of that for you:
```python
import datasets
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from tqdm.auto import tqdm

pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0)
dataset = datasets.load_dataset("superb", name="asr", split="test")

# KeyDataset (only *pt*) will simply return the item in the dict returned by the dataset item
# as we're not interested in the *target* part of the dataset.
for out in tqdm(pipe(KeyDataset(dataset, "file"))):
    print(out)
    # {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"}
    # {"text": ....}
    # ....
```
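KeyDataset itself is conceptually a thin wrapper: it exposes a single field of each dataset row, so the pipeline sees plain inputs rather than dicts. A minimal sketch of the idea, assuming only an indexable dataset of dicts (this is a simplified illustration, not the library's implementation):

```python
class KeyDataset:
    """Expose a single key of each example in an indexable dataset of dicts."""

    def __init__(self, dataset, key):
        self.dataset = dataset
        self.key = key

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        # Pick out just the requested field of row i
        return self.dataset[i][self.key]

# Works with anything indexable that yields dicts, e.g. a plain list:
rows = [{"file": "a.wav", "text": "hello"}, {"file": "b.wav", "text": "hi"}]
files = KeyDataset(rows, "file")
print(list(files))  # ['a.wav', 'b.wav']
```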
For convenience, you can also feed the pipeline with a generator:
```python
from transformers import pipeline

pipe = pipeline("text-classification")

def data():
    while True:
        # This could come from a dataset, a database, a queue or HTTP request
        # in a server
        # Caveat: because this is iterative, you cannot use `num_workers > 1`
        # to preprocess data with multiple threads. You can still have one
        # thread doing the preprocessing while the main thread runs the big inference
        yield "This is a test"

for out in pipe(data()):
    print(out)
    # {'label': 'POSITIVE', 'score': ...}
    # {'label': 'POSITIVE', 'score': ...}
    # ...
```
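The point of passing a generator is that inputs are consumed lazily, so an unbounded stream never has to fit in memory. A pure-Python sketch of the pattern, where classify_stream is a hypothetical stand-in for the pipeline call and the fixed label/score are placeholder values:

```python
from itertools import islice

def data():
    # Simulates an endless input stream (e.g. a queue or HTTP requests)
    while True:
        yield "This is a test"

def classify_stream(inputs):
    # Hypothetical stand-in for a text-classification pipeline: results
    # are yielded one at a time, so the stream is consumed lazily
    for text in inputs:
        yield {"label": "POSITIVE", "score": 1.0}

# Take only the first 3 results from the infinite stream
results = list(islice(classify_stream(data()), 3))
print(len(results))  # 3
```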
Switching between tasks and models
The pipeline method lets you switch flexibly between models and tasks. For example:
```python
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Sentiment analysis pipeline
pipeline("sentiment-analysis")

# Question answering pipeline, specifying the checkpoint identifier
pipeline("question-answering", model="distilbert-base-cased-distilled-squad", tokenizer="bert-base-cased")

# Named entity recognition pipeline, passing in a specific model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
pipeline("ner", model=model, tokenizer=tokenizer)
```
In the examples above, we kept the same calling convention, pipeline(...), for sentiment analysis, question answering, and named entity recognition. Calling pipelines this way is elegant and concise.
Summary
This post introduced Pipeline, a ready-to-use tool in the Huggingface Transformers library. It is flexible and convenient, and well suited to quick inference, model integration, and project deployment.