OpenAI Documentation Reference: Fine-tuning

Fine-tuning

Learn how to customize a model for your application.

Introduction

Fine-tuning lets you get more out of the models available through the API by providing:

  1. Higher-quality results than prompt design
  2. The ability to train on more examples than can fit in a prompt
  3. Token savings due to shorter prompts
  4. Lower-latency requests

GPT-3 has been pre-trained on a vast amount of text from the open internet. When given a prompt with just a few examples, it can often intuit what task you are trying to perform and generate a plausible completion. This is often called "few-shot learning."

Fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt, letting you achieve better results on a wide range of tasks. Once a model has been fine-tuned, you won't need to provide examples in the prompt anymore. This saves costs and enables lower-latency requests.

At a high level, fine-tuning involves the following steps:

  1. Prepare and upload training data
  2. Train a new fine-tuned model
  3. Use your fine-tuned model

Visit our pricing page to learn more about how fine-tuned model training and usage are billed.

What models can be fine-tuned?

Fine-tuning is currently only available for the following base models: davinci, curie, babbage, and ada. These are the original models, which have had no instruction-following training (unlike, for example, text-davinci-003). You can also continue fine-tuning an already fine-tuned model to add more data without having to start from scratch.

Installation

We recommend using our OpenAI command-line interface (CLI). To install it, run:

pip install --upgrade openai

(The following instructions work for version 0.9.4 and up. Additionally, the OpenAI CLI requires Python 3.)

Set your OPENAI_API_KEY environment variable by adding the following line to your shell initialization script (e.g. .bashrc, .zshrc, etc.) or by running it in the command line before the fine-tuning command:

export OPENAI_API_KEY="<OPENAI_API_KEY>"

Prepare training data

Training data is how you teach GPT-3 what you'd like it to say.

Your data must be a JSONL document, where each line is a prompt-completion pair corresponding to a training example. You can use our CLI data preparation tool to easily convert your data into this file format.

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
...
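As a quick sanity check before uploading, the format above can be verified with a few lines of Python. This is a minimal sketch, not part of the OpenAI tooling: `validate_jsonl` is a hypothetical helper name, and it only checks that each non-blank line is a JSON object with string `prompt` and `completion` keys.

```python
import json

def validate_jsonl(lines):
    """Return (line_number, problem) pairs for lines that are not valid examples."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((number, f"invalid JSON: {error.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        # Both keys must be present and must hold strings.
        for key in ("prompt", "completion"):
            if not isinstance(record.get(key), str):
                problems.append((number, f"missing or non-string '{key}'"))
    return problems

sample = [
    '{"prompt": "2+2 ->", "completion": " 4"}',
    '{"prompt": "no completion here"}',
]
print(validate_jsonl(sample))
```

In practice you would pass the lines of your own file, for example `validate_jsonl(open(path, encoding="utf-8"))`, where `path` is your local JSONL file.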

Designing your prompts and completions for fine-tuning is different from designing your prompts for use with our base models (Davinci, Curie, Babbage, Ada). In particular, while prompts for base models often consist of multiple examples ("few-shot learning"), for fine-tuning, each training example generally consists of a single input example and its associated output, without the need to give detailed instructions or include multiple examples in the same prompt.

For more detailed guidance on how to prepare training data for various tasks, please refer to our best practices for preparing your dataset.

The more training examples you have, the better. We recommend having at least a couple hundred examples. In general, we've found that each doubling of the dataset size leads to a linear increase in model quality.

CLI data preparation tool

We developed a tool that validates your data, gives suggestions, and reformats it:

openai tools fine_tunes.prepare_data -f <LOCAL_FILE>

This tool accepts different formats, with the only requirement being that they contain a prompt and a completion column/key. You can pass a CSV, TSV, XLSX, JSON or JSONL file, and the tool will guide you through its suggested changes and then save the output into a JSONL file ready for fine-tuning.
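To make the target format concrete, here is a rough sketch of the simplest conversion the tool performs: CSV in, JSONL out. It is an illustration only and does not reproduce the tool's validation or suggestions; `csv_to_jsonl` is a hypothetical helper, and the input is assumed to have `prompt` and `completion` columns.

```python
import csv
import io
import json

def csv_to_jsonl(csv_text):
    """Convert CSV text with 'prompt' and 'completion' columns to JSONL text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {"prompt", "completion"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required column(s): {sorted(missing)}")
    # One JSON object per line; any extra columns are dropped.
    lines = [
        json.dumps({"prompt": row["prompt"], "completion": row["completion"]})
        for row in reader
    ]
    return "\n".join(lines) + "\n"

csv_text = "prompt,completion\nGreat phone! ->, positive\nBroke in a day ->, negative\n"
print(csv_to_jsonl(csv_text), end="")
```

The leading space in each completion survives the conversion because `csv` does not strip whitespace after the delimiter by default.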

Create a fine-tuned model

The following assumes you've already prepared training data following the above instructions.

Start your fine-tuning job using the OpenAI CLI:

openai api fine_tunes.create -t <TRAIN_FILE_ID_OR_PATH> -m <BASE_MODEL>

Here, BASE_MODEL is the name of the base model you're starting from (ada, babbage, curie, or davinci). You can customize your fine-tuned model's name using the suffix parameter.

Running the above command does several things:

  1. Uploads the file using the files API (or uses an already-uploaded file)
  2. Creates a fine-tune job
  3. Streams events until the job is done (this often takes minutes, but can take hours if there are many jobs in the queue or your dataset is large)

Every fine-tuning job starts from a base model, which defaults to curie. The choice of model influences both the performance of the model and the cost of running your fine-tuned model. Your model can be one of: ada, babbage, curie, or davinci. Visit our pricing page for details on fine-tune rates.

After you've started a fine-tune job, it may take some time to complete. Your job may be queued behind other jobs on our system, and training the model can take minutes or hours depending on the model and dataset size. If the event stream is interrupted for any reason, you can resume it by running:

openai api fine_tunes.follow -i <YOUR_FINE_TUNE_JOB_ID>

When the job is done, it should display the name of the fine-tuned model.

In addition to creating a fine-tune job, you can also list existing jobs, retrieve the status of a job, or cancel a job.

# List all created fine-tunes
openai api fine_tunes.list

# Retrieve the state of a fine-tune. The resulting object includes
# job status (which can be one of pending, running, succeeded, or failed)
# and other information
openai api fine_tunes.get -i <YOUR_FINE_TUNE_JOB_ID>

# Cancel a job
openai api fine_tunes.cancel -i <YOUR_FINE_TUNE_JOB_ID>
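If you would rather wait for a job from a script than watch the event stream, a polling loop is one option. The sketch below is an assumption-laden illustration: `wait_for_job` is a hypothetical helper, `retrieve` is any callable you supply that returns the job as a dictionary with a `status` key, and only the statuses named above (pending, running, succeeded, failed) are handled.

```python
import time

# Statuses after which the job will not change any more (per the list above).
TERMINAL_STATUSES = {"succeeded", "failed"}

def wait_for_job(retrieve, job_id, poll_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Poll retrieve(job_id) until the job reports a terminal status."""
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} still not finished after {max_polls} polls")

# Usage with a stand-in retrieve function that simulates a job's progress.
_states = iter(["pending", "running", "succeeded"])

def fake_retrieve(job_id):
    return {"id": job_id, "status": next(_states)}

final = wait_for_job(fake_retrieve, "ft-example", sleep=lambda seconds: None)
print(final["status"])
```

Passing the retrieval function in as an argument keeps the loop independent of any particular client library, and lets you try it out without network access.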

Use a fine-tuned model

When a job has succeeded, the fine_tuned_model field will be populated with the name of the model. You may now specify this model as a parameter to our Completions API, and make requests to it using the Playground.

After your job first completes, it may take several minutes for your model to become ready to handle requests. If completion requests to your model time out, it is likely because your model is still being loaded. If this happens, try again in a few minutes.

You can start making requests by passing the model name as the model parameter of a completion request:

OpenAI CLI:

openai api completions.create -m <FINE_TUNED_MODEL> -p <YOUR_PROMPT>

cURL:

curl https://api.openai.com/v1/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": YOUR_PROMPT, "model": FINE_TUNED_MODEL}'

Python:

import openai
openai.Completion.create(
    model=FINE_TUNED_MODEL,
    prompt=YOUR_PROMPT)

Node.js:

const response = await openai.createCompletion({
  model: FINE_TUNED_MODEL,
  prompt: YOUR_PROMPT,
});

You may continue to use all the other Completions parameters, such as temperature, frequency_penalty, and presence_penalty, on these requests to fine-tuned models.

Delete a fine-tuned model

To delete a fine-tuned model, you must be designated an "owner" within your organization.

OpenAI CLI:

openai api models.delete -i <FINE_TUNED_MODEL>

cURL:

curl -X "DELETE" https://api.openai.com/v1/models/<FINE_TUNED_MODEL> \
  -H "Authorization: Bearer $OPENAI_API_KEY"

Python:

import openai
openai.Model.delete(FINE_TUNED_MODEL)

Preparing your dataset

Fine-tuning is a powerful technique to create a new model that's specific to your use case. Before fine-tuning your model, we strongly recommend reading the best practices and the specific guidelines for your use case below.

Data formatting

To fine-tune a model, you'll need a set of training examples that each consist of a single input ("prompt") and its associated output ("completion"). This is notably different from using our base models, where you might input detailed instructions or multiple examples in a single prompt.

  • Each prompt should end with a fixed separator to inform the model when the prompt ends and the completion begins. A simple separator which generally works well is \n\n###\n\n. The separator should not appear elsewhere in any prompt.
  • Each completion should start with a whitespace due to our tokenization, which tokenizes most words with a preceding whitespace.
  • Each completion should end with a fixed stop sequence to inform the model when the completion ends. A stop sequence could be \n, ###, or any other token that does not appear in any completion.
  • For inference, you should format your prompts in the same way as you did when creating the training dataset, including the same separator. Also specify the same stop sequence to properly truncate the completion.
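The formatting rules above can be applied mechanically. The following is a minimal sketch, assuming \n\n###\n\n as the separator and " END" as the stop sequence; `format_example` is a hypothetical helper, not an OpenAI API.

```python
SEPARATOR = "\n\n###\n\n"  # appended to every prompt
STOP = " END"              # appended to every completion

def format_example(prompt, completion, separator=SEPARATOR, stop=STOP):
    """Build one training example that follows the formatting rules."""
    if separator in prompt:
        raise ValueError("separator must not appear inside the prompt")
    if stop in completion:
        raise ValueError("stop sequence must not appear inside the completion")
    return {
        "prompt": prompt + separator,
        # Leading space for tokenization, fixed stop sequence at the end.
        "completion": " " + completion.strip() + stop,
    }

example = format_example("Summarize: cats sleep most of the day", "Cats sleep a lot.")
print(example)
```

The two checks guard against the separator or stop sequence leaking into the data, which would blur where a prompt or completion ends.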

General best practices

Fine-tuning performs better with more high-quality examples. To fine-tune a model that performs better than using a high-quality prompt with our base models, you should provide at least a few hundred high-quality examples, ideally vetted by human experts. From there, performance tends to increase linearly with every doubling of the number of examples. Increasing the number of examples is usually the best and most reliable way of improving performance.

Classifiers are the easiest models to get started with. For classification problems we suggest using ada, which generally performs only very slightly worse than more capable models once fine-tuned, while being significantly faster and cheaper.

If you are fine-tuning on a pre-existing dataset rather than writing prompts from scratch, be sure to manually review your data for offensive or inaccurate content if possible, or review as many random samples of the dataset as possible if it is large.

Specific guidelines

Fine-tuning can solve a variety of problems, and the optimal way to use it may depend on your specific use case. Below, we've listed the most common use cases for fine-tuning and corresponding guidelines.

Classification

In classification problems, each input in the prompt should be classified into one of the predefined classes. For this type of problem, we recommend:

  • Use a separator at the end of the prompt, e.g. \n\n###\n\n. Remember to also append this separator when you eventually make requests to your model.
  • Choose classes that map to a single token. At inference time, specify max_tokens=1 since you only need the first token for classification.
  • Ensure that the prompt + completion doesn't exceed 2048 tokens, including the separator
  • Aim for at least ~100 examples per class
  • To get class log probabilities you can specify logprobs=5 (for 5 classes) when using your model
  • Ensure that the dataset used for fine-tuning is very similar in structure and type of task to what the model will be used for
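One of these recommendations, roughly 100 examples per class, is easy to check automatically. A minimal sketch follows; `classes_below_minimum` is a hypothetical helper and the threshold of 100 is simply the guideline above. Token-length checks are left out because they need a tokenizer.

```python
from collections import Counter

def classes_below_minimum(examples, minimum=100):
    """Return {label: count} for classes with fewer than `minimum` examples."""
    counts = Counter(example["completion"].strip() for example in examples)
    return {label: count for label, count in counts.items() if count < minimum}

# Toy dataset: 120 " yes" examples and 40 " no" examples.
dataset = (
    [{"prompt": f"ad {i}\nSupported:", "completion": " yes"} for i in range(120)]
    + [{"prompt": f"ad {i}\nSupported:", "completion": " no"} for i in range(40)]
)
print(classes_below_minimum(dataset))
```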

Case study: Is the model making untrue statements?

Let's say you'd like to ensure that the text of the ads on your website mentions the correct product and company. In other words, you want to ensure the model isn't making things up. You may want to fine-tune a classifier which filters out incorrect ads.

The dataset might look something like the following:

{"prompt":"Company: BHFF insurance\nProduct: allround insurance\nAd:One stop shop for all your insurance needs!\nSupported:", "completion":" yes"}
{"prompt":"Company: Loft conversion specialists\nProduct: -\nAd:Straight teeth in weeks!\nSupported:", "completion":" no"}

In the example above, we used a structured input containing the name of the company, the product, and the associated ad. As a separator we used \nSupported: which clearly separated the prompt from the completion. With a sufficient number of examples, the separator doesn't make much of a difference (usually less than 0.4%) as long as it doesn't appear within the prompt or the completion.

For this use case we fine-tuned an ada model since it will be faster and cheaper, and the performance will be comparable to larger models because it is a classification task.

Now we can query our model by making a Completion request.

curl https://api.openai.com/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "prompt": "Company: Reliable accountants Ltd\nProduct: Personal Tax help\nAd:Best advice in town!\nSupported:",
    "max_tokens": 1,
    "model": "YOUR_FINE_TUNED_MODEL_NAME"
  }'

This will return either yes or no.

Case study: Sentiment analysis

Let's say you'd like to measure the degree to which a particular tweet is positive or negative. The dataset might look something like the following:

{"prompt":"Overjoyed with the new iPhone! ->", "completion":" positive"}
{"prompt":"@lakers disappoint for a third straight night https://t.co/38EFe43 ->", "completion":" negative"}

Once the model is fine-tuned, you can get back the log probabilities for the first completion token by setting logprobs=2 on the completion request. The higher the probability for the positive class, the higher the relative sentiment.

Now we can query our model by making a Completion request.

curl https://api.openai.com/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "prompt": "https://t.co/f93xEd2 Excited to share my latest blog post! ->",
    "max_tokens": 1,
    "model": "YOUR_FINE_TUNED_MODEL_NAME"
  }'

This will return:

{
  "id": "cmpl-COMPLETION_ID",
  "object": "text_completion",
  "created": 1589498378,
  "model": "YOUR_FINE_TUNED_MODEL_NAME",
  "choices": [
    {
      "logprobs": {
        "text_offset": [
          19
        ],
        "token_logprobs": [
          -0.03597255
        ],
        "tokens": [
          " positive"
        ],
        "top_logprobs": [
          {
            " negative": -4.9785037,
            " positive": -0.03597255
          }
        ]
      },

      "text": " positive",
      "index": 0,
      "finish_reason": "length"
    }
  ]
}
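The values under top_logprobs are natural-log probabilities, so exponentiating them recovers the class probabilities. A short sketch using the numbers from the response above (`class_probabilities` is a hypothetical helper, and the response is trimmed to the fields it reads):

```python
import math

def class_probabilities(response):
    """Convert the first token's top_logprobs into plain probabilities."""
    top = response["choices"][0]["logprobs"]["top_logprobs"][0]
    return {token.strip(): math.exp(logprob) for token, logprob in top.items()}

# Trimmed copy of the response shown above.
response = {
    "choices": [
        {
            "logprobs": {
                "top_logprobs": [{" negative": -4.9785037, " positive": -0.03597255}]
            },
            "text": " positive",
        }
    ]
}
probs = class_probabilities(response)
print(probs)
```

For this response, the positive class comes out at roughly 0.96 and the negative class at under 0.01, which can serve as the sentiment score.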

Case study: Categorization for Email triage

Let's say you'd like to categorize incoming email into one of a large number of predefined categories. For classification into a large number of categories, we recommend you convert those categories into numbers, which will work well up to ~500 categories. We've observed that adding a space before the number sometimes slightly helps the performance, due to tokenization. You may want to structure your training data as follows:

{"prompt":"Subject: <email_subject>\nFrom:<customer_name>\nDate:<date>\nContent:<email_body>\n\n###\n\n", "completion":" <numerical_category>"}

For example:

{"prompt":"Subject: Update my address\nFrom:Joe Doe\nTo:support@ourcompany.com\nDate:2021-06-03\nContent:Hi,\nI would like to update my billing address to match my delivery address.\n\nPlease let me know once done.\n\nThanks,\nJoe\n\n###\n\n", "completion":" 4"}

In the example above we used an incoming email capped at 2043 tokens as input. (This allows for a 4-token separator and a one-token completion, summing up to 2048.) As a separator we used \n\n###\n\n and we removed any occurrence of ### within the email.
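The steps described above (numbering the categories, adding a leading space, stripping ### from the email body) can be sketched as follows. `build_label_map` and `email_to_example` are hypothetical helpers; sorting the category names is just one arbitrary but stable way to assign numbers, and capping the body at 2043 tokens is left out because it needs a tokenizer.

```python
SEPARATOR = "\n\n###\n\n"

def build_label_map(categories):
    """Map each category name to a numeric completion with a leading space."""
    return {name: f" {index}" for index, name in enumerate(sorted(categories))}

def email_to_example(subject, sender, date, body, category, label_map):
    """Format one email as a prompt-completion pair."""
    clean_body = body.replace("###", "")  # keep the separator unique
    prompt = (
        f"Subject: {subject}\nFrom:{sender}\nDate:{date}\n"
        f"Content:{clean_body}{SEPARATOR}"
    )
    return {"prompt": prompt, "completion": label_map[category]}

label_map = build_label_map(["billing", "address change", "returns"])
example = email_to_example(
    "Update my address", "Joe Doe", "2021-06-03",
    "Hi,\nPlease update my billing address. ###\nThanks,\nJoe",
    "address change", label_map,
)
print(example)
```

Keep the label map alongside the model so that the numeric completions can be translated back into category names at inference time.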

Conditional generation

Conditional generation is a problem where the content needs to be generated given some kind of input. This includes paraphrasing, summarizing, entity extraction, product description writing given specifications, chatbots and many others. For this type of problem we recommend:

  • Use a separator at the end of the prompt, e.g. \n\n###\n\n. Remember to also append this separator when you eventually make requests to your model.
  • Use an ending token at the end of the completion, e.g. END
  • Remember to add the ending token as a stop sequence during inference, e.g. stop=[" END"]
  • Aim for at least ~500 examples
  • Ensure that the prompt + completion doesn't exceed 2048 tokens, including the separator
  • Ensure the examples are of high quality and follow the same desired format
  • Ensure that the dataset used for fine-tuning is very similar in structure and type of task to what the model will be used for
  • Using a lower learning rate and only 1-2 epochs tends to work better for these use cases
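At inference time, the separator and the stop sequence must match what was used in training. Below is a small sketch of a request payload built along those lines; `build_completion_request` is a hypothetical helper, the default of 256 for max_tokens is an arbitrary assumption, and the model name is a placeholder.

```python
def build_completion_request(model, prompt, separator="\n\n###\n\n",
                             stop=" END", max_tokens=256):
    """Build Completions parameters that mirror the training format."""
    return {
        "model": model,
        "prompt": prompt + separator,  # same separator as in training
        "stop": [stop],                # truncate at the ending token
        "max_tokens": max_tokens,
    }

request = build_completion_request(
    "YOUR_FINE_TUNED_MODEL_NAME",
    "Samsung Galaxy Feel\nAn Android smartphone for the Japanese market.",
)
print(request)
```

The resulting dictionary can be sent as the JSON body of a request to the completions endpoint, or unpacked into the keyword arguments of the Python call shown earlier in this guide.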

Case study: Write an engaging ad based on a Wikipedia article

This is a generative use case so you would want to ensure that the samples you provide are of the highest quality, as the fine-tuned model will try to imitate the style (and mistakes) of the given examples. A good starting point is around 500 examples. A sample dataset might look like this:

{"prompt":"<Product Name>\n<Wikipedia description>\n\n###\n\n", "completion":" <engaging ad> END"}

For example:

{"prompt":"Samsung Galaxy Feel\nThe Samsung Galaxy Feel is an Android smartphone developed by Samsung Electronics exclusively for the Japanese market. The phone was released in June 2017 and was sold by NTT Docomo. It runs on Android 7.0 (Nougat), has a 4.7 inch display, and a 3000 mAh battery.\nSoftware\nSamsung Galaxy Feel runs on Android 7.0 (Nougat), but can be later updated to Android 8.0 (Oreo).\nHardware\nSamsung Galaxy Feel has a 4.7 inch Super AMOLED HD display, 16 MP back facing and 5 MP front facing cameras. It has a 3000 mAh battery, a 1.6 GHz Octa-Core ARM Cortex-A53 CPU, and an ARM Mali-T830 MP1 700 MHz GPU. It comes with 32GB of internal storage, expandable to 256GB via microSD. Aside from its software and hardware specifications, Samsung also introduced a unique hole in the phone's shell to accommodate the Japanese perceived penchant for personalizing their mobile phones. The Galaxy Feel's battery was also touted as a major selling point since the market favors handsets with longer battery life. The device is also waterproof and supports 1seg digital broadcasts using an antenna that is sold separately.\n\n###\n\n", "completion":" Looking for a smartphone that can do it all? Look no further than Samsung Galaxy Feel! With a slim and sleek design, our latest smartphone features high-quality picture and video capabilities, as well as an award winning battery life. END"}

Here we used a multi-line separator, as Wikipedia articles contain multiple paragraphs and headings. We also used a simple end token, to ensure that the model knows when the completion should finish.

Case study: Entity extraction

This is similar to a language transformation task. To improve the performance, it is best to sort the extracted entities either alphabetically or in the same order as they appear in the original text. This will help the model to keep track of all the entities which need to be generated in order. The dataset could look as follows:

{"prompt":"<any text, for example news article>\n\n###\n\n", "completion":" <list of entities, separated by a newline> END"}

For example:

{"prompt":"Portugal will be removed from the UK's green travel list from Tuesday, amid rising coronavirus cases and concern over a \"Nepal mutation of the so-called Indian variant\". It will join the amber list, meaning holidaymakers should not visit and returnees must isolate for 10 days...\n\n###\n\n", "completion":" Portugal\nUK\nNepal mutation\nIndian variant END"}

A multi-line separator works best, as the text will likely contain multiple lines. Ideally there will be a high diversity of input prompt types (news articles, Wikipedia pages, tweets, legal documents), reflecting the texts likely to be encountered when extracting entities.

Case study: Customer support chatbot

A chatbot prompt will normally contain relevant context about the conversation (such as order details), a summary of the conversation so far, and the most recent messages. For this use case, the same past conversation can generate multiple rows in the dataset, each with a slightly different context, with one row for every agent response used as a completion. This use case will require a few thousand examples, as it will likely deal with different types of requests and customer issues. To ensure high-quality performance, we recommend vetting the conversation samples to ensure the quality of agent messages. The summary can be generated with a separate text transformation fine-tuned model. The dataset could look as follows:

{"prompt":"Summary: <summary of the interaction so far>\n\nSpecific information:<for example order details in natural language>\n\n###\n\nCustomer: <message1>\nAgent: <response1>\nCustomer: <message2>\nAgent:", "completion":" <response2>\n"}
{"prompt":"Summary: <summary of the interaction so far>\n\nSpecific information:<for example order details in natural language>\n\n###\n\nCustomer: <message1>\nAgent: <response1>\nCustomer: <message2>\nAgent: <response2>\nCustomer: <message3>\nAgent:", "completion":" <response3>\n"}

Here we purposefully separated different types of input information, but kept the Customer/Agent dialog in the same format across the prompt and the completion. All completions should be written by the agent only, and we can use \n as a stop sequence during inference.
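Turning one past conversation into several training rows, as described above, can be sketched as follows. `conversation_to_examples` is a hypothetical helper; for simplicity it reuses one fixed summary for every row, whereas in practice the summary would usually be regenerated as the conversation grows.

```python
def conversation_to_examples(summary, specific_info, turns):
    """Build one row per agent response.

    `turns` is an ordered list of (customer_message, agent_response) pairs.
    """
    header = f"Summary: {summary}\n\nSpecific information:{specific_info}\n\n###\n\n"
    examples = []
    history = ""
    for customer_message, agent_response in turns:
        history += f"Customer: {customer_message}\nAgent:"
        # The prompt ends at "Agent:"; the completion is the agent's reply.
        examples.append({"prompt": header + history,
                         "completion": f" {agent_response}\n"})
        history += f" {agent_response}\n"
    return examples

rows = conversation_to_examples(
    "Customer asks about a late order.",
    "Order 123 shipped on Monday.",
    [("Where is my order?", "It shipped on Monday."),
     ("When will it arrive?", "It should arrive by Friday.")],
)
for row in rows:
    print(row)
```

Each completion ends with \n, which matches the stop sequence suggested for inference.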

Case study: Product description based on a technical list of properties

Here it is important to convert the input data into natural language, which will likely lead to superior performance. For example, the following format:

{"prompt":"Item=handbag, Color=army_green, price=$99, size=S->", "completion":" This stylish small green handbag will add a unique touch to your look, without costing you a fortune."}

Won't work as well as:

{"prompt":"Item is a handbag. Colour is army green. Price is midrange. Size is small.->", "completion":" This stylish small green handbag will add a unique touch to your look, without costing you a fortune."}

For high performance, ensure that the completions are based on the description provided. If external content is often consulted, then adding such content in an automated way would improve the performance. If the description is based on images, it may help to use an algorithm to extract a textual description of the image. Since completions are only one sentence long, we can use . as the stop sequence during inference.
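Converting a property list into natural language can be automated with a simple template. The sketch below is deliberately naive: `properties_to_prompt` is a hypothetical helper that writes one "X is Y." sentence per property and replaces underscores with spaces. It does not add articles or map prices to words like "midrange", so real data would need more care.

```python
def properties_to_prompt(properties, separator="->"):
    """Render a dict of properties as short sentences followed by the separator."""
    sentences = [
        f"{name.capitalize()} is {str(value).replace('_', ' ')}."
        for name, value in properties.items()
    ]
    return " ".join(sentences) + separator

prompt = properties_to_prompt(
    {"item": "handbag", "colour": "army_green", "price": "midrange", "size": "small"}
)
print(prompt)
```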

Advanced usage

Customize your model name

You can add a suffix of up to 40 characters to your fine-tuned model name using the suffix parameter.

OpenAI CLI:

openai api fine_tunes.create -t test.jsonl -m ada --suffix "custom model name"

The resulting name would be:

ada:ft-your-org:custom-model-name-2022-02-15-04-21-04

Analyzing your fine-tuned model

We attach a results file to each job once it has been completed. The ID of this results file is listed when you retrieve a fine-tune, and also when you look at the events on a fine-tune. You can download these files:

OpenAI CLI:

openai api fine_tunes.results -i <YOUR_FINE_TUNE_JOB_ID>

cURL:

curl https://api.openai.com/v1/files/$RESULTS_FILE_ID/content \
  -H "Authorization: Bearer $OPENAI_API_KEY" > results.csv

The _results.csv file contains a row for each training step, where a step refers to one forward and backward pass on a batch of data. In addition to the step number, each row contains the following fields corresponding to that step:

  • elapsed_tokens: the number of tokens the model has seen so far (including repeats)
  • elapsed_examples: the number of examples the model has seen so far (including repeats), where one example is one element in your batch. For example, if batch_size = 4, each step will increase elapsed_examples by 4.
  • training_loss: loss on the training batch
  • training_sequence_accuracy: the fraction of completions in the training batch for which the model's predicted tokens matched the true completion tokens exactly. For example, with a batch_size of 3, if your data contains the completions [[1, 2], [0, 5], [4, 2]] and the model predicted [[1, 1], [0, 5], [4, 2]], this accuracy will be 2/3 = 0.67
  • training_token_accuracy: the fraction of tokens in the training batch that were correctly predicted by the model. For example, with a batch_size of 3, if your data contains the completions [[1, 2], [0, 5], [4, 2]] and the model predicted [[1, 1], [0, 5], [4, 2]], this accuracy will be 5/6 = 0.83
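The two accuracy definitions can be reproduced on the example batch above. This is only an illustration of the definitions, not the code used to produce the results file, and it assumes each prediction has the same length as its target.

```python
def sequence_accuracy(targets, predictions):
    """Fraction of completions whose tokens all match the target exactly."""
    matches = sum(1 for target, predicted in zip(targets, predictions)
                  if target == predicted)
    return matches / len(targets)

def token_accuracy(targets, predictions):
    """Fraction of individual tokens that match the target."""
    correct = 0
    total = 0
    for target, predicted in zip(targets, predictions):
        total += len(target)
        correct += sum(1 for a, b in zip(target, predicted) if a == b)
    return correct / total

targets = [[1, 2], [0, 5], [4, 2]]
predictions = [[1, 1], [0, 5], [4, 2]]
print(round(sequence_accuracy(targets, predictions), 2))  # 2/3 -> 0.67
print(round(token_accuracy(targets, predictions), 2))     # 5/6 -> 0.83
```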

Classification-specific metrics

We also provide the option of generating additional classification-specific metrics in the results file, such as accuracy and weighted F1 score. These metrics are calculated against the full validation set periodically during training and again at the end of fine-tuning. You will see them as additional columns in your results file.

To enable this, set the parameter --compute_classification_metrics. Additionally, you must provide a validation file, and set either the classification_n_classes parameter, for multiclass classification, or classification_positive_class, for binary classification.

OpenAI CLI:

# For multiclass classification
openai api fine_tunes.create \
  -t <TRAIN_FILE_ID_OR_PATH> \
  -v <VALIDATION_FILE_OR_PATH> \
  -m <MODEL> \
  --compute_classification_metrics \
  --classification_n_classes <N_CLASSES>

# For binary classification
openai api fine_tunes.create \
  -t <TRAIN_FILE_ID_OR_PATH> \
  -v <VALIDATION_FILE_OR_PATH> \
  -m <MODEL> \
  --compute_classification_metrics \
  --classification_n_classes 2 \
  --classification_positive_class <POSITIVE_CLASS_FROM_DATASET>

The following metrics will be displayed in your results file if you set --compute_classification_metrics:

For multiclass classification

  • classification/accuracy: accuracy
  • classification/weighted_f1_score: weighted F-1 score

For binary classification

The following metrics are based on a classification threshold of 0.5 (i.e. when the probability is > 0.5, an example is classified as belonging to the positive class).

  • classification/accuracy
  • classification/precision
  • classification/recall
  • classification/f{beta}
  • classification/auroc - AUROC
  • classification/auprc - AUPRC

Note that these evaluations assume that you are using text labels for classes that tokenize down to a single token, as described above. If these conditions do not hold, the numbers you get will likely be wrong.
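For readers unfamiliar with the threshold-based binary metrics listed above, here is a minimal sketch of how accuracy, precision, recall and F-beta follow from a 0.5 threshold. It uses the standard textbook definitions and is not the code that generates the results file; `binary_metrics` is a hypothetical helper, and AUROC and AUPRC are omitted.

```python
def binary_metrics(labels, positive_probabilities, threshold=0.5, beta=1.0):
    """Compute threshold-based metrics; labels are True for the positive class."""
    predictions = [p > threshold for p in positive_probabilities]
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, y_hat in pairs if y and y_hat)
    fp = sum(1 for y, y_hat in pairs if not y and y_hat)
    fn = sum(1 for y, y_hat in pairs if y and not y_hat)
    tn = sum(1 for y, y_hat in pairs if not y and not y_hat)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f_beta": f_beta,
    }

metrics = binary_metrics([True, True, False, False], [0.9, 0.4, 0.6, 0.1])
print(metrics)
```

In this toy batch there is one true positive, one false positive, one false negative and one true negative, so every metric comes out at 0.5.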

Validation

You can reserve some of your data for validation. A validation file has exactly the same format as a train file, and your train and validation data should be mutually exclusive. 您可以保留一些数据以供验证。验证文件与训练文件具有完全相同的格式,并且您的训练数据和验证数据应该互斥。

If you include a validation file when creating your fine-tune job, the generated results file will include evaluations on how well the fine-tuned model performs against your validation data at periodic intervals during training. 如果您在创建微调作业时包含验证文件,则生成的结果文件将包括对微调模型在训练期间定期对验证数据执行情况的评估。

OpenAI CLI:

openai api fine_tunes.create -t <TRAIN_FILE_ID_OR_PATH> \
  -v <VALIDATION_FILE_ID_OR_PATH> \
  -m <MODEL>

If you provided a validation file, we periodically calculate metrics on batches of validation data during training time. You will see the following additional metrics in your results file: 如果您提供了验证文件,我们会在训练期间定期计算批量验证数据的指标。您将在结果文件中看到以下附加指标:

  • validation_loss: loss on the validation batch validation_loss:验证批次的损失
  • validation_sequence_accuracy: the percentage of completions in the validation batch for which the model's predicted tokens matched the true completion tokens exactly. For example, with a batch_size of 3, if your data contains the completion [[1, 2], [0, 5], [4, 2]] and the model predicted [[1, 1], [0, 5], [4, 2]], this accuracy will be 2/3 = 0.67 validation_sequence_accuracy:模型的预测标记与真实完成标记完全匹配的验证批次中的完成百分比。例如, batch_size 为 3,如果您的数据包含完成 [[1, 2], [0, 5], [4, 2]] 和模型预测 [[1, 1], [0, 5], [4, 2]], 这个精度将是 2/3 = 0.67
  • validation_token_accuracy: the percentage of tokens in the validation batch that were correctly predicted by the model. For example, with a batch_size of 3, if your data contains the completion [[1, 2], [0, 5], [4, 2]] and the model predicted [[1, 1], [0, 5], [4, 2]], this accuracy will be 5/6 = 0.83 validation_token_accuracy:模型正确预测的验证批次中标记的百分比。例如, batch_size 为 3,如果您的数据包含完成 [[1, 2], [0, 5], [4, 2]] 和模型预测 [[1, 1], [0, 5], [4, 2]], 这个精度将是 5/6 = 0.83
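The two accuracy figures in the examples above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function names are invented here, and it assumes each predicted completion has the same number of tokens as the true completion.

```python
def sequence_accuracy(true_completions, predicted_completions):
    """Fraction of completions whose tokens all match exactly."""
    matches = sum(t == p for t, p in zip(true_completions, predicted_completions))
    return matches / len(true_completions)


def token_accuracy(true_completions, predicted_completions):
    """Fraction of individual tokens that were predicted correctly."""
    correct = 0
    total = 0
    for true_tokens, predicted_tokens in zip(true_completions, predicted_completions):
        total += len(true_tokens)
        correct += sum(a == b for a, b in zip(true_tokens, predicted_tokens))
    return correct / total


# The batch from the examples above (batch_size of 3).
true_batch = [[1, 2], [0, 5], [4, 2]]
predicted_batch = [[1, 1], [0, 5], [4, 2]]

seq_acc = sequence_accuracy(true_batch, predicted_batch)
tok_acc = token_accuracy(true_batch, predicted_batch)
print(round(seq_acc, 2))  # 0.67
print(round(tok_acc, 2))  # 0.83
```

One wrong token makes its whole completion count as a miss for sequence accuracy, but it costs only one of six tokens for token accuracy.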

Hyperparameters

We've picked default hyperparameters that work well across a range of use cases. The only required parameter is the training file. 我们选择了适用于一系列用例的默认超参数。唯一需要的参数是训练文件。

That said, tweaking the hyperparameters used for fine-tuning can often lead to a model that produces higher quality output. In particular, you may want to configure the following: 也就是说,调整用于微调的超参数通常可以产生产生更高质量输出的模型。特别是,您可能需要配置以下内容:

  • model: The name of the base model to fine-tune. You can select one of "ada", "babbage", "curie", or "davinci". To learn more about these models, see the Models documentation. model :要微调的基本模型的名称。您可以选择“ada”、“babbage”、“curie”或“davinci”之一。要了解有关这些模型的更多信息,请参阅模型文档。
  • n_epochs - defaults to 4. The number of epochs to train the model for. An epoch refers to one full cycle through the training dataset. n_epochs - 默认为 4。训练模型的时期数。一个纪元指的是训练数据集的一个完整周期。
  • batch_size - defaults to ~0.2% of the number of examples in the training set, capped at 256. The batch size is the number of training examples used in a single forward and backward pass. In general, we've found that larger batch sizes tend to work better for larger datasets. batch_size - 默认为训练集中示例数量的约 0.2%,上限为 256。批量大小是单次前向和反向传递所使用的训练示例的数量。总的来说,我们发现更大的批量大小往往更适用于更大的数据集。
  • learning_rate_multiplier - defaults to 0.05, 0.1, or 0.2 depending on final batch_size. The fine-tuning learning rate is the original learning rate used for pretraining multiplied by this multiplier. We recommend experimenting with values in the range 0.02 to 0.2 to see what produces the best results. Empirically, we've found that larger learning rates often perform better with larger batch sizes. learning_rate_multiplier - 默认为 0.05、0.1 或 0.2,具体取决于最终的 batch_size 。微调学习率是用于预训练的原始学习率乘以该乘数。我们建议使用 0.02 到 0.2 范围内的值进行试验,以查看产生最佳结果的值。根据经验,我们发现较大的学习率通常在较大的批量大小下表现更好。
  • compute_classification_metrics - defaults to False. If True, computes classification-specific metrics (accuracy, F-1 score, etc.) on the validation set at the end of every epoch when fine-tuning for classification tasks. compute_classification_metrics - 默认为 False。如果为 True,则在针对分类任务进行微调时,于每个 epoch 结束时在验证集上计算特定于分类的指标(准确性、F-1 分数等)。

To configure these additional hyperparameters, pass them in via command line flags on the OpenAI CLI, for example: 要配置这些额外的超参数,请通过 OpenAI CLI 上的命令行标志传递它们,例如:

openai api fine_tunes.create \
  -t file-JD89ePi5KMsB3Tayeli5ovfW \
  -m ada \
  --n_epochs 1
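If you launch jobs from a script, a small helper can assemble the same command line and estimate the default batch size. The sketch below is illustrative only: both function names are hypothetical, and the rounding (floor) and the minimum of 1 in the batch-size estimate are assumptions, since the text only says "~0.2%, capped at 256". The helper passes hyperparameter names through unchanged, so it is up to the caller to use flags the CLI accepts.

```python
import shlex


def estimate_default_batch_size(n_examples, cap=256):
    """Approximate the default batch size: ~0.2% of the examples, capped.

    Flooring and the minimum of 1 are assumptions made for this sketch.
    """
    return max(1, min(cap, n_examples * 2 // 1000))


def build_fine_tune_command(train_file, model, **hyperparameters):
    """Build the argument list for `openai api fine_tunes.create`."""
    args = ["openai", "api", "fine_tunes.create", "-t", train_file, "-m", model]
    for name, value in hyperparameters.items():
        flag = f"--{name}"
        if value is True:
            # Boolean switches such as compute_classification_metrics take no value.
            args.append(flag)
        else:
            args += [flag, str(value)]
    return args


command = build_fine_tune_command("file-JD89ePi5KMsB3Tayeli5ovfW", "ada", n_epochs=1)
command_line = shlex.join(command)
print(command_line)

# Estimated default batch sizes for a few training-set sizes.
for n in (100, 10_000, 500_000):
    print(n, estimate_default_batch_size(n))
```

The resulting list can be passed to `subprocess.run` as-is; `shlex.join` is used here only to print a copy-pasteable command.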

Continue fine-tuning from a fine-tuned model 从微调模型继续微调

If you have already fine-tuned a model for your task and now have additional training data that you would like to incorporate, you can continue fine-tuning from the model. This creates a model that has learned from all of the training data without having to re-train from scratch. 如果您已经为您的任务微调了一个模型,并且现在有您想要合并的额外训练数据,您可以从模型继续微调。这将创建一个从所有训练数据中学习的模型,而无需从头开始重新训练。

To do this, pass in the fine-tuned model name when creating a new fine-tuning job (e.g. -m curie:ft-<org>-<date>). Other training parameters do not have to be changed; however, if your new training data is much smaller than your previous training data, you may find it useful to reduce learning_rate_multiplier by a factor of 2 to 4. 为此,请在创建新的微调作业时传入微调模型名称(例如 -m curie:ft-<org>-<date> )。不必更改其他训练参数,但是如果您的新训练数据比以前的训练数据小得多,您可能会发现将 learning_rate_multiplier 减少 2 到 4 倍很有用。
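As a quick worked example of the "factor of 2 to 4" suggestion, the sketch below computes the range of reduced multipliers for a given previous value. The function name and the starting value of 0.1 are chosen for illustration only.

```python
def reduced_multiplier_range(previous_multiplier):
    """Return (lowest, highest) multiplier after reducing by a factor of 4 and 2."""
    return previous_multiplier / 4, previous_multiplier / 2


# Hypothetical previous run that used a learning_rate_multiplier of 0.1.
low, high = reduced_multiplier_range(0.1)
print(low, high)  # 0.025 0.05
```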

Weights & Biases

You can sync your fine-tunes with Weights & Biases to track experiments, models, and datasets. 您可以将微调与 Weights & Biases 同步以跟踪实验、模型和数据集。

To get started, you will need a Weights & Biases account and a paid OpenAI plan. To make sure you are using the latest version of openai and wandb, run: 要开始使用,您需要一个 Weights & Biases 帐户和一个付费的 OpenAI 计划。为确保您使用的是最新版本的 openai 和 wandb ,请运行:

pip install --upgrade openai wandb

To sync your fine-tunes with Weights & Biases, run: 要将微调与 Weights & Biases 同步,请运行:

openai wandb sync

You can read the Weights & Biases documentation for more information on this integration. 您可以阅读 Weights & Biases 文档以获取有关此集成的更多信息。

Example notebooks

Classification

finetuning-classification.ipynb

This notebook will demonstrate how to fine-tune a model that can classify whether a piece of input text is related to Baseball or Hockey. We will perform this task in four steps in the notebook: 此笔记本将演示如何微调模型,该模型可以对一段输入文本是否与棒球或曲棍球相关进行分类。我们将在笔记本中分四个步骤执行此任务:

  1. Data exploration will give an overview of the data source and what an example looks like 数据探索将概述数据源和示例
  2. Data preparation will turn our data source into a jsonl file that can be used for fine-tuning 数据准备会将我们的数据源变成一个jsonl文件,可以用来微调
  3. Fine-tuning will kick off the fine-tuning job and explain the resulting model's performance 微调将启动微调工作并解释生成的模型的性能
  4. Using the model will demonstrate making requests to the fine-tuned model to get predictions. 使用该模型将演示向微调模型发出请求以获得预测。
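The data preparation step can be sketched as follows. This is not the notebook's code: the sample sentences, the `to_jsonl_lines` helper, the prompt separator, and the leading space on each label are illustrative assumptions. The only point is to show one prompt/completion JSON object per line.

```python
import json

# Assumed conventions for this sketch: a fixed separator marks the end of the
# prompt, and each label starts with a space.
PROMPT_SEPARATOR = "\n\n###\n\n"


def to_jsonl_lines(labeled_texts):
    """Turn (text, label) pairs into JSONL lines with prompt/completion keys."""
    lines = []
    for text, label in labeled_texts:
        record = {
            "prompt": text.strip() + PROMPT_SEPARATOR,
            "completion": " " + label,
        }
        lines.append(json.dumps(record))
    return lines


# Made-up examples standing in for the real data source.
samples = [
    ("The pitcher threw a no-hitter last night.", "baseball"),
    ("The goalie stopped 40 shots in the shootout.", "hockey"),
]

jsonl_lines = to_jsonl_lines(samples)
print("\n".join(jsonl_lines))
```

Writing `"\n".join(jsonl_lines)` to a file gives a training file that can then be passed to the fine-tuning job.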


Question answering 问题解答

olympics-1-collect-data.ipynb, olympics-2-create-qa.ipynb, olympics-3-train-qa.ipynb

The idea of this project is to create a question answering model, based on a few paragraphs of provided text. Base GPT-3 models do a good job at answering questions when the answer is contained within the paragraph; however, if the answer isn't contained, the base models tend to try their best to answer anyway, often leading to confabulated answers. 这个项目的想法是基于提供的文本的几段来创建一个问答模型。当答案包含在段落中时,基础 GPT-3 模型在回答问题方面做得很好,但是如果答案不包含在内,基础模型往往仍会尽力回答,通常会导致虚构的答案。

To create a model which answers questions only if there is sufficient context for doing so, we first create a dataset of questions and answers based on paragraphs of text. In order to train the model to answer only when the answer is present, we also add adversarial examples, where the question doesn't match the context. In those cases, we ask the model to output "No sufficient context for answering the question". 为了创建一个仅在有足够上下文的情况下才回答问题的模型,我们首先创建一个基于文本段落的问题和答案数据集。为了训练模型仅在出现答案时回答,我们还添加了对抗性示例,其中问题与上下文不匹配。在这些情况下,我们要求模型输出“没有足够的上下文来回答问题”。

We will perform this task in three notebooks: 我们将在三个笔记本中执行此任务:

  1. The first notebook focuses on collecting recent data, which GPT-3 didn't see during its pre-training. We picked the topic of Olympic Games 2020 (which actually took place in the summer of 2021), and downloaded 713 unique pages. We organized the dataset by individual sections, which will serve as context for asking and answering the questions. 第一个笔记本侧重于收集最近的数据,这些数据是 GPT-3 在预训练期间没有看到的。我们选择了 2020 年奥运会(实际发生在 2021 年夏天)的主题,并下载了 713 个独特的页面。我们按各个部分组织数据集,这些部分将作为提问和回答问题的背景。
  2. The second notebook will utilize Davinci-instruct to ask a few questions based on a Wikipedia section, as well as answer those questions, based on that section. 第二个笔记本将利用 Davinci-instruct 根据维基百科部分提出一些问题,并根据该部分回答这些问题。
  3. The third notebook will utilize the dataset of context, question and answer pairs to additionally create adversarial questions and context pairs, where the question was not generated on that context. In those cases the model will be prompted to answer "No sufficient context for answering the question". We will also train a discriminator model, which predicts whether the question can be answered based on the context or not. 第三个笔记本将利用上下文、问题和答案对的数据集来额外创建对抗性问题和上下文对,其中问题不是在该上下文中生成的。在这些情况下,系统将提示模型回答“没有足够的上下文来回答问题”。我们还将训练一个鉴别器模型,该模型预测是否可以根据上下文回答问题。
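The adversarial-example idea from the third notebook can be sketched as below. This is a simplified illustration, not the notebook's implementation: the record layout, the prompt template, the sample records, and the strategy of borrowing the context from the next record are all assumptions made for this example.

```python
NO_CONTEXT_ANSWER = "No sufficient context for answering the question"


def make_example(context, question, answer):
    """Format one training example; the prompt template is hypothetical."""
    return {
        "prompt": f"{context}\nQuestion: {question}\nAnswer:",
        "completion": " " + answer,
    }


def build_qa_examples(records):
    """Return answerable examples plus adversarial ones with a mismatched context."""
    examples = []
    for i, record in enumerate(records):
        # Regular example: the question is paired with its own context.
        examples.append(
            make_example(record["context"], record["question"], record["answer"])
        )
        # Adversarial example: same question, context borrowed from another record.
        other = records[(i + 1) % len(records)]
        if other["context"] != record["context"]:
            examples.append(
                make_example(other["context"], record["question"], NO_CONTEXT_ANSWER)
            )
    return examples


# Made-up records standing in for the generated context/question/answer data.
records = [
    {
        "context": "Section A describes the opening ceremony.",
        "question": "What does Section A describe?",
        "answer": "The opening ceremony",
    },
    {
        "context": "Section B lists the swimming events.",
        "question": "What does Section B list?",
        "answer": "The swimming events",
    },
]

qa_examples = build_qa_examples(records)
for example in qa_examples:
    print(example["completion"])
```

Picking the mismatched context at random, or picking a similar but still non-matching section, would make the adversarial examples harder; the fixed rotation here just keeps the sketch deterministic.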