About ChatGPT: what can be learned at this stage


Preface

I have been using ChatGPT for a few days now, and its strengths and limitations have become fairly clear. Unfortunately, it is not an open-source model and I have not found a corresponding paper, so all I can do here is organize the information available on the official homepage. Going forward, my attention will mostly be on GPT-4.

Introducing ChatGPT

We’ve trained a model called ChatGPT which interacts in a conversational way. The dialogue format makes it possible for ChatGPT to answer followup questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests.

In short, ChatGPT is a model that interacts in a conversational way. Over the course of a dialogue it can answer questions, correct its own mistakes, challenge inaccurate premises, and refuse inappropriate requests.

In actual use, this means that if ChatGPT's answer contains a mistake you did not want to see, you can point out where it went wrong and have it revise the answer; likewise, if you ask for something that violates ethical norms, it will refuse to respond.

The homepage gives an example of using ChatGPT to debug code. For programmers, it is clearly a useful tool, whether for fixing bugs, writing a baseline, or Google-driven programming.

Approach

ChatGPT is trained with RLHF. Taken literally, RLHF means applying Reinforcement Learning to a language model based on Human Feedback, which is naturally different from an ordinary fine-tuning process, let alone prompt tuning.

The concrete steps of RLHF are as follows:

(Figure: the three-step RLHF training pipeline, from the OpenAI page)

  1. Collect demonstration data and train a supervised policy: sample prompts (questions) from a large prompt dataset and have human labelers write answers for them. This yields a set of supervised data, which is then used to fine-tune GPT-3.5. As for what exactly GPT-3.5 is, we will have to wait and see.
  2. Collect comparison data to train a reward model: for a given prompt, sample outputs from several different models, then have labelers rank those answers from best to worst. This yields a set of scored data that can be used to train a reward model (preference model).
  3. With the policy and the reward model in hand, reinforcement learning can begin: again sample prompts from the dataset, initialize a PPO policy from the supervised policy, have the reward model assign a reward to each generated output, and optimize the PPO policy against that reward (see the sketch after this list).
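
Since ChatGPT itself has no open implementation or paper, the snippet below is only a minimal PyTorch-style sketch of steps 2 and 3, assuming InstructGPT-style objectives (a pairwise ranking loss for the reward model, and a reward-minus-KL-penalty signal for PPO). The names `reward_model`, `policy_logprob`, `sft_logprob`, and the value of `kl_coef` are hypothetical placeholders for illustration, not anything OpenAI has published.

```python
import torch
import torch.nn.functional as F


def reward_model_loss(reward_model, prompt, ranked_responses):
    """Step 2: pairwise ranking loss for the reward (preference) model.

    `ranked_responses` is one labeler's ranking, ordered best to worst.
    For every (better, worse) pair, the loss -log(sigmoid(r_better - r_worse))
    pushes the better answer's score above the worse one's.
    `reward_model(prompt, response)` is assumed to return a scalar tensor.
    """
    losses = []
    for i in range(len(ranked_responses)):
        for j in range(i + 1, len(ranked_responses)):
            r_better = reward_model(prompt, ranked_responses[i])
            r_worse = reward_model(prompt, ranked_responses[j])
            losses.append(-F.logsigmoid(r_better - r_worse))
    return torch.stack(losses).mean()


def ppo_reward(reward_model, prompt, response, policy_logprob, sft_logprob,
               kl_coef=0.02):
    """Step 3: the scalar reward handed to PPO for one sampled response.

    It is the reward model's score minus a KL penalty that keeps the PPO
    policy close to the supervised (SFT) policy; kl_coef is an illustrative
    value, not a published hyperparameter.
    """
    score = reward_model(prompt, response)
    kl_penalty = kl_coef * (policy_logprob - sft_logprob)
    return score - kl_penalty
```

The KL term is what ties the PPO policy back to the supervised policy from step 1, which is why the pipeline needs all three stages rather than reinforcement learning alone.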

This part reads a bit awkwardly; I need to go brush up on reinforcement learning.

Note: ChatGPT is fine-tuned from GPT-3.5.

Limitations

  • ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers. Fixing this issue is challenging, as: (1) during RL training, there’s currently no source of truth; (2) training the model to be more cautious causes it to decline questions that it can answer correctly; and (3) supervised training misleads the model because the ideal answer depends on what the model knows, rather than what the human demonstrator knows.
  • ChatGPT is sensitive to tweaks to the input phrasing or attempting the same prompt multiple times. For example, given one phrasing of a question, the model can claim to not know the answer, but given a slight rephrase, can answer correctly.
  • The model is often excessively verbose and overuses certain phrases, such as restating that it’s a language model trained by OpenAI. These issues arise from biases in the training data (trainers prefer longer answers that look more comprehensive) and well-known over-optimization issues.1,2
  • Ideally, the model would ask clarifying questions when the user provided an ambiguous query. Instead, our current models usually guess what the user intended.
  • While we’ve made efforts to make the model refuse inappropriate requests, it will sometimes respond to harmful instructions or exhibit biased behavior. We’re using the Moderation API to warn or block certain types of unsafe content, but we expect it to have some false negatives and positives for now. We’re eager to collect user feedback to aid our ongoing work to improve this system.

These are basically user-experience issues and easy enough to follow, so I will not translate them here.

That is roughly everything on the homepage. It does touch on quite a lot of related background, so I will work through it piece by piece.