Talking About ChatGPT: Throwing Out a Brick to Attract Jade
Without Further Ado
ChatGPT has been on fire since its release on November 30, 2022 (it passed 1 million registered users within 5 days of launch, 100 million MAU within 2 months, and over 13 million DAU). I'm personally very interested in ChatGPT and very optimistic about this technology and its downstream applications. Let me say up front that I'm probably not the person on our team who understands ChatGPT best, but today I'd like to share what I know and throw out a brick to attract some jade.
Introduction
What Is It
ChatGPT (full name: Chat Generative Pre-trained Transformer). From OpenAI's introduction:
ChatGPT: Optimizing Language Models for Dialogue
We've trained a model called ChatGPT which interacts in a conversational way. The dialogue format makes it possible for ChatGPT to answer followup questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests. ChatGPT is a sibling model to InstructGPT, which is trained to follow an instruction in a prompt and provide a detailed response.
OpenAI treats ChatGPT as a warm-up before the release of GPT-4, and accordingly labels it as part of the GPT-3.5 series.
ChatGPT can handle almost every natural language understanding and generation task.
Here is another demonstration, this time of its ability to write code.
Results
On major content platforms, and inside ByteDance as well, there is a great deal of discussion about ChatGPT: some people are amazed by what it can do, while others say it sometimes sounds like a smug middle-aged uncle solemnly spouting nonsense. The Discussion section lists some views from the heads of the major tech companies.
My personal take only:
The arrival of ChatGPT not only points the way for the next generation of intelligent search engines; it is also a milestone in the history of AI and will push the AIGC (AI-generated content) industry further forward. ChatGPT makes at least three important breakthroughs:
- ChatGPT can understand fairly complex input, such as sentences with multiple levels of nested grammar (semantic understanding).
- ChatGPT can use the conversation context to understand a question and keep digging into it over multiple turns (it can track context within roughly 8k tokens). (conversational context)
- ChatGPT can automatically refuse to carry out certain illegitimate instructions (refusal).
This means ChatGPT can already carry on a discussion with a human around a given topic, something none of the earlier voice-assistant-like products could do.
How It Works
As of this writing (2023-02-26), OpenAI has not published a ChatGPT paper. OpenAI's blog post about ChatGPT contains the passage below; in other words, ChatGPT's model architecture and training method are almost identical to those of the InstructGPT paper, with only slight differences in how the training data was collected. The discussion of principles below is therefore based on the InstructGPT paper.
We trained this model using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT, but with slight differences in the data collection setup. We trained an initial model using supervised fine-tuning: human AI trainers provided conversations in which they played both sides—the user and an AI assistant. We gave the trainers access to model-written suggestions to help them compose their responses. We mixed this new dialogue dataset with the InstructGPT dataset, which we transformed into a dialogue format.
To create a reward model for reinforcement learning, we needed to collect comparison data, which consisted of two or more model responses ranked by quality. To collect this data, we took conversations that AI trainers had with the chatbot. We randomly selected a model-written message, sampled several alternative completions, and had AI trainers rank them. Using these reward models, we can fine-tune the model using Proximal Policy Optimization. We performed several iterations of this process.
What Problem It Solves
InstructGPT works in a prompt-and-completion form, while ChatGPT supports a dialogue form ("which interacts in a conversational way"). Part of ChatGPT's training data was newly labeled, and part was converted from the old InstructGPT training data.
Below, let's look at the problem being solved from two perspectives.
From OpenAI's perspective, the problem InstructGPT sets out to solve: the paragraph below, quoted from the InstructGPT paper, says that an LLM (Large Language Model) like GPT-3 is not well aligned with the user's intent and may generate content that is unhelpful, untruthful, or toxic.
Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback.
The following are two examples 🌰 comparing GPT-3 and InstructGPT, both about helpfulness; I won't include examples of untruthful (fabricated facts) or toxic (explicit/violent) output here.
- Explain the Moon landing to a six-year-old
- Write a poem about 🐸 (frogs)
From an algorithm engineer's perspective, InstructGPT is really about making LLMs practical to deploy. Compute aside, using an LLM used to mean carefully designing prompt templates and candidate words, whether for fine-tuning or few-shot use (i.e., prompt engineering; Danqi Chen's team and others have proposed automated methods, but that still takes significant work). InstructGPT's claim is: we fine-tuned the model once, and it works even on tasks it was never trained on, which the paper illustrates with non-English tasks and code. InstructGPT and ChatGPT still need prompt design, but apparently far less of the fiddly hand-crafting.
Concepts Explained
fine-tuning, prompt & instruct learning
Instruction learning is an idea proposed in the 2021 paper "Finetuned Language Models Are Zero-Shot Learners" by Quoc V. Le's team at Google [5]. Both instruction learning and prompt learning aim to mine the knowledge the language model already has. The difference is that a prompt stimulates the model's **completion ability**, such as generating the second half of a sentence from the first half or filling in blanks, while an instruction stimulates the model's **understanding ability**: by giving an explicit instruction, it gets the model to take the correct action. We can see the two learning styles through the following examples:
Prompt learning: I bought this necklace for my girlfriend. She likes it very much. This necklace is so ____.
Instruction learning: Judge the sentiment of this sentence: "I bought this necklace for my girlfriend and she liked it." Options: A = good; B = average; C = poor.
The advantage of instruction learning is that after multi-task fine-tuning it can also do zero-shot on unseen tasks, whereas prompt learning targets a single task and generalizes less well. The figure below compares fine-tuning, prompt learning, and instruction learning.
Figure: similarities and differences among model fine-tuning, prompt learning, and instruction learning
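As a toy illustration (my own, not from the FLAN paper), here is how the same sentiment example would be formatted in the two styles:

```python
# Toy illustration of prompt-style vs instruction-style formatting of one example.
review = "I bought this necklace for my girlfriend. She likes it very much."

# Prompt learning: lean on the model's completion ability with a cloze template.
prompt_style = f"{review} This necklace is so ___."

# Instruction learning: state the task explicitly and give answer options.
instruct_style = (
    f"Judge the sentiment of this sentence: {review}\n"
    "Options: A = good; B = average; C = poor."
)
print(prompt_style)
print(instruct_style)
```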
Why do prompts work?
As shown in the figure below.
The real reason prompts work is that the training set is large enough that similar text already appears in it. For the translation example above, documents like the one shown below may well appear in the dataset. The model is powerful enough to have learned the relationship between "wrote in French", "translates", and "translate English to French", and that is why the prompt works.
Model Architecture
GPT-1, GPT-2, GPT-3, and GPT-3.5, the successive generations of text pre-training models, all use the Transformer as their core structure.
They differ in hyperparameters such as the number of layers and the word-vector dimension; the specifics are shown in Table 1.
Table 1: release time, parameter count, and training data volume of past GPT models
GPT-1 was born a few months before BERT. Both use the Transformer as their core structure; the difference is that GPT-1 uses a left-to-right generative pre-training task to obtain a general pre-trained model, which can then be fine-tuned on downstream tasks just like BERT. GPT-1 achieved SOTA on 9 NLP tasks at the time, but its relatively small model and data scale led to the birth of GPT-2.
Compared with GPT-1, GPT-2 did not change the model structure much; it simply used more parameters and more training data. GPT-2's most important idea is the claim that **all supervised learning is a subset of unsupervised language modeling**, which is also the predecessor of prompt learning. GPT-2 caused quite a stir when it was released: the news articles it generated were convincing enough to fool most humans. It was even dubbed "the most dangerous weapon in AI" at the time, and many news portals banned the use of GPT-2-generated articles.
When GPT-3 was proposed, beyond its results being far better than GPT-2's, what drew even more discussion was its 175 billion parameters. In addition to handling common NLP tasks, GPT-3 was unexpectedly found to be quite good at writing code in SQL, JavaScript, and other languages, and at simple arithmetic. GPT-3's training uses in-context learning, a form of meta-learning; the core idea of meta-learning is to use a small amount of data to find a suitable initialization so that the model can fit quickly on a limited dataset and still achieve good results.
Datasets
The blog says ChatGPT's SFT dataset is partly conversations labeled by labelers and partly converted from InstructGPT data. Since there is no detailed description, I won't go into my guesses here and will stick to what is documented: the InstructGPT datasets.
We first hired a team of 40 contractors to label our data, based on their performance on a screening test (see Section 3.4 and Appendix B.1 for more details). We then collect a dataset of human-written demonstrations of the desired output behavior on (mostly English) prompts submitted to the OpenAI API.
They hired 40 contract labelers and manually annotated two datasets (there are three datasets in total: the ones for training the SFT model and the RM model need labels, while the training set for PPO-ptx, i.e. InstructGPT itself, does not). Labeling for the **Supervised Fine-Tuning (SFT)** model requires understanding the prompt and writing an answer; labeling for the RM model only requires ranking the SFT model's outputs, which is comparatively easy.
The training data for InstructGPT's SFT model has two parts: one written from scratch by the labelers, and one collected from the playground of an early version of InstructGPT.
The SFT dataset contains about 13k training prompts (from the API and labeler-written), the RM dataset has 33k training prompts (from the API and labeler-written), and the PPO dataset has 31k training prompts (only from the API).
Dataset distribution:
English accounts for the vast majority, with other languages only a small share, so from an effectiveness standpoint it is well worth training a home-grown Chinese ChatGPT.
To wrap up this subsection, here is the labeling interface:
Training
Step 1: training the SFT model is essentially fine-tuning GPT-3; the figure below shows some hyperparameters from the paper. One point to note: they trained for 16 epochs even though the authors observed some overfitting after just one epoch, arguing that a little overfitting is not a problem here.
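Conceptually, the SFT step is just ordinary supervised fine-tuning of a causal language model on labeler-written demonstrations. Here is a minimal sketch; GPT-2 stands in for GPT-3, and the example data, learning rate, and batch handling are illustrative assumptions, not OpenAI's actual setup.

```python
# Minimal SFT sketch: fine-tune a causal LM on (prompt + demonstration) texts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Each example is just "prompt + labeler-written demonstration" as one text.
demos = [
    "Explain the moon landing to a 6 year old.\nPeople flew a rocket to the moon...",
    "Write a short poem about frogs.\nGreen jumpers by the pond...",
]
batch = tokenizer(demos, return_tensors="pt", padding=True, truncation=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for epoch in range(16):  # the paper reports 16 epochs, with mild overfitting after 1
    optimizer.zero_grad()
    # Standard next-token cross-entropy; for simplicity padding positions are not
    # masked out of the loss (a real setup would set those labels to -100).
    out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    optimizer.step()
```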
Step 2: training the reward model (RM). It is essentially a regression model: the prompt and the response are concatenated as input, the SFT model's unembedding layer is removed, and the model outputs a scalar score. In the end they used a 6B-parameter RM, because it is cheaper to compute and easier to train (the 175B model was unstable).
One interesting trick: the authors chose K as 4 or 9, because when labeling this ranking data most of the human's time goes into reading and understanding the prompt, while producing the ranking itself is quick since the outputs are fairly similar. Labeling time and compute roughly double, but because training uses pairwise comparisons, the effective training set grows much faster: K ranked responses yield C(K, 2) comparison pairs. A small sketch of the ranking loss follows.
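Below is a minimal sketch of how such a pairwise ranking loss can be written; this is my reading of the InstructGPT setup, not OpenAI's code, and the function name and scores are made up for illustration.

```python
# For one prompt with K ranked responses, each of the C(K, 2) pairs contributes
# -log(sigmoid(r_winner - r_loser)) to the reward-model loss.
import itertools
import torch
import torch.nn.functional as F

def rm_loss(scores: torch.Tensor) -> torch.Tensor:
    """scores: shape (K,), reward-model scores for K responses to the same prompt,
    already sorted from best to worst by the labeler's ranking."""
    pairs = list(itertools.combinations(range(len(scores)), 2))  # C(K, 2) pairs
    losses = [-F.logsigmoid(scores[w] - scores[l]) for w, l in pairs]
    return torch.stack(losses).mean()

# With K=9, one labeled prompt yields 36 comparison pairs instead of 6 for K=4.
print(rm_loss(torch.tensor([2.0, 1.0, 0.5, -1.0])))
```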
Step 3: reinforcement learning (RL), i.e. the RLHF that the paper talks about throughout. The objective function is shown below.
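The original post shows the objective as an image; for reference, the PPO-ptx objective as written in the InstructGPT paper is roughly the following (reconstructed from the paper, so double-check the notation against the original):

$$
\text{objective}(\phi)=
\mathbb{E}_{(x,y)\sim D_{\pi_{\phi}^{\mathrm{RL}}}}\!\left[r_{\theta}(x,y)-\beta\,\log\frac{\pi_{\phi}^{\mathrm{RL}}(y\mid x)}{\pi^{\mathrm{SFT}}(y\mid x)}\right]
+\gamma\,\mathbb{E}_{x\sim D_{\mathrm{pretrain}}}\!\left[\log \pi_{\phi}^{\mathrm{RL}}(x)\right]
$$

Here $r_{\theta}$ is the reward model from step 2, the KL-style term keeps the RL policy close to the SFT model, and the $\gamma$ term mixes pretraining gradients back in (the "ptx" part) to limit regressions on public NLP tasks.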
Experiments
The 68-page InstructGPT paper devotes half of its length to experiments and discussion (which is also the main contribution of most of the authors on the author list 😁). This is OpenAI's usual style (they even apply something like continuous integration to paper writing): you can't fully understand the InstructGPT paper without having read the GPT-series papers, and I'll make a bold prediction that if a ChatGPT paper is ever published, the InstructGPT paper will be required reading for it too.
There is a lot to learn from that part; if you want more implementation details of the GPT-3.5 series I recommend reading it, but for lack of time I'll skip it here.
Recap
Finally, a recap of the whole training pipeline: first fine-tune on data considered useful to get an SFT model, then train a small RM model that serves as part of the objective for the subsequent reinforcement learning, and finally train the reinforcement learning model.
Note again that the dataset section above describes InstructGPT; ChatGPT's data is different, and ChatGPT also does better on the 3H criteria (helpful, honest, harmless). Below is an example asking when Columbus came to America in 2015.
Trick
There are actually plenty of tricks in ChatGPT's data collection and annotation, but OpenAI has published very little about them and I don't know much either. So anyone trying to reproduce a Chinese ChatGPT will inevitably have to step on some landmines and spend extra time there.
Compute
Some of the figures below come from Wallstreetcn (华尔街见闻).
- Training: 175B parameters and 45TB of training data, consuming about 3640 PF-days of compute (i.e., running at 1 PetaFLOP/s for 3640 days, where 1 PetaFLOP/s means 10^15 floating-point operations per second).
- Inference: loading the 175B parameters takes at least 5 A100s just to hold the model; generating 15-20 words per second takes 8 A100 GPUs, i.e., one NVIDIA DGX A100 server (so even with the weights in hand, self-hosting your own instance would require at least 8 A100s).
- Compute to sustain 13 million DAU: according to Similarweb, ChatGPT had about 13 million daily active users in January 2023, each asking roughly 1,000 characters of questions, for a total of about 13 billion characters (about 17.33 billion tokens). Assuming the load is spread evenly over 24 hours, the required compute is 17.33 billion tokens × 2 × 300 billion FLOPs / (20% × 24 hours × 3,600 seconds) ≈ 601.75 PetaFLOP/s. Since traffic has peaks, and assuming the peak is 5× the daily average, about 602 DGX A100 servers are needed to serve the current load; at today's 12-13 million DAU that is roughly 600 DGX A100 servers (see the NVIDIA DGX A100 datasheet). This arithmetic is reproduced in the sketch after this list.
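A quick back-of-envelope reproduction of the arithmetic in the last bullet; the per-token cost, 20% utilization, 5× peak factor, and roughly 5 PFLOPS per DGX A100 server are the source's assumptions (and my reading of them), not measured values.

```python
# Back-of-envelope reproduction of the serving-compute estimate above.
tokens_per_day = 17.33e9          # ~13M DAU * ~1,000 characters each
flops_per_token = 2 * 300e9       # assumed forward-pass cost per token
utilization = 0.20                # assumed effective GPU utilization
seconds_per_day = 24 * 3600

required_flops = tokens_per_day * flops_per_token / (utilization * seconds_per_day)
print(f"average load: {required_flops / 1e15:.2f} PFLOP/s")   # ~601.75

dgx_a100_flops = 5e15             # ~5 PFLOPS of AI compute per DGX A100 server
peak_factor = 5                   # assume peak traffic is 5x the daily average
servers = required_flops * peak_factor / dgx_a100_flops
print(f"DGX A100 servers needed at peak: {servers:.0f}")      # ~602
```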
API
Turbo
Model: The ChatGPT model family we are releasing today, gpt-3.5-turbo, is the same model used in the ChatGPT product. It is priced at $0.002 per 1k tokens, which is 10x cheaper than our existing GPT-3.5 models. It’s also our best model for many non-chat use cases—we’ve seen early testers migrate from text-davinci-003 to gpt-3.5-turbo with only a small amount of adjustment needed to their prompts.
For non-chat scenarios, or for the first turn of a conversation, gpt-3.5-turbo may be the better fit.
API
Turbo:
# curl
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "What is the OpenAI mission?"}]
  }'
# python
# Requires the openai Python package (v0.27.0+); the API key is read from the
# OPENAI_API_KEY environment variable.
import openai

completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Tell the world about the ChatGPT API in the style of a pirate."}]
)
print(completion)
# Note: you need to be using OpenAI Python v0.27.0 for the code below to work
import openai

openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]
)
The main input is the messages parameter. Messages must be an array of message objects, where each object has a role (either “system”, “user”, or “assistant”) and content (the content of the message). Conversations can be as short as 1 message or fill many pages.
Typically, a conversation is formatted with a system message first, followed by alternating user and assistant messages.
The system message helps set the behavior of the assistant. In the example above, the assistant was instructed with “You are a helpful assistant.”
The user messages help instruct the assistant. They can be generated by the end users of an application, or set by a developer as an instruction.
The assistant messages help store prior responses. They can also be written by a developer to help give examples of desired behavior.
An example API response looks as follows:
{
  'id': 'chatcmpl-6p9XYPYSTTRi0xEviKjjilqrWU2Ve',
  'object': 'chat.completion',
  'created': 1677649420,
  'model': 'gpt-3.5-turbo',
  'usage': {'prompt_tokens': 56, 'completion_tokens': 31, 'total_tokens': 87},
  'choices': [
    {
      'message': {
        'role': 'assistant',
        'content': 'The 2020 World Series was played in Arlington, Texas at the Globe Life Field, which was the new home stadium for the Texas Rangers.'},
      'finish_reason': 'stop',
      'index': 0
    }
  ]
}
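As a small usage note (mine, not from the original post): with the openai Python package, the assistant's reply text can be pulled out of this structure by indexing into choices, for example:

```python
# Extract just the assistant's reply from the response object above.
reply = completion["choices"][0]["message"]["content"]
print(reply)
```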
Downstream Applications
There are already close to a thousand applications built on GPT-3, ChatGPT included.
Purely my personal view: treat ChatGPT as the brain and give it hands ✋🏻 so that it can use tools (a toy sketch follows this list):
- For example, it is bad at arithmetic but knows what needs to be computed, so give it a calculator.
- For example, it doesn't know anything newer than its training data, so let it use a search engine.
- For example, it only accepts text input, so multimodal content can be bridged: speech can be converted to text, and images can be fed through an image-captioning model.
- For example, it can only output text (even a flow chart has to be described in something like markdown), so it could be combined with text-to-image models like DALL-E 2, and perhaps with text2video models in the future.
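Here is a toy sketch of the calculator case above; the "CALC:" convention, the system prompt, and the parsing are entirely invented for illustration and are not a real OpenAI feature.

```python
# Toy "ChatGPT as the brain, calculator as the hand" sketch (not production code).
import openai

def ask_with_calculator(question: str) -> str:
    system = ("If the user needs arithmetic, reply with exactly one line "
              "'CALC: <python expression>'. Otherwise answer normally.")
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": question}],
    )
    text = resp["choices"][0]["message"]["content"].strip()
    if text.startswith("CALC:"):
        # Hand the arithmetic to a real calculator instead of trusting the model.
        # eval on model output is unsafe in general; this is only a toy demo.
        result = eval(text[len("CALC:"):], {"__builtins__": {}}, {})
        return f"{question} -> {result}"
    return text

print(ask_with_calculator("What is 1234 * 5678?"))
```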
I'm very optimistic about where this is headed.
That said, prompt attacks such as DAN mean it can still be abused, and in my own testing the current version of ChatGPT can still generate false and toxic content, so careful prompt writing remains crucial. How to land it in downstream applications will test the wisdom of the engineers and product managers doing the work.
Discussion
References
- InstructGPT blog: openai.com/blog/instru…
- InstructGPT paper: Training language models to follow instructions with human feedback
- Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022). arxiv.org/pdf/2203.02…
- Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I., 2018. Improving language understanding by generative pre-training. www.cs.ubc.ca/~amuham01/L…
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D. and Sutskever, I., 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8), p.9. life-extension.github.io/2020/05/27/…
- Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020). proceedings.neurips.cc/paper/2020/…
- Wei, Jason, et al. "Finetuned language models are zero-shot learners." arXiv preprint arXiv:2109.01652 (2021). arxiv.org/pdf/2109.01…
- Christiano, Paul F., et al. "Deep reinforcement learning from human preferences." Advances in Neural Information Processing Systems 30 (2017). arxiv.org/pdf/1706.03…
- Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017). arxiv.org/pdf/1707.06…
- DataFun live stream: www.datafuntalk.com/live\_pc/l\…
- Andrew Ng's letter: ChatGPT is cool, and so is RL. zhuanlan.zhihu.com/p/606827548
- ChatGPT prompt engineering: design, practice, and reflections. www.kuxai.com/article/790
- Andrew Ng's letter: the present and future of prompt engineering. hub.baai.ac.cn/view/21225
- Wolfram|Alpha as the Way to Bring Computational Knowledge Superpowers to ChatGPT
- What Is ChatGPT Doing … and Why Does It Work? writings.stephenwolfram.com/2023/02/wha…
- Hallucination in large language models. mp.weixin.qq.com/s/-SHGTui0Q…
- Gradio: Build Machine Learning Web Apps — in Python. github.com/gradio-app/…
- After ChatGPT. mp.weixin.qq.com/s/ErujYGRmv…
- CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers (paper share)
- Meta's recent move: m.toutiao.com/article/720…
- Illustrating Reinforcement Learning from Human Feedback (RLHF)
This article is taking part in the ✍🏻 "Deep Dive into ChatGPT from a Technical Perspective" writing contest.