- LLaMA: Open and Efficient Foundation Language Models
- GitHub: arxiv.org/pdf/2302.13…
- Paper: arxiv.org/pdf/2302.13…
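Deployment is in progress; I'll look at how to fine-tune it later. I haven't had much exposure to NLP, so I hope to use LLaMA as an entry point and branch out from there: for example, the development of the fine-tuning techniques it mentions, the iterations of models between GPT-3 and now, and the use of various activation functions. It is a completely different body of knowledge from CV, but both live in the same world.
Overview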
- LLaMA is trained on trillions of tokens, and the authors show that it is possible to train state-of-the-art models using publicly available datasets exclusively.
Parameter count
- A collection of foundation language models ranging from 7B to 65B parameters
What is a foundation language model?
A foundation language model is a type of language model that serves as a basis or starting point for other models. It is typically trained on a large corpus of text data and can be fine-tuned for specific tasks such as language translation or sentiment analysis.
How does LLaMA-13B outperform GPT-3 (175B) on most benchmarks?
LLaMA-13B outperforms GPT-3 on most benchmarks despite being more than 10 times smaller. The paper trains exclusively on publicly available datasets, without resorting to proprietary and inaccessible ones. The gain itself is better explained by the paper's main idea: training a smaller model on far more tokens than is typical for its size.
Datasets
Comparison of results
- Zero-shot. We provide a textual description of the task and a test example. The model either provides an answer using open-ended generation, or ranks the proposed answers.
- Few-shot. We provide a few examples of the task (between 1 and 64) and a test example. The model takes this text as input and generates the answer or ranks different options.
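To make the difference between the two settings concrete, here is a minimal sketch of how such prompts could be assembled as plain text. The `build_prompt` helper and the `Q:`/`A:` layout are my own assumptions for illustration; the paper does not prescribe this exact format.

```python
from typing import List, Tuple


def build_prompt(task_description: str,
                 examples: List[Tuple[str, str]],
                 test_question: str) -> str:
    """Assemble a zero-shot (no examples) or few-shot prompt as plain text.

    Hypothetical helper: the Q:/A: layout is an illustrative assumption.
    """
    parts = [task_description]
    for question, answer in examples:
        # Each solved example shows the model the expected input/output shape.
        parts.append(f"Q: {question}\nA: {answer}")
    # The test example is left unanswered so the model completes it.
    parts.append(f"Q: {test_question}\nA:")
    return "\n\n".join(parts)


# Zero-shot: task description plus the test example only.
zero_shot = build_prompt("Answer the question.", [],
                         "What is the capital of France?")

# Few-shot: the same, with a couple of solved examples in between.
few_shot = build_prompt(
    "Answer the question.",
    [("What is the capital of Italy?", "Rome"),
     ("What is the capital of Japan?", "Tokyo")],
    "What is the capital of France?",
)
```

In both cases the model only ever sees one block of text; the number of solved examples in that text is the only thing that changes.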
What is open-ended generation?
Open-ended generation is a type of task where the language model is given a prompt or input and generates a response without any specific constraints or limitations. In other words, the model is free to generate any response that it deems appropriate based on the input it receives. This is in contrast to other types of tasks such as multiple-choice questions or fill-in-the-blank tasks, where the model is given a set of options to choose from.
How is the performance of different models evaluated?
In the multiple-choice tasks, the objective is to select the most appropriate completion among a set of given options, based on a provided context. We select the completion with the highest likelihood given the provided context (that is, the text preceding the blank to be filled in).
What is likelihood normalized by the number of characters in the completion?
The likelihood of a completion is the product of its per-token probabilities, so a longer completion tends to receive a lower raw likelihood simply because it contains more tokens. To keep length alone from deciding the ranking, the (log-)likelihood is normalized by the number of characters in the completion, so that completions of different lengths are compared on a per-character basis.
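A small sketch shows how the normalization can change which option is picked. The log-probabilities below are made-up numbers, and `pick_completion` is a hypothetical helper that assumes non-empty completions; in a real evaluation the scores would come from the model.

```python
from typing import Dict


def pick_completion(logprobs: Dict[str, float], normalize: bool = True) -> str:
    """Return the completion with the highest log-likelihood.

    With normalize=True the log-likelihood is divided by the number of
    characters in the completion (completions are assumed non-empty).
    """
    def score(item):
        text, logprob = item
        return logprob / len(text) if normalize else logprob

    return max(logprobs.items(), key=score)[0]


# Hypothetical summed log-probabilities log P(completion | context).
candidate_logprobs = {
    "yes": -2.0,
    "it is certainly true": -6.0,
}

raw_choice = pick_completion(candidate_logprobs, normalize=False)
norm_choice = pick_completion(candidate_logprobs, normalize=True)
```

With these numbers the raw score prefers the short answer, while the per-character score (-2.0 / 3 versus -6.0 / 20) prefers the longer one.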
Why do OpenBookQA and BoolQ use a different scoring method?
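In both datasets the blank to fill is preceded by "Answer", and this information is taken into account when selecting the answer. It is accounted for with a ratio of conditional probabilities: P(completion | context) / P(completion | "Answer:").
In other words, the score measures how much more likely the completion is under the full context than when only "Answer:" precedes it.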
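Here is a minimal sketch of this scoring rule, assuming the score is the ratio P(completion | context) / P(completion | "Answer:") as the paper describes for these two datasets. The probabilities are invented for illustration and `pick_by_answer_ratio` is a hypothetical helper, not code from the paper.

```python
import math
from typing import Dict, Tuple


def pick_by_answer_ratio(scores: Dict[str, Tuple[float, float]]) -> str:
    """Pick the completion maximizing P(c | context) / P(c | "Answer:").

    `scores` maps each completion to a pair of log-probabilities:
    (log P(c | context), log P(c | "Answer:")).
    Dividing probabilities is the same as subtracting log-probabilities.
    """
    return max(scores, key=lambda c: scores[c][0] - scores[c][1])


# Hypothetical values; a real evaluation would query the model twice per option.
hypothetical = {
    "the sun": (math.log(0.20), math.log(0.10)),  # ratio 2.0
    "a rock": (math.log(0.30), math.log(0.30)),   # ratio 1.0
}

best = pick_by_answer_ratio(hypothetical)
```

Note that "a rock" has the higher raw likelihood here (0.30 versus 0.20), but it is just as likely without the context, so the ratio favors "the sun".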
Evaluation task types
- Common Sense Reasoning
- Closed-book Question Answering
- Reading Comprehension
- Mathematical Reasoning
- Code Generation
- Massive Multitask Language Understanding
- Evolution of performance during training
What is common sense reasoning?
Common sense reasoning refers to a type of reasoning that involves making inferences and drawing conclusions based on everyday knowledge and experience. It is the ability to understand and reason about the world in a way that is consistent with how humans typically think and behave. Common sense reasoning is an important area of research in natural language processing because it is essential for many tasks, such as question answering, dialogue systems, and language understanding in general. The paper mentions several standard benchmarks for evaluating common sense reasoning, including BoolQ, PIQA, and SIQA.
What is closed-book question answering?
Closed-book question answering is a question answering task in which the model is given no external information or context beyond the text of the question itself. In other words, the model is not allowed to access any documents or information sources that could help it answer. This type of task is designed to test the model's ability to reason and generate answers based solely on its internal knowledge and its understanding of language.
What is reading comprehension?
Reading comprehension refers to the ability to understand and interpret written text. It involves a range of skills, including vocabulary knowledge, sentence comprehension, and the ability to make inferences and draw conclusions based on the information presented in the text. Reading comprehension is an important area of research in natural language processing because it is essential for many tasks, such as question answering, summarization, and information retrieval.
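Other
The paper also includes a bias analysis using several different measures, one of which is CrowS-Pairs.
It then compares energy consumption and carbon emissions: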