Abstract
LLMs have transformed NLP and shown promise in various fields, yet their potential in finance remains underexplored due to a lack of thorough evaluations and the complexity of financial tasks. This, along with the rapid development of LLMs, highlights the urgent need for a systematic financial evaluation benchmark for LLMs. In this paper, we introduce FinBen, the first comprehensive open-source evaluation benchmark, specifically designed to thoroughly assess the capabilities of LLMs in the financial domain. FinBen encompasses 35 datasets across 23 financial tasks, organized into three spectrums of difficulty inspired by the Cattell-Horn-Carroll theory, to evaluate LLMs’ cognitive abilities in inductive reasoning, associative memory, quantitative reasoning, crystallized intelligence, and more. Our evaluation of 15 representative LLMs, including GPT-4, ChatGPT, and the latest Gemini, reveals insights into their strengths and limitations within the financial domain. The findings indicate that GPT-4 leads in quantification, extraction, numerical reasoning, and stock trading, while Gemini shines in generation and forecasting; however, both struggle with complex extraction and forecasting, showing a clear need for targeted enhancements. Instruction tuning boosts performance on simple tasks but falls short in improving complex reasoning and forecasting abilities. FinBen seeks to continuously evaluate LLMs in finance, fostering AI development through regular updates of tasks and models.
1 Introduction
Recently, Large Language Models (LLMs) (Brown et al., 2020) such as ChatGPT and GPT-4 (OpenAI, 2023a) have reshaped the field of natural language processing (NLP) and exhibited remarkable capabilities in specialized domains including mathematics, coding, medicine, law, and finance (Bubeck et al., 2023). With their increasing model size and extensive pre-training data, LLMs have developed the emergent capacity for in-context learning, enabling them to perform remarkably well across a wide range of domain-specific tasks in zero-shot and few-shot settings (Wei et al., 2023). Within the financial domain, several recent studies (Xie et al., 2023a; Lopez-Lira and Tang, 2023; Li et al., 2023b; Xie et al., 2023b) have shown the great potential of advanced LLMs such as GPT-4 on financial text analysis and prediction tasks. While their potential is evident, a comprehensive understanding of their capabilities and limitations in finance remains largely unexplored, due to a lack of extensive evaluation studies and benchmarks, and the inherent complexities associated with the professional nature of financial tasks.
Existing financial domain evaluation benchmarks, including FLUE (Shah et al., 2022), BBT-CFLEB (Lu et al., 2023), and PIXIU (Xie et al., 2023b), have a limited scope and focus solely on financial NLP tasks, primarily targeting language understanding abilities on which LLMs have already been extensively evaluated. As shown in Table 1, they fail to capture other crucial facets of the financial domain, such as comprehending and extracting domain-specific financial knowledge and resolving realistic financial tasks. As such, their efficacy in evaluating and understanding LLM performance is limited.
Furthermore, while newly released general-domain benchmarks such as MMLU (Hendrycks et al., 2020), HELM (Liang et al., 2022), and BIG-bench (Srivastava et al., 2023) compile massive collections of tasks from numerous institutions, they do not extend to the financial domain. The fast progression of LLMs, coupled with an incomplete understanding of their abilities and behavior, highlights the need for a systematic financial evaluation benchmark dedicated to these models.

How should an effective systematic financial evaluation benchmark be designed? We believe it should fulfill the following criteria: 1) Broad coverage: It should cover a broad spectrum of tasks to capture the financial domain’s complexity, incorporating both linguistic understanding and diverse skills such as knowledge extraction, text generation, and numerical reasoning. 2) Real-world application orientation: The benchmark should focus on real-world scenarios, including stock market analysis and trading, highlighting LLMs’ practical application capabilities. 3) Inclusion of financial domain-specific characteristics: It also needs to address the unique aspects of finance, embedding tasks that demand specific knowledge, terminology, and concepts, demonstrating LLMs’ proficiency in the field. 4) Consideration of human-level cognition: It should gauge human-like cognitive abilities, evaluating LLMs on decision-making, problem-solving, and abstract reasoning within financial contexts.
To bridge this gap, we propose FinBen, the first open-source (we will release all resources to the research community) comprehensive evaluation benchmark designed for assessing the capabilities of LLMs in the financial domain. As shown in Figure 1, FinBen includes 35 datasets spanning 23 financial tasks, organized into three Spectrums of difficulty inspired by the Cattell-Horn-Carroll (CHC) theory (Schneider and McGrew, 2012) from the fields of psychology and education, to assess LLMs across various cognitive domains, including inductive reasoning, associative memory, quantitative reasoning, crystallized intelligence, fluid intelligence, and general intelligence. Spectrum I comprises foundational tasks including Quantification, Extraction, and Numerical Understanding, laying the groundwork for basic cognitive skills.
Moving up, Spectrum II delves into more complex Generation and Forecasting tasks, demanding enhanced cognitive involvement. At the apex, Spectrum III focuses on the sophisticated stock trading task, exemplifying the application of General Intelligence.
In line with the above criteria, FinBen distinguishes itself from existing benchmarks in the breadth and depth of its coverage, as well as its uniquely tailored focus on the financial domain:
1) Wide coverage: FinBen integrates classic NLP tasks (text analysis, knowledge extraction, question answering) with finance-specific challenges (numeric labeling), assesses LLMs on real-world financial applications (stock prediction, credit scoring), and, for the first time, directly evaluates the trading performance of LLMs. This broad approach comprehensively unveils LLMs’ strengths and limitations in finance.

2) Multi-data modality and diversity of text types: FinBen distinguishes itself by embracing diverse data forms and text types for its tasks, including news, tweets, earnings calls, financial documents, tables, and time-series data. This variety facilitates a thorough assessment of LLMs’ comprehension and generation of financial content, highlighting their real-world utility.
3) Diverse difficulty levels: FinBen incorporates tasks of varying difficulty, from simpler fundamental tasks like news headline classification, to advanced cognitive engagement tasks such as stock movement prediction, and even more complex general intelligence tasks that challenge even humans, such as stock trading. This range enables a nuanced evaluation of LLMs, fully mapping their strengths and weaknesses in finance.
We test 15 representative general LLMs, such as GPT-4, ChatGPT, and the latest Gemini, as well as financial LLMs in FinBen, and have the following findings: 1) GPT-4 outperforms all others in quantification, extraction, numerical reasoning, and the intricate stock trading task, whereas Gemini excels in generation and forecasting tasks. 2) While state-of-the-art (SOTA) LLMs such as GPT-4 demonstrate superior capabilities in quantification and simple extraction tasks, they fall short in areas requiring advanced numerical reasoning and complex information extraction. Notably, these LLMs show promise in the demanding stock trading task, yet there is a pronounced need for improvement in text generation and forecasting tasks, which rely heavily on crystallized and fluid intelligence. 3) Instruction tuning is an effective way to improve performance on quantification and simple extraction tasks, while it is less useful for other tasks such as numerical reasoning, generation, and forecasting.
2 The FinBen
Our benchmark framework evaluates financial LLMs through a hierarchy inspired by the Cattell-Horn-Carroll (CHC) theory (Schneider and McGrew, 2012), defining cognitive abilities in three Spectrums. Spectrum I includes foundational tasks like Quantification (Inductive Reasoning) using classification tasks, Extraction (Associative Memory) covering information extraction tasks, and Numerical Understanding (Quantitative Reasoning) covering numerical reasoning tasks. Spectrum II advances to Generation (Crystallized Intelligence), covering generation tasks, and Forecasting (Fluid Intelligence), with prediction tasks requiring deeper cognitive engagement. The pinnacle, Spectrum III, encompasses strategic decision-making in trading using the current state-of-the-art (SOTA) financial LLM agent (Yu et al., 2023) with the stock trading task, showcasing General Intelligence (McGrew, 2009). This structured approach allows for a nuanced assessment of LLMs’ financial analytical capabilities across varied cognitive demands. Table 2 and Figure 2 show all tasks, datasets, data statistics, and evaluation metrics covered by FinBen (for detailed instructions for each dataset, see Appendix C).
2.1 Spectrum I: Fundamental Tasks
Spectrum I includes 20 datasets from 16 tasks to evaluate financial LLMs from three perspectives: Quantification (Inductive Reasoning), Extraction (Associative Memory), and Numerical Understanding (Quantitative Reasoning).
Quantification
The quantification task includes 8 classification tasks for evaluating financial LLMs on transforming financial text into categorical labels and numerical scores. As inductive reasoning (Ir), it requires LLMs to discern patterns and quantify sentiments within financial narratives. 1) Sentiment analysis focuses on extracting sentiment information from financial texts. We utilize three datasets: the Financial Phrase Bank (FPB) (Malo et al., 2014), FiQA-SA (Maia et al., 2018a), and TSA (Cortis et al., 2017). 2) News headline classification analyzes additional information, like price movements in financial texts, using the Headlines dataset (Sinha and Khandait, 2021), which includes news about "gold" from 2000 to 2019 and their 9 corresponding tags. 3) Hawkish-dovish classification aims to classify sentences from monetary policy texts as "hawkish" or "dovish," focusing on the nuanced language and economic implications of financial texts, using the FOMC (Shah et al., 2023a) dataset. 4) Argument unit classification categorizes sentences as claims or premises using the FinArg AUC dataset (Sy et al., 2023). 5) Argument relation detection identifies relationships (attack, support, or irrelevant) between social media posts using the FinArg ARC dataset (Sy et al., 2023). 6) Multi-class classification targets categorizing a variety of financial texts, including analyst reports, news articles, and investor comments, utilizing the MultiFin dataset (Jørgensen et al., 2023). 7) Deal completeness classification predicts whether mergers and acquisitions events are "completed" or remain "rumors" based on news and tweets, employing the MA dataset (Yang et al., 2020a). 8) ESG issue identification focuses on detecting Environmental, Social, and Governance (ESG) concerns in financial documents using the MLESG dataset (Chen et al., 2023a). For all datasets, evaluation utilizes accuracy and the F1 score.
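For reference, the accuracy and macro-F1 metrics used for these classification tasks can be computed as in the following minimal pure-Python sketch (the sentiment labels shown are illustrative examples, not drawn from the actual datasets):

```python
def accuracy(golds, preds):
    """Fraction of predictions that exactly match the gold labels."""
    return sum(g == p for g, p in zip(golds, preds)) / len(golds)

def macro_f1(golds, preds):
    """Unweighted mean of per-class F1 scores over all observed labels."""
    labels = set(golds) | set(preds)
    f1s = []
    for label in labels:
        tp = sum(g == p == label for g, p in zip(golds, preds))
        fp = sum(p == label and g != label for g, p in zip(golds, preds))
        fn = sum(g == label and p != label for g, p in zip(golds, preds))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

golds = ["positive", "negative", "neutral", "positive"]
preds = ["positive", "neutral", "neutral", "positive"]
print(accuracy(golds, preds))  # 0.75
```

Macro averaging weights every class equally, which matters for the imbalanced label distributions typical of financial sentiment data.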
Extraction
The extraction task includes 5 datasets from 4 information extraction tasks, evaluating LLMs’ ability to accurately retrieve specific financial information from large datasets, a process tied closely to Associative Memory (Ma). 1) Named entity recognition extracts entities like LOCATION, ORGANIZATION, and PERSON from financial agreements and SEC filings, using the NER (Alvarado et al., 2015) and FiNER-ORD (Shah et al., 2023b) datasets. 2) Relation extraction identifies relationships such as "product/material produced" and "manufacturer" in financial news and earnings transcripts with the FINRED dataset (Sharma et al., 2022). 3) Causal classification discerns whether sentences from financial news and SEC filings convey causality using the SC dataset (Mariko et al., 2020). 4) Causal detection identifies cause and effect spans in financial texts with the CD dataset (Mariko et al., 2020). The evaluation of these tasks is focused on the F1 score (Goutte and Gaussier, 2005) and the Entity F1 score (Derczynski, 2016).
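The entity-level F1 used here can be sketched as a micro-averaged score over predicted (span, type) pairs; this is an illustrative reconstruction under that assumption, not the benchmark’s exact scoring script:

```python
def entity_f1(gold_entities, pred_entities):
    """Micro-averaged F1 over gold vs. predicted (span, type) pairs."""
    gold, pred = set(gold_entities), set(pred_entities)
    tp = len(gold & pred)  # an entity counts only if span AND type both match
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {("Goldman Sachs", "ORGANIZATION"), ("New York", "LOCATION")}
pred = {("Goldman Sachs", "ORGANIZATION"), ("New York", "PERSON")}
print(entity_f1(gold, pred))  # 0.5
```

Note that a correct span with the wrong type ("New York" as PERSON) scores zero, which is why entity F1 is stricter than token-level overlap.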
Understanding
The understanding task includes 5 datasets from 4 numerical reasoning tasks, challenging LLMs to interpret and analyze complex numerical data and intricate financial statistics, associated with the Quantitative Reasoning (Gq) ability. 1) Question answering focuses on solving questions through multi-step numerical reasoning with financial reports and tables, utilizing the FinQA (Chen et al., 2021) and TATQA (Zhu et al., 2021) datasets. 2) Multi-turn question answering extends QA with multi-turn questions and answers based on financial earnings reports and tables, using the ConvFinQA dataset (Chen et al., 2022). 3) Numeric labeling aims at tagging numeric spans in financial documents using 2,794 labels with the FNXL dataset (Sharma et al., 2023). 4) Token classification aims at identifying common attributes and comparative elements in textual analogies by extracting analogy frames, utilizing the FSRL dataset (Lamm et al., 2018). The Entity F1 score (Derczynski, 2016) and the Exact Match Accuracy (EMAcc) metric (Kim et al., 2023) are used to evaluate these tasks.
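Exact Match Accuracy can be illustrated as follows; the light answer normalization shown (stripping whitespace, currency symbols, thousands separators, and trailing percent signs) is an assumption for illustration, not necessarily the exact rules used by the benchmark:

```python
def exact_match_accuracy(golds, preds):
    """Share of answers matching the reference after light normalization."""
    def norm(ans):
        # Hypothetical normalization: case, whitespace, $, commas, trailing %.
        return str(ans).strip().lower().rstrip("%").replace(",", "").replace("$", "")
    return sum(norm(g) == norm(p) for g, p in zip(golds, preds)) / len(golds)

print(exact_match_accuracy(["$1,250", "3.5%"], ["1250", "3.6"]))  # 0.5
```

Because multi-step numerical reasoning produces a single final value, EMAcc is all-or-nothing: a near-miss like 3.6 vs. 3.5 scores zero.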
2.2 Spectrum II: Advanced Cognitive Engagement
Spectrum II has 14 datasets across 6 tasks designed to assess the Generation (Crystallized Intelligence) and Forecasting (Fluid Intelligence) capabilities of LLMs, requiring deeper cognitive engagement.
Generation
The generation task gauges the models’ proficiency in producing coherent, informative, and relevant text outputs, involving Crystallized Intelligence (Gc). We focus on the text summarization task, utilizing the ECTSUM (Mukherjee et al., 2022) dataset for summarizing earnings call transcripts and the EDTSUM (Zhou et al., 2021) dataset for abstracting financial news articles into concise summaries. Generation is evaluated using ROUGE scores (Lin, 2004), BERTScore (Zhang et al., 2019), and BART Score (Yuan et al., 2021), metrics that measure the alignment, factual consistency, and information retention between machine-generated and expert summaries.
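As a concrete reference point for the n-gram part of this evaluation, ROUGE-1 F1 reduces to unigram overlap between candidate and reference; the following is a simplified sketch (real ROUGE implementations add options such as stemming and stopword removal):

```python
from collections import Counter

def rouge_1_f(reference, candidate):
    """ROUGE-1 F1: clipped unigram overlap between candidate and reference."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())  # clipped match counts
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref_counts.values())
    precision = overlap / sum(cand_counts.values())
    return 2 * precision * recall / (precision + recall)

ref = "quarterly revenue rose five percent"
cand = "revenue rose five percent this quarter"
print(round(rouge_1_f(ref, cand), 3))  # 0.727
```

Surface-overlap scores like this are exactly why BERTScore and BART Score are reported alongside ROUGE: a paraphrase with few shared words can still be a faithful summary.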
Forecasting
The forecasting task leverages Fluid Intelligence (Gf), challenging models to adaptively predict future market and investor behaviors from emerging patterns. It includes 12 datasets from 5 forecasting tasks. 1) Stock movement prediction focuses on forecasting stock directions as either positive or negative, based on historical prices and tweets, utilizing three datasets: BigData22 (Soun et al., 2022), ACL18 (Xu and Cohen, 2018), and CIKM18 (Wu et al., 2018). 2) Credit scoring classifies individuals as "good" or "bad" credit risks using historical customer data, employing datasets including German (Hofmann, 1994), Australia (Quinlan), and LendingClub (Feng et al., 2023). 3) Fraud detection involves categorizing transactions as "fraudulent" or "non-fraudulent", using two datasets: ccf (Feng et al., 2023) and ccFraud (Feng et al., 2023). 4) Financial distress identification aims to predict a company’s bankruptcy risk, using the Polish (Feng et al., 2023) and Taiwan (Feng et al., 2023) datasets. 5) Claim analysis anonymizes client data for privacy, labeling a "target" to indicate claim status, using two datasets: PortoSeguro (Feng et al., 2023) and travelinsurance (Feng et al., 2023). The F1 score and the Matthews correlation coefficient (MCC) (Chicco and Jurman, 2020) are used for evaluating these tasks.
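MCC, which is robust to the heavy class imbalance typical of fraud and distress datasets, can be computed from the binary confusion counts; a minimal sketch:

```python
import math

def mcc(golds, preds):
    """Matthews correlation coefficient for binary labels (1 = positive class)."""
    tp = sum(g == p == 1 for g, p in zip(golds, preds))
    tn = sum(g == p == 0 for g, p in zip(golds, preds))
    fp = sum(p == 1 and g == 0 for g, p in zip(golds, preds))
    fn = sum(p == 0 and g == 1 for g, p in zip(golds, preds))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Degenerate cases (a row/column of the confusion matrix is empty) -> 0.
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(round(mcc([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1]), 3))  # 0.333
```

Unlike F1, MCC ranges from -1 to 1 and stays near 0 for a classifier that simply predicts the majority class, which is why it is the headline metric for these skewed tasks.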
2.3 Spectrum III: General Intelligence
Trading
Strategic decision-making in Trading (Punt, 2017), categorized under Spectrum III, is the pinnacle task for financial LLMs, emphasizing their use of General Intelligence (g). This task evaluates the model’s proficiency in synthesizing diverse information to formulate and implement trading strategies, a challenge even for experts, representing the highest level of cognitive capability in financial analysis. The SOTA financial LLM agent FinMem (Yu et al., 2023) is used to evaluate LLMs on sophisticated stock decisions, based on our curated FinTrade dataset of seven major stocks, simulating real-world trading through historical prices, news, and sentiment analysis. Performance is measured by Cumulative Return (CR) (Ariel, 1987), Sharpe Ratio (SR) (Sharpe, 1998), Daily Volatility (DV) and Annualized Volatility (AV) (Zhou et al., 2023), and Maximum Drawdown (MD) (Magdon-Ismail and Atiya, 2004), offering a comprehensive assessment of profitability, risk management, and decision-making prowess.
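These trading metrics reduce to standard formulas over the daily return series; the sketch below (with an illustrative return series, not FinTrade data, and assuming 252 trading days per year) computes cumulative return, annualized Sharpe ratio, and maximum drawdown:

```python
import math

def trading_metrics(daily_returns, risk_free_rate=0.0, trading_days=252):
    """Cumulative return, annualized Sharpe ratio, and max drawdown
    from a series of simple daily returns (standard textbook formulas)."""
    n = len(daily_returns)
    mean = sum(daily_returns) / n
    var = sum((r - mean) ** 2 for r in daily_returns) / (n - 1)
    std = math.sqrt(var)
    # Compound the returns while tracking the running peak for drawdown.
    cumulative, peak, max_dd = 1.0, 1.0, 0.0
    for r in daily_returns:
        cumulative *= 1 + r
        peak = max(peak, cumulative)
        max_dd = max(max_dd, (peak - cumulative) / peak)
    sharpe = ((mean - risk_free_rate) / std) * math.sqrt(trading_days) if std else 0.0
    return {"CR": cumulative - 1, "SR": sharpe, "MDD": max_dd}

metrics = trading_metrics([0.01, -0.02, 0.015, 0.005])
print(metrics)
```

A Sharpe ratio above 1, as reported for GPT-4, means the strategy earned more than one unit of excess return per unit of volatility over the evaluation window.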
3 Evaluation
We evaluate the zero-shot and few-shot performance of 15 representative general LLMs and financial LLMs on the FinBen benchmark, including: 1) ChatGPT: An instruction-following LLM with 175B parameters developed by OpenAI. 2) GPT-4 (OpenAI, 2023b): A powerful instruction-following LLM with approximately 1T parameters, proposed by OpenAI. 3) Gemini Pro (Team et al., 2023): A multimodal LLM released by Google. 4) LLaMA2-70B (Touvron et al., 2023): An instruction-following LLM with 70B parameters developed by Meta AI. 5) ChatGLM3-6B (Du et al., 2022): A conversational LLM with 6B parameters, jointly released by Zhipu AI and Tsinghua KEG. 6) Baichuan2-6B (Baichuan, 2023): An open-source LLM with 6B parameters, launched by Baichuan Intelligent Technology. 7) InternLM-7B (Team, 2023b): An open-source 7B-parameter base model tailored for practical scenarios, proposed by SenseTime. 9) Falcon-7B (Almazrouei et al., 2023): A 7B-parameter causal decoder-only LLM trained on 1,500B tokens of RefinedWeb enhanced with curated corpora. 10) Mixtral 8×7B (Jiang et al., 2024): An LLM with a Sparse Mixture of Experts (SMoE) architecture. 11) Code Llama-7B (Roziere et al., 2023): An open-source LLM for generating programming code, launched by Meta AI with 7B parameters. 12) FinGPT (Yang et al., 2023a): A 7B financial LLM instruction-tuned on sentiment analysis tasks. 13) FinMA-7B (Xie et al., 2023b): A 7B financial LLM instruction-tuned on multiple NLP and forecasting tasks. 14) DISC-FinLLM (Chen et al., 2023c): An open-source financial LLM fine-tuned from Baichuan-13B-Chat (Baichuan, 2023). 15) CFGPT (Li et al., 2023a): An open-source LLM with 7B parameters, specifically designed for the financial sector and trained on Chinese financial datasets. All experiments are conducted exclusively on 5 NVIDIA TITAN RTX GPUs and 2 NVIDIA GeForce RTX 3090 GPUs, taking approximately 20 hours to complete. On average, 2 GPUs are allocated per experiment, amounting to a total of approximately 20,400 GPU hours.
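Schematically, the zero-shot evaluation protocol can be sketched as below; `model_fn`, the prompt template, and the label-mapping heuristic are all hypothetical stand-ins for the actual API calls and parsing rules, shown only to illustrate the shape of the loop:

```python
def zero_shot_evaluate(model_fn, examples, label_set):
    """Schematic zero-shot loop: prompt the model with each input, map its
    free-form reply onto the task's label set, and score accuracy.
    `model_fn` is a hypothetical stand-in for an LLM API call."""
    correct = 0
    for ex in examples:
        prompt = (f"Classify the sentiment of the following financial text "
                  f"as one of {sorted(label_set)}.\nText: {ex['text']}\nAnswer:")
        reply = model_fn(prompt).strip().lower()
        # Naive parsing heuristic: first known label found in the reply,
        # falling back to "neutral" when none appears.
        pred = next((lab for lab in label_set if lab in reply), "neutral")
        correct += pred == ex["label"]
    return correct / len(examples)

def always_positive(prompt):
    """Toy stand-in model that always answers "positive"."""
    return "positive"

data = [{"text": "Shares surged after earnings.", "label": "positive"},
        {"text": "The firm missed revenue targets.", "label": "negative"}]
print(zero_shot_evaluate(always_positive, data, {"positive", "negative", "neutral"}))  # 0.5
```

Few-shot evaluation differs only in prepending labeled demonstrations to the same prompt.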
4 Results
Table 3 and Table 4 show the performance of 12 representative LLMs on all datasets in FinBen.
4.1 Fundamental Tasks Analysis
From Table 3, for the fundamental tasks, we can see that GPT-4 stands out with the highest average performance, closely followed by ChatGPT and Gemini. Among all open-source LLMs, the financial LLM FinMA-7B showcases superior performance on several classification tasks, such as FPB, even exceeding larger models like GPT-4. This is attributed to its tailored instruction tuning on the training datasets. For general-purpose LLMs, LLaMA2 70B leads in average performance, owing to its larger model size. Among models tailored for the Chinese language, ChatGLM2-6B outperforms InternLM 7B in average performance, indicating its effectiveness in handling financial tasks. However, CFGPT sft-7B-Full, fine-tuned on Chinese financial data, exhibits limited improvement on a few datasets and even declining performance on others, such as MultiFin, compared with its base model InternLM 7B. This trend suggests a language-based discrepancy, highlighting that fine-tuning with Chinese data may adversely affect performance on English tasks and underscoring the complexities of cross-lingual adaptation in model training.
Notably, in quantification datasets such as Headlines, models like Gemini and other financially tuned LLMs, including FinMA-7B, perform on par with or even better than GPT-4. However, when tackling understanding tasks in datasets like FinQA and ConvFinQA, GPT-4 and ChatGPT significantly outperform others, highlighting the limited numerical reasoning capabilities of models like Gemini and LLaMA2-70B. Challenges persist in extraction datasets requiring complex information extraction and numeric labeling, such as FinRED, CD, FNXL, and FSRL, where all models, including GPT-4, fall short, indicating a need for further enhancement in these areas.
In conclusion, SOTA LLMs like GPT-4 exhibit strong performance across quantification tasks. However, there is a clear gap in numerical reasoning and complex information extraction tasks, pinpointing the necessity for further development. Instruction tuning has been shown to enhance performance significantly, suggesting a valuable approach for improving model capabilities on specialized financial tasks. The results highlight the complexity of cross-lingual model tuning and the importance of careful language consideration in enhancing LLMs’ effectiveness across diverse financial tasks.
4.2 Advanced Cognitive Engagement Tasks Analysis
In the text generation task, Gemini emerges as the frontrunner on the EDTSUM abstractive text summarization dataset, illustrating its prowess in generating coherent summaries. Nevertheless, all models face challenges with extractive summarization, which demands the generation of precise label sequences for sentences. In the forecasting task, Gemini distinguishes itself across most datasets, except on the Australian credit scoring dataset, where GPT-4 demonstrates superior performance. Among open-source LLMs, LLaMA2 70B stands out in text summarization, whereas LLaMA2-7B-chat excels in forecasting tasks. Despite instruction tuning with datasets like BigData22 and ACL18, FinMA 7B lags behind peers such as Falcon 7B in forecasting performance, underscoring the need for more effective improvement strategies. CFGPT sft-7B-Full consistently shows decreased performance compared with its foundational model, InternLM 7B. For forecasting, it is crucial to acknowledge that none of the LLMs meet expected outcomes; all fall behind traditional methodologies. This observation, consistent with existing studies (Feng et al., 2023; Xie et al., 2023b), underlines a notable deficiency in LLMs’ capacity to tackle advanced cognitive tasks as effectively as conventional methods. This analysis reveals significant potential for enhancement in LLMs, including industry leaders like GPT-4 and Gemini, particularly on text generation and forecasting tasks that demand higher cognitive skills.
4.3 General Intelligence Tasks Analysis
The comparative analysis of various LLMs on the complex task of stock trading, which demands a high degree of general intelligence, is presented in Table 4 (for detailed trading performance, please see Appendix E). The results indicate that all LLMs outperform the traditional Buy & Hold strategy, highlighting their efficacy in formulating more advantageous trading decisions. Among the evaluated LLMs, GPT-4 distinguishes itself by attaining the highest Sharpe Ratio (SR), exceeding 1. This achievement underscores GPT-4's proficiency in optimizing profit against risk, a capability that appears diminished in other LLMs, which tend to expose investors to higher risk for lesser returns. Additionally, GPT-4 demonstrates the smallest Max Drawdown (MDD), suggesting that it limits potential losses more effectively than its counterparts, thereby offering a more secure investment avenue.
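The two metrics discussed above can be computed directly from a strategy's daily return series. The sketch below is illustrative only: the annualization factor of 252 trading days, the zero risk-free rate, and the toy return series are assumptions, not details from our evaluation setup.

```python
import numpy as np

def sharpe_ratio(daily_returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe Ratio: mean excess return divided by its volatility."""
    excess = np.asarray(daily_returns) - risk_free_rate / periods_per_year
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)

def max_drawdown(daily_returns):
    """Max Drawdown: largest peak-to-trough decline of the equity curve.

    Returned as a negative fraction; its magnitude is the drawdown.
    """
    equity = np.cumprod(1.0 + np.asarray(daily_returns))
    running_peak = np.maximum.accumulate(equity)
    drawdowns = equity / running_peak - 1.0
    return drawdowns.min()

# Toy synthetic return series, for illustration only.
rets = [0.01, -0.02, 0.015, 0.005, -0.01, 0.02]
print(f"SR  = {sharpe_ratio(rets):.2f}")
print(f"MDD = {max_drawdown(rets):.2%}")
```

An SR above 1 means annualized excess return exceeds annualized volatility, which is why GPT-4's SR > 1 indicates favorable risk-adjusted profit rather than raw profit alone; MDD complements it by bounding the worst realized loss an investor would have sat through.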
In contrast, ChatGPT exhibits significantly lower performance metrics, indicating limitations in its financial decision-making capabilities. Gemini, on the other hand, secures the position of second-best performer, showcasing lower risk and volatility in comparison to GPT-4, yet maintaining commendable returns. When considering open-source models, it is observed that LLaMA-70B, despite its lower volatility, yields the least profit among the LLMs, highlighting a trade-off between risk management and profitability.
For smaller models with fewer than 70 billion parameters, a marked inability to consistently adhere to trading instructions across transactions is noted, attributed to their limited comprehension and extraction capabilities and their constrained context windows. This limitation underscores the critical challenges smaller LLMs face in tasks requiring intricate financial reasoning and decision-making, spotlighting the necessity of more advanced models to tackle such high-level cognitive tasks effectively.
In essence, the exceptional performance of LLMs in the stock trading task illuminates their capacity to embody general intelligence within the financial domain. This capacity, rooted in the integration of diverse cognitive skills and the application of these skills to real-world financial challenges, heralds a new era of financial analysis and decision-making. Our findings, thereby, not only affirm the significant potential of LLMs in navigating the complexities of financial markets but also suggest a promising trajectory for their further development and application in tasks demanding a high level of general intelligence.
5 Conclusion
In this work, we introduce FinBen, a comprehensive financial benchmark specifically designed for evaluating LLMs in the financial domain. The benchmark encompasses 35 diverse datasets from 23 tasks, organized into three spectrums of difficulty. Unlike previous benchmarks in the financial domain, FinBen extends its evaluation to a broad spectrum of tasks, including quantification, extraction, understanding, generation, and forecasting. Notably, for the first time, it incorporates a direct trading task through an agent-based evaluation framework. Our comprehensive evaluation of 15 representative LLMs yields several key insights: 1) GPT-4 emerges as the top performer in tasks related to quantification, extraction, understanding, and trading, whereas Gemini leads in generation and forecasting tasks. 2) While existing LLMs demonstrate commendable performance on foundational tasks, their effectiveness on more cognitively demanding tasks and those requiring general intelligence appears constrained. 3) The findings highlight the capacity of LLMs to directly inform trading decisions, suggesting a promising avenue for future research. Moving forward, we aim to expand FinBen to encompass additional languages and a wider array of financial trading tasks, further broadening the benchmark's applicability and utility in advancing the field of financial LLMs.
Limitations
Despite the groundbreaking efforts to benchmark LLMs in the financial domain through FinBen, we acknowledge several inherent limitations that could impact the benchmark's effectiveness and applicability:
Dataset Size Limitations: A primary challenge faced in the development of FinBen is the restricted size of available datasets, a common issue in the niche field of open-source financial data. This limitation may affect the depth of the models' financial understanding and their ability to generalize across the full spectrum of financial contexts.
Model Size Limitations: Due to computational resource constraints, our evaluation was limited to the LLaMA 70B model. This restriction potentially overlooks the capabilities and performance nuances that larger or differently architected models might demonstrate on FinBen’s comprehensive task suite.
Generalizability: The tasks, particularly those involving trading and forecasting, are predominantly based on data from American markets and English-language texts. This focus may limit the benchmark’s applicability to global financial markets, where linguistic diversity and unique market dynamics play a crucial role.
Potential Negative Impacts: While FinBen aims to propel the field of financial language understanding forward, it is crucial to consider the potential for misuse, such as the propagation of financial misinformation or the exertion of unethical influence on markets. These risks underscore the importance of responsible usage and further safeguards in the deployment of LLMs trained or evaluated with FinBen (for a detailed ethical and legal statement concerning this work, please see the Appendix).
Ethical Statement
The authors bear full responsibility for any potential violation of rights or legal issues arising from the development and dissemination of FinBen. Diligent efforts have been undertaken to ensure that the construction of FinBen respects privacy and conforms to established ethical guidelines. The datasets compiled within FinBen are shared under the MIT license, with the expectation that users agree to adhere to its conditions. This manuscript, inclusive of any associated source code, datasets, and appendices ("Material"), is designated exclusively for academic and educational pursuits. It is crucial to acknowledge that the Material does not provide financial, legal, or investment counsel, nor should it be used as a basis for any form of decision-making.
While the authors have exerted reasonable diligence to verify the accuracy and reliability of the Material, no explicit or implied warranty is extended regarding its completeness or suitability for any specific application. The authors, along with their affiliated entities, absolve themselves of liability for any losses, damages, or other consequences, whether direct or indirect, that may emanate from the employment or reliance upon the Material. It is incumbent upon the user to seek professional consultation for financial, legal, or investment determinations.
By referencing or employing this Material, individuals consent to indemnify, defend, and hold the authors, along with any affiliated organizations or persons, harmless against any claims or damages that may arise from such utilization.
References
(Omitted for length.)
A Contributions
Science Leadership: Qianqian Xie, Min Peng, Sophia Ananiadou, Alejandro Lopez-Lira, Hao Wang, Yanzhao Lai, Benyou Wang, Xiao-yang Liu, Gang Hu, Jiajia Huang, Jimin Huang. Contributors: Weiguang Han, Zhengyu Chen, Ruoyu Xiang, Xiao Zhang, Yueru He, Mengxi Xiao, Dong Li, Yongfu Dai, Duanyu Feng, Yijing Xu, Haoqiang Kang, Ziyan Kuang, Chenhan Yuan, Kailai Yang, Zheheng Luo, Tianlin Zhang, Zhiwei Liu, Guojun Xiong, Zhiyang Deng, Yuechen Jiang, Zhiyuan Yao, Haohang Li, Yangyang Yu
B Other LLMs Performance
Table 5 presents other LLMs’ performance in the FinBen.
C Instructions
For detailed instructions for each dataset, please see Table 6 and Table 7.
D Related Work
D.1 Financial Large Language Models
Recent years have seen a significant surge in research on finance-specific LLMs, expanding on the groundwork laid by general-purpose language models (Lee et al., 2024; Liu et al., 2023b; Xie et al., 2023a; Zhang et al., 2024; Dai et al., 2024). Financial pre-trained language models (FinPLMs) like FinBERT (Araci, 2019; Yang et al., 2020b; Liu et al., 2020), derived from BERT, and FLANG (Shah et al., 2022), based on ELECTRA, have been developed using domain-specific data for enhanced performance on tasks like sentiment analysis and stock prediction. The open-source release of Meta AI's LLaMA (Touvron et al., 2023) has fueled further innovation in Financial LLMs (FinLLMs), with models like FinMA (Xie et al., 2023b), InvestLM (Yang et al., 2023b), and FinGPT (Wang et al., 2023; Liu et al., 2023a) leveraging advanced tuning strategies (Zhang et al., 2023a) for financial applications. BloombergGPT (Wu et al., 2023) stands out as a BLOOM-based, closed-source model tailored for the financial industry. Additionally, the Chinese financial sector has seen the emergence of models like XuanYuan 2.0 (Zhang et al., 2023c), integrating broad and specialized knowledge; FinBART (Hongyuan et al., 2023), for financial communication; and CFGPT (Li et al., 2023a), which includes a comprehensive dataset for targeted pre-training and fine-tuning.
D.2 Financial Evaluation Benchmarks
Financial evaluation benchmarks, such as the pioneering FLUE (Shah et al., 2022), have been introduced to measure model performance in the financial sector, covering five key NLP tasks: financial sentiment analysis (Shah et al., 2022), news headline classification (Sinha and Khandait, 2020), named entity recognition (NER) (Salinas Alvarado et al., 2015), structure boundary detection, and question answering (QA) (Maia et al., 2018b). Building upon FLUE, FLARE (Xie et al., 2023b) added the evaluation of time-series processing capabilities, i.e., forecasting stock price movements. Among Chinese financial benchmarks, more recently released datasets include CFBenchmark (Lei et al., 2023), DISC-FINSFT (Chen et al., 2023b), and CGCE (Zhang et al., 2023b). However, these benchmarks have a limited scope and have yet to address more complex financial NLP tasks such as event detection (Zhou et al., 2021), as well as realistic financial tasks, despite previous efforts on stock trading (Liu et al., 2022; Han et al., 2023a,b).
E Trading Accumulative Returns
Table 8 and the figures below show detailed trading performance.