Abstract
Large Language Models (LLMs) have shown remarkable capabilities across a wide variety of Natural Language Processing (NLP) tasks and have attracted attention from multiple domains, including financial services. Despite the extensive research into general-domain LLMs, and their immense potential in finance, Financial LLM (FinLLM) research remains limited. This survey provides a comprehensive overview of FinLLMs, including their history, techniques, performance, and opportunities and challenges. Firstly, we present a chronological overview of general-domain Pre-trained Language Models (PLMs) through to current FinLLMs, including the GPT-series, selected open-source LLMs, and financial LMs. Secondly, we compare five techniques used across financial PLMs and FinLLMs, including training methods, training data, and fine-tuning methods. Thirdly, we summarize the performance evaluations of six benchmark tasks and datasets. In addition, we provide eight advanced financial NLP tasks and datasets for developing more sophisticated FinLLMs. Finally, we discuss the opportunities and the challenges facing FinLLMs, such as hallucination, privacy, and efficiency. To support AI research in finance, we compile a collection of accessible datasets and evaluation benchmarks on GitHub. (github.com/adlnlp/FinL…)
1. Introduction
Research on Large Language Models (LLMs) has grown rapidly in both academia and industry, with notable attention to LLM applications such as ChatGPT. Inspired by Pre-trained Language Models (PLMs) [Devlin et al., 2018; Radford et al., 2018], LLMs are empowered by transfer learning and built upon the Transformer architecture [Vaswani et al., 2017] using large-scale textual corpora. Researchers have discovered that scaling models [Radford et al., 2019] to sufficient sizes not only enhances model capacity but also enables surprising emergent properties, such as in-context learning [Brown et al., 2020], that do not appear in small-scale language models. Language Models (LMs) can be categorized based on parameter size, and the research community has created the term "Large Language Models (LLMs)" for PLMs of substantial size, typically exceeding 7 billion parameters [Zhao et al., 2023]. The technical evolution of LLMs has resulted in a remarkable level of homogenization [Bommasani et al., 2021], meaning that a single model can yield strong performance across a wide range of tasks. The capability of LLMs has facilitated the adaptation of various forms of multimodal data (e.g., text, image, audio, video, and tabular data) and multimodal models across AI and interdisciplinary research communities. In the financial domain, there has been growing interest in applying NLP across various financial tasks, including sentiment analysis, question answering, and stock market prediction. The rapid advancement of general-domain LLMs has spurred an investigation into Financial LLMs (FinLLMs), employing methods such as mixed-domain LLMs with prompt engineering and instruction fine-tuned LLMs with prompt engineering. While general LLMs are extensively researched and reviewed [Zhao et al., 2023; Yang et al., 2023b; Zhang et al., 2023], the field of financial LLMs [Li et al., 2023b] is at an early stage.
Considering the immense potential of LLMs in finance, this survey provides a holistic overview of FinLLMs and discusses future directions that can stimulate interdisciplinary studies. We acknowledge that this research focuses on LMs in English. The key contributions of this survey paper are summarized below.
• To the best of our knowledge, this is the first comprehensive survey of FinLLMs that explores the evolution from general-domain LMs to financial-domain LMs.
• We compare five techniques used across four financial PLMs and four financial LLMs, including training methods and data, and instruction fine-tuning methods.
• We summarize the performance evaluation of six benchmark tasks and datasets between different models, and provide eight advanced financial NLP tasks and datasets for the development of advanced FinLLMs.
• We discuss the opportunities and the challenges of FinLLMs, with regard to datasets, techniques, evaluation, implementation, and real-world applications.
2. Evolution Trends: from General to Finance
2.1 General-domain LMs
Since the introduction of the Transformer architecture [Vaswani et al., 2017] by Google in 2017, Language Models (LMs) have generally been pre-trained with either discriminative or generative objectives. Discriminative pre-training uses masked language modeling, which predicts masked tokens (often paired with next-sentence prediction), and features an encoder-only or an encoder-decoder architecture. Generative pre-training uses autoregressive language modeling to predict the next token and features a decoder-only architecture. Figure 1 illustrates the evolutionary timeline from general-domain LMs to financial-domain LMs.
GPT-Series
The Generative Pre-trained Transformer (GPT) series of models started with GPT-1 (110M parameters) [Radford et al., 2018]. Since then, the OpenAI team has focused on scaling the model, and GPT-2 (1.5B) [Radford et al., 2019] was released in 2019. GPT-2 demonstrated the power of scaling and a probabilistic approach to multi-task problem-solving. In 2020, GPT-3, with 175B parameters, was released [Brown et al., 2020]. This was a significant milestone for LLMs, as it introduced an emergent capability of LLMs: in-context learning. In-context learning refers to the model acquiring capabilities it was not explicitly trained for, allowing language models to understand human language and produce outcomes beyond their original pre-training objectives.
Ongoing efforts to improve LLMs resulted in the introduction of ChatGPT in November 2022. This application combines GPT-3 (in-context learning), Codex (LLMs for code), and InstructGPT (Reinforcement Learning from Human Feedback, RLHF). The success of ChatGPT has led to the further development of significantly larger models, including GPT-4 (estimated 1.7T parameters). GPT-4 demonstrates human-level performance, capable of passing law and medical exams, and handling multimodal data.
OpenAI continues to build extremely large language models, aiming to enhance the model’s capabilities in handling multimodal data, as well as providing APIs for the development of real-world applications. Despite the mainstream popularity and adoption, real-world applications in finance utilizing their APIs have not yet been fully explored.
Open-source LLMs
Prior to the era of LLMs, the research community often released open-source PLMs such as Bidirectional Encoder Representations from Transformers (BERT, base: 110M parameters) [Devlin et al., 2018]. BERT is the foundational model for many early PLMs, including FinBERT. Since OpenAI shifted from open-source to closed-source LLMs, the trend across LLM research has been a reduction in the release of open-source models. However, in February 2023, Meta AI released the open-source LLM LLaMA (7B, 13B, 33B, and 65B parameters) [Touvron et al., 2023], which encouraged the development of diverse LLMs built on LLaMA. Similar to BERT variants, LLaMA variants quickly proliferated by adopting various techniques such as Instruction Fine-Tuning (IFT) [Zhang et al., 2023] and Chain-of-Thought (CoT) prompting [Wei et al., 2022].
There have also been significant efforts by the research community to generate open-source LLMs to reduce the reliance on corporate research and proprietary models. BLOOM (176B) [Scao et al., 2022] was built by a collaboration of hundreds of researchers from the BigScience Workshop. This open-source LLM was trained on 46 natural languages and 13 programming languages.
2.2 Financial-domain LMs
Domain-specific LMs, such as financial-domain LMs, are commonly built on general-domain LMs. In finance, there are primarily four financial PLMs (FinPLMs) and four financial LLMs (FinLLMs). Among the four FinPLMs, FinBERT-19 [Araci, 2019], FinBERT-20 [Yang et al., 2020], and FinBERT-21 [Liu et al., 2021] are all based on BERT, while FLANG [Shah et al., 2022] is based on ELECTRA [Clark et al., 2020]. Among the four FinLLMs, FinMA [Xie et al., 2023], InvestLM [Yang et al., 2023c], and FinGPT [Wang et al., 2023] are based on LLaMA or other open-source models, while BloombergGPT [Wu et al., 2023] is a BLOOM-style closed-source model.
3. Techniques: from FinPLMs to FinLLMs
While our survey focuses on FinLLMs, it is important to acknowledge previous studies on FinPLMs, as they formed the groundwork for FinLLM development. We review three techniques used by the four FinPLMs and two techniques used by the four FinLLMs. Figure 2 illustrates technical comparisons of building financial LMs (the continual pre-training diagram can be found on our GitHub), and Table 1 shows a summary of FinPLMs/FinLLMs including pre-training techniques, fine-tuning, and evaluation.
3.1 Continual Pre-training
Continual pre-training of LMs aims to train an existing general LM with new domain-specific data on an incremental sequence of tasks [Ke et al., 2022].
FinBERT-19 [Araci, 2019] (huggingface.co/ProsusAI/fi…) is the first FinBERT model released for financial sentiment analysis and implements three steps: 1) initialization from the general-domain BERT PLM (3.3B tokens), 2) continual pre-training on a financial-domain corpus, and 3) fine-tuning on financial domain-specific NLP tasks. The fine-tuned financial LM has been released on HuggingFace, and FinBERT-19 is a task-dependent model for the financial sentiment analysis task.
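The masked-language-modeling objective that such BERT-style pre-training optimizes can be sketched in a few lines. This is an illustrative simplification, not the actual FinBERT training code: real implementations operate on subword tokens and use an 80/10/10 mask/replace/keep scheme.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace a fraction of tokens with [MASK], returning the
    corrupted sequence and the positions the model must predict."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # the model is trained to recover this token
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

sentence = "the company reported strong quarterly earnings".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3)
```

Continual pre-training simply continues optimizing this same objective, initialized from the general-domain checkpoint, on the financial corpus.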
3.2 Domain-Specific Pre-training from Scratch
The domain-specific pre-training approach involves training a model exclusively on an unlabeled domain-specific corpus while following the original architecture and its training objective.
FinBERT-20 [Yang et al., 2020] (github.com/yya518/FinB…) is a finance domain-specific BERT model, pre-trained on a financial communication corpus (4.9B tokens). The authors released not only the FinBERT model but also FinVocab (uncased/cased), which has a vocabulary size similar to the original BERT model. FinBERT-20 also conducted a sentiment analysis task for fine-tuning experiments on the same dataset as FinBERT-19.
3.3 Mixed-Domain Pre-training
The mixed-domain pre-training approach involves training a model on both a general-domain corpus and a domain-specific corpus. The assumption is that general-domain text remains relevant, while the financial-domain data provides knowledge and adaptation during the pre-training process.
FinBERT-21 [Liu et al., 2021] (as of Dec 2023, the link mentioned in the paper no longer exists) is another BERT-based PLM designed for financial text mining, trained simultaneously on a general corpus and a financial-domain corpus. FinBERT-21 employs multi-task learning across six self-supervised pre-training tasks, enabling it to efficiently capture language knowledge and semantic information. FinBERT-21 conducted experiments on Sentiment Analysis and also provided experimental results for two additional tasks: Sentence Boundary Detection and Question Answering.
FLANG [Shah et al., 2022] (github.com/SALT-NLP/FL…) is a domain-specific model that uses financial keywords and phrases for masking, and follows the training strategy of ELECTRA [Clark et al., 2020]. This research first introduced the Financial Language Understanding Evaluation (FLUE) benchmark, a collection of five financial NLP tasks: Sentiment Analysis, Headline Text Classification, Named Entity Recognition, Structure Boundary Detection, and Question Answering.
3.4 Mixed-Domain LLM with Prompt Engineering
Mixed-domain LLMs are trained on both a large general corpus and a large domain-specific corpus. Users then describe the task, optionally providing a set of examples, in human language. This technique is called prompt engineering; it uses the same frozen LLM for several downstream tasks with no weight updates. This survey does not explore prompt engineering in depth and instead refers readers to a recent survey [Liu et al., 2023]. BloombergGPT [Wu et al., 2023] (this model and its associated data are closed-source) is the first FinLLM to utilize the BLOOM model [Scao et al., 2022]. It is trained on a large general corpus (345B tokens) and a large financial corpus (363B tokens). The financial corpus, FinPile, contains data collected from the web, news, filings, press releases, and Bloomberg's proprietary data. The authors conducted financial NLP tasks (5 benchmark tasks and 12 internal tasks) as well as 42 general-purpose NLP tasks.
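A few-shot prompt for a frozen LLM can be assembled with plain string formatting; the sketch below uses a hypothetical sentiment-analysis template (the exact prompt formats used by BloombergGPT and the other surveyed models differ):

```python
def build_few_shot_prompt(task_description, examples, query):
    """Assemble a k-shot prompt: a task description, labeled examples,
    then the unlabeled query. No model weights are updated."""
    lines = [task_description, ""]
    for text, label in examples:
        lines.append(f"Text: {text}\nSentiment: {label}\n")
    lines.append(f"Text: {query}\nSentiment:")
    return "\n".join(lines)

examples = [
    ("Shares surged after record profits.", "positive"),
    ("The firm missed revenue estimates.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each financial sentence.",
    examples,
    "The bank reported flat quarterly growth.",
)
```

The prompt ends with an open `Sentiment:` slot, so the frozen model's next-token prediction serves as the classification.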
3.5 Instruction Fine-tuned LLM with Prompt Engineering
Instruction tuning is the additional training of LLMs using explicit text instructions to enhance their capabilities and controllability. Research on instruction tuning can be classified into two main areas [Zhang et al., 2023]: 1) the construction of instruction datasets, and 2) the generation of fine-tuned LLMs using these instruction datasets. In finance, researchers have started transforming existing financial datasets into instruction datasets and subsequently using these datasets for fine-tuning LLMs.
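Transforming an existing labeled dataset into instruction format can be sketched as follows; the `(instruction, input, output)` template is a common convention and only illustrative, not the exact format of FIT or the other datasets below:

```python
def to_instruction_sample(text, label):
    """Wrap a labeled sentiment example into an instruction-tuning record:
    a natural-language instruction, the raw input, and the expected output."""
    return {
        "instruction": "Analyze the sentiment of this financial statement. "
                       "Answer with positive, negative, or neutral.",
        "input": text,
        "output": label,
    }

sample = to_instruction_sample("Operating margin improved to 14%.", "positive")
```

A corpus of such records is then used for supervised fine-tuning of the base LLM.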
FinMA (or PIXIU) [Xie et al., 2023] (github.com/The-FinAI/P…) consists of two fine-tuned LLaMA models (7B and 30B) [Touvron et al., 2023] that use financial instruction datasets for financial tasks. It is built from a large-scale multi-task instruction dataset called Financial Instruction Tuning (FIT, 136k samples), constructed by collecting nine publicly released financial datasets used across five different tasks. In addition to the five FLUE benchmark tasks, it includes the Stock Movement Prediction task.
InvestLM [Yang et al., 2023c] (the instruction dataset is unavailable, but the model has been released on GitHub: github.com/AbaciNLP/In…) is a fine-tuned LLaMA-65B model using a manually curated financial-domain instruction dataset. The dataset includes Chartered Financial Analyst (CFA) exam questions, SEC filings, Stack Exchange quantitative finance discussions, and financial NLP tasks. The downstream tasks are similar to FinMA's but also include a financial Text Summarization task.
FinGPT [Yang et al., 2023a] (github.com/AI4Finance-…) is an open-source, data-centric framework that provides a suite of APIs for financial data sources, an instruction dataset for financial tasks, and several fine-tuned financial LLMs. The FinGPT team has released several related papers describing the framework, as well as an experimental paper [Wang et al., 2023] on instruction fine-tuned FinLLMs built from six open-source LLMs using the Low-Rank Adaptation (LoRA) [Hu et al., 2021] method.
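The LoRA method mentioned above freezes the pre-trained weight matrix and learns only a low-rank additive update. A minimal numpy sketch of the idea (toy dimensions, not FinGPT's actual adapter code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                         # hidden size, LoRA rank (r << d)
W = rng.normal(size=(d, d))         # frozen pre-trained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-initialized

def lora_forward(x):
    """y = W x + B (A x): only A and B (2*d*r parameters) are trained,
    while the d*d matrix W stays frozen."""
    return W @ x + B @ (A @ x)

x = rng.normal(size=d)
# Because B starts at zero, the adapted model initially matches the frozen one.
assert np.allclose(lora_forward(x), W @ x)
```

The appeal in fine-tuning FinLLMs is the parameter count: 2·d·r trainable values instead of d², which shrinks memory and compute requirements dramatically at realistic model sizes.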
4. Evaluation: Benchmark Tasks and Datasets
As LLMs gain significant attention, evaluating them becomes increasingly critical. We summarize six financial NLP benchmark tasks and datasets, and review the evaluation results of models including FinPLMs, FinLLMs, ChatGPT, GPT-4, and task-specific State-of-the-Art (SOTA) models. The results (figures and tables for each task can be found on our GitHub) are drawn from the original research, from analysis studies [Li et al., 2023a], and from SOTA results reported for task-specific models.
4.1 Sentiment Analysis (SA)
The Sentiment Analysis (SA) task aims to analyze sentiment information in input text, including financial news and microblog posts. Most FinPLMs and FinLLMs report evaluation results for this task on the Financial PhraseBank (FPB) and FiQA SA datasets. The FPB [Malo et al., 2014] dataset consists of 4,845 sentences from English financial news. Domain experts annotated each sentence with one of three sentiment labels: Positive, Negative, or Neutral. The FiQA-SA [Maia et al., 2018] dataset consists of 1,173 posts from both headlines and microblogs. Its sentiment scores are on a scale of [-1, 1], and recent studies have converted the scores into a classification task. Overall, FLANG-ELECTRA achieved the best result (92% F1), while FinMA-30B and GPT-4 with 5-shot prompting achieved similar results (87% F1). This suggests a practical approach for less complex tasks in terms of efficiency and cost.
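Converting a continuous [-1, 1] sentiment score into a discrete class, as the studies above do, amounts to a simple thresholding rule. The neutral band of ±0.1 below is an illustrative choice, not a published standard:

```python
def score_to_label(score, threshold=0.1):
    """Map a continuous sentiment score in [-1, 1] to a discrete class.
    Scores inside the +/-threshold band are treated as neutral."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

assert score_to_label(0.6) == "positive"
assert score_to_label(-0.4) == "negative"
assert score_to_label(0.05) == "neutral"
```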
For further evaluation of SA, we include two openly released datasets: SemEval-2017 (Task 5) and StockEmotions. The SemEval-2017 [Cortis et al., 2017] dataset comprises 4,157 sentences collected from both headlines and microblogs. Similar to FiQA SA, its sentiment scores are on a scale of [-1, 1]. The StockEmotions [Lee et al., 2023] dataset consists of 10,000 sentences collected from microblogs, annotated with binary sentiment and 12 fine-grained emotion classes spanning the multi-dimensional range of investor emotions.
4.2 Text Classification (TC)
Text Classification (TC) is the task of classifying a given text or document into predefined labels based on its content. Financial text often carries multiple dimensions of information beyond sentiment, such as price direction or interest rate direction. FLUE includes the gold commodity news Headline [Sinha and Khandait, 2021] dataset for text classification. This dataset comprises 11,412 news headlines, each given binary labels across nine categories such as "price up" or "price down". Similar to the SA task, FLANG-ELECTRA and FinMA-30B with 5-shot prompting achieved the best results (98% avg. F1), and the performance of BERT and FinBERT-20 was also noteworthy (97% avg. F1).
As TC is a broad task that depends on the dataset and its predefined labels, we include three openly released financial TC datasets for further research: FedNLP, FOMC, and Banking77. The FedNLP [Lee et al., 2021] dataset comprises documents sourced from various Federal Open Market Committee (FOMC) materials. The dataset is annotated with labels Up, Maintain, or Down based on the Federal Reserve's Federal Funds Rate decision for the subsequent period. Similarly, the FOMC [Shah et al., 2023] dataset is a collection of FOMC documents with the labels Dovish, Hawkish, or Neutral, reflecting the prevailing sentiment conveyed within the FOMC materials. The Banking77 [Casanueva et al., 2020] dataset comprises 13,083 samples covering 77 intents related to banking customer service queries, such as "card loss" or "linking to an existing card". This dataset is designed for intent detection and developing conversation systems.
4.3 Named Entity Recognition (NER)
The Named Entity Recognition (NER) task extracts information from unstructured text and categorizes it into predefined named entities such as locations (LOC), organizations (ORG), and persons (PER). For the financial NER task, the FIN dataset [Alvarado et al., 2015] is included in the FLUE benchmark. The FIN dataset comprises eight financial loan agreements sourced from the US Securities and Exchange Commission (SEC) for credit risk assessment. GPT-4 with 5-shot prompting (83% Entity F1) and FLANG-ELECTRA (82% Entity F1) demonstrate notable performance, while other FinLLMs exhibit suboptimal results (61%-69% Entity F1).
For further research, we include a financial NER dataset, FiNER-139 [Loukas et al., 2022], consisting of 1.1M sentences annotated with 139 eXtensive Business Reporting Language (XBRL) word-level tags, sourced from the SEC. This dataset is designed for Entity Extraction and Numerical Reasoning tasks, predicting the XBRL tags (e.g., cash and cash equivalents) based on numeric input data within sentences (e.g., “24.8” million).
4.4 Question Answering (QA)
Question Answering (QA) is a task to retrieve or generate answers to questions from an unstructured collection of documents. Financial QA is more challenging than general QA as it requires numerical reasoning across multiple formats. FiQA-QA [Maia et al., 2018] is for opinion-based QA, representing an early Financial QA dataset.
Over time, financial QA has evolved to include complex numerical reasoning in multi-turn conversations. This evolution involves hybrid QA, which creates reasoning paths connecting hybrid contexts that include both tabular and textual content. FinQA [Chen et al., 2021] is a single-turn hybrid QA dataset with 8,281 QA pairs, annotated by experts from the annual reports of S&P 500 companies. ConvFinQA [Chen et al., 2022], an extension of FinQA, is a multi-turn conversational hybrid QA dataset consisting of 3,892 conversations with 14,115 questions. Instead of using the FiQA-QA dataset, all FinLLMs conducted experiments on the FinQA and/or ConvFinQA datasets to assess their numerical reasoning capabilities. GPT-4 with zero-shot prompting outperforms all other models (69%-76% EM Accuracy), approaching the performance of human experts (avg. 90% EM Accuracy). BloombergGPT's result (43% EM Accuracy) was slightly below that of the general crowd (47% EM Accuracy).
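The Exact Match (EM) accuracy metric used for these QA benchmarks can be sketched as below. Normalization conventions vary by benchmark; this version applies only case-folding and whitespace stripping, which is an illustrative simplification:

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answer
    after light normalization (case and surrounding whitespace)."""
    norm = lambda s: s.strip().lower()
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["$8,281", "14.2%", "3,892"]
refs  = ["$8,281", "14.1%", "3,892"]
score = exact_match_accuracy(preds, refs)  # 2 of 3 match
```

The strictness of EM is why numerical-reasoning errors (e.g., 14.2% vs. 14.1%) are penalized as heavily as entirely wrong answers.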
4.5 Stock Movement Prediction (SMP)
The Stock Movement Prediction (SMP) task aims to predict the next day's price movement (e.g., up or down) based on historical prices and associated text data. Because it requires integrating time-series problems with temporal dependencies drawn from text, it is a complex task in which text data can act as both noise and signal. FinMA included the SMP task for the first time, conducting experiments on three datasets: StockNet, CIKM18, and BigData22.
StockNet [Xu and Cohen, 2018] collected historical price data and Twitter data between 2014 and 2016 for 88 stocks in the S&P 500, and is widely used for SMP tasks. The task is framed as binary classification with a threshold: a price movement higher than 0.55% is labeled as a rise (denoted as 1), while a movement lower than -0.5% is labeled as a fall (denoted as 0). Similarly, CIKM18 [Wu et al., 2018] uses historical price and Twitter data from 2017 for 47 stocks in the S&P 500. BigData22 [Soun et al., 2022] compiled data between 2019 and 2020 for 50 stocks in the US stock markets. Like StockNet, it adopts a binary classification formulation with a threshold. On average across these three datasets, GPT-4 with zero-shot prompting achieves higher performance (54% Accuracy) than FinMA (52% Accuracy) and slightly lower results than the SOTA model (58% Accuracy). Although NLP metrics such as accuracy are commonly used, they are insufficient for SMP evaluation; it is also important to consider financial evaluation metrics, such as the Sharpe ratio, as well as backtesting simulation results.
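The binary labeling scheme above, together with a basic Sharpe-ratio metric, can be sketched as follows. The thresholds come from the StockNet setup as described; the Sharpe computation is a simplified, unannualized per-period version:

```python
import statistics

def smp_label(return_pct):
    """Binary movement label: a move above 0.55% is a rise (1), below
    -0.5% a fall (0); moves in between are ambiguous (None) and are
    typically discarded from the dataset."""
    if return_pct > 0.55:
        return 1
    if return_pct < -0.5:
        return 0
    return None

def sharpe_ratio(returns, risk_free=0.0):
    """Per-period (unannualized) Sharpe ratio: mean excess return
    divided by the standard deviation of excess returns."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.pstdev(excess)

labels = [smp_label(r) for r in (1.2, -0.8, 0.1)]  # [1, 0, None]
```

A backtest would feed a strategy's realized per-period returns into `sharpe_ratio`, complementing the classification accuracy reported by the NLP benchmarks.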
4.6 Text Summarization (Summ)
Summarization (Summ) is the generation of a concise summary that conveys a document's key information, via either an extractive or an abstractive approach. In finance, it has been relatively underexplored due to the lack of benchmark datasets, the challenges of domain-expert evaluation, and the need for disclaimers when presenting financial advice. InvestLM included the summarization task for the first time, conducting experiments on the ECTSum dataset. ECTSum [Mukherjee et al., 2022] consists of 2,425 document-summary pairs, containing Earnings Call Transcripts (ECTs) and bullet-point summaries from Reuters. It reports evaluation results on various metrics, including ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore. As with other complex financial tasks, the task-specific SOTA model (47% ROUGE-1) outperforms all LLMs. According to the authors of InvestLM, while GPT-4 with zero-shot prompting (30% ROUGE-1) shows superior performance compared to FinLLMs, the commercial models tend to generate decisive answers.
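The ROUGE-1 metric reported above measures unigram overlap between a candidate and a reference summary. A simplified sketch (whitespace tokenization, no stemming, single reference, unlike full ROUGE implementations):

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap ROUGE-1 F1 between a candidate summary and one
    reference: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())   # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("revenue grew 10 percent",
                  "revenue grew by 10 percent this year")
```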
The summarization task offers significant development opportunities for exploring whether FinLLMs can outperform task-specific SOTA models. For ongoing research, we include the financial summarization dataset MultiLing 2019 [El-Haj, 2019], containing 3,863 document-summary pairs extracted from the annual reports of UK companies listed on the London Stock Exchange (LSE). It provides at least two gold-standard summaries for each annual report.
4.7 Discussion
Within the six benchmarks, the performance of mixed-domain FinPLMs is noteworthy for the SA, TC, and NER tasks, suggesting that using a PLM with fine-tuning for a specific task can be a practical approach depending on the task complexity. For QA, SMP, and Summ tasks, the task-specific SOTA models outperform all LLMs, indicating areas for improvement in FinLLMs. Notably, GPT-4 shows impressive performance across all benchmarks except the Summ task, indicating that scaling models alone may not be adequate for optimal performance in finance. As most instruction-finetuned FinLLMs used the same datasets for their evaluation, we include additional datasets.
5. Advanced Financial NLP Tasks and Datasets 高级金融自然语言处理任务和数据集
适当设计的基准任务和数据集是评估LLMs能力的关键资源,然而,目前的6个基准任务尚未解决更复杂的金融自然语言处理任务。在本节中,我们提出了8个高级基准任务,并为每个任务编译了相关数据集。
关系提取(RE)任务旨在识别和分类文本中隐含的实体之间的关系。与NER类似,这一任务属于信息提取的一部分。FinRED [Sharma等人,2022]数据集为RE任务发布,从金融新闻和收益电话记录中整理而来,包含29个金融领域特有的关系标签(例如"owned by")。
金融中的事件检测(ED)涉及识别投资者如何感知和评估相关公司的影响。事件驱动交易(EDT)[Zhou等人,2021]数据集为ED发布,包括11种公司事件检测。EDT包括9,721篇带有标记级事件标签的新闻文章,以及额外的303,893篇带有分钟级时间戳和股票价格标签的新闻文章。
金融中的因果关系检测(CD)旨在识别事实文本中的因果关系,旨在发展生成有意义的金融叙事摘要的能力。金融叙事处理研讨会(FNP)每年都会解决这个任务并贡献数据集。FNP的一个公开发布的数据集FinCausal20 [Mariko等人,2020]包含两个任务:在给定文本中检测因果方案和识别因果句子。
金融中的数值推理(NR)旨在识别数字和数学运算符,无论是数字形式还是单词形式,以便进行计算和理解金融背景(例如,现金及现金等价物)。一些为NER和QA任务引入的数据集也设计用于数值推理,包括:FiNER-139 [Loukas等人,2022],FinQA [Chen等人,2021],ConvFinQA [Chen等人,2022]。
结构识别(SR)是一个专注于文档内结构边界检测(例如文本、表格或图表)的任务,并识别表格与周围内容之间或表格内单元格之间的逻辑关系。IBM Research发布了FinTabNet [Zheng等人,2021]数据集,该数据集从标准普尔500公司的收益报告中收集。该数据集包含带有详细注释的表格结构的非结构化PDF文档。QA任务中包含的FinQA和ConvFinQA数据集已从FinTabNet进一步发展。
多模态(MM)理解是许多领域中的一个具有挑战性的任务。最近出现了几个多模态金融数据集。MAEC [Li等人,2020]较大规模地汇编了收益电话记录的多模态数据(文本、时间序列和音频),包含3,443个实例和394,277个句子。此外,MONOPOLY [Mathur等人,2022]引入了来自六家中央银行的货币政策会议视频数据,共享了来自340个视频的24,180个样本,包括文本脚本和时间序列。
金融中的机器翻译(MT)不仅旨在将句子从源语言翻译成目标语言,还要理解不同语言中的金融上下文含义。MINDS-14 [Gerz等人,2021]包含8,168个银行语音助手数据样本,涵盖14种不同语言的文本和音频格式。MultiFin [Jørgensen等人,2023]包括10,048个涵盖金融主题的样本,带有6个高级标签(例如"金融")和23个低级标签(例如"并购与估值"),来源于15种不同语言的公开金融文章。
市场预测(MF)是金融市场中的一个重要任务,涉及市场价格、波动性和风险的预测。这个任务超越了股票走势预测(SMP),后者将问题制定为分类任务。在情感分析、事件检测和多模态任务中引入的数据集也设计用于市场预测。这里,我们包括与MF相关的数据集列表:StockEmotions(SA)[Lee等人,2023],EDT(ED)[Zhou等人,2021],MAEC(MM-音频)[Li等人,2020],和MONOPOLY(MM-视频)[Mathur等人,2022]。
Properly designed benchmark tasks and datasets are a crucial resource to assess the capability of LLMs, however, the current 6 benchmark tasks have yet to address more complex financial NLP tasks. In this section, we present 8 advanced benchmark tasks and compile associated datasets for each.
The Relation Extraction (RE) task aims to identify and classify relationships between entities implied in text. Similar to NER, this task is part of Information Extraction. The FinRED [Sharma et al., 2022] dataset is released for RE and is curated from financial news and earnings call transcripts, containing 29 relation tags (e.g. owned by) specific to the finance domain.
Event Detection (ED) in finance involves identifying corporate events and how investors perceive and assess their impact on related companies. The Event-Driven Trading (EDT) [Zhou et al., 2021] dataset is released for ED and covers the detection of 11 types of corporate events. EDT comprises 9,721 news articles with token-level event labels, and an additional 303,893 news articles with minute-level timestamps and stock price labels.
Causality Detection (CD) in finance seeks to identify cause-and-effect relationships within factual text, aiming to develop an ability to generate meaningful financial narrative summaries. The Workshop on Financial Narrative Processing (FNP) addresses this task every year and contributes datasets. One of the openly released datasets from FNP, FinCausal20 [Mariko et al., 2020], comprises two tasks: detecting a causal scheme in a given text and identifying cause-and-effect sentences.
Numerical Reasoning (NR) in finance aims to identify numbers and mathematical operators, in either digit or word form, in order to perform calculations and comprehend financial context (e.g. cash and cash equivalents). Some datasets introduced for NER and QA tasks are also designed for numerical reasoning, including FiNER-139 [Loukas et al., 2022], FinQA [Chen et al., 2021], and ConvFinQA [Chen et al., 2022].
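As a toy illustration of the digit-versus-word-form distinction this task involves, the sketch below (a hypothetical helper, not taken from any cited dataset or system) normalizes a handful of word-form numbers so that both forms can feed the same downstream calculation:

```python
import re

# Hypothetical word-to-digit table covering only a few small numbers;
# real NR systems also handle compounds ("twenty-three") and scales ("million").
WORD_NUMBERS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

def extract_numbers(text: str) -> list[float]:
    """Collect numbers in digit form (e.g. '4.2') and simple word form (e.g. 'three')."""
    values = [float(m) for m in re.findall(r"\d+(?:\.\d+)?", text)]
    values += [float(WORD_NUMBERS[t]) for t in re.findall(r"[a-z]+", text.lower())
               if t in WORD_NUMBERS]
    return values

nums = extract_numbers("Cash and cash equivalents rose by three million to 4.2 billion.")
```

On this sentence the sketch recovers both 3 (word form) and 4.2 (digit form); resolving the "million"/"billion" scale, and then reasoning over the values, is where the benchmark datasets above become challenging.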
Structure Recognition (SR) is a task focused on detecting structural boundaries within a document (e.g. text, tables, or figures) and recognizing the logical relationships between tables and surrounding content, or between cells within a table. IBM Research has released the FinTabNet [Zheng et al., 2021] dataset, collected from earnings reports of S&P 500 companies. This dataset comprises unstructured PDF documents with detailed annotations of table structures. The FinQA and ConvFinQA datasets, included in QA tasks, were further developed from FinTabNet.
Multimodal (MM) understanding is a challenging task across many domains. Recently, several multimodal financial datasets have been introduced. MAEC [Li et al., 2020] compiles multimodal data (text, time series, and audio) from earnings call transcripts on a larger scale, with 3,443 instances and 394,277 sentences. Additionally, MONOPOLY [Mathur et al., 2022] introduces video data from monetary policy call transcripts across six central banks, sharing 24,180 samples from 340 videos with text scripts and time series.
Machine Translation (MT) in finance aims not only to translate sentences from a source language to a target language, but also to comprehend the financial contextual meaning in different languages. MINDS-14 [Gerz et al., 2021] consists of 8,168 samples of banking voice assistant data in text and audio formats across 14 different languages. MultiFin [Jørgensen et al., 2023] includes 10,048 samples covering financial topics with 6 high-level labels (e.g., Finance) and 23 low-level labels (e.g., M&A & Valuations), sourced from public financial articles in 15 different languages.
Market Forecasting (MF) is an essential task in financial markets, involving the prediction of market prices, volatility, and risk. This task extends beyond Stock Movement Prediction (SMP), which formulates the problem as a classification task. The datasets introduced in the Sentiment Analysis, Event Detection, and Multimodal tasks are also designed for Market Forecasting. Here, we include a list of datasets relevant to MF: StockEmotions (SA) [Lee et al., 2023], EDT (ED) [Zhou et al., 2021], MAEC (MM-audio) [Li et al., 2020], and MONOPOLY (MM-video) [Mathur et al., 2022].
6. Opportunities and Challenges 机遇与挑战
在本节中,我们强调了指导FinLLMs未来方向的各个方面,包括数据集、技术、评估、实施和实际应用。
数据集:对于开发复杂的FinLLMs来说,高质量的数据和多模态数据非常重要。由于大多数FinLLMs在金融特定数据上训练通用领域LLMs,挑战在于收集多种形式的高质量金融数据。通过将现有数据集转换为特定金融NLP任务的指令微调金融数据集,将促进先进FinLLMs的发展。此外,金融多模态数据集的研究将变得越来越重要,提高FinLLMs在复杂任务上的性能。
技术:金融中的主要挑战包括在不侵犯隐私、不引发安全问题的前提下利用内部数据,同时增强对FinLLMs生成回复的信任。为应对这些挑战,可以将一些研究活跃的LLM技术引入金融领域,如检索增强生成(RAG)[Lewis等人,2020]。RAG系统类似于开卷方法,它检索未参与预训练的外部知识资源(例如,查询私有数据)以增强预训练模型对信息的原始表示。RAG使模型能够访问事实信息,生成可交叉引用的答案,从而提高可靠性并最大限度地减少幻觉问题。此外,RAG使模型无需重新训练整个模型即可使用内部不可训练数据,确保隐私不被侵犯。
评估:评估的主要挑战在于结合金融专家的领域知识,针对金融NLP任务验证模型的性能。当前的评估结果使用F1分数或准确率等常用NLP指标呈现。然而,知识驱动的任务需要金融专家的人工评估、优先于通用NLP指标的适当金融评估指标,以及用于模型对齐的专家反馈。此外,包括我们提出的八个进一步基准在内的高级金融NLP任务,将有助于发掘FinLLMs的隐藏能力。这些复杂任务将在兼顾特定任务的成本和性能的前提下,评估FinLLMs能否作为通用金融问题求解模型[Guo等人,2023]。
实施:选择合适的FinLLMs和技术的挑战在于成本与性能之间的权衡。视任务复杂性和推理成本而定,选择带提示的通用领域LLMs或特定任务模型可能比构建FinLLMs更为实用。这需要LLMOps工程技能,包括参数高效微调(PEFT)等软提示技术,以及带有持续集成(CI)和持续交付(CD)流程的运维监控系统。
应用:开发实际金融应用的挑战与非技术问题有关,包括业务需求、行业障碍、数据隐私、问责制、伦理和金融专家与AI专家之间的理解差距。为了克服这些挑战,分享FinLLM用例将在各种金融领域(包括机器人顾问、量化交易和低代码开发)中有益。此外,我们鼓励未来的方向朝着生成性应用发展,包括报告生成和文档理解。
In this section, we highlight various aspects guiding the future directions of FinLLMs, covering datasets, techniques, evaluation, implementation, and real-world applications.
Datasets : High-quality data and multimodal data are significantly important for developing sophisticated FinLLMs. As most FinLLMs train general-domain LLMs on financial-specific data, the challenge lies in collecting high-quality financial data in diverse formats. Building instruction-finetuned financial datasets by converting existing datasets for specific financial NLP tasks will facilitate the development of advanced FinLLMs. Also, research on financial multimodal datasets will become increasingly important, enhancing the performance of FinLLMs on complex tasks.
Techniques : Major challenges in finance include utilizing internal data without privacy breaches or security issues, while also enhancing trust in the responses generated by FinLLMs. To address these challenges, actively researched LLM techniques, such as Retrieval-Augmented Generation (RAG) [Lewis et al., 2020], can be implemented in the financial domain. The RAG system is similar to an open-book approach: it retrieves non-pre-trained external knowledge resources (e.g., queried private data) to enhance the pre-trained model’s raw representation of information. RAG provides the model with access to factual information, enabling the generation of cross-referenced answers, thereby improving reliability and minimizing hallucination issues. Moreover, RAG enables the use of internal non-trainable data without retraining the entire model, ensuring privacy is not breached.
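The retrieve-then-generate flow described above can be sketched with a toy keyword-overlap retriever over an in-memory internal corpus. This is illustrative only: real RAG systems use dense embeddings and a vector store, the documents below are invented, and the final prompt would be passed to an LLM (the generation call is omitted):

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank internal documents by keyword overlap with the query (toy retriever)."""
    q_terms = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda doc: len(q_terms & set(doc.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    """Augment the prompt with retrieved context so answers can be cross-referenced."""
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

# Hypothetical internal (non-trainable) documents that never enter pre-training.
internal_docs = [
    "Q3 revenue was 4.2 billion USD, up 8% year over year.",
    "The board approved a share buyback in March.",
    "Headcount grew 5% in the engineering division.",
]
prompt = build_prompt("What was Q3 revenue?", internal_docs)
```

Because the model only sees the private documents at inference time inside the prompt, the internal data stays out of the trained weights, which is the privacy property noted above.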
Evaluation : The primary challenge in evaluation is incorporating domain knowledge from financial experts to validate a model’s performance on financial NLP tasks. Current evaluation results are presented using commonly used NLP metrics such as F1-score or Accuracy. However, knowledge-driven tasks require human evaluation by financial experts, appropriate financial evaluation metrics beyond NLP metrics, and expert feedback for model alignment. Furthermore, advanced financial NLP tasks, including the eight further benchmarks we presented, would help uncover the hidden capabilities of FinLLMs. These complex tasks will assess whether FinLLMs can serve as general financial problem-solver models [Guo et al., 2023], considering both cost and performance for specific tasks.
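For reference, the F1-score used throughout these evaluations is the harmonic mean of precision and recall; a minimal binary version in Python is shown below (libraries such as scikit-learn provide the production implementation, including macro/micro averaging for multi-class tasks):

```python
def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    """Binary F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

f1 = f1_score([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```

In the example there are 2 true positives, 1 false positive, and 1 false negative, so precision and recall are both 2/3 and F1 is 2/3; the point made above is that such label-overlap metrics alone cannot capture the financial soundness that expert evaluation would.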
Implementation : The challenge in selecting suitable FinLLMs and techniques lies in the trade-off between cost and performance. Depending on the task complexity and inference cost, selecting general-domain LLMs with prompting or task-specific models might be a more practical choice than building FinLLMs. This requires LLMOps engineering skills, including soft prompt techniques such as Parameter-Efficient Fine-Tuning (PEFT) and monitoring operational systems with a Continuous Integration (CI) and Continuous Delivery (CD) pipeline.
Applications : The challenge of developing real-world financial applications relates to non-technical issues, including business needs, industry barriers, data privacy, accountability, ethics, and the understanding gap between financial experts and AI experts. To overcome these challenges, sharing FinLLM use cases will be beneficial across various financial fields, including robo-advisors, quantitative trading, and low-code development [Yang et al., 2023a]. Furthermore, we encourage future directions towards generative applications, including report generation and document understanding.
7. Conclusion 结论
本综述对FinLLMs进行了简洁而全面的考察:探讨了它们从通用领域LMs的演变,比较了FinPLMs/FinLLMs的技术,并介绍了六个传统基准以及八个高级基准任务和数据集。对于未来研究,我们对FinLLMs的宏观视角、用于更高级评估的相关且广泛的数据集汇编,以及面向新方向的机遇和挑战,将使计算机科学和金融研究社区都受益。
Our survey provides a concise yet comprehensive investigation of FinLLMs, by exploring their evolution from general-domain LMs, comparing techniques of FinPLMs/FinLLMs, and presenting six conventional benchmarks as well as eight advanced benchmarks and datasets. For future research, our big-picture view of FinLLMs, a relevant and extensive collection of datasets for more advanced evaluation, and opportunities and challenges for new directions for advanced FinLLMs will be beneficial to both the Computer Science and Finance research communities.
References 参考文献
[Alvarado et al., 2015] Julio Cesar Salinas Alvarado, Karin Verspoor, and Timothy Baldwin. Domain adaption of named entity recognition to support credit risk assessment. In Proceedings of ALTA Workshop, pages 84–90, 2015.
[Araci, 2019] Dogu Araci. Finbert: Financial sentiment analysis with pre-trained language models. arXiv preprint arXiv:1908.10063, 2019.
[Bommasani et al., 2021] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
[Brown et al., 2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, et al. Language models are few-shot learners. NeurIPS, 33:1877–1901, 2020.
[Casanueva et al., 2020] Inigo Casanueva, Tadas Temčinas, Daniela Gerz, Matthew Henderson, and Ivan Vulić. Efficient intent detection with dual sentence encoders. In Proceedings of NLP4ConvAI Workshop, pages 38–45, 2020.
[Chen et al., 2021] Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, et al. Finqa: A dataset of numerical reasoning over financial data. In Proceedings of EMNLP, pages 3697–3711, 2021.
[Chen et al., 2022] Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang. Convfinqa: Exploring the chain of numerical reasoning in conversational finance question answering. In Proceedings of EMNLP, pages 6279–6292, 2022.
[Clark et al., 2020] Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555, 2020.
[Cortis et al., 2017] Keith Cortis, André Freitas, Tobias Daudert, et al. Semeval-2017 task 5: Fine-grained sentiment analysis on financial microblogs and news. In Proceedings of SemEval, pages 519–535, 2017.
[Devlin et al., 2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, 2018.
[El-Haj, 2019] Mahmoud El-Haj. Multiling 2019: Financial narrative summarisation. In Proceedings of the Workshop MultiLing 2019, pages 6–10, 2019.
[Gerz et al., 2021] Daniela Gerz, Pei-Hao Su, Razvan Kusztos, Avishek Mondal, Michał Lis, et al. Multilingual and cross-lingual intent detection from spoken data. In Proceedings of EMNLP, pages 7468–7475, 2021.
[Guo et al., 2023] Yue Guo, Zian Xu, and Yi Yang. Is chatgpt a financial expert? evaluating language models on financial natural language processing. In Findings of EMNLP, 2023.
[Hu et al., 2021] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, et al. Lora: Low-rank adaptation of large language models. In ICLR, 2021.
[Jørgensen et al., 2023] Rasmus Jørgensen, Oliver Brandt, Mareike Hartmann, Xiang Dai, Christian Igel, and Desmond Elliott. Multifin: A dataset for multilingual financial nlp. In Findings of EACL, pages 864–879, 2023.
[Ke et al., 2022] Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. Continual pretraining of language models. In ICLR, 2022.
[Lee et al., 2021] Jean Lee, Hoyoul Luis Youn, Nicholas Stevens, Josiah Poon, and Soyeon Caren Han. Fednlp: an interpretable nlp system to decode federal reserve communications. In Proceedings of ACM SIGIR, pages 2560– 2564, 2021.
[Lee et al., 2023] Jean Lee, Hoyoul Luis Youn, Josiah Poon, and Soyeon Caren Han. Stockemotions: Discover investor emotions for financial sentiment analysis and multivariate time series. AAAI-24 Bridge, 2023.
[Lewis et al., 2020] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. NeurIPS, 33:9459–9474, 2020.
[Li et al., 2020] Jiazheng Li, Linyi Yang, Barry Smyth, and Ruihai Dong. Maec: A multimodal aligned earnings conference call dataset for financial risk prediction. In Proceedings of ACM CIKM, pages 3063–3070, 2020.
[Li et al., 2023a] Xianzhi Li, Samuel Chan, Xiaodan Zhu, et al. Are chatgpt and gpt-4 general-purpose solvers for financial text analytics? a study on several typical tasks. In Proceedings of EMNLP: Industry, pages 408–422, 2023.
[Li et al., 2023b] Yinheng Li, Shaofei Wang, Han Ding, and Hang Chen. Large language models in finance: A survey. In Proceedings of ACM ICAIF, pages 374–382, 2023.
[Liu et al., 2021] Zhuang Liu, Degen Huang, Kaiyu Huang, Zhuang Li, and Jun Zhao. Finbert: A pre-trained financial language representation model for financial text mining. In Proceedings of IJCAI, pages 4513–4519, 2021.
[Liu et al., 2023] Pengfei Liu, Weizhe Yuan, Jinlan Fu, et al. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023.
[Loukas et al., 2022] Lefteris Loukas, Manos Fergadiotis, Ilias Chalkidis, Eirini Spyropoulou, et al. Finer: Financial numeric entity recognition for xbrl tagging. In Proceedings of ACL, pages 4419–4431, 2022.
[Maia et al., 2018] Macedo Maia, Siegfried Handschuh, André Freitas, et al. WWW’18 open challenge: financial opinion mining and question answering. In Companion proceedings of WWW, pages 1941–1942, 2018.
[Malo et al., 2014] Pekka Malo, Ankur Sinha, Pekka Korhonen, Jyrki Wallenius, and Pyry Takala. Good debt or bad debt: Detecting semantic orientations in economic texts. JASIST, 65(4):782–796, 2014.
[Mariko et al., 2020] Dominique Mariko, Hanna Abi Akl, Estelle Labidurie, et al. The financial document causality detection shared task (fincausal 2020). In Proceedings of the Workshop on FNP-FNS, pages 23–32, 2020.
[Mathur et al., 2022] Puneet Mathur, Atula Neerkaje, Malika Chhibber, Ramit Sawhney, et al. Monopoly: Financial prediction from monetary policy conference videos using multimodal cues. In Proceedings of ACM MM, pages 2276–2285, 2022.
[Mukherjee et al., 2022] Rajdeep Mukherjee, Abhinav Bohra, Akash Banerjee, Soumya Sharma, et al. Ectsum: A new benchmark dataset for bullet point summarization of long earnings call transcripts. In Proceedings of EMNLP, pages 10893–10906, 2022.
[Radford et al., 2018] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. OpenAI, 2018.
[Radford et al., 2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[Scao et al., 2022] Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
[Shah et al., 2022] Raj Shah, Kunal Chawla, Dheeraj Eidnani, Agam Shah, Wendi Du, Sudheer Chava, Natraj Raman, Charese Smiley, Jiaao Chen, and Diyi Yang. When flue meets flang: Benchmarks and large pretrained language model for financial domain. In Proceedings of EMNLP, pages 2322–2335, 2022.
[Shah et al., 2023] Agam Shah, Suvan Paturi, and Sudheer Chava. Trillion dollar words: A new financial dataset, task & market analysis. In Proceedings of ACL, 2023.
[Sharma et al., 2022] Soumya Sharma, Tapas Nayak, Arusarka Bose, Ajay Kumar Meena, et al. Finred: A dataset for relation extraction in financial domain. In Companion Proceedings of WWW, pages 595–597, 2022.
[Sinha and Khandait, 2021] Ankur Sinha and Tanmay Khandait. Impact of news on the commodity market: Dataset and results. In Proceedings of FICC, pages 589–601. Springer, 2021.
[Soun et al., 2022] Yejun Soun, Jaemin Yoo, Minyong Cho, Jihyeong Jeon, and U Kang. Accurate stock movement prediction with self-supervised learning from sparse noisy tweets. In IEEE Big Data, pages 1691–1700. IEEE, 2022.
[Touvron et al., 2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 30, 2017.
[Wang et al., 2023] Neng Wang, Hongyang Yang, and Christina Dan Wang. Fingpt: Instruction tuning benchmark for open-source large language models in financial datasets. In NeurIPS Workshop on Instruction Tuning and Instruction Following, 2023.
[Wei et al., 2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 35:24824–24837, 2022.
[Wu et al., 2018] Huizhe Wu, Wei Zhang, Weiwei Shen, and Jun Wang. Hybrid deep sequential modeling for social text-driven stock prediction. In Proceedings of ACM CIKM, pages 1627–1630, 2018.
[Wu et al., 2023] Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564, 2023.
[Xie et al., 2023] Qianqian Xie, Weiguang Han, Xiao Zhang, Yanzhao Lai, Min Peng, Alejandro Lopez-Lira, and Jimin Huang. Pixiu: A large language model, instruction data and evaluation benchmark for finance. In Proceedings of NeurIPS Datasets and Benchmarks, 2023.
[Xu and Cohen, 2018] Yumo Xu and Shay B Cohen. Stock movement prediction from tweets and historical prices. In Proceedings of ACL, pages 1970–1979, 2018.
[Yang et al., 2020] Yi Yang, Mark Christopher Siy Uy, and Allen Huang. Finbert: A pretrained language model for financial communications. arXiv preprint arXiv:2006.08097, 2020.
[Yang et al., 2023a] Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. Fingpt: Open-source financial large language models. arXiv preprint arXiv:2306.06031, 2023.
[Yang et al., 2023b] Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, and Xia Hu. Harnessing the power of llms in practice: A survey on chatgpt and beyond. arXiv preprint arXiv:2304.13712, 2023.
[Yang et al., 2023c] Yi Yang, Yixuan Tang, and Kar Yan Tam. Investlm: A large language model for investment using financial domain instruction tuning. arXiv preprint arXiv:2309.13064, 2023.
[Zhang et al., 2023] Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, et al. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792, 2023.
[Zhao et al., 2023] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, , Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
[Zheng et al., 2021] Xinyi Zheng, Douglas Burdick, Lucian Popa, Xu Zhong, and Nancy Xin Ru Wang. Global table extractor (gte): A framework for joint table identification and cell structure recognition using visual context. In Proceedings of the IEEE/CVF WACV, pages 697–706, 2021.
[Zhou et al., 2021] Zhihan Zhou, Liqian Ma, and Han Liu. Trade the event: Corporate events detection for news-based event-driven trading. In Findings of ACL-IJCNLP, pages 2114–2124, 2021.