Large Language Models in Finance: A Survey

Link: arxiv.org/abs/2311.10…

Abstract

Recent advances in large language models (LLMs) have opened new possibilities for artificial intelligence applications in finance. In this paper, we provide a practical survey focused on two key aspects of utilizing LLMs for financial tasks: existing solutions and guidance for adoption.

First, we review current approaches employing LLMs in finance, including leveraging pretrained models via zero-shot or few-shot learning, fine-tuning on domain-specific data, and training custom LLMs from scratch. We summarize key models and evaluate their performance improvements on financial natural language processing tasks.

Second, we propose a decision framework to guide financial professionals in selecting the appropriate LLM solution based on their use case constraints around data, compute, and performance needs. The framework provides a pathway from lightweight experimentation to heavy investment in customized LLMs.

Lastly, we discuss limitations and challenges around leveraging LLMs in financial applications. Overall, this survey aims to synthesize the state-of-the-art and provide a roadmap for responsibly applying LLMs to advance financial AI.

Keywords: Large Language Models, Generative AI, Natural Language Processing, Finance

1 Introduction

Recent advances in artificial intelligence, especially in natural language processing, have led to the development of powerful large language models (LLMs) like ChatGPT[33]. These models have demonstrated impressive capabilities in understanding, generating, and reasoning about natural language. The finance industry could benefit from applying LLMs, as effective language understanding and generation can inform trading, risk modeling, customer service, and more.

In this survey, we aim to provide a practical overview focused on two key aspects of utilizing LLMs for financial applications:

• Existing solutions and models that employ LLMs for various finance tasks. We summarize key techniques like fine-tuning pretrained LLMs and training domain-specific LLMs from scratch.

• Guidance on the decision process for applying LLMs in finance. We discuss factors to consider regarding whether LLMs are suitable for a task, cost/benefit tradeoffs, risks, and limitations.

By reviewing current literature and developments, we hope to give an accessible synthesis of the state-of-the-art along with considerations for adopting LLMs in finance. This survey targets financial professionals and researchers exploring the intersection of AI and finance. It may also inform developers applying LLM solutions for the finance industry. The remainder of the paper is organized as follows. Section 2 covers background on language modeling and recent advances leading to LLMs. Section 3 surveys current AI applications in finance and the potential for LLMs to advance these areas. Sections 4 and 5 provide LLM solutions and decision guidance for financial applications. Finally, Sections 6 and 7 discuss risks, limitations, and conclusions.

2 Basics of Language Models

A language model is a statistical model that is trained on extensive text corpora to predict the probability distribution of word sequences [4]. Consider a sequence of words denoted as 𝑊 = 𝑤1,𝑤2, ...,𝑤𝑛, where 𝑤𝑖 represents the 𝑖-th word in the sequence. The goal of a language model is to calculate the probability 𝑃(𝑊), which, by the chain rule, can be expressed as:

𝑃(𝑊) = 𝑃(𝑤1) 𝑃(𝑤2|𝑤1) · · · 𝑃(𝑤𝑛|𝑤1, ...,𝑤𝑛−1) = ∏ᵢ₌₁ⁿ 𝑃(𝑤𝑖|𝑤1,𝑤2, ...,𝑤𝑖−1)

The conditional probability 𝑃(𝑤𝑖|𝑤1,𝑤2, ...,𝑤𝑖−1) captures the likelihood of word 𝑤𝑖 given the preceding words. Over the past few decades, language model architectures have undergone significant evolution. Initially, n-gram models represented word sequences as Markov processes [3], assuming that the probability of the next word depends solely on the preceding (𝑛−1) words. For example, in a bigram model, the probability of a word is conditioned only on the previous word.
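As a concrete illustration of the n-gram idea, the following minimal sketch (not from the paper) estimates bigram probabilities from raw counts, without smoothing; the toy corpus is purely illustrative:

```python
from collections import Counter

def train_bigram(corpus):
    """Estimate P(w_i | w_{i-1}) from raw counts (no smoothing)."""
    unigram = Counter()
    bigram = Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        unigram.update(tokens[:-1])                  # count each word as a "previous" word
        bigram.update(zip(tokens[:-1], tokens[1:]))  # count adjacent word pairs
    return {(prev, w): c / unigram[prev] for (prev, w), c in bigram.items()}

def sentence_prob(model, sentence):
    """Multiply the conditional probabilities along the sentence."""
    tokens = ["<s>"] + sentence.split()
    p = 1.0
    for prev, w in zip(tokens[:-1], tokens[1:]):
        p *= model.get((prev, w), 0.0)
    return p

corpus = ["prices rose sharply", "prices fell sharply", "prices rose again"]
model = train_bigram(corpus)
# "prices" is followed by "rose" in 2 of its 3 occurrences, so P(rose | prices) = 2/3
print(model[("prices", "rose")])
```

Real n-gram models add smoothing (e.g. add-one or Kneser-Ney) so unseen pairs do not zero out the whole product.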

Later, Recurrent Neural Network (RNN)-based models like LSTM [20] and GRU [10] emerged as neural network solutions capable of capturing long-term dependencies in sequential data. In 2017, however, the introduction of the transformer architecture [46] revolutionized language modeling, surpassing the performance of RNNs in tasks such as machine translation. Transformers employ self-attention mechanisms to model relationships between all pairs of words in parallel, facilitating efficient training on large-scale datasets. Prominent transformer-based models include GPT (Generative Pretrained Transformer) [5, 48], which is a decoder-only framework, BERT (Bidirectional Encoder Representations from Transformers) [13], which is an encoder-only framework, and T5 (Text-to-Text Transfer Transformer) [38], which leverages both encoder and decoder structures. These models have achieved state-of-the-art results on various natural language processing (NLP) tasks through transfer learning.
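The self-attention computation mentioned above can be sketched in a few lines of NumPy. This is an illustrative single-head version without masking or multi-head projections; the dimensions are arbitrary:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of word vectors X (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise word-word affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights                      # weighted mix of value vectors

rng = np.random.default_rng(0)
n, d = 4, 8                                          # 4 tokens, 8-dim embeddings
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, w = self_attention(X, Wq, Wk, Wv)
```

Because every token attends to every other token in one matrix product, the whole sequence is processed in parallel, unlike the step-by-step recurrence of an RNN.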

It is important to note that the evolution of language models has mainly been driven by advancements in computational power, the availability of large-scale datasets, and the development of novel neural network architectures. These models have significantly enhanced language understanding and generation capabilities, enabling their application across a wide range of industries and domains.

3 Overview of AI Applications in Finance

3.1 Current AI Applications in Finance

Artificial Intelligence (AI) has witnessed extensive adoption across various domains of finance in recent years [19]. In this survey, we focus on key financial applications, including trading and portfolio management [67], financial risk modeling [30], financial text mining [21, 36], and financial advisory and customer services [41]. While this list is not exhaustive, these areas have shown significant interest and high potential with the advancement of AI.

Trading and portfolio management have been early adopters of machine learning and deep learning models within the finance industry. The primary objective of trading is to forecast prices and generate profits based on these predictions. Initially, statistical machine learning methods such as Support Vector Machines (SVM) [23], XGBoost [68], and tree-based algorithms were utilized for profit and loss estimation. However, the emergence of deep neural networks introduced techniques like Recurrent Neural Networks (RNN), particularly Long Short-Term Memory (LSTM) networks [40], Convolutional Neural Networks (CNN), and transformers [51], which have proven effective in price forecasting. Additionally, reinforcement learning [47] has been applied to automatic trading and portfolio optimization.

Financial risk modeling encompasses various applications of machine learning and deep learning models. For instance, McKinsey & Company has developed a deep learning-based solution for financial fraud detection by leveraging user history data and real-time transaction data [39]. Similar approaches have been employed in credit scoring [29, 52] and bankruptcy or default prediction [8].

Financial text mining represents a popular area where deep learning models and natural language processing techniques are extensively utilized. According to [35], there are over 40 research publications on this topic. Financial text mining aims to extract valuable information from large-scale unstructured data in real-time, enabling more informed decision-making in trading and risk modeling. For example, [15] employs financial market sentiment extracted from news articles to forecast the direction of the stock market index.

Applying AI in financial advisory and customer-related services is an emerging and rapidly growing field. AI-powered chatbots, as discussed in [32], already provide more than 37% of supporting functions in various e-commerce and e-service scenarios. In the financial industry, chatbots are being adopted as cost-effective alternatives to human customer service, as highlighted in the report "Chatbots in consumer finance" [2]. Additionally, banks like JPMorgan are leveraging AI services to provide investment advice, as mentioned in a report by CNBC [42].

The current implementation of deep learning models offers significant advantages by efficiently extracting valuable insights from vast amounts of data within short time frames. This capability is particularly valuable in the finance industry, where timely and accurate information plays a crucial role in decision-making processes. With the emergence of LLMs, even more tasks that were previously considered intractable become possible, further expanding the potential applications of AI in the finance industry.

3.2 Advancements of LLMs in Finance

LLMs offer numerous advantages over traditional models, particularly in the field of finance. Firstly, LLMs leverage their extensive pre-training data to effectively process commonsense knowledge, enabling them to understand natural language instructions. This is valuable in scenarios where supervised training is challenging due to limited labeled financial data or restricted access to certain documents. LLMs can perform tasks through zero-shot learning [26], as demonstrated by their satisfactory performance in sentiment classification tasks across complex levels [65]. For similar text mining tasks on financial documents, LLMs can automatically achieve acceptable performance.

Compared to other supervised models, LLMs offer superior adaptation and flexibility. Instead of training separate models for specific tasks, LLMs can handle multiple tasks by simply modifying the prompt under different task instructions [6]. This adaptability does not require additional training, enabling LLMs to simultaneously perform sentiment analysis, summarization, and keyword extraction on financial documents.
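This multi-task flexibility comes down to swapping the instruction in the prompt while keeping the model fixed. A minimal sketch (the task wording and document are illustrative, not from the survey):

```python
# Hypothetical instruction templates -- one model, many tasks, no retraining.
TASKS = {
    "sentiment": "Classify the sentiment of the following financial text as positive, negative, or neutral.",
    "summary": "Summarize the following financial text in one sentence.",
    "keywords": "Extract the five most important keywords from the following financial text.",
}

def build_prompt(task, document):
    """Compose instruction + document into a single prompt string."""
    return f"{TASKS[task]}\n\nText: {document}\n\nAnswer:"

doc = "The central bank raised rates by 50 bps, and bank stocks rallied."
prompts = [build_prompt(task, doc) for task in TASKS]
```

Each prompt would then be sent to the same LLM; only the instruction changes, which is what makes the zero-shot, multi-task setup so cheap to experiment with.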

LLMs excel at breaking down ambiguous or complex tasks into actionable plans. Applications like Auto-GPT [1], Semantic Kernel [31], and LangChain [7] have been developed to showcase this capability. In this paper, we refer to this as Tool Augmented Generation. For instance [37], Auto-GPT can optimize a portfolio with global equity ETFs and bond ETFs based on user-defined goals. It formulates detailed plans, including acquiring financial data, utilizing Python packages for Sharpe ratio optimization, and presenting the results to the user. Previously, achieving such end-to-end solutions with a single model was unfeasible. This property makes LLMs an ideal fit for financial customer service or financial advisory, where they can understand natural language instructions and assist customers by leveraging available tools and information.

While the application of LLMs in finance is highly promising, it is crucial to acknowledge their limitations and associated risks, which will be further discussed in Section 6.

4 LLM Solutions for Finance

4.1 Utilizing Few-shot/Zero-shot Learning in Finance Applications

Accessing LLM solutions in finance can be done through two options: utilizing an API from LLM service providers or employing open-source LLMs. Companies like OpenAI (openai.com/product), Google (gemini.google.com/), and Microsoft (azure.microsoft.com/en-us/produ…) offer LLM services through APIs. These services not only provide the base language model capabilities but also offer additional features tailored for specific use cases. For example, OpenAI's APIs include functionalities for chat, SQL generation, code completion, and code interpretation. While there is no dedicated LLM service exclusively designed for finance applications, leveraging these general-purpose LLM services can be a viable option, especially for common tasks. An example in this work [16] demonstrates the use of OpenAI's GPT-4 service for financial statement analysis.

In addition to LLM services provided by tech companies, open-source LLMs can also be applied to financial applications. Models such as LLaMA [45], BLOOM [54], Flan-T5 [12], and more are available for download from the Hugging Face model repository (huggingface.co/models). Unlike using APIs, these open-source models must be hosted and run on the user's own infrastructure.

Similar to using LLM APIs, zero-shot or few-shot learning approaches can be employed with open-source models. Utilizing open-source models offers greater flexibility as the model's weights are accessible, and the model's output can be customized for downstream tasks. Additionally, it provides better privacy protection as the model and data remain under the user's control. However, working with open-source models also has its drawbacks. Reported evaluation metrics suggest a performance gap between open-source models and proprietary models. For certain downstream tasks, zero-shot or few-shot learning may not yield optimal performance. In such cases, fine-tuning the model with labeled data, expertise, and computational resources is necessary to achieve satisfactory results. This may explain why, at the time of writing this paper, no direct examples of open-source models applied to financial applications have been found. In Section 5, we provide a more detailed discussion of which option is more favorable under different circumstances.

4.2 Fine-tuning a Model

Fine-tuning LLMs in the finance domain can enhance domain-specific language understanding and contextual comprehension, resulting in improved performance on finance-related tasks and more accurate, tailored outputs.

4.2.1 Common Techniques for LLM Fine-tuning.

Modern techniques for fine-tuning LLMs typically fall into two main categories: standard fine-tuning and instruction fine-tuning.

In standard fine-tuning, the model is trained on the raw datasets without modification. The key context, question, and desired answer are directly fed into the LLM, with the answer masked during training so that the model learns to generate it. Despite its simplicity, this approach is widely effective.

Instruction fine-tuning [34] involves creating task-specific datasets that provide examples and guidance to steer the model's learning process. By formulating explicit instructions and demonstrations in the training data, the model can be optimized to excel at certain tasks or produce more contextually relevant and desired outputs. The instructions act as a form of supervision to shape the model's behavior.

Both methods have their merits: standard fine-tuning is straightforward to implement, while instruction fine-tuning allows for more precise guidance of the model. The ideal approach depends on the amount of training data available and the complexity of the desired behaviors. However, both leverage the knowledge already embedded in LLMs and fine-tune them for enhanced performance on downstream tasks.
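The difference between the two data formats can be made concrete. The field names and templates below are illustrative conventions, not any specific library's API:

```python
def standard_sample(context, question, answer):
    """Standard fine-tuning: raw text in, answer tokens as the training target."""
    return {"input": f"{context}\n{question}", "target": answer}

def instruction_sample(instruction, context, question, answer):
    """Instruction fine-tuning: an explicit task instruction prepended to each example."""
    return {
        "input": f"Instruction: {instruction}\nContext: {context}\nQuestion: {question}",
        "target": answer,
    }

sample = instruction_sample(
    "Answer using only the given filing excerpt.",
    "Q3 revenue was $2.1B, up 12% year over year.",
    "What was Q3 revenue?",
    "$2.1B",
)
```

During training, the loss is computed only on the `target` tokens in either format; the instruction text in the second variant is what steers the model's behavior at inference time.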

In addition to the above methods, techniques such as Low-Rank Adaptation (LoRA) [22] and quantization [18] can enable fine-tuning with significantly lower computational requirements.

LoRA allows for fine-tuning the low-rank decomposed factors of the original weight matrices instead of the full matrices. This approach drastically reduces the number of trainable parameters, enabling training on less powerful hardware and shortening the total training time.
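A minimal NumPy sketch of the LoRA idea, following the usual formulation h = Wx + (α/r)·BAx; the dimensions and hyperparameters here are illustrative, not taken from any particular paper:

```python
import numpy as np

d, k, r = 768, 768, 8              # original weight shape and LoRA rank (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))        # frozen pretrained weight -- never updated
A = rng.normal(size=(r, k)) * 0.01 # small random init
B = np.zeros((d, r))               # zero init, so the adapter starts as a no-op

def lora_forward(x, alpha=16):
    """h = Wx + (alpha/r) * B A x -- only A and B are trained."""
    return W @ x + (alpha / r) * (B @ (A @ x))

full_params = d * k                # parameters a full fine-tune would update
lora_params = r * (d + k)          # parameters LoRA updates instead
print(f"trainable: {lora_params} vs {full_params} ({100 * lora_params / full_params:.1f}%)")
```

With rank 8 on a 768×768 matrix, LoRA trains roughly 2% of the original parameters, which is where the hardware and training-time savings come from; after training, BA can be merged back into W so inference cost is unchanged.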

Another impactful approach is to use reduced numerical precisions such as bfloat16 [24] or float16 instead of float32. By halving the bit-width, each parameter only occupies 2 bytes instead of 4 bytes, reducing memory usage by 50%. This also accelerates computation by up to 2x since smaller data types speed up training. Moreover, the reduced memory footprint enables larger batch sizes, further boosting throughput.
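The memory arithmetic is straightforward; for example, for a hypothetical 7B-parameter model:

```python
def model_memory_gb(n_params, bytes_per_param):
    """Approximate memory for storing the weights alone (no optimizer state or activations)."""
    return n_params * bytes_per_param / 1e9

for name, nbytes in [("float32", 4), ("bfloat16", 2), ("float16", 2)]:
    print(f"7B model in {name}: {model_memory_gb(7e9, nbytes):.0f} GB")
```

Note this counts only the weights: full fine-tuning with an optimizer like Adam adds gradient and optimizer-state copies on top, which is why the 50% weight saving compounds in practice.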

4.2.2 Fine-tuned finance LLM evaluation.

The performance of fine-tuned finance LLMs can be evaluated in two categories: finance classification tasks and finance generative tasks. In finance classification, we consider tasks such as Sentiment Analysis and News Headline Classification. In finance generative tasks, our focus is on Question Answering, News Summarization, and Named Entity Recognition. Table 1 provides detailed information about all the fine-tuned finance LLMs. Among the various fine-tuned LLMs, we will focus on discussing three of them: (1) PIXIU (also known as FinMA) [56], which fine-tunes LLaMA on 136K task-specific instruction samples. (2) FinGPT [58], which presents an end-to-end framework for training and applying FinLLMs in the finance industry. FinGPT utilizes the lightweight Low-Rank Adaptation (LoRA) technique to fine-tune open-source LLMs (such as LLaMA and ChatGLM) using approximately 50k samples. However, FinGPT's evaluation is limited to finance classification tasks. (3) Instruct-FinGPT [63], on the other hand, fine-tunes LLaMA on 10k instruction samples derived from two financial sentiment analysis datasets, and likewise evaluates performance only on finance classification tasks. Based on the reported model performance, we summarize our findings as follows:

• Compared to the original base LLM (LLaMA) and other open-source LLMs (BLOOM, OPT [64], ChatGLM [14, 62]), all fine-tuned finance LLMs exhibit significantly better performance across all finance-domain tasks reported in the papers, especially classification tasks.

• The fine-tuned finance LLMs outperform BloombergGPT [55] in most finance tasks reported in the papers.

• When compared to powerful general LLMs like ChatGPT and GPT-4, the fine-tuned finance LLMs demonstrate superior performance in most finance classification tasks, which indicates their enhanced domain-specific language understanding and contextual comprehension abilities. However, in finance generative tasks, the fine-tuned LLMs show similar or worse performance, suggesting the need for more high-quality domain-specific datasets to improve their generative capabilities.

4.3 Pretrain from Scratch

The objective of training LLMs from scratch is to develop models that adapt even better to the finance domain. Table 2 presents the current finance LLMs that have been trained from scratch: BloombergGPT, Xuan Yuan 2.0 [66], and Fin-T5 [28].

As shown in Table 2, there is a trend of combining public datasets with finance-specific datasets during the pretraining phase. Notably, BloombergGPT serves as an example where the corpus comprises an equal mix of general and finance-related text. It is worth mentioning that BloombergGPT primarily relies on a subset of 5 billion tokens that pertain exclusively to Bloomberg, representing only 0.7% of the total training corpus. This targeted corpus contributes to the performance improvements achieved in finance benchmarks.

Both BloombergGPT and Fin-T5 have demonstrated superior performance on finance classification tasks compared to their original models, BLOOM-176B and T5, respectively. These tasks encompass market sentiment classification, multi-categorical and multi-label classification, and more. BloombergGPT achieves an impressive average score of 62.51, surpassing the open-source BLOOM-176B model, which only attains a score of 54.35. Similarly, Fin-T5 demonstrates its excellence with an average score of 81.78, outperforming the T5 model's score of 79.56. Notably, BloombergGPT was also evaluated using an internal benchmark specifically designed by Bloomberg, where it achieved an average score of 62.47, far surpassing BLOOM-176B's score of 33.39. This outcome highlights that even when the internal private training corpus constitutes less than 1% of the total training corpus, it can still lead to substantial improvements on evaluation tasks from the same domain and distribution.

On finance-related generative tasks such as Question Answering, Named Entity Recognition, and summarization, both models exhibited significantly better results than their respective general models. Specifically, BloombergGPT achieved an impressive score of 64.83, surpassing BLOOM-176B's score of 45.43. Similarly, Fin-T5 outperformed T5 with a score of 68.69 versus 66.06. These findings further highlight the models' superior performance in generating finance-related content compared to their general-purpose counterparts.

Although these models are not as powerful as closed-source models like GPT-3 or PaLM [11], they demonstrate similar or superior performance compared to similar-sized public models. In evaluations on various general generative tasks, such as BIG-bench Hard, knowledge assessments, reading comprehension, and linguistic tasks, BloombergGPT performed on par with or better than public models of comparable size, albeit slightly below larger models like GPT-3 or PaLM. This indicates that the model's enhanced capabilities in finance-related tasks do not come at the expense of its general abilities.

5 Decision Process in Applying LLM to Financial Applications

5.1 Determining the Need for an LLM

Before exploring LLM solutions, it is essential to ascertain whether employing such a model is truly necessary for the given task. The advantages of LLMs over smaller models can be summarized as follows, as outlined in the work by Yang et al. [59]:

Leveraging Pretraining Knowledge: LLMs can utilize the knowledge acquired from pretraining data to provide solutions. If a task lacks sufficient training data or annotated data but requires common-sense knowledge, an LLM may be a suitable choice.

Reasoning and Emergent Abilities: LLMs excel at tasks that involve reasoning or emergent abilities [49]. This property makes LLMs well-suited for tasks where task instructions or expected answers are not clearly defined, or when dealing with out-of-distribution data. In the context of financial advisory, client requests in customer service often exhibit high variance and complex conversations. LLMs can serve as virtual agents to provide assistance in such cases.

Orchestrating Model Collaboration: LLMs can act as orchestrators between different models and tools. For tasks that require collaboration among various models, LLMs can serve as orchestrators to integrate and utilize these tools together [1, 7, 31]. This capability is particularly valuable when aiming for a robust automation of a model solution pipeline.

While LLMs offer immense power, their use comes with a significant cost, whether utilizing a third-party API [33] or fine-tuning an open-source LLM. Therefore, it is prudent to consider conventional models before fully committing to LLMs. In cases where the task has a clear definition (e.g., regression, classification, ranking), there is an ample amount of annotated training data, or the task relies minimally on common-sense knowledge or emergent abilities like reasoning, relying on LLMs may not be necessary or justified initially.

5.2 A General Decision Guidance for Applying LLMs on Finance Tasks

Once the decision has been made to utilize LLMs for a finance task, a decision guidance framework can be followed to ensure efficient and effective implementation. The framework, illustrated in Figure 1, categorizes the usage of LLMs into four levels based on computational resources and data requirements. By progressing through the levels, costs associated with training and data collection increase. It is recommended to start at Level 1 and move to higher levels (2, 3, and 4) only if the model’s performance is not satisfactory. The following section provides detailed explanations of the decision and action blocks at each level. Table ?? presents an approximate cost range for different options, based on pricing from various third-party services like AWS and OpenAI.
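The escalation path described above can be roughly encoded as a helper function. This is a simplification of the framework in Figure 1, not a reproduction of its exact decision blocks, and the Level 3/4 wording is our shorthand for the fine-tuning and pretraining options of Sections 4.2 and 4.3:

```python
def recommend_llm_level(zero_shot_ok, has_few_examples, few_shot_ok,
                        has_labeled_data, confidential_data):
    """Start cheap (Level 1) and escalate only when performance is unsatisfactory."""
    # Confidential inputs rule out third-party APIs at every prompting level.
    hosting = "self-hosted open-source LLM" if confidential_data else "third-party LLM API"
    if zero_shot_ok:
        return f"Level 1: zero-shot via {hosting}"
    if has_few_examples and few_shot_ok:
        return f"Level 2: few-shot prompting via {hosting}"
    if has_labeled_data:
        return "Level 3: fine-tune an open-source LLM on domain data"
    return "Level 4: pretrain a custom finance LLM from scratch"

print(recommend_llm_level(zero_shot_ok=False, has_few_examples=True, few_shot_ok=True,
                          has_labeled_data=True, confidential_data=True))
```

The point of the encoding is the ordering: each branch is only reached after the cheaper option above it has been tried and found unsatisfactory.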

5.2.1 Level 1: Zero-shot Applications.

The first decision block determines whether to use an existing LLM service or an open-source model. If the input question or context involves confidential data, it is necessary to proceed with action block 1A, which involves self-hosting an open-source LLM. As of July 2023, several options are available, including LLaMA [45], OpenLLaMA [17], Alpaca [44], and Vicuna [9]. LLaMA offers models ranging in size from 7B to 65B parameters, but they are limited to research purposes. OpenLLaMA provides 3B, 7B, and 13B models and supports commercial usage. Alpaca and Vicuna are fine-tuned from LLaMA, offering 7B and 13B options. Deploying your own LLM requires a robust local machine with a suitable GPU, such as an NVIDIA V100 for a 7B model or an NVIDIA A100 or A6000 for a 13B model.

If data privacy is not a concern, third-party LLMs such as OpenAI's GPT-3.5/GPT-4 or Google's Bard are recommended. This option allows for lightweight experimentation and early performance evaluation without significant deployment costs. The only cost incurred is the fee associated with each API call, typically based on the token counts of the input and of the model's output.
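As a rough illustration of this per-call pricing model, the helper below computes a fee from token counts. The per-1K-token rates in the example are hypothetical placeholders; substitute the provider's currently published prices:

```python
def estimate_api_cost(n_input_tokens, n_output_tokens,
                      price_in_per_1k, price_out_per_1k):
    """Per-call cost when pricing is quoted per 1,000 tokens.

    Rates vary by provider and model; pass the current numbers from the
    vendor's pricing page (the figures used below are illustrative only).
    """
    return (n_input_tokens / 1000) * price_in_per_1k + \
           (n_output_tokens / 1000) * price_out_per_1k

# Illustrative: 1,500 prompt tokens and 500 completion tokens at
# hypothetical rates of $0.0015 / $0.002 per 1K tokens.
cost = estimate_api_cost(1500, 500, 0.0015, 0.002)
```

Note that few-shot prompting (Level 2) raises the input token count and hence the per-call fee, which is why the examples provided per request matter for budgeting.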

5.2.2 Level 2: Few-shot Applications.

If the model's performance at Level 1 is not acceptable for the application, few-shot learning can be explored, provided several example questions and their corresponding answers are available. Few-shot learning has shown advantages in various previous works [5, 48]. The core idea is to provide a set of example questions along with their corresponding answers as context, in addition to the specific question being asked. The cost is similar to that of Level 1, except that the examples must be supplied with each request. Generally, achieving good performance may require 1 to 10 examples. These examples can be the same across questions or selected based on the specific question at hand. The challenge lies in determining the optimal number of examples and selecting relevant ones; this process involves experimentation and testing until the desired performance is reached.
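A minimal sketch of assembling such a few-shot prompt follows. The `Q:`/`A:` template and the financial-sentiment examples are illustrative conventions, not a prescribed format:

```python
def build_few_shot_prompt(examples, question):
    """Concatenate (question, answer) demonstrations ahead of the real
    query, leaving a trailing 'A:' for the model to complete."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

# Hypothetical sentiment-classification demonstrations.
examples = [
    ("Is 'revenue grew 20% YoY' positive or negative?", "positive"),
    ("Is 'the firm missed earnings estimates' positive or negative?", "negative"),
]
prompt = build_few_shot_prompt(
    examples, "Is 'margins contracted sharply' positive or negative?"
)
```

The `examples` list can be held fixed across queries or re-selected per query (e.g., by retrieving the most similar past questions), which is exactly the selection trade-off described above.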

5.2.3 Level 3: Tool-Augmented Generation and Finetuning.

If the task at hand is extremely complicated and in-context learning does not yield reasonable performance, the next option is to leverage external tools or plugins with the LLM, assuming a collection of relevant tools/plugins is available. For example, a simple calculator could assist with arithmetic-related tasks, while a search engine could be indispensable for knowledge-intensive tasks such as querying the CEO of a specific company or identifying the company with the highest market capitalization.

Integrating tools with LLMs can be achieved by providing the tools' descriptions as part of the prompt. The cost of this approach is generally higher than that of few-shot learning, owing to the development of the tool(s) and the longer input sequence required as context. However, the concatenated tool descriptions may be too long, surpassing the input-length limit of the LLM. In such cases, an additional step, such as simple tool retrieval or filtering, might be needed to narrow down the tools for selection. The deployment cost typically includes the cost of using the LLM as well as the cost of using the tool(s).
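The tool-filtering step can be as simple as ranking tool descriptions by keyword overlap with the query and keeping those that fit the context budget. The tool entries and character budget below are hypothetical, and word overlap is a deliberately naive relevance proxy (an embedding-based retriever would be the heavier alternative):

```python
def filter_tools(tools, query, max_chars):
    """Keep the most query-relevant tool descriptions (by naive word
    overlap) that fit within a character budget for the prompt."""
    query_words = set(query.lower().split())
    ranked = sorted(
        tools,
        key=lambda t: -len(query_words & set(t["description"].lower().split())),
    )
    selected, used = [], 0
    for tool in ranked:
        length = len(tool["description"])
        if used + length <= max_chars:
            selected.append(tool)
            used += length
    return selected

# Hypothetical tool registry.
tools = [
    {"name": "calculator", "description": "evaluate arithmetic expressions"},
    {"name": "search",
     "description": "look up the market capitalization or ceo of a company"},
]
kept = filter_tools(
    tools, "which company has the highest market capitalization", max_chars=60
)
```

Here the budget admits only the search tool, whose description overlaps the query, mirroring the examples above of routing arithmetic to a calculator and knowledge-intensive lookups to a search engine.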

If the above options fail to produce satisfactory performance, finetuning the LLMs can be attempted. This stage requires a reasonable amount of annotated data, computational resources (GPU, CPU, etc.), and expertise in tuning language models, as listed in Table ??.

5.2.4 Level 4: Train Your Own LLMs from Scratch.

If the results are still unsatisfactory, the remaining option is to train a domain-specific LLM from scratch, as was done for BloombergGPT. However, this option comes with significant computational costs and data requirements: it typically requires millions of dollars in computational resources and training on a dataset containing trillions of tokens. The intricacies of the training process are beyond the scope of this survey, but it is worth noting that it can take a professional team several months or even years to accomplish.

By following this decision guidance framework, financial professionals and researchers can navigate through the various levels and options, making informed choices that align with their specific needs and resource constraints.

5.3 Evaluation

The evaluation of LLMs in finance can be conducted through various approaches. One direct method is to assess the model's performance on downstream tasks. Following the taxonomy provided by [57], evaluation metrics can be categorized into two main groups: accuracy and performance. The accuracy category can be further divided into metrics for regression (such as MAPE, RMSE, and R²) and metrics for classification (recall, precision, F1 score). The performance category includes metrics or measurements that directly assess the model's performance on the specific task, such as total profit or the Sharpe ratio in a trading-related task. These evaluations can be conducted using historical data, backtest simulations, or online experiments. While performance metrics are often more important in finance, it is crucial to ensure that accuracy metrics align with performance, to support meaningful decision-making and guard against overfitting.
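The regression and trading metrics named above can be made concrete with a few small functions. The Sharpe ratio shown is per-period and not annualized, using the sample standard deviation; annualization conventions vary by use case:

```python
import math

def mape(y_true, y_pred):
    """Mean absolute percentage error (assumes no zero targets)."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def sharpe_ratio(returns, risk_free=0.0):
    """Sharpe ratio over per-period returns: mean excess return divided
    by the sample standard deviation of excess returns."""
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    variance = sum((r - mean) ** 2 for r in excess) / (len(excess) - 1)
    return mean / math.sqrt(variance)
```

Computing an accuracy metric (e.g., RMSE on return forecasts) alongside a performance metric (e.g., Sharpe ratio from a backtest using those forecasts) is one way to check the alignment between the two groups discussed above.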

In addition to task-specific evaluations, general metrics used for LLMs can also be applied. Particularly, when evaluating the overall quality of an existing LLM or a fine-tuned one, comprehensive evaluation systems like the one presented in [27] can be utilized. This evaluation system covers tasks for various scenarios and incorporates metrics from different aspects, including accuracy, fairness, robustness, bias, and more. It can serve as a guide for selecting a language model or evaluating one’s own model in the context of finance applications.

5.4 Limitations

While significant progress has been made in applying LLMs to revolutionize financial applications, it is important to acknowledge the limitations of these language models. Two major challenges are the production of misinformation and the manifestation of biases, such as racial, gender, and religious biases, in LLMs [43]. In the financial industry, accurate information is crucial for making sound financial decisions, and fairness is a fundamental requirement for all financial services. To ensure information accuracy and mitigate hallucination, additional measures such as retrieval-augmented generation [25] can be implemented. To address biases, content moderation and output-restriction techniques (such as only generating answers from a predefined list) can be employed to control the generated content and reduce bias.
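One such output-restriction technique, mapping free-form model text onto a predefined answer list, can be sketched as follows. The label set, substring matching, and fallback behavior are illustrative choices for the sketch:

```python
def restrict_answer(model_output, allowed, fallback="neutral"):
    """Return the first allowed label found in the model's free-form
    output; fall back to a safe default when nothing matches."""
    text = model_output.lower()
    for label in allowed:
        if label.lower() in text:
            return label
    return fallback

labels = ["positive", "negative", "neutral"]
answer = restrict_answer("The tone of the filing is clearly Positive.", labels)
```

Constraining outputs this way trades expressiveness for control: the model can never emit an off-list answer, which simplifies downstream auditing of generated content.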

LLMs pose potential challenges in terms of regulation and governance. Although LLMs offer more interpretability than conventional deep learning models by providing reasoning steps or thought processes for their answers when prompted appropriately [50, 60], an LLM remains a black box, and the explainability of the content it generates is highly limited.

Addressing these limitations and ensuring the ethical and responsible use of LLMs in finance applications is essential. Continuous research, development of robust evaluation frameworks, and the implementation of appropriate safeguards are vital steps in harnessing the full potential of LLMs while mitigating potential risks.

6 Conclusion

In conclusion, this paper has conducted a timely and practical survey on the emerging application of LLMs for financial AI. We structured the survey around two critical pillars: solutions and adoption guidance.

Under solutions, we reviewed diverse approaches to harnessing LLMs for finance, including leveraging pretrained models, fine-tuning on domain data, and training custom LLMs. Experimental results demonstrate significant performance gains over general-purpose LLMs on natural language tasks such as sentiment analysis, question answering, and summarization.

To provide adoption guidance, we proposed a structured framework for selecting the optimal LLM strategy based on constraints around data availability, compute resources, and performance needs. The framework aims to balance value and investment by guiding practitioners from low-cost experimentation to rigorous customization.

In summary, this survey synthesized the latest progress in applying LLMs to transform financial AI and provided a practical roadmap for adoption. We hope it serves as a useful reference for researchers and professionals exploring the intersection of LLMs and finance. As datasets and computation improve, finance-specific LLMs represent an exciting path to democratize cutting-edge NLP across the industry.