What We Learned from a Year of Building with LLMs (Part III): Strategy




We previously shared our insights on the tactics we have honed while operating LLM applications. Tactics are granular: they are the specific actions employed to achieve specific objectives. We also shared our perspective on operations: the higher-level processes in place to support tactical work to achieve objectives.

But where do those objectives come from? That is the domain of strategy. Strategy answers the “what” and “why” questions behind the “how” of tactics and operations.

We provide our opinionated takes, such as “no GPUs before PMF” and “focus on the system not the model,” to help teams figure out where to allocate scarce resources. We also suggest a roadmap for iterating toward a great product. This final set of lessons answers the following questions:

  1. Building vs. Buying: When should you train your own models, and when should you leverage existing APIs? The answer is, as always, “it depends.” We share what it depends on.
  2. Iterating to Something Great: How can you create a lasting competitive edge that goes beyond just using the latest models? We discuss the importance of building a robust system around the model and focusing on delivering memorable, sticky experiences.
  3. Human-Centered AI: How can you effectively integrate LLMs into human workflows to maximize productivity and happiness? We emphasize the importance of building AI tools that support and enhance human capabilities rather than attempting to replace them entirely.
  4. Getting Started: What are the essential steps for teams embarking on building an LLM product? We outline a basic playbook that starts with prompt engineering, evaluations, and data collection.
  5. The Future of Low-Cost Cognition: How will the rapidly decreasing costs and increasing capabilities of LLMs shape the future of AI applications? We examine historical trends and walk through a simple method to estimate when certain applications might become economically feasible.
  6. From Demos to Products: What does it take to go from a compelling demo to a reliable, scalable product? We emphasize the need for rigorous engineering, testing, and refinement to bridge the gap between prototype and production.

To answer these difficult questions, let’s think step by step…

Strategy: Building with LLMs without Getting Out-Maneuvered

Successful products require thoughtful planning and tough prioritization, not endless prototyping or following the latest model releases or trends. In this final section, we look around the corners and think about the strategic considerations for building great AI products. We also examine key trade-offs teams will face, like when to build and when to buy, and suggest a “playbook” for early LLM application development strategy.

No GPUs before PMF

To be great, your product needs to be more than just a thin wrapper around somebody else’s API. But mistakes in the opposite direction can be even more costly. The past year has also seen a mint of venture capital, including an eye-watering six-billion-dollar Series A, spent on training and customizing models without a clear product vision or target market. In this section, we’ll explain why jumping immediately to training your own models is a mistake and consider the role of self-hosting.

Training from scratch (almost) never makes sense

For most organizations, pretraining an LLM from scratch is an impractical distraction from building products.

As exciting as it is and as much as it seems like everyone else is doing it, developing and maintaining machine learning infrastructure takes a lot of resources. This includes gathering data, training and evaluating models, and deploying them. If you’re still validating product-market fit, these efforts will divert resources from developing your core product. Even if you had the compute, data, and technical chops, the pretrained LLM may become obsolete in months.

Consider the case of BloombergGPT, an LLM specifically trained for financial tasks. The model was pretrained on 363B tokens and required a heroic effort by nine full-time employees, four from AI Engineering and five from ML Product and Research. Despite this effort, it was outclassed by gpt-3.5-turbo and gpt-4 on those financial tasks within a year.

This story and others like it suggest that for most practical applications, pretraining an LLM from scratch, even on domain-specific data, is not the best use of resources. Instead, teams are better off fine-tuning the strongest open source models available for their specific needs.

There are of course exceptions. One shining example is Replit’s code model, trained specifically for code generation and understanding. With pretraining, Replit was able to outperform larger models such as CodeLlama7b. But as other, increasingly capable models have been released, maintaining utility has required continued investment.

Don’t fine-tune until you’ve proven it’s necessary

For most organizations, fine-tuning is driven more by FOMO than by clear strategic thinking.

Organizations invest in fine-tuning too early, trying to beat the “just another wrapper” allegations. In reality, fine-tuning is heavy machinery, to be deployed only after you’ve collected plenty of examples that convince you other approaches won’t suffice.

A year ago, many teams were telling us they were excited to fine-tune. Few have found product-market fit and most regret their decision. If you’re going to fine-tune, you’d better be really confident that you’re set up to do it again and again as base models improve; see “The model isn’t the product” and “Build LLMOps” below.

When might fine-tuning actually be the right call? If the use case requires data not available in the mostly open web-scale datasets used to train existing models—and if you’ve already built an MVP that demonstrates the existing models are insufficient. But be careful: if great training data isn’t readily available to the model builders, where are you getting it?

Ultimately, remember that LLM-powered applications aren’t a science fair project; investment in them should be commensurate with their contribution to your business’s strategic objectives and its competitive differentiation.

Start with inference APIs, but don’t be afraid of self-hosting

With LLM APIs, it’s easier than ever for startups to adopt and integrate language modeling capabilities without training their own models from scratch. Providers like Anthropic and OpenAI offer general APIs that can sprinkle intelligence into your product with just a few lines of code. By using these services, you can reduce the effort spent and instead focus on creating value for your customers—this allows you to validate ideas and iterate toward product-market fit faster.

But, as with databases, managed services aren’t the right fit for every use case, especially as scale and requirements increase. Indeed, self-hosting may be the only way to use models without sending confidential/private data out of your network, as required in regulated industries like healthcare and finance or by contractual obligations or confidentiality requirements.

Furthermore, self-hosting circumvents limitations imposed by inference providers, like rate limits, model deprecations, and usage restrictions. In addition, self-hosting gives you complete control over the model, making it easier to construct a differentiated, high-quality system around it. Finally, self-hosting, especially of fine-tunes, can reduce cost at large scale. For example, BuzzFeed shared how they fine-tuned open source LLMs to reduce costs by 80%.

Iterate to something great

To sustain a competitive edge in the long run, you need to think beyond models and consider what will set your product apart. While speed of execution matters, it shouldn’t be your only advantage.

The model isn’t the product; the system around it is

For teams that aren’t building models, the rapid pace of innovation is a boon as they migrate from one SOTA model to the next, chasing gains in context size, reasoning capability, and price-to-value to build better and better products.

This progress is as exciting as it is predictable. Taken together, this means models are likely to be the least durable component in the system.

Instead, focus your efforts on what’s going to provide lasting value, such as:

  • Evaluation chassis: To reliably measure performance on your task across models
  • Guardrails: To prevent undesired outputs no matter the model
  • Caching: To reduce latency and cost by avoiding the model altogether
  • Data flywheel: To power the iterative improvement of everything above

These components create a thicker moat of product quality than raw model capabilities.
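To make the caching idea concrete, here is a minimal sketch (class and function names are hypothetical): an exact-match cache keyed on a hash of the prompt, so repeated requests skip the model entirely.

```python
import hashlib

class PromptCache:
    """Exact-match cache: identical prompts skip the model entirely."""
    def __init__(self):
        self._store = {}

    def _key(self, prompt: str) -> str:
        # Hash the prompt so the key stays small regardless of prompt length.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str):
        return self._store.get(self._key(prompt))

    def put(self, prompt: str, response: str):
        self._store[self._key(prompt)] = response

def answer(prompt: str, cache: PromptCache, call_model) -> str:
    cached = cache.get(prompt)
    if cached is not None:
        return cached  # cache hit: no latency, no token cost
    response = call_model(prompt)
    cache.put(prompt, response)
    return response
```

In practice, teams often layer a semantic (embedding-similarity) cache on top of exact matching, but even this simple version pays for itself on high-repetition workloads.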

But that doesn’t mean building at the application layer is risk free. Don’t point your shears at the same yaks that OpenAI or other model providers will need to shave if they want to provide viable enterprise software.

For example, some teams invested in building custom tooling to validate structured output from proprietary models; minimal investment here is important, but a deep one is not a good use of time. OpenAI needs to ensure that when you ask for a function call, you get a valid function call—because all of their customers want this. Employ some “strategic procrastination” here, build what you absolutely need, and await the obvious expansions to capabilities from providers.

Build trust by starting small

Building a product that tries to be everything to everyone is a recipe for mediocrity. To create compelling products, companies need to specialize in building memorable, sticky experiences that keep users coming back.

Consider a generic RAG system that aims to answer any question a user might ask. The lack of specialization means that the system can’t prioritize recent information, parse domain-specific formats, or understand the nuances of specific tasks. As a result, users are left with a shallow, unreliable experience that doesn’t meet their needs.

To address this, focus on specific domains and use cases. Narrow the scope by going deep rather than wide. This will create domain-specific tools that resonate with users. Specialization also allows you to be upfront about your system’s capabilities and limitations. Being transparent about what your system can and cannot do demonstrates self-awareness, helps users understand where it can add the most value, and thus builds trust and confidence in the output.

Build LLMOps, but build it for the right reason: faster iteration

DevOps is not fundamentally about reproducible workflows or shifting left or empowering two-pizza teams—and it’s definitely not about writing YAML files.

DevOps is about shortening the feedback cycles between work and its outcomes so that improvements accumulate instead of errors. Its roots go back, via the Lean Startup movement, to Lean manufacturing and the Toyota Production System, with its emphasis on Single Minute Exchange of Die and Kaizen.

MLOps has adapted the form of DevOps to ML. We have reproducible experiments and we have all-in-one suites that empower model builders to ship. And Lordy, do we have YAML files.

But as an industry, MLOps didn’t adapt the function of DevOps. It didn’t shorten the feedback gap between models and their inferences and interactions in production.

Hearteningly, the field of LLMOps has shifted away from thinking about hobgoblins of little minds like prompt management and toward the hard problems that block iteration: production monitoring and continual improvement, linked by evaluation.

Already, we have interactive arenas for neutral, crowd-sourced evaluation of chat and coding models—an outer loop of collective, iterative improvement. Tools like LangSmith, Log10, LangFuse, W&B Weave, HoneyHive, and more promise to not only collect and collate data about system outcomes in production but also to leverage them to improve those systems by integrating deeply with development. Embrace these tools or build your own.

Don’t build LLM features you can buy

Most successful businesses are not LLM businesses. Simultaneously, most businesses have opportunities to be improved by LLMs.

This pair of observations often misleads leaders into hastily retrofitting systems with LLMs at increased cost and decreased quality and releasing them as ersatz, vanity “AI” features, complete with the now-dreaded sparkle icon. There’s a better way: focus on LLM applications that truly align with your product goals and enhance your core operations.

Consider a few misguided ventures that waste your team’s time:

  • Building custom text-to-SQL capabilities for your business
  • Building a chatbot to talk to your documentation
  • Integrating your company’s knowledge base with your customer support chatbot

While the above are the hello world of LLM applications, none of them make sense for virtually any product company to build themselves. These are general problems for many businesses with a large gap between promising demo and dependable component—the customary domain of software companies. Investing valuable R&D resources on general problems being tackled en masse by the current Y Combinator batch is a waste.

If this sounds like trite business advice, it’s because in the frothy excitement of the current hype wave, it’s easy to mistake anything “LLM” as cutting-edge accretive differentiation, missing which applications are already old hat.

AI in the loop; humans at the center

Right now, LLM-powered applications are brittle. They require an incredible amount of safeguarding and defensive engineering and remain hard to predict. Additionally, when tightly scoped, these applications can be wildly useful. This means that LLMs make excellent tools to accelerate user workflows.

While it may be tempting to imagine LLM-based applications fully replacing a workflow or standing in for a job function, today the most effective paradigm is a human-computer centaur (c.f. Centaur chess). When capable humans are paired with LLM capabilities tuned for their rapid utilization, productivity and happiness doing tasks can be massively increased. One of the flagship applications of LLMs, GitHub Copilot, demonstrated the power of these workflows:

“Overall, developers told us they felt more confident because coding is easier, more error-free, more readable, more reusable, more concise, more maintainable, and more resilient with GitHub Copilot and GitHub Copilot Chat than when they’re coding without it.” —Mario Rodriguez, GitHub

For those who have worked in ML for a long time, you may jump to the idea of “human-in-the-loop,” but not so fast: HITL machine learning is a paradigm built on human experts ensuring that ML models behave as predicted. While related, here we are proposing something more subtle. LLM-driven systems should not be the primary drivers of most workflows today; they should merely be a resource.

Centering humans and asking how an LLM can support their workflow leads to significantly different product and design decisions. Ultimately, it will drive you to build different products than competitors who try to rapidly offshore all responsibility to LLMs: better, more useful, and less risky products.

Start with prompting, evals, and data collection

The previous sections have delivered a fire hose of techniques and advice. It’s a lot to take in. Let’s consider the minimum useful set of advice: if a team wants to build an LLM product, where should they begin?

Over the last year, we’ve seen enough examples to start becoming confident that successful LLM applications follow a consistent trajectory. We walk through this basic “getting started” playbook in this section. The core idea is to start simple and only add complexity as needed. A decent rule of thumb is that each level of sophistication typically requires at least an order of magnitude more effort than the one before it. With this in mind…

Prompt engineering comes first

Start with prompt engineering. Use all the techniques we discussed in the tactics section before. Chain-of-thought, n-shot examples, and structured input and output are almost always a good idea. Prototype with the most highly capable models before trying to squeeze performance out of weaker models.
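As a hedged sketch of how those three techniques combine in a single template, consider a hypothetical review-classification task (the task, examples, and wording here are all illustrative, not from the original):

```python
# Sketch: n-shot examples + chain-of-thought instruction + structured output,
# assembled into one prompt string. All content is hypothetical.
FEW_SHOT_EXAMPLES = [
    {"review": "Arrived broken, box was crushed.", "label": "negative"},
    {"review": "Exactly as described, fast shipping.", "label": "positive"},
]

def build_prompt(review: str) -> str:
    lines = [
        "Classify the sentiment of a product review.",
        # Chain-of-thought plus a structured-output contract:
        'Think step by step, then answer with JSON: {"label": "positive" | "negative"}.',
        "",
    ]
    for ex in FEW_SHOT_EXAMPLES:  # n-shot examples
        lines.append(f"Review: {ex['review']}")
        lines.append(f'Answer: {{"label": "{ex["label"]}"}}')
        lines.append("")
    lines.append(f"Review: {review}")
    lines.append("Answer:")
    return "\n".join(lines)
```

The same skeleton transfers to most tasks: swap the instruction, the examples, and the output schema, and keep the structure fixed so you can evaluate changes one variable at a time.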

Only if prompt engineering cannot achieve the desired level of performance should you consider fine-tuning. This will come up more often if there are nonfunctional requirements (e.g., data privacy, complete control, and cost) that block the use of proprietary models and thus require you to self-host. Just make sure those same privacy requirements don’t block you from using user data for fine-tuning!

Build evals and kickstart a data flywheel

Even teams that are just getting started need evals. Otherwise, you won’t know whether your prompt engineering is sufficient or when your fine-tuned model is ready to replace the base model.

Effective evals are specific to your tasks and mirror the intended use cases. The first level of evals that we recommend is unit testing. These simple assertions detect known or hypothesized failure modes and help drive early design decisions. Also see other task-specific evals for classification, summarization, etc.
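A minimal sketch of such assertion-style evals, here for a hypothetical summarization task (`summarize` stands in for whatever prompt-plus-model call is under test; the checks are illustrative):

```python
# Assertion-style unit evals: each check encodes a known or hypothesized
# failure mode. Returns the list of failures, empty if all checks pass.
def eval_summary(summarize, document: str) -> list[str]:
    failures = []
    summary = summarize(document)
    if not summary.strip():
        failures.append("empty output")
    if len(summary) >= len(document):
        failures.append("summary not shorter than input")
    if "as an ai language model" in summary.lower():
        failures.append("refusal/boilerplate leaked into output")
    return failures
```

Run a battery of these over a fixed set of documents on every prompt or model change; a regression shows up as a nonempty failure list rather than a vague sense that "outputs got worse."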

While unit tests and model-based evaluations are useful, they don’t replace the need for human evaluation. Have people use your model/product and provide feedback. This serves the dual purpose of measuring real-world performance and defect rates while also collecting high-quality annotated data that can be used to fine-tune future models. This creates a positive feedback loop, or data flywheel, which compounds over time:

  • Use human evaluation to assess model performance and/or find defects
  • Use the annotated data to fine-tune the model or update the prompt
  • Repeat
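One turn of that loop can be sketched as follows (all names are hypothetical; `collect_human_labels` stands in for whatever human-review step your team uses):

```python
def flywheel_turn(inputs, examples, generate, collect_human_labels):
    """One turn of the data flywheel: generate outputs with the current
    few-shot examples, collect human labels on them, and fold the good
    pairs back in so the next turn starts from a richer prompt."""
    outputs = [(x, generate(x, examples)) for x in inputs]
    # Human evaluation step; returns [((input, output), is_good), ...]
    labeled = collect_human_labels(outputs)
    good_pairs = [pair for pair, is_good in labeled if is_good]
    return examples + good_pairs  # repeat with the improved examples
```

The same labeled pairs can instead feed a fine-tuning dataset; the loop structure is identical either way.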

For example, when auditing LLM-generated summaries for defects we might label each sentence with fine-grained feedback identifying factual inconsistency, irrelevance, or poor style. We can then use these factual inconsistency annotations to train a hallucination classifier or use the relevance annotations to train a reward model to score on relevance. As another example, LinkedIn shared about its success with using model-based evaluators to estimate hallucinations, responsible AI violations, coherence, etc. in its write-up.

By creating assets that compound their value over time, we upgrade building evals from a purely operational expense to a strategic investment and build our data flywheel in the process.

The high-level trend of low-cost cognition

In 1971, the researchers at Xerox PARC predicted the future: the world of networked personal computers that we are now living in. They helped birth that future by playing pivotal roles in the invention of the technologies that made it possible, from Ethernet and graphics rendering to the mouse and the window.

But they also engaged in a simple exercise: they looked at applications that were very useful (e.g., video displays) but were not yet economical (i.e., enough RAM to drive a video display cost many thousands of dollars). Then they looked at historic price trends for that technology (à la Moore’s law) and predicted when those technologies would become economical.

We can do the same for LLM technologies, even though we don’t have something quite as clean as transistors-per-dollar to work with. Take a popular, long-standing benchmark, like the Massive Multitask Language Understanding (MMLU) dataset, and a consistent input approach (five-shot prompting). Then, compare the cost to run language models with various performance levels on this benchmark over time.

Figure: For a fixed cost, capabilities are rapidly increasing. For a fixed capability level, costs are rapidly decreasing. Created by coauthor Charles Frye using public data on May 13, 2024.

In the four years since the launch of OpenAI’s davinci model as an API, the cost for running a model with equivalent performance on that task at the scale of one million tokens (about one hundred copies of this document) has dropped from $20 to less than 10¢, a halving time of just six months. Similarly, the cost to run Meta’s LLama 3 8B via an API provider or on your own is just 20¢ per million tokens as of May 2024, and it has similar performance to OpenAI’s text-davinci-003, the model that enabled ChatGPT to shock the world. That model also cost about $20 per million tokens when it was released in late November 2022. That’s two orders of magnitude in just 18 months, the same time frame in which Moore’s law predicts a mere doubling.

Now, let’s consider an application of LLMs that is very useful (powering generative video game characters, à la Park et al.) but is not yet economical. (Their cost was estimated at $625 per hour here.) Since that paper was published in August 2023, the cost has dropped roughly one order of magnitude, to $62.50 per hour. We might expect it to drop to $6.25 per hour in another nine months.

Meanwhile, when Pac-Man was released in 1980, $1 of today’s money would buy you a credit, good to play for a few minutes or tens of minutes: call it six games per hour, or $6 per hour. This napkin math suggests that a compelling LLM-enhanced gaming experience will become economical some time in 2025.
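The napkin math above can be written down explicitly. Assuming cost declines exponentially, roughly one order of magnitude every nine months as observed for this use case (the function name and parameterization are ours, not from the original):

```python
import math

def months_until_affordable(cost_now: float, cost_target: float,
                            months_per_10x: float) -> float:
    """Months until cost_now decays to cost_target, assuming costs fall
    by 10x every `months_per_10x` months (exponential decline)."""
    if cost_now <= cost_target:
        return 0.0
    return months_per_10x * math.log10(cost_now / cost_target)

# Generative game characters (numbers from the text): $62.50/hour today,
# Pac-Man benchmark ~$6.25/hour, costs dropping ~10x per 9 months.
print(round(months_until_affordable(62.50, 6.25, months_per_10x=9)))  # prints 9
```

Swapping in a six-month halving time (the MMLU trend above) instead of a nine-month 10x gives a different but similarly near-term answer; the point is that any reasonable parameterization of the trend lands within a year or two.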

These trends are new, only a few years old. But there is little reason to expect this process to slow down in the next few years. Even as we perhaps use up low-hanging fruit in algorithms and datasets, like scaling past the “Chinchilla ratio” of ~20 tokens per parameter, deeper innovations and investments inside the data center and at the silicon layer promise to pick up slack.

And this is perhaps the most important strategic fact: what is a completely infeasible floor demo or research paper today will become a premium feature in a few years and then a commodity shortly after. We should build our systems, and our organizations, with this in mind.

Enough 0 to 1 Demos, It’s Time for 1 to N Products

We get it; building LLM demos is a ton of fun. With just a few lines of code, a vector database, and a carefully crafted prompt, we create ✨magic✨. And in the past year, this magic has been compared to the internet, the smartphone, and even the printing press.

Unfortunately, as anyone who has worked on shipping real-world software knows, there’s a world of difference between a demo that works in a controlled setting and a product that operates reliably at scale.

Take, for example, self-driving cars. The first car was driven by a neural network in 1988. Twenty-five years later, Andrej Karpathy took his first demo ride in a Waymo. A decade after that, the company received its driverless permit. That’s thirty-five years of rigorous engineering, testing, refinement, and regulatory navigation to go from prototype to commercial product.

Across different parts of industry and academia, we have keenly observed the ups and downs for the past year: year 1 of N for LLM applications. We hope that the lessons we have learned—from tactics like rigorous operational techniques for building teams to strategic perspectives like which capabilities to build internally—help you in year 2 and beyond, as we all build on this exciting new technology together.