It Turns Out Amazon Has Been Quietly Helping Enterprises Cut Costs and Boost Efficiency


So what has Amazon been up to lately?

What comes to mind when you hear "Amazon"? My first thought is that it started out as an e-commerce platform selling books. But to internet companies and developers, Amazon's more important identity is as the world's leading cloud service provider.

In cloud services, Amazon is the undisputed heavyweight. Yet with AI dominating the conversation, the name "Amazon" rarely comes up; even second-tier AI companies like Mistral and Cohere seem to get more exposure.

For a tech giant that already reigns over both e-commerce and cloud, this apparent "absence" from the AI boom is a bit odd.

In fact, while other companies were piling into consumer-facing LLM applications, Amazon chose a different path: its real target is the enterprise AI market.

Just a few days ago (April 8), Amazon announced that prompt caching for Amazon Bedrock is now generally available.

The feature works with Anthropic's Claude 3.5 Haiku and Claude 3.7 Sonnet models, as well as Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro. By caching prompts that are reused across multiple API calls, it can reduce response latency by up to 85% and cut compute costs by up to 90%.

With prompt caching, you can mark a specific contiguous portion of your prompt to be cached (called a "prompt prefix"). When a request is made with the specified prefix, the model processes the input and caches the internal state associated with that prefix. On subsequent requests that use the same prompt prefix, the model reads from the cache and skips the recomputation of those input tokens, shortening the time to first token (TTFT) and making better use of hardware. For users, the most noticeable effect is lower cost.

How prompt caching works

As early as December 2024, a post on the AWS blog introduced prompt caching:

《Reduce costs and latency with Amazon Bedrock Intelligent Prompt Routing and prompt caching (preview)》

Large language model (LLM) inference consists of two main phases:

  • input token processing
  • output token generation

Prompt caching on Amazon Bedrock optimizes the input token processing phase.

Concretely, you set a cache checkpoint to mark a key contiguous portion of the prompt. Everything before the checkpoint becomes the cached prompt prefix:

  • Cached prefix (static part): fixed content such as instructions, examples, and document bodies.

  • Dynamic part: the user query or other request-specific content.

When a later request reuses the same prompt prefix, the LLM checks whether that prefix is already stored in the cache. On a match, the model restores the cached state and resumes input processing from the last checkpoint, avoiding recomputation and cutting both time and cost — genuine cost savings and efficiency gains.

**A cache hit occurs only when the prompt prefix matches exactly.** To get the most out of prompt caching, place static content such as instructions and examples at the beginning of the prompt, and put user-specific dynamic information at the end. Likewise, images and tool definitions must stay identical across requests for the cache to take effect.

01.png
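Everything hinges on that exact match. Here is a minimal sketch (variable names are illustrative, not part of any API) contrasting a layout that silently defeats caching with one that preserves it:

    import datetime

    # Bad: a value that changes between requests sits inside the "static"
    # prefix, so the exact-match check fails and the cache never hits.
    bad_prefix = f"Today is {datetime.date.today()}. You are a careful analyst. <rules>...</rules>"

    # Good: the prefix stays byte-for-byte identical across requests; anything
    # request-specific goes after the cache checkpoint.
    static_prefix = "You are a careful analyst. <rules>...</rules>"  # cached; checkpoint goes after this
    dynamic_part = f"Today is {datetime.date.today()}. Question: ..."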

Let's put it in a picture. In the diagram below, A, B, C, and D are different parts of a prompt, where A, B, and C are marked as the prompt prefix and a cache checkpoint is created after them. The cache stores the A-B-C prefix together with the model's corresponding internal state. When a later request contains exactly the same A-B-C prefix, the model skips reprocessing A, B, and C and reuses the cached computation — a cache hit.

02.png

A reminder: prompt caching is designed for specific models. You can check which models currently support it here:

docs.aws.amazon.com/bedrock/lat…

When is prompt caching a good fit?

Prompt caching on Amazon Bedrock is well suited to workloads that reuse long-context prompts across many API calls. It can cut response latency by up to 85% and inference costs by up to 90%, so it especially benefits applications with long, repetitive inputs. To judge whether it fits your use case, start by estimating how many tokens you plan to cache, how often they will be reused, and the time between requests — the sketch below shows one way to run that estimate.
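As a back-of-envelope aid, here is one way to run that estimate. This is a sketch, not official pricing math: the 1.25× cache-write premium and 0.1× cache-read rate are assumptions modeled on typical published cache pricing (the post's own "up to 90%" figure implies reads cost roughly one-tenth of normal input tokens) — check the actual Bedrock pricing for your model.

    def caching_saves_money(cached_tokens, dynamic_tokens, num_requests,
                            write_premium=1.25, read_rate=0.10):
        # Cost in input-token units if nothing is cached: every request pays
        # full price for the static prefix plus the dynamic part.
        without_cache = num_requests * (cached_tokens + dynamic_tokens)
        # With caching: one write of the prefix at a premium, then discounted
        # reads of the prefix on each later request; dynamic tokens always
        # pay full price.
        with_cache = (cached_tokens * write_premium
                      + (num_requests - 1) * cached_tokens * read_rate
                      + num_requests * dynamic_tokens)
        return with_cache < without_cache, without_cache, with_cache

    # Example: the 37,209-token document from later in this post, reused
    # across 20 questions of roughly 10 tokens each.
    print(caching_saves_money(37209, 10, 20))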

The following scenarios are a natural fit for prompt caching:

  • Chat with a document: cache the document content as context on the first request, so later user queries don't reprocess it — simplifying the architecture and avoiding heavier solutions such as vector databases.
  • Coding assistants: long code files reused across prompts enable near-real-time inline code suggestions, sharply reducing the latency of reprocessing those files.
  • Agent workflows: longer system prompts refine agent behavior; caching the system prompt and complex tool definitions trims per-step processing time in multi-step agent flows, keeping the end-user experience responsive.
  • Few-shot learning: for scenarios needing many high-quality examples and intricate instructions (such as customer service or technical troubleshooting), prompt caching can noticeably improve response efficiency and accuracy.

How do you use it?

When evaluating whether a use case suits prompt caching, the key is to split the prompt into two distinct parts:

  • the static, repeated content, and
  • the dynamic content that changes with each user or request.

The prompt template should follow the structure shown below: the fixed, unchanging static part goes first, followed by a cache checkpoint, with the dynamic part placed after the checkpoint.

With this split, whenever the static part of a request is identical to a previous one, the model can restore that part's computation state directly from the cache, skip the repeated work, and process only the dynamic part.

03.png

Within a single request, you can create multiple cache checkpoints, subject to per-model limits. Everything before each checkpoint must remain static and unchanged, while what follows it can vary as needed. In other words, every cache checkpoint follows the same "static part, cache checkpoint, dynamic part" structure — see the sketch after the figure below.

04.png
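Here is a sketch of what a content array with two checkpoints might look like, following the "static part, checkpoint, static part, checkpoint, dynamic part" layout (the full example later in this post uses the same pattern; the placeholder values are purely illustrative):

    # Hypothetical placeholder values; in a real request these would be your
    # actual instructions, examples, and query.
    system_instructions = "You are a support assistant. Answer using the examples below."
    few_shot_examples = "Q: ...\nA: ...\nQ: ...\nA: ..."
    user_query = "How do I reset my password?"

    content = [
        {"type": "text", "text": system_instructions,   # static block 1
         "cache_control": {"type": "ephemeral"}},        # checkpoint 1
        {"type": "text", "text": few_shot_examples,      # static block 2
         "cache_control": {"type": "ephemeral"}},        # checkpoint 2
        {"type": "text", "text": user_query},            # dynamic part
    ]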

🌰 An example

"Chat with a document" is a particularly good fit for prompt caching. In short, the parts of each conversation that never change — say, the response formatting rules and the full document text — can be marked ahead of time as the cached prefix, while the user's specific question is the part that varies. On the first call the system has to process the entire document, but when the user asks follow-up questions, it can pull the cached document state directly and answer much faster.

In practice, we set cache checkpoints in the request via code — for example, marking the formatting instructions and the document body separately — so the system knows which content can be reused instead of reprocessing the whole document on every call.

05.png

    import json

    import boto3
    import requests

    # Bedrock runtime client (uses your default AWS region/credentials).
    bedrock_runtime = boto3.client("bedrock-runtime")

    def chat_with_document(document, user_query):
        instructions = (
            "I will provide you with a document, followed by a question about its content. "
            "Your task is to analyze the document, extract relevant information, and provide "
            "a comprehensive answer to the question. Please follow these detailed instructions:"
            "\n\n1. Identifying Relevant Quotes:"
            "\n - Carefully read through the entire document."
            "\n - Identify sections of the text that are directly relevant to answering the question."
            "\n - Select quotes that provide key information, context, or support for the answer."
            "\n - Quotes should be concise and to the point, typically no more than 2-3 sentences each."
            "\n - Choose a diverse range of quotes if multiple aspects of the question need to be addressed."
            "\n - Aim to select between 2 to 5 quotes, depending on the complexity of the question."
            "\n\n2. Presenting the Quotes:"
            "\n - List the selected quotes under the heading 'Relevant quotes:'"
            "\n - Number each quote sequentially, starting from [1]."
            "\n - Present each quote exactly as it appears in the original text, enclosed in quotation marks."
            "\n - If no relevant quotes can be found, write 'No relevant quotes' instead."
            "\n - Example format:"
            "\n  Relevant quotes:"
            "\n  [1]\"This is the first relevant quote from the document.\""
            "\n  [2]\"This is the second relevant quote from the document.\""
            "\n\n3. Formulating the Answer:"
            "\n - Begin your answer with the heading 'Answer:' on a new line after the quotes."
            "\n - Provide a clear, concise, and accurate answer to the question based on the information in the document."
            "\n - Ensure your answer is comprehensive and addresses all aspects of the question."
            "\n - Use information from the quotes to support your answer, but do not repeat them verbatim."
            "\n - Maintain a logical flow and structure in your response."
            "\n - Use clear and simple language, avoiding jargon unless it's necessary and explained."
            "\n\n4. Referencing Quotes in the Answer:"
            "\n - Do not explicitly mention or introduce quotes in your answer (e.g., avoid phrases like 'According to quote [1]')."
            "\n - Instead, add the bracketed number of the relevant quote at the end of each sentence or point that uses information from that quote."
            "\n - If a sentence or point is supported by multiple quotes, include all relevant quote numbers."
            "\n - Example: 'The company's revenue grew by 15% last year. [1] This growth was primarily driven by increased sales in the Asian market. [2][3]'"
            "\n\n5. Handling Uncertainty or Lack of Information:"
            "\n - If the document does not contain enough information to fully answer the question, clearly state this in your answer."
            "\n - Provide any partial information that is available, and explain what additional information would be needed to give a complete answer."
            "\n - If there are multiple possible interpretations of the question or the document's content, explain this and provide answers for each interpretation if possible."
            "\n\n6. Maintaining Objectivity:"
            "\n - Stick to the facts presented in the document. Do not include personal opinions or external information not found in the text."
            "\n - If the document presents biased or controversial information, note this objectively in your answer without endorsing or refuting the claims."
            "\n\n7. Formatting and Style:"
            "\n - Use clear paragraph breaks to separate different points or aspects of your answer."
            "\n - Employ bullet points or numbered lists if it helps to organize information more clearly."
            "\n - Ensure proper grammar, punctuation, and spelling throughout your response."
            "\n - Maintain a professional and neutral tone throughout your answer."
            "\n\n8. Length and Depth:"
            "\n - Provide an answer that is sufficiently detailed to address the question comprehensively."
            "\n - However, avoid unnecessary verbosity. Aim for clarity and conciseness."
            "\n - The length of your answer should be proportional to the complexity of the question and the amount of relevant information in the document."
            "\n\n9. Dealing with Complex or Multi-part Questions:"
            "\n - For questions with multiple parts, address each part separately and clearly."
            "\n - Use subheadings or numbered points to break down your answer if necessary."
            "\n - Ensure that you've addressed all aspects of the question in your response."
            "\n\n10. Concluding the Answer:"
            "\n  - If appropriate, provide a brief conclusion that summarizes the key points of your answer."
            "\n  - If the question asks for recommendations or future implications, include these based strictly on the information provided in the document."
            "\n\nRemember, your goal is to provide a clear, accurate, and well-supported answer based solely on the content of the given document. "
            "Adhere to these instructions carefully to ensure a high-quality response that effectively addresses the user's query.")

        document_content = f"Here is the document: <document> {document} </document>"

        messages_API_body = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 4096,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            # Static part 1: the instructions, followed by a cache checkpoint.
                            "type": "text",
                            "text": instructions,
                            "cache_control": {"type": "ephemeral"},
                        },
                        {
                            # Static part 2: the document body, with a second checkpoint.
                            "type": "text",
                            "text": document_content,
                            "cache_control": {"type": "ephemeral"},
                        },
                        {
                            # Dynamic part: the user's question, after the checkpoints.
                            "type": "text",
                            "text": user_query,
                        },
                    ],
                }
            ],
        }

        response = bedrock_runtime.invoke_model(
            body=json.dumps(messages_API_body),
            modelId="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
            accept="application/json",
            contentType="application/json",
        )
        response_body = json.loads(response.get("body").read())
        print(json.dumps(response_body, indent=2))

    # Fetch the AWS blog post and use it as the document to chat with.
    response = requests.get("https://aws.amazon.com/blogs/aws/reduce-costs-and-latency-with-amazon-bedrock-intelligent-prompt-routing-and-prompt-caching-preview/")
    blog = response.text
    chat_with_document(blog, "What is the blog writing about?")

In the response to the code above, a usage section provides metrics on cache reads and writes.

Here is a sample response from the first model invocation.

    {
        "id": "msg_bdrk_01BwzJX6DBVVjUDeRqo3Z6GL",
        "type": "message",
        "role": "assistant",
        "model": "claude-3-7-sonnet-20250219",
        "content": [
            {
                "type": "text",
                "text": "Relevant quotes:\n[1] \"Today, Amazon Bedrock has introduced in preview two capabilities that help reduce costs and latency for generative AI applications\"\n\n[2] \"Amazon Bedrock Intelligent Prompt Routing \u2013 When invoking a model, you can now use a combination of foundation models (FMs) from the same model family to help optimize for quality and cost... Intelligent Prompt Routing can reduce costs by up to 30 percent without compromising on accuracy.\"\n\n[3] \"Amazon Bedrock now supports prompt caching \u2013 You can now cache frequently used context in prompts across multiple model invocations... Prompt caching in Amazon Bedrock can reduce costs by up to 90% and latency by up to 85% for supported models.\"\n\nAnswer:\nThe article announces two new preview features for Amazon Bedrock that aim to improve cost efficiency and reduce latency in generative AI applications [1]:\n\n1. Intelligent Prompt Routing: This feature automatically routes requests between different models within the same model family based on the complexity of the prompt, choosing more cost-effective models for simpler queries while maintaining quality. This can reduce costs by up to 30% [2].\n\n2. Prompt Caching: This capability allows frequent reuse of cached context across multiple model invocations, which is particularly useful for applications that repeatedly use the same context (like document Q&A systems). This feature can reduce costs by up to 90% and improve latency by up to 85% [3].\n\nThese features are designed to help developers build more efficient and cost-effective generative AI applications while maintaining performance and quality standards."
            }
        ],
        "stop_reason": "end_turn",
        "stop_sequence": null,
        "usage": {
            "input_tokens": 9,
            "cache_creation_input_tokens": 37209,
            "cache_read_input_tokens": 0,
            "output_tokens": 357
        }
    }

As shown below, the cache_creation_input_tokens value tells us that a cache checkpoint was created successfully, caching 37,209 tokens.

06.png

For a follow-up request, we can ask a different question:

    chat_with_document(blog, "what are the use cases?")

The dynamic part of the prompt has changed, but the static part — the prompt prefix — has not. We can therefore expect a cache hit on this call, as the following response shows.

    {
      "id": "msg_bdrk_01HKoDMs4Bmm9mhzCdKoQ8bQ",
      "type": "message",
      "role": "assistant",
      "model": "claude-3-7-sonnet-20250219",
      "content": [
        {
          "type": "text",
          "text": "Relevant quotes:\n[1] \"This is particularly useful for applications such as customer service assistants, where uncomplicated queries can be handled by smaller, faster, and more cost-effective models, and complex queries are routed to more capable models.\"\n\n[2] \"This is especially valuable for applications that repeatedly use the same context, such as document Q&A systems where users ask multiple questions about the same document or coding assistants that need to maintain context about code files.\"\n\n[3] \"During the preview, you can use the default prompt routers for Anthropic's Claude and Meta Llama model families.\"\n\nAnswer:\nThe document describes two main features with different use cases:\n\n1. Intelligent Prompt Routing:\n- Customer service applications where query complexity varies\n- Applications needing to balance between cost and performance\n- Systems that can benefit from using different models from the same family (Claude or Llama) based on query complexity [1][3]\n\n2. Prompt Caching:\n- Document Q&A systems where users ask multiple questions about the same document\n- Coding assistants that need to maintain context about code files\n- Applications that frequently reuse the same context in prompts [2]\n\nBoth features are designed to optimize costs and reduce latency while maintaining response quality. Prompt routing can reduce costs by up to 30% without compromising accuracy, while prompt caching can reduce costs by up to 90% and latency by up to 85% for supported models."
        }
      ],
      "stop_reason": "end_turn",
      "stop_sequence": null,
      "usage": {
        "input_tokens": 10,
        "cache_creation_input_tokens": 0,
        "cache_read_input_tokens": 37209,
        "output_tokens": 324
      }
    }

As shown below, 37,209 tokens — the document and instructions — were read from the cache, while the 10 input tokens cover the user query.

07.png
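Both responses expose the same usage fields, so it's easy to check cache behavior programmatically. Here is a small helper — a sketch that assumes the response shape shown above; summarize_cache_usage is a hypothetical name, not part of any SDK:

    def summarize_cache_usage(response_body):
        # Pull the cache metrics out of the "usage" block of the response.
        usage = response_body["usage"]
        written = usage.get("cache_creation_input_tokens", 0)
        read = usage.get("cache_read_input_tokens", 0)
        fresh = usage.get("input_tokens", 0)
        if read:
            status = "cache hit"
        elif written:
            status = "cache write (first call)"
        else:
            status = "no caching"
        print(f"{status}: read {read}, wrote {written}, "
              f"processed {fresh} uncached input tokens")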

For more ways to invoke the model API, see: docs.aws.amazon.com/bedrock/lat…

Does prompt caching always cut costs?

Whether prompt caching actually pays off depends on the workload. It reliably shortens time to first token (TTFT) when handling repeated requests built on a fixed template, but its advantage shrinks in certain scenarios: for example, a workload whose system prompt runs to around 2,000 tokens but is followed by large amounts of text that changes on every request may see little benefit from prompt caching. The quick arithmetic below shows why.
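A worked number makes the point. Assume (illustratively — the 20,000 is not from the source) each request appends 20,000 dynamic tokens after that 2,000-token cached prefix, and cached reads cost one-tenth of normal input tokens:

    static, dynamic = 2000, 20000   # cached prefix vs. per-request dynamic text
    full_cost = static + dynamic                 # input-token cost, no caching
    cached_cost = static * 0.10 + dynamic        # input-token cost on a cache hit
    print(f"savings per request: {1 - cached_cost / full_cost:.1%}")   # ~8.2%

Even a 90% discount on the cached portion barely moves the total, because the cached share of each request is small.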

There's a write-up on GitHub about how to use prompt caching and benchmark it. The benchmark results depend on the specific use case: input token counts, cached token counts, and output token counts. Interested readers can take a look: github.com/aws-samples…

Summary

All in all, Amazon may not have the high-profile AI presence of OpenAI or Google, but it is steadily pushing enterprise AI into production. Prompt caching is a good example — seemingly "behind the scenes," it is in fact a sharp tool for raising efficiency and lowering costs. For developers and businesses with long-term, repetitive invocation needs, its value should not be underestimated.

Amazon may have opted out of AI's battle for attention, but it clearly knows the road it wants to travel. In the end, whoever truly solves enterprises' real problems "behind the scenes" may be the one that goes furthest in this AI race.

Interested readers can learn more here 👉 Amazon's cloud services