A Data Blueprint for AI. Introduction to AI Data Foundations: Building an AI-Ready Data Foundation

This chapter examines the rapid growth of generative AI (GenAI) technologies and introduces the critical data infrastructure their successful implementation requires. Foundation models have been adopted at an unprecedented pace that has far outrun organizations’ data readiness, creating a significant gap between experimentation and production deployment.

We explore how this gap shows up in practice: according to recent McKinsey research, 79% of organizations already use GenAI regularly in at least one business function, yet only 7% have fully scaled AI use into production. It is now widely recognized that the main cause of failure is not model selection but inadequate data preparation. Traditional data architectures, optimized mainly for analytics and machine learning, cannot support the semantic understanding, real-time context, and cross-domain reasoning that GenAI applications require.

As GenAI evolves from simple assistants into autonomous agents, the demands on data infrastructure grow more complex. Each stage of that evolution, from basic AI assistants to copilot assistants, to retrieval-augmented generation (RAG) based agents, and on to agentic AI, introduces fundamentally different requirements for how data is structured, accessed, and governed.

This chapter identifies five architectural patterns that recur in organizations that have crossed the production gap: knowledge graphs for contextual intelligence, event-driven architecture for real-time AI, lakehouse platforms for unified data, semantic search for meaning-based retrieval, and agent-ready properties for autonomous access. Through real-world examples, including a global retailer’s evolution away from traditional dimensional models, we show how these patterns work together to achieve production success.

Introduction and Market Context

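On November 30, 2022, OpenAI released ChatGPT, built on its GPT-3.5 model series, as a “research preview” intended to gather feedback and to learn about the system’s capabilities and limitations.

The release, however, drew enormous interest. The new model quickly went viral, unleashing a technological force that is rapidly reshaping the world in ways we are only beginning to understand. Within two months ChatGPT had passed 100 million users, making it the fastest-growing consumer software application in history at that time. In just 60 days it reached a user base that took Instagram two and a half years and TikTok nine months to build.

Earlier technological advances, such as the combustion engine in the 1860s, the internet in the 1990s, smartphones in 2007, and cloud computing in the early 2010s, were adopted gradually. Generative AI’s impact, by contrast, was democratized almost immediately.

GenAI put powerful content creation capabilities in everyone’s hands, often through nothing more than a simple natural language prompt or even a spoken command. Anyone can now generate content in the areas they care about or excel in, whether that means creating images, writing scripts, producing videos, or coding. This new accessibility has transformed productivity, letting individuals create many kinds of content with ease, regardless of technical expertise.

Knowledge workers across industries, including software developers, content creators, analysts, researchers, consultants, and other professionals who work primarily with information, have seen their workflows accelerate markedly: tasks that once took days now take hours or even minutes. GenAI integrates seamlessly through intuitive natural language interfaces and, unlike earlier enterprise software that demanded extensive onboarding, requires no specialized training. This has driven an unprecedented wave of adoption across both consumer and enterprise landscapes.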

Having seen GenAI’s transformative potential, organizations quickly entered what industry analysts call GenAI panic mode: a rush to implement generative AI capabilities driven by competitive pressure rather than strategic planning. C-suite executives across industries demanded immediate generative AI initiatives, pushing teams to launch proofs of concept soon after ChatGPT’s debut in November 2022.

Arize AI’s 2024 analysis of US Securities and Exchange Commission (SEC) filing data found that 64.6% of Fortune 500 companies mentioned AI in their most recent annual reports, a 250.1% increase over 2022. More than one in five companies explicitly mentioned generative AI, and more than half (281 companies) cited AI as a risk factor, up from only 49 in 2022. Figure 1-1 visualizes this rapid growth in adoption and awareness.

Figure 1-1: GenAI adoption statistics, including ChatGPT’s unprecedented adoption figures and Arize AI’s observations of the trend in AI mentions in Fortune 500 annual reports from 2022 to 2024

Boardroom discussions suddenly began to include technical terminology that had previously appeared only in AI research papers. Terms such as “foundation model,” “knowledge base,” “prompt engineering,” “RAG architecture,” “vector database,” “agents,” “confabulation” (also known as “hallucination”), and “agentic AI” became part of the executive vocabulary almost overnight.

This corporate enthusiasm, or FOMO (fear of missing out), intensified as more foundation models quickly reached the market. Within six months of ChatGPT’s release, multiple competitors had launched their own offerings, including but not limited to:

Anthropic’s Claude
Google’s Bard
Meta’s Llama
Amazon’s Titan and Bedrock
Alibaba’s Tongyi Qianwen
Stability AI’s Stable Diffusion models

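This proliferation of models further heightened the competitive urgency organizations felt to adopt generative AI technologies. Customers actively sought guidance on the available options to accelerate proofs of concept, and consumer adoption translated into unprecedented enterprise deployment.

JPMorgan Chase has scaled generative AI to 200,000 employees across its global operations, one of the largest enterprise AI deployments in financial services history. Similarly, Siemens integrated Amazon Bedrock into its Mendix platform, which now serves more than 50 million users across more than 200,000 applications. Development velocity is equally striking: BT Group deployed Amazon CodeWhisperer (now part of Amazon Q Developer) to 1,200 engineers, automating 12% of their work and generating more than 100,000 lines of code in four months.

Beneath this adoption frenzy, however, lies a critical challenge: most organizations are discovering that their existing data infrastructure cannot support production GenAI applications. Experimentation is widespread, but scaling to production exposes fundamental gaps in how enterprise data is structured, accessed, and governed.

To understand why traditional data architectures fail when faced with GenAI, we must first examine what makes generative AI fundamentally different from earlier AI approaches. These technical differences lead directly to the infrastructure challenges organizations face as they move from experimentation to production.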

What Makes Generative AI Different

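Generative AI marks a fundamental shift in how AI systems are built and what they can do. Traditional AI systems typically focus on classification, prediction, or recommendation and work from structured inputs. They rely on narrow, task-specific models that need labeled datasets or reinforcement signals to improve over time. Generative models, by contrast, and large foundation models in particular, can create entirely new content, including text, images, music, video, and even code.

What makes foundation models revolutionary is not only their scale or the diversity of their output but their learning methodology (see Figure 1-2). Rather than relying mainly on human-labeled data (as in supervised learning) or trial-and-error reward loops (as in reinforcement learning), they use self-supervised learning: the model teaches itself to understand and generate language, images, or code by predicting masked or missing elements in raw, unlabeled data.

Figure 1-2: Traditional supervised learning (using labeled data and explicit training), as used in machine learning and deep learning, compared with self-supervised learning

The process resembles how humans learn: by observing, inferring, and forming internal representations of the world, without always needing explicit instruction. When you teach a toddler to recognize a dog, a cat, or another animal, for example, they learn more than one specific animal. They soon begin to tell different types of animals apart and eventually distinguish the breeds within each category.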
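To make the idea of learning from unlabeled data concrete, the following minimal sketch shows how a single raw sentence can be turned into training pairs without any human labeling. The function name `make_masked_examples` and the `[MASK]` token are illustrative assumptions, not part of any particular model’s training pipeline.

```python
def make_masked_examples(sentence, mask_token="[MASK]"):
    """Turn one unlabeled sentence into (masked_input, target) training pairs.

    Hypothetical helper: each word is hidden in turn, and the hidden word
    itself becomes the label, so no human annotation is needed.
    """
    tokens = sentence.split()
    examples = []
    for i, target in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        examples.append((" ".join(masked), target))
    return examples


pairs = make_masked_examples("the dog chased the cat")
for masked_input, target in pairs:
    print(f"{masked_input!r} -> {target!r}")
```

The raw text supplies both the input and the answer, which is why this style of training can scale to very large unlabeled corpora.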

The Transformer Architecture: The Technical Foundation of GenAI

At the core of modern generative AI is the Transformer architecture, a deep learning breakthrough that changed how AI systems process and generate content. Introduced by Ashish Vaswani and a team at Google in the seminal 2017 paper “Attention Is All You Need,” the Transformer uses a mechanism called self-attention to let models capture complex relationships in data.

The self-attention mechanism lets the model weigh the importance of different parts of the input as it generates each part of the output. This allows the model to maintain coherence and context across long sequences of text or other data, something earlier architectures struggled to do. The key technical aspects of Transformers that affect data requirements include:

Parallel processing

Unlike earlier sequential models such as recurrent neural networks and long short-term memory networks, Transformers process all input tokens simultaneously, which makes training on massive datasets efficient.

Context windows

Transformers operate within context windows of fixed length, so the data infrastructure must manage and retrieve relevant context effectively.

Positional encoding

Transformers use positional encodings to understand sequence order, so data preparation must preserve meaningful sequential relationships.

Attention mechanisms

The self-attention mechanism builds a computational graph that connects every token to every other token. This supports rich contextual understanding but demands substantial computational resources, because the number of token pairs grows quadratically with sequence length.
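The all-pairs behavior of self-attention can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the queries, keys, and values are the raw token vectors themselves (a real Transformer applies learned projection matrices and multiple heads), and the three two-dimensional token vectors are made up for the example.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections.

    Every token is compared with every other token, which is why the
    work grows quadratically with the number of tokens.
    """
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of all token vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(row, vectors)) for d in range(dim)]
        )
    return outputs, weights


# Hypothetical token vectors: the first two are similar, the third is not.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

In this toy input the first token places more weight on the second token than on the third, because their vectors point in similar directions.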

These architectural characteristics directly shape how data must be structured, stored, and retrieved for effective GenAI implementations. Traditional data architectures optimized for row-based or columnar access patterns are generally a poor fit for the contextual, token-based access patterns that Transformer models need.

Enterprise Data as the Key Differentiator

Foundation models provide powerful general capabilities, but their real business value emerges when they are connected to data that is unique to your organization. That connection turns general-purpose AI systems into specialized business tools that understand your industry context, company terminology, and domain-specific knowledge.

Organizations that build effective bridges between proprietary data and foundation models gain significant competitive advantages through AI solutions that are more accurate, relevant, and trustworthy. Four strategies are commonly used to achieve this:

Context engineering

Uses RAG to supply proprietary information to foundation models as context. It requires low-latency semantic search over vector embeddings, with support for real-time updates.

Fine-tuning

Adapts pretrained foundation models using domain-specific datasets. It requires versioned, high-quality labeled datasets with clear lineage tracking.

Custom model training

Builds purpose-built models optimized for specific use cases and data. It requires massively parallel access to diverse datasets and is optimized for throughput rather than latency.

Model optimization

Creates smaller, more efficient models through techniques such as distillation and pruning, capturing the capabilities of larger models while requiring fewer resources.

As Chapter 2 explores further, these implementation patterns form a spectrum of structural coupling between enterprise data and model behavior. As organizations make use of these capabilities, their GenAI adoption typically passes through distinct stages, each placing different demands on data infrastructure. We examine these stages in “The Evolution of GenAI Applications.”

Contextual Intelligence: A Simple Example

Understanding the technical foundations of generative AI (contextual intelligence, self-supervised learning, and semantic understanding) matters because they make increasingly sophisticated applications possible. To illustrate the power of the self-attention mechanism, consider a simple but telling experiment. Ask a generative image model to create an image of “a man sitting by a bank,” and you may get a person sitting outside a financial institution. Change the prompt to “a man fishing by a bank,” and the generated image will likely show a person fishing beside a riverbank (Figure 1-3). The model is not guessing at random; it uses the context of the surrounding words to disambiguate the different meanings of the word “bank.”

Figure 1-3: Example images generated with Amazon Nova Canvas from the prompts “a man sitting by a bank” and “a man fishing by a bank”

Now consider how a single word can completely change a pronoun reference in language understanding, as shown in Figure 1-4. In the sentence “The trophy doesn’t fit in the suitcase because it is too big,” “it” clearly refers to the trophy. Change just one word, “The trophy doesn’t fit in the suitcase because it is too small,” and “it” now refers to the suitcase. The grammatical structure and the pronoun are identical, yet the context supplied by “big” and “small” fundamentally changes the object being referenced. The example shows that models must draw on contextual clues and real-world knowledge of physical constraints to interpret meaning correctly.

Figure 1-4: Example of contextual pronoun reference

This ability to infer meaning from context demonstrates the model’s internal grasp of semantic relationships, a key difference from traditional systems that depend on keyword matching or rigid labels. As generative models are trained on more diverse and expansive data, their understanding deepens, enabling them to perform tasks they were never explicitly trained for.

This contextual reasoning is not magic; it is learned during training. To understand how generative AI models distinguish meaning so finely, we need to look at how they represent language internally.

Representing Meaning in Vector Space

How does a GenAI model know whether words are semantically close? Mathematically, this contextual understanding rests on how models represent words, phrases, and even concepts as vectors in a high-dimensional space (Figure 1-5). Words with similar meanings or contextual usage sit closer together, while unrelated terms or terms used differently sit farther apart.

Figure 1-5: A two-dimensional vector space visualization showing how the different meanings of “bank” are positioned relative to other words

In this example:

Bank (finance) is close to terms such as “loan,” “money,” and “finance.”
Bank (river) appears near “water,” “fishing,” and “river.”
The angle between the two “bank” vectors is large, indicating low cosine similarity, that is, low contextual overlap.
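The relationship between angle and similarity described above can be checked with a small calculation. The two-dimensional coordinates below are invented for illustration and are not taken from Figure 1-5 or from any real embedding model; real embeddings typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# Hypothetical 2D embeddings: axis 0 ~ "finance", axis 1 ~ "nature".
embeddings = {
    "bank_finance": (0.9, 0.1),
    "loan": (0.8, 0.2),
    "bank_river": (0.1, 0.9),
    "fishing": (0.2, 0.8),
}

same_sense = cosine_similarity(embeddings["bank_finance"], embeddings["loan"])
cross_sense = cosine_similarity(embeddings["bank_finance"], embeddings["bank_river"])
print(f"bank(finance) vs loan:        {same_sense:.3f}")
print(f"bank(finance) vs bank(river): {cross_sense:.3f}")
```

With these made-up coordinates, the financial sense of “bank” scores close to 1 against “loan” and much lower against the river sense, which mirrors the large angle between the two “bank” vectors.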

This structure lets generative AI systems reason on the basis of semantics rather than syntax, enabling contextual search, summarization, question answering, and creative generation tasks in ways traditional AI cannot. The semantic search capability, developed later in this chapter as one of the five emerging architectural patterns, is becoming a foundational requirement for enterprise AI systems.

Enterprise Example: Cross-Document Understanding in Customer Support

To see how this vector-based understanding applies in business settings, consider a customer support knowledge base containing the following entries:

Product X requires firmware version 2.1 or higher when used with the Y3000 controller.
The Y3000 controller was deprecated in November 2023 and replaced with the Y3500 model.
Y3500 controllers are backward compatible with all products requiring Y3000 controllers.

When a support agent asks “What firmware does Product X need?”, a traditional keyword-based system will probably return only the first document. A GenAI system that uses semantic understanding can:

Recognize that the three documents are contextually related through their shared references to controllers.
Understand the temporal relationship between the Y3000 and the newer Y3500 models.
Connect the compatibility statement to provide a complete answer that includes both the original requirement and the updated controller information.
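The difference between keyword matching and following shared references can be sketched with the three knowledge base entries above. This is a deliberately simple stand-in: the hand-written `ENTITIES` list and the `related_documents` helper are hypothetical, and a production system would rely on embeddings or entity extraction rather than exact string matches.

```python
DOCS = [
    "Product X requires firmware version 2.1 or higher when used with the Y3000 controller.",
    "The Y3000 controller was deprecated in November 2023 and replaced with the Y3500 model.",
    "Y3500 controllers are backward compatible with all products requiring Y3000 controllers.",
]

# Hypothetical, hand-written entity list used only for this illustration.
ENTITIES = ["Product X", "Y3000", "Y3500"]


def entities_in(text):
    """Return the known entities mentioned in a piece of text."""
    return {entity for entity in ENTITIES if entity in text}


def related_documents(question, docs):
    """Collect documents reachable from the question through shared entities."""
    frontier = entities_in(question)
    selected = []
    changed = True
    while changed:
        changed = False
        for doc in docs:
            if doc not in selected and entities_in(doc) & frontier:
                selected.append(doc)
                frontier |= entities_in(doc)  # follow newly seen entities
                changed = True
    return selected


question = "What firmware does Product X need?"
keyword_hits = [doc for doc in DOCS if "Product X" in doc]
context = related_documents(question, DOCS)
print(len(keyword_hits), "document(s) by keyword;", len(context), "by shared references")
```

The keyword filter finds only the first entry, while following the Y3000 and Y3500 references pulls in all three, which is the context a model needs to give the complete answer.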

This ability to draw context across multiple documents, shown in Figure 1-6, is a fundamental difference between GenAI and traditional data systems, whose data architectures generally cannot support these capabilities effectively.

Figure 1-6: How self-attention supports sophisticated cross-document understanding, connecting information across three documents to give a complete answer about firmware requirements

Understanding how GenAI models process and represent meaning is key to designing intelligent systems, but it is only part of the picture. As these technologies evolve from simple assistants into autonomous agents, organizations are responding by building new data architectures that bridge the gap between traditional systems and AI-ready infrastructure (see Table 1-1).

Table 1-1: Differences between traditional AI and generative AI

Characteristic	Traditional AI	Generative AI
Learning method	Supervised learning with labeled data	Self-supervised learning from raw data
Understanding	Syntactic pattern matching	Semantic contextual understanding
Representation	Feature vectors	Contextual embeddings in vector space
Capabilities	Classification, prediction, recommendation	Generation, reasoning, contextual understanding
Data requirements	Structured, labeled datasets	Diverse, contextual data with relationships
Infrastructure needs	Batch processing, data warehouses	Real-time retrieval, vector databases, context management

The Evolution of GenAI Applications

As generative AI technologies mature, their applications are evolving rapidly from simple assistants into increasingly autonomous agents. This progression is more than a technical advance; it represents a fundamental shift in how AI systems interact with data, users, and the world, and organizations must understand it to prepare their data infrastructure for current and future AI capabilities.

From Assistants to Agents: The Four Stages of GenAI Evolution

The evolution of GenAI applications can be understood as four distinct stages (see Figure 1-7). Each stage builds on the capabilities of the previous one while introducing new data requirements and organizational challenges.

Figure 1-7: The four stages of GenAI evolution

Stage 1: AI assistants—information retrieval and basic task completion

The first generation of GenAI applications focused on responding to direct user queries, providing information, and completing simple tasks. These assistants, such as early versions of ChatGPT and Gemini, excel at:

Answering factual questions based on training data
Generating content such as emails, summaries, or creative writing
Explaining concepts or procedures
Translating between languages

These systems rely mainly on pretrained knowledge and have limited ability to access or reason over new information. Their data requirements are relatively straightforward: they need high-quality training data but have little need for real-time data integration or complex knowledge management.

Stage 2: Copilot assistants—collaborative work with humans

In the second stage, GenAI systems evolve into copilots that work alongside humans and augment their capabilities in specific domains. Examples include GitHub Copilot for coding, Microsoft 365 Copilot for productivity, and Adobe Firefly for creative work. These copilots:

Understand domain-specific contexts and terminology
Generate domain-appropriate content and suggestions
Learn from user feedback and preferences
Integrate with existing tools and workflows

Copilots introduce more complex data requirements, including access to domain-specific knowledge bases, an understanding of proprietary data formats, and the ability to maintain context across user sessions. They begin to blur the line between general and specialized AI systems and call for more sophisticated data integration strategies.

Stage 3: RAG-based agents—enhanced knowledge and contextual understanding

The third stage introduces retrieval-augmented generation to overcome the limitations of pretrained knowledge. RAG-based agents can:

Access and reason over enterprise knowledge bases in real time
Provide up-to-date information beyond the training cutoff
Ground responses in specific documents or data sources
Connect information from previously siloed sources

RAG significantly increases the complexity of data requirements, calling for:

Vector databases for semantic search
Embedding pipelines that convert documents into vector representations
Metadata management for source attribution
Content refresh mechanisms that keep information current
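The four requirements listed above can be seen working together in a very small in-memory sketch. Everything here is an assumption made for illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, `TinyVectorIndex` is a hypothetical class rather than any vector database’s API, and the policy texts are invented.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class TinyVectorIndex:
    """Hypothetical in-memory index: vectors plus metadata per document."""

    def __init__(self):
        self.records = {}  # doc_id -> (vector, text, metadata)

    def upsert(self, doc_id, text, metadata):
        """Insert or refresh a document so stale copies are overwritten."""
        self.records[doc_id] = (embed(text), text, metadata)

    def search(self, query, top_k=1):
        """Return the top_k most similar documents with their metadata."""
        query_vec = embed(query)
        ranked = sorted(
            self.records.items(),
            key=lambda item: cosine(query_vec, item[1][0]),
            reverse=True,
        )
        return [(doc_id, text, meta) for doc_id, (_, text, meta) in ranked[:top_k]]


index = TinyVectorIndex()
index.upsert("policy-1", "refunds are accepted within 30 days",
             {"source": "returns.pdf", "version": 1})
index.upsert("policy-2", "shipping takes five business days",
             {"source": "shipping.pdf", "version": 1})
# Content refresh: the returns policy changed, so the same id is re-embedded.
index.upsert("policy-1", "refunds are accepted within 45 days",
             {"source": "returns.pdf", "version": 2})

hit = index.search("how many days for refunds")[0]
print(hit)
```

The metadata returned with each hit is what allows an agent to attribute its answer to a source, and the `upsert` call is the simplest possible form of a refresh mechanism.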

Organizations implementing RAG-based agents find that their existing data architectures often cannot meet these new requirements, which leads to the implementation challenges discussed later in this chapter.

Stage 4: Agentic AI—autonomous decision making and action

The fourth stage of GenAI evolution, and the current frontier, is agentic AI: systems that can autonomously plan and execute complex tasks with minimal human supervision. These agents:

Break complex goals down into actionable steps
Access multiple tools and APIs to complete tasks
Make decisions based on real-time information
Learn from successes and failures to improve performance
Operate with increasing levels of autonomy

Agentic AI represents a major leap in data requirements and organizational readiness. Beyond access to information, these systems need:

Permission frameworks for controlling actions
Monitoring systems for oversight
Feedback mechanisms for learning
Security controls to prevent misuse
Trust verification for data sources
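A permission framework and a monitoring trail can be combined in a single gateway that every agent action must pass through. The sketch below is one possible shape under stated assumptions: the `AgentPolicy` and `ToolGateway` classes, the tool names, and the refund limit are all hypothetical and are not drawn from any specific agent framework.

```python
from dataclasses import dataclass, field


@dataclass
class AgentPolicy:
    """Hypothetical policy: which tools an agent may call, and a spend limit."""
    allowed_tools: set
    max_refund: float


@dataclass
class ToolGateway:
    """Checks every tool call against the policy and records it for oversight."""
    policy: AgentPolicy
    audit_log: list = field(default_factory=list)

    def call(self, tool, **args):
        allowed = tool in self.policy.allowed_tools
        if tool == "issue_refund" and args.get("amount", 0) > self.policy.max_refund:
            allowed = False
        # Log both permitted and denied attempts so humans can review them.
        self.audit_log.append({"tool": tool, "args": args, "allowed": allowed})
        if not allowed:
            raise PermissionError(f"{tool} denied by policy")
        return f"{tool} executed"


gateway = ToolGateway(
    AgentPolicy(allowed_tools={"lookup_order", "issue_refund"}, max_refund=100.0)
)
result = gateway.call("lookup_order", order_id="A1")
try:
    gateway.call("issue_refund", amount=500.0)
except PermissionError as error:
    denied = str(error)

print(result)
print(denied)
print(gateway.audit_log)
```

Routing actions through one checkpoint keeps the policy in a single place and produces the audit trail that oversight and feedback mechanisms depend on.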

Stage 4 is already emerging in specialized domains. Amazon’s million-scale robot fleet demonstrates autonomous decision making in logistics, making independent decisions about package handling, route optimization, and inventory management.

The Increasing Complexity of Data Requirements

Each stage of the evolution introduces new data requirements and challenges, as shown in Table 1-2.

Table 1-2: The evolution of generative AI

GenAI stage	Primary data focus	Key data requirements	Organizational challenges
AI assistants	Pretrained knowledge	High-quality training data	Managing expectations about knowledge limitations
Copilot assistants	Domain-specific knowledge	Integration with proprietary systems and formats	Balancing assistance with human expertise
RAG-based agents	Enterprise knowledge bases	Vector databases, embedding pipelines, metadata management	Breaking down data silos, ensuring data quality
Agentic AI	Multisource knowledge and action permissions	Permission frameworks, monitoring systems, trust verification	Security, governance, liability, observability and control

This evolutionary complexity explains why so many organizations struggle to move beyond experimentation. It also explains why organizations that have successfully implemented early GenAI applications can still run into difficulty with more advanced implementations: each stage demands fundamentally different data architecture considerations. By studying the patterns that emerge from successful implementations, however, we can identify concrete architectural approaches for meeting these escalating requirements.

What Leading Organizations Are Building Today

Although the gap between experimentation and production remains significant, with fewer than 10% of organizations currently scaling GenAI successfully, clear trends have emerged among those that succeed. Leading organizations are not waiting for perfect solutions. They are building systems that implement distinct architectural patterns, turning traditional data infrastructure into AI-ready foundations that help them move from proof of concept to production.

The Production Reality Check

The 2025 edition of the McKinsey Global Survey on AI reveals a clear implementation gap. Although 88% of organizations now use AI in at least one function, up from 72% in 2024, the transition from pilot to production remains difficult. Success in capturing meaningful enterprise value varies widely with use case complexity: 15–20% of organizations have successfully scaled AI for document processing and customer service; 10–15% have moved beyond experimentation in content generation and data analysis; and fewer than 10% (currently estimated at the 6% classed as “high performers”) have successfully deployed AI for complex decision-making or autonomous agents.

We have seen these challenges firsthand in our own work. One client had to roll back its GenAI deployment, not because of model or prompt issues but because of poor data quality. Everything performed well in the test environment, but real production use exposed data inconsistencies that brought the entire project to a halt. The client is now focused on rebuilding its data foundation.

We have also led successful implementations, such as a variance analysis solution built for a chief financial officer’s office that processes enterprise resource planning (ERP) data, calculates variances, and uses AI to draft financial commentaries. In another case, we helped a client build a marketing chat agent that generates localized content for field teams.

Five distinct architectural patterns emerged from these real implementations. They are not theoretical frameworks but proven approaches to overcoming the limitations of traditional data architectures in meeting the specific demands of GenAI.

Emerging GenAI Architectural Patterns

This section explores the five emerging architectural patterns in detail.

Pattern 1: Knowledge graphs for contextual intelligence

What it is

Knowledge graphs connect disparate data sources through explicit relationship mapping, creating a semantic network that supports cross-domain reasoning. Where traditional databases store isolated records, knowledge graphs represent entities (customers, products, transactions) as nodes and their relationships (purchased, returned, recommended) as the edges that connect them.

Why it matters for AI

GenAI agents need to understand context that spans organizational boundaries. When a customer service agent asks “Why did this customer return three orders last month?”, the answer requires connecting customer data, order history, product specifications, shipping logistics, and support tickets, data that usually lives in separate systems. Knowledge graphs make these connections explicit and queryable.

Field example

A financial services organization implemented a knowledge graph connecting customer profiles, transaction histories, market data, and regulatory requirements. Its RAG-based compliance agent can now answer questions such as “Which customers are affected by the new EU regulation?” by traversing relationships across previously siloed datasets. Analysis time dropped from days to minutes.
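The kind of traversal described in the field example can be illustrated with a small adjacency list and a breadth-first search. The node identifiers, relationship names, and `reachable` helper below are invented for this sketch; a real deployment would express the same question in a graph database’s own query language.

```python
from collections import deque

# Hypothetical graph: node -> list of (relationship, neighbor) edges.
EDGES = {
    "regulation:EU-123": [("applies_to", "product:bond-fund")],
    "product:bond-fund": [
        ("held_by", "customer:alice"),
        ("held_by", "customer:bob"),
    ],
    "product:equity-fund": [("held_by", "customer:carol")],
}


def reachable(start, edges, prefix):
    """Breadth-first traversal returning reachable nodes whose id starts with prefix."""
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        node = queue.popleft()
        for _relation, neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
                if neighbor.startswith(prefix):
                    found.append(neighbor)
    return found


affected = reachable("regulation:EU-123", EDGES, "customer:")
print(affected)
```

Because the relationships are stored explicitly, the question becomes a path-following problem rather than a series of joins across separate systems; the customer who holds only the unaffected product never appears in the result.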

Data requirements

Graph databases for relationship storage, such as Neo4j and Amazon Neptune
Entity resolution pipelines to identify and merge duplicate entities
Ontology management for consistent relationship definitions
Continuous synchronization from source systems

This pattern is visualized in Figure 1-8.

Evolution to data mesh

Leading organizations are extending this pattern by treating knowledge graphs as federated data products. Rather than centralizing all data in a single graph, they create domain-specific knowledge graphs, such as a customer graph, a product graph, and a supply chain graph, that agents can discover and query autonomously. This data mesh approach scales without centralized bottlenecks.

Figure 1-8: Knowledge graphs for contextual intelligence

Pattern 2: Event-driven foundations for real-time AI

What it is

Event-driven architectures replace batch processing with streaming data pipelines that capture and propagate changes as events occur. Every business event, whether a customer purchase, an inventory update, a price change, or a support ticket, flows through event streams that multiple systems can consume in real time.

Why it matters for AI

Agents need fresh context, not stale data from last night’s batch job. When a customer asks “Is this item in stock?”, the answer must reflect inventory changes from seconds ago, not yesterday’s snapshot. Event-driven architectures keep the information agents work with current.

Field example

A retail organization replaced its nightly batch extract, transform, load (ETL) process with Kafka-based event streaming. Customer behavior events (page views, cart additions, purchases) flow continuously into a vector database and update product embeddings in near real time. The recommendation agent can respond to trending products within minutes rather than days, which improves conversion rates.

Data requirements

Event streaming platforms, such as Kafka and Amazon Kinesis
Change data capture (CDC) from source systems
Stream processing frameworks for transformation, such as Flink and Spark Streaming
Low-latency vector database updates

Figure 1-9 visualizes this pattern.

Figure 1-9: Event-driven foundations for real-time AI

Real-time context refresh

The most sophisticated implementations extend this pattern further by updating vector embeddings continuously as source documents change. When a product specification is updated, the embedding pipeline automatically regenerates the vectors and updates the semantic search index, so agents are far less likely to retrieve outdated information.
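The refresh loop described above can be sketched as an event handler that re-embeds a document whenever a change event arrives. This is an illustration under stated assumptions: the in-memory `deque` stands in for a stream such as a Kafka topic, `embed` is a toy placeholder for an embedding model, and the event fields (`doc_id`, `version`, `text`) are hypothetical.

```python
from collections import deque


def embed(text):
    """Toy stand-in for an embedding model: (word count, character count)."""
    return (len(text.split()), len(text))


search_index = {}  # doc_id -> {"text", "vector", "version"}

# An in-memory queue standing in for an event stream.
events = deque([
    {"type": "doc_updated", "doc_id": "spec-7", "version": 1,
     "text": "Max load 40 kg"},
    {"type": "doc_updated", "doc_id": "spec-7", "version": 2,
     "text": "Max load 55 kg after redesign"},
    # A late, duplicate delivery of the older version.
    {"type": "doc_updated", "doc_id": "spec-7", "version": 1,
     "text": "Max load 40 kg"},
])


def handle(event, index):
    """Re-embed a changed document, ignoring events older than what is indexed."""
    current = index.get(event["doc_id"])
    if current and current["version"] >= event["version"]:
        return False  # stale or duplicate event; keep the newer entry
    index[event["doc_id"]] = {
        "text": event["text"],
        "vector": embed(event["text"]),
        "version": event["version"],
    }
    return True


applied = [handle(events.popleft(), search_index) for _ in range(len(events))]
print(applied)
print(search_index["spec-7"]["text"])
```

The version check matters because streaming systems can deliver events late or more than once; without it, the stale specification would overwrite the current one.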

Pattern 3: Lakehouse as unified AI platform

What it is

Lakehouse architecture merges data lakes and data warehouses into a unified platform that supports both analytics and AI workloads. Built on open table formats such as Apache Iceberg, Delta Lake, and Apache Hudi, lakehouses provide ACID transactions, schema evolution, and time travel (the ability to query historical snapshots of data) on top of data lake storage, combining the flexibility of lakes with the reliability of warehouses.

Why it matters for AI

Traditional architectures force organizations to choose between data lakes (flexible, unstructured) and data warehouses (structured, performant). GenAI needs both: unstructured documents for RAG, structured data for analysis, and the ability to join them seamlessly. Lakehouses make this possible.

Field example

A healthcare organization consolidated 15 years of clinical research data into an Iceberg-based lakehouse, including structured trial results, unstructured research papers, patient records, and regulatory documents. Its research agents can now query across all data types through a single interface, joining structured trial outcomes with unstructured researcher notes. Data preparation time dropped from weeks to hours.

Data requirements

Open table formats for unified storage, such as Iceberg, Delta Lake, and Hudi
Multi-engine access for diverse workloads, such as Spark, Trino, and Athena
Unified governance across all data, such as Lake Formation and Unity Catalog
Support for diverse data types, including structured, semistructured, and unstructured

Figure 1-10 illustrates this pattern.

Figure 1-10: The lakehouse as a unified AI platform

Multimodal evolution

Leading organizations are extending lakehouses to support multimodal data, with text, images, audio, video, and structured data on a single platform. This lets agents reason across modalities, for example analyzing product images alongside customer reviews and sales data, or correlating audio transcripts with structured call center metrics.

Pattern 4: Semantic search as the new query layer

What it is

Semantic search replaces traditional keyword-based retrieval with vector-based similarity search. Documents are converted into high-dimensional embeddings that capture meaning, enabling retrieval based on conceptual similarity rather than exact word matches.

Why it matters for AI

The quality of RAG-based agents depends on their retrieval systems. When a user asks “How do we handle European data privacy requirements?”, keyword search may miss documents that discuss “GDPR compliance” or “EU data protection regulations.” Semantic search recognizes these as related concepts and retrieves the relevant documents.

Field example

A manufacturing company built a semantic search layer over 30 years of engineering specifications. Its design agent can now find similar components across decades of documentation, even where terminology has changed. Engineers can ask in natural language, “Find lightweight materials suitable for high-temperature applications,” and receive relevant specs from documents that never use those exact words.

Data requirements

Vector databases for embedding storage, such as Pinecone, OpenSearch, and pgvector
Embedding models for document vectorization, such as text-embedding-ada-002 and Titan Embeddings
Hybrid search capabilities that combine semantic and keyword retrieval
Metadata filtering for access control and relevance refinement

Implementation insight

The most effective implementations use hybrid search, combining semantic similarity, keyword matching, and metadata filters. This ensures that the documents agents retrieve are both conceptually relevant and a precise match for specific criteria such as date ranges, document types, and security classifications.
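One simple way to combine the three signals is a metadata filter followed by a weighted sum of a semantic score and a keyword score. The sketch below assumes precomputed unit-length vectors, an invented weighting parameter `alpha`, and made-up documents; production systems often use more refined fusion methods, so treat this only as an illustration of the idea.

```python
def hybrid_search(query_terms, query_vector, documents, allowed_types, alpha=0.7):
    """Rank documents by a weighted mix of semantic and keyword scores.

    alpha (an assumed tuning parameter) is the weight given to the
    semantic score; the remainder goes to keyword overlap.
    """
    results = []
    for doc in documents:
        if doc["type"] not in allowed_types:  # metadata filter
            continue
        # Dot product equals cosine similarity for unit-length vectors.
        semantic = sum(q * d for q, d in zip(query_vector, doc["vector"]))
        words = set(doc["text"].lower().split())
        keyword = len(words & query_terms) / len(query_terms)
        results.append((alpha * semantic + (1 - alpha) * keyword, doc["id"]))
    return [doc_id for _score, doc_id in sorted(results, reverse=True)]


# Hypothetical documents with made-up, unit-length 2D embeddings.
documents = [
    {"id": "gdpr-guide", "type": "policy",
     "text": "GDPR compliance checklist", "vector": (0.8, 0.6)},
    {"id": "privacy-memo", "type": "policy",
     "text": "European data privacy requirements memo", "vector": (0.6, 0.8)},
    {"id": "draft-notes", "type": "draft",
     "text": "European privacy brainstorm", "vector": (1.0, 0.0)},
]

ranked = hybrid_search(
    query_terms={"european", "data", "privacy"},
    query_vector=(0.8, 0.6),
    documents=documents,
    allowed_types={"policy"},
)
print(ranked)
```

The GDPR checklist is still retrieved even though it shares no words with the query, the memo ranks first because it matches on both signals, and the draft is excluded by the metadata filter regardless of its score.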

Pattern 5: Agent-ready digital properties

What it is

Organizations are redesigning their digital properties (websites, products, APIs, and documentation) so that they can be consumed by AI agents as well as by humans. This includes structured data markup (Schema.org), machine-readable APIs, and agent-friendly documentation that supports autonomous discovery and interaction.

Why it matters for AI

In the emerging agent-driven economy, customers will not browse your website; their agents will. If your product information is not machine-readable, those agents are likely to recommend a competitor’s product instead. Agent-ready properties keep your business discoverable and accessible in an AI-mediated world.

Field example

An ecommerce company restructured its product catalog with Schema.org markup, created OpenAPI specifications for all of its services, and built an agent-accessible API layer. When shopping agents query “Find ergonomic office chairs under $500 with next-day delivery,” the system returns structured product data that agents can compare and act upon directly. Agent-driven sales now account for 12% of revenue and are growing.

Data requirements

Structured data markup for content, such as Schema.org and JSON-LD
API-first design based on OpenAPI or GraphQL specifications
Agent-friendly documentation, meaning machine-readable and not only human-readable
Authentication and rate limiting for agent access
Usage analytics for understanding agent behavior patterns
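Structured data markup of the kind listed above can be generated directly from catalog records. The following sketch builds a Schema.org `Product` description as JSON-LD; the `product_jsonld` helper, the product name, and the SKU are hypothetical, while the property names (`name`, `sku`, `offers`, `price`, `priceCurrency`, `availability`) come from the Schema.org vocabulary.

```python
import json


def product_jsonld(name, sku, price, currency, in_stock):
    """Build a Schema.org Product description as a JSON-LD dictionary."""
    availability = (
        "https://schema.org/InStock" if in_stock else "https://schema.org/OutOfStock"
    )
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "sku": sku,
        "offers": {
            "@type": "Offer",
            "price": f"{price:.2f}",
            "priceCurrency": currency,
            "availability": availability,
        },
    }


# Hypothetical catalog entry used only for illustration.
markup = json.dumps(
    product_jsonld("Ergonomic office chair", "CHAIR-001", 449.0, "USD", True),
    indent=2,
)
print(markup)
```

Embedded in a product page or returned from an API, a description like this gives a shopping agent typed fields for price and availability instead of free text it would have to interpret.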

Strategic imperative

This pattern represents a fundamental shift in how organizations think about their digital presence. Just as mobile-first design became essential in the smartphone era, agent-ready design is becoming critical in the AI era. Organizations that delay this transformation risk becoming invisible in future customer interactions mediated by agents.

From Patterns to Practice

The five patterns are not mutually exclusive. The most successful implementations often combine several of them. A retail organization might, for example, use knowledge graphs to connect customer and product data (pattern 1), event streaming to keep that data current (pattern 2), a lakehouse to unify structured and unstructured sources (pattern 3), semantic search to support agent retrieval (pattern 4), and agent-ready APIs to support external access (pattern 5).

The key insight is that these patterns address a fundamental mismatch between traditional data architectures and GenAI requirements. Traditional systems are optimized for human analysts running SQL queries on structured data. GenAI systems need semantic understanding, real-time context, cross-domain reasoning, and autonomous access, capabilities that call for fundamentally different architectural approaches.

Organizations that implement these patterns successfully tend to share certain traits: they start with a single pattern aligned to their highest-value use case, prove value quickly, usually within three to six months, and then expand to additional patterns as their AI capabilities mature. They do not wait for perfect solutions; they build, learn, and iterate.

This iterative approach to architectural implementation reflects a deeper difference between GenAI and traditional machine learning (ML) projects. To implement the architectural patterns outlined here successfully, organizations must also understand how GenAI development cycles differ fundamentally from traditional ML approaches. These process differences directly affect how and when the patterns should be implemented.

Development Cycles: ML Versus GenAI

Understanding the fundamental differences between traditional machine learning and GenAI development cycles is essential for organizations that want to move from experimentation to production deployment. These differences explain why successful ML practices often fail when applied to GenAI implementations.

Traditional Machine Learning Development Cycle

The traditional ML development cycle, shown in Figure 1-11, follows a well-established pattern that has matured over decades of practice.

Figure 1-11: The traditional machine learning development cycle

The steps are:

Problem definition

In the initial stage, teams articulate the business challenge clearly and define specific, measurable outcomes. This establishes the scope and success criteria for the whole project. Key requirements at this stage include:

A clearly defined business problem with specific metrics
A narrow scope focused on a single prediction or classification task
An explicit definition of inputs and outputs
Clear success criteria based on accuracy, precision, recall, and similar metrics

Data investigation

In the discovery phase, data scientists explore the available data sources, understand their structure, and assess their quality and relevance. Key activities at this stage include:

Analyzing structured data from enterprise systems
Identifying and selecting relevant features
Performing statistical analysis to understand data distributions
Defining training, validation, and test datasets

Data preparation

This critical preprocessing stage converts raw data into a format suitable for machine learning algorithms, resolves quality issues, and creates optimized inputs. Key activities include:

Feature engineering to create model inputs
Data cleansing and normalization
Handling missing values and outliers
Creating balanced datasets for training

Development

In the model development phase, data scientists select and implement appropriate algorithms, tune parameters, and iteratively improve performance based on validation results. This stage involves:

Selecting appropriate algorithms, such as random forest and gradient boosting
Hyperparameter tuning and optimization
Model training on labeled datasets
Iterative improvement based on validation results

Evaluation

The evaluation stage rigorously tests model performance against objective metrics using held-out data to ensure reliability and effectiveness. Key tasks at this stage include:

Rigorous testing against held-out test data
Performance measurement using established metrics
Comparison with baseline models
Statistical significance testing
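The established metrics referred to in this stage have standard definitions that fit in a few lines. The sketch below computes accuracy, precision, and recall for a binary classifier; the label lists are invented held-out data, and in practice a library such as scikit-learn would normally be used instead of hand-written code.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical held-out labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Metrics like these give traditional ML an objective pass or fail signal, which is one reason its evaluation stage is more clear-cut than the human assessment that GenAI outputs usually require.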

Deployment

In the deployment stage, the validated model is prepared for production use and integrated with existing systems, and documentation is created for operations teams. Key activities include:

Model serialization and packaging
Integration with production systems
Batch or real-time inference setup
Creating documentation and handing over to operations

Monitoring and improvement

The final, ongoing stage establishes continuous monitoring processes to sustain performance and support iterative enhancements. It includes:

Performance monitoring against established metrics
Drift detection and model retraining
A/B testing of model improvements
Establishing feedback loops to support continuous enhancement

This traditional cycle is well suited to predictive and classification tasks with clear inputs and outputs. It depends heavily on structured data and explicit feature engineering and measures success through objective performance metrics. The process is inherently iterative: insights from monitoring and improvement feed back into problem definition and drive continuous refinement.

GenAI Development Cycle

相比之下，GenAI development cycle 如图 1-12 所示，引入了本质上不同的 patterns 和 requirements。

图 1-12：generative AI development cycle

这个 cycle 也可以分为七个 stages：

Use case definition and model selection

第一阶段聚焦于定义更广泛的 goals，例如 “improve customer support”，并选择与 use case requirements 和 organizational constraints 对齐的 appropriate foundation models，例如 GPT、Claude、Llama 等。Additional tasks 包括：

考虑 model capabilities、limitations 和 biases
评估 hosting options，例如 API services 与 self-hosting
定义超越 traditional metrics 的 success criteria

Knowledge integration and context engineering

This GenAI-specific stage involves connecting enterprise knowledge sources to the foundation model and ensuring that relevant information is properly represented and accessible. This is where the five architectural patterns described in “What Leading Organizations Are Building Today” become critical. Key tasks include:

Identifying enterprise knowledge sources
Developing document processing and chunking strategies
Embedding generation and vector database selection
Metadata enrichment and retrieval strategy design
Integration of structured and unstructured data sources
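To make the chunking task in this list concrete, here is a minimal sketch of one common strategy: fixed-size, word-based chunks with overlap, so that a sentence cut at a boundary still appears whole in a neighboring chunk. The function name, the default sizes, and the `start_word` metadata field are assumptions chosen for illustration; production pipelines often chunk by tokens, sentences, or document structure instead.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping word-based chunks for later embedding."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        # Keep the starting offset as simple metadata for later retrieval.
        chunks.append({"start_word": start, "text": " ".join(chunk)})
        if start + chunk_size >= len(words):
            break  # the final chunk already reaches the end of the document
    return chunks

# Hypothetical 500-word document.
document = " ".join(f"word{i}" for i in range(500))
chunks = chunk_text(document, chunk_size=200, overlap=50)
```

Each chunk would then be passed to an embedding model and stored alongside its metadata in the vector database selected for the use case.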

Prompt engineering and system design

This critical design phase focuses on writing effective instructions for the model and establishing frameworks that ensure appropriate responses and safeguards. It involves:

Designing effective prompts and system instructions
Creating few-shot examples and templates
Developing evaluation and filtering mechanisms
Integrating with external tools and APIs
Implementing guardrails and safety measures

Fine-tuning and customization (optional)

In this optional but powerful stage, foundation models are adapted to specific domains or tasks through additional specialized training. It includes:

Creating fine-tuning datasets
Selecting appropriate fine-tuning techniques
Using parameter-efficient tuning methods, such as LoRA and QLoRA
Evaluation of fine-tuned model performance
Trade-off analysis between fine-tuning and RAG approaches

Evaluation and alignment

This phase combines objective metrics with human assessment to ensure that outputs meet quality, safety, and alignment standards across diverse scenarios. It involves:

Human evaluation of outputs for quality and relevance
Red teaming for safety and security vulnerabilities
Ensuring alignment with organizational values and guidelines
Testing across diverse scenarios and edge cases
Evaluation of hallucination rates and factual accuracy

Deployment and integration

The deployment stage focuses on integrating the GenAI system with existing workflows while establishing proper monitoring, feedback mechanisms, and user training. Key tasks include:

Integration with existing workflows and systems
Implementation of monitoring and logging
Establishment of feedback collection mechanisms
User training and change management
Implementation of governance and compliance controls

Continuous learning and improvement

This ongoing phase emphasizes the dynamic nature of GenAI systems: feedback is collected continuously, and the system and its knowledge bases are updated regularly. It involves:

Collection and analysis of user feedback
Monitoring of model performance and usage patterns
Identification of failure modes and edge cases
Regular updates to knowledge bases and context
Adaptation to evolving user needs and expectations

Table 1-3 summarizes the structural differences between the traditional ML and GenAI development cycles.

Table 1-3: Key differences between the ML and GenAI development cycles

Aspect	Traditional ML cycle	GenAI development cycle
Problem scope	Narrow, well-defined prediction tasks	Broad, open-ended generation and reasoning tasks
Data requirements	Structured, labeled datasets	Diverse knowledge sources, both structured and unstructured
Model development	Custom models built from scratch	Adaptation of pre-trained foundation models
Engineering focus	Feature engineering	Context engineering and prompt design
Evaluation methods	Objective metrics, such as accuracy and F1	Subjective assessment and human evaluation
Deployment pattern	Static model deployment with periodic retraining	Dynamic systems with continuous knowledge updates
Success criteria	Quantitative performance metrics	User satisfaction and business value metrics
Failure modes	Gradual performance degradation	Catastrophic failures, such as hallucinations and harmful outputs
Governance needs	Model documentation and monitoring	Comprehensive safety, ethics, and alignment controls

Data Infrastructure Implications

The development cycle differences outlined in Table 1-3 directly affect data infrastructure requirements and create fundamental challenges for organizations trying to reuse existing ML investments for GenAI applications. The five architectural patterns we saw earlier address exactly these implications:

Knowledge management versus feature stores

Traditional ML relies on feature stores that hold structured, engineered features, whereas GenAI needs comprehensive knowledge management systems that preserve context and relationships. One telecommunications company found that its carefully engineered features, such as “average call duration,” were a poor fit for its GenAI customer service assistant, because those features stripped away the contextual richness that helpful responses require.

Solution: Knowledge graphs (pattern 1) preserve the relationships and context that feature stores eliminate.

Static versus dynamic data access

ML models access static datasets that change only during periodic retraining, but GenAI needs real-time awareness. One global bank’s GenAI system kept serving outdated policy information for weeks after a regulatory update, creating compliance exposure.

Solution: Event-driven architectures (pattern 2) ensure that agents always work with current information.
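The idea behind the event-driven pattern can be sketched in a few lines: instead of waiting for a periodic reload, the context that an assistant reads is updated the moment a change event is published. The `EventBus` class, the topic name, and the policy record below are all hypothetical stand-ins; a real deployment would use a managed event stream or message broker rather than an in-process bus.

```python
from collections import defaultdict

class EventBus:
    """Tiny in-process publish/subscribe bus (a stand-in for a real event stream)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._handlers[topic]:
            handler(event)

# Hypothetical context store that a GenAI assistant would read at answer time.
policy_context = {"overdraft_fee": {"text": "Fee is $35 per item.", "version": 1}}

def on_policy_updated(event):
    current = policy_context.get(event["policy_id"], {"version": 0})
    # Ignore late or duplicate events so older text never overwrites newer text.
    if event["version"] > current["version"]:
        policy_context[event["policy_id"]] = {
            "text": event["text"],
            "version": event["version"],
        }

bus = EventBus()
bus.subscribe("policy_updated", on_policy_updated)
bus.publish("policy_updated",
            {"policy_id": "overdraft_fee", "text": "Fee is $25 per item.", "version": 2})
# A stale event that arrives late is ignored.
bus.publish("policy_updated",
            {"policy_id": "overdraft_fee", "text": "Fee is $35 per item.", "version": 1})
```

The version check is a design choice worth keeping in any real implementation, because event streams can deliver messages late or more than once.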

Batch versus interactive processing

ML workflows optimize throughput through batch processing, but GenAI demands low-latency interactive processing. One manufacturing company’s overnight data pipeline could not support its GenAI maintenance assistant, which needed subsecond response times.

Solution: Lakehouse architectures (pattern 3) support both batch analytics and interactive AI workloads.

Structured versus contextual storage

Traditional ML data is optimized for tabular storage, but GenAI needs unstructured, context-rich information. One healthcare provider’s systems stored diagnosis codes but could not retain the narrative context of patient histories.

Solution: Semantic search (pattern 4) supports retrieval based on meaning rather than structure.
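Retrieval based on meaning usually comes down to comparing embedding vectors. The sketch below ranks a few notes by cosine similarity to a query vector. The three-dimensional vectors and the note texts are invented for illustration; in practice the vectors would come from an embedding model and the search would run inside a vector database.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors invented for illustration, keyed by the text they represent.
notes = {
    "patient reports chest tightness after exercise": [0.9, 0.1, 0.2],
    "follow-up on seasonal allergy medication": [0.1, 0.9, 0.1],
    "billing address updated": [0.0, 0.1, 0.9],
}

def search(query_vector, top_k=2):
    """Return the top_k (score, text) pairs, most similar first."""
    scored = [(cosine_similarity(query_vector, vec), text)
              for text, vec in notes.items()]
    return sorted(scored, reverse=True)[:top_k]

# Hypothetical embedding of a query about exertion-related symptoms.
results = search([0.8, 0.2, 0.1])
```

Note that the query does not need to share any keywords with the top result; only the vectors need to be close.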

Metric-driven versus feedback-driven improvement

ML systems improve through metric optimization, whereas GenAI evolves through ongoing user feedback and continuous learning. One retail organization’s recommendation system improved on conversion metrics, but its GenAI shopping assistant had to incorporate subjective feedback and keep its product knowledge continuously updated.

Solution: Agent-ready architectures (pattern 5) include feedback loops and continuous learning mechanisms.

Architecture Evolution in Practice: From Traditional ETL to GenAI-Ready Pipelines

To truly unlock the transformative potential of GenAI, enterprises must move beyond the traditional AI/ML and big data paradigms (Figure 1-13) toward a GenAI data foundation (Figure 1-14). Generative AI requires a fundamentally new approach to how data is stored, processed, retrieved, and governed: a shift from structured, batch-oriented pipelines to flexible, real-time, context-aware architectures designed specifically for unstructured and dynamic content.

Figure 1-13: Traditional data foundation

Figure 1-14: GenAI data foundation

Key Differences Between Traditional and GenAI Data Foundations

Before diving into the real-world example, let’s quickly review the key differences. Table 1-4 provides an overview.

Table 1-4: Key differences between traditional and GenAI data foundations

Aspect	Traditional data foundation	GenAI data foundation
Data types	Primarily structured, some unstructured	Heavily unstructured, with complex relationships
Context requirements	Limited context, often single-record processing	Rich contextual relationships across data sources
Query patterns	Predictable, explicit queries	Semantic, similarity-based, and context-aware retrieval
Processing mode	Primarily batch, some streaming	Real-time, interactive, with dynamic context windows
Storage paradigm	Table-oriented (rows/columns)	Vector-based, with semantic relationships
Update frequency	Periodic refreshes	Continuous integration of new knowledge
Schema requirements	Rigid schemas required	Schema-flexible, with an emphasis on embeddings
Governance focus	Access control and compliance	Ethics, bias detection, and responsible use
Retrieval method	Exact match, keyword-based	Similarity-based, semantic understanding
Scaling dimension	Volume (petabytes of structured data)	Context (maintaining relationships at scale)

Real-World Example: Evolving from Kimball to Medallion to GenAI-Ready Architecture

How do organizations actually implement the architectural evolution that GenAI success requires? Let’s examine a concrete case study that illustrates this transformation.

The challenge: multi-billion-row fact tables in the GenAI era

A global retailer with more than 3 billion rows of transaction history was migrating from a traditional Kimball star schema to a medallion architecture with bronze (raw), silver (cleansed and standardized), and gold (curated) data layers, as shown in Figure 1-15. At the same time, it was building its first GenAI applications.

Figure 1-15: Migrating from a traditional star schema to a medallion architecture

The existing model looked like this:

-- Traditional dimensional model with separate fact and dimension tables
CREATE TABLE gold.fact_sales (
  transaction_id BIGINT,
  date_key INT,
  product_key INT,
  store_key INT,
  customer_key INT,
  quantity INT,
  amount DECIMAL(12,2)
);
CREATE TABLE gold.dim_product (
  product_key INT,
  product_id VARCHAR(50),
  product_name VARCHAR(100),
  category VARCHAR(50),
  subcategory VARCHAR(50),
  brand VARCHAR(50)
);
CREATE TABLE gold.dim_store (
  store_key INT,
  store_id VARCHAR(20),
  store_name VARCHAR(100),
  region_id VARCHAR(10),
  city VARCHAR(50),
  state VARCHAR(2)
);

This traditional approach works well for historical reporting, but it poses several challenges for GenAI applications:

Product descriptions and details are confined to structured fields.
There is no semantic understanding of product relationships.
Natural language queries about products are not supported.
Image data and rich content are not supported.
Updates are batch-oriented, which can lead to stale data.

The solution: a hybrid approach for GenAI readiness

The organization implemented a hybrid approach that retained its core dimensional assets, keeping business definitions consistent across the enterprise, while creating GenAI-optimized data structures:

Purpose-built, denormalized fact tables aligned to specific business domains.
Designs optimized for both human analytics and AI consumption.

For the implementation, it created two types of gold layer assets:

-- For data within SPICE limits (< 1B rows)
CREATE TABLE gold.daily_aggregated AS
SELECT
  date_key,
  region_id,
  product_category,
  SUM(amount) as daily_total,
  COUNT(DISTINCT customer_id) as customer_count
FROM silver.transactions
GROUP BY 1,2,3;
-- For detailed data exceeding SPICE limits
-- Use direct query mode
CREATE VIEW gold.transaction_details AS
SELECT * FROM silver.transactions;
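The split between the two gold assets can also be expressed as a simple routing rule. The sketch below assumes that SPICE refers to the in-memory engine of Amazon QuickSight and takes the 1-billion-row threshold directly from the SQL comment above; the function name, the mode labels, and the row estimates are hypothetical.

```python
SPICE_ROW_LIMIT = 1_000_000_000  # threshold taken from the SQL comment above

def choose_query_mode(dataset, estimated_rows, row_limit=SPICE_ROW_LIMIT):
    """Pick in-memory import for small datasets, direct query for large ones."""
    mode = "import" if estimated_rows < row_limit else "direct_query"
    return {"dataset": dataset, "mode": mode}

# Hypothetical row estimates for the two gold layer assets.
aggregated = choose_query_mode("gold.daily_aggregated", 40_000_000)
details = choose_query_mode("gold.transaction_details", 3_000_000_000)
```

Keeping the threshold in one named constant makes it easy to adjust if the limit for your BI tool or edition differs.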

To flatten data for AI consumption, it created denormalized tables built specifically for GenAI applications:

CREATE TABLE gold.store_performance AS
SELECT
  t.transaction_id,
  t.transaction_date,
  t.amount,
  p.product_name,
  p.category,
  p.subcategory,
  s.store_name,
  s.region_id,
  s.city,
  s.state
FROM silver.transactions t
JOIN silver.products p ON t.product_id = p.product_id
JOIN silver.stores s ON t.store_id = s.store_id;

GenAI-enhanced data pipeline architecture

To support GenAI applications, the organization implemented a comprehensive data pipeline that goes beyond traditional ETL. The pipeline implements all five patterns introduced earlier:

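# Python code for a GenAI-ready retail data pipeline
# Note: helpers such as extract_from_database() and generate_embeddings() are
# illustrative placeholders for your own integrations, not a real library API.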
# 1. Extract comprehensive product and transaction data
def extract_retail_data():
    # Extract structured data from traditional sources
    structured_data = extract_from_database("SELECT * FROM source.products")
    # Extract unstructured data from multiple sources
    product_descriptions = extract_from_cms("product_descriptions")
    product_images = extract_from_asset_manager("product_images")
    customer_reviews = extract_from_reviews_system("customer_reviews")
    social_media_mentions = extract_from_social_platforms("brand_mentions")
    return {
        "structured": structured_data,
        "descriptions": product_descriptions,
        "images": product_images,
        "reviews": customer_reviews,
        "social": social_media_mentions
    }
# 2. Process and enrich data for GenAI consumption
def process_for_genai(raw_data):
    # Process structured data
    products_df = process_structured_data(raw_data["structured"])
    # Generate text embeddings for product descriptions
    # (Pattern 4: Semantic Search)
    description_embeddings = generate_embeddings(
        raw_data["descriptions"],
        model="text-embedding-ada-002"
    )
    # Extract features from product images (Pattern 3: Multimodal Lakehouse)
    image_features = process_images_with_vision_model(raw_data["images"])
    # Analyze sentiment and extract topics from reviews
    review_insights = analyze_reviews_with_nlp(raw_data["reviews"])
    # Process social media for brand sentiment
    social_sentiment = analyze_social_sentiment(raw_data["social"])
    # Create unified product embeddings
    unified_embeddings = create_multimodal_embeddings(
        text=description_embeddings,
        images=image_features,
        reviews=review_insights,
        social=social_sentiment
    )
    return {
        "structured_data": products_df,
        "embeddings": unified_embeddings,
        "metadata": extract_metadata(raw_data),
        "relationships": identify_product_relationships(
            products_df, unified_embeddings
        ),
    }
# 3. Store in hybrid GenAI-optimized architecture
def store_genai_data(processed_data):
    # Store structured data in traditional warehouse (for reporting)
    store_in_warehouse(processed_data["structured_data"])
    # Store embeddings in vector database (Pattern 4: Semantic Search)
    store_in_vector_db(
        embeddings=processed_data["embeddings"],
        metadata=processed_data["metadata"]
    )
    # Store relationships in graph database (Pattern 1: Knowledge Graphs)
    store_in_graph_db(processed_data["relationships"])
    # Update real-time search index for AI applications
    update_search_index(processed_data)
    # Create knowledge graph for complex reasoning (Pattern 1)
    update_knowledge_graph(processed_data["relationships"])
# 4. Implement real-time updates for GenAI applications
def setup_realtime_genai_pipeline():
    # Set up event listeners for product updates (Pattern 2: Event-Driven)
    setup_event_listeners([
        "product_updates",
        "inventory_changes",
        "price_updates",
        "new_reviews",
        "social_mentions"
    ])
    # Implement change data capture for structured data (Pattern 2)
    setup_cdc_pipeline()
    # Monitor content management system for description updates
    monitor_cms_changes()

Results and benefits

When the organization later built an AI assistant to answer business questions, the hybrid approach proved highly valuable:

The AI could access domain-specific data efficiently while keeping definitions consistent across the enterprise.
Semantic search (pattern 4) made natural language queries such as “Show me products similar to our best-selling winter jackets” possible.
Real-time inventory context let the AI provide accurate availability information (pattern 2).
Cross-domain insights emerged from connecting product data, customer reviews, and social sentiment (pattern 1).

For example, when a store manager asks the AI assistant, “Today’s sales are 15% below forecast, what should I do?” the system can respond:

“Today’s sales pattern is similar to what we saw last quarter during the supply chain disruption. Based on historical data, promoting complementary products in categories X and Y typically recovers 8–12% of the shortfall. Current inventory levels support this strategy.”

Scaling strategies for the AI era

As the retailer’s data volumes grew from billions to trillions of rows, it implemented several strategies to maintain performance:

Domain-driven modeling

Create separate gold datasets for the merchandising, store operations, and finance teams (pattern 1, data mesh evolution).

Thoughtful partitioning

Partition transaction data by date and region (pattern 3).

Aggregation layers

Build daily, weekly, and monthly aggregates for common queries.

Metadata enrichment

Add clear business descriptions to all fields (pattern 5).

Vector optimization

Implement hierarchical vector indexes for faster similarity search (pattern 4).

Caching strategies

Implement intelligent caching for frequently accessed embeddings (pattern 2).
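As one illustration of the caching strategy, the sketch below wraps an embedding call in a least-recently-used (LRU) cache from the Python standard library. LRU is a deliberately simple stand-in for the “intelligent caching” described here, and `expensive_embedding` is a placeholder that returns a toy vector rather than calling a real model.

```python
from functools import lru_cache

calls = {"count": 0}  # tracks how often the expensive function actually runs

def expensive_embedding(text):
    """Placeholder for a real embedding model call; returns a toy vector."""
    calls["count"] += 1
    return tuple(float(ord(c) % 7) for c in text[:4])

@lru_cache(maxsize=10_000)
def cached_embedding(text):
    # Returning a tuple keeps the cached value immutable.
    return expensive_embedding(text)

# Hypothetical query stream with a frequently repeated query.
for query in ["winter jacket", "rain boots", "winter jacket", "winter jacket"]:
    cached_embedding(query)

info = cached_embedding.cache_info()
```

Here four lookups trigger only two embedding computations. A production cache would typically live in a shared store and would need an invalidation rule for products whose descriptions change.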

Why this matters: The data foundation for AI success

This approach makes the organization’s business logic available to both traditional business intelligence (BI) tools and GenAI applications. Placing calculations in the data transformation layer, rather than in opaque database procedures, made the logic more transparent and easier to document, and ultimately improved the accuracy of both dashboards and AI-generated insights.

By moving complex business logic out of opaque procedures and into transparent transformations in the gold layer, the organization created a single source of truth that visualization tools and AI models can both use consistently.

As the retailer’s GenAI implementation expanded, it found that data architecture decisions had far-reaching consequences:

Data quality became paramount. AI systems amplify any inconsistencies or errors in the data.
Context preservation was critical. Maintaining the relationships between data elements enabled more sophisticated AI reasoning (pattern 1).
Real-time updates were essential. Stale data led to outdated AI responses that eroded user trust (pattern 2).
Metadata richness enabled better AI performance. Well-documented data helped AI systems provide more accurate and relevant responses (pattern 5).

The evolution from traditional Kimball dimensional modeling to a GenAI-ready architecture illustrates the practical steps organizations must take to bridge the gap between experimental AI applications and production-ready systems. More importantly, it shows how the five architectural patterns introduced in this chapter work together in practice to build a comprehensive GenAI data foundation.

Preparing for the Agent-Driven Future

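The implementation approaches discussed so far meet current GenAI requirements. But as we look beyond today’s GenAI implementations toward a future increasingly shaped by autonomous AI agents, organizations must prepare their data infrastructure for more fundamental changes in how information is accessed, processed, and acted upon. This section explores the key trends that will define data readiness in an agent-driven world and offers guidance for organizations that want to position themselves well in the emerging landscape. Once again, the architectural patterns we have seen can help in navigating this transition.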

“Jeeves Does the Shopping”: The Agent-Mediated Consumer Experience

One of the most transformative scenarios emerging in agentic AI is what we might call the “Jeeves does the shopping” paradigm. In this paradigm, AI agents act as intermediaries between consumers and businesses, fundamentally changing how products and services are discovered, evaluated, and purchased.

In this scenario:

Consumers delegate purchasing decisions to AI agents that understand their preferences, budget constraints, and needs.
Agents negotiate directly with businesses on consumers’ behalf, comparing options across multiple providers.
Traditional marketing becomes less effective because agents filter promotional content based on objective criteria.
Product and service attributes become machine-readable requirements rather than emotional appeals.
Price transparency increases significantly because agents can instantly compare options across providers.

For organizations, especially in the retail and consumer services industries, this shift requires rethinking data readiness along several dimensions:

Product information

All product attributes must be structured in machine-readable formats that agents can parse and compare (pattern 5).

Pricing models

Dynamic pricing systems must be designed to interact with AI agents through standardized interfaces (pattern 5).

Inventory systems

Real-time inventory data must be accessible to agents to prevent fulfillment failures (pattern 2).

Service level agreements (SLAs)

Performance guarantees must be formalized in ways that agents can verify and enforce (pattern 5).

Search Engine Optimization: From Keywords to Agent Optimization

As AI agents increasingly mediate information discovery and decision making, traditional search engine optimization (SEO) will evolve into a new discipline that we might call agent optimization (AO).

This transformation fundamentally changes how organizations must structure their data to remain discoverable and relevant, as shown in Table 1-5.

Table 1-5: Traditional SEO versus agent optimization

Traditional SEO	Agent optimization
Keyword optimization	Structured data optimization
Content readability for humans	Machine-readable attribute standardization
Backlink authority	Verifiable trust credentials
User engagement metrics	Objective performance metrics
Emotional appeals	Quantifiable value propositions

This shift requires organizations to:

Implement comprehensive structured data markup using standards such as Schema.org and industry-specific ontologies (pattern 5).
Develop machine-readable performance metrics that agents can verify independently.
Create standardized interfaces for agent interactions that support automated comparison and negotiation (pattern 5).
Establish verifiable trust mechanisms that let agents assess reliability and quality (pattern 1).
Maintain impeccable data hygiene, because inconsistencies that humans might overlook can cause agents to exclude an offering from consideration.
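Structured data markup of the kind listed here is commonly published as JSON-LD using the Schema.org vocabulary. The following minimal sketch turns an internal product record into a Schema.org `Product` with a nested `Offer`. The input record, its field names, and the function name are hypothetical; the Schema.org types and properties used (`Product`, `Brand`, `Offer`, `sku`, `price`, `priceCurrency`, `availability`) are part of the published vocabulary.

```python
import json

def product_to_jsonld(product):
    """Map an internal product record to a Schema.org Product document."""
    availability = "InStock" if product["in_stock"] else "OutOfStock"
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "sku": product["sku"],
        "name": product["name"],
        "brand": {"@type": "Brand", "name": product["brand"]},
        "offers": {
            "@type": "Offer",
            "price": f'{product["price"]:.2f}',
            "priceCurrency": product["currency"],
            "availability": f"https://schema.org/{availability}",
        },
    }

# Hypothetical internal record.
record = {
    "sku": "WJ-1001",
    "name": "Insulated Winter Jacket",
    "brand": "ExampleBrand",
    "price": 129.0,
    "currency": "USD",
    "in_stock": True,
}

doc = product_to_jsonld(record)
markup = json.dumps(doc, indent=2)
```

The resulting string can be embedded in a product page or served from an API so that agents can parse and compare offers without scraping free text.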

Agents Have No Allegiance: Preparing for Radical Transparency

Perhaps one of the most profound implications of an agent-driven world is that AI agents have no inherent allegiance to specific brands, providers, or platforms. Human consumers form loyalty based on emotional connections, past experiences, or simple inertia, but agents make decisions based solely on optimization criteria and available data. This creates a marketplace defined by radical transparency and objective comparison, in which:

Switching costs approach zero because agents can instantly evaluate alternatives without emotional attachment.
Historical relationships confer little advantage unless they translate into objectively superior offerings.
Brand value derives from verifiable performance rather than emotional associations.
Hidden fees and quality issues cannot be masked by marketing or psychological manipulation.
Value must be explicitly quantifiable in terms that agents can process and compare.

For organizations accustomed to competing on brand loyalty, emotional connections, or information asymmetry, this poses an existential challenge. Succeeding in an agent-driven world requires restructuring data and operations around objective value delivery that can withstand algorithmic scrutiny. This is where the five architectural patterns become essential: they provide the foundation for competing in a radically transparent marketplace.

Business Autopilot: Autonomous Operations and Decision Making

Beyond consumer interactions, agentic AI also enables what we might call “business autopilot”: the autonomous execution of core business processes with minimal human intervention. This capability is emerging across multiple functions:

Supply chain

Autonomous inventory management, supplier selection, and logistics optimization (patterns 1, 2, and 3).

Marketing

Automated campaign creation, targeting, and performance optimization (patterns 2, 4, and 5).

Customer service

End-to-end issue resolution without human escalation (patterns 1, 4, and 5).

Financial operations

Automated cash flow management, investment decisions, and risk hedging (patterns 1, 2, and 3).

Product development

AI-driven feature prioritization and design iteration (patterns 1, 3, and 4).

For these autonomous operations to run effectively, organizations must develop data infrastructure that supports the following capabilities:

Clear decision boundaries and authorization frameworks that define where and how agents can act autonomously (pattern 5).
Real-time monitoring systems that track agent actions and outcomes (pattern 2).
Feedback mechanisms for continuous learning and improvement (pattern 5).
Explainability tools that help humans understand agent decisions (pattern 1).
Override capabilities that make human intervention available when necessary (pattern 5).
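Several of these capabilities (decision boundaries, monitoring, and human override) can be combined in a single gateway that every agent action passes through. The sketch below is one possible shape for such a gateway; the class names, decision labels, and the order-value limit are assumptions made for illustration, not a prescribed design.

```python
from dataclasses import dataclass, field

@dataclass
class AgentPolicy:
    allowed_actions: set
    max_order_value: float

@dataclass
class AgentGateway:
    policy: AgentPolicy
    audit_log: list = field(default_factory=list)
    human_override: bool = False  # when True, every action is held for review

    def request(self, agent_id, action, value):
        """Decide whether an agent action may proceed and record the decision."""
        if self.human_override:
            decision = "held_for_human"
        elif action not in self.policy.allowed_actions:
            decision = "denied"
        elif value > self.policy.max_order_value:
            decision = "escalated"
        else:
            decision = "approved"
        # Every request is logged, whatever the outcome, to form an audit trail.
        self.audit_log.append(
            {"agent": agent_id, "action": action, "value": value, "decision": decision}
        )
        return decision

# Hypothetical policy: the agent may reorder stock up to a value of 5,000.
gateway = AgentGateway(AgentPolicy({"reorder_stock"}, 5000.0))
d1 = gateway.request("inventory-agent", "reorder_stock", 1200.0)
d2 = gateway.request("inventory-agent", "reorder_stock", 9000.0)
d3 = gateway.request("inventory-agent", "change_supplier", 100.0)
gateway.human_override = True
d4 = gateway.request("inventory-agent", "reorder_stock", 1200.0)
```

Because the log records the decision along with the request, the same data can feed monitoring dashboards and later explanations of why an action was or was not taken.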

Data Readiness Checklist for the Agent-Driven Future

To prepare for this agent-driven future, organizations should assess their data readiness along the following key dimensions:

Structured data accessibility

All product and service attributes are available in machine-readable formats.

APIs provide standardized access to all relevant business data.

Real-time data updates are available through event streams.

Comprehensive metadata describes the meaning and context of all data elements.

Performance transparency

Key performance metrics are objectively defined and measurable.

Historical performance data is accessible and verifiable.

Service level guarantees are formalized and machine-readable.

Performance monitoring is continuous and transparent.

Trust and verification

Data provenance is tracked and verifiable.

Quality assurance processes are documented and accessible.

Third-party certifications are available in machine-readable formats.

Trust credentials are standardized and verifiable.

Agent interaction capabilities

Standardized interfaces are available for agent queries and transactions.

Negotiation protocols are defined for pricing and terms.

Feedback mechanisms are established for service quality.

Exception handling processes are established for unusual situations.

Governance and control

Clear permission frameworks are defined for agent actions.

Audit trails are maintained for all agent interactions.

Override mechanisms are available for human intervention.

Compliance with evolving governance and regulatory requirements is ensured.

Organizations that address these readiness factors proactively will be well positioned in the emerging agent-driven economy, while those that delay risk being gradually marginalized as agent-mediated transactions become the norm. The five architectural patterns introduced in this chapter provide the technical foundation for achieving this readiness.

Blueprint: Implementation Guidance

The explosive growth of generative AI has brought unprecedented opportunities, and significant challenges, to organizations across industries. As this chapter has explored, the gap between experimental proofs of concept and successful production deployments stems largely from inadequate data foundations, but that gap can be bridged.

Where to Start

Your starting point depends on your organization’s current data maturity and GenAI ambitions.

If you’re just starting (experimentation stage)

Start with: Pattern 4 (semantic search) + Pattern 3 (lakehouse).

Why: Semantic search delivers immediate value for document retrieval and knowledge management use cases, while the lakehouse provides the unified storage foundation that later patterns need.

Timeline: 3–4 months to the first production use case.

Example use cases: Internal knowledge base search, document Q&A, policy/procedure assistants.

Success metric: Reduce time to find information by 50% or more.

If you have basic GenAI running (pilot stage)

Add: Pattern 2 (event-driven architecture) + Pattern 1 (knowledge graphs).

Why: Event-driven architecture ensures that agents work with current information, while knowledge graphs support cross-domain reasoning and unlock more sophisticated use cases.

Timeline: 4–6 months to enhance existing applications.

Example use cases: Customer service agents with real-time context, supply chain optimization, cross-functional analytics.

Success metric: Expand from single-domain to cross-domain use cases.

If you’re scaling production (growth stage)

Add: Pattern 5 (agent-ready) + advanced capabilities across all patterns.

Why: Agent-ready design prepares you for the coming agent-driven economy, while advanced capabilities such as data mesh, multimodal support, and real-time context refresh support autonomous operations.

Timeline: 6–12 months to complete a full agent-ready transformation.

Example use cases: Autonomous purchasing agents, business autopilot functions, agent-mediated customer experiences.

Success metric: Enable autonomous agent operations with minimal human intervention.

The Decision Tree: Which Pattern First?

Start by identifying your highest-value use case. It will guide you to the pattern you should begin with:

Document / knowledge retrieval? → Start with pattern 4 (semantic search)

Quick wins with existing document repositories
Foundation for RAG-based agents
Timeline to production: 3–4 months

Cross-domain insights? → Start with pattern 1 (knowledge graphs)

Connects siloed data sources
Enables sophisticated reasoning
Timeline to production: 4–6 months

Real-time decision support? → Start with pattern 2 (event-driven)

Ensures agents work with current data
Prevents stale information issues
Timeline to production: 3–5 months

Unified analytics + AI? → Start with pattern 3 (lakehouse)

Consolidates data infrastructure
Supports both BI and AI workloads
Timeline to production: 4–6 months

External agent access? → Start with pattern 5 (agent-ready)

Positions for agent-driven economy
Enables autonomous interactions
Timeline to production: 6–9 months

Remember: The most successful implementations combine multiple patterns. Start with one pattern, prove value quickly (typically in 3–6 months), and then expand to additional patterns as your AI capabilities mature.
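For teams that want to embed this decision tree in a planning tool or intake form, it can be captured as a small lookup table. The pattern numbers, names, and timelines below are copied from the decision tree; the dictionary keys and the function name are hypothetical.

```python
# Use case key -> (pattern number, pattern name, (min_months, max_months))
STARTING_PATTERNS = {
    "document_knowledge_retrieval": (4, "semantic search", (3, 4)),
    "cross_domain_insights": (1, "knowledge graphs", (4, 6)),
    "real_time_decision_support": (2, "event-driven", (3, 5)),
    "unified_analytics_and_ai": (3, "lakehouse", (4, 6)),
    "external_agent_access": (5, "agent-ready", (6, 9)),
}

def recommend_starting_pattern(use_case):
    """Return the starting pattern and timeline for a highest-value use case."""
    try:
        number, name, months = STARTING_PATTERNS[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case!r}") from None
    return {"pattern": number, "name": name, "months_to_production": months}

recommendation = recommend_starting_pattern("real_time_decision_support")
```

The lookup only selects a starting point; as the reminder above notes, most implementations go on to combine several patterns.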

Your Action Plan

In the next 30 days

Assess your current data maturity against the five patterns.
Identify your highest-value GenAI use case.
Select a starting pattern based on the decision tree.
Assemble a cross-functional team that includes data engineering, AI/ML, and business stakeholders.
Define success metrics that measure business value, not just technical performance.

In the next 90 days

Implement the first pattern for a focused use case.
Establish feedback loops with end users.
Document learnings and architectural decisions.
Begin planning the second pattern.
Build organizational capability through training and change management.

In the next 6–12 months

Expand to multiple patterns as use cases mature.
Scale successful implementations across the organization.
Establish governance frameworks for agent operations.
Prepare for the agent-driven future with machine-readable data.
Measure and communicate business value to secure continued investment.

Summary

In Chapter 2, we dive deeper into the technical implementation of the AI-ready data framework, providing detailed architectural patterns, technology selections, and governance models to support successful production deployment. We explore specific Amazon Web Services (AWS) services, implementation approaches, and best practices for organizations at different stages of their AI journey.

The window for competitive advantage is short. Organizations that build AI-ready data foundations now will lead in the agent-driven economy, while those that maintain traditional approaches risk becoming invisible in future customer interactions mediated by agents.