Python RAG Cookbook: Graph RAG

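Graph RAG extends basic retrieval with graph traversal, so you can move along a network of entities and relationships instead of relying only on the semantic similarity of isolated text embeddings.

A basic RAG system splits documents into chunks, embeds them, and relies on vector search to find relevant content. Vector search treats each chunk as an isolated unit and has no notion of how the pieces connect within a larger narrative. When information is spread across several parts of a long document, or when an answer needs data from multiple sources, this approach misses relevant context. Dependencies, references, and relationships between sections are lost.

Graph RAG closes this gap. Instead of storing text only as embeddings, it extracts entities, builds explicit relationships between them, and combines this graph structure with a vector index. The result is richer, more precise context that preserves the explicit connections between entities.

Figure 9-1 uses contract document data to illustrate the difference between basic RAG and graph RAG. In a basic RAG setup, the system maintains a set of isolated embedding vectors. In graph RAG, every text fragment is anchored in its surrounding structure. Each clause is connected to its clause type, company, address, and the service-level agreement (SLA) it comes from. This structured context helps the model not only find relevant text but also determine where that text belongs.

Figure 9-1: Graph RAG versus classic vector search

To build a graph RAG system, you replace the conventional retriever with a graph database. All graph examples in this chapter use Neo4j.

Once the graph is populated, the retrieval flow typically follows four steps:

Initial search: Use vector search or full-text search to identify relevant nodes. These nodes serve as anchor points.
Graph expansion: Starting from the anchor nodes, traverse the graph, usually one or two hops, to collect related nodes and edges.
Filtering and ranking (optional): Refine the expanded result set and keep only the nodes that add meaningful context.
Context assembly and generation: Pass the original anchor text together with its connected context to the LLM.

This process combines semantic matching with structural reasoning and produces context that reflects both meaning and connections. Figure 9-2 shows the main retrieval process.

Figure 9-2: Stages of graph RAG retrieval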
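The four steps can be tried out on a toy graph before any database is involved. The following is a minimal sketch, not the chapter's implementation: the node data, the NODES and EDGES structures, and all function names are invented for illustration, and a plain keyword match stands in for vector or full-text search.

```python
from collections import deque

# Toy graph with invented data; a real system would store this in Neo4j.
NODES = {
    "C1": {"kind": "Clause", "text": "The service is available 99.9% of each month."},
    "C2": {"kind": "Clause", "text": "Support answers incidents within four hours."},
    "SLA1": {"kind": "SLA", "text": "Hosting SLA"},
    "ACME": {"kind": "Company", "text": "Acme Corp"},
    "AVAIL": {"kind": "ClauseType", "text": "Availability"},
}
EDGES = [("ACME", "SLA1"), ("SLA1", "C1"), ("SLA1", "C2"), ("C1", "AVAIL")]


def neighbors(node_id):
    """Return directly connected nodes, ignoring edge direction."""
    out = []
    for a, b in EDGES:
        if a == node_id:
            out.append(b)
        elif b == node_id:
            out.append(a)
    return out


def initial_search(query):
    """Step 1: keyword match on clause text (stand-in for vector search)."""
    terms = query.lower().split()
    return [
        nid for nid, node in NODES.items()
        if node["kind"] == "Clause"
        and any(term in node["text"].lower() for term in terms)
    ]


def expand(anchors, max_hops=2):
    """Step 2: breadth-first traversal; returns node id -> hop distance."""
    hops = {a: 0 for a in anchors}
    queue = deque(anchors)
    while queue:
        current = queue.popleft()
        if hops[current] == max_hops:
            continue
        for nxt in neighbors(current):
            if nxt not in hops:
                hops[nxt] = hops[current] + 1
                queue.append(nxt)
    return hops


def filter_nodes(hops, keep_kinds=("SLA", "Company", "ClauseType")):
    """Step 3: keep the anchors plus context nodes of the wanted kinds."""
    return [n for n, h in hops.items() if h == 0 or NODES[n]["kind"] in keep_kinds]


def assemble_context(node_ids):
    """Step 4: build the text block that would be passed to the LLM."""
    return "\n".join(f"[{NODES[n]['kind']}] {NODES[n]['text']}" for n in node_ids)


anchors = initial_search("available uptime")
print(assemble_context(filter_nodes(expand(anchors))))
```

The sibling clause C2 is reachable within two hops but is dropped in step 3, which is the kind of pruning the optional filtering stage is meant for.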

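In this chapter, you will build your first knowledge graph, populate it with text from documents, and enrich it with structured data.

NOTE

Graph databases provide richer context by moving along the connections between nodes. Compared with basic RAG systems, graphs introduce additional complexity and cost. They require careful data modeling, structured ingestion, and more up-front work. When relationships matter and the final answer depends on how entities are connected, graphs unlock retrieval behavior that classic RAG systems cannot deliver.

The following recipes walk you step by step through building a complete graph RAG system, starting with graph construction and progressing to hybrid search. The first recipe lays the foundation by creating a knowledge graph from SLA contract documents.

You can find all code examples for this chapter in the book's GitHub repository.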

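9.1 Creating Your First Neo4j Knowledge Graph and Populating It with Document Text

Problem

You have never used Neo4j and want to build a first knowledge graph for your graph RAG system.

Solution

In this recipe, you will build a knowledge graph derived from service-level agreements (SLAs). SLAs are contracts between a service provider and a customer that define performance standards, responsibilities, and remedies. Figure 9-3 shows the graph structure you will create.

At the center is a Company node. Each company has an associated Address and one or more SLA nodes. Each SLA is divided into Clause nodes, each representing one section of the document. Each clause is connected to its successor by a NEXT relationship to preserve document order. In addition, each clause is connected to a ClauseType node, which groups similar topics across suppliers.

Figure 9-3: SLA knowledge graph schema

In this recipe, you will build the right-hand side of the graph diagram: the SLA and Clause nodes. The company data on the left is added in Recipe 9.2.

Figure 9-4 shows this structure. Create Clause nodes and connect them to their parent SLA with HAS_CLAUSE relationships. Connect adjacent clauses with NEXT relationships to preserve document order. Create ClauseType nodes such as Availability, Support, Maintenance, and Termination, and connect each clause to its type with an OF_TYPE relationship.

Figure 9-4: SLA knowledge graph structure

You need Neo4j Desktop for local development or a Neo4j Aura cloud instance. For detailed installation steps, see the Neo4j installation guide. After installing, create a database instance. If you run Neo4j on your local machine, the database is usually reachable at localhost:7687. There is always a predefined admin user named neo4j, whose password you choose when you create the instance. To verify that everything works, start the instance in Neo4j Desktop and connect to it.

Next, connect to the graph from a Python script so you can populate it with the prepared data. This requires the neo4j library and python-dotenv, which manages the connection credentials safely and conveniently: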

pip install neo4j python-dotenv

Store the connection details in a .env file. Never commit credentials to GitHub or share them publicly. In the .env file, set the connection URI, username, and password of your Neo4j instance:

NEO4J_URI=neo4j://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=testpassword

To avoid redefining the credentials every time, create a helper function get_driver that establishes a connection when needed and returns a driver for running Cypher queries against the knowledge graph:

import os

from dotenv import load_dotenv
from neo4j import GraphDatabase

load_dotenv()  # read the NEO4J_* settings from the .env file

def get_driver():
    """Create Neo4j driver from environment variables."""
    uri = os.getenv("NEO4J_URI", "neo4j://127.0.0.1:7687")
    user = os.getenv("NEO4J_USERNAME", "neo4j")
    pwd = os.getenv("NEO4J_PASSWORD", "testpassword")
    return GraphDatabase.driver(uri, auth=(user, pwd))

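Before inserting data, define constraints to prevent duplicates. Without constraints, MERGE can create duplicate nodes when the matching logic is ambiguous, for example when two writers merge the same key concurrently. The id property of SLA and Clause nodes and the name property of ClauseType nodes must be unique: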

def create_constraints(driver):
    """Create uniqueness constraints for graph nodes."""

    constraints = [
        "CREATE CONSTRAINT sla_id IF NOT EXISTS "
        "FOR (s:SLA) REQUIRE s.id IS UNIQUE",

        "CREATE CONSTRAINT clause_id IF NOT EXISTS "
        "FOR (c:Clause) REQUIRE c.id IS UNIQUE",

        "CREATE CONSTRAINT type_name IF NOT EXISTS "
        "FOR (t:ClauseType) REQUIRE t.name IS UNIQUE",
    ]

    with driver.session() as session:
        for constraint in constraints:
            session.run(constraint)


driver = get_driver()
create_constraints(driver)

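Parse the SLA document by splitting it at its section headings. Use a regex to split the document at every level-two heading, that is, every line that starts with ##, and create a Clause object for each section: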

from dataclasses import dataclass
import re


@dataclass
class Clause:
    id: str
    title: str
    text: str
    order: int
    clause_type: str = "Other"


def parse_sla_file(path, sla_id):
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()

    sections = re.split(r"^##\s+", content, flags=re.MULTILINE)[1:]
    clauses = []

    for idx, section in enumerate(sections, start=1):
        lines = section.strip().splitlines()
        if not lines:
            continue

        title = lines[0].strip()
        text = "\n".join(lines[1:]).strip()

        clauses.append(
            Clause(
                id=f"{sla_id}_C{idx}",
                title=title,
                text=text,
                order=idx,
            )
        )

    return clauses

Figure 9-5 shows the resulting list of Clause objects.

Figure 9-5: List of clauses extracted from the SLA
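To see what the heading split produces without a file on disk, the same regex can be applied to an in-memory string. This is a sketch only: SAMPLE_SLA is invented sample text, and split_sections is a hypothetical stand-in that returns plain dictionaries instead of Clause objects.

```python
import re

# Invented sample document; real SLA files will look different.
SAMPLE_SLA = """# Hosting SLA

Preamble that belongs to no clause.

## 1. Service Availability
Uptime target is 99.9% per month.

## 2. Support
Incidents are answered within four hours.
"""


def split_sections(content, sla_id):
    """Split text at level-two headings, using the same regex as parse_sla_file."""
    sections = re.split(r"^##\s+", content, flags=re.MULTILINE)[1:]
    result = []
    for idx, section in enumerate(sections, start=1):
        lines = section.strip().splitlines()
        if not lines:
            continue
        result.append(
            {
                "id": f"{sla_id}_C{idx}",
                "title": lines[0].strip(),
                "text": "\n".join(lines[1:]).strip(),
                "order": idx,
            }
        )
    return result


clauses = split_sections(SAMPLE_SLA, "SLA1")
for c in clauses:
    print(c["id"], "|", c["title"], "|", c["text"])
```

Two properties of the split are worth knowing: everything before the first ## heading (the document title and preamble here) is discarded by the [1:] slice, and deeper headings such as ### are not split points, so subsections stay inside their parent clause.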

Once the clauses are extracted, convert the list into a graph structure. Create an SLA node with a title, create the Clause nodes and connect them back to the SLA, then connect adjacent clauses with NEXT relationships:

def write_sla_and_clauses(driver, sla_id, sla_title, clauses):
    """Write SLA and Clause nodes to Neo4j with their relationships."""
    with driver.session() as session:
        # Create SLA node
        session.run(
            """
            MERGE (s:SLA {id: $id})
            SET s.title = $title
            """,
            id=sla_id,
            title=sla_title,
        )

        # Create Clause nodes and link to SLA
        for c in clauses:
            session.run(
                """
                MERGE (cl:Clause {id: $id})
                SET cl.title = $title,
                    cl.text = $text,
                    cl.order = $order
                """,
                id=c.id,
                title=c.title,
                text=c.text,
                order=c.order,
            )
            session.run(
                """
                MATCH (s:SLA {id: $sla_id})
                MATCH (cl:Clause {id: $cid})
                MERGE (s)-[:HAS_CLAUSE]->(cl)
                """,
                sla_id=sla_id,
                cid=c.id,
            )

        # Connect adjacent clauses to preserve document order
        for prev, nxt in zip(clauses, clauses[1:]):
            session.run(
                """
                MATCH (a:Clause {id: $p})
                MATCH (b:Clause {id: $n})
                MERGE (a)-[:NEXT]->(b)
                """,
                p=prev.id,
                n=nxt.id,
            )


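# Placeholder values: adjust the title and path to your own SLA file.
sla_id = "SLA1"
sla_title = "Example SLA"
clauses = parse_sla_file("data/sla1.md", sla_id)  # hypothetical path
write_sla_and_clauses(driver, sla_id, sla_title, clauses)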
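The function above sends two queries per clause, which is fine for a single contract. For larger imports, the same writes can be batched with UNWIND, the pattern this chapter also uses for clause types. The following is a sketch under assumptions: write_clauses_batched, BATCH_QUERY, NEXT_QUERY, and the prev_id and next_id keys are names introduced here, and the SLA node is assumed to exist already.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    id: str
    title: str
    text: str
    order: int
    clause_type: str = "Other"


# One query creates all clauses of an SLA and links them to it.
BATCH_QUERY = """
UNWIND $rows AS row
MATCH (s:SLA {id: $sla_id})
MERGE (cl:Clause {id: row.id})
SET cl.title = row.title, cl.text = row.text, cl.order = row.order
MERGE (s)-[:HAS_CLAUSE]->(cl)
"""

# One query links every adjacent pair of clauses.
NEXT_QUERY = """
UNWIND $pairs AS pair
MATCH (a:Clause {id: pair.prev_id})
MATCH (b:Clause {id: pair.next_id})
MERGE (a)-[:NEXT]->(b)
"""


def write_clauses_batched(session, sla_id, clauses):
    """Write clauses and NEXT links in two round trips instead of 3n."""
    rows = [
        {"id": c.id, "title": c.title, "text": c.text, "order": c.order}
        for c in clauses
    ]
    pairs = [
        {"prev_id": a.id, "next_id": b.id}
        for a, b in zip(clauses, clauses[1:])
    ]
    session.run(BATCH_QUERY, sla_id=sla_id, rows=rows)
    session.run(NEXT_QUERY, pairs=pairs)
    return rows, pairs
```

With a real driver you would call it as write_clauses_batched(session, "SLA1", clauses) inside a with driver.session() as session: block.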

Each clause needs a type, which you assign by matching keywords in the clause title. Here, you map keywords such as availability or uptime to clause types; clauses that match no keyword pattern are labeled Other:

def infer_clause_type(title):
    """Infer ClauseType based on keywords in the title."""
    title_lower = title.lower()
    keywords = {
        "Availability": ["availability", "uptime"],
        "Support": ["support", "response time", "incident"],
        "Maintenance": ["maintenance"],
        "DataProtection": ["data protection", "gdpr", "privacy"],
        "Liability": ["liability"],
        "Termination": ["termination"],
    }

    for clause_type, words in keywords.items():
        if any(word in title_lower for word in words):
            return clause_type
    return "Other"


for c in clauses:
    c.clause_type = infer_clause_type(c.title)
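Because the classifier is a substring match that returns the first hit, a few sample titles show both its behavior and its limits. The sketch below repeats the function so it runs on its own; the sample titles are invented.

```python
def infer_clause_type(title):
    """Infer ClauseType based on keywords in the title (first match wins)."""
    title_lower = title.lower()
    keywords = {
        "Availability": ["availability", "uptime"],
        "Support": ["support", "response time", "incident"],
        "Maintenance": ["maintenance"],
        "DataProtection": ["data protection", "gdpr", "privacy"],
        "Liability": ["liability"],
        "Termination": ["termination"],
    }
    for clause_type, words in keywords.items():
        if any(word in title_lower for word in words):
            return clause_type
    return "Other"


samples = [
    "Service Availability",
    "Support and Maintenance",   # matches two types; dict order decides
    "Incidental Damages",        # substring "incident" triggers Support
    "Limitation of Liability",
    "Definitions",
]
for title in samples:
    print(f"{title!r} -> {infer_clause_type(title)}")
```

"Support and Maintenance" is classified as Support because Support comes before Maintenance in the dictionary, and "Incidental Damages" is also classified as Support because the check is a substring match. If such cases appear in your contracts, consider word-boundary matching or a manual review of the assigned types.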

Once every clause has a type, create the ClauseType nodes and connect each clause to its type with an OF_TYPE relationship:

def add_clause_types(session, clauses):
    # Create ClauseType nodes
    types = [{"clause_type": c.clause_type} for c in clauses]
    session.run(
        """
        UNWIND $rows AS row
        MERGE (t:ClauseType {name: row.clause_type})
    """,
        rows=types,
    )

    # Link clauses to their types
    links = [{"id": c.id, "type": c.clause_type} for c in clauses]
    session.run(
        """
        UNWIND $rows AS row
        MATCH (cl:Clause {id: row.id})
        MATCH (t:ClauseType {name: row.type})
        MERGE (cl)-[:OF_TYPE]->(t)
    """,
        rows=links,
    )


# Use a context manager so the session is closed after the import
with driver.session() as session:
    add_clause_types(session, clauses)

This completes the basic SLA knowledge graph structure. To verify that the import succeeded, run a simple Cypher query in Neo4j Browser. Open Neo4j Desktop, click Connect and Query, and run the following query, which lists the clauses of SLA1 in document order:

MATCH (s:SLA {id: "SLA1"})-[:HAS_CLAUSE]->(cl:Clause)
RETURN s.id AS sla_id, cl.id AS clause_id,
       cl.order AS clause_order, cl.title AS clause_title
ORDER BY cl.order;

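Figure 9-6 shows the completed graph with SLA contracts, clauses, and their relationships.

Figure 9-6: The first SLA knowledge graph

Discussion

This recipe models SLAs as a graph of contracts, clauses, and clause types while preserving document structure and relationships. The uniqueness constraints prevent duplicates, and the indexes that back them keep lookups by id fast. The order property and the NEXT relationships preserve the original document order.

Graph modeling works because it captures relationships that embeddings cannot, such as which clause belongs to which contract or which clause comes next. This lets you compare clauses across documents, move along the reading order, or verify whether a contract lacks a required clause.

Use a graph when a question requires traversing relationships, such as comparing termination clauses across suppliers or checking which contracts lack a particular clause type. For simple semantic lookups within a single document, vector search alone is usually enough.

The trade-off is schema and maintenance overhead. You must design node and relationship types up front and keep them in sync as the data changes, but this structure supports powerful queries that pure vector RAG cannot.

The next recipe extends this foundation with supplier data, adding business context to contract queries.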

9.2 Extending the Knowledge Graph with Structured Data

Problem

You want to enrich the knowledge graph with structured business data.

Solution

Import four kinds of master data to enrich the knowledge graph:

Company information

Supplier name, country, and industry from companies.csv.

Address data

Street, city, postal code, and country from addresses.csv.

Spending information

Yearly spending and spending category from spend_2024.csv.

SLA metadata

Effective dates, service names, and governing law from slas.csv.

After the import, your graph combines document structure with business context, as shown in Figure 9-7.

Figure 9-7: Extended SLA knowledge graph structure with company data

Import companies.csv and create one Company node per supplier, with properties such as name, country, and industry:

import pandas as pd


def load_companies(driver, companies_csv_path):
    """Load companies from CSV and create Company nodes."""
    df = pd.read_csv(companies_csv_path)
    print(f"Loading {len(df)} companies...")

    with driver.session() as session:
        session.run(
            """
            UNWIND $rows AS row
            MERGE (c:Company {supplier_id: row.supplier_id})
            ON CREATE SET c.name = row.name,
                          c.country = row.country
            SET c.industry = row.industry
            """,
            {"rows": df.to_dict(orient="records")},
        )

Add address nodes with street, city, postal code, and country as properties:

def load_addresses(driver, addresses_csv_path):
    """Load addresses from CSV and create Address nodes."""
    df = pd.read_csv(addresses_csv_path)
    print(f"Loading {len(df)} addresses...")

    with driver.session() as session:
        session.run(
            """
            UNWIND $rows AS row
            MERGE (a:Address {id: row.address_id})
            ON CREATE SET a.street = row.street,
                          a.city = row.city,
                          a.postal_code = row.postal_code,
                          a.country = row.country
            """,
            {"rows": df.to_dict(orient="records")},
        )

Connect companies to their addresses with LOCATED_AT relationships:

def connect_company_addresses(driver, companies_csv_path):
    """Create LOCATED_AT relationships between companies and addresses."""
    df = pd.read_csv(companies_csv_path)[["supplier_id", "address_id"]]
    print(f"Connecting {len(df)} company-address relationships...")

    with driver.session() as session:
        session.run(
            """
            UNWIND $rows AS row
            MATCH (c:Company {supplier_id: row.supplier_id})
            MATCH (a:Address {id: row.address_id})
            MERGE (c)-[:LOCATED_AT]->(a)
            """,
            {"rows": df.to_dict(orient="records")},
        )

Each supplier's annual spending data is stored in spend_2024.csv. Add this information as properties on the Company node:

def load_spend(driver, spend_csv_path):
    """Load spend data and update Company nodes with spend information."""
    df = pd.read_csv(spend_csv_path)
    print(f"Loading spend data for {len(df)} companies...")

    with driver.session() as session:
        session.run(
            """
            UNWIND $rows AS row
            MATCH (c:Company {supplier_id: row.supplier_id})
            SET c.spend_2024 = toFloat(row.spend_eur),
                c.spend_category = row.spend_category
            """,
            {"rows": df.to_dict(orient="records")},
        )

Finally, import slas.csv, connect each company to its SLA with a HAS_SLA relationship, and apply additional SLA metadata such as title, service name, and effective date:

def load_slas(driver, sla_csv_path):
    """Load SLA metadata and link to Company nodes."""
    df = pd.read_csv(sla_csv_path)
    df["effective_date"] = df["effective_date"].replace({"N/A": None})
    df["effective_date"] = df["effective_date"].where(
        df["effective_date"].notna(), None
    )
    print(f"Loading {len(df)} SLA records...")

    with driver.session() as session:
        session.run(
            """
            UNWIND $rows AS row
            MATCH (c:Company {supplier_id: row.supplier_id})
            MERGE (s:SLA {id: row.sla_id})
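            // SET rather than ON CREATE SET: the SLA nodes created in
            // Recipe 9.1 already exist and still need this metadata
            SET s.title = row.title,
                s.service_name = row.service_name,
                s.governing_law = row.governing_law,
                s.effective_date = CASE
                  WHEN row.effective_date IS NOT NULL
                  THEN date(row.effective_date)
                  ELSE NULL END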
            MERGE (c)-[:HAS_SLA]->(s)
            """,
            {"rows": df.to_dict(orient="records")},
        )

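This import performs two key actions. First, it ensures that each SLA node exists and sets its metadata fields, such as title, service name, governing law, and effective date. Second, it creates the HAS_SLA relationship that connects each supplier to the contracts it holds.

This connection is essential for queries that combine business context with contract content, such as "Which high-spend suppliers lack a termination clause?"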
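The missing-date handling in load_slas can be checked in isolation. The sketch below uses the standard csv module as a stand-in for pandas so it runs without extra dependencies; the sample rows and the clean_rows helper are invented, while the column names follow the query above.

```python
import csv
import io
from datetime import date

# Invented sample rows; the column names match those used in load_slas.
RAW = """sla_id,supplier_id,effective_date
SLA1,S001,2024-01-15
SLA2,S002,N/A
SLA3,S003,
"""


def clean_rows(csv_text):
    """Turn 'N/A' and empty dates into None so Cypher receives NULL."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = (row["effective_date"] or "").strip()
        if value in ("", "N/A"):
            row["effective_date"] = None
        else:
            # Fail early on malformed dates instead of inside the query.
            date.fromisoformat(value)
            row["effective_date"] = value
        rows.append(row)
    return rows


rows = clean_rows(RAW)
for row in rows:
    print(row["sla_id"], row["effective_date"])
```

Passing None matters because the Cypher CASE expression tests for NULL; a literal "N/A" string would reach date() and make the query fail.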

Discussion

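Use structured enrichment when retrieval depends on business context. Questions such as "Which high-spend suppliers lack termination clauses?" or "Show me data protection clauses for German healthcare companies" need both contract text and entity attributes in the same query. This pattern is a core capability in procurement, compliance, vendor management, and any domain where documents are meaningful only in connection with the entities they describe.

If your questions are entirely document-centric, you can skip enrichment. If all you need is "What does the Acme Corp contract say about support?", a simple clause lookup is enough. Importing and maintaining business data is worth the effort only when it actually changes the retrieval results. If metadata rarely affects selection, keep it outside the graph.

Adding structured data increases operational complexity. Attributes such as spending, addresses, or organizational structure change over time and must be kept in sync with the source systems. Stale data leads to wrong filters, for example flagging a supplier as low-spend when it is not. When these filters significantly narrow the search space before semantic ranking, the accuracy gain usually justifies the effort.

The previous recipe modeled only document structure: SLAs, clauses, and clause types. This recipe adds business entities and their attributes as a second dimension. The result is a hybrid knowledge graph that combines content structure with business context, which is what distinguishes enterprise-grade graph RAG from document-only graphs.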

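9.3 Building Your First Cypher Query

Problem

You have built an SLA knowledge graph that connects suppliers, contracts, and clauses. Now you want a first hands-on example of querying this graph with Cypher.

Solution

Retrieve all clauses of a given SLA and return them in their original order: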

def list_clauses_for_sla(sla_id):
    """Return all clauses for one SLA ordered by their original position."""
    cypher = """
    MATCH (s:SLA {id: $sla_id})-[:HAS_CLAUSE]->(cl:Clause)
    RETURN cl.order AS order,
           cl.title AS title,
           cl.text AS text
    ORDER BY order
    """
    driver = get_driver()
    with driver.session() as session:
        response = session.run(cypher, sla_id=sla_id)
        records = [r.data() for r in response]
        return records

Search across all SLAs for clauses of a given type, such as Termination. This makes it easy to compare how different suppliers word the same contractual topic:

def clauses_of_type(clause_type):
    """List all clauses of a specific ClauseType across suppliers."""
    cypher = """
    MATCH (c:Company)-[:HAS_SLA]->(s:SLA)-[:HAS_CLAUSE]->(cl:Clause)
    MATCH (cl)-[:OF_TYPE]->(t:ClauseType {name: $clause_type})
    RETURN c.name AS company,
           s.id AS sla_id,
           cl.order AS clause_order,
           cl.title AS clause_title,
           cl.text AS clause_text
    ORDER BY company, sla_id, clause_order
    """
    driver = get_driver()
    with driver.session() as session:
        return [r.data() for r in session.run(cypher, clause_type=clause_type)]

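Now that you have seen basic lookups, you can move on to more investigative queries that help you uncover risks and gaps in your contracts.

Use the following function to return all companies whose spending exceeds a threshold but whose SLAs contain no termination clause. This quickly surfaces high-value suppliers that lack a key protective clause: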

def high_spend_missing_termination(min_spend):
    """Return companies above min_spend whose SLAs lack a termination clause."""

    cypher = """
    MATCH (c:Company)-[:HAS_SLA]->(s:SLA)
    WHERE c.spend_2024 > $min_spend
    OPTIONAL MATCH (s)-[:HAS_CLAUSE]->(cl:Clause)
                  -[:OF_TYPE]->(t:ClauseType {name: "Termination"})
    WITH c, s, count(cl) AS num_termination
    WHERE num_termination = 0
    RETURN c.name AS company,
           c.spend_2024 AS spend_2024,
           s.id AS sla_id
    ORDER BY spend_2024 DESC
    """

    driver = get_driver()

    with driver.session() as session:
        return [
            r.data()
            for r in session.run(cypher, min_spend=min_spend)
        ]
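The OPTIONAL MATCH followed by count(cl) = 0 is an anti-join: keep the rows for which no match exists. To make the logic concrete, here is the same rule over plain Python data. This is an illustration only; the COMPANIES list and the missing_termination function are invented for the example and are not part of the chapter's code.

```python
# Invented sample data: each company maps SLA ids to the clause types they contain.
COMPANIES = [
    {"name": "Acme Corp", "spend_2024": 250000.0,
     "slas": {"SLA1": ["Availability", "Termination"]}},
    {"name": "Globex", "spend_2024": 400000.0,
     "slas": {"SLA2": ["Availability", "Support"]}},
    {"name": "Initech", "spend_2024": 50000.0,
     "slas": {"SLA3": ["Support"]}},
]


def missing_termination(companies, min_spend):
    """Return one row per (company, SLA) pair that lacks a Termination clause."""
    hits = []
    for company in companies:
        if company["spend_2024"] <= min_spend:
            continue  # mirrors WHERE c.spend_2024 > $min_spend
        for sla_id, clause_types in company["slas"].items():
            # mirrors count(cl) = 0 after the OPTIONAL MATCH
            if clause_types.count("Termination") == 0:
                hits.append(
                    {
                        "company": company["name"],
                        "spend_2024": company["spend_2024"],
                        "sla_id": sla_id,
                    }
                )
    return sorted(hits, key=lambda row: row["spend_2024"], reverse=True)


print(missing_termination(COMPANIES, min_spend=100000))
```

Like the Cypher query, this reports results per SLA, not per company: a supplier with two SLAs appears once for each SLA that lacks a termination clause.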

You can also focus on specific clause types and text patterns. The next example inspects availability clauses and filters them by a search phrase, such as a specific uptime target like 99.9:

def availability_clauses(search_phrase):
    """Inspect availability clauses per supplier."""
    cypher = """
    MATCH (c:Company)-[:HAS_SLA]->(s:SLA)-[:HAS_CLAUSE]->(cl:Clause)
    MATCH (cl)-[:OF_TYPE]->(t:ClauseType {name: "Availability"})
    WHERE toLower(cl.text) CONTAINS toLower($phrase)
    RETURN c.name AS company,
           s.id AS sla_id,
           cl.order AS clause_order,
           cl.text AS availability_text
    ORDER BY company, sla_id, clause_order
    """
    driver = get_driver()
    with driver.session() as session:
        return [r.data() for r in session.run(cypher, phrase=search_phrase)]

Combine clause types with supplier metadata. The following query retrieves data protection clauses from suppliers located in a given list of EU countries:

def eu_data_protection_clauses(countries: list[str]):
    """Retrieve data protection clauses for suppliers in given EU countries."""
    cypher = """
    MATCH (c:Company)-[:LOCATED_AT]->(a:Address)
    WHERE a.country IN $countries
    MATCH (c)-[:HAS_SLA]->(s:SLA)-[:HAS_CLAUSE]->(cl:Clause)
    MATCH (cl)-[:OF_TYPE]->(t:ClauseType {name: "DataProtection"})
    RETURN c.name AS company,
           a.country AS country,
           s.id AS sla_id,
           cl.order AS clause_order,
           cl.text AS data_protection_clause
    ORDER BY country, company, sla_id, clause_order
    """
    driver = get_driver()
    with driver.session() as session:
        return [r.data() for r in session.run(cypher, countries=countries)]

Discussion

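Structural retrieval works by navigating the graph rather than searching text. You start at a known node and follow relationships to the information you need. If you know the specific SLA or company, you can move directly from that node to its clauses. If you want to compare the same kind of content across many contracts, start from a ClauseType and collect all clauses of that type. If business rules come first, filter companies or contracts before looking at any text.

Use direct lookups when you know the exact document or entity you care about. Use type-based queries when you want to compare similar clauses across many documents. Use filtered searches when attributes such as spending, region, or industry determine which data should be included.

Structural queries are fast and precise because they return only nodes that match the labels and filters. Their limitation is that they depend on the quality of the graph structure. If the taxonomy is incomplete or inconsistent, relevant content can be missed. The next recipe adds semantic search to compensate by finding similar meaning even when labels do not line up exactly.

Compared with vector-only RAG, graph-based retrieval combines structure and semantics. Business rules and relationships can narrow the search space before semantic matching, or validate and aggregate results after semantic search.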

9.4 Enabling Semantic Search on a Neo4j Knowledge Graph

Problem

You want to search SLA clauses by explicit graph labels as well as by semantic meaning, and combine both with supplier metadata and graph relationships.

Solution

This recipe shows how to combine semantic vector search with structural graph filters in Neo4j. Figure 9-8 shows the hybrid search approach.

Figure 9-8: Hybrid semantic and structural search in Neo4j

Start with the setup, which runs only once after the graph is created.

Generate an embedding for each Clause node and write it to the graph:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def create_embedding(text):
    """Generate embeddings using OpenAI's text-embedding-3-small model."""
    return (
        client.embeddings.create(model="text-embedding-3-small", input=text)
        .data[0]
        .embedding
    )

def create_clause_embeddings():
    """Add embeddings to all Clause nodes in the graph."""
    driver = get_driver()

    with driver.session() as session:
        clauses = list(session.run(
            "MATCH (cl:Clause) RETURN cl.id AS id, cl.text AS text"
        ))

        print(f"Creating embeddings for {len(clauses)} clauses...")

        for i, row in enumerate(clauses, 1):
            emb = create_embedding(row["text"])
            session.run(
                "MATCH (cl:Clause {id:$id}) SET cl.embedding=$emb",
                id=row["id"],
                emb=emb,
            )

            if i % 10 == 0:
                print(f"  Processed {i}/{len(clauses)} clauses")

    driver.close()

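Run create_clause_embeddings once after creating the graph. With the embeddings in place, create a vector index over them. The vector.dimensions value of 1536 matches the output size of text-embedding-3-small; change it if you switch to a model with a different embedding size: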

def create_vector_index():
    """Create a vector index for semantic search on Clause embeddings."""
    driver = get_driver()
    with driver.session() as session:
        session.run(
            """
            CREATE VECTOR INDEX clause_embeddings IF NOT EXISTS
            FOR (c:Clause) ON c.embedding
            OPTIONS {
                indexConfig: {
                    `vector.dimensions`: 1536,
                    `vector.similarity_function`: "cosine"
                }
            }
            """
        )
    driver.close()

You can now search the whole corpus with natural-language queries. Return the top k most similar clauses based on their embeddings:

def semantic_search(query, top_k=5):
    """Find clauses semantically similar to the query."""
    driver = get_driver()
    emb = create_embedding(query)

    cypher = """
    CALL db.index.vector.queryNodes(
        "clause_embeddings", $top_k, $embedding
    )
    YIELD node, score
    RETURN node.title AS title,
           node.text AS text,
           score
    ORDER BY score DESC
    """

    with driver.session() as session:
        rows = [r.data() for r in session.run(
            cypher, top_k=top_k, embedding=emb
        )]

    driver.close()
    return rows
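Conceptually, the index ranks clauses by cosine similarity between the query embedding and each stored embedding. The following brute-force sketch shows that ranking on invented three-dimensional vectors; the cosine and top_k functions are illustrative and are not how Neo4j implements its index, which uses an approximate nearest-neighbor structure and reports a score normalized to the range 0 to 1.

```python
import math


def cosine(a, b):
    """Raw cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, embeddings, k):
    """Score every stored vector and return the k best (id, score) pairs."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Invented toy embeddings; real ones have 1536 dimensions.
EMBEDDINGS = {
    "SLA1_C1": [1.0, 0.0, 0.0],
    "SLA1_C2": [0.0, 1.0, 0.0],
    "SLA1_C3": [1.0, 1.0, 0.0],
}

for clause_id, score in top_k([1.0, 0.0, 0.0], EMBEDDINGS, k=2):
    print(clause_id, round(score, 3))
```

The brute-force version compares the query with every clause, which is acceptable for a few hundred vectors; the vector index exists to avoid that full scan on larger graphs.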

Combine semantic search with graph filters to narrow the results. This example uses an industry filter:

def hybrid_search_by_industry(query, industry, top_k=5):
    """Combine semantic search with industry filtering."""
    driver = get_driver()
    emb = create_embedding(query)

    cypher = """
    CALL db.index.vector.queryNodes(
        "clause_embeddings", $top_k, $embedding
    )
    YIELD node, score
    MATCH (node)<-[:HAS_CLAUSE]-(s:SLA)<-[:HAS_SLA]-(c:Company)
    WHERE c.industry = $industry
    RETURN c.name AS company,
           s.id AS sla_id,
           node.title AS clause_title,
           node.text AS clause_text,
           score
    ORDER BY score DESC
    """

    with driver.session() as session:
        rows = [r.data() for r in session.run(
            cypher,
            top_k=top_k,
            embedding=emb,
            industry=industry
        )]

    driver.close()
    return rows

This function returns semantically similar clauses filtered by industry. Each result contains the company name, SLA ID, clause title and text, and the similarity score.
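One detail of this query is worth testing before you rely on it: the vector index returns its top_k nearest clauses first, and the industry filter is applied afterward. If the nearest clauses belong to other industries, fewer than top_k rows come back, possibly none. The sketch below reproduces the effect on invented, pre-scored data; hybrid_search and its candidates parameter are assumptions made for illustration, not part of the recipe's code.

```python
def hybrid_search(scored_clauses, industry, top_k, candidates=None):
    """Take the nearest candidates first, then filter by industry.

    scored_clauses: list of (clause_id, industry, score) tuples.
    candidates: how many neighbors to fetch before filtering;
                defaults to top_k, which mirrors the recipe's query.
    """
    limit = max(top_k, candidates or top_k)
    nearest = sorted(scored_clauses, key=lambda r: r[2], reverse=True)[:limit]
    matching = [r for r in nearest if r[1] == industry]
    return matching[:top_k]


# Invented similarity scores for five clauses.
DATA = [
    ("C1", "Finance", 0.95),
    ("C2", "Finance", 0.93),
    ("C3", "Healthcare", 0.90),
    ("C4", "Finance", 0.88),
    ("C5", "Healthcare", 0.85),
]

print(hybrid_search(DATA, "Healthcare", top_k=2))                # no rows
print(hybrid_search(DATA, "Healthcare", top_k=2, candidates=5))  # C3 and C5
```

In the Cypher version, the equivalent adjustment would be to request more neighbors from db.index.vector.queryNodes than you intend to return and to add a LIMIT after the filter.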

Discussion

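This recipe combines vector search with graph filtering. As written, the vector index first returns the top_k nearest clauses, and the MATCH and WHERE clauses then discard those whose company belongs to a different industry. The function can therefore return fewer than top_k rows; if the filter is selective, request more candidates from the index than you need. Either way, the model sees only text that fits the company and SLA context.

Use hybrid search when queries mix meaning and business constraints. When labels such as ClauseType are reliable, a pure Cypher query is simpler and faster.

The trade-off is overhead. You must maintain both embeddings and graph structure, and vector search adds cost and latency. In return, you can handle queries that neither graphs nor vectors can handle alone.

Unlike basic RAG, graph RAG does not return isolated chunks. It returns clauses together with their connected business context, which makes answers more precise and easier to explain.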

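9.5 Optimizing the Knowledge Graph for RAG Systems

Problem

You have measured a performance or cost bottleneck, such as slow queries, high token costs, or low retrieval precision, and want to optimize the graph.

Solution

Enhance the graph with optional components such as summaries, embeddings of entire SLAs, risk scores, domain ontologies, or retrieval shortcuts.

Use an efficient LLM, such as gpt-4o-mini, to summarize each clause and attach the result as a new summary property: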

def summarize_clause(text):
    """Generate a summary for a clause using GPT-4o-mini."""
    prompt = f"Summarize this SLA clause:\n\n{text}"
    return (
        client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=120,
        )
        .choices[0]
        .message.content
    )

def add_clause_summaries():
    """Add summaries to all Clause nodes that don't have one."""

    driver = get_driver()

    with driver.session() as session:
        result = session.run(
            """
            MATCH (cl:Clause)
            WHERE cl.summary IS NULL
            RETURN cl.id AS id,
                   cl.text AS text
            """
        )

        clauses = list(result)

        print(
            f"Generating summaries for {len(clauses)} clauses..."
        )

        for i, row in enumerate(clauses, 1):
            s = summarize_clause(row["text"])
            session.run(
                "MATCH (cl:Clause {id:$id}) SET cl.summary=$s", id=row["id"], s=s
            )
            if i % 5 == 0:
                print(f"  Processed {i}/{len(clauses)} clauses")
    driver.close()

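You can register a vector index for SLAs in the same way you did for clauses. Summaries can also be indexed and searched separately.

To speed up queries, you can store useful aggregates directly in the graph, for example:

The number of availability clauses per supplier
A consolidated list of support obligations
Flags indicating whether a clause type is present

The following Cypher query shows a basic example: it counts the clauses of each clause type per company and stores the result as a new property on the Company node. Setting a property whose key is computed at runtime, as in c[t.name + '_count'], requires a recent Neo4j release that supports dynamic property keys in SET: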

MATCH (c:Company)-[:HAS_SLA]->(:SLA)-[:HAS_CLAUSE]->(cl:Clause)
      -[:OF_TYPE]->(t:ClauseType)
WITH c, t, count(cl) AS num_clauses
SET c[t.name + '_count'] = num_clauses
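If your Neo4j version rejects the computed property key, one alternative is to build the property map in Python and apply it with SET c += map, which merges a map into a node's properties. The sketch below shows the aggregation step on invented data; clause_type_counts and UPDATE_QUERY are names introduced here, and the (supplier_id, clause type) pairs are assumed to come from a query that returns c.supplier_id and t.name.

```python
from collections import Counter

# Applies one property map per company; SET c += map adds or overwrites keys.
UPDATE_QUERY = """
UNWIND $rows AS row
MATCH (c:Company {supplier_id: row.supplier_id})
SET c += row.props
"""


def clause_type_counts(pairs):
    """Turn (supplier_id, clause_type) pairs into per-company count properties."""
    per_company = {}
    for supplier_id, clause_type in pairs:
        per_company.setdefault(supplier_id, Counter())[clause_type] += 1
    return [
        {
            "supplier_id": supplier_id,
            "props": {f"{t}_count": n for t, n in sorted(counts.items())},
        }
        for supplier_id, counts in sorted(per_company.items())
    ]


# Invented sample pairs, one per clause.
PAIRS = [
    ("S001", "Availability"),
    ("S001", "Availability"),
    ("S001", "Termination"),
    ("S002", "Support"),
]

rows = clause_type_counts(PAIRS)
print(rows)
# With a live session: session.run(UPDATE_QUERY, rows=rows)
```

Either approach stores a snapshot, so the counts must be recomputed whenever clauses are added, removed, or retyped.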

Discussion

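These optimizations reduce cost and latency by shifting work from query time to ingestion time. Clause summaries let the model skim many clauses cheaply before reading the full text. SLA-level embeddings make it possible to retrieve whole contracts without searching every clause. Aggregates avoid recomputing the same values on every query.

Each technique addresses a different kind of bottleneck. Use summaries when you retrieve many clauses but only a few are truly relevant. Use aggregates when you repeatedly compute the same counts or flags. Use SLA embeddings when users ask about whole contracts rather than individual clauses.

Do not add these optimizations at the outset. Start with the baseline graph and measure how it performs. If queries are fast, token usage is acceptable, and answers are accurate, the simpler model is often enough.

The main trade-off is maintenance. Summaries, aggregates, and embeddings must be kept in sync as the data changes, and they add storage and ingestion time. They pay off only when they remove a proven cost or performance problem.

The final answer should always be based on the full clause text. Summaries and cached values are for filtering and ranking, not a replacement for the original data.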
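The division of labor between summaries and full text can be sketched as a two-stage lookup. The example below is illustrative only: the CLAUSES dictionary, the keyword-based shortlist function, and final_context are invented, and a real system would rank summaries with embeddings or an LLM instead of a substring match.

```python
# Invented clause data with a short summary and the full text for each clause.
CLAUSES = {
    "SLA1_C1": {
        "summary": "Guarantees 99.9% monthly uptime.",
        "text": "The provider guarantees 99.9% availability per calendar "
                "month, excluding scheduled maintenance windows.",
    },
    "SLA1_C2": {
        "summary": "Support responds to incidents within four hours.",
        "text": "Priority 1 incidents receive a first response within four hours.",
    },
    "SLA2_C1": {
        "summary": "Uptime target of 99.5% per quarter.",
        "text": "Availability is measured quarterly against a 99.5% target.",
    },
}


def shortlist(clauses, keyword, limit=2):
    """Stage 1: scan the cheap summaries to pick candidate clauses."""
    needle = keyword.lower()
    hits = [cid for cid, c in clauses.items() if needle in c["summary"].lower()]
    return hits[:limit]


def final_context(clauses, clause_ids):
    """Stage 2: assemble the answer context from the full text only."""
    return "\n\n".join(f"{cid}: {clauses[cid]['text']}" for cid in clause_ids)


picked = shortlist(CLAUSES, "uptime")
print(final_context(CLAUSES, picked))
```

The summaries decide which clauses are read, but only the original text reaches the model, so details that a summary omits, such as the maintenance exclusion here, are not lost.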
