Knowledge Graph RAG in Practice, 2026: Retrieval That Actually Understands Entity Relationships


The core limitation of traditional RAG is that vector similarity only captures semantic similarity; it cannot understand the relationships between entities. Questions like "Who is Apple's CEO?" or "Has Apple's revenue surpassed Google's?" require relational reasoning, and traditional RAG often answers them poorly. Knowledge Graph RAG (KG-RAG) introduces a knowledge graph to give the retrieval system genuine relational reasoning ability.

1. Why Knowledge Graph RAG

1.1 Blind Spots of Traditional RAG

Multi-hop reasoning failures: some questions can only be answered by chaining through multiple entity relations. For example, "Which companies were founded by researchers who co-authored the original Transformer paper?" requires following the author → startup relation chain.

Missing relation understanding: vector retrieval cannot distinguish "A acquired B" from "B acquired A". The two sentences have nearly identical embeddings, but the direction of the relation is exactly reversed.
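Directed triples make this distinction trivial to represent. A toy sketch (the facts below are illustrative examples, not data from the article):

```python
# A knowledge graph stores facts as directed (subject, predicate, object)
# triples, so "A acquired B" and "B acquired A" are different edges.
triples = {
    ("Microsoft", "acquired", "GitHub"),
    ("Google", "acquired", "DeepMind"),
}

def holds(subject: str, predicate: str, obj: str) -> bool:
    """Check whether a directed fact is present in the graph."""
    return (subject, predicate, obj) in triples
```

Because direction is part of the key, a lookup for the reversed edge simply fails, whereas the embeddings of the two sentences would be nearly indistinguishable.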

Hard-to-update facts: when an entity's attributes change (say, a CEO is replaced), traditional RAG must re-index the affected documents; KG-RAG only needs to update the corresponding graph node.

1.2 Core Advantages of Knowledge Graph RAG

  • Structured relational reasoning: graph traversal can answer "through which relation paths is A connected to B?"
  • Entity deduplication and disambiguation: different surface forms of the same entity ("OpenAI" / "Altman's company") map to a single node
  • Explainable reasoning paths: retrieval can trace the relation chain, providing an auditable basis for each answer
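Entity resolution in particular can start from something as simple as an alias table before graduating to embedding-based or LLM-based matching. A minimal sketch (the alias entries are invented examples):

```python
# Map surface forms to canonical graph node names. Real systems combine
# alias tables like this with embedding similarity and LLM disambiguation.
ALIASES = {
    "openai": "OpenAI",
    "open ai": "OpenAI",
    "altman's company": "OpenAI",
}

def canonicalize(mention: str) -> str:
    """Resolve an entity mention to its canonical node name (identity if unknown)."""
    return ALIASES.get(mention.strip().lower(), mention.strip())
```

Running every extracted entity through a step like this before ingestion keeps "OpenAI" and "Altman's company" from becoming two disconnected nodes.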

2. System Architecture of Knowledge Graph RAG

User question
   ↓
[Entity recognition & relation extraction] (NER + RE)
   ↓
[Graph query generation] (NL → Cypher/SPARQL)
   ↓
[Graph retrieval] (Neo4j / Amazon Neptune)
   ↓                        ↓
[Subgraph extraction]   [Vector retrieval] (traditional RAG)
   ↓                        ↓
[Result fusion & ranking] → [LLM answer generation]

3. Building the Knowledge Graph: From Text to Structured Graph

3.1 LLM-Based Entity and Relation Extraction

import anthropic
import json
from typing import NamedTuple

class Entity(NamedTuple):
    name: str
    entity_type: str
    properties: dict

class Relation(NamedTuple):
    subject: str
    predicate: str
    obj: str
    confidence: float

# Literal braces in the template are doubled ({{ }}) so str.format only fills in {text}.
KG_EXTRACTION_PROMPT = """Extract entities and relations from the following text and output them as JSON.

Text: {text}

Output format:
{{
  "entities": [
    {{"name": "entity name", "type": "entity type", "properties": {{"key": "value"}}}}
  ],
  "relations": [
    {{"subject": "subject entity", "predicate": "relation type", "object": "object entity", "confidence": 0.9}}
  ]
}}

Entity types: Person, Organization, Product, Technology, Location, Event, Concept
Relation types: founded_by, acquired_by, works_at, created_by, based_in, part_of, related_to, competes_with
"""

class KGExtractor:
    def __init__(self):
        self.client = anthropic.Anthropic()
    
    def extract(self, text: str) -> dict:
        """Extract entities and relations from a single text."""
        response = self.client.messages.create(
            model="claude-opus-4-7",
            max_tokens=2000,
            messages=[{
                "role": "user",
                "content": KG_EXTRACTION_PROMPT.format(text=text)
            }]
        )
        return self._parse_json(response.content[0].text)
    
    @staticmethod
    def _parse_json(content: str) -> dict:
        """Pull the JSON object out of the model's reply."""
        try:
            start = content.find('{')
            end = content.rfind('}') + 1
            return json.loads(content[start:end])
        except (json.JSONDecodeError, ValueError):
            return {"entities": [], "relations": []}
    
    def extract_batch(self, texts: list[str]) -> list[dict]:
        """Batch extraction (concurrent async calls for throughput)."""
        import asyncio
        
        async_client = anthropic.AsyncAnthropic()  # one client shared by all tasks
        
        async def extract_one(text):
            response = await async_client.messages.create(
                model="claude-opus-4-7",
                max_tokens=2000,
                messages=[{"role": "user", "content": KG_EXTRACTION_PROMPT.format(text=text)}]
            )
            return self._parse_json(response.content[0].text)
        
        async def run_batch():
            return await asyncio.gather(*(extract_one(t) for t in texts))
        
        return asyncio.run(run_batch())

3.2 Writing the Graph into Neo4j

from neo4j import GraphDatabase

class KnowledgeGraphDB:
    def __init__(self, uri: str, username: str, password: str):
        self.driver = GraphDatabase.driver(uri, auth=(username, password))
    
    def ingest_extraction(self, extraction: dict, source_doc: str):
        """Write extraction results into Neo4j."""
        with self.driver.session() as session:
            # Create or update entity nodes
            for entity in extraction.get("entities", []):
                session.run(
                    """
                    MERGE (e:Entity {name: $name})
                    ON CREATE SET
                        e.entity_type = $entity_type,
                        e.source_doc = $source_doc,
                        e.created_at = datetime()
                    ON MATCH SET
                        e.updated_at = datetime()
                    SET e += $properties
                    """,
                    name=entity["name"],
                    entity_type=entity.get("type", "Unknown"),
                    source_doc=source_doc,
                    properties=entity.get("properties", {})
                )
            
            # Create relationship edges (keep only confident extractions)
            for relation in extraction.get("relations", []):
                if relation.get("confidence", 0) >= 0.7:
                    # Relationship types cannot be parameterized in Cypher, so the
                    # predicate is interpolated; sanitize it to prevent injection.
                    predicate = "".join(
                        c for c in relation["predicate"] if c.isalnum() or c == "_"
                    ).upper()
                    session.run(
                        f"""
                        MATCH (s:Entity {{name: $subject}})
                        MATCH (o:Entity {{name: $object}})
                        MERGE (s)-[r:{predicate}]->(o)
                        SET r.source_doc = $source_doc,
                            r.confidence = $confidence
                        """,
                        subject=relation["subject"],
                        object=relation["object"],
                        source_doc=source_doc,
                        confidence=relation.get("confidence", 0.8)
                    )
    
    def query_subgraph(self, entity_name: str, depth: int = 2) -> dict:
        """Extract the subgraph centered on a given entity."""
        # Variable-length pattern bounds cannot be query parameters in Cypher,
        # so the depth is validated and interpolated directly.
        depth = int(depth)
        with self.driver.session() as session:
            result = session.run(
                f"""
                MATCH path = (n:Entity {{name: $name}})-[*1..{depth}]-(m:Entity)
                RETURN path
                LIMIT 50
                """,
                name=entity_name
            )
            
            nodes = {}
            edges = []
            
            for record in result:
                path = record["path"]
                for node in path.nodes:
                    nodes[node.element_id] = {
                        "id": node.element_id,
                        "name": node.get("name"),
                        "type": node.get("entity_type")
                    }
                for rel in path.relationships:
                    edges.append({
                        "source": rel.start_node.element_id,
                        "target": rel.end_node.element_id,
                        "type": rel.type
                    })
            
            return {"nodes": list(nodes.values()), "edges": edges}

4. Natural Language to Graph Queries (NL2Cypher)

NL2CYPHER_PROMPT = """Convert the following natural-language question into a Neo4j Cypher query.

Graph schema:
Node type: Entity (properties: name, entity_type)
Relationship types: FOUNDED_BY, ACQUIRED_BY, WORKS_AT, CREATED_BY, BASED_IN, PART_OF, COMPETES_WITH

Question: {question}

Requirements:
1. Output only the Cypher query, no explanation
2. Use LIMIT 10 to cap the result size
3. Use CONTAINS or the =~ operator for fuzzy matching

Cypher query:"""

class NL2CypherConverter:
    def __init__(self):
        self.client = anthropic.Anthropic()
    
    def convert(self, question: str) -> str:
        response = self.client.messages.create(
            model="claude-opus-4-7",
            max_tokens=500,
            messages=[{
                "role": "user", 
                "content": NL2CYPHER_PROMPT.format(question=question)
            }]
        )
        
        cypher = response.content[0].text.strip()
        # Strip any surrounding code-fence markers
        if cypher.startswith("```"):
            cypher = "\n".join(cypher.split("\n")[1:-1])
        
        return cypher
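LLM-generated Cypher should be treated as untrusted input before it reaches the database. A minimal guard (an illustrative addition, not part of the converter above) rejects any generated query that contains write clauses, so the retrieval path stays strictly read-only:

```python
import re

# Cypher clauses that mutate the graph; a generated query containing any
# of these should never be executed by a read-only retrieval path.
WRITE_CLAUSES = ("CREATE", "MERGE", "DELETE", "DETACH", "SET", "REMOVE", "DROP")

def is_read_only(cypher: str) -> bool:
    """Return True if the query contains no write clauses (word-boundary match)."""
    upper = cypher.upper()
    return not any(re.search(rf"\b{clause}\b", upper) for clause in WRITE_CLAUSES)
```

Running the check before `session.run` means a hallucinated `DETACH DELETE` fails fast instead of corrupting the graph; production deployments would additionally use a read-only database user.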

5. Hybrid Retrieval: Fusing Graph and Vectors


class HybridKGRetriever:
    """Hybrid retriever fusing knowledge-graph and vector retrieval."""
    
    def __init__(self, kg_db: KnowledgeGraphDB, vector_store):
        self.kg_db = kg_db
        self.vector_store = vector_store
        self.nl2cypher = NL2CypherConverter()
        self.extractor = KGExtractor()
    
    def retrieve(self, query: str, top_k: int = 5) -> list[dict]:
        """Hybrid retrieval: graph retrieval + vector retrieval."""
        
        results = []
        
        # 1. Graph retrieval (structured relations)
        kg_results = self._kg_retrieve(query)
        results.extend(kg_results)
        
        # 2. Vector retrieval (semantic similarity); assumes a LangChain-style
        #    store exposing similarity_search_with_score -> [(doc, score), ...]
        vector_results = self.vector_store.similarity_search_with_score(query, k=top_k)
        results.extend([
            {"source": "vector", "content": doc.page_content, "score": score}
            for doc, score in vector_results
        ])
        
        # 3. Fuse and rerank
        return self._rerank(query, results, top_k)
    
    def _kg_retrieve(self, query: str) -> list[dict]:
        """Retrieve relevant information via the knowledge graph."""
        results = []
        
        try:
            # Convert to Cypher and execute
            cypher = self.nl2cypher.convert(query)
            
            with self.kg_db.driver.session() as session:
                records = session.run(cypher)
                for record in records:
                    results.append({
                        "source": "knowledge_graph",
                        "content": self._record_to_text(record),
                        "cypher": cypher,
                        "score": 0.9  # graph hits get a high default confidence
                    })
        except Exception as e:
            print(f"KG retrieval failed: {e}")
        
        return results
    
    def _record_to_text(self, record) -> str:
        """Flatten a Cypher result record into readable text for the LLM context."""
        parts = []
        for key, value in record.items():
            # Nodes and relationships expose their properties as a mapping
            parts.append(f"{key}: {dict(value) if hasattr(value, 'items') else value}")
        return "; ".join(parts)
    
    def _rerank(self, query: str, results: list[dict], top_k: int) -> list[dict]:
        """Reorder results by relevance."""
        if not results:
            return []
        
        # Simple policy: graph results first, vector results as backfill
        kg_results = [r for r in results if r["source"] == "knowledge_graph"]
        vector_results = [r for r in results if r["source"] == "vector"]
        
        # Sort each group by score
        kg_results.sort(key=lambda x: x.get("score", 0), reverse=True)
        vector_results.sort(key=lambda x: x.get("score", 0), reverse=True)
        
        # Mix: up to three KG hits, then top up with vector hits
        final = kg_results[:3] + vector_results[:top_k - len(kg_results[:3])]
        return final[:top_k]
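The fixed "KG first" policy above can over-trust the graph when NL2Cypher misfires. Reciprocal Rank Fusion (RRF) is a common alternative that merges the two ranked lists without comparing their incompatible score scales; a minimal sketch (the constant k=60 is the conventional default from the RRF literature, not something this pipeline prescribes):

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document keys with Reciprocal Rank Fusion.

    Each item scores sum(1 / (k + rank)) over the lists it appears in, so
    items ranked highly by multiple retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

kg_hits = ["doc_a", "doc_b", "doc_c"]
vec_hits = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([kg_hits, vec_hits])  # fused[0] == "doc_b"
```

Because only ranks matter, RRF sidesteps the question of whether a graph hit's fixed 0.9 is comparable to a cosine distance, and an item both retrievers agree on (doc_b here) wins.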

6. The Complete KG-RAG Pipeline

class KGRAGPipeline:
    """The complete Knowledge Graph RAG pipeline."""
    
    def __init__(self, kg_db: KnowledgeGraphDB, vector_store):
        self.retriever = HybridKGRetriever(kg_db, vector_store)
        self.client = anthropic.Anthropic()
    
    def answer(self, question: str) -> dict:
        """End-to-end question answering."""
        
        # 1. Hybrid retrieval
        retrieved = self.retriever.retrieve(question, top_k=5)
        
        # 2. Build the context
        context_parts = []
        for r in retrieved:
            if r["source"] == "knowledge_graph":
                context_parts.append(f"[Knowledge graph] {r['content']}")
            else:
                context_parts.append(f"[Document] {r['content']}")
        
        context = "\n\n".join(context_parts)
        
        # 3. Generate the answer with the LLM
        response = self.client.messages.create(
            model="claude-opus-4-7",
            max_tokens=1500,
            messages=[{
                "role": "user",
                "content": f"""Answer the question using the retrieved information below.

Retrieved information:
{context}

Question: {question}

Give an accurate answer grounded in the retrieved information. If the information is insufficient, say so explicitly."""
            }]
        )
        
        return {
            "answer": response.content[0].text,
            "sources": retrieved,
            "question": question
        }

7. Engineering Recommendations

When to use KG-RAG

  • The domain knowledge has a clear entity-and-relation structure (e.g. corporate knowledge bases, medical knowledge bases, legal statutes)
  • Questions require multi-hop reasoning (A → B → C relation chains)
  • An explainable reasoning process is required

When to use traditional RAG

  • The corpus is mostly descriptive text (news, papers, documentation)
  • Questions are mainly a matter of semantic matching
  • Transparency of the reasoning path is not a hard requirement

Hybrid strategy: for most production systems, a KG-RAG + vector-RAG hybrid architecture is the best choice; the two are complementary and together cover a broader range of question types.
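One way to operationalize the hybrid choice is a lightweight query router that biases retrieval toward the graph for relation-style questions. The keyword heuristic below is purely illustrative (the cue list is an assumption, not derived from any benchmark); a production router would use an NER pass or a small classifier instead:

```python
# Hypothetical routing heuristic: prefer the knowledge-graph path when the
# question looks relational, the vector path when it looks descriptive.
RELATIONAL_CUES = ("who founded", "acquired", "works at", "ceo of",
                   "subsidiary", "related to", "connection between")

def route_query(question: str) -> str:
    """Return 'kg' for relation-style questions, otherwise 'vector'."""
    q = question.lower()
    return "kg" if any(cue in q for cue in RELATIONAL_CUES) else "vector"
```

Even a crude router like this lets the system skip the NL2Cypher round-trip for purely descriptive questions, saving latency where the graph adds nothing.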

8. Summary

Knowledge Graph RAG is an important complement to traditional RAG: by introducing a structured graph of entities and relations, it fills vector retrieval's blind spot in relational reasoning. The core engineering steps:

  1. Use an LLM to automatically extract entities and relations from documents
  2. Store the graph in a graph database such as Neo4j
  3. Implement NL2Cypher to turn natural-language questions into graph queries
  4. Fuse graph retrieval and vector retrieval in a hybrid strategy
  5. Have the LLM generate the final answer from the retrieved results

In 2026, as LLMs continue to improve at understanding structured data, KG-RAG is on track to become a standard component of enterprise AI knowledge systems.