Enterprise Lookup / Intelligence Knowledge Graph - KBQA Implementation and Evaluation


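Companion documents: based on 01_领域本体详细设计书.md (domain ontology detailed design) and 02_代码工程脚手架.md (code scaffolding)
Goal: deliver a production-ready question answering service that accepts natural-language questions over the enterprise graph
Core metrics: end-to-end answer accuracy ≥ 85%, P95 latency < 5s, Cypher safety 100%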


1. KBQA overall architecture

                   ┌─────────────────────────────────┐
                   │ User natural-language question  │
                   └────────────────┬────────────────┘
                                    ↓
                   ┌─────────────────────────────────┐
                   │ 1. Preprocessing & intent       │
                   └────────────────┬────────────────┘
                                    ↓
                   ┌─────────────────────────────────┐
                   │ 2. Entity recognition & linking │
                   └────────────────┬────────────────┘
                                    ↓
                  ┌─────────────────┴─────────────────┐
                  ↓                                   ↓
    ┌───────────────────────────┐       ┌───────────────────────────┐
    │ 3a. Text2Cypher           │       │ 3b. GraphRAG              │
    │ (fact/aggregation/path)   │       │ (open-ended/synthesis)    │
    └─────────────┬─────────────┘       └─────────────┬─────────────┘
                  ↓                                   ↓
    ┌───────────────────────────┐       ┌───────────────────────────┐
    │ 4a. Cypher validate + run │       │ 4b. Subgraph+vector recall│
    └─────────────┬─────────────┘       └─────────────┬─────────────┘
                  └─────────────────┬─────────────────┘
                                    ↓
                   ┌─────────────────────────────────┐
                   │ 5. Result reranking & fusion    │
                   └────────────────┬────────────────┘
                                    ↓
                   ┌─────────────────────────────────┐
                   │ 6. Answer generation + citations│
                   └────────────────┬────────────────┘
                                    ↓
                   ┌─────────────────────────────────┐
                   │ Response to user                │
                   └─────────────────────────────────┘

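Core design principles

  1. Two routes: factual questions go through Text2Cypher (high precision, explainable); open-ended questions go through GraphRAG (high recall)
  2. Safety first: every LLM-generated Cypher statement must pass the validator before it is executed
  3. Explainable: every answer must carry citations (node UUID or document ID)
  4. Degradable: every step has a fallback on failure (for example, Cypher failure → full-text search)
  5. Observable: the full-chain trace of every question can be inspected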

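2. Intent recognition

2.1 Intent classification

User questions fall into the following categories, which determine downstream routing:

| Category | Subcategory | Example | Route |
|---|---|---|---|
| FACT | basic_info | Who is the legal representative of Alibaba? | Text2Cypher |
| FACT | shareholder | Who are the shareholders of company XX? | Text2Cypher |
| FACT | executive | Who is the chairman of company XX? | Text2Cypher |
| RELATION | direct_relation | How are Zhang San and Li Si related? | Text2Cypher |
| RELATION | path | How are company A and company B connected? | Text2Cypher |
| RELATION | control_chain | Who is the actual controller of company XX? | Text2Cypher |
| AGGREGATION | count | How many high-tech enterprises are there in industry XX? | Text2Cypher |
| AGGREGATION | top_k | Top 10 by financing in a given industry | Text2Cypher |
| AGGREGATION | distribution | Distribution of enterprise counts in region XX | Text2Cypher |
| EVENT | event_query | What lawsuits does company XX have? | Text2Cypher |
| EVENT | timeline | Timeline of registration changes for company XX | Text2Cypher |
| OPEN | summary | Tell me about company XX | GraphRAG |
| OPEN | analysis | What are the risk points of company XX? | GraphRAG |
| OPEN | comparison | Which is better, company A or company B? | GraphRAG |
| OUT_OF_SCOPE | - | What is the weather like today? | Refuse |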
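
The routing column above can be written as a small lookup. This is a minimal sketch: the `Route` enum, the `ROUTES` table and `route_for` are hypothetical names introduced here for illustration, and sending unknown categories to refusal is an assumption rather than documented behaviour.

```python
from enum import Enum


class Route(str, Enum):
    TEXT2CYPHER = "text2cypher"
    GRAPHRAG = "graphrag"
    REFUSE = "refuse"


# Category -> route, mirroring the routing column of the intent table.
ROUTES = {
    "FACT": Route.TEXT2CYPHER,
    "RELATION": Route.TEXT2CYPHER,
    "AGGREGATION": Route.TEXT2CYPHER,
    "EVENT": Route.TEXT2CYPHER,
    "OPEN": Route.GRAPHRAG,
    "OUT_OF_SCOPE": Route.REFUSE,
}


def route_for(category: str) -> Route:
    """Return the route for an intent category (assumption: unknown -> refuse)."""
    return ROUTES.get(category, Route.REFUSE)
```

Keeping the mapping in data rather than in branching code makes it easy to review against the table when a category is added.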

2.2 Implementation: lightweight LLM classifier

# src/kg/kbqa/intent.py
from enum import StrEnum
from pydantic import BaseModel
from anthropic import AsyncAnthropic
from kg.core.config import get_settings


class IntentCategory(StrEnum):
    FACT = "FACT"
    RELATION = "RELATION"
    AGGREGATION = "AGGREGATION"
    EVENT = "EVENT"
    OPEN = "OPEN"
    OUT_OF_SCOPE = "OUT_OF_SCOPE"


class IntentResult(BaseModel):
    category: IntentCategory
    subcategory: str
    confidence: float
    needs_disambiguation: bool = False


INTENT_PROMPT = """你是问答系统的意图分类器。给定用户问题,输出意图类别。

类别定义:
- FACT: 询问某个实体的具体属性(法人/注册资本/地址/股东等)
- RELATION: 询问两个或多个实体之间的关系
- AGGREGATION: 需要计数/排序/分布/统计的问题
- EVENT: 询问某实体的历史事件(诉讼/处罚/融资/变更)
- OPEN: 开放式总结/分析/对比/建议
- OUT_OF_SCOPE: 与企业知识图谱无关

输出 JSON:
{{
  "category": "FACT|RELATION|AGGREGATION|EVENT|OPEN|OUT_OF_SCOPE",
  "subcategory": "具体子类",
  "confidence": 0.95
}}

问题:{question}
"""


class IntentClassifier:
    def __init__(self):
        s = get_settings()
        self.client = AsyncAnthropic(api_key=s.llm.anthropic_api_key.get_secret_value())
        self.model = "claude-haiku-4-5-20251001"  # Haiku for classification, to keep cost down

    async def classify(self, question: str) -> IntentResult:
        resp = await self.client.messages.create(
            model=self.model,
            max_tokens=256,
            messages=[{"role": "user", "content": INTENT_PROMPT.format(question=question)}],
        )
        text = resp.content[0].text.strip()
        # Tolerant parsing is omitted here
        return IntentResult.model_validate_json(text)

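Later optimization: once enough intent-classification samples have accumulated, switch to a small local model (such as fastText or a fine-tuned BERT), for latency < 50ms and near-zero cost.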
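
The classifier code leaves tolerant parsing out. One possible shape for it is sketched below, assuming the model may wrap its JSON in a Markdown code fence or surround it with extra text; `extract_json_object` is a hypothetical helper, not part of the codebase.

```python
import json
import re

_FENCE = "`" * 3  # Markdown code fence marker
_FENCED = re.compile(_FENCE + r"(?:json)?\s*(.*?)" + _FENCE, re.DOTALL)


def extract_json_object(text: str) -> dict:
    """Parse the first JSON object found in an LLM reply."""
    # Prefer the content of a fenced block when the model added one.
    fenced = _FENCED.search(text)
    if fenced:
        text = fenced.group(1)
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model reply")
    # raw_decode parses one JSON value and ignores any trailing text.
    obj, _end = json.JSONDecoder().raw_decode(text[start:])
    return obj
```

The resulting dict could then be passed to `IntentResult.model_validate(...)`; malformed JSON still raises `ValueError`, so the caller decides whether to retry or refuse.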


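3. Entity recognition and linking

3.1 Flow

Question text → NER (detect mentions) → candidate generation (FTS + vector recall) → disambiguation → link to KG nodes

3.2 Implementation notes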

# src/kg/kbqa/entity_linking.py
from kg.extraction.ner.llm_ner import LLMNER
from kg.store.neo4j_client import Neo4jClient
from kg.store.es_client import ESClient


class EntityLinker:
    def __init__(self, ner, embedder):
        self.ner = ner
        self.embedder = embedder

    async def link(self, question: str) -> list[dict]:
        # 1. NER
        mentions = await self.ner.extract(question)
        results = []
        for m in mentions:
            cands = await self._gen_candidates(m)
            best = await self._disambiguate(m, cands, question)
            if best:
                results.append({"mention": m, "linked": best})
        return results

    async def _gen_candidates(self, m: dict, k: int = 20) -> list[dict]:
        # Two-way recall
        # A. Elasticsearch full-text search (exact match + fuzzy)
        es_hits = await ESClient.search(
            index="entities",
            body={
                "query": {
                    "bool": {
                        "should": [
                            {"term": {"name.keyword": {"value": m["text"], "boost": 5}}},
                            {"match": {"name": {"query": m["text"], "boost": 2}}},
                            {"match": {"aliases": m["text"]}},
                        ]
                    }
                },
                "size": k,
            },
        )
        # B. Vector recall (Milvus or a Neo4j vector index)
        vec = await self.embedder.embed(m["text"])
        vec_hits = await Neo4jClient.execute_read(
            """
            CALL db.index.vector.queryNodes('enterprise_embedding', $k, $vec)
            YIELD node, score
            RETURN node{.uuid, .name, .unified_credit_code, .registration_status} AS node, score
            """,
            {"k": k, "vec": vec},
        )
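        # Merge and deduplicate by uuid (assumes ES hits have the same shape
        # as the Neo4j rows: {"node": {...}, "score": ...})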
        seen, merged = set(), []
        for h in es_hits + vec_hits:
            uuid = h["node"]["uuid"]
            if uuid in seen:
                continue
            seen.add(uuid)
            merged.append(h)
        return merged

    async def _disambiguate(self, mention: dict, candidates: list[dict], question: str) -> dict | None:
        if not candidates:
            return None
        if len(candidates) == 1:
            return candidates[0]
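        # Multiple candidates: rerank in context (with an LLM or a reranker model)
        # Simplified version: prefer entities in business, then the highest recall score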
        candidates.sort(
            key=lambda c: (
                c["node"].get("registration_status") == "IN_BUSINESS",
                c.get("score", 0),
            ),
            reverse=True,
        )
        return candidates[0]

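3.3 Upgrading the disambiguation strategy

When a mention resolves to many candidates (for example, the name "张伟" is shared by more than ten thousand people nationwide), combine the following signals to disambiguate:

| Signal | Weight |
|---|---|
| Literal match | 0.3 |
| Vector similarity | 0.2 |
| Entity activity (number of recent events) | 0.15 |
| Relatedness to other entities in the question | 0.25 |
| User session history context | 0.1 |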

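4. Text2Cypher module

This is the core engine of KBQA; it handles about 80% of the fact/aggregation/relation questions.

4.1 Overall flow

Question + linked entities
   ↓
[Build schema-aware prompt]
   ↓
[LLM generates Cypher (with reasoning)]
   ↓
[Static validation: syntax + schema + safety]
   ↓ pass                       ↓ fail
[Sandboxed execution]       [Feed the error back to the LLM, retry ≤ 2 times]
   ↓                              ↓
[Result set]                  [Fallback: switch to GraphRAG or refuse]

4.2 Schema injection

The LLM must see the complete, up-to-date schema to generate correct Cypher. We serialize the schema automatically into compact text: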

# src/kg/kbqa/text2cypher/schema_serializer.py
from kg.store.neo4j_client import Neo4jClient


async def serialize_schema() -> str:
    """Generate a compact, LLM-friendly schema description."""
    labels = await Neo4jClient.execute_read("CALL db.labels()")
    rel_types = await Neo4jClient.execute_read("CALL db.relationshipTypes()")
    # Node properties: CALL db.schema.nodeTypeProperties()
    node_props = await Neo4jClient.execute_read("CALL db.schema.nodeTypeProperties()")
    rel_props = await Neo4jClient.execute_read("CALL db.schema.relTypeProperties()")
    # Assemble as Markdown
    md = ["# 图谱 Schema\n", "## 节点\n"]
    by_label = {}
    for row in node_props:
        for lbl in row["nodeLabels"]:
            by_label.setdefault(lbl, []).append((row["propertyName"], row["propertyTypes"]))
    for lbl, props in by_label.items():
        md.append(f"### {lbl}")
        for p, t in props:
            md.append(f"- `{p}`: {','.join(t)}")
    md.append("\n## 关系\n")
    # ... relationship types are handled the same way
    return "\n".join(md)

In production: the schema is cached in Redis with TTL = 1 hour and refreshed proactively after ontology changes.
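
The caching behaviour can be illustrated with an in-process stand-in. This is a sketch under stated assumptions: production uses Redis as noted above, whereas `SchemaCache` here is a hypothetical in-memory class, and the `loader` argument stands for something like `serialize_schema`. Its `get_schema` method matches the call made by the generator later in this document.

```python
import time
from typing import Awaitable, Callable


class SchemaCache:
    """Cache one schema string with a TTL and an explicit refresh hook."""

    def __init__(
        self,
        loader: Callable[[], Awaitable[str]],
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._value: str | None = None
        self._expires_at = 0.0

    async def get_schema(self) -> str:
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value = await self._loader()
            self._expires_at = now + self._ttl
        return self._value

    def invalidate(self) -> None:
        """Call after an ontology change to force a reload on next access."""
        self._value = None
```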

4.3 Prompt templates (production grade)

# src/kg/kbqa/text2cypher/prompts.py

SYSTEM_PROMPT = """你是 Neo4j Cypher 查询生成专家。根据图谱 Schema 和用户问题,生成准确的 Cypher 查询。

## 严格遵守的规则

1. **只使用 Schema 中定义的标签和关系**,禁止编造
2. **必须使用提供的实体 UUID 或主键**作为查询起点,不要在 WHERE 中用模糊名称匹配(除非问题明显需要)
3. **必须包含 LIMIT** —— 默认 LIMIT 100,TOP-K 查询用 LIMIT k
4. **路径深度 ≤ 5**:`*1..5` 是上限
5. **时间过滤**:用 `r.valid_to IS NULL` 表示"当前有效"
6. **排除派生关系除非明确询问**:`WHERE r.derived = false OR r.derived IS NULL`
7. **禁止任何写操作**:CREATE/MERGE/SET/DELETE/REMOVE 一律禁止
8. **返回结果必须易于回答**:用 AS 别名,避免返回整个节点
9. **聚合用 collect/count/sum/avg**

## 输出格式

严格输出 JSON:
```json
{{
  "reasoning": "简要分析问题、识别意图、说明查询思路(中文,≤100字)",
  "cypher": "完整 Cypher 查询语句",
  "params": {{ "key": "value" }},
  "expected_columns": ["col1", "col2"]
}}
```

## 当前 Schema

{schema}

## Few-shot 示例

{few_shots}
"""

USER_PROMPT = """## 用户问题
{question}

## 已链接实体

{linked_entities}

请生成 Cypher。
"""

# Few-shot library (configurable via YAML)

FEW_SHOTS_TEMPLATE = """

示例1:基本工商

Q: 阿里巴巴的注册资本是多少?
已链接:[{"mention":"阿里巴巴","uuid":"E_001","label":"Enterprise"}]
A:

{
  "reasoning": "FACT/basic_info,按 uuid 直接取属性",
  "cypher": "MATCH (e:Enterprise {uuid: $uuid}) RETURN e.name AS name, e.registered_capital AS capital, e.capital_currency AS currency LIMIT 1",
  "params": {"uuid": "E_001"},
  "expected_columns": ["name", "capital", "currency"]
}

示例2:股东

Q: 字节跳动的当前股东有哪些?按持股比例排序
已链接:[{"mention":"字节跳动","uuid":"E_002","label":"Enterprise"}]
A:

{
  "reasoning": "FACT/shareholder,遍历 HOLDS_SHARE,过滤当前有效",
  "cypher": "MATCH (s)-[r:HOLDS_SHARE]->(e:Enterprise {uuid: $uuid}) WHERE r.valid_to IS NULL RETURN s.name AS shareholder, labels(s)[0] AS type, r.percentage AS pct ORDER BY pct DESC LIMIT 50",
  "params": {"uuid": "E_002"},
  "expected_columns": ["shareholder", "type", "pct"]
}

示例3:实际控制人

Q: 拼多多的实际控制人是谁?
已链接:[{"mention":"拼多多","uuid":"E_003","label":"Enterprise"}]
A:

{
  "reasoning": "RELATION/control_chain,用派生的 ACTUAL_CONTROLS",
  "cypher": "MATCH (p)-[r:ACTUAL_CONTROLS]->(e:Enterprise {uuid: $uuid}) RETURN p.name AS controller, labels(p)[0] AS type, r.control_ratio AS ratio ORDER BY ratio DESC LIMIT 10",
  "params": {"uuid": "E_003"},
  "expected_columns": ["controller", "type", "ratio"]
}

示例4:两公司路径

Q: 腾讯和京东之间有什么关联?
已链接:[{"mention":"腾讯","uuid":"E_004"},{"mention":"京东","uuid":"E_005"}]
A:

{
  "reasoning": "RELATION/path,两实体间最短路径,深度 ≤5",
  "cypher": "MATCH (a {uuid:$a}), (b {uuid:$b}) MATCH p = shortestPath((a)-[*..5]-(b)) RETURN [n IN nodes(p) | {name: coalesce(n.name, n.title), label: labels(n)[0]}] AS path_nodes, [r IN relationships(p) | type(r)] AS path_rels LIMIT 5",
  "params": {"a":"E_004","b":"E_005"},
  "expected_columns": ["path_nodes","path_rels"]
}

示例5:聚合

Q: 新能源汽车行业有多少家高新技术企业?
已链接:[{"mention":"新能源汽车","industry_code":"C36"}]
A:

{
  "reasoning": "AGGREGATION/count,按行业 + 高新筛选",
  "cypher": "MATCH (e:Enterprise)-[:IN_INDUSTRY]->(i:Industry {code:$code}) WHERE e.is_high_tech = true AND e.registration_status = 'IN_BUSINESS' RETURN count(e) AS total LIMIT 1",
  "params": {"code":"C36"},
  "expected_columns": ["total"]
}

示例6:事件

Q: 比亚迪近 3 年的行政处罚有哪些?
已链接:[{"mention":"比亚迪","uuid":"E_006"}]
A:

{
  "reasoning": "EVENT/admin_penalty,按时间过滤",
  "cypher": "MATCH (e:Enterprise {uuid:$uuid})<-[:PUNISHES]-(p:AdminPenalty) WHERE p.decision_date > date() - duration({years:3}) RETURN p.decision_no AS no, p.decision_date AS date, p.violation_type AS type, p.penalty_amount AS amount, p.violation_description AS desc ORDER BY p.decision_date DESC LIMIT 50",
  "params":{"uuid":"E_006"},
  "expected_columns":["no","date","type","amount","desc"]
}

"""


4.4 Generator implementation

# src/kg/kbqa/text2cypher/generator.py
from pydantic import BaseModel, ValidationError
from anthropic import AsyncAnthropic
from kg.core.config import get_settings
from kg.core.logger import get_logger
from kg.kbqa.text2cypher.prompts import SYSTEM_PROMPT, USER_PROMPT

log = get_logger(__name__)


class CypherCandidate(BaseModel):
    reasoning: str
    cypher: str
    params: dict
    expected_columns: list[str]


class CypherGenerator:
    def __init__(self, schema_provider, few_shots: str):
        s = get_settings()
        self.client = AsyncAnthropic(api_key=s.llm.anthropic_api_key.get_secret_value())
        self.model = s.llm.model_kbqa
        self.schema_provider = schema_provider
        self.few_shots = few_shots

    async def generate(
        self,
        question: str,
        linked_entities: list[dict],
        *,
        prior_error: str | None = None,
    ) -> CypherCandidate:
        schema = await self.schema_provider.get_schema()
        system = SYSTEM_PROMPT.format(schema=schema, few_shots=self.few_shots)
        user = USER_PROMPT.format(
            question=question,
            linked_entities=str(linked_entities),
        )
        if prior_error:
            user += f"\n\n上一次生成的 Cypher 执行失败:{prior_error}\n请修正后重新生成。"

        resp = await self.client.messages.create(
            model=self.model,
            max_tokens=1500,
            system=[
                {"type": "text", "text": system, "cache_control": {"type": "ephemeral"}}
            ],
            messages=[{"role": "user", "content": user}],
        )
        raw = resp.content[0].text.strip()
        # Tolerance: strip a Markdown code fence if the model added one
        if "```json" in raw:
            raw = raw.split("```json")[1].split("```")[0].strip()
        elif raw.startswith("```"):
            raw = raw.strip("`").strip()
        try:
            cand = CypherCandidate.model_validate_json(raw)
        except ValidationError as e:
            log.error("cypher_gen_parse_error", raw=raw, error=str(e))
            raise
        log.info("cypher_generated", reasoning=cand.reasoning, cypher=cand.cypher)
        return cand

4.5 Static validator

# src/kg/kbqa/text2cypher/validator.py
import re
from kg.core.errors import CypherUnsafeError


class CypherValidator:
    FORBIDDEN_KEYWORDS = [
        "CREATE", "DELETE", "DETACH", "SET", "REMOVE", "MERGE", "DROP",
        "FOREACH", "LOAD CSV", "USING PERIODIC COMMIT", "CALL APOC.PERIODIC",
        "CALL APOC.LOAD", "CALL APOC.EXPORT", "CALL APOC.REFACTOR",
    ]
    MAX_LIMIT = 1000
    MAX_PATH_DEPTH = 5

    @classmethod
    def validate(cls, cypher: str, *, schema_labels: set, schema_rels: set) -> None:
        upper = cypher.upper()
        # 1. Forbidden keywords (write operations and risky procedures)
        for kw in cls.FORBIDDEN_KEYWORDS:
            if re.search(rf"\b{re.escape(kw)}\b", upper):
                raise CypherUnsafeError(f"forbidden keyword: {kw}")
        # 2. LIMIT must be present and bounded
        if not re.search(r"\bLIMIT\b", upper):
            raise CypherUnsafeError("missing LIMIT clause")
        for limit_val in re.findall(r"\bLIMIT\s+(\d+)", upper):
            if int(limit_val) > cls.MAX_LIMIT:
                raise CypherUnsafeError(f"LIMIT exceeds {cls.MAX_LIMIT}")
        # 3. Path depth: upper bound of *m..n and *..n, then fixed length *n
        for depth in re.findall(r"\*\s*\d*\.\.(\d+)", cypher):
            if int(depth) > cls.MAX_PATH_DEPTH:
                raise CypherUnsafeError(f"path depth exceeds {cls.MAX_PATH_DEPTH}")
        for depth in re.findall(r"\*(\d+)\b", cypher):
            if int(depth) > cls.MAX_PATH_DEPTH:
                raise CypherUnsafeError(f"path depth exceeds {cls.MAX_PATH_DEPTH}")
        # 4. Labels and relationship types must exist in the schema.
        #    Names must start with a letter or underscore, so numeric map values
        #    such as duration({years:3}) are not mistaken for labels.
        used_labels = set(re.findall(r":([A-Za-z_]\w*)", cypher))
        # Note: relationship types also match the :NAME form, and schema_rels covers them
        all_known = schema_labels | schema_rels
        unknown = used_labels - all_known
        if unknown:
            raise CypherUnsafeError(f"unknown labels/rels: {unknown}")
        # 5. Parameterization check
        if "'" in cypher and "$" not in cypher:
            # Hint: a string literal without any parameter may indicate an injection risk
            ...

4.6 Sandboxed executor

# src/kg/kbqa/text2cypher/executor.py
import asyncio
from kg.store.neo4j_client import Neo4jClient
from kg.core.errors import KGError


class CypherExecutor:
    DEFAULT_TIMEOUT = 10  # seconds
    MAX_ROWS = 1000

    @classmethod
    async def execute(cls, cypher: str, params: dict, *, timeout: int = DEFAULT_TIMEOUT) -> list[dict]:
        try:
            rows = await asyncio.wait_for(
                Neo4jClient.execute_read(cypher, params),
                timeout=timeout,
            )
        except asyncio.TimeoutError:
            raise KGError(f"cypher execution timeout (>{timeout}s)")
        if len(rows) > cls.MAX_ROWS:
            raise KGError(f"result too large: {len(rows)} rows (max {cls.MAX_ROWS})")
        return rows

4.7 Complete Text2Cypher pipeline

# src/kg/kbqa/text2cypher/pipeline.py
from kg.kbqa.text2cypher.generator import CypherGenerator
from kg.kbqa.text2cypher.validator import CypherValidator
from kg.kbqa.text2cypher.executor import CypherExecutor
from kg.core.errors import CypherUnsafeError, KGError
from kg.core.logger import get_logger

log = get_logger(__name__)


class Text2CypherPipeline:
    MAX_RETRIES = 2

    def __init__(self, generator, schema_provider):
        self.generator = generator
        self.schema_provider = schema_provider

    async def run(self, question: str, linked_entities: list[dict]) -> dict:
        prior_error = None
        labels, rels = await self.schema_provider.get_label_rel_sets()

        for attempt in range(self.MAX_RETRIES + 1):
            cand = await self.generator.generate(
                question, linked_entities, prior_error=prior_error
            )
            # 1. Validate
            try:
                CypherValidator.validate(cand.cypher, schema_labels=labels, schema_rels=rels)
            except CypherUnsafeError as e:
                log.warning("cypher_unsafe", attempt=attempt, error=str(e))
                prior_error = f"安全校验失败: {e}"
                continue
            # 2. Execute
            try:
                rows = await CypherExecutor.execute(cand.cypher, cand.params)
                return {
                    "cypher": cand.cypher,
                    "params": cand.params,
                    "reasoning": cand.reasoning,
                    "rows": rows,
                    "attempts": attempt + 1,
                }
            except KGError as e:
                log.warning("cypher_exec_failed", attempt=attempt, error=str(e))
                prior_error = f"执行失败: {e}"
                continue
        raise KGError(f"text2cypher failed after {self.MAX_RETRIES + 1} attempts: {prior_error}")

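5. GraphRAG module

Text2Cypher handles open-ended questions poorly (for example, "Tell me about company XX" or "What risks does company XX face?"), so these go through GraphRAG instead.

5.1 Three retrieval strategies combined

Open-ended question
   ↓
┌──────────────────────────────────────────────────────────────────┐
│  Strategy A: entity-anchored 2-hop subgraph                      │ 70% share
│  - Locate the anchor entity, expand 1-2 hops into a subgraph     │
├──────────────────────────────────────────────────────────────────┤
│  Strategy B: document vector recall                              │ 20% share
│  - Semantic search over the document store; returns passages     │
├──────────────────────────────────────────────────────────────────┤
│  Strategy C: community summary recall                            │ 10% share
│  - Offline community detection + LLM summaries; recall by topic  │
└──────────────────────────────────────────────────────────────────┘
        ↓
   Reranker (bge-reranker-large)
        ↓
   Merged into the context passed to the LLM for generation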

5.2 Subgraph retrieval + verbalization

# src/kg/kbqa/graph_rag/retriever.py
from kg.store.neo4j_client import Neo4jClient


class SubgraphRetriever:
    MAX_HOPS = 3  # assumed safety cap on expansion depth

    async def retrieve(self, anchor_uuids: list[str], *, n_hops: int = 2, limit: int = 200) -> list[dict]:
        """Retrieve a focused subgraph around anchor entities."""
        hops = int(n_hops)
        if not 1 <= hops <= self.MAX_HOPS:
            raise ValueError(f"n_hops must be between 1 and {self.MAX_HOPS}")
        # Variable-length bounds generally cannot be passed as Cypher parameters,
        # so the validated integer is inlined into the query text.
        cypher = """
        UNWIND $uuids AS uuid
        MATCH (anchor {uuid: uuid})
        CALL {
          WITH anchor
          MATCH path = (anchor)-[*1..__HOPS__]-(neighbor)
          WHERE all(r IN relationships(path) WHERE
                    coalesce(r.valid_to, date()) >= date() - duration({years: 2}))
          RETURN path
          LIMIT $limit
        }
        WITH collect(DISTINCT path) AS paths
        UNWIND paths AS path
        RETURN
          [n IN nodes(path) | {uuid: n.uuid, label: labels(n)[0], props: properties(n)}] AS nodes,
          [r IN relationships(path) | {type: type(r), props: properties(r), start: startNode(r).uuid, end: endNode(r).uuid}] AS rels
        """.replace("__HOPS__", str(hops))
        return await Neo4jClient.execute_read(
            cypher, {"uuids": anchor_uuids, "limit": limit}
        )


# src/kg/kbqa/graph_rag/verbalizer.py
from typing import Iterable


PROP_BLACKLIST = {"uuid", "embedding", "_meta_uuid", "_meta_source_id"}


class SubgraphVerbalizer:
    """Convert subgraph into LLM-friendly natural language."""

    @staticmethod
    def _node_text(node: dict) -> str:
        label = node["label"]
        props = node["props"]
        name = props.get("name") or props.get("title") or props.get("uuid")
        key_props = {
            "Enterprise": ["unified_credit_code", "registration_status", "industry_name", "establishment_date"],
            "NaturalPerson": ["gender", "birth_year", "is_executed_dishonest"],
            "LegalCase": ["case_no", "case_reason", "judgment_date"],
            "AdminPenalty": ["decision_no", "violation_type", "penalty_amount"],
        }.get(label, [])
        detail = ", ".join(f"{k}={props[k]}" for k in key_props if k in props and props[k] is not None)
        return f"[{label}] {name}" + (f" ({detail})" if detail else "")

    @staticmethod
    def _rel_text(rel: dict, name_lookup: dict) -> str:
        head = name_lookup.get(rel["start"], rel["start"])
        tail = name_lookup.get(rel["end"], rel["end"])
        rt = rel["type"]
        props = rel.get("props", {})
        prop_summary = ", ".join(
            f"{k}={v}" for k, v in props.items()
            if k not in PROP_BLACKLIST and v is not None
        )
        return f"  - {head} --[{rt}{(' '+prop_summary) if prop_summary else ''}]--> {tail}"

    @classmethod
    def verbalize(cls, subgraph_rows: Iterable[dict]) -> str:
        nodes_seen, rels_seen = {}, []
        for row in subgraph_rows:
            for n in row["nodes"]:
                nodes_seen[n["uuid"]] = n
            for r in row["rels"]:
                rels_seen.append(r)
        # Deduplicate relationships by (start, type, end)
        unique_rels = {(r["start"], r["type"], r["end"]): r for r in rels_seen}.values()
        # Verbalize
        name_lookup = {uuid: (n["props"].get("name") or uuid) for uuid, n in nodes_seen.items()}
        lines = ["## 实体清单"]
        for n in nodes_seen.values():
            lines.append("- " + cls._node_text(n))
        lines.append("\n## 关系")
        for r in unique_rels:
            lines.append(cls._rel_text(r, name_lookup))
        return "\n".join(lines)

5.3 GraphRAG Pipeline

# src/kg/kbqa/graph_rag/pipeline.py
class GraphRAGPipeline:
    def __init__(self, retriever, verbalizer, doc_retriever, reranker, llm):
        self.retriever = retriever
        self.verbalizer = verbalizer
        self.doc_retriever = doc_retriever
        self.reranker = reranker
        self.llm = llm

    async def run(self, question: str, linked_entities: list[dict]) -> dict:
        anchor_uuids = [e["linked"]["node"]["uuid"] for e in linked_entities]
        # A. Graph subgraph
        subgraph_rows = await self.retriever.retrieve(anchor_uuids, n_hops=2)
        graph_text = self.verbalizer.verbalize(subgraph_rows)
        # B. Document retrieval
        doc_chunks = await self.doc_retriever.search(question, k=8)
        # C. Rerank
        all_passages = [{"text": graph_text, "source": "graph"}] + [
            {"text": d["text"], "source": d["doc_id"]} for d in doc_chunks
        ]
        reranked = await self.reranker.rerank(question, all_passages, top_k=6)
        # D. Generate
        context = "\n\n---\n\n".join(p["text"] for p in reranked)
        answer = await self.llm.generate(question, context)
        return {
            "answer": answer,
            "citations": [p["source"] for p in reranked],
        }

6. Answer generation and citations

6.1 Answer generation prompt

ANSWER_SYSTEM = """你是企业知识图谱问答助手。根据提供的查询结果或上下文,准确回答用户问题。

## 强制规则
1. **只基于提供的资料回答**,不编造事实
2. 资料中无相关内容时,明确说"知识图谱中暂无该信息"
3. 涉及数字、日期、比例时**保留原值**,不做估算
4. 答案中提及的每个关键事实,**必须给出引用标记** `[ref:N]`,N 对应资料中的编号
5. 答案语言简洁专业,避免冗余客套
6. 若资料中存在数据冲突,明确指出并列出各方说法
7. 不解释你的查询过程,直接回答

## 输出格式
答案正文。
关键事实1 [ref:1]。关键事实2 [ref:2]。

## 资料
{context}

## 问题
{question}
"""


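class AnswerGenerator:
    def __init__(self, llm):
        # `llm` is the project's LLM wrapper exposing complete(system=..., user=...)
        self.llm = llm

    async def generate(self, question: str, retrieval_result: dict, mode: str) -> dict:
        if mode == "text2cypher":
            context = self._format_cypher_result(retrieval_result)
        else:
            context = retrieval_result["context"]
        resp = await self.llm.complete(
            system=ANSWER_SYSTEM.format(context=context, question=question),
            # Repeat the question as the user turn: chat APIs generally reject an empty user message
            user=question,
        )
        return {
            "text": resp,
            # _extract_citations is not shown in this document
            "citations": self._extract_citations(resp, retrieval_result),
        }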

    def _format_cypher_result(self, result: dict) -> str:
        rows = result["rows"]
        lines = [f"### 查询结果(共 {len(rows)} 条)"]
        for i, row in enumerate(rows, start=1):
            lines.append(f"[{i}] " + " | ".join(f"{k}={v}" for k, v in row.items()))
        return "\n".join(lines)
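
`_extract_citations` is referenced above but not shown. Its first step could look like the sketch below: pull the `[ref:N]` markers out of the answer text. `extract_ref_ids` is a hypothetical helper, and dropping ids that point outside the supplied material is an assumption about the desired behaviour.

```python
import re

_REF = re.compile(r"\[ref:(\d+)\]")


def extract_ref_ids(answer: str, n_sources: int) -> list[int]:
    """Return the distinct [ref:N] ids cited in the answer, in order of first use.

    Ids outside 1..n_sources are dropped, since they cannot map to real material.
    """
    seen: list[int] = []
    for match in _REF.finditer(answer):
        ref_id = int(match.group(1))
        if 1 <= ref_id <= n_sources and ref_id not in seen:
            seen.append(ref_id)
    return seen
```

For Text2Cypher results, `n_sources` would be the number of rows rendered by `_format_cypher_result`, whose entries are numbered from 1.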

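6.2 Citation and provenance data structures

from pydantic import BaseModel


class Citation(BaseModel):
    ref_id: int                       # the N used as [ref:N] in the answer
    source_type: str                  # GRAPH_NODE / GRAPH_REL / DOCUMENT
    source_uuid: str | None = None    # set for graph sources
    source_doc_id: str | None = None  # set for document sources
    snippet: str | None = None
    confidence: float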


class KBQAResponse(BaseModel):
    answer: str
    citations: list[Citation]
    intent: dict
    linked_entities: list[dict]
    trace_id: str
    duration_ms: int
    cost_usd: float
    fallback_used: bool = False

7. End-to-end API integration

# src/kg/api/routers/kbqa.py
from fastapi import APIRouter, Depends
from pydantic import BaseModel, Field

from kg.api.deps import get_current_user
from kg.kbqa.pipeline import KBQAPipeline
from kg.kbqa.response_models import KBQAResponse

router = APIRouter()


class KBQARequest(BaseModel):
    question: str = Field(min_length=2, max_length=500)
    session_id: str | None = None
    user_context: dict | None = None


@router.post("/ask", response_model=KBQAResponse)
async def ask(req: KBQARequest, user=Depends(get_current_user)) -> KBQAResponse:
    return await KBQAPipeline.instance().answer(
        req.question, session_id=req.session_id, user=user
    )


@router.post("/feedback")
async def feedback(req: dict, user=Depends(get_current_user)) -> dict:
    """User feedback on answer quality — for offline evaluation."""
    # Write to the PostgreSQL kbqa_feedback table
    ...
    return {"ok": True}

8. Evaluation framework

8.1 Evaluation set structure

tests/fixtures/kbqa_eval_set.jsonl

{"id": "Q001", "category": "FACT", "subcategory": "basic_info", "question": "阿里巴巴的法定代表人是谁?", "expected_intent": "FACT", "expected_entities": ["阿里巴巴"], "expected_cypher_pattern": "MATCH.*Person.*SERVES_AS.*Enterprise.*is_legal_rep", "gold_answer": "蔡崇信", "gold_answer_aliases": ["蔡崇信", "Joseph Tsai"], "evaluation_method": "exact_match"}
{"id": "Q002", "category": "FACT", "subcategory": "shareholder", "question": "字节跳动的当前股东有哪些?", "expected_intent": "FACT", "expected_entities": ["字节跳动"], "gold_answer_set": ["张一鸣", "梁汝波", ...], "evaluation_method": "set_overlap", "min_f1": 0.8}
{"id": "Q003", "category": "AGGREGATION", "subcategory": "count", "question": "新能源汽车行业有多少家高新企业?", "expected_intent": "AGGREGATION", "gold_answer": 1247, "evaluation_method": "exact_numeric"}
{"id": "Q004", "category": "OPEN", "subcategory": "summary", "question": "介绍一下比亚迪", "expected_intent": "OPEN", "rubric": ["注册信息", "主要业务", "主要股东", "近期事件"], "evaluation_method": "llm_judge"}
{"id": "Q005", "category": "OUT_OF_SCOPE", "question": "今天天气怎么样?", "expected_intent": "OUT_OF_SCOPE", "evaluation_method": "expect_refusal"}

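8.2 Evaluation set size and distribution

600 evaluation questions in total, distributed as follows:

| Category | Count | Share |
|---|---|---|
| FACT/basic | 100 | 17% |
| FACT/shareholder | 60 | 10% |
| FACT/executive | 50 | 8% |
| RELATION/direct | 60 | 10% |
| RELATION/path | 40 | 7% |
| RELATION/control | 50 | 8% |
| AGGREGATION | 80 | 13% |
| EVENT | 80 | 13% |
| OPEN | 60 | 10% |
| OUT_OF_SCOPE | 20 | 3% |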

8.3 Evaluation method

# src/kg/quality/kbqa_eval.py
import json
from pathlib import Path


class KBQAEvaluator:
    def __init__(self, pipeline):
        self.pipeline = pipeline

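    async def evaluate(self, eval_set_path: Path) -> dict:
        # One JSON object per line; blank lines are skipped
        lines = eval_set_path.read_text(encoding="utf-8").splitlines()
        cases = [json.loads(line) for line in lines if line.strip()]
        results = []
        for case in cases:
            resp = await self.pipeline.answer(case["question"])
            score = await self._score(case, resp)
            # Pydantic v2 models expose model_dump(); dict() is deprecated
            results.append({**case, "response": resp.model_dump(), "score": score})
        return self._aggregate(results)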

    async def _score(self, case: dict, resp) -> dict:
        method = case["evaluation_method"]
        s = {"answer_correct": False}  # default when the evaluation method is unknown
        # Intent
        s["intent_correct"] = resp.intent["category"] == case.get("expected_intent")
        # Answer
        if method == "exact_match":
            s["answer_correct"] = any(
                alias.lower() in resp.answer.lower()
                for alias in [case["gold_answer"]] + case.get("gold_answer_aliases", [])
            )
        elif method == "set_overlap":
            extracted = self._extract_named_entities(resp.answer)
            gold = set(case["gold_answer_set"])
            tp = len(extracted & gold)
            precision = tp / len(extracted) if extracted else 0
            recall = tp / len(gold) if gold else 0
            f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
            s["f1"] = f1
            s["answer_correct"] = f1 >= case.get("min_f1", 0.8)
        elif method == "exact_numeric":
            num = self._extract_number(resp.answer)
            # "Exact" means equal up to float noise, not within a tolerance of 1
            s["answer_correct"] = num is not None and abs(num - case["gold_answer"]) < 1e-6
        elif method == "llm_judge":
            s["answer_correct"] = await self._llm_judge(case, resp)
        elif method == "expect_refusal":
            s["answer_correct"] = any(
                kw in resp.answer for kw in ["不在", "无法", "超出", "无关"]
            )
        # Performance
        s["latency_ms"] = resp.duration_ms
        s["cost_usd"] = resp.cost_usd
        s["has_citations"] = len(resp.citations) > 0
        return s

    def _aggregate(self, results: list[dict]) -> dict:
        total = len(results)
        if total == 0:
            raise ValueError("evaluation set is empty")
        intent_acc = sum(r["score"]["intent_correct"] for r in results) / total
        answer_acc = sum(r["score"]["answer_correct"] for r in results) / total
        avg_latency = sum(r["score"]["latency_ms"] for r in results) / total
        # Nearest-rank P95: the ceil(0.95 * n)-th smallest value (1-based)
        latencies = sorted(r["score"]["latency_ms"] for r in results)
        p95_latency = latencies[max(0, -(-total * 95 // 100) - 1)]
        total_cost = sum(r["score"]["cost_usd"] for r in results)

        # Accuracy per category
        by_cat = {}
        for r in results:
            cat = r["category"]
            by_cat.setdefault(cat, []).append(r["score"]["answer_correct"])
        cat_acc = {c: sum(v) / len(v) for c, v in by_cat.items()}

        return {
            "total": total,
            "intent_accuracy": intent_acc,
            "answer_accuracy": answer_acc,
            "by_category": cat_acc,
            "avg_latency_ms": avg_latency,
            "p95_latency_ms": p95_latency,
            "total_cost_usd": total_cost,
        }

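8.4 Acceptance targets

| Metric | M2 (launch) target | M3 (iteration) target |
|---|---|---|
| Overall answer accuracy | ≥ 0.80 | ≥ 0.88 |
| FACT answer accuracy | ≥ 0.92 | ≥ 0.96 |
| AGGREGATION answer accuracy | ≥ 0.85 | ≥ 0.92 |
| RELATION answer accuracy | ≥ 0.78 | ≥ 0.85 |
| OPEN answer accuracy (LLM judge) | ≥ 0.75 | ≥ 0.85 |
| OUT_OF_SCOPE refusal rate | = 1.00 | = 1.00 |
| Intent recognition accuracy | ≥ 0.95 | ≥ 0.98 |
| Cypher safety pass rate | = 1.00 | = 1.00 |
| P95 end-to-end latency | ≤ 5s | ≤ 3s |
| Cost per question | ≤ $0.02 | ≤ $0.01 |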

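8.5 Continuous evaluation process

Every Monday in the early morning:
  1. Run the full evaluation set
  2. Compare with the previous week and generate weekly_eval_report.md
  3. If accuracy regresses by more than 2 percentage points: raise an alert and block the release
  4. Add new bad cases to the annotation queue automatically

Every two weeks:
  1. Review bad cases and decide between prompt optimization, additional few-shot examples, or a model upgrade
  2. Expand the evaluation set (30 new edge cases every two weeks)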
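
The release gate in step 3 can be sketched as a comparison of two weekly reports. The report keys (`answer_accuracy`, `by_category`) follow the output of `_aggregate` above; the function names are hypothetical, and applying the 2-point threshold per category as well as overall is an assumption.

```python
def accuracy_regressed(previous: float, current: float, max_drop_pp: float = 2.0) -> bool:
    """True when accuracy fell by more than max_drop_pp percentage points."""
    drop_pp = (previous - current) * 100
    # Round away float noise so a drop of exactly 2.0 points does not trip the gate.
    return round(drop_pp, 6) > max_drop_pp


def regressed_metrics(previous: dict, current: dict, max_drop_pp: float = 2.0) -> list[str]:
    """List the metrics (overall and per category) that regressed beyond the threshold."""
    names = []
    if accuracy_regressed(previous["answer_accuracy"], current["answer_accuracy"], max_drop_pp):
        names.append("overall")
    for category, prev_acc in previous["by_category"].items():
        cur_acc = current["by_category"].get(category)
        if cur_acc is not None and accuracy_regressed(prev_acc, cur_acc, max_drop_pp):
            names.append(category)
    return names
```

A non-empty result would trigger the alert and block the release; an empty list lets the pipeline proceed.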

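9. Runtime optimization

9.1 Caching strategy

| Cached item | TTL | Invalidation trigger |
|---|---|---|
| Schema serialization | 1h | Ontology change |
| Entity linking result (keyed by mention) | 24h | Entity merge |
| Answers to frequent questions (keyed by question hash + user role) | 5min | Data update |
| Cypher query result (keyed by hash of cypher + params) | 1min | Update to any involved node |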
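
The two hash-based keys in the table could be built as follows. This is a sketch only: the `kbqa:` key prefixes, the whitespace and case normalization of the question, and the helper names are assumptions made for illustration.

```python
import hashlib
import json


def _digest(*parts: str) -> str:
    # A unit separator keeps ("ab", "c") distinct from ("a", "bc").
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def answer_cache_key(question: str, user_role: str) -> str:
    """Key for cached answers: normalized question + user role."""
    normalized = " ".join(question.split()).lower()
    return "kbqa:answer:" + _digest(normalized, user_role)


def cypher_cache_key(cypher: str, params: dict) -> str:
    """Key for cached query results: Cypher text + canonical JSON of params."""
    canonical = json.dumps(params, sort_keys=True, ensure_ascii=False, default=str)
    return "kbqa:cypher:" + _digest(cypher, canonical)
```

Including the user role in the answer key keeps answers produced under one permission level from being served to another.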

9.2 LLM cost control

from datetime import date

# `redis` (an async client), `get_settings` and `LLMBudgetExceeded` are assumed
# to be provided elsewhere in the project.


class LLMBudgetGuard:
    """Daily budget tracker, raises if exceeded."""
    async def check(self, estimated_cost: float) -> None:
        spent = await redis.get(f"llm:cost:{date.today()}") or 0.0
        if float(spent) + estimated_cost > get_settings().llm.daily_budget_usd:
            raise LLMBudgetExceeded("daily LLM budget exceeded")

    async def record(self, cost: float) -> None:
        key = f"llm:cost:{date.today()}"
        await redis.incrbyfloat(key, cost)
        await redis.expire(key, 86400 * 3)  # keep counters for three days

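9.3 Performance tuning checklist

  1. Intent recognition: use Haiku rather than Opus (about 90% lower cost)
  2. Schema injection: prompt caching (saves about 70% of input tokens)
  3. Few-shot library: cached, with content kept unchanged for 90 days
  4. Concurrency: run entity linking, subgraph retrieval, and document retrieval in parallel
  5. Streaming output: return the answer generation stage as an SSE stream
  6. Warm-up: pre-generate the 1-hop subgraphs of popular enterprises and cache them in Redis
  7. Degradation: when the LLM service fails, fall back to full-text search plus templated answers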
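
Item 4 can be sketched with `asyncio.gather`. Only subgraph and document retrieval are shown here, because subgraph retrieval needs the linked entities first. `retrieve_in_parallel` is a hypothetical helper, and degrading a failed branch to an empty result is an assumption in the spirit of the fallback principle.

```python
import asyncio


async def retrieve_in_parallel(subgraph_task, document_task) -> list:
    """Await both retrieval coroutines concurrently.

    A failure in one branch yields an empty list for that branch instead of
    failing the whole request.
    """
    results = await asyncio.gather(subgraph_task, document_task, return_exceptions=True)
    return [[] if isinstance(result, Exception) else result for result in results]
```

In `GraphRAGPipeline.run`, the two awaited calls to `self.retriever.retrieve(...)` and `self.doc_retriever.search(...)` could be passed to this helper so their latencies overlap rather than add up.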

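9.4 Observability metrics

| Metric | Alert threshold |
|---|---|
| KBQA QPS | Deviation from baseline beyond ±50% |
| P95 end-to-end latency | > 5s sustained for 5min |
| Cypher retry rate | > 15% |
| Refusal rate | > 10% sustained for 1h |
| Daily LLM cost | > 80% of budget (warning) |
| Unlinked entity rate | > 20% |
| Negative user feedback rate | > 5% |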

10. Sample test set (30 preset cases)

Ready to use for integration testing in the first week:

{"id":"Q001","question":"阿里巴巴集团的统一社会信用代码是多少?","category":"FACT"}
{"id":"Q002","question":"宁德时代的法定代表人是谁?","category":"FACT"}
{"id":"Q003","question":"字节跳动的注册地址在哪里?","category":"FACT"}
{"id":"Q004","question":"比亚迪股份有限公司的注册资本是多少?","category":"FACT"}
{"id":"Q005","question":"小米的当前股东有哪些?按持股比例排序","category":"FACT"}
{"id":"Q006","question":"腾讯的实际控制人是谁?","category":"RELATION"}
{"id":"Q007","question":"马云目前在哪些企业担任董事?","category":"FACT"}
{"id":"Q008","question":"阿里巴巴和蚂蚁集团是什么关系?","category":"RELATION"}
{"id":"Q009","question":"宁德时代和比亚迪之间有什么关联?","category":"RELATION"}
{"id":"Q010","question":"刘强东最终受益的企业有哪些?","category":"RELATION"}
{"id":"Q011","question":"百度的所有子公司有哪些?","category":"RELATION"}
{"id":"Q012","question":"美团在最近 3 年有哪些诉讼?","category":"EVENT"}
{"id":"Q013","question":"拼多多有什么行政处罚记录?","category":"EVENT"}
{"id":"Q014","question":"小鹏汽车最近一次融资是什么时候?","category":"EVENT"}
{"id":"Q015","question":"比亚迪的工商变更历史","category":"EVENT"}
{"id":"Q016","question":"新能源汽车行业有多少家高新企业?","category":"AGGREGATION"}
{"id":"Q017","question":"广东省的上市公司数量","category":"AGGREGATION"}
{"id":"Q018","question":"半导体行业最近 5 年融资 TOP 10","category":"AGGREGATION"}
{"id":"Q019","question":"被列入经营异常名录的企业有多少家?","category":"AGGREGATION"}
{"id":"Q020","question":"哪些上市公司的法定代表人变更最频繁?","category":"AGGREGATION"}
{"id":"Q021","question":"介绍一下宁德时代","category":"OPEN"}
{"id":"Q022","question":"比亚迪面临的主要风险有哪些?","category":"OPEN"}
{"id":"Q023","question":"阿里和腾讯哪个生态更强?","category":"OPEN"}
{"id":"Q024","question":"小米的核心竞争力是什么?","category":"OPEN"}
{"id":"Q025","question":"百度近年的战略调整","category":"OPEN"}
{"id":"Q026","question":"今天股市怎么样?","category":"OUT_OF_SCOPE"}
{"id":"Q027","question":"帮我写一首诗","category":"OUT_OF_SCOPE"}
{"id":"Q028","question":"删除张三的节点","category":"OUT_OF_SCOPE"}
{"id":"Q029","question":"   ","category":"OUT_OF_SCOPE"}
{"id":"Q030","question":"DROP TABLE enterprises","category":"OUT_OF_SCOPE"}

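11. Launch checklist

Confirm the following before the KBQA service goes live:

  • Evaluation set ≥ 600 questions, covering every intent category
  • Overall answer accuracy ≥ 80% (on the test set)
  • OUT_OF_SCOPE refusal rate = 100% (including injection attack samples)
  • Cypher safety validation passes 100% (including malicious input)
  • P95 latency ≤ 5s (including LLM calls)
  • Cost per question ≤ $0.02
  • Citation and provenance features work (every fact is clickable through to its source)
  • User feedback channel connected
  • Monitoring dashboard and alerts connected
  • LLM cost budget and circuit breaker connected
  • Gradual rollout plan (5% → 25% → 100%)
  • Emergency rollback plan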

Document series index: see 00_README.md