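Companion documents: based on 《01_领域本体详细设计书.md》 and 《02_代码工程脚手架.md》.
Goal: deliver a production-ready question-answering service that lets users query the enterprise knowledge graph in natural language.
Key metrics: end-to-end answer accuracy ≥ 85%, P95 latency < 5s, Cypher safety 100%.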
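1. KBQA overall architecture
            ┌───────────────────────────────────┐
            │ User natural-language question    │
            └─────────────────┬─────────────────┘
                              ↓
            ┌───────────────────────────────────┐
            │ 1. Preprocessing & intent         │
            └─────────────────┬─────────────────┘
                              ↓
            ┌───────────────────────────────────┐
            │ 2. Entity recognition & linking   │
            └─────────────────┬─────────────────┘
                              ↓
               ┌──────────────┴──────────────┐
               ↓                             ↓
 ┌───────────────────────────┐ ┌───────────────────────────┐
 │ 3a. Text2Cypher           │ │ 3b. GraphRAG              │
 │ (fact/aggregation/path)   │ │ (open-ended/analysis)     │
 └─────────────┬─────────────┘ └─────────────┬─────────────┘
               ↓                             ↓
 ┌───────────────────────────┐ ┌───────────────────────────┐
 │ 4a. Cypher validation     │ │ 4b. Subgraph retrieval    │
 │     + execution           │ │     + vector recall       │
 └─────────────┬─────────────┘ └─────────────┬─────────────┘
               └──────────────┬──────────────┘
                              ↓
            ┌───────────────────────────────────┐
            │ 5. Result reranking & fusion      │
            └─────────────────┬─────────────────┘
                              ↓
            ┌───────────────────────────────────┐
            │ 6. Answer generation + citations  │
            └─────────────────┬─────────────────┘
                              ↓
            ┌───────────────────────────────────┐
            │ Response to user                  │
            └───────────────────────────────────┘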
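Core design principles:
- Two parallel paths: factual questions go through Text2Cypher (high precision, explainable); open-ended questions go through GraphRAG (high recall)
- Safety first: every LLM-generated Cypher query must pass the validator before it is executed
- Explainable: every answer must carry citations (node UUID or document ID)
- Degradable: every step has a fallback on failure (e.g. Cypher failure → full-text search)
- Observable: the full-pipeline trace of every question can be inspected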
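2. Intent recognition
2.1 Intent classification
User questions are classified into the following categories, which determine downstream routing:
| Category | Subcategory | Example | Route |
|---|---|---|---|
| FACT | basic_info | Who is Alibaba's legal representative? | Text2Cypher |
| FACT | shareholder | Who are the shareholders of company XX? | Text2Cypher |
| FACT | executive | Who is the chairman of company XX? | Text2Cypher |
| RELATION | direct_relation | How are Zhang San and Li Si related? | Text2Cypher |
| RELATION | path | How are company A and company B connected? | Text2Cypher |
| RELATION | control_chain | Who is the actual controller of company XX? | Text2Cypher |
| AGGREGATION | count | How many high-tech enterprises are there in industry XX? | Text2Cypher |
| AGGREGATION | top_k | Top 10 by funding in a given industry | Text2Cypher |
| AGGREGATION | distribution | Distribution of enterprise counts in region XX | Text2Cypher |
| EVENT | event_query | What lawsuits does company XX have? | Text2Cypher |
| EVENT | timeline | Timeline of registration changes for company XX | Text2Cypher |
| OPEN | summary | Give an overview of company XX | GraphRAG |
| OPEN | analysis | What are the risk points of company XX? | GraphRAG |
| OPEN | comparison | Which is better, company A or company B? | GraphRAG |
| OUT_OF_SCOPE | - | What is the weather like today? | Refuse |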
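The routing column of the table can be sketched as a small lookup. This is a minimal illustration, not the project's router: the `Route` enum, the `ROUTES` mapping, and `route_for` are hypothetical names, and sending unknown categories to refusal is an assumption made here.

```python
from enum import Enum


class Route(str, Enum):
    TEXT2CYPHER = "text2cypher"
    GRAPH_RAG = "graph_rag"
    REFUSE = "refuse"


# Category -> route, copied from the table above.
ROUTES = {
    "FACT": Route.TEXT2CYPHER,
    "RELATION": Route.TEXT2CYPHER,
    "AGGREGATION": Route.TEXT2CYPHER,
    "EVENT": Route.TEXT2CYPHER,
    "OPEN": Route.GRAPH_RAG,
    "OUT_OF_SCOPE": Route.REFUSE,
}


def route_for(category: str) -> Route:
    """Pick the downstream path; unknown categories are refused (assumption)."""
    return ROUTES.get(category, Route.REFUSE)
```

Keeping the mapping as data makes it easy to move a subcategory to another path later without touching the control flow.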
2.2 Implementation: lightweight LLM classifier
# src/kg/kbqa/intent.py
from enum import StrEnum

from anthropic import AsyncAnthropic
from pydantic import BaseModel

from kg.core.config import get_settings


class IntentCategory(StrEnum):
    FACT = "FACT"
    RELATION = "RELATION"
    AGGREGATION = "AGGREGATION"
    EVENT = "EVENT"
    OPEN = "OPEN"
    OUT_OF_SCOPE = "OUT_OF_SCOPE"


class IntentResult(BaseModel):
    category: IntentCategory
    subcategory: str
    confidence: float
    needs_disambiguation: bool = False


# The template is filled via str.format(), so the literal braces of the JSON
# example are doubled; {question} is the only placeholder.
INTENT_PROMPT = """你是问答系统的意图分类器。给定用户问题,输出意图类别。
类别定义:
- FACT: 询问某个实体的具体属性(法人/注册资本/地址/股东等)
- RELATION: 询问两个或多个实体之间的关系
- AGGREGATION: 需要计数/排序/分布/统计的问题
- EVENT: 询问某实体的历史事件(诉讼/处罚/融资/变更)
- OPEN: 开放式总结/分析/对比/建议
- OUT_OF_SCOPE: 与企业知识图谱无关
输出 JSON:
{{
  "category": "FACT|RELATION|AGGREGATION|EVENT|OPEN|OUT_OF_SCOPE",
  "subcategory": "具体子类",
  "confidence": 0.95
}}
问题:{question}
"""


class IntentClassifier:
    def __init__(self):
        s = get_settings()
        self.client = AsyncAnthropic(api_key=s.llm.anthropic_api_key.get_secret_value())
        self.model = "claude-haiku-4-5-20251001"  # classify with Haiku to save cost

    async def classify(self, question: str) -> IntentResult:
        resp = await self.client.messages.create(
            model=self.model,
            max_tokens=256,
            messages=[{"role": "user", "content": INTENT_PROMPT.format(question=question)}],
        )
        text = resp.content[0].text.strip()
        # Tolerant parsing is omitted here
        return IntentResult.model_validate_json(text)
Future optimization: once enough labeled samples have accumulated, switch intent classification to a local small model (e.g. fastText or a fine-tuned BERT), with latency < 50ms and near-zero cost.
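The classifier code skips tolerant parsing of the model reply. One possible shape for that step is sketched below, assuming the reply contains at most one JSON object, optionally wrapped in a Markdown fence or surrounded by chatter. `extract_json_object` is a hypothetical helper, not part of the original code.

```python
import json
import re

# Optional Markdown fence around the payload, with or without a "json" tag.
_FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)```", re.DOTALL)


def extract_json_object(text: str) -> dict:
    """Best-effort extraction of a single JSON object from an LLM reply."""
    fenced = _FENCE_RE.search(text)
    if fenced:
        text = fenced.group(1)
    # Keep only the span from the first "{" to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start : end + 1])
```

The resulting dict could then be passed to `IntentResult.model_validate(...)`, so schema errors still surface as validation errors rather than being swallowed.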
3. Entity recognition and linking
3.1 Flow
Question text → NER (detect mentions) → candidate generation (FTS + vector recall) → disambiguation → link to KG node
3.2 Implementation notes
# src/kg/kbqa/entity_linking.py
from kg.extraction.ner.llm_ner import LLMNER
from kg.store.neo4j_client import Neo4jClient
from kg.store.es_client import ESClient


class EntityLinker:
    def __init__(self, ner, embedder):
        self.ner = ner
        self.embedder = embedder

    async def link(self, question: str) -> list[dict]:
        # 1. NER
        mentions = await self.ner.extract(question)
        results = []
        for m in mentions:
            cands = await self._gen_candidates(m)
            best = await self._disambiguate(m, cands, question)
            if best:
                results.append({"mention": m, "linked": best})
        return results

    async def _gen_candidates(self, m: dict, k: int = 20) -> list[dict]:
        # Two-way recall
        # A. ES full-text search (exact match + fuzzy match)
        es_hits = await ESClient.search(
            index="entities",
            body={
                "query": {
                    "bool": {
                        "should": [
                            {"term": {"name.keyword": {"value": m["text"], "boost": 5}}},
                            {"match": {"name": {"query": m["text"], "boost": 2}}},
                            {"match": {"aliases": m["text"]}},
                        ]
                    }
                },
                "size": k,
            },
        )
        # B. Vector recall (Milvus or Neo4j vector index)
        vec = await self.embedder.embed(m["text"])
        vec_hits = await Neo4jClient.execute_read(
            """
            CALL db.index.vector.queryNodes('enterprise_embedding', $k, $vec)
            YIELD node, score
            RETURN node{.uuid, .name, .unified_credit_code, .registration_status} AS node, score
            """,
            {"k": k, "vec": vec},
        )
        # Merge and deduplicate. This assumes both clients return rows shaped
        # like {"node": {...}, "score": float}.
        seen, merged = set(), []
        for h in es_hits + vec_hits:
            uuid = h["node"]["uuid"]
            if uuid in seen:
                continue
            seen.add(uuid)
            merged.append(h)
        return merged

    async def _disambiguate(self, mention: dict, candidates: list[dict], question: str) -> dict | None:
        if not candidates:
            return None
        if len(candidates) == 1:
            return candidates[0]
        # Multiple candidates: context-aware reranking (LLM or reranker model).
        # Simplified version: prefer entities in business, then the highest recall score.
        candidates.sort(
            key=lambda c: (
                c["node"].get("registration_status") == "IN_BUSINESS",
                c.get("score", 0),
            ),
            reverse=True,
        )
        return candidates[0]
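3.3 Upgraded disambiguation strategy
When a mention has many candidates (e.g. more than ten thousand people nationwide are named "张伟"), the following signals are combined to disambiguate:
| Signal | Weight |
|---|---|
| Literal match score | 0.3 |
| Vector similarity | 0.2 |
| Entity activity (number of recent events) | 0.15 |
| Relatedness to other entities in the question | 0.25 |
| User session history context | 0.1 |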
4. Text2Cypher module
This is the core engine of KBQA, handling the roughly 80% of questions that are fact, aggregation, or relation queries.
4.1 Overall flow
Question + linked entities
        ↓
[Schema-aware prompt construction]
        ↓
[LLM generates Cypher (with reasoning)]
        ↓
[Static validation: syntax + schema + safety]
    ↓ pass                 ↓ fail
[Sandboxed execution]    [Feed the error back to the LLM and retry, ≤ 2 times]
    ↓                      ↓
[Result set]             [Fallback: switch to GraphRAG or refuse]
4.2 Schema injection
The LLM must see the complete, up-to-date schema to generate correct Cypher. The schema is automatically serialized into compact text:
# src/kg/kbqa/text2cypher/schema_serializer.py
from kg.store.neo4j_client import Neo4jClient


async def serialize_schema() -> str:
    """Generate a compact, LLM-friendly schema description."""
    labels = await Neo4jClient.execute_read("CALL db.labels()")
    rel_types = await Neo4jClient.execute_read("CALL db.relationshipTypes()")
    # Node and relationship properties
    node_props = await Neo4jClient.execute_read("CALL db.schema.nodeTypeProperties()")
    rel_props = await Neo4jClient.execute_read("CALL db.schema.relTypeProperties()")
    # Assemble as Markdown
    md = ["# 图谱 Schema\n", "## 节点\n"]
    by_label = {}
    for row in node_props:
        if row["propertyName"] is None:
            # Node types without properties yield a row with null property fields
            continue
        for lbl in row["nodeLabels"]:
            by_label.setdefault(lbl, []).append((row["propertyName"], row["propertyTypes"] or []))
    for lbl, props in by_label.items():
        md.append(f"### {lbl}")
        for p, t in props:
            md.append(f"- `{p}`: {','.join(t)}")
    md.append("\n## 关系\n")
    # ... relationship types are handled the same way
    return "\n".join(md)
In production, the serialized schema is cached in Redis with TTL = 1 hour and is refreshed proactively after ontology changes.
4.3 Prompt templates (production-grade)
# src/kg/kbqa/text2cypher/prompts.py
# SYSTEM_PROMPT is filled via str.format(), so the literal braces of its JSON
# example are doubled; {schema} and {few_shots} are the only placeholders.
SYSTEM_PROMPT = """你是 Neo4j Cypher 查询生成专家。根据图谱 Schema 和用户问题,生成准确的 Cypher 查询。
## 严格遵守的规则
1. **只使用 Schema 中定义的标签和关系**,禁止编造
2. **必须使用提供的实体 UUID 或主键**作为查询起点,不要在 WHERE 中用模糊名称匹配(除非问题明显需要)
3. **必须包含 LIMIT** —— 默认 LIMIT 100,TOP-K 查询用 LIMIT k
4. **路径深度 ≤ 5**:`*1..5` 是上限
5. **时间过滤**:用 `r.valid_to IS NULL` 表示"当前有效"
6. **排除派生关系除非明确询问**:`WHERE r.derived = false OR r.derived IS NULL`
7. **禁止任何写操作**:CREATE/MERGE/SET/DELETE/REMOVE 一律禁止
8. **返回结果必须易于回答**:用 AS 别名,避免返回整个节点
9. **聚合用 collect/count/sum/avg**
## 输出格式
严格输出 JSON:
```json
{{
  "reasoning": "简要分析问题、识别意图、说明查询思路(中文,≤100字)",
  "cypher": "完整 Cypher 查询语句",
  "params": {{ "key": "value" }},
  "expected_columns": ["col1", "col2"]
}}
```
## 当前 Schema
{schema}
## Few-shot 示例
{few_shots}
"""
USER_PROMPT = """## 用户问题
{question}
## 已链接实体
{linked_entities}
请生成 Cypher。
"""
# Few-shot library (configurable via YAML). It is inserted as a value and never
# passed through str.format() itself, so its braces stay single.
FEW_SHOTS_TEMPLATE = """
### 示例1:基本工商
Q: 阿里巴巴的注册资本是多少?
已链接:[{"mention":"阿里巴巴","uuid":"E_001","label":"Enterprise"}]
A:
{
  "reasoning": "FACT/basic_info,按 uuid 直接取属性",
  "cypher": "MATCH (e:Enterprise {uuid: $uuid}) RETURN e.name AS name, e.registered_capital AS capital, e.capital_currency AS currency LIMIT 1",
  "params": {"uuid": "E_001"},
  "expected_columns": ["name", "capital", "currency"]
}
### 示例2:股东
Q: 字节跳动的当前股东有哪些?按持股比例排序
已链接:[{"mention":"字节跳动","uuid":"E_002","label":"Enterprise"}]
A:
{
  "reasoning": "FACT/shareholder,遍历 HOLDS_SHARE,过滤当前有效",
  "cypher": "MATCH (s)-[r:HOLDS_SHARE]->(e:Enterprise {uuid: $uuid}) WHERE r.valid_to IS NULL RETURN s.name AS shareholder, labels(s)[0] AS type, r.percentage AS pct ORDER BY pct DESC LIMIT 50",
  "params": {"uuid": "E_002"},
  "expected_columns": ["shareholder", "type", "pct"]
}
### 示例3:实际控制人
Q: 拼多多的实际控制人是谁?
已链接:[{"mention":"拼多多","uuid":"E_003","label":"Enterprise"}]
A:
{
  "reasoning": "RELATION/control_chain,用派生的 ACTUAL_CONTROLS",
  "cypher": "MATCH (p)-[r:ACTUAL_CONTROLS]->(e:Enterprise {uuid: $uuid}) RETURN p.name AS controller, labels(p)[0] AS type, r.control_ratio AS ratio ORDER BY ratio DESC LIMIT 10",
  "params": {"uuid": "E_003"},
  "expected_columns": ["controller", "type", "ratio"]
}
### 示例4:两公司路径
Q: 腾讯和京东之间有什么关联?
已链接:[{"mention":"腾讯","uuid":"E_004"},{"mention":"京东","uuid":"E_005"}]
A:
{
  "reasoning": "RELATION/path,两实体间最短路径,深度 ≤5",
  "cypher": "MATCH (a {uuid:$a}), (b {uuid:$b}) MATCH p = shortestPath((a)-[*..5]-(b)) RETURN [n IN nodes(p) | {name: coalesce(n.name, n.title), label: labels(n)[0]}] AS path_nodes, [r IN relationships(p) | type(r)] AS path_rels LIMIT 5",
  "params": {"a":"E_004","b":"E_005"},
  "expected_columns": ["path_nodes","path_rels"]
}
### 示例5:聚合
Q: 新能源汽车行业有多少家高新技术企业?
已链接:[{"mention":"新能源汽车","industry_code":"C36"}]
A:
{
  "reasoning": "AGGREGATION/count,按行业 + 高新筛选",
  "cypher": "MATCH (e:Enterprise)-[:IN_INDUSTRY]->(i:Industry {code:$code}) WHERE e.is_high_tech = true AND e.registration_status = 'IN_BUSINESS' RETURN count(e) AS total LIMIT 1",
  "params": {"code":"C36"},
  "expected_columns": ["total"]
}
### 示例6:事件
Q: 比亚迪近 3 年的行政处罚有哪些?
已链接:[{"mention":"比亚迪","uuid":"E_006"}]
A:
{
  "reasoning": "EVENT/admin_penalty,按时间过滤",
  "cypher": "MATCH (e:Enterprise {uuid:$uuid})<-[:PUNISHES]-(p:AdminPenalty) WHERE p.decision_date > date() - duration({years:3}) RETURN p.decision_no AS no, p.decision_date AS date, p.violation_type AS type, p.penalty_amount AS amount, p.violation_description AS description ORDER BY p.decision_date DESC LIMIT 50",
  "params": {"uuid":"E_006"},
  "expected_columns": ["no","date","type","amount","description"]
}
"""
4.4 Generator implementation
# src/kg/kbqa/text2cypher/generator.py
import json

from anthropic import AsyncAnthropic
from pydantic import BaseModel, ValidationError

from kg.core.config import get_settings
from kg.core.logger import get_logger
from kg.kbqa.text2cypher.prompts import SYSTEM_PROMPT, USER_PROMPT

log = get_logger(__name__)


class CypherCandidate(BaseModel):
    reasoning: str
    cypher: str
    params: dict
    expected_columns: list[str]


class CypherGenerator:
    def __init__(self, schema_provider, few_shots: str):
        s = get_settings()
        self.client = AsyncAnthropic(api_key=s.llm.anthropic_api_key.get_secret_value())
        self.model = s.llm.model_kbqa
        self.schema_provider = schema_provider
        self.few_shots = few_shots

    async def generate(
        self,
        question: str,
        linked_entities: list[dict],
        *,
        prior_error: str | None = None,
    ) -> CypherCandidate:
        schema = await self.schema_provider.get_schema()
        system = SYSTEM_PROMPT.format(schema=schema, few_shots=self.few_shots)
        user = USER_PROMPT.format(
            question=question,
            # Serialize as JSON so the format matches the few-shot examples
            linked_entities=json.dumps(linked_entities, ensure_ascii=False, default=str),
        )
        if prior_error:
            user += f"\n\n上一次生成的 Cypher 执行失败:{prior_error}\n请修正后重新生成。"
        resp = await self.client.messages.create(
            model=self.model,
            max_tokens=1500,
            system=[
                {"type": "text", "text": system, "cache_control": {"type": "ephemeral"}}
            ],
            messages=[{"role": "user", "content": user}],
        )
        raw = resp.content[0].text.strip()
        # Tolerance: strip a surrounding Markdown code fence
        if "```json" in raw:
            raw = raw.split("```json")[1].split("```")[0].strip()
        elif raw.startswith("```"):
            raw = raw.strip("`").strip()
        try:
            cand = CypherCandidate.model_validate_json(raw)
        except ValidationError as e:
            log.error("cypher_gen_parse_error", raw=raw, error=str(e))
            raise
        log.info("cypher_generated", reasoning=cand.reasoning, cypher=cand.cypher)
        return cand
4.5 Static validator
# src/kg/kbqa/text2cypher/validator.py
import re

from kg.core.errors import CypherUnsafeError


class CypherValidator:
    FORBIDDEN_KEYWORDS = [
        "CREATE", "DELETE", "DETACH", "SET", "REMOVE", "MERGE", "DROP",
        "FOREACH", "LOAD CSV", "USING PERIODIC COMMIT", "CALL APOC.PERIODIC",
        "CALL APOC.LOAD", "CALL APOC.EXPORT", "CALL APOC.REFACTOR",
    ]
    MAX_LIMIT = 1000
    MAX_PATH_DEPTH = 5

    @classmethod
    def validate(cls, cypher: str, *, schema_labels: set, schema_rels: set) -> None:
        upper = cypher.upper()
        # 1. Forbidden keywords
        for kw in cls.FORBIDDEN_KEYWORDS:
            if re.search(rf"\b{re.escape(kw)}\b", upper):
                raise CypherUnsafeError(f"forbidden keyword: {kw}")
        # 2. LIMIT
        if "LIMIT" not in upper:
            raise CypherUnsafeError("missing LIMIT clause")
        for limit_val in re.findall(r"LIMIT\s+(\d+)", upper):
            if int(limit_val) > cls.MAX_LIMIT:
                raise CypherUnsafeError(f"LIMIT exceeds {cls.MAX_LIMIT}")
        # 3. Path depth: every variable-length relationship needs an explicit
        #    upper bound, so `[*]` and `[*2..]` are rejected as unbounded.
        #    A `*` inside other square brackets can be flagged too; that only
        #    costs a retry.
        var_length = re.findall(r"\[[^\]]*?\*\s*(\d*)\s*(\.\.)?\s*(\d*)[^\]]*\]", cypher)
        for lower, dots, upper_bound in var_length:
            depth = upper_bound if dots else lower
            if not depth:
                raise CypherUnsafeError("unbounded variable-length path")
            if int(depth) > cls.MAX_PATH_DEPTH:
                raise CypherUnsafeError(f"path depth exceeds {cls.MAX_PATH_DEPTH}")
        # 4. Label validity. Only identifier-like tokens directly after ":" are
        #    collected, so numeric map values such as {years:3} are not mistaken
        #    for labels.
        used_labels = set(re.findall(r":([A-Za-z_]\w*)", cypher))
        # Note: relationship types also match as :REL, but schema_rels covers them
        all_known = schema_labels | schema_rels
        unknown = used_labels - all_known
        if unknown:
            raise CypherUnsafeError(f"unknown labels/rels: {unknown}")
        # 5. Parameterization check
        if "'" in cypher and "$" not in cypher:
            # Hint: a string literal without any parameter may indicate an injection risk
            ...
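One refinement the validator could apply before its keyword scan is to blank out quoted string literals, so that a value such as `'CREATE INDEX'` inside a `WHERE` clause does not trigger a false rejection. This is a sketch of that idea, not part of the original validator; `strip_string_literals` is a hypothetical helper, and it deliberately ignores backtick-quoted identifiers.

```python
import re

# Single- or double-quoted string literals, honoring backslash escapes.
_STRING_LITERAL_RE = re.compile(r"'(?:\\.|[^'\\])*'|\"(?:\\.|[^\"\\])*\"")


def strip_string_literals(cypher: str) -> str:
    """Replace every quoted literal with an empty literal before keyword scanning."""
    return _STRING_LITERAL_RE.sub("''", cypher)
```

The stripped text would be used only for the checks; the query that gets executed stays unchanged.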
4.6 Sandboxed executor
# src/kg/kbqa/text2cypher/executor.py
import asyncio

from kg.store.neo4j_client import Neo4jClient
from kg.core.errors import KGError


class CypherExecutor:
    DEFAULT_TIMEOUT = 10  # seconds
    MAX_ROWS = 1000

    @classmethod
    async def execute(cls, cypher: str, params: dict, *, timeout: int = DEFAULT_TIMEOUT) -> list[dict]:
        try:
            rows = await asyncio.wait_for(
                Neo4jClient.execute_read(cypher, params),
                timeout=timeout,
            )
        except asyncio.TimeoutError:
            # wait_for cancels the awaiting task; whether the server-side query
            # is also aborted depends on how Neo4jClient handles cancellation.
            raise KGError(f"cypher execution timeout (>{timeout}s)")
        if len(rows) > cls.MAX_ROWS:
            raise KGError(f"result too large: {len(rows)} rows (max {cls.MAX_ROWS})")
        return rows
4.7 Complete Text2Cypher pipeline
# src/kg/kbqa/text2cypher/pipeline.py
from kg.kbqa.text2cypher.generator import CypherGenerator
from kg.kbqa.text2cypher.validator import CypherValidator
from kg.kbqa.text2cypher.executor import CypherExecutor
from kg.core.errors import CypherUnsafeError, KGError
from kg.core.logger import get_logger

log = get_logger(__name__)


class Text2CypherPipeline:
    MAX_RETRIES = 2

    def __init__(self, generator, schema_provider):
        self.generator = generator
        self.schema_provider = schema_provider

    async def run(self, question: str, linked_entities: list[dict]) -> dict:
        prior_error = None
        labels, rels = await self.schema_provider.get_label_rel_sets()
        for attempt in range(self.MAX_RETRIES + 1):
            cand = await self.generator.generate(
                question, linked_entities, prior_error=prior_error
            )
            # 1. Validate
            try:
                CypherValidator.validate(cand.cypher, schema_labels=labels, schema_rels=rels)
            except CypherUnsafeError as e:
                log.warning("cypher_unsafe", attempt=attempt, error=str(e))
                prior_error = f"安全校验失败: {e}"
                continue
            # 2. Execute
            try:
                rows = await CypherExecutor.execute(cand.cypher, cand.params)
                return {
                    "cypher": cand.cypher,
                    "params": cand.params,
                    "reasoning": cand.reasoning,
                    "rows": rows,
                    "attempts": attempt + 1,
                }
            except KGError as e:
                log.warning("cypher_exec_failed", attempt=attempt, error=str(e))
                prior_error = f"执行失败: {e}"
                continue
        raise KGError(f"text2cypher failed after {self.MAX_RETRIES + 1} attempts: {prior_error}")
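5. GraphRAG module
Text2Cypher handles open-ended questions poorly (e.g. "give an overview of company XX", "what risks does company XX have"), so these go through GraphRAG instead.
5.1 Three retrieval strategies used together
Open-ended question
        ↓
┌─────────────────────────────────────────────────────────────────┐
│ Strategy A: entity-anchored 2-hop subgraph                      │  share 70%
│  - find anchor entities, expand 1-2 hops into a local subgraph  │
├─────────────────────────────────────────────────────────────────┤
│ Strategy B: document vector recall                              │  share 20%
│  - semantic search over the document store, returning passages  │
├─────────────────────────────────────────────────────────────────┤
│ Strategy C: community summary recall                            │  share 10%
│  - offline community detection + LLM summaries; recall by topic │
└─────────────────────────────────────────────────────────────────┘
        ↓
Reranker (bge-reranker-large)
        ↓
Merge into a context and feed it to the LLM for generation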
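The source does not say what the 70% / 20% / 10% figures are shares of. If they are read as shares of the context slots handed to the reranker (an assumption made only for this sketch), they can be turned into per-strategy counts with a largest-remainder split. `SHARES_PCT` and `allocate_slots` are hypothetical names.

```python
# Shares from the diagram above, as integer percentages to avoid float rounding.
SHARES_PCT = {"subgraph": 70, "documents": 20, "communities": 10}


def allocate_slots(total: int, shares_pct: dict[str, int]) -> dict[str, int]:
    """Split `total` slots by percentage using the largest-remainder method."""
    slots = {name: total * pct // 100 for name, pct in shares_pct.items()}
    remainders = {name: total * pct % 100 for name, pct in shares_pct.items()}
    leftover = total - sum(slots.values())
    # Hand the unassigned slots to the strategies with the largest remainders.
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        slots[name] += 1
    return slots
```

With 6 slots this gives 4 / 1 / 1, so even the 10% strategy keeps a seat in a small context window.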
5.2 Subgraph retrieval + verbalization
# src/kg/kbqa/graph_rag/retriever.py
from kg.store.neo4j_client import Neo4jClient


class SubgraphRetriever:
    MAX_HOPS = 5  # assumption: same ceiling as the validator's MAX_PATH_DEPTH

    async def retrieve(self, anchor_uuids: list[str], *, n_hops: int = 2, limit: int = 200) -> list[dict]:
        """Retrieve a focused subgraph around anchor entities."""
        # Cypher does not accept a parameter as a variable-length bound, so the
        # hop count is validated as a small integer and substituted into the
        # query text instead of being passed in params.
        if not isinstance(n_hops, int) or not 1 <= n_hops <= self.MAX_HOPS:
            raise ValueError(f"n_hops must be an integer in 1..{self.MAX_HOPS}")
        cypher = """
        UNWIND $uuids AS uuid
        MATCH (anchor {uuid: uuid})
        CALL {
            WITH anchor
            MATCH path = (anchor)-[*1..$hops]-(neighbor)
            WHERE all(r IN relationships(path) WHERE
                coalesce(r.valid_to, date()) >= date() - duration({years: 2}))
            RETURN path
            LIMIT $limit
        }
        WITH collect(DISTINCT path) AS paths
        UNWIND paths AS path
        RETURN
            [n IN nodes(path) | {uuid: n.uuid, label: labels(n)[0], props: properties(n)}] AS nodes,
            [r IN relationships(path) | {type: type(r), props: properties(r), start: startNode(r).uuid, end: endNode(r).uuid}] AS rels
        """.replace("$hops", str(n_hops))
        return await Neo4jClient.execute_read(
            cypher, {"uuids": anchor_uuids, "limit": limit}
        )

# src/kg/kbqa/graph_rag/verbalizer.py
from typing import Iterable

PROP_BLACKLIST = {"uuid", "embedding", "_meta_uuid", "_meta_source_id"}


class SubgraphVerbalizer:
    """Convert subgraph into LLM-friendly natural language."""

    @staticmethod
    def _node_text(node: dict) -> str:
        label = node["label"]
        props = node["props"]
        name = props.get("name") or props.get("title") or props.get("uuid")
        key_props = {
            "Enterprise": ["unified_credit_code", "registration_status", "industry_name", "establishment_date"],
            "NaturalPerson": ["gender", "birth_year", "is_executed_dishonest"],
            "LegalCase": ["case_no", "case_reason", "judgment_date"],
            "AdminPenalty": ["decision_no", "violation_type", "penalty_amount"],
        }.get(label, [])
        detail = ", ".join(f"{k}={props[k]}" for k in key_props if k in props and props[k] is not None)
        return f"[{label}] {name}" + (f" ({detail})" if detail else "")

    @staticmethod
    def _rel_text(rel: dict, name_lookup: dict) -> str:
        head = name_lookup.get(rel["start"], rel["start"])
        tail = name_lookup.get(rel["end"], rel["end"])
        rt = rel["type"]
        props = rel.get("props", {})
        prop_summary = ", ".join(
            f"{k}={v}" for k, v in props.items()
            if k not in PROP_BLACKLIST and v is not None
        )
        return f" - {head} --[{rt}{(' '+prop_summary) if prop_summary else ''}]--> {tail}"

    @classmethod
    def verbalize(cls, subgraph_rows: Iterable[dict]) -> str:
        nodes_seen, rels_seen = {}, []
        for row in subgraph_rows:
            for n in row["nodes"]:
                nodes_seen[n["uuid"]] = n
            for r in row["rels"]:
                rels_seen.append(r)
        # Deduplicate relationships
        unique_rels = {(r["start"], r["type"], r["end"]): r for r in rels_seen}.values()
        # Verbalize
        name_lookup = {uuid: (n["props"].get("name") or uuid) for uuid, n in nodes_seen.items()}
        lines = ["## 实体清单"]
        for n in nodes_seen.values():
            lines.append("- " + cls._node_text(n))
        lines.append("\n## 关系")
        for r in unique_rels:
            lines.append(cls._rel_text(r, name_lookup))
        return "\n".join(lines)
5.3 GraphRAG pipeline
# src/kg/kbqa/graph_rag/pipeline.py
class GraphRAGPipeline:
    def __init__(self, retriever, verbalizer, doc_retriever, reranker, llm):
        self.retriever = retriever
        self.verbalizer = verbalizer
        self.doc_retriever = doc_retriever
        self.reranker = reranker
        self.llm = llm

    async def run(self, question: str, linked_entities: list[dict]) -> dict:
        anchor_uuids = [e["linked"]["node"]["uuid"] for e in linked_entities]
        # A. Graph subgraph
        subgraph_rows = await self.retriever.retrieve(anchor_uuids, n_hops=2)
        graph_text = self.verbalizer.verbalize(subgraph_rows)
        # B. Document retrieval
        doc_chunks = await self.doc_retriever.search(question, k=8)
        # C. Rerank
        all_passages = [{"text": graph_text, "source": "graph"}] + [
            {"text": d["text"], "source": d["doc_id"]} for d in doc_chunks
        ]
        reranked = await self.reranker.rerank(question, all_passages, top_k=6)
        # D. Generate. Passages are numbered so the answer can cite them as [ref:N].
        context = "\n\n---\n\n".join(
            f"[{i}] {p['text']}" for i, p in enumerate(reranked, start=1)
        )
        answer = await self.llm.generate(question, context)
        return {
            "answer": answer,
            "context": context,  # read by AnswerGenerator in GraphRAG mode
            "citations": [p["source"] for p in reranked],
        }
6. Answer generation and citations
6.1 Answer generation prompt
ANSWER_SYSTEM = """你是企业知识图谱问答助手。根据提供的查询结果或上下文,准确回答用户问题。
## 强制规则
1. **只基于提供的资料回答**,不编造事实
2. 资料中无相关内容时,明确说"知识图谱中暂无该信息"
3. 涉及数字、日期、比例时**保留原值**,不做估算
4. 答案中提及的每个关键事实,**必须给出引用标记** `[ref:N]`,N 对应资料中的编号
5. 答案语言简洁专业,避免冗余客套
6. 若资料中存在数据冲突,明确指出并列出各方说法
7. 不解释你的查询过程,直接回答
## 输出格式
答案正文。
关键事实1 [ref:1]。关键事实2 [ref:2]。
## 资料
{context}
## 问题
{question}
"""


class AnswerGenerator:
    def __init__(self, llm):
        # `llm` is the project's LLM wrapper exposing complete(system=..., user=...)
        self.llm = llm

    async def generate(self, question: str, retrieval_result: dict, mode: str) -> dict:
        if mode == "text2cypher":
            context = self._format_cypher_result(retrieval_result)
        else:
            context = retrieval_result["context"]
        resp = await self.llm.complete(
            system=ANSWER_SYSTEM.format(context=context, question=question),
            # The question is repeated as the user turn; an empty user message
            # is not accepted by the Messages API.
            user=question,
        )
        return {
            "text": resp,
            # _extract_citations is not shown in this document
            "citations": self._extract_citations(resp, retrieval_result),
        }

    def _format_cypher_result(self, result: dict) -> str:
        rows = result["rows"]
        lines = [f"### 查询结果(共 {len(rows)} 条)"]
        for i, row in enumerate(rows, start=1):
            lines.append(f"[{i}] " + " | ".join(f"{k}={v}" for k, v in row.items()))
        return "\n".join(lines)
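`AnswerGenerator` relies on an `_extract_citations` helper that this document does not show. Its first step is presumably pulling the `[ref:N]` markers out of the answer; a minimal sketch of that step follows, with `extract_ref_ids` as a hypothetical function that drops numbers not matching any numbered source.

```python
import re

_REF_RE = re.compile(r"\[ref:(\d+)\]")


def extract_ref_ids(answer: str, n_sources: int) -> list[int]:
    """Return cited reference numbers in first-appearance order.

    Numbers outside 1..n_sources are ignored, since they cannot point at a
    numbered row or passage in the context.
    """
    seen: list[int] = []
    for match in _REF_RE.finditer(answer):
        ref = int(match.group(1))
        if 1 <= ref <= n_sources and ref not in seen:
            seen.append(ref)
    return seen
```

Each returned number can then be mapped back to the corresponding row or passage to build the citation records.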
6.2 Citation and provenance data structures
from pydantic import BaseModel


class Citation(BaseModel):
    ref_id: int  # the number N used as [ref:N] in the answer
    source_type: str  # GRAPH_NODE / GRAPH_REL / DOCUMENT
    source_uuid: str | None
    source_doc_id: str | None
    snippet: str | None
    confidence: float


class KBQAResponse(BaseModel):
    answer: str
    citations: list[Citation]
    intent: dict
    linked_entities: list[dict]
    trace_id: str
    duration_ms: int
    cost_usd: float
    fallback_used: bool = False
7. End-to-end API integration
# src/kg/api/routers/kbqa.py
from fastapi import APIRouter, Depends
from pydantic import BaseModel, Field

from kg.api.deps import get_current_user
from kg.kbqa.pipeline import KBQAPipeline
from kg.kbqa.response_models import KBQAResponse

router = APIRouter()


class KBQARequest(BaseModel):
    question: str = Field(min_length=2, max_length=500)
    session_id: str | None = None
    user_context: dict | None = None


@router.post("/ask", response_model=KBQAResponse)
async def ask(req: KBQARequest, user=Depends(get_current_user)) -> KBQAResponse:
    return await KBQAPipeline.instance().answer(
        req.question, session_id=req.session_id, user=user
    )


@router.post("/feedback")
async def feedback(req: dict, user=Depends(get_current_user)) -> dict:
    """User feedback on answer quality — for offline evaluation."""
    # Write to the PostgreSQL kbqa_feedback table
    ...
    return {"ok": True}
8. Evaluation framework
8.1 Evaluation set structure
tests/fixtures/kbqa_eval_set.jsonl:
{"id": "Q001", "category": "FACT", "subcategory": "basic_info", "question": "阿里巴巴的法定代表人是谁?", "expected_intent": "FACT", "expected_entities": ["阿里巴巴"], "expected_cypher_pattern": "MATCH.*Person.*SERVES_AS.*Enterprise.*is_legal_rep", "gold_answer": "蔡崇信", "gold_answer_aliases": ["蔡崇信", "Joseph Tsai"], "evaluation_method": "exact_match"}
{"id": "Q002", "category": "FACT", "subcategory": "shareholder", "question": "字节跳动的当前股东有哪些?", "expected_intent": "FACT", "expected_entities": ["字节跳动"], "gold_answer_set": ["张一鸣", "梁汝波", ...], "evaluation_method": "set_overlap", "min_f1": 0.8}
{"id": "Q003", "category": "AGGREGATION", "subcategory": "count", "question": "新能源汽车行业有多少家高新企业?", "expected_intent": "AGGREGATION", "gold_answer": 1247, "evaluation_method": "exact_numeric"}
{"id": "Q004", "category": "OPEN", "subcategory": "summary", "question": "介绍一下比亚迪", "expected_intent": "OPEN", "rubric": ["注册信息", "主要业务", "主要股东", "近期事件"], "evaluation_method": "llm_judge"}
{"id": "Q005", "category": "OUT_OF_SCOPE", "question": "今天天气怎么样?", "expected_intent": "OUT_OF_SCOPE", "evaluation_method": "expect_refusal"}
8.2 Evaluation set size and distribution
600 evaluation questions in total, distributed as follows:
| Category | Count | Share |
|---|---|---|
| FACT/basic | 100 | 17% |
| FACT/shareholder | 60 | 10% |
| FACT/executive | 50 | 8% |
| RELATION/direct | 60 | 10% |
| RELATION/path | 40 | 7% |
| RELATION/control | 50 | 8% |
| AGGREGATION | 80 | 13% |
| EVENT | 80 | 13% |
| OPEN | 60 | 10% |
| OUT_OF_SCOPE | 20 | 3% |
8.3 Evaluation method
# src/kg/quality/kbqa_eval.py
import json
from pathlib import Path


class KBQAEvaluator:
    def __init__(self, pipeline):
        self.pipeline = pipeline

    async def evaluate(self, eval_set_path: Path) -> dict:
        lines = eval_set_path.read_text(encoding="utf-8").splitlines()
        cases = [json.loads(line) for line in lines if line.strip()]
        results = []
        for case in cases:
            resp = await self.pipeline.answer(case["question"])
            score = await self._score(case, resp)
            results.append({**case, "response": resp.model_dump(), "score": score})
        return self._aggregate(results)

    async def _score(self, case: dict, resp) -> dict:
        method = case["evaluation_method"]
        # Unknown evaluation methods count as incorrect instead of raising later
        s = {"answer_correct": False}
        # Intent
        s["intent_correct"] = resp.intent["category"] == case.get("expected_intent")
        # Answer (the _extract_* and _llm_judge helpers are not shown in this document)
        if method == "exact_match":
            s["answer_correct"] = any(
                alias.lower() in resp.answer.lower()
                for alias in [case["gold_answer"]] + case.get("gold_answer_aliases", [])
            )
        elif method == "set_overlap":
            extracted = set(self._extract_named_entities(resp.answer))
            gold = set(case["gold_answer_set"])
            tp = len(extracted & gold)
            precision = tp / len(extracted) if extracted else 0
            recall = tp / len(gold) if gold else 0
            f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
            s["f1"] = f1
            s["answer_correct"] = f1 >= case.get("min_f1", 0.8)
        elif method == "exact_numeric":
            num = self._extract_number(resp.answer)
            s["answer_correct"] = num is not None and abs(num - case["gold_answer"]) < 1
        elif method == "llm_judge":
            s["answer_correct"] = await self._llm_judge(case, resp)
        elif method == "expect_refusal":
            s["answer_correct"] = any(
                kw in resp.answer for kw in ["不在", "无法", "超出", "无关"]
            )
        # Performance
        s["latency_ms"] = resp.duration_ms
        s["cost_usd"] = resp.cost_usd
        s["has_citations"] = len(resp.citations) > 0
        return s

    def _aggregate(self, results: list[dict]) -> dict:
        total = len(results)
        if total == 0:
            raise ValueError("evaluation set is empty")
        intent_acc = sum(r["score"]["intent_correct"] for r in results) / total
        answer_acc = sum(r["score"]["answer_correct"] for r in results) / total
        avg_latency = sum(r["score"]["latency_ms"] for r in results) / total
        p95_latency = sorted(r["score"]["latency_ms"] for r in results)[int(total * 0.95)]
        total_cost = sum(r["score"]["cost_usd"] for r in results)
        # Accuracy per category
        by_cat = {}
        for r in results:
            cat = r["category"]
            by_cat.setdefault(cat, []).append(r["score"]["answer_correct"])
        cat_acc = {c: sum(v) / len(v) for c, v in by_cat.items()}
        return {
            "total": total,
            "intent_accuracy": intent_acc,
            "answer_accuracy": answer_acc,
            "by_category": cat_acc,
            "avg_latency_ms": avg_latency,
            "p95_latency_ms": p95_latency,
            "total_cost_usd": total_cost,
        }
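The P95 figure in `_aggregate` is taken by indexing the sorted latencies at `int(total * 0.95)`. An alternative, shown here only as a sketch, is the nearest-rank definition, which makes the choice of element explicit. `percentile_nearest_rank` is a hypothetical helper.

```python
import math


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

For 20 samples this selects the 19th smallest value as P95, whereas `int(20 * 0.95)` used as a 0-based index selects the 20th.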
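8.4 Acceptance targets
| Metric | M2 (launch) target | M3 (iteration) target |
|---|---|---|
| Overall answer accuracy | ≥ 0.80 | ≥ 0.88 |
| FACT answer accuracy | ≥ 0.92 | ≥ 0.96 |
| AGGREGATION answer accuracy | ≥ 0.85 | ≥ 0.92 |
| RELATION answer accuracy | ≥ 0.78 | ≥ 0.85 |
| OPEN answer accuracy (LLM Judge) | ≥ 0.75 | ≥ 0.85 |
| OUT_OF_SCOPE refusal rate | = 1.00 | = 1.00 |
| Intent recognition accuracy | ≥ 0.95 | ≥ 0.98 |
| Cypher safety pass rate | = 1.00 | = 1.00 |
| P95 end-to-end latency | ≤ 5s | ≤ 3s |
| Cost per question | ≤ $0.02 | ≤ $0.01 |
8.5 Continuous evaluation process
Every Monday in the early morning:
1. Run the full evaluation set
2. Compare against the previous week and generate weekly_eval_report.md
3. If accuracy regresses by more than 2 percentage points: alert and block the release
4. New bad cases are automatically added to the labeling queue
Every two weeks:
1. Review bad cases and decide between prompt tuning, adding few-shot examples, or a model upgrade
2. Expand the evaluation set (30 new edge cases every two weeks)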
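The weekly release gate (a regression of more than 2 percentage points triggers an alert and blocks the release) can be sketched as a comparison of two metric dictionaries. The function name, the dictionary shape, and the choice to skip metrics missing from the current run are assumptions made for illustration.

```python
def regression_gate(
    previous: dict[str, float],
    current: dict[str, float],
    max_drop: float = 0.02,
) -> list[str]:
    """Return the metrics whose accuracy fell by more than `max_drop` (absolute)."""
    failing = []
    for name, prev in previous.items():
        cur = current.get(name)
        # Metrics absent from the current run are skipped (assumption).
        if cur is not None and prev - cur > max_drop:
            failing.append(name)
    return failing
```

An empty list means the release may proceed; any entry would raise the alert and block the release.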
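9. Runtime optimization
9.1 Caching strategy
| Cached item | TTL | Invalidation trigger |
|---|---|---|
| Serialized schema | 1h | Ontology change |
| Entity linking result (keyed by mention) | 24h | Entity merge |
| Answers to frequent questions (keyed by question hash + user role) | 5min | Data update |
| Cypher query result (keyed by hash of cypher + params) | 1min | Update of any involved node |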
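For the last row of the table, the cache key has to be stable regardless of parameter order or incidental whitespace. A minimal sketch follows, assuming a SHA-256 digest over canonical JSON; the `kbqa:cypher:` prefix and the function name are hypothetical.

```python
import hashlib
import json


def cypher_cache_key(cypher: str, params: dict) -> str:
    """Build a deterministic cache key from the query text and its parameters."""
    # Collapse whitespace so formatting differences do not fragment the cache.
    # Caveat: this also collapses runs of spaces inside string literals.
    normalized = " ".join(cypher.split())
    payload = json.dumps(
        {"cypher": normalized, "params": params},
        sort_keys=True,
        ensure_ascii=False,
        default=str,  # dates and other non-JSON values fall back to str()
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"kbqa:cypher:{digest}"
```

Because generated queries are parameterized, the same question about two different companies produces the same query text but different `params`, and therefore different keys.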
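9.2 LLM cost control
from datetime import date

from kg.core.config import get_settings

# `redis` is assumed to be the project's shared async Redis client, and
# LLMBudgetExceeded an exception defined elsewhere in the project; neither
# import is shown in this document.


class LLMBudgetGuard:
    """Daily budget tracker, raises if exceeded."""

    async def check(self, estimated_cost: float) -> None:
        spent = await redis.get(f"llm:cost:{date.today()}") or 0.0
        if float(spent) + estimated_cost > get_settings().llm.daily_budget_usd:
            raise LLMBudgetExceeded("daily LLM budget exceeded")

    async def record(self, cost: float) -> None:
        key = f"llm:cost:{date.today()}"
        await redis.incrbyfloat(key, cost)
        await redis.expire(key, 86400 * 3)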
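9.3 Performance tuning checklist
- Intent recognition: use Haiku instead of Opus (about 90% lower cost)
- Schema injection: Prompt Cache (saves about 70% of input tokens)
- Few-shot library: left unchanged for 90 days once cached
- Concurrency: run entity linking, subgraph retrieval, and document retrieval in parallel
- Streaming output: return the answer-generation stage as an SSE stream
- Warm-up: pre-generate 1-hop subgraphs for popular enterprises and cache them in Redis
- Degradation: when the LLM service fails, fall back to full-text search + templated answers
9.4 Observability metrics
| Metric | Alert threshold |
|---|---|
| KBQA QPS | Deviates ±50% from baseline |
| P95 end-to-end latency | > 5s sustained for 5min |
| Cypher retry rate | > 15% |
| Refusal rate | > 10% sustained for 1h |
| Daily LLM cost | Warning at > 80% of budget |
| Unlinked-entity rate | > 20% |
| Negative user feedback rate | > 5% |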
10. Sample test set (30 preset cases)
Ready to use for first-week integration testing:
{"id":"Q001","question":"阿里巴巴集团的统一社会信用代码是多少?","category":"FACT"}
{"id":"Q002","question":"宁德时代的法定代表人是谁?","category":"FACT"}
{"id":"Q003","question":"字节跳动的注册地址在哪里?","category":"FACT"}
{"id":"Q004","question":"比亚迪股份有限公司的注册资本是多少?","category":"FACT"}
{"id":"Q005","question":"小米的当前股东有哪些?按持股比例排序","category":"FACT"}
{"id":"Q006","question":"腾讯的实际控制人是谁?","category":"RELATION"}
{"id":"Q007","question":"马云目前在哪些企业担任董事?","category":"FACT"}
{"id":"Q008","question":"阿里巴巴和蚂蚁集团是什么关系?","category":"RELATION"}
{"id":"Q009","question":"宁德时代和比亚迪之间有什么关联?","category":"RELATION"}
{"id":"Q010","question":"刘强东最终受益的企业有哪些?","category":"RELATION"}
{"id":"Q011","question":"百度的所有子公司有哪些?","category":"RELATION"}
{"id":"Q012","question":"美团在最近 3 年有哪些诉讼?","category":"EVENT"}
{"id":"Q013","question":"拼多多有什么行政处罚记录?","category":"EVENT"}
{"id":"Q014","question":"小鹏汽车最近一次融资是什么时候?","category":"EVENT"}
{"id":"Q015","question":"比亚迪的工商变更历史","category":"EVENT"}
{"id":"Q016","question":"新能源汽车行业有多少家高新企业?","category":"AGGREGATION"}
{"id":"Q017","question":"广东省的上市公司数量","category":"AGGREGATION"}
{"id":"Q018","question":"半导体行业最近 5 年融资 TOP 10","category":"AGGREGATION"}
{"id":"Q019","question":"被列入经营异常名录的企业有多少家?","category":"AGGREGATION"}
{"id":"Q020","question":"哪些上市公司的法定代表人变更最频繁?","category":"AGGREGATION"}
{"id":"Q021","question":"介绍一下宁德时代","category":"OPEN"}
{"id":"Q022","question":"比亚迪面临的主要风险有哪些?","category":"OPEN"}
{"id":"Q023","question":"阿里和腾讯哪个生态更强?","category":"OPEN"}
{"id":"Q024","question":"小米的核心竞争力是什么?","category":"OPEN"}
{"id":"Q025","question":"百度近年的战略调整","category":"OPEN"}
{"id":"Q026","question":"今天股市怎么样?","category":"OUT_OF_SCOPE"}
{"id":"Q027","question":"帮我写一首诗","category":"OUT_OF_SCOPE"}
{"id":"Q028","question":"删除张三的节点","category":"OUT_OF_SCOPE"}
{"id":"Q029","question":" ","category":"OUT_OF_SCOPE"}
{"id":"Q030","question":"DROP TABLE enterprises","category":"OUT_OF_SCOPE"}
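11. Launch checklist
Before the KBQA service goes live, confirm the following:
- Evaluation set ≥ 600 questions, covering all intent categories
- Overall answer accuracy ≥ 80% (on the test set)
- OUT_OF_SCOPE refusal rate = 100% (including injection-attack samples)
- Cypher safety validation passes 100% (including malicious input)
- P95 latency ≤ 5s (including LLM calls)
- Cost per question ≤ $0.02
- Citations and provenance work (every fact is clickable and traceable)
- User feedback channel connected
- Monitoring dashboard + alerting connected
- LLM cost budget + circuit breaker connected
- Gradual rollout plan (5% → 25% → 100%)
- Emergency rollback plan
Document series index: see 00_README.md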