配套文档:基于《01_领域本体详细设计书.md》、《02_代码工程脚手架.md》 覆盖范围:四类数据源(内部数据库、文档、外部 API、实时流)的完整接入方案 本文档作用:每个数据源接入前的标准化设计模板 + 具体源头的逐一规划
1. 接入总体架构
┌────────────────────────────────────────────────────────────────┐
│ 数据源接入统一架构 │
├────────────────────────────────────────────────────────────────┤
│ │
│ 内部DB ─┐ 文档 ─┐ 外部API ─┐ 实时流 ─┐ │
│ ↓ ↓ ↓ ↓ │
│ ┌────────────────────────────────────────────┐ │
│ │ Source Adapter (统一抽象) │ │
│ └────────────────────────────────────────────┘ │
│ ↓ │
│ ┌────────────────────────────────────────────┐ │
│ │ Raw Landing Zone (PostgreSQL + 对象存储) │ │
│ └────────────────────────────────────────────┘ │
│ ↓ │
│ ┌────────────────────────────────────────────┐ │
│ │ Extraction Pipeline (NER/RE/EE) │ │
│ └────────────────────────────────────────────┘ │
│ ↓ │
│ ┌────────────────────────────────────────────┐ │
│ │ Staging (PostgreSQL,待审核三元组库) │ │
│ └────────────────────────────────────────────┘ │
│ ↓ │
│ ┌────────────────────────────────────────────┐ │
│ │ Fusion & Quality Gate │ │
│ └────────────────────────────────────────────┘ │
│ ↓ │
│ ┌────────────────────────────────────────────┐ │
│ │ Neo4j + ES + Milvus (Online Stores) │ │
│ └────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
关键设计:
- 统一 Adapter 抽象:每个数据源实现
SourceAdapter协议,调度器看不到细节差异 - 三层 Landing 策略:Raw → Staging → Online,每层都可重放
- 审计可溯源:每条数据保留
source / source_id / source_record_url / extracted_by - 背压控制:每条链路有限流、重试、死信队列
2. SourceAdapter 接口规范
所有数据源接入必须实现此协议,确保调度器、监控、错误处理一致。
# src/kg/ingestion/sources/base.py
from abc import ABC, abstractmethod
from datetime import datetime
from typing import AsyncIterator, Literal
from pydantic import BaseModel
class SourceRecord(BaseModel):
"""Raw record from a data source — uniform envelope."""
source: str # NECIPS / TYC / DOC_PDF / KAFKA_AUDIT / ...
source_id: str # 源系统主键
source_record_url: str | None
record_type: str # ENTERPRISE / PERSON / EVENT / DOCUMENT
payload: dict # 原始字段(保留全部,不裁剪)
fetched_at: datetime
schema_version: str # 源数据 schema 版本
class IngestStats(BaseModel):
fetched: int = 0
succeeded: int = 0
failed: int = 0
skipped: int = 0
bytes_in: int = 0
duration_seconds: float = 0.0
class SourceAdapter(ABC):
"""Uniform interface for all data sources."""
source_name: str # 必须唯一
mode: Literal["batch", "stream", "polling"]
@abstractmethod
async def discover(self) -> dict:
"""Probe source: connectivity, schema version, watermark."""
@abstractmethod
async def fetch(
self,
*,
since: datetime | None = None,
until: datetime | None = None,
limit: int | None = None,
) -> AsyncIterator[SourceRecord]:
"""Yield raw records. Must be resumable via watermark."""
@abstractmethod
async def checkpoint(self, watermark: datetime) -> None:
"""Persist progress; called by orchestrator after each chunk."""
@abstractmethod
async def health(self) -> dict:
"""Return live health for monitoring."""
3. 数据源 #1:企业内部数据库(CDC 接入)
3.1 典型场景
企业内部业务系统:
- HR 系统:员工、组织架构、任职变动
- CRM:客户主数据、联系人
- ERP:供应商、客户、合同、物料
- 法务系统:合同、案件
- OA:流程、文件、人员信息
3.2 接入模式选择
| 模式 | 适用 | 工具 |
|---|---|---|
| JDBC 全量拉取 | 数据量小(< 100 万行)、变化少 | Airflow SQL Operator |
JDBC 增量拉取(基于 updated_at) | 中等量、源系统有时间戳 | Airflow + 自研 |
| CDC(binlog 订阅) | 大数据量、低延迟要求 ✓ 推荐 | Debezium + Kafka |
| 触发器 + 消息队列 | 源库可侵入 | 源库 trigger |
生产推荐:业务系统统一用 Debezium CDC,避免对源库压力。
3.3 Debezium CDC 接入
拓扑
[MySQL/Oracle/PG] → [Debezium Connector] → [Kafka Topic]
↓
[Flink/Python Consumer]
↓
[Raw Landing (PostgreSQL)]
↓
[D2R Engine → Neo4j]
Debezium 配置示例(MySQL HR 库)
{
"name": "hr-mysql-cdc",
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"tasks.max": "1",
"database.hostname": "hr-mysql.internal",
"database.port": "3306",
"database.user": "debezium",
"database.password": "${secrets:DEBEZIUM_PWD}",
"database.server.id": "1001",
"topic.prefix": "cdc.hr",
"database.include.list": "hr_prod",
"table.include.list": "hr_prod.employee,hr_prod.department,hr_prod.org",
"snapshot.mode": "initial",
"schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
"schema.history.internal.kafka.topic": "cdc.hr.schema-history",
"include.schema.changes": "true",
"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": "false",
"transforms.unwrap.delete.handling.mode": "rewrite"
}
}
消费端代码
# src/kg/ingestion/sources/mysql_cdc.py
from kg.ingestion.sources.base import SourceAdapter, SourceRecord
from aiokafka import AIOKafkaConsumer
import json
class MySQLCDCAdapter(SourceAdapter):
source_name = "MYSQL_CDC_HR"
mode = "stream"
def __init__(self, bootstrap: str, topics: list[str], group_id: str):
self.bootstrap = bootstrap
self.topics = topics
self.group_id = group_id
async def discover(self) -> dict:
return {"topics": self.topics, "type": "kafka-cdc"}
async def fetch(self, *, since=None, until=None, limit=None):
consumer = AIOKafkaConsumer(
*self.topics,
bootstrap_servers=self.bootstrap,
group_id=self.group_id,
enable_auto_commit=False,
auto_offset_reset="earliest",
value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
)
await consumer.start()
try:
count = 0
async for msg in consumer:
if msg.value is None:
continue
op = msg.value.get("__op", "c") # c/u/d/r
table = msg.topic.replace("cdc.hr.", "")
yield SourceRecord(
source=self.source_name,
source_id=f"{table}:{msg.value.get('id')}",
source_record_url=None,
record_type=self._map_record_type(table),
payload={**msg.value, "_op": op, "_table": table},
fetched_at=msg.timestamp,
schema_version="hr.v1",
)
count += 1
if limit and count >= limit:
break
# Commit per batch is done by orchestrator via checkpoint()
finally:
await consumer.stop()
@staticmethod
def _map_record_type(table: str) -> str:
return {
"employee": "PERSON",
"department": "DEPARTMENT",
"org": "ORGANIZATION",
}.get(table, "UNKNOWN")
async def checkpoint(self, watermark):
# Kafka offsets are committed by consumer group
pass
async def health(self) -> dict:
return {"alive": True}
3.4 D2R 映射
D2R(Database to RDF/KG)核心思想:用 YAML 声明式定义数据库行 → 图谱节点/关系的映射,引擎自动生成 Cypher。
YAML 映射示例
# src/kg/ingestion/d2r/mappings/hr_employee.yaml
mapping_name: hr_employee_to_person
version: 1.0
source:
type: mysql_cdc
topic: cdc.hr.employee
target:
- kind: node
label: NaturalPerson
primary_key:
property: person_hash
expression: "fn:person_hash($name, $id_card_tail, $birth_year)"
properties:
uuid: "fn:uuid_v5_namespace('person', $person_hash)"
name: $name
gender: "fn:map($gender, {'1':'M','2':'F','0':'U'})"
birth_year: "fn:year($birth_date)"
id_card_tail: "fn:substr($id_card, -4)"
nationality: $nationality
- kind: node
label: Department
primary_key:
property: dept_code
expression: $dept_code
properties:
name: $dept_name
- kind: relationship
type: WORKS_IN
from:
label: NaturalPerson
key: person_hash
expression: "fn:person_hash($name, $id_card_tail, $birth_year)"
to:
label: Department
key: dept_code
expression: $dept_code
properties:
since: $hire_date
position: $position
valid_to: "fn:if($leave_date != null, $leave_date, null)"
filters:
- condition: "$status != 'DELETED'"
- condition: "$id_card IS NOT NULL"
handlers:
on_insert: upsert
on_update: upsert
on_delete: soft_delete # 设置 _meta.status = 'DELETED',不真删
D2R Engine 核心逻辑
# src/kg/ingestion/d2r/engine.py
from typing import Any
import yaml
from pathlib import Path
from kg.core.id_generator import person_hash, new_uuid
from kg.store.neo4j_client import Neo4jClient
class D2REngine:
def __init__(self, mapping_path: Path):
with mapping_path.open() as f:
self.mapping = yaml.safe_load(f)
self.funcs = {
"person_hash": person_hash,
"uuid_v5_namespace": lambda ns, k: f"{ns}:{k}",
"year": lambda d: int(d[:4]) if d else None,
"substr": lambda s, n: s[n:] if s else None,
"map": lambda v, mapping: mapping.get(str(v), v),
"if": lambda cond, a, b: a if cond else b,
}
def _eval(self, expr: str, row: dict) -> Any:
"""Tiny expression evaluator: $field references or fn:name(...) calls."""
if isinstance(expr, str) and expr.startswith("$"):
return row.get(expr[1:])
if isinstance(expr, str) and expr.startswith("fn:"):
# Simple parser; production code should use a real AST parser (e.g. lark)
return self._eval_func(expr[3:], row)
return expr
def _eval_func(self, fn_call: str, row: dict) -> Any:
# ... safe expression eval, omitted for brevity
...
async def apply(self, row: dict, op: str = "upsert") -> None:
"""Apply one row to the graph according to the mapping."""
# 1. Filter
for f in self.mapping.get("filters", []):
if not self._eval(f["condition"], row):
return
# 2. Generate Cypher for each target
for target in self.mapping["target"]:
if target["kind"] == "node":
await self._upsert_node(target, row)
elif target["kind"] == "relationship":
await self._upsert_rel(target, row)
async def _upsert_node(self, target: dict, row: dict) -> None:
label = target["label"]
pk_prop = target["primary_key"]["property"]
pk_val = self._eval(target["primary_key"]["expression"], row)
props = {k: self._eval(v, row) for k, v in target["properties"].items()}
props = {k: v for k, v in props.items() if v is not None}
cypher = f"""
MERGE (n:{label} {{{pk_prop}: $pk}})
SET n += $props,
n.updated_at = datetime(),
n._meta_source = $source,
n._meta_updated_at = datetime()
ON CREATE SET n.uuid = coalesce(n.uuid, $uuid),
n.created_at = datetime(),
n._meta_created_at = datetime()
"""
await Neo4jClient.execute_write(
cypher,
{"pk": pk_val, "props": props, "uuid": new_uuid(), "source": self.mapping["source"]["type"]},
)
3.5 内部 DB 接入运维要点
- 只读账号:Debezium 用户仅授予
REPLICATION SLAVE / SELECT权限 - binlog 保留:MySQL
binlog_expire_logs_seconds ≥ 7 天,避免 connector 暂停后 catch-up 失败 - 快照策略:首次启动
snapshot.mode=initial,之后schema_only_recovery避免重复全量 - Topic 命名:
cdc.{业务域}.{表名},便于权限隔离 - DLQ:消费失败的消息进入
cdc.dlq.{业务域},告警 + 人工介入 - Schema 演进:源表 DDL 变更要先在 Debezium 测试,再上生产
4. 数据源 #2:文档(PDF / Word / 邮件)
4.1 文档类型清单
| 文档类型 | 来源 | 体量预估 | 解析难度 | 优先级 |
|---|---|---|---|---|
| 上市公司年报 | 巨潮资讯 | 数万份 | 中(表格多) | P0 |
| 招股说明书 | 巨潮资讯 | 数千份 | 高 | P0 |
| 法院判决文书 | 中国裁判文书网 | 千万级 | 中 | P0 |
| 行政处罚决定书 | 信用中国 | 百万级 | 中 | P1 |
| 公司公告 | 巨潮资讯 | 百万级 | 低-中 | P1 |
| 新闻报道 | 新闻 API | 持续流入 | 低 | P1 |
| 内部合同 | 内部 OA | 万级 | 高(扫描件) | P2 |
| 邮件归档 | Exchange | 持续流入 | 低 | P2 |
4.2 文档处理流水线
[文件落地] → [类型识别] → [解析器路由] → [结构化中间格式] → [章节切分] → [入抽取队列]
│ │
│ ├─ PDF / 数字版 → PyMuPDF
│ ├─ PDF / 扫描版 → PaddleOCR
│ ├─ Word → python-docx
│ ├─ HTML → trafilatura
│ └─ Email → mail-parser
│
└─ 元数据登记(PostgreSQL `documents` 表)
4.3 统一中间格式
所有解析器输出统一 Unstructured-Like Element 格式:
{
"doc_id": "DOC_20260511_001",
"doc_type": "ANNUAL_REPORT",
"metadata": {
"title": "阿里巴巴集团 2023 年度报告",
"source": "CNINFO",
"source_url": "http://...",
"publish_date": "2024-04-15",
"language": "zh-CN",
"page_count": 287,
"file_hash": "sha256:..."
},
"elements": [
{
"id": "el_001",
"type": "Title",
"text": "第一章 公司基本情况",
"page": 1,
"bbox": [120, 80, 480, 100]
},
{
"id": "el_002",
"type": "NarrativeText",
"text": "阿里巴巴集团成立于...",
"page": 1
},
{
"id": "el_003",
"type": "Table",
"text": "| 项目 | 2023 | 2022 |\n|---|---|---|\n...",
"html": "<table>...</table>",
"page": 5,
"rows": [...]
},
{
"id": "el_004",
"type": "ListItem",
"text": "..."
}
]
}
4.4 PDF 解析(数字版)
# src/kg/ingestion/parsers/pdf_parser.py
from pathlib import Path
import fitz # PyMuPDF
import pdfplumber
class PDFParser:
def __init__(self, *, ocr_fallback: bool = True):
self.ocr_fallback = ocr_fallback
def parse(self, path: Path) -> dict:
# 1. 探测是否是扫描版(无文本层)
if self._is_scanned(path):
if not self.ocr_fallback:
raise ValueError(f"Scanned PDF and OCR disabled: {path}")
from kg.ingestion.parsers.ocr import OCRParser
return OCRParser().parse(path)
# 2. 数字版:双引擎并用,PyMuPDF 拿文本流,pdfplumber 拿表格
doc = fitz.open(path)
elements = []
with pdfplumber.open(path) as plumber_doc:
for page_no, page in enumerate(doc, start=1):
# 文本块
for block in page.get_text("blocks"):
text = block[4].strip()
if not text:
continue
elements.append({
"id": f"p{page_no}_b{block[5]}",
"type": self._classify_block(text),
"text": text,
"page": page_no,
"bbox": [block[0], block[1], block[2], block[3]],
})
# 表格
plumber_page = plumber_doc.pages[page_no - 1]
for ti, table in enumerate(plumber_page.extract_tables() or []):
elements.append({
"id": f"p{page_no}_t{ti}",
"type": "Table",
"text": self._table_to_md(table),
"rows": table,
"page": page_no,
})
return {
"metadata": {
"page_count": len(doc),
"title": doc.metadata.get("title"),
},
"elements": elements,
}
@staticmethod
def _is_scanned(path: Path) -> bool:
doc = fitz.open(path)
text_chars = sum(len(p.get_text()) for p in doc)
return text_chars / max(len(doc), 1) < 50 # 平均每页 < 50 字符视为扫描版
@staticmethod
def _classify_block(text: str) -> str:
if len(text) < 50 and (text.endswith(":") or text.startswith("第")):
return "Title"
return "NarrativeText"
@staticmethod
def _table_to_md(rows: list[list[str]]) -> str:
if not rows:
return ""
header = rows[0]
lines = ["| " + " | ".join(c or "" for c in header) + " |"]
lines.append("|" + "|".join("---" for _ in header) + "|")
for r in rows[1:]:
lines.append("| " + " | ".join(c or "" for c in r) + " |")
return "\n".join(lines)
4.5 OCR(扫描件)
# src/kg/ingestion/parsers/ocr.py
from pathlib import Path
from paddleocr import PaddleOCR
import fitz
class OCRParser:
def __init__(self):
self.ocr = PaddleOCR(use_angle_cls=True, lang="ch", show_log=False)
def parse(self, path: Path) -> dict:
doc = fitz.open(path)
elements = []
for page_no, page in enumerate(doc, start=1):
pix = page.get_pixmap(dpi=200)
img_path = f"/tmp/ocr_{page_no}.png"
pix.save(img_path)
result = self.ocr.ocr(img_path, cls=True)
for line in result[0] or []:
bbox, (text, conf) = line
if conf < 0.5:
continue
elements.append({
"type": "NarrativeText",
"text": text,
"page": page_no,
"bbox": bbox,
"ocr_confidence": conf,
})
return {"metadata": {"page_count": len(doc)}, "elements": elements}
4.6 邮件解析
# src/kg/ingestion/parsers/email_parser.py
import mailparser
from pathlib import Path
class EmailParser:
def parse(self, path: Path) -> dict:
mail = mailparser.parse_from_file(str(path))
return {
"metadata": {
"subject": mail.subject,
"from": [a[1] for a in mail.from_],
"to": [a[1] for a in mail.to],
"cc": [a[1] for a in mail.cc],
"date": mail.date.isoformat() if mail.date else None,
"message_id": mail.message_id,
"attachments_count": len(mail.attachments),
},
"elements": [
{"type": "Title", "text": mail.subject or ""},
{"type": "NarrativeText", "text": mail.body or ""},
*[
{"type": "Attachment", "filename": a["filename"], "content_type": a["mail_content_type"]}
for a in mail.attachments
],
],
}
4.7 章节切分(Chunking)
抽取前的关键步骤:长文档必须切分为合适长度的 chunk,每个 chunk 内做 NER/RE,跨 chunk 关系在融合阶段解决。
# src/kg/ingestion/parsers/chunker.py
from typing import Iterator
class SemanticChunker:
"""Chunk by titles first, then by max_tokens with overlap."""
def __init__(self, max_tokens: int = 1500, overlap: int = 100):
self.max_tokens = max_tokens
self.overlap = overlap
def chunk(self, elements: list[dict]) -> Iterator[dict]:
section: list[dict] = []
section_title = ""
for el in elements:
if el["type"] == "Title":
if section:
yield from self._emit(section, section_title)
section = []
section_title = el["text"]
else:
section.append(el)
if section:
yield from self._emit(section, section_title)
def _emit(self, section: list[dict], title: str) -> Iterator[dict]:
buf, buf_len = [], 0
for el in section:
t = el["text"]
tl = self._tokens(t)
if buf_len + tl > self.max_tokens and buf:
yield {"title": title, "elements": buf, "tokens": buf_len}
# overlap by tail
overlap_buf = []
ol = 0
for x in reversed(buf):
if ol + self._tokens(x["text"]) > self.overlap:
break
overlap_buf.insert(0, x)
ol += self._tokens(x["text"])
buf, buf_len = list(overlap_buf), ol
buf.append(el)
buf_len += tl
if buf:
yield {"title": title, "elements": buf, "tokens": buf_len}
@staticmethod
def _tokens(text: str) -> int:
# Rough estimation: 1 Chinese char ≈ 1 token, ASCII chars / 4
ascii_chars = sum(1 for c in text if ord(c) < 128)
return (len(text) - ascii_chars) + ascii_chars // 4
4.8 文档元数据登记表
-- PostgreSQL:documents 表
CREATE TABLE documents (
doc_id VARCHAR(64) PRIMARY KEY,
doc_type VARCHAR(50) NOT NULL,
source VARCHAR(50) NOT NULL,
source_url TEXT,
file_path TEXT NOT NULL,
file_hash CHAR(64) UNIQUE NOT NULL,
file_size BIGINT,
page_count INT,
language VARCHAR(10),
publish_date DATE,
title TEXT,
parse_status VARCHAR(20) NOT NULL DEFAULT 'PENDING',
-- PENDING / PARSING / PARSED / FAILED / EXTRACTED
extract_status VARCHAR(20),
parsed_at TIMESTAMP,
extracted_at TIMESTAMP,
error_message TEXT,
metadata JSONB,
created_at TIMESTAMP DEFAULT now(),
updated_at TIMESTAMP DEFAULT now()
);
CREATE INDEX idx_documents_status ON documents(parse_status, extract_status);
CREATE INDEX idx_documents_publish ON documents(publish_date DESC);
5. 数据源 #3:外部 API(公开数据)
5.1 典型外部源清单
| 名称 | 性质 | 数据 | 鉴权 | 频控 |
|---|---|---|---|---|
| 国家企业信用信息公示系统(NECIPS) | 官方 | 工商基础 | 无(爬) | 严,需 IP 池 |
| 信用中国 | 官方 | 行政处罚、失信被执行 | 无(爬) | 中 |
| 中国裁判文书网 | 官方 | 司法案件 | 复杂 | 极严 |
| 巨潮资讯 | 官方 | 上市公司公告 | 无 | 中 |
| 天眼查 OpenAPI | 商业 | 综合 | API Key | 按量计费 |
| 企查查 OpenAPI | 商业 | 综合 | API Key | 按量计费 |
| 启信宝 OpenAPI | 商业 | 综合 | API Key | 按量计费 |
| 国家知识产权局 | 官方 | 专利、商标 | 复杂 | 严 |
5.2 接入规约
⚠️ 合规先行:
- 商业 API 必须有有效合同,明确数据使用范围
- 公开网站爬取必须遵守
robots.txt,控制频率 - 司法文书网等敏感源 优先采购商业数据,避免合规风险
- 任何接入前需法务过审
5.3 商业 API Pull Adapter
# src/kg/ingestion/sources/api_pull.py
import asyncio
from datetime import datetime, timedelta
import httpx
from tenacity import retry, stop_after_attempt, wait_exponential
from kg.ingestion.sources.base import SourceAdapter, SourceRecord
from kg.core.config import get_settings
from kg.core.logger import get_logger
log = get_logger(__name__)
class TYCAdapter(SourceAdapter):
"""Tianyancha (天眼查) API adapter — pull mode."""
source_name = "TYC"
mode = "polling"
BASE = "https://open.api.tianyancha.com/services/open/v3"
def __init__(self, api_key: str, *, qps_limit: int = 5):
self.api_key = api_key
self.qps_limit = qps_limit
self._semaphore = asyncio.Semaphore(qps_limit)
async def discover(self) -> dict:
return {"endpoint": self.BASE, "auth": "Header(token)"}
@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
async def _call(self, endpoint: str, params: dict) -> dict:
async with self._semaphore:
async with httpx.AsyncClient(timeout=20) as client:
resp = await client.get(
f"{self.BASE}/{endpoint}",
params=params,
headers={"Authorization": self.api_key},
)
resp.raise_for_status()
data = resp.json()
if data.get("error_code") != 0:
raise ValueError(f"TYC error: {data}")
return data["result"]
async def fetch(self, *, since=None, until=None, limit=None):
"""Pull enterprises listed since `since`. Driven by a watchlist of USCC seeds."""
seeds = await self._load_seeds(since)
count = 0
for uscc in seeds:
try:
detail = await self._call("ic/baseinfoV3", {"keyword": uscc})
yield SourceRecord(
source=self.source_name,
source_id=uscc,
source_record_url=f"https://www.tianyancha.com/company/{uscc}",
record_type="ENTERPRISE",
payload=detail,
fetched_at=datetime.utcnow(),
schema_version="tyc.v3",
)
count += 1
if limit and count >= limit:
break
except Exception as e:
log.error("tyc_fetch_failed", uscc=uscc, error=str(e))
async def _load_seeds(self, since):
# 从内部 watchlist(PostgreSQL)读取需要刷新的企业列表
...
async def checkpoint(self, watermark):
# 记录到 ingestion_watermark 表
...
async def health(self) -> dict:
try:
await self._call("v4/health", {})
return {"alive": True}
except Exception as e:
return {"alive": False, "error": str(e)}
5.4 频控与缓存
商业 API 按调用次数计费。生产必须做:
- 本地缓存:相同企业 24 小时内不重复查(Redis + PostgreSQL 双层)
- 批量优先:能用批量接口绝不用单条接口
- 预算控制:每日调用预算 + 告警
- 数据保鲜策略:
- 高频实体(重点监控企业):每日刷新
- 普通实体:每周刷新
- 长尾实体:按需触发
# src/kg/ingestion/sources/cached_api.py
class CachedAPIWrapper:
def __init__(self, adapter: SourceAdapter, ttl: int = 86400):
self.adapter = adapter
self.ttl = ttl
async def fetch_one(self, key: str) -> SourceRecord | None:
# 1. 查 Redis
cached = await redis.get(f"src:{self.adapter.source_name}:{key}")
if cached:
return SourceRecord.model_validate_json(cached)
# 2. 查 PostgreSQL(持久化缓存)
row = await pg.fetchrow(
"SELECT payload FROM source_cache WHERE source = $1 AND source_id = $2 "
"AND fetched_at > now() - interval '7 days'",
self.adapter.source_name, key
)
if row:
await redis.setex(f"src:{self.adapter.source_name}:{key}", self.ttl, row["payload"])
return SourceRecord.model_validate_json(row["payload"])
# 3. 调上游
async for rec in self.adapter.fetch(...):
await pg.execute(
"INSERT INTO source_cache (source, source_id, payload, fetched_at) "
"VALUES ($1, $2, $3, now()) ON CONFLICT (source, source_id) DO UPDATE "
"SET payload = $3, fetched_at = now()",
self.adapter.source_name, rec.source_id, rec.model_dump_json(),
)
await redis.setex(f"src:{self.adapter.source_name}:{rec.source_id}", self.ttl, rec.model_dump_json())
return rec
return None
6. 数据源 #4:实时流(Kafka / IoT)
6.1 流式场景
- 业务事件流(订单、交易、登录、操作日志)
- 舆情/新闻实时流
- CDC 流(已在 §3 涵盖)
- 风险信号流(征信、反欺诈预警)
6.2 流处理架构
[Kafka Topic] → [Flink SQL/DataStream] → [Stateful Enrich (Lookup KG)] → [Sink]
│ │
↓ ├─ 写回 Neo4j(增量)
[Aggregation/Window] ├─ 写 Kafka 派生 Topic
└─ 触发告警
6.3 Flink 作业示例:实时舆情入图
-- flink_jobs/opinion_to_kg.sql
CREATE TABLE news_stream (
news_id STRING,
title STRING,
content STRING,
publish_time TIMESTAMP_LTZ(3),
source_url STRING,
raw_entities ARRAY<ROW<text STRING, type STRING>>,
WATERMARK FOR publish_time AS publish_time - INTERVAL '30' SECOND
) WITH (
'connector' = 'kafka',
'topic' = 'news.raw',
'properties.bootstrap.servers' = 'kafka:9092',
'format' = 'json',
'scan.startup.mode' = 'group-offsets'
);
CREATE TABLE opinion_events (
event_id STRING,
enterprise_uscc STRING,
title STRING,
summary STRING,
sentiment STRING,
first_seen TIMESTAMP_LTZ(3),
source_url STRING,
PRIMARY KEY (event_id) NOT ENFORCED
) WITH (
'connector' = 'jdbc',
'url' = 'jdbc:postgresql://pg:5432/kg',
'table-name' = 'staging_opinion_events'
);
-- 流式实体链接(Lookup Join 到 Neo4j 镜像表)
CREATE TABLE enterprise_dim (
name STRING,
aliases ARRAY<STRING>,
uscc STRING,
PRIMARY KEY (name) NOT ENFORCED
) WITH (
'connector' = 'jdbc',
'lookup.cache.max-rows' = '500000',
'lookup.cache.ttl' = '1h'
);
INSERT INTO opinion_events
SELECT
SHA256(CONCAT(n.news_id, '_', e.text)) AS event_id,
d.uscc,
n.title,
SUBSTRING(n.content FROM 1 FOR 300) AS summary,
'NEUTRAL' AS sentiment, -- 后续模型替换
n.publish_time,
n.source_url
FROM news_stream AS n
CROSS JOIN UNNEST(n.raw_entities) AS e
LEFT JOIN enterprise_dim FOR SYSTEM_TIME AS OF n.publish_time AS d
ON d.name = e.text
WHERE e.type = 'ORGANIZATION' AND d.uscc IS NOT NULL;
6.4 流式入图最佳实践
- 不要直接写 Neo4j:流写入冲到 PostgreSQL Staging,由 Airflow 微批(1–5 分钟)入图
- 去重:用
event_id = SHA256(原始内容 + 实体)做幂等 - 背压:Flink 检测下游延迟时降速,避免 Neo4j 过载
- 状态管理:Flink State 用 RocksDB,定期 Checkpoint 到 S3/OSS
- 乱序处理:用 Watermark + 允许迟到 5 分钟
- 大窗口降冷:超过 1 小时的窗口落 OSS,热路径只保留 1 小时
7. 通用接入管控
7.1 Watermark 与断点续传
每个 Source 在 PostgreSQL 维护 watermark:
CREATE TABLE ingestion_watermark (
source VARCHAR(50) PRIMARY KEY,
watermark TIMESTAMP NOT NULL,
last_run_at TIMESTAMP NOT NULL,
last_run_status VARCHAR(20) NOT NULL,
last_run_stats JSONB,
updated_at TIMESTAMP DEFAULT now()
);
Airflow DAG 每次启动读 watermark,执行后更新。
7.2 数据落地分层
| 层 | 存储 | 保留 | 用途 |
|---|---|---|---|
| Raw | 对象存储(OSS / S3)+ PostgreSQL 索引 | 90 天 | 原始数据,可重放 |
| Bronze | PostgreSQL bronze_<source> 表 | 90 天 | 结构化原始 |
| Silver | PostgreSQL staging_* 表 | 30 天 | 抽取后待入图 |
| Gold | Neo4j + ES + Milvus | 长期 | 在线服务 |
7.3 数据契约(Data Contract)
每个数据源必须有正式的数据契约文档,包含:
# contracts/necips_v1.yaml
contract_name: NECIPS_Enterprise_v1
producer: External / NECIPS官网爬虫
consumer: KG-Ingestion
slo:
freshness: 24h # 数据延迟不超过 24 小时
completeness: 0.99 # 字段填充率
availability: 0.95 # 接入成功率
schema:
- name: unified_credit_code
type: string
pattern: '^[0-9A-HJ-NPQRTUWXY]{18}$'
required: true
- name: name
type: string
required: true
- ...
breaking_change_policy: 60_days_notice
incident_contact: kg-team@example.com / +86-xxxx
7.4 监控指标(每个 Source)
| 指标 | 阈值 |
|---|---|
| 接入延迟(fetch → land) | P95 < 5 min(流)/ < 4h(批) |
| Watermark 落后 | > 1h 告警 |
| 失败率 | > 5% 告警 |
| 数据量同比 | 波动 > 30% 告警 |
| 字段填充率 | 关键字段 < 95% 告警 |
7.5 死信处理
每个 Adapter 失败的记录必须进入 DLQ(Dead Letter Queue):
CREATE TABLE ingestion_dlq (
id BIGSERIAL PRIMARY KEY,
source VARCHAR(50) NOT NULL,
source_id VARCHAR(255),
record_type VARCHAR(50),
payload JSONB NOT NULL,
error_type VARCHAR(100),
error_message TEXT,
error_stack TEXT,
retry_count INT DEFAULT 0,
next_retry_at TIMESTAMP,
status VARCHAR(20) DEFAULT 'PENDING',
-- PENDING / RETRYING / RESOLVED / ABANDONED
created_at TIMESTAMP DEFAULT now(),
updated_at TIMESTAMP DEFAULT now()
);
CREATE INDEX idx_dlq_status_retry ON ingestion_dlq(status, next_retry_at);
定时任务扫描 DLQ,按指数退避重试,超过 5 次进入 ABANDONED 状态供人工处理。
8. 数据源接入清单(含 Owner、状态、SLA)
| # | 数据源 | 类型 | 优先级 | Owner | 状态 | 上线 SLA |
|---|---|---|---|---|---|---|
| 1 | HR 内部库(MySQL CDC) | DB | P0 | 数据工程组 | 设计 | 第 4 周 |
| 2 | CRM 内部库(Oracle CDC) | DB | P0 | 数据工程组 | 设计 | 第 5 周 |
| 3 | 巨潮资讯(年报+公告) | 文档 | P0 | NLP 组 | 设计 | 第 6 周 |
| 4 | 裁判文书(商业数据采购) | 文档 | P0 | 法务+数据组 | 选型 | 第 8 周 |
| 5 | 信用中国(行政处罚) | API | P1 | 数据工程组 | 设计 | 第 7 周 |
| 6 | 天眼查 OpenAPI | API | P0 | 数据工程组 | 已签约 | 第 4 周 |
| 7 | 内部 OA 合同文件 | 文档 | P1 | NLP 组 | 设计 | 第 9 周 |
| 8 | 邮件归档 | 文档 | P2 | 数据工程组 | 待评估 | 第 12 周 |
| 9 | 新闻舆情流(第三方推送) | 流 | P1 | 数据工程组 | 设计 | 第 10 周 |
| 10 | 内部业务事件 Kafka | 流 | P1 | 平台组 | 接入中 | 第 6 周 |
9. 接入质量验收标准
每个数据源接入完成必须通过以下验收:
| # | 验收项 | 标准 |
|---|---|---|
| 1 | Adapter 实现 SourceAdapter 协议 | 100% 接口实现 |
| 2 | 数据契约文档 | 已签字 |
| 3 | 单元测试 | 覆盖率 ≥ 75% |
| 4 | 集成测试 | 至少 100 条样本端到端跑通 |
| 5 | 字段映射 | D2R YAML 评审通过 |
| 6 | Watermark / Checkpoint | 故障注入测试通过 |
| 7 | 监控接入 | 指标可见、告警可达 |
| 8 | 性能基准 | 吞吐量符合预期 |
| 9 | DLQ 流程 | 失败样本被正确捕获 |
| 10 | 合规审查 | 法务签字 |
下一份文档:04_KBQA实施与评估.md