企业查/情报知识图谱 - 数据源接入详设

5 阅读15分钟

配套文档:基于《01_领域本体详细设计书.md》、《02_代码工程脚手架.md》 覆盖范围:四类数据源(内部数据库、文档、外部 API、实时流)的完整接入方案 本文档作用:每个数据源接入前的标准化设计模板 + 具体源头的逐一规划


1. 接入总体架构

┌────────────────────────────────────────────────────────────────┐
│                    数据源接入统一架构                            │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  内部DB ─┐    文档 ─┐    外部API ─┐    实时流 ─┐                  │
│         ↓         ↓            ↓             ↓                  │
│    ┌────────────────────────────────────────────┐              │
│    │       Source Adapter (统一抽象)             │              │
│    └────────────────────────────────────────────┘              │
│                          ↓                                      │
│    ┌────────────────────────────────────────────┐              │
│    │   Raw Landing Zone (PostgreSQL + 对象存储)  │              │
│    └────────────────────────────────────────────┘              │
│                          ↓                                      │
│    ┌────────────────────────────────────────────┐              │
│    │       Extraction Pipeline (NER/RE/EE)       │              │
│    └────────────────────────────────────────────┘              │
│                          ↓                                      │
│    ┌────────────────────────────────────────────┐              │
│    │   Staging (PostgreSQL,待审核三元组库)       │              │
│    └────────────────────────────────────────────┘              │
│                          ↓                                      │
│    ┌────────────────────────────────────────────┐              │
│    │   Fusion & Quality Gate                     │              │
│    └────────────────────────────────────────────┘              │
│                          ↓                                      │
│    ┌────────────────────────────────────────────┐              │
│    │   Neo4j + ES + Milvus (Online Stores)       │              │
│    └────────────────────────────────────────────┘              │
└────────────────────────────────────────────────────────────────┘

关键设计

  1. 统一 Adapter 抽象:每个数据源实现 SourceAdapter 协议,调度器看不到细节差异
  2. 三层 Landing 策略:Raw → Staging → Online,每层都可重放
  3. 审计可溯源:每条数据保留 source / source_id / source_record_url / extracted_by
  4. 背压控制:每条链路有限流、重试、死信队列

2. SourceAdapter 接口规范

所有数据源接入必须实现此协议,确保调度器、监控、错误处理一致。

# src/kg/ingestion/sources/base.py
from abc import ABC, abstractmethod
from datetime import datetime
from typing import AsyncIterator, Literal

from pydantic import BaseModel


class SourceRecord(BaseModel):
    """Raw record from a data source — uniform envelope."""
    source: str                  # NECIPS / TYC / DOC_PDF / KAFKA_AUDIT / ...
    source_id: str               # 源系统主键
    source_record_url: str | None
    record_type: str             # ENTERPRISE / PERSON / EVENT / DOCUMENT
    payload: dict                # 原始字段(保留全部,不裁剪)
    fetched_at: datetime
    schema_version: str          # 源数据 schema 版本


class IngestStats(BaseModel):
    fetched: int = 0
    succeeded: int = 0
    failed: int = 0
    skipped: int = 0
    bytes_in: int = 0
    duration_seconds: float = 0.0


class SourceAdapter(ABC):
    """Uniform interface for all data sources."""

    source_name: str             # 必须唯一
    mode: Literal["batch", "stream", "polling"]

    @abstractmethod
    async def discover(self) -> dict:
        """Probe source: connectivity, schema version, watermark."""

    @abstractmethod
    async def fetch(
        self,
        *,
        since: datetime | None = None,
        until: datetime | None = None,
        limit: int | None = None,
    ) -> AsyncIterator[SourceRecord]:
        """Yield raw records. Must be resumable via watermark."""

    @abstractmethod
    async def checkpoint(self, watermark: datetime) -> None:
        """Persist progress; called by orchestrator after each chunk."""

    @abstractmethod
    async def health(self) -> dict:
        """Return live health for monitoring."""

3. 数据源 #1:企业内部数据库(CDC 接入)

3.1 典型场景

企业内部业务系统:

  • HR 系统:员工、组织架构、任职变动
  • CRM:客户主数据、联系人
  • ERP:供应商、客户、合同、物料
  • 法务系统:合同、案件
  • OA:流程、文件、人员信息

3.2 接入模式选择

模式适用工具
JDBC 全量拉取数据量小(< 100 万行)、变化少Airflow SQL Operator
JDBC 增量拉取(基于 updated_at中等量、源系统有时间戳Airflow + 自研
CDC(binlog 订阅)大数据量、低延迟要求 ✓ 推荐Debezium + Kafka
触发器 + 消息队列源库可侵入源库 trigger

生产推荐:业务系统统一用 Debezium CDC,避免对源库压力。

3.3 Debezium CDC 接入

拓扑

[MySQL/Oracle/PG][Debezium Connector][Kafka Topic][Flink/Python Consumer][Raw Landing (PostgreSQL)][D2R Engine → Neo4j]

Debezium 配置示例(MySQL HR 库)

{
  "name": "hr-mysql-cdc",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "hr-mysql.internal",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "${secrets:DEBEZIUM_PWD}",
    "database.server.id": "1001",
    "topic.prefix": "cdc.hr",
    "database.include.list": "hr_prod",
    "table.include.list": "hr_prod.employee,hr_prod.department,hr_prod.org",
    "snapshot.mode": "initial",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "cdc.hr.schema-history",
    "include.schema.changes": "true",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "false",
    "transforms.unwrap.delete.handling.mode": "rewrite"
  }
}

消费端代码

# src/kg/ingestion/sources/mysql_cdc.py
from kg.ingestion.sources.base import SourceAdapter, SourceRecord
from aiokafka import AIOKafkaConsumer
import json


class MySQLCDCAdapter(SourceAdapter):
    source_name = "MYSQL_CDC_HR"
    mode = "stream"

    def __init__(self, bootstrap: str, topics: list[str], group_id: str):
        self.bootstrap = bootstrap
        self.topics = topics
        self.group_id = group_id

    async def discover(self) -> dict:
        return {"topics": self.topics, "type": "kafka-cdc"}

    async def fetch(self, *, since=None, until=None, limit=None):
        consumer = AIOKafkaConsumer(
            *self.topics,
            bootstrap_servers=self.bootstrap,
            group_id=self.group_id,
            enable_auto_commit=False,
            auto_offset_reset="earliest",
            value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
        )
        await consumer.start()
        try:
            count = 0
            async for msg in consumer:
                if msg.value is None:
                    continue
                op = msg.value.get("__op", "c")  # c/u/d/r
                table = msg.topic.replace("cdc.hr.", "")
                yield SourceRecord(
                    source=self.source_name,
                    source_id=f"{table}:{msg.value.get('id')}",
                    source_record_url=None,
                    record_type=self._map_record_type(table),
                    payload={**msg.value, "_op": op, "_table": table},
                    fetched_at=msg.timestamp,
                    schema_version="hr.v1",
                )
                count += 1
                if limit and count >= limit:
                    break
                # Commit per batch is done by orchestrator via checkpoint()
        finally:
            await consumer.stop()

    @staticmethod
    def _map_record_type(table: str) -> str:
        return {
            "employee": "PERSON",
            "department": "DEPARTMENT",
            "org": "ORGANIZATION",
        }.get(table, "UNKNOWN")

    async def checkpoint(self, watermark):
        # Kafka offsets are committed by consumer group
        pass

    async def health(self) -> dict:
        return {"alive": True}

3.4 D2R 映射

D2R(Database to RDF/KG)核心思想:用 YAML 声明式定义数据库行 → 图谱节点/关系的映射,引擎自动生成 Cypher。

YAML 映射示例

# src/kg/ingestion/d2r/mappings/hr_employee.yaml
mapping_name: hr_employee_to_person
version: 1.0
source:
  type: mysql_cdc
  topic: cdc.hr.employee

target:
  - kind: node
    label: NaturalPerson
    primary_key:
      property: person_hash
      expression: "fn:person_hash($name, $id_card_tail, $birth_year)"
    properties:
      uuid: "fn:uuid_v5_namespace('person', $person_hash)"
      name: $name
      gender: "fn:map($gender, {'1':'M','2':'F','0':'U'})"
      birth_year: "fn:year($birth_date)"
      id_card_tail: "fn:substr($id_card, -4)"
      nationality: $nationality

  - kind: node
    label: Department
    primary_key:
      property: dept_code
      expression: $dept_code
    properties:
      name: $dept_name

  - kind: relationship
    type: WORKS_IN
    from:
      label: NaturalPerson
      key: person_hash
      expression: "fn:person_hash($name, $id_card_tail, $birth_year)"
    to:
      label: Department
      key: dept_code
      expression: $dept_code
    properties:
      since: $hire_date
      position: $position
      valid_to: "fn:if($leave_date != null, $leave_date, null)"

filters:
  - condition: "$status != 'DELETED'"
  - condition: "$id_card IS NOT NULL"

handlers:
  on_insert: upsert
  on_update: upsert
  on_delete: soft_delete   # 设置 _meta.status = 'DELETED',不真删

D2R Engine 核心逻辑

# src/kg/ingestion/d2r/engine.py
from typing import Any
import yaml
from pathlib import Path

from kg.core.id_generator import person_hash, new_uuid
from kg.store.neo4j_client import Neo4jClient


class D2REngine:
    def __init__(self, mapping_path: Path):
        with mapping_path.open() as f:
            self.mapping = yaml.safe_load(f)
        self.funcs = {
            "person_hash": person_hash,
            "uuid_v5_namespace": lambda ns, k: f"{ns}:{k}",
            "year": lambda d: int(d[:4]) if d else None,
            "substr": lambda s, n: s[n:] if s else None,
            "map": lambda v, mapping: mapping.get(str(v), v),
            "if": lambda cond, a, b: a if cond else b,
        }

    def _eval(self, expr: str, row: dict) -> Any:
        """Tiny expression evaluator: $field references or fn:name(...) calls."""
        if isinstance(expr, str) and expr.startswith("$"):
            return row.get(expr[1:])
        if isinstance(expr, str) and expr.startswith("fn:"):
            # Simple parser; production code should use a real AST parser (e.g. lark)
            return self._eval_func(expr[3:], row)
        return expr

    def _eval_func(self, fn_call: str, row: dict) -> Any:
        # ... safe expression eval, omitted for brevity
        ...

    async def apply(self, row: dict, op: str = "upsert") -> None:
        """Apply one row to the graph according to the mapping."""
        # 1. Filter
        for f in self.mapping.get("filters", []):
            if not self._eval(f["condition"], row):
                return
        # 2. Generate Cypher for each target
        for target in self.mapping["target"]:
            if target["kind"] == "node":
                await self._upsert_node(target, row)
            elif target["kind"] == "relationship":
                await self._upsert_rel(target, row)

    async def _upsert_node(self, target: dict, row: dict) -> None:
        label = target["label"]
        pk_prop = target["primary_key"]["property"]
        pk_val = self._eval(target["primary_key"]["expression"], row)
        props = {k: self._eval(v, row) for k, v in target["properties"].items()}
        props = {k: v for k, v in props.items() if v is not None}
        cypher = f"""
        MERGE (n:{label} {{{pk_prop}: $pk}})
        SET n += $props,
            n.updated_at = datetime(),
            n._meta_source = $source,
            n._meta_updated_at = datetime()
        ON CREATE SET n.uuid = coalesce(n.uuid, $uuid),
                      n.created_at = datetime(),
                      n._meta_created_at = datetime()
        """
        await Neo4jClient.execute_write(
            cypher,
            {"pk": pk_val, "props": props, "uuid": new_uuid(), "source": self.mapping["source"]["type"]},
        )

3.5 内部 DB 接入运维要点

  • 只读账号:Debezium 用户仅授予 REPLICATION SLAVE / SELECT 权限
  • binlog 保留:MySQL binlog_expire_logs_seconds ≥ 7 天,避免 connector 暂停后 catch-up 失败
  • 快照策略:首次启动 snapshot.mode=initial,之后 schema_only_recovery 避免重复全量
  • Topic 命名cdc.{业务域}.{表名},便于权限隔离
  • DLQ:消费失败的消息进入 cdc.dlq.{业务域},告警 + 人工介入
  • Schema 演进:源表 DDL 变更要先在 Debezium 测试,再上生产

4. 数据源 #2:文档(PDF / Word / 邮件)

4.1 文档类型清单

文档类型来源体量预估解析难度优先级
上市公司年报巨潮资讯数万份中(表格多)P0
招股说明书巨潮资讯数千份P0
法院判决文书中国裁判文书网千万级P0
行政处罚决定书信用中国百万级P1
公司公告巨潮资讯百万级低-中P1
新闻报道新闻 API持续流入P1
内部合同内部 OA万级高(扫描件)P2
邮件归档Exchange持续流入P2

4.2 文档处理流水线

[文件落地][类型识别][解析器路由][结构化中间格式][章节切分][入抽取队列]
     │                         │
     │                         ├─ PDF / 数字版    → PyMuPDF
     │                         ├─ PDF / 扫描版    → PaddleOCR
     │                         ├─ Word            → python-docx
     │                         ├─ HTML            → trafilatura
     │                         └─ Email           → mail-parser
     │
     └─ 元数据登记(PostgreSQL `documents` 表)

4.3 统一中间格式

所有解析器输出统一 Unstructured-Like Element 格式:

{
  "doc_id": "DOC_20260511_001",
  "doc_type": "ANNUAL_REPORT",
  "metadata": {
    "title": "阿里巴巴集团 2023 年度报告",
    "source": "CNINFO",
    "source_url": "http://...",
    "publish_date": "2024-04-15",
    "language": "zh-CN",
    "page_count": 287,
    "file_hash": "sha256:..."
  },
  "elements": [
    {
      "id": "el_001",
      "type": "Title",
      "text": "第一章 公司基本情况",
      "page": 1,
      "bbox": [120, 80, 480, 100]
    },
    {
      "id": "el_002",
      "type": "NarrativeText",
      "text": "阿里巴巴集团成立于...",
      "page": 1
    },
    {
      "id": "el_003",
      "type": "Table",
      "text": "| 项目 | 2023 | 2022 |\n|---|---|---|\n...",
      "html": "<table>...</table>",
      "page": 5,
      "rows": [...]
    },
    {
      "id": "el_004",
      "type": "ListItem",
      "text": "..."
    }
  ]
}

4.4 PDF 解析(数字版)

# src/kg/ingestion/parsers/pdf_parser.py
from pathlib import Path
import fitz  # PyMuPDF
import pdfplumber


class PDFParser:
    def __init__(self, *, ocr_fallback: bool = True):
        self.ocr_fallback = ocr_fallback

    def parse(self, path: Path) -> dict:
        # 1. 探测是否是扫描版(无文本层)
        if self._is_scanned(path):
            if not self.ocr_fallback:
                raise ValueError(f"Scanned PDF and OCR disabled: {path}")
            from kg.ingestion.parsers.ocr import OCRParser
            return OCRParser().parse(path)

        # 2. 数字版:双引擎并用,PyMuPDF 拿文本流,pdfplumber 拿表格
        doc = fitz.open(path)
        elements = []
        with pdfplumber.open(path) as plumber_doc:
            for page_no, page in enumerate(doc, start=1):
                # 文本块
                for block in page.get_text("blocks"):
                    text = block[4].strip()
                    if not text:
                        continue
                    elements.append({
                        "id": f"p{page_no}_b{block[5]}",
                        "type": self._classify_block(text),
                        "text": text,
                        "page": page_no,
                        "bbox": [block[0], block[1], block[2], block[3]],
                    })
                # 表格
                plumber_page = plumber_doc.pages[page_no - 1]
                for ti, table in enumerate(plumber_page.extract_tables() or []):
                    elements.append({
                        "id": f"p{page_no}_t{ti}",
                        "type": "Table",
                        "text": self._table_to_md(table),
                        "rows": table,
                        "page": page_no,
                    })

        return {
            "metadata": {
                "page_count": len(doc),
                "title": doc.metadata.get("title"),
            },
            "elements": elements,
        }

    @staticmethod
    def _is_scanned(path: Path) -> bool:
        doc = fitz.open(path)
        text_chars = sum(len(p.get_text()) for p in doc)
        return text_chars / max(len(doc), 1) < 50  # 平均每页 < 50 字符视为扫描版

    @staticmethod
    def _classify_block(text: str) -> str:
        if len(text) < 50 and (text.endswith(":") or text.startswith("第")):
            return "Title"
        return "NarrativeText"

    @staticmethod
    def _table_to_md(rows: list[list[str]]) -> str:
        if not rows:
            return ""
        header = rows[0]
        lines = ["| " + " | ".join(c or "" for c in header) + " |"]
        lines.append("|" + "|".join("---" for _ in header) + "|")
        for r in rows[1:]:
            lines.append("| " + " | ".join(c or "" for c in r) + " |")
        return "\n".join(lines)

4.5 OCR(扫描件)

# src/kg/ingestion/parsers/ocr.py
from pathlib import Path
from paddleocr import PaddleOCR
import fitz


class OCRParser:
    def __init__(self):
        self.ocr = PaddleOCR(use_angle_cls=True, lang="ch", show_log=False)

    def parse(self, path: Path) -> dict:
        doc = fitz.open(path)
        elements = []
        for page_no, page in enumerate(doc, start=1):
            pix = page.get_pixmap(dpi=200)
            img_path = f"/tmp/ocr_{page_no}.png"
            pix.save(img_path)
            result = self.ocr.ocr(img_path, cls=True)
            for line in result[0] or []:
                bbox, (text, conf) = line
                if conf < 0.5:
                    continue
                elements.append({
                    "type": "NarrativeText",
                    "text": text,
                    "page": page_no,
                    "bbox": bbox,
                    "ocr_confidence": conf,
                })
        return {"metadata": {"page_count": len(doc)}, "elements": elements}

4.6 邮件解析

# src/kg/ingestion/parsers/email_parser.py
import mailparser
from pathlib import Path


class EmailParser:
    def parse(self, path: Path) -> dict:
        mail = mailparser.parse_from_file(str(path))
        return {
            "metadata": {
                "subject": mail.subject,
                "from": [a[1] for a in mail.from_],
                "to": [a[1] for a in mail.to],
                "cc": [a[1] for a in mail.cc],
                "date": mail.date.isoformat() if mail.date else None,
                "message_id": mail.message_id,
                "attachments_count": len(mail.attachments),
            },
            "elements": [
                {"type": "Title", "text": mail.subject or ""},
                {"type": "NarrativeText", "text": mail.body or ""},
                *[
                    {"type": "Attachment", "filename": a["filename"], "content_type": a["mail_content_type"]}
                    for a in mail.attachments
                ],
            ],
        }

4.7 章节切分(Chunking)

抽取前的关键步骤:长文档必须切分为合适长度的 chunk,每个 chunk 内做 NER/RE,跨 chunk 关系在融合阶段解决。

# src/kg/ingestion/parsers/chunker.py
from typing import Iterator


class SemanticChunker:
    """Chunk by titles first, then by max_tokens with overlap."""

    def __init__(self, max_tokens: int = 1500, overlap: int = 100):
        self.max_tokens = max_tokens
        self.overlap = overlap

    def chunk(self, elements: list[dict]) -> Iterator[dict]:
        section: list[dict] = []
        section_title = ""
        for el in elements:
            if el["type"] == "Title":
                if section:
                    yield from self._emit(section, section_title)
                section = []
                section_title = el["text"]
            else:
                section.append(el)
        if section:
            yield from self._emit(section, section_title)

    def _emit(self, section: list[dict], title: str) -> Iterator[dict]:
        buf, buf_len = [], 0
        for el in section:
            t = el["text"]
            tl = self._tokens(t)
            if buf_len + tl > self.max_tokens and buf:
                yield {"title": title, "elements": buf, "tokens": buf_len}
                # overlap by tail
                overlap_buf = []
                ol = 0
                for x in reversed(buf):
                    if ol + self._tokens(x["text"]) > self.overlap:
                        break
                    overlap_buf.insert(0, x)
                    ol += self._tokens(x["text"])
                buf, buf_len = list(overlap_buf), ol
            buf.append(el)
            buf_len += tl
        if buf:
            yield {"title": title, "elements": buf, "tokens": buf_len}

    @staticmethod
    def _tokens(text: str) -> int:
        # Rough estimation: 1 Chinese char ≈ 1 token, ASCII chars / 4
        ascii_chars = sum(1 for c in text if ord(c) < 128)
        return (len(text) - ascii_chars) + ascii_chars // 4

4.8 文档元数据登记表

-- PostgreSQL:documents 表
CREATE TABLE documents (
    doc_id VARCHAR(64) PRIMARY KEY,
    doc_type VARCHAR(50) NOT NULL,
    source VARCHAR(50) NOT NULL,
    source_url TEXT,
    file_path TEXT NOT NULL,
    file_hash CHAR(64) UNIQUE NOT NULL,
    file_size BIGINT,
    page_count INT,
    language VARCHAR(10),
    publish_date DATE,
    title TEXT,
    parse_status VARCHAR(20) NOT NULL DEFAULT 'PENDING',
    -- PENDING / PARSING / PARSED / FAILED / EXTRACTED
    extract_status VARCHAR(20),
    parsed_at TIMESTAMP,
    extracted_at TIMESTAMP,
    error_message TEXT,
    metadata JSONB,
    created_at TIMESTAMP DEFAULT now(),
    updated_at TIMESTAMP DEFAULT now()
);

CREATE INDEX idx_documents_status ON documents(parse_status, extract_status);
CREATE INDEX idx_documents_publish ON documents(publish_date DESC);

5. 数据源 #3:外部 API(公开数据)

5.1 典型外部源清单

名称性质数据鉴权频控
国家企业信用信息公示系统(NECIPS)官方工商基础无(爬)严,需 IP 池
信用中国官方行政处罚、失信被执行无(爬)
中国裁判文书网官方司法案件复杂极严
巨潮资讯官方上市公司公告
天眼查 OpenAPI商业综合API Key按量计费
企查查 OpenAPI商业综合API Key按量计费
启信宝 OpenAPI商业综合API Key按量计费
国家知识产权局官方专利、商标复杂

5.2 接入规约

⚠️ 合规先行

  1. 商业 API 必须有有效合同,明确数据使用范围
  2. 公开网站爬取必须遵守 robots.txt,控制频率
  3. 司法文书网等敏感源 优先采购商业数据,避免合规风险
  4. 任何接入前需法务过审

5.3 商业 API Pull Adapter

# src/kg/ingestion/sources/api_pull.py
import asyncio
from datetime import datetime, timedelta
import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

from kg.ingestion.sources.base import SourceAdapter, SourceRecord
from kg.core.config import get_settings
from kg.core.logger import get_logger

log = get_logger(__name__)


class TYCAdapter(SourceAdapter):
    """Tianyancha (天眼查) API adapter — pull mode."""
    source_name = "TYC"
    mode = "polling"

    BASE = "https://open.api.tianyancha.com/services/open/v3"

    def __init__(self, api_key: str, *, qps_limit: int = 5):
        self.api_key = api_key
        self.qps_limit = qps_limit
        self._semaphore = asyncio.Semaphore(qps_limit)

    async def discover(self) -> dict:
        return {"endpoint": self.BASE, "auth": "Header(token)"}

    @retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
    async def _call(self, endpoint: str, params: dict) -> dict:
        async with self._semaphore:
            async with httpx.AsyncClient(timeout=20) as client:
                resp = await client.get(
                    f"{self.BASE}/{endpoint}",
                    params=params,
                    headers={"Authorization": self.api_key},
                )
                resp.raise_for_status()
                data = resp.json()
                if data.get("error_code") != 0:
                    raise ValueError(f"TYC error: {data}")
                return data["result"]

    async def fetch(self, *, since=None, until=None, limit=None):
        """Pull enterprises listed since `since`. Driven by a watchlist of USCC seeds."""
        seeds = await self._load_seeds(since)
        count = 0
        for uscc in seeds:
            try:
                detail = await self._call("ic/baseinfoV3", {"keyword": uscc})
                yield SourceRecord(
                    source=self.source_name,
                    source_id=uscc,
                    source_record_url=f"https://www.tianyancha.com/company/{uscc}",
                    record_type="ENTERPRISE",
                    payload=detail,
                    fetched_at=datetime.utcnow(),
                    schema_version="tyc.v3",
                )
                count += 1
                if limit and count >= limit:
                    break
            except Exception as e:
                log.error("tyc_fetch_failed", uscc=uscc, error=str(e))

    async def _load_seeds(self, since):
        # 从内部 watchlist(PostgreSQL)读取需要刷新的企业列表
        ...

    async def checkpoint(self, watermark):
        # 记录到 ingestion_watermark 表
        ...

    async def health(self) -> dict:
        try:
            await self._call("v4/health", {})
            return {"alive": True}
        except Exception as e:
            return {"alive": False, "error": str(e)}

5.4 频控与缓存

商业 API 按调用次数计费。生产必须做:

  1. 本地缓存:相同企业 24 小时内不重复查(Redis + PostgreSQL 双层)
  2. 批量优先:能用批量接口绝不用单条接口
  3. 预算控制:每日调用预算 + 告警
  4. 数据保鲜策略
    • 高频实体(重点监控企业):每日刷新
    • 普通实体:每周刷新
    • 长尾实体:按需触发
# src/kg/ingestion/sources/cached_api.py
class CachedAPIWrapper:
    def __init__(self, adapter: SourceAdapter, ttl: int = 86400):
        self.adapter = adapter
        self.ttl = ttl

    async def fetch_one(self, key: str) -> SourceRecord | None:
        # 1. 查 Redis
        cached = await redis.get(f"src:{self.adapter.source_name}:{key}")
        if cached:
            return SourceRecord.model_validate_json(cached)
        # 2. 查 PostgreSQL(持久化缓存)
        row = await pg.fetchrow(
            "SELECT payload FROM source_cache WHERE source = $1 AND source_id = $2 "
            "AND fetched_at > now() - interval '7 days'",
            self.adapter.source_name, key
        )
        if row:
            await redis.setex(f"src:{self.adapter.source_name}:{key}", self.ttl, row["payload"])
            return SourceRecord.model_validate_json(row["payload"])
        # 3. 调上游
        async for rec in self.adapter.fetch(...):
            await pg.execute(
                "INSERT INTO source_cache (source, source_id, payload, fetched_at) "
                "VALUES ($1, $2, $3, now()) ON CONFLICT (source, source_id) DO UPDATE "
                "SET payload = $3, fetched_at = now()",
                self.adapter.source_name, rec.source_id, rec.model_dump_json(),
            )
            await redis.setex(f"src:{self.adapter.source_name}:{rec.source_id}", self.ttl, rec.model_dump_json())
            return rec
        return None

6. 数据源 #4:实时流(Kafka / IoT)

6.1 流式场景

  • 业务事件流(订单、交易、登录、操作日志)
  • 舆情/新闻实时流
  • CDC 流(已在 §3 涵盖)
  • 风险信号流(征信、反欺诈预警)

6.2 流处理架构

[Kafka Topic][Flink SQL/DataStream][Stateful Enrich (Lookup KG)][Sink]
                          │                                                 │
                          ↓                                                 ├─ 写回 Neo4j(增量)
                   [Aggregation/Window]                                     ├─ 写 Kafka 派生 Topic
                                                                            └─ 触发告警

6.3 Flink 作业示例:实时舆情入图

-- flink_jobs/opinion_to_kg.sql
CREATE TABLE news_stream (
    news_id STRING,
    title STRING,
    content STRING,
    publish_time TIMESTAMP_LTZ(3),
    source_url STRING,
    raw_entities ARRAY<ROW<text STRING, type STRING>>,
    WATERMARK FOR publish_time AS publish_time - INTERVAL '30' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'news.raw',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json',
    'scan.startup.mode' = 'group-offsets'
);

CREATE TABLE opinion_events (
    event_id STRING,
    enterprise_uscc STRING,
    title STRING,
    summary STRING,
    sentiment STRING,
    first_seen TIMESTAMP_LTZ(3),
    source_url STRING,
    PRIMARY KEY (event_id) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:postgresql://pg:5432/kg',
    'table-name' = 'staging_opinion_events'
);

-- 流式实体链接(Lookup Join 到 Neo4j 镜像表)
CREATE TABLE enterprise_dim (
    name STRING,
    aliases ARRAY<STRING>,
    uscc STRING,
    PRIMARY KEY (name) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'lookup.cache.max-rows' = '500000',
    'lookup.cache.ttl' = '1h'
);

INSERT INTO opinion_events
SELECT
    SHA256(CONCAT(n.news_id, '_', e.text)) AS event_id,
    d.uscc,
    n.title,
    SUBSTRING(n.content FROM 1 FOR 300) AS summary,
    'NEUTRAL' AS sentiment,    -- 后续模型替换
    n.publish_time,
    n.source_url
FROM news_stream AS n
CROSS JOIN UNNEST(n.raw_entities) AS e
LEFT JOIN enterprise_dim FOR SYSTEM_TIME AS OF n.publish_time AS d
  ON d.name = e.text
WHERE e.type = 'ORGANIZATION' AND d.uscc IS NOT NULL;

6.4 流式入图最佳实践

  1. 不要直接写 Neo4j:流写入冲到 PostgreSQL Staging,由 Airflow 微批(1–5 分钟)入图
  2. 去重:用 event_id = SHA256(原始内容 + 实体) 做幂等
  3. 背压:Flink 检测下游延迟时降速,避免 Neo4j 过载
  4. 状态管理:Flink State 用 RocksDB,定期 Checkpoint 到 S3/OSS
  5. 乱序处理:用 Watermark + 允许迟到 5 分钟
  6. 大窗口降冷:超过 1 小时的窗口落 OSS,热路径只保留 1 小时

7. 通用接入管控

7.1 Watermark 与断点续传

每个 Source 在 PostgreSQL 维护 watermark:

CREATE TABLE ingestion_watermark (
    source VARCHAR(50) PRIMARY KEY,
    watermark TIMESTAMP NOT NULL,
    last_run_at TIMESTAMP NOT NULL,
    last_run_status VARCHAR(20) NOT NULL,
    last_run_stats JSONB,
    updated_at TIMESTAMP DEFAULT now()
);

Airflow DAG 每次启动读 watermark,执行后更新。

7.2 数据落地分层

存储保留用途
Raw对象存储(OSS / S3)+ PostgreSQL 索引90 天原始数据,可重放
BronzePostgreSQL bronze_<source>90 天结构化原始
SilverPostgreSQL staging_*30 天抽取后待入图
GoldNeo4j + ES + Milvus长期在线服务

7.3 数据契约(Data Contract)

每个数据源必须有正式的数据契约文档,包含:

# contracts/necips_v1.yaml
contract_name: NECIPS_Enterprise_v1
producer: External / NECIPS官网爬虫
consumer: KG-Ingestion
slo:
  freshness: 24h           # 数据延迟不超过 24 小时
  completeness: 0.99       # 字段填充率
  availability: 0.95       # 接入成功率
schema:
  - name: unified_credit_code
    type: string
    pattern: '^[0-9A-HJ-NPQRTUWXY]{18}$'
    required: true
  - name: name
    type: string
    required: true
  - ...
breaking_change_policy: 60_days_notice
incident_contact: kg-team@example.com / +86-xxxx

7.4 监控指标(每个 Source)

指标阈值
接入延迟(fetch → land)P95 < 5 min(流)/ < 4h(批)
Watermark 落后> 1h 告警
失败率> 5% 告警
数据量同比波动 > 30% 告警
字段填充率关键字段 < 95% 告警

7.5 死信处理

每个 Adapter 失败的记录必须进入 DLQ(Dead Letter Queue):

CREATE TABLE ingestion_dlq (
    id BIGSERIAL PRIMARY KEY,
    source VARCHAR(50) NOT NULL,
    source_id VARCHAR(255),
    record_type VARCHAR(50),
    payload JSONB NOT NULL,
    error_type VARCHAR(100),
    error_message TEXT,
    error_stack TEXT,
    retry_count INT DEFAULT 0,
    next_retry_at TIMESTAMP,
    status VARCHAR(20) DEFAULT 'PENDING',
    -- PENDING / RETRYING / RESOLVED / ABANDONED
    created_at TIMESTAMP DEFAULT now(),
    updated_at TIMESTAMP DEFAULT now()
);

CREATE INDEX idx_dlq_status_retry ON ingestion_dlq(status, next_retry_at);

定时任务扫描 DLQ,按指数退避重试,超过 5 次进入 ABANDONED 状态供人工处理。


8. 数据源接入清单(含 Owner、状态、SLA)

#数据源类型优先级Owner状态上线 SLA
1HR 内部库(MySQL CDC)DBP0数据工程组设计第 4 周
2CRM 内部库(Oracle CDC)DBP0数据工程组设计第 5 周
3巨潮资讯(年报+公告)文档P0NLP 组设计第 6 周
4裁判文书(商业数据采购)文档P0法务+数据组选型第 8 周
5信用中国(行政处罚)APIP1数据工程组设计第 7 周
6天眼查 OpenAPIAPIP0数据工程组已签约第 4 周
7内部 OA 合同文件文档P1NLP 组设计第 9 周
8邮件归档文档P2数据工程组待评估第 12 周
9新闻舆情流(第三方推送)P1数据工程组设计第 10 周
10内部业务事件 KafkaP1平台组接入中第 6 周

9. 接入质量验收标准

每个数据源接入完成必须通过以下验收:

#验收项标准
1Adapter 实现 SourceAdapter 协议100% 接口实现
2数据契约文档已签字
3单元测试覆盖率 ≥ 75%
4集成测试至少 100 条样本端到端跑通
5字段映射D2R YAML 评审通过
6Watermark / Checkpoint故障注入测试通过
7监控接入指标可见、告警可达
8性能基准吞吐量符合预期
9DLQ 流程失败样本被正确捕获
10合规审查法务签字

下一份文档04_KBQA实施与评估.md