Enterprise Lookup / Intelligence Knowledge Graph - Code Project Scaffold

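Companion document: based on 01_领域本体详细设计书.md
Target readers: development and implementation engineers, SREs
Purpose of this document: a ready-to-use project skeleton; set it up as described and you can start coding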


1. Final technology stack selection

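| Module | Choice | Version | Notes |
|---|---|---|---|
| Language | Python | 3.11 | Primary development language |
| Package management | Poetry | 1.8+ | Dependency locking |
| Graph database | Neo4j Enterprise | 5.15 | Primary store |
| Relational database | PostgreSQL | 16 | Staging / review / metadata |
| Cache | Redis | 7.2 | Query cache / rate limiting |
| Full-text search | Elasticsearch | 8.13 | Entity name search |
| Vector store | Milvus | 2.4 | Entity alignment / RAG |
| Message queue | Kafka | 3.7 | Incremental stream |
| Workflow | Airflow | 2.9 | Offline ETL |
| API framework | FastAPI | 0.110+ | REST / WebSocket |
| Application server | Uvicorn + Gunicorn | - | Production deployment |
| Task queue | Celery + Redis | 5.4 | Async tasks |
| OCR | PaddleOCR | 2.7 | Chinese OCR |
| LLM SDK | anthropic | 0.40+ | Claude API |
| Embedding model | bge-large-zh-v1.5 | - | Chinese embeddings |
| Extraction model | UIE (PaddleNLP) | 2.7 | Fine-tuned |
| Monitoring | Prometheus + Grafana | - | Metrics |
| Logging | Loki + Promtail | - | Log aggregation |
| Tracing | OpenTelemetry + Jaeger | - | APM |
| Containers | Docker + Kubernetes | 1.28+ | Orchestration |
| CI/CD | GitLab CI / GitHub Actions | - | Pipelines |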

2. Full directory structure

enterprise-insight-kg/
├── README.md
├── pyproject.toml
├── poetry.lock
├── .python-version
├── .env.example
├── .gitignore
├── .editorconfig
├── .pre-commit-config.yaml
├── Makefile
├── docker-compose.yml                       # One-command local development environment
├── docker-compose.prod.yml
│
├── docs/                                    # Project documentation
│   ├── 01_领域本体详细设计书.md
│   ├── 02_代码工程脚手架.md
│   ├── 03_数据源接入详设.md
│   ├── 04_KBQA实施与评估.md
│   ├── api/                                 # OpenAPI docs (auto-generated)
│   ├── adr/                                 # Architecture decision records
│   │   ├── 0001-use-neo4j.md
│   │   ├── 0002-text2cypher-vs-finetune.md
│   │   └── ...
│   └── runbook/                             # Operations runbooks
│       ├── deploy.md
│       ├── backup.md
│       └── incident.md
│
├── schema/                                  # Neo4j Schema DDL
│   ├── changelog.yaml
│   ├── changesets/
│   ├── rollback/
│   └── seeds/
│
├── src/
│   └── kg/                                  # Main Python package
│       ├── __init__.py
│       ├── core/                            # Core abstractions
│       │   ├── __init__.py
│       │   ├── config.py                    # Pydantic Settings
│       │   ├── logger.py                    # Structured logging
│       │   ├── tracing.py                   # OTel configuration
│       │   ├── errors.py                    # Exception hierarchy
│       │   ├── types.py                     # Shared types
│       │   └── id_generator.py              # UUID/hash utilities
│       │
│       ├── ontology/                        # Ontology definitions (as code)
│       │   ├── __init__.py
│       │   ├── entities.py                  # Entity-type Pydantic models
│       │   ├── relations.py                 # Relationship types
│       │   ├── enums.py                     # Enum constants
│       │   └── validators.py                # Business validators
│       │
│       ├── store/                           # Storage layer
│       │   ├── __init__.py
│       │   ├── neo4j_client.py              # Async Neo4j driver wrapper
│       │   ├── postgres_client.py           # Staging database
│       │   ├── redis_client.py
│       │   ├── es_client.py
│       │   ├── milvus_client.py
│       │   └── repositories/                # Data access layer
│       │       ├── __init__.py
│       │       ├── enterprise_repo.py
│       │       ├── person_repo.py
│       │       ├── event_repo.py
│       │       └── document_repo.py
│       │
│       ├── ingestion/                       # Data ingestion
│       │   ├── __init__.py
│       │   ├── sources/                     # Data source adapters
│       │   │   ├── base.py                  # SourceAdapter abstract class
│       │   │   ├── mysql_cdc.py
│       │   │   ├── api_pull.py
│       │   │   ├── file_watcher.py
│       │   │   └── kafka_consumer.py
│       │   ├── parsers/                     # Document parsing
│       │   │   ├── pdf_parser.py
│       │   │   ├── docx_parser.py
│       │   │   ├── email_parser.py
│       │   │   ├── html_parser.py
│       │   │   └── ocr.py
│       │   ├── d2r/                         # Database-to-graph mapping
│       │   │   ├── engine.py
│       │   │   ├── mapping_loader.py
│       │   │   └── mappings/                # YAML mapping rules
│       │   │       ├── necips_enterprise.yaml
│       │   │       ├── tyc_person.yaml
│       │   │       └── ...
│       │   └── stream/
│       │       ├── flink_jobs/              # Flink SQL jobs
│       │       └── handlers.py
│       │
│       ├── extraction/                      # Knowledge extraction
│       │   ├── __init__.py
│       │   ├── pipeline.py                  # Extraction pipeline orchestration
│       │   ├── ner/                         # Named entity recognition
│       │   │   ├── base.py
│       │   │   ├── rule_ner.py
│       │   │   ├── uie_ner.py
│       │   │   └── llm_ner.py
│       │   ├── relation/                    # Relation extraction
│       │   │   ├── base.py
│       │   │   ├── pattern_re.py
│       │   │   ├── uie_re.py
│       │   │   └── llm_re.py
│       │   ├── event/                       # Event extraction
│       │   │   ├── schemas/                 # Event schema YAML
│       │   │   └── llm_ee.py
│       │   ├── linking/                     # Entity linking
│       │   │   ├── candidate_gen.py
│       │   │   └── disambiguator.py
│       │   ├── normalization/               # Normalization
│       │   │   ├── date.py
│       │   │   ├── amount.py
│       │   │   ├── name.py
│       │   │   └── address.py
│       │   ├── prompts/                     # LLM prompt templates
│       │   │   ├── ner_prompt.py
│       │   │   ├── re_prompt.py
│       │   │   └── event_prompt.py
│       │   └── arbitration.py               # Multi-source arbitration
│       │
│       ├── fusion/                          # Knowledge fusion
│       │   ├── __init__.py
│       │   ├── blocking.py                  # Candidate generation
│       │   ├── matcher.py                   # Entity matching
│       │   ├── clusterer.py                 # Cluster partitioning
│       │   ├── merger.py                    # Node merge execution
│       │   ├── conflict_resolver.py         # Conflict resolution
│       │   └── features.py                  # Matching feature engineering
│       │
│       ├── reasoning/                       # Reasoning and derivation
│       │   ├── __init__.py
│       │   ├── rules/                       # Rule definitions
│       │   │   ├── actual_controls.cypher
│       │   │   ├── beneficial_owner.cypher
│       │   │   └── ...
│       │   ├── runner.py                    # Rule runner
│       │   └── gds_jobs.py                  # Graph algorithm jobs
│       │
│       ├── api/                             # FastAPI application
│       │   ├── __init__.py
│       │   ├── main.py                      # FastAPI app entry point
│       │   ├── deps.py                      # Dependency injection
│       │   ├── middleware/
│       │   │   ├── auth.py
│       │   │   ├── ratelimit.py
│       │   │   ├── logging.py
│       │   │   └── tracing.py
│       │   ├── routers/
│       │   │   ├── health.py
│       │   │   ├── entity.py                # Entity CRUD
│       │   │   ├── graph.py                 # Graph queries
│       │   │   ├── search.py
│       │   │   ├── kbqa.py                  # KBQA entry point
│       │   │   └── admin.py                 # Admin backend
│       │   └── schemas/                     # Pydantic request/response models
│       │
│       ├── kbqa/                            # Question answering module
│       │   ├── __init__.py
│       │   ├── pipeline.py                  # End-to-end KBQA
│       │   ├── intent.py                    # Intent recognition
│       │   ├── entity_linking.py
│       │   ├── text2cypher/
│       │   │   ├── generator.py             # LLM-generated Cypher
│       │   │   ├── validator.py             # Cypher validation
│       │   │   ├── executor.py              # Safe execution
│       │   │   └── few_shots.yaml           # Few-shot library
│       │   ├── graph_rag/
│       │   │   ├── retriever.py
│       │   │   ├── verbalizer.py
│       │   │   └── community_summary.py
│       │   ├── reranker.py
│       │   ├── answer_gen.py                # Answer generation
│       │   └── citation.py                  # Citations and provenance
│       │
│       ├── quality/                         # Quality assessment
│       │   ├── checks.py
│       │   ├── metrics.py
│       │   └── reports.py
│       │
│       └── cli/                             # Command-line tools
│           ├── __init__.py
│           ├── main.py                      # Typer entry point
│           ├── schema_cli.py                # Apply DDL
│           ├── ingest_cli.py
│           ├── extract_cli.py
│           └── eval_cli.py
│
├── airflow/                                 # Workflow DAGs
│   ├── dags/
│   │   ├── ingest_necips.py
│   │   ├── extract_documents.py
│   │   ├── fusion_daily.py
│   │   ├── reasoning_daily.py
│   │   ├── quality_check_daily.py
│   │   └── backup_neo4j.py
│   ├── plugins/
│   └── tests/
│
├── tests/                                   # Tests
│   ├── unit/
│   │   ├── ontology/
│   │   ├── extraction/
│   │   ├── fusion/
│   │   └── kbqa/
│   ├── integration/
│   │   ├── test_neo4j.py
│   │   ├── test_pipeline_e2e.py
│   │   └── test_api.py
│   ├── fixtures/                            # Test data
│   │   ├── enterprises_30.json
│   │   ├── documents/
│   │   └── kbqa_eval_set.jsonl
│   └── conftest.py
│
├── scripts/                                 # Operations scripts
│   ├── apply_schema.sh
│   ├── load_seeds.py
│   ├── reindex.sh
│   ├── benchmark_query.py
│   └── backup.sh
│
├── deploy/
│   ├── k8s/                                 # K8s manifests
│   │   ├── namespace.yaml
│   │   ├── neo4j-cluster.yaml
│   │   ├── postgres.yaml
│   │   ├── api-deployment.yaml
│   │   ├── airflow.yaml
│   │   ├── ingress.yaml
│   │   └── monitoring/
│   ├── helm/                                # Helm chart
│   ├── terraform/                           # Infrastructure as code
│   └── ansible/                             # Configuration management
│
├── monitoring/
│   ├── prometheus/
│   │   ├── prometheus.yml
│   │   └── alerts.yaml
│   ├── grafana/
│   │   └── dashboards/
│   │       ├── neo4j.json
│   │       ├── api.json
│   │       └── extraction.json
│   └── otel-collector.yaml
│
└── .github/                                 # Or .gitlab-ci.yml
    └── workflows/
        ├── ci.yml                           # Tests, lint, build
        ├── cd-staging.yml
        ├── cd-prod.yml
        ├── security-scan.yml
        └── release.yml

3. Key configuration files

3.1 pyproject.toml

[tool.poetry]
name = "kg"
version = "0.1.0"
description = "Enterprise Insight Knowledge Graph"
authors = ["KG Team <kg@example.com>"]
packages = [{ include = "kg", from = "src" }]
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.11"
# Web
fastapi = "^0.110.0"
uvicorn = { extras = ["standard"], version = "^0.29.0" }
gunicorn = "^21.2.0"
# Data validation
pydantic = "^2.6"
pydantic-settings = "^2.2"
# Stores
neo4j = "^5.18"
asyncpg = "^0.29"
redis = "^5.0"
elasticsearch = "^8.13"
pymilvus = "^2.4"
# Async
aiofiles = "^23.2"
httpx = "^0.27"
# LLM & NLP
anthropic = "^0.40"
sentence-transformers = "^2.7"
paddlenlp = "^2.7"
paddlepaddle = "^2.6"
# Document parsing
pymupdf = "^1.24"
pdfplumber = "^0.11"
python-docx = "^1.1"
mail-parser = "^3.15"
trafilatura = "^1.10"
unstructured = "^0.13"
# OCR
paddleocr = "^2.7"
# Stream
kafka-python = "^2.0"
# Workflow
celery = "^5.4"
# CLI
typer = "^0.12"
rich = "^13.7"
# Observability
opentelemetry-api = "^1.24"
opentelemetry-sdk = "^1.24"
opentelemetry-instrumentation-fastapi = "^0.45b0"
opentelemetry-instrumentation-neo4j = "^0.45b0"
structlog = "^24.1"
prometheus-client = "^0.20"
# Utility
tenacity = "^8.2"
python-jose = { extras = ["cryptography"], version = "^3.3" }
passlib = { extras = ["bcrypt"], version = "^1.7" }

[tool.poetry.group.dev.dependencies]
pytest = "^8.1"
pytest-asyncio = "^0.23"
pytest-cov = "^4.1"
pytest-mock = "^3.12"
ruff = "^0.4"
black = "^24.3"
mypy = "^1.9"
pre-commit = "^3.7"
ipython = "^8.22"
locust = "^2.25"   # Load testing

[tool.poetry.scripts]
kg = "kg.cli.main:app"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

[tool.ruff]
line-length = 100
target-version = "py311"

# Rule selection lives under [tool.ruff.lint]; the top-level keys are deprecated.
[tool.ruff.lint]
select = ["E", "F", "I", "W", "B", "UP", "N", "SIM"]
ignore = ["E501"]

[tool.black]
line-length = 100
target-version = ["py311"]

[tool.mypy]
python_version = "3.11"
strict = true
ignore_missing_imports = true

[tool.pytest.ini_options]
asyncio_mode = "auto"
addopts = "-ra --strict-markers --cov=src/kg --cov-report=term-missing"
testpaths = ["tests"]

3.2 .env.example

# ====== Runtime ======
ENV=local                              # local | dev | staging | prod
LOG_LEVEL=INFO
LOG_FORMAT=json                        # json | console

# ====== Neo4j ======
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=changeit
NEO4J_DATABASE=neo4j
NEO4J_POOL_SIZE=50
NEO4J_CONN_TIMEOUT=30

# ====== PostgreSQL ======
PG_DSN=postgresql://kg:changeit@localhost:5432/kg

# ====== Redis ======
REDIS_URL=redis://localhost:6379/0
REDIS_CACHE_TTL=300

# ====== Elasticsearch ======
ES_URL=http://localhost:9200
ES_USER=elastic
ES_PASSWORD=changeit

# ====== Milvus ======
MILVUS_HOST=localhost
MILVUS_PORT=19530

# ====== Kafka ======
KAFKA_BOOTSTRAP=localhost:9092

# ====== LLM ======
ANTHROPIC_API_KEY=sk-ant-xxxx
LLM_MODEL_EXTRACT=claude-opus-4-7
LLM_MODEL_KBQA=claude-opus-4-7
LLM_MAX_TOKENS=4096
LLM_TIMEOUT=60
LLM_DAILY_BUDGET_USD=200

# ====== Embedding ======
EMBED_MODEL=BAAI/bge-large-zh-v1.5
EMBED_DIM=1024
EMBED_BATCH=64

# ====== API ======
API_HOST=0.0.0.0
API_PORT=8080
API_WORKERS=4
API_CORS_ORIGINS=https://insight.example.com
JWT_SECRET=change-me
JWT_EXPIRE_MIN=60
RATE_LIMIT_PER_MIN=120

# ====== Observability ======
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_SERVICE_NAME=kg-api
PROMETHEUS_PORT=9100

# ====== Sources ======
NECIPS_API_KEY=
TYC_API_KEY=
QCC_API_KEY=

3.3 src/kg/core/config.py

"""Centralized configuration loaded from environment variables.

Use Pydantic Settings — typed, validated, with sensible defaults.
"""
from functools import lru_cache
from typing import Literal

from pydantic import Field, SecretStr
from pydantic_settings import BaseSettings, SettingsConfigDict


class Neo4jSettings(BaseSettings):
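    # Nested settings are instantiated on their own, so they must load .env themselves;
    # extra="ignore" skips the unrelated keys that share the file.
    model_config = SettingsConfigDict(
        env_prefix="NEO4J_", env_file=".env", env_file_encoding="utf-8", extra="ignore"
    )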

    uri: str = "bolt://localhost:7687"
    user: str = "neo4j"
    password: SecretStr = SecretStr("changeit")
    database: str = "neo4j"
    pool_size: int = 50
    conn_timeout: int = 30


class LLMSettings(BaseSettings):
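    # Loads .env directly for the same reason as Neo4jSettings.
    model_config = SettingsConfigDict(
        env_prefix="LLM_", env_file=".env", env_file_encoding="utf-8", extra="ignore"
    )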

    anthropic_api_key: SecretStr = Field(SecretStr(""), alias="ANTHROPIC_API_KEY")
    model_extract: str = "claude-opus-4-7"
    model_kbqa: str = "claude-opus-4-7"
    max_tokens: int = 4096
    timeout: int = 60
    daily_budget_usd: float = 200.0


class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        case_sensitive=False,
        extra="ignore",
    )

    env: Literal["local", "dev", "staging", "prod"] = "local"
    log_level: str = "INFO"
    log_format: Literal["json", "console"] = "json"

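    # default_factory defers instantiation until Settings() is built, not import time.
    neo4j: Neo4jSettings = Field(default_factory=Neo4jSettings)
    llm: LLMSettings = Field(default_factory=LLMSettings)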

    pg_dsn: str = "postgresql://kg:changeit@localhost:5432/kg"
    redis_url: str = "redis://localhost:6379/0"
    es_url: str = "http://localhost:9200"
    milvus_host: str = "localhost"
    milvus_port: int = 19530

    api_host: str = "0.0.0.0"
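    api_port: int = 8080
    # Comma-separated list of allowed origins (API_CORS_ORIGINS in .env.example).
    api_cors_origins: str = ""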
    jwt_secret: SecretStr = SecretStr("change-me")
    jwt_expire_min: int = 60
    rate_limit_per_min: int = 120

    otel_endpoint: str = Field("", alias="OTEL_EXPORTER_OTLP_ENDPOINT")
    otel_service: str = Field("kg-api", alias="OTEL_SERVICE_NAME")


@lru_cache
def get_settings() -> Settings:
    return Settings()

3.4 src/kg/core/logger.py

"""Structured logging with structlog + JSON output for prod."""
import logging
import sys

import structlog

from kg.core.config import get_settings


def setup_logging() -> None:
    settings = get_settings()
    level = getattr(logging, settings.log_level.upper(), logging.INFO)

    timestamper = structlog.processors.TimeStamper(fmt="iso")

    shared_processors = [
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.StackInfoRenderer(),
        timestamper,
    ]

    if settings.log_format == "json":
        renderer = structlog.processors.JSONRenderer(ensure_ascii=False)
    else:
        renderer = structlog.dev.ConsoleRenderer(colors=True)

    structlog.configure(
        processors=shared_processors + [
            structlog.processors.format_exc_info,
            renderer,
        ],
        wrapper_class=structlog.make_filtering_bound_logger(level),
        logger_factory=structlog.PrintLoggerFactory(file=sys.stdout),
        cache_logger_on_first_use=True,
    )


def get_logger(name: str) -> structlog.stdlib.BoundLogger:
    return structlog.get_logger(name)

3.5 src/kg/store/neo4j_client.py

"""Async Neo4j driver wrapper with retry, tracing, and metrics."""
from __future__ import annotations

import asyncio
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager
from typing import Any

from neo4j import AsyncDriver, AsyncGraphDatabase
from neo4j.exceptions import ServiceUnavailable, TransientError
from prometheus_client import Counter, Histogram
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

from kg.core.config import get_settings
from kg.core.logger import get_logger

log = get_logger(__name__)

_query_total = Counter(
    "kg_neo4j_query_total", "Total Neo4j queries", ["operation", "status"]
)
_query_latency = Histogram(
    "kg_neo4j_query_seconds", "Neo4j query latency", ["operation"]
)


class Neo4jClient:
    _driver: AsyncDriver | None = None
    _lock = asyncio.Lock()

    @classmethod
    async def init(cls) -> None:
        if cls._driver is not None:
            return
        async with cls._lock:
            if cls._driver is not None:
                return
            s = get_settings().neo4j
            cls._driver = AsyncGraphDatabase.driver(
                s.uri,
                auth=(s.user, s.password.get_secret_value()),
                max_connection_pool_size=s.pool_size,
                connection_timeout=s.conn_timeout,
            )
            await cls._driver.verify_connectivity()
            log.info("neo4j_connected", uri=s.uri, db=s.database)

    @classmethod
    async def close(cls) -> None:
        if cls._driver:
            await cls._driver.close()
            cls._driver = None

    @classmethod
    @asynccontextmanager
    async def session(cls, *, write: bool = False) -> AsyncIterator[Any]:
        if cls._driver is None:
            await cls.init()
        s = get_settings().neo4j
        async with cls._driver.session(
            database=s.database,
            default_access_mode="WRITE" if write else "READ",
        ) as sess:
            yield sess

    @classmethod
    @retry(
        retry=retry_if_exception_type((TransientError, ServiceUnavailable)),
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=0.5, max=5),
        reraise=True,
    )
    async def execute_read(
        cls, cypher: str, params: dict[str, Any] | None = None
    ) -> list[dict[str, Any]]:
        op = "read"
        with _query_latency.labels(op).time():
            try:
                async with cls.session(write=False) as sess:
                    result = await sess.run(cypher, params or {})
                    rows = [r.data() async for r in result]
                _query_total.labels(op, "ok").inc()
                return rows
            except Exception:
                _query_total.labels(op, "err").inc()
                raise

    @classmethod
    @retry(
        retry=retry_if_exception_type((TransientError, ServiceUnavailable)),
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=0.5, max=5),
        reraise=True,
    )
    async def execute_write(
        cls, cypher: str, params: dict[str, Any] | None = None
    ) -> list[dict[str, Any]]:
        op = "write"
        with _query_latency.labels(op).time():
            try:
                async with cls.session(write=True) as sess:
                    result = await sess.run(cypher, params or {})
                    rows = [r.data() async for r in result]
                _query_total.labels(op, "ok").inc()
                return rows
            except Exception:
                _query_total.labels(op, "err").inc()
                raise

    @classmethod
    async def batch_write(
        cls, cypher: str, rows: list[dict[str, Any]], batch_size: int = 5000
    ) -> int:
        """UNWIND-based batch upsert."""
        total = 0
        for i in range(0, len(rows), batch_size):
            chunk = rows[i : i + batch_size]
            await cls.execute_write(cypher, {"rows": chunk})
            total += len(chunk)
            log.info("batch_written", count=len(chunk), total=total)
        return total
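
`batch_write` expects a Cypher statement that consumes the `$rows` parameter. A minimal sketch of such a statement, and of the chunking the method relies on, might look like the following. The `Enterprise` label and the `unified_credit_code` merge key are assumptions taken from the Pydantic model in section 3.6, and `chunked` is a hypothetical helper rather than part of the client.

```python
"""Sketch: an UNWIND-based upsert statement plus the chunking used by batch_write."""
from collections.abc import Iterator
from typing import Any

# Assumed statement shape: one MERGE per row, keyed on the unified social credit code.
UPSERT_ENTERPRISE = """
UNWIND $rows AS row
MERGE (e:Enterprise {unified_credit_code: row.unified_credit_code})
SET e += row
"""


def chunked(
    rows: list[dict[str, Any]], batch_size: int = 5000
) -> Iterator[list[dict[str, Any]]]:
    """Yield consecutive slices of at most batch_size rows (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start : start + batch_size]


if __name__ == "__main__":
    # Placeholder rows; real rows would come from Enterprise.to_cypher_props().
    demo_rows = [{"unified_credit_code": f"row-{n}"} for n in range(5)]
    for chunk in chunked(demo_rows, batch_size=2):
        print(len(chunk))
```

With the client above, the call would then be `await Neo4jClient.batch_write(UPSERT_ENTERPRISE, rows)`. Because `MERGE` keyed on a unique property is idempotent, retrying a failed chunk should not create duplicates.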

3.6 src/kg/ontology/entities.py

"""Pydantic models that mirror the ontology — the single source of truth for code.

Any data going into Neo4j must pass through these models first.
"""
from __future__ import annotations

from datetime import date, datetime
from enum import StrEnum
from typing import Annotated

from pydantic import BaseModel, ConfigDict, Field, StringConstraints

USCC_PATTERN = r"^[0-9A-HJ-NPQRTUWXY]{18}$"


class RegistrationStatus(StrEnum):
    IN_BUSINESS = "IN_BUSINESS"
    CANCELLED = "CANCELLED"
    REVOKED = "REVOKED"
    SUSPENDED = "SUSPENDED"
    LIQUIDATING = "LIQUIDATING"
    MIGRATED_OUT = "MIGRATED_OUT"


class EnterpriseType(StrEnum):
    LIMITED_LIABILITY = "LIMITED_LIABILITY"
    JOINT_STOCK = "JOINT_STOCK"
    WHOLLY_FOREIGN_OWNED = "WHOLLY_FOREIGN_OWNED"
    SINO_FOREIGN_JOINT_VENTURE = "SINO_FOREIGN_JOINT_VENTURE"
    PARTNERSHIP = "PARTNERSHIP"
    SOLE_PROPRIETORSHIP = "SOLE_PROPRIETORSHIP"
    INDIVIDUAL_BUSINESS = "INDIVIDUAL_BUSINESS"
    STATE_OWNED = "STATE_OWNED"
    COLLECTIVE = "COLLECTIVE"
    OTHER = "OTHER"


class EntityMeta(BaseModel):
    """Required metadata for every entity / relationship."""
    uuid: str
    source: str
    source_id: str | None = None
    source_record_url: str | None = None
    created_at: datetime
    updated_at: datetime
    confidence: float = Field(ge=0, le=1, default=1.0)
    status: str = "ACTIVE"
    version: int = 1
    merged_from: list[str] = Field(default_factory=list)
    extracted_by: str | None = None


class Enterprise(BaseModel):
    model_config = ConfigDict(extra="forbid")

    uuid: str
    unified_credit_code: Annotated[str, StringConstraints(pattern=USCC_PATTERN)]
    registration_no: str | None = None
    name: Annotated[str, StringConstraints(max_length=200)]
    aliases: list[str] = Field(default_factory=list)
    legal_representative_name: str | None = None
    registered_capital: float | None = Field(default=None, ge=0)
    paid_in_capital: float | None = Field(default=None, ge=0)
    capital_currency: str = "CNY"
    enterprise_type: EnterpriseType
    establishment_date: date
    business_term_start: date | None = None
    business_term_end: date | None = None
    registration_authority: str | None = None
    registration_status: RegistrationStatus
    industry_code: str
    industry_name: str | None = None
    business_scope: str | None = None
    email: str | None = None
    phone: str | None = None
    website: str | None = None
    is_listed: bool = False
    stock_code: str | None = None
    stock_exchange: str | None = None
    staff_size: str | None = None
    is_high_tech: bool = False
    is_specialized_new: bool = False
    credit_rating: str | None = None
    embedding: list[float] | None = None
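    # Pydantic v2 treats underscore-prefixed names as private attributes, not fields,
    # so the metadata is declared as `meta` and accepts `_meta` as its input alias.
    meta: EntityMeta = Field(alias="_meta")

    def to_cypher_props(self) -> dict:
        """Flatten for Cypher SET clause."""
        d = self.model_dump(exclude_none=True)
        meta = d.pop("meta", None) or {}
        for k, v in meta.items():
            d[f"_meta_{k}"] = v
        return d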


class NaturalPerson(BaseModel):
    model_config = ConfigDict(extra="forbid")

    uuid: str
    person_hash: str
    name: str
    aliases: list[str] = Field(default_factory=list)
    gender: str | None = None
    birth_year: int | None = Field(default=None, ge=1900, le=2100)
    nationality: str | None = None
    id_card_tail: str | None = Field(default=None, pattern=r"^[0-9X]{4}$")
    is_pep: bool = False
    is_sanctioned: bool = False
    is_executed_dishonest: bool = False
    embedding: list[float] | None = None
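    # Declared without a leading underscore so Pydantic v2 treats it as a field;
    # input still uses the `_meta` key through the alias.
    meta: EntityMeta = Field(alias="_meta")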


# ... other entities follow the same pattern
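
`USCC_PATTERN` only checks length and alphabet. The 18th character of a unified social credit code is a check character defined by GB 32100-2015, so a validator in `ontology/validators.py` could verify it as well. The sketch below is one way that might look; `uscc_check_char` and `is_valid_uscc` are hypothetical names, and the code used in the demo lines is synthetic, not a real registration.

```python
"""Sketch: check-character validation for unified social credit codes (GB 32100-2015)."""
import re

USCC_PATTERN = r"^[0-9A-HJ-NPQRTUWXY]{18}$"
# The 31-symbol alphabet; a symbol's index is its numeric value.
USCC_ALPHABET = "0123456789ABCDEFGHJKLMNPQRTUWXY"
# Position weights for the first 17 characters (3**i mod 31).
USCC_WEIGHTS = [1, 3, 9, 27, 19, 26, 16, 17, 20, 29, 25, 13, 8, 24, 10, 30, 28]


def uscc_check_char(body: str) -> str:
    """Return the check character for the first 17 characters of a code."""
    if len(body) != 17:
        raise ValueError("body must be exactly 17 characters")
    total = sum(USCC_ALPHABET.index(ch) * w for ch, w in zip(body, USCC_WEIGHTS))
    return USCC_ALPHABET[(31 - total % 31) % 31]


def is_valid_uscc(code: str) -> bool:
    """True when the code matches the pattern and its check character is correct."""
    if not re.fullmatch(USCC_PATTERN, code):
        return False
    return uscc_check_char(code[:17]) == code[17]


if __name__ == "__main__":
    synthetic_body = "9111" + "0" * 13  # made-up 17-character body for illustration
    code = synthetic_body + uscc_check_char(synthetic_body)
    print(code, is_valid_uscc(code))
```

Running the pattern check first keeps `uscc_check_char` from seeing characters outside the alphabet, so `is_valid_uscc` never raises on malformed input.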

3.7 src/kg/api/main.py

"""FastAPI app entry."""
from contextlib import asynccontextmanager

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from prometheus_client import make_asgi_app

from kg.api.middleware.auth import AuthMiddleware
from kg.api.middleware.logging import LoggingMiddleware
from kg.api.middleware.ratelimit import RateLimitMiddleware
from kg.api.routers import admin, entity, graph, health, kbqa, search
from kg.core.config import get_settings
from kg.core.logger import get_logger, setup_logging
from kg.core.tracing import setup_tracing
from kg.store.neo4j_client import Neo4jClient

log = get_logger(__name__)


@asynccontextmanager
async def lifespan(app: FastAPI):
    setup_logging()
    setup_tracing()
    await Neo4jClient.init()
    log.info("api_startup_complete")
    yield
    await Neo4jClient.close()
    log.info("api_shutdown_complete")


def create_app() -> FastAPI:
    settings = get_settings()
    app = FastAPI(
        title="Enterprise Insight KG API",
        version="0.1.0",
        lifespan=lifespan,
        docs_url="/docs" if settings.env != "prod" else None,
        redoc_url="/redoc" if settings.env != "prod" else None,
    )

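    # API_CORS_ORIGINS is a comma-separated list; getattr keeps this working if
    # Settings does not define api_cors_origins (then no origins are allowed).
    cors_origins = [
        o.strip() for o in getattr(settings, "api_cors_origins", "").split(",") if o.strip()
    ]
    # The middleware added last is the outermost layer, so CORS goes last to let
    # preflight requests receive CORS headers before auth or rate limiting runs.
    app.add_middleware(LoggingMiddleware)
    app.add_middleware(RateLimitMiddleware)
    app.add_middleware(AuthMiddleware)
    app.add_middleware(
        CORSMiddleware,
        allow_origins=["*"] if settings.env == "local" else cors_origins,
        allow_methods=["*"],
        allow_headers=["*"],
    )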

    app.include_router(health.router, tags=["health"])
    app.include_router(entity.router, prefix="/api/v1/entities", tags=["entity"])
    app.include_router(graph.router, prefix="/api/v1/graph", tags=["graph"])
    app.include_router(search.router, prefix="/api/v1/search", tags=["search"])
    app.include_router(kbqa.router, prefix="/api/v1/kbqa", tags=["kbqa"])
    app.include_router(admin.router, prefix="/api/v1/admin", tags=["admin"])

    app.mount("/metrics", make_asgi_app())

    FastAPIInstrumentor.instrument_app(app)
    return app


app = create_app()
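
`RateLimitMiddleware` and `RATE_LIMIT_PER_MIN` imply a per-client request budget, but the middleware body is not shown in this document. One common approach is a fixed-window counter. The sketch below keeps counters in memory purely for illustration; a Redis-backed version (the stack lists Redis for rate limiting) would typically increment a key with an expiry instead. `FixedWindowLimiter` and its method names are hypothetical.

```python
"""Sketch: a fixed-window rate limiter (in-memory stand-in for a Redis counter)."""
from __future__ import annotations

import time
from dataclasses import dataclass, field


@dataclass
class FixedWindowLimiter:
    limit_per_min: int = 120
    # Maps (client_id, window_index) to the number of requests seen in that window.
    _counts: dict[tuple[str, int], int] = field(default_factory=dict, init=False, repr=False)

    def allow(self, client_id: str, now: float | None = None) -> bool:
        """Count one request and report whether it is within the per-minute budget."""
        ts = time.time() if now is None else now
        window = int(ts // 60)
        # Drop counters that belong to earlier windows so the dict stays small.
        for key in [k for k in self._counts if k[1] < window]:
            del self._counts[key]
        count = self._counts.get((client_id, window), 0) + 1
        self._counts[(client_id, window)] = count
        return count <= self.limit_per_min


if __name__ == "__main__":
    limiter = FixedWindowLimiter(limit_per_min=2)
    for _ in range(3):
        print(limiter.allow("client-a", now=0.0))
```

A fixed window is simple but allows short bursts of up to twice the limit across a window boundary; a sliding window or token bucket avoids that at the cost of more state.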

3.8 src/kg/api/routers/entity.py (example)

from typing import Annotated

from fastapi import APIRouter, Depends, HTTPException, Query

from kg.api.deps import get_current_user
from kg.api.schemas.entity import EnterpriseRead, EnterpriseSearchResult
from kg.store.repositories.enterprise_repo import EnterpriseRepository

router = APIRouter()


@router.get("/enterprise/{uscc}", response_model=EnterpriseRead)
async def get_enterprise(
    uscc: str,
    user: Annotated[dict, Depends(get_current_user)],
) -> EnterpriseRead:
    ent = await EnterpriseRepository.get_by_uscc(uscc)
    if not ent:
        raise HTTPException(status_code=404, detail="enterprise not found")
    return ent


@router.get("/enterprise", response_model=list[EnterpriseSearchResult])
async def search_enterprises(
    q: Annotated[str, Query(min_length=2, max_length=100)],
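    # Bounded so a single request cannot ask for an arbitrarily large result set.
    limit: Annotated[int, Query(ge=1, le=100)] = 20,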
) -> list[EnterpriseSearchResult]:
    return await EnterpriseRepository.search_by_name(q, limit=limit)

3.9 docker-compose.yml (one-command local startup)

version: "3.9"

services:
  neo4j:
    image: neo4j:5.15-enterprise
    environment:
      NEO4J_AUTH: neo4j/changeit
      NEO4J_ACCEPT_LICENSE_AGREEMENT: "yes"
      NEO4J_PLUGINS: '["apoc", "graph-data-science"]'
      NEO4J_dbms_security_procedures_unrestricted: apoc.*,gds.*
      NEO4J_server_memory_heap_max__size: 4G
      NEO4J_server_memory_pagecache_size: 4G
    ports:
      - "7474:7474"
      - "7687:7687"
    volumes:
      - neo4j_data:/data
      - neo4j_logs:/logs
      - ./schema/seeds:/import

  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: kg
      POSTGRES_PASSWORD: changeit
      POSTGRES_DB: kg
    ports: ["5432:5432"]
    volumes:
      - pg_data:/var/lib/postgresql/data

  redis:
    image: redis:7.2
    ports: ["6379:6379"]

  elasticsearch:
    image: elasticsearch:8.13.0
    environment:
      discovery.type: single-node
      xpack.security.enabled: "false"
      ES_JAVA_OPTS: -Xms2g -Xmx2g
    ports: ["9200:9200"]

  milvus:
    image: milvusdb/milvus:v2.4.0
    command: ["milvus", "run", "standalone"]
    environment:
      ETCD_USE_EMBED: "true"
      COMMON_STORAGETYPE: local
    ports: ["19530:19530"]

  kafka:
    image: bitnami/kafka:3.7
    environment:
      KAFKA_CFG_NODE_ID: 0
      KAFKA_CFG_PROCESS_ROLES: controller,broker
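      # Two listeners: PLAINTEXT is advertised as localhost for clients on the host
      # (KAFKA_BOOTSTRAP=localhost:9092); INTERNAL serves containers on the compose network.
      KAFKA_CFG_LISTENERS: PLAINTEXT://:9092,INTERNAL://:29092,CONTROLLER://:9093
      KAFKA_CFG_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092,INTERNAL://kafka:29092
      KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,INTERNAL:PLAINTEXT
      KAFKA_CFG_INTER_BROKER_LISTENER_NAME: INTERNAL
      KAFKA_CFG_CONTROLLER_QUORUM_VOTERS: 0@kafka:9093
      KAFKA_CFG_CONTROLLER_LISTENER_NAMES: CONTROLLER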
    ports: ["9092:9092"]

  airflow:
    image: apache/airflow:2.9.0
    environment:
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
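      # The postgres service only creates the `kg` database, so the Airflow metadata DB
      # reuses it for local development; use a dedicated database in other environments.
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://kg:changeit@postgres:5432/kg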
    volumes:
      - ./airflow/dags:/opt/airflow/dags
    depends_on: [postgres]
    ports: ["8088:8080"]
    command: standalone

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    ports: ["9090:9090"]

  grafana:
    image: grafana/grafana:10.4.0
    ports: ["3000:3000"]
    volumes:
      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards

volumes:
  neo4j_data:
  neo4j_logs:
  pg_data:

3.10 Dockerfile

# ===== builder =====
FROM python:3.11-slim AS builder
ENV PIP_NO_CACHE_DIR=1 POETRY_VERSION=1.8.2
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential curl git && rm -rf /var/lib/apt/lists/*
RUN pip install poetry==${POETRY_VERSION}
WORKDIR /app
COPY pyproject.toml poetry.lock ./
RUN poetry config virtualenvs.in-project true \
 && poetry install --only main --no-root --no-interaction

# ===== runtime =====
FROM python:3.11-slim AS runtime
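# --no-root skips installing the kg package itself, so src/ is exposed on PYTHONPATH.
ENV PATH="/app/.venv/bin:$PATH" PYTHONUNBUFFERED=1 PYTHONPATH=/app/src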
RUN apt-get update && apt-get install -y --no-install-recommends \
    libgomp1 curl ca-certificates && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY --from=builder /app/.venv /app/.venv
COPY src /app/src
RUN groupadd -r kg && useradd -r -g kg kg && chown -R kg:kg /app
USER kg
EXPOSE 8080
HEALTHCHECK --interval=30s --timeout=10s --start-period=10s --retries=3 \
  CMD curl -fsS http://localhost:8080/health || exit 1
CMD ["gunicorn", "kg.api.main:app", \
     "-w", "4", "-k", "uvicorn.workers.UvicornWorker", \
     "-b", "0.0.0.0:8080", "--access-logfile", "-"]

3.11 Makefile

.PHONY: install up down schema seed test lint fmt api ingest bench

install:
	poetry install
	poetry run pre-commit install

up:
	docker compose up -d

down:
	docker compose down

schema:
	poetry run kg schema apply

seed:
	poetry run kg ingest seed

test:
	poetry run pytest -q

lint:
	poetry run ruff check src tests
	poetry run mypy src

fmt:
	poetry run ruff check --fix src tests
	poetry run black src tests

api:
	poetry run uvicorn kg.api.main:app --reload --port 8080

ingest:
	poetry run kg ingest run --source $(SOURCE)

bench:
	poetry run python scripts/benchmark_query.py

3.12 .github/workflows/ci.yml

name: CI

on:
  push:
    branches: [main, develop]
  pull_request:

env:
  POETRY_VERSION: 1.8.2

jobs:
  test:
    runs-on: ubuntu-22.04
    services:
      neo4j:
        image: neo4j:5.15
        env:
          NEO4J_AUTH: neo4j/changeit
        ports: [7687:7687]
        options: >-
          --health-cmd "wget -O - http://localhost:7474 || exit 1"
          --health-interval 10s --health-timeout 5s --health-retries 10
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install poetry==${POETRY_VERSION}
      - run: poetry install --no-interaction
      - run: poetry run ruff check src tests
      - run: poetry run mypy src
      - run: poetry run pytest -q --cov=src/kg --cov-report=xml
        env:
          NEO4J_URI: bolt://localhost:7687
          NEO4J_PASSWORD: changeit
      - uses: codecov/codecov-action@v4

  security:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - uses: aquasecurity/trivy-action@master
        with:
          scan-type: fs
          severity: CRITICAL,HIGH
          exit-code: 1

  build:
    needs: [test, security]
    runs-on: ubuntu-22.04
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ${{ secrets.REGISTRY }}
          username: ${{ secrets.REGISTRY_USER }}
          password: ${{ secrets.REGISTRY_PASS }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: |
            ${{ secrets.REGISTRY }}/kg-api:${{ github.sha }}
            ${{ secrets.REGISTRY }}/kg-api:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max

3.13 .pre-commit-config.yaml

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-json
      - id: check-added-large-files
        args: ["--maxkb=1024"]
      - id: detect-private-key
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.0
    hooks:
      - id: ruff
        args: [--fix]
  - repo: https://github.com/psf/black
    rev: 24.3.0
    hooks: [{ id: black }]
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.9.0
    hooks:
      - id: mypy
        additional_dependencies: [pydantic, types-redis]

4. CLI Command Conventions

The kg CLI (built on Typer) is the entry point for day-to-day operations:

# Schema management
kg schema apply                          # apply all changesets
kg schema status                         # show which changesets are applied
kg schema rollback --to 005              # roll back to the given version

# Seed data loading
kg ingest seed --type industry           # load the industry dictionary
kg ingest seed --type region             # load administrative regions
kg ingest seed --type pep                # load the PEP list

# ETL runs
kg ingest run --source necips --mode full     # full load
kg ingest run --source necips --mode incr     # incremental load
kg ingest run --source documents --path /data/docs

# Extraction
kg extract document --doc-id D123
kg extract batch --task-id T456

# Fusion
kg fusion run --entity-type Enterprise --since 2026-05-01

# Recompute derived relationships
kg reasoning rebuild --rule actual_controls

# Quality checks
kg quality check --report-path ./reports/q1.json

# Evaluation
kg eval kbqa --testset tests/fixtures/kbqa_eval_set.jsonl

# Backup
kg backup neo4j --output /backup/$(date +%Y%m%d).dump

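5. Testing Strategy

5.1 Layers

| Layer | Scope | Tools | Target coverage |
|---|---|---|---|
| Unit | Functions/classes | pytest + mock | ≥ 80% |
| Integration | Multiple modules/external dependencies | pytest + docker-compose | 100% of critical paths |
| End-to-end | API → DB | pytest + httpx | Main use cases covered |
| Performance | Critical queries | locust + scripts/benchmark | Baselines met |
| Data | Schema/extraction results | great-expectations | All rules pass |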

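5.2 Test Fixtures

tests/fixtures/enterprises_30.json: the complete graph of 30 real enterprises (anonymized), comprising:

  • Listed companies: 5
  • State-owned enterprises: 3
  • Private technology companies: 10
  • Companies on the dishonest judgment debtor list: 3
  • Deregistered or license-revoked companies: 4
  • Companies with multi-layer shareholding structures: 5

The fixture data is used for:

  • Integration test baselines
  • Building the KBQA evaluation set
  • Onboarding new team members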

5.3 tests/conftest.py

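import asyncio
import json

import pytest
import pytest_asyncio

from kg.store.neo4j_client import Neo4jClient


@pytest.fixture(scope="session")
def event_loop():
    # One loop for the whole session so session-scoped async fixtures can share it
    loop = asyncio.new_event_loop()
    yield loop
    loop.close()


# pytest_asyncio.fixture is needed for async fixtures unless pytest-asyncio runs in auto mode
@pytest_asyncio.fixture(scope="session", autouse=True)
async def neo4j_setup():
    await Neo4jClient.init()
    # Wipe the database, then rebuild the schema
    await Neo4jClient.execute_write("MATCH (n) DETACH DELETE n")
    # Apply schema changesets
    # ...
    yield
    await Neo4jClient.close()


@pytest.fixture
def sample_enterprises():
    # Plain synchronous fixture: nothing here needs to be awaited
    with open("tests/fixtures/enterprises_30.json", encoding="utf-8") as f:
        return json.load(f)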

6. Key Engineering Conventions

6.1 Error Handling

# src/kg/core/errors.py
class KGError(Exception):
    """Base."""
    code: str = "KG_ERROR"
    http_status: int = 500

class ValidationError(KGError):
    code = "KG_VALIDATION"
    http_status = 400

class NotFoundError(KGError):
    code = "KG_NOT_FOUND"
    http_status = 404

class ExtractError(KGError):
    code = "KG_EXTRACT_FAILED"

class CypherUnsafeError(KGError):
    code = "KG_CYPHER_UNSAFE"
    http_status = 400

class LLMBudgetExceeded(KGError):
    code = "KG_LLM_BUDGET"
    http_status = 503

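A global FastAPI exception handler converts every KGError into a uniform JSON response:

from fastapi import Request
from fastapi.responses import JSONResponse

from kg.core.errors import KGError

# `app` is the FastAPI instance, e.g. the one served as kg.api.main:app
@app.exception_handler(KGError)
async def kg_error_handler(request: Request, exc: KGError) -> JSONResponse:
    return JSONResponse(
        status_code=exc.http_status,
        content={"code": exc.code, "message": str(exc)},
    )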
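
To make the mapping from class attributes to the response concrete, here is a dependency-free sketch. The `to_response` helper is hypothetical and only mirrors what the handler does; the classes restate part of the hierarchy so the snippet runs on its own.

```python
class KGError(Exception):
    """Base error: subclasses override code and, where needed, http_status."""
    code: str = "KG_ERROR"
    http_status: int = 500


class NotFoundError(KGError):
    code = "KG_NOT_FOUND"
    http_status = 404


class ExtractError(KGError):
    code = "KG_EXTRACT_FAILED"  # no http_status override, so it inherits 500


def to_response(exc: KGError) -> tuple[int, dict[str, str]]:
    """Hypothetical helper: same mapping as the FastAPI handler, without FastAPI."""
    return exc.http_status, {"code": exc.code, "message": str(exc)}


not_found = to_response(NotFoundError("enterprise not found"))
extract_failed = to_response(ExtractError("OCR produced no text"))
```

Because `ExtractError` does not set `http_status`, it falls back to the base value of 500; an error class that should surface as a client error has to override it explicitly.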

6.2 ID and UUID Conventions

# src/kg/core/id_generator.py
import hashlib
import uuid

def new_uuid() -> str:
    return str(uuid.uuid4())

def person_hash(name: str, id_card_tail: str | None, birth_year: int | None) -> str:
    """Stable hash for natural person identity.

    Uses SHA-256 with stable input ordering. id_card_tail and birth_year may be None;
    None is serialized as empty string for determinism.
    """
    s = f"{name}|{id_card_tail or ''}|{birth_year or ''}"
    return hashlib.sha256(s.encode("utf-8")).hexdigest()[:32]

def doc_hash(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()
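
The properties these helpers are meant to provide can be checked in a few lines. The snippet below restates the two functions so it runs standalone; the sample name, ID tail, and document strings are made up for illustration.

```python
import hashlib
from typing import Optional


def person_hash(name: str, id_card_tail: Optional[str], birth_year: Optional[int]) -> str:
    s = f"{name}|{id_card_tail or ''}|{birth_year or ''}"
    return hashlib.sha256(s.encode("utf-8")).hexdigest()[:32]


def doc_hash(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


# Same inputs always give the same key, so re-ingesting a record cannot mint a new identity.
a = person_hash("张三", "1234", 1980)
b = person_hash("张三", "1234", 1980)

# Changing any component (here the birth year) yields a different key.
c = person_hash("张三", "1234", 1981)

# Content-addressed de-duplication: a document seen twice is stored once.
seen: dict[str, str] = {}
for text in ["annual report 2025", "annual report 2025", "prospectus"]:
    seen.setdefault(doc_hash(text), text)
```

Note that `person_hash` treats a missing `id_card_tail` and an empty string the same way, so two people with the same name and no other attributes collapse to one key; entity linking has to resolve those cases.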

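6.3 Data Write Flow (Golden Path)

Raw data
  ↓ Pydantic validation (Ontology Model)
  ↓ Business rule validation (Validators)
  ↓ Entity linking (candidates → match → select master)
  ↓ Staging (PostgreSQL, with review status)
  ↓ [Optional] Manual review
  ↓ Write to Neo4j (UNWIND batch MERGE)
  ↓ Write to ES (search sync)
  ↓ Write to Milvus (store embeddings)
  ↓ Trigger derived-relationship recomputation (if needed)
  ↓ Emit domain events (Kafka)

Every step must be idempotent, audit-logged, and resumable from a checkpoint.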
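
One way to get all three properties is a runner that records a per-record, per-step checkpoint and an audit entry around each step. The sketch below is an illustration under assumptions: `PipelineRunner` and the step names are hypothetical, and in-memory structures stand in for the PostgreSQL staging table, Neo4j, and the search index.

```python
from typing import Callable

Step = tuple[str, Callable[[dict], None]]


class PipelineRunner:
    """Runs ordered steps for one record, skipping steps that already completed."""

    def __init__(self, steps: list[Step]) -> None:
        self.steps = steps
        self.done: set[tuple[str, str]] = set()       # (record_id, step) checkpoints
        self.audit: list[tuple[str, str, str]] = []   # (record_id, step, outcome)

    def run(self, record: dict) -> None:
        rid = record["id"]
        for name, fn in self.steps:
            if (rid, name) in self.done:
                continue                              # resume: finished in an earlier run
            try:
                fn(record)
            except Exception:
                self.audit.append((rid, name, "failed"))
                raise
            self.done.add((rid, name))
            self.audit.append((rid, name, "ok"))


graph: dict[str, dict] = {}          # stand-in for Neo4j; keyed upsert mimics MERGE
search_up = {"value": False}         # simulate the search index being down at first


def write_graph(rec: dict) -> None:
    graph[rec["id"]] = rec           # writing twice still leaves one node: idempotent


def write_search(rec: dict) -> None:
    if not search_up["value"]:
        raise ConnectionError("search index unavailable")


runner = PipelineRunner([
    ("validate", lambda rec: None),
    ("neo4j", write_graph),
    ("es", write_search),
])
record = {"id": "E1", "name": "Example Co"}

try:
    runner.run(record)               # fails at "es"
except ConnectionError:
    pass

search_up["value"] = True
runner.run(record)                   # resumes at "es"; earlier steps are skipped

outcomes = [(step, outcome) for _, step, outcome in runner.audit]
```

In a real deployment the checkpoint set and audit list would have to live in durable storage, otherwise a process restart loses the resume point.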

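6.4 Performance Budgets (keep in mind while coding)

| Operation | Budget |
|---|---|
| Single API request, P95 | < 500 ms |
| 1-hop query | < 50 ms |
| 3-hop path query | < 200 ms |
| KBQA end to end | < 5 s |
| Full-text search | < 100 ms |
| Bulk write (per 10,000 rows) | < 30 s |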
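
These budgets are easiest to keep when a benchmark script checks them automatically. The following is a minimal sketch, not the project's benchmark: the dictionary keys and function names are hypothetical, and it assumes every budget is evaluated at P95, although the table states the percentile only for the API request row.

```python
# Budgets from the table, in milliseconds (key names are illustrative).
BUDGET_MS = {
    "api_request": 500,
    "one_hop_query": 50,
    "three_hop_path_query": 200,
    "kbqa_end_to_end": 5_000,
    "fulltext_search": 100,
    "bulk_write_per_10k_rows": 30_000,
}


def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of the measured latencies."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def within_budget(operation: str, samples_ms: list[float]) -> bool:
    """True when the P95 latency is strictly below the operation's budget."""
    return p95(samples_ms) < BUDGET_MS[operation]


# 19 fast calls plus one slow outlier: the outlier falls outside P95.
one_hop = [20.0 + i for i in range(19)] + [400.0]

# Latencies that creep upward: P95 lands above the 200 ms budget.
three_hop = [150.0 + 5 * i for i in range(20)]

one_hop_ok = within_budget("one_hop_query", one_hop)
three_hop_ok = within_budget("three_hop_path_query", three_hop)
```

A check like this could run from the `bench` target and fail the build when `within_budget` returns False.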

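6.5 Security Conventions

  • Never splice user input directly into Cypher (injection risk)
  • All Cypher must be parameterized ($param)
  • Cypher generated by the kbqa module must pass the validator before the executor runs it
  • The executor enforces READ mode, a timeout, and a LIMIT
  • Sensitive fields (phone number, national ID number) are hashed before storage
  • All APIs require JWT authentication (except /health and /metrics)
  • All mutating endpoints must write an audit log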
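
The validator rule can be sketched as a pre-execution check that rejects anything other than a single read query and caps the result size. This is a conservative illustration, not the project's validator: `validate_cypher`, `max_limit`, and the keyword list are assumptions, keyword matching is cruder than parsing the query, and the local `CypherUnsafeError` stands in for the class in kg.core.errors.

```python
import re


class CypherUnsafeError(Exception):
    """Local stand-in for kg.core.errors.CypherUnsafeError."""


# Clauses that write, or that could reach procedures with side effects.
_FORBIDDEN = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|FOREACH|LOAD\s+CSV|CALL)\b",
    re.IGNORECASE,
)
_TRAILING_LIMIT = re.compile(r"\bLIMIT\s+(\d+)\s*$", re.IGNORECASE)


def validate_cypher(query: str, max_limit: int = 100) -> str:
    """Return a query that is safe to pass to a read-only executor, or raise."""
    q = query.strip().rstrip(";").rstrip()
    if ";" in q:
        raise CypherUnsafeError("multiple statements are not allowed")
    hit = _FORBIDDEN.search(q)
    if hit:
        raise CypherUnsafeError(f"clause not allowed: {hit.group(1).upper()}")
    limit = _TRAILING_LIMIT.search(q)
    if limit is None:
        return f"{q} LIMIT {max_limit}"               # no LIMIT given: add one
    if int(limit.group(1)) > max_limit:
        return _TRAILING_LIMIT.sub(f"LIMIT {max_limit}", q)  # cap an oversized LIMIT
    return q


# User input travels as a parameter, never inside the query text.
query = validate_cypher("MATCH (e:Enterprise {uscc: $uscc}) RETURN e.name")
params = {"uscc": "user-supplied value"}

capped = validate_cypher("MATCH (n) RETURN n LIMIT 5000")
```

Blocking `CALL` outright also rejects read-only procedures, so a real validator would likely need an allow-list. The check complements READ-mode execution and timeouts in the executor; it does not replace them.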

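7. First-Week Development Tasks (ready to start immediately)

| Day | Task | Deliverable |
|---|---|---|
| D1 | Repository initialization, Poetry, pre-commit | Runnable empty skeleton |
| D2 | Bring up the local stack with docker-compose up | All middleware reachable |
| D3 | Neo4j client + Pydantic entity models | Basic writes working |
| D4 | Schema DDL apply + seed data | Industry/region/PEP data loaded |
| D5 | FastAPI skeleton + basic query API | /health and /enterprises/{uscc} callable |
| D6 | CI pipeline (test + lint + build) | PRs trigger automatic verification |
| D7 | 30-enterprise fixture + integration tests | Baseline regression suite |

Once these are done, the project can move on to data source integration and extraction implementation.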


Next document: 03_数据源接入详设.md