Companion document to 《01_领域本体详细设计书.md》 (Domain Ontology Detailed Design). Target readers: implementation engineers and SREs. Purpose: a ready-to-use project skeleton; set it up as described and start coding.
1. Final Technology Stack
| Module | Choice | Version | Notes |
|---|---|---|---|
| Language | Python | 3.11 | Primary development language |
| Package manager | Poetry | 1.8+ | Dependency locking |
| Graph database | Neo4j Enterprise | 5.15 | Primary store |
| Relational DB | PostgreSQL | 16 | Staging / review / metadata |
| Cache | Redis | 7.2 | Query cache / rate limiting |
| Full-text search | Elasticsearch | 8.13 | Entity-name search |
| Vector store | Milvus | 2.4 | Entity alignment / RAG |
| Message queue | Kafka | 3.7 | Incremental streams |
| Workflow | Airflow | 2.9 | Offline ETL |
| API framework | FastAPI | 0.110+ | REST / WebSocket |
| App server | Uvicorn + Gunicorn | - | Production deployment |
| Task queue | Celery + Redis | 5.4 | Async tasks |
| OCR | PaddleOCR | 2.7 | Chinese OCR |
| LLM SDK | anthropic | 0.40+ | Claude API |
| Embedding model | bge-large-zh-v1.5 | - | Chinese embeddings |
| Extraction model | UIE (PaddleNLP) | 2.7 | Fine-tuned |
| Monitoring | Prometheus + Grafana | - | Metrics |
| Logging | Loki + Promtail | - | Log aggregation |
| Tracing | OpenTelemetry + Jaeger | - | APM |
| Containers | Docker + Kubernetes | 1.28+ | Orchestration |
| CI/CD | GitLab CI / GitHub Actions | - | Pipelines |
2. Complete Directory Layout
enterprise-insight-kg/
├── README.md
├── pyproject.toml
├── poetry.lock
├── .python-version
├── .env.example
├── .gitignore
├── .editorconfig
├── .pre-commit-config.yaml
├── Makefile
├── docker-compose.yml          # one-command local dev environment
├── docker-compose.prod.yml
│
├── docs/                       # project documentation
│   ├── 01_领域本体详细设计书.md
│   ├── 02_代码工程脚手架.md
│   ├── 03_数据源接入详设.md
│   ├── 04_KBQA实施与评估.md
│   ├── api/                    # OpenAPI docs (auto-generated)
│   ├── adr/                    # architecture decision records
│   │   ├── 0001-use-neo4j.md
│   │   ├── 0002-text2cypher-vs-finetune.md
│   │   └── ...
│   └── runbook/                # operations runbooks
│       ├── deploy.md
│       ├── backup.md
│       └── incident.md
│
├── schema/                     # Neo4j schema DDL
│   ├── changelog.yaml
│   ├── changesets/
│   ├── rollback/
│   └── seeds/
│
├── src/
│   └── kg/                     # main Python package
│       ├── __init__.py
│       ├── core/               # core abstractions
│       │   ├── __init__.py
│       │   ├── config.py       # Pydantic Settings
│       │   ├── logger.py       # structured logging
│       │   ├── tracing.py      # OTel setup
│       │   ├── errors.py       # exception hierarchy
│       │   ├── types.py        # shared types
│       │   └── id_generator.py # UUID/hash utilities
│       │
│       ├── ontology/           # ontology as code
│       │   ├── __init__.py
│       │   ├── entities.py     # entity-type Pydantic models
│       │   ├── relations.py    # relation types
│       │   ├── enums.py        # enum constants
│       │   └── validators.py   # business validators
│       │
│       ├── store/              # storage layer
│       │   ├── __init__.py
│       │   ├── neo4j_client.py # async Neo4j driver wrapper
│       │   ├── postgres_client.py # staging DB
│       │   ├── redis_client.py
│       │   ├── es_client.py
│       │   ├── milvus_client.py
│       │   └── repositories/   # data access layer
│       │       ├── __init__.py
│       │       ├── enterprise_repo.py
│       │       ├── person_repo.py
│       │       ├── event_repo.py
│       │       └── document_repo.py
│       │
│       ├── ingestion/          # data ingestion
│       │   ├── __init__.py
│       │   ├── sources/        # source adapters
│       │   │   ├── base.py     # SourceAdapter abstract class
│       │   │   ├── mysql_cdc.py
│       │   │   ├── api_pull.py
│       │   │   ├── file_watcher.py
│       │   │   └── kafka_consumer.py
│       │   ├── parsers/        # document parsing
│       │   │   ├── pdf_parser.py
│       │   │   ├── docx_parser.py
│       │   │   ├── email_parser.py
│       │   │   ├── html_parser.py
│       │   │   └── ocr.py
│       │   ├── d2r/            # database→graph mapping
│       │   │   ├── engine.py
│       │   │   ├── mapping_loader.py
│       │   │   └── mappings/   # YAML mapping rules
│       │   │       ├── necips_enterprise.yaml
│       │   │       ├── tyc_person.yaml
│       │   │       └── ...
│       │   └── stream/
│       │       ├── flink_jobs/ # Flink SQL jobs
│       │       └── handlers.py
│       │
│       ├── extraction/         # knowledge extraction
│       │   ├── __init__.py
│       │   ├── pipeline.py     # extraction pipeline orchestration
│       │   ├── ner/            # named entity recognition
│       │   │   ├── base.py
│       │   │   ├── rule_ner.py
│       │   │   ├── uie_ner.py
│       │   │   └── llm_ner.py
│       │   ├── relation/       # relation extraction
│       │   │   ├── base.py
│       │   │   ├── pattern_re.py
│       │   │   ├── uie_re.py
│       │   │   └── llm_re.py
│       │   ├── event/          # event extraction
│       │   │   ├── schemas/    # event schema YAML
│       │   │   └── llm_ee.py
│       │   ├── linking/        # entity linking
│       │   │   ├── candidate_gen.py
│       │   │   └── disambiguator.py
│       │   ├── normalization/  # normalization
│       │   │   ├── date.py
│       │   │   ├── amount.py
│       │   │   ├── name.py
│       │   │   └── address.py
│       │   ├── prompts/        # LLM prompt templates
│       │   │   ├── ner_prompt.py
│       │   │   ├── re_prompt.py
│       │   │   └── event_prompt.py
│       │   └── arbitration.py  # multi-source arbitration
│       │
│       ├── fusion/             # knowledge fusion
│       │   ├── __init__.py
│       │   ├── blocking.py     # candidate generation
│       │   ├── matcher.py      # entity matching
│       │   ├── clusterer.py    # cluster partitioning
│       │   ├── merger.py       # node-merge execution
│       │   ├── conflict_resolver.py # conflict resolution
│       │   └── features.py     # matching feature engineering
│       │
│       ├── reasoning/          # reasoning & derivation
│       │   ├── __init__.py
│       │   ├── rules/          # rule definitions
│       │   │   ├── actual_controls.cypher
│       │   │   ├── beneficial_owner.cypher
│       │   │   └── ...
│       │   ├── runner.py       # rule executor
│       │   └── gds_jobs.py     # graph-algorithm jobs
│       │
│       ├── api/                # FastAPI application
│       │   ├── __init__.py
│       │   ├── main.py         # FastAPI app entry point
│       │   ├── deps.py         # dependency injection
│       │   ├── middleware/
│       │   │   ├── auth.py
│       │   │   ├── ratelimit.py
│       │   │   ├── logging.py
│       │   │   └── tracing.py
│       │   ├── routers/
│       │   │   ├── health.py
│       │   │   ├── entity.py   # entity CRUD
│       │   │   ├── graph.py    # graph queries
│       │   │   ├── search.py
│       │   │   ├── kbqa.py     # KBQA entry point
│       │   │   └── admin.py    # back-office admin
│       │   └── schemas/        # Pydantic request/response models
│       │
│       ├── kbqa/               # question answering
│       │   ├── __init__.py
│       │   ├── pipeline.py     # end-to-end KBQA
│       │   ├── intent.py       # intent recognition
│       │   ├── entity_linking.py
│       │   ├── text2cypher/
│       │   │   ├── generator.py # LLM Cypher generation
│       │   │   ├── validator.py # Cypher validation
│       │   │   ├── executor.py # safe execution
│       │   │   └── few_shots.yaml # few-shot library
│       │   ├── graph_rag/
│       │   │   ├── retriever.py
│       │   │   ├── verbalizer.py
│       │   │   └── community_summary.py
│       │   ├── reranker.py
│       │   ├── answer_gen.py   # answer generation
│       │   └── citation.py     # citations & provenance
│       │
│       ├── quality/            # quality assessment
│       │   ├── checks.py
│       │   ├── metrics.py
│       │   └── reports.py
│       │
│       └── cli/                # command-line tools
│           ├── __init__.py
│           ├── main.py         # typer entry point
│           ├── schema_cli.py   # DDL application
│           ├── ingest_cli.py
│           ├── extract_cli.py
│           └── eval_cli.py
│
├── airflow/                    # workflow DAGs
│   ├── dags/
│   │   ├── ingest_necips.py
│   │   ├── extract_documents.py
│   │   ├── fusion_daily.py
│   │   ├── reasoning_daily.py
│   │   ├── quality_check_daily.py
│   │   └── backup_neo4j.py
│   ├── plugins/
│   └── tests/
│
├── tests/                      # tests
│   ├── unit/
│   │   ├── ontology/
│   │   ├── extraction/
│   │   ├── fusion/
│   │   └── kbqa/
│   ├── integration/
│   │   ├── test_neo4j.py
│   │   ├── test_pipeline_e2e.py
│   │   └── test_api.py
│   ├── fixtures/               # test data
│   │   ├── enterprises_30.json
│   │   ├── documents/
│   │   └── kbqa_eval_set.jsonl
│   └── conftest.py
│
├── scripts/                    # ops scripts
│   ├── apply_schema.sh
│   ├── load_seeds.py
│   ├── reindex.sh
│   ├── benchmark_query.py
│   └── backup.sh
│
├── deploy/
│   ├── k8s/                    # K8s manifests
│   │   ├── namespace.yaml
│   │   ├── neo4j-cluster.yaml
│   │   ├── postgres.yaml
│   │   ├── api-deployment.yaml
│   │   ├── airflow.yaml
│   │   ├── ingress.yaml
│   │   └── monitoring/
│   ├── helm/                   # Helm chart
│   ├── terraform/              # infrastructure as code
│   └── ansible/                # configuration management
│
├── monitoring/
│   ├── prometheus/
│   │   ├── prometheus.yml
│   │   └── alerts.yaml
│   ├── grafana/
│   │   └── dashboards/
│   │       ├── neo4j.json
│   │       ├── api.json
│   │       └── extraction.json
│   └── otel-collector.yaml
│
└── .github/                    # or .gitlab-ci.yml
    └── workflows/
        ├── ci.yml              # test, lint, build
        ├── cd-staging.yml
        ├── cd-prod.yml
        ├── security-scan.yml
        └── release.yml
3. Key Configuration Files
3.1 pyproject.toml
[tool.poetry]
name = "kg"
version = "0.1.0"
description = "Enterprise Insight Knowledge Graph"
authors = ["KG Team <kg@example.com>"]
packages = [{ include = "kg", from = "src" }]
readme = "README.md"
[tool.poetry.dependencies]
python = "^3.11"
# Web
fastapi = "^0.110.0"
uvicorn = { extras = ["standard"], version = "^0.29.0" }
gunicorn = "^21.2.0"
# Data validation
pydantic = "^2.6"
pydantic-settings = "^2.2"
# Stores
neo4j = "^5.18"
asyncpg = "^0.29"
redis = "^5.0"
elasticsearch = "^8.13"
pymilvus = "^2.4"
# Async
aiofiles = "^23.2"
httpx = "^0.27"
# LLM & NLP
anthropic = "^0.40"
sentence-transformers = "^2.7"
paddlenlp = "^2.7"
paddlepaddle = "^2.6"
# Document parsing
pymupdf = "^1.24"
pdfplumber = "^0.11"
python-docx = "^1.1"
mail-parser = "^3.15"
trafilatura = "^1.10"
unstructured = "^0.13"
# OCR
paddleocr = "^2.7"
# Stream
kafka-python = "^2.0"
# Workflow
celery = "^5.4"
# CLI
typer = "^0.12"
rich = "^13.7"
# Observability
opentelemetry-api = "^1.24"
opentelemetry-sdk = "^1.24"
opentelemetry-instrumentation-fastapi = "^0.45b0"
opentelemetry-instrumentation-neo4j = "^0.45b0"
structlog = "^24.1"
prometheus-client = "^0.20"
# Utility
tenacity = "^8.2"
python-jose = { extras = ["cryptography"], version = "^3.3" }
passlib = { extras = ["bcrypt"], version = "^1.7" }
[tool.poetry.group.dev.dependencies]
pytest = "^8.1"
pytest-asyncio = "^0.23"
pytest-cov = "^4.1"
pytest-mock = "^3.12"
ruff = "^0.4"
black = "^24.3"
mypy = "^1.9"
pre-commit = "^3.7"
ipython = "^8.22"
locust = "^2.25" # load testing
[tool.poetry.scripts]
kg = "kg.cli.main:app"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
[tool.ruff]
line-length = 100
target-version = "py311"
# Since ruff 0.2, rule selection lives under the `lint` table.
[tool.ruff.lint]
select = ["E", "F", "I", "W", "B", "UP", "N", "SIM"]
ignore = ["E501"]
[tool.black]
line-length = 100
target-version = ["py311"]
[tool.mypy]
python_version = "3.11"
strict = true
ignore_missing_imports = true
[tool.pytest.ini_options]
asyncio_mode = "auto"
addopts = "-ra --strict-markers --cov=src/kg --cov-report=term-missing"
testpaths = ["tests"]
3.2 .env.example
# ====== Runtime ======
ENV=local # local | dev | staging | prod
LOG_LEVEL=INFO
LOG_FORMAT=json # json | console
# ====== Neo4j ======
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=changeit
NEO4J_DATABASE=neo4j
NEO4J_POOL_SIZE=50
NEO4J_CONN_TIMEOUT=30
# ====== PostgreSQL ======
PG_DSN=postgresql://kg:changeit@localhost:5432/kg
# ====== Redis ======
REDIS_URL=redis://localhost:6379/0
REDIS_CACHE_TTL=300
# ====== Elasticsearch ======
ES_URL=http://localhost:9200
ES_USER=elastic
ES_PASSWORD=changeit
# ====== Milvus ======
MILVUS_HOST=localhost
MILVUS_PORT=19530
# ====== Kafka ======
KAFKA_BOOTSTRAP=localhost:9092
# ====== LLM ======
ANTHROPIC_API_KEY=sk-ant-xxxx
LLM_MODEL_EXTRACT=claude-opus-4-7
LLM_MODEL_KBQA=claude-opus-4-7
LLM_MAX_TOKENS=4096
LLM_TIMEOUT=60
LLM_DAILY_BUDGET_USD=200
# ====== Embedding ======
EMBED_MODEL=BAAI/bge-large-zh-v1.5
EMBED_DIM=1024
EMBED_BATCH=64
# ====== API ======
API_HOST=0.0.0.0
API_PORT=8080
API_WORKERS=4
API_CORS_ORIGINS=https://insight.example.com
JWT_SECRET=change-me
JWT_EXPIRE_MIN=60
RATE_LIMIT_PER_MIN=120
# ====== Observability ======
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_SERVICE_NAME=kg-api
PROMETHEUS_PORT=9100
# ====== Sources ======
NECIPS_API_KEY=
TYC_API_KEY=
QCC_API_KEY=
3.3 src/kg/core/config.py
"""Centralized configuration loaded from environment variables.

Use Pydantic Settings — typed, validated, with sensible defaults.
"""
from functools import lru_cache
from typing import Literal

from pydantic import Field, SecretStr
from pydantic_settings import BaseSettings, SettingsConfigDict


class Neo4jSettings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="NEO4J_")

    uri: str = "bolt://localhost:7687"
    user: str = "neo4j"
    password: SecretStr = SecretStr("changeit")
    database: str = "neo4j"
    pool_size: int = 50
    conn_timeout: int = 30


class LLMSettings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="LLM_")

    anthropic_api_key: SecretStr = Field(SecretStr(""), alias="ANTHROPIC_API_KEY")
    model_extract: str = "claude-opus-4-7"
    model_kbqa: str = "claude-opus-4-7"
    max_tokens: int = 4096
    timeout: int = 60
    daily_budget_usd: float = 200.0


class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        case_sensitive=False,
        extra="ignore",
    )

    env: Literal["local", "dev", "staging", "prod"] = "local"
    log_level: str = "INFO"
    log_format: Literal["json", "console"] = "json"

    neo4j: Neo4jSettings = Field(default_factory=Neo4jSettings)
    llm: LLMSettings = Field(default_factory=LLMSettings)

    pg_dsn: str = "postgresql://kg:changeit@localhost:5432/kg"
    redis_url: str = "redis://localhost:6379/0"
    es_url: str = "http://localhost:9200"
    milvus_host: str = "localhost"
    milvus_port: int = 19530

    api_host: str = "0.0.0.0"
    api_port: int = 8080
    api_cors_origins: str = "https://insight.example.com"  # comma-separated allow-list
    jwt_secret: SecretStr = SecretStr("change-me")
    jwt_expire_min: int = 60
    rate_limit_per_min: int = 120

    otel_endpoint: str = Field("", alias="OTEL_EXPORTER_OTLP_ENDPOINT")
    otel_service: str = Field("kg-api", alias="OTEL_SERVICE_NAME")


@lru_cache
def get_settings() -> Settings:
    return Settings()
3.4 src/kg/core/logger.py
"""Structured logging with structlog + JSON output for prod."""
import logging
import sys

import structlog

from kg.core.config import get_settings


def setup_logging() -> None:
    settings = get_settings()
    level = getattr(logging, settings.log_level.upper(), logging.INFO)
    timestamper = structlog.processors.TimeStamper(fmt="iso")
    shared_processors = [
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.StackInfoRenderer(),
        timestamper,
    ]
    if settings.log_format == "json":
        renderer = structlog.processors.JSONRenderer(ensure_ascii=False)
    else:
        renderer = structlog.dev.ConsoleRenderer(colors=True)
    structlog.configure(
        processors=shared_processors + [
            structlog.processors.format_exc_info,
            renderer,
        ],
        wrapper_class=structlog.make_filtering_bound_logger(level),
        logger_factory=structlog.PrintLoggerFactory(file=sys.stdout),
        cache_logger_on_first_use=True,
    )


def get_logger(name: str) -> structlog.stdlib.BoundLogger:
    return structlog.get_logger(name)
3.5 src/kg/store/neo4j_client.py
"""Async Neo4j driver wrapper with retry, tracing, and metrics."""
from __future__ import annotations

import asyncio
from contextlib import asynccontextmanager
from typing import Any, AsyncIterator

from neo4j import AsyncDriver, AsyncGraphDatabase
from neo4j.exceptions import ServiceUnavailable, TransientError
from prometheus_client import Counter, Histogram
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

from kg.core.config import get_settings
from kg.core.logger import get_logger

log = get_logger(__name__)

_query_total = Counter(
    "kg_neo4j_query_total", "Total Neo4j queries", ["operation", "status"]
)
_query_latency = Histogram(
    "kg_neo4j_query_seconds", "Neo4j query latency", ["operation"]
)


class Neo4jClient:
    _driver: AsyncDriver | None = None
    _lock = asyncio.Lock()

    @classmethod
    async def init(cls) -> None:
        if cls._driver is not None:
            return
        async with cls._lock:
            if cls._driver is not None:
                return
            s = get_settings().neo4j
            cls._driver = AsyncGraphDatabase.driver(
                s.uri,
                auth=(s.user, s.password.get_secret_value()),
                max_connection_pool_size=s.pool_size,
                connection_timeout=s.conn_timeout,
            )
            await cls._driver.verify_connectivity()
            log.info("neo4j_connected", uri=s.uri, db=s.database)

    @classmethod
    async def close(cls) -> None:
        if cls._driver:
            await cls._driver.close()
            cls._driver = None

    @classmethod
    @asynccontextmanager
    async def session(cls, *, write: bool = False) -> AsyncIterator[Any]:
        if cls._driver is None:
            await cls.init()
        s = get_settings().neo4j
        async with cls._driver.session(
            database=s.database,
            default_access_mode="WRITE" if write else "READ",
        ) as sess:
            yield sess

    @classmethod
    @retry(
        retry=retry_if_exception_type((TransientError, ServiceUnavailable)),
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=0.5, max=5),
        reraise=True,
    )
    async def execute_read(cls, cypher: str, params: dict | None = None) -> list[dict]:
        op = "read"
        with _query_latency.labels(op).time():
            try:
                async with cls.session(write=False) as sess:
                    result = await sess.run(cypher, params or {})
                    rows = [r.data() async for r in result]
                _query_total.labels(op, "ok").inc()
                return rows
            except Exception:
                _query_total.labels(op, "err").inc()
                raise

    @classmethod
    @retry(
        retry=retry_if_exception_type((TransientError, ServiceUnavailable)),
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=0.5, max=5),
        reraise=True,
    )
    async def execute_write(cls, cypher: str, params: dict | None = None) -> list[dict]:
        op = "write"
        with _query_latency.labels(op).time():
            try:
                async with cls.session(write=True) as sess:
                    result = await sess.run(cypher, params or {})
                    rows = [r.data() async for r in result]
                _query_total.labels(op, "ok").inc()
                return rows
            except Exception:
                _query_total.labels(op, "err").inc()
                raise

    @classmethod
    async def batch_write(
        cls, cypher: str, rows: list[dict], batch_size: int = 5000
    ) -> int:
        """UNWIND-based batch upsert."""
        total = 0
        for i in range(0, len(rows), batch_size):
            chunk = rows[i : i + batch_size]
            await cls.execute_write(cypher, {"rows": chunk})
            total += len(chunk)
            log.info("batch_written", count=len(chunk), total=total)
        return total
3.6 src/kg/ontology/entities.py
"""Pydantic models that mirror the ontology — the single source of truth for code.

Any data going into Neo4j must pass through these models first.
"""
from __future__ import annotations

from datetime import date, datetime
from enum import StrEnum
from typing import Annotated

from pydantic import BaseModel, ConfigDict, Field, StringConstraints

USCC_PATTERN = r"^[0-9A-HJ-NPQRTUWXY]{18}$"


class RegistrationStatus(StrEnum):
    IN_BUSINESS = "IN_BUSINESS"
    CANCELLED = "CANCELLED"
    REVOKED = "REVOKED"
    SUSPENDED = "SUSPENDED"
    LIQUIDATING = "LIQUIDATING"
    MIGRATED_OUT = "MIGRATED_OUT"


class EnterpriseType(StrEnum):
    LIMITED_LIABILITY = "LIMITED_LIABILITY"
    JOINT_STOCK = "JOINT_STOCK"
    WHOLLY_FOREIGN_OWNED = "WHOLLY_FOREIGN_OWNED"
    SINO_FOREIGN_JOINT_VENTURE = "SINO_FOREIGN_JOINT_VENTURE"
    PARTNERSHIP = "PARTNERSHIP"
    SOLE_PROPRIETORSHIP = "SOLE_PROPRIETORSHIP"
    INDIVIDUAL_BUSINESS = "INDIVIDUAL_BUSINESS"
    STATE_OWNED = "STATE_OWNED"
    COLLECTIVE = "COLLECTIVE"
    OTHER = "OTHER"


class EntityMeta(BaseModel):
    """Required metadata for every entity / relationship."""

    uuid: str
    source: str
    source_id: str | None = None
    source_record_url: str | None = None
    created_at: datetime
    updated_at: datetime
    confidence: float = Field(ge=0, le=1, default=1.0)
    status: str = "ACTIVE"
    version: int = 1
    merged_from: list[str] = Field(default_factory=list)
    extracted_by: str | None = None


class Enterprise(BaseModel):
    model_config = ConfigDict(extra="forbid")

    uuid: str
    unified_credit_code: Annotated[str, StringConstraints(pattern=USCC_PATTERN)]
    registration_no: str | None = None
    name: Annotated[str, StringConstraints(max_length=200)]
    aliases: list[str] = Field(default_factory=list)
    legal_representative_name: str | None = None
    registered_capital: float | None = Field(default=None, ge=0)
    paid_in_capital: float | None = Field(default=None, ge=0)
    capital_currency: str = "CNY"
    enterprise_type: EnterpriseType
    establishment_date: date
    business_term_start: date | None = None
    business_term_end: date | None = None
    registration_authority: str | None = None
    registration_status: RegistrationStatus
    industry_code: str
    industry_name: str | None = None
    business_scope: str | None = None
    email: str | None = None
    phone: str | None = None
    website: str | None = None
    is_listed: bool = False
    stock_code: str | None = None
    stock_exchange: str | None = None
    staff_size: str | None = None
    is_high_tech: bool = False
    is_specialized_new: bool = False
    credit_rating: str | None = None
    embedding: list[float] | None = None
    # NOTE: Pydantic v2 treats leading-underscore attributes as private, so the
    # metadata field is named `meta` here; it is still persisted as `_meta_*` props.
    meta: EntityMeta

    def to_cypher_props(self) -> dict:
        """Flatten for a Cypher SET clause; metadata keys get a `_meta_` prefix."""
        d = self.model_dump(exclude_none=True)
        meta = d.pop("meta", None) or {}
        for k, v in meta.items():
            d[f"_meta_{k}"] = v
        return d


class NaturalPerson(BaseModel):
    model_config = ConfigDict(extra="forbid")

    uuid: str
    person_hash: str
    name: str
    aliases: list[str] = Field(default_factory=list)
    gender: str | None = None
    birth_year: int | None = Field(default=None, ge=1900, le=2100)
    nationality: str | None = None
    id_card_tail: str | None = Field(default=None, pattern=r"^[0-9X]{4}$")
    is_pep: bool = False
    is_sanctioned: bool = False
    is_executed_dishonest: bool = False
    embedding: list[float] | None = None
    meta: EntityMeta

# ... other entities follow the same pattern
3.7 src/kg/api/main.py
"""FastAPI app entry point."""
from contextlib import asynccontextmanager

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from prometheus_client import make_asgi_app

from kg.api.middleware.auth import AuthMiddleware
from kg.api.middleware.logging import LoggingMiddleware
from kg.api.middleware.ratelimit import RateLimitMiddleware
from kg.api.routers import admin, entity, graph, health, kbqa, search
from kg.core.config import get_settings
from kg.core.logger import get_logger, setup_logging
from kg.core.tracing import setup_tracing
from kg.store.neo4j_client import Neo4jClient

log = get_logger(__name__)


@asynccontextmanager
async def lifespan(app: FastAPI):
    setup_logging()
    setup_tracing()
    await Neo4jClient.init()
    log.info("api_startup_complete")
    yield
    await Neo4jClient.close()
    log.info("api_shutdown_complete")


def create_app() -> FastAPI:
    settings = get_settings()
    app = FastAPI(
        title="Enterprise Insight KG API",
        version="0.1.0",
        lifespan=lifespan,
        docs_url="/docs" if settings.env != "prod" else None,
        redoc_url="/redoc" if settings.env != "prod" else None,
    )
    app.add_middleware(
        CORSMiddleware,
        # API_CORS_ORIGINS is a comma-separated allow-list (not the bind host).
        allow_origins=["*"] if settings.env == "local"
        else [o.strip() for o in settings.api_cors_origins.split(",") if o.strip()],
        allow_methods=["*"],
        allow_headers=["*"],
    )
    app.add_middleware(LoggingMiddleware)
    app.add_middleware(RateLimitMiddleware)
    app.add_middleware(AuthMiddleware)

    app.include_router(health.router, tags=["health"])
    app.include_router(entity.router, prefix="/api/v1/entities", tags=["entity"])
    app.include_router(graph.router, prefix="/api/v1/graph", tags=["graph"])
    app.include_router(search.router, prefix="/api/v1/search", tags=["search"])
    app.include_router(kbqa.router, prefix="/api/v1/kbqa", tags=["kbqa"])
    app.include_router(admin.router, prefix="/api/v1/admin", tags=["admin"])

    app.mount("/metrics", make_asgi_app())
    FastAPIInstrumentor.instrument_app(app)
    return app


app = create_app()
3.8 src/kg/api/routers/entity.py (example)
from typing import Annotated

from fastapi import APIRouter, Depends, HTTPException, Query

from kg.api.deps import get_current_user
from kg.api.schemas.entity import EnterpriseRead, EnterpriseSearchResult
from kg.store.repositories.enterprise_repo import EnterpriseRepository

router = APIRouter()


@router.get("/enterprise/{uscc}", response_model=EnterpriseRead)
async def get_enterprise(
    uscc: str,
    user: Annotated[dict, Depends(get_current_user)],
) -> EnterpriseRead:
    ent = await EnterpriseRepository.get_by_uscc(uscc)
    if not ent:
        raise HTTPException(status_code=404, detail="enterprise not found")
    return ent


@router.get("/enterprise", response_model=list[EnterpriseSearchResult])
async def search_enterprises(
    q: Annotated[str, Query(min_length=2, max_length=100)],
    limit: int = 20,
) -> list[EnterpriseSearchResult]:
    return await EnterpriseRepository.search_by_name(q, limit=limit)
3.9 docker-compose.yml (one-command local stack)
version: "3.9"
services:
  neo4j:
    image: neo4j:5.15-enterprise
    environment:
      NEO4J_AUTH: neo4j/changeit
      NEO4J_ACCEPT_LICENSE_AGREEMENT: "yes"
      NEO4J_PLUGINS: '["apoc", "graph-data-science"]'
      NEO4J_dbms_security_procedures_unrestricted: apoc.*,gds.*
      NEO4J_server_memory_heap_max__size: 4G
      NEO4J_server_memory_pagecache_size: 4G
    ports:
      - "7474:7474"
      - "7687:7687"
    volumes:
      - neo4j_data:/data
      - neo4j_logs:/logs
      - ./schema/seeds:/import
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: kg
      POSTGRES_PASSWORD: changeit
      POSTGRES_DB: kg
    ports: ["5432:5432"]
    volumes:
      - pg_data:/var/lib/postgresql/data
  redis:
    image: redis:7.2
    ports: ["6379:6379"]
  elasticsearch:
    image: elasticsearch:8.13.0
    environment:
      discovery.type: single-node
      xpack.security.enabled: "false"
      ES_JAVA_OPTS: -Xms2g -Xmx2g
    ports: ["9200:9200"]
  milvus:
    image: milvusdb/milvus:v2.4.0
    command: ["milvus", "run", "standalone"]
    environment:
      ETCD_USE_EMBED: "true"
      COMMON_STORAGETYPE: local
    ports: ["19530:19530"]
  kafka:
    image: bitnami/kafka:3.7
    environment:
      KAFKA_CFG_NODE_ID: 0
      KAFKA_CFG_PROCESS_ROLES: controller,broker
      KAFKA_CFG_LISTENERS: PLAINTEXT://:9092,CONTROLLER://:9093
      KAFKA_CFG_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_CFG_CONTROLLER_QUORUM_VOTERS: 0@kafka:9093
      KAFKA_CFG_CONTROLLER_LISTENER_NAMES: CONTROLLER
    ports: ["9092:9092"]
  airflow:
    image: apache/airflow:2.9.0
    environment:
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://kg:changeit@postgres:5432/airflow
    volumes:
      - ./airflow/dags:/opt/airflow/dags
    depends_on: [postgres]
    ports: ["8088:8080"]
    command: standalone
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    ports: ["9090:9090"]
  grafana:
    image: grafana/grafana:10.4.0
    ports: ["3000:3000"]
    volumes:
      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
volumes:
  neo4j_data:
  neo4j_logs:
  pg_data:
3.10 Dockerfile
# ===== builder =====
FROM python:3.11-slim AS builder
ENV PIP_NO_CACHE_DIR=1 POETRY_VERSION=1.8.2
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential curl git && rm -rf /var/lib/apt/lists/*
RUN pip install poetry==${POETRY_VERSION}
WORKDIR /app
COPY pyproject.toml poetry.lock ./
RUN poetry config virtualenvs.in-project true \
&& poetry install --only main --no-root --no-interaction
# ===== runtime =====
FROM python:3.11-slim AS runtime
# PYTHONPATH is needed because the builder installs deps with --no-root: the kg
# package itself is imported straight from /app/src.
ENV PATH="/app/.venv/bin:$PATH" PYTHONPATH=/app/src PYTHONUNBUFFERED=1
RUN apt-get update && apt-get install -y --no-install-recommends \
libgomp1 curl ca-certificates && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY --from=builder /app/.venv /app/.venv
COPY src /app/src
RUN groupadd -r kg && useradd -r -g kg kg && chown -R kg:kg /app
USER kg
EXPOSE 8080
HEALTHCHECK --interval=30s --timeout=10s --start-period=10s --retries=3 \
CMD curl -fsS http://localhost:8080/health || exit 1
CMD ["gunicorn", "kg.api.main:app", \
"-w", "4", "-k", "uvicorn.workers.UvicornWorker", \
"-b", "0.0.0.0:8080", "--access-logfile", "-"]
3.11 Makefile
.PHONY: install up down schema seed test lint fmt run api ingest bench
install:
	poetry install
	pre-commit install

up:
	docker compose up -d

down:
	docker compose down

schema:
	poetry run kg schema apply

seed:
	poetry run kg ingest seed

test:
	poetry run pytest -q

lint:
	poetry run ruff check src tests
	poetry run mypy src

fmt:
	poetry run ruff check --fix src tests
	poetry run black src tests

api:
	poetry run uvicorn kg.api.main:app --reload --port 8080

ingest:
	poetry run kg ingest run --source $(SOURCE)

bench:
	poetry run python scripts/benchmark_query.py
3.12 .github/workflows/ci.yml
name: CI
on:
  push:
    branches: [main, develop]
  pull_request:
env:
  POETRY_VERSION: 1.8.2
jobs:
  test:
    runs-on: ubuntu-22.04
    services:
      neo4j:
        image: neo4j:5.15
        env:
          NEO4J_AUTH: neo4j/changeit
        ports: ["7687:7687"]
        options: >-
          --health-cmd "wget -O - http://localhost:7474 || exit 1"
          --health-interval 10s --health-timeout 5s --health-retries 10
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install poetry==${POETRY_VERSION}
      - run: poetry install --no-interaction
      - run: poetry run ruff check src tests
      - run: poetry run mypy src
      - run: poetry run pytest -q --cov=src/kg --cov-report=xml
        env:
          NEO4J_URI: bolt://localhost:7687
          NEO4J_PASSWORD: changeit
      - uses: codecov/codecov-action@v4
  security:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - uses: aquasecurity/trivy-action@master
        with:
          scan-type: fs
          severity: CRITICAL,HIGH
          exit-code: 1
  build:
    needs: [test, security]
    runs-on: ubuntu-22.04
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ${{ secrets.REGISTRY }}
          username: ${{ secrets.REGISTRY_USER }}
          password: ${{ secrets.REGISTRY_PASS }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: |
            ${{ secrets.REGISTRY }}/kg-api:${{ github.sha }}
            ${{ secrets.REGISTRY }}/kg-api:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max
3.13 .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-json
      - id: check-added-large-files
        args: ["--maxkb=1024"]
      - id: detect-private-key
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.0
    hooks:
      - id: ruff
        args: [--fix]
  - repo: https://github.com/psf/black
    rev: 24.3.0
    hooks: [{ id: black }]
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.9.0
    hooks:
      - id: mypy
        additional_dependencies: [pydantic, types-redis]
4. CLI Command Reference
The `kg` CLI (built on Typer) is the day-to-day operations entry point:
# Schema management
kg schema apply # apply all changesets
kg schema status # show which changesets have been applied
kg schema rollback --to 005 # roll back to a given version
# Seed data loading
kg ingest seed --type industry # load the industry dictionary
kg ingest seed --type region # load administrative regions
kg ingest seed --type pep # load the PEP list
# ETL runs
kg ingest run --source necips --mode full # full load
kg ingest run --source necips --mode incr # incremental load
kg ingest run --source documents --path /data/docs
# Extraction
kg extract document --doc-id D123
kg extract batch --task-id T456
# Fusion
kg fusion run --entity-type Enterprise --since 2026-05-01
# Derived-relation recomputation
kg reasoning rebuild --rule actual_controls
# Quality checks
kg quality check --report-path ./reports/q1.json
# Evaluation
kg eval kbqa --testset tests/fixtures/kbqa_eval_set.jsonl
# Backup
kg backup neo4j --output /backup/$(date +%Y%m%d).dump
5. Testing Strategy
5.1 Layers
| Layer | Scope | Tools | Coverage target |
|---|---|---|---|
| Unit | functions/classes | pytest + mock | ≥ 80% |
| Integration | multi-module / external deps | pytest + docker-compose | 100% of critical paths |
| End-to-end | API → DB | pytest + httpx | main use cases covered |
| Performance | key queries | locust + scripts/benchmark | baselines met |
| Data | schema / extraction output | great-expectations | all rules pass |
5.2 Test Fixtures
tests/fixtures/enterprises_30.json: a complete (anonymized) graph of 30 real enterprises, containing:
- 5 listed companies
- 3 state-owned enterprises
- 10 private tech companies
- 3 dishonest-debtor companies
- 4 deregistered/revoked companies
- 5 multi-layer shareholding structures
The fixture data is used for:
- integration-test baselines
- building the KBQA evaluation set
- onboarding new team members
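A fixture like this deserves a sanity check before any suite relies on it. A sketch of such a check; the category keys below are illustrative, since the actual JSON layout is defined by the fixture file itself:

```python
# Expected composition of tests/fixtures/enterprises_30.json, per the list above.
# Key names are hypothetical; only the counts come from the document.
EXPECTED_COMPOSITION = {
    "listed": 5,
    "state_owned": 3,
    "private_tech": 10,
    "dishonest": 3,
    "deregistered_or_revoked": 4,
    "multi_layer_equity": 5,
}

def check_composition(counts: dict[str, int], total: int = 30) -> bool:
    """Verify that the per-category counts cover exactly `total` enterprises."""
    return sum(counts.values()) == total

assert check_composition(EXPECTED_COMPOSITION)
```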
5.3 tests/conftest.py
import asyncio
import json

import pytest

from kg.store.neo4j_client import Neo4jClient


@pytest.fixture(scope="session")
def event_loop():
    loop = asyncio.new_event_loop()
    yield loop
    loop.close()


@pytest.fixture(scope="session", autouse=True)
async def neo4j_setup():
    await Neo4jClient.init()
    # Wipe the database and rebuild the schema.
    await Neo4jClient.execute_write("MATCH (n) DETACH DELETE n")
    # Apply schema changesets.
    # ...
    yield
    await Neo4jClient.close()


@pytest.fixture
async def sample_enterprises():
    with open("tests/fixtures/enterprises_30.json") as f:
        return json.load(f)
6. Key Engineering Conventions
6.1 Error Handling
# src/kg/core/errors.py
class KGError(Exception):
    """Base class for all domain errors."""
    code: str = "KG_ERROR"
    http_status: int = 500

class ValidationError(KGError):
    code = "KG_VALIDATION"
    http_status = 400

class NotFoundError(KGError):
    code = "KG_NOT_FOUND"
    http_status = 404

class ExtractError(KGError):
    code = "KG_EXTRACT_FAILED"

class CypherUnsafeError(KGError):
    code = "KG_CYPHER_UNSAFE"
    http_status = 400

class LLMBudgetExceeded(KGError):
    code = "KG_LLM_BUDGET"
    http_status = 503
A global FastAPI exception handler converts these into a uniform JSON response:
from fastapi.responses import JSONResponse

@app.exception_handler(KGError)
async def kg_error_handler(request, exc: KGError):
    return JSONResponse(
        status_code=exc.http_status,
        content={"code": exc.code, "message": str(exc)},
    )
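The class-attribute convention means the handler needs no lookup table: each exception carries its own status and code. A framework-free check of that mapping (the classes are re-declared locally so no FastAPI is involved):

```python
class KGError(Exception):
    """Mirror of src/kg/core/errors.py, reproduced here for illustration."""
    code: str = "KG_ERROR"
    http_status: int = 500

class NotFoundError(KGError):
    code = "KG_NOT_FOUND"
    http_status = 404

def to_response(exc: KGError) -> tuple[int, dict]:
    # Same shape the FastAPI handler returns.
    return exc.http_status, {"code": exc.code, "message": str(exc)}

status, body = to_response(NotFoundError("enterprise not found"))
# status == 404, body["code"] == "KG_NOT_FOUND"
```

Subclasses that omit `http_status` (like ExtractError) inherit 500 from the base, so unexpected failures still map to a server error by default.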
6.2 ID and UUID Conventions
# src/kg/core/id_generator.py
import hashlib
import uuid

def new_uuid() -> str:
    return str(uuid.uuid4())

def person_hash(name: str, id_card_tail: str | None, birth_year: int | None) -> str:
    """Stable hash for natural person identity.

    Uses SHA-256 with stable input ordering. id_card_tail and birth_year may be
    None; None is serialized as an empty string for determinism.
    """
    s = f"{name}|{id_card_tail or ''}|{birth_year or ''}"
    return hashlib.sha256(s.encode("utf-8")).hexdigest()[:32]

def doc_hash(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()
6.3 Data Write Flow (the golden path)
raw data
↓ Pydantic validation (ontology models)
↓ business-rule validation (validators)
↓ entity linking (candidates → matching → master selection)
↓ staging (PostgreSQL, with review status)
↓ [optional] human review
↓ write to Neo4j (batched UNWIND MERGE)
↓ write to ES (search-index sync)
↓ write to Milvus (embedding ingestion)
↓ trigger derived-relation recomputation (if needed)
↓ emit domain events (Kafka)
Every step must be idempotent, audit-logged, and resumable from a checkpoint.
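The Neo4j step gets its idempotency from the `UNWIND … MERGE` pattern: merging on the business key means replays update rather than duplicate. A minimal sketch of building such a statement for `Neo4jClient.batch_write` (property names here are illustrative, not the final schema):

```python
def build_upsert_cypher(label: str, key: str, props: list[str]) -> str:
    """Render an idempotent batch-upsert statement.

    MERGE on the business key makes re-running the same batch safe:
    existing nodes are updated, new ones created, nothing duplicated.
    """
    set_clause = ", ".join(f"n.{p} = row.{p}" for p in props)
    return (
        f"UNWIND $rows AS row "
        f"MERGE (n:{label} {{{key}: row.{key}}}) "
        f"SET {set_clause}, n._meta_updated_at = datetime()"
    )

cypher = build_upsert_cypher(
    "Enterprise", "unified_credit_code", ["name", "registration_status"]
)
# The batch itself is passed as the $rows parameter, e.g.:
# await Neo4jClient.batch_write(cypher, rows=[{...}, {...}])
```

Label and key come from trusted code, never from user input; only the row values travel as parameters, which keeps this compatible with the parameterization rule in section 6.5.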
6.4 Performance Budgets (keep these in mind while writing code)
| Operation | Budget |
|---|---|
| Single API request, P95 | < 500 ms |
| 1-hop query | < 50 ms |
| 3-hop path query | < 200 ms |
| KBQA end-to-end | < 5 s |
| Full-text search | < 100 ms |
| Batch write (per 10k rows) | < 30 s |
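Budgets expressed as percentiles are easiest to enforce with a small helper in scripts/benchmark_query.py; a sketch using the nearest-rank definition (the helper itself is illustrative, only the budget table above is normative):

```python
import math

def percentile(samples_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=95 for the P95 latency."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct / 100 * len(ordered))  # nearest-rank definition
    return ordered[rank - 1]

def within_budget(samples_ms: list[float], budget_ms: float, pct: float = 95) -> bool:
    """True when the chosen percentile of observed latencies is under budget."""
    return percentile(samples_ms, pct) < budget_ms

# With 100 samples, P95 is the 95th-ranked value, so up to 5 slow
# outliers are tolerated before the budget check fails.
```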
6.5 Security Rules
- Never splice user input directly into Cypher (injection risk)
- All Cypher must be parameterized (`$param`)
- Cypher generated by the kbqa module must pass the validator before the executor runs it; the executor enforces READ-only mode, a timeout, and a LIMIT
- Sensitive fields (phone numbers, ID-card numbers) are hashed before storage
- All APIs require JWT authentication (except `/health` and `/metrics`)
- All mutating endpoints must produce audit logs
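A minimal sketch of the read-only gate described above. The real kbqa/text2cypher/validator.py would add parser-level checks; the clause deny-list and LIMIT policy here are assumptions:

```python
import re

# Clauses that mutate the graph or escape the read-only sandbox.
_FORBIDDEN = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|LOAD\s+CSV|CALL\s+dbms)\b",
    re.IGNORECASE,
)

def validate_readonly(cypher: str, max_limit: int = 100) -> str:
    """Reject write clauses and force a LIMIT onto LLM-generated Cypher."""
    if _FORBIDDEN.search(cypher):
        raise ValueError("unsafe Cypher: write or admin clause detected")
    if not re.search(r"\bRETURN\b", cypher, re.IGNORECASE):
        raise ValueError("query must RETURN something")
    if not re.search(r"\bLIMIT\s+\d+\b", cypher, re.IGNORECASE):
        cypher = f"{cypher.rstrip().rstrip(';')} LIMIT {max_limit}"
    return cypher

safe = validate_readonly("MATCH (e:Enterprise {name: $name}) RETURN e")
# safe now ends with "LIMIT 100"
```

A regex deny-list alone is not a complete defense (keywords can hide in strings or comments), which is why the executor additionally opens a READ-mode session with a server-side timeout.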
7. Week-One Development Tasks (immediately actionable)
| Day | Task | Deliverable |
|---|---|---|
| D1 | Repo init, Poetry, pre-commit | runnable empty skeleton |
| D2 | docker-compose up brings up the local stack | all middleware reachable |
| D3 | Neo4j client + Pydantic entity models | basic writes working |
| D4 | Schema DDL apply + seed data | industry/region/PEP data loaded |
| D5 | FastAPI skeleton + basic query API | /health and /enterprise/{uscc} callable |
| D6 | CI pipeline (test + lint + build) | PRs trigger automatic checks |
| D7 | 30-enterprise fixture + integration tests | baseline regression suite |
Once these are done, work can move on to data-source ingestion and extraction.
Next document: 03_数据源接入详设.md