Lesson 29: The Document Model and DocumentLoader


Lesson objectives

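A close reading of the document data model in LangChain.js: the pageContent + metadata design of the Document class, the BaseDocumentLoader abstraction, the BaseDocumentTransformer transformer, the RecordManager incremental indexing system, and the query IR defined in structured_query.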


29.1 Document: the document data model

Document is the unified representation of all document data in a RAG system.

Source location: libs/langchain-core/src/documents/document.ts

export interface DocumentInput<
  Metadata extends Record<string, any> = Record<string, any>,
> {
  pageContent: string;
  metadata?: Metadata;
  id?: string;
}

export class Document<
  Metadata extends Record<string, any> = Record<string, any>,
> implements DocumentInput, DocumentInterface {
  pageContent: string;
  metadata: Metadata;
  id?: string;

  constructor(fields: DocumentInput<Metadata>) {
    this.pageContent = fields.pageContent !== undefined
      ? fields.pageContent.toString()
      : "";
    this.metadata = fields.metadata ?? ({} as Metadata);
    this.id = fields.id;
  }
}

Core design

  • pageContent: the document's text content (always a string)
  • metadata: structured metadata (generic, defaults to Record<string, any>)
  • id: an optional unique identifier (UUID format recommended)

Usage example

import { Document } from "@langchain/core/documents";

const doc = new Document({
  pageContent: "LangChain.js 是一个用于构建 LLM 应用的框架。",
  metadata: {
    source: "docs/introduction.md",
    chapter: "概述",
    language: "zh",
    lastUpdated: "2024-12-01",
  },
  id: "doc-001",
});

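Why does metadata matter so much? In a RAG system, metadata is used to:

  • Trace where a document came from (file name, URL, page number)
  • Filter at retrieval time (by date, category, or language)
  • Support incremental indexing (a hash is used to detect whether a document has changed)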
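
The filtering use case can be illustrated with a minimal sketch. It does not call LangChain: SimpleDoc is a local stand-in for the Document shape, and filterByMetadata is a hypothetical helper introduced only for this illustration.

```typescript
// Local stand-in for the Document shape shown earlier (not imported from LangChain).
interface SimpleDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical helper: keep only documents whose metadata matches
// every key/value pair in `criteria` (strict equality).
function filterByMetadata(
  docs: SimpleDoc[],
  criteria: Record<string, unknown>
): SimpleDoc[] {
  return docs.filter((doc) =>
    Object.entries(criteria).every(([key, value]) => doc.metadata[key] === value)
  );
}

const sampleDocs: SimpleDoc[] = [
  { pageContent: "Intro", metadata: { source: "docs/introduction.md", language: "zh" } },
  { pageContent: "Guide", metadata: { source: "docs/guide.md", language: "en" } },
  { pageContent: "FAQ", metadata: { source: "docs/faq.md", language: "zh" } },
];

// Select the Chinese-language documents and list their sources.
const zhDocs = filterByMetadata(sampleDocs, { language: "zh" });
console.log(zhDocs.map((d) => d.metadata.source));
```

In a real pipeline the vector store usually applies such filters itself; the point here is only that consistent metadata keys are what make the filter possible.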

29.2 BaseDocumentLoader: the document loader

Loaders are responsible for reading documents from various data sources.

Source location: libs/langchain-core/src/document_loaders/base.ts

export interface DocumentLoader {
  load(): Promise<Document[]>;
}

export abstract class BaseDocumentLoader implements DocumentLoader {
  abstract load(): Promise<Document[]>;
}

A minimal abstraction: a loader only needs to implement a single load() method that returns Document[].

Custom DocumentLoader example

import { BaseDocumentLoader } from "@langchain/core/document_loaders/base";
import { Document } from "@langchain/core/documents";

class JsonDocumentLoader extends BaseDocumentLoader {
  constructor(private filePath: string) {
    super();
  }

  async load(): Promise<Document[]> {
    const fs = await import("fs/promises");
    const content = await fs.readFile(this.filePath, "utf-8");
    const data = JSON.parse(content);

    return data.map((item: any, index: number) =>
      new Document({
        pageContent: item.text || item.content || "",
        metadata: {
          source: this.filePath,
          index,
          ...item.metadata,
        },
      })
    );
  }
}

// Usage
const loader = new JsonDocumentLoader("./data/articles.json");
const docs = await loader.load();
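
The loader above assumes the JSON file holds an array of objects with a text or content field. A defensive variant of that mapping step might look like the sketch below. DocLike and toDocLikes are hypothetical names used only here, and unlike the loader above, source and index are written last so per-item metadata cannot overwrite them.

```typescript
// Local stand-in for the Document input shape (not imported from LangChain).
interface DocLike {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical helper: turn parsed JSON of unknown shape into document-like records.
// Accepts a single object or an array, and skips entries without usable text.
function toDocLikes(parsed: unknown, source: string): DocLike[] {
  const items: unknown[] = Array.isArray(parsed) ? parsed : [parsed];
  const result: DocLike[] = [];
  items.forEach((item, index) => {
    if (typeof item !== "object" || item === null) return; // ignore primitives
    const record = item as Record<string, unknown>;
    const text = record.text || record.content;
    if (typeof text !== "string" || text.length === 0) return; // no text, skip
    const extra =
      typeof record.metadata === "object" && record.metadata !== null
        ? (record.metadata as Record<string, unknown>)
        : {};
    // `index` is the position in the original input, so skipped items leave gaps.
    result.push({ pageContent: text, metadata: { ...extra, source, index } });
  });
  return result;
}

const parsedSample = JSON.parse(
  '[{"text":"a","metadata":{"tag":"x"}},{"content":"b"},{"title":"no text"},42]'
);
console.log(toDocLikes(parsedSample, "articles.json"));
```

Whether to skip malformed entries silently or throw is a design choice; skipping keeps one bad record from failing a whole load, at the cost of hiding data problems.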

29.3 BaseDocumentTransformer: the document transformer

Transformers post-process documents: cleaning, filtering, and enriching them.

Source location: libs/langchain-core/src/documents/transformers.ts

export abstract class BaseDocumentTransformer<
  RunInput extends DocumentInterface[] = DocumentInterface[],
  RunOutput extends DocumentInterface[] = DocumentInterface[],
> extends Runnable<RunInput, RunOutput> {
  lc_namespace = ["langchain_core", "documents", "transformers"];

  abstract transformDocuments(documents: RunInput): Promise<RunOutput>;

  invoke(input: RunInput, _options?: BaseCallbackConfig): Promise<RunOutput> {
    return this.transformDocuments(input);
  }
}

Key design point: BaseDocumentTransformer extends Runnable, so it can be dropped directly into a pipe() chain.
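
What chaining buys you can be shown without the library. The sketch below is an illustration only: TransformStep and chainSteps are hypothetical stand-ins that mimic the effect of piping one transformer into the next, not LangChain's Runnable implementation.

```typescript
// Local stand-ins; the real classes live in @langchain/core and are not imported here.
interface PipeDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}
type TransformStep = (docs: PipeDoc[]) => Promise<PipeDoc[]>;

// Hypothetical helper that mimics sequential chaining:
// the output of each step becomes the input of the next.
function chainSteps(...steps: TransformStep[]): TransformStep {
  return async (docs) => {
    let current = docs;
    for (const step of steps) {
      current = await step(current);
    }
    return current;
  };
}

// Step 1: drop documents whose content is empty or whitespace only.
const dropEmpty: TransformStep = async (docs) =>
  docs.filter((d) => d.pageContent.trim().length > 0);

// Step 2: record the character count in metadata.
const addLength: TransformStep = async (docs) =>
  docs.map((d) => ({
    ...d,
    metadata: { ...d.metadata, charCount: d.pageContent.length },
  }));

const cleanAndMeasure = chainSteps(dropEmpty, addLength);

cleanAndMeasure([
  { pageContent: "hello", metadata: {} },
  { pageContent: "   ", metadata: {} },
]).then((out) => console.log(out));
```

With the real classes, the same composition would be expressed by piping one transformer instance into another, because each one is a Runnable.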

MappingDocumentTransformer

A convenience base class for one-to-one transformations:

export abstract class MappingDocumentTransformer extends BaseDocumentTransformer {
  async transformDocuments(documents: DocumentInterface[]): Promise<DocumentInterface[]> {
    const newDocuments = [];
    for (const document of documents) {
      const transformedDocument = await this._transformDocument(document);
      newDocuments.push(transformedDocument);
    }
    return newDocuments;
  }

  abstract _transformDocument(document: DocumentInterface): Promise<DocumentInterface>;
}

Custom transformer example

import { Document, MappingDocumentTransformer } from "@langchain/core/documents";
import type { DocumentInterface } from "@langchain/core/documents";

class MetadataEnricher extends MappingDocumentTransformer {
  async _transformDocument(doc: DocumentInterface): Promise<DocumentInterface> {
    return new Document({
      pageContent: doc.pageContent,
      metadata: {
        ...doc.metadata,
        charCount: doc.pageContent.length,
        wordCount: doc.pageContent.split(/\s+/).filter(Boolean).length,
        hasCode: doc.pageContent.includes("```"),
      },
    });
  }
}

29.4 RecordManager: the incremental indexing system

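In a large-scale RAG system, the vector database should be updated incrementally rather than rebuilt from scratch every time. RecordManager makes this possible by recording document hashes, which enables deduplication and incremental updates.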

Source location: libs/langchain-core/src/indexing/record_manager.ts

export interface RecordManagerInterface {
  createSchema(): Promise<void>;
  getTime(): Promise<number>;
  update(keys: string[], updateOptions: UpdateOptions): Promise<void>;
  exists(keys: string[]): Promise<boolean[]>;
  listKeys(options: ListKeyOptions): Promise<string[]>;
  deleteKeys(keys: string[]): Promise<void>;
}

export abstract class RecordManager extends Serializable implements RecordManagerInterface {
  lc_namespace = ["langchain", "recordmanagers"];
  abstract createSchema(): Promise<void>;
  abstract getTime(): Promise<number>;
  abstract update(keys: string[], updateOptions?: UpdateOptions): Promise<void>;
  abstract exists(keys: string[]): Promise<boolean[]>;
  abstract listKeys(options?: ListKeyOptions): Promise<string[]>;
  abstract deleteKeys(keys: string[]): Promise<void>;
}

HashedDocument: computing the document hash

Source location: libs/langchain-core/src/indexing/base.ts:41

export class _HashedDocument implements HashedDocumentInterface {
  uid: string;
  hash_?: string;
  contentHash?: string;
  metadataHash?: string;

  calculateHashes(): void {
    const contentHash = this._hashStringToUUID(this.pageContent);
    const metadataHash = this._hashNestedDictToUUID(this.metadata);
    this.contentHash = contentHash;
    this.metadataHash = metadataHash;
    this.hash_ = this._hashStringToUUID(this.contentHash + this.metadataHash);
  }
}

The hash is determined jointly by pageContent and metadata. A change to either one produces a new hash and triggers re-indexing.
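
The idea of combining a content hash with a metadata hash can be sketched as follows. This is not the library's algorithm: the toy FNV-1a hash, stableStringify, and fingerprint are assumptions chosen to keep the example dependency-free. The sorted-key serialization is there so that metadata key order does not affect the result.

```typescript
// Toy 32-bit FNV-1a hash over UTF-16 code units, used only to keep the sketch
// dependency-free. It is NOT the hash that _HashedDocument uses.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Serialize with sorted keys so that key order does not change the hash.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(",")}]`;
  }
  if (typeof value === "object" && value !== null) {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`)
      .join(",");
    return `{${body}}`;
  }
  return String(JSON.stringify(value));
}

// Hypothetical fingerprint: hash the content and the metadata separately,
// then hash the concatenation of the two, mirroring the structure of calculateHashes().
function fingerprint(pageContent: string, metadata: Record<string, unknown>): string {
  const contentHash = fnv1a(pageContent);
  const metadataHash = fnv1a(stableStringify(metadata));
  return fnv1a(contentHash + metadataHash);
}

const fpOriginal = fingerprint("hello", { source: "a.md", page: 1 });
const fpReordered = fingerprint("hello", { page: 1, source: "a.md" });
const fpNewMetadata = fingerprint("hello", { source: "a.md", page: 2 });
console.log(fpOriginal === fpReordered, fpOriginal === fpNewMetadata);
```

A practical consequence is that volatile metadata, such as a fetch timestamp, makes every document look changed on every run.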


29.5 index(): the incremental indexing function

Source location: libs/langchain-core/src/indexing/base.ts:263

// Abridged excerpt: batching and the declarations of docsToIndex, docsToUpdate,
// uids, and the result counters are omitted.
export async function index(args: IndexArgs): Promise<IndexingResult> {
  const { docsSource, recordManager, vectorStore, options } = args;
  const { batchSize = 100, cleanup, sourceIdKey, forceUpdate = false } = options ?? {};

  const docs = _isBaseDocumentLoader(docsSource) ? await docsSource.load() : docsSource;
  const indexStartDt = await recordManager.getTime();

  for (const batch of batches) {
    const hashedDocs = _deduplicateInOrder(
      batch.map(doc => _HashedDocument.fromDocument(doc))
    );
    const batchExists = await recordManager.exists(hashedDocs.map(d => d.uid));

    // Skip documents that already exist and have not changed
    hashedDocs.forEach((hashedDoc, i) => {
      if (batchExists[i] && !forceUpdate) {
        docsToUpdate.push(hashedDoc.uid);  // only refresh the timestamp
        return;
      }
      docsToIndex.push(hashedDoc.toDocument());  // needs to be indexed
    });

    // Write to the vector store
    if (docsToIndex.length > 0) {
      await vectorStore.addDocuments(docsToIndex, { ids: uids });
    }
  }

  // cleanup: "full" or "incremental"
  if (cleanup === "full") {
    // Delete all old documents that did not appear in this indexing run
  }

  return { numAdded, numDeleted, numUpdated, numSkipped };
}
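
The skip-or-index decision and the cleanup option can be simulated in memory. The sketch below is a simplified model under stated assumptions, not the library code: ToyRecords, toyIndex, and Entry are hypothetical names, each uid stands in for a document hash, every call is treated as a single batch, and "incremental" cleanup is approximated as "delete stale entries that share a source with the documents just indexed".

```typescript
type CleanupMode = "incremental" | "full" | undefined;

interface Entry {
  uid: string; // stand-in for the document hash
  sourceId: string; // stand-in for metadata[sourceIdKey]
}

// In-memory stand-in for a record manager: uid -> { sourceId, updatedAt }.
class ToyRecords {
  private rows = new Map<string, { sourceId: string; updatedAt: number }>();
  private clock = 0;

  tick(): number {
    return ++this.clock;
  }
  has(uid: string): boolean {
    return this.rows.has(uid);
  }
  touch(entry: Entry, time: number): void {
    this.rows.set(entry.uid, { sourceId: entry.sourceId, updatedAt: time });
  }
  // uids last updated before `before`, optionally restricted to some sources
  stale(before: number, sources?: Set<string>): string[] {
    return Array.from(this.rows.entries())
      .filter(([, row]) => row.updatedAt < before && (!sources || sources.has(row.sourceId)))
      .map(([uid]) => uid);
  }
  remove(uids: string[]): void {
    uids.forEach((uid) => this.rows.delete(uid));
  }
  keys(): string[] {
    return Array.from(this.rows.keys()).sort();
  }
}

function toyIndex(entries: Entry[], records: ToyRecords, cleanup: CleanupMode) {
  const startedAt = records.tick();
  let numAdded = 0;
  let numSkipped = 0;
  let numDeleted = 0;
  for (const entry of entries) {
    if (records.has(entry.uid)) numSkipped++; // unchanged: only refresh the timestamp
    else numAdded++; // new or changed: would be written to the vector store
    records.touch(entry, startedAt);
  }
  if (cleanup !== undefined) {
    const sources =
      cleanup === "incremental" ? new Set(entries.map((e) => e.sourceId)) : undefined;
    const staleUids = records.stale(startedAt, sources);
    records.remove(staleUids);
    numDeleted = staleUids.length;
  }
  return { numAdded, numSkipped, numDeleted };
}

const toyRecords = new ToyRecords();
const run1 = toyIndex(
  [{ uid: "a1", sourceId: "A" }, { uid: "b1", sourceId: "B" }],
  toyRecords,
  undefined
);
const run2 = toyIndex([{ uid: "a2", sourceId: "A" }], toyRecords, "incremental");
const run3 = toyIndex([{ uid: "a2", sourceId: "A" }], toyRecords, "full");
console.log(run1, run2, run3, toyRecords.keys());
```

In this model the second run replaces a1 with a2 but leaves source B untouched, while the third run with "full" also removes b1 because it was not part of that run.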

Cleanup modes

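| Mode | Behavior | User experience |
| --- | --- | --- |
| undefined | Deletes nothing | Stale documents may remain |
| "incremental" | Deletes old documents from the same source while indexing | Minimizes duplicates |
| "full" | After indexing completes, deletes all old documents that did not appear in this run | Duplicates may be visible while indexing is in progress |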

29.6 Structured Query: the structured query IR

The structured_query module defines the intermediate representation (IR) used by the self-query retriever.

Source location: libs/langchain-core/src/structured_query/ir.ts

// Comparison operators
export type Comparator = "eq" | "ne" | "lt" | "gt" | "lte" | "gte";

// Logical operators
export type Operator = "and" | "or" | "not";

// Comparison expression
export class Comparison extends FilterDirective {
  exprName = "Comparison" as const;
  constructor(
    public comparator: Comparator,
    public attribute: string,
    public value: string | number
  ) { super(); }
}

// Logical operation expression
export class Operation extends FilterDirective {
  exprName = "Operation" as const;
  constructor(
    public operator: Operator,
    public args?: FilterDirective[]
  ) { super(); }
}

// Structured query
export class StructuredQuery extends Expression {
  exprName = "StructuredQuery" as const;
  constructor(
    public query: string,
    public filter?: FilterDirective
  ) { super(); }
}

Visitor pattern: each vector store implements its own Visitor, which translates the IR into that store's native query syntax.

// Usage example
const query = new StructuredQuery(
  "关于 TypeScript 的文档",
  new Operation("and", [
    new Comparison("eq", "language", "zh"),
    new Comparison("gte", "year", 2024),
  ])
);
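
To make the Visitor idea concrete, here is a minimal sketch that translates a filter like the one above into a MongoDB-style filter object. It uses local copies of the IR shapes rather than the LangChain classes, and MongoStyleTranslator is a hypothetical name. The target syntax ($eq, $and, $nor) is an assumption chosen for illustration; each real store defines its own.

```typescript
// Minimal local copies of the IR shapes (not imported from LangChain).
type Comparator = "eq" | "ne" | "lt" | "gt" | "lte" | "gte";
type Operator = "and" | "or" | "not";

interface ComparisonNode {
  kind: "comparison";
  comparator: Comparator;
  attribute: string;
  value: string | number;
}
interface OperationNode {
  kind: "operation";
  operator: Operator;
  args: FilterNode[];
}
type FilterNode = ComparisonNode | OperationNode;
type BackendFilter = Record<string, unknown>;

// Hypothetical visitor: one method per node type, plus a dispatching visit().
class MongoStyleTranslator {
  visitComparison(node: ComparisonNode): BackendFilter {
    // { attribute: { $eq: value } }
    return { [node.attribute]: { [`$${node.comparator}`]: node.value } };
  }

  visitOperation(node: OperationNode): BackendFilter {
    const translated = node.args.map((arg) => this.visit(arg));
    // "not" is expressed with $nor, which matches when none of the clauses hold.
    if (node.operator === "not") return { $nor: translated };
    return { [`$${node.operator}`]: translated };
  }

  visit(node: FilterNode): BackendFilter {
    return node.kind === "comparison"
      ? this.visitComparison(node)
      : this.visitOperation(node);
  }
}

const exampleFilter: FilterNode = {
  kind: "operation",
  operator: "and",
  args: [
    { kind: "comparison", comparator: "eq", attribute: "language", value: "zh" },
    { kind: "comparison", comparator: "gte", attribute: "year", value: 2024 },
  ],
};

const translatedFilter = new MongoStyleTranslator().visit(exampleFilter);
console.log(JSON.stringify(translatedFilter));
// prints {"$and":[{"language":{"$eq":"zh"}},{"year":{"$gte":2024}}]}
```

Keeping the IR independent of any backend means the query-construction step has to emit only one format, and each store adds support by supplying its own translator.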

29.7 Hands-on exercise: a custom DocumentLoader with incremental indexing

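import { Document } from "@langchain/core/documents";
import { BaseDocumentLoader } from "@langchain/core/document_loaders/base";
import { RecordManager, index } from "@langchain/core/indexing";
import { VectorStore } from "@langchain/core/vectorstores";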

// Custom loader: load documents from a JSON API
class ApiDocumentLoader extends BaseDocumentLoader {
  constructor(private apiUrl: string) { super(); }

  async load(): Promise<Document[]> {
    const response = await fetch(this.apiUrl);
    const items = await response.json();

    return items.map((item: any) =>
      new Document({
        pageContent: item.content,
        metadata: {
          source: this.apiUrl,
          title: item.title,
          author: item.author,
          publishedAt: item.publishedAt,
        },
        id: item.id,
      })
    );
  }
}

// Incremental indexing workflow
async function indexDocuments(
  loader: BaseDocumentLoader,
  recordManager: RecordManager,
  vectorStore: VectorStore
) {
  const result = await index({
    docsSource: loader,
    recordManager,
    vectorStore,
    options: {
      cleanup: "incremental",
      sourceIdKey: "source",
      batchSize: 50,
    },
  });

  console.log(`Added: ${result.numAdded}`);
  console.log(`Updated: ${result.numUpdated}`);
  console.log(`Skipped: ${result.numSkipped}`);
  console.log(`Deleted: ${result.numDeleted}`);
}

29.8 Source reading roadmap

| Priority | File | Focus |
| --- | --- | --- |
| P0 | langchain-core/src/documents/document.ts | Document class: pageContent + metadata |
| P0 | langchain-core/src/document_loaders/base.ts | BaseDocumentLoader: the loader interface |
| P1 | langchain-core/src/documents/transformers.ts | BaseDocumentTransformer, MappingDocumentTransformer |
| P1 | langchain-core/src/indexing/base.ts:263 | index() function: core incremental indexing logic |
| P1 | langchain-core/src/indexing/record_manager.ts | RecordManagerInterface: the record manager interface |
| P2 | langchain-core/src/indexing/base.ts:41 | _HashedDocument: document hash computation |
| P2 | langchain-core/src/structured_query/ir.ts | Comparison, Operation, StructuredQuery: the query IR |

Lesson takeaways

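| Level | What you should master |
| --- | --- |
| 🟢 Basic | Know the Document data structure (pageContent + metadata); understand what loaders do |
| 🔵 Intermediate | Understand the BaseDocumentLoader abstraction and be able to write a custom loader |
| 🟡 Advanced | Master document post-processing with BaseDocumentTransformer and MappingDocumentTransformer |
| 🟠 Senior | Analyze the complete mechanism by which RecordManager and the index() function implement incremental indexing |
| 🔴 Architect | Design a large-scale document ingestion pipeline: deduplication, versioning, incremental sync, Structured Query IR |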

Next lesson preview

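Lesson 30 moves on to text splitting and Embeddings: understanding TextSplitter chunking strategies and the Embeddings vectorization interface, which completes a key stage of the RAG pipeline.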