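Course objectives
A close reading of the LangChain.js document data model: the pageContent + metadata design of the Document class, the BaseDocumentLoader loader abstraction, the BaseDocumentTransformer transformer, the RecordManager incremental indexing system, and the query IR defined in structured_query.
29.1 Document: the document data model
Document is the unified representation of all document data in a RAG system.
Source location: libs/langchain-core/src/documents/document.ts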
export interface DocumentInput<
Metadata extends Record<string, any> = Record<string, any>,
> {
pageContent: string;
metadata?: Metadata;
id?: string;
}
export class Document<
Metadata extends Record<string, any> = Record<string, any>,
> implements DocumentInput, DocumentInterface {
pageContent: string;
metadata: Metadata;
id?: string;
constructor(fields: DocumentInput<Metadata>) {
this.pageContent = fields.pageContent !== undefined
? fields.pageContent.toString()
: "";
this.metadata = fields.metadata ?? ({} as Metadata);
this.id = fields.id;
}
}
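Core design:
- pageContent: the document's text content (always a string; the constructor calls toString() on it)
- metadata: structured metadata (generic, defaulting to Record<string, any>)
- id: an optional unique identifier (UUID format recommended)
Usage example: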
import { Document } from "@langchain/core/documents";
const doc = new Document({
pageContent: "LangChain.js 是一个用于构建 LLM 应用的框架。",
metadata: {
source: "docs/introduction.md",
chapter: "概述",
language: "zh",
lastUpdated: "2024-12-01",
},
id: "doc-001",
});
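Why does metadata matter so much? In a RAG system, metadata is used to:
- Trace where a document came from (file name, URL, page number)
- Filter at retrieval time (by date, category, or language)
- Support incremental indexing (a hash tells whether a document has changed)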
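As a rough illustration of the filtering use case above, the sketch below selects documents by exact metadata match. `SimpleDoc` and `filterByMetadata` are hypothetical names introduced here as local stand-ins; they are not part of LangChain.js, and real vector stores apply filters on the server side.

```typescript
// Minimal stand-in for the Document shape described above (not the library class).
interface SimpleDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Keep only documents whose metadata equals every key/value pair in `where`.
function filterByMetadata(
  docs: SimpleDoc[],
  where: Record<string, unknown>,
): SimpleDoc[] {
  return docs.filter((doc) =>
    Object.entries(where).every(([key, value]) => doc.metadata[key] === value),
  );
}

const sampleDocs: SimpleDoc[] = [
  { pageContent: "Intro", metadata: { source: "docs/introduction.md", language: "zh" } },
  { pageContent: "Setup", metadata: { source: "docs/setup.md", language: "en" } },
];

// Select the Chinese-language documents and list their sources.
const zhDocs = filterByMetadata(sampleDocs, { language: "zh" });
console.log(zhDocs.map((doc) => doc.metadata.source));
```

Exact equality is the simplest possible predicate; range filters such as "published after a date" need the comparison operators covered in the structured query section.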
29.2 BaseDocumentLoader: document loaders
Loaders read documents from a variety of data sources.
Source location: libs/langchain-core/src/document_loaders/base.ts
export interface DocumentLoader {
load(): Promise<Document[]>;
}
export abstract class BaseDocumentLoader implements DocumentLoader {
abstract load(): Promise<Document[]>;
}
A minimal abstraction: a loader only has to implement a single load() method that returns Document[].
Custom DocumentLoader example
import { BaseDocumentLoader } from "@langchain/core/document_loaders/base";
import { Document } from "@langchain/core/documents";
class JsonDocumentLoader extends BaseDocumentLoader {
constructor(private filePath: string) {
super();
}
async load(): Promise<Document[]> {
const fs = await import("fs/promises");
const content = await fs.readFile(this.filePath, "utf-8");
const data = JSON.parse(content);
return data.map((item: any, index: number) =>
new Document({
pageContent: item.text || item.content || "",
metadata: {
source: this.filePath,
index,
...item.metadata,
},
})
);
}
}
// Usage
const loader = new JsonDocumentLoader("./data/articles.json");
const docs = await loader.load();
29.3 BaseDocumentTransformer: document transformers
Transformers post-process documents: cleaning, filtering, and enriching them.
Source location: libs/langchain-core/src/documents/transformers.ts
export abstract class BaseDocumentTransformer<
RunInput extends DocumentInterface[] = DocumentInterface[],
RunOutput extends DocumentInterface[] = DocumentInterface[],
> extends Runnable<RunInput, RunOutput> {
lc_namespace = ["langchain_core", "documents", "transformers"];
abstract transformDocuments(documents: RunInput): Promise<RunOutput>;
invoke(input: RunInput, _options?: BaseCallbackConfig): Promise<RunOutput> {
return this.transformDocuments(input);
}
}
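Key design point: BaseDocumentTransformer extends Runnable, and its invoke() simply delegates to transformDocuments(), so a transformer can be embedded directly in a pipe() chain.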
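To show why a uniform invoke() makes composition easy, here is a minimal sketch that does not use the library. `Step`, `pipeSteps`, `dropBlank`, and `joinContent` are hypothetical names; `Step` is a stripped-down stand-in for Runnable that models only invoke().

```typescript
// Hypothetical, stripped-down stand-in for Runnable: only invoke() is modeled.
interface Step<I, O> {
  invoke(input: I): Promise<O>;
}

// Compose two steps so the output of the first feeds the second, like pipe().
function pipeSteps<A, B, C>(first: Step<A, B>, second: Step<B, C>): Step<A, C> {
  return {
    invoke: async (input: A): Promise<C> => second.invoke(await first.invoke(input)),
  };
}

interface SimpleDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Transformer-like step: drop documents whose content is blank.
const dropBlank: Step<SimpleDoc[], SimpleDoc[]> = {
  invoke: async (docs) => docs.filter((doc) => doc.pageContent.trim().length > 0),
};

// Downstream step: join the surviving documents into one context string.
const joinContent: Step<SimpleDoc[], string> = {
  invoke: async (docs) => docs.map((doc) => doc.pageContent).join("\n"),
};

const cleaningChain = pipeSteps(dropBlank, joinContent);

cleaningChain
  .invoke([
    { pageContent: "first", metadata: {} },
    { pageContent: "   ", metadata: {} },
    { pageContent: "second", metadata: {} },
  ])
  .then((context) => console.log(context));
```

Because every step exposes the same invoke() shape, a transformer can sit anywhere in the chain without the neighboring steps knowing what it does.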
MappingDocumentTransformer
A convenience base class for one-to-one transformations:
export abstract class MappingDocumentTransformer extends BaseDocumentTransformer {
async transformDocuments(documents: DocumentInterface[]): Promise<DocumentInterface[]> {
const newDocuments = [];
for (const document of documents) {
const transformedDocument = await this._transformDocument(document);
newDocuments.push(transformedDocument);
}
return newDocuments;
}
abstract _transformDocument(document: DocumentInterface): Promise<DocumentInterface>;
}
Custom transformer example
import {
Document,
type DocumentInterface,
MappingDocumentTransformer,
} from "@langchain/core/documents";
class MetadataEnricher extends MappingDocumentTransformer {
async _transformDocument(doc: DocumentInterface): Promise<DocumentInterface> {
return new Document({
pageContent: doc.pageContent,
metadata: {
...doc.metadata,
charCount: doc.pageContent.length,
// filter(Boolean) drops empty tokens, so blank content counts as 0 words
wordCount: doc.pageContent.split(/\s+/).filter(Boolean).length,
hasCode: doc.pageContent.includes("```"),
},
});
}
}
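29.4 RecordManager: the incremental indexing system
In a large-scale RAG system, the vector store should be updated incrementally rather than rebuilt in full on every run. RecordManager records document hashes to deduplicate documents and drive incremental updates.
Source location: libs/langchain-core/src/indexing/record_manager.ts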
export interface RecordManagerInterface {
createSchema(): Promise<void>;
getTime(): Promise<number>;
update(keys: string[], updateOptions: UpdateOptions): Promise<void>;
exists(keys: string[]): Promise<boolean[]>;
listKeys(options: ListKeyOptions): Promise<string[]>;
deleteKeys(keys: string[]): Promise<void>;
}
export abstract class RecordManager extends Serializable implements RecordManagerInterface {
lc_namespace = ["langchain", "recordmanagers"];
abstract createSchema(): Promise<void>;
abstract getTime(): Promise<number>;
abstract update(keys: string[], updateOptions?: UpdateOptions): Promise<void>;
abstract exists(keys: string[]): Promise<boolean[]>;
abstract listKeys(options?: ListKeyOptions): Promise<string[]>;
abstract deleteKeys(keys: string[]): Promise<void>;
}
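To get a feel for what an implementation has to track, here is a minimal in-memory sketch. It is a local class, not a subclass of the library's RecordManager: `InMemoryRecords`, `UpdateOpts`, and `ListOpts` are hypothetical names, the option types are simplified assumptions, and createSchema() is omitted because there is no schema to create in memory.

```typescript
// Simplified local option types: assumptions for this sketch, not the library's definitions.
interface UpdateOpts {
  groupIds?: (string | null)[];
}
interface ListOpts {
  before?: number;
  groupIds?: string[];
}

class InMemoryRecords {
  // key -> last-updated timestamp and optional group (source) id
  private records = new Map<string, { updatedAt: number; groupId: string | null }>();

  async getTime(): Promise<number> {
    return Date.now();
  }

  // Insert new keys or refresh the timestamp of existing ones.
  async update(keys: string[], opts: UpdateOpts = {}): Promise<void> {
    const now = await this.getTime();
    keys.forEach((key, i) => {
      this.records.set(key, { updatedAt: now, groupId: opts.groupIds?.[i] ?? null });
    });
  }

  async exists(keys: string[]): Promise<boolean[]> {
    return keys.map((key) => this.records.has(key));
  }

  // List keys, optionally restricted to records older than `before` and to given groups.
  async listKeys(opts: ListOpts = {}): Promise<string[]> {
    const result: string[] = [];
    this.records.forEach((rec, key) => {
      if (opts.before !== undefined && rec.updatedAt >= opts.before) return;
      if (opts.groupIds && (rec.groupId === null || opts.groupIds.indexOf(rec.groupId) === -1)) return;
      result.push(key);
    });
    return result;
  }

  async deleteKeys(keys: string[]): Promise<void> {
    keys.forEach((key) => this.records.delete(key));
  }
}

// Usage: record two keys, then check which of three keys are known.
const demoRecords = new InMemoryRecords();
demoRecords
  .update(["doc-a", "doc-b"], { groupIds: ["source-1", "source-2"] })
  .then(() => demoRecords.exists(["doc-a", "doc-b", "doc-c"]))
  .then((flags) => console.log(flags));
```

A production record manager would persist these rows in a database so that the state survives restarts; the in-memory map only illustrates the bookkeeping.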
HashedDocument: computing the document hash
Source location: libs/langchain-core/src/indexing/base.ts:41
export class _HashedDocument implements HashedDocumentInterface {
uid: string;
hash_?: string;
contentHash?: string;
metadataHash?: string;
calculateHashes(): void {
const contentHash = this._hashStringToUUID(this.pageContent);
const metadataHash = this._hashNestedDictToUUID(this.metadata);
this.contentHash = contentHash;
this.metadataHash = metadataHash;
this.hash_ = this._hashStringToUUID(this.contentHash + this.metadataHash);
}
}
The hash is determined by pageContent and metadata together. A change to either one produces a new hash and triggers re-indexing.
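The same idea can be sketched without the library. Note the swap: the excerpt above derives UUIDs from the hashes, whereas this sketch uses plain SHA-256 hex digests from Node's built-in crypto module. `stableStringify`, `sha256`, and `fingerprint` are hypothetical helper names.

```typescript
import { createHash } from "node:crypto";

// Serialize a value with object keys sorted, so key order does not change the hash.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`);
    return `{${parts.join(",")}}`;
  }
  return String(JSON.stringify(value));
}

function sha256(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

// Combined fingerprint: hash of (content hash + metadata hash), as in the excerpt above.
function fingerprint(pageContent: string, metadata: Record<string, unknown>): string {
  const contentHash = sha256(pageContent);
  const metadataHash = sha256(stableStringify(metadata));
  return sha256(contentHash + metadataHash);
}

const original = fingerprint("LangChain.js intro", { source: "intro.md", page: 1 });
const reordered = fingerprint("LangChain.js intro", { page: 1, source: "intro.md" });
const edited = fingerprint("LangChain.js intro", { source: "intro.md", page: 2 });

console.log(original === reordered); // same content and metadata, different key order
console.log(original === edited); // metadata changed
```

Sorting keys before hashing matters: without it, two metadata objects with identical contents but different key order would look like different documents and be re-indexed needlessly.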
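29.5 index(): the incremental indexing function
Source location: libs/langchain-core/src/indexing/base.ts:263 (abridged excerpt; the declarations of batches, docsToIndex, docsToUpdate, uids, and the result counters are omitted)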
export async function index(args: IndexArgs): Promise<IndexingResult> {
const { docsSource, recordManager, vectorStore, options } = args;
const { batchSize = 100, cleanup, sourceIdKey, forceUpdate = false } = options ?? {};
const docs = _isBaseDocumentLoader(docsSource) ? await docsSource.load() : docsSource;
const indexStartDt = await recordManager.getTime();
for (const batch of batches) {
const hashedDocs = _deduplicateInOrder(
batch.map(doc => _HashedDocument.fromDocument(doc))
);
const batchExists = await recordManager.exists(hashedDocs.map(d => d.uid));
// Skip documents that already exist and are unchanged
hashedDocs.forEach((hashedDoc, i) => {
if (batchExists[i] && !forceUpdate) {
docsToUpdate.push(hashedDoc.uid); // Only refresh the timestamp
return;
}
docsToIndex.push(hashedDoc.toDocument()); // Needs indexing
});
// Write to the vector store
if (docsToIndex.length > 0) {
await vectorStore.addDocuments(docsToIndex, { ids: uids });
}
}
// cleanup: "full" or "incremental"
if (cleanup === "full") {
// Delete every old document that did not appear in this indexing run
}
return { numAdded, numDeleted, numUpdated, numSkipped };
}
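The two decisions in the excerpt above (which documents in a batch need indexing, and which old records to delete afterwards) can be sketched as pure functions. This is a simplified model, not the library's implementation: `HashedDoc`, `IndexRecord`, `partitionBatch`, and `selectStaleKeys` are hypothetical names, and staleness is assumed to mean "timestamp not refreshed since the run started".

```typescript
interface HashedDoc {
  uid: string;
  pageContent: string;
}

interface IndexRecord {
  key: string;
  source: string;
  updatedAt: number;
}

type CleanupMode = "incremental" | "full" | undefined;

// Drop later duplicates while preserving first-seen order.
function deduplicateInOrder(docs: HashedDoc[]): HashedDoc[] {
  const seen = new Set<string>();
  return docs.filter((doc) => {
    if (seen.has(doc.uid)) return false;
    seen.add(doc.uid);
    return true;
  });
}

// Split a batch into documents to (re)index and uids that only need a timestamp refresh.
function partitionBatch(
  batch: HashedDoc[],
  alreadyIndexed: Set<string>,
  forceUpdate = false,
): { toIndex: HashedDoc[]; toTouch: string[] } {
  const toIndex: HashedDoc[] = [];
  const toTouch: string[] = [];
  for (const doc of deduplicateInOrder(batch)) {
    if (alreadyIndexed.has(doc.uid) && !forceUpdate) {
      toTouch.push(doc.uid);
    } else {
      toIndex.push(doc);
    }
  }
  return { toIndex, toTouch };
}

// Pick the keys to delete for a run that started at `indexStart`.
// "incremental" only considers sources seen in this run; "full" considers everything.
function selectStaleKeys(
  records: IndexRecord[],
  indexStart: number,
  mode: CleanupMode,
  sourcesSeen: string[] = [],
): string[] {
  if (mode === undefined) return [];
  return records
    .filter((rec) => rec.updatedAt < indexStart)
    .filter((rec) => mode === "full" || sourcesSeen.indexOf(rec.source) !== -1)
    .map((rec) => rec.key);
}
```

Refreshing the timestamp of unchanged documents is what makes cleanup possible: anything whose timestamp is still older than the run's start time was not seen in this run.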
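Cleanup modes:
| Mode | Behavior | User-facing effect |
|---|---|---|
| undefined | Deletes nothing | Stale documents may remain |
| "incremental" | Deletes old documents from the same source while indexing | Minimizes duplicates |
| "full" | Deletes all old documents after indexing completes | Duplicates may be visible during indexing |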
29.6 Structured Query: the structured query IR
The structured_query module defines the intermediate representation (IR) used by the self-query retriever.
Source location: libs/langchain-core/src/structured_query/ir.ts
// Comparison operators
export type Comparator = "eq" | "ne" | "lt" | "gt" | "lte" | "gte";
// Logical operators
export type Operator = "and" | "or" | "not";
// Comparison expression
export class Comparison extends FilterDirective {
exprName = "Comparison" as const;
constructor(
public comparator: Comparator,
public attribute: string,
public value: string | number
) { super(); }
}
// Logical operation expression
export class Operation extends FilterDirective {
exprName = "Operation" as const;
constructor(
public operator: Operator,
public args?: FilterDirective[]
) { super(); }
}
// Structured query
export class StructuredQuery extends Expression {
exprName = "StructuredQuery" as const;
constructor(
public query: string,
public filter?: FilterDirective
) { super(); }
}
Visitor pattern: each vector store implements its own Visitor that translates the IR into that store's native query syntax.
// Usage example
const query = new StructuredQuery(
"关于 TypeScript 的文档",
new Operation("and", [
new Comparison("eq", "language", "zh"),
new Comparison("gte", "year", 2024),
])
);
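To make the Visitor idea concrete, here is a minimal sketch that translates a filter tree into a MongoDB-style filter object. It uses local plain-object node types rather than the library's Comparison and Operation classes. `MongoLikeTranslator` and the node types are hypothetical names, and mapping "not" to `$nor` is an assumption made for this sketch.

```typescript
type Comparator = "eq" | "ne" | "lt" | "gt" | "lte" | "gte";
type Operator = "and" | "or" | "not";

// Local, simplified IR nodes mirroring the shapes above.
interface ComparisonNode {
  kind: "comparison";
  comparator: Comparator;
  attribute: string;
  value: string | number;
}
interface OperationNode {
  kind: "operation";
  operator: Operator;
  args: FilterNode[];
}
type FilterNode = ComparisonNode | OperationNode;

type MongoLikeFilter = { [key: string]: unknown };

// One translator per backend: each visit method emits that backend's syntax.
class MongoLikeTranslator {
  visit(node: FilterNode): MongoLikeFilter {
    return node.kind === "comparison"
      ? this.visitComparison(node)
      : this.visitOperation(node);
  }

  visitComparison(node: ComparisonNode): MongoLikeFilter {
    return { [node.attribute]: { [`$${node.comparator}`]: node.value } };
  }

  visitOperation(node: OperationNode): MongoLikeFilter {
    const translated = node.args.map((arg) => this.visit(arg));
    // Assumption: "not" is expressed as $nor over its arguments.
    const key = node.operator === "not" ? "$nor" : `$${node.operator}`;
    return { [key]: translated };
  }
}

const irFilter: FilterNode = {
  kind: "operation",
  operator: "and",
  args: [
    { kind: "comparison", comparator: "eq", attribute: "language", value: "zh" },
    { kind: "comparison", comparator: "gte", attribute: "year", value: 2024 },
  ],
};

const mongoFilter = new MongoLikeTranslator().visit(irFilter);
console.log(JSON.stringify(mongoFilter));
```

Keeping the IR backend-neutral means the query-construction step is written once, and supporting another vector store only requires another translator.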
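29.7 Hands-on exercise: a custom DocumentLoader with incremental indexing
import { Document } from "@langchain/core/documents";
import { BaseDocumentLoader } from "@langchain/core/document_loaders/base";
import { RecordManager, index } from "@langchain/core/indexing";
import { VectorStore } from "@langchain/core/vectorstores";
// Custom loader: loads documents from a JSON API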
class ApiDocumentLoader extends BaseDocumentLoader {
constructor(private apiUrl: string) { super(); }
async load(): Promise<Document[]> {
const response = await fetch(this.apiUrl);
const items = await response.json();
return items.map((item: any) =>
new Document({
pageContent: item.content,
metadata: {
source: this.apiUrl,
title: item.title,
author: item.author,
publishedAt: item.publishedAt,
},
id: item.id,
})
);
}
}
// Incremental indexing flow
async function indexDocuments(
loader: BaseDocumentLoader,
recordManager: RecordManager,
vectorStore: VectorStore
) {
const result = await index({
docsSource: loader,
recordManager,
vectorStore,
options: {
cleanup: "incremental",
sourceIdKey: "source",
batchSize: 50,
},
});
console.log(`Added: ${result.numAdded}`);
console.log(`Updated: ${result.numUpdated}`);
console.log(`Skipped: ${result.numSkipped}`);
console.log(`Deleted: ${result.numDeleted}`);
}
29.8 Source reading roadmap
| Priority | File | Focus |
|---|---|---|
| P0 | langchain-core/src/documents/document.ts | Document class: pageContent + metadata |
| P0 | langchain-core/src/document_loaders/base.ts | BaseDocumentLoader: the loader interface |
| P1 | langchain-core/src/documents/transformers.ts | BaseDocumentTransformer, MappingDocumentTransformer |
| P1 | langchain-core/src/indexing/base.ts:263 | index() function: core incremental indexing logic |
| P1 | langchain-core/src/indexing/record_manager.ts | RecordManagerInterface: the record manager interface |
| P2 | langchain-core/src/indexing/base.ts:41 | _HashedDocument: document hash computation |
| P2 | langchain-core/src/structured_query/ir.ts | Comparison, Operation, StructuredQuery: the query IR |
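Lesson takeaways
| Level | What you should be able to do |
|---|---|
| 🟢 Basic | Know the Document data structure (pageContent + metadata) and understand what loaders are for |
| 🔵 Intermediate | Understand the BaseDocumentLoader abstraction and write a custom loader |
| 🟡 Advanced | Use BaseDocumentTransformer and MappingDocumentTransformer for document post-processing |
| 🟠 Senior | Analyze how RecordManager and the index() function together implement incremental indexing |
| 🔴 Architect | Design a large-scale document ingestion pipeline: deduplication, versioning, incremental sync, and the Structured Query IR |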
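Next lesson preview
Lesson 30 covers text splitting and Embeddings: the chunking strategies of TextSplitter and the vectorization interface of Embeddings, which together complete a key stage of the RAG pipeline.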