Building a Semantic Search Engine with LangChain
This tutorial introduces LangChain's document loader, embedding, and vector store abstractions. These abstractions are designed to support retrieving data from (vector) databases and other sources for integration into LLM workflows, and they are central to applications that fetch data for a model to reason over, such as retrieval-augmented generation (RAG). We will cover:
- Documents and document loaders
- Text splitters
- Embeddings (text vectorization)
- Vector stores and retrieval
- Retrievers (vector recall)
- A simple RAG pipeline
1. Documents and Document Loaders
LangChain implements loaders for many different document types (see the official documentation for the full list); each loader produces the unified Document object once loading completes. Common file types include:
| File type | Description |
|---|---|
| CSV | Loads data from CSV files, with configurable column extraction |
| JSON | Loads JSON files, using JSON pointers to target specific keys |
| JSONLines | Loads data from JSONLines/JSONL files |
| Text | Loads plain-text or Markdown files |
| DOCX | Loads Microsoft Word documents (.docx and .doc formats) |
| EPUB | Loads EPUB files, with optional chapter splitting |
| PPTX | Loads PowerPoint presentations |
| Subtitles | Loads subtitle files (.srt format) |
The corresponding loaders include:
| Loader | Description |
|---|---|
| DirectoryLoader | Loads every file in a directory via a custom loader mapping |
| UnstructuredLoader | Loads many file types through the Unstructured API |
| MultiFileLoader | Loads data from multiple individual file paths |
| ChatGPT | Loads ChatGPT conversation exports |
| Notion Markdown | Loads Notion pages exported as Markdown |
| OpenAI Whisper Audio | Transcribes audio files with the OpenAI Whisper API |
| PDFLoader | Loads and parses PDF files using pdf-parse |
To see what else is available, browse the loaders under the @langchain/community/document_loaders/fs/ package. Next, let's walk through the loading flow by using TextLoader to load a readme.md file.
import { TextLoader } from "@langchain/classic/document_loaders/fs/text";
const loader = new TextLoader("../readme.md");
const documents = await loader.load();
console.info(documents);
console.info(documents.length); // prints how many documents were loaded; for a PDF this is the number of pages
[
Document {
pageContent: "# langchain 实战教程项目\n" +
"## 1. 项目介绍\n" +
"本项目是关于学习 langchain.js 框架的实战项目,项目中包含了多个实战案例,每个案例都有详细的代码注释和说明。\n" +
"本项目包括如下几部分:\n" +
"+ 基础模型的使用\n" +
"+ 上下文记忆实现\n" +
"+ RAG 搜索实现\n" +
"+ MCP 协议实战\n" +
"+ Agent 实战\n" +
"+ Skills 实战",
metadata: { source: "../readme.md" },
id: undefined
}
]
1
The content read from the file is wrapped in a unified Document object. Note that a single Document usually represents one part of a larger document; when reading a PDF, for example, each Document represents one page. It holds the file's content and metadata: pageContent contains the text, metadata contains information such as the source file path, and id (optional) is a string identifier for the document. Next, let's load a PDF paper and inspect the result (if loading throws an error, the PDF version may be incompatible with the parser version; try a different PDF):
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
const loader = new PDFLoader("../resource/s41598-021-99343-4.pdf");
const docs = await loader.load();
console.info(docs)
console.log(docs.length);
[
Document {
pageContent: "1\n" +
"\x19o\x8e.:ȋͬͭͮͯͰͱͲͳʹ͵Ȍ\n" +
"Scientific Reports | (2021) 11:20146 \n" +
"| \n" +
"https://doi.org/10.1038/s41598-021-99343-4\n" +
"www.nature.com/scientificreports\n" +
"Assessing the cascading impacts \n" +
"of natural disasters in a multi‑layer \n" +
"behavioral network framework\n" +
"Asjad Naqvi\n" +
"1,2*\n" +
" & Irene Monasterolo\n" +
"1,2,3\n" +
"Natural disasters negatively impact regions and exacerbate socioeconomic vulnerabilities. \n" +
"While the direct impacts of natural disasters are well understood, the channels through which these \n" +
"shocks spread to non‑affected regions, still represents an open research question. In this paper we \n" +
"propose modelling socioeconomic systems as spatially‑explicit, multi‑layer behavioral networks, \n" +
"where the interplay of supply‑side production, and demand‑side consumption decisions, can help \n" +
...
}
]
14
For each Document object we can easily access:
- the page's string content;
- metadata containing the file name and page number.
console.log(docs[0].pageContent.slice(0, 200)); // first 200 characters of page one
console.log(docs[0].metadata); // print the document metadata
1
o.:ȋͬͭͮͯͰͱͲͳʹ͵Ȍ
Scientific Reports | (2021) 11:20146
|
https://doi.org/10.1038/s41598-021-99343-4
www.nature.com/scientificreports
Assessing the cascading impacts
of natural disasters i
{
source: "../resource/s41598-021-99343-4.pdf",
pdf: {
version: "1.10.100",
info: {
PDFFormatVersion: "1.6",
IsAcroFormPresent: false,
IsXFAPresent: false,
Title: "Assessing the cascading impacts of natural disasters in a multi-layer behavioral network framework",
Author: "Asjad Naqvi ",
Subject: "Scientific Reports, https://doi.org/10.1038/s41598-021-99343-4",
Creator: "Springer",
Producer: "Adobe PDF Library 15.0; modified using iText® 5.3.5 ©2000-2012 1T3XT BVBA (SPRINGER SBM; licensed version)",
CreationDate: "D:20211005160931+05'30'",
ModDate: "D:20211005190810+02'00'"
},
metadata: Metadata {
_metadata: [Object: null prototype] {
"xmp:createdate": "2021-10-05T16:09:31+05:30",
"xmp:creatortool": "Springer",
"xmp:modifydate": "2021-10-05T19:08:10+02:00",
"xmp:metadatadate": "2021-10-05T19:08:10+02:00",
"dc:format": "application/pdf",
"dc:identifier": "https://doi.org/10.1038/s41598-021-99343-4",
"dc:publisher": "Nature Publishing Group UK",
"dc:description": "Scientific Reports, https://doi.org/10.1038/s41598-021-99343-4",
"dc:title": "Assessing the cascading impacts of natural disasters in a multi-layer behavioral network framework",
"dc:creator": "Asjad Naqvi Irene Monasterolo",
"crossmark:doi": "10.1038/s41598-021-99343-4",
"crossmark:majorversiondate": "2010-04-23",
"crossmark:crossmarkdomainexclusive": "true",
"crossmark:crossmarkdomains": "springer.comspringerlink.com",
"prism:url": "https://doi.org/10.1038/s41598-021-99343-4",
"prism:doi": "10.1038/s41598-021-99343-4",
"prism:issn": "2045-2322",
...
}
},
totalPages: 14
},
loc: { pageNumber: 1 }
}
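Since the loader yields one Document per PDF page, per-page processing is plain array work. A minimal, dependency-free sketch (the Doc interface below is a simplified stand-in for LangChain's Document class, used here only for illustration):

```typescript
// Simplified stand-in for LangChain's Document shape (illustrative only).
interface Doc {
  pageContent: string;
  metadata: { source: string; loc?: { pageNumber?: number } };
}

// Build a map from page number to a short preview of that page's text.
function pagePreviews(docs: Doc[], previewLen = 50): Map<number, string> {
  const previews = new Map<number, string>();
  for (const d of docs) {
    const page = d.metadata.loc?.pageNumber;
    if (page !== undefined) {
      previews.set(page, d.pageContent.slice(0, previewLen));
    }
  }
  return previews;
}

const sample: Doc[] = [
  { pageContent: "Assessing the cascading impacts of natural disasters", metadata: { source: "a.pdf", loc: { pageNumber: 1 } } },
  { pageContent: "Natural disasters negatively impact regions", metadata: { source: "a.pdf", loc: { pageNumber: 2 } } },
];
console.log(pagePreviews(sample).get(2)); // preview of page 2
```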
2. Text Splitters
With the documents loaded, the next step is to split them into smaller chunks, each holding a bounded number of characters, which makes the subsequent embedding step manageable. For information retrieval and downstream question answering, whole pages are too coarse: splitting the PDF further helps ensure that the meaning of relevant passages is not drowned out by surrounding text.
A text splitter serves this purpose. Here we use a simple character-based splitter that cuts the documents into 100-character chunks with a 20-character overlap between adjacent chunks (in practice, a chunk size of 1000 with an overlap of 200 is a common recommendation). The overlap reduces the chance of separating a statement from the context it needs. We use RecursiveCharacterTextSplitter, which recursively splits on common separators such as newlines until each chunk reaches the target size; it is the recommended splitter for generic text.
In the output below, pageContent is the first chunk cut from docs, at most 100 characters long and overlapping its neighboring chunks by 20 characters.
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
const textSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 100,
chunkOverlap: 20,
});
const allSplits = await textSplitter.splitDocuments(docs); // split all documents into chunks
console.log(allSplits[0]);
console.log(allSplits.length);
Document {
pageContent: "1\n\x19o\x8e.:ȋͬͭͮͯͰͱͲͳʹ͵Ȍ\nScientific Reports | (2021) 11:20146 \n|",
metadata: {
source: "../resource/s41598-021-99343-4.pdf",
pdf: {
version: "1.10.100",
info: {
PDFFormatVersion: "1.6",
IsAcroFormPresent: false,
IsXFAPresent: false,
Title: "Assessing the cascading impacts of natural disasters in a multi-layer behavioral network framework",
Author: "Asjad Naqvi ",
Subject: "Scientific Reports, https://doi.org/10.1038/s41598-021-99343-4",
Creator: "Springer",
Producer: "Adobe PDF Library 15.0; modified using iText® 5.3.5 ©2000-2012 1T3XT BVBA (SPRINGER SBM; licensed version)",
CreationDate: "D:20211005160931+05'30'",
ModDate: "D:20211005190810+02'00'"
},
metadata: Metadata {
_metadata: [Object: null prototype] {
"xmp:createdate": "2021-10-05T16:09:31+05:30",
"xmp:creatortool": "Springer",
"xmp:modifydate": "2021-10-05T19:08:10+02:00",
"xmp:metadatadate": "2021-10-05T19:08:10+02:00",
"dc:format": "application/pdf",
"dc:identifier": "https://doi.org/10.1038/s41598-021-99343-4",
"dc:publisher": "Nature Publishing Group UK",
"dc:description": "Scientific Reports, https://doi.org/10.1038/s41598-021-99343-4",
"dc:title": "Assessing the cascading impacts of natural disasters in a multi-layer behavioral network framework",
"dc:creator": "Asjad Naqvi Irene Monasterolo",
"crossmark:doi": "10.1038/s41598-021-99343-4",
"crossmark:majorversiondate": "2010-04-23",
"crossmark:crossmarkdomainexclusive": "true",
"crossmark:crossmarkdomains": "springer.comspringerlink.com",
"prism:url": "https://doi.org/10.1038/s41598-021-99343-4",
"prism:doi": "10.1038/s41598-021-99343-4",
"prism:issn": "2045-2322",
"prism:aggregationtype": "journal",
"prism:publicationname": "Scientific Reports",
"prism:copyright": "The Author(s)",
"pdfx:crossmarkmajorversiondate": "2010-04-23",
"pdfx:crossmarkdomainexclusive": "true",
"pdfx:doi": "10.1038/s41598-021-99343-4",
"pdfx:robots": "noindex",
"pdfx:crossmarkdomains": "springer.comspringerlink.com",
"jav:journal_article_version": "VoR",
"xmpmm:documentid": "uuid:43861835-254a-47ce-903b-fffb3fd6781a",
"xmpmm:instanceid": "uuid:5cb53b71-003e-4f4f-9411-d085b98aa13b",
"xmpmm:renditionclass": "default",
"xmpmm:versionid": "1",
"xmpmm:history": "converteduuid:52a5f700-b51b-47b7-9b95-3c0246d02b26converted to PDF/A-2bpdfToolbox2021-10-05T16:10:24+05:30converteduuid:5cc79c6d-a15b-4f48-8796-a2d0719b6789converted to PDF/A-2bpdfToolbox2021-10-05T16:11:03+05:30",
"pdf:producer": "Adobe PDF Library 15.0; modified using iText® 5.3.5 ©2000-2012 1T3XT BVBA (SPRINGER SBM; licensed version)",
"pdfaid:part": "2",
"pdfaid:conformance": "B",
"pdfaextension:schemas": "http://ns.adobe.com/pdfx/1.3/pdfxAdobe Document Info PDF eXtension SchemaexternalMirrors crossmark:MajorVersionDateCrossmarkMajorVersionDateTextexternalMirrors crossmark:CrossmarkDomainExclusiveCrossmarkDomainExclusiveTextinternalMirrors crossmark:DOIdoiTextexternalMirrors crossmark:CrosMarkDomainsCrossMarkDomainsseq TextinternalA name object indicating whether the document has been modified to include trapping informationrobotsTextinternalID of PDF/X standardGTS_PDFXVersionTextinternalConformance level of PDF/X standardGTS_PDFXConformanceTextinternalCompany creating the PDFCompanyTextinternalDate when document was last modifiedSourceModifiedTexthttp://crossref.org/crossmark/1.0/crossmarkCrossmark SchemainternalUsual same as prism:doiDOITextexternalThe date when a publication was publishe.MajorVersionDateTextinternalCrossmarkDomainExclusiveCrossmarkDomainExclusiveTextinternalCrossMarkDomainsCrossMarkDomainsseq Texthttp://prismstandard.org/namespaces/basic/2.0/prismPrism SchemaexternalThis element provides the url for an article or unit of content. The attribute platform is optionally allowed for situations in which multiple URLs must be specified. PRISM recommends that a subset of the PCV platform values, namely “mobile” and “web”, be used in conjunction with this element. NOTE: PRISM recommends against the use of the #other value allowed in the PRISM Platform controlled vocabulary. In lieu of using #other please reach out to the PRISM group at prism-wg@yahoogroups.com to request addition of your term to the Platform Controlled Vocabulary.urlURIexternalThe Digital Object Identifier for the article.\n" +
"The DOI may also be used as the dc:identifier. If used as a dc:identifier, the URI form should be captured, and the bare identifier should also be captured using prism:doi. If an alternate unique identifier is used as the required dc:identifier, then the DOI should be specified as a bare identifier within prism:doi only. If the URL associated with a DOI is to be specified, then prism:url may be used in conjunction with prism:doi in order to provide the service endpoint (i.e. the URL). doiTextexternalISSN for an electronic version of the issue in which the resource occurs. Permits publishers to include a second ISSN, identifying an electronic version of the issue in which the resource occurs (therefore e(lectronic)Issn. If used, prism:eIssn MUST contain the ISSN of the electronic version.issnTextinternalVolume numbervolumeTextinternalIssue numbernumberTextinternalStarting pagestartingPageTextinternalEnding pageendingPageTextexternalThe aggregation type specifies the unit of aggregation for a content collection. Comment PRISM recommends that the PRISM Aggregation Type Controlled Vocabulary be used to provide values for this element. Note: PRISM recommends against the use of the #other value currently allowed in this controlled vocabulary. In lieu of using #other please reach out to the PRISM group at info@prismstandard.org to request addition of your term to the Aggregation Type Controlled Vocabulary. aggregationTypeTextexternalTitle of the magazine, or other publication, in which a resource was/will be published. Typically this will be used to provide the name of the magazine an article appeared in as metadata for the article, along with information such as the article title, the publisher, volume, number, and cover date. 
Note: Publication name can be used to differentiate between a print magazine and the online version if the names are different such as “magazine” and “magazine.com.”publicationNameTextexternalCopyrightcopyrightTexthttp://ns.adobe.com/pdf/1.3/pdfAdobe PDF SchemainternalA name object indicating whether the document has been modified to include trapping informationTrappedTexthttp://ns.adobe.com/xap/1.0/mm/xmpMMXMP Media Management SchemainternalUUID based identifier for specific incarnation of a documentInstanceIDURIinternalThe common identifier for all versions and renditions of a document.DocumentIDURIinternalThe common identifier for all versions and renditions of a document.OriginalDocumentIDURIinternalA reference to the original document from which this one is derived. It is a minimal reference; missing components can be assumed to be unchanged. For example, a new version might only need to specify the instance ID and version number of the previous version, or a rendition might only need to specify the instance ID and rendition class of the original.DerivedFromResourceRefIdentifies a portion of a document. This can be a position at which the document has been changed since the most recent event history (stEvt:changed). 
For a resource within an xmpMM:Ingredients list, the ResourceRef uses this type to identify both the portion of the containing document that refers to the resource, and the portion of the referenced resource that is referenced.http://ns.adobe.com/xap/1.0/sType/Part#stPartParthttp://www.aiim.org/pdfa/ns/id/pdfaidPDF/A ID SchemainternalPart of PDF/A standardpartIntegerinternalAmendment of PDF/A standardamdTextinternalConformance level of PDF/A standardconformanceTexthttp://ns.adobe.com/xap/1.0/t/pg/xmpTPgXMP Paged-TextinternalXMP08 Spec: An ordered array of plate names that are needed to print the document (including any in contained documents).PlateNamesSeq Texthttp://www.niso.org/schemas/jav/1.0/javNISOexternalValues for Journal Article Version are one of the following:\n" +
"AO = Author’s Original\n" +
"SMUR = Submitted Manuscript Under Review\n" +
"AM = Accepted Manuscript\n" +
"P = Proof\n" +
"VoR = Version of Record\n" +
"CVoR = Corrected Version of Record\n" +
"EVoR = Enhanced Version of Recordjournal_article_versionClosed Choice of Text"
}
},
totalPages: 14
},
loc: { pageNumber: 1, lines: { from: 1, to: 4 } }
},
id: undefined
}
1112
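The chunk count above follows from the window arithmetic: each new chunk starts chunkSize − chunkOverlap characters after the previous one. A naive fixed-window splitter makes this concrete (this is only the windowing arithmetic, not the recursive, separator-aware algorithm RecursiveCharacterTextSplitter actually uses):

```typescript
// Naive fixed-window splitter: each chunk starts (chunkSize - chunkOverlap)
// characters after the previous one, so neighbors share chunkOverlap characters.
// Assumes chunkOverlap < chunkSize.
function windowSplit(text: string, chunkSize: number, chunkOverlap: number): string[] {
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

console.log(windowSplit("0123456789", 4, 2)); // [ "0123", "2345", "4567", "6789" ]
console.log(windowSplit("a".repeat(250), 100, 20).length); // 3
```

Adjacent chunks share their last/first chunkOverlap characters, which is exactly what keeps a sentence from being severed from its context.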
3. Embedding the Chunks
Vector search is a common way to store and search unstructured data such as free text. The idea is to store a numeric vector associated with each piece of text; given a query, we embed it into a vector of the same dimensionality and use a vector similarity measure, such as cosine similarity, to find related text. LangChain supports text embeddings from dozens of providers (see the official documentation); these models define how text is converted into numeric vectors. Here we will use the open-source HuggingFace transformers library and Alibaba's Qwen embedding model.
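The cosine similarity measure mentioned above fits in a few lines. A standalone sketch, independent of any embedding provider:

```typescript
// Cosine similarity: dot product of the vectors divided by the product of
// their magnitudes; 1 means identical direction, 0 orthogonal, -1 opposite.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1
console.log(cosineSimilarity([1, 0], [0, 1])); // 0
```

When vectors are normalized to unit length (as with the normalize: true pooling option in the transformers pipeline), cosine similarity reduces to a plain dot product.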
3.1 Alibaba Tongyi (Qwen) Embedding Model
import { AlibabaTongyiEmbeddings } from "@langchain/community/embeddings/alibaba_tongyi";
const embeddingModel = new AlibabaTongyiEmbeddings({
apiKey: process.env.QWEN_API_KEY,
modelName: "text-embedding-v4",
});
const res = await embeddingModel.embedQuery(
"What would be a good company name for a company that makes colorful socks?",
);
const documentRes = await embeddingModel.embedDocuments(["Hello world", "Bye bye"]);
// console.log({ documentRes });
console.log('---------------------------');
console.log({ res });
---------------------------
{
res: [
0.029817035421729088, 0.023035120218992233, 0.02877928502857685,
-0.03925909474492073, 0.03931755945086479, 0.04150998964905739,
0.022494321689009666, 0.023371294140815735, 0.0011272738920524716,
0.12616698443889618, 0.08787255734205246, -0.0467718206346035,
0.003665010444819927, 0.02033112570643425, 0.05498611927032471,
0.008521241135895252, -0.042094636708498, -0.02154427021741867,
-0.017393270507454872, -0.04291314259171486, 0.046567194163799286,
0.04156845435500145, 0.035751208662986755, -0.00456025218591094,
-0.005638196598738432, 0.045953311026096344, -0.014053470455110073,
-0.014396950602531433, 0.00890126172453165, 0.03659895062446594,
-0.05738317593932152, -0.02332744561135769, -0.010903680697083473,
-0.05174132436513901, 0.02473060041666031, 0.02167581580579281,
-0.05942944437265396, -0.013242271728813648, 0.054459936916828156,
0.04732723534107208, -0.006237460765987635, -0.009113196283578873,
-0.04951966553926468, 0.014097318984568119, -0.02132502570748329,
0.004746608901768923, 0.028983911499381065, -0.013804994523525238,
-0.008733175694942474, 0.0248182974755764, 0.026206834241747856,
0.018036382272839546, 0.020930388942360878, 0.045514825731515884,
-0.023473607376217842, 0.01752481609582901, -0.03230178728699684,
0.032974131405353546, 0.0033891298808157444, 0.03630662336945534,
0.014740431681275368, -0.0009258445352315903, 0.04191924259066582,
0.0698654055595398, -0.03124942258000374, 0.012087591923773289,
0.04449169337749481, 0.046976447105407715, 0.030635541304945946,
-0.033383384346961975, 0.011751419864594936, 0.050659727305173874,
-0.03835289180278778, 0.04969505965709686, 0.04355625808238983,
0.010837907902896404, 0.004172923043370247, -0.054401472210884094,
-0.04153922200202942, 0.031600210815668106, -0.00043802906293421984,
-0.05501535162329674, -0.06553900986909866, -0.01942492090165615,
-0.02079884335398674, -0.010194795206189156, -0.07085930556058884,
-0.03165867552161217, 0.028238486498594284, 0.01177334412932396,
0.026309149339795113, 0.03256487846374512, 0.0654805451631546,
-0.043731652200222015, 0.020696530118584633, 0.08869106322526932,
0.033412620425224304, -0.020258044824004173, -0.004348317626863718,
0.019907256588339806,
... 924 more items
]
}
3.2 Open-Source Model via HuggingFace transformers
First install the dependency: npm install @huggingface/transformers
import { HuggingFaceTransformersEmbeddings } from '@langchain/community/embeddings/huggingface_transformers'
const embeddingModel_tf = new HuggingFaceTransformersEmbeddings({
model: 'Xenova/all-MiniLM-L6-v2'
})
const res = await embeddingModel_tf.embedQuery(
"What would be a good company name for a company that makes colorful socks?",
);
const documentRes = await embeddingModel_tf.embedDocuments(["Hello world", "Bye bye"]);
// console.log({ documentRes });
console.log('---------------------------');
console.log({ res });
dtype not specified for "model". Using the default dtype (fp32) for this device (cpu).
---------------------------
{
res: [
-0.11090296506881714, -0.06207888200879097, -0.026457801461219788,
-0.014639840461313725, -0.03584568202495575, -0.03648298978805542,
0.03844122588634491, -0.056844647973775864, 0.004606223665177822,
-0.011424371972680092, 0.00996622908860445, 0.05055968463420868,
0.017805255949497223, 0.03721049055457115, -0.015863768756389618,
0.0849541500210762, 0.05105682834982872, 0.06582950800657272,
0.03156667947769165, -0.19215089082717896, -0.022613516077399254,
-0.10871582478284836, 0.02355892024934292, 0.028067192062735558,
-0.0637408047914505, 0.005205999128520489, -0.014231138862669468,
0.010988589376211166, -0.022418370470404625, -0.07386071979999542,
-0.08736184984445572, -0.019853731617331505, 0.03278568759560585,
0.013404401019215584, -0.0070277247577905655, 0.029492776840925217,
-0.05833226442337036, -0.05483487993478775, -0.0400066040456295,
0.06512190401554108, 0.03344210609793663, -0.014605705626308918,
-0.016307922080159187, -0.03151392191648483, 0.00842444971203804,
-0.031146645545959473, 0.1264137625694275, 0.06309884786605835,
0.015157172456383705, 0.03475946560502052, -0.02383442409336567,
-0.1039414256811142, -0.00514568667858839, -0.01335218083113432,
0.044624343514442444, 0.0038201739080250263, -0.06735650449991226,
-0.0846400260925293, 0.01961386390030384, -0.04056335240602493,
0.06778173893690109, 0.007492696866393089, -0.037689752876758575,
0.030470313504338264, 0.005149312783032656, 0.021455513313412666,
-0.04174603149294853, 0.10233484208583832, -0.06513651460409164,
-0.10311980545520782, -0.0356571339070797, -0.024493569508194923,
0.07409506291151047, 0.1411401927471161, -0.021153582260012627,
0.07043297588825226, 0.06894206255674362, -0.01509435661137104,
-0.0579165518283844, -0.005523786414414644, -0.07963882386684418,
-0.0022520916536450386, -0.030415641143918037, 0.06850080192089081,
-0.001742134802043438, 0.06578843295574188, 0.023562844842672348,
0.027051398530602455, -0.017859533429145813, -0.02100076898932457,
-0.10933089256286621, 0.02166176587343216, 0.06821256875991821,
-0.052844513207674026, -0.06312106549739838, 0.0608273483812809,
0.02273370884358883, -0.02880200184881687, -0.02250671200454235,
0.003738575614988804,
... 284 more items
]
}
4. Vector Stores and Retrieval
LangChain's VectorStore objects provide methods for adding text and Document objects to the store and for querying it with various similarity measures. They are typically initialized with an embedding model, which determines how text data is converted into numeric vectors. LangChain integrates with a range of vector store technologies (see the official documentation); here we cover only the in-memory store and FAISS, both suited to lightweight workloads.
LangChain provides a unified interface for vector stores that lets you:
- addDocuments: add documents to the store.
- delete: remove stored documents by ID.
- similaritySearch: query for semantically similar documents.
This abstraction lets you switch between implementations without changing your application logic.
4.1 In-Memory Vector Store
similaritySearch searches the vector store for the documents most similar to a query and returns the closest embedded documents. Most vector stores support the following parameters:
- k: the number of results to return.
- filter: a metadata-based condition; filtering by metadata (e.g. source, date) can refine the search results.
import { MemoryVectorStore } from "@langchain/classic/vectorstores/memory";
import { Document } from "@langchain/core/documents";
const vectorStore = new MemoryVectorStore(embeddingModel_tf);
// add documents to the vector store
const document1 = new Document({
pageContent: "我的老师叫做 cheney,他是一位年轻的数学老师,他热爱 python 编程与数学。",
});
const document2 = new Document({
pageContent: "您可以在我们的 ESM 指南 中找到有关 Electron 中 ESM 状态以及如何在我们的应用程序中使用它们的更多信息",
});
await vectorStore.addDocuments([document1,document2]);
// similarity search 1
const results = await vectorStore.similaritySearch("我的老师热爱数学", 1);
console.info(results)
// similarity search 2
const results2 = await vectorStore.similaritySearch("学习 Electron", 1);
console.info(results2)
[
Document {
pageContent: "我的老师叫做 cheney,他是一位年轻的数学老师,他热爱 python 编程与数学。",
metadata: {},
id: undefined
}
]
[
Document {
pageContent: "您可以在我们的 ESM 指南 中找到有关 Electron 中 ESM 状态以及如何在我们的应用程序中使用它们的更多信息",
metadata: {},
id: undefined
}
]
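Conceptually, the in-memory store above performs a brute-force scan: compare the query vector against every stored vector and keep the top k. A standalone sketch over plain arrays (no LangChain types; the names here are illustrative):

```typescript
interface Entry { text: string; vector: number[] }

// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Brute-force top-k: score every entry against the query, sort descending.
function topK(store: Entry[], query: number[], k: number): Entry[] {
  return [...store]
    .sort((x, y) => cosine(y.vector, query) - cosine(x.vector, query))
    .slice(0, k);
}

const store: Entry[] = [
  { text: "math teacher", vector: [1, 0, 0] },
  { text: "Electron ESM guide", vector: [0, 1, 0] },
];
console.log(topK(store, [0.9, 0.1, 0], 1)[0].text); // "math teacher"
```

This linear scan is why in-memory stores suit lightweight workloads; larger corpora need an index structure such as FAISS, covered next.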
4.2 FAISS Local Vector Store
FaissStore is an in-memory FAISS vector store implementation; it persists locally by saving index files to a specified directory. On first use, create an empty instance, add documents, and save; afterwards, load the store from that directory. First install the FAISS dependencies: npm install @langchain/community faiss-node
import { FaissStore } from "@langchain/community/vectorstores/faiss";
import { Document } from "@langchain/core/documents";
import { HuggingFaceTransformersEmbeddings } from '@langchain/community/embeddings/huggingface_transformers'
// 1. Create the embeddings model
const embeddings = new HuggingFaceTransformersEmbeddings({
model: 'Xenova/all-MiniLM-L6-v2'
})
// 2. Create an empty vector store (first use)
let loadedVectorStore = new FaissStore(embeddings, {});
// 3. Add documents
const docs: Document[] = [
{ pageContent: "线粒体是细胞的能量工厂", metadata: { source: "biology.com" } },
{ pageContent: "java 编程我不会", metadata: { source: "arch.com" } },
{ pageContent: "线粒体由脂质构成", metadata: { source: "biology.com" } },
{ pageContent: "张智华是华为公司的员工,今年 38 岁,负责开发 AI 产品,年收入 1000 万美元。", metadata: { source: "biology.com" } },
];
await loadedVectorStore.addDocuments(docs, { ids: ["1", "2", "3", "4"] }); // embed the documents and add them to the index
dtype not specified for "model". Using the default dtype (fp32) for this device (cpu).
[ "1", "2", "3", "4" ]
Similarity search:
const results = await loadedVectorStore.similaritySearch("线粒体的结构", 2);
results.forEach(doc => {
console.log(doc.pageContent, doc.metadata);
});
// search with scores, commonly used in practice
const withScores = await loadedVectorStore.similaritySearchWithScore("细胞结构", 2);
withScores.forEach(([doc, score]) => {
console.log(score, doc.pageContent);
});
线粒体是细胞的能量工厂 { source: "biology.com" }
线粒体由脂质构成 { source: "biology.com" }
0.5523346662521362 线粒体是细胞的能量工厂
0.9782209396362305 线粒体由脂质构成
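Note that these scores come from an L2 (Euclidean) distance index, so lower means more similar, which is why the closest match carries the smaller number. The metric itself, as a sketch:

```typescript
// Squared differences summed, then square-rooted: classic Euclidean distance.
// With an L2 index, a smaller distance means a closer (more similar) vector.
function l2Distance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

console.log(l2Distance([0, 0], [3, 4])); // 5
console.log(l2Distance([1, 1], [1, 1])); // 0
```

This is the opposite convention from cosine similarity, where higher is better; keep the index type in mind when interpreting scores.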
Deleting documents:
await loadedVectorStore.delete({ ids: ["3"] }); // delete by ID
Updating documents:
// delete the old document
await loadedVectorStore.delete({ ids: ["3"] });
// add the new content
await loadedVectorStore.addDocuments(
[{ pageContent: "线粒体是双层膜结构的细胞器", metadata: { source: "updated.com" } }],
{ ids: ["3"] }
);
[ "3" ]
// save to a local directory (creates the index directory and files)
await loadedVectorStore.save("./faiss_index"); // local persistence
// later, load the saved store
loadedVectorStore = await FaissStore.load(
"./faiss_index",
embeddings,
{ maxConcurrency: 128 } // optional parameter
);
FaissStore {
lc_serializable: false,
lc_kwargs: {
docstore: SynchronousInMemoryDocstore {
_docs: Map(4) {
"1" => { pageContent: "线粒体是细胞的能量工厂", metadata: [Object] },
"2" => { pageContent: "java 编程我不会", metadata: [Object] },
"4" => {
pageContent: "张智华是华为公司的员工,今年 38 岁,负责开发 AI 产品,年收入 1000 万美元。",
metadata: [Object]
},
"3" => { pageContent: "线粒体是双层膜结构的细胞器", metadata: [Object] }
}
},
index: IndexFlatL2 {},
mapping: { "0": "1", "1": "2", "2": "4", "3": "3" }
},
lc_namespace: [ "langchain", "vectorstores", "faiss" ],
embeddings: HuggingFaceTransformersEmbeddings {
caller: AsyncCaller {
maxConcurrency: Infinity,
maxRetries: 6,
onFailedAttempt: [Function: defaultFailedAttemptHandler],
queue: PQueue {
_events: Events <[Object: null prototype] {}> {},
_eventsCount: 0,
_intervalCount: 1,
_intervalEnd: 0,
_pendingCount: 0,
_resolveEmpty: [Function: empty],
_resolveIdle: [Function: empty],
_carryoverConcurrencyCount: false,
_isIntervalIgnored: true,
_intervalCap: Infinity,
_interval: 0,
_queue: PriorityQueue { _queue: [] },
_queueClass: [class PriorityQueue],
_concurrency: Infinity,
_intervalId: undefined,
_timeout: undefined,
_throwOnTimeout: false,
_isPaused: false
}
},
model: "Xenova/all-MiniLM-L6-v2",
batchSize: 512,
stripNewLines: true,
timeout: undefined,
pretrainedOptions: {},
pipelineOptions: { pooling: "mean", normalize: true },
pipelinePromise: Promise {
[Function: closure] FeatureExtractionPipeline {
task: "feature-extraction",
model: [Function: closure] BertModel {
main_input_name: "input_ids",
forward_params: [Array],
config: [PretrainedConfig],
sessions: [Object],
configs: undefined,
can_generate: false,
_forward: [AsyncFunction: encoderForward],
_prepare_inputs_for_generation: null,
custom_config: {}
},
tokenizer: [Function: closure] BertTokenizer {
return_token_type_ids: true,
padding_side: "right",
config: [Object],
normalizer: [BertNormalizer],
pre_tokenizer: [BertPreTokenizer],
model: [WordPieceTokenizer],
post_processor: [TemplateProcessing],
decoder: [WordPieceDecoder],
special_tokens: [Array],
all_special_ids: [Array],
added_tokens: [Array],
additional_special_tokens: [],
added_tokens_splitter: [DictionarySplitter],
added_tokens_map: [Map],
mask_token: "[MASK]",
mask_token_id: 103,
pad_token: "[PAD]",
pad_token_id: 0,
sep_token: "[SEP]",
sep_token_id: 102,
unk_token: "[UNK]",
unk_token_id: 100,
bos_token: null,
bos_token_id: undefined,
eos_token: null,
eos_token_id: undefined,
model_max_length: 512,
remove_space: undefined,
clean_up_tokenization_spaces: true,
do_lowercase_and_remove_accent: false,
add_bos_token: undefined,
add_eos_token: undefined,
legacy: false,
chat_template: null,
_compiled_template_cache: Map(0) {}
},
processor: null
}
}
},
_index: IndexFlatL2 {},
_mapping: { "0": "1", "1": "2", "2": "4", "3": "3" },
docstore: SynchronousInMemoryDocstore {
_docs: Map(4) {
"1" => {
pageContent: "线粒体是细胞的能量工厂",
metadata: { source: "biology.com" }
},
"2" => { pageContent: "java 编程我不会", metadata: { source: "arch.com" } },
"4" => {
pageContent: "张智华是华为公司的员工,今年 38 岁,负责开发 AI 产品,年收入 1000 万美元。",
metadata: { source: "biology.com" }
},
"3" => {
pageContent: "线粒体是双层膜结构的细胞器",
metadata: { source: "updated.com" }
}
}
},
args: {
docstore: SynchronousInMemoryDocstore {
_docs: Map(4) {
"1" => { pageContent: "线粒体是细胞的能量工厂", metadata: [Object] },
"2" => { pageContent: "java 编程我不会", metadata: [Object] },
"4" => {
pageContent: "张智华是华为公司的员工,今年 38 岁,负责开发 AI 产品,年收入 1000 万美元。",
metadata: [Object]
},
"3" => { pageContent: "线粒体是双层膜结构的细胞器", metadata: [Object] }
}
},
index: IndexFlatL2 {},
mapping: { "0": "1", "1": "2", "2": "4", "3": "3" }
}
}
5. Retrievers (Vector Recall)
LangChain VectorStore objects do not inherit from Runnable. LangChain Retriever objects are Runnables, so they implement the standard set of methods (e.g. synchronous and asynchronous invoke and batch operations). While retrievers can be constructed from vector stores, they can also wrap non-vector-store data sources, such as external APIs. Vector stores implement an asRetriever method that produces a retriever, specifically a VectorStoreRetriever; its properties, such as searchType and the search options, determine which methods of the underlying vector store are called and how they are parameterized. For example, we can reproduce the search above as follows:
// convert to the standard Retriever interface
const retriever = loadedVectorStore.asRetriever({
k: 2, // number of results to recall
searchType: "similarity", // similarity scoring; "mmr" (Maximal Marginal Relevance) instead maximizes result diversity while preserving relevance, avoiding redundant hits
filter: (doc) => doc.metadata.source === "biology.com", // filter condition
});
// invoke it directly
const docs = await retriever.invoke("线粒体的结构");
console.log(docs);
[
{ pageContent: "线粒体是细胞的能量工厂", metadata: { source: "biology.com" } },
{ pageContent: "线粒体是双层膜结构的细胞器", metadata: { source: "updated.com" } }
]
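The "mmr" search type noted in the comment above can be approximated as a greedy re-ranking: repeatedly pick the candidate maximizing λ·relevance − (1 − λ)·(max similarity to results already picked). A simplified standalone sketch over precomputed similarity scores (a hypothetical helper, not LangChain's implementation):

```typescript
// Maximal Marginal Relevance over precomputed similarities.
// relevance[i]: similarity of candidate i to the query.
// pairwise[i][j]: similarity between candidates i and j.
function mmr(relevance: number[], pairwise: number[][], k: number, lambda = 0.5): number[] {
  const selected: number[] = [];
  const remaining = new Set(relevance.map((_, i) => i));
  while (selected.length < k && remaining.size > 0) {
    let best = -1, bestScore = -Infinity;
    for (const i of remaining) {
      // Penalize candidates similar to anything we already picked.
      const redundancy = selected.length
        ? Math.max(...selected.map((j) => pairwise[i][j]))
        : 0;
      const score = lambda * relevance[i] - (1 - lambda) * redundancy;
      if (score > bestScore) { bestScore = score; best = i; }
    }
    selected.push(best);
    remaining.delete(best);
  }
  return selected;
}

// Candidates 0 and 1 are near-duplicates; MMR picks 0 and then the diverse 2.
const picked = mmr([0.9, 0.85, 0.5], [[1, 0.95, 0.1], [0.95, 1, 0.1], [0.1, 0.1, 1]], 2);
console.log(picked); // [0, 2]
```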
6. A Simple RAG Pipeline
import dotenv from "dotenv" // load the model API key from environment variables
import { ChatOpenAI } from "@langchain/openai"
import { HumanMessage } from "langchain";
dotenv.config()
const llm = new ChatOpenAI({
model: "qwen-plus",
apiKey: process.env.QWEN_API_KEY,
temperature: 0.7,
streamUsage: false, // whether to include token usage metadata when streaming
// maxTokens: 1000, // maximum tokens
// maxRetries: 6, // maximum number of retries
// timeout: undefined, // request timeout
logprobs: true,
configuration: {
baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1"
}
})
const questionMessage = new HumanMessage("张智华今天多少岁?年薪多少?")
const docs = await retriever.invoke(questionMessage.content); // query the knowledge base for the most relevant documents
const ragContext = docs[0].pageContent
llm.invoke([
questionMessage,
new HumanMessage(`请根据上下文回答问题:${ragContext}`),
])
.then((res) => {
console.log(res.content);
})
根据提供的上下文:
- 张智华今天**38岁**;
- 年薪为**1000万美元**。
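Stripped of provider details, section 6 is the classic retrieve-then-generate pattern. A synchronous sketch with a stubbed retriever and model (all names here are hypothetical; the real pipeline plugs in retriever.invoke and llm.invoke):

```typescript
// Function types for the two moving parts of a RAG pipeline.
type Retriever = (query: string) => string[];
type Model = (prompt: string) => string;

// Minimal retrieve-then-generate: fetch context, prepend it to the question,
// and let the model answer from the combined prompt.
function answerWithRag(question: string, retrieve: Retriever, model: Model): string {
  const context = retrieve(question).join("\n");
  const prompt = `Answer using only this context:\n${context}\n\nQuestion: ${question}`;
  return model(prompt);
}

// Stubs standing in for the vector store retriever and the chat model.
const stubRetrieve: Retriever = () => ["Zhang Zhihua is 38 years old."];
const stubModel: Model = (prompt) => (prompt.includes("38") ? "He is 38." : "Unknown.");

console.log(answerWithRag("How old is Zhang Zhihua?", stubRetrieve, stubModel)); // "He is 38."
```

Grounding the prompt in retrieved context is what lets the model answer questions about private data it was never trained on.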