In the previous article we covered how to interact with the model. The next problem is how to load data and search it in a vector database. Here, too, LangChain provides a set of tools.
Document Loaders
Loading a PDF
pip install langchain-community pymupdf
from dotenv import load_dotenv
load_dotenv('../.env')

from langchain_community.document_loaders import PyMuPDFLoader

def load_pdf():
    # Load the PDF into a list of Document objects
    loader = PyMuPDFLoader('../data/deepseek-v3-1-4.pdf')
    pages = loader.load_and_split()
    print(pages[0].page_content)
    return pages  # returned so later examples can reuse the pages

if __name__ == '__main__':
    load_pdf()
Output: the text of the first page of the PDF.
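Each element of pages is a LangChain Document: the text lives in page_content, and the loader also fills in metadata such as the source path and page number (the exact keys depend on the loader). A minimal sketch to check this:

def inspect_metadata():
    loader = PyMuPDFLoader('../data/deepseek-v3-1-4.pdf')
    pages = loader.load_and_split()
    # e.g. source file, page number, and other loader-specific fields
    print(pages[0].metadata)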
Loading a CSV
from langchain_community.document_loaders import CSVLoader

def load_csv():
    # Each row of the CSV becomes one Document
    loader = CSVLoader('../data/test.csv')
    data = loader.load()
    for record in data[:2]:
        print(record)
Output: the first two rows, each wrapped as a Document.
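CSVLoader also accepts a csv_args dict that is passed through to Python's csv.DictReader, which helps with non-default formats such as tab-separated files (the file name below is hypothetical):

def load_tsv():
    loader = CSVLoader(
        '../data/test.tsv',
        csv_args={'delimiter': '\t'},  # forwarded to csv.DictReader
    )
    for record in loader.load()[:2]:
        print(record)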
More loaders are covered in the official documentation.
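As one more example, a plain-text file can be loaded with TextLoader, which reads the whole file into a single Document (the path here is hypothetical):

from langchain_community.document_loaders import TextLoader

def load_text():
    loader = TextLoader('../data/notes.txt', encoding='utf-8')
    docs = loader.load()
    print(docs[0].page_content[:200])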
Document Splitting
pip install --upgrade langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_doc():
    pages = load_pdf()
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=200,        # maximum characters per chunk
        chunk_overlap=100,     # characters shared between adjacent chunks
        length_function=len,
        add_start_index=True,  # store each chunk's start offset in metadata
    )
    paragraphs = text_splitter.create_documents([pages[0].page_content])
    for para in paragraphs:
        print(para.page_content)
        print('-------')
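Note that create_documents takes raw strings, so the chunks above lose the per-page metadata. If you instead split the loaded Documents directly with split_documents, each chunk keeps its page's metadata; a sketch reusing load_pdf from above:

def split_pages():
    pages = load_pdf()
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=200,
        chunk_overlap=100,
    )
    # split_documents carries each page's metadata onto its chunks
    chunks = text_splitter.split_documents(pages)
    print(len(chunks), chunks[0].metadata)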
Writing to a Vector Store and Retrieval
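Besides the packages installed earlier, this example needs the FAISS bindings and the OpenAI integration package:
pip install faiss-cpu langchain-openai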
from dotenv import load_dotenv
load_dotenv('../.env')

from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyMuPDFLoader

# Load the document
loader = PyMuPDFLoader("../data/deepseek-v3-1-4.pdf")
pages = loader.load_and_split()

# Split the document
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=100,
    length_function=len,
    add_start_index=True,
)
texts = text_splitter.create_documents(
    [page.page_content for page in pages[:4]]
)

# Embed the chunks and build the index
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
db = FAISS.from_documents(texts, embeddings)

# Retrieve the top-3 results
retriever = db.as_retriever(search_kwargs={"k": 3})
docs = retriever.invoke("How strong is deepseek-v3 at coding?")

for doc in docs:
    print(doc.page_content)
    print('===============')
Output: the three chunks most similar to the query, separated by the divider line.
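The index built this way lives only in memory. In recent versions of langchain-community, FAISS's save_local/load_local can persist it to disk and reload it across runs (a minimal sketch continuing from the code above; the directory name is arbitrary, and allow_dangerous_deserialization just acknowledges that the pickle file being loaded is one we created ourselves):

# Persist the index
db.save_local('../data/faiss_index')

# Load it back later and query as before
db2 = FAISS.load_local(
    '../data/faiss_index',
    embeddings,
    allow_dangerous_deserialization=True,
)
print(db2.as_retriever(search_kwargs={"k": 3}).invoke("How strong is deepseek-v3 at coding?"))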
Note that LangChain only provides interface wrappers around vector databases; see: python.langchain.com/docs/integr…
There is not much else that needs special explanation in this part; following the official documentation is enough.