LangChain (Part 2): Data Connection Wrappers


In the previous article we covered how to interact with models. The next problem is how to load data and then retrieve it from a vector database. LangChain likewise provides a set of tools for this.

Document Loaders

Loading a PDF

pip install langchain-community pymupdf
from dotenv import load_dotenv

load_dotenv('../.env')
from langchain_community.document_loaders import PyMuPDFLoader

def load_pdf():
    loader = PyMuPDFLoader('../data/deepseek-v3-1-4.pdf')
    # load the PDF and split it into a list of Document objects
    pages = loader.load_and_split()
    print(pages[0].page_content)
    # return the pages so the splitting example below can reuse them
    return pages

if __name__ == '__main__':
    load_pdf()

Output: the plain text extracted from the first page of the PDF.
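Each element of pages is a Document object carrying both the page text and its metadata (source path, page number, and so on). A quick, self-contained way to check, reusing the same PDF path as above:

from langchain_community.document_loaders import PyMuPDFLoader

pages = PyMuPDFLoader('../data/deepseek-v3-1-4.pdf').load_and_split()
# metadata typically includes keys such as 'source' and 'page'
print(pages[0].metadata)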

Loading a CSV

from langchain_community.document_loaders import CSVLoader

def load_csv():
    loader = CSVLoader('../data/test.csv')
    # each row of the CSV becomes one Document
    data = loader.load()
    for record in data[:2]:
        print(record)

Output: the first two rows of the CSV, each printed as a Document.

For more loaders, see the official documentation. One more example is shown below.
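For instance, plain-text files can be loaded with TextLoader from the same package; the file path below is just a placeholder:

from langchain_community.document_loaders import TextLoader

def load_txt():
    # TextLoader reads the whole file into a single Document
    loader = TextLoader('../data/test.txt', encoding='utf-8')
    docs = loader.load()
    print(docs[0].page_content[:200])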

Document Splitting

pip install --upgrade langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_doc():

    pages = load_pdf()
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=200,        # maximum number of characters per chunk
        chunk_overlap=100,     # number of characters shared between adjacent chunks
        length_function=len,
        add_start_index=True,  # record each chunk's start offset in its metadata
    )

    paragraphs = text_splitter.create_documents([pages[0].page_content])
    for para in paragraphs:
        print(para.page_content)
        print('-------')

Output: the first page's text split into overlapping chunks of at most 200 characters, separated by dashed lines.
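To get a feel for what chunk_size and chunk_overlap actually do, you can also run the splitter directly on a plain string with split_text(); the sample sentence below is just made up for illustration.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=40,      # each chunk is at most 40 characters
    chunk_overlap=15,   # adjacent chunks may repeat up to ~15 characters
)

text = ("LangChain splits long documents into overlapping chunks "
        "so that context is not lost at chunk boundaries.")

# split_text() takes a plain string and returns a list of strings;
# the tail of one chunk reappears at the head of the next
for chunk in splitter.split_text(text):
    print(repr(chunk))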

Writing to a Vector Store and Retrieval

pip install langchain-openai faiss-cpu

from dotenv import load_dotenv

load_dotenv('../.env')
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyMuPDFLoader

# Load the document
loader = PyMuPDFLoader("../data/deepseek-v3-1-4.pdf")
pages = loader.load_and_split()

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=100,
    length_function=len,
    add_start_index=True,
)

texts = text_splitter.create_documents(
    [page.page_content for page in pages[:4]]
)

# Embed the chunks and write them into the vector store
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
db = FAISS.from_documents(texts, embeddings)

# Retrieve the top-3 results
retriever = db.as_retriever(search_kwargs={"k": 3})

docs = retriever.invoke("deepseek-v3代码能力怎么样")

for doc in docs:
    print(doc.page_content)
    print('===============')

Output: the three chunks most similar to the query.

Note that LangChain only provides interface wrappers around vector databases, not the databases themselves; see: python.langchain.com/docs/integr…
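Because the wrapper exposes a common VectorStore interface, switching backends mostly means changing one import and one constructor call. A minimal sketch using Chroma instead of FAISS, assuming chromadb is installed (pip install chromadb) and reusing the texts and embeddings built above:

from langchain_community.vectorstores import Chroma

# Build the index with Chroma instead of FAISS; the rest of the code is unchanged
db = Chroma.from_documents(texts, embeddings)
retriever = db.as_retriever(search_kwargs={"k": 3})

for doc in retriever.invoke("deepseek-v3代码能力怎么样"):
    print(doc.page_content)
    print('===============')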

There isn't much else that needs special explanation for this part; just follow the official documentation.