DeepSeek-R1:14B: What is LangChain?
Okay, so I need to figure out what LangChain is. I've heard the term before in the context of AI and machine learning, but I'm not exactly sure what it stands for or does. Let me try to break this down.
First, the name "LangChain" sounds a bit like "language chain." Maybe it's something related to processing languages? I know that language models are a big part of AI, especially with things like chatbots and NLP tasks. So maybe LangChain is a framework or tool used in that area.
I remember hearing about chains in the context of machine learning pipelines—like a series of steps where each model's output feeds into another. Maybe LangChain is something similar but focused on language models? Perhaps it allows you to connect different AI models or components together in a sequence to perform complex tasks.
Wait, I think I've heard that LangChain is used for building applications with large language models (LLMs). That makes sense because LLMs are powerful but need structured workflows to be effective. So maybe LangChain provides tools to chain together these models and other AI services to create more sophisticated systems.
I'm also thinking about how sometimes you need different parts of a system to interact, like using an LLM for text generation followed by another model for image recognition or something else. LangChain might facilitate that kind of integration, allowing each step to pass data to the next in a chain-like manner.
Another thought: Maybe LangChain is similar to other frameworks but focuses specifically on language tasks. For example, it could help with question answering, chat interfaces, or document analysis by linking multiple models together. Each model in the chain handles a specific part of the problem, and they work collaboratively to produce the final output.
I should also consider who developed LangChain. If it's from a reputable source like a well-known AI research lab or company, that would add credibility. I'm not sure about its origins, but if it's widely used, there might be documentation or examples available online to learn more.
Looking up some keywords in my mind—LangChain, LLMs, chains—I think it's a framework for creating applications by chaining together different AI components. It allows developers to build complex systems where each step is handled by an appropriate model, and the outputs are passed along to achieve the desired task.
So putting this all together, LangChain seems to be a tool or framework that enables the creation of sophisticated AI applications by connecting various language models and other AI services in a sequential manner. This chaining allows for more dynamic and powerful interactions compared to using a single model alone.
LangChain: A Framework for Building Sophisticated AI Applications
LangChain is a framework designed to build advanced AI applications, particularly leveraging large language models (LLMs). It facilitates the creation of complex systems by chaining together various AI components in a sequential manner. Each component in this chain can handle specific tasks, allowing for a structured and collaborative approach to problem-solving.
Key features of LangChain include:
- Chaining Models: It enables the integration of multiple AI models, where each model's output feeds into another, creating a pipeline of processing steps.
- Language Focus: Primarily focused on language tasks, it can be used for applications like question answering, chat interfaces, and document analysis.
- Integration with LLMs: LangChain allows the use of powerful LLMs alongside other AI services, enhancing the capability to handle diverse tasks.
- Modular Design: Each step in the chain is modular, making it easy to add or remove components based on specific needs.
LangChain is particularly useful for developers looking to create dynamic and sophisticated AI applications by orchestrating multiple models to work together seamlessly.
Overview
A LangChain application is mainly composed of three parts:
- LLM: the large language model is the core engine of the application.
- Prompt Templates: provide the instructions for the language model and control its output, so understanding how to construct prompts and prompt strategies is very important.
- Output parsers: convert the language model's output into a more usable format so that downstream steps in the workflow can consume it.
LLM
LangChain provides two model classes:
- llms: language models that take a string as input and return a string.
- chat_models: models that take a list of messages as input and return a single message.
They expose two corresponding methods:
- predict: takes a string and returns a string.
- predict_messages: takes a list of messages and returns a message.
text = "What would be a good company name for a company that makes colorful socks?"
llm.predict(text)
# >> Feetful of Fun
chat_model.predict(text)
# >> Socks O'Color
from langchain.schema import HumanMessage
text = "What would be a good company name for a company that makes colorful socks?"
messages = [HumanMessage(content=text)]
llm.predict_messages(messages)
# >> Feetful of Fun
chat_model.predict_messages(messages)
# >> Socks O'Color
A message is composed of two parts:
- content: the actual content of the message.
- role: the type of role the message comes from.
LangChain provides message types for the different roles (a short construction sketch follows this list):
- HumanMessage: from a human/user; usually consists only of content.
- AIMessage: from the AI/assistant, i.e. a message produced by the model; may carry extra information in additional_kwargs.
- SystemMessage: from the system; usually consists only of content.
- FunctionMessage: the result of a function call; besides role and content it carries a name field recording the function name.
- ToolMessage: the result of a tool call; besides role and content it carries a tool_call_id field identifying the tool call that produced the result.
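A minimal sketch of constructing these message types directly (the contents are made up for illustration, and exact import paths can vary slightly between LangChain versions):
from langchain.schema import AIMessage, FunctionMessage, HumanMessage, SystemMessage

messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="What is 2 + 2?"),
    AIMessage(content="4"),
    # FunctionMessage records which function produced the result
    FunctionMessage(name="calculator", content="4"),
]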
Prompt Templates
Usually user input is not passed to the LLM directly; instead it is interpolated into a pre-designed piece of text called a prompt template.
from langchain.prompts import PromptTemplate
prompt = PromptTemplate.from_template("What is a good name for a company that makes {product}?")
prompt.format(product="colorful socks")
What is a good name for a company that makes colorful socks?
Prompt templates can also combine multiple messages.
from langchain.prompts.chat import (
ChatPromptTemplate,
SystemMessagePromptTemplate,
HumanMessagePromptTemplate,
)
template = "You are a helpful assistant that translates {input_language} to {output_language}."
system_message_prompt = SystemMessagePromptTemplate.from_template(template)
human_template = "{text}"
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)
chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])
chat_prompt.format_messages(input_language="English", output_language="French", text="I love programming.")
[
SystemMessage(content="You are a helpful assistant that translates English to French.", additional_kwargs={}),
HumanMessage(content="I love programming.")
]
Output parsers
Output parsers convert the LLM's output into a format that downstream steps in the workflow can consume. The main kinds are:
- LLM text -> structured data, e.g. JSON
- ChatMessage -> string
- extra information returned by a function call -> string
from langchain.schema import BaseOutputParser
class CommaSeparatedListOutputParser(BaseOutputParser):
"""Parse the output of an LLM call to a comma-separated list."""
def parse(self, text: str):
"""Parse the output of an LLM call."""
return text.strip().split(", ")
CommaSeparatedListOutputParser().parse("hi, bye")
# >> ['hi', 'bye']
LLMChain
LLMChain combines an LLM, a prompt template, and an output parser into a single workflow.
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
ChatPromptTemplate,
SystemMessagePromptTemplate,
HumanMessagePromptTemplate,
)
from langchain.chains import LLMChain
from langchain.schema import BaseOutputParser
class CommaSeparatedListOutputParser(BaseOutputParser):
"""Parse the output of an LLM call to a comma-separated list."""
def parse(self, text: str):
"""Parse the output of an LLM call."""
return text.strip().split(", ")
template = """You are a helpful assistant who generates comma separated lists.
A user will pass in a category, and you should generate 5 objects in that category in a comma separated list.
ONLY return a comma separated list, and nothing more."""
system_message_prompt = SystemMessagePromptTemplate.from_template(template)
human_template = "{text}"
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)
chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])
chain = LLMChain(
llm=ChatOpenAI(),
prompt=chat_prompt,
output_parser=CommaSeparatedListOutputParser()
)
chain.run("colors")
# >> ['red', 'blue', 'green', 'yellow', 'orange']
Language Models
LangChain provides two kinds of model interfaces:
- llms: language models that take a string as input and return a string.
- chat_models: models that take a list of messages as input and return a single message.
llms are plain text-in, text-out generation models. chat_models are also backed by large language models but add conversational capability: they take a list of messages as input and support role definitions.
The two kinds of models are largely interchangeable; both expose the predict and predict_messages methods, but it is recommended to use predict with llms and predict_messages with chat_models.
Custom LLM
There are many kinds of language models: open source, commercial, and models you train or fine-tune yourself. To use a model that LangChain does not support out of the box, you need to write a custom LLM wrapper and plug it into LangChain.
A custom LLM wrapper class must implement the _call method and the _identifying_params property:
- _call takes a prompt string plus an optional list of stop words and returns a string.
- _identifying_params is a dict used when printing information about the class; it is optional.
from typing import Any, List, Mapping, Optional

from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.llms.base import LLM

class CustomLLM(LLM):
n: int
@property
def _llm_type(self) -> str:
return "custom"
def _call(
self,
prompt: str,
stop: Optional[List[str]] = None,
run_manager: Optional[CallbackManagerForLLMRun] = None,
**kwargs: Any,
) -> str:
if stop is not None:
raise ValueError("stop kwargs are not permitted.")
return prompt[: self.n]
@property
def _identifying_params(self) -> Mapping[str, Any]:
"""Get the identifying parameters."""
return {"n": self.n}
llm = CustomLLM(n=10)
llm("This is a foobar thing")
'This is a '
Fake LLM
FakeListLLM is a mock LLM for testing; it makes it easy to simulate how an LLM response would be processed.
from langchain.llms.fake import FakeListLLM
from langchain.agents import load_tools
from langchain.agents import initialize_agent
from langchain.agents import AgentType
tools = load_tools(["python_repl"])
responses = ["Action: Python REPL\nAction Input: print(2 + 2)", "Final Answer: 4"]
llm = FakeListLLM(responses=responses)
agent = initialize_agent(
tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
agent.run("whats 2 + 2")
> Entering new AgentExecutor chain...
Action: Python REPL
Action Input: print(2 + 2)
Observation: 4
Thought:Final Answer: 4
> Finished chain.
'4'
Human input LLM
HumanInputLLM is a pseudo-LLM for debugging and teaching: the LLM call is mocked so that a human can stand in for the model and respond to the prompts.
from langchain.llms.human import HumanInputLLM
from langchain.agents import load_tools
from langchain.agents import initialize_agent
from langchain.agents import AgentType
tools = load_tools(["wikipedia"])
llm = HumanInputLLM(
prompt_func=lambda prompt: print(
f"\n===PROMPT====\n{prompt}\n=====END OF PROMPT======"
)
)
agent = initialize_agent(
tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)
agent.run("What is 'Bocchi the Rock!'?")
Streaming
Some LLMs support streaming responses: instead of waiting for the full answer to be generated, results are returned incrementally as they are produced.
To utilize streaming, use a CallbackHandler that implements on_llm_new_token. In this example, we are using StreamingStdOutCallbackHandler.
from langchain.llms import OpenAI
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
llm = OpenAI(streaming=True, callbacks=[StreamingStdOutCallbackHandler()], temperature=0)
resp = llm("Write me a song about sparkling water.")
Chat Models
Chat models are a variant of language models. They are also backed by LLMs, but their inputs and outputs differ: language models are text in, text out, while chat models take and return ChatMessage objects.
The ChatMessage kinds currently supported by LangChain are AIMessage, HumanMessage, SystemMessage, FunctionMessage, and ChatMessage.
Chat models implement the Runnable interface, the basic building block of the LangChain Expression Language (LCEL). This means they support invoke, ainvoke, stream, astream, batch, abatch, and astream_log calls.
Chat models accept List[BaseMessage] as inputs, or objects which can be coerced to messages, including str (converted to HumanMessage) and PromptValue.
from langchain.chat_models import ChatOpenAI
from langchain.schema.messages import HumanMessage, SystemMessage

chat = ChatOpenAI()
messages = [
SystemMessage(content="You're a helpful assistant"),
HumanMessage(content="What is the purpose of model regularization?")
]
chat.invoke(messages)
AIMessage(content="The purpose of model regularization is to prevent overfitting in machine learning models. Overfitting occurs when a model becomes too complex and starts to fit the noise in the training data, leading to poor generalization on unseen data. Regularization techniques introduce additional constraints or penalties to the model's objective function, discouraging it from becoming overly complex and promoting simpler and more generalizable models. Regularization helps to strike a balance between fitting the training data well and avoiding overfitting, leading to better performance on new, unseen data.")
Batch calls, richer outputs
You can go one step further and generate completions for multiple sets of messages using generate. This returns an LLMResult with an additional message parameter, which includes extra information about each generation beyond the returned message (e.g. the finish reason) and about the full API call (e.g. total tokens used).
batch_messages = [
[
SystemMessage(content="You are a helpful assistant that translates English to French."),
HumanMessage(content="I love programming.")
],
[
SystemMessage(content="You are a helpful assistant that translates English to French."),
HumanMessage(content="I love artificial intelligence.")
],
]
result = chat.generate(batch_messages)
result
result.llm_output
LLMResult(generations=[[ChatGeneration(text="J'adore programmer.", generation_info={'finish_reason': 'stop'}, message=AIMessage(content="J'adore programmer."))], [ChatGeneration(text="J'adore l'intelligence artificielle.", generation_info={'finish_reason': 'stop'}, message=AIMessage(content="J'adore l'intelligence artificielle."))]], llm_output={'token_usage': {'prompt_tokens': 53, 'completion_tokens': 18, 'total_tokens': 71}, 'model_name': 'gpt-3.5-turbo'}, run=[RunInfo(run_id=UUID('077917a9-026c-47c4-b308-77b37c3a3bfa')), RunInfo(run_id=UUID('0a70a0bf-c599-4f51-932a-c7d42202c984'))])
{'token_usage': {'prompt_tokens': 53,
'completion_tokens': 18,
'total_tokens': 71},
'model_name': 'gpt-3.5-turbo'}
Caching
LangChain provides an optional caching layer for LLMs. It remembers previous responses and reduces the number of LLM API calls, and caching can be turned on for some nodes and off for others within the same application.
from langchain.globals import set_llm_cache
from langchain.llms import OpenAI
# To make the caching really obvious, lets use a slower model.
llm = OpenAI(model_name="text-davinci-002", n=2, best_of=2)
# In Memory Cache
from langchain.cache import InMemoryCache
set_llm_cache(InMemoryCache())
# The first time, it is not yet in cache, so it should take longer
llm.predict("Tell me a joke")
# SQLite Cache
# (remove any existing cache file first, e.g. run `rm .langchain.db` in a shell)
# We can do the same thing with a SQLite cache
from langchain.cache import SQLiteCache
set_llm_cache(SQLiteCache(database_path=".langchain.db"))
# The first time, it is not yet in cache, so it should take longer
llm.predict("Tell me a joke")
Prompt Templates
Prompt templates are predefined templates for generating prompts for language models. They can contain instructions, few-shot examples, context, and the question itself.
Language models expect the prompt to be either a string or a list of ChatMessage objects.
PromptTemplate
A template that generates a string prompt.
from langchain.prompts import PromptTemplate
prompt_template = PromptTemplate.from_template(
"Tell me a {adjective} joke about {content}."
)
prompt_template.format(adjective="funny", content="chickens")
PromptTemplate by default uses Python f-string as its template format. However, it can also use other formats like jinja2, specified through the template_format argument.
from langchain.prompts import PromptTemplate
jinja2_template = "Tell me a {{ adjective }} joke about {{ content }}"
prompt = PromptTemplate.from_template(jinja2_template, template_format="jinja2")
prompt.format(adjective="funny", content="chickens")
# Output: Tell me a funny joke about chickens.
Custom PromptTemplate
LangChain ships with many kinds of prompt templates, and it also lets developers define custom string prompt templates for specific tasks.
Creating a custom string prompt template requires two things:
- an input_variables attribute listing the variables the template expects;
- a format method that receives keyword arguments for those input_variables and returns the formatted prompt string.
import inspect

from langchain.prompts import StringPromptTemplate
from pydantic import BaseModel, validator

def get_source_code(function_name):
    # Get the source code of the function
    return inspect.getsource(function_name)
PROMPT = """\
Given the function name and source code, generate an English language explanation of the function.
Function Name: {function_name}
Source Code:
{source_code}
Explanation:
"""
class FunctionExplainerPromptTemplate(StringPromptTemplate, BaseModel):
"""A custom prompt template that takes in the function name as input, and formats the prompt template to provide the source code of the function."""
@validator("input_variables")
def validate_input_variables(cls, v):
"""Validate that the input variables are correct."""
if len(v) != 1 or "function_name" not in v:
raise ValueError("function_name must be the only input_variable.")
return v
def format(self, **kwargs) -> str:
# Get the source code of the function
source_code = get_source_code(kwargs["function_name"])
# Generate the prompt to be sent to the language model
prompt = PROMPT.format(
function_name=kwargs["function_name"].__name__, source_code=source_code
)
return prompt
def _prompt_type(self):
return "function-explainer"
Few-shot PromptTemplate
A prompt template that includes few-shot examples, useful for tasks where examples guide the model. Building one takes a few steps:
- Create the example set. Each example is a dict whose keys are the input variables and whose values are the concrete contents for those variables.
from langchain.prompts.few_shot import FewShotPromptTemplate
from langchain.prompts.prompt import PromptTemplate
examples = [
{
"question": "Who lived longer, Muhammad Ali or Alan Turing?",
"answer":
"""
Are follow up questions needed here: Yes.
Follow up: How old was Muhammad Ali when he died?
Intermediate answer: Muhammad Ali was 74 years old when he died.
Follow up: How old was Alan Turing when he died?
Intermediate answer: Alan Turing was 41 years old when he died.
So the final answer is: Muhammad Ali
"""
},
    # ... more examples ...
]
- Create an example formatter, an object that converts an example into a string; a PromptTemplate works for this.
example_prompt = PromptTemplate(input_variables=["question", "answer"], template="Question: {question}\n{answer}")
print(example_prompt.format(**examples[0]))
- Use the example set and the example formatter to create a FewShotPromptTemplate instance.
prompt = FewShotPromptTemplate(
examples=examples,
example_prompt=example_prompt,
suffix="Question: {input}",
input_variables=["input"]
)
print(prompt.format(input="Who was the father of Mary Ball Washington?"))
- Instead of passing the example set directly, you can build an ExampleSelector from it. A SemanticSimilarityExampleSelector uses an embedding model to measure the similarity between the actual input and the examples and selects the closest ones.
from langchain.prompts.example_selector import SemanticSimilarityExampleSelector
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
example_selector = SemanticSimilarityExampleSelector.from_examples(
# This is the list of examples available to select from.
examples,
# This is the embedding class used to produce embeddings which are used to measure semantic similarity.
OpenAIEmbeddings(),
# This is the VectorStore class that is used to store the embeddings and do a similarity search over.
Chroma,
# This is the number of examples to produce.
k=1
)
# Select the most similar example to the input.
question = "Who was the father of Mary Ball Washington?"
selected_examples = example_selector.select_examples({"question": question})
print(f"Examples most similar to the input: {question}")
for example in selected_examples:
print("\n")
for k, v in example.items():
print(f"{k}: {v}")
prompt = FewShotPromptTemplate(
example_selector=example_selector,
example_prompt=example_prompt,
suffix="Question: {input}",
input_variables=["input"]
)
print(prompt.format(input="Who was the father of Mary Ball Washington?"))
ChatPromptTemplate
ChatPromptTemplate is the prompt template used with chat_models; it holds a collection of messages. A chat model can use a single MessagePromptTemplate directly, or several MessagePromptTemplates can be combined into a ChatPromptTemplate. Its format_prompt method returns a PromptValue, which can be turned into a string or into message objects.
Creating a ChatPromptTemplate instance from 2-tuples:
from langchain.prompts import ChatPromptTemplate
template = ChatPromptTemplate.from_messages([
("system", "You are a helpful AI bot. Your name is {name}."),
("human", "Hello, how are you doing?"),
("ai", "I'm doing well, thanks!"),
("human", "{user_input}"),
])
messages = template.format_messages(
name="Bob",
user_input="What is your name?"
)
Creating a ChatPromptTemplate from MessagePromptTemplate and BaseMessage objects:
from langchain.prompts import ChatPromptTemplate
from langchain.prompts.chat import SystemMessage, HumanMessagePromptTemplate
template = ChatPromptTemplate.from_messages(
[
SystemMessage(
content=(
"You are a helpful assistant that re-writes the user's text to "
"sound more upbeat."
)
),
HumanMessagePromptTemplate.from_template("{text}"),
]
)
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI()
llm(template.format_messages(text='i dont like eating tasty things.'))
MessagePromptTemplate
LangChain provides several kinds of MessagePromptTemplate; the most common are AIMessagePromptTemplate, SystemMessagePromptTemplate, and HumanMessagePromptTemplate.
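A minimal sketch of formatting one of these into a single message (the template text is made up for illustration):
from langchain.prompts import AIMessagePromptTemplate

ai_prompt = AIMessagePromptTemplate.from_template("Sure, here is a {adjective} answer.")
ai_prompt.format(adjective="short")
# >> AIMessage(content='Sure, here is a short answer.')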
Few-Shot ChatPromptTemplate
For few-shot prompting with chat_models, LangChain provides the FewShotChatMessagePromptTemplate object. It has two basic parts:
- the example set: each element is a dict representing one example;
- the example formatter: its format_messages method turns an example into messages, typically a human-message input paired with an AI-message output, or a human message plus a function-call message.
from langchain.prompts import (
FewShotChatMessagePromptTemplate,
ChatPromptTemplate,
)
# Example set
examples = [
{"input": "2+2", "output": "4"},
{"input": "2+3", "output": "5"},
]
# Example formatter
# This is a prompt template used to format each individual example.
example_prompt = ChatPromptTemplate.from_messages(
[
("human", "{input}"),
("ai", "{output}"),
]
)
few_shot_prompt = FewShotChatMessagePromptTemplate(
example_prompt=example_prompt,
examples=examples,
)
print(few_shot_prompt.format())
Human: 2+2
AI: 4
Human: 2+3
AI: 5
The example set can be replaced with an example selector, which picks examples similar to the input. For instance, SemanticSimilarityExampleSelector compares similarity via vector embeddings, so it needs a vector store.
from langchain.prompts import SemanticSimilarityExampleSelector
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
# Example set
examples = [
{"input": "2+2", "output": "4"},
{"input": "2+3", "output": "5"},
{"input": "2+4", "output": "6"},
{"input": "What did the cow say to the moon?", "output": "nothing at all"},
{
"input": "Write me a poem about the moon",
"output": "One for the moon, and one for me, who are we to talk about the moon?",
},
]
# Vector store
to_vectorize = [" ".join(example.values()) for example in examples]
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_texts(to_vectorize, embeddings, metadatas=examples)
# Build the example selector
example_selector = SemanticSimilarityExampleSelector(
vectorstore=vectorstore,
k=2,
)
# The prompt template will load examples by passing the input to the `select_examples` method
example_selector.select_examples({"input": "horse"})
# Build the few-shot ChatPromptTemplate
from langchain.prompts import (
FewShotChatMessagePromptTemplate,
ChatPromptTemplate,
)
# Define the few-shot prompt.
few_shot_prompt = FewShotChatMessagePromptTemplate(
# The input variables select the values to pass to the example_selector
input_variables=["input"],
example_selector=example_selector,
# Define how each example will be formatted.
# In this case, each example will become 2 messages:
# 1 human, and 1 AI
example_prompt=ChatPromptTemplate.from_messages(
[("human", "{input}"), ("ai", "{output}")]
),
)
Prompt template output formats (a short sketch follows this list):
- string: chat_prompt.format, chat_prompt.format_prompt().to_string()
- list of message objects: chat_prompt.format_prompt().to_messages()
- ChatPromptValue: chat_prompt.format_prompt()
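A minimal sketch of the three output forms, assuming the English-to-French translation chat_prompt defined earlier (the one with input_language, output_language, and text variables):
prompt_value = chat_prompt.format_prompt(
    input_language="English", output_language="French", text="I love programming."
)
prompt_value.to_string()    # plain string
prompt_value.to_messages()  # list of message objects
prompt_value                # ChatPromptValue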
Feature Store
A feature store is a concept from traditional machine learning, used to keep data fresh and relevant. When putting an LLM into a real product, you usually want to inject real-time, user-specific information so that the application feels personalized. A feature store is a good way to fetch that data, and LangChain provides ways to combine it with LLMs.
- Feast: a popular open-source feature store.
- Tecton (tecton.ai): a fully managed feature platform built to orchestrate the complete ML feature lifecycle, from transformation to online serving, with enterprise-grade SLAs.
- Featureform: an open-source, enterprise-grade feature store. Featureform lets you define feature transformations on your own infrastructure, such as Spark, or locally.
Output parsers
An LLM's output is usually a string, and a chat model's output is a Message. In some scenarios the string or Message contains information in a specific format, which can be converted into structured data.
Output parsers are responsible for converting model output into structured data.
Turning model output into structured data involves the following steps:
- Get format instructions: this method returns a description of the expected format, which is added to the prompt to steer the language model's generation:
prompt = PromptTemplate(
template="Answer the user query.\n{format_instructions}\n{query}\n",
input_variables=["query"],
partial_variables={"format_instructions": parser.get_format_instructions()},
)
- Parse: convert the language model's output into the specified structured data.
LangChain provides many kinds of output parsers, each described by:
- Name: the parser's name;
- Supports Streaming: whether streaming output is supported;
- Has Format Instructions: whether it provides format instructions;
- Calls LLM: whether it calls an LLM, e.g. to fix malformed output;
- Input Type: the expected input, usually strings or messages;
- Output Type: the type the parser returns;
- Description: what the parser does.
Some of the available output parsers (a concrete example follows the list):
- JSON: returns a JSON object
- XML: returns a dict
- CSV: returns a list
- Structured: returns structured data as a dict
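As a concrete illustration, here is a minimal sketch using the built-in PydanticOutputParser to get JSON-like structured output (the Joke schema and query are made up for this example):
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from pydantic import BaseModel, Field

class Joke(BaseModel):
    setup: str = Field(description="the question that sets up the joke")
    punchline: str = Field(description="the answer that resolves the joke")

parser = PydanticOutputParser(pydantic_object=Joke)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# After the model responds, parse the raw text into a Joke instance:
# joke = parser.parse(model_output)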
Chain
A Chain combines calls to components using fixed logic, simplifying how LLMs are used. A chain can also contain other chains.
chain = LLMChain(llm=chat, prompt=chat_prompt)
chain.run(input_language="English", output_language="French", text="I love programming.")
LLMChain is the most basic chain. Its parameters are a language model and a PromptTemplate: it formats the user input with the PromptTemplate to produce a prompt, passes the prompt to the LLM, and returns the generated result.
Custom Chain
To define a custom chain, subclass Chain and implement the _call and _acall methods and the input_keys and output_keys properties.
from __future__ import annotations
from typing import Any, Dict, List, Optional
from pydantic import Extra
from langchain.schema.language_model import BaseLanguageModel
from langchain.callbacks.manager import (
AsyncCallbackManagerForChainRun,
CallbackManagerForChainRun,
)
from langchain.chains.base import Chain
from langchain.prompts.base import BasePromptTemplate
class MyCustomChain(Chain):
"""
An example of a custom chain.
"""
prompt: BasePromptTemplate
"""Prompt object to use."""
llm: BaseLanguageModel
output_key: str = "text" #: :meta private:
class Config:
"""Configuration for this pydantic object."""
extra = Extra.forbid
arbitrary_types_allowed = True
@property
def input_keys(self) -> List[str]:
"""Will be whatever keys the prompt expects.
:meta private:
"""
return self.prompt.input_variables
@property
def output_keys(self) -> List[str]:
"""Will always return text key.
:meta private:
"""
return [self.output_key]
def _call(
self,
inputs: Dict[str, Any],
run_manager: Optional[CallbackManagerForChainRun] = None,
) -> Dict[str, str]:
# Your custom chain logic goes here
# This is just an example that mimics LLMChain
prompt_value = self.prompt.format_prompt(**inputs)
# Whenever you call a language model, or another chain, you should pass
# a callback manager to it. This allows the inner run to be tracked by
# any callbacks that are registered on the outer run.
# You can always obtain a callback manager for this by calling
# `run_manager.get_child()` as shown below.
response = self.llm.generate_prompt(
[prompt_value], callbacks=run_manager.get_child() if run_manager else None
)
# If you want to log something about this run, you can do so by calling
# methods on the `run_manager`, as shown below. This will trigger any
# callbacks that are registered for that event.
if run_manager:
run_manager.on_text("Log something about this run")
return {self.output_key: response.generations[0][0].text}
async def _acall(
self,
inputs: Dict[str, Any],
run_manager: Optional[AsyncCallbackManagerForChainRun] = None,
) -> Dict[str, str]:
# Your custom chain logic goes here
# This is just an example that mimics LLMChain
prompt_value = self.prompt.format_prompt(**inputs)
# Whenever you call a language model, or another chain, you should pass
# a callback manager to it. This allows the inner run to be tracked by
# any callbacks that are registered on the outer run.
# You can always obtain a callback manager for this by calling
# `run_manager.get_child()` as shown below.
response = await self.llm.agenerate_prompt(
[prompt_value], callbacks=run_manager.get_child() if run_manager else None
)
# If you want to log something about this run, you can do so by calling
# methods on the `run_manager`, as shown below. This will trigger any
# callbacks that are registered for that event.
if run_manager:
await run_manager.on_text("Log something about this run")
return {self.output_key: response.generations[0][0].text}
@property
def _chain_type(self) -> str:
return "my_custom_chain"
from langchain.callbacks.stdout import StdOutCallbackHandler
from langchain.chat_models.openai import ChatOpenAI
from langchain.prompts.prompt import PromptTemplate
chain = MyCustomChain(
prompt=PromptTemplate.from_template("tell us a joke about {topic}"),
llm=ChatOpenAI(),
)
chain.run({"topic": "callbacks"}, callbacks=[StdOutCallbackHandler()])
Chain interfaces
Synchronous interfaces: _call, run
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
llm = OpenAI(temperature=0.9)
prompt = PromptTemplate(
input_variables=["company", "product"],
template="What is a good name for {company} that makes {product}?",
)
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run({
'company': "ABC Startup",
'product': "colorful socks"
}))
Asynchronous interfaces: arun, apredict, acall
A simple example:
import asyncio
import time
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
async def async_generate(chain):
resp = await chain.arun(product="toothpaste")
print(resp)
async def generate_concurrently():
llm = OpenAI(temperature=0.9)
prompt = PromptTemplate(
input_variables=["product"],
template="What is a good name for a company that makes {product}?",
)
chain = LLMChain(llm=llm, prompt=prompt)
tasks = [async_generate(chain) for _ in range(5)]
await asyncio.gather(*tasks)
s = time.perf_counter()
# If running this outside of Jupyter, use asyncio.run(generate_concurrently())
await generate_concurrently()
elapsed = time.perf_counter() - s
print("\033[1m" + f"Concurrent executed in {elapsed:0.2f} seconds." + "\033[0m")
def generate_serially():
    llm = OpenAI(temperature=0.9)
    prompt = PromptTemplate(
        input_variables=["product"],
        template="What is a good name for a company that makes {product}?",
    )
    chain = LLMChain(llm=llm, prompt=prompt)
    for _ in range(5):
        print(chain.run(product="toothpaste"))

s = time.perf_counter()
generate_serially()
elapsed = time.perf_counter() - s
print("\033[1m" + f"Serial executed in {elapsed:0.2f} seconds." + "\033[0m")
Other chain execution interfaces
Besides _call and run, Chain provides several other execution interfaces.
apply allows you to run the chain against a list of inputs:
input_list = [{"product": "socks"}, {"product": "computer"}, {"product": "shoes"}]
llm_chain.apply(input_list)
[{'text': '\n\nSocktastic!'},
{'text': '\n\nTechCore Solutions.'},
{'text': '\n\nFootwear Factory.'}]
generate is similar to apply, except it returns an LLMResult instead of a string. An LLMResult often contains useful generation info such as token usage and the finish reason.
llm_chain.generate(input_list)
LLMResult(generations=[[Generation(text='\n\nSocktastic!', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text='\n\nTechCore Solutions.', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text='\n\nFootwear Factory.', generation_info={'finish_reason': 'stop', 'logprobs': None})]], llm_output={'token_usage': {'completion_tokens': 19, 'prompt_tokens': 36, 'total_tokens': 55}, 'model_name': 'text-davinci-003'}, run=[RunInfo(run_id=UUID('9a423a43-6d35-4e8f-9aca-cacfc8e0dc49')), RunInfo(run_id=UUID('a879c077-b521-461c-8f29-ba63adfc327c')), RunInfo(run_id=UUID('40b892fa-e8c2-47d0-a309-4f7a4ed5b64a'))])
predict is similar to run, except that the input keys are specified as keyword arguments instead of a Python dict.
# Single input example
llm_chain.predict(product="colorful socks")
# Multiple inputs example
template = """Tell me a {adjective} joke about {subject}."""
prompt = PromptTemplate(template=template, input_variables=["adjective", "subject"])
llm_chain = LLMChain(prompt=prompt, llm=OpenAI(temperature=0))
llm_chain.predict(adjective="sad", subject="ducks")
predict_and_parse
If you want to apply the output parser attached to the prompt, use predict_and_parse instead of predict and apply_and_parse instead of apply.
from langchain.output_parsers import CommaSeparatedListOutputParser
output_parser = CommaSeparatedListOutputParser()
template = """List all the colors in a rainbow"""
prompt = PromptTemplate(
template=template, input_variables=[], output_parser=output_parser
)
llm_chain = LLMChain(prompt=prompt, llm=llm)
llm_chain.predict_and_parse()
['Red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet']
Agents
The key idea behind agents is to use an LLM to process the input, parse the LLM's output, and pick which of the available tools to run. Agents use the language model to decide which actions to take and in what order, whereas chains have fixed logic and an unchanging execution flow.
Some important terms (a short construction sketch follows this list):
- AgentAction: a dataclass describing the action the agent should take; it has two attributes, the tool name and the tool_input.
- AgentFinish: a dataclass signalling that the agent is done and should return to the user. It carries a dict-typed return_values parameter, usually containing an output key with a string value.
- intermediate_steps: the outputs of the previous agent actions, which the agent needs in order to know what has already been done and to plan the next step. Its type is List[Tuple[AgentAction, Any]]; the value is declared as Any for flexibility, but in practice it is usually a string.
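A minimal sketch of constructing these objects directly (the tool name and values are made up for illustration):
from langchain.schema import AgentAction, AgentFinish

action = AgentAction(tool="Search", tool_input="weather in Beijing", log="Looking up the weather")
finish = AgentFinish(return_values={"output": "It is sunny in Beijing."}, log="Done")
intermediate_steps = [(action, "Sunny, 25°C")]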
Agent overview
An agent typically consists of an LLM, a prompt, and an output parser. Each agent type uses a specific prompt to steer the LLM into producing output in a particular format, which the output parser then turns into the next AgentAction.
The agent's input is a key-value map that must contain at least an "intermediate_steps" key, holding the outputs of the previous AgentActions. The agent's PromptTemplate is usually responsible for turning this into a prompt that is fed to the LLM.
The agent's output can be the next AgentAction or the final AgentFinish; its type is Union[AgentAction, List[AgentAction], AgentFinish]. The AgentOutputParser converts the raw LLM output into one of these types.
AgentExecutor
The agent itself is static information; the AgentExecutor is what actually runs it. It asks the agent for the next AgentAction, executes it, feeds the resulting observation back to the agent to get the next AgentAction or AgentFinish, and repeats. It also handles edge cases (a construction sketch follows this list):
- the agent selects a tool that is not defined;
- a tool returns an error;
- the LLM output cannot be parsed by the output parser;
- logging and tracing of the execution, to stdout or to LangSmith.
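Assuming an agent and its tools have already been built (for example the custom agent shown later), an AgentExecutor might be wired up roughly like this; handle_parsing_errors covers the unparseable-output case above:
from langchain.agents import AgentExecutor

agent_executor = AgentExecutor(
    agent=agent,                 # an agent such as the FakeAgent defined later
    tools=tools,                 # the tools the agent may call
    verbose=True,                # log each step to stdout
    handle_parsing_errors=True,  # recover when the LLM output cannot be parsed
)
agent_executor.run("What is the weather in Beijing?")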
Agent Types
- AgentType.ZERO_SHOT_REACT_DESCRIPTION: uses the ReAct framework and decides which tool to use based solely on the tools' descriptions. Any number of tools can be provided, and each tool must have a description.
- AgentType.REACT_DOCSTORE: uses the ReAct framework to interact with a document store. Exactly two tools must be provided, and they must be named Search and Lookup: the Search tool searches for a document, and the Lookup tool looks up a term within the most recently found document.
- AgentType.SELF_ASK_WITH_SEARCH: uses a single tool that must be named Intermediate Answer; this tool should be able to look up factual answers to questions. This agent corresponds to the original self-ask-with-search paper, where the Google search API was provided as the tool.
- AgentType.CONVERSATIONAL_REACT_DESCRIPTION: designed for conversational settings. Its prompt is written to make the agent helpful in a conversation; it uses the ReAct framework to decide which tool to use and uses memory to remember previous conversation turns.
Tool
A Tool wraps a capability the agent can call; it is the interface through which the LLM reaches external functionality. It consists of:
- the name of the tool
- a description of the tool
- a JSON schema for the tool's input
- the function to call
- whether the tool's result should be returned directly.
This information helps build action-taking systems: the name, description, and JSON schema go into the prompt to steer the LLM toward output in a specific format, while the function is what actually executes the action.
LangChain ships with many built-in tools.
Custom Tool
There are two ways to define a custom tool:
- using the @tool decorator:
from langchain.tools import tool

@tool
def multiply(a: int, b: int) -> int:
"""Multiply two numbers."""
return a * b
- subclassing BaseTool and overriding the _run and _arun methods:
from typing import Optional, Type

from pydantic import BaseModel, Field

from langchain.callbacks.manager import (
    AsyncCallbackManagerForToolRun,
    CallbackManagerForToolRun,
)
from langchain.tools import BaseTool
class CalculatorInput(BaseModel):
a: int = Field(description="first number")
b: int = Field(description="second number")
class CustomCalculatorTool(BaseTool):
name = "Calculator"
description = "useful for when you need to answer questions about math"
args_schema: Type[BaseModel] = CalculatorInput
return_direct: bool = True
def _run(
self, a: int, b: int, run_manager: Optional[CallbackManagerForToolRun] = None
) -> str:
"""Use the tool."""
return a * b
async def _arun(
self,
a: int,
b: int,
run_manager: Optional[AsyncCallbackManagerForToolRun] = None,
) -> str:
"""Use the tool asynchronously."""
raise NotImplementedError("Calculator does not support async")
Toolkits
A task usually needs several tools working together. LangChain provides toolkits, which bundle multiple tools that together accomplish a specific objective.
For example, the GitHub toolkit has a tool for searching through GitHub issues, a tool for reading a file, a tool for commenting, etc.
# Initialize a toolkit
toolkit = ExampleToolkit(...)
# Get list of tools
tools = toolkit.get_tools()
# Create agent
agent = create_agent_method(llm, tools, prompt)
Custom Agent
An agent consists of two parts:
- tools: the tools the agent is allowed to use;
- the agent class itself: this decides which action to take.
from typing import List, Tuple, Any, Union

from langchain.agents import BaseSingleActionAgent
from langchain.schema import AgentAction, AgentFinish

class FakeAgent(BaseSingleActionAgent):
"""Fake Custom Agent."""
@property
def input_keys(self):
return ["input"]
def plan(
self, intermediate_steps: List[Tuple[AgentAction, str]], **kwargs: Any
) -> Union[AgentAction, AgentFinish]:
"""Given input, decided what to do.
Args:
intermediate_steps: Steps the LLM has taken to date,
along with observations
**kwargs: User inputs.
Returns:
Action specifying what tool to use.
"""
return AgentAction(tool="Search", tool_input=kwargs["input"], log="")
async def aplan(
self, intermediate_steps: List[Tuple[AgentAction, str]], **kwargs: Any
) -> Union[AgentAction, AgentFinish]:
"""Given input, decided what to do.
Args:
intermediate_steps: Steps the LLM has taken to date,
along with observations
**kwargs: User inputs.
Returns:
Action specifying what tool to use.
"""
return AgentAction(tool="Search", tool_input=kwargs["input"], log="")
LCEL
LCEL (the LangChain Expression Language) makes it easy to build complex chains from basic components, and it supports features such as the following out of the box:
- streaming,
- async calls,
- batching,
- parallelization,
- retries,
- fallbacks,
- tracing,
- and more.
A simple example of how LCEL composes a prompt, a model, and an output parser:
from langchain.chat_models import ChatAnthropic
from langchain.prompts import ChatPromptTemplate
from langchain.schema import StrOutputParser
model = ChatAnthropic()
prompt = ChatPromptTemplate.from_messages(
[
(
"system",
"You're a very knowledgeable historian who provides accurate and eloquent answers to historical questions.",
),
("human", "{question}"),
]
)
runnable = prompt | model | StrOutputParser()
for chunk in runnable.stream({"question": "How did Mansa Musa accumulate his wealth?"}):
print(chunk, end="", flush=True)
The equivalent using the Chains interface:
from langchain.chains import LLMChain
chain = LLMChain(llm=model, prompt=prompt, output_parser=StrOutputParser())
chain.run(question="How did Mansa Musa accumulate his wealth?")
The whole flow is as follows:
- We pass in the user input, e.g. {"question": "How did Mansa Musa accumulate his wealth?"}.
- The prompt component takes the user input and uses it to build a PromptValue. A PromptValue is a wrapper around the completed prompt; it can be passed to an LLM (which takes a string as input) or to a ChatModel (which takes a sequence of messages as input). It works with either model type because it defines logic both for producing BaseMessages and for producing a string: prompt_value.to_messages() and prompt_value.to_string().
- The PromptValue is then passed to the model. Here our model is a ChatModel, so it outputs a BaseMessage; if our model were an LLM, it would output a string.
- Finally, the output_parser component receives the model's output. It is a BaseOutputParser, which means it accepts either a string or a BaseMessage as input; StrOutputParser simply converts any input into a string.
graph LR
    A(Input: question) --> |dict| B(PromptTemplate)
    B -->|PromptValue| C(ChatModel)
    C -->|ChatMessage| D(StrOutputParser)
    D --> |string| F(Result)
Retrieval
The data an LLM was trained on is finite, while LLM applications need data specific to their own scenario. Retrieval-augmented generation (RAG) is the main way to give an LLM data beyond its training set.
Document loaders
Document loaders load document data of many types and from many sources; LangChain provides more than 100 of them, for example:
- CSV: from langchain_community.document_loaders.csv_loader import CSVLoader
- HTML: from langchain_community.document_loaders import UnstructuredHTMLLoader, from langchain_community.document_loaders import BSHTMLLoader
- JSON: from langchain_community.document_loaders import JSONLoader
- Markdown: from langchain_community.document_loaders import UnstructuredMarkdownLoader
- PDF: pip install pypdf, from langchain_community.document_loaders import PyPDFLoader
HTML
Load HTML documents and construct Document objects. You can also use BSHTMLLoader, which uses BeautifulSoup4 to parse the HTML, putting the extracted text into page_content and the page title into metadata.
from langchain_community.document_loaders import UnstructuredHTMLLoader
loader = UnstructuredHTMLLoader("example_data/fake-content.html")
data = loader.load()
from langchain_community.document_loaders import BSHTMLLoader
loader = BSHTMLLoader("example_data/fake-content.html")
data = loader.load()
Text Splitting
A key retrieval step is finding only the relevant parts of a document, which requires several preprocessing transformations. The most important one is splitting the full document into smaller chunks. LangChain provides several text splitters:
- Recursive: A list of user defined characters, Recursively splits text. Splitting text recursively serves the purpose of trying to keep related pieces of text next to each other. This is the recommended way to start splitting text.
- HTML: HTML specific characters, Splits text based on HTML-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the HTML)
- Markdown: Markdown specific characters, Splits text based on Markdown-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the Markdown)
- Code: Code specific characters, Splits text based on characters specific to coding languages. 15 different languages are available to choose from.
- Token: Tokens, Splits text on tokens. There exist a few different ways to measure tokens.
- Character: A user defined character. Splits text based on a user defined character. One of the simpler methods.
A text splitter is described by:
- Name: the splitter's name;
- Splits On: how the text is split;
- Adds Metadata: whether each chunk gets metadata describing where it came from;
- Description: a description of the splitter and its recommended use cases.
Split by character
- splits the text on a specified character;
- chunk size is measured in number of characters.
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
separator="\n\n",
chunk_size=1000,
chunk_overlap=200,
length_function=len,
is_separator_regex=False,
)
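Assuming some loaded text is available in state_of_the_union (as in the later examples), the splitter is then applied like this:
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])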
Split Code
CodeTextSplitter splits source code and supports many programming languages.
from langchain.text_splitter import (
Language,
RecursiveCharacterTextSplitter,
)
# Full list of supported languages
[e.value for e in Language]
# You can also see the separators used for a given language
RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)
Python
PYTHON_CODE = """
def hello_world():
print("Hello, World!")
# Call the function
hello_world()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs
Recursively split by character
This splitter is the best fit for generic text. It is parameterized by a list of characters, by default ["\n\n", "\n", " ", ""], and it tries to keep paragraphs, sentences, and words together as long as possible while staying within the chunk size.
- a list of separator characters;
- chunk size measured in number of characters.
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
# Set a really small chunk size, just to show.
chunk_size=100,
chunk_overlap=20,
length_function=len,
is_separator_regex=False,
)
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
print(texts[1])
Semantic Chunking
The semantic splitter splits text based on semantic similarity: it first splits the text into sentences, then groups them three at a time, and finally merges groups that are close in embedding space into a single chunk.
!pip install --quiet langchain_experimental langchain_openai
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
text_splitter = SemanticChunker(OpenAIEmbeddings())
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
Split by tokens
Language model inputs usually have a token limit, so the chunk's token count plus the user input's token count must not exceed the model's limit.
tiktoken is a BPE tokenizer implemented and provided by OpenAI.
tiktoken can be used to estimate a text's token count; its estimates are most accurate for OpenAI language models.
- splits on characters;
- measures chunk size in tokens.
%pip install --upgrade --quiet tiktoken
# This is a long document we can split up.
with open("../../state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
CharacterTextSplitter.from_tiktoken_encoder only uses tiktoken when merging splits, so a chunk can end up larger than the chunk size as measured by the tiktoken tokenizer.
RecursiveCharacterTextSplitter.from_tiktoken_encoder, by contrast, ensures each chunk stays below the token-based chunk size: any chunk over the limit is split again recursively.
You can also use the tiktoken splitter directly, which guarantees each chunk is below the chunk size:
from langchain.text_splitter import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Other token-based splitters:
- spaCy
- SentenceTransformers
- NLTK
- KoNLPY
- Hugging Face tokenizer: GPT2TokenizerFast
Text embedding models
Another key retrieval step is embedding the documents as vectors. Embeddings capture the semantics of the text, making it fast to retrieve the relevant parts later.
Embedding converts a piece of text into a vector representation, so that relatedness between texts can be computed in vector space, enabling semantic search.
The Embeddings class provides the interface for calling text embedding models; OpenAI, Cohere, Hugging Face and others offer such models.
from langchain_openai import OpenAIEmbeddings
embeddings_model = OpenAIEmbeddings(openai_api_key="...")
The base Embeddings class has two methods:
embed_documents
Embed a list of documents, producing vector data:
embeddings = embeddings_model.embed_documents(
[
"Hi there!",
"Oh, hello!",
"What's your name?",
"My friends call me World",
"Hello World!"
]
)
len(embeddings), len(embeddings[0])
embed_query
Embed a query for searching against the vector data:
embedded_query = embeddings_model.embed_query("What was the name mentioned in the conversation?")
embedded_query[:5]
Vector stores
Once the documents' embeddings have been generated, you need a database that can store and query them: a vector store.
The core of the vector store workflow is creating the store from the embeddings and then querying it (a minimal construction sketch follows the list of options below).
Common open-source vector stores:
- Chroma
pip install chromadb
- FAISS: the Facebook AI Similarity Search (FAISS) library.
pip install faiss-cpu
- Lance
pip install lancedb
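A minimal sketch of building a Chroma store from a text file and then using it for the queries below (the file path is illustrative):
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

raw_docs = TextLoader("state_of_the_union.txt").load()
docs = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0).split_documents(raw_docs)
db = Chroma.from_documents(docs, OpenAIEmbeddings())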
Two query interfaces:
- Similarity search
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)
- Similarity search by vector
embedding_vector = OpenAIEmbeddings().embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)
print(docs[0].page_content)
Retrievers
Once the embedding data is ready, the next step is querying it. LangChain provides several retrieval algorithms, for example:
- Parent Document Retriever: This involves indexing multiple chunks for each document. Then you find the chunks that are most similar in embedding space, but you retrieve the whole parent document and return that (rather than individual chunks). If your pages have lots of smaller pieces of distinct information that are best indexed by themselves, but best retrieved all together.
- Self Query Retriever: This uses an LLM to transform user input into two things: (1) a string to look up semantically, (2) a metadata filter to go along with it. This is useful because oftentimes questions are about the METADATA of documents (not the content itself), and such questions are better answered by fetching documents based on metadata rather than similarity with the text.
- Ensemble Retriever: This fetches documents from multiple retrievers and then combines them, useful if you have multiple retrieval methods and want to try combining them (a sketch follows this list).
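As one concrete illustration, an EnsembleRetriever might combine two retrievers roughly like this (keyword_retriever and vector_retriever are assumed to have been built elsewhere, and the weights are arbitrary):
from langchain.retrievers import EnsembleRetriever

# Combine, e.g., a keyword-based retriever and a vector-store retriever
ensemble = EnsembleRetriever(
    retrievers=[keyword_retriever, vector_retriever],
    weights=[0.5, 0.5],
)
docs = ensemble.get_relevant_documents("what did he say about ketanji brown jackson")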
Custom Retriever
The retriever interface is very simple, so implementing a custom retriever is easy:
from langchain_core.retrievers import BaseRetriever
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from typing import List
class CustomRetriever(BaseRetriever):
def _get_relevant_documents(
self, query: str, *, run_manager: CallbackManagerForRetrieverRun
) -> List[Document]:
return [Document(page_content=query)]
retriever = CustomRetriever()
retriever.get_relevant_documents("bar")
Vector store-backed Retriever
A vector store retriever wraps a vector store's search capability and adapts it to the retriever interface.
retriever = db.as_retriever()
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")
By default the retriever uses plain similarity search; if the underlying vector store supports it, you can switch to maximum marginal relevance (MMR) search:
retriever = db.as_retriever(search_type="mmr")
You can also use a score-threshold strategy, returning only results whose similarity score exceeds the threshold:
retriever = db.as_retriever(
search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.5}
)
There is also a top-k strategy:
retriever = db.as_retriever(search_kwargs={"k": 1})
MultiQueryRetriever
Distance-based vector retrieval embeds the query in a high-dimensional space and finds documents whose embeddings are close to it. However, small changes in the wording of the query can produce different results; MultiQueryRetriever addresses this by using an LLM to generate multiple variants of the query and merging the documents retrieved for each.
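A minimal sketch of wiring one up on top of the vector store retriever built earlier (the question text is illustrative):
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.chat_models import ChatOpenAI

mq_retriever = MultiQueryRetriever.from_llm(
    retriever=db.as_retriever(),
    llm=ChatOpenAI(temperature=0),
)
docs = mq_retriever.get_relevant_documents("What did the president say about Ketanji Brown Jackson?")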