Reading Notes on Knowledge-Graph-Enhanced RAG Systems (3): Advanced Vector Retrieval


Chapter 3: Advanced Vector Retrieval Strategies

Let's continue optimizing the system. A basic implementation of text embedding and vector similarity search may fall short in retrieval precision and recall. The embedding generated from a user query does not always align closely with the embeddings of the documents that contain the key information, often because of differences in terminology or context. This mismatch can cause semantically highly relevant documents to be overlooked, because the query's embedding fails to capture the essence of the information being sought.

3.1 Overview of Optimization Approaches

3.1.1 Query Rewriting

One strategy for improving retrieval precision and recall is to rewrite the query used to find relevant documents. Query-rewriting methods aim to bridge the gap between the user's query and information-rich documents by reformulating the original query into a form that better matches the language and context of the target documents. This query optimization raises the likelihood of finding documents that contain the relevant information, which in turn improves the accuracy of the response to the original query. Common query-rewriting strategies include the Hypothetical Document Retriever and Step-Back Prompting.

The core idea of Step-Back Prompting is not to retrieve with the user's complex question directly, but to first construct a broader, more abstract query (the "step-back" question) by having the LLM think about the underlying principles, background knowledge, or more generic question behind the original one. We then retrieve from the knowledge base with this more generic query to obtain broader, more foundational context, and finally combine the original question with that context to generate a higher-quality answer.

3.1.2 Optimizing the Embedding Strategy

Hypothetical questions

In the hypothetical-question embedding strategy, you determine which questions a document's content can answer. For example, you can use a large language model (LLM) to generate a set of hypothetical questions, or use a chatbot's conversation history to infer which questions a document might answer. The core idea is to embed not the original document itself but the questions it can answer. When a user asks a question, the system computes the embedding of that query and searches for nearest neighbors among the precomputed question embeddings, looking for the stored questions that are semantically closest to the user's. It then retrieves the documents that answer those similar questions. In essence, the hypothetical-question strategy embeds the potential questions a document can answer and uses those embeddings to match and retrieve documents relevant to the user's query.
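The indexing direction can be sketched in a few lines. This is a minimal illustration only: the `embed` function below is a toy bag-of-words similarity standing in for a real embedding model (such as the nomic-embed-text model used later in this chapter), and the hand-written `question_to_doc` mapping stands in for LLM-generated hypothetical questions.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding", used only for illustration; a real system
    # would call an embedding model here instead.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical questions (written by hand here; normally generated by an LLM),
# each mapped to the document that answers it.
question_to_doc = {
    "is simple or complex code preferred": "Simple is better than complex.",
    "what does the zen of python say about errors": "Errors should never pass silently.",
}
question_index = {q: embed(q) for q in question_to_doc}

def retrieve_by_hypothetical_question(user_query):
    # Embed the query, find the nearest *question* embedding, and return
    # the document associated with that question.
    q_emb = embed(user_query)
    best_q = max(question_index, key=lambda q: cosine(q_emb, question_index[q]))
    return question_to_doc[best_q]

print(retrieve_by_hypothetical_question("should my code be simple or complex?"))
```

Note that the stored vectors are question embeddings, not document embeddings; the documents are only reached indirectly through the question they answer.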

Parent document embedding

In this approach, the original document (the "parent document") is split into smaller units called child chunks, typically based on a fixed number of tokens. Instead of embedding the entire parent document as one unit, the strategy computes a separate embedding for each child chunk. When a user submits a query, the system compares the query vector against these child-chunk embeddings to find the most relevant results. However, rather than returning only the matching child chunk, it retrieves the complete parent document associated with that chunk. This lets the LLM operate on the full context, making an accurate and complete answer more likely. The strategy addresses a common limitation of embedding long documents: when an entire parent document is embedded, the resulting vector can blur the document's distinct key ideas through averaging, making it hard to match specific queries effectively. Splitting the document into smaller chunks enables more precise matching while still returning the full context when needed.

3.1.3 Other Optimizations

Fine-tuning the embedding model

Fine-tuning the embedding model on domain-specific data can strengthen its ability to capture the context of user queries, yielding tighter semantic matches with relevant documents. Note that fine-tuning typically requires more compute and infrastructure. Moreover, once the model is updated, all existing document embeddings must be recomputed to reflect the change, which can be very resource-intensive for a large document collection.

Reranking

After an initial set of documents has been retrieved, a reranking algorithm can reorder the results by their relevance to the user's intent. This second-pass filtering step typically uses a more sophisticated model or scoring heuristics to refine the results. Even when the initial retrieval match is imperfect, reranking helps surface the most relevant content at the top.
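The shape of that second pass can be sketched as follows. This is a minimal illustration, not a real reranker: the query-term-overlap heuristic in `second_pass_score` stands in for what would normally be a cross-encoder or other finer-grained relevance model, and the candidate list is hypothetical.

```python
import re

def rerank(query, candidates):
    """Reorder first-pass retrieval results with a second, finer-grained score.

    `candidates` is a list of {"text": ..., "score": ...} dicts, as returned by
    the retrieval functions later in this chapter. Here the second-pass score
    is a simple query-term-overlap heuristic standing in for a real model.
    """
    q_terms = set(re.findall(r"[a-z]+", query.lower()))

    def second_pass_score(c):
        doc_terms = set(re.findall(r"[a-z]+", c["text"].lower()))
        return len(q_terms & doc_terms) / max(len(q_terms), 1)

    return sorted(candidates, key=second_pass_score, reverse=True)

# Hypothetical first-pass results: the vector scores put the wrong chunk first.
candidates = [
    {"text": "Readability counts.", "score": 0.81},
    {"text": "Simple is better than complex.", "score": 0.79},
]
top = rerank("is simple better than complex?", candidates)
print(top[0]["text"])
```

The point is that the first-pass score is only used to build the candidate set; the final ordering comes from the second-pass model.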

Metadata-based context filtering

Many documents carry structured metadata such as author, publication date, topic tags, or source type. Filtering on this metadata, whether configured manually or as part of the retrieval pipeline, can significantly narrow the candidate set before semantic matching and thus improve retrieval precision. For example, a query about recent policy updates can be restricted to documents published within the past year.
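The pre-filtering step is straightforward; a minimal sketch with hypothetical documents and metadata fields (`published`, `source`) might look like this:

```python
from datetime import date

# Hypothetical document store with structured metadata.
documents = [
    {"text": "Policy update on data retention.", "published": date(2024, 11, 3), "source": "official"},
    {"text": "Older policy overview.", "published": date(2019, 5, 1), "source": "official"},
    {"text": "Blog commentary on the update.", "published": date(2024, 12, 1), "source": "blog"},
]

def filter_by_metadata(docs, since=None, source=None):
    # Narrow the candidate set on structured metadata *before* running any
    # (more expensive) semantic similarity computation.
    result = docs
    if since is not None:
        result = [d for d in result if d["published"] >= since]
    if source is not None:
        result = [d for d in result if d["source"] == source]
    return result

recent_official = filter_by_metadata(documents, since=date(2024, 1, 1), source="official")
print([d["text"] for d in recent_official])
```

Only the surviving candidates would then be passed on to embedding comparison, so the semantic search runs over a smaller, more relevant set.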

Hybrid retrieval

Combining sparse retrieval (such as keyword-based search) with dense vector retrieval (semantic search) captures the strengths of both. Keyword search excels at exact matches and rare terms, while vector retrieval captures broader semantic meaning. A hybrid retrieval system can merge and rerank the results of both methods to strike the best balance between recall and precision.
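The hybrid search implemented later in this chapter fuses the two result sets by max-normalizing the scores in Cypher. A common alternative worth knowing, sketched below, is reciprocal rank fusion (RRF), which needs only the rank positions, not comparable scores; the document ids and the two hit lists are illustrative.

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several ranked result lists (lists of document ids, best first).

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant commonly used for RRF.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]  # sparse / full-text ranking
vector_hits = ["doc1", "doc5", "doc3"]   # dense / vector ranking
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Because RRF ignores the raw scores, it sidesteps the problem that full-text and cosine scores live on different scales, which is exactly what the normalization in the Cypher version has to compensate for.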

3.2 Implementing Step-Back Prompting

Step-back prompting reduces the complexity of the vector search by turning a specific question into a broader, higher-level query. The idea is that a broader query usually covers a more comprehensive range of information, making it easier for the model to identify relevant facts without getting bogged down in specifics. The authors of the original paper used the system prompt in the following listing to instruct the LLM on how to rewrite the input query.

stepback_system_message = """
You are an expert at world knowledge. Your task is to step back and
paraphrase a question to a more generic step-back question, which
is easier to answer. Here are a few examples:
"input": "Could the members of The Police perform lawful arrests?"
"output": "what can the members of The Police do?"
"input": "Jan Sindel's was born in what country?"
"output": "what is Jan Sindel's personal history?"
"""

A prompt that contains only instructions and no examples is called zero-shot prompting; it relies entirely on the LLM's general understanding of and capability for the task. To guide the model more effectively and ensure consistent output, however, the authors chose to include several examples of the desired rewrites in the prompt. This technique, called few-shot prompting, adds a small number of examples to the prompt to illustrate the task. Few-shot prompting uses concrete instances to help the LLM better understand the required transformation, improving the quality and reliability of the output.

The corresponding implementation is shown below.

import ollama
import re
from langchain_ollama import OllamaEmbeddings
from neo4j import GraphDatabase

EMBEDDING_MODEL = "nomic-embed-text:latest"
LLM_MODEL = "qwen3:14b"
OLLAMA_BASE_URL = "http://localhost:11434"     # Ollama service address
NEO4J_URI = "bolt://localhost:7687"            # Neo4j Bolt address
NEO4J_USER = "neo4j"                           # Neo4j username
NEO4J_PASSWORD = "password"                    # Neo4j password
VECTOR_INDEX_NAME = "queryIndex"               # vector index name
FULLTEXT_INDEX_NAME = "ChunkFulltext"          # full-text index name
VECTOR_DIMENSIONS = 768                        # nomic-embed-text vector dimensions

# Initialize the embedding model and the Neo4j driver
embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL, base_url=OLLAMA_BASE_URL)
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))
​
​
# 1. Create the vector index
def create_vector_index():
    """Create a vector index on the embedding property of :Chunk nodes in Neo4j."""
    with driver.session() as session:
        session.run(f"""
            CREATE VECTOR INDEX {VECTOR_INDEX_NAME} IF NOT EXISTS
            FOR (c:Chunk) ON c.embedding
            OPTIONS {{
                indexConfig: {{
                    `vector.dimensions`: {VECTOR_DIMENSIONS},
                    `vector.similarity_function`: 'cosine'
                }}
            }}
        """)
    print(f"Vector index '{VECTOR_INDEX_NAME}' created")
​
​
# 2. Create the full-text index
def create_fulltext_index():
    """Create a full-text index named 'ChunkFulltext' on the text property of :Chunk nodes."""
    with driver.session() as session:
        # Check whether the index already exists to avoid duplicate-creation errors
        result = session.run("""
            SHOW FULLTEXT INDEXES YIELD name
            WHERE name = $index_name
            RETURN count(*) > 0 AS exists
        """, index_name=FULLTEXT_INDEX_NAME)
        exists = result.single()["exists"]
        if not exists:
            session.run(f"""
                CREATE FULLTEXT INDEX {FULLTEXT_INDEX_NAME} FOR (c:Chunk) ON EACH [c.text]
            """)
            print(f"Full-text index '{FULLTEXT_INDEX_NAME}' created")
        else:
            print(f"Full-text index '{FULLTEXT_INDEX_NAME}' already exists")
​
​
# 3. Insert Chunk nodes
def insert_chunks_cypher(chunks):
    """Insert text chunks and their embedding vectors into Neo4j."""
    embedding_list = embeddings.embed_documents(chunks)
    with driver.session() as session:
        session.run("""
            WITH $chunks AS chunks, $embeddings AS embeddings
            UNWIND range(0, size(chunks) - 1) AS i
            MERGE (c:Chunk {index: i})  // use MERGE to avoid duplicates
            SET c.text = chunks[i], c.embedding = embeddings[i]
        """, chunks=chunks, embeddings=embedding_list)
    print(f"Inserted {len(chunks)} Chunk nodes via Cypher")
​
​
# 4. Vector similarity search
def retrieve_similar_chunks(query_text, top_k=3):
    """
    Retrieve the most similar Chunks from Neo4j using the query text's embedding.
    Returns a list of retrieved records.
    """
    query_embedding = embeddings.embed_query(query_text)
    with driver.session() as session:
        result = session.run(f"""
            CALL db.index.vector.queryNodes('{VECTOR_INDEX_NAME}', $top_k, $embedding_vector)
            YIELD node, score
            RETURN node.text AS text, score, node.index AS index
            ORDER BY score DESC
        """, embedding_vector=query_embedding, top_k=top_k)

        records = [{"text": record["text"], "score": record["score"], "index": record["index"], "type": "vector"}
                   for record in result]
        return records
​
​
# 5. Full-text search
def retrieve_fulltext_chunks(query_text, top_k=3):
    """
    Retrieve Chunks containing the keywords via the 'ChunkFulltext' full-text index.
    Returns a list of retrieved records with their scores.
    """
    with driver.session() as session:
        result = session.run(f"""
            CALL db.index.fulltext.queryNodes('{FULLTEXT_INDEX_NAME}', $query_term)
            YIELD node, score
            RETURN node.text AS text, score, node.index AS index
            ORDER BY score DESC
            LIMIT $top_k
        """, query_term=query_text, top_k=top_k)

        records = [{"text": record["text"], "score": record["score"], "index": record["index"], "type": "fulltext"}
                   for record in result]
        return records
​
​
# 6. Cypher-based hybrid search
def retrieve_hybrid_chunks_cypher(query_text, top_k=3):
    """
    Run a hybrid search in Cypher that combines vector and full-text search,
    normalizing the scores and deduplicating the nodes.
    """
    query_embedding = embeddings.embed_query(query_text)
    with driver.session() as session:
        result = session.run("""
            CALL () {
                 // vector index search
                 CALL db.index.vector.queryNodes($vector_index_name, $k, $embedding_vector)
                 YIELD node, score
                 WITH collect({node:node, score:score}) AS nodes, max(score) AS max_score
                 UNWIND nodes AS n
                 // normalize the vector scores
                 RETURN n.node AS node, (n.score / max_score) AS score
                 UNION
                 // full-text index search
                 CALL db.index.fulltext.queryNodes($fulltext_index_name, $query_term, {limit: $k})
                 YIELD node, score
                 WITH collect({node:node, score:score}) AS nodes, max(score) AS max_score
                 UNWIND nodes AS n
                 // normalize the full-text scores
                 RETURN n.node AS node, (n.score / max_score) AS score
            }
            // deduplicate nodes, keeping the highest score
            WITH node, max(score) AS score ORDER BY score DESC LIMIT $top_k
            RETURN node.text AS text, node.index AS index, score
            """,
                             vector_index_name=VECTOR_INDEX_NAME,
                             fulltext_index_name=FULLTEXT_INDEX_NAME,
                             embedding_vector=query_embedding,
                             query_term=query_text,
                             k=top_k * 2,  # over-fetch from each index before fusing
                             top_k=top_k   # final limit, so the function honors top_k
                             )
        # Return results as dicts containing text, score, and index
        return [{"text": record["text"], "score": record["score"], "index": record["index"]} for record in result]
​
​
# 7. Generate the answer with Ollama
def generate_answer_with_ollama(question, context_records):
    """
    Call the qwen3:14b model via Ollama to generate an answer from the retrieved context.
    """
    # Extract the text content to build the prompt
    context_texts = "\n".join([rec["text"] for rec in context_records])

    # Build the system and user messages
    system_message = (
        "You are an expert on the Zen of Python (PEP 20). "
        "Answer questions based solely on the provided documents. "
        "Be concise and accurate. Only output the final answer, do not include any thinking process."
    )
    user_message = f"""Use the following documents to answer the question at the end.

Documents:
{context_texts}
---
Question: {question}
Answer in English:"""

    print(f"Question: {question}")
    print("Answer:")

    try:
        response = ollama.chat(
            model=LLM_MODEL,
            messages=[
                {'role': 'system', 'content': system_message},
                {'role': 'user', 'content': user_message},
            ],
            options={'temperature': 0.1}  # lower randomness for consistency
        )
        answer = response['message']['content'].strip()
        # Strip any leaked thinking process
        cleaned_answer = re.sub(r'<think>.*?</think>', '', answer, flags=re.DOTALL)
        cleaned_answer = re.sub(r'^\s*"|"\s*$', '', cleaned_answer.strip())  # remove surrounding quotes
        print(cleaned_answer)
    except Exception as e:
        print(f"\nError while generating the answer with Ollama: {e}")
​
​
# 8. Step-back prompting
def generate_stepback_question(question):
    """
    Use the LLM to generate a "step-back" question, turning a specific
    question into a more generic query.
    """
    stepback_system_message = """
You are an expert at world knowledge. Your task is to step back and paraphrase a question 
to a more generic step-back question, which is easier to answer. Here are examples:

Input: "Could the members of The Police perform lawful arrests?"
Output: "What can the members of The Police do?"

Input: "Jan Sindel's was born in what country?"
Output: "What is Jan Sindel's personal history?"

Input: "In the face of ambiguity, refuse the temptation to guess - what does this mean?"
Output: "What are the principles of the Zen of Python?"

Now convert the following question to a more generic form. Only output the final question, do not include any thinking process or explanation:
"""

    try:
        response = ollama.chat(
            model=LLM_MODEL,
            messages=[
                {'role': 'system', 'content': stepback_system_message.strip()},
                {'role': 'user', 'content': f'Input: "{question}"\nOutput:'},
            ],
            options={'temperature': 0.1}  # lower randomness for consistency
        )
        stepback_question = response['message']['content'].strip()
        # Strip any thinking process and clean up the formatting
        cleaned_question = re.sub(r'<think>.*?</think>', '', stepback_question, flags=re.DOTALL)
        # Take the last line as the final question
        lines = cleaned_question.strip().split('\n')
        final_question = lines[-1].strip() if lines else cleaned_question.strip()
        # Remove surrounding quotes, if any
        final_question = re.sub(r'^\s*"|"\s*$', '', final_question)
        return final_question
    except Exception as e:
        print(f"Error while generating the step-back question: {e}")
        return question  # fall back to the original question on error
​
​
def main():
    texts = [
        "Beautiful is better than ugly.",
        "Explicit is better than implicit.",
        "Simple is better than complex.",
        "Complex is better than complicated.",
        "Flat is better than nested.",
        "Sparse is better than dense.",
        "Readability counts.",
        "Special cases aren't special enough to break the rules.",
        "Although practicality beats purity.",
        "Errors should never pass silently.",
        "Unless explicitly silenced.",
        "In the face of ambiguity, refuse the temptation to guess.",
        "There should be one-- and preferably only one --obvious way to do it.",
        "Although that way may not be obvious at first unless you're Dutch.",
        "Now is better than never.",
        "Although never is often better than *right* now.",
        "If the implementation is hard to explain, it's a bad idea.",
        "If the implementation is easy to explain, it may be a good idea.",
        "Namespaces are one honking great idea -- let's do more of those!"
    ]
​
    # Step 1: create the vector index
    create_vector_index()
    # Step 2: create the full-text index
    create_fulltext_index()
    # Step 3: insert the text chunks (via Cypher)
    insert_chunks_cypher(texts)
​
    # Define the question to ask
    question = "In the Zen of Python, what is the relationship between 'simple' and 'complex'?"
​
    # Step-back prompting
    print("--- Original question ---")
    print(f"  {question}")

    stepback_question = generate_stepback_question(question)
    print("\n--- Step-back question ---")
    print(f"  {stepback_question}")

    print("\n--- Vector search results (using the step-back question) ---")
    vector_records = retrieve_similar_chunks(stepback_question, top_k=3)
    if vector_records:
        for i, record in enumerate(vector_records):
            print(f"  [{i + 1}] (similarity: {record['score']:.4f}) {record['text']}")

    print("\n--- Full-text search results (using the step-back question) ---")
    fulltext_records = retrieve_fulltext_chunks(stepback_question, top_k=3)
    if fulltext_records:
        for i, record in enumerate(fulltext_records):
            print(f"  [{i + 1}] (relevance: {record['score']:.4f}) {record['text']}")

    print("\n--- Hybrid search results (using the step-back question) ---")
    # Step 4: Cypher-based hybrid retrieval
    hybrid_records = retrieve_hybrid_chunks_cypher(stepback_question, top_k=3)

    if not hybrid_records:
        print("No relevant text chunks retrieved.")
        return

    print("Retrieved text chunks (hybrid scores):")
    for i, record in enumerate(hybrid_records):
        print(f"  [{i + 1}] (hybrid score: {record['score']:.4f}, index: {record['index']}) {record['text']}")
    print("-" * 40)

    # Step 5: generate the answer with Ollama
    generate_answer_with_ollama(question, hybrid_records)

    # Close the database connection
    driver.close()
​
​
if __name__ == "__main__":
    main()

To be honest, the final answer in this example is fairly poor. I think that is down to the example itself; as for why I chose it, I simply couldn't think of a better text at the time.

3.3 Implementing a Parent Document Retriever

3.3.1 Generation

The parent document retriever strategy splits a large document into smaller parts and computes an embedding for each small part (rather than for the whole document), matching user queries more precisely while ultimately retrieving the complete parent document to provide a context-rich response. However, since you cannot feed an entire PDF directly into an LLM, you first split the PDF into multiple parent documents, and then further split those parent documents into the child documents used for embedding and retrieval.

First the file is split into parent documents. Here we simply target a rough fixed chunk size and build each chunk by concatenating whole sentences; the child chunks one level down are likewise split on sentence boundaries. The code creates vector indexes on both the parent and the child nodes, so you can adjust the retrieval granularity to your needs.

import ollama
import re
from typing import List
from langchain_ollama import OllamaEmbeddings
from neo4j import GraphDatabase

# Configuration
EMBEDDING_MODEL = "nomic-embed-text:latest"
LLM_MODEL = "qwen3:14b"
OLLAMA_BASE_URL = "http://localhost:11434"     # Ollama service address
NEO4J_URI = "bolt://localhost:7687"            # Neo4j Bolt address
NEO4J_USER = "neo4j"                           # Neo4j username
NEO4J_PASSWORD = "password"                    # Neo4j password
VECTOR_INDEX_NAME = "queryIndex"               # vector index name
FULLTEXT_INDEX_NAME = "ChunkFulltext"          # full-text index name
VECTOR_DIMENSIONS = 768                        # nomic-embed-text vector dimensions

# Initialize the embedding model and the Neo4j driver
embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL, base_url=OLLAMA_BASE_URL)
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))
​
​
def split_document_to_parent_chunks(document_text: str, max_chunk_size: int = 800) -> List[str]:
    """
    Split the input document into parent chunks along sentence boundaries.
    """
    # Split into sentences (ending with a period, question mark, or exclamation mark)
    sentences = re.split(r'(?<=[.!?])\s+', document_text.strip())
    sentences = [s.strip() for s in sentences if s.strip()]

    parent_chunks = []
    current_chunk = ""

    for sentence in sentences:
        # If a sentence alone exceeds the maximum chunk size, keep it on its own
        if len(sentence) > max_chunk_size:
            if current_chunk:
                parent_chunks.append(current_chunk.strip())
                current_chunk = ""
            parent_chunks.append(sentence)
        # If adding the sentence would exceed the maximum size, close the current chunk and start a new one
        elif len(current_chunk) + len(sentence) + 1 > max_chunk_size:
            if current_chunk:
                parent_chunks.append(current_chunk.strip())
            current_chunk = sentence
        # Otherwise append the sentence to the current chunk
        else:
            if current_chunk:
                current_chunk += " " + sentence
            else:
                current_chunk = sentence

    # Append the final chunk
    if current_chunk:
        parent_chunks.append(current_chunk.strip())

    return parent_chunks
​
​
def split_parent_to_child_chunks(parent_text: str, max_child_size: int = 300) -> List[str]:
    """
    Split a parent chunk into child chunks.
    """
    # Split into sentences
    sentences = re.split(r'(?<=[.!?])\s+', parent_text.strip())
    sentences = [s.strip() for s in sentences if s.strip()]

    child_chunks = []
    current_chunk = ""

    for sentence in sentences:
        # If a sentence alone exceeds the maximum child size, keep it on its own
        if len(sentence) > max_child_size:
            if current_chunk:
                child_chunks.append(current_chunk.strip())
                current_chunk = ""
            child_chunks.append(sentence)
        # If adding the sentence would exceed the maximum size, close the current chunk and start a new one
        elif len(current_chunk) + len(sentence) + 1 > max_child_size:
            if current_chunk:
                child_chunks.append(current_chunk.strip())
            current_chunk = sentence
        # Otherwise append the sentence to the current chunk
        else:
            if current_chunk:
                current_chunk += " " + sentence
            else:
                current_chunk = sentence

    # Append the final chunk
    if current_chunk:
        child_chunks.append(current_chunk.strip())

    return child_chunks
​
​
def insert_document_hierarchy_with_embeddings(document_text: str, document_id: str,
                                              document_name: str = "Research Paper"):
    """
    Insert the document into Neo4j as a three-level hierarchy:
    Document -> Parent (with embedding) -> Child (with embedding)
    """
    # Level 1: split the document into parent chunks
    parent_chunks = split_document_to_parent_chunks(document_text, max_chunk_size=800)

    # Generate embeddings for the parent chunks
    parent_embeddings = embeddings.embed_documents(parent_chunks)

    with driver.session() as session:
        # 1. Merge the Document node
        session.run("""
            MERGE (d:Document {id: $doc_id})
            SET d.name = $doc_name, d.text = $doc_text
        """, doc_id=document_id, doc_name=document_name,
                    doc_text=document_text[:500] + "..." if len(document_text) > 500 else document_text)

        # 2. Process each parent chunk and its children
        for i, parent_text in enumerate(parent_chunks):
            parent_id = f"{document_id}_parent_{i}"

            # Level 2: split the parent chunk into child chunks
            child_chunks = split_parent_to_child_chunks(parent_text, max_child_size=300)

            # Generate embeddings for the child chunks
            if child_chunks:
                child_embeddings = embeddings.embed_documents(child_chunks)
            else:
                child_chunks = [parent_text]  # if splitting fails, use the parent chunk as the child
                child_embeddings = embeddings.embed_documents(child_chunks)

            # 3. Insert the full three-level structure
            session.run("""
                // merge the Document node
                MERGE (d:Document {id: $doc_id})
                // create or merge the Parent node
                MERGE (p:Parent {id: $parent_id})
                SET p.text = $parent_text, p.index = $parent_index, p.embedding = $parent_embedding
                // link Document to Parent
                MERGE (d)-[:HAS_PARENT]->(p)
                // handle the Child nodes
                WITH p, $child_chunks AS chunks, $child_embeddings AS embeddings, $child_ids AS child_ids
                UNWIND range(0, size(chunks) - 1) AS j
                MERGE (c:Child {id: child_ids[j]})
                SET c.text = chunks[j], c.index = j, c.embedding = embeddings[j]
                MERGE (p)-[:HAS_CHILD]->(c)
            """,
                        doc_id=document_id,
                        parent_id=parent_id,
                        parent_text=parent_text,
                        parent_index=i,
                        parent_embedding=parent_embeddings[i],
                        child_chunks=child_chunks,
                        child_embeddings=child_embeddings,
                        child_ids=[f"{parent_id}_child_{j}" for j in range(len(child_chunks))]
                        )

    print(f"Inserted document '{document_name}' (ID: {document_id})")
    print(f"  - {len(parent_chunks)} parent chunks")
    print(f"  - {sum(len(split_parent_to_child_chunks(p, 300)) for p in parent_chunks)} child chunks in total")
​
​
def create_vector_indexes():
    """
    Create vector indexes on the embedding property of Parent and Child nodes.
    """
    with driver.session() as session:
        # Vector index for Parent nodes
        session.run(f"""
            CREATE VECTOR INDEX parent_index IF NOT EXISTS
            FOR (p:Parent) ON p.embedding
            OPTIONS {{
                indexConfig: {{
                    `vector.dimensions`: {VECTOR_DIMENSIONS},
                    `vector.similarity_function`: 'cosine'
                }}
            }}
        """)
        print("Parent vector index 'parent_index' created")

        # Vector index for Child nodes
        session.run(f"""
            CREATE VECTOR INDEX child_index IF NOT EXISTS
            FOR (c:Child) ON c.embedding
            OPTIONS {{
                indexConfig: {{
                    `vector.dimensions`: {VECTOR_DIMENSIONS},
                    `vector.similarity_function`: 'cosine'
                }}
            }}
        """)
        print("Child vector index 'child_index' created")
​
​
def query_document_hierarchy():
    """
    Query the three-level document hierarchy to verify it.
    """
    with driver.session() as session:
        result = session.run("""
            MATCH (d:Document)-[:HAS_PARENT]->(p:Parent)-[:HAS_CHILD]->(c:Child)
            RETURN d.id AS doc_id, d.name AS doc_name,
                   p.id AS parent_id, p.index AS parent_index,
                   count(c) AS child_count,
                   head(collect(c.text)) AS first_child_text
            LIMIT 5
        """)

        print("\n--- Document hierarchy ---")
        for record in result:
            print(f"Document: {record['doc_name']} (ID: {record['doc_id']})")
            print(f"  Parent chunk {record['parent_index']}: {record['parent_id']}")
            print(f"    Number of child chunks: {record['child_count']}")
            print(f"    First child preview: {record['first_child_text'][:100]}...")
            print()
​
​
def main():
    # Sample document
    document_text = """Artificial intelligence (AI) text generators caught much attention over the last years, especially after the release of GPT-3 in 2020 (1). GPT-3, the latest iteration of the generative pretrained transformers developed by OpenAI, is arguably the most advanced system of pretrained language representations (2). A generative pretrained transformer, in its essence, is a statistical representation of language; it is an AI engine that, based on users' prompts, can produce very credible, and sometimes astonishing, texts (3). An initial test on people's ability to tell whether a ∼500-word article was written by humans or GPT-3 showed a mean accuracy of 52%, just slightly better than random guessing (1).
GPT-3 does not have any mental representations or understanding of the language it operates on (4). The system relies on statistical representations of language for how it is used in real-life by real humans or "a simulacrum of the interaction between people and the world" (4). Even keeping in mind these structural limitations, what GPT-3 can do is remarkable, and remarkable is also the possible implication. While GPT-3 can be a great tool for machine translations, text classification, dialogue/chatbot systems, knowledge summarizing, question answering, creative writing (2, 5, 6), detecting hate speech (7), and automatic code writing (2, 8), it can also be used to produce "misinformation, spam, phishing, abuse of legal and governmental processes, fraudulent academic essay writing, and social engineering pretexting" (1, 9–11). GPT-3 serves as a lever, amplifying human intentions. It can receive instructions in natural language and generate output that may be in either natural or formal language. The tool is inherently neutral from an ethical point of view, and as every other similar technology, it is subject to the dual-use problem (12).
The advancements in AI text generators and the release of GPT-3 historically coincide with the ongoing infodemic (13), an epidemic-like circulation of fake news and disinformation, which, alongside the coronavirus disease 2019 (COVID-19) pandemic, has been greatly detrimental for global health. GPT-3 has the potential to generate information, which raises concerns about potential misuse, such as producing disinformation that can have devastating effects on global health. Therefore, it is crucial to assess how text generated by GPT-3 can affect people's comprehension of information.
The purpose of this paper is to assess whether GPT-3 can generate both accurate information and disinformation in the form of tweets. We will compare the credibility of this text with information and disinformation produced by humans. Furthermore, we will explore the potential for this technology to be used in developing assistive tools for identifying disinformation. For clarity, we acknowledge that the definitions of disinformation and misinformation are diverse, but here, we refer to an inclusive definition, which considers disinformation as both intentionally false information (also partially false information) and/or unintentionally misleading content (14).
To achieve our goals, we asked GPT-3 to write tweets containing informative or disinformative texts on a range of different topics, including vaccines, 5G technology and COVID-19, or the theory of evolution, among others, which are commonly subject to disinformation and public misconception. We collected a set of real tweets written by users on the same topics and programmed a survey in which we asked respondents to classify whether randomly selected synthetic tweets (i.e., written by GPT-3) and organic tweets (i.e., written by humans) were true or false (i.e., whether they contained accurate information or disinformation) and whether they were written by a real Twitter user or by an AI. Note that this study has been preregistered on the Open Science Framework (OSF) (15), and we have conducted a power analysis based on the findings of a pilot study, as described in Materials and Methods."""
​
    # Create the vector indexes
    create_vector_indexes()

    # Insert the three-level document hierarchy
    insert_document_hierarchy_with_embeddings(document_text, "doc_001", "GPT-3 Research Paper")

    # Verify the structure
    query_document_hierarchy()

    # Close the database connection
    driver.close()
​
​
if __name__ == "__main__":
    main()
Parent vector index 'parent_index' created
Child vector index 'child_index' created
Inserted document 'GPT-3 Research Paper' (ID: doc_001)
  - 7 parent chunks
  - 17 child chunks in total

--- Document hierarchy ---
Document: GPT-3 Research Paper (ID: doc_001)
  Parent chunk 0: doc_001_parent_0
    Number of child chunks: 4
    First child preview: An initial test on people's ability to tell whether a ∼500-word article was written by humans or GPT...

Document: GPT-3 Research Paper (ID: doc_001)
  Parent chunk 1: doc_001_parent_1
    Number of child chunks: 2
    First child preview: Even keeping in mind these structural limitations, what GPT-3 can do is remarkable, and remarkable i...

Document: GPT-3 Research Paper (ID: doc_001)
  Parent chunk 2: doc_001_parent_2
    Number of child chunks: 3
    First child preview: The tool is inherently neutral from an ethical point of view, and as every other similar technology,...

Document: GPT-3 Research Paper (ID: doc_001)
  Parent chunk 3: doc_001_parent_3
    Number of child chunks: 3
    First child preview: The purpose of this paper is to assess whether GPT-3 can generate both accurate information and disi...

Document: GPT-3 Research Paper (ID: doc_001)
  Parent chunk 4: doc_001_parent_4
    Number of child chunks: 2
    First child preview: For clarity, we acknowledge that the definitions of disinformation and misinformation are diverse, b...

3.3.2 Retrieval
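Before the full script, the core retrieval step of this strategy can be sketched: search the child-level vector index, then follow HAS_CHILD back to the Parent node so that the LLM receives the full surrounding context. The Cypher string and the `dedupe_child_hits` helper below are illustrative assumptions of mine, not code from the original post; they reuse the `child_index`, `:Parent`, `:Child`, and `HAS_CHILD` names created in section 3.3.1. The pure-Python helper mirrors what the `WITH p, max(score)` aggregation does on the database side.

```python
# Sketch of parent-document retrieval (assumed, matching the names from 3.3.1):
# search child embeddings, then return the parent with the best child score.
PARENT_RETRIEVAL_CYPHER = """
CALL db.index.vector.queryNodes('child_index', $k, $embedding_vector)
YIELD node, score
MATCH (p:Parent)-[:HAS_CHILD]->(node)
WITH p, max(score) AS score
RETURN p.id AS parent_id, p.text AS text, score
ORDER BY score DESC
LIMIT $top_k
"""

def dedupe_child_hits(child_hits):
    """Collapse child-level hits onto their parents, keeping the best score.

    `child_hits` is a list of (parent_id, score) pairs as the child search
    would return them; the result is sorted by best score, descending.
    """
    best = {}
    for parent_id, score in child_hits:
        best[parent_id] = max(best.get(parent_id, 0.0), score)
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

# Two child hits from the same parent collapse into one parent result.
hits = [("doc_001_parent_2", 0.91), ("doc_001_parent_0", 0.87), ("doc_001_parent_2", 0.84)]
print(dedupe_child_hits(hits))
```

The key design point is that matching happens at child granularity while the returned unit is the parent, which is exactly the asymmetry the strategy is built around.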

import ollama
import re
from typing import List
from langchain_ollama import OllamaEmbeddings
from neo4j import GraphDatabase

# Configuration
EMBEDDING_MODEL = "nomic-embed-text:latest"
LLM_MODEL = "qwen3:14b"
OLLAMA_BASE_URL = "http://localhost:11434"  # Ollama service address
NEO4J_URI = "bolt://localhost:7687"         # Neo4j Bolt address
NEO4J_USER = "neo4j"                        # Neo4j username
NEO4J_PASSWORD = "password"                 # Neo4j password
VECTOR_DIMENSIONS = 768                     # nomic-embed-text vector dimensions

# Initialize the embedding model and the Neo4j driver
embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL, base_url=OLLAMA_BASE_URL)
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))
​
​
def split_document_to_parent_chunks(document_text: str, max_chunk_size: int = 800) -> List[str]:
    """
    将输入文档按句子分割成父文档块
    """
    # 按句子分割(以句号、问号、感叹号结尾)
    sentences = re.split(r'(?<=[.!?])\s+', document_text.strip())
    sentences = [s.strip() for s in sentences if s.strip()]
​
    parent_chunks = []
    current_chunk = ""
​
    for sentence in sentences:
        # 如果句子本身超过最大块大小,单独处理
        if len(sentence) > max_chunk_size:
            if current_chunk:
                parent_chunks.append(current_chunk.strip())
                current_chunk = ""
            parent_chunks.append(sentence)
        # 如果加上句子后超过最大块大小,保存当前块并开始新块
        elif len(current_chunk) + len(sentence) + 1 > max_chunk_size:
            if current_chunk:
                parent_chunks.append(current_chunk.strip())
            current_chunk = sentence
        # 否则将句子添加到当前块
        else:
            if current_chunk:
                current_chunk += " " + sentence
            else:
                current_chunk = sentence
​
    # 添加最后一个块
    if current_chunk:
        parent_chunks.append(current_chunk.strip())
​
    return parent_chunks
​
​
def split_parent_to_child_chunks(parent_text: str, max_child_size: int = 300) -> List[str]:
    """
    将父文档块分割成子文档块
    """
    # 按句子分割
    sentences = re.split(r'(?<=[.!?])\s+', parent_text.strip())
    sentences = [s.strip() for s in sentences if s.strip()]
​
    child_chunks = []
    current_chunk = ""
​
    for sentence in sentences:
        # 如果句子本身超过最大子块大小,单独处理
        if len(sentence) > max_child_size:
            if current_chunk:
                child_chunks.append(current_chunk.strip())
                current_chunk = ""
            child_chunks.append(sentence)
        # 如果加上句子后超过最大子块大小,保存当前块并开始新块
        elif len(current_chunk) + len(sentence) + 1 > max_child_size:
            if current_chunk:
                child_chunks.append(current_chunk.strip())
            current_chunk = sentence
        # 否则将句子添加到当前块
        else:
            if current_chunk:
                current_chunk += " " + sentence
            else:
                current_chunk = sentence
​
    # 添加最后一个块
    if current_chunk:
        child_chunks.append(current_chunk.strip())
​
    return child_chunks
​
​
def insert_document_hierarchy_with_embeddings(document_text: str, document_id: str,
                                              document_name: str = "Research Paper"):
    """
    将文档按照三级层级结构插入到 Neo4j 数据库中
    Document -> Parent (包含嵌入) -> Child (包含嵌入)
    """
    # 第一级:将文档分割成父文档块
    parent_chunks = split_document_to_parent_chunks(document_text, max_chunk_size=800)
​
    # 为父文档块生成嵌入
    parent_embeddings = embeddings.embed_documents(parent_chunks)
​
    with driver.session() as session:
        # 1. 合并 Document 节点
        session.run("""
            MERGE (d:Document {id: $doc_id})
            SET d.name = $doc_name, d.text = $doc_text
        """, doc_id=document_id, doc_name=document_name,
                    doc_text=document_text[:500] + "..." if len(document_text) > 500 else document_text)
​
        # 2. 处理每个父文档块及其子块
        for i, parent_text in enumerate(parent_chunks):
            parent_id = f"{document_id}_parent_{i}"
​
            # 第二级:将父文档块分割成子文档块
            child_chunks = split_parent_to_child_chunks(parent_text, max_child_size=300)
​
            # 为子文档块生成嵌入
            if child_chunks:
                child_embeddings = embeddings.embed_documents(child_chunks)
            else:
                child_chunks = [parent_text]  # 如果无法分割,使用父块作为子块
                child_embeddings = embeddings.embed_documents(child_chunks)
​
            # 3. 插入完整的三级结构
            session.run("""
                // 合并 Document 节点
                MERGE (d:Document {id: $doc_id})
                // 创建或合并 Parent 节点
                MERGE (p:Parent {id: $parent_id})
                SET p.text = $parent_text, p.index = $parent_index, p.embedding = $parent_embedding
                // 创建 Document 到 Parent 的关系
                MERGE (d)-[:HAS_PARENT]->(p)
                // 处理 Child 节点
                WITH p, $child_chunks AS chunks, $child_embeddings AS embeddings, $child_ids AS child_ids
                UNWIND range(0, size(chunks) - 1) AS j
                MERGE (c:Child {id: child_ids[j]})
                SET c.text = chunks[j], c.index = j, c.embedding = embeddings[j]
                MERGE (p)-[:HAS_CHILD]->(c)
            """,
                        doc_id=document_id,
                        parent_id=parent_id,
                        parent_text=parent_text,
                        parent_index=i,
                        parent_embedding=parent_embeddings[i],
                        child_chunks=child_chunks,
                        child_embeddings=child_embeddings,
                        child_ids=[f"{parent_id}_child_{j}" for j in range(len(child_chunks))]
                        )
​
    print(f"已插入文档 '{document_name}' (ID: {document_id})")
    print(f"  - 包含 {len(parent_chunks)} 个父文档块")
    print(f"  - 总计 {sum(max(1, len(split_parent_to_child_chunks(p, 300))) for p in parent_chunks)} 个子文档块")  # max(1, ...) 对应无法分割时退回父块的逻辑
​
​
def create_vector_indexes():
    """
    为 Parent 和 Child 节点的 embedding 属性创建向量索引
    """
    with driver.session() as session:
        # 为 Parent 节点创建向量索引
        session.run(f"""
            CREATE VECTOR INDEX parent_index IF NOT EXISTS
            FOR (p:Parent) ON p.embedding
            OPTIONS {{
                indexConfig: {{
                    `vector.dimensions`: {VECTOR_DIMENSIONS},
                    `vector.similarity_function`: 'cosine'
                }}
            }}
        """)
        print("Parent 向量索引 'parent_index' 已创建")
​
        # 为 Child 节点创建向量索引
        session.run(f"""
            CREATE VECTOR INDEX child_index IF NOT EXISTS
            FOR (c:Child) ON c.embedding
            OPTIONS {{
                indexConfig: {{
                    `vector.dimensions`: {VECTOR_DIMENSIONS},
                    `vector.similarity_function`: 'cosine'
                }}
            }}
        """)
        print("Child 向量索引 'child_index' 已创建")
​
​
def parent_retrieval(question: str, k: int = 4) -> List[dict]:
    """
    检索父文档策略数据
    基于子节点向量搜索,然后去重返回唯一的父文档
    """
    # 为问题生成嵌入
    question_embedding = embeddings.embed_query(question)
​
    with driver.session() as session:
        # 执行基于子节点的向量搜索,检索 k*4 个子节点以确保足够的父文档多样性
        result = session.run("""
            // 基于子节点向量索引进行搜索
            CALL db.index.vector.queryNodes('child_index', $k_multiplier, $question_embedding)
            YIELD node AS child_node, score
            // 获取对应的父节点
            MATCH (parent:Parent)-[:HAS_CHILD]->(child_node)
            // 按父节点聚合,保留其子节点命中的最高分
            WITH parent, max(score) AS max_score
            // 按分数排序并限制结果数量
            ORDER BY max_score DESC
            LIMIT $k
            // 返回父节点文本和分数
            RETURN parent.text AS text, max_score AS score, parent.index AS parent_index
            ORDER BY max_score DESC
        """,
                             question_embedding=question_embedding,
                             k_multiplier=k * 4,  # 检索更多的子节点以确保父文档多样性
                             k=k
                             )
​
        # 返回检索结果
        records = []
        for record in result:
            records.append({
                "text": record["text"],
                "score": record["score"],
                "parent_index": record["parent_index"]
            })
​
        return records
​
​
def child_retrieval(question: str, k: int = 4) -> List[dict]:
    """
    直接检索子文档
    """
    # 为问题生成嵌入
    question_embedding = embeddings.embed_query(question)
​
    with driver.session() as session:
        # 直接在子节点上进行向量搜索
        result = session.run("""
            CALL db.index.vector.queryNodes('child_index', $k, $question_embedding)
            YIELD node AS child_node, score
            RETURN child_node.text AS text, score, child_node.index AS child_index
            ORDER BY score DESC
        """,
                             question_embedding=question_embedding,
                             k=k
                             )
​
        # 返回检索结果
        records = []
        for record in result:
            records.append({
                "text": record["text"],
                "score": record["score"],
                "child_index": record["child_index"]
            })
​
        return records
​
​
def main():
    # 示例文档
    document_text = """Artificial intelligence (AI) text generators caught much attention over the last years, especially after the release of GPT-3 in 2020 (1). GPT-3, the latest iteration of the generative pretrained transformers developed by OpenAI, is arguably the most advanced system of pretrained language representations (2). A generative pretrained transformer, in its essence, is a statistical representation of language; it is an AI engine that, based on users' prompts, can produce very credible, and sometimes astonishing, texts (3). An initial test on people's ability to tell whether a ∼500-word article was written by humans or GPT-3 showed a mean accuracy of 52%, just slightly better than random guessing (1).
GPT-3 does not have any mental representations or understanding of the language it operates on (4). The system relies on statistical representations of language for how it is used in real-life by real humans or "a simulacrum of the interaction between people and the world" (4). Even keeping in mind these structural limitations, what GPT-3 can do is remarkable, and remarkable is also the possible implication. While GPT-3 can be a great tool for machine translations, text classification, dialogue/chatbot systems, knowledge summarizing, question answering, creative writing (2, 5, 6), detecting hate speech (7), and automatic code writing (2, 8), it can also be used to produce "misinformation, spam, phishing, abuse of legal and governmental processes, fraudulent academic essay writing, and social engineering pretexting" (1, 9–11). GPT-3 serves as a lever, amplifying human intentions. It can receive instructions in natural language and generate output that may be in either natural or formal language. The tool is inherently neutral from an ethical point of view, and as every other similar technology, it is subject to the dual-use problem (12).
The advancements in AI text generators and the release of GPT-3 historically coincide with the ongoing infodemic (13), an epidemic-like circulation of fake news and disinformation, which, alongside the coronavirus disease 2019 (COVID-19) pandemic, has been greatly detrimental for global health. GPT-3 has the potential to generate information, which raises concerns about potential misuse, such as producing disinformation that can have devastating effects on global health. Therefore, it is crucial to assess how text generated by GPT-3 can affect people's comprehension of information.
The purpose of this paper is to assess whether GPT-3 can generate both accurate information and disinformation in the form of tweets. We will compare the credibility of this text with information and disinformation produced by humans. Furthermore, we will explore the potential for this technology to be used in developing assistive tools for identifying disinformation. For clarity, we acknowledge that the definitions of disinformation and misinformation are diverse, but here, we refer to an inclusive definition, which considers disinformation as both intentionally false information (also partially false information) and/or unintentionally misleading content (14).
To achieve our goals, we asked GPT-3 to write tweets containing informative or disinformative texts on a range of different topics, including vaccines, 5G technology and COVID-19, or the theory of evolution, among others, which are commonly subject to disinformation and public misconception. We collected a set of real tweets written by users on the same topics and programmed a survey in which we asked respondents to classify whether randomly selected synthetic tweets (i.e., written by GPT-3) and organic tweets (i.e., written by humans) were true or false (i.e., whether they contained accurate information or disinformation) and whether they were written by a real Twitter user or by an AI. Note that this study has been preregistered on the Open Science Framework (OSF) (15), and we have conducted a power analysis based on the findings of a pilot study, as described in Materials and Methods."""
​
    # 创建向量索引
    create_vector_indexes()
​
    # 插入文档三级层级结构
    insert_document_hierarchy_with_embeddings(document_text, "doc_001", "GPT-3 Research Paper")
​
    # 定义检索问题
    question = "What are the ethical implications of GPT-3?"
​
    print(f"检索问题: {question}")
    print("=" * 50)
​
    # 执行父文档检索
    print("\n--- 父文档检索结果 ---")
    parent_results = parent_retrieval(question, k=3)
    if parent_results:
        for i, record in enumerate(parent_results, 1):
            print(f"[{i}] (相似度: {record['score']:.4f}, 父块索引: {record['parent_index']})")
            print(f"    {record['text'][:200]}...")
            print()
    else:
        print("未检索到相关父文档")
​
    # 执行子文档检索
    print("\n--- 子文档检索结果 ---")
    child_results = child_retrieval(question, k=3)
    if child_results:
        for i, record in enumerate(child_results, 1):
            print(f"[{i}] (相似度: {record['score']:.4f}, 子块索引: {record['child_index']})")
            print(f"    {record['text']}")
            print()
    else:
        print("未检索到相关子文档")
​
    # 关闭数据库连接
    driver.close()
​
​
if __name__ == "__main__":
    main()
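
`parent_retrieval` 的关键步骤是:先对子块做向量搜索,再按父块去重,并为每个父块保留其子块命中的最高分。这一"子块命中 → 父块去重"的核心逻辑可以用纯 Python 示意如下(不依赖 Neo4j,`hits` 为假设的搜索结果,元素为 (父块 ID, 相似度分数)):

```python
def dedupe_parents(hits, k):
    """按父块去重,保留每个父块的最高分,返回分数最高的前 k 个父块。"""
    best = {}
    for parent_id, score in hits:
        if parent_id not in best or score > best[parent_id]:
            best[parent_id] = score
    # 按最高分降序排序,截取前 k 个
    return sorted(best.items(), key=lambda x: x[1], reverse=True)[:k]

# 假设检索到 8 个子块命中,分属 3 个父块
hits = [("p0", 0.92), ("p1", 0.90), ("p0", 0.88), ("p2", 0.80),
        ("p1", 0.79), ("p2", 0.75), ("p0", 0.70), ("p2", 0.66)]
print(dedupe_parents(hits, k=2))  # [('p0', 0.92), ('p1', 0.9)]
```

这也解释了代码中为什么要先取 `k * 4` 个子块:同一父块的多个子块可能同时命中,去重之后才能保证仍有足够多的不同父块可供返回。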

3.3.3 查看图结构

// 查询1: 可视化完整的三级文档结构路径
// 匹配 Document -> Parent -> Child 的完整路径关系
// 使用 path 作为路径变量名,避免与 Parent 节点变量 p 冲突
MATCH path=(d:Document)-[:HAS_PARENT]->(p:Parent)-[:HAS_CHILD]->(c:Child)
RETURN path 
LIMIT 25
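
除了可视化路径,也可以用一个简单的统计查询核对各层级节点的数量是否符合预期(基于上文已定义的标签和关系,仅作示意):

```cypher
// 查询2: 统计每个文档的父块与子块数量
MATCH (d:Document)
OPTIONAL MATCH (d)-[:HAS_PARENT]->(p:Parent)
OPTIONAL MATCH (p)-[:HAS_CHILD]->(c:Child)
RETURN d.id AS document, count(DISTINCT p) AS parent_count, count(DISTINCT c) AS child_count
```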


总结

  • 查询重写可以通过使用户查询与目标文档的语言和上下文更加匹配,从而提高文档检索的准确性。
  • 假设性文档检索器和“后退一步提示”(step-back prompting)等技术能有效弥合用户意图与文档内容之间的差距,降低遗漏相关信息的可能性。
  • 除嵌入原始文本外,还可以嵌入与上下文相关的摘要或改写内容,从而更好地捕捉文档的核心要义,提升检索系统的有效性。
  • 实施假设性问题嵌入和父文档检索等策略,可以实现查询与文档之间更精确的匹配,从而增强检索信息的相关性和准确性。
  • 将文档分割为更小、更易处理的片段进行嵌入,使得信息检索更加细致入微,确保具体查询能够找到最相关的文档部分。