[GraphRAG] [Index][graph][extractors] graph_extractor

GraphExtractionResult

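The GraphExtractionResult class wraps the result of a graph extraction.

output: a networkx.Graph object representing the extracted graph.
source_docs: a dictionary recording the source text of each document.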
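As a rough illustration of the two fields above, the container could look like the sketch below. It assumes a plain dataclass and types output loosely as Any so the example runs without networkx; the sample text is made up for the demonstration.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class GraphExtractionResult:
    """Sketch of the result container described above (assumed shape)."""

    # A networkx.Graph in the notes; typed as Any here to avoid the dependency.
    output: Any
    # Maps a document index to that document's source text.
    source_docs: Dict[Any, Any]


# Hypothetical usage with a placeholder graph and one sample document.
result = GraphExtractionResult(
    output=None,
    source_docs={0: "Martin Smith chairs the Central Institution."},
)
```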

GraphExtractor

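GraphExtractor is the core graph extraction class. It uses a language model to extract entities and relationships from text and builds an undirected graph (a networkx.Graph). The graph is unipartite, meaning every node is an entity. Its main responsibilities are:

Initialization: sets up the language model (LLM), the graph-building parameters (delimiters, maximum number of extraction passes, and so on), and the error handler.
Entity extraction: calls the language model API to extract the entities and relationships in each document.
Result processing: parses the extracted output into a graph object, optionally merging descriptions into nodes and edges.

Key attributes

_llm: the language model used for entity extraction (CompletionLLM).
_join_descriptions: a boolean that decides whether multiple descriptions are merged onto the same node or edge.
_tuple_delimiter_key, _record_delimiter_key, _entity_types_key, etc.: the configuration keys under which the delimiters and entity types are defined.
_max_gleanings: the maximum number of extraction passes, which bounds how many times entities and relationships are extracted from a document (a single pass may be inaccurate).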

Main delimiters and configuration

DEFAULT_TUPLE_DELIMITER = "<|>"
DEFAULT_RECORD_DELIMITER = "##"
DEFAULT_COMPLETION_DELIMITER = "<|COMPLETE|>"
DEFAULT_ENTITY_TYPES = ["organization", "person", "geo", "event"]

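Main methods

__init__: constructor; initializes the extractor and sets its parameters.
__call__: the core method; takes text input, calls the language model to extract entities, and returns a GraphExtractionResult.
_process_document: processes a single document, using the language model to extract entities and relationships, and returns the result.
_process_results: parses the extraction results of multiple documents into a graph in which nodes are entities and edges are the relationships between them.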

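Steps

Initialization and configuration

Parameters

Set the extraction parameters (delimiters, entity types, maximum number of extraction passes, and so on).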

prompt_variables

{
  "entity_types": "organization,person,geo,event",
  "tuple_delimiter": "<|>",
  "record_delimiter": "##",
  "completion_delimiter": "<|COMPLETE|>"
}
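The prompt_variables above can be thought of as caller-supplied values with the defaults filled in. The sketch below shows one way to do that; build_prompt_variables is a hypothetical helper name, and the fallback logic is an assumption rather than the extractor's exact code.

```python
from typing import Any, Dict, Optional

DEFAULT_TUPLE_DELIMITER = "<|>"
DEFAULT_RECORD_DELIMITER = "##"
DEFAULT_COMPLETION_DELIMITER = "<|COMPLETE|>"
DEFAULT_ENTITY_TYPES = ["organization", "person", "geo", "event"]


def build_prompt_variables(overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Fill in defaults for any delimiter or entity-type setting not supplied."""
    overrides = overrides or {}
    entity_types = overrides.get("entity_types") or DEFAULT_ENTITY_TYPES
    return {
        **overrides,
        # The prompt expects a comma-separated string, not a list.
        "entity_types": ",".join(entity_types),
        "tuple_delimiter": overrides.get("tuple_delimiter") or DEFAULT_TUPLE_DELIMITER,
        "record_delimiter": overrides.get("record_delimiter") or DEFAULT_RECORD_DELIMITER,
        "completion_delimiter": overrides.get("completion_delimiter")
        or DEFAULT_COMPLETION_DELIMITER,
    }


defaults = build_prompt_variables()
people_only = build_prompt_variables({"entity_types": ["person"]})
```

Calling it with no arguments reproduces the dictionary shown above; passing a shorter entity_types list narrows what the prompt asks for.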

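Task template

-Goal-

Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types in the text and all relationships among the identified entities.

-Steps-

1. Identify all entities. For each identified entity, extract the following information:
- entity_name: name of the entity, capitalized
- entity_type: one of the following types: [organization, person, geo, event]
- entity_description: a detailed description of the entity's attributes and activities

Format each entity as ("entity"<|><entity_name><|><entity_type><|><entity_description>)

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are clearly and directly related to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, i.e. the entity_name defined in step 1
- target_entity: name of the target entity
- relationship_description: an explanation of why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score (score range) indicating the strength of the relationship between source_entity and target_entity

Format each relationship as ("relationship"<|><source_entity><|><target_entity><|><relationship_description><|><relationship_strength>)

3. Combine everything identified in step 1 and step 2 into a single result list. Use **##** as the delimiter between items.

4. When finished, output <|COMPLETE|>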

COT

######################
-Examples-
######################
Example 1:
Entity_types: ORGANIZATION,PERSON
Text:
The Verdantis's Central Institution is scheduled to meet on Monday and Thursday, with the institution planning to release its latest policy decision on Thursday at 1:30 p.m. PDT, followed by a press conference where Central Institution Chair Martin Smith will take questions. Investors expect the Market Strategy Committee to hold its benchmark interest rate steady in a range of 3.5%-3.75%.
######################
Output:
("entity"<|>CENTRAL INSTITUTION<|>ORGANIZATION<|>The Central Institution is the Federal Reserve of Verdantis, which is setting interest rates on Monday and Thursday)
##
("entity"<|>MARTIN SMITH<|>PERSON<|>Martin Smith is the chair of the Central Institution)
##
("entity"<|>MARKET STRATEGY COMMITTEE<|>ORGANIZATION<|>The Central Institution committee makes key decisions about interest rates and the growth of Verdantis's money supply)
##
("relationship"<|>MARTIN SMITH<|>CENTRAL INSTITUTION<|>Martin Smith is the Chair of the Central Institution and will answer questions at a press conference<|>9)
<|COMPLETE|>

Template

-Goal-

-Steps-

######################
-Examples-
######################
Example 1:

######################
Output:

######################

-Real Data-
######################
Entity_types: {entity_types}
Text: {input_text}
######################
Output:
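To make the placeholder substitution concrete, here is a minimal sketch that fills the -Real Data- tail of the template. The render helper is hypothetical, and plain string replacement is an assumption chosen for illustration; it is not necessarily how the LLM wrapper substitutes variables.

```python
from typing import Dict

# Only the tail of the template is reproduced here.
REAL_DATA_TEMPLATE = (
    "-Real Data-\n"
    "######################\n"
    "Entity_types: {entity_types}\n"
    "Text: {input_text}\n"
    "######################\n"
    "Output:"
)


def render(template: str, variables: Dict[str, str]) -> str:
    """Replace each {key} placeholder with its value, leaving other text alone."""
    rendered = template
    for key, value in variables.items():
        rendered = rendered.replace("{" + key + "}", str(value))
    return rendered


prompt = render(
    REAL_DATA_TEMPLATE,
    {
        "entity_types": "organization,person",
        "input_text": "Martin Smith chairs the Central Institution.",
    },
)
```

Per-key replacement is used instead of str.format because a prompt may contain literal braces that str.format would try to interpret.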

The encoder is loaded with tiktoken, and parameters such as logit_bias and max_tokens are configured to optimize the model's inference.
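One plausible use of these parameters is to push the model toward a one-token YES or NO reply. The sketch below builds such a parameter dictionary; build_loop_args, the bias value of 100, and the toy vocabulary are all assumptions made for illustration, and the encoder is passed in as a callable so the example runs without tiktoken.

```python
from typing import Callable, Dict, List


def build_loop_args(encode: Callable[[str], List[int]], bias: int = 100) -> Dict:
    """Bias the first token of YES and NO and cap the reply at one token."""
    yes_id = encode("YES")[0]
    no_id = encode("NO")[0]
    return {"logit_bias": {yes_id: bias, no_id: bias}, "max_tokens": 1}


# With tiktoken installed, a real encoder could be passed instead:
#   import tiktoken
#   encode = tiktoken.get_encoding("cl100k_base").encode
# A toy vocabulary stands in here; its token ids are made up.
toy_vocab = {"YES": 1, "NO": 2}
loop_args = build_loop_args(lambda text: [toy_vocab[text]])
```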

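Entity extraction

_process_document -> str:

Calls the language model on each document with the custom extraction prompt GRAPH_EXTRACTION_PROMPT (.\prompts\index\entity_extraction.py).

Extraction may be repeated (at most max_gleanings times) until either the maximum count is reached or the model reports that extraction is complete.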

response = await self._llm(
    CONTINUE_PROMPT,
    name=f"extract-continuation-{i}",
    history=response.history,
)

CONTINUE_PROMPT:

MANY entities and relationships were missed in the last extraction. Remember to ONLY emit entities that match any of the previously extracted types. Add them below using the same format:\n

Maximum count reached

if i >= self._max_gleanings - 1:
    break

The model decides whether extraction is complete

response = await self._llm(
  LOOP_PROMPT,
  name=f"extract-loopcheck-{i}",
  history=response.history,
  model_parameters=self._loop_args,
)
if response.output != "YES":
  break

LOOP_PROMPT:

It appears some entities and relationships may have still been missed.  Answer YES | NO if there are still entities or relationships that need to be added.\n
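Put together, the continuation call, the maximum-count check, and the loop check form a single loop. The sketch below assembles the fragments above around a scripted stand-in for the LLM; FakeLLM, FakeResponse, and extract_with_gleanings are hypothetical names, the initial extraction call is an assumption, and the prompt strings are abbreviated.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List

CONTINUE_PROMPT = "MANY entities and relationships were missed in the last extraction. ..."
LOOP_PROMPT = "It appears some entities and relationships may have still been missed. ..."


@dataclass
class FakeResponse:
    output: str
    history: List[str] = field(default_factory=list)


class FakeLLM:
    """Hypothetical stand-in that replays scripted replies and records call names."""

    def __init__(self, replies: List[str]) -> None:
        self._replies = list(replies)
        self.calls: List[str] = []

    async def __call__(self, prompt, name="", history=None, model_parameters=None):
        self.calls.append(name)
        output = self._replies.pop(0)
        return FakeResponse(output, (history or []) + [prompt, output])


async def extract_with_gleanings(llm, first_prompt: str, max_gleanings: int) -> str:
    # Initial extraction pass.
    response = await llm(first_prompt, name="extract")
    results = response.output
    for i in range(max_gleanings):
        # Ask for anything that was missed and append it to the results.
        response = await llm(
            CONTINUE_PROMPT, name=f"extract-continuation-{i}", history=response.history
        )
        results += response.output
        # Stop once the maximum count is reached.
        if i >= max_gleanings - 1:
            break
        # Otherwise let the model decide whether another pass is needed.
        response = await llm(
            LOOP_PROMPT, name=f"extract-loopcheck-{i}", history=response.history
        )
        if response.output != "YES":
            break
    return results


llm = FakeLLM(["A##", "B##", "YES", "C##", "NO"])
gleaned = asyncio.run(extract_with_gleanings(llm, "prompt", max_gleanings=3))
```

In this scripted run the model answers YES once and then NO, so the loop stops after the second continuation even though max_gleanings allows a third.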

Result processing

_process_results -> nx.Graph:

{doc_index: text}

text

("entity"<|><entity_name><|><entity_type><|><entity_description>)
##
("entity"<|><entity_name><|><entity_type><|><entity_description>)
##
("relationship"<|><source_entity><|><target_entity><|><relationship_description><|><relationship_strength>)
##
<|COMPLETE|>

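Processes the extracted string data, splitting it into records and parsing them into graph nodes and edges.
In the graph, nodes represent entities (organizations, people, geographic locations, and so on) and edges represent the relationships between entities.
- Node: entity
  - entity_name
  - entity_type
  - entity_description
- Edge: relationship
  - source
  - target
  - edge_description
  - edge_source_id: source document id

The attributes of each node and edge (description, source document id, weight, and so on) are recorded.

Duplicates and description merging: if the same entity or relationship appears in multiple documents, the extractor either merges the descriptions or, depending on the setting, keeps the richest description.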
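The record-splitting step can be sketched as follows. parse_records is a hypothetical helper, not the extractor's own code; it assumes the default delimiters and uses two records copied from Example 1 as input.

```python
import re
from typing import List

TUPLE_DELIMITER = "<|>"
RECORD_DELIMITER = "##"
COMPLETION_DELIMITER = "<|COMPLETE|>"


def parse_records(raw: str) -> List[List[str]]:
    """Split raw LLM output into records, each a list of fields."""
    body = raw.replace(COMPLETION_DELIMITER, "")
    records = []
    for chunk in body.split(RECORD_DELIMITER):
        # Drop surrounding whitespace, then the outer parentheses.
        chunk = re.sub(r"^\(|\)$", "", chunk.strip())
        if not chunk:
            continue
        records.append([part.strip() for part in chunk.split(TUPLE_DELIMITER)])
    return records


raw = (
    '("entity"<|>MARTIN SMITH<|>PERSON<|>Martin Smith is the chair of the Central Institution)\n'
    "##\n"
    '("relationship"<|>MARTIN SMITH<|>CENTRAL INSTITUTION<|>Martin Smith is the Chair of the '
    "Central Institution and will answer questions at a press conference<|>9)\n"
    "<|COMPLETE|>"
)
records = parse_records(raw)
entities = [r for r in records if r[0] == '"entity"']
relationships = [r for r in records if r[0] == '"relationship"']
```

The first field keeps its quotation marks, so a record's kind is checked against '"entity"' or '"relationship"' including the quotes.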

entity

write

graph.add_node(
    entity_name,
    type=entity_type,
    description=entity_description,
    source_id=str(source_doc_id),
)

update

node = graph.nodes[entity_name]
# join the descriptions with \n
if self._join_descriptions:
    node["description"] = "\n".join(
        list({
            *_unpack_descriptions(node),
            entity_description,
        })
    )
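The node update above relies on _unpack_descriptions, which these notes do not define. The sketch below assumes it splits the stored description on newlines and uses a plain dict in place of a graph node; both helper names are hypothetical.

```python
from typing import Dict, List


def unpack_descriptions(data: Dict[str, str]) -> List[str]:
    """Assumed helper: split a stored description back into its parts."""
    value = data.get("description")
    return [] if value is None else value.split("\n")


def merge_node_description(node: Dict[str, str], new_description: str) -> None:
    """Union the existing and new descriptions, then store them newline-joined."""
    merged = {*unpack_descriptions(node), new_description}
    # sorted() makes the result deterministic; a bare set has no fixed order.
    node["description"] = "\n".join(sorted(merged))


node = {"type": "PERSON", "description": "Chair of the Central Institution"}
merge_node_description(node, "Will take questions at a press conference")
# A repeated description is absorbed by the set and does not appear twice.
merge_node_description(node, "Chair of the Central Institution")
```

Because the snippet above joins list({...}) directly, the order of the merged descriptions there is not guaranteed; sorting is one way to make it stable.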

relationship

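If an endpoint node is not yet in the graph:

for node_name in (source, target):
    if node_name not in graph.nodes():
        graph.add_node(
            node_name,
            type="",
            description="",
            source_id=edge_source_id,
        )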

write

graph.add_edge(
    source,
    target,
    weight=weight,
    description=edge_description,
    source_id=edge_source_id,
)

update

edge_data = graph.get_edge_data(source, target)
if edge_data is not None:
    # add the weights together
    weight += edge_data["weight"]
    if self._join_descriptions:
        # join the edge descriptions with \n
        edge_description = "\n".join(
            list({
                *_unpack_descriptions(edge_data),
                edge_description,
            })
        )
        # join the edge source ids with ", "
        edge_source_id = ", ".join(
            list({
                *_unpack_source_ids(edge_data),
                str(source_doc_id),
            })
        )
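The edge update can be exercised without networkx by treating the stored edge data as a dict. The sketch below assumes description joining is enabled and that _unpack_source_ids splits on ", "; merge_edge and unpack_source_ids are hypothetical names and the sample descriptions are made up.

```python
from typing import Dict, List, Optional


def unpack_source_ids(data: Dict) -> List[str]:
    """Assumed helper: split a stored source_id string back into ids."""
    value = data.get("source_id")
    return [] if value is None else value.split(", ")


def merge_edge(existing: Optional[Dict], weight: float, description: str, source_doc_id) -> Dict:
    """Return the attributes to store for an edge, merging with existing data if any."""
    source_id = str(source_doc_id)
    if existing is not None:
        # Weights accumulate across mentions of the same relationship.
        weight += existing["weight"]
        # Descriptions and source ids are de-duplicated, then joined.
        description = "\n".join(sorted({*existing["description"].split("\n"), description}))
        source_id = ", ".join(sorted({*unpack_source_ids(existing), source_id}))
    return {"weight": weight, "description": description, "source_id": source_id}


first = merge_edge(None, 9.0, "Martin Smith chairs the Central Institution", 0)
second = merge_edge(first, 7.0, "Martin Smith speaks for the Central Institution", 1)
```

After the second mention the edge carries the summed weight, both descriptions, and both source document ids.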