[GraphRAG][Index][Flows]

create_base_text_units

Function parameters

documents: pd.DataFrame

text	id	title
document	document_id	document_title
chunk_by_columns: list[str]

[id]

chunk_strategy

{
  "type": "tokens",
  "chunk_size": 1200,
  "chunk_overlap": 100,
  "group_by_columns": [
    "id"
  ],
  "encoding_name": "cl100k_base"
}

Output

Returns a DataFrame of chunked text units. Each row is one chunk of text plus its metadata (such as id, document_ids, and n_tokens).

id	text	document_ids	n_tokens
chunked_document	chunked_document	[document_id, ...]

Step-by-step walkthrough

Sort

Sort the documents by the id column.

sort = documents.sort_values(by=["id"], ascending=[True])

Combine columns

Combine the id and text columns into a new column of (id, text) tuples.

sort["text_with_ids"] = list(
        zip(*[sort[col] for col in ["id", "text"]], strict=True)
    )

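Aggregate

Group the rows by the configured columns (chunk_by_columns) and collect the (id, text) tuples of each group into one array, so that all texts of a group sit in a single row.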

aggregated = _aggregate_df(
    sort,
    groupby=[*chunk_by_columns] if len(chunk_by_columns) > 0 else None,
    aggregations=[
        {
            "column": "text_with_ids",
            "operation": "array_agg",
            "to": "texts",
        }
    ],
)
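
The sort, combine, and aggregate steps can be tried out with plain pandas. This is only a sketch: it assumes that the "array_agg" operation behaves like collecting each group into a Python list, and the sample rows are invented.

```python
import pandas as pd

# Toy input; the rows are invented for illustration.
documents = pd.DataFrame({
    "id": ["d2", "d1"],
    "text": ["second document", "first document"],
    "title": ["B", "A"],
})

# 1. Sort by id.
sort = documents.sort_values(by=["id"], ascending=[True])

# 2. Pair every text with its document id.
sort["text_with_ids"] = list(zip(sort["id"], sort["text"]))

# 3. Collect the pairs of each group into one list
#    (assumed equivalent of the "array_agg" operation).
aggregated = (
    sort.groupby(["id"], sort=False)
    .agg(texts=("text_with_ids", list))
    .reset_index()
)
print(aggregated)
```

With chunk_by_columns set to ["id"], every group holds exactly one document, so each texts array has a single (id, text) pair.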

Chunk

Apply the chunking strategy (chunk_strategy) to split each aggregated text array into smaller text chunks.

chunked = chunk_text(
    aggregated,
    column="texts",
    to="chunks",
    callbacks=callbacks,
    strategy=chunk_strategy,
)
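
To make the chunk_size and chunk_overlap settings concrete, here is a minimal sliding-window sketch. It is not the GraphRAG implementation: chunk_tokens is a hypothetical helper, and plain integers stand in for the token ids that a real tokenizer (such as the cl100k_base encoding named in the strategy) would produce.

```python
def chunk_tokens(tokens, chunk_size=1200, chunk_overlap=100):
    """Split a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the input
        start += step
    return chunks


tokens = list(range(2500))  # stand-in for encoded token ids
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

Each window starts chunk_size - chunk_overlap tokens after the previous one, so the last 100 tokens of a chunk reappear at the start of the next chunk.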

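Generate hashes: compute a hash for each text chunk and use it as the chunk's id, which keeps chunks unique and traceable.
Rename fields: rename the chunked columns to the names that downstream steps expect.
Optional snapshot: if snapshot_transient_enabled is set, the result is saved as an intermediate snapshot file in Parquet format.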
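
The hashing step can be pictured as follows. The text above does not say which hash function is used, so SHA-512 and the helper name chunk_id are assumptions made for this sketch.

```python
import hashlib


def chunk_id(text: str) -> str:
    """Derive a stable id from the chunk text (hypothetical helper)."""
    return hashlib.sha512(text.encode("utf-8")).hexdigest()


a = chunk_id("same text")
b = chunk_id("same text")
c = chunk_id("other text")
print(a == b, a == c)
```

Because the id depends only on the content, re-running the step on unchanged input yields the same ids.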

create_final_documents

Function parameters

documents: pd.DataFrame

text	id	title
document	id	document_title
text_units: pd.DataFrame

The text unit table produced by create_base_text_units. Each text unit (chunk) is linked to the documents it came from.

text	id	document_ids	n_tokens
chunked_document	id	chunked_document_ids
document_attribute_columns: list[str] | None

Optional. Names of the document attribute columns to process. When given, these columns are merged into a single new JSON object column.

Output

The final columns are the document ID, title, text, text unit IDs and, if specified, the merged attributes column.

id	human_readable_id	title	text	text_unit_ids
id	human_readable_id	document_title	document	[chunked_document_id1, ..., chunked_document_idn]

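Step-by-step walkthrough

Explode the `text_units` table

Call .explode("document_ids") to split the document_ids column of text_units into multiple rows, so that each row pairs one text chunk with one document ID.
Rename columns:
- document_ids -> chunk_doc_id (the document ID).
- id -> chunk_id (the text unit ID).
- text -> chunk_text (the chunk content).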

exploded = (
    text_units.explode("document_ids")
    .loc[:, ["id", "document_ids", "text"]]
    .rename(
        columns={
            "document_ids": "chunk_doc_id",
            "id": "chunk_id",
            "text": "chunk_text",
        }
    )
)

chunk_id	chunk_doc_id	chunk_text
chunked_document_ids	document_id	chunked_document

Merge `documents` and `text_units`

Inner-join the exploded text_units table with the original documents table using merge, matching chunk_doc_id on the left against id on the right.

joined = exploded.merge(
    documents,
    left_on="chunk_doc_id",
    right_on="id",
    how="inner",
    copy=False,
)

chunk_id	chunk_doc_id	chunk_text	text	id	title
chunked_document_ids	document_id	chunked_document	document	id	document_title

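Aggregate the text units of each document

Group the joined table by the document id and collect all chunk_id values of each document into a list, text_unit_ids.
Merge the aggregated text unit ID lists with the original documents table again, so that the full document information and its text unit IDs sit on one row.
Use rejoined.index + 1 to give each document a "human-readable ID" (human_readable_id) that is easy for users to recognize.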

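docs_with_text_units = joined.groupby("id", sort=False).agg(
    text_unit_ids=("chunk_id", list)
)
# Re-merge with documents (the step described above; the join type is assumed)
rejoined = docs_with_text_units.merge(
    documents, on="id", how="right"
).reset_index(drop=True)
rejoined["id"] = rejoined["id"].astype(str)
rejoined["human_readable_id"] = rejoined.index + 1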

id	text_unit_ids	text	title	human_readable_id
id	[chunked_document_id1, ..., chunked_document_idn]	document	document_title	human_readable_id
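
The explode, merge, and group steps can be followed end to end on a toy input. The rows below are invented, and the sketch stops once each document has its list of chunk ids.

```python
import pandas as pd

# Toy inputs; all values are invented for illustration.
documents = pd.DataFrame({
    "id": ["d1", "d2"],
    "title": ["A", "B"],
    "text": ["first document", "second document"],
})
text_units = pd.DataFrame({
    "id": ["c1", "c2", "c3"],
    "text": ["first", "document", "second document"],
    "document_ids": [["d1"], ["d1"], ["d2"]],
    "n_tokens": [1, 1, 2],
})

# One row per (chunk, document) pair.
exploded = (
    text_units.explode("document_ids")
    .loc[:, ["id", "document_ids", "text"]]
    .rename(
        columns={
            "document_ids": "chunk_doc_id",
            "id": "chunk_id",
            "text": "chunk_text",
        }
    )
)

# Attach the document columns to every chunk.
joined = exploded.merge(
    documents, left_on="chunk_doc_id", right_on="id", how="inner"
)

# Collect the chunk ids of each document.
docs_with_text_units = (
    joined.groupby("id", sort=False)
    .agg(text_unit_ids=("chunk_id", list))
    .reset_index()
)
print(docs_with_text_units)
```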

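Process the document attribute columns (optional)

If document_attribute_columns is specified, those columns are converted to strings and merged into a single JSON object column named attributes, and the original attribute columns are dropped.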
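
A plain-Python sketch of this optional step is shown below. The helper name merge_attribute_columns and the sample row are hypothetical; only the behavior (stringify, merge into one JSON object, drop the originals) comes from the description above.

```python
import json


def merge_attribute_columns(rows, attribute_columns):
    """Move the given columns of each row into one JSON 'attributes' string."""
    merged = []
    for row in rows:
        # Convert every attribute value to a string before merging.
        attributes = {col: str(row[col]) for col in attribute_columns}
        # Keep the remaining columns and drop the original attribute columns.
        rest = {k: v for k, v in row.items() if k not in attribute_columns}
        rest["attributes"] = json.dumps(attributes)
        merged.append(rest)
    return merged


rows = [{"id": "d1", "title": "A", "author": "Dickens", "year": 1843}]
result = merge_attribute_columns(rows, ["author", "year"])
print(result)
```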

create_base_entity_graph

Function parameters

text_units: pd.DataFrame (the output of create_base_text_units)
clustering_strategy: dict[str, Any]

Clustering strategy; configures the clustering step applied to the entity graph.
```
{
  "type": "leiden",
  "max_cluster_size": 10
}
```

extraction_strategy: dict[str, Any] | None

Entity extraction strategy; configures how entities are extracted.

{
  "type": "graph_intelligence",
  "llm": {},
  "stagger": 0.3,
  "num_threads": 50,
  "extraction_prompt": "./prompts/entity_extraction.txt",
  "max_gleanings": 1,
  "encoding_name": "cl100k_base",
  "prechunked": true
}

entity_types: list[str] | None

The entity types to extract (for example person names or place names). None means that all entity types are extracted.
```
['organization', 'person', 'geo', 'event']
```

node_merge_config: dict[str, Any] | None

Node merge configuration; controls how nodes in the graph are merged.

{
  "source_id": {
    "operation": "concat",
    "delimiter": ", ",
    "distinct": true
  },
  "description": {
    "operation": "concat",
    "separator": "\\n",
    "distinct": false
  }
}

edge_merge_config: dict[str, Any] | None

Edge merge configuration; controls how edges in the graph are merged.

{
  "source_id": {
    "operation": "concat",
    "delimiter": ", ",
    "distinct": true
  },
  "description": {
    "operation": "concat",
    "separator": "\\n",
    "distinct": false
  },
  "weight": "sum"
}

summarization_strategy: dict[str, Any] | None

Summarization strategy; configures how the descriptions in the graph are summarized.

{
  "type": "graph_intelligence",
  "llm": {},
  "stagger": 0.3,
  "num_threads": 50,
  "summarize_prompt": "./prompts/summarize_descriptions.txt",
  "max_summary_length": 500
}

Output

level	clustered_graph
0	GraphML of the clustered graph at each level

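Step-by-step walkthrough

Extract entities: `extract_entities`

Extract entity information from each chunked_doc and return an entity graph for each text unit.

graph_extractor extracts the entities of each chunked_doc and returns a GraphExtractionResult

@dataclass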
class GraphExtractionResult:
    """Unipartite graph extraction result class definition."""

    output: nx.Graph
    source_docs: dict[Any, Any]

Read GraphExtractionResult.output.nodes and collect all entity information of each doc

# results: GraphExtractionResult
graph = results.output
entities = [
    ({"name": item[0], **(item[1] or {})})
    for item in graph.nodes(data=True)
    if item is not None
]

entities:

[
  {
    "name": "",
    "type": "",
    "description": "",
    "source_id": ""
  },
    ...
]

This yields an EntityExtractionResult

entities: all entities of the chunked_doc
graph: the graph of the chunked_doc

@dataclass
class EntityExtractionResult:
    """Entity extraction result class definition."""

    entities: list[ExtractedEntity]
    graph: nx.Graph | None

Add a new entities column to the input text_units

text	id	document_ids	n_tokens	entities
chunked_document	id	chunked_document_ids		entities

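Merge entity graphs: `merge_graphs`

Merge the per-text-unit entity graphs into one larger graph, handling nodes and edges according to the merge configuration.

Create a new graph and merge each of the previously extracted subgraphs into it:

If the target does not contain the node/edge yet: add the node/edge directly
If the target already contains the node/edge: merge it in using the merge_operation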

DetailedAttributeMergeOperation
```
@dataclass
class DetailedAttributeMergeOperation:
    """
    Detailed attribute merge operation class definition.
    """
    operation: str  # StringOperation | NumericOperation
    # concat
    separator: str | None = None
    delimiter: str | None = None
    distinct: bool = False
```
- node_ops
  - description:
    
    DetailedAttributeMergeOperation(operation='concat', separator='\n', delimiter=None, distinct=False)
  - source_id:
    
    DetailedAttributeMergeOperation(operation='concat', separator=None, delimiter=', ', distinct=True)
- edge_ops
  - description:
    
    DetailedAttributeMergeOperation(operation='concat', separator='\n', delimiter=None, distinct=False)
  - source_id:
    
    DetailedAttributeMergeOperation(operation='concat', separator=None, delimiter=', ', distinct=True)
  - weight:
    
    DetailedAttributeMergeOperation(operation='sum', separator=None, delimiter=None, distinct=False)

merge_nodes

def merge_nodes(
    target: nx.Graph,
    subgraph: nx.Graph,
    node_ops: dict[str, DetailedAttributeMergeOperation],
):
    """
    Merge nodes from subgraph into target using the operations defined in node_ops.
    """
    for node in subgraph.nodes:
        if node not in target.nodes:
            target.add_node(node, **(subgraph.nodes[node] or {}))
        else:
            merge_attributes(target.nodes[node], subgraph.nodes[node], node_ops)

merge_edges

def merge_edges(
    target_graph: nx.Graph,
    subgraph: nx.Graph,
    edge_ops: dict[str, DetailedAttributeMergeOperation],
):
    """
    Merge edges from subgraph into target using the operations defined in edge_ops.
    """
    for source, target, edge_data in subgraph.edges(data=True):  # type: ignore
        if not target_graph.has_edge(source, target):
            target_graph.add_edge(source, target, **(edge_data or {}))
        else:
            merge_attributes(
                target_graph.edges[(source, target)],  # noqa
                edge_data,
                edge_ops,
            )

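merge_attributes

Merges the attributes of source_item into target_item according to the operations defined in ops. It works for both node and edge attributes.
- target_item: the target node or edge that receives the merged attributes.
- source_item: the source node or edge whose attributes are merged into the target.
- ops: a dictionary that defines the merge operation for each attribute.

apply_merge_operation

Applies the attribute merge operation defined in op. Supported operations:
- Replace: replace the attribute value.
- Skip: skip the attribute and leave it unchanged.
- Concat: concatenate the attribute values, with an optional separator.
- Sum: sum the attribute values.
- Max: take the maximum of the attribute values.
- Min: take the minimum of the attribute values.
- Average: average the attribute values.
- Multiply: multiply the attribute values.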
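
The following sketch shows how such per-attribute operations could be applied when a node or edge already exists in the target graph. MergeOp and apply_op are simplified stand-ins written for this example, not the library's classes; the sketch covers only a subset of the operations (Average is left out), and the rule that either separator or delimiter is used as the glue string is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MergeOp:
    """Simplified stand-in for DetailedAttributeMergeOperation."""

    operation: str
    separator: Optional[str] = None
    delimiter: Optional[str] = None
    distinct: bool = False


def apply_op(target: dict, source: dict, attrib: str, op: MergeOp) -> None:
    """Merge source[attrib] into target[attrib] in place."""
    old, new = target.get(attrib), source.get(attrib)
    if op.operation == "skip" or new is None:
        return
    if op.operation == "replace" or old is None:
        target[attrib] = new
    elif op.operation == "concat":
        glue = op.separator or op.delimiter or ","  # assumed precedence
        parts = f"{old}{glue}{new}".split(glue)
        if op.distinct:
            parts = list(dict.fromkeys(parts))  # drop repeats, keep order
        target[attrib] = glue.join(parts)
    elif op.operation == "sum":
        target[attrib] = old + new
    elif op.operation == "max":
        target[attrib] = max(old, new)
    elif op.operation == "min":
        target[attrib] = min(old, new)
    elif op.operation == "multiply":
        target[attrib] = old * new
    else:
        raise ValueError(f"unsupported operation: {op.operation}")


# Invented attribute values for one edge seen in two text units.
edge = {"source_id": "t1, t2", "description": "A clerk.", "weight": 1.0}
incoming = {"source_id": "t2, t3", "description": "Works for Scrooge.", "weight": 2.0}
ops = {
    "source_id": MergeOp("concat", delimiter=", ", distinct=True),
    "description": MergeOp("concat", separator="\n"),
    "weight": MergeOp("sum"),
}
for attrib, op in ops.items():
    apply_op(edge, incoming, attrib, op)
print(edge)
```

With distinct set, repeated text unit ids collapse into one entry, while descriptions are simply appended and left for the summarization step to condense.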

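Generate summaries: `summarize_descriptions`

During merge_graphs, the descriptions of an entity or relationship that appears in several chunked_docs are simply concatenated. This step summarizes those concatenated descriptions on the merged entity graph, producing a shorter description that keeps the main information.

description_summary_extractor performs the actual summarization and returns a SummarizationResult

items: the graph node or edge being described; either the name of a single node or the node pair of an edge.
description: the summarized description of that node or edge.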

@dataclass
class SummarizationResult:
    """Unipartite graph extraction result class definition."""

    items: str | tuple[str, str]
    description: str

Update the graph with the SummarizationResult values

for result in results:
    graph_item = result.items
    if isinstance(graph_item, str) and graph_item in graph.nodes():
        graph.nodes[graph_item]["description"] = result.description
    elif isinstance(graph_item, tuple) and graph_item in graph.edges():
        graph.edges[graph_item]["description"] = result.description

return graph

Cluster: `cluster_graph`

Cluster the summarized entity graph with the Leiden algorithm. The clustering strategy groups the nodes of the graph into communities: [(level, cluster_id, nodes), ...]
```
"strategy": {
  "type": "leiden",
  "max_cluster_size": 10
}
```
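Apply the resulting community information to the nodes and edges of the graph:
- Set the community information on each node as the cluster and level attributes.
- Compute the degree of each node (the number of edges incident to it) and store it as a node attribute.
- Generate a UUID (id) and an index (human_readable_id) for every node and edge, and add them as attributes.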
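
The node-side part of these steps can be sketched without a graph library. apply_clustering is a hypothetical helper that works on plain lists; the real step operates on a graph object, and the node names below are invented.

```python
import uuid


def apply_clustering(nodes, edges, communities, level=0):
    """Annotate nodes with cluster, level, degree, id and human_readable_id.

    nodes: list of node names; edges: list of (source, target) pairs;
    communities: list of (level, cluster_id, member_nodes) tuples.
    """
    attrs = {name: {} for name in nodes}
    # Copy the community assignment of the requested level onto the nodes.
    for community_level, cluster_id, members in communities:
        if community_level == level:
            for name in members:
                attrs[name]["cluster"] = cluster_id
                attrs[name]["level"] = community_level
    # Degree: number of edge endpoints that touch the node.
    for source, target in edges:
        attrs[source]["degree"] = attrs[source].get("degree", 0) + 1
        attrs[target]["degree"] = attrs[target].get("degree", 0) + 1
    # Identifiers: a UUID plus a running index.
    for index, name in enumerate(nodes):
        attrs[name].setdefault("degree", 0)
        attrs[name]["human_readable_id"] = index
        attrs[name]["id"] = str(uuid.uuid4())
    return attrs


nodes = ["SCROOGE", "MARLEY", "FEZZIWIG"]
edges = [("SCROOGE", "MARLEY"), ("SCROOGE", "FEZZIWIG")]
communities = [(0, "0", ["SCROOGE", "MARLEY"]), (0, "1", ["FEZZIWIG"])]
result = apply_clustering(nodes, edges, communities)
print(result["SCROOGE"])
```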
Output

entity_graph	level	clustered_graph
GraphML of the graph before clustering	0	GraphML of the clustered graph at each level
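Generate embeddings: `embed_graph` (optional)

If an embedding strategy (embedding_strategy) is specified, generate an embedding representation for the clustered entity graph.

Save intermediate results (optional)

If the corresponding snapshot settings are enabled, the raw entities, the merged entity graph, the summarized graph, the clustered graph, and its embeddings are saved as GraphML or in other formats.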

create_final_entities

workflow:create_base_entity_graph

Function parameters

entity_graph: pd.DataFrame

Output

entities: pd.DataFrame

id	human_readable_id	title	type	description	text_unit_ids

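Step-by-step walkthrough

Parse `entity_graph`: `unpack_graph`

input_df: entity_graph: pd.DataFrame

level	clustered_graph
0	GraphML of the clustered graph at each level
column: name of the graph column to unpack (clustered_graph).
type: what to unpack,
- "nodes": nodes
- "edges": edges
copy: the columns to copy over, ["level"] by default.
embeddings_column: the column that stores the graph embeddings, "embeddings" by default.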

_unpack_nodes

nx.parse_graphml parses the .graphml string into an nx.Graph

Each node of the nx.Graph is then converted to a dict

{
  "level": "cluster_level",
  "label": "entity_name",
  "type": "entity_type",
  "description": "description",
  "source_id": "doc_id",
  "degree": 3,
  "human_readable_id": 0,
  "id": "node_id",
  "graph_embedding": null
}

Post-processing

Convert nodes to a pd.DataFrame
Drop the rows that have no title

Split source_id values that hold several IDs into a list

nodes = nodes.loc[nodes["title"].notna()]
nodes["text_unit_ids"] = nodes["source_id"].str.split(",")

create_final_nodes

workflow:create_base_entity_graph

Function parameters

entity_graph: pd.DataFrame

Output

entities: pd.DataFrame

id	human_readable_id	title	community	level	degree	x	y

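Step-by-step walkthrough

`layout_graph`

layout_graph operates on entity_graph and produces a position (a NodePosition) for every node. The x, y, and z coordinates are computed according to the strategy type and then added to entity_graph.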

@dataclass
class NodePosition:
    """Node position class definition."""
    
    label: str
    cluster: str
    size: float
        
    x: float
    y: float
    z: float | None = None

    def to_pandas(self) -> tuple[str, float, float, str, float]:
        """To pandas method definition."""
        return self.label, self.x, self.y, self.cluster, self.size

strategy:

class LayoutGraphStrategyType(str, Enum):
    """LayoutGraphStrategyType class definition."""

    umap = "umap"
    zero = "zero"

run_umap
- computes x, y, and z from the embedding vectors
run_zero
- x = y = z = 0
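
The "zero" strategy is simple enough to sketch in a few lines. Position is a reduced stand-in for NodePosition and zero_layout is a hypothetical helper; both exist only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Position:
    """Reduced version of NodePosition, for illustration only."""

    label: str
    cluster: str
    size: float
    x: float
    y: float
    z: Optional[float] = None


def zero_layout(labels, clusters, sizes):
    """Place every node at the origin, as described for the zero strategy."""
    return [
        Position(label=label, cluster=str(cluster), size=size, x=0.0, y=0.0, z=0.0)
        for label, cluster, size in zip(labels, clusters, sizes)
    ]


positions = zero_layout(["SCROOGE", "MARLEY"], [0, 0], [2, 1])
print(positions)
```

A layout like this carries no spatial information; it only keeps the x and y columns present for downstream steps when no embeddings are available.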

Data processing

Each node after the layout step:

{
  "level": "cluster_level",
  "label": "entity_name",
  "type": "entity_type",
  "description": "description",
  "source_id": "doc_id",
  "degree": 3,
  "human_readable_id": 0,
  "id": "node_id",
  "graph_embedding": null,
  "x": 0,
  "y": 0, 
  "size": "degree"
}

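Filter nodes:
- nodes_without_positions holds the original node data (without the x and y positions)
- nodes keeps only the nodes of the given level (level_for_node_positions, here 0) with the index reset; the id, x, and y columns are then taken from it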
```
nodes_without_positions = nodes.drop(columns=["x", "y"])
nodes = nodes[nodes["level"] == level_for_node_positions].reset_index(drop=True)
```
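Snapshot the top-level nodes (optional):

If snapshot_top_level_nodes_enabled is True, the snapshot function is called to save the node data as JSON in storage.
Merge the positioned node data (nodes) with the original node data without the position columns (nodes_without_positions) on the id column.
Rename columns: label becomes title and cluster becomes community.
Fill missing community values with -1 and convert the column to integers.
Drop the columns that are not needed (source_id, type, description, size, graph_embedding, and so on), keeping only the information the graph requires.

Deduplicate on the title and community columns, so that each node has exactly one record within its community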

joined = nodes_without_positions.merge(
    nodes,
    on="id",
    how="inner",
)
joined.rename(columns={"label": "title", "cluster": "community"}, inplace=True)
joined["community"] = joined["community"].fillna(-1).astype(int)
joined.drop(
    columns=["source_id", "type", "description", "size", "graph_embedding"],
    inplace=True,
)
deduped = joined.drop_duplicates(subset=["title", "community"])

create_final_communities

Build communities

Function parameters

entity_graph: pd.DataFrame

Output

community: pd.DataFrame

return filtered.loc[
    :,
    [
        "id",
        "human_readable_id",
        "community",
        "level",
        "title",
        "entity_ids",
        "relationship_ids",
        "text_unit_ids",
        "period",
        "size",
    ],
]

Step-by-step walkthrough

Unpack the graph data

Use unpack_graph to extract the nodes (graph_nodes) and edges (graph_edges) of the graph from entity_graph as DataFrames.

graph_nodes = unpack_graph(entity_graph, callbacks, "clustered_graph", "nodes")
graph_edges = unpack_graph(entity_graph, callbacks, "clustered_graph", "edges")

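Merge nodes and edges

Merge graph_nodes with graph_edges twice, producing source_clusters and target_clusters.

The first merge matches each node against the source node (source) of an edge, the second against the target node (target).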

source_clusters = graph_nodes.merge(
    graph_edges, left_on="label", right_on="source", how="inner"
)
target_clusters = graph_nodes.merge(
    graph_edges, left_on="label", right_on="target", how="inner"
)

Concatenate source and target clusters

Concatenate source_clusters and target_clusters into a single clusters DataFrame.

clusters = pd.concat([source_clusters, target_clusters], ignore_index=True)

Filter the clusters

Keep only the rows where level_x equals level_y, that is, rows whose node and edge belong to the same level.

combined_clusters = clusters[
    clusters["level_x"] == clusters["level_y"]
].reset_index(drop=True)
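
The level_x and level_y names are not defined anywhere in the code; they come from the default pandas merge suffixes. The toy example below (invented rows, reduced column set) shows where they originate: columns present in both frames get _x for the left frame (graph_nodes) and _y for the right frame (graph_edges).

```python
import pandas as pd

# Invented rows with only the columns needed for the illustration.
graph_nodes = pd.DataFrame({
    "label": ["A", "B"],
    "id": ["n1", "n2"],
    "level": [0, 0],
    "cluster": ["0", "0"],
    "source_id": ["t1", "t2"],
})
graph_edges = pd.DataFrame({
    "source": ["A"],
    "target": ["B"],
    "id": ["e1"],
    "level": [0],
    "source_id": ["t1"],
})

source_clusters = graph_nodes.merge(
    graph_edges, left_on="label", right_on="source", how="inner"
)
target_clusters = graph_nodes.merge(
    graph_edges, left_on="label", right_on="target", how="inner"
)
clusters = pd.concat([source_clusters, target_clusters], ignore_index=True)

# Columns shared by both frames are disambiguated with _x and _y.
suffixed = sorted(c for c in clusters.columns if c.endswith(("_x", "_y")))
print(suffixed)

combined_clusters = clusters[
    clusters["level_x"] == clusters["level_y"]
].reset_index(drop=True)
print(combined_clusters[["label", "id_x", "id_y", "level_x"]])
```

In the result, id_x is the node id and id_y is the edge id, which is why the later aggregation can read entity ids from id_x and relationship ids from id_y.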

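Aggregate cluster relationships

Use groupby and agg on combined_clusters to build, for each cluster and level (level_x), its related entity IDs (id_x), text unit IDs (source_id_x), and relationship IDs (id_y). Columns with the _x suffix come from graph_nodes and columns with the _y suffix come from graph_edges.

Grouping (groupby):

Group by cluster and level_x. Every row in a group shares the same cluster and level_x values.
- cluster: the cluster identifier.
- level_x: the level of the node.
Aggregation (agg):
- relationship_ids=("id_y", "unique"): collects the unique id_y values (the edge IDs) of each group. (unique is a pandas aggregation that returns the distinct values of each group.)

  These are the relationships associated with the group.
- text_unit_ids=("source_id_x", "unique"): collects the unique source_id_x values (the source_id of the node) of each group.

  These are the text unit IDs associated with the group.
- entity_ids=("id_x", "unique"): collects the unique id_x values (the node IDs) of each group.

  These are the entities associated with the group.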

cluster_relationships = (
    combined_clusters.groupby(["cluster", "level_x"], sort=False)
    .agg(
        relationship_ids=("id_y", "unique"),
        text_unit_ids=("source_id_x", "unique"),
        entity_ids=("id_x", "unique"),
    )
    .reset_index()
)

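Build the full cluster table

Aggregate by cluster and level to get the basic information (community) of each cluster, then merge it with cluster_relationships.

Grouping (groupby):

Group graph_nodes by cluster and level.
Aggregation (agg):

For each cluster and level combination, take the first cluster value of the group as the group's "community".
Merge (merge):
- The community column of all_clusters is the left join key.
- The cluster column of cluster_relationships is the right join key.
- The result only contains rows whose community and cluster values match.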

all_clusters = (
    graph_nodes.groupby(["cluster", "level"], sort=False)
    .agg(community=("cluster", "first"))
    .reset_index()
)

joined = all_clusters.merge(
    cluster_relationships,
    left_on="community",
    right_on="cluster",
    how="inner",
)

Filter again

Filter once more by matching the levels (level and level_x), so that each community row only carries data from its own level.

filtered = cast(
    pd.DataFrame,
    joined[joined["level"] == joined["level_x"]].reset_index(drop=True),
)

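Add extra fields

id: a new unique identifier (UUID) for each community.
community: the community ID (integer).
human_readable_id: the community ID.
title: the community title ("Community " + community).
period: the current date.
size: the size of the community (the number of entities it contains).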
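
These derived fields can be sketched for a single community row as follows. add_community_fields is a hypothetical helper, and the ISO date format used for period is an assumption.

```python
import uuid
from datetime import datetime, timezone


def add_community_fields(row):
    """Return a copy of a community row with the derived fields filled in."""
    community = int(row["community"])
    return {
        **row,
        "id": str(uuid.uuid4()),
        "community": community,
        "human_readable_id": community,
        "title": "Community " + str(community),
        # Assumed format: ISO date of the current day.
        "period": datetime.now(timezone.utc).date().isoformat(),
        "size": len(row["entity_ids"]),
    }


row = {"community": "3", "level": 0, "entity_ids": ["n1", "n2"]}
out = add_community_fields(row)
print(out)
```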

create_final_relationships

workflow:create_base_entity_graph

workflow:create_final_nodes

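Function parameters

entity_graph: pd.DataFrame
nodes: pd.DataFrame

Output

final_relationships: pd.DataFrame

id: unique identifier of the edge.
human_readable_id: readable ID of the edge.
source: source node of the edge.
target: target node of the edge.
description: description of the edge.
weight: weight of the edge.
combined_degree: combined degree of the edge.
text_unit_ids: text unit IDs of the edge (a list after splitting)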

return deduped.loc[
    :,
    [
        "id",
        "human_readable_id",
        "source",
        "target",
        "description",
        "weight",
        "combined_degree",
        "text_unit_ids",
    ],
]

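Step-by-step walkthrough

Parse the edge data

Call unpack_graph to extract the "edges" of clustered_graph from entity_graph as graph_edges.
Rename columns:

Rename the source_id column of graph_edges to text_unit_ids
Keep only the edges with level == 0, then drop the level column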

graph_edges = unpack_graph(entity_graph, callbacks, "clustered_graph", "edges")

graph_edges.rename(columns={"source_id": "text_unit_ids"}, inplace=True)

filtered = cast(
    pd.DataFrame, graph_edges[graph_edges["level"] == 0].reset_index(drop=True)
)

pruned_edges = filtered.drop(columns=["level"])

Parse the node data

Keep only the nodes with level == 0 and reset the index.
Select the required columns: title (the node title) and degree (the node degree).

filtered_nodes = nodes[nodes["level"] == 0].reset_index(drop=True)
filtered_nodes = cast(pd.DataFrame, filtered_nodes[["title", "degree"]])

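Compute the degree of each edge:

Use the parsed node data to compute combined_degree for every edge and store the result in the pruned_edges["combined_degree"] column.

combined_degree is the degree of an edge: the sum of the degrees of its source node and its target node.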

output_df = join_to_degree(edge_df, edge_source_column)
output_df = join_to_degree(output_df, edge_target_column)
output_df["combined_degree"] = (
    output_df[_degree_colname(edge_source_column)]
    + output_df[_degree_colname(edge_target_column)]
)

Split the `text_unit_ids` column

Split the text_unit_ids string into a list.

pruned_edges["text_unit_ids"] = pruned_edges["text_unit_ids"].str.split(",")

Deduplicate

Deduplicate on the combination of the source and target columns.

deduped = pruned_edges.drop_duplicates(subset=["source", "target"])

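create_final_text_units

Function parameters

text_units: pd.DataFrame
final_entities: pd.DataFrame
final_relationships: pd.DataFrame

Output

final_text_units: pd.DataFrame

id: ID of the text unit.
human_readable_id: readable ID of the text unit.
text: the text content.
n_tokens: number of tokens.
document_ids: IDs of the documents associated with the text unit.
entity_ids: IDs of the entities associated with the text unit.
relationship_ids: IDs of the relationships associated with the text unit.
covariate_ids (only if covariate data exists): IDs of the covariates associated with the text unit.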

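Step-by-step walkthrough

Join entities and relationships onto the text units (`text_units`)

Extract the entity information related to each text unit from final_entities.

id	entity_ids
chunked_doc_id	[entity_id, ...]
Extract the relationship information related to each text unit from final_relationships.

id	relationship_ids
chunked_doc_id	[relationship_id, ...]
Merge everything

id	text	document_ids	n_tokens	human_readable_id	entity_ids	relationship_ids
Group by id with groupby and keep the first value of each group (agg("first")).

This guarantees that each text unit appears on exactly one row of the final result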
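
The core of this step is an inversion: entities and relationships each list the text units they came from, and the text unit table needs the opposite mapping. A plain-Python sketch with invented rows follows; ids_by_text_unit is a hypothetical helper.

```python
from collections import defaultdict


def ids_by_text_unit(items):
    """Map each text unit id to the ids of the items that reference it."""
    result = defaultdict(list)
    for item in items:
        for text_unit_id in item["text_unit_ids"]:
            if item["id"] not in result[text_unit_id]:
                result[text_unit_id].append(item["id"])
    return dict(result)


final_entities = [
    {"id": "e1", "text_unit_ids": ["c1", "c2"]},
    {"id": "e2", "text_unit_ids": ["c2"]},
]
text_units = [{"id": "c1", "text": "..."}, {"id": "c2", "text": "..."}]

entity_ids = ids_by_text_unit(final_entities)
# Attach the entity ids to every text unit; units without entities get [].
final_text_units = [
    {**unit, "entity_ids": entity_ids.get(unit["id"], [])} for unit in text_units
]
print(final_text_units)
```

The same helper can be applied to final_relationships to obtain relationship_ids.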

create_final_community_reports

workflow:create_final_nodes
workflow:create_base_entity_graph
workflow:create_final_entities
workflow:create_final_communities

Function parameters

nodes: pd.DataFrame
edges: pd.DataFrame
entities: pd.DataFrame
communities: pd.DataFrame

summarization_strategy: dict

{
  "type": "graph_intelligence",
  "llm": {
  },
  "stagger": 0.3,
  "num_threads": 50,
  "extraction_prompt": "./prompts/community_report.txt",
  "max_report_length": 2000,
  "max_input_length": 8000
}

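Output

community_reports: pd.DataFrame

id: unique identifier of each community report (generated with uuid4()).
human_readable_id: readable community ID (int) that identifies the report.
community: the actual identifier of the community (human_readable_id as a str).
level: level of the community within the community hierarchy.
title: title of the community report.
summary: summary of the community report.
full_content: full content of the community report (Markdown).
rank: impact rating of the community (its level of importance).
rank_explanation: explanation of the impact rating.
findings: 5 to 10 key observations about the community.
full_content_json: the complete report as JSON.
period: cutoff date of the source information.
size: size of the community (number of entities).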

merged.loc[
        :,
        [
            "id",
            "human_readable_id",
            "community",
            "level",
            "title",
            "summary",
            "full_content",
            "rank",
            "rank_explanation",
            "findings",
            "full_content_json",
            "period",
            "size",
        ],
    ]

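Step-by-step walkthrough

Recover the community hierarchy: `restore_community_hierarchy`

Build the community level dictionary:
- Iterate over nodes to build a community_levels dictionary. Its keys are the levels; each value maps the communities of that level to their nodes.
Build the hierarchy:

For each level, go through all of its communities. If the node set of a community on the next level is a subset of the node set of a community on the current level, the next-level community is treated as a sub-community of the current-level one, and the pair is appended to the community_hierarchy list.
The result is a DataFrame that describes the hierarchy, community_hierarchy:

community	level	sub_community	sub_community_size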
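
The subset rule can be sketched in plain Python. restore_hierarchy is a hypothetical helper that returns a list of dicts rather than a DataFrame, and the community ids and node names are invented.

```python
def restore_hierarchy(community_levels):
    """community_levels: {level: {community_id: set_of_node_names}}."""
    hierarchy = []
    levels = sorted(community_levels)
    # Compare every level with the level that follows it.
    for current, following in zip(levels, levels[1:]):
        for community, nodes in community_levels[current].items():
            for sub_community, sub_nodes in community_levels[following].items():
                if sub_nodes <= nodes:  # subset test
                    hierarchy.append({
                        "community": community,
                        "level": current,
                        "sub_community": sub_community,
                        "sub_community_size": len(sub_nodes),
                    })
    return hierarchy


community_levels = {
    0: {"0": {"A", "B", "C"}, "1": {"D"}},
    1: {"2": {"A", "B"}, "3": {"C"}},
}
rows = restore_hierarchy(community_levels)
for row in rows:
    print(row)
```

Here community "0" receives the two sub-communities "2" and "3", while community "1" has none.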

Prepare community reports: `prepare_community_reports`

Iterate over the levels and call _prepare_reports_at_level to prepare the report data for the communities of each level.

community	all_context	context_string	context_size	context_exceed_limit	level
node_name	list(dict)	str	int	bool	int

Node information

merged_node_df

title	community	level	degree	node_details	edge_details	all_context
node_name	int	int	int	dict	list

node_details:

{
  "human_readable_id": 122,
  "title": "",
  "description": "",
  "degree": 1
}

edge_details:

[
  {
    "human_readable_id": 74,
    "source": "",
    "target": "",
    "description": "",
    "combined_degree": 99
  },
    ...
]

all_context:

{
  "title": "",
  "degree": 1,
  "node_details": {
    "human_readable_id": 122,
    "title": "",
    "description": "",
    "degree": 1
  },
  "edge_details": [
    NaN,
    {
      "human_readable_id": 113,
      "source": "",
      "target": "",
      "description": "",
      "combined_degree": 97
    }
  ],
  "claim_details": []
}

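Filter the node and edge data of the current level (level) to get level_node_df and level_edge_df
Merge node and edge details:
- Merge each node (as both source and target) with the details of its related edges, then concatenate the source-side and target-side results into merged_node_df.
- Group merged_node_df by node_name_column (node name), node_community_column (the community of the node), node_degree_column (node degree), and node_level_column (node level).
- Aggregate each group:
  - node_details_column: "first": keep only the first node detail record of the group
  - edge_details_column: list: collect all edge details of the group into one list. A node may have several incident edges, and all of their details end up in this list.
Merge claim information: if claim data exists, merge it into the node details with merge.
Build the full context of every node: all details of a node (node_name_column (node name), node_community_column (the community of the node), node_degree_column (node degree), node_level_column (node level)) are combined into one dictionary and stored in the ALL_CONTEXT column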

Community information

community_df

community	all_context	context_string	context_size	context_exceed_limit	level
node_name	list(dict)	str	int	bool	int

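Group all node information by community (node_community_column) to produce the report input of each community

Generate a context_string from the context of each community.

Sort the context entries by edge degree in descending order

Then build a context string within the max_tokens limit that contains all Entities (nodes) and Relationships (edges) information. Anything beyond max_tokens is truncated.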

-----Entities-----
human_readable_id,title,description,degree
13,MR. FEZZIWIG,"Mr. Fezziwig is a kind-hearted, jovial old merchant",1


-----Relationships-----
human_readable_id,source,target,description,combined_degree
11,MR. FEZZIWIG,MRS. FEZZIWIG,Mr. Fezziwig is married to Mrs. Fezziwig,5

Compute its length, context_size, and whether it exceeds the maximum token threshold, context_exceed_limit.
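
A reduced sketch of building such a context string is shown below. It makes several assumptions: count_tokens counts whitespace-separated words instead of using a real tokenizer, all entities are always kept, relationships are added in descending combined_degree order until the budget is used up, and the third return value only reports whether anything was dropped. All names and rows are invented for the example.

```python
def count_tokens(text):
    """Whitespace token count; a stand-in for a real tokenizer."""
    return len(text.split())


def build_context_string(entities, relationships, max_tokens):
    """Return (context_string, context_size, truncated)."""
    entity_header = "-----Entities-----\nhuman_readable_id,title,description,degree"
    rel_header = (
        "-----Relationships-----\n"
        "human_readable_id,source,target,description,combined_degree"
    )
    entity_lines = [
        f'{e["human_readable_id"]},{e["title"]},{e["description"]},{e["degree"]}'
        for e in entities
    ]
    # Highest combined_degree first.
    ranked = sorted(relationships, key=lambda r: r["combined_degree"], reverse=True)
    kept = []
    for rel in ranked:
        line = (
            f'{rel["human_readable_id"]},{rel["source"]},{rel["target"]},'
            f'{rel["description"]},{rel["combined_degree"]}'
        )
        candidate = "\n".join(
            [entity_header, *entity_lines, "", rel_header, *kept, line]
        )
        if count_tokens(candidate) > max_tokens:
            break  # adding this relationship would exceed the budget
        kept.append(line)
    context = "\n".join([entity_header, *entity_lines, "", rel_header, *kept])
    return context, count_tokens(context), len(kept) < len(ranked)


entities = [
    {"human_readable_id": 13, "title": "MR. FEZZIWIG",
     "description": "kind merchant", "degree": 1},
]
relationships = [
    {"human_readable_id": 12, "source": "MR. FEZZIWIG", "target": "SCROOGE",
     "description": "employer", "combined_degree": 2},
    {"human_readable_id": 11, "source": "MR. FEZZIWIG", "target": "MRS. FEZZIWIG",
     "description": "married", "combined_degree": 5},
]
context, size, truncated = build_context_string(entities, relationships, max_tokens=10)
print(context)
print(size, truncated)
```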

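Summarize community reports

Initialize report_df, which holds the reports of all communities.

Starting from the bottom level, iterate over the levels and call prep_community_report_context to prepare the report data (context_string) for the communities of each level.

Prepare the report data

prep_community_report_context: prepares the level_context_df of a given level.

Its main purpose is to make sure that the context_string of each community does not exceed the maximum token count (max_tokens). Because the context_string of a community can be viewed as the concatenation of the context_string values of its sub-communities, an oversized context can have part of its local context replaced with sub-community reports. In detail:

Select the context_string values of the current level: filter local_context_df by level to get level_context_df, and split it into two parts:
- valid_context_df: the records whose context is within the max_tokens limit.
- invalid_context_df: the records whose context exceeds the limit
If invalid_context_df is empty, the context_size of every community of the current level is within the token limit, and valid_context_df is returned directly
If invalid_context_df is not empty:
- If report_df is empty: this level is the bottom level and its communities have no sub-communities. The context_string of each oversized community is trimmed directly (in the same way as in the community information step)
- If report_df is not empty: the communities of this level have sub-communities:
  - Remove the reported communities: use _antijoin_reports to drop the communities that already appear in report_df.
  - Get the context_string of the sub-communities: call _get_subcontext_df to get the context of each sub-community.
  - Replace the oversized contexts: use _get_community_df to get the sub-community-level context and try to replace the oversized parts of invalid_context_df with sub-community reports.
  - Handle the records that cannot be replaced: if some records still cannot be replaced with sub-community reports (they still exceed the token limit), trim their local context again.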

Generate reports

community_reports_extractor processes the context_string of each level and returns a CommunityReportsResult

@dataclass
class CommunityReportsResult:
    """Community reports result class definition."""

    output: str
    structured_output: dict

output

# {title}

{summary}

## {finding.summary}

{finding.explanation}

...

structured_output

{
  "title": "",
  "summary": "",
  "rating": 3.5,
  "rating_explanation": "",
  "findings": [
    {
      "summary": "",
      "explanation": "..., [Data: Relationships (169)]"
    },
  ]
}
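
The relation between structured_output and the Markdown output can be sketched as a small rendering function. render_report is a hypothetical helper that follows the template shown above; the report content is invented.

```python
def render_report(report: dict) -> str:
    """Render a structured report as Markdown text."""
    sections = [f"# {report['title']}", report["summary"]]
    for finding in report.get("findings", []):
        sections.append(f"## {finding['summary']}")
        sections.append(finding["explanation"])
    return "\n\n".join(sections)


# Invented example data in the shape of structured_output.
structured_output = {
    "title": "Fezziwig Household",
    "summary": "A small community around Mr. Fezziwig.",
    "rating": 3.5,
    "rating_explanation": "Limited impact.",
    "findings": [
        {
            "summary": "Marriage",
            "explanation": "Mr. Fezziwig is married to Mrs. Fezziwig.",
        },
    ],
}
text = render_report(structured_output)
print(text)
```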

Assemble the CommunityReportsResult into a CommunityReport

class CommunityReport(TypedDict):
    """Community report class definition."""

    community: str | int
    title: str
    summary: str
    full_content: str
    full_content_json: str
    rank: float
    level: int
    rank_explanation: str
    findings: list[Finding]

generate_text_embeddings

workflow:create_final_documents
workflow:create_final_relationships
workflow:create_final_text_units
workflow:create_final_entities
workflow:create_final_community_reports

Function parameters

source: pd.DataFrame
final_relationships: pd.DataFrame
final_text_units: pd.DataFrame
final_entities: pd.DataFrame
final_community_reports: pd.DataFrame

Output

Step-by-step walkthrough

Configure which embeddings to generate

Name	Data	Embed Column
document_text_embedding	["id", "text"]	text
relationship_description_embedding	["id", "description"]	description
text_unit_text_embedding	["id", "text"]	text
entity_title_embedding	["id", "title"]	title
entity_description_embedding	["id", "title", "description"]	title_description
community_title_embedding	["id", "title"]	title
community_summary_embedding	["id", "summary"]	summary
community_full_content_embedding	["id", "full_content"]	full_content

Defaults:

community_full_content_embedding
entity_description_embedding
text_unit_text_embedding

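Generate embeddings: `embed_text`

Configure the vector store

If a vector store is configured (for example a database or a vector search service), the embeddings are saved to it.
If no vector store is configured, the embeddings are returned directly in memory.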

text_embed

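Process the texts in batches: first split by max_batch_size, then split further by max_batch_tokens
- max_batch_size: int: the maximum number of samples per batch -> texts:[text,...]
- max_batch_tokens: int: the maximum number of tokens allowed per batch, which keeps a single request from carrying too much text. -> text_batches:[[text,...],...]
- Send text_batches to the LLM to obtain the corresponding embedding vectors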
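
A minimal sketch of batching under both limits is given below. It applies the two limits in a single pass rather than in two stages, count_tokens is a whitespace stand-in for a real tokenizer, and create_text_batches is a hypothetical helper.

```python
def count_tokens(text):
    """Whitespace token count; a stand-in for a real tokenizer."""
    return len(text.split())


def create_text_batches(texts, max_batch_size, max_batch_tokens):
    """Group texts so that no batch exceeds either limit."""
    batches, current, current_tokens = [], [], 0
    for text in texts:
        tokens = count_tokens(text)
        too_many = len(current) >= max_batch_size
        too_long = current_tokens + tokens > max_batch_tokens
        if current and (too_many or too_long):
            # Close the current batch and start a new one.
            batches.append(current)
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches


texts = ["a b c", "d e", "f", "g h i j", "k"]
text_batches = create_text_batches(texts, max_batch_size=2, max_batch_tokens=4)
print(text_batches)
```

A text that alone exceeds max_batch_tokens still ends up in a batch of its own in this sketch; a production implementation would need to decide how to handle that case.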

Build VectorStoreDocument objects and store them

@dataclass
class VectorStoreDocument:
    """A document that is stored in vector storage."""

    id: str | int
    """unique id for the document"""

    text: str | None
    vector: list[float] | None

    attributes: dict[str, Any] = field(default_factory=dict)
    """store any additional metadata, e.g. title, date ranges, etc"""

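Wrap the embedding vector, document ID, text content, and title of each text in a VectorStoreDocument object
Call vector_store.load_documents() to store these objects in the vector store. The overwrite parameter controls whether existing document data is overwritten.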
