GraphRAG Installation and Pitfalls


Installation process

  1. Clone the graphrag-local-ollama repository:

    git clone https://github.com/TheAiSingularity/graphrag-local-ollama.git
    
  2. Enter the project directory:

    cd graphrag-local-ollama/
    
  3. Install the dependencies:

    pip install -e .
    
  4. Create a GraphRAG project and set up its input folder:

    mkdir -p ./ragtest/input
    cp input/* ./ragtest/input
    
  5. Initialize the project:

    python -m graphrag.index --init --root ./ragtest
    
  6. Copy the prepared settings file into the project:

    cp settings.yaml ./ragtest
    
  7. Edit the configuration file:

    
    encoding_model: cl100k_base
    skip_workflows: []
    llm:
      api_key: ${GRAPHRAG_API_KEY}
      type: openai_chat # or azure_openai_chat
      model: qwen-32b-instruct-fp16:latest
      model_supports_json: true # recommended if this is available for your model.
      # max_tokens: 4000
      # request_timeout: 180.0
      api_base: http://localhost:7889/v1
      # api_version: 2024-02-15-preview
      # organization: <organization_id>
      # deployment_name: <azure_model_deployment_name>
      # tokens_per_minute: 150_000 # set a leaky bucket throttle
      # requests_per_minute: 10_000 # set a leaky bucket throttle
      # max_retries: 10
      # max_retry_wait: 10.0
      # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
      # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    
    parallelization:
      stagger: 0.3
      # num_threads: 50 # the number of threads to use for parallel processing
    
    async_mode: threaded # or asyncio
    
    embeddings:
      ## parallelization: override the global parallelization settings for embeddings
      async_mode: threaded # or asyncio
      llm:
        api_key: ${GRAPHRAG_API_KEY}
        type: openai_embedding # or azure_openai_embedding
        model: bge-large-zh-v1.5:f16
        api_base: http://localhost:7889/api
        # api_version: 2024-02-15-preview
        # organization: <organization_id>
        # deployment_name: <azure_model_deployment_name>
        # tokens_per_minute: 150_000 # set a leaky bucket throttle
        # requests_per_minute: 10_000 # set a leaky bucket throttle
        # max_retries: 10
        # max_retry_wait: 10.0
        # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
        # concurrent_requests: 25 # the number of parallel inflight requests that may be made
        # batch_size: 16 # the number of documents to send in a single request
        # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
        # target: required # or optional
      
    
    
    chunks:
      size: 300
      overlap: 100
      group_by_columns: [id] # by default, we don't allow chunks to cross documents
        
    input:
      type: file # or blob
      file_type: text # or csv
      base_dir: "input"
      file_encoding: utf-8
      file_pattern: ".*\\.txt$"
    
    cache:
      type: file # or blob
      base_dir: "cache"
      # connection_string: <azure_blob_storage_connection_string>
      # container_name: <azure_blob_storage_container_name>
    
    storage:
      type: file # or blob
      base_dir: "output/${timestamp}/artifacts"
      # connection_string: <azure_blob_storage_connection_string>
      # container_name: <azure_blob_storage_container_name>
    
    reporting:
      type: file # or console, blob
      base_dir: "output/${timestamp}/reports"
      # connection_string: <azure_blob_storage_connection_string>
      # container_name: <azure_blob_storage_container_name>
    
    entity_extraction:
      ## llm: override the global llm settings for this task
      ## parallelization: override the global parallelization settings for this task
      ## async_mode: override the global async_mode settings for this task
      prompt: "prompts/entity_extraction.txt"
      entity_types: [organization,person,geo,event]
      max_gleanings: 0
    
    summarize_descriptions:
      ## llm: override the global llm settings for this task
      ## parallelization: override the global parallelization settings for this task
      ## async_mode: override the global async_mode settings for this task
      prompt: "prompts/summarize_descriptions.txt"
      max_length: 500
    
    claim_extraction:
      ## llm: override the global llm settings for this task
      ## parallelization: override the global parallelization settings for this task
      ## async_mode: override the global async_mode settings for this task
      # enabled: true
      prompt: "prompts/claim_extraction.txt"
      description: "Any claims or facts that could be relevant to information discovery."
      max_gleanings: 0
    
    community_report:
      ## llm: override the global llm settings for this task
      ## parallelization: override the global parallelization settings for this task
      ## async_mode: override the global async_mode settings for this task
      prompt: "prompts/community_report.txt"
      max_length: 2000
      max_input_length: 8000
    
    cluster_graph:
      max_cluster_size: 10
    
    embed_graph:
      enabled: false # if true, will generate node2vec embeddings for nodes
      # num_walks: 10
      # walk_length: 40
      # window_size: 2
      # iterations: 3
      # random_seed: 597832
    
    umap:
      enabled: false # if true, will generate UMAP embeddings for nodes
    
    snapshots:
      graphml: yes
      raw_entities: yes
      top_level_nodes: yes
    
    local_search:
      # text_unit_prop: 0.5
      # community_prop: 0.1
      # conversation_history_max_turns: 5
      # top_k_mapped_entities: 10
      # top_k_relationships: 10
      # max_tokens: 12000
    
    global_search:
      # max_tokens: 12000
      # data_max_tokens: 12000
      # map_max_tokens: 1000
      # reduce_max_tokens: 2000
      # concurrency: 32
    

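A note on ${GRAPHRAG_API_KEY}: the value is read from the environment (the --init step also generates a ragtest/.env file for it), and a local Ollama server does not validate the key, so any placeholder value should work, for example:

    # ragtest/.env ("ollama" is an arbitrary placeholder value)
    GRAPHRAG_API_KEY=ollama
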
For the configuration file, change the model names and api_base URLs to your local models; you can also point them at a remote service instead (such as Alibaba Cloud Bailian).

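Before starting a long indexing run, it can be worth smoke-testing both endpoints from the config. Below is a minimal sketch; it assumes the server on localhost:7889 exposes the OpenAI-compatible /v1 chat route and the Ollama-native /api/embeddings route, with the model names taken from the settings above:

    import requests

    # Chat endpoint used by the llm block (OpenAI-compatible route)
    r = requests.post(
        "http://localhost:7889/v1/chat/completions",
        json={
            "model": "qwen-32b-instruct-fp16:latest",
            "messages": [{"role": "user", "content": "ping"}],
        },
        timeout=120,
    )
    r.raise_for_status()
    print(r.json()["choices"][0]["message"]["content"])

    # Embedding endpoint used by the embeddings block (Ollama-native route)
    r = requests.post(
        "http://localhost:7889/api/embeddings",
        json={"model": "bge-large-zh-v1.5:f16", "prompt": "ping"},
        timeout=120,
    )
    r.raise_for_status()
    print(len(r.json()["embedding"]))
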

  8. Run the indexer to build the graph (running the command below as-is will fail in an offline environment; see the pitfalls section):

    python -m graphrag.index --root ./ragtest
    

Results

After all of the commands above have been executed, a successful indexing run looks like this:

[Screenshot: successful indexing run]
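Beyond the console output, you can inspect the artifacts the indexer wrote. A small sketch, assuming the layout from the storage section above (output/<timestamp>/artifacts) and the parquet names used by graphrag 0.1-era builds, such as create_final_entities.parquet:

    import glob

    import pandas as pd

    # Pick the artifacts folder of the most recent indexing run
    runs = sorted(glob.glob("./ragtest/output/*/artifacts"))
    entities = pd.read_parquet(f"{runs[-1]}/create_final_entities.parquet")
    print(entities.columns.tolist())
    print(entities.head())
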

Querying

    python -m graphrag.query --root ./ragtest --method global "What is machine learning?"

Run the query to get the result:

[Screenshot: query result]

Pitfalls

About tiktoken

When tiktoken does not have an encoder available locally, it automatically downloads the corresponding file. In an offline environment that download can never succeed, so the file has to be supplied manually.

Method 1 (not verified)

In tiktoken_ext/openai_public.py, the blobpath for cl100k_base is https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken, and the corresponding cache filename is the SHA-1 of that URL string: 9b5ad71b2ce5302211f9c61530b329a4922fc6a4. (The expected_hash in the snippet below is a different value: the SHA-256 of the file's contents.)

    # From tiktoken_ext/openai_public.py. Note that expected_hash here is the
    # SHA-256 of the downloaded file's contents, not the cache filename:
    def cl100k_base():
        mergeable_ranks = load_tiktoken_bpe(
            "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken",
            expected_hash="223921b76ee99bde995b7ff738513eef100fb51d18c93597a113bcffe865b2a7",
        )

In tiktoken/load.py you can see where the cache is stored: if no environment variable overrides it, the default path on Linux is /tmp/data-gym-cache (you can verify the path yourself). So find an internet-connected machine that has already run a tiktoken program, grab /tmp/data-gym-cache/9b5ad71b2ce5302211f9c61530b329a4922fc6a4 from it, and copy that file to the same path on the offline machine.
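You can confirm the cache filename yourself; it is just the SHA-1 of the URL string, computed exactly the way load.py does it:

    import hashlib

    blobpath = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
    # tiktoken names the cached file after the SHA-1 of the blobpath URL
    print(hashlib.sha1(blobpath.encode()).hexdigest())
    # -> 9b5ad71b2ce5302211f9c61530b329a4922fc6a4
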

The above comes from this post: blog.csdn.net/zhilaizhiwa…

Method 2 (what I used)

In the directory where the tiktoken package is installed, edit load.py and change its file-reading logic:

    from __future__ import annotations

    import base64
    import hashlib
    import json
    import os
    import tempfile
    import uuid
    from typing import Optional

    import requests

    # The original read_file() logic, commented out below, fetched the file remotely:
    # def read_file(blobpath: str) -> bytes:
    #     if not blobpath.startswith("http://") and not blobpath.startswith("https://"):
    #         try:
    #             import blobfile
    #         except ImportError as e:
    #             raise ImportError(
    #                 "blobfile is not installed. Please install it by running `pip install blobfile`."
    #             ) from e
    #         with blobfile.BlobFile(blobpath, "rb") as f:
    #             return f.read()
    #     # avoiding blobfile for public files helps avoid auth issues, like MFA prompts
    #     resp = requests.get(blobpath)
    #     resp.raise_for_status()
    #     return resp.content

    # Replacement: read the pre-downloaded encoding file from local disk
    def read_file(blobpath: str) -> bytes:
        blobpath = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
        cache_key = hashlib.sha1(blobpath.encode()).hexdigest()
        cache_dir = "/data/inspur/product/lbk-tests"  # local directory holding the cached file
        cache_path = os.path.join(cache_dir, cache_key)
        with open(cache_path, "rb") as f:
            data = f.read()
        return data

All this does is change read_file() from downloading the file remotely to reading it from local disk. At bottom it is the same trick as Method 1; the difference is that Method 1 drops the file into the cache directory that tiktoken already resolves from the environment variable (or its default path), rather than patching the code.
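If you would rather not patch an installed package, the environment-variable route mentioned above should also work: tiktoken reads its cache directory from TIKTOKEN_CACHE_DIR, so pointing that at a directory containing the pre-downloaded file avoids any code changes. A minimal sketch (the path is just the cache directory used above):

    import os

    # Must be set before the encoding is first loaded; the directory must
    # contain the encoding file saved under its SHA-1 cache key
    # (9b5ad71b2ce5302211f9c61530b329a4922fc6a4 for cl100k_base).
    os.environ["TIKTOKEN_CACHE_DIR"] = "/data/inspur/product/lbk-tests"

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    print(enc.encode("hello world"))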