How to use Microsoft's GraphRAG knowledge graph


Question

How do I use Microsoft's GraphRAG knowledge graph?

Answer (19)

Best Answer

Below is my settings.yaml. It looks like there is quite a lot to configure, or is it that this project simply isn't supported?

This config file contains required core defaults that must be set, along with a handful of common optional settings.

For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

LLM settings

There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

```yaml
models:
  default_chat_model:
    type: chat
    model_provider: openai
    auth_type: api_key # or azure_managed_identity
    api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file, or remove if managed identity
    model: gpt-4-turbo-preview
    api_base: https://free.v36.cm/v1/
    api_version: dall-e-2
    model_supports_json: true # recommended if this is available for your model.
    concurrent_requests: 25
    async_mode: threaded # or asyncio
    retry_strategy: exponential_backoff
    max_retries: 10
    tokens_per_minute: null
    requests_per_minute: null
  default_embedding_model:
    type: embedding
    model_provider: openai
    auth_type: api_key
    api_key: ${GRAPHRAG_API_KEY}
    model: text-embedding-3-small
    api_base: https://free.v36.cm/v1/
    api_version: dall-e-2
    concurrent_requests: 25
    async_mode: threaded # or asyncio
    retry_strategy: exponential_backoff
    max_retries: 10
    tokens_per_minute: null
    requests_per_minute: null
```
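The `${GRAPHRAG_API_KEY}` placeholder is resolved from the environment, which is why the key belongs in the generated .env file rather than in settings.yaml itself. A minimal Python sketch of this kind of substitution (the key value here is a stand-in, not a real key, and GraphRAG performs the expansion itself when it loads the config):

```python
import os
from string import Template

# Stand-in value; in practice GraphRAG loads this from the .env file
# that `graphrag init` generates next to settings.yaml.
os.environ["GRAPHRAG_API_KEY"] = "sk-demo-not-a-real-key"

# A line from settings.yaml with an environment-variable placeholder.
line = "api_key: ${GRAPHRAG_API_KEY}"
resolved = Template(line).substitute(os.environ)
print(resolved)  # api_key: sk-demo-not-a-real-key
```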

Input settings

```yaml
input:
  storage:
    type: file # or blob
    base_dir: "input"
  file_type: text # [csv, text, json]

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]
```
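To make the `size` and `overlap` numbers concrete, here is a simplified sketch of overlapping chunking. Real GraphRAG counts tokens with a tokenizer; this toy version just slices a list, and the `chunk` helper name is hypothetical:

```python
def chunk(tokens, size=1200, overlap=100):
    # Each window shares `overlap` items with the previous one, so entities
    # that straddle a chunk boundary still appear whole in some chunk.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

# Toy demonstration: 30 "tokens", windows of 10 with an overlap of 2.
pieces = chunk(list(range(30)), size=10, overlap=2)
print(len(pieces))    # 4
print(pieces[1][:2])  # [8, 9], same as pieces[0][-2:]
```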

Output/storage settings

If blob storage is specified in the following four sections, connection_string and container_name must be provided.

```yaml
output:
  type: file # [file, blob, cosmosdb]
  base_dir: "output"

cache:
  type: file # [file, blob, cosmosdb]
  base_dir: "cache"

reporting:
  type: file # [file, blob]
  base_dir: "logs"

vector_store:
  default_vector_store:
    type: lancedb
    db_uri: output\lancedb
    container_name: default
```

Workflow settings

```yaml
embed_text:
  model_id: default_embedding_model
  vector_store_id: default_vector_store

extract_graph:
  model_id: default_chat_model
  prompt: "prompts/extract_graph.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 1

summarize_descriptions:
  model_id: default_chat_model
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

extract_graph_nlp:
  text_analyzer:
    extractor_type: regex_english # [regex_english, syntactic_parser, cfg]
  async_mode: threaded # or asyncio

cluster_graph:
  max_cluster_size: 10

extract_claims:
  enabled: false
  model_id: default_chat_model
  prompt: "prompts/extract_claims.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  model_id: default_chat_model
  graph_prompt: "prompts/community_report_graph.txt"
  text_prompt: "prompts/community_report_text.txt"
  max_length: 2000
  max_input_length: 8000

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)

snapshots:
  graphml: false
  embeddings: false
```

Query settings

The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.

See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

```yaml
local_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  chat_model_id: default_chat_model
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/drift_search_system_prompt.txt"
  reduce_prompt: "prompts/drift_search_reduce_prompt.txt"

basic_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/basic_search_system_prompt.txt"
```
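Each of the four sections above corresponds to a `--method` value on the GraphRAG command line in recent releases. Assuming a project rooted at ./ragtest (the directory name and the questions are illustrative), queries would look roughly like this:

```shell
# One query per configured search method; questions are placeholders
graphrag query --root ./ragtest --method local  --query "Who is the main character?"
graphrag query --root ./ragtest --method global --query "What are the top themes?"
graphrag query --root ./ragtest --method drift  --query "How do the key entities relate?"
graphrag query --root ./ragtest --method basic  --query "Summarize the source documents."
```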


Using Microsoft GraphRAG's knowledge graph is a multi-step process; you can configure it through the following steps:

  1. Understand the configuration file: settings.yaml contains the required core configuration plus some optional settings. Make sure you know which items must be set, such as the models, the API key, and the concurrency limits.

  2. Core configuration

    • LLM settings: configure the chat and embedding models, including the model provider (e.g. OpenAI), the auth type (api_key or Azure managed identity), the API key, the API base URL, the API version, and so on. Make sure GRAPHRAG_API_KEY is set, or remove it when using a managed identity.
  3. Input settings

    • Choose the storage type (file or blob) and set the base directory, the file format, and the chunk size, overlap, and grouping columns.
  4. Output/storage settings

    • Configure the output, cache, reporting, and vector store types as needed, providing a connection string and container name where applicable.
  5. Workflow settings

    • Define the model, prompt, and parameters each task uses (graph extraction, description summarization, etc.), such as the maximum number of gleanings and the analyzer type.
  6. Query settings

    • For each search method (local_search, global_search, etc.), set the corresponding prompts; the other optional parameters can be tuned for performance.
  7. Environment variables

    • Keep GRAPHRAG_API_KEY in the generated .env file rather than in settings.yaml, so the API key stays out of version control.
  8. Run

    • Save the configuration file, then start GraphRAG with this configuration and begin querying and analyzing the knowledge graph.

Make sure to adjust the parameters to your project's actual needs during configuration, and consult the official documentation at <a href="https://microsoft.github.io/graphrag/config/yaml/">https://microsoft.github.io/graphrag/config/yaml/</a> for details.
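The steps above can be run end to end with the GraphRAG CLI; the commands below follow the official quick-start (the ./ragtest directory is illustrative, and a valid key must be set in the generated .env before indexing):

```shell
pip install graphrag                 # install the pipeline and CLI
mkdir -p ./ragtest/input             # place your .txt source documents in input/
graphrag init --root ./ragtest       # generates settings.yaml and .env
# edit settings.yaml as needed and set GRAPHRAG_API_KEY in ragtest/.env
graphrag index --root ./ragtest      # build the knowledge graph and community reports
graphrag query --root ./ragtest --method global --query "What are the top themes?"
```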