Microsoft GraphRAG – Index Error – Azure Blob Storage

Octavian Mocanu 45 Reputation points
2025-02-10T11:25:57.9533333+00:00

I want to use Microsoft GraphRAG to identify chat topics/scope based on a data source.

I followed the guidance from here.

I’ve installed the following graphrag Python package:

Name: graphrag
Version: 0.9.0
Summary: GraphRAG: A graph-based retrieval-augmented generation (RAG) system.
Home-page: 
Author: Alonso Guevara Fernández
Author-email: alonsog@microsoft.com
License: MIT

The data source is hosted in an Azure Blob Storage container within an Azure Storage account (3 .txt files).

The settings file is like this:

encoding_model: cl100k_base # this needs to be matched to your model!

llm:
  api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
  type: azure_openai_chat
  model: gpt-4o
  model_supports_json: false # recommended if this is available for your model.
  api_base: https://[openai-service].openai.azure.com
  api_version: "2024-08-01-preview"
  deployment_name: gpt-4o

parallelization:
  stagger: 0.3

async_mode: threaded # or asyncio

embeddings:
  async_mode: threaded # or asyncio
  vector_store: # configuration for AI Search
    type: azure_ai_search
    url: https://[search-service].search.windows.net
    api_key: ${AZURE_SEARCH_SERVICE_API_KEY}

  llm:
    api_key: ${GRAPHRAG_API_KEY_TEXT_EMBEDDING}
    type: azure_openai_embedding
    model: text-embedding-ada-002
    api_base:  https://[openai-service].openai.azure.com
    api_version: 2024-02-15-preview
    deployment_name: text-embedding-ada-002

### Input settings ###

input:
  type: blob # file or blob
  connection_string: "${AZURE_STORAGE_ACCOUNT_CONNECTION_STRING}"
  container_name: "graphrag-input-001"
  file_type: text # or csv
  file_encoding: utf-8
  file_pattern: ".*\\.(txt|md)$"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Storage settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: blob
  container_name: "graphrag-workspace-001"
  connection_string: "${AZURE_STORAGE_ACCOUNT_CONNECTION_STRING}"

reporting:
  type: file #file or blob
  base_dir: "logs"

storage:
  type: blob # file or blob
  container_name: "graphrag-workspace-001"
  connection_string: "${AZURE_STORAGE_ACCOUNT_CONNECTION_STRING}"

### Workflow settings ###

skip_workflows: []

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  max_gleanings: 1

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  enabled: false
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: true # if true, will generate node2vec embeddings for nodes

umap:
  enabled: true # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: true
  embeddings: true
  transient: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  prompt: "prompts/local_search_001_sys_prompt_sk_plugin_ftos_context.txt"
  conversation_history_max_turns: 5

global_search:
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  prompt: "prompts/drift_search_system_prompt.txt"
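Before running the index command, it can help to confirm that every environment variable referenced via `${...}` in the settings above is actually set, since an unset variable can substitute to an empty value and surface as a confusing downstream error. A minimal sketch using the variable names from this config:

```python
import os

# Environment variables referenced via ${...} in settings.yaml above.
required = [
    "GRAPHRAG_API_KEY",
    "GRAPHRAG_API_KEY_TEXT_EMBEDDING",
    "AZURE_SEARCH_SERVICE_API_KEY",
    "AZURE_STORAGE_ACCOUNT_CONNECTION_STRING",
]

# Report any variable that is unset or empty before indexing starts.
missing = [name for name in required if not os.environ.get(name)]
if missing:
    print("Missing environment variables:", ", ".join(missing))
```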

Running index command:

graphrag index --root ./ragtest

I’ve got this error:

AttributeError: 'list' object has no attribute 'on_error'         

The logs.json content:

{
    "type": "error",
    "data": "Error executing verb \"create_base_entity_graph\" in create_base_entity_graph: 'name'",
    "stack": "Traceback (most recent call last):\n  File \"C:\\Users\\om\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\datashaper\\workflow\\workflow.py\", line 415, in _execute_verb\n    result = await result\n             ^^^^^^^^^^^^\n  File \"C:\\Users\\om\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\graphrag\\index\\workflows\\v1\\subflows\\create_base_entity_graph.py\", line 47, in create_base_entity_graph\n    await create_base_entity_graph_flow(\n  File \"C:\\Users\\om\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\graphrag\\index\\flows\\create_base_entity_graph.py\", line 58, in create_base_entity_graph\n    merged_entities = _merge_entities(entity_dfs)\n                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\om\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\graphrag\\index\\flows\\create_base_entity_graph.py\", line 119, in _merge_entities\n    all_entities.groupby([\"name\", \"type\"], sort=False)\n  File \"C:\\Users\\om\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\pandas\\core\\frame.py\", line 9183, in groupby\n    return DataFrameGroupBy(\n           ^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\om\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\pandas\\core\\groupby\\groupby.py\", line 1329, in __init__\n    grouper, exclusions, obj = get_grouper(\n                               ^^^^^^^^^^^^\n  File \"C:\\Users\\om\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\pandas\\core\\groupby\\grouper.py\", line 1043, in get_grouper\n    raise KeyError(gpr)\nKeyError: 'name'\n",
    "source": "'name'",
    "details": null
}
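The "stack" field stores the traceback with escaped \n sequences, which makes it hard to read in raw form. A quick way to print it legibly (the entry dict below is a truncated stand-in for one record from logs.json):

```python
# A truncated stand-in for one error record parsed from logs.json;
# json.load on the real file yields the same structure.
entry = {
    "type": "error",
    "data": "Error executing verb ...",
    "stack": "Traceback (most recent call last):\n  File ...\nKeyError: 'name'\n",
}

# Printing turns the escaped \n sequences into real line breaks.
print(entry["stack"])
```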

Could you please guide me to solve this error?


Accepted answer
  1. Adrian Calinescu 80 Reputation points Microsoft Employee
    2025-02-25T09:50:23.54+00:00

    You have to specify base_dir; otherwise it will default to base_dir: "input", which does not exist in your blob storage container and will trigger all kinds of parsing errors, which unfortunately aren't handled very gracefully.

    See

    https://github.com/microsoft/graphrag/blob/main/graphrag/config/defaults.py#L255

    and

    https://github.com/microsoft/graphrag/blob/0144b3fd88940218375bca9bb251b81eec192624/graphrag/config/models/input_config.py#L26

    ### Input settings ###
    input:
      type: blob
      connection_string: "${AZURE_STORAGE_ACCOUNT_CONNECTION_STRING}"
      container_name: "graphrag-input-001"
      base_dir: "."
      file_type: text # or csv
      file_encoding: utf-8
      file_pattern: ".*\\.txt$"	
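    In addition to setting base_dir, it can help to sanity-check that your blob names actually match file_pattern, since this regex is applied to blob paths under base_dir. A quick local sketch, where the blob names are hypothetical placeholders for your own container listing:

```python
import re

# The file_pattern from the corrected input settings above.
file_pattern = r".*\.txt$"

# Hypothetical blob names standing in for a real container listing.
blob_names = ["doc1.txt", "doc2.txt", "notes/doc3.txt", "readme.md"]

# Only names matching the pattern are picked up as input documents.
matching = [name for name in blob_names if re.match(file_pattern, name)]
print(matching)  # the .md file is excluded by this pattern
```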
    

1 additional answer

  1. SKale 1,441 Reputation points
    2025-02-10T19:38:15.2166667+00:00

    Hello Octavian Mocanu,

    Thank you for posting your question in the Microsoft Q&A forum.

    The error you're encountering, KeyError: 'name', indicates that the code is trying to group entities by the columns "name" and "type", but the "name" column is missing from your data. I have listed a few reasons for you to validate:

    1. The input data (text files) might not be structured in a way that the GraphRAG system expects. Specifically, it might be missing the "name" field that is required for entity extraction.
    2. The entity extraction process might not be correctly configured to extract the "name" field from your data.

    A checklist of possible solutions to validate on your end:

    • Ensure that your text files contain structured data that includes a "name" field or something equivalent. If your data is unstructured, you might need to preprocess it to extract entities and assign them a "name" field.
    • The entity_extraction section in your YAML configuration references a prompt file (prompts/entity_extraction.txt). Open this file and ensure that the prompt is designed to extract entities with a "name" field. If the prompt is not correctly extracting the "name" field, you may need to modify it.
    • If your data does not naturally contain a "name" field, you might need to modify the entity extraction process to generate or infer this field. For example, you could modify the entity_extraction prompt to extract a different field and map it to "name".
    • Add logging or print statements in the create_base_entity_graph.py file to inspect the entity_dfs variable. This will help you understand what data is being passed to the _merge_entities function. Ensure that the data frames in entity_dfs contain the expected columns ("name" and "type").
    • If your data does not have a "name" field, you might need to update the configuration to use a different field for grouping. For example, if your data has a "title" field, you could modify the _merge_entities function to group by "title" instead of "name".
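    The failure in _merge_entities can be reproduced in isolation with a small pandas sketch; the DataFrame below is a hypothetical stand-in for an entity frame that came back without a "name" column:

```python
import pandas as pd

# Hypothetical stand-in for an extracted-entity frame that is missing
# the "name" column the merge step expects.
entities = pd.DataFrame({"title": ["Contoso"], "type": ["ORG"]})

# Grouping by a column that does not exist raises the same KeyError
# seen in logs.json.
try:
    entities.groupby(["name", "type"], sort=False)
except KeyError as exc:
    print(f"KeyError: {exc}")
```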

    After making the necessary changes above, re-run the graphrag index --root ./ragtest command to see if the issue is resolved.

    If the above answer helped, please do not forget to "Accept Answer", as this may help other community members facing a similar issue.

