reading csv files in azure ml notebook

Question

Ajam, Meraj 40

Hi,

I want to read all CSV files in an Azure ML notebook. I registered a folder as a data asset, and when consuming the files, the final DataFrame becomes a concatenation of all CSV files. However, I want to read each CSV file separately, remove a specific column, and save it back as a CSV file. Each CSV file is different from the other.

I came across a Microsoft lab that uses glob.glob in Python scripts (.py files) to address this, but when I try using glob in a notebook, I encounter an error.

Could someone please help me with the correct approach to achieve this?

Thanks!
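
For context on the glob approach mentioned above: in an Azure ML job, a folder input is typically mounted or downloaded, so the script receives an ordinary local directory and glob.glob can list its files. The following is a minimal sketch of the per-file processing under that assumption, using only the standard library. The function name drop_column_from_each_csv and its parameters are hypothetical, and the output folder is assumed to be different from the input folder.

```python
import csv
import glob
import os


def drop_column_from_each_csv(folder: str, column: str, out_dir: str) -> list:
    """Rewrite every CSV in `folder` without `column`, one output file per input."""
    if os.path.abspath(folder) == os.path.abspath(out_dir):
        # Writing into the input folder would truncate files while they are read.
        raise ValueError("out_dir must differ from folder")
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for src in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        dst = os.path.join(out_dir, os.path.basename(src))
        with open(src, newline="", encoding="utf-8") as fin, \
                open(dst, "w", newline="", encoding="utf-8") as fout:
            reader = csv.DictReader(fin)
            # Keep every header except the one to drop; files that lack
            # the column are copied through unchanged.
            kept = [name for name in (reader.fieldnames or []) if name != column]
            writer = csv.DictWriter(fout, fieldnames=kept, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(reader)
        written.append(dst)
    return written
```

Each file keeps its own schema because it is read and written independently; nothing is concatenated. The same loop could use pandas (read_csv, drop, to_csv) in place of the csv module.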

1 answer

Answer 1

Saideep Anchuri 4,115 Microsoft External Staff

Hi Ajam, Meraj

Welcome to Microsoft Q&A Forum, thank you for posting your query here!

To read all CSV files from a registered folder data asset in an Azure ML notebook and process each one individually, follow these steps:

1. Use the mltable library to access the folder asset.
2. List the CSV files in the folder.
3. Read each CSV file into a DataFrame.
4. Remove the specific column you don't need.
5. Save the modified DataFrame back as a CSV file.

Here's a sample code snippet to illustrate this:

import os

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Initialize MLClient
ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Get the data asset (folder)
data_asset = ml_client.data.get(name="<name_of_asset>", version="<version>")
folder_path = data_asset.path

# List all CSV files in the folder
csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

# Process each CSV file
for csv_file in csv_files:
    file_path = os.path.join(folder_path, csv_file)

    # Read this single file into its own DataFrame
    df = mltable.from_delimited_files(paths=[{'file': file_path}]).to_pandas_dataframe()

    # Remove the unwanted column
    df = df.drop(columns=['<column_to_remove>'])

    # Save the modified DataFrame as a new CSV file
    df.to_csv(os.path.join(folder_path, f'modified_{csv_file}'), index=False)

Please refer to the following link: access-your-data-in-a-notebook

Hope this helps. Do let us know if you have any further queries.

If this answers your query, please click "Accept Answer" and "Yes" for "Was this answer helpful?".

Thank You.

Ajam, Meraj 40

I get an error when I use os.listdir saying there is no such directory (The folder_path exists).

The only way to consume a data asset folder is the following suggestion (under the consume section in data asset), which concatenates all CSVs:

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get("dibabet-folder", version="1")

path = {
  'folder': data_asset.path
}

tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df

Saideep Anchuri 4,115 Microsoft External Staff

Hi Ajam, Meraj

It sounds like you're getting an error because the directory you're trying to list does not exist. Ensure that the path you're passing to os.listdir is correct and exists. Verify that you have the necessary permissions to access the directory.

Here is the sample code:


import os
import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
# Initialize the ML client
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
# Retrieve the data asset
data_asset = ml_client.data.get("dibabet-folder", version="1")
# Use the path of the data asset
folder_path = data_asset.path
# Check if the directory exists and list its contents
if os.path.exists(folder_path):
    print("Directory exists. Contents:")
    print(os.listdir(folder_path))
else:
    print("Directory does not exist. Please check the path:", folder_path)
# Create an mltable from the CSV files in the folder
path = {
  'folder': folder_path
}
tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
print(df)

Thank You.

Saideep Anchuri 4,115 Reputation points Microsoft External Staff

2025-02-09T04:42:15.6833333+00:00

Hi Ajam, Meraj

We haven’t heard back from you since the last response and are just checking whether you have a resolution yet.

Thank You.

Ajam, Meraj 40 Reputation points

2025-02-10T22:19:29.72+00:00

Hi, the code you sent most recently returns a concatenation of all the CSVs, which is not what I want.

The code prints "Directory does not exist", yet it was still able to return the df. This is the issue I have with the Folder type of data asset: os.listdir and os.path do not work on its path.
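
A possible explanation for this behaviour, offered as an assumption rather than a confirmed diagnosis: for a Folder data asset, data_asset.path is usually a remote URI (for example one starting with azureml://) rather than a directory on the compute instance's disk. mltable understands such URIs, while os.listdir, os.path.exists, and pathlib only look at the local filesystem, so they report that the directory does not exist even though the data loads. A minimal sketch for checking which kind of path you have (the helper name is_remote_uri and the example values are illustrative):

```python
from urllib.parse import urlparse


def is_remote_uri(path: str) -> bool:
    """Return True if `path` looks like a URI (azureml://, abfss://, https://, ...)."""
    scheme = urlparse(path).scheme
    # A one-letter "scheme" is a Windows drive letter (C:\...), not a URI.
    return len(scheme) > 1


# Hypothetical values; in the notebook you would pass data_asset.path instead.
example_uri = "azureml://subscriptions/<sub>/resourcegroups/<rg>/workspaces/<ws>/datastores/<ds>/paths/<folder>/"
example_local = "/home/azureuser/cloudfiles/data"

for candidate in (example_uri, example_local):
    kind = "remote URI (os.listdir cannot read it)" if is_remote_uri(candidate) else "local path"
    print(candidate, "->", kind)
```

If print(data_asset.path) shows a URI, the os and pathlib calls will need to be replaced by something that can read remote storage.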

Saideep Anchuri 4,115 Microsoft External Staff

Hi Ajam, Meraj

If os.listdir and os.path aren't functioning as expected with your data asset's Folder type, you may want to explore alternative methods to access the files within the folder.

Here is the Python code:

from pathlib import Path
import pandas as pd

# Set your folder path here
folder_path = Path('your_folder_path')

# Check if the directory exists
if folder_path.exists() and folder_path.is_dir():
    # Create a list to hold individual DataFrames
    dfs = []
    
    # Iterate over CSV files in the directory
    for csv_file in folder_path.glob('*.csv'):
        df = pd.read_csv(csv_file)
        dfs.append(df)
    
    # Now you have a list of DataFrames
    # You can process them individually or combine them as needed
else:
    print("Directory does not exist")

Thank You.

Ajam, Meraj 40 Reputation points

2025-02-11T16:16:47.1933333+00:00

The output is "Directory does not exist".
I tried all the different paths for my registered folder, but none of them worked.
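
One approach that may work in this situation, assuming the azureml-fsspec package is installed in the notebook environment, is to open the asset's URI as an fsspec filesystem and list the files through it instead of through os or pathlib. The sketch below reads each CSV into its own table so nothing is concatenated. The helper names open_asset_filesystem and read_each_csv are hypothetical, the default glob pattern is an assumption about the folder layout, and whether data_asset.path can be passed as-is or must be trimmed to the datastore root is something to verify against the package documentation.

```python
import csv
import io


def open_asset_filesystem(uri: str):
    """Return an fsspec filesystem for an azureml:// URI.

    Assumes `pip install azureml-fsspec`; imported lazily so the rest of
    this sketch runs without it.
    """
    from azureml.fsspec import AzureMachineLearningFileSystem
    return AzureMachineLearningFileSystem(uri)


def read_each_csv(fs, pattern: str = "**/*.csv") -> dict:
    """Read every CSV matched by `pattern` into its own list of row dicts.

    `fs` can be any object with fsspec-style glob() and open() methods.
    """
    tables = {}
    for path in fs.glob(pattern):
        with fs.open(path, "rb") as raw:
            # utf-8-sig also strips a byte-order mark if the file has one.
            text = raw.read().decode("utf-8-sig")
        tables[path] = list(csv.DictReader(io.StringIO(text, newline="")))
    return tables
```

In the notebook this would be called as fs = open_asset_filesystem(data_asset.path) followed by tables = read_each_csv(fs). The open file handle could equally be passed to pandas.read_csv to get one DataFrame per file, after which the unwanted column can be dropped and each result saved separately.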

Saideep Anchuri 4,115 Reputation points Microsoft External Staff

2025-02-11T17:07:53.5966667+00:00

Hi Ajam, Meraj

I recommend reporting this issue to the Azure support team. They will be able to investigate the issue further and provide a more targeted solution.

They will review your request and provide assistance as soon as possible: Azure support.

Thank You.
