Reading CSV files in an Azure ML notebook

Ajam, Meraj 40 Reputation points
2025-02-05T15:36:43.7833333+00:00

Hi,

I want to read all CSV files in an Azure ML notebook. I registered a folder as a data asset, and when consuming the files, the final DataFrame becomes a concatenation of all CSV files. However, I want to read each CSV file separately, remove a specific column, and save it back as a CSV file. Each CSV file is different from the other.

I came across a Microsoft lab that uses glob.glob in Python scripts (.py files) to address this, but when I try using glob in a notebook, I get an error.

Could someone please help me with the correct approach to achieve this?

Thanks!

Azure Machine Learning
An Azure machine learning service for building and deploying models.

1 answer

  1. Saideep Anchuri 2,110 Reputation points Microsoft Vendor
    2025-02-06T04:47:26.36+00:00

    Hi Ajam, Meraj

    Welcome to Microsoft Q&A Forum, thank you for posting your query here!

    To read all CSV files from a registered folder data asset in an Azure ML notebook and process each one individually, follow these steps:

    1. Use the mltable library to access the folder asset.
    2. List the CSV files in the folder.
    3. Read each CSV file into a DataFrame.
    4. Remove the specific column you don't need.
    5. Save the modified DataFrame back as a CSV file.

    Here's a sample code snippet to illustrate this:

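    import os

    import mltable
    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential

    # Initialize MLClient from the workspace config
    ml_client = MLClient.from_config(credential=DefaultAzureCredential())

    # Get the data asset (folder)
    data_asset = ml_client.data.get(name="<name_of_asset>", version="<version>")
    folder_path = data_asset.path

    # List all CSV files in the folder.
    # Note: os.listdir only works on a local or mounted directory. If
    # data_asset.path is a cloud URI (for example azureml://...), mount or
    # download the folder first and point folder_path at that location.
    csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

    # Process each CSV file separately
    for csv_file in csv_files:
        file_path = os.path.join(folder_path, csv_file)

        # Read a single file into its own DataFrame
        df = mltable.from_delimited_files(paths=[{'file': file_path}]).to_pandas_dataframe()

        # Remove the unwanted column
        df = df.drop(columns=['<column_to_remove>'])

        # Save the result next to the original with a "modified_" prefix
        df.to_csv(os.path.join(folder_path, f'modified_{csv_file}'), index=False)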
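
Since the question mentions glob, here is a minimal sketch of the same per-file loop written with glob and Python's standard csv module instead of mltable and pandas. It assumes the folder is already available as a local or mounted directory on the compute instance. The function name drop_column_from_csvs and the separate output directory are hypothetical choices for illustration, not part of any Azure ML API.

```python
import csv
import glob
import os


def drop_column_from_csvs(src_dir, dst_dir, column):
    """Copy every CSV in src_dir to dst_dir with `column` removed.

    Each file is handled on its own, so the files may have different
    columns. Returns the list of output paths that were written.
    """
    os.makedirs(dst_dir, exist_ok=True)
    written = []
    # sorted() gives a stable processing order across runs
    for src_path in sorted(glob.glob(os.path.join(src_dir, "*.csv"))):
        with open(src_path, newline="", encoding="utf-8") as src:
            reader = csv.DictReader(src)
            # Keep the original column order, minus the unwanted column
            kept = [name for name in (reader.fieldnames or []) if name != column]
            rows = [{name: row[name] for name in kept} for row in reader]

        # Write under the same file name, but in a separate output folder
        dst_path = os.path.join(dst_dir, os.path.basename(src_path))
        with open(dst_path, "w", newline="", encoding="utf-8") as dst:
            writer = csv.DictWriter(dst, fieldnames=kept)
            writer.writeheader()
            writer.writerows(rows)
        written.append(dst_path)
    return written
```

A call such as drop_column_from_csvs("<local_folder>", "<output_folder>", "<column_to_remove>") would process every file in turn. Writing to a separate output folder keeps the originals untouched and avoids picking up already-modified files on a second run.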
    
    
    

    For more details, kindly refer to this article: access-your-data-in-a-notebook

    Hope this helps. Do let us know if you have any further queries.

     


    If this answers your query, please click Accept Answer and select Yes for "Was this answer helpful".

    Thank You.

