I have a set of pickle files for which I want to write a Python script to read them and store them as datasets in Azure. What's the procedure to do that?

Chitti, Srinivasa 20 Reputation points
2025-01-31T10:09:52.1333333+00:00

I have a set of pickle files for which I want to write a Python script to read them and store them as datasets in Azure. What's the best procedure to do that?

Azure Open Datasets

Accepted answer
  1. Vikram Singh 1,070 Reputation points Microsoft Employee
    2025-01-31T11:23:24.4833333+00:00

    Hi @Chitti, Srinivasa

    Thanks for posting your question in Microsoft Q&A.

    To store pickle files as datasets in Azure, you can upload them to Azure Blob Storage, authenticating with either a connection string or Azure AD. Below is a sample script that uses a connection string for authentication:

    import os
    import pickle
    import pandas as pd
    from azure.storage.blob import BlobServiceClient
    
    # Authenticate using the storage account connection string
    AZURE_STORAGE_CONNECTION_STRING = "your_connection_string"
    CONTAINER_NAME = "your-container-name"
    blob_service_client = BlobServiceClient.from_connection_string(AZURE_STORAGE_CONNECTION_STRING)
    container_client = blob_service_client.get_container_client(CONTAINER_NAME)
    
    # Folder containing the pickle files
    pickle_folder = "path/to/pickle/files"
    for filename in os.listdir(pickle_folder):
        if filename.endswith(".pkl"):
            file_path = os.path.join(pickle_folder, filename)
            # Load the pickle file
            with open(file_path, "rb") as file:
                data = pickle.load(file)
            # Convert to CSV (only DataFrames are handled; other objects are skipped)
            if isinstance(data, pd.DataFrame):
                csv_data = data.to_csv(index=False)
    
                # Upload to Azure Blob Storage, replacing the .pkl extension with .csv
                csv_name = os.path.splitext(filename)[0] + ".csv"
                blob_client = container_client.get_blob_client(f"datasets/{csv_name}")
                blob_client.upload_blob(csv_data, overwrite=True)
                print(f"Uploaded: {csv_name}")
    
    print("All files uploaded successfully.")
    

    This script reads each pickle file in the specified folder, converts any pandas DataFrames to CSV, and uploads them to the designated Azure container. Make sure you have the azure-storage-blob library installed (pip install azure-storage-blob). Alternatively, for Azure AD authentication, you can use DefaultAzureCredential from azure.identity.
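
    As a minimal sketch of that Azure AD option (assuming the azure-identity package is installed; the account URL and container name below are placeholders to replace with your own values), only the client construction changes:

    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient
    
    # Placeholder account URL and container name; replace with your own values
    ACCOUNT_URL = "https://<your-storage-account>.blob.core.windows.net"
    CONTAINER_NAME = "your-container-name"
    
    # DefaultAzureCredential picks up environment variables, a managed identity,
    # or an Azure CLI login, so no secret needs to be embedded in the script
    credential = DefaultAzureCredential()
    blob_service_client = BlobServiceClient(account_url=ACCOUNT_URL, credential=credential)
    container_client = blob_service_client.get_container_client(CONTAINER_NAME)

    The rest of the upload loop stays the same as in the script above.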


    Do let me know if you are facing any issue.

    1 person found this answer helpful.

1 additional answer

  1. Chitti, Srinivasa 20 Reputation points
    2025-02-04T06:56:32.69+00:00

    Hi Vikram,

    Thanks for answering. This was really helpful. I also got clarity on the difference between Azure Blob Storage and Azure Data Lake Storage Gen2 (that is, the hierarchical way of organizing data).

    Can I ask a follow-up question, if you don't mind? Since my data is now present in Blob storage, can you please direct me to some material on how I can create an automated ETL pipeline to process the data and load it into a relational table?

    Thanks a lot in advance

    Phani

