How to Automate Large .SAV File to Parquet Conversion in Azure?

Akshay Patel 90 Reputation points
2024-12-27T10:31:15.69+00:00

I'm using Azure Data Lake Storage (ADLS) as our primary storage, Azure Data Factory (ADF) for data transformations, and Power BI for reporting and visualization.

I have a large .SAV file (200-300 MB, containing 2-4 million rows) stored in Azure Data Lake Storage (ADLS). To load the data into a SQL table, I need to first convert the .SAV file into a Parquet file, as Azure Data Factory (ADF) cannot directly process .SAV files.

I previously attempted to use an Azure Function for this conversion, but encountered a limitation where execution times out after 10 minutes, which is insufficient for processing files of this size.

I'm looking for an optimized and scalable solution to automate this conversion process within the Azure ecosystem.

Key Considerations:

  1. The solution must handle large files efficiently.
  2. It should be compatible with Azure services and integrate seamlessly into a data pipeline.
  3. Preferably avoid time-out or size limitations like those in Azure Functions.

Any guidance on how to approach this or examples of similar implementations would be highly appreciated.

Tags: Azure Data Lake Storage, Azure Functions, Azure Data Factory

Accepted answer
  1. Hari Babu Vattepally 1,475 Reputation points Microsoft Vendor
    2025-01-06T17:13:15.4633333+00:00

    Hi @Akshay Patel

    Welcome to Microsoft Q&A Forum. Thanks for posting your query here!

    Firstly, apologies for the delay in replying. I understand that you would like to automate the conversion of a large .SAV file to Parquet in Azure.

    For this scenario, you can use Azure Databricks, an Apache Spark-based analytics platform designed to process large datasets efficiently and at scale.

    Here are the steps you can follow to implement this solution:

    1. Create an Azure Databricks workspace and cluster: You can do this through the Azure portal or the Azure CLI.
    2. Make sure the .SAV file is accessible from Databricks: You already store the file in Azure Data Lake Storage (ADLS), so ensure the cluster can reach it, for example by mounting the container in the workspace.
    3. Create a Databricks notebook: In the Databricks workspace, create a new notebook with code that reads the .SAV file from ADLS and writes it out in Parquet format. Spark has no built-in reader for SPSS .SAV files, so read the file with a Python library such as pyreadstat (or pandas.read_spss), convert the result to a Spark DataFrame, and write it with df.write.parquet() (see the sketch after this list).
    4. Schedule the notebook to run: Use the Databricks Jobs feature to run the notebook at a specific time or on a recurring basis. This automates the conversion process and ensures it runs at the desired frequency.
    5. Load the Parquet file into the SQL table: Once the Parquet file is generated, use an Azure Data Factory Copy activity to load it into the SQL table.
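    As a rough illustration of step 3, here is a minimal sketch of what the conversion notebook could contain. It assumes the ADLS container is mounted at /mnt/raw, that the input file is named survey.sav (both are placeholders), and that the pyreadstat library is installed on the cluster; adjust paths and names to your environment.

    ```python
    # Minimal sketch (not a definitive implementation): convert an SPSS .SAV file
    # to Parquet in a Databricks notebook. All paths and file names are placeholders.
    import pyreadstat  # install on the cluster first, e.g. %pip install pyreadstat

    input_path = "/dbfs/mnt/raw/survey.sav"   # pyreadstat needs the local (FUSE) path of the mount
    output_path = "/mnt/raw/survey_parquet"   # Spark path on the same mount

    # Read the .SAV file into a pandas DataFrame; meta carries variable labels, formats, etc.
    pdf, meta = pyreadstat.read_sav(input_path)

    # Convert to a Spark DataFrame ("spark" is predefined in Databricks notebooks)
    # so the Parquet write is distributed across the cluster.
    sdf = spark.createDataFrame(pdf)

    # Write Parquet; ADF can then load this folder into the SQL table.
    sdf.write.mode("overwrite").parquet(output_path)
    ```

    A 200-300 MB .SAV file with 2-4 million rows should fit in the driver's memory, but if it ever does not, pyreadstat.read_file_in_chunks can read the file chunk by chunk, with each chunk appended to the Parquet output using mode("append").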

    Additionally, Databricks integrates seamlessly with other Azure services such as ADLS and Azure Data Factory, which makes it a good fit for building data pipelines. Because the conversion runs on a cluster rather than in a Function, you also avoid the time-out and size limitations you encountered with Azure Functions. If you want to create the scheduled job programmatically rather than through the UI, see the sketch below.
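    The following is a minimal sketch of creating the scheduled job (step 4) with the Databricks Jobs API 2.1 from Python. The workspace URL, access token, notebook path, cluster ID, and cron expression are all placeholders you would replace with your own values.

    ```python
    # Minimal sketch: create a scheduled Databricks job that runs the conversion
    # notebook nightly via the Jobs API 2.1. All identifiers below are placeholders.
    import requests

    workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"  # your workspace URL
    token = "<databricks-personal-access-token>"                          # or an Azure AD token

    job_spec = {
        "name": "sav-to-parquet-conversion",
        "tasks": [
            {
                "task_key": "convert",
                "notebook_task": {"notebook_path": "/Shared/sav_to_parquet"},
                "existing_cluster_id": "<cluster-id>",
            }
        ],
        # Run every day at 02:00 UTC (Quartz cron syntax).
        "schedule": {
            "quartz_cron_expression": "0 0 2 * * ?",
            "timezone_id": "UTC",
            "pause_status": "UNPAUSED",
        },
    }

    resp = requests.post(
        f"{workspace_url}/api/2.1/jobs/create",
        headers={"Authorization": f"Bearer {token}"},
        json=job_spec,
    )
    resp.raise_for_status()
    print("Created job with id:", resp.json()["job_id"])
    ```

    Alternatively, you can orchestrate the whole flow from Azure Data Factory by adding a Databricks Notebook activity before the Copy activity, which keeps scheduling and the SQL load in a single pipeline.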

    I hope this helps in addressing the above.

    Please let us know if you have any further queries. We will be glad to assist you further.

    Please consider upvoting wherever the information provided helps you, as this can benefit other community members.

    1 person found this answer helpful.
