Merge around 3000 parquet files to a single parquet file in ADF in gen2 storage

Shivshankar Kanawade 0 Reputation points Microsoft Employee
2025-02-28T00:30:05.58+00:00

Hi

I want to merge around 3,000 parquet files into a single parquet file in ADF. The files are in an ADLS Gen2 storage account. I tried the Copy activity with the merge-into-a-single-file option, but it is very slow. How can I do this faster? Can we use an ADF data flow? If so, how? I am not able to find documentation around that. Any information or example would be great.

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. Vinodh247 28,386 Reputation points MVP
    2025-02-28T05:53:24.0166667+00:00

    Hi,

    Thanks for reaching out to Microsoft Q&A.

    Merging files into a single file with the Copy activity works fine for smaller datasets, but it is not optimized for large-scale merges. The Copy activity effectively streams each file and then aggregates them, which becomes very slow when you have thousands of files. For better performance and parallelism, move to a Data Flow or a Spark-based engine such as Synapse or Databricks; that will best suit your requirement.

    If you go with a data flow, keep the following in mind:

    1. Data Flow compute size: Increase the Data Flow cluster size if you’re dealing with a large volume of data. A larger compute (General Purpose with 8 or more cores) can handle merges more efficiently.
    2. Partitioning: By default, Data Flow tries to optimize partitioning, but forcing a single partition at the sink is necessary to get exactly one output file. Just be aware that everything is funneled through a single partition in that final step; earlier steps (source, transformations) still run in parallel across multiple partitions, so you still get performance benefits.

    Note:

    • Mapping Data Flow is the more “ADF-native” approach. It runs on a managed Spark engine under the hood, giving you better parallelization and more transformations in one place.
    • Setting the sink partitioning option to “Single partition” (on the sink's Optimize tab) is what ensures all data gets written into one file.
    • If even Data Flow is not performant enough, consider a larger compute size or a direct Spark solution in Databricks or Synapse; see the PySpark sketch below.
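
    For the direct Spark route, here is a minimal PySpark sketch of the merge. It assumes a Synapse or Databricks notebook that already has access to the Gen2 account (for example via a linked service or managed identity); the container, account, and folder names are placeholders you would replace with your own.

```python
from pyspark.sql import SparkSession

# In a Synapse or Databricks notebook a SparkSession usually already exists as `spark`;
# this line is only needed when running the script standalone.
spark = SparkSession.builder.appName("merge-parquet").getOrCreate()

# Placeholder ADLS Gen2 paths - replace container, account, and folders with your own.
source_path = "abfss://<container>@<account>.dfs.core.windows.net/input/"
target_path = "abfss://<container>@<account>.dfs.core.windows.net/merged/"

# Spark reads all ~3000 part files in the source folder in parallel.
df = spark.read.parquet(source_path)

# coalesce(1) collapses the data into a single partition just before the write,
# so exactly one parquet part file is produced under target_path.
df.coalesce(1).write.mode("overwrite").parquet(target_path)
```

    Note that Spark writes the result as a folder containing a single part-*.parquet file; rename or move that file afterwards if you need a specific file name. coalesce(1) avoids a full shuffle, but the final write still runs as a single task, which is unavoidable when the output must be exactly one file.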

    Please feel free to click the 'Upvote' (Thumbs-up) button and 'Accept as Answer'. This helps the community by allowing others with similar queries to easily find the solution.

