Hi,
Thanks for reaching out to Microsoft Q&A.
Merging files into a single file works fine for smaller datasets, but it is not optimized for large-scale merges. Copy Activity effectively streams each file and then aggregates them, which can become very slow when you have thousands of files. To get more performance and parallelism, move to a Data Flow or a Spark-based engine such as Synapse or Databricks, whichever best suits your requirement.
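For reference, here is a minimal PySpark sketch of the Spark-based approach. The storage account, container names, and paths are placeholders you would replace with your own, and it assumes the cluster is already authenticated to the storage account:

```python
from pyspark.sql import SparkSession

# Minimal sketch: merge many CSV files into a single output file on Spark
# (Synapse or Databricks). Paths and storage account are placeholders.
spark = SparkSession.builder.appName("merge-files").getOrCreate()

source_path = "abfss://input@<storageaccount>.dfs.core.windows.net/raw/*.csv"
target_path = "abfss://output@<storageaccount>.dfs.core.windows.net/merged/"

# Reading is parallelized across the cluster, even with thousands of files.
df = spark.read.option("header", "true").csv(source_path)

# coalesce(1) funnels the data into one partition only at write time,
# which is what produces a single part-*.csv file under target_path.
df.coalesce(1).write.mode("overwrite").option("header", "true").csv(target_path)
```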
If you are going with Data Flow, keep the following in mind:
- Data Flow compute size: Increase the Data Flow cluster size if you’re dealing with a large volume of data. A larger compute (General Purpose with 8 or more cores) can handle merges more efficiently.
- Partitioning: By default, Data Flow tries to optimize partitioning, but forcing a single partition at the sink is necessary to get exactly one file. Just be aware that everything is funneled through a single partition at that last step. Earlier steps (source, transformations) still run in parallel across multiple partitions, so you still get performance benefits.
Note:
- Mapping Data Flow is the more “ADF-native” approach. It runs on a Spark-based engine under the hood, giving you better parallelization and more transformations in one place.
- Setting the sink partition to “Single partition” is what ensures all data gets written into one file.
- If even Data Flow is not performant enough, consider a bigger compute size or a direct Spark solution (Databricks or Synapse).
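One practical detail if you go with Databricks or Synapse: Spark writes the single output as a part-* file inside the target folder rather than as a file with a name you choose. Below is a hedged sketch of one way to rename it afterwards via the Hadoop FileSystem API, reusing the hypothetical target_path from the earlier example and assuming exactly one part file was written:

```python
# Spark writes a folder containing one part-*.csv (plus _SUCCESS markers).
# One option for a fixed single filename is renaming that part file via the
# Hadoop FileSystem API exposed through Spark's internal JVM handles.
hadoop_conf = spark._jsc.hadoopConfiguration()
Path = spark._jvm.org.apache.hadoop.fs.Path
fs = Path(target_path).getFileSystem(hadoop_conf)

# Find the single part file and rename it to a predictable name.
part_file = [f.getPath() for f in fs.listStatus(Path(target_path))
             if f.getPath().getName().startswith("part-")][0]
fs.rename(part_file, Path(target_path + "merged.csv"))
```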
Please feel free to click the 'Upvote' (Thumbs-up) button and 'Accept as Answer'. This helps the community by allowing others with similar queries to easily find the solution.