Reduce Cluster startup time in ADF Dataflow

Patil Rajas 40 Reputation points
2025-02-23T16:50:04.0466667+00:00

I have created an ADF pipeline which merges data between source and target in parallel for 18 tables. The table names, database names, etc. are dynamic and assigned at runtime.

The pipeline takes more than 7 minutes to load the data, and most of that time is spent on cluster startup for each Data Flow. How can we reduce it?

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. Chandra Boorla 9,685 Reputation points Microsoft External Staff
    2025-02-24T03:31:41.4+00:00

    Hi @Patil Rajas

    Thank you for posting your query!

    Cluster startup time is a common challenge in ADF Data Flows since each Data Flow activity initializes a new Spark cluster, which can take several minutes. Given that you're processing 18 tables in parallel, this startup time adds up significantly.

    Here are some optimized strategies to improve performance and cost efficiency:

    Enable Time-to-Live (TTL) for Your Integration Runtime -

    • What it does - Keeps the Spark cluster alive between jobs, reducing cold start delays.
    • How to set it up - Go to Azure Integration Runtime (IR) settings in ADF. Set TTL to 10–15 minutes (adjust based on job frequency).

    Note - TTL helps only if Data Flows are run sequentially or in small batches. It won't reduce startup time for fully parallel executions, as each Data Flow still requires a separate cluster.
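    As a rough sketch, a TTL-enabled Azure IR looks something like this in its JSON definition (the IR name, core count, and TTL value below are placeholders, and exact property names may vary with the schema version your factory uses; `timeToLive` is in minutes):

    ```json
    {
      "name": "DataFlowRuntimeWithTTL",
      "properties": {
        "type": "Managed",
        "typeProperties": {
          "computeProperties": {
            "location": "AutoResolve",
            "dataFlowProperties": {
              "computeType": "General",
              "coreCount": 8,
              "timeToLive": 10
            }
          }
        }
      }
    }
    ```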

    Optimize Parallel Execution – Use Batching Instead of Full Parallelism

    • Issue - Running 18 Data Flows in parallel results in 18 separate cluster startups, increasing overall execution time.
    • Fix - Instead of full parallel execution, use a batched approach. Batch tables into groups (e.g., 3 batches of 6 tables). Run each batch sequentially, while processing tables within the batch in parallel. This reduces cluster provisioning overhead while still leveraging parallelism.

    Use a Dedicated Integration Runtime (IR) for Faster Startup

    • Why - The default "Auto-resolve IR" has inconsistent startup times due to shared infrastructure.
    • Solution - Create a Dedicated Azure IR in the same region as your storage & compute resources. Use a fixed cluster size (e.g., General Purpose, 16–32 vCores) for faster provisioning. Avoid IRs in over-utilized regions to prevent resource allocation delays.

    Merge Data Flows – Process Multiple Tables in One Data Flow

    • Instead of creating 18 separate Data Flows, process multiple tables within a single Data Flow using dynamic parameters.
    • How - Use @pipeline().parameters.TableName in the source query to dynamically load tables. Loop over table names with a ForEach activity, feeding each into a single parameterized Data Flow. Configure the sink dataset dynamically to write output per table.
    • Impact - This eliminates repeated cluster startups, significantly reducing execution time.
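    Passing the table name from the ForEach loop into a single parameterized Data Flow might look roughly like this (activity, Data Flow, and parameter names are placeholders; note the extra single quotes ADF expects when passing a string expression as a Data Flow parameter):

    ```json
    {
      "name": "MergeOneTable",
      "type": "ExecuteDataFlow",
      "typeProperties": {
        "dataflow": {
          "referenceName": "MergeDataFlow",
          "type": "DataFlowReference",
          "parameters": {
            "TableName": {
              "value": "'@{item().TableName}'",
              "type": "Expression"
            }
          }
        }
      }
    }
    ```

    Inside the Data Flow itself, the source can then reference the parameter in the data flow expression language, e.g. a query like concat('SELECT * FROM ', $TableName).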

    Cost vs. Speed Trade-Off – Recommended Approach

    • Set TTL to 10 minutes (adjust as needed to balance cost vs. speed).
    • Use batch processing (3-5 tables per Data Flow) instead of full parallelism.
    • Choose a Dedicated IR with fixed vCores for consistent performance.

    By implementing these optimizations, you can significantly reduce cluster startup time while maintaining cost efficiency.

    For additional information, please refer to the following Microsoft documentation:

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

    Thank you.

