Data Flows in a daisy chain - TTL

Question

Alan Anscombe 51

I found someone online describing this issue, and I have exactly the same problem:

**"For each pipeline that used data flows to perform data transformations, there'd be a ~6 minute cold-start time where ADF would be "acquiring compute" for an Apache Spark cluster. Azure states in their docs that you can overcome this cold start for down stream tasks by configuring a TTL on the integration runtime but this does not work. We found our pipeline would be cold starting all data flow activities down stream. Microsoft, you really need to fix this!"**

Where I work we have this exact same problem. We were forced to move our factory from Australia Southeast to Australia East because Data Flows are not yet supported in the former (confirmed by a Microsoft response), so I'm wondering whether our mixed-location subscription is now causing trouble.

Accepted answer

Answer 1

Mark Kromer MSFT 1,146

Setting a TTL on your Azure IR and then executing your data flows in sequence will reduce your Azure Databricks cluster acquisition time to 1-2 minutes.

You can execute your Azure Databricks compute from ADF Data Flows in a region that is different from the home region of your factory by setting the region in the Azure IR configuration.
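As a rough illustration of the two settings mentioned above, the sketch below builds an Azure IR definition with a data flow TTL and an explicit region. The property names (`computeProperties.location`, `dataFlowProperties.timeToLive`) follow the integration runtime JSON as I understand it and are an assumption, not something posted in this thread. The IR name, core count, and 10-minute TTL are made-up values, so check the output against the current ADF documentation before relying on it.

```python
import json


def build_azure_ir(name, region, ttl_minutes, core_count=8, compute_type="General"):
    """Return a dict shaped like an ADF managed (Azure) integration runtime.

    Assumption: the field names mirror the integration runtime JSON schema
    as I understand it; every value passed in here is illustrative only.
    """
    if ttl_minutes < 0:
        raise ValueError("ttl_minutes must be zero or positive")
    return {
        "name": name,
        "properties": {
            "type": "Managed",
            "typeProperties": {
                "computeProperties": {
                    # Region where the data flow compute runs; it may differ
                    # from the home region of the factory.
                    "location": region,
                    "dataFlowProperties": {
                        "computeType": compute_type,
                        "coreCount": core_count,
                        # Minutes the compute stays warm after a run finishes.
                        "timeToLive": ttl_minutes,
                    },
                }
            },
        },
    }


# Hypothetical IR name and values, chosen only for this example.
ir_definition = build_azure_ir("DataFlowIR-AusEast", "Australia East", ttl_minutes=10)
print(json.dumps(ir_definition, indent=2))
```

Every data flow activity that should benefit from the warm compute would then need to reference this same IR.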

KranthiPakala-MSFT 46,612 Reputation points Microsoft Employee

2020-07-13T19:53:49.83+00:00

Hi @Alan Anscombe ,

Following up to see if the above information from Mark Kromer was helpful. If you need further assistance or encounter any issues, please do let us know.

----------

Thank you

Alan Anscombe 51 Reputation points

2020-07-14T03:14:45.077+00:00

I do see that TTL decreases execution time, but I was wondering why each Dataflow exhibited a 'cluster startup time'. I guess the documentation below about the 'Spark context' explains it for me. Clean separation of logic is a must with any kind of sizable ETL project.

Execute DataFlow sequentially:
"If you execute your data flow activities in sequence in the pipeline and you have set a TTL on the Azure IR configuration, then ADF will reuse the compute resources (VMs) resulting in faster subsequent execution times. You will still receive a new Spark context for each execution.

Of these three options, this action likely takes the longest time to execute end-to-end. But it does provide a clean separation of logical operations in each data flow step"
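A sequential chain like the one described in that passage can be sketched as pipeline activities that each depend on the previous one and all point at the same TTL-enabled Azure IR. This is only an illustration: the data flow and IR names are hypothetical, and the activity shape (`ExecuteDataFlow`, `dependsOn`, `integrationRuntime`) reflects my understanding of the pipeline JSON rather than anything posted in this thread.

```python
import json


def chain_data_flows(data_flow_names, ir_name):
    """Build Execute Data Flow activities that run one after another on one IR.

    Assumption: the activity layout follows the ADF pipeline JSON as I
    understand it; all names used here are placeholders.
    """
    activities = []
    previous = None
    for flow in data_flow_names:
        activity = {
            "name": f"Run_{flow}",
            "type": "ExecuteDataFlow",
            "dependsOn": [],
            "typeProperties": {
                "dataflow": {"referenceName": flow, "type": "DataFlowReference"},
                # Same IR for every activity so warm compute can be reused.
                "integrationRuntime": {
                    "referenceName": ir_name,
                    "type": "IntegrationRuntimeReference",
                },
            },
        }
        if previous is not None:
            # Start only after the previous data flow has succeeded.
            activity["dependsOn"].append(
                {"activity": previous, "dependencyConditions": ["Succeeded"]}
            )
        activities.append(activity)
        previous = activity["name"]
    return activities


# Hypothetical data flow and IR names.
activities = chain_data_flows(
    ["StageOrders", "CleanOrders", "LoadOrders"], "DataFlowIR-AusEast"
)
print(json.dumps(activities, indent=2))
```

As the quoted documentation notes, each activity in such a chain still receives its own Spark context; sharing the IR only allows the VMs to be reused within the TTL window.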

KranthiPakala-MSFT 46,612 Reputation points Microsoft Employee

2020-07-16T21:44:21.817+00:00

Hi @AlanAnscombe-0773,

Thanks for your response, and yes, your understanding is correct. Please feel free to let us know if you have any further queries. If you found any of the posts helpful, please consider clicking "Accept Answer" or "Upvote" on it, as this can be beneficial to other community members.

Thank you
