

Rerunning many slices and activities in Azure Data Factory

Today someone asked me how to run all the data slices in their data factory on demand, in an ad hoc fashion; in other words, how to run the whole pipeline again from scratch.

For example, if you have a one-time copy data factory that is used to load a data warehouse or a development environment in its entirety, you might want to run it only on-demand, and not have it run on a schedule.

A. Click on each Dataset and click Run on the data slice blade

You can click on one dataset at a time and run or rerun its slices. That is still piecemeal and a bit tedious if you have many datasets and slice times.

1. Select the dataset in question.
2. Find the time slice in question.
3. Click the Run button to rerun it.

B. Multiselect from the ADF Monitor and Manage web app.

You can filter the list and rerun the slices from the Monitoring web app. Multiselect (hold the Shift key and arrow down through the list) works in the Activity Windows grid, so you can rerun many slices with a single click.

If you have 100 tables being copied for example, you might have a bunch of different data sets and nested slice times as shown.

However, if you have a one-time copy that you created with the Copy Data tool in the browser, you most likely won't see any datasets, so I'm afraid this option is not universal for all copy activities.

 

C. Script it with PowerShell

Another way to rerun many slices is to use a PowerShell script to loop over all datasets in a given data factory or pipeline and reset the status to run them again.

Below is a code sample to do that. This script will rerun all slices on all datasets. If you only want to run some of them (not all slices), you can change the * pattern in the -like filter of the Where clause.

1. Launch Windows PowerShell ISE from the Windows start menu.

Open the code, or paste it in the script window at the top.
You can press the play button to run the whole script, or highlight lines piecemeal and run selection as needed.

Example of how to highlight the code and run one line at a time:

2. First-time installation

The first time you use Azure PowerShell on a given computer, you can either download the whole Azure PowerShell SDK or launch PowerShell as administrator and run the following commands to install the needed modules:

Install-Module AzureRM

Install-Module AzureRM.DataFactories
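If you are not sure whether the modules are already on the computer, a quick check before installing can save a step. The sketch below wraps Get-Module -ListAvailable in a small helper; the function name Test-ModuleInstalled is made up for this illustration and is not part of Azure PowerShell.

```powershell
# Hypothetical helper: returns $true when a module with the given name
# is available on this computer, $false otherwise.
function Test-ModuleInstalled {
    param([Parameter(Mandatory)][string]$Name)

    # Get-Module -ListAvailable returns nothing when no matching module is found
    $found = Get-Module -ListAvailable -Name $Name
    return [bool]$found
}

# Report on the two modules this post relies on
foreach ($m in 'AzureRM', 'AzureRM.DataFactories') {
    if (Test-ModuleInstalled -Name $m) {
        Write-Host "$m is already installed"
    }
    else {
        Write-Host "$m is missing. Run: Install-Module $m"
    }
}
```

The check only reads what is installed, so it does not need an administrator window.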

 

3. Set your active subscription once per session.

The first time, you need to connect to Azure, authenticate, and pick which subscription to work on.
If you reuse the same PowerShell window later, you can skip these lines to save time and avoid typing your password again. Use the # (pound sign) to comment out the code as needed.

# To login to Azure Resource Manager
Login-AzureRmAccount

# To view all subscriptions for your account
Get-AzureRmSubscription

# To select a default subscription for your current session
Get-AzureRmSubscription -SubscriptionName "your sub" | Select-AzureRmSubscription

 

4. Example PowerShell code to run.

 

There are #Comments inline to help you understand the intent.

This sample finds your data factory by name and resource group, then lists all slices, and loops to set the slice status to waiting. That will reset them so that the slices run again.

In case you want to filter just to Output slices, or a certain naming pattern, you can edit the like * clause.
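To see how the -like pattern narrows the list before you point it at a real factory, here is a small sketch that runs the same kind of filter over plain strings. The dataset names are hypothetical and only stand in for whatever your factory contains.

```powershell
# Hypothetical dataset names, only to illustrate the wildcard matching
$names = 'InputDataset-Customers', 'OutputDataset-Customers', 'OutputDataset-Orders'

# "*" matches every name, which is what the script uses by default
$all = @($names | Where-Object { $_ -like '*' })

# "*output*" keeps only names that contain "output"; -like ignores case
$outputsOnly = @($names | Where-Object { $_ -like '*output*' })

Write-Host "All names:" $all.Count
Write-Host "Output names only:" ($outputsOnly -join ', ')
```

In the script itself the same pattern goes inside the Where clause on DatasetName.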

You should adjust the two timestamps (-StartDateTime and -EndDateTime) to a reasonable range so that the script finds only recent slices; otherwise it may be a bit slow if you have months or years of slices in the history to repeat.

-StartDateTime 2016-01-01T00:00:00.0000000 -EndDateTime 2017-01-01T00:00:00.0000000
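Rather than typing fixed dates, you could compute a recent window relative to now. This is only a sketch: the 7-day span and the variable names $Start and $End are assumptions, so pick whatever range suits your slices.

```powershell
# Assumption: the last 7 days is a reasonable window; adjust as needed
$End   = (Get-Date).ToUniversalTime()
$Start = $End.AddDays(-7)

# The "s" format gives a sortable yyyy-MM-ddTHH:mm:ss string for display
$StartText = $Start.ToString('s')
$EndText   = $End.ToString('s')

Write-Host "-StartDateTime $StartText -EndDateTime $EndText"
```

Because $Start and $End are DateTime values, they could be passed to the -StartDateTime and -EndDateTime parameters in place of the literal timestamps.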

 

#Put your Data Factory name and Resource group here
$RG = "YourResourceGroup"
$DFname = "YourFactory"

#Once per session - comment out to save time
# Login-AzureRmAccount

# Once, select a default subscription for your current session
# Get-AzureRmSubscription -SubscriptionName "your sub" | Select-AzureRmSubscription

# One time per computer - comment out to save time
# Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory

# Debug option: List all DF in a resource group for debugging if needed
# Get-AzureRmDataFactory -ResourceGroupName $RG

# Find the Data Factory
$df= Get-AzureRmDataFactory -ResourceGroupName $RG -Name $DFname
# Stop here if the factory was not found, so the loops below do not run against nothing
if ($null -eq $df) { Write-Host "Data Factory $DFname cannot be found. Check spelling and resource group name." -BackgroundColor Red; return }

# List all DataSets in the data factory - Add the LIKE filter by name instead of * if needed such as *output*
$DataSets = Get-AzureRmDataFactoryDataset -DataFactory $df | Where {$_.DatasetName -like "*"} | Sort-Object DatasetName

# Loop over matching named DataSets
$i = 1
ForEach($DS in $DataSets)
{
    Write-Host $DS.DataFactoryName "--> " $DS.DatasetName -ForegroundColor:Yellow

    # List slices
    $Slices = Get-AzureRmDataFactorySlice -DataFactory $df -DatasetName $DS.DatasetName -StartDateTime 2016-01-01T00:00:00.0000000 -EndDateTime 2017-01-01T00:00:00.0000000

    # Reset all slices to status Waiting for the given dataset, in case there are multiple
    ForEach($S in $Slices)
    {
        $outcome=$false

        Write-Host $i ":" $DS.DataFactoryName "--> " $DS.DatasetName "--> Slice Start:["$S.Start"] End:["$S.End"] State:"$S.State $S.SubState -ForegroundColor:Cyan

        Try {
            # -ErrorAction Stop turns a failure into a terminating error so that Catch can see it
            $outcome=Set-AzureRmDataFactorySliceStatus -DataFactory $df -DatasetName $DS.DatasetName -Status Waiting -UpdateType UpstreamInPipeline -StartDateTime $S.Start -EndDateTime $S.End -ErrorAction Stop

            Write-Host " Slice status reset to Waiting so it will run again:" $outcome -ForegroundColor:Green
        }
        Catch
        {
            Write-Host " Slice status reset has failed. Error: " $_ -ForegroundColor:Red
        }

        $i++
    }
}
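
If you only want to rerun slices that did not succeed, one option is to filter the slice list by its State property before the reset loop. The sketch below uses made-up slice objects in place of the real Get-AzureRmDataFactorySlice output, and it assumes that failed slices report a State of Failed, as the monitoring views show; check the values your own factory returns before relying on it.

```powershell
# Hypothetical slice objects standing in for the output of Get-AzureRmDataFactorySlice
$Slices = @(
    [pscustomobject]@{ Start = [datetime]'2016-01-01'; End = [datetime]'2016-01-02'; State = 'Ready' }
    [pscustomobject]@{ Start = [datetime]'2016-01-02'; End = [datetime]'2016-01-03'; State = 'Failed' }
    [pscustomobject]@{ Start = [datetime]'2016-01-03'; End = [datetime]'2016-01-04'; State = 'Failed' }
)

# Keep only the slices whose State is Failed (assumed value); the others are left alone
$ToRerun = @($Slices | Where-Object { $_.State -eq 'Failed' })

Write-Host $ToRerun.Count "of" $Slices.Count "slices would be reset"
```

In the full script, the inner ForEach would then loop over the filtered list instead of every slice.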