Sdílet prostřednictvím


Incremental data load from Azure Table Storage to Azure SQL using Azure Data Factory

 

Azure Data Factory is a cloud based data integration service. The service not only helps to move data between cloud services but also helps to move data from/to on-premises.  For example, moving data from Azure blob storage to Azure SQL etc. You can find supported data stores here.

Many business scenario starts with an initial data load, then continual incremental data loads, either hourly/daily/monthly etc. The focus of this post is to explain ADF creation and how incremental data could be loaded.

Prerequisite

  • Azure Subscription
  • Azure Table Storage which hosts the source data
  • Azure SQL DB which hold table to store data
  • PowerShell/.Net SDK if you want to create ADF solution from said platform. In this blog we’ll be using Azure portal without this requirement.

 

Setting up the initial scenario

Everyday new data lands in an Azure Table, and needs to be updated on Azure SQL Database on some periodic basis, daily/hourly/monthly depending on the schedule desired. We don’t want to upload the entire table daily rather just want to add new data at destination. In this example I will focus on Employee data defined in next section.

Azure Table storage

Here is the sample data that will be our Copy Activity source. In this example we will be using Azure Storage Explorer.

image

Here is the schema defined for records.

image

PartitionKey, RowKey and Timestamp are automatically included for every entity. PartitionKey and RowKey value will be user/developer responsibility to fill in whereas Timestamp value manage by server which can’t be changed/modified. So if schema doesn’t have any property to define when record added then Timestamp property can be used. For more information about Table service model please click here.

As shown in above screenshot, Azure storage explorer 6.0.0 doesn’t show Timestamp property.

Azure SQL Server

Below is the table definition defined in Azure SQL Database that will be our Copy Activity Sink (destination).

image

Azure Data Factory Solution

In this section, we will be creating an Azure Data Factory.

Creating ADF Service

image

  • Enter Azure Data Factory Name
  • Select Subscription
  • Select/Create new Resource group name

image

  • Select Pin to Dashboard
  • Click Create
  • Once ADF created, there will be an tile added at home screen.

image

  • Click Author and deploy

image

 

Creating Data Sources

Azure Storage

    • Click New data store

             image

    • Select Azure storage
    • Provide storage account name and account key

             image

    • Click Deploy

Azure SQL

    • Click New data store
    • Click Azure SQL

             image

    • Provide Server name
    • Provide Initial Catalog (database name)
    • Provide user id and password

            image

    • Click Deploy

Creating Dataset

Azure Table

    • Click New dataset

             image

    • Select Azure table
    • Refer article Datasets to understand properties like Structure, Published, type,  typeProperties etc.

            image

    • Click Deploy

Azure SQL

    • Click New dataset

             image

    • Select Azure SQL

             image

    • Click Deploy

Creating Pipeline

  • Click New pipeline

image

  • Click Add activity and select Copy activity

image

  • Add the code for data transformation. It will be look like below once done.

image

  • Click Deploy. The diagram will look like below.

image

Once this pipeline is deployed, it will start processing the data based on the start and end time. In this blog we define in scheduler frequency is Day and interval is 1. That means it will run daily once. We also define start and end time for this pipeline and as per the definition it will run only for one day. For more information on Pipeline execution please refer Scheduling and Execution with Data Factory.

Go back to the Data Factory which we created in the web portal. Click Pipelines and select the pipeline which we created just now.

 

image

On summary page click Datasets, Consumed. Notice the Pipeline Activities details shows it executed once.

image

Go back and click Datasets, Produced. Notice the Slice is in the “Ready” state, meaning data has been transferred. Sometime it will show Pending Execution means it’s waiting to execute. Is progress means it is still running. You may notice Error state in case there is any error while designing/execution of pipeline.

image

Checking Execution

You can use SQL Server Management studio to confirm if the data arrived or not. Do a select query on destination table to confirm.

image

Incremental load

Now that we have all the data loaded in destination, the next step is that we want to move incremental data on a daily basis rather than deleting and inserting the whole set of data. In this example, we will be adding some additional records with different dates and will insert data for specific date. Below is sample data added in Azure Table storage.

image

As an example, for the next date, we’ll move data only for 9-January-2016. To do this first change we will do in pipeline is to fetch specific data from source. To do we’ll add the line below in JSON property azureTableSourceQuery for the data factory pipeline.

"azureTableSourceQuery": "$$Text.Format('RecordAddedDate eq datetime\\'{0:yyyy-MM-ddTHH:mm:ss}\\'', SliceStart)"

The above query will filter record based on the slice Start Time on RecordAddedDate property. $$ is used to invoke data factory macro functions. Since RecordAddedDate is a date-time property we need to add datetime prefix to cast it. In this example we will filter records where RecordAddedDate equals to SliceStart date. We can also define SliceStart, SliceEnd, WindowStart or WindowEnd as parameter.

*Note:- if RecordAddedDate property is not define in table entity, we can use Timestamp property. Please refer Azure Table Storage section at the beginning of this blog.

Add changes and deploy

image

Once the JSON is deployed, and the Produced dataset state is in Ready status, you can query the SQL Server table to see the second copy activity’s output. Notice it has only transferred data for 09-Jan-2016.

image

Generally, in such defined scheduler, ADF pipeline executes at 12:00:00 AM daily. In case if required to run pipeline in different time instead of default (12:00:00 AM) let say 6:00:00 AM then add "offset": "06:00:00" property.

Thanks my colleague Jason for reviewing this post.

Hope it’s helpful.

Comments

  • Anonymous
    March 16, 2017
    Is it necessary that we need to have a datetime column in our data for achieving incremental delta load. If it is not so, how can we do that?