SSIS: Data maintenance with Microsoft Azure SQL Database
Introduction
This article will explain how to track changes to Azure data using SSIS. To set the scene, let us imagine a scenario. There are two database servers, one hosted in the Cloud and another an on-premises SQL Server. Both have one user table with the same attributes.
We will consider routines to keep both of these matching by updating both with the same data.
Justifying
One of the biggest challenges of the extract, transform, and load (ETL) process is determining which records need to be extracted from the source data and loaded into a destination table. For smaller dimensional tables which are not used to populate slowly changing dimension, we may choose to truncate the target table and refill it with all of the data from the source with every load. This won't work for fact tables, large dimensional tables, and slowly changing dimensional tables. Truncate is also going to be a problem for any table with a foreign key constraint on it.
Where truncate will not work, we need to determine what data has changed since the last extract and only grab those records.
There are several methods to mark which data has changed since the last extract. They include:
- Changed timestamp and type fields
- A flag which is reset when a record is changed and marked when the record is extracted.
- Trigger or stored procedure to replicate changes to change capture table.
- Trigger writing an entry to a change list table with record id, timestamp and change type.
- Log ship to another server
All of these approaches add overhead to the transactional system and can even require changes to the front-end systems that create the transaction.
One external way round this is to pass all changes through a separate business server which records changes separately. This may not be attractive if you don't have a spare sql, rabbitmq or appfabric server or two standing by.
With SQL SERVER 2008 - and later - databases, we have a feature to make this process much easier. That feature is Known as CHANGE DATA CAPTURE- CDC.
Versions
This article cover the following versions of SQL Server Business Intelligence and Enterprise/Developer Editions : SQL SERVER 2008, SQL SERVER 2008 R2, SQL SERVER 2012 and SQL SERVER 2014.
Management Portal Azure
How to: Create an Azure SQL Database
- Sign in to the Management Portal.
- Create a WebSite to use a free 20 MB SQL DATABASE.
https://ricardolacerda.files.wordpress.com/2015/03/createwebsite.png
https://ricardolacerda.files.wordpress.com/2015/03/createwebsite2.png
https://ricardolacerda.files.wordpress.com/2015/03/createwebsitefreesql.png
https://ricardolacerda.files.wordpress.com/2015/03/createdatabase.png
https://ricardolacerda.files.wordpress.com/2015/03/databasecreated.png
How to: Configure the firewall for the logical server
- In the Management Portal, clique SQL DATABASES, then click SERVERS.
https://ricardolacerda.files.wordpress.com/2015/03/click-servers.png
- From the list, click on the server you just created.
- Click CONFIGURE.
https://ricardolacerda.files.wordpress.com/2015/03/click-configure.png
- In the allowed IP address section click ADD TO THE ALLOWED IP ADDRESS.
- Note: Your router or proxy server is currently listening this IP address. The SQL AZURE DATABASE detects the IP address used by SSIS's current connection and creates a firewall rule to accept connection request from the device.
https://ricardolacerda.files.wordpress.com/2015/03/allowed-ip-addresses.png
- A default name for the rule is generated. You can change the name.
- Clique SAVE at the bottom of the page to complete the step.
Connection Information - SQL AZURE
Server: | tcp:dxyt4i9hi5.database.windows.net (tcp: + Server Name + .database.windows.net) |
DataBase: | SQLRicardo |
User ID: | ricardo@dxyt4i9hi5 (User + @ + Server Name); |
Password: | {your_password}; |
https://ricardolacerda.files.wordpress.com/2015/03/ssms.png
**
Script executed to create the USER TABLE- SQL AZURE.
**
CREATE TABLE [dbo].[User](
[sAMAccountName] [varchar](20) NOT NULL,
[displayName] [varchar](50) NULL,
[Age] [int] NULL,
[Address] [varchar](100) NULL,
[TelephoneNumber] [varchar](20) NULL,
CONSTRAINT [PK_User] PRIMARY KEY CLUSTERED
(
[sAMAccountName] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF)
)
https://ricardolacerda.files.wordpress.com/2015/03/ssms_user.png
Script executed to create the USER_LOCAL TABLE- On-premises SQL Server database- WSS_CUSTOMS.
CREATE TABLE [dbo].[USER_LOCAL](
[sAMAccountName] [varchar](20) NOT NULL,
[displayName] [varchar](50) NULL,
[Age] [int] NULL,
[Address] [varchar](100) NULL,
[TelephoneNumber] [varchar](20) NULL,
CONSTRAINT [PK_USER_LOCAL] PRIMARY KEY CLUSTERED
(
[sAMAccountName] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
Architecture
Create a routine to keep the data updated in both tables basing on the table from On-premises SQL Server Database - WSS_CUSTOMS .
https://ricardolacerda.files.wordpress.com/2015/03/architecture2.png
Goals
As picture below we have three procedures to do:
https://ricardolacerda.files.wordpress.com/2015/04/tablesaction2.png
Requirements
Enabling Change Data Capture
The ADO.NET connection manager utilized by the CDC Source must point to a database that has change data capture enabled.
Enabling change data capture is two-step process. This feature must first be enabled for the database as a whole. Once this is done, change data capture must also be enabled for each table individually, in our situation will be executed on User_Local table.
Important: Change data capture is only available in the Enterprise, Developer, and Enterprise Evaluation editions.
**The following system stored procedure will enable change data capture for the database:
**Database Name = "WSS_CUSTOMS"
Sysadmin privileges are required to initiate the request action below.
USE WSS_Customs
EXEC sys.sp_cdc_enable_db
**The following system stored procedure will enable change data capture for the USER_LOCAL table: **
Important: SQLServerAgent must be started to enable change data capture. (Error message: SQLServerAgent is not currently running so it cannot be notified of this action)
execute sys.sp_cdc_enable_table
@source_schema='dbo',
@source_name ='USER_LOCAL',
@role_name='cdc_Admin',
@supports_net_changes=1
https://ricardolacerda.files.wordpress.com/2015/03/systemtable.png
The table is named for the schema and table name of the source table as follows:
CDC.DBO.USER_LOCAL_CT is found in the System Tables folder in SQL Server Management Studio.
The CDC.DBO.USER_LOCAL_CT has one column that mirrors each column in the source table where changed data is being captured. These columns have the same name and the same data type as the corresponding column in the source table.
In addition to the columns that mirror the CDC.DBO.USER_LOCAL_CT table**,** it contains several columns of metadata, as follows:
- __$start_lsn : The LOG SEQUENCE NUMBER (LSN) assigned to this change. All changes committed within the same transaction will have the same LSN. The LSN shows the order in which transactions occurred.
- __$end_lsn: This column may be utilized in future versions of SQL SERVER.
- __$seqval: A sequence number used to determine the order of changes that are within the same transaction and thus have the same LSN.
- __$operation: The operation contains a code identifying whether the row has been inserted, updated, or deleted as follows: 1- delete | 2- insert | 3-update (values prior to the update) | 4-update (values after the update)
- __$update_mask: A bit map indicating which columns where affected by an update.
In addition to creating a SQL Agent Job to capture change data from the SQL transaction log, the sys.sp_cdc_enable_db stored procedure creates a SQL Agent Job to remove old change data from the change tables. This prevent the change tables from growing indefinitely. It also means that there is a limited time period during which change data is available. This time period is known as the validity interval. By default, the validity interval is three days.
Integration Services Projects - SSIS
https://ricardolacerda.files.wordpress.com/2015/03/integrationservices.png
Creating New Connection Manager – SQL AZURE
https://ricardolacerda.files.wordpress.com/2015/03/connectionmanager.png
https://ricardolacerda.files.wordpress.com/2015/03/connectionmanagersuccess.png
Creating New Connection Manager – ADO.NET - CDC - USER_LOCAL
https://ricardolacerda.files.wordpress.com/2015/04/ado_connection.png
https://ricardolacerda.files.wordpress.com/2015/04/screenshotado.png
The integration Services packages for performing the initial load consists of three control flow items. An initial CDC Control Task marks the starting Load Sequence Number (LSN). Next, a control flow copies all of the records from the source table to the destination table. Finally, a second CDC Control Task marks the ending LSN. The CDC initial load package is shown in sequence below:
https://ricardolacerda.files.wordpress.com/2015/04/cdc11.png
https://ricardolacerda.files.wordpress.com/2015/04/cdc6.png
https://ricardolacerda.files.wordpress.com/2015/04/cdc5.png
https://ricardolacerda.files.wordpress.com/2015/04/cdc10.png
https://ricardolacerda.files.wordpress.com/2015/04/cdc9.png
https://ricardolacerda.files.wordpress.com/2015/04/cdc12.png
https://ricardolacerda.files.wordpress.com/2015/04/cd13.png
See Also