Tracking changes to tables in your data warehouse using snapshot-based versioning
Today I will explain a way to track historic changes in a data warehouse and share a little program I wrote that will help you do this in minutes. It’s based on a method I’ve used many times, and it is generic enough to cover many different scenarios. I’ll use an example to clarify how this works. In this example, I have a data warehouse database called MyDW, and this database contains a table called ImportTable, which I’ll assume is imported from another database by the data warehouse’s ETL process. For the purpose of this example I’ve kept this table very simple:
CREATE TABLE [dbo].[ImportTable](
[ID] [int] NULL,
[Name] [nvarchar](100) NULL,
[Country] [nvarchar](100) NULL
) ON [PRIMARY]
It’s important to understand that the ETL process periodically refreshes this table by fetching fresh data from an external database. Records might be added, removed or changed. The ETL process might simply update it by truncating the existing table and reloading it entirely. The ID field is the primary key which is imported from the external database (for this example we don’t add our own ID in the data warehouse).
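As a minimal sketch, such a truncate-and-reload refresh could look like the statements below. The source names ([ExternalServer].[SourceDB].[dbo].[Customers]) are made up for illustration; substitute the external database and table your ETL process actually reads from.
-- Full truncate-and-reload refresh of the import table (sketch).
-- [ExternalServer].[SourceDB].[dbo].[Customers] is a hypothetical source.
TRUNCATE TABLE [dbo].[ImportTable];

INSERT INTO [dbo].[ImportTable] ([ID], [Name], [Country])
SELECT [ID], [Name], [Country]
FROM [ExternalServer].[SourceDB].[dbo].[Customers];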
I call this method of tracking history “snapshot-based versioning”. It does not rely on database triggers or on Change Data Capture (CDC). While those features might be useful for other scenarios, I’ve found them too limiting for data warehouse scenarios. For example, they don’t work well if your ETL process refreshes the data by truncating the existing table and then reloading it.
Snapshot-based versioning instead relies on T-SQL’s MERGE statement, which is supported in SQL Server 2008 and higher. A limitation is that you cannot track changes to fields of type image, ntext and text. This is usually not a blocker, because in real-life scenarios most tables have a “modified” field that is updated whenever the record is changed, so you can compare that field instead.
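If you want to check whether a table contains any of the affected column types before setting this up, a query along the following lines lists them (a sketch, assuming the table lives in the dbo schema):
-- List columns in ImportTable whose types (image, ntext, text) cannot be compared
-- by the change test in the MERGE statement shown later.
SELECT c.name AS column_name, t.name AS type_name
FROM sys.columns AS c
JOIN sys.types AS t ON t.user_type_id = c.user_type_id
WHERE c.object_id = OBJECT_ID('dbo.ImportTable')
  AND t.name IN ('image', 'ntext', 'text');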
The idea is to have a second table called H_ImportTable which has the same fields as the original ImportTable. This second table, which I’ll refer to as the “history table”, holds a versioned copy of the data in ImportTable. In other words, it works like a log which appends a row for every change to ImportTable. Obviously, if the table in the external database has changed multiple times between two ETL runs, this will appear as only one change in H_ImportTable, hence the name “snapshot-based versioning”.
You can create H_ImportTable as follows:
CREATE TABLE [dbo].[H_ImportTable] (
[_ACTION] [char](1) NOT NULL,
[_KEY] [int] IDENTITY(1,1) NOT NULL,
[_EFF_FROM] [datetime] NOT NULL,
[_EFF_TO] [datetime] NOT NULL,
[ID] [int] NULL,
[Name] [nvarchar](100) NULL,
[Country] [nvarchar](100) NULL
)
As you can see, there are four additional fields in this table:
- _ACTION indicates whether a row has been added (‘A’), changed (‘C’), deleted (‘D’) or undeleted (‘U’).
- _KEY is a unique row identifier. This field is only useful in specific scenarios.
- _EFF_FROM indicates the date/time from which this row is effective.
- _EFF_TO indicates the date/time until which this row is effective. If the row is still effective now, this field is set to 9999-12-31.
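Together, _EFF_FROM and _EFF_TO let you reconstruct how ImportTable looked at any point in time. A minimal sketch of such an “as of” query (the timestamp is just an example value):
-- Reconstruct ImportTable as it looked at a given moment (sketch).
-- Rows with _ACTION = 'D' mark deletions, so they are excluded.
DECLARE @AsOf datetime = '2014-06-01';

SELECT [ID], [Name], [Country]
FROM [dbo].[H_ImportTable]
WHERE _EFF_FROM <= @AsOf
  AND _EFF_TO > @AsOf
  AND _ACTION <> 'D';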
After refreshing ImportTable, the ETL process should update H_ImportTable using the following SQL statement:
INSERT INTO [dbo].[H_ImportTable]
SELECT _ACTION, SYSDATETIME() AS _EFF_FROM, '9999-12-31' AS _EFF_TO, [ID], [Name], [Country]
FROM
(
    MERGE [dbo].[H_ImportTable] AS T
    USING [dbo].[ImportTable] AS S
    ON (S.[ID] = T.[ID] AND T._EFF_TO = '9999-12-31')
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (_ACTION, _EFF_FROM, _EFF_TO, [ID], [Name], [Country])
        VALUES ('A', SYSDATETIME(), '9999-12-31', [S].[ID], [S].[Name], [S].[Country])
    WHEN MATCHED AND T._EFF_TO = '9999-12-31' AND (T._ACTION = 'D' OR ([S].[Name] <> [T].[Name] OR [S].[Country] <> [T].[Country])) THEN
        UPDATE SET T._EFF_TO = SYSDATETIME()
    WHEN NOT MATCHED BY SOURCE AND T._EFF_TO = '9999-12-31' AND T._ACTION <> 'D' THEN
        UPDATE SET T._EFF_TO = SYSDATETIME()
    OUTPUT $Action Action_Out
        , CASE WHEN S.[ID] IS NULL THEN 'D' WHEN Inserted._ACTION = 'D' THEN 'U' ELSE 'C' END AS _ACTION
        , ISNULL(S.[ID], Deleted.[ID]) AS ID, [S].[Name], [S].[Country]
) AS MERGE_OUT
WHERE MERGE_OUT.Action_Out = 'UPDATE';
The statement above does all of the heavy lifting to update the history table H_ImportTable. Because MERGE is a single SQL statement, it is very efficient compared to using multiple SELECT, INSERT and UPDATE statements to accomplish the same thing. The statement assumes that the ID field is unique in the source table, so you’re advised to create a unique index on that field in ImportTable. Obviously, in the history table H_ImportTable this field will no longer be unique.
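On ImportTable, a unique index along these lines enforces that assumption (the index name is just a suggestion):
-- Enforce uniqueness of the imported key in the import table (sketch).
CREATE UNIQUE NONCLUSTERED INDEX IX_ImportTable_ID
    ON [dbo].[ImportTable] ([ID]);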
For performance you’ll want to create the following indexes on the history table H_ImportTable: ID, _ACTION, _EFF_FROM and _EFF_TO.
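A minimal sketch of those indexes (the names are illustrative; the Table Merge Scripts program described below generates equivalent statements for you):
-- Non-clustered indexes on the history table to speed up the MERGE and typical
-- history queries (sketch).
CREATE NONCLUSTERED INDEX IX_H_ImportTable_ID       ON [dbo].[H_ImportTable] ([ID]);
CREATE NONCLUSTERED INDEX IX_H_ImportTable_ACTION   ON [dbo].[H_ImportTable] ([_ACTION]);
CREATE NONCLUSTERED INDEX IX_H_ImportTable_EFF_FROM ON [dbo].[H_ImportTable] ([_EFF_FROM]);
CREATE NONCLUSTERED INDEX IX_H_ImportTable_EFF_TO   ON [dbo].[H_ImportTable] ([_EFF_TO]);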
You can adjust this SQL code to work with the tables in your data warehouse. To do so, adjust the table and field names (ImportTable, H_ImportTable, ID, Name and Country in the example above) to match your own table. If you need to do this for many tables, making these changes manually is laborious and error-prone. This is why I wrote the program “Table Merge Scripts” which you can download here. This program generates the SQL code based on an existing table in the database. You can see the UI below:
First, edit the connection string by entering the correct server name and database name. Then click Connect.
Next, select the table and the field that is the primary key.
The first button in the bottom half of the screen copies the SQL code to create the history table to the clipboard. The new table will have the same name as the selected source table, prefixed with “H_”. The code also creates indexes for _ACTION, _EFF_FROM, _EFF_TO, _KEY and the field which you designated as the primary key in the source table.
The next button copies the SQL code to update the history table to the clipboard. This consists of a MERGE command similar to the example shown above.
The final button copies the same SQL code as the second button, but wrapped in a stored procedure.
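As a sketch, the generated procedure essentially just wraps the update statement shown earlier so the ETL process can call it after every refresh; the procedure name usp_Update_H_ImportTable below is my own placeholder, and the generated code may differ in detail.
-- History-table update wrapped in a stored procedure (sketch).
CREATE PROCEDURE [dbo].[usp_Update_H_ImportTable]
AS
BEGIN
    SET NOCOUNT ON;

    -- Same history-table update as shown earlier in this post.
    INSERT INTO [dbo].[H_ImportTable]
    SELECT _ACTION, SYSDATETIME() AS _EFF_FROM, '9999-12-31' AS _EFF_TO, [ID], [Name], [Country]
    FROM
    (
        MERGE [dbo].[H_ImportTable] AS T
        USING [dbo].[ImportTable] AS S
        ON (S.[ID] = T.[ID] AND T._EFF_TO = '9999-12-31')
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (_ACTION, _EFF_FROM, _EFF_TO, [ID], [Name], [Country])
            VALUES ('A', SYSDATETIME(), '9999-12-31', S.[ID], S.[Name], S.[Country])
        WHEN MATCHED AND T._EFF_TO = '9999-12-31' AND (T._ACTION = 'D' OR (S.[Name] <> T.[Name] OR S.[Country] <> T.[Country])) THEN
            UPDATE SET T._EFF_TO = SYSDATETIME()
        WHEN NOT MATCHED BY SOURCE AND T._EFF_TO = '9999-12-31' AND T._ACTION <> 'D' THEN
            UPDATE SET T._EFF_TO = SYSDATETIME()
        OUTPUT $Action AS Action_Out
            , CASE WHEN S.[ID] IS NULL THEN 'D' WHEN Inserted._ACTION = 'D' THEN 'U' ELSE 'C' END AS _ACTION
            , ISNULL(S.[ID], Deleted.[ID]) AS ID, S.[Name], S.[Country]
    ) AS MERGE_OUT
    WHERE MERGE_OUT.Action_Out = 'UPDATE';
END
GO

-- Typical usage at the end of each ETL run, after ImportTable has been refreshed:
EXEC [dbo].[usp_Update_H_ImportTable];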
Good luck!
Comments
Anonymous
June 25, 2014
Great, this is something I've been looking for, instead of CDC. I'm trying it in my environment.
Anonymous
April 23, 2015
I am confused as to how you are identifying CDC for SOURCE rows to be upserted. If your source fact table has a billion rows, you need CDC to get the new / updated / deleted rows. I do not think you are advocating selecting all billion source fact rows at each ETL run.
Anonymous
April 23, 2015
@dave dumas
The technique described in this post assumes you have a way to load/update the entire source table into an import staging table in the DW. The simplest approach, which I use most of the time, is to reload the entire table with every ETL run. This works well if the source table has millions of rows. If you have billions of rows, you'll need to find a way to load only the data that changed. If the source system has CDC enabled, you can leverage that.
Nevertheless, once the ETL has completed loading the import staging table, the method described in this post can theoretically be used to track historic changes to the data. It relies on the SQL MERGE statement. I have had no problems using it with tables with millions of rows, but have never tested it with a billion-row table. I suspect the time MERGE requires to complete is linear with the number of rows, and if so, that means it would probably not complete in time with very large data sets.
BTW, the method described in this post also assumes there is a unique ID field. It needs that in order to match records and understand if they have changed. If the source data does not have a primary key, you can't use this method to track historic changes.
Anonymous
September 28, 2015
Thanks for the post! Any chance you'd be willing to share your source code for your 'Table Merge Scripts' program?
Anonymous
September 29, 2015
The comment has been removed
Anonymous
September 30, 2015
Much appreciated!