Load slowly changing dimensions
In most relational data warehouses, you need to handle updates to dimension data and support what are commonly referred to as slowly changing dimensions (SCDs).
Types of slowly changing dimension
There are multiple kinds of slowly changing dimension, of which three are commonly implemented:
Type 0
Type 0 dimension data can't be changed. Any attempted changes fail.
DateKey | DateAltKey | Day | Month | Year |
---|---|---|---|---|
20230101 | 01-01-2023 | Sunday | January | 2023 |
Type 1
In type 1 dimensions, the dimension record is updated in-place. Changes made to an existing dimension row apply to all previously loaded facts related to the dimension.
StoreKey | StoreAltKey | StoreName |
---|---|---|
123 | EH199J |
Type 2
In a type 2 dimension, a change to a dimension results in a new dimension row. Existing rows for previous versions of the dimension are retained for historical fact analysis and the new row is applied to future fact table entries.
CustomerKey | CustomerAltKey | Name | Address | City | DateFrom | DateTo | IsCurrent |
---|---|---|---|---|---|---|---|
1211 | jo@contoso.com | Jo Smith | 999 Main St | Seattle | 20190101 | 20230105 | False |
2996 | jo@contoso.com | Jo Smith | 1234 9th Ave | Boston | 20230106 | True |
Note
Type 2 dimensions often include columns to track the effective time periods for each version of an entity, and/or a flag to indicate which row represents the current version of the entity. If you’re using an incrementing surrogate key and you only need to track the most recently added version of an entity, then you may not need these columns; but before making that decision, consider how you’ll look up the appropriate version of an entity when a new fact is entered based on the time at which the event the fact relates to occurred.
Combining INSERT and UPDATE statements
Logic to implement Type 1 and Type 2 updates can be complex, and there are various techniques you can use. For example, you could use a combination of UPDATE
and INSERT
statements.
-- New Customers
INSERT INTO dbo.DimCustomer
SELECT stg.*
FROM dbo.StageCustomers AS stg
WHERE NOT EXISTS
(SELECT * FROM dbo.DimCustomer AS dim
WHERE dim.CustomerAltKey = stg.CustNo)
-- Type 1 updates (name)
UPDATE dbo.DimCustomer
SET CustomerName = stg.CustomerName
FROM dbo.StageCustomers AS stg
WHERE dbo.DimCustomer.CustomerAltKey = stg.CustomerNo;
-- Type 2 updates (StreetAddress)
INSERT INTO dbo.DimCustomer
SELECT stg.*
FROM dbo.StageCustomers AS stg
JOIN dbo.DimCustomer AS dim
ON stg.CustNo = dim.CustomerAltKey
AND stg.StreetAddress <> dim.StreetAddress;
In the previous example, it's assumed that an incrementing surrogate key based on an IDENTITY
column identifies each row, and that the highest value surrogate key for a given alternate key indicates the most recent or "current" instance of the dimension entity associated with that alternate key. In practice, many data warehouse designers include a Boolean column to indicate the current active instance of a changing dimension or use DateTime fields to indicate the active time periods for each version of the dimension instance. With these approaches, the logic for a type 2 change must include an INSERT
of the new dimension row and an UPDATE
to mark the current row as inactive.
Using a MERGE statement
As an alternative to using multiple INSERT
and UPDATE
statements, you can use a single MERGE
statement to perform an "upsert" operation to insert new records and update existing ones.
MERGE dbo.DimProduct AS tgt
USING (SELECT * FROM dbo.StageProducts) AS src
ON src.ProductID = tgt.ProductBusinessKey
WHEN MATCHED THEN
-- Type 1 updates
UPDATE SET
tgt.ProductName = src.ProductName,
tgt.ProductCategory = src.ProductCategory,
tgt.Color = src.Color,
tgt.Size = src.Size,
tgt.ListPrice = src.ListPrice,
tgt.Discontinued = src.Discontinued
WHEN NOT MATCHED THEN
-- New products
INSERT VALUES
(src.ProductID,
src.ProductName,
src.ProductCategory,
src.Color,
src.Size,
src.ListPrice,
src.Discontinued);
Note
For more information about the MERGE statement, see the MERGE documentation for Azure Synapse Analytics.