Udostępnij za pośrednictwem


Slowly Changing Dimension using SSIS

Surrogate Keys

Also known as meaningless keys, substitute keys, non-natural keys, or artificial keys. A surrogate key is a unique value, usually an integer, assigned to each row in the dimension. This surrogate key becomes the primary key of the dimension table and is used to join the dimension to the associated foreign key field in the fact table.

It is useful because the natural primary key (i.e. Customer Number in Customer table) can change and this makes updates more difficult.
Example:

On Jan 1 2010, Emp A belongs to Dept1, whatever sales made by this employee added to Dept1 but on June 1 2010 Emp A moved to Dept2. All his new sales contribution should be added to Dept2 from that day onwards and the old one should belong to the Dept2.

If let's say in this case we have used business key (Primary key as stated in RDBMS) within data warehouse everything would be allocated to Dept2 even what actually belongs to Dept1

If you use surrogate keys you could create on the 1st June a new record for the Employee 'A' in your Employee Dimension with a new surrogate key.
This way in your fact table you have your old data (before June) with the SID of the Employee 'E1' + 'Dept1' All new data (after June) would take the SID of the employee 'E1' + 'Dept2'
Key Points:

A surrogate key is a unique value, usually an integer, assigned to each row in the dimension. This surrogate key becomes the primary key of the dimension table and is used to join the dimension to the associated foreign key field in the fact table.

The ability to track changes in dimension attributes over time is reason enough to implement surrogate keys.

Handling Slowly Changing Dimension through SSIS:

SCD and its types:

Slowly Changing Dimensions (SCD) are dimensions that have data that slowly changes.

For example, you may have a Dimension in your database that tracks the sales records of your company's salespeople. Creating sales reports seems simple enough, until a salesperson is transferred from one regional office to another or from one manager to another manager. How do you record such a change in your sales Dimension?

You could sum or average the sales by salesperson, but if you use that to compare the performance of salesmen, that might give misleading information. If the salesperson that was transferred used to work in a hot market where sales were easy, and now works in a market where sales are infrequent, her totals will look much stronger than the other salespeople in her new region, even if they are just as good. Or you could create a second salesperson record and treat the transferred person as a new sales person, but that creates problems also.

Dealing with these issues involves SCD management methodologies

The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all. This is most appropriate when correcting certain types of data errors, such as the spelling of a name. (Assuming you won't ever need to know how it used to be misspelled in the past)

Another example would be of a database table that keeps supplier information.

clip_image002[4]

Now imagine that this supplier moves their headquarters to Illinois. The updated table would simply overwrite this record:

clip_image003[4]

The obvious disadvantage to this method of managing SCDs is that there is no historical record kept in the data warehouse. You can't tell if your suppliers are tending to move to the Midwest, for example. But an advantage to this is that these are very easy to maintain. Type 2

The Type 2 method tracks historical data by creating multiple records in the dimensional tables with separate keys. With Type 2, we have unlimited history preservation as a new record is inserted each time a change is made.

In the same example, if the supplier moves to Illinois, the table would look like this:

clip_image004[4]

Another popular method for tuple versioning is to add effective date columns.

clip_image005[4]

Null End_Date signifies current tuple version. In some cases, a standardized surrogate high date (e.g. 9999-12-31) may be used as an end date, so that the field can be included in an index.

Transactions that reference this Surrogate Key (Supplier_Key) are then permanently bound to these time slices defined by each row in the slowly changing dimension table. If there are retrospective changes made to the contents of the dimension, or if a new set of attributes are added to the dimension (for example a Sales Rep column) which have different effective dates to those already defined, then this can result in the existing transactions needing to be updated to reflect the new situation. This can be an expensive database operation, so Type 2 SCD are not a good choice if the dimensional model is subject to change.

The Type 3 method tracks changes using separate columns. Whereas Type 2 had unlimited history preservation, Type 3 has limited history preservation, as it's limited to the number of columns we designate for storing historical data. Where the original table structure in Type 1 and Type 2 was very similar, Type 3 will add additional columns to the tables:

clip_image006[4]

Note that this record cannot track all historical changes, such as when a supplier moves twice. Although, one possible version of this type is to create the field Previous_Supplier_State instead of Original_Supplier_State which will then track all historical changes.

Note: Type 3, keeps separate columns for both the old and new attribute values—sometimes called “alternate realities.” In our experience, Type 3 is less common because it involves changing the physical tables and is not very scalable.

SSIS has built in support for SCD (Sample Database is Attached)

Easy to get up and running

This is how Slowly Changing Dimension Wizard - Data Flow Looks like

clip_image008[4]

Demo of Using SCD through SSIS:

Source Table:

select * from scd.dbo.SourceEmployees

clip_image010[4]

Destination Table:

clip_image012[4]

Execute Package for the first time will give this, which is nothing but a copy of Source Table

clip_image013[4]

Now let us fire following statement.

update scd.dbo.SourceEmployees

set managerid=1002

where empname='G'

Execute Package and now you will find the difference, if you will notice here we get one more record for EmpBusinessID 1007 with EmpSurrogateID as 11 and EmpDate as Null and for old record EndDate inserted which signifies that this record is valid till 14th Sep 2010clip_image014[4]

Step of Implementing SCD from SSIS (Applies to SQL 2005 / 2008 and 2008 R2)

1) Create a New package.

2) Got to Data Tab and drop one OLE DB Source & Configure it for database ->Source Table

clip_image015[4]

clip_image016[4]

3) From Toolbox drag & drop in Data Flow, connect OLE DB Source created in step 2 to SCD

clip_image017[4]

4) Double Click on SCD will initiate SCD Wizard

5) Select a Dimension Table and Keys, here you need to specify the destination table, automatically mapping will be shown with matching col of source with destination

clip_image018[4]

6) Specify EmpBusinessID as Business Key & Click on next tab, this will allow you to specify the Change Type.

There are three kinds of Attributes:

Fixed Attribute - For which value is not changing

Changing Attribute - For which value will be over-written (Type 1)

Historical Attribute- For which value is saved as a new record. Previous values are saved in the records marked as outdated (Type 2)

clip_image019[4]

7) You can check or uncheck as per your requirement, I am un-checking these two

clip_image020[4]

8) Specify how you want record changed value, using single column or start / end date

clip_image021[4]

9) While loading data in Fact tables we usually see a scenario where the fact data is available but there is no corresponding business key in the related dimension.

In this case we choose multiple options to resolve the issue.

Ignore that fact

Insert the associated business key in dimension table and return the newly generated surrogate key from dimension table. And now store the data in Fact table with the surrogate key.

The second approach relates to a term called “Inferred members”. All the other attributes of that dimension will also be updated in next run of dimension load (usually nightly load).

This is al-together a lengthy topic for discussion and outside the scope of this session, for further information refers below mentioned links:-

SSIS approach to handle Inferred Members

Inferred Members Implementation Best Practices

For this demo, I am not checking Enable Inferred member support

clip_image022[4]

Click Next and Finish.

SCD.zip

Comments

  • Anonymous
    February 13, 2011
    Thanks for the screen shots :).

  • Anonymous
    March 27, 2011
    thanks....:-)

  • Anonymous
    March 27, 2011
    thanks

  • Anonymous
    November 08, 2011
    Awesome..

  • Anonymous
    April 16, 2012
    Thanks a lot

  • Anonymous
    April 26, 2012
    Hi All, Is it while migrating an SSIS Package from SQL Server 2008 R2 to SQL Server 2012 SCD Configurations do not get carried forward and we have to reconfigure the settings in the upgraded server?? One of our Client is having this issue. Any insights of this is appreciated! Thanks in advance.

  • Anonymous
    April 29, 2012
    i havent tested this yet...if its not allowing you to carry forward settings then i think you need to reconfigure it on upgraded server. I will test this behavior and get back to you.

  • Anonymous
    May 08, 2012
    Hi Sneha, SSIS 2012 introduced a new design surface. When you upgrade your packages, the layout information is lost. Unfortunately, the SCD Wizard stores all of its settings in the layout… this means that after upgrading to 2012, you will need to re-enter the settings the first time you go through the wizard. --Debs

  • Anonymous
    October 03, 2012
    I have implemented as you have show in this article. What would be the next step. Lets say I go ahead and want to change MaritalStatus for employee 1001 and go and change SQL syntax in OLE DB Command 1 where we have  UPDATE [dbo].[DestinationEmployees] SET [MaratialStatus] = '' WHERE [EmpBusinessID] = '' AND [EndDate] IS NULL. I am new so asking these question. Infact I just got started after reading your article.

  • Anonymous
    April 15, 2013
    I have hire date,start date column in my table and i want my start date column values should be same as the hiredate column values .when i select system:@starttime as column value in historical attribute option in scd transforamtion it just take the System:start time rather than taking hiredate value.Please advise on this.

  • Anonymous
    May 13, 2013
    can u give the scd type 3 screen shots i didn't get newly updated value comes into new column in dimtable

  • Anonymous
    March 05, 2014
    It is excellent. easy to understood everyone

  • Anonymous
    March 15, 2014
    Screenshots given here are excellent

  • Anonymous
    May 14, 2015
    Its very Understandable. Good work!!!!

  • Anonymous
    May 29, 2015
    I follow that step by step, but the result are keep appends to the destination even the status was not changed. Which step do I missed?

  • Anonymous
    June 04, 2015
    hi Karan, May be I missed out something, but I am really not getting how to populate surrogate key column.

  • Anonymous
    July 16, 2015
    Its very clear....thanks for sharing...........