다음을 통해 공유


SSIS: Using A Lookup Component

There are many suggestions for loading a data warehouse using SQL Server integration Services (SSIS). Once you get started, you find a pattern to repeat for facts and dimensions. This might be because the consistency Dimensional Modeling designs for fact and dimension tables. The SSIS component Lookup provides options to divert handling of non-conformed data or you can use your T-SQL skills up front where a diversion is not necessary.

The most common use of a Lookup component is to find the Dimension surrogate key for one or more rows in a fact table. The Customer Key from the Customer dimension will be used in the Sales Fact table in order to join to customer hierarchies and attributes like the following T-SQL:

SELECT c.CommuteDistance, Sum([OrderQuantity]) AS OrdQty
 
    , Sum([SalesAmount]) AS SalesAmt
 
  FROM [dbo].[FactInternetSales] fis
 
    INNER JOIN dbo.DimCustomer c ON c.CustomerKey = fis.CustomerKey
 
  GROUP BY c.CommuteDistance

CommuteDistance

OrdQty

SalesAmt

5-10 Miles

10615

4893148.0413

2-5 Miles

10084

4965514.4362

1-2 Miles

10170

4541608.1498

0-1 Miles

21307

11207592.2248

10+ Miles

8222

3750814.3686

Once you get the source data for the Sales fact in an SSIS package, the Lookup component can be used to get the surrogate from the customer dimension. You will see in Figure 1, there is an LU – Customer Key lookup in the flow of populating this fact table.

 

Figure 1: Loading the FactInternetSales from AdventureWorks

The options for Lookups vary based on some properties. In Figure 2, we can see some of these options.

Figure 2: General Lookup properties

Most lookups for dimensions can use the Full Cache mode because there is not a lot of dimension rows. But, if you come across large dimensions in a data warehouse, some time should be spent seeing if a Partial Cache or No Cache will help with loading data as well as a Cache connection manager. The Full Cache option will load all the data from the dimension table into memory for lookups before any flow starts. This is why it is good not to SELECT all the columns in the dimension table for a lookup.

Figure 3: Connection properties

Figure 3 shows the connection using a T-SQL statement to only get the CustomerKey and CustomerAlternateKey from the DimCustomer table. The CustomerAlternateKey is the business key that matches the customer to rows in the source data for FactInternetSales. If you use the option “Use a table or a view”, the generated query will select all columns from the source.

Let’s go back to the General properties to look at “Specify how to handle rows with no matching entries”. Figure 4 shows the different options available.

Figure 4: General properties

The “Fail component” option will stop processing of the import if no match is found in the lookup table. This is not a good option for loading data into a fact table. Most systems would want the import to continue for the customer surrogate keys that are found.

So, the two other options I use most are “Redirect rows to no match output” or “Ignore Failure”. The redirect rows… option can be used to stage the rows that have problems and get someone to look at them before the next import. The Ignore Failure option can be used to substitute a Null value in the CustomerKey column.

  1. Ignore Failure – Null will replace lookup values selected. Those rows with no match are streamed to the normal flow in the package with the Null value in match columns selected.
  2. Redirect rows to error output – red line output will show a failure but can pipe the data to any component. Those rows with no match are not streamed to the normal flow in the package.
  3. Fail component – the package stops with a failure if no match exists if all match there is no failure
  4. Redirect rows to no match output – output can be piped to another component and processing continues. Those rows with no match are not streamed to the normal flow in the package.

Now, this will not work if importing to a fact table with foreign keys and a Not Null property on the CustomerKey column, but what I do most often is change the Null value to -1 and have an entry in the DimCustomer table for Unknown possibilities like this example. 

CustomerKey

GeographyKey

CustomerAlternateKey

FirstName

LastName

MaritalStatus

Gender

-1

-1

-1

Unknown

Unknown

U

U

Figure 5: Derived Column component

Figure 5 shows what I can do to convert the Null value in the CustomerKey column to -1. This way with the -1 row in the customer dimension table, we will be able to insert the row into the fact table without having to process the rows outside of this package.

http://i1.wp.com/thomas-leblanc.com/wp-content/uploads/2016/07/clip_image013.jpg

Figure 6: Selected Lookup Columns and names

Figure 6 shows the Columns tab Properties. Here is where we match the Source fact column for the customer to the Business Key in the Customer Dimension to find the math. We also select the CustomerKey from the Available Lookup Columns in order to get the surrogate key for the fact table. The selected CustomerKey is where the Null value will be located if no match if found.

This process also indicates the another process needs to be implemented to determine why existing fact rows have business keys for the customer source table that are not imported into the Customer dimension. This one tip can go a long way and patterns developed to help with this situation. It also indicates the power and flexibility SSIS provides data warehouse architects in implementing an ETL solution for its business.