Data Quality Services: Cleansing Complex Data Using Composite Domains
This article provides information on how to cleanse complex source data using composite domains in DQS. How do you cleanse your data if you have multiple delimited fields in a single column or have data in delimited columns in a plain text file. For example, consider the following two2 sets of sample data that contain the list of US States and Capitals (purposely introduced errors in data for this tutorial):
The data is not correct in the above sample. For example Arizona, California, Indiana, and New Jersey are spelled incorrectly; Richmond and New York are not the capital cities for the Colorado and Washington states respectively.
Since the source data is in a plain text file, we will use the DQS Cleansing transformation in SSIS to cleanse this data.
Create a Knowledge Base Containing Composite Domains
We will leverage the default database, DQS Data, in this example to cleanse the sample data.
- Run Data Quality Client.
- In the Data Quality Client home screen, click New Knowledge Base.
- Type State&Capital as the name for the knowledge base, and use DQS Data as the base for the creation of new knowledge base.
- Select the Domain Management activity, and click Next.
- In the Domain Management screen, you will see all the domains in the left pane that are available in the DQS Data knowledge base. Click the create a composite domain icon (), specify StatesAndCapitals as the composite domain name, and add the US - State and US - Places domains. Click OK.
- Next, in the CD Properties tab of the newly created composite domain, select Comma as the delimiter.
- Next, in the CD Rules tab, create some cross-domain rules that corrects data in the US Places domain based on the values in the US- State domain. For example, to correct the value for the "Colorado:Richmond" record, create the following rule: IF US - State value is equal to Colorado THEN US-Places value is equal to Denver.
- Click Finish, and then click Publish to publish the knowledge base.
You now have a State&Capital knowledge base that contains the StatesAndCapitals composite domain containing the following two domains: US - State and US - Places.
Scenario 1: Cleansing Delimited Values in Single Column
In this case, we will cleanse the comma delimited values as mentioned earlier in the sameple data under Scenario 1. When you have a single column with delimited values, you must map the source column directly to the composite domain.
- Use SQL Server Data Tools (SSDT) to create an Integration Services project.
- Add a new data flow task to the package.
- Add a Flat File Source to the data flow to read data from the .txt file. Next, double-click Flat File Source to open the editor.
- On the Connection Manager page, specify a connection to the file by clicking New, and then browsing and selecting the .txt file.
- Click OK to close the Flat File Source editor.
- Add the DQS Cleansing transformation, and connect the output of Flat File Source to this. Next, double-click the DQS Cleansing transformation to open the editor.
- On the Connection Manager tab, connect to your Data Quality Server, and then select the knowledge base that you created earlier in DQS.
- On the Mapping tab, map the source column directly to the composite domain created earlier.
- Click OK to close the DQS Cleansing editor.
- OPTIONAL: You can choose to split the delimited values cleansed by DQS into two columns. To do so, you add a Derived Column transformation to the flow. In this case, we will split the cleansed values in the State&Capital_Output column into StatesCleansedOutput and CapitalsCleansedOutput columns.
Connect the output of the DQS Cleansing transformation to the Derived Column transformation.
Double-click the Derived Column transformation to open the editor.
Add the following values for the StatesCleansedOutput derived column:
- Derived Column Name: StatesCleansedOutput
- Derived Column: <add as new column>
- Expression: TOKEN([State&Capital_Output],",",1)
Add the following values for the CapitalsCleansedOutput derived column:
- Derived Column Name: CapitalsCleansedOutput
- Derived Column: <add as new column>
- Expression: SUBSTRING([State&Capital_Output],(FINDSTRING([State&Capital_Output],",",1) + 1),50)
- Next, add an OLE DB Destination component to send the output to a SQL database, In this case, we will output the values to a new table in the DQS_STAGING_DATA db. Your data flow will look like the following:
- Run the package. After the package has completed successfully, verify the cleansed output in the newly created table in the DQS_STAGING_DATA db:
Scenario 2: Cleansing Delimited Values in Delimited Columns
When you have delimited data columns to be cleansed, you must use the delimter to extract the individual columns into SSIS, and then map the extracted input columns with the individual columns within the composite domain.
- Use SQL Server Data Tools (SSDT) to create an Integration Services project.
- Add a new data flow task to the package.
- Add a Flat File Source to the data flow to read data from the .txt file. Next, double-click Flat File Source to open the editor.
- On the Connection Manager page, specify a connection to the file by clicking New, and then browsing and selecting the .txt file.
- On the Columns page, ensure that comma is selected in the Column delimiter field.
NOTE: In case your source data has any other delimiter, ensure that the appropriate delimiter is selected. - Click OK to close the Flat File Source editor.
- Add the DQS Cleansing transformation, and connect the output of Flat File Source to this. Next, double-click the DQS Cleansing transformation to open the editor.
- On the Connection Manager tab, connect to your Data Quality Server, and then select the knowledge base that you created earlier in DQS.
- On the Mapping tab, map the source columns to individual columns within the composite domain: US - State and US - Places.
- Click OK to close the DQS Cleansing editor.
- Add the OLE DB Destination component to send the output to a SQL database, In this case, we will output the values to a new table in the DQS_STAGING_DATA db. Your data flow will look like the following:
- Run the package. After the package has completed successfully, verify the cleansed output in the newly created table in the DQS_STAGING_DATA db:
See Also
- [[articles:Using DQS: How to find blank/empty values in your data source]]
- DQS Resources on TechNet Wiki
- Map Columns to Composite Domains