Remove duplicates in each table for data unification

Article
01/08/2025

Deduplication finds and removes duplicate records for a customer from a source table so that each customer is represented by a single row in each table. Each table is deduplicated separately using rules to identify the records for a given customer.

Each deduplication rule is run against every row. If your first rule matches rows 1 and 2, and rule 2 matches rows 2 and 3, then rows 1, 2, and 3 are matched. When matched rows are found, a winner row is selected to represent that customer based on the Merge preferences (Most filled, Most recent, or Least recent). Use the Advanced option to create a winner row by selecting fields from the various matched rows, such as the most recent email, but the most filled address.

Customer Insights - Data automatically performs the following actions:

Deduplicate records with the same primary key value, selecting the first row in the data set as the winner.
Deduplicate records using the Matching Rules defined for the table when matching rows between tables.

Define deduplication rules

A good rule identifies a unique customer. Consider your data. It might be enough to identify customers based on a field such as email. However, if you want to differentiate customers that share an email, you might choose to have a rule with two conditions, matching on Email + FirstName. For more information, see Deduplication best practices.

On the Deduplication rules page, select a table and select Add rule to define the deduplication rules.

Tip

If you enriched tables on the data source level to help improve your unification results, select Use enriched tables at the top of the page. For more information, see Enrichment for data sources.

In the Add rule pane, enter the following information:

Select field: Choose from the list of available fields from the table that you want to check for duplicates. Choose fields that are likely unique for every single customer. For example, an email address, or the combination of name, city, and phone number.

Normalize: Select normalization options for the column. Normalization only impacts the matching step, and doesn't change the data.

Normalization	Examples
Numerals	Converts many Unicode symbols that represent numbers to simple numbers. Examples: ❽ and Ⅷ are both normalized to the number 8. Note: The symbols must be encoded in Unicode Point Format.
Symbols	Removes symbols and special characters. Examples: !?"#$%&'( )+,.-/:;<=>@^~{}`[ ]
Text to lower case	Converts upper case characters to lower case. Example: "THIS Is aN EXamplE" is converted to "this is an example"
Type – Phone	Converts phones in various formats to digits, and accounts for variations in how country codes and extensions are presented. Symbols and whitespace are ignored. Leading '0' digits in country codes are ignored, matching +1 and +01. Extensions signified by a lettered prefix are ignored (X 123). The normalized country code is significant, so a phone with a country code won’t match a phone without a country code. Example: +01 425.555.1212 matches 1 (425) 555-1212 +01 425.555.1212 won't match (425) 555-1212
Type - Name	Converts over 500 common name variations and titles. Examples: "debby" -> "deborah" "prof" and "professor" -> "Prof."
Type - Address	Converts common parts of addresses Examples: "street" -> "st" and "northwest" -> "nw"
Type - Organization	Removes around 50 company name "noise words" such as "co," "corp," "corporation," and "ltd."
Unicode to ASCII	Converts Unicode characters to their ASCII letter equivalent Example: The characters 'à,' 'á,' 'â,' 'À,' 'Á,' 'Â,' 'Ã,' 'Ä,' 'Ⓐ,' and 'Ａ' are all converted to 'a.'
Whitespace	Removes all white space
Alias mapping	Allows you to upload a custom list of string pairs that can then be used to indicate strings that should always be considered an exact match. Use alias mapping when you have specific data examples you think should match, and aren't matched using one of the other normalization patterns. Example: Scott and Scooter, or MSFT and Microsoft.
Custom bypass	Allows you to upload a custom list of strings that can then be used to indicate strings that should never be matched. Custom bypass is useful when you have data with common values that should be ignored, such as a dummy phone number or a dummy email. Example: Never match the phone 555-1212, or test@contoso.com

Precision: Set the level of precision. Precision is used for exact match and fuzzy matching, and determines how close two strings need to be in order to be considered a match.
- Basic: Choose from Low (30%), Medium (60%), High (80%), and Exact (100%). Select Exact to only match records that match 100 percent.
- Custom: Set a percentage that records need to match. The system only matches records passing this threshold.
Name: Name for the rule.

Optionally, select Add > Add condition to add more conditions to the rule. Conditions are connected with a logical AND operator and thus only executed if all conditions are met.
Optionally, Add > Add exception to add exceptions to the rule. Exceptions are used to address rare cases of false positives and false negatives.
Select Done to create the rule.
Optionally, add more rules.

Select merge preferences

When rules are run and duplicate records are identified for a customer, a "winner row" is selected based on the merge policy. The winner row represents the customer in the next unification step that matches records between tables. Data in the non-winner (“alternate") rows is used in the Matching rules unification step to match records from other tables to the winner row. This approach improves matching results by allowing information like previous phone numbers to help identify matching records. The winner row can be configured to be the most filled, most recent, or least recent of the duplicate records found.

Select a table and then Edit merge preferences. The Merge preferences pane appears.
Choose one of three options to determine which record to keep if a duplicate is found:
- Most filled: Identifies the record with most populated columns as the winner record. It's the default merge option.
- Most recent: Identifies the winner record based on the most recency. Requires a date or a numeric field to define the recency.
- Least recent: Identifies the winner record based on the least recency. Requires a date or a numeric field to define the recency.
If there's a tie, the winner record is the one with the MAX(PK) or the larger primary key value.
Optionally, to define merge preferences on individual columns of a table, select Advanced at the bottom of the pane. For example, you can choose to keep the most recent email AND the most complete address from different records. Expand the table to see all its columns and define which option to use for individual columns. If you choose a recency-based option, you also need to specify a date/time field that defines the recency.
Select Done to apply your merge preferences.

After defining the deduplication rules and merge preferences, select Next.

Next step for a single table: View unified data

Next step for multiple tables: Define matching rules

Share via

Remove duplicates in each table for data unification

Define deduplication rules

Select merge preferences

Additional resources