Remove duplicates in each table for data unification

The Deduplication rules step of unification finds and removes duplicate records for a customer from a source table so that each customer is represented by a single row in each table. Each table is deduplicated separately using rules to identify the records for a given customer.

Rules are processed in order. After all rules have been run on all the records in a table, match groups that share a common row are combined into a single match group.

Define deduplication rules

A good rule identifies a unique customer. Consider your data. It might be enough to identify customers based on a field such as email. However, if you want to differentiate customers that share an email, you might choose to have a rule with two conditions, matching on Email + FirstName. For more information, see Deduplication best practices.

  1. On the Deduplication rules page, select a table and select Add rule to define the deduplication rules.

    Tip

    If you enriched tables on the data source level to help improve your unification results, select Use enriched tables at the top of the page. For more information, see Enrichment for data sources.

    Screenshot of Deduplication rules page with table highlighted and Add rule displayed

  2. In the Add rule pane, enter the following information:

    • Select field: Choose from the list of available fields from the table that you want to check for duplicates. Choose fields that are likely unique for every single customer. For example, an email address, or the combination of name, city, and phone number.

      • Normalize: Select normalization options for the column. Normalization only impacts the matching step, and doesn't change the data.

        Normalization Examples
        Numerals Converts many Unicode symbols that represent numbers to simple numbers.
        Examples: ❽ and Ⅷ are both normalized to the number 8.
        Note: The symbols must be encoded in Unicode Point Format.
        Symbols Removes symbols and special characters.
        Examples: !?"#$%&'( )+,.-/:;<=>@^~{}`[ ]
        Text to lower case Converts upper case characters to lower case. 
        Example: "THIS Is aN EXamplE" is converted to "this is an example"
        Type – Phone Converts phones in various formats to digits, and accounts for variations in how country codes and extensions are presented. Symbols and whitespace are ignored. Leading '0' digits in country codes are ignored, matching +1 and +01. Extensions signified by a lettered prefix are ignored (X 123). The normalized country code is significant, so a phone with a country code won’t match a phone without a country code.
        Example: +01 425.555.1212 matches 1 (425) 555-1212
        +01 425.555.1212 won't match (425) 555-1212
        Type - Name Converts over 500 common name variations and titles. 
        Examples: "debby" -> "deborah" "prof" and "professor" -> "Prof."
        Type - Address Converts common parts of addresses 
        Examples: "street" -> "st" and "northwest" -> "nw"
        Type - Organization Removes around 50 company name "noise words" such as "co," "corp," "corporation," and "ltd."
        Unicode to ASCII Converts Unicode characters to their ASCII letter equivalent 
        Example: The characters 'à,' 'á,' 'â,' 'À,' 'Á,' 'Â,' 'Ã,' 'Ä,' 'Ⓐ,' and 'A' are all converted to 'a.'
        Whitespace Removes all white space
        Alias mapping Allows you to upload a custom list of string pairs that can then be used to indicate strings that should always be considered an exact match. 
        Use alias mapping when you have specific data examples you think should match, and aren't matched using one of the other normalization patterns. 
        Example: Scott and Scooter, or MSFT and Microsoft.
        Custom bypass Allows you to upload a custom list of strings that can then be used to indicate strings that should never be matched.
        Custom bypass is useful when you have data with common values that should be ignored, such as a dummy phone number or a dummy email. 
        Example: Never match the phone 555-1212, or test@contoso.com
    • Precision: Set the level of precision. Precision is used for exact match and fuzzy matching, and determines how close two strings need to be in order to be considered a match.

      • Basic: Choose from Low (30%), Medium (60%), High (80%), and Exact (100%). Select Exact to only match records that match 100 percent.
      • Custom: Set a percentage that records need to match. The system only matches records passing this threshold.
    • Name: Name for the rule.

      Screenshot of Add rule pane for removing duplicates.

  3. Optionally, select Add > Add condition to add more conditions to the rule. Conditions are connected with a logical AND operator and thus only executed if all conditions are met.

  4. Optionally, Add > Add exception to add exceptions to the rule. Exceptions are used to address rare cases of false positives and false negatives.

  5. Select Done to create the rule.

  6. Optionally, add more rules.

Select merge preferences

When rules are run and duplicate records are identified for a customer, a "winner row" is selected based on the merge policy. The winner row represents the customer in the next unification step that matches records between tables. Data in the non-winner (“alternate") rows is used in the Matching rules unification step to match records from other tables to the winner row. This approach improves matching results by allowing information like previous phone numbers to help identify matching records. The winner row can be configured to be the most filled, most recent, or least recent of the duplicate records found.

  1. Select a table and then Edit merge preferences. The Merge preferences pane appears.

  2. Choose one of three options to determine which record to keep if a duplicate is found:

    • Most filled: Identifies the record with most populated columns as the winner record. It's the default merge option.
    • Most recent: Identifies the winner record based on the most recency. Requires a date or a numeric field to define the recency.
    • Least recent: Identifies the winner record based on the least recency. Requires a date or a numeric field to define the recency.

    If there's a tie, the winner record is the one with the MAX(PK) or the larger primary key value.

  3. Optionally, to define merge preferences on individual columns of a table, select Advanced at the bottom of the pane. For example, you can choose to keep the most recent email AND the most complete address from different records. Expand the table to see all its columns and define which option to use for individual columns. If you choose a recency-based option, you also need to specify a date/time field that defines the recency.

    Advanced merge preferences pane showing recent email and complete address

  4. Select Done to apply your merge preferences.

After defining the deduplication rules and merge preferences, select Next.