Share via


Using DENSE_RANK with SOUNDEX to Detect Imperfect Duplicate Values in Character Data

The DENSE_RANK function is a window function that provides an increasing integer value based on repetition in the ordering clause of the window. The ranks can be broken down into groups using the partition by option in the over clause of the window function.

The SOUNDEX function returns an alphanumeric code indicating a consonant sound pattern (w,h,y are treated as vowels for this purpose). The pattern begins with the first letter of the character string irrespective of its being a consonant or vowel. The pattern that follows the first letter is a number each digit representing which group of consonants each detected consonant belongs to.

By building a table valued expression that includes a soundex against the target character column and running a dense rank against it you have seeds for keying columns on the basis of both a shared soundex (the dense rank) and the permutations of character values sharing a soundex value (the soundex and the key off of the source).

Consider the following:

IF OBJECT_ID ('tempdb..#t_soundexrank_example_source') IS  NOT NULL  DROP TABLE  #t_soundexrank_example_source
    CREATE TABLE  #t_soundexrank_example_source (c_rownum INT IDENTITY(1,1), c_charval VARCHAR (25));
IF OBJECT_ID ('t_soundexrank_example_dest') IS  NOT NULL  DROP TABLE  t_soundexrank_example_dest;
    CREATE TABLE  t_soundexrank_example_dest
        (c_rownum INT  IDENTITY(1,1), c_charval VARCHAR (25), c_soundex CHAR(4), c_startchar CHAR(1), c_partdrank INT, c_partrow INT);
  
WITH CTE1 AS
(SELECT c_rownum, c_charval, SOUNDEX(c_charval) AS c_soundex, LEFT(c_charval,1) AS  c_startchar FROM   #t_soundexrank_example_source) 
 
INSERT INTO  t_soundexrank_example_dest
    SELECT c_charval, c_soundex, c_startchar, 
        DENSE_RANK() OVER (PARTITION BY  c_startchar ORDER  BY c_soundex) AS c_partdrank,
        ROW_NUMBER() OVER (PARTITION BY  c_startchar ORDER  BY c_soundex) AS c_partrow
    FROM CTE1
     
    DECLARE @t_charlist TABLE (c_startchar char(1));
  
 DECLARE @v_char CHAR(1);
 DECLARE @v_idxname VARCHAR(15);
INSERT INTO  @t_charlist SELECT  DISTINCT (c_startchar) FROM t_soundexrank_example_dest;
  
  
  
WHILE EXISTS (SELECT c_startchar FROM @t_charlist)
  
BEGIN
  
SET @v_char =
  
(SELECT TOP  1 c_startchar FROM  @t_charlist);
 SET @v_idxname = 'ix_charlookup_' + @v_char;
 EXEC (
'CREATE INDEX ' + @v_idxname + ' ON
  
#t_soundex_example_dest (c_soundex)
  
INCLUDE (c_partdrank, c_partrow, c_rownum) 
 
WHERE c_startchar = @v_char;'
);
  
  
  
DELETE TOP(1) FROM  @t_charlist;
  
END
; 
  
 GO
CREATE FUNCTION  f_soundexref
  
(
  
@p_charlookup VARCHAR(25)
  
)
  
RETURNS TABLE
RETURN
 
SELECT c_rownum, c_partdrank
  
FROM t_soundexrank_example_dest
  
WHERE LEFT(SOUNDEX(@p_charlookup),1) = c_startchar
  
AND SOUNDEX(@p_charlookup) = c_soundex
 
;

By populating into the source temporary table the character valued list you wish to examine you can move through the CTE into the destination temporary table values that represent the soundex scoring along with values that provide indications of repetition with in the group defined by the first character of the strings being examined. Extended to a more robust case this would allow for combined function or procedure and indexing strategies that could be used to speed up the querying of fuzzy matched strings.

The preceding code should generate indexes for each of the starting characters. This does not take into consideration the distributions of such characters in naturally occurring datasets. Also in the code below is a function that should provide an effective utilization of these indexes to search for the DENSE_RANK and row number pairs (for mapping to the source) for all values sharing a SOUNDEX pattern.

*

How the Census Bureau uses Soundex
*