Share via


SQL Server: Implementation of N-Gram Search Index

https://msdnshared.blob.core.windows.net/media/2016/05/0640_NinjaAwardTinyGold.pngGold Award Winner


All current Web search engines have only one input field for the search term and always deliver a result, no matter what you type. This makes us developer’s life difficult, because now end users expect the same in Line-of-Business (LOB) applications like ERP, for example, when searching for an address or an item. This article shows how to implement a similar search function with the Microsoft SQL Server.

Introducing

The built-In function “Full-Text Index” of the Microsoft SQL Server provides an easy way to index all the words in a database, as well as to query them in a flexible and performant way. But there are a few limitations that cause it's not suitable for a similar web search functionality.

  • no suffix (ends with) search possible like *part

  • no typing error tolerance with respect to search term and/or base data

  • only conditional indexing record delimiter and other Noise Word

    An alternative is to implement a own indexing and in this case the so-called N-Gram Index. In a n-gram index all N consecutive characters are counted as an unit (fragment) and indexed, where N can be any numeric value from 1-x. An example with "Hello World" and a 3-grams results in the units
    "HEL", "ELL", "LLO", "LO ", "O W", " WO", "WOR", "ORL", "RLD"

    The higher the N value, the larger the fragment and the more precise is the search, the less the N value the higher the typing error tolerance,


Normalization

The aim is to implement a typing error tolerant and accent insensitive search of addresses, for this we have to provide normalized data base for the n-gram index accordingly. This requires a few considerations ahead.

Umlauts

German umlauts like Ü can be replaced by UE, so that we can find with the search phrase "Müller" also the name "Mueller". In other languages, there are special characters that come from the Latin alphabet and include indicators for emphasis, as the Polish Ł. Such characters could be also replace.

Delimiter/Punctuation Mark

For full-text index delimiters are counted as the end of words and are not indexed. There are, however, company and product names that contain delimiter and therefore they should be in a certain way searchable. Here you have to decide by the data if the delimiters are retained or replaced.

In this example, all delimiters are replaced by a single character for simplicity. So the structure of the text is retained, but for indexing and searching just one sign is used.

Special Character

Again, you have to decide by database whether to replace special characters or not. Good example is the @ sign in e-mail addresses; one would like to specifically search only for the e-mail or also after the parts of the e-mail address like domain name in another data areas.

This means in this example only

  • digits

  • Western capital letters

  • Exclamation mark ! as a general substitute for all delimiter

  • hash mark # as a general substitute for all other special characters
    are approved as signs of the normalized text.To maintain a central list of allowable characters, a function is created that returns them; global variables are not available in MS SQL Server.

    CREATE FUNCTION  dbo.fnGetNGramValidChars()
        RETURNS nvarchar(39)
    AS
    BEGIN
        -- Approved Signs for N-Gram Index
        -- Western capital letters + Space
        -- Digits
        -- Delimiter sign: !
        -- Special character sign: #
        RETURN N'ABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789!#';
     
    END;
    

    All other characters are replaced accordingly. For this another function is created, which normalizes the supplied text. It is later used for both indexing the data and for the normalization of the search term.
    Here the shortened version:

    CREATE FUNCTION  dbo.fnNormalizeDatastring
        (@string nvarchar(max))
        RETURNS nvarchar(max)
    AS
    BEGIN
        -- Get all valid characters
        DECLARE @validChars nvarchar(50) = (SELECT dbo.dbo.fnGetNGramValidChars() AS Result);
        -- Only capital characters
        SET @string = UPPER(@string);
        -- Replace umlauts
        SET @string = REPLACE(@string, N'Ä', N'AE');
        SET @string = REPLACE(@string, N'Æ', N'AE');
        -- usw. ...
     
        -- Registered Marks
        SET @string = REPLACE(@string, N'℅', N'C#O');
        SET @string = REPLACE(@string, N'™', N'TM');
        -- usw. ...
     
        -- Replace delimiter by !
        SET @string = REPLACE(@string, N'.', N'!');
        SET @string = REPLACE(@string, N',', N'!');
        -- usw. ...
     
        -- Replace special characters by hash mark.
        SET @string = REPLACE(@string, N'+', N'#');
        SET @string = REPLACE(@string, N'-', N'#');
        -- usw. ...
     
        -- Replace all other invalid characters by space.
        DECLARE @loop int = 1;
        DECLARE @len int = LEN(@string);
        DECLARE @invalidChar nchar(1);
     
        WHILE @loop <= @len
        BEGIN
            SET @invalidChar = SUBSTRING(@string, @loop, 1);
            IF CHARINDEX(@invalidChar, @validChars) <= 0
            BEGIN
                SET @string = REPLACE(@string, @invalidChar, N' ');
            END;
     
            SET @loop = @loop + 1;
        END;
     
        -- Remove multiple spaces
        DECLARE @goOn tinyint = 1;
        DECLARE @oldLen int;
     
        WHILE @goOn <> 0
        BEGIN
            SET @oldLen = LEN(@string);
            SET @string = REPLACE(@string, N'    ', N' ');
            SET @string = REPLACE(@string, N'  ', N' ');
     
            IF @oldLen = LEN(@string)
            BEGIN
                SET @goOn = 0;
            END;
        END;
     
        RETURN @string;
    END;
    

    Note: Such elaborate string operations are very expensive in SQL Server. It would be conceivable to optimize implementation as CLR Assembly.


Indexer

The index can be easily calculated on the number of allowed characters, the position of the character and the N-gram value. For the fragment "BDHZ" the index for 4-Gram is calculated by way of example as below (position is 0-based):
B = Position  1 * (39^0) = 1
D = Position  4 * (39^1) = 156
H = Position  7 * (39^2) = 10,647
Z = Position 25 * (39^3) = 1,482,975
Sum = 1,493,779 less integer lower limit -2,147,483,648 => index value -2,145,989,869.

Take the example of a 4-gram index with the above defined 39 characters, then we can get 39 ^ 4 = 2,313,441 different character combinations (fragments).
Let we allow negative values, then the value range of the int data type fits up to the 6-gram index values. The number of possible index values for each n-gram:
N-Gram Formular Value Data Typ
2-Gram = 39^2 = 1,521 => smallint
3-Gram = 39^3 = 59,319 => smallint
4-Gram = 39^4 = 2,313,441 => int
5-Gram = 39^5 = 90,224,199 => int
6-Gram = 39^6 = 3,518,743,761 => int
7-Gram = 39^7 = 137,231,006,679 => bigint

The index calculation function in T-SQL provide as follows is being assumed, that the fragment is passed normalized.

CREATE FUNCTION  dbo.fnCalculate4GramIndex
    (@fragment nchar(4))
    RETURNS int
AS
BEGIN
    -- Get all valid characters
    DECLARE @validChars varchar(50) = (SELECT dbo.fnGetNGramValidChars() AS Result);
    DECLARE @size int  = (SELECT  LEN(@validChars) AS  size);
 
    DECLARE @id int = -2147483648;
    -- Calculate position of each char
    DECLARE @pos1 int = CHARINDEX(SUBSTRING(@fragment, 1, 1), @validChars);
    DECLARE @pos2 int = CHARINDEX(SUBSTRING(@fragment, 2, 1), @validChars);
    DECLARE @pos3 int = CHARINDEX(SUBSTRING(@fragment, 3, 1), @validChars);
    DECLARE @pos4 int = CHARINDEX(SUBSTRING(@fragment, 4, 1), @validChars);
 
    SET @id = @id + (@pos1 - 1) + 
                    (@pos2 - 1) * @size + 
                    (@pos3 - 1) * POWER(@size, 2) + 
                    (@pos4 - 1) * POWER(@size, 3);
 
    RETURN @id;
END;

Now, in order to generate all the index values for an entire text, a Table Valued Function (TVF) is created, which breaks down the text into fragments, calculates the values for each and returns the result as a table.

CREATE FUNCTION  dbo.fnBuildIndex4Gram
    (@string nvarchar(max))
    RETURNS @ids TABLE (id int)
AS
BEGIN
    SET @string = (SELECT dbo.fnNormalizeDatastring(@string));
    DECLARE @pos int = 1;
    DECLARE @len int = LEN(@string) - 4;
 
    IF @len <= 0
        RETURN;
 
    WHILE @pos <= @len
    BEGIN
        INSERT INTO  @ids (id)
        SELECT dbo.fnCalculate4GramIndex(SUBSTRING(@string, @pos, 4));
 
        SET @pos = @pos + 1;
    END
 
    RETURN;
END;
 
-- Small test:
SELECT *
FROM dbo.fnBuildIndex4Gram(N'Hello World') AS  IDX;

As an example for indexing, the address data used in the scheme "Person" from the AdventureWorks database is used.
To get all the relevant data in just one field, incl. of 0-n telephone numbers, a view is created that provides the data. This view has the advantage that the data can be extended at any time, without having to adjust other logics and process steps.

CREATE VIEW  Person.vAddressIndexData
AS
    SELECT ADR.AddressID,
           ADR.AddressLine1 + N' ' +
           ISNULL(ADR.AddressLine2 + N' ', N'') +  
           ADR.City + N' ' +
           ADR.PostalCode +  N' ' +
           SPR.StateProvinceCode + N' ' +
           SPR.Name + N' '  +
           CR.Name + N' '  +
           ISNULL(PER.Title + N' ', N'') +
           ISNULL(PER.FirstName + N' ', N'') +
           ISNULL(PER.MiddleName + N' ', N'') +  
           ISNULL(PER.LastName + N' ', N'') +
           ISNULL(PER.Suffix + N' ', N'') +
           ISNULL(EA.EmailAddress + N' ', N'') +
           ISNULL((SELECT CAST((SELECT (SELECT PHO.PhoneNumber + ','
                                 FROM Person.PersonPhone AS PHO
                                 WHERE PHO.BusinessEntityID = PER.BusinessEntityID
                                 FOR XML PATH (''), Type 
                               ) AS  ColCSV 
                              ) AS  nvarchar(MAX))) + N' ', N'')
           AS IndexData
    FROM Person.Address AS ADR
         INNER JOIN
         Person.StateProvince AS  SPR
             ON ADR.StateProvinceID = SPR.StateProvinceID
         INNER JOIN
         Person.CountryRegion AS  CR
             ON SPR.CountryRegionCode = CR.CountryRegionCode
         LEFT JOIN
         Person.BusinessEntityAddress AS  BEA
             ON ADR.AddressID = BEA.AddressID
         LEFT JOIN
         Person.Person AS  PER
             ON BEA.BusinessEntityID = PER.BusinessEntityID
         LEFT JOIN
         Person.EmailAddress AS  EA
             ON PER.BusinessEntityID = EA.BusinessEntityID
                AND PER.EmailPromotion = EA.EmailAddressID;

Note: The view is only meant as an example, sense and correctness was not respected here.

Next, we need a table to save the index values. What is needed is the primary key of the record, the AddressId, and the n-gram index value. Since a fragment / index value can occur more than once for each record, a field for the occurrence is used instead of storing each index value individually; which of course also would work.

CREATE TABLE  dbo.Gram4Index(
    AddressId int  NOT NULL,
    ID_4Gram int  NOT NULL,
    Occurrence smallint  NOT NULL,
    CONSTRAINT PK_Gram4Index PRIMARY KEY  CLUSTERED 
    (
        AddressId ASC,
        ID_4Gram ASC
    )
);

Note: The mixing of different schemes, like here with Person/dbo, usually does not make sense.

During search only the index value is directly needed, the primary key index is not so suitable for this, so another index is created, which is more selective. To avoid unnecessary key lookups on the address and occurrences in the base table, the fields are included as "include".

CREATE NONCLUSTERED INDEX IDX_Gram4Index_Search
    ON dbo.Gram4Index 
       (ID_4Gram)
    INCLUDE (AddressId, Occurrence);
 
-- Some additional statistics.
 
CREATE STATISTICS  [STAT_Gram4Index_AddressId] 
    ON [dbo].[Gram4Index]
       ([AddressId]);
 
CREATE STATISTICS  [STAT_Gram4Index_Occurrence] 
    ON [dbo].[Gram4Index]
       ([Occurrence]);
 
CREATE STATISTICS  [STAT_Gram4Index_AddressId_Occurrence] 
    ON [dbo].[Gram4Index]
       (AddressId, [Occurrence]);

How it should be, filling the index value is done centrally via a stored procedure, which can then be called from other SP's or from triggers. The following SP serves this purpose, it first deletes the old index values and adds the new into the index table for an address; it expects only the PK AddressId as parameters.

CREATE PROCEDURE  dbo.spUpdate4Gram
    @addressId int
AS
BEGIN
    SET NOCOUNT ON;
 
    -- Delete old index data
    DELETE FROM  dbo.Gram4Index
    WHERE AddressId  = @addressId;
 
    -- Re-create Index for the Address.
    INSERT INTO  dbo.Gram4Index
       (AddressId, ID_4Gram, Occurrence)
    SELECT ADR.AddressId, NG.ID AS ID_NGram4, COUNT(*) AS  Occurrence
    FROM Person.vAddressIndexData AS ADR
         CROSS APPLY
         dbo.fnBuildIndex4Gram(ADR.IndexData) AS  NG
    WHERE ADR.AddressId = @addressId
    GROUP BY  ADR.AddressId, NG.ID;
END;

Since we have with the Adventure Works an existing database with data, we need additionally a routine to initial process the existing data. For this we use a cursor to process all addresses one by one.

SET NOCOUNT ON;
DECLARE @addressId int;
DECLARE addresses CURSOR LOCAL  FOR
    SELECT AddressId
    FROM Person.Address;
 
OPEN addresses;
 
FETCH NEXT  FROM addresses INTO @addressId;
WHILE (@@FETCH_STATUS = 0)
BEGIN
    EXEC dbo.spUpdate4Gram @addressId;
    FETCH NEXT  FROM addresses INTO @addressId;
END;
 
CLOSE addresses;
DEALLOCATE addresses;
 
PRINT N'Finished';

Space requirements for the index

Every index requires additional space, with this N-Gram index it's the same, of course.
For a N-Gram index and a text length of x characters we get (x - N) fragments. For each fragment we save

  • AddressId = int = 4 byte

  • ID_4Gram = int = 4 byte

  • Occurrence = smallint = 2 byte

  • Primary Key = 8 byte

  • Search Index = 10 byte
    in total it is 28 bytes per fragment. If we have a text of 1,000 characters, we get for a 4-Gram index 996 * 28 bytes = 27,888 bytes storage size. In compare a common index on a NVarChar column needs 2 bytes per character, here 2,000 bytes. The N-Gram index requires ~14 times of space and therefore it is from storage side an expensive index. In this particular case, 30.4 MB for data and 21 MB for indexes are needed, thus total 51.4 MB.

    Note: The size calculation is simplified, the more accurate calculation can be read at MSDN: Estimate the Size of a Table


Search Logic

Suitable search results are the addresses, that have as many of the same index values as in the search term. The fewer index values matches, the less the address matches with the search term.
Unlike a normal index, the order of individual term in the search text as well as in the data is almost irrelevant. On the other hand, of course, fragments from the search text can appear in the data also in completely different combinations from different words.
For searching first the entire search string is separated by the function dbo.fnBuildIndex4Gram as previously in the individual index values, this is joined with the stored index values and then counted the matches, more specifically, the sum of "occurrence" is formed and so frequent occurring in the text matching index values are more heavily weighted.

CREATE PROCEDURE  dbo.spSearch4Gram
    (@search nvarchar(4000),
     @factor decimal(9, 4)
    )
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @len int;
 
    -- Normalize search string.
    SET @search = (SELECT dbo.fnNormalizeDatastring(@search));
    SET @len = (LEN(@search) - 4) * @factor;
    IF @len <= 0
        SET @len = 1;
 
    -- Weighted 4-Gram index search
    ;WITH idx AS
        (SELECT GR4.AddressId, SUM(GR4.Occurrence) AS  Weigth
         FROM dbo.fnBuildIndex4Gram(@search) AS BIG
              INNER JOIN
              dbo.Gram4Index AS  GR4
                  ON GR4.ID_4Gram = BIG.id
         GROUP BY  GR4.AddressId
         HAVING SUM(GR4.Occurrence) > @len
        )
    SELECT ADR.AddressID,
           ADR.AddressLine1,
           ADR.City,
           ADR.City,
           idx.Weigth
    FROM Person.Address AS ADR WITH (NOLOCK)
         INNER JOIN
         idx 
             ON ADR.AddressID = idx.AddressId
    ORDER BY  idx.weigth DESC;
END;
 
-- Test cases
 
-- Umlaut search => "Buergermeister" as result
EXEC dbo.spSearch4Gram N'Saarland Bürgermeister', 0.8;
-- and counter wise
EXEC dbo.spSearch4Gram N'rotthaeuser germany saarbruecken', 0.8;
 
-- City name and telephone number without area code.
EXEC dbo.spSearch4Gram N'Springfield -555-0181', 0.7;
 
-- Typo: "Bery" instead "Berry"
EXEC dbo.spSearch4Gram N'Bery court', 0.6;
 
-- Some part terms
EXEC dbo.spSearch4Gram N'West gloria California 91791', 0.75;
 
-- Long search text
EXEC dbo.spSearch4Gram N'5157 Washington 98027 lane marywood Issaquah', 0.8;
 
-- Large result set
EXEC dbo.spSearch4Gram N'Washington', 0.8;

Summary

The N-gram index is easy to implement using standard Transact-SQL and can be easily adapted to specific characteristics of each data set and the desired search logic.
The search function is very performant and the accuracy or tolerance of typing errors can be controlled individually via a parameter, as well as the choice of the N value. These benefits is bought with a high storage requirements for the index values.

As with every solution you can also still find potential for optimization, for example, the usage of a Column Store Index, but this would restricts the usage to Enterprise Edition.


Other Languages

SQL Server: Implementierung eines N-Gram Suchindex (de-DE)


https://en.wikipedia.org/wiki/N-gram