SQL Server: Implementation of N-Gram Search Index
https://msdnshared.blob.core.windows.net/media/2016/05/0640_NinjaAwardTinyGold.pngGold Award Winner
All current Web search engines have only one input field for the search term and always deliver a result, no matter what you type. This makes us developer’s life difficult, because now end users expect the same in Line-of-Business (LOB) applications like ERP, for example, when searching for an address or an item. This article shows how to implement a similar search function with the Microsoft SQL Server.
Introducing
The built-In function “Full-Text Index” of the Microsoft SQL Server provides an easy way to index all the words in a database, as well as to query them in a flexible and performant way. But there are a few limitations that cause it's not suitable for a similar web search functionality.
no suffix (ends with) search possible like *part
no typing error tolerance with respect to search term and/or base data
only conditional indexing record delimiter and other Noise Word
An alternative is to implement a own indexing and in this case the so-called N-Gram Index. In a n-gram index all N consecutive characters are counted as an unit (fragment) and indexed, where N can be any numeric value from 1-x. An example with "Hello World" and a 3-grams results in the units
"HEL", "ELL", "LLO", "LO ", "O W", " WO", "WOR", "ORL", "RLD"The higher the N value, the larger the fragment and the more precise is the search, the less the N value the higher the typing error tolerance,
Normalization
The aim is to implement a typing error tolerant and accent insensitive search of addresses, for this we have to provide normalized data base for the n-gram index accordingly. This requires a few considerations ahead.
Umlauts
German umlauts like Ü can be replaced by UE, so that we can find with the search phrase "Müller" also the name "Mueller". In other languages, there are special characters that come from the Latin alphabet and include indicators for emphasis, as the Polish Ł. Such characters could be also replace.
Delimiter/Punctuation Mark
For full-text index delimiters are counted as the end of words and are not indexed. There are, however, company and product names that contain delimiter and therefore they should be in a certain way searchable. Here you have to decide by the data if the delimiters are retained or replaced.
In this example, all delimiters are replaced by a single character for simplicity. So the structure of the text is retained, but for indexing and searching just one sign is used.
Special Character
Again, you have to decide by database whether to replace special characters or not. Good example is the @ sign in e-mail addresses; one would like to specifically search only for the e-mail or also after the parts of the e-mail address like domain name in another data areas.
This means in this example only
digits
Western capital letters
Exclamation mark ! as a general substitute for all delimiter
hash mark # as a general substitute for all other special characters
are approved as signs of the normalized text.To maintain a central list of allowable characters, a function is created that returns them; global variables are not available in MS SQL Server.CREATE FUNCTION dbo.fnGetNGramValidChars() RETURNS nvarchar(39) AS BEGIN -- Approved Signs for N-Gram Index -- Western capital letters + Space -- Digits -- Delimiter sign: ! -- Special character sign: # RETURN N'ABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789!#'; END;
All other characters are replaced accordingly. For this another function is created, which normalizes the supplied text. It is later used for both indexing the data and for the normalization of the search term.
Here the shortened version:CREATE FUNCTION dbo.fnNormalizeDatastring (@string nvarchar(max)) RETURNS nvarchar(max) AS BEGIN -- Get all valid characters DECLARE @validChars nvarchar(50) = (SELECT dbo.dbo.fnGetNGramValidChars() AS Result); -- Only capital characters SET @string = UPPER(@string); -- Replace umlauts SET @string = REPLACE(@string, N'Ä', N'AE'); SET @string = REPLACE(@string, N'Æ', N'AE'); -- usw. ... -- Registered Marks SET @string = REPLACE(@string, N'℅', N'C#O'); SET @string = REPLACE(@string, N'™', N'TM'); -- usw. ... -- Replace delimiter by ! SET @string = REPLACE(@string, N'.', N'!'); SET @string = REPLACE(@string, N',', N'!'); -- usw. ... -- Replace special characters by hash mark. SET @string = REPLACE(@string, N'+', N'#'); SET @string = REPLACE(@string, N'-', N'#'); -- usw. ... -- Replace all other invalid characters by space. DECLARE @loop int = 1; DECLARE @len int = LEN(@string); DECLARE @invalidChar nchar(1); WHILE @loop <= @len BEGIN SET @invalidChar = SUBSTRING(@string, @loop, 1); IF CHARINDEX(@invalidChar, @validChars) <= 0 BEGIN SET @string = REPLACE(@string, @invalidChar, N' '); END; SET @loop = @loop + 1; END; -- Remove multiple spaces DECLARE @goOn tinyint = 1; DECLARE @oldLen int; WHILE @goOn <> 0 BEGIN SET @oldLen = LEN(@string); SET @string = REPLACE(@string, N' ', N' '); SET @string = REPLACE(@string, N' ', N' '); IF @oldLen = LEN(@string) BEGIN SET @goOn = 0; END; END; RETURN @string; END;
Note: Such elaborate string operations are very expensive in SQL Server. It would be conceivable to optimize implementation as CLR Assembly.
Indexer
The index can be easily calculated on the number of allowed characters, the position of the character and the N-gram value. For the fragment "BDHZ" the index for 4-Gram is calculated by way of example as below (position is 0-based):
B = Position 1 * (39^0) = 1
D = Position 4 * (39^1) = 156
H = Position 7 * (39^2) = 10,647
Z = Position 25 * (39^3) = 1,482,975
Sum = 1,493,779 less integer lower limit -2,147,483,648 => index value -2,145,989,869.
Take the example of a 4-gram index with the above defined 39 characters, then we can get 39 ^ 4 = 2,313,441 different character combinations (fragments).
Let we allow negative values, then the value range of the int data type fits up to the 6-gram index values. The number of possible index values for each n-gram:
N-Gram Formular Value Data Typ
2-Gram = 39^2 = 1,521 => smallint
3-Gram = 39^3 = 59,319 => smallint
4-Gram = 39^4 = 2,313,441 => int
5-Gram = 39^5 = 90,224,199 => int
6-Gram = 39^6 = 3,518,743,761 => int
7-Gram = 39^7 = 137,231,006,679 => bigint
The index calculation function in T-SQL provide as follows is being assumed, that the fragment is passed normalized.
CREATE FUNCTION dbo.fnCalculate4GramIndex
(@fragment nchar(4))
RETURNS int
AS
BEGIN
-- Get all valid characters
DECLARE @validChars varchar(50) = (SELECT dbo.fnGetNGramValidChars() AS Result);
DECLARE @size int = (SELECT LEN(@validChars) AS size);
DECLARE @id int = -2147483648;
-- Calculate position of each char
DECLARE @pos1 int = CHARINDEX(SUBSTRING(@fragment, 1, 1), @validChars);
DECLARE @pos2 int = CHARINDEX(SUBSTRING(@fragment, 2, 1), @validChars);
DECLARE @pos3 int = CHARINDEX(SUBSTRING(@fragment, 3, 1), @validChars);
DECLARE @pos4 int = CHARINDEX(SUBSTRING(@fragment, 4, 1), @validChars);
SET @id = @id + (@pos1 - 1) +
(@pos2 - 1) * @size +
(@pos3 - 1) * POWER(@size, 2) +
(@pos4 - 1) * POWER(@size, 3);
RETURN @id;
END;
Now, in order to generate all the index values for an entire text, a Table Valued Function (TVF) is created, which breaks down the text into fragments, calculates the values for each and returns the result as a table.
CREATE FUNCTION dbo.fnBuildIndex4Gram
(@string nvarchar(max))
RETURNS @ids TABLE (id int)
AS
BEGIN
SET @string = (SELECT dbo.fnNormalizeDatastring(@string));
DECLARE @pos int = 1;
DECLARE @len int = LEN(@string) - 4;
IF @len <= 0
RETURN;
WHILE @pos <= @len
BEGIN
INSERT INTO @ids (id)
SELECT dbo.fnCalculate4GramIndex(SUBSTRING(@string, @pos, 4));
SET @pos = @pos + 1;
END
RETURN;
END;
-- Small test:
SELECT *
FROM dbo.fnBuildIndex4Gram(N'Hello World') AS IDX;
As an example for indexing, the address data used in the scheme "Person" from the AdventureWorks database is used.
To get all the relevant data in just one field, incl. of 0-n telephone numbers, a view is created that provides the data. This view has the advantage that the data can be extended at any time, without having to adjust other logics and process steps.
CREATE VIEW Person.vAddressIndexData
AS
SELECT ADR.AddressID,
ADR.AddressLine1 + N' ' +
ISNULL(ADR.AddressLine2 + N' ', N'') +
ADR.City + N' ' +
ADR.PostalCode + N' ' +
SPR.StateProvinceCode + N' ' +
SPR.Name + N' ' +
CR.Name + N' ' +
ISNULL(PER.Title + N' ', N'') +
ISNULL(PER.FirstName + N' ', N'') +
ISNULL(PER.MiddleName + N' ', N'') +
ISNULL(PER.LastName + N' ', N'') +
ISNULL(PER.Suffix + N' ', N'') +
ISNULL(EA.EmailAddress + N' ', N'') +
ISNULL((SELECT CAST((SELECT (SELECT PHO.PhoneNumber + ','
FROM Person.PersonPhone AS PHO
WHERE PHO.BusinessEntityID = PER.BusinessEntityID
FOR XML PATH (''), Type
) AS ColCSV
) AS nvarchar(MAX))) + N' ', N'')
AS IndexData
FROM Person.Address AS ADR
INNER JOIN
Person.StateProvince AS SPR
ON ADR.StateProvinceID = SPR.StateProvinceID
INNER JOIN
Person.CountryRegion AS CR
ON SPR.CountryRegionCode = CR.CountryRegionCode
LEFT JOIN
Person.BusinessEntityAddress AS BEA
ON ADR.AddressID = BEA.AddressID
LEFT JOIN
Person.Person AS PER
ON BEA.BusinessEntityID = PER.BusinessEntityID
LEFT JOIN
Person.EmailAddress AS EA
ON PER.BusinessEntityID = EA.BusinessEntityID
AND PER.EmailPromotion = EA.EmailAddressID;
Note: The view is only meant as an example, sense and correctness was not respected here.
Next, we need a table to save the index values. What is needed is the primary key of the record, the AddressId, and the n-gram index value. Since a fragment / index value can occur more than once for each record, a field for the occurrence is used instead of storing each index value individually; which of course also would work.
CREATE TABLE dbo.Gram4Index(
AddressId int NOT NULL,
ID_4Gram int NOT NULL,
Occurrence smallint NOT NULL,
CONSTRAINT PK_Gram4Index PRIMARY KEY CLUSTERED
(
AddressId ASC,
ID_4Gram ASC
)
);
Note: The mixing of different schemes, like here with Person/dbo, usually does not make sense.
During search only the index value is directly needed, the primary key index is not so suitable for this, so another index is created, which is more selective. To avoid unnecessary key lookups on the address and occurrences in the base table, the fields are included as "include".
CREATE NONCLUSTERED INDEX IDX_Gram4Index_Search
ON dbo.Gram4Index
(ID_4Gram)
INCLUDE (AddressId, Occurrence);
-- Some additional statistics.
CREATE STATISTICS [STAT_Gram4Index_AddressId]
ON [dbo].[Gram4Index]
([AddressId]);
CREATE STATISTICS [STAT_Gram4Index_Occurrence]
ON [dbo].[Gram4Index]
([Occurrence]);
CREATE STATISTICS [STAT_Gram4Index_AddressId_Occurrence]
ON [dbo].[Gram4Index]
(AddressId, [Occurrence]);
How it should be, filling the index value is done centrally via a stored procedure, which can then be called from other SP's or from triggers. The following SP serves this purpose, it first deletes the old index values and adds the new into the index table for an address; it expects only the PK AddressId as parameters.
CREATE PROCEDURE dbo.spUpdate4Gram
@addressId int
AS
BEGIN
SET NOCOUNT ON;
-- Delete old index data
DELETE FROM dbo.Gram4Index
WHERE AddressId = @addressId;
-- Re-create Index for the Address.
INSERT INTO dbo.Gram4Index
(AddressId, ID_4Gram, Occurrence)
SELECT ADR.AddressId, NG.ID AS ID_NGram4, COUNT(*) AS Occurrence
FROM Person.vAddressIndexData AS ADR
CROSS APPLY
dbo.fnBuildIndex4Gram(ADR.IndexData) AS NG
WHERE ADR.AddressId = @addressId
GROUP BY ADR.AddressId, NG.ID;
END;
Since we have with the Adventure Works an existing database with data, we need additionally a routine to initial process the existing data. For this we use a cursor to process all addresses one by one.
SET NOCOUNT ON;
DECLARE @addressId int;
DECLARE addresses CURSOR LOCAL FOR
SELECT AddressId
FROM Person.Address;
OPEN addresses;
FETCH NEXT FROM addresses INTO @addressId;
WHILE (@@FETCH_STATUS = 0)
BEGIN
EXEC dbo.spUpdate4Gram @addressId;
FETCH NEXT FROM addresses INTO @addressId;
END;
CLOSE addresses;
DEALLOCATE addresses;
PRINT N'Finished';
Space requirements for the index
Every index requires additional space, with this N-Gram index it's the same, of course.
For a N-Gram index and a text length of x characters we get (x - N) fragments. For each fragment we save
AddressId = int = 4 byte
ID_4Gram = int = 4 byte
Occurrence = smallint = 2 byte
Primary Key = 8 byte
Search Index = 10 byte
in total it is 28 bytes per fragment. If we have a text of 1,000 characters, we get for a 4-Gram index 996 * 28 bytes = 27,888 bytes storage size. In compare a common index on a NVarChar column needs 2 bytes per character, here 2,000 bytes. The N-Gram index requires ~14 times of space and therefore it is from storage side an expensive index. In this particular case, 30.4 MB for data and 21 MB for indexes are needed, thus total 51.4 MB.Note: The size calculation is simplified, the more accurate calculation can be read at MSDN: Estimate the Size of a Table
Search Logic
Suitable search results are the addresses, that have as many of the same index values as in the search term. The fewer index values matches, the less the address matches with the search term.
Unlike a normal index, the order of individual term in the search text as well as in the data is almost irrelevant. On the other hand, of course, fragments from the search text can appear in the data also in completely different combinations from different words.
For searching first the entire search string is separated by the function dbo.fnBuildIndex4Gram as previously in the individual index values, this is joined with the stored index values and then counted the matches, more specifically, the sum of "occurrence" is formed and so frequent occurring in the text matching index values are more heavily weighted.
CREATE PROCEDURE dbo.spSearch4Gram
(@search nvarchar(4000),
@factor decimal(9, 4)
)
AS
BEGIN
SET NOCOUNT ON;
DECLARE @len int;
-- Normalize search string.
SET @search = (SELECT dbo.fnNormalizeDatastring(@search));
SET @len = (LEN(@search) - 4) * @factor;
IF @len <= 0
SET @len = 1;
-- Weighted 4-Gram index search
;WITH idx AS
(SELECT GR4.AddressId, SUM(GR4.Occurrence) AS Weigth
FROM dbo.fnBuildIndex4Gram(@search) AS BIG
INNER JOIN
dbo.Gram4Index AS GR4
ON GR4.ID_4Gram = BIG.id
GROUP BY GR4.AddressId
HAVING SUM(GR4.Occurrence) > @len
)
SELECT ADR.AddressID,
ADR.AddressLine1,
ADR.City,
ADR.City,
idx.Weigth
FROM Person.Address AS ADR WITH (NOLOCK)
INNER JOIN
idx
ON ADR.AddressID = idx.AddressId
ORDER BY idx.weigth DESC;
END;
-- Test cases
-- Umlaut search => "Buergermeister" as result
EXEC dbo.spSearch4Gram N'Saarland Bürgermeister', 0.8;
-- and counter wise
EXEC dbo.spSearch4Gram N'rotthaeuser germany saarbruecken', 0.8;
-- City name and telephone number without area code.
EXEC dbo.spSearch4Gram N'Springfield -555-0181', 0.7;
-- Typo: "Bery" instead "Berry"
EXEC dbo.spSearch4Gram N'Bery court', 0.6;
-- Some part terms
EXEC dbo.spSearch4Gram N'West gloria California 91791', 0.75;
-- Long search text
EXEC dbo.spSearch4Gram N'5157 Washington 98027 lane marywood Issaquah', 0.8;
-- Large result set
EXEC dbo.spSearch4Gram N'Washington', 0.8;
Summary
The N-gram index is easy to implement using standard Transact-SQL and can be easily adapted to specific characteristics of each data set and the desired search logic.
The search function is very performant and the accuracy or tolerance of typing errors can be controlled individually via a parameter, as well as the choice of the N value. These benefits is bought with a high storage requirements for the index values.
As with every solution you can also still find potential for optimization, for example, the usage of a Column Store Index, but this would restricts the usage to Enterprise Edition.
Other Languages
SQL Server: Implementierung eines N-Gram Suchindex (de-DE)