T-SQL : Find Word Matched Count
Introduction
We have been asked to design an algorithm which counts “No. Of Words Matched” between 2 different strings
Problem Definition
User wants to count “No. Of Words Matched” between 2 different strings, suppose we have tables ‘Test1’ & ‘Test2’ which contains columns String1 & String2 respectively. Now user wants to calculate “No. Of Words Matched” between Table 'Test1' String1 column with every row of Table 'Test2' String2 column
Total No. Of Words Matched = Find Words in String1, find Words in String2, now find how many Words from String1 exists in String2 and return the total no. of words matched between 2 strings
Please find sample table structures which we would be working for this problem
CREATE TABLE TEST1
(
ID1 INT IDENTITY(1,1)
, STRING1 VARCHAR ( MAX )
)
CREATE TABLE TEST2
(
ID2 INT IDENTITY(1,1)
, STRING2 VARCHAR ( MAX )
)
Please find the script to populate these sample tables with dummy data
INSERT INTO TEST1(STRING1)
SELECT 'This is String1 Row1' UNION ALL
SELECT 'This is String1 Row2' UNION ALL
SELECT 'This is String1 Row3' UNION ALL
SELECT 'This is String1 John Doe' UNION ALL
SELECT 'This is String1 Will Smith'
INSERT INTO TEST2(STRING2)
SELECT 'This is string2 Row1' UNION ALL
SELECT 'This is string2 Row2' UNION ALL
SELECT 'This is String2 Row3' UNION ALL
SELECT 'This is String2 John doe Band' UNION ALL
SELECT 'This is string2 Will smith One'
Here is a copy of the image of sample data for reference to this problem:
Solution
We have created a function udf_GetMatchWordCount which accept 3 parameters, String1, String2 & CaseSensitive and return the No. Words Matched between these two String parameters, matching of Words depends on CaseSensitive parameter which defines either to perform Word matching on Case Sensitive or Case Insensitive
The syntax for udf_GetMatchWordCount function is:
Syntax : udf_GetMatchWordCount(String1, String2, CaseSensitive)
String1 : Is the string which words you want to be matched with String2 and can be of VARCHAR(MAX)
String2 : Is the string in which you want to match words from String1 and can be of VARCHAR(MAX)
CaseSensitive : It defines either you want String1 & String2 to be compared with Case Sensitive or not, 0 for Case Insensitive, 1 for Case Sensitive
Provide these input parameters and you will be given the output, you can also use this code inside inline query or stored procedure as well
IF OBJECT_ID (N 'dbo.udf_GetMatchWordCount' , N 'FN' ) IS NOT NULL
DROP FUNCTION dbo.udf_GetMatchWordCount;
GO
--Create Function udf_GetMatchWordCount to Find out Matched Word Count
CREATE FUNCTION dbo.udf_GetMatchWordCount(@String1 VARCHAR(MAX), @String2 VARCHAR(MAX), @CaseSensitive INT)
RETURNS INT
AS
BEGIN
DECLARE @MatchingCount INT
DECLARE @WordTable1 TABLE(String VARCHAR(MAX))
DECLARE @WordTable2 TABLE(String VARCHAR(MAX))
SELECT @String1 = CASE WHEN @CaseSensitive = 0 THEN UPPER(@String1) ELSE @String1 END
SELECT @String2 = CASE WHEN @CaseSensitive = 0 THEN UPPER(@String2) ELSE @String2 END
SELECT @String1 = REPLACE(REPLACE(REPLACE(@String1, ' ', ' X'), 'X ', ''), 'X', '')
SELECT @String2 = REPLACE(REPLACE(REPLACE(@String2, ' ', ' X'), 'X ', ''), 'X', '')
; WITH CTE (Word, String1) AS
(
SELECT CASE WHEN CHARINDEX(' ', @String1) > 0 THEN SUBSTRING(@String1, 1, CHARINDEX(' ', @String1)-1) ELSE @String1 END AS Word
, CASE WHEN CHARINDEX(' ', @String1) > 0 THEN SUBSTRING(@String1, CHARINDEX(' ', @String1)+1, LEN(@String1)) ELSE @String1 END AS String1
UNION ALL
SELECT CASE WHEN CHARINDEX(' ', C.String1) > 0 THEN SUBSTRING(C.String1, 1, CHARINDEX(' ', C.String1)-1) ELSE C.String1 END AS Word
, CASE WHEN CHARINDEX(' ', C.String1) > 0 THEN SUBSTRING(C.String1, CHARINDEX(' ', C.String1)+1, LEN(C.String1)) ELSE '' END AS String1
FROM CTE C
WHERE LEN(C.String1) < LEN(@String1) AND C.String1 <> ''
)
INSERT INTO @WordTable1
SELECT Word FROM CTE
; WITH CTE2 (Word, String2) AS
(
SELECT CASE WHEN CHARINDEX(' ', @String2) > 0 THEN SUBSTRING(@String2, 1, CHARINDEX(' ', @String2)-1) ELSE @String2 END AS Word
, CASE WHEN CHARINDEX(' ', @String2) > 0 THEN SUBSTRING(@String2, CHARINDEX(' ', @String2)+1, LEN(@String2)) ELSE @String2 END AS String2
UNION ALL
SELECT CASE WHEN CHARINDEX(' ', C.String2) > 0 THEN SUBSTRING(C.String2, 1, CHARINDEX(' ', C.String2)-1) ELSE C.String2 END AS Word
, CASE WHEN CHARINDEX(' ', C.String2) > 0 THEN SUBSTRING(C.String2, CHARINDEX(' ', C.String2)+1, LEN(C.String2)) ELSE '' END AS String2
FROM CTE2 C
WHERE LEN(C.String2) < LEN(@String2) AND C.String2 <> ''
)
INSERT INTO @WordTable2
SELECT Word FROM CTE2
; WITH CTE3 (Word, WordMatchCount) AS
(
SELECT T1.String AS Word, COUNT(*) AS WordMatchCount
FROM @WordTable1 T1
INNER JOIN @WordTable2 T2 ON T1.String = T2.String
GROUP BY T1.String
)
SELECT @MatchingCount = ISNULL(SUM(WordMatchCount), 0) FROM CTE3
RETURN @MatchingCount
END
Here is our desired result
--Now your final query would be;
SELECT T1.ID1, T1.STRING1, T2.ID2, T2.STRING2
, LEN(REPLACE(REPLACE(REPLACE(T1.String1, ' ', ' X'), 'X ', ''), 'X', '')) - LEN(REPLACE(T1.String1, ' ', '')) + 1 AS Table1WordCount
, LEN(REPLACE(REPLACE(REPLACE(T2.String2, ' ', ' X'), 'X ', ''), 'X', '')) - LEN(REPLACE(T2.String2, ' ', '')) + 1 AS Table2WordCount
, dbo.udf_GetMatchWordCount(T1.String1, T2.String2, 1) AS MatchedWordsCount
FROM TEST1 T1
INNER JOIN TEST2 T2 ON 1 = 1
ORDER BY T1.ID1, T2.ID2
Conclusion
This article demonstrated solution for a common problem of Finding No. Of Matched words using set-based approach, and provided good example for Recursive CTE. This article also demonstrates how to replace multiple spaces with single space.