TSQL - Solve it YOUR Way - Finding the Longest Repeated Substring
Introduction:
As part of the blog series TSQL - Solve it YOUR Way, today's topic will cover a common programming interview question where we are asked to find the longest repeated substring in a given string, followed by different solutions and explanations from three of the more helpful and creative contributors in the TSQL MSDN forums, Steven Wang, Kent Waldrop, and Jingyang Li.
Topic: Given a string, find the longest repeated substring
I was asked this question many years ago in a programming interview and I solved it in C# as was expected in the context of the role I was interviewing for. A few days ago, a friend of mine was discussing a recent programming interview where he was asked this same question. While discussing solutions with him, I realized that this could make for a great topic for the Solve it YOUR Way series. While T-SQL may be an unlikely method of solving this problem, the solutions below highlight the power of SQL Server through the creative solutions of the authors.
As with any good interview question, the initial phrasing of the question does not provide all of the necessary requirements to provide the best or only correct solution. In this case, in order to provide a correct solution, we would need details such as 1) can substrings overlap, and 2) what do we return if multiple substrings of the same length exist (first occurrence, all occurrences?), etc. I intentionally left these details out of the original topic to allow the answers a bit more freedom in their solutions. The solutions reflect these subtle differences.
Solution #1 - Provided by Steven Wang
Code Snippet
- --First of all, we need a generic number table. I know that we can use the common table expression to create
- --a virtual number table. But I still prefer to have a physical number table in place.
- --As SQL server can only store up to 8000 characters for a string value, for simplicity I will create
- -- a number table with maximum value equals to 8000.
- If Object_ID('dbo.LongestStr_Number', 'U') Is Null
- Begin
- SELECT TOP 8000 IDENTITY(INT,1,1) AS Number
- INTO dbo.LongestStr_Number
- FROM master.sys.all_columns A
- CROSS JOIN
- master.sys.all_columns B;
- ALTER TABLE dbo.LongestStr_Number ADD CONSTRAINT PK_LongestStr_Number PRIMARY KEY CLUSTERED (Number) WITH FILLFACTOR = 100;
- End
- --Solution Starts from here. The solution works only for SQL server 2012, by some change can be easily used for SQL server 2005/08
- Declare @MyString varchar(8000);
- Declare @Length int;
- Set @MyString = 'ABAB CDCDCDCD ABAB';
- Set @Length = Len(@MyString);
- ;With RepeatingNumber
- AS
- (
- Select A.Number, Row_Number() Over(Partition By A.Number Order By (Select 0)) AS Position
- From LongestStr_Number A
- Cross Apply
- (
- Select 1 As Col
- From LongestStr_Number B
- Where Number <= A.Number
- ) C
- Where A.Number between @Length/2 + 1 and @Length
- )
- , CTE
- As
- (
- Select A.Number, B.MySubString, B.StringLen
- , Count(B.MySubString) Over(Partition By A.Number, B.MySubString) As MyCount
- , A.Position - First_Value(A.Position) Over(partition By A.Number, B.MySubstring Order by A.Position Rows Unbounded Preceding) - B.StringLen As PositionGap
- From RepeatingNumber A
- CROSS APPly
- ( Select Substring(@MyString, A.Position, @Length - A.Number + 1) As MySubstring
- , @Length - A.Number + 1 As StringLen
- ) B
- )
- Select Top 1 with Ties MySubstring As longest_repeating_String
- From CTE
- Where MyCount > 1 and PositionGap >= 0
- Order by StringLen Desc
Explanation of Steven's solution:
To get substrings of a string, the number of substrings is determined by the length of the substring. For a string with length equals to n, it has 1 substring if the length of the substring equals to n – 1 + 1. If the length of the substring equals to n – 2 + 1, then it can have 2 substrings with overlaps. With the same fashion, if the length of a substring equals to n – m + 1 (m<=n), a string will have m number of substrings with overlaps.
As our task is to find out the longest repeating substring with NO overlaps, the number m should be only starting from n/2 + 1 (remember in T-SQL integer n divided by 2 will always return back integer). However, this doesn’t guarantee the substrings without overlaps.
Since each m number has m number of substrings which is calculated by the RepeatingNumber virtual table Cross Applied with a substring function, we can use the window function Count () by partitioning the m number and the substring to find out the repeating substrings with overlaps. If the window function Count() Over returns back the value is greater than 1, then we find a repeating substring. But at this stage, we don’t know if the repeating substrings are overlapped or not.
I used another window function First_value to identify the repeating substrings without overlaps by using the code below:
A.Position - First_Value(A.Position) Over(partition By A.Number, B.MySubstring Order by A.Position) - B.StringLen
The idea of above code is to get the offset value of the starting point of the remaining substrings with same value as the first substring within the same Number group. If the offset value is negative, then we can conclude that the two substrings have overlaps. Otherwise, the two substrings are non-overlapped repeating substrings.
At the last we use the TOP 1 clause with order by to get the longest repeating substring.
I have to mention that the white spaces will be treated as characters in this solution. we can modify the solution to take care of it if it is needed.
Solution #2 - Provided by Kent Waldrop
Code Snippet
- -- Create a numbers table for use in the solution
- create table numbers(n int)
- go
- declare @counter int
- set @counter = 0
- while @counter < 400
- begin
- set @counter = @counter + 1
- insert into numbers
- select @counter
- end
- -- Create a temporary table containing the strings for which we will find the longest substring
- declare @test table(theString varchar(400));
- insert into @test
- select 'ABABABABABABABABABABABABABAB CDCDCDCDCDCDCDCDCDCDCDCDCDCD'
- union all
- select 'SQL Server 2012 has introduced several new window functions and enhanced support for window aggregate functions by introducing window order and frame clauses, support for offset functions. In this session, the presenter will apply these new functions to solve some most frequently asked questions in MSDN T-SQL forum.'
- union all
- select 'bananas'
- union all
- select '3.141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117067982148086513282306647093844609550582231725359408128481117450284102701938521105559644622948954930381964428810975665933446128475648233'
- union all
- select 'Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation.'
- ;
- -- Solution starts here
- --;WITH Numbers AS
- --(
- -- SELECT 1 AS n
- -- UNION ALL
- -- SELECT n + 1 AS n FROM Numbers WHERE n < 400
- --)
- --,
- with findRepeatChars as
- ( select
- 1 as stringLength,
- n as location,
- (len(theString)-n)/2 as diff,
- cast(substring(theString, n, 1) as varchar(200)) as MatchString,
- theString
- from @test
- outer apply
- ( select n from numbers
- where n < 400
- and n < len(theString)
- and charindex(substring(theString, n, 1), theString, n+1) > 0
- ) oa1
- ), primaryWork as
- (
- select
- substring(theString, location, loop_Offset + max_N) as matchString,
- theString,
- rank() over
- ( partition by theString
- order by loop_Offset + max_N desc
- ) as Rn
- from findRepeatChars
- cross apply
- ( select
- iif(diff>=9, 8, null) as group2_Offset,
- iif(diff>=25, 24, null) as group3_Offset,
- iif(diff>=49, 48, null) as group4_Offset,
- iif(diff>=90, 89, null) as group5_Offset,
- iif(diff>=144, 143, null) as group6_Offset
- ) calcGroupOffsets
- cross apply
- ( select
- nullif(sign(charindex(substring(theString, location, group6_offset + 1),
- theString, location + group6_Offset + 1)), 0) as group6_Index,
- nullif(sign(charindex(substring(theString, location, group5_offset + 1),
- theString, location + group5_Offset + 1)), 0) as group5_Index,
- nullif(sign(charindex(substring(theString, location, group4_offset + 1),
- theString, location + group4_Offset + 1)), 0) as group4_Index,
- nullif(sign(charindex(substring(theString, location, group3_offset + 1),
- theString, location + group3_Offset + 1)), 0) as group3_Index,
- nullif(sign(charindex(substring(theString, location, group2_offset + 1),
- theString, location + group2_Offset + 1)), 0) as group2_Index
- ) calcGroupIndices
- cross apply
- ( select
- coalesce( group6_Index * group6_Offset, group5_Index * group5_Offset,
- group4_Index * group4_Offset, group3_Index * group3_Offset,
- group2_Index * group2_Offset, 0
- ) as loop_Offset,
- coalesce
- ( group6_Index * iif(diff>=144, 57, null), group5_Index * iif(diff>=90,55, null),
- group4_Index * iif(diff>=49, 42, null), group3_Index * iif(diff>=25, 25, null),
- group2_Index * iif(diff>=9, 17, null), 9
- ) as loop_Range
- ) calcLoopParms
- cross apply
- ( select
- max(n) as max_N
- from numbers
- where n <= 57
- and n <= loop_Range
- and charindex(substring(theString, location, loop_Offset + n), theString,
- location + loop_Offset + n)
- > 0
- ) calcMaxN
- )
- select
- matchString as longest_Substring,
- theString as input_String
- from primaryWork
- where Rn = 1
- OPTION (maxrecursion 0)
Explanation of Kent's solution:
The purpose of this query is to find the longest substring that is repeated within a given string. Note that this query depends on a table of numbers. This query is divided into three major parts:
• The "findRepeatChars" CTE
• The "primaryWork" CTE
• The outer query
"findRepeatChars" locates all characters in the input string that are repeated at some point later in the string. This CTE parses the string by using a table of numbers to spin through each character position except the last position, which cannot have a repeated character that comes later in the string.
"primaryWork" processes the locations found by "findRepeatChars" and determines the longest substring beginning at the given location. This CTE includes four CROSS APPLY operators to provide reusable calculations:
• calcGroupOffsets
• calcGroupIndices
• calcLoopParms
• calcMaxN
calcGroupOffsets uses the built in IIF function to calculate offsets that can be used as a base locations for dividing a string into smaller segments that can be quickly processed with a linear search.
calcGroupIndices determines if a string at the given location has a substring of a length determined by the group offset that repeats later in the given string.
calcLoopParms uses COALESCE functions to select the largest group offset that defines a substring that is later repeated in the given string based on the computed group indices. In addition, the built in IIF function is used to determine an appropriate search range based on the selected group offset and group index.
calcMaxN uses the computed loop parameters and a table of numbers to determine the largest number that can be added to the offset. This largest number when added to the largest group offset determines the largest substring that is repeated at a given location in the string.
After the primaryWork CTE processes the four apply operators, a RANK function is used such that the rank is partitioned according to the input' string and ordered by the descending string lengths of all of the repeating substrings for a given input string to provide a row number (rn) column.
The outer query calls the primaryWork CTE to process a list of all repeating substrings associated with a given input string. Records are filtered with the outer query such that only records that have a row number (rn) of 1 are retained.
Notes:
I sort of used odd squares to segment out the data; I didn't investigate whether some kind of log method might be better.
Solution #3 - Provided by Jingyang Li
Code Snippet
- declare @MyString varchar(max)
- Set @MyString = 'Server 2012 has introduced several new window functions and enhanced support for window aggregate functions by introducing window order and frame clauses, support for offset functions. In this session, the presenter will apply these new functions to solve some most frequently asked questions in MSDN T-SQL forum.';
- ;WITH Numbers AS
- (
- SELECT 1 AS i
- UNION ALL
- SELECT i + 1 AS i FROM Numbers WHERE i < LEN(@MyString)
- )
- ,mycte1 AS
- (
- SELECT a.i AS Step,b.i AS StartPos, SUBSTRING(@MyString, b.i, a.i) AS SubSeq FROM Numbers a CROSS JOIN Numbers b
- )
- ,mycte2 AS (
- SELECT StartPos, Step, SubSeq
- ,Count(SubSeq)over(Partition by SubSeq) cntSubSeq
- ,ROW_NUMBER() Over(Partition by SubSeq Order by Step ) rnSubSeq
- ,MAX(Len(SubSeq)) OVER(Partition by SubSeq) maxLen
- FROM mycte1 WHERE LEN(SubSeq) >= Step
- )
- ,mycte3 as
- (
- SELECT Step,StartPos, SubSeq, maxLen, rnSubSeq
- FROM mycte2
- WHERE cntSubSeq>1
- )
- ,mycte4 as
- (
- SELECT a.SubSeq, a.StartPos, a.maxLen
- ,RANK() OVER (order by a.maxLen DESC) AS rnkLen
- ,Row_Number() OVER (order by a.maxLen DESC,a.Step) AS rnLen
- FROM mycte3 a
- LEFT JOIN mycte3 b ON a.SubSeq=b.SubSeq and (a.rnSubSeq=b.rnSubSeq-1 OR b.rnSubSeq-1 = 0)
- WHERE (a.StartPos+a.Step-1) < b.StartPos
- )
- SELECT SubSeq as [LognestRepeatingSubstring] FROM mycte4
- WHERE rnkLen=1
- Order by maxLen DESC, StartPos
- OPTION (maxrecursion 0)
Explanation of Jingyang's solution:
Step 1: Create a numbers table based on the string length but it would be better to use an auxiliary numbers table;
Step 2: Cross join Numbers table to create positions and steps for all substring combinations;
Step 3: Calculate the occurrence of substrings in various lengths, the length for each substring and a sequence within a substring based on the step. The WHERE condition is to eliminate double count for any substrings after the last full step increments;
Step 4: Get all substrings with at least two occurrences;
Step 5: Remove repeating substring with overlapping and get the ranking order for the longest substrings;
Step 6: Retrieve the longest repeating substring(s) with two options.
The option1 is to get longest string with tie breaker (first occurrence) by using rnLen=1 and the second option is to get longest strings without tie breaker by using rnkLen=1;
In order to change the default recursion limit from 100 to no limit for CTEs, the query hints MAXRECURSION is set to 0.
Note: I include the rank order to choose the first occurrence or all longest repeating substrings in the query.
Conclusion:
As you can see, all three of the above solutions provide variations of the result we were looking for, but do so in creatively different styles. Each of these solutions leverages different SQL Server language constructs and includes different considerations in the final solutions, including one solution that leverage new constructs from SQL Server 2012. I hope that you are able to learn a lot by trying out the problem yourself and then reading through the additional solutions.
Special thanks to Jingyang, Steven, and Kent for their valuable forums contribution and for contributing to this series!
Hope that helps,
Sam Lester (MSFT)
Contributor Bios:
Steven Wang has worked with SQL server for more than 10 years. Steven is an active SQL server community participant and speaks at events such as TechEd, CodeCamp, SQL User Group etc.
Blog: www.MSBICOE.com | LinkedIn: Steven Wang | MSDN Profile: Steven Wang - Shangzhou
Kent Waldrop started working with Sybase Transact SQL in 1989 as an application developer and continued working with Sybase until 1995 when SQL Server 6 was released. At that time, he became a Microsoft SQL Server database administrator and has continued to work with Microsoft SQL Server ever since. Currently he is a database architect working with Microsoft SQL Server, Oracle and UDB/DB2.
Jingyang Li has been working with SQL Server since he began his IT career as an ASP.NET developer in 2001. He enjoys working with T-SQL and recently took a full time job as a DBA. He has been participating in the Microsoft forums with alias Limno.
Comments
Anonymous
October 10, 2012
For the first solution, if we need to find our the all occurrences for the longest repeating strings, then tweak the code a little bit to use TOP with TIE clause as below: If Object_ID('dbo.LongestStr_Number', 'U') Is Null Begin SELECT TOP 8000 IDENTITY(INT,1,1) AS Number INTO dbo.LongestStr_Number FROM master.sys.all_columns A CROSS JOIN master.sys.all_columns B; ALTER TABLE dbo.LongestStr_Number ADD CONSTRAINT PK_LongestStr_Number PRIMARY KEY CLUSTERED (Number) WITH FILLFACTOR = 100; End --Solution Starts from here. The solution works only for SQL server 2012, by some change can be easily used for SQL server 2005/08 Declare @MyString varchar(8000); Declare @Length int; Set @MyString = 'ABAB CDCDCDCD ABAB'; Set @Length = Len(@MyString); ;With RepeatingNumber AS ( Select A.Number, Row_Number() Over(Partition By A.Number Order By (Select 0)) AS Position From LongestStr_Number A Cross Apply ( Select 1 As Col From LongestStr_Number B Where Number <= A.Number ) C Where A.Number between @Length/2 + 1 and @Length ) , CTE As ( Select A.Number, B.MySubString, B.StringLen , Count(B.MySubString) Over(Partition By A.Number, B.MySubString) As MyCount , A.Position - First_Value(A.Position) Over(partition By A.Number, B.MySubstring Order by A.Position Rows Unbounded Preceding) - B.StringLen As PositionGap From RepeatingNumber A CROSS APPly ( Select Substring(@MyString, A.Position, @Length - A.Number + 1) As MySubstring , @Length - A.Number + 1 As StringLen ) B ) Select Top 1 with Ties MySubstring As longest_repeating_String From CTE Where MyCount > 1 and PositionGap >= 0 Order by StringLen Desc Thanks. --StevenAnonymous
October 11, 2012
The solution I provided works better if you have a table of numbers or use a better numbers CTE. Probably better to comment out this stuff: -- Solution starts here --;WITH Numbers AS --( -- SELECT 1 AS n -- UNION ALL -- SELECT n + 1 AS n FROM Numbers WHERE n < 400 --) --,findRepeatChars as and instead use with findRepeatChars as Sorry for the confusion. KentAnonymous
October 12, 2012
Thanks Steven, I updated your solution above.Anonymous
October 12, 2012
Thanks Kent, I made the modification to yours as well.Anonymous
June 03, 2013
Good one !!!