다음을 통해 공유


C#: Calculating Percentage Similarity of 2 strings

This article is all about calculating the similarity between two strings or two words in C#. String similarity confidentially reflects relationships between two words or strings. In this article my focus is calculating similarity in strings instead of meanings of words.

I implemented this algorithm while developing a Windows Store app that will do partial matching. I used the Levenshtein algorithm to do it. If you want to understand this algorithm without going deep into theory, consider reading about Levenshtein algorithm.

Problem Domain

String similarity algorithm was to be developed that will be able to recognize changes in word character order. Algorithm will simply tell percentage similarity between two words or strings.

Solutions

Above problem can be solved in two steps:

  1. Calculating number of steps required to transform one string to other.
  2. Calculating percentage similarity between two words from above calculated steps.

Calculating Steps Required To Transform

Using Levenshtein algorithm we can calculate number of required steps to transform one word to other. Levenshtein algorithm can be implemented into following steps:

  • Step 1: If target word length is zero or source word length is zero then number of steps required to transform will be equals to source word length and target word length respectively. For example, to transform ‘hello’ to ”, required steps will be 5 (equals to the length of source word) or to transform ” to ‘hello’ required steps will also be 5 (equals to the length of target word).

  • Step 2: If both target word and source word length is greater then zero. Computed distance on each step will be stored in a m x n matrix where m = source word length + 1 and n = target word length + 1. Fill matrix first row and first column with consecutive integers starting from zero at top left corner.

    for (int i = 0; i <= sourceWordCount; distance[i, 0] = i++);
    for (int j = 0; j <= targetWordCount; distance[0, j] = j++);
    
  • Step 3: For each remaining cell at i x j we have to calculate cost that will be 0 if char at i – 1 in source string is equal to char at j – 1 in target string else it will be 1.

    int cost = (target[j - 1] == source[i - 1]) ? 0 : 1;
    
  • Step 4: For each cell except from row 1 and column 1 distance can be calculated by using below code

    distance[i, j] = Math.Min(Math.Min(distance[i - 1, j] + 1, distance[i, j - 1] + 1), distance[i - 1, j - 1] + cost);
    
  • Finally steps required to transform will be in bottom left corner of matrix at the end.

Complete code to calculate steps required to transform one word to other will be:



      /// <summary>  
      /// Returns the number of steps required to transform the source string  
      /// into the target string.  
      /// </summary>  
      int ComputeLevenshteinDistance(string source, string target)
      {  
                    if ((source == null) || (target == null)) return  0;  
                    if ((source.Length == 0) || (target.Length == 0))  return  0;  
                    if (source == target) return source.Length;  
               
                    int sourceWordCount = source.Length;  
                    int targetWordCount = target.Length;  
               
                    // Step 1      
                    if (sourceWordCount == 0)  
                        return targetWordCount;  
               
                    if (targetWordCount == 0)  
                        return sourceWordCount;  
               
                    int      [,] distance =       new  int[sourceWordCount + 1, targetWordCount + 1];  
               
                    // Step 2      
                    for (int i = 0; i <= sourceWordCount; distance[i, 0] = i++);  
                    for (int j = 0; j <= targetWordCount; distance[0, j] = j++);  
               
                    for (int i = 1; i <= sourceWordCount; i++)  
                    {      
                        for (int j = 1; j <= targetWordCount; j++)  
                        {      
                            // Step 3      
                            int cost = (target[j - 1] == source[i - 1]) ? 0 : 1;  
               
                            // Step 4      
                            distance[i, j] = Math.Min(Math.Min(distance[i - 1, j] + 1, distance[i, j - 1] + 1), distance[i - 1, j - 1] + cost);      
                        }      
                    }      
               
                    return distance[sourceWordCount, targetWordCount];  
      }  

Calculating Percentage Similarity

Percentage similarity between two words can be calculated after calculating number of steps required to transform one word to other.



      /// <summary>  
      /// Calculate percentage similarity of two strings  
      /// <param name="source">Source String to Compare with</param>  
      /// <param name="target">Targeted String to Compare</param>  
      /// <returns>Return Similarity between two strings from 0 to 1.0</returns>  
      /// </summary>  
      double CalculateSimilarity(string source, string target)
      {  
                    if ((source == null) || (target == null)) return  0.0;  
                    if ((source.Length == 0) || (target.Length == 0))  return  0.0;  
                    if (source == target) return 1.0;  
               
                    int stepsToSame = ComputeLevenshteinDistance(source, target);  
                    return (1.0 - ((double)stepsToSame / (double)Math.Max(source.Length, target.Length)));  
      }  

Remember: This operation is case sensitive. To ignore case you can simple convert both target and source string into lower or upper case.

Source Code

Download complete source code with demo console project from https://github.com/aztnan/StringSimilarity