Udostępnij za pośrednictwem


Generic Levenshtein edit distance with C#

A friend of mine asked me about common algorithms for determining string similarity.  One of the most well-known string similarity algorithms is the Levenshtein edit distance algorithm, possibly because it's frequently used in computer science algorithm courses as a classic example of dynamic programming.  Levenshtein edit distance computes the number of insertions, deletions, or substitutions it would take to go from one string to another.  For example, the distance between "dog" and "dogs" is 1, because we can insert an 's' at the end of "dog" to get to "dogs".  The distance between "puppy" and "lucky" is 3, because we can swap the 'l' for the first 'p', the 'c' for the second 'p', and the 'k' for the third 'p'. Etc.  More information on the Levenshtein edit distance is available in wikipedia.

I wrote up an implementation of the algorithm in C# for comparing strings, but then I realized that with generics it would be very easy to extend it to support any IEnumerable<T> of equatable types.  This makes it easy to determine, for example, the edit distance between two arrays of integers. For anyone interested, here's the resulting code:

 /// <SUMMARY>Computes the Levenshtein Edit Distance between two enumerables.</SUMMARY>
/// <TYPEPARAM name="T">The type of the items in the enumerables.</TYPEPARAM>
/// <PARAM name="x">The first enumerable.</PARAM>
/// <PARAM name="y">The second enumerable.</PARAM>
/// <RETURNS>The edit distance.</RETURNS>
public static int EditDistance<T>(IEnumerable<T> x, IEnumerable<T> y) 
    where T : IEquatable<T>
{
    // Validate parameters
    if (x == null) throw new ArgumentNullException("x");
    if (y == null) throw new ArgumentNullException("y");

    // Convert the parameters into IList instances
    // in order to obtain indexing capabilities
    IList<T> first = x as IList<T> ?? new List<T>(x);
    IList<T> second = y as IList<T> ?? new List<T>(y);

    // Get the length of both.  If either is 0, return
    // the length of the other, since that number of insertions
    // would be required.
    int n = first.Count, m = second.Count;
    if (n == 0) return m;
    if (m == 0) return n;

    // Rather than maintain an entire matrix (which would require O(n*m) space),
    // just store the current row and the next row, each of which has a length m+1,
    // so just O(m) space. Initialize the current row.
    int curRow = 0, nextRow = 1;
    int[][] rows = new int[][] { new int[m + 1], new int[m + 1] };
    for (int j = 0; j <= m; ++j) rows[curRow][j] = j;

    // For each virtual row (since we only have physical storage for two)
    for (int i = 1; i <= n; ++i)
    {
        // Fill in the values in the row
        rows[nextRow][0] = i;
        for (int j = 1; j <= m; ++j)
        {
            int dist1 = rows[curRow][j] + 1;
            int dist2 = rows[nextRow][j - 1] + 1;
            int dist3 = rows[curRow][j - 1] +
                (first[i - 1].Equals(second[j - 1]) ? 0 : 1);

            rows[nextRow][j] = Math.Min(dist1, Math.Min(dist2, dist3));
        }

        // Swap the current and next rows
        if (curRow == 0)
        {
            curRow = 1;
            nextRow = 0;
        }
        else
        {
            curRow = 0;
            nextRow = 1;
        }
    }

    // Return the computed edit distance
    return rows[curRow][m];
}

In addition to generics, this implementation also showcases a new feature of C# 2.0, the null coalescing operator (??).  The ?? operator returns the left-hand operand if it is not null, otherwise it returns the right-hand operator.  I use it in the above code in order to get IList<T> representations of the supplied IEnumerable<T> instances passed as parameters.  For computing the edit distance, I needed the capability to index into each of the lists being compared, but I also wanted to support any enumerable type, such as System.String (which implements IEnumerable<char>).  I can easily create two new List<T> instances from the supplied parameters, but that could be an expensive operation if the lists provided to the method are long, as building a List<T> from an IEnumerable<T> requires copying all of the data from the latter into the former (if the IEnumerable<T> is an ICollection<T>, this might result in a straightforward array copy, but if it doesn't, it results in enumerating through every element in the IEnumerable<T> and Add'ing each individual item to the new List<T>).  Instead, I can save a lot of time and effort if the IEnumerable<T> instance already implements IList<T>.  If it does, I can simply cast the IEnumerable<T> to IList<T>, and if it doesn't, I can take the expensive path of creating the new List<T>.  The ?? operator makes this easy.  Rather than having to write code like:

     IList<T> first;
    if (x is IList<T>) 
    {
        first = (IList<T>)x;
    }
    else 
    {
        first = new List<T>(x);
    }

or:

     IList<T> first = (x is IList<T>) ? (IList<T>)x : new List<T>(x);

or:

     IList<T> first = x as IList<T>;
    if (first == null) first = new List<T>(x);

I can simply write:

     IList<T> first = x as IList<T> ?? new List<T>(x);

Nice and neat.

Comments

  • Anonymous
    May 05, 2006
    Cool, never knew about the ?? operator :) Thanks.

  • Anonymous
    July 31, 2006
    thanks a much for that one man :) i removed templating in my case, which was super easy, so you really made my day ;)

  • Anonymous
    January 21, 2007
     The Levenshtein algorithm is very precise, but it's very slow. It has O(nm) complexity. In a duplicate finding situation, where you have to compare at least a few thousand entries with themselves, you get to have N^2n^2 operations where N is the number of entries and n the average word length.  But humans recognize duplicates in a glance, and humans are slow, so there must be a faster, less precise algorithm for finding string distance. I have tried to implement one and I've created the Sift2 algorithm: http://siderite.blogspot.com/2007/01/super-fast-string-distance-algorithm.html . In my opinion it works very much like Levenshtein, at least where words are concerned, and it's really fast. Complexity O(n*constant) where default constant is 5.  I've received remarcably little input about it, though.

  • Anonymous
    August 18, 2008
    Greate stuff - I was able to get this converted to VB.NET quite easily.

  • Anonymous
    September 30, 2008
    What's your license on this, Stephen? Can I snag it for personal use, if I give you credit?

  • Anonymous
    October 20, 2009
    Excuse me, an example of how to use this method to compare two strings will be very useful, tnx

  • Anonymous
    January 29, 2011
    Awesome!  This is what I've been looking for.  Nice start on add distance to my application metrics.

  • Anonymous
    December 10, 2012
    Strings can be compare as string first = "welcome"; string second = "weldome"; and by calling method like int distance = EditDistance<Char>(first.ToCharArray(), second.ToCharArray());

  • Anonymous
    October 06, 2013
    Hi Stephen, Many thanks for this excellent implementation!  I'm using it to good effect in the "Fuzzy Search" feature of Entrian Source Search ( entrian.com/source-search ). -- Richie Hindle

  • Anonymous
    August 05, 2014
    Very good implementation. The only thing I've changed is the swapping of current and next row to be like: // Swap the current and next rows: 0->1 , 1->0 . curRow = ++curRow % 2; nextRow= ++nextRow % 2;

  • Anonymous
    November 11, 2014
    // Swap the current and next rows nextRow = curRow; curRow = 1 - curRow; :)