Udostępnij za pośrednictwem


LINQ Distinct: IEqualityComparer and Func extension method Implementations

As you may know when working with LINQ there is a method that allows one to return distinct elements from a sequence. Two prototypes are supported:

 Distinct<TSource>(IEqualityComparer<TSource>)
Distinct()

Unlike most sequence operations that take some form of a Func parameter, the Distinct() method takes an instance of an IEqualityComparer Type. It is the implementation of this Type that I will be discussing which will hopefully provide some insight into why this approach has been taken.

I am going to demonstrate the Distinct() method using the following class definition:

 public class Product
{
    // Key comparison elements
    public string Reference { get; set; }
    public ProductVersion ProductVersion { get; set; }
 
    // Other attributes not used on comparison
    public string Description { get; set; }
 
    // Ensure have a sensible string value
    public override string ToString()
    {
        if (this.ProductVersion == null)
        {
            return string.Format("Product: {0}", this.Reference ?? "Null Product");
        }
        else
        {
            return string.Format("Product: {0}, Version: {1}-{2}-{3}", this.Reference, this.ProductVersion.VersionName, this.ProductVersion.VersionNumber, this.ProductVersion.SubVersionNumber);
        }
    }
 
    // Derive a unique identifier
    public string UniqueIdentifier
    {
        get
        {
            if (this.ProductVersion == null)
            {
                return this.Reference;
            }
            else
            {
                return string.Format("{0}:{1}-{2:00}-{3:00}", this.Reference, this.ProductVersion.VersionName, this.ProductVersion.VersionNumber, this.ProductVersion.SubVersionNumber);
            }
        }
    }
}
 
public class ProductVersion
{
    public int VersionNumber { get; set; }
    public int SubVersionNumber { get; set; }
    public string VersionName { get; set; }
}

A distinct product is one defined where the Reference and ProductVersion values are unalike. The ProductVersion will also only be distinct if all its properties are unalike.

For testing and demonstration purposes the following code will be used:

 // Build collection
List<Product> products = new List<Product>();
products.Add(new Product() { Reference = "AA", Description = "Bike AA", ProductVersion = new ProductVersion() { VersionNumber = 1, SubVersionNumber = 0, VersionName = "A" } });
products.Add(new Product() { Reference = "AA", Description = "Bike AAv2", ProductVersion = new ProductVersion() { VersionNumber = 1, SubVersionNumber = 1, VersionName = "A" } });
products.Add(new Product() { Reference = "AA", Description = "Bike AAv3", ProductVersion = new ProductVersion() { VersionNumber = 2, SubVersionNumber = 0, VersionName = "A" } });
products.Add(new Product() { Reference = "AB", Description = "Bike AB", ProductVersion = new ProductVersion() { VersionNumber = 1, SubVersionNumber = 0, VersionName = "B" } });
products.Add(new Product() { Reference = "AC", Description = "Bike ACv1", ProductVersion = new ProductVersion() { VersionNumber = 1, SubVersionNumber = 0, VersionName = "C" } });
products.Add(new Product() { Reference = "AC", Description = "Bike ACv2", ProductVersion = new ProductVersion() { VersionNumber = 1, SubVersionNumber = 1, VersionName = "C" } });
 
// Add some duplicates
products.Add(new Product() { Reference = "AA", Description = "Bike AAv2", ProductVersion = new ProductVersion() { VersionNumber = 1, SubVersionNumber = 1, VersionName = "A" } });
products.Add(new Product() { Reference = "AA", Description = "Bike AAv3", ProductVersion = new ProductVersion() { VersionNumber = 2, SubVersionNumber = 0, VersionName = "A" } });
products.Add(new Product() { Reference = "AC", Description = "Bike ACv2", ProductVersion = new ProductVersion() { VersionNumber = 1, SubVersionNumber = 1, VersionName = "C" } });
 
IEnumerable<Product> distinctProducts = products.Distinct<Product>(new ProductEqualityComparer());
 
foreach (Product product in distinctProducts.OrderBy(prod => prod.UniqueIdentifier)) Console.WriteLine(product.ToString());

The line creating the distinct collection is:

 IEnumerable<Product> distinctProducts =
    products.Distinct<Product>(new ProductEqualityComparer());

The initial attempt at an IEqualityComparer implementation may be something along the lines of:

 public class ProductEqualityComparer : IEqualityComparer<Product>
{
    public bool Equals(Product x, Product y)
    {
        // If reference same object including null then return true
        if (object.ReferenceEquals(x, y))
        {
            return true;
        }
 
        // If one object null the return false
        if (object.ReferenceEquals(x, null) || object.ReferenceEquals(y, null))
        {
            return false;
        }
 
        // Compare properties for equality
        return (x.Reference == y.Reference && x.ProductVersion == y.ProductVersion);
    }
 
    public int GetHashCode(Product obj)
    {
        return obj.GetHashCode();
    }
}

If you run and debugs this code you will quickly see that the Distinct() operation does not return a collection of products that are distinct from each other according to the stated criteria for uniqueness. If you debug the code you will also see that the Equals() method of the comparer class never actually gets called. So what is happening?

The important method here is GetHashCode(). What is actually happening is that the Distinct() operation is first using the comparers implementation of GetHashCode() to first derive a hash-code for the Product. Only when a product returns like hash-codes is Equals() called.

This implementation returns a hash-code based on the object instance. Thus one based on the product values is needed:

 public int GetHashCode(Product obj)
{
    if (object.ReferenceEquals(obj, null))
    { 
        return 0;
    }
 
    int referencehHash =
        obj.Reference == null ? 0 : obj.Reference.GetHashCode();
    int versionHash =
        obj.ProductVersion == null ? 0 : obj.ProductVersion.GetHashCode();
 
    return referencehHash ^ versionHash;
}

Of course this implementation assumes a couple of things. Firstly that equality comparisons of the ProductVersion return meaningful results, and secondly that the GetHashCode() returns a value based on the ProductVersion instance values; rather than the base implementation. For completeness here is the ProductVersion implementation which overrides the == operator:

 public class ProductVersion
{
    public int VersionNumber { get; set; }
    public int SubVersionNumber { get; set; }
    public string VersionName { get; set; }
 
    public static bool operator ==(ProductVersion x, ProductVersion y)
    {
        // If reference same object including null then return true
        if (object.ReferenceEquals(x, y))
        {
            return true;
        }
 
        // If one object null the return false
        if (object.ReferenceEquals(x, null) || object.ReferenceEquals(y, null))
        {
            return false;
        }
 
        // Compare object property values
        return VersionsEqual(x, y);
    }
 
    public static bool operator !=(ProductVersion x, ProductVersion y)
    {
        return !(x == y);
    }
 
    public override bool Equals(object obj)
    {
        // If comparing to null return false
        if (object.ReferenceEquals(obj, null))
        {
            return false;
        }
 
        // Cast and compare object property values
        ProductVersion version = (ProductVersion)obj;
        return VersionsEqual(this, version);
    }
 
    public override int GetHashCode()
    {
        int nameHash = this.VersionName == null ? 0 : this.VersionName.GetHashCode();
 
        return nameHash ^ this.VersionNumber ^ this.SubVersionNumber;
    }
 
    private static bool VersionsEqual(ProductVersion x, ProductVersion y)
    {
        // Compare properties for equality
        return (x.VersionNumber == y.VersionNumber && x.SubVersionNumber == y.SubVersionNumber && x.VersionName == y.VersionName);
    }
}

Remember, one has to use object.ReferenceEquals() calls for null checking when overriding the == operator

The IEqualityComparer thus firstly utilizes the product hash-code to place the product into a hast table. Only if a collision occurs is is the Equals() method called to determine uniqueness. You can easily see this in action if you code a GetHashCode() method that returns constant value for all products.

Once all this has been implemented the Distinct() methods works as expected, returning:

 Product: AA, Version: A-1-0
Product: AA, Version: A-1-1
Product: AA, Version: A-2-0
Product: AB, Version: B-1-0
Product: AC, Version: C-1-0
Product: AC, Version: C-1-1

In conclusion, distinct elements are found only when the hash-code are equal and the products with the same hash-code are equal based on the Equals() method call.

Hopefully one can see that this implementation is a lot more efficient than a process based on sorting the collection;  and explains why the returned collection is in an unspecified order.

However, an extension to IEnumerable for the Distinct() method that allows one to use a Func to determine a property on which the Distinct() method can operate, can provide a much simpler interface for when distinct operations are needed on single property values.

Using the Product type outlined above, this would allow the following Distinct() operation:

 IEnumerable<Product> distinctProducts =
    products.Distinct<Product, string>(prod => prod.UniqueIdentifier);

The UniqueIdentifier property is a read-only value that determines the product uniqueness:

 public string UniqueIdentifier
{
    get
    {
        if (this.ProductVersion == null)
        {
            return this.Reference;
        }
        else
        {
            return string.Format("{0}:{1}-{2:00}-{3:00}",
                this.Reference,
                this.ProductVersion.VersionName,
                this.ProductVersion.VersionNumber,
                this.ProductVersion.SubVersionNumber);
        }
    }
}

The complete implementation for this Distinct() extension method is as follows:

 public static class EnumerableExtensions
{
    public static IEnumerable<TSource> Distinct<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> getKey)
    {
        Dictionary<TKey, TSource> dictionary = new Dictionary<TKey, TSource>();
 
        foreach (TSource item in source)
        {
            TKey key = getKey(item);
            if (!dictionary.ContainsKey(key))
            {
                dictionary.Add(key, item);
            }
        }
 
        return dictionary.Select(item => item.Value);
    }
}

The Func in the Distinct() method calls provide a mapping from the source type, TSource, to the property type, TKey, on which the distinct comparisons will be performed.

In this implementation the determination of distinctness is made through the use of the Dictionary<TKey, TValue> generic class.

For each item in the Enumerable source a dictionary lookup is performed using the property value calculated in the Func expression. If no entry is found then the item is added to the dictionary, otherwise the duplicate item is skipped and the next item processed. Using this methodology, the dictionary will ultimately contain only distinct elements, accessible through the final Select() projection.

One could however use a HashSet, the difference in implementation being that the HasSet and yield approach uses deferred execution:

 public static IEnumerable<TSource> Distinct<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> getKey)
{
    HashSet<TKey> dictionary = new HashSet<TKey>();

    foreach (TSource item in source)
    {
        TKey key = getKey(item);
        if (dictionary.Add(key))
        {
            yield return item;
        }
    }
}

So how does the performance of the two options compare? As a comparison I executed 500,000 Distinct() method calls over a list containing 9 items with 3 duplicates. The IEqualityComparer approach took 2 seconds and the Func extension took 10 seconds; showing the hashing method is faster.

However if you just need to perform Distinct operations on single properties for small collections/lists, then the extension method is a much simpler programmatic approach; with reasonable performance.