Find Duplicates using LINQ
Sometimes you need to find the duplicates in a list. I’m currently developing a little utility that tests code in a word processing document (now posted, you can find it here). Each code snippet in the document has an identifier, and one of the rules that I’m imposing on this code testing utility is that there must be no duplicate identifiers in the set of documents that contain code snippets to be tested. An easy way to find duplicates is to write a query that groups by the identifier and then filters for groups that have more than one member. In the following example, we want to find out that 4 and 3 are duplicates:
int[] listOfItems = new[] { 4, 2, 3, 1, 6, 4, 3 };
var duplicates = listOfItems
    .GroupBy(i => i)
    .Where(g => g.Count() > 1)
    .Select(g => g.Key);
foreach (var d in duplicates)
    Console.WriteLine(d);
This produces the following:
4
3
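The same group-and-filter query can be applied to the identifier rule described above. The following is only a minimal sketch, not the code of the actual utility: the `Snippet` class, its `Id` and `Document` properties, the `DuplicateFinder` helper, and the sample file names are all hypothetical names chosen for illustration. It reports each duplicated identifier together with the documents it appears in.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type: a snippet identifier and the document it was found in.
public class Snippet
{
    public string Id { get; set; }
    public string Document { get; set; }
}

public static class DuplicateFinder
{
    // Returns each identifier that occurs more than once, mapped to the
    // documents in which it occurs (in source order).
    public static Dictionary<string, List<string>> FindDuplicateIds(IEnumerable<Snippet> snippets)
    {
        return snippets
            .GroupBy(s => s.Id)
            // Skip(1).Any() is true when the group has at least two members.
            .Where(g => g.Skip(1).Any())
            .ToDictionary(g => g.Key, g => g.Select(s => s.Document).ToList());
    }
}

public static class Program
{
    public static void Main()
    {
        // Sample data, invented for this sketch.
        var snippets = new[]
        {
            new Snippet { Id = "S001", Document = "Intro.docx" },
            new Snippet { Id = "S002", Document = "Intro.docx" },
            new Snippet { Id = "S001", Document = "Advanced.docx" },
            new Snippet { Id = "S003", Document = "Advanced.docx" },
        };

        foreach (var pair in DuplicateFinder.FindDuplicateIds(snippets))
            Console.WriteLine(pair.Key + ": " + string.Join(", ", pair.Value));
    }
}
```

With the sample data above, this prints `S001: Intro.docx, Advanced.docx`. Keeping the whole group rather than just `g.Key` is a design choice: it makes it possible to say where each duplicate lives, which is usually what you want in an error message.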
Comments
Anonymous
January 15, 2009
Hi Eric, I am looking to remove duplicates when loading XML into a Dictionary. I have achieved it using the following:
counties = XDocument.Load(HttpContext.Current.Server.MapPath("/data/counties.xml"))
    .Descendants("Row")
    .GroupBy(i => i.Attribute("Code").Value)
    .Where(g => g.Count() == 1)
    .Select(g => new { Code = g.Key, Desc = g.First().Attribute("Descrip").Value })
    .ToDictionary(x => x.Code, x => x.Desc);
So I remove the duplicates by grouping by the "Code" attribute, allowing only groups with one element. Is there a more elegant way?
Anonymous
January 15, 2009
Actually, the above code doesn't work as I want it to: .Where(g => g.Count() == 1) means that if duplicates exist, both entries are removed, not just the "copies".
Anonymous
January 16, 2009
Hi Daniel, I think that you can just remove the Where, and it will work. You would then be adding one entry into the dictionary for each unique Key. You also don't need the Select - you can write the ToDictionary like this:
    .ToDictionary(g => g.Key, g => g.First().Attribute("Descrip").Value)
-Eric
Anonymous
March 18, 2009
Hi Eric, I am looking to do something similar to your example but I have my items in 2 collections:
string[] x = new string[] { @"firstPath\a.txt", @"firstPath\b.txt", @"firstPath\c.txt", @"firstPath\d.txt" };
string[] y = new string[] { @"secondPath\a.txt", @"secondPath\e.txt", @"secondPath\f.txt", @"secondPath\g.txt" };
I want to end up with the results:
{ @"firstPath\a.txt", @"secondPath\e.txt", @"secondPath\f.txt", @"secondPath\g.txt" }
I've tried different combinations of Except(), Intersect(), and Union() with lambdas but just can't seem to get the right results. Any assistance is greatly appreciated!
Anonymous
March 18, 2009
Hi Dave, You could do something like this:
// uniqueness is based on the 'BaseName' so here is a function to get it
static string BaseName(string path)
{
    return path.Split('\\').ElementAt(1);
}

static void Main(string[] args)
{
    string[] x = new string[] { @"firstPath\a.txt", @"firstPath\b.txt", @"firstPath\c.txt", @"firstPath\d.txt" };
    string[] y = new string[] { @"secondPath\a.txt", @"secondPath\e.txt", @"secondPath\f.txt", @"secondPath\g.txt" };
    // find all elements in x that are also in y
    var x1 = x.Where(p => y.Select(z => BaseName(z)).Contains(BaseName(p)));
    // find all elements in y that are not in x
    var y1 = y.Where(p => !x1.Select(z => BaseName(z)).Contains(BaseName(p)));
    // concatenate for complete collection
    var all = x1.Concat(y1);
    foreach (var z in all)
        Console.WriteLine(z);
}
You could optimize this by materializing the intermediate results into arrays; whether that is worthwhile depends on your needs and your real-world data.
-Eric
Anonymous
March 18, 2009
Hi Eric, Thank you so much for your help and quick response! It works perfectly! I have a few follow-up questions (just for my own learning):
- Though the end result is the same, is there any reason one would use Concat() instead of Union() in this case? Note that order is of no importance here.
- Is there any way to do this as I was originally trying to do, using any of Intersect(), Except(), or Where()?
- What is "Best Practice" - using the strongly-typed generic versions of LINQ methods or the non-generic (e.g. Select() vs. Select<TSource, TResult>())?
Thanks again for your time - this is extremely helpful!
Anonymous
March 19, 2009
Hi Dave,
>> Though the end result is the same, is there any reason one would use Concat() instead of Union() in this case? Note that order is of no importance here.
Concat will perform better than Union, which must check for duplicates. Both are lazy, but Concat simply yields the items of the first collection followed by the items of the second, whereas Union has to remember every item it has already yielded and test each incoming item against that set before yielding it.
>> Is there any way to do this as I was originally to do using any of Intersect(), Except(), or Where()?
Problem is, you are not really intersecting sets. Your rules are: if the basename exists in the first set, take it. Then take all items from the second list where the basename isn't in the first set. If you don't care which list the items in the result come from, then you could use Intersect to find the items that appear in both lists, and you could use Except to include the items from each source list that don't exist in the other. Alternatively, you could keep around a 'priority' to indicate which list the full path should come from. One approach to using those methods would be to make your own equality comparer.
>> What is "Best Practice" - using the strongly-typed generic versions of LINQ methods or the non-generic (e.g. Select() vs. Select<TSource, TResult>())?
Actually, in both cases, you are using the same Select<TSource, TResult> method. When you don't specify the type parameters, the C# compiler infers those types. There are a number of places where the C# compiler infers types - using the var keyword to declare a variable, or using a lambda expression where you don't specify the types of the arguments to the lambda. This is another place where the compiler does type inference.
-Eric
Anonymous
July 06, 2009
Thanks, this little code snippet saved me a lot of manual labor.
Anonymous
June 03, 2010
Is there any short method to find duplicate keys in any row or column?
Anonymous
December 03, 2014
Hi Eric, I think we have to include a little more code: if x contains duplicates and y also contains duplicates, Concat() is not sufficient.
Anonymous
January 14, 2015
Excellent! Just had to remember to put: using System.Linq; Also, I have a list of items with a "string Name" field, and I just want to know if the names are duplicated, so I had var query = myList.GroupBy(x => x.Name)
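
The query in the comment above is cut off, and the following does not attempt to reconstruct it. It is only a minimal sketch of one way to answer the yes/no question "are any names duplicated?" using the same grouping idea as the post. The `Item` class and the `NameChecks.HasDuplicateNames` helper are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical item type with a Name property, standing in for the
// element type of the commenter's list.
public class Item
{
    public string Name { get; set; }
}

public static class NameChecks
{
    // True if at least two items share the same Name.
    public static bool HasDuplicateNames(IEnumerable<Item> items)
    {
        return items
            .GroupBy(x => x.Name)
            // A group with a second element means the name is repeated.
            .Any(g => g.Skip(1).Any());
    }
}

public static class Program
{
    public static void Main()
    {
        // Sample data, invented for this sketch.
        var myList = new List<Item>
        {
            new Item { Name = "alpha" },
            new Item { Name = "beta" },
            new Item { Name = "alpha" },
        };

        Console.WriteLine(NameChecks.HasDuplicateNames(myList)); // prints True
    }
}
```

If you need the duplicated names themselves rather than a yes/no answer, the `Where(g => g.Count() > 1).Select(g => g.Key)` form from the post applies unchanged with `x => x.Name` as the grouping key.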