A Set Class? [Ari Weinstein]
Recently, I have heard a number of requests for the addition of a Set<T> class to the System.Collection.Generic namespace. Basic implementation of this class that implements only ICollection<T> would cover all the etremely vital operations a set should implement, but that would just be a weak List<T>. We want to cover more interesting functionality; at least Union, Intersect, Difference, Subset, and Disjoint.
While any person using set would expect these operations from a set class, there are many additional interesting options. Of them are symmetric difference, mapping/correspondence, predicates, min/max, and powersets. Does anyone out there need so much goodness, or is it a bit more than what you would need?
Other questions were more centered around the implementation. A set based on a hash would have extremely good performance with near-constant insert and remove, but would have a slow search time O(n). An ordered set, on the other hand, would probably take O(log n) for insert, search, and remove. This means that for most applications, the unordered list provides better results. Of course, the speed of these operations has serious impact on all of the other set operations. We may implement both.
So again, to solicit some more user comments (and thanks for the past ones!):
· Would the addition of a set class to the BCL make sense for the programs you are writing? Would you rather see something else implemented?
· How much functionality would you like to see in the class; do the members in the prototype below satisfy your needs, or are you looking for more (or less)?
· Do you see yourself using sets for ordered or unordered collections of values, or both?
Here's the source for a simple program that does some basic operation on odd and even sets of integers, it highlights the most basic operations one may want to do with a set:
Check it out and respond up here or send me an email at ArielW@microsoft.com!
using
System;
using
System.Collections;
using
System.Collections.Generic;
class
Starter{
static void Main(){
/* Produces the Following output:
Set of even integers under 10:
0
2
4
6
8
Set of odd integers under 10:
1
3
5
7
9
Do the even and odd sets intesect?
no
Even union odd integers:
0
2
4
6
8
1
3
5
7
9
Are even integers a subset of all integers?
yes
Are all integers a subset of even integers?
no
Difference of All integers from Odd integers:
0
2
4
6
8
*/
Set<
int> EvenSet = new Set<int>();
Set<
int> OddSet = new Set<int>();
for(int i=0; i<10; ++i){
if (i % 2 == 0)
EvenSet.Add(i);
else
OddSet.Add(i);
}
Console.WriteLine("Set of even integers under 10:");
foreach (int i in EvenSet)
Console.WriteLine("\t" + i);
Console.WriteLine("\nSet of odd integers under 10:");
foreach (int i in OddSet)
Console.WriteLine("\t" + i);
Console.WriteLine("\nDo the even and odd sets intesect?");
if(EvenSet.IntersectsSet(OddSet))
Console.WriteLine("\tyes");
else
Console.WriteLine("\tno");
Console.WriteLine("\nEven union odd integers:");
Set<
int> EvenAndOddSet = EvenSet.Union(OddSet);
foreach(int i in EvenAndOddSet)
Console.WriteLine("\t" + i);
Console.WriteLine("\nAre even integers a subset of all integers?");
if (EvenSet.IsSubsetOf(EvenAndOddSet))
Console.WriteLine("\tyes");
else
Console.WriteLine("\tno");
Console.WriteLine("\nAre all integers a subset of even integers?");
if (EvenAndOddSet.IsSubsetOf(EvenSet))
Console.WriteLine("\tyes");
else
Console.WriteLine("\tno");
Console.WriteLine("\nDifference of All integers from Odd integers:");
foreach (int i in EvenAndOddSet.Difference(OddSet))
Console.WriteLine("\t" + i);
}
}
public
class Set<T> : ICollection<T>{
private List<T> list;
public Set()
{
list =
new List<T>();
}
//
//ICollection<T> methods
//
public void Add(T item)
{
if(!list.Contains(item))
list.Add(item);
}
public void Clear()
{
list.Clear();
}
public bool Contains(T item)
{
return list.Contains(item);
}
public void CopyTo(T[] array, int arrayIndex)
{
list.CopyTo(array, arrayIndex);
}
public bool Remove(T item)
{
return list.Remove(item);
}
public int Count
{
get {return list.Count;}
}
public bool IsReadOnly
{
get { return false; }
}
IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
// TODO: Add Class1.System.Collections.IEnumerable.GetEnumerator implementation
return list.GetEnumerator();
}
IEnumerator<T> System.Collections.Generic.IEnumerable<T>.GetEnumerator()
{
return list.GetEnumerator();
}
public Set<T> Union(Set<T> otherSet)
{
Set<T> UnionSet =
new Set<T>();
foreach (T item in this)
UnionSet.Add(item);
foreach (T item in otherSet)
UnionSet.Add(item);
return UnionSet;
}
public Set<T> Intersect(Set<T> otherSet)
{
Set<T> IntersectSet =
new Set<T>();
foreach (T item in this)
if(otherSet.Contains(item))
IntersectSet.Add(item);
return IntersectSet;
}
public Set<T> Difference(Set<T> otherSet){
Set<T> DifferenceSet =
new Set<T>();
foreach (T item in this)
DifferenceSet.Add(item);
foreach (T item in otherSet)
DifferenceSet.Remove(item);
return DifferenceSet;
}
public bool IsSubsetOf(Set<T> otherSet){
foreach (T item in this)
if(!otherSet.Contains(item))
return false;
return true;
}
public bool IntersectsSet(Set<T> otherSet)
{
return !IsDisjointFrom(otherSet);
}
public bool IsDisjointFrom(Set<T> otherSet){
foreach (T item in this)
if(otherSet.Contains(item))
return false;
return true;
}
}
Comments
- Anonymous
September 01, 2005
I have to say I'm a fan of the Java approach to the Collections classes better than the one taken by the .NET framework so far. Admittedly I haven't looked into the .NET approach in great detail but at first glance there are a lot of things missing compared with the Java model.
The approach taken by Java is to provide a rich set (no pun intended) of interfaces - Collection, Map, Set, List, SortedSet, SortedMap - and then provide a few alternative implementations of each interface with different performance characteristics.
So you have ArrayList and LinkedList, TreeMap and HashMap, and TreeSet and HashSet. All of these can be used together and you can choose the implementation whose performance characteristics best suit your particular need.
Want fast constant-time insertion and will always be iterating in order? Use LinkedList. Don't care so much about inserting but need fast random access? Use ArrayList. Etc.
List<T> may be extremely well optimized but, assuming it's implemented as an array, it'll never be as good as a linked list for insertion at arbitrary points in a large list.
The way this applies to your question is that, naturally, I believe that it would be nice to have an ISet<T> with both concrete implementations.
FWIW, I'd also like to see common specialized collections like an LRU cache and a Dictionary that preserves order-of-insertion (at the expense of lookup performance) be part of the framework too... - Anonymous
September 01, 2005
The standard set operators like union, intersection, etc can be implemented wickedly fast if the sets are either sorted linked lists (then it can be done in-place) or sorted arrays.
Just go through both lists at once, I assume you can figure out the rest. - Anonymous
September 01, 2005
I agree with the ISet<T> approach, but otherwise suspect that a set optimized for writing wouldn't be as useful as one optimized for reading (and, therefor, complicated calculations upon).
For completeness, is there an implementation that possibly takes more memory, but remains balanced in terms of reading and writing. Depending on the application, the tradeoff may be worth the overhead.
.. hence the ISet<T> - Anonymous
September 01, 2005
The comment has been removed - Anonymous
September 01, 2005
I agree with the poster who suggests that multiple ISet<T> implementations are needed.
I know ive needed a HashSet, a ListSet and more in my time. - Anonymous
September 01, 2005
The comment has been removed - Anonymous
September 01, 2005
I haven't really had a really big need for a Set-class yet in my projects. I can image some uses though so it would be nice to have one, but since it is relatively easy to implement and since I wouldn't need it very often, I could live with having to implement one myself if I would need one. And as far as the complicated features are concerned (e.g. calculating the symmetric difference of two sets), can't say that I ever had a need for that. Naturally, it depends on what kind of projects one develops. I can imagine that somebody who's into developing games, financial or statistical software for example might have a need for such a class.
However, while we're on the topic of enhancements to System.Collections and System.Collections.Generic: very often, I need a sort of Hashtable and Dictionary<T> that maintains the sortorder in which the elements were added.
I currently implement such a class myself like this:
public class MyHashtable: ICollection
{
private ArrayList mItems = new ArrayList();
private Hashtable mKeys = new Hashtable();
...
}
mItems contains all the items in the original sortorder. mKeys contains the key and index in mItems of every item. This allows this class to be quickly searched using index and key. - Anonymous
September 01, 2005
The comment has been removed - Anonymous
September 01, 2005
Michael, I believe that Dictionary<K,V> actually preserves the orders in which items are added, provided that no objects were previously removed. - Anonymous
September 01, 2005
A heap-based Priority Queue is also a nice add. That is a really easy add and actually requires less code than some of the other collection classes that have already been added. - Anonymous
September 01, 2005
The comment has been removed - Anonymous
September 02, 2005
I haven't tought about this more than 1 minute, but why not create a Set class that would simply apply set operations on existing collections? Something like the Sort pattern? - Anonymous
September 02, 2005
I'm suprised you did not mentioned Peter Golde's PowerCollections project. I know he has worked closely with the BCL team on that. He's implemented Set<T>, with all the set operations, and a host of other collections mentioned in the comments here. In addition, he has a wonderful set of STL-style algorithms implemented as static methods that operate against the standard generic collection interfaces. I would simply like to see his implementations slurped into the BCL. They are quite elegant, and designed to work WITH the BCL rather than as a replacement for the existing collections. I am using them extensively. - Anonymous
September 06, 2005
lets see a non-blocking queue implementation as well.
http://www.cs.rochester.edu/u/scott/synchronization/pseudocode/queues.html - Anonymous
September 06, 2005
Cool. I definitely like the idea of having a Set class built into the BCL! (Been wishing for that since 1.0...)
In your first paragraph, you said that a Set without its own interface would basically just be a List<T>. This isn't right, because sets (by definition) don't contain duplicates. If you add the same element twice, the second add is just ignored.
I've used set-like stuff before. Usually I've just used a Hashtable, but a genuine Set class would make the code more readable. The features I'd want would be foreach (with no ordering implied, just like Hashtable.Keys), Add, Remove, Contains, Union, Intersection, Clone (shallow copy), and possibly Difference.
Anything that implies ordering (either keeping elements in the order they were originally added, or sorting them) is making the class into something other than a real set, so while such things might be useful on rare occasions, I don't see much value in putting them in the BCL. (If you're keeping the original order, but you add the same element a second time, does it stay in the original location, or move to the end? If you don't try to impose ordering onto the Set, life is so much simpler.) A Set that requires its elements to be sorted, in particular, seems like a bad idea, because most of the time, I'd be using objects that don't have any innate ordering -- accounts, printers, etc. -- and it'd be silly to impose ordering just to use a Set class.
I definitely agree with what others have said about having an ISet interface. That way, even if there's only one Set implementation that ships with the BCL 2.0, there's still a standard interface, so others can write their own and still have them work with everyone else's code. - Anonymous
November 08, 2006
HashSet is a new generic collection that has been added to the System.Collections.Generic namespace.