System.Collections.Generic Dictionary Capacity, Hashing, and Collisions.
Somebody asked me how to set the initial capacity of System.Collections.Generic.Dictionary<K,V> if one knows that the Dictionary will contain 1000 entries.
The short answer
Dictionary<int,string> numbers = new Dictionary<int,string>(1000);
The long answer
Dictionary<K,V> is a generic type almost identical to the non-generic System.Collections.Hashtable. Both types are implemented as hashtables, but the implementations differ slightly. In particular, they use different algorithms to deal with collisions: Hashtable uses probing and Dictionary<K,V> uses chaining.
Probing means that when a collision is detected, the algorithm tries to store the new entry in the next free bucket. When a hashtable gets close to full (saturated), finding the next free bucket becomes difficult. For this reason, hashtables that use probing grow their size well before they get completely saturated, to minimize collisions. In our Hashtable, the threshold of acceptable saturation is controlled by the loadFactor parameter of the constructor, but this is a topic for another time. Hashtables using probing always have more buckets allocated than the number of entries they store.
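To illustrate the idea (a simplified sketch only - the real Hashtable uses double hashing rather than the linear step shown here, but the collision handling is conceptually the same):

static int FindSlot(string[] keys, string key)
{
    // Map the key's hash code to a bucket index.
    int index = (key.GetHashCode() & 0x7FFFFFFF) % keys.Length;

    // On a collision, probe the next bucket. This assumes the table is
    // never completely full - which is exactly why probing tables grow early.
    while (keys[index] != null && keys[index] != key)
    {
        index = (index + 1) % keys.Length;
    }
    return index; // the key's slot, or the free slot to insert into
}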
Now, the interesting question is whether the capacity specifies the number of buckets we allocate or whether the algorithm is actually smarter. The good news is that we take the loadFactor into account when allocating the buckets. So, if you specify a capacity of, let's say, 1000, we actually allocate more buckets. In fact, we allocate enough buckets to ensure that we don't have to grow the number of buckets until more entries than the specified capacity have been added.
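For example (a hedged illustration: if memory serves, the Hashtable documentation notes that the specified load factor is scaled internally by 0.72, the value found to best balance speed and memory; the numbers below assume that):

using System.Collections;

// With the default load factor, a requested capacity of 1000 yields
// roughly 1000 / 0.72 = ~1389 buckets, rounded up to a prime - enough
// to add 1000 entries without a single resize.
Hashtable numbers = new Hashtable(1000);

// A lower load factor allocates even more buckets, trading memory
// for fewer collisions (valid values are 0.1 through 1.0).
Hashtable sparse = new Hashtable(1000, 0.5f);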
Chaining means that all entries with the same hash code are stored in a list associated with the bucket corresponding to that hash code. A Dictionary<K,V> can therefore be completely saturated; we don't use a loadFactor, and we simply take the next prime number after the capacity to determine how many buckets to allocate.
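In code, the rule looks roughly like this (a sketch of the concept, not the actual BCL helper, which picks primes from a precomputed table):

static int GetBucketCount(int capacity)
{
    // No load factor involved: just the smallest prime >= the capacity.
    int candidate = capacity < 2 ? 2 : capacity;
    while (!IsPrime(candidate)) candidate++;
    return candidate; // e.g. a capacity of 1000 yields 1009 buckets
}

static bool IsPrime(int n)
{
    if (n < 2) return false;
    if (n % 2 == 0) return n == 2;
    for (int i = 3; i * i <= n; i += 2)
        if (n % i == 0) return false;
    return true;
}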
Comments
- Anonymous
August 06, 2004
On chaining, why the next prime number? Why not just use 110% of the capacity asked for? - Anonymous
August 06, 2004
Better yet, why isn't the "growth factor" a user-writable property, so the user can tune his application if smart enough? Of course, a default could be supplied. I too have problems with "next prime number" as a default growth factor, since this increases the growth factor %-wise FASTER for larger collections, which might not be desirable.
I'm thinking of collections in the millions of items... Primes get farther and farther apart and will cause an attempt to increase by larger and larger %'s as the collection grows. A lot of apps I write would probably run out of memory prematurely that way, but not if I was able to tune it with a percentage...and be able to change, e.g., decrease, the growth factor as it grows.
Basically, ANY rule-of-thumb default will ALWAYS have "exceptions to the rule", and you should allow the app to be smart about allocations by overriding the default, esp. those apps that have large collections constrained by the 2/3GB address space size. - Anonymous
August 06, 2004
Not to be too pedantic, but I think the comment re: chaining reallocation is a trifle off. You say: "A Dictionary<K,V> can therefore be completely saturated; we don't use a loadFactor, and we simply take the next prime number after the capacity to determine how many buckets to allocate." Rather, it gets the closest prime number twice as large as the last prime number. This makes sense, since certain prime numbers are near one another, like 1783 and 1787. You don't want to go through all the work of resizing only to increase the size by a small number. (In fact, IIRC from my school days, it is conjectured that there is an infinite # of twin primes - prime numbers that differ by 2 [http://primes.utm.edu/top20/page.php?id=1].)
Steve, your comment about primes always getting further apart as they get bigger is a bit off, as you can see visually at this list of the first 1,000 primes: http://www.utm.edu/research/primes/lists/small/1000.txt
For the Dictionary, it might make sense to let the developer specify the rate at which the bucket size increases, although doubling the size is a good heuristic. For the Hashtable, however, it is important that the next number be a prime: for probing, it is important that the hashed value and the number of buckets be relatively prime, which is guaranteed if the number of buckets is itself prime. Granted, you could still allow a different scaling factor and just manually find the next prime.
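A toy example (an assumed illustration, not BCL code) makes the relatively-prime point concrete - stepping through a table by a fixed probe increment only visits every bucket when the increment and the bucket count are relatively prime:

using System.Collections.Generic;

static int BucketsVisited(int bucketCount, int step)
{
    var seen = new HashSet<int>();
    int index = 0;
    for (int i = 0; i < bucketCount; i++)
    {
        seen.Add(index);
        index = (index + step) % bucketCount;
    }
    return seen.Count;
}

// BucketsVisited(12, 4) returns 3  (gcd is 4, so most buckets are unreachable)
// BucketsVisited(13, 4) returns 13 (13 is prime, so every bucket is reachable)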
For more on Hashtables, check out my article at: http://msdn.microsoft.com/vcsharp/default.aspx?pull=/library/en-us/dv_vstechart/html/datastructures_guide2.asp - Anonymous
August 07, 2004
The comment has been removed - Anonymous
August 13, 2004
Scott, you are right. We do take the closest prime after 2 x the current number of buckets.
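Reusing the IsPrime helper from the sketch in the post above, that growth rule looks roughly like this (again just a sketch, not the actual implementation):

static int NewBucketCount(int currentBuckets)
{
    // Double, then round up to the next prime:
    // 89 buckets grow to 179, then 359, then 719, ...
    int candidate = 2 * currentBuckets;
    while (!IsPrime(candidate)) candidate++;
    return candidate;
}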
Steve, we have considered pluggable growth algorithms, but decided against it for performance and API complexity reasons. We may consider it in the future though.
Peter, we chose chaining after some perf analysis of common scenarios. Both implementations were virtually the same in the majority of benchmarks. Chaining was slightly better in memory consumption in some specific scenarios that we felt were more important than the scenarios where it was actually slightly worse (as you observed), but I cannot recall what they were now. To tell you the truth, I don't think either implementation is a clear winner. Possibly the main reason we ended up with chaining is that we had a brand new implementation with cleaner code that the developers liked more :-)