Udostępnij za pośrednictwem


GUID guide, part two

So how is it that a GUID can be guaranteed to be unique without some sort of central authority that ensures uniqueness, a la the ISBN system?

Well, first off, notice that the number of possible GUIDs is vastly larger than the number of possible ISBNs. Because the last of the thirteen digits is a checksum, there are only 1012 possible ISBNs. That is about a hundred unique ISBNs for every person on earth. That's almost exactly 240, so you could represent an ISBN by a 40 bit number (again, ignoring the checksum). There are 2128 possible GUIDs; that's about 40 billion billion billion unique GUIDs for every person on earth. This alone gives us the intuition that it ought to be pretty easy to ensure that two of them never collide; there are a lot of GUIDs to choose from!

There are a number of possible strategies for making a unique GUID, and in fact information about the strategy used is encoded in the first four bits of the third "group"; almost every GUID you see will be of the form {xxxxxxxx-xxxx-1xxx-xxxx-xxxxxxxxxxxx} or {xxxxxxxx-xxxx-4xxx-xxxx-xxxxxxxxxxxx}.

If there is a one in that place then the algorithm used to guarantee uniqueness is essentially a variation on the ISBN strategy. The GUID is guaranteed to be unique "in space" by choosing some of the bits as the MAC address of the network card in the machine. (The tricky problem of ensuring that no two network cards in the world have the same MAC address is solved somehow by someone else; how that problem is solved, we don't particularly care. The cost of solving that problem is passed on to you, the consumer, in the purchase cost of the network card.)

(UPDATE: Larry Osterman pointed out to me that of course, this MAC address solution is not foolproof. First, you could deliberately or accidentally set your MAC address to an already-used value. Second, a hardware manufacturer might forget to set the address at all and have it be all zeros. Third, two virtual machines might be using the same physical network card and those virtual machines could be generating GUIDs at exactly the same time fast enough to cause collisions.)

We know that we can rely on this as a source of uniqueness in space. Most of the remaining bits are a high-resolution timestamp. Therefore every generated GUID is unique in both space and time, and is therefore globally unique.

This system does in practice have a few weak spots. The most obvious one is that it does not work well if the machine does not have a network card! Version-one GUIDs generated on machines without network cards are not guaranteed to be unique. The less obvious one is that there is a slim chance that two GUIDs could be generated "at the same time". Perhaps two GUID generators were running on two different processors in the same machine at the same time. Or perhaps a GUID was generated, the clock in the machine was deliberately "turned back", and the same GUID was then generated again, just by bad luck. There are a few "extra" bits in a GUID that are used to mitigate these timing problems, so in practice they don't arise.

There are a number of interesting consequences of this algorithm. The first is that such GUIDs are almost completely non-random. Many people seem to be of the mistaken belief that GUIDs are a source of randomness, when in fact they are only guaranteed to be a source of uniqueness.

The second is that it is possible for GUIDs generated on a particular machine with this algorithm to be monotone increasing. This is actually a very nice property for GUIDs to have; GUIDs are often used as primary keys in databases, and it can be much cheaper to insert a large number of rows into an indexed-by-primary-key database table if the rows are already sorted and you're always inserting after every previous entry. This again demonstrates that using a sequence of GUIDs as a sequence of random 128 bit numbers is a terrible idea; random numbers are typically not monotone increasing!

The third interesting consequence is that code or documents that contains a GUID generated with this first algorithm contain information that uniquely identifies the machine that was used to create the GUID. Just as a skilled reader can determine interesting facts about a book from its ISBN, so too can a skilled reader determine when and where a GUID was generated if it has a one as its thirteenth hex digit. This fact was used when tracking down and prosecuting the author of the famous Melissa virus. (We'll discuss the implications of this in more detail in an upcoming episode.)

The fourth interesting consequence is that no proper subsequence of the bits of a GUID have the global uniqueness property, as Raymond pointed out back in 2008. And indeed, we have no reason to expect that a smaller set of bits would have the same properties as a larger set of bits! You don't expect to be able to saw one aircraft in half and end up with two things that both fly.

Next time we'll talk about GUIDs that have a four in their thirteenth hex digit; they use a completely different technique for ensuring uniqueness.

Comments

  • Anonymous
    April 30, 2012
    The comment has been removed

  • Anonymous
    April 30, 2012
    This is a great article on an interesting subject, thanks.

  • Anonymous
    April 30, 2012
    I have offtopic question! About immutability. So Roslyn is deeply immutable library. Is it clear win? How things are going? I mean feedback etc. Is it worth the efforts comparing to mutable approach? If everything fine then could immutability be apllied say to c# xml library? What do think? Is it time for imperative developers to switch to immutable stuff? Tnx

  • Anonymous
    April 30, 2012
    Good article, but I feel like mentioning -- and I'm sure you know -- that in a properly constructed GUID, between five and seven (usually six) bits are reserved to provide information about the type and version of the GUID. In practice, there are only two values of those 5-7 bits in common use, effectively reducing them to one bit, so in practice there are much only about 2^123 GUIDs rather than 2^128. Still a lot, but it decreases it from 40 billion billion billion per person to one billion billion billion. :-)

  • Anonymous
    May 01, 2012
    The name-based GUIDs described by RFC4122 have a "3" or a "5" as the thirteenth hex digits. There's no built-in support for creating these GUIDs in .NET, so I implemented my own: code.logos.com/.../generating_a_deterministic_guid.html

  • Anonymous
    May 01, 2012
    This is a great series of posts - fascinating and useful. Thanks!

  • Anonymous
    May 01, 2012
    The comment has been removed

  • Anonymous
    May 01, 2012
    Perhaps it is not at the start for sorting reasons?

  • Anonymous
    May 01, 2012
    MAC addresses are 48 bits = 12 hex digits. I would guess that this wasn't planned to be an algorithm flag, but that the times just happen to always start with a 1 based on whatever epoch they use.

  • Anonymous
    May 01, 2012
    Do you still need to smash a network card with a hammer (blogs.msdn.com/.../71307.aspx) for an app to generate hardware-independent GUIDs?

  • Anonymous
    May 03, 2012
    @servy42 I think that originally there was only one type of GUID, and they didn't think of a type designator. So when they introduced this designator, they used a digit that was the same for all GUIDs created with the MAC method.

  • Anonymous
    May 04, 2012
    I am curious if adding randomization into GUID would not make it more unique?

  • Anonymous
    May 04, 2012
    Well, if you use a method giving guaranteed unique values, there's no way to make it more unique. Now, all the guids using hashing / random guids only are pretty likely unique in a reasonably sized data-set. There you could improve things by adding more uniquifiers. But that's not really relevant in realistic use-cases not dealing with more than 10^20 or so items or having to fend of attackers.

  • Anonymous
    May 07, 2012
    @Servy42:  The strategy identifier is a digit, not a bit, in order to have more than two of them in the future (even if only two methods are currently defined, and I'm not completely sure abou that). Second, it doesn't matter WHERE the strategy number appears in the GUID.  The bits that make up the strategy number could be scattered throughout the 128-bit GUID (although it's a little easier to spot the way it's done).  Why do you care whether the identifier is at the front or in the middle?

  • Anonymous
    May 07, 2012
    The comment has been removed

  • Anonymous
    May 17, 2012
    I had a weird DHCP collision issue on my ISP years back, with the seemingly "helpful" advice of update your mobo drivers. I thought that was an initial copout answer to any problem. It turned out that the nForce 2 default drivers had a bug that issued every one of them with the same MAC, and the update made them unique. I have no clue how this happened, possibly the default had MAC_Address = '1234.6789.abcd' //GetMACAddressFromHW in it. Now imagine if somewhere bought 100 devs identical systems with these drivers and they all sat around generating GUIDS for a while.