Are Hash Codes Unique?

While you will hear people state that hash codes generate a unique value for a given input, the fact is that, while difficult to accomplish, it is technically feasible to find two different data inputs that hash to the same value. However, the true determining factors regarding the effectiveness of a hash algorithm lie in the length of the generated hash code and the complexity of the data being hashed.

Let's first talk about the hash algorithms themselves. Hashing algorithms generate a fixed-length hash code regardless of the length of the input. For example, the MD5 hash function always generates hash codes that are 32 bytes in length, the SHA1 hash function generates 20-byte hash codes, SHA256 generates 256-bit (32 byte) hash codes, and so on. Therefore, since there are a limited range of possible values for a given hash code and an unlimited range of values to hash, it stands to reason that the length of the hash code generated with a given hash algorithm is directly related to the difficulty of finding two inputs that will generate the same hash code.

This is easy to prove. If n is the number of possible hash codes, we only need n + 1 distinct input values to prove that there is an overlap. Granted that for most hash functions, n is rather large so we can safely assume that for any meaningful input values, it would be very difficult to find another meaningful input that would give the same hash.

That brings me to the second point—the input value itself. It is assured that if the input is anything more involved than random data, then even if you were to find two inputs that generated the same hash value, the two inputs would have no semantic relationship to one another. This is especially evident if you are hashing text, because there are many more ways to produce gibberish than there are ways to produce meaningful words. In other words, it's extremely difficult to create random words, even harder to form sentences, and very unlikely that those sentences will form a paragraph or a document and still have it make sense.

As an example, let's say you're generating a hash code for a top-secret document that details the route of nuclear warheads for disposal. The chances of someone randomly generating words until they matched the hash code generated for the original document and producing a document that makes any sense are so small that you could accurately say it would be impossible.

Therefore, the fact that each possible hash code doesn't uniquely match an input value only comes into play when you are dealing with random (nonstructured) data. An example of this would be spyware-removal applications that determine if a file is spyware by hashing each file in a specified folder or drive and then comparing that hash value to a list of known spyware-file hash codes. In this case, relying solely on the hash code would be a mistake as the files being hashed are of varying lengths with many files having no semantic meaning (as the application is not determining the meaning of the data; only the hash code values). As a result, it would be highly recommended that these applications either match on file name before hashing the file's contents (which dramatically reduces the possibility of "false positive" results) or use a hash code with a very long output value - such as SHA256.

This brings us to the concept of password hashing. Instead of storing their users' passwords, some applications will store only the hash code for the password. When the user attempts to log into the system, the application hashes the user's input password and compares that hash value to what has been stored for the valid password. This is a very good technique because it means that the users' passwords are not stored and therefore - theoretically cannot be hacked as hash codes cannot be reverse-engineered to their pre-hashed values. However, assuming that a hacker got his hands on the password-hash file, a brute-force method could be employed by the hacker to continually generate hash codes until a code is generated that matches a hash code on the password list. The value the hacker used to generate the matching hash code could then be used to allow the hacker unauthorized access.

This is a very real risk as passwords generally have a fixed length (such as 6-10 characters) . Therefore, the hacker doesn't need to generate as many hash codes as he would if attempting to regenerate the same hash code that was originally hashed for a much longer value. As a result, anyone using this technique to protect user passwords should go a step further by adding a salt, or static piece of data, to the input value before it is hashed. This would produce an almost fool-proof system as even if someone were to obtain a list of hashed passwords and were to produce a password that generated to a hash code in the list, this password would not work because when attempting to log in, the system would apply the salt and the resulting hash code would differ from that stored for the user. In other words, the hacker's input value would already be salted, but the system is expecting a non-salted value that it salts. As a result, the addition of the salt greatly increases the difficulty in cracking the passwords, because now the hacker would need to 1) steal the hash-code file, 2) know that a salt has been added, 3) know the value of the salt and 4) know exactly how the salt was added.

Therefore, while there isn't a one-to-one correlation between every possible hash code and every possible input value such that all combinations are guaranteed to be unique, hash codes are an extremely reliable means of protecting data integrity.

Related Reads: Keith Brown did a full article on securing the username token with WSE 2.0 that is a good read even if you're not doing Web Services as it talks about general security issues regarding the protection of user authentication information that is being sent over the wire.

Comments

  • Anonymous
    May 09, 2006
    Interesting article and a nice read

    Cheers


  • Anonymous
    May 09, 2006
    Interesting article, thanks Tom.
  • Anonymous
    May 10, 2006
    I wonder why then, why many popular sites require a fixed password length instead of allowing a user to create a password of any length they want. It seems that it would be safer to remove the length rule as this way someone who is trying to brute force their way in will have a tougher time than if he knew that the password was always 6-10 characters long.
  • Anonymous
    May 10, 2006
    The comment has been removed
  • Anonymous
    May 10, 2006
    Paul: Thanks! I'm flattered that you read my stuff :) I'm hoping now that I'm a little more situated here at Microsoft that I can begin blogging much more.
  • Anonymous
    May 10, 2006
    Austin: DOH! Thanks for the catch!
  • Anonymous
    May 10, 2006
    Bryan: I'm in total agreement with you. On a related topic, TechNet has a great article on "security myths" and one thing they bring up is that the better way to protect yourself is NOT with "strong" passwords, but with long passwords.

    http://www.microsoft.com/technet/technetmag/issues/2006/05/SecurityMyths/default.aspx
  • Anonymous
    May 10, 2006
    Wow, that was a really good read. Thanks for the link, Tom
  • Anonymous
    May 14, 2006
    hi  i downlaoded  a version of vista  (vista beta 1) but when I want to install it . i'd have an error  written in this kind  

    the windows can't get the information of your disks

    also i should mentione that i have  sata2 hard

    what's wrong with my system ?
  • Anonymous
    May 15, 2006
    Always love reading your blog!
  • Anonymous
    May 18, 2006
    Good reading Tom!.  Security starts with the developer.  It's been years since I did serious crypto work and I had actually forgotten about using salt values in stored password (though thanks to micrsoft I had been hashing them for many years).  Great work!
  • Anonymous
    May 19, 2006
    Wow
    Geek Stuffs
    Quite amazing to read from the employee of ur dream company
    nice work Tom
  • Anonymous
    May 22, 2006
    Tom, Great piece of information.
    Couldnt find it anywhere on the net so clear.
  • Anonymous
    May 23, 2006
    The comment has been removed
  • Anonymous
    May 23, 2006
    The comment has been removed
  • Anonymous
    May 26, 2006
    The comment has been removed
  • Anonymous
    June 15, 2006
    > TechNet has a great article on "security myths"
    http://www.microsoft.com/technet/technetmag/issues/2006/05/SecurityMyths/default.aspx

    Good article, but they blew it by not recognizing that forced timed password changes weaken security.

    After arguing that the policy of the forced password changes could be set to every 1,900,000,000,000 years without risking security, then they set it to 90 days.

    Why? Clearly, the author just really likes annoying users, since he readily admits this serves no useful purpose.  That MUCH longer periods (much more than a lifetime) would still be secure.

    Forcing users to pick new password every 90 days likely results in increasingly weaker (including shorter) passwords each time.
  • Anonymous
    June 19, 2006
    Your article is prety nice. It's a pity that i didn't see it more later.
  • Anonymous
    June 20, 2006
    The comment has been removed
  • Anonymous
    June 20, 2006
    The comment has been removed
  • Anonymous
    July 12, 2006
    Interesting and useful
  • Anonymous
    July 19, 2006
    Tom, Great piece of information.....
  • Anonymous
    February 10, 2007
    PingBack from http://ipfreaks.com/dreamscene/?p=43