What curious property does this string have?

There are all kinds of interesting things in the Unicode standard. For example, the block of characters from U+A000 to U+A48F is for representing syllables in the "Yi script". Apparently it is a writing system for the Yi language of China, developed during the Tang Dynasty.

A string drawn from this block has an unusual property; the string consists of just two characters, both the same: a repetition of character U+A0A2:

string s = "ꂢꂢ";

Or, if your browser can't hack the Yi script, that's the equivalent of the C# program fragment:

string s = "\uA0A2\uA0A2";

What curious property does this string have?

I'll leave some hints in the comments, and post the answer next time.

UPDATE: A couple people have figured it out, so don't read the comments too far if you don't want to be spoiled. I'll post a follow-up article on Friday.

Comments

  • Anonymous
    July 12, 2011
    Hint #1: The curious property is platform-dependent; you'll want to be using a 32 bit version of CLR v4.

  • Anonymous
    July 12, 2011
    Hint #2: The curious property is also a property of a much more commonly-used string.

  • Anonymous
    July 12, 2011
    "s.ToUpper() == s.ToLower()" is true. Though that's not that curious.
    Indeed, I think all strings in Chinese-style languages have this property. - Eric

  • Anonymous
    July 12, 2011
    I am not seeing any curious property either. Although I did notice that it doesn't have a case difference like [ICR].

  • Anonymous
    July 12, 2011
    Is it something to do with byte order marks?  Something like it matches an empty string in a different encoding with byte order marks?

  • Anonymous
    July 12, 2011
    I'm at work right now and only have access (easily) to a 2008 R2, ergo a 64 bit CLR. I'll give it a look when I get home.

  • Anonymous
    July 12, 2011
    Well Eyal is correct, on the x86 v4 CLR, "\uA0A2\uA0A2".GetHashCode() == "".GetHashCode(). Though technically that doesn't meet the criterion of hint number 2. Unless the curious property that "\uA0A2\uA0A2" shares with a much more commonly used string (i.e. string.Empty) is 'having the hashcode 757602046'. But I don't know, I just don't find that property all that curious.

  • Anonymous
    July 12, 2011
    As far as I can tell the hashcodes also match on x64.

  • Anonymous
    July 12, 2011
    Collisions like this happen in real systems all the time when using GetHashCode(). However, a lot of the framework's comparing and sorting infrastructure, like LINQ for example, depends on it. For me this model of equality is (in some scenarios) broken. I would like to see a much improved hashing algorithm, with a lower collision probability, implemented in future versions of .NET.

  • Anonymous
    July 12, 2011
    @Paul Irwin Hashcodes are meant to collide! Not a problem depending on them! :)

  • Anonymous
    July 12, 2011
    @iCe Hashcodes are not meant to express equality (a good reason why that would be a broken equality model), but they surely can express inequality.

  • Anonymous
    July 12, 2011
    Guessing here: is it the shortest string (in code points) whose hashcode matches? Or the only two-code-point string to match? The smallest (if treated as an unsigned number) legal UTF-16 string which shares the hashcode? Or is it the fact that it's a palindrome (and string.Empty is essentially a palindrome in the root case), and the only such palindrome to share the hashcode?

  • Anonymous
    July 12, 2011
    Could it be that s.GetHashCode() == (s + s).GetHashCode()?
    You're so close! -- Eric

  • Anonymous
    July 12, 2011
    > Could it be that s.GetHashCode() == (s + s).GetHashCode()?
    More than that - any number of repetitions of this string have the same hashcode.
    Give the man a cigar! Nicely done. -- Eric

  • Anonymous
    July 12, 2011
    I wonder if it has anything to do with string interning? Although I played around with String.Intern(...) on the 32-bit CLR v4 and didn't come to any definitive conclusion.

  • Anonymous
    July 12, 2011
    > > Could it be that s.GetHashCode() == (s + s).GetHashCode()?
    > More than that - any number of repetitions of this string have the same hashcode.
    And the much more common string that shares this property is of course String.Empty (where it is much less interesting).

  • Anonymous
    July 12, 2011
    UPDATE: A couple people have figured it out, so don't read the comments too far if you don't want to be spoiled. I'll post a follow-up article on Friday. => the first thing I did was look at the answer :(

  • Anonymous
    July 12, 2011
    @Diego: It can express inequality only if BOTH hashcodes are generated on the same machine with the same .NET Framework version in the same mode (x86/x64).

  • Anonymous
    July 12, 2011
    @Eugene: the hashcode function clearly contains some kind of internal state, but if you include that state it's not a general fixpoint. For instance "abcd".GetHashCode() != "abcdꂢꂢ".GetHashCode(). On the other hand, for any 2-letter string s it appears that (s+"ꂢꂢ").GetHashCode() == s.GetHashCode(), whereas (s+"ꂢꂢꂢꂢ").GetHashCode() does not match. That seems to indicate that some parts of the internal state don't immediately affect the output but do after later iterations. This kind of hashcode property might be exploitable to form a denial of service attack; it's a bit far-fetched, but vaguely reminiscent (though not nearly as easily exploitable) of the php/java floating point parsing bug of a while back...

  • Anonymous
    July 12, 2011
    the phonetic pronunciation of this is (IPA) m̥o pronounced (very roughly) like hmo (but with the hm bit voiceless) Exploiting this would thus be a "hmOhmO" attack. Not really catchy so I can't imagine it would be popular :)

  • Anonymous
    July 12, 2011
    I suppose another curious property is that it looks like an M.C. Escher drawing of impossible eyeglasses.

  • Anonymous
    July 14, 2011
    @Ramon GetHashCode in general can only express inequality within a single AppDomain. @iCe Code usually relies on a low number of collisions to get a large performance gain. But any code that relies on GetHashCode() being unique for correctness is broken.

  • Anonymous
    December 28, 2011
    @Alex Yes! http://arst.ch/rz0 Remembered this post while reading the above article.
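
The collision discussed in the comments above depends on the runtime, so it may not reproduce with the built-in GetHashCode() on a current machine. What follows is a minimal sketch that emulates the hash by hand. It assumes the 32-bit CLR v4 string hash keeps two interleaved accumulators, both seeded with (5381 << 16) + 5381, and mixes in two UTF-16 code units at a time as shown. The names LegacyHash32 and Pair are hypothetical, and this is an illustration under those assumptions, not the runtime's actual source.

```csharp
using System;

static class LegacyHashSketch
{
    // Hypothetical helper: emulates the ASSUMED 32-bit CLR v4 string hash.
    static int LegacyHash32(string s)
    {
        unchecked // integer overflow is expected to wrap around
        {
            int hash1 = (5381 << 16) + 5381; // 0x15051505
            int hash2 = hash1;
            for (int i = 0; i < s.Length; i += 4)
            {
                // First pair of code units feeds the first accumulator.
                hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ Pair(s, i);
                if (s.Length - i <= 2) break;
                // Second pair of code units feeds the second accumulator.
                hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ Pair(s, i + 2);
            }
            return hash1 + hash2 * 1566083941;
        }
    }

    // Packs two UTF-16 code units into one 32-bit word, low unit first.
    // Units past the end of the string read as 0, like a null terminator.
    static int Pair(string s, int i)
    {
        int lo = i < s.Length ? s[i] : 0;
        int hi = i + 1 < s.Length ? s[i + 1] : 0;
        return lo | (hi << 16);
    }

    static void Main()
    {
        string s = "\uA0A2\uA0A2";
        Console.WriteLine(LegacyHash32(""));        // 757602046
        Console.WriteLine(LegacyHash32(s));         // 757602046
        Console.WriteLine(LegacyHash32(s + s));     // 757602046
        Console.WriteLine(LegacyHash32(s + s + s)); // 757602046
    }
}
```

Under these assumptions the word 0xA0A2A0A2 is a fixed point of the mixing step applied to the seed: shifting, adding and rotating 0x15051505 gives 0xB5A7B5A7, and XOR with 0xA0A2A0A2 returns 0x15051505. Each repetition of the string therefore leaves both accumulators unchanged, which is why every repetition hashes the same as the empty string.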