

How come Substring(0, xxx) matches something, but StartsWith returns false?

I was asked how a string can match a substring of another string, yet StartsWith can return false? For example:

 

    string str = "Mu\u0308nchen";
    string find = "Mu";
    Console.WriteLine("Substring: " + (str.Substring(0,2) == find));
    Console.WriteLine("StartsWith:" + str.StartsWith(find));
    Console.WriteLine("IndexOf:   " + str.IndexOf(find));

 

returns this:

 

    Substring: True
    StartsWith:False
    IndexOf:   -1

 

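So if you compare the first 2 characters with the search string, you'll see that they match, yet StartsWith() returns false and IndexOf() can't find it. Substring() and == work on individual UTF-16 code units, while StartsWith(string) and IndexOf(string) do a culture-sensitive (linguistic) comparison by default. In a linguistic comparison the U+0308 diacritic (combining diaeresis) is considered part of the u that it modifies, so a plain "u" won't be found. In many languages, letters with diacritics like this are really different letters. Just as you don't expect a == z, you shouldn't expect u == ü.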
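If a code-unit match is what you actually want, one option is to ask for an ordinal comparison explicitly. The following is a minimal sketch, not the only way to do it; the class and helper names (OrdinalVsLinguistic, StartsWithOrdinal, IndexOfOrdinal) are made up for illustration. Only the ordinal results are annotated, because the culture-sensitive results depend on the collation data of the runtime and platform.

```csharp
using System;

public static class OrdinalVsLinguistic
{
    // Ordinal prefix test: compares UTF-16 code units only, no collation rules.
    public static bool StartsWithOrdinal(string text, string prefix) =>
        text.StartsWith(prefix, StringComparison.Ordinal);

    // Ordinal search: returns the code-unit index of the first match, or -1.
    public static int IndexOfOrdinal(string text, string value) =>
        text.IndexOf(value, StringComparison.Ordinal);

    public static void Main()
    {
        string str = "Mu\u0308nchen"; // 'u' followed by U+0308 COMBINING DIAERESIS
        string find = "Mu";

        Console.WriteLine(StartsWithOrdinal(str, find)); // True
        Console.WriteLine(IndexOfOrdinal(str, find));    // 0

        // Culture-sensitive comparisons: the result depends on the runtime's
        // collation data, so no expected output is claimed here.
        Console.WriteLine(str.StartsWith(find, StringComparison.CurrentCulture));
        Console.WriteLine(str.IndexOf(find, StringComparison.CurrentCulture));
    }
}
```

Ordinal comparison trades linguistic correctness for predictability, so it fits identifiers, file names and protocol tokens better than text shown to users.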

 

Taking the substring effectively "breaks" the character, changing its meaning. Substring() can even create illegal Unicode if it chops off part of a surrogate pair (e.g. U+D800 followed by U+DC00).
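One way to avoid breaking characters is to cut on text-element boundaries instead of char indexes. This sketch uses System.Globalization.StringInfo for that; the class and helper names (TextElementDemo, PrefixByTextElements, SplitsSurrogatePair) are hypothetical and only illustrate the idea.

```csharp
using System;
using System.Globalization;

public static class TextElementDemo
{
    // Takes the first `count` text elements rather than the first `count` chars,
    // so a base letter is never separated from its combining marks.
    public static string PrefixByTextElements(string text, int count) =>
        new StringInfo(text).SubstringByTextElements(0, count);

    // True if cutting `text` at char index `index` would separate a surrogate pair.
    public static bool SplitsSurrogatePair(string text, int index) =>
        index > 0 && index < text.Length &&
        char.IsHighSurrogate(text[index - 1]) && char.IsLowSurrogate(text[index]);

    public static void Main()
    {
        string str = "Mu\u0308nchen";

        // Two text elements: "M" and "u" + U+0308, which is three chars.
        string prefix = PrefixByTextElements(str, 2);
        Console.WriteLine(prefix.Length); // 3

        // Substring(0, 1) on this string would leave a lone high surrogate.
        string pair = "\uD800\uDC00";
        Console.WriteLine(SplitsSurrogatePair(pair, 1)); // True
    }
}
```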

 

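A similar oddity would be characters with no weight, like U+FFFD. So if I have str = "A\uFFFD\uFFFD\uFFFD", then str.Substring(0,1), str.Substring(0,2), str.Substring(0,3) and str.Substring(0,4) all compare as equal to "A" in a linguistic comparison (the C# == operator is ordinal, so read the equality here as a culture-sensitive one, such as String.Compare). And in this case str.StartsWith("A") would be true.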

 

Another perhaps unexpected behavior involves unweighted characters (or characters ignored by a flag) at the beginning of the string. So if str = "\uFFFD" + "A", then str.IndexOf("A") can return 1, yet str.StartsWith("A") will return true (even though IndexOf didn't return 0).
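To see how unweighted characters behave on a particular machine, the ordinal and linguistic results can be printed side by side. This is a rough sketch with made-up names (IgnorableCharDemo, LinguisticallyEqual, OrdinalResultsAgree). Whether U+FFFD really carries no weight depends on the collation data in use, so only the ordinal lines have expected output.

```csharp
using System;

public static class IgnorableCharDemo
{
    // Linguistic equality under the invariant culture (collation weights decide).
    public static bool LinguisticallyEqual(string a, string b) =>
        string.Compare(a, b, StringComparison.InvariantCulture) == 0;

    // With ordinal comparison, StartsWith is true exactly when IndexOf returns 0,
    // so the two results never disagree.
    public static bool OrdinalResultsAgree(string text, string value) =>
        text.StartsWith(value, StringComparison.Ordinal) ==
        (text.IndexOf(value, StringComparison.Ordinal) == 0);

    public static void Main()
    {
        string padded = "A\uFFFD\uFFFD\uFFFD";
        string prefixed = "\uFFFD" + "A";

        // Ordinal results: every code unit counts.
        Console.WriteLine(string.Equals(padded.Substring(0, 1), padded.Substring(0, 2),
                                        StringComparison.Ordinal));            // False
        Console.WriteLine(prefixed.IndexOf("A", StringComparison.Ordinal));    // 1
        Console.WriteLine(prefixed.StartsWith("A", StringComparison.Ordinal)); // False

        // Linguistic results: runtime dependent, so no expected output is claimed.
        Console.WriteLine(LinguisticallyEqual(padded, "A"));
        Console.WriteLine(prefixed.IndexOf("A", StringComparison.InvariantCulture));
        Console.WriteLine(prefixed.StartsWith("A", StringComparison.InvariantCulture));
    }
}
```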

 

Similar behaviors can be seen with LastIndexOf() and EndsWith(), and with the native Vista API FindNLSString() and its variations. In addition, with the FindNLSString() API the substring that is found may be unexpected.

Comments

  • Anonymous
    February 10, 2010
    I'm new to the topic of globalization, so bear with my basic question. What's the use of diacritics when their combinations are already defined as characters? For example, compare "ü" vs. "u\u0308". I followed your example and this is what I got:

    >? ("München").Substring(0)
    "München"
    >? ("München").Substring(1)
    "ünchen"
    >? ("München").Substring(2)
    "nchen"

    ...so far so good...

    >? ("Mu\u0308nchen").Substring(0)
    "München"
    >? ("Mu\u0308nchen").Substring(1)
    "ünchen"
    >? ("Mu\u0308nchen").Substring(2)
    "̈nchen"
    >? ("Mu\u0308nchen").Substring(3)
    "nchen"

    ... that means that ...

    >? ("Mu\u0308nchen").Substring(1,1)
    "u"
    >? ("Mu\u0308nchen").Substring(2,1)
    "̈"

    ... so far so good also. But:

    >? ("Mu\u0308nchen").IndexOf("u")
    -1
    >? ("Mu\u0308nchen").IndexOf("ü")
    1

    In any case, why is there this inconsistency in the treatment of these strings in the common string management methods? Shouldn't they be split into methods that work one way and another set of NLS-related methods? I like consistency. Thanks!

  • Anonymous
    February 10, 2010
    by the way... note that in my examples, the little square is the umlaut.