When does 0 + 4 = 1?

I received a good question today from a developer who asked:

Why should one not use pointer arithmetic for string length calculation, access to string elements, or string manipulation?

The main reason pointer arithmetic is not a good idea is that not all characters are represented by the same number of bytes. For example, in some code pages (such as the Japanese ones), some characters (a, b, c ... and half-width katakana) are represented by one byte, while kanji characters are represented by two bytes. Thus you cannot compute where the fifth character in a string is by simply adding 4 to the pointer to the string's first byte. The fifth character could start as much as 8 bytes from the beginning of the string (when it is preceded by four kanji characters of 2 bytes each).
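
To make the offsets concrete, here is a minimal Python sketch. It assumes Shift_JIS (Python's `shift_jis` codec) as the Japanese code page, and the helper `byte_offset_of_char` and the sample strings are made up for this illustration.

```python
def byte_offset_of_char(text: str, index: int, encoding: str = "shift_jis") -> int:
    """Return the byte offset at which character `index` starts once `text` is encoded.

    Hypothetical helper for illustration: it encodes everything before the
    character and counts the bytes, instead of assuming a fixed width.
    """
    return len(text[:index].encode(encoding))


ascii_text = "abcde"        # five one-byte characters
kanji_text = "日本語の本"    # five two-byte characters in Shift_JIS
mixed_text = "aｱ漢b字"       # 1 + 1 + 2 + 1 bytes before the fifth character

# Offset of the fifth character (index 4) in each string.
ascii_offset = byte_offset_of_char(ascii_text, 4)   # 4: pointer + 4 happens to work
kanji_offset = byte_offset_of_char(kanji_text, 4)   # 8: pointer + 4 lands on the third character
mixed_offset = byte_offset_of_char(mixed_text, 4)   # 5: neither 4 nor 8

print(ascii_offset, kanji_offset, mixed_offset)
```

Only the all-ASCII string matches the "add 4" assumption; the other two show why the offset has to be computed from the actual characters.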

 

Even when using Unicode you cannot assume that each character is 16 bits (one word). If you are using the UTF-16 encoding, some characters (supplementary characters) are represented by surrogate pairs (two 16-bit words). Some single characters are represented as a base character (like "a") plus one or more combining characters (such as a diacritic like "^"). And if you are using UTF-8, a character can be 1, 2, 3 or 4 bytes long.
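
The same point can be checked with a short Python sketch. The helper names `utf16_code_units` and `utf8_bytes` are invented for this example, and the sample characters are just convenient picks, not anything special.

```python
import unicodedata


def utf16_code_units(s: str) -> int:
    """Number of 16-bit words needed to store `s` in UTF-16 (no BOM)."""
    return len(s.encode("utf-16-le")) // 2


def utf8_bytes(s: str) -> int:
    """Number of bytes needed to store `s` in UTF-8."""
    return len(s.encode("utf-8"))


samples = {
    "ASCII letter": "a",                       # 1 UTF-16 word, 1 UTF-8 byte
    "precomposed e-acute": "\u00e9",           # 1 UTF-16 word, 2 UTF-8 bytes
    "e + combining acute": "e\u0301",          # 2 UTF-16 words, 3 UTF-8 bytes
    "kanji": "\u6f22",                         # 1 UTF-16 word, 3 UTF-8 bytes
    "supplementary character": "\U00020BB7",   # 2 UTF-16 words (surrogate pair), 4 UTF-8 bytes
}

for label, s in samples.items():
    print(f"{label}: {utf16_code_units(s)} UTF-16 word(s), {utf8_bytes(s)} UTF-8 byte(s)")

# The two spellings of e-acute are the same character to a reader,
# and normalization maps one onto the other.
same_after_nfc = unicodedata.normalize("NFC", "e\u0301") == "\u00e9"
```

Every entry is what a reader would call one character, yet the storage size ranges from 1 to 4 bytes in UTF-8 and from 1 to 2 words in UTF-16.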

 

That is why it is better to walk through a string with APIs that are aware of these differences than with pointer arithmetic. (See: StringInfo Class.) As a globalization best practice, never assume that all characters are the same size in bytes.
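
As a rough illustration of what such an API does, the Python sketch below groups a string into text elements. It is a deliberate simplification: it only attaches combining marks to the preceding base character, it does not implement the full Unicode grapheme cluster rules, and it is not how StringInfo itself is implemented. The function name `text_elements` is made up for this example.

```python
import unicodedata
from typing import List


def text_elements(s: str) -> List[str]:
    """Split `s` into base characters, each with its trailing combining marks.

    Simplified sketch: a code point with a non-zero canonical combining
    class is glued onto the previous element. Python iterates a str by
    code point, so a supplementary character is already a single item
    here and never appears as two surrogates.
    """
    elements: List[str] = []
    for ch in s:
        if elements and unicodedata.combining(ch) != 0:
            elements[-1] += ch      # diacritic: extend the current element
        else:
            elements.append(ch)     # base character: start a new element
    return elements


# "a" + combining circumflex, then "b", then one supplementary character.
sample = "a\u0302b\U00020BB7"
elements = text_elements(sample)

print(len(sample.encode("utf-16-le")))  # storage size in bytes as UTF-16
print(len(sample))                      # number of code points
print(len(elements))                    # number of text elements a reader sees
```

The three counts differ for the same string, which is exactly why indexing by bytes or words does not land on character boundaries.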

 

Because sometimes, 0 (beginning pointer to a string array) + 4 (a high surrogate and a low surrogate [4 bytes]) = 1 (a single Unicode character).