How Come My "ț" (or Another Character) Doesn't Work in Code Page XXX?

First of all, as I always suggest, use Unicode when practical :)  Then you don't run into these kinds of problems.

The "thing" to remember about code pages in general is that they were an early way to get characters to display in a readable way on CRT MS-DOS displays, or, before that, on teletypes and such.  ASCII was the common representation, but most software developers realized that the eighth bit of each byte wasn't being used, and extended ASCII in several standard, and not so standard, ways.  Usually those extensions were for characters that someone thought were useful, but then other users discovered that some characters didn't "work" for their language and invented a variation of a code page.  After all, all you had to do was change a bitmap font.  Oftentimes subtle distinctions between characters were lost, or users "made do" with the closest code page.

Sometimes the behaviors were pretty much a "hack".  Some code pages represented right-to-left text, like Arabic, in a left-to-right fashion, since the computer systems of the day didn't really understand the concept of right-to-left text.  On the CRT, in addition to using the 8th bit, MS-DOS reused the 1-31 code points for "symbols", since the ASCII values there were invisible concepts like "bell" and "return".  That hack allowed for card suits and console card games.  Commodore did something similar with their PET fonts.
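To make the low-code-point hack concrete, here is a minimal Python sketch. The table name CP437_LOW_GLYPHS and the helper as_displayed are made up for illustration, and the table covers only a handful of the glyphs from the original IBM PC font (code page 437). Python's own "cp437" codec keeps bytes 1-31 as control characters, so the display glyphs have to be layered on top.

```python
# Partial, illustrative table: glyphs the PC display font drew for a few
# of the low code points that ASCII defines as invisible control codes.
CP437_LOW_GLYPHS = {
    0x01: "\u263a",  # white smiling face
    0x02: "\u263b",  # black smiling face
    0x03: "\u2665",  # heart suit
    0x04: "\u2666",  # diamond suit
    0x05: "\u2663",  # club suit
    0x06: "\u2660",  # spade suit
}


def as_displayed(data: bytes) -> str:
    """Return roughly what a CP437 text screen would show for these bytes.

    Hypothetical helper: bytes in the partial table above become their
    display glyphs; everything else goes through Python's cp437 codec.
    """
    return "".join(
        CP437_LOW_GLYPHS.get(b, bytes([b]).decode("cp437")) for b in data
    )


suits = as_displayed(b"\x03\x04\x05\x06")
print(suits)
```

The same byte can therefore mean "control code" or "card suit" depending on whether it is sent to a device or drawn on the screen, which is exactly the kind of ambiguity that makes code pages awkward.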

So what's this got to do with a Romanian ț (U+021B "Latin small letter t with comma below")?  Well, code page 1250, "Eastern Europe", has a code point "0xfe" for ţ (U+0163 "Latin small letter t with cedilla").  I'm not sure of all of the history behind these characters, but they are different in appearance.  I don't know if the 1250 ţ was originally intended for use with Romanian, but a cedilla isn't a comma below, and this distinction caused a separate Unicode code point to be created.  Code page 1250, however, still has its original meaning, and U+021B isn't in it.
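The difference is easy to check with Python's built-in "cp1250" codec. This is just a quick sketch; the variable names are arbitrary.

```python
# Byte 0xFE in code page 1250 is t with cedilla (U+0163), not t with comma below.
decoded = b"\xfe".decode("cp1250")
print(f"0xFE decodes to U+{ord(decoded):04X}")

t_comma = "\u021b"  # Latin small letter t with comma below

# Trying to encode the comma-below letter into 1250 fails outright.
try:
    t_comma.encode("cp1250")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print("U+021B encodable in cp1250:", encodable)

# A Unicode encoding such as UTF-8 has no trouble with it.
utf8_bytes = t_comma.encode("utf-8")
print(utf8_bytes)
```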

So, for Romanian, if you really want to use the "correct" U+021B character, then 1250 won't work (nor will any other non-Unicode code page).  Those kinds of subtle (to non-Romanians) issues are why it's best to have Unicode applications and data stores.  We can't really change the behavior of code page 1250, or else someone else's usage (even a Romanian application making do with the cedilla) will break.
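For an application that is stuck with 1250, "making do with the cedilla" might look something like the sketch below. The table COMMA_TO_CEDILLA and the function encode_for_cp1250 are hypothetical names, not part of any library, and the substitution is lossy: decoding the bytes again gives back the cedilla letters, not the comma-below originals.

```python
# Hypothetical fallback table: Romanian comma-below letters mapped to the
# cedilla look-alikes that code page 1250 actually contains.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # S with comma below -> S with cedilla
    "\u0219": "\u015f",  # s with comma below -> s with cedilla
    "\u021a": "\u0162",  # T with comma below -> T with cedilla
    "\u021b": "\u0163",  # t with comma below -> t with cedilla
})


def encode_for_cp1250(text: str) -> bytes:
    """Lossy encode: swap comma-below letters for cedilla ones, then encode."""
    return text.translate(COMMA_TO_CEDILLA).encode("cp1250")


encoded = encode_for_cp1250("\u021b")
round_tripped = encoded.decode("cp1250")
print(encoded, f"U+{ord(round_tripped):04X}")
```

Keeping the data in Unicode and applying a fallback like this only at the boundary to a legacy system keeps the loss contained to that one place.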

Comments

  • Anonymous
    July 06, 2008
    What happened is that Turkish uses s-cedilla and Romanian uses s-comma.  When ISO was standardizing the first four parts of 8859, they unified these two into a single character called s-cedilla, and put it in 8859-2 for Romanian and 8859-3 for Turkish.  As it turned out, the Turks didn't want to use 8859-3 (otherwise only used for Maltese and Esperanto), so they hacked a version of 8859-1 using Turkish characters in place of Icelandic ones (not a big Icelandic presence in Turkey) and that was eventually standardized as 8859-9. Unfortunately, at the same time ISO decided to provide t-cedilla for Romanian use instead of t-comma, so t-cedilla got into 8859-2 even though nobody uses t-cedilla at all.  Unicode copied the 8859 situation, then years later cleaned it up by adding s-comma and t-comma for Romanian use. As for 1250, it has a superset of the 8859-2 repertoire, partly rearranged to be more like 1252, so it inherits the issue from 8859-2.