Udostępnij za pośrednictwem


Encodings in Strings are Evil Things (Part 8)

   As more Unicode encodings are being finished, I find myself wanting to actually start using rmstring in real situations.  However, most of my "real situations" involve legacy encodings.  So, I need to start cracking on transcoding.

   The first concern is allowing adapters for arbitrary transcodings.  A tricky problem that's related to transcoding is collation (aka sorting) -- most people aren't aware that sorting strings is often a locale-dependent issue.  This is a localization problem.  Just to make sure that terminology is clear, internationalization (often abbreviated to i18n) is the act of coding a program such that it is entirely independent of location and language; the most classic example of i18n is moving all string literals into a binary resource within an EXE, so that the strings may be changed without modifing the program's logic.  This is almost always paired with localization (sometimes abbreviated to l10n), which is the act of tailoring an already-internationalized program for a specific language/locale.  Internationalization may be done by any programmer; localization requires translators.

   In the case of sorting, a binary sort is often not enough.  Context is everything!

  • Where do accented characters sort -- the same as their base characters, or after?  (For French speakers, accented As come after Z.)
  • What are you sorting for?  (German has a special sorting order for names in phone books!)
  • What about ligatures such as ch or fi?  (Spanish speakers, for example, will sort character sequences starting in "ch" between "c" and "d", even though they recognize "ch" as two separate characters.)

   For this reason, developers using rmstring on Win32 platforms will almost certainly want to use a sorting predicate based on Win32's CompareString or LCMapString APIs.  For example:

rmstring<ucs4, bytevector> getfirst( std::list<rmstring<utf8, bytevector> > & lines ) {
std::sort( lines.begin(), lines.end(), win32_collator( LOCALE_USER_DEFAULT ) );
return (*lines.begin()).transcode<ucs4, bytevector>();
}

   This example is a bit contrived -- a real example would template the container and output encoding, and make the LCID a parameter with a default argument -- but you get the point.  win32_collator, in this case, is a custom predicate for std::sort (see <algorithm>) that converts both strings to UTF-16 and then invokes CompareStringW on them, throwing a missing_symbol exception if there's a codepoint above 0x10FFFF that UTF-16 can't contain.  Of course, this will hardly be my primary solution!  More on that later.

   Anyways, similar issues arise for transcoding.  (Not to mention the fact that win32_collator is, in fact, dependent on the ability to transcode, since the Win32 Unicode APIs expect UTF-16 strings.)  So, we must include pluggable transcoders.  So, we change our prototypes from Part 7 to include one more template argument, the transcoding tool:

template <class Engine, class SrcEnc, class SrcStore, class TgtEnc, class TgtStore>
void transcode( const rmstring<SrcEnc, SrcStore> & src, rmstring<TgtEnc, TgtStore> & tgt, Engine e = Engine() );

template <class Engine, class TgtEnc, class TgtStore>
rmstring<TgtEnc, TgtStore> rmstring<SrcEnc, SrcStore>::transcode( Engine e = Engine(), TgtEnc newenc = TgtEnc(), TgtStore newstore = TgtStore() );

   These functions now put off transcoding to the Engine object, whatever that may be.  In the Win32 vein, we could use MultiByteToWideChar and WideCharToMultiByte -- but that's too easy, not to mention very difficult to wrap.  I'd really like to do something that's solely C++ and entirely based in the Unicode Character Database's mappings directory.  There's a few dilemmas to be solved for that.

   Going from a legacy format to Unicode is fairly simple; in addition to combining characters, Unicode also provides an array of compatibility characters.  Compatibility characters are canonically equivalent to a sequence of one or more other Unicode characters; they are usually placed so that you have a single codepoint that's equivalent to a character in some older standard.  For example, ISO8859-2 defines 0x5A to be equivalent to a capital letter L with a caron accent (&Lcaron).  The "simple" equivalent of this in Unicode is a capital letter L (U+004C) followed by a combining caron (U+030C); however, Unicode also defines a single pre-combined character, U+013D, that is directly equivalent to those two.  Therefore, almost all legacy encodings thus can have a simple 1:1 function to go up to Unicode, in the form of a lookup table.  (Unfortunately, not all legacy encodings have a complete set of compatibility characters, so a LUT will not work for everything.)  Going back from Unicode to legacy is more problematic, however: we now have two equivalents to a given legacy character.  The most direct solution, it seems, is to generate a finite automata.

   I've been working on the DFA for the last few days.  My main concern has been memory efficiency, and I can now get a complete set of typical round-trip encoding data to fit in at under 8K per encoding, which fits nicely in cache.  Obviously, certain ones will be smaller, and certain ones will be larger (in particular KOI8 and other encodings with very large symbol sets).  The DFA solution is very clean though; the legacy-to-Unicode DFA takes in bytes and outputs 32-bit unsigned ints containing codepoints which are then re-encoded, and the Unicode-to-legacy DFA takes in codepoints and outputs bytes.  Legacy-to-legacy transcodes use UCS-4 as an intermediary.

   At this point, I'm now working on a program that reads in a file from MAPPINGS and UnicodeData.txt from the Unicode Character Database and outputs the DFA in C++ format.  I'll post more when that's finished.  (I'm writing this entry pre-emptively, as this work-week looks like an absolute killer.)

Comments

  • Anonymous
    January 17, 2005
    Note that UTF-16 can handle supplementary characters as surrogate pairs, and both Windows XP and Server 2003 include weights for these characters where they are defined....

    Also, note that the MultiByteToWideChar API defines MB_PRECOMPOSED and MB_COMPOSITE flags that will let you choose which form you want of those letters that can be either one code point or two....

    You may not want to write them all yourself so quickly (especially since many of the mappings on the Unicode site are out of date!)
  • Anonymous
    January 17, 2005
    Also, you may want to note that some of the most awful bugs that Microsoft has had to deal (like Nimda?) were caused by attempts to do UTF-8 conversions without using the APIs that had the supported fixes to problems in them.

    You may not want to go to far on your own in this direction.... :-)
  • Anonymous
    January 17, 2005
    Last post, I promise!

    Collation (sorting) is not really localization, we define it here with staff linguists and the help of language experts and dictionaries from around the world.
  • Anonymous
    January 17, 2005
    The comment has been removed
  • Anonymous
    January 17, 2005
    Well, they did not make a new UTF-8 parser -- they picked up ours, only they didn't tell us and they didn't pick up any of the bug fixes or updates. :-(

    I'll stay nervous about all the stuff that you are doing, just in case. And I'd love to be on that review list. :-)
  • Anonymous
    February 02, 2005
    The comment has been removed
  • Anonymous
    February 03, 2005
    To be blunt, I want to get started using this, so the only conversion I've written so far has been the win32<src,tgt> engine, which looks for a win32_tr_info member variable containing a code page and calls MultiByteToWideChar/WideCharToMultiByte to get it up to UTF-16 and back down to the target.

    My current implementation I'm working on is a DFA-based engine; I wrote a dedicated program that takes in a set of known relations and generates a DFA to go from that encoding to UCS4 and visa versa. My main concern there is memory usage, and at this time the DFA table for a typical single-byte char set mapping to accented characters (such as ISO8859-1) fits in under 8K, just enough for data L1. I just have to get a modern set of mappings, then.

    I do plan on releasing the source, but I'm debating what license. My gut instinct is to do BSD, or a modified Boost license; I've never been much of a GPL fan. (Most of my previous OSS work, before I went to MSFT, was BSD or Artistic.) Part of it too, of course, is that I would like my library to go through the internal review process so I can use it internally, and there's no way that'd happen under a viral license.

    BTW: Woot, another ACM member!
  • Anonymous
    June 07, 2009
    PingBack from http://greenteafatburner.info/story.php?id=2198