Encodings in Strings are Evil Things (Part 7)

   Eugh.  Due to a three-part punch of piling-up work, time with family over the holidays, and being thoroughly sick, I haven't had much time to work on rmstring -- which means, of course, that this hasn't updated.  I haven't given up on it though!  (I'm not dead!  I don't want to go on the cart...)  If anything, my desire to finish it has increased, since I've been working on a set of internal utilities which parse text files to take instructions, and one keeps on thinking, "This would be so much easier if I just finished rmstring..."

   So, on to business.  First off, the all-important fixed_width_encoding class is done.  This critical class is the foundation of all encodings with a fixed number of bits per code point; it's templated on an intrinsic type that the implementor knows is 1/2/4 bytes.  The hardest part of an encoding, I've found, is writing the iterators; there are a huge number of methods that one must implement in order to make an iterator that complies with ISO/IEC 14882 (the C++ standard).  The internals are mostly simple pointer arithmetic; there's just a lot to be tested.  (Yes, I have to write a test harness for this, if I want it to be approved for on-campus use :P)
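
   To give a feel for how many methods "a huge number" is, here is a minimal sketch of a read-only random-access iterator over fixed-width code units.  It is not the real fixed_width_encoding; the nested const_iterator and the begin()/end() helpers are assumptions made for illustration, and the sketch treats one Stride-sized unit as one code point.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>

// Hypothetical sketch: Stride is the 1/2/4-byte intrinsic type.
template <class Stride>
class fixed_width_encoding {
public:
    class const_iterator {
    public:
        // The five typedefs std::iterator_traits looks for.
        typedef std::random_access_iterator_tag iterator_category;
        typedef Stride                          value_type;
        typedef std::ptrdiff_t                  difference_type;
        typedef const Stride*                   pointer;
        typedef const Stride&                   reference;

        const_iterator() : p_(0) {}
        explicit const_iterator(const Stride* p) : p_(p) {}

        reference operator*() const { return *p_; }
        pointer   operator->() const { return p_; }
        reference operator[](difference_type n) const { return p_[n]; }

        // Everything below is plain pointer arithmetic.
        const_iterator& operator++() { ++p_; return *this; }
        const_iterator  operator++(int) { const_iterator t(*this); ++p_; return t; }
        const_iterator& operator--() { --p_; return *this; }
        const_iterator  operator--(int) { const_iterator t(*this); --p_; return t; }
        const_iterator& operator+=(difference_type n) { p_ += n; return *this; }
        const_iterator& operator-=(difference_type n) { p_ -= n; return *this; }

        friend const_iterator operator+(const_iterator it, difference_type n) { it += n; return it; }
        friend const_iterator operator+(difference_type n, const_iterator it) { it += n; return it; }
        friend const_iterator operator-(const_iterator it, difference_type n) { it -= n; return it; }
        friend difference_type operator-(const_iterator a, const_iterator b) { return a.p_ - b.p_; }

        friend bool operator==(const_iterator a, const_iterator b) { return a.p_ == b.p_; }
        friend bool operator!=(const_iterator a, const_iterator b) { return a.p_ != b.p_; }
        friend bool operator<(const_iterator a, const_iterator b)  { return a.p_ <  b.p_; }
        friend bool operator>(const_iterator a, const_iterator b)  { return a.p_ >  b.p_; }
        friend bool operator<=(const_iterator a, const_iterator b) { return a.p_ <= b.p_; }
        friend bool operator>=(const_iterator a, const_iterator b) { return a.p_ >= b.p_; }

    private:
        const Stride* p_;
    };

    // Hypothetical helpers: iterate over `count` code units starting at `data`.
    static const_iterator begin(const Stride* data) { return const_iterator(data); }
    static const_iterator end(const Stride* data, std::size_t count) {
        assert(data != 0 || count == 0);
        return const_iterator(data + count);
    }
};
```

   Even this stripped-down version needs about twenty operators, and a mutable iterator would repeat most of them.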

   One annoyance that I've found is pointer type conversions; imagine that you've allocated a byte array for recv()ing data from a TCP socket.  If we know that said content is UCS-4, the natural urge is to cast it to an unsigned long * to iterate over... except that you can't.  Or, at least, you shouldn't.  If that byte array isn't suitably aligned for 32-bit accesses, code will either run slowly (on x86 and AMD64) or crash (on IA-64, unless SetErrorMode() is called to force OS alignment fixups, in which case it will run extremely slowly).  Of course, people do this all the time; you just can't guarantee that doing so is safe within the confines of strictly conformant code.  There is also no way for strictly conformant code to check if a given pointer is aligned, since there is no operator to retrieve a type's alignment requirements.  The best you can do is assume that no type will have an alignment requirement greater than its size, and assert(0 == reinterpret_cast<size_t>(ptr) % sizeof(type)), which is thoroughly disgusting AND assumes certain things about the host's virtual memory system that may not be true.

   Thus, I've opted for the simplest solution: a huge comment in the code that says "These functions assume that the backing store's data() pointer is suitably aligned for Stride-sized accesses and that size() is a multiple of Stride's size. Violating either of these assumptions will result in your program's untimely death."   Sometime later, I might come up with a helper function alignment_assert<T>(ptr) that takes advantage of compiler-specific extensions such as MSVC's __alignof if available.  Note that this also could potentially result in a Unicode stream that does not make much sense, such as combining characters that don't properly match base characters.  The Unicode standard notes that such a stream is not ill-formed, although it is not quite renderer-friendly; so, I'll support it.
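
   For the curious, here is a rough sketch of what such a helper might look like.  It assumes a compiler that offers the alignof operator (standard since C++11, so later than the compilers discussed in this post); is_aligned_for and store_is_usable are hypothetical names, and the pointer-to-integer conversion still relies on implementation-defined behavior, exactly as grumbled about above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when ptr meets T's alignment requirement.  On a compiler without
// alignof, one would substitute an extension such as MSVC's __alignof, or
// fall back to sizeof(T) as a conservative guess.
template <class T>
bool is_aligned_for(const void* ptr) {
    // Assumption: converting the pointer to an integer yields its address.
    return reinterpret_cast<std::uintptr_t>(ptr) % alignof(T) == 0;
}

// The helper named in the text, sketched; it only fires in debug builds.
template <class T>
void alignment_assert(const void* ptr) {
    assert(is_aligned_for<T>(ptr));
}

// Checks both documented preconditions for a backing store of `size` bytes:
// data() is aligned for Stride and size() is a multiple of Stride's size.
template <class Stride>
bool store_is_usable(const void* data, std::size_t size) {
    return is_aligned_for<Stride>(data) && size % sizeof(Stride) == 0;
}
```

   A caller could then write alignment_assert<Stride>(store.data()) at the top of each function covered by the big comment, which at least turns the untimely death into a debug-build assertion.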

   I've also had occasion to rethink my plans for encoding_cast.  Initially, I planned to use encoding_cast in a way similar to the Boost lexical_cast pseudo-operator.  However, it disturbed me that doing so would mean that every call to encoding_cast would create a temporary in which to store the result, which would then make its way to final storage either by operator= or copy constructor.  I ended up realizing that a good 70% of the calls to encoding_cast would be writing the encode into a string that already existed.  So, instead, we now have the transcode function, which comes in both non-member and member flavors:

template <class SrcEnc, class SrcStore, class TgtEnc, class TgtStore>
void transcode( const rmstring<SrcEnc, SrcStore> & src, rmstring<TgtEnc, TgtStore> & tgt );

template <class SrcEnc, class SrcStore>
template <class TgtEnc, class TgtStore>
rmstring<TgtEnc, TgtStore> rmstring<SrcEnc, SrcStore>::transcode( TgtEnc newenc = TgtEnc(), TgtStore newstore = TgtStore() );

   With the above, the originally envisioned encoding_cast is now just syntactic sugar for a call to the source string's member transcode() function.  It also means that the code to do transcodes is now centralized within rmstring.  Handy!
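
   Here is a toy sketch of how the three pieces could fit together.  Everything in it is an assumption made for illustration: the rmsketch namespace, the latin1 and ucs4 encoding structs with their decode()/encode() functions, the store() accessor, and the simplified member signature (it omits the newenc/newstore arguments).  The real rmstring will certainly look different.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

namespace rmsketch {

typedef std::uint32_t code_point;

// Hypothetical single-byte encoding: one byte per code point (Latin-1).
struct latin1 {
    template <class Store>
    static code_point decode(const Store& s, std::size_t& pos) {
        return static_cast<unsigned char>(s[pos++]);
    }
    template <class Store>
    static void encode(code_point cp, Store& s) {
        // Unrepresentable code points become '?'; a real resolver would decide.
        s.push_back(static_cast<typename Store::value_type>(
            cp <= 0xFF ? cp : code_point('?')));
    }
};

// Hypothetical fixed-width encoding: one 32-bit unit per code point (UCS-4).
struct ucs4 {
    template <class Store>
    static code_point decode(const Store& s, std::size_t& pos) { return s[pos++]; }
    template <class Store>
    static void encode(code_point cp, Store& s) { s.push_back(cp); }
};

template <class Enc, class Store> class rmstring;

// Non-member flavor: overwrites an existing target, so no temporary is made.
template <class SrcEnc, class SrcStore, class TgtEnc, class TgtStore>
void transcode(const rmstring<SrcEnc, SrcStore>& src, rmstring<TgtEnc, TgtStore>& tgt) {
    // This sketch does not support transcoding a string onto itself.
    assert(static_cast<const void*>(&src) != static_cast<const void*>(&tgt));
    tgt.store().clear();
    std::size_t pos = 0;
    while (pos < src.store().size())
        TgtEnc::encode(SrcEnc::decode(src.store(), pos), tgt.store());
}

template <class Enc, class Store>
class rmstring {
public:
    rmstring() {}
    explicit rmstring(const Store& s) : store_(s) {}
    Store& store() { return store_; }
    const Store& store() const { return store_; }

    // Member flavor: builds and returns a new string in the target encoding.
    template <class TgtEnc, class TgtStore>
    rmstring<TgtEnc, TgtStore> transcode() const {
        rmstring<TgtEnc, TgtStore> result;
        rmsketch::transcode(*this, result);
        return result;
    }

private:
    Store store_;
};

// encoding_cast as syntactic sugar over the member transcode().
template <class TgtEnc, class TgtStore, class SrcEnc, class SrcStore>
rmstring<TgtEnc, TgtStore> encoding_cast(const rmstring<SrcEnc, SrcStore>& src) {
    return src.template transcode<TgtEnc, TgtStore>();
}

} // namespace rmsketch
```

   In this sketch the loop lives in the non-member function and the member forwards to it; it could just as easily be the other way around.  When the destination string already exists, rmsketch::transcode(src, tgt) reuses its storage, which is the case the temporary-heavy encoding_cast handled poorly.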

   Oh, and since someone asked: I'm currently developing and testing this on Visual C++ .NET 2003 and Stephan Lavavej's distribution of MinGW; I'll likely run it against Comeau as well to make sure it's kosher before I release the source to the public.

   My goal for the next article is to have a few non-Unicode encodings done, so I can start testing out transcoding and flesh out the different encoding mechanisms.  My main dilemma is designing the symbol tables; I noted in Part 4 that I wanted to have the ability to pass different resolving engines to the transcoder such as a perfect lossless transcription, visual parity, error codes, etc.  Visual parity will be the hardest to do; in fact, I will likely not do it right away.  (Mainly because the Unicode tables do not contain such parity information.)  Another concern has been memory consumption of tables for encodings; I'll be tackling that shortly.

(Since this was mostly a "what happened while I was gone" article, no point summary.)

(Update 2pm: Michael Kaplan nudged me a bit that I had broken my previous insistence on "code point" versus "character" terminology -- that's what I get for stepping away from the blog for two weeks!  Terminology corrected; anyone who doesn't know the difference between code points and characters needs to go back and read this blog from the beginning, or at least Part 5.)

Comments

  • Anonymous
    January 10, 2005
    Well, my thoughts are intense nervousness. Though I do like the series as a whole, despite my fear that this last post has moved toward the dark side.... :-)