Encodings in Strings are Evil Things (Part 7)
Eugh. Due to a three-part punch of piling-up work, time with family over the holidays, and being thoroughly sick, I haven't had much time to work on rmstring -- which means, of course, that this hasn't updated. I haven't given up on it though! (I'm not dead! I don't want to go on the cart...) If anything, my desire to finish it has increased, since I've been working on a set of internal utilities which parse text files to take instructions, and one keeps on thinking, "This would be so much easier if I just finished rmstring..."
So, on to business. First off, the all-important fixed_width_encoding class is done. This critical class is the foundation of all encodings with a fixed number of bits per code point; it's templated on an intrinsic type that the implementor knows is 1/2/4 bytes. The hardest part of an encoding, I've found, is writing the iterators; there are a huge number of methods that one must implement in order to make a 14882-compliant iterator. The internals are mostly simple pointer arithmetic; just a lot to be tested. (Yes, I have to write a test harness for this, if I want it to be approved for on-campus use :P)
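To give a flavor of what "mostly simple pointer arithmetic" means, here is a minimal sketch of a read-only iterator over fixed-width units. The names `fixed_width_iterator` and `Stride` are placeholders picked for illustration, not rmstring's actual interface, and it covers only a subset of what a fully conforming random-access iterator requires (no `operator->`, no `n + it`, only some of the comparisons).

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>

// Hypothetical sketch: walks code points stored as fixed-width units of
// type Stride (an integer type the implementor knows is 1, 2, or 4 bytes).
template <class Stride>
class fixed_width_iterator {
public:
    typedef std::random_access_iterator_tag iterator_category;
    typedef Stride                          value_type;
    typedef std::ptrdiff_t                  difference_type;
    typedef const Stride*                   pointer;
    typedef const Stride&                   reference;

    explicit fixed_width_iterator(const Stride* p = 0) : p_(p) {}

    reference operator*() const { return *p_; }
    reference operator[](difference_type n) const { return p_[n]; }

    fixed_width_iterator& operator++() { ++p_; return *this; }
    fixed_width_iterator  operator++(int) { fixed_width_iterator t(*this); ++p_; return t; }
    fixed_width_iterator& operator--() { --p_; return *this; }
    fixed_width_iterator  operator--(int) { fixed_width_iterator t(*this); --p_; return t; }
    fixed_width_iterator& operator+=(difference_type n) { p_ += n; return *this; }
    fixed_width_iterator& operator-=(difference_type n) { p_ -= n; return *this; }

    friend fixed_width_iterator operator+(fixed_width_iterator it, difference_type n) { it += n; return it; }
    friend fixed_width_iterator operator-(fixed_width_iterator it, difference_type n) { it -= n; return it; }
    friend difference_type operator-(const fixed_width_iterator& a, const fixed_width_iterator& b) { return a.p_ - b.p_; }

    friend bool operator==(const fixed_width_iterator& a, const fixed_width_iterator& b) { return a.p_ == b.p_; }
    friend bool operator!=(const fixed_width_iterator& a, const fixed_width_iterator& b) { return a.p_ != b.p_; }
    friend bool operator<(const fixed_width_iterator& a, const fixed_width_iterator& b)  { return a.p_ < b.p_; }

private:
    const Stride* p_;  // every operation above is plain pointer arithmetic on this
};
```

Even this cut-down version needs about a dozen operators, each of which wants its own test case.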
One annoyance that I've found is pointer type conversions; imagine that you've allocated a byte array for recv()ing something in from a TCP socket. If we know that said content is UCS-4, the natural urge is to cast it to an unsigned long * to iterate over... except that you can't. Or, at least, you shouldn't. If that byte array isn't suitably aligned for 32-bit accesses, code will either run slowly (on x86 and AMD64) or crash (on IA-64, unless SetErrorMode() is called to force OS alignment fixups, in which case it will run extremely slowly). Of course, people do this all the time; you just can't guarantee that doing so is safe within the confines of strictly conformant code. There is also no way for strictly conformant code to check if a given pointer is aligned, since there is no operator to retrieve a type's alignment requirements. The best you can do is assume that no type will have an alignment requirement greater than its size, and assert(0 == reinterpret_cast<size_t>(ptr) % sizeof(type)), which is thoroughly disgusting AND assumes certain things about the host's virtual memory system that may not be true.
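For what it's worth, one strictly conformant way to dodge the alignment question, at the cost of a small copy per code point, is to memcpy each unit out of the byte buffer rather than casting the pointer. A minimal sketch, assuming the buffer holds UCS-4 in host byte order; `ucs4_unit` and `read_ucs4` are names made up for this example, and `unsigned int` is assumed to be 4 bytes on the target platform:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

typedef unsigned int ucs4_unit;  // assumption: 4 bytes wide on this platform

// Reads the index-th UCS-4 unit from a byte buffer. The buffer does not
// need to be aligned for 4-byte accesses because memcpy works bytewise.
inline ucs4_unit read_ucs4(const unsigned char* bytes, std::size_t index) {
    ucs4_unit value;
    std::memcpy(&value, bytes + index * sizeof(ucs4_unit), sizeof(value));
    return value;
}
```

This trades the cast for a copy, so it does not help code that wants to hand out a raw Stride pointer into the backing store.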
Thus, I've opted for the simplest solution: a huge comment in the code that says "These functions assume that the backing store's data() pointer is suitably aligned for Stride-sized accesses and that size() is a multiple of Stride's size. Violating either of these assumptions will result in your program's untimely death." Sometime later, I might come up with a helper function alignment_assert<T>(ptr) that takes advantage of compiler-specific extensions such as MSVC's __alignof if available. Note that this also could potentially result in a Unicode stream that does not make much sense, such as combining characters that don't properly match base characters. The Unicode standard notes that such a stream is not ill-formed, although it is not quite renderer-friendly; so, I'll support it.
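As a rough idea of what such a helper could look like, here is a sketch that assumes a compiler with the standard `alignof` operator and `std::uintptr_t` (both arrived in C++11, so neither is available on the compilers this series targets). `is_aligned_for` is a made-up companion function; the pointer-to-integer conversion is still implementation-defined, so the earlier caveat about the host's memory model applies here too.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when ptr is suitably aligned for objects of type T.
// Relies on the implementation-defined mapping from pointers to integers.
template <class T>
bool is_aligned_for(const void* ptr) {
    return reinterpret_cast<std::uintptr_t>(ptr) % alignof(T) == 0;
}

// Debug-build check in the spirit of alignment_assert<T>(ptr);
// compiles to nothing when NDEBUG is defined.
template <class T>
void alignment_assert(const void* ptr) {
    assert(is_aligned_for<T>(ptr));
    (void)ptr;  // avoids an unused-parameter warning in release builds
}
```

On a pre-C++11 compiler, `alignof(T)` would have to be swapped for a vendor extension such as MSVC's `__alignof`.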
I've also had occasion to rethink my plans for encoding_cast. Initially, I planned to use encoding_cast in a way similar to the Boost lexical_cast pseudo-operator. However, it disturbed me that doing so would mean that every call to encoding_cast would create a temporary in which to store the result, which would then make its way to final storage either by operator= or copy constructor. I ended up realizing that a good 70% of the calls to encoding_cast would be writing the encode into a string that already existed. So, instead, we now have the transcode function, which comes in both non-member and member flavors:
template <class SrcEnc, class SrcStore, class TgtEnc, class TgtStore>
void transcode( const rmstring<SrcEnc, SrcStore> & src, rmstring<TgtEnc, TgtStore> & tgt );
template <class SrcEnc, class SrcStore>
template <class TgtEnc, class TgtStore>
rmstring<TgtEnc, TgtStore> rmstring<SrcEnc, SrcStore>::transcode( TgtEnc newenc = TgtEnc(), TgtStore newstore = TgtStore() );
With the above, the originally envisioned encoding_cast is now just syntactic sugar for a call to the source string's member transcode() function. It also means that the code to do transcodes is now centralized within rmstring. Handy!
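To make the relationship between the three pieces concrete, here is a toy sketch. Everything in it is an assumption for illustration: the `sketch` namespace, the one-unit-per-code-point `latin1_encoding` and `ucs4_encoding` structs, the public `store` member, and the way `encoding_cast` takes the target string type as its template argument. The real rmstring internals are more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// Toy encodings: one stored unit per code point.
struct latin1_encoding {
    typedef unsigned char unit_type;
    static unsigned long decode(unit_type u) { return u; }
    // Code points outside Latin-1 become '?' in this toy version.
    static unit_type encode(unsigned long cp) { return static_cast<unit_type>(cp <= 0xFF ? cp : '?'); }
};
struct ucs4_encoding {
    typedef unsigned long unit_type;
    static unsigned long decode(unit_type u) { return u; }
    static unit_type encode(unsigned long cp) { return cp; }
};

template <class Enc, class Store = std::vector<typename Enc::unit_type> >
class rmstring {
public:
    typedef Enc   encoding_type;
    typedef Store store_type;
    Store store;  // public only to keep the sketch short

    // Member flavor: returns a new string in the target encoding.
    template <class TgtEnc, class TgtStore>
    rmstring<TgtEnc, TgtStore> transcode(TgtEnc newenc = TgtEnc(), TgtStore newstore = TgtStore()) const;
};

// Non-member flavor: writes into a string that already exists (no temporary).
template <class SrcEnc, class SrcStore, class TgtEnc, class TgtStore>
void transcode(const rmstring<SrcEnc, SrcStore>& src, rmstring<TgtEnc, TgtStore>& tgt) {
    tgt.store.clear();
    for (typename SrcStore::const_iterator it = src.store.begin(); it != src.store.end(); ++it)
        tgt.store.push_back(TgtEnc::encode(SrcEnc::decode(*it)));
}

template <class Enc, class Store>
template <class TgtEnc, class TgtStore>
rmstring<TgtEnc, TgtStore> rmstring<Enc, Store>::transcode(TgtEnc /*newenc*/, TgtStore newstore) const {
    rmstring<TgtEnc, TgtStore> result;
    result.store = newstore;             // caller-supplied backing store
    sketch::transcode(*this, result);    // qualified: the member name hides the free function
    return result;
}

// encoding_cast as sugar over the source string's member transcode().
template <class TgtString, class SrcEnc, class SrcStore>
TgtString encoding_cast(const rmstring<SrcEnc, SrcStore>& src) {
    return src.template transcode<typename TgtString::encoding_type,
                                  typename TgtString::store_type>();
}

}  // namespace sketch
```

The point of the split is visible in the call sites: `sketch::transcode(src, tgt)` reuses `tgt`, while `encoding_cast<...>(src)` pays for a returned value but reads like a cast.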
Oh, and since someone asked: I'm currently developing and testing this on Visual C++ .NET 2003 and Stephan Lavavej's distribution of MinGW; I'll likely run it against Comeau as well to make sure it's kosher before I release the source to the public.
My goal for the next article is to have a few non-Unicode encodings done, so I can start testing out transcoding and flesh out the different encoding mechanisms. My main dilemma is designing the symbol tables; I noted in Part 4 that I wanted to have the ability to pass different resolving engines to the transcoder, such as a perfect lossless transcription, visual parity, error codes, etc. Visual parity will be the hardest to do; in fact, I will likely not do it right away, mainly because the Unicode tables do not contain such parity information. Another concern has been memory consumption of tables for encodings; I'll be tackling that shortly.
(Since this was mostly a "what happened while I was gone" article, no point summary.)
(Update 2pm: Michael Kaplan nudged me a bit that I had broken my previous insistence on "code point" versus "character" terminology -- that's what I get for stepping away from the blog for two weeks! Terminology corrected; anyone who doesn't know the difference between code points and characters needs to go back and read this blog from the beginning, or at least Part 5.)
Comments
- Anonymous
January 10, 2005
Well, my thoughts are intense nervousness. Though I do like the series as a whole, despite my fear that this last post has moved toward the dark side.... :-)