Encodings in Strings are Evil Things (Part 6)

   First, I apologize for not updating recently -- at work, my dev machine's power supply died, and took my hard drive with it.  Luckily, I had everything backed up; however, I had to copy everything over to, and work on, a single-monitor Longhorn dogfood box with no major apps installed.  This went on for a week and a half while I waited for Dell to slog through the warranty process for new parts and have them installed by a Dell-authorized tech (in order to keep the warranty going) and this put me behind schedule for several deadlines.  So, now that my dev machine has a new PSU and HDD I've been frantically working to get caught up on things, and this has left little time for the blog.  In about two weeks these deadlines will be behind me, and I can start posting with regularity again.

   Also, at this point I'm primarily implementing previously discussed ideas, so this series of posts will temporarily serve two purposes: discussion of issues, and a journal of the concerns that come up while implementing this in C++.  This post covers one of those concerns: how do you define operator[] for a string that's in a variable-width encoding such as UTF-8?  One of the basic assumptions in std::string that I intend to honor is that operator[] returns a reference to the actual data, not a copy.  For fixed-width encodings such as ASCII, UCS2, or UCS4, this is not a problem; I simply return an unsigned char/short/long.  However, for variable-width encodings, I need to return a range of bytes, and presumably a size as well.  I could do this with covariant returns and unions, but this is horribly ugly -- and I'd need a lot of different returns, since UTF-8 as originally specified can use up to six bytes for a single code point (RFC 3629 later capped this at four).
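
   To make the variable-width problem concrete, here is a minimal sketch of how the byte count of one UTF-8 sequence can be read off its lead byte.  The helper name utf8_sequence_length is made up for this example and is not part of the library described in this series; it follows the four-byte limit of RFC 3629 and does not validate the bytes that follow.

```cpp
#include <cstddef>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with `lead`, judged only by the lead byte's high bits. Returns 0 for a
// byte that cannot start a sequence under the RFC 3629 bit patterns.
// Note: overlong leads such as 0xC0/0xC1 are not rejected in this sketch.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80)           return 1;  // 0xxxxxxx: single-byte (ASCII)
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // continuation byte or unused lead
}
```

   Because the answer varies per character, operator[] cannot hand back a reference to a single fixed-size integer the way a fixed-width encoding can.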

   My solution is to return a proxy object, MultiByteChar.  When I initially decided on this, one of my coworkers pointed out that I would run into the same problem as vector<bool>.  The Vector Wrapper Problem, as some refer to it, deserves a bit of discussion.

   The C++ standard requires that every implementation of the STL container std::vector<T> include a specialization vector<bool> that stores the bits in packed form.  (Contrast with an array of bools -- bools can be stored in memory as if they were any of several integral types, depending on the situation and the intelligence of the compiler.)  In this case, if operator[] returns a plain bool, you cannot write expressions such as a[3] = true; -- there's no bool back there!  You need to return a proxy object containing a pointer/reference to the source container, with operator= overloaded, in order to support assignment in this manner.  However, this breaks with the definition of std::vector<T> -- the standard simultaneously claims that any operator[] on a vector must return some type that is convertible to T &.  This bit of doublespeak results in the inability to reliably write certain types of wrappers around vector that can accept bool.
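
   The mismatch can be checked directly.  The sketch below assumes a C++11 compiler (for static_assert and <type_traits>); flip_third_bit is a name invented for this example.

```cpp
#include <type_traits>
#include <vector>

// For an ordinary element type, operator[] yields a real T&.
static_assert(std::is_same<std::vector<int>::reference, int&>::value,
              "vector<int>::reference is int&");

// For vector<bool>, it yields a proxy class instead of bool&.
static_assert(!std::is_same<std::vector<bool>::reference, bool&>::value,
              "vector<bool>::reference is not bool&");

// Illustrative helper: toggles bits[2] through the proxy and returns the
// new value. Assumes bits.size() > 2.
bool flip_third_bit(std::vector<bool>& bits) {
    std::vector<bool>::reference r = bits[2];  // proxy object, not bool&
    r = !r;                                    // writes through to the packed bit
    // bool& b = bits[2];                      // would not compile
    return bits[2];
}
```

   A wrapper that declares its own operator[] as returning T& compiles for every element type except bool, which is the problem in a nutshell.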

   My belief is that this was an oversight of the standardization committee.  They took the first step towards solving this by defining operator[] (and the iterator's dereference operators) as returning a member typedef, reference; however, they stopped short of the goal by saying that reference had to be defined from the allocator for the vector.  A better solution would be to define a set of semantics and overloaded operators that suitably encapsulate the intent, purpose, and behavior of references, and to define this as a Reference concept.  They could then simply require that reference be some type meeting the Reference(T) requirements, and all would be well.  This is what I intend to do.
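
   As a rough illustration of what coding against the member typedef buys you, here is a small sketch.  The helper name set_at is made up for this example; it works for any container whose reference type supports assignment from the value type, whether that type is a real T& or a proxy.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: assigns through the container's own reference type
// rather than assuming that operator[] returns value_type&.
template <typename Container>
void set_at(Container& c, std::size_t i, typename Container::value_type v) {
    typename Container::reference r = c[i];  // int& or a proxy, as appropriate
    r = v;                                   // only assignment is required of r
}
```

   The same function body serves vector<int> and vector<bool>, because the only thing it asks of the returned type is reference-like assignment.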

   The only remaining question is how to handle assignment; at first I planned to make it read-only, but later decided to maintain a reference to the host string and call replace() on the encoding/store in response to an operator=.  This means that a MultiByteChar must be templated on the source string in order to be typesafe.  This brings up the question of the string's lifetime and the ref's lifetime being separate; however, traditional C++ says that operations such as destruction may invalidate iterators/references/etc. anyway.  I think it's reasonable for the proxy to follow the same rule.  (This also means it's okay to use a member reference variable; in almost every other case, pointers are preferable as members, since a reference cannot be reseated once it has been initialized.)
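
   A minimal sketch of the idea follows.  Everything here beyond the name MultiByteChar is an assumption made for illustration: the byte offset/length bookkeeping, the bytes() and size() members, and the use of std::string as a stand-in host that exposes substr() and replace().

```cpp
#include <cstddef>
#include <string>

// Illustrative proxy for one variable-width character inside a host string.
// Host is assumed to provide std::string-like substr() and replace().
template <typename Host>
class MultiByteChar {
public:
    MultiByteChar(Host& host, std::size_t offset, std::size_t length)
        : host_(host), offset_(offset), length_(length) {}

    // Reading: copy out the bytes that make up this character.
    std::string bytes() const { return host_.substr(offset_, length_); }

    // Number of bytes currently occupied by this character.
    std::size_t size() const { return length_; }

    // Writing: splice the new encoded bytes into the host, which may
    // change the host's length, then track the new width.
    MultiByteChar& operator=(const std::string& encoded) {
        host_.replace(offset_, length_, encoded);
        length_ = encoded.size();
        return *this;
    }

private:
    Host& host_;          // member reference: the proxy never rebinds
    std::size_t offset_;  // byte offset of the character in the host
    std::size_t length_;  // byte length of the character
};
```

   Assigning a one-byte character over a two-byte one shrinks the host string, so any other outstanding proxies into that string have to be treated as invalidated, just as with iterators.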

   As far as implementation goes, I've completed the unmanaged_ptr and vector_of_bytes backing stores, and am currently working on the fixed_width_encoding parent class that all fixed-width encodings such as UCS2 and ASCII derive from.  Next post, I will likely talk about the interactions of encoding and backing store classes, and how I've divided responsibilities between them.

   To finish this post off, though, a quick oddity about the use of widen() in iostreams.  widen() is defined on streams as converting a narrow char to the stream's character type through the ctype facet of the stream's locale, so '\n' goes through it like any other character.  (The translation of '\n' to the platform's end-of-line sequence -- LF for Unix and Mac OS X, CRLF for Windows, CR for Classic MacOS -- happens separately, in text-mode file buffers.)

  • cout << '\n'; outputs cout.widen('\n'), as you'd expect.

  • cout << "\n"; iterates through all characters in the string (as reported by char_traits<char>::length()) and outputs the result of cout.widen() on each one, as you'd expect.

  • cout << string("\n"); does NOT widen characters.  It directly asks for cout's streambuf, and xsputn()'s the entire contents of data() into it.  Do not pass locale, do not collect i18n.
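
   A small sketch for experimenting with the three forms above.  It assumes the default "C" locale, where all three insertions write the same bytes on a narrow stream; any difference in widening would only become visible with a locale whose ctype facet maps characters differently, and the exact behavior may vary between library implementations.  Both function names are made up for this example.

```cpp
#include <sstream>
#include <string>

// Inserts a newline three ways into a narrow string stream and returns
// everything that was written.
std::string insert_three_ways() {
    std::ostringstream os;    // narrow stream, default locale
    os << '\n';               // single char
    os << "\n";               // C string
    os << std::string("\n");  // std::string
    return os.str();
}

// Shows widen() doing its actual job: mapping a narrow char to the
// character type of a wide stream via the stream's locale.
wchar_t widen_on_wide_stream(char c) {
    std::wostringstream ws;
    return ws.widen(c);
}
```

   Swapping the string streams for a stream imbued with a custom locale is the natural next experiment for seeing which insertion paths actually consult the facet.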

   I'm still thinking over how I want to define my behavior for operator<<.