Writing "fields" of data to an encoded file.

The moral here is "Use Unicode," so you can skip the details below if you want :)

A common problem when storing string data in various fields is how to encode it.  Obviously you can store the Unicode as Unicode, which is a good choice for an XML file or text file.  However, sometimes data gets mixed with other non-string data or stored in a record, like a database record.  There are several ways to do that; some common formats are delimited fields, fixed-width fields, and counted fields.  I'm going to ignore more robust protocols like XML for this problem.

Delimited fields use a character between fields to indicate that one field has ended and another has started.  Common delimiters are null (0), comma, and tab.  Using delimited fields, a list of names would look something like "Joe,Mary,Sally,Fred".

A fixed width field would be a field of a known size regardless of the input data size.  Generally data that is too short is padded with a space or null, and data that is too long is clipped.  If our "names" field was of fixed size four, then the previous list could look something like "Joe_MarySallFred".  Note the _ to pad the 3 character name, that Sally is clipped, and that the other names are "run together".

A counted field would indicate the field size for each piece of data before outputting the data.  The advantage is that it doesn't have the size restriction/clipping of fixed width fields, nor does it have to waste space with unnecessary padding.  (It could still be clipped for large strings, as the count is likely restricted to some number of bits.)  Similarly, delimiters aren't a problem.  Generally the count is binary, but I'll show an example using digits: "3Joe4Mary5Sally4Fred".
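As a rough sketch of the three layouts, here they are in Python, using the same four names and the same width of four. The padding character "_" and the decimal count are only the stand-ins used in the examples above, not a recommendation.

```python
names = ["Joe", "Mary", "Sally", "Fred"]

# Delimited: a comma marks where one field ends and the next starts.
delimited = ",".join(names)

# Fixed width (4): short names are padded with "_", long names are clipped.
WIDTH = 4
fixed = "".join(name[:WIDTH].ljust(WIDTH, "_") for name in names)

# Counted: each field is preceded by its length (decimal here, usually binary).
counted = "".join(f"{len(name)}{name}" for name in names)

print(delimited)  # Joe,Mary,Sally,Fred
print(fixed)      # Joe_MarySallFred
print(counted)    # 3Joe4Mary5Sally4Fred
```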

A somewhat obvious way to store and read Unicode char or Unicode string data in the above formats is to write it in Unicode.  Counted fields can just count the Unicode code points to be read in.  Fixed width fields can similarly check for the space available and use Unicode character counts.   Delimited fields can also use Unicode.

When the desired output isn't Unicode (UTF-16), however, you start running into some interesting problems.  Encodings (code pages) don't have a 1:1 relationship with UTF-16 code points, so you have to be careful.  Additionally, some encodings shift modes and maintain state through shift or escape sequences.
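To make the size mismatch concrete, here is a small Python sketch that encodes one four-character string in a few encodings. The sample string and the choice of encodings are mine; ISO-2022-JP stands in for the "shifting" kind of encoding because it uses escape sequences to switch between ASCII and Japanese.

```python
text = "Joe\u3042"  # "Joe" followed by one Japanese character (HIRAGANA LETTER A)

# Same four characters, different byte counts depending on the encoding.
sizes = {
    name: len(text.encode(name))
    for name in ("utf-16-le", "utf-8", "shift_jis", "iso2022_jp")
}

print(len(text))  # 4
print(sizes)      # {'utf-16-le': 8, 'utf-8': 6, 'shift_jis': 5, 'iso2022_jp': 11}
```

The ISO-2022-JP form is the longest because it spends three bytes shifting into Japanese mode and three more shifting back to ASCII.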

For all of the fixed, counted, and delimited techniques, shift states cause an additional problem: the writer either has to terminate the sequence or persist the state until the next field.  Consider 2 fields where field 1 has some ASCII data that looks like "Joe" followed by a shift sequence, then a Japanese character, and field 2 has "Kelly" in what looks like ASCII.  If the decoder retains the state between reading the 2 fields, it may accidentally read in "Kelly" as Japanese and presumably corrupt the output.  Alternatively, if "Kelly" was really intended to be read in "Japanese" mode, then any application starting to read at field 2 gets confused since it didn't see the shift at the end of field 1.

For that reason I like to make sure the fields are "complete", flushing the encoder at the end of each field (this is different than writing a pure-text document like XML).  So then field 1 above would have a shift-back-to-ASCII sequence at the end.
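Here is a minimal sketch of what flushing per field looks like, assuming Python's incremental encoder for ISO-2022-JP. The helper name `encode_fields` and the sample data are made up for illustration.

```python
import codecs

def encode_fields(fields, flush_each, encoding="iso2022_jp"):
    """Encode each field with one shared encoder, optionally flushing per field."""
    encoder = codecs.getincrementalencoder(encoding)()
    # Passing final=True (second argument) flushes the encoder, which emits any
    # pending shift-back sequence and returns it to its initial state.
    return [encoder.encode(text, flush_each) for text in fields]

fields = ["Joe\u3042", "Kelly"]  # field 1 ends in a Japanese character

unflushed = encode_fields(fields, flush_each=False)
flushed = encode_fields(fields, flush_each=True)

# Without flushing, the shift back to ASCII leaks into the start of field 2.
print(unflushed)
# With flushing, field 1 ends with the shift back and field 2 is plain ASCII.
print(flushed)
```

With `flush_each=True`, every field can be decoded on its own, which is the property I'm after.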

For fixed fields this could introduce another problem because the shift-back-to-ASCII sequence may exceed the allowed field size.  In that case the string would have to be made smaller before encoding to allow enough room for flushing.
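One way to handle that, sketched in Python under my own assumptions: keep trimming characters until the fully flushed encoding fits, then pad. The helper name `encode_fixed`, the ISO-2022-JP encoding, and the space padding are all illustrative choices.

```python
def encode_fixed(text, width, encoding="iso2022_jp", pad=b" "):
    """Return exactly `width` bytes: a complete (flushed) encoding of `text`,
    trimmed one character at a time until it fits, then padded."""
    while True:
        # A one-shot encode() always ends back in the initial (ASCII) state,
        # so the shift-back sequence is included in the measured length.
        data = text.encode(encoding)
        if len(data) <= width:
            return data + pad * (width - len(data))
        text = text[:-1]  # drop the last character and try again

field = encode_fixed("Joe\u3042\u3042", 12)
print(len(field))                  # 12
print(field.decode("iso2022_jp"))  # "Joe", one Japanese character, one space
```

Both Japanese characters would need 13 bytes once the shift sequences are counted, so the second one is dropped to leave room for the shift back to ASCII. Trimming by character like this is a simplification; it could still split a combining sequence.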

For delimited fields there's an additional problem in that a byte inside an encoded sequence could accidentally look like the delimiter.  Delimiters should only be tested on the decoded data.
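A small Python sketch of that trap, assuming ISO-2022-JP and a comma delimiter. I picked the character U+304C on purpose because its second encoded byte happens to be 0x2C, the same byte as an ASCII comma.

```python
record = ",".join(["Joe", "\u304c"])  # two fields: "Joe" and one Japanese character
raw = record.encode("iso2022_jp")

# Wrong: splitting the encoded bytes finds a "comma" inside the Japanese character.
wrong = raw.split(b",")

# Right: decode first, then look for the delimiter in the decoded text.
right = raw.decode("iso2022_jp").split(",")

print(len(wrong))  # 3
print(len(right))  # 2
```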

For counted fields you start having trouble if the count isn't in encoded bytes.  If you count the Unicode code points and then encode those code points, you don't know how many bytes to read back in when decoding.  It isn't possible to "just guess" when to stop reading data because there may or may not be some state-changing data that you are expected to either ignore or read.  For example, "Joe++", where ++ is a Japanese character, could look like:

4<shift-to-ascii>Joe<shift-to-Japanese><+><+>, or
4<shift-to-ascii>Joe<shift-to-Japanese><+><+><shift-to-ascii>, or
4<shift-to-ascii>Joe<shift-to-Japanese><+><+><shift-to-mode-q><shift-to-mode-z><shift-to-mode-x>

where "4" represents the count (in code points), <+><+> represents the bytes of the encoded Japanese character, and <shift...> indicates some sort of state change that doesn't cause output directly by itself.

Since the application doesn't know whether to expect the trailing <shift> sequence(s), it may not read enough data, and then may try to use <shift-to-ascii> as the count of the next field.  Similarly if it does see a <shift-to-ascii> and tries to read it in, then maybe it'll be confused if that was actually the count of the next field that just happened to look like a mode change.
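For comparison, here is a sketch in Python of counting the encoded bytes instead, after the field has been flushed. The 2-byte big-endian count and the helper names `write_counted` and `read_counted` are assumptions of mine, not any particular protocol.

```python
import struct

def write_counted(fields, encoding="iso2022_jp"):
    """Write each field as a 2-byte big-endian byte count followed by the bytes."""
    out = bytearray()
    for text in fields:
        data = text.encode(encoding)  # complete field, shift-back included
        out += struct.pack(">H", len(data)) + data  # count encoded bytes, not characters
    return bytes(out)

def read_counted(blob, encoding="iso2022_jp"):
    """Read the fields back; the count says exactly how many bytes to decode."""
    fields, pos = [], 0
    while pos < len(blob):
        (size,) = struct.unpack_from(">H", blob, pos)
        pos += 2
        fields.append(blob[pos:pos + size].decode(encoding))
        pos += size
    return fields

blob = write_counted(["Joe\u3042\u3042", "Kelly"])
print(struct.unpack_from(">H", blob)[0])  # 13 bytes for a 5-character field
print(read_counted(blob))
```

Because the count covers the trailing shift sequence too, the reader never has to guess whether more state-changing bytes follow.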

So the moral is: Use UTF-16 because that's what the strings look like so they're less likely to get shifty about their sizes. 

  • Use Unicode.  Either UTF-16, or maybe UTF-8, though it can still change size and you have to be careful, but at least each encoded sequence represents a Unicode code point.
  • If you must count, try to count the actual encoded data size, not the unencoded form since that'll be confusing when decoding.
  • Be good and flush your encoder if you must encode, so that the state gets back into a known state (usually ASCII) and then the decoding application doesn't get confused if they don't reset their decoder.
  • Make sure you say which encoding you used.

Of course you may be talking to a GPS or something where you don't get to define the standard.  In that case you can just watch out for these caveats.  Should you be designing such a protocol however, make sure to use Unicode.  If that cannot happen, at least make sure to pay attention to the impact of encoding and decoding the data when the protocol's used.

-Shawn

Comments

  • Anonymous
    June 02, 2009
    I think you are using the term "code points" for Unicode code units.  A single code point represents a single Unicode character; however, a code point may require either one or two 16-bit code units in UTF-16, and either one, two, three, or four 8-bit code units in UTF-8.

  • Anonymous
    July 07, 2009
    Yup, that's a common problem when talking about code points/units/characters/glyphs/whatever :) So reader beware:  I'm not always consistent in my grammar ;-) Thanks John, no clue you read my blog.  Guess that spam on my email address caught someone.