Using Strings for Computer Data Interchange

Previously I blogged about Culture Data Shouldn't Be Considered Stable (Except for Invariant), but it may have led to confusion in a couple of cases.  Specifically, it may have suggested the fallacy that strings are inherently localized and therefore not a good way to store data.

It is fine to store data in a string if need be.  There are often more efficient types, but strings offer features that other types don't, like plain-text readability.  Lots of modern formats require strings, like XML, JSON, etc.  Even simple numbers get spelled out as strings in those formats.

The "catch" with strings is that when you define some sort of string interchange mechanism, you have to be consistent about the formatting of those strings.  The C# & Windows "invariant" locale might be helpful for formatting some strings, though you have to make sure the standard you're using doesn't specify some other format.

For example, numbers can be formatted in various locales as "- 123,456.12", "-1,23,456.12", "123 456,12 -", "-123456.12", etc.  They could even use alternate number systems like "- १,२३,४५६.१२".  Some might require bidi formatting marks to ensure proper visual ordering in bidirectional applications.  All of these variations can make it quite difficult for a computer to parse these kinds of numeric strings - when written for humans.

Even with a wide range of human formats, computers can still use strings to exchange numbers.  The trick is merely to ensure that a consistent format is used.  For numbers it is common to use a period . as the decimal indicator and a hyphen before the number as the negative sign.  Knowing the required format, machines can easily format the correct string for transmission and parse that string when consuming the data.
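As a minimal sketch of that idea, the Python below formats and parses numbers using only the common - . convention.  The function names format_number and parse_number are made up for this illustration, and the choice to reject everything else (grouping separators, trailing signs, non-ASCII digits) is an assumption about what a strict protocol might want.

```python
import re
from decimal import Decimal

# Wire format assumed here: optional leading hyphen, ASCII digits,
# optional period followed by more ASCII digits. Nothing else.
_WIRE_NUMBER = re.compile(r"-?[0-9]+(?:\.[0-9]+)?")


def format_number(value: Decimal) -> str:
    """Format a number for interchange; the 'f' format ignores the locale."""
    return format(value, "f")


def parse_number(text: str) -> Decimal:
    """Parse a number from the wire, rejecting human-style formats."""
    if not _WIRE_NUMBER.fullmatch(text):
        raise ValueError(f"not in the agreed wire format: {text!r}")
    return Decimal(text)
```

The explicit pattern check is there because general-purpose parsers tend to be more forgiving than a protocol should be; strings such as "123 456,12 -" or "-1,23,456.12" are rejected rather than guessed at.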

Although I'd recommend the common - . system for numbers, other consistent formats could be used -- so long as all endpoints of the data stream agree on the format.  If you choose to use less common formats, ensure that all consumers are aware of the protocol you are defining and that they format and parse the data specifically for that format.

Some types are more complicated.  Dates and times carry more information than a simple number, and conventions such as whether the month or the day comes first vary by locale.  ISO 8601 formats are commonly used in computing systems, and I'd recommend using those.  Special care is needed to communicate time zone information, as the time zones themselves can change.  A UTC offset can be explicit for a specific point in time, but a recurring meeting needs to let the recipient recognize that daylight saving rules might apply to future occurrences.
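Here is a rough Python sketch of both cases.  A single point in time is sent as an ISO 8601 string with an explicit UTC offset, while a recurring meeting is sent as a local wall-clock time plus a zone name so the recipient can apply whatever rules are in force on each date.  The function names, the dictionary layout, and the use of an IANA-style zone identifier such as "America/New_York" are assumptions made for this illustration, not part of any particular standard.

```python
from datetime import datetime, time, timezone


def format_instant(moment: datetime) -> str:
    """Format one specific point in time as ISO 8601 in UTC."""
    if moment.tzinfo is None:
        raise ValueError("refusing to send a time without an offset")
    return moment.astimezone(timezone.utc).isoformat(timespec="seconds")


def parse_instant(text: str) -> datetime:
    """Parse an ISO 8601 string and insist that it carries an offset."""
    moment = datetime.fromisoformat(text)
    if moment.tzinfo is None:
        raise ValueError(f"no UTC offset in {text!r}")
    return moment


def format_recurring(local_time: time, zone_id: str) -> dict:
    """Describe a recurring meeting by wall-clock time and zone name.

    No offset is stored: the recipient resolves it per occurrence,
    since daylight saving rules may differ for future dates.
    """
    return {"local_time": local_time.isoformat(timespec="minutes"),
            "zone": zone_id}
```

Note that the recurring form deliberately avoids baking in an offset; an offset computed today may be wrong for a meeting six months from now.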

Currency data is another case where additional information is needed.  You can't just format a human-readable "$123.00", as many currencies use $ as their currency symbol.  Additional challenges around revaluations, changing symbols and other interesting currency behavior mean that most of the currency formatting APIs intended for human display are particularly bad for interchange.  Bank codes (the three-letter ISO 4217 codes such as USD) are stable and can be used as an identifier for the currency itself, while the numeric value can be transmitted as a number, like in the previous example.  It is common to transmit a <value>12.34</value> and <currency>USD</currency>, though they could be combined as "-12.34USD" so long as all endpoints agree to the protocol.
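The combined "-12.34USD" form could be handled along these lines in Python.  This is only a sketch: format_money and parse_money are hypothetical names, and the code merely checks that the currency code looks like three uppercase ASCII letters, not that it is a real, currently assigned code.

```python
import re
from decimal import Decimal

# Assumed wire format: a - . number immediately followed by a
# three-letter uppercase currency code, e.g. "-12.34USD".
_WIRE_MONEY = re.compile(r"(-?[0-9]+(?:\.[0-9]+)?)([A-Z]{3})")
_WIRE_CODE = re.compile(r"[A-Z]{3}")


def format_money(amount: Decimal, code: str) -> str:
    """Combine a locale-independent number with a currency code."""
    if not _WIRE_CODE.fullmatch(code):
        raise ValueError(f"not a three-letter currency code: {code!r}")
    return format(amount, "f") + code


def parse_money(text: str) -> tuple:
    """Split the wire string back into (amount, currency code)."""
    match = _WIRE_MONEY.fullmatch(text)
    if match is None:
        raise ValueError(f"not in the agreed wire format: {text!r}")
    return Decimal(match.group(1)), match.group(2)
```

A symbol-based string such as "$123.00" fails to parse here, which is the point: the symbol alone does not say which currency is meant.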

No matter what format is chosen, locale/language-based formatting and parsing operations should NEVER be used.  You may be able to find one or more languages that appear to generate the formats you specify, but you have no control over spelling reforms or other variations, including customization of the data by the user or machine administrator.  When formatting data for string interchange, always use explicit formats, and don't use APIs that are intended for human consumption or whose output could otherwise vary.

Testing for these protocols should ensure that the data is still readable when the user's language changes.  An easy mistake is for a developer to format the string using locale-sensitive APIs without noticing, because the developer happens to use a language that generates the expected format, or has user preferences that happen to match the standard.
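One way such a test might look in Python is sketched below.  It switches the process's numeric locale through a list of candidates and checks that the encoder's output never changes.  The locale names in CANDIDATE_LOCALES are only guesses at what might be installed on a test machine, so unavailable ones are skipped, and encode_number stands in for whatever formatting routine the protocol actually uses.

```python
import locale

# Assumed candidates; which of these exist depends on the machine.
CANDIDATE_LOCALES = ["C", "en_US.UTF-8", "de_DE.UTF-8",
                     "fr_FR.UTF-8", "hi_IN.UTF-8"]


def encode_number(value: float) -> str:
    """Stand-in encoder: the 'f' format does not consult the locale."""
    return format(value, ".2f")


def check_stable_across_locales(encode, value, expected):
    """Assert the encoder yields `expected` under every usable locale.

    Returns the list of locales that were actually exercised.
    """
    original = locale.setlocale(locale.LC_NUMERIC)  # query current setting
    checked = []
    try:
        for name in CANDIDATE_LOCALES:
            try:
                locale.setlocale(locale.LC_NUMERIC, name)
            except locale.Error:
                continue  # locale not installed here; skip it
            assert encode(value) == expected, (name, encode(value))
            checked.append(name)
    finally:
        locale.setlocale(locale.LC_NUMERIC, original)  # always restore
    return checked
```

Returning the list of locales exercised matters: if only "C" was available, the test passed without really proving anything, and the test machine probably needs more locales installed.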