Working with text and bytes

This post continues from the previous Bytes, encodings and text post.

The reading and writing classes I mentioned before are abstract (TextReader, TextWriter and Stream). Depending on what the underlying source/target is, you'll use one subclass or another. For example, for streams, you'll typically use MemoryStream if you want to work with an in-memory byte array, FileStream to work with bytes in files, and NetworkStream to work with bytes from a network socket (you'll find others if you look around, such as PipeStream).
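To make the "one abstract class, many subclasses" point concrete, here is a minimal sketch. The class and method names (StreamDemo, ReadAllBytes) are made up for illustration; the helper only knows about Stream, so the same code would work with a FileStream or a NetworkStream in place of the MemoryStream.

```csharp
using System;
using System.IO;

public static class StreamDemo
{
    // Hypothetical helper: drains any Stream into a byte array,
    // without caring which concrete subclass it was given.
    public static byte[] ReadAllBytes(Stream source)
    {
        using (var buffer = new MemoryStream())
        {
            source.CopyTo(buffer);
            return buffer.ToArray();
        }
    }

    public static void Main()
    {
        byte[] data = { 0x48, 0x69 };

        // A MemoryStream wraps an in-memory byte array.
        using (Stream stream = new MemoryStream(data))
        {
            byte[] copy = ReadAllBytes(stream);
            Console.WriteLine(BitConverter.ToString(copy)); // prints 48-69
        }
    }
}
```

Note that nothing here says what those two bytes mean as text; the stream only ever deals in bytes.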

The "classes that work on bytes" lack an encoding, so as per our equation (bytes + encoding = text), you can't really get text out of them. Enter again the encoding classes from the System.Text namespace. With these you can work with text, as long as you can tell what the correct encoding is - otherwise you'll get a runtime error or, perhaps worse, incorrect results.
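Here is a small sketch of both failure modes, under the assumption that the bytes were written as UTF-8 and that someone reads them back as ISO-8859-1. The names EncodingDemo and Decode are made up for illustration.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // bytes + encoding = text
    public static string Decode(byte[] bytes, Encoding encoding)
    {
        return encoding.GetString(bytes);
    }

    public static void Main()
    {
        // "café" as UTF-8 is 63 61 66 C3 A9: the é takes two bytes.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("caf\u00E9");
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // Right encoding: the original 4 characters come back.
        Console.WriteLine(Decode(utf8Bytes, Encoding.UTF8).Length);

        // Wrong encoding: no error, just 5 characters of incorrect text.
        Console.WriteLine(Decode(utf8Bytes, latin1).Length);

        // A strict decoder (throwOnInvalidBytes: true) gives a runtime error instead.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            // A lone 0xE9 (é in ISO-8859-1) is not a valid UTF-8 sequence.
            Decode(new byte[] { 0xE9 }, strictUtf8);
        }
        catch (DecoderFallbackException)
        {
            Console.WriteLine("invalid UTF-8");
        }
    }
}
```

The silent case is the nastier one: every byte in the UTF-8 data is also a valid ISO-8859-1 character, so nothing complains.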

The concrete classes you can use to read/write text inherit from TextReader/TextWriter, and include StreamReader/StreamWriter (working on an underlying stream) and StringReader/StringWriter. The latter two are an oddity - no encoding is necessary, as they work on a String/StringBuilder and so never go down to bytes.

As you might expect, then, the StreamReader/StreamWriter classes take an Encoding object in their constructor. All is well and good.
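As a rough sketch of what that looks like, the following round-trips a string through bytes with an explicit encoding on both sides. RoundTripDemo, WriteText and ReadText are names I made up for the example, and I'm using a MemoryStream only to keep it self-contained.

```csharp
using System;
using System.IO;
using System.Text;

public static class RoundTripDemo
{
    // Text -> bytes, with the encoding stated explicitly.
    public static byte[] WriteText(string text, Encoding encoding)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream, encoding))
            {
                writer.Write(text);
            } // disposing the writer flushes it and closes the stream

            // ToArray still works on a closed MemoryStream.
            return stream.ToArray();
        }
    }

    // Bytes -> text, with the same encoding stated explicitly.
    public static string ReadText(byte[] bytes, Encoding encoding)
    {
        using (var reader = new StreamReader(new MemoryStream(bytes), encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        byte[] bytes = WriteText("na\u00EFve", Encoding.Unicode);
        Console.WriteLine(ReadText(bytes, Encoding.Unicode) == "na\u00EFve");
    }
}
```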

In general, be careful with anything that goes from bytes to text without an explicit encoding. For example, one of the overloads of the StreamReader constructor doesn't take an encoding. Do you know what encoding it will default to? UTF-8, the same as StreamWriter, so you're good as long as you pair them to work on any given file. But if you need to interoperate with other systems - and you never know which other programs those might be, especially when using files - then you need to watch out for these things.
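If you'd rather see the default than take it on faith, one quick check is to write a single non-ASCII character with the overload that takes no encoding and look at the bytes that come out. This is only a sketch; DefaultEncodingDemo and WriteWithDefault are made-up names.

```csharp
using System;
using System.IO;

public static class DefaultEncodingDemo
{
    // Writes text using the StreamWriter overload that takes no Encoding.
    public static byte[] WriteWithDefault(string text)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream)) // no Encoding argument
            {
                writer.Write(text);
            }
            return stream.ToArray();
        }
    }

    public static void Main()
    {
        // é comes out as the two UTF-8 bytes C3 A9, with no byte order mark.
        Console.WriteLine(BitConverter.ToString(WriteWithDefault("\u00E9"))); // prints C3-A9
    }
}
```

A program on the other end that assumes a different encoding will read those two bytes as two characters, which is exactly the interoperability problem described above.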

So, how do you know what encoding was used to write out text when reading from a source of bytes? There is no general way to know, so typically you might find an out-of-band mechanism to help with this. For example, HTTP requests and responses often include the charset as a parameter on the MIME type that describes their content.
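As an illustration of the HTTP case, here is one way you might turn a Content-Type header value into an Encoding. The helper name EncodingFromContentType and the idea of a caller-supplied fallback are my own assumptions for the sketch, not something HTTP or .NET prescribes.

```csharp
using System;
using System.Net.Http.Headers;
using System.Text;

public static class CharsetDemo
{
    // Hypothetical helper: picks an Encoding from a Content-Type value,
    // using the fallback when the charset is missing or not recognized.
    public static Encoding EncodingFromContentType(string contentType, Encoding fallback)
    {
        MediaTypeHeaderValue parsed;
        if (!MediaTypeHeaderValue.TryParse(contentType, out parsed)
            || string.IsNullOrEmpty(parsed.CharSet))
        {
            return fallback;
        }

        try
        {
            // The charset value may arrive quoted, so strip any quotes first.
            return Encoding.GetEncoding(parsed.CharSet.Trim('"'));
        }
        catch (ArgumentException)
        {
            // GetEncoding throws for names it does not know.
            return fallback;
        }
    }

    public static void Main()
    {
        Encoding e = EncodingFromContentType("text/html; charset=utf-8", Encoding.ASCII);
        Console.WriteLine(e.WebName);
    }
}
```

What the fallback should be depends on the protocol and the content type, so the sketch leaves that decision to the caller.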

Thankfully XML has already taken steps to make this even easier. The XML declaration at the top of the document specifies the encoding, and because documents with a declaration must start with '<?xml', parsers can figure out enough to parse the encoding attribute and pick the right encoding for the rest of the document (more details here).
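You can watch a parser do this by handing it raw bytes rather than text. In this sketch (XmlEncodingDemo and ReadRootText are made-up names) the document is encoded as ISO-8859-1 and says so in its declaration; the code never passes an Encoding to the parser.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

public static class XmlEncodingDemo
{
    // Parses XML from raw bytes; the parser works out the encoding itself.
    public static string ReadRootText(byte[] xmlBytes)
    {
        using (var stream = new MemoryStream(xmlBytes))
        {
            return XDocument.Load(stream).Root.Value;
        }
    }

    public static void Main()
    {
        string xml = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?><word>caf\u00E9</word>";

        // In ISO-8859-1 the é is the single byte 0xE9, which is not valid UTF-8.
        byte[] bytes = Encoding.GetEncoding("iso-8859-1").GetBytes(xml);

        Console.WriteLine(ReadRootText(bytes) == "caf\u00E9");
    }
}
```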

The next hurdle to jump for further text processing is language identification. This is necessary for correct rendering, including default text directionality and glyph selection. But more on that some other time...