What's with Encoding.GetMaxByteCount() and Encoding.GetMaxCharCount()?

The behavior of Encoding.GetMaxByteCount() changed somewhat between .NET versions 1.0/1.1 and version 2.0 (Whidbey).  The change was made partly because GetMaxByteCount() didn't always return the worst-case byte count, and partly because fallbacks can produce larger maximums than previous versions allowed.  GetMaxCharCount() has similar issues, but for decoding.

For example, Encoding.GetEncoding(1252).GetMaxByteCount(1) returns 2.  1252 is a single byte code page (encoding), so one would generally expect GetMaxByteCount(n) to return n.  It doesn't; it usually returns n+1.
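A minimal sketch of that n+1 behavior follows. It assumes code page 28591 (ISO-8859-1) as a stand-in for 1252, because newer runtimes may need a code page provider registered before GetEncoding(1252) works. The class and method names are made up for illustration.

```csharp
using System;
using System.Text;

static class MaxByteCountDemo
{
    // Code page 28591 (ISO-8859-1) stands in for 1252 here: an assumption made
    // so the sketch runs without registering extra code page providers.
    public static int MaxBytes(int charCount)
    {
        return Encoding.GetEncoding(28591).GetMaxByteCount(charCount);
    }

    static void Main()
    {
        for (int n = 0; n <= 3; n++)
        {
            // On .NET 2.0 and later each line should show n + 1, not n.
            Console.WriteLine("GetMaxByteCount({0}) = {1}", n, MaxBytes(n));
        }
    }
}
```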

One reason for this oddity is that an Encoder can hold on to a high surrogate from one call to GetBytes(), hoping that the next call begins with a low surrogate.  This allows the fallback mechanism to provide a fallback for a complete surrogate pair, even if that pair is split between calls to GetBytes().  If the fallback returns a ? for each surrogate half, or if the next call doesn't start with a low surrogate, then 2 bytes could be output even though that call was passed only 1 character.  So in this case, calling Encoder.GetBytes() with only a high surrogate would return 0 bytes, and following that with another call containing only the low surrogate would return 2 bytes.
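The split-surrogate case can be tried with a sketch like the one below. It feeds a string to a single Encoder one character per call and records how many bytes each call produced. ISO-8859-1 (code page 28591) is again an assumed stand-in for 1252, and SplitSurrogateDemo and BytesPerCall are illustrative names.

```csharp
using System;
using System.Text;

static class SplitSurrogateDemo
{
    // Passes each char of `text` to the same Encoder in its own GetBytes()
    // call and returns the number of bytes each call produced.
    public static int[] BytesPerCall(Encoding encoding, string text)
    {
        Encoder encoder = encoding.GetEncoder();
        char[] chars = text.ToCharArray();
        // Sized for the worst case of a single one-character call.
        byte[] output = new byte[encoding.GetMaxByteCount(1)];
        int[] counts = new int[chars.Length];

        for (int i = 0; i < chars.Length; i++)
        {
            bool flush = (i == chars.Length - 1); // flush only on the last call
            counts[i] = encoder.GetBytes(chars, i, 1, output, 0, flush);
        }
        return counts;
    }

    static void Main()
    {
        // U+1F600 is the UTF-16 surrogate pair D83D DE00, which a single-byte
        // code page cannot represent, so the fallback is used.
        int[] counts = BytesPerCall(Encoding.GetEncoding(28591), "\uD83D\uDE00");
        Console.WriteLine(string.Join(", ", counts));
    }
}
```

If the encoder behaves as described above, the first call should report 0 bytes and the second should report 2, which is why a buffer of n bytes would not be enough for a call with n characters.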

Another change is that GetMaxByteCount(n) now asks the fallback what its worst case for an unknown character is.  So if I have a fallback that looks like new EncoderReplacementFallback("{Unknown Character}"), then those 19 characters will aggravate our worst case.  For 1252, GetMaxByteCount(n) would then be 19 * (n + 1), or 38 for GetMaxByteCount(1)!
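The arithmetic above can be checked with a short sketch. It assumes ISO-8859-1 (code page 28591) in place of 1252 and uses the GetEncoding overload that takes explicit fallbacks. FallbackWorstCase and MaxBytes are illustrative names.

```csharp
using System;
using System.Text;

static class FallbackWorstCase
{
    // Builds a single-byte encoding with the given replacement string as its
    // encoder fallback and reports the worst-case byte count for charCount chars.
    public static int MaxBytes(string replacement, int charCount)
    {
        Encoding encoding = Encoding.GetEncoding(
            28591, // assumption: ISO-8859-1 used as a stand-in for 1252
            new EncoderReplacementFallback(replacement),
            DecoderFallback.ReplacementFallback);
        return encoding.GetMaxByteCount(charCount);
    }

    static void Main()
    {
        Console.WriteLine(MaxBytes("?", 1));                   // (1 + 1) * 1
        Console.WriteLine(MaxBytes("{Unknown Character}", 1)); // (1 + 1) * 19
    }
}
```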

Allocating a buffer is what these methods are designed for, but avoid abusing them.  I've seen code that assumes that GetMaxByteCount() and GetMaxCharCount() are reciprocals of each other, or that they are linear.  If my output buffer is 256 bytes, then I can't use 256/GetMaxCharCount(1) to determine the size of my input buffer!  In fact, due to the fallbacks, GetMaxByteCount(1) could get quite large.  Try to allocate your output buffer based on the GetMax...() size of your input buffer.  If your output buffer is constrained in size, then consider the new Encoder.Convert() and Decoder.Convert() methods.
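For the constrained-buffer case, here is a rough sketch of how Encoder.Convert() might be used. It encodes a string through a small fixed-size byte buffer and writes each chunk to a stream. The names ConvertDemo and EncodeInChunks, the buffer size, and the choice of UTF-8 are illustrative assumptions, not anything prescribed by the framework.

```csharp
using System;
using System.IO;
using System.Text;

static class ConvertDemo
{
    // Encodes `text` through a fixed-size output buffer, writing each chunk to
    // `destination`. Returns the total number of bytes written.
    public static int EncodeInChunks(Encoding encoding, string text,
                                     Stream destination, int bufferSize)
    {
        // Convert() throws if the buffer cannot hold even one character's
        // output; requiring the one-char worst case is a conservative guard.
        if (bufferSize < encoding.GetMaxByteCount(1))
            throw new ArgumentException("Buffer too small.", "bufferSize");

        Encoder encoder = encoding.GetEncoder();
        char[] chars = text.ToCharArray();
        byte[] buffer = new byte[bufferSize];
        int charIndex = 0;
        int total = 0;
        bool completed = false;

        while (!completed)
        {
            int charsUsed, bytesUsed;
            // All remaining input is offered on every call, so flush is true;
            // Convert() stops early when the output buffer fills up.
            encoder.Convert(chars, charIndex, chars.Length - charIndex,
                            buffer, 0, buffer.Length, true,
                            out charsUsed, out bytesUsed, out completed);
            destination.Write(buffer, 0, bytesUsed);
            charIndex += charsUsed;
            total += bytesUsed;
        }
        return total;
    }

    static void Main()
    {
        using (MemoryStream stream = new MemoryStream())
        {
            int written = EncodeInChunks(Encoding.UTF8,
                "h\u00E9llo w\u00F6rld, h\u00E9llo w\u00F6rld", stream, 16);
            Console.WriteLine("{0} bytes written", written);
        }
    }
}
```

The loop never needs to know the total output size in advance, so it avoids the reciprocal guesswork described above.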

Another use of GetMaxByteCount() was to determine whether an encoding is single byte or not.  Usually that kind of code really means "does 1 input char always cause 1 output char?", but that might not be true even if GetMaxByteCount(1) == 1.  Use the IsSingleByte property to try to figure out if an encoding is a single byte code page; however, I'd really recommend that you don't make too many assumptions about encodings.  Code that assumes a 1 to 1 relationship and then tries to seek or back up or something is likely to get confused; encodings aren't conducive to that kind of behavior.  Fallbacks, decoders and encoders can change the byte count behavior for individual calls, and encodings can sometimes do unexpected things.
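As a small illustration of why GetMaxByteCount(1) is a poor single-byte test, the sketch below prints IsSingleByte next to GetMaxByteCount(1) for two encodings. ISO-8859-1 (code page 28591) is an assumed stand-in for a single byte code page such as 1252, and SingleByteCheck and Describe are illustrative names.

```csharp
using System;
using System.Text;

static class SingleByteCheck
{
    // Formats the two values side by side for one encoding.
    public static string Describe(Encoding encoding)
    {
        return string.Format("{0}: IsSingleByte = {1}, GetMaxByteCount(1) = {2}",
            encoding.WebName, encoding.IsSingleByte, encoding.GetMaxByteCount(1));
    }

    static void Main()
    {
        // A single byte code page still reports more than 1 for GetMaxByteCount(1).
        Console.WriteLine(Describe(Encoding.GetEncoding(28591)));
        Console.WriteLine(Describe(Encoding.UTF8));
    }
}
```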

If you're using .NET 1.0 or 1.1 and allocating a buffer for an Encoder using GetMaxByteCount() or for a Decoder using GetMaxCharCount(), you should probably allow a bit of extra space in case a byte or character is left over in the encoder or decoder, causing additional output on the next call.

Use Unicode!
