Encoder/Decoder Encoding fallbacks fail after 2GB of data has been converted
We have an unfortunate bug in .Net v2.0+ that causes encoding or decoding to fail after more than 2GB of data has been converted. That's a lot of data, but it still shouldn't fail. The problem is in our built-in fallbacks.
Ironically, if you encounter bad bytes the counter is reset and you're "good" for another 2GB. The bug affects most of our code pages when converting valid data, but some optimizations make it unlikely to happen with Unicode (UTF-16), ASCII & Latin-1. There are some workarounds, though a few of them don't work if you're insulated from the encoder/decoder (for example, by a StreamWriter):
- Change the encoder and decoder fallbacks to custom fallbacks, or use the built-in EncoderExceptionFallback/DecoderExceptionFallback. If you have known-good data, the exception fallbacks are a good choice (see the first sketch after this list).
- Use UTF-8 or UTF-16. I think this nearly completely solves the problem; at a minimum it raises the limit by enough orders of magnitude that your computer would probably die of hardware failure before you hit the bug.
- Unconvertible data resets the counter, so you get another 2GB before it fails again. You may be able to occasionally introduce an unconvertible code point (like U+FFFD).
- The bug only happens when the encoder/decoder fallback buffers aren't reset. Encoding.GetBytes/GetChars won't fail unless you convert a single string longer than 2GB, so if you're working with short text segments that don't need Encoder or Decoder state, that's a good approach. For example, if you're piping a bunch of messages to the console, you might consider just sending one line at a time through the Encoding class (second sketch below).
- Getting a new Encoder or Decoder object whenever possible gives you a fresh start. For example, if you process a bunch of smaller documents you might swap in new encoders/decoders between documents, or between records, or whatever fits your data (third sketch below).
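Here's a minimal sketch of the first workaround. The code page is arbitrary (I've picked windows-1252 just for illustration); the part that matters is passing EncoderExceptionFallback and DecoderExceptionFallback to Encoding.GetEncoding so the default replacement-fallback path is never used:

```csharp
using System;
using System.Text;

class FallbackSketch
{
    static void Main()
    {
        // Ask for the code page with exception fallbacks instead of the
        // default replacement fallbacks (code page choice is arbitrary here).
        Encoding enc = Encoding.GetEncoding(
            "windows-1252",
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());

        // With known-good data these calls never throw, and the buggy
        // fallback buffer is never involved.
        byte[] bytes = enc.GetBytes("Known-good text");
        Console.WriteLine(enc.GetString(bytes));
    }
}
```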
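And a sketch of the "one line at a time with the Encoding class" idea. The WriteMessages helper and the message list are made up for illustration; the point is that each Encoding.GetBytes call is stateless, so no fallback state accumulates across messages:

```csharp
using System;
using System.IO;
using System.Text;

class PerMessageSketch
{
    // Hypothetical helper: each message is converted independently with
    // Encoding.GetBytes, so no Encoder (and no fallback buffer) state is
    // carried from one message to the next.
    static void WriteMessages(Encoding enc, string[] messages)
    {
        using (Stream stdout = Console.OpenStandardOutput())
        {
            foreach (string message in messages)
            {
                byte[] bytes = enc.GetBytes(message + Environment.NewLine);
                stdout.Write(bytes, 0, bytes.Length);
            }
        }
    }

    static void Main()
    {
        // Code page 1252 picked arbitrarily for the example.
        WriteMessages(Encoding.GetEncoding(1252),
                      new string[] { "first line", "second line" });
    }
}
```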
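Finally, a sketch of the fresh-Encoder-per-document idea. EncodeDocument is a made-up helper name; the point is just that Encoding.GetEncoder() hands back an Encoder with no leftover state from the previous document:

```csharp
using System;
using System.Text;

class FreshEncoderSketch
{
    // Hypothetical helper: request a brand-new Encoder for each document so
    // any fallback state left over from the previous document is discarded.
    static byte[] EncodeDocument(Encoding encoding, char[] documentChars)
    {
        Encoder encoder = encoding.GetEncoder();   // fresh state every time
        int byteCount = encoder.GetByteCount(documentChars, 0, documentChars.Length, true);
        byte[] bytes = new byte[byteCount];
        encoder.GetBytes(documentChars, 0, documentChars.Length, bytes, 0, true);
        return bytes;
    }

    static void Main()
    {
        Encoding enc = Encoding.GetEncoding(1252);  // arbitrary code page
        byte[] doc1 = EncodeDocument(enc, "first document".ToCharArray());
        byte[] doc2 = EncodeDocument(enc, "second document".ToCharArray());
        Console.WriteLine(doc1.Length + " + " + doc2.Length + " bytes");
    }
}
```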
Hope that helps,
Shawn