Fixing bugs with compatibility in mind - an IO experience [Lakshan Fernando]
The current focus on the team is bug-fixing, but every so often we get a bug that may cause compatibility issues if we fix it. I thought I would share one such recent scenario where, because the gain doesn’t seem significant enough, we’re currently choosing to stick with the V1.1 behavior. Note that For Beta2, you’ll actually see the CHANGED behavior, but we intend to roll that back. The issue was on the direction to take when StreamWriter/Reader encounter invalid characters. Below are the issue in more details, the steps we considered and the current thinking on this issue.
StreamWriter by default uses UTF8Encoding which will throw on invalid encoding errors (for example, in case of a high surrogate character without a following low surrogate character). This will cause the internal StreamWriter's state to be irrecoverable as we would have buffered the illegal chars and any subsequent call to Flush() would again hit the encoding error. This means that the StreamWriter become unusable after this. The user can release the underlying resources by calling Close(). We will throw on this too but will release the resources (a bug in V1.1 prevented this from happening but this will be fixed in a future SP release). StreamReader by default uses UTFEncoding that doesn’t throw on above. This means that the default behavior would be to ignore any invalid byte sequence. Our concern was that the user would not know of this issue. Below are 2 code snippets that show this behavior.
StreamWriter:
// This array contains two high surrogates in a row (\uD801, \uD802).
// A high surrogate should be followed by a low surrogate.
Char[] chars = new Char[] { 'a', 'b', 'c', '\uD801', '\uD802', 'd', '?' };
using (StreamWriter writer = new StreamWriter("foo"))
{
writer.Write(chars, 0, chars.Length);
try
{
writer.Flush();
Console.WriteLine("Didn't throw");
}
catch (Exception e)
{
Console.WriteLine("Exception raised. \nMessage: {0}", e.Message);
}
}
StreamReader:
//This byte sequence has an invalid high surrogate char (240, 144). We replace this with the ? (63) in Beta2 but will
//most likely ignore (V1.1 behavior) in post RTM
MemoryStream stream = new MemoryStream(new Byte[] { 97, 240, 144, 98 });
using (StreamReader reader = new StreamReader(stream))
{
int value;
while ((value = reader.Read()) != -1)
Console.Write("Character value: {0} ", value);
Console.WriteLine();
}
In V2.0, we’ve added new classes to System.Text namespace called EncoderFallback/DecoderFallback which provide a failure-handling mechanism for an input character that cannot be converted to an encoded output byte sequence or for an encoded input byte sequence that cannot be converted to an output character. We considered using these classes where we will replace the character ‘?’ when encountering an invalid byte sequence (using the default ReplacementFallback class). We also considered changing the behavior in the StreamReader’s to throw as in the StreamWriter class.
In StreamWriter’s case, the general consensus was to leave the behavior as previously shipped state (continue to throw). The user will have the option of explicitly providing an Encoding class that wouldn’t throw or add a Fallback class to the encoding class to handle a failure mechanism using the new classes in System.Text. In StreamReader’s case, we seriously considered replacing the invalid byte sequence with the ‘?’ character. In fact, Beta2 will ship with this behavior. This wasn’t a clear cut decision though. Initially, I was against this change. The dev was really for it since he believed that the buyoff for the customer was good. The PM thought the risk wasn’t high enough, and also liked the buyoff and I ended up agreeing. But on further analysis, it simply doesn’t look like the overall benefit outweighs the risk and we are planning to roll this back to the V1.1 behavior.
Comments
- Anonymous
March 17, 2005
Tips - Anonymous
March 21, 2005
Tips