Random and unexpected EXCEPTION_FLT_DIVIDE_BY_ZERO and EXCEPTION_FLT_INVALID_OPERATION
Your application has been running fine, then one day it starts to fail with EXCEPTION_FLT_DIVIDE_BY_ZERO (0xC000008E) or EXCEPTION_FLT_INVALID_OPERATION (0xC0000090) exceptions at seemingly random places.
Why?
One reason might be the current value of the floating point control word (fpcw). This is a bit mask used to control whether Intel 8087 and later CPUs raise exceptions or not when certain types of floating point errors occur. There is a good article about this over on the openwatcom.org site.
In Windows applications the usual value for fpcw is 027F. For example, just fire up notepad, attach WinDBG and do the following:
0:001> ~*e r @fpcw
fpcw=0000027f
fpcw=0000027f
The value of this register can be set in many ways; for example, the CRT functions _control87, _controlfp and __control87_2 can all modify it (the related _clear87, _clearfp, _status87, _statusfp and _statusfp2 functions read or clear the floating point status word rather than the control word). But once it is modified, it fundamentally changes the ground rules for how some floating point operations behave on that thread until the original value is restored.
As 027f is the "normal" value on Windows, almost all Microsoft and third party applications, components, frameworks and libraries are written and tested with the expectation that this register will have this value and that floating point operations will happen in a particular way either raising or not raising exceptions. Therefore any code that needs to modify this register for some reason has a duty to change it back again when finished unless it is running on its own private thread. If not, mayhem will result.
And that is exactly what we see sometimes here at Microsoft support.
I had a case not so long ago from a systems integrator. They were delivering a solution to an end customer using a system developed by another company which in turn used applications from different vendors. After an update to one of the components was deployed these applications began to fail with floating point exceptions. Troubleshooting was particularly difficult because the end customer was in an isolated environment without access to the Internet so remote access was out of the question. Every time we wanted to do some troubleshooting someone from one of the vendors had to go on site and we'd have a phone conference where I would talk them through a series of debug steps "blind" (you develop good visualisation skills in my job).
From the outset I suspected something was modifying the fpcw so the first thing I had them check was the value of fpcw on all threads at the point in time where the exceptions had started to happen but the process was still up (unhandled, these exceptions will take down a process). Sure enough, somehow the "normal" value had been changed on thread 0, the main UI thread of the application:
0:000> ~*e r @fpcw
fpcw=00001372
fpcw=0000027f
fpcw=0000027f
fpcw=0000027f
fpcw=0000027f
fpcw=0000027f
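To see what the two values mean, here is a minimal sketch that decodes an x87 control word, assuming the standard Intel layout (bits 0 to 5 are the exception masks, bits 8 to 9 the precision control, bits 10 to 11 the rounding control). The function and table names are made up for illustration and are not part of any debugger or CRT API.

```python
# Bit positions follow the documented x87 control word layout.
# A set mask bit means the exception is suppressed (masked);
# a clear bit means the CPU raises it.
X87_EXCEPTION_MASKS = {
    0: "invalid operation",
    1: "denormal operand",
    2: "divide by zero",
    3: "overflow",
    4: "underflow",
    5: "inexact result (precision)",
}
PRECISION_CONTROL = {
    0: "24-bit (single)",
    1: "reserved",
    2: "53-bit (double)",
    3: "64-bit (extended)",
}
ROUNDING_CONTROL = {0: "nearest", 1: "down", 2: "up", 3: "toward zero"}


def decode_fpcw(value):
    """Hypothetical helper: describe an x87 control word value."""
    unmasked = [
        name
        for bit, name in X87_EXCEPTION_MASKS.items()
        if not value & (1 << bit)
    ]
    return {
        "unmasked": unmasked,
        "precision": PRECISION_CONTROL[(value >> 8) & 0x3],
        "rounding": ROUNDING_CONTROL[(value >> 10) & 0x3],
    }


for word in (0x027F, 0x1372):
    print(f"{word:04x}: {decode_fpcw(word)}")
```

Under that layout, 027f masks all six exceptions, while 1372 leaves invalid operation, divide by zero and overflow unmasked, which lines up with the two exception codes in the title.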
So this was the cause of the exceptions but the harder question to answer was who was changing this?
The lack of direct access to the system in question limited the complexity of the debug steps I could use, as I would be talking someone else (who was not familiar with debugging) through them. I decided to start by running the process under the debugger from the beginning and setting the debugger to break any time a module loaded into the process, dump out the value of fpcw on every thread and then continue execution. All this output would be captured in a log file which could be brought back from the onsite visit. This was on the assumption that whatever DLL was making the change was likely to be doing it when it first loaded into the process. To do this we used the following commands just after launching the process under the debugger:
0:001> .logopen c:\debug_session.log
Closing open log file debug_1318_2008-09-30_15-06-38-527.log
Opened log file 'c:\debug_session.log'
0:001> sxe -c "~*e r @fpcw;g" ld
(Putting @ before the register name tells the debugger that it is a register name, so it does not need to be evaluated for any symbol resolution that might be going on. Doing this can make debug sessions a bit more responsive.)
The output looked something like the following:
ModLoad: 053a0000 053b1000 C:\libraries\thirdpart.dll <<< start of loading of third party module
fpcw=0000027f
fpcw=0000027f
fpcw=0000027f
fpcw=0000027f
fpcw=0000027f
fpcw=0000027f
ModLoad: 77760000 778cc000 C:\WINDOWS\system32\shdocvw.dll <<< start of loading of shdocvw.dll (a Windows component)
fpcw=00001372 <<< incorrect value of fpcw on thread 0
fpcw=0000027f
fpcw=0000027f
fpcw=0000027f
fpcw=0000027f
fpcw=0000027f
Sometime between the start of loading thirdpart.dll and the start of loading the next DLL into the process, the wrong fpcw value was set on thread 0. Therefore we could now say with a reasonable degree of certainty that thirdpart.dll was responsible.
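When the log is long, the same reasoning can be automated. The following is a rough sketch, assuming the log contains only `ModLoad:` lines and `fpcw=` lines in the shape shown above, that module paths contain no spaces, and that 027f is the expected value. `find_suspect` is a hypothetical helper name, not a debugger feature.

```python
import re

NORMAL_FPCW = 0x027F  # the usual Windows value quoted in the article

# "ModLoad: <start> <end> <path>"; anything after the path is ignored.
MODLOAD = re.compile(r"^ModLoad:\s+\S+\s+\S+\s+(\S+)")
FPCW = re.compile(r"^fpcw=([0-9a-fA-F]+)")


def find_suspect(log_lines):
    """Return (suspect_module, detected_at_module) for the first
    non-normal fpcw value, or None if every value is normal.

    The break fires at the START of each module load, so a bad value
    dumped under module N was set while module N-1 was loading or
    running. The suspect is therefore the previous module.
    """
    previous_module = None
    current_module = None
    for line in log_lines:
        line = line.strip()
        match = MODLOAD.match(line)
        if match:
            previous_module, current_module = current_module, match.group(1)
            continue
        match = FPCW.match(line)
        if match and int(match.group(1), 16) != NORMAL_FPCW:
            return previous_module, current_module
    return None


sample = r"""ModLoad: 053a0000 053b1000 C:\libraries\thirdpart.dll
fpcw=0000027f
fpcw=0000027f
ModLoad: 77760000 778cc000 C:\WINDOWS\system32\shdocvw.dll
fpcw=00001372
fpcw=0000027f
""".splitlines()

print(find_suspect(sample))
```

This only narrows things down to a window between two module loads. Code that changes the register long after its DLL has loaded would need a different approach.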
After discussions between all parties involved we eventually established that this module was being injected into the process to fulfil a hooking/monitoring function. Unfortunately the changing of the fpcw value appeared to be a side effect of the non-Microsoft compiler the DLL was compiled with. Certain compilers seem to generate code that does this possibly as a legacy side effect of targeting non-Microsoft operating systems in the past. The vendor was not in a position to recompile this module so in the end they had to redesign things to avoid using it.
[A little tip for spotting components that were compiled with certain non-Microsoft compilers (based on my experience). A clue lies in the time date stamp from the module's PE header (do lmvm thirdpart in the debugger):
2A425E19 time date stamp Fri Jun 19 23:22:17 1992
Now I remember that when I joined Microsoft Developer Support in 1995 we were in the beta of Windows 95, and although there was a thing called Win32s that gave some kind of 32-bit implementation on the 16-bit Windows platform, we were only just at the beginning of 32-bit computing. So I was fairly sure this component was not really compiled in 1992. I've seen this 1992 thing a few times now and I think it has usually been with PE binaries produced by a non-Microsoft compiler. ]
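The hexadecimal value in that output is a count of seconds since 1 January 1970 UTC, so it can be checked outside the debugger. A small sketch follows; the helper name is invented for illustration.

```python
from datetime import datetime, timezone


def pe_timestamp_to_utc(stamp):
    """Hypothetical helper: convert a PE time date stamp to a UTC datetime."""
    return datetime.fromtimestamp(stamp, tz=timezone.utc)


# Value shown by lmvm in the article.
print(pe_timestamp_to_utc(0x2A425E19).isoformat())
# prints 1992-06-19T22:22:17+00:00
```

The debugger output above shows 23:22:17, which would be consistent with it displaying local time on a machine one hour ahead of UTC.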
I've also seen cases where modification of the mxcsr register has led to very unexpected errors:
Microsoft VBScript error 800a000b
Division by Zero
This was occurring on this line of ASP code:
Response.Write(5/3)
Imagine how confusing that was!
In that case a third party ASP.NET charting component was changing the mxcsr register to 00001fa0 or 00001fa4 instead of its "normal" Windows value of 00001f80 (the customer was hosting ASP.NET and ASP applications in the same application pool).
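For reference, here is a minimal sketch that decodes those mxcsr values, assuming the standard Intel layout in which bits 0 to 5 are sticky exception status flags and bits 7 to 12 are the corresponding exception masks. The names are illustrative only.

```python
# Same six exception kinds, in bit order, for both the flag bits (0-5)
# and the mask bits (7-12) of MXCSR.
SSE_EXCEPTIONS = [
    "invalid operation",
    "denormal operand",
    "divide by zero",
    "overflow",
    "underflow",
    "inexact result (precision)",
]


def decode_mxcsr(value):
    """Hypothetical helper: list the flags set and the exceptions unmasked."""
    flags_set = [
        name for bit, name in enumerate(SSE_EXCEPTIONS) if value & (1 << bit)
    ]
    unmasked = [
        name
        for bit, name in enumerate(SSE_EXCEPTIONS)
        if not value & (1 << (bit + 7))
    ]
    return {"flags_set": flags_set, "unmasked": unmasked}


for word in (0x1F80, 0x1FA0, 0x1FA4):
    print(f"{word:08x}: {decode_mxcsr(word)}")
```

Read this way, all three values keep every exception masked. 00001fa0 differs from 00001f80 only by the inexact result flag, and 00001fa4 additionally has the divide by zero flag set. That is an interpretation of the bit layout, not something established in the case itself, and how VBScript reacted to those flags is not covered here.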
In another case we saw a customer getting a VBScript error 6, overflow on this line:
x = 1 + 2.0
Again, confusion reigned. This time it was caused by a component that was using MMX/SSE2/SSE3 instructions.
I'm not against code altering the fpcw or mxcsr registers. But if you are a library component that is going to be used by arbitrary threads in some foreign host process then your documentation needs to have a big red warning sticker on it and you certainly shouldn't go around injecting yourself into other processes and changing the way the CPU behaves. That's just bad manners!
HTH
Doug
Comments
Anonymous
November 13, 2008
I encountered the same problem. Here's my scenario: a very old application with very 'old' code. Several years ago we integrated .NET 1.1 into the application, then .NET 2.0. One day we found that some .NET operations would cause a stack overflow in our application. After investigation, we found that our old application unmasks the underflow exception in the floating point control word at startup; then when you use Double.Min or Double.Max or WPF, you get a stack overflow. My post in the forum: http://social.msdn.microsoft.com/forums/en-US/clr/thread/b3505262-4e01-4e21-bea6-ce897caf4186/
Anonymous
April 04, 2009
That doesn't make this make any more sense:
ModLoad: 75e00000 75e1f000 C:\Windows\system32\IMM32.DLL
fpcw=0000027f
ModLoad: 753d0000 7549d000 C:\Windows\system32\MSCTF.dll
fpcw=0000027f
ModLoad: 73d90000 73dd0000 C:\Windows\system32\uxtheme.dll
fpcw=00001372
ModLoad: 739f0000 73a03000 C:\Windows\system32\dwmapi.dll
Anonymous
April 05, 2009
Hi Troff
I agree that looks weird. Hard to comment without more context, however.
Doug
Anonymous
August 10, 2014
Hello, is it possible that the VB6 runtime is using the fpcw of 000137 by default? We are currently facing the same problem when using WPF inside a VB6 application. It seems WPF is running with the fpcw value of VB6.
With best regards
Jakob
Anonymous
August 11, 2014
Hi Jacob
I've not come across that scenario. I think I had noticed WPF apps sometimes having different values on some threads but had not come across it causing an issue. You are actually hosting WPF somehow within a VB6 based host process?
Doug
Anonymous
August 12, 2014
Yeah, that's right, we are hosting a WPF ElementHost inside a VB6 application :) Works like a charm, but when there are floating point exceptions the VB6 process sometimes gets the Floating Point Inexact Result exception. My guess is that it's because the main UI thread runs with the mxcsr value of 00001fa0 (set by VB6) instead of the Windows default value 00001f80. VB6 does not have the inexact result exception disabled and reacts to the errors produced by WPF.
Anonymous
August 12, 2014
I notice now you refer to the inexact result exception rather than 0xC000008E and 0xC0000090. If you are seeing 0xC000008F then that is normal for a VB6 process because the VB6 runtime uses that exception code to implement VB error handling. Therefore if you attach a native debugger to a VB6 process and that application experiences any VB errors you will see that exception and you should ignore it if you are not chasing those. See support.microsoft.com/.../232829 for more info.
Anonymous
August 13, 2014
Hi Doug, is it possible to change the mxcsr register from managed code? Via P/Invoke or something? I tried the controlfp methods, but I only achieved a change in the fpcw control word back to (000027f). The mxcsr register still has the 00001fa0 value. (I think this is the cause of our problem with the ElementHost.) I forgot to mention that the VB6 error only occurs when a float or single is accessed.
Anonymous
August 17, 2014
I would say not directly. You would have to implement a function in a native DLL to do that and pInvoke to that.