CLR and floating point: Some answers to common questions

Some very common questions I get from customers regarding floating point are:

- I get different results when compiling with optimizations vs without optimizations!

- My == checks are failing when the expressions are the same!

The answer for this question is most of the time ‘This is by design’, but it does seem that we’re not doing a good job explaining why it is by design or why things work this way. I’ll try to cover the most common causes for this behavior in this post.

X87 FPU

X86’s old CPU is based on the following

- 8 registers laid out in the form of a stack. The registers are 80 bit wide, although the precision at which operations happen (significant bits of the mantissa) can be modified via a control register. The range cannot be changed though (the exponent will always be 15 bit wide)

- Some control registers that control the precision, exception modes, etc of the FPU.

- Set of instructions to operate with stack registers and/or memory. Instructions that operate with memory can only use a register operand if the register is the top of the stack. Operations that involve 2 FP registers can access any element of the stack. This aspect of the FPU is what makes code generation for it more ‘interesting’

Most of the non intuitive behavior people encounter comes from the fact that the registers are 80 bit wide. Precision is set by default in VC++ and CLR apps to ‘double precision’, which means that if you are operating with operands of type float, results of operations done with floats actually exist in the x87 stack as if there were of type double. In fact, it’s even weirder than that. They will have the mantissa of a double, but the range (exponent) of an extended double (80 bit).

That means that a sequence of operations like the following

- Load float A

- Load float B

- Multiply loaded arguments

- Store float result in memory

- load float A

- load float B

- Multiply loaded arguments

- Compare with stored result in memory

May result in a result of ‘not equal’. Why? If the result of the first multiplication was not representable exactly as a float, it lost bits of precision when stored in memory, thus, when we compared what we stored in memory with what was in the FP stack, we see that the numbers are similar, but not equal, resulting in a surprising ‘not equal’ result.

Even more subtle

- load double MAX_DOUBLE

- load double MAX_DOUBLE

- add

- store double A

- load double A

- load double MAX_DOUBLE

- Substract

Is different than

- load double MAX_DOUBLE

- load double MAX_DOUBLE

- add

- load double MAX_DOUBLE

- substract

At the end of this sequence we have +Infinity in the floating point stack, because 2*MAX_DOUBLE is bigger than MAX_DOUBLE, which is the biggest number that can be represented in a double. Substracting MAX_DOUBLE from Inifinity still results in infinity. However, in the second example, everything happened in the floating point stack, which is operating in double extended range, and can represent 2*MAX_DOUBLE, so when we substract, we get MAX_DOUBLE, instead of Infinity.

What CLR has to say

How does the CLR play with this weird HW? Pulled out from the ECMA spec:

Storage locations for floating-point numbers (statics, array elements, and fields of classes) are of fixed size. The supported storage sizes are float32 and float64. Everywhere else (on the evaluation stack, as arguments, as return types, and as local variables) floating-point numbers are represented using an internal floating-point type. In each such instance, the nominal type of the variable or expression is either R4 or R8, but its value can be represented internally with additional range and/or precision. The size of the internal floating-point representation is implementation-dependent, can vary, and shall have precision at least as great as that of the variable or expression being represented. An implicit widening conversion to the internal representation from float32 or float64 is performed when those types are loaded from storage. The internal representation is typically the native size for the hardware, or as required for efficient implementation of an operation. The internal representation shall have the following characteristics:

· The internal representation shall have precision and range greater than or equal to the nominal type.

· Conversions to and from the internal representation shall preserve value.

[Note: This implies that an implicit widening conversion from float32 (or float64) to the internal representation, followed by an explicit conversion from the internal representation to float32 (or float64), will result in a value that is identical to the original float32 (or float64) value. end note]

[Rationale: This design allows the CLI to choose a platform-specific high-performance representation for floating-point numbers until they are placed in storage locations. For example, it might be able to leave floating-point variables in hardware registers that provide more precision than a user has requested. At the same time, CIL generators can force operations to respect language-specific rules for representations through the use of conversion instructions. end rationale]

When a floating-point value whose internal representation has greater range and/or precision than its nominal type is put in a storage location, it is automatically coerced to the type of the storage location . This can involve a loss of precision or the creation of an out-of-range value (NaN, +infinity, or ‑infinity) . However, the value might be retained in the internal representation for future use, if it is reloaded from the storage location without having been modified. It is the responsibility of the compiler to ensure that the retained value is still valid at the time of a subsequent load, taking into account the effects of aliasing and other execution threads (see memory model section). This freedom to carry extra precision is not permitted, however, following the execution of an explicit conversion (conv.r4 or conv.r8), at which time the internal representation must be exactly representable in the associated type.

[Note: To detect values that cannot be converted to a particular storage type, a conversion instruction (conv.r4, or conv.r8) can be used, followed by acheck for a non-finite value using ckfinite. Underflow can be detected by converting to a particular storage type, comparing to zero before and after the conversion. end note]

[Note: The use of an internal representation that is wider than float32 or float64can cause differences in computational results when a developer makes seemingly unrelated modifications to their code, the result of which can be that a value is spilled from the internal representation (e.g., in a register) to a location on the stack. end note]

This spec clearly had in mind the x87 FPU. The spec is basically saying that a CLR implementation is allowed to use an internal representation (in our case, the x87 80 bit representation) as long as there is no explicit storage to a coerced location (a class or valuet type field), that forces narrowing. Also, at any point, the IL stream may have conv.r4 and conv.r8 instructions, which will force the narrowing to happen.

Why did the spec implementers decide to go down this route? Imagine they hadn’t, and said that the precision/range of FP results would always have to be of the type of their operands. In x87 it would mean having to spill to memory (in order to narrow down to the operand precision/range) after every operation. This can become a very high price to pay for that extra predictability, which is not really cross platform, and that not everybody needs.

Typical problems: It’s a language choice

Note that with current spec, it’s still a language choice to give ‘predictability’. The language may insert conv.r4 or conv.r8 instructions after every FP operation to get a ‘predictable’ behavior. Obviously, this is really expensive, and different languages have different compromises. C#, for example, does nothing, if you want narrowing, you will have to insert (float) and (double) casts by hand. On the other hand, VC++ Whidbey has a new floating point model, which by default will do narrowing on assignment boundaries (it’s more complex than that, see https://msdn.microsoft.com/library/default.asp?url=/library/en-us/dv_vstechart/html/floapoint.asp for more details).

Let’s look what can happen with current C# semantics in a recent question we got:

int Compare(Point x, Point y)

{

float thetaX = Worker(pivot, x);

      float thetaY = Worker (pivot, y);

      if (thetaX == thetaY)

      {

            return (0);

      }

      else

      {

            return (-1);

      }

}

Worker was a ‘pure’ function that returned a float value, as in it depended only of it’s input for it’s result, and produced no side effects. The surprise was coming from the fact that if you called Compare with an x == y, the function would still result in a -1. Let’s look at the disassembly to see what may have happened:

            …. Prolog

            lea EAX, bword ptr [ESI+4]

            cmp ECX, dword ptr [EAX]

            sub ESP, 8

            movq XMM0, qword ptr [EAX]

            movq qword ptr [ESP], XMM0

            lea EAX, bword ptr [ESP+14H+08H]

            sub ESP, 8

            movq XMM0, qword ptr [EAX]

            movq qword ptr [ESP], XMM0

            call [Worker(struct,struct):float]

            fstp dword ptr [ESP]

            add ESI, 4

            sub ESP, 8

            movq XMM0, qword ptr [ESI]

            movq qword ptr [ESP], XMM0

            lea EAX, bword ptr [ESP+0CH+08H]

            sub ESP, 8

            movq XMM0, qword ptr [EAX]

            movq qword ptr [ESP], XMM0

            call [Worker(struct,struct):float]

            fld dword ptr [ESP]

            fcomip ST(0), ST(1)

            fstp ST(0)

            jpe SHORT G_M004_IG03

            jne SHORT G_M004_IG03

            xor EAX, EAX

            …. Epilog

What is happening is that Worker returned a value in the ST(0) register of the FP stack, which was narrowed down to float precision in the fstp (memory store) following the first call. But then, for doing the comparison, it’s comparing ST(0) returned by the second call with the narrowed down to float result of the first call. So if what Worker did was not exactly representable in a float, the comparison (fcomip) would result in false.

How could this be solved? By narrowing by hand the result of the Worker() calls:

int Compare(Point x, Point y)

{

float thetaX = (float) Worker(pivot, x);

      float thetaY = (float) Worker (pivot, y);

      if (thetaX == thetaY)

      {

            return (0);

      }

      else

      {

            return (-1);

      }

}

Which generates:

            …prolog

            lea EAX, bword ptr [ESI+4]

            cmp ECX, dword ptr [EAX]

            sub ESP, 8

            movq XMM0, qword ptr [EAX]

            movq qword ptr [ESP], XMM0

            lea EAX, bword ptr [ESP+18H+08H]

            sub ESP, 8

            movq XMM0, qword ptr [EAX]

          movq qword ptr [ESP], XMM0

            call [ComputationalGeometryUtils.theta(struct,struct):float]

            fstp dword ptr [ESP]

            add ESI, 4

            sub ESP, 8

            movq XMM0, qword ptr [ESI]

            movq qword ptr [ESP], XMM0

            lea EAX, bword ptr [ESP+10H+08H]

            sub ESP, 8

            movq XMM0, qword ptr [EAX]

            movq qword ptr [ESP], XMM0

            call [ComputationalGeometryUtils.theta(struct,struct):float]

            fstp dword ptr [ESP+04H] ; Narrow down

            fld dword ptr [ESP+04H] ; and reload result of call

            fld dword ptr [ESP] ; result of first call

            fcomip ST(0), ST(1) ; comparison is now done on narrowed down results

            fstp ST(0)

            jpe SHORT G_M004_IG03

            jne SHORT G_M004_IG03

            xor EAX, EAX

           .epilog

This will now be more in line with what programmer expects. All this that is happening is in 100% compliance with CLR ECMA specification. I would agree that it is a behavior that is not 100% intuitive, but the problem is that being intuitive will result in a performance penalty, which all users may not be happy with. Maybe the solution is for C# to adopt FP models just like C++ has now, which, although don’t solve the problem completely, do help in the most common situations.

Comments

  • Anonymous
    August 08, 2005
    Why not educate the programmers about FP calculations instead? It's a well-known fact that FP variables must never be compared for equality, unless they result from a constant assignment. Otherwise, you must compare their absolute difference against an epsilon that represents the desired precision.

    If you write FP code using crutches such as fp:precise or fp:strict you get poorly performing code that will break as soon as someone uses a different compiler or different compiler switches. I don't quite see the point, frankly -- other than perhaps reducing Microsoft's support incidents.
  • Anonymous
    August 09, 2005
    fp:precise or fp:strict is not crutches but makes wrighting "correct" code much easier.
    Chris it just shows you never.


    BTW 2 days before your blog I blogged on the same issue :)


    David, could you explain some of the optimization desicisions and specifically why is it hard to inline value type members.

  • Anonymous
    August 09, 2005
    Chris, the C++ story is better now than before. Pre Whidbey they wouldn't respect casts. At least now they don't do things like that.

    Every time this issue comes up, we have to camps:

    a) People that want the same floating point IL to give exactly the same results on all platforms and JITs.

    b) People that want compilers to be able to take full advantage of the HW.

    Both of these options are pretty extreme. a) is ideal, but not practical due to performance issues. Make b) too 'hard' for the developer and you are not offering a compelling platform. A similar decision has happened regarding the memory model of the CLR: a) would mean bad performance, b) as proposed by HW vendors (free write and read reordering) makes writing code for the platform very hard. What you really want is something that is easy to get right (so the platform is productive), but that doesn't leave performance on the table for advanced developers. I think C++ is closer to that model today for floating point with their /fp options: By default, the behavior is more coherent than in their previous versions, but it still has the /fp:fast option when you care more about speed.







  • Anonymous
    August 09, 2005
    Dmitry:

    It's not a design decision, it's a limitation in how the x86 JIT compiler was built which is expensive to fix now (ie, some major rehaul). It's on our radar, we hate that we have these limitations, but in a limited time budget, we have preferred to fix 20 other issues than concentrate a lot of effort in this particular one.
  • Anonymous
    August 11, 2005
    Hmm, I guess I'm extreme then. :) What I'd really like to see:

    1. A warning when the C# compiler encounters any direct comparisons between floating-point types. Better yet, different warnings depending on whether one operand is 0.0 or 1.0.

    2. A standard method to perform FP comparisons with a given epsilon, or with a default epsilon based on the operand type. Let's call it Math.CompareTo(a, b, epsilon) which results in 0 if abs(a-b) <= epsilon, etc.

    That solution would be fast, reliable, portable... you name it. No fundamental changes to the CLR or C# compiler would be required. And when people asked about weird FP results you could just tell them to use Math.CompareTo.

    Maybe this would be a nice idea in addition to whatever FPU-specific options you think of adding?
  • Anonymous
    June 26, 2007
    PingBack from http://www.timvw.be/about-additional-precision-and-unpredicted-behaviour/