Partilhar via


Non-Properties of floating point numbers

I recently received a customer question that boiled down to the oft-encountered binary floating point inexact representation issue.  They were rather troubled that basic identities normal to numbers did not apply to floating point arithmetic. 

Learning that floating point is kind of messed up is kind of like finding out there is no Santa Claus.  Sorry if I just spoiled that for you :)

It made me think about some of my undergraduate math, and so I decided to illustrate just how messed up floating point numbers are.

What are some of the properties of floating point math compared to normal numbers?  Do they form a "Group"?

https://en.wikipedia.org/wiki/Group_(mathematics)#Definition

Let's try:

---------------
Closure

For all a, b in G, the result of the operation, a + b, is also in G.

Well, no, we lose already because floating point numbers can overflow.

From my javascript console:

>1e308
1e+308

>1e308+1e308
Infinity

Oops.  OK but nevermind that, it's mostly closed right?  I mean overflow hardly ever happens.  Let's trudge on bravely.

---------------
Associativity

For all a, b and c in G, (a + b) + c = a + (b + c).

That looks harmless enough right?  I can just reorder the parens, what could possibly go wrong?

>(1+1e200)-1e200
0

>1+(1e200-1e200)
1

Well... drat.

------------------
Identity element

There exists an element e in G, such that for every element a in G, the equation e + a = a + e = a holds. Such an element is unique, and thus one speaks of the identity element.

OK this one is really easy right?

All I need is a unique identity element, that's the "zero" and I can add zero and get the same thing back. 

>1+0
1

Great!  But wait...

>1+1e-100
1

And 1e-100 isn't another name for zero...

>1e-100
1e-100

Oops...

------------------
Inverse element

For each a in G, there exists an element b in G such that a + b = b + a = 0

OK on this one I think we're good.  If I started with a number I can represent in floating point, x, then there is exactly one number I can add to x which will give me zero and that is -x.  Now this may seem a little like I'm cheating because

>1.1111111111111111 - 1.111111111111111111111111111111
0

Seems to indicate that there is another number that I could subtract to get zero.

However...

>1.111111111111111111111111111111
1.1111111111111111

As we can see 1.111111111111111111111111111111 isn't a valid floating point number on my system so it's not fair to try to use it.  It's automatically converted to the best approximation which is 1.1111111111111111.

Now, anyone want to tell me why addition of floating point numbers is necessarily commutative?

Comments

  • Anonymous
    October 15, 2014
    Well, adding two numbers is commutative because the processer has to give the closest floating point number to the answer, and that answer is the same in both orders. However, addition isn't commutative with respect to multiple additions (as we usually assume with real numbers). e.g. a+b+c might not equal c+b+a, because this requires associativity and commutativity of addition to prove. I quibble with your characterization that floating point numbers aren't closed over addition though, because I'd consider Infinity a valid floating point number (you can add it to the other numbers and get infinity for example).

  • Anonymous
    October 15, 2014
    Forgive me for not addressing the commutative property yet -- I don't want to spoil it.   Now with regard to adding Infinity to the floating point system to gain closure, that alone won't do it. >Infinity-Infinity NaN So you also have to add NaN. OK so add that too?  I'm not feeling very good about this.... But there was another heavy cost.  Infinity has no additive inverse (see above).  So we gave up that to get closure.  NaN isn't any better.   a+(b-b) == a just went out the window. Basically we've created this funny sort of closure that includes three catch-alls (-Infinity being the other) for cases that would otherwise not fit but the rules for those "numbers" bear no resemblance to the usual kind.  I think the cure was worse than the disease.

  • Anonymous
    October 15, 2014
    1e-100 + 1e-100 = 2e-100 != 1e-100, so 1e-100 does not seem to contradict the uniqueness of the identity element (the equation must hold for every element).

  • Anonymous
    October 15, 2014
    I cannot explain why addition is commutative, but I can prove that it is using Z3: http://rise4fun.com/Z3/9Bla. "Unsat" here means that there is no a, b and rounding mode such that a + b != b + a under this rounding mode. Z3 has bit-precise floating point numbers. Here's distributivity for addition: http://rise4fun.com/Z3/F1dW

  • Anonymous
    October 15, 2014
    The reason why floating point addition is commutative is quite simple, it's commutative because integer addition is commutative :). The real issue is not why FP addition is commutative but why FP addition is not associative. Once one understands what's going on with associtiavity the question about commutativity becomes superfluous. FP addition isn't just addition, it's addition followed by rounding. (a+b)+c is actually round(round(a+b)+c) and a+(b+c) is round(a+round(b+c)). Needless to say, the two aren't equal when (b+c) != round(b+c) and this happens quite often. If you apply the same treatment to commutative you get round(a+b) = round(b+a) which always holds. And a slightly off topic remark - you mentioned JavaScript and floating point numbers :). It's probably worth mentioning another weird characteristic of FP numbers: integer values can be accurately represented in floating point up to a certain maximum value. For example, all 32 bit integers can be represented using double (JavaScript's Number type) but 64 integers cannot. For example 2^60 is 1152921504606846976 but if you use double you get 1152921504606847000, the 4 least significant digits are off. This should be obvious if you think that the double mantissa has 52 bits. If you need to roundtrip large 64 bit integers through JavaScript/JSON then beware, you may end up with bad results. And this is particularly bad for integers near the max value of a 64 bit integer, the value you get back is out of range.

  • Anonymous
    October 16, 2014
    @Rasmus: >1e-100 + 1e-100 = 2e-100 != 1e-100, so 1e-100 does not seem to contradict the uniqueness of the identity element (the equation must hold for every element). Uniqueness of identity requires that it be that case that if there is an item I can add to x and get x then it is the one and only identity element: x + e = x for exactly one and only one e. Or if you like, more formally: x + y = x   <=>  y = e BUT 1 + 1e-100 == 1 and 1 + 0 == 1 and 1e-100 != 0 It is precisely because 1e-100 isn't the identity element (as you correctly point out) and yet it can (sometimes) behave like it were that uniqueness is broken. The overall point of this posting is of course that floating point numbers actually are totally weird. I'm afraid nobody has been able to state why floating point addition commutes yet.  The rounding notion is totally a red herring.

  • Anonymous
    October 16, 2014
    "I'm afraid nobody has been able to state why floating point addition commutes yet." But it's trivial, FP addition is not really different from integer addition. If the 2 FP numbers to add have the same exponent then you simply add the mantissa of the 2 numbers. If the 2 FP numbers have different exponents you shift one mantissa by the difference between exponents and you end up in the case where the 2 exponents are equal. "The rounding notion is totally a red herring." Well, it is as far as commutativity is concerned. It certainly isn't a red herring when it comes to associativity. It's also the reason why the identity element isn't unique. Your 1+1e200-1e200 example would be fine if you were to use a FP format with log2(1e200) bits in the mantissa. One could say that the problem isn't the addition operation itself but the value representation. What sets floating point representation apart from fixed point representations is the fact that the distance between consecutive representable numbers increases with magnitude. Once you notice that, everything falls into place. 1+1e200 produces 1e200 because the distance between 1e200 and the next FP number is ~1e185.

  • Anonymous
    October 17, 2014
    I think that's close enough to call a winner.  It must be commutative because either the exponents are the same, in which case we're talking about integer add with overflow or else you must normalize the smaller exponent to the larger, in which case the order the addends were specified is irrelevant, the algorithm forced you to essentially add the lower magnitude to the larger regardless. Rounding is a peculiar notion implying you had a more precise answer and discarded some precision.  That is not the case!  While it is a useful mental model to think of adding the two floats as though they were real numbers and then trimming to fit you can't do this in practice.  Associatively fails, as do the others, because precision is finite.  When computing 1+1e-100 we have insufficient storage to hold the unrounded answer, it never exists.  In fact no rounding of any kind is needed.

  • Anonymous
    October 17, 2014
    You forgot about the peculiarity that is the negative zero: -0.0 It turns out that (-0.0) + (+0.0) results in +0.0, so positive 0 is not a neutral element at all! In fact, it is negative zero that is the neutral element for addition, as (-0.0 + x) always results in x (well, except when x is a signaling NaN, which results in an exception...). But of course this breaks the inverse element: there is no x such that (+0.0) + x results in (-0.0). By the way: I'm writing 'results in' instead of '==' because -0.0 compares equal to +0.0 despite being a different number. To test whether a number is negative zero, use "if (f == 0 && float.IsNegativeInfinity(1 / f))".

  • Anonymous
    October 17, 2014
    "Rounding is a peculiar notion implying you had a more precise answer and discarded some precision.  That is not the case!" Yes and no.

  • Yes, processors do not have buffers of 2000 bits in which to compute the exact result. I suppose they could even skip the addition in cases like 1 + 1e200, if the difference between exponents is larger than the number of mantissa bits then one can simply pick the number with the larger exponent as a result.
  • No, there are case where you have a more precise answer and precision is discarded. This happens all the time with x87 FP instructions where operations are performed at higher precision in FPU registers and you get a rounded/truncated result when you store the register in memory. "Associatively fails, as do the others, because precision is finite." Nope :). Fixed point representations also have finite precision but they're associative.
  • Anonymous
    October 17, 2014
    I think my comments about precision could have been more precise :)

  • Anonymous
    October 17, 2014
    On negative zero, the representational issues aren't inherent to floating point so I ignored them.  :) You can always normalize so that it never happens, not everyone chooses to do this.  But if you like I agree it worse than I said :)

  • Anonymous
    October 20, 2014
    @rico: Sorry to be so long in replying (I was away on vacation) and I am probably nitpicking, but: You are correct that in a group, if for any a,b a+b=a then b must be the identity element (since e=-a+a=-a+(a+b)=(-a+a)+b=e+b=b). But as you are pointing out the floating point numbers are not a group and commutativity does not hold: 0=-1+1=-1+(1+1e-100) != (-1+1)+1e-100 ) = 0+1e-100=1e-100. So you are not able to use this to deduce that there are multiple identity elements (in fact as Daniel correctly points out, there actually aren't any identity elements).