Fun with Floating Point Arithmetic, Part Four

01/18/2005

A reader also asked the other day why it is that in VBScript, CSng(0.1) = CDbl(0.1) is False.

Forget about binary floating point for a moment. Suppose that we had two fixed-point decimal systems, say one with five digits after the decimal point and one with ten. You want to represent one-third. In our first system, the closest we can get is 0.33333. In our second system, the closest we can get is 0.3333333333.

Now we compare these two things. But this is comparing apples to oranges -- two things need to be the same type to sensibly compare them. We have a choice -- we can either convert the type with more precision to the less-precise format and then compare, or we can convert the less-precise type to the more precise format and compare.

If we did the former, then we'd truncate the long one and it would compare equal to the short one. If we did the latter, then clearly they would not be equal because we'd be comparing 0.3333300000 with 0.3333333333.
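The two choices can be sketched in Python with the standard `decimal` module. This is only an illustration of the analogy, not of anything VBScript does internally; the variable names are made up for the example.

```python
from decimal import Decimal, ROUND_DOWN

FIVE_PLACES = Decimal("0.00001")
TEN_PLACES = Decimal("0.0000000001")

third = Decimal(1) / Decimal(3)
short = third.quantize(FIVE_PLACES, rounding=ROUND_DOWN)  # 0.33333
long_ = third.quantize(TEN_PLACES, rounding=ROUND_DOWN)   # 0.3333333333

# Choice 1: truncate the more precise value to five places, then compare.
narrowed_equal = long_.quantize(FIVE_PLACES, rounding=ROUND_DOWN) == short

# Choice 2: pad the less precise value out to ten places, then compare.
widened = short.quantize(TEN_PLACES)  # 0.3333300000
widened_equal = widened == long_

print(narrowed_equal, widened_equal)
```

Narrowing makes the two values compare equal; widening does not.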

The analogy holds for doubles and singles. In a single, 1/10 in binary is 0.00011001100110011001100. In a double, it's 0.000110011001100110011001100110011001100110011001100110. If we compare by converting the double to a single, then clearly they are equal -- and, in fact, a billion or so doubles which are close enough to 1/10 also compare equal. If we compare by converting the single to a double before comparing, clearly they are not equal.

VBScript always converts to the more precise format before doing the comparison.
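The same effect can be reproduced outside VBScript. Python has no native single-precision type, so the sketch below uses a hypothetical helper, `to_single`, that rounds a value to the nearest 4-byte IEEE float via the standard `struct` module and hands it back widened to a double.

```python
import struct

def to_single(x: float) -> float:
    """Round x to the nearest IEEE single; the result is returned as a double."""
    return struct.unpack("f", struct.pack("f", x))[0]

double_tenth = 0.1
single_tenth = to_single(0.1)  # the single nearest 0.1, widened exactly

# Widen the single to a double and compare: the extra bits differ.
widened_equal = single_tenth == double_tenth

# Narrow the double to a single and compare: the extra bits are discarded.
narrowed_equal = to_single(double_tenth) == single_tenth

# A different double close to 0.1 narrows to the very same single.
nearby = 0.1 + 1e-10
nearby_equal = to_single(nearby) == single_tenth

print(widened_equal, narrowed_equal, nearby_equal)
```

The widening comparison is the one that corresponds to `CSng(0.1) = CDbl(0.1)` being False.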

You might think that this is kind of bogus. Surely if we're comparing a more precise value to a less precise value, it makes sense to say that the less significant bits are, well, less significant and throw them away. By converting the less precise format to a more precise format, we are essentially manufacturing new precision that didn't previously exist. We're just making it up out of whole cloth.

In the world of science, there's a word for that. It's called "fraud".

Yep, we are totally cheating. This is one of those unfortunate "gotchas" which you've got to be very careful of if you're mixing doubles with singles. There is some justification for it though.

Consider addition, for example. If you have a single, say 0.10000000000000000000000, and a double 0.10000000000000000000000111000…, and you add them together, do you expect that the result will be a single or a double? For this situation, many people would say that the sensible thing to do is to treat the single as a double and add them together, rather than losing the information in the less significant bits of the double. Yet this is once more manufacturing new precision for the single.

It comes down to a simple decision. Which is more important: not losing existing information, or not creating new information arbitrarily?

Once you pick which factor is more important, you've got to apply the rules that entails consistently. You can't say that for addition and subtraction, you convert singles to doubles, but do it the other way for comparisons. If you do that then you get into the rather ridiculous situation that two numbers can have a nonzero difference and yet compare as equal!
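A rough Python sketch of that inconsistency follows. It assumes a made-up rule set in which subtraction widens to double but comparison narrows to single; `to_single` is a hypothetical helper that rounds to a 4-byte float using `struct`.

```python
import struct

def to_single(x: float) -> float:
    """Round x to the nearest IEEE single; the result is returned as a double."""
    return struct.unpack("f", struct.pack("f", x))[0]

s = to_single(0.1)  # a single holding roughly 1/10
d = 0.1             # a double holding roughly 1/10

# Subtraction under the "widen" rule: carried out in double precision.
difference = d - s

# Comparison under the "narrow" rule: the double is squeezed into a single first.
equal_if_narrowed = to_single(d) == s

# Comparison under the "widen" rule, consistent with the subtraction.
equal_if_widened = d == s

print(difference, equal_if_narrowed, equal_if_widened)
```

With mixed rules the difference is nonzero while the narrowing comparison reports equality; using the widening rule for both keeps the two answers in agreement.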

The Visual Basic designers decided that loss of information is worse than manufacturing new information and applied that rule consistently to the variant arithmetic logic. Hence, the same goes for operations between integer and floating point types; the integer types are converted to floating point types and the operations are done in floats. You'd certainly never say that 100 + 0.25 should avoid manufacturing new precision, convert the double to an integer, and result in 100, I hope. Similarly, comparisons between the integer 100 and the double 100.25 are done by converting the integer to a double, not converting the double to an integer.

In one case, a comparison can be done by converting to neither type. If you're comparing a 32 bit integer to a single-precision float, you can't convert the single to an integer or the integer to a single without one of them being potentially lossy. In that case, both are converted to doubles. In the VBScript implementation we consult this handy table for what conversion is used when comparing currency, 8-byte float, 4-byte float, 4-byte integer and 2-byte integer to each other:

(As an aside, in JScript .NET, where we have 64 bit integers and 64 bit floats which could be compared, we're in this cleft stick again, but this time with no clear way out! There is no larger type to which both can be losslessly converted. Comparing a 64 bit integer to a 64 bit float is a bad idea.)
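Both situations can be checked with a small Python sketch. The specific integers are chosen for illustration only, and `to_single` is a hypothetical helper that rounds to a 4-byte float using `struct`.

```python
import struct

def to_single(x: float) -> float:
    """Round x to the nearest IEEE single; the result is returned as a double."""
    return struct.unpack("f", struct.pack("f", x))[0]

# A value that fits in a 32-bit integer but needs 25 significant bits.
n32 = 2**24 + 1
survives_single = int(to_single(float(n32))) == n32  # a single has 24 bits of mantissa
survives_double = int(float(n32)) == n32             # a double has 53 bits of mantissa

# A value that fits in a 64-bit integer but needs 54 significant bits.
n64 = 2**53 + 1
survives_double_64 = int(float(n64)) == n64

print(survives_single, survives_double, survives_double_64)
```

Every 32-bit integer and every single fits exactly in a double, which is why double works as the common type there. No such wider type is available for 64-bit integers and doubles.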

Unless you have really compelling backwards-compatibility reasons, avoid using single precision floats altogether. In VBScript both a single and a double are stored as a 16 byte VARIANT, so there are no space savings. And on the chip, both single and double precision floats are converted to an internal extended format (which I believe is 80 bits), processed in that format, and then converted back to singles or doubles when the operation is done. There are no significant savings in either time or space obtained by using singles, and you potentially get a lot of pain because things don't compare the way you might think they do. Avoid, avoid, avoid.

Comments

Anonymous
January 18, 2005
In general, if you're dealing with money, the correct way is to use an integer multiplied by a scaling factor, instead of an IEEE float of any variety (single or double).

The most common case where you see this problem is when someone's dealing with currency -- it hits double precision inaccuracies a lot, and the runtime tries to compensate for it.

By using fixed point math (multiply the values by 1000, for example, then divide and convert into a float/double once you're done), you eliminate any rounding errors, etc. at the expense of a small increase in the amount of code that you have to write.

SQL Server, Oracle and ISO C++ provide mechanisms for handling this. I believe the .NET runtime has support for it as well.
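A minimal Python sketch of the scaled-integer idea, assuming amounts are held as whole cents (a scale of 100 rather than the 1000 mentioned above); `format_cents` is a made-up helper for display.

```python
# Ten items at 0.10 each, accumulated as binary doubles.
total_float = sum([0.10] * 10)

# The same total held as an integer number of cents.
price_cents = 10
total_cents = price_cents * 10

def format_cents(cents: int) -> str:
    """Render a non-negative cent count as a decimal string for display only."""
    return f"{cents // 100}.{cents % 100:02d}"

print(total_float == 1.0)         # the double total has drifted off 1.0
print(format_cents(total_cents))  # the integer total is exact
```

The integer arithmetic is exact; conversion to a decimal string happens only at the point of display.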
Anonymous
January 18, 2005
Oh and that goes for any time you care about significance in any computer math -- never use a single or a double if you care about how many digits come after the decimal.
Anonymous
January 18, 2005
Yes, the "decimal" type in .NET is a base-ten floating point number. It's a 96 bit integer scaled by 10^exp, where exp is any value from 0 to -28.
Anonymous
January 18, 2005
The comment has been removed
Anonymous
January 18, 2005
> The Visual Basic designers decided that loss
> of information is worse than manufacturing
> new information and applied that rule
> consistently to the variant arithmetic logic.

Except when printing. See your previous blog entry ^_^

1/18/2005 4:29 PM James Mastros (theorbtwo)

> when dealing with money, it's often an error
> to carry any more information than the
> smallest unit in the currency

Surely not. If interest is compounded daily and every 6 months you're supposed to pay the amount of interest that has accrued during those 6 months, you shouldn't compute each day's interest as zero and add up all those zeroes. You need to compute with enough extra digits to make the final result accurate within one of those smallest units, and then round once. (And of course you probably shouldn't be using floating point for any part of this.)
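The point can be sketched with Python's `decimal` module. The balance, rate and period below are assumptions chosen to make the effect visible, and the sketch uses simple daily accrual rather than true compounding to keep the arithmetic short.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")
balance = Decimal("50.00")          # assumed principal
daily_rate = Decimal("0.03") / 365  # assumed 3% annual rate, accrued daily
days = 182                          # roughly six months

# Round each day's interest to the cent, then add up the rounded amounts.
rounded_daily_total = sum(
    ((balance * daily_rate).quantize(CENT, rounding=ROUND_HALF_UP) for _ in range(days)),
    Decimal("0.00"),
)

# Carry the extra digits through the whole period and round once at the end.
round_once_total = (balance * daily_rate * days).quantize(CENT, rounding=ROUND_HALF_UP)

print(rounded_daily_total, round_once_total)
```

Each day's interest is under half a cent, so rounding daily accrues nothing, while rounding once yields 0.75.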
Anonymous
March 14, 2006
public static double roundnum(double num, int place)
{
    double n;
    n = num * Math.Pow(10, place);
    n = Math.Sign(n) * Math.Abs(Math.Floor(n + .5));
    return n / Math.Pow(10, place);
}

didn't write this....but looking for comments on it
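One way to examine the routine is to port it and try a few inputs. The Python below is a line-for-line port of the C# above, plus a variant, `roundnum_away`, which is only a guess at what the sign handling may have been meant to do (round halves away from zero); `sign` is a small helper standing in for `Math.Sign`.

```python
import math

def sign(x: float) -> int:
    """Return -1, 0 or 1, like Math.Sign."""
    return (x > 0) - (x < 0)

def roundnum(num: float, place: int) -> float:
    """Direct port of the posted routine."""
    n = num * 10.0 ** place
    n = sign(n) * abs(math.floor(n + 0.5))
    return n / 10.0 ** place

def roundnum_away(num: float, place: int) -> float:
    """Assumed intent: take the absolute value before flooring."""
    n = num * 10.0 ** place
    n = sign(n) * math.floor(abs(n) + 0.5)
    return n / 10.0 ** place

print(roundnum(2.5, 0), roundnum(-2.5, 0), roundnum_away(-2.5, 0))
print(roundnum(1.005, 2))
```

As posted, the sign and absolute-value calls cancel out, so negative halves round toward positive infinity (-2.5 becomes -2.0). Separately, `roundnum(1.005, 2)` gives 1.0 rather than 1.01, because 1.005 * 100 is slightly below 100.5 as a double. That is the binary representation issue the post describes, and no rearrangement of the floor call removes it.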
Anonymous
April 26, 2007
This is for Dave... A BIG THANKS!! I just multiplied by 100 (I had only 2 decimal places) and it solved my issue. My code was something like this:

    If pctTotal > 100 Then
        ...
        Response.Write("Percentage total cannot be greater than 100. Your percentage total is " & pctTotal)
        ...
    End If

I was stumped when I got a message like this:

    Percentage total cannot be greater than 100. Your percentage total is 100

I just changed it to:

    If CInt(pctTotal * 100) > 10000

Now it's all fixed. Thanks guys!
