BCL Refresher: Floating Point Types - The Good, The Bad and The Ugly [Inbar Gazit, Matthew Greig]

So here is another BCL refresher on the topic of floating point types in the BCL.

Believe it or not, we have three different floating point types: Single (float), Double, and Decimal. Each has its own characteristics and abilities, so let's learn a little bit about them and what we can do with them.

So here's a question: how is a number like 231.312 represented in bits? There are a few techniques one can use. One way to represent this number in binary is to shift it so it becomes 231312 (an integer) and remember how much it was shifted (3 decimal places). This is called floating-point representation, and it is how the BCL types (System.Single, System.Double and System.Decimal) work. Another way is to store the 231 (whole part) and the .312 (fraction part) separately. This is called fixed-point representation.

The main difference between Decimal and Single/Double is the base used to store the number. Decimal uses a base 10 representation, so it can exactly represent any number that has a fixed number of digits when written in base 10. Single and Double use base 2, so they can exactly represent any number that has a fixed number of digits when written in base 2, but they often hold only the closest approximation of numbers that are easily written in base 10 (for example, 0.1 stored as a Single is actually 0.100000001490116119384765625). When we print them with the default "G" format, though, we get the result rounded as we expect.
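To make the base 2 versus base 10 point concrete, here is a minimal sketch. The class and method names are made up for illustration. It adds 0.1 and 0.2 once as Double and once as Decimal, and checks whether the sum compares equal to 0.3.

```csharp
using System;

static class BaseDemo
{
    // True when 0.1 + 0.2 compares equal to 0.3 using Double (base 2).
    public static bool DoubleSumIsExact()
    {
        double a = 0.1, b = 0.2;
        return a + b == 0.3;
    }

    // True when 0.1 + 0.2 compares equal to 0.3 using Decimal (base 10).
    public static bool DecimalSumIsExact()
    {
        decimal a = 0.1m, b = 0.2m;
        return a + b == 0.3m;
    }

    static void Main()
    {
        Console.WriteLine(DoubleSumIsExact());   // False
        Console.WriteLine(DecimalSumIsExact());  // True
    }
}
```

None of 0.1, 0.2 or 0.3 has a finite base 2 expansion, so the Double comparison is between three separate approximations. All three are exact in base 10, which is why the Decimal comparison succeeds.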

Single is a 32-bit floating point type that can represent numbers from negative 3.402823e38 to positive 3.402823e38 with 7 decimal digits of precision (even though internally we maintain 9 digits of precision). However, Single only approximates most numbers: with just 32 bits it cannot give every decimal value in this range its own representation, so values that are close enough together share the same bits. To see how we represent different numbers, use this code:

float f = 2.3F;

foreach (byte b in System.BitConverter.GetBytes(f))

    Console.Write(b.ToString("X2"));

Console.Read();

By replacing the 2.3F with various numbers you can see the bit representation (in hex) of each. Notice that numbers that are close enough to each other have the same representation (try 12007.13009F and 12007.13008F: both return 859C3B46).
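The comparison can also be done in code rather than by eye. The following is a small sketch, with a hypothetical helper named ToHex, that wraps the same BitConverter.GetBytes call and compares the two values directly.

```csharp
using System;

static class SingleBits
{
    // Hex string of the four bytes of a float, in this machine's memory order.
    public static string ToHex(float value)
    {
        return BitConverter.ToString(BitConverter.GetBytes(value)).Replace("-", "");
    }

    static void Main()
    {
        Console.WriteLine(ToHex(12007.13009F));
        Console.WriteLine(ToHex(12007.13008F));

        // Both literals round to the same Single, so the bytes match.
        Console.WriteLine(ToHex(12007.13009F) == ToHex(12007.13008F)); // True

        // The byte order of the hex string depends on the machine.
        Console.WriteLine(BitConverter.IsLittleEndian);
    }
}
```

Near 12007 two neighbouring Single values are roughly 0.001 apart, so a difference in the fifth decimal place cannot survive the conversion.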

When we store these numbers we store three parts: a sign (1 bit), an exponent (8 bits), and a mantissa (23 bits).  The sign is simply 0 if the number is positive and 1 if it is negative.  The exponent is similar to the shift we talked about earlier except for two things.  First, it is in base 2 rather than base 10, so rather than the number of decimal digits of the shift, it is the number of binary digits shifted.  Second, it is the shift plus an offset rather than just the shift, so that we can easily represent negative shifts as well as positive ones (i.e. precision for numbers between -1 and 1). In Single that offset is 127, so instead of shifts from 0 to 255, the 8-bit exponent covers shifts from -127 to 128. The two extreme stored exponents, 00 and FF, are reserved (00 for zero and denormalized values, FF for infinities and NaN) and our normal value formula does not apply to them, which leaves shifts of -126 to 127 for ordinary values.  The exponent used will be the one that shifts the value being represented to one between 1 and 2 (including 1, excluding 2).  Thus we know the value we now need to store will have a whole part of 1 and a fractional part.  Since the whole part is always 1 we don't need to store it at all; we already know it is 1.  The fractional part is stored in the 23-bit mantissa.  Since there are 23 bits in the mantissa, it can hold 2^23 = 8388608 different values, from 0 to 8388607.  So we can think of the fractional value we are storing as being rounded to the nearest 8388608th, with the numerator of that fraction stored in the mantissa.  Writing this as a formula gives the value stored as

Value = (-1)^sign * 2^(exponent - offset) * (1 + (mantissa / 2^(mantissa bits)))

Adding in the values for the Single type (offset of 127 and 23 mantissa bits, so a divisor of 8388608) we get:

SingleValue = (-1)^sign * 2^(exponent - 127) * (1 + (mantissa / 8388608))

We understand that this can be a little confusing, so let's take a closer look at a couple of values to better understand how they are represented.  If we try the values 1.0 and 0.1 we get the following output from our code, respectively: 0000803F and CDCCCC3D.  The bytes are printed least significant first, which makes discussing them a little tricky, so let's reorder them to 3F800000 and 3DCCCCCD to ease our discussion.  Now if we convert these to binary and separate them into the sign, exponent and mantissa we get the following:

1.0 stored as 0 01111111 00000000000000000000000

0.1 stored as 0 01111011 10011001100110011001101

If we convert these values back to base 10 we get

1.0 stored as sign of 0, exponent of 127, mantissa of 0

0.1 stored as sign of 0, exponent of 123, mantissa of 5033165

Putting these values into our formula shows what number is actually stored.  Remember that Single does not store every number exactly; we carry out these calculations at a greater precision than the Single type can handle so that we can see the imprecision Single introduces.  For 1.0 we get:

SingleValue = (-1)^0 * 2^(127-127) * (1 + (0/8388608))

SingleValue = 1 * 2^0 * (1 + 0)

SingleValue = 1 * 1 * 1

SingleValue = 1

In the case of 1.0, it turns out that Single does represent the number exactly.  The number is still imprecise, though, since other numbers are also stored as exactly 1, and from the Single alone we cannot determine whether the original was exactly 1.0 or something else that rounds to it (e.g. 1.0000000000001).  To see how imprecision is introduced, let's find the value stored for our 0.1 case.

SingleValue = (-1)^0 * 2^(123-127) * (1 + (5033165/8388608))

SingleValue = 1 * 2^(-4) * (1 + 0.60000002384185791015625)

SingleValue = 1 * 0.0625 * 1.60000002384185791015625

SingleValue = 0.100000001490116119384765625

Now we see that 0.1 is actually stored as 0.100000001490116119384765625.  The reason a number as seemingly simple as 0.1 is not stored exactly is that we are storing it in base 2 and not base 10, so 0.1, while simple to represent in base 10, is not so simple in base 2.
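The sign, exponent and mantissa can also be pulled out of a Single in code and fed back through the formula. This is only a sketch: the class and method names are invented for illustration, it divides the mantissa by 2^23 = 8388608, and it handles ordinary (normalized) values only, not the reserved exponents.

```csharp
using System;

static class SingleParts
{
    // The 32 bits of the float, reinterpreted as an int.
    static int RawBits(float value)
    {
        return BitConverter.ToInt32(BitConverter.GetBytes(value), 0);
    }

    public static int Sign(float value)     { return (RawBits(value) >> 31) & 1; }
    public static int Exponent(float value) { return (RawBits(value) >> 23) & 0xFF; }
    public static int Mantissa(float value) { return RawBits(value) & 0x7FFFFF; }

    // Applies the value formula for normalized numbers (stored exponent 1..254).
    public static double Rebuild(int sign, int exponent, int mantissa)
    {
        double signFactor = (sign == 0) ? 1.0 : -1.0;
        return signFactor * Math.Pow(2, exponent - 127) * (1.0 + mantissa / 8388608.0);
    }

    static void Main()
    {
        float f = 0.1F;
        Console.WriteLine("sign={0} exponent={1} mantissa={2}",
            Sign(f), Exponent(f), Mantissa(f));
        // sign=0 exponent=123 mantissa=5033165

        // Rebuilt at Double precision, so the error of the Single is visible.
        Console.WriteLine(Rebuild(Sign(f), Exponent(f), Mantissa(f)).ToString("R"));
    }
}
```

Rebuilding in Double works because a Double has more than enough mantissa bits to hold any Single exactly.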

Now, let's talk about Double. Double does a better job of capturing values precisely because it uses 64 bits, so you can represent numbers in the range of negative 1.79769313486232e308 to positive 1.79769313486232e308 with 15 digits of precision. Double uses a 1-bit sign, an 11-bit exponent with an offset of 1023, and a 52-bit mantissa. Again, this is only an approximation of the number and not an exact representation. By running the same code as before with a double instead of a float, we actually get a different result for 12007.13009 (C503CAA69073C740) vs. 12007.13008 (EF2076A69073C740). Notice that the right part is the same, since the numbers are similar in value, but the left part is completely different; this is the extra precision in the Double type.
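The Double layout can be inspected the same way. The sketch below uses BitConverter.DoubleToInt64Bits and masks out the 11-bit exponent and the 52-bit mantissa; the class and method names are illustrative only.

```csharp
using System;

static class DoubleParts
{
    // The 11-bit stored exponent (shift plus the 1023 offset).
    public static int Exponent(double value)
    {
        long bits = BitConverter.DoubleToInt64Bits(value);
        return (int)((bits >> 52) & 0x7FF);
    }

    // The low 52 bits: the stored fraction.
    public static long Mantissa(double value)
    {
        return BitConverter.DoubleToInt64Bits(value) & 0xFFFFFFFFFFFFFL;
    }

    static void Main()
    {
        double a = 12007.13009, b = 12007.13008;

        Console.WriteLine(Exponent(a) - 1023);          // 13, since 2^13 <= a < 2^14
        Console.WriteLine(Exponent(a) == Exponent(b));  // True
        Console.WriteLine(Mantissa(a) == Mantissa(b));  // False
    }
}
```

Both values have the same exponent, so the whole difference between them sits in the mantissa, which is where Double has 29 more bits than Single.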

The third floating point type is Decimal, and it's rather different. With Decimal we have 96 bits representing numbers from negative 79,228,162,514,264,337,593,543,950,335 to positive 79,228,162,514,264,337,593,543,950,335. Unlike Single and Double, Decimal stores base 10 numbers exactly, but only within a limited range and with limited precision.  We need slightly different code to see the bits in Decimal:

Decimal myDecimal = 2.3M;

foreach (Int32 i in Decimal.GetBits(myDecimal))

    Console.Write(i.ToString("X4"));

This will output 00170000000010000, whereas if we change the 2.3 to 23 we'll get 0017000000000000. In both cases the leading 0017 is the integer 23 in hex. The extra digit in the first output is the scale, which tells us to shift the decimal point one place to the left (try .23M and you'll get 00170000000020000, and so on).
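Rather than reading the scale out of the hex string, it can be extracted from the fourth integer returned by Decimal.GetBits, where it occupies bits 16 to 23 (the sign is bit 31). The following is a minimal sketch with made-up helper names.

```csharp
using System;

static class DecimalParts
{
    // Number of decimal places the integer part is shifted by (0 to 28).
    public static int Scale(decimal value)
    {
        int flags = decimal.GetBits(value)[3];
        return (flags >> 16) & 0xFF;
    }

    // The sign lives in the top bit of the flags integer.
    public static bool IsNegative(decimal value)
    {
        return decimal.GetBits(value)[3] < 0;
    }

    // Lowest 32 bits of the 96-bit integer.
    public static int LowBits(decimal value)
    {
        return decimal.GetBits(value)[0];
    }

    static void Main()
    {
        foreach (decimal d in new[] { 23M, 2.3M, .23M })
            Console.WriteLine("{0}: integer={1} scale={2}", d, LowBits(d), Scale(d));
        // integer is 23 each time; scale is 0, 1 and 2 respectively
    }
}
```

All three values share the integer 23 and differ only in scale, which is exactly the "shifted integer" idea from the start of the post, carried out in base 10.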

OK, so which one do you choose? Let’s look at this example:

Single s1 = 1300.40F;

Single s2 = 1359.48F;

Single s = s2 - s1;

s *= 1000000000000;

Double d1 = 1300.40;

Double d2 = 1359.48;

Double d = d2 - d1;

d *= 1000000000000;

Decimal e1 = 1300.40M;

Decimal e2 = 1359.48M;

Decimal e = e2 - e1;

e *= 1000000000000;

 

Console.WriteLine(d);

Console.WriteLine(s);

Console.WriteLine(e);

When trying this code you’ll get 3 different answers to the same calculation!

Single tells you the answer is 59079960000000, Double tells you the answer is 59079999999999.9, and Decimal tells you the answer is 59080000000000. Which one is it? Well, if you calculate it yourself you'll see that it's 59080000000000. So, in this example it would seem that Decimal is giving us the best result. This would generally be true in calculations that are in the "normal" range of numbers. Decimal is therefore aimed at financial applications, where numbers are in the range of 10-20 digits and relatively little precision is needed. If, however, you need very large or very small numbers, you may want to choose Double, which has a much greater range and can represent very small numbers. As for Single, you should only use it if you're space conscious, for example if you have an array with millions of items and are trying to reduce the memory use of your application.
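The trade-off described above can be sketched in a few lines. The helper names below are invented for illustration. The first two methods add 0.1 repeatedly, the kind of thing a financial application does all day, and the third checks whether a Double value is within Decimal's range.

```csharp
using System;

static class ChoosingTypes
{
    // Adds 0.1 to itself 'count' times using Double.
    public static double SumDouble(int count)
    {
        double sum = 0;
        for (int i = 0; i < count; i++) sum += 0.1;
        return sum;
    }

    // Adds 0.1 to itself 'count' times using Decimal.
    public static decimal SumDecimal(int count)
    {
        decimal sum = 0;
        for (int i = 0; i < count; i++) sum += 0.1m;
        return sum;
    }

    // False when the value is outside Decimal's range (the cast throws).
    public static bool FitsInDecimal(double value)
    {
        try
        {
            decimal converted = (decimal)value;
            return true;
        }
        catch (OverflowException)
        {
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(SumDouble(10) == 1.0);    // False
        Console.WriteLine(SumDecimal(10) == 1.0m);  // True
        Console.WriteLine(FitsInDecimal(1e20));     // True
        Console.WriteLine(FitsInDecimal(1e30));     // False
    }
}
```

Decimal wins on exactness for base 10 amounts, while Double is the only one of the two that can hold 1e30 at all.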

OK, now, after reading all this we know you are going to ask—who is the bad, who is the good and who is the ugly? Well, since we have three types, there aren’t that many options are there? :-)

Comments

  • Anonymous
    May 29, 2007
    Decimals can more precisely represent fractions entered by the user, since the base 10 values entered by the user match the base of the representation. Doubles seem to work better with division operations, as repeating decimals (non-divisible by 10) are represented more closely. For example, with doubles (1.0/3)*3 == 1. With decimals, (1m/3)*3 == .9999{28 times}. The repeating fraction gets chopped off. The same goes for any of the other complex math operations (trig, exp, logs, etc.), since they often involve irrational or nonterminating values.

  • Anonymous
    May 29, 2007
    Nice trick with the BitConverter and GetBits calls... I'll have to remember that. You mixed up the terms "fixed point" and "floating point" in your third paragraph, by the way (the one starting with "So here's a question").  The method used by the BCL is the first described, which is floating-point representation.

  • Anonymous
    May 30, 2007

  1. In the last example comparing the three types, I think you have the values for Single and Double reversed.
  2. You say the decimal type has 'limited precision.' Near the end you again put decimal and 'little precision' in the same sentence. Decimal is actually the type with the highest precision: 96 bits.
  3. A drawback of the Decimal type is that it is an order of magnitude slower than the other types. Using a consistent rounding policy can eliminate nearly all the problems with inexact representations.
  4. Another reason to use Singles is that it takes less time to load them into the CPU, and they take up less room in the cache. This makes a significant difference only with arrays.
  • Anonymous
    May 30, 2007
    IEEE754 is the relevant standard for the Single and Double data types and does have a (fairly) good set of rules governing rounding. Many older formats had no such nice rules and left a lot to be desired. An interesting issue to note when choosing Single, Double, or Decimal is that Single and Double are native and, more importantly, operations on Single and Double can be atomic. On some architectures the bit-representation of the Decimal structure cannot be modified atomically. Interesting thread safety implications there.

  • Anonymous
    May 30, 2007
    Jeff, thanks for your feedback.  Here are my comments on it.

  1. Yes, you are right. Should be fixed now.
  2. Decimal does have limited precision. It has a higher precision than Single or Double, but still limited. I will discuss the two mentions of precision separately. "limited precision": It may seem somewhat strange to mention that it has limited precision since it has the highest precision of the three types, but it is necessary when talking about how it represents numbers exactly if they are in the given range (limited range) and have the given number of base-10 digits (limited precision).   "relatively little precision": Maybe this was a little misleading so I will try to explain better what was meant here.  Generally we meant that Decimal will suffer from greater problems with underflow than Single or Double.  Consider the output of: Double d3 = Double.Parse("1e-30"); Double d4 = d3/10; Decimal e3 = (Decimal) d3; Decimal e4 = (Decimal) d4; Console.WriteLine("Double (1e-30): " + d3); Console.WriteLine("Double (1e-31): " + d4); Console.WriteLine("Decimal (1e-30): " + e3); Console.WriteLine("Decimal (1e-31): " + e4); With values in this range Double is more precise than Decimal, since Decimal will underflow and be masked to zero.
  3. Agree that Decimal is generally slower than the other types but we are not really trying to focus on the performance of the types here although this may be important in the choice when using them.  The consistent rounding policy definitely helps with inexact representations, but not sure I would go as far as to say it “eliminates nearly all the problems”
  4. I agree with your statements. These are typical side effects of using less memory and not anything intrinsic about the way Singles are stored.
  • Anonymous
    May 31, 2007
    (Fixed/floating still appears to be wrong at the moment in the third paragraph, btw.) Oh dear - I'm constantly fighting against things like this: "The difference between Decimal and Single/Double is that Decimal is an exact representation of the number whereas Single/Double only gives us the closest estimate." Both of them give closest estimates to a particular real number. It's just that a literal expressed as a decimal within the appropriate range and precision can always be exactly represented as a decimal. Now, your statement is correct within the context of storing 231.312, but when taken out of context (as statements often are) it's incorrect. If "the number" is a third, both of them will only store an approximation. (Both will store an exact number, but it'll be an approximation to the original one.) Statements like the one quoted give rise to the myth that decimal is "accurate" and double/single are "approximate" - as if they're very different, with decimal never losing data. Decimal and float/double are more similar than most people realise, basically being an integer and a point - it's mostly the base of point (10 or 2) and the limits on precision/scaling which differ. (Yes, there are some other differences in terms of remembering "extra" digits, denormal values, infinite values etc - but the biggest difference is the base of the point, IMO.) Still, at least MS does seem firmly behind the idea that decimal is a floating point type. It was documented in MSDN as being a fixed point type, and I spent many messages arguing with a documentation maintainer about whether it was floating or not. The fact that there was a mantissa and an exponent as part of the value didn't seem to be evidence enough ;) Jon

  • Anonymous
    May 31, 2007
    Matthew, you've still got fixed-point and floating-point confused in the third paragraph...

  • Anonymous
    June 01, 2007
    Thanks to everyone who pointed out the errors in the third paragraph (they should be corrected now).  I reworded it somewhat to better describe the Single/Double vs Decimal differences.

  • Anonymous
    June 02, 2007
    Much better - thanks for being so responsive :)
