This section provides an overview of what floating-point is, and why a developer might want to use it.
There are two types of numbers: fixed point and floating point.
An example of a fixed point format, using decimal digits with three digits before the decimal point and two digits after it:
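For illustration (the values are invented for this example), every number in such a format is written with exactly the same layout:

    123.45
    001.01
    999.99

The decimal point never moves, so the smallest step is always 0.01 and the largest representable value is always 999.99.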
An example of a floating point format, using decimal digits with five digits for the mantissa and one digit for the exponent:
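For illustration (again invented values, ignoring how the exponent's sign would be stored), the same five mantissa digits can be shifted up or down by the exponent:

    1.2345 * 10^1  = 12.345
    1.2345 * 10^5  = 123450
    1.2345 * 10^-3 = 0.0012345

The decimal point "floats" to wherever the exponent puts it, which is where the name comes from.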
So a floating point number can represent numbers with very different magnitudes (0.000000000123456 and 123456789.1) with the same relative accuracy.
Fixed point numbers are useful when a particular number of decimal places is always needed, regardless of the magnitude of the number (money, for example). Floating point numbers are useful when the magnitude varies and accuracy is still needed. For example: to a road engineer distances are measured in meters and 0.01 of a meter is insignificant, but to a microchip designer the difference between 0.0000001 meters and 0.000000001 meters is huge - and a physicist might need to use huge numbers and very, very tiny numbers in the same calculation. Accuracy at many different magnitudes is what makes floating point numbers useful.
Computers don't use decimal, though - they use binary, and that causes problems for floating point: not every decimal number can be represented exactly by a binary floating point number, which introduces rounding errors into calculations.
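A minimal C sketch of that effect (the exact digits printed may vary by platform, but with IEEE-754 doubles the comparison fails):

    #include <stdio.h>

    int main(void) {
        /* 0.1 and 0.2 have no exact binary representation, so their sum
           is not exactly the closest double to 0.3. */
        double sum = 0.1 + 0.2;

        if (sum == 0.3)
            printf("equal\n");
        else
            printf("not equal: 0.1 + 0.2 == %.17g\n", sum);
        return 0;
    }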
The examples above are in decimal, but computers work in binary. So instead of storing a floating point number as a sum of decimal fractions:
123.875 = 1*10^2 + 2*10^1 + 3*10^0 + 8*10^-1 + 7*10^-2 + 5*10^-3
computers store floating point numbers as a sum of binary fractions:
123.875 = 1*2^6 + 1*2^5 + 1*2^4 + 1*2^3 + 1*2^1 + 1*2^0 + 1*2^-1 + 1*2^-2 + 1*2^-3
There are many different ways of storing the bit patterns that represent those fractions, but the one most computers now use is based on the IEEE-754 standard. It has rules for storing both decimal and binary representations, and for data types of different sizes.
The way normal numbers are stored using the IEEE standard is with a sign bit, a biased exponent, and a mantissa that has an implicit leading 1 before the binary point.
To allow a more gradual underflow, denormalized numbers (when the exponent bits are all zero) are treated specially: the exponent is taken to be -126 (for 32 bit numbers) and the implicit leading 1 before the binary point is NOT added to the mantissa.
For a normal 32 bit IEEE-754 floating point number there is 1 sign bit, then 8 exponent bits (with a bias of 127), then 23 mantissa bits.
So a normal number is calculated as:
(-1)^sign * 2^(exponent - bias) * 1.mantissa
If the bit pattern were:
0 10000101 11101111100000000000000
Then the value is:
(-1)^0 * 2^(133 - 127) * 1.111011111
= (-1)^0 * 2^6 * (1 + 1/2 + 1/4 + 1/8 + 1/32 + 1/64 + 1/128 + 1/256 + 1/512)
= 1 * 64 * 991/512
= 123.875
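The same decomposition can be checked in code. A minimal C sketch (the variable names are just chosen for this example, and it assumes float is an IEEE-754 32 bit type, which is true on virtually all current platforms):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        float value = 123.875f;
        uint32_t bits;
        memcpy(&bits, &value, sizeof bits);      /* reinterpret the float's bit pattern */

        uint32_t sign     = bits >> 31;          /* 1 sign bit       */
        uint32_t exponent = (bits >> 23) & 0xFF; /* 8 exponent bits  */
        uint32_t mantissa = bits & 0x7FFFFF;     /* 23 mantissa bits */

        /* Prints: sign=0 exponent=133 (unbiased 6) mantissa=0x77c000 */
        printf("sign=%u exponent=%u (unbiased %d) mantissa=0x%06x\n",
               (unsigned)sign, (unsigned)exponent, (int)exponent - 127, (unsigned)mantissa);
        return 0;
    }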
There are some special values:
0 11111111 11111111111111111111111 = NaN (any non-zero mantissa with an all-ones exponent is a NaN)
0 11111111 00000000000000000000000 = +infinity
1 11111111 00000000000000000000000 = -infinity
0 00000000 00000000000000000000000 = +Zero
1 00000000 00000000000000000000000 = -Zero
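An all-zero exponent with a non-zero mantissa is a denormalized number, as described above. For example, in the 32 bit format:

    0 00000000 10000000000000000000000 = 2^-126 * 0.1 (binary) = 2^-127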
Specifics of the 32 bit IEEE-754 format can be found at: