I am trying to understand floating point arithmetic better and have seen a few links to 'What Every Computer Scientist Should Know About Floating Point Arithmetic'.
I still don't understand how a number like 0.1 or 0.5 is stored in floats and as decimals.
Can someone please explain how it is laid out in memory?
I know about the float being two parts (i.e., a number to the power of something).
I've always pointed people towards Harald Schmidt's online converter, along with the Wikipedia IEEE754-1985 article with its nice pictures.
For those two specific values, you get (for 0.1):
s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm  1/n
0 01111011 10011001100110011001101
           |  ||  ||  ||  ||  || +- 8388608
           |  ||  ||  ||  ||  |+--- 2097152
           |  ||  ||  ||  ||  +---- 1048576
           |  ||  ||  ||  |+------- 131072
           |  ||  ||  ||  +-------- 65536
           |  ||  ||  |+----------- 8192
           |  ||  ||  +------------ 4096
           |  ||  |+--------------- 512
           |  ||  +---------------- 256
           |  |+------------------- 32
           |  +-------------------- 16
           +----------------------- 2
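If you'd rather not rely on an online converter, the same three fields can be pulled out locally. This is only a minimal sketch using Python's standard `struct` module; the helper name `float32_fields` is made up for illustration, and it assumes you want the 32-bit (single precision) layout shown above.

```python
import struct

def float32_fields(x):
    """Return (sign, exponent, mantissa) bit strings of x rounded to float32."""
    # Pack as a big-endian 32-bit float, then reread the same 4 bytes
    # as an unsigned integer so the raw bits can be formatted.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = f"{bits:032b}"
    return s[0], s[1:9], s[9:]

print(*float32_fields(0.1))
# 0 01111011 10011001100110011001101
```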
The sign is positive, that's pretty easy.
The exponent is 64+32+16+8+2+1 = 123; subtracting the bias of 127 gives -4, so the multiplier is 2^-4, or 1/16.
The mantissa is chunky. It consists of 1 (the implicit base) plus, for each bit that is set, 1/(2^n), where n starts at 1 at the leftmost bit and increases to the right: {1/2, 1/16, 1/32, 1/256, 1/512, 1/4096, 1/8192, 1/65536, 1/131072, 1/1048576, 1/2097152, 1/8388608}.
When you add all these up, you get 1.60000002384185791015625.
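That sum is tedious by hand, so here is a rough sketch that redoes it with exact fractions. The bit string is copied from the 0.1 layout above; the variable names are just for illustration.

```python
from decimal import Decimal
from fractions import Fraction

MANTISSA_BITS = "10011001100110011001101"  # mantissa field of float32 0.1

# Bit n (counting from 1 at the left) contributes 1/2**n when it is set.
terms = [
    Fraction(1, 2**n)
    for n, bit in enumerate(MANTISSA_BITS, start=1)
    if bit == "1"
]
significand = 1 + sum(terms)  # the leading 1 is the implicit base

print(", ".join(str(t) for t in terms))
# The significand has only 24 significant bits, so converting it to a
# Python float (a 64-bit double) is exact, and Decimal shows every digit.
print(Decimal(float(significand)))
# 1.60000002384185791015625
```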
When you multiply that by the multiplier, you get 0.100000001490116119384765625, which is why they say you cannot represent 0.1 exactly as an IEEE754 float, and why there is so much opportunity on SO for people answering "why doesn't 0.1 + 0.1 + 0.1 == 0.3?"-type questions :-)
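You can check the stored value yourself. A small sketch, with one caveat: Python's own `float` is a 64-bit double, so the first part rounds 0.1 to 32 bits via `struct` to match the layout discussed here, while the comparison at the end runs in double precision (where 0.1 has a different, smaller error).

```python
import struct
from decimal import Decimal

# Round 0.1 to float32, then widen it back to a Python float.
# Widening is exact, so Decimal prints the float32 value digit for digit.
(as_float32,) = struct.unpack(">f", struct.pack(">f", 0.1))
print(Decimal(as_float32))
# 0.100000001490116119384765625

# The classic surprise, here in double precision.
sums_equal = (0.1 + 0.1 + 0.1 == 0.3)
print(sums_equal)
# False
```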
The 0.5 example is substantially easier. It's represented as:
s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm
0 01111110 00000000000000000000000
which means it's the implicit base, 1, plus no other additives (all the mantissa bits are zero).
The sign is again positive. The exponent is 64+32+16+8+4+2 = 126; subtracting the bias of 127 gives -1. Hence the multiplier is 2^-1, which is 1/2 or 0.5.
So the final value is 1 multiplied by 0.5, or 0.5. Voila!
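Both walk-throughs follow the same recipe: sign, times (1 + fraction), times 2 to the unbiased exponent. Here is a minimal sketch of that recipe; `decode_float32` is a hypothetical helper name, and it assumes a normal number, so zeros, subnormals, infinities and NaNs are deliberately not handled.

```python
def decode_float32(sign, exponent, mantissa):
    """Value of a normal float32 given its three bit fields as strings."""
    # Sketch only: special exponents (all zeros or all ones) are not handled.
    multiplier = 2.0 ** (int(exponent, 2) - 127)  # remove the bias of 127
    significand = 1 + int(mantissa, 2) / 2**23    # implicit 1 plus fraction
    return (-1) ** int(sign) * significand * multiplier

print(decode_float32("0", "01111110", "0" * 23))
# 0.5
```

Feeding it the 0.1 fields from earlier ("0", "01111011", "10011001100110011001101") gives back the slightly-too-large value rather than exactly 0.1.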
I've sometimes found it easier to think of it in terms of decimal.
The number 1.345 is equivalent to
1 + 3/10 + 4/100 + 5/1000
or:
1 + 3*10^-1 + 4*10^-2 + 5*10^-3
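The decimal place-value expansion can be checked with exact fractions. This is just an illustrative sketch; the `digits` string is 1.345 with the decimal point removed.

```python
from fractions import Fraction

digits = "1345"  # the digits of 1.345, decimal point removed

# The digit at position k (starting at 0) is weighted by 10**-k.
value = sum(Fraction(int(d), 10**k) for k, d in enumerate(digits))
print(value == Fraction("1.345"))
# True
```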
Similarly, the IEEE754 representation for decimal 0.8125 is:
s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm
0 01111110 10100000000000000000000
With the implicit base of 1, that's equivalent to the binary:
1.101 * 2^(01111110-01111111)
or:
(1 + 1/2 + 1/8) * 2^-1 (no 1/4 since that bit is 0)
which becomes:
(8/8 + 4/8 + 1/8) * 1/2
and then becomes:
13/8 * 1/2 = 0.8125
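The 0.8125 arithmetic can be replayed exactly as well. A short sketch, again assuming the 32-bit layout and using only the standard `struct` and `fractions` modules:

```python
import struct
from fractions import Fraction

# Bit fields of 0.8125 as stored in a float32.
(bits,) = struct.unpack(">I", struct.pack(">f", 0.8125))
s = f"{bits:032b}"
print(s[0], s[1:9], s[9:])
# 0 01111110 10100000000000000000000

# Redo the arithmetic with exact fractions: (1 + 1/2 + 1/8) * 2**-1
significand = 1 + Fraction(int(s[9:], 2), 2**23)
value = significand * Fraction(2) ** (int(s[1:9], 2) - 127)
print(significand, value, float(value))
# 13/8 13/16 0.8125
```

Unlike 0.1, this value is a sum of a few powers of two, so it is stored with no rounding error at all.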