CS 202 - Notes 2018-10-01
Floating point
All of the numbers we have looked at so far are integers. What about reals?
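The radix point works the same way in binary as it does in decimal: digits to its right carry negative powers of the base.
$d_1 d_0 . d_{-1} d_{-2} = d_1 \times 2^1 + d_0 \times 2^0 + d_{-1} \times 2^{-1} + d_{-2} \times 2^{-2}$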
Example: 110.11 = 4 + 2 + 0 + .5 + .25 = 6.75
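The positional expansion above can be sketched in a few lines of Python. This is only an illustration; the function name `binary_to_decimal` is made up for these notes.

```python
def binary_to_decimal(text: str) -> float:
    """Evaluate a binary numeral such as '110.11' using positional weights."""
    whole, _, frac = text.partition(".")
    value = 0.0
    # Digits left of the radix point carry weights 2^0, 2^1, ... from right to left.
    for power, digit in enumerate(reversed(whole)):
        value += int(digit) * 2 ** power
    # Digits right of the radix point carry weights 2^-1, 2^-2, ... from left to right.
    for power, digit in enumerate(frac, start=1):
        value += int(digit) * 2 ** -power
    return value


print(binary_to_decimal("110.11"))  # 6.75
```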
The problem is that we can't store the radix point itself in the computer; we still only have bits.
So, we come up with a new storage technique based on scientific notation.
Example: $110.11 = 1.1011 \times 2^2$
More generally, $F \times 2^E$. So, we take the available bits and divide them up into room for the fractional part F and the exponent E.
We would like signed values, so we will also add a sign bit S, giving us $(-1)^S \times F \times 2^E$.
As a part of our notation, we need to normalize the number so that it has exactly one digit to the left of the radix point. Note that in binary, the leading digit of a nonzero number is always 1, so our normalized number will always start with "1.". Since we know this is always the case, we won't bother storing it.
So, now our number looks like $(-1)^S \times 1.M \times 2^E$, where M holds the fraction bits of the significand (the digits after the implied leading 1). So, in the computer we need 1 bit for S, some bits for E, and whatever is left over for M.
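As a rough sketch of the normalization step, Python's `math.frexp` can be used to pull a value apart into sign, significand, and exponent. The helper name `normalize` is an assumption for illustration. Note that `frexp` returns a significand in [0.5, 1), so it has to be shifted one place to get the "1.M" form used here.

```python
import math


def normalize(x: float):
    """Return (S, significand, E) with x == (-1)**S * significand * 2**E
    and 1 <= significand < 2."""
    if x == 0 or math.isinf(x) or math.isnan(x):
        raise ValueError("only nonzero finite values can be normalized")
    sign = 1 if x < 0 else 0
    # math.frexp gives m in [0.5, 1) with abs(x) == m * 2**e,
    # so double m and lower e by one to get the 1.M form.
    m, e = math.frexp(abs(x))
    return sign, m * 2, e - 1


print(normalize(6.75))  # (0, 1.6875, 2)
```

Here 1.6875 is 1.1011 in binary, which matches the example 110.11 = 1.1011 x 2^2.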
Excess notation
The next issue is that we would like negative exponents, so however we store E has to be signed. We could use something like two's complement, but we are going to use something else called excess notation. Its advantage is that the stored bit patterns sort in the same order as the values they represent: small exponents look small and large exponents look large when the bits are compared as unsigned integers.
The idea is that we subtract some fixed number (called the bias) from the stored value to get the actual value. To make the number of positive and negative values roughly the same, we want to split the available representations in half, which we can do with an N-bit field by using a bias of $2^{N-1}$ or $2^{N-1} - 1$ (the choice comes down to whether we want one extra negative value or one extra positive value, respectively).
Consider a three-bit number. Our bias could be either 3 or 4:
Binary | Stored value | Excess-4 interpretation | Excess-3 interpretation |
---|---|---|---|
000 | 0 | -4 | -3 |
001 | 1 | -3 | -2 |
010 | 2 | -2 | -1 |
011 | 3 | -1 | 0 |
100 | 4 | 0 | 1 |
101 | 5 | 1 | 2 |
110 | 6 | 2 | 3 |
111 | 7 | 3 | 4 |
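A minimal sketch of excess notation in Python, assuming hypothetical helper names `excess_encode` and `excess_decode`. The loop at the end prints the same rows as the table above.

```python
def excess_encode(value: int, bits: int, bias: int) -> str:
    """Return the stored bit pattern for value in excess-`bias` notation."""
    stored = value + bias  # the stored value is the true value plus the bias
    if not 0 <= stored < 2 ** bits:
        raise ValueError(f"{value} does not fit in {bits} bits with bias {bias}")
    return format(stored, f"0{bits}b")


def excess_decode(pattern: str, bias: int) -> int:
    """Interpret a bit pattern: read it as unsigned, then subtract the bias."""
    return int(pattern, 2) - bias


# Print each 3-bit pattern with its excess-4 and excess-3 interpretations.
for stored in range(8):
    pattern = format(stored, "03b")
    print(pattern, excess_decode(pattern, 4), excess_decode(pattern, 3))
```

Because encoding only adds a constant, comparing two stored patterns as unsigned integers gives the same answer as comparing the values they stand for.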
IEEE 754
IEEE 754 is the standard governing the representation of floats and doubles.
float (single): sign - 1 bit, exponent - 8 bits, fraction - 23 bits (uses excess-127 for the exponent)
double: sign - 1 bit, exponent - 11 bits, fraction - 52 bits (uses excess-1023 for the exponent)
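To see the single-precision layout on a real value, the bits can be pulled apart with Python's `struct` module. This is a sketch only; the function name `float_fields` is made up for these notes, and the value is rounded to single precision when it is packed.

```python
import struct


def float_fields(x: float):
    """Split x, rounded to IEEE 754 single precision, into
    (sign, stored exponent, fraction) as unsigned integers."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # top bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits, stored in excess-127
    fraction = bits & 0x7FFFFF       # low 23 bits
    return sign, exponent, fraction


sign, exponent, fraction = float_fields(6.75)
print(sign, exponent - 127, format(fraction, "023b"))
# 0 2 10110000000000000000000
```

The stored exponent for 6.75 is 129, which is 2 after subtracting the bias of 127, and the fraction field holds the 1011 from 1.1011 followed by zeros.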
There are also a number of special patterns (exponents are listed as stored values, before the bias is subtracted):
Single exponent | Single fraction | Double exponent | Double fraction | Meaning |
---|---|---|---|---|
0 | 0 | 0 | 0 | +- zero |
0 | non-zero | 0 | non-zero | +- denormalized number |
1-254 | anything | 1-2046 | anything | +- normalized number |
255 | 0 | 2047 | 0 | +- infinity |
255 | non-zero | 2047 | non-zero | NaN |
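The single-precision column of the table can be sketched as a small classifier over raw 32-bit patterns. The names `classify_single` and `bits_of` are assumptions for illustration, not part of any library.

```python
import struct


def classify_single(bits: int) -> str:
    """Classify a 32-bit single-precision pattern by its stored exponent and fraction."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == 255:
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"  # stored exponent 1-254


def bits_of(x: float) -> int:
    """Return the single-precision bit pattern of x as an unsigned integer."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits


print(classify_single(bits_of(6.75)))          # normalized
print(classify_single(bits_of(float("inf"))))  # infinity
print(classify_single(0x00000001))             # denormalized
```

The sign bit is ignored here, since every row of the table applies to both signs.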