CS 202 - Notes 2018-10-01

Floating point

All of the numbers we have looked at so far are integers. What about reals?

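The radix point works the same in binary as it does in decimal: each digit is weighted by a power of the base, with negative powers to the right of the point.

$d_1 d_0 . d_{-1} d_{-2} = d_1 \times 2^{1} + d_0 \times 2^{0} + d_{-1} \times 2^{-1} + d_{-2} \times 2^{-2}$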

Example: 110.11 = 4 + 2 + 0 + .5 + .25 = 6.75
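The positional weighting above can be sketched in a few lines of Python. This is only an illustration; the function name binary_to_decimal is made up for these notes.

```python
def binary_to_decimal(text):
    """Convert a binary string such as '110.11' to a float."""
    whole, _, frac = text.partition(".")
    value = 0.0
    # Digits left of the radix point carry weights 2^0, 2^1, ... (right to left).
    for position, digit in enumerate(reversed(whole)):
        value += int(digit) * 2 ** position
    # Digits right of the radix point carry weights 2^-1, 2^-2, ... (left to right).
    for position, digit in enumerate(frac, start=1):
        value += int(digit) * 2 ** -position
    return value


print(binary_to_decimal("110.11"))  # 6.75
```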

Our problem is that we can't store the radix point itself in the computer: we still only have bits.

So, we come up with a new storage technique based on scientific notation.

Example: $110.11 = 1.1011 \times 2^{2}$

More generally, $F \times 2^{E}$. So, we take the available bits and divide them up into room for the fractional part F and the exponent E.

We would like signed values, so we will also add a sign bit S, giving us $(-1)^{S} \times F \times 2^{E}$

As a part of our notation, we need to normalize the number so that it only has one digit to the left of the radix point. Note that in binary, the most significant digit of any nonzero number is always 1, so our normalized number will always start with "1.". Since we know this is always the case, we won't bother storing it.

So, now our number looks like $(-1)^{S} \times 1.M \times 2^{E}$, where M is the fraction part of the significand 1.M. So, in the computer we need 1 bit for S, some bits for E, and whatever is left over for M.
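As a rough sketch of the normalization step, the Python below splits a nonzero value into S, the significand 1.M, and E. The helper name normalize is made up for these notes; it leans on math.frexp from the standard library, which returns a significand in [0.5, 1) rather than [1, 2), so the result is shifted by one place.

```python
import math


def normalize(x):
    """Return (S, significand, E) with x == (-1)**S * significand * 2**E."""
    if x == 0:
        raise ValueError("zero has no normalized form")
    sign = 1 if x < 0 else 0
    # math.frexp gives m in [0.5, 1) with abs(x) == m * 2**e.
    # Doubling m and lowering e by one moves it into the 1.M form.
    m, e = math.frexp(abs(x))
    return sign, m * 2, e - 1


print(normalize(6.75))  # (0, 1.6875, 2)
```

Here 1.6875 is 1.1011 in binary, so this matches the 110.11 example.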

Excess notation

The next issue is that we would like negative exponents, so however we store E has to be signed. We could use something like two's complement, but we are going to use something else called excess notation. Its advantage is that the stored bit patterns sort in the same order as the values they represent: small numbers look small and large numbers look big when the bits are compared as unsigned integers.

The idea is that we subtract some fixed number (called the bias) from the stored value. To make the number of positive and negative numbers roughly the same, we want to split the available representations, which we can do for an N-bit field by using a bias of $2^{N-1}$ or $2^{N-1} - 1$ (the choice comes down to whether we want one extra negative or one extra positive value).

Consider a three bit number. Our bias could either be 3 or 4:

Binary   Stored value   Excess-4 interpretation   Excess-3 interpretation
000      0              -4                        -3
001      1              -3                        -2
010      2              -2                        -1
011      3              -1                         0
100      4               0                         1
101      5               1                         2
110      6               2                         3
111      7               3                         4
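The table can be reproduced with a short Python sketch. The function names excess_encode and excess_decode are made up for these notes.

```python
def excess_encode(value, bits, bias):
    """Return the bit pattern that stores value in excess-bias notation."""
    stored = value + bias
    if not 0 <= stored < 2 ** bits:
        raise ValueError("value does not fit in the given number of bits")
    return format(stored, "0{}b".format(bits))


def excess_decode(pattern, bias):
    """Interpret a bit pattern as an excess-bias number."""
    # Read the bits as an unsigned integer, then subtract the bias.
    return int(pattern, 2) - bias


for stored in range(8):
    pattern = format(stored, "03b")
    print(pattern, stored, excess_decode(pattern, 4), excess_decode(pattern, 3))
```

Because decoding only subtracts a constant, sorting the bit patterns as unsigned integers also sorts the values they represent.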

IEEE 754

IEEE 754 is the standard governing the implementation of floats and doubles.

float (single precision, 32 bits): 1 sign bit, 8 exponent bits, 23 fraction bits (uses excess-127 for the exponent)

double (double precision, 64 bits): 1 sign bit, 11 exponent bits, 52 fraction bits (uses excess-1023 for the exponent)
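To see the single-precision layout on a real value, the Python sketch below pulls the three fields out of a float using the standard struct module. The function name float_fields is made up for these notes, and the value is first rounded to single precision by struct.pack.

```python
import struct


def float_fields(x):
    """Return (sign, stored exponent, fraction) of x as a 32-bit single."""
    # Pack as a big-endian IEEE 754 single, then reread the same
    # four bytes as an unsigned 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # top bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits, still biased
    fraction = bits & 0x7FFFFF       # low 23 bits
    return sign, exponent, fraction


sign, exponent, fraction = float_fields(6.75)
# Subtract the bias of 127 to recover the true exponent.
print(sign, exponent - 127, format(fraction, "023b"))
# 0 2 10110000000000000000000
```

The fraction field starts with 1011, the bits after the "1." in 1.1011, and the stored exponent is 129 = 2 + 127.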

There are also a number of special patterns (the exponents listed are the stored values, before the bias is subtracted):

Single exponent   Single fraction   Double exponent   Double fraction   Meaning
0                 0                 0                 0                 +- zero
0                 non-zero          0                 non-zero          +- denormalized number
1-254             anything          1-2046            anything          +- normalized number
255               0                 2047              0                 +- infinity
255               non-zero          2047              non-zero          NaN
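The single-precision rows of the table translate directly into a small classifier. This is a sketch only; the function name classify_single is made up for these notes.

```python
import struct


def classify_single(x):
    """Classify x by the exponent and fraction fields of its 32-bit encoding."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == 255:
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"


for value in (0.0, 1e-40, 6.75, float("inf"), float("nan")):
    print(value, classify_single(value))
```

The value 1e-40 is below the smallest normalized single (about 1.18e-38), so it is stored as a denormalized number.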
Last Updated: 10/2/2018, 4:58:13 PM