CS 202 - Notes 2018-10-01

Floating point

All of the numbers we have looked at so far are integers. What about reals?

The radix point works the same way in binary as it does in decimal: digits to its right are weighted by negative powers of the base.

d₁d₀.d₋₁d₋₂ = d₁ x 2¹ + d₀ x 2⁰ + d₋₁ x 2⁻¹ + d₋₂ x 2⁻²

Example: 110.11 = 4 + 2 + 0 + .5 + .25 = 6.75
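The positional rule above can be sketched in a few lines of Python. This is only an illustration: the function name binary_to_decimal is made up for these notes, and it assumes the input is a well-formed string of 0s and 1s with at most one radix point.

```python
def binary_to_decimal(text):
    """Convert a binary string such as '110.11' to its decimal value."""
    whole, _, frac = text.partition(".")
    value = 0.0
    # Digits left of the radix point have weights 2^0, 2^1, ... from right to left.
    for position, digit in enumerate(reversed(whole)):
        value += int(digit) * 2 ** position
    # Digits right of the radix point have weights 2^-1, 2^-2, ... from left to right.
    for position, digit in enumerate(frac, start=1):
        value += int(digit) * 2 ** -position
    return value


print(binary_to_decimal("110.11"))  # 6.75
```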

Our problem is that we can't represent the radix point itself in the computer; we still only have bits.

So, we come up with a new storage technique based on scientific notation.

Example: 110.11 = 1.1011 x 2²

More generally, F x 2^E. So, we take the available bits and divide them up into room for the fractional part F and the exponent E.

We would like signed values, so we will also add a sign bit S, giving us (-1)^S x F x 2^E.

As a part of our notation, we need to normalize the number so that it only has one digit to the left of the radix point. Note that in binary, the most significant digit of a non-zero number is always 1, so our normalized number will always start with "1.". Since we know this is always the case, we won't bother storing it.

So, now our number looks like (-1)^S x 1.M x 2^E, where M is the fraction, the part of the significand to the right of the radix point (often called the mantissa). So, in the computer we need 1 bit for S, some bits for E, and whatever is left over for M.
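As a rough sketch of normalization, Python's math.frexp can be used to pull a float apart. The helper name normalize is an assumption made for illustration, and the sketch only handles non-zero, finite inputs.

```python
import math


def normalize(x):
    """Split a non-zero float into (S, 1.M, E) so that x == (-1)**S * 1.M * 2**E."""
    sign = 1 if x < 0 else 0
    # math.frexp returns m in [0.5, 1) with abs(x) == m * 2**e,
    # so doubling m and lowering e by one gives a significand in [1, 2).
    m, e = math.frexp(abs(x))
    return sign, m * 2, e - 1


print(normalize(6.75))  # (0, 1.6875, 2)
```

Here 1.6875 is the decimal value of binary 1.1011, matching the example 110.11 = 1.1011 x 2².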

Excess notation

The next issue is that we would like negative exponents, so however we store E has to be signed. We could use something like two's complement, but we are going to use something else called excess notation, which has the advantage that the stored patterns sort in the same order as the exponents they represent: small exponents look small and large exponents look big when the bits are compared as unsigned integers.

The idea is that the value represented is the stored (unsigned) value minus some fixed number, called the bias. To make the number of positive and negative values roughly the same, we want to split the available bit patterns in half, which we can do for an N-bit field by using a bias of 2^(N-1) or 2^(N-1) - 1 (the choice comes down to whether we want one extra negative value or one extra positive value, respectively).

Consider a three-bit number (N = 3). Our bias could be either 4 or 3:

Binary	Stored value	Excess-4 interpretation	Excess-3 interpretation
000	0	-4	-3
001	1	-3	-2
010	2	-2	-1
011	3	-1	0
100	4	0	1
101	5	1	2
110	6	2	3
111	7	3	4
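The table can be reproduced with a short Python sketch. The function names excess_decode and excess_encode are invented for this example, and the bit patterns are handled as strings purely for readability.

```python
def excess_decode(bits, bias):
    """Interpret a binary string as an excess-`bias` number."""
    return int(bits, 2) - bias


def excess_encode(value, bias, width):
    """Return the `width`-bit pattern that stores `value` in excess-`bias`."""
    stored = value + bias
    if not 0 <= stored < 2 ** width:
        raise ValueError("value does not fit in the available bits")
    return format(stored, "0{}b".format(width))


# Print binary pattern, stored value, excess-4 and excess-3 interpretations.
for stored in range(8):
    bits = format(stored, "03b")
    print(bits, stored, excess_decode(bits, 4), excess_decode(bits, 3))
```

Because decoding is just a subtraction, comparing two stored patterns as unsigned integers gives the same answer as comparing the exponents they stand for.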

IEEE 754

IEEE 754 is the standard governing the implementation of floats and doubles

float (single): 1 sign bit, 8 exponent bits, 23 fraction bits (uses excess-127 for the exponent)

double: 1 sign bit, 11 exponent bits, 52 fraction bits (uses excess-1023 for the exponent)
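To see the single-precision layout on a real value, one option is to reinterpret the bytes of a float with Python's struct module. This is a sketch, not part of the standard itself; the name float_fields is an assumption for illustration.

```python
import struct


def float_fields(x):
    """Return (sign, stored exponent, fraction) of x as an IEEE 754 single."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # top bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits, still in excess-127 form
    fraction = bits & 0x7FFFFF       # low 23 bits
    return sign, exponent, fraction


sign, exponent, fraction = float_fields(6.75)
print(sign, exponent, exponent - 127)  # 0 129 2
print(format(fraction, "023b"))        # 1011 followed by 19 zeros
```

The stored exponent 129 decodes to 129 - 127 = 2, and the fraction bits are the 1011 from 1.1011 x 2² with the leading "1." left out.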

There are also a number of special bit patterns, distinguished by the stored exponent and fraction fields:

Single exponent	Single fraction	Double exponent	Double fraction	Meaning
0	0	0	0	±0
0	non-zero	0	non-zero	± denormalized number
1-254	anything	1-2046	anything	± normalized number
255	0	2047	0	± infinity
255	non-zero	2047	non-zero	NaN
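The single-precision columns of the table can be checked with a small classifier. This is a sketch under the assumption that the value is first rounded to single precision by struct.pack; the name classify_single is hypothetical.

```python
import struct


def classify_single(x):
    """Name the IEEE 754 single-precision category that x falls into."""
    # Reinterpret the 32-bit float encoding of x as an unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == 255:
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"


for value in (0.0, 1e-40, 6.75, float("inf"), float("nan")):
    print(value, classify_single(value))
```

The value 1e-40 is used because it is too small for a normalized single but still representable as a denormalized one.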
