Lecture 07 - Real numbers

Published

February 23, 2026

Goals

  • Learn how to represent real numbers
  • Learn the process of converting to and from IEEE 754
  • Learn how we represent a collection of other types of binary data

real numbers

all of our discussion to date has been about integers, so what about real numbers?

If we are writing out numbers, then we can use the same ideas that we use in decimal

we will talk about the radix point instead of the decimal point, but the concept is the same

in decimal, the positional values continue to be powers of 10 – they are just negative powers

  • \(0.1 = 10^{-1} = \frac{1}{10}\)
  • \(0.01 = 10^{-2} = \frac{1}{100}\)
  • etc…

In binary

  • \(0.1 = 2^{-1} = \frac{1}{2}\)
  • \(0.01 = 2^{-2} = \frac{1}{4}\)
  • etc…
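As a quick illustration of these positional values, here is a minimal Python sketch that evaluates a binary string containing a radix point. The function name `binary_to_value` is made up for this example and is not part of any library.

```python
def binary_to_value(bits: str) -> float:
    """Evaluate a binary string with an optional radix point, e.g. '0.11'."""
    whole, _, frac = bits.partition(".")
    value = 0.0
    # Digits left of the radix point have weights 2^0, 2^1, 2^2, ...
    for position, digit in enumerate(reversed(whole)):
        value += int(digit) * 2 ** position
    # Digits right of the radix point have weights 2^-1, 2^-2, ...
    for position, digit in enumerate(frac, start=1):
        value += int(digit) * 2 ** -position
    return value


print(binary_to_value("0.1"))   # 0.5
print(binary_to_value("0.01"))  # 0.25
print(binary_to_value("0.11"))  # 0.75
```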

but we have the same problem we had with the negative sign – how do we express the radix point?

floating point

there is a collection of different approaches we could use, but we are going to use something called floating point (this is why the data type for real numbers is called float)

The basic concept is that we are going to represent our number in a form that resembles scientific notation

Let’s take an example: \(-5 \frac{3}{16}\)

we start with the \(5\)

  • 101

Now we need to do the \(\frac{3}{16}\)

  • 16 is \(2^{4}\) so \(\frac{1}{16} = 2^{-4}\)
  • \(\frac{1}{16} = 0.0001\)
  • we have three of those, so \(0.0011\)

Putting that together, we have \(-101.0011\)

Now we normalize the number by shifting the radix point so that it sits immediately after the most significant digit

\(-1.010011 \times 2^{2}\)

Since we shifted the radix point two places to the left, we need to multiply the number by \(2^{2}\) to retain the value
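The conversion and normalization steps can be sketched in Python. This is only one way to do it: the helper names `to_binary` and `normalize` are invented for this example, and the fraction bits are found by repeated doubling, which gives the same digits as the reasoning with \(\frac{1}{16}\) above. The sign is handled separately, so only the magnitude is converted.

```python
from fractions import Fraction


def to_binary(value: Fraction, max_frac_bits: int = 32) -> str:
    """Return the magnitude of value as a binary string with a radix point."""
    magnitude = abs(value)
    whole = magnitude.numerator // magnitude.denominator
    frac = magnitude - whole
    digits = []
    # Repeated doubling: each doubling moves the next fraction bit into the ones place
    while frac != 0 and len(digits) < max_frac_bits:
        frac *= 2
        bit = frac.numerator // frac.denominator
        digits.append(str(bit))
        frac -= bit
    return f"{whole:b}." + "".join(digits)


def normalize(bits: str) -> tuple[str, int]:
    """Move the radix point to just after the leading 1.

    Returns (significand, exponent). Raises ValueError if there is no 1 digit.
    """
    whole, _, frac = bits.partition(".")
    digits = whole + frac
    first_one = digits.index("1")
    # Count how many places the radix point moved (positive means moved left)
    exponent = len(whole) - 1 - first_one
    significand = digits[first_one] + "." + digits[first_one + 1:]
    return significand, exponent


bits = to_binary(Fraction(-83, 16))  # -5 3/16 written as an improper fraction
print(bits)                          # 101.0011
print(normalize(bits))               # ('1.010011', 2)
```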

Now we need to figure out how to store this number

  • need to keep track of the sign, the significand and the exponent

IEEE 754

IEEE 754 is the standard that provides the details of how we will do this

the sign will be stored in the high order bit of the number (like signed magnitude)

the exponent will be stored in the next block of bits

we need to be able to express both positive and negative exponents

instead of using two’s complement, we are going to use something called excess notation

given \(n\) bits, we pick a bias of either \(2^{n-1}\) or \(2^{n-1} - 1\); the choice is just based on whether we want an extra negative or an extra positive value

the idea is that we add the bias to the exponent to get the representation (basically we shift the most negative number up to 0)

to figure out which number is being represented, we reverse the process by subtracting the bias
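Excess notation is small enough to sketch directly. The function names below are hypothetical, and the example assumes \(n = 8\) with the \(2^{n-1} - 1\) choice of bias.

```python
def excess_encode(exponent: int, n_bits: int, bias: int) -> str:
    """Add the bias and return the stored value as an n_bits-wide bit string."""
    stored = exponent + bias
    if not 0 <= stored < 2 ** n_bits:
        raise ValueError("exponent does not fit in the available bits")
    return format(stored, f"0{n_bits}b")


def excess_decode(bits: str, bias: int) -> int:
    """Reverse the process: read the stored value and subtract the bias."""
    return int(bits, 2) - bias


n = 8
bias = 2 ** (n - 1) - 1               # 127
print(excess_encode(2, n, bias))      # 10000001
print(excess_encode(-127, n, bias))   # 00000000 (most negative value maps to 0)
print(excess_decode("10000001", bias))  # 2
```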

the significand will take up the remaining available bits

we make one small tweak, though: since the digit to the left of the radix point in a normalized number is always 1, we don’t bother putting it in our representation

IEEE 754 provides the standard for two data types: the float and the double

float

  • 32-bit number
  • 1 bit sign
  • 8 bit exponent using excess 127
  • 23 bit significand

double

  • 64 bit number
  • 1 bit sign
  • 11 bit exponent using excess 1023
  • 52 bit significand
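To see these field widths in practice, the following sketch pulls the three fields out of a value using Python's standard `struct` module, whose `f` and `d` codes pack IEEE 754 single and double precision values. The helper names `float_fields` and `double_fields` are made up for this example.

```python
import struct


def float_fields(x: float) -> tuple[int, int, int]:
    """Return (sign, stored exponent, stored significand) for a 32-bit float."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    sign = raw >> 31                 # 1 bit
    exponent = (raw >> 23) & 0xFF    # 8 bits, still in excess 127
    significand = raw & 0x7FFFFF     # 23 bits, leading 1 not stored
    return sign, exponent, significand


def double_fields(x: float) -> tuple[int, int, int]:
    """Return (sign, stored exponent, stored significand) for a 64-bit double."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = raw >> 63                     # 1 bit
    exponent = (raw >> 52) & 0x7FF       # 11 bits, still in excess 1023
    significand = raw & ((1 << 52) - 1)  # 52 bits, leading 1 not stored
    return sign, exponent, significand


print(float_fields(1.0))    # (0, 127, 0)
print(double_fields(1.0))   # (0, 1023, 0)
print(float_fields(-2.0))   # (1, 128, 0)
```

Subtracting the bias from the middle value gives the actual exponent, for example \(128 - 127 = 1\) for \(-2.0\).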

there are also some special patterns

single exponent   single significand   double exponent   double significand   meaning
0                 0                    0                 0                    0
0                 non-zero             0                 non-zero             +/- denormalized number
1-254             anything             1-2046            anything             +/- normalized number
255               0                    2047              0                    +/- infinity
255               non-zero             2047              non-zero             NaN
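The special patterns amount to a short decision procedure on the stored exponent and significand. Here is a hedged sketch: `classify` and `classify_float` are names invented for this example, and `struct` is used only to get at the stored bits of a single precision value.

```python
import struct


def classify(exponent: int, significand: int, exponent_bits: int = 8) -> str:
    """Classify a value from its stored exponent and significand fields."""
    all_ones = 2 ** exponent_bits - 1   # 255 for float, 2047 for double
    if exponent == 0:
        return "zero" if significand == 0 else "denormalized"
    if exponent == all_ones:
        return "infinity" if significand == 0 else "NaN"
    return "normalized"


def classify_float(x: float) -> str:
    """Classify x after storing it as a 32-bit float."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    return classify((raw >> 23) & 0xFF, raw & 0x7FFFFF)


for x in (0.0, 1e-45, -5.1875, float("inf"), float("nan")):
    print(x, classify_float(x))
```

Passing `exponent_bits=11` applies the same rules to the fields of a double.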

Returning to our example: \(-1.010011 \times 2^{2}\)

the exponent will be \(2 + 127 = 129 = 10000001_{2}\)

so our representation will be

1 10000001 01001100000000000000000
| |        |
| |        significand with leading 1 removed
| exponent in excess 127
sign bit
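One way to check a hand conversion like this is to assemble the 32 bits and let Python decode them. This is just a sanity check using the standard `struct` module; the variable names are chosen for this example.

```python
import struct

sign = "1"                             # negative number
exponent = format(2 + 127, "08b")      # exponent 2 in excess 127
significand = "010011".ljust(23, "0")  # 1.010011 with the leading 1 removed, padded to 23 bits

bits = sign + exponent + significand
raw = int(bits, 2)
# Interpret the 4 bytes as a big-endian IEEE 754 single precision value
(value,) = struct.unpack(">f", raw.to_bytes(4, "big"))

print(bits)   # 11000000101001100000000000000000
print(value)  # -5.1875
```

The decoded value matches \(-5 \frac{3}{16} = -5.1875\).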

Note that this representation can only represent a limited number of values

it will struggle with values like 0.3, which requires an infinite number of bits to represent in binary (0.01001100110011…)

this is a bit of a problem if you really need that precision (say, if you are a bank), so there are other representations for when we really need full precision
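The effect is easy to observe. The sketch below uses Python, where `float` is an IEEE 754 double rather than a 32-bit float, so the error is smaller but still present. The `decimal` module appears here as one example of an alternative representation, not as the only option.

```python
from decimal import Decimal
from fractions import Fraction

# Fraction(0.3) gives the exact value actually stored in the float
stored = Fraction(0.3)
print(stored == Fraction(3, 10))   # False: the stored value is only close to 3/10

# The small errors show up in ordinary arithmetic
print(0.1 + 0.2 == 0.3)            # False

# Decimal works in base 10, so these values are represented exactly
total = Decimal("0.1") + Decimal("0.2")
print(total == Decimal("0.3"))     # True
```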


Mechanical level

vocabulary

Skills

  • convert between binary and IEEE 754