# IEEE Single Precision Floating Point Format Examples 1

Recall from the Storage of Numbers in IEEE Single-Precision Floating Point Format page that with 32 bits of storage, a number $x$ is stored in the form $x = \sigma \cdot \bar{x} \cdot 2^e$, and that the 32 bits $b_1b_2...b_{32}$ are allocated as follows:

| Bits | Contents |
|---|---|
| Bit $1$ | The bit $b_1$ corresponds to the sign $\sigma$ of $x$ where $b_1 = \left\{\begin{matrix} 0 & \mathrm{if} \: \sigma = +1\\ 1 & \mathrm{if} \: \sigma = -1 \end{matrix}\right.$. |
| Bits $2$ to $9$ | Most computers do not store the exponent $e$ of a floating point binary number directly. Instead, they define $E = e + 127$, which is a positive binary number (since $-126 \leq e$). The eight bits $b_2b_3...b_8b_9$ correspond to this number $E$. |
| Bits $10$ to $32$ | The $23$ digits $a_1a_2...a_{22}a_{23}$ that follow the leading $1$ of the significand of $x$, $\bar{x} = 1.a_1a_2...a_{22}a_{23}$, are stored here. |
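The bit layout described above can be sketched in a few lines of Python. This is only an illustrative helper, not part of the IEEE standard or any library: the name `decode_single` is hypothetical, and the sketch assumes a normalized number (so $1 \leq E \leq 254$), leaving zeros, subnormals, infinities, and NaNs out of scope. It uses `fractions.Fraction` so that $\bar{x}$ is returned exactly.

```python
from fractions import Fraction

def decode_single(bits):
    """Split a 32-character bit string into (sigma, e, xbar).

    Hypothetical helper: handles only normalized numbers (1 <= E <= 254).
    """
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected a string of exactly 32 binary digits")
    sigma = 1 if bits[0] == "0" else -1      # bit 1: the sign
    E = int(bits[1:9], 2)                    # bits 2 to 9: E = e + 127
    if not 1 <= E <= 254:
        raise ValueError("not a normalized number")
    e = E - 127
    # bits 10 to 32 hold a_1 ... a_23, so xbar = 1 + (a_1 ... a_23)_2 / 2^23
    xbar = 1 + Fraction(int(bits[9:], 2), 2**23)
    return sigma, e, xbar

# 0 | 01111111 | 1000...0 encodes +1.5 * 2^0
sigma, e, xbar = decode_single("0" + "01111111" + "1" + "0" * 22)
print(sigma, e, xbar)  # 1 0 3/2
```

Slicing the string at positions 1 and 9 mirrors the split into $b_1$, $b_2 \ldots b_9$, and $b_{10} \ldots b_{32}$.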

We will now look at some examples of determining the decimal value of an IEEE single-precision floating point number and of converting numbers to this form.

## Example 1

Consider the following floating point number presented in IEEE single precision (32 bits) as $01101011101101010000000000000000$. Determine the sign $\sigma$, exponent $e$, and significand/mantissa $\bar{x}$ and determine the value of $x = \sigma \cdot \bar{x} \cdot 2^e$.

The first bit of the number given above is $b_1 = 0$, so the sign of $x$ is $\sigma = +1$.

The next eight bits $b_2b_3 \ldots b_9$ are $11010111$ and represent $E = e + 127$. We want to convert the binary number $E = (11010111)_2$ to decimal. Summing the place values from the rightmost bit to the leftmost, we have that:

(1)
\begin{align} \quad E = 1 + 2 + 4 + 0 + 16 + 0 + 64 + 128 = 215 \end{align}

Thus we get that $e = E - 127 = 215 - 127 = 88$.

Lastly, recall that the twenty-three bits $b_{10}b_{11}…b_{32}$ represent the fractional part of the significand/mantissa $\bar{x}$, and that $\bar{x} = 1.b_{10}b_{11}…b_{32}$ and so:

(2)
\begin{align} \quad \bar{x} = 1 + \left ( 0 + \frac{1}{4} + \frac{1}{8} + 0 + \frac{1}{32} + 0 + \frac{1}{128} \right ) = 1.4140625 \end{align}

So the decimal representation of this number is $x = \sigma \cdot \bar{x} \cdot 2^e = + (1.4140625) \cdot 2^{88}$.
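As a sanity check on the hand computation, the same 32 bits can be reinterpreted as a single-precision float with Python's standard `struct` module. This is a sketch of one way to verify the result, not a required step. The comparison is exact because $1.4140625 \cdot 2^{88}$ is representable without rounding in both single and double precision.

```python
import struct

bits = "01101011101101010000000000000000"

# Pack the 32 bits into 4 bytes, then read them back as a big-endian
# IEEE single-precision float (format code ">f").
raw = int(bits, 2).to_bytes(4, "big")
(x,) = struct.unpack(">f", raw)

print(x == 1.4140625 * 2**88)  # True
```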

## Example 2

Consider the following number presented in IEEE single precision (32 bits) as $11001100101111100010000000000000$. Determine the sign $\sigma$, exponent $e$, and significand/mantissa $\bar{x}$ and determine the value of $x = \sigma \cdot \bar{x} \cdot 2^e$.

Once again, since $b_1 = 1$, the sign of $x$ is $\sigma = -1$.

The next eight bits are $10011001$. These bits represent $E = e + 127$. Thus we have that:

(3)
\begin{align} \quad E = (10011001)_2 = 1 + 0 + 0 + 8 + 16 + 0 + 0 + 128 = 153 \end{align}

Therefore the exponent of $x$ is $e = E - 127 = 153 - 127 = 26$.

Lastly, we will calculate the mantissa using the last twenty-three bits of the given number, which are $0111110001$ followed by thirteen zeros. We have that:

(4)
\begin{align} \quad \bar{x} = 1 + \left ( 0 + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} + \frac{1}{32} + \frac{1}{64} + \frac{1}{1024} \right ) = 1.4853515625 \end{align}

So the decimal representation of this number is $x = \sigma \cdot \bar{x} \cdot 2^e = - (1.4853515625) \cdot 2^{26}$.
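Because $2^{26}$ is large enough to clear the denominator of $\bar{x}$, this particular value is an integer. As a quick check of the arithmetic, note that $\bar{x} = 1 + \frac{497}{1024} = \frac{1521}{1024}$, and so:

\begin{align} \quad x = -\frac{1521}{1024} \cdot 2^{26} = -1521 \cdot 2^{16} = -99680256 \end{align}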

## Example 3

Consider the number $x = -\left ( 1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{16} + \frac{1}{32} \right ) 2^{-48}$. Determine the floating point representation in IEEE single precision (32 bits).

We immediately see that $x$ is a negative number and so the sign is $\sigma = -1$. Therefore the first bit in our floating point representation of this number will be $b_1 = 1$.

We also see that the exponent is $e = -48$. IEEE single precision (32 bits) stores the number $E = e + 127$ instead, and hence $E = -48 + 127 = 79$. We must now convert $79$ to a binary number. We have that:

(5)
\begin{align} \quad 79 = (1 + 2 + 4 + 8 + 64) = (01001111)_2 \end{align}

Therefore $b_2b_3…b_9 = 01001111$. Lastly we will determine the last twenty-three digits which represent the fractional part of the significand/mantissa. We note that $\bar{x} = \left ( 1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{16} + \frac{1}{32} \right )$. If we convert $\bar{x}$ to binary we get that:

(6)
\begin{align} \quad \left ( 1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{16} + \frac{1}{32} \right )_{10} = (1.11011)_2 \end{align}

The digits $b_{10}b_{11} \ldots b_{32}$ are thus $11011$ followed by eighteen zeros. Therefore the floating point representation of $x$ is:

(7)
\begin{align} \quad 1 01001111 11011000000000000000000 \end{align}
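The encoding steps of this example can also be sketched in Python. The function name `encode_single` is hypothetical, and the sketch assumes a normalized input: $1 \leq \bar{x} < 2$, $-126 \leq e \leq 127$, and an $\bar{x}$ that fits exactly in $23$ fractional binary digits, so no rounding is handled. The last two lines cross-check the result against the encoding produced by the standard `struct` module.

```python
import struct
from fractions import Fraction

def encode_single(sigma, e, xbar):
    """Return the 32-bit pattern for sigma * xbar * 2**e as a string.

    Hypothetical helper: assumes 1 <= xbar < 2, -126 <= e <= 127, and
    xbar exactly representable with 23 fractional binary digits.
    """
    frac = (Fraction(xbar) - 1) * 2**23      # the digits a_1 ... a_23 as an integer
    if not (1 <= xbar < 2 and -126 <= e <= 127 and frac.denominator == 1):
        raise ValueError("input is outside the scope of this sketch")
    sign_bit = "0" if sigma == 1 else "1"
    # E = e + 127 in 8 bits, then the 23 fraction bits, both zero-padded.
    return sign_bit + format(e + 127, "08b") + format(int(frac), "023b")

xbar = 1 + Fraction(1, 2) + Fraction(1, 4) + Fraction(1, 16) + Fraction(1, 32)
bits = encode_single(-1, -48, xbar)
print(bits[0], bits[1:9], bits[9:])  # 1 01001111 11011000000000000000000

# Cross-check against the single-precision encoding produced by struct.
packed = struct.pack(">f", float(-xbar) * 2.0**-48)
print(format(int.from_bytes(packed, "big"), "032b") == bits)  # True
```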