# Storage of Numbers in IEEE Single-Precision Floating Point Format

For $32$ bit storage in computers, a real number $x$ can be stored in what is known as IEEE Single-Precision Floating Point Format as follows:

(1)
\begin{align} x = \sigma \cdot \left ( 1 . a_1a_2...a_{23}\right ) \cdot 2^e \end{align}

The sign $\sigma$ of $x$ takes up one bit. The significand $1.a_1a_2...a_{23}$ consists of $24$ binary digits, so the precision of the IEEE Single-Precision Floating Point Format is $24$ binary digits; however, since the leading digit is always $1$, we only need to specify the $23$ digits $a_1a_2...a_{23}$, so the significand occupies $23$ bits. We will restrict $e$ such that $-(126)_{10} ≤ e ≤ (127)_{10}$, which in binary corresponds to the inequality $-(1111110)_{2} ≤ e ≤ (1111111)_{2}$; as we will see below, the exponent is stored in $8$ bits. Together, we can write a binary number $x$ in IEEE Single-Precision Floating Point Format in $32$ bits ($4$ bytes).

The number $x = \sigma \cdot (1.a_1a_2...a_{22}a_{23} ) \cdot 2^e$ can hence be stored in $32$ bits as $b_1b_2...b_{31}b_{32}$ where each of the bits are assigned as follows:

Bit $1$: The bit $b_1$ corresponds to the sign $\sigma$ of $x$ where $b_1 = \left\{\begin{matrix} 0 & \mathrm{if} \: \sigma = +1\\ 1 & \mathrm{if} \: \sigma = -1 \end{matrix}\right.$.

Bits $2$ to $9$: Most computers do not store the exponent $e$ of a floating point binary number directly. Instead, they define $E = e + 127$, which is a positive binary number (since $-126 ≤ e$). The eight bits $b_2b_3...b_8b_9$ correspond to this number $E$.

Bits $10$ to $32$: The $23$ digits $a_1a_2...a_{22}a_{23}$ of the significand $1.a_1a_2...a_{22}a_{23}$ of $x$ are stored in the bits $b_{10}b_{11}...b_{31}b_{32}$.
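That the biased exponent $E = e + 127$ always fits in the eight bits $b_2b_3...b_9$ can be checked with a short Python sketch (illustrative only, not part of the format definition):

```python
# Every allowed exponent e in -126..127 maps to a biased value
# E = e + 127 in the range 1..254, which fits in 8 bits.
for e in range(-126, 128):
    E = e + 127
    assert 1 <= E <= 254
    assert len(format(E, "b")) <= 8

print("all biased exponents fit in 8 bits")
```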

The following image represents how a binary number is stored in IEEE Single-Precision Floating Point Format. Recall from the Floating Point Numbers page that if $x = 0$ then $x$ cannot be uniquely written in the form $\sigma \cdot \bar{x} \cdot 2^e$. We still need to be able to store $0$ in a computer though, so by convention $0$ is stored with $b_1 = b_2 = ... = b_{31} = b_{32} = 0$, that is, all $32$ bits set to $0$.
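The layout above can be sketched in Python using the standard `struct` module. The helper name `decode_single` is an illustration (not part of the text); it splits a $32$-bit string into the three fields $b_1$, $b_2...b_9$, and $b_{10}...b_{32}$ described above:

```python
import struct

def decode_single(bits: str):
    """Split a 32-bit string into sign, exponent, and significand.

    Illustrative helper: fields follow the layout b1 | b2..b9 | b10..b32.
    """
    assert len(bits) == 32
    sign_bit = bits[0]                 # b1: sign bit
    E = int(bits[1:9], 2)              # b2..b9: biased exponent E = e + 127
    frac = bits[9:]                    # b10..b32: the digits a1..a23
    sigma = -1 if sign_bit == "1" else 1
    e = E - 127
    # Reassemble the significand 1.a1 a2 ... a23 as a decimal value.
    xbar = 1 + sum(int(a) * 2 ** -(i + 1) for i, a in enumerate(frac))
    return sigma, e, xbar

# The all-zeros pattern is reserved for x = 0 by convention:
zero = struct.unpack(">f", int("0" * 32, 2).to_bytes(4, "big"))[0]
print(zero)  # 0.0
```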

## Example 1

Suppose that $11011011011010000000000000000000$ represents an IEEE Single-Precision floating-point number. Determine what this number is in decimal.

We will first separate the string as $1 \: 10110110 \: 11010000000000000000000$. Notice that $b_1 = 1$, $b_2b_3...b_8b_9 = 10110110$, and $b_{10}b_{11}...b_{31}b_{32} = 11010000000000000000000$. Therefore we have that $E = (10110110)_2$ and $\bar{x} = (1.11010000000000000000000)_2$.

Since $b_1 = 1$, this tells us that $\sigma = -1$, and so $x$ is a negative number.

Since $b_2b_3...b_9$ represent $E$, we have that $E = (10110110)_2 = (182)_{10}$, and so since $E = e + 127$, we have that $e = 182 - 127 = 55$.

Lastly, we have that $\bar{x} = 1.11010000000000000000000 = 1.1101 = \left ( 1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{16} \right ) = 1.8125$.

Putting this all together, we have that:

(2)
\begin{align} (x)_{10} = -1 \cdot (1.8125) \cdot 2^{55} \end{align}
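This result can be checked with Python's standard `struct` module, which interprets four raw bytes as an IEEE single-precision float (a sketch to confirm the worked example, not part of the original derivation):

```python
import struct

bits = "11011011011010000000000000000000"
raw = int(bits, 2).to_bytes(4, "big")      # the 32 bits as 4 bytes
x = struct.unpack(">f", raw)[0]            # big-endian single precision
print(x == -1.8125 * 2**55)  # True
```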

## Example 2

Suppose that $00011111110111000000000000000000$ represents an IEEE Single-Precision floating-point number. Determine what this number is in decimal.

Once again, we first separate $0 \: 00111111 \: 10111000000000000000000$. Notice that $b_1 = 0$, $b_2b_3...b_9 = 00111111$, and $b_{10}b_{11}...b_{32} = 10111000000000000000000$.

Since $b_1 = 0$, this tells us that $\sigma = +1$, so $x$ is a positive number.

Now since $b_2b_3...b_9$ represents $E$, we have that $E = (00111111)_2 = (63)_{10}$, and so since $E = e + 127$, we have that $63 = e + 127$, so $e = -64$.

Lastly, we have that $\bar{x} = 1.b_{10}b_{11}...b_{32} = 1.10111000000000000000000$. Thus $\bar{x} = 1 + \frac{1}{2} + \frac{1}{8} + \frac{1}{16} + \frac{1}{32} = 1.71875$.

Therefore we have that:

(3)
\begin{align} x = +1 \cdot (1.71875) \cdot 2^{-64} \end{align}
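As with Example 1, this can be verified with the standard `struct` module (a sketch confirming the result above):

```python
import struct

bits = "0" + "00111111" + "10111000000000000000000"
raw = int(bits, 2).to_bytes(4, "big")      # the 32 bits as 4 bytes
x = struct.unpack(">f", raw)[0]            # big-endian single precision
print(x == 1.71875 * 2**-64)  # True
```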