Storage of Numbers in IEEE Single-Precision Floating Point Format
For $32$-bit storage in computers, a real number $x$ can be stored in what is known as the IEEE Single-Precision Floating Point Format, which is based on writing $x$ in the following form:
\begin{align} \quad x = \sigma \cdot \bar{x} \cdot 2^e \tag{1} \end{align}
The sign of $x$, $\sigma$, takes up one bit. The significand $\bar{x}$ has $24$ binary digits, so the precision of the IEEE Single-Precision Floating Point Format is $24$ binary digits. Since the leading digit of $\bar{x}$ is always $1$, only the remaining $23$ digits need to be specified, so the significand occupies $23$ bits. We restrict $e$ such that $-(126)_{10} \leq e \leq (127)_{10}$, which in binary corresponds to the inequality $-(1111110)_{2} \leq e \leq (1111111)_{2}$, and the exponent is allotted $8$ bits. Together, we can write a binary number $x$ in IEEE Single-Precision Floating Point Format in $1 + 8 + 23 = 32$ bits ($4$ bytes).
The number $x = \sigma \cdot (1.a_1a_2...a_{22}a_{23})_2 \cdot 2^e$ can hence be stored in $32$ bits as $b_1b_2...b_{31}b_{32}$, where the bits are assigned as follows:
| Bits | Contents |
|---|---|
| Bit $1$ | The bit $b_1$ corresponds to the sign $\sigma$ of $x$ where $b_1 = \left\{\begin{matrix} 0 & \mathrm{if} \: \sigma = +1\\ 1 & \mathrm{if} \: \sigma = -1 \end{matrix}\right.$. |
| Bits $2$-$9$ | Most computers do not store the exponent $e$ of a floating point binary number directly. Instead, they define $E = e + 127$ which is a positive binary number (since $-126 \leq e$). The eight bits $b_2b_3...b_8b_9$ correspond to this number $E$. |
| Bits $10$-$32$ | The $23$ succeeding digits $a_1a_2...a_{22}a_{23}$ of the significand of $x$, $1.a_1a_2...a_{22}a_{23}$ are stored here. |
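The bit assignment in the table can be checked on a real machine. The following is a minimal Python sketch, assuming that the standard `struct` module's `">f"` format yields the IEEE Single-Precision encoding; the function name `float32_fields` is hypothetical and chosen only for illustration.

```python
import struct

def float32_fields(x):
    """Return (b_1, E, a_1...a_23 as an integer) for x rounded to single precision."""
    # Pack x as a big-endian 32-bit float, then reread the same 4 bytes
    # as an unsigned integer so the individual bits can be masked out.
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    sign_bit = word >> 31              # bit 1
    biased_exp = (word >> 23) & 0xFF   # bits 2-9, E = e + 127
    fraction = word & 0x7FFFFF         # bits 10-32, the digits a_1 ... a_23
    return sign_bit, biased_exp, fraction

# -0.15625 = -(1.01)_2 * 2^(-3), so b_1 = 1 and E = -3 + 127 = 124.
sign_bit, E, fraction = float32_fields(-0.15625)
print(sign_bit, E, format(fraction, "023b"))
# prints: 1 124 01000000000000000000000
```

The shifts and masks mirror the table directly: one bit for the sign, eight bits for $E$, and $23$ bits for the stored digits of the significand.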
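The following layout represents how a binary number is stored in IEEE Single-Precision Floating Point Format: $\underbrace{b_1}_{\mathrm{sign}} \: \underbrace{b_2b_3...b_9}_{E = e + 127} \: \underbrace{b_{10}b_{11}...b_{32}}_{a_1a_2...a_{23}}$.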
Recall from the Floating Point Numbers page that if $x = 0$ then $x$ cannot be uniquely written in the form $\sigma \cdot \bar{x} \cdot 2^e$. We still need to be able to store $0$ in a computer, so by convention $0$ is stored with $b_1 = b_2 = ... = b_{31} = b_{32} = 0$, that is, all $32$ bits are $0$.
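The all-zero convention for $0$ can be observed directly. This is a short sketch under the same assumption as before, namely that Python's `struct` format `">f"` produces the IEEE Single-Precision encoding; the variable name `zero_bits` is only illustrative.

```python
import struct

# Encode 0.0 as a big-endian 32-bit float and show all 32 bits.
zero_bits = format(int.from_bytes(struct.pack(">f", 0.0), "big"), "032b")
print(zero_bits)
# prints: 00000000000000000000000000000000
```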
Example 1
Suppose that $11011011011010000000000000000000$ represents an IEEE Single-Precision floating-point number. Determine what this number is in decimal.
We will first separate the bits into the three fields $1 \: 10110110 \: 11010000000000000000000$. Notice that $b_1 = 1$, $b_2b_3...b_8b_9 = 10110110$, and $b_{10}b_{11}...b_{31}b_{32} = 11010000000000000000000$. Therefore we have that $\sigma = -1$, $E = (10110110)_2$, and $\bar{x} = (1.11010000000000000000000)_2$.
Since $b_1 = 1$, this tells us that $\sigma = -1$, and so $x$ is a negative number.
Since $b_2b_3...b_9$ represent $E$, we have that $E = (10110110)_2 = 128 + 32 + 16 + 4 + 2 = 182$, and so since $E = e + 127$, we have that $e = 182 - 127 = 55$.
Lastly, we have that $\bar{x} = 1.11010000000000000000000 = 1.1101 = \left ( 1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{16} \right ) = 1.8125$.
Putting this all together, we have that:
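\begin{align} \quad x = \sigma \cdot \bar{x} \cdot 2^e = -1.8125 \cdot 2^{55} \approx -6.53 \times 10^{16} \tag{2} \end{align}
Example 2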
Suppose that $00011111110111000000000000000000$ represents an IEEE Single-Precision floating-point number. Determine what this number is in decimal.
Once again, we first separate the bits into the three fields $0 \: 00111111 \: 10111000000000000000000$. Notice that $b_1 = 0$, $b_2b_3...b_9 = 00111111$, and $b_{10}b_{11}...b_{32} = 10111000000000000000000$.
Since $b_1 = 0$, this tells us that $\sigma = +1$, so $x$ is a positive number.
Now since $b_2b_3...b_9$ represents $E$, we have that $E = (00111111)_2$. In decimal, we have that $E = 63$, and so since $E = e + 127$, we have that $63 = e + 127$, so $e = -64$.
Lastly, we have that $\bar{x} = (1.b_{10}b_{11}...b_{32})_2 = (1.10111000000000000000000)_2 = (1.10111)_2$. Thus $\bar{x} = 1 + \frac{1}{2} + \frac{1}{8} + \frac{1}{16} + \frac{1}{32} = 1.71875$.
Therefore we have that:
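\begin{align} \quad x = \sigma \cdot \bar{x} \cdot 2^e = 1.71875 \cdot 2^{-64} \approx 9.32 \times 10^{-20} \tag{3} \end{align}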
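The decoding procedure used in both examples can be sketched in Python as follows. This is only an illustrative sketch: the function name `decode_single` is hypothetical, exact rational arithmetic from the standard `fractions` module is used to avoid rounding, and the function deliberately handles only the cases covered on this page ($1 \leq E \leq 254$ and the all-zero encoding of $0$).

```python
from fractions import Fraction

def decode_single(bits):
    """Decode a 32-character bit string b_1 ... b_32 into an exact Fraction."""
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 32 binary digits")
    if bits == "0" * 32:
        return Fraction(0)                    # the special all-zero encoding of 0
    sigma = -1 if bits[0] == "1" else 1       # bit 1 gives the sign
    E = int(bits[1:9], 2)                     # bits 2-9 give E = e + 127
    if not 1 <= E <= 254:
        raise ValueError("E is outside the range covered on this page")
    e = E - 127
    # Bits 10-32 are the digits a_1 ... a_23 after the binary point.
    x_bar = 1 + Fraction(int(bits[9:], 2), 2**23)
    return sigma * x_bar * Fraction(2) ** e

# The two worked examples, built field by field: sign, E, significand digits.
example_1 = "1" + "10110110" + "1101".ljust(23, "0")
example_2 = "0" + "00111111" + "10111".ljust(23, "0")

assert decode_single(example_1) == Fraction(-29, 16) * 2**55   # -1.8125 * 2^55
assert decode_single(example_2) == Fraction(55, 32) / 2**64    # 1.71875 * 2^(-64)
```

Building each bit string from its three fields makes it easy to confirm that the string has exactly $32$ digits before decoding it.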