# Floating Point Numbers

Modern computers store numbers as integers or floating point numbers. Floating point numbers are the focus of this section. Let $(x)_{10} \in \mathbb{R}$ be a nonzero number. Then we can write $(x)_{10}$ uniquely in the following form:

(1)
\begin{align} \quad x = \sigma \cdot \bar{x} \cdot 10^e \end{align}

We will define what $\sigma$, $\bar{x}$ and $e$ are below.

 Definition: If $(x)_{10} \in \mathbb{R} \setminus \{ 0 \}$ is a real decimal number, then $(x)_{10}$ can be written in the form $(x)_{10} = \sigma \cdot \bar{x} \cdot 10^e$. The symbol $\sigma = \pm 1$ is the Sign of $x$ and takes on the value $+1$ if $x > 0$ and $-1$ if $x < 0$. The symbol $\bar{x}$, where $1 \leq \bar{x} < 10$, is the Significand or Mantissa of $x$. The integer $e$ is the Exponent (base $10$) of $x$.

We exclude the number $0 \in \mathbb{R}$ from the definition above because $0$ cannot be uniquely expressed in this form: no significand with $1 \leq \bar{x} < 10$ produces $0$. We will get to the case of $0$ later.

For example, consider the number $(435.115)_{10}$. This number is positive, so the sign of $x$ is $\sigma = +1$. The significand/mantissa of $x$ is $4.35115$, which we obtain by shifting the decimal point two places to the left. Lastly, the exponent of $x$ is $2$, and thus we can write $(435.115)_{10}$ uniquely as:

(2)
\begin{align} (435.115)_{10} = +1 \cdot 4.35115 \cdot 10^2 \end{align}
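
The decomposition above can be checked mechanically. The following is a minimal sketch, assuming Python's standard `decimal` module and a hypothetical helper name `decompose_decimal` (not part of the original text). It reads the number as a decimal string so that no binary rounding interferes with the digits.

```python
from decimal import Decimal


def decompose_decimal(text):
    """Return (sigma, x_bar, e) with x = sigma * x_bar * 10**e and 1 <= x_bar < 10.

    `text` is a decimal string such as "435.115". The helper name is
    hypothetical and only illustrates the definition.
    """
    x = Decimal(text)
    if x == 0:
        raise ValueError("0 cannot be written uniquely in this form")
    sigma = -1 if x < 0 else 1
    e = x.adjusted()             # exponent of the leading nonzero digit
    x_bar = abs(x).scaleb(-e)    # shift the decimal point by e places
    return sigma, x_bar, e


print(decompose_decimal("435.115"))   # (1, Decimal('4.35115'), 2)
print(decompose_decimal("-0.00123"))  # (-1, Decimal('1.23'), -3)
```

The second call shows that numbers smaller than $1$ in magnitude get a negative exponent, since the decimal point is shifted to the right instead.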

The example above contains $6$ digits. However, we may sometimes be dealing with numbers with many more digits. For example, the number $\pi$ has infinitely many digits, and since $\pi$ is irrational these digits never settle into a repeating pattern. It is therefore often unnecessary (and in the case of $\pi$ impossible) to keep track of all of the digits, so instead we limit the number of digits of the significand/mantissa $\bar{x}$ and/or restrict the range of values of the exponent $e$.

 Definition: If $(x)_{10} \in \mathbb{R}$ is a real decimal number where the significand/mantissa $\bar{x}$ is restricted to be at most $m$ digits and $e$ is restricted to a fixed range, then $\sigma \cdot \bar{x} \cdot 10^e$ is an $m$ Digit Decimal Floating Point Value of $x$.
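
To illustrate an $m$ digit decimal floating point value, here is a small sketch that limits the significand to $m$ digits. The definition does not say how the extra digits are discarded, so the two choices shown (rounding to nearest and chopping) are assumptions, as are the helper name `to_m_digits` and the 21-digit string used for $\pi$. The restriction on the exponent $e$ is not modelled.

```python
from decimal import Context, ROUND_DOWN, ROUND_HALF_EVEN


def to_m_digits(text, m, rounding=ROUND_HALF_EVEN):
    """Return the decimal string `text` with its significand limited to m digits.

    The rounding mode is an assumption: the definition only says the
    significand has at most m digits, not how the rest is discarded.
    """
    ctx = Context(prec=m, rounding=rounding)
    return ctx.create_decimal(text)


pi_text = "3.14159265358979323846"  # first 21 digits of pi, as a string

print(to_m_digits(pi_text, 5))              # 3.1416 (rounded to nearest)
print(to_m_digits(pi_text, 5, ROUND_DOWN))  # 3.1415 (chopped)
```

Both results are valid $5$ digit decimal floating point values near $\pi$; which one a given system produces depends on its rounding rule.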

Of course, we do not necessarily need to work with decimal numbers. In fact, binary numbers are much more prevalent in computing as we've mentioned earlier, and so instead we can write a nonzero binary number $(x)_{2}$ uniquely as follows:

(3)
\begin{align} \quad (x)_{2} = \sigma \cdot \bar{x} \cdot 2^e \end{align}

Once again, $\sigma$ represents the sign of $x$ (where $\sigma = +1$ if $(x)_2 > 0$ and $\sigma = -1$ if $(x)_2 < 0$), $\bar{x}$ represents the significand/mantissa of $x$ (where $1 \leq \bar{x} < 2$), and $e$ represents the exponent of $x$, this time with respect to base $2$. For example, consider the binary number $(101101.101)_{2}$. Then $\sigma = +1$, $\bar{x} = 1.01101101$, and $e = 5$ (since we moved the binary point $5$ places to the left), and so we have that:

(4)
\begin{align} (101101.101)_{2} = +1 \cdot 1.01101101 \cdot 2^5 \end{align}

 Definition: If $(x)_{2} \in \mathbb{R}$ is a real binary number where the significand/mantissa $\bar{x}$ is restricted to be at most $m$ binary digits and $e$ is restricted to a fixed range, then $\sigma \cdot \bar{x} \cdot 2^e$ is an $m$ Digit Binary Floating Point Value of $x$. The value $m$ is known as the Precision of the floating point value.
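
The binary form $\sigma \cdot \bar{x} \cdot 2^e$ can be recovered from an ordinary Python float. The sketch below assumes the standard `math.frexp` function, which returns a fraction in $[0.5, 1)$ rather than $[1, 2)$, so the result is rescaled by one power of $2$. The helper name `decompose_binary` is hypothetical.

```python
import math


def decompose_binary(x):
    """Return (sigma, x_bar, e) with x = sigma * x_bar * 2**e and 1 <= x_bar < 2."""
    if x == 0:
        raise ValueError("0 cannot be written uniquely in this form")
    sigma = -1 if x < 0 else 1
    frac, exp = math.frexp(abs(x))  # abs(x) == frac * 2**exp with 0.5 <= frac < 1
    return sigma, frac * 2, exp - 1  # rescale so the significand lies in [1, 2)


# (101101.101)_2: read the bits as an integer, then divide by 2**3
# to account for the three bits after the binary point.
x = int("101101101", 2) / 2**3      # 45.625 in decimal

print(decompose_binary(x))          # (1, 1.42578125, 5)
```

Here $1.42578125$ is the decimal value of the binary significand $(1.01101101)_2$, and the exponent $5$ agrees with the worked example.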

## Example 1

Write $(285437.6523)_{10}$ in the form $\sigma \cdot \bar{x} \cdot 10^e$.

Since $(285437.6523)_{10}$ is positive, the sign is $\sigma = +1$. Shifting the decimal point five places to the left gives the significand $\bar{x} = 2.854376523$ and the exponent $e = 5$, and therefore we have that:

(5)
\begin{align} \quad (285437.6523)_{10} = +1 \cdot 2.854376523 \cdot 10^5 \end{align}

## Example 2

Determine the value (as a decimal number) of the binary number $x$ where $\sigma = -1$, $\bar{x} = (1.001011)_2$ and $e = 5$.

Since $\sigma = -1$ we know that $x$ is negative. Moving the binary point in the significand $5$ places to the right gives the magnitude $|x| = (100101.1)_{2}$. We now want to convert this magnitude into the decimal system, which is fortunately easy:

(6)
\begin{align} \quad (100101.1)_{2} = 32_{10} + 0 + 0 + 4_{10} + 0 + 1_{10} + \frac{1}{2} = (37.5)_{10} \end{align}

Thus, applying the sign, for $\sigma = -1$, $\bar{x} = (1.001011)_2$ and $e = 5$ we have that $(x)_{10} = -37.5$.
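
The same conversion can be sketched in code. The function below is a hypothetical helper (the name `binary_float_value` and its string-based input are assumptions made for illustration) that evaluates $\sigma \cdot \bar{x} \cdot 2^e$ from a binary significand written as text.

```python
def binary_float_value(sigma, significand_bits, e):
    """Evaluate sigma * x_bar * 2**e, where x_bar is a binary string like "1.001011"."""
    whole, _, frac = significand_bits.partition(".")
    # Read all bits as one integer, then divide by 2**(number of fractional bits)
    # to put the binary point back where it belongs.
    x_bar = int(whole + frac, 2) / 2 ** len(frac)
    return sigma * x_bar * 2 ** e


print(binary_float_value(-1, "1.001011", 5))  # -37.5
```

Reading the bits as an integer and rescaling keeps every step exact for short significands like this one, because all intermediate values are small multiples of powers of $2$.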