Rounding of Floating Point Numbers

Rounding of Floating Point Numbers

Recall from the Truncation of Floating Point Numbers page that if $x$ is a real number, then the binary floating point number truncated to have $n$ digits has an error $\epsilon = \frac{\mathrm{fl}(x) - x}{x}$ which is bounded as:

(1)
\begin{align} -2^{-(n-1)} ≤ \epsilon ≤ 0 \end{align}

What's great about truncation is that the process is relatively fast, and for large calculations with many digits, often times the error $\epsilon$ is so exceptionally small and insignificant. However, sometimes, the error $\epsilon$ is too large for insignificance and we would prefer to reduce the error in a floating point representation of a binary number as much as possible. We can greatly reduce the error $\epsilon$ by instead taking a binary number $x$ and rounding it to contain $n$ digits. This process of rounding is determined by the $(n+1)^{\mathrm{th}}$ digit. If $x = 1.a_1a_2...a_{n-1}a_n...$ and round so that $\mathrm{fl}(x)$ contains $n$ digits, then look at the digit $a_{n}$. If $a_{n} = 0$ (which is the $(n+1)^{\mathrm{th}}$ digit), then $\mathrm{fl}(x) = 1.a_1a_2...a_{n-1}$. If $a_{n} = 1$, then $\mathrm{fl}(x)$ can be obtained by taking truncating $x$ to contain $n$ digits and then adding $1$ to the last digit, $a_{n-1}$, after truncation.

For example, consider the binary number $(1.11011101)_2$. If we wanted to round this number to contain $7$ digits, then, then we notice that the $8^{\mathrm{th}}$ digit is $0$ and so we round $(1.11011101)_2$ to be $1.110111)_2$. If we instead wanted to round $(1.11011101)_2$ to contain $6$ digits, then we notice that the $7^{\mathrm{th}}$ digit is $1$ and so we first truncate $(1.11011101)_2$ to $6$ digits to get $(1.11011)_2$ and then add $1$ to the last digit which is equivalent to adding $(0.00001)_2$ to get that $(1.11011101)_2$ rounded to $6$ binary digits is $(1.11100)_2$.

The following proposition tells us that the error $\epsilon = \frac{\mathrm{fl}(x) - x}{x}$ after rounding is also bounded and quite a bit smaller than with truncation.

 Proposition 1: Let $x$ be a real number. Then the floating point representation of $x$ rounded to have $n$ digits has error $\epsilon = \frac{\mathrm{fl}(x) - x}{x}$ such that $-2^{-n} ≤ \epsilon ≤ 2^{-n}$.
• Proof: Let $x = \sigma \cdot 1.a_1a_2...a_{n-1}a_n... \cdot 2^e$. There are two cases to consider, that is when $a_n = 0$ and when $a_n = 1$.
• Case 1: Suppose that $a_n = 0$. Then $\mathrm{fl}(x) = \sigma \cdot 1.a_1a_2...a_{n-1}00... \cdot 2^e$. Then we have that $x - \mathrm{fl}(x)$ is given by:
(2)
\begin{align} \quad x - \mathrm{fl}(x) = \sigma \cdot 1.a_1a_2...a_{n-1}a_n... \cdot 2^e - \sigma \cdot 1.a_1a_2...a_{n-1}00... \cdot 2^e = \sigma \cdot \underbrace{0.000...00}_{\mathrm{n+1 \: many \: zeroes}} a_{n+1}... \cdot 2^e \end{align}
• Now we will divide $x - \mathrm{fl}(x)$ to get that:
(3)
\begin{align} \quad 0 < \frac{x - \mathrm{fl}(x)}{x} = \frac{\sigma \cdot \underbrace{0.000...00}_{\mathrm{n+1 \: many \: zeroes}} a_{n+1}... \cdot 2^e}{\sigma \cdot 1.a_1a_2...a_{n-1}a_n... \cdot 2^e} ≤ \frac{\sigma \cdot \underbrace{0.000...00}_{\mathrm{n+1 \: many \: zeroes}} a_{n+1}... \cdot 2^e}{\sigma \cdot 1.000... \cdot 2^e} = 2^{-n} \end{align}
• Therefore $0 < \frac{x - \mathrm{fl}(x)}{x} ≤ 2^{-n}$, so multiplying both sides of this inequality by $-1$ and we get that $0 > \epsilon = \frac{\mathrm{fl}(x) - x}{x} ≥ -2^{-n}$.
• Case 2: Suppose that $a_n = 1$. Then $\mathrm{fl}(x) = \sigma \cdot \left ( 1.a_1a_2...a_{n-1}00... + \underbrace{0.000...00}_{\mathrm{n \: many \: zeroes}}1 \right ) \cdot 2^e$. Then we have that $\mathrm{fl}(x) - x$ is given by:
(4)
\begin{align} \quad \mathrm{fl}(x) - x = \sigma \cdot \left ( 1.a_1a_2...a_{n-1}00... + \underbrace{0.000...00}_{\mathrm{n \: many \: zeroes}}1 \right ) \cdot 2^e - \sigma \cdot 1.a_1a_2...a_{n-1}a_n... \cdot 2^e \\ \quad \mathrm{fl}(x) - x = \sigma \cdot \left ( \underbrace{0.000...0}_{\mathrm{n \: many \: zeroes}} 1 - \underbrace{0.000...000}_{\mathrm{n+1 \: many \: zeroes}} 1 a_{n+1}...\right ) \cdot 2^e \end{align}
• Now we will divide $\mathrm{fl}(x) - x$ by $x$ to get that:
(5)
\begin{align} \quad \quad 0 ≤ \frac{\mathrm{fl}(x) - x}{x} = \frac{\sigma \cdot \left ( \underbrace{0.000...0}_{\mathrm{n \: many \: zeroes}} 1 - \underbrace{0.000...000}_{\mathrm{n+1 \: many \: zeroes}} 1 a_{n+1}...\right ) \cdot 2^e}{\sigma \cdot 1.a_1a_2...a_{n-1}a_n... \cdot 2^e} = \frac{\underbrace{0.000...0}_{\mathrm{n \: many \: zeroes}} 1 - \underbrace{0.000...000}_{\mathrm{n+1 \: many \: zeroes}} 1 a_{n+1}...}{1.a_1a_2...a_{n-1}a_n...} \end{align}
(6)
\begin{align} \quad ≤ \frac{\underbrace{0.000...0}_{\mathrm{n \: many \: zeroes}} 1 - \underbrace{0.000...000}_{\mathrm{n+1 \: many \: zeroes}} 1 a_{n+1}...}{1.000...} = \underbrace{0.000...0}_{\mathrm{n \: many \: zeroes}} 1 - \underbrace{0.000...000}_{\mathrm{n+1 \: many \: zeroes}} 1 a_{n+1}... ≤ \underbrace{0.000...0}_{\mathrm{n \: many \: zeroes}} 1 - \underbrace{0.000...000}_{\mathrm{n+1 \: many \: zeroes}} 1 000... = \underbrace{0.000...0}_{\mathrm{n \: many \: zeroes}} 1 = 2^{-n} \end{align}
• Therefore we have that $0 ≤ \epsilon = \frac{\mathrm{fl}(x) - x}{x} ≤ 2^{-n}$. Combining this inequality with the inequality we obtained in the first case and we obtain that:
(7)
\begin{align} -2^{-n} ≤ \epsilon ≤ 2^{-n} \quad \blacksquare \end{align}