Rounding of Floating Point Numbers
Recall from the Truncation of Floating Point Numbers page that if $x$ is a real number, then the binary floating point number truncated to have $n$ digits has an error $\epsilon = \frac{\mathrm{fl}(x) - x}{x}$ which is bounded as:
(1)What's great about truncation is that the process is relatively fast, and for large calculations with many digits, often times the error $\epsilon$ is so exceptionally small and insignificant. However, sometimes, the error $\epsilon$ is too large for insignificance and we would prefer to reduce the error in a floating point representation of a binary number as much as possible. We can greatly reduce the error $\epsilon$ by instead taking a binary number $x$ and rounding it to contain $n$ digits. This process of rounding is determined by the $(n+1)^{\mathrm{th}}$ digit. If $x = 1.a_1a_2...a_{n-1}a_n...$ and round so that $\mathrm{fl}(x)$ contains $n$ digits, then look at the digit $a_{n}$. If $a_{n} = 0$ (which is the $(n+1)^{\mathrm{th}}$ digit), then $\mathrm{fl}(x) = 1.a_1a_2...a_{n-1}$. If $a_{n} = 1$, then $\mathrm{fl}(x)$ can be obtained by taking truncating $x$ to contain $n$ digits and then adding $1$ to the last digit, $a_{n-1}$, after truncation.
For example, consider the binary number $(1.11011101)_2$. If we wanted to round this number to contain $7$ digits, then, then we notice that the $8^{\mathrm{th}}$ digit is $0$ and so we round $(1.11011101)_2$ to be $1.110111)_2$. If we instead wanted to round $(1.11011101)_2$ to contain $6$ digits, then we notice that the $7^{\mathrm{th}}$ digit is $1$ and so we first truncate $(1.11011101)_2$ to $6$ digits to get $(1.11011)_2$ and then add $1$ to the last digit which is equivalent to adding $(0.00001)_2$ to get that $(1.11011101)_2$ rounded to $6$ binary digits is $(1.11100)_2$.
The following proposition tells us that the error $\epsilon = \frac{\mathrm{fl}(x) - x}{x}$ after rounding is also bounded and quite a bit smaller than with truncation.
Proposition 1: Let $x$ be a real number. Then the floating point representation of $x$ rounded to have $n$ digits has error $\epsilon = \frac{\mathrm{fl}(x) - x}{x}$ such that $-2^{-n} ≤ \epsilon ≤ 2^{-n}$. |
- Proof: Let $x = \sigma \cdot 1.a_1a_2...a_{n-1}a_n... \cdot 2^e$. There are two cases to consider, that is when $a_n = 0$ and when $a_n = 1$.
- Case 1: Suppose that $a_n = 0$. Then $\mathrm{fl}(x) = \sigma \cdot 1.a_1a_2...a_{n-1}00... \cdot 2^e$. Then we have that $x - \mathrm{fl}(x)$ is given by:
- Now we will divide $x - \mathrm{fl}(x)$ to get that:
- Therefore $0 < \frac{x - \mathrm{fl}(x)}{x} ≤ 2^{-n}$, so multiplying both sides of this inequality by $-1$ and we get that $0 > \epsilon = \frac{\mathrm{fl}(x) - x}{x} ≥ -2^{-n}$.
- Case 2: Suppose that $a_n = 1$. Then $\mathrm{fl}(x) = \sigma \cdot \left ( 1.a_1a_2...a_{n-1}00... + \underbrace{0.000...00}_{\mathrm{n \: many \: zeroes}}1 \right ) \cdot 2^e$. Then we have that $\mathrm{fl}(x) - x$ is given by:
- Now we will divide $\mathrm{fl}(x) - x$ by $x$ to get that:
- Therefore we have that $0 ≤ \epsilon = \frac{\mathrm{fl}(x) - x}{x} ≤ 2^{-n}$. Combining this inequality with the inequality we obtained in the first case and we obtain that: