# Truncation of Floating-Point Numbers

Suppose that $x = \sigma \cdot \bar{x} \cdot 2^e$ is a binary number whose significand $\bar{x}$ contains $n$ binary digits. Recall that the IEEE Single-Precision Floating-Point Representation restricts the number of significand digits that can be stored in the floating-point representation of $x$.

One way to fit such a number into the representation is to truncate (or chop) it, which we define below.

Definition: A number $x$ is said to be Truncated or Chopped at the $m^{\mathrm{th}}$ digit when all $n - m$ digits after the $m^{\mathrm{th}}$ digit are removed.

For example, consider the decimal number $(1522.345113)_{10}$. If we were to truncate this number at the $7^{\mathrm{th}}$ digit we would obtain the number $1522.345$. The same applies to binary numbers. For example, consider the binary number $(1101000.100010)_2$ which when truncated at the $9^{\mathrm{th}}$ digit produces the number $(1101000.10)_2$. Note that if we were to truncate $(1101000.100010)_2$ at the $12^{\mathrm{th}}$ digit instead, then we would obtain $(1101000.10001)_2$ which is in fact equal to the number we started with.
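The examples above can be reproduced with a small sketch in Python; the function name `truncate_at` is our own, not from any library. It chops a number written as a digit string at the $m^{\mathrm{th}}$ digit, where the radix point is not counted as a digit:

```python
def truncate_at(digits: str, m: int) -> str:
    """Keep only the first m digits of a number written as a digit
    string; the radix point does not count as a digit."""
    kept = 0
    out = []
    for ch in digits:
        if ch == ".":
            out.append(ch)
            continue
        if kept == m:
            break          # everything past the m-th digit is removed
        out.append(ch)
        kept += 1
    return "".join(out).rstrip(".")

print(truncate_at("1522.345113", 7))     # -> 1522.345
print(truncate_at("1101000.100010", 9))  # -> 1101000.10
```

Truncating `"1101000.100010"` at the $12^{\mathrm{th}}$ digit returns `"1101000.10001"`, which, as noted above, has the same value as the original number.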

We are about to look at a proposition that bounds the error associated with truncation at a specific digit of a floating-point number. Before we do, we must first define the machine floating-point representation of a number notationally.

Definition: Let $x$ be a real number. Then the Machine Floating-Point Representation of $x$ is denoted $\mathrm{fl}(x)$.

Note that when we truncate a binary number $x = \sigma \cdot 1.a_1a_2 \ldots \cdot 2^e$ at the $n^{\mathrm{th}}$ digit, we get $\mathrm{fl}(x) = \sigma \cdot 1.a_1a_2 \ldots a_{n-1} \cdot 2^e$, a floating-point number whose significand contains $n$ digits. The error that results from truncation is very important to define:

Definition: Let $x$ be a nonzero real number. Then the Error resulting from truncation is denoted by $\epsilon$ and satisfies $\mathrm{fl}(x) = x(1 + \epsilon)$, so the error is given explicitly by the formula $\epsilon = \frac{\mathrm{fl}(x) - x}{x}$.
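As an illustrative sketch of this definition (the helper name `truncate_significand` is our own; we use Python's `math.frexp` and `math.ldexp` to access the binary significand), we can chop the significand of a number to $n$ digits and compute the resulting $\epsilon$:

```python
import math

def truncate_significand(x: float, n: int) -> float:
    """Chop the binary significand of x to n digits (the leading digit
    plus n - 1 fractional bits), discarding everything after them."""
    m, e = math.frexp(x)        # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** n
    return math.ldexp(math.trunc(m * scale) / scale, e)

x = 1 / 3                            # 0.010101..._2, infinite expansion
fl_x = truncate_significand(x, 8)    # keep an 8-digit significand
eps = (fl_x - x) / x
print(fl_x)  # 0.33203125
print(eps)   # approximately -1/256 = -0.00390625
```

Note that $\epsilon$ comes out negative: chopping discards trailing digits, so $\mathrm{fl}(x)$ never exceeds $x$ in magnitude.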

Note that if the error $\epsilon = 0$ then $\mathrm{fl}(x) = x(1 + 0) = x$, which should make sense: the representation is exact. The following proposition tells us that when we truncate a number to $n$ digits, the associated error $\epsilon$ is bounded within a certain range.

Proposition 1: Let $x$ be a nonzero real number. Then the floating-point representation of $x$ truncated to $n$ digits has error $\epsilon = \frac{\mathrm{fl}(x) - x}{x}$ such that $-2^{-n+1} ≤ \epsilon ≤ 0$.

**Proof:** Let $x$ be a real number such that $x = \sigma \cdot \bar{x} \cdot 2^e = \sigma \cdot 1.a_1a_2...a_{n-1}a_{n} ... \cdot 2^e$. Suppose that we truncate $x$ at the $n^{\mathrm{th}}$ digit. Then we have that $\mathrm{fl} (x) = \sigma \cdot \underbrace{1.a_1a_2 \cdots a_{n-1}}_{n \text{ digits}} \cdot 2^e$. Thus we have that the difference between $x$ and $\mathrm{fl} (x)$ is:

$$x - \mathrm{fl}(x) = \sigma \cdot 0.\underbrace{00 \cdots 0}_{n-1 \text{ zeroes}}a_na_{n+1}\cdots \cdot 2^e = \sigma \cdot 0.a_na_{n+1}\cdots \cdot 2^{e-n+1}$$

- We now divide $x - \mathrm{fl}(x)$ by $x$ to get:

$$\frac{x - \mathrm{fl}(x)}{x} = \frac{\sigma \cdot 0.a_na_{n+1}\cdots \cdot 2^{e-n+1}}{\sigma \cdot 1.a_1a_2\cdots \cdot 2^e} = \frac{0.a_na_{n+1}\cdots}{1.a_1a_2\cdots} \cdot 2^{-n+1}$$

- We note that when the signs $\sigma$ cancel in the above division, the result is nonnegative, and thus $\frac{x - \mathrm{fl}(x)}{x} ≥ 0$. Multiplying both sides of this inequality by $-1$, we have that $\epsilon = \frac{\mathrm{fl}(x) - x}{x} ≤ 0$.

- We also note that $0.a_na_{n+1}\cdots ≤ 1$ and $1.a_1a_2\cdots ≥ 1$, so $\frac{0.a_na_{n+1}\cdots}{1.a_1a_2\cdots} ≤ 1$.

- Therefore we have that $\frac{x - \mathrm{fl}(x)}{x} ≤ 2^{-n+1}$. Multiplying both sides of this inequality by $-1$, we have that $\epsilon = \frac{\mathrm{fl}(x) - x}{x} ≥ -2^{-n+1}$. Combining both inequalities, we have that:

$$-2^{-n+1} ≤ \epsilon ≤ 0 \quad \blacksquare$$
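The bound in Proposition 1 can also be checked numerically. The following sketch (the helper `chop` is our own model of truncation, not a library routine) chops significands to $n = 10$ digits and verifies that $-2^{-n+1} ≤ \epsilon ≤ 0$ over many random samples:

```python
import math
import random

n = 10                      # significand digits kept after chopping
bound = 2 ** (-(n - 1))     # Proposition 1: -2^{-n+1} <= eps <= 0

def chop(x: float, n: int) -> float:
    """Model fl(x): truncate the binary significand of x to n digits."""
    m, e = math.frexp(x)            # x = m * 2**e with 0.5 <= |m| < 1
    s = 2 ** n
    return math.ldexp(math.trunc(m * s) / s, e)

random.seed(0)
for _ in range(10_000):
    x = random.uniform(-100.0, 100.0)
    if x == 0.0:
        continue                    # eps is undefined at x = 0
    eps = (chop(x, n) - x) / x
    assert -bound <= eps <= 0, (x, eps)

print("Proposition 1 held for all samples")
```

Note that the bound holds for negative $x$ as well, since `math.trunc` rounds toward zero, mirroring how the signs $\sigma$ cancel in the proof.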