Truncation of Floating-Point Numbers

# Truncation of Floating-Point Numbers

Suppose that $x = \sigma \cdot \bar{x} \cdot 2^e$ is a binary number whose significand $\bar{x}$ contains $n$ binary digits. Recall that the IEEE Single-Precision Floating-Point Representation restricts how many digits of the significand that can be held in the floating-point representation of $x$.

One way to deal with holding these numbers is by truncating or chopping these numbers which we will define below.

 Definition: A number $x$ that is Truncated or Chopped at the $m^{\mathrm{th}}$ digit means that all $n - m$ digits after the $n^{\mathrm{th}}$ digit are removed.

For example, consider the decimal number $(1522.345113)_{10}$. If we were to truncate this number at the $7^{\mathrm{th}}$ digit we would obtain the number $1522.345$. The same applies to binary numbers. For example, consider the binary number $(1101000.100010)_2$ which when truncated at the $9^{\mathrm{th}}$ digit produces the number $(1101000.10)_2$. Note that if we were to truncate $(1101000.100010)_2$ at the $12^{\mathrm{th}}$ digit instead, then we would obtain $(1101000.10001)_2$ which is in fact equal to the number we started with.

We are about to look at a proposition which bounds the error associated with truncation at a specific digit in a floating point number. Before we look at it, we must first define the machine floating-point representation of number notationally

 Definition: Let $x$ be a real number. Then the Machine Floating-Point Representation of $x$ is denoted $\mathrm{fl}(x)$.

Note that when we truncate a binary number $x$ at the $n^{\mathrm{th}}$ digit then we get that $\mathrm{fl}(x) = 1.a_1a_2...a_{n-1}$ which is a floating point number that contains $n$ digits. The associated error that results from truncation is very important to define:

 Definition: Let $x$ be a real number. Then the Error as a result of truncation is denoted by $\epsilon$ and $\mathrm{fl}(x) = x(1 + \epsilon)$ where the error is explicitly given by the formula $\epsilon = \frac{\mathrm{fl}(x) - x}{x}$.

Note that if the error $\epsilon = 1$ then $x = x(1 + 0) = x$ which should make sense. The following proposition tells us that when we truncate a number to $n$ digits, then the associated error $\epsilon$ is bounded within a certain range.

 Proposition 1: Let $x$ be a real number. Then the floating point representation of $x$ truncated to have $n$ digits has error $\epsilon = \frac{\mathrm{fl}(x) - x}{x}$ such that $-2^{-n+1} ≤ \epsilon ≤ 0$.
• Proof: Let $x$ be a real number such that $x = \sigma \cdot \bar{x} \cdot 2^e = \sigma \cdot 1.a_1a_2...a_{n-1}a_{n} ... \cdot 2^e$. Suppose that we truncate $x$ at the $n^{\mathrm{th}}$ digit. Then we have that $\mathrm{fl} (x) = \sigma \cdot \underbrace{1.a_1a_2 \cdots a_{n-1}}_{\mathrm{n \: many \: digits}} \cdot 2^e$. Thus we have that the difference between $x$ and $\mathrm{fl} (x)$ is:
(1)
\begin{align} \quad \quad x - \mathrm{fl}(x) = \sigma \cdot \bar{x} \cdot 2^e = \sigma \cdot 1.a_1a_2...a_{n-1}a_{n} ... \cdot 2^e - \sigma \cdot \underbrace{1.a_1a_2 \cdots a_{n-1}}_{\mathrm{n \: many \: digits}} \cdot 2^e = \sigma \cdot \underbrace{0.000 \cdots 00}_{\mathrm{n \: many \: zeroes}} a_n ... \cdot 2^e \end{align}
• We now divide $x - \mathrm{fl}(x)$ by $x$ to get:
(2)
\begin{align} \quad \frac{x - \mathrm{fl}(x)}{x} = \frac{\sigma \cdot \underbrace{0.000 \cdots 00}_{\mathrm{n \: many \: zeroes}} a_n ... \cdot 2^e}{\sigma \cdot 1.a_1a_2...a_{n-1}a_{n} \cdot 2^e} \end{align}
• We note that when the signs $\sigma$ cancel in the above division that the result will be positive and greater or equal to zero and thus $\frac{x - \mathrm{fl}(x)}{x} ≥ 0$. Multiplying both sides of this inequality by $-1$ and we have that $\epsilon = \frac{\mathrm{fl}(x) - x}{x} ≤ 0$.
• We also note that
(3)
\begin{align} \quad \frac{x - \mathrm{fl}(x)}{x} = \frac{\sigma \cdot \underbrace{0.000 \cdots 00}_{\mathrm{n \: many \: zeroes}} a_n ... \cdot 2^e}{\sigma \cdot 1.a_1a_2...a_{n-1}a_{n} \cdot 2^e} ≤ \frac{\underbrace{0.000 \cdots 00}_{\mathrm{n-many \: zeroes}}111...1...}{1.000...0...} = \underbrace{0.000 \cdots 00}_{\mathrm{n-many \: zeroes}}111...1... ≤ \underbrace{0.000 \cdots 0}_{\mathrm{n-1 \: many \: zeroes}}1000...0... = 2^{-n+1} \end{align}
• Therefore we have that $\frac{x - \mathrm{fl}(x)}{x} ≤ 2^{-n+1)}$. Multiplying both sides of this inequality by $-1$ and we have that $\epsilon = \frac{\mathrm{fl} - x}{x} ≥ -2^{-n+1}$. Combining both inequalities and we have that:
(4)
\begin{align} -2^{-n+1} ≤ \epsilon ≤ 0 \quad \blacksquare \end{align}