
I have read that

(1) Ill conditioned operations should be performed before well conditioned ones.

As an example, one should calculate $xz-yz$ as $(x-y)z$ since subtraction is ill conditioned while multiplication isn't.

However, a first-order error analysis of both algorithms reveals that they differ only by a factor of three (*), and I don't see why one can generalize this to statement (1), nor do I intuitively grasp the significance of the order of operations. Do you think statement (1) is an accepted rule, and do you have other explanations for it?

*: more specifically, the first version has a relative error bounded by

$$\text{eps}+3\frac{|x|+|y|}{|x|-|y|}\text{eps}$$ while the relative error of the second version is bounded by

$$3\text{eps}+\frac{|x|+|y|}{|x|-|y|}\text{eps}$$

where $\text{eps}$ is the machine precision.
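(For example, with $x=1.001$ and $y=1$ the factor $\frac{|x|+|y|}{|x|-|y|}\approx 2\times 10^3$, so the two bounds are roughly $6\times 10^3\,\text{eps}$ and $2\times 10^3\,\text{eps}$: both are dominated by the cancellation term and differ by about the factor of three mentioned above.)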

This analysis is based on the assumption that the $i$-th intermediate result is multiplied by $(1+\varepsilon_i)$ (due to rounding errors), where the $\varepsilon_i$ are i.i.d. random variables bounded by $\text{eps}$. "First-order" means that higher-order terms, like $\varepsilon_i\varepsilon_j x$, are neglected.


1 Answer


Let's denote by $\otimes,\oplus,\ominus$ the floating-point analogs of exact multiplication ($\times$), addition ($+$), and subtraction ($-$), respectively. We'll assume (IEEE-754) that for all of them $$ [x\oplus y]=(x+ y)(1+\delta_\oplus),\quad |\delta_\oplus|\le\epsilon_\mathrm{mach}, $$ where $\epsilon_\mathrm{mach}$ is the machine epsilon, an upper bound on the relative error due to rounding. We will also use the following lemma (assuming all $|\delta_i|\le\epsilon_\mathrm{mach}$ and $m$ is not too large), which is easily proven: $$ \prod\limits_{i=1}^{m}(1+\delta_i)=1+\theta(m),\quad |\theta(m)|\le\frac{m\epsilon_\mathrm{mach}}{1-m\epsilon_\mathrm{mach}}. $$
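For completeness, here is a quick sketch of why the lemma holds, assuming $m\epsilon_\mathrm{mach}<1$ (it is essentially Lemma 3.1 in Higham, *Accuracy and Stability of Numerical Algorithms*). For $m=1$, $\theta(1)=\delta_1$. For the induction step, write $1+\theta(m)=\big(1+\theta(m-1)\big)(1+\delta_m)$, i.e. $\theta(m)=\theta(m-1)+\delta_m\big(1+\theta(m-1)\big)$; using the induction hypothesis both for $|\theta(m-1)|$ and for $1+|\theta(m-1)|\le\frac{1}{1-(m-1)\epsilon_\mathrm{mach}}$, $$ |\theta(m)|\le\frac{(m-1)\epsilon_\mathrm{mach}}{1-(m-1)\epsilon_\mathrm{mach}}+\frac{\epsilon_\mathrm{mach}}{1-(m-1)\epsilon_\mathrm{mach}}=\frac{m\epsilon_\mathrm{mach}}{1-(m-1)\epsilon_\mathrm{mach}}\le\frac{m\epsilon_\mathrm{mach}}{1-m\epsilon_\mathrm{mach}}. $$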

Let's define the true function $f$ that operates on real numbers $x,y,z$ as

$$ f(x,y,z)=(x\times z)-(y\times z) $$

and two versions of its implementation in IEEE-compliant floating-point arithmetic, $\tilde{f_1}$ and $\tilde{f_2}$, that operate on the floating-point representations $\tilde{x}=x(1+\delta_x)$, $\tilde{y}=y(1+\delta_y)$, $\tilde{z}=z(1+\delta_z)$, as follows:

$$ \tilde{f_1}(\tilde{x},\tilde{y},\tilde{z})=(\tilde{x}\otimes\tilde{z})\ominus(\tilde{y}\otimes\tilde{z}), $$

$$ \tilde{f_2}(\tilde{x},\tilde{y},\tilde{z})=(\tilde{x}\ominus\tilde{y})\otimes\tilde{z}. $$
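Before going through the algebra, here is a minimal numerical sanity check of the two implementations; this is my own illustration, not part of the analysis below. It evaluates $\tilde{f_1}$ and $\tilde{f_2}$ in single precision (NumPy `float32`) against a double-precision reference; the particular values of `x`, `y`, `z` are arbitrary, chosen so that $x$ and $y$ are nearly equal.

```python
import numpy as np

# x and y nearly equal, so (|x| + |y|) / |x - y| is about 2e6 and
# cancellation dominates the error of both implementations
x, y, z = 1.000001, 1.0, 1.1

# double-precision reference for f(x, y, z) = x*z - y*z
exact = (x - y) * z

# rounding the inputs to single precision introduces delta_x, delta_y, delta_z
xs, ys, zs = np.float32(x), np.float32(y), np.float32(z)

f1 = xs * zs - ys * zs   # (x ⊗ z) ⊖ (y ⊗ z)
f2 = (xs - ys) * zs      # (x ⊖ y) ⊗ z

print("relative error of f1:", abs(float(f1) - exact) / abs(exact))
print("relative error of f2:", abs(float(f2) - exact) / abs(exact))
```

For these inputs both relative errors come out at the level of a few percent, vastly larger than the single-precision $\epsilon_\mathrm{mach}\approx1.2\times10^{-7}$, and neither ordering is reliably better: exactly the cancellation effect quantified below.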

Error analysis for $\tilde{f_1}$:

$$ \begin{aligned} \tilde{f_1}&=\Big(\underbrace{\big(x(1+\delta_x)\times z(1+\delta_z)\big)\big(1+\delta_{\otimes_{xz}}\big)}_{(\tilde{x}\otimes\tilde{z})}-\underbrace{\big(y(1+\delta_y)\times z(1+\delta_z)\big)\big(1+\delta_{\otimes_{yz}}\big)}_{(\tilde{y}\otimes\tilde{z})}\Big)\Big(1+\delta_{\ominus}\Big)\\ &=xz(1+\delta_x)(1+\delta_z)(1+\delta_{\otimes_{xz}})(1+\delta_\ominus)-yz(1+\delta_y)(1+\delta_z)(1+\delta_{\otimes_{yz}})(1+\delta_\ominus)\\ &=xz(1+\theta_{xz,1})-yz(1+\theta_{yz,1}). \end{aligned} $$ Here, $|\theta_{xz,1}|,|\theta_{yz,1}|\le\frac{4\epsilon_\mathrm{mach}}{1-4\epsilon_\mathrm{mach}}$.

Similarly, for $\tilde{f_2}$ $$ \begin{aligned} \tilde{f_2}&=\Bigg(\Big(\big( x(1+\delta_x)-y(1+\delta_y) \big)\big(1+\delta_{\ominus_{xy}}\big)\Big)\times \Big(z(1+\delta_z)\Big)\Bigg)\Bigg(1+\delta_{\otimes}\Bigg)\\ &=xz(1+\delta_x)(1+\delta_z)(1+\delta_{\ominus_{xy}})(1+\delta_\otimes)-yz(1+\delta_y)(1+\delta_z)(1+\delta_{\ominus_{xy}})(1+\delta_\otimes)\\ &=xz(1+\theta_{x,2})-yz(1+\theta_{y,2}). \end{aligned} $$ Here, $|\theta_{x,2}|,|\theta_{y,2}|\le\frac{4\epsilon_\mathrm{mach}}{1-4\epsilon_\mathrm{mach}}$.

So, for both $\tilde{f_1}$ and $\tilde{f_2}$ we obtain expressions of the same type, and thus I do not see why one implementation would be preferred to the other from a numerical point of view (apart from the fact that $\tilde{f_2}$ needs only two floating-point operations, compared to three for $\tilde{f_1}$).

Computing the relative error shows that the problem comes from the fact that $x$ and $y$ can be very close (cancellation):

$$ \begin{aligned} \frac{|\tilde{f_1}-f|}{|f|}&=\frac{|xz+xz\theta_{xz,1}-yz-yz\theta_{yz,1}-(xz-yz)|}{|xz-yz|}=\frac{|x\theta_{xz,1}-y\theta_{yz,1}|}{|x-y|}\\ &\le\frac{|x|+|y|}{|x-y|}\frac{4\epsilon_\mathrm{mach}}{1-4\epsilon_\mathrm{mach}}, \end{aligned} $$ $$ \begin{aligned} \frac{|\tilde{f_2}-f|}{|f|}&=\frac{|xz+xz\theta_{x,2}-yz-yz\theta_{y,2}-(xz-yz)|}{|xz-yz|}=\frac{|x\theta_{x,2}-y\theta_{y,2}|}{|x-y|}\\ &\le\frac{|x|+|y|}{|x-y|}\frac{4\epsilon_\mathrm{mach}}{1-4\epsilon_\mathrm{mach}}. \end{aligned} $$
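To put numbers on this (my own example): in double precision, $\epsilon_\mathrm{mach}\approx 2.2\times 10^{-16}$; taking $x=1+10^{-10}$, $y=1$, and any moderate $z$ gives $\frac{|x|+|y|}{|x-y|}\approx 2\times 10^{10}$, so both bounds evaluate to roughly $2\times 10^{-5}$: either implementation is only guaranteed about five significant digits, and the loss is the same for both.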

Slight differences between the $\theta$'s might make one of the two numerical implementations marginally better or worse depending on $x,y,z$, but I doubt this is of any significance. The result makes sense: no matter what, if you have to compute $x-y$ in floating-point arithmetic when $x$ and $y$ are close enough in value (for the precision you work with), no scaling will help you significantly; you are already in trouble.

NB: All discussion above assumes no overflow or underflow, i.e. $x,y,z,f(x,y,z)\in\mathbb F_0$, $\mathbb F_0$ being the set of all normal floating-point numbers.
