ACM SIGPLAN
Garbage In/Garbage Out
You Could Learn a Lot from a Quadratic:
I. Overloading Considered Harmful
Author: Henry G. Baker, http://home.pipeline.com/hbaker1/home.html; [email protected]
(No sacred cows were physically harmed in the making of this column.)
Probably the single most memorable non-trivial algebraic formula that students ever see is the famous quadratic formula for finding the roots of the quadratic equation Ax² + Bx + C = 0:

    x = (-B ± √(B² - 4AC)) / (2A).
Thus, when students are given the problem of writing code for solving this equation in an elementary programming course, they dredge up this formula, and proceed to try to use it to solve for the roots of a quadratic in the obvious way. If the teacher is exceptional, he will provide enough test cases to show that computer arithmetic is not at all like algebraic arithmetic, and the students will learn something valuable when this formula fails in mysterious ways. Unfortunately, most computer scientists don't take this opportunity to provide valuable insights into a wide variety of mathematical and programming issues. Indeed, the Ada Language Reference Manual [Ada83] gives a very poor implementation of quadratic equation solving as an example to be emulated!
    -- From [Ada83], 10.1.1, Ex. 1.
    with TEXT_IO, REAL_OPERATIONS;
    use REAL_OPERATIONS;
    procedure QUADRATIC_EQUATION is
       A, B, C, D : REAL;
       use REAL_IO, TEXT_IO, REAL_FUNCTIONS;
    begin
       GET(A); GET(B); GET(C);
       D := B**2 - 4.0*A*C;
       if D < 0.0 then
          PUT("Imaginary Roots.");
       else
          PUT("Real Roots : X1 = ");
          PUT((-B - SQRT(D))/(2.0*A));
          PUT(" X2 = ");
          PUT((-B + SQRT(D))/(2.0*A));
       end if;
       NEW_LINE;
    end QUADRATIC_EQUATION;
Most of the careful work of the Ada language designers on high quality numeric datatypes has gone down the drain with these few lines of careless code. Suppose that the REAL variables in this program are implemented as ubiquitous IEEE-754 binary floating point numbers having a 24-bit mantissa and an 8-bit exponent. How can this program fail? If A = 0, then we can get a divide-by-zero exception (or an infinity). If A = 1, B = 2^14 = 16384, C = 1, then B² - 4AC = 2^28 - 4 ≈ 2^28 = D, so SQRT(D) = 2^14 and -B + SQRT(D) = -2^14 + 2^14 = 0, even though the true roots are approximately -2^14 = -16384 and -0.000061! For another example, if A = C = 2^80 (about the size of Avogadro's Number) and B = 0, then B² - 4AC = -2^162, which cannot be represented within the 8-bit exponent range even though the roots ±i can be represented.

Thus, if -B and SQRT(D) are of approximately the same magnitude, then we can get massive cancellation, and perhaps produce a root which is zero even when C ≠ 0, which is impossible, since C/A is the product of the roots![1] If B is particularly large or small, then the mere computation of B² may cause an exponent underflow or overflow, even when both roots are well within exponent range. Aside from these problems, the program is quite inefficient, since it computes SQRT(D) and 2.0*A twice.[2] Given the class of real-time systems that Ada is targeting, it is possible that the end-users may die of more than embarrassment if this example is followed.
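These failure modes are easy to reproduce in any IEEE-754 arithmetic. Below is a sketch in Python (whose floats are 53-bit doubles rather than the 24-bit REALs above, so B is scaled up to about 10^8 to trigger the same cancellation), comparing the naive formula against a cancellation-free variant in the style of the [Press86] cookbook; the function names are illustrative, not from any library.

```python
import math

# Naive quadratic formula, transcribed directly from the algebra.
def naive_roots(a, b, c):
    d = math.sqrt(b * b - 4 * a * c)
    return ((-b - d) / (2 * a), (-b + d) / (2 * a))

# Cancellation-free variant: compute the large-magnitude root first,
# then recover the small one from the product of the roots, C/A.
def stable_roots(a, b, c):
    d = math.sqrt(b * b - 4 * a * c)
    q = -0.5 * (b + math.copysign(d, b))
    return (q / a, c / q)

# Double-precision analogue of the 24-bit example in the text:
# the true roots are approximately -1e8 and -1e-8.
a, b, c = 1.0, 1e8, 1.0
_, x2_naive = naive_roots(a, b, c)
x1, x2 = stable_roots(a, b, c)
print(x2_naive)  # small root, wrecked by cancellation
print(x2)        # small root, computed accurately
```

The naive small root loses essentially all of its significant digits, while the C/Q form is accurate to full precision.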
Examples like this prove the utter futility of trying to
[1] [Casio86] suggests making the same mistake, thus vastly reducing one's confidence in Casio products. [Menzel60, 1.1], [Kahan96], and [Press86] avoid cancellation, and [Young72] provides an exhaustive analysis of cases of the quadratic. Although [Press86] doesn't handle overflow/underflow gracefully, his cookbook is at least elegant:

    Q = -(1/2)[B + sgn(B)√(B² - 4AC)]
    x1 = Q/A
    x2 = C/Q.

[2] Yes, I know that many compilers can do common subexpression elimination, but this ability for subroutines like SQRT is rare, and what good is an optimization that can't be relied upon?
make programs look more like mathematical formulæ, which hubris is the core premise of FORTRAN (FORmula TRANslator) and its descendant wannabees. Computer arithmetic doesn't follow most of the rules required by algebra, so trying to make program expressions look like mathematical expressions is foolhardy and dangerous. Such confusion is especially dangerous to the poor students, who don't yet have enough of a solid grasp of either mathematics or programming to be able to navigate these subtle minefields.
Mathematics of Quadratic Equations
For the past 35 years, American high schools have been engaged in a massive Federally-funded study to determine how little mathematics and science can be taught to the populace before a first-world country collapses into a third-world economy. Freshmen now arrive at college blissfully ignorant of much of basic algebra, including the algebra necessary to understand and solve a quadratic equation. In particular, most cannot derive the quadratic formula, or even describe the simple symmetries of a quadratic equation.
The usual derivation of the quadratic formula involves completing the square, but since this step is completely unmotivated, it is (quite properly) dismissed by the student as a mere trick and quickly forgotten. A more fundamental approach involves looking at the symmetries of the equation Ax² + Bx + C = 0 with real coefficients A, B, C, where A ≠ 0.

The first symmetry of this equation is the observation that the solution does not change when the equation is multiplied through by any non-zero constant, including 1/A itself (assuming that A ≠ 0). Thus, we can force the coefficient of the quadratic term to be non-negative by multiplying the equation by sgn(A):[3]

    0 = sgn(A)Ax² + sgn(A)Bx + sgn(A)C
      = |A|x² + sgn(A)Bx + sgn(A)C

More importantly, we can simplify the equation by eliminating a parameter if we divide a non-trivial quadratic equation through by the coefficient of the squared, or quadratic, term. This produces the monic equation:

    x² + (B/A)x + (C/A) = 0.

[3] We use the convention that sgn(A) = +1 if A > 0 and sgn(A) = -1 if A < 0. When A = 0, we require only that sgn(A)² = |sgn(A)| = 1, so that 1/sgn(A) is non-singular and A = sgn(A)|A|.
Assuming that A ≠ 0, we can then assume without loss of generality that we have already performed this step, which reduces 3 parameters to 2, for a savings of 33%. This allows us to focus our attention on the simpler equation:

    x² + Bx + C = 0.

We now consider the effect on the structure of the equation of performing the substitution x = -y. We get:

    (-y)² + B(-y) + C = y² - By + C = 0.
In other words, negating x negates the linear term while preserving the signs of the quadratic and constant terms. If we wanted to, we could use this symmetry to force the coefficient of the linear term to be negative with the substitution x = -sgn(B)y:

    0 = (-sgn(B)y)² + B(-sgn(B)y) + C
      = y² - (sgn(B)B)y + C
      = y² - |B|y + C
We can generalize this symmetry by considering the dilation x = ay, where a is a non-zero real number:

    (ay)² + B(ay) + C = (a²)y² + (aB)y + C = 0.

The dilation x = √|C| y can be used, when C ≠ 0, to normalize the equation such that the constant term has absolute value 1:

    0 = (√|C| y)² + B(√|C| y) + C
      = |C|y² + (B√|C|)y + C.

Dividing by |C| produces:

    0 = y² + (B/√|C|)y + sgn(C)
      = y² + (B/√|C|)y ± 1.

Dilations also give us another way to get rid of the coefficient A > 0 in the equation Ax² + Bx + C = 0: use the substitution x = y/√A:

    A(y/√A)² + B(y/√A) + C = y² + (B/√A)y + C = 0.
Finally, we can perform all three simplifications at the same time with the substitution x = -y sgn(B)√(|C|/A):

    0 = A(-y sgn(B)√(|C|/A))² + B(-y sgn(B)√(|C|/A)) + C
      = |C|y² - (|B|√(|C|/A))y + C

Now, dividing by |C|, we get

    0 = y² - (|B|/√(A|C|))y + sgn(C)
      = y² - (|B|/√(A|C|))y ± 1.

With this substitution, we have achieved a 67% reduction in parameters, from 3 to 1.
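As a quick sanity check of this reduction, here is a sketch in Python (assuming A > 0 and C ≠ 0; the helper names are mine) that maps a general quadratic to the one-parameter form y² - B'y ± 1 = 0 and then maps the roots back:

```python
import math

# Reduce A*x^2 + B*x + C = 0 (A > 0, C != 0) to y^2 - B'*y + sgn(C) = 0
# via the substitution x = -y * sgn(B) * sqrt(|C|/A).
def reduce_quadratic(a, b, c):
    scale = -math.copysign(math.sqrt(abs(c) / a), b)   # x = scale * y
    b_prime = abs(b) / math.sqrt(a * abs(c))
    return b_prime, math.copysign(1.0, c), scale

# Example: 2x^2 - 10x + 8 = 0 has roots 1 and 4.
bp, cp, scale = reduce_quadratic(2.0, -10.0, 8.0)
# Solve the reduced equation y^2 - bp*y + cp = 0 directly.
d = math.sqrt(bp * bp - 4 * cp)
y1, y2 = (bp + d) / 2, (bp - d) / 2
x1, x2 = scale * y1, scale * y2
print(sorted([x1, x2]))  # [1.0, 4.0]
```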
If C ≠ 0, then we can consider the substitution x = 1/y:

    A(1/y)² + B(1/y) + C = A/y² + B/y + C = 0.

Now if C ≠ 0, then any root y ≠ 0, so we can multiply through by y² to get:

    y²(A/y² + B/y + C) = Cy² + By + A = 0.

The substitution x = 1/y reverses the quadratic end-for-end and exchanges the roles of A and C!
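This reversal is easy to confirm numerically; the following minimal Python sketch (the helper name is illustrative) checks that reversing the coefficients reciprocates the roots:

```python
import math

# Roots of a*x^2 + b*x + c = 0, sorted for easy comparison.
def roots(a, b, c):
    d = math.sqrt(b * b - 4 * a * c)
    return sorted([(-b - d) / (2 * a), (-b + d) / (2 * a)])

# x^2 - 3x + 2 = 0 has roots 1 and 2 ...
print(roots(1, -3, 2))
# ... and reversing the coefficients (the x = 1/y substitution)
# reciprocates the roots: 1 and 1/2.
print(roots(2, -3, 1))
```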
The final symmetry we consider is that of translation, in which we perform the substitution x = y + b:

    0 = A(y + b)² + B(y + b) + C
      = Ay² + 2Aby + Ab² + By + Bb + C
      = Ay² + (2Ab + B)y + (Ab² + Bb + C).

This last symmetry provides for the possibility of arranging for the linear coefficient of y to be zero if 2Ab + B = 0, i.e., b = -B/2A:

    0 = Ay² + (2Ab + B)y + (Ab² + Bb + C)
      = Ay² + (B²/4A - B²/2A + C)
      = Ay² + (C - B²/4A).
In other words, y² = B²/4A² - C/A, in which case

    y = ±√(B²/4A² - C/A) = ±√((B² - 4AC)/4A²) = ±√(B² - 4AC)/2A.

Substituting now for x, we get

    x = y + b
      = y - B/2A
      = ±√(B² - 4AC)/2A - B/2A
      = (-B ± √(B² - 4AC))/2A.

In other words, by studying the symmetries of the equation Ax² + Bx + C = 0, we were able to find the quadratic formula by ourselves.
We now turn the problem around, and study the quadratic equation that results from the two roots x1 and x2:

    (x - x1)(x - x2) = x² - (x1 + x2)x + x1·x2 = 0.

In other words, if we have a monic (A = 1) quadratic equation x² + Bx + C = 0, then the sum of the roots is -B, and the product of the roots is C. In particular, if C = 0, then (at least) one of the roots is zero, while if B = 0, then x1 + x2 = 0, i.e., x2 = -x1.

Furthermore, if C ≠ 0 and if we have already found one root x1 = r (which must therefore be non-zero), then we can trivially find the second root: x2 = C/r = C/x1. In particular, if |C| = 1, then x2 = ±1/x1 = ±1/r, and the equation has the following simple form:

    (x - r)(x ∓ 1/r) = x² - (r ± 1/r)x ± 1 = 0.
Let us revisit the quadratic formula for x² + Bx + C = 0 once more, now that we know that B = -(x1 + x2) and C = x1·x2 (x1, x2 both real):

    x = (-B ± √(B² - 4C))/2
      = ((x1 + x2) ± √((-(x1 + x2))² - 4(x1·x2)))/2
      = ((x1 + x2) ± √((x1 + x2)² - 4x1·x2))/2
      = ((x1 + x2) ± √(x1² + 2x1·x2 + x2² - 4x1·x2))/2
      = ((x1 + x2) ± √(x1² - 2x1·x2 + x2²))/2
      = ((x1 + x2) ± √((x1 - x2)²))/2
      = ((x1 + x2) ± |x1 - x2|)/2
      = (x1 + x2)/2 ± |x1 - x2|/2.
In other words, the first term of the quadratic formula provides the average/mean (center of mass) of the two roots, while the second term of the quadratic formula provides half the (absolute value of the) difference of the two roots![4]
Trigonometric Solutions
In the previous section, we saw that the quadratic equation y² - By ± 1 = 0 for B ≥ 0 is a particularly interesting "universal" quadratic, because the general quadratic can be reduced to this form without performing addition/subtraction, which can sometimes cause spectacular cancellation errors. We now investigate trigonometric solutions to this equation.

We first take up the case where C = +1, i.e., y² - By + 1 = 0, for B ≥ 0. There are two subcases: B ≥ 2, and 0 ≤ B < 2. Consider the quadratic formed by the two (positive) roots y1 = e^θ, y2 = e^(-θ):

    0 = (y - y1)(y - y2)
      = (y - e^θ)(y - e^(-θ))
      = y² - (e^θ + e^(-θ))y + e^θ·e^(-θ)
      = y² - 2cosh(θ)y + 1
      = y² - By + 1.

This last equation is well-defined if B ≥ 2, since cosh(θ) ≥ 1 for all real θ, so we can solve for θ to produce the roots y1, y2:

    θ = acosh(B/2)
    y1 = e^θ = e^acosh(B/2)
    y2 = e^(-θ) = e^(-acosh(B/2)).

When 0 ≤ B < 2 in the equation y² - By + 1 = 0, then we have 2 complex roots because B² - 4AC = B² - 4 < 0. We can then utilize hyperbolic trigonometric
[4] A classic hack for the max and min functions involves the identities max(x1, x2) + min(x1, x2) = x1 + x2 and max(x1, x2) - min(x1, x2) = |x1 - x2|, which yield the formulæ:

    max(x1, x2) = (x1 + x2)/2 + |x1 - x2|/2
    min(x1, x2) = (x1 + x2)/2 - |x1 - x2|/2.

We have thus shown that these formulæ have the same cancellation problems as the quadratic formula, and are thus a terrible way to compute max and min!
functions with complex angles, or alternatively, we can identify B/2 with cos(θ), for some real angle θ:

    0 = (y - y1)(y - y2)
      = (y - e^(iθ))(y - e^(-iθ))
      = y² - (e^(iθ) + e^(-iθ))y + e^(iθ)·e^(-iθ)
      = y² - 2cos(θ)y + 1
      = y² - 2sin(π/2 - θ)y + 1
      = y² - 2sin(φ)y + 1       (where φ = π/2 - θ)
      = y² - By + 1.

We then solve for φ, y1, y2:[5]

    φ = asin(B/2)
    y1 = e^(iθ) = e^(i(π/2 - φ)) = i·e^(-iφ) = cis(-φ)i = cis(-asin(B/2))i
    y2 = e^(-iθ) = e^(-i(π/2 - φ)) = -i·e^(iφ) = -cis(φ)i = -cis(asin(B/2))i.

(cis(θ) = cos(θ) + i·sin(θ) = e^(iθ).)
The other major case involves C = -1, i.e., y² - By - 1 = 0, for B ≥ 0. Consider the roots y1 = e^θ, y2 = -e^(-θ):

    0 = (y - y1)(y - y2)
      = (y - e^θ)(y + e^(-θ))
      = y² - (e^θ - e^(-θ))y - e^θ·e^(-θ)
      = y² - 2sinh(θ)y - 1
      = y² - By - 1.

Thus, we can now solve for θ, y1, y2:

    θ = asinh(B/2)
    y1 = e^θ = e^asinh(B/2)
    y2 = -e^(-θ) = -e^(-asinh(B/2)).
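These hyperbolic solutions can be checked directly; a small Python sketch (the function name is mine) uses the library asinh and verifies that both values satisfy the equation:

```python
import math

# Roots of the "universal" quadratic y^2 - B*y - 1 = 0 (B >= 0):
# y1 = e^asinh(B/2), y2 = -e^(-asinh(B/2)).
def universal_roots(b):
    t = math.asinh(b / 2.0)
    return math.exp(t), -math.exp(-t)

y1, y2 = universal_roots(3.0)
# Residuals of y^2 - 3y - 1 at both computed roots (should be ~0).
print(y1 * y1 - 3 * y1 - 1)
print(y2 * y2 - 3 * y2 - 1)
```

Note that the product y1·y2 = -1 automatically, as the constant term demands.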
For completeness, we express a root x1 of the original quadratic Ax² + Bx + C = 0 trigonometrically:

    x1 = √(-C/A) · e^(-asinh(B/(2√(-AC)))).

Suitably interpreted, this formula is equivalent to the classical quadratic formula! (Hint: use the mathematical definition asinh(z) = log(z + √(1 + z²)) and the property asinh(-z) = -asinh(z).)

An important reason for expressing the solutions of the quadratic equation in this trigonometric form is that all
[5] We utilize the function asin(B/2) rather than acos(B/2) because the inverse sin function is better behaved near B/2 = 0.
of the operations leading up to this form are numerically stable,[6] thus passing the buck to the trigonometric[7] and exponential functions to properly handle the numerical subtleties, instead of trying to handle them one's self!
Floating Point Arithmetic
Computers have (at least) 2 kinds of arithmetic operations: "integer" (fixed point) operations and "floating point" operations. Fixed point addition and multiplication are commutative and associative over a limited range in traditional algebraic fashion, whereas floating point addition and multiplication are usually commutative, but almost never associative.

Most algebraic systems encountered by students are commutative and associative, with matrices providing the first encounters with non-commutative algebra. Other than floating point arithmetic, the only non-associative algebra normally encountered is that of vector cross-products, which are neither commutative nor associative. Unfortunately, computer science classes rarely use the students' encounters with floating point arithmetic to point out its non-associativity and other weird features.

In floating point arithmetic, there exist numbers y ≠ 0, such that x ⊕ y = x,[8] i.e., y "drowns" in x (why do you think they call it "floating" point?). For example, on many computers 10^8 ⊕ 1 = 10^8. As a result, one can write loops which continually increment x with y, but will never reach z > x! The student usually gets this rude awakening the first time he tries to perform approximate integration by adding up the little rectangles as his calculus class suggests.
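Both the drowning phenomenon and the resulting loss of associativity are visible in a couple of lines of Python (IEEE-754 double precision, so the drowning threshold sits near 2^53 rather than the 10^8 of the older machines mentioned above):

```python
# A small addend "drowns" in a large one: ulp(1e16) is 2, so anything
# well below 1 simply disappears when added.
x = 1e16
y = 0.5
print(x + y == x)                                # y drowns in x
# Floating point addition is not associative.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))    # the two sums differ
```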
In many implementations, there exist numbers x ≠ 0 such that x ⊘ 2 = 0. In other words, x is so small that dividing it by 2 can make it identically zero (or crash
[6] When B ≈ 2, we compute either cosh(θ) ≈ 1 or cos(θ) ≈ 1, either of which implies that θ ≈ 0. The loss of accuracy near B/2 ≈ 1 is unavoidable due to the approximation of a double root.

[7] If you want to try these trigonometric solutions, you may need to implement the inverse hyperbolic functions acosh(x), asinh(x) yourself, either because they weren't included in your language, or because they are broken (inverse hyperbolic functions are rarely tested). In such cases, the mathematical definitions are [Steele90]:

    asinh(z) = log(z + √(1 + z²))
    acosh(z) = 2 log(√((z + 1)/2) + √((z - 1)/2)).
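As a sketch, the [Steele90] definitions transcribe directly into Python (function names are mine; the library versions are used only as a cross-check):

```python
import math

# Log-based definitions of the inverse hyperbolic functions [Steele90],
# for environments where asinh/acosh are missing or broken.
def my_asinh(z):
    return math.log(z + math.sqrt(1.0 + z * z))

def my_acosh(z):
    # the two-square-root form avoids forming z*z - 1 directly
    return 2.0 * math.log(math.sqrt((z + 1.0) / 2.0) +
                          math.sqrt((z - 1.0) / 2.0))

print(my_asinh(1.5), math.asinh(1.5))
print(my_acosh(2.5), math.acosh(2.5))
```

(For negative arguments, use the oddness property asinh(-z) = -asinh(z) rather than the log form directly.)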
[8] We follow [Knuth81] in using ⊕, ⊖, ⊗, ⊘ for the floating point operations of +, -, ×, /.
the program). This situation is called numerical "underflow". There also exist numbers x ≠ 0 such that computing x ⊗ 2 causes either the program to crash with a numerical "overflow", or to produce a number that prints as "NaN" (Not-a-Number) or "infinity". However, for binary floating point implementations, division and multiplication by powers of 2 lose no accuracy in the absence of overflow/underflow,[9] so we can write x ⊗ 2^k and x ⊘ 2^k as x·2^k and x/2^k = x·2^(-k), respectively.

Although floating point multiplication and division are not associative, even in the absence of overflow/underflow, they are relatively stable operations in the sense that the floating point result is not too far from the mathematical value (which is not itself usually representable). Square root is even more stable, as it cannot produce overflow/underflow, and fails only for negative arguments.

Probably the most common (and most severe) problem with floating point arithmetic occurs when very precise numbers of the opposite sign and nearly the same value are algebraically summed. In this case, the resulting value may be very far from the correct numerical value, and may be almost totally garbage. Thus, while x ⊖ x = 0, (y ⊕ x) ⊖ x may be very far from y, and may even be identically zero, if y first drowns in x.
Let us consider the two roots x1 > 0, x2 > 0 of the quadratic equation x² - (x1 + x2)x + x1·x2 = x² + Bx + C = 0. If x2 is many orders of magnitude smaller than x1, then B = -(x1 ⊕ x2) = -x1, when evaluated in floating point. Thus, if we look at the operation of the quadratic formula when computed using floating point:

    x2 = (x1 ⊕ x2)/2 ⊖ (x1 ⊖ x2)/2 = x1/2 ⊖ x1/2 = 0,

even when C = x1·x2 ≠ 0!

Thus, when implemented with floating point arithmetic, the naive quadratic formula may get one of the roots correct, but completely flub the other one. The naive formula may still produce very poor results even when both of the answers produced from the floating point arithmetic are non-zero.
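The flub is easy to demonstrate; in the Python sketch below (double precision, so the roots need only differ by a factor of about 2^53), the mean/half-difference form of the formula returns a zero root even though C ≠ 0:

```python
# Two positive roots, the small one far below one ulp of the large one.
x1, x2 = 1e16, 0.25

# Mean/half-difference form of the quadratic formula, in floating point:
mean = (x1 + x2) / 2        # x2 drowns: this is exactly x1/2
halfdiff = abs(x1 - x2) / 2  # likewise exactly x1/2
small_root = mean - halfdiff
print(small_root)   # 0.0, even though the true small root is 0.25
print(x1 * x2)      # C is nonzero, so a zero root is impossible
```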
Exponent Range Analysis
Consider again the quadratic equation (with real roots) x² - Bx - 1 = 0, where B ≥ 0. We note that since |x1| = 1/|x2|, if |x1| = 2^k, then |x2| = 2^(-k), so the two
[9] Except for denormalized numbers, which should have been called "subnormal" numbers.
roots have exponents that are symmetrically distributed about 2^0 = 1. Since floating point exponent range limits are usually more-or-less symmetric about 2^0 = 1, we can usually be assured (for this equation) that if x1 is within exponent range, then so will be x2.

Now B = x1 + x2 ≥ 0, so the root of larger magnitude must be non-negative (regardless of the sign of x2). Call this larger root x1 = r > 0. If the magnitude of B is very large, say B = 2^k, for k ≫ 1, then B will equal x1 = r, because |x2| = 1/r will drown in x1. So, in this case, we get the equation

    x² - (x1 ⊕ x2)x + (x1 ⊗ x2) = x² - (x1)x - 1 = 0.

In short, if the coefficient B ≥ 0 in the quadratic x² - Bx - 1 = 0 with real roots x1, x2 is in exponent range, then x1 and x2 must both also be in exponent range.[10]
So the quadratic equation x² - Bx + C = 0, B ≥ 0, C = ±1, is particularly nice, because its real solutions are always representable. We now solve this equation. If B ≥ 2, then B² - 4C ≥ 2² - 4C = 4(1 - C) ≥ 0, so the roots are always real. The larger magnitude root (which must be positive) can be computed as

    B' = B ⊘ 2        (so B' ≥ 1)
    x1 = B' ⊗ (1 ⊕ √(1 ⊖ (C ⊘ B' ⊘ B')))
    x2 = C ⊘ x1.

The only possible problem occurs in the step where we compute C ⊘ B' ⊘ B'. If B' is very large, say B' = 2^k for k ≫ 1, then |C ⊘ B' ⊘ B'| = 2^(-2k), which can produce exponent underflow. However, in this case, the underflow isn't serious, because when it happens, we merely produce an extremely small (in absolute value) number which drowns when added to 1. This underflow should therefore be ignored, because we will already be getting the best answer possible.
If 0 ≤ B < 2, on the other hand, then we have two cases: C = +1 and C = -1. If C = +1, then B² - 4AC < 2² - 4 = 0, so both roots are complex. If C = -1, then B² - 4AC = B² + 4 > 0, so both roots are real. The larger magnitude root is also positive, so

    B' = B ⊘ 2        (0 ≤ B' < 1)
    x1 = B' ⊕ √(B' ⊗ B' ⊖ C) = B' ⊕ √(B' ⊗ B' ⊕ 1)
    x2 = C ⊘ x1 = -1 ⊘ x1.
[10] The proof requires a certain minimum of precision, which is true of essentially all floating point implementations.
We thus conclude that the quadratic equation x² - Bx ± 1 = 0 is a particularly nice quadratic, because it can be easily solved when its roots are real, and these roots are representable if and only if the equation itself is representable.
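A hedged Python sketch of this special-case solver (double precision; the name and the error handling are mine, and the careful exception discipline described above is omitted):

```python
import math

# Solve the "nice" special case x^2 - B*x + C = 0, with B >= 0 and
# C = +1 or -1.  Division by 2 is exact in binary floating point, and
# the large root is computed first so that the small root comes from
# the relation x1 * x2 = C.
def solve_special(b, c):
    bp = b / 2.0
    if bp >= 1.0:                 # B >= 2: always real roots
        x1 = bp * (1.0 + math.sqrt(1.0 - c / bp / bp))
    elif c == -1.0:               # 0 <= B < 2 and C = -1: real roots
        x1 = bp + math.sqrt(bp * bp + 1.0)
    else:                         # 0 <= B < 2 and C = +1
        raise ValueError("complex roots")
    return x1, c / x1

# x^2 - 2.5x + 1 = 0 has roots 2 and 1/2.
print(solve_special(2.5, 1.0))
```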
Reducing the General Case
Now that we have a robust quadratic-solver for the special case x² - Bx ± 1 = 0, B ≥ 0, we show how to reduce the general quadratic Ax² + Bx + C = 0, A ≠ 0, C ≠ 0, to this case, or die trying; i.e., if the general case cannot be so reduced, then its roots cannot be represented.

But we already know how to reduce the general equation Ax² + Bx + C = 0 into this form. We first multiply through by sgn(A) to produce |A|x² + sgn(A)Bx + sgn(A)C = 0. This step can always be performed without any exceptions, since changing the sign of a floating point number is a trivial operation. We assume that this has already been done in the following.
We next compute √|A| and √|C|, which are both representable, since |A| > 0, |C| > 0, and the absolute values of the exponents of √|A|, √|C| are less than the absolute values of the exponents of |A|, |C|, respectively.

We then form the product √|A| ⊗ √|C|, which is representable because both |A| and |C| are representable, so even in the worst case in which |A| = |C| = M, where M is the maximum representable value, the product will be √M ⊗ √M = M.
The most difficult step in the reduction is the formation of |B| ⊘ (√|A| ⊗ √|C|). This quantity may indeed not be representable. Consider, for example, the equation 2^(-k)x² - 2·2^k·x + 2^(-k) = 0:

    |B| ⊘ (√|A| ⊗ √|C|)
      = 2·2^k ⊘ (2^(-k/2) ⊗ 2^(-k/2))
      = 2·2^k ⊘ 2^(-k)
      = 2·2^(2k),

which will not be representable if k is the largest possible exponent. However, in this case, the roots, which are approximately 2·2^(2k) and 2^(-2k)/2, are not representable, either.

So, we must allow for an exponent overflow in the formation of the quantity |B| ⊘ (√|A| ⊗ √|C|), and such exceptions are not spurious, because they indicate that the roots do fall outside the representable range. However, an underflow here is benign.
Thus, through the use of the substitution x = -y sgn(B)(√|C| ⊘ √|A|), we get the equation

    y² - (|B| ⊘ √|A| ⊘ √|C|)y + sgn(C) = 0.

We solve this equation, as shown above, and then back-substitute to produce the roots of the original equation. In order to do this, we must form the ratio √|C| ⊘ √|A|, which can always be represented, since |A| and |C| are representable. The original roots are then

    x1 = -sgn(B)(√|C| ⊘ √|A|) ⊗ y1
    x2 = -sgn(B)(√|C| ⊘ √|A|) ⊗ y2.

This rescaling of y1, y2 to form x1, x2 may cause exponent overflows and underflows, and these exceptions are not benign.
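Putting the reduction and back-substitution together, here is a Python sketch of the square-root-scaling solver for real roots (double precision; names are mine, and the overflow/underflow bookkeeping described above is omitted):

```python
import math

# Solve A*x^2 + B*x + C = 0 (A, C != 0, real roots) by reducing to
# y^2 - B'*y + sgn(C) = 0 with B' = |B| / (sqrt|A| * sqrt|C|), then
# rescaling the roots by sqrt|C| / sqrt|A|.
def solve_scaled(a, b, c):
    sa, sc = math.sqrt(abs(a)), math.sqrt(abs(c))
    if a < 0:                     # multiply through by sgn(A)
        b, c = -b, -c
    bp = abs(b) / sa / sc
    sgn_c = math.copysign(1.0, c)
    d = bp * bp - 4.0 * sgn_c
    if d < 0:
        raise ValueError("complex roots")
    y1 = (bp + math.sqrt(d)) / 2.0
    y2 = sgn_c / y1               # product of the reduced roots is sgn(C)
    scale = -math.copysign(sc / sa, b)
    return scale * y1, scale * y2

# 3x^2 - 21x + 30 = 0 has roots 2 and 5.
x1, x2 = solve_scaled(3.0, -21.0, 30.0)
print(x1, x2)
```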
Let us solve the equation -2^(-k)x² - 2x - 2^k = 0, for k the largest representable exponent. After multiplying through by -1, we get 2^(-k)x² + 2x + 2^k = 0, and A = 2^(-k), B = 2, C = 2^k. We then form the equation

    0 = y² - (|B| ⊘ √|A| ⊘ √|C|)y + 1
      = y² - (2 ⊘ 2^(-k/2) ⊘ 2^(k/2))y + 1
      = y² - 2y + 1.

The solution of this equation is straightforward, and produces the roots y1 = y2 = 1. We now back-substitute:

    x1 = (-sgn(B)√|C| ⊘ √|A|) ⊗ y1
       = -(2^(k/2) ⊗ 2^(k/2)) ⊗ y1
       = -2^k ⊗ y1
       = -2^k.

If we now substitute x1 back into the original equation (which we can't really do without causing overflow) to see if x1 is really a root, we get

    0 = -2^(-k)x² - 2x - 2^k
      = -2^(-k)(-2^k)² - 2(-2^k) - 2^k
      = -2^(-k)·2^(2k) + 2·2^k - 2^k
      = -2^k + 2·2^k - 2^k
      = 0.

We have thus succeeded in properly solving a quadratic that couldn't even be put into monic form without causing exponent overflow!
Although the use of square roots for scaling can be considered extravagant in terms of performance,[11] these square roots are quite stable and generate no spurious overflows/underflows. If no other means of scaling is available (e.g., in the Postscript language [Adobe90]), then square root scaling is an excellent way to maximize the dynamic range of a quadratic equation solver.
Scale Factors
Although the method presented in the previous sections works, it may not be quite as fast or accurate as one would like, due to the multiple square root operations. In this section, we would like to sketch an analogous scheme in which square roots are merely approximated by means of "scale factors", so as to reduce the exponent ranges to reasonable values. Since scale factors which are powers of 2 preserve all the available precision (assuming that there is no underflow or overflow), we achieve the simultaneous goals of high efficiency, high precision and wide exponent range ("dynamic range").

We will need the following functions:

    exponent(x) = xp(x) = floor(log2|x|) + 1
    mantissa(x) = mn(x) = |x|·2^(-exponent(x)) = |x|·2^(-xp(x))

With this definition, 1/2 ≤ mn(x) < 1.
Some examples are in order:

    xp(1) = floor(log2|1|) + 1 = 0 + 1 = 1
    xp(2) = floor(log2|2|) + 1 = 1 + 1 = 2
    xp(2^k) = floor(log2|2^k|) + 1 = k + 1
    xp(2^k + 2^(k-1)) = k + 1
    mn(3) = 0.75

We thus have for A ≠ 0 the following factorization:

    A = sgn(A)·mn(A)·2^xp(A).

We will also need a slightly more constrained version of xp(x) which we will call xp2(x) and which will be an even integer. Thus:

    xp2(x) = xp(x),     if xp(x) is even
           = xp(x) - 1, if xp(x) is odd.
[11] Unless the architecture has a fast square root implementation, e.g., Digital's Alpha.
We can also define a more constrained version of mn(x):

    mn2(x) = |x|·2^(-xp2(x)).

Note that we have the alternate factorization:

    A = sgn(A)·mn2(A)·2^xp2(A).

Some additional examples are in order:

    xp2(1) = 0
    xp2(2) = 2
    xp2(3) = 2
    xp2(4) = 2
    xp2(7) = 2
    xp2(8) = 4

So, 1/2 ≤ mn2(A) < 2. xp2(x) and mn2(x) accomplish two things: they produce an even exponent, and make the mantissa symmetrical about 1.
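In Python, these functions fall out of math.frexp, which already decomposes x as m·2^e with 1/2 ≤ |m| < 1 (a sketch; the names mirror the text):

```python
import math

# xp/mn and their even-exponent variants xp2/mn2, built on math.frexp.
def xp(x):
    return math.frexp(x)[1]

def mn(x):
    return abs(math.frexp(x)[0])

def xp2(x):
    e = xp(x)
    return e if e % 2 == 0 else e - 1

def mn2(x):
    return abs(x) * 2.0 ** (-xp2(x))

print([xp2(v) for v in (1, 2, 3, 4, 7, 8)])  # [0, 2, 2, 2, 2, 4]
print(mn(3), mn2(3))
```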
Using these concepts, we can now tackle the general quadratic equation Ax² + Bx + C = 0, for A ≠ 0, C ≠ 0. Since A = sgn(A)·mn2(A)·2^xp2(A), and since 1/2 ≤ mn2(A) < 2, we can divide the equation through by sgn(A)·mn2(A) without much risk of underflow or overflow. Let us assume that this has already been done, so our new quadratic coefficient is 2^(2k), for some integer k.

We now have an equation that looks like 2^(2k)x² + Bx + C = 0, for some new B and C. We can rewrite this equation as

    0 = 2^(2k)x² + Bx + sgn(C)·mn2(C)·2^xp2(C)
      = 2^(2k)x² + Bx + sgn(C)·mn2(C)·2^(2λ),

for some integer λ = xp2(C)/2.
Substituting x = -y sgn(B)·2^(λ-k):

    0 = 2^(2k)(-y sgn(B)·2^(λ-k))² + B(-y sgn(B)·2^(λ-k)) + sgn(C)·mn2(C)·2^(2λ)
      = 2^(2λ)y² - (|B|·2^(λ-k))y + sgn(C)·mn2(C)·2^(2λ).

Dividing through by 2^(2λ), we get

    0 = y² - (|B|·2^(-λ-k))y + sgn(C)·mn2(C)
      = y² - 2B'y + C',

for B' = |B|·2^(-λ-k-1) and C' = C·2^(-2λ). Furthermore, we know that B' ≥ 0 and 1/2 ≤ |C'| < 2.
If B' ≥ √2 (≥ √|C'|), then we can compute y as follows (ignoring any exponent underflow on C' ⊘ B' ⊘ B'):

    y1 = B' ⊗ (1 ⊕ √(1 ⊖ (C' ⊘ B' ⊘ B')))
    y2 = C' ⊘ y1.

If 0 ≤ B' < √2, then we have two cases: D = B'² - C' < 0, in which case the roots are imaginary, and D = B'² - C' ≥ 0, in which case the roots are real. In the second case, we compute y as follows:

    y1 = B' ⊕ √D
    y2 = C' ⊘ y1.
Finally, we can rescale the y's back into x's, watching out for overflows/underflows:

    x1 = -sgn(B)·2^(λ-k) ⊗ y1
    x2 = -sgn(B)·2^(λ-k) ⊗ y2.

We have thus shown how to get essentially the same dynamic range by careful scaling of the coefficients as we got before by taking square roots of the coefficients.
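The whole scale-factor scheme can be sketched in a few lines of Python (double precision, real roots only; the names are mine, and the exception handling discussed above is omitted). All the scaling is by powers of 2, so no precision is lost:

```python
import math

# Even-exponent decomposition helper (see the xp2/mn2 definitions above).
def xp2(x):
    e = math.frexp(x)[1]
    return e if e % 2 == 0 else e - 1

# Solve A*x^2 + B*x + C = 0 (A, C != 0, real roots) using power-of-2
# scale factors instead of square roots.
def solve_scale_factor(a, b, c):
    # divide through by sgn(A)*mn2(A): leading coefficient becomes 2^(2k)
    m = math.copysign(abs(a) * 2.0 ** (-xp2(a)), a)
    b, c = b / m, c / m
    k = xp2(a) // 2
    lam = xp2(c) // 2
    bp = abs(b) * 2.0 ** (-lam - k - 1)   # B' in y^2 - 2B'y + C' = 0
    cp = c * 2.0 ** (-2 * lam)            # C', with 1/2 <= |C'| < 2
    d = bp * bp - cp
    if d < 0:
        raise ValueError("complex roots")
    y1 = bp + math.sqrt(d)
    y2 = cp / y1                          # product of reduced roots is C'
    scale = -math.copysign(2.0 ** (lam - k), b)
    return scale * y1, scale * y2

# 3x^2 - 21x + 30 = 0 has roots 5 and 2, recovered exactly here.
print(solve_scale_factor(3.0, -21.0, 30.0))
```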
Conclusions
If a programming language is to be used for portable, high quality software, it is imperative that the language provide the programmer access to the sign,[12] exponent, and mantissa of a floating point number. Furthermore, this access must be blindingly fast: of the same order of magnitude as an absolute value, a negation, or a multiplication. If high-speed scaling is not easy to achieve in a language, then the quality and/or cost of the software will suffer. The quality will suffer in that many programmers will find it too difficult to get high speed without a mass of special cases, and the cost will suffer if this mass of special cases must be programmed and maintained.

We also learned the lesson that numerical software is often improved by not using a classical mathematical formula itself, but by following instead the derivation of this formula. This approach provides a number of advantages: it tends to canonicalize the problem by gradually reducing the number of parameters (and thus the
[12] Most hardware architects have heretofore refused to provide for a primitive sgn(x) operation, because they claim that it is easily emulated by means of a conditional branch operation. But conditional branch operations can be relatively expensive on pipelined architectures, and the penalty for conditional branching is expected to increase. Sounds like Catch-22 to me.
number of cases to consider!), the canonicalization itself removes common subexpressions which result in redundant calculations, and, assuming that the reductions preserve information/accuracy, the performance of operations likely to result in massive cancellation can be deferred until the problem is simple enough to easily understand the effect of this cancellation.

Although reducing the work involved in a calculation is important, it is not nearly as important as getting the right answer, or even getting an answer![13] In particular, computer arithmetic is very different from mathematical arithmetic: it has limited range and limited precision, and therefore violates some of the algebraic properties of mathematical numbers, most prominently associativity. The trend of programming languages to try to cover up these issues instead of facing up to them directly is quite distressing, and is likely to continue the trend of poor quality software and poor software productivity.
In a future column, we will consider other ways to
solve a quadratic equation, including various iterative
methods.
References
[Ada83] Reference Manual for the Ada (R) Pro-
gramming Language. ANSI/MIL-STD-1815A-1983,
1983.
[Adobe90] Adobe Systems, Inc. Postscript Language
Reference Manual, 2nd Ed. Addison-Wesley, Read-
ing, MA, 1990. ISBN 0-201-18127-4.
[FORT77] ANSI. American National Standard Pro-
gramming Language FORTRAN. ANSI X3.9-1978,
1978.
[Casio86] Casio, Inc. Computing With The Scientific Calculator. Casio, Inc., Japan, SA012200115B, 1986. Manual for Casio fx-115D calculator.
[Goldberg91] Goldberg, David. What Every Computer
Scientist Should Know About Floating-Point Arith-
metic. ACM Computing Surveys, 23, 1 (March
1991), 5-48.
[Kahan96] Kahan, W. Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic. http://http.cs.berkeley.edu/~wkahan/ieee754status/ieee754.ps

[13] Limited numerical precision effects were shown to cause substantial problems in the operation of the Patriot anti-missile system during the Gulf War.
[Knuth81] Knuth, D.E. Seminumerical Algorithms, 2nd
Ed. Addison-Wesley, Reading, MA, 1981.
[Menzel60] Menzel, D.H., ed. Fundamental Formulas of
Physics, Vol. I. Dover Publs., New York, 1960, ISBN
0-486-60595-7.
[Moto87] Motorola, Inc. MC68881/MC68882 Floating-
Point Coprocessor Users Manual. Prentice-Hall, En-
glewood Cliffs, NJ, 1987.
[Press86] Press, W.H., et al. Numerical Recipes. Cam-
bridge Univ. Press, 1986, ISBN 0-521-30811-9. Sect.
5.5 Quadratic and Cubic Equations.
[Steele90] Steele, G.L., Jr. Common Lisp: The Lan-
guage, 2nd. Ed. Digital Press, 1990. ISBN 1-55558-
041-6.
[Young72] Young, D.M., and Gregory, R.T. A Survey of
Numerical Mathematics, Vol. I. Dover Publ., New
York, 1972. Sect. 3.4 The Quadratic Equation.
Henry Baker has been diddling bits for 35 years, with time off for good behavior at MIT and Symbolics. In his spare time, he collects garbage and tilts at windbags. This column appeared in ACM Sigplan Notices 33, 1 (Jan 1998), 30-38.