Numerical Analysis

by
Dr. Anita Pal
Assistant Professor
Department of Mathematics
National Institute of Technology Durgapur
Durgapur-713209
email: [email protected]

Chapter 1

Numerical Errors

Module No. 2

Propagation of Errors and Computer Arithmetic


......................................................................................

This module continues Module 1. It discusses in detail how errors propagate during
arithmetic operations, and explains how numbers are represented in a computer and how
arithmetic is carried out on them.

2.1 Propagation of errors in arithmetic operations

In numerical computation, it is always assumed that every number carries an error,
which may be small or large. The errors present in the numbers are propagated during
arithmetic operations, but the rate of propagation depends on the type of operation.
Each case is discussed in the subsections that follow.

2.1.1 Errors in sum and difference

Let X1 , X2 , . . . , Xn be exact numbers and let x1 , x2 , . . . , xn be their corresponding
approximate values. Assume that ∆x1 , ∆x2 , . . . , ∆xn are the absolute errors in
x1 , x2 , . . . , xn , so that Xi = xi ± ∆xi , i = 1, 2, . . . , n.
Let X = X1 + X2 + · · · + Xn and x = x1 + x2 + · · · + xn .
The total absolute error is

|X − x| = |(X1 − x1 ) + (X2 − x2 ) + · · · + (Xn − xn )|


≤ |X1 − x1 | + |X2 − x2 | + · · · + |Xn − xn |.

This shows that the total absolute error in the sum is bounded by

∆x = ∆x1 + ∆x2 + · · · + ∆xn . (2.1)

Thus the maximum absolute error in a sum of approximate numbers is equal to the sum of the
absolute errors of the individual numbers.
The following points should be kept in mind when adding numbers:
(i) identify the number (or numbers) of least accuracy,

(ii) round off the remaining numbers, retaining one digit more than in the
identified number,

(iii) perform the addition with all retained digits,

(iv) round off the result by discarding the last digit.



Subtraction
The case for subtraction is similar to addition. Let x1 and x2 be two approximate values
of the exact numbers X1 and X2 respectively and X = X1 − X2 , x = x1 − x2 .
Therefore, one can write X1 = x1 ± ∆x1 and X2 = x2 ± ∆x2 .
Now, |X − x| = |(X1 − x1 ) − (X2 − x2 )| ≤ |X1 − x1 | + |X2 − x2 |. Hence,

∆x = ∆x1 + ∆x2 . (2.2)

It may be noted that the absolute error in difference of two numbers is equal to the
sum of individual absolute errors.

2.1.2 The error in product

Let us consider two exact numbers X1 and X2 with their approximate values x1 and
x2 . Let, X1 = x1 ± ∆x1 and X2 = x2 ± ∆x2 , where ∆x1 and ∆x2 are the absolute
errors in x1 and x2 .
Now, X1 X2 = x1 x2 ± x1 ∆x2 ± x2 ∆x1 ± ∆x1 · ∆x2 .
Therefore, |X1 X2 − x1 x2 | ≤ |x1 ∆x2 | + |x2 ∆x1 | + |∆x1 · ∆x2 |. Both |∆x1 | and
|∆x2 | are errors and are small, so their product is smaller still. We therefore
discard it and divide both sides by |x| = |x1 x2 | to obtain the relative error.
Hence, the relative error is

$$\left|\frac{X_1 X_2 - x_1 x_2}{x_1 x_2}\right| = \left|\frac{\Delta x_2}{x_2}\right| + \left|\frac{\Delta x_1}{x_1}\right|. \qquad (2.3)$$

From this expression we conclude that the relative error in product of two numbers
is equal to the sum of the individual relative errors.
This result can be extended for n numbers as follows: Let X = X1 X2 · · · Xn and
x = x1 x2 · · · xn . Then

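$$\left|\frac{X - x}{x}\right| = \left|\frac{\Delta x_1}{x_1}\right| + \left|\frac{\Delta x_2}{x_2}\right| + \cdots + \left|\frac{\Delta x_n}{x_n}\right|. \qquad (2.4)$$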

That is, the total relative error in product of n numbers is equal to the sum of
individual relative errors.
In particular, let all approximate values x1 , x2 , . . . , xn be positive and x = x1 x2 · · · xn .
Then log x = log x1 + log x2 + · · · + log xn .

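In this case, taking differentials of both sides gives

$$\frac{\Delta x}{x} = \frac{\Delta x_1}{x_1} + \frac{\Delta x_2}{x_2} + \cdots + \frac{\Delta x_n}{x_n}.$$

Hence,

$$\left|\frac{\Delta x}{x}\right| = \left|\frac{\Delta x_1}{x_1}\right| + \left|\frac{\Delta x_2}{x_2}\right| + \cdots + \left|\frac{\Delta x_n}{x_n}\right|.$$
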
Let us consider another particular case. Suppose x = kx1 , where k is a non-zero real
number. Now,

$$\delta x = \left|\frac{\Delta x}{x}\right| = \left|\frac{k\,\Delta x_1}{k\,x_1}\right| = \left|\frac{\Delta x_1}{x_1}\right| = \delta x_1.$$

Also, |∆x| = |k ∆x1 | = |k| |∆x1 |.
Observe that the relative errors in x and x1 are the same, while the absolute error in
x is |k| times the absolute error in x1 .

2.1.3 The error in quotient

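Let X1 and X2 be two exact numbers and let x1 and x2 be their approximate values.
Again, let X = X1 /X2 and x = x1 /x2 .
If ∆x1 and ∆x2 are the absolute errors, then X1 = x1 + ∆x1 , X2 = x2 + ∆x2 .
Suppose both x1 and x2 are non-zero.
Now,

$$X - x = \frac{x_1 + \Delta x_1}{x_2 + \Delta x_2} - \frac{x_1}{x_2} = \frac{x_2\,\Delta x_1 - x_1\,\Delta x_2}{x_2 (x_2 + \Delta x_2)}.$$

Dividing both sides by x and taking absolute values:

$$\left|\frac{X - x}{x}\right| = \left|\frac{x_2\,\Delta x_1 - x_1\,\Delta x_2}{x_1 (x_2 + \Delta x_2)}\right| = \left|\frac{x_2}{x_2 + \Delta x_2}\right| \, \left|\frac{\Delta x_1}{x_1} - \frac{\Delta x_2}{x_2}\right|.$$

Since the error ∆x2 is small compared to x2 ,

$$\left|\frac{x_2}{x_2 + \Delta x_2}\right| \simeq 1.$$

Thus,

$$\delta x = \left|\frac{\Delta x}{x}\right| = \left|\frac{X - x}{x}\right| = \left|\frac{\Delta x_1}{x_1} - \frac{\Delta x_2}{x_2}\right| \le \left|\frac{\Delta x_1}{x_1}\right| + \left|\frac{\Delta x_2}{x_2}\right|, \qquad (2.5)$$

i.e., δx = δx1 + δx2 .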


This expression shows that the total relative error in quotient is equal to the sum of
their individual relative errors.

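The relative error δx of (2.5) can also be expressed as

$$\left|\frac{\Delta x}{x}\right| = \left|\frac{\Delta x_1}{x_1} - \frac{\Delta x_2}{x_2}\right| \ge \left|\frac{\Delta x_1}{x_1}\right| - \left|\frac{\Delta x_2}{x_2}\right|. \qquad (2.6)$$
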

It may be observed that the relative error in quotient is greater than or equal to the
difference of their individual relative errors.
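For positive numbers, the same result can be obtained from the logarithm. Let
x1 and x2 be the approximate numbers and x = x1 /x2 .
Now, log x = log x1 − log x2 . Thus,

$$\frac{\Delta x}{x} = \frac{\Delta x_1}{x_1} - \frac{\Delta x_2}{x_2}, \quad \text{i.e.,} \quad \left|\frac{\Delta x}{x}\right| \le \left|\frac{\Delta x_1}{x_1}\right| + \left|\frac{\Delta x_2}{x_2}\right|.$$
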
Example 2.1 Find the sum of the approximate numbers 120.237, 0.8761, 78.23, 0.001234,
234.3, 128.34, 35.4, 0.0672, 0.723, 0.08734, in each of which all the written digits
are valid. Find the absolute error in the sum.
Solution. The least exact numbers are 234.3 and 35.4. The maximum error of each of
them is 0.05. We round off all the other numbers to two decimal places (one digit more
than in the least exact numbers).
Their sum is 120.24 + 0.88 + 78.23 + 0.00 + 234.3 + 128.34 + 35.4 + 0.07 + 0.72 + 0.09 =
598.27.
Rounding off the sum to one decimal place gives 598.3.
There are two types of errors in the sum. The first is the initial error: the sum of
the errors of the least exact numbers and the rounding errors of the other
numbers, which is equal to 0.05 × 2 + 0.0005 × 8 = 0.104 ≈ 0.10.
The second is the error in rounding off the sum, which is 598.3 − 598.27 = 0.03.
Thus, the total absolute error in the sum is 0.10 + 0.03 = 0.13.
Finally, the sum can be expressed as 598.3 ± 0.13.
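
The rounding and summation steps of Example 2.1 can be checked mechanically. The following is only an illustrative sketch: it assumes Python's decimal module with half-up rounding so that the decimal digits are handled exactly, and the variable names are not taken from the text.

```python
from decimal import Decimal, ROUND_HALF_UP

# The ten approximate numbers of Example 2.1, kept as strings so that
# no binary floating point error enters the check.
numbers = ["120.237", "0.8761", "78.23", "0.001234", "234.3",
           "128.34", "35.4", "0.0672", "0.723", "0.08734"]

# Round every number to two decimal places, i.e. one digit more than
# in the least exact numbers 234.3 and 35.4.
rounded = [Decimal(s).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
           for s in numbers]

total = sum(rounded)                                              # 598.27
result = total.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)   # 598.3
rounding_error = abs(result - total)                              # 0.03

print(total, result, rounding_error)
```

The sketch reproduces only the arithmetic of the example (the sum 598.27, the rounded result 598.3 and the final rounding error 0.03); the estimate of the initial error is left as in the text.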
Example 2.2 Let x1 = 43.5 and x2 = 76.9 be two approximate numbers and let 0.02
and 0.008 be the corresponding absolute errors. Find the difference between
these numbers and evaluate the absolute and relative errors.
Solution. Here, x = x1 − x2 = −33.4 and the total absolute error is ∆x = 0.02 + 0.008 =
0.028.
Hence, the difference is −33.4 and the absolute error is 0.028.
The relative error is 0.028/| − 33.4| ≈ 0.00084 = 0.084%.

Example 2.3 Let x1 = 12.4 and x2 = 45.356 be two approximate numbers and let all
digits of both numbers be valid. Find the product and the relative and absolute
errors.
Solution. The first and second approximate numbers have one and three valid decimal
places respectively. So we round off the second number to one decimal place.
After rounding off, the numbers become x1 = 12.4 and x2 = 45.4.
Now, the product is x = x1 x2 = 12.4 × 45.4 = 562.96 ≈ 563.
The result is rounded to three significant figures, because the least number of valid
significant digits of the given numbers is 3.
The relative error in the product is

$$\delta x = \left|\frac{\Delta x}{x}\right| = \left|\frac{\Delta x_1}{x_1}\right| + \left|\frac{\Delta x_2}{x_2}\right| = \frac{0.05}{12.4} + \frac{0.0005}{45.356} = 0.004043 \simeq 0.40\%.$$

The absolute error is 563 × 0.004043 ≈ 2.276 ≈ 2.3.

Example 2.4 Let x1 = 7.235 and x2 = 8.72 be two approximate numbers, where all
the digits of the numbers are valid. Find the quotient and also the relative and absolute
errors.
Solution. Here, x1 = 7.235 and x2 = 8.72 have four and three valid significant digits
respectively. Now,

$$\frac{x_1}{x_2} = \frac{7.235}{8.72} \simeq 0.830.$$

We keep three significant digits, since the least exact number contains three valid
significant digits.
The absolute errors in x1 and x2 are respectively ∆x1 = 0.0005 and ∆x2 = 0.005.
The relative error in the quotient is

$$\left|\frac{\Delta x_1}{x_1}\right| + \left|\frac{\Delta x_2}{x_2}\right| = \frac{0.0005}{7.235} + \frac{0.005}{8.72} = 0.000069 + 0.000573 \simeq 0.001 = 0.1\%.$$

The absolute error is

$$\left|\frac{x_1}{x_2}\right| \times 0.001 = 0.830 \times 0.001 = 0.00083 \simeq 0.001.$$

2.1.4 The errors in power and in root

Let x1 be an approximate value of an exact number X1 and let its relative error be δx1 .
We now determine the relative error of $x = x_1^k$, where k is a positive integer.
Then

$$x = x_1^k = \underbrace{x_1 \cdot x_1 \cdots x_1}_{k \text{ times}}.$$

According to formula (2.4), the relative error δx is given by

$$\delta x = \underbrace{\delta x_1 + \delta x_1 + \cdots + \delta x_1}_{k \text{ times}} = k\,\delta x_1. \qquad (2.7)$$

Thus, the relative error of the approximate number x is k times the relative error of
x1 .
Now consider the kth root of a positive approximate value x1 , i.e. the
number $x = \sqrt[k]{x_1}$.
Since x1 > 0,

$$\log x = \frac{1}{k} \log x_1.$$

Therefore,

$$\frac{\Delta x}{x} = \frac{1}{k}\,\frac{\Delta x_1}{x_1} \quad \text{or} \quad \left|\frac{\Delta x}{x}\right| = \frac{1}{k} \left|\frac{\Delta x_1}{x_1}\right|.$$

Thus, the relative error in $\sqrt[k]{x_1}$ is

$$\delta x = \frac{1}{k}\,\delta x_1.$$

Example 2.5 Let a = 5.27, b = 28.61, c = 15.8 be the approximate values of some
numbers and let the absolute errors in a, b, c be 0.01, 0.04 and 0.02 respectively.
Calculate the value of

$$E = \frac{a^2 \sqrt[3]{b}}{c^3}$$

and the error in the result.
Solution. It is given that the absolute errors are ∆a = 0.01, ∆b = 0.04 and ∆c = 0.02. One
extra significant figure is retained in the intermediate calculations. The approximate values
of the terms $a^2$, $\sqrt[3]{b}$, $c^3$ are 27.77, 3.0585, 3944.0 respectively.
The approximate value of the expression is

$$E = \frac{27.77 \times 3.0585}{3944.0} = 0.0215.$$

Three significant digits are kept in the result, since the least number of significant
digits in the given numbers is three.
The relative error is given by

$$\delta E = 2\,\delta a + \frac{1}{3}\,\delta b + 3\,\delta c = 2 \times \frac{0.01}{5.27} + \frac{1}{3} \times \frac{0.04}{28.61} + 3 \times \frac{0.02}{15.8} \simeq 0.0038 + 0.00047 + 0.0038 \simeq 0.008 = 0.8\%.$$

The absolute error ∆E in E is 0.0215 × 0.008 ≈ 0.0002.

Hence, E = 0.0215 ± 0.0002 and the relative error is 0.008.
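
The calculation of Example 2.5 can be reproduced with the power rule (2.7) and the root rule. This is a minimal sketch, assuming ordinary Python floats; the variable names are chosen here for illustration.

```python
# Approximate values and their absolute errors from Example 2.5.
a, b, c = 5.27, 28.61, 15.8
da, db, dc = 0.01, 0.04, 0.02

# E = a^2 * cube_root(b) / c^3
E = a**2 * b**(1 / 3) / c**3

# Relative errors add for products and quotients; a power k multiplies
# the relative error by k and a k-th root divides it by k.
rel_E = 2 * (da / a) + (1 / 3) * (db / b) + 3 * (dc / c)
abs_E = abs(E) * rel_E

print(E, rel_E, abs_E)   # roughly 0.0215, 0.0081 and 0.00017
```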
In the above example, E is an expression in three variables a, b, c, and the error
present in E has been estimated. The general rule for calculating the error in a function of
several variables is derived below:

Error in function of several variables

Let y = f (x1 , x2 , . . . , xn ) be a differentiable function of n variables x1 , x2 , . . . , xn .
Also, let ∆xi be the error in xi , for i = 1, 2, . . . , n.
Now, the absolute error ∆y in y is given by

$$\begin{aligned}
y + \Delta y &= f(x_1 + \Delta x_1, x_2 + \Delta x_2, \ldots, x_n + \Delta x_n) \\
&= f(x_1, x_2, \ldots, x_n) + \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}\,\Delta x_i + \cdots \quad \text{(by Taylor's series expansion)} \\
&= y + \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}\,\Delta x_i \quad \text{(neglecting second and higher powers of } \Delta x_i\text{)},
\end{aligned}$$

i.e.,

$$\Delta y = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}\,\Delta x_i.$$

This is the formula for the total absolute error in computing a function of
several variables.
The relative error can be calculated as

$$\frac{\Delta y}{y} = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}\,\frac{\Delta x_i}{y}.$$
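
The general formula can be evaluated numerically when the partial derivatives are inconvenient to write down. The sketch below is one possible illustration, not a method from the text: it approximates each partial derivative by a central difference (an assumption made here) and sums the absolute values of the terms to obtain a worst-case bound. The name propagated_abs_error and the step size are hypothetical choices.

```python
def propagated_abs_error(f, xs, dxs, rel_step=1e-6):
    """Worst-case absolute error of f(xs): the sum of |df/dx_i| * |dx_i|.

    Each partial derivative is approximated by a central difference.
    """
    total = 0.0
    for i, (x, dx) in enumerate(zip(xs, dxs)):
        h = rel_step * max(abs(x), 1.0)
        up = list(xs)
        up[i] = x + h
        down = list(xs)
        down[i] = x - h
        partial = (f(*up) - f(*down)) / (2 * h)
        total += abs(partial) * abs(dx)
    return total

# The expression of Example 2.5: E = a^2 * cube_root(b) / c^3.
def E(a, b, c):
    return a**2 * b**(1 / 3) / c**3

xs, dxs = [5.27, 28.61, 15.8], [0.01, 0.04, 0.02]
abs_err = propagated_abs_error(E, xs, dxs)
rel_err = abs_err / abs(E(*xs))
print(abs_err, rel_err)   # the relative error is roughly 0.008
```

For this expression the bound agrees with the value obtained from the power and root rules, as expected, since those rules are special cases of the general formula.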

2.2 Significant error

It may be remembered that some significant digits are lost during arithmetic
calculation, due to the finite representation of numbers in computing instruments. This
error is called significant error.
In the following two cases there is a high chance of losing more significant digits,
and care should be taken in these situations:
(i) when two nearly equal numbers are subtracted, and

(ii) when division is made by a divisor that is very small compared to the dividend.

It should be remembered that significant error is more serious than round-off
error. These cases are illustrated in the following examples:
Example 2.6 Find the difference √10.23 − √10.21 and calculate the relative error in
the result.
Solution. Let X1 = √10.23 and X2 = √10.21 and let their approximate values be
x1 = 3.198 and x2 = 3.195. Let X = X1 − X2 .
Then the absolute errors are ∆x1 = 0.0005 and ∆x2 = 0.0005 and the approximate
difference is x = 3.198 − 3.195 = 0.003.
Thus, the total absolute error in the subtraction is ∆x = 0.0005 + 0.0005 = 0.001
and the relative error is δx = 0.001/0.003 = 0.3333.
However, by changing the calculation scheme one can obtain a more accurate result. For
example,

$$X = \sqrt{10.23} - \sqrt{10.21} = \frac{10.23 - 10.21}{\sqrt{10.23} + \sqrt{10.21}} = \frac{0.02}{3.198 + 3.195} \simeq 0.003128 = x \ \text{(say)}.$$

The relative error is

$$\delta x = \frac{\Delta x_1 + \Delta x_2}{x_1 + x_2} = \frac{0.001}{3.198 + 3.195} \simeq 0.0002 = 0.02\%.$$

Observe that the relative error is much less than in the previous case.
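
The effect described in Example 2.6 can be observed directly by comparing both calculation schemes with a more accurate reference value. This is a small sketch under the assumption that the double-precision result of math.sqrt may serve as the reference; the variable names are illustrative.

```python
import math

# Reference value computed in double precision.
reference = math.sqrt(10.23) - math.sqrt(10.21)

# Four-significant-digit approximations of the two square roots.
x1, x2 = 3.198, 3.195

direct = x1 - x2                            # subtracts nearly equal numbers
rationalized = (10.23 - 10.21) / (x1 + x2)  # avoids the cancellation

rel_direct = abs(direct - reference) / abs(reference)
rel_rationalized = abs(rationalized - reference) / abs(reference)

print(rel_direct)         # a few percent
print(rel_rationalized)   # of the order of 0.01 percent
```

The actual errors are smaller than the bounds 0.3333 and 0.0002 derived in the example, as they should be, but the ratio between the two schemes is of the same kind.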

Example 2.7 Find the roots of the equation x² − 1500x + 0.5 = 0.

Solution. To illustrate the difficulty of the problem, let us assume that the
computing machine uses four significant digits for all arithmetic calculations. The roots of
this equation are

$$\frac{1500 \pm \sqrt{1500^2 - 2}}{2}.$$

Now, 1500² − 2 = 0.2250 × 10^7 − 0.0000 × 10^7 = 0.2250 × 10^7.
Thus √(1500² − 2) = 0.1500 × 10^4.
Hence, the roots are

$$\frac{0.1500 \times 10^4 \pm 0.1500 \times 10^4}{2} = 0.1500 \times 10^4, \ 0.0000 \times 10^4.$$

That is, the smaller root is computed as zero (to four significant digits); this occurs due
to the finite representation of the numbers.
But 0 is not a root of the given equation.
To get a more accurate result, we rearrange the arithmetic calculation.
The smaller root of the equation is now calculated as follows:

$$\frac{1500 - \sqrt{1500^2 - 2}}{2} = \frac{\left(1500 - \sqrt{1500^2 - 2}\right)\left(1500 + \sqrt{1500^2 - 2}\right)}{2\left(1500 + \sqrt{1500^2 - 2}\right)} = \frac{2}{2\left(1500 + \sqrt{1500^2 - 2}\right)} \simeq 0.0003333.$$

Hence, the smaller root of the equation is 0.0003333, which is much closer to the exact
root. The other root is 0.1500 × 10^4.
This situation may arise when |4ac| ≪ b².

So care should be taken when two nearly equal numbers are
subtracted. This is done by keeping a sufficient number of reserve valid digits.
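
Example 2.7 can be imitated with Python's decimal module by restricting the working precision to four significant digits. This is only a sketch under stated assumptions: the decimal context rounds each exact result to four digits (round-half-even), which is not identical to the digit-alignment scheme of the text but gives the same values for this equation. The function name is hypothetical and the formulae are written for b < 0.

```python
from decimal import Decimal, localcontext

def quadratic_roots_4_digits(a, b, c):
    """Roots of a*x**2 + b*x + c = 0 (with b < 0) in 4-digit decimal arithmetic.

    Returns (larger root, smaller root by the direct formula,
    smaller root by the rearranged formula).
    """
    with localcontext() as ctx:
        ctx.prec = 4                      # four significant digits per operation
        a, b, c = Decimal(a), Decimal(b), Decimal(c)
        root_d = (b * b - 4 * a * c).sqrt()
        larger = (-b + root_d) / (2 * a)
        naive_smaller = (-b - root_d) / (2 * a)    # nearly equal numbers subtracted
        stable_smaller = (2 * c) / (-b + root_d)   # rationalized form
    return larger, naive_smaller, stable_smaller

larger, naive_smaller, stable_smaller = quadratic_roots_4_digits("1", "-1500", "0.5")
print(larger, naive_smaller, stable_smaller)
```

The direct formula returns zero for the smaller root, while the rearranged formula returns 0.0003333, in agreement with the example.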

2.3 Representation of numbers in computer

It was mentioned earlier that numerical methods are used to solve problems on a
computer. But a computer has only limited space to store a number, whether it is an integer
or a real (floating point) number. Generally, two bytes of memory are used to
store an integer and four bytes are used to store a floating point number. Due to
this limitation of space, the rules for arithmetic operations used in mathematics do not
always hold in computer arithmetic.
The representation of a floating point number in a computer is different from our
conventional technique. The computer representation is designed to preserve the
maximum number of significant digits and to increase the range of values of the real
numbers. This representation is known as the normalized floating point mode. In this
representation, the whole number is converted to a proper fraction in such a way that
the first digit after the decimal point is non-zero, and the value is adjusted by multiplying
by a suitable power of 10. For example, the number 3876.23 is represented in the normalized
form as .387623 × 10^4 , and in computer representation it is written as .387623E4 (E4
is used to denote 10^4 ). In the normalized floating point representation,
a number has two parts: the mantissa and the exponent. In this example, .387623 is the
mantissa and 4 is the exponent. In this representation the magnitude of the mantissa is
always greater than or equal to .1 and the exponent is an integer.
To explain computer arithmetic, it is assumed in this section that the computer
uses only four digits to store the mantissa and two digits for the exponent. The mantissa and
the exponent have their own signs. Under this assumption, the range of floating point
numbers (magnitudes) is .9999 × 10^99 to .1000 × 10^−99 .
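
The conversion to normalized form can be sketched in a few lines. The helper below is hypothetical and only illustrative: it assumes the number is given as a finite decimal string, uses Python's decimal module to read off its digits, and truncates (rather than rounds) the mantissa to the requested number of digits.

```python
from decimal import Decimal

def normalized(text, digits=4):
    """Write a decimal string in normalized floating point form, e.g. '.3876E4'.

    The mantissa is truncated to `digits` digits and padded with zeros.
    """
    sign, ds, exp = Decimal(text).as_tuple()
    if not any(ds):                       # the number is zero
        return "." + "0" * digits + "E0"
    exponent = len(ds) + exp              # position of the decimal point
    mantissa = "".join(map(str, ds))[:digits].ljust(digits, "0")
    return ("-" if sign else "") + "." + mantissa + "E" + str(exponent)

print(normalized("3876.23", digits=6))   # .387623E4
print(normalized("3876.23"))             # .3876E4
print(normalized("0.00123"))             # .1230E-2
```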

2.4 Arithmetic of normalized floating point numbers

In this section, the four basic arithmetic operations on normalized floating point
numbers are discussed.

2.4.1 Addition

The addition of two normalized floating point numbers is done by using the following
rules:
(i) If the two numbers have the same exponent, then the mantissas are added directly and
the exponent of the sum is the common exponent.

(ii) If the exponents are different, then the number with the lower exponent is shifted to the
higher exponent by adjusting its mantissa, and then the above rule is used to add them.

All the possible cases are discussed in the following examples.

Example 2.8 Add the following normalized floating point numbers.


(i) .2678E15 and .4876E15 (same exponent)
(ii) .7487E10 and .6712E10 (same exponent)
(iii) .3451E3 and .3218E8 (different exponents)
(iv) .3876E25 and .8541E27 (different exponents)
(v) .8231E99 and .6541E99 (overflow condition)

Solution. (i) Here the exponents are the same. So, using the first rule, one can add the
numbers by adding the mantissas. Therefore, the sum is .7554E15.

(ii) In this case also the exponents are equal, and as in the previous case the mantissas
are added, giving 1.4199E10. Notice that the sum contains five significant figures, but it is
assumed that the computer can store only four significant figures. So the number is shifted
right by one place before it is stored in the computer memory. To convert it to four
significant figures, the exponent is increased by 1 and the last digit is truncated. Hence,
the final sum is .1419E11.

(iii) For this problem, the exponents are different and the difference is 8 − 3 = 5. The
mantissa of the smaller number (lower exponent) is shifted 5 places to the right and the
number becomes .0000E8. Now the numbers have the same exponent and hence the final result
is .0000E8 + .3218E8 = .3218E8.

(iv) In this case the exponents are also different and the difference is 27 − 25 = 2.
So the mantissa of the smaller number (here the first number) is shifted right by 2 places
and it becomes .0038E27. Now the sum is .0038E27 + .8541E27 = .8579E27.

(v) This case is different. The exponents are the same and the sum is 1.4772E99. Here,
the mantissa has five significant digits, so it is shifted right and the exponent is increased
by 1. The exponent then becomes 100. Since, as per our assumption, the maximum value
of the exponent is 99, the number is larger than the largest floating point number of
the assumed computer. This number cannot be stored in the computer, and this situation is
called an overflow condition. In this case, the computer will generate an error message.


2.4.2 Subtraction

Subtraction is a special type of addition: a positive number is added to a negative
number. The different cases of subtraction are illustrated in the
following examples.
Example 2.9 Subtract the normalized floating point numbers indicated below:
(i) .2832E10 from .8432E10
(ii) .2693E15 from .2697E15
(iii) .2786E–17 from .2134E–16
(iv) .7224E–99 from .7273E–99.

Solution. (i) Here the exponents are equal, and hence the mantissas are subtracted
directly. Thus, the result is
.8432E10 – .2832E10 = .5600E10.

(ii) Here also the exponents are equal. So the result is .2697E15 – .2693E15 = .0004E15.
The mantissa is not in normalised form. Since the computer always stores normalised
numbers, the result has to be converted to normalised form. The normalised number
corresponding to .0004E15 is .4000E12. This is the final answer.

(iii) In these numbers the exponents are different. The number with the smaller exponent is
shifted right, and its exponent is increased by 1 for every right shift. The second number
becomes .0278E–16. Thus the result is .2134E–16 – .0278E–16 = .1856E–16.

(iv) The result is .7273E–99 – .7224E–99 = .0049E–99 = .4900E–101 (in normalised form).
Note that the exponent now has three digits, but our hypothetical computer can
store only two digits.
In this case, the result is smaller than the smallest number that can be stored in
our computer. This situation is called the underflow condition, and the computer will
give an error message.
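
The addition and subtraction rules of the hypothetical four-digit machine can be sketched as follows. Everything in this block is an illustrative assumption rather than part of the text: a number is stored as a tuple (sign, mantissa, exponent), so that .2678E15 becomes (1, 2678, 15), and the names normalize, add, sub and UnderflowError are hypothetical.

```python
DIGITS = 4    # mantissa digits of the hypothetical machine
EMAX = 99     # the exponent must lie between -99 and 99

class UnderflowError(ArithmeticError):
    """The result is too small for the hypothetical machine."""

def normalize(sign, m, e):
    """Normalize sign * (m / 10**DIGITS) * 10**e, truncating extra digits."""
    if m == 0:
        return (1, 0, 0)
    n = len(str(m))
    e += n - DIGITS                     # adjust the exponent for the shift
    if n > DIGITS:
        m //= 10 ** (n - DIGITS)        # shift right and truncate
    else:
        m *= 10 ** (DIGITS - n)         # shift left
    if e > EMAX:
        raise OverflowError("exponent exceeds 99")
    if e < -EMAX:
        raise UnderflowError("exponent below -99")
    return (sign, m, e)

def add(x, y):
    (sx, mx, ex), (sy, my, ey) = x, y
    if mx == 0:
        return y
    if my == 0:
        return x
    e = max(ex, ey)
    mx //= 10 ** (e - ex)               # align the mantissas, truncating
    my //= 10 ** (e - ey)
    total = sx * mx + sy * my
    return normalize(1 if total >= 0 else -1, abs(total), e)

def sub(x, y):
    return add(x, (-y[0], y[1], y[2]))

print(add((1, 7487, 10), (1, 6712, 10)))   # (1, 1419, 11), i.e. .1419E11
print(sub((1, 2697, 15), (1, 2693, 15)))   # (1, 4000, 12), i.e. .4000E12
```

Case (v) of Example 2.8 raises OverflowError and case (iv) of Example 2.9 raises the hypothetical UnderflowError, which corresponds to the error messages mentioned in the text.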

2.4.3 Multiplication

The multiplication of normalised floating point numbers is the same as the multiplication of
ordinary numbers.
Two normalized floating point numbers are multiplied by multiplying the mantissas
and adding the exponents. After multiplication, the mantissa is converted into
normalized floating point form and the exponent is adjusted accordingly. Multiplication is
illustrated in the following examples.
Example 2.10 Multiply the following floating point numbers:
(i) .2198E6 by .5671E12
(ii) .2318E17 by .8672E–17
(iii) .2341E52 by .9231E51
(iv) .2341E–53 by .7652E-51.
Solution. (i) In this case, .2198E6 × .5671E12 = .12464858E18.
Note that the mantissa has 8 significant figures, but on our computer the result
will be .1246E18 (the last four significant figures are truncated).
(ii) Here, .2318E17 × .8672E–17 = .20101696E0 = .2010E0.
(iii) .2341E52 × .9231E51 = .21609771E103.
In this case, the exponent has three digits, which is not allowed in our assumed
computer. The overflow condition occurs, so an error message will be generated.
(iv) .2341E–53 × .7652E–51 = .17913332E–104 = .1791E–104. The underflow condition
occurs, so an error message will be generated.

2.4.4 Division

The division of normalised floating point numbers is likewise similar to the division of
ordinary numbers. The only difference is that the mantissa retains only four significant
digits (as per our assumed computer) instead of all digits. The quotient mantissa must be
written in the normalized form and the exponent is adjusted accordingly.
Example 2.11 Perform the following divisions
(i) .8765E43 ÷ .3131E21
(ii) .9999E5 ÷ .1452E–99
(iii) .3781E–18 ÷ .2871E94.
Solution. (i) .8765E43 ÷ .3131E21 = 2.7994251038E22 = .2799E23.
(ii) In this case, the number is divided by a small number.
.9999E5 ÷ .1452E–99 = 6.8863636364E104 = .6886E105.
The overflow condition occurs.

(iii) In this case, the number is divided by a large number.
.3781E–18 ÷ .2871E94 = 1.3169627307E–112 = .1316E–111.
On our computer, the underflow condition occurs.
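
Multiplication and division on the hypothetical four-digit machine can be sketched in the same spirit. As before, this is an illustration under assumptions that are not part of the text: a number is stored as a tuple (sign, mantissa, exponent), so that .2198E6 becomes (1, 2198, 6), and the function names and the UnderflowError class are hypothetical.

```python
DIGITS = 4    # mantissa digits of the hypothetical machine
EMAX = 99     # the exponent must lie between -99 and 99

class UnderflowError(ArithmeticError):
    """The result is too small for the hypothetical machine."""

def normalize(sign, m, e, scale):
    """Normalize sign * m * 10**(e - scale) to a DIGITS-digit mantissa."""
    if m == 0:
        return (1, 0, 0)
    n = len(str(m))
    e = e - scale + n
    if n >= DIGITS:
        m //= 10 ** (n - DIGITS)        # truncate the extra digits
    else:
        m *= 10 ** (DIGITS - n)
    if e > EMAX:
        raise OverflowError("exponent exceeds 99")
    if e < -EMAX:
        raise UnderflowError("exponent below -99")
    return (sign, m, e)

def mul(x, y):
    # Multiply the mantissas and add the exponents.
    return normalize(x[0] * y[0], x[1] * y[1], x[2] + y[2], 2 * DIGITS)

def div(x, y):
    if y[1] == 0:
        raise ZeroDivisionError("division by zero")
    # Scale the dividend so that the integer quotient keeps enough digits.
    q = x[1] * 10 ** (2 * DIGITS) // y[1]
    return normalize(x[0] * y[0], q, x[2] - y[2], 2 * DIGITS)

print(mul((1, 2198, 6), (1, 5671, 12)))    # (1, 1246, 18), i.e. .1246E18
print(div((1, 8765, 43), (1, 3131, 21)))   # (1, 2799, 23), i.e. .2799E23
```

The overflow and underflow cases of Examples 2.10 and 2.11 raise OverflowError and the hypothetical UnderflowError respectively.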

2.5 Effect of normalized floating point arithmetics

Sometimes floating point arithmetic gives unexpected results, due to the truncation
of the mantissa. To illustrate this situation, let us consider the following example. It is well
known that (1/6) × 12 = 2. But in floating point arithmetic 1/6 = .1667, and hence
(1/6) × 12 = .1667 × 12 = .2000E1. One can also determine the value of (1/6) × 12 by repeated
addition. Note that .1667 + .1667 + .1667 + .1667 + .1667 + .1667 = 1.0002 = .1000E1,
but .1667 + .1667 + .1667 + · · · (12 times) gives .1996E1.
Thus, in floating point arithmetic multiplication is not always the same as repeated
addition, i.e. 12x = x + x + · · · + x (12 times) is not always true.
Also, in floating point arithmetic the associative and distributive laws do not always
hold, due to the truncation of the mantissa.
That is, in general,
(i) (a + b) + c ≠ a + (b + c)
(ii) (a + b) − c ≠ (a − c) + b
(iii) a(b − c) ≠ ab − ac.
These results are illustrated in the following examples:

(i) Suppose a = .6889E2, b = .7799E2 and c = .1008E2. Now, a + b = .1468E3 and
(a + b) + c = .1468E3 + .1008E2 = .1468E3 + .0100E3 = .1568E3.
Again, b + c = .8807E2 and
a + (b + c) = .6889E2 + .8807E2 = .1569E3.
Hence, for this example, (a + b) + c ≠ a + (b + c).

(ii) Let a = .7433E1, b = .6327E–1, c = .6672E1.
Then a + b = .7496E1 and (a + b) − c = .7496E1 – .6672E1 = .8240E0.
Again, a − c = .7610E0 and (a − c) + b = .7610E0 + .0632E0 = .8242E0.
Thus, (a + b) − c ≠ (a − c) + b.

(iii) Let a = .6683E1, b = .4684E1, c = .4672E1.
b − c = .1200E–1.
a(b − c) = .6683E1 × .1200E–1 = .08019600E0 = .8019E–1 (normalising before truncating, as in Example 2.10).
ab = .3130E2, ac = .3122E2.
ab − ac = .8000E–1.
Thus, a(b − c) ≠ ab − ac.

From these examples one might think that numerical computation is very dangerous. But it
is not so dangerous, since an actual computer generally stores about seven decimal digits
of mantissa (in single precision). A longer mantissa gives a more accurate result.
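
The failure of the associative law and the difference between multiplication and repeated addition can be reproduced with Python's decimal module. This is a sketch under an explicit assumption: the decimal context computes each result exactly and then truncates it to four significant digits, which is not the same procedure as the digit alignment of the text but gives the same values for the numbers used here. The function name four_digit_demo is hypothetical.

```python
from decimal import Decimal, localcontext, ROUND_DOWN

def four_digit_demo():
    """Return results computed with a truncated 4-digit mantissa."""
    with localcontext() as ctx:
        ctx.prec = 4                  # four significant digits
        ctx.rounding = ROUND_DOWN     # truncate instead of rounding

        # Associative law, case (i).
        a, b, c = Decimal("68.89"), Decimal("77.99"), Decimal("10.08")
        left = (a + b) + c
        right = a + (b + c)

        # Multiplication versus repeated addition of .1667.
        x = Decimal("0.1667")
        product = x * 12
        repeated = Decimal(0)
        for _ in range(12):
            repeated += x
    return left, right, product, repeated

left, right, product, repeated = four_digit_demo()
print(left, right)          # 156.8 156.9
print(product, repeated)    # 2.000 1.996
```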

2.5.1 Zeros in floating point numbers

Zero has a definite meaning in mathematics, but in computer arithmetic exact
equality of a number to zero can never be guaranteed, because most numbers
in floating point representation are approximate. The behaviour of zero is illustrated in
the following example.

The exact roots of the equation x² + 2x − 5 = 0 are x = −1 ± √6.
In floating point representation (4 digits mantissa) these are .1449E1 and –.3449E1.
When x =.1449E1, then the left hand side of the equation is
.1449E1 × .1449E1 + .2000E1 × .1449E1 – .5000E1
= .0209E2 + .2898E1 – .5000E1 = .0209E2 + .0289E2 – .0500E2 = –.0002E2.
When x =–.3449E1, then left hand side of the equation is
(–.3449E1) × (–.3449E1) + .2000E1 × (–.3449E1) – .5000E1
= .1189E2 – .6898E1 – .5000E1 = .1189E2 – .0689E2 – .0500E2 = .0000E2, which
is equal to 0.
It is interesting to see that one root satisfies the equation exactly while the other
does not, though both are roots of the equation. Since .1449E1 is a root, the residual
–.0002E2 = –0.02 must be regarded as zero. Thus, we can conclude the following:
Note 2.1 Unlike in mathematical calculation, there is no fixed value of zero in computer
arithmetic. Thus, it is not advisable to give any instruction based on testing whether a
floating point number is exactly zero. Instead, a number should be treated as zero if its
magnitude is less than a given (very) small number.
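
Note 2.1 translates into code as a comparison against a tolerance instead of a test for exact equality. The sketch below is illustrative only; the tolerance 1e-2 is an assumption chosen to suit the four-digit root used here, and in practice the tolerance depends on the problem and on the precision of the machine.

```python
import math

def f(x):
    """Left hand side of the equation x**2 + 2x - 5 = 0."""
    return x * x + 2 * x - 5

def is_zero(value, tol=1e-2):
    """Treat a value as zero if its magnitude does not exceed the tolerance."""
    return math.isclose(value, 0.0, abs_tol=tol)

root = 1.449              # the four-digit approximation of -1 + sqrt(6)
residual = f(root)        # small, but not exactly zero

print(residual == 0.0)    # False: the exact test rejects the root
print(is_zero(residual))  # True: the tolerance test accepts it
```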
