10 Operation Count; Numerical Linear Algebra
10.1 Introduction
in the first column below the first matrix element. Then the transform (row3 + 3 row2) → row3 yields zeros below the second element on the diagonal:
 1  1  0         1  1  0         1  1  0
 2  1 -1   →     0 -1 -1   →     0 -1 -1
-1  2  5         0  3  5         0  0  2
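These row operations are easy to carry out programmatically. Below is a minimal sketch of forward elimination without pivoting in Python/NumPy (the function name and the small driver are illustrative, not from the text); applied to the matrix above, it reproduces the elimination steps.

```python
import numpy as np

def forward_eliminate(A):
    """Reduce A to upper triangular form by row operations (no pivoting)."""
    A = A.astype(float)
    n = A.shape[0]
    for k in range(n - 1):                  # eliminate below diagonal element k
        for i in range(k + 1, n):
            factor = A[i, k] / A[k, k]
            A[i, k:] -= factor * A[k, k:]   # row_i <- row_i - factor * row_k
    return A

A = np.array([[ 1, 1,  0],
              [ 2, 1, -1],
              [-1, 2,  5]])
print(forward_eliminate(A))                 # upper triangular result
```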
[Figure 10-1: Execution time of a program solving a system of N linear equations in N variables; time (seconds) versus problem size N, on logarithmic axes. For comparison, the dashed line shows an increase proportional to N^3.]
and Matlab, I can solve a linear system of about 3500 equations in one
second. Matlab automatically took advantage of all 4 CPU cores.) The
increase in computation time with the number of variables is not as ideal
as expected from the operation count, because time is required not only
for arithmetic operations but also for data movement. For this particular
implementation the execution time is larger when N is a power of two.
Other wiggles in the graph arise because the execution time is not exactly
the same every time the program is run.
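A measurement like the one in Figure 10-1 can be reproduced in a few lines. Here is a sketch that times a dense solver for increasing N; the problem sizes and the use of NumPy's solver are my choices, not the setup used for the figure.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
for N in (250, 500, 1000, 2000, 4000):
    A = rng.standard_normal((N, N))
    b = rng.standard_normal(N)
    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)       # LU-based dense solve, about (2/3) N^3 flops
    dt = time.perf_counter() - t0
    print(f"N = {N:5d}   time = {dt:.3f} s")
```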
Table 10-I shows the operation count for several important problems.
Operation count also goes under the name of computational cost or
computational complexity (of an algorithm).
A system whose matrix is sparse, that is, where most elements are zero, can be solved with appropriate algorithms much faster and with less storage than a full matrix. For example, a tridiagonal system of equations can be solved with O(N) effort and O(N) storage, as compared to O(N^3) effort for a full matrix.
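As a concrete illustration of the O(N) case, a tridiagonal system can be handed to a banded solver; here is a sketch using SciPy, where the specific matrix (2 on the diagonal, -1 on the off-diagonals) is only an example.

```python
import numpy as np
from scipy.linalg import solve_banded

N = 1_000_000
# Banded storage: row 0 = superdiagonal, row 1 = main diagonal, row 2 = subdiagonal.
ab = np.zeros((3, N))
ab[0, 1:]  = -1.0
ab[1, :]   =  2.0
ab[2, :-1] = -1.0
b = np.ones(N)

x = solve_banded((1, 1), ab, b)   # O(N) work and O(N) storage
```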
The ratio of floating-point operations to bytes of main memory ac-
cessed is called the arithmetic intensity. Large problems with low arith-
metic intensity are limited by memory bandwidth, while large problems
with high arithmetic intensity are floating point limited. For example, a
matrix transpose has an arithmetic intensity of zero, matrix addition of
double-precision numbers 1/(3·8) (very low), and naive matrix multiplication 2N/24 (high).
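These figures follow directly from counting floating-point operations and bytes of traffic; a quick check of the arithmetic (the value of N is arbitrary):

```python
N = 1000
bytes_per_double = 8

# Matrix addition C = A + B: N*N flops; read A and B, write C: 3*N*N doubles.
add_intensity = (N * N) / (3 * N * N * bytes_per_double)      # = 1/24

# Naive matrix multiplication C = A*B: 2*N**3 flops; 3*N*N doubles of traffic,
# assuming each matrix moves between main memory and the processor only once.
mul_intensity = (2 * N**3) / (3 * N * N * bytes_per_double)   # = 2N/24 = N/12

print(add_intensity, mul_intensity)
```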
but the first block; that is about another m^2 of traffic per tile. For each
output tile, we have to go through N/m pairs of input tiles. In total,
this amounts to (N/m)^3 · 4m^2 = 4N^3/m memory movements, or 4N/m for
each matrix element, versus 2N for the naive method. This reduces data
access to the main memory and turns matrix multiplication into a truly
floating point limited operation.
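For illustration only (in practice an optimized BLAS routine would be used), a tiled multiplication might be organized as follows; the tile size m is a tunable assumption and should be chosen so that roughly three m×m tiles fit in cache.

```python
import numpy as np

def tiled_matmul(A, B, m=64):
    """Multiply square matrices A and B one m x m tile at a time.

    Each loaded tile is reused for m multiply-adds per element while it
    sits in cache, reducing traffic to main memory compared with the
    naive triple loop over individual elements.
    """
    N = A.shape[0]
    C = np.zeros((N, N))
    for i in range(0, N, m):
        for j in range(0, N, m):
            for k in range(0, N, m):
                # Accumulate the contribution of one pair of input tiles.
                C[i:i+m, j:j+m] += A[i:i+m, k:k+m] @ B[k:k+m, j:j+m]
    return C
```

In pure Python the loop overhead dominates the run time; the point of the sketch is only the access pattern, in which each tile is reused many times before it leaves the cache.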
There are enough highly optimized matrix multiplication routines around
that we never have to write one on our own, but the same concept, grouping
of data to reduce traffic to main memory, can be applied to
various problems with low arithmetic intensity.