Linear Algebra
Linear Algebra
Linear Algebra
SORIN MITRAN
Department of Mathematics
University of North Carolina at Chapel Hill
ABSTRACT
This textbook presents the essential concepts from linear algebra of direct utility to analysis
of large data sets. The theoretical foundations of the emerging discipline of Data Science are
still being dened at present, but linear algebra is certainly one the cornerstones. Traditional
presentations of linear algebra reect its historical roots with a focus on linear systems and
determinants, typically of small size. Presentation of the topic oen links solutions of linear sys-
tems to posible intersections of lines or planes. Such an approach is ill-suited for data science in
which the primary interest is in ecient description of large data sets, and automated extraction
of regularity from the available data. Neither is the essence of solving a linear system presented
as the information-conserving coordinate transformation that it actually represents when the
system matrix is of full rank.
The emphasis in linear algebra presentation suggested by data science is quite di erent. The
focus naturally shis to the essential problem of ecient description of large data sets using a
small, typically incomplete set of feature vectors. Linear algebra becomes the study of the basic
operation of linear combination and its potential as a descriptor of large data sets. Rather than
concentrate on the basis transformation represented by linear system solution, the focus shis
to maximal information compression. Instead of Gaussian elimination, the crucial algorithm
becomes the singular value decomposition. The typical operation required is not to solve a linear
system, but to construct low-dimensional approximations of the available data through projec-
tion and least squares.
Furthermore, computational exercises based on small matrices obscure the vitality and utility of
linear algebra in data science of describing objects as disparate and information-rich as images,
medical scans or sound recordings. To more faithfully portray the way linear algebra actually
gets used in data science, this textbook is packaged with a soware environment that contains
extensive data sets, code snippets to carry out typical analysis, and procedures to transform
heterogeneous data sources into standard linear algebra representations. Rather than relegate
computational applications to isolated sections, the entire text is interspersed with practical
examples using the Julia language, well suited for linear algebra and data science.
This textbook is draed and meant to be worked through in TeXmacs, a scientic editing plat-
form that features live documents with embedded computational examples constructed in
freely available mathematical soware systems such as Asymptote, Eukleides, Gnuplot, Julia,
Maxima, and Octave.
This textbook was developed for an intensive Maymester course that meets in twelve sessions
of three hours each. The content organization reects a desire to present crucial mathematical
ideas and practical skills to students from various backgrounds who might be interested in data
science. The key concepts required for mathematics students are present: vector spaces, matrix
factorizations, linear systems, eigenvalues. For a more general audience, these mathematical
5
6 ABSTRACT
topics are also recast as addressing specic aspects of data: expressiveness, redundancy, e-
ciency, compression, partitioning. More than a simple relabeling, this reinterpretation allows
for application of linear algebra operations to data far removed from the physical sciences or
engineering. The text and its associated soware environment considers data sets from visual
art, music, biology, medicine, social sciences.
TABLE OF CONTENTS
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1. LINEAR COMBINATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1. Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.1. Number sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2. Quantities described by a single number . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3. Quantities described by multiple numbers . . . . . . . . . . . . . . . . . . . . . . . 12
2. Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1. Vector spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2. Real vector space m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Column vectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Row vectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Compatible vectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3. Working with vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Ranges. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Visualization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3. Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1. Forming matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2. Identity matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4. Linear combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.1. Linear combination as a matrix-vector product . . . . . . . . . . . . . . . . . . . . 17
Linear combination. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Matrix-vector product. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2. Linear algebra problem examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Linear combinations in E2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5. Vectors and matrice in data science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Linear Mappings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1. Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.1. Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Homogeneous relations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.2. Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.3. Linear functionals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.4. Linear mappings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2. Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.1. Equivalence classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2. Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3. Inner product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
7
8 TABLE OF CONTENTS
Correlation coecient. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Patterns in data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5. CHANGE OF BASIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Data Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
1. Gaussian elimination and row echelon reduction . . . . . . . . . . . . . . . . . . . . . . 65
2. LU -factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.1. Example for m =3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.2. General m case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3. Inverse matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.1. Gauss-Jordan algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4. Determinants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1. Cross product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Data Eciency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
1. Krylov bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
2. Greedy approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6. EIGENPROBLEMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Data Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
1. The eigenvalue problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
2. Computation of the SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Data Resonance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
1. Bases induced by eigenmodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
CHAPTER 1
LINEAR COMBINATIONS
SYNOPSIS. Data science arises from the need to organize massive amounts of data into mean-
ingful insights into some natural or social process. There are many ways to do so such as lists,
trees, clusters. Linear algebra concentrates on grouping quantifiable data into sets of fixed
length, known as vectors. Multiple vectors are subsequently grouped as matrices, and a for-
malism is established to work with vectors and matrices.
1. Numbers
1.1. Number sets
Several types of numbers have been introduced in mathematics to express di erent types of
quantities, and the following will be used throughout this text:
. The set of natural numbers, = {0,1,2,3,... }, innite and countable, + = {1,2,3,... };
$. The set of integers, $ = {0,±1,±2,±3,... }, innite and countable;
. The set of rational numbers = {p / q , p $, q +}, innite and countable;
. The set of real numbers, innite, not countable, can be ordered;
. The set of complex numbers, = {x + iy , x , y }, infinite, not countable, cannot be
ordered.
These sets of numbers form a hierarchy, with $ . The size of a set of numbers is
an important aspect of its utility in describing natural phenomena. The set S = {Mary,Jane,Tom}
has three elements, and its size is dened by the cardinal number , |S | = 3. The sets , $, , ,
have an innite number of elements, but the relation
/2
z = { (nn +1 for n even
) /2 for n odd
denes a one-to-one correspondence between n and z $, so these sets are of the same size
denoted by the transnite number 50 (aleph-zero). The rationals can also be placed into a one-
to-one correspondence with , hence
|| = |$| = | | = 50 .
11
12 LINEAR COMBINATIONS
In contrast there is no one-to-one mapping of the reals to the naturals, and the cardinality of
the reals is || = (Fraktur-script c). Intuitively, there are exponentially more reals than naturals,
formally stated in set theory as =25 . 0
" the population of a large community expressed as a oat p F, even though for a com-
munity of individuals the population is a natural number, as in the population of the
United States is p =328.2E6, i.e., 328.2 million.
In most disciplines, there is a particular interest in comparison of two quantities, and to facilitate
such comparison a common reference is used known as a standard unit . For measurement of
a length L, the meter = 1 m is a standard unit, as in the statement L = 10 m, that states that
L is obtained by taking the standard unit ten times, L = 10 . The rules for carrying out such
comparisons are part of the denition of real and rational numbers.
1.3. Quantities described by multiple numbers
Other quantities require more than a single number. The distribution of population in
the year 2000 among the alphabetically-ordered South American countries (Argentina,
Bolivia,..,Venezuela) requires 12 numbers. These are placed together in a list known in math-
ematics as a tupl e, in this case a 12-tuple P = (p1, p2,..., p12), with p1 the population of Argentina,
p2 that of Bolivia, and so on. An analogous 12-tuple can be formed from the South American
populations in the year 2020, say Q = (q1, q2,..., q12). Note that it is dicult to ascribe meaning
to apparently plausible expressions such as P + Q since, for instance, some people in the 2000
population are also in the 2020 population, and would be counted twice.
2. Vectors
2.1. Vector spaces
In contrast to the population 12-tuple example above, combining multiple numbers is well defined
in operations such as specifying a position within a three-dimensional Cartesian grid, or deter-
mining the resultant of two forces in space. Both of these lead to the consideration of 3-tuples
or triples such as the force ( f1, f2, f3). When combined with another force (g1, g2, g3) the resul-
tant is ( f1 + g1, f2 + g2, f3 + g3). If the force ( f1, f2, f3) is amplied by the scalar ± and the force
(g 1, g 2, g 3) is similarly scaled by ² , the resultant becomes
It is useful to distinguish tuples for which scaling and addition is well dened from simple lists
of numbers. In fact, since the essential di erence is the behavior with respect to scaling and
addition, the focus should be on these operations rather than the elements of the tuple.
The above observations underlie the defin-
ition of a vector space ± by a set V whose ele- Addition rules for a,b,c V
ments satisfy scaling and addition properties, a +b V Closure
denoted all together by the 4-tuple ± = (V , a + (b + c ) = (a + b ) + c Associativity
S ,+, Å). The rst element of the 4-tuple is a set a +b =b +a Commutativity
whose elements are called vectors. The second 0+a =a Zero vector
element is a set of scalars, and the third is a + (a ) = 0 Additive inverse
the vector addition operation. The last is the Scaling rules for a , b V , x, y S
scaling operation, seen as multiplication of a xa V Closure
vector by a scalar. Vector addition and scaling x (a + b ) = x a + x b Distributivity
(x + y ) a = x a + y a Distributivity
operations must satisfy rules suggested by
positions or forces in three-dimensional space, x (y a ) = (xy )a Composition
which are listed in Table 1.1. In particular, a 1a = a Scalar identity
vector space requires denition of two distin-
guished elements: the zero vector 0 V , and Table 1.1. Vector space ± = (V , S ,+, Å) properties.
the identity scalar element 1 S .
The denition of a vector space reects everyday experience with vectors in Euclidean geometry,
and it is common to refer to such vectors by descriptions in a Cartesian coordinate system. For
example, a position vector r within the plane can be referred through the pair of coordinates
(x , y ). This intuitive understanding can be made precise through the denition of a vector space
2 = (2, ,+, Å), called the real 2-space. Vectors within 2 are elements of 2 = × = {(x , y )| x ,
y }, meaning that a vector is specied through two real numbers, r (x , y ). Addition of two
vectors, q (s, t ), r (x , y ) is dened by addition of coordinates q + r = (s + x , t + v ). Scaling
r (x , y ) by scalar a is dened by a r (ax , ay ). Similarly, consideration of position vectors in
three-dimensional space leads to the denition of the 3 = (3, ,+, Å), or more generally a real
m-space m = (m , ,+, Å), m , m >0.
2.2. Real vector space m
Column vectors. Since the real spaces m = (m , , +, Å) play such an important role in them-
selves and as a guide to other vector spaces, familiarity with vector operations in m is necessary
to fully appreciate the utility of linear algebra to a wide range of applications. Following the
usage in geometry and physics, the m real numbers that specify a vector u m are called the
components of u . The one-to-one correspondence between a vector and its components u
(u 1,..., u m ), is by convention taken to dene an equality relationship,
u1
u = [[ ÅÅÅ ]], (1.1)
[ um ]
with the components arranged vertically and enclosed in square brackets. Given two vectors u ,
v m , and a scalar a , vector addition and scaling are dened in m by real number addition
14 LINEAR COMBINATIONS
The vector space m is dened using the real numbers as the set of scalars, and constructing
vectors by grouping together m scalars, but this approach can be extended to any set of scalars
S , leading to the denition of the vector spaces ®n = (S n, S ,+, Å). These will oen be referred to as
n-vector space of scalars, signifying that the set of vectors is V = S n.
To aid in visual recognition of vectors, the following notation conventions are introduced:
" vectors are denoted by lower-case bold Latin letters: u , v ;
" the components of a vector are denoted by the corresponding normal face with sub-
scripts as in equation (1.1);
" related sets of vectors are denoted by indexed bold Latin letters: u 1, u 2,..., u n.
[[ 12 ]]
[[ 1 ]] (1.3)
[[ 2 ]]
4
The equal sign in mathematics signies a particular equivalence relationship. In computer sys-
tems such as Julia the equal sign has the di erent meaning of assignment , that is dening the
label on the le side of the equal sign to be the expression on the right side. Subsequent invoca-
tion of the label returns the assigned object. Components of a vector are obtained by enclosing
the index in parantheses.
4 u=[1; 2; -1; 2]; u
[[ 12 ]]
[[ 1 ]] (1.4)
[[ 2 ]]
4
Row vectors. Instead of the vertical placement or components into one column , the compo-
nents of could have been placed horizontally in one row [ u1 ... um ], that contains the same
data, di erently organized. By convention vertical placement of vector components is the pre-
VECTORS AND MATRICES 15
ferred organization, and u shall denote a column vector henceforth. A transpose operation
denoted by a T superscript is introduced to relate the two representations
u T = [ u1 ... um ],
and u T is the notation used to denote a row vector . In Julia, horizontal placement of successive
components in a row is denoted by a space.
4 uT=transpose(u)
[ 1 2 1 2 ] (1.5)
4
[ 3 1 4 1] (1.6)
4 rT=[1 2]; uT+rT
DimensionMismatch
4 m=4; 1:m
[[ 12 ]]
[[ 3 ]] (1.7)
[[[ 4 ]]]
4 m:-1:2
[ ] (1.11)
4
0.0
[[ 0.16589613269341502 ]
[[ 0.3271946967961522 ]]]
[[ ]]
0.479425538604203
[[ 0.618369803069737 ]]
[[ 0.7401768531960371 ]] (1.12)
[[ ]]
0.8414709848078965
[[ 0.9194449792537551 ]]
[[ ]
[[ 0.9719379013633127 ]]]
0.9974949866040544
Indeed, such a piece-by-piece approach is not the way humans organize large amounts of infor-
mation, preferring to conceptualize the data as some other entity: an image, a sound excerpt, a
smell, a taste, a touch, a sense of balance, or relative position. All seven of these human senses
will be shown to allow representation by linear algebra concepts, including representation by
vectors.
4 clf(); plot(t,y); grid("on"); xlabel("t"); ylabel("y");
VECTORS AND MATRICES 17
4 cd(homedir()*"/courses/MATH347DS/images");
4 savefig("L01Fig01.eps");
1.0
0.8
0.6
y
0.4
0.2
0.0
3. Matrices
3.1. Forming matrices
The real numbers themselves form the vector space 1 = (, ,+, Å), as does any eld of scalars,
®1 = (S , S , +, Å). Juxtaposition of m real numbers has been seen to dene the new vector space
m . This process of juxtaposition can be continued to form additional mathematical objects. A
matrix is dened as a juxtaposition of compatible vectors. As an example, consider n vectors a 1,
a 2,..., a n V within some vector space ± = (V , S ,+, Å). Form a matrix by placing the vectors into
a row,
A = [ a 1 a 2 ... a n ]. (1.13)
To aid in visual recognition of a matrix, upper-case bold Latin letters will be used to denote
matrices. The columns of a matrix will be denoted by the corresponding lower-case bold letter
with a subscripted index as in equation (1.13). Note that the number of columns in a matrix can
be di erent from the number of components in each column, as would be the case for matrix A
from equation (1.13) when choosing vectors from, say, the real space m , a 1, a 2,..., a n m .
Vectors were seen to be useful juxtapositions of scalars that could describe quantities a single
scalar could not: a position in space, a force in physics, or a sampled function graph. The crucial
utility of matrices is their central role in providing a means to obtain new vectors from their
column vectors, as suggested by experience with Euclidean spaces.
3.2. Identity matrix
Consider rst 1, the vector space of real numbers. A position vector r 1 on the real axis is
specied by a single scalar component, r = [x ], x . Read this to mean that the position r is
18 LINEAR COMBINATIONS
obtained by traveling x units from the origin at position vector 0 = [0]. Look closely at what is
meant by unit in this context. Since x is a scalar, the mathematical expression r = 0 + x has
no meaning, as addition of a vector to a scalar has not been dened. Recall that scalars were
introduced to capture the concept of scaling of a vector, so in the context of vector spaces they
always appear as multiplying some vector. The correct mathematical description is r = 0 + x e ,
where e is the unit vector e = [1]. Taking the components leads to r1 =01 + xe1, where r1,01, e1 are
the rst (and in this case only) components of the r , 0, e vectors. Since r1 = x , 01 = 0, e1 = 1, one
obtains the identity x =0+ x Å 1.
Now consider 2, the vector space of positions in the plane. Repeating the above train of thought
leads to the identication of two direction vectors e1 and e 2
r = [ xy ] = x [ 10 ] + y [ 01 ] = x e1 + y e2, e1 = [ 10 ], e2 = [ 01 ].
[[ 24 ]] (1.14)
4
4. Linear combinations
4.1. Linear combination as a matrix-vector product
The expression x = x1 e1 + x2 e2 + Å Å Å + xm e m expresses the idea of scaling vectors within a set and
subsequent addition to form a new vector x . The matrix I = [ e1 e 2 ... em ] groups these vectors
together in a single entity, and the scaling factors are the components of the vector x . To bring
all these concepts together it is natural to consider the notation
x = Ix ,
as a generalization of the scalar expression x = 1 Å x . It is clear what the operation Ix should
signify: it should capture the vector scaling and subsequent vector addition x1 e1 + x2 e 2 + Å Å Å +
xm em . A specic meaning is now ascribed to Ix by identifying two denitions to one another.
Linear combination. Repeateadly stating vector scaling and subsequent vector addition is
unwieldy, so a special term is introduced for some given set of vectors {a 1,..., a n}.
DEFINITION. (LINEAR COMBINATION) . The linear combination of vectors a 1, a 2,..., a n V with scalars
x1, x2,..., xn S in vector space (V , S ,+, Å) is the vector b = x1 a 1 + x2 a 2 + ... xn a n .
Matrix-vector product. Similar to the grouping of unit vectors e1,..., em into the identity matrix
I , a more concise way of referring to arbitrary vectors a 1, . . . , a n from the same vector space is
the matrix A = [ a 1 a 2 ... a n ]. Combining these observations leads to the denition of a matrix-
vector product.
DEFINITION. (MATRIX-VECTOR PRODUCT) . In the vector space (V , S , +, Å), the product of matrix
A = [ a 1 a 2 ... a n ] composed of columns a 1, a 2,..., a n V with the vector x Sn whose components
are scalars x1, x2,..., xn S is the linear combination b = x1 a 1 + x2 a 2 + ... xn a n = Ax V .
problem: decomposition of forces in the plane along two directions. Suppose a force is given in
terms of components along the Cartesian x , y -axes, b = bx ex + by ey , as expressed by the matrix-
vector multiplication b = Ib . Note that the same force could be obtained by linear combination
of other vectors, for instance the normal and tangential components of the force applied on an
inclined plane with angle ¸ , b = xt et + xn e n, as in Figure 1.2. This denes an alternate reference
system for the problem. The unit vectors along these directions are
t = [[ cos ¸
sin ¸ ]
] , n = [
[
sin ¸
cos ¸ ]],
20 LINEAR COMBINATIONS
and can be combined into a matrix A = [ t n ]. The value of the components (xt , xn) are the
scaling factors and can be combined into a vector x = [ xt xn ]T . The same force must result
irrespective of whether its components are given along the Cartesian axes or the inclined plane
directions leading to the equality
Ib = b = Ax . (1.15)
Interpret equation (1.15) to state that the vector b could be obtained either as a linear combi-
nation of I , b = Ib , or as a linear combination of the columns of A , b = Ax . Of course the simpler
description seems to be Ib for which the components are already known. But this is only due to
an arbitrary choice made by a human observer to dene the force in terms of horizontal and ver-
tical components. The problem itself suggests that the tangential and normal components are
more relevant; for instance a friction force would be evaluated as a scaling of the normal force.
The components in this more natural reference system are not known, but can be determined
by solving the vector equality Ax = b , known as a linear system of equations. Procedures to carry
this out will be studied in more detail later, but Julia provides an instruction for this common
problem, the backslash operator, as in x=A\b.
4 ex=[1; 0]; ey=[0; 1]; b=[0.2; 0.4]; I=[ex ey]; I*b
[[ 0.2
0.4 ]] (1.16)
4 th=pi/6; c=cos(th); s=sin(th); tvec=[c; s]; nvec=[-s; c];
4 A=[tvec nvec]
0.8660254037844387 0.49999999999999994
[[ 0.49999999999999994 (1.17)
0.8660254037844387 ]
4 x=A\b
[[ 0.37320508075688774
0.2464101615137755 ]] (1.18)
4 x[1]*tvec+x[2]*nvec
[[ 0.2
0.4 ]] (1.19)
4
Note from the above calculations how the same vector is obtained by two di erent linear com-
binations
b = [ 0.2 1 0 0.866 0.500
0.4 ] =0.2 [ 0 ] +0.4 [ 1 ] =0.373[ 0.500 ] +0.246[ 0.866 ].
The general problem of determining what description is more insightful is a key question arising
in linear algebra applications.
LINEAR MAPPINGS
SYNOPSIS. Vectors have been introduced to represent complicated objects, whose description
requires m numbers, and the procedure of linear combination allows construction of new vec-
tors. Alternative insights into some object might be obtained by transformation of vectors. Of
all possible transformations those that are compatible with linear combinations are of special
interest. It turns out that matrices are not only important in organizing collections of vectors,
but also to represent such transformations, referred to as linear mappings.
1. Functions
1.1. Relations
The previous chapter focused on mathematical expression of the concept of quantication , the
act of associating human observation with measurements, as a rst step of scientic inquiry.
Consideration of di erent types of quantities led to various types of numbers, vectors as group-
ings of numbers, and matrices as groupings of vectors. Symbols were introduced for these quan-
tities along with some intial rules for manipulating such objects, laying the foundation for an
algebra of vectors and matrices. Science seeks to not only observe, but to also explain, which
now leads to additional operations for working with vectors and matrices that will dene the
framework of linear algebra.
Explanations within scientic inquiry are formulated as hypotheses, from which predictions are
derived and tested. A widely applied mathematical transcription of this process is to organize
hypotheses and predictions as two sets X and Y , and then construct another set R of all of the
instances in which an element of X is associated with an element in Y . The set of all possible
instances of x X and y Y , is the Cartesian product of X with Y , denoted as X × Y = {(x , y )| x X ,
y Y }, a construct already encountered in the denition of the real 2-space 2 = (2, , +, Å)
where 2 = × . Typically, not all possible tuples (x , y ) X × Y are relevant leading to the
following denition.
DEFINITION. (RELATION) . A relation R between two sets X , Y is a subset of the Cartesian product
X × Y, R X × Y.
The key concept is that of associating an input x X with an output y Y . Inverting the approach
and associating an output to an input is also useful, leading to the denition of an inverse rela-
tion as R 1 Y × X , R 1 = {(y , x ) | (x , y ) R}. Note that an inverse exists for any relation, and the
inverse of an inverse is the original relation, (R 1)1 = R. From the above, a relation is a triplet (a
tuple with three elements), (X , Y , R), that will oen be referred to by just its last member R.
LINEAR MAPPINGS 23
Homogeneous relations. Many types of relations are dened in mathematics and encoun-
tered in linear algebra, and establishing properties of specic relations is an important task
within data science. A commonly encountered type of relationship is from a set onto itself,
known as a homogeneous relation. For homogeneous relations H A × A, it is common to replace
the set membership notation (a, b ) H to state that a A is in relationship H with b A, with a
binary operator notation a <H b . Familiar examples include the equality and less than relation-
ships between reals, E , L × , in which (a, b ) E is replaced by a = b , and (a, b ) L is replaced
by a < b . The equality relationship is its own inverse, and the inverse of the less than relationship
is the greater than relation G × , G = L1, a < b Ò b > a.
1.2. Functions
Functions between sets X and Y are a specic type of relationship that oen arise in science. For
a given input x X , theories that predict a single possible output y Y are of particular scientic
interest.
DEFINITION. (FUNCTION) . A function from set X to set Y is a relation F X × Y, that associates to
x X a single y Y.
The above intuitive denition can be transcribed in precise mathematical terms as F X × Y is a
function if (x , y ) F and (x , z ) F implies y = z . Since it's a particular kind of relation, a function
is a triplet of sets (X , Y , F ), but with a special, common notation to denote the triplet by f :
X Y , with F = {(x , f (x ))| x X , f (x ) Y } and the property that (x , y ) F Ò y = f (x ). The set
X is the domain and the set Y is the codomain of the function f . The value from the domain
x X is the argument of the function associated with the function value y = f (x ). The function
value y is said to be returned by evaluation y = f (x ). The previously dened relations R, P , I
are functions but S = {(a, ± ), (a, ² ), (a, ³ )} is not. All relations can be inverted, and inversion
of a function denes a new relation, but which might not itself be a function. For example the
relation S 1 = {(± , a), (² , a), (³ , a)} is a function, but its inverse (S 1)1 = S is not.
Familiar functions include:
" the trigonometric functions cos: [1, 1], sin: [1, 1] that for argument ¸
return the function values cos(¸ ), sin(¸ ) giving the Cartesian coordinates (x , y ) 2 of
a point on the unit circle at angular extent ¸ from the x -axis;
" the exponential and logarithm functions exp: , log: (0, ) , as well as power and
logarithm functions in some other base a;
" polynomial functions pn: , dened by a succession of additions and multiplications
n
pn(x ) = a nx n + a n1x n1 + Å Å Å + a 1x + a 0 = y a i x i = ((a n x + a n1)x + Å Å Å + a 1)x + a 0.
i =0
Simple functions such as sin, cos, exp, log, are predened in Octave, and when given a vector
argument return the function applied to each vector component.
octave] disp(cos(0:pi/4:pi))
3 0 -1 0 3
octave]
As seen previously, a Euclidean space Em = (m , ,+, Å) can be used to suggest properties of more
complex spaces such as the vector space of continuous functions 0(). A construct that will
be oen used is to interpret a vector within Em as a function, since v m with components
v = [ v1 v2 ... vm ]T also denes a function v : {1,2,..., m} , with values v (i ) = vi. As the number
of components grows the function v can provide better approximations of some continuous
function f 0() through the function values vi = v (i ) = f (xi) at distinct sample points x1, x2,...,
xm .
The above function examples are all dened on a domain of scalars or naturals and returned
scalar values. Within linear algebra the particular interest is on functions dened on sets of
vectors from some vector space ± = (V , S , +, Å) that return either scalars f : V S , or vectors
from some other vector space ² = (W , S , +, Å), g : V W . The codomain of a vector-valued
function might be the same set of vectors as its domain, h : V V . The fundamental operation
within linear algebra is the linear combination a u + b v with a, b S , u , v V . A key aspect is
to characterize how a function behaves when given a linear combination as its argument, for
instance f (a u + b v ) or g (a u + b v ).
1.3. Linear functionals
Consider rst the case of a function dened on a set of vectors that returns a scalar value. These
can be interpreted as labels attached to a vector, and are very oen encountered in applications
from natural phenomena or data analysis.
DEFINITION. (FUNCTIONAL) . A functional on vector space ± = (V , S ,+, Å) is a function from the set
of vectors V to the set of scalars S of the vector space ±.
DEFINITION. (LINEAR FUNCTIONAL) . The functional f : V S on vector space ± = (V , S , +, Å) is a
linear functional if for any two vectors u , v V and any two scalars a, b
f (a u + b v ) = af (u ) + bf (v ). (1.20)
Many di erent functionals may be dened on a vector space ± = (V , S , +, Å), and an insightful
alternative description is provided by considering the set of all linear functionals, that will be
denoted as V = { f | f : V S }. These can be organized into another vector space ± = (V , S ,+, Å)
with vector addition of linear functionals f , g V and scaling by a S dened by
( f + g )(u ) = f (u ) + g (u ), (af )(u ) = af (u ), u V . (1.21)
LINEAR MAPPINGS 25
DEFINITION. (DUAL VECTOR SPACE) . For some vector space ±, the vector space of linear functionals
± is called the dual vector space.
As is oen the case, the above abstract denition can better be understood by reference to the
familiar case of Euclidean space. Consider 2 = (2, , +, Å), the set of vectors in the plane with
x 2 the position vector from the origin (0,0) to point X in the plane with coordinates (x1, x2).
One functional from the dual space 2 is f2(x ) = x2, i.e., taking the second coordinate of the
position vector. The linearity property is readily veried. For x , y 2, a, b ,
f2(a x + b y ) = ax2 + by2 = af2(x ) + bf2( y ).
Given some constant value h , the curves within the plane dened by f2(x ) = h are called the
contour lines or level sets of f2. Several contour lines and position vectors are shown in Figure
1.3. The utility of functionals and dual spaces can be shown by considering a simple example
from physics. Assume that x2 is the height above ground level and a vector x is the displacement
of a body of mass m in a gravitational eld. The mechanical work done to li the body from
ground level to height h is W = mgh with g the gravitational acceleration. The mechanical work
is the same for all displacements x that satisfy the equation f2(x ) = h. The work expressed in
units mg h can be interpreted as the number of contour lines f2(x ) = n h intersected by the
displacement vector x . This concept of duality between vectors and scalar-valued functionals
arises throughout mathematics, the physical and social sciences and in data science. The term
duality itself comes from geometry. A point X in 2 with coordinates (x 1, x 2) can be dened
either as the end-point of the position vector x , or as the intersection of the contour lines of two
funtionals f1(x ) = x1 and f2(x ) = x2. Either geometric description works equally well in specifying
the position of X , so it might seem redundant to have two such procedures. It turns out though
that many quantities of interest in applications can be dened through use of both descriptions,
as shown in the computation of mechanical work in a gravitational eld.
1.0
0.8
0.6
0.4
0.2
2. Measurements
Vectors within the real space m can be completely specied by m real numbers, even though
m is large in many realistic applications. A vector within 0(), i.e., a continuous function
dened on the reals, cannot be so specied since it would require an innite, non-countable
listing of function values. In either case, the task of describing the elements of a vector space
± = (V , S ,+, Å) by simpler means arises. Within data science this leads to classication problems
in accordance with some relevant criteria.
2.1. Equivalence classes
Many classication criteria are scalars, dened as a scalar-valued function f : ± S on a vector
space, ± = (V , S , +, Å). The most common criteria are inspired by experience with Euclidean
space. In a Euclidean-Cartesian model (2, , +, Å) of the geometry of a plane , a point O
is arbitrarily chosen to correspond to the zero vector 0 = [ 0 0 ]T , along with two preferred
vectors e1, e2 grouped together into the identity matrix I . The position of a point X with
respect to O is given by the linear combination
x = Ix + 0 = [ e1 e 2 ][[ xx1 ]] = x1 e 1 + x2 e 2 .
2
Several possible classications of points in the plane are depicted in Figure 1.4: lines, squares,
circles. Intuitively, each choice separates the plane into subsets, and a given point in the plane
belongs to just one in the chosen family of subsets. A more precise characterization is given by
the concept of a partition of a set.
DEFINITION. (PARTITION) . A partition of a set is a grouping of its elements into non-empty subsets
such that every element is included in exactly one subset.
LINEAR MAPPINGS 27
-1.0 -0.5 0.5 1.0 -1.0 -0.5 0.5 1.0 -1.0 -0.5 0.5 1.0 -1.0 -0.5 0.5 1.0
Null vector case. Provision must be made for the only distinguished element of V , the null
vector 0. It is natural to associate the null vector with the null scalar element, 0 =0. A
crucial additional property is also imposed namely that the null vector is the only vector
whose norm is zero, v = 0 Ò v = 0. From knowledge of a single scalar value, an entire
vector can be determined. This property arises at key junctures in linear algebra, notably
in providing a link to another branch of mathematics known as analysis, and is needed to
establish the fundamental theorem of linear algbera or the singular value decomposition
encountered later.
Scaling. Transfer of the scaling operation v = a u property leads to imposing v = |a| u . This
property ensures commensurability of vectors, meaning that the magnitude of vector v
can be expressed as a multiple of some standard vector magnitude u .
Vector addition. Position vectors from the origin to coordinates x , y >0 on the real line can
be added and |x + y | = |x | + |y |. If however the position vectors point in di erent directions,
x > 0, y < 0, then |x + y | < |x | + |y |. For a general vector space the analogous property is
known as the triangle inequality , u + v u + v for u , v V .
DEFINITION. (NORM) . A norm on the vector space ± = (V , S ,+, Å) is a function : V + that for
u , v V, a S satises:
1. v =0 Ò v = 0;
2. a u = |a| u ;
3. u + v u + v .
Note that the norm is a functional, but the triangle inequality implies that it is not generally
a linear functional. Returning to Figure 1.4, consider the functions fi : 2 + dened for x =
[ x 1 x 2 ]T through values
f1(x ) = |x1|, f2(x ) = |x2|, f3(x ) = |x1| + |x2|, f4(x ) = (|x1|2 + |x2|2)1/2.
Sets of constant value of the above functions are also equivalence classes induced by the equiv-
alence relations Ei for i =1,2,3,4.
1. f1(x ) = c Ò |x1| = c , E1 = {(x , y )| f1(x ) = f1( y ) Ô |x1| = |y1| } 2 × 2;
2. f2(x ) = c Ò |x2| = c , E2 = {(x , y )| f2(x ) = f2( y ) Ô |x2| = |y2| } 2 × 2;
3. f3(x ) = c Ò |x1| + |x2| = c , E3 = {(x , y )| f3(x ) = f3( y ) Ô |x1| + |x2| = |y1| + |y2| } 2 × 2;
4. f4(x ) = c Ò (|x1|2 + |x2|2)1/2 = c , E4 = {(x , y )| f4(x ) = f4( y ) Ô (|x1|2 + |x2|2)1/2 = (|y1|2 + |y2|2)1/2 }
2 × 2.
These equivalence classes correspond to the vertical lines, horizontal lines, squares, and circles
of Figure 1.4. Not all of the functions fi are norms since f1(x ) is zero for the non-null vector
x = [ 0 1 ]T , and f2(x )is zero for the non-null vector x = [ 1 0 ]T . The functions f3 and f4 are
indeed norms, and specic cases of the following general norm.
DEFINITION. (p-NORM IN m ) . The p-norm on the real vector space m = (m , ,+, Å) for p 1 is the
function p: V + with values x p = (|x1|p + |x2|p + Å Å Å + |xm |p)1/p, or
m 1/p
x = ((y |xi|p)) for x m .
p (1.23)
i =1
LINEAR MAPPINGS 29
The integration operation +ab can be intuitively interpreted as the value of the sum mi=1 from
equation (1.23) for very large m and very closely spaced evaluation points of the function f (xi),
for instance |xi+1 xi| = (b a) / m. An inf-norm can also be dene for continuous functions by
f = sup | f (x )|,
x [a ,b]
where sup, the supremum operation can be intuitively understood as the generalization of the
max operation over the countable set {1,2,..., m} to the uncountable set [a, b ].
1.0 1.0 1.0 1.0
Vector norms arise very oen in applications, especially in data science since they can be used
to classify data, and are implemented in soware systems such as Octave in which the norm
function with a single argument computes the most commonly encountered norm, the 2-norm.
If a second argument p is specied the p-norm is computed.
octave] x=[1; 1; 1]; disp([norm(x) sqrt(3)])
1.7321 1.7321
3 3
30 LINEAR COMBINATIONS
4 4
octave]
1.0
0.8
0.6
0.4
0.2
In general, any linear functional f dened on the real space m can be labeled by a vector
a T = [ a 1 a 2 ... a m ],
LINEAR MAPPINGS 31
and evaluated through the matrix-vector product f (x ) = a T x . This suggests the denition of
another function s: m × m ,
s (a , x ) = a T x .
The function s is called an inner product , has two vector arguments from which a matrix-vector
product is formed and returns a scalar value, hence is also called a scalar product . The denition
from an Euclidean space can be extended to general vector spaces. For now, consider the eld
of scalars to be the reals S = .
DEFINITION. (INNER PRODUCT) . An inner product in the vector space ± = (V , ,+, Å) is a function s:
V × V with properties
Symmetry. For any a , x V, s(a , x ) = s(x , a ).
Linearity in second argument. For any a , x , y V, ± , ² , s(a , ±x + ² y ) = ± s(a , x ) + ² s(a , y ).
Positive deniteness. For any x V \{0}, s(x , x ) >0.
The inner product s(a , x ) returns the number of level sets of the functional labeled by a crossed
by the vector x , and this interpretation underlies many applications in the sciences as in the
gravitational eld example above. Inner products also provide a procedure to evaluate geomet-
rical quantities and relationships.
Vector norm. In m the number of level sets of the functional labeled by x crossed by x
itself is identical to the square of the 2-norm
s(x , x ) = x Tx = x 22 .
In general, the square root of s(x , x ) satises the properties of a norm, and is called the
norm induced by an inner product
x = s (x , x ) .
1/2
A real space together with the scalar product s(x , y ) = x Ty and induced norm x = s(x ,
x )1/2 denes an Euclidean vector space 0m .
Orientation. In 02 the point specied by polar coordinates (r , ¸ ) has the Cartesian coordi-
nates x1 = r cos ¸ , x2 = r sin ¸ , and position vector x = [ x1 x2 ]T . The inner product
e1T x = [ 1 0 ] [ xx1 ] =1 Å x1 +0 Å x2 = r cos ¸ ,
2
in a Euclidean space, x , y m .
32 LINEAR COMBINATIONS
Orthogonality. In 02 two vectors are orthogonal if the angle between them is such that
cos ¸ =0, and this can be extended to an arbitrary vector space ± = (V , ,+, Å) with a scalar
product by stating that x , y V are orthogonal if s(x , y ) = 0. In 0m vectors x , y m are
orthogonal if x T y =0.
3. Linear mapping composition
3.1. Matrix-matrix product
From two functions f : A B and g: B C , a composite function, h = g f , h: A C is dened by
h(x ) = g ( f (x )).
Consider linear mappings between Euclidean spaces f : n m , g : m p. Recall that linear
mappings between Euclidean spaces are expressed as matrix vector multiplication
f (x ) = Ax , g ( y ) = By , A m ×n, B p×m .
The composite function h = g f is h : n p, dened by
h (x ) = g (f (x )) = g (Ax ) = BAx .
Note that the intemediate vector u = Ax is subsequently multiplied by the matrix B . The com-
posite function h is itself a linear mapping
h (a x + b y ) = BA (a x + b y ) = B (a A x + bA y ) = B (a u + b v ) = a Bu + b Bv = a BAx + b BAy = a h (x ) +
b h ( y ),
so it also can be expressed a matrix-vector multiplication
h (x ) = Cx = BAx . (1.25)
Using the above, C is dened as the product of matrix B with matrix A
C = BA .
The columns of C can be determined from those of A by considering the action of h on the the
column vectors of the identity matrix I = [ e1 e 2 ... en ] n×n. First, note that
[[ ]]
1
[[ ÅÅ ]]
0
[[ ÅÅ ]]
0
[[ ]] [[ Å ]] [[ Å ]]
][ ÅÅÅ ] ][ ] ][ ÅÅÅ ]
0
Ae j = [ a1 a2 ... a n
[[ ÅÅ ]] = a1 , . . . , Ae j = [ a1 a2 ... a n
[[ ÅÅ ]]
1 = a Ae
j, n = [ a1 a2 ... a n
[[ ]] =
[[ Å ]]
0
[[ Å ]]
0
[[ ]]
0
a n. (1.26)
The above can be repeated for the matrix C = [ c 1 c2 ... c n ] giving
h (e1) = Ce1 = c1,..., h (ej ) = Ce j = cj ,..., h (en) = Cen = cn. (1.27)
Combining the above equations leads to cj = Ba j , or
C = [ c1 c 2 ... cn ] = B [ a 1 a 2 ... a n ].
LINEAR MAPPINGS 33
From the above the matrix-matrix product C = BA is seen to simply be a grouping of all the
products of B with the column vectors of A ,
C = [ c 1 c2 ... c n ] = [B a 1 Ba 2 ... Ba n ]
Matrix-vector and matrix-matrix products are implemented in Octave, the above results can
readily be veried.
octave] a1=[1; 2]; a2=[3; 4]; A=[a1 a2]
A =
1 3
2 4
B =
-1 2
1 -2
3 3
octave] C=B*A
C =
3 5
-3 -5
9 21
3 5
-3 -5
9 21
octave]
CHAPTER 2
VECTOR SPACES
FORMAL RULES
1. Algebraic structures
A vector space has been introduced as a 4-tuple ± = (V , S , +, Å) with specic behavior of the vector addition and
scaling operations. Arithmetic operations between scalars were implicitly assumed to be similar to those of the
real numbers, but also must be specied to obtain a complete denition of a vector space. Algebra denes various
structures that specify the behavior operations with objects. Knowledge of these structures is useful not only in
linear algebra, but also in other mathematical approaches to data analysis such as topology or geometry.
Groups. A group is a 2-tuple ¢ = (G, +) containing a group.
set G and an operation + with properties from Table 2.2. Addition rules
If a, b G, a + b = b + a, the group is said to be commu- a+b G Closure
tative. Besides the familiar example of integers under a + (b + c ) = (a + b ) + c Associativity
addition ($, +), symmetry groups that specify spatial 0+ a = a Identity element
or functional relations are of particular interest. The a + (a) =0 Inverse element
rotations by 0, À , À , À or vertices of a square form a
2
3
2
Table 2.1. Group ¢ = (G ,+) properties, for a , b, c G
Fields. A ring is a 3-tuple 1 = (F ,+, Å) containing a set space must satisfy the properties of a eld. Since the
F and two operations +, Å, each with properties of a operations are oen understood from context a eld
commutative group, but with special behavior for the might be referred to as the full 3 tuple, or, more con-
inverse of the null element. The multiplicative inverse cisely just through the set of elements as in the deni-
is denoted as a . Scalars S in the denition of a vector
1
tion of a vector space.
35
36 VECTOR SPACES
Addition rules
(F ,+) is a commutative (Abelian) group
Multiplication rules
(F , Å) is a commutative group except
that 0 does not exist
1
Distributivity
a Å (b + c ) = (a Å b ) + (a Å c )
Table 2.3. Field = (F , +, Å) properties, for a , b, c F .
Using the above denitions, a vector space ± = (V , S ,+, Å) can be described as a commutative group (V ,+) combined
with a eld S that satises the scaling properties a u V , a(u + v ) = a u + a v , (a + b ) u = a u + b u , a(b u ) = (ab ) u , 1 u = u ,
for a, b S , u , v V .
The above states a vector subspace must be closed under linear combination, and have the same vector addition
and scaling operations as the enclosing vector space. The simplest vector subspace of a vector space is the null
subspace that only contains the null element, U = {0}. In fact any subspace must contain the null element 0, or
otherwise closure would not be veried for the particular linear combination u + (u ) = 0. If U V , then ° is said
to be a proper subspace of ±, denoted by ° < ±.
Setting n m components equal to zero in the real space m denes a proper subspace whose elements can be
placed into a one-to-one correspondence with the vectors within n. For example, setting component m of x m
equal to zero gives x = [ x x ... xm 0 ]T that while not a member of m , is in a one-to-one relation with
1 2 1
1
x = [ x x ... xm ]T m . Dropping the last component of y m, y = [ y y ... ym ym ]T gives vector y =
¹ 1 2 1
1
1 2 1
¹
[ y y ... ym ] m , but this is no longer a one-to-one correspondence since for some given y , the last com-
1 2 1
1
¹
1
2
octave] y=[1; 2; 3]; yp=y(1:2); disp(yp)
1
2
octave]
FORMAL RULES 37
Vector subspaces arise in decomposition of a vector space. The converse, composition of vector spaces ° = (U ,
S , +, Å) ± = (V , S , +, Å) is also dened in terms of linear combination. A vector x can be obtained as the linear
3
combination
x x 0
x = [[[ x ]]] = [[[ 0 ]]] + [[[ x ]]],
1 1
[[ x ]] [[ 0 ]] [[ x ]]
2
3
2
but also as
x x 0
x = [[[ x ]]] = [[[ x a ]]] + [[[ a ]]],
1 1
[x ] [ 0 ] [x ]
2
3
2
for some arbitrary a . In the rst case, x is obtained as a unique linear combination of a vector from the set
U = ¡ x 0 0 ¢T | x with a vector from V = {[ 0 x x ]T | x , x }. In the second case, there is an innity of
1 1 2 3 2 3
linear combinations of a vector from V with another from W = ¡ x x 0 ¢T | x , x to the vector x . This is
1 2 1 2
octave] u=[1; 0; 0]; v=[0; 2; 3]; vp=[0; 1; 3]; w=[1; 1; 0]; disp([u+v vp+w])
1 1
2 2
3 3
octave]
In the previous example, the essential di erence between the two ways to express x is that U ) V = {0}, but
3
V ) W = {[ 0 a 0 ]T | a } ` {0}, and in general if the zero vector is the only common element of two vector spaces
then the sum of the vector spaces becomes a direct sum. In practice, the most important procedure to construct
direct sums or check when an intersection of two vector subspaces reduces to the zero vector is through an inner
product.
DEFINITION. Two vector subspaces U , V of the real vector space m are orthogonal, denoted as U ¥V if u Tv =0 for any
u U , v V.
DEFINITION. Two vector subspaces U,V of U + V are orthogonal complements, denoted U = V ¥, V = U ¥ if they are orthog-
onal subspaces, U ¥V, and U ) V = {0}, i.e., the null vector is the only common element of both subspaces.
0 1
octave]
38 VECTOR SPACES
The above concept of orthogonality can be extended to other vector subspaces, such as spaces of functions. It can
also be extended to other choices of an inner product, in which case the term conjugate vector spaces is sometimes
used.
The concepts of sum and direct sum of vector spaces used linear combinations of the form u + v . This notion can
be extended to arbitrary linear combinations.
DEFINITION. In vector space ± = (V , S ,+, Å), the span of vectors a , a ,..., a n V , is the set of vectors reachable by linear
1 2
combination
span{a , a ,..., an} = {b V | x ,..., xn S suchthat b = x a + ... + xn an}.
1 2 1 1 1
Note that for real vector spaces a member of the span of the vectors {a , a ,..., an} is the vector b obtained from the
1 2
[[ xÅ ]
n
From the above, the span is a subset of the co-domain of the linear mapping f (x ) = Ax .
The wide-ranging utility of linear algebra essentially results a complete characterization of the behavior of a linear
mapping between vector spaces f : U V , f (a u + b v ) = a f (u ) + b f (v ). For some given linear mapping the questions
that arise are:
1. Can any vector within V be obtained by evaluation of f ?
2. Is there a single way that a vector within V can be obtained by evaluation of f ?
Linear mappings between real vector spaces f : n m, have been seen to be completely specied by a matrix
A m n. It is common to frame the above questions about the behavior of the linear mapping f (x ) = Ax through
×
sets associated with the matrix A . To frame an answer to the first question, a set of reachable vectors is first defined.
DEFINITION. The column space (or range) of matrix A m n is the set of vectors reachable by linear combination of the
×
By denition, the column space is included in the co-domain of the function f (x ) = Ax , C (A ) m, and is readily
seen to be a vector subspace of m. The question that arises is whether the column space is the entire co-domain
C (A ) = m that would signify that any vector can be reached by linear combination. If this is not the case then the
column space would be a proper subset, C (A ) m, and the question is to determine what part of the co-domain
cannot be reached by linear combination of columns of A . Consider the orthogonal complement of C (A ) dened
as the set vectors orthogonal to all of the column vectors of A , expressed through inner products as
a T y =0, a T y =0,..., a nT y =0.
1 2
FORMAL RULES 39
[[ ÅT ]] [[ TÅ ]]
[ ] [ ]an an y
and leads to the denition of a set of vectors for which ATy = 0
DEFINITION. The le null space (or cokernel) of a matrix A m n is the set ×
Note that the le null space is also a vector subspace of the co-domain of f (x ) = Ax , N (AT ) m. The above
denitions suggest that both the matrix and its transpose play a role in characterizing the behavior of the linear
mapping f = Ax , so analagous sets are dene for the transpose AT .
N (A ) =null(A ) = {x n|Ax = 0} n
Examples. Consider a linear mapping between real spaces f : n m, dened by y = f (x ) = Ax = [ y ... yn ]T , 1
with A m n.
×
octave]
the column space C (A ) is the y -axis, and the the columns of A are colinear, a = a , and the
2 1
tors that span these spaces are returned by the null space N (AT ) is the y y -plane, as before.
2 3
-1 -1.00000
-0 -0.00000
-0 -0.00000
----- -----
0 0 0 0
1 0 1 0
0 1 0 1
40 VECTOR SPACES
10 -0.00000 -0.00000
-1 -0
-0 -0 AT = [[ 3 1 3 ]] [[[ T ]]] [[ T ]]
[a y]
2 2
----- a 3 3
0
0
since a 3 = a 1 + 2 a 2, the orthogonality condi-
1
tion AT y = 0 is satised by vectors of form y =
octave] [ a 0 a ], a .
0.69157 0.14741
1 1
A = [[[ 1 1 ]]], AT = [[ 11 11 00 ]],
-0.20847 0.97803
[[ 0 0 ]] 0.69157 0.14741
-----
0.70711
0.00000
the same C (A ), N (AT ) are obtained, albeit with -0.70711
a di erent set of spanning vectors returned by
orth. octave]
The above low dimensional examples are useful to gain initial insight into the significance of the spaces C (A ), N (AT ).
Further appreciation can be gained by applying the same concepts to processing of images. A gray-scale image
of size px by py pixels can be represented as a vector with m = px py components, b [0, 1]m m. Even for a small
image with px = py = 128 = 2 pixels along each direction, the vector b would have m = 2 components. An image
7 14
[[[ bb ]]]
1
[ bm ]
with bi the gray-level intensity in pixel i. Similar to the inclined plane example from §1, an alternative description
as a linear combination of another set of vectors a ,..., a m might be more relevant. One choice of greater utility for
1
DATA REDUNDANCY 41
image processing mimics the behavior of the set {1,cos t ,cos2t , ...,sin t ,sin2t , ...} that extends the second example
in §1, would be for m =4
[[[ 11 1
1
1
0
0]
1 ]]].
A = [ a a a a ] = [[ 1 1 ]]]
1 2
[[ 3 4
0 1
[1 0 0 0]
DATA REDUNDANCY
1. Linear dependence
For the simple scalar mapping f : , f (x ) = ax , the condition f (x ) =0 implies either that a =0 or x =0. Note that
a =0 can be understood as dening a zero mapping f (x ) =0. Linear mappings between vector spaces, f : U V , can
exhibit di erent behavior, and the condtion f (x ) = Ax = 0, might be satised for both x ` 0, and A ` 0. Analogous to
the scalar case, A = 0 can be understood as dening a zero mapping, f (x ) = 0.
In vector space ± = (V , S ,+, Å), vectors u , v V related by a scaling operation, v = a u , a S , are said to be colinear,
and are considered to contain redundant data. This can be restated as v span{u }, from which it results that
span{u } =span{u , v }. Colinearity can be expressed only in terms of vector scaling, but other types of redundancy
arise when also considering vector addition as expressed by the span of a vector set. Assuming that v span{u },
then the strict inclusion relation span{u } span{u , v } holds. This strict inclusion expressed in terms of set concepts
can be transcribed into an algebraic condition.
DEFINITION. The vectors a , a ,..., an V , are linearly dependent if there exist n scalars, x ,..., xn S, at least one of which
1 2 1
A = [ a a ... a n ]; x = [[ ÅÅ ]]
[[ Å ]]
2
1 2
[ xn ]
allows restating linear dependence as the existence of a non-zero vector, x ` 0, such that Ax = 0. Linear dependence
can also be written as Ax = 0 Ï x = 0, or that one cannot deduce from the fact that the linear mapping f (x ) = Ax
attains a zero value that the argument itself is zero. The converse of this statement would be that the only way to
ensure Ax = 0 is for x = 0, or Ax = 0 Ò x = 0, leading to the concept of linear independence.
DEFINITION. The vectors a , a ,..., a n V , are linearly independent if the only n scalars, x ,..., xn S, that satisfy
1 2 1
x a + ... + xnan = 0,
1 1 (2.1)
are x =0, x =0,...,xn =0.
1 2
element = , *{b } would be redundant since span , = span . This is recognized by the concept of a basis, and
also allows leads to a characterization of the size of a vector space by the cardinality of a basis set.
42 VECTOR SPACES
2. span{u ,..., u n} = V.
1
DEFINITION. The number of vectors u ,..., un V within a basis is the dimension of the vector space ± = (V , S ,+, Å).
1
been dened:
DEFINITION. The rank of a matrix A m n is the dimension of its column space and is equal to the dimension of its row
×
space.
DATA INFORMATION
1. Partition of linear mapping domain and codomain
A partition of a set S has been introduced as a collection of subsets P = {Si | Si P , Si ` } such that any given element
x S belongs to only one set in the partition. This is modied when applied to subspaces of a vector space, and a
partition of a set of vectors is understood as a collection of subsets such that any vector except 0 belongs to only
one member of the partition.
Linear mappings between vector spaces f : U V can be represented by matrices A with columns that are images
of the columns of a basis {u , u ,... } of U
1 2
A = [ f (u ) f (u ) ... ].
1 2
Consider the case of real nite-dimensional domain and co-domain, f : n m, in which case A m n, ×
The column space of A is a vector subspace of the codomain, C (A ) d m, but according to the denition of dimen-
sion if n < m there remain non-zero vectors within the codomain that are outside the range of A ,
n < m Ò v m, v ` 0, v C (A ).
All of the non-zero vectors in N (AT ), namely the set of vectors orthogonal to all columns in A fall into this category.
The above considerations can be stated as
C (A ) d m, N (AT ) d m, C (A )¥N (AT ) C (A ) + N (AT ) d m .
The question that arises is whether there remain any non-zero vectors in the codomain that are not part of C (A )
or N (AT ). The fundamental theorem of linear algebra states that there no such vectors, that C (A ) is the orthogonal
complement of N (AT ), and their direct sum covers the entire codomain C (A ) N (AT ) = m.
LEMMA 3.1. Let °, ±, be subspaces of vector space ². Then ² = ° ± if and only if
i. ² = ° + ±, and
ii. ° ) ± = {0}.
Proof.² = ° ± Ò ² = ° + ± by denition of direct sum, sum of vector subspaces. To prove that ² = ° ± Ò
° ) ± = {0}, consider w ° ) ±. Since w ° and w ± write
w = w + 0 (w °, 0 ±), w = 0 + w (0 °, w ±),
and since expression w = u + v is unique, it results that w = 0. Now assume (i),(ii) and establish an unique decomposition.
Assume there might be two decompositions of w ², w = u + v , w = u + v , with u , u °, v , v ±. Obtain u + v =
u + v , or x = u u = v v . Since x ° and x ± it results that x = 0, and u = u , v = v , i.e., the decomposition is
1 1 2 2 1 2 1 2 1 1
2 2 1 2 2 1 1 2 1 2
unique. ¡
43
44 FUNDAMENTAL THEOREM OF LINEAR ALGEBRA
In the vector space U + V the subspaces U , V are said to be orthogonal complements is U ¥V , and U ) V = {0}. When
U d m, the orthogonal complement of U is denoted as U ¥, U U ¥ = m.
THEOREM. Given the linear mapping associated with matrix A m n we have: ×
1. C (A ) N (AT ) = m, the direct sum of the column space and le null space is the codomain of the mapping
2. C (AT ) N (A ) = n, the direct sum of the row space and null space is the domain of the mapping
3. C (A )¥N (AT ) and C (A ) ) N (AT ) = {0}, the column space is orthogonal to the le null space, and they are
orthogonal complements of one another,
C (A ) = N (AT )¥, N (AT ) = C (A )¥ .
4. C (AT )¥N (A ) and C (AT ) ) N (A ) = {0}, the row space is orthogonal to the null space, and they are orthogonal
complements of one another,
C (AT ) = N (A )¥, N (A ) = C (AT )¥ .
Figure 3.1. Graphical represenation of the Fundamental Theorem of Linear Algebra, Gil Strang, Amer. Math. Monthly , 848-855,
100
1993.
Consideration of equality between sets arises in proving the above theorem. A standard technique to show set
equality A = B, is by double inclusion, A B ' B A Ò A = B. This is shown for the statements giving the decomposi-
tion of the codomain m. A similar approach can be used to decomposition of n.
ii. C (A ) ) N (AT ) = {0} (0 is the only vector both in C (A ) and N (AT )).
DATA PARTITIONING 45
Proof. (By contradiction, reductio ad absurdum). Assume there might be b C (A ) and b N (AT ) and b ` 0.
Since b C (A ), x n such that b = Ax . Since b N (AT ), ATb = AT (Ax ) = 0. Note that x ` 0 since x =0 Ò b =0,
contradicting assumptions. Multiply equality AT Ax = 0 on le by x T ,
iii. C (A ) N (AT ) = m
Proof. (iii) and (iv) have established that C (A ), N (AT ) are orthogonal complements
C (A ) = N (AT )¥, N (AT ) = C (A )¥.
The remainder of the FTLA is established by considering B = AT , e.g., since it has been established in (v) that C (B )
N (AT ) = n, replacing B = AT yields C (AT ) N (A ) = m, etc.
DATA PARTITIONING
1. Mappings as data
1.1. Vector spaces of mappings and matrix representations
A vector space can be formed from all linear mappings from the vector space ° = (U , S , +, Å) to another vector
space ± = (V , S ,+, Å)
= {L, S ,+, Å}, L = {f | f : U V , f (a u + b v ) = af (u ) + bf (v )},
with addition and scaling of linear mappings dened by (f + g )(u ) = f (u ) + g (u ) and (a f )(u ) = a f (u ). Let B =
{u , u , . . . } denote a basis for the domain U of linear mappings within , such that the linear mapping f is
1 2
When the domain and codomain are the real vector spaces U = n, V = m, the above is a standard matrix of real
numbers, A m n. For linear mappings between innite dimensional vector spaces the matrix is understood in a
×
generalized sense to contain an innite number of columns that are elements of the codomain V . For example, the
indenite integral is a linear mapping between the vector space of functions that allow di erentiation to any order,
5 : v (x ) = 5 u(x ) dx
and for the monomial basis B = {1, x , x ,... }, is represented by the generalized matrix
2
A=Í x x 1
2
2
1
3
x ... Î.3
Truncation of the basis expansion u(x ) = j uj x j where uj to n terms, and sampling of u at points x ,...,
=1 1
2 3
xm
As to be expected, matrices can also be organized as vector space 3, which is essentially the representation of the
associated vector space of linear mappings,
The addition C = A + B and scaling S = a R of matrices is given in terms of the matrix components by
Consider rst the case of nite matrices with real components A m n that represent linear mappings between ×
real vector spaces f : m n. The columns a ,..., an of A m n could be placed into a single column vector c with
1
×
mn components
a
c = [[[ ÅÅÅ ]]].
1
[ an ]
Subsequently the norm of the matrix A could be dened as the norm of the vector c . An example of this approach
is the Frobenius norm
m n
A = c = (((y y |a ij | ))) .
1/2
F
(i )
2
2
=1 j =1
DATA PARTITIONING 47
A drawback of the above approach is that the structure of the matrix and its close relationship to a linear mapping
is lost. A more useful characterization of the size of a mapping is to consider the amplication behavior of linear
mapping. The motivation is readily understood starting from linear mappings between the reals f : , that are
of the form f (x ) = ax . When given an argument of unit magnitude |x | =1, the mapping returns a real number with
magnitude |a|. For mappings f : within the plane, arguments that satisfy x =1 are on the unit circle with
2 2
2
f (x ) = Ax = [ a a
1 2 ] [[ cos ¸
sin ¸ ]] =cos ¸ a +sin ¸ a ,
1 2
-1
-2
-3
-3 -2 -1 0 1 2 3
From the above the mapping associated A amplies some directions more than others. This suggests a denition
of the size of a matrix or a mapping by the maximal amplication unit norm vectors within the domain.
f = sup f (x )V .
x
U =1
DEFINITION. For vector spaces n, m with norms (n): U , (m): V , the induced norm of matrix A m n is
+ +
×
A = sup Ax (m).
(n )
x
=1
48 FUNDAMENTAL THEOREM OF LINEAR ALGEBRA
In the above, any vector norm can be used within the domain and codomain.
The dimension of the column and row spaces r =dim C (A ) =dim C (AT ) is the rank of the matrix, n r is the nullity
of A , and m r is the nullity of AT . A innite number of bases could be dened for the domain and codomain. It is
of great theoretical and practical interest bases with properties that faciliatate insight or computation.
The above partitions of the domain and codomain are orthogonal, and suggest searching for orthogonal bases
within these subspaces. Introduce a matrix representation for the bases
U = [ u u ... u m ] m m, V = [ v v ... v n ] n n,
1 2
×
1 2
×
with C (U ) = m and C (V ) = n. Orthogonality between columns u i , uj for i ` j is expressed as u iT u j =0. For i = j , the
inner product is positive u iT ui >0, and since scaling of the columns of U preserves the spanning property C (U ) = m,
it is convenient to impose uiT u i =1. Such behavior is concisely expressed as a matrix product
U TU = I m,
with Im the identity matrix in m. Expanded in terms of the column vectors of U the rst equality is
[[ u Tu u Tu ... u Tu m ]]
T T T T
[[ u T ]] 1 1 1 1 2 1
[ u u ... u m ]T [ u u ... u m ] = [[[[ uÅÅ ]]]][ u u ... um ] = [[[[ u ÅÅu u ÅÅu Å...Å u ÅÅu m ]]]] = I m.
2 2 1 2 2 2
m m m 1 m
2
UX = U [ x x ... x m ] = [ e e ... em ].
1 2 1 2
DATA PARTITIONING 49
The columns of X are the coordinates of the column vectors of I m in the basis U , and can readily be determined
T
[[ u T ]]
Ux j = ej Ò U T Ux j = U T ej Ò I m x j = [[[ uÅÅ ]]] ej Ò x j = (U T )j ,
1
[[ Å ]]
2
[ u mT ]
U TU = I = UU T .
Note that the second equality
T
[[[ u T ]]] 1
[[ T ]]
1 2 1 2 1 2 1 1 2 2
um
A = [[ 20 00 ]]
and given by
[[ Ã0 0 ... 0 0 ... 0
à ... 0 0 ... 0 ]]]
1
[[ ÅÅ ]]
[[ Å
2
ÅÅ Å Å 0 ÅÅ Å Å ÅÅ
0 ... Ãr 0 ... 0 ]]]].
Å Å Å Å Å
£ = [[[ 0
[[[ 0 0 ... 0 0 ... 0 ]]]
[[[ ÅÅÅ ÅÅ Å Å ÅÅ ÅÅ Å Å ÅÅ ]
Å Å Å Å Å]
0 ... 0 0 ... 0 ]]
Å
[0
Imposing the condition that U , V are orthogonal leads to
Us = y Ò s = U T y , Vr = x Ò r = V Tx ,
50 FUNDAMENTAL THEOREM OF LINEAR ALGEBRA
From the above the orthogonal bases U , V and scaling coecients £ that are sought must satisfy A = U £ V T .
A = U £V T ,
with properties:
1. U m m is an orthogonal matrix, U TU = I m;
×
2. V m m is an orthogonal matrix, V TV = I n;
×
Proof. The proof of the SVD makes use of properties of the norm, concepts from analysis and complete induction.
Adopting the 2-norm set à = A ,
1 2
à = sup Ax .
1 2
x
2 =1
The domain x = 1 is compact (closed and bounded), and the extreme value theorem implies that f (x ) = Ax attains
2
its maxima and minima, hence there must exist some vectors u , v of unit norm such that à u = Av Ò Ã = u T Av .
1 1 1 1 1 1 1 1
Introduce orthogonal bases U , V for m, n whose rst column vectors are u , v , and compute
1 1 1 1
T
U TAV [u ]
= [[[ ÅÅÅ ]]][ Av ... A vn ] = [[ Ã0 wB ] = C .
1 T
1
1 1
[ umT ] 1
In the above w T is a row vector with n 1 components u T Av j, j =2,..., n, and u iT Av must be zero for u to be the direction
1 1 1
y = [[ Ãw ]], z = Cy = [[ Ã +Bw
w Tw ], 2
]
1 1
U TAV = [[ Ã 0T ] = [ 1 0T ][ Ã 0T ][ 1 0T ],
0 U £ V T ] [ 0 U ][ 0 £ ][ 0 V T ]
1 1
1 1
2 2 2 2 2 2
and the orthogonal matrices arising in the singular value decomposition of A are
The scaling coecients Ãj are called the singular values of A . The columns of U are called the le singular vectors ,
and those of V are called the right singular vectors .
The fact that the scaling coecients are norms of A and submatrices of A , Ã = A , is crucial importance in appli-
1
T
à 0 ... 0 0 ... 0 [ v ]]]
1
[
[[[ 0 Ã ... 0 0 ... 0 ]]][[ v T 1
] [ Ã vT ]
2
[
[[ ÅÅ ÅÅ Å Å 0 ÅÅ Å Å ÅÅ ]][[[ ÅÅÅ ]]] 2
[
[[ Ã v T ]]]
1 1
[[ Å Å Å Å Å Å ]][ T ] [ Å ]
... u m ][[ 0 0 ... Ãr 0 ... 0 ]][[ v r ]] = [ u u ... u r ur ... u m ][[[ ÅÅ T ]]]
2 2
A = [ u u ... ur u r
1 2 +1
[[ 0 0 ... 0 0 ... 0 ]][[ v rT ]] 1 2
[[ Ãrv r ]]
+1
[ vn ]
leads to a representation of A as a sum
r
A = y Ãi u iv iT , r min (m, n).
i =1
A = Ã u v T + Ã u v T + Å Å Å + Ãru rv rT
1 1 1 2 2 2
Each product u iv iT is a matrix of rank one, and is called a rank-one update. Truncation of the above sum to p terms
leads to an approximation of A
p
A E Ap = y Ãi uiv iT .
i =1
In very many cases the singular values exhibit rapid, exponential decay, à 1 k à k ŠŠŠ, such that the approximation
2
? ? ?
Figure 3.2. Successive SVD approximations of Andy Warhol's painting, Marilyn Diptych (~1960), with k = 10,20, 40 rank-one updates.
52 FUNDAMENTAL THEOREM OF LINEAR ALGEBRA
P = 122500
octave] q1=(2*m+1)*10
q1 = 7010
octave] q2=(2*m+1)*20
q2 = 14020
octave] q3=(2*m+1)*40
q3 = 28040
octave]
(U £ V T ) x = b Ò U (£ V T x ) = b Ò (£ V T x ) = U T b Ò V T x = £ U T b Ò x = V £ U T b ,
+ +
where the matrix £ n m (notice the inversion of dimensions) is dened as a matrix with elements Ãi on the
+ × 1
that allows stating the solution of Ax = b simply as x = A b is called the pseudo-inverse of A . Note that in practice
+
A is not explicitly formed. Rather the notation A is simply a concise reference to carrying out steps 1-4 above.
+ +
CHAPTER 4
LEAST SQUARES
DATA COMPRESSION
A typical scenario in many sciences is acquisition of m numbers to describe some object that is understood to
actually require only n j m parameters. For example, m voltage measurements ui of an alternating current could
readily be reduced to three parameters, the amplitude, phase and frequency u(t ) = a sin(Ét + Æ ). Very oen a simple
rst-degree polynomial approximation y = ax + b is sought for a large data set D = {(xi , yi ), i =1,..., m}. All of these
are instances of data compression, a problem that can be solved in a linear algebra framework.
1. Projection
Consider a partition of a vector space U into orthogonal subspaces U = V W , V = W ¥, W = V ¥. Within the typical
scenario described above U = m, V m, W m, dim V = n, dim W = m n. If V = [ v ... v n ] m n is a basis for V
1
×
and W = [ w ... w mn ] m (mn) is a basis for W, then U = [ v ... v n w ... w mn ] is a basis for U . Even though
1
×
1 1
the matrices V , W are not necessarily square, they are said to be orthogonal, in the sense that all columns are of
unit norm and orthogonal to one another. Computation of the matrix product V TV leads to the formation of the
identity matrix within n
[[ v T v v T v ... v T v n ]]
T T T T
[[ v T ]]
1 1 1 1 2 1
V TV = [[[ vÅÅ ]]][ v v ... v n ] = [[[ v ÅÅv v ÅÅv Å...Å v ÅÅv n ]]] = I n.
[[ ÅT ]] [[ TÅ TÅ Å TÅ ]]
2 2 1 2 2 2
1 2
[ vn ] [ v n v v n v ... v n v n ]
1 2
Similarly, W TW = I mn. Whereas for the square orthogonal matrix U multiplication both on the le and the right
by its transpose leads to the formation of the identity matrix
U T U = UU T = I m,
the same operations applied to rectangular orthogonal matrices lead to di erent results
T
[[ v T ]] n
V TV = I n, VV T = [ v v ... v n ][[[ vÅÅ ]]] = y v i v iT ,rank(v i v iT ) =1
1
[[ Å ]] i
2
1 2
[ v nT ] =1
A simple example is provided by taking V = I m n, the rst n columns of the identity matrix in which case
,
n
VV T = y ei eTi = [[ I0n 00 ]] m m. ×
i =1
53
54 LEAST SQUARES
Applying P = VV T to some vector b m leads to a vector r = Pb whose rst n components are those of b , and the
remaining m n are zero. The subtraction b r leads to a new vector s = (I P )b that has the rst components
equal to zero, and the remaining m n the same as those of b . Such operations are referred to as projections , and
for V = I m n correspond to projection onto the span{e ,..., en}.
, 1
octave]
W = {[[ y0 ]]| y }
U = 2
s = (I P )b
Figure 4.1. Projection in 2. The vectors r , s 2 have two components, but could be expressed through scaling of e 1, e 2.
Returning to the general case, the orthogonal matrices U m m, V m n, W m (mn) are associated with linear
× × ×
span{w , . . . , w mn}, respectively. When V , W are orthogonal matrices the projections are also orthogonal r ¥ s .
1
Projection can also be carried out onto nonorthogonal spanning sets, but the process is fraught with possible error,
especially when the angle between basis vectors is small, and will be avoided henceforth.
Notice that projection of a vector already in the spanning set simply returns the same vector, which leads to a
general denition.
DEFINITION. The mapping is called a projection if f f = f , or if for any u U, f (f (u )) = f (u ). With P the matrix
associated f , a projection matrix satises P = P. 2
P = VV T
P = PP = VV TVV T = V (V TV ) V T = VIV T = VV T = P
2
DATA COMPRESSION 55
2. Gram-Schmidt
Orthonormal vector sets { q ,..., q n} are of the greatest practical utility, leading to the question of whether some
1
such a set can be obtained from an arbitrary set of vectors {a , . . . , a n}. This is possible for independent vectors, 1
5. Divide by norm q = (a 2 2 ( q Ta ) q ) / a ( q Ta ) q
1 2 1 2 1 2 1
a 2 ( q Ta ) q
1 2 1 a 2
q 2
a 1
q 1
P a = ( q q T )a = q ( q Ta ) = ( q Ta ) q
1 2 1 1 2 1 1 2 1 2 1
(( r0 r r ... r n )
r r ... r n ))
11 12 13 1
A = ( a a ... an ) = ( q q ...
1 2 1 2
((( ÅÅÅ ÅÅ
Å
33
ÅÅ Å Å
Å Å
)))
3
ÅÅ
Å
(0 0 ... ... rmn )
{{{ ÅÅÅ
2 12 1 22
. 2
{{ an = r nq + r nq + ... + rnnqn
1 1 2 2
56 LEAST SQUARES
The system is easily solved by forward substitution resulting in what is known as the (modied) Gram-Schmidt
algorithm, transcribed below both in pseudo-code and in Octave.
rii = ( q iT q i )
1/2 if (R(i,i)<eps) break;
Note that the normalization condition q ii =1 is satisifed by two values ±rii , so results from the above implementa-
tion might give orthogonal vectors q ,..., q n of di erent orientations than those returned by the Octave qr function.
1
The implementation provided by computational packages such as Octave contain many renements of the basic
algorithm and it's usually preferable to use these in applications.
octave] A=rand(4); [Q,R]=mgs(A); disp([Q R])
1.1102e-16 8.0390e-16
octave]
By analogy to arithmetic and polynomial algebra, the Gram-Schmidt algorithm furnishes a factorization
QR = A
with Q m n with orthonormal columns and R n n an upper triangular matrix, known as the QR-factorization.
× ×
Since the column vectors within Q were obtained through linear combinations of the column vectors of A we have
C (A ) = C (Q ) ` C (R )
AX = B , A [ x ... x n ] = [ Ax ... Ax n ].
1 1
DATA COMPRESSION 57
The QR-factorization can be used to solve basic problems within linear algebra.
octave] A=[3 2; 1 2]
A =
3 2
1 2
octave] [Q R]=qr(A)
Q =
-0.94868 -0.31623
-0.31623 0.94868
R =
-3.16228 -2.52982
0.00000 1.26491
octave]
Recall that when given a vector b m, an implicit basis is assumed, the canonical basis given by the column vectors
of the identity matrix I m m. The coordinates x in another basis A m m can be found by solving the equation
× ×
Ib = b = Ax ,
by an intermediate change of coordinates to the orthogonal basis Q . Since the basis Q is orthogonal the relation
Q TQ = I holds, and changes of coordinates from I to Q , Qc = b , are easily computed c = Q Tb . Since matrix multi-
plication is associative
b = Ax = (QR )x = Q (Rx ),
the relations Rx = Q Tb = c are obtained, stating that x also contains the coordinates of c in the basis R . The three
steps are:
Since R is upper-triangular,
(( r0 r r ... r m )[ x ] [ c ]]
((( 0 r r ... r m ))[[ x ]] [[ c
11 12 13 1 1 1
]]
0 r ... r m )))[[[ ÅÅÅ ]]] = [[[ ÅÅÅ ]]]
22 23 2 2 2
cm
1
cj = cj rji xi end;
end end;
end end
return x octave]
The above operations are carried out in the background by the Octave backslash operation A\b to solve A*x=b,
inspired by the scalar mnemonic ax = b Ò x = (1/ a) b . Again, many additional renements of the basic algorithm
argue for using the built-in Octave functions, even though the above implementations can be veried as correct.
octave]
The above approch for the real vector space m can be used to determine orthogonal bases for any other vector
space by appropriate modication of the scalar product. For example, within the space of smooth functions [1,
1] that can di erentiated an arbitrary number of times, the Taylor series
2
¹¹
2
1
!
DATA COMPRESSION 59
is seen to be a linear combination of the monomial basis M = 1 x x ... with scaling coecients µ f (0), f (0),
2
¹
1
2
f (0),... ¶. The scalar product
¹¹
( f , g) = 5 f (x ) g (x ) dx
1
1
can be seen as the extension to the [1,1] continuum of a the vector dot product. Orthogonalization of the mono-
mial basis with the above scalar product leads to the denition of another family of polynomials, known as the
Legendre polynomials
Q (x ) = ±
0
1
2
1/2
² Å 1, Q (x ) = ± ²
1
3
2
1/2
Å x , Q (x ) = ± ²
2
5
8
1/2
Å (3x 1), Q (x ) = ± ²
2
4
7
8
1/2
Å (5x 3x ),... .
3
The Legendre polynomials are usually given with a di erent scaling such that Pk(1) =1, rather than the unit norm
condition Q k = (Q k, Q k) = 1. The above results can be recovered by sampling of the interval [1, 1] at points
1/2
m
5 f (x ) Lj (x ) dx E hy f (xi )Lj (xi ) = h f T Lj .
1
1
i =1
octave] m=50; h=2/(m-1); x=(-1:h:1)'; M=[x.^0 x.^1 x.^2 x.^3 x.^4]; [Q,R]=mgs(M);
S=diag(1./Q(m,:)); P=Q*S; sc=[-1 1 -1 1];
figure(1); plot(x,M(:,1),x,M(:,2),x,M(:,3),x,M(:,4)); axis(sc); grid on;
figure(2); plot(x,P(:,1),x,P(:,2),x,P(:,3),x,P(:,4)); axis(sc); grid on;
octave]
Figure 4.2.
? ?
Comparison of monomial basis (le) to Legendre polynomial basis (right). The resolution of P3(x ) can be interpreted as
the number of crossings of the y =0 ordinate axis, and is greater than that of the corresponding monomial x 3.
60 LEAST SQUARES
m m
S (a , a ) = y (y (xi ) yi ) = y (a + a xi yi )
0 1
2
0 1
2
i =1 i =1
and seek (a , a ) that minimize S (a , a ). The function S (a , a ) 0 can be thought of as the height of a surface above
0 1 0 1 0 1
the a a plane, and the gradient S is dened as a vector in the direction of steepest slope. When at some point
on the surface if the gradient is di erent from the zero vector S ` 0, travel in the direction of the gradient would
0 1
increase the height, and travel in the opposite direction would decrease the height. The minimal value of S would
be attained when no local travel could decrease the function value, which is known as stationarity condition, stated
as S =0. Applying this to determining the coecients (a , a ) of a linear regression leads to the equations
0 1
m m m
S
a
=0 Ò 2y (a + a xi yi ) =0 Ô ma + ((y xi ))a = y yi ,
0 1 0 1
0
i =1 i =1 i =1
m m m m
S
a
=0 Ò 2y (a + a xi yi )xi =0 Ô ((y xi ))a + ((y xi ))a = y xi yi .
0 1 0
2
1
1
i =1 i =1 i =1 i =1
The above calculations can become tedious, and do not illuminate the geometrical essence of the calculation, which
can be brought out by reformulation in terms of a matrix-vector product that highlights the particular linear com-
bination that is sought in a liner regression. Form a vector of errors with components ei = y (xi ) yi , which for linear
regression is y (x ) = a + a x . Recognize that y (xi ) is a linear combination of 1 and xi with coecients a , a , or in
0 1 0 1
vector form
1 x
e = ((( ÅÅÅ ÅÅÅ )))(( aa )) y = ( 1 x )a y = Aa y
1
( 1 xm )
0
The norm of the error vector e is smallest when Aa is as close as possible to y . Since Aa is within the column
space of C (A ), Aa C (A ), the required condition is for e to be orthogonal to the column space
The above is known as the normal system, with N = ATA is the normal matrix. The system Na = b can be interpreted
as seeking the coordinates in the N = ATA basis of the vector b = AT y . An example can be constructed by randomly
perturbing a known function y (x ) = a + a x to simulate measurement noise and compare to the approximate a
0 1 Ü
octave]
octave]
2.0000 2.0302
3.0000 2.9628
octave]
The normal matrix basis N = AT A can however be an ill-advised choice. Consider A given by
2×2
A = [ a a ] = [[ 10 cos
1 2
¸
sin ¸ ]],
where the rst column vector is taken from the identity matrix a = e , and second is the one obtained by rotating it
1 1
with angle ¸ . If ¸ = À /2, the normal matrix is orthogonal, ATA = I , but for small ¸ , A and N = AT A are approximated as
A E [[ 10 ¸1 ]], N = [ n n ] = [[ 10 ¸1 ]].
1 2 2
When ¸ is small a , a are almost colinear, and n , n even more so. This can lead to amplication of small errors,
1 2 1 2
but can be avoided by recognizing that the best approximation in the 2-norm is identical to the Euclidean concept
of orthogonal projection. The orthogonal projector onto C (A ) is readily found by QR-factorization, and the steps
to solve least squares become
1. Compute QR = A
2. The projection of y onto the column space of A is z = QQ T y , and has coordinates c = Q T y in the orthogonal
basis Q .
3. The same z can also obtained by linear combination of the columns of A , z = Aa = QQ T y , and replacing A
with its QR-factorization gives QRa = Q c , that leads to the system Ra = c , solved by back-substitution.
b
e
Ax C (A )
e = b Ax
MODEL REDUCTION 63
MODEL REDUCTION
1. Projection of mappings
The least-squares problem
min
x n
y Ax (4.1)
Consider some phenomenon modeled as a function between vector spaces f : X Y , such that for input parameters
x X , the state of the system is y = f (x ). For most models f is di erentiable, a transcription of the condition that
the system should not exhibit jumps in behavior when changing the input parameters. Then by appropriate choice
of units and origin, a linearized model
y = Ax , A m n, ×
A simpler description is oen sought, typically based on recognition that the inputs and outputs of the model
can themselves be obtained as linear combinations x = Bu , y = C v , involving a smaller set of parameters u q,
v p, p < m, q < n. The column spaces of the matrices B n q, C m p are vector subspaces of the original set
× ×
of inputs and outputs, C (B ) d n, C (C ) d m. The sets of column vectors of B , C each form a reduced basis for
the system inputs and outputs if they are chosed to be of full rank. The reduced bases are assumed to have been
orthonormalized through the Gram-Schmidt procedure such that B TB = Iq, and C TC = I p. Expressing the model
inputs and outputs in terms of the reduced basis leads to
Cv = ABu Ò v = C T ABu Ò v = Ru .
The matrix R = C T AB is called the reduced system matrix and is associated with a mapping g : U V , that is a
restriction to the U , V vector subspaces of the mapping f . When f is an endomorphism, f : X X , m = n, the same
reduced basis is used for both inputs and outputs, x = Bu , y = Bv , and the reduced system is
v = Ru , R = B T AB .
Since B is assumed to be orthogonal, the projector onto C (B ) is PB = BB T . Applying the projector on the inital
model
PB y = PB Ax
leads to BB T y = BB T Ax , and since v = B T y the relation Bv = BB T ABu is obtained, and conveniently grouped as
Bv = B (B T AB ) u Ò Bv = B (Ru ),
again leading to the reduced model v = Bu . The above calculation highlights that the reduced model is a projection
of the full model y = Ax on C (B ).
64 LEAST SQUARES
2. Reduced bases
2.1. Correlation matrices
Correlation coecient. Consider two functions x , x : , that represent data streams in time of inputs x (t )
1 2 1
and outputs x (t ) of some system. A basic question arising in modeling and data science is whether the inputs
2
and outputs are themselves in a functional relationship. This usually is a consequence of incomplete knowledge
of the system, such that while x , x might be assumed to be the most relevant input, output quantities, this is
1 2
not yet fully established. A typical approach is to then carry out repeated measurements leading to a data set
D = {(x (ti ), x (ti ))| i = 1, . . . , N }, thus dening a relation. Let x , x N denote vectors containing the input and
1 2 1 2
output values. The mean values ¼ , ¼ of the input and output are estimated by the statistics
1 2
N N
1
¼ E x̄ = y x (ti ) = E [x ], ¼ E x̄ = y x (t i ) = E [x ],
1
1 1
Ni =1
1 1 2 2
Ni =1
2 2
E = N1 [ 1 1 ... 1 ],
and the means are also obtained by matrix vector multiplication (linear combination),
x̄ = E x , x̄ = E x .
1 1 2 2
Deviation from the mean is measured by the standard deviation dened for x , x by 1 2
à = 4E [(x ¼
1 1 1 ) ], Ã
2
2 = 4E [(x ¼ ) ] . 2 2
2
Note that the standard deviations are no longer linear mappings of the data.
Assume that the origin is chosen such that x̄ = x̄ =0. One tool to estalish whether the relation D is also a function
1 2
Á(x , x ) =
E [x x ] = 1 2 E [x x ] , 1 2
1 2
ÃÃ 1 2
4E [x ] E [x ]
2
1
2
2
Á(x , x ) =
xT x .
1 2
x x
1 2
1 2
(x 1 + x )T (x + x ) x T x + x T x +2 x x Ò x T x
2 1 2 1 1 2 2 1 2 1 2 x 1 x ,
2
known as the Cauchy-Schwarz inequality, which implies 1 Á(x , x ) 1. Depending on the value of Á, the vari- 1 2
1. uncorrelated , if Á =0;
2. correlated , if Á =1;
MODEL REDUCTION 65
3. anti-correlated , if Á = 1.
The numerator of the correlation coecient is known as the covariance of x , x 1 2
cov(x , x ) = E [x x ].
1 2 1 2
The correlation coecient can be interpreted as a normalization of the covariance, and the relation
cov(x , x ) = x T x = Á(x , x ) x x ,
1 2 1 2 1 2 1 2
is the two-variable version of a more general relationship encountered when the system inputs and outputs become
vectors.
Patterns in data. Consider now a related problem, whether the input and output parameters x n, y m thought
to characterize a system are actually well chosen, or whether they are redundant in the sense that a more insightful
description is furnished by u q, v p with fewer components p < m, q < n. Applying the same ideas as in the
correlation coecient, a sequence of N measurements is made leading to data sets
X = [ x x ... x n ] N n, Y = [ y y ... y n ] N m.
1 2
×
1 2
×
Again, by appropriate choice of the origin the means of the above measurements is assumed to be zero
E [x ] = 0, E [ y ] = 0.
Covariance matrices can be constructed by
CX = X TX = [[[ xÅÅ ]]][ x x ... x n ] = [[[ x ÅÅx x ÅÅx Å...Å x ÅÅx n ]]] n n.
2 2 1 2 2 2 ×
[[ ÅT ]] [[ TÅ TÅ Å TÅ ]]
1 2
[ xn ] [ x n x x n x ... x n x n ]
1 2
Recall that the SVD returns an order set of singular values à à ŠŠŠ, and associated singular vectors. In many
1 2
applications the singular values decrease quickly, oen exponentially fast. Taking the rst q singular modes then
gives a basis set suitable for mode reduction
x = Sq u = [ s s ... sq ] u .
1 2
CHAPTER 5
CHANGE OF BASIS
DATA TRANSFORMATION
1. Gaussian elimination and row echelon reduction
Suppose now that A x = b admits a unique solution. How to nd it? We are especially interested in constructing
a general procedure, that will work no matter what the size of A might be. This means we seek an algorithm that
precisely species the steps that lead to the solution, and that we can program a computing device to carry out
automatically. One such algorithm is Gaussian elimination.
Consider the system
{{ x2 x+2 xx + xx
1 2 3 =2
=2
{{ 3 x x x
1
1
2
2
3
3 =1
The idea is to combine equations such that we have one fewer unknown in each equation. Ask: with what number
should the rst equation be multiplied in order to eliminate x from sum of equation 1 and equation 2? This number
1
is called a Gaussian multiplier, and is in this case 2. Repeat the question for eliminating x from third equation, 1
{{ 2x x+2 xx + xx
1 2 3 =2 x +2 x x = 2
= 2 Ò {{{ 5 x +3 x = 2
1 2 3
{{ 3 x x x
1
1
2
2
3
3 = 1 { 7 x +2 x = 5
2
2
3
Now, ask: with what number should the second equation be multiplied to eliminate x from sum of second and 2
= 2 Ò {{ 5 x11+3 x = 11
2
1 2 3
{ 7 x +2 x {
2 3
{
2 3
= 5 x = 5
2 3
5 3
Starting from the last equation we can now nd x =1, replace in the second to obtain 5 x = 5, hence x =1, and
3 2 2
The above operations only involve coecients. A more compact notation is therefore to work with what is known
as the "bordered matrix"
1 2 1 2
(( 21 21 11 22 )) < (( 01 25 31 22 )) < ((( 0 5 3 2
)))
((( 3 1 1 1 ))) ((( 0 7 2 5 ))) ((( 11 ))
0 0 11 5 )
5
67
68 CHANGE OF BASIS
Once the above triangular form has been obtain, the solution is found by back substitution, in which we seek to
form the identity matrix in the rst 3 columns, and the solution is obtained in the last column.
2. LU -factorization
" We have introduced Gaussian elimination as a procedure to solve the linear system A x = b ("nd coordi-
nates of vector b in terms of column vectors of matrix A"), x , b m, A m m ×
" We now reinterpret Gaussian elimination as a sequence of matrix multiplications applied to A to obtain a
simpler, upper triangular form.
{{ x2 x+2 xx + xx
1 2 3 =2
=2
{{ 3 x x x
1
1
2
2
3
3 =1
with
1 2 1 2
A = (((( 2 1 1 )))), b = (((( 2 ))))
(( 3 1 1 )) ( 1 ))
We ask if there is a matrix L that could be multiplied with A to produce a result L A with zeros under the main
1 1
diagonal in the rst column. First, gain insight by considering multiplication by the identity matrix, which leaves
A unchanged
In the rst stage of Gaussian multiplication, the rst line remains unchanged, so we deduce that L should have 1
(? ? ?)
Next, recall the way Gaussian multipliers were determined: nd number to multiply rst line so that added to
second, third lines a zero is obtained. This leads to the form
100
L = ((( ? 1 0 )))
1
(? 0 1)
Finally, identify the missing entries with the Gaussian multipliers to determine
1 00
L = ((( 2 1 0 )))
1
( 3 0 1 )
Verify by carrying out the matrix multiplication
1 00 1 2 1 1 2 1 )
L A = ((( 2 1 0 )))((( 2 1 1 ))) = ((( 0 5 3 )))
1
( 3 0 1 )( 3 1 1 ) ( 0 7 2 )
Repeat the above reasoning to come up with a second matrix L that forms a zero under the main diagonal in the
2
second column
1 0 0
L = ((( 0 1 0 )))
2
( 0 7/5 1 )
1 0 0 1 2 1 1 2 1
L L A = ((( 0 1 0 )))((( 0 5 3 ))) = ((( 0 5 3 ))) = U
2 1
( 0 7/5 1 )( 0 7 2 ) (( 0 0 11/5 )
We have obtained a matrix with zero entries under the main diagonal (an upper triangular matrix) by a sequence
of matrix multiplications.
" Recall the basic operation in row echelon reduction: constructing a linear combination of rows to form zeros
beneath the main diagonal, e.g.
a a ... am
(( aa a ... a m ) (( 0 a ))
11 12 1
a
11
a
12
... a m )) (((
1
a 2111 a ... a m aa a m 21
)))
(
22 12 2 1
a a 3111 a
(((( ÅÅÅ
31
ÅÅ ÅÅ ÅÅ ÅÅ ÅÅ ÅÅ ÅÅ
11
Å Å Å
am 1 a m ...
2 a mm ((( Å
0 am 2
Å
a
am111 a 12
Å
... a mm aam
Å
1
a m ))
1
)
11
70 CHANGE OF BASIS
0 1 ... 0 ))))(((( a
21
a
22
... a m l a m
21
)))
1
((( ÅÅÅ
31
ÅÅÅ ÅÅ ))) 3
ÅÅÅ
31 12
ÅÅ
3
ÅÅÅ
31 1
)))
( l m 1 0 0 ... 1 )( a m 1 a m ... 2
Å
a mm ) ( 0 a m l m a ... 2 1 12
Å
a mm lm a m )
1 1
Lk = ((( 0
(0 ...
... lk k
1 ...
... 0 )))
((( 0 ... lk k
+1,
... 0 )))
((( ÅÅÅ ... ÅÅÅ Å Å Å ÅÅÅ )))
+2,
(0 ... l m k ... 1 ) ,
with l i k = a (i kk) / a (kkk), and A(k) = a (i kj ) the matrix obtained aer step k of row echelon reduction (or, equivalently, Gaussian
, , , ,
" For A m m nonsingular, the successive steps in row echelon reduction (or Gaussian elimination) corre-
×
(( 10 ... 0 ... 1)
0 ))
((( 0 Å 0 ...
ÅÅ
( ... 1 ... 0 )))
Lk = ((( 0 ... l k k ... 0 ))) = I (Lk I )
((( 0 0 )))
1
+1,
... l k k ...
((( ÅÅÅ ... ÅÅÅ
+2,
ÅÅ Å
Å ÅÅ )))
(0 ... lm k , ... 1)
" Due to the simple form of Lk the matrix L is easily obtained as
1
(( l 1 0
1
0
0
...
...
0
0
0)
0 ))
((( l 2,1
l 1 ... 0 0 )))
L = (( ÅÅ )))
((( Å
3,1 3,2
ÅÅ ÅÅ ÅÅ ÅÅ ÅÅ
Å Å Å Å Å
(( lm lm lm ... 1 0 ))
1)
1,1 1,2 1,3
lm ,1 lm ,2 lm ,3 ... l m m , 1
DATA TRANSFORMATION 71
We will show that this indeed possible if A x = b admits a unique solution. Furthermore, the product of lower
triangular matrices is lower triangular, and the inverse of a lower triangular matrix is lower triangular (same applies
for upper triangular matrices). Introduce the notation
L = Lm ... L L
1
1 2 1
and obtain
L A = U
1
or
A=LU
The above result permits a basic insight into Gaussian elimination: the procedure depends on "factoring" the matrix
A into two "simpler" matrices L, U . The idea of representing a matrix as a product of simpler matrices is funda-
mental to linear algebra, and we will come across it repeatedly.
For now, the factorization allows us to devise the following general approach to solving A x = b
2. Insert the factorization into A x = b to obtain (L U ) x = L (U x ) = L y = b , where the notation y = U x has been
introduced. The system
Ly =b
for i = s +1 to m
mag = abs(a p(i ) s ) ,
if mag>piv then
piv=mag; k = p(s ); p(s ) = p(i); p(i) = k
if piv< õ then break(Singular matrix)
t = a p(i )s / a p(s ) s
for j = s +1 to m
a p(i )j = a p(i )j + t Å a p(s ) j
bp(i ) = bp(i ) + t Å bp(s )
for s = m downto 1
xs = bp(s ) / a p(s ) s
for i =1 to s 1
bp(i ) = bp(i ) a p(i ) s Å xs
return x
Given A m n ×
3. Inverse matrix
By analogy to the simple scalar equation a x = b with solution x = a b when a ` 0, we are interested in writing the
1
expressing the vector b as a linear combination of the columns of A. This can only be done if the columns of A form
a basis for m, in which case we say that A is non-singular .
DEFINITION 5.1. For matrix A m m non-singular the inverse matrix is denoted by A and satises the properties
× 1
A A = A A = I
1 1
B = A for a moment and consider the inverse matrix property A B = I . Introducing the column notation for B, I
1
leads to
A( B ... Bm ) = ( e ... em )
1 1
DATA TRANSFORMATION 73
AB k = ek, k =1,2,.., m
with ek the column unit vector with zero components everywhere except for a 1 in row k . To nd the inverse we
need to simultaneously solve the m linear systems given above.
Gauss-Jordan algorithm example . Consider
1 2 3
A = ((( 1 3 1 ))
))
(( 2 1 2 )
To nd the inverse we solve the systems A B = e , A B = e , A B = e . This can be done simultaneously by working
1 1 2 2 3 3
(A|I ) =
((( 11 11 01 10 01 00 )))
(( 2 4 2 0 0 1 ))
Carry out now operations involving linear row combinations and permutations to bring the le side to I
( 10 12 01 11 0 0 ) ( 1 1 0 1 02 0 ) (1 1 0 1 0 0
1 )))
<(
((( 1 0 )) < ((( 0 2 0 0 3 1 )) (( 0 1 0 0 1
3 ))) < ((( 6 ))) <
1 1 )) (((
) ) (
3
(0 0 1 1 )
1 1 0 0 1 1 13 1)
3 3 (0 0 1 1 3
3
) ( 3
(( 1 0 0 1 13
1)
6)
((( 1 )))
((( 0 1 0 0 13 6 )))
((( 0 0 1 1 1 1)
)
3 3)
to obtain
(( 1 13
1
6 )))
(( 1 ))
A = (( 0 1
((( 3 6 )))
1
(( 1 13 1)
)
3)
4. Determinants
" A m m a square matrix, det(A ) is the oriented volume enclosed by the column vectors of A (a paral-
×
lelipiped)
74 CHANGE OF BASIS
|| aa aa ... am|
|
det(A) = || ÅÅ ÅÅ Å Å a ÅÅm |||
|
11 12 1
...
|| Å Å Å Å ||
21 22 2
| a m a m ... a mm |
1 2
giving the (oriented) volume of the parallelepiped spanned by matrix column vectors.
" m =2
A = ( aa aa 11 12
),det(A) = Ä
a a
11
a a
12
Ä
21 22 21 22
" m =3
a a a )),det(A) = || aa a a
A = ((( a a a
||
)))
11 12 13 11 12 13
|| a a ||
(( a a a
21
31
22
32
23
33 |a
21
31
22
a a
32
23
33 |
a a
11 12
=a a a a
Ä
a a
21 22
Ä 11 22 12 21
|| aa aa aa || = a a a + a a a + a a a
|| |
11 12 13
| a a a ||
21 22 23 11 22 33 21 32 13 31 12 23
31 32 33
a 13 a a 22 31 a 23 a a
32 11 a 33 a a
12 21
" Where do these determinant computation rules come from? Two viewpoints
Algebraic viewpoint : determinants are computed from all possible products that can be formed from
choosing a factor from each row and each column
" m =2
DATA TRANSFORMATION 75
A 2
A = ( a a ) = (( aa
1 2
11
21
a 12 )
a 22 )
A 1
Figure 5.1.
" Take a as the base, with length b = a . Vector a is at angle Æ to x -axis, a is at angle Æ to x -axis, and
1 1 1 1 1 2 2 2
" The geometric interpretation of a determinant as an oriented volume is useful in establishing rules for
calculation with determinants:
Determinant of matrix with repeated columns is zero (since two edges of the parallelepiped are
identical). Example for m =3
|a a u|
= ||| b b v ||| = abw + bcu + cav ubc vca wab =0
|c c w|
=det( a a a ... ) =0
1 1 3
Determinant of matrix with linearly dependent columns is zero (since one edge lies in the 'hyper-
plane' formed by all the others)
with ai , b m
1
76 CHANGE OF BASIS
with ±
with ± .
" A determinant of size m can be expressed as a sum of determinants of size m 1 by expansion along a row
or column
...
1
a a
Å ÅÅ |||
22 23 2
ÅÅ ÅÅ Å
||| Å Å Å Å Å ||| || a m
21 22 23 2
Å
... a mm |
11
a m a m a m ... a mm 2 am 3
1 2 3
|a a ... a m |
a ||| ÅÅÅ Å ÅÅ ||| +
21 23 2
ÅÅ ÅÅ Å
Å
| am ... a mm |
12
1 am 3
a a a ... a m |
a ||| ÅÅÅ ||
21 22 24 2
ÅÅ ÅÅ ÅÅ ÅÅ
|| a m 13
1 a m a m ... a mm ||
Å
2
Å
4
Å Å
...
+(1)m a m|||
| aÅÅ aÅÅ Å...Å a mÅÅ
21 23 2, 1
|||
| |||
+1
Å Å Å Å
| a m a m ... a m m
1
1 3 , 1
" A more economical determinant is to use row and column combinations to create zeros and then reduce the
size of the determinant, an algorithm reminiscent of Gauss elimination for systems
Example:
The rst equality comes from linear combinations of rows, i.e. row 1 is added to row 2, and row 1 multiplied
by 2 is added to row 3. These linear combinations maintain the value of the determinant. The second
equality comes from expansion along the rst column
DATA EFFICIENCY 77
u Å v = uT v = u v + u v + u v 1 1 2 2 3 3
u uv uv uv
uv T = ((( u ))( v v v ) = ((( u v u v u v ))
1 1 1 1 2 1 3
(u 2
3
)) 1 2
(u v u v u v
3 2
3
1
1
2
3
2
2
2
3
3
3
))
" Another product of two vectors is also useful, the cross product, most conveniently expressed in determi-
nant-like form
1
|e e e |
u × v = ||| u u u ||| = (u v v u )e + (u v v u )e + (u v v u )e
2 3
1
|| v v v ||
2
2
3
3
2 3 2 3 1 3 1 3 1 2 1 2 1 2 3
DATA EFFICIENCY
1. Krylov bases
In reduction of the model
A
Ax = y , A m n, n m ×
by restriction to a subspaces spanned by the orthogonal column vectors of B , C , x = Bu , y = Cv , the reduced model
v = Ru
is obtained with R = C T AB , the reduced system matrix. The choice of the basis sets B , C is not yet specied. One
common choice is to use the singular value decomposition A = S £ Q T and choose the dominant k singular vectors
to span the subspaces,
B = Q k, C = Sk.
This assumes that an SVD is available, and that ordering of vectors by the 2-norm is relevant to the problem. This
is oen the case in problems in which the matrix A somewhow expresses the energy of a system. For example in
deformation of a structure a relationship between forces f and displacements u is approximated linearly by f = Ku ,
with the sti ness matrix K expressing the potential energy stored in the deformation.
78 CHANGE OF BASIS
However in many cases, the model might not express an energy so the 2-norm is not an appropriate functional, or
even if it is the size of the problem might render the computation of the singular value decomposition impractical.
In such situations alternative procedures to construct reduced bases must be devised.
Consider that the only information available about the problem are the matrix A m m and a vector y m. From
×
these two a sequence of vectors can be gather into what is known as a Krylov matrix
K n = y Ay A y ... An y .
2 1
The Krylov matrix K n is generally not orthogonal, but an orthogonal basis can readily be extracted through the QR
factorization
Q n R = K n.
The basis Q n can then be adopted, both for the domain and the codomain
B = C = Q n.
2. Greedy approximation
The Krylov procedure has the virtue of simplicity, but does not have the desirable property of the SVD of ordering
of the singular vectors. Suppose that the system matrix A m m is applied to k vectors x ,..., x k, leading to forma-
×
1
tion of the vector set S = {Ax ,..., Ax k}. Denote by B the rst n members of the set ordered in some arbitrary norm
1
B = [ b b ... bn ], n j k
1 2
where à denotes the permutation that orders the vectors in accordance with the chosen norm. The above is called a
greedy approximation, and furnishes an alternative to the SVD that exhibits an ordering property. As in the Krylov
case, it is usually more ecient to use an orthogonal set obtained through QR factorization
Q n R = B n.
CHAPTER 6
EIGENPROBLEMS
DATA STABILITY
A = U £ V T A = QR A = LU
1. The eigenvalue problem
" Consider square matrix A m m. The eigenvalue problem asks for vectors x m, x ` 0, scalars » such
×
that
Ax = » x (6.1)
" Eigenvectors are those special vectors whose direction is not modied by the matrix A
" Rewrite (1): (A » I )x = 0, and deduce that A »I must be singular in order to have non-trivial solutions
det(A » I ) =0
" Consider the determinant
|| a a » a a » ... am |
a m |||
|
11 12 1
det(A »I ) = || ÅÅ ...
|| Å
21
Å
Å
22
Å Å ÅÅ || 2
| a m a mÅ ...Å a mmÅ » ||
1 2
" From determinant denition ``sum of all products choosing an element from row/column''
" A m × m has characteristic polynomial pA(») of degree m, which has m roots (Fundamental theorem of
algebra)
" Example
A =
0.50000 -0.86603
0.86603 0.50000
79
80 EIGENPROBLEMS
octave] eig(A)
ans =
0.50000 + 0.86603i
0.50000 - 0.86603i
octave] [R,lambda]=eig(A);
octave] disp(R);
octave] disp(lambda)
Diagonal Matrix
0.50000 + 0.86603i 0
0 0.50000 - 0.86603i
octave] A=[-2 1 0 0 0 0; 1 -2 1 0 0 0; 0 1 -2 1 0 0; 0 0 1 -2 1 0; 0 0 0 1 -2 1; 0 0 0 0 1 -2];
octave] disp(A)
-2 1 0 0 0 0
1 -2 1 0 0 0
0 1 -2 1 0 0
0 0 1 -2 1 0
0 0 0 1 -2 1
0 0 0 0 1 -2
octave] lambda=eig(A);
octave] disp(lambda);
-3.80194
-3.24698
-2.44504
-1.55496
-0.75302
-0.19806
octave]
" If the column vectors of X are linearly independent, then X is invertible and A can be reduced to diagonal
form
A = X X , A = U £V T
1
y =Ay Ô (X y ) = (X y )
¹
1 1
DATA STABILITY 81
" When can a matrix be reduced to diagonal form? When eigenvectors are linearly independent such that the
inverse of X exists
" Matrices with distinct eigenvalues are diagonalizable. Consider A m m with eigenvalues »j ` »k for j ` k ,
×
j , k {1,..., m}
Proof . By contradiction. Take any two eigenvalues »j ` »k and assume that xj would depend linearly on xk,
xk = cxj for some c ` 0. Then
Ax = » x Ò Ax = » x
1 1 1 1 1 1
Ax = » x Ò Acx = » cx
2 2 2 1 2 1
» =» .
1 2
" The characteristic polynomial might have repeated roots. Establishing diagonalizability in that case requires
additional concepts
DEFINITION 6.1. The algebraic multiplicity of an eigenvalue » is the number of times it appears as a repeated root of the
characteristic polynomial p(») =det(A » I )
Example. p(») = »(» 1)(» 2) has two single roots » =0, » =1 and a repeated root » =2. The eigenvalue » =2
2
1 2 3,4
DEFINITION 6.2. The geometric multiplicity of an eigenvalue » is the dimension of the null space of A » I
DEFINITION 6.3. An eigenvalue for which the geometric multiplicity is less than the algebraic multiplicity is said to be
defective
PROPOSITION 6.4. A matrix is diagonalizable is the geometric multiplicity of each eigenvalue is equal to the algebraic
multiplicity of that eigenvalue.
" Finding eigenvalues as roots of the characteristic polynomial p(») =det(A »I ) is suitable for small matrices
A m m. ×
small errors in characteristic polynomial coecients can lead to large errors in roots
5 -4 2
5 -4 1
-2 2 -3
octave]
» =1 Ò A » I = (
(( 5 5 1 ( 4 4 2 )) < (( 02 20 46 )) < (( 02 02 64 ))
1 1
( 2 2 4 )) (( 5 5 1 )) (( 0 0 0 ))
Note convenient choice of row operations to reduce amount of arithmetic, and use of knowledge that A » I 1
{{ 0x2x+0+2x x6x4x
1 2 3 =0 1
= 0 Ò x = ± (((( 1 )))), x =1 Ò ± =1/ (2
{ 0x +0x +0x
1
1
2
2
3
3 =0 (0)
" In Octave/Matlab the computations are carried out by the null function
octave] null(A+5*eye(3))'
ans = [](0x3)
octave]
" The eigenvalues of I are » = 1, but small errors in numerical computation can give roots of the
3×3
1,2,3
" In the following example notice that if we slightly perturb A (by a quantity less than 0.0005=0.05%), the
eigenvalues get perturb by a larger amount, e.g. 0.13%.
DATA STABILITY 83
octave]
" Extracting eigenvalues and eigenvectors is a commonly encountered operation, and specialized functions
exist to carry this out, including the eig function
octave> disp(null(A-3*eye(3)))
0.00000
0.70711
0.70711
octave> disp(null(A+2*eye(3)))
0.57735
-0.57735
-0.57735
octave>
DEFINITION. A matrix which has n» < m» for any of its eigenvalues is said to be defective.
Diagonal Matrix
-2.0000 0 0
0 3.0000 0
0 0 -2.0000
octave> disp(X);
octave> disp(null(A+2*eye(3)));
0.57735
-0.57735
-0.57735
octave> disp(rank(X))
2
octave>
84 EIGENPROBLEMS
B = AT A = V £T £ V T = V V T
ÅÅ
(((
Å
Ãr
))) ×
)))
+
(( 0
ÅÅ )
Å
DATA RESONANCE
1. Bases induced by eigenmodes
The trigonometric functions {1,cos t ,sin t ,cos2t ,sin2t ,... } have been introduced as a particularly appropriate basis
for periodic functions. The functions cos(kt ), sin(kt ) are also known as solution of the homogeneous di erential
equation
y + k y =0.
¹¹
2
octave]