
Linear algebra for data science

SORIN MITRAN
Department of Mathematics
University of North Carolina at Chapel Hill
ABSTRACT
This textbook presents the essential concepts from linear algebra of direct utility to analysis
of large data sets. The theoretical foundations of the emerging discipline of Data Science are
still being defined at present, but linear algebra is certainly one of the cornerstones. Traditional
presentations of linear algebra reflect its historical roots with a focus on linear systems and
determinants, typically of small size. Presentation of the topic often links solutions of linear
systems to possible intersections of lines or planes. Such an approach is ill-suited for data science in
which the primary interest is in efficient description of large data sets, and automated extraction
of regularity from the available data. Neither is the essence of solving a linear system presented
as the information-conserving coordinate transformation that it actually represents when the
system matrix is of full rank.
The emphasis in linear algebra presentation suggested by data science is quite different. The
focus naturally shifts to the essential problem of efficient description of large data sets using a
small, typically incomplete set of feature vectors. Linear algebra becomes the study of the basic
operation of linear combination and its potential as a descriptor of large data sets. Rather than
concentrate on the basis transformation represented by linear system solution, the focus shifts
to maximal information compression. Instead of Gaussian elimination, the crucial algorithm
becomes the singular value decomposition. The typical operation required is not to solve a linear
system, but to construct low-dimensional approximations of the available data through projection
and least squares.
Furthermore, computational exercises based on small matrices obscure the vitality and utility of
linear algebra in data science of describing objects as disparate and information-rich as images,
medical scans or sound recordings. To more faithfully portray the way linear algebra actually
gets used in data science, this textbook is packaged with a software environment that contains
extensive data sets, code snippets to carry out typical analysis, and procedures to transform
heterogeneous data sources into standard linear algebra representations. Rather than relegate
computational applications to isolated sections, the entire text is interspersed with practical
examples using the Julia language, well suited for linear algebra and data science.
This textbook is drafted and meant to be worked through in TeXmacs, a scientific editing platform
that features live documents with embedded computational examples constructed in
freely available mathematical software systems such as Asymptote, Eukleides, Gnuplot, Julia,
Maxima, and Octave.
This textbook was developed for an intensive Maymester course that meets in twelve sessions
of three hours each. The content organization reflects a desire to present crucial mathematical
ideas and practical skills to students from various backgrounds who might be interested in data
science. The key concepts required for mathematics students are present: vector spaces, matrix
factorizations, linear systems, eigenvalues. For a more general audience, these mathematical

topics are also recast as addressing specific aspects of data: expressiveness, redundancy, efficiency,
compression, partitioning. More than a simple relabeling, this reinterpretation allows
for application of linear algebra operations to data far removed from the physical sciences or
engineering. The text and its associated software environment considers data sets from visual
art, music, biology, medicine, social sciences.
TABLE OF CONTENTS

ABSTRACT

1. LINEAR COMBINATIONS
   Vectors and Matrices
      1. Numbers
         1.1. Number sets
         1.2. Quantities described by a single number
         1.3. Quantities described by multiple numbers
      2. Vectors
         2.1. Vector spaces
         2.2. Real vector space ℛᵐ
            Column vectors
            Row vectors
            Compatible vectors
         2.3. Working with vectors
            Ranges
            Visualization
      3. Matrices
         3.1. Forming matrices
         3.2. Identity matrix
      4. Linear combinations
         4.1. Linear combination as a matrix-vector product
            Linear combination
            Matrix-vector product
         4.2. Linear algebra problem examples
            Linear combinations in E2
      5. Vectors and matrices in data science
   Linear Mappings
      1. Functions
         1.1. Relations
            Homogeneous relations
         1.2. Functions
         1.3. Linear functionals
         1.4. Linear mappings
      2. Measurements
         2.1. Equivalence classes
         2.2. Norms
         2.3. Inner product
      3. Linear mapping composition
         3.1. Matrix-matrix product

2. VECTOR SPACES
   Formal Rules
      1. Algebraic structures
         1.1. Typical structures
            Groups
            Rings
            Fields
         1.2. Vector subspaces
      2. Vector subspaces of a linear mapping
   Data Redundancy
      1. Linear dependence
      2. Basis and dimension
      3. Dimension of matrix spaces

3. FUNDAMENTAL THEOREM OF LINEAR ALGEBRA
   Data Information
      1. Partition of linear mapping domain and codomain
   Data Partitioning
      1. Mappings as data
         1.1. Vector spaces of mappings and matrix representations
         1.2. Measurement of mappings
      2. The Singular Value Decomposition (SVD)
         2.1. Orthogonal matrices
         2.2. Intrinsic basis of a linear mapping
         2.3. SVD solution of linear algebra problems
            Change of coordinates
            Best 2-norm approximation
            The pseudo-inverse

4. LEAST SQUARES
   Data Compression
      1. Projection
      2. Gram-Schmidt
      3. QR solution of linear algebra problems
         3.1. Transformation of coordinates
         3.2. General orthogonal bases
         3.3. Least squares
   Model Reduction
      1. Projection of mappings
      2. Reduced bases
         2.1. Correlation matrices
            Correlation coefficient
            Patterns in data

5. CHANGE OF BASIS
   Data Transformation
      1. Gaussian elimination and row echelon reduction
      2. LU-factorization
         2.1. Example for m = 3
         2.2. General m case
      3. Inverse matrix
         3.1. Gauss-Jordan algorithm
      4. Determinants
         4.1. Cross product
   Data Efficiency
      1. Krylov bases
      2. Greedy approximation

6. EIGENPROBLEMS
   Data Stability
      1. The eigenvalue problem
      2. Computation of the SVD
   Data Resonance
      1. Bases induced by eigenmodes

CHAPTER 1
LINEAR COMBINATIONS
CHAPTER 1
LINEAR COMBINATIONS

VECTORS AND MATRICES

SYNOPSIS. Data science arises from the need to organize massive amounts of data into mean-
ingful insights into some natural or social process. There are many ways to do so such as lists,
trees, clusters. Linear algebra concentrates on grouping quantifiable data into sets of fixed
length, known as vectors. Multiple vectors are subsequently grouped as matrices, and a for-
malism is established to work with vectors and matrices.

1. Numbers
1.1. Number sets
Several types of numbers have been introduced in mathematics to express different types of
quantities, and the following will be used throughout this text:
ℕ. The set of natural numbers, ℕ = {0, 1, 2, 3, ...}, infinite and countable, ℕ₊ = {1, 2, 3, ...};
ℤ. The set of integers, ℤ = {0, ±1, ±2, ±3, ...}, infinite and countable;
ℚ. The set of rational numbers, ℚ = {p/q | p ∈ ℤ, q ∈ ℕ₊}, infinite and countable;
ℝ. The set of real numbers, infinite, not countable, can be ordered;
ℂ. The set of complex numbers, ℂ = {x + iy | x, y ∈ ℝ}, infinite, not countable, cannot be
ordered.
These sets of numbers form a hierarchy, with ℕ ⊂ ℤ ⊂ ℚ ⊂ ℝ ⊂ ℂ. The size of a set of numbers is
an important aspect of its utility in describing natural phenomena. The set S = {Mary, Jane, Tom}
has three elements, and its size is defined by the cardinal number, |S| = 3. The sets ℕ, ℤ, ℚ, ℝ, ℂ
have an infinite number of elements, but the relation

$$z = \begin{cases} n/2 & \text{for } n \text{ even} \\ -(n+1)/2 & \text{for } n \text{ odd} \end{cases}$$

defines a one-to-one correspondence between n ∈ ℕ and z ∈ ℤ, so these sets are of the same size,
denoted by the transfinite number ℵ₀ (aleph-zero). The rationals can also be placed into a one-to-one
correspondence with ℕ, hence

|ℕ| = |ℤ| = |ℚ| = ℵ₀.
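The correspondence above can be checked numerically; the lines below are a minimal sketch (the function name natural_to_integer is chosen here for illustration and is not part of the text).

    # Map a natural number n to an integer z as in the formula above:
    # even n go to n/2, odd n go to -(n+1)/2.
    natural_to_integer(n) = iseven(n) ? n ÷ 2 : -(n + 1) ÷ 2

    # The images of 0, 1, 2, ... enumerate 0, -1, 1, -2, 2, ..., covering all integers.
    println([natural_to_integer(n) for n in 0:8])   # [0, -1, 1, -2, 2, -3, 3, -4, 4]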


In contrast there is no one-to-one mapping of the reals to the naturals, and the cardinality of
the reals is |ℝ| = 𝔠 (Fraktur-script c). Intuitively, there are exponentially more reals than naturals,
formally stated in set theory as 𝔠 = 2^ℵ₀.

1.2. Quantities described by a single number


The above numbers and their computer approximations are sufficient to describe many quantities
encountered in applications. Typical examples include:
• the position x ∈ ℝ of a point on the unit line segment [0, 1], approximated by the floating
point number x̃ ∈ 𝔽, to within machine epsilon precision, |x − x̃| ≤ ε;
• the measure of resistance to change of the rate of motion known as mass, m ∈ ℝ, m > 0;
• the population of a large community expressed as a float p ∈ 𝔽, even though for a community
of individuals the population is a natural number, as in the population of the
United States is p = 328.2E6, i.e., 328.2 million.
In most disciplines, there is a particular interest in comparison of two quantities, and to facilitate
such comparison a common reference is used, known as a standard unit. For measurement of
a length L, the meter ℓ = 1 m is a standard unit, as in the statement L = 10 m, that states that
L is obtained by taking the standard unit ten times, L = 10 ℓ. The rules for carrying out such
comparisons are part of the definition of real and rational numbers.
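Floating point approximation of the reals can be probed directly in Julia; the lines below are a brief sketch (not part of the text) showing machine epsilon and a representative rounding effect.

    # Machine epsilon for Float64: the spacing between 1.0 and the next larger float.
    println(eps(Float64))     # 2.220446049250313e-16

    # Reals are stored only to within this relative precision.
    x = 0.1 + 0.2             # floating point evaluation
    println(x == 0.3)         # false: the two stored values differ slightly
    println(abs(x - 0.3))     # difference on the order of machine epsilon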
1.3. Quantities described by multiple numbers
Other quantities require more than a single number. The distribution of population in
the year 2000 among the alphabetically-ordered South American countries (Argentina,
Bolivia, ..., Venezuela) requires 12 numbers. These are placed together in a list known in mathematics
as a tuple, in this case a 12-tuple P = (p1, p2, ..., p12), with p1 the population of Argentina,
p2 that of Bolivia, and so on. An analogous 12-tuple can be formed from the South American
populations in the year 2020, say Q = (q1, q2, ..., q12). Note that it is difficult to ascribe meaning
to apparently plausible expressions such as P + Q since, for instance, some people in the 2000
population are also in the 2020 population, and would be counted twice.

2. Vectors
2.1. Vector spaces
In contrast to the population 12-tuple example above, combining multiple numbers is well defined
in operations such as specifying a position within a three-dimensional Cartesian grid, or deter-
mining the resultant of two forces in space. Both of these lead to the consideration of 3-tuples
or triples such as the force ( f1, f2, f3). When combined with another force (g1, g2, g3) the resul-
tant is ( f1 + g1, f2 + g2, f3 + g3). If the force ( f1, f2, f3) is amplied by the scalar ± and the force
(g 1, g 2, g 3) is similarly scaled by ² , the resultant becomes

± ( f1, f2, f3) + ² (g 1, g 2, g 3) = (± f1, ± f2, ± f3) + (² g 1, ² g 2, ² g 3) = (± f1 + ² g 1, ± f2 + ² g 2, ± f3 + ² g 3).
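This componentwise rule is exactly what array arithmetic carries out in Julia; the short example below uses arbitrary illustrative values (not taken from the text).

    # Two forces given by their three components, and two scaling factors.
    f = [1.0, 2.0, 3.0]
    g = [0.5, -1.0, 2.0]
    α, β = 2.0, 3.0

    # Scaling and addition act component by component, as in the formula above.
    println(α*f + β*g)        # [3.5, 1.0, 12.0]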



It is useful to distinguish tuples for which scaling and addition is well defined from simple lists
of numbers. In fact, since the essential difference is the behavior with respect to scaling and
addition, the focus should be on these operations rather than the elements of the tuple.
The above observations underlie the definition of a vector space 𝒱 by a set V whose elements
satisfy scaling and addition properties, denoted all together by the 4-tuple 𝒱 = (V, S, +, ·). The
first element of the 4-tuple is a set whose elements are called vectors. The second element is a
set of scalars, and the third is the vector addition operation. The last is the scaling operation,
seen as multiplication of a vector by a scalar. Vector addition and scaling operations must satisfy
rules suggested by positions or forces in three-dimensional space, which are listed in Table 1.1.
In particular, a vector space requires definition of two distinguished elements: the zero vector
0 ∈ V, and the identity scalar element 1 ∈ S.

Addition rules for a, b, c ∈ V
  a + b ∈ V                      Closure
  a + (b + c) = (a + b) + c      Associativity
  a + b = b + a                  Commutativity
  0 + a = a                      Zero vector
  a + (−a) = 0                   Additive inverse
Scaling rules for a, b ∈ V, x, y ∈ S
  x a ∈ V                        Closure
  x (a + b) = x a + x b          Distributivity
  (x + y) a = x a + y a          Distributivity
  x (y a) = (x y) a              Composition
  1 a = a                        Scalar identity

Table 1.1. Vector space 𝒱 = (V, S, +, ·) properties.
The definition of a vector space reflects everyday experience with vectors in Euclidean geometry,
and it is common to refer to such vectors by descriptions in a Cartesian coordinate system. For
example, a position vector r within the plane can be referred through the pair of coordinates
(x, y). This intuitive understanding can be made precise through the definition of a vector space
ℛ² = (ℝ², ℝ, +, ·), called the real 2-space. Vectors within ℛ² are elements of ℝ² = ℝ × ℝ = {(x, y) | x,
y ∈ ℝ}, meaning that a vector is specified through two real numbers, r ↔ (x, y). Addition of two
vectors, q ↔ (s, t), r ↔ (x, y), is defined by addition of coordinates q + r = (s + x, t + y). Scaling
r ↔ (x, y) by scalar a is defined by a r ↔ (a x, a y). Similarly, consideration of position vectors in
three-dimensional space leads to the definition of ℛ³ = (ℝ³, ℝ, +, ·), or more generally a real
m-space ℛᵐ = (ℝᵐ, ℝ, +, ·), m ∈ ℕ, m > 0.
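The rules of Table 1.1 can be spot-checked numerically in ℛ²; the snippet below uses arbitrary sample vectors chosen so the arithmetic is exact (a sketch, not code from the text).

    # Sample vectors and scalars in the real 2-space.
    a = [1.0, -2.0];  b = [0.5, 3.0]
    x, y = 2.0, -1.5

    println(a + b == b + a)            # commutativity of vector addition
    println(x*(a + b) == x*a + x*b)    # distributivity of scaling over vector addition
    println((x + y)*a == x*a + y*a)    # distributivity over scalar addition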
2.2. Real vector space ℛᵐ
Column vectors. Since the real spaces ℛᵐ = (ℝᵐ, ℝ, +, ·) play such an important role in themselves
and as a guide to other vector spaces, familiarity with vector operations in ℝᵐ is necessary
to fully appreciate the utility of linear algebra to a wide range of applications. Following the
usage in geometry and physics, the m real numbers that specify a vector u ∈ ℝᵐ are called the
components of u. The one-to-one correspondence between a vector and its components u ↔
(u1, ..., um), is by convention taken to define an equality relationship,

$$u = \begin{bmatrix} u_1 \\ \vdots \\ u_m \end{bmatrix}, \qquad (1.1)$$

with the components arranged vertically and enclosed in square brackets. Given two vectors u,
v ∈ ℝᵐ, and a scalar a ∈ ℝ, vector addition and scaling are defined in ℝᵐ by real number addition

and multiplication of components


$$u + v = \begin{bmatrix} u_1 \\ \vdots \\ u_m \end{bmatrix} + \begin{bmatrix} v_1 \\ \vdots \\ v_m \end{bmatrix} = \begin{bmatrix} u_1 + v_1 \\ \vdots \\ u_m + v_m \end{bmatrix}, \qquad a\,u = a \begin{bmatrix} u_1 \\ \vdots \\ u_m \end{bmatrix} = \begin{bmatrix} a u_1 \\ \vdots \\ a u_m \end{bmatrix}. \qquad (1.2)$$

The vector space ℛᵐ is defined using the real numbers as the set of scalars, and constructing
vectors by grouping together m scalars, but this approach can be extended to any set of scalars
S, leading to the definition of the vector spaces 𝒮ⁿ = (Sⁿ, S, +, ·). These will often be referred to as
n-vector space of scalars, signifying that the set of vectors is V = Sⁿ.
To aid in visual recognition of vectors, the following notation conventions are introduced:
• vectors are denoted by lower-case bold Latin letters: u, v;
• scalars are denoted by normal face Latin or Greek letters: a, b, α, β;
• the components of a vector are denoted by the corresponding normal face with subscripts
as in equation (1.1);
• related sets of vectors are denoted by indexed bold Latin letters: u1, u2, ..., un.

In Julia, successive components placed vertically are separated by a semicolon.


julia] [1; 2; -1; 2]

     1
     2
    -1
     2                        (1.3)

The equal sign in mathematics signifies a particular equivalence relationship. In computer systems
such as Julia the equal sign has the different meaning of assignment, that is defining the
label on the left side of the equal sign to be the expression on the right side. Subsequent invocation
of the label returns the assigned object. Components of a vector are obtained by enclosing
the index in square brackets.
julia] u=[1; 2; -1; 2]; u

     1
     2
    -1
     2                        (1.4)

Row vectors. Instead of the vertical placement of components into one column, the components
of u could have been placed horizontally in one row [ u1 ... um ], that contains the same
data, differently organized. By convention vertical placement of vector components is the preferred
organization, and u shall denote a column vector henceforth. A transpose operation
denoted by a T superscript is introduced to relate the two representations

uᵀ = [ u1 ... um ],

and uᵀ is the notation used to denote a row vector. In Julia, horizontal placement of successive
components in a row is denoted by a space.
julia] uT=transpose(u)

  [ 1  2  -1  2 ]              (1.5)

Compatible vectors. Addition of real vectors u, v ∈ ℝᵐ defines another vector w = u + v ∈ ℝᵐ.
The components of w are the sums of the corresponding components of u and v, wi = ui + vi,
for i = 1, 2, ..., m. Addition of vectors with different number of components is not defined, and
attempting to add such vectors produces an error. Such vectors with different number of components
are called incompatible, while vectors with the same number of components are said to
be compatible. Scaling of u by a defines a vector z = a u, whose components are zi = a ui, for i = 1,
2, ..., m. Vector addition and scaling in Julia are defined using the + and * operators.
julia] uT=[1 0 1 2]; vT=[2 1 3 -1]; wT=uT+vT

  [ 3  1  4  1 ]               (1.6)

julia] rT=[1 2]; uT+rT

DimensionMismatch

2.3. Working with vectors


Ranges. The vectors used in applications usually have a large number of components, m ≫
1, and it is important to become proficient in their manipulation. Previous examples defined
vectors by explicit listing of their m components. This is impractical for large m, and support is
provided for automated generation for often-encountered situations. First, observe that Table
1.1 mentions one distinguished vector, the zero element that is a member of any vector space,
0 ∈ V. The zero vector of a real vector space ℝᵐ is a column vector with m components, all of
which are zero, and a mathematical convention for specifying this vector is 0ᵀ = [ 0 0 ... 0 ].
This notation specifies that the transpose of the zero vector is the row vector with m zero components,
also written through explicit indexing of each component as 0i = 0, for i = 1, ..., m. Keep
in mind that the zero vector 0 and the zero scalar 0 are different mathematical objects. The
ellipsis symbol in the mathematical notation is transcribed in Julia by the notion of a range, with
1:m denoting all the integers starting from 1 to m, organized as a row vector. The notation is
extended to allow for strides different from one, and the mathematical ellipsis i = m, m − 1, ..., 1 is
denoted as m:-1:1. In general r:s:t denotes the set of numbers {r, r + s, ..., r + ns} with r + ns ≤ t,
and r, s, t real numbers and n a natural number, r, s, t ∈ ℝ, n ∈ ℕ. If there is no natural number n
such that r + ns ≤ t, an empty vector with no components is returned.

julia] m=4; 1:m

     1
     2
     3
     4                         (1.7)

julia] m:-1:2

     4
     3
     2                         (1.8)

julia] r=0; s=0.2; t=1; (r:s:t)'

  [ 0.0  0.2  0.4  0.6  0.8  1.0 ]       (1.9)

julia] r=0; s=0.3; t=1; (r:s:t)'

  [ 0.0  0.3  0.6  0.9 ]                 (1.10)

julia] r=0; s=-0.2; t=1; (r:s:t)'

  [ ]                                    (1.11)

Visualization. A component-by-component display of a vector becomes increasingly unwieldy
as the number of components m becomes large. For example, the numbers below seem an
inefficient way to describe the sine function.
julia] t=LinRange(0,1.5,10); y=sin.(t)

     0.0
     0.16589613269341502
     0.3271946967961522
     0.479425538604203
     0.618369803069737
     0.7401768531960371
     0.8414709848078965
     0.9194449792537551
     0.9719379013633127
     0.9974949866040544        (1.12)
Indeed, such a piece-by-piece approach is not the way humans organize large amounts of infor-
mation, preferring to conceptualize the data as some other entity: an image, a sound excerpt, a
smell, a taste, a touch, a sense of balance, or relative position. All seven of these human senses
will be shown to allow representation by linear algebra concepts, including representation by
vectors.
julia] clf(); plot(t,y); grid("on"); xlabel("t"); ylabel("y");
julia] cd(homedir()*"/courses/MATH347DS/images");
julia] savefig("L01Fig01.eps");

Figure 1.1. Sample plot (y versus t).

3. Matrices
3.1. Forming matrices
The real numbers themselves form the vector space ℛ¹ = (ℝ, ℝ, +, ·), as does any field of scalars,
𝒮¹ = (S, S, +, ·). Juxtaposition of m real numbers has been seen to define the new vector space
ℝᵐ. This process of juxtaposition can be continued to form additional mathematical objects. A
matrix is defined as a juxtaposition of compatible vectors. As an example, consider n vectors a1,
a2, ..., an ∈ V within some vector space 𝒱 = (V, S, +, ·). Form a matrix by placing the vectors into
a row,
a row,
A = [ a 1 a 2 ... a n ]. (1.13)
To aid in visual recognition of a matrix, upper-case bold Latin letters will be used to denote
matrices. The columns of a matrix will be denoted by the corresponding lower-case bold letter
with a subscripted index as in equation (1.13). Note that the number of columns in a matrix can
be different from the number of components in each column, as would be the case for matrix A
from equation (1.13) when choosing vectors from, say, the real space ℝᵐ, a1, a2, ..., an ∈ ℝᵐ.
Vectors were seen to be useful juxtapositions of scalars that could describe quantities a single
scalar could not: a position in space, a force in physics, or a sampled function graph. The crucial
utility of matrices is their central role in providing a means to obtain new vectors from their
column vectors, as suggested by experience with Euclidean spaces.
3.2. Identity matrix
Consider first ℛ¹, the vector space of real numbers. A position vector r ∈ ℝ¹ on the real axis is
specified by a single scalar component, r = [x], x ∈ ℝ. Read this to mean that the position r is
obtained by traveling x units from the origin at position vector 0 = [0]. Look closely at what is
meant by unit in this context. Since x is a scalar, the mathematical expression r = 0 + x has
no meaning, as addition of a vector to a scalar has not been defined. Recall that scalars were
introduced to capture the concept of scaling of a vector, so in the context of vector spaces they
always appear as multiplying some vector. The correct mathematical description is r = 0 + x e,
where e is the unit vector e = [1]. Taking the components leads to r1 = 01 + x e1, where r1, 01, e1 are
the first (and in this case only) components of the r, 0, e vectors. Since r1 = x, 01 = 0, e1 = 1, one
obtains the identity x = 0 + x · 1.
Now consider ℛ², the vector space of positions in the plane. Repeating the above train of thought
leads to the identification of two direction vectors e1 and e2,

$$r = \begin{bmatrix} x \\ y \end{bmatrix} = x \begin{bmatrix} 1 \\ 0 \end{bmatrix} + y \begin{bmatrix} 0 \\ 1 \end{bmatrix} = x\,e_1 + y\,e_2, \qquad e_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \; e_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.$$

julia] x=2; y=4; e1=[1; 0]; e2=[0; 1]; r=x*e1+y*e2

     2
     4                         (1.14)

Continuing the process to ℝᵐ gives

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} = x_1 e_1 + x_2 e_2 + \cdots + x_m e_m, \qquad e_1 = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \; e_2 = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \ldots, \; e_m = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}.$$
For arbitrary m, the components are now x1, x2, ..., xm rather than the alphabetically ordered
letters common for m = 2 or m = 3. It is then consistent with the adopted notation convention to
use x ∈ ℝᵐ to denote the position vector whose components are (x1, ..., xm). The basic idea is the
same as in the previous cases: to obtain a position vector scale direction e1 by x1, e2 by x2, ..., em
by xm, and add the resulting vectors.
Juxtaposition of the vectors e1, e 2,..., em leads to the formation of a matrix of special utility known
as the identity matrix
I = [ e1 e2 ... em ].
The identity matrix is an example of a matrix in which the number of column vectors n is equal
to the number of components in each column vector m = n . Such matrices with equal number of
columns and rows are said to be square. Due to entrenched practice an exception to the notation
convention is made and the identity matrix is denoted by I, but its columns are denoted by the
indexed bold-face of a different lower-case letter, e1, ..., em. If it becomes necessary to explicitly
state the number of columns in I, the notation Im is used to denote the identity matrix with m
columns, each with m components.
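In Julia the identity matrix is provided by the LinearAlgebra standard library; the snippet below (a small sketch, not from the text) forms it explicitly for m = 3 and confirms that its columns are the unit vectors e1, e2, e3.

    using LinearAlgebra

    m = 3
    Im = Matrix{Float64}(I, m, m)    # explicit m-by-m identity matrix
    println(Im[:, 1])                # first column is e1 = [1.0, 0.0, 0.0]
    println(Im * [2.0, -1.0, 4.0])   # I x reproduces x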

4. Linear combinations
4.1. Linear combination as a matrix-vector product
The expression x = x1 e1 + x2 e2 + ··· + xm em expresses the idea of scaling vectors within a set and
subsequent addition to form a new vector x. The matrix I = [ e1 e2 ... em ] groups these vectors
together in a single entity, and the scaling factors are the components of the vector x. To bring
all these concepts together it is natural to consider the notation

x = I x,

as a generalization of the scalar expression x = 1 · x. It is clear what the operation I x should
signify: it should capture the vector scaling and subsequent vector addition x1 e1 + x2 e2 + ··· +
xm em. A specific meaning is now ascribed to I x by identifying two definitions to one another.
Linear combination. Repeatedly stating vector scaling and subsequent vector addition is
unwieldy, so a special term is introduced for some given set of vectors {a1, ..., an}.

DEFINITION. (LINEAR COMBINATION). The linear combination of vectors a1, a2, ..., an ∈ V with scalars
x1, x2, ..., xn ∈ S in vector space (V, S, +, ·) is the vector b = x1 a1 + x2 a2 + ··· + xn an.

Matrix-vector product. Similar to the grouping of unit vectors e1, ..., em into the identity matrix
I, a more concise way of referring to arbitrary vectors a1, ..., an from the same vector space is
the matrix A = [ a1 a2 ... an ]. Combining these observations leads to the definition of a matrix-vector
product.

DEFINITION. (MATRIX-VECTOR PRODUCT). In the vector space (V, S, +, ·), the product of matrix
A = [ a1 a2 ... an ] composed of columns a1, a2, ..., an ∈ V with the vector x ∈ Sⁿ whose components
are scalars x1, x2, ..., xn ∈ S is the linear combination b = x1 a1 + x2 a2 + ··· + xn an = A x ∈ V.
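The definition can be verified numerically: the product A x coincides with the explicit linear combination of the columns of A. The values below are arbitrary illustrations, a sketch rather than code from the text.

    # Columns a1, a2 grouped into a matrix, and a coefficient vector x.
    a1 = [1.0, 0.0, 2.0]
    a2 = [-1.0, 3.0, 1.0]
    A  = [a1 a2]                          # 3-by-2 matrix
    x  = [2.0, 0.5]

    println(x[1]*a1 + x[2]*a2 ≈ A*x)      # true: A x is the linear combination of columns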

4.2. Linear algebra problem examples


Linear combinations in E2. Consider a simple example that leads to a common linear algebra
problem: decomposition of forces in the plane along two directions. Suppose a force is given in
terms of components along the Cartesian x, y-axes, b = bx ex + by ey, as expressed by the matrix-vector
multiplication b = I b. Note that the same force could be obtained by linear combination
of other vectors, for instance the normal and tangential components of the force applied on an
inclined plane with angle θ, b = xt et + xn en, as in Figure 1.2. This defines an alternate reference
system for the problem. The unit vectors along these directions are

$$t = \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix}, \qquad n = \begin{bmatrix} -\sin\theta \\ \cos\theta \end{bmatrix},$$

and can be combined into a matrix A = [ t n ]. The value of the components (xt , xn) are the
scaling factors and can be combined into a vector x = [ xt xn ]T . The same force must result
irrespective of whether its components are given along the Cartesian axes or the inclined plane
directions leading to the equality
Ib = b = Ax . (1.15)
Interpret equation (1.15) to state that the vector b could be obtained either as a linear combi-
nation of I , b = Ib , or as a linear combination of the columns of A , b = Ax . Of course the simpler
description seems to be I b for which the components are already known. But this is only due to
an arbitrary choice made by a human observer to define the force in terms of horizontal and vertical
components. The problem itself suggests that the tangential and normal components are
more relevant; for instance a friction force would be evaluated as a scaling of the normal force.
The components in this more natural reference system are not known, but can be determined
by solving the vector equality Ax = b , known as a linear system of equations. Procedures to carry
this out will be studied in more detail later, but Julia provides an instruction for this common
problem, the backslash operator, as in x=A\b.
julia] ex=[1; 0]; ey=[0; 1]; b=[0.2; 0.4]; I=[ex ey]; I*b

  0.2
  0.4                          (1.16)

julia] th=pi/6; c=cos(th); s=sin(th); tvec=[c; s]; nvec=[-s; c];
julia] A=[tvec nvec]

  0.8660254037844387  -0.49999999999999994
  0.49999999999999994  0.8660254037844387       (1.17)

julia] x=A\b

  0.37320508075688774
  0.2464101615137755           (1.18)

julia] x[1]*tvec+x[2]*nvec

  0.2
  0.4                          (1.19)

Figure 1.2. Alternative decompositions of force on inclined plane.



Note from the above calculations how the same vector is obtained by two different linear combinations,

$$b = \begin{bmatrix} 0.2 \\ 0.4 \end{bmatrix} = 0.2 \begin{bmatrix} 1 \\ 0 \end{bmatrix} + 0.4 \begin{bmatrix} 0 \\ 1 \end{bmatrix} = 0.373 \begin{bmatrix} 0.866 \\ 0.500 \end{bmatrix} + 0.246 \begin{bmatrix} -0.500 \\ 0.866 \end{bmatrix}.$$
The general problem of determining what description is more insightful is a key question arising
in linear algebra applications.

5. Vectors and matrices in data science


The above examples highlight some essential aspects of linear algebra in the context of data
science applications.
" Vectors organize information that cannot be expressed as a single number and for which
there exists a concept of scaling and addition.
" Matrices group together multiple vectors.
" The matrix-vector product expresses a linear combination of the column vectors of the
matrix.
" Solving a linear system Ax = b = Ib , to nd x  m for given b  m , re-expresses the linear
combination
b = b1e1 + Å Å Å + bm em , I = [ e1 e2 ... e m ],
as another linear combination
b = x1 a 1 + x2 a 2 + ... xn a n, A = [ a 1 a 2 ... a n ].
For certain problems the linear combination Ax might be more insightful, but the above
transformation is information-preserving, with b , x both having the same number of
components.
" Finding the best approximation of some given b  m by a linear combination Ax of the
n column vectors of A  m ×n is known as a least squares problem and transforms the
information from the m components of b into n components of x , and knowledge of the
form of the column vectors. If m > n and the form of the columns of A can be succintly
stated, the transformation compresses information.
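As a preview of the least squares problem, Julia's backslash operator already handles the overdetermined case m > n; the example below fits a straight line to four synthetic data points (a sketch, not from the text).

    # Overdetermined system: fit y ≈ x1 + x2*t to m = 4 samples using n = 2 coefficients.
    t = [0.0, 1.0, 2.0, 3.0]
    y = [1.1, 1.9, 3.2, 3.9]
    A = [ones(4) t]            # 4-by-2 matrix with columns [1, t]
    x = A \ y                  # least squares solution when A is tall
    println(x)                 # approximately [1.07, 0.97]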
Data science seeks to extract regularity directly from available data, not necessarily invoking
any additional hypotheses. The typical scenario is that immense amounts of data are available,
with limited capability of human analysis. In this context it is apparent that the least squares
problem is of greater interest than solving a linear system with a square matrix. It should also
be clear that while computation by hand of small examples is useful to solidify theoretical
concepts, it is essential to become proficient in the use of software that can deal with large data
sets, such as Julia.

LINEAR MAPPINGS
SYNOPSIS. Vectors have been introduced to represent complicated objects, whose description
requires m numbers, and the procedure of linear combination allows construction of new vec-
tors. Alternative insights into some object might be obtained by transformation of vectors. Of
all possible transformations those that are compatible with linear combinations are of special
interest. It turns out that matrices are not only important in organizing collections of vectors,
but also to represent such transformations, referred to as linear mappings.
1. Functions
1.1. Relations
The previous chapter focused on mathematical expression of the concept of quantification, the
act of associating human observation with measurements, as a first step of scientific inquiry.
Consideration of different types of quantities led to various types of numbers, vectors as groupings
of numbers, and matrices as groupings of vectors. Symbols were introduced for these quantities
along with some initial rules for manipulating such objects, laying the foundation for an
algebra of vectors and matrices. Science seeks to not only observe, but to also explain, which
now leads to additional operations for working with vectors and matrices that will define the
framework of linear algebra.
Explanations within scientific inquiry are formulated as hypotheses, from which predictions are
derived and tested. A widely applied mathematical transcription of this process is to organize
hypotheses and predictions as two sets X and Y, and then construct another set R of all of the
instances in which an element of X is associated with an element in Y. The set of all possible
instances of x ∈ X and y ∈ Y is the Cartesian product of X with Y, denoted as X × Y = {(x, y) | x ∈ X,
y ∈ Y}, a construct already encountered in the definition of the real 2-space ℛ² = (ℝ², ℝ, +, ·)
where ℝ² = ℝ × ℝ. Typically, not all possible tuples (x, y) ∈ X × Y are relevant, leading to the
following definition.

DEFINITION. (RELATION). A relation R between two sets X, Y is a subset of the Cartesian product
X × Y, R ⊂ X × Y.

The key concept is that of associating an input x ∈ X with an output y ∈ Y. Inverting the approach
and associating an output to an input is also useful, leading to the definition of an inverse relation
as R⁻¹ ⊂ Y × X, R⁻¹ = {(y, x) | (x, y) ∈ R}. Note that an inverse exists for any relation, and the
inverse of an inverse is the original relation, (R⁻¹)⁻¹ = R. From the above, a relation is a triplet (a
tuple with three elements), (X, Y, R), that will often be referred to by just its last member R.

Homogeneous relations. Many types of relations are defined in mathematics and encountered
in linear algebra, and establishing properties of specific relations is an important task
within data science. A commonly encountered type of relationship is from a set onto itself,
known as a homogeneous relation. For homogeneous relations H ⊂ A × A, it is common to replace
the set membership notation (a, b) ∈ H, stating that a ∈ A is in relationship H with b ∈ A, with a
binary operator notation a H b. Familiar examples include the equality and less than relationships
between reals, E, L ⊂ ℝ × ℝ, in which (a, b) ∈ E is replaced by a = b, and (a, b) ∈ L is replaced
by a < b. The equality relationship is its own inverse, and the inverse of the less than relationship
is the greater than relation G ⊂ ℝ × ℝ, G = L⁻¹, a < b ⇔ b > a.
1.2. Functions
Functions between sets X and Y are a specific type of relationship that often arise in science. For
a given input x ∈ X, theories that predict a single possible output y ∈ Y are of particular scientific
interest.

DEFINITION. (FUNCTION). A function from set X to set Y is a relation F ⊂ X × Y that associates to
x ∈ X a single y ∈ Y.

The above intuitive definition can be transcribed in precise mathematical terms as: F ⊂ X × Y is a
function if (x, y) ∈ F and (x, z) ∈ F implies y = z. Since it is a particular kind of relation, a function
is a triplet of sets (X, Y, F), but with a special, common notation to denote the triplet by f:
X → Y, with F = {(x, f(x)) | x ∈ X, f(x) ∈ Y} and the property that (x, y) ∈ F ⇒ y = f(x). The set
X is the domain and the set Y is the codomain of the function f. The value from the domain
x ∈ X is the argument of the function associated with the function value y = f(x). The function
value y is said to be returned by evaluation y = f(x). The previously defined relations R, P, I
are functions but S = {(a, α), (a, β), (a, γ)} is not. All relations can be inverted, and inversion
of a function defines a new relation, but which might not itself be a function. For example the
relation S⁻¹ = {(α, a), (β, a), (γ, a)} is a function, but its inverse (S⁻¹)⁻¹ = S is not.
Familiar functions include:
• the trigonometric functions cos: ℝ → [−1, 1], sin: ℝ → [−1, 1] that for argument θ ∈ ℝ
return the function values cos(θ), sin(θ) giving the Cartesian coordinates (x, y) ∈ ℝ² of
a point on the unit circle at angular extent θ from the x-axis;
• the exponential and logarithm functions exp: ℝ → ℝ, log: (0, ∞) → ℝ, as well as power and
logarithm functions in some other base a;
• polynomial functions pn: ℝ → ℝ, defined by a succession of additions and multiplications

$$p_n(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0 = \sum_{i=0}^{n} a_i x^i = ((a_n x + a_{n-1})x + \cdots + a_1)x + a_0.$$

Simple functions such as sin, cos, exp, log, are predefined in Octave, and when given a vector
argument return the function applied to each vector component.
octave] disp(cos(0:pi/4:pi))

1.0000e+00 7.0711e-01 6.1232e-17 -7.0711e-01 -1.0000e+00



octave] y=log2(1:8); disp(y)

   0.00000   1.00000   1.58496   2.00000   2.32193   2.58496   2.80735   3.00000
octave] disp(pow2(y))

1.0000 2.0000 3.0000 4.0000 5.0000 6.0000 7.0000 8.0000

octave] a=[1 0 -1]; x=-2:2; y=polyval(a,x); disp(y)

3 0 -1 0 3

octave]

As seen previously, a Euclidean space Em = (ℝᵐ, ℝ, +, ·) can be used to suggest properties of more
complex spaces such as the vector space of continuous functions 𝒞⁰(ℝ). A construct that will
be often used is to interpret a vector within Em as a function, since v ∈ ℝᵐ with components
v = [ v1 v2 ... vm ]ᵀ also defines a function v: {1, 2, ..., m} → ℝ, with values v(i) = vi. As the number
of components grows the function v can provide better approximations of some continuous
function f ∈ 𝒞⁰(ℝ) through the function values vi = v(i) = f(xi) at distinct sample points x1, x2, ...,
xm.
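The idea of a vector as a sampled function can be tried directly; the snippet below (a sketch, not from the text) samples f(x) = cos(x) at m equally spaced points and reports how finely the sample vector resolves the function as m grows.

    # Sample f(x) = cos(x) on [0, π] with m points; v stores the sample values.
    f(x) = cos(x)

    for m in (5, 50, 500)
        xs = range(0, π, length=m)
        v  = f.(xs)                                # v(i) = f(x_i), a vector with m components
        println(m, "  ", maximum(abs.(diff(v))))   # largest jump between neighboring samples
    end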
The above function examples are all defined on a domain of scalars or naturals and returned
scalar values. Within linear algebra the particular interest is on functions defined on sets of
vectors from some vector space 𝒱 = (V, S, +, ·) that return either scalars f: V → S, or vectors
from some other vector space 𝒲 = (W, S, +, ·), g: V → W. The codomain of a vector-valued
function might be the same set of vectors as its domain, h: V → V. The fundamental operation
within linear algebra is the linear combination a u + b v with a, b ∈ S, u, v ∈ V. A key aspect is
to characterize how a function behaves when given a linear combination as its argument, for
instance f(a u + b v) or g(a u + b v).
1.3. Linear functionals
Consider first the case of a function defined on a set of vectors that returns a scalar value. These
can be interpreted as labels attached to a vector, and are very often encountered in applications
from natural phenomena or data analysis.

DEFINITION. (FUNCTIONAL). A functional on vector space 𝒱 = (V, S, +, ·) is a function from the set
of vectors V to the set of scalars S of the vector space 𝒱.

DEFINITION. (LINEAR FUNCTIONAL). The functional f: V → S on vector space 𝒱 = (V, S, +, ·) is a
linear functional if for any two vectors u, v ∈ V and any two scalars a, b

f(a u + b v) = a f(u) + b f(v). (1.20)

Many different functionals may be defined on a vector space 𝒱 = (V, S, +, ·), and an insightful
alternative description is provided by considering the set of all linear functionals, that will be
denoted as V* = {f | f: V → S}. These can be organized into another vector space 𝒱* = (V*, S, +, ·)
with vector addition of linear functionals f, g ∈ V* and scaling by a ∈ S defined by

(f + g)(u) = f(u) + g(u), (a f)(u) = a f(u), for u ∈ V. (1.21)

DEFINITION. (DUAL VECTOR SPACE). For some vector space 𝒱, the vector space of linear functionals
𝒱* is called the dual vector space.

As is often the case, the above abstract definition can better be understood by reference to the
familiar case of Euclidean space. Consider ℛ² = (ℝ², ℝ, +, ·), the set of vectors in the plane with
x ∈ ℝ² the position vector from the origin (0, 0) to point X in the plane with coordinates (x1, x2).
One functional from the dual space of ℛ² is f2(x) = x2, i.e., taking the second coordinate of the
position vector. The linearity property is readily verified. For x, y ∈ ℝ², a, b ∈ ℝ,

f2(a x + b y) = a x2 + b y2 = a f2(x) + b f2(y).

Given some constant value h ∈ ℝ, the curves within the plane defined by f2(x) = h are called the
contour lines or level sets of f2. Several contour lines and position vectors are shown in Figure
1.3. The utility of functionals and dual spaces can be shown by considering a simple example
from physics. Assume that x2 is the height above ground level and a vector x is the displacement
of a body of mass m in a gravitational field. The mechanical work done to lift the body from
ground level to height h is W = m g h with g the gravitational acceleration. The mechanical work
is the same for all displacements x that satisfy the equation f2(x) = h. The work expressed in
units m g Δh can be interpreted as the number of contour lines f2(x) = n Δh intersected by the
displacement vector x. This concept of duality between vectors and scalar-valued functionals
arises throughout mathematics, the physical and social sciences and in data science. The term
duality itself comes from geometry. A point X in ℝ² with coordinates (x1, x2) can be defined
either as the end-point of the position vector x, or as the intersection of the contour lines of two
functionals f1(x) = x1 and f2(x) = x2. Either geometric description works equally well in specifying
the position of X, so it might seem redundant to have two such procedures. It turns out though
that many quantities of interest in applications can be defined through use of both descriptions,
as shown in the computation of mechanical work in a gravitational field.
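The linearity of f2 can be spot-checked with arbitrary vectors and scalars; the lines below are a small sketch, not code from the text.

    # The functional f2 extracts the second coordinate of a plane vector.
    f2(x) = x[2]

    x = [0.3, 0.7];  y = [-1.0, 2.0]
    a, b = 2.0, -0.5

    println(f2(a*x + b*y))          # 0.4
    println(a*f2(x) + b*f2(y))      # 0.4, confirming f2(a x + b y) = a f2(x) + b f2(y)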
Figure 1.3. Vectors in E2 and contour lines of the functional f2(x) = x2.

1.4. Linear mappings


Consider now functions f: V → W from vector space 𝒱 = (V, S, +, ·) to another vector space
𝒲 = (W, T, +, ·). As before, the action of such functions on linear combinations is of special
interest.

DEFINITION. (LINEAR MAPPING). A function f: V → W, from vector space 𝒱 = (V, S, +, ·) to vector
space 𝒲 = (W, S, +, ·), is called a linear mapping if for any two vectors u, v ∈ V and any two scalars
a, b ∈ S

f(a u + b v) = a f(u) + b f(v). (1.22)

The image of a linear combination a u + b v through a linear mapping is another linear combination
a f(u) + b f(v), and linear mappings are said to preserve the structure of a vector space, and
called homomorphisms in mathematics. The codomain of a linear mapping might be the same
as the domain in which case the mapping is said to be an endomorphism.
Matrix-vector multiplication has been introduced as a concise way to specify a linear combination

f(x) = A x = x1 a1 + ··· + xn an,

with a1, ..., an the columns of the matrix, A = [ a1 a2 ... an ]. This is a linear mapping between
the real spaces ℝⁿ and ℝᵐ, f: ℝⁿ → ℝᵐ, and indeed any linear mapping between real spaces can be
given as a matrix-vector product.
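Preservation of linear combinations by f(x) = A x is easy to confirm numerically; the snippet below uses an arbitrary 2-by-3 matrix (a sketch, not from the text).

    # A linear mapping from R^3 to R^2 given by a matrix.
    A = [1.0  2.0  0.0;
         0.0 -1.0  3.0]
    f(x) = A*x

    u = [1.0, 0.0, 2.0];  v = [0.5, -1.0, 1.0]
    a, b = 3.0, -2.0

    println(f(a*u + b*v) ≈ a*f(u) + b*f(v))   # true: f preserves linear combinations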

2. Measurements
Vectors within the real space ℝᵐ can be completely specified by m real numbers, even though
m is large in many realistic applications. A vector within 𝒞⁰(ℝ), i.e., a continuous function
defined on the reals, cannot be so specified since it would require an infinite, non-countable
listing of function values. In either case, the task of describing the elements of a vector space
𝒱 = (V, S, +, ·) by simpler means arises. Within data science this leads to classification problems
in accordance with some relevant criteria.
2.1. Equivalence classes
Many classification criteria are scalars, defined as a scalar-valued function f: V → S on a vector
space 𝒱 = (V, S, +, ·). The most common criteria are inspired by experience with Euclidean
space. In a Euclidean-Cartesian model (ℝ², ℝ, +, ·) of the geometry of a plane, a point O in the plane
is arbitrarily chosen to correspond to the zero vector 0 = [ 0 0 ]ᵀ, along with two preferred
vectors e1, e2 grouped together into the identity matrix I. The position of a point X in the plane with
respect to O is given by the linear combination

$$x = I x + 0 = [\, e_1 \;\; e_2 \,] \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = x_1 e_1 + x_2 e_2.$$

Several possible classifications of points in the plane are depicted in Figure 1.4: lines, squares,
circles. Intuitively, each choice separates the plane into subsets, and a given point in the plane
belongs to just one in the chosen family of subsets. A more precise characterization is given by
the concept of a partition of a set.
DEFINITION. (PARTITION) . A partition of a set is a grouping of its elements into non-empty subsets
such that every element is included in exactly one subset.

In precise mathematical terms, a partition of set S is P = {Si | Si ⊂ S, Si ≠ ∅, i ∈ I} such that ∀x ∈ S,
∃! j ∈ I for which x ∈ Sj. Since there is only one set (∃! signifies exists and is unique) to which
some given x ∈ S belongs, the subsets Si of the partition P are disjoint, i ≠ j ⇒ Si ∩ Sj = ∅. The
subsets Si are labeled by i within some index set I. The index set might be a subset of the
naturals, I ⊂ ℕ, in which case the partition is countable, possibly finite. The partitions of the plane
suggested by Figure 1.4 are however indexed by a real-valued label, i ∈ ℝ with I ⊂ ℝ.
A technique which is often used to generate a partition of a vector space 𝒱 = (V, S, +, ·) is to
define an equivalence relation between vectors, H ⊂ V × V. For some element u ∈ V, the equivalence
class of u is defined as all vectors v that are equivalent to u, {v | (u, v) ∈ H}. The set of
equivalence classes of H is called the quotient set and denoted as V/H, and the quotient set is a
partition of V. Figure 1.4 depicts four different partitions of the plane. These can be interpreted
geometrically, such as parallel lines or distance from the origin. With wider implications for
linear algebra, the partitions can also be given in terms of classification criteria specified by
functions.
Figure 1.4. Equivalence classes within the plane.


2.2. Norms
The partition of ℝ² by circles from Figure 1.4 is familiar; the equivalence classes are sets of
points whose position vector has the same size, {x = [ x1 x2 ]ᵀ | (x1² + x2²)^(1/2) = r}, or is at the same
distance from the origin. Note that familiarity with Euclidean geometry should not obscure
the fact that some other concept of distance might be induced by the data. A simple example is
statement of walking distance in terms of city blocks, in which the distance from a starting point
to an address x1 = 3 blocks east and x2 = 4 blocks north is x1 + x2 = 7 city blocks, not the Euclidean
distance (x1² + x2²)^(1/2) = 5 since one cannot walk through the buildings occupying a city block.
The above observations lead to the mathematical concept of a norm as a tool to evaluate vector
magnitude. Recall that a vector space is specified by two sets and two operations, 𝒱 = (V, S, +, ·),
and the behavior of a norm with respect to each of these components must be defined. The
desired behavior includes the following properties and formal definition.

Unique value. The magnitude of a vector v ∈ V should be a unique scalar, requiring the definition
of a function. The scalar could have irrational values and should allow ordering
of vectors by size, so the function should be from V to ℝ, f: V → ℝ. On the real line the
point at coordinate x is at distance |x| from the origin, and to mimic this usage the norm
of v ∈ V is denoted as ‖v‖, leading to the definition of a function ‖·‖: V → ℝ₊, ℝ₊ = {a | a ∈ ℝ,
a ≥ 0}.

Null vector case. Provision must be made for the only distinguished element of V, the null
vector 0. It is natural to associate the null vector with the null scalar element, ‖0‖ = 0. A
crucial additional property is also imposed, namely that the null vector is the only vector
whose norm is zero, ‖v‖ = 0 ⇒ v = 0. From knowledge of a single scalar value, an entire
vector can be determined. This property arises at key junctures in linear algebra, notably
in providing a link to another branch of mathematics known as analysis, and is needed to
establish the fundamental theorem of linear algebra or the singular value decomposition
encountered later.

Scaling. Transfer of the scaling operation v = a u property leads to imposing ‖v‖ = |a| ‖u‖. This
property ensures commensurability of vectors, meaning that the magnitude of vector v
can be expressed as a multiple of some standard vector magnitude ‖u‖.

Vector addition. Position vectors from the origin to coordinates x, y > 0 on the real line can
be added and |x + y| = |x| + |y|. If however the position vectors point in different directions,
x > 0, y < 0, then |x + y| < |x| + |y|. For a general vector space the analogous property is
known as the triangle inequality, ‖u + v‖ ≤ ‖u‖ + ‖v‖ for u, v ∈ V.

DEFINITION. (NORM). A norm on the vector space 𝒱 = (V, S, +, ·) is a function ‖·‖: V → ℝ₊ that for
u, v ∈ V, a ∈ S satisfies:
1. ‖v‖ = 0 ⇒ v = 0;
2. ‖a u‖ = |a| ‖u‖;
3. ‖u + v‖ ≤ ‖u‖ + ‖v‖.
Note that the norm is a functional, but the triangle inequality implies that it is not generally
a linear functional. Returning to Figure 1.4, consider the functions fi: ℝ² → ℝ₊ defined for x =
[ x1 x2 ]ᵀ through values

f1(x) = |x1|, f2(x) = |x2|, f3(x) = |x1| + |x2|, f4(x) = (|x1|² + |x2|²)^(1/2).

Sets of constant value of the above functions are also equivalence classes induced by the equivalence
relations Ei for i = 1, 2, 3, 4.
1. f1(x) = c ⇒ |x1| = c, E1 = {(x, y) | f1(x) = f1(y) ⇔ |x1| = |y1|} ⊂ ℝ² × ℝ²;
2. f2(x) = c ⇒ |x2| = c, E2 = {(x, y) | f2(x) = f2(y) ⇔ |x2| = |y2|} ⊂ ℝ² × ℝ²;
3. f3(x) = c ⇒ |x1| + |x2| = c, E3 = {(x, y) | f3(x) = f3(y) ⇔ |x1| + |x2| = |y1| + |y2|} ⊂ ℝ² × ℝ²;
4. f4(x) = c ⇒ (|x1|² + |x2|²)^(1/2) = c, E4 = {(x, y) | f4(x) = f4(y) ⇔ (|x1|² + |x2|²)^(1/2) = (|y1|² + |y2|²)^(1/2)} ⊂
ℝ² × ℝ².

These equivalence classes correspond to the vertical lines, horizontal lines, squares, and circles
of Figure 1.4. Not all of the functions fi are norms since f1(x) is zero for the non-null vector
x = [ 0 1 ]ᵀ, and f2(x) is zero for the non-null vector x = [ 1 0 ]ᵀ. The functions f3 and f4 are
indeed norms, and specific cases of the following general norm.
DEFINITION. (p-NORM IN m ) . The p-norm on the real vector space m = (m , ,+, Å) for p  1 is the
function  p: V ’+ with values x p = (|x1|p + |x2|p + Å Å Å + |xm |p)1/p, or
m 1/p
x = ((y |xi|p)) for x  m .
 p (1.23)
i =1
Denote by xi the largest component in absolute value of x ∈ ℝᵐ. As p increases, |xi|^p becomes dominant with respect to all other terms in the sum, suggesting the definition of an inf-norm by

‖x‖∞ = max_{1≤i≤m} |xi|.

This also works for vectors with equal components, since the fact that the number of components is finite while p → ∞ can be used, as exemplified for x = [a a ... a]^T by ‖x‖p = (m |a|^p)^(1/p) = m^(1/p) |a|, with m^(1/p) → 1.
Note that the Euclidean norm corresponds to p = 2, and is often called the 2-norm. The analogy between vectors and functions can be exploited to also define a p-norm for 𝒞⁰[a, b] = (C([a, b]), ℝ, +, ·), the vector space of continuous functions defined on [a, b].
DEFINITION. (p-NORM IN 𝒞⁰[a, b]). The p-norm on the vector space of continuous functions 𝒞⁰[a, b] for p ≥ 1 is the function ‖ ‖p: V → ℝ₊ with values

‖f‖p = ( ∫_a^b |f(x)|^p dx )^(1/p), for f ∈ C([a, b]).   (1.24)

The integration operation ∫_a^b can be intuitively interpreted as the value of the sum Σ_{i=1}^{m} from equation (1.23) for very large m and very closely spaced evaluation points of the function f(xi), for instance |xi+1 − xi| = (b − a)/m. An inf-norm can also be defined for continuous functions by

‖f‖∞ = sup_{x ∈ [a,b]} |f(x)|,

where sup, the supremum operation, can be intuitively understood as the generalization of the max operation over the countable set {1, 2, ..., m} to the uncountable set [a, b].
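
The Riemann-sum interpretation of (1.24) can be tried out directly. The commands below are a minimal Octave sketch (the function and interval are chosen only for illustration) that approximates the 2-norm of f(x) = sin(πx) on [0, 1]; the exact value is 1/√2.

octave] a=0; b=1; m=1000; h=(b-a)/m; x=a+h*(0:m-1)';  % sample points on [a,b]
octave] f=sin(pi*x); p=2;
octave] fnorm=(h*sum(abs(f).^p))^(1/p);               % Riemann-sum approximation of (1.24)
octave] disp([fnorm 1/sqrt(2)])                       % both values close to 0.70711
octave]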
Figure 1.5. Regions within ℝ² for which ‖x‖p ≤ 1, for p = 1, 2, 3, ∞.
Vector norms arise very often in applications, especially in data science since they can be used to classify data, and are implemented in software systems such as Octave, in which the norm function with a single argument computes the most commonly encountered norm, the 2-norm. If a second argument p is specified, the p-norm is computed.
octave] x=[1; 1; 1]; disp([norm(x) sqrt(3)])

1.7321 1.7321

octave] m=9; x=ones(m,1); disp([norm(x) sqrt(m)])

3 3

octave] m=4; x=ones(m,1); disp([norm(x,1) m])

4 4

octave] disp([norm(x,1) norm(x,2) norm(x,4) norm(x,8) norm(x,16) norm(x,inf)])

4.0000 2.0000 1.4142 1.1892 1.0905 1.0000

octave]

2.3. Inner product


Norms are functionals that define what is meant by the size of a vector, but are not linear. Even in the simplest case of the real line, the linearity relation |x + y| = |x| + |y| is not verified for x > 0, y < 0. Nor do norms characterize the familiar geometric concept of orientation of a vector. A particularly important orientation from Euclidean geometry is orthogonality between two vectors. Another function is required, but before a formal definition some intuitive understanding is sought by considering vectors and functionals in the plane, as depicted in Figure 1.6. Consider a position vector x = [x1 x2]^T ∈ ℝ² and the previously-encountered linear functionals

f1, f2: ℝ² → ℝ, f1(x) = x1, f2(x) = x2.

The x1 component of the vector x can be thought of as the number of level sets of f1 that it crosses; similarly for the x2 component. A convenient labeling of level sets is by their normal vectors. The level sets of f1 have normal vector e1^T = [1 0], and those of f2 have normal vector e2^T = [0 1]. Both of these can be thought of as matrices with two columns, each containing a single component. The products of these matrices with the vector x give the values of the functionals f1, f2:

e1^T x = [1 0][x1 x2]^T = 1·x1 + 0·x2 = x1 = f1(x),

e2^T x = [0 1][x1 x2]^T = 0·x1 + 1·x2 = x2 = f2(x).

Figure 1.6. Euclidean space E² and its dual.

In general, any linear functional f defined on the real space ℝᵐ can be labeled by a vector

a^T = [a1 a2 ... am],

and evaluated through the matrix-vector product f(x) = a^T x. This suggests the definition of another function s: ℝᵐ × ℝᵐ → ℝ,

s(a, x) = a^T x.

The function s is called an inner product; it has two vector arguments from which a matrix-vector product is formed and returns a scalar value, hence it is also called a scalar product. The definition from a Euclidean space can be extended to general vector spaces. For now, consider the field of scalars to be the reals, S = ℝ.
DEFINITION. (INNER PRODUCT). An inner product in the vector space 𝒱 = (V, ℝ, +, ·) is a function s: V × V → ℝ with properties
Symmetry. For any a, x ∈ V, s(a, x) = s(x, a).
Linearity in second argument. For any a, x, y ∈ V, α, β ∈ ℝ, s(a, αx + βy) = α s(a, x) + β s(a, y).
Positive definiteness. For any x ∈ V \ {0}, s(x, x) > 0.
The inner product s(a, x) returns the number of level sets of the functional labeled by a crossed by the vector x, and this interpretation underlies many applications in the sciences, as in the gravitational field example above. Inner products also provide a procedure to evaluate geometrical quantities and relationships.
Vector norm. In ℝᵐ the number of level sets of the functional labeled by x crossed by x itself is identical to the square of the 2-norm,

s(x, x) = x^T x = ‖x‖₂².

In general, the square root of s(x, x) satisfies the properties of a norm, and is called the norm induced by an inner product,

‖x‖ = s(x, x)^(1/2).

A real space together with the scalar product s(x, y) = x^T y and induced norm ‖x‖ = s(x, x)^(1/2) defines a Euclidean vector space Eᵐ.
Orientation. In E² the point specified by polar coordinates (r, θ) has the Cartesian coordinates x1 = r cos θ, x2 = r sin θ, and position vector x = [x1 x2]^T. The inner product

e1^T x = [1 0][x1 x2]^T = 1·x1 + 0·x2 = r cos θ,

is seen to contain information on the relative orientation of x with respect to e1. In general, the angle θ between two vectors x, y within any vector space with a scalar product can be defined by

cos θ = s(x, y) / [s(x, x) s(y, y)]^(1/2) = s(x, y) / (‖x‖ ‖y‖),

which becomes

cos θ = x^T y / (‖x‖ ‖y‖),

in a Euclidean space, x, y ∈ ℝᵐ.

Orthogonality. In E² two vectors are orthogonal if the angle between them is such that cos θ = 0, and this can be extended to an arbitrary vector space 𝒱 = (V, ℝ, +, ·) with a scalar product by stating that x, y ∈ V are orthogonal if s(x, y) = 0. In Eᵐ vectors x, y ∈ ℝᵐ are orthogonal if x^T y = 0.
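
These geometric quantities are directly computable in Octave; the session below is a minimal sketch (the vectors are chosen only for illustration) evaluating the induced norm, the cosine of the angle between two vectors, and an orthogonality check.

octave] x=[3; 4]; y=[-4; 3];
octave] disp([sqrt(x'*x) norm(x)])          % norm induced by the inner product equals the 2-norm: 5 5
octave] costheta=(x'*y)/(norm(x)*norm(y));
octave] disp([x'*y costheta])               % both zero, hence x and y are orthogonal
octave]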
3. Linear mapping composition
3.1. Matrix-matrix product
From two functions f: A → B and g: B → C, a composite function h = g ∘ f, h: A → C, is defined by

h(x) = g(f(x)).

Consider linear mappings between Euclidean spaces f: ℝⁿ → ℝᵐ, g: ℝᵐ → ℝᵖ. Recall that linear mappings between Euclidean spaces are expressed as matrix-vector multiplication

f(x) = Ax, g(y) = By, A ∈ ℝᵐˣⁿ, B ∈ ℝᵖˣᵐ.

The composite function h = g ∘ f is h: ℝⁿ → ℝᵖ, defined by

h(x) = g(f(x)) = g(Ax) = BAx.

Note that the intermediate vector u = Ax is subsequently multiplied by the matrix B. The composite function h is itself a linear mapping,

h(ax + by) = BA(ax + by) = B(aAx + bAy) = B(au + bv) = aBu + bBv = aBAx + bBAy = ah(x) + bh(y),

so it also can be expressed as a matrix-vector multiplication

h(x) = Cx = BAx.   (1.25)

Using the above, C is defined as the product of matrix B with matrix A,

C = BA.
The columns of C can be determined from those of A by considering the action of h on the column vectors of the identity matrix I = [e1 e2 ... en] ∈ ℝⁿˣⁿ. First, note that

A e1 = [a1 a2 ... an][1 0 ... 0]^T = a1, ..., A ej = [a1 a2 ... an][0 ... 1 ... 0]^T = aj, ..., A en = [a1 a2 ... an][0 ... 0 1]^T = an.   (1.26)

The above can be repeated for the matrix C = [c1 c2 ... cn] giving

h(e1) = C e1 = c1, ..., h(ej) = C ej = cj, ..., h(en) = C en = cn.   (1.27)

Combining the above equations leads to cj = B aj, or

C = [c1 c2 ... cn] = B [a1 a2 ... an].

From the above, the matrix-matrix product C = BA is seen to simply be a grouping of all the products of B with the column vectors of A,

C = [c1 c2 ... cn] = [B a1  B a2  ...  B an].

Matrix-vector and matrix-matrix products are implemented in Octave, so the above results can readily be verified.
octave] a1=[1; 2]; a2=[3; 4]; A=[a1 a2]

A =

1 3
2 4

octave] b1=[-1; 1; 3]; b2=[2; -2; 3]; B=[b1 b2]

B =

-1 2
1 -2
3 3

octave] C=B*A

C =

3 5
-3 -5
9 21

octave] c1=B*a1; c2=B*a2; [c1 c2]


ans =

3 5
-3 -5
9 21

octave]
CHAPTER 2
VECTOR SPACES

FORMAL RULES

1. Algebraic structures

1.1. Typical structures

A vector space has been introduced as a 4-tuple 𝒱 = (V, S, +, ·) with specific behavior of the vector addition and scaling operations. Arithmetic operations between scalars were implicitly assumed to be similar to those of the real numbers, but they also must be specified to obtain a complete definition of a vector space. Algebra defines various structures that specify the behavior of operations with objects. Knowledge of these structures is useful not only in linear algebra, but also in other mathematical approaches to data analysis such as topology or geometry.
Groups. A group is a 2-tuple 𝒢 = (G, +) containing a set G and an operation + with properties from Table 2.1. If a, b ∈ G, a + b = b + a, the group is said to be commutative. Besides the familiar example of integers under addition (ℤ, +), symmetry groups that specify spatial or functional relations are of particular interest. The rotations by 0, π/2, π, 3π/2 of the vertices of a square form a group.

Addition rules
a + b ∈ G                        Closure
a + (b + c) = (a + b) + c        Associativity
0 + a = a                        Identity element
a + (−a) = 0                     Inverse element

Table 2.1. Group 𝒢 = (G, +) properties, for a, b, c ∈ G.

Rings. A ring is a 3-tuple ℛ = (R, +, ·) containing a set R and two operations +, · with properties from Table 2.2. As is often the case, a ring is a more complex structure built up from simpler algebraic structures. With respect to addition a ring has the properties of a commutative group. Only associativity and existence of an identity element are imposed for multiplication. Matrix addition and multiplication has the structure of a ring (ℝᵐˣᵐ, +, ·).

Addition rules
(R, +) is a commutative (Abelian) group
Multiplication rules
a · b ∈ R                        Closure
(a · b) · c = a · (b · c)        Associativity
a · 1 = 1 · a = a                Identity element
Distributivity
a · (b + c) = (a · b) + (a · c)  on the left
(a + b) · c = (a · c) + (b · c)  on the right

Table 2.2. Ring ℛ = (R, +, ·) properties, for a, b, c ∈ R.

Fields. A field is a 3-tuple ℱ = (F, +, ·) containing a set F and two operations +, ·, each with the properties of a commutative group, but with special behavior for the inverse of the null element. The multiplicative inverse is denoted as a⁻¹. The scalars S in the definition of a vector space must satisfy the properties of a field. Since the operations are often understood from context, a field might be referred to as the full 3-tuple, or, more concisely, just through the set of elements as in the definition of a vector space.

Addition rules
(F, +) is a commutative (Abelian) group
Multiplication rules
(F, ·) is a commutative group, except that 0⁻¹ does not exist
Distributivity
a · (b + c) = (a · b) + (a · c)

Table 2.3. Field ℱ = (F, +, ·) properties, for a, b, c ∈ F.

Using the above definitions, a vector space 𝒱 = (V, S, +, ·) can be described as a commutative group (V, +) combined with a field S that satisfies the scaling properties a u ∈ V, a(u + v) = a u + a v, (a + b) u = a u + b u, a(b u) = (ab) u, 1 u = u, for a, b ∈ S, u, v ∈ V.

1.2. Vector subspaces


A central interest in data science is to seek simple description of complex objects. A typical situation is that many instances of some object of interest are initially given as an m-tuple v ∈ ℝᵐ with large m. Assuming that addition and scaling of such objects can cogently be defined, a vector space is obtained, say over the field of reals with a Euclidean distance, Eᵐ. Examples include recordings of medical data (electroencephalograms, electrocardiograms), sound recordings, or images, for which m can easily reach into the millions. A natural question to ask is whether all the m real numbers are actually needed to describe the observed objects, or perhaps there is some intrinsic description that requires a much smaller number of descriptive parameters, one that still preserves the useful idea of linear combination. The mathematical transcription of this idea is a vector subspace.
DEFINITION. (VECTOR SUBSPACE). 𝒰 = (U, S, +, ·), U ≠ ∅, is a vector subspace of vector space 𝒱 = (V, S, +, ·) over the same field of scalars S, denoted by 𝒰 ≤ 𝒱, if U ⊆ V and for any a, b ∈ S, u, v ∈ U, the linear combination a u + b v ∈ U.

The above states that a vector subspace must be closed under linear combination, and have the same vector addition and scaling operations as the enclosing vector space. The simplest vector subspace of a vector space is the null subspace that only contains the null element, U = {0}. In fact any subspace must contain the null element 0, or otherwise closure would not be verified for the particular linear combination u + (−u) = 0. If U ⊊ V, then 𝒰 is said to be a proper subspace of 𝒱, denoted by 𝒰 < 𝒱.

Setting m − n of the components equal to zero in the real space ℝᵐ defines a proper subspace whose elements can be placed into a one-to-one correspondence with the vectors within ℝⁿ. For example, setting component m of x ∈ ℝᵐ equal to zero gives x′ = [x1 x2 ... x_{m−1} 0]^T that, while not a member of ℝ^{m−1}, is in a one-to-one relation with x″ = [x1 x2 ... x_{m−1}]^T ∈ ℝ^{m−1}. Dropping the last component of y ∈ ℝᵐ, y = [y1 y2 ... y_{m−1} ym]^T, gives the vector y′ = [y1 y2 ... y_{m−1}]^T ∈ ℝ^{m−1}, but this is no longer a one-to-one correspondence since for some given y′, the last component ym could take any value.


octave] m=3; x=[1; 2; 0]; xp=x(1:2); disp(xp)

1
2
octave] y=[1; 2; 3]; yp=y(1:2); disp(yp)

1
2
octave]

Vector subspaces arise in decomposition of a vector space. The converse, composition of vector spaces 𝒰 = (U, S, +, ·), 𝒱 = (V, S, +, ·), is also defined in terms of linear combination. A vector x ∈ ℝ³ can be obtained as the linear combination

x = [ x1     [ x1     [ 0
      x2   =   0    +   x2
      x3 ]     0  ]     x3 ],

but also as

x = [ x1     [ x1         [ 0
      x2   =   x2 − a   +   a
      x3 ]     0       ]    x3 ],

for some arbitrary a ∈ ℝ. In the first case, x is obtained as a unique linear combination of a vector from the set U = {[x1 0 0]^T | x1 ∈ ℝ} with a vector from V = {[0 x2 x3]^T | x2, x3 ∈ ℝ}. In the second case, there is an infinity of linear combinations of a vector from V with another from W = {[x1 x2 0]^T | x1, x2 ∈ ℝ} to give the vector x. This is captured by a pair of definitions to describe vector space composition.

DEFINITION. Given two vector subspaces 𝒰 = (U, S, +, ·), 𝒱 = (V, S, +, ·) of the space 𝒲 = (W, S, +, ·), the sum is the vector space 𝒰 + 𝒱 = (U + V, S, +, ·), where the sum of the two sets of vectors U, V is U + V = {u + v | u ∈ U, v ∈ V}.
DEFINITION. Given two vector subspaces 𝒰 = (U, S, +, ·), 𝒱 = (V, S, +, ·) of the space 𝒲 = (W, S, +, ·), the direct sum is the vector space 𝒰 ⊕ 𝒱 = (U ⊕ V, S, +, ·), where the direct sum of the two sets of vectors U, V is U ⊕ V = {u + v | ∃! u ∈ U, ∃! v ∈ V} (unique decomposition).
Since the same scalar field, vector addition, and scaling are used, it is more convenient to refer to vector space sums simply by the sum of the vector sets, U + V or U ⊕ V, instead of specifying the full tuple for each space. This shall be adopted henceforth to simplify the notation.

octave] u=[1; 0; 0]; v=[0; 2; 3]; vp=[0; 1; 3]; w=[1; 1; 0]; disp([u+v vp+w])

1 1
2 2
3 3

octave]

In the previous example, the essential difference between the two ways to express x ∈ ℝ³ is that U ∩ V = {0}, but V ∩ W = {[0 a 0]^T | a ∈ ℝ} ≠ {0}, and in general if the zero vector is the only common element of two vector spaces then the sum of the vector spaces becomes a direct sum. In practice, the most important procedure to construct direct sums or check when an intersection of two vector subspaces reduces to the zero vector is through an inner product.
DEFINITION. Two vector subspaces U, V of the real vector space ℝᵐ are orthogonal, denoted as U ⊥ V, if u^T v = 0 for any u ∈ U, v ∈ V.
DEFINITION. Two vector subspaces U, V of U + V are orthogonal complements, denoted U = V^⊥, V = U^⊥, if they are orthogonal subspaces, U ⊥ V, and U ∩ V = {0}, i.e., the null vector is the only common element of both subspaces.

octave] disp([u'*v vp'*w])

0 1

octave]

The above concept of orthogonality can be extended to other vector subspaces, such as spaces of functions. It can
also be extended to other choices of an inner product, in which case the term conjugate vector spaces is sometimes
used.

The concepts of sum and direct sum of vector spaces used linear combinations of the form u + v . This notion can
be extended to arbitrary linear combinations.
DEFINITION. In vector space 𝒱 = (V, S, +, ·), the span of vectors a1, a2, ..., an ∈ V is the set of vectors reachable by linear combination,

span{a1, a2, ..., an} = {b ∈ V | ∃ x1, ..., xn ∈ S such that b = x1 a1 + ... + xn an}.

Note that for real vector spaces a member of the span of the vectors {a1, a2, ..., an} is the vector b obtained from the matrix-vector multiplication

b = Ax = [a1 a2 ... an] [ x1
                          x2
                          ...
                          xn ].

From the above, the span is a subset of the co-domain of the linear mapping f(x) = Ax.
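
Membership in a span can be tested numerically; one simple approach, given as a minimal sketch below with vectors chosen only for illustration, is to check whether appending a candidate vector to the matrix of spanning vectors increases the rank.

octave] a1=[1; 0; 1]; a2=[0; 1; 1]; A=[a1 a2];
octave] b=2*a1-3*a2; c=[1; 1; 0];                % b lies in span{a1,a2}, c does not
octave] disp([rank(A) rank([A b]) rank([A c])])  % expect 2 2 3
octave]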

2. Vector subspaces of a linear mapping

The wide-ranging utility of linear algebra essentially results from a complete characterization of the behavior of a linear mapping between vector spaces f: U → V, f(a u + b v) = a f(u) + b f(v). For some given linear mapping the questions that arise are:
1. Can any vector within V be obtained by evaluation of f?
2. Is there a single way that a vector within V can be obtained by evaluation of f?
Linear mappings between real vector spaces f: ℝⁿ → ℝᵐ have been seen to be completely specified by a matrix A ∈ ℝᵐˣⁿ. It is common to frame the above questions about the behavior of the linear mapping f(x) = Ax through sets associated with the matrix A. To frame an answer to the first question, a set of reachable vectors is first defined.
DEFINITION. The column space (or range) of matrix A ∈ ℝᵐˣⁿ is the set of vectors reachable by linear combination of the matrix column vectors,

C(A) = range(A) = {b ∈ ℝᵐ | ∃ x ∈ ℝⁿ such that b = Ax}.

By definition, the column space is included in the co-domain of the function f(x) = Ax, C(A) ⊆ ℝᵐ, and is readily seen to be a vector subspace of ℝᵐ. The question that arises is whether the column space is the entire co-domain, C(A) = ℝᵐ, which would signify that any vector can be reached by linear combination. If this is not the case then the column space would be a proper subset, C(A) ⊂ ℝᵐ, and the question is to determine what part of the co-domain cannot be reached by linear combination of columns of A. Consider the orthogonal complement of C(A) defined as the set of vectors orthogonal to all of the column vectors of A, expressed through inner products as

a1^T y = 0, a2^T y = 0, ..., an^T y = 0.
1 2

This can be expressed more concisely through the transpose operation,

A = [a1 a2 ... an],   A^T y = [ a1^T y
                                a2^T y
                                ...
                                an^T y ],

and leads to the definition of a set of vectors for which A^T y = 0.

DEFINITION. The left null space (or cokernel) of a matrix A ∈ ℝᵐˣⁿ is the set

N(A^T) = null(A^T) = {y ∈ ℝᵐ | A^T y = 0}.

Note that the left null space is also a vector subspace of the co-domain of f(x) = Ax, N(A^T) ⊆ ℝᵐ. The above definitions suggest that both the matrix and its transpose play a role in characterizing the behavior of the linear mapping f(x) = Ax, so analogous sets are defined for the transpose A^T.

DEFINITION. The row space (or corange) of a matrix A ∈ ℝᵐˣⁿ is the set

R(A) = C(A^T) = range(A^T) = {c ∈ ℝⁿ | ∃ y ∈ ℝᵐ such that c = A^T y} ⊆ ℝⁿ.

DEFINITION. The null space of a matrix A ∈ ℝᵐˣⁿ is the set

N(A) = null(A) = {x ∈ ℝⁿ | Ax = 0} ⊆ ℝⁿ.

Examples. Consider a linear mapping between real spaces f: ℝⁿ → ℝᵐ, defined by y = f(x) = Ax = [y1 ... ym]^T, with A ∈ ℝᵐˣⁿ.

1. For n = 1, m = 3,

   A = [ 1
         0
         0 ],   A^T = [1 0 0],

the column space C(A) is the y1-axis, and the left null space N(A^T) is the y2y3-plane. Vectors that span these spaces are returned by the Octave orth and null functions.

octave] A=[1; 0; 0]; disp(orth(A)); disp('-----'); disp(null(A'))

  -1
  -0
  -0
  -----
   0   0
   1   0
   0   1

octave]

2. For n = 2, m = 3,

   A = [ 1 -1
         0  0
         0  0 ] = [a1 a2],   A^T = [  1 0 0
                                     -1 0 0 ],

the columns of A are colinear, a2 = −a1, the column space C(A) is the y1-axis, and the left null space N(A^T) is the y2y3-plane, as before.

octave] A=[1 -1; 0 0; 0 0]; disp(orth(A)); disp('-----'); disp(null(A'))

  -1.00000
  -0.00000
  -0.00000
  -----
   0   0
   1   0
   0   1

octave]

3. For n = 2, m = 3,

   A = [ 1 0
         0 1
         0 0 ],   A^T = [ 1 0 0
                          0 1 0 ],

the column space C(A) is the y1y2-plane, and the left null space N(A^T) is the y3-axis.

octave] A=[1 0; 0 1; 0 0]; disp(orth(A)); disp('-----'); disp(null(A'))

  -1  -0
  -0  -1
  -0  -0
  -----
   0
   0
   1

octave]

4. For n = 2, m = 3,

   A = [ 1  1
         1 -1
         0  0 ],   A^T = [ 1  1 0
                           1 -1 0 ],

the same C(A), N(A^T) are obtained, albeit with a different set of spanning vectors returned by orth.

octave] A=[1 1; 1 -1; 0 0]; disp(orth(A)); disp('-----'); disp(null(A'))

   0.70711   0.70711
   0.70711  -0.70711
  -0.00000  -0.00000
  -----
   0
   0
   1

octave]

5. For n = 3, m = 3,

   A = [ 1  1  3
         1 -1 -1
         1  1  3 ] = [a1 a2 a3],

   A^T = [ 1  1  1
           1 -1  1
           3 -1  3 ],   A^T y = [ a1^T y
                                  a2^T y
                                  a3^T y ].

Since a3 = a1 + 2 a2, the orthogonality condition A^T y = 0 is satisfied by vectors of the form y = [a 0 −a]^T, a ∈ ℝ.

octave] A=[1 1 3; 1 -1 -1; 1 1 3]; disp(orth(A)); disp('-----'); disp(null(A'))

   0.69157   0.14741
  -0.20847   0.97803
   0.69157   0.14741
  -----
   0.70711
   0.00000
  -0.70711

octave]
The above low-dimensional examples are useful to gain initial insight into the significance of the spaces C(A), N(A^T). Further appreciation can be gained by applying the same concepts to the processing of images. A gray-scale image of size px by py pixels can be represented as a vector with m = px·py components, b ∈ [0, 1]ᵐ ⊂ ℝᵐ. Even for a small image with px = py = 128 = 2⁷ pixels along each direction, the vector b would have m = 2¹⁴ components. An image can be specified as a linear combination of the columns of the identity matrix,

b = I b = [e1 e2 ... em] [ b1
                           b2
                           ...
                           bm ],

with bi the gray-level intensity in pixel i. Similar to the inclined plane example from §1, an alternative description as a linear combination of another set of vectors a1, ..., am might be more relevant. One choice of greater utility for image processing mimics the behavior of the set {1, cos t, cos 2t, ..., sin t, sin 2t, ...} that extends the second example in §1; for m = 4 such a basis A = [a1 a2 a3 a4] pairs a constant column with columns of alternating sign, in analogy to sampled cosine and sine functions.
DATA REDUNDANCY
1. Linear dependence
For the simple scalar mapping f: ℝ → ℝ, f(x) = ax, the condition f(x) = 0 implies either that a = 0 or x = 0. Note that a = 0 can be understood as defining a zero mapping f(x) = 0. Linear mappings between vector spaces, f: U → V, can exhibit different behavior, and the condition f(x) = Ax = 0 might be satisfied for both x ≠ 0 and A ≠ 0. Analogous to the scalar case, A = 0 can be understood as defining a zero mapping, f(x) = 0.
In vector space 𝒱 = (V, S, +, ·), vectors u, v ∈ V related by a scaling operation, v = a u, a ∈ S, are said to be colinear, and are considered to contain redundant data. This can be restated as v ∈ span{u}, from which it results that span{u} = span{u, v}. Colinearity can be expressed only in terms of vector scaling, but other types of redundancy arise when also considering vector addition as expressed by the span of a vector set. Assuming that v ∉ span{u}, the strict inclusion relation span{u} ⊂ span{u, v} holds. This strict inclusion expressed in terms of set concepts can be transcribed into an algebraic condition.
DEFINITION. The vectors a1, a2, ..., an ∈ V are linearly dependent if there exist n scalars x1, ..., xn ∈ S, at least one of which is different from zero, such that

x1 a1 + ... + xn an = 0.

Introducing a matrix representation of the vectors,

A = [a1 a2 ... an],   x = [ x1
                            x2
                            ...
                            xn ],

allows restating linear dependence as the existence of a non-zero vector, ∃ x ≠ 0, such that Ax = 0. Linear dependence can also be written as Ax = 0 ⇏ x = 0, that is, one cannot deduce from the fact that the linear mapping f(x) = Ax attains a zero value that the argument itself is zero. The converse of this statement would be that the only way to ensure Ax = 0 is for x = 0, or Ax = 0 ⇒ x = 0, leading to the concept of linear independence.
DEFINITION. The vectors a1, a2, ..., an ∈ V are linearly independent if the only n scalars x1, ..., xn ∈ S that satisfy

x1 a1 + ... + xn an = 0,   (2.1)

are x1 = 0, x2 = 0, ..., xn = 0.
2. Basis and dimension


Vector spaces are closed under linear combination, and the span of a vector set 𝒜 = {a1, a2, ...} defines a vector subspace. If the entire set of vectors can be obtained by a spanning set, V = span 𝒜, extending 𝒜 by an additional element, ℬ = 𝒜 ∪ {b}, would be redundant since span 𝒜 = span ℬ. This is recognized by the concept of a basis, and also leads to a characterization of the size of a vector space by the cardinality of a basis set.

DEFINITION. A set of vectors u1, ..., un ∈ V is a basis for vector space 𝒱 = (V, S, +, ·) if
1. u1, ..., un are linearly independent;
2. span{u1, ..., un} = V.

DEFINITION. The number of vectors u1, ..., un ∈ V within a basis is the dimension of the vector space 𝒱 = (V, S, +, ·).
3. Dimension of matrix spaces


The domain and co-domain of the linear mapping f: U → V, f(x) = Ax, are decomposed by the spaces associated with the matrix A. When U = ℝⁿ, V = ℝᵐ, the following vector subspaces associated with the matrix A ∈ ℝᵐˣⁿ have been defined:

• C(A), the column space of A;

• C(A^T), the row space of A;

• N(A), the null space of A;

• N(A^T), the left null space of A, or null space of A^T.

DEFINITION. The rank of a matrix A ∈ ℝᵐˣⁿ is the dimension of its column space and is equal to the dimension of its row space.

DEFINITION. The nullity of a matrix A ∈ ℝᵐˣⁿ is the dimension of its null space.
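
Rank and nullity are readily computed in Octave; the minimal sketch below (matrix chosen only for illustration) also verifies that they add up to the number of columns n.

octave] A=[1 1 3; 1 -1 -1; 1 1 3]; n=columns(A);
octave] r=rank(A); nullity=columns(null(A));
octave] disp([r nullity r+nullity n])   % expect 2 1 3 3
octave]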
CHAPTER 3
FUNDAMENTAL THEOREM OF LINEAR ALGEBRA

DATA INFORMATION
1. Partition of linear mapping domain and codomain
A partition of a set S has been introduced as a collection of subsets P = {Si | Si ⊆ S, Si ≠ ∅} such that any given element x ∈ S belongs to only one set in the partition. This is modified when applied to subspaces of a vector space, and a partition of a set of vectors is understood as a collection of subsets such that any vector except 0 belongs to only one member of the partition.
Linear mappings between vector spaces f: U → V can be represented by matrices A with columns that are images of the columns of a basis {u1, u2, ...} of U,

A = [f(u1) f(u2) ...].

Consider the case of real finite-dimensional domain and co-domain, f: ℝⁿ → ℝᵐ, in which case A ∈ ℝᵐˣⁿ,

A = [f(e1) f(e2) ... f(en)] = [a1 a2 ... an].

The column space of A is a vector subspace of the codomain, C(A) ≤ ℝᵐ, but according to the definition of dimension, if n < m there remain non-zero vectors within the codomain that are outside the range of A,

n < m ⇒ ∃ v ∈ ℝᵐ, v ≠ 0, v ∉ C(A).

All of the non-zero vectors in N(A^T), namely the set of vectors orthogonal to all columns in A, fall into this category. The above considerations can be stated as

C(A) ≤ ℝᵐ, N(A^T) ≤ ℝᵐ, C(A) ⊥ N(A^T), C(A) + N(A^T) ≤ ℝᵐ.

The question that arises is whether there remain any non-zero vectors in the codomain that are not part of C(A) or N(A^T). The fundamental theorem of linear algebra states that there are no such vectors, that C(A) is the orthogonal complement of N(A^T), and that their direct sum covers the entire codomain, C(A) ⊕ N(A^T) = ℝᵐ.
LEMMA 3.1. Let 𝒰, 𝒱 be subspaces of vector space 𝒲. Then 𝒲 = 𝒰 ⊕ 𝒱 if and only if
i. 𝒲 = 𝒰 + 𝒱, and
ii. 𝒰 ∩ 𝒱 = {0}.
Proof. 𝒲 = 𝒰 ⊕ 𝒱 ⇒ 𝒲 = 𝒰 + 𝒱 by the definition of a direct sum as a sum of vector subspaces. To prove that 𝒲 = 𝒰 ⊕ 𝒱 ⇒ 𝒰 ∩ 𝒱 = {0}, consider w ∈ 𝒰 ∩ 𝒱. Since w ∈ 𝒰 and w ∈ 𝒱, write

w = w + 0 (w ∈ 𝒰, 0 ∈ 𝒱), w = 0 + w (0 ∈ 𝒰, w ∈ 𝒱),

and since the expression w = u + v is unique, it results that w = 0. Now assume (i), (ii) and establish a unique decomposition. Assume there might be two decompositions of w ∈ 𝒲, w = u1 + v1, w = u2 + v2, with u1, u2 ∈ 𝒰, v1, v2 ∈ 𝒱. Obtain u1 + v1 = u2 + v2, or x = u1 − u2 = v2 − v1. Since x ∈ 𝒰 and x ∈ 𝒱 it results that x = 0, and u1 = u2, v1 = v2, i.e., the decomposition is unique. □

In the vector space U + V the subspaces U, V are said to be orthogonal complements if U ⊥ V and U ∩ V = {0}. When U ≤ ℝᵐ, the orthogonal complement of U is denoted as U^⊥, with U ⊕ U^⊥ = ℝᵐ.
THEOREM. Given the linear mapping associated with matrix A ∈ ℝᵐˣⁿ we have:
1. C(A) ⊕ N(A^T) = ℝᵐ, the direct sum of the column space and left null space is the codomain of the mapping;
2. C(A^T) ⊕ N(A) = ℝⁿ, the direct sum of the row space and null space is the domain of the mapping;
3. C(A) ⊥ N(A^T) and C(A) ∩ N(A^T) = {0}, the column space is orthogonal to the left null space, and they are orthogonal complements of one another,

C(A) = N(A^T)^⊥, N(A^T) = C(A)^⊥;

4. C(A^T) ⊥ N(A) and C(A^T) ∩ N(A) = {0}, the row space is orthogonal to the null space, and they are orthogonal complements of one another,

C(A^T) = N(A)^⊥, N(A) = C(A^T)^⊥.
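
The statements of the theorem can be spot-checked for a particular matrix; the minimal sketch below (matrix chosen only for illustration) verifies the orthogonality C(A) ⊥ N(A^T) and the dimension count dim C(A) + dim N(A^T) = m.

octave] A=[1 1 3; 1 -1 -1; 1 1 3]; [m,n]=size(A);
octave] U=orth(A); W=null(A');           % orthonormal bases for C(A) and N(A')
octave] disp(norm(U'*W))                 % orthogonality, expect 0 up to roundoff
octave] disp([columns(U)+columns(W) m])  % dimensions add up to m: 3 3
octave]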

Figure 3.1. Graphical representation of the Fundamental Theorem of Linear Algebra, Gil Strang, Amer. Math. Monthly 100, 848-855, 1993.

Consideration of equality between sets arises in proving the above theorem. A standard technique to show set equality A = B is by double inclusion, A ⊆ B ∧ B ⊆ A ⇒ A = B. This is shown for the statements giving the decomposition of the codomain ℝᵐ. A similar approach can be used for the decomposition of ℝⁿ.

i. C(A) ⊥ N(A^T) (column space is orthogonal to left null space).

Proof. Consider arbitrary u ∈ C(A), v ∈ N(A^T). By definition of C(A), ∃ x ∈ ℝⁿ such that u = Ax, and by definition of N(A^T), A^T v = 0. Compute u^T v = (Ax)^T v = x^T A^T v = x^T (A^T v) = x^T 0 = 0, hence u ⊥ v for arbitrary u, v, and C(A) ⊥ N(A^T). □

ii. C(A) ∩ N(A^T) = {0} (0 is the only vector both in C(A) and N(A^T)).

Proof. (By contradiction, reductio ad absurdum.) Assume there might be b ∈ C(A) and b ∈ N(A^T) with b ≠ 0. Since b ∈ C(A), ∃ x ∈ ℝⁿ such that b = Ax. Since b ∈ N(A^T), A^T b = A^T(Ax) = 0. Note that x ≠ 0 since x = 0 ⇒ b = 0, contradicting the assumptions. Multiply the equality A^T A x = 0 on the left by x^T,

x^T A^T A x = 0 ⇒ (Ax)^T(Ax) = b^T b = ‖b‖² = 0,

thereby obtaining b = 0, using norm property 1. Contradiction. □

iii. C(A) ⊕ N(A^T) = ℝᵐ.

Proof. Parts (i) and (ii) have established that C(A), N(A^T) are orthogonal complements,

C(A) = N(A^T)^⊥, N(A^T) = C(A)^⊥.

By Lemma 3.1 it results that C(A) ⊕ N(A^T) = ℝᵐ. □

The remainder of the FTLA is established by considering B = A^T; e.g., since it has been established in (iii) that C(B) ⊕ N(B^T) = ℝⁿ for B ∈ ℝⁿˣᵐ, replacing B = A^T yields C(A^T) ⊕ N(A) = ℝⁿ, etc.

DATA PARTITIONING
1. Mappings as data
1.1. Vector spaces of mappings and matrix representations
A vector space ℒ can be formed from all linear mappings from the vector space 𝒰 = (U, S, +, ·) to another vector space 𝒱 = (V, S, +, ·),

ℒ = (L, S, +, ·), L = {f | f: U → V, f(a u + b v) = a f(u) + b f(v)},

with addition and scaling of linear mappings defined by (f + g)(u) = f(u) + g(u) and (a f)(u) = a f(u). Let B = {u1, u2, ...} denote a basis for the domain U of the linear mappings within ℒ, such that the linear mapping f ∈ ℒ is represented by the matrix

A = [f(u1) f(u2) ...].

When the domain and codomain are the real vector spaces U = ℝⁿ, V = ℝᵐ, the above is a standard matrix of real numbers, A ∈ ℝᵐˣⁿ. For linear mappings between infinite-dimensional vector spaces the matrix is understood in a generalized sense to contain an infinite number of columns that are elements of the codomain V. For example, the indefinite integral is a linear mapping on the vector space of functions that can be differentiated to any order,

∫: 𝒞^∞ → 𝒞^∞, v(x) = ∫ u(x) dx,

and for the monomial basis B = {1, x, x², ...} it is represented by the generalized matrix

A = [ x  x²/2  x³/3  ... ].

Truncation of the basis expansion u(x) = Σ_{j=1} uj x^j, where uj ∈ ℝ, to n terms, and sampling of u ∈ 𝒞^∞ at the points x1, ..., xm, forms a standard matrix of real numbers,

A = [ x  x²/2  x³/3  ... ] ∈ ℝᵐˣⁿ, where x^j = [x1^j x2^j ... xm^j]^T denotes the vector of samples of the jth monomial.

As is to be expected, matrices can also be organized as a vector space ℳ, which is essentially the representation of the associated vector space of linear mappings,

ℳ = (M, S, +, ·), M = {A | A = [f(u1) f(u2) ...]}.

The addition C = A + B and scaling S = aR of matrices is given in terms of the matrix components by

cij = aij + bij, sij = a rij.

1.2. Measurement of mappings


From the above it is apparent that linear mappings and matrices can also be considered as data, and a first step in the analysis of such data is the definition of functionals that would attach a single scalar label to each linear mapping or matrix. Of particular interest is the definition of a norm functional that characterizes in an appropriate sense the size of a linear mapping.

Consider first the case of finite matrices with real components A ∈ ℝᵐˣⁿ that represent linear mappings between real vector spaces f: ℝⁿ → ℝᵐ. The columns a1, ..., an of A ∈ ℝᵐˣⁿ could be placed into a single column vector c with mn components,

c = [ a1
      ...
      an ].

Subsequently the norm of the matrix A could be defined as the norm of the vector c. An example of this approach is the Frobenius norm,

‖A‖F = ‖c‖₂ = ( Σ_{i=1}^{m} Σ_{j=1}^{n} |aij|² )^(1/2).

A drawback of the above approach is that the structure of the matrix and its close relationship to a linear mapping is lost. A more useful characterization of the size of a mapping is to consider the amplification behavior of the linear mapping. The motivation is readily understood starting from linear mappings between the reals, f: ℝ → ℝ, that are of the form f(x) = ax. When given an argument of unit magnitude |x| = 1, the mapping returns a real number with magnitude |a|. For mappings f: ℝ² → ℝ² within the plane, arguments that satisfy ‖x‖₂ = 1 are on the unit circle, with components x = [cos θ  sin θ]^T, and have images through f given analytically by

f(x) = Ax = [a1 a2][cos θ  sin θ]^T = cos θ a1 + sin θ a2,

which correspond to ellipses.

Figure 1. Mapping of the unit circle by f(x) = Ax.

From the above, the mapping associated with A amplifies some directions more than others. This suggests a definition of the size of a matrix or a mapping by the maximal amplification of unit norm vectors within the domain.

DEFINITION. For vector spaces U, V with norms ‖ ‖U: U → ℝ₊, ‖ ‖V: V → ℝ₊, the induced norm of f: U → V is

‖f‖ = sup_{‖x‖U = 1} ‖f(x)‖V.

DEFINITION. For vector spaces ℝⁿ, ℝᵐ with norms ‖ ‖(n): ℝⁿ → ℝ₊, ‖ ‖(m): ℝᵐ → ℝ₊, the induced norm of matrix A ∈ ℝᵐˣⁿ is

‖A‖ = sup_{‖x‖(n) = 1} ‖Ax‖(m).

In the above, any vector norm can be used within the domain and codomain.
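
Octave provides both flavors of matrix norm; the commands below are a minimal sketch (matrix chosen only for illustration) comparing the Frobenius norm with the induced 2-norm, the latter being the largest amplification of a unit vector.

octave] A=[3 0; 0 1];
octave] disp([norm(A,'fro') sqrt(3^2+1^2)])            % Frobenius norm of A
octave] disp([norm(A,2) norm(A*[1;0]) norm(A*[0;1])])  % induced 2-norm 3 is the largest amplification
octave]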

2. The Singular Value Decomposition (SVD)


The fundamental theorem of linear algebra partitions the domain and codomain of a linear mapping f: U → V. For real vector spaces U = ℝⁿ, V = ℝᵐ the partition properties are stated in terms of the spaces of the associated matrix A as

C(A) ⊕ N(A^T) = ℝᵐ, C(A) ⊥ N(A^T), C(A^T) ⊕ N(A) = ℝⁿ, C(A^T) ⊥ N(A).

The dimension of the column and row spaces, r = dim C(A) = dim C(A^T), is the rank of the matrix, n − r is the nullity of A, and m − r is the nullity of A^T. An infinite number of bases could be defined for the domain and codomain. It is of great theoretical and practical interest to find bases with properties that facilitate insight or computation.

2.1. Orthogonal matrices

The above partitions of the domain and codomain are orthogonal, and suggest searching for orthogonal bases within these subspaces. Introduce a matrix representation for the bases,

U = [u1 u2 ... um] ∈ ℝᵐˣᵐ, V = [v1 v2 ... vn] ∈ ℝⁿˣⁿ,

with C(U) = ℝᵐ and C(V) = ℝⁿ. Orthogonality between columns ui, uj for i ≠ j is expressed as ui^T uj = 0. For i = j, the inner product is positive, ui^T ui > 0, and since scaling of the columns of U preserves the spanning property C(U) = ℝᵐ, it is convenient to impose ui^T ui = 1. Such behavior is concisely expressed as a matrix product,

U^T U = Im,

with Im the identity matrix in ℝᵐ. Expanded in terms of the column vectors of U the first equality is

[u1 u2 ... um]^T [u1 u2 ... um] = [ u1^T u1  u1^T u2  ...  u1^T um
                                    u2^T u1  u2^T u2  ...  u2^T um
                                    ...      ...      ...  ...
                                    um^T u1  um^T u2  ...  um^T um ] = Im.

It is useful to determine if a matrix X exists such that UX = Im, or

UX = U [x1 x2 ... xm] = [e1 e2 ... em].

The columns of X are the coordinates of the column vectors of Im in the basis U, and can readily be determined,

U xj = ej ⇒ U^T U xj = U^T ej ⇒ Im xj = U^T ej ⇒ xj = (U^T)j,

where (U^T)j is the jth column of U^T, hence X = U^T, leading to

U^T U = I = U U^T.

Note that the second equality,

[u1 u2 ... um][u1 u2 ... um]^T = u1 u1^T + u2 u2^T + ··· + um um^T = I,

acts as a normalization condition on the matrices Uj = uj uj^T.

DEFINITION. A square matrix U is said to be orthogonal if U^T U = U U^T = I.
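
Orthogonality of a matrix is easy to check numerically; the minimal sketch below (random matrix used only for illustration) extracts an orthogonal factor through QR factorization and verifies U^T U = U U^T = I.

octave] [U,R]=qr(rand(3));          % the Q factor of a random matrix is orthogonal
octave] disp(norm(U'*U-eye(3)))     % expect 0 up to roundoff
octave] disp(norm(U*U'-eye(3)))     % expect 0 up to roundoff
octave]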


2.2. Intrinsic basis of a linear mapping


Given a linear mapping f: U → V, expressed as y = f(x) = Ax, the simplest description of the action of A would be a simple scaling, as exemplified by g(x) = a x that has as its associated matrix a I. Recall that specification of a vector is typically done in terms of the identity matrix, b = I b, but may be more insightfully given in some other basis, Ax = I b. This suggests that especially useful bases for the domain and codomain would reduce the action of a linear mapping to scaling along orthogonal directions, and evaluate y = Ax by first re-expressing y in another basis U, U s = I y, and re-expressing x in another basis V, V r = I x. The condition that the linear operator reduces to simple scaling in these new bases is expressed as si = σi ri for i = 1, ..., min(m, n), with σi the scaling coefficients along each direction, which can be expressed as a matrix-vector product s = Σ r, where Σ ∈ ℝᵐˣⁿ is of the same dimensions as A and given by

Σ = [ σ1  0  ...  0   0 ... 0
      0   σ2 ...  0   0 ... 0
      ...         ...
      0   0  ...  σr  0 ... 0
      0   0  ...  0   0 ... 0
      ...         ...
      0   0  ...  0   0 ... 0 ].

Imposing the condition that U, V are orthogonal leads to

U s = y ⇒ s = U^T y, V r = x ⇒ r = V^T x,

which can be replaced into s = Σ r to obtain

U^T y = Σ V^T x ⇒ y = U Σ V^T x.

From the above, the orthogonal bases U, V and scaling coefficients Σ that are sought must satisfy A = U Σ V^T.

THEOREM. Every matrix A ∈ ℝᵐˣⁿ has a singular value decomposition (SVD)

A = U Σ V^T,

with properties:
1. U ∈ ℝᵐˣᵐ is an orthogonal matrix, U^T U = Im;
2. V ∈ ℝⁿˣⁿ is an orthogonal matrix, V^T V = In;
3. Σ ∈ ℝᵐˣⁿ is diagonal, Σ = diag(σ1, ..., σp), p = min(m, n), and σ1 ≥ σ2 ≥ ··· ≥ σp ≥ 0.

Proof. The proof of the SVD makes use of properties of the norm, concepts from analysis, and complete induction. Adopting the 2-norm, set σ1 = ‖A‖₂,

σ1 = sup_{‖x‖₂ = 1} ‖Ax‖₂.

The domain ‖x‖₂ = 1 is compact (closed and bounded), and the extreme value theorem implies that f(x) = Ax attains its maxima and minima, hence there must exist some vectors u1, v1 of unit norm such that σ1 u1 = A v1 ⇒ σ1 = u1^T A v1. Introduce orthogonal bases U1, V1 for ℝᵐ, ℝⁿ whose first column vectors are u1, v1, and compute

U1^T A V1 = [ u1^T          [ A v1 ... A vn ] = [ σ1  w^T
              ...                                 0   B   ] = C.
              um^T ]

In the above w^T is a row vector with n − 1 components u1^T A vj, j = 2, ..., n, and ui^T A v1 must be zero for u1 to be the direction along which the maximum norm ‖A v1‖ is obtained. Introduce the vectors

y = [ σ1      z = C y = [ σ1² + w^T w
      w  ],               B w         ],

and note that ‖z‖₂ ≥ σ1² + w^T w = ‖y‖₂ (σ1² + w^T w)^(1/2). From ‖U1^T A V1‖₂ = ‖A‖₂ = σ1 = ‖C‖₂ ≥ (σ1² + w^T w)^(1/2) it results that w = 0. By induction, assume that B has a singular value decomposition, B = U2 Σ2 V2^T, such that

U1^T A V1 = [ σ1  0^T           [ 1  0^T ][ σ1  0^T ][ 1  0^T
              0   U2 Σ2 V2^T ] =  0  U2  ][ 0   Σ2  ][ 0  V2^T ],

and the orthogonal matrices arising in the singular value decomposition of A are

U = U1 [ 1  0^T        V^T = [ 1  0^T
         0  U2 ],              0  V2^T ] V1^T. □

The scaling coefficients σj are called the singular values of A. The columns of U are called the left singular vectors, and those of V are called the right singular vectors.
The fact that the scaling coefficients are norms of A and of submatrices of A, σ1 = ‖A‖₂, is of crucial importance in applications. Carrying out the computation of the matrix products,

A = U Σ V^T = [u1 u2 ... ur ur+1 ... um] [ σ1 v1^T
                                           σ2 v2^T
                                           ...
                                           σr vr^T
                                           0
                                           ...
                                           0       ],

leads to a representation of A as a sum

A = Σ_{i=1}^{r} σi ui vi^T, r ≤ min(m, n),

A = σ1 u1 v1^T + σ2 u2 v2^T + ··· + σr ur vr^T.

Each product ui vi^T is a matrix of rank one, and is called a rank-one update. Truncation of the above sum to p terms leads to an approximation of A,

A ≅ Ap = Σ_{i=1}^{p} σi ui vi^T.

In very many cases the singular values exhibit rapid, exponential decay, σ1 ≫ σ2 ≫ ···, such that the approximation above is an accurate representation of the matrix A.
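
The quality of such truncations can be examined directly; the minimal sketch below (random matrix used only for illustration) builds a rank-p approximation from the first p singular triplets and reports the 2-norm error, which equals the first neglected singular value.

octave] A=rand(8,6); [U,S,V]=svd(A); p=3;
octave] Ap=U(:,1:p)*S(1:p,1:p)*V(:,1:p)';   % truncation of the SVD sum to p rank-one updates
octave] disp([norm(A-Ap,2) S(p+1,p+1)])     % error equals sigma_{p+1}
octave]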

Figure 3.2. Successive SVD approximations of Andy Warhol's painting, Marilyn Diptych (~1960), with k = 10, 20, 40 rank-one updates.

octave] m=350; P=m**2

P = 122500

octave] q1=(2*m+1)*10

q1 = 7010

octave] q2=(2*m+1)*20

q2 = 14020

octave] q3=(2*m+1)*40

q3 = 28040

octave]

2.3. SVD solution of linear algebra problems


The SVD can be used to solve common problems within linear algebra.
Change of coordinates. To change from vector coordinates b in the canonical basis I ∈ ℝᵐˣᵐ to coordinates x in some other basis A ∈ ℝᵐˣᵐ, a solution to the equation I b = A x can be found by the following steps.
1. Compute the SVD, U Σ V^T = A;
2. Find the coordinates of b in the orthogonal basis U, c = U^T b;
3. Scale the coordinates of c by the inverse of the singular values, yi = ci/σi, i = 1, ..., m, such that Σ y = c is satisfied;
4. Find the coordinates of y in basis V^T, x = V y.
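
The four steps can be traced in Octave; the minimal sketch below (matrix and vector chosen only for illustration) solves I b = A x through the SVD and compares the result with the backslash solution.

octave] A=[3 2; 1 2]; b=[1; 5];
octave] [U,S,V]=svd(A);        % step 1: compute the SVD
octave] c=U'*b;                % step 2: coordinates of b in basis U
octave] y=c./diag(S);          % step 3: scale by the inverse singular values
octave] x=V*y;                 % step 4: coordinates in basis V'
octave] disp([x A\b])          % both columns contain the same coordinates
octave]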
Best 2-norm approximation. In the above, A was assumed to be a basis, hence r = rank(A) = m. If the columns of A do not form a basis, r < m, then b ∈ ℝᵐ might not be reachable by linear combinations within C(A). The closest vector to b in the 2-norm is however found by the same steps, with the simple modification that in Step 3 the scaling is carried out only for the non-zero singular values, yi = ci/σi, i = 1, ..., r.
The pseudo-inverse. From the above, finding either the solution of A x = I b or the best approximation possible if A is not of full rank can be written as a sequence of matrix multiplications using the SVD,

(U Σ V^T) x = b ⇒ U (Σ V^T x) = b ⇒ (Σ V^T x) = U^T b ⇒ V^T x = Σ⁺ U^T b ⇒ x = V Σ⁺ U^T b,

where the matrix Σ⁺ ∈ ℝⁿˣᵐ (notice the inversion of dimensions) is defined as a matrix with elements σi⁻¹ on the diagonal, and is called the pseudo-inverse of Σ. Similarly, the matrix

A⁺ = V Σ⁺ U^T

that allows stating the solution of A x = b simply as x = A⁺ b is called the pseudo-inverse of A. Note that in practice A⁺ is not explicitly formed. Rather, the notation A⁺ is simply a concise reference to carrying out steps 1-4 above.
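
Octave implements the pseudo-inverse as the pinv function; the minimal sketch below (matrix chosen only for illustration) checks that V Σ⁺ U^T agrees with pinv(A) for a rank-deficient matrix.

octave] A=[1 1 3; 1 -1 -1; 1 1 3];           % rank-deficient matrix, r = 2
octave] [U,S,V]=svd(A); r=rank(A); s=diag(S);
octave] Sp=zeros(size(A')); Sp(1:r,1:r)=diag(1./s(1:r));   % pseudo-inverse of Sigma
octave] disp(norm(V*Sp*U'-pinv(A)))          % expect 0 up to roundoff
octave]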
CHAPTER 4
LEAST SQUARES

DATA COMPRESSION
A typical scenario in many sciences is the acquisition of m numbers to describe some object that is understood to actually require only n ≪ m parameters. For example, m voltage measurements ui of an alternating current could readily be reduced to three parameters: the amplitude, phase, and frequency, u(t) = a sin(ωt + φ). Very often a simple first-degree polynomial approximation y = ax + b is sought for a large data set D = {(xi, yi), i = 1, ..., m}. All of these are instances of data compression, a problem that can be solved in a linear algebra framework.

1. Projection
Consider a partition of a vector space U into orthogonal subspaces U = V ⊕ W, V = W^⊥, W = V^⊥. Within the typical scenario described above, U = ℝᵐ, V ⊂ ℝᵐ, W ⊂ ℝᵐ, dim V = n, dim W = m − n. If V = [v1 ... vn] ∈ ℝᵐˣⁿ is a basis for V and W = [w1 ... w_{m−n}] ∈ ℝᵐˣ⁽ᵐ⁻ⁿ⁾ is a basis for W, then U = [v1 ... vn w1 ... w_{m−n}] is a basis for U. Even though the matrices V, W are not necessarily square, they are said to be orthogonal, in the sense that all columns are of unit norm and orthogonal to one another. Computation of the matrix product V^T V leads to the formation of the identity matrix within ℝⁿ,

V^T V = [ v1^T v1  v1^T v2  ...  v1^T vn
          v2^T v1  v2^T v2  ...  v2^T vn
          ...      ...      ...  ...
          vn^T v1  vn^T v2  ...  vn^T vn ] = In.

Similarly, W^T W = I_{m−n}. Whereas for the square orthogonal matrix U multiplication both on the left and the right by its transpose leads to the formation of the identity matrix,

U^T U = U U^T = Im,

the same operations applied to rectangular orthogonal matrices lead to different results,

V^T V = In, V V^T = [v1 v2 ... vn][v1^T; v2^T; ...; vn^T] = Σ_{i=1}^{n} vi vi^T, rank(vi vi^T) = 1.

A simple example is provided by taking V = I_{m,n}, the first n columns of the identity matrix, in which case

V V^T = Σ_{i=1}^{n} ei ei^T = [ In  0
                                0   0 ] ∈ ℝᵐˣᵐ.


Applying P = V V^T to some vector b ∈ ℝᵐ leads to a vector r = P b whose first n components are those of b, and the remaining m − n are zero. The subtraction b − r leads to a new vector s = (I − P) b that has the first n components equal to zero, and the remaining m − n the same as those of b. Such operations are referred to as projections, and for V = I_{m,n} correspond to projection onto span{e1, ..., en}.

octave] I4=eye(5); V=I4(:,1:2); P=V*V'; Q=I4-P;


b=rand(5,1); r=P*b; s=Q*b; disp([P b r s])

1.00000 0.00000 0.00000 0.00000 0.00000 0.42253 0.42253 0.00000


0.00000 1.00000 0.00000 0.00000 0.00000 0.95900 0.95900 0.00000
0.00000 0.00000 0.00000 0.00000 0.00000 0.41781 0.00000 0.41781
0.00000 0.00000 0.00000 0.00000 0.00000 0.45744 0.00000 0.45744
0.00000 0.00000 0.00000 0.00000 0.00000 0.49784 0.00000 0.49784

octave]

Figure 4.1. Projection in ℝ² onto V = {[x 0]^T | x ∈ ℝ}, with W = {[0 y]^T | y ∈ ℝ}, r = P b, s = (I − P) b. The vectors r, s ∈ ℝ² have two components, but could be expressed through scaling of e1, e2.

Returning to the general case, the orthogonal matrices U ∈ ℝᵐˣᵐ, V ∈ ℝᵐˣⁿ, W ∈ ℝᵐˣ⁽ᵐ⁻ⁿ⁾ are associated with linear mappings b = f(x) = Ux, r = g(b) = Pb, s = h(b) = (I − P) b. The mapping f gives the components in the I basis of a vector whose components in the U basis are x. The mappings g, h project a vector onto span{v1, ..., vn} and span{w1, ..., w_{m−n}}, respectively. When V, W are orthogonal matrices the projections are also orthogonal, r ⊥ s. Projection can also be carried out onto nonorthogonal spanning sets, but the process is fraught with possible error, especially when the angle between basis vectors is small, and will be avoided henceforth.

Notice that projection of a vector already in the spanning set simply returns the same vector, which leads to a general definition.

DEFINITION. The mapping f is called a projection if f ∘ f = f, or if for any u ∈ U, f(f(u)) = f(u). With P the matrix associated with f, a projection matrix satisfies P² = P.

For P = V V^T with V^T V = I,

P² = P P = V V^T V V^T = V (V^T V) V^T = V I V^T = V V^T = P.
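
The idempotence property is easy to confirm numerically; the minimal sketch below (random subspace used only for illustration) builds an orthogonal projector onto a two-dimensional subspace of ℝ⁴.

octave] V=orth(rand(4,2)); P=V*V';             % orthogonal projector onto C(V)
octave] disp(norm(P*P-P))                      % idempotence P^2 = P, expect 0 up to roundoff
octave] b=rand(4,1); r=P*b; disp(norm(P*r-r))  % projecting a projected vector changes nothing
octave]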

2. Gram-Schmidt
Orthonormal vector sets {q1, ..., qn} are of the greatest practical utility, leading to the question of whether such a set can be obtained from an arbitrary set of vectors {a1, ..., an}. This is possible for independent vectors, through what is known as the Gram-Schmidt algorithm:

1. Start with an arbitrary direction a1;

2. Divide by its norm to obtain a unit-norm vector, q1 = a1/‖a1‖;

3. Choose another direction a2;

4. Subtract off its component along the previous direction(s), a2 − (q1^T a2) q1;

5. Divide by the norm, q2 = (a2 − (q1^T a2) q1) / ‖a2 − (q1^T a2) q1‖;

6. Repeat the above.

The component of a2 along q1 that is removed in step 4 is the rank-one projection

P1 a2 = (q1 q1^T) a2 = q1 (q1^T a2) = (q1^T a2) q1.
The above geometrical description can be expressed in terms of matrix operations as

A = [a1 a2 ... an] = [q1 q2 ... qn] [ r11 r12 r13 ... r1n
                                       0  r22 r23 ... r2n
                                       0   0  r33 ... r3n
                                      ...             ...
                                       0   0   0  ... rnn ] = Q R,

equivalent to the system

a1 = r11 q1
a2 = r12 q1 + r22 q2
...
an = r1n q1 + r2n q2 + ... + rnn qn.

The system is easily solved by forward substitution, resulting in what is known as the (modified) Gram-Schmidt algorithm, transcribed below both in pseudo-code and in Octave.

Algorithm (Gram-Schmidt)

Given n vectors a1, ..., an
Initialize q1 = a1, ..., qn = an, R = In
for i = 1 to n
  rii = (qi^T qi)^(1/2)
  if rii < ε break
  qi = qi / rii
  for j = i+1 to n
    rij = qi^T aj;  qj = qj − rij qi
  end
end
return Q, R

octave] function [Q,R] = mgs(A)
  [m,n]=size(A); Q=A; R=eye(n);
  for i=1:n
    R(i,i) = sqrt(Q(:,i)'*Q(:,i));
    if (R(i,i)<eps) break; end
    Q(:,i) = Q(:,i)/R(i,i);
    for j=i+1:n
      R(i,j) = Q(:,i)'*A(:,j);
      Q(:,j) = Q(:,j) - R(i,j)*Q(:,i);
    end;
  end;
end

octave]

Note that the normalization condition ‖qi‖ = 1 is satisfied by two values ±rii, so results from the above implementation might give orthogonal vectors q1, ..., qn of different orientations than those returned by the Octave qr function. The implementations provided by computational packages such as Octave contain many refinements of the basic algorithm and it's usually preferable to use these in applications.
octave] A=rand(4); [Q,R]=mgs(A); disp([Q R])

0.82757 -0.25921 -0.49326 0.06802 0.83553 0.64827 1.24651 1.05301


0.19408 0.53127 0.15805 0.80939 0.00000 0.93177 0.82700 0.87551
0.22006 0.79553 -0.12477 -0.55058 0.00000 0.00000 0.38433 -0.20336
0.47857 -0.13302 0.84625 -0.19270 0.00000 0.00000 0.00000 0.42469

octave] [Q1,R1]=qr(A); disp([Q1 R1])

-0.82757 0.25921 -0.49326 -0.06802 -0.83553 -0.64827 -1.24651 -1.05301


-0.19408 -0.53127 0.15805 -0.80939 0.00000 -0.93177 -0.82700 -0.87551
-0.22006 -0.79553 -0.12477 0.55058 0.00000 0.00000 0.38433 -0.20336
-0.47857 0.13302 0.84625 0.19270 0.00000 0.00000 0.00000 -0.42469
octave] disp([norm(A-Q*R) norm(A-Q1*R1)])

1.1102e-16 8.0390e-16

octave]

By analogy to arithmetic and polynomial algebra, the Gram-Schmidt algorithm furnishes a factorization

Q R = A,

with Q ∈ ℝᵐˣⁿ with orthonormal columns and R ∈ ℝⁿˣⁿ an upper triangular matrix, known as the QR-factorization. Since the column vectors within Q were obtained through linear combinations of the column vectors of A we have

C(A) = C(Q) ≠ C(R).

Recall that a matrix-matrix product can be regarded column by column, A X = B, A [x1 ... xn] = [A x1 ... A xn].

The QR-factorization can be used to solve basic problems within linear algebra.

octave] A=[3 2; 1 2]

A =

3 2
1 2

octave] [Q R]=qr(A)

Q =

-0.94868 -0.31623
-0.31623 0.94868

R =

-3.16228 -2.52982
0.00000 1.26491

octave]

3. QR solution of linear algebra problems

3.1. Transformation of coordinates

Recall that when given a vector b ∈ ℝᵐ, an implicit basis is assumed, the canonical basis given by the column vectors of the identity matrix I ∈ ℝᵐˣᵐ. The coordinates x in another basis A ∈ ℝᵐˣᵐ can be found by solving the equation

I b = b = A x,

by an intermediate change of coordinates to the orthogonal basis Q. Since the basis Q is orthogonal the relation Q^T Q = I holds, and changes of coordinates from I to Q, Q c = b, are easily computed, c = Q^T b. Since matrix multiplication is associative,

b = A x = (Q R) x = Q (R x),

the relations R x = Q^T b = c are obtained, stating that x also contains the coordinates of c in the basis R. The three steps are:

1. Compute the QR-factorization, Q R = A;

2. Find the coordinates of b in the orthogonal basis Q, c = Q^T b;

3. Find the coordinates of x in basis R, R x = c.



Since R is upper-triangular,

[ r11 r12 r13 ... r1m ] [ x1    ]   [ c1    ]
[  0  r22 r23 ... r2m ] [ x2    ]   [ c2    ]
[  0   0  r33 ... r3m ] [ ...   ] = [ ...   ]
[ ...             ... ] [ xm−1  ]   [ cm−1  ]
[  0   0   0  ... rmm ] [ xm    ]   [ cm    ],

the coordinates of c in the R basis are easily found by back substitution.

Algorithm (Back substitution)

Given R upper-triangular, vector c
for i = m down to 1
  if rii < ε break
  xi = ci / rii
  for j = i−1 down to 1
    cj = cj − rji xi
  end
end
return x

octave] function x=bcks(R,c)
  [m,n]=size(R); x=zeros(m,1);
  for i=m:-1:1
    x(i) = c(i)/R(i,i);
    for j=i-1:-1:1
      c(j) = c(j) - R(j,i)*x(i);
    end;
  end;
end

octave]

The above operations are carried out in the background by the Octave backslash operator A\b to solve A*x=b, inspired by the scalar mnemonic ax = b ⇒ x = (1/a) b. Again, many additional refinements of the basic algorithm argue for using the built-in Octave functions, even though the above implementations can be verified as correct.

octave] xex=rand(4,1); b=A*xex; [Q,R]=mgs(A); c=Q'*b; x=bcks(R,c); xO=A\b;

octave] disp([xex x xO])

0.96838 0.96838 0.96838


0.31829 0.31829 0.31829
0.58529 0.58529 0.58529
0.38250 0.38250 0.38250

octave]

3.2. General orthogonal bases

The above approach for the real vector space ℝᵐ can be used to determine orthogonal bases for any other vector space by appropriate modification of the scalar product. For example, within the space of smooth functions 𝒞^∞[−1, 1] that can be differentiated an arbitrary number of times, the Taylor series

f(x) = f(0)·1 + f′(0)·x + (1/2) f″(0)·x² + ··· + (1/n!) f⁽ⁿ⁾(0)·xⁿ + ···

is seen to be a linear combination of the monomial basis M = {1, x, x², ...} with scaling coefficients {f(0), f′(0), (1/2) f″(0), ...}. The scalar product

(f, g) = ∫₋₁¹ f(x) g(x) dx

can be seen as the extension to the [−1, 1] continuum of the vector dot product. Orthogonalization of the monomial basis with the above scalar product leads to the definition of another family of polynomials, known as the Legendre polynomials,

Q0(x) = (1/2)^(1/2) · 1, Q1(x) = (3/2)^(1/2) · x, Q2(x) = (5/8)^(1/2) · (3x² − 1), Q3(x) = (7/8)^(1/2) · (5x³ − 3x), ... .

The Legendre polynomials are usually given with a different scaling such that Pk(1) = 1, rather than the unit norm condition ‖Qk‖ = (Qk, Qk)^(1/2) = 1. The above results can be recovered by sampling of the interval [−1, 1] at points xi = (i − 1)h − 1, h = 2/(m − 1), i = 1, ..., m, and by approximation of the integral by a Riemann sum

∫₋₁¹ f(x) Lj(x) dx ≅ h Σ_{i=1}^{m} f(xi) Lj(xi) = h f^T Lj.

octave] m=50; h=2/(m-1); x=(-1:h:1)'; M=[x.^0 x.^1 x.^2 x.^3 x.^4]; [Q,R]=mgs(M);
S=diag(1./Q(m,:)); P=Q*S; sc=[-1 1 -1 1];
figure(1); plot(x,M(:,1),x,M(:,2),x,M(:,3),x,M(:,4)); axis(sc); grid on;
figure(2); plot(x,P(:,1),x,P(:,2),x,P(:,3),x,P(:,4)); axis(sc); grid on;

octave]

Figure 4.2. Comparison of the monomial basis (left) to the Legendre polynomial basis (right). The resolution of P_3(x) can be interpreted as
the number of crossings of the y = 0 ordinate axis, and is greater than that of the corresponding monomial x^3.

3.3. Least squares


The approach to compressing data D = {(x_i, y_i) | i = 1, ..., m} suggested by calculus concepts is to form the sum of
squared differences between y(x_i) and y_i, for example for y(x) = a_0 + a_1 x when carrying out linear regression,

S(a_0, a_1) = Σ_{i=1}^{m} (y(x_i) - y_i)^2 = Σ_{i=1}^{m} (a_0 + a_1 x_i - y_i)^2,

and seek (a_0, a_1) that minimize S(a_0, a_1). The function S(a_0, a_1) ≥ 0 can be thought of as the height of a surface above
the a_0 a_1 plane, and the gradient ∇S is defined as a vector in the direction of steepest slope. When at some point
on the surface the gradient is different from the zero vector, ∇S ≠ 0, travel in the direction of the gradient would
increase the height, and travel in the opposite direction would decrease the height. The minimal value of S is
attained when no local travel can decrease the function value, which is known as the stationarity condition, stated
as ∇S = 0. Applying this to determining the coefficients (a_0, a_1) of a linear regression leads to the equations

∂S/∂a_0 = 0 ⇒ Σ_{i=1}^{m} 2(a_0 + a_1 x_i - y_i) = 0 ⇒ m a_0 + (Σ_{i=1}^{m} x_i) a_1 = Σ_{i=1}^{m} y_i,

∂S/∂a_1 = 0 ⇒ Σ_{i=1}^{m} 2(a_0 + a_1 x_i - y_i) x_i = 0 ⇒ (Σ_{i=1}^{m} x_i) a_0 + (Σ_{i=1}^{m} x_i^2) a_1 = Σ_{i=1}^{m} x_i y_i.

The above calculations can become tedious, and do not illuminate the geometrical essence of the calculation, which
can be brought out by reformulation in terms of a matrix-vector product that highlights the particular linear com-
bination that is sought in a linear regression. Form a vector of errors with components e_i = y(x_i) - y_i, which for linear
regression is y(x) = a_0 + a_1 x. Recognize that y(x_i) is a linear combination of 1 and x_i with coefficients a_0, a_1, or in
vector form

    ( 1  x_1 )
e = (  ⋮   ⋮ ) ( a_0 )  -  y = ( 1  x ) a - y = A a - y.
    ( 1  x_m ) ( a_1 )

The norm of the error vector ||e|| is smallest when Aa is as close as possible to y. Since Aa is within the column
space of A, Aa ∈ C(A), the required condition is for e to be orthogonal to the column space,

e ⊥ C(A)  ⇒  A^T e = ( 1^T ) e = ( 1^T e ) = ( 0 ) = 0,
                     ( x^T )     ( x^T e )   ( 0 )

A^T e = 0  ⇔  A^T (A a - y) = 0  ⇔  (A^T A) a = A^T y = b.

The above is known as the normal system, with N = A^T A the normal matrix. The system N a = b can be interpreted
as seeking the coordinates in the N = A^T A basis of the vector b = A^T y. An example can be constructed by randomly
perturbing a known function y(x) = a_0 + a_1 x to simulate measurement noise and comparing to the approximation ã
obtained by solving the normal system.


1. Generate some data on a line and perturb it by some random quantities
octave] m=100; x=(0:m-1)/m; a=[2; 3];
a0=a(1); a1=a(2); yex=a0+a1*x; y=(yex+rand(1,m)-0.5)';

octave]

2. Form the matrices A, N = A^T A, and the vector b = A^T y


octave] A=ones(m,2); A(:,2)=x(:); N=A'*A; b=A'*y;

octave]

3. Solve the system N ã = b, and form the linear combination ỹ = A ã closest to y

octave] atilde=N\b; disp([a atilde]);

2.0000 2.0302
3.0000 2.9628

octave]

The normal matrix basis N = A^T A can however be an ill-advised choice. Consider A ∈ ℝ^(2×2) given by

A = ( a_1  a_2 ) = ( 1  cos θ )
                   ( 0  sin θ ),

where the first column vector is taken from the identity matrix, a_1 = e_1, and the second is the one obtained by rotating it
by angle θ. If θ = π/2, the normal matrix is orthogonal, A^T A = I, but for small θ, A and N = A^T A are approximately

A ≈ ( 1  1 ),   N = ( n_1  n_2 ) = ( 1      cos θ ) ≈ ( 1  1 ).
    ( 0  θ )                       ( cos θ  1     )   ( 1  1 )

When θ is small a_1, a_2 are almost collinear, and n_1, n_2 even more so. This can lead to amplification of small errors,
but can be avoided by recognizing that the best approximation in the 2-norm is identical to the Euclidean concept
of orthogonal projection. The orthogonal projector onto C(A) is readily found by QR-factorization, and the steps
to solve least squares become

1. Compute QR = A.

2. The projection of y onto the column space of A is z = QQ^T y, and has coordinates c = Q^T y in the orthogonal
basis Q.

3. The same z can also be obtained by linear combination of the columns of A, z = Aa = QQ^T y, and replacing A
with its QR-factorization gives QRa = Qc, which leads to the system Ra = c, solved by back-substitution.

octave] [Q,R]=qr(A); c=Q'*y; aQR=R\c; disp([a atilde aQR])

2.0000 2.0302 2.0302


3.0000 2.9628 2.9628
octave]

The above procedure carries over to approximation by higher degree polynomials.



octave] m=100; n=6; x=(0:m-1)/m; x=x'; a=randi(10,n,1); A=[];


for j=1:n
A = [A x.^(j-1)];
end;
yex=A*a; y=yex+(rand(m,1)-0.5);

octave] N=A'*A; b=A'*y; atilde=inv(N)*b;


[Q,R]=qr(A); c=Q'*y; aQR=R\c;
disp([a atilde aQR]);

8.0000 8.0847 8.0847


8.0000 7.1480 7.1480
4.0000 4.2264 4.2264
4.0000 8.7568 8.7568
10.0000 2.7420 2.7420
6.0000 9.0386 9.0386
octave]

Given data b, form A, and find x such that ||e|| = ||Ax - b|| is minimized, where e = b - Ax is the error between Ax ∈ C(A) and b.

MODEL REDUCTION
1. Projection of mappings
The least-squares problem

min_{x ∈ ℝ^n} ||y - Ax||     (4.1)

focuses on a simpler representation of a data vector y ∈ ℝ^m as a linear combination of the column vectors of A ∈ ℝ^(m×n).

Consider some phenomenon modeled as a function between vector spaces f : X → Y, such that for input parameters
x ∈ X, the state of the system is y = f(x). For most models f is differentiable, a transcription of the condition that
the system should not exhibit jumps in behavior when changing the input parameters. Then by appropriate choice
of units and origin, a linearized model

y = Ax,   A ∈ ℝ^(m×n),

is obtained if y ∈ C(A), or expressed as the least-squares problem (4.1) if y ∉ C(A).

A simpler description is often sought, typically based on recognition that the inputs and outputs of the model
can themselves be obtained as linear combinations x = Bu, y = Cv, involving a smaller set of parameters u ∈ ℝ^q,
v ∈ ℝ^p, p < m, q < n. The column spaces of the matrices B ∈ ℝ^(n×q), C ∈ ℝ^(m×p) are vector subspaces of the original sets
of inputs and outputs, C(B) ≤ ℝ^n, C(C) ≤ ℝ^m. The sets of column vectors of B, C each form a reduced basis for
the system inputs and outputs if they are chosen to be of full rank. The reduced bases are assumed to have been
orthonormalized through the Gram-Schmidt procedure such that B^T B = I_q, and C^T C = I_p. Expressing the model
inputs and outputs in terms of the reduced basis leads to

Cv = ABu  ⇒  v = C^T ABu  ⇒  v = Ru.

The matrix R = C^T AB is called the reduced system matrix and is associated with a mapping g : U → V, that is a
restriction to the U, V vector subspaces of the mapping f. When f is an endomorphism, f : X → X, m = n, the same
reduced basis is used for both inputs and outputs, x = Bu, y = Bv, and the reduced system is

v = Ru,   R = B^T AB.

Since B is assumed to be orthogonal, the projector onto C(B) is P_B = BB^T. Applying the projector to the initial
model

P_B y = P_B Ax

leads to BB^T y = BB^T Ax, and since v = B^T y the relation Bv = BB^T ABu is obtained, conveniently grouped as

Bv = B (B^T AB) u  ⇒  Bv = B (Ru),

again leading to the reduced model v = Ru. The above calculation highlights that the reduced model is a projection
of the full model y = Ax on C(B).
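The projection interpretation can be illustrated by a short computation, a sketch in which the reduced bases B, C are simply obtained by orthogonalizing random vectors through the QR factorization (the dimensions are chosen arbitrarily):

octave] m=8; n=6; q=3; p=3; A=rand(m,n);
[B,RB]=qr(rand(n,q),0); [C,RC]=qr(rand(m,p),0); % orthonormal reduced bases, B'*B=I, C'*C=I
R=C'*A*B;                                       % reduced system matrix
u=rand(q,1); y=A*(B*u);                         % full model output for an input x=B*u
disp(norm(C*(R*u)-C*(C'*y)));                   % C*(R*u) equals the projection of y onto C(C)

octave]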

2. Reduced bases
2.1. Correlation matrices
Correlation coefficient. Consider two functions x_1, x_2 : ℝ → ℝ, that represent data streams in time of inputs x_1(t)
and outputs x_2(t) of some system. A basic question arising in modeling and data science is whether the inputs
and outputs are themselves in a functional relationship. This usually is a consequence of incomplete knowledge
of the system, such that while x_1, x_2 might be assumed to be the most relevant input and output quantities, this is
not yet fully established. A typical approach is to then carry out repeated measurements leading to a data set
D = {(x_1(t_i), x_2(t_i)) | i = 1, ..., N}, thus defining a relation. Let x_1, x_2 ∈ ℝ^N denote vectors containing the input and
output values. The mean values μ_1, μ_2 of the input and output are estimated by the statistics

μ_1 ≈ x̄_1 = (1/N) Σ_{i=1}^{N} x_1(t_i) = E[x_1],   μ_2 ≈ x̄_2 = (1/N) Σ_{i=1}^{N} x_2(t_i) = E[x_2],

where E is the expectation, seen to be a linear mapping E : ℝ^N → ℝ whose associated matrix is

E = (1/N) ( 1 1 ... 1 ),

and the means are also obtained by matrix-vector multiplication (linear combination),

x̄_1 = E x_1,   x̄_2 = E x_2.

Deviation from the mean is measured by the standard deviation, defined for x_1, x_2 by

σ_1 = ( E[(x_1 - μ_1)^2] )^{1/2},   σ_2 = ( E[(x_2 - μ_2)^2] )^{1/2}.

Note that the standard deviations are no longer linear mappings of the data.

Assume that the origin is chosen such that x̄_1 = x̄_2 = 0. One tool to establish whether the relation D is also a function
is to compute the correlation coefficient

ρ(x_1, x_2) = E[x_1 x_2] / (σ_1 σ_2) = E[x_1 x_2] / ( E[x_1^2] E[x_2^2] )^{1/2},

that can be expressed in terms of a scalar product and 2-norm as

ρ(x_1, x_2) = x_1^T x_2 / ( ||x_1|| ||x_2|| ).

Squaring each side of the norm property ||x_1 + x_2|| ≤ ||x_1|| + ||x_2|| leads to

(x_1 + x_2)^T (x_1 + x_2) ≤ x_1^T x_1 + x_2^T x_2 + 2 ||x_1|| ||x_2||  ⇒  x_1^T x_2 ≤ ||x_1|| ||x_2||,

known as the Cauchy-Schwarz inequality, which implies -1 ≤ ρ(x_1, x_2) ≤ 1. Depending on the value of ρ, the vari-
ables x_1(t), x_2(t) are said to be:

1. uncorrelated, if ρ = 0;

2. correlated, if ρ = 1;

3. anti-correlated, if ρ = -1.

The numerator of the correlation coefficient is known as the covariance of x_1, x_2,

cov(x_1, x_2) = E[x_1 x_2].

The correlation coefficient can be interpreted as a normalization of the covariance, and the relation

cov(x_1, x_2) = x_1^T x_2 = ρ(x_1, x_2) ||x_1|| ||x_2||,

is the two-variable version of a more general relationship encountered when the system inputs and outputs become
vectors.
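These definitions are easy to check numerically; the sketch below uses synthetic data in which x_2 is built to depend strongly on x_1, so ρ should come out close to 1 (the construction of the data is an illustrative assumption):

octave] N=1000; x1=randn(N,1); x2=2*x1+0.5*randn(N,1);
x1=x1-mean(x1); x2=x2-mean(x2);       % choose the origin so that the means are zero
rho=(x1'*x2)/(norm(x1)*norm(x2));     % correlation coefficient
disp(rho);

octave]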
Patterns in data. Consider now a related problem, whether the input and output parameters x ∈ ℝ^n, y ∈ ℝ^m thought
to characterize a system are actually well chosen, or whether they are redundant in the sense that a more insightful
description is furnished by u ∈ ℝ^q, v ∈ ℝ^p with fewer components p < m, q < n. Applying the same ideas as in the
correlation coefficient, a sequence of N measurements is made, leading to the data sets

X = ( x_1 x_2 ... x_n ) ∈ ℝ^(N×n),   Y = ( y_1 y_2 ... y_m ) ∈ ℝ^(N×m).

Again, by appropriate choice of the origin the means of the above measurements are assumed to be zero,

E[x] = 0,   E[y] = 0.

Covariance matrices can be constructed by

              ( x_1^T )                      ( x_1^T x_1  x_1^T x_2  ...  x_1^T x_n )
C_X = X^T X = ( x_2^T ) ( x_1 x_2 ... x_n ) = ( x_2^T x_1  x_2^T x_2  ...  x_2^T x_n ) ∈ ℝ^(n×n).
              (   ⋮   )                      (     ⋮          ⋮       ⋱       ⋮     )
              ( x_n^T )                      ( x_n^T x_1  x_n^T x_2  ...  x_n^T x_n )

Consider now the SVDs of C_X and X, C_X = N Λ N^T, X = U Σ S^T, and from

C_X = X^T X = (U Σ S^T)^T U Σ S^T = S Σ^T U^T U Σ S^T = S Σ^T Σ S^T = N Λ N^T,

identify N = S, and Λ = Σ^T Σ.

Recall that the SVD returns an ordered set of singular values σ_1 ≥ σ_2 ≥ ···, and associated singular vectors. In many
applications the singular values decrease quickly, often exponentially fast. Taking the first q singular modes then
gives a basis set suitable for model reduction,

x = S_q u = ( s_1 s_2 ... s_q ) u.
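The identification N = S, Λ = Σ^T Σ can be verified with a small synthetic data set (a sketch; eig and svd may order and sign the vectors differently, so only the spectra are compared, and Octave's output V corresponds to S in the notation above):

octave] N=200; n=5; X=randn(N,n); X=X-ones(N,1)*mean(X); % zero-mean measurements
CX=X'*X; [U,S,V]=svd(X,0); [Q,L]=eig(CX);
disp(sort(diag(S).^2)'); disp(sort(diag(L))');           % Sigma^T Sigma equals Lambda

octave]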
CHAPTER 5
CHANGE OF BASIS

DATA TRANSFORMATION
1. Gaussian elimination and row echelon reduction
Suppose now that Ax = b admits a unique solution. How to find it? We are especially interested in constructing
a general procedure that will work no matter what the size of A might be. This means we seek an algorithm that
precisely specifies the steps that lead to the solution, and that we can program a computing device to carry out
automatically. One such algorithm is Gaussian elimination.
Consider the system

{ x_1 + 2 x_2 - x_3 = 2
{ 2 x_1 - x_2 + x_3 = 2
{ 3 x_1 - x_2 - x_3 = 1

The idea is to combine equations such that we have one fewer unknown in each equation. Ask: with what number
should the first equation be multiplied in order to eliminate x_1 from the sum of equation 1 and equation 2? This number
is called a Gaussian multiplier, and is in this case -2. Repeat the question for eliminating x_1 from the third equation,
with multiplier -3.

{ x_1 + 2 x_2 - x_3 = 2        { x_1 + 2 x_2 - x_3 = 2
{ 2 x_1 - x_2 + x_3 = 2   ⇒   {      -5 x_2 + 3 x_3 = -2
{ 3 x_1 - x_2 - x_3 = 1        {      -7 x_2 + 2 x_3 = -5

Now, ask: with what number should the second equation be multiplied to eliminate x_2 from the sum of the second and
third equations. The multiplier is in this case -7/5.

{ x_1 + 2 x_2 - x_3 = 2         { x_1 + 2 x_2 - x_3 = 2
{      -5 x_2 + 3 x_3 = -2  ⇒  {      -5 x_2 + 3 x_3 = -2
{      -7 x_2 + 2 x_3 = -5      {           -(11/5) x_3 = -11/5

Starting from the last equation we can now find x_3 = 1, replace in the second to obtain -5 x_2 = -5, hence x_2 = 1, and
finally replace in the first equation to obtain x_1 = 1.

The above operations only involve coefficients. A more compact notation is therefore to work with what is known
as the "bordered matrix"

( 1  2 -1 |  2 )   ( 1  2 -1 |  2 )   ( 1  2  -1    |   2    )
( 2 -1  1 |  2 ) ~ ( 0 -5  3 | -2 ) ~ ( 0 -5   3    |  -2    )
( 3 -1 -1 |  1 )   ( 0 -7  2 | -5 )   ( 0  0 -11/5  | -11/5  )


Once the above triangular form has been obtained, the solution is found by back substitution, in which we seek to
form the identity matrix in the first 3 columns, and the solution is obtained in the last column.

( 1  2  -1   |   2    )   ( 1  2 -1 |  2 )   ( 1  0  0 | 1 )
( 0 -5   3   |  -2    ) ~ ( 0 -5  3 | -2 ) ~ ( 0  1  0 | 1 )
( 0  0 -11/5 | -11/5  )   ( 0  0  1 |  1 )   ( 0  0  1 | 1 )
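The arithmetic of the worked example can be checked directly in Octave (a quick verification; the matrix and right-hand side are those of the system above):

octave] A=[1 2 -1; 2 -1 1; 3 -1 -1]; b=[2; 2; 1];
disp((A\b)');    % expect x = [1 1 1]

octave]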

2. LU-factorization

• We have introduced Gaussian elimination as a procedure to solve the linear system Ax = b ("find the coordi-
nates of vector b in terms of the column vectors of matrix A"), x, b ∈ ℝ^m, A ∈ ℝ^(m×m)

• We now reinterpret Gaussian elimination as a sequence of matrix multiplications applied to A to obtain a
simpler, upper triangular form.

2.1. Example for m =3

Consider the system Ax = b

{ x_1 + 2 x_2 - x_3 = 2
{ 2 x_1 - x_2 + x_3 = 2
{ 3 x_1 - x_2 - x_3 = 1

with

    ( 1  2 -1 )        ( 2 )
A = ( 2 -1  1 ),   b = ( 2 )
    ( 3 -1 -1 )        ( 1 )

We ask if there is a matrix L_1 that could be multiplied with A to produce a result L_1 A with zeros under the main
diagonal in the first column. First, gain insight by considering multiplication by the identity matrix, which leaves
A unchanged

( 1 0 0 ) ( 1  2 -1 )   ( 1  2 -1 )
( 0 1 0 ) ( 2 -1  1 ) = ( 2 -1  1 )
( 0 0 1 ) ( 3 -1 -1 )   ( 3 -1 -1 )

In the first stage of Gaussian elimination, the first line remains unchanged, so we deduce that L_1 should have
the same first line as the identity matrix

      ( 1 0 0 )
L_1 = ( ? ? ? )
      ( ? ? ? )

( 1 0 0 ) ( 1  2 -1 )   ( 1  2 -1 )
( ? ? ? ) ( 2 -1  1 ) = ( 0 -5  3 )
( ? ? ? ) ( 3 -1 -1 )   ( 0 -7  2 )

Next, recall the way Gaussian multipliers were determined: find the number by which to multiply the first line so that, added to
the second and third lines, a zero is obtained. This leads to the form

      ( 1 0 0 )
L_1 = ( ? 1 0 )
      ( ? 0 1 )

Finally, identify the missing entries with the Gaussian multipliers to determine

      (  1 0 0 )
L_1 = ( -2 1 0 )
      ( -3 0 1 )

Verify by carrying out the matrix multiplication

        (  1 0 0 ) ( 1  2 -1 )   ( 1  2 -1 )
L_1 A = ( -2 1 0 ) ( 2 -1  1 ) = ( 0 -5  3 )
        ( -3 0 1 ) ( 3 -1 -1 )   ( 0 -7  2 )

Repeat the above reasoning to come up with a second matrix L_2 that forms a zero under the main diagonal in the
second column

      ( 1   0   0 )
L_2 = ( 0   1   0 )
      ( 0 -7/5  1 )

            ( 1   0   0 ) ( 1  2 -1 )   ( 1  2   -1   )
L_2 L_1 A = ( 0   1   0 ) ( 0 -5  3 ) = ( 0 -5    3   ) = U
            ( 0 -7/5  1 ) ( 0 -7  2 )   ( 0  0  -11/5 )

We have obtained a matrix with zero entries under the main diagonal (an upper triangular matrix) by a sequence
of matrix multiplications.
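The multiplier matrices of this example can be verified numerically (a short check using the matrices constructed above):

octave] A=[1 2 -1; 2 -1 1; 3 -1 -1];
L1=[1 0 0; -2 1 0; -3 0 1]; L2=[1 0 0; 0 1 0; 0 -7/5 1];
U=L2*L1*A; disp(U);                % upper triangular
L=inv(L1)*inv(L2); disp(L*U-A);    % recovers A = L*U

octave]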

2.2. General m case


From the above, we assume that we can form a sequence of multiplier matrices such that the result is an upper
triangular matrix U,

L_{m-1} ··· L_2 L_1 A = U.

" Recall the basic operation in row echelon reduction: constructing a linear combination of rows to form zeros
beneath the main diagonal, e.g.
a a ... am
(( aa a ... a m ) (( 0 a ))
11 12 1

a
11

a
12

... a m )) (((
1
 a 2111 a ... a m  aa a m 21
)))
(
22 12 2 1

A = ((( a ... a m ))) < ((( 0 a )))


11
a
... a m  aa a m
21 22 2

a  a 3111 a
(((( ÅÅÅ
31

))) ((( )))


31 32 3 32 12 3 1

ÅÅ ÅÅ ÅÅ ÅÅ ÅÅ ÅÅ ÅÅ
11

Å Å Å
am 1 a m ...
2 a mm ((( Å
0 am 2
Å
a
 am111 a 12
Å
... a mm  aam
Å
1
a m ))
1
)
11

" This can be stated as a matrix multiplication operation, with l i = a i / a 1 1 11

((( 1l 0 0 ... 0 )( a


1 0 ... 0 ))(( a
11 a
a
12 ... a m ) ( a
... a m )) (( 0 a  l a
1 a 11 12 ... am
... a m  l a m
1
)))
((( l 21

0 1 ... 0 ))))(((( a
21

a
22

... a m )))) = (((( 0 a  l a


2 22 21 12 2

... a m  l a m
21
)))
1

((( ÅÅÅ
31

ÅÅÅ ÅÅÅ Å Å Å ÅÅÅ )))((( ÅÅÅ31 32

ÅÅÅ ÅÅ ))) 3

ÅÅÅ ((( ÅÅÅ


32

ÅÅÅ
31 12

ÅÅ
3

ÅÅÅ
31 1
)))
( l m 1 0 0 ... 1 )( a m 1 a m ... 2
Å
a mm ) ( 0 a m  l m a ... 2 1 12
Å
a mm  lm a m )
1 1

DEFINITION. The matrix

      ( 1 ...     0        ... 0 )
      ( ⋮ ...     ⋮            ⋮ )
      ( 0 ...     1        ... 0 )
L_k = ( 0 ... -l_{k+1,k}   ... 0 )
      ( 0 ... -l_{k+2,k}   ... 0 )
      ( ⋮ ...     ⋮         ⋱  ⋮ )
      ( 0 ... -l_{m,k}     ... 1 )

with l_{i,k} = a^(k)_{i,k} / a^(k)_{k,k}, and A^(k) = [a^(k)_{i,j}] the matrix obtained after step k of row echelon reduction (or, equivalently, Gaussian
elimination), is called a Gaussian multiplier matrix.

" For A  m m nonsingular, the successive steps in row echelon reduction (or Gaussian elimination) corre-
×

spond to successive multiplications on the le by Gaussian multiplier matrices

Lm Lm ... L L A = U


1 2 2 1

" The inverse of a Gaussian multiplier is

(( 10 ... 0 ... 1)
0 ))
((( 0 Å 0 ...
ÅÅ
( ... 1 ... 0 )))
Lk = ((( 0 ... l k k ... 0 ))) = I  (Lk  I )
((( 0 0 )))
1
+1,

... l k k ...
((( ÅÅÅ ... ÅÅÅ
+2,

ÅÅ Å
Å ÅÅ )))
(0 ... lm k , ... 1)

" From (Lm Lm ... L L )A = U obtain


1 2 2 1

A = (Lm Lm ... L L ) U = L L Å ... Å Lm U = LU


1 2 2 1
1
1
1
2
1 1
1

" Due to the simple form of Lk the matrix L is easily obtained as
1

(( l 1 0
1
0
0
...
...
0
0
0)
0 ))
((( l 2,1

l 1 ... 0 0 )))
L = (( ÅÅ )))
((( Å
3,1 3,2

ÅÅ ÅÅ ÅÅ ÅÅ ÅÅ
Å Å Å Å Å
(( lm lm lm ... 1 0 ))
1)
1,1 1,2 1,3

lm ,1 lm ,2 lm ,3 ... l m m , 1

We will show that this is indeed possible if Ax = b admits a unique solution. Furthermore, the product of lower
triangular matrices is lower triangular, and the inverse of a lower triangular matrix is lower triangular (the same applies
to upper triangular matrices). Introduce the notation

L^{-1} = L_{m-1} ··· L_2 L_1

and obtain

L^{-1} A = U,

or

A = L U.

The above result permits a basic insight into Gaussian elimination: the procedure depends on "factoring" the matrix
A into two "simpler" matrices L, U. The idea of representing a matrix as a product of simpler matrices is funda-
mental to linear algebra, and we will come across it repeatedly.

For now, the factorization allows us to devise the following general approach to solving Ax = b:

1. Find the factorization LU = A.

2. Insert the factorization into Ax = b to obtain (LU)x = L(Ux) = Ly = b, where the notation y = Ux has been
introduced. The system

Ly = b

is easy to solve by forward substitution to find y for given b.

3. Finally, find x by backward substitution, solving

Ux = y.

Algorithm (Gauss elimination without pivoting)

for s = 1 to m-1
  for i = s+1 to m
    t = -a_is / a_ss
    for j = s+1 to m
      a_ij = a_ij + t · a_sj
    b_i = b_i + t · b_s
for s = m downto 1
  x_s = b_s / a_ss
  for i = 1 to s-1
    b_i = b_i - a_is · x_s
return x
Algorithm (Gauss elimination with partial pivoting)

p = 1:m (initialize row permutation vector)
for s = 1 to m-1
  piv = abs(a_{p(s),s})
  for i = s+1 to m
    mag = abs(a_{p(i),s})
    if mag > piv then
      piv = mag; k = p(s); p(s) = p(i); p(i) = k
  if piv < ε then break (singular matrix)
  for i = s+1 to m
    t = -a_{p(i),s} / a_{p(s),s}
    for j = s+1 to m
      a_{p(i),j} = a_{p(i),j} + t · a_{p(s),j}
    b_{p(i)} = b_{p(i)} + t · b_{p(s)}
for s = m downto 1
  x_s = b_{p(s)} / a_{p(s),s}
  for i = 1 to s-1
    b_{p(i)} = b_{p(i)} - a_{p(i),s} · x_s
return x
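A direct Octave transcription of the above might look as follows (a sketch under the conventions stated above; gausspp is an illustrative name, not a built-in function):

octave] function x=gausspp(A,b)
  m=length(b); p=(1:m)';                     % row permutation vector
  for s=1:m-1
    [piv,k]=max(abs(A(p(s:m),s))); k=k+s-1;  % locate pivot row
    if piv<eps, error('Singular matrix'); end
    tmp=p(s); p(s)=p(k); p(k)=tmp;           % record row exchange in p
    for i=s+1:m
      t=-A(p(i),s)/A(p(s),s);                % Gaussian multiplier
      A(p(i),s+1:m)=A(p(i),s+1:m)+t*A(p(s),s+1:m);
      b(p(i))=b(p(i))+t*b(p(s));
    end
  end
  x=zeros(m,1);
  for s=m:-1:1                               % back substitution on permuted rows
    x(s)=b(p(s))/A(p(s),s);
    b(p(1:s-1))=b(p(1:s-1))-A(p(1:s-1),s)*x(s);
  end
end

octave]

Applied to the 3×3 system solved earlier, gausspp([1 2 -1; 2 -1 1; 3 -1 -1],[2;2;1]) reproduces x = (1, 1, 1)^T.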

Given A ∈ ℝ^(m×n), transformation of coordinates Ax = b:

Singular value decomposition             Gram-Schmidt                        Lower-upper
U Σ V^T = A                              QR = A                              LU = A
(U Σ V^T)x = b ⇒ Uy = b ⇒ y = U^T b      (QR)x = b ⇒ Qy = b, y = Q^T b       (LU)x = b ⇒ Ly = b (forward sub to find y)
Σ z = y ⇒ z = Σ^+ y                      Rx = y (back sub to find x)         Ux = y (back sub to find x)
V^T x = z ⇒ x = Vz

3. Inverse matrix

By analogy to the simple scalar equation ax = b with solution x = a^{-1} b when a ≠ 0, we are interested in writing the
solution to a linear system Ax = b as x = A^{-1} b for A ∈ ℝ^(m×m), x ∈ ℝ^m. Recall that solving Ax = b = Ib corresponds to
expressing the vector b as a linear combination of the columns of A. This can only be done if the columns of A form
a basis for ℝ^m, in which case we say that A is non-singular.

DEFINITION 5.1. For a matrix A ∈ ℝ^(m×m) non-singular, the inverse matrix is denoted by A^{-1} and satisfies the properties

A^{-1} A = A A^{-1} = I.

3.1. Gauss-Jordan algorithm


Computation of the inverse A^{-1} can be carried out by repeated use of Gauss elimination. Denote the inverse by
B = A^{-1} for a moment and consider the inverse matrix property AB = I. Introducing the column notation for B, I
leads to

A ( B_1 ... B_m ) = ( e_1 ... e_m )

and identification of each column in the equality states

A B_k = e_k,   k = 1, 2, ..., m,

with e_k the column unit vector with zero components everywhere except for a 1 in row k. To find the inverse we
need to simultaneously solve the m linear systems given above.
Gauss-Jordan algorithm example. Consider

    ( 1  1  0 )
A = ( 1 -1  1 )
    ( 2  4  2 )

To find the inverse we solve the systems A B_1 = e_1, A B_2 = e_2, A B_3 = e_3. This can be done simultaneously by working
with the matrix A bordered by I,

        ( 1  1  0 | 1 0 0 )
(A|I) = ( 1 -1  1 | 0 1 0 )
        ( 2  4  2 | 0 0 1 )

Carry out now operations involving linear row combinations and permutations to bring the left side to I:

( 1  1  0 | 1 0 0 )   ( 1  1  0 |  1 0 0 )   ( 1  1  0 |  1  0  0 )
( 1 -1  1 | 0 1 0 ) ~ ( 0 -2  1 | -1 1 0 ) ~ ( 0 -2  1 | -1  1  0 ) ~
( 2  4  2 | 0 0 1 )   ( 0  2  2 | -2 0 1 )   ( 0  0  3 | -3  1  1 )

( 1  1  0 |  1   0    0  )   ( 1  1  0 |  1    0    0  )   ( 1  0  0 |  1   1/3  -1/6 )
( 0 -2  1 | -1   1    0  ) ~ ( 0  1  0 |  0  -1/3  1/6 ) ~ ( 0  1  0 |  0  -1/3   1/6 )
( 0  0  1 | -1  1/3  1/3 )   ( 0  0  1 | -1   1/3  1/3 )   ( 0  0  1 | -1   1/3   1/3 )

to obtain

         (  1   1/3  -1/6 )
A^{-1} = (  0  -1/3   1/6 )
         ( -1   1/3   1/3 )
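The Gauss-Jordan result can be checked against Octave's built-in inverse (a quick verification of the worked example):

octave] A=[1 1 0; 1 -1 1; 2 4 2];
disp(inv(A)); disp(A*inv(A));   % compare with the result above; the product is I

octave]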

4. Determinants
" A  m m a square matrix, det(A )   is the oriented volume enclosed by the column vectors of A (a paral-
×

lelipiped)

" Geometric interpretation of determinants

" Determinant calculation rules

" Algebraic denition of a determinant

DEFINITION. The determinant of a square matrix A = ( a_1 ... a_m ) ∈ ℝ^(m×m) is a real number

         | a_11 a_12 ... a_1m |
det(A) = | a_21 a_22 ... a_2m | ∈ ℝ
         |  ⋮    ⋮   ⋱    ⋮   |
         | a_m1 a_m2 ... a_mm |

giving the (oriented) volume of the parallelepiped spanned by the matrix column vectors.

" m =2

A = ( aa aa 11 12

),det(A) = Ä
a a
11

a a
12
Ä
21 22 21 22

" m =3
a a a )),det(A) = || aa a a
A = ((( a a a
||
)))
11 12 13 11 12 13

|| a a ||
(( a a a
21

31
22

32
23

33 |a
21

31
22

a a
32
23

33 |

" Computation of a determinant with m =2

a a
11 12
=a a a a
Ä
a a
21 22
Ä 11 22 12 21

" Computation of a determinant with m =3

|| aa aa aa || = a a a + a a a + a a a
|| |
11 12 13

| a a a ||
21 22 23 11 22 33 21 32 13 31 12 23

31 32 33

a 13 a a 22 31 a 23 a a
32 11 a 33 a a
12 21

" Where do these determinant computation rules come from? Two viewpoints

 Geometric viewpoint : determinants express parallelepiped volumes

 Algebraic viewpoint : determinants are computed from all possible products that can be formed from
choosing a factor from each row and each column

" m =2

A = ( a_1  a_2 ) = ( a_11  a_12 )
                   ( a_21  a_22 )

Figure 5.1. Parallelogram with sides given by the column vectors a_1, a_2 of A.

" In two dimensions a ``parallelepiped'' becomes a parallelogram with area given as

(Area) = (LengthofBase) × (LengthofHeight)

" Take a as the base, with length b = a . Vector a is at angle Æ to x -axis, a is at angle Æ to x -axis, and
1 1 1 1 1 2 2 2

the angle between a , a is ¸ = Æ  Æ . The height has length


1 2 2 1

h = a  sin ¸ = a sin(Æ  Æ ) = a (sinÆ cosÆ  sin Æ cosÆ )


2 2 2 1 2 2 1 1 2

" Use cos Æ = a / a , sin Æ = a / a , cos Æ = a / a , sin Æ = a / a 


1 11 1 1 12 1 2 21 2 2 22 2

(Area) = a  a (sinÆ 1 2 2 cosÆ  sin Æ cosÆ ) = a a


1 1 2 11 22 a a
12 21

" The geometric interpretation of a determinant as an oriented volume is useful in establishing rules for
calculation with determinants:

 Determinant of matrix with repeated columns is zero (since two edges of the parallelepiped are
identical). Example for m =3

|a a u|
” = ||| b b v ||| = abw + bcu + cav  ubc  vca  wab =0
|c c w|

This is more easily seen using the column notation

” =det( a a a ... ) =0
1 1 3

 Determinant of matrix with linearly dependent columns is zero (since one edge lies in the 'hyper-
plane' formed by all the others)

" Separating sums in a column (similar for rows)

det( a + b a ... am ) =det( a a ... am ) +det( b a ... a m )


1 1 2 1 2 1 2

with ai , b  m
1

" Scalar product in a column (similar for rows)

det( ±a a ... am ) = ± det( a a ... am )


1 2 1 2

with ±  

" Linear combinations of columns (similar for rows)

det( a a ... am ) =det( a ± a + a ... a m )


1 2 1 1 2

with ±  .

" A determinant of size m can be expressed as a sum of determinants of size m  1 by expansion along a row
or column

||| aa aa aa ... am| ... a m |


|| ÅÅ ÅÅ ÅÅ Å Å a ÅÅm |||| = a ||| ÅÅÅ
11 12 13

...
1
a a
Å ÅÅ ||| 
22 23 2

ÅÅ ÅÅ Å
||| Å Å Å Å Å ||| || a m
21 22 23 2

Å
... a mm |
11

a m a m a m ... a mm 2 am 3
1 2 3

|a a ... a m |
a ||| ÅÅÅ Å ÅÅ ||| +
21 23 2

ÅÅ ÅÅ Å
Å
| am ... a mm |
12

1 am 3

a a a ... a m |
a ||| ÅÅÅ || 
21 22 24 2

ÅÅ ÅÅ ÅÅ ÅÅ
|| a m 13

1 a m a m ... a mm ||
Å
2
Å
4
Å Å

...
+(1)m a m|||
| aÅÅ aÅÅ Å...Å a mÅÅ 
21 23 2, 1
|||
| |||
+1

Å Å Å Å
| a m a m ... a m m
1

1 3 , 1

" The formal denition of a determinant

det A = y ½ (Ã )a i a i ...a mim


1 1 2 2
à £

requires mm! operations, a number that rapidly increases with m

" A more economical determinant is to use row and column combinations to create zeros and then reduce the
size of the determinant, an algorithm reminiscent of Gauss elimination for systems
Example:

||| 11 20 31 ||| = ||| 10 22 34 ||| = 2 4 =20  12=8


|| 2 1 4 || || 0 3 10 || 3 10 Ä Ä

The rst equality comes from linear combinations of rows, i.e. row 1 is added to row 2, and row 1 multiplied
by 2 is added to row 3. These linear combinations maintain the value of the determinant. The second
equality comes from expansion along the rst column
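The example can be confirmed numerically with the built-in determinant function:

octave] A=[1 2 3; -1 0 1; -2 -1 4];
disp(det(A));   % expect 8

octave]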

4.1. Cross product


" Consider u, v   . We've introduced the idea of a scalar product
3

u Å v = uT v = u v + u v + u v 1 1 2 2 3 3

in which from two vectors one obtains a scalar

" We've also introduced the idea of an exterior product

u uv uv uv
uv T = ((( u ))( v v v ) = ((( u v u v u v ))
1 1 1 1 2 1 3

(u 2

3
)) 1 2

(u v u v u v
3 2

3
1

1
2

3
2

2
2

3
3

3
))

in which a matrix is obtained from two vectors

" Another product of two vectors is also useful, the cross product, most conveniently expressed in determi-
nant-like form
1
|e e e |
u × v = ||| u u u ||| = (u v  v u )e + (u v  v u )e + (u v  v u )e
2 3

1
|| v v v ||
2

2
3

3
2 3 2 3 1 3 1 3 1 2 1 2 1 2 3
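Octave provides dot and cross functions for these operations; a brief check of the component formula above, with arbitrarily chosen vectors:

octave] u=[1;2;3]; v=[4;5;6];
disp(dot(u,v)); disp(u*v');      % scalar product and exterior product
disp(cross(u,v)');
disp([u(2)*v(3)-u(3)*v(2), u(3)*v(1)-u(1)*v(3), u(1)*v(2)-u(2)*v(1)]);

octave]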

DATA EFFICIENCY
1. Krylov bases
In reduction of the model

Ax = y,   A ∈ ℝ^(m×n),   A : ℝ^n → ℝ^m,

by restriction to subspaces spanned by the orthogonal column vectors of B, C, x = Bu, y = Cv, the reduced model

v = Ru

is obtained with R = C^T AB, the reduced system matrix. The choice of the basis sets B, C is not yet specified. One
common choice is to use the singular value decomposition A = S Σ Q^T and choose the dominant k singular vectors
to span the subspaces,

B = Q_k,   C = S_k.

This assumes that an SVD is available, and that ordering of vectors by the 2-norm is relevant to the problem. This
is often the case in problems in which the matrix A somehow expresses the energy of a system. For example in
deformation of a structure a relationship between forces f and displacements u is approximated linearly by f = Ku,
with the stiffness matrix K expressing the potential energy stored in the deformation.

However in many cases the model might not express an energy, so the 2-norm is not an appropriate functional, or
even if it is, the size of the problem might render the computation of the singular value decomposition impractical.
In such situations alternative procedures to construct reduced bases must be devised.

Consider that the only information available about the problem are the matrix A ∈ ℝ^(m×m) and a vector y ∈ ℝ^m. From
these two a sequence of vectors can be gathered into what is known as a Krylov matrix

K_n = ( y  Ay  A^2 y  ...  A^{n-1} y ).

The Krylov matrix K_n is generally not orthogonal, but an orthogonal basis can readily be extracted through the QR
factorization

Q_n R = K_n.

The basis Q_n can then be adopted, both for the domain and the codomain,

B = C = Q_n.
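A minimal sketch of the Krylov construction (dimensions chosen arbitrarily; for larger powers of A the Krylov vectors tend to become nearly parallel, which is one reason the QR orthogonalization step is useful):

octave] m=6; n=4; A=rand(m); y=rand(m,1);
K=y; for j=2:n, K=[K A*K(:,end)]; end   % Krylov matrix [y Ay A^2*y A^3*y]
[Qn,R]=qr(K,0);                         % orthogonal basis of the Krylov subspace
disp(norm(Qn'*Qn-eye(n)));              % verify orthonormality

octave]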

2. Greedy approximation
The Krylov procedure has the virtue of simplicity, but does not have the desirable property of the SVD of ordering
of the singular vectors. Suppose that the system matrix A ∈ ℝ^(m×m) is applied to k vectors x_1, ..., x_k, leading to forma-
tion of the vector set S = {Ax_1, ..., Ax_k}. Denote by B the first n members of the set ordered in some arbitrary norm,

B = ( b_1  b_2  ...  b_n ),   n ≤ k,

b_1 = A x_{σ(1)}, ..., b_k = A x_{σ(k)},

where σ denotes the permutation that orders the vectors in accordance with the chosen norm. The above is called a
greedy approximation, and furnishes an alternative to the SVD that exhibits an ordering property. As in the Krylov
case, it is usually more efficient to use an orthogonal set obtained through QR factorization,

Q_n R = B_n.
CHAPTER 6
EIGENPROBLEMS
DATA STABILITY
A = U Σ V^T     A = QR     A = LU
1. The eigenvalue problem
" Consider square matrix A  m m. The eigenvalue problem asks for vectors x  m, x ` 0, scalars »   such
×

that
Ax = » x (6.1)
" Eigenvectors are those special vectors whose direction is not modied by the matrix A
" Rewrite (1): (A  » I )x = 0, and deduce that A  »I must be singular in order to have non-trivial solutions
det(A  » I ) =0
" Consider the determinant
|| a a  » a a  » ... am |
a m |||
|
11 12 1

det(A  »I ) = || ÅÅ ...
|| Å
21

Å
Å
22

Å Å ÅÅ || 2

| a m a mÅ ...Å a mmÅ  » ||
1 2

" From determinant denition ``sum of all products choosing an element from row/column''

det(A  »I ) = (1)m»m + c »m + ... + cm » + cm = pA(»)


1
1
1

is the characteristic polynomial associated with the matrix A, and is of degree m

" A  m × m has characteristic polynomial pA(») of degree m, which has m roots (Fundamental theorem of
algebra)
" Example

octave] theta=pi/3.; A=[cos(theta) -sin(theta); sin(theta) cos(theta)]

A =

0.50000 -0.86603
0.86603 0.50000


octave] eig(A)
ans =

0.50000 + 0.86603i
0.50000 - 0.86603i

octave] [R,lambda]=eig(A);

octave] disp(R);

0.70711 + 0.00000i 0.70711 - 0.00000i


0.00000 - 0.70711i 0.00000 + 0.70711i

octave] disp(lambda)

Diagonal Matrix

0.50000 + 0.86603i 0
0 0.50000 - 0.86603i
octave] A=[-2 1 0 0 0 0; 1 -2 1 0 0 0; 0 1 -2 1 0 0; 0 0 1 -2 1 0; 0 0 0 1 -2 1; 0 0 0 0 1 -2];

octave] disp(A)

-2 1 0 0 0 0
1 -2 1 0 0 0
0 1 -2 1 0 0
0 0 1 -2 1 0
0 0 0 1 -2 1
0 0 0 0 1 -2

octave] lambda=eig(A);

octave] disp(lambda);

-3.80194
-3.24698
-2.44504
-1.55496
-0.75302
-0.19806

octave]

" For A  m m, the eigenvalue problem 5 (x ` 0) can be written in matrix form as


×

AX = X ›, X = (x ... xm) eigenvector, › =diag(» ,..., »m) eigenvaluematrices


1 1

" If the column vectors of X are linearly independent, then X is invertible and A can be reduced to diagonal
form
A = X ›X  , A = U £V T
1

" Diagonal forms are useful in solving linear ODE systems

y =Ay Ô (X  y ) = › (X  y )
¹
1 1

" Also useful in repeatedly applying A

uk = Aku = AA ... Au = (X ›X  )(X ›X  ) ... (X ›X  )u = X ›kX  u


0 0
1 1 1
0
1
0
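This identity is easy to check numerically (a small sketch with an arbitrarily chosen diagonalizable matrix):

octave] A=[2 1; 1 2]; [X,L]=eig(A); u0=[1; 2]; k=5;
disp([A^k*u0  X*L^k*inv(X)*u0]);   % the two columns agree

octave]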

" When can a matrix be reduced to diagonal form? When eigenvectors are linearly independent such that the
inverse of X exists

" Matrices with distinct eigenvalues are diagonalizable. Consider A  m m with eigenvalues »j ` »k for j ` k ,
×

j , k  {1,..., m}
Proof . By contradiction. Take any two eigenvalues »j ` »k and assume that xj would depend linearly on xk,
xk = cxj for some c ` 0. Then

Ax = » x Ò Ax = » x
1 1 1 1 1 1

Ax = » x Ò Acx = » cx
2 2 2 1 2 1

and subtracting would give 0= (»  » )x . Since x is an eigenvector, hence x ` 0 we obtain a contradiction


1 2 1 1 1

» =» .
1 2

" The characteristic polynomial might have repeated roots. Establishing diagonalizability in that case requires
additional concepts

DEFINITION 6.1. The algebraic multiplicity of an eigenvalue » is the number of times it appears as a repeated root of the
characteristic polynomial p(») =det(A  » I )

Example. p(») = »(»  1)(»  2) has two single roots » =0, » =1 and a repeated root » =2. The eigenvalue » =2
2
1 2 3,4

has an algebraic multiplicity of 2

DEFINITION 6.2. The geometric multiplicity of an eigenvalue » is the dimension of the null space of A  » I

DEFINITION 6.3. An eigenvalue for which the geometric multiplicity is less than the algebraic multiplicity is said to be
defective

PROPOSITION 6.4. A matrix is diagonalizable is the geometric multiplicity of each eigenvalue is equal to the algebraic
multiplicity of that eigenvalue.

" Finding eigenvalues as roots of the characteristic polynomial p(») =det(A  »I ) is suitable for small matrices
A  m m. ×

 analytical root-nding formulas are available only for m  4

 small errors in characteristic polynomial coecients can lead to large errors in roots

" Octave/Matlab procedures to nd characteristic polynomial

 poly(A) function returns the coecients



  - the roots(p) function computes the roots of the polynomial

octave] A=[5 -4 2; 5 -4 1; -2 2 -3]; disp(A);

5 -4 2
5 -4 1
-2 2 -3

octave] p=poly(A); disp(p);

1.00000 2.00000 -1.00000 -2.00000

octave] r=roots(p); disp(r');

1.0000 -2.0000 -1.0000

octave]

" Find eigenvectors as non-trivial solutions of system (A  »I )x =0

» =1 Ò A  » I = (
(( 5 5 1 ( 4 4 2 )) < (( 02 20 46 )) < (( 02 02 64 ))
1 1
( 2 2 4 )) (( 5 5 1 )) (( 0 0 0 ))

Note convenient choice of row operations to reduce amount of arithmetic, and use of knowledge that A  » I 1

is singular to deduce that last row must be null

" In traditional form the above row-echelon reduced system corresponds to

{{ 0x2x+0+2x x6x4x
1 2 3 =0 1
= 0 Ò x = ± (((( 1 )))), x  =1 Ò ± =1/ (2
{ 0x +0x +0x
1

1
2

2
3

3 =0 (0)

" In Octave/Matlab the computations are carried out by the null function

octave] null(A+5*eye(3))'

ans = [](0x3)

octave]

" The eigenvalues of I   are » = 1, but small errors in numerical computation can give roots of the
3×3
1,2,3

characteristic polynomial with imaginary parts

octave> lambda=roots(poly(eye(3))); disp(lambda')

1.00001 - 0.00001i 1.00001 + 0.00001i 0.99999 - 0.00000i


octave>

" In the following example notice that if we slightly perturb A (by a quantity less than 0.0005=0.05%), the
eigenvalues get perturb by a larger amount, e.g. 0.13%.

octave] A=[-2 1 -1; 5 -3 6; 5 -1 4]; disp([eig(A) eig(A+0.001*(rand(3,3)-0.5))])

3.0000 + 0.0000i 3.0005 + 0.0000i


-2.0000 + 0.0000i -2.0000 + 0.0161i
-2.0000 + 0.0000i -2.0000 - 0.0161i

octave]

" Extracting eigenvalues and eigenvectors is a commonly encountered operation, and specialized functions
exist to carry this out, including the eig function

octave> [X,L]=eig(A); disp([L X]);

-2.00000 0.00000 0.00000 -0.57735 -0.00000 0.57735


0.00000 3.00000 0.00000 0.57735 0.70711 -0.57735
0.00000 0.00000 -2.00000 0.57735 0.70711 -0.57735

octave> disp(null(A-3*eye(3)))

0.00000
0.70711
0.70711

octave> disp(null(A+2*eye(3)))

0.57735
-0.57735
-0.57735
octave>

" Recall denitions of eigenvalue algebraic m» and geometric multiplicities n» .

DEFINITION. A matrix which has n» < m» for any of its eigenvalues is said to be defective.

octave> A=[-2 1 -1; 5 -3 6; 5 -1 4]; [X,L]=eig(A); disp(L);

Diagonal Matrix

-2.0000 0 0
0 3.0000 0
0 0 -2.0000

octave> disp(X);

-5.7735e-01 -1.9153e-17 5.7735e-01


5.7735e-01 7.0711e-01 -5.7735e-01
5.7735e-01 7.0711e-01 -5.7735e-01

octave> disp(null(A+2*eye(3)));

0.57735
-0.57735
-0.57735

octave> disp(rank(X))

2
octave>

2. Computation of the SVD

" The SVD is determined by eigendecomposition of ATA, and AAT


 AT A = (U £ V T )T (U £ V T ) = V (£T £ ) V T , an eigendecomposition of ATA. The columns of V are eigen-
vectors of ATA and called right singular vectors of A

B = AT A = V £T £ V T = V ›V T

 AAT = (U £ V T )(U £T V T )T = U (££T ) UT , an eigendecomposition of ATA. The columns of U are eigen-


vectors of AAT and called le singular vectors of A
 The matrix £ has form
(( Ã 1
))
(( Ã ))) m n
£ = (((
2

ÅÅ
(((
Å
Ãr
)))  ×

)))
+

(( 0
ÅÅ )
Å

and Ãi are the singular values of A.


" The singular value decomposition (SVD) furnishes complete information about A
 rank(A) = r (the number of non-zero singular values)
 U , V are orthogonal basis for the domain and codomain of A

DATA RESONANCE
1. Bases induced by eigenmodes
The trigonometric functions {1, cos t, sin t, cos 2t, sin 2t, ...} have been introduced as a particularly appropriate basis
for periodic functions. The functions cos(kt), sin(kt) are also known as solutions of the homogeneous differential
equation

y'' + k^2 y = 0.

The differential operator is a linear mapping,

d^q/dt^q (α y + β z) = α d^q y/dt^q + β d^q z/dt^q,

and hence has an associated matrix. An approximation of the second-order differentiation operation is
given by the finite difference formula

y_i'' = y''(t_i) ≈ (1/h^2) (y_{i+1} - 2 y_i + y_{i-1}).
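The matrix used in the earlier eigenvalue example, tridiagonal with -2 on the diagonal and 1 on the off-diagonals, is h^2 times this finite difference approximation of the second derivative with homogeneous end conditions. Its eigenvalues are known in closed form as -4 sin^2(kπ/(2(m+1))), a standard result quoted here as an assumption and checked against the values computed earlier:

octave] m=6; A=diag(-2*ones(m,1))+diag(ones(m-1,1),1)+diag(ones(m-1,1),-1);
lam=eig(A);
disp(sort(lam)'); disp(sort(-4*sin((1:m)*pi/(2*(m+1))).^2));  % the two lists agree

octave]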

