BEST PDF On LinearAlgebra

A Colorful Introduction to Linear Algebra
Zack Booth Simpson, May 2014

Table of Contents
Introduction
Learning to count
Vector addition
Vectors as geometry
Solving for a new mixture
Matrix multiplication in general
Order matters
Size matters
Rescaling and normalization
Projection
Matrices as operators
Combining Operators
Basis
Homogeneous Matrices
Inverses
Rank
Special Matrices
Identity Matrix
Diagonal, a.k.a. Scaling Matrix
Orthogonal Matrix
Orthonormal / Rotation Matrix
Factorizations
Eigen Decomposition
LU decomposition
Principal Component Analysis
Covariance
Vocabulary
References
Acknowledgments
Appendix A
Using Trigonometry
Without Using Trigonometry
Appendix B Trigonometry Refresher
Appendix C The Algebra of PCA

Introduction
Linear Algebra is the math of mixing things. Linear algebra is incredibly useful; it has
applications to practically every discipline of science and engineering ranging from quantum
1
mechanics to video game design, to name but two examples of hundreds.

not
This paper seeks to give you intuition for a variety of ideas from linear algebra. It is a linear
algebra textbook. Whereas a textbook would focus on proofs and formal definitions, this paper
will instead focus on application and vocabulary. It is hoped that you will be able to generalize
from the presented applications and be capable, through familiarity with the jargon, of
referencing a textbook when needed.
Learning to count
When we teach children to count we might say: “one ball, two balls, three balls”. Then we might
say: “one block, two blocks, three blocks”. The child might reply: “one ball, two blocks, three
dolls” and we correct them: “No, no you have to add things of the same type!” It is
meaningless to add balls with blocks with dolls. Or is it? If we want to count different kinds of
things we have to keep them isolated in separate piles. Keeping track of these separate piles is
what linear algebra is all about so you might think that mathematicians would call these
interesting piles of things “piles” or “collections” or some such term. But they don’t, they call
them “vectors.”

Linear algebra, like a lot of math, has tons of weird vocabulary for otherwise easy ideas.
Sometimes this jargon can get in the way of understanding the ideas. That said, you can’t really
communicate with other people about an idea unless you learn the jargon. So, in this paper, I’m
going to introduce the jargon slowly and always be sure to let you know where those words
came from.

Let’s start with the word “vector.” “Vector” comes from the Latin “vehere” meaning “to carry.” It
[1]
has the same root as “vehicle.” It makes a tiny little bit of sense a vector “carries” us to a
destination. In the case of our child learning to count we might say that destination is something
like: “3 balls, 2 blocks, 1 doll.”

Vectors are usually written vertically like this:

Why are vectors conventionally written as columns? Because they are. Let’s have a word
about notation. All of the rules about how to write down a mathematical statement are one of
the reason that people get scared of math. The symbols, conventions, and Greek letters
understandably make people believe that math is a magical language only to be spoken by high
priests. This impression starts in gradeschool when we are taught these symbols without
2
explanation of where they came from as if the + and symbols were handed down by God and
we must not question them.

In reality, notation is a shorthand that’s been passed down from generation to generation and
has typically evolved extensively. In other words, it is part arbitrary convention and part refined
tool. You should remember that a notation represents a decision made by people just like you
and if you don’t like a notation then you are free to invent your own. Of course, you do so at
your peril you may have a hard time communicating your ideas to others who have different
notations. That said, this is exactly where all the standard notation came from someone
made it up and their ideas were good enough that other people started copying them. Maybe
you’ll get good enough at math that you can make up notation for your ideas and they’ll stick
too!

Back to vectors. Sometimes it is inconvenient to write a vector as a column as it takes up a lot
T
of paper. So, there’s an alternative notation like this (3,2,1) . That little “T” up there is a way of
saying that this vector is laying on its side. The “T” stands for “transpose” which comes from
[2]
Latin: “trans” = “over” + “pose”=“to put.” So “transpose” literally means “to put on its side.”

I’m going to interchange between these two notations frequently.

Let’s write our first vector equation. Suppose that we have a bread recipe that calls for 2 cups
of flour, 1 cup of sugar, and 1 stick of butter. Suppose that we want to make two loaves. We
need to multiply. By the way, “multiply” comes from Latin “multi” = “many” + “ply” = “fold.” A
vector multiply by 2 is written like this:

Multiplying a vector by a number is called “scaling” the vector also known
as a “scalar” multiply. The AngloSaxon word “scale” is interesting; it
comes from the time when scales used to
measure the weight of things were made of
balanced cups hanging from strings. The word
comes from the Norse word “skal” which means
“skull” maybe because the Vikings used to use
3
skulls as cups! So when you “scale” a vector you may be ghoulishly talking about drinking from
[3]
a skull!

When scaling a vector, it doesn’t matter on which side you write the multiply. This might seem
obvious but it is very important as we will learn in a bit.

Vector addition
Suppose that you had two recipes. One for the bread that we’ve already discussed and one for
a cake that was “1 cup flour, 2 cups sugar, 1 stick butter”. If you want to go to the store and buy
enough ingredients for both bread and cake you’d want to add the two requirements which we
write as follows where we simply add up each row so that we’re adding flour with flour, sugar
with sugar, etc.

Vectors as geometry
One way to think of vectors is to plot them. For example, the following is the flour and sugar
components of our bread and cake purchase above where we plot cups of flour on the
horizontal axis and cups of sugar on the vertical.

When plotted addition is the same thing as drawing the vectors so that the tail one is on the
4
head of the other. The result (how much we need to buy at the store for the combined recipe) is
the blue vector that carries us to the end point. In this case, 3 cups flour and 3 cups of sugar.

Visualizing vectors as geometry is very appealing when you have only two components
(ingredients) but is much harder with three vectors, and impossible with even more. Sometimes
people find that once they think of vectors as geometry, they can’t think of them any other way.
This is a mistake. The power of linear algebra is exactly that it works for vectors with lots of
[4]
components, that is, with a high “dimension” (from “di” + “metri” = “to measure out” ). There
are two good reasons to think of vectors as geometry. The first is when geometry is in fact the
problem of interest such as when one is modelling shapes in a video game or for a physics
problem. The second reason is that geometric understanding leads to good visual intuition.
But, as suggested, this can be as limiting since your intuition for high dimensions is essentially
zero and therefore relying on geometric intuition over algebra can lead to problems.

Geometrically, when we scale a vector we’re changing its size but not its direction. A bigger or
smaller cake is still a cake its ingredient vectors point in the same direction but with different
lengths. Here we plot 2 loaves of bread:

Solving for a new mixture
Consider all of the things you could make with flour and sugar, ignoring the other ingredients.
This is called the “space” of flour and sugar. We’ve already given the names “bread” and “cake”
T
to some of the vectors in this space now let’s create a third vector (2,0.5) and call it “scone.”

5
Suppose that we’ve bought a premixed cake recipe. We know the ratio of ingredients in the
cake mix (1 flour to 2 sugar), but the mix is just that mixed we can’t unmix it. Meanwhile
we’ve also bought a premixed scone recipe. Similarly we know its ratio of ingredients (2 flour to
0.5 sugar) but we can’t unmix them either. So, the question is, given these two premixed bags
of cake and scone can we make bread ? (Ignoring all the ingredients except for flour and sugar
Yes we can!
for simplicity). As we see below, there’s some amount of scone mix plus some
amount of cake mix that will make exactly one loaf of bread.

The fact that we can combine two premade mixtures into something new is not obvious. After
all, we have these two mixtures and it seems as if we’re somehow unmixing them to get
something new. But, of course, we’re not unmixing them, we’re mixing two mixtures and if we
can do that then we can get what we want. This idea is the heart of linear algebra.

But, we must be careful: can we make scone from bread and cake? No. That would require
unmixing either the bread or the cake mixes. Mathematically this is saying we’d need a
negative quantity of bread or cake mix which is manifestly silly in this situation but not in all
situations. The following demonstrates the problem; to get scone we’d need a cake vector
pointing in the negative sugar and flour direction.

Given that we know we can make bread from cake and scone since they are all positive
numbers, let’s solve for right combination of cake and scone ingredients to make the bread. We
can write an equation for the situation like this:

6

We’re interested in finding the two unknowns: cups of scone mix (C ) plus cups of cake mix
scone
(C ); we know the mixtures of everything else.
cake

We’re ready for the neatest trick in linear algebra notation. We will rewrite our equation using a
why
table called a “matrix.” (We’ll explain this rearrangement works and what the meaning of
the word “matrix” is in the next section; just follow along for a minute.)

The lefthand table (“matrix”) is made by combining the column vectors from our equation into
an orderly table. Meanwhile the two unknown were moved into a column vector of their own.

While this rearrangement is merely notation, it’s some very clever notation for it allows us to
rewrite the equation in a form that is familiar from grade school. We’ll use “M” for the “Mixing
T
Matrix”, “x” for the unknown values vector (C ,C
scone )
cake and “b” for the bread vector. (It is
conventional to use upper case letters for matrices, bold lower for vectors, and nonbold for
scalars).

x
M * =
b

Isn’t that elegant? It looks like a plain old algebra problem from 8th grade except that we know
that it’s really a two dimensional set of equations. We know from 8th grade how to solve for x
we divide by M.

x b
= / M

Except there’s a problem; division isn’t something we know how to do to a matrix! There’s a
way to do it by hand for a 2 by 2 matrix, but for matrices bigger than that it gets extremely
involved so we’re not going to learn how to do it manually. The right answer is that in this
modern age is you simply ask a computer to do it for you.

The way you ask a computer to do this is you ask it for the “inverse” of the matrix M. “Inverse”
[5]
comes from Latin. “In”=“opposite” + “vertere”=”turn”. Mathematicians represent the “inverse”
using a “1” in superscript. Why? That’s a story we don’t have time for.
7

M * x =
b Original equation
1
M * M * x M1
= *
b 1
Left multiply both sides by M
1 1 1
M
( * M) * x M *
= b Visually group M * M
x
= M1

*
b (
M1 * M) cancels out

1
The (M * M) term cancels out. It’s like saying with plainold numbers (M/M) cancels out
because M/M = 1 and 1 times any number is that same number.

1
So, all we have to do is ask a computer for M , multiply it by b and we will have our unknowns x!
(A warning: we’re later going to see that there’s a different and superior way to do this which in
general eliminates the needs to use inverses. See “LU Decomposition” below.)

Rearranging vectors into matrix collections as just demonstrated is probably the most important
concept in linear algebra so we’re going to study it more. By the way, “matrix” means a “womb”
[6]
in Latin (“mater” = mother. ) So, why do mathematicians call such tables wombs? Because a
mathematician in 1850 imaginatively thought of the equations coming from the womb of a
[7]
common parent. Thus demonstrating the strange nature of mathematical etymology.
Matrix multiplication in general
In the above section we rearranged an equation into a matrix times a vector. Let’s look at that
again in more detail because it’s a technique we’re going to use a lot:

Notice that multiplying a matrix (blue and orange) by a vector (purple) is the same thing as
scaling the first column vector (blue) by the first component of the vector (C ) plus scaling the
scone
second column (orange) by the second component (C ). Stop now, and study that; it is
cake
particularly important to see that the righthandside of the equation is a single vector
specifically it is the sum of two scalar/vector multiplications.

When describing the sizes of matrices it is conventional to list the number of rows first and the
number of columns second. So the above equation is a 2 x 2 matrix times a 2 x 1 vector = 2 x 1
vector.

8

This pattern can extend to as many vectors as you want by simply adding more columns to the
a
right side matrix. For example, here we have two vectors b
and combined into a matrix:

All we’ve done is show that we can extend the pattern to more than one vector. Compared to
our previous equation, we’ve added another set of unknowns. Where before we had a single
T T T
vector (C , C
scone )
cake we now have two vectors (a , a
1 )
2 and (b, b
1 )
2 . And the righthand result is
now a two column matrix instead of a vector.

This notation is a bit tricky but is really just some rearrangements. You might want to review the
above several times until you see the pattern, it is very important.
Order matters
The order in which we multiply matrices matters. A * B does not equal B * A. Attempting to
reverse the multiply is like mixing up components the way our naive child did when learning to
count you can’t add balls to blocks. For example, first we do it the right way:

9

Now the other way around, swapping the order of the multiply results in nonsense.

10

So, in general you can not swap the order of a matrix multiply. A*B is not the same as B*A and
furthermore its units are often meaningless. So, the lesson is you have to be very careful when
you write down your equations.
Size matters
The arrangement of the components of a matrix matters too. We can only multiply matrices if
they are compatible. For example, the following is meaningless:

It is simply meaningless to multiply a 2x2 by a 1x2 as demonstrated.
11

There’s a handy trick to remind yourself if a given matrix multiply is meaningful. Write out the
sizes rows by columns times rows by columns. For example:

If the inner numbers are the same then it is a legal multiplication. The resulting matrix will be
the dimensions of the outer numbers. So in the above example the inner numbers are both 2 so
its legal. The resulting matrix will be 1 x 3. The following is illegal as the inner numbers are not
the same
Rescaling and normalization
T T
b
Our first bread recipe called for = (flour,sugar,butter) of (2,1,1). Suppose that we made this
recipe and found that the recipe made 1.5 loaves in our pans. “What kind of stupid recipe
makes 1.5 of something?”, we ask ourselves. We want to scale the recipe so that it gives us
exactly 1 loaf in our pan units. We want to “rescale” the recipe to length 1. We simply divide it
by its length. This is a process called “normalization”.

2 2
How do we get the length of a vector? The Pythagorean Theorem: sqrt(base + height) for two
2
dimensions or, more generally, sqrt( sum( C
i ) ) where C is the ith component of a higher
i
dimensional system.

12

T T
Above we see normalizing the bread vector (2,1) . Notice that when the bread vector (2,1) is
T
divided by the length
2.2
then it ends up at
(0.9,0.45) which is 1 unit long as shown on the right.
Projection
Consider the following dialog between two vectors:

So, who is right? The orange vector or the green vector? Neither of course. All measurement is
relative to some other measurement. Thus, we need a way to say: “How long is the green arrow
compared to the orange arrow” and viceversa. Such measurement is called “projecting” one
vector onto another. In the above case our orange and green friends are pointing in the same
13
direction, but in general they might not be. Consider this situation:

Now what do we do? They don’t even agree on which direction is the “true” direction! When
projecting the orange vector onto the green vector we have to find the vector that points in the
same direction as green but which is closest to orange.

The idea of projection is very important. It is common that we have one vector and we want to
understand it in terms of some other vector. In this case we seek the vector v shown in
projection
blue which is how the green vector b
would understand the orange vector a. Note that the
difference between the orange vector a and the blue projection vector is called the “residual”
[8]
vector. “Residual” comes from Latin and means “the leftover part”.

The equation for projecting a b
onto is:

This is a somewhat mysterious equation, for a detailed understanding of this equation and its
relationship to the angle between the vectors see Appendix A.

One thing to notice is that if ab
x + a
x b
y a
is zero then there is no projection of
y b
onto . How is
that possible? It means that a
and b
are perpendicular to each other.

14

Matrices as operators
A very common thing to do with a matrix is to have it represent some transformation you wish to
do to a bunch of vectors. For example, in video games, characters are modelled by positioning
lots of points in 3D space and connecting them with triangles and quadrilateral faces as the
below picture illustrates. This kind of model is called a “mesh” of vertices.

[9]

We use matrices to act on the vertices of the character in order to position, scale, and rotate the
character in space. This called using the matrix as an “operator.” “Operator” comes from Latin
[10]
“operari” = ”to work.”

Let’s demonstrate what a matrix operator does by using a 2D mesh representing a house
instead of a complicated 3D game character. The following shows our matrix of vertices (one
per column) sitting next to a plot of those same vertices with connecting lines forming a house.

15

Let’s suppose that we’ve modeled the house but want it twice as wide (everyone wants a bigger
house!). We make a matrix that contains our scaling as follows.

We make the first column of our scaling matrix be two units long in the horizontal direction. With
this we’re saying: “the horizontal components of the space are twice as long.” Then we make
the
second column be one unit along the vertical direction so we’re keeping the height the same
as it was. The result is a scale matrix that makes the house twice as fat but no taller.

SCALE

Now let’s rotate the house, laying it on its side. This time we want the operationmatrix’s first
column vector to be pointed up and the second column vector to now point left as in the
following.

And the result is the house laying on its side.

16

The following is an arbitrary rotation around 45 degrees. If cos(45) and sin(45) are mysterious
to you, see the trigonometry refresher in Appendix B.

ROTATE

And finally if we let the matrix operator have nonperpendicular column vectors it will deform the
house, skewing it.

SKEW
Combining Operators

You can combine operators easily. Suppose you want to both scale and a rotate. You simply
multiply the operators to form the composite matrix operation as the following demonstrates with
a scale by 2 and a rotation by 90 degrees.
17

Basis
Notice that in the above five examples the column vectors of the operation matrix are plotted in
orange and blue. Take note of how those operation vectors are related to the outcome of the
house; when those orange/blue vectors are rotated, the house is rotated, when scaled, the
house scales, etc. These orange and blue vectors the column vectors of the operation matrix
are called a “basis.” A basis is simply a set of vectors that represent the canonical directions
of the operation. It is common to “change basis” meaning that you modify a set of vectors such
as the data vectors (the house vertices, for example) are in the new coordinate system. For
example:

Homogeneous Matrices
In all of the examples above the house is anchored to the origin no matter what operator we
use. What if we wanted to move the house to a different place? Sliding a set of vectors around
[11]
without changing shape is called “translation”, from Latin “trans”=”across” + “latus”=”to carry.”
The question is, how do we translate using a matrix operator? We don’t have a multiplicative
operator which means “add this amount to every vector,” but there’s a very clever trick we can
do: we add another dimension.
18

In the above example a third column and row have been added to the matrix operator; that third
T
column (2,2,1) results in a translation of (2,2) because we’ve also added a row of 1’s to our
vertices at the bottom. (Note that the system is now three dimensional but I’m only plotting two
of those dimensions.) So, for example, consider the first vertex (first column) of the house =
T
(0,0,1).

This is tricky matrix kungfu called “matrix homogenization”. If you plan on using linear algebra
for, say, computer graphics, then this is a mustknow method. In summary, the idea is that by
adding a new dimension to our system we’ve added translation. The clever bit is that we added
an extra dimension set to all 1 on the vertices of our house model so that when multiplied by the
operator they pick up the translation in the third column vector. One way to visualize this is that
the operator matrix is broken into two subparts as shown below. The green submatrix is the
rotation/scale/skew operator and the blue submatrix is the translation. The bottom row is
always set to 0, 0, 1.

Inverses
Almost anything you can do with a matrix you can undo. For example if we rotate by 90
degrees and scale by 2 as we did above then all we have to do to undo that is make a matrix
operator that scales by 0.5 and rotates 45 degrees (note the reversal of the order!). As
described before, a matrix which undoes the operation of another is called its inverse.
19

Do all matrix operators have an inverse? No!

Consider the following operation on our house model:

The above operator flattens the house on to a diagonal line. Once you’ve done that there’s no
inverse anymore. That is, when you flatten some aspect of a matrix operation you’ve lost
information you can’t go backward anymore. Matrices that can’t be inverted are called
“singular”. This term probably comes from the fact that a singular matrix (noninvertible) “stands
alone” it has no inverse partner to play with.

Notice that a noninvertible matrix has some of its column vectors pointing in the same direction.
That is, it is “flattened” in a way it’s parts don’t span the whole space. The basis of this matrix
has lost a dimension you can’t combine the blue and orange column vectors and get to
anywhere in the space anymore you’re stuck on that line.
Rank
Rank is a metric of how many effective dimensions there are in a matrix. A full rank matrix
means one that isn’t singular its vectors allow you to combine them so that you can reach
anywhere in the space. But a singular matrix like the one above has at least two of its column
vectors pointing in the same direction thus collapsing the space; its rank is less than full. In this
case, it is a two dimensional matrix but has only one direction among its two column vectors so
its rank is one. The directions you can’t travel in (which prevent you from leaving the line that
the blue and orange vectors are on) is called the “null space” of the matrix. For example, the
circle in the below basis is in the nullspace there’s no combination of the blue and orange
basis vectors that gets to it.

20

Special Matrices
There are a lot of special matrix forms that have interesting properties. Here are a few:
Identity Matrix
A matrix that is all zeros except for having one’s along the topleft to bottomright diagonal is
called the “identity matrix”. It has the special property that as an operator it leaves things
unchanged. It is abbreviated capitalized “I” for “Identity” (to differentiate it from lower case “i”:
the imaginary number).

In the above drawing, I’m only showing two dimensions, but the idea of the identity matrix
extends to any number of dimensions just add 1’s along the diagonal. Prove to yourself that
this operator leaves things unchanged. Try plugging in your own values in the green vector
below:

1
Note that the identity matrix is the result of a matrix times its own inverse. A * A = I.
Diagonal, a.k.a. Scaling Matrix
The identity matrix described above is a special case of a “diagonal matrix” one with all zeros
except the along the topleft to bottomright diagonal . A diagonal matrix is one whose basis
vectors are axisaligned but which are not necessarily all 1’s along the diagonal. When used as
an operator, it is said that these column vectors act to “scale” the right matrix. We’ve seen an
example of a “scaling” matrix earlier.

21

Orthogonal Matrix
The word “orthogonal” comes from Greek, “ortho” = ”straight” and “gonia” = ”angle”; “gonia”
[12]
comes from “gony” = “knee”. Any matrix where all of its column vectors are all at right angles
to one another is orthogonal. The identity and diagonal matrices we just learned about are two
examples of orthogonal matrices in those cases the basis is not only orthogonal but
axisaligned too. A more general orthogonal matrix might be rotated like:

Note that the lengths of the basis vectors don’t have to be the same length, as long as all the
bases are perpendicular to each other, it’s called orthogonal.

We previously saw that when you have two vectors that are orthogonal to each other that the
a b
x +a
x b
y term of their projection was zero. Note that this term is the same thing as taking the
y
row vector aT
b
and multiplying it by the column vector .

ab
x +a
x b
yy aT
=
*
b

This is true no matter how many dimensions there are (showing only 2 in the above equation).
Thus, whenever aT
b
* a
= 0 we know that b
and are orthogonal.
Orthonormal / Rotation Matrix
An orthogonal matrix that has all unitlength basis vectors is called “orthonormal” or a “rotation
22
matrix”. The “normal” part of the word means that it has been “normalized”, i.e. made all
unitlength. It is also called a rotation matrix because its effect as an operator is to simply rotate
the space without resizing or skewing.

An orthonormal matrix has the property that it’s transpose is also its inverse. To see why:

Since we know that the matrix A is orthogonal, all the offdiagonal terms must be zero since we
know at aT
b
* is zero when two vectors are orthogonal. And it must be the case that the
upperleft to lowerright diagonal must be all ones since a vector aT
a
* must be equal to 1 since
its projection onto itself is obviously 1. (Review projections above if you don’t see that).

T T
Thus we see that A * A is equal to the identity matrix meaning that A must be equal to the
inverse. (See Identity Matrix above). Therefore we’ve proved that for an orthonormal matrix, the
its transpose is its inverse.
Factorizations
A very important thing to understand about matrix multiplication is that you can combine or split
the matrices into parts. Consider this simple scalarvalued equation:

3 * 5 * x = 15 * x
23

We can collapse the 3 * 5 into 15 or viceversa we can split the 15 into 3 * 5. When we split a
product into parts that’s “factorization”; when we combine, that’s “compositing” or just
“multiplying”.

We can do the same with matrix multiplies. For example we can composite A*B into a single
(A*B) matrix.

x
A * B * = (A*B) *
x

This is a very handy thing to do computationally because it means we save a potentially
expensive loop over x
. For example, the matrix A might be a scale operation and the matrix B
might be a rotation. We might have many x
’s say, all the vertices of a game character. It is
much more efficient to combine the A * B into a single operation and then apply it to each x
vertex than it is to apply all the vertices to B and repeat that applying them all to A. Why loop
through the x’s more than once when you can combine the operators more cheaply?

Factoring is equally handy but much more complicated. Suppose you have a composite
rotation and scale (R*S)=M; the two operations have been mixed together and you might want
to pull back out the unknown rotation and unknown scale given M. This is not so easy even for
simple numbers (you basically have to try every prime number) much less systems of coupled
equations. We’re now going to look at several such matrix factorizations.
LU decomposition
The second factorization we are going to examine is used for solving matrices without using the
inverse. We previously solved an equation like the following to find the proportions of scone
and cake recipe we needed to make a bread recipe:

x
M * =
b

and we learned that we could solve for x by:

1
(M x
*M) * 1
= M *
b

But calculating an inverse is a fairly computationally expensive operation, especially for large
matrices. It is also fraught with numerical accuracy problems. Furthermore, suppose that we
have a lot of different b
vectors to solve for. We’d have to do an expensive matrix multiply for
each b
vector. There’s a better way.

First, observe the following. Suppose that we had a matrix that had the following form:

24

The matrix is in a “upper triangular form” meaning that all of its nonzero entries are in the
upperright triangle. Everything in the lowerleft is zero. Now consider the equation for the
bottom row:
0 * x + 0 * x
1 + m
2 * x
6 = b
3 3

Which is solvable for m because those zeros make it simplify down to:
6

m * x
6 = b
3 therefore x
3 = b
3 / m
3 6

And now that we have the value for x we can solve the next row up because we can substitute
3
in x in the equation and therefore have only one unknown. We repeat this up the rows until we
3
have solved for all the unknowns. This process is called “backsubstitution” and is the prefered
way to solve a large matrix.

Using an uppertriangular (U) matrix allows you to solve by backsubstitution, but how do you
get any old matrix M into U form? You use an algorithm called “LU” for “Lowertriangle /
b
Uppertriangle” decomposition. It breaks M into L * U. You then multiply by L to get the
b
vector into the right form and then you apply back substitution using U. For large matrices you
have to solve them using LU for both time and numerical stability reasons. The advantages of
LU over the inverse are so large that some computational linear algebra packages such as
LAPACK don’t even provide an inverse function you have to use a dedicated solver function
which is internally using LU.
Eigen Decomposition
For most matrices there exists a set of vectors that don’t change direction when operated upon
by the matrix. In all the above matrix operations we would input a set of vectors (for example,
the house vertices) and we’d find that the output vectors were different from the input. But, for
most matrices there are special vectors that don’t change direction under the operation. These
are the “eigenvectors”. The term “eigen” comes from German and means, roughly, “own” or
“self” which makes sense the eigenvectors map onto themselves under the operation.

It’s a little hard to visualize so the best way to understand this is with several specific examples.

First consider the following skew matrix from above.

25

Any vector that points in the blue direction (or opposite to blue) is going to keep pointing in the
same blue direction under the operation. Any other vector will change which way it’s pointing (it
will be skewed right). The blue direction is an “eigenvector” of this operation.

To motivate ourselves as to the utility of an eigen decomposition, let’s look at a practical
example borrowed from Linear Algebra and its Applications by David Lay [13]
. In his example he
models the movement of people to and from the cities. Let’s suppose that in any given year
10% of the people in the rural areas move to the city and 5% move from the city to the country.
That means that 90% of rural people stay in the rural area and 95% stay in the city. Let’s make
a matrix for that:

The first column represents the movement of the people who live in the countryside 90% go
from countryside to countryside (i.e. stay put) and 10% move to the city. The second column
represents the movement of the people who live in the cities: 5% move to the country and 95%
stay put.

Now let’s use this matrix as an operator. Let’s suppose that there’s a million people in the
country and half a million in the city
the first year. We can find out how many end up in each
place after one year by applying the matrix operator.

As we can see, 75,000 = (1,000,000925,000) people moved from the country to the city.

Algebraically we could write the above operator using M as the matrix that quantifies the
movement between country and city,
p0 p
as the population at time zero, and as the population
1
after one year as follows:
26

p
M * 0 p
=
1

What if we apply the movement matrix for another year? We can write that as:

p
M * p
=
1 2

Notice that there are fewer people moving this year only 63,750 = (925,000861,250). It
makes sense that less people moved the second year after all there were fewer people in the
country to move out. We could write this two year trend by concatenating the matrices as in:

p
M * p
=
0 1
p
M * p
=
1 2

Therefore:
M * (M * p ) =
0 p
2
M 2
p
* p
=
0 2

What if this pattern repeats itself year after year for, say for n years? That is the same thing as
raising M to the nth power.

n
M p
* p
=
0 n

Eventually, when n is large, the number of people leaving the country will
exactly match the number of people leaving the city. That is called the
“equilibrium” of the system. “Equi”=”equal” + “Libra”=”Balance”. (That’s
why the astrological symbol of Libra is represented holding a balance.)

The equilibrium vector (p ) doesn’t change under the matrix
equilibirum
operation. Stated algebraically:

p p
= M *
equilibirum equilibirum

Because the equilibrium vector doesn’t change under the matrix operator, it is an eigenvector of
27
the matrix operator.

This particular example of an eigenvector for the peoplemoving example is actually a very
special eigenvector because it doesn’t change length under the operation. In general,
eigenvectors point the same direction after the matrix operation but may change length. The
fact that this one doesn’t change length means it is a special case where the system isn’t
changing size (that is, we didn’t have birth or death in our model, only movement). A movement
operator such as this has a special name: “Markov matrix” named after an important Russian
mathematician who studied probability. You can tell it has the special conserving property
because all the columns add up to 1.

Each eigenvector has an associated number called its “eigenvalue” that tells you by how much
the eigenvector changes length (scales) under the operation. For a Markov matrix, the
dominant eigenvalue is 1 meaning it doesn’t scale.

How do you solve for the eigenvectors and their eigenvalues? Like other linear algebra
operations, you ask a computer to do it for you but you have to know how to phrase the
question.

The eigen decomposition of M is written as follows:

MQ = QD
which is the same thing as 1
M = QDQ

Where M is the transform matrix, Q is the matrix of eigenvectors (each in a column) and D is the
diagonal matrix of eigenvalues. What the above equation says is “If I operate with M on any of
the special eigenvectors (the columns of Q) then I end up with that same special Q vector back
scaled by some amount (the amount is in the respective column of D).

What other applications are there for the eigen factorization? There are a lot. In general they
have to do with analysis of how something is changing when you mix or rotate in some way. In
the Principal Component Analysis section below we will look at one of them.

28
Principal Component Analysis

[14]
The above graph is a scatter plot of weight versus height. Each data point represents a
person. The vertical axis is their height, the horizontal their weight. As we can see, and as we
would suspect, there is a relationship between these two variables giving us a bottomleft to
topright slant. That is, the taller you are in general the heavier you are. This pattern is called
“correlation”. (“Correlation” comes from “cor”=”together” + “relation”.) But, as we see clearly, it
isn’t a perfect correlation: there’s scatter to the points. That means that there are some people
who are much heavier than you’d expect given their height and others who are much skinnier
than you’d expect given their height.

A very handy thing to do to this graph would be to make a new set of axes that are better
aligned to the data giving us a trend line.

29

The red axis in the above plot shows us the trend how much do you weigh on average given
your height (or vice versa how tall are you given your weight). Those individuals above the
red line are those who are taller than the trend and those below are shorter. That is, your
position along the green axis tells us how “fat” or “skinny” someone is corrected for their height
(or vice versa).

Our goal is to solve for the new axes the ones labeled in red and green above. We will use a
tool called “Principal Component Analysis” (PCA).

The first “principal component” (shown in red) is the direction of maximum “variance”, or
“spreadoutiness”. If you knew nothing else about a person other than their position along this
red axis you’d still be able to guess at both their height and weight. Knowing the position of a
data point along the most variable dimension tells you a lot about that data point. Knowing the
position along the least variable dimension tells you very little. So, if you’re interested in
summarizing data, especially data that is embedded in a high dimension, then you use PCA to
find the dimension(s) of highest variance, then record each data point along those few
dimensions and simply toss out the rest of the data. This is a form of data compression called
“dimensional reduction”.

To perform PCA you begin by centering the data by subtracting out the mean so that the center
of the data is now at the origin.

30

We want to find the directions of maximum variance. To compute the variance you simply add
up the square of all the residuals from the mean.

Note that variance is a squared quantity which means that it’s going to have the property that
both positive and negative residuals will have positive contributions to the sum (since negative
times negative is positive).

Computing the variance with matrix notation is very easy: we simply multiply a centered data
T
matrix by its transpose. Variance = Data * Data.

31

Our data matrix was 2 dimensions (weight and height) by n data points. Transposing and
multiplying gives a (2 x n) * (n x 2) = (2 x 2). What are the four elements of that 2x2?

The topleft corner of our result is the variance along the first dimension. The lowerright corner
is the variance along the second dimension. The lowerleft corner is called the “covariance” of
the 1st with the 2nd dimension. Note that the upperright is the same thing as the lowerleft.
This makes sense. The dotproduct(x,y) is the same thing as the dotproduct(y,x).
Covariance
Variance of a single variable is fairly easy to understand, but covariance between two variables
is a little tricker. Consider the following contrasting situations.
32

When a data series runs along a lowerleft to upperright diagonal we see that the x*y residual
products will generally be positive because the points in the lowerleft will be negative*negative
= positive and the ones in the upperright will also be positive. So the sum of all these numbers
will be large positive number.

But in the opposite situation when the data points trend upperleft to lowerright we see that the
residuals will be positive*negative = negative and thus the sum will tend towards a large
negative number. What if there is no correlation? In that case (bottom) the positive values
roughly cancel out the negative values and the result will be a small number, somewhere near
zero.

The covariance between two components of a signal tells us about how well they predict each
other. A large positive score means they align well bottomleft to topright and a large negative
score tells us the line up well the other way around. In contrast, a small sum tells us that they
don’t line up well at all.

Principal Component Analysis uses the covariance matrix to solve for the new axes which lie
along the directions of most variance to least variance. The procedure is both simple and
profound the eigenvectors of the covariance matrix are the principal components. The
eigenvector with the largest eigenvalue is the first component, the next largest is the next, and
33
so forth.

Why are the eigenvectors of the covariance matrix the principal components? That’s a much
harder question to answer without slogging through the algebra (see Appendix C).

It’s worth noting that the vast majority of mathematical theorems are only accessible through
algebra. Try as we might to have intuition for why the PCA components equal the eigenvectors
of the covariance matrix, there isn’t an easy shortcut (believe me, I’ve tried!). While searching
for an easier way to learn geometry King Ptolemy is supposed to have been told by Euclid that
[15]
there was no “Royal Road to geometry”. As usual, Euclid is right often you just have to
make your way through the algebra to get somewhere useful.
Vocabulary
vector a collection of measurements, one for each component of a system
scalar a term which scales something under multiplication
matrix a table of number that represents a linear system
basis
the column vectors of a matrix
operator using a matrix to change the basis of a system
dimension the rulers by which a system is measured
transpose to lay a matrix on its side making column vectors into row vectors and vice versa
projection the shadow of a higher dimensional system on a lower one
inverse the operator that undoes the operation of some other operator
normalize to make unit length
projection to measure one vector in terms of another
orthogonal perpendicular in every dimension
homogeneous matrix a trick where by an extra dimension is added to account for translation
eigen decomposition a matrix factorization that finds the equilibrium vectors of a system
principal component analysis (PCA) a technique to identify the principal axes of variation
References
[1]
http://www.etymonline.com/index.php?term=vector
[2]
http://www.etymonline.com/index.php?term=transpose
[3]
http://www.etymonline.com/index.php?term=scale
[4]
http://www.etymonline.com/index.php?term=dimension
[5]
http://www.etymonline.com/index.php?term=invert
[6]
http://www.etymonline.com/index.php?term=matrix
[7] Via Wikipedia. The Collected Mathematical Papers of James Joseph Sylvester. 18371853. p.247. On google
books: http://books.google.com/books?id=5GQPlxWrDiEC&pg=PA247
[8]
http://www.etymonline.com/index.php?term=residual
[9] Image from http://www.polycount.com/forum/showthread.php?t=72623
[10]
http://www.etymonline.com/index.php?term=operate
34
[11]
http://www.etymonline.com/index.php?term=translate
http://www.etymonline.com/index.php?term=orthogonal
[12]
[13] Lay, David; Linear Algebra and its Applications, 3rd Edition; p. 289
[14] Data from http://www.amstat.org/publications/jse/datasets/body.dat.txt
[15] Proclus, p.15 http://books.google.com/books?id=JZEHj2fEmqAC&lpg=PP1&pg=PA57#v=onepage&q&f=false
[16]
http://mathworld.wolfram.com/OrthogonalMatrix.html
[17] See theorem 4, http://www.math.vanderbilt.edu/~msapir/msapir/prtranspose.html
Acknowledgments
Thanks to Diane Whitmer, John H Davis, Andreas Matouschek, Edward Marcotte, Alberto
Verjovsky, Benni Goetz, and Steve Moore for their help.

35
Appendix A

Using Trigonometry
First, we consider the situation using trigonometry.

Note that there is a right triangle formed by V on the base, V
projection on the opposite and
residual a
on the hypotenuse.

First, make a unit vector out of the green b vector. Then, once we find the length of V projection
we’ll just scale that unit vector to the right length.

b =
unit b / len(b)

Consider the angle θ between the two vectors. The cosine of that angle is what we want
remember the cosine is the adjacent over the hypotenuse (see Appendix B).

cos(θ) = len( v ) / len(
projection a
) Definition of cosine
len( v ) = cos(θ) * len(
projection a) Solve for len(v
projection)

v = cos(θ) * len(
projection a) *
bunit Multiply
bunit by length to get projection vector

Without Using Trigonometry
Turns out that there’s a way to do this projection without angles. The reason that’s interesting is
that trig functions like cosine are computationally expensive. If there was a way do vector
projection without angles, that would be great. And there is, here’s the equation:

a
len( b
) * len() * cos(θ) = a*b
x + a
x *b
y y

Notice that the righthand side has no reference to the angle but rather uses only the
36
a
components of . But it’s hard to see how this magic equation works.
b
and

In this “proof” we see the two vectors a and b at the top and the angle θ between them. Two
triangles, green and brown, have been formed from those vectors. We arrange pairs of the
triangles made by those vectors into the rectangular arrangement seen on the left (follow the
colors). We see that the beige area is equal to a *b
x + a
x *b
y . We then rearrange those triangles
y
as shown in the center panel. Since we only moved the triangles around then it follows that
beige area in the center panel must equal the beige area in the left panel. We then cut along
the a vector and move the resulting triangle up to form a rectangle whose area is len(a) * len(b)
* cos(θ) which must equal a *b
x + a
x *b
y .
y

So now we can rearrange the projection equation so that it doesn’t involve an angle.

From above
Substitute definition of
b
unit

Multiply by len(b) / len(
b
)
)
Substitute cos(θ)*len(a *len(
b
)
Simplify

So, finally the projection can be written using vector multiplication as:

37
Appendix B Trigonometry Refresher

38
Appendix C The Algebra of PCA
We seek a rotation matrix (Q) that will align with the maximum variance of the data matrix (X).

X = Q * Y

The columns of Y are the new coordinates in the Q basis. In other words, the Y column vectors
when rotated by Q, give us back X.

We want to choose Q so that the covariation in this new rotation is all diagonal. That is, we
want Q so that it removes all the crosscorrelation terms and leaves us with just the diagonal
T
variance terms in the Y * Y covariance matrix.

T
We need to find the relationship between the original covariance matrix (X * X ) and the new
T
rotated one (Y * Y ).

(1) X = Q * Y Definition of Q and Y
1 1
(2) Q * X = Q * Q * Y
Left multiply both sides by Q1
1 1
(3) Q * X = Y Cancel Q to Isolate Y
T
(4) Q * X = Y
Q is orthonormal: it’s inverse is its transpose [16]
T T T
(5) Q * X * Y = Y * Y Right multiply both sides by Y T
T T
(6) X * Q = Y From (4) solve for YT by transposition theorem[17]
T T T T T
(7) Q * X * X * Q = Y * Y From (6) substitute X * Q for Y into (5)
T
(8) Q * (covar X) * Q = (covar Y) Rewrite as “covar” as a reminder of what we’re up
to
1
(9) Q * (covar X) * Q = (covar Y)
Q is orthonormal: it’s transpose is its inverse [16]
1
(10) Q * Q * (covar X) * Q = Q * (covar Y) Left multiply both sides by Q
(11) (covar X) * Q = Q * (covar Y) Now in eigen form: M * Q = Q * D

Now we have an equation we’ve seen before an eigen system. Remember that we seek
(covar Y) to be all diagonal. So this equation is in eigen form where Q are the eigenvectors of
the covariance of X and the covariance of Y is diagonal.

Thus we see that the eigenvectors of the covariance matrix of X are the principal components of
the data matrix X. Not so obvious!

39

BEST PDF On LinearAlgebra

Uploaded by

Copyright:

Available Formats

BEST PDF On LinearAlgebra

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

BEST PDF On LinearAlgebra

Uploaded by

Copyright:

Available Formats

What is a vector and how are they represented?

What is the relationship between a vector and its components?

A Colorful Introduction to Linear Algebra

You might also like