Notes For Lectures 11 To 16 - 2024
Christl Donnelly
With grateful acknowledgement to Jonathan Marchini and Dino Sejdinovic
1 Introduction
The first 10 lectures of the Prelims Statistics and Data Analysis course in-
troduced the concept of likelihood for a probabilistic model, leading up to
linear regression with several explanatory variables (multiple linear regres-
sion).
Interest lies in identifying which of the Xi ’s are important parts of the model,
and also building a model that is able to make accurate predictions of Y
using X1 , . . . , Xp .
1. Can we find a way to visualize the data that is informative?
https://www.youtube.com/watch?v=SUbqykXVx0A
Massive amounts of data are being collected in almost all walks of life. Fi-
nancial institutions, businesses, governments, hospitals, and universities are
all interested in utilizing and making sense of data they collect.
1.2 Motivating examples
1.2.1 Single cell genomics
In 2014 a study reported in the journal Nature Biotechnology (Pollen et al., Nature Biotechnology 32, 1053-1058, https://www.nature.com/articles/nbt.2967) collected gene expression measurements at 8,686 genes from 300 cells. One way to think about this dataset is that they measured how ‘active’ each gene was in each cell.
Figure 1 shows an image of the dataset where each row is one of the cells and each column is one of the genes. Viewing the data in this way it is very difficult to see any structure in the dataset. In this course we will learn about the method of Principal Components Analysis (which builds on Prelims Linear Algebra), which will allow us to find low-dimensional representations of the dataset that uncover underlying structure. For example, Figure 2 shows the “best” 2D representation of the data, and Figure 3 shows the “best” 3D representation of the data (see also the file movie.gif on the course website
https://courses.maths.ox.ac.uk/course/view.php?id=620
for a rotating version of Figure 4). What we mean by “best” here will be defined more precisely later in the course. Both of these images show structure within the dataset: some samples appear to cluster together in clear groups.
Figure 1: Single Cell dataset : gene expression measurements on 300 cells
(rows) at 8,686 genes (columns).
Figure 2: Plot of 1st and 2nd Principal Components for the Single Cell
Genomics dataset.
Figure 3: 3D plot of 1st, 2nd and 3rd Principal Components for the Single
Cell Genomics dataset.
Figure 4: 3D plot of 1st, 2nd and 3rd Principal Components for the Single
Cell Genomics dataset, with colouring given by the k-means clustering.
1.2.2 Food consumption
Consider the following DEFRA data showing the consumption in grams (per
person, per week) of 17 different types of foodstuff measured and averaged
in the four nations of the United Kingdom in 1997. We shall say that the 17
food types are the variables and the 4 nations are the observations. Looking
at the data in Table 1, it is hard to spot obvious patterns.
Figure 5 shows a 2D projection of the data (i.e. the first and second principal components), and we see that Northern Ireland is a major outlier. Once we go back and look at the data in the table, this makes sense: the Northern Irish eat many more grams of fresh potatoes and far fewer grams of fresh fruit, cheese, fish and alcoholic drinks.
Table 1: DEFRA data showing the consumption in grams (per person, per
week) of 17 different types of foodstuff measured and averaged in the four
nations of the United Kingdom in 1997.
England Wales Scotland N.Ireland
Cheese 105 103 103 66
Carcass meat 245 227 242 267
Other meat 685 803 750 586
Fish 147 160 122 93
Fats and oils 193 235 184 209
Sugars 156 175 147 139
Fresh potatoes 720 874 566 1033
Fresh Veg 253 265 171 143
Other Veg 488 570 418 355
Processed potatoes 198 203 220 187
Processed Veg 360 365 337 334
Fresh fruit 1102 1137 957 674
Cereals 1472 1582 1462 1494
Beverages 57 73 53 47
Soft drinks 1374 1256 1572 1506
Alcoholic drinks 375 475 458 135
Confectionery 54 64 62 41
Figure 5: Plot of 1st and 2nd Principal Components for UK food dataset.
1.2.3 Finding structure in genetic datasets
Novembre et al. (Nature 2008, https://www.nature.com/articles/nature07331) analysed genetic data from 3,000 individuals at ∼500,000 positions in the genome. The individuals were collected from different countries around Europe as part of the Population Reference Sample (POPRES) project. Before the study it was
“not clear to what extent populations within continental regions exist as dis-
crete genetic clusters versus as a genetic continuum, nor how precisely one
can assign an individual to a geographic location on the basis of their genetic
information alone.”
The question of interest was to see how much structure was present in the
dataset and to assess the similarities and differences between individuals
from different populations. Figure 6 shows the 2D projection of the dataset.
Each point on the plot is an individual and points are coloured in the plot
according to which country the individual comes from. What is striking
about this plot is how well the arrangement of points corresponds to the
geographic locations of the samples. The plot was constructed just using
the genetic data, without any knowledge of the geographic locations, yet the
plot is able to uncover a map of where samples came from.
We will refer to X as the data matrix. We use bold upper case for matrices
in these notes.
Figure 6: Plot of 1st and 2nd Principal Components for the POPRES dataset.
2 Exploratory data analysis and visualizing datasets
A key first step in many data analysis tasks is to carry out an exploratory data analysis. If the dataset is stored in an n × p data matrix X then we can look at data summaries and plots of various aspects of the data to help us uncover the properties of the dataset. To illustrate some simple plot types we will use the ‘famous’ Crabs dataset.
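As a minimal sketch of how such an exploratory analysis might be started in R (the crabs data frame shipped with the MASS package is similar to, though not necessarily identical to, the dataset used in these notes; the object name crabs_meas below is an assumption and is reused in later sketches):

    library(MASS)                                            # provides a version of the crabs data
    crabs_meas <- crabs[, c("FL", "RW", "CL", "CW", "BD")]   # the five morphological measurements
    summary(crabs_meas)                                      # numerical summaries of each variable
    boxplot(crabs_meas)                                      # box plots, as in Figure 9
    pairs(crabs_meas)                                        # pairwise scatter plots, as in Figure 10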
2.2 Histograms
A histogram is one of the simplest ways of visualizing the data from a single
variable. The range of the variable is divided into bins and the frequency of
observations in each bin is plotted as a bar with height proportional to the
frequency. Figure 7 shows histograms of the five crab measurements.
2.3 Boxplots
A Box Plot (sometimes called a Box-and-Whisker Plot) is a relatively sophisticated plot that summarises the distribution of a given variable. These plots are built from a small set of summary statistics, defined below.
[Figure 7: histograms of the five crab measurements FL (Frontal Lobe Size, mm), RW (Rear Width, mm), CL (Carapace Length, mm), CW and BD, with frequency on the vertical axis.]
Median - the ‘middle’ value i.e. the value for which 50% of the data fall
below when arranged in numerical order.
1st quartile - the 25% value i.e. the value for which 25% of the data fall
below when arranged in numerical order.
3rd quartile - the 75% value i.e. the value for which 75% of the data fall
below when arranged in numerical order.
Inter Quartile Range (IQR) - the difference between the 1st and 3rd
quartiles. This is a measure of ‘spread’ within the dataset.
The plot itself then consists of the following components:
1. A box that covers the middle 50% of the data, i.e. the IQR. The edges of the box are the 1st and 3rd quartiles. A line is drawn in the box at the median value.
2. Whiskers that extend out from the box to indicate how far the data extend either side of the box. The whiskers should extend no further than α times the length of the box, i.e. the maximum length of a whisker is α times the IQR. A commonly used value of α is 1.5.
3. All points that lie outside the whiskers are plotted individually as outlying observations.
Figure 8: A Box Plot labelled with the main features of the plot.
Figure 9: Box plots of the Crabs dataset showing similarities and differences
between the 5 variables.
[Figure 10: pairs plot of the five Crabs variables FL, RW, CL, CW and BD, showing all pairwise scatter plots.]
Figure 11: Screen shot from 3D interactive plotting of the Crabs variables
CW, RW and CL
2.5 The Multivariate Normal Distribution
To build good models of datasets consisting of several measured variables
we need probability models of multiple variables. We have seen this type of
model briefly in Prelims Probability in the section on Joint Distributions.
One of the simplest and most widely used models for multiple continuous
random variables is the multivariate normal distribution (MVN).
We have seen before that the univariate normal distribution has two scalar
parameters µ and σ 2 , and we use the notation X ∼ N (µ, σ 2 ). The parameter
µ denotes the mean of the distribution and σ 2 denotes the variance. The
pdf of the univariate normal distribution is
$$f(x) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right\}, \qquad -\infty < x < \infty.$$
The multivariate normal distribution is the generalization of the univariate
normal distribution to higher dimensions. The MVN is a distribution for
a p-dimensional random column vector X = (X1 , . . . , Xp )T that allows for
non-zero correlations to exist between the elements of the vector. As such,
it can be a useful distribution for modelling data that consist of multiple
variables measured on the same items or observations that may (or may not)
be correlated.
We use the notation X ∼ Np (µ, Σ). When p = 1 this reduces to the univari-
ate normal distribution. The multivariate normal distribution is sometimes
referred to as the multivariate Gaussian distribution.
The parameter µ is a p × 1 mean vector and Σ is a p × p covariance matrix, with
E[X_i] = µ_i for i = 1, . . . , p,
var(X_i) = Σ_ii for i = 1, . . . , p,
cov(X_i, X_j) = Σ_ij for i ≠ j.
$$\mathrm{cor}(X_i, X_j) = \frac{\mathrm{cov}(X_i, X_j)}{\sqrt{\mathrm{var}(X_i)\,\mathrm{var}(X_j)}} \in [-1, 1],$$
so if X ∼ N_p(µ, Σ) then
$$\mathrm{cor}(X_i, X_j) = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\,\Sigma_{jj}}}.$$
Figure 12 shows the density of a bivariate (p = 2) normal distribution with correlation ρ = 0.7, and Figure 13 shows contour plots for several bivariate normal densities with different values of ρ. Figure 14 shows a 2D density together with a sample of 200 data points simulated from the same distribution.
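A minimal sketch of how a sample like the one in Figure 14 could be simulated in R, assuming the MASS package is available for its mvrnorm() function:

    library(MASS)                                       # for mvrnorm()
    mu    <- c(0, 0)
    Sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2)        # Sigma_11 = Sigma_22 = 1, Sigma_12 = 0.7
    x     <- mvrnorm(n = 200, mu = mu, Sigma = Sigma)   # 200 points from N_2(mu, Sigma)
    plot(x, xlab = "x1", ylab = "x2")                   # scatter of the simulated sample
    cor(x)[1, 2]                                        # sample correlation, close to 0.7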
Figure 12: 2D multivariate normal density for µ = (0, 0)T , Σ11 = Σ22 = 1
and Σ12 = 0.7
[Figure 13: contour plots of bivariate normal densities for several values of ρ, including ρ = 0, ρ = −0.2 and ρ = −0.7.]
[Figure 14: contours of a bivariate normal density with ρ = 0.7, together with a sample of 200 simulated data points.]
2.5.2 Estimating parameters for the multivariate normal distribution
Often given a sample of real data, we will want to find the MVN that best
fits the data. Given a sample of n observations from a Np (µ, Σ) distribution,
it can be shown (see Exercises) that the maximum likelihood estimates of µ
and Σ are
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
and
$$\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})^T.$$
Note that (x_i − µ̂) is a p × 1 column vector and (x_i − µ̂)^T is a 1 × p row vector, so Σ̂ is a p × p matrix.
Note that we don't have to assume that the data are generated from an MVN distribution in order to calculate the sample covariance matrix S. It is a useful summary of the pairwise covariances in any collection of variables. On the Crabs data the sample covariance matrix is
S =
          FL      RW      CL      CW      BD
    FL  12.21    8.15   24.35   26.55   11.82
    RW   8.15    6.62   16.35   18.23    7.83
    CL  24.35   16.35   50.67   55.76   23.97
    CW  26.55   18.23   55.76   61.96   26.09
    BD  11.82    7.83   23.97   26.09   11.72
Sometimes it is useful to work with the sample correlation matrix, de-
noted R, which has entries
$$R_{ij} = \frac{S_{ij}}{\sqrt{S_{ii}\,S_{jj}}}$$
R =
          FL      RW      CL      CW      BD
    FL   1.00    0.91    0.98    0.96    0.99
    RW   0.91    1.00    0.89    0.90    0.89
    CL   0.98    0.89    1.00    1.00    0.98
    CW   0.96    0.90    1.00    1.00    0.97
    BD   0.99    0.89    0.98    0.97    1.00
The high levels of correlation between all pairs of variables can be seen
visually in Figure 10.
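A sketch of how S and R can be computed in R (reusing the crabs_meas data frame assumed in the earlier sketch; note that R's cov() and cor() divide by n − 1 rather than n, which does not change the qualitative picture):

    S <- cov(crabs_meas)   # sample covariance matrix (divisor n - 1)
    R <- cor(crabs_meas)   # sample correlation matrix
    round(S, 2)
    round(R, 2)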
If X ∼ N_p(µ, Σ) and B is an m × p matrix of constants, then Y = BX satisfies
E[Y] = Bµ,    cov(Y) = BΣB^T.
A further useful result (which we will not prove in this course since it is
covered in Part A) is that the distribution of Y = BX is normal and given
by
Y ∼ Nm (Bµ, BΣB T )
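This result can be checked empirically by simulation; the matrix B below is an arbitrary choice made purely for illustration:

    library(MASS)
    mu    <- c(1, 2, 3)
    Sigma <- diag(c(2, 1, 0.5))                           # a simple 3 x 3 covariance matrix
    B     <- matrix(c(1, 0, 1,
                      0, 1, -1), nrow = 2, byrow = TRUE)  # an arbitrary 2 x 3 matrix
    X <- mvrnorm(n = 100000, mu = mu, Sigma = Sigma)      # rows are draws of X
    Y <- X %*% t(B)                                       # rows are y_i = B x_i
    colMeans(Y)                                           # approximately B mu
    cov(Y)                                                # approximately B Sigma B^T
    B %*% Sigma %*% t(B)                                  # the theoretical value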
3 Principal Components Analysis (PCA)
If we have a large number p of variables collected on a set of n observations it can be hard to visualize and summarize the dataset. PCA addresses this by finding a small number of linear combinations of the variables that retain as much of the variation in the data as possible.
Figure 15 shows a simple two-dimensional example with two clusters of points and two candidate (orthogonal) projection directions, labelled A and B. Choosing the direction along which the projected points have the largest variance turns out to be a good way to choose a projection that separates the two clusters.
Another way to think about this is that we have found a rotation of the
data points to maximize the variance. A rotation is a set of orthogonal
projections. Figure 16 shows the data points before rotation (left) and after
rotation (right). The points in the two clusters are well separated on the
new x-axis.
[Figure 15: scatter of data points together with two orthogonal projection directions A and B; the variances of the projected points are σ² = 12.98 and σ² = 1.80.]
Figure 15: Example showing the variance of points projected on two (or-
thogonal) projections (A) and (B). The projection (A) which separates the
points well has the highest variance.
Figure 16: The data points before rotation (left, "Raw data") and after rotation to the principal component axes (right, "Data rotated to Principal Components"). The two clusters are well separated along the first principal component.
We can write
Z1 = α1^T X
where α1 = (α11, α12, . . . , α1p)^T and X = (X1, . . . , Xp)^T are both column vectors. Then it can be shown (Exercise Sheet 1) that
1. The sample variance of Z1 is α1^T S α1, where S is the sample covariance matrix of the data.
2. This variance is unbounded as the αi ’s increase, so we need to add a
constraint on α such that
$$\sum_{j=1}^{p} \alpha_{j1}^2 = \alpha_1^T \alpha_1 = 1.$$
In other words, we try to maximize the sample variance of the first prin-
cipal component (α1T Sα1 ) subject to the constraint α1T α1 = 1. We can solve
this using Lagrange Multipliers (see Prelims Introductory Calculus course
for a reminder on this technique).
Let
$$L(\alpha_1, \lambda_1) = \alpha_1^T S \alpha_1 - \lambda_1(\alpha_1^T \alpha_1 - 1).$$
We then need the vector of partial derivatives of L with respect to the vector α1. This is the gradient vector ∇L seen in Introductory Calculus. Two results are useful here.
1. Let y = a^T x = Σ_{i=1}^p a_i x_i, then
$$\nabla y = \left(\frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}, \ldots, \frac{\partial y}{\partial x_p}\right)^T = (a_1, a_2, \ldots, a_p)^T = a.$$
2. Let z = x^T B x = Σ_{i=1}^p Σ_{j=1}^p B_ij x_i x_j with B symmetric, then consider
$$\nabla z = \left(\frac{\partial z}{\partial x_1}, \frac{\partial z}{\partial x_2}, \ldots, \frac{\partial z}{\partial x_p}\right)^T.$$
Since ∂z/∂x_i = 2 Σ_{j=1}^p B_ij x_j = 2(Bx)_i we have that
$$\nabla z = 2Bx.$$
Re-writing, we have
$$L(\alpha_1, \lambda_1) = \alpha_1^T S \alpha_1 - \lambda_1 \alpha_1^T I_p \alpha_1 + \lambda_1.$$
Using the two results above, taking the gradient with respect to α1 and setting it equal to zero gives 2Sα1 − 2λ1α1 = 0, that is
$$S\alpha_1 = \lambda_1 \alpha_1. \qquad (1)$$
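Numerically, equation (1) says that α1 is an eigenvector of S. A minimal check in R, again assuming the crabs_meas data frame from the earlier sketch:

    S       <- cov(crabs_meas)
    e       <- eigen(S)              # eigendecomposition S = V D V^T
    alpha1  <- e$vectors[, 1]        # loading vector of the 1st principal component
    lambda1 <- e$values[1]           # its variance (the largest eigenvalue)
    max(abs(S %*% alpha1 - lambda1 * alpha1))   # close to zero: S alpha1 = lambda1 alpha1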
Recap of results from Linear Algebra II
Let V be a vector space over R and T : V → V be a linear transformation.
4. A real symmetric matrix A ∈ M_n(R) has real eigenvalues and there exists an orthonormal basis of R^n consisting of eigenvectors of A. In other words, there exists a real orthogonal matrix V (its columns are orthonormal, so V^T = V^{-1}) such that
A = VDV^T
where D is a diagonal matrix of eigenvalues.
and this leads to another Lagrange multipliers problem, where we seek to maximize
$$L(\alpha_2, \lambda_2, m) = \alpha_2^T S \alpha_2 - \lambda_2(\alpha_2^T \alpha_2 - 1) - m\,\alpha_2^T \alpha_1,$$
the constraints now being α2^T α2 = 1 and α2^T α1 = 0. Setting the gradient with respect to α2 equal to zero gives
$$2S\alpha_2 - 2\lambda_2 \alpha_2 - m\,\alpha_1 = 0. \qquad (3)$$
Pre-multiplying (3) by α1^T gives 2α1^T S α2 − m = 0, and α1^T S α2 = α2^T S α1 = λ1 α2^T α1 = 0,
since α2^T S α1 is a scalar and S is symmetric, which implies that m = 0 and equation (3) reduces to the eigenvalue equation
$$S\alpha_2 = \lambda_2 \alpha_2. \qquad (5)$$
The above process can be continued for the other principal components. This
results in a sequence of principal components ordered by their variance. In
other words, if we consider the eigenvalue decomposition of S (see Linear
Algebra recap)
S = VDVT
where D is a diagonal matrix of eigenvalues ordered in decreasing value,
and V is a p × p matrix of the corresponding eigenvectors as orthonormal
columns (v1, . . . , vp), then we have that the ith principal component loading vector is α_i = v_i, with sample variance given by the ith largest eigenvalue D_ii.
3.3 Plotting the principal components
If X is the n × p data matrix of the n observations on p variables, S is the p × p sample covariance matrix, with ordered eigendecomposition S = VDV^T,
then the data can be transformed to the p principal component directions.
If we define Z to be an n × p matrix containing the transformed data, such
that Zij is the value of the jth principal component for the ith observation
then we have
$$Z_{ij} = \sum_{k=1}^{p} X_{ik} V_{kj}.$$
In other words, we take the inner product of the ith row of X and the jth column of V. The matrix V is known as the loadings matrix. In matrix
notation we can write this as
Z = XV
We can then plot the columns of Z against each other to visualize the data as
represented by those pairs of principal components. The matrix Z is known
as the scores matrix. The columns of this matrix contain the projections
onto the principal components.
Figure 17 shows a pairs plot of the 5 PCs for the Crabs dataset. The points
have been coloured according to the 4 groups Blue Male (dark blue), Blue
Female (light blue), Orange Male (orange), Orange Female (yellow). From
this plot we can see that PCs 2 and 3 seem to show good discrimination of
these 4 groups. This is shown more clearly in Figure 18.
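A sketch of how the scores and loadings can be obtained in R with prcomp(); the group labels constructed below from the sp and sex columns of the MASS crabs data frame are an assumption about how the colouring of Figures 17 and 18 was produced:

    pca <- prcomp(crabs_meas)                   # PCA of the covariance matrix (columns are centred)
    V   <- pca$rotation                         # loadings matrix
    Z   <- pca$x                                # scores matrix, i.e. the centred data times V
    grp <- factor(paste(crabs$sp, crabs$sex))   # four groups: blue/orange x male/female
    plot(Z[, 2], Z[, 3], col = grp, xlab = "PC2", ylab = "PC3")   # compare with Figure 18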
[Figure 17: pairs plot of the five principal components (Comp.1 to Comp.5) of the Crabs dataset, with points coloured by the four groups BM, BF, OM and OF.]
[Figure 18: plot of PC2 against PC3 for the Crabs dataset, with points coloured by the four groups BM, BF, OM and OF.]
3.4 Biplots
The columns of the loadings matrix V contain the linear projections for each of the PCs. It can be interesting to look at these loadings to see which variables contribute to each PC. For example, the loadings matrix for the Crabs dataset is as follows
          PC1     PC2     PC3     PC4     PC5
    FL   0.28    0.32   -0.50    0.73    0.12
    RW   0.19    0.86    0.41   -0.14   -0.14
    CL   0.59   -0.19   -0.17   -0.14   -0.74
    CW   0.66   -0.28    0.49    0.12    0.47
    BD   0.28    0.15   -0.54   -0.63    0.43
So, for example, this means that the first, second and third PCs are
Z1 = 0.28 FL + 0.19 RW + 0.59 CL + 0.66 CW + 0.28 BD
Z2 = 0.32 FL + 0.86 RW − 0.19 CL − 0.28 CW + 0.15 BD
Z3 = −0.50 FL + 0.41 RW − 0.17 CL + 0.49 CW − 0.54 BD
Notice how the loadings for the 1st PC are all positive. This is quite usual, especially when the units of observation are biological samples (such as crabs), where the 1st PC is essentially a linear combination of features that measures overall size. Often this component will account for a large amount of variance, but it tends not to be able to split samples into distinct groups. It is often the PCs with a mixture of positive and negative loadings that provide the contrast that separates groups.
For example, PC2 is a contrast between the variables FL, RW, BD (positive loadings) and CL, CW (negative loadings), whereas PC3 is a contrast between variables CW, RW and CL, FL, BD. Also, PC2 separates the Orange Females from the Blue Males well, whereas PC3 separates the Blue Females from the Orange Males well.
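A biplot such as Figure 19 overlays the scores (points) and the loadings (arrows); a minimal sketch using base R, with the same assumed objects as in the earlier sketches:

    pca <- prcomp(crabs_meas)
    biplot(pca, choices = c(2, 3))   # scores and variable arrows for PC2 and PC3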
[Figure 19: biplot of PC2 (0.9% explained var.) against PC3 (0.7% explained var.) for the Crabs dataset, showing the loading directions of FL, RW, CL, CW and BD, with points coloured by the groups BF, BM, OF and OM.]
3.5 Variance decomposition and eigenspectrum
The total amount of variance in the original data matrix X is the sum of the diagonal entries in S. In other words, we have that $\widehat{\mathrm{var}}(X_i) = S_{ii}$ so that
$$\text{Total variance} = \sum_{i=1}^{p} S_{ii} = \mathrm{tr}(S),$$
but since tr(AB) = tr(BA) where A and B are two p × p matrices, we have that
$$\mathrm{tr}(S) = \mathrm{tr}(VDV^T) = \mathrm{tr}(DV^TV) = \mathrm{tr}(D) = \sum_{i=1}^{p} D_{ii},$$
i.e. the total variance is the sum of the eigenvalues of S.
So we can see that the eigenvalues tell us interesting information about how
important each component is within the dataset. It is common practice to
plot the decreasing sequence of eigenvalues to visualize the structure in the
dataset. Such plots are sometimes referred to as eigenspectrum plots or
variance scree plots, and usually they are scaled so that each bar is a percentage of the total variance. That is, we plot
$$\frac{100\,D_{ii}}{\mathrm{tr}(D)} \qquad \text{for } i = 1, \ldots, p.$$
Figure 20 shows the scree plot for a simulated dataset with 3 clear clusters.
The 1st PC (middle) clearly separates the groups and the scree plot (right)
shows that the 1st PC accounts for almost all the variance in the dataset.
Figure 21 shows the scree plot for a simulated dataset with 4 clear clusters.
The 1st and 2nd PCs (middle) clearly separate the groups and the scree plot
(right) shows that the 1st and 2nd PCs account for almost all the variance
in the dataset.
The scree plot is sometimes used to decide on a set of PCs to carry forward
for further analysis. For example, we might choose to take the PCs that
account for the top θ% of the variance. However, care is sometimes needed
here. Figure 22 shows the scree plot for the Crabs dataset and shows that
the 1st PC accounts for almost all the variance. As discussed before, this
PC is most likely measuring the size of the crabs.
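A sketch of how the scree plot values 100 D_ii / tr(D) can be computed in R, reusing the assumed crabs_meas data frame:

    pca   <- prcomp(crabs_meas)
    evals <- pca$sdev^2                        # eigenvalues D_11, ..., D_pp
    pct   <- 100 * evals / sum(evals)          # percentage of total variance per PC
    barplot(pct, names.arg = paste0("PC", seq_along(pct)), ylab = "% Variance")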
Figure 20: Example with 3 clear clusters (left). The 1st PC (middle) clearly separates the groups. The scree plot (right) shows that the 1st PC accounts for almost all the variance (98.24%) in the dataset.
Figure 21: Example with 4 clear clusters (left). The 1st and 2nd PCs (middle) clearly separate the groups. The scree plot (right) shows that the 1st and 2nd PCs account for almost all the variance in the dataset.
[Figure 22: variance scree plot for the Crabs dataset. The 1st PC accounts for almost all of the variance, even though it is the 2nd and 3rd PCs that are most useful in separating the Orange/Blue and Male/Female groupings of crabs.]
3.6 Using the covariance matrix or the correlation matrix
A key practical problem when applying PCA is deciding exactly what data
the method should be applied to. Should we apply the method to the raw
data, or should we first transform it in some way? You should always ask
yourself this question when starting out on a new data analysis.
Remember that
$$\mathrm{cor}(X_i, X_j) = \frac{\mathrm{cov}(X_i, X_j)}{\sqrt{\mathrm{var}(X_i)\,\mathrm{var}(X_j)}} \in [-1, 1].$$
This means that the relationship between R and S is
$$R_{ij} = \frac{S_{ij}}{\sqrt{S_{ii}\,S_{jj}}}.$$
Let W be a diagonal matrix with entries W_ii = S_ii for i = 1, . . . , p; in other words, W is the same as S but with all off-diagonal entries set to 0. Then in matrix notation we can write
$$R = W^{-1/2} S W^{-1/2}.$$
It can be shown (see Exercises) that the PCA components derived from
using S are not the same as those derived from using R, and knowledge of
one of these sets of components does not enable the other set to be derived.
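In R the two choices correspond to the scale. argument of prcomp(); a sketch, assuming the indicators of Table 2 are stored in a data frame eu1 with countries as row names:

    pca_cov <- prcomp(eu1, scale. = FALSE)   # PCA of the covariance matrix S
    pca_cor <- prcomp(eu1, scale. = TRUE)    # PCA of the correlation matrix R
    pca_cov$rotation[, 1:2]                  # loadings dominated by the high-variance variables
    pca_cor$rotation[, 1:2]                  # loadings after putting variables on a common scale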
Table 2: CPI, consumer price index (index = 100 in 2005); UNE, unemployment rate in the 15-64 age group; INP, industrial production (index = 100 in 2005); BOP, balance of payments (€/capita); PRC, private final consumption expenditure (€/capita); UN%, annual change in unemployment rate.
Country CPI UNE INP BOP PRC UN%
Belgium 116.03 4.77 125.59 908.60 6716.50 -1.60
Bulgaria 141.20 7.31 102.39 27.80 1094.70 3.50
CzechRep. 116.20 4.88 119.01 -277.90 2616.40 -0.60
Denmark 114.20 6.03 88.20 1156.40 7992.40 0.50
Germany 111.60 4.63 111.30 499.40 6774.60 -1.30
Estonia 135.08 9.71 111.50 153.40 2194.10 -7.70
Ireland 106.80 10.20 111.20 -166.50 6525.10 2.00
Greece 122.83 11.30 78.22 -764.10 5620.10 6.40
Spain 116.97 15.79 83.44 -280.80 4955.80 0.70
France 111.55 6.77 92.60 -337.10 6828.50 -0.90
Italy 115.00 5.05 87.80 -366.20 5996.60 -0.50
Cyprus 116.44 5.14 86.91 -1090.60 5310.30 -0.40
Latvia 144.47 12.11 110.39 42.30 1968.30 -3.60
Lithuania 135.08 11.47 114.50 -77.40 2130.60 -4.30
Luxembourg 118.19 3.14 85.51 2016.50 10051.60 -3.00
Hungary 134.66 6.77 115.10 156.20 1954.80 -0.10
Malta 117.65 4.15 101.65 359.40 3378.30 -0.60
Netherlands 111.17 3.23 103.80 1156.60 6046.00 -0.40
Austria 114.10 2.99 116.80 87.80 7045.50 -1.50
Poland 119.90 6.28 146.70 -74.80 2124.20 -1.00
Portugal 113.06 9.68 89.30 -613.40 4073.60 0.80
Romania 142.34 4.76 131.80 -128.70 1302.20 3.20
Slovenia 118.33 5.56 105.40 39.40 3528.30 1.80
Slovakia 117.17 9.19 156.30 16.00 2515.30 -2.10
Finland 114.60 5.92 101.00 -503.70 7198.80 -1.30
Sweden 112.71 6.10 100.50 1079.10 7476.70 -2.30
UnitedKingdom 120.90 6.11 90.36 -24.30 6843.90 -0.80
Variance 111.66 9.95 357.27 450057.15 5992520.48 7.12
Figure 23 shows plots of the 1st and 2nd PCs for the EU indicators dataset. The left plot used the covariance matrix S; the right plot used the correlation matrix R. Points are labelled with the abbreviated country name. There is a clear difference between the two.
When using the covariance matrix S, the loadings of the 1st and 2nd PCs are dominated by the variables BOP and PRC; these are the variables with by far the largest variances (see the final row of Table 2). When using the correlation matrix R, all variables are put on a comparable scale and the loadings are spread more evenly across the variables.
Figure 23: Plots of the 1st and 2nd PCs for the EU indicators dataset.
(Left) Using the covariance matrix S. (Right) Using the correlation matrix
R. Points are labelled with the abbreviated country name.
3.7 PCA via the Singular Value Decomposition
The Crabs data has the property that the number of observations n = 200 is much larger than the number of variables p = 5. This is not always the case for datasets that we might work with. For example, earlier we saw that the Single Cell dataset had n = 300 and p = 8,686, the UK Foods dataset had n = 4 and p = 17, and the POPRES dataset of human genetic data had n = 3,000 and p = 500,000. A key step in carrying out PCA is calculating an eigendecomposition of the p × p sample covariance matrix S, which can be very expensive when p is large.
The singular value decomposition (SVD) provides a way around this. It allows us to write the n × p matrix X as
$$X = P \Lambda Q^T$$
where P is an n × n orthogonal matrix, Q is a p × p orthogonal matrix, and Λ is an n × p matrix whose only non-zero entries are non-negative values on its leading diagonal (the singular values).
NOTE: This theorem is given without proof and Prelims students are not expected to be able to prove it.
Note that Λ^T Λ is a p × p diagonal matrix whose entries are the squares of the diagonal entries of Λ. This has the same form as the eigendecomposition of S = VDV^T, with V = Q and D = (1/(n−1)) Λ^T Λ.
Therefore
Z = XV = XQ = P Λ
which implies that we need to calculate P and Λ. This can be achieved
using the eigendecomposition of the n × n matrix XX^T, since
$$XX^T = P\Lambda Q^T Q \Lambda^T P^T = P(\Lambda\Lambda^T)P^T,$$
so the columns of P are eigenvectors of XX^T and the diagonal entries of ΛΛ^T are the corresponding eigenvalues. When n is much smaller than p this is a far smaller problem than an eigendecomposition of the p × p matrix S.
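A sketch of PCA via the SVD in R for the wide-data case (X an assumed n × p numeric matrix with p much larger than n); the divisor n − 1 below matches R's cov() rather than the 1/n convention used for the maximum likelihood estimate:

    Xc <- scale(X, center = TRUE, scale = FALSE)   # mean-centre the columns
    s  <- svd(Xc)                                  # Xc = U diag(d) V^T (U plays the role of P, V of Q)
    Z  <- s$u %*% diag(s$d)                        # principal component scores (at most n columns)
    evals <- s$d^2 / (nrow(Xc) - 1)                # variances of the principal components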
A different way to think about PCA is as finding the best rank-one approximation X ≈ z1 w1^T, where z1 is an n × 1 column vector and w1 is a p × 1 column vector. We choose z1 and w1 to minimize the average reconstruction error
$$J(z_1, w_1) = \frac{1}{n}\sum_{i=1}^{n} (x_i - z_{i1} w_1)^T (x_i - z_{i1} w_1) \qquad (10)$$
$$= \frac{1}{n}\sum_{i=1}^{n} \left[ x_i^T x_i - 2 z_{i1} w_1^T x_i + w_1^T z_{i1} z_{i1} w_1 \right] \qquad (11)$$
Schematically, the n × p matrix X is approximated by the product of an n × 1 column vector z1 and a 1 × p row vector w1^T.
$$= \frac{1}{n}\sum_{i=1}^{n} \left[ x_i^T x_i - 2 z_{i1} w_1^T x_i + z_{i1}^2 (w_1^T w_1) \right] \qquad (12)$$
Since we can arbitrarily scale z1 and w1 so that the product X̃ = z1 w1^T stays the same, we can add the constraint that w1^T w1 = 1. Taking derivatives with respect to z_{i1} and equating to zero gives
$$\frac{\partial}{\partial z_{i1}} J(z_1, w_1) = \frac{1}{n}\left[-2 w_1^T x_i + 2 z_{i1}\right] = 0 \;\Rightarrow\; z_{i1} = w_1^T x_i = x_i^T w_1,$$
or in matrix form
$$z_1 = X w_1.$$
Plugging this back in gives
$$J(w_1) = \frac{1}{n}\sum_{i=1}^{n} x_i^T x_i - \frac{1}{n}\sum_{i=1}^{n} z_{i1}^2$$
$$= \frac{1}{n}\sum_{i=1}^{n} x_i^T x_i - \frac{1}{n}\sum_{i=1}^{n} (w_1^T x_i)(w_1^T x_i)^T$$
$$= \text{constant} - w_1^T \left(\frac{1}{n}\sum_{i=1}^{n} x_i x_i^T\right) w_1.$$
Now if we assume that the columns of X have been mean centered then
$$\hat{\Sigma} = \frac{1}{n} X^T X = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^T,$$
which gives
$$J(w_1) = \text{constant} - w_1^T \hat{\Sigma} w_1.$$
Minimizing J(w_1) is therefore equivalent to maximizing w_1^T S w_1 (writing S for the sample covariance matrix) subject to the constraint w_1^T w_1 = 1, and using a Lagrange multiplier exactly as before gives
$$S w_1 = \lambda_1 w_1. \qquad (15)$$
This is the same equation we had before (see eqn 1), so w1 and λ1 are an
eigenvector and eigenvalue of S respectively. Since w1T Sw1 = λ1 is what we
are trying to maximize we choose λ1 to be the largest eigenvalue of S and
w1 its associated eigenvector. Thus the best rank-1 approximation to X is
z1 w1T = Xw1 w1T .
The argument can be repeated to find the best rank-two approximation
$$X \approx z_1 w_1^T + z_2 w_2^T$$
using the constraints w_2^T w_2 = 1 and w_2^T w_1 = 0 and assuming that we have already estimated w_1 and z_1. It can be shown that w_2 is the eigenvector associated with the second largest eigenvalue of S.
3.8.1 Using PCA for compression
The data matrix X has a total of np elements, whereas the best rank-q approximation in equation 16 has q(n + p) elements. Also, the derivation assumed that X had been mean centred, so really we also need to store the column mean vector µ̂, which adds another p values.
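As a worked illustration: for the Single Cell dataset n = 300 and p = 8,686, so storing X directly requires np = 2,605,800 numbers. Keeping, say, q = 10 components (a value chosen here purely for illustration) would require q(n + p) + p = 10 × 8,986 + 8,686 = 98,546 numbers, roughly 4% of the original.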
4 Clustering Methods
We have seen that PCA can provide low-dimensional representations of a dataset that show groupings of observations when visualized. However, PCA does not involve a direct labelling of observations into different groups. Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a dataset.
[Figure: scatter plot of a simple simulated two-dimensional dataset used to illustrate clustering.]
A clustering of the n observations is a partition of the index set {1, . . . , n} into clusters C1, C2, . . . , CK satisfying:
1. C1 ∪ C2 ∪ · · · ∪ CK = {1, . . . , n}
2. Ci ∩ Cj = ∅ for all i ≠ j
We want to choose a clustering which has the property that the differences
between the observations within each cluster are as small as possible. If we
define W (Ck ) to be a measure of how different the observations are within
cluster k then we want to solve the problem
$$\min_{C_1, C_2, \ldots, C_K} \left\{ \sum_{k=1}^{K} W(C_k) \right\}$$
One option would be to search over all possible clusterings of the n obser-
vations. The number of possible clusterings becomes large very quickly. If
we let S(n, k) denote the number of ways to partition a set of n objects into
k non-empty subsets then
$$S(n, k) = \frac{1}{k!} \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} j^n$$
These numbers are known as Stirling numbers of the second kind and get
very large quickly. For example,
S(100, 4) = 66955751844038698560793085292692610900187911879206859351901
This clearly illustrates that it is very challenging to search over all these
possibilities. However, if we can find a clustering that is ”good” but not the
best, then this may be (and often is) good enough for our purposes. The
following algorithm can be shown to provide a good local optimum (i.e. a
pretty good solution).
K-means algorithm
1. Choose K.
2. Randomly assign each observation to one of the clusters 1, . . . , K.
3. Iterate the following 2 steps until the cluster assignments stop changing:
(a) For each cluster compute the cluster mean. The kth cluster mean, denoted µk, is the mean of all the x_i in cluster k, i.e.
$$\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i$$
(b) Assign each observation to the cluster whose cluster mean is closest (in Euclidean distance).
Figure 26 shows the results of running this algorithm on the simple example
with 3 clear clusters.
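A minimal sketch of running K-means in R with the built-in kmeans() function (the object name xx2 for the simulated points is an assumption based on the axis labels of the earlier figure):

    km <- kmeans(xx2, centers = 3)              # run K-means with K = 3
    km$cluster                                  # fitted cluster assignment of each observation
    km$centers                                  # the K cluster means
    plot(xx2, col = km$cluster)                 # points coloured by fitted cluster
    points(km$centers, pch = 17, col = "red")   # cluster means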
[Figure 26: iterations of the K-means algorithm on the example with 3 clear clusters, beginning with (a) randomly allocating points to 3 clusters and (b) calculating the cluster means.]
We call
$$\sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \mu_{kj})^2 \qquad (19)$$
the objective function of the K-means algorithm.
Now suppose that we have just updated the cluster assignments for the next iteration, i.e. we have determined what C^(t+1) is, and so the current value of the objective function (using C^(t+1) and µ_k^(t)) is
$$\sum_{k=1}^{K} \sum_{i \in C_k^{(t+1)}} \sum_{j=1}^{p} (x_{ij} - \mu_{kj}^{(t)})^2.$$
But since µ_k^(t) may not be the same as the cluster means for the assignments C^(t+1) (since the cluster assignments can change), when we update the cluster means to µ_k^(t+1) we must have that
$$\sum_{k=1}^{K} \sum_{i \in C_k^{(t+1)}} \sum_{j=1}^{p} (x_{ij} - \mu_{kj}^{(t+1)})^2 \;\le\; \sum_{k=1}^{K} \sum_{i \in C_k^{(t+1)}} \sum_{j=1}^{p} (x_{ij} - \mu_{kj}^{(t)})^2$$
using Equation 20. So updating the cluster means to µ_k^(t+1) never increases the objective function.
Similarly, in Step 3(b) the cluster means are fixed and we update the assignments. We can re-write the objective function in (19) as a sum over the n observations
$$\sum_{i=1}^{n} \sum_{j=1}^{p} (x_{ij} - \mu_{c(i)j})^2,$$
where c(i) denotes the cluster to which observation i is assigned. Assigning each observation to the cluster with the closest mean minimizes each term of this sum separately, so this step also never increases the objective function.
Multiple starts
The algorithm does not always give the same solution since the start point
is random. Figure 27 shows an example dataset simulated to have points in
5 clusters that are quite close together and so hard to separate. The true
cluster means are shown as red triangles in the plot.
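In practice the usual remedy is to run the algorithm from several random starts and keep the solution with the smallest objective function; kmeans() does this via its nstart argument. A sketch (the object name yy for this simulated dataset is an assumption based on the figure's axis labels):

    km1  <- kmeans(yy, centers = 5, nstart = 1)    # a single random start
    km25 <- kmeans(yy, centers = 5, nstart = 25)   # best of 25 random starts
    km1$tot.withinss                               # objective value from one start
    km25$tot.withinss                              # typically at least as small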
[Figure 27: simulated dataset with 5 overlapping clusters that are hard to separate; the true cluster means are shown as red triangles.]
Figure 29 shows the results of applying k-means to the Single Cell Genomics dataset seen at the start of these course notes. The dataset was first reduced down to its first 11 PCs and then k-means was run with K = 11. The Figure shows the k-means labelling of the points in the first 3 PC dimensions.
https://courses.maths.ox.ac.uk/course/view.php?id=620
Hierarchical clustering
Hierarchical clustering is an alternative approach that avoids having to
specify the number of clusters in advance. An added advantage is that the
method results in a tree-like (hierarchical) representation of the dataset, that
can be helpful when visualizing structure in the dataset, especially when the
data is high-dimensional, i.e. p is large.
Figure 29: 3D plot of 1st, 2nd and 3rd Principal Components for the Single
Cell Genomics dataset. Points are coloured according to a run of the k-
means algorithm with K = 11 working on the first 11 PCs from a PCA of
the dataset.
The agglomerative clustering algorithm proceeds as follows.
1. Start with each observation in its own cluster and compute the matrix of pairwise dissimilarities d_ij between observations,
$$D^{(n)} = \begin{pmatrix}
d_{21} & & & & \\
d_{31} & d_{32} & & & \\
\vdots & \vdots & \ddots & & \\
d_{i1} & d_{i2} & \cdots & d_{i(i-1)} & \\
\vdots & \vdots & & & \ddots \\
d_{n1} & d_{n2} & \cdots & d_{nj} & \cdots & d_{n(n-1)}
\end{pmatrix}$$
2. For i = n, n − 1, . . . , 2
(a) Find the pair of clusters with the smallest dissimilarity. Fuse
these two clusters.
(b) Compute the new dissimilarity matrix between the new fused
cluster and all other i − 1 remaining clusters and create an up-
dated matrix of dissimilarities D (n−1) .
Linkage methods
A key step in the algorithm is the choice of how to compute the new dis-
similarity (or distance) between the new fused cluster and all other i − 1
remaining clusters. There are several commonly used options, with differing
properties in terms of type of clusters they tend to produce. Let G and H be
two groups of observations then we can define the function d(G, H) between
groups G and H to be some function of all the pairwise dissimilarities dij
where i ∈ G and j ∈ H.
Single Linkage (SL) takes the intergroup distance to be the closest of all the pairs of observations between the two groups,
$$d_{SL}(G, H) = \min_{i \in G,\, j \in H} d_{ij}.$$
Group Average (GA) takes the intergroup distance to be the average of all the pairs of observations between the two groups,
$$d_{GA}(G, H) = \frac{1}{|G||H|} \sum_{i \in G} \sum_{j \in H} d_{ij}.$$
Next we merge clusters (1,2) and 5 into a new cluster (denoted (1,2,5)) and compute the new dissimilarities

    D(4) =
                (1,2,5)     3       4
          3       6.61
          4       5.02    2.06
          6       3.53    3.48    2.84

Next we merge clusters 3 and 4 into a new cluster (denoted (3,4)) and compute the new dissimilarities

    D(3) =
                  (1,2,5)   (3,4)
          (3,4)     5.02
          6         3.53    2.84

Next we merge clusters (3,4) and 6 into a new cluster (denoted (3,4,6)) and compute the new dissimilarities

    D(2) =
                    (1,2,5)
          (3,4,6)     3.53

Finally, we merge the two remaining clusters into a single cluster containing all the observations. Notice how the sequence of dissimilarities at which clusters merge (the smallest entry of each successive matrix: here 2.06, 2.84 and then 3.53) is an increasing sequence of values.
Dendrograms
The results of an agglomerative clustering of a dataset can be represented
as dendrogram, which is a tree-like diagram that allows us to visualize the
way in which the observations have been joined into clusters.
Figure 30: Small example of hierarchical clustering. The left plot shows the
raw dataset consisting of 6 observations in 2 dimensions. The right plot
shows the dendrogram of using agglomerative clustering with single linkage.
Differences between Linkage methods
Different linkage methods can lead to different dendrograms. Figure 31
shows a new dataset of 45 points, with 15 points each simulated from 3
different bivariate normal distributions with different means. The 3 sets of
points are coloured according to their true cluster. Points are also labelled
with a number.
[Figure 31: the 45 simulated points, coloured according to their true cluster and labelled with their observation numbers.]
Figure 32: Dendrograms using different Linkage methods applied to the dataset in Figure 31.
Figure 33 shows the results of building a dendrogram using Complete Linkage on the EU indicators dataset (see Table 2).
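A sketch of how such a dendrogram might be produced in R (the data frame name eu1 and the standardisation with scale() are assumptions suggested by the axis annotation of the original figure):

    d  <- dist(scale(eu1))                 # Euclidean distances between standardised countries
    hc <- hclust(d, method = "complete")   # complete linkage agglomerative clustering
    plot(hc)                               # dendrogram as in Figure 33
    cutree(hc, k = 3)                      # cut the tree into, say, 3 clusters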
[Figure 33: Cluster dendrogram from complete linkage hierarchical clustering of the EU indicators dataset (variables standardised before computing Euclidean distances).]
[Figure 34: panels for Single Linkage, Complete Linkage and Average Linkage.]
Figure 34: (Top row) Dendrograms using different Linkage methods applied
to the dataset in Figure 31. Each dendrogram has been cut to produce 3
clusters. (Bottom row) The corresponding clustering of points in the original
2D space produced by cutting the dendrogram to have 3 clusters.