Notes For Lectures 11 To 16 - 2024
Christl Donnelly
With grateful acknowledgement to Jonathan Marchini and Dino Sejdinovic
1 Introduction
The first 10 lectures of the Prelims Statistics and Data Analysis course in-
troduced the concept of likelihood for a probabilistic model, leading up to
linear regression with several explanatory variables (multiple linear regres-
sion).
Interest lies in identifying which of the Xi ’s are important parts of the model,
and also building a model that is able to make accurate predictions of Y
using X1 , . . . , Xp .
1. Can we find a way to visualize the data that is informative?
https://www.youtube.com/watch?v=SUbqykXVx0A
Massive amounts of data are being collected in almost all walks of life. Fi-
nancial institutions, businesses, governments, hospitals, and universities are
all interested in utilizing and making sense of data they collect.
1.2 Motivating examples
1.2.1 Single cell genomics
In 2014 a study reported in the journal Nature Biotechnology (Pollen et al., Nature Biotechnology 32, 1053-1058, https://www.nature.com/articles/nbt.2967) collected gene expression measurements at 8,686 genes from 300 cells. One way to think about this dataset is that they measured how ‘active’ each gene was in each cell.
Figure 1 shows an image of the dataset where each row is one of the cells and each column is one of the genes. Viewing the data in this way it is very difficult to see any structure in the dataset. In this course we will learn about the method of Principal Components Analysis (which builds on Prelims Linear Algebra), which will allow us to find low-dimensional representations of the dataset that uncover underlying structure. For example, Figure 2 shows the “best” 2D representation of the data, and Figure 3 shows the “best” 3D representation of the data (see also the file movie.gif on the course website
https://courses.maths.ox.ac.uk/course/view.php?id=620
for a rotating version of Figure 4). What we mean by “best” here will be defined more precisely later in the course. Both of these images show structure within the dataset: some samples appear to cluster together in clear groups.
Figure 1: Single Cell dataset : gene expression measurements on 300 cells
(rows) at 8,686 genes (columns).
Figure 2: Plot of 1st and 2nd Principal Components for the Single Cell
Genomics dataset.
Figure 3: 3D plot of 1st, 2nd and 3rd Principal Components for the Single
Cell Genomics dataset.
Figure 4: 3D plot of 1st, 2nd and 3rd Principal Components for the Single
Cell Genomics dataset, with colouring given by the k-means clustering.
1.2.2 Food consumption
Consider the following DEFRA data showing the consumption in grams (per
person, per week) of 17 different types of foodstuff measured and averaged
in the four nations of the United Kingdom in 1997. We shall say that the 17
food types are the variables and the 4 nations are the observations. Looking
at the data in Table 1, it is hard to spot obvious patterns.
Figure 5 shows a 2D projection of the data (i.e. the first and second principal components), and we see that Northern Ireland is a major outlier. Once we go back and look at the data in the table, this makes sense: the Northern Irish eat many more grams of fresh potatoes and far fewer grams of fresh fruit, cheese, fish and alcoholic drinks.
Table 1: DEFRA data showing the consumption in grams (per person, per
week) of 17 different types of foodstuff measured and averaged in the four
nations of the United Kingdom in 1997.
England Wales Scotland N.Ireland
Cheese 105 103 103 66
Carcass meat 245 227 242 267
Other meat 685 803 750 586
Fish 147 160 122 93
Fats and oils 193 235 184 209
Sugars 156 175 147 139
Fresh potatoes 720 874 566 1033
Fresh Veg 253 265 171 143
Other Veg 488 570 418 355
Processed potatoes 198 203 220 187
Processed Veg 360 365 337 334
Fresh fruit 1102 1137 957 674
Cereals 1472 1582 1462 1494
Beverages 57 73 53 47
Soft drinks 1374 1256 1572 1506
Alcoholic drinks 375 475 458 135
Confectionery 54 64 62 41
Figure 5: Plot of 1st and 2nd Principal Components for UK food dataset.
1.2.3 Finding structure in genetic datasets
Novembre et al. (Nature 2008, https://www.nature.com/articles/nature07331) analysed genetic data from 3,000 individuals at ∼500,000 positions in the genome. The individuals were collected from different countries around Europe as part of the Population Reference Sample (POPRES) project. Before the study it was
“not clear to what extent populations within continental regions exist as dis-
crete genetic clusters versus as a genetic continuum, nor how precisely one
can assign an individual to a geographic location on the basis of their genetic
information alone.”
The question of interest was to see how much structure was present in the
dataset and to assess the similarities and differences between individuals
from different populations. Figure 6 shows the 2D projection of the dataset.
Each point on the plot is an individual and points are coloured in the plot
according to which country the individual comes from. What is striking
about this plot is how well the arrangement of points corresponds to the
geographic locations of the samples. The plot was constructed just using
the genetic data, without any knowledge of the geographic locations, yet the
plot is able to uncover a map of where samples came from.
We will refer to X as the data matrix. We use bold upper case for matrices
in these notes.
Figure 6: Plot of 1st and 2nd Principal Components for the POPRES dataset.
2 Exploratory data analysis and visualizing datasets
A key first step in many data analysis tasks is to carry out an exploratory data analysis. If the dataset is stored in an n × p data matrix X then we can look at data summaries and plots of various aspects of the data to help us uncover the properties of the dataset. To illustrate some simple plot types we will use the ‘famous’ Crabs dataset.
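As a minimal sketch of how such an exploratory analysis might be started in R (the crabs data frame shipped with the MASS package is similar to, though not necessarily identical to, the dataset used in these notes; the object name crabs_meas below is an assumption and is reused in later sketches):

    library(MASS)                                            # provides a version of the crabs data
    crabs_meas <- crabs[, c("FL", "RW", "CL", "CW", "BD")]   # the five morphological measurements
    summary(crabs_meas)                                      # numerical summaries of each variable
    boxplot(crabs_meas)                                      # box plots, as in Figure 9
    pairs(crabs_meas)                                        # pairwise scatter plots, as in Figure 10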
2.2 Histograms
A histogram is one of the simplest ways of visualizing the data from a single
variable. The range of the variable is divided into bins and the frequency of
observations in each bin is plotted as a bar with height proportional to the
frequency. Figure 7 shows histograms of the five crab measurements.
2.3 Boxplots
A Box Plot (sometimes called a Box-and-Whisker Plot) is a relatively sophisticated plot that summarises the distribution of a given variable. These plots are built from a small set of summary statistics, defined below.
[Figure 7: histograms of the five crab measurements FL (Frontal Lobe Size, mm), RW (Rear Width, mm), CL (Carapace Length, mm), CW and BD, with frequency on the vertical axis.]
Median - the ‘middle’ value i.e. the value for which 50% of the data fall
below when arranged in numerical order.
1st quartile - the 25% value i.e. the value for which 25% of the data fall
below when arranged in numerical order.
3rd quartile - the 75% value i.e. the value for which 75% of the data fall
below when arranged in numerical order.
Inter Quartile Range (IQR) - the difference between the 1st and 3rd
quartiles. This is a measure of ‘spread’ within the dataset.
The plot itself then consists of the following components:
1. A box that covers the middle 50% of the data, i.e. the IQR. The edges of the box are the 1st and 3rd quartiles. A line is drawn in the box at the median value.
2. Whiskers that extend out from the box to indicate how far the data extend either side of the box. The whiskers should extend no further than α times the length of the box, i.e. the maximum length of a whisker is α times the IQR. A commonly used value of α is 1.5.
3. All points that lie outside the whiskers are plotted individually as outlying observations.
Figure 8: A Box Plot labelled with the main features of the plot.
Figure 9: Box plots of the Crabs dataset showing similarities and differences
between the 5 variables.
[Figure 10: pairs plot of the five Crabs variables FL, RW, CL, CW and BD, showing all pairwise scatter plots.]
Figure 11: Screen shot from 3D interactive plotting of the Crabs variables
CW, RW and CL
2.5 The Multivariate Normal Distribution
To build good models of datasets consisting of several measured variables
we need probability models of multiple variables. We have seen this type of
model briefly in Prelims Probability in the section on Joint Distributions.
One of the simplest and most widely used models for multiple continuous
random variables is the multivariate normal distribution (MVN).
We have seen before that the univariate normal distribution has two scalar
parameters µ and σ 2 , and we use the notation X ∼ N (µ, σ 2 ). The parameter
µ denotes the mean of the distribution and σ 2 denotes the variance. The
pdf of the univariate normal distribution is
$$f(x) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right\}, \qquad -\infty < x < \infty.$$
The multivariate normal distribution is the generalization of the univariate
normal distribution to higher dimensions. The MVN is a distribution for
a p-dimensional random column vector X = (X1 , . . . , Xp )T that allows for
non-zero correlations to exist between the elements of the vector. As such,
it can be a useful distribution for modelling data that consist of multiple
variables measured on the same items or observations that may (or may not)
be correlated.
We use the notation X ∼ Np (µ, Σ). When p = 1 this reduces to the univari-
ate normal distribution. The multivariate normal distribution is sometimes
referred to as the multivariate Gaussian distribution.
The parameter µ is a p × 1 mean vector and Σ is a p × p covariance matrix, with
E[X_i] = µ_i for i = 1, . . . , p,
var(X_i) = Σ_ii for i = 1, . . . , p,
cov(X_i, X_j) = Σ_ij for i ≠ j.
$$\mathrm{cor}(X_i, X_j) = \frac{\mathrm{cov}(X_i, X_j)}{\sqrt{\mathrm{var}(X_i)\,\mathrm{var}(X_j)}} \in [-1, 1],$$
so if X ∼ N_p(µ, Σ) then
$$\mathrm{cor}(X_i, X_j) = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\,\Sigma_{jj}}}.$$
Figure 12 shows the density of a bivariate (p = 2) normal distribution with correlation ρ = 0.7, and Figure 13 shows contour plots for several bivariate normal densities with different values of ρ. Figure 14 shows a 2D density together with a sample of 200 data points simulated from the same distribution.
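A minimal sketch of how a sample like the one in Figure 14 could be simulated in R, assuming the MASS package is available for its mvrnorm() function:

    library(MASS)                                       # for mvrnorm()
    mu    <- c(0, 0)
    Sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2)        # Sigma_11 = Sigma_22 = 1, Sigma_12 = 0.7
    x     <- mvrnorm(n = 200, mu = mu, Sigma = Sigma)   # 200 points from N_2(mu, Sigma)
    plot(x, xlab = "x1", ylab = "x2")                   # scatter of the simulated sample
    cor(x)[1, 2]                                        # sample correlation, close to 0.7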
Figure 12: 2D multivariate normal density for µ = (0, 0)T , Σ11 = Σ22 = 1
and Σ12 = 0.7
[Figure 13: contour plots of bivariate normal densities for several values of ρ, including ρ = 0, ρ = −0.2 and ρ = −0.7.]
[Figure 14: contours of a bivariate normal density with ρ = 0.7, together with a sample of 200 simulated data points.]
2.5.2 Estimating parameters for the multivariate normal distribution
Often given a sample of real data, we will want to find the MVN that best
fits the data. Given a sample of n observations from a Np (µ, Σ) distribution,
it can be shown (see Exercises) that the maximum likelihood estimates of µ
and Σ are
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
and
$$\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})^T.$$
Note that (x_i − µ̂) is a p × 1 column vector and (x_i − µ̂)^T is a 1 × p row vector, so Σ̂ is a p × p matrix.
Note that we don't have to assume that the data are generated from an MVN distribution in order to calculate the sample covariance matrix S. It is a useful summary of the pairwise covariances in any collection of variables. On the Crabs data the sample covariance matrix is
S =
          FL      RW      CL      CW      BD
    FL  12.21    8.15   24.35   26.55   11.82
    RW   8.15    6.62   16.35   18.23    7.83
    CL  24.35   16.35   50.67   55.76   23.97
    CW  26.55   18.23   55.76   61.96   26.09
    BD  11.82    7.83   23.97   26.09   11.72
Sometimes it is useful to work with the sample correlation matrix, de-
noted R, which has entries
$$R_{ij} = \frac{S_{ij}}{\sqrt{S_{ii}\,S_{jj}}}$$
R =
          FL      RW      CL      CW      BD
    FL   1.00    0.91    0.98    0.96    0.99
    RW   0.91    1.00    0.89    0.90    0.89
    CL   0.98    0.89    1.00    1.00    0.98
    CW   0.96    0.90    1.00    1.00    0.97
    BD   0.99    0.89    0.98    0.97    1.00
The high levels of correlation between all pairs of variables can be seen
visually in Figure 10.
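A sketch of how S and R can be computed in R (reusing the crabs_meas data frame assumed in the earlier sketch; note that R's cov() and cor() divide by n − 1 rather than n, which does not change the qualitative picture):

    S <- cov(crabs_meas)   # sample covariance matrix (divisor n - 1)
    R <- cor(crabs_meas)   # sample correlation matrix
    round(S, 2)
    round(R, 2)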
If X ∼ N_p(µ, Σ) and B is an m × p matrix of constants, then Y = BX satisfies
E[Y] = Bµ,    cov(Y) = BΣB^T.
A further useful result (which we will not prove in this course since it is
covered in Part A) is that the distribution of Y = BX is normal and given
by
Y ∼ Nm (Bµ, BΣB T )
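This result can be checked empirically by simulation; the matrix B below is an arbitrary choice made purely for illustration:

    library(MASS)
    mu    <- c(1, 2, 3)
    Sigma <- diag(c(2, 1, 0.5))                           # a simple 3 x 3 covariance matrix
    B     <- matrix(c(1, 0, 1,
                      0, 1, -1), nrow = 2, byrow = TRUE)  # an arbitrary 2 x 3 matrix
    X <- mvrnorm(n = 100000, mu = mu, Sigma = Sigma)      # rows are draws of X
    Y <- X %*% t(B)                                       # rows are y_i = B x_i
    colMeans(Y)                                           # approximately B mu
    cov(Y)                                                # approximately B Sigma B^T
    B %*% Sigma %*% t(B)                                  # the theoretical value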
3 Principal Components Analysis (PCA)
If we have a large number p of variables collected on a set of n observations it can be hard to visualize and summarize the dataset. PCA addresses this by finding a small number of linear combinations of the variables that retain as much of the variation in the data as possible.
Figure 15 shows a simple two-dimensional example with two clusters of points and two candidate (orthogonal) projection directions, labelled A and B. Choosing the direction along which the projected points have the largest variance turns out to be a good way to choose a projection that separates the two clusters.
Another way to think about this is that we have found a rotation of the
data points to maximize the variance. A rotation is a set of orthogonal
projections. Figure 16 shows the data points before rotation (left) and after
rotation (right). The points in the two clusters are well separated on the
new x-axis.
[Figure 15: scatter of data points together with two orthogonal projection directions A and B; the variances of the projected points are σ² = 12.98 and σ² = 1.80.]
Figure 15: Example showing the variance of points projected on two (or-
thogonal) projections (A) and (B). The projection (A) which separates the
points well has the highest variance.
Figure 16: The data points before rotation (left, "Raw data") and after rotation to the principal component axes (right, "Data rotated to Principal Components"). The two clusters are well separated along the first principal component.
We can write
Z1 = α1^T X
where α1 = (α11, α12, . . . , α1p)^T and X = (X1, . . . , Xp)^T are both column vectors. Then it can be shown (Exercise Sheet 1) that
1. The sample variance of Z1 is α1^T S α1, where S is the sample covariance matrix of the data.
2. This variance is unbounded as the αi ’s increase, so we need to add a
constraint on α such that
$$\sum_{j=1}^{p} \alpha_{j1}^2 = \alpha_1^T \alpha_1 = 1.$$
In other words, we try to maximize the sample variance of the first prin-
cipal component (α1T Sα1 ) subject to the constraint α1T α1 = 1. We can solve
this using Lagrange Multipliers (see Prelims Introductory Calculus course
for a reminder on this technique).
Let
$$L(\alpha_1, \lambda_1) = \alpha_1^T S \alpha_1 - \lambda_1(\alpha_1^T \alpha_1 - 1).$$
We then need the vector of partial derivatives of L with respect to the vector α1. This is the gradient vector ∇L seen in Introductory Calculus. Two results are useful here.
1. Let y = a^T x = Σ_{i=1}^p a_i x_i, then
$$\nabla y = \left(\frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}, \ldots, \frac{\partial y}{\partial x_p}\right)^T = (a_1, a_2, \ldots, a_p)^T = a.$$
2. Let z = x^T B x = Σ_{i=1}^p Σ_{j=1}^p B_ij x_i x_j with B symmetric, then consider
$$\nabla z = \left(\frac{\partial z}{\partial x_1}, \frac{\partial z}{\partial x_2}, \ldots, \frac{\partial z}{\partial x_p}\right)^T.$$
Since ∂z/∂x_i = 2 Σ_{j=1}^p B_ij x_j = 2(Bx)_i we have that
$$\nabla z = 2Bx.$$
Re-writing, we have
$$L(\alpha_1, \lambda_1) = \alpha_1^T S \alpha_1 - \lambda_1 \alpha_1^T I_p \alpha_1 + \lambda_1.$$
Using the two results above, taking the gradient with respect to α1 and setting it equal to zero gives 2Sα1 − 2λ1α1 = 0, that is
$$S\alpha_1 = \lambda_1 \alpha_1. \qquad (1)$$
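Numerically, equation (1) says that α1 is an eigenvector of S. A minimal check in R, again assuming the crabs_meas data frame from the earlier sketch:

    S       <- cov(crabs_meas)
    e       <- eigen(S)              # eigendecomposition S = V D V^T
    alpha1  <- e$vectors[, 1]        # loading vector of the 1st principal component
    lambda1 <- e$values[1]           # its variance (the largest eigenvalue)
    max(abs(S %*% alpha1 - lambda1 * alpha1))   # close to zero: S alpha1 = lambda1 alpha1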
Recap of results from Linear Algebra II
Let V be a vector space over R and T : V → V be a linear transformation.
4. A real symmetric matrix A ∈ M_n(R) has real eigenvalues and there exists an orthonormal basis of R^n consisting of eigenvectors of A. In other words, there exists a real orthogonal matrix V (its columns are orthonormal, so V^T = V^{-1}) such that
A = VDV^T
where D is a diagonal matrix of eigenvalues.
and this leads to another Lagrange multipliers problem, where we seek to maximize
$$L(\alpha_2, \lambda_2, m) = \alpha_2^T S \alpha_2 - \lambda_2(\alpha_2^T \alpha_2 - 1) - m\,\alpha_2^T \alpha_1,$$
the constraints now being α2^T α2 = 1 and α2^T α1 = 0. Setting the gradient with respect to α2 equal to zero gives
$$2S\alpha_2 - 2\lambda_2 \alpha_2 - m\,\alpha_1 = 0. \qquad (3)$$
Pre-multiplying (3) by α1^T gives 2α1^T S α2 − m = 0, and α1^T S α2 = α2^T S α1 = λ1 α2^T α1 = 0,
since α2^T S α1 is a scalar and S is symmetric, which implies that m = 0 and equation (3) reduces to the eigenvalue equation
$$S\alpha_2 = \lambda_2 \alpha_2. \qquad (5)$$
The above process can be continued for the other principal components. This
results in a sequence of principal components ordered by their variance. In
other words, if we consider the eigenvalue decomposition of S (see Linear
Algebra recap)
S = VDVT
where D is a diagonal matrix of eigenvalues ordered in decreasing value,
and V is a p × p matrix of the corresponding eigenvectors as orthonormal
columns (v1, . . . , vp), then we have that the ith principal component loading vector is α_i = v_i, with sample variance given by the ith largest eigenvalue D_ii.
3.3 Plotting the principal components
If X is the n × p data matrix of the n observations on p variables, S is the p × p sample covariance matrix, with ordered eigendecomposition S = VDV^T,
then the data can be transformed to the p principal component directions.
If we define Z to be an n × p matrix containing the transformed data, such
that Zij is the value of the jth principal component for the ith observation
then we have
$$Z_{ij} = \sum_{k=1}^{p} X_{ik} V_{kj}.$$
In other words, we take the inner product of the ith row of X and the jth column of V. The matrix V is known as the loadings matrix. In matrix
notation we can write this as
Z = XV
We can then plot the columns of Z against each other to visualize the data as
represented by those pairs of principal components. The matrix Z is known
as the scores matrix. The columns of this matrix contain the projections
onto the principal components.
Figure 17 shows a pairs plot of the 5 PCs for the Crabs dataset. The points
have been coloured according to the 4 groups Blue Male (dark blue), Blue
Female (light blue), Orange Male (orange), Orange Female (yellow). From
this plot we can see that PCs 2 and 3 seem to show good discrimination of
these 4 groups. This is shown more clearly in Figure 18.
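A sketch of how the scores and loadings can be obtained in R with prcomp(); the group labels constructed below from the sp and sex columns of the MASS crabs data frame are an assumption about how the colouring of Figures 17 and 18 was produced:

    pca <- prcomp(crabs_meas)                   # PCA of the covariance matrix (columns are centred)
    V   <- pca$rotation                         # loadings matrix
    Z   <- pca$x                                # scores matrix, i.e. the centred data times V
    grp <- factor(paste(crabs$sp, crabs$sex))   # four groups: blue/orange x male/female
    plot(Z[, 2], Z[, 3], col = grp, xlab = "PC2", ylab = "PC3")   # compare with Figure 18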
[Figure 17: pairs plot of the five principal components (Comp.1 to Comp.5) of the Crabs dataset, with points coloured by the four groups BM, BF, OM and OF.]
[Figure 18: plot of PC2 against PC3 for the Crabs dataset, with points coloured by the four groups BM, BF, OM and OF.]
3.4 Biplots
The columns of the loadings matrix V contain the linear projections for each of the PCs. It can be interesting to look at these loadings to see which variables contribute to each PC. For example, the loadings matrix for the Crabs dataset is as follows
          PC1     PC2     PC3     PC4     PC5
    FL   0.28    0.32   -0.50    0.73    0.12
    RW   0.19    0.86    0.41   -0.14   -0.14
    CL   0.59   -0.19   -0.17   -0.14   -0.74
    CW   0.66   -0.28    0.49    0.12    0.47
    BD   0.28    0.15   -0.54   -0.63    0.43
So, for example, this means that the first, second and third PCs are
Z1 = 0.28 FL + 0.19 RW + 0.59 CL + 0.66 CW + 0.28 BD
Z2 = 0.32 FL + 0.86 RW − 0.19 CL − 0.28 CW + 0.15 BD
Z3 = −0.50 FL + 0.41 RW − 0.17 CL + 0.49 CW − 0.54 BD
Notice how the loadings for the 1st PC are all positive. This is quite usual, especially when the units of observation are biological samples (such as crabs), where the 1st PC is essentially a linear combination of features that measures overall size. Often this component will account for a large amount of variance, but it tends not to be able to split samples into distinct groups. It is often the PCs with a mixture of positive and negative loadings that provide the contrast that separates groups.
For example, PC2 is a contrast between the variables FL, RW, BD (positive loadings) and CL, CW (negative loadings), whereas PC3 is a contrast between variables CW, RW and CL, FL, BD. Also, PC2 separates the Orange Females from the Blue Males well, whereas PC3 separates the Blue Females from the Orange Males well.
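A biplot such as Figure 19 overlays the scores (points) and the loadings (arrows); a minimal sketch using base R, with the same assumed objects as in the earlier sketches:

    pca <- prcomp(crabs_meas)
    biplot(pca, choices = c(2, 3))   # scores and variable arrows for PC2 and PC3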
[Figure 19: biplot of PC2 (0.9% explained var.) against PC3 (0.7% explained var.) for the Crabs dataset, showing the loading directions of FL, RW, CL, CW and BD, with points coloured by the groups BF, BM, OF and OM.]
3.5 Variance decomposition and eigenspectrum
The total amount of variance in the original data matrix X is the sum of the diagonal entries in S. In other words, we have that $\widehat{\mathrm{var}}(X_i) = S_{ii}$ so that
$$\text{Total variance} = \sum_{i=1}^{p} S_{ii} = \mathrm{tr}(S),$$
but since tr(AB) = tr(BA) where A and B are two p × p matrices, we have that
$$\mathrm{tr}(S) = \mathrm{tr}(VDV^T) = \mathrm{tr}(DV^TV) = \mathrm{tr}(D) = \sum_{i=1}^{p} D_{ii},$$
i.e. the total variance is the sum of the eigenvalues of S.
So we can see that the eigenvalues tell us interesting information about how
important each component is within the dataset. It is common practice to
plot the decreasing sequence of eigenvalues to visualize the structure in the
dataset. Such plots are sometimes referred to as eigenspectrum plots or
variance scree plots, and usually they are scaled so that each bar is a percentage of the total variance. That is, we plot
$$\frac{100\,D_{ii}}{\mathrm{tr}(D)} \qquad \text{for } i = 1, \ldots, p.$$
Figure 20 shows the scree plot for a simulated dataset with 3 clear clusters.
The 1st PC (middle) clearly separates the groups and the scree plot (right)
shows that the 1st PC accounts for almost all the variance in the dataset.
Figure 21 shows the scree plot for a simulated dataset with 4 clear clusters.
The 1st and 2nd PCs (middle) clearly separate the groups and the scree plot
(right) shows that the 1st and 2nd PCs account for almost all the variance
in the dataset.
The scree plot is sometimes used to decide on a set of PCs to carry forward
for further analysis. For example, we might choose to take the PCs that
account for the top θ% of the variance. However, care is sometimes needed
here. Figure 22 shows the scree plot for the Crabs dataset and shows that
the 1st PC accounts for almost all the variance. As discussed before, this
PC is most likely measuring the size of the crabs.
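A sketch of how the scree plot values 100 D_ii / tr(D) can be computed in R, reusing the assumed crabs_meas data frame:

    pca   <- prcomp(crabs_meas)
    evals <- pca$sdev^2                        # eigenvalues D_11, ..., D_pp
    pct   <- 100 * evals / sum(evals)          # percentage of total variance per PC
    barplot(pct, names.arg = paste0("PC", seq_along(pct)), ylab = "% Variance")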
Figure 20: Example with 3 clear clusters (left). The 1st PC (middle) clearly separates the groups. The scree plot (right) shows that the 1st PC accounts for almost all the variance (98.24%) in the dataset.
Figure 21: Example with 4 clear clusters (left). The 1st and 2nd PCs (middle) clearly separate the groups. The scree plot (right) shows that the 1st and 2nd PCs account for almost all the variance in the dataset.
[Figure 22: variance scree plot for the Crabs dataset. The 1st PC accounts for almost all of the variance, even though it is the 2nd and 3rd PCs that are most useful in separating the Orange/Blue and Male/Female groupings of crabs.]
3.6 Using the covariance matrix or the correlation matrix
A key practical problem when applying PCA is deciding exactly what data
the method should be applied to. Should we apply the method to the raw
data, or should we first transform it in some way? You should always ask
yourself this question when starting out on a new data analysis.
Remember that
$$\mathrm{cor}(X_i, X_j) = \frac{\mathrm{cov}(X_i, X_j)}{\sqrt{\mathrm{var}(X_i)\,\mathrm{var}(X_j)}} \in [-1, 1].$$
This means that the relationship between R and S is
$$R_{ij} = \frac{S_{ij}}{\sqrt{S_{ii}\,S_{jj}}}.$$
Let W be a diagonal matrix with entries W_ii = S_ii for i = 1, . . . , p; in other words, W is the same as S but with all off-diagonal entries set to 0. Then in matrix notation we can write
$$R = W^{-1/2} S W^{-1/2}.$$
It can be shown (see Exercises) that the PCA components derived from
using S are not the same as those derived from using R, and knowledge of
one of these sets of components does not enable the other set to be derived.
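In R the two choices correspond to the scale. argument of prcomp(); a sketch, assuming the indicators of Table 2 are stored in a data frame eu1 with countries as row names:

    pca_cov <- prcomp(eu1, scale. = FALSE)   # PCA of the covariance matrix S
    pca_cor <- prcomp(eu1, scale. = TRUE)    # PCA of the correlation matrix R
    pca_cov$rotation[, 1:2]                  # loadings dominated by the high-variance variables
    pca_cor$rotation[, 1:2]                  # loadings after putting variables on a common scale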
Table 2: CPI, consumer price index (index = 100 in 2005); UNE, unemployment rate in the 15-64 age group; INP, industrial production (index = 100 in 2005); BOP, balance of payments (€/capita); PRC, private final consumption expenditure (€/capita); UN%, annual change in unemployment rate.
Country CPI UNE INP BOP PRC UN%
Belgium 116.03 4.77 125.59 908.60 6716.50 -1.60
Bulgaria 141.20 7.31 102.39 27.80 1094.70 3.50
CzechRep. 116.20 4.88 119.01 -277.90 2616.40 -0.60
Denmark 114.20 6.03 88.20 1156.40 7992.40 0.50
Germany 111.60 4.63 111.30 499.40 6774.60 -1.30
Estonia 135.08 9.71 111.50 153.40 2194.10 -7.70
Ireland 106.80 10.20 111.20 -166.50 6525.10 2.00
Greece 122.83 11.30 78.22 -764.10 5620.10 6.40
Spain 116.97 15.79 83.44 -280.80 4955.80 0.70
France 111.55 6.77 92.60 -337.10 6828.50 -0.90
Italy 115.00 5.05 87.80 -366.20 5996.60 -0.50
Cyprus 116.44 5.14 86.91 -1090.60 5310.30 -0.40
Latvia 144.47 12.11 110.39 42.30 1968.30 -3.60
Lithuania 135.08 11.47 114.50 -77.40 2130.60 -4.30
Luxembourg 118.19 3.14 85.51 2016.50 10051.60 -3.00
Hungary 134.66 6.77 115.10 156.20 1954.80 -0.10
Malta 117.65 4.15 101.65 359.40 3378.30 -0.60
Netherlands 111.17 3.23 103.80 1156.60 6046.00 -0.40
Austria 114.10 2.99 116.80 87.80 7045.50 -1.50
Poland 119.90 6.28 146.70 -74.80 2124.20 -1.00
Portugal 113.06 9.68 89.30 -613.40 4073.60 0.80
Romania 142.34 4.76 131.80 -128.70 1302.20 3.20
Slovenia 118.33 5.56 105.40 39.40 3528.30 1.80
Slovakia 117.17 9.19 156.30 16.00 2515.30 -2.10
Finland 114.60 5.92 101.00 -503.70 7198.80 -1.30
Sweden 112.71 6.10 100.50 1079.10 7476.70 -2.30
UnitedKingdom 120.90 6.11 90.36 -24.30 6843.90 -0.80
Variance 111.66 9.95 357.27 450057.15 5992520.48 7.12
Figure 23 shows plots of the 1st and 2nd PCs for the EU indicators dataset. The left plot used the covariance matrix S; the right plot used the correlation matrix R. Points are labelled with the abbreviated country name. There is a clear difference between the two.
When using the covariance matrix S, the loadings of the 1st and 2nd PCs are dominated by the variables BOP and PRC; these are the variables with by far the largest variances (see the final row of Table 2). When using the correlation matrix R, all variables are put on a comparable scale and the loadings are spread more evenly across the variables.
Figure 23: Plots of the 1st and 2nd PCs for the EU indicators dataset.
(Left) Using the covariance matrix S. (Right) Using the correlation matrix
R. Points are labelled with the abbreviated country name.
3.7 PCA via the Singular Value Decomposition
The Crabs data has the property that the number of observations n = 200 is much larger than the number of variables p = 5. This is not always the case for datasets that we might work with. For example, earlier we saw that the Single Cell dataset had n = 300 and p = 8,686, the UK Foods dataset had n = 4 and p = 17, and the POPRES dataset of human genetic data had n = 3,000 and p = 500,000. A key step in carrying out PCA is calculating an eigendecomposition of the p × p sample covariance matrix S, which can be very expensive when p is large.
The singular value decomposition (SVD) provides a way around this. It allows us to write the n × p matrix X as
$$X = P \Lambda Q^T$$
where P is an n × n orthogonal matrix, Q is a p × p orthogonal matrix, and Λ is an n × p matrix whose only non-zero entries are non-negative values on its leading diagonal (the singular values).
NOTE: This theorem is given without proof and Prelims students are not expected to be able to prove it.
Note that Λ^T Λ is a p × p diagonal matrix whose entries are the squares of the diagonal entries of Λ. This has the same form as the eigendecomposition of S = VDV^T, with V = Q and D = (1/(n−1)) Λ^T Λ.
Therefore
Z = XV = XQ = P Λ
which implies that we need to calculate P and Λ. This can be achieved
using the eigendecomposition of the n × n matrix XX^T, since
$$XX^T = P\Lambda Q^T Q \Lambda^T P^T = P(\Lambda\Lambda^T)P^T,$$
so the columns of P are eigenvectors of XX^T and the diagonal entries of ΛΛ^T are the corresponding eigenvalues. When n is much smaller than p this is a far smaller problem than an eigendecomposition of the p × p matrix S.
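A sketch of PCA via the SVD in R for the wide-data case (X an assumed n × p numeric matrix with p much larger than n); the divisor n − 1 below matches R's cov() rather than the 1/n convention used for the maximum likelihood estimate:

    Xc <- scale(X, center = TRUE, scale = FALSE)   # mean-centre the columns
    s  <- svd(Xc)                                  # Xc = U diag(d) V^T (U plays the role of P, V of Q)
    Z  <- s$u %*% diag(s$d)                        # principal component scores (at most n columns)
    evals <- s$d^2 / (nrow(Xc) - 1)                # variances of the principal components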
A different way to think about PCA is as finding the best rank-one approximation X ≈ z1 w1^T, where z1 is an n × 1 column vector and w1 is a p × 1 column vector. We choose z1 and w1 to minimize the average reconstruction error
$$J(z_1, w_1) = \frac{1}{n}\sum_{i=1}^{n} (x_i - z_{i1} w_1)^T (x_i - z_{i1} w_1) \qquad (10)$$
$$= \frac{1}{n}\sum_{i=1}^{n} \left[ x_i^T x_i - 2 z_{i1} w_1^T x_i + w_1^T z_{i1} z_{i1} w_1 \right] \qquad (11)$$
Schematically, the n × p matrix X is approximated by the product of an n × 1 column vector z1 and a 1 × p row vector w1^T.
$$= \frac{1}{n}\sum_{i=1}^{n} \left[ x_i^T x_i - 2 z_{i1} w_1^T x_i + z_{i1}^2 (w_1^T w_1) \right] \qquad (12)$$
Since we can arbitrarily scale z1 and w1 so that the product X̃ = z1 w1^T stays the same, we can add the constraint that w1^T w1 = 1. Taking derivatives with respect to z_{i1} and equating to zero gives
$$\frac{\partial}{\partial z_{i1}} J(z_1, w_1) = \frac{1}{n}\left[-2 w_1^T x_i + 2 z_{i1}\right] = 0 \;\Rightarrow\; z_{i1} = w_1^T x_i = x_i^T w_1,$$
or in matrix form
$$z_1 = X w_1.$$
Plugging this back in gives
$$J(w_1) = \frac{1}{n}\sum_{i=1}^{n} x_i^T x_i - \frac{1}{n}\sum_{i=1}^{n} z_{i1}^2$$
$$= \frac{1}{n}\sum_{i=1}^{n} x_i^T x_i - \frac{1}{n}\sum_{i=1}^{n} (w_1^T x_i)(w_1^T x_i)^T$$
$$= \text{constant} - w_1^T \left(\frac{1}{n}\sum_{i=1}^{n} x_i x_i^T\right) w_1.$$
Now if we assume that the columns of X have been mean centered then
$$\hat{\Sigma} = \frac{1}{n} X^T X = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^T,$$
which gives
$$J(w_1) = \text{constant} - w_1^T \hat{\Sigma} w_1.$$
Minimizing J(w_1) is therefore equivalent to maximizing w_1^T S w_1 (writing S for the sample covariance matrix) subject to the constraint w_1^T w_1 = 1, and using a Lagrange multiplier exactly as before gives
$$S w_1 = \lambda_1 w_1. \qquad (15)$$
This is the same equation we had before (see eqn 1), so w1 and λ1 are an
eigenvector and eigenvalue of S respectively. Since w1T Sw1 = λ1 is what we
are trying to maximize we choose λ1 to be the largest eigenvalue of S and
w1 its associated eigenvector. Thus the best rank-1 approximation to X is
z1 w1T = Xw1 w1T .
The argument can be repeated to find the best rank-two approximation
$$X \approx z_1 w_1^T + z_2 w_2^T$$
using the constraints w_2^T w_2 = 1 and w_2^T w_1 = 0 and assuming that we have already estimated w_1 and z_1. It can be shown that w_2 is the eigenvector associated with the second largest eigenvalue of S.
3.8.1 Using PCA for compression
The data matrix X has a total of np elements, whereas the best rank-q approximation in equation 16 has q(n + p) elements. Also, the derivation assumed that X had been mean centred, so really we also need to store the column mean vector µ̂, which adds another p values.
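As a worked illustration: for the Single Cell dataset n = 300 and p = 8,686, so storing X directly requires np = 2,605,800 numbers. Keeping, say, q = 10 components (a value chosen here purely for illustration) would require q(n + p) + p = 10 × 8,986 + 8,686 = 98,546 numbers, roughly 4% of the original.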
4 Clustering Methods
We have seen that PCA can provide low-dimensional representations of a dataset that show groupings of observations when visualized. However, PCA does not involve a direct labelling of observations into different groups. Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a dataset.
[Figure: scatter plot of a simple simulated two-dimensional dataset used to illustrate clustering.]
A clustering of the n observations is a partition of the index set {1, . . . , n} into clusters C1, C2, . . . , CK satisfying:
1. C1 ∪ C2 ∪ · · · ∪ CK = {1, . . . , n}
2. Ci ∩ Cj = ∅ for all i ≠ j
We want to choose a clustering which has the property that the differences
between the observations within each cluster are as small as possible. If we
define W (Ck ) to be a measure of how different the observations are within
cluster k then we want to solve the problem
$$\min_{C_1, C_2, \ldots, C_K} \left\{ \sum_{k=1}^{K} W(C_k) \right\}$$
One option would be to search over all possible clusterings of the n obser-
vations. The number of possible clusterings becomes large very quickly. If
we let S(n, k) denote the number of ways to partition a set of n objects into
k non-empty subsets then
$$S(n, k) = \frac{1}{k!} \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} j^n$$
These numbers are known as Stirling numbers of the second kind and get
very large quickly. For example,
S(100, 4) = 66955751844038698560793085292692610900187911879206859351901
This clearly illustrates that it is very challenging to search over all these
possibilities. However, if we can find a clustering that is ”good” but not the
best, then this may be (and often is) good enough for our purposes. The
following algorithm can be shown to provide a good local optimum (i.e. a
pretty good solution).
K-means algorithm
1. Choose K.
2. Randomly assign each observation to one of the clusters 1, . . . , K.
3. Iterate the following 2 steps until the cluster assignments stop changing:
(a) For each cluster compute the cluster mean. The kth cluster mean, denoted µk, is the mean of all the x_i in cluster k, i.e.
$$\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i$$
(b) Assign each observation to the cluster whose cluster mean is closest (in Euclidean distance).
Figure 26 shows the results of running this algorithm on the simple example
with 3 clear clusters.
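A minimal sketch of running K-means in R with the built-in kmeans() function (the object name xx2 for the simulated points is an assumption based on the axis labels of the earlier figure):

    km <- kmeans(xx2, centers = 3)              # run K-means with K = 3
    km$cluster                                  # fitted cluster assignment of each observation
    km$centers                                  # the K cluster means
    plot(xx2, col = km$cluster)                 # points coloured by fitted cluster
    points(km$centers, pch = 17, col = "red")   # cluster means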
[Figure 26: iterations of the K-means algorithm on the example with 3 clear clusters, beginning with (a) randomly allocating points to 3 clusters and (b) calculating the cluster means.]
We call
$$\sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \mu_{kj})^2 \qquad (19)$$
the objective function of the K-means algorithm.
Now suppose that we have just updated the cluster assignments for the next iteration, i.e. we have determined what C^(t+1) is, and so the current value of the objective function (using C^(t+1) and µ_k^(t)) is
$$\sum_{k=1}^{K} \sum_{i \in C_k^{(t+1)}} \sum_{j=1}^{p} (x_{ij} - \mu_{kj}^{(t)})^2.$$
But since µ_k^(t) may not be the same as the cluster means for the assignments C^(t+1) (since the cluster assignments can change), when we update the cluster means to µ_k^(t+1) we must have that
$$\sum_{k=1}^{K} \sum_{i \in C_k^{(t+1)}} \sum_{j=1}^{p} (x_{ij} - \mu_{kj}^{(t+1)})^2 \;\le\; \sum_{k=1}^{K} \sum_{i \in C_k^{(t+1)}} \sum_{j=1}^{p} (x_{ij} - \mu_{kj}^{(t)})^2$$
using Equation 20. So updating the cluster means to µ_k^(t+1) never increases the objective function.
Similarly, in Step 3(b) the cluster means are fixed and we update the assignments. We can re-write the objective function in (19) as a sum over the n observations
$$\sum_{i=1}^{n} \sum_{j=1}^{p} (x_{ij} - \mu_{c(i)j})^2,$$
where c(i) denotes the cluster to which observation i is assigned. Assigning each observation to the cluster with the closest mean minimizes each term of this sum separately, so this step also never increases the objective function.
Multiple starts
The algorithm does not always give the same solution since the start point
is random. Figure 27 shows an example dataset simulated to have points in
5 clusters that are quite close together and so hard to separate. The true
cluster means are shown as red triangles in the plot.
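In practice the usual remedy is to run the algorithm from several random starts and keep the solution with the smallest objective function; kmeans() does this via its nstart argument. A sketch (the object name yy for this simulated dataset is an assumption based on the figure's axis labels):

    km1  <- kmeans(yy, centers = 5, nstart = 1)    # a single random start
    km25 <- kmeans(yy, centers = 5, nstart = 25)   # best of 25 random starts
    km1$tot.withinss                               # objective value from one start
    km25$tot.withinss                              # typically at least as small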
[Figure 27: simulated dataset with 5 overlapping clusters that are hard to separate; the true cluster means are shown as red triangles.]
Figure 29 shows the results of applying k-means to the Single Cell Genomics dataset seen at the start of these course notes. The dataset was first reduced down to its first 11 PCs and then k-means was run with K = 11. The Figure shows the k-means labelling of the points in the first 3 PC dimensions.
https://courses.maths.ox.ac.uk/course/view.php?id=620
Hierarchical clustering
Hierarchical clustering is an alternative approach that avoids having to
specify the number of clusters in advance. An added advantage is that the
method results in a tree-like (hierarchical) representation of the dataset, that
can be helpful when visualizing structure in the dataset, especially when the
data is high-dimensional, i.e. p is large.
Figure 29: 3D plot of 1st, 2nd and 3rd Principal Components for the Single
Cell Genomics dataset. Points are coloured according to a run of the k-
means algorithm with K = 11 working on the first 11 PCs from a PCA of
the dataset.
The agglomerative clustering algorithm proceeds as follows.
1. Start with each observation in its own cluster and compute the matrix of pairwise dissimilarities d_ij between observations,
$$D^{(n)} = \begin{pmatrix}
d_{21} & & & & \\
d_{31} & d_{32} & & & \\
\vdots & \vdots & \ddots & & \\
d_{i1} & d_{i2} & \cdots & d_{i(i-1)} & \\
\vdots & \vdots & & & \ddots \\
d_{n1} & d_{n2} & \cdots & d_{nj} & \cdots & d_{n(n-1)}
\end{pmatrix}$$
2. For i = n, n − 1, . . . , 2
(a) Find the pair of clusters with the smallest dissimilarity. Fuse
these two clusters.
(b) Compute the new dissimilarity matrix between the new fused
cluster and all other i − 1 remaining clusters and create an up-
dated matrix of dissimilarities D (n−1) .
Linkage methods
A key step in the algorithm is the choice of how to compute the new dis-
similarity (or distance) between the new fused cluster and all other i − 1
remaining clusters. There are several commonly used options, with differing
properties in terms of type of clusters they tend to produce. Let G and H be
two groups of observations then we can define the function d(G, H) between
groups G and H to be some function of all the pairwise dissimilarities dij
where i ∈ G and j ∈ H.
Single Linkage (SL) takes the intergroup distance to be the closest of all the pairs of observations between the two groups,
$$d_{SL}(G, H) = \min_{i \in G,\, j \in H} d_{ij}.$$
Group Average (GA) takes the intergroup distance to be the average of all the pairs of observations between the two groups,
$$d_{GA}(G, H) = \frac{1}{|G||H|} \sum_{i \in G} \sum_{j \in H} d_{ij}.$$
Next we merge clusters (1,2) and 5 into a new cluster (denoted (1,2,5)) and compute the new dissimilarities

    D(4) =
                (1,2,5)     3       4
          3       6.61
          4       5.02    2.06
          6       3.53    3.48    2.84

Next we merge clusters 3 and 4 into a new cluster (denoted (3,4)) and compute the new dissimilarities

    D(3) =
                  (1,2,5)   (3,4)
          (3,4)     5.02
          6         3.53    2.84

Next we merge clusters (3,4) and 6 into a new cluster (denoted (3,4,6)) and compute the new dissimilarities

    D(2) =
                    (1,2,5)
          (3,4,6)     3.53

Finally, we merge the two remaining clusters into a single cluster containing all the observations. Notice how the sequence of dissimilarities at which clusters merge (the smallest entry of each successive matrix: here 2.06, 2.84 and then 3.53) is an increasing sequence of values.
Dendrograms
The results of an agglomerative clustering of a dataset can be represented
as dendrogram, which is a tree-like diagram that allows us to visualize the
way in which the observations have been joined into clusters.
Figure 30: Small example of hierarchical clustering. The left plot shows the
raw dataset consisting of 6 observations in 2 dimensions. The right plot
shows the dendrogram of using agglomerative clustering with single linkage.
Differences between Linkage methods
Different linkage methods can lead to different dendrograms. Figure 31
shows a new dataset of 45 points, with 15 points each simulated from 3
different bivariate normal distributions with different means. The 3 sets of
points are coloured according to their true cluster. Points are also labelled
with a number.
[Figure 31: the 45 simulated points, coloured according to their true cluster and labelled with their observation numbers.]
Figure 32: Dendrograms using different Linkage methods applied to the dataset in Figure 31.
Figure 33 shows the results of building a dendrogram using Complete Linkage on the EU indicators dataset (see Table 2).
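A sketch of how such a dendrogram might be produced in R (the data frame name eu1 and the standardisation with scale() are assumptions suggested by the axis annotation of the original figure):

    d  <- dist(scale(eu1))                 # Euclidean distances between standardised countries
    hc <- hclust(d, method = "complete")   # complete linkage agglomerative clustering
    plot(hc)                               # dendrogram as in Figure 33
    cutree(hc, k = 3)                      # cut the tree into, say, 3 clusters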
[Figure 33: Cluster dendrogram from complete linkage hierarchical clustering of the EU indicators dataset (variables standardised before computing Euclidean distances).]
[Figure 34: panels for Single Linkage, Complete Linkage and Average Linkage.]
Figure 34: (Top row) Dendrograms using different Linkage methods applied
to the dataset in Figure 31. Each dendrogram has been cut to produce 3
clusters. (Bottom row) The corresponding clustering of points in the original
2D space produced by cutting the dendrogram to have 3 clusters.