Practical PCA Methods in R
Alboukadel KASSAMBARA
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form
or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, without the prior
written permission of the Publisher. Requests to the Publisher for permission should
be addressed to STHDA (http://www.sthda.com).
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in
preparing this book, they make no representations or warranties with respect to the accuracy or
completeness of the contents of this book and specifically disclaim any implied warranties of
merchantability or fitness for a particular purpose. No warranty may be created or extended by sales
representatives or written sales materials.
Contents

I Basics
1 Introduction to R
1.1 Installing R and RStudio
1.2 Installing and loading R packages
1.3 Getting help with functions in R
1.4 Importing your data into R
1.5 Demo data sets
1.6 Close your R/RStudio session
2 Required R packages
2.1 FactoMineR & factoextra
2.2 Installation
2.3 Main R functions

II Classical Methods
3 Principal Component Analysis
3.1 Introduction
3.2 Basics
3.3 Computation
3.4 Visualization and interpretation
3.5 Supplementary elements
3.6 Filtering results
3.7 Exporting results
3.8 Summary
3.9 Further reading
4 Correspondence Analysis
4.1 Introduction
4.2 Computation
4.3 Visualization and interpretation
4.4 Supplementary elements
4.5 Filtering results
4.6 Outliers
4.7 Exporting results
4.8 Summary
4.9 Further reading

IV Clustering
8 HCPC: Hierarchical Clustering on Principal Components
8.1 Introduction
8.2 Why HCPC?
8.3 Algorithm of the HCPC method
8.4 Computation
8.5 Summary
8.6 Further reading
Preface
One of the difficulties inherent in multivariate analysis is the problem of visualizing data that has many variables. In R, there are many functions and packages for displaying a graph of the relationship between two variables (http://www.sthda.com/english/wiki/data-visualization). There are also commands for displaying different three-dimensional views. But when there are more than three variables, it is more difficult to visualize their relationships.
Fortunately, in data sets with many variables, some variables are often correlated. This
can be explained by the fact that more than one variable might be measuring the same
driving principle governing the behavior of the system. Correlation indicates that there is
redundancy in the data. When this happens, you can simplify the problem by replacing
a group of correlated variables with a single new variable.
Principal component analysis is a rigorous statistical method for achieving this simplification. The method creates a new set of variables, called principal components. Each principal component is a linear combination of the original variables. All the principal components are orthogonal to each other, so there is no redundant information.
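This lack of redundancy can be checked directly in R (a quick sketch using the built-in USArrests data, not an example from this book):

```r
# Principal component scores of the USArrests data (standardized)
pcs <- prcomp(USArrests, scale. = TRUE)$x
# The components are uncorrelated: off-diagonal correlations are ~0
round(cor(pcs), 10)
```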
The type of principal component method to use depends on the variable types contained in the data set. This practical guide will describe the following methods:
1. Principal Component Analysis (PCA), which is one of the most popular multivariate analysis methods. The goal of PCA is to summarize the information contained in continuous (i.e., quantitative) multivariate data by reducing the dimensionality of the data without losing important information.
2. Correspondence Analysis (CA), which is an extension of principal component analysis for analyzing a large contingency table formed by two qualitative variables (or categorical data).
3. Multiple Correspondence Analysis (MCA), which is an adaptation of CA to
a data table containing more than two categorical variables.
4. Factor Analysis of Mixed Data (FAMD), dedicated to analyzing a data set containing both quantitative and qualitative variables.
5. Multiple Factor Analysis (MFA), dedicated to analyzing data sets in which variables are organized into groups (qualitative and/or quantitative variables).
Additionally, we’ll discuss the HCPC (Hierarchical Clustering on Principal Components) method. It applies agglomerative hierarchical clustering to the results of principal component methods (PCA, CA, MCA, FAMD, MFA). This makes it possible, for example, to perform clustering analysis on any type of data (quantitative, qualitative or mixed).
Figure 1 illustrates the type of analysis to be performed depending on the type of variables
contained in the data set.
Some examples of plots generated in this book are shown hereafter. You’ll learn how to
create, customize and interpret these plots.
1) Eigenvalues/variances of principal components. Proportion of information
retained by each principal component.
[Scree plot: percentage of explained variance for each principal component — 41.2%, 18.4%, 12.4%, 8.2%, 7%, 4.2%, 3%, 2.7%, 1.6% and 1.2% for dimensions 1 to 10.]
[Graph of variables: correlation circle of the decathlon variables on Dim1 (41.2%) and Dim2 (18.4%), colored by contribution, together with bar plots of the contributions (%) of the variables to the principal components.]
3) PCA - Graph of individuals:
• Automatically control the color of individuals using the cos2 (the quality of representation of the individuals on the factor map)
[Individuals - PCA: athletes plotted on Dim1 (41.2%) and Dim2 (18.4%), colored by cos2.]
• Change the point size according to the cos2 of the corresponding individuals:
[Individuals - PCA: the same map with point size proportional to the cos2 of each individual.]
[Biplot of individuals and variables for the iris data on Dim1 (73%) and Dim2 (22.9%): Sepal/Petal variables grouped into petal and sepal clusters, individuals colored by Species (setosa, versicolor, virginica).]
[Correspondence analysis map on Dim1 (48.7%): categories of the housetasks data such as Tidying, Shopping, Insurance, Dishes, Finances, Jointly and Holidays.]
[Multiple factor analysis maps of the wine data on Dim1 (43.9%) and Dim2 (16.9%): individual wines (e.g. 1VAU, 2ING) shown by appellation (Saumur, Bourgueuil, Chinon) and by condition (Reference, Env1, Env2, Env4).]
[Dendrogram from hierarchical clustering of the US states (USArrests data).]
0.6 Acknowledgment
I sincerely thank all developers for their efforts behind the packages that factoextra depends on, namely, ggplot2 (Hadley Wickham, Springer-Verlag New York, 2009), FactoMineR (Sebastien Le et al., Journal of Statistical Software, 2008), dendextend (Tal Galili, Bioinformatics, 2015), cluster (Martin Maechler et al., 2016) and more.
0.7 Colophon
This book was built with:
• R 3.3.2
• factoextra 1.0.5
• FactoMineR 1.36
• ggpubr 0.1.5
• dplyr 0.7.2
• bookdown 0.4.3
About the author
Part I
Basics
Chapter 1
Introduction to R
R is free and powerful statistical software for analyzing and visualizing data. If you want to easily learn the essentials of R programming, visit our series of tutorials available on STHDA: http://www.sthda.com/english/wiki/r-basics-quick-and-easy.
In this chapter, we provide a very brief introduction to R, covering how to install R/RStudio and how to import your data into R for computing principal component methods.
2. How to install packages from GitHub? You should first install devtools if you don’t already have it on your computer.
For example, the following R code installs the latest development version of the factoextra R package developed by A. Kassambara (https://github.com/kassambara/factoextra) for multivariate data analysis and elegant visualization.
install.packages("devtools")
devtools::install_github("kassambara/factoextra")
Now, we can use R functions, such as PCA() [in the FactoMineR package] for performing
principal component analysis.
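A minimal call might look like the following sketch (the built-in USArrests data stands in here purely as a placeholder; the book's worked example comes later):

```r
library("FactoMineR")
# Compute PCA on the standardized USArrests data, suppressing the default plots
res.pca <- PCA(USArrests, scale.unit = TRUE, graph = FALSE)
# Eigenvalues of the principal components
res.pca$eig
```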
Using these functions, the imported data will be of class data.frame (R terminology). You can read more about how to import data into R at this link: http://www.sthda.com/english/wiki/importing-data-into-r
To select just certain columns from a data frame, you can either refer to the columns by
name or by their location (i.e., column 1, 2, 3, etc.).
# Access the data in the 'Murder' column
# (the dollar sign selects a column by name)
head(USArrests$Murder)
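Selecting by location works analogously; for example (Murder and UrbanPop are columns 1 and 3 of USArrests):

```r
# Access columns by their position: columns 1 and 3 ('Murder' and 'UrbanPop')
head(USArrests[, c(1, 3)])
```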
Chapter 2
Required R packages
No matter which package you decide to use for computing principal component methods, the factoextra R package can help you easily extract the analysis results from the different packages mentioned above in a human-readable data format. factoextra also provides convenient solutions to create beautiful ggplot2-based graphs.
2.2 Installation
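Both packages can be installed from CRAN in the usual way:

```r
# Install FactoMineR and factoextra from CRAN
install.packages(c("FactoMineR", "factoextra"))
```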
Figure 2.1: Key features of FactoMineR and factoextra for multivariate analysis
# Load the two packages
library("FactoMineR")
library("factoextra")
Figure 2.2: Principal component methods and clustering methods supported by the factoextra R package
Functions   Description
PCA         Principal component analysis.
CA          Correspondence analysis.
MCA         Multiple correspondence analysis.
FAMD        Factor analysis of mixed data.
MFA         Multiple factor analysis.
HCPC        Hierarchical clustering on principal components.
dimdesc     Dimension description.
factoextra functions covered in this book are listed in the table below. See the online documentation (http://www.sthda.com/english/rpkgs/factoextra) for a complete list.
• Visualizing principal component method outputs
Functions                      Description
fviz_eig (or fviz_eigenvalue)  Visualize eigenvalues.
fviz_pca                       Graph of PCA results.
fviz_ca                        Graph of CA results.
fviz_mca                       Graph of MCA results.
fviz_mfa                       Graph of MFA results.
fviz_famd                      Graph of FAMD results.
fviz_hmfa                      Graph of HMFA results.
fviz_ellipses                  Plot ellipses around groups.
fviz_cos2                      Visualize element cos2.
fviz_contrib                   Visualize element contributions.
Functions        Description
get_eigenvalue   Access to the dimension eigenvalues.
get_pca          Access to PCA outputs.
get_ca           Access to CA outputs.
get_mca          Access to MCA outputs.
get_mfa          Access to MFA outputs.
get_famd         Access to FAMD outputs.
get_hmfa         Access to HMFA outputs.
facto_summarize  Summarize the analysis.

Note: the cos2 measures the quality of representation of the row/column variables on the principal component maps; the contribution measures how much a row/column element contributes to the definition of the principal components.
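As a quick sketch of how the extraction functions are used (USArrests serves as a stand-in data set here):

```r
library("FactoMineR")
library("factoextra")
# Compute PCA, then extract its eigenvalues with factoextra
res.pca <- PCA(USArrests, graph = FALSE)
# Columns: eigenvalue, variance.percent, cumulative.variance.percent
get_eigenvalue(res.pca)
```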
Functions     Description
fviz_dend     Enhanced visualization of dendrograms.
fviz_cluster  Visualize clustering results.
Part II
Classical Methods
Chapter 3
Principal Component Analysis
3.1 Introduction
Principal component analysis (PCA) allows us to summarize and to visualize the
information in a data set containing individuals/observations described by multiple inter-
correlated quantitative variables. Each variable could be considered as a different dimen-
sion. If you have more than 3 variables in your data sets, it could be very difficult to
visualize a multi-dimensional hyperspace.
Principal component analysis is used to extract the important information from a multivariate data table and to express this information as a set of a few new variables called principal components. These new variables correspond to a linear combination of the originals. The number of principal components is less than or equal to the number of original variables.
The information in a given data set corresponds to the total variation it contains. The
goal of PCA is to identify directions (or principal components) along which the variation
in the data is maximal.
In other words, PCA reduces the dimensionality of multivariate data to two or three principal components, which can be visualized graphically, with minimal loss of information.
In this chapter, we describe the basic idea of PCA and demonstrate how to compute and visualize PCA using R software. Additionally, we’ll show how to reveal the most important variables that explain the variation in a data set.
3.2 Basics
Understanding the details of PCA requires knowledge of linear algebra. Here, we’ll explain
only the basics with simple graphical representation of the data.
In the Plot 1A below, the data are represented in the X-Y coordinate system. The
dimension reduction is achieved by identifying the principal directions, called principal
components, in which the data varies.
PCA assumes that the directions with the largest variances are the most “important” (i.e., the most principal).
In the figure below, the PC1 axis is the first principal direction along which the samples
show the largest variation. The PC2 axis is the second most important direction
and it is orthogonal to the PC1 axis.
The dimensionality of our two-dimensional data can be reduced to a single dimension by projecting each sample onto the first principal component (Plot 1B).
[Plot 1A: the data in the X-Y coordinate system, with the principal directions PC1 and PC2 drawn through the point cloud. Plot 1B: the same samples after projection onto the PC1 axis.]
Taken together, the main purposes of principal component analysis are to:
• identify hidden patterns in a data set,
• reduce the dimensionality of the data by removing noise and redundancy,
• identify correlated variables.
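This dimension-reduction idea can be reproduced on a few lines of simulated data (an illustrative sketch, not code from this book):

```r
set.seed(123)
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 0.5)   # y is strongly correlated with x
res <- prcomp(cbind(x, y), scale. = TRUE)
# PC1 captures most of the total variation in the two correlated variables
summary(res)$importance["Proportion of Variance", "PC1"]
```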
3.3 Computation
3.3.1 R packages
Several functions from different packages are available in the R software for computing
PCA:
• prcomp() and princomp() [built-in R stats package],
• PCA() [FactoMineR package],
• dudi.pca() [ade4 package],
• and epPCA() [ExPosition package]
No matter what function you decide to use, you can easily extract and visualize the
results of PCA using R functions provided in the factoextra R package.
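For reference, the base-R route with prcomp() looks like this sketch (the built-in USArrests data is used as a stand-in; the rest of this chapter uses FactoMineR instead):

```r
# Base-R PCA on the USArrests data, with standardization
res <- prcomp(USArrests, scale. = TRUE)
# Standard deviations of the four principal components
res$sdev
```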
Here, we’ll use the two packages FactoMineR (for the analysis) and factoextra (for
ggplot2-based visualization).
We’ll use the demo data set decathlon2 from the factoextra package:
data(decathlon2)
# head(decathlon2)
As illustrated in Figure 3.1, the data used here describes athletes’ performance during two sporting events (Decastar and OlympicG). It contains 27 individuals (athletes) described by 13 variables.
Note that only some of these individuals and variables will be used to perform the principal component analysis. The coordinates of the remaining individuals and variables on the factor map will be predicted after the PCA.
• Active individuals (in light blue, rows 1:23): Individuals that are used during the principal component analysis.
• Supplementary individuals (in dark blue, rows 24:27): The coordinates of these individuals will be predicted using the PCA information and parameters obtained with the active individuals/variables.
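Following this split, the active part of the data can be extracted before the analysis (assuming, as in the factoextra documentation, that the first 10 columns hold the quantitative performance variables):

```r
library("factoextra")
data(decathlon2)
# Keep the active individuals (rows 1:23) and active variables (columns 1:10)
decathlon2.active <- decathlon2[1:23, 1:10]
head(decathlon2.active[, 1:6])
```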