COMPX310-19A Machine Learning Chapter 8: Dimensionality Reduction
Machine Learning
Chapter 8: Dimensionality Reduction
An introduction using Python, Scikit-Learn, Keras, and Tensorflow
Unless otherwise indicated, all images are from Hands-on Machine Learning with
Scikit-Learn, Keras, and TensorFlow by Aurélien Géron, Copyright © 2019 O’Reilly Media
Dimensionality reduction
Some datasets have 1000s or even millions of features
Some might be useless, some might be highly correlated
MNIST (see feature importance slides from Random Forests):
borders are useless
Neighbouring pixels are very similar, could be averaged
Reasons:
Speedup: especially for methods that are worse than linear in the
number of features; Ridge is O(n²) for memory, O(n³) for speed
Higher accuracy: sometimes
Data visualisation: 2d and 3d plots and animation, can be very
useful to report, but also to debug
03/08/2021 COMPX310 2
Dimensionality reduction
How:
Projection
Manifold learning
Algorithms covered:
PCA
Kernel PCA
LLE
[mentioned t-SNE, UMAP]
Curse of Dimensionality
We understand low-dimensional spaces
Curse of Dimensionality
High-dimensional spaces are VERY different
Unit square (1x1) in 2d: 0.4% chance to be less than 0.001 from a border
def close(point):
    for x in point:
        if x < 0.001: return True
        if x > 0.999: return True
    return False
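A quick Monte Carlo check of that 0.4% figure, using a `close()` test like the one above (a sketch; the trial count and seed are arbitrary choices):

```python
import random

def close(point, eps=0.001):
    """True if any coordinate is within eps of a border of the unit hypercube."""
    return any(x < eps or x > 1.0 - eps for x in point)

def border_fraction(dims, trials=100_000, seed=42):
    """Monte Carlo estimate of the chance a uniform random point is 'close'."""
    rng = random.Random(seed)
    hits = sum(close([rng.random() for _ in range(dims)]) for _ in range(trials))
    return hits / trials

# In 2d the exact value is 1 - 0.998**2, i.e. about 0.4%.
# In 1000d it is 1 - 0.998**1000, already about 86%.
```

`border_fraction(2)` lands near 0.004, while `border_fraction(1000)` is close to 0.86: in high dimensions, almost every point is near some border.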
Curse of Dimensionality
Even worse:
Unit square (2d): two random points: average distance ≈ 0.52 ± 0.248
Unit cube (3d): two random points: average distance ≈ 0.66 ± 0.249
Unit hypercube (1,000,000d): two random points: average distance ≈ 408.25 ± 0.243
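These averages can be reproduced with a short simulation (a sketch; the pair count and seed are arbitrary). In d dimensions the mean distance approaches √(d/6), which gives 408.25 for d = 1,000,000:

```python
import math
import random

def avg_pair_distance(dims, pairs=2000, seed=0):
    """Average Euclidean distance between random point pairs in the unit hypercube."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        p = [rng.random() for _ in range(dims)]
        q = [rng.random() for _ in range(dims)]
        total += math.dist(p, q)
    return total / pairs
```

`avg_pair_distance(2)` comes out near 0.52 and `avg_pair_distance(3)` near 0.66; for large d the value tracks √(d/6), e.g. about 40.8 for d = 10,000.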
Projection
What if the examples do not really use all dimensions?
Projection
What if we had a way to compute this hyperplane?
Projections can fail
Swiss roll toy dataset
Projection vs unrolling
What if we could unroll (but how)?
Unrolling: Manifold learning
Swiss roll: 2d shape bent and twisted in 3d
d-dimensional manifold: a d-dim shape twisted and bent in an n-
dim space, with d < n
Swiss roll: 2d manifold, d=2, n=3
Manifold Hypothesis, Manifold Learning:
Our data actually lies on a d-dim manifold
This “explains” why learning in very high dimensions works:
The data is very sparse in the n-dim space, but
The data is dense (enough) in the d-dim space
Explicitly transforming the data from n-dim to d-dim can help, but
not always (see next slide).
Unrolling gone wrong
Sometimes the original space works better
Principal component analysis
PCA: the most popular method. It finds the axes of a subspace that
keep as much information as possible by going for the directions with
the most variance; the axes must also be at right angles to each other
PCA
PCA can compute n components for an n-dim input space
The Scikit-Learn implementation uses Singular Value Decomposition (SVD)
It is a transformer:
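The transformer usage can be sketched like this (the data here is made up, and two components is an arbitrary choice):

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up correlated 3-d data: the third feature carries almost no variance
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.05])

pca = PCA(n_components=2)    # keep the first two principal components
X2d = pca.fit_transform(X)   # fit then project -- the usual transformer API
```

`X2d` has shape (200, 2), and `pca.components_` holds the axes that were found.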
PCA: how many components
Every component “captures” some of the variance of the data:
Or doing it automatically:
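Both the manual and the automatic route can be sketched as follows (synthetic data; the 95% target is just a common choice):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 20))  # correlated features

# Manual: pick the smallest d whose cumulative explained variance reaches 95%
pca = PCA().fit(X)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = int(np.argmax(cumsum >= 0.95)) + 1

# Automatic: pass the target variance ratio instead of a component count
pca95 = PCA(n_components=0.95).fit(X)
```

`pca95.n_components_` then equals the `d` computed by hand.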
PCA: variance explained
Plotted as a function of the number of components
PCA for lossy compression
All components are weighted sums of the input features; we can
invert this to reproduce an approximation of the original:
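A sketch of round-tripping through the compressed representation (made-up data; four components is arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10)) @ rng.normal(size=(10, 10))  # correlated features

pca = PCA(n_components=4)
X_reduced = pca.fit_transform(X)                # compress: 10 features -> 4
X_recovered = pca.inverse_transform(X_reduced)  # decompress (approximately)

# How much was lost: mean squared reconstruction error
reconstruction_mse = np.mean((X - X_recovered) ** 2)
```

The reconstruction is lossy, but the error is smaller than the data's total variance because the kept components capture the biggest directions.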
Randomized PCA
PCA is expensive: O(m·n² + n³)
Randomized PCA: O(m·d² + d³), much faster for d << n,
but an approximation
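Selecting the randomized solver is a single argument (a sketch; note that the default `svd_solver="auto"` already picks the randomized algorithm for sufficiently large inputs):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 200))  # made-up data, m=1000, n=200

# Force the randomized (approximate) SVD algorithm
rnd_pca = PCA(n_components=50, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X)
```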
IncrementalPCA
For very large datasets that do not fit into RAM; usable in two ways:
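A sketch of both ways (the batch count is arbitrary; the `np.memmap` path assumes a file written elsewhere, so it is left as a comment):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))  # made-up data standing in for a huge dataset

# Way 1: feed mini-batches through partial_fit
inc_pca = IncrementalPCA(n_components=5)
for batch in np.array_split(X, 10):
    inc_pca.partial_fit(batch)
X_reduced = inc_pca.transform(X)

# Way 2: np.memmap lets fit() stream a file that is too big for RAM, e.g.
#   X_mm = np.memmap("big.data", dtype="float32", mode="readonly", shape=(m, n))
#   IncrementalPCA(n_components=5, batch_size=100).fit(X_mm)
```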
PCA use case
Can improve performance of classifiers that ‘prefer’ uncorrelated
features:
Naïve Bayes
Logistic Regression
Single decision trees
PCA + Logistic
Using the FashionMNIST data with the following code:
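The slide's original code is not reproduced here; the pipeline can be sketched as below, with Scikit-Learn's small digits dataset standing in for FashionMNIST so the sketch runs quickly (30 components is an arbitrary choice; `fetch_openml("Fashion-MNIST")` would supply the real data):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Small stand-in dataset; swap in FashionMNIST for the slide's experiment
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# PCA feeds decorrelated, compressed features into Logistic Regression
clf = make_pipeline(PCA(n_components=30),
                    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```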
PCA + Logistic: Accuracy
Goes up quickly, then almost plateaus around 84%:
PCA + Logistic: Runtime
Initially just seconds; at 180 components, around 20 minutes
Variance explained
First component captures almost 30%, then a quick dropoff
Zoomed in on the last 50: still improving, but only by tiny steps,
sometimes briefly getting worse (why?)
KernelPCA
PCA is “linear”, but we can also model non-linearity using
kernels (like we did with SVMs)
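A sketch on the Swiss roll (the `gamma` value follows the book's example):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

# RBF kernel implicitly maps to a high-dim space before doing linear PCA there
rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)
```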
KernelPCA example
From the Scikit-Learn documentation, showing the difference between
a linear kernel (standard PCA) and an RBF kernel with KernelPCA
KernelPCA inverse
Computing inverse_transform is hard, so it is switched off by default:
Not so good for the Swiss roll, but worked well for the disc
inside the circle.
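Enabling the inverse takes one constructor flag (a sketch on the Swiss roll; `gamma` is an arbitrary choice):

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

X, _ = make_swiss_roll(n_samples=500, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04,
                 fit_inverse_transform=True)  # off by default: learns a pre-image map
X_reduced = kpca.fit_transform(X)
X_preimage = kpca.inverse_transform(X_reduced)  # approximate reconstruction

# Pre-image error: how far the reconstruction is from the original points
preimage_error = np.mean((X - X_preimage) ** 2)
```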
KernelPCA: how to optimize?
It has a number of hyperparameters; use GridSearchCV to tune them:
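A sketch of such a search, wrapping KernelPCA in a pipeline with a classifier (the binary labels here are made up purely to give the search something to score):

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, t = make_swiss_roll(n_samples=500, random_state=42)
y = (t > 6.9).astype(int)  # made-up binary target for the demo

clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression(max_iter=1000)),
])

# Search over kernel choice and gamma; best combo = best downstream accuracy
param_grid = {
    "kpca__gamma": np.linspace(0.03, 0.05, 3),
    "kpca__kernel": ["rbf", "sigmoid"],
}
grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)
```

`grid_search.best_params_` then reports the winning kernel and gamma.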
Locally linear embedding
LLE: non-linear manifold learning (PCA was projection-based)
Tries to preserve local relationships: close neighbours will stay
close, but far neighbours might not be that far anymore
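A sketch on the Swiss roll (the neighbour count is an arbitrary choice):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

# Each point is described by its 10 nearest neighbours; LLE finds 2-d
# coordinates that preserve those local relationships
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X)
```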
LLE for the Swiss roll
Local relationships are preserved, but there is some global
distortion
LLE math (bonus material)
Two-step process, quite expensive:
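The two steps can be written as follows (a sketch of the standard formulation, with m instances, w_{i,j} the learned weights, and z^{(i)} the low-dimensional images):

```latex
% Step 1: linearly reconstruct each instance from its k nearest neighbours
\hat{W} = \operatorname*{argmin}_{W} \sum_{i=1}^{m}
          \Bigl\| \mathbf{x}^{(i)} - \sum_{j=1}^{m} w_{i,j}\,\mathbf{x}^{(j)} \Bigr\|^2
\quad \text{s.t.}\quad w_{i,j} = 0 \text{ if } \mathbf{x}^{(j)} \text{ is not a neighbour of } \mathbf{x}^{(i)},
\qquad \sum_{j=1}^{m} w_{i,j} = 1

% Step 2: keep the weights fixed and place the low-dim points to respect them
\hat{Z} = \operatorname*{argmin}_{Z} \sum_{i=1}^{m}
          \Bigl\| \mathbf{z}^{(i)} - \sum_{j=1}^{m} \hat{w}_{i,j}\,\mathbf{z}^{(j)} \Bigr\|^2
```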
Many more methods
MDS: multi-dimensional scaling, similar to LLE
UMAP: Uniform Manifold Approximation and Projection, the newest method; it can
also use partially or fully labeled data, and 'add' new data
Swiss roll again
MDS has a very similar effect to LLE and PCA, Isomap 'unrolls'
nicely, t-SNE is 'odd'
More visualisations
Adapted from https://umap-learn.readthedocs.io/en/latest/auto_examples/plot_algorithm_comparison.html
Summary
Curse of dimensionality
Useful for computational purposes and for visualisation
Projection-based: PCA main method
Manifold learning: LLE and others
Some methods only useful for visualisation