COMPX310-19A Machine Learning Chapter 8: Dimensionality Reduction


COMPX310-19A

Machine Learning
Chapter 8: Dimensionality Reduction
An introduction using Python, Scikit-Learn, Keras, and Tensorflow

Unless otherwise indicated, all images are from Hands-on Machine Learning with
Scikit-Learn, Keras, and TensorFlow by Aurélien Géron, Copyright © 2019 O’Reilly Media
Dimensionality reduction
 Some datasets have 1000s or even millions of features
 Some might be useless, some might be highly correlated
 MNIST (see feature importance slides from Random Forests):
 borders are useless
 Neighbouring pixels are very similar, could be averaged

 Reasons:
 Speedup: especially for methods that are worse than linear in the
number of features, e.g. Ridge: O(n*n) for memory, O(n*n*n) for speed
 Higher accuracy: sometimes
 Data visualisation: 2d and 3d plots and animations can be very
useful for reporting, but also for debugging

03/08/2021 COMPX310 2
Dimensionality reduction
 How:
 Projection
 Manifold learning

 Algorithms covered:
 PCA
 Kernel PCA
 LLE
 [mentioned t-SNE, UMAP]

 Reduction is bound to lose “some” information, which may hurt accuracy.

03/08/2021 COMPX310 3
Curse of Dimensionality
 We understand low-dimensional spaces

03/08/2021 COMPX310 4
Curse of Dimensionality
 High-dimensional spaces are VERY different

 Unit square (1x1) in 2d: 0.4% chance to be less than 0.001 from a border

 Unit 10,000d hypercube: 99.999999% chance

import numpy as np

def random_point(dim):
    return np.random.rand(dim)

def close(point):
    # is any coordinate within 0.001 of a border of the unit hypercube?
    for x in point:
        if x < 0.001: return True
        if x > 0.999: return True
    return False

sum(close(random_point(2)) for _ in range(100000))      # => 419 (about 0.4%)
sum(close(random_point(10000)) for _ in range(100000))  # => 100000

03/08/2021 COMPX310 5
Curse of Dimensionality
 Even worse:
 Unit square (2d), two random points: average distance => 0.52 ± 0.248
 Unit cube (3d), two random points: average distance => 0.66 ± 0.249
 Unit hypercube (1,000,000d), two random points: average distance => 408.25 ± 0.243
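These averages are easy to check empirically; a quick Monte Carlo sketch (the pair counts are chosen only to keep memory use reasonable):

import numpy as np

def avg_pair_distance(dim, n_pairs=100000):
    # sample pairs of uniform random points in the unit hypercube
    a = np.random.rand(n_pairs, dim)
    b = np.random.rand(n_pairs, dim)
    dist = np.linalg.norm(a - b, axis=1)
    return dist.mean(), dist.std()

print(avg_pair_distance(2))                        # roughly (0.52, 0.25)
print(avg_pair_distance(3))                        # roughly (0.66, 0.25)
print(avg_pair_distance(1_000_000, n_pairs=20))    # roughly (408.25, 0.24), few pairs: each point has 10^6 coordinates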

 => High dim space is SPARSE, overfitting is very likely

 Why not just use more data?


 100d, uniform distribution, average distance to the next instance <= 0.1:
 => we would need more examples than there are atoms in the universe

03/08/2021 COMPX310 6
Projection
 What if the examples are actually not really using all dimensions?

 In other words: they are inside or close to some subspace.

03/08/2021 COMPX310 7
Projection
 What if we had a way to compute this hyperplane?

03/08/2021 COMPX310 8
Projections can fail
 Swiss roll toy dataset

03/08/2021 COMPX310 9
Projection vs unrolling
 What if we could unroll (but how)?

03/08/2021 COMPX310 10
Unrolling: Manifold learning
 Swiss roll: 2d shape bent and twisted in 3d
 d-dimensional manifold: a d-dim shape twisted and bent in an n-
dim space, with d < n
 Swiss roll: 2d manifold, d=2, n=3
 Manifold Hypothesis, Manifold Learning:
 Our data actually is a d-dim manifold
 This “explains” why learning in very high dimensions works:
 The data is very sparse in the n-dim space, but
 The data is dense (enough) in the d-dim space
 Explicitly transforming the data from n-dim to d-dim can help, but
not always (see next slide).

03/08/2021 COMPX310 11
Unrolling gone wrong
 Sometimes the original space works better

03/08/2021 COMPX310 12
Principal component analysis
 PCA: the most popular method; it finds axes for the subspace, trying to keep as
much information as possible by going for the directions with the most
variance, while requiring the axes to be at right angles to each other

03/08/2021 COMPX310 13
PCA
 PCA can compute n components for an n-dim input space
 The Scikit-Learn implementation uses some fancy math (SVD, Singular Value Decomposition)
 It is a transformer:
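A minimal sketch of the transformer usage, with a placeholder dataset standing in for the book's 3d toy data:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 3)      # placeholder 3d dataset
pca = PCA(n_components=2)       # keep the first 2 principal components
X2D = pca.fit_transform(X)      # fit the PCA model, then project X onto the 2d subspace
print(pca.components_)          # the principal axes, one row per component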

03/08/2021 COMPX310 14
PCA: how many components
 Every component “captures” some of the variance of the data:

 Manually finding the right number:

 Or doing it automatically:
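A sketch of both options, assuming X_train is the loaded training set and 95% is the target variance:

import numpy as np
from sklearn.decomposition import PCA

# Manually: fit with all components, then count how many reach 95% of the variance
pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1        # smallest d with >= 95% variance explained

# Automatically: a float n_components is interpreted as the target variance ratio
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)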

03/08/2021 COMPX310 15
PCA: variance explained
 Plotted as a function of the number of components

03/08/2021 COMPX310 16
PCA for lossy compression
 All components are weighted sums of the input features; we can
invert this to reproduce an approximation of the original:
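A minimal sketch, assuming X_train is the training set and that 154 components (an example value, roughly 95% of the variance on MNIST) are kept:

from sklearn.decomposition import PCA

pca = PCA(n_components=154)                      # e.g. ~95% of the variance on MNIST
X_reduced = pca.fit_transform(X_train)           # compress to 154 dimensions
X_recovered = pca.inverse_transform(X_reduced)   # map back to the original space (lossy)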

03/08/2021 COMPX310 17
Randomized PCA
 PCA is expensive: O(m*n*n + n*n*n)
 Randomized PCA: O(m*d*d + d*d*d), much faster for d << n,
but an approximation

 Scikit-Learn tries to be smart: the default svd_solver=“auto” switches from “full”
to “randomized” when the data is larger than 500x500 and d is less than 80% of the smaller of m and n
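The randomized solver can also be requested explicitly; a sketch assuming X_train and 154 target components:

from sklearn.decomposition import PCA

rnd_pca = PCA(n_components=154, svd_solver="randomized")  # force the approximate solver
X_reduced = rnd_pca.fit_transform(X_train)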

03/08/2021 COMPX310 18
IncrementalPCA
 For very large datasets that do not fit into RAM; it can be used in two ways:

 Or, using ’memory mapping’:
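A sketch of both approaches; X_train, the dataset shape (m, n), the number of batches, and the file name my_data.dat are assumptions:

import numpy as np
from sklearn.decomposition import IncrementalPCA

# Way 1: feed mini-batches explicitly via partial_fit
n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)
X_reduced = inc_pca.transform(X_train)

# Way 2: memory-map a binary file on disk; fit() then works in mini-batches itself
X_mm = np.memmap("my_data.dat", dtype="float32", mode="r", shape=(m, n))
inc_pca = IncrementalPCA(n_components=154, batch_size=m // n_batches)
inc_pca.fit(X_mm)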

03/08/2021 COMPX310 19
PCA use case
 Can improve performance of classifiers that ‘prefer’ uncorrelated
features:
 Naïve Bayes
 Logistic Regression
 Single decision trees

 Can help remove noise from data

03/08/2021 COMPX310 20
PCA + Logistic
 Using the FashionMNIST data with the following code:
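The exact code is not reproduced here; a rough sketch of one way to set up this experiment (the OpenML dataset name, split sizes, and component counts are assumptions):

from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Load Fashion-MNIST from OpenML
X, y = fetch_openml("Fashion-MNIST", version=1, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10000, random_state=42)

for n in (2, 10, 50, 100, 180):          # example component counts
    clf = make_pipeline(PCA(n_components=n), LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    print(n, accuracy_score(y_test, clf.predict(X_test)))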

03/08/2021 COMPX310 21
PCA + Logistic: Accuracy
 Goes up quickly, then almost plateaus around 84%:

03/08/2021 COMPX310 22
PCA + Logistic: Runtime
 Initially just seconds; with 180 components it takes around 20 minutes

03/08/2021 COMPX310 23
Variance explained
 First component explains almost 30%, then a quick drop-off
 Zoomed in on the last 50: still improving, but only by tiny steps,
sometimes briefly getting worse (why?)

03/08/2021 COMPX310 24
KernelPCA
 PCA is “linear”, but we can also model non-linearity using
kernels (like we did with SVMs)
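A minimal sketch, assuming X is the Swiss roll data; gamma=0.04 is just an example value:

from sklearn.decomposition import KernelPCA

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)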

03/08/2021 COMPX310 25
KernelPCA example
 From the Scikit-Learn documentation, showing the difference between
a linear kernel (standard PCA) and an RBF kernel with KernelPCA

03/08/2021 COMPX310 26
KernelPCA inverse
 Computing inverse_transform is hard, so it is switched off by default:
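A sketch of turning the approximate inverse on, with X and gamma assumed as before:

from sklearn.decomposition import KernelPCA

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04,
                    fit_inverse_transform=True)     # also learn the inverse mapping
X_reduced = rbf_pca.fit_transform(X)
X_preimage = rbf_pca.inverse_transform(X_reduced)   # approximate reconstruction in the original space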

 How good is it?

 Not so good for the Swiss roll, but worked well for the disc
inside the circle.

03/08/2021 COMPX310 27
KernelPCA: how to optimize?
 Has a number of hyperparameters; we may use GridSearchCV:
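A sketch of such a search, wrapping KernelPCA and a classifier in a Pipeline; the parameter ranges, the use of LogisticRegression, and the labels y are assumptions:

import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression()),
])
param_grid = [{
    "kpca__gamma": np.linspace(0.03, 0.05, 10),
    "kpca__kernel": ["rbf", "sigmoid"],
}]
grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)        # X, y: features and class labels of the task at hand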

 What is the best solution?
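The best combination can be read off the fitted GridSearchCV object (actual values depend on the data):

print(grid_search.best_params_)   # the best hyperparameter combination found
print(grid_search.best_score_)    # its cross-validated score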

03/08/2021 COMPX310 28
Locally linear embedding
 LLE: non-linear manifold learning (PCA was projection-based)
 Tries to preserve local relationships: close neighbours will stay
close, but far neighbours might not be that far anymore
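A minimal sketch, assuming X is the Swiss roll data; 10 neighbours is just an example setting:

from sklearn.manifold import LocallyLinearEmbedding

lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_reduced = lle.fit_transform(X)    # the unrolled 2d representation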

03/08/2021 COMPX310 29
LLE for the Swiss roll
 Local relationships are preserved, but there is some global
distortion

03/08/2021 COMPX310 30
LLE math (bonus material)
 LLE itself is a 2-step process, but computationally it is an expensive three-stage process:

 O(m*log(m)*n*log(k)) for finding the k nearest neighbours
 LLE step 1: O(m*n*k*k*k) for optimizing the weights
 LLE step 2: O(d*m*m) for finding the d-dim Zs

03/08/2021 COMPX310 31
Many more methods
 MDS: multi-dim scaling, similar to LLE

 Isomap: nearest neighbor graph, then trying to preserve geodesic distances

 t-SNE: t-distributed Stochastic Neighbor Embedding, keeps close instances close
and dissimilar ones ‘flexibly’ apart; mainly used to visualise data (see the sketch after this list)

 UMAP: Uniform Manifold Approximation and Projection, newest method, can also use
partially or fully labeled data, and ’add’ new data

 [Bonus] Supervised methods:
 LDA: Linear Discriminant Analysis, finds axes that are most discriminative between classes
 PLS: Partial Least Squares: finds ‘components’ that work best for predicting numeric target
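A usage sketch for the Scikit-Learn methods mentioned above (MDS, Isomap, t-SNE); X is assumed to be the dataset to embed:

from sklearn.manifold import MDS, Isomap, TSNE

X_mds  = MDS(n_components=2).fit_transform(X)      # multi-dimensional scaling
X_iso  = Isomap(n_components=2).fit_transform(X)   # preserves geodesic distances
X_tsne = TSNE(n_components=2).fit_transform(X)     # mostly used for visualisation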

03/08/2021 COMPX310 32
Swiss roll again
 MDS has a very similar effect to LLE and PCA, Isomap ’unrolls’
nicely, t-SNE is ‘odd’

03/08/2021 COMPX310 33
More visualisations
 Adapted from https://umap-learn.readthedocs.io/en/latest/auto_examples/plot_algorithm_comparison.html

03/08/2021 COMPX310 34
Summary
 Curse of dimensionality
 Useful for computational purposes and for visualisation
 Projection-based: PCA main method
 Manifold learning: LLE and others
 Some methods only useful for visualisation

 Interesting slide deck: https://amueller.github.io/COMS4995-s19/slides/aml-14-dimensionality-reduction/#1

03/08/2021 COMPX310 35
