COMPX310-19A Machine Learning Chapter 8: Dimensionality Reduction


COMPX310-19A

Machine Learning
Chapter 8: Dimensionality Reduction
An introduction using Python, Scikit-Learn, Keras, and Tensorflow

Unless otherwise indicated, all images are from Hands-on Machine Learning with
Scikit-Learn, Keras, and TensorFlow by Aurélien Géron, Copyright © 2019 O’Reilly Media
Dimensionality reduction
 Some datasets have 1000s or even millions of features
 Some might be useless, some might be highly correlated
 MNIST (see feature importance slides from Random Forests):
 borders are useless
 Neighbouring pixels are very similar, could be averaged

 Reasons:
 Speedup: especially for methods that are worse than linear in the
number of features, e.g. Ridge: O(n*n) for memory, O(n*n*n) for speed
 Higher accuracy: sometimes
 Data visualisation: 2d and 3d plots and animations can be very
useful for reporting, but also for debugging

03/08/2021 COMPX310 2
Dimensionality reduction
 How:
 Projection
 Manifold learning

 Algorithms covered:
 PCA
 Kernel PCA
 LLE
 [mentioned t-SNE, UMAP]

 Reduction is bound to lose “some” information, which may hurt accuracy.

03/08/2021 COMPX310 3
Curse of Dimensionality
 We understand low-dimensional spaces

03/08/2021 COMPX310 4
Curse of Dimensionality
 High-dimensional spaces are VERY different

 Unit square (1x1) in 2d: 0.4% chance to be less than 0.001 from a border

 Unit 10,000d hypercube: 99.999999% chance

import numpy as np

def random_point(dim):
    return np.random.rand(dim)

def close(point):
    # is any coordinate within 0.001 of a border of the unit hypercube?
    for x in point:
        if x < 0.001: return True
        if x > 0.999: return True
    return False

sum(close(random_point(2)) for _ in range(100000))      # => 419 (about 0.4%)
sum(close(random_point(10000)) for _ in range(100000))  # => 100000

03/08/2021 COMPX310 5
Curse of Dimensionality
 Even worse:
 Unit square (2d), two random points: average distance => 0.52 ± 0.248
 Unit cube (3d), two random points: average distance => 0.66 ± 0.249
 Unit hypercube (1,000,000d), two random points: average distance => 408.25 ± 0.243
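These averages are easy to check empirically; a quick Monte Carlo sketch (the pair counts are chosen only to keep memory use reasonable):

import numpy as np

def avg_pair_distance(dim, n_pairs=100000):
    # sample pairs of uniform random points in the unit hypercube
    a = np.random.rand(n_pairs, dim)
    b = np.random.rand(n_pairs, dim)
    dist = np.linalg.norm(a - b, axis=1)
    return dist.mean(), dist.std()

print(avg_pair_distance(2))                        # roughly (0.52, 0.25)
print(avg_pair_distance(3))                        # roughly (0.66, 0.25)
print(avg_pair_distance(1_000_000, n_pairs=20))    # roughly (408.25, 0.24), few pairs: each point has 10^6 coordinates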

 => High dim space is SPARSE, overfitting is very likely

 Why not just use more data?


 100d, uniform distribution, average distance to the next instance <= 0.1:
 => we would need more examples than there are atoms in the universe

03/08/2021 COMPX310 6
Projection
 What if the examples are actually not really using all dimensions?

 In other words: they are inside or close to some subspace.

03/08/2021 COMPX310 7
Projection
 What if we had a way to compute this hyperplane?

03/08/2021 COMPX310 8
Projections can fail
 Swiss roll toy dataset

03/08/2021 COMPX310 9
Projection vs unrolling
 What if we could unroll (but how)?

03/08/2021 COMPX310 10
Unrolling: Manifold learning
 Swiss roll: 2d shape bent and twisted in 3d
 d-dimensional manifold: a d-dim shape twisted and bent in an n-
dim space, with d < n
 Swiss roll: 2d manifold, d=2, n=3
 Manifold Hypothesis, Manifold Learning:
 Our data actually is a d-dim manifold
 This “explains” why learning in very high dimensions works:
 The data is very sparse in the n-dim space, but
 The data is dense (enough) in the d-dim space
 Explicitly transforming the data from n-dim to d-dim can help, but
not always (see next slide).

03/08/2021 COMPX310 11
Unrolling gone wrong
 Sometimes the original space works better

03/08/2021 COMPX310 12
Principal component analysis
 PCA: the most popular method; it finds axes for the subspace, trying to keep as
much information as possible by going for the directions with the most
variance, while requiring the axes to be at right angles to each other

03/08/2021 COMPX310 13
PCA
 PCA can compute n components for an n-dim input space
 The Scikit-Learn implementation uses some fancy math (SVD, Singular Value Decomposition)
 It is a transformer:
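A minimal sketch of the transformer usage, with a placeholder dataset standing in for the book's 3d toy data:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 3)      # placeholder 3d dataset
pca = PCA(n_components=2)       # keep the first 2 principal components
X2D = pca.fit_transform(X)      # fit the PCA model, then project X onto the 2d subspace
print(pca.components_)          # the principal axes, one row per component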

03/08/2021 COMPX310 14
PCA: how many components
 Every component “captures” some of the variance of the data:

 Manually finding the right number:

 Or doing it automatically:
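A sketch of both options, assuming X_train is the loaded training set and 95% is the target variance:

import numpy as np
from sklearn.decomposition import PCA

# Manually: fit with all components, then count how many reach 95% of the variance
pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1        # smallest d with >= 95% variance explained

# Automatically: a float n_components is interpreted as the target variance ratio
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)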

03/08/2021 COMPX310 15
PCA: variance explained
 Plotted as a function of the number of components

03/08/2021 COMPX310 16
PCA for lossy compression
 All components are weighted sums of the input features; we can
invert this to reproduce an approximation of the original:
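A minimal sketch, assuming X_train is the training set and that 154 components (an example value, roughly 95% of the variance on MNIST) are kept:

from sklearn.decomposition import PCA

pca = PCA(n_components=154)                      # e.g. ~95% of the variance on MNIST
X_reduced = pca.fit_transform(X_train)           # compress to 154 dimensions
X_recovered = pca.inverse_transform(X_reduced)   # map back to the original space (lossy)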

03/08/2021 COMPX310 17
Randomized PCA
 PCA is expensive: O(m*n*n + n*n*n)
 Randomized PCA: O(m*d*d + d*d*d), much faster for d << n,
but an approximation

 Scikit-Learn tries to be smart: the default svd_solver=“auto” switches from “full”
to “randomized” when the data is larger than 500x500 and d is less than 80% of the smaller of m and n
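The randomized solver can also be requested explicitly; a sketch assuming X_train and 154 target components:

from sklearn.decomposition import PCA

rnd_pca = PCA(n_components=154, svd_solver="randomized")  # force the approximate solver
X_reduced = rnd_pca.fit_transform(X_train)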

03/08/2021 COMPX310 18
IncrementalPCA
 For very large datasets that do not fit into RAM; it can be used in two ways:

 Or, using ’memory mapping’:
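A sketch of both approaches; X_train, the dataset shape (m, n), the number of batches, and the file name my_data.dat are assumptions:

import numpy as np
from sklearn.decomposition import IncrementalPCA

# Way 1: feed mini-batches explicitly via partial_fit
n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)
X_reduced = inc_pca.transform(X_train)

# Way 2: memory-map a binary file on disk; fit() then works in mini-batches itself
X_mm = np.memmap("my_data.dat", dtype="float32", mode="r", shape=(m, n))
inc_pca = IncrementalPCA(n_components=154, batch_size=m // n_batches)
inc_pca.fit(X_mm)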

03/08/2021 COMPX310 19
PCA use case
 Can improve performance of classifiers that ‘prefer’ uncorrelated
features:
 Naïve Bayes
 Logistic Regression
 Single decision trees

 Can help remove noise from data

03/08/2021 COMPX310 20
PCA + Logistic
 Using the FashionMNIST data with the following code:
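The exact code is not reproduced here; a rough sketch of one way to set up this experiment (the OpenML dataset name, split sizes, and component counts are assumptions):

from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Load Fashion-MNIST from OpenML
X, y = fetch_openml("Fashion-MNIST", version=1, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10000, random_state=42)

for n in (2, 10, 50, 100, 180):          # example component counts
    clf = make_pipeline(PCA(n_components=n), LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    print(n, accuracy_score(y_test, clf.predict(X_test)))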

03/08/2021 COMPX310 21
PCA + Logistic: Accuracy
 Goes up quickly, then almost plateaus around 84%:

03/08/2021 COMPX310 22
PCA + Logistic: Runtime
 Initially just seconds; with 180 components it takes around 20 minutes

03/08/2021 COMPX310 23
Variance explained
 First component explains almost 30%, then a quick drop-off
 Zoomed in on the last 50: still improving, but only by tiny steps,
sometimes briefly getting worse (why?)

03/08/2021 COMPX310 24
KernelPCA
 PCA is “linear”, but we can also model non-linearity using
kernels (like we did with SVMs)
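A minimal sketch, assuming X is the Swiss roll data; gamma=0.04 is just an example value:

from sklearn.decomposition import KernelPCA

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)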

03/08/2021 COMPX310 25
KernelPCA example
 From the Scikit-Learn documentation, showing the difference between
a linear kernel (standard PCA) and an RBF kernel with KernelPCA

03/08/2021 COMPX310 26
KernelPCA inverse
 Computing inverse_transform is hard, so it is switched off by default:
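A sketch of turning the approximate inverse on, with X and gamma assumed as before:

from sklearn.decomposition import KernelPCA

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04,
                    fit_inverse_transform=True)     # also learn the inverse mapping
X_reduced = rbf_pca.fit_transform(X)
X_preimage = rbf_pca.inverse_transform(X_reduced)   # approximate reconstruction in the original space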

 How good is it?

 Not so good for the Swiss roll, but worked well for the disc
inside the circle.

03/08/2021 COMPX310 27
KernelPCA: how to optimize?
 Has a number of hyperparameters; we may use GridSearchCV:
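A sketch of such a search, wrapping KernelPCA and a classifier in a Pipeline; the parameter ranges, the use of LogisticRegression, and the labels y are assumptions:

import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression()),
])
param_grid = [{
    "kpca__gamma": np.linspace(0.03, 0.05, 10),
    "kpca__kernel": ["rbf", "sigmoid"],
}]
grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)        # X, y: features and class labels of the task at hand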

 What is the best solution?
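The best combination can be read off the fitted GridSearchCV object (actual values depend on the data):

print(grid_search.best_params_)   # the best hyperparameter combination found
print(grid_search.best_score_)    # its cross-validated score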

03/08/2021 COMPX310 28
Locally linear embedding
 LLE: non-linear manifold learning (PCA was projection-based)
 Tries to preserve local relationships: close neighbours will stay
close, but far neighbours might not be that far anymore
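A minimal sketch, assuming X is the Swiss roll data; 10 neighbours is just an example setting:

from sklearn.manifold import LocallyLinearEmbedding

lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_reduced = lle.fit_transform(X)    # the unrolled 2d representation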

03/08/2021 COMPX310 29
LLE for the Swiss roll
 Local relationships are preserved, but there is some global
distortion

03/08/2021 COMPX310 30
LLE math (bonus material)
 LLE itself is a 2-step process, but computationally it is an expensive three-stage process:

 O(m*log(m)*n*log(k)) for finding the k nearest neighbours
 LLE step 1: O(m*n*k*k*k) for optimizing the weights
 LLE step 2: O(d*m*m) for finding the d-dim Zs

03/08/2021 COMPX310 31
Many more methods
 MDS: multi-dim scaling, similar to LLE

 Isomap: nearest neighbor graph, then trying to preserve geodesic distances

 t-SNE: t-distributed Stochastic Neighbor Embedding, keeps close instances close
and dissimilar ones ‘flexibly’ apart; mainly used to visualise data (see the sketch after this list)

 UMAP: Uniform Manifold Approximation and Projection, newest method, can also use
partially or fully labeled data, and ’add’ new data

 [Bonus] Supervised methods:
 LDA: Linear Discriminant Analysis, finds axes that are most discriminative between classes
 PLS: Partial Least Squares: finds ‘components’ that work best for predicting numeric target
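A usage sketch for the Scikit-Learn methods mentioned above (MDS, Isomap, t-SNE); X is assumed to be the dataset to embed:

from sklearn.manifold import MDS, Isomap, TSNE

X_mds  = MDS(n_components=2).fit_transform(X)      # multi-dimensional scaling
X_iso  = Isomap(n_components=2).fit_transform(X)   # preserves geodesic distances
X_tsne = TSNE(n_components=2).fit_transform(X)     # mostly used for visualisation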

03/08/2021 COMPX310 32
Swiss roll again
 MDS has a very similar effect to LLE and PCA, Isomap ’unrolls’
nicely, t-SNE is ‘odd’

03/08/2021 COMPX310 33
More visualisations
 Adapted from https://umap-learn.readthedocs.io/en/latest/auto_examples/plot_algorithm_comparison.html

03/08/2021 COMPX310 34
Summary
 Curse of dimensionality
 Useful for computational purposes and for visualisation
 Projection-based: PCA main method
 Manifold learning: LLE and others
 Some methods only useful for visualisation

 Interesting slide deck: https://amueller.github.io/COMS4995-s19/slides/aml-14-dimensionality-reduction/#1

03/08/2021 COMPX310 35
