Lecture 2 Annotated


02450: Introduction to Machine Learning and Data Mining

Data, feature extraction and PCA

Georgios Arvanitidis

DTU Compute, Technical University of Denmark (DTU)


Today
Reading material: Chapter 2, Chapter 3

Feedback groups of the day:
Abrahim Deiaa El Din Abbas, Johanne Abildgaard, Julieta Aceves, Helle Achari, Magnus Møller Aggernæs, Subhayan Akhuli, Melis Cemre Akyol, Malek Al Abed, Mohammad Al-Ansari, Maximillian Al-Helo, Ismail Ali, Yusuf Mohamed Alin, Mads Albert Alkjærsig, Javier Alonso Fernandez, Rikke Alstrup, Muhammad Hussain Rashid Al-Takmaji, Mohamad Malaz Mohamed Alzarrad, Saeed Mohamud Amin, William Kirk Andersen, Mikkel Arn Andersen, Simon Rung Andersen, Oline Melinda Andersen, Jeppe Aarup Andersen, Mathias Vith Andersen, Giulia Andreatta, Sander Skjolden Andresen, Željko Antunović, Theo Rønne Appel, Pedro Aragon Fernandez, Ivan Antonino Arena, Alba Arias Martínez, Enerel Ariunbold, Amari Karakandi Arun, Matthew Hiroto Asano, Jacob Frederik Aslan-Lassen, Bergur Ástrásson, Salomé Anaïs Aubri, Mike Auer, Eseme Ida Elena Ayiwe, Md Amin Azad, Gursharandeep Singh Badhesha, Haydar Hamid Abbas Bahr, Eline Agnes Jacqueline Balland, Volodymyr Baran, Samira Sanjay Barve, Laura Bauer, Quim Bech Vilaseca, Aslan Dalhoff Behbahani, Nikolaj Ivø Beier, Alex Belai, Magnus Johan Berg-Arnbak, Toms Rudolfs Berzins, Kaushik Amol Bhat, Mark Bidstrup, Kawa Shawki Bilal, Christian Lundgaard Bjerregaard, Magnus Bjørnskov, Emma Louise Blair, Louise Toft Blankensteiner

2 DTU Compute Lecture 2 5 September, 2023


Lecture Schedule

1 Introduction
  29 August: C1

Data: Feature extraction, and visualization
2 Data, feature extraction and PCA
  5 September: C2, C3
3 Measures of similarity, summary statistics and probabilities
  12 September: C4, C5
4 Probability densities and data visualization
  19 September: C6, C7

Supervised learning: Classification and regression
5 Decision trees and linear regression
  26 September: C8, C9
6 Overfitting, cross-validation and Nearest Neighbor
  3 October: C10, C12 (Project 1 due before 13:00)
7 Performance evaluation, Bayes, and Naive Bayes
  10 October: C11, C13
8 Artificial Neural Networks and Bias/Variance
  24 October: C14, C15
9 AUC and ensemble methods
  31 October: C16, C17

Unsupervised learning: Clustering and density estimation
10 K-means and hierarchical clustering
  7 November: C18
11 Mixture models and density estimation
  14 November: C19, C20 (Project 2 due before 13:00)
12 Association mining
  21 November: C21

Recap
13 Recap and discussion of the exam
  28 November: C1-C21

Online help: Discussion Forum (Piazza) on DTU Learn
Videos of lectures: https://panopto.dtu.dk
Streaming of lectures: Zoom (link on DTU Learn)
Learning Objectives
• Understand the types of data, their attributes and data issues
• Understand the bag of words representation
• Be able to apply principal component analysis for data visualization and feature
extraction


What is data?


Discrete / continuous attributes


Types of attributes



Quiz 1: Attribute types (Spring 2012)

In a study of healthy breakfast habits 77 cereal brands were investigated. The attributes of the data are given in Table 1. There are a total of 14 attributes denoted x1–x14 and one output variable y which defines the average rating of the cereal products by the consumers.

Which statement about the attributes in the data set is incorrect?

A. NAME is discrete and nominal.
B. PROT, FAT and SOD are all continuous and ratio.
C. TYPE and VIT are both discrete and ordinal.
D. An attribute that is ratio will also be interval.
E. Don't know.

No.  Attribute description                                             Abbrev.
x1   Type (0 = served cold, 1 = served hot)                            TYPE
x2   Calories per serving                                              CAL
x3   Grams of protein                                                  PROT
x4   Grams of fat                                                      FAT
x5   Milligrams of sodium                                              SOD
x6   Grams of dietary fiber                                            FIB
x7   Grams of complex carbohydrates                                    CARB
x8   Grams of sugars                                                   SUG
x9   Milligrams of potassium                                           POT
x10  Vitamins and minerals in 0%, 25%, or 100% of FDA recommendations  VIT
x11  Shelf position (1, 2, or 3, counting from the floor)              SHELF
x12  Weight in ounces of one serving                                   WEIGHT
x13  Number of cups in one serving                                     CUPS
x14  Name of cereal brand                                              NAME
y    Average rating of the cereal (from 0 to 100)                      RAT

Table 1: Attributes in a study of cereals (i.e. breakfast products, taken from http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html).



Solution:

There is a finite set of brands, thus NAME is discrete, and as the only operators that can be applied to NAME are equal and not equal, NAME is nominal. PROT, FAT and SOD are all continuous, and since zero means absence for them, they are ratio. TYPE and VIT are both discrete; however, TYPE is not ordinal (Hot is not better than Cold), thus TYPE must be considered nominal, while VIT on the other hand is ordinal, as 0% is less than 25%, which in turn is less than 100%. An attribute that is ratio will also be interval, ordinal and nominal, i.e. we can apply all the operations =, ≠, >, <, +, −, ×, / to a ratio attribute. Statement C is therefore the incorrect one.

Types of data sets


Record data example: Market basket data


Relational data example: Who knows who?



Ordered data example: Time series


Data quality


Noise


Outliers


Missing values
• Definition
  • No value is stored for an attribute in a data object
• Reasons for missing values
  • Information is not collected or measured
    • People decline to give their age
  • Attribute is not applicable
    • Annual income is not applicable to children
• Handling missing values
  • Eliminate data objects
  • Eliminate attributes
  • Estimate missing values (e.g. an average)
  • Ignore the missing value in analysis
  • Model the missing values
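As a rough illustration of two of the handling strategies above, here is a minimal NumPy sketch; the toy matrix `X` and its values are made up for illustration.

```python
import numpy as np

# Toy data matrix with a missing value (np.nan); rows = objects, columns = attributes.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, 6.0]])

# Strategy 1: eliminate data objects (rows) that contain missing values.
X_drop = X[~np.isnan(X).any(axis=1)]

# Strategy 2: estimate missing values by the per-attribute (column) average.
col_mean = np.nanmean(X, axis=0)              # mean of each column, ignoring NaNs
X_imputed = np.where(np.isnan(X), col_mean, X)
```

Which strategy is appropriate depends on how many values are missing and whether they are missing at random.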

Dataset manipulations


Feature processing


Common feature transformations


One-out-of-K encoding
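One-out-of-K (one-hot) encoding maps a nominal attribute with K categories to a K-dimensional binary vector. A minimal NumPy sketch; the `colors` attribute and its categories are hypothetical.

```python
import numpy as np

# Hypothetical nominal attribute with K = 3 categories.
colors = ["red", "green", "blue", "green"]
categories = sorted(set(colors))          # fixed category order: ['blue', 'green', 'red']

# Each observation becomes a K-dimensional binary vector with a single 1.
K = len(categories)
encoded = np.zeros((len(colors), K))
for i, c in enumerate(colors):
    encoded[i, categories.index(c)] = 1.0
```

Note that exactly one entry per row is 1, so no artificial ordering is imposed on the categories.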


Bag of words representation
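The bag-of-words representation counts how often each vocabulary term occurs in each document, discarding word order. A minimal sketch in plain Python; the two example documents are made up.

```python
from collections import Counter

# Two short example documents (hypothetical corpus).
docs = ["the cat sat on the mat", "the dog sat"]

# Vocabulary: all unique terms across the corpus, in a fixed order.
vocab = sorted({word for doc in docs for word in doc.split()})

# Bag-of-words matrix: X[i][j] counts how often vocab[j] occurs in document i.
counts = [Counter(doc.split()) for doc in docs]
X = [[c[term] for term in vocab] for c in counts]
```

Each document thus becomes a fixed-length count vector, regardless of its original length.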


Image representation


Vector space representation



Plan for the rest of today:

• Linear algebra recap (subspaces and projections)
• The goal of Principal Component Analysis (PCA)
• Derivation of PCA
• Singular Value Decomposition used to implement PCA
• Use of PCA for data visualization


Vectors and matrices


Matrix multiplication

Example: $\begin{bmatrix}1 & 2\\ 3 & 4\end{bmatrix}\begin{bmatrix}5 & 6\\ 7 & 8\end{bmatrix} = \begin{bmatrix}19 & 22\\ 43 & 50\end{bmatrix}$
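The example can be checked with NumPy's matrix product:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Matrix product: C[i, j] = sum_k A[i, k] * B[k, j]
C = A @ B
```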

Matrix transpose


The identity matrix


In general, AB ≠ BA.


Norms




Vector spaces

[Board note: we can find any point x ∈ ℝⁿ as a linear combination of these vectors.]



Subspaces

[Board note: any x ∈ ℝ³ can be written x = a₁e₁ + a₂e₂ + a₃e₃ with aᵢ ∈ ℝ and the standard basis e₁ = (1, 0, 0), e₂ = (0, 1, 0), e₃ = (0, 0, 1); any point on the plane spanned by two vectors can be written x = b₁v₁ + b₂v₂.]



Basis of a (sub)space


[Board note: any point in ℝ³ is x = a₁(1,0,0)ᵀ + a₂(0,1,0)ᵀ + a₃(0,0,1)ᵀ, and any point on the plane is y = b₁v₁ + b₂v₂. Each vᵢ is a vector, and by linear combinations we can find any point in a subspace/vector space.]



Projection




Projection onto a subspace


[Board note: collect the basis vectors as V = [v₁, …, v_K] ∈ ℝ^{M×K}; the coefficients are b = Vᵀx = (xᵀv₁, …, xᵀv_K)ᵀ ∈ ℝ^K, and the projection of x onto the subspace is x̂ = Vb = b₁v₁ + ⋯ + b_K v_K.]
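Projection onto a subspace can be sketched in NumPy, assuming an orthonormal basis collected as the columns of V; here the basis spans the x-y plane of ℝ³, a made-up example.

```python
import numpy as np

# Orthonormal basis of a 2D subspace of R^3 (the x-y plane), as columns of V.
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

x = np.array([3.0, 4.0, 5.0])

b = V.T @ x     # coefficients b_k = v_k^T x
x_hat = V @ b   # projection onto the subspace: b_1 v_1 + b_2 v_2
```

The residual x − x̂ is orthogonal to the subspace, which is what makes x̂ the closest point in it.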




PCA for high-dimensional data

" -> &

o
po

%.......
6 - - o o

44 DTU Compute Lecture 2 5 September, 2023


osvg-47

PCA derivation

$$\arg\max_{v} \operatorname{Var}[b] = \arg\max_{v} \; v^T \tilde{X}^T \tilde{X} v \quad \text{s.t.} \quad \|v\|^2 = v^T v = 1$$

$$L(v, \lambda) = v^T \tilde{X}^T \tilde{X} v - \lambda (v^T v - 1), \qquad \frac{\partial L}{\partial v} = 2 \tilde{X}^T \tilde{X} v - 2 \lambda v = 0 \quad \text{or} \quad \tilde{X}^T \tilde{X} v = \lambda v$$

This means that $\operatorname{Var}[b] = \frac{1}{N-1} v^T \lambda v = \frac{1}{N-1} \lambda$.
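The derivation can be checked numerically, as a sketch with synthetic data (the random matrix and its column scaling are made up): the leading eigenvector v of X̃ᵀX̃ satisfies X̃ᵀX̃v = λv, and the projected coordinates b = X̃v have sample variance λ/(N−1).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data with different variances along the three axes.
X = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.2])
Xt = X - X.mean(axis=0)                # centered data matrix X~

# The first principal direction is the eigenvector of X~^T X~
# with the largest eigenvalue (eigh returns eigenvalues in ascending order).
evals, evecs = np.linalg.eigh(Xt.T @ Xt)
v = evecs[:, -1]
lam = evals[-1]

# Projected coordinates onto the first principal direction.
b = Xt @ v
```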



The Singular Value Decomposition (SVD)



Principal component analysis (PCA)



[Figure: Data → Centered → Reconstructions (centered)]
Explained Variance

Recall that from the SVD: $\tilde{X} = U \Sigma V^T$.

In the original space, the coordinates of $\tilde{X}$ projected onto the first $K$ components are

$$X' = U \Sigma_{(K)} V_{(K)}^T$$

We can measure how much variance is retained in the reconstruction $X'$:

$$\text{Explained var.} = \frac{\|X'\|_F^2}{\|\tilde{X}\|_F^2} = \frac{\sum_{i=1}^{K} \sigma_i^2}{\sum_{i=1}^{M} \sigma_i^2}$$

since, using $\operatorname{trace}(AB) = \operatorname{trace}(BA)$,

$$\|\tilde{X}\|_F^2 = \operatorname{trace}(\tilde{X}^T \tilde{X}) = \operatorname{trace}(\tilde{X} \tilde{X}^T) = \operatorname{trace}(U \Sigma V^T (U \Sigma V^T)^T) = \operatorname{trace}(U \Sigma V^T V \Sigma^T U^T) = \operatorname{trace}(U \Sigma \Sigma^T U^T) = \operatorname{trace}(U^T U \Sigma \Sigma^T) = \operatorname{trace}(\Sigma \Sigma^T) = \sum_i \sigma_i^2$$

Similarly, the fraction of explained variance for the $i$'th component is

$$\text{Explained var.} = \frac{\sigma_i^2}{\sum_{i=1}^{M} \sigma_i^2}$$
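The explained-variance computation via the SVD can be sketched in NumPy; the synthetic data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Xt = X - X.mean(axis=0)                             # centered data matrix X~

# Thin SVD: X~ = U diag(s) V^T, singular values s in descending order.
U, s, Vt = np.linalg.svd(Xt, full_matrices=False)

# Fraction of variance explained by each component: sigma_i^2 / sum_j sigma_j^2.
explained = s**2 / np.sum(s**2)

# Rank-K reconstruction X' = U_(K) diag(s_(K)) V_(K)^T.
K = 2
X_rec = (U[:, :K] * s[:K]) @ Vt[:K]
```

The squared Frobenius norm of the rank-K reconstruction divided by that of the centered data equals the sum of the first K explained-variance fractions.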
Quiz 2: PCA (Fall 2012)

A PCA analysis is applied to the standardized data based on the attributes x1–x10. The squared Frobenius norm of the standardized data matrix X is given by $\|X\|_F^2 = 5780.0$. The first four singular values are $\sigma_1 = 40.1$, $\sigma_2 = 34.2$, $\sigma_3 = 28.1$, and $\sigma_4 = 24.8$. Which of the following statements is correct?

A. The first PCA component accounts for more than 35 % of the variation.
B. The second PCA component accounts for more than 30 % of the variation.
C. The first three PCA components account for less than 70 % of the variation in the data.
D. The fourth PCA component accounts for less than 10 % of the variation in the data.
E. Don't know.

No.  Attribute description                    Abbrev.
x1   Age (in years)                           AGE
x2   Gender (Female=0, Male=1)                GDR
x3   Total Bilirubin                          TB
x4   Direct Bilirubin                         DB
x5   Alkaline Phosphatase                     AP
x6   Alamine Aminotransferase                 AlA
x7   Aspartate Aminotransferase               AsA
x8   Total Proteins                           TP
x9   Albumin                                  AB
x10  Albumin to Globulin ratio                A/G
y    0=No liver disease, 1=Liver disease      LD

Table 1: Attributes in a study on liver disease among Indians living in the north eastern part of Andhra Pradesh, India (taken from http://archive.ics.uci.edu/ml/datasets/ILPD+%28Indian+Liver+Patient+Dataset%29). The data has 10 input attributes x1–x10 and one output variable y which defines whether the subject considered has a liver disease (y = 1) or not (y = 0). x3–x9 are non-negative measurements giving the concentrations of various quantities measured in a blood test. x10 gives the ratio of Albumin to Globulin in the blood.



Solution:

The $i$'th principal component accounts for $\frac{\sigma_i^2}{\sum_j \sigma_j^2} = \frac{\sigma_i^2}{\|X\|_F^2}$ of the variation. We therefore have that the first PCA component accounts for $\frac{40.1^2}{5780.0} = 27.8\%$, the second $\frac{34.2^2}{5780.0} = 20.2\%$, and the first three principal components account for $\frac{40.1^2 + 34.2^2 + 28.1^2}{5780.0} = 61.7\%$ of the variation, whereas the fourth principal component accounts for $\frac{24.8^2}{5780.0} = 10.6\%$. Thus, the first three PCA components account for less than 70% of the variation in the data.
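The arithmetic in the solution can be verified directly, with the numbers taken from the quiz:

```python
import numpy as np

# Squared Frobenius norm of the standardized data and the first four singular values.
frob2 = 5780.0
s = np.array([40.1, 34.2, 28.1, 24.8])

# Per-component fraction of the total variation.
explained = s**2 / frob2
first_three = explained[:3].sum()
```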


Fishers Iris Data


3D scatter plot of Iris Data


Visualization of the PCA projections of the data



Quiz 3: PCA Cont. (Fall 2012)
The first and second principal component directions of the liver-dataset are

$$v_1 = \begin{bmatrix} -0.1404 \\ -0.1090 \\ -0.4115 \\ -0.4179 \\ -0.2468 \\ -0.2682 \\ -0.3009 \\ 0.2781 \\ 0.4375 \\ 0.3638 \end{bmatrix}, \qquad v_2 = \begin{bmatrix} -0.2859 \\ 0.0130 \\ 0.2510 \\ 0.2622 \\ 0.0525 \\ 0.4162 \\ 0.3927 \\ 0.4197 \\ 0.4323 \\ 0.3052 \end{bmatrix}$$

with entries ordered as AGE, GDR, TB, DB, AP, AlA, AsA, TP, AB, A/G.

Figure 1: Principal component 1 (PCA1) plotted against principal component 2 (PCA2).

In the figure, the data projected onto the first two principal components is plotted, and the colors indicate the presence of liver disease. Which of the following statements is correct?

A. Relatively high values of AGE, GDR, TB, DB, AP, AlA, and AsA and low values of TP, AB, and A/G will result in a positive projection onto the first principal component.
B. Relatively low values of the projection onto PCA1 and high values of the projection onto PCA2 indicates the subject does not have a liver disease.
C. PCA2 mainly discriminates between old subjects with low measurements of TB, DB, AlA, AsA, TP, AB, and A/G and young subjects with high values of TB, DB, AlA, AsA, TP, AB, and A/G.
D. The principal component directions are not guaranteed to be orthogonal to each other since the data has been standardized.
E. Don't know.
Solution:

AGE, GDR, TB, DB, AP, AlA, and AsA have negative coefficients of PCA1 whereas TP, AB, and A/G have positive coefficients, so the combination described results in a negative projection onto the first principal component, and A is incorrect. From the figure we observe that observations with low values of PCA1 and high values of PCA2 in general have a red dot, meaning they have a liver disease, so B is incorrect. For PCA2 we observe that AGE has a negative value whereas the remaining entries have positive values, while GDR and AP have small amplitudes. As a result, PCA2 mainly discriminates between young subjects with high measurements of TB, DB, AlA, AsA, TP, AB, and A/G and old subjects with low values of TB, DB, AlA, AsA, TP, AB, and A/G, hence C is correct. The principal component directions are always orthogonal to each other irrespective of the data preprocessing, so D is incorrect.

Visualization of hand written digits


PCA as compression
[Board note: add the mean back to obtain the actual reconstructions.]



Data and domain driven feature extraction



Resources

• Our online PCA demo, which highlights key concepts of PCA such as the effect of normalization, variance explained, and much more: http://www2.imm.dtu.dk/courses/02450/DemoPCA.html
• A great and more in-depth tutorial on PCA: https://arxiv.org/abs/1404.1100
• A great, animated recap of linear algebra: https://www.3blue1brown.com/essence-of-linear-algebra-page/