Note 4 Nov 2023


CS 464 Summary

Ch 2 Probability Review

Bayes Rule
$P(X \mid Y) = \dfrac{P(Y \mid X)\, P(X)}{P(Y)}$

Chain Rule

$P(X_1, X_2, \ldots, X_n) = P(X_1)\, P(X_2 \mid X_1) \cdots P(X_n \mid X_1, \ldots, X_{n-1})$

especially useful when there is conditional independence across variables

picking the right order can often make evaluating the probability much easier

Conditional independence

$X \perp Y \mid Z \iff P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)$

Ch 3 Estimation

Maximum Likelihood Estimator (MLE): chooses the $\theta$ that maximizes the probability of the observed data (the likelihood of the data), $P(D \mid \theta)$.

MLE estimate of $\theta$:
$\hat{\theta}_{MLE} = \arg\max_{\theta} P(D \mid \theta)$
A Bound From Hoeffding's Inequality

Let $N = \alpha_H + \alpha_T$ and $\hat{\theta} = \dfrac{\alpha_H}{\alpha_H + \alpha_T}$.
Let $\theta^*$ be the true parameter. For any $\epsilon > 0$:
$P(|\hat{\theta} - \theta^*| \ge \epsilon) \le 2 e^{-2 N \epsilon^2}$

Probably Approximately Correct (PAC)

I want to know the coin parameter $\theta$ within $\epsilon$ with probability at least $1 - \delta$.
How many flips, i.e., how big do I set $N$?
$P(|\hat{\theta} - \theta^*| \ge \epsilon) \le 2 e^{-2 N \epsilon^2} \le \delta$, e.g. $\delta = 0.05$
$-2 N \epsilon^2 \le \ln(\delta / 2)$
$N \ge \dfrac{\ln(2/\delta)}{2 \epsilon^2} \approx \dfrac{3.7}{2 \epsilon^2}$ for $\delta = 0.05$
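A minimal Python sketch of this bound, assuming a coin-flip setting; the bias value and the choices of $\epsilon$ and $\delta$ below are made up for illustration:

```python
import numpy as np

def pac_sample_size(eps, delta):
    # N >= ln(2/delta) / (2 * eps^2), from the Hoeffding bound above
    return int(np.ceil(np.log(2 / delta) / (2 * eps ** 2)))

N = pac_sample_size(eps=0.1, delta=0.05)      # about 185 flips
theta_true = 0.3                              # hypothetical true coin bias
flips = np.random.rand(N) < theta_true
theta_hat = flips.mean()                      # MLE: fraction of heads
print(N, abs(theta_hat - theta_true))         # error <= 0.1 with prob >= 0.95
```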
MLE of Gaussian Parameters

Probability of i.i.d. samples $D = \{x_1, \ldots, x_N\}$:
$P(D \mid \mu, \sigma) = \prod_{i=1}^{N} \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$

$(\hat{\mu}_{MLE}, \hat{\sigma}_{MLE}) = \arg\max_{\mu, \sigma} P(D \mid \mu, \sigma)$

$\hat{\mu}_{MLE} = \dfrac{1}{N} \sum_{i=1}^{N} x_i$

$\hat{\sigma}^2_{MLE} = \dfrac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu}_{MLE})^2$

The MLE for the variance of a Gaussian is biased: the expected value of the estimator is not equal to the true parameter. An unbiased variance estimator:
$\hat{\sigma}^2_{unbiased} = \dfrac{1}{N - 1} \sum_{i=1}^{N} (x_i - \hat{\mu}_{MLE})^2$
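A small NumPy check of the two variance estimators; the sample values are synthetic, just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10)   # small sample: the bias is visible

mu_mle = x.mean()
var_mle = np.mean((x - mu_mle) ** 2)                       # biased MLE: divides by N
var_unbiased = np.sum((x - mu_mle) ** 2) / (len(x) - 1)    # unbiased: divides by N-1

print(mu_mle, var_mle, var_unbiased)          # var_unbiased equals np.var(x, ddof=1)
```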
Maximum A Posteriori MAP Estimation

Our prior could be in the form of a probability distribution.

posterior: $p(\theta \mid D) = \dfrac{P(D \mid \theta)\, p(\theta)}{P(D)}$, where $P(D)$ is a normalization constant.

Beta Distribution
$p(\theta \mid \alpha, \beta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}, \quad 0 \le \theta \le 1,\ \alpha, \beta > 0$

When the data is sparse this allows us to fall back to the prior and
avoid the issues faced by MLE
When the data is abundant the likelihood will dominate the prior and the
prior will not have much of an effect on the posterior distribution
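A minimal sketch of MAP vs. MLE for a coin-flip (Bernoulli) likelihood with a Beta(α, β) prior; the counts and hyperparameters below are hypothetical. The posterior is Beta(heads + α, tails + β), and its mode gives the MAP estimate:

```python
def theta_mle(heads, tails):
    return heads / (heads + tails)

def theta_map(heads, tails, alpha, beta):
    # mode of the Beta posterior: Beta(heads + alpha, tails + beta)
    return (heads + alpha - 1) / (heads + tails + alpha + beta - 2)

# sparse data: 1 head, 0 tails -> MLE says theta = 1.0, MAP pulls toward the prior
print(theta_mle(1, 0))                        # 1.0
print(theta_map(1, 0, alpha=2, beta=2))       # (1 + 1) / (1 + 2) = 0.667

# abundant data: the likelihood dominates the prior
print(theta_map(600, 400, alpha=2, beta=2))   # ~0.600
```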

Chapter 4 Naive Bayes

Naive Bayes assumes:
$P(X_1, \ldots, X_n \mid Y) = \prod_{i} P(X_i \mid Y)$

Random-variable features $X_i$ and $X_j$ are conditionally independent of each other given the class label $Y$, for all $i \ne j$.

$y^{new} = \arg\max_{y_k} P(Y = y_k) \prod_{i} P(X_i^{new} \mid Y = y_k)$

To avoid underflow:
$y^{new} = \arg\max_{y_k} \log P(Y = y_k) + \sum_{i} \log P(X_i^{new} \mid Y = y_k)$

$y^{new} = \arg\max_{y_k} \log \hat{\pi}_k + \sum_{i} \log \hat{\theta}_{ijk}$
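A minimal log-space prediction sketch, assuming a bag-of-words style setup where `x_new` is a list of feature indices and the log-prior / log-likelihood tables have already been estimated (the names are illustrative, not from the lecture):

```python
import numpy as np

def nb_predict(x_new, log_priors, log_likelihoods):
    """x_new: list of observed feature indices (e.g. word ids).
    log_priors[k]         = log P(Y = y_k)
    log_likelihoods[k, i] = log P(X_i | Y = y_k)
    """
    scores = log_priors + log_likelihoods[:, x_new].sum(axis=1)
    return np.argmax(scores)   # argmax_k  log pi_k + sum_i log theta_ik
```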
Chapter 5 Feature Selection

The objective in classification / regression is to learn a function that relates values of features to the value of the outcome variable.

Often we are presented with many features.

Not all of these features are relevant.

Feature selection is the task of identifying an optimal set of features that are useful for accurately predicting the outcome variable.

Motivation For Feature Selection

Accuracy / Generalizability
Interpretability
Efficiency

Three Main Approaches

1 Treat feature selection as a separate task
Filtering-based feature selection
Wrapper-based feature selection

2 Embed feature selection into the task of learning a model
Regularization

3 Do not select features; instead, construct new features that effectively represent combinations of the original features
Dimensionality reduction

Feature Selection as a Separate Task

Filtering-based feature selection: all features → feature selection → subset of features → learning method → model.

Wrapper-based feature selection: the feature selection procedure calls the learning method many times and uses it to help select features; the chosen subset is then used to train the final model.
Scoring Features For Filtering
Mutual information, statistical tests (t-statistic, chi-square statistic), variance, frequency.
The chi-square statistic compares the frequencies of a term between different classes.

Information
Reduction in uncertainty / amount of surprise in the outcome:
$I(x) = \log_2 \dfrac{1}{p(x)} = -\log_2 p(x)$

If the probability of an event is small and it happens, the information is large.

Observing that the outcome of a coin flip is heads:
$I = -\log_2(1/2) = 1$ bit

The outcome of a die roll is 6:
$I = -\log_2(1/6) \approx 2.58$ bits

Entropy
The entropy of a random variable is the sum of the information provided by its possible values, weighted by the probability of each value.

Entropy is a measure of uncertainty:
$H(X) = -\sum_{x} p(x) \log_2 p(x)$

Mutual Information
$I(X; Y)$ is the reduction of uncertainty in one variable upon observation of the other variable.
A measure of statistical dependency between two random variables.

$H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$
$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y)$

The mutual information between a feature vector and the class label measures the amount by which the uncertainty in the class is decreased by knowledge of the feature.

Definition:
$I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log_2 \dfrac{p(x, y)}{p(x)\, p(y)}$
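A small NumPy sketch that evaluates this definition on a joint probability table; the example tables are made up:

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) = sum_{x,y} p(x,y) log2[ p(x,y) / (p(x) p(y)) ] for a joint table."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0                                   # 0 * log 0 = 0 by convention
    return np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz]))

# independent variables -> I = 0; perfectly dependent -> I = H(X)
print(mutual_information(np.array([[0.25, 0.25], [0.25, 0.25]])))  # 0.0
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))      # 1.0 bit
```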

t-statistic
Compares the means of a feature between two classes. The distribution of the t-statistic approaches the normal distribution as the number of samples grows.

Forward Selection vs Backward Elimination

Both use a hill-climbing search.

Forward selection: efficient for choosing a small number of features; misses features whose usefulness requires other features.

Backward elimination: efficient for discarding a small subset of the features; preserves features whose usefulness requires other features.
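A minimal sketch of the forward-selection hill climb; the `score` function (e.g. cross-validated accuracy of a model trained on the candidate subset) is assumed to be supplied by the user:

```python
def forward_selection(all_features, score, k):
    """Greedy hill climbing: at each step add the feature that improves `score` most.
    `score(feature_subset)` is assumed to train/evaluate a model and return a number."""
    selected = []
    for _ in range(k):
        remaining = [f for f in all_features if f not in selected]
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
    return selected
```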

Embedded Methods Regularization

Ridge Regression
$E(w) = \sum_{d} \left( y^{(d)} - w_0 - \sum_{i} x_i^{(d)} w_i \right)^2 + \lambda \sum_{i} w_i^2$   (L2 penalty)

LASSO
$E(w) = \sum_{d} \left( y^{(d)} - w_0 - \sum_{i} x_i^{(d)} w_i \right)^2 + \lambda \sum_{i} |w_i|$   (L1 penalty)

L1 and L2 penalties can be used with other learning methods: logistic regression, neural nets, SVMs, etc.
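A minimal closed-form ridge sketch in NumPy, using $w = (X^T X + \lambda I)^{-1} X^T y$; the data are synthetic, and for simplicity the bias term is included as a column of $X$ and penalized along with the other weights:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])       # bias column + 3 features
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + 0.1 * rng.normal(size=50)
print(ridge_fit(X, y, lam=0.0))    # ordinary least squares
print(ridge_fit(X, y, lam=10.0))   # weights shrink toward zero
```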

Chapter 6 Feature Extraction

PCA Applications
Noise reduction, data visualization, data compression.

How could we find the smallest subspace that keeps the most information about the original data?
A solution: Principal Component Analysis.

Feature Extraction

Rather than picking a subset of features $x_1, x_2, \ldots, x_n$, create new features from the existing ones:
$z_1 = w_0^{(1)} + \sum_{i} w_i^{(1)} x_i$
$\vdots$
$z_k = w_0^{(k)} + \sum_{i} w_i^{(k)} x_i$

Principal Component Analysis

PCA vectors originate from the center of mass.

Principal component 1 points in the direction of the largest variance.

Each subsequent principal component is orthogonal to the previous ones and points in the direction of the largest variance of the residual subspace.

PCA Algorithm
$\mathrm{Cov}(X, Y) = \dfrac{1}{N} \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)$

Compute the covariance matrix $\Sigma$.

PCA basis vectors = the eigenvectors of $\Sigma$.

Larger eigenvalue $\Rightarrow$ more important eigenvector.

Steps in PCA
Mean-center the data.

Compute the covariance matrix (or the scatter matrix).

Calculate the eigenvalues and eigenvectors of the covariance matrix.

Eigenvector with the largest eigenvalue $\lambda_1$ is the 1st principal component (PC).

Eigenvector with the k-th largest eigenvalue $\lambda_k$ is the k-th PC.

Proportion of variance captured by the k-th PC: $\lambda_k / \sum_{i} \lambda_i$
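A minimal NumPy sketch of these steps via eigendecomposition of the covariance matrix; whether $1/N$ or $1/(N-1)$ is used only rescales the eigenvalues, not the directions:

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the covariance matrix (minimal sketch)."""
    Xc = X - X.mean(axis=0)                      # mean-center the data
    cov = Xc.T @ Xc / len(X)                     # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenpairs (ascending order)
    order = np.argsort(eigvals)[::-1]            # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()          # proportion of variance per PC
    Z = Xc @ eigvecs[:, :k]                      # project onto the top-k PCs
    return Z, eigvecs[:, :k], explained
```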

Scaling Up
The covariance matrix can be really big.
Use singular value decomposition (SVD): takes input $X$ and finds the top $k$ eigenvectors.

The SVD of an $m \times n$ matrix $A$ is given by the formula $A = U S V^T$, where:

$U$: $m \times m$ matrix of the orthonormal eigenvectors of $A A^T$.

$V^T$: transpose of an $n \times n$ matrix containing the orthonormal eigenvectors of $A^T A$.

$S$: diagonal matrix with $r$ elements equal to the square roots of the positive eigenvalues of $A A^T$ or $A^T A$ (both matrices have the same positive eigenvalues anyway).
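A quick NumPy check of the factorization on a random matrix; the rows of $V^T$ are eigenvectors of $A^T A$, i.e. the PCA directions when $A$ is the mean-centered data matrix:

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(6, 4))   # an example m x n matrix
U, s, Vt = np.linalg.svd(A, full_matrices=True)    # A = U S V^T
S = np.zeros_like(A)
np.fill_diagonal(S, s)                             # s holds the singular values
print(np.allclose(A, U @ S @ Vt))                  # True: the factorization reconstructs A
```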
PCA Summary

PCA
sorts dimensions in order of importance

Applications
get a compact description
remove noise
improve classification (hopefully)

Not magic
does not know class labels
can only capture linear variations

One of many tricks to reduce dimensionality

Chapter 7 Performance Metrics

Model Selection and Validation

Learning typically involves trying out different models: algorithms, parameters, feature sets, etc.

How do we select the best model among different models?

Types of Errors Confusion Matrix

Confusion matrix (classifier prediction vs. actual label):
True Positive (TP): predicted positive, actually positive
False Positive (FP): predicted positive, actually negative (Type I error)
False Negative (FN): predicted negative, actually positive (Type II error)
True Negative (TN): predicted negative, actually negative

Accuracy
$\mathrm{Accuracy} = \dfrac{TP + TN}{TP + FP + FN + TN}$
Precision Recall F Measure

Precision: fraction of true positives among all positive predictions.
$P = \dfrac{TP}{TP + FP}$

Recall (Sensitivity, TPR): fraction of true positive predictions among all positive samples.
$R = \dfrac{TP}{TP + FN}$

Specificity (TNR): fraction of true negative predictions among all negative samples.
$\mathrm{TNR} = \dfrac{TN}{TN + FP}$

F-Measure: harmonic mean of precision and recall.
$F = \dfrac{2 P R}{P + R}$

F-Measure can be weighted to favor precision or recall:
$F_\beta = \dfrac{(1 + \beta^2)\, P R}{\beta^2 P + R}$, where $\beta > 1$ favors recall and $\beta < 1$ favors precision.
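A small helper that computes these metrics from hypothetical confusion-matrix counts:

```python
def prf(tp, fp, fn, tn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                       # sensitivity / TPR
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_beta, accuracy

# hypothetical counts
print(prf(tp=80, fp=20, fn=40, tn=860))   # P=0.80, R=0.667, F1=0.727, acc=0.94
```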

Micro vs Macro Averaging

A macro average just averages the individually calculated scores of each class.
Weights each class equally.

A micro average calculates the metric by first pooling all instances of each class.
Weights each instance equally.

A macro average can be dominated by minority classes.
A micro average is dominated by the majority classes, so it is not very sensitive to performance on minority classes.

If micro avg << macro avg: high misclassification rate for the majority class.

If macro avg << micro avg: high misclassification rate for minority classes.
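A small sketch contrasting the two averages on hypothetical per-class counts (one large class, one small class):

```python
import numpy as np

# hypothetical per-class counts: a large class and a small class
classes = {"majority": dict(tp=900, fp=50, fn=50),
           "minority": dict(tp=10,  fp=40, fn=40)}

def f1(tp, fp, fn):
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

macro = np.mean([f1(**c) for c in classes.values()])   # average of per-class scores
pooled = {k: sum(c[k] for c in classes.values()) for k in ("tp", "fp", "fn")}
micro = f1(**pooled)                                   # pool all counts, then score
print(macro, micro)   # micro >> macro here: the minority class is poorly classified
```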
Chapter 8 Model Selection Validation

Generalization

Definition: the model does a good job of correctly predicting class labels of previously unseen samples.

Evaluating Generalization Requires

An unseen dataset with known correct labels.

A quantitative measure (metric) of the tendency of the model to predict correct labels.

Optimizing Model Complexity
Most learning algorithms have knobs that adjust the model complexity:

Regression: order of the polynomial

Naive Bayes: number of features

Decision trees: number of nodes in the tree

kNN: number of nearest neighbors

SVM: kernel type, cost parameter

Model Selection
Extra-sample error estimates:

Train/test split (most used method)
Cross-validation
Bootstrap
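A minimal k-fold cross-validation sketch; the `fit` and `evaluate` callables (train a model, score it on held-out data) are assumed to be supplied by the user:

```python
import numpy as np

def k_fold_scores(X, y, k, fit, evaluate):
    """k-fold cross-validation sketch. `fit(X_tr, y_tr)` returns a model and
    `evaluate(model, X_te, y_te)` returns a score; both are assumed given."""
    idx = np.random.permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        scores.append(evaluate(model, X[test], y[test]))
    return np.mean(scores), np.std(scores)
```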
Chapter 9 Linear Regression
Measure of Error

We can measure the prediction loss in terms of squared error. Loss on one example:
$\mathrm{Loss}(\hat{y}, y) = (y - \hat{y})^2$

Loss on n training examples:
$J_n(w) = \dfrac{1}{n} \sum_{i=1}^{n} \left( y_i - f(x_i; w) \right)^2$

$f: \mathbb{R}^d \to \mathbb{R}, \quad f(x; w) = w_0 + w_1 x_1 + \cdots + w_d x_d$

Minimize the Squared Loss

$J_n(w) = \dfrac{1}{n} \sum_{i=1}^{n} \left( y_i - f(x_i; w) \right)^2 = \dfrac{1}{n} \sum_{i=1}^{n} \left( y_i - w_0 - \sum_{j=1}^{d} w_j x_{ij} \right)^2$

To get the optimal parameter values, take the derivative and set it to 0:
$\nabla_w J_n(w) = 0$

$w = (X^T X)^{-1} X^T y$

Numerical Solution

Matrix inversion is computationally very expensive.

Using the analytical form to compute the optimal solution may not be feasible even for moderate values of n.

Also, it is only possible if $X^T X$ is not singular (the multicollinearity problem):

determinant is zero
not full rank

Gradient Descent

A general algorithm for optimization. Components:
Initialization: initialize $\theta$.
Step size $\eta$ (can change as a function of the iteration).
Gradient direction.
Stopping condition.

Algorithm:
initialize $\theta$
do: $\theta \leftarrow \theta - \eta\, \nabla_\theta J(\theta)$
while $\|\nabla_\theta J(\theta)\| > \epsilon$ (stopping condition)

Gradient vector: $\nabla_\theta J = \left[ \dfrac{\partial J}{\partial \theta_1}, \ldots, \dfrac{\partial J}{\partial \theta_d} \right]^T$
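A minimal NumPy sketch of gradient descent on the squared loss for linear regression, compared against the closed-form solution; the data are synthetic and the step size is chosen by hand:

```python
import numpy as np

def gd_linear_regression(X, y, eta=0.01, eps=1e-6, max_iter=10_000):
    """Gradient descent on J(w) = (1/n) sum_i (y_i - X_i w)^2.
    X is assumed to already contain a bias column."""
    n, d = X.shape
    w = np.zeros(d)                                # initialization
    for _ in range(max_iter):
        grad = -2.0 / n * X.T @ (y - X @ w)        # gradient vector dJ/dw
        w -= eta * grad                            # step: w <- w - eta * grad
        if np.linalg.norm(grad) < eps:             # stopping condition
            break
    return w

# compare with the closed-form solution w = (X^T X)^(-1) X^T y
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
y = X @ np.array([0.5, 2.0, -1.0]) + 0.01 * rng.normal(size=100)
print(gd_linear_regression(X, y))
print(np.linalg.lstsq(X, y, rcond=None)[0])
```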
