Note 4 Nov 2023
Ch 2 Probability Review
Bayes Rule
P(y | x) = P(x | y) P(y) / P(x),  where P(x) = Σ_y P(x | y) P(y)
Chain Rule
Picking the right order can often make evaluating the probability much easier.
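The general factorization (standard chain rule, stated here for completeness):
P(x_1, x_2, …, x_n) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) ⋯ P(x_n | x_1, …, x_{n-1})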
Conditional independence
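For reference, the standard definition: X is conditionally independent of Y given Z when P(X, Y | Z) = P(X | Z) P(Y | Z), or equivalently P(X | Y, Z) = P(X | Z).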
Ch 3 Estimation
MLE estimate of θ
θ_MLE = argmax_θ P(D | θ)
A Bound: From Hoeffding's inequality,
P(|θ_MLE − θ*| ≥ ε) ≤ 2 e^(−2Nε²) ≤ δ
⇒ N ≥ ln(2/δ) / (2ε²);  for δ = 0.05 this gives N ≥ ln(2/0.05) / (2ε²) ≈ 3.7 / (2ε²)
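A quick numeric check of the sample-size bound (the values of ε and δ below are illustrative, not from the notes):

```python
import math

def hoeffding_sample_size(epsilon, delta):
    """Smallest N with 2*exp(-2*N*epsilon**2) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

print(hoeffding_sample_size(epsilon=0.1, delta=0.05))  # 185
```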
MLE of Gaussian Parameters
p(D | μ, σ²) = ∏_{i=1}^N (1 / √(2πσ²)) exp(−(x_i − μ)² / (2σ²))
μ_MLE = (1/N) Σ_{i=1}^N x_i
σ²_MLE = (1/N) Σ_{i=1}^N (x_i − μ_MLE)²
The MLE for the variance of a Gaussian is biased: the expected value of the estimator is not equal to the true parameter. An unbiased variance estimator:
σ²_unbiased = (1 / (N − 1)) Σ_{i=1}^N (x_i − μ_MLE)²
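A small simulation sketch (synthetic data with assumed parameters) showing the bias of the 1/N estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, true_var, N, trials = 0.0, 4.0, 5, 100_000

biased, unbiased = [], []
for _ in range(trials):
    x = rng.normal(true_mu, np.sqrt(true_var), size=N)
    biased.append(np.var(x, ddof=0))    # MLE: divide by N
    unbiased.append(np.var(x, ddof=1))  # divide by N - 1

print(np.mean(biased))    # ~3.2: underestimates sigma^2 = 4 by the factor (N-1)/N
print(np.mean(unbiased))  # ~4.0
```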
Maximum A Posteriori (MAP) Estimation
Beta Distribution
p(θ | α, β) ∝ θ^(α−1) (1 − θ)^(β−1),  0 ≤ θ ≤ 1
When the data is sparse, this allows us to fall back to the prior and avoid the issues faced by MLE.
When the data is abundant, the likelihood will dominate the prior, and the prior will not have much of an effect on the posterior distribution.
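A minimal sketch of MAP vs MLE for a coin with a Beta(α, β) prior (the hyperparameters and flip counts below are assumed for illustration):

```python
# With n1 heads out of n flips and a Beta(alpha, beta) prior on theta:
#   theta_MLE = n1 / n
#   theta_MAP = (n1 + alpha - 1) / (n + alpha + beta - 2)   (posterior mode)
def bernoulli_mle(n1, n):
    return n1 / n

def bernoulli_map(n1, n, alpha=2, beta=2):
    return (n1 + alpha - 1) / (n + alpha + beta - 2)

# Sparse data: 3 heads out of 3 flips.
print(bernoulli_mle(3, 3))  # 1.0 -- overconfident
print(bernoulli_map(3, 3))  # 0.8 -- pulled back toward the prior mean of 0.5
```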
Naive Bayes assumption (features conditionally independent given the class):
P(x_1, …, x_n | y) = ∏_i P(x_i | y)
Decision rule:
y_new = argmax_{y_k} [ log π_k + Σ_i log θ_{ijk} ]
where π_k = P(Y = y_k) and θ_{ijk} = P(X_i = x_j | Y = y_k).
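A compact sketch of this log-space decision rule with assumed toy parameters (binary features, two classes):

```python
import numpy as np

log_pi = np.log(np.array([0.6, 0.4]))   # log P(Y = y_k), assumed values
theta = np.array([[0.8, 0.1, 0.7],      # P(X_i = 1 | Y = 0)
                  [0.3, 0.9, 0.2]])     # P(X_i = 1 | Y = 1)

def predict(x):
    # log P(x_i | y_k) for binary features: x*log(theta) + (1-x)*log(1-theta)
    log_lik = x * np.log(theta) + (1 - x) * np.log(1 - theta)
    return int(np.argmax(log_pi + log_lik.sum(axis=1)))

print(predict(np.array([1, 0, 1])))  # -> 0
```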
Ch 5 Feature Selection
Why feature selection (instead of using all features)?
Accuracy (generalizability), Interpretability, Efficiency
Information: I(x) = −log₂ P(x)
Example: the outcome of a die roll is 6 → I = −log₂(1/6) = log₂ 6 ≈ 2.58 bits
Entropy
The entropy of a random variable is the sum of the information provided by its possible values, weighted by the probability of each value:
H(X) = −Σ_x P(x) log₂ P(x)
Mutual information
I(X; Y) is the reduction of uncertainty in one variable upon observation of the other variable.
A measure of statistical dependency between two random variables.
[Venn diagram: H(X) and H(Y) overlapping; the overlap is I(X; Y)]
I(X; Y) = H(X) + H(Y) − H(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X)
The mutual information between the feature vector and the class label measures the amount by which the uncertainty in the class is decreased by knowledge of the feature.
Definition:
I(X; Y) = Σ_x Σ_y p(x, y) log₂ [ p(x, y) / (p(x) p(y)) ]
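A small sketch computing entropy and mutual information from a joint probability table (the table values are assumed for illustration):

```python
import numpy as np

def entropy(p):
    """H = -sum p log2 p over nonzero entries."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint probability table."""
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)
    return entropy(px) + entropy(py) - entropy(joint)

joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
print(mutual_information(joint))  # ~0.28 bits; 0 would mean X and Y are independent
```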
t statistic
t = (x̄_1 − x̄_2) / √(s_1²/n_1 + s_2²/n_2)
The distribution of t approaches the normal distribution as the number of samples grows.
Forward Selection vs Backward Elimination
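A minimal sketch of greedy forward selection; score(features) is a placeholder for whatever criterion is used (e.g. cross-validated accuracy):

```python
def forward_selection(all_features, score, max_features=None):
    """Greedily add the feature that most improves score(selected)."""
    selected, remaining = [], list(all_features)
    best = score(selected)
    while remaining and (max_features is None or len(selected) < max_features):
        candidates = [(score(selected + [f]), f) for f in remaining]
        new_best, best_f = max(candidates, key=lambda t: t[0])
        if new_best <= best:          # stop when no feature helps
            break
        selected.append(best_f)
        remaining.remove(best_f)
        best = new_best
    return selected
```

Backward elimination is the mirror image: start from all features and repeatedly drop the one whose removal hurts the score least.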
Ridge Regression
E(w) = Σ_d ( y^(d) − (w_0 + Σ_i x_i^(d) w_i) )² + λ Σ_i w_i²   (L2 penalty)
LASSO
E(w) = Σ_d ( y^(d) − (w_0 + Σ_i x_i^(d) w_i) )² + λ Σ_i |w_i|   (L1 penalty)
L1 and L2 penalties can be used with other learning methods: logistic regression, neural nets, SVMs, etc.
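A short sketch contrasting the two penalties with scikit-learn (the synthetic data and the regularization strengths are assumed):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: only 2 of 10 features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all weights toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1: drives irrelevant weights exactly to zero

print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))      # most entries come out exactly 0.0
```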
PCA Applications
Noise reduction, data visualization, data compression.
How could we find the smallest subspace that keeps the most information about the original data?
A solution: Principal Component Analysis.
Feature Extraction
z_1 = w_{1,0} + Σ_i w_{1,i} x_i
⋮
z_K = w_{K,0} + Σ_i w_{K,i} x_i
Cov(X, Y) = (1/N) Σ_i (x_i − μ_x)(y_i − μ_y)
Steps in PCA
1. Mean-center the data
2. Compute the covariance matrix
3. Find its top k eigenvectors (the principal components)
4. Project the data onto those eigenvectors
Scaling Up
The covariance matrix can be really big.
Use singular value decomposition (SVD): takes input X and finds the top k eigenvectors.
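A minimal sketch of PCA via SVD on mean-centered data (the toy data is assumed):

```python
import numpy as np

def pca_svd(X, k):
    """Project X onto its top-k principal components using SVD."""
    X_centered = X - X.mean(axis=0)          # 1. mean-center
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]                      # rows = principal directions
    Z = X_centered @ components.T            # projected (n, k) representation
    return Z, components

X = np.random.default_rng(0).normal(size=(200, 5))
Z, W = pca_svd(X, k=2)
print(Z.shape, W.shape)  # (200, 2) (2, 5)
```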
PCA
Applications: get a compact description, remove noise, improve classification (hopefully).
Not magic: does not know class labels; can only capture linear variations.
Confusion matrix (rows: classifier output, columns: actual label)
  Classifier positive:  True Positive | False Positive (Type I error)
  Classifier negative:  False Negative (Type II error) | True Negative
Accuracy = (TP + TN) / (TP + FP + FN + TN)
Precision, Recall, F-Measure
Precision: P = TP / (TP + FP)
Recall (Sensitivity, TPR): fraction of true positive predictions among all positive samples.
R = TP / (TP + FN)
Specificity (TNR): fraction of true negative predictions among all negative samples.
F = 2PR / (P + R)
Micro-average: weights each instance equally.
Macro-average: weights each class equally; can be dominated by minority classes.
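A quick sketch of micro vs macro averaging with scikit-learn (the labels are assumed toy data with a rare class):

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 2, 0]

print(f1_score(y_true, y_pred, average="micro"))  # 0.80: every instance counts equally
print(f1_score(y_true, y_pred, average="macro"))  # ~0.77: the rare class's lower F1
                                                  # pulls the per-class average down
```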
Generalization
Optimizing Model Complexity
Most learning algorithms have knobs that adjust the model complexity.
Model Selection
Extra sample error estimates
We can measure the prediction loss in terms of squared error. Loss on one example:
Loss(ŷ, y) = (ŷ − y)²
Loss on n training examples:
J_n(w) = (1/n) Σ_{i=1}^n ( y_i − f(x_i; w) )²
f: ℝ^d → ℝ,  f(x; w) = w_0 + w_1 x_1 + … + w_d x_d
Minimize the Squared Loss
J_n(w) = (1/n) Σ_{i=1}^n ( y_i − f(x_i; w) )²
       = (1/n) Σ_{i=1}^n ( y_i − w_0 − Σ_j w_j x_{ij} )²
Set the gradient to zero: ∇_w J_n(w) = 0
⇒ w = (XᵀX)⁻¹ Xᵀ y
Numerical Solution
Needed when XᵀX cannot be inverted directly: determinant is zero, not full rank.
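A sketch of the closed-form solution, alongside the pseudoinverse / least-squares routines that also handle the rank-deficient case (the toy data is assumed; the leading column of ones provides the intercept w_0):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=50)

# Normal equations: w = (X^T X)^{-1} X^T y  (requires X^T X to be full rank)
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Pseudoinverse / lstsq work even when X^T X is singular.
w_pinv = np.linalg.pinv(X) @ y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(w_normal, 2), np.round(w_pinv, 2), np.round(w_lstsq, 2))
```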
Gradient Descent
1. Initialization: initialize θ
2. Do: θ ← θ − η ∇J(θ)
   (the step size η can change as a function of the iteration; the update moves in the negative gradient direction)
3. While: ||∇J(θ)|| > ε  (stopping condition)
Gradient vector: ∇J = [ ∂J/∂θ_1, …, ∂J/∂θ_d ]ᵀ
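A minimal sketch of gradient descent applied to the squared loss above (the learning rate, tolerance, and toy data are assumed):

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, eps=1e-6, max_iters=10_000):
    """Minimize J(w) = (1/n) * ||y - X w||^2 by gradient descent."""
    n, d = X.shape
    w = np.zeros(d)                               # 1. initialize
    for _ in range(max_iters):
        grad = -(2.0 / n) * X.T @ (y - X @ w)     # gradient of the squared loss
        w -= eta * grad                           # 2. step against the gradient
        if np.linalg.norm(grad) < eps:            # 3. stopping condition
            break
    return w

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=100)
print(np.round(gradient_descent(X, y), 2))  # close to [1.0, 2.0, -3.0]
```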