ML Cheatsheet
Each classifier below is summarised along the same dimensions: what it does, its objective function, training, regularisation, complexity, non-linear variants, and online learning.
k-nearest neighbour

What it does: The label of a new point $\hat{x}$ is classified with the most frequent label $\hat{t}$ of the k nearest training instances:

$\hat{t} = \arg\max_{C} \sum_{i\,:\,x_i \in N_k(\mathbf{x}, \hat{x})} \delta(t_i, C)$

. . . where $N_k(\mathbf{x}, \hat{x})$ are the k points in $\mathbf{x}$ closest to $\hat{x}$; closeness is measured by the Euclidean distance $\sqrt{\sum_{i=1}^{D}(x_i - \hat{x}_i)^2}$; and $\delta(a, b) = 1$ if $a = b$, 0 otherwise.

Training: No optimisation needed. Use cross-validation to learn the appropriate k; otherwise there is no training, and classification is based on the existing points.

Regularisation: k acts to regularise the classifier: as k → N the boundary becomes smoother.

Complexity: O(NM) space complexity, since all training instances and all their features need to be kept in memory.

Non-linear, online learning: To be added.
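Below is a minimal Python/NumPy sketch of the classification rule above, assuming numeric features and discrete labels; the function name knn_predict and its defaults are illustrative additions, not part of the original sheet.

import numpy as np
from collections import Counter

def knn_predict(X_train, t_train, x_new, k=3):
    # Euclidean distances from x_new to every stored training point
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the k nearest training instances, N_k(x, x_new)
    nearest = np.argsort(dists)[:k]
    # most frequent label t among the k neighbours
    return Counter(t_train[nearest].tolist()).most_common(1)[0][0]

As noted under Training, k itself would be chosen by cross-validation.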
Naive Bayes

What it does: Learn p(Ck|x) by modelling p(x|Ck) and p(Ck), using Bayes' rule to infer the class-conditional probability. Assumes each feature is independent of all others, ergo "Naive". Classify with $\hat{t} = \arg\max_{C_k} p(C_k)\, p(x|C_k)$.

Objective function (likelihoods):

Multivariate likelihood: $\log p(x|C_k) = \sum_{i=1}^{D}\log p(x_i|C_k)$, with
$p_{\mathrm{MLE}}(x_i = v|C_k) = \dfrac{\sum_{j=1}^{N}\delta(t_j = C_k \wedge x_{ji} = v)}{\sum_{j=1}^{N}\delta(t_j = C_k)}$

Multinomial likelihood: $p(x|C_k) = \prod_{i=1}^{D} p(\mathrm{word}_i|C_k)^{x_i}$, with
$p_{\mathrm{MLE}}(\mathrm{word}_i = v|C_k) = \dfrac{\sum_{j=1}^{N}\delta(t_j = C_k)\, x_{ji}}{\sum_{j=1}^{N}\sum_{d=1}^{D}\delta(t_j = C_k)\, x_{jd}}$
. . . where $x_{ji}$ is the count of word i in example j.

Gaussian likelihood: $p(x|C_k) = \prod_{i=1}^{D}\mathcal{N}(x_i;\, \mu_{ik}, \sigma_{ik})$

Training: No optimisation needed.

Regularisation: Use a Dirichlet prior on the parameters to obtain a MAP estimate:

Multivariate: $p_{\mathrm{MAP}}(x_i = v|C_k) = \dfrac{(\alpha_i - 1) + \sum_{j=1}^{N}\delta(t_j = C_k \wedge x_{ji} = v)}{|x_i|(\alpha_i - 1) + \sum_{j=1}^{N}\delta(t_j = C_k)}$

Multinomial: $p_{\mathrm{MAP}}(\mathrm{word}_i = v|C_k) = \dfrac{(\alpha_i - 1) + \sum_{j=1}^{N}\delta(t_j = C_k)\, x_{ji}}{\sum_{j=1}^{N}\sum_{d=1}^{D}\delta(t_j = C_k)\, x_{jd} - D + \sum_{d=1}^{D}\alpha_d}$

Complexity: O(NM), since each training instance must be visited and each of its features counted.

Non-linear: Can only learn linear boundaries for multivariate/multinomial attributes. With Gaussian attributes, quadratic boundaries can be learned with uni-modal distributions.

Online learning: To be added.
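A rough Python/NumPy sketch of the multinomial variant, assuming X is a word-count matrix and every class appears in the training data; train_multinomial_nb and predict_nb are hypothetical names. The symmetric Dirichlet parameter alpha maps onto the MAP estimate above (alpha = 1 recovers the MLE).

import numpy as np

def train_multinomial_nb(X, t, n_classes, alpha=2.0):
    # X: (N, D) matrix of word counts, t: (N,) integer class labels.
    # alpha = 1 gives the MLE; alpha = 2 gives add-one smoothing via the MAP estimate.
    N, D = X.shape
    log_prior = np.zeros(n_classes)
    log_lik = np.zeros((n_classes, D))
    for k in range(n_classes):
        Xk = X[t == k]
        log_prior[k] = np.log(len(Xk) / N)             # p(C_k) by counting
        counts = Xk.sum(axis=0) + (alpha - 1.0)        # numerator of p_MAP(word_i | C_k)
        log_lik[k] = np.log(counts / counts.sum())     # normalise over the vocabulary
    return log_prior, log_lik

def predict_nb(log_prior, log_lik, x):
    # arg max_k  log p(C_k) + sum_i x_i log p(word_i | C_k)
    return int(np.argmax(log_prior + log_lik @ x))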
Log-linear

What it does: Estimate p(Ck|x) directly, by assuming a maximum-entropy distribution and optimising an objective function over the conditional entropy distribution:

$p(C_k|x) = \frac{1}{Z_\lambda(x)}\, e^{\sum_m \lambda_m \phi_m(x, C_k)}$

Objective function: The negative conditional log-likelihood,

$L_{\mathrm{MLE}}(\lambda, D) = \arg\min_\lambda\, -\!\!\sum_{(x,t)\in D}\!\log p(t|x) = \arg\min_\lambda \sum_{(x,t)\in D}\Big[\log Z_\lambda(x) - \sum_m \lambda_m\phi_m(x, t)\Big]$

with gradient

$\dfrac{\partial L_{\mathrm{MLE}}}{\partial \lambda_m} = \sum_{(x,t)\in D}\Big[E_{p(C_k|x)}[\phi_m(x, \cdot)] - \phi_m(x, t)\Big]$

Training: Gradient descent (or gradient ascent if maximising the objective):

$\lambda^{n+1} = \lambda^n - \eta\nabla L$

. . . where η is the step parameter.

Regularisation: Penalise large values for the λ parameters by introducing a prior distribution over them (typically a Gaussian):

$L_{\mathrm{MAP}}(\lambda, D, \sigma) = \arg\min_\lambda\, -\log p(\lambda) - \!\!\sum_{(x,t)\in D}\!\log p(t|x) = \arg\min_\lambda \sum_m \dfrac{(\lambda_m - 0)^2}{2\sigma^2} - \!\!\sum_{(x,t)\in D}\!\log p(t|x)$

with gradient

$\dfrac{\partial L_{\mathrm{MAP}}}{\partial \lambda_m} = \dfrac{\lambda_m}{\sigma^2} + \sum_{(x,t)\in D}\Big[E[\phi_m(x, \cdot)] - \phi_m(x, t)\Big]$

Complexity: O(INMK), since each training instance must be visited and each combination of class and features must be calculated for the appropriate feature mapping.

Non-linear: Reformulate the class-conditional distribution in terms of a kernel K(x, x′), and use a non-linear kernel (for example $K(x, x') = (1 + x^T x')^2$). By the Representer Theorem:

$p(C_k|x) = \frac{1}{Z_\lambda(x)}\, e^{\lambda^T\phi(x,C_k)} = \frac{1}{Z_\lambda(x)}\, e^{\sum_{n=1}^{N}\sum_{i=1}^{K}\alpha_{nk}\,\phi(x_n, C_i)^T\phi(x, C_k)} = \frac{1}{Z_\lambda(x)}\, e^{\sum_{n=1}^{N}\sum_{i=1}^{K}\alpha_{nk}\, K((x_n, C_i), (x, C_k))} = \frac{1}{Z_\lambda(x)}\, e^{\sum_{n=1}^{N}\alpha_{nk}\, K(x_n, x)}$

Online learning: Online Gradient Descent: update the parameters using GD after seeing each training instance.
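A sketch of batch gradient descent on the MAP objective, simplified so that the feature map φ(x, Ck) becomes one weight vector per class over shared features φ(x); the function names and hyper-parameters (eta, sigma2, iters) are illustrative assumptions, not part of the sheet.

import numpy as np

def train_loglinear(Phi, t, n_classes, eta=0.1, sigma2=10.0, iters=200):
    # Phi: (N, M) feature vectors, t: (N,) integer class labels.
    # Gradient descent on the MAP objective with a zero-mean Gaussian prior of variance sigma2.
    N, M = Phi.shape
    Lam = np.zeros((n_classes, M))
    T = np.eye(n_classes)[t]                         # one-hot targets, shape (N, K)
    for _ in range(iters):
        scores = Phi @ Lam.T                         # lambda^T phi(x, C_k) for every class
        P = np.exp(scores - scores.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)            # p(C_k | x), the softmax of the scores
        # gradient: E[phi(x, .)] - phi(x, t) summed over the data, plus the prior term lambda / sigma^2
        grad = (P - T).T @ Phi + Lam / sigma2
        Lam -= eta * grad / N                        # lambda <- lambda - eta * gradient
    return Lam

def predict_loglinear(Lam, phi_x):
    return int(np.argmax(Lam @ phi_x))               # arg max_k lambda^T phi(x, C_k)

Updating Lam inside a loop over single instances instead of the full batch gives the Online Gradient Descent variant mentioned above.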
Perceptron

What it does: Binary, linear classifier: $y(x) = \mathrm{sign}(w^T x)$

. . . where $\mathrm{sign}(x) = +1$ if $x \ge 0$, $-1$ if $x < 0$.

Directly estimates the linear function y(x) by iteratively updating the weight vector whenever a training instance is incorrectly classified.

Objective function: Tries to minimise the error function, i.e. the number of incorrectly classified input vectors:

$\arg\min_w E_P(w) = \arg\min_w\, -\!\sum_{n\in\mathcal{M}} w^T x_n t_n$

. . . where $\mathcal{M}$ is the set of misclassified training instances.

Training: Iterate over each training example $x_n$, and update the weight vector on misclassification:

$w^{i+1} = w^i - \eta\nabla E_P(w) = w^i + \eta\, x_n t_n$

. . . where typically η = 1. For the multiclass perceptron:

$w^{i+1} = w^i + \phi(x, t) - \phi(x, y(x))$

Regularisation: The Voted Perceptron: run the perceptron i times and store each iteration's weight vector. Then classify with:

$y(x) = \mathrm{sign}\Big(\sum_i c_i\, \mathrm{sign}(w_i^T x)\Big)$

Complexity: O(INML), since each combination of instance, class and features must be calculated (see log-linear).

Non-linear: Use a kernel K(x, x′), and one weight per training instance:

$y(x) = \mathrm{sign}\Big(\sum_{n=1}^{N} w_n t_n K(x, x_n)\Big)$

Online learning: The update is applied one instance at a time, so the perceptron is online by default.
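A minimal sketch of the training loop above, assuming labels in {-1, +1}; the epochs cap is an illustrative addition so the loop terminates on non-separable data.

import numpy as np

def train_perceptron(X, t, eta=1.0, epochs=10):
    # X: (N, M) inputs, t: (N,) labels in {-1, +1}.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:          # misclassified (the boundary counts as an error)
                w += eta * t_n * x_n          # w <- w + eta * x_n * t_n
    return w

Since the update fires immediately after each instance, the same loop works in the online setting; the kernelised version instead keeps one weight per training instance, as in the formula above.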
Support vector machines

What it does: A maximum-margin classifier: finds the separating hyperplane with the maximum margin to its closest data points. Classify with:

$y(x) = \sum_{n=1}^{N}\alpha_n t_n\, x^T x_n + w_0$

Objective function:

Primal: $\arg\min_{w, w_0}\ \frac{1}{2}\|w\|^2$ s.t. $t_n(w^T x_n + w_0) \ge 1$ for all n

Dual: $L(\alpha) = \sum_{n=1}^{N}\alpha_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\alpha_n\alpha_m t_n t_m\, x_n^T x_m$ s.t. $\alpha_n \ge 0$, $\sum_{n=1}^{N}\alpha_n t_n = 0$

Regularisation: The soft-margin SVM: penalise a hyperplane by the number and distance of misclassified points.

Primal: $\arg\min_{w, w_0}\ \frac{1}{2}\|w\|^2 + C\sum_{n=1}^{N}\xi_n$ s.t. $t_n(w^T x_n + w_0) \ge 1 - \xi_n$, $\xi_n \ge 0$

Dual: $L(\alpha) = \sum_{n=1}^{N}\alpha_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\alpha_n\alpha_m t_n t_m\, x_n^T x_m$ s.t. $0 \le \alpha_n \le C$, $\sum_{n=1}^{N}\alpha_n t_n = 0$

Training / Complexity: QP: O(n³); SMO: much more efficient than QP, since the computation is based only on the support vectors.

Non-linear: Replace the inner product with a kernel, $L(\alpha) = \sum_{n=1}^{N}\alpha_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\alpha_n\alpha_m t_n t_m\, K(x_n, x_m)$, and classify with:

$y(x) = \sum_{n=1}^{N}\alpha_n t_n\, K(x, x_n) + w_0$

Online learning: Online SVM is described in "The Huller: A Simple and Efficient Online SVM", Bordes & Bottou (2005).
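The sheet trains the SVM via QP or SMO on the dual; as a simpler illustration, the sketch below instead minimises the equivalent unconstrained hinge-loss form of the soft-margin primal by batch subgradient descent. Labels are assumed to be in {-1, +1}; the function name and step sizes are illustrative.

import numpy as np

def train_linear_svm(X, t, C=1.0, eta=0.01, epochs=200):
    # X: (N, M) inputs, t: (N,) labels in {-1, +1}.
    # Subgradient descent on 1/2 ||w||^2 + C * sum_n max(0, 1 - t_n (w^T x_n + w_0)).
    N, M = X.shape
    w, w0 = np.zeros(M), 0.0
    for _ in range(epochs):
        margins = t * (X @ w + w0)
        viol = margins < 1                                   # points violating the margin
        grad_w = w - C * (t[viol, None] * X[viol]).sum(axis=0)
        grad_w0 = -C * t[viol].sum()
        w -= eta * grad_w
        w0 -= eta * grad_w0
    return w, w0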
k-means

What it does: A hard-margin, geometric clustering algorithm, where each data point is assigned to its closest centroid. Hard assignments $r_{nk}\in\{0,1\}$ s.t. $\forall n\ \sum_k r_{nk} = 1$, i.e. each data point is assigned to exactly one cluster k. Geometric distance: the Euclidean distance, l2 norm:

$\|x_n - \mu_k\|_2 = \sqrt{\sum_{i=1}^{D}(x_{ni} - \mu_{ki})^2}$

Objective function:

$\arg\min_{r,\mu}\sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk}\,\|x_n - \mu_k\|_2^2$

. . . i.e. minimise the distance from each cluster centre to each of its points.

Training: Alternate between the assignment and update steps:

$r_{nk} = 1$ if $\|x_n - \mu_k\|^2$ is minimal for k, 0 o/w; $\quad \mu_k^{\mathrm{MLE}} = \dfrac{\sum_n r_{nk}\, x_n}{\sum_n r_{nk}}$

. . . where $\mu_k$ is the centroid of cluster k.

Regularisation: Not applicable.

Non-linear: Only hard-margin assignment to clusters.

Complexity, online learning: To be added.
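A compact sketch of the two alternating steps above, assuming real-valued data; initialising the centroids at K random training points and running a fixed number of iterations are illustrative choices, not prescribed by the sheet.

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    # X: (N, D) data. Returns the centroids mu (K, D) and hard assignments r (N,).
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(iters):
        # assignment step: r_nk = 1 for the closest centroid, 0 otherwise
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        r = dists.argmin(axis=1)
        # update step: mu_k = sum_n r_nk x_n / sum_n r_nk
        for k in range(K):
            if np.any(r == k):
                mu[k] = X[r == k].mean(axis=0)
    return mu, r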
Mixture of Gaussians

What it does: A probabilistic clustering algorithm, where clusters are modelled as latent Gaussians and each data point is assigned the probability of being drawn from a particular Gaussian. Assignments to clusters are made by specifying the probabilities

$p(x^{(i)}, z^{(i)}) = p(x^{(i)}|z^{(i)})\, p(z^{(i)})$

. . . with $z^{(i)}\sim\mathrm{Multinomial}(\lambda)$, and responsibilities $\gamma_{nk} \equiv p(k|x_n)$ s.t. $\sum_{j=1}^{K}\gamma_{nj} = 1$. I.e. we want to maximise the probability of the observed data x.

Objective function: The log-likelihood

$L(x; \lambda, \mu, \Sigma) = \log p(x|\lambda, \mu, \Sigma) = \sum_{n=1}^{N}\log\sum_{k=1}^{K}\lambda_k\,\mathcal{N}(x_n|\mu_k, \Sigma_k)$

Training (EM):

Expectation: for each n, k set

$\gamma_{nk} = p(z^{(i)} = k\,|\,x^{(i)};\lambda,\mu,\Sigma) = \dfrac{p(x^{(i)}|z^{(i)} = k;\mu,\Sigma)\,p(z^{(i)} = k;\lambda)}{\sum_{l=1}^{K} p(x^{(i)}|z^{(i)} = l;\mu,\Sigma)\,p(z^{(i)} = l;\lambda)}\quad(= p(k|x_n))$

Maximisation: re-estimate the parameters from the responsibilities (the standard EM updates):

$\lambda_k = \frac{1}{N}\sum_n\gamma_{nk},\quad \mu_k = \dfrac{\sum_n\gamma_{nk}\,x_n}{\sum_n\gamma_{nk}},\quad \Sigma_k = \dfrac{\sum_n\gamma_{nk}\,(x_n-\mu_k)(x_n-\mu_k)^T}{\sum_n\gamma_{nk}}$

The mixture of Gaussians assigns probabilities for each cluster to each data point, and as such is capable of capturing ambiguities in the data set (contrast with the hard assignments of k-means).

Regularisation: To be added.

Non-linear: Not applicable.

Online learning: Online estimation of Gaussian Mixture Models is described in "Highly Efficient Incremental Estimation of Gaussian Mixture Models for Online Data Stream Clustering", Song & Wang (2005).
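A sketch of the EM loop above, using scipy.stats.multivariate_normal for the Gaussian densities; the initialisation, iteration count, and the small ridge added to each covariance are illustrative assumptions rather than part of the sheet.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, iters=50, seed=0):
    # X: (N, D) data. Returns mixing weights lam, means mu and covariances Sigma.
    N, D = X.shape
    rng = np.random.default_rng(seed)
    lam = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)
    Sigma = np.stack([np.eye(D) for _ in range(K)])
    for _ in range(iters):
        # E-step: responsibilities gamma_nk = p(z = k | x_n)
        gamma = np.stack([lam[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                          for k in range(K)], axis=1)                  # (N, K)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the soft assignments
        Nk = gamma.sum(axis=0)                                         # effective cluster sizes
        lam = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
            Sigma[k] += 1e-6 * np.eye(D)                               # small ridge for stability
    return lam, mu, Sigma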
Created by Emanuel Ferm, HT2011, for semi-procrastinational reasons while studying for a Machine Learning exam. Last updated May 5, 2011.