ML Cheatsheet

Download as pdf or txt
Download as pdf or txt
You are on page 1of 1

Cheat Sheet: Algorithms for Supervised- and Unsupervised Learning

Algorithm Description Model Objective Training Regularisation t = arg max


C

1
Complexity Non-linear Online learning

i:xi Nk (x,) x

(ti , C) Use cross-validation to learn the appropriate k; otherwise no training, classication based on existing points. k acts as to regularise the classier: as k N the boundary becomes smoother. O(N M ) space complexity, since all training instances and all their features need to be kept in memory.

k-nearest neighbour

The label of a new point x is classied with the most frequent label t of the k nearest training instances.

Nk (x, x) k points in x closest to x Euclidean distance formula: D 2 i=1 (xi xi ) (a, b) 1 if a = b; 0 o/w

No optimisation needed.

Natively nds non-linear boundaries.

To be added.

y(x) = arg max p(Ck |x)


k k

Multivariate likelihood p(x|Ck ) = D log p(xi |Ck ) i=1 N pMLE (xi = v|Ck ) =

Use a Dirichlet prior on the parameters to obtain a MAP estimate. (tj = Ck xji = v) N j=1 (tj = Ck ) Multivariate likelihood pMAP (xi = v|Ck ) = (i 1) + N (tj = Ck xji = v) j=1 |xi |(i 1) + N (tj = Ck ) j=1 Can only learn linear boundaries for multivariate/multinomial attributes. With Gaussian attributes, quadratic boundaries can be learned with uni-modal distributions. To be added.

j=1

Naive Bayes

Learn p(Ck |x) by modelling p(x|Ck ) and p(Ck ), using Bayes rule to infer the class conditional probability. Assumes each feature independent of all others, ergo Naive.

= arg max p(x|Ck ) p(Ck ) = arg max


k D

i=1 D i=1

p(xi |Ck ) p(Ck ) log p(xi |Ck ) + log p(Ck )

No optimisation needed.

Multinomial likelihood p(x|Ck ) = D p(wordi |Ck )xi i=1 N pMLE (wordi = v|Ck ) = . . . where:

= arg max
k

j=1 (tj = Ck ) xji N D j=1 d=1 (tj = Ck ) xdi

Multinomial likelihood

O(N M ), each training instance must be visited and each of its features counted.

xji is the count of word i in test example j; xdi is the count of feature d in test example j. Gaussian likelihood p(x|Ck ) = D N (v; ik , ik ) i=1 Gradient descent (or gradient ascent if maximising objective): n+1 = n L

pMAP (wordi = v|Ck ) = (i 1) + N (tj = Ck ) xji j=1 N D ((tj = Ck ) xdi ) D + D d j=1 d=1 d=1

y(x) = arg max p(Ck |x)


k k

Log-linear

Estimate p(Ck |x) directly, by assuming a maximum entropy distribution and optimising an objective function over the conditional entropy distribution.

= arg max . . . where:

m m (x, Ck )

Minimise the negative log-likelihood: LMLE (, D) = p(t|x) =


(x,t)D

. . . where is the step parameter. log p(t|x) m m (x, t) LMLE (, D) = E[(x, )] (x, t)

Penalise large values for the parameters, by introducing a prior distribution over them (typically a Gaussian). Objective function LMAP (, D, ) = arg min log p()

Reformulate the class conditional distribution in terms of a kernel K(x, x ), and use a non-linear kernel (for example K(x, x ) = (1 + wT x)2 ). By the Representer Theorem: O(IN M K), since each training instance must be visited and each combination of class and features must be calculated for the appropriate feature mapping. p(Ck |x) =
T 1 e (x,Ck ) Z (x) N K T 1 = e n=1 i=1 nk (xn ,Ci ) (x,Ck ) Z (x) N K 1 = e n=1 i=1 nk K((xn ,Ci ),(x,Ck )) Z (x) N 1 = e n=1 nk K(xn ,x) Z (x)

1 e m m m (x,Ck ) Z (x) Z (x) = e m m m (x,Ck )

(x,t)D

p(Ck |x) =

(x,t)D

log Z (x) log


k

(x,t)D

(x,t)D

m m (x, t)

m m (x,Ck )

LMAP (, D, ) = . . . where
(x,t)D

+ 2

(x,t)D

(x,t)D

E[(x, )]

(x, t)

(x,t)D

(x, t) are the empirical counts.

= arg min log e

(0)2 2 2

For each class Ck : E[(x, )] =


(x,t)D

(x,t)D

(x, )p(Ck |x)

= arg min

(x,t)D

log p(t|x)

Online Gradient Descent: Update the parameters using GD after seeing each training instance.

2 m

2 2

(x,t)D

log p(t|x)

log p(t|x)

Binary, linear classier: y(x) = sign(wT x) Directly estimate the linear function y(x) by iteratively updating the weight vector when incorrectly classifying a training instance. . . . where: sign(x) = Tries to minimise the Error function; the number of incorrectly classied input vectors: arg min EP (w) = arg min w T xn t n
w w nM

Iterate over each training example xn , and update the weight vector if misclassication: wi+1 = wi + EP (w) = wi + xn tn . . . where typically = 1. For the multiclass perceptron: wi+1 = wi + (x, t) (x, y(x)) The Voted Perceptron: run the perceptron i times and store each iterations weight vector. Then: y(x) = sign ci sign(wT x) i
i

Perceptron

+1 1

if x 0 if x < 0

O(IN M L), since each combination of instance, class and features must be calculated (see log-linear).

Use a kernel K(x, x ), and 1 weight per training instance: N y(x) = sign wn tn K(x, xn )
n=1

Multiclass perceptron: y(x) = arg max wT (x, Ck )


Ck

. . . where M is the set of misclassied training vectors.

. . . and the update:


i+1 i wn = w n + 1

The perceptron is an online algorithm per default.

. . . where ci is the number of correctly classied training instances for wi .

Primal arg min


w,w0

The soft margin SVM: penalise a hyperplane by the number and distance of misclassied points. 1 ||w||2 2 n Primal
N 1 arg min ||w||2 + C n 2 w,w0 n=1

Use a non-linear kernel K(x, x ):


N N

y(x) =

n t n xT xn + w 0

n=1

s.t. Support vector machines A maximum margin classier: nds the separating hyperplane with the maximum margin to its closest data points. y(x) =
N

tn (wT xn + w0 ) 1
N N

Dual n t n x xn + w 0 L() =
T N

Quadratic Programming (QP) SMO, Sequential Minimal Optimisation (chunking).

s.t. Dual

tn (wT xn + w0 ) 1 n ,
N N N

n > 0

QP: O(n3 );

n=1

n=1

n m t n t m xT xm n

SMO: much more ecient than QP, since computation based only on support vectors. L() = =

n tn K(x, xn ) + w0

n=1 N N N N N N

Online SVM is described in The Huller: A Simple and Ecient Online SVM, Bordes & Bottou (2005).

n=1 m=1 N

L() = n s.t.

n=1

n m t n t m xT xm n

n=1 m=1 N

n=1

n n

n m t n t m xT xm n

n=1 m=1

s.t.

n 0,

n tn = 0,

n=1

n m tn tm K(xn , xm )

0 n C,

n tn = 0,

n=1

n=1

n=1 m=1

Hard assignments rnk {0, 1} s.t. n k rnk = 1, i.e. each data point is assigned to one cluster k. k-means A hard-margin, geometric clustering algorithm, where each data point is assigned to its closest centroid. Geometric distance: The Euclidean distance, l2 norm: D ||xn || = (x )2
k 2 ni ki i=1

arg min
r,

n=1 k=1

N K

rnk ||xn k ||2 2

Expectation: 1 rnk = 0 Maximisation:

if ||xn k ||2 minimal for k o/w n rnk xn = n rnk Only hard-margin assignment to clusters. To be added. Not applicable. To be added.

. . . i.e. minimise the distance from each cluster centre to each of its points.

MLE

(k)

. . . where (k) is the centroid of cluster k. Expectation: For each n, k set: nk = p(z (i) = k|x(i) ; , , ) = K L(x, , , ) = log p(x|, , ) K N = log k Nk (xn |k , k )
n=1 k=1

(= p(k|xn ))

p(x(i) |z (i) = k; , )p(z (i) = k; )

j=1

Mixture of Gaussians

A probabilistic clustering algorithm, where clusters are modelled as latent Guassians and each data point is assigned the probability of being drawn from a particular Gaussian.

Assignments to clusters by specifying probabilities p(x(i) , z (i) ) = p(x(i) |z (i) )p(z (i) ) . . . with z (i) Multinomial(), and k nk p(k|xn ) s.t. j=1 nj = 1. I.e. want to maximise the probability of the observed data x.

Maximisation:

k N (xn |k , k ) = K j=1 j N (xn |j , j )


N

p(x(i) |z (i) = l; , )p(z (i) = l; ) Online estimation of Gaussian Mixture Models is described in Highly Ecient Incremental Estimation of Gaussian Mixture Models for Online Data Stream Clustering, Song & Wang (2005).

1 k = nk N n=1 N T n=1 nk (xn k )(xn k ) k = N n=1 nk N n=1 nk xn k = N n=1 nk

The mixture of Gaussians assigns probabilities for each cluster to each data point, and as such is capable of capturing ambiguities in the data set.

To be added.

Not applicable.

1 Created

by Emanuel Ferm, HT2011, for semi-procrastinational reasons while studying for a Machine Learning exam. Last updated May 5, 2011.

You might also like