3 Non Linear Classifiers
● The XOR problem: there is no single line (hyperplane) that separates class A from class B. On the contrary, the AND and OR operations are linearly separable problems.
● The Two-Layer Perceptron
➢ Draw two lines such that class B is located outside the shaded area (the region between them) and class A inside. This is a two-phase design.
• Phase 1: Draw two lines (hyperplanes)
$g_1(x) = 0, \quad g_2(x) = 0$
Each of them is realized by a perceptron. The outputs of the perceptrons will be
$y_i = f(g_i(x)) = \begin{cases} 0 \\ 1 \end{cases}, \qquad i = 1, 2$
depending on the position of x.
• Phase 2: Find the position of x w.r.t. both lines, based on
the values of y1, y2.
         1st phase         2nd phase
x1  x2   y1      y2        class
0   0    0(−)    0(−)      B(0)
0   1    1(+)    0(−)      A(1)
1   0    1(+)    0(−)      A(1)
1   1    1(+)    1(+)      B(0)
The decision is now performed on the transformed data $y$:
$g(y) = 0$
This can be performed via a second line, which can also be realized by a perceptron.
➢ Computations of the first phase perform a mapping
that transforms the nonlinearly separable problem to a
linearly separable one.
➢ The architecture
• This is known as the two-layer perceptron, with one hidden and one output layer. The activation functions are
$f(\cdot) = \begin{cases} 0 \\ 1 \end{cases}$
• The neurons (nodes) of the figure realize the following lines (hyperplanes):
$g_1(x) = x_1 + x_2 - \tfrac{1}{2} = 0$
$g_2(x) = x_1 + x_2 - \tfrac{3}{2} = 0$
$g(y) = y_1 - 2 y_2 - \tfrac{1}{2} = 0$
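A minimal numerical check of these three hyperplanes on the four XOR inputs (an illustrative sketch, not part of the slides; the step activation and the weights are read directly off $g_1$, $g_2$ and $g$):

```python
import numpy as np

def step(v):
    # step activation: 1 if the argument is positive, else 0
    return (v > 0).astype(int)

# Hidden layer: g1(x) = x1 + x2 - 1/2, g2(x) = x1 + x2 - 3/2
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-0.5, -1.5])

# Output layer: g(y) = y1 - 2*y2 - 1/2
w_out = np.array([1.0, -2.0])
b_out = -0.5

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_hidden = step(X @ W_hidden.T + b_hidden)   # first-phase mapping
output = step(y_hidden @ w_out + b_out)      # second-phase decision

for x, y, o in zip(X, y_hidden, output):
    print(x, "->", y, "-> class", "A(1)" if o else "B(0)")
# Expected: (0,0)->B, (0,1)->A, (1,0)->A, (1,1)->B, i.e., the XOR truth table
```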
● Classification capabilities of the two-layer perceptron
➢ The mapping performed by the first layer neurons is onto the vertices of the unit-side square, e.g., (0, 0), (0, 1), (1, 0), (1, 1). More generally, a first layer with p neurons,
$x \in \mathbb{R}^l \rightarrow y = [y_1, \ldots, y_p]^T, \quad y_i \in \{0, 1\},\ i = 1, 2, \ldots, p,$
performs a mapping of a vector onto the vertices of the unit-side hypercube $H_p$.
➢ Intersections of these hyperplanes form regions in the l-
dimensional space. Each region corresponds to a vertex of
the Hp unit hypercube.
For example, the vertex 001 corresponds to the region located on the (−) side of $g_1(x) = 0$, the (−) side of $g_2(x) = 0$, and the (+) side of $g_3(x) = 0$.
➢ The output neuron realizes a hyperplane in the transformed $y$ space, that separates some of the vertices from the others. Thus, the two-layer perceptron has the capability to classify vectors into classes that consist of unions of polyhedral regions. But NOT ANY union. It depends on the relative position of the corresponding vertices.
● Three-layer perceptrons
➢ The architecture: a second hidden layer operates on the transformed vectors $y \in \mathbb{R}^p$.
➢ The reasoning
• For each vertex, corresponding to class, say A, construct
a hyperplane which leaves THIS vertex on one side (+)
and ALL the others to the other side (-).
• The output neuron realizes an OR gate
➢ Overall:
The first layer of the network forms the hyperplanes, the
second layer forms the regions and the output neuron
forms the classes.
● The Backpropagation Algorithm
➢ This is an algorithmic procedure that computes the synaptic
weights iteratively, so that an adopted cost function is
minimized (optimized)
➢ Backpropagation requires differentiable activations; the step function
$f(x) = \begin{cases} 1, & x > 0 \\ 0, & x < 0 \end{cases}$
is not differentiable. There is always an escape path!!! The logistic function
$f(x) = \dfrac{1}{1 + \exp(-ax)}$
is an example. Other functions are also possible and in some cases more desirable.
➢ The steps:
• Adopt an optimizing cost function, e.g.,
– Least Squares Error
– Relative Entropy
between desired responses and actual responses of
the network for the available training patterns. That
is, from now on we have to live with errors. We only
try to minimize them, using certain criteria.
• The task is a nonlinear optimization one. For the gradient descent method:
$\vec{w}_1(\text{new}) = \vec{w}_1(\text{old}) + \Delta \vec{w}_1$
$\Delta \vec{w}_1 = -\mu \dfrac{\partial J}{\partial \vec{w}_1}$
➢ The Procedure:
• Initialize unknown weights randomly with small values.
• Compute the gradient terms backwards, starting with the
weights of the last (3rd) layer and then moving towards the
first
• Update the weights
• Repeat the procedure until a termination criterion is met.
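A minimal numpy sketch of this procedure (illustrative, not the slides' own code) for a two-layer network with logistic activations and a squared-error cost; the architecture, the learning rate $\mu$ and the toy data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(v, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * v))

# XOR-like toy data: inputs and desired responses
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([[0], [1], [1], [0]], dtype=float)

# Initialize unknown weights randomly with small values
W1 = 0.1 * rng.standard_normal((2, 2)); b1 = np.zeros(2)
W2 = 0.1 * rng.standard_normal((2, 1)); b2 = np.zeros(1)
mu = 0.5  # learning rate (illustrative)

for epoch in range(20000):
    # Forward pass
    y1 = logistic(X @ W1 + b1)      # hidden layer outputs
    y2 = logistic(y1 @ W2 + b2)     # network outputs
    e = y2 - d                      # errors w.r.t. desired responses

    # Backward pass: error signals (deltas), computed from the
    # last layer towards the first
    delta2 = e * y2 * (1 - y2)
    delta1 = (delta2 @ W2.T) * y1 * (1 - y1)

    # Gradient-descent update of the weights
    W2 -= mu * y1.T @ delta2; b2 -= mu * delta2.sum(axis=0)
    W1 -= mu * X.T @ delta1;  b1 -= mu * delta1.sum(axis=0)

    if np.sum(e ** 2) < 1e-2:       # termination criterion (illustrative)
        break

# Often approaches [0, 1, 1, 0]; the algorithm may also converge to a local minimum
print(np.round(y2.ravel(), 3))
```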
➢ A major problem: The algorithm may converge to a local
minimum
➢ The Cost function choice
Examples:
• The Least Squares
$J = \sum_{i=1}^{N} E(i)$
$E(i) = \sum_{m=1}^{k} e_m^2(i) = \sum_{m=1}^{k} \left( y_m(i) - \hat{y}_m(i) \right)^2, \qquad i = 1, 2, \ldots, N$
$y_m(i) \rightarrow$ desired response of the m-th output neuron (1 or 0) for input $x(i)$
$\hat{y}_m(i) \rightarrow$ actual response of the m-th output neuron, in the interval [0, 1], for input $x(i)$
➢ The cross-entropy
$J = \sum_{i=1}^{N} E(i)$
$E(i) = \sum_{m=1}^{k} \left\{ y_m(i) \ln \hat{y}_m(i) + (1 - y_m(i)) \ln\left(1 - \hat{y}_m(i)\right) \right\}$
This presupposes an interpretation of y and ŷ as probabilities
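A small sketch (illustrative) computing both costs for hypothetical desired and actual responses:

```python
import numpy as np

# Hypothetical desired responses (one-hot) and network outputs for N=3 patterns, k=2 outputs
y_des = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)
y_hat = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])

# Least Squares cost: J = sum_i sum_m (y_m(i) - yhat_m(i))^2
J_ls = np.sum((y_des - y_hat) ** 2)

# Cross-entropy cost (interpreting y and yhat as probabilities)
J_ce = np.sum(y_des * np.log(y_hat) + (1 - y_des) * np.log(1 - y_hat))

print(J_ls, J_ce)
```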
➢ Remark 1: A common feature of all the above is the danger
of local minimum convergence. “Well formed” cost
functions guarantee convergence to a “good” solution, that is
one that classifies correctly
ALL training patterns, provided such a solution exists. The
cross-entropy cost function is a well formed one. The Least
Squares is not.
➢ Remark 2: Both the Least Squares and the cross-entropy criteria lead to output values $\hat{y}_m(i)$ that approximate optimally the class a-posteriori probabilities!!!
$\hat{y}_m(i) \cong P(\omega_m \mid x(i))$
That is, the probability of class $\omega_m$ given $x(i)$.
This is a very interesting result. It does not depend on the underlying distributions. It is a characteristic of certain cost functions. How good or bad the approximation is depends on the underlying model. Furthermore, it is only valid at the global minimum.
➢ Choice of the network size.
How large should the network be? How many layers and how many neurons per layer? There are two major directions:
• Pruning Techniques: These techniques start from a large
network and then weights and/or neurons are removed
iteratively, according to a criterion.
— Methods based on parameter sensitivity:
$\delta J = \sum_i g_i \delta w_i + \frac{1}{2} \sum_i h_{ii} \delta w_i^2 + \frac{1}{2} \sum_i \sum_{j \ne i} h_{ij} \delta w_i \delta w_j$
where
$g_i = \dfrac{\partial J}{\partial w_i}, \qquad h_{ij} = \dfrac{\partial^2 J}{\partial w_i \partial w_j}$
Near a minimum the first-order terms vanish and, assuming the cross terms can be neglected,
$\delta J \cong \frac{1}{2} \sum_i h_{ii} \delta w_i^2$
Pruning is now achieved by the following procedure:
• Train the network using Backpropagation for a number of steps.
• Compute the saliencies $s_i = \dfrac{h_{ii} w_i^2}{2}$.
• Remove weights with small $s_i$.
• Repeat the process.
— Methods based on weight regularization: train with an augmented cost
$J = \sum_{i=1}^{N} E(i) + a E_p(w)$
The second term favours small values for the weights, e.g.,
$E_p(w) = \sum_k h(w_k^2), \qquad h(w_k^2) = \dfrac{w_k^2}{w_0^2 + w_k^2}$
with $w_0$ a preselected parameter. Then $h(w_k^2) \cong 1$ for weights with $w_k^2 \gg w_0^2$ and $h(w_k^2) \cong 0$ for $w_k^2 \ll w_0^2$. After some training steps, weights with small values are removed.
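A brief sketch (illustrative) of the weight-elimination penalty above; the weight values, $w_0$, $a$ and the pruning threshold are assumptions:

```python
import numpy as np

def penalty(w, w0=0.1):
    # E_p(w) = sum_k w_k^2 / (w0^2 + w_k^2): close to 0 for small weights, close to 1 for large ones
    return np.sum(w ** 2 / (w0 ** 2 + w ** 2))

w = np.array([0.01, -0.02, 0.8, -1.3])   # hypothetical trained weights
a = 0.1                                   # regularization strength (assumed)
print(a * penalty(w))                     # large weights dominate the penalty term

# After some training steps, weights with small magnitude are pruned:
pruned = np.where(np.abs(w) < 0.05, 0.0, w)
print(pruned)
```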
• Constructive techniques:
They start with a small network and keep increasing it,
according to a predetermined procedure and criterion.
➢ Remark: Why not start with a large network and leave the
algorithm to decide which weights are small?? This
approach is just naïve. It overlooks that classifiers must
have good generalization properties. A large network can
result in small errors for the training set, since it can learn
the particular details of the training set. On the other hand,
it will not be able to perform well when presented with data
unknown to it. The size of the network must be:
• Large enough to learn what makes data of the same class
similar and data from different classes dissimilar
• Small enough so as not to be able to learn the underlying differences between data of the same class; learning such details leads to the so-called overfitting.
➢ Overtraining is another side of the same coin, i.e., the
network adapts to the peculiarities of the training set.
● Generalized Linear Classifiers
$x \rightarrow y = \begin{bmatrix} f(g_1(x)) \\ f(g_2(x)) \end{bmatrix}$
The activation function $f(\cdot)$ transforms the nonlinear task into a linear one.
➢ In the more general case:
• Let $x \in \mathbb{R}^l$ and a nonlinear classification task.
• Are there any functions $f_i(\cdot),\ i = 1, 2, \ldots, k$, and an appropriate $k$, so that the mapping
$x \rightarrow y = \begin{bmatrix} f_1(x) \\ \vdots \\ f_k(x) \end{bmatrix}$
transforms the task into a linear one in the $y \in \mathbb{R}^k$ space?
• If this is true, then there exists a hyperplane $w \in \mathbb{R}^k$ so that
if $w_0 + w^T y > 0$, then $x \in \omega_1$
if $w_0 + w^T y < 0$, then $x \in \omega_2$
➢ In such a case, this is equivalent to approximating the nonlinear discriminant function g(x) in terms of the $f_i(x)$, i.e.,
$g(x) \cong w_0 + \sum_{i=1}^{k} w_i f_i(x)\ \gtrless\ 0$
➢ Given the $f_i(x)$, the task of computing the weights is a linear one.
● Capacity of the l-dimensional space in Linear
Dichotomies
➢ Assume N points in $\mathbb{R}^l$, assumed to be in general position, that is, no subset of $l+1$ of them lies on an $(l-1)$-dimensional hyperplane.
➢ Cover’s theorem states: the number of groupings that can be formed by $(l-1)$-dimensional hyperplanes to separate the N points in two classes is
$O(N, l) = 2 \sum_{i=0}^{l} \binom{N-1}{i}, \qquad \binom{N-1}{i} = \dfrac{(N-1)!}{(N-1-i)!\, i!}$
Example: N=4, l=2, O(4,2)=14
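A quick check of the formula (illustrative sketch):

```python
from math import comb

def O(N, l):
    # Number of linearly separable dichotomies of N points in general position in R^l
    return 2 * sum(comb(N - 1, i) for i in range(l + 1))

print(O(4, 2))           # 14, as in the example above (out of 2^4 = 16 possible groupings)
print(O(4, 2) / 2 ** 4)  # fraction of groupings that are linearly separable
```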
➢ The probability of grouping N points in two linearly separable classes is
$P_N^l = \dfrac{O(N, l)}{2^N}$
Setting $N = r(l+1)$, the probability of having the N points in linearly separable classes tends to 1, for large $l$, provided $N < 2(l+1)$.
Hence, by mapping to a higher dimensional space, we increase the probability of linear separability, provided the space is not too densely populated.
● Radial Basis Function Networks (RBF)
➢ Choose
$f_i(x) = \exp\left( -\dfrac{\| x - c_i \|^2}{2\sigma_i^2} \right)$
➢ Example: The XOR problem
• Define:
$c_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad c_2 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad \sigma_1 = \sigma_2 = \tfrac{1}{2}$
$y = \begin{bmatrix} \exp(-\| x - c_1 \|^2) \\ \exp(-\| x - c_2 \|^2) \end{bmatrix}$
• The corresponding decision line in the $y$ space and its equivalent in the $x$ space are
$g(y) = y_1 + y_2 - 1 = 0$
$g(x) = \exp(-\| x - c_1 \|^2) + \exp(-\| x - c_2 \|^2) - 1 = 0$
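A small numerical sketch (illustrative) evaluating this RBF mapping on the four XOR points:

```python
import numpy as np

c1, c2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])

def rbf_map(x):
    # y = [exp(-||x-c1||^2), exp(-||x-c2||^2)]
    return np.array([np.exp(-np.sum((x - c1) ** 2)),
                     np.exp(-np.sum((x - c2) ** 2))])

for x in [np.array([0., 0.]), np.array([0., 1.]),
          np.array([1., 0.]), np.array([1., 1.])]:
    y = rbf_map(x)
    g = y[0] + y[1] - 1.0            # g(y) = y1 + y2 - 1
    print(x, np.round(y, 3), "g(y) =", np.round(g, 3))
# (0,0) and (1,1) give g > 0, while (0,1) and (1,0) give g < 0:
# the two XOR classes become linearly separable in the y space.
```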
➢ Training of the RBF networks: for given centers and widths,
$g(x) = w_0 + w^T y$
is a typical linear classifier design.
● Universal Approximators
It has been shown that any nonlinear continuous function can be approximated arbitrarily closely, both by a two-layer perceptron with sigmoid activations and by an RBF network, provided a large enough number of nodes is used.
● Support Vector Machines: The non-linear case
$x \in \mathbb{R}^l \rightarrow \boldsymbol{y} \in \mathbb{R}^k, \quad k > l$
Then use the SVM in $\mathbb{R}^k$.
➢ Recall that in this case the dual problem formulation will be
$\max_{\lambda} \left( \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j \, \boldsymbol{y}_i^T \boldsymbol{y}_j \right)$
where $\boldsymbol{y}_i \in \mathbb{R}^k$.
Also, the classifier will be
$g(\boldsymbol{y}) = w^T \boldsymbol{y} + w_0$, with $w^T \boldsymbol{y} = \sum_{i=1}^{N_s} \lambda_i y_i \boldsymbol{y}_i^T \boldsymbol{y}$
where $x \rightarrow \boldsymbol{y} \in \mathbb{R}^k$.
Thus, inner products in a high dimensional space are involved, hence
• High complexity
➢ Something clever: Compute the inner products in the
high dimensional space as functions of inner products
performed in the low dimensional space!!!
Let $x = [x_1, x_2]^T \in \mathbb{R}^2$ and
$x \rightarrow \boldsymbol{y} = \begin{bmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{bmatrix} \in \mathbb{R}^3$
Then it is easy to show that
$\boldsymbol{y}_i^T \boldsymbol{y}_j = (x_i^T x_j)^2$
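A quick numerical check of this identity (illustrative sketch; the two sample vectors are arbitrary):

```python
import numpy as np

def phi(x):
    # Mapping R^2 -> R^3 from the example above
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

xi, xj = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(phi(xi) @ phi(xj))   # inner product in the high-dimensional space
print((xi @ xj) ** 2)      # (x_i^T x_j)^2, computed in the original space: same value
```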
➢ Mercer’s Theorem
Let $x \rightarrow \Phi(x) \in H$. Then the inner product in $H$ can be written as
$\sum_r \Phi_r(x) \Phi_r(y) = K(x, y)$
where $K(x, y)$ is a symmetric function, known as a kernel, satisfying
$\int \int K(x, y)\, g(x)\, g(y)\, dx\, dy \ge 0$
for any $g(x)$ with
$\int g^2(x)\, dx < +\infty$
➢ The opposite is also true. Any kernel, with the above
properties, corresponds to an inner product in SOME
space!!!
➢ Examples of kernels
• Polynomial: $K(x, z) = (x^T z + 1)^q, \quad q > 0$
• Radial Basis Functions: $K(x, z) = \exp\left( -\dfrac{\| x - z \|^2}{\sigma^2} \right)$
• Hyperbolic Tangent: $K(x, z) = \tanh(\beta x^T z + \gamma)$, for appropriate values of $\beta$, $\gamma$
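Illustrative Python implementations of the three kernels (the parameter values are assumptions):

```python
import numpy as np

def poly_kernel(x, z, q=2):
    return (x @ z + 1.0) ** q

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def tanh_kernel(x, z, beta=1.0, gamma=0.0):
    # A valid (Mercer) kernel only for suitable beta, gamma
    return np.tanh(beta * (x @ z) + gamma)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(x, z), rbf_kernel(x, z), tanh_kernel(x, z))
```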
➢ SVM Formulation
• Step 1: Choose an appropriate kernel. This implicitly assumes a mapping to a higher dimensional (yet, not known) space.
• Step 2:
$\max_{\lambda} \left( \sum_i \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j K(x_i, x_j) \right)$
subject to: $0 \le \lambda_i \le C, \quad i = 1, 2, \ldots, N$
$\sum_i \lambda_i y_i = 0$
This results in an implicit combination
$w = \sum_{i=1}^{N_s} \lambda_i y_i \varphi(x_i)$
• Step 3: Assign $x$ to
$\omega_1 (\omega_2)$ if $g(x) = \sum_{i=1}^{N_s} \lambda_i y_i K(x_i, x) + w_0 > (<)\, 0$
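A minimal sketch of these three steps using scikit-learn's SVC (an assumed tool, not the slides' own code) on the XOR data with an RBF kernel; the values of $\gamma$ and $C$ are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# XOR data: the classes are not linearly separable in the original space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# Step 1: choose a kernel; Step 2: solve the dual (done internally by SVC)
clf = SVC(kernel='rbf', gamma=1.0, C=10.0)
clf.fit(X, y)

# Step 3: classify according to the sign of g(x)
print(clf.predict(X))            # expected: [-1  1  1 -1]
print(clf.decision_function(X))  # the values of g(x) for the four points
```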
● Decision Trees
➢ The figures below are such examples. This type of tree is known as an Ordinary Binary Classification Tree (OBCT). The decision hyperplanes, splitting the space into regions, are parallel to the axes of the space. Other types of partition are also possible, yet less popular.
➢ Design elements that define a decision tree.
• Each node, t, is associated with a subset $X_t \subseteq X$, where X is the training set. At each node, $X_t$ is split into two (binary splits) disjoint descendant subsets $X_{t,Y}$ and $X_{t,N}$, where
$X_{t,Y} \cap X_{t,N} = \emptyset$
$X_{t,Y} \cup X_{t,N} = X_t$
$X_{t,Y}$ is the subset of $X_t$ for which the answer to the query at node t is YES. $X_{t,N}$ is the subset corresponding to NO. The split is decided according to an adopted question (query).
• A splitting criterion must be adopted for the best split of Xt into
Xt,Y and Xt,N.
➢ Set of Questions: In OBCT trees the set of questions is of the type
"is $x_i \le a$?"
The choice of the specific feature $x_i$ and the value of the threshold $a$, for each node t, are the results of a search, during training, among the features and a set of possible threshold values. The final combination is the one that results in the best value of a criterion.
➢ Splitting Criterion: The main idea behind splitting at each node is that the resulting descendant subsets $X_{t,Y}$ and $X_{t,N}$ should be more class homogeneous than $X_t$. Thus the criterion must be in harmony with such a goal. A commonly used criterion is the node impurity:
$I(t) = -\sum_{i=1}^{M} P(\omega_i \mid t) \log_2 P(\omega_i \mid t)$
where $P(\omega_i \mid t) \approx \dfrac{N_t^i}{N_t}$
and $N_t^i$ is the number of data points in $X_t$ that belong to class $\omega_i$, out of the $N_t$ points in $X_t$. The decrease in node impurity is defined as:
$\Delta I(t) = I(t) - \dfrac{N_{t,Y}}{N_t} I(t_Y) - \dfrac{N_{t,N}}{N_t} I(t_N)$
• The goal is to choose the parameters in each node (feature and
threshold) that result in a split with the highest decrease in
impurity.
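A short sketch (illustrative) computing the impurity and its decrease for a hypothetical candidate split:

```python
import numpy as np

def impurity(counts):
    # I(t) = -sum_i P(w_i|t) * log2 P(w_i|t), with P(w_i|t) ~ N_t^i / N_t
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                   # 0 * log 0 is taken as 0
    return -np.sum(p * np.log2(p))

# Hypothetical node with 40 points of class 1 and 60 of class 2,
# split into a YES subset (35, 10) and a NO subset (5, 50)
N_t, N_Y, N_N = [40, 60], [35, 10], [5, 50]
dI = impurity(N_t) - (45 / 100) * impurity(N_Y) - (55 / 100) * impurity(N_N)
print(round(dI, 3))                # decrease in impurity for this candidate split
```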
➢ Summary of an OBCT algorithmic scheme:
➢ Remarks:
• A critical factor in the design is the size of the tree. Usually one
grows a tree to a large size and then applies various pruning
techniques.
• Decision trees belong to the class of unstable classifiers. This
can be overcome by a number of “averaging” techniques.
Bagging is a popular technique. Using bootstrap techniques in
X, various trees are constructed, Ti, i=1, 2, …, B. The decision
is taken according to a majority voting rule.
● Combining Classifiers
The basic philosophy behind the combination of different classifiers
lies in the fact that even the “best” classifier fails in some patterns that
other classifiers may classify correctly. Combining classifiers aims at
exploiting this complementary information residing in the various
classifiers.
Thus, one designs different optimal classifiers and then combines the
results with a specific rule.
• Product Rule: Assign $x$ to the class $\omega_i$:
$i = \arg\max_k \prod_{j=1}^{L} P_j(\omega_k \mid x)$
where $P_j(\omega_k \mid x)$ is the respective posterior probability of the j-th classifier.
• Sum Rule: Assign $x$ to the class $\omega_i$:
$i = \arg\max_k \sum_{j=1}^{L} P_j(\omega_k \mid x)$
• Majority Voting Rule: Assign $x$ to the class for which there is a consensus, or when at least $\ell_c$ of the classifiers agree on the class label of $x$, where
$\ell_c = \begin{cases} \frac{L}{2} + 1, & L \text{ even} \\ \frac{L+1}{2}, & L \text{ odd} \end{cases}$
otherwise the decision is rejection, that is, no decision is taken.
Thus, a correct decision is made if the majority of the classifiers agree on the correct label, and a wrong decision if the majority agrees on the wrong label.
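A compact sketch (illustrative) of the three combination rules for hypothetical posterior estimates from $L = 3$ classifiers:

```python
import numpy as np

# Posterior estimates P_j(w_k | x) from L=3 hypothetical classifiers over 2 classes
P = np.array([[0.6, 0.4],    # classifier 1
              [0.7, 0.3],    # classifier 2
              [0.4, 0.6]])   # classifier 3

product_rule = np.argmax(np.prod(P, axis=0))   # argmax_k prod_j P_j(w_k|x)
sum_rule = np.argmax(np.sum(P, axis=0))        # argmax_k sum_j P_j(w_k|x)

# Majority voting on the individual (hard) decisions, with rejection otherwise
votes = np.argmax(P, axis=1)
L = len(votes)
lc = L // 2 + 1                                # at least l_c classifiers must agree
counts = np.bincount(votes, minlength=P.shape[1])
majority = np.argmax(counts) if counts.max() >= lc else None   # None = reject

print(product_rule, sum_rule, majority)
```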
➢ Dependent or not Dependent classifiers?
– Stacking: Train the combiner with data points that have been
excluded from the set used to train the individual classifiers.
– Use different subspaces to train individual classifiers: According to
the method, each individual classifier operates in a different feature
subspace. That is, use different features for each classifier.
➢ Remarks:
• The majority voting and the summation schemes rank among the
most popular combination schemes.
• Besides the above three rules, other alternatives are also possible,
such as to use the median value of the outputs of individual
classifiers.
● The Boosting Approach
➢ The origins: Is it possible for a weak learning algorithm (one that performs slightly better than random guessing) to be boosted into a strong one? (Valiant, 1984).
➢ The procedure to achieve it:
• Adopt a weak classifier known as the base classifier.
• Employing the base classifier, design a series of classifiers, in a
hierarchical fashion, each time employing a different weighting
of the training samples. Emphasis in the weighting is given on
the hardest samples, i.e., the ones that keep “failing”.
• Combine the hierarchically designed classifiers by a weighted
average procedure.
➢ The AdaBoost Algorithm.
Construct an optimally designed classifier of the form:
$f(x) = \text{sign}\{F(x)\}$
where:
$F(x) = \sum_{k=1}^{K} a_k \varphi(x; \vartheta_k)$
where $\varphi(x; \vartheta_k)$ denotes the base classifier that returns a binary class label, $\varphi(x; \vartheta_k) \in \{-1, 1\}$, and $\vartheta_k$ is a parameter vector.
• The essence of the method.
Design the series of classifiers:
$\varphi(x; \vartheta_1), \varphi(x; \vartheta_2), \ldots, \varphi(x; \vartheta_K)$
The parameter vectors $\vartheta_k,\ k = 1, 2, \ldots, K$, are optimally computed so as:
– To minimize the error rate on the training set.
– Each time, the training samples are re-weighted so that the
weight of each sample depends on its history. Hard samples that
“insist” on failing to be predicted correctly, by the previously
designed classifiers, are more heavily weighted.
• Updating the weights for each sample $x_i,\ i = 1, 2, \ldots, N$:
$w_i^{(m+1)} = \dfrac{w_i^{(m)} \exp\left(-y_i a_m \varphi(x_i; \vartheta_m)\right)}{Z_m}$
– $Z_m$ is a normalizing factor common for all samples.
– $a_m = \dfrac{1}{2} \ln \dfrac{1 - P_m}{P_m}$
where $P_m < 0.5$ (by assumption) is the error rate of the optimal classifier $\varphi(x; \vartheta_m)$ at stage m. Thus $a_m > 0$.
– The exponent is positive for a wrongly classified sample, $y_i \varphi(x_i; \vartheta_m) < 0$, and negative for a correctly classified one, $\{ y_i \varphi(x_i; \vartheta_m) > 0 \}$.
– The update equation is of a multiplicative nature. That is, successive large values of the weights (hard samples) result in a larger weight for the next iteration.
• The algorithm: a compact numerical sketch is given below.
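A compact numerical sketch consistent with the update equations above (illustrative; the decision-stump base classifier, the toy data and $K$ are assumptions):

```python
import numpy as np

def adaboost(X, y, K=10):
    """Minimal AdaBoost sketch with decision stumps as the (assumed) base classifier.
    Labels y take values in {-1, +1}, consistent with phi(x; theta) in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                    # initial sample weights
    stumps, alphas = [], []
    for m in range(K):
        # Base classifier: the stump (feature, threshold, sign) with smallest weighted error
        best = None
        for f in range(X.shape[1]):
            for thr in np.unique(X[:, f]):
                for s in (+1, -1):
                    pred = s * np.where(X[:, f] <= thr, 1, -1)
                    P_m = np.sum(w[pred != y])
                    if best is None or P_m < best[0]:
                        best = (P_m, f, thr, s, pred)
        P_m, f, thr, s, pred = best
        a_m = 0.5 * np.log((1 - P_m) / max(P_m, 1e-10))   # a_m = 1/2 ln((1-P_m)/P_m)
        # Re-weight: hard samples (y_i * phi < 0) get larger weights
        w = w * np.exp(-y * a_m * pred)
        w = w / w.sum()                        # Z_m normalization
        stumps.append((f, thr, s)); alphas.append(a_m)
    return stumps, alphas

def predict(X, stumps, alphas):
    F = np.zeros(len(X))
    for (f, thr, s), a in zip(stumps, alphas):
        F += a * s * np.where(X[:, f] <= thr, 1, -1)
    return np.sign(F)                          # f(x) = sign{F(x)}

# Toy usage on a 1-D set that no single stump can classify correctly
X = np.array([[1], [2], [3], [4], [5], [6]], dtype=float)
y = np.array([1, 1, -1, -1, 1, 1])
stumps, alphas = adaboost(X, y, K=10)
print(predict(X, stumps, alphas))              # matches y after a few rounds
```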
➢ Remarks:
• The training error rate tends to zero after a few iterations. The test error levels off at some value.
• AdaBoost is greedy in reducing the margin that samples leave
from the decision surface.