Perceptron Learning Algorithm Lecture Supplement


Perceptron Learning Algorithm Convergence

In this section we prove that, when a linearly separable set of training examples (x1, y1), . . . , (xn, yn) is
provided as input to the Perceptron Learning algorithm, the algorithm will eventually terminate,
meaning that values w and b have been found for which yi(w · xi + b) > 0, for all i = 1, . . . , n.

Positive vector sets

To simplify the analysis, notice that, if we add an extra component equal to 1 to each vector xi, i = 1, . . . , n,
then we may think of b as an extra component of w. Then, after absorbing b into w, we have
yi(w · xi) > 0, for all i = 1, . . . , n. Finally, if we replace each xi with the vector yi xi, then after
absorbing yi into xi, we have w · xi > 0, for all i = 1, . . . , n. We say that a set of vectors x1, . . . , xn
is positive iff there exists a vector w for which w · xi > 0, for all i = 1, . . . , n.
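
The reduction described above is mechanical, and the following minimal Python sketch may help (the function name and the sample data are my own, not part of the lecture): append a constant 1 to each input vector so that the bias b is absorbed into w, and then multiply each augmented vector by its label, so that correct classification of every example becomes the single condition w · xi > 0.

from typing import List, Tuple

Vector = List[float]

def to_positive_vectors(examples: List[Tuple[Vector, int]]) -> List[Vector]:
    """Reduce labeled examples (x, y), y in {+1, -1}, to a set of positive vectors.

    Step 1: append 1 to x, absorbing the bias b into the weight vector.
    Step 2: multiply the augmented vector by its label y, so that
            y * (w . x + b) > 0 becomes simply w . x > 0.
    """
    positive = []
    for x, y in examples:
        augmented = list(x) + [1.0]                   # absorb the bias
        positive.append([y * c for c in augmented])   # absorb the label
    return positive

# Hypothetical usage on a small separable set (not the data of Example 1):
if __name__ == "__main__":
    examples = [([2.0, 3.0], 1), ([-1.0, -2.0], -1)]
    print(to_positive_vectors(examples))   # [[2.0, 3.0, 1.0], [1.0, 2.0, -1.0]]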

Example 1. Re-write the linearly separable set of training examples

((1, 1), 1), ((2, 3), 1), ((1, 3), 1), ((3, 1), 1), ((4, 5), 1),

as a set of three-dimensional positive vectors.

Cauchy-Schwarz inequality

Theorem 1 (Cauchy-Schwarz-Bunyakovsky Inequality). If u and v are vectors in a dot-product
vector space, then
(u · v)² ≤ |u|²|v|²,
which implies that
|u · v| ≤ |u||v|.

Proof of Theorem 1. Theorem 1 is intuitively true if we recall that |u · v| is the length of the
projection of u onto v, times the length of v. Then the result is true if we believe that the length of
the projection of u onto v should not exceed the length of u. The following is a more formal proof.

For any scalar t, by several applications of the four properties of inner products, we have

0 ≤ (tu + v) · (tu + v) = t²(u · u) + 2t(u · v) + v · v = t²|u|² + 2t(u · v) + |v|²,

which may be written as at² + bt + c ≥ 0, where a = |u|², b = 2(u · v), and c = |v|². But at² + bt + c ≥ 0
implies that the equation at² + bt + c = 0 either has no real roots, or exactly one root. In other words,
we must have
b² − 4ac ≤ 0,
which implies
4(u · v)² ≤ 4|u|²|v|²,
or
(u · v)² ≤ |u|²|v|².
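
The inequality is easy to spot-check numerically; the short Python sketch below (my own illustration, not part of the lecture) compares (u · v)² with |u|²|v|² on randomly drawn vectors.

import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Spot-check (u . v)^2 <= |u|^2 |v|^2 on random 4-dimensional vectors.
random.seed(0)
for _ in range(1000):
    u = [random.uniform(-5, 5) for _ in range(4)]
    v = [random.uniform(-5, 5) for _ in range(4)]
    assert dot(u, v) ** 2 <= dot(u, u) * dot(v, v) + 1e-9   # small tolerance for rounding

print("Cauchy-Schwarz held on all sampled vectors.")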

Convergence Theorem

Theorem 2. Let x1, . . . , xn be a set of positive vectors. Then the Perceptron Learning algorithm
determines a weight vector w for which w · xi > 0, for all i = 1, . . . , n.

Proof of Theorem 2. Since the set of input vectors is positive, there is a weight vector w* for
which |w*| = 1, and there exists a δ > 0 for which, for i = 1, 2, . . . , n,
w* · xi ≥ δ.
Furthermore, let r > 0 be such that |xi| ≤ r, for all i = 1, . . . , n. Let k be the number of times the
vector w in the Perceptron Learning algorithm has been updated, and let wk denote the value of the
weight vector after the k-th update. We assume w0 = 0; i.e. the algorithm begins with the zero weight
vector. The objective is to show that k must be bounded. Suppose xi is used for the k-th update in
the algorithm. Then wk can be written recursively as
wk = wk−1 + xi,
where wk−1 · xi ≤ 0.

Claim. |wk|² ≤ kr².

The proof of this claim is by induction on k. For k = 0, w0 = 0, and so |w0|² = 0 ≤ 0(r²) = 0.

For the inductive step, assume that |wj|² ≤ jr², for all j < k. Then
|wk|² = |wk−1 + xi|² = (wk−1 + xi) · (wk−1 + xi) = |wk−1|² + 2(wk−1 · xi) + |xi|²
≤ |wk−1|² + r² ≤ (k − 1)r² + r² = kr²,
since wk−1 · xi ≤ 0 and |xi| ≤ r, and the claim is proved.

Thus, |wk| ≤ r√k.

Next, we may use induction a second time to prove a lower bound on w* · wk, namely that w* · wk ≥ kδ.
This is certainly true for k = 0. Now if the inductive assumption is that w* · wk−1 ≥ (k − 1)δ, then
w* · wk = w* · (wk−1 + xi) = w* · wk−1 + w* · xi ≥ (k − 1)δ + δ = kδ,
and the lower bound is proved.

Finally, applying the Cauchy-Schwarz inequality, we have

|w*| |wk| ≥ w* · wk ≥ kδ.

And since |w*| = 1, this implies |wk| ≥ kδ.

Putting the two inequalities together yields kδ ≤ |wk| ≤ r√k, which yields k ≤ r²/δ². Therefore, k is bounded,
and the algorithm must terminate.
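
The proof suggests a direct implementation. Below is a minimal Python sketch of the Perceptron Learning algorithm on a positive vector set (the function name and the sample data are my own, not part of the lecture), together with a numerical check of the bound k ≤ r²/δ² for one admissible choice of w*.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron_positive(xs, max_epochs=1000):
    """Find w with w . x > 0 for every x in the positive set xs.

    Sweep the data repeatedly; whenever w . x <= 0, update w <- w + x.
    Returns (w, k), where k is the number of updates performed.
    """
    w = [0.0] * len(xs[0])
    k = 0
    for _ in range(max_epochs):
        updated = False
        for x in xs:
            if dot(w, x) <= 0:
                w = [wi + xi for wi, xi in zip(w, x)]
                k += 1
                updated = True
        if not updated:              # every x now satisfies w . x > 0
            return w, k
    raise RuntimeError("did not converge within max_epochs")

if __name__ == "__main__":
    xs = [[1.0, 1.0, 1.0], [0.0, 2.0, 1.0], [3.0, 0.0, -1.0]]   # a small positive set
    w, k = perceptron_positive(xs)
    print("w =", w, " updates k =", k)

    # Check the bound of Theorem 2, using the normalized final w as one valid w*:
    # k <= r^2 / delta^2, where r = max |xi| and delta = min w* . xi.
    norm = dot(w, w) ** 0.5
    w_star = [wi / norm for wi in w]
    r = max(dot(x, x) ** 0.5 for x in xs)
    delta = min(dot(w_star, x) for x in xs)
    print("bound r^2/delta^2 =", r * r / (delta * delta), ">= k =", k)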

Exercises
1. Describe five features that could be used for the purpose of classifying a fish as either a salmon
or a trout.

2. Plot the training samples ((0, 0), +1), ((0, 1), −1), ((1, 0), −1), ((1, 1), +1) and verify that the
two classes are not linearly separable. Then provide an algebraic proof. Hint: assume w =
(w1, w2) and b are the parameters of a separating line, and obtain a contradiction.

3. If vector w = (2, 1, 5) is normal to plane P and P contains the point (0, 0, 5), then provide
an equation for P .

4. Provide an equation of a plane P that is normal to vector w = (1, 1, 3) and passes through
the point (0, 1, 2).

5. If the vector v = (2, 1, 5) makes a 60-degree angle with a unit vector u, compute u · v.

6. Prove that the Cauchy-Schwarz inequality becomes an equality iff v = ku, for some constant k.

7. Establish that, for any n-dimensional vector v, |v| = √(v · v).

8. Given the feature vectors from the two classes

C+ = (0.1, 0.2), (0.2, 0.1), (0.15, 0.1), (1.1, 0.8), (1.2, 1.1),

and
C− = (1.1, 0.1), (1.25, 0.15), (0.9, 0.1), (0.1, 1.2), (0.2, 0.9),
compute the centers c+ and c− and provide the equation of the Simple-Learning algorithm
decision surface. Use the decision surface to classify the vector (0.5, 0.5).

9. Give an example using only three linearly separable training vectors, where the surface obtained
from the Simple-Learning algorithm misclassifies at least one of the training vectors.

10. Re-write the linearly separable set of training examples

((1, 1), 1), ((0, 2), 1), ((3, 0), 1), ((2, 1), 1), ((0, 2), 1),

as a set of three-dimensional positive vectors.

11. Demonstrate the Perceptron Learning algorithm, with learning rate equal to 1, using the positive vectors obtained
from the previous exercise as input. Start with w0 = 0, and use the order

(0, 2, 1), (2, 1, 1), (3, 0, 1), (1, 1, 1), (0, 2, 1)

when checking for misclassifications. Compute the final normal vector w*, and verify that the
surface (w*1, w*2) · x = −w*3 separates the original data.

Exercise Solutions
1. Answers may vary. Here are five that come to mind: weight (grams), length from head to tail
(cm), girth (cm), number of fins (1-10), primary color.

2. Assume the training samples are separated by the line w · x = b, where w = (w1, w2). Then i)
w · (1, 1) = w1 + w2 > b, ii) w · (0, 0) = 0 > b, iii) w · (1, 0) = w1 < b, and iv) w · (0, 1) = w2 < b.
Adding iii) and iv) yields w1 + w2 < 2b, and combining this with i) yields b < 2b, or b > 0, which
contradicts ii). Therefore, the training samples are not linearly separable.
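
As an empirical companion to this algebraic argument (an illustration, not a proof), the Python sketch below runs a perceptron with a capped number of epochs on the four samples; because no separating line exists, the update loop never settles. The epoch cap and variable names are my own choices.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# The samples of Exercise 2: class +1 at (0,0) and (1,1), class -1 at (0,1) and (1,0).
samples = [([0, 0], 1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]

# Absorb the bias and the labels, as in the positive-vector reduction.
xs = [[y * c for c in list(x) + [1]] for x, y in samples]

w = [0.0, 0.0, 0.0]
converged = False
for epoch in range(10000):
    updated = False
    for x in xs:
        if dot(w, x) <= 0:
            w = [wi + xi for wi, xi in zip(w, x)]
            updated = True
    if not updated:
        converged = True
        break

print("converged:", converged)   # prints False: no separating line exists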

3. Since
b = w · (0, 0, 5) = (2, 1, 5) · (0, 0, 5) = 25,
the equation is w · x = 25.

4. Since
b = w · (0, 1, 2) = (0, 1, 2) · (1, 1, 3) = 7,
the equation is w · x = 7.

5.
u · v = |u||v| cos 60° = (√30)(1)(1/2) = √30/2.
6. If v = ku, for some constant k, then

|u · v| = |u · (ku)| = |k|(u · u) = |k||u||u| = |v||u| = |u||v|.

Now assume that |u · v| = |u||v|. Without loss of generality, assume that |u| = 1. Then let

w = proj(v, u) = (u · v)u.

Now consider v − w. Then

|v − w|² = (v − w) · (v − w) = |v|² + |w|² − 2(v · w) = |v|² + (u · v)² − 2(u · v)(u · v) = 0,

since |v|² = (1)|v|² = |u|²|v|² = (u · v)². Hence, |v − w| = 0, which implies v = w, which in turn is a
multiple of u.
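
A quick numerical illustration of the equality case (my own check, not part of the solution): when v = ku the two sides of the Cauchy-Schwarz inequality coincide, and they separate as soon as v picks up a component that is not parallel to u.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

u = [1.0, 2.0, -1.0]
k = -3.0
v_parallel = [k * c for c in u]      # v = ku: equality holds
v_skewed = [k * c for c in u]
v_skewed[0] += 0.5                   # add a component not parallel to u

for v in (v_parallel, v_skewed):
    lhs = dot(u, v) ** 2
    rhs = dot(u, u) * dot(v, v)
    print(round(lhs, 6), round(rhs, 6), "equal" if abs(lhs - rhs) < 1e-9 else "strict")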

7.
√(v · v) = √(v1² + · · · + vn²) = |v|.

8. c+ = (0.49, 0.38), while c− = (0.71, 0.45). Then w = c+ − c− = (−0.22, −0.07), and c =
1/2(1.2, 0.83) = (0.6, 0.415). Finally, b = w · c = −0.16105. The equation of the decision
surface is thus
(−0.22, −0.07) · x = −0.16105.
Then
(−0.22, −0.07) · (0.5, 0.5) = −0.145 > −0.16105,
which implies that (0.5, 0.5) is classified as being in Class +1.
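
The arithmetic of this solution is easy to reproduce in a few lines of Python (a sketch that starts from the class centers computed above; the variable names are my own): form w = c+ − c−, the midpoint c, the offset b = w · c, and classify a point by comparing w · x with b.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

c_plus = [0.49, 0.38]
c_minus = [0.71, 0.45]

w = [p - m for p, m in zip(c_plus, c_minus)]         # approximately (-0.22, -0.07)
c = [(p + m) / 2 for p, m in zip(c_plus, c_minus)]   # (0.6, 0.415)
b = dot(w, c)                                        # approximately -0.16105

x = [0.5, 0.5]
score = dot(w, x)                                    # approximately -0.145
print("w =", [round(wi, 2) for wi in w], " b =", round(b, 5))
print("class", "+1" if score > b else "-1", "for", x)   # +1, since -0.145 > -0.16105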

9. Consider ((0, −1), −1), ((0, 0), +1), and ((0, 4), +1). Then c+ = (0, 2), while c− = (0, −1). Then
w = c+ − c− = (0, 3), and c = 1/2(0, 1) = (0, 0.5). Finally, b = w · c = 1.5. The equation of the
decision surface is thus
(0, 3) · x = 1.5.
Then
(0, 3) · (0, 0) = 0 < 1.5,
which implies that (0, 0) is misclassified as being in Class −1.

10. Adding a +1 component to each vector yields

(1, 1, 1), (0, 2, 1), (3, 0, 1), (2, 1, 1), (0, 2, 1).

Then scaling each vector with its class label yields

(1, 1, 1), (0, 2, 1), (3, 0, 1), (2, 1, 1), (0, 2, 1).

11.
w0 · (0, 2, 1) = 0 ⇒ w1 = w0 + (0, 2, 1) = (0, 2, 1).
w1 · (3, 0, −1) = −1 ⇒ w2 = w1 + (3, 0, −1) = (3, 2, 0).
Now w2 · x > 0 for each training vector x. Therefore, w* = w2 = (3, 2, 0). Finally, the decision
surface for the original set of training vectors (see previous exercise) has equation (3, 2) · x = 0.
Notice that (3, 2) · x > 0 for every x in Class +1, and (3, 2) · x < 0 for every x in Class −1, which
is the desired result.
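
For readers who want to replay this run, here is a short Python trace (my own sketch; the signs on the third and fifth vectors in the checking order are inferred from the updates shown above and may differ from the original handout).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Checking order from Exercise 11, with signs inferred from the updates above.
order = [[0, 2, 1], [2, 1, 1], [3, 0, -1], [1, 1, 1], [0, 2, -1]]

w = [0, 0, 0]
while True:
    updated = False
    for x in order:
        if dot(w, x) <= 0:
            w = [wi + xi for wi, xi in zip(w, x)]
            updated = True
    if not updated:
        break

print("w* =", w)   # [3, 2, 0], matching the solution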
