Perceptron Learning Algorithm Lecture Supplement
In this section we prove that, when a linearly separable set of training examples (x1, y1), . . . , (xn, yn) is
provided as input to the Perceptron Learning algorithm, the algorithm will eventually terminate,
meaning that values w and b have been found for which yi(w · xi + b) > 0, for all i = 1, . . . , n.
To simplify the analysis, notice that, if we add an extra component equal to 1 to each vector xi, i = 1, . . . , n,
then we may think of b as an extra component of w. Then, after absorbing b into w, the condition becomes
yi(w · xi) > 0, for all i = 1, . . . , n. Finally, if we replace each xi with the vector yi xi, then, after
absorbing yi into xi, the condition becomes w · xi > 0, for all i = 1, . . . , n. We say that a set of vectors x1, . . . , xn
is positive iff there exists a vector w for which w · xi > 0, for all i = 1, . . . , n.
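
As an illustration, the following Python sketch performs both absorption steps at once (NumPy is assumed, and the helper name to_positive_vectors is only illustrative): it appends a 1 to each xi, absorbing b into w, and then multiplies by the label yi.

import numpy as np

def to_positive_vectors(samples):
    """Convert labeled pairs (x_i, y_i), with y_i in {+1, -1}, into positive-vector
    form: append a 1 to x_i (absorbing the bias b into w), then multiply by y_i
    (absorbing the label).  The original data are linearly separable iff the
    returned vectors form a positive set."""
    return [y * np.append(np.asarray(x, dtype=float), 1.0) for x, y in samples]

# For instance, to_positive_vectors([((1, 2), +1), ((3, 4), -1)]) returns
# [array([1., 2., 1.]), array([-3., -4., -1.])].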
Cauchy-Schwarz inequality
Theorem 1 (Cauchy-Schwarz inequality). For any vectors u and v, |u · v| ≤ |u| |v|.

Proof of Theorem 1. Theorem 1 is intuitively true if we recall that |u · v| is the length of the
projection of u onto v, times the length of v. The result is then true if we believe that the length of
a projection of u onto v should not exceed the length of u. The following is a more formal proof.
If v = 0 the inequality is immediate, so assume v ≠ 0. For any scalar t, by several applications of the four properties of inner products, we have

0 ≤ |u − tv|² = (u − tv) · (u − tv) = u · u − 2t (u · v) + t² (v · v).

Choosing t = (u · v)/(v · v) gives

0 ≤ u · u − (u · v)²/(v · v),

so that (u · v)² ≤ |u|² |v|², and taking square roots yields |u · v| ≤ |u| |v|.
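
As a quick numerical sanity check of the inequality, the following Python sketch (the function name check_cauchy_schwarz is only illustrative; NumPy is assumed) verifies |u · v| ≤ |u| |v| on randomly drawn vectors.

import numpy as np

def check_cauchy_schwarz(trials=1000, dim=5, seed=0):
    """Check |u . v| <= |u| |v| on randomly drawn vectors."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        u, v = rng.normal(size=dim), rng.normal(size=dim)
        # small tolerance guards against floating-point round-off
        assert abs(np.dot(u, v)) <= np.linalg.norm(u) * np.linalg.norm(v) + 1e-12
    return True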
Convergence Theorem
Theorem 2. Let x1, . . . , xn be a set of positive vectors. Then the Perceptron Learning algorithm
determines a weight vector w for which w · xi > 0, for all i = 1, . . . , n.
Proof of Theorem 2. Since the set of input vectors is positive, there is a weight vector w* for
which |w*| = 1, and there exists a γ > 0 for which, for i = 1, 2, . . . , n,

w* · xi > γ.

Furthermore, let r > 0 be such that |xi| ≤ r, for all i = 1, . . . , n. Let k be the number of times the
vector w in the Perceptron Learning algorithm has been updated, and let wk denote the value of the
weight vector after the k-th update. We assume w0 = 0; i.e. the algorithm begins with the zero weight
vector. The objective is to show that k must be bounded. Suppose xi is used for the k-th update in
the algorithm. Then wk can be recursively written as

wk = wk−1 + xi,

where wk−1 · xi ≤ 0, since xi was misclassified by wk−1.

We first use induction to prove that |wk|² ≤ kr². This is certainly true for k = 0, since w0 = 0.
For the inductive step, assume that |wj|² ≤ jr², for all j < k. Then

|wk|² = |wk−1 + xi|² = (wk−1 + xi) · (wk−1 + xi) = |wk−1|² + 2 wk−1 · xi + |xi|² ≤ |wk−1|² + r² ≤ (k − 1)r² + r² = kr²,

and the claim is proved.
Thus, |wk| ≤ r√k.

Next, we may use induction a second time to prove a lower bound on w* · wk, namely that w* · wk ≥ kγ.
This is certainly true for k = 0. Now if the inductive assumption is that w* · wk−1 ≥ (k − 1)γ, then

w* · wk = w* · (wk−1 + xi) = w* · wk−1 + w* · xi ≥ w* · wk−1 + γ ≥ (k − 1)γ + γ = kγ,

and the lower bound is proved.

Finally, combining the two bounds with the Cauchy-Schwarz inequality (Theorem 1) gives

kγ ≤ w* · wk ≤ |w*| |wk| = |wk| ≤ r√k,

so that k ≤ (r/γ)². Hence the number of updates is bounded, the algorithm terminates, and the final
weight vector w satisfies w · xi > 0, for all i = 1, . . . , n.
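
To make the update rule concrete, here is a minimal Python sketch of the algorithm analyzed above (the function name perceptron_positive and the max_updates guard are illustrative choices); it assumes the input has already been converted to a set of positive vectors.

import numpy as np

def perceptron_positive(xs, max_updates=10000):
    """Perceptron learning on a positive set: find w with w . x > 0 for all x.
    Starting from w0 = 0, repeatedly add some x with w . x <= 0 to w,
    which is exactly the update w_k = w_{k-1} + x_i used in the proof of Theorem 2."""
    xs = [np.asarray(x, dtype=float) for x in xs]
    w = np.zeros_like(xs[0])                      # w0 = 0
    for k in range(max_updates):
        misclassified = [x for x in xs if np.dot(w, x) <= 0]
        if not misclassified:                     # w . x > 0 for every x
            return w, k                           # k = number of updates performed
        w = w + misclassified[0]                  # w_k = w_{k-1} + x_i
    raise RuntimeError("update limit reached; input may not be a positive set")

By Theorem 2, when the input really is a positive set, the loop exits after at most (r/γ)² updates.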
Exercises
1. Describe five features that could be used for the purpose of classifying a fish as either a salmon
or a trout.
2. Plot the training samples ((0, 0), +1), ((0, 1), −1), ((1, 0), −1), ((1, 1), +1) and verify that the
two classes are not linearly separable. Then provide an algebraic proof. Hint: assume w =
(w1, w2) and b are the parameters of a separating line, and obtain a contradiction.
3. If vector w = (2, 1, 5) is normal to plane P and P contains the point (0, 0, 5), then provide
an equation for P .
4. Provide an equation of a plane P that is normal to vector w = (1, 1, 3) and passes through
the point (0, 1, 2).
5. If the vector v = (2, 1, 5) makes a 60-degree angle with a unit vector u, compute u · v.
6. Prove that the Cauchy-Schwarz inequality becomes an equality iff v = ku, for some constant k.
7. Establish that, for any n-dimensional vector v, |v| = √(v · v).
8. Consider the two classes of training vectors
C+ = (0.1, 0.2), (0.2, 0.1), (0.15, 0.1), (1.1, 0.8), (1.2, 1.1)
and
C− = (1.1, 0.1), (1.25, 0.15), (0.9, 0.1), (0.1, 1.2), (0.2, 0.9).
Compute the centers c+ and c− and provide the equation of the Simple-Learning algorithm
decision surface. Use the decision surface to classify the vector (0.5, 0.5).
9. Give an example using only three linearly separable training vectors, where the surface obtained
from the Simple-Learning algorithm misclassifies at least one of the training vectors.
10. Convert the training samples ((1, 1), +1), ((0, 2), +1), ((−3, 0), −1), ((2, 1), +1), ((0, −2), −1)
into an equivalent set of positive vectors.
11. Demonstrate the Perceptron Learning algorithm, with learning rate equal to 1, using the positive vectors obtained
from the previous exercise as input. Start with w0 = 0, and use the order
when checking for misclassifications. Compute the final normal vector w*, and verify that the
surface (w*1, w*2) · x = −w*3 separates the original data.
Exercise Solutions
1. Answers may vary. Here are five that come to mind: weight (grams), length from head to tail
(cm), girth (cm), number of fins (1-10), primary color.
2. Assume the training samples are separated by the line w · x = b, where w = (w1, w2), with
w · x > b for the samples labeled +1 and w · x < b for those labeled −1. Then i)
w · (1, 1) = w1 + w2 > b, ii) w · (0, 0) = 0 > b, iii) w · (1, 0) = w1 < b, and iv) w · (0, 1) = w2 < b.
Adding iii) and iv) yields w1 + w2 < 2b, and combining this with i) yields b < 2b, or b > 0, which
contradicts ii). Therefore, the training samples are not linearly separable.
3. Since
b = w · (0, 0, 5) = (2, 1, 5) · (0, 0, 5) = 25,
the equation is w · x = 25.
4. Since
b = w · (0, 1, 2) = (1, 1, 3) · (0, 1, 2) = 7,
the equation is w · x = 7.
5. u · v = |u||v| cos 60° = (√30)(1)(1/2) = √30/2.
6. If v = ku, for some constant k, then
|u · v| = |k| (u · u) = |k| |u|² = |u| (|k| |u|) = |u| |v|,
so the inequality holds with equality.
Now assume that |u · v| = |u||v|. Without loss of generality, assume that |u| = 1. Let
w = proj(v, u) = (u · v) u.
Then
|v − w|² = (v − w) · (v − w) = v · v − 2 (u · v)² + (u · v)² (u · u) = |v|² − (u · v)² = |v|² − |v|² = 0,
since |u| = 1 and |u · v| = |u||v| = |v|. Therefore v = w = (u · v) u; i.e. v = ku with k = u · v.
7. √(v · v) = √(v1² + . . . + vn²) = |v|.
9. Consider ((0, −1), −1), ((0, 0), +1), and ((0, 4), +1). Then c+ = (0, 2), while c− = (0, −1). Then
w = c+ − c− = (0, 3), and c = (1/2)(c+ + c−) = (1/2)(0, 1) = (0, 0.5). Finally, b = w · c = 1.5. The equation of the
decision surface is thus
(0, 3) · x = 1.5.
Then
(0, 3) · (0, 0) = 0 < 1.5,
which implies that (0, 0) is misclassified as being in Class −1. (A Python sketch of this computation
appears at the end of this section.)
10. (1, 1, 1), (0, 2, 1), (3, 0, −1), (2, 1, 1), (0, 2, −1).
11.
w0 · (0, 2, 1) = 0 ⇒ w1 = w0 + (0, 2, 1) = (0, 2, 1).
w1 · (3, 0, −1) = −1 ⇒ w2 = w1 + (3, 0, −1) = (3, 2, 0).
Now w2 · x > 0 for each training vector x. Therefore, w* = w2 = (3, 2, 0). Finally, the decision
surface for the original set of training vectors (see previous exercise) has equation (3, 2) · x = 0.
Notice that (3, 2) · x > 0 for every x in Class +1, and (3, 2) · x < 0 for every x in Class −1, which
is the desired result.
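
For Exercises 8 and 9, the computations in the solution to Exercise 9 suggest the following center-of-mass rule for the Simple-Learning algorithm: w = c+ − c−, with the decision surface passing through the midpoint of the two centers. The Python sketch below is based on that reading (the function names simple_learning and classify are only illustrative).

import numpy as np

def simple_learning(class_plus, class_minus):
    """Center-of-mass rule: w = c+ - c-, and the decision surface w . x = b
    passes through the midpoint c = (c+ + c-)/2, so b = w . c."""
    c_plus = np.mean(class_plus, axis=0)
    c_minus = np.mean(class_minus, axis=0)
    w = c_plus - c_minus
    b = float(np.dot(w, (c_plus + c_minus) / 2.0))
    return w, b

def classify(w, b, x):
    """Return +1 if x lies on the c+ side of the surface, otherwise -1."""
    return 1 if np.dot(w, x) > b else -1

On the three vectors from Exercise 9 it gives w = (0, 3) and b = 1.5, and classify(w, b, (0, 0)) returns −1, reproducing the misclassification noted above.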
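
Similarly, the final answer to Exercise 11 can be checked directly: with the positive vectors from Exercise 10, every inner product with w* = (3, 2, 0) is positive.

import numpy as np

pos = [(1, 1, 1), (0, 2, 1), (3, 0, -1), (2, 1, 1), (0, 2, -1)]   # from Exercise 10
w_star = np.array([3.0, 2.0, 0.0])
print([float(np.dot(w_star, x)) for x in pos])   # [5.0, 4.0, 9.0, 8.0, 4.0] -- all > 0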