
Lecture 13 Perceptrons

Machine Learning March 16, 2010

Last Time
Hidden Markov Models
Sequential modeling represented in a Graphical Model

Today
Perceptrons, leading to...
Neural Networks, a.k.a. Multilayer Perceptron Networks (but more accurately: Multilayer Logistic Regression Networks)

Review: Fitting Polynomial Functions


Fitting nonlinear 1-D functions.
Polynomial: $f(x) = \sum_{d=0}^{D} \theta_d x^d$
Risk: $R(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - f(x_i)\right)^2$

Review: Fitting Polynomial Functions


Order-D polynomial regression on a 1-D variable is the same as D-dimensional linear regression: extend the feature vector from a scalar $x$ to $(1, x, x^2, \ldots, x^D)$. More generally, any fixed feature expansion reduces nonlinear fitting to linear regression, as sketched below.
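A minimal sketch of this equivalence (the data and the helper names here are illustrative, not from the lecture):

```python
import numpy as np

def polynomial_features(x, D):
    """Map a 1-D input x to the feature vector (1, x, x^2, ..., x^D)."""
    return np.vander(x, D + 1, increasing=True)

# Hypothetical data: a noisy cubic
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 2 * x**3 - x + 0.1 * rng.standard_normal(x.shape)

# Order-3 polynomial regression == linear regression on expanded features
Phi = polynomial_features(x, 3)
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(theta)  # approximately [0, -1, 0, 2]
```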

Neuron inspires Regression


Graphical representation of linear regression: the McCulloch-Pitts neuron. (Note: this is not a graphical model.)

Neuron inspires Regression


Edges multiply the signal ($x_i$) by some weight ($\theta_i$); nodes sum their inputs. Equivalent to linear regression.

Introducing Basis functions


Graphical representation of feature extraction

Edges multiply the signal by a weight; nodes apply a basis function, $\phi_d$ (sketched below).
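A small illustration of this node/edge picture (the particular basis functions chosen here are hypothetical; any fixed set of $\phi_d$ works):

```python
import numpy as np

# Illustrative choice of basis functions phi_d (any fixed set works)
basis = [lambda x: 1.0, lambda x: x, lambda x: x**2, lambda x: np.sin(x)]

def features(x):
    """Nodes: apply each basis function phi_d to the raw input."""
    return np.array([phi(x) for phi in basis])

def network_output(x, theta):
    """Edges: weight each phi_d(x), then sum. Linear regression in feature space."""
    return theta @ features(x)
```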

Extension to more features


Graphical representation of feature extraction

Combining function
How do we construct the neural output?

Linear Neuron

Combining function
Sigmoid (squashing) function: $\sigma(z) = \frac{1}{1 + e^{-z}}$

Logistic Neuron
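A sketch of the two combining functions side by side (assuming the standard logistic sigmoid):

```python
import numpy as np

def sigmoid(z):
    """Squashing function: maps any real-valued input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def linear_neuron(x, theta):
    """Linear combining function: just the weighted sum."""
    return x @ theta

def logistic_neuron(x, theta):
    """Logistic neuron: sigmoid applied to the weighted sum."""
    return sigmoid(x @ theta)
```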

Classification using the same metaphor



Logistic Neuron optimization


Minimizing $R(\theta)$ is more difficult.

Bad news: there is no closed-form solution. Good news: it's convex.

Aside: Convex Regions


Convex: for any pair of points $x_a$ and $x_b$ within a region, every point $x_c$ on the line segment between $x_a$ and $x_b$ is also in the region.

Aside: Convex Functions


Convex: for any pair of points $x_a$ and $x_b$, the function lies on or below the chord connecting $(x_a, f(x_a))$ and $(x_b, f(x_b))$: $f(\lambda x_a + (1-\lambda) x_b) \le \lambda f(x_a) + (1-\lambda) f(x_b)$ for all $\lambda \in [0, 1]$.

Aside: Convex Functions


Convex functions have a single minimum: any local minimum is the global minimum! How does this help us? (Nearly) guaranteed optimality of gradient descent.

Gradient Descent
The gradient $\nabla R(\theta)$ is defined (though we can't solve $\nabla R(\theta) = 0$ directly). It points in the direction of fastest increase.

Gradient Descent
The gradient points in the direction of fastest increase. To minimize $R$, move in the opposite direction: $\theta \leftarrow \theta - \eta \nabla R(\theta)$.


Gradient Descent
Initialize randomly. Update with small steps: (nearly) guaranteed to converge to the minimum.


Gradient Descent
Initialize randomly. Update with small steps. Can oscillate if $\eta$ is too large.


Gradient Descent
Initialize randomly. Update with small steps. Can stall if $\nabla R(\theta)$ is ever 0 when not at the minimum.
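A minimal batch gradient descent sketch for the logistic neuron, assuming the cross-entropy risk (the step size `eta` and iteration count are illustrative choices, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, eta=0.1, steps=1000):
    """Batch gradient descent on the logistic-regression risk.

    X: (N, D) design matrix; y: (N,) labels in {0, 1}.
    eta is the step size: too large can oscillate, small steps converge.
    """
    rng = np.random.default_rng(0)
    theta = rng.normal(size=X.shape[1])                  # initialize randomly
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)   # gradient of the risk
        theta -= eta * grad                              # move against the gradient
    return theta
```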

Back to Neurons

Linear Neuron

Logistic Neuron


Perceptron
Classification squashing function: a hard threshold (step/sign function) instead of the sigmoid.

Strictly classification error

Classification Error
Only count errors when a classification is incorrect.

The sigmoid leads to greater-than-zero error even on correct classifications.

Classification Error
Only count errors when a classification is incorrect.

Perceptrons use strictly classification error


Perceptron Error
Can't do gradient descent on this: the 0-1 error is piecewise constant, so its gradient is zero almost everywhere.

Perceptrons use strictly classification error

Perceptron Loss
With classification loss in hand, define the Perceptron Loss, calculated only for the misclassified data points:

$E_P(\theta) = -\sum_{i \in \mathcal{M}} y_i \theta^\top x_i$, with $y_i \in \{-1, +1\}$ and $\mathcal{M}$ the set of misclassified points.

0-1 loss vs. perceptron loss: the risk is now piecewise linear rather than a step function.
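A sketch of this loss under the conventions above (labels in $\{-1, +1\}$; boundary points counted as misclassified is a choice made here):

```python
import numpy as np

def perceptron_loss(theta, X, y):
    """Perceptron criterion: sum of -y_i * theta^T x_i over misclassified points.

    y in {-1, +1}. Zero for correctly classified points, piecewise linear
    in theta elsewhere, so gradient descent applies.
    """
    margins = y * (X @ theta)
    misclassified = margins <= 0     # count boundary points as errors
    return -np.sum(margins[misclassified])
```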

Perceptron Loss
Loss calculated for each misclassified data point.

Nice gradient descent: $\nabla E_P(\theta) = -\sum_{i \in \mathcal{M}} y_i x_i$.

Perceptron vs. Logistic Regression


Logistic regression has a hard time eliminating all errors; perceptrons often do better, because errors carry increased importance.

[Figure: the same data under both models. Logistic regression: 2 errors. Perceptron: 0 errors.]

Stochastic Gradient Descent


Instead of calculating the gradient over all points and then updating, update the weights for each misclassified point during training.

Update rule: $\theta \leftarrow \theta + \eta\, y_i x_i$ for each misclassified point $x_i$.

Online Perceptron Training


Online training: update the weights for each data point.

Iterate over points $x_i$:
If $x_i$ is correctly classified: leave $\theta$ unchanged.
Else: $\theta \leftarrow \theta + \eta\, y_i x_i$.

Theorem: if the $x_i \in X$ are linearly separable, then this process converges to a $\theta^*$ with zero error in a finite number of steps.
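A sketch of this training loop, using the update rule from the stochastic gradient descent slide (the `max_epochs` safety cap is an assumption, not in the slides):

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Online perceptron training.

    X: (N, D) inputs; y: (N,) labels in {-1, +1}.
    If the data are linearly separable, the loop reaches zero error
    after finitely many updates (the convergence theorem above).
    """
    theta = np.zeros(X.shape[1])
    for _ in range(max_epochs):              # assumed cap on training passes
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ theta) <= 0:       # misclassified (or on the boundary)
                theta += eta * yi * xi       # update rule
                errors += 1
        if errors == 0:                      # converged: zero training error
            break
    return theta
```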

Linearly Separable
Two classes of points are linearly separable iff there exists a line (in general, a hyperplane) such that all points of one class fall on one side and all points of the other class fall on the other side.


Next Time
Multilayer Neural Networks

