Machine Learning: Learning Highly Non-Linear Functions


Machine Learning

10-701/15-781, Fall 2011

Neural Networks

Eric Xing

Lecture 5, September 26, 2011

Reading: Chap. 5 CB

Learning highly non-linear functions

f: X → Y
• f might be a non-linear function
• X: (vector of) continuous and/or discrete variables
• Y: (vector of) continuous and/or discrete variables

Examples: the XOR gate; speech recognition

Perceptron and Neural Nets
• From biological neuron to artificial neuron (perceptron)

[Figure: a biological neuron (soma, dendrites, axon, synapses) alongside an artificial neuron: inputs x1 and x2 enter through synaptic weights w1 and w2, a linear combiner Σ sums them, and a hard limiter with threshold θ produces the output Y.]

• Activation function:

  X = Σ_{i=1..n} x_i w_i        Y = +1 if X ≥ ω0,  −1 if X < ω0

• Artificial neural networks
  • supervised learning
  • gradient descent

[Figure: a multilayer network; input signals enter the input layer, pass through a middle layer, and leave the output layer as output signals.]
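As a concrete, hedged illustration of the threshold unit above, here is a minimal Python sketch; only the ±1 hard-limiter convention comes from the slide, while the weights, threshold, and example inputs are made up for illustration.

```python
def perceptron(x, w, omega0):
    # Linear combiner followed by a hard limiter with threshold omega0
    X = sum(xi * wi for xi, wi in zip(x, w))
    return +1 if X >= omega0 else -1

# Illustrative values only (not from the slide):
print(perceptron([1.0, 0.0], [0.7, 0.3], 0.5))   # -> +1
print(perceptron([0.0, 1.0], [0.7, 0.3], 0.5))   # -> -1
```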

Connectionist Models
• Consider humans:
  • neuron switching time: ~0.001 second
  • number of neurons: ~10^10
  • connections per neuron: ~10^4-5
  • scene recognition time: ~0.1 second
  • 100 inference steps doesn't seem like enough → much parallel computation

[Figure: a schematic neuron with dendrites (nodes), synapses (weights), and an axon, showing excitatory (+) and inhibitory (−) connections.]

• Properties of artificial neural nets (ANN):
  • many neuron-like threshold switching units
  • many weighted interconnections among units
  • highly parallel, distributed processes
Abdominal Pain Perceptron

[Figure: a single-layer perceptron for abdominal pain diagnosis. Inputs Male = 1, Age = 20, Temp = 37, WBC = 10, Pain Intensity = 1, Pain Duration = 1 feed through adjustable weights to seven output units: Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, Pancreatitis. For this input the outputs are 0 0 0 1 0 0 0, i.e. Non-specific Pain.]

Myocardial Infarction Network


[Figure: a neural network for myocardial infarction. Inputs Pain Duration = 2, Pain Intensity = 4, ECG: ST Elevation = 1, Smoker = 1, Age = 50, Male = 1 feed a network whose output is 0.8, the "probability" of MI.]

The "Driver" Network


Perceptrons
[Figure: a perceptron for diagnosis. Input units: Cough, Headache. Output units: No disease, Pneumonia, Flu, Meningitis. The Δ rule changes the weights to decrease the error, where error = (what we got) − (what we wanted).]

Perceptrons

Input units i
  Input to unit i: a_i = measured value of variable i

Output units j
  Input to unit j:  a_j = Σ_i w_ij a_i
  Output of unit j: o_j = 1 / (1 + e^−(a_j + θ_j))
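A minimal Python sketch of the sigmoid unit above; the formula follows the slide, while the numeric inputs, weights, and bias are placeholders for illustration.

```python
import math

def unit_output(a, w, theta_j):
    # o_j = 1 / (1 + e^-(a_j + theta_j)), with a_j = sum_i w_ij * a_i
    a_j = sum(w_i * a_i for w_i, a_i in zip(w, a))
    return 1.0 / (1.0 + math.exp(-(a_j + theta_j)))

print(unit_output([1.0, 0.0], [2.0, -1.0], -0.5))   # ~0.82
```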

Jargon Pseudo-Correspondence
• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = "weights"
• Estimates = "targets"

Logistic Regression Model (the sigmoid unit)


[Figure: a single sigmoid unit as a logistic regression model. Inputs Age = 34, Gender = 1, Stage = 4 with coefficients 5, 4, 8 feed a Σ/sigmoid unit whose output is 0.6, the "probability of being alive". Independent variables x1, x2, x3; coefficients a, b, c; dependent variable (prediction) p.]

The perceptron learning algorithm

• Recall the nice property of the sigmoid function
• Consider the regression problem f: X → Y, for scalar Y
• Let's maximize the conditional data likelihood
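The equations on this slide are images and do not survive in the text version; as a reference reconstruction (standard forms, not a verbatim copy of the slide), the sigmoid, its derivative, and the conditional log-likelihood being maximized are:

  σ(a) = 1 / (1 + e^−a),        dσ(a)/da = σ(a) (1 − σ(a))

  l(w) = Σ_d [ t_d ln o_d + (1 − t_d) ln(1 − o_d) ],   where o_d = σ(w · x_d)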

Gradient Descent

Notation: x_d = input, t_d = target output, o_d = observed unit output, w_i = weight i
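The gradient expressions on this slide are likewise images; the standard squared-error objective and its gradient for a sigmoid unit (a reconstruction for reference, not a verbatim copy of the slide) are:

  E_D[w] = ½ Σ_{d∈D} (t_d − o_d)²,      ∂E_D/∂w_i = − Σ_{d∈D} (t_d − o_d) o_d (1 − o_d) x_{i,d}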

The perceptron learning rules

Notation: x_d = input, t_d = target output, o_d = observed unit output, w_i = weight i

Incremental (stochastic) mode:
  Do until converged:
    For each training example d in D:
      1. compute the gradient ∇E_d[w]
      2. update the weights: w ← w − η ∇E_d[w]

Batch mode:
  Do until converged:
    1. compute the gradient ∇E_D[w]
    2. update the weights: w ← w − η ∇E_D[w]
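The two modes can be sketched in Python as follows; this is a hedged illustration for a single sigmoid unit with squared error, and the learning rate, initialization, and stopping rule are placeholder choices not taken from the slide.

```python
import math, random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_single(w, x_d, t_d):
    """Gradient of the squared error of one example for a sigmoid unit: dE_d/dw_i."""
    o_d = sigmoid(sum(wi * xi for wi, xi in zip(w, x_d)))
    return [-(t_d - o_d) * o_d * (1 - o_d) * x_i for x_i in x_d]

def train(data, n_weights, eta=0.1, epochs=100, incremental=True):
    w = [random.uniform(-0.05, 0.05) for _ in range(n_weights)]
    for _ in range(epochs):
        if incremental:                      # update after every example
            for x_d, t_d in data:
                g = grad_single(w, x_d, t_d)
                w = [wi - eta * gi for wi, gi in zip(w, g)]
        else:                                # batch: sum gradients over D, then update once
            g = [0.0] * n_weights
            for x_d, t_d in data:
                g = [gi + gdi for gi, gdi in zip(g, grad_single(w, x_d, t_d))]
            w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w
```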

MLE vs MAP
• Maximum conditional likelihood estimate

• Maximum a posteriori estimate

What decision surface does a perceptron define?

  x  y  Z (color)
  0  0  1
  0  1  1
  1  0  1
  1  1  0
  NAND

A single unit computes y = f(x1 w1 + x2 w2), with θ = 0.5 and f(a) = 1 for a > θ, 0 for a ≤ θ:

  f(0·w1 + 0·w2) = 1
  f(0·w1 + 1·w2) = 1
  f(1·w1 + 0·w2) = 1
  f(1·w1 + 1·w2) = 0

Some possible values for w1 and w2:

  w1    w2
  0.20  0.35
  0.20  0.40
  0.25  0.30
  0.40  0.20

What decision surface does a perceptron define?

  x  y  Z (color)
  0  0  0
  0  1  1
  1  0  1
  1  1  0
  XOR

A single unit computes y = f(x1 w1 + x2 w2), with θ = 0.5 and f(a) = 1 for a > θ, 0 for a ≤ θ:

  f(0·w1 + 0·w2) = 0
  f(0·w1 + 1·w2) = 1
  f(1·w1 + 0·w2) = 1
  f(1·w1 + 1·w2) = 0

Some possible values for w1 and w2:

  w1    w2
  (none: there is no single-layer solution)

What decision surface does a perceptron define?

  x  y  Z (color)
  0  0  0
  0  1  1
  1  0  1
  1  1  0
  XOR

A two-layer network, with θ = 0.5 for all units and f(a) = 1 for a > θ, 0 for a ≤ θ: weights w1, w2, w3, w4 connect the inputs to two hidden units, and w5, w6 connect the hidden units to the output.

A possible set of values for (w1, w2, w3, w4, w5, w6):

  (0.6, −0.6, −0.7, 0.8, 1, 1)
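A quick Python check that the quoted weights implement XOR, assuming the natural wiring of the figure (w1 and w3 into the first hidden unit, w2 and w4 into the second, w5 and w6 from the hidden units to the output); this wiring is an assumption, since the figure itself is not reproduced here.

```python
def f(a, theta=0.5):
    return 1 if a > theta else 0

def two_layer(x1, x2, w=(0.6, -0.6, -0.7, 0.8, 1.0, 1.0)):
    w1, w2, w3, w4, w5, w6 = w
    h1 = f(w1 * x1 + w3 * x2)      # hidden unit 1
    h2 = f(w2 * x1 + w4 * x2)      # hidden unit 2
    return f(w5 * h1 + w6 * h2)    # output unit

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, two_layer(x1, x2))   # prints the XOR truth table: 0, 1, 1, 0
```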

Non Linear Separation

[Figure, left: a 2-D input space over (Cough, Headache) with corners 00 = No disease, 10 = Pneumonia, 01 = Meningitis, 11 = Flu, and a non-linear boundary separating Treatment from No treatment. Right: a 3-D cube with corners 000 to 111.]

Neural Network Model
[Figure: a feed-forward network. Independent variables (inputs) Age = 34, Gender = 2, Stage = 4 connect through weights to a hidden layer of Σ units, which in turn connect through weights to an output Σ unit whose value is 0.6, the "probability of being alive" (the dependent variable / prediction).]

“Combined logistic models”


[Figure: the same network with one of the hidden-layer logistic units and its weights highlighted; inputs Age = 34, Gender = 2, Stage = 4; output 0.6, the "probability of being alive".]

[Figure: the same network with the second hidden-layer logistic unit and its weights highlighted; inputs Age = 34, Gender = 2, Stage = 4; output 0.6, the "probability of being alive".]

[Figure: the same network with another combination of hidden-unit weights highlighted; inputs Age = 34, Gender = 1, Stage = 4; output 0.6, the "probability of being alive".]

Not really, no target for hidden units...

[Figure: the full network again (inputs Age = 34, Gender = 2, Stage = 4; hidden layer of Σ units; output 0.6, the "probability of being alive"), highlighting that the hidden units have no training targets of their own.]

Recall perceptrons
[Figure (repeated from the earlier perceptron slide): input units Cough, Headache; output units No disease, Pneumonia, Flu, Meningitis; the Δ rule changes the weights to decrease the error = (what we got) − (what we wanted).]

Hidden Units and Backpropagation

[Figure: a network with input units, hidden units, and output units. The error = (what we got) − (what we wanted) is computed at the output units, and the Δ rule is applied layer by layer, propagating the error back from the output units to the hidden units.]

Backpropagation Algorithm

Notation: x_d = input, t_d = target output, o_d = observed unit output, w_i,j = network weights

• Initialize all weights to small random numbers.

Until convergence, do:
  For each training example:
    1. Input the training example to the network and compute the network outputs
    2. For each output unit k: compute its error term δ_k
    3. For each hidden unit h: compute its error term δ_h
    4. Update each network weight w_i,j
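The per-unit update equations are images in the original; the following is a hedged Python sketch of standard stochastic backpropagation for one hidden layer of sigmoid units with squared error (bias terms omitted for brevity; not necessarily the slide's exact notation).

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def backprop_step(x, t, W_hid, W_out, eta=0.1):
    """One stochastic update. W_hid[j][i]: input i -> hidden j; W_out[k][j]: hidden j -> output k."""
    # Forward pass
    h = [sigmoid(sum(W_hid[j][i] * x[i] for i in range(len(x)))) for j in range(len(W_hid))]
    o = [sigmoid(sum(W_out[k][j] * h[j] for j in range(len(h)))) for k in range(len(W_out))]
    # Output-unit error terms: delta_k = o_k (1 - o_k) (t_k - o_k)
    delta_o = [o[k] * (1 - o[k]) * (t[k] - o[k]) for k in range(len(o))]
    # Hidden-unit error terms: delta_j = h_j (1 - h_j) * sum_k W_out[k][j] * delta_k
    delta_h = [h[j] * (1 - h[j]) * sum(W_out[k][j] * delta_o[k] for k in range(len(o)))
               for j in range(len(h))]
    # Weight updates: w <- w + eta * delta * (input feeding that weight)
    for k in range(len(W_out)):
        for j in range(len(h)):
            W_out[k][j] += eta * delta_o[k] * h[j]
    for j in range(len(W_hid)):
        for i in range(len(x)):
            W_hid[j][i] += eta * delta_h[j] * x[i]
    return o
```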

More on Backpropagation
• It is doing gradient descent over the entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum
  • In practice, often works well (can run multiple times)
• Often include a weight momentum term α
• Minimizes error over the training examples
  • Will it generalize well to subsequent testing examples?
• Training can take thousands of iterations → very slow!
  • Using the network after training is very fast
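The momentum term is only named, not written out, in the text version; its usual form (a standard formulation, included for reference rather than copied from the slide) adds a fraction α of the previous weight change:

  Δw_ij(n) = η δ_j x_ij + α Δw_ij(n − 1)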

Learning Hidden Layer Representation

• A network:
• A target function:
• Can this be learned?

Learning Hidden Layer Representation

• A network:
• Learned hidden layer representation:

Training


Minimizing the Error

[Figure: an error surface over a weight w. Gradient descent starts at w_initial with the initial error, follows the negative derivative (a positive change in w), and ends at the final error in a local minimum at w_trained.]

Overfitting in Neural Nets

[Figure, left: CHD risk vs. age, comparing an overfitted model with the "real" model. Right: training and holdout error vs. training cycles; the holdout error rises while the training error keeps falling, the signature of an overfitted model.]

Alternative Error Functions

• Penalize large weights:
• Train on target slopes as well as values
• Tie weights together
  • e.g., in phoneme recognition
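The penalty itself is an image in the original; the standard weight-decay variant of the training error (a reconstruction for reference, not a verbatim copy of the slide) is:

  E(w) = ½ Σ_{d∈D} Σ_{k∈outputs} (t_kd − o_kd)² + γ Σ_{i,j} w_ji²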

Regression vs. Neural Networks

[Figure, left: a regression model where Y depends on all (2^3 − 1) possible combinations of X1, X2, X3: X1, X2, X3, X1X2, X1X3, X2X3, X1X2X3. Right: a neural network over X1, X2, X3 whose hidden units play the role of interaction terms such as "X1", "X2", "X1X3", "X1X2X3".]

Y = a(X1) + b(X2) + c(X3) + d(X1X2) + ...

Artificial neural networks – what you should know

• Highly expressive non-linear functions
• Highly parallel network of logistic-function units
• Minimizing the sum of squared training errors
  • gives MLE estimates of the network weights if we assume zero-mean Gaussian noise on the output values
• Minimizing the sum of squared errors plus the sum of squared weights (regularization)
  • gives MAP estimates assuming zero-mean Gaussian priors on the weights
• Gradient descent as the training procedure
  • how to derive your own gradient descent procedure
• Discover useful representations at the hidden units
• Local minima are the greatest problem
• Overfitting, regularization, early stopping

Modern ANN topics:
“Deep” Learning


Expressive Capabilities of ANNs

• Boolean functions:
  • Every Boolean function can be represented by a network with a single hidden layer
  • but it might require exponentially many (in the number of inputs) hidden units

• Continuous functions:
  • Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
  • Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

Using ANNs to learn hierarchical representations

[Figure: a pipeline of a trainable feature extractor, a second trainable feature extractor, and a trainable classifier.]

Good representations are hierarchical

• In language: hierarchy in syntax and semantics
  – Words -> Parts of Speech -> Sentences -> Text
  – Objects, Actions, Attributes... -> Phrases -> Statements -> Stories
• In vision: part-whole hierarchy
  – Pixels -> Edges -> Textons -> Parts -> Objects -> Scenes

“Deep” learning: learning hierarchical representations

[Figure: the same pipeline of trainable feature extractors and a trainable classifier, now with a learned internal representation.]

• Deep learning: learning a hierarchy of internal representations
• From low-level features to mid-level invariant representations, to object identities
• Representations become increasingly invariant as we go up the layers
• Using multiple stages gets around the specificity/invariance dilemma

Filtering + Non-Linearity + Pooling = one stage of a Convolutional Net

• [Hubel & Wiesel 1962]:
  – simple cells detect local features
  – complex cells "pool" the outputs of simple cells within a retinotopic neighborhood

[Figure: multiple convolutions produce retinotopic feature maps ("simple cells"), followed by pooling / subsampling ("complex cells").]

Convolutional Network: Multi-Stage Trainable Architecture

[Figure: alternating stages of convolution/filtering and pooling/subsampling, followed by classification.]

Hierarchical architecture
  Representations become more global, more invariant, and more abstract as we go up the layers
Alternated layers of filtering and spatial pooling
  Filtering detects conjunctions of features
  Pooling computes local disjunctions of features
Fully trainable
  All the layers are trainable

Convolutional Net Architecture for Handwriting Recognition

[Figure: a LeNet-style network. Input 1@32x32 -> Layer 1: 6@28x28 (5x5 convolution) -> Layer 2: 6@14x14 (2x2 pooling/subsampling) -> Layer 3: 12@10x10 (5x5 convolution) -> Layer 4: 12@5x5 (2x2 pooling/subsampling) -> Layer 5: 100@1x1 (5x5 convolution) -> Layer 6: 10 output units.]

• Convolutional net for handwriting recognition (400,000 synapses)
  • Convolutional layers (simple cells): all units in a feature plane share the same weights
  • Pooling/subsampling layers (complex cells): for invariance to small distortions
  • Supervised gradient-descent learning using back-propagation
  • The entire network is trained end-to-end; all the layers are trained simultaneously
  • [LeCun et al., Proc. IEEE, 1998]
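As a hedged illustration of the layer sizes in the figure, here is a minimal PyTorch-style sketch; the tanh non-linearities and average pooling are assumptions, and the original LeNet-family nets differ in details such as connection tables and the output layer.

```python
import torch
import torch.nn as nn

class SmallConvNet(nn.Module):
    """Layer sizes follow the slide: 1@32x32 -> 6@28x28 -> 6@14x14 -> 12@10x10 -> 12@5x5 -> 100@1x1 -> 10."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),    # 6@28x28
            nn.AvgPool2d(2),                              # 6@14x14
            nn.Conv2d(6, 12, kernel_size=5), nn.Tanh(),   # 12@10x10
            nn.AvgPool2d(2),                              # 12@5x5
            nn.Conv2d(12, 100, kernel_size=5), nn.Tanh(), # 100@1x1
        )
        self.classifier = nn.Linear(100, 10)              # 10 digit classes

    def forward(self, x):
        x = self.features(x)                              # x: (batch, 1, 32, 32)
        return self.classifier(x.flatten(1))

print(SmallConvNet()(torch.zeros(1, 1, 32, 32)).shape)    # torch.Size([1, 10])
```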

How to train?

Application:
MNIST Handwritten Digit Dataset

MNIST handwritten digit dataset: 60,000 training samples, 10,000 test samples


Results on MNIST Handwritten Digits

CLASSIFIER                        | DEFORMATION | PREPROCESSING         | ERROR (%) | REFERENCE
linear classifier (1-layer NN)    |             | none                  | 12.00     | LeCun et al. 1998
linear classifier (1-layer NN)    |             | deskewing             | 8.40      | LeCun et al. 1998
pairwise linear classifier        |             | deskewing             | 7.60      | LeCun et al. 1998
K-nearest-neighbors (L2)          |             | none                  | 3.09      | Kenneth Wilder, U. Chicago
K-nearest-neighbors (L2)          |             | deskewing             | 2.40      | LeCun et al. 1998
K-nearest-neighbors (L2)          |             | deskew, clean, blur   | 1.80      | Kenneth Wilder, U. Chicago
K-NN (L3), 2-pixel jitter         |             | deskew, clean, blur   | 1.22      | Kenneth Wilder, U. Chicago
K-NN, shape context matching      |             | shape context feature | 0.63      | Belongie et al., IEEE PAMI 2002
40 PCA + quadratic classifier     |             | none                  | 3.30      | LeCun et al. 1998
1000 RBF + linear classifier      |             | none                  | 3.60      | LeCun et al. 1998
K-NN, Tangent Distance            |             | subsample 16x16 px    | 1.10      | LeCun et al. 1998
SVM, Gaussian kernel              |             | none                  | 1.40      |
SVM, deg-4 polynomial             |             | deskewing             | 1.10      | LeCun et al. 1998
Reduced Set SVM, deg-5 polynomial |             | deskewing             | 1.00      | LeCun et al. 1998
Virtual SVM, deg-9 polynomial     | Affine      | none                  | 0.80      | LeCun et al. 1998
V-SVM, 2-pixel jittered           |             | none                  | 0.68      | DeCoste and Scholkopf, MLJ 2002
V-SVM, 2-pixel jittered           |             | deskewing             | 0.56      | DeCoste and Scholkopf, MLJ 2002
2-layer NN, 300 HU, MSE           |             | none                  | 4.70      | LeCun et al. 1998
2-layer NN, 300 HU, MSE           | Affine      | none                  | 3.60      | LeCun et al. 1998
2-layer NN, 300 HU                |             | deskewing             | 1.60      | LeCun et al. 1998
3-layer NN, 500+150 HU            |             | none                  | 2.95      | LeCun et al. 1998
3-layer NN, 500+150 HU            | Affine      | none                  | 2.45      | LeCun et al. 1998
3-layer NN, 500+300 HU, CE, reg   |             | none                  | 1.53      | Hinton, unpublished, 2005
2-layer NN, 800 HU, CE            |             | none                  | 1.60      | Simard et al., ICDAR 2003
2-layer NN, 800 HU, CE            | Affine      | none                  | 1.10      | Simard et al., ICDAR 2003
2-layer NN, 800 HU, MSE           | Elastic     | none                  | 0.90      | Simard et al., ICDAR 2003
2-layer NN, 800 HU, CE            | Elastic     | none                  | 0.70      | Simard et al., ICDAR 2003
Convolutional net LeNet-1         |             | subsample 16x16 px    | 1.70      | LeCun et al. 1998
Convolutional net LeNet-4         |             | none                  | 1.10      | LeCun et al. 1998
Convolutional net LeNet-5         |             | none                  | 0.95      | LeCun et al. 1998
Conv. net LeNet-5                 | Affine      | none                  | 0.80      | LeCun et al. 1998
Boosted LeNet-4                   | Affine      | none                  | 0.70      | LeCun et al. 1998
Conv. net, CE                     | Affine      | none                  | 0.60      | Simard et al., ICDAR 2003
Conv. net, CE                     | Elastic     | none                  | 0.40      | Simard et al., ICDAR 2003

Application: ANN for Face Recognition

[Figure, left: the model; right: the learned hidden-unit weights.]

Face Detection with a Convolutional Net

Computer vision features

[Figure: examples of hand-engineered features: SIFT, Spin image, HoG, RIFT, Textons, GLOH.]

Drawbacks of feature engineering:
1. Needs expert knowledge
2. Time-consuming hand-tuning

Courtesy: Lee and Ng

Sparse coding on images

[Figure: natural image patches and the learned bases ("edges").]

New example:

  x = 0.8 * b36 + 0.3 * b42 + 0.5 * b65

  [0, 0, ..., 0.8, ..., 0.3, ..., 0.5, ...] = coefficients (feature representation)

Courtesy: Lee and Ng

Bases (or features) can be learned by optimization

Given input data {x^(1), ..., x^(m)}, we want to find good bases {b_1, ..., b_n}:

  min_{b,a}  Σ_i || x^(i) − Σ_j a_j^(i) b_j ||_2^2  +  β Σ_i || a^(i) ||_1

  (reconstruction error)                              (sparsity penalty)

  subject to  ∀j: || b_j || ≤ 1   (normalization constraint)

Solve by alternating minimization:
  – Keep b fixed, find optimal a.
  – Keep a fixed, find optimal b.

Courtesy: Lee and Ng
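The slide only names the two alternating steps; the sketch below is one possible concrete instantiation in Python/NumPy (ISTA for the sparse codes a, ridge-regularized least squares plus renormalization for the bases b), not necessarily the algorithm used by Lee and Ng.

```python
import numpy as np

def soft_threshold(z, thr):
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def sparse_coding(X, n_bases=64, beta=0.5, n_outer=30, n_ista=50, seed=0):
    """X: (d, m) data matrix, columns are examples. Returns bases B (d, n) and codes A (n, m)."""
    rng = np.random.default_rng(seed)
    d, m = X.shape
    B = rng.standard_normal((d, n_bases))
    B /= np.linalg.norm(B, axis=0, keepdims=True)           # start with ||b_j|| = 1
    A = np.zeros((n_bases, m))
    for _ in range(n_outer):
        # a-step: keep B fixed, minimize ||x - B a||^2 + beta ||a||_1 by ISTA
        step = 1.0 / (2 * np.linalg.norm(B, 2) ** 2 + 1e-8)
        for _ in range(n_ista):
            A = soft_threshold(A - step * 2 * B.T @ (B @ A - X), step * beta)
        # b-step: keep A fixed, ridge-regularized least squares, then project to ||b_j|| <= 1
        B = X @ A.T @ np.linalg.inv(A @ A.T + 1e-6 * np.eye(n_bases))
        norms = np.maximum(np.linalg.norm(B, axis=0, keepdims=True), 1.0)
        B /= norms
    return B, A
```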

Learning Feature Hierarchy

[Figure: a feature hierarchy: input image (pixels) -> "sparse coding" (edges) -> higher layer (combinations of edges).]

DBN (Hinton et al., 2006) with an additional sparseness constraint.

[Related work: Hinton, Bengio, LeCun, and others.]

Courtesy: Lee and Ng

Convolutional architectures

[Figure: a stack of convolution (detection) layers and 2x2 max-pooling layers, from the input upward.]

• Weight sharing by convolution (e.g., [LeCun et al., 1989])
• "Max-pooling":
  – invariance
  – computational efficiency
  – deterministic and feed-forward
• One can develop a convolutional Restricted Boltzmann Machine (CRBM).
• One can define probabilistic max-pooling that combines bottom-up and top-down information.

Courtesy: Lee and Ng
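As a small illustration of the 2x2 max-pooling operation in the figure, here is a generic sketch in plain NumPy on a single feature map; deep-learning frameworks provide their own pooling operators.

```python
import numpy as np

def max_pool_2x2(fmap):
    """Non-overlapping 2x2 max-pooling of a single 2-D feature map (trims odd edges)."""
    H, W = fmap.shape
    f = fmap[:H // 2 * 2, :W // 2 * 2]                 # crop to even size
    return f.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fmap = np.arange(16).reshape(4, 4)
print(max_pool_2x2(fmap))   # [[ 5  7]
                            #  [13 15]]
```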

Convolutional Deep Belief Networks

• Bottom-up (greedy), layer-wise training
  • Train one layer (convolutional RBM) at a time.

• Inference (approximate)
  • Undirected connections for all layers (Markov net)
    [Related work: Salakhutdinov and Hinton, 2009]
  • Block Gibbs sampling or mean-field
  • Hierarchical probabilistic inference

Courtesy: Lee and Ng

Unsupervised learning of object parts

[Figure: learned part-like features for four object classes: Faces, Cars, Elephants, Chairs.]

Courtesy: Lee and Ng

Weaknesses & Criticisms

• Learning everything. Better to encode prior knowledge about the structure of images.
  A: Compare with the machine learning vs. linguists debate in NLP.

• Results not yet competitive with the best engineered systems.
  A: Agreed. True for some domains.

Courtesy: Lee and Ng
