Machine Learning: Learning Highly Non-Linear Functions
10-701/15-781, Fall 2011
Neural Networks
Eric Xing
Reading: Chap. 5, C. Bishop
© Eric Xing @ CMU, 2006-2011
Perceptron and Neural Nets
• From biological neuron to artificial neuron (perceptron)
[Figure: a biological neuron (dendrites, soma, axon, synapses) alongside its artificial counterpart: inputs x1, x2 pass through synaptic weights w1, w2 into a linear combiner Σ and a hard limiter with threshold θ, producing output Y.]
• Activation function:
  X = Σ_{i=1}^{n} x_i w_i
  Y = +1 if X ≥ ω0, −1 if X < ω0
• Artificial neural networks
  • supervised learning
  • gradient descent
[Figure: a feed-forward network with an input layer, a middle (hidden) layer, and an output layer; input signals flow left to right to the output signals.]
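A minimal sketch of this unit (the weights and threshold below are illustrative, not from the slide):

```python
import numpy as np

def perceptron_unit(x, w, threshold):
    """Hard-limiter perceptron: weighted sum followed by a threshold."""
    X = np.dot(x, w)                 # X = sum_i x_i * w_i
    return 1 if X >= threshold else -1

# Example: two inputs with weights w1, w2 and threshold omega_0 = 0.5
print(perceptron_unit([1.0, 0.0], [0.7, 0.3], 0.5))   # -> 1
print(perceptron_unit([0.0, 1.0], [0.7, 0.3], 0.5))   # -> -1
```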
Connectionist Models
• Consider humans:
  • neuron switching time ~ 0.001 second
  • number of neurons ~ 10^10
[Figure: a network of nodes connected by weighted links (synapses), with excitatory (+) and inhibitory (−) connections, analogous to dendrites, axons, and synapses.]
Abdominal Pain Perceptron
[Figure: a single-layer perceptron for abdominal pain. Input variables Male, Age, Temp, WBC, Pain Intensity, Pain Duration (example values 1, 20, 37, 10, 1, 1; a second row of values, 2, 4, 1, 1, 50, 1, also appears) feed through adjustable weights into a single output unit whose output, 0.8, is read as the "probability" of Myocardial Infarction (MI).]
The "Driver" Network
Perceptrons
[Figure: a single-layer perceptron. Input units (Cough, Headache) connect through weights to output units (No disease, Pneumonia, Flu, Meningitis). The ∆ rule changes the weights to decrease the error = what we got − what we wanted.]
Perceptrons
• Input units i — input to unit i: a_i = measured value of variable i
• Output units j — output of unit j: o_j = 1 / (1 + e^{−(a_j + θ_j)})
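A one-line sketch of this output unit (variable names are illustrative; a_j stands for the unit's net input):

```python
import numpy as np

def sigmoid_output(a_j, theta_j):
    """Logistic output unit: o_j = 1 / (1 + exp(-(a_j + theta_j)))."""
    return 1.0 / (1.0 + np.exp(-(a_j + theta_j)))

# a_j would be the weighted sum of the unit's inputs; theta_j is its bias.
print(sigmoid_output(0.0, 0.0))   # -> 0.5
```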
Jargon Pseudo-Correspondence
• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = "weights"
• Estimates = "targets"
The perceptron learning algorithm
x_d = input
t_d = target output
o_d = observed unit output
Weight update (the standard perceptron training rule): w_i ← w_i + η (t_d − o_d) x_{i,d}, where η is the learning rate.
x_d = input, t_d = target output, o_d = observed unit output

Incremental (stochastic) mode — do until converged, for each training example d:
  1. compute the gradient ∇E_d[w]
  2. w ← w − η ∇E_d[w]

Batch mode — do until converged:
  1. compute the gradient ∇E_D[w]
  2. w ← w − η ∇E_D[w]

where E_D[w] = ½ Σ_{d∈D} (t_d − o_d)² and E_d[w] = ½ (t_d − o_d)².
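A hedged sketch of both modes for a linear unit (the toy data, learning rate, and epoch count are illustrative, not from the lecture):

```python
import numpy as np

def train_delta_rule(X, t, eta=0.05, epochs=100, batch=False):
    """Gradient descent on E[w] = 1/2 * sum_d (t_d - o_d)^2 for a linear unit o = X @ w.
    batch=False gives the incremental (stochastic) mode, batch=True the batch mode."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        if batch:
            o = X @ w
            grad = -(t - o) @ X                       # dE_D/dw summed over all examples
            w -= eta * grad
        else:
            for x_d, t_d in zip(X, t):
                o_d = x_d @ w
                w -= eta * (-(t_d - o_d) * x_d)       # dE_d/dw for a single example
    return w

# Toy usage: recover the weights of a noiseless linear target
X = np.array([[1., 0.], [0., 1.], [1., 1.]])
t = X @ np.array([2., -1.])
print(train_delta_rule(X, t, batch=True))             # approximately [ 2. -1.]
```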
MLE vs MAP
• Maximum conditional likelihood estimate
• Maximum a posteriori estimate
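In symbols (standard forms, stated here as an assumption rather than quoted from the slide):
  MCLE:  ŵ = argmax_w ∏_d P(t_d | x_d, w)
  MAP:   ŵ = argmax_w P(w) ∏_d P(t_d | x_d, w)
With a Gaussian prior on w, the MAP estimate adds an L2 (weight-decay) penalty to the conditional log-likelihood.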
What decision surface does a perceptron define?

NAND
x | y | Z (color)
0 | 0 | 1
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0

Perceptron: y = f(x1·w1 + x2·w2), with threshold θ = 0.5 and
f(a) = 1 for a > θ, 0 for a ≤ θ

f(0·w1 + 0·w2) = 1
f(0·w1 + 1·w2) = 1
f(1·w1 + 0·w2) = 1
f(1·w1 + 1·w2) = 0

Some possible values for (w1, w2):
(0.20, 0.35), (0.20, 0.40), (0.25, 0.30), (0.40, 0.20)
What decision surface does a perceptron define?

XOR
x | y | Z (color)
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0

y = f(x1·w1 + x2·w2), θ = 0.5, same f(a) as above; the required outputs are
f(0·w1 + 0·w2) = 0
f(0·w1 + 1·w2) = 1
f(1·w1 + 0·w2) = 1
f(1·w1 + 1·w2) = 0
but no values of w1 and w2 satisfy all four: XOR is not linearly separable.
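A quick check of this claim (a grid search over illustrative weight values, with the threshold fixed at θ = 0.5 and no bias, as in the slide):

```python
import itertools
import numpy as np

def threshold_unit(x, w, theta=0.5):
    return 1 if np.dot(x, w) > theta else 0

xor_table = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

# Brute-force search over a grid of weight pairs: none reproduces XOR.
solutions = [
    (w1, w2)
    for w1, w2 in itertools.product(np.linspace(-1, 1, 41), repeat=2)
    if all(threshold_unit(x, (w1, w2)) == z for x, z in xor_table.items())
]
print(solutions)   # -> [] : no linear threshold unit computes XOR
```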
[Figure: decision regions over the binary inputs. Left: a square over cough and headache with corners 00 (No disease), 01 (Meningitis), 10 (Pneumonia), 11 (Flu), divided by a line into "treatment" and "no treatment" regions. Right: a three-input version, a cube with corners 000 through 111.]
Neural Network Model

[Figure: a feed-forward network. Independent variables (inputs) Age = 34, Gender = 2, Stage = 4 connect through weights (.6, .4, .2, .1, .5, .3, .2, .8, .7, .2) to a hidden layer of Σ units, which connect through a second set of weights to a single output unit whose value, 0.6, is read as the "probability of being alive" (the dependent variable / prediction).]
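A minimal forward-pass sketch of this kind of network (the exact weight-to-edge mapping and the hidden-unit nonlinearity are assumptions; only the numbers come from the figure):

```python
import numpy as np

def forward(x, W_hidden, w_out):
    """Forward pass: sigmoid hidden layer, sigmoid output ("probability of being alive")."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sigmoid(W_hidden @ x)     # hidden-layer activations
    return sigmoid(w_out @ h)     # scalar output in (0, 1)

# Illustrative numbers only.
x = np.array([34.0, 2.0, 4.0])                       # Age, Gender, Stage
W_hidden = np.array([[.6, .1, .7], [.4, .5, .2]])    # 2 hidden units x 3 inputs
w_out = np.array([.2, .8])
print(forward(x, W_hidden, w_out))
```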
[Figure: the same network shown twice more, each view highlighting a subset of the weights from the inputs (Age, Gender, Stage) through the hidden Σ units to the output 0.6, the "probability of being alive".]
Not really, no target for hidden units...

[Figure: the full network again: Age = 34, Gender = 2, Stage = 4 → hidden Σ units → output 0.6, the "probability of being alive"; the hidden units have no training targets of their own.]
Recall perceptrons
[Figure: as before, input units (Cough, Headache) connect through weights to output units (No disease, Pneumonia, Flu, Meningitis); the ∆ rule changes the weights to decrease the error = what we got − what we wanted.]
Hidden Units and Backpropagation

[Figure: a network with input units, hidden units, and output units. The output error (what we got − what we wanted) is propagated backwards: the ∆ rule is applied first to the output-layer weights, then to the hidden-layer weights.]

x_d = input
t_d = target output
o_d = observed unit output
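A compact sketch of one stochastic backpropagation update for a one-hidden-layer sigmoid network, using the standard squared-error deltas (sizes, learning rate, and toy data are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W1, W2, eta=0.1):
    """One stochastic backpropagation step.
    output unit:  delta_k = o_k (1 - o_k) (t_k - o_k)
    hidden unit:  delta_h = o_h (1 - o_h) * sum_k W2[k, h] * delta_k
    """
    h = sigmoid(W1 @ x)                       # hidden activations
    o = sigmoid(W2 @ h)                       # output activations
    delta_out = o * (1 - o) * (t - o)
    delta_hid = h * (1 - h) * (W2.T @ delta_out)
    W2 += eta * np.outer(delta_out, h)
    W1 += eta * np.outer(delta_hid, x)
    return W1, W2

# Toy usage on a single (x, t) pair with small random weights.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(scale=0.1, size=(3, 2)), rng.normal(scale=0.1, size=(1, 3))
for _ in range(1000):
    W1, W2 = backprop_step(np.array([1.0, 0.0]), np.array([1.0]), W1, W2)
print(sigmoid(W2 @ sigmoid(W1 @ np.array([1.0, 0.0]))))   # approaches the target 1.0
```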
More on Backpropagation
• It is doing gradient descent over the entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum
• In practice, often works well (can run multiple times)
Learning Hidden Layer Representation
• A target function: [figure]
• A network: [figure]
Training

[Figure: an error surface over a weight w. Starting from w_initial (initial error), gradient descent follows the negative derivative (here a positive change in w) down to w_trained, reaching the final error at a local minimum.]
Overfitting in Neural Nets

[Figure: error versus training cycles. Training error keeps decreasing with more cycles, while holdout error stops improving and rises again as the network overfits.]
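One common response to this curve is early stopping; a rough sketch (the train_step and holdout_error callbacks are placeholders, not anything defined in the lecture):

```python
import numpy as np

def train_with_early_stopping(train_step, holdout_error, max_cycles=1000, patience=10):
    """Keep training while holdout error improves; stop once it has not improved
    for `patience` cycles (a common guard against the overfitting shown above)."""
    best_err, best_cycle, since_best = np.inf, 0, 0
    for cycle in range(max_cycles):
        train_step()              # one pass of weight updates on the training set
        err = holdout_error()     # error on held-out data
        if err < best_err:
            best_err, best_cycle, since_best = err, cycle, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_cycle, best_err
```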
Regression vs. Neural Networks

[Figure: Y modeled from X1, X2, X3 two ways, as a regression and as a neural network; with three inputs there are (2^3 − 1) possible combinations of X1, X2, X3.]
Modern ANN topics: "Deep" Learning
• Continuous functions:
  • Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
  • Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
Using ANNs to learn hierarchical representations

Trainable Feature Extractor → Trainable Feature Extractor → Trainable Classifier
(Learned Internal Representation)

• Deep Learning: learning a hierarchy of internal representations
• From low-level features to mid-level invariant representations, to object identities
• Representations are increasingly invariant as we go up the layers
• Using multiple stages gets around the specificity/invariance dilemma
Filtering + Nonlinearity + Pooling = 1 stage of a Convolutional Net
• [Hubel & Wiesel 1962]:
  • simple cells detect local features
  • complex cells "pool" the outputs of simple cells within a retinotopic neighborhood
[Figure: multiple convolutions produce retinotopic feature maps ("simple cells"), which are then pooled/subsampled ("complex cells").]
[Figure: alternating stages of convolutions/filtering and pooling/subsampling.]

Hierarchical Architecture
• Representations are more global, more invariant, and more abstract as we go up the layers

Alternated Layers of Filtering and Spatial Pooling
• Filtering detects conjunctions of features
• Pooling computes local disjunctions of features

Fully Trainable
• All the layers are trainable
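A minimal numpy sketch of one filtering + nonlinearity + pooling stage (the filter values, tanh nonlinearity, and 2x2 pooling are illustrative choices):

```python
import numpy as np

def conv_stage(image, kernel, pool=2):
    """One convolutional-net stage: filtering (valid 2-D convolution),
    a pointwise nonlinearity (tanh), then non-overlapping max pooling."""
    kh, kw = kernel.shape
    H = image.shape[0] - kh + 1
    W = image.shape[1] - kw + 1
    fmap = np.empty((H, W))
    for i in range(H):                      # filtering: detect a local feature everywhere
        for j in range(W):
            fmap[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    fmap = np.tanh(fmap)                    # nonlinearity
    H2, W2 = H // pool, W // pool           # pooling: local "disjunction" of detections
    pooled = fmap[:H2*pool, :W2*pool].reshape(H2, pool, W2, pool).max(axis=(1, 3))
    return pooled

# Toy usage: a 6x6 image, a 3x3 edge-like filter, 2x2 max pooling -> 2x2 output
img = np.arange(36, dtype=float).reshape(6, 6)
kern = np.array([[1., 0., -1.]] * 3)
print(conv_stage(img, kern).shape)          # (2, 2)
```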
Convolutional Net Architecture for Handwriting Recognition

Input:   1@32x32
Layer 1: 6@28x28   (5x5 convolution)
Layer 2: 6@14x14   (2x2 pooling/subsampling)
Layer 3: 12@10x10  (5x5 convolution)
Layer 4: 12@5x5    (2x2 pooling/subsampling)
Layer 5: 100@1x1   (5x5 convolution)
Layer 6: 10 output units
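A sketch of these layer sizes in PyTorch (the framework, tanh nonlinearities, and average pooling are assumptions; only the shapes follow the slide):

```python
import torch.nn as nn

lenet_like = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),     # 1@32x32  -> 6@28x28
    nn.Tanh(),
    nn.AvgPool2d(2),                    # 6@28x28  -> 6@14x14
    nn.Conv2d(6, 12, kernel_size=5),    # 6@14x14  -> 12@10x10
    nn.Tanh(),
    nn.AvgPool2d(2),                    # 12@10x10 -> 12@5x5
    nn.Conv2d(12, 100, kernel_size=5),  # 12@5x5   -> 100@1x1
    nn.Tanh(),
    nn.Flatten(),
    nn.Linear(100, 10),                 # Layer 6: 10 output units (digit classes)
)
```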
How to train?
Application: MNIST Handwritten Digit Dataset
• MNIST: 60,000 training samples, 10,000 test samples
Application: ANN for Face Recognition
• The model
• The learned hidden unit weights
Computer vision features

[Figure: hand-engineered descriptors: SIFT, Spin image, HoG, RIFT, Textons, GLOH. Courtesy: Lee and Ng]

Drawbacks of feature engineering
1. Needs expert knowledge
2. Time-consuming hand-tuning
Basis (or features) can be learned by optimization

• Given input data {x(1), …, x(m)}, we want to find good bases {b1, …, bn}: minimize the reconstruction error plus a sparsity penalty, subject to the normalization constraint ∀j: ||bj|| ≤ 1
• "Sparse coding" (edges)
• DBN (Hinton et al., 2006) with an additional sparseness constraint
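The labels above correspond, in the usual sparse-coding formulation (an assumption, not quoted from the slide), to minimizing Σ_i ||x(i) − Σ_j a_j(i) b_j||² + β Σ_{i,j} |a_j(i)| subject to ||b_j|| ≤ 1. A small numpy sketch of evaluating that objective:

```python
import numpy as np

def sparse_coding_objective(X, B, A, beta=0.1):
    """Reconstruction error + L1 sparsity penalty for sparse coding.
    X: (m, d) inputs; B: (n, d) bases b_j; A: (m, n) coefficients a_j^(i)."""
    recon_error = np.sum((X - A @ B) ** 2)     # sum_i ||x(i) - sum_j a_j(i) b_j||^2
    sparsity = beta * np.sum(np.abs(A))        # beta * sum_{i,j} |a_j(i)|
    return recon_error + sparsity

def normalize_bases(B):
    """Enforce the constraint ||b_j|| <= 1 by rescaling any over-long basis row."""
    norms = np.maximum(np.linalg.norm(B, axis=1, keepdims=True), 1.0)
    return B / norms
```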
Convolutional architectures

[Figure: a stack of stages, each consisting of a convolution (detection) layer followed by a max-pooling layer over a 2x2 grid; filter weights are shared across the image (weight sharing by convolution, e.g., [LeCun et al., 1989]).]

• "Max-pooling"
  • invariance
  • computational efficiency
  • deterministic and feed-forward
• One can develop a convolutional Restricted Boltzmann Machine (CRBM)
• One can define probabilistic max-pooling that combines bottom-up and top-down information

Courtesy: Lee and Ng
• Inference (approximate)
  • undirected connections for all layers (Markov net) [related work: Salakhutdinov and Hinton, 2009]
  • block Gibbs sampling or mean-field
  • hierarchical probabilistic inference
Unsupervised learning of object parts

[Figure: examples for Faces, Cars, Elephants, Chairs.]