Machine Learning: Learning Highly Non-Linear Functions
10-701/15-781, Fall 2011
Neural Networks
Eric Xing
Reading: Chap. 5, C. Bishop
© Eric Xing @ CMU, 2006-2011
Perceptron and Neural Nets
• From biological neuron to artificial neuron (perceptron)
[Figure: a biological neuron (dendrites, soma, axon, synapses) alongside its artificial counterpart: inputs x1, x2 pass through synaptic weights w1, w2 into a linear combiner Σ and a hard limiter with threshold θ, producing output Y.]
• Activation function:
  X = Σ_{i=1}^{n} x_i w_i
  Y = +1 if X ≥ ω0, −1 if X < ω0
• Artificial neural networks
  • supervised learning
  • gradient descent
[Figure: a feed-forward network with an input layer, a middle (hidden) layer, and an output layer; input signals flow left to right to the output signals.]
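A minimal sketch of this unit (the weights and threshold below are illustrative, not from the slide):

```python
import numpy as np

def perceptron_unit(x, w, threshold):
    """Hard-limiter perceptron: weighted sum followed by a threshold."""
    X = np.dot(x, w)                 # X = sum_i x_i * w_i
    return 1 if X >= threshold else -1

# Example: two inputs with weights w1, w2 and threshold omega_0 = 0.5
print(perceptron_unit([1.0, 0.0], [0.7, 0.3], 0.5))   # -> 1
print(perceptron_unit([0.0, 1.0], [0.7, 0.3], 0.5))   # -> -1
```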
Connectionist Models
• Consider humans:
  • neuron switching time ~ 0.001 second
  • number of neurons ~ 10^10
[Figure: a network of nodes connected by weighted links (synapses), with excitatory (+) and inhibitory (−) connections, analogous to dendrites, axons, and synapses.]
Abdominal Pain Perceptron
[Figure: a single-layer perceptron for abdominal pain. Input variables Male, Age, Temp, WBC, Pain Intensity, Pain Duration (example values 1, 20, 37, 10, 1, 1; a second row of values, 2, 4, 1, 1, 50, 1, also appears) feed through adjustable weights into a single output unit whose output, 0.8, is read as the "probability" of Myocardial Infarction (MI).]
The "Driver" Network
Perceptrons
[Figure: a single-layer perceptron. Input units (Cough, Headache) connect through weights to output units (No disease, Pneumonia, Flu, Meningitis). The ∆ rule changes the weights to decrease the error = what we got − what we wanted.]
Perceptrons
• Input units i — input to unit i: a_i = measured value of variable i
• Output units j — output of unit j: o_j = 1 / (1 + e^{−(a_j + θ_j)})
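A one-line sketch of this output unit (variable names are illustrative; a_j stands for the unit's net input):

```python
import numpy as np

def sigmoid_output(a_j, theta_j):
    """Logistic output unit: o_j = 1 / (1 + exp(-(a_j + theta_j)))."""
    return 1.0 / (1.0 + np.exp(-(a_j + theta_j)))

# a_j would be the weighted sum of the unit's inputs; theta_j is its bias.
print(sigmoid_output(0.0, 0.0))   # -> 0.5
```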
Jargon Pseudo-Correspondence
• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = "weights"
• Estimates = "targets"
The perceptron learning algorithm
x_d = input
t_d = target output
o_d = observed unit output
Weight update (the standard perceptron training rule): w_i ← w_i + η (t_d − o_d) x_{i,d}, where η is the learning rate.
x_d = input, t_d = target output, o_d = observed unit output

Incremental (stochastic) mode — do until converged, for each training example d:
  1. compute the gradient ∇E_d[w]
  2. w ← w − η ∇E_d[w]

Batch mode — do until converged:
  1. compute the gradient ∇E_D[w]
  2. w ← w − η ∇E_D[w]

where E_D[w] = ½ Σ_{d∈D} (t_d − o_d)² and E_d[w] = ½ (t_d − o_d)².
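A hedged sketch of both modes for a linear unit (the toy data, learning rate, and epoch count are illustrative, not from the lecture):

```python
import numpy as np

def train_delta_rule(X, t, eta=0.05, epochs=100, batch=False):
    """Gradient descent on E[w] = 1/2 * sum_d (t_d - o_d)^2 for a linear unit o = X @ w.
    batch=False gives the incremental (stochastic) mode, batch=True the batch mode."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        if batch:
            o = X @ w
            grad = -(t - o) @ X                       # dE_D/dw summed over all examples
            w -= eta * grad
        else:
            for x_d, t_d in zip(X, t):
                o_d = x_d @ w
                w -= eta * (-(t_d - o_d) * x_d)       # dE_d/dw for a single example
    return w

# Toy usage: recover the weights of a noiseless linear target
X = np.array([[1., 0.], [0., 1.], [1., 1.]])
t = X @ np.array([2., -1.])
print(train_delta_rule(X, t, batch=True))             # approximately [ 2. -1.]
```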
MLE vs MAP
• Maximum conditional likelihood estimate
• Maximum a posteriori estimate
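In symbols (standard forms, stated here as an assumption rather than quoted from the slide):
  MCLE:  ŵ = argmax_w ∏_d P(t_d | x_d, w)
  MAP:   ŵ = argmax_w P(w) ∏_d P(t_d | x_d, w)
With a Gaussian prior on w, the MAP estimate adds an L2 (weight-decay) penalty to the conditional log-likelihood.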
What decision surface does a perceptron define?

NAND
x | y | Z (color)
0 | 0 | 1
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0

Perceptron: y = f(x1·w1 + x2·w2), with threshold θ = 0.5 and
f(a) = 1 for a > θ, 0 for a ≤ θ

f(0·w1 + 0·w2) = 1
f(0·w1 + 1·w2) = 1
f(1·w1 + 0·w2) = 1
f(1·w1 + 1·w2) = 0

Some possible values for (w1, w2):
(0.20, 0.35), (0.20, 0.40), (0.25, 0.30), (0.40, 0.20)
What decision surface does a perceptron define?

XOR
x | y | Z (color)
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0

y = f(x1·w1 + x2·w2), θ = 0.5, same f(a) as above; the required outputs are
f(0·w1 + 0·w2) = 0
f(0·w1 + 1·w2) = 1
f(1·w1 + 0·w2) = 1
f(1·w1 + 1·w2) = 0
but no values of w1 and w2 satisfy all four: XOR is not linearly separable.
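A quick check of this claim (a grid search over illustrative weight values, with the threshold fixed at θ = 0.5 and no bias, as in the slide):

```python
import itertools
import numpy as np

def threshold_unit(x, w, theta=0.5):
    return 1 if np.dot(x, w) > theta else 0

xor_table = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

# Brute-force search over a grid of weight pairs: none reproduces XOR.
solutions = [
    (w1, w2)
    for w1, w2 in itertools.product(np.linspace(-1, 1, 41), repeat=2)
    if all(threshold_unit(x, (w1, w2)) == z for x, z in xor_table.items())
]
print(solutions)   # -> [] : no linear threshold unit computes XOR
```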
[Figure: decision regions over the binary inputs. Left: a square over cough and headache with corners 00 (No disease), 01 (Meningitis), 10 (Pneumonia), 11 (Flu), divided by a line into "treatment" and "no treatment" regions. Right: a three-input version, a cube with corners 000 through 111.]
Neural Network Model

[Figure: a feed-forward network. Independent variables (inputs) Age = 34, Gender = 2, Stage = 4 connect through weights (.6, .4, .2, .1, .5, .3, .2, .8, .7, .2) to a hidden layer of Σ units, which connect through a second set of weights to a single output unit whose value, 0.6, is read as the "probability of being alive" (the dependent variable / prediction).]
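A minimal forward-pass sketch of this kind of network (the exact weight-to-edge mapping and the hidden-unit nonlinearity are assumptions; only the numbers come from the figure):

```python
import numpy as np

def forward(x, W_hidden, w_out):
    """Forward pass: sigmoid hidden layer, sigmoid output ("probability of being alive")."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sigmoid(W_hidden @ x)     # hidden-layer activations
    return sigmoid(w_out @ h)     # scalar output in (0, 1)

# Illustrative numbers only.
x = np.array([34.0, 2.0, 4.0])                       # Age, Gender, Stage
W_hidden = np.array([[.6, .1, .7], [.4, .5, .2]])    # 2 hidden units x 3 inputs
w_out = np.array([.2, .8])
print(forward(x, W_hidden, w_out))
```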
[Figure: the same network shown twice more, each view highlighting a subset of the weights from the inputs (Age, Gender, Stage) through the hidden Σ units to the output 0.6, the "probability of being alive".]
Not really, no target for hidden units...

[Figure: the full network again: Age = 34, Gender = 2, Stage = 4 → hidden Σ units → output 0.6, the "probability of being alive"; the hidden units have no training targets of their own.]
Recall perceptrons
[Figure: as before, input units (Cough, Headache) connect through weights to output units (No disease, Pneumonia, Flu, Meningitis); the ∆ rule changes the weights to decrease the error = what we got − what we wanted.]
Hidden Units and Backpropagation

[Figure: a network with input units, hidden units, and output units. The output error (what we got − what we wanted) is propagated backwards: the ∆ rule is applied first to the output-layer weights, then to the hidden-layer weights.]

x_d = input
t_d = target output
o_d = observed unit output
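A compact sketch of one stochastic backpropagation update for a one-hidden-layer sigmoid network, using the standard squared-error deltas (sizes, learning rate, and toy data are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W1, W2, eta=0.1):
    """One stochastic backpropagation step.
    output unit:  delta_k = o_k (1 - o_k) (t_k - o_k)
    hidden unit:  delta_h = o_h (1 - o_h) * sum_k W2[k, h] * delta_k
    """
    h = sigmoid(W1 @ x)                       # hidden activations
    o = sigmoid(W2 @ h)                       # output activations
    delta_out = o * (1 - o) * (t - o)
    delta_hid = h * (1 - h) * (W2.T @ delta_out)
    W2 += eta * np.outer(delta_out, h)
    W1 += eta * np.outer(delta_hid, x)
    return W1, W2

# Toy usage on a single (x, t) pair with small random weights.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(scale=0.1, size=(3, 2)), rng.normal(scale=0.1, size=(1, 3))
for _ in range(1000):
    W1, W2 = backprop_step(np.array([1.0, 0.0]), np.array([1.0]), W1, W2)
print(sigmoid(W2 @ sigmoid(W1 @ np.array([1.0, 0.0]))))   # approaches the target 1.0
```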
More on Backpropagation
• It is doing gradient descent over the entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum
• In practice, often works well (can run multiple times)
Learning Hidden Layer Representation
• A target function: [figure]
• A network: [figure]
Training

[Figure: an error surface over a weight w. Starting from w_initial (initial error), gradient descent follows the negative derivative (here a positive change in w) down to w_trained, reaching the final error at a local minimum.]
Overfitting in Neural Nets

[Figure: error versus training cycles. Training error keeps decreasing with more cycles, while holdout error stops improving and rises again as the network overfits.]
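One common response to this curve is early stopping; a rough sketch (the train_step and holdout_error callbacks are placeholders, not anything defined in the lecture):

```python
import numpy as np

def train_with_early_stopping(train_step, holdout_error, max_cycles=1000, patience=10):
    """Keep training while holdout error improves; stop once it has not improved
    for `patience` cycles (a common guard against the overfitting shown above)."""
    best_err, best_cycle, since_best = np.inf, 0, 0
    for cycle in range(max_cycles):
        train_step()              # one pass of weight updates on the training set
        err = holdout_error()     # error on held-out data
        if err < best_err:
            best_err, best_cycle, since_best = err, cycle, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_cycle, best_err
```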
Regression vs. Neural Networks

[Figure: Y modeled from X1, X2, X3 two ways, as a regression and as a neural network; with three inputs there are (2^3 − 1) possible combinations of X1, X2, X3.]
Modern ANN topics: "Deep" Learning
• Continuous functions:
  • Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
  • Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
Using ANNs to learn hierarchical representations

Trainable Feature Extractor → Trainable Feature Extractor → Trainable Classifier
(Learned Internal Representation)

• Deep Learning: learning a hierarchy of internal representations
• From low-level features to mid-level invariant representations, to object identities
• Representations are increasingly invariant as we go up the layers
• Using multiple stages gets around the specificity/invariance dilemma
Filtering + Nonlinearity + Pooling = 1 stage of a Convolutional Net
• [Hubel & Wiesel 1962]:
  • simple cells detect local features
  • complex cells "pool" the outputs of simple cells within a retinotopic neighborhood
[Figure: multiple convolutions produce retinotopic feature maps ("simple cells"), which are then pooled/subsampled ("complex cells").]
[Figure: alternating stages of convolutions/filtering and pooling/subsampling.]

Hierarchical Architecture
• Representations are more global, more invariant, and more abstract as we go up the layers

Alternated Layers of Filtering and Spatial Pooling
• Filtering detects conjunctions of features
• Pooling computes local disjunctions of features

Fully Trainable
• All the layers are trainable
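A minimal numpy sketch of one filtering + nonlinearity + pooling stage (the filter values, tanh nonlinearity, and 2x2 pooling are illustrative choices):

```python
import numpy as np

def conv_stage(image, kernel, pool=2):
    """One convolutional-net stage: filtering (valid 2-D convolution),
    a pointwise nonlinearity (tanh), then non-overlapping max pooling."""
    kh, kw = kernel.shape
    H = image.shape[0] - kh + 1
    W = image.shape[1] - kw + 1
    fmap = np.empty((H, W))
    for i in range(H):                      # filtering: detect a local feature everywhere
        for j in range(W):
            fmap[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    fmap = np.tanh(fmap)                    # nonlinearity
    H2, W2 = H // pool, W // pool           # pooling: local "disjunction" of detections
    pooled = fmap[:H2*pool, :W2*pool].reshape(H2, pool, W2, pool).max(axis=(1, 3))
    return pooled

# Toy usage: a 6x6 image, a 3x3 edge-like filter, 2x2 max pooling -> 2x2 output
img = np.arange(36, dtype=float).reshape(6, 6)
kern = np.array([[1., 0., -1.]] * 3)
print(conv_stage(img, kern).shape)          # (2, 2)
```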
Convolutional Net Architecture for Handwriting Recognition

Input:   1@32x32
Layer 1: 6@28x28   (5x5 convolution)
Layer 2: 6@14x14   (2x2 pooling/subsampling)
Layer 3: 12@10x10  (5x5 convolution)
Layer 4: 12@5x5    (2x2 pooling/subsampling)
Layer 5: 100@1x1   (5x5 convolution)
Layer 6: 10 output units
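A sketch of these layer sizes in PyTorch (the framework, tanh nonlinearities, and average pooling are assumptions; only the shapes follow the slide):

```python
import torch.nn as nn

lenet_like = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),     # 1@32x32  -> 6@28x28
    nn.Tanh(),
    nn.AvgPool2d(2),                    # 6@28x28  -> 6@14x14
    nn.Conv2d(6, 12, kernel_size=5),    # 6@14x14  -> 12@10x10
    nn.Tanh(),
    nn.AvgPool2d(2),                    # 12@10x10 -> 12@5x5
    nn.Conv2d(12, 100, kernel_size=5),  # 12@5x5   -> 100@1x1
    nn.Tanh(),
    nn.Flatten(),
    nn.Linear(100, 10),                 # Layer 6: 10 output units (digit classes)
)
```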
How to train?
Application: MNIST Handwritten Digit Dataset
• MNIST: 60,000 training samples, 10,000 test samples
Application: ANN for Face Recognition
• The model
• The learned hidden unit weights
Computer vision features

[Figure: hand-engineered descriptors: SIFT, Spin image, HoG, RIFT, Textons, GLOH. Courtesy: Lee and Ng]

Drawbacks of feature engineering
1. Needs expert knowledge
2. Time-consuming hand-tuning
Basis (or features) can be learned by optimization

• Given input data {x(1), …, x(m)}, we want to find good bases {b1, …, bn}: minimize the reconstruction error plus a sparsity penalty, subject to the normalization constraint ∀j: ||bj|| ≤ 1
• "Sparse coding" (edges)
• DBN (Hinton et al., 2006) with an additional sparseness constraint
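The labels above correspond, in the usual sparse-coding formulation (an assumption, not quoted from the slide), to minimizing Σ_i ||x(i) − Σ_j a_j(i) b_j||² + β Σ_{i,j} |a_j(i)| subject to ||b_j|| ≤ 1. A small numpy sketch of evaluating that objective:

```python
import numpy as np

def sparse_coding_objective(X, B, A, beta=0.1):
    """Reconstruction error + L1 sparsity penalty for sparse coding.
    X: (m, d) inputs; B: (n, d) bases b_j; A: (m, n) coefficients a_j^(i)."""
    recon_error = np.sum((X - A @ B) ** 2)     # sum_i ||x(i) - sum_j a_j(i) b_j||^2
    sparsity = beta * np.sum(np.abs(A))        # beta * sum_{i,j} |a_j(i)|
    return recon_error + sparsity

def normalize_bases(B):
    """Enforce the constraint ||b_j|| <= 1 by rescaling any over-long basis row."""
    norms = np.maximum(np.linalg.norm(B, axis=1, keepdims=True), 1.0)
    return B / norms
```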
Convolutional architectures

[Figure: a stack of stages, each consisting of a convolution (detection) layer followed by a max-pooling layer over a 2x2 grid; filter weights are shared across the image (weight sharing by convolution, e.g., [LeCun et al., 1989]).]

• "Max-pooling"
  • invariance
  • computational efficiency
  • deterministic and feed-forward
• One can develop a convolutional Restricted Boltzmann Machine (CRBM)
• One can define probabilistic max-pooling that combines bottom-up and top-down information

Courtesy: Lee and Ng
• Inference (approximate)
  • undirected connections for all layers (Markov net) [related work: Salakhutdinov and Hinton, 2009]
  • block Gibbs sampling or mean-field
  • hierarchical probabilistic inference
Unsupervised learning of object parts

[Figure: examples for Faces, Cars, Elephants, Chairs.]