Slide 2
02.04.2024
Σi wi xi + w0
(B) Probabilistic learning:
Models the joint probability P(x, y), where x is the input and y is the label, and predicts the most probable label for an unseen input using Bayes' theorem.
Naïve Bayes (MAP):
[Figure: a Bayesian network with nodes B, E, A, J, and M]
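As a minimal sketch (with made-up probabilities, not taken from the slide), MAP prediction with Naïve Bayes picks the label y maximizing P(y) · Πi P(xi | y):

```python
# Naive Bayes MAP prediction: choose the label y that maximizes the
# unnormalized posterior P(y) * prod_i P(x_i | y) (Bayes' theorem, with
# the constant evidence term P(x) dropped).
def naive_bayes_map(x, priors, likelihoods):
    """Return the most probable label for input x."""
    best_label, best_score = None, -1.0
    for y, p_y in priors.items():
        score = p_y
        for i, xi in enumerate(x):
            score *= likelihoods[y][i][xi]   # P(x_i = xi | y)
        if score > best_score:
            best_label, best_score = y, score
    return best_label

# Hypothetical example: features (outlook, windy) -> play outside?
priors = {"yes": 0.6, "no": 0.4}
likelihoods = {
    "yes": [{"sunny": 0.3, "rain": 0.7}, {True: 0.4, False: 0.6}],
    "no":  [{"sunny": 0.8, "rain": 0.2}, {True: 0.7, False: 0.3}],
}
print(naive_bayes_map(("sunny", True), priors, likelihoods))
```

The naive independence assumption lets the joint P(x, y) factor into the per-feature likelihoods multiplied above.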
Neurons?
An electrical signal shoots down a nerve cell and then off to others in the brain. Learning strengthens the paths that these signals take, essentially "wiring" certain common paths through the brain. Image source: https://www.snexplores.org/ (artist's impression)
A healthy human brain has around 100 billion (10^11) neurons, and a neuron may connect to as many as 100,000 other neurons.
A Nerve Cell: Neuron
Inputs (Dendrites): Neurons in an ANN receive inputs from other neurons or from the external environment. These inputs are analogous to the signals received by the dendrites of biological neurons.
Weights (Synaptic Strength): Each input to a neuron is associated with a weight, which determines the strength of influence of that input on the neuron's output. These weights can be adjusted during the learning process.
Summation (Cell Body): The neuron computes the weighted sum of its inputs. This process is akin to the integration of signals that happens in the cell body (soma) of a biological neuron.
Activation Function (Axon Hillock): After the summation, the neuron applies an activation function to the weighted sum. This function introduces non-linearity into the model and determines whether the neuron should fire or not based on the computed value.
Common activation functions include sigmoid, tanh, ReLU (Rectified Linear Unit), etc.
Output (Axon): The result of the activation function is the output of the neuron, which is then passed to the next layer of neurons in the network. This output is analogous to the signal transmitted along the axon of a biological neuron.
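The five components above can be sketched as a single artificial neuron; a sigmoid activation and illustrative weights are assumed here:

```python
import math

def neuron_output(inputs, weights, bias):
    """One artificial neuron: inputs (dendrites) scaled by weights
    (synaptic strengths), summed in the 'cell body', then passed
    through a sigmoid activation to produce the output (axon)."""
    g = sum(w * x for w, x in zip(weights, inputs)) + bias   # summation
    return 1.0 / (1.0 + math.exp(-g))                        # activation

print(neuron_output([1.0, 0.0], [0.75, 0.75], -1.0))
```

Swapping the activation for tanh or ReLU changes only the last line; the dendrite/weight/summation structure stays the same.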
Perceptron: Modelling the Nerve cell
1943
[Figure: a perceptron with inputs x1, x2, weights w1, w2, summation unit g, and threshold unit f]

y = f(g(x)) = 1, if g(x) ≥ 𝜽
            = 0, if g(x) < 𝜽
What function does this neuron compute?
Normalizing thresholds
• Why do we need normalization?

y(x) = w0 + w1 x1 + w2 x2 + … + wn xn = w0 + Σ (i=1..n) wi xi
[Figure: two equivalent units — one with inputs x1 … xn, weights w1 … wn, and threshold 𝜽; one with an extra bias input fixed at 1, whose weight -𝜽 replaces the threshold]

y = f(g(x)) = 1, if -𝜽·1 + Σi wi xi ≥ 0
            = 0, otherwise

Advantage: threshold = 0 for all neurons.
Normalized examples

AND (weights: w0 = -1, w1 = .75, w2 = .75):
INPUT: x1 = 1, x2 = 1:  1*-1 + .75*1 + .75*1 = .5  ≥ 0  OUTPUT: 1
INPUT: x1 = 1, x2 = 0:  1*-1 + .75*1 + .75*0 = -.25 < 0  OUTPUT: 0
INPUT: x1 = 0, x2 = 1:  1*-1 + .75*0 + .75*1 = -.25 < 0  OUTPUT: 0
INPUT: x1 = 0, x2 = 0:  1*-1 + .75*0 + .75*0 = -1   < 0  OUTPUT: 0

OR (weights: w0 = -.5, w1 = .75, w2 = .75):
INPUT: x1 = 1, x2 = 1:  1*-.5 + .75*1 + .75*1 = 1   ≥ 0  OUTPUT: 1
INPUT: x1 = 1, x2 = 0:  1*-.5 + .75*1 + .75*0 = .25 ≥ 0  OUTPUT: 1
INPUT: x1 = 0, x2 = 1:  1*-.5 + .75*0 + .75*1 = .25 ≥ 0  OUTPUT: 1
INPUT: x1 = 0, x2 = 0:  1*-.5 + .75*0 + .75*0 = -.5 < 0  OUTPUT: 0
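The AND and OR examples above can be checked in code; the weights are the ones from the slide, while the `perceptron` helper itself is illustrative:

```python
def perceptron(x, weights):
    """Threshold-0 perceptron with a fixed bias input x0 = 1
    (normalized threshold), matching the worked examples."""
    g = weights[0] * 1 + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if g >= 0 else 0

AND_W = [-1.0, 0.75, 0.75]   # weights from the AND example
OR_W  = [-0.5, 0.75, 0.75]   # weights from the OR example

for x in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print(x, perceptron(x, AND_W), perceptron(x, OR_W))
```

Only the bias weight differs between the two functions: -1 demands both inputs be active, -0.5 only one.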
Perceptron as a Decision Surface
• Perceptron is a Linear Binary Classifier
[Figure: a line separating y = +1 from y = -1 points in a 2-dimensional space]
• A perceptron can only solve linearly separable classification problems.
[Figures: linear decision surfaces for AND and OR]
How about XOR?
Activation Functions in a Perceptron
[Figure: two threshold units applied to y(x) over inputs 1, x1, x2 — the step function, with outputs 0 and +1, and the sign function, with outputs -1 and +1]
Continued…
Initial weights: w0 = -2.9, w1 = 0.6, w2 = 0.2

x1 = 1, x2 = 1: -2.9*1 + 0.6*1 + 0.2*1 = -2.1 → 0  WRONG
x1 = 1, x2 = 0: -2.9*1 + 0.6*1 + 0.2*0 = -2.3 → 0  OK
x1 = 0, x2 = 1: -2.9*1 + 0.6*0 + 0.2*1 = -2.7 → 0  OK
x1 = 0, x2 = 0: -2.9*1 + 0.6*0 + 0.2*0 = -2.9 → 0  OK

New weights (after the update for the wrong example):
w0 = -2.9 + 1 = -1.9
w1 = 0.6 + 1 = 1.6
w2 = 0.2 + 1 = 1.2

x1 = 1, x2 = 1: -1.9*1 + 1.6*1 + 1.2*1 = 0.9  → 1  OK
x1 = 1, x2 = 0: -1.9*1 + 1.6*1 + 1.2*0 = -0.3 → 0  OK
x1 = 0, x2 = 1: -1.9*1 + 1.6*0 + 1.2*1 = -0.7 → 0  OK
x1 = 0, x2 = 0: -1.9*1 + 1.6*0 + 1.2*0 = -1.9 → 0  OK
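The weight change shown above is one application of the perceptron training rule wi ← wi + η(t - o)xi; a minimal sketch (learning rate η = 1, as the slide's +1 updates imply, and a hypothetical `train_perceptron` helper) reproduces the new weights:

```python
def train_perceptron(examples, weights, lr=1.0, epochs=10):
    """Perceptron learning rule: w_i <- w_i + lr * (t - o) * x_i,
    with a fixed bias input x0 = 1 (threshold normalized to 0).
    Stops early once an epoch makes no mistakes."""
    for _ in range(epochs):
        changed = False
        for x, t in examples:
            xs = (1,) + x                                   # prepend bias input
            o = 1 if sum(w * xi for w, xi in zip(weights, xs)) >= 0 else 0
            if o != t:
                weights = [w + lr * (t - o) * xi for w, xi in zip(weights, xs)]
                changed = True
        if not changed:
            break
    return weights

AND = [((1, 1), 1), ((1, 0), 0), ((0, 1), 0), ((0, 0), 0)]
w = train_perceptron(AND, [-2.9, 0.6, 0.2])
print(w)   # the corrected weights from the slide
```

Only the misclassified example (1, 1) triggers an update, adding (t - o) = 1 times each input to the corresponding weight, exactly as in the hand-worked step.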
• The key idea behind the delta rule is gradient descent, the basis for the back-propagation algorithm.
ADALINE: an adaptive linear neural network trained on mean squared error (MSE), also known as the least mean squares (LMS) or Widrow-Hoff rule.
Visualizing Gradient Descent: Recap
[Figure: the error surface of a linear unit is parabolic (convex), with a single global minimum]
w0 and w1: The two weights of a linear unit and E is the error.
Derivation of Gradient Descent: Recap
• How can we calculate the direction of steepest descent along the error
surface?
• Gradient: ∇E(w) = [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn]; the training rule moves the weights in the opposite direction: Δw = -η ∇E(w).
3. For each linear unit weight wi do {
     Δwi ← Δwi + η (td - od) xid    (accumulated over all training examples d)
   }
4. For each linear unit weight wi do {
     wi ← wi + Δwi
   }
GD: the error is summed over all examples before the weights are updated. It may fail to find the global minimum when multiple local minima are present. In SGD (incremental GD), the weights are updated after examining each training example.
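A sketch of batch GD for a linear unit, assuming squared error; the `batch_gd` name, the learning rate, and the training data are all illustrative:

```python
def batch_gd(examples, weights, lr=0.05, epochs=200):
    """Batch gradient descent for a linear unit o = w . x:
    accumulate delta_w_i = lr * sum over examples of (t - o) * x_i,
    then update all weights once per epoch (contrast with SGD, which
    would update inside the inner loop, after every example)."""
    for _ in range(epochs):
        delta = [0.0] * len(weights)
        for x, t in examples:
            o = sum(w * xi for w, xi in zip(weights, x))
            for i, xi in enumerate(x):
                delta[i] += lr * (t - o) * xi
        weights = [w + d for w, d in zip(weights, delta)]
    return weights

# Hypothetical data generated by t = 2*x1 - 1 (bias input x0 = 1):
data = [((1.0, 0.0), -1.0), ((1.0, 1.0), 1.0),
        ((1.0, 2.0), 3.0), ((1.0, 3.0), 5.0)]
w = batch_gd(data, [0.0, 0.0])
print(w)   # approaches [-1.0, 2.0]
```

Moving the update inside the example loop turns this into SGD/incremental GD: noisier steps, but the weights react to every example immediately.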
Inadequacy of Perceptron
• Many simple problems are NOT linearly separable.
[Figure: the XOR function — no single line can separate the two classes]

• Output is binary (0 or 1), NOT in the form of continuous values or probabilities.
[Figure: a multilayer network — input layer, first hidden layer, second hidden layer, output layer]
What does a hidden layer hide?
It hides its desired output: neurons in the hidden layer cannot be observed through the input/output behaviour of the network.
How many nodes per layer, and how many layers? Too few nodes: the network can't learn; too many: poor generalization. Determined by experimentation and tuning.
Decision Surface in a Multilayer Network: An Ex.
[Figure: a non-linear decision surface formed by a multilayer network over binary inputs]
What is sparse connectivity, and what are its pros and cons? Leaving out some links.
Activation Function: Sigmoid
Vanishing Gradient
• The tanh function squashes its input to the range [-1, 1]. It is similar to the sigmoid function, but its output is zero-centered, unlike the sigmoid, which outputs values between 0 and 1.
Used in RNNs,
and LSTMs…
The reason for using the ReLU is that its derivatives are particularly well behaved: either they vanish or they just let the argument through. This makes optimization better behaved and reduces the vanishing-gradient problem.
Unlike the sigmoid or tanh, which saturate in certain regions (i.e., the gradients become very close to zero), ReLU does not saturate in the positive region: for positive inputs, the derivative of ReLU is always 1. Hence, during backpropagation, gradients do not vanish for positive values, allowing faster and more effective learning.
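The saturation argument can be made concrete with a small sketch comparing the derivatives of the sigmoid and ReLU (function names are mine):

```python
import math

def sigmoid(x):
    """Squashes to (0, 1); saturates for large |x|."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """max(0, x): does not saturate in the positive region."""
    return max(0.0, x)

def d_sigmoid(x):
    """Sigmoid derivative: sigma(x) * (1 - sigma(x)); near zero when saturated."""
    return sigmoid(x) * (1.0 - sigmoid(x))

def d_relu(x):
    """ReLU derivative: exactly 1 for any positive input."""
    return 1.0 if x > 0 else 0.0

# At x = 10 the sigmoid's gradient has all but vanished; ReLU's is still 1.
print(d_sigmoid(10.0), d_relu(10.0))
```

tanh behaves like the sigmoid here: its derivative 1 - tanh(x)^2 also collapses toward zero for large |x|, which is why deep stacks of saturating units train slowly.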
Gradient Descent for Sigmoid Unit
But we know: 𝜎'(x) = 𝜎(x)(1 - 𝜎(x))
Backpropagation Training Algorithm (BPN)
• Initialize weights (typically random!)
• Keep doing epochs
• For each example ‘e’ in the training set do
Error Backpropagation
• First calculate error of output units and use this to change the
top layer of weights.
[Figure: network with output, hidden, and input layers]

δj = oj(1 - oj) Σk δk wkj
Error Backpropagation continued…
• Finally update bottom layer of weights based on errors
calculated for hidden units.
δj = oj(1 - oj) Σk δk wkj        (k ranges over the output units)

Update weights into j:

Δwji = η δj oi

[Figure: network with output, hidden, and input layers]
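Putting the delta formulas and the weight update together, a hypothetical 2-2-1 sigmoid network (weights made up, no bias units) performs one backpropagation step like this:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(x, t, w_hidden, w_out, lr=0.5):
    """One backpropagation step for a tiny 2-2-1 sigmoid network, using:
      output unit:  delta_o = o (1 - o) (t - o)
      hidden unit:  delta_j = o_j (1 - o_j) * delta_o * w_oj
      update:       w_ji   += lr * delta_j * o_i
    """
    # Forward pass: hidden activations, then the output.
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    o = sigmoid(sum(w * hj for w, hj in zip(w_out, h)))

    # Backward pass: output delta first, then hidden deltas
    # (computed with the OLD output weights).
    delta_o = o * (1 - o) * (t - o)
    delta_h = [hj * (1 - hj) * delta_o * w_out[j] for j, hj in enumerate(h)]

    # Weight updates: top layer, then bottom layer.
    w_out = [w + lr * delta_o * hj for w, hj in zip(w_out, h)]
    w_hidden = [[w + lr * delta_h[j] * xi for w, xi in zip(ws, x)]
                for j, ws in enumerate(w_hidden)]
    return w_hidden, w_out, o

# Made-up starting weights; target t = 1 for input (1, 0).
w_h = [[0.1, 0.2], [-0.1, 0.3]]
w_o = [0.4, -0.4]
o_first = None
for _ in range(100):
    w_h, w_o, o = backprop_step([1.0, 0.0], 1.0, w_h, w_o)
    if o_first is None:
        o_first = o
print(o_first, o)   # the output moves toward the target 1.0
```

Note that the hidden deltas must be computed before the top-layer weights are overwritten, mirroring the slide's order: output errors first, then the errors they induce in the hidden layer.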
Assignment 5: BPNs for Predicting Age of Abalones