Slide 2

Download as pdf or txt
Download as pdf or txt
You are on page 1of 35

Birla Institute of Technology and Science Pilani, Hyderabad Campus

2nd Semester 2023-24

02.04.2024

BITS F464: Machine Learning


NEURAL NETWORKS: PERCEPTRON, BACK PROPAGATION

Chittaranjan Hota, Sr. Professor


Dept. of Computer Sc. and Information Systems
[email protected]
Recap:
(A) Symbolic learning:
It involves representing knowledge in a symbolic form, often using logical rules
(logical structures) or mathematical functions.
Ear y(x)=wT x + w0 =

wi xi + w0
(B) Probabilistic learning:
Takes a Joint probability P(x, y) where x is the input and y is the label and
predicts the most possible known label for the unknown variable using
the Bayes theorem.

B E
Naïve Bayes (MAP):
A

J M

(C) Connectionist learning (Artificial Neural Networks) today…


ANNs: Motivating Examples

(Russia-Ukrain War: Ukrainians used ANNs to


combine ground-level photos, drone video
footage and satellite imagery to enhance War
Image source: https://towardsdatascience.com/
Intelligence)
Learning Rewires the Brain

Neurons?

An electrical signal shooting down a nerve cell and then off to others in the brain. Learning strengthens the paths that these signals
take, essentially "wiring" certain common paths through the brain. Image Source: https://www.snexplores.org/ (imagination)

A healthy human brain has around 100 billion neurons (1011), and a neuron
may connect to as many as 100,000 other neurons.
A Nerve Cell: Neuron

(Img. Source: https://bio.libretexts.org/)

What are their computational abstractions in an Artificial Neural Network?


In an Artificial Neural Network (ANN), a neuron is abstracted as a computational unit that takes multiple inputs, computes a weighted sum of these inputs, applies an activation function to the sum, and produces an output. Here's how it breaks down:

Inputs (Dendrites): Neurons in an ANN receive inputs from other neurons or from the external environment. These inputs are analogous to the signals received by the dendrites of biological neurons.

Weights (Synaptic Strength): Each input to a neuron is associated with a weight, which determines the strength of influence of that input on the neuron's output. These weights can be adjusted during the learning process.

Summation (Cell Body): The neuron computes the weighted sum of its inputs. This process is akin to the integration of signals that happens in the cell body (soma) of a biological neuron.

Activation Function (Axon Hillock): After the summation, the neuron applies an activation function to the weighted sum. This function introduces non-linearity into the model and determines whether the neuron should fire or not based on the computed value.
Common activation functions include sigmoid, tanh, ReLU (Rectified Linear Unit), etc.

Output (Axon): The result of the activation function is the output of the neuron, which is then passed to the next layer of neurons in the network. This output is analogous to the signal transmitted along the axon of a biological neuron.
Perceptron: Modelling the Nerve cell
1943

Frank Rosenblatt at IBM 704 (Electronic


profile analyzing computer): a precursor to
the perceptron, 1958.
(Image source: https://news.cornell.edu)
Perceptron: An Example
AND
g(x1,x2, …, xn) = g(x) =

x0 

w1
x1 g f .75 .75

x2

y = f (g(x)) = 1, if g(x) ≥ 𝜽
x1 x2
= 0, if g(x) < 𝜽
What function this neuron computes?
Normalizing thresholds
• Why do we need Normalization? n
y( x)  w0  w1 x1  w2 x2 ,, wn xn  w0   wi xi
i 1

 Bias

w1 wn
w2  wn
w1 w2
...
...

x1 x2 xn
1 x1 x2 xn

y = f (g(x)) = 1, if - 𝜽X1 + ≥ 0
Advantage: threshold = 0 for all neurons:
= 0, Otherwise
Normalized examples
INPUT: x1 = 1, x2 = 1
AND
1*-1 + .75*1 + .75*1 = .5 >= 0  OUTPUT: 1
1  -1 INPUT: x1 = 1, x2 = 0
-1 .75
1*-1 +.75*1 + .75*0 = -.25 < 1  OUTPUT: 0
.75
.75
x1 0 INPUT: x1 = 0, x2 = 1
1*-1 +.75*0 + .75*1 = -.25 < 1  OUTPUT: 0
INPUT: x1 = 0, x2 = 0
x2 .75
1 x1 x2 1*-1 +.75*0 + .75*0 = -1 < 1  OUTPUT: 0

INPUT: x1 = 1, x2 = 1
OR 1*-.5 + .75*1 + .75*1 = 1 >= 0  OUTPUT: 1

INPUT: x1 = 1, x2 = 0
-.5 .75 1*-.5 +.75*1 + .75*0 = .25 > 1  OUTPUT: 1
.75
INPUT: x1 = 0, x2 = 1
1*-.5 +.75*0 + .75*1 = .25 < 1  OUTPUT: 1
INPUT: x1 = 0, x2 = 0
1 x1 x2
1*-.5 +.75*0 + .75*0 = -.5 < 1  OUTPUT: 0
Perceptron as a Decision Surface
• Perceptron is a Linear Binary Classifier

y = +1

y = -1

(A 2-dimensional space)
• A perceptron can only solve linearly separable classification problems.

AND OR
How about
XOR?
Activation Functions in a Perceptron
y y
+1

y(x)
x
x
-1
(Step Function) (Sign Function)

y (x) = 1, if -𝜽X1 + > 0


y (x) = 1, if - 𝜽X1 + ≥ 0
= 0, for equal to 0
= 0, Otherwise
= -1, for less than 0
What about other activation functions like Sigmoid, Gaussian etc.? Multi-layer Perceptron
Perceptron Training Example
Gradient Descent:
changing the
weight a small
amount decreases
the training error

An example: A perceptron updating its linear boundary as more


training examples are added. (Image Source: Wiki)
Perceptron Training Algorithm

Image Source: PRML, Bishop

Algorithm: Perceptron Learning Algorithm

P  inputs with label + if (x ∈ N && w.x ≥ 0) then


N  inputs with label – w = w - x;
Initialize w  Random value; endif;
while (!convergence) do
Pick random x ∈ P ⋃ N; endwhile;
if (x ∈ P && w.x <0) then
w = w + x; // Algorithm converges when all the
endif; inputs are classified correctly.
Run through the algorithm: AND
x1 = 1, x2 = 1: -0.9*1 + 0.6*1 + 0.2*1 = -0.1  0 WRONG
x1 = 1, x2 = 0: -0.9*1 + 0.6*1 + 0.2*0 = -0.3  0 OK
x1 = 0, x2 = 1: -0.9*1 + 0.6*0 + 0.2*1 = -0.7  0 OK
x1 = 0, x2 = 0: -0.9*1 + 0.6*0 + 0.2*0 = -0.9  0 OK

(Training Set) New Weights

 

-0.9 0.2 0.1 1.2


0.6 1.6

1 x1 x2 1 x1 x2
Continued…
 x1 = 1, x2 = 1: -2.9*1 + 0.6*1 + 0.2*1 = -2.1  0 WRONG
x1 = 1, x2 = 0: -2.9*1 + 0.6*1 + 0.2*0 = -2.3  0 OK
-2.9 0.2 x1 = 0, x2 = 1: -2.9*1 + 0.6*0 + 0.2*1 = -2.7  0 OK
0.6
x1 = 0, x2 = 0: -2.9*1 + 0.6*0 + 0.2*0 = -2.9  0 OK

w0 = -2.9 + 1 = -1.9
1 x1 x2 w1 = 0.6 + 1 = 1.6 New Weights
w2 = 0.2 + 1 = 1.2


x1 = 1, x2 = 1: -1.9*1 + 1.6*1 + 1.2*1 = 0.9  1 OK
-1.9
1.6
1.2 x1 = 1, x2 = 0: -1.9*1 + 1.6*1 + 1.2*0 = -0.3  0 OK
x1 = 0, x2 = 1: -1.9*1 + 1.6*0 + 1.2*1 = -0.7  0 OK
x1 = 0, x2 = 0: -1.9*1 + 1.6*0 + 1.2*0 = -1.9  0 OK

1 x1 x2 Convergence Reached. Halt! w0 = -1.9, w1 = 1.6, w2 = 1.2


Perceptron Training Rule: Recap
wi  wi + wi (t - o) xi
Let us see this through an example:
When all ( , (t-o) and xi) are
positive, wi will increase and vice
Where, versa:
xi = 0.8, = 0.1, t = 1, o = -1:
•t = target value  wi = 0.1(1-(-1))0.8 = 0.16

•o = perceptron output If t = -1, o = 1, what will happen?

• is a small constant (e.g, 0.1) called the learning rate.


Why should this update rule converge toward successful
weight values?
If training data is linearly separable and is sufficiently small.
Gradient Descent and the Delta Rule
• If the training examples are NOT linearly separable (which the
Perceptron rule cannot handle), the delta rule converges towards a
best-fit approximation to the target concept.

• The key-idea behind the delta rule is to use Gradient descent, a basis
for Back-propagation algorithm.

• Delta rule is best understood by considering an un-thresholded


Perceptron, i.e. a linear unit without threshold (or activation function).

• Let the linear unit be characterized by: o = w0 + w1x1 + w2x2 + … + wnxn

• Let us learn wi’s that minimize the squared error:

ADALINE: adaptive linear neural network based on MSE. Or Least mean square (LMS) Widrow Hoff
Visualizing Gradient Descent: Recap
Parabolic
(Convex)
with a single
global
minimum.

w0 and w1: The two weights of a linear unit and E is the error.
Derivation of Gradient Descent: Recap
• How can we calculate the direction of steepest descent along the error
surface?
• Gradient:

• When interpreted as a vector in weight space, the gradient specifies


the direction that produces the steepest increase in E .
(2)
• The negative of this vector therefore gives the direction of steepest
decrease.
• The training rule: Where: Substituting (2) in (1):
(1)
Gradient Descent & Stochastic Gradient Descent
1. Initialize each wi to some small random value
2. Until the termination condition is met {
1. Initialize each wi to 0
2. For each training example do {
1. Input the instance to the Linear unit and compute output ‘o’
2. For each Linear unit weight wi do

3. }
3. For each Linear unit weight wi do {
X
4. }

3. } Alternatively, SGD computes ‘E’ for each training ex:

GD: the error is summed over all examples before updating weights. It might miss global minima when multiple
local minima are present. In SGD/Incremental GD, weights are updated upon examining each training example.
Inadequacy of Perceptron
• Many simple problems are NOT linearly separable.
?
XOR function
x1
• Output is in the form of binary (0 or 1), NOT in the
form of continuous values or probabilities. 1 1 0

• No memory and hence treat each input


independently. Hence, limited ability to understand 0 1

sequential or temporal patterns in data. x2

0.1 However, you can compute XOR by introducing a new,


hidden unit as shown in the left.
-3.5
1.5 1.5 1.5 Every classification problem has a Perceptron solution if
1 1 enough hidden layers are used.
x1 x2
How to build such a multi-layer network?
Minsky & Papert’s paper: Pretty much killed ANN research in 1970. Rebirth in 1980: faster
parallel computers, newer algorithms (BPN,…), newer architectures (Hopfield nets).
Hidden units in a Multi-layer Perceptron (MLP)
- The addition of hidden units allows the network to develop
complex feature detectors (i.e., internal representations)

- e.g., Optical Character Recognition (OCR)


• perhaps one hidden unit
"looks for" a horizontal bar
• another hidden unit
"looks for" a diagonal
• another looks for the vertical base

- The combination of specific hidden units


indicates a 7.
Multiple Hidden Layers

Out put Signals


Input Signals

I O
N U
P T
U P
T U
First Second T
Input
Input Hidden
hidden Hidden
hidden Output
Output
Layer
layer Layer1
layer Layer
layer2 Layer
layer
What does a hidden layer hide?
Hides it’s desired output. Neurons in the hidden layer cannot be observed
through the input/output behaviour of the network.

No. of nodes in a layer and no. of layers? Nodes too few: can’t learn, Too many: poor generalization Expt. & tuning.
Decision Surface in a Multilayer Network: An Ex.

Image source: Tom Mitchell’s text


Input to the Network: two features from spectral analysis of a spoken sound
Output: vowel sound occurring in the context “h__d”

Image source: Tom Mitchell's Text


An Example 3-layer Perceptron
𝑦 =softmax(z), where 𝑧 = 𝑊Tℎ+𝑏
For a multiclass classification.
hidden units
z

b
i
n
a
r
y

Hidden Units: Sigmoid, Tanh, ReLU etc…

What is Sparse Connectivity and what are its’ Pros and Cons? Leaving out some links.
Activation Function: Sigmoid

During backpropagation, the gradients


of the weights in the early layers of the
network (closer to the input) can
become very small as they are
multiplied by small gradients of later
layers using the sigmoid function.

Vanishing Gradient

Where should you worry much? Shallow or Deep NNs?


Activation Function: Tanh
• The hyperbolic tangent (tanh) activation function is another
commonly used non-linear activation function in neural networks.

• The tanh function squashes the input values to the range [-1, 1]. It is
similar to the sigmoid function, but its output is zero-centered,
meaning that its output is centered around zero, unlike the sigmoid
function which outputs values between 0 and 1.

Used in RNNs,
and LSTMs…

Suffers from same Vanishing gradient problem.


Activation Functions: ReLU
Rectified linear unit function (ReLU) provides a very simple nonlinear
transformation:
ReLU(x)=max(x,0) Retains only positive elements and discards negative ones.

The reason for using the ReLU is that its derivatives are particularly well
behaved: either they vanish or they just let the argument through. This makes
optimization better behaved and it reduces the issue of the vanishing gradient
problem.
Unlike sigmoid or tanh which saturate in certain regions (i.e., the gradients become very close to zero), ReLU does not
saturate in the positive region  for positive inputs, the derivative of ReLU is always 1. Hence, during backpropagation,
gradients do not vanish for positive values, allowing for faster and more effective learning.
Gradient Descent for Sigmoid Unit

But we know:
Backpropagation Training Algorithm (BPN)
• Initialize weights (typically random!)
• Keep doing epochs
• For each example ‘e’ in the training set do

• forward pass to compute


• O = neural-net-output (network, e)
• miss = (T-O) at each output unit
• backward pass to calculate deltas to weights
• update all weights
• end

• until tuning set error stops improving


(Look for Global minima)

Error Backpropagation

• First calculate error of output units and use this to change the
top layer of weights.

Current output: oj=0.2 output


Correct output: tj=1.0
Error δj = oj(1–oj)(tj–oj)
hidden
0.2(1–0.2)(1–0.2)=0.128
Update weights into j
input
w ji   j oi
Error Backpropagation continued…
• Next calculate error for hidden units based on errors on the
output units it feeds into.

output

 j  o j (1  o j )  k wkj
k hidden

input
Error Backpropagation continued…
• Finally update bottom layer of weights based on errors
calculated for hidden units.

output

 j  o j (1  o j )  k wkj
k
hidden
Update weights into j
w ji   j oi
input
Assignment 5: BPNs for Predicting Age of Abalones

will decide the age)


Label (no of rings
Thank You!

You might also like