AML 03 Dense Neural Networks


Advanced Machine Learning

Dense Neural Networks


Amit Sethi
Electrical Engineering, IIT Bombay
Learning objectives
• Learn the motivations behind neural networks

• Become familiar with neural network terms

• Understand the working of neural networks

• Understand behind-the-scenes training of


neural networks
Neural networks are inspired by the
mammalian brain
• Each unit (neuron) is simple
• But, human brain has 100
billion neurons with 100
trillion connections
• The strength and nature of
the connections stores
memories and the
“program” that makes us
human
• A neural network is a web
of artificial neurons
Artificial neurons are inspired by
biological neurons
• Neural networks are made up of artificial
neurons
• Artificial neurons are only loosely based on
real neurons, just like neural networks are
only loosely based on the human brain

[Figure: an artificial neuron — inputs x1, x2, x3 weighted by w1, w2, w3, a bias b, a summation Σ, and an activation function g]
Activation function is the secret sauce
of neural networks
• Neural network training is all about tuning weights and biases
• If there were no activation function g, the output of the entire neural network would be a linear function of the inputs
• The earliest models used a step function

[Figure: the same artificial neuron, with the activation g applied to the weighted sum]
Types of activation functions
• Step: original concept
behind classification and
region bifurcation. Not
used anymore
• Sigmoid and tanh:
trainable approximations
of the step-function
• ReLU: currently preferred
due to fast convergence
• Softmax: currently
preferred for the output of a
classification net; a
generalized sigmoid
• Linear: good for modeling
a range in the output of a
regression net
Formulas for activation functions
• Step: g(x) = (sign(x) + 1) / 2

• Sigmoid: g(x) = 1 / (1 + e^(−x))

• Tanh: g(x) = tanh(x)

• ReLU: g(x) = max(0, x)

• Softmax: g(x_i) = e^(x_i) / Σ_j e^(x_j)

• Linear: g(x) = x
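As a quick cross-check of these formulas, here is a minimal NumPy sketch (the function names and vectorized style are illustrative, not from the slides):

```python
import numpy as np

def step(x):
    # (sign(x) + 1) / 2: 0 for negative inputs, 1 for positive inputs
    return (np.sign(x) + 1) / 2

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# tanh is available directly as np.tanh

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    # shift by the max for numerical stability; the result sums to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

def linear(x):
    return x
```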
Step function divides the input space
into two halves → 0 and 1
• In a single neuron, step
function is a linear binary
classifier
• The weights and biases
determine where the step will
be in n-dimensions
• But, as we shall see later, it
gives little information about
how to change the weights if
we make a mistake
• So, we need a smoother
version of a step function
• Enter: the Sigmoid function
The sigmoid function is a smoother
step function

• Smoothness ensures that there is more


information about the direction in which
to change the weights if there are errors
• Sigmoid function is also mathematically
linked to logistic regression, which is a
theoretically well-backed linear classifier
The problem with sigmoid is (near)
zero gradient on both extremes
• For both large
positive and
negative input
values, sigmoid
doesn’t change
much with change
of input
• ReLU has a
constant gradient
for almost half of
the inputs
• But, ReLU cannot
give a meaningful
final output
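A small sketch (with illustrative helper names) makes the contrast concrete: the sigmoid gradient vanishes for large positive or negative inputs, while the ReLU gradient stays at 1 for every positive input.

```python
import numpy as np

def sigmoid_grad(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)              # nearly 0 when |x| is large (saturation)

def relu_grad(x):
    return (x > 0).astype(float)    # constant 1 for positive inputs, 0 otherwise

x = np.array([-10.0, -1.0, 0.5, 10.0])
print(sigmoid_grad(x))  # roughly [4.5e-05, 0.20, 0.24, 4.5e-05]
print(relu_grad(x))     # [0., 0., 1., 1.]
```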
Output activation functions can only
be of the following kinds
• Sigmoid gives binary
classification output
• Tanh can also do that
provided the desired
output is in {-1, +1}
• Softmax generalizes
sigmoid to n-ary
classification
• Linear is used for
regression
• ReLU is only used in
internal nodes (non-
output)
Contents

• Introduction to neural networks

• Feed forward neural networks

• Gradient descent and backpropagation

• Learning rate setting and tuning


Basic structure of a neural network
[Figure: layers of nodes — inputs x1 … xd at the bottom, hidden units h11, h12, …, h1n above them, and outputs y1 … yn at the top]
• It is feed forward
  – Connections go from inputs towards outputs
  – No connection comes backwards
• It consists of layers
  – Current layer’s input is the previous layer’s output
  – No lateral (intra-layer) connections
• That’s it!
Basic structure of a neural network
[Figure: the same layered network, with inputs x1 … xd, hidden units h11, h12, …, h1n, and outputs y1 … yn]
• Output layer
  – Represents the output of the neural network
  – For a two-class problem or regression with a 1-d output, we need only one output node
• Hidden layer(s)
  – Represent the intermediary nodes that divide the input space into regions with (soft) boundaries
  – These usually form a hidden layer
  – Usually, there is only one such layer
  – Given enough hidden nodes, we can model an arbitrary input-output relation
• Input layer
  – Represents the dimensions of the input vector (one node for each dimension)
  – These usually form an input layer
  – Usually, there is only one such layer
Importance of hidden layers
[Figure: a 2-D set of + and − points — a single sigmoid neuron can only separate them with one linear boundary, while sigmoid hidden layers followed by a sigmoid output carve out the nonlinear region correctly]
• First hidden layer extracts features
• Second hidden layer extracts features of features
• …
• Output layer gives the desired output
Overall function of a neural network
• f(x_i) = g_l(W_l ∗ g_{l−1}(W_{l−1} ∗ … g_1(W_1 ∗ x_i) … ))
• Weights form a matrix
• Outputs of the previous layer form a vector
• The activation (nonlinear) function is applied point-wise to the weights times the input
• Design questions (hyperparameters):
  – Number of layers
  – Number of neurons in each layer (rows of the weight matrices)
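A minimal NumPy sketch of this nested computation, assuming lists of weight matrices, bias vectors (the formula above omits the biases shown in the earlier neuron diagram), and per-layer activation functions; all names and sizes here are illustrative:

```python
import numpy as np

def forward(x, weights, biases, activations):
    # f(x) = g_l(W_l * g_{l-1}(W_{l-1} * ... g_1(W_1 * x) ...)), layer by layer
    a = x
    for W, b, g in zip(weights, biases, activations):
        z = W @ a + b   # weight matrix times the previous layer's output, plus bias
        a = g(z)        # activation applied point-wise
    return a

# Example: 4-d input -> 8 hidden units (ReLU) -> 3 outputs (softmax)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 4)), rng.normal(size=(3, 8))]
biases = [np.zeros(8), np.zeros(3)]
relu = lambda z: np.maximum(0, z)
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()
y_hat = forward(rng.normal(size=4), weights, biases, [relu, softmax])
```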
Training the neural network
• Given 𝒙𝑖 and 𝑦𝑖
• Think of what hyper-parameters and neural
network design might work
• Form a neural network:
  f(x_i) = g_l(W_l ∗ g_{l−1}(W_{l−1} ∗ … g_1(W_1 ∗ x_i) … ))
• Compute f_w(x_i) as an estimate of y_i for all samples
• Compute loss:
  (1/N) Σ_{i=1}^{N} L(f_w(x_i), y_i) = (1/N) Σ_{i=1}^{N} l_i(w)
• Tweak 𝒘 to reduce loss (optimization algorithm)
• Repeat last three steps
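A gradient-descent training loop in this spirit might look like the sketch below; `loss_fn` and `grad_fn` are stand-ins for the per-sample loss and the gradient of the average loss with respect to the weights w (how that gradient is obtained is the topic of backpropagation):

```python
import numpy as np

def train(w, X, Y, loss_fn, grad_fn, lr=0.01, epochs=100):
    """Repeat: compute estimates, compute the average loss, tweak w to reduce it."""
    for epoch in range(epochs):
        avg_loss = np.mean([loss_fn(w, x, y) for x, y in zip(X, Y)])
        if epoch % 10 == 0:
            print(f"epoch {epoch}: loss {avg_loss:.4f}")
        w = w - lr * grad_fn(w, X, Y)   # move against the gradient of the loss
    return w
```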
Loss function choice
• In regression, errors can be positive or negative, and mean squared error (MSE) is the most common loss function
• In classification, the output is a probability of the correct class, for which cross entropy is the most common loss function

[Figure: plots of the two losses as a function of the error]
Some loss functions and their
derivatives
• Terminology
  – y is the output
  – t is the target output
• Mean square error
  – Loss: (y − t)²
  – Derivative of the loss: 2(y − t)
• Cross entropy
  – Loss: − Σ_{c=1}^{C} t_c log(y_c)
  – Derivative of the loss: −1/y_c, for the correct class c = ω
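These translate directly into code; a sketch assuming y is the network output and t the target (one-hot for cross entropy):

```python
import numpy as np

def mse_loss(y, t):
    return (y - t) ** 2

def mse_grad(y, t):
    return 2 * (y - t)

def cross_entropy_loss(y, t):
    # y: predicted class probabilities (e.g. softmax output), t: one-hot target
    return -np.sum(t * np.log(y))

def cross_entropy_grad(y, t):
    # nonzero only for the correct class c = omega, where it equals -1/y_c
    return -t / y
```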
Computational graph of a single hidden
layer NN

[Computational graph: x is multiplied by W1 and b1 is added to give Z1; ReLU(Z1) gives A1; A1 is multiplied by W2 and b2 is added to give Z2; softmax(Z2) gives A2; A2 and the target feed the cross-entropy (CE) loss]
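A minimal NumPy sketch of this graph's forward pass, with x as an input vector and target as a one-hot label; the layer sizes are arbitrary, and the cached intermediates are what backpropagation would reuse:

```python
import numpy as np

def forward_and_loss(x, target, W1, b1, W2, b2):
    Z1 = W1 @ x + b1                      # first affine transform
    A1 = np.maximum(0, Z1)                # ReLU
    Z2 = W2 @ A1 + b2                     # second affine transform
    A2 = np.exp(Z2 - Z2.max())
    A2 /= A2.sum()                        # softmax
    loss = -np.sum(target * np.log(A2))   # cross-entropy (CE) loss
    return loss, (Z1, A1, Z2, A2)

# Illustrative sizes: 4 inputs, 5 hidden units, 3 classes
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)
loss, cache = forward_and_loss(rng.normal(size=4), np.array([0.0, 1.0, 0.0]),
                               W1, b1, W2, b2)
```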
