
Intelligent Systems

Artificial Neural Networks

Slides are by: Tan, Steinbach, Karpatne, Kumar



Machine Learning Problem

• Non-linear classification: when there are only 2 features, a complex
  non-linear hypothesis (decision boundary) can still be constructed directly.

• With a large number of features, e.g. n = 100, this becomes impractical:
  including all quadratic feature terms has complexity O(n^2) ~ n^2/2,
  i.e. about 5000 features for n = 100.


Machine Learning Problem

• What is this?
• You see this:


Computer Vision: Car Detection

• Car detection: training on labeled pictures



Neuron in the Brain

(Diagram of a biological neuron, with its input and output signals labeled.)


Artificial Neural Networks (ANN)

• Three inputs X1, X2, X3 feed a black box that produces the output Y:

    X1  X2  X3   Y
     1   0   0  -1
     1   0   1   1
     1   1   0   1
     1   1   1   1
     0   0   1  -1
     0   1   0  -1
     0   1   1   1
     0   0   0  -1

• Output Y is 1 if at least two of the three inputs are equal to 1.


Artificial Neural Networks (ANN)

• The same black box modeled as a single output node: each input Xi is
  connected to the output node by a link with weight 0.3, and a bias unit
  X0 = 1 is connected with weight w0 = -0.4 (equivalently, the output node
  uses the threshold t = 0.4).  The truth table is the same as on the
  previous slide.

    h(x) = sign( 0.3 X1 + 0.3 X2 + 0.3 X3 - 0.4 )

    where sign(x) = +1 if x >= 0, and -1 if x < 0


Artificial Neural Networks (ANN)

• Model is an assembly of inter-connected nodes and weighted links
  (the Perceptron Model): input nodes X1, ..., Xd are connected to an
  output node by links with weights w1, ..., wd.

• The output node sums up each of its input values according to the
  weights of its links and compares the sum against a threshold t:

    h(x) = sign( Σ_{i=1..d} wi Xi + w0 X0 )
         = sign( Σ_{i=0..d} wi Xi ),   where X0 = 1 and w0 = -t
Perceptron

• Single layer network
  – Contains only input and output nodes

• Activation function: h(x) = sign(w · x)

• Applying the model is straightforward:

    h(x) = sign( 0.3 X1 + 0.3 X2 + 0.3 X3 - 0.4 )

    where sign(x) = +1 if x >= 0, and -1 if x < 0

  – Example: X1 = 1, X2 = 0, X3 = 1  =>  y = sign(0.2) = 1
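As a minimal sketch (not part of the original slides; the function names are
illustrative), the decision function above can be evaluated in Python with the
weights 0.3, 0.3, 0.3 and bias -0.4 taken from the example:

```python
# Sketch of applying the perceptron from the example above.
# Weights (0.3, 0.3, 0.3) and bias -0.4 come from the slide;
# the function names are illustrative.

def sign(v):
    """sign(v) = +1 if v >= 0, -1 otherwise (as defined on the slide)."""
    return 1 if v >= 0 else -1

def perceptron_predict(x, w=(0.3, 0.3, 0.3), bias=-0.4):
    """h(x) = sign(w . x + bias)."""
    return sign(sum(wi * xi for wi, xi in zip(w, x)) + bias)

# Worked example from the slide: X1 = 1, X2 = 0, X3 = 1 -> sign(0.2) = 1
print(perceptron_predict((1, 0, 1)))   # 1

# The model reproduces the "at least two inputs equal to 1" truth table.
for x in [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1),
          (0, 0, 1), (0, 1, 0), (0, 1, 1), (0, 0, 0)]:
    print(x, perceptron_predict(x))
```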
Perceptron Learning Rule

• Initialize the weights (w0, w1, ..., wd)

• Repeat
  – For each training example (xi, yi)
    • Compute f(w(k), xi)
    • Update the weights:

        w(k+1) = w(k) + λ ( yi - f(w(k), xi) ) xi

• Until stopping condition is met
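A minimal Python sketch of this rule (illustrative, not the textbook's code).
It uses exact rational arithmetic so that ties at the decision boundary are not
disturbed by floating-point rounding; with λ = 0.1 and the three-input data from
the earlier truth table it should reproduce the epoch-by-epoch weights shown on
the "Example of Perceptron Learning" slide, ending at (-0.6, 0.4, 0.4, 0.2).

```python
# Illustrative sketch of the perceptron learning rule above (not the
# textbook's code).  Exact rational arithmetic (Fraction) keeps the
# weight trace free of floating-point rounding at ties.
from fractions import Fraction

def sign(v):
    # sign as defined earlier: +1 if v >= 0, -1 otherwise
    return 1 if v >= 0 else -1

def f(w, x):
    """Perceptron output f(w, x) = sign(w . x); x[0] is the bias input 1."""
    return sign(sum(wi * xi for wi, xi in zip(w, x)))

def train_perceptron(data, lam, epochs):
    d = len(data[0][0])
    w = [Fraction(0)] * d                     # initialize the weights
    for _ in range(epochs):                   # repeat ...
        for x, y in data:                     # for each training example
            err = y - f(w, x)                 # error term y - f(w, x)
            # w(k+1) = w(k) + lambda * (y - f(w(k), x)) * x
            w = [wi + lam * err * xi for wi, xi in zip(w, x)]
    return w                                  # ... until stopping condition

# Data from the earlier truth table, with a bias input x0 = 1 prepended.
data = [([1, 1, 0, 0], -1), ([1, 1, 0, 1],  1), ([1, 1, 1, 0],  1),
        ([1, 1, 1, 1],  1), ([1, 0, 0, 1], -1), ([1, 0, 1, 0], -1),
        ([1, 0, 1, 1],  1), ([1, 0, 0, 0], -1)]

w = train_perceptron(data, lam=Fraction(1, 10), epochs=6)
print([float(wi) for wi in w])   # [-0.6, 0.4, 0.4, 0.2]
```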



Perceptron Learning Rule

• Weight update formula:

    w(k+1) = w(k) + λ ( yi - f(w(k), xi) ) xi ,    λ : learning rate

• Intuition:
  – Update the weight based on the error  e = yi - f(w(k), xi)
  – If y = f(x, w), e = 0: no update needed
  – If y > f(x, w), e = 2: the weight must be increased so that
    f(x, w) will increase
  – If y < f(x, w), e = -2: the weight must be decreased so that
    f(x, w) will decrease
Example of Perceptron Learning

    w(k+1) = w(k) + λ ( yi - f(w(k), xi) ) xi
    Y = sign( Σ_{i=0..d} wi Xi ),    λ = 0.1

• Training data:

    X1  X2  X3   Y
     1   0   0  -1
     1   0   1   1
     1   1   0   1
     1   1   1   1
     0   0   1  -1
     0   1   0  -1
     0   1   1   1
     0   0   0  -1

• Weights after each update during the first epoch:

    Step   w0    w1    w2    w3
     0      0     0     0     0
     1    -0.2  -0.2    0     0
     2      0     0     0    0.2
     3      0     0     0    0.2
     4      0     0     0    0.2
     5    -0.2    0     0     0
     6    -0.2    0     0     0
     7      0     0    0.2   0.2
     8    -0.2    0    0.2   0.2

• Weights at the end of each epoch:

    Epoch   w0    w1    w2    w3
     0       0     0     0     0
     1     -0.2    0    0.2   0.2
     2     -0.2    0    0.4   0.2
     3     -0.4    0    0.4   0.2
     4     -0.4   0.2   0.4   0.4
     5     -0.6   0.2   0.4   0.2
     6     -0.6   0.4   0.4   0.2


Perceptron Learning Rule

• Since f(w, x) is a linear combination of the input variables, the
  decision boundary is linear.

• For nonlinearly separable problems, the perceptron learning algorithm
  will fail because no linear hyperplane can separate the data perfectly.



General Structure of ANN

(Diagram: a feed-forward network with an input layer (x1, ..., x5), a
hidden layer, and an output layer.  Each neuron i receives inputs I1, I2,
I3 over links with weights wi1, wi2, wi3, computes the weighted sum Si,
and applies an activation function g(Si) with threshold t to produce its
output Oi.)

• Training an ANN means learning the weights of the neurons.


Nonlinearly Separable Data

• XOR data:  y = x1 ⊕ x2

    x1  x2   y
     0   0  -1
     1   0   1
     0   1   1
     1   1  -1


Multilayer Neural Network

• An artificial neural network has a more complex structure than that of
  a perceptron model.
  – Hidden layers: intermediary layers between the input and output layers.
  – The network may use types of activation functions other than the
    sign function.
  – Examples of other activation functions include the sigmoid (logistic)
    and hyperbolic tangent functions.
  – These activation functions allow the hidden and output nodes to
    produce output values that are nonlinear in their input parameters.
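As a small illustration (not from the slides), the activation functions
mentioned above can be written as:

```python
# Illustrative definitions of the activation functions mentioned above.
import math

def sign(v):
    return 1 if v >= 0 else -1            # used by the perceptron

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))     # logistic function, output in (0, 1)

def tanh(v):
    return math.tanh(v)                   # hyperbolic tangent, output in (-1, 1)
```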



Artificial Neural Networks (ANN)

• Various types of neural network topology:
  – single-layered network (perceptron) versus multi-layered network
  – feed-forward versus recurrent network

• Various types of activation functions (f):

    h(x) = f( Σ_i wi Xi )


Multi-layer Neural Network

• A multi-layer neural network can solve any type of classification task
  involving nonlinear decision surfaces.

• (Figure: the XOR data are separated by combining two hyperplanes; the
  hidden and output nodes apply a sigmoid function σ.)
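A minimal sketch of such a two-layer sigmoid network (the weights below are
hand-chosen for illustration and are not from the slides): the two hidden units
compute approximately OR and AND of the inputs, and the output unit combines
them into XOR.

```python
# Hand-chosen weights (illustrative) for a two-layer sigmoid network
# that computes XOR: h1 ~ x1 OR x2, h2 ~ x1 AND x2, y ~ h1 AND NOT h2.
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def xor_net(x1, x2):
    h1 = sigmoid(20 * x1 + 20 * x2 - 10)   # ~ x1 OR x2
    h2 = sigmoid(20 * x1 + 20 * x2 - 30)   # ~ x1 AND x2
    y  = sigmoid(20 * h1 - 20 * h2 - 10)   # ~ h1 AND (NOT h2) = x1 XOR x2
    return y

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x1, x2, round(xor_net(x1, x2)))  # 0, 1, 1, 0
```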



Learning Multi-layer Neural Network

• Can we apply the perceptron learning rule to each node, including the
  hidden nodes?
  – The perceptron learning rule computes the error term e = y - f(w, x)
    and updates the weights accordingly.
  – Problem: how do we determine the true value of y for hidden nodes?
  – Approximate the error in hidden nodes by the error in the output nodes.
  – Problems:
    • It is not clear how adjustments in the hidden nodes affect the
      overall error.
    • There is no guarantee of convergence to an optimal solution.



Learning the ANN model (Multilayer)

• The goal of the ANN learning algorithm is to determine a set of
  weights w that minimizes the total sum of squared errors:

    E(w) = (1/2) Σ_{i=1..N} ( yi - ŷi )^2

• The sum of squared errors depends on w because the predicted class ŷ
  is a function of the weights assigned to the hidden and output nodes.

• A simple (quadratic) error surface is encountered when ŷ is a linear
  function of its parameters w, i.e. ŷ = w · x: the error function is
  then quadratic in its parameters and a global minimum solution can be
  easily found.
Learning the ANN model (Multilayer)

• In most cases, the output of an ANN is a nonlinear function of its
  parameters because of the choice of its activation functions (e.g.,
  sigmoid or tanh function).

• As a result, it is no longer straightforward to derive a solution for
  w that is guaranteed to be globally optimal.

• Greedy algorithms such as those based on the gradient descent method
  have been developed to efficiently solve the optimization problem.



Learning the ANN model (Multilayer)

• The weight update formula used by the gradient descent method can be
  written as follows:

    wj  <-  wj - λ ∂E(w)/∂wj

  where λ is the learning rate.

• The second term states that the weight should be adjusted in a
  direction that reduces the overall error term.

• For hidden nodes, the computation is not trivial because it is
  difficult to assess their error term ∂E/∂wj without knowing what their
  output values should be.

• A technique known as back-propagation has been developed to address
  this problem.
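A compact sketch of gradient descent with back-propagation for a network with
one hidden layer of sigmoid units and squared-error loss (an illustrative
implementation, not the textbook's; the network size, learning rate, and epoch
count are arbitrary choices):

```python
# Illustrative back-propagation for a network with one hidden layer of
# sigmoid units and a single sigmoid output, trained by gradient descent
# on the squared error E = 1/2 * (y - yhat)^2.
import math, random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def train(data, n_hidden=2, lam=0.5, epochs=5000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # small random initial weights; the last entry of each row is the bias
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    W2 = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for x, y in data:
            # forward pass
            h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + row[-1])
                 for row in W1]
            yhat = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + W2[-1])
            # backward pass: hidden error terms are obtained by propagating
            # the output error back through the output weights
            d_out = (yhat - y) * yhat * (1 - yhat)
            d_hid = [d_out * W2[j] * h[j] * (1 - h[j]) for j in range(n_hidden)]
            # gradient descent update: w <- w - lam * dE/dw
            for j in range(n_hidden):
                W2[j] -= lam * d_out * h[j]
            W2[-1] -= lam * d_out
            for j in range(n_hidden):
                for i in range(n_in):
                    W1[j][i] -= lam * d_hid[j] * x[i]
                W1[j][-1] -= lam * d_hid[j]
    return W1, W2

# Example: fit the XOR data (targets coded 0/1 for the sigmoid output).
# Depending on the random initialization, gradient descent may end up in
# a local minimum, as noted later in these slides.
xor = [([0, 0], 0), ([1, 0], 1), ([0, 1], 1), ([1, 1], 0)]
W1, W2 = train(xor, n_hidden=2, lam=0.5, epochs=5000)
```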



Design Issues in ANN

• Number of nodes in the input layer:
  – One input node per binary/continuous attribute
  – k or log2(k) nodes for each categorical attribute with k values
• Number of nodes in the output layer:
  – One output node for a binary class problem
  – k nodes for a k-class problem
• Number of nodes in the hidden layer
• Initial weights and biases: random assignment is usually acceptable.
• Training examples with missing values should be removed or replaced
  with their most likely values.



Design Issues in ANN

• Number of nodes in the hidden layer:
  – Start from a fully connected network with a sufficiently large
    number of nodes and hidden layers, and then repeat the
    model-building procedure with a smaller number of nodes.
  – Alternatively, instead of repeating the model-building procedure, we
    could remove some of the nodes and repeat the model evaluation
    procedure to select the right model complexity.
• Initial weights and biases: random assignment
• Training examples with missing values should be removed or replaced
  with their most likely values.



Characteristics of ANN

• Multilayer ANNs are universal approximators, but they can suffer from
  overfitting if the network is too large.
• Gradient descent may converge to a local minimum.  One way to escape
  from a local minimum is to add a momentum term to the weight update
  formula (see the sketch after this list).
• Model building can be very time consuming, but testing can be very fast.
• Redundant attributes can be handled because the weights are learned
  automatically.
• Sensitive to noise in the training data.
• Difficult to handle missing attributes.
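As an illustration of the momentum idea mentioned above (a common formulation,
not spelled out on the slides; λ is the learning rate and μ the momentum factor):

```python
# Illustrative gradient-descent update with a momentum term: a fraction
# mu of the previous weight change is carried over into the current one,
# which can help the search move through shallow local minima.
def momentum_update(w, grad, prev_delta, lam=0.1, mu=0.9):
    """w, grad, prev_delta are lists of equal length; returns (new_w, delta)."""
    delta = [-lam * g + mu * d for g, d in zip(grad, prev_delta)]
    new_w = [wi + di for wi, di in zip(w, delta)]
    return new_w, delta   # keep delta for the next iteration
```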

