08 NN


CSE472, CSE386

Artificial Intelligence

Neural Networks
Prof. Mahmoud Khalil
Summer 2024
1
Artificial Neural Networks

Artificial neural networks are inspired by brains and neurons


A neural network is a graph with nodes, or units, connected by links
Each link has an associated weight, a real number
Typically, each node i outputs a real number, which is fed as input to the nodes connected to i
The output of a node is a function of the weighted sum of the node’s inputs
2
Basic Concepts
• A Neural Network maps a set of inputs to a set of outputs.
• The number of inputs/outputs is variable.
• The network itself is composed of an arbitrary number of nodes, or units, connected by links, with an arbitrary topology.
  [Figure: inputs (Input 0 … Input n) feed the Neural Network, which produces outputs (Output 0 … Output m)]
• A link from unit j to unit i serves to propagate the activation aj from j to i, and it has a weight Wj,i.
• What can a neural network do?
  • Compute a known function / approximate an unknown function
  • Pattern recognition / signal processing
  • Learn to do any of the above
3
Fully Connected NN
Different types of nodes

4
Artificial Neuron, Node, or Unit: a Mathematical Abstraction

[Figure: processing unit i with input edges (each with a weight), an input function in_i, an activation function g, and output edges (each with a weight); weights can be positive or negative and can change over time (learning)]

Input function: the weighted sum of the unit's inputs,

    in_i = Σ_{j=0}^{n} W_{j,i} a_j

Activation function g (typically non-linear), applied to the input function:

    a_i = g(in_i) = g( Σ_{j=0}^{n} W_{j,i} a_j )

⇒ a processing element producing an output based on a function of its inputs

5
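
To make the abstraction concrete, here is a minimal sketch of such a unit in Python; the particular weights, inputs, and step activation below are made up for illustration and are not from the slides.

```python
# Minimal sketch of a single artificial unit (hypothetical weights/activation).
# in_i = sum_j W[j] * a[j]   (weighted-sum input function)
# a_i  = g(in_i)             (activation function applied to the input function)

def unit_output(weights, inputs, g):
    """Compute the output of one unit: g(weighted sum of its inputs)."""
    in_i = sum(w * a for w, a in zip(weights, inputs))
    return g(in_i)

# Example with a step activation (threshold at 0) and made-up weights:
step = lambda x: 1 if x >= 0 else 0
print(unit_output([-0.5, 1.0, 1.0], [-1, 0, 1], step))  # bias input -1, then a1 = 0, a2 = 1
```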
Activation Functions

• Step function
• Sign function
• Sigmoid function

6
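
As a quick reference, here is a minimal sketch of these three activation functions in Python (standard definitions; the exact threshold convention for the step function is an assumption):

```python
import math

def step(x, t=0.0):        # step function: 1 if x >= threshold t, else 0
    return 1 if x >= t else 0

def sign(x):               # sign function: +1 if x >= 0, else -1
    return 1 if x >= 0 else -1

def sigmoid(x):            # sigmoid (logistic) function: smooth, output in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

for x in (-2.0, 0.0, 2.0):
    print(x, step(x), sign(x), round(sigmoid(x), 3))
```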
Normalizing Unit Thresholds
If t is the threshold value of the output unit, then

    step_t( Σ_{j=1}^{n} Wj Ij ) = step_0( Σ_{j=0}^{n} Wj Ij ),  where W0 = t and I0 = −1

- So we can always assume that the unit's threshold is 0. This allows thresholds to be learned like any other weight.

- We can even allow output values in [0, 1] by replacing step_0 with the sigmoid function.

7
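
A quick numerical check of this normalization trick; the weights, threshold, and inputs below are hypothetical, chosen only to illustrate that thresholding at t is equivalent to adding a weight W0 = t on a fixed input I0 = −1 and thresholding at 0.

```python
def step(x, t=0.0):
    return 1 if x >= t else 0

# Hypothetical weights, threshold, and inputs
W, t = [0.7, -0.3], 0.4
I = [1, 1]

lhs = step(sum(w * i for w, i in zip(W, I)), t)            # step_t(sum_{j>=1} Wj*Ij)
rhs = step(sum(w * i for w, i in zip([t] + W, [-1] + I)))  # step_0 with W0 = t, I0 = -1
assert lhs == rhs
print(lhs, rhs)  # identical outputs
```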
Units as Logic Gates

Units with a threshold activation function can act as logic gates;
we can use these units to compute Boolean functions of their inputs.

Activation of threshold units when:

    Σ_{j=1}^{n} W_{j,i} a_j ≥ W_{0,i}

8
AND

x1  x2  output
0   0   0
0   1   0
1   0   0
1   1   1

[Figure: threshold unit with bias input −1 weighted W0 = 1.5, and inputs x1, x2 with weights w1 = 1, w2 = 1]

Activation of threshold units when:

    Σ_{j=1}^{n} W_{j,i} a_j ≥ W_{0,i}

9
OR
x1  x2  output
0   0   0
0   1   1
1   0   1
1   1   1

[Figure: threshold unit with bias input −1 weighted w0 = 0.5, and inputs x1, x2 with weights w1 = 1, w2 = 1]

Activation of threshold units when:

    Σ_{j=1}^{n} W_{j,i} a_j ≥ W_{0,i}

10
NOT

x1  output
0   1
1   0

[Figure: threshold unit with bias input −1 weighted w0 = −0.5, and input x1 with weight w1 = −1]

Activation of threshold units when:

    Σ_{j=1}^{n} W_{j,i} a_j ≥ W_{0,i}

So, units with a threshold activation function can act as logic gates, given the appropriate input and bias weights.

11
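
A short sketch verifying the three gates with the weights shown on these slides (the NOT weights are taken as w0 = −0.5, w1 = −1, one choice consistent with its truth table); a unit fires when the weighted sum of its inputs reaches the bias weight W0.

```python
def threshold_unit(w0, weights, inputs):
    """Output 1 if sum_j w_j * x_j >= w0 (bias input -1 with weight w0), else 0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= w0 else 0

AND = lambda x1, x2: threshold_unit(1.5, [1, 1], [x1, x2])
OR  = lambda x1, x2: threshold_unit(0.5, [1, 1], [x1, x2])
NOT = lambda x1:     threshold_unit(-0.5, [-1], [x1])      # assumed weights

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, AND(x1, x2), OR(x1, x2))   # AND/OR truth tables
    print(x1, NOT(x1))                            # NOT truth table
```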
Network Structures
• Feed-forward networks
• Activation flows from input layer to output layer
• single-layer perceptrons
• multi-layer perceptrons
• Feed-forward networks implement functions,
have no internal state (only weights).

• Recurrent networks
  • Feed the outputs back into the network's own inputs
  • The network is a dynamical system (stable states, oscillations, chaotic behavior)
  • The response of the network depends on its initial state
  • Can support short-term memory
  • More difficult to understand
12
Feed-Forward Network
Two input units, two hidden units, one output unit

Each unit receives input only from units in the immediately preceding layer.
(Bias units omitted for simplicity.)

Given an input vector x = (x1, x2), the activations of the input units are set to the values of the
input vector, i.e., (a1, a2) = (x1, x2), and the network computes (with inputs 1, 2, hidden units 3, 4, and output unit 5):

    a5 = g( W_{3,5} a3 + W_{4,5} a4 )
       = g( W_{3,5} g( W_{1,3} a1 + W_{2,3} a2 ) + W_{4,5} g( W_{1,4} a1 + W_{2,4} a2 ) )

By adjusting the weights we get different functions:
that is how learning is done in neural networks!
13
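
A minimal sketch of the forward pass of this 2-2-1 network, using the unit numbering above; the sigmoid activation and the example weight values are assumptions for illustration, not values from the slides.

```python
import math

def g(x):                      # assumed activation: sigmoid
    return 1.0 / (1.0 + math.exp(-x))

def forward(x1, x2, W):
    """W maps (j, i) -> weight on the link from unit j to unit i."""
    a1, a2 = x1, x2                                   # input activations
    a3 = g(W[(1, 3)] * a1 + W[(2, 3)] * a2)           # hidden unit 3
    a4 = g(W[(1, 4)] * a1 + W[(2, 4)] * a2)           # hidden unit 4
    a5 = g(W[(3, 5)] * a3 + W[(4, 5)] * a4)           # output unit 5
    return a5

# Hypothetical weights; changing them changes the function the network computes.
W = {(1, 3): 1.0, (2, 3): -1.0, (1, 4): -0.5, (2, 4): 0.5, (3, 5): 2.0, (4, 5): 1.0}
print(forward(0.0, 1.0, W))
```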
Single Layer Feed-Forward Networks (perceptron)
• Single-layer neural network (perceptron network)
  • A network with all the inputs connected directly to the outputs
  • Output units all operate separately: no shared weights

Since each output unit is independent of the others, we can limit our study to single-output perceptrons.

14
Perceptron Learning Intuition
• Weight update
  • Inputs Ij (j = 1, 2, …, n)
  • Single output O; target output T
• Consider some initial weights.
• Define the example error: Err = T − O
• Now just move the weights in the right direction!
  • If the error is positive, we need to increase O: Err > 0 ⇒ increase O
  • If the error is negative, we need to decrease O: Err < 0 ⇒ decrease O
• Each input unit j contributes Wj·Ij to the total input:
  • if Ij is positive, increasing Wj tends to increase O;
  • if Ij is negative, increasing Wj tends to decrease O.
• So we use the update rule Wj ← Wj + α · Ij · Err, where α is the learning rate.

[Figure: input Ij connects through weight Wj to output O]
15
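
In code, the update rule looks as follows; a minimal sketch, with the learning rate α left as a parameter and the "output 1 if the weighted sum is positive" convention used in the example on the next slides.

```python
def perceptron_update(W, I, target, alpha=1.0):
    """One perceptron learning step: Wj <- Wj + alpha * Ij * Err, with Err = T - O."""
    S = sum(w * i for w, i in zip(W, I))
    O = 1 if S > 0 else 0          # threshold output
    err = target - O
    return [w + alpha * i * err for w, i in zip(W, I)]

print(perceptron_update([0, 0, 0], [1, 0, 1], target=1))  # -> [1, 0, 1]
```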
Perceptron Learning: Example
• Let's consider an example.
• Framework and notation:
  • 0/1 signals
  • Input vector: X = ⟨x0, x1, x2, …, xn⟩
  • Weight vector: W = ⟨w0, w1, w2, …, wn⟩
  • x0 = 1 and −w0 simulate the threshold.
  • O is the output (0 or 1) (single output).
  • Learning rate α = 1.
• Threshold function: S = Σ_{k=0}^{n} wk xk; if S > 0 then O = 1, else O = 0.
16
Perceptron Learning: Example
Err = T − O
Wj ← Wj + α · Ij · Err
• Set of training examples: each example is a pair (xi, yi), i.e., an input vector and a label y (0 or 1).
• Learning procedure, called the "error-correcting method":
  • Start with an all-zero weight vector.
  • Cycle (repeatedly) through the training examples, and for each example:
    • If the perceptron outputs 0 while it should be 1, add the input vector to the weight vector.
    • If the perceptron outputs 1 while it should be 0, subtract the input vector from the weight vector.
    • Otherwise do nothing.
  Intuitively correct (e.g., if the output is 0 but it should be 1, the weights are increased)!
17
Perceptron Learning: Example
• Consider learning the logical OR function.
• Our training samples are:

      x0  x1  x2  label
  1   1   0   0   0
  2   1   0   1   1
  3   1   1   0   1
  4   1   1   1   1

Activation function: S = Σ_{k=0}^{n} wk xk; if S > 0 then O = 1, else O = 0.
18
Perceptron Learning: Example

S = Σ_{k=0}^{n} wk xk; if S > 0 then O = 1, else O = 0

Error-correcting method:
  If the perceptron outputs 0 while it should be 1, add the input vector to the weight vector.
  If the perceptron outputs 1 while it should be 0, subtract the input vector from the weight vector.
  Otherwise do nothing.

• We'll use a single perceptron with three inputs.
  [Figure: inputs I0 (fixed at 1), I1, I2 with weights w0, w1, w2 feeding output O]
• We'll start with all weights zero: W = <0,0,0>.

• Example 1: I = <1,0,0>, label = 0, W = <0,0,0>
  • Perceptron: S = 1·0 + 0·0 + 0·0 = 0, output = 0
  • It classifies it as 0, which is correct, so do nothing.

• Example 2: I = <1,0,1>, label = 1, W = <0,0,0>
  • Perceptron: S = 1·0 + 0·0 + 1·0 = 0, output = 0
  • It classifies it as 0 while it should be 1, so we add the input to the weights.
  • W = <0,0,0> + <1,0,1> = <1,0,1>
19
Perceptron Learning: Example

• Example 3: I = <1,1,0>, label = 1, W = <1,0,1>
  • Perceptron: S = 1·1 + 1·0 + 0·1 = 1 > 0, output = 1
  • It classifies it as 1, correct, so do nothing.
  • W = <1,0,1>

• Example 4: I = <1,1,1>, label = 1, W = <1,0,1>
  • Perceptron: S = 1·1 + 1·0 + 1·1 = 2 > 0, output = 1
  • It classifies it as 1, correct, so do nothing.
  • W = <1,0,1>

20
Perceptron Learning: Example

• Epoch 2, through the examples, W = <1,0,1>.

• Example 1: I = <1,0,0>, label = 0, W = <1,0,1>
  • Perceptron: S = 1·1 + 0·0 + 0·1 = 1 > 0, output = 1
  • It classifies it as 1 while it should be 0, so subtract the input from the weights.
  • W = <1,0,1> − <1,0,0> = <0,0,1>

• Example 2: I = <1,0,1>, label = 1, W = <0,0,1>
  • Perceptron: S = 1·0 + 0·0 + 1·1 = 1 > 0, output = 1
  • It classifies it as 1, correct, so do nothing.
21
Perceptron Learning: Example

• Example 3: I = <1,1,0>, label = 1, W = <0,0,1>
  • Perceptron: S = 1·0 + 1·0 + 0·1 = 0, output = 0
  • It classifies it as 0 while it should be 1, so add the input to the weights.
  • W = I + W = <1,1,1>

• Example 4: I = <1,1,1>, label = 1, W = <1,1,1>
  • Perceptron: S = 1·1 + 1·1 + 1·1 = 3 > 0, output = 1
  • It classifies it as 1, correct, so do nothing.
  • W = <1,1,1>

22
Perceptron Learning: Example

• Epoch 3, through the examples, W = <1,1,1>.

• Example 1: I = <1,0,0>, label = 0, W = <1,1,1>
  • Perceptron: S = 1·1 + 0·1 + 0·1 = 1 > 0, output = 1
  • It classifies it as 1 while it should be 0, so subtract the input from the weights.
  • W = <1,1,1> − <1,0,0> = <0,1,1>

• Example 2: I = <1,0,1>, label = 1, W = <0,1,1>
  • Perceptron: S = 1·0 + 0·1 + 1·1 = 1 > 0, output = 1
  • It classifies it as 1, correct, so do nothing.
23
Perceptron Learning: Example

• Example 3: I = <1,1,0>, label = 1, W = <0,1,1>
  • Perceptron: S = 1·0 + 1·1 + 0·1 = 1 > 0, output = 1
  • It classifies it as 1, correct, so do nothing.

• Example 4: I = <1,1,1>, label = 1, W = <0,1,1>
  • Perceptron: S = 1·0 + 1·1 + 1·1 = 2 > 0, output = 1
  • It classifies it as 1, correct, so do nothing.

24
Perceptron Learning: Example

• Epoch 4, through the examples, W = <0,1,1>.

• Example 1: I = <1,0,0>, label = 0, W = <0,1,1>
  • Perceptron: S = 1·0 + 0·1 + 0·1 = 0, output = 0
  • It classifies it as 0, correct, so do nothing.

So the final weight vector W = <0,1,1> classifies all examples correctly, and the perceptron has learned the OR function!

[Figure: learned perceptron with W0 = 0, W1 = 1, W2 = 1]

Aside: in more realistic cases the bias (W0) will not be 0. (This was just a toy example!)
Also, in general, there are many more inputs (100 to 1000).
25
Perceptron Learning: Example

Epoch  x0  x1  x2  Target  w0  w1  w2  Output  Error  New w0  New w1  New w2
1      1   0   0   0       0   0   0   0        0     0       0       0
       1   0   1   1       0   0   0   0        1     1       0       1
       1   1   0   1       1   0   1   1        0     1       0       1
       1   1   1   1       1   0   1   1        0     1       0       1
2      1   0   0   0       1   0   1   1       -1     0       0       1
       1   0   1   1       0   0   1   1        0     0       0       1
       1   1   0   1       0   0   1   0        1     1       1       1
       1   1   1   1       1   1   1   1        0     1       1       1
3      1   0   0   0       1   1   1   1       -1     0       1       1
       1   0   1   1       0   1   1   1        0     0       1       1
       1   1   0   1       0   1   1   1        0     0       1       1
       1   1   1   1       0   1   1   1        0     0       1       1
4      1   0   0   0       0   1   1   0        0     0       1       1
26
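
The whole trace above can be reproduced with a short script; this is a minimal sketch that cycles through the four OR examples with the error-correcting rule (learning rate 1) and prints one row per example.

```python
# Error-correcting perceptron learning on the OR examples (x0 is the fixed bias input 1).
examples = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
W = [0, 0, 0]

for epoch in range(1, 5):
    for x, target in examples:
        S = sum(w * xi for w, xi in zip(W, x))
        O = 1 if S > 0 else 0
        err = target - O
        W = [w + xi * err for w, xi in zip(W, x)]   # learning rate alpha = 1
        print(epoch, x, target, O, err, W)

# Final weights: W = [0, 1, 1], which classifies all four examples correctly.
```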
Expressiveness of Perceptron

What hypothesis space can a perceptron represent?

It can represent the Boolean functions AND, OR, and NOT shown earlier, and even more complex Boolean functions such as the majority function.

But can it represent any arbitrary Boolean function?

27
Expressiveness of Perceptron

A threshold perceptron returns 1 iff the weighted sum of its inputs (including the bias) is positive, i.e.:

    Σ_{j=0}^{n} Wj xj > 0,  i.e.,  W · x > 0

That is, iff the input is on one side of the hyperplane it defines.

Perceptron ⇒ Linear Separator

A linear discriminant function, or linear decision surface.

The weights determine the slope and the bias determines the offset.

28
Linear Separability

Consider an example with two inputs, x1 and x2:

[Figure: a scatter of positive (+) and negative (−) points in the (x1, x2) plane, separated by a line]

We can view the trained network as defining a "separation line". What is its equation?

    w1 x1 + w2 x2 − w0 = 0

i.e.,

    x2 = −(w1/w2) x1 + w0/w2

Perceptron used for classification

29
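
A small sketch that extracts the slope and offset of the separation line from a perceptron's weights using the equation above; plugging in the weights learned for OR earlier (w0 = 0, w1 = 1, w2 = 1) gives the line x2 = −x1.

```python
def separation_line(w0, w1, w2):
    """Return (slope, intercept) of x2 = -(w1/w2) * x1 + w0/w2 (requires w2 != 0)."""
    return -w1 / w2, w0 / w2

slope, intercept = separation_line(0, 1, 1)   # weights learned for OR above
print(slope, intercept)                        # -1.0 0.0: the line x2 = -x1
```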
Linear Separability

[Figure: OR — the four input points in the (x1, x2) plane can be separated by a single line]

30
Linear Separability

[Figure: AND — the four input points in the (x1, x2) plane can be separated by a single line]

31
Linear Separability

[Figure: XOR — the four input points in the (x1, x2) plane]

32
Linear Separability

[Figure: XOR — no single line separates the two classes in the (x1, x2) plane]

XOR is not linearly separable.

Minsky & Papert (1969)

Bad news: perceptrons can only represent linearly separable functions.
33
Linear Separability XOR
• Consider a threshold perceptron for the logical XOR function (two inputs), which outputs 1 when

    w1 x1 + w2 x2 > T

• Our examples are:

      x1  x2  label
  1   0   0   0
  2   1   0   1
  3   0   1   1
  4   1   1   0

• Given our examples, we have the following inequalities for the perceptron:
  • From (1): 0 + 0 ≤ T
  • From (2): w1 + 0 > T
  • From (3): 0 + w2 > T
  • From (4): w1 + w2 ≤ T — a contradiction, since (2) and (3) with T ≥ 0 from (1) give w1 + w2 > 2T ≥ T.

So, XOR is not linearly separable.

34
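
The same conclusion can be illustrated empirically: the sketch below brute-forces a small grid of (w1, w2, T) values and finds many settings that implement OR but none that implement XOR. This is only an illustrative check over a finite grid, not a proof; the proof is the inequality argument above.

```python
import itertools

def fits(truth_table, w1, w2, T):
    """True if the threshold rule w1*x1 + w2*x2 > T reproduces the given labels."""
    return all((w1 * x1 + w2 * x2 > T) == bool(label)
               for (x1, x2), label in truth_table)

OR_tt  = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
XOR_tt = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

grid = [x / 2 for x in range(-6, 7)]          # -3.0, -2.5, ..., 3.0
or_hits  = sum(fits(OR_tt,  *p) for p in itertools.product(grid, repeat=3))
xor_hits = sum(fits(XOR_tt, *p) for p in itertools.product(grid, repeat=3))
print(or_hits, xor_hits)                      # many solutions for OR, zero for XOR
```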
Convergence of Perceptron Learning Algorithm

The perceptron converges to a consistent function, if…

• … the training data are linearly separable
• … the step size α is sufficiently small
• … there are no "hidden" units

35
Non Linear Classifiers
• The XOR problem

  x1  x2  XOR  Class
  0   0   0    B
  0   1   1    A
  1   0   1    A
  1   1   0    B

36
Non Linear Classifiers
• There is no single line (hyperplane) that separates class A from class B. In contrast, the AND and OR operations are linearly separable problems.

37
The Two‐Layer Perceptron
• For the XOR problem, draw two lines instead of one.

38
The Two‐Layer Perceptron
• Then class B is located outside the shaded area and class A inside. This is a two-phase design.
• Phase 1: Draw two lines (hyperplanes):

    g1(x) = 0 and g2(x) = 0

  Each of them is realized by a perceptron. The outputs of the perceptrons will be

    yi = f(gi(x)) = 0 or 1,  i = 1, 2

  depending on the position of x.

• Phase 2: Find the position of x with respect to both lines, based on the values of y1, y2.
39
The Two‐Layer Perceptron

        1st phase          2nd phase
  x1  x2   y1      y2
  0   0   0 (−)   0 (−)    B (0)
  0   1   1 (+)   0 (−)    A (1)
  1   0   1 (+)   0 (−)    A (1)
  1   1   1 (+)   1 (+)    B (0)

• Equivalently: the computations of the first phase perform a mapping

    x → y = [y1, y2]^T

40
The Two‐Layer Perceptron
The decision is now performed on the transformed y data:

    g(y) = 0

This can be performed via a second line, which can also be realized by a perceptron.
41
The Two‐Layer Perceptron
• Computations of the first phase perform a mapping
that transforms the nonlinearly separable problem to a
linearly separable one.

• The architecture

42
The Two‐Layer Perceptron
• This is known as the two-layer perceptron, with one hidden and one output layer. The activation functions are step functions:

    f(·) ∈ {0, 1}

• The neurons (nodes) of the figure realize the following lines (hyperplanes):

    g1(x) = x1 + x2 − 1/2 = 0
    g2(x) = x1 + x2 − 3/2 = 0
    g(y)  = y1 − 2 y2 − 1/2 = 0
43
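
Putting the three hyperplanes together yields a working XOR classifier; a minimal sketch with step activations (output 1 when the argument is positive, matching the (+)/(−) entries in the table of the first phase):

```python
def step(x):
    return 1 if x > 0 else 0

def two_layer_xor(x1, x2):
    # First (hidden) layer: the two lines g1, g2
    y1 = step(x1 + x2 - 0.5)        # g1(x) = x1 + x2 - 1/2
    y2 = step(x1 + x2 - 1.5)        # g2(x) = x1 + x2 - 3/2
    # Second (output) layer: the line g in the transformed y-space
    return step(y1 - 2 * y2 - 0.5)  # g(y) = y1 - 2*y2 - 1/2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, two_layer_xor(x1, x2))   # 0, 1, 1, 0 -> classes B, A, A, B
```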
