AICh 6

Chapter 6
Learning Agents
What is Learning?
• In the psychology literature, learning is considered
one of the keys to human intelligence.
• What is learning? Learning is:
– Memorizing something
– Knowing facts through observation and exploration
– Improving motor and/or cognitive skills through practice
• The idea behind learning is that percepts should not
only be used for acting now, but also for improving the
agent’s ability to act in the future.
– Learning modifies the agent's decision making mechanisms to
improve performance
Face Recognition
• Learning is training (adaptation) from a data set.
[Figure: training examples of a person, followed by test images]
Learning Agents
The Basic Learning Model
• A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P,
– if its performance at tasks T, as measured by P, improves with
experience E.
• Learning agents consist of four main components:
– learning element -- the part of the agent responsible for
improving its performance
– performance element -- the part that chooses the actions to take
– critic -- provides feedback to the learning element on how the
agent is doing with respect to a performance standard
– problem generator -- suggests actions that could lead to new,
informative experiences (suboptimal from the point of view of
the performance element, but designed to improve that element)
Types of learning
• Supervised learning: occurs where a set of input/output pairs
are explicitly presented to the agent by a teacher
– The teacher provides a category label for each pattern in a training
set, then the learning algorithm finds a rule that does a good job of
predicting the output associated with a new input.
• Unsupervised learning: Learning when there is no information
about what the correct outputs are.
– In unsupervised learning or clustering there is no explicit teacher,
the system forms clusters or natural groupings of the input patterns.
• Reinforcement learning: an agent interacting with the world
makes observations, takes actions, and is rewarded or punished;
it should learn to choose actions that obtain as much reward
as possible.
– The agent is given an evaluation of its action, but is not told the
correct action. Reward strengthens the likelihood of the rewarded
action. Typically, the environment is assumed to be stochastic.
Classification
• Process of predicting class or category from
observed values or given data points.
• The categorical output can take forms such as
“Black” or “White”, or “spam” or “no spam”.
• To implement this classification, we first need to
train the classifier.
• For this example, “spam” and “no spam” emails
would be used as the training data.
• After the classifier has been trained successfully, it can be
used to classify an unknown email.
Data sets preparation for learning
• Training set
– Used in supervised learning, a training set is a set of problem
instances (described as a set of properties and their values),
together with a classification of the instance.
• Test set
– A set of instances and their classifications used to test the
accuracy of a learned hypothesis.
• Learning—A Two-Step Process

– Model construction: The training set is used to create the model.
The model is represented as classification rules, decision trees, or
mathematical formulae
– Model usage: the test set is used to see how well it works for
classifying future or unknown objects
Step 1: Model Construction
• The training data is fed to a classification algorithm, which
produces the classifier (model).
Training data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       9      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no
Classifier (model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Step 2: Using the Model in Prediction
• The classifier model is checked against the testing data and then
applied to unseen data.
Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Unseen data: (Jeff, Professor, 4). Tenured?
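The two steps can be sketched in a few lines of Python. This is only an illustration: the rule is the one shown in Step 1, the rows are the testing data from Step 2, and the function name predict_tenured is hypothetical.

```python
# Rule from Step 1: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict_tenured(rank: str, years: int) -> str:
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from Step 2: (name, rank, years, actual label)
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Model usage: compare predictions with the known labels
correct = sum(predict_tenured(rank, years) == label
              for _, rank, years, label in test_set)
accuracy = correct / len(test_set)
print(f"accuracy on the test set: {accuracy:.2f}")

# Unseen data: (Jeff, Professor, 4)
print("Jeff tenured?", predict_tenured("Professor", 4))
```

The rule misclassifies Merlisa (7 years but not tenured), so it is right on 3 of the 4 test rows; for Jeff it predicts "yes" because his rank is Professor.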
Learning methods
• There are various learning methods. Popular
learning techniques include the following.
– Nearest-neighbor classification
– k-nearest-neighbor classification
– Decision tree : divide decision space into piecewise
constant regions.
– Support vector machine
– Neural networks: partition by non-linear boundaries
– Regression: (linear or any other polynomial)
– Markov decision process
– Clustering
– K-means clustering
Decision tree
• A decision tree performs classification by constructing a tree
from training instances, with class labels at the leaves.
– The tree is traversed for each test instance to find a leaf, and the
class of the leaf is the predicted class. This is directed
knowledge discovery in the sense that there is a specific field
whose value we want to predict.
• Widely used learning method. It has been applied to:
– classify medical patients based on the disease,
– equipment malfunction by cause,
– loan applicant by likelihood of payment.
• Easy to interpret: can be re-represented as if-then-else rules
• Does not require any prior knowledge of data distribution,
works well on noisy data.
Decision Trees
• Tree where internal nodes are simple decision rules on one or
more attributes and leaf nodes are predicted class labels; i.e. a
Boolean classifier for the input instance.
Given an instance of an object or situation, which is specified
by a set of properties, the tree returns a "yes" or "no"
decision about that instance.
[Figure: a generic decision tree]
Attribute_1
  value-1: Attribute_2
    value-5: Class2
    value-4: Class3
  value-2: Class1
  value-3: Attribute_2
    value-6: Class4
    value-7: Class5
Example 1: The problem of “Sunburn”
• You want to predict whether a person is likely to get sunburned
when they go back to the beach. How can you do this?
• Data collected: predict based on the observed properties of the people
Name   Hair    Height   Weight   Lotion  Class
Sarah  Blonde  Average  Light    No      Sunburned
Dana   Blonde  Tall     Average  Yes     None
Alex   Brown   Short    Average  Yes     None
Annie  Blonde  Short    Average  No      Sunburned
Emily  Red     Average  Heavy    No      Sunburned
Pete   Brown   Tall     Heavy    No      None
John   Brown   Average  Heavy    No      None
Katie  Blonde  Short    Light    Yes     None
Decision Tree 1
• This is one way of looking at the data, starting with the ‘Height’
property. The decision tree correctly classifies all the data!
is_sunburned
Height
  short: Hair colour
    blonde: Weight
      light: Katie
      average: Annie
      heavy: (no examples)
    red: (no examples)
    brown: Alex
  average: Weight
    light: Sarah
    average: (no examples)
    heavy: Hair colour
      blonde: (no examples)
      red: Emily
      brown: John
  tall: Dana, Pete
Decision Tree 2
• Are there any attributes you would not want to use for predicting
whether people will suffer from sunburn? Height, for one, is not a
reasonable choice.
is_sunburned
Lotion used
  no: Hair colour
    blonde: Sarah, Annie
    red: Emily
    brown: Pete, John
  yes: Hair colour
    blonde: Dana, Katie
    red: (no examples)
    brown: Alex
• This decision tree is a lot better than the first one and it doesn’t involve any
of the irrelevant attributes. Unfortunately, it uses the hair colour
attribute twice, which doesn’t seem efficient.
So what is the best decision tree?
• Up to now we have used common sense to select the
attributes most likely to be relevant for classifying whether
people will get sunburned or not.
• It is better to construct an optimal decision tree by selecting the
best attributes with the help of entropy and information gain.
• The entropy measures the disorder of a set S containing a total of n
examples, of which n+ are positive and n- are negative. It is given by:
$$D(n_+, n_-) = -\frac{n_+}{n}\log_2\frac{n_+}{n} - \frac{n_-}{n}\log_2\frac{n_-}{n} = \mathrm{Entropy}(S)$$
•Some useful properties of the Entropy:
•D(n,m) = D(m,n)
•D(0,m) = D(m,0) = 0
• D(S)=0 means that all the examples in S have the same class
•D(m,m) = 1
• D(S)=1 means that half the examples in S are of one class and
half are the opposite class
Information Gain
• Once we have measured the disorder using entropy, what is
left? We want to measure how much the disorder of a set
would be reduced by knowing the value of a particular attribute;
this is what the information gain measures.
• The information gain measures the expected reduction in
entropy due to splitting on an attribute A:
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
• The sum is the average disorder: the weighted sum of the
disorders in the branches (subsets) created by the values of A.
• We want a large gain, which is the same as a small average
disorder.
How to construct the best decision tree?
• Let us construct an optimal decision tree with the help of
entropy and information gain.
• Entropy and information gain help us select the best attributes
during decision tree construction.
• Entropy: the disorder of Sunburned
$$D(\{\text{Sarah}, \text{Dana}, \text{Alex}, \text{Annie}, \text{Emily}, \text{Pete}, \text{John}, \text{Katie}\}) = D(3^+, 5^-) = -\frac{3}{8}\log_2\frac{3}{8} - \frac{5}{8}\log_2\frac{5}{8} = 0.954$$
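Calculate the Average Disorder Associated with Hair Colour
• Splitting on hair colour creates three subsets, and the average
disorder is the weighted sum of their disorders:
$$\frac{|S_{blonde}|}{|S|} D(S_{blonde}) + \frac{|S_{red}|}{|S|} D(S_{red}) + \frac{|S_{brown}|}{|S|} D(S_{brown})$$
• Sblonde = {Sarah, Annie, Dana, Katie}
• Sred = {Emily}
• Sbrown = {Alex, Pete, John}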
The first term of the sum:
• D(Sblonde) = D({“Sarah”, “Annie”, “Dana”, “Katie”}) = D(2+, 2-) = 1
$$\frac{|S_{blonde}|}{|S|} D(S_{blonde}) = \frac{4}{8} \times 1 = 0.5$$
The second and third terms of the sum:
• Sred = {“Emily”}
• Sbrown = { “Alex”, “Pete”, “John”}.
• These are both 0 because within each set all the examples
have the same class
• So the average disorder created when splitting on ‘hair colour’
is 0.5+0+0=0.5
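The same weighted-sum calculation can be repeated for every attribute. The sketch below does this for the sunburn table; the helper names (entropy, average_disorder) and the lower-case encoding of the values are illustrative choices, not part of the original material.

```python
import math
from collections import Counter, defaultdict

# Sunburn data: (name, hair, height, weight, lotion, class)
DATA = [
    ("Sarah", "blonde", "average", "light", "no", "sunburned"),
    ("Dana", "blonde", "tall", "average", "yes", "none"),
    ("Alex", "brown", "short", "average", "yes", "none"),
    ("Annie", "blonde", "short", "average", "no", "sunburned"),
    ("Emily", "red", "average", "heavy", "no", "sunburned"),
    ("Pete", "brown", "tall", "heavy", "no", "none"),
    ("John", "brown", "average", "heavy", "no", "none"),
    ("Katie", "blonde", "short", "light", "yes", "none"),
]
ATTRIBUTES = {"hair": 1, "height": 2, "weight": 3, "lotion": 4}  # column indices

def entropy(labels):
    """Disorder of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def average_disorder(rows, column):
    """Weighted sum of the disorders of the subsets created by one attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

base = entropy([row[-1] for row in DATA])
for name, column in ATTRIBUTES.items():
    avg = average_disorder(DATA, column)
    print(f"{name:7s} average disorder = {avg:.2f}, gain = {base - avg:.3f}")
```

Because the code subtracts unrounded values, the printed gains can differ in the third decimal place from figures obtained by subtracting the two-decimal average disorders by hand.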
Which decision variable minimises the disorder?
Average disorder of each attribute (test):
Test    Average disorder
hair    0.50
height  0.69
weight  0.94
lotion  0.61
• Which decision variable maximises the information gain, then?
• Remember, it is the one that minimises the average disorder.
– Gain(hair) = 0.954 - 0.50 = 0.454
– Gain(height) = 0.954 - 0.69 = 0.264
– Gain(weight) = 0.954 - 0.94 = 0.014
– Gain(lotion) = 0.954 - 0.61 = 0.344
The best decision tree?
is_sunburned
Hair colour
  blonde: ? (Sunburned = Sarah, Annie; None = Dana, Katie)
  red: Emily
  brown: Alex, Pete, John
• Once we have finished with hair colour, we need to work out
the remaining branches of the decision tree.
• Which attribute is best for classifying the remaining examples?
The best Decision Tree
• This is the simplest and optimal tree possible, and it makes a lot of
sense.
• It classifies 4 of the people on hair colour alone.
is_sunburned
Hair colour
  blonde: Lotion used
    no: Sarah, Annie
    yes: Dana, Katie
  red: Emily
  brown: Alex, Pete, John
Sunburn sufferers are ...
• You can view the decision tree as an IF-THEN-ELSE
statement that tells us whether someone will suffer
from sunburn.
If (hair-colour = “red”) then
    return (sunburned = yes)
else if (hair-colour = “blonde” and lotion-used = “no”) then
    return (sunburned = yes)
else
    return (sunburned = no)
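As a quick check, the rule can be written as a Python function and run against the eight people in the data table. The function name is_sunburned and the tuple layout are illustrative.

```python
def is_sunburned(hair: str, lotion: str) -> bool:
    """The decision tree written out as an if/else rule."""
    if hair == "red":
        return True
    if hair == "blonde" and lotion == "no":
        return True
    return False

# (name, hair, lotion, actually sunburned?) from the data table
PEOPLE = [
    ("Sarah", "blonde", "no", True),
    ("Dana", "blonde", "yes", False),
    ("Alex", "brown", "yes", False),
    ("Annie", "blonde", "no", True),
    ("Emily", "red", "no", True),
    ("Pete", "brown", "no", False),
    ("John", "brown", "no", False),
    ("Katie", "blonde", "yes", False),
]

misclassified = [name for name, hair, lotion, label in PEOPLE
                 if is_sunburned(hair, lotion) != label]
print("misclassified:", misclassified)
```

An empty list means the rule agrees with every row of the training data.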
Decision tree learning: Algorithm
• Aim: find a small tree consistent with the training examples
• Idea: (recursively) choose "most significant" attribute as root of
(sub)tree
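A minimal recursive sketch of this idea, assuming that "most significant" means the attribute with the smallest average disorder (equivalently, the largest information gain). It is run on the sunburn data; the function names and the nested-dictionary tree representation are illustrative choices, not a definitive implementation.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def average_disorder(rows, attr, target):
    total = 0.0
    for value in {row[attr] for row in rows}:
        subset = [row[target] for row in rows if row[attr] == value]
        total += len(subset) / len(rows) * entropy(subset)
    return total

def build_tree(rows, attributes, target):
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:            # all examples agree: make a leaf
        return labels[0]
    if not attributes:                   # nothing left to split on: majority class
        return Counter(labels).most_common(1)[0][0]
    # Choose the attribute with the smallest average disorder as the root
    best = min(attributes, key=lambda a: average_disorder(rows, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining, target)   # recurse on the subtree
    return {best: branches}

COLUMNS = ("hair", "height", "weight", "lotion", "class")
RAW = [
    ("blonde", "average", "light", "no", "sunburned"),
    ("blonde", "tall", "average", "yes", "none"),
    ("brown", "short", "average", "yes", "none"),
    ("blonde", "short", "average", "no", "sunburned"),
    ("red", "average", "heavy", "no", "sunburned"),
    ("brown", "tall", "heavy", "no", "none"),
    ("brown", "average", "heavy", "no", "none"),
    ("blonde", "short", "light", "yes", "none"),
]
rows = [dict(zip(COLUMNS, r)) for r in RAW]
tree = build_tree(rows, ["hair", "height", "weight", "lotion"], "class")
print(tree)
```

On this data the sketch picks hair colour at the root and lotion under the blonde branch, which matches the tree derived by hand.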
Exercise: Decision tree for “buy computer or not”. Use the
training dataset given below to construct a decision tree.
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Output: A Decision Tree for “buys_computer”
age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
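To illustrate how a tree is traversed for each instance until a leaf is reached, the sketch below stores the output tree as nested dictionaries and walks it for every row of the training table. The representation and the function name classify are assumptions made for the example.

```python
# The output tree as nested dictionaries: {attribute: {value: subtree or leaf}}
TREE = {"age": {
    "<=30": {"student": {"no": "no", "yes": "yes"}},
    "31..40": "yes",
    ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}

def classify(tree, example):
    """Follow the branches matching the example until a leaf (a plain string) is reached."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[example[attribute]]
    return tree

COLUMNS = ("age", "income", "student", "credit_rating", "buys_computer")
RAW = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
examples = [dict(zip(COLUMNS, r)) for r in RAW]
errors = [e for e in examples if classify(TREE, e) != e["buys_computer"]]
print("training errors:", len(errors))
```

The tree is consistent with all 14 training rows, so the error list is empty.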
Exercise: ‘Play Tennis?’
• Suppose that you have a free afternoon and you are deciding whether
or not to go and play tennis. How do you decide?
– The goal is to predict when this player will play tennis.
– A set of training examples is prepared for the classifier.
[Table: training examples, not included in this text]
Pros and Cons of decision trees
· Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle large number of features
· Cons
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data
Why decision tree induction in data mining?

• Relatively fast learning speed (compared with other classification
methods)
• Convertible to simple and easy-to-understand classification rules
• Classification accuracy comparable with other methods
Neural Network
• A neural network is a biologically inspired
AI technique
• A NN is an artificial representation of the
human brain that is trained to simulate its
learning process
– Learns by example
– Is trained to recognize input patterns
Neural Network
• It is represented as a layered set of interconnected processors. These
processor nodes are analogous to the neurons of the brain. Each
node has a weighted connection to several other nodes in adjacent
layers. Individual nodes take the input received from connected nodes
and use the weights together to compute output values.
• The inputs are fed simultaneously into the input layer.
• The weighted outputs of these units are fed into the hidden layer.
• The weighted outputs of the last hidden layer are inputs to the
units making up the output layer.
Architecture of Neural network
• Neural networks are used to look for patterns in data, learn these
patterns, and then classify new patterns & make forecasts
• A network with only an input and an output layer is called a
single-layered network, whereas a multilayer neural network is a
generalized one with one or more hidden layers.
– A network containing two hidden layers is called a three-layer neural
network, and so on.
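[Figure: a single-layered NN with inputs x1, x2, x3 and weights w1, w2, w3, next to a multilayer NN with input nodes, hidden nodes and output nodes]
• Output of the single-layered NN:
$$o = \sigma\left(\sum_{i=1}^{n} w_i x_i\right), \qquad \sigma(y) = \frac{1}{1 + e^{-y}}$$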
Hidden layer: Neuron with Activation
• The neuron is the basic information processing unit of a NN. It
consists of:
1. A set of links, describing the neuron inputs, with weights W1,
W2, …, Wm
2. An adder function (linear combiner) for computing the weighted
sum of the inputs (real numbers):
$$y = \sum_{j=1}^{m} w_j x_j$$
3. An activation function for limiting the output behavior of the
neuron.
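The three parts of a neuron can be sketched in Python as follows. The sigmoid is assumed as the activation function, and the weights and inputs are made-up illustrative values.

```python
import math

def sigmoid(y: float) -> float:
    """Activation function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def neuron_output(weights, inputs):
    # Adder (linear combiner): weighted sum of the inputs
    y = sum(w * x for w, x in zip(weights, inputs))
    # Activation function limits the output
    return sigmoid(y)

# Illustrative values: the weighted sum is 0.5*1 + (-0.5)*1 + 1.0*0 = 0
print(neuron_output([0.5, -0.5, 1.0], [1.0, 1.0, 0.0]))
```

With a weighted sum of 0 the sigmoid returns 0.5, the midpoint of its range; larger sums push the output towards 1 and smaller sums towards 0.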
Topologies of neural network
• In a feed forward neural network connections between the units
do not form a directed cycle.
– In this network, the information moves in only one direction,
forward, from the input nodes, through the hidden nodes (if any)
and to the output nodes. There are no cycles or loops in the
network.
• In recurrent networks data circulates back and forth until the
activation of the units is stabilized
– Recurrent networks have a feedback loop where data can be fed back into
the input at some point before it is fed forward again for further
processing and final output.
• In a feed forward NN the data processing can extend over
multiple (layers of) units, but no feedback connections are
present, that is, connections extending from outputs of units to
inputs of units in the same layer or previous layers.
Training the neural network
• Back Propagation is the most commonly used method for
training multilayer feed forward NN.
– Back propagation learns by iteratively processing a set
of training data (samples).
– For each sample, weights are modified to minimize the
error between the desired output and the actual output.
• After propagating an input through the network, the error
is calculated and the error is propagated back through the
network while the weights are adjusted in order to make
the error smaller.
Propagation through Hidden Layer
- Bias j
x0 w0j
x1 w1j
 y
output y
xn wnj
Input weight weighted Activation

vector x vector w sum function
• The inputs to unit j are outputs from the previous layer. These are
multiplied by their corresponding weights in order to form a
weighted sum, which is added to the bias associated with unit j.
• A nonlinear activation function f is applied to the net input.
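Written out as a formula, the computation described above looks roughly like this. The symbol $\theta_j$ for the bias of unit j is an assumption made here, since the text names the bias but does not give it a symbol:

$$y = f\left(\sum_{i=0}^{n} w_{ij}\, x_i + \theta_j\right)$$

Here the sum is the weighted sum of the inputs to unit j, and f is the nonlinear activation function, for example the sigmoid $f(z) = 1/(1 + e^{-z})$.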
Steps in Back propagation Algorithm
• STEP ONE: initialize the weights and biases.
– The weights in the network are initialized to random numbers from
the interval [-1,1].
– Each unit has a BIAS associated with it
– The biases are similarly initialized to random numbers from the
interval [-1,1].
• STEP TWO: feed the training sample.
• STEP THREE: Propagate the inputs forward; we compute
the net input and output of each unit in the hidden and
output layers.
• STEP FOUR: back propagate the error.
• STEP FIVE: update weights and biases to reflect the
propagated errors.
• STEP SIX: terminating conditions.
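The six steps can be sketched for a tiny network with one hidden layer and a single output unit. This is a minimal illustration under several assumptions that the steps above do not fix: sigmoid activations, the common error terms out(1 - out)(target - out), a learning rate of 0.5, a fixed number of epochs as the terminating condition, and the logical AND function as made-up training data.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Propagate one input forward; returns hidden outputs and the network output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    out = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, out

def total_error(samples, net):
    return sum((t - forward(x, *net)[1]) ** 2 for x, t in samples)

def train(samples, n_hidden=2, rate=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # STEP ONE: initialize weights and biases to random numbers in [-1, 1]
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b_hidden = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b_out = rng.uniform(-1, 1)
    initial = total_error(samples, (w_hidden, b_hidden, w_out, b_out))
    for _ in range(epochs):              # STEP SIX: stop after a fixed number of epochs
        for x, target in samples:        # STEP TWO: feed a training sample
            # STEP THREE: propagate the inputs forward
            hidden, out = forward(x, w_hidden, b_hidden, w_out, b_out)
            # STEP FOUR: back propagate the error (output unit first, then hidden units)
            err_out = out * (1 - out) * (target - out)
            err_hidden = [h * (1 - h) * err_out * w for h, w in zip(hidden, w_out)]
            # STEP FIVE: update weights and biases to reflect the propagated errors
            for j, h in enumerate(hidden):
                w_out[j] += rate * err_out * h
            b_out += rate * err_out
            for j in range(n_hidden):
                for i, xi in enumerate(x):
                    w_hidden[j][i] += rate * err_hidden[j] * xi
                b_hidden[j] += rate * err_hidden[j]
    net = (w_hidden, b_hidden, w_out, b_out)
    return net, initial, total_error(samples, net)

# Illustrative training data: logical AND of two inputs
AND_SAMPLES = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
net, error_before, error_after = train(AND_SAMPLES)
print(f"squared error before: {error_before:.3f}, after: {error_after:.3f}")
```

Other terminating conditions, such as stopping when the error or the weight changes fall below a threshold, would fit STEP SIX equally well; the fixed epoch count is just the simplest choice.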
Pros and Cons of Neural Network
• Useful for learning complex data like handwriting, speech and
image recognition
Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle large number of features
Cons
- Slow training time
- Hard to interpret
- Hard to implement: trial and error for choosing number of nodes
• A neural network needs a long time for training.
• A neural network has a high tolerance to noisy and incomplete
data
• Conclusion: Use neural nets only if decision-trees fail.
Clustering
• Organizing a set of objects into groups in
such a way that similar objects tend to be in
the same group
• Some Clustering Applications
– Genetic research
– Image segmentation
– Market research
– Medical imaging
– Social network analysis.
K-means clustering
• An algorithm for clustering data based on
repeatedly assigning points to clusters and
updating those clusters' centers
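A minimal sketch of the assign-and-update loop described above. Several details are assumptions made for the example: the initial centers are passed in by the caller, distance is Euclidean, the loop runs for a fixed number of iterations, and the points are made-up values.

```python
import math

def k_means(points, centers, iterations=10):
    """Repeat: assign each point to its nearest center, then move each center to its cluster mean."""
    for _ in range(iterations):
        # Assignment step: put every point in the cluster of its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its cluster (unchanged if the cluster is empty)
        centers = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters

# Illustrative data: two visibly separate groups of 2-D points
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, clusters = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
print(clusters)
```

With these starting centers, the three points near (1, 1) end up in the first cluster and the three points near (8, 8) in the second.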
