Chapter 6 AI
Chapter 6 AI
Chapter 6 AI
Learning
1
Outline
o Learning from Examples/Observation
o Knowledge in Learning
o Learning Probabilistic Models
o Neural Networks
2
What is Learning?
o Learning is one of the keys to human intelligence. Do you
agree?
o what learning is? Learning is:
Memorizing something
Knowing facts through observation and exploration
Improving motor and/or cognitive skills through practice
o The idea behind learning is that percepts should not only be used for acting now,
but also for improving the agent’s ability to act in the future.
Learning is essential for unknown environments, i.e. when the agent lacks knowledge. It
enables to organize new knowledge into general, effective representations
Learning modifies the agent's decision making mechanisms to improve performance
3
What is Learning: Examples
o Learning is memorizing and remembering
Telephone number
Reading textbook
Understanding Language (memorizing grammar & practicing)
Recognize face, signature & fraudulent credit card transactions
o Learning is improving motor skills
Riding a bike
Exercising and practicing the idea
o Learning is understanding the strategy & rule of the game
Playing chess and football
o Learning is abstraction and exploration
Develop scientific theory
Undertaking research
o Learning is nothing but
Feature extraction
Classification
4
Learning
Machine Learning is the study of how to build computer systems that adapt
and improve with experience.
It is a subfield of Artificial Intelligence and intersects with cognitive science,
information theory, and probability theory, among others.
Classical AI deals mainly with deductive reasoning, learning represents
inductive reasoning.
Deductive reasoning arrives at answers to queries relating to a particular
situation starting from a set of general axioms, whereas inductive reasoning
arrives at general axioms from a set of particular instances.
Classical AI often suffers from the knowledge acquisition problem in real life
applications where obtaining and updating the knowledge base is costly and
prone to errors.
Machine learning serves to solve the knowledge acquisition bottleneck by
obtaining the result from data by induction.
5
Cont’d
o Machine learning is particularly attractive in several real life problem
because of the following reasons:
• Some tasks cannot be defined well except by example
• Working environment of machines may not be known at design time
• Explicit knowledge encoding may be difficult and not available
• Environments change over time
• Biological systems learn
o Recently, learning is widely used in a number of application areas including,
• Data mining and knowledge discovery
• Speech/image/video (pattern) recognition
• Adaptive control
• Autonomous vehicles/robots
• Decision support systems
• Bioinformatics
6
Cont’d
o Formally, a computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P, if its performance at tasks in T, as
measured by P, improves with experience E.
o Thus a learning system is characterized by:
• task T
• experience E, and
• performance measure P
Examples:
o Learning to play chess
T: Play chess
P: Percentage of games won in world tournament
E: Opportunity to play against self or other players
o Learning to drive a van
T: Drive on a public highway using vision sensors
P: Average distance traveled before an error (according to human observer)
E: Sequence of images and steering actions recorded during human driving.
7
Cont’d
o The block diagram of a generic learning system which can realize
the above definition is shown below:
8
Cont’d
o As can be seen from the above diagram the system consists of the
following components:
• Goal: Defined with respect to the task to be performed by the
system
• Model: A mathematical function which maps perception to actions
• Learning rules: Which update the model parameters with new
experience such that the performance measures with respect to the
goals is optimized
• Experience: A set of perception (and possibly the corresponding
actions)
9
Taxonomy of Learning Systems
o Several classification of learning systems are possible based on the above
components as follows:
o Goal/Task/Target Function:
o Prediction: To predict the desired output for a given input based on previous
input/output pairs. E.g., to predict the value of a stock given other inputs like
market index, interest rates etc.
o Categorization: To classify an object into one of several categories based on
features of the object. E.g., a robotic vision system to categorize a machine part
into one of the categories, spanner, hammer etc based on the parts’ dimension
and shape.
o Clustering: To organize a group of objects into homogeneous segments. E.g., a
satellite image analysis system which groups land areas into forest, urban and
water body, for better utilization of natural resources.
o Planning: To generate an optimal sequence of actions to solve a particular
problem. E.g., an Unmanned Air Vehicle which plans its path to obtain a set of
pictures and avoid enemy anti-aircraft guns.
10
Cont’d
o Models:
o Propositional and FOL rules
o Decision trees
o Linear separators
o Neural networks
o Graphical models
o Temporal models like hidden Markov models
o Learning Rules:
o Learning rules are often tied up with the model of learning used.
o Some common rules are gradient descent, least square error, expectation
maximization and margin maximization.
11
Cont’d
o Experiences:
o Learning algorithms use experiences in the form of perceptions or
perception action pairs to improve their performance. The nature of
experiences available varies with applications. Some common situations are
described below.
o Supervised learning: In supervised learning a teacher or oracle is available
which provides the desired action corresponding to a perception. A set of
perception action pair provides what is called a training set. Examples
include an automated vehicle where a set of vision inputs and the
corresponding steering actions are available to the learner.
The problem of supervised learning involves learning a function from
examples of its inputs and outputs.
12
Cont’d
o Unsupervised learning: In unsupervised learning no teacher is
available. The learner only discovers persistent patterns in the data
consisting of a collection of perceptions. This is also called
exploratory learning. Finding out malicious network attacks from
a sequence of anomalous data packets is an example of
unsupervised learning.
o Involves learning patterns in the input when no
specific output values are supplied. For example, a taxi agent might
gradually develop a concept of "good traffic days" and "bad traffic
days" without ever being given labelled examples
of each.
o A purely unsupervised learning agent cannot learn what to do, because
it has no information as to what constitutes a correct action or a
desirable state
13
Cont’d
o Active learning: Here not only a teacher is available, the learner has the
freedom to ask the teacher for suitable perception-action example pairs
which will help the learner to improve its performance. Consider a news
recommender system which tries to learn users preferences and categorize
news articles as interesting or uninteresting to the user. The system may
present a particular article (of which it is not sure) to the user and ask
whether it is interesting or not.
o Reinforcement learning: In reinforcement learning a teacher is available,
but the teacher instead of directly providing the desired action
corresponding to a perception, return reward and punishment to the
learner for its action corresponding to a perception. Examples include a
robot in a unknown terrain where its get a punishment when its hits an
obstacle and reward when it moves smoothly.
14
Learning From Observations
o Concept Learning
o Definition:
o The problem is to learn a function mapping examples into two classes: positive
and negative. We are given a database of examples already classified as positive
or negative. Concept learning: the process of inducing a function mapping input
examples into a Boolean output.
o Tom M.Mitchell:
Inferring a Boolean-valued function from training examples of its input and
output.
Can be formulated as a problem of searching through a predefined space of
potential hypotheses for the hypothesis that best fits the training example.
o Examples:
Classifying objects in astronomical images as stars or galaxies
Classifying animals as vertebrates or invertebrates
15
Cont’d
o Example: Classifying Mushrooms
o Class of Tasks: Predicting poisonous mushrooms
o Performance: Accuracy of classification
o Experience: Database describing mushrooms with their class
o Knowledge to learn: Function mapping mushrooms to {0,1}
where 0:not-poisonous and 1:poisonous
o In general, any concept learning task can be described by
The set of instances over which the target function is defined
The target function
The set of candidate hypothesis considered by learner
The set of available training examples
16
Cont’d
Representation of target knowledge: conjunction of (constraints)attribute values.
Learning mechanism: candidate-elimination
Representation of instances:
Features:
color {red, brown, gray}
size {small, large}
shape {round,elongated}
land {humid,dry}
air humidity {low,high}
texture {smooth, rough}
Input and Output Spaces:
X : The space of all possible examples (input space).
Y: The space of classes (output space).
An example in X is a feature vector x.
For instance: x= (red,small,elongated,humid,low,rough)
X is the cross product of all feature values.
Only a small subset of instances is available in the database of examples.
17
Cont’d
Training Examples:
D : The set of training examples.
D is a set of pairs { (x, c(x)) }, where c is the target concept(the concept or
function to be learned). X is a subset of the universe of discourse or the set of
all possible instances. In the current example, X is the set of all possible
poisonous of mushrooms, each represented by the attributes color, size,
shape, land, air humidity and texture.
c can be any Boolean valued function defined over the instances x; that
is ,c : x{0,1}. In the current example the target concept corresponding to
the value of mushrooms (i.e., c(x) =1 if mushroom is non poison and
c(x)=0 if mushroom is poison)
18
Cont’d
Learning the target concept, the learner is presented as a set of training examples,
each consisting of an instance x from X(possible poisonous of mushrooms).
Instances for which c(x)=1 are called positive examples, or members of the target
concept.
Instances for which c(x)=0 are called negative examples or non members of the
target concept.
We often write the order pair (x, c(x)) to describe the training example consisting
of the instance x and its target concept value c(x).
D is the set of available training examples
Given a set of training examples of the target concept c, the problem faced by
learner is to hypothesize , or estimate c.
H is the set of all possible hypotheses that the learner may consider regarding the
identity of the target concept
h in H represents a Boolean-valued function defined over X; that is h:X{0,1}.
The goal of the learner is to find a hypothesis h such that h(x)=c(x)for all x in X
19
Cont’d
Example of D:
• ((red,small,round,humid,low,smooth), poisonous)
• ((red,small,elongated,humid,low,smooth), poisonous)
• ((gray,large,elongated,humid,low,rough), not-poisonous)
• ((red,small,elongated,humid,high,rough), poisonous)
20
Cont’d
Hypothesis Representation
Any hypothesis h is a function from X to Y
h: X Y
We will explore the space of conjunctions.
Special symbols:
? Any value is acceptable
0 no value is acceptable
Consider the following hypotheses:
(?,?,?,?,?,?): all mushrooms are poisonous
(0,0,0,0,0,0): no mushroom is poisonous
21
Cont’d
Hypotheses Space:
The space of all hypotheses is represented by H
Let h be a hypothesis in H.
Let X be an example of a mushroom.
if h(X) = 1 then X is poisonous, otherwise X is not-poisonous
Our goal is to find the hypothesis, h*, that is very “close” to target
concept c.
A hypothesis is said to “cover” those examples it classifies as positive.
22
Decision Tree
Decision Trees
Decision trees are a class of learning models that are more robust to
noise as well as more powerful as compared to concept learning.
Consider the problem of classifying a star based on some astronomical
measurements.
It can naturally be represented by the following set of decisions on each
measurement arranged in a tree like fashion
23
Definition
A decision-tree learning algorithm approximates a target concept using a
tree representation, where each internal node corresponds to an attribute,
and every terminal node corresponds to a class.
There are two types of nodes:
Internal node.- Splits into different branches according to the different
values the corresponding attribute can take. Example: luminosity <=
T1 or luminosity > T1.
Terminal Node.- Decides the class assigned to the example.
24
Cont’d
When to Consider Decision Trees:
Instances describable by attribute-value pairs.
e.g. the attribute temperature has values hot, mild and cold
Target function is discrete valued
e.g. assigning Boolean classification(yes or no) toe each example
Disjunctive hypothesis may be required
Decision trees represent disjunctive expression
Possibly noisy training data
Decision tree can be used when the some training examples have
unknown values
25
Cont’d
Decision tree learning has been applied to problems:
Learning to classify medical patients by their disease
Equipment malfunctions by their cause
Loan applicants by their likelihood of defaulting on payments
26
Classifying Examples Using Decision Tree
To classify an example X we start at the root of the tree, and check the
value of that attribute on X.
We follow the branch corresponding to that value and jump to the next
node.
We continue until we reach a terminal node and take that class as our
best prediction.
27
Cont’d…
Decision trees adopt a DNF (Disjunctive Normal Form) representation.
For a fixed class, every branch from the root of the tree to a terminal
node with that class is a conjunction of attribute values; different
branches ending in that class form a disjunction.
Learned trees can also be re-represented as sets of if-then rules to
improve human readability.
28
Decision Tree Construction
There are different ways to construct trees from data.
We will concentrate on the top-down, greedy search approach:
Basic idea:
1. Choose the best attribute a* to place at the root of the tree.
29
Cont’d
30
Cont’d
31
Cont’d
32
Cont’d
Steps:
• Create a root for the tree
• If all examples are of the same class or the number of examples is
below a threshold return that class
• If no attributes available return majority class
• Let a* be the best attribute
• For each possible value v of a*
• Add a branch below a* labeled “a = v”
• Let Sv be the subsets of example where attribute a*=v
• Recursively apply the algorithm to Sv
33
Rule Induction and Decision Tree
o Splitting Functions
What attribute is the best to split the data? Let us remember some definitions from
information theory.
What is the quantitative measure of the worth of an attribute?
Information gain measure how well a given attribute separate the training examples
according to their target classification
In order to define information gain precisely, we start by defining a measure of information
theory called entropy
A measure of uncertainty or entropy that is associated to a random variable X is defined as
H(X) = - Σ pi log pi
where the logarithm is in base 2.
This is the “average amount of information or entropy of a finite complete probability
scheme”.
We will use a entropy based splitting function.
Decision tree use this information gain measure to select among the candidate attributes at
each step while growing the tree.
Consider
34 the previous example:
Cont’d
o Size divides the sample in two. humidity divides the sample in three.
o S1 = { 6P, 0NP} S1 = { 2P, 2NP}
S2 = { 5P, 0NP}
o S2 = { 3P, 5NP}
S3 = { 2P, 3NP}
o H(S1) = 0 H(S1) = 1
o H(S2) = -(3/8)log2(3/8) H(S2) = 0
o -(5/8)log2(5/8) H(S3) = -(2/5)log2(2/5)
-(3/5)log2(3/5)
35
Cont’d
Let us define information gain as follows:
Information gain IG over attribute A: IG (A)
IG(A) = H(S) - Σv (Sv/S) H (Sv)
H(S) is the entropy of all examples. H(Sv) is the entropy of one subsample after
partitioning S based on all possible values of attribute A.
Consider the previous example: We have,
H(S1) = 0
H(S2) = -(3/8)log2(3/8)
-(5/8)log2(5/8)
H(S) = -(9/14)log2(9/14)
-(5/14)log2(5/14)
|S1|/|S| = 6/14
|S2|/|S| = 8/14
The principle for decision tree construction may be stated as follows:
Order the splits (attribute and value of the attribute) in decreasing order of information gain.
36
Cont’d
Exercise: Decision Tree for “buy computer or not”. Use the training Dataset
given below to construct decision tree
37
Cont’d
o Output: A Decision Tree for “buys_computer”
age?
<=30 overcast
31..40 >40
no yes no yes
38
NEURAL NETWORKS
39
Neural Networks
Neural Networks
Artificial neural networks are among the most powerful learning models.
They have the versatility to approximate a wide range of complex functions
representing multi-dimensional input-output maps.
Neural networks also have inherent adaptability, and can perform robustly
even in noisy environments.
An Artificial Neural Network (ANN) is an information processing
paradigm that is inspired by the way biological nervous systems, such as
the brain, process information.
The key element of this paradigm is the novel structure of the information
processing system.
It is composed of a large number of highly interconnected simple
processing elements (neurons) working in unison to solve specific
problems.
40
Cont’d
ANNs, like people, learn by example. An ANN is configured for a specific
application, such as pattern recognition or data classification, through a
learning process.
Learning in biological systems involves adjustments to the synaptic
connections that exist between the neurons. This is true of ANNs as well.
ANNs can process information at a great speed owing to their highly
massive parallelism.
Neural networks, with their remarkable ability to derive meaning from
complicated or imprecise data, can be used to extract patterns and detect
trends that are too complex to be noticed by either humans or other
computer techniques.
41
Cont’d
A trained neural network can be thought of as an "expert" in the category of
information it has been given to analyse.
This expert can then be used to provide projections given new situations of
interest and answer "what if" questions. Other advantages include:
1. Adaptive learning: An ability to learn how to do tasks based on the data given
for training or initial experience.
2. Self-Organisation: An ANN can create its own organisation or representation
of the information it receives during learning time.
3. Real Time Operation: ANN computations may be carried out in parallel, and
special hardware devices are being designed and manufactured which take
advantage of this capability.
4. Fault Tolerance via Redundant Information Coding: Partial destruction of a
network leads to the corresponding degradation of performance. However, some
network capabilities may be retained even with major network damage.
42
Biological Neural Networks
Much is still unknown about how the brain trains itself to process
information, so theories abound. In the human brain, a typical neuron collects
signals from others through a host of fine structures called dendrites.
The neuron sends out spikes of electrical activity through a long, thin stand
known as an axon, which splits into thousands of branches.
At the end of each branch, a structure called a synapse converts the activity
from the axon into electrical effects that inhibit or excite activity from the
axon into electrical effects that inhibit or excite activity in the connected
neurons.
When a neuron receives excitatory input that is sufficiently large compared
with its inhibitory input, it sends a spike of electrical activity down its axon.
Learning occurs by changing the effectiveness of the synapses so that the
influence of one neuron on another changes.
43
Cont’d
44
Artificial Neural Networks
Artificial neural networks are represented by a set of nodes or units,
often arranged in layers, and a set of weighted directed links connecting
them. The nodes are equivalent to neurons, while the links denote
synapses. The nodes are the information processing units and the links
acts as communicating media.
There are a wide variety of networks depending on the nature of
information processing carried out at individual nodes, the topology of
the links, and the algorithm for adaptation of link weights. Some of the
popular among them include:
Perceptron: This consists of a single neuron with multiple inputs and a
single output. It has restricted information processing capability. The
information processing is done through a transfer function which is
either linear or non-linear.
45
Cont’d
Multi-layered Perceptron (MLP): It has a layered architecture consisting
of input, hidden and output layers. Each layer consists of a number of
perceptrons. The output of each layer is transmitted to the input of nodes
in other layers through weighted links. Usually, this transmission is done
only to nodes of the next layer, leading to what are known as feed forward
networks. MLPs were proposed to extend the limited information
processing capabilities of simple percptrons, and are highly versatile in
terms of their approximation ability. Training or weight adaptation is done
in MLPs using supervised backpropagation learning.
Recurrent Neural Networks: RNN topology involves backward links
from output to the input and hidden layers. The notion of time is encoded
in the RNN information processing scheme. They are thus used in
applications like speech processing where inputs are time sequences data.
46
Cont’d
Self-Organizing Maps: SOMs or Kohonen networks have a grid
topology, with unequal grid weights. The topology of the grid provides a
low dimensional visualization of the data distribution. These are thus
used in applications which typically involve organization and human
browsing of a large volume of data. Learning is performed using a
winner take all strategy in a unsupervised mode.
47
Appropriate problems for NN learning
o The BACKPROPAGATION algorithm is the most
commonly used ANN learning technique with the
following characteristics:
Instances are represented by many attribute-value pairs
The target function output may be discrete-valued, real-valued,
or a vector of several real- or discrete-valued attributes
The training examples may contain errors.
Long training times are acceptable
Fast evaluation of the learned target function may be required
The ability of humans to understand the learned target function
is not important
48
Cont’d
o Perceptron
Definition: It’s a step function based on a linear combination of real-valued
inputs. If the combination is above a threshold it outputs a 1, otherwise it
outputs a –1.
49
Cont’d
A perceptron draws a hyperplane as the decision boundary over the (n-
dimensional) input space.
• A perceptron can learn only examples that are called “linearly separable”.
• These are examples that can be perfectly separated by a hyperplane.
50
Cont’d
Perceptrons can learn many boolean functions: AND, OR, NAND, NOR, but
not XOR
However, every boolean function can be represented with a perceptron
network that has two levels of depth or more.
The weights of a perceptron implementing the AND function is shown below.
51
Cont’d
52
Perceptron Learning
The learning problem is to determine a weight vector w that causes the
perceptron to produce the correct output for each training example
The hypothesis space of a perceptron is the space of all weight vectors.
The perceptron learning algorithm can be stated as below.
1. Assign random values to the weight vector
2. Apply the weight update rule to every training example
3. Are all training examples correctly classified?
a. Yes. Quit
b. No. Go back to Step 2.
There are two popular weight update rules.
i) The perceptron rule, and
ii) Delta rule
53
The Perceptron Rule
For a new training example X = (x1, x2, …, xn), update each weight
according to this rule:
wi = wi + Δwi
Where Δwi = η (t-o) xi
t: target output
o: output generated by the perceptron
η: constant called the learning rate (e.g., 0.1)
Comments about the perceptron training rule:
If the example is correctly classified the term (t-o) equals zero, and no update on
the weight is necessary.
• If the perceptron outputs –1 and the real answer is 1, the weight is increased.
54
Cont’d
If the perceptron outputs a 1 and the real answer is -1,
the weight is decreased.
Provided the examples are linearly separable and a
small value for η is used, the rule is proved to classify
all training examples correctly (i.e, is consistent with
the training data).
convergence guaranteed provided linearly separable
training examples and sufficiently small η
55
The Delta Rule
perceptron rule fails if data is not linearly separable
delta rule converges toward a best-fit approximation
uses gradient descent to search the hypothesis space
perceptron cannot be used, because it is not differentiable
hence, a unthresholded linear unit is appropriate
It is done by minimizing the error E = ½ Σi (ti – oi) 2
where the sum goes over all training examples. Here oi is the inner
product WX and not sgn(WX) as with the perceptron rule.
The idea is to find a minimum in the space of weights and the error
function E.
56
Cont’d
The delta rule is as follows:
For a new training example X = (x1, x2, …, xn),
update each weight according to this rule:
wi = wi + Δwi
Where Δwi = -η E’(W)/wi
η: learning rate (e.g., 0.1) which determines the step
size in the gradient descent search.
57
Cont’d
The negative sign indicate we want to move the weight vector in
the direction that decrease E
It is easy to see that
E’(W)/ wi = Σi (ti – oi) (-xi)
So that gives us the following equation:
wi = η Σi (ti – oi) xi
What are the key practical difficulties in applying gradient
descent ? How do we alleviate these difficulties?(Reading
Assignment )
58
Cont’d
Gradient descent algorithm for training a linear unit:
Gradient_Descent(training_examples, η)
Initialize each wi, to some small random value
Until the termination condition is met, Do
Initialize each Δwi to zero
For each (x,t ) in training_axamples, Do
Input the instance x to the unit and compute the output o
For each linear unit weight wi, Do
Δwi = Δwi +η (t – o) xi
For each linear unit weight wi, Do
Wi=wi+ Δwi
59
Multi-Layer Perceptron's
In contrast to perceptron's, multilayer networks can learn not only
multiple decision boundaries, but the boundaries may be nonlinear.
The typical architecture of a multi-layer perceptron (MLP) is shown
below.
• To make nonlinear partitions on the space we need to define each unit as a nonlinear
function (unlike the perceptron). One solution is to use the sigmoid unit.
• Another reason for using sigmoids are that they are continuous unlike linear thresholds
and are thus differentiable at all points.
60
Cont’d
Each layer is made up of units.
The inputs to the network correspond to the attributes measured for each
training tuple
The inputs are fed simultaneously into the units making up the input layer.
These inputs pass through the input layer and are then weighted and fed
simultaneously to a second layer of “neuronlike” units, known as a hidden
layer.
The outputs of the hidden layer units can be input to another hidden layer, and
so on.
The weighted outputs of the last hidden layer are input to units making up the
output layer, which emits the network’s prediction for given tuples
It is a feed-forward network since none of the weights cycles back to an input
unit or to a previous layer’s output unit.
It is fully connected in that each unit provides input to each unit in the next
forward layer.
61
Applications of Neural Networks
Neural networks have broad applicability to real world business problems. They have
already been successfully applied in many industries.
Since neural networks are best at identifying patterns or trends in data, they are well
suited for prediction or forecasting needs including:
• sales forecasting
• industrial process control
• customer research
• data validation
• risk management
• target marketing
Because of their adaptive and non-linear nature they have also been used in a number of
control system application domains like, • process control in chemical plants • unmanned
vehicles • robotics • consumer electronics
Neural networks are also used a number of other applications which are too hard to model
using classical techniques. These include, computer vision, path planning and user
modeling.
62
Thank you!!!
63