ML Unit-I
Course Objectives
• To explain machine learning techniques such as decision tree learning
and Bayesian learning.
• To understand computational learning theory.
• To study the pattern comparison techniques.
Course Outcomes
• Understand the concepts of computational intelligence, in particular
machine learning
• Acquire the skill to apply machine learning techniques to address
real-time problems in different areas
• Understand neural networks and their usage in machine learning
applications.
Unit-I
• Introduction
• Concept Learning and General to Specific Ordering
• Decision Tree Learning
Introduction
• Well-posed Learning Problems
• Designing a Learning System
• Perspectives and Issues in Machine Learning
Introduction
• A successful understanding of how to make computers learn opens up
new uses of computers and new levels of competence and
customization.
• Algorithms have been invented that are effective for certain types of
learning tasks, and a theoretical understanding of learning has begun
to emerge.
• As our understanding of computers continues to mature, it seems very
clear that machine learning will revolutionize the way we use
computers.
Introduction-Some Successful Applications
of Machine Learning
Introduction-Some disciplines and examples
of their influence on machine learning
Well-Posed Learning Problems
Definition of Machine Learning:
• A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with
experience E.
Well-Posed Learning Problems
In general, to have a well-defined learning problem, the following three
features must be identified:
• The class of tasks
• The measure of performance to be improved
• The source of experience
Well-Posed Learning Problems-Examples
A checkers learning problem:
• Task T: playing checkers
• Performance measure P: percent of games won against opponents
• Training experience E: playing practice games against itself
A handwriting recognition learning problem:
• Task T: recognizing and classifying handwritten words within images
• Performance measure P: percent of words correctly classified
• Training experience E: a database of handwritten words with given
classifications
Well-Posed Learning Problems-Examples
A robot driving learning problem:
• Task T: driving on public four-lane highways using vision sensors
• Performance measure P: average distance travelled before an error
(as judged by human overseer)
• Training Experience E: a sequence of images and steering commands
recorded while observing a human driver
Designing A Learning System
Here we design a program to learn to play checkers so that we
understand some of the basic design issues and approaches to machine
learning. We take the following steps:
• Choosing the Training Experience
• Choosing the Target function
• Choosing a representation for the target function
• Choosing a function approximation algorithm
• Estimating Training values
• Adjusting the weights
• The Final Design
Choosing the Training Experience
The type of training experience available can have a significant impact
on success or failure of the learner. There are three important
attributes of training experience. They are:
• Whether the training experience provides direct or indirect feedback
• The degree to which the learner controls the sequence of training
examples
• Representation of the distribution of examples
Direct or Indirect Feedback
• Direct Feedback: Learning from direct training examples consisting of
individual checkers board states and the correct moves for each.
• Indirect Feedback: Learning from indirect information consisting of
the move sequences and final outcomes of various games played.
• Indirect feedback raises the problem of credit assignment: determining
the degree to which each move in the sequence deserves credit or blame
for the final outcome.
Sequencing of Training Examples (settings
for learning)
• The learner might rely on the teacher to select informative board
states and to provide correct moves for each of them.
• The learner might propose board states that are confusing and ask
the teacher for the appropriate move.
• The learner might take complete control over both the board states
and the correct moves while playing against itself.
Representation of the distribution of
examples
• Learning is most reliable when the training examples follow a
distribution similar to that of the future test examples.
Choosing the Training Experience
• In our design we decide that our system will train by playing games
against itself. Hence no external trainer needs to be present, and the
system is allowed as much training data as time permits. Now we
have a fully specified learning task:
A checkers learning problem:
• Task T: playing checkers
• Performance measure P: percent of games won in the world
tournament
• Training experience E: playing practice games against itself
Choosing the Target Function
• We now need to determine what type of knowledge will be learned
and how it will be used by the performance program.
• We begin with a checkers-playing program that can generate legal
moves from any board state.
• The program needs to learn how to choose the best move from
among the legal moves.
• Many optimization problems fall into this class.
Choosing the Target Function
• To solve this problem we need a function that chooses the best move
for any given board state. The target function is defined as follows:
ChooseMove : B → M
• This function accepts as input any board from the set of legal board
states B and produces as output an appropriate move M from the set
of legal moves.
• Now the problem of improving performance P at task T is reduced to
the problem of learning some target function ChooseMove.
• Given the kind of indirect training experience the target function
ChooseMove is very difficult to learn.
Choosing the Target Function
• An alternative target function that will be easy to learn is an
evaluation function that assigns a numerical score to any given board
state. Let us call the function V and it is defined as follows:
V : B → R
• This function maps a legal board state to some real value.
Choosing the Target Function
• Let us now define the target value V(b) for an arbitrary board state b
in B.
• If b is a final board state that is won, then V(b) = 100
• If b is a final board state that is lost, then V(b) = −100
• If b is a final board state that is drawn, then V(b) = 0
• If b is not a final state in the game, then V(b) = V(b′), where b′ is the best final
board state that can be achieved starting from b and playing optimally until
the end of the game.
• This is a nonoperational definition, because the last case requires searching
ahead for the optimal line of play until the end of the game; the definition is
therefore not efficiently computable.
Choosing the Target Function
• The goal here is to discover an operational description of V; that is a
description that is useful in evaluating states and selecting moves for
the checkers game.
• Now the learning task is reduced to the problem of discovering an
operational description of the ideal target function V.
• We actually expect learning algorithms to acquire only some approximation
to the target function; for this reason the process of learning the target
function is often called function approximation.
• We will use the symbol V̂ to refer to the function that is actually
learned, to distinguish it from the ideal target function V.
Choosing a Representation for the Target
Function
• The program can be allowed to represent V̂ using:
• A large table with a distinct entry specifying the value for each distinct board state.
• We could allow it to represent using a collection of rules that match against
features of the board state.
• A quadratic polynomial function of predefined board features.
• An artificial neural network.
• The choice of representation of the function approximation involves a
tradeoff:
• A very expressive representation allows representing as close an approximation as
possible to the ideal target function.
• The more expressive the representation, however, the more training data the
program will require in order to choose among the alternative hypotheses it can
represent.
Choosing a Representation for the Target
Function
• Here in this example we choose a simple representation in which, for any
given board state b, V̂(b) will be calculated as a linear combination
of the following board features:
• X1: the number of black pieces on the board
• X2: the number of red pieces on the board
• X3: the number of black kings on the board
• X4: the number of red kings on the board
• X5: the number of black pieces threatened by red
• X6: the number of red pieces threatened by black.
• Our learning program will represent V̂(b) as a linear function of the form:
V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
where w0 through w6 are numerical coefficients (weights) to be learned.
Choosing a Representation for the Target
Function
• The partial design of the checkers learning program is as follows:
• Task T: playing checkers
• Performance measure P: percent of games won in the world
tournament
• Training Experience E: games played against itself
• Target function: V : B → R
• Target function representation:
V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
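To make this representation concrete, here is a minimal Python sketch (not part of the original slides) of evaluating V̂(b) as a linear combination of the six board features; the feature and weight values are illustrative assumptions, not learned values.

```python
def v_hat(features, weights):
    # features = (x1, ..., x6); weights = (w0, ..., w6) with w0 as the bias
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

# Illustrative board: 3 black pieces, no red pieces, 1 black king, no threats
features = [3, 0, 1, 0, 0, 0]
weights = [0.0, 1.0, -1.0, 3.0, -3.0, -0.5, 0.5]   # made-up weight values
print(v_hat(features, weights))                     # 1*3 + 3*1 = 6.0
```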
Choosing a function approximation
algorithm
• A set of training examples, each describing a given board state b together
with a training value Vtrain(b), is needed to learn the target function V̂.
• Each training example is represented as an ordered pair ⟨b, Vtrain(b)⟩
• An example training pair, describing a board state in which black has won
the game (x2 = 0 indicates that red has no remaining pieces), is:
⟨⟨x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0⟩, +100⟩
• There are two steps to be taken as part of the function approximation
algorithm
• Estimating training values
• Adjusting the weights
Estimating Training Values
• In the checkers example, the only training information available to the
learner is whether the game was eventually won or lost.
• Here, however, we require training examples that assign scores to board
states.
• The most difficult part is assigning scores to the intermediate board states
that occur before the game’s end.
• The rule for estimating training values sets the training value of b to the
learned value of its successor:
Vtrain(b) ← V̂(Successor(b))
where Successor(b) is the next board state following b at which it is again
the program’s turn to move.
Adjusting the Weights
• The remaining task is to specify the algorithm for choosing the
weights wi to best fit to the set of training examples <b,Vtrain(b)> just
estimated.
• One very common approach is to estimate the weights so that the
squared error E between the training values and the values predicted by
the current hypothesis is minimized:
E ≡ Σ⟨b,Vtrain(b)⟩ (Vtrain(b) − V̂(b))²
• The LMS (least mean squares) training rule adjusts each weight a small
amount in the direction that reduces this error: for every training example,
wi ← wi + η (Vtrain(b) − V̂(b)) xi, where η is a small constant (e.g., 0.1).
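A minimal sketch of one LMS weight update under the same linear representation; the learning rate and all numeric values are illustrative assumptions.

```python
ETA = 0.1  # learning rate (illustrative)

def v_hat(features, weights):
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

def lms_update(weights, features, v_train):
    # wi <- wi + eta * (v_train - v_hat(b)) * xi, with x0 = 1 for the bias w0
    error = v_train - v_hat(features, weights)
    return ([weights[0] + ETA * error] +
            [w + ETA * error * x for w, x in zip(weights[1:], features)])

print(lms_update([0.0] * 7, [3, 0, 1, 0, 0, 0], 100.0))
# weights move toward predicting +100 for this (winning) board
```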
The Generalizer
• This module takes the training examples as input and produces as output
the hypothesis V̂, its estimate of the target function.
• It generalizes from the given specific examples.
• In our checkers problem this module uses the LMS algorithm and outputs
the hypothesis V̂ described by the learned weights w0, w1, w2, w3, w4, w5, w6.
Experiment Generator
• This module takes as input the current hypothesis and produces a
new problem as output for the performance system to explore.
• In the current checkers problem the current hypothesis is V̂ and the
new problem is the checkers game with the board in its initial state.
• Our experiment generator is a very simple module which outputs the
same board with the initial state each time.
• A better experiment generator module would create board positions
to explore particular regions of the state space.
Summary of choices in designing the
checkers learning problem
Perspectives and Issues in Machine Learning
• In machine learning there is a very large space of possible hypotheses
to be searched.
• The job of the machine learning algorithm is to search this space for
the hypothesis that best fits the observed training data and any prior
knowledge held by the learner.
• In the checkers game example, the LMS algorithm adjusts the weights
each time the hypothesized evaluation function predicts a value that
differs from the training value.
Perspectives and Issues in Machine Learning
Issues in Machine Learning
• What algorithms exist for learning general target functions from specific
training examples?
• How much training data is sufficient?
• When and how does the prior knowledge held by the learner guide the
process of generalizing from examples?
• What is the best strategy for choosing a useful next training experience?
• How does the choice of this strategy alter the complexity of the learning
problem?
• What specific functions should the system attempt to learn? Can this
process be automated?
• How can the learner automatically alter its representation to improve its
ability to represent and learn the target function?
Concept Learning and the General to
Specific Ordering
• Introduction
• A Concept Learning Task
• Concept Learning as Search
• Find-S: Finding a Maximally Specific Hypothesis
• Version Spaces and the Candidate Elimination Algorithm
• Remarks on Version Spaces and Candidate Elimination
• Inductive Bias
Introduction
• Concept learning can be thought of as a problem of searching through
a predefined space of potential hypotheses for the hypothesis that
best fits the training examples.
• The activity of learning mostly involves acquiring general concepts
from specific training examples; this is called concept learning.
• Each concept can be thought of as a Boolean-valued function defined
over a large set. (Eg: a function defined over all animals, whose value
is defined as true for birds and false for other animals).
• Concept learning is all about inferring a Boolean valued function from
training examples of its input and output.
A Concept Learning Task
• Description of Concept Learning Task
• Notation
• The Inductive Learning Hypothesis
Description of A Concept Learning Task
• Any concept learning task can be described by:
• The set of instances over which the target function is defined (X)
• The target function (c)
• The set of candidate hypotheses considered by the learner (H)
• The set of available training examples (D)
Notation
The Inductive Learning Hypothesis
• The learning task in the previous example is to determine a
hypothesis h identical to the target concept c over the entire set of
instances X.
• However, the only information available about c is its value over the
training examples.
• Inductive learning algorithms can therefore at best guarantee that the
output hypothesis fits the target concept over the training data.
• The inductive learning hypothesis: any hypothesis found to
approximate the target function well over a sufficiently large set of
training examples will also approximate the target function well
over other unobserved examples.
Concept Learning as Search
• Concept Learning as Search (Description)
• General to Specific Ordering of Hypothesis
Description of Concept Learning as Search
• Concept learning is a task of searching through a large space of hypotheses
implicitly defined by the hypothesis representation.
• The goal is to find the hypothesis that best fits the training examples.
• It is the selection of the hypothesis representation that defines the space
of all hypotheses the program can ever represent and therefore can ever
learn.
• Let us consider the set of instances X and hypothesis space H in the previous
example of EnjoySport.
• The instance space contains 3·2·2·2·2·2 = 96 distinct instances,
and the hypothesis space contains 5·4·4·4·4·4 = 5120
syntactically distinct hypotheses.
Description of Concept Learning as Search
• Since every hypothesis containing one or more Ø constraints classifies every
instance as negative, the number of semantically distinct hypotheses is only
1 + 4·3·3·3·3·3 = 973.
General to Specific Ordering of Hypothesis
• There is a naturally occurring structure that exists for any concept
learning problem: a general-to-specific ordering of hypotheses.
• This structure helps us design learning algorithms that
exhaustively search even infinite hypothesis spaces without explicitly
enumerating every hypothesis.
• Let us consider two hypotheses:
• h1 = ⟨Sunny, ?, ?, Strong, ?, ?⟩
• h2 = ⟨Sunny, ?, ?, ?, ?, ?⟩
• h2 imposes fewer constraints on the instances and therefore classifies
more instances as positive. We can say that h2 is more general than
h1.
General to Specific Ordering of Hypothesis
• The “more general than” relationship between hypotheses can be defined
as follows:
• For any instance x in X and hypothesis h in H, x satisfies h iff h(x) = 1
• Definition of “more general than or equal to”: hj ≥g hk iff, for every
instance x in X, (hk(x) = 1) → (hj(x) = 1)
• We say that hj is (strictly) more general than hk, written hj >g hk, iff
(hj ≥g hk) and not (hk ≥g hj).
• Equivalently, hk is more specific than hj.
General to Specific Ordering of Hypothesis
The relation ≥g defines a partial order over the hypothesis space H (the
relation is reflexive, antisymmetric and transitive).
The ≥g relation is important because it provides a useful structure
over the hypothesis space H for any concept learning problem.
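A minimal Python sketch of the ≥g test for the conjunctive representation used in EnjoySport, where each constraint is either '?' (any value) or a single required value (the all-Ø hypothesis is omitted for brevity); h1 and h2 are the hypotheses from the previous slide.

```python
def more_general_or_equal(hj, hk):
    # hj >=g hk iff every instance satisfying hk also satisfies hj
    return all(a == '?' or a == b for a, b in zip(hj, hk))

h1 = ('Sunny', '?', '?', 'Strong', '?', '?')
h2 = ('Sunny', '?', '?', '?', '?', '?')
print(more_general_or_equal(h2, h1))   # True: h2 is more general than h1
print(more_general_or_equal(h1, h2))   # False
```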
Find-S: Finding A Maximally Specific
Hypothesis
• One way to use the “more general than” partial ordering to
organize the search for an acceptable hypothesis is to begin with the most
specific possible hypothesis in H, then generalize it each time it fails
to cover an observed positive training example.
• Find-S is an algorithm which uses this partial ordering very effectively.
Find-S: Finding A Maximally Specific
Hypothesis
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x: for each attribute constraint ai in h,
if the constraint is satisfied by x, do nothing; otherwise replace ai by the
next more general constraint that is satisfied by x
3. Output hypothesis h
Find-S: Finding A Maximally Specific
Hypothesis
• The Find-S algorithm is an algorithm which illustrates a way in which
the more general than partial ordering can be used to find an
acceptable hypothesis.
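A minimal Python sketch of Find-S for this conjunctive representation; 'Ø' stands for the empty (most specific) constraint, and the two positive training examples below are from the EnjoySport data.

```python
def find_s(examples, n_attrs):
    h = ['Ø'] * n_attrs                    # the most specific hypothesis in H
    for x, positive in examples:
        if not positive:
            continue                       # Find-S ignores negative examples
        for i in range(n_attrs):
            if h[i] == 'Ø':
                h[i] = x[i]                # first positive: adopt its values
            elif h[i] != x[i]:
                h[i] = '?'                 # generalize a violated constraint
    return h

examples = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), True),
]
print(find_s(examples, 6))  # ['Sunny', 'Warm', '?', 'Strong', 'Warm', 'Same']
```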
Find-S: Finding A Maximally Specific
Hypothesis
• There are a few questions left unanswered by the Find-S algorithm
such as:
• Has the learner converged to the correct target concept?
• Why prefer the most specific hypothesis?
• Are the training examples consistent?
• What if there are several maximally specific consistent hypotheses?
Version Spaces and the
CANDIDATE-ELIMINATION Algorithm
• Introduction
• Representation
• The List-Then-Eliminate Algorithm
• A more compact representation for version spaces
• CANDIDATE-ELIMINATION Learning Algorithm
• An Illustrative Example
Version Spaces and the
CANDIDATE-ELIMINATION Algorithm
(Introduction)
• A new approach to concept learning is the CANDIDATE-ELIMINATION
algorithm, which addresses several limitations of the Find-S algorithm.
• The idea in CANDIDATE-ELIMINATION algorithm is to output a
description of the set of all hypotheses consistent with the training
examples.
• CANDIDATE-ELIMINATION provides a useful conceptual framework for
introducing several fundamental issues in machine learning.
Representation
• The CANDIDATE-ELIMINATION algorithm finds all describable
hypotheses that are consistent with the observed training examples.
• Let us first understand what it means for a hypothesis to be consistent
with the observed training examples.
Representation
• Definition: A hypothesis h is consistent with a set of training examples D
iff h(x) = c(x) for each example ⟨x, c(x)⟩ in D.
• Definition: The version space, denoted VSH,D, with respect to hypothesis
space H and training examples D, is the subset of hypotheses from H
consistent with the training examples in D.
The LIST-THEN-ELIMINATE Algorithm
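In outline, the algorithm first initializes the version space to contain every hypothesis in H, then eliminates any hypothesis found inconsistent with a training example, and finally outputs the surviving list. A minimal sketch, assuming H is small enough to enumerate and modeling each hypothesis as a Python predicate (the tiny H and data below are hypothetical):

```python
def list_then_eliminate(H, examples):
    version_space = list(H)                    # 1. start with every h in H
    for x, c_x in examples:                    # 2. eliminate inconsistent h
        version_space = [h for h in version_space if h(x) == c_x]
    return version_space                       # 3. output the survivors

H = [lambda x: x > 0, lambda x: x > 2, lambda x: True]
examples = [(3, True), (1, False)]
print(len(list_then_eliminate(H, examples)))   # 1: only "x > 2" survives
```

The obvious drawback, which motivates the boundary-set representation below, is that this requires exhaustively enumerating every hypothesis in H.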
A more Compact Representation of Version
Spaces
• The CANDIDATE-ELIMINATION algorithm employs a much more compact
representation of the version space.
• The version space is represented by its most general and least general
members.
• These members form the general and specific boundary sets that delimit
the version space within the partially ordered hypothesis space.
A more Compact Representation of Version
Spaces
• It is intuitively plausible that we can represent the version space in
terms of its most specific and most general members.
• Here are the definitions for the boundary sets G and S:
• Definition: The general boundary G, with respect to hypothesis space H
and training data D, is the set of maximally general members of H
consistent with D.
• Definition: The specific boundary S is the set of minimally general
(i.e., maximally specific) members of H consistent with D.
CANDIDATE-ELIMINATION Learning
Algorithm
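In outline, the algorithm maintains the S and G boundary sets and refines them with each training example: a positive example removes inconsistent members of G and minimally generalizes S, while a negative example removes inconsistent members of S and minimally specializes G. The following simplified Python sketch illustrates this for the conjunctive EnjoySport-style representation; it keeps a single specific hypothesis S and omits some of the full algorithm's boundary bookkeeping (e.g., discarding G members less general than other G members), so treat it as an illustration rather than the complete algorithm.

```python
def matches(h, x):
    # '?' matches any value; otherwise attribute values must agree
    return all(hc in ('?', xc) for hc, xc in zip(h, x))

def candidate_elimination(examples, n_attrs):
    S = ['Ø'] * n_attrs          # specific boundary (a single hypothesis here)
    G = [['?'] * n_attrs]        # general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if matches(g, x)]   # drop inconsistent g's
            for i in range(n_attrs):              # minimally generalize S
                if S[i] == 'Ø':
                    S[i] = x[i]
                elif S[i] != x[i]:
                    S[i] = '?'
        else:
            new_G = []
            for g in G:
                if not matches(g, x):             # g already rejects x
                    new_G.append(g)
                    continue
                for i in range(n_attrs):          # minimal specializations
                    if g[i] == '?' and S[i] not in ('?', 'Ø', x[i]):
                        g2 = list(g)
                        g2[i] = S[i]              # constrain attribute i
                        new_G.append(g2)
            G = new_G
    return S, G
```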
An Illustrative Example
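As a worked example, running the sketch above on the standard four EnjoySport training examples reproduces the final boundary sets:

```python
examples = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), True),
    (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), False),
    (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change'), True),
]
S, G = candidate_elimination(examples, 6)
print(S)  # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
print(G)  # [['Sunny', '?', '?', '?', '?', '?'], ['?', 'Warm', '?', '?', '?', '?']]
```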
Remarks on Version Spaces and
CANDIDATE-ELIMINATION
• Will the CANDIDATE-ELIMINATION Algorithm converge to the correct
hypothesis?
• What training example should the learner request next?
• How can partially learned concepts be used?
Will the CANDIDATE-ELIMINATION Algorithm converge to the correct hypothesis?
• The version space learned by the CANDIDATE-ELIMINATION algorithm will
converge to the hypothesis that correctly represents the target concept,
provided:
• There are no errors in the training examples
• There is some hypothesis in H that correctly describes the target concept.
• The target concept is exactly learned if the S and G boundaries converge to
a single, identical hypothesis.
• In case there are errors in the training data, the S and G boundaries
will eventually converge to an empty version space.
• Similarly, even when the training examples are correct, if the hypothesis
representation cannot describe the target concept, the algorithm
will not converge to the target concept.
What Training Example should the learner request next?
• Suppose that the learner is allowed to conduct experiments in which it
chooses the next instance, then obtains the correct classification for this
instance from an external oracle (e.g., nature or a teacher).
• The term query is used to refer to such instances constructed by the
learner, which are then classified by an external oracle.
• Let us consider a query the learner might construct in the EnjoySport
example:
• (Sunny, Warm, Normal, Light, Warm, Same)
• If the trainer classifies this instance as positive, the S boundary of the
version space can be generalized.
• If the trainer classifies it as negative, the G boundary can be specialized.
What Training Example should the learner request next?
• The optimal query strategy for a concept learner is to generate
instances that satisfy exactly half of the hypotheses in the current
version space.
• The correct target concept can then be found with only ⌈log2 |VS|⌉
experiments.
How can partially learned concepts be used?
Inductive Bias
• Inductive Bias Introduction
• A Biased Hypothesis Space
• An Unbiased Learner
• The Futility of Bias-Free Learning
Inductive Bias Introduction
• The CANDIDATE-ELIMINATION algorithm will converge towards the
true target concept provided it is given accurate training examples
and provided its initial hypothesis space contains the target concept.
• What if the target concept is not contained in the hypothesis space?
• Can we avoid this difficulty by using a hypothesis space that includes
every possible hypothesis?
• How does the size of this hypothesis space influence the ability of the
algorithm to generalize to unobserved instances?
• How does the size of the hypothesis space influence the number of
training examples that must be observed?
A Biased Hypothesis Space
Decision Tree Learning: Decision Tree Representation
(Figure: a decision tree for the concept PlayTennis.)
• Consider how this tree classifies the instance
(Outlook = Sunny, Humidity = High, Wind = Strong): it is sorted down the
Outlook = Sunny branch, then the Humidity = High branch, and is classified
as a negative example (PlayTennis = No).
• The expression for the tree is:
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
Appropriate Problems for Decision Tree
Learning
• The following are the characteristics of problems suitable for decision
tree learning
• Instances are represented by attribute-value pairs
• The target function has discrete output values
• Disjunctive descriptions may be required
• The training data may contain errors
• The training data may contain missing attribute values
• Decision tree learning has therefore been applied to problems such as
• learning to classify medical patients by their disease,
• equipment malfunctions by their cause,
• and loan applicants by their likelihood of defaulting on payments.
• Such problems, in which the task is to classify examples into one of a
discrete set of possible categories, are often referred to as classification
problems.
The Basic Decision Tree Algorithm
• Which attribute is the best classifier?
• An illustrative Example
The Basic Decision Tree Algorithm
• The core ID3 (Iterative Dichotomizer 3) algorithm used for learning
decision trees employs a top-down, greedy search through the space
of possible decision trees.
• The ID3 algorithm learns decision trees by constructing them
top-down beginning with the question “Which attribute should be
tested at the root of the tree?”.
• ID3 uses an information gain measure to select among the candidate
attributes at each step while growing the tree.
The Basic Decision Tree Algorithm
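The algorithm itself appeared here as a table in the original slides; in its place, here is a hedged Python sketch of the ID3 skeleton just described. The dict-based tree encoding and helper names are assumptions of this sketch, and the entropy and information gain measures it relies on are defined in the next subsection.

```python
from math import log2

def entropy(labels):
    # Entropy of a Boolean-labeled collection (0.0 when the collection is pure)
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def information_gain(values, labels):
    # Expected reduction in entropy from partitioning on an attribute
    gain = entropy(labels)
    for v in set(values):
        subset = [l for val, l in zip(values, labels) if val == v]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

def majority(labels):
    return max(set(labels), key=labels.count)

def id3(data, attributes):
    # data: list of (x, y) pairs, x a dict mapping attribute name -> value
    labels = [y for _, y in data]
    if len(set(labels)) == 1:
        return {'label': labels[0]}           # pure node: make a leaf
    if not attributes:
        return {'label': majority(labels)}    # no attributes left to test
    best = max(attributes, key=lambda a: information_gain(
        [x[a] for x, _ in data], labels))     # greedy choice at this node
    node = {'attr': best, 'branches': {}, 'majority': majority(labels)}
    for v in {x[best] for x, _ in data}:
        subset = [(x, y) for x, y in data if x[best] == v]
        node['branches'][v] = id3(subset, [a for a in attributes if a != best])
    return node
```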
Which Attribute is the Best Classifier?
• Entropy measures homogeneity of examples
• Information gain measures the expected reduction in entropy
Entropy measures homogeneity of examples
• Entropy is a measure that characterizes the impurity of a collection of
examples.
• In a collection S containing positive and negative examples of some
target concept, the entropy of S relative to this Boolean classification
is:
Entropy(S) = −p⊕ log2 p⊕ − p⊖ log2 p⊖
where p⊕ is the proportion of positive examples in S, p⊖ is the proportion
of negative examples in S, and 0 log 0 is defined to be 0.
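As a quick numeric check of the formula, consider a collection with 9 positive and 5 negative examples (a standard textbook illustration):

```python
from math import log2

p_pos, p_neg = 9 / 14, 5 / 14
print(-p_pos * log2(p_pos) - p_neg * log2(p_neg))   # ~= 0.940
```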
• If a tree is grown until it perfectly fits noisy or unrepresentative training
data, the result is a more complex tree, and future instances will not be
correctly categorized; this problem is called overfitting.
Avoiding Overfitting of Data
• There are several approaches to avoid overfitting in decision tree
learning. They can be grouped into two classes:
• approaches that stop growing the tree earlier, before it reaches the
point where it perfectly classifies the training data,
• approaches that allow the tree to overfit the data, and then
post-prune the tree.
Avoiding Overfitting of Data
• A very important question is what criterion is to be used to determine
the correct size of the tree:
• Use a separate set of examples, distinct from the training examples, to
evaluate the utility of post-pruning nodes from the tree.(training and
validation set approach).
• Use all the available data for training, but apply a statistical test to estimate
whether expanding (or pruning) a particular node is likely to produce an
improvement beyond the training set. (chi-square test)
• Use an explicit measure of the complexity for encoding the training examples
and the decision tree, halting growth of the tree when this encoding size is
minimized. (Minimum Description Length principle)
Avoiding Overfitting of Data
• In the training and validation set approach the data set is separated
into two sets of examples namely:
• Training set
• Validation set
• There are two approaches to use a validation set to prevent
overfitting. They are:
• Reduced Error Pruning
• Rule Post-Pruning
Reduced Error Pruning
• One approach to using a validation set to avoid overfitting of the data
is called reduced error pruning.
• This technique considers each of the decision nodes as a candidate for
pruning (removing the subtree rooted at that node and making it a leaf
node, assigned the most common classification of the training examples
affiliated with it).
• Pruning is done only if the pruned tree performs no worse than the
original tree over the validation set.
• Pruning of nodes continues until further pruning is harmful.
• Using a separate validation dataset needs a large amount of data. This
approach is not useful when the data is limited.
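A minimal sketch of this pruning loop, reusing the dict-based tree encoding from the ID3 sketch earlier ('majority' stores the most common training label at each decision node; leaves are {'label': y}). The greedy loop below is an assumption of this sketch, not the slides' exact procedure: it repeatedly commits the prune that helps validation accuracy most, as long as accuracy does not get worse.

```python
def classify(node, x):
    while 'label' not in node:                 # descend until a leaf
        node = node['branches'][x[node['attr']]]
    return node['label']

def accuracy(root, data):
    return sum(classify(root, x) == y for x, y in data) / len(data)

def decision_nodes(node, found=None):
    found = [] if found is None else found
    if 'branches' in node:                     # a decision node
        found.append(node)
        for child in node['branches'].values():
            decision_nodes(child, found)
    return found

def reduced_error_prune(root, validation):
    while True:
        base, best = accuracy(root, validation), None
        for node in decision_nodes(root):
            saved = dict(node)
            node.clear()
            node['label'] = saved['majority']  # trial prune: node -> leaf
            acc = accuracy(root, validation)
            node.clear()
            node.update(saved)                 # undo the trial prune
            if acc >= base and (best is None or acc > best[0]):
                best = (acc, node, saved['majority'])
        if best is None:
            return root                        # further pruning is harmful
        _, node, label = best
        node.clear()
        node['label'] = label                  # commit the best prune

# Hypothetical toy tree and validation data:
tree = {'attr': 'Wind', 'majority': 'Yes',
        'branches': {'Weak': {'label': 'Yes'}, 'Strong': {'label': 'No'}}}
val = [({'Wind': 'Weak'}, 'Yes'), ({'Wind': 'Strong'}, 'Yes')]
reduced_error_prune(tree, val)
print(tree)   # {'label': 'Yes'}: the split hurt validation accuracy
```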
Rule Post-Pruning
• One very successful method for finding high accuracy hypotheses is a
technique we shall call rule post-pruning.
• Rule post-pruning involves the following steps:
1. Infer the decision tree from the training set, growing the tree until the
training data is fit as well as possible and allowing overfitting to occur.
2. Convert the learned tree into an equivalent set of rules by creating one rule
for each path from the root node to a leaf node (sketched in code after this list).
3. Prune (generalize) each rule by removing any preconditions that result in
improving its estimated accuracy.
4. Sort the pruned rules by their estimated accuracy, and consider them in this
sequence when classifying subsequent instances.
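A minimal sketch of step 2, using the same dict-based tree encoding as the earlier sketches: each rule is a (preconditions, consequent) pair built from one root-to-leaf path.

```python
def tree_to_rules(node, preconditions=()):
    if 'label' in node:                        # leaf: one finished rule
        return [(list(preconditions), node['label'])]
    rules = []
    for value, child in node['branches'].items():
        rules += tree_to_rules(child, preconditions + ((node['attr'], value),))
    return rules

tree = {'attr': 'Outlook', 'branches': {
    'Sunny': {'attr': 'Humidity', 'branches': {
        'High': {'label': 'No'}, 'Normal': {'label': 'Yes'}}},
    'Overcast': {'label': 'Yes'}}}
for pre, label in tree_to_rules(tree):
    print(pre, '->', label)
# e.g. [('Outlook', 'Sunny'), ('Humidity', 'High')] -> No
```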
Rule Post-Pruning
• Each attribute test along the path from the root to the leaf becomes a
rule antecedent (precondition) and the classification at the leaf node
becomes the rule consequent (post-condition).
• Let us look at an example: the leftmost path of the PlayTennis tree yields
the rule
IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No
Rule Post-Pruning
• Why convert the decision tree to rules before pruning? There are
three main advantages.
• Converting to rules allows distinguishing among the different contexts in
which a decision node is used. In contrast, if the tree itself were pruned, the
only two choices would be to remove the decision node completely, or to
retain it in its original form.
• Converting to rules removes the distinction between attribute tests that occur
near the root of the tree and those that occur near the leaves.
• Converting to rules improves readability. Rules are often easier for people to
understand.
Incorporating Continuous Valued Attributes
• Our initial definition of ID3 is restricted to attributes which are
discrete in nature.
• The target attribute must be discrete while other attributes can be
continuous.
• For a continuous-valued attribute A, the algorithm can dynamically create
a new Boolean attribute Ac that is true if A < c and false otherwise.
• The most important question is how to select the best value for the
threshold c.
Incorporating Continuous Valued Attributes
• Let us look at an example. Suppose the Temperature attribute in our
PlayTennis data set is continuous-valued.
• Candidate thresholds are generated by sorting the examples according to
Temperature and identifying adjacent examples that differ in their target
classification; the midpoint between each such pair is a candidate value
for c, and the candidate that maximizes information gain can then be chosen.
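A minimal sketch of this threshold-selection step: sort by the attribute's value and propose the midpoint wherever the class label changes. The temperature values and labels below follow the usual textbook illustration and are assumptions of this sketch.

```python
def candidate_thresholds(values, labels):
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1]]     # label changes here

temps  = [40, 48, 60, 72, 80, 90]
played = [0, 0, 1, 1, 1, 0]                        # PlayTennis labels
print(candidate_thresholds(temps, played))         # [54.0, 85.0]
```

The candidate threshold with the highest information gain would then be used to define the new Boolean attribute Ac.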