Ecs 403 ML Module I

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 33

EID 403 MACHINE LEARNING LECTURE NOTES

LECTURE NOTES–
UNIT I
INTRODUCTION
EID 403
MACHINE LEARNING
4/4 B.Tech [ CSE B5,B8 ]

Dr. R. Bhramaramba
DEPARTMENT OF IT,
GIT,GITAM

1
EID 403 MACHINE LEARNING LECTURE NOTES

Module I Lecture Notes

Syllabus
Learning problems, perspectives and issues, concept learning, version spaces and candidate
eliminations, inductive bias, decision tree learning, representation, algorithm,
hypothesis space search.

Introduction

2
EID 403 MACHINE LEARNING LECTURE NOTES
Machine learning addresses the question of “how to build computer programs that improve
their performance at some task through experience”. Machine learning algorithms have proven to
be of great practical value in a variety of application domains like
● Data mining problems where large databases may contain valuable implicit
regularities that can be discovered automatically.
● Poorly understood domains where humans might not have the knowledge needed to
develop effective algorithms such as human face recognition from images.
● Domains where the program must dynamically adapt to changing condition.

Machine Learning draws on ideas from a diverse set of disciplines, including artificial
intelligence, probability and statistics, computational complexity, information theory, psychology

and neurobiology, control theory, and philosophy.

1.1 WELL-POSED LEARNING PROBLEMS


A Formal definition of Machine learning given by Mitchell is:

“A computer program is said to learn from experience E with respect to some class of tasks
T and performance measure P, if its performance at tasks T, as measured by P, improves
with experience E.”

To have a well-defined learning problem, we must identify these three features: the class of
tasks(T), the measure of performance to be improved(P), and the source of experience (E). We can
specify many learning problems in this fashion. For example,
A checkers learning problem:
➢ Task T: playing checkers
➢ Performance measure P: percent of games won against opponents
➢ Training experience E: playing practice games against itself
A handwriting recognition learning problem:

3
EID 403 MACHINE LEARNING LECTURE NOTES
➢ Task T: recognizing and classifying handwritten words within images
➢ Performance measure P: percent of words correctly classified
➢ Training experience E: a database of handwritten words with given classifications

1.2 DESIGNING A LEARNING SYSTEM

Designing a machine learning approach involves a number of design choices like


choosing the type of training experience, the target function to be learned, a representation for
this target function, and an algorithm for learning the target function from training examples.
Let us consider designing a program to learn to play checkers, with the goal of entering it in the
world checkers tournament

1.2.1 Choosing the Training Experience

● The type of training experience available can have a significant impact on success or
failure of the learner. One key attribute is whether the training experience provides
direct or indirect feedback regarding the choices made by the performance system.
In learning to play checkers, the system might learn from direct training examples
consisting of individual checkers board states and the correct move for each. Alternatively, it might
have available only indirect information consisting of the move sequences and final outcomes of
various games played.
● A second important attribute of the training experience is the degree to which the
learner controls the sequence of training examples.
For example, the learner might rely on the teacher to select informative board states and to provide
the correct move for each. Alternatively, the learner might itself propose board states that it finds
particularly confusing and ask the teacher for the correct move. Or the learner may have complete
control over both the board states and (indirect) training classifications, as it does when it learns by
playing against itself with no teacher present.

4
EID 403 MACHINE LEARNING LECTURE NOTES
● A third important attribute of the training experience is how well it represents the
distribution of examples over which the final system performance P must be
measured.
In our checkers learning scenario, the performance metric P is the percent of games the system
wins in the world tournament. If its training experience E consists only of games played against
itself, there is an obvious danger that this training experience might not be fully representative of
the distribution of situations over which it will later be tested. In practice, it is often necessary to
learn from a distribution of examples that is somewhat different from those on which the final
system will be evaluated.

1.2.2 Choosing the Target Function

The next design choice is to determine exactly what type of knowledge will be learned and how
this will be used by the performance program.
Consider checkers-playing program that needs only to learn how to choose the best move
from among legal moves. To this problem, the type of information to be learned is a program, or
function, that chooses the best move for any given board state. Let us call this function
ChooseMove and use the notation ChooseMove : B→M to indicate that this function accepts as
input any board from the set of legal board states B and produces as output some move from the set
of legal moves M.
ChooseMove will be very difficult to learn if the training experience is indirect. An alternative
target function that works easier is an evaluation function that assigns a numerical score to any
given board state. Let us call this target function V and again use the notation V : B → R to denote
that V maps any legal board state from the set B to some real value which is used to assign higher
scores to better board states. The target value V(b) for an arbitrary board state b in B, as follows:
➢ if b is a final board state that is won, then V(b) = 100

5
EID 403 MACHINE LEARNING LECTURE NOTES
➢ if b is a final board state that is lost, then V(b) = -100
➢ if b is a final board state that is drawn, then V(b) = 0
➢ if b is a not a final state in the game, then V(b) = V(b|), where b' is the best final board state
that can be achieved starting from b and playing optimally until the end of the game.

As it is very difficult to learn Target function V perfectly, we often expect learning algorithms to
acquire only some approximation ( V^) to the target function, and for this reason the process of
learning the target function is often called function approximation

1.2.3 Choosing a Representation for the Target Function


We can represent target function using a large table with a distinct entry specifying the
value for each distinct board state or we can represent using a collection of rules that match against
features of the board state, or a quadratic polynomial function of predefined board features, or an
artificial neural networks. let us choose a simple representation: for any given board state, the
function c will be calculated as a linear combination of the following board features:
xl: the number of black pieces on the board
x2: the number of red pieces on the board
x3: the number of black kings on the board
x4: the number of red kings on the board
x5: the number of black pieces threatened by red (i.e., which can be captured
on red's next turn)
X6: the number of red pieces threatened by black
Thus, our learning program will represent V^ (b) as a linear function of the form

where Wo through W6 are numerical coefficients, or weights, to be chosen by the learning


algorithm. Learned values for the weights W1 through W6 will determine the relative importance
of the various board features in determining the value of the board, whereas the weight Wo will
provide an additive constant to the board value.

6
EID 403 MACHINE LEARNING LECTURE NOTES
1.2.4 Choosing a Function Approximation Algorithm
In order to learn the target function f we require a set of training examples, each describing a
specific board state b and the training value Vtrain(b) for b.The following training example
describes a board state b in which black has won the game (note x2 = 0 indicates that red has no
remaining pieces) and for which the target function value Vtrain(b) is +100.

● To assign the training value Vtrain of any intermediate board state b to be

.where is the learner's current approximation to V and


where Successor(b) denotes the next board state following b for which it is again
the program's turn to move.
● Now it is the time to specify the learning algorithm for choosing the weights wi to
best fit the set of training examples { (b,V train(b)}. As a first step we must
define what we mean by the best fit to the training data
● One common approach is to define the best hypothesis, or set of weights, as that
which minimizes the square error E between the training values and the values
predicted by the hypothesis V.

● Several algorithms are available. For our problem least mean squares, or LMS
training rule is used. For each observed training example it adjusts the weights a
small amount in the direction that reduces the error on this training example.

7
EID 403 MACHINE LEARNING LECTURE NOTES

Here is the small constant that moderates the size of weight update.
1.2.5 The Final Design

The final design of our checkers learning system can be described by four distinct program
modules that represent the central components in many learning systems.
● Performance System:
It solves the given performance task by taking an instance of a new problem (new game) as input
and produces a trace of its solution (game history) as output. For our checkers problem,
Performance System selects its next move determined by the learned V^ evaluation function at
each step. Therefore, we expect its performance to improve as this evaluation function becomes
increasingly accurate.

8
EID 403 MACHINE LEARNING LECTURE NOTES

Fig 1.1 Final design of the checkers learning program.

●Critic:
It takes as input the history or trace of the game and produces as output a set of training examples
of the target function. Each training example in this case corresponds to some game state
in the trace, along with an estimate Vtrain of the target function.
●Generalizer:
It takes as input the training examples and produces an output hypothesis that is its estimate of the
target function. It generalizes from the specific training examples, hypothesizing a general function
that covers these examples. In our example, the Generalizer corresponds to the LMS algorithm, and
the output hypothesis is the function f described by the learned weights W0, . . . , W6.

9
EID 403 MACHINE LEARNING LECTURE NOTES
●The Experiment Generator:
It takes as input the current hypothesis (currently learned function) and outputs a new problem
(i.e., initial board state) for the Performance System to explore. Its role is to pick new practice
problems that will maximize the learning rate of the overall system.

1.3 PERSPECTIVES AND ISSUES IN MACHINE LEARNING

● One useful perspective on machine learning is that it involves searching a very large
space of possible hypotheses to determine one that best fits the observed data and any
prior knowledge held by the learner.
● For Checkers Learning problem, hypothesis space consists of all evaluation functions
that can be represented by some choice of values for the weights wo through w6. The
learner's task is thus to search through this vast space to locate the hypothesis that is
most consistent with the available training examples.

1.3.1 Issues in Machine Learning

● What algorithms exist for learning general target functions from specific training examples?
In what settings will particular algorithms converge to the desired function, given sufficient
training data? Which algorithms perform best for which types of problems and
representations?
● How much training data is sufficient? What general bounds can be found to relate the
confidence in learned hypotheses to the amount of training experience and the character of
the learner's hypothesis space?
● When and how can prior knowledge held by the learner guide the process of generalizing
from examples? Can prior knowledge be helpful even when it is only approximately
correct?
● What is the best strategy for choosing a useful next training experience, and how does the
choice of this strategy alter the complexity of the learning problem?

10
EID 403 MACHINE LEARNING LECTURE NOTES
● What is the best way to reduce the learning task to one or more function approximation
problems? Put another way, what specific functions should the system attempt to learn? Can
this process itself be automated?
● How can the learner automatically alter its representation to improve its ability to represent
and learn the target function?

1.4 CONCEPT LEARNING


● Concept means describing some subset of objects or events defined over a larger set.
The problem of inducing the general definition of concept or category from the given
examples which are members or nonmembers(positive&negative)of the concept is
referred as concept learning.
● Each concept can be thought of as a Boolean-valued (true/false or yes/no) function
hence concept learning can also be defined as “Inferring a boolean-valued function
from training examples of its input and output.” This learning involves acquiring
general concepts from specific training examples.
● When learning the target concept, the learner will be given with set of training
examples, each consisting of an instance x from X, along with its target concept value
c(x).
● Given a set of training examples of the target concept c, the problem faced by the
learner is to hypothesize, or estimate c such that h(x) = c(x) for all x in X.

Consider the example task of learning the target concept "days on which my friend Aldo
enjoys his favorite watersport."
Concept to be learned

11
EID 403 MACHINE LEARNING LECTURE NOTES

Days
Attributes

TABLE 1.1 Positive and negative training examples for the target concept EnjoySport

● The table1.1 describes a set of example days that are represented by a set of attributes
and the attribute EnjoySport indicates whether or not Aldo enjoys his favorite water
sport on this day which is the target concept for this problem.

● The task is to learn to predict the value of EnjoySport for a day, based on the values
of its other attributes. In other words , the EnjoySport concept learning task requires
learning the set of days for which EnjoySport = yes, describing this set by a
conjunction of constraints over the instance attributes.

● For this learning, hypothesis is represented as conjunction of the constraints that


specifies six attribute values ( Sky, AirTemp, Humidity, Wind, Water, and Forecast).

● For each attribute, the hypothesis will either,

➢ A specific value (e.g., “Water=Warm”)

➢ Any value can be accepted (e.g., “Water=?”)

➢ No value acceptable (e.g., “Water=Ø”)

● If some example/instance satisfies all the constraints of hypothesis h, then h classifies


x as a positive example (h(x) = 1). Otherwise x is negative example and is (h(x)=0).

12
EID 403 MACHINE LEARNING LECTURE NOTES
● The hypothesis that Aldo enjoys his favorite sport only on cold days with high
humidity (independent of the values of the other attributes) is represented by the
expression ( ?, Cold, High, ?, ?, ?)

● The most general hypothesis that every day is a positive example is represented

by ( ?, ?, ?, ?, ?, ?)

● The most specific possible hypothesis that no day is a positive example is


represented by (Ø, Ø, Ø,Ø,Ø)

● The EnjoySport concept learning task is given below

As we know concept learning is to find the hypothesis that best fits the training examples.
Consider EnjoySport learning task, given that the attribute Sky has three possible values, and that
AirTemp, Humidity, Wind, Water, and Forecast each have two possible values, the instance space
X contains exactly 3 .2 2 .2 2 .2 = 96 distinct instances. A similar calculation shows that there are

13
EID 403 MACHINE LEARNING LECTURE NOTES
5.4-4 -4 -4.4 = 5 120 syntactically distinct hypotheses within H. As our example is simple
Hypothesis space becomes finite but real world problems have larger and infinite hypothesis. To
do this hypothesis search in effiecient way, different strategies are available.

1.5 VERSION SPACES AND CANDIDATE-ELIMINATION ALGORITHM

Candidate Elimination algorithm makes use of General-to-Specific ordering structure of


hypothesis space in order to search best hypothesis that fits for our training examples. To
illustrate the general-to-specific ordering, consider the two hypotheses

h1= (Sunny, ?, ?, Strong, ?, ?)

h2 = (Sunny, ?, ?, ?, ?, ?)

h3=(Sunny,?,?,?,cool)

Because h2 imposes fewer constraints on the example, it classifies more examples

as positive. In fact, any example classified positive by hl and h3 will also be classified

positive by h2. Therefore, we say that h2 is more general than h1 and h3.

14
EID 403 MACHINE LEARNING LECTURE NOTES
The arrows connecting hypotheses in the hypothesis space H represent the m o r e - g e n e r a l - t
h a n relation, with the arrow pointing toward the less general hypothesis.The left box contains
instances X.
● more-general-than-or equal: Given hypotheses hj and hk, hj is more-general-than-or-
equal to hk if and only if any instance that satisfies hk also satisfies hj (hj>g=hk).

● These relations depend only on which instances satisfy the two hypotheses and not on the
classification of those instances according to the target concept.

● Formally, defines partial order over Hypothesis Space means the relation is
reflexive,antisymmetric,transitive. Using this partial order we can expect that there may be
hypothesis with a relation as this relation is partial order which is

antisymmetric the will be possible.


● Hence these relations are very useful in Hypothesis search.

The key idea in the CANDIDATE-ELIMINATION ALGORITHM is to output a description of


the set of all hypotheses consistent with the training examples. Consistent means If the
hypotheses correctly classifies the examples. These set of hypothesis retuned by the algorithm is
called Version Space (VS H,D ), (Examples D consistent with subset of Hypothesis in H).

➢ The CANDIDATE-ELIMINATION ALGORITHM represents the version space by storing


only its most general members (labeled G ) and its most specific (labeled S ). With these
two sets S and G, it is possible to include all members of the version space by generating the
hypotheses that lie between these two sets in the general-to-specific partial ordering over
hypotheses.
➢ The general boundary G, with respect to hypothesis space H and training data D, is the set
of maximally general members of H consistent with D.

15
EID 403 MACHINE LEARNING LECTURE NOTES
➢ The specific boundary S, with respect to hypothesis space H and training data D, is the set
of minimally general (i.e., maximally specific) members of H consistent with D.

➢ Hence, version space is precisely the set of hypotheses contained in G, plus those contained
in S, plus those that lie between G and S in the partially ordered hypothesis space.

➢ The algorithm begins by initializing the G boundary set to contain the most general

hypothesis as and initializing the S boundary set to contain the


most specific (least general) hypothesis as .

➢ By considering each example, the S and G boundary sets are generalized and specialized, to
eliminate from the version space any hypotheses found inconsistent with the examples.

Consider the Enjoy sport problem and the training examples for this algorithm. Positive
training examples may force the S boundary of the version space to become increasingly general.
Negative training examples play the complimentary role of forcing the G boundary to become
increasingly specific
➢ When the first training example is given (positive), the CANDIDATE-ELIMINATION
ALGORITHM checks the S boundary and finds that it is overly specific-it fails to cover the
positive example. The boundary is therefore revised by replacing with the next more general
constraint that fits the example; namely, the attribute values for this training example.
➢ This revised boundary is shown as S1. No update of the G boundary is needed in response
to this training example because Go correctly covers this example.
➢ When the second training example (also positive) is observed, it has a similar effect of
generalizing S further to S2, substituting a "?' in place of any attribute value in S1 that is not
satisfied by the new example. G again unchanged (i.e., G2 = G1 = G0)

16
EID 403 MACHINE LEARNING LECTURE NOTES

➢ Consider the third training example which is negative example reveals that the G boundary
of the version space is overly general. The hypothesis in the G boundary must therefore be
specialized until it correctly classifies this new negative example. There are several
alternative minimally more specific hypotheses but only three are included due to the reason
that only these three hypothesis are consistent with the previously encountered positive
examples. All of these become members of the new G3 boundary set.

17
EID 403 MACHINE LEARNING LECTURE NOTES

➢ The fourth training example, further generalizes the S boundary of the version space. It also
results in removing one member of the G boundary, because this member fails to cover the
new positive example.

18
EID 403 MACHINE LEARNING LECTURE NOTES

➢ It is clear that, the S boundary of the version space forms a summary of the previously
encountered positive examples Any hypothesis more general than S will cover any example
that S covers and thus will cover any past positive example. Similarly, the G boundary
summarizes the information from previously encountered negative examples. Any
hypothesis more specific than G is assured to be consistent with past negative examples
➢ After processing these four examples, the boundary sets S4 and G4 delimit the version
space of all hypotheses consistent with the set of incrementally observed training examples.
The entire version space, including those hypotheses bounded by S4 and G4 as shown in
the figure.

19
EID 403 MACHINE LEARNING LECTURE NOTES

➢ Because the S and G sets delimit the entire set of hypotheses consistent with the data, they
provide the learner with a description of its uncertainty regarding the exact identity of the
target concept. Although this algorithm provides useful framework for concept learning ,it
is not robust to noisy data or to situations in which the unknown target concept is not
expressible in the provided hypothesis space.
➢ Candidate Elimination algorithm is given below:

20
EID 403 MACHINE LEARNING LECTURE NOTES

1.6 INDUCTIVE BIAS

➢ Inductive learning algorithms are able to classify unseen examples only because of their
implicit inductive bias for selecting one consistent hypothesis over another
➢ Consider the following example

➢ The most specific hypothesis consistent with the first two examples in the given
hypothesis space H is
S2 : (?, Warm, Normal, Strong, Cool, Change)

21
EID 403 MACHINE LEARNING LECTURE NOTES
Which is already overly general that incorrectly covers the third (negative) training
example. The problem is that we have biased the learner to consider only conjunctive
hypotheses. Because of this restriction, the hypothesis space is unable to represent even
simple disjunctive target concepts such as "Sky = Sunny or Sky = Cloudy." To avoid this
we need a more expressive hypothesis.
➢ One way to define such a new hypothesis is to allow arbitrary disjunctions, conjunctions,
and negations of our earlier hypotheses. For instance, the target concept "Sky = Sunny or
Sky = Cloudy" could then be described as
(Sunny, ?, ?, ?, ?, ?) v (Cloudy, ?, ?, ?, ?, ?)
➢ Given this hypothesis space, we can safely use the CANDIDATE-ELIMINATION
algorithm without worrying that the target concept might not be expressible. But it raises a
difficult problem: our concept learning algorithm is now completely unable to generalize
beyond the observed examples.
➢ suppose we present three positive examples (xl,x 2, x3) and two negative examples (x4, x5)
to the learner then the S boundary will always be simply the disjunction of the observed
positive examples, while the G boundary will always be the negated disjunction of the
observed negative examples.

➢ Therefore, the only examples that will be unambiguously classified by S and G are the
observed training examples themselves. In order to converge to a single, final target
concept, we will have to present every single instance in X as a training example.
➢ From this discussion it is clear that a learner that makes no a priori assumptions
regarding the identity of the target concept has no rational basis for classifying any
unseen instances.
➢ The reason that the CANDIDATE-ELIMINATION ALGORITHM was able to generalize
beyond the observed training examples in our original formulation of the EnjoySport task is
that it was biased by the implicit assumption that the target concept could be represented by
a conjunction of attribute values

22
EID 403 MACHINE LEARNING LECTURE NOTES
➢ This prior assumption is called Inductive Bias.
➢ Let, learning algorithm L is provided with set of training data D, = {(x, c(x))} of some
target concept c. After training, L is asked to classify a new instance xi. Let L(xi, D,) denote
the classification (e.g., positive or negative) that L assigns to xi after learning from the
training data D,. We can describe this inductive inference step performed by L as follows

where the notation indicates that z is inductively inferred from y.


➢ For example, if we take L to be the CANDIDATE-ELIMINAaTlIgOoNrit hm, D, to be the
training data from Table 2.1, and xi to be the first instance then the inductive inference
performed in this case concludes that L(xi, D,) = (EnjoySport = yes). Because L is an
inductive learning algorithm, the result L(xi, D,) that it infers will not in general be provably
correct.

➢ To justify its inductive inferences as deductive inferences we define inductive bias as set of
additional assumption B as

where the notation indicates that z follows deductively from y (i.e., that z is provable from
y).

23
EID 403 MACHINE LEARNING LECTURE NOTES

➢ Given this definition of L(xi, D,) for the CANDIDATE-ELIMINATION ALGORITHM,


what is its inductive bias? It is simply the assumption c E H. Given this assumption, each
inductive inference performed by the CANDIDATE-ELIMINATION ALGORITHM can be
justified deductively.
➢ With advantage of inductive bias one can compare different learning algorithms. Following
are the algorithms from weakest to strongest bias
➢ ROTE-LEARNER: learning corresponds simply to storing each observed training example
in memory. Subsequent instances are classified by looking them up in memory. If the
instance is found in memory, the stored classification is returned. Otherwise, the system
refuses to classify the new instance.
➢ CANDIDATE-ELIMINATION ALGORITHM New instances are classified only in the
case where all members of the current version space agree on the classification. Otherwise,
the system refuses to classify the new instance.
➢ FIND-S: This algorithm, described earlier, finds the most specific hypothesis consistent
with the training examples. It then uses this hypothesis to classify all subsequent instances .

1.7 DECISION TREE LEARNING


➢ Decision tree learning is one of the most popular inductive inference algorithms and have
been successfully applied to a medical diagnosis learning, Credit risk analysis.

24
EID 403 MACHINE LEARNING LECTURE NOTES
➢ It is a method for approximating discrete-valued target functions, in which the learned
function is represented by a decision tree.
➢ Decision tree learning is generally best suited to problems with the following
characteristics:
● Instances are represented by attribute-value pairs: It is easy to construct decision tree
if a situation is represented with attributes having disjoint possible values.
● The target function has discrete output values: Decision tree methods can assign
Boolean classification to each example and can also easily extend to learning
functions with more than two possible output values.
● Disjunctive descriptions may be required: Decision trees naturally express
disjunctive expressions.
● The training data may contain errors: Decision tree learning methods are robust to
errors in classifications of the training examples and errors in the attribute values that
describe these examples.
● The training data may contain missing attribute values: Decision tree methods can be
used even when some training examples have unknown values.

DECISION TREE REPRESENTATION

➢ Decision trees classify instances by sorting them down the tree from the root to some leaf
node. Each node in the tree specifies a test of some attribute of the instance, and each branch
descending from that node corresponds to one of the possible values for this attribute.
➢ Consider the learning task represented by the training examples of the following table. Here
the target attribute PlayTennis, which can have values yes or no for different Saturday
mornings which is to be predicted based on other attributes of the morning for the given
query.

25
EID 403 MACHINE LEARNING LECTURE NOTES

The following tree illustrates a typical learned decision tree that classifies Saturday mornings
according to whether they are suitable for playing tennis For example, the instance

would be sorted down the leftmost branch of this decision tree and would therefore be classified as
a negative instance (i.e., the tree predicts that PlayTennis = no).
➢ In general, decision trees represent a disjunction of conjunctions of constraints on the
attribute values of instances. Each path from the tree root to a leaf corresponds to a
conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions.
➢ For example, the decision tree shown below corresponds to the expression

26
EID 403 MACHINE LEARNING LECTURE NOTES

Fig 1.3 Decision Tree for the PlayTennis concept

DECISION TREE LEARNING ALGORTIHM

● ID3 is the basic algorithm learns decision trees by constructing them top- down.
● The ID3 algorithm starts by selecting attributes that is suitable to place at root node. The
best attribute is selected and used as the test at the root node of the tree.
● A descendant of the root node is then created for each possible value of this attribute, and
the training examples are sorted to the appropriate descendant node (i.e., down the branch
corresponding to the example's value for this attribute).
● The entire process is then repeated using the training examples associated with each
descendant node to select the best attribute to test at that point in the tree.
● The central component of this learning is in selecting attribute to test. To do this it makes
use of quantitative measure,a statistical property, called information gain that measures
how well a given attribute separates the training examples according to their target
classification.

27
EID 403 MACHINE LEARNING LECTURE NOTES
● To calculate Information gain, we require a term called entropy, that characterizes the
(im)purity of collection of examples. Given a collection S, containing positive and negative
examples of some target concept, the entropy of S relative to this boolean classification is

● For the given example of PlayTennis Concept, there are nine postivie examples and five
negative examples in a collection of Fourteen examples(S).

● If entropy is 0 then all members of S belong to the same class. The entropy will be 1
when the collection contains an equal number of positive and negative examples. If
the collection contains unequal numbers of positive and negative examples, the
entropy is between 0 and 1.
● The information gain, Gain(S, A) of an attribute A, relative to a collection of
examples S, is defined as

where Values(A) is the set of all possible values for attribute A, and S, is the subset of S for
which attribute A has value v (i.e., S, = {s E SIA(s) = v)).
● The information gain due to sorting the original 14 examples by the attribute Wind
may then be calculated as

28
EID 403 MACHINE LEARNING LECTURE NOTES

● To construct decision tree for PlayTennis concept, we need to calculate information


gain for all the four attributes for selecting root node.

● According to the information gain measure, the Outlook attribute provides the best
prediction of the target over the training examples. Therefore, Outlook is selected as
the decision attribute for the root node, and branches are created below the root for
each of its possible values (i.e. Sunny, Overcast, and Rain)

29
EID 403 MACHINE LEARNING LECTURE NOTES
● Observe that every example for which Outlook = Overcast is also a positive example
of PlayTennis. Therefore, this node of the tree becomes a leaf node with the
classification PlayTennis = Yes.

● But descendants corresponding to Outlook = Sunny and Outlook = Rain still have
nonzero entropy, and the decision tree will be further elaborated below these nodes.

● The process of selecting a new attribute and partitioning the training examples is now
repeated for each nonterminal descendant node, this time using only the training
examples associated with that node and any given attribute can appear at

30
EID 403 MACHINE LEARNING LECTURE NOTES
most once along any path through the tree.

● This process continues for each new leaf node until either of two conditions is met:
(1) every attribute has already been included along this path through the tree, or (2)
the training examples associated with this leaf node all have the same target attribute
value (i.e., their entropy is zero).

● ID3 Algorithm is given below.

HYPOTHESIS SPACE SEARCH


➢ The hypothesis space searched by ID3 is the set of possible decision trees.

31
EID 403 MACHINE LEARNING LECTURE NOTES
➢ ID3 performs hill-climbing search through this hypothesis space, beginning with the
empty tree, then considering progressively more elaborate hypotheses in search of a
decision tree that correctly classifies the training data.

➢ The evaluation function that guides this hill-climbing search is the information
gain measure.

➢ Because every finite discrete-valued function can be represented by some


decision tree, ID3 avoids one of the major risks of methods that search
incomplete hypothesis spaces that might not contain target.

➢ ID3 maintains only a single current hypothesis as it searches through the space
of decision trees. In other words it does not have the ability to determine how
many alternative decision trees are consistent with the available training data.

32
EID 403 MACHINE LEARNING LECTURE NOTES
➢ ID3 in its pure form performs no backtracking in its search. Once it, selects an
attribute to test at a particular level in the tree, it never backtracks to
reconsider this choice. Therefore, it is susceptible to the usual risks of hill-
climbing search without backtracking.

➢ ID3 uses all training examples at each step in the search to make statistically
based decisions regarding how to refine its current hypothesis. This contrasts
with methods that make decisions incrementally, based on individual training
examples

33

You might also like