ML - Unit 2 - Part I
UNIT - 2
DECISION TREE LEARNING
Decision trees classify instances by sorting them down the tree from the root to some
leaf node, which provides the classification of the instance.
Each node in the tree specifies a test of some attribute of the instance, and each branch
descending from that node corresponds to one of the possible values for this attribute.
An instance is classified by starting at the root node of the tree, testing the attribute
specified by this node, then moving down the tree branch corresponding to the value
of the attribute in the given example. This process is then repeated for the subtree
rooted at the new node.
FIGURE: A decision tree for the concept Play Tennis. An example is classified by sorting it
through the tree to the appropriate leaf node, then returning the classification associated with
this leaf
For example, the decision tree shown in above figure corresponds to the expression
(Outlook = Sunny ∧ Humidity = Normal)
∨ (Outlook = Overcast)
∨ (Outlook = Rain ∧ Wind = Weak)
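As a small illustration (a hypothetical snippet, not part of the original notes), the same tree can be written as a Python function whose branches mirror the figure:

def play_tennis(outlook, humidity, wind):
    # Mirrors (Outlook = Sunny AND Humidity = Normal)
    #      OR (Outlook = Overcast)
    #      OR (Outlook = Rain AND Wind = Weak)
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Rain":
        return "Yes" if wind == "Weak" else "No"
    return "No"  # value of Outlook not covered by the tree

print(play_tennis("Sunny", "High", "Weak"))   # -> No
print(play_tennis("Rain", "Normal", "Weak"))  # -> Yes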
Decision tree learning is generally best suited to problems with the following characteristics:
1. Instances are represented by attribute-value pairs – Instances are described by a fixed
set of attributes and their values.
2. The target function has discrete output values – The decision tree assigns a Boolean
classification (e.g., yes or no) to each example. Decision tree methods easily extend to
learning functions with more than two possible output values.
3. The training data may contain errors – Decision tree learning methods are robust to
errors, both errors in classifications of the training examples and errors in the attribute
values that describe these examples.
4. The training data may contain missing attribute values – Decision tree methods can
be used even when some training examples have unknown values.
The basic algorithm is ID3, which learns decision trees by constructing them top-down.
Examples are the training examples. Target_attribute is the attribute whose value is to
be predicted by the tree. Attributes is a list of other attributes that may be tested by the
learned decision tree. The algorithm returns a decision tree that correctly classifies the given Examples.
Algorithm: ID3(Examples, Target_attribute, Attributes)
Create a Root node for the tree
If all Examples are positive, Return the single-node tree Root, with label = +
If all Examples are negative, Return the single-node tree Root, with label = -
If Attributes is empty, Return the single-node tree Root, with label = most common value
of Target_attribute in Examples
Otherwise Begin
    A ← the attribute from Attributes that best* classifies Examples
    The decision attribute for Root ← A
    For each possible value, vi, of A,
        Add a new tree branch below Root, corresponding to the test A = vi
        Let Examples_vi be the subset of Examples that have value vi for A
        If Examples_vi is empty
            Then below this new branch add a leaf node with label = most common
            value of Target_attribute in Examples
            Else below this new branch add the subtree
            ID3(Examples_vi, Target_attribute, Attributes – {A})
End
Return Root
* The best attribute is the one with the highest information gain, as defined below.
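The pseudocode above maps fairly directly onto Python. The following is a minimal sketch, under the assumption that each example is a dictionary mapping attribute names to values and that the learned tree is a nested dictionary; entropy and information_gain (the measures defined in the next section) implement the best-attribute choice.

import math
from collections import Counter

def entropy(examples, target):
    # Entropy of the target-attribute values among the given examples
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute, target):
    # Expected reduction in entropy from partitioning the examples on attribute
    total = len(examples)
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, target, attributes):
    # Returns a class label (leaf) or a dict {attribute: {value: subtree, ...}}
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:          # all Examples positive or all negative
        return labels[0]
    if not attributes:                 # no attributes left to test
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in {e[best] for e in examples}:
        # Branches are only created for values that occur in the examples,
        # so the "Examples_vi is empty" case of the pseudocode never arises here.
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = id3(subset, target,
                                [a for a in attributes if a != best])
    return tree

Calling id3(examples, "PlayTennis", ["Outlook", "Temperature", "Humidity", "Wind"]) on the Play Tennis training examples used in the illustrative example below should rebuild the tree shown in the earlier figure.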
The central choice in the ID3 algorithm is selecting which attribute to test at each node
in the tree.
ID3 uses a statistical property called information gain, which measures how well a given
attribute separates the training examples according to their target classification.
ID3 uses this information gain measure to select among the candidate attributes at each
step while growing the tree.
Given a collection S, containing positive and negative examples of some target concept, the
entropy of S relative to this Boolean classification is

Entropy(S) = -p+ log2(p+) - p- log2(p-)

Where,
p+ is the proportion of positive examples in S
p- is the proportion of negative examples in S.
Example:
Suppose S is a collection of 14 examples of some Boolean concept, including 9 positive and
5 negative examples (denoted [9+, 5-]). Then the entropy of S relative to this Boolean classification is

Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
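As a quick arithmetic check (an illustrative snippet, not part of the notes), the same value can be computed directly:

import math

# Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14)
p_pos, p_neg = 9 / 14, 5 / 14
print(round(-p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg), 3))  # -> 0.94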
The information gain of an attribute A, relative to a collection of examples S, is defined as

Gain(S, A) = Entropy(S) - Σ (|Sv| / |S|) Entropy(Sv), where the sum is over v ∈ Values(A)

Here Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v.
An Illustrative Example
To illustrate the operation of ID3, consider the learning task represented by the training
examples in the table below.
Here the target attribute is Play Tennis, which can have values yes or no for different days.
Consider the first step through the algorithm, in which the topmost node of the
decision tree is created.
ID3 determines the information gain for each candidate attribute (i.e., Outlook,
Temperature, Humidity, and Wind), then selects the one with the highest information gain.
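As an illustration of this first step, the sketch below computes the gain of one attribute from the class counts of its branches. The counts used (splitting S = [9+, 5-] on Wind gives Weak = [6+, 2-] and Strong = [3+, 3-]) are those of the standard Play Tennis training set referenced above, so treat them as an assumption if your table differs.

import math

def entropy_from_counts(pos, neg):
    # Entropy of a collection with pos positive and neg negative examples
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

# S = [9+, 5-]; splitting on Wind gives Weak = [6+, 2-] and Strong = [3+, 3-]
branches = {"Weak": (6, 2), "Strong": (3, 3)}
total = sum(p + n for p, n in branches.values())
gain = entropy_from_counts(9, 5) - sum(
    (p + n) / total * entropy_from_counts(p, n) for p, n in branches.values())
print(round(gain, 3))  # -> 0.048, i.e. Gain(S, Wind)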
According to the information gain measure, the Outlook attribute provides the best
prediction of the target attribute, Play Tennis, over the training examples. Therefore,
Outlook is selected as the decision attribute for the root node, and branches are created
below the root for each of its possible values i.e., Sunny, Overcast, and Rain.
Occam's razor
Occam's razor is the problem-solving principle that the simplest solution tends to be
the right one. When presented with competing hypotheses to solve a problem, one
should select the solution with the fewest assumptions.
The below figure illustrates the impact of overfitting in a typical application of decision tree
learning.
The horizontal axis of this plot indicates the total number of nodes in the decision tree,
as the tree is being constructed. The vertical axis indicates the accuracy of predictions
made by the tree.
The solid line shows the accuracy of the decision tree over the training examples. The
broken line shows accuracy measured over an independent set of test examples.
The accuracy of the tree over the training examples increases monotonically as the tree
is grown. The accuracy measured over the independent test examples first increases,
then decreases.
Reduced-Error Pruning:
The impact of reduced-error pruning on the accuracy of the decision tree is illustrated in the below figure.
The additional line in the figure shows accuracy over the test examples as the tree is pruned. When
pruning begins, the tree is at its maximum size and lowest accuracy over the test set.
As pruning proceeds, the number of nodes is reduced and accuracy over the test set
increases.
The available data has been split into three subsets: the training examples, the validation examples
used for pruning the tree, and a set of test examples used to provide an unbiased
estimate of accuracy over future unseen examples. The plot shows accuracy over the
training and test sets.
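Reduced-error pruning considers each decision node as a candidate for pruning: the subtree rooted at the node is replaced by a leaf labelled with the most common classification of the training examples associated with that node, and the change is kept only if the pruned tree does no worse on the validation set. The sketch below is a simplified bottom-up variant of this idea, reusing the nested-dict tree representation from the earlier ID3 sketch; the helpers classify and correct are illustrative assumptions, not part of the notes.

from collections import Counter

def classify(tree, example):
    # Walk a nested-dict tree (as built by the ID3 sketch) down to a leaf label
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute].get(example[attribute])
    return tree

def correct(tree, examples, target):
    # Number of examples the tree classifies correctly
    return sum(classify(tree, e) == e[target] for e in examples)

def reduced_error_prune(tree, training, validation, target):
    # Replace a node by a leaf (the most common class of the training examples
    # reaching it) whenever the leaf does at least as well as the subtree on the
    # validation examples reaching that node.
    if not isinstance(tree, dict):
        return tree
    attribute = next(iter(tree))
    for value in list(tree[attribute]):
        tree[attribute][value] = reduced_error_prune(
            tree[attribute][value],
            [e for e in training if e[attribute] == value],
            [e for e in validation if e[attribute] == value],
            target)
    labels = [e[target] for e in training]
    if not labels:
        return tree  # no training examples reach this node; leave it unchanged
    leaf = Counter(labels).most_common(1)[0][0]
    if correct(leaf, validation, target) >= correct(tree, validation, target):
        return leaf
    return tree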
Rule Post-Pruning
In rule post-pruning, the learned tree is first converted into an equivalent set of rules by
creating one rule for each path from the root to a leaf node. Each rule is then pruned by
removing any precondition whose removal does not reduce its estimated accuracy.
For example, consider the decision tree in the figure below. Its leftmost path is
translated into the rule:
IF (Outlook = Sunny) ∧ (Humidity = High)
THEN Play Tennis = No
Given the above rule, rule post-pruning would consider removing the preconditions
(Outlook = Sunny) and (Humidity = High)
It would select whichever of these pruning steps produced the greatest improvement in
estimated rule accuracy, then consider pruning the second precondition as a further
pruning step.
No pruning step is performed if it reduces the estimated rule accuracy.
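A minimal sketch of this greedy precondition pruning is given below. The rule representation (a list of attribute-value pairs plus a conclusion) and the use of a held-out set of examples to estimate rule accuracy are assumptions for illustration; C4.5 itself uses a pessimistic estimate computed from the training data, which these notes do not detail.

def rule_accuracy(preconditions, conclusion, examples, target):
    # Fraction of the covered examples whose target value matches the conclusion
    covered = [e for e in examples if all(e[a] == v for a, v in preconditions)]
    if not covered:
        return 0.0
    return sum(e[target] == conclusion for e in covered) / len(covered)

def post_prune_rule(preconditions, conclusion, examples, target):
    # Greedily remove whichever precondition yields the largest improvement in
    # estimated accuracy; stop when every removal would reduce the accuracy.
    preconditions = list(preconditions)
    while preconditions:
        current = rule_accuracy(preconditions, conclusion, examples, target)
        scored = [(rule_accuracy([p for p in preconditions if p != q],
                                 conclusion, examples, target), q)
                  for q in preconditions]
        best_score, drop = max(scored)
        if best_score < current:
            break  # no pruning step is performed if it reduces estimated accuracy
        preconditions.remove(drop)
    return preconditions

# The rule from the leftmost path of the tree:
# IF (Outlook = Sunny) AND (Humidity = High) THEN PlayTennis = No
rule_preconditions = [("Outlook", "Sunny"), ("Humidity", "High")]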
There are three main advantages to converting the decision tree to rules before pruning:
1. Converting to rules allows distinguishing among the different contexts in which a
decision node is used, since each distinct path through the node produces a distinct rule
that can be pruned independently.
2. Converting to rules removes the distinction between attribute tests that occur near the
root of the tree and those that occur near the leaves, avoiding awkward bookkeeping about
how to restructure the tree if a node near the root is pruned.
3. Converting to rules improves readability; rules are often easier for people to understand.
The gain ratio measure penalizes attributes by incorporating a term, called split information,
that is sensitive to how broadly and uniformly the attribute splits the data.
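Concretely, SplitInformation(S, A) = -Σ (|Sv| / |S|) log2(|Sv| / |S|), summed over the values v of A, i.e. the entropy of S with respect to the values of A rather than the target attribute, and GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A). A minimal sketch, using the same examples-as-dictionaries representation as the earlier snippets:

import math

def split_information(examples, attribute):
    # Entropy of the examples with respect to the values of the attribute itself
    total = len(examples)
    info = 0.0
    for value in {e[attribute] for e in examples}:
        p = sum(1 for e in examples if e[attribute] == value) / total
        info -= p * math.log2(p)
    return info

def gain_ratio(gain, examples, attribute):
    # gain is Gain(S, A), computed as in the earlier information_gain sketch
    return gain / split_information(examples, attribute)

An attribute that splits the data into many small, uniformly sized subsets (such as a Date attribute) gets a large split information, so its gain ratio is driven down.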
The available data may contain missing values for some attributes.
Example: Medical diagnosis
<Fever = true, Blood-Pressure = normal, …, Blood-Test = ?,…>
Sometimes the values are truly unknown; sometimes they are of low priority (or the cost of obtaining them is too high).
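One common strategy (the details here are an assumption, since the notes do not spell them out) is to fill a missing value with the value that is most common among the training examples at the current node; more refined variants use the most common value among examples with the same classification, or distribute fractional examples across the branches as C4.5 does. A minimal sketch of the first strategy:

from collections import Counter

def fill_missing(examples, attribute, missing="?"):
    # Replace missing values of the attribute with its most common known value
    known = [e[attribute] for e in examples if e[attribute] != missing]
    if not known:
        return examples  # nothing to base the replacement on
    most_common = Counter(known).most_common(1)[0][0]
    return [dict(e, **{attribute: most_common}) if e[attribute] == missing else e
            for e in examples]

# Hypothetical medical-diagnosis examples in the spirit of the one above
patients = [
    {"Fever": "true", "BloodPressure": "normal", "BloodTest": "positive"},
    {"Fever": "true", "BloodPressure": "normal", "BloodTest": "?"},
    {"Fever": "false", "BloodPressure": "high", "BloodTest": "negative"},
]
print(fill_missing(patients, "BloodTest"))  # the "?" becomes "positive"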