Attribute Selection Measures: Decision Tree Based Classification
- V. Kumar
Attribute Selection Measures
• A heuristic measure for selecting the splitting criterion
• The splitting criterion should best separate the given data into individual classes
• Ideally, each resulting partition would be pure (all tuples belong to one class)
• The best splitting criterion is the one that comes closest to producing pure partitions
• Also known as splitting rules
• Provide a ranking of the candidate attributes
• For continuous-valued attributes or binary splits, they also determine the split point or the splitting subset
Attribute Selection Measures
• Three common measures: information gain, gain ratio, Gini index
• Information gain is based on the expected information (entropy) needed to classify a tuple in D:
  Info(D) = -Σ_{i=1..m} pi log2(pi)
• For the AllElectronics training set:
  • Gain(age) = 0.246
  • Gain(income) = 0.029
  • Gain(student) = 0.151
  • Gain(credit_rating) = 0.048
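These entropy and gain figures can be reproduced with a short sketch (the helper names `info` and `gain` are mine, not from the slides; counts come from the training set shown later):

```python
from math import log2

def info(counts):
    """Expected information (entropy): Info(D) = -sum(p_i * log2(p_i))."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain(class_counts, partitions):
    """Gain(A) = Info(D) - sum(|Dj|/|D| * Info(Dj)) over A's partitions."""
    total = sum(class_counts)
    info_a = sum(sum(p) / total * info(p) for p in partitions)
    return info(class_counts) - info_a

# The training set has 9 tuples with buys_computer=yes and 5 with no.
print(round(info([9, 5]), 2))  # Info(D) = 0.94

# age partitions: <=30 -> 2 yes/3 no, 31..40 -> 4 yes/0 no, >40 -> 3 yes/2 no
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # ~0.247
```

The slide's 0.246 comes from rounding the intermediate Info values first; computing at full precision gives 0.2467.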
Attribute Selection Measures: Gain ratio
• GainRatio(A) = Gain(A) / SplitInfo_A(D), where
  SplitInfo_A(D) = -Σ_j (|Dj|/|D|) log2(|Dj|/|D|)
• income (partitions of size 4, 6, 4): SplitInfo = 1.557
  GainRatio(income) = 0.029/1.557 ≈ 0.019
• student (7 yes, 7 no): SplitInfo = -0.5*log2(0.5) - 0.5*log2(0.5) = 1.0
  GainRatio(student) = 0.151/1.0 = 0.151
• credit_rating (8 fair, 6 excellent):
  SplitInfo = -0.571*(3 - 3.807) - 0.428*(2.584 - 3.807) = 0.461 + 0.523 ≈ 0.984
  GainRatio(credit_rating) = 0.048/0.984 ≈ 0.049
  student   pi  ni
  yes        6   1
  no         3   4

  credit_rating  pi  ni
  fair            6   2
  excellent       3   3
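The split-information terms can be checked with a minimal sketch, assuming the standard C4.5 definition of SplitInfo (the helper name `split_info` is mine; the gains 0.151 and 0.048 are from the earlier slide):

```python
from math import log2

def split_info(sizes):
    """SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|)) over A's partitions."""
    total = sum(sizes)
    return -sum(s / total * log2(s / total) for s in sizes if s)

# student splits D into 7 students and 7 non-students:
si_student = split_info([7, 7])       # exactly 1.0
print(round(0.151 / si_student, 3))   # GainRatio(student) = 0.151

# credit_rating splits D into 8 fair and 6 excellent:
si_credit = split_info([8, 6])        # ~0.985
print(round(0.048 / si_credit, 3))    # GainRatio(credit_rating) ~ 0.049
```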
Attribute Selection Measures: Gini Index
• Used in CART; the Gini index measures the impurity of a data partition D:
  Gini(D) = 1 - Σ_{i=1..m} pi²
• For the training set below (9 yes, 5 no):
  Gini(D) = 1 - (9/14)² - (5/14)² = 0.459

  age    income  student  credit_rating  buys_computer
  <=30   high    no       fair           no
  <=30   high    no       excellent      no
  31…40  high    no       fair           yes
  >40    medium  no       fair           yes
  >40    low     yes      fair           yes
  >40    low     yes      excellent      no
  31…40  low     yes      excellent      yes
  <=30   medium  no       fair           no
  <=30   low     yes      fair           yes
  >40    medium  yes      fair           yes
  <=30   medium  yes      excellent      yes
  31…40  medium  no       excellent      yes
  31…40  high    yes      fair           yes
  >40    medium  no       excellent      no

• Binary splits of income:
  • {{low, high},{medium}}: Gini = 0.571*0.469 + 0.428*0.444 ≈ 0.458
  • {{high, medium},{low}}: Gini = 0.714*0.480 + 0.285*0.375 = 0.343 + 0.107 ≈ 0.450
  • {{low, medium},{high}}: Gini = 0.714*0.420 + 0.285*0.500 ≈ 0.443 (the best binary split for income)
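The Gini arithmetic above can be sketched as follows (helper names `gini` and `gini_split` are mine; the class counts are taken from the table):

```python
def gini(counts):
    """Gini(D) = 1 - sum(p_i^2) over the class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini index of a binary split: sum(|Dj|/|D| * Gini(Dj))."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

print(round(gini([9, 5]), 3))  # Gini(D) = 0.459

# income in {low, medium} (7 yes / 3 no) vs {high} (2 yes / 2 no):
print(round(gini_split([[7, 3], [2, 2]]), 3))  # 0.443
```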
Attribute Selection Measures: Gini Index (contd.)
• student and credit_rating are both binary splits, with Gini index 0.367 and 0.429:
  ΔGini(student) = 0.459 - 0.367 = 0.092
  ΔGini(credit_rating) = 0.459 - 0.429 = 0.030
• The best binary split for income is {{low, medium},{high}}, with Gini index 0.443:
  ΔGini(income) = 0.459 - 0.443 = 0.016
• The best split for age is {{youth, senior},{middle-aged}}, with Gini index 0.357:
  ΔGini(age) = 0.459 - 0.357 = 0.102
• The attribute ranking is:
  Age({{youth, senior},{middle-aged}}) = 0.102
  Student({yes, no}) = 0.092
  Credit_rating({fair, excellent}) = 0.030
  Income({{low, medium},{high}}) = 0.016
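Finding the best binary split of a discrete-valued attribute means enumerating its subset groupings, as a minimal sketch (function names are mine; age counts come from the table):

```python
from itertools import combinations

def gini(counts):
    """Gini(D) = 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def best_binary_split(value_counts):
    """Try every non-trivial two-subset grouping of the attribute's values
    and return the grouping with the lowest weighted Gini index."""
    values = list(value_counts)
    total = sum(sum(c) for c in value_counts.values())
    best = None
    for r in range(1, len(values)):
        for left in combinations(values, r):
            right = [v for v in values if v not in left]
            g = 0.0
            for side in (left, right):
                yes = sum(value_counts[v][0] for v in side)
                no = sum(value_counts[v][1] for v in side)
                g += (yes + no) / total * gini([yes, no])
            if best is None or g < best[0]:
                best = (g, set(left))
    return best

# age: (yes, no) class counts per value
age = {"youth": (2, 3), "middle_aged": (4, 0), "senior": (3, 2)}
best_g, subset = best_binary_split(age)
print(round(best_g, 3), subset)  # 0.357 {'middle_aged'}
```

Because the middle-aged partition is pure (4 yes, 0 no), splitting it out gives the overall minimum Gini index, 10/14 * 0.5 = 0.357.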
Other Attribute Selection Measures
• CHAID: a popular decision tree algorithm whose measure is based on the χ² test for independence
• C-SEP: performs better than information gain and Gini index in certain cases
• G-statistic: has a close approximation to the χ² distribution
• MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
  • The best tree is the one that requires the fewest number of bits to both (1) encode the tree and (2) encode the exceptions to the tree
• Multivariate splits (partitioning based on combinations of multiple variables)
  • CART: finds multivariate splits based on a linear combination of attributes
• Which attribute selection measure is the best?
  • Most give good results; none is significantly superior to the others