Attribute Selection Measures: Decision Tree Based Classification
- V. Kumar
Attribute Selection Measures
• A heuristic measure for selecting the splitting criterion
• The splitting criterion should best separate the given data into individual classes
• Ideally, each resulting partition would be pure (all tuples belong to one class)
• The best splitting criterion is the one that comes closest to producing pure partitions
• Also known as splitting rules
• Provide a ranking of the candidate attributes
• For continuous-valued attributes or binary splits, they also determine the split point or the splitting subset
Attribute Selection Measures
• Three common measures: information gain, gain ratio, Gini index
• Information gain is based on the expected information (entropy) needed to classify a tuple in D:
  Info(D) = -Σ_{i=1..m} pi log2(pi)
• For the AllElectronics training set:
  • Gain(age) = 0.246
  • Gain(income) = 0.029
  • Gain(student) = 0.151
  • Gain(credit_rating) = 0.048
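These entropy and gain figures can be reproduced with a short sketch (the helper names `info` and `gain` are mine, not from the slides; counts come from the training set shown later):

```python
from math import log2

def info(counts):
    """Expected information (entropy): Info(D) = -sum(p_i * log2(p_i))."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain(class_counts, partitions):
    """Gain(A) = Info(D) - sum(|Dj|/|D| * Info(Dj)) over A's partitions."""
    total = sum(class_counts)
    info_a = sum(sum(p) / total * info(p) for p in partitions)
    return info(class_counts) - info_a

# The training set has 9 tuples with buys_computer=yes and 5 with no.
print(round(info([9, 5]), 2))  # Info(D) = 0.94

# age partitions: <=30 -> 2 yes/3 no, 31..40 -> 4 yes/0 no, >40 -> 3 yes/2 no
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # ~0.247
```

The slide's 0.246 comes from rounding the intermediate Info values first; computing at full precision gives 0.2467.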
Attribute Selection Measures: Gain ratio
• GainRatio(A) = Gain(A) / SplitInfo_A(D), where
  SplitInfo_A(D) = -Σ_j (|Dj|/|D|) log2(|Dj|/|D|)
• income (partitions of size 4, 6, 4): SplitInfo = 1.557
  GainRatio(income) = 0.029/1.557 ≈ 0.019
• student (7 yes, 7 no): SplitInfo = -0.5*log2(0.5) - 0.5*log2(0.5) = 1.0
  GainRatio(student) = 0.151/1.0 = 0.151
• credit_rating (8 fair, 6 excellent):
  SplitInfo = -0.571*(3 - 3.807) - 0.428*(2.584 - 3.807) = 0.461 + 0.523 ≈ 0.984
  GainRatio(credit_rating) = 0.048/0.984 ≈ 0.049
  student   pi  ni
  yes        6   1
  no         3   4

  credit_rating  pi  ni
  fair            6   2
  excellent       3   3
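The split-information terms can be checked with a minimal sketch, assuming the standard C4.5 definition of SplitInfo (the helper name `split_info` is mine; the gains 0.151 and 0.048 are from the earlier slide):

```python
from math import log2

def split_info(sizes):
    """SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|)) over A's partitions."""
    total = sum(sizes)
    return -sum(s / total * log2(s / total) for s in sizes if s)

# student splits D into 7 students and 7 non-students:
si_student = split_info([7, 7])       # exactly 1.0
print(round(0.151 / si_student, 3))   # GainRatio(student) = 0.151

# credit_rating splits D into 8 fair and 6 excellent:
si_credit = split_info([8, 6])        # ~0.985
print(round(0.048 / si_credit, 3))    # GainRatio(credit_rating) ~ 0.049
```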
Attribute Selection Measures: Gini Index
• Used in CART; the Gini index measures the impurity of a data partition D:
  Gini(D) = 1 - Σ_{i=1..m} pi²
• For the training set below (9 yes, 5 no):
  Gini(D) = 1 - (9/14)² - (5/14)² = 0.459

  age    income  student  credit_rating  buys_computer
  <=30   high    no       fair           no
  <=30   high    no       excellent      no
  31…40  high    no       fair           yes
  >40    medium  no       fair           yes
  >40    low     yes      fair           yes
  >40    low     yes      excellent      no
  31…40  low     yes      excellent      yes
  <=30   medium  no       fair           no
  <=30   low     yes      fair           yes
  >40    medium  yes      fair           yes
  <=30   medium  yes      excellent      yes
  31…40  medium  no       excellent      yes
  31…40  high    yes      fair           yes
  >40    medium  no       excellent      no

• Binary splits of income:
  • {{low, high},{medium}}: Gini = 0.571*0.469 + 0.428*0.444 ≈ 0.458
  • {{high, medium},{low}}: Gini = 0.714*0.480 + 0.285*0.375 = 0.343 + 0.107 ≈ 0.450
  • {{low, medium},{high}}: Gini = 0.714*0.420 + 0.285*0.500 ≈ 0.443 (the best binary split for income)
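The Gini arithmetic above can be sketched as follows (helper names `gini` and `gini_split` are mine; the class counts are taken from the table):

```python
def gini(counts):
    """Gini(D) = 1 - sum(p_i^2) over the class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini index of a binary split: sum(|Dj|/|D| * Gini(Dj))."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

print(round(gini([9, 5]), 3))  # Gini(D) = 0.459

# income in {low, medium} (7 yes / 3 no) vs {high} (2 yes / 2 no):
print(round(gini_split([[7, 3], [2, 2]]), 3))  # 0.443
```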
Attribute Selection Measures: Gini Index (contd.)
• student and credit_rating are both binary splits, with Gini index 0.367 and 0.429:
  ΔGini(student) = 0.459 - 0.367 = 0.092
  ΔGini(credit_rating) = 0.459 - 0.429 = 0.030
• The best binary split for income is {{low, medium},{high}}, with Gini index 0.443:
  ΔGini(income) = 0.459 - 0.443 = 0.016
• The best split for age is {{youth, senior},{middle-aged}}, with Gini index 0.357:
  ΔGini(age) = 0.459 - 0.357 = 0.102
• The attribute ranking is:
  Age({{youth, senior},{middle-aged}}) = 0.102
  Student({yes, no}) = 0.092
  Credit_rating({fair, excellent}) = 0.030
  Income({{low, medium},{high}}) = 0.016
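Finding the best binary split of a discrete-valued attribute means enumerating its subset groupings, as a minimal sketch (function names are mine; age counts come from the table):

```python
from itertools import combinations

def gini(counts):
    """Gini(D) = 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def best_binary_split(value_counts):
    """Try every non-trivial two-subset grouping of the attribute's values
    and return the grouping with the lowest weighted Gini index."""
    values = list(value_counts)
    total = sum(sum(c) for c in value_counts.values())
    best = None
    for r in range(1, len(values)):
        for left in combinations(values, r):
            right = [v for v in values if v not in left]
            g = 0.0
            for side in (left, right):
                yes = sum(value_counts[v][0] for v in side)
                no = sum(value_counts[v][1] for v in side)
                g += (yes + no) / total * gini([yes, no])
            if best is None or g < best[0]:
                best = (g, set(left))
    return best

# age: (yes, no) class counts per value
age = {"youth": (2, 3), "middle_aged": (4, 0), "senior": (3, 2)}
best_g, subset = best_binary_split(age)
print(round(best_g, 3), subset)  # 0.357 {'middle_aged'}
```

Because the middle-aged partition is pure (4 yes, 0 no), splitting it out gives the overall minimum Gini index, 10/14 * 0.5 = 0.357.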
Other Attribute Selection Measures
• CHAID: a popular decision tree algorithm whose measure is based on the χ² test for independence
• C-SEP: performs better than information gain and Gini index in certain cases
• G-statistic: has a close approximation to the χ² distribution
• MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
  • The best tree is the one that requires the fewest number of bits to both (1) encode the tree and (2) encode the exceptions to the tree
• Multivariate splits (partitioning based on combinations of multiple variables)
  • CART: finds multivariate splits based on a linear combination of attributes
• Which attribute selection measure is the best?
  • Most give good results; none is significantly superior to the others