Attribute Selection Measures: Decision Tree Based Classification


Attribute Selection Measures: Decision Tree Based Classification
- V Kumar
Attribute Selection Measures
• A heuristic measure for selecting the splitting criterion that best separates the given data into individual classes
• Ideally, each resulting partition would be pure; the best splitting criterion is the one that comes closest to producing pure partitions
• Also known as splitting rules, because they determine how the tuples at a node are split
• The measure provides a ranking of the candidate attributes
• For a continuous-valued attribute or a binary split, a split point or splitting subset must also be determined
Attribute Selection Measures
• Common measures: Information Gain, Gain Ratio, Gini Index
• Information gain: select the attribute with the highest information gain
• Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|
• Expected information (entropy) needed to classify a tuple in D:
    Info(D) = - Σ_{i=1..m} p_i log2(p_i)
• Information needed (after using A to split D into v partitions) to classify D:
    Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) · Info(D_j)
• Information gained by branching on attribute A:
    Gain(A) = Info(D) - Info_A(D)
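To make these formulas concrete, here is a minimal Python sketch (the helper names `info`, `info_after_split`, `gain` and the row/label layout are my own, not from the slides) that computes Info(D), Info_A(D), and Gain(A) for tuples stored as dictionaries of attribute values:

```python
from collections import Counter
from math import log2

def info(labels):
    """Expected information (entropy) Info(D) for a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_after_split(rows, labels, attr):
    """Info_A(D): weighted entropy after partitioning the tuples on `attr`."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    return sum(len(part) / n * info(part) for part in partitions.values())

def gain(rows, labels, attr):
    """Gain(A) = Info(D) - Info_A(D): information gained by branching on `attr`."""
    return info(labels) - info_after_split(rows, labels, attr)
```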
Attribute Selection: Information Gain

Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)

    Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

    Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Here (5/14) I(2,3) means the branch "age <= 30" holds 5 of the 14 samples, with 2 yes's and 3 no's. Hence

    Gain(age) = Info(D) - Info_age(D) = 0.246

    age     p_i  n_i  I(p_i, n_i)
    <=30    2    3    0.971
    31…40   4    0    0
    >40     3    2    0.971

Training data set D:

    age    income  student  credit_rating  buys_computer
    <=30   high    no       fair           no
    <=30   high    no       excellent      no
    31…40  high    no       fair           yes
    >40    medium  no       fair           yes
    >40    low     yes      fair           yes
    >40    low     yes      excellent      no
    31…40  low     yes      excellent      yes
    <=30   medium  no       fair           no
    <=30   low     yes      fair           yes
    >40    medium  yes      fair           yes
    <=30   medium  yes      excellent      yes
    31…40  medium  no       excellent      yes
    31…40  high    yes      fair           yes
    >40    medium  no       excellent      no

Similarly,

    Gain(income) = 0.029
    Gain(student) = 0.151
    Gain(credit_rating) = 0.048
Attribute Selection Measures
• Information gain, for the training data set D above:
• Gain(age) = 0.246
• Gain(income) = 0.029
• Gain(student) = 0.151
• Gain(credit_rating) = 0.048
• age has the highest information gain, so it is selected as the splitting attribute
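Reusing the `gain` helper sketched after the formula slide, the training data table above can be encoded directly and the four Gain values reproduced (illustrative code, not part of the slides):

```python
# The 14 training tuples from the table above.
rows = [
    {"age": "<=30",  "income": "high",   "student": "no",  "credit_rating": "fair"},
    {"age": "<=30",  "income": "high",   "student": "no",  "credit_rating": "excellent"},
    {"age": "31…40", "income": "high",   "student": "no",  "credit_rating": "fair"},
    {"age": ">40",   "income": "medium", "student": "no",  "credit_rating": "fair"},
    {"age": ">40",   "income": "low",    "student": "yes", "credit_rating": "fair"},
    {"age": ">40",   "income": "low",    "student": "yes", "credit_rating": "excellent"},
    {"age": "31…40", "income": "low",    "student": "yes", "credit_rating": "excellent"},
    {"age": "<=30",  "income": "medium", "student": "no",  "credit_rating": "fair"},
    {"age": "<=30",  "income": "low",    "student": "yes", "credit_rating": "fair"},
    {"age": ">40",   "income": "medium", "student": "yes", "credit_rating": "fair"},
    {"age": "<=30",  "income": "medium", "student": "yes", "credit_rating": "excellent"},
    {"age": "31…40", "income": "medium", "student": "no",  "credit_rating": "excellent"},
    {"age": "31…40", "income": "high",   "student": "yes", "credit_rating": "fair"},
    {"age": ">40",   "income": "medium", "student": "no",  "credit_rating": "excellent"},
]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]

for attr in ("age", "income", "student", "credit_rating"):
    print(attr, round(gain(rows, labels, attr), 3))
# Prints values matching the slide up to rounding of intermediate terms:
# age ≈ 0.247, income ≈ 0.029, student ≈ 0.152, credit_rating ≈ 0.048
```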
Attribute Selection Measures: Gain Ratio
• The information gain measure is biased toward tests with many outcomes (e.g., a split on an attribute that acts as a unique identifier produces a large number of pure, single-tuple partitions)
• The information required to classify D based on such a partitioning would be 0, so the information gained is maximal, yet the split is useless for classification
• C4.5 uses the gain ratio, a normalization of information gain, to overcome this bias:
    GainRatio(A) = Gain(A) / SplitInfo_A(D)
  where
    SplitInfo_A(D) = - Σ_{j=1..v} (|D_j| / |D|) log2(|D_j| / |D|)
• The attribute with the maximum gain ratio is selected as the splitting attribute
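Continuing the earlier sketch (same hypothetical helpers, reusing `Counter`, `log2`, and `gain` from the previous blocks), the gain ratio can be computed as:

```python
def split_info(rows, attr):
    """SplitInfo_A(D): entropy of the partition sizes induced by splitting on `attr`."""
    n = len(rows)
    counts = Counter(row[attr] for row in rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain_ratio(rows, labels, attr):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    return gain(rows, labels, attr) / split_info(rows, attr)
```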
Attribute Selection Measures
• Computation of the gain ratio for the training data set D (recall Gain(age) = 0.246, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048):

    SplitInfo_age(D) = -(5/14) log2(5/14) - (4/14) log2(4/14) - (5/14) log2(5/14)
                     ≈ 0.531 + 0.516 + 0.531 ≈ 1.577
    GainRatio(age) = 0.246 / 1.577 ≈ 0.156

    SplitInfo_income(D) = -(4/14) log2(4/14) - (4/14) log2(4/14) - (6/14) log2(6/14)
                        ≈ 0.516 + 0.516 + 0.524 ≈ 1.557
    GainRatio(income) = 0.029 / 1.557 ≈ 0.019

    age     p_i  n_i      income  p_i  n_i
    <=30    2    3        high    2    2
    31…40   4    0        low     3    1
    >40     3    2        medium  4    2
Attribute Selection Measures
    GainRatio(age) ≈ 0.156
    GainRatio(income) ≈ 0.019

    SplitInfo_student(D) = -(7/14) log2(7/14) - (7/14) log2(7/14) = 0.5 + 0.5 = 1.0
    GainRatio(student) = 0.151 / 1.0 ≈ 0.151

    SplitInfo_credit_rating(D) = -(8/14) log2(8/14) - (6/14) log2(6/14)
                               ≈ 0.461 + 0.524 ≈ 0.985
    GainRatio(credit_rating) = 0.048 / 0.985 ≈ 0.049

    student  p_i  n_i      credit_rating  p_i  n_i
    yes      6    1        fair           6    2
    no       3    4        excellent      3    3
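As a quick cross-check of these hand computations, the `gain_ratio` sketch above can be run over the `rows`/`labels` encoding of the training data (illustrative only):

```python
for attr in ("age", "income", "student", "credit_rating"):
    print(attr, round(gain_ratio(rows, labels, attr), 3))
# ≈ age 0.156, income 0.019, student 0.152, credit_rating 0.049
# (matches the values above up to rounding of the intermediate terms)
```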
Attribute Selection Measures: Gini Index
• Used in CART, the Gini index measures the impurity of a data set D:
    Gini(D) = 1 - Σ_{i=1..m} p_i²
  where p_i is the probability that a tuple in D belongs to class C_i
• The Gini index considers a binary split for each attribute
• Let A be a discrete-valued attribute with v distinct values {a1, a2, …, av}; excluding the full set and the empty set, there are 2^v - 2 possible subsets, each defining a candidate binary split
• If a binary split on A partitions D into D1 and D2, the Gini index of D given that partitioning is:
    Gini_A(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)
• The reduction in impurity from a binary split on A is:
    ΔGini(A) = Gini(D) - Gini_A(D)
• For each attribute, determine the best binary split: the subset that minimizes Gini_A(D)
• Gini index example: income ∈ {low, medium, high} has binary splits such as {low, medium} vs. {high} (worked out on the next slides)
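A short sketch of these definitions in code (helper names are again my own, with `Counter` reused from the earlier block), including enumeration of the 2^v - 2 candidate subsets for a discrete-valued attribute:

```python
from itertools import combinations

def gini(labels):
    """Gini(D) = 1 - sum_i p_i^2 over the class distribution of `labels`."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_binary_split(rows, labels, attr, subset):
    """Gini_A(D) for the binary split: `attr` value in `subset` vs. not in `subset`."""
    n = len(labels)
    d1 = [lab for row, lab in zip(rows, labels) if row[attr] in subset]
    d2 = [lab for row, lab in zip(rows, labels) if row[attr] not in subset]
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

def best_binary_split(rows, labels, attr):
    """Return the proper, non-empty value subset of `attr` with the lowest Gini index.

    Subsets of size 1..v-1 cover all 2^v - 2 candidates (each split appears twice,
    once as a subset and once as its complement, which is harmless here)."""
    values = sorted({row[attr] for row in rows})
    candidates = [set(c) for r in range(1, len(values))
                  for c in combinations(values, r)]
    return min(candidates, key=lambda s: gini_binary_split(rows, labels, attr, s))
```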
Attribute Selection Measures: Gini Index
• Gini index of D (9 "yes" tuples, 5 "no" tuples):
    Gini(D) = 1 - (9/14)² - (5/14)² = 0.459
• Compute the Gini index for each attribute, using the same training data set D as above
• income ∈ {low, medium, high}; candidate binary splits:
    {{low, high}, {medium}}   {{low, medium}, {high}}   {{high, medium}, {low}}
• Split {{low, medium}, {high}}:
    Gini_{income ∈ {low, medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
                                     = (10/14)(1 - (7/10)² - (3/10)²) + (4/14)(1 - (2/4)² - (2/4)²)
                                     = 0.714 × 0.420 + 0.286 × 0.500
                                     ≈ 0.300 + 0.143 = 0.443
Attribute Selection Measures: Gini Index
• Split {{low, high}, {medium}}:
    Gini_{income ∈ {low, high}}(D) = (8/14)(1 - (5/8)² - (3/8)²) + (6/14)(1 - (4/6)² - (2/6)²)
                                   = 0.571 × 0.469 + 0.429 × 0.444
                                   ≈ 0.268 + 0.190 = 0.458
• Split {{high, medium}, {low}}:
    Gini_{income ∈ {high, medium}}(D) = (10/14)(1 - (6/10)² - (4/10)²) + (4/14)(1 - (3/4)² - (1/4)²)
                                      = 0.714 × 0.480 + 0.286 × 0.375
                                      ≈ 0.343 + 0.107 = 0.450
Attribute Selection Measures: Gini Index
• The best binary split for income is therefore {{low, medium}, {high}}, with Gini index 0.443:
    ΔGini(income) = 0.459 - 0.443 = 0.016
• The best binary split for age is {{youth, senior}, {middle-aged}}, i.e. {{<=30, >40}, {31…40}}, with Gini index 0.357:
    ΔGini(age) = 0.459 - 0.357 = 0.102
• student and credit_rating are already binary, with Gini index values of 0.367 and 0.429:
    ΔGini(student) = 0.459 - 0.367 = 0.092
    ΔGini(credit_rating) = 0.459 - 0.429 = 0.030
• Ranking the attributes by reduction in impurity:
    age (0.102) > student (0.092) > credit_rating (0.030) > income (0.016)
  so age, with the splitting subset {youth, senior}, gives the largest reduction in impurity
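With the same `rows`/`labels` encoding and the Gini helpers sketched above, these numbers can be reproduced (illustrative only):

```python
print(round(gini(labels), 3))        # Gini(D) ≈ 0.459

for subset in ({"low", "medium"}, {"low", "high"}, {"high", "medium"}):
    print(sorted(subset), round(gini_binary_split(rows, labels, "income", subset), 3))
# ≈ 0.443, 0.458, 0.450 -> {low, medium} vs. {high} is income's best binary split

print(round(gini_binary_split(rows, labels, "age", {"<=30", ">40"}), 3))
# ≈ 0.357 for the {youth, senior} vs. {middle-aged} split on age
```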
Other Attribute Selection Measures
 CHAID: a popular decision tree algorithm, measure based on χ2 test
for independence
 C-SEP: performs better than info. gain and gini index in certain cases
 G-statistic: closely approximated by the χ2 distribution
 MDL (Minimal Description Length) principle (i.e., the simplest solution
is preferred):
 The best tree as the one that requires the fewest # of bits to both
(1) encode the tree, and (2) encode the exceptions to the tree
 Multivariate splits (partition based on multiple variable combinations)
 CART: finds multivariate splits based on a linear comb. of attrs.
 Which attribute selection measure is the best?
 Most give good results, but none is significantly superior to the others



Comparing Attribute Selection Measures
 The three measures, in general, return good results but
 Information gain:
 biased towards multivalued attributes
 Gain ratio:
 tends to prefer unbalanced splits in which one
partition is much smaller than the others
 Gini index:
 biased toward multivalued attributes
 has difficulty when # of classes is large
 tends to favor tests that result in equal-sized
partitions and purity in both partitions
Which attribute selection measure is the best?
All measures have some bias:
• The time complexity of decision tree induction generally increases exponentially with tree height, so measures that produce shallower trees may be preferred
• However, shallow trees tend to have a large number of leaves and higher error rates
• No one attribute selection measure has been found to be significantly superior to the others
THANK YOU
