Unit 4 Classification


Unit-4

Classification
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of training data are unknown
– Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in
the data
2
Classification vs. Numeric Prediction
• Classification
– predicts categorical class labels (discrete or nominal)
– classifies data (constructs a model) based on the training
set and the values (class labels) in a classifying attribute
and uses it in classifying new data
• Numeric Prediction
– models continuous-valued functions, i.e., predicts unknown
or missing values
• Typical applications
– Credit/loan approval
– Medical diagnosis: whether a tumor is cancerous or benign
– Fraud detection: whether a transaction is fraudulent
– Web page categorization: which category a page belongs to
3
Introduction to classification
 Classification is a supervised learning method.
 It is a data mining function that assigns items in a collection to
target categories or classes. 
 The goal of classification is to accurately predict the target class for
each case in the data.
 For example, a classification model could be used to identify loan
applicants as low, medium, or high credit risks.
 In supervised learning, the learner (a computer program) is provided
with two sets of data: a training data set and a test data set.
 The idea is for the learner to “learn” from a set of labeled examples
in the training set so that it can identify unlabeled examples in the
test set with the highest possible accuracy.
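A minimal sketch of this train/test workflow (Python with scikit-learn assumed; the iris data set here just stands in for any labeled collection):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Labeled examples: X holds the attribute values, y the class labels.
X, y = load_iris(return_X_y=True)

# Split into a training set (to learn from) and a test set (to evaluate on).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Learn a model from the labeled training examples ...
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# ... then identify (classify) the held-out test examples.
print("test accuracy:", model.score(X_test, y_test))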
4
Introduction to classification (Cont..)
 Suppose a database D is given as D = {t1, t2, …, tn} and a set of desired
classes is C = {C1, …, Cm}.
 The classification problem is to define a mapping m such that each tuple
of database D is assigned to one class of C.
 In effect, the mapping partitions D into equivalence classes.
 Prediction is similar, but can be viewed as having an infinite number of
classes.

5
Classification Example
 Teachers classify students’ grades as A, B, C, D or E.
 Identify individuals with credit risks (high, low, medium or
unknown).
 In cricket (batsman, bowler, all-rounder)
 Websites (educational, sports, music)

6
Classification Example (Cont..)
 How do teachers assign grades to students based on their obtained
marks?
• If x >= 90 then A grade.
• If 80 <= x < 90 then B grade.
• If 70 <= x < 80 then C grade.
• If 60 <= x < 70 then D grade.
• If x < 60 then E grade.

[Figure: decision tree that tests the mark x against the thresholds 90, 80, 70 and 60 in turn, with leaves A, B, C, D and E.]
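The same rules, written as a small Python function:

def grade(x):
    # Follow the tests from the tree above, top-down.
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:
        return "D"
    return "E"

print(grade(85))  # B
print(grade(42))  # E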

7
Classification : a two step process
1) Model Construction
Classification algorithms are applied to the training data to construct the classifier (model).

Training data:
Name | Rank | Years | Tenured
Mike | Asst. Prof. | 3 | No
Mary | Asst. Prof. | 7 | Yes
Bill | Prof. | 2 | Yes
Jim | Asso. Prof. | 7 | Yes
Dave | Asst. Prof. | 6 | No
Anne | Asso. Prof. | 3 | No

Resulting classifier (model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

8
Classification : a two step process (Cont..)
2) Model Usage
The classifier is applied first to testing data, to estimate its accuracy, and then to unseen data.

Testing data:
Name | Rank | Years | Tenured
Tom | Asst. Prof. | 2 | No
Merlisa | Asso. Prof. | 7 | No
George | Prof. | 5 | Yes
Joseph | Asst. Prof. | 7 | Yes

Unseen data: (Jeff, Professor, 4) → Tenured? Yes

9
Classification : a two step process (Cont..)
1) Model Construction
• Describing a set of predetermined classes :
o Each tuple/sample is assumed to belong to a predefined class, as determined by
the class label attribute.
o The set of tuples used for model construction is called the training set.
o The model is represented as classification rules, decision trees, or
mathematical formulae.

2) Model Usage
• For classifying future or unknown objects
o Estimate accuracy of the model:
o The known label of each test sample is compared with the model’s classification result.
o The accuracy rate is the percentage of test set samples that are correctly classified by the model.

10
Classification & prediction issues
Data Preparation
 Data cleaning
• Pre-process data in order to reduce noise and handle missing values.
 Relevance analysis (Feature selection)
• Remove the irrelevant or redundant attributes.
 Data transformation
• Generalize the data to higher level concepts using concept hierarchies and/or
normalize data which involves scaling the values.

11
Classification & prediction methods
Classification methods: Decision Tree, Bayesian Classification, Rule Based Classification, Neural Network

Prediction methods: Linear Regression, Non-Linear Regression, Logistic Regression
13
Decision tree
 One of the most common tasks is to build models for the
prediction of the class of an object on the basis of its attributes.
 The objects can be a customer, patient, transaction, e-mail message or
even a single character.
 Attributes of a patient object can be heart rate, blood pressure, weight,
gender, etc.
 The class of the patient object would most commonly be
positive/negative for a certain disease.

14
Decision tree (Cont..)
 In a decision tree, instances are represented by a fixed set of attributes
(e.g. gender) and their values (e.g. male, female), described as
attribute-value pairs.
 If each attribute has a small number of disjoint possible values (e.g.
high, medium, low) or there are only two possible classes (e.g.
true, false), then decision tree learning is easy.
 Extensions to the decision tree algorithm also handle real-valued
attributes (e.g. salary).
 It gives a class label to each instance of the dataset.

15
Decision tree (Cont..)
 Decision tree is a classifier in the form of a tree structure
• Decision node: Specifies a test on a single attribute
• Leaf node: Indicates the value of the target attribute
• Arc/edge: Split of one attribute
• Path: A conjunction of tests that leads to the final decision

16
Decision tree representation- example
[Figure: example decision tree: a root node with branches leading to leaf nodes; each leaf node yields a set of possible answers.]
17
Decision Tree Induction: An Example
 Training data set: Buys_computer
 The data set follows an example of Quinlan’s ID3 (Playing Tennis)

age | income | student | credit_rating | buys_computer
<=30 | high | no | fair | no
<=30 | high | no | excellent | no
31…40 | high | no | fair | yes
>40 | medium | no | fair | yes
>40 | low | yes | fair | yes
>40 | low | yes | excellent | no
31…40 | low | yes | excellent | yes
<=30 | medium | no | fair | no
<=30 | low | yes | fair | yes
>40 | medium | yes | fair | yes
<=30 | medium | yes | excellent | yes
31…40 | medium | no | excellent | yes
31…40 | high | yes | fair | yes
>40 | medium | no | excellent | no

 Resulting tree:
age?: <=30 → student? (no → no, yes → yes); 31..40 → yes; >40 → credit rating? (excellent → no, fair → yes)
18
Key requirements for classification
 Sufficient data:
• Enough training cases should be provided to learn the model.

 Attribute-value description:
• Object or case must be expressible in terms of a fixed collection of
properties or attributes (e.g., hot, mild, cold).

 Predefined classes (target values):


• The target function has discrete output values (boolean or multiclass)

19
Important terms for decision tree
 Entropy
 Attribute Selection Measure
✔ Information Gain
✔ Gain Ratio
✔ Gini Index

20
Entropy (E)
 It measures the uncertainty of a decision
• 0 if completely certain,
• 1 if completely uncertain (two equally likely classes),
• Normally entropy lies between 0 and 1; it is a probability-based
measure used to calculate the amount of uncertainty.

21
Entropy (E) (Cont..)
 It measures how much information we don’t know (how
uncertain we are about the data).
 It can also be used to measure how much information we gain
from an attribute when the target attribute is revealed to us.
 Which attribute is best?
✔ The attribute with the largest expected reduction in entropy is the 'best'
attribute to use next.
✔ A large expected reduction means that splitting on that attribute removes
most of the uncertainty, leaving the resulting subsets close to certain.

22
Entropy (E) (Cont..)
 A decision tree is built top-down from a root node and involves
partitioning the data into subsets that contain instances with
similar values (homogenous).
 ID3 algorithm uses entropy to calculate the homogeneity of a
sample.
 If the sample is completely homogeneous the entropy is zero, and
if the sample is equally divided it has an entropy of one.
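A minimal sketch of this entropy calculation (plain Python):

import math

def entropy(counts):
    # counts: class counts in a sample, e.g. [9, 5] for 9 yes / 5 no.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([14, 0]))  # 0.0, completely homogeneous
print(entropy([7, 7]))   # 1.0, equally divided
print(entropy([9, 5]))   # ~0.940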

23
Attribute Selection Measures
 It is a heuristic for selecting the splitting criterion that “best”
separates a given data partition, D, of class-labelled training
tuples into individual classes.

 Attribute selection measures are also known as splitting rules.

24
Information Gain
 Information gain can be used for continuous-valued (numeric)
attributes.
 The attribute which has the highest information gain is selected
for the split.
 Assume that there are two classes, P (positive) and N (negative).
 Suppose we have S samples; of these, p samples belong to
class P and n samples belong to class N.
 The amount of information needed to decide whether an arbitrary
sample in S belongs to P or N is defined as

I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

25
Information Gain (ID3/C4.5)
 Expected information (entropy) needed to classify a tuple in D:

Info(D) = -Σ_{i=1..m} p_i log2(p_i)

 Information needed (after using A to split D into v partitions) to
classify D:

Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)

 Information gained by branching on attribute A:

Gain(A) = Info(D) - Info_A(D)


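A minimal sketch of Gain(A) for the age split worked out on the next slide (plain Python; entropy as in the earlier sketch):

import math

def entropy(counts):  # as in the earlier sketch
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_gain(parent, partitions):
    # Gain(A) = Info(D) - Info_A(D), where Info_A(D) is the
    # size-weighted entropy of the partitions induced by attribute A.
    total = sum(parent)
    info_a = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent) - info_a

# Splitting the 14-tuple buys_computer data on age:
# <=30 -> (2 yes, 3 no), 31..40 -> (4, 0), >40 -> (3, 2)
print(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # ~0.246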
26
Attribute Selection: Information Gain
 Class P: buys_computer = “yes” (9 tuples)
 Class N: buys_computer = “no” (5 tuples)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

age | p_i | n_i | I(p_i, n_i)
<=30 | 2 | 3 | 0.971
31…40 | 4 | 0 | 0
>40 | 3 | 2 | 0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

(5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s.

Hence Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

(The training data is the buys_computer table shown on the previous slide.)
27
Gain Ratio (C4.5)
• Information gain measure is biased towards attributes with a
large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
SplitInfo_A(D) = -Σ_{j=1..v} (|D_j| / |D|) × log2(|D_j| / |D|)

– GainRatio(A) = Gain(A) / SplitInfo_A(D)
• Ex.: with SplitInfo_income(D) = 1.557,
– gain_ratio(income) = 0.029 / 1.557 = 0.019


• The attribute with the maximum gain ratio is selected as the
splitting attribute
28
Gini Index
 Assume there exist several possible split values for each attribute.
 We may need other tools, such as clustering, to get the possible
split values.
 It can be modified for categorical attributes.
 An alternative method to information gain is called the Gini Index.
 Gini is used in CART (Classification and Regression Trees).
 If a dataset T contains examples from n classes, the Gini index, gini(T),
is defined as

gini(T) = 1 - Σ_{j=1..n} (p_j)²

• n: the number of classes
• p_j: the relative frequency of class j in T

29
Gini Index (Cont..)
 After splitting T into two subsets T1 and T2 with sizes N1 and N2
(out of N tuples in total), the Gini index of the split data is defined as

Gini_split(T) = (N1/N) × gini(T1) + (N2/N) × gini(T2)

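A minimal sketch of both formulas (plain Python; the split counts in the last line are illustrative only):

def gini(counts):
    # gini(T) = 1 - sum of squared class frequencies
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(t1, t2):
    # Size-weighted Gini of a binary split into subsets T1 and T2.
    n1, n2 = sum(t1), sum(t2)
    n = n1 + n2
    return (n1 / n) * gini(t1) + (n2 / n) * gini(t2)

print(gini([9, 5]))                # ~0.459 for the full buys_computer data
print(gini_split([2, 3], [7, 2]))  # Gini of a hypothetical split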
30
Example – ID3
Age Income Student Credit_Rating Class : buys_computer
<=30 High No Fair No
<=30 High No Excellent No
31..40 High No Fair Yes
>40 Medium No Fair Yes
>40 Low Yes Fair Yes
>40 Low Yes Excellent No
31..40 Low Yes Excellent Yes
<=30 Medium No Fair No
<=30 Low Yes Fair Yes
>40 Medium Yes Fair Yes
<=30 Medium Yes Excellent Yes
31..40 Medium No Excellent Yes
31..40 High Yes Fair Yes
>40 Medium No Excellent No

31
Solution – ID3
 Class P : buys_computer = “Yes” (9 records)
 Class N : buys_computer = “No” (5 records)
 Total number of Records 14.
 Now, the expected information is
I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

32
Solution – ID3 (Age <=30,31..40,>40)
Age Income Student Credit_Rating Buys_Computer
<=30 High No Fair No
<=30 High No Excellent No
<=30 Medium No Fair No
<=30 Low Yes Fair Yes
<=30 Medium Yes Excellent Yes

Age | Income | Stu. | Cr_Rating | Buys (age 31..40)
31..40 | High | No | Fair | Yes
31..40 | Low | Yes | Excellent | Yes
31..40 | Medium | No | Excellent | Yes
31..40 | High | Yes | Fair | Yes

Age | Income | Stu. | Cr_Rating | Buys (age > 40)
>40 | Medium | No | Fair | Yes
>40 | Low | Yes | Fair | Yes
>40 | Low | No | Excellent | No
>40 | Medium | Yes | Fair | Yes
>40 | Medium | No | Excellent | No

33
Solution – ID3
 So the expected information needed to classify a given sample, if
the samples are partitioned according to age, is calculated using the
formula below:

Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)

E(Age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2)

34
Solution – ID3 (Age <=30)
 Compute the information gain & Entropy For Age <=30,
• Pi = Yes class = 2
• Ni = No class = 3
So, the expected information is
I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

I(2,3) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971

35
Solution – ID3
Age Pi Ni I (Pi , Ni)
<=30 2 3 0.971
31..40 4 0 0
>40 3 2 0.971

 So the expected information needed to classify a given sample, if the samples
are partitioned according to age, is:
 Calculate the entropy using the values from the table and the formula given below:

E(Age) = Σ_j (|D_j| / |D|) × I(Pi, Ni)
       = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2)

E(Age) = 0.694

36
Solution – ID3
Gain (Age) = I (p, n) – E (Age)
= 0.940 – 0.694
= 0.246
Similarly,
Gain value
Gain (age) 0.246
Gain (income) 0.029
Gain (student) 0.151
Gain (credit_rating) 0.048

So, here we start decision tree with root node Age.

37
Solution – ID3
 Now age has the highest information gain among all the
attributes, so select age as the test attribute and create a node
labelled age, with one branch for each possible value of age for
further splitting.

Age

<=30 31..40 >40

38
Solution – ID3 (Age <= 30)
Age Income Student Credit_Rating Buys_Computer
<=30 High No Fair No
<=30 High No Excellent No
<=30 Medium No Fair No
<=30 Low Yes Fair Yes
<=30 Medium Yes Excellent Yes

39
Solution – ID3 (Age <=30)
 Compute Information gain & Entropy for Age with sample S<=30.
 For age <=30,
• Pi = Yes = 2
• Ni = No = 3

I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

I(2,3) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971

40
Solution – ID3 (Age <= 30, Income)
Income Pi Ni I (Pi , Ni)
High 0 2 0
Medium 1 1 1
Low 1 0 0

In the table above, High (0,2) is homogeneous so I(0,2) = 0; Medium is
equally divided so I(1,1) = 1; and Low gives I(1,0) = 0.

E(Income) = (2/5) I(0,2) + (2/5) I(1,1) + (1/5) I(1,0) = 0.4

Gain (S<=30, Income) = I(p, n) - E(Income)
                     = 0.971 - 0.4
                     = 0.571

41
Solution – ID3 (Age <= 30, Student)
student Pi Ni I (Pi , Ni)
No 0 3 0
Yes 2 0 0

In the table above, I(0,3) = 0 and I(2,0) = 0, so E(Student) = 0.

E(Student) = 0

Gain (S<=30,Student) = I (p,n) – E(Student)


= 0.971 – 0
= 0.971

42
Solution – ID3 (Age <= 30, credit_rating)
credit_rating Pi Ni I (Pi , Ni)
Fair 1 2 0.918
Excellent 1 1 1

E(credit_rating) = (3/5) I(1,2) + (2/5) I(1,1)
                 = (3/5)(0.918) + (2/5)(1)
                 = 0.951

Gain (S<=30,credit_rating) = I (p, n) – E (credit_rating)


= 0.971 – 0.951
= 0.020
43
Solution – ID3 (Age <=30)
Gain (Age <= 30) | value
Income | 0.571
Student | 0.971
Credit_rating | 0.020

As shown in the table, we get the maximum gain for Student, so select
Student as the child node for age <= 30.
[Partial tree: root Age with branches <=30, 31..40 and >40; the <=30 branch leads to a Student node with branches No and Yes.]

44
Solution – ID3 (Age 31..40)
Age Income Student Credit_Rating Buys_Computer
31..40 High No Fair Yes
31..40 Low Yes Excellent Yes
31..40 Medium No Excellent Yes
31..40 High Yes Fair Yes

All tuples with age 31..40 belong to class Yes, so the 31..40 branch becomes a leaf labelled Yes.

[Partial tree: root Age; <=30 → Student (branches No and Yes), 31..40 → leaf Yes, >40 → to be expanded.]
45
Solution – ID3 (Age > 40)
Age Income Student Credit_Rating Buys_Computer
>40 Medium No Fair Yes
>40 Low Yes Fair Yes
>40 Low No Excellent No
>40 Medium Yes Fair Yes
>40 Medium No Excellent No

46
Solution – ID3 (Age > 40)
 Compute Information gain for Age with sample S>40.
 For age > 40,
• Pi = Yes = 3
• Ni = No = 2

I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

I(3,2) = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.971

47
Solution – ID3 (Age > 40, Income)
Income Pi Ni I (Pi , Ni)
High 0 0 0
Medium 2 1 0.918
Low 1 1 1

E(Income) = (0/5) I(0,0) + (3/5) I(2,1) + (2/5) I(1,1)
          = (3/5)(0.918) + (2/5)(1)
          = 0.951

Gain (S>40, Income) = I(p, n) - E(Income)
                    = 0.971 - 0.951
                    = 0.020

48
Solution – ID3 (Age > 40, credit_rating)
Credit_rating Pi Ni I (Pi , Ni)
Fair 3 0 0
Excellent 0 2 0

E(credit_rating) = (3/5) I(3,0) + (2/5) I(0,2) = 0

Gain (S>40,credit_rating) = I (p, n) – E (credit_rating)


= 0.971 – 0
= 0.971
49
Solution – ID3 (Age > 40)
Gain (Age > 40) | value
Income | 0.020
Credit_rating | 0.971

As shown in the table, we get the maximum gain for credit_rating, so
select credit_rating as the child node for age > 40.
[Partial tree: root Age; <=30 → Student (branches No and Yes), 31..40 → leaf Yes, >40 → Credit_rating node.]

50
Decision Tree – ID3
Age?
  <=30 → Student?
    no → buys_computer = No
    yes → buys_computer = Yes
  31..40 → buys_computer = Yes
  >40 → Credit_rating?
    excellent → buys_computer = No
    fair → buys_computer = Yes

51
Classification rules from decision tree
 IF age = “<=30” AND student = “no” THEN buys_computer = “no”
 IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
 IF age = “31..40” THEN buys_computer = “yes”
 IF age = “>40” AND credit_rating = “excellent” THEN
buys_computer = “no”
 IF age = “>40” AND credit_rating = “fair” THEN buys_computer =
“yes”
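As a minimal sketch, these five rules translate directly into a Python function (attribute values as strings, following the table):

def buys_computer(age, student, credit_rating):
    # Direct transcription of the five rules above.
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31..40":
        return "yes"
    # age == ">40"
    return "yes" if credit_rating == "fair" else "no"

print(buys_computer("<=30", "yes", "fair"))     # yes
print(buys_computer(">40", "no", "excellent"))  # no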

52
Bayesian Classification
 It is named after Thomas Bayes, who proposed the Bayes Theorem.
 It is a statistical, supervised learning method for
classification.
 It can solve problems involving both categorical and continuous
valued attributes.
 Bayesian classification is used to find conditional probabilities.

53
The Bayes Theorem
 The Bayes Theorem:
• P(H|X) = P(X|H) P(H) / P(X)
 P(H|X): Probability that the customer will buy a computer given that we know
his age, credit rating and income. (Posterior probability of H)
 P(H): Probability that the customer will buy a computer regardless of age,
credit rating, income. (Prior probability of H)
 P(X|H): Probability that the customer is 35 years old, has a fair credit rating and
earns $40,000, given that he has bought a computer. (Posterior probability of X)
 P(X): Probability that a person from our set of customers is 35 years old, has a
fair credit rating and earns $40,000. (Prior probability of X)

54
Naïve Bayes classifier - Example
Age Income Student Credit_Rating Class : buys_computer
<=30 High No Fair No
<=30 High No Excellent No
31..40 High No Fair Yes
>40 Medium No Fair Yes
>40 Low Yes Fair Yes
>40 Low Yes Excellent No
31..40 Low Yes Excellent Yes
<=30 Medium No Fair No
<=30 Low Yes Fair Yes
>40 Medium Yes Fair Yes
<=30 Medium Yes Excellent Yes
31..40 Medium No Excellent Yes
31..40 High Yes Fair Yes
>40 Medium No Excellent No

55
Naïve Bayes classifier - Solution
Age
P (<=30 | Yes) = 2/9 P (<=30 | No) = 3/5
P (31..40 | Yes) = 4/9 P (31..40 | No) = 0/5 P (Yes) = 9/14
P (> 40 | Yes) = 3/9 P (> 40 | No) = 2/5 P (No) = 5/14
Income
P (High | Yes) = 2/9 P (High | No) = 2/5
P (Medium | Yes) = 4/9 P (Medium | No) = 2/5
P (Low | Yes) = 3/9 P (Low | No) = 1/5
Student
P (No | Yes) = 3/9 P (No | No) = 4/5
P (Yes | Yes) = 6/9 P (Yes | No) = 1/5
Credit_rating
P (Fair | Yes) = 6/9 P (Fair | No) = 2/5
P (Excellent | Yes) = 3/9 P (Excellent | No) = 3/5

56
Naïve Bayes classifier - Solution
 An unseen sample Y = (<=30, Low, Yes, Excellent)
 P(Y|Yes).P(Yes) = P(<=30|Yes). P(Low|Yes). P(Yes|Yes). P(Excellent|
Yes) . P(Yes)
= 2/9 * 3/9 * 6/9 * 3/9 * 9/14
= 0.010582

 P(Y|No).P(No) = P(<=30|No). P(Low|No). P(Yes|No). P(Excellent|


No) . P(No)
= 3/5 * 1/5 * 1/5 * 3/5 * 5/14
= 0.005142
 Choose the class that maximizes this probability; this means the
new instance will be classified as Yes (buys_computer).
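The same arithmetic in Python (reproducing the two scores above):

p_yes = (2/9) * (3/9) * (6/9) * (3/9) * (9/14)  # P(Y|Yes) . P(Yes)
p_no  = (3/5) * (1/5) * (1/5) * (3/5) * (5/14)  # P(Y|No) . P(No)

print(p_yes)  # ~0.010582
print(p_no)   # ~0.005143
print("buys_computer =", "Yes" if p_yes > p_no else "No")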
57
Try yourself (Bayesian Classification)
Car No | Color | Type | Origin | Stolen?
1 | Red | Sports | Domestic | Yes
2 | Red | Sports | Domestic | No
3 | Red | Sports | Domestic | Yes
4 | Yellow | Sports | Domestic | No
5 | Yellow | Sports | Imported | Yes
6 | Yellow | SUV | Imported | No
7 | Yellow | SUV | Imported | Yes
8 | Yellow | SUV | Domestic | No
9 | Red | SUV | Imported | No
10 | Red | Sports | Imported | Yes

Unseen data: Y = <Red, Domestic, SUV> (answers: 0.024, 0.072)
Actual data: Y = <Red, Sports, Domestic> (answers: 0.192, 0.096)

58
Rule Based Classification
 Rule based classification builds rules from an object's attributes.
 Rule-based classifier makes use of a set of IF-THEN rules for
classification.
 We can express a rule in the following from
• IF condition THEN conclusion
 Let us consider a rule R1,
R1: IF age=youth AND student=yes THEN buy_computer=yes
• The IF part of the rule is called rule antecedent or precondition.
• The THEN part of the rule is called rule consequent (conclusion).
• In the antecedent (IF) part, the condition consists of one or more attribute
tests, and these tests are logically ANDed.
• The consequent (THEN) part consists of class prediction.

59
Rule Based Classification (Cont..)
  We can also write rule R1 as follows:
R1: ((age = youth) ^ (student = yes)) => (buys_computer = yes)
• If the condition (that is, all of the attribute tests) in a rule antecedent holds
true for a given tuple, we say that the rule antecedent is satisfied and that
the rule covers the tuple.
 A rule R can be assessed by its coverage and accuracy.
 Given a class-labeled data set D, let n_covers be the number of tuples
covered by R, n_correct be the number of tuples correctly classified by
R, and |D| be the number of tuples in D.
 We can define the coverage and accuracy of R as
Coverage(R) = n_covers / |D|
Accuracy(R) = n_correct / n_covers
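A minimal sketch of these two measures (plain Python; the tuple representation and the encoding of rule R1 are assumptions for illustration):

def rule_stats(data, antecedent, predicted_class):
    # data: list of dicts, each with attribute values and a "class" key.
    covered = [t for t in data if antecedent(t)]
    correct = [t for t in covered if t["class"] == predicted_class]
    coverage = len(covered) / len(data)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

# R1 from above; the data set D itself is assumed to be supplied elsewhere.
r1 = lambda t: t["age"] == "youth" and t["student"] == "yes"
# coverage_r1, accuracy_r1 = rule_stats(D, r1, "yes")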

60
Model Evaluation and Selection

• Evaluation metrics: How can we measure accuracy? Other


metrics to consider?
• Use validation test set of class-labeled tuples instead of
training set when assessing accuracy
• Methods for estimating a classifier’s accuracy:
– Holdout method, random subsampling
– Cross-validation
– Bootstrap
• Comparing classifiers:
– Confidence intervals
– Cost-benefit analysis and ROC Curves
61
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)

Example of Confusion Matrix:


Actual class\Predicted class | buy_computer = yes | buy_computer = no | Total
buy_computer = yes | 6954 | 46 | 7000
buy_computer = no | 412 | 2588 | 3000
Total | 7366 | 2634 | 10000

• Given m classes, an entry, CMi,j in a confusion matrix indicates


# of tuples in class i that were labeled by the classifier as class j
• May have extra rows/columns to provide totals
62
Classifier Evaluation Metrics: Accuracy, Error
Rate, Sensitivity and Specificity
A\P | C | ¬C |
C | TP | FN | P
¬C | FP | TN | N
  | P’ | N’ | All

• Classifier Accuracy, or recognition rate: percentage of test set
tuples that are correctly classified
Accuracy = (TP + TN)/All
• Error rate: 1 – accuracy, or
Error rate = (FP + FN)/All

 Class Imbalance Problem:
 One class may be rare, e.g. fraud, or HIV-positive
 Significant majority of the negative class and minority of the
positive class
 Sensitivity: True Positive recognition rate = TP/P
 Specificity: True Negative recognition rate = TN/N
63
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
• Precision: exactness – what % of tuples that the classifier labeled
as positive are actually positive
Precision = TP / (TP + FP)
• Recall: completeness – what % of positive tuples did the classifier
label as positive?
Recall = TP / (TP + FN)
• Perfect score is 1.0
• Inverse relationship between precision & recall
• F measure (F1 or F-score): harmonic mean of precision and recall
F1 = 2 × Precision × Recall / (Precision + Recall)
• Fß: weighted measure of precision and recall
Fß = (1 + ß²) × Precision × Recall / (ß² × Precision + Recall)
– assigns ß times as much weight to recall as to precision

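A minimal sketch computing these metrics from raw confusion-matrix counts (plain Python):

def metrics(tp, fp, fn, tn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # sensitivity
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_beta, accuracy

# Cancer example from the next slide: TP=90, FP=140, FN=210, TN=9560
print(metrics(90, 140, 210, 9560))  # precision ~0.3913, recall 0.30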
64
Classifier Evaluation Metrics: Example

Actual Class\Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
cancer = yes | 90 | 210 | 300 | 30.00 (sensitivity)
cancer = no | 140 | 9560 | 9700 | 98.56 (specificity)
Total | 230 | 9770 | 10000 | 96.50 (accuracy)

– Precision = 90/230 = 39.13%; Recall = 90/300 = 30.00%

65
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
• Holdout method
– Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies
obtained
• Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets
D1, …, Dk, each of approximately equal size
– At i-th iteration, use Di as test set and others as training set
– Leave-one-out: k folds where k = # of tuples, for small sized
data
– *Stratified cross-validation*: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial data
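A sketch of 10-fold cross-validation (Python with scikit-learn assumed; the iris data stands in for any labeled data set):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean())  # accuracy averaged over the 10 held-out folds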
66
Ensemble Methods: Increasing the Accuracy

• Ensemble methods
– Use a combination of models to increase accuracy
– Combine a series of k learned models, M1, M2, …, Mk, with the
aim of creating an improved model M*
• Popular ensemble methods
– Bagging: averaging the prediction over a collection of classifiers
– Boosting: weighted vote with a collection of classifiers
– Ensemble: combining a set of heterogeneous classifiers

67
Model Selection: ROC Curves
• ROC (Receiver Operating Characteristic) curves: for visual
comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true positive rate and the
false positive rate
• The area under the ROC curve is a measure of the accuracy of
the model
• Rank the test tuples in decreasing order: the one that is most
likely to belong to the positive class appears at the top of the list
• The closer to the diagonal line (i.e., the closer the area is to 0.5),
the less accurate is the model
 The vertical axis represents the true positive rate
 The horizontal axis represents the false positive rate
 The plot also shows a diagonal line
 A model with perfect accuracy will have an area of 1.0
68
Issues Affecting Model Selection
• Accuracy
– classifier accuracy: predicting class label
• Speed
– time to construct the model (training time)
– time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
69
Bagging: Bootstrap Aggregation

• Analogy: Diagnosis based on multiple doctors’ majority vote


• Training
– Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled
with replacement from D (i.e., bootstrap)
– A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
– Each classifier Mi returns its class prediction
– The bagged classifier M* counts the votes and assigns the class with the most
votes to X
• Prediction: can be applied to the prediction of continuous values by taking the
average value of each prediction for a given test tuple
• Accuracy
– Often significantly better than a single classifier derived from D
– For noisy data: not considerably worse, more robust
– Proven improved accuracy in prediction
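A sketch of bagging (Python with scikit-learn assumed; the classifier count and data set are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# k = 25 models, each trained on a bootstrap sample of D;
# prediction is by majority vote across the 25 trees.
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0)
print(cross_val_score(bagged, X, y, cv=10).mean())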
70
Boosting
• Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
• How does boosting work?
– Weights are assigned to each training tuple
– A series of k classifiers is iteratively learned
– After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
– The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
71
Adaboost (Freund and Schapire, 1997)

• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)


• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
– Tuples from D are sampled (with replacement) to form a training set Di
of the same size
– Each tuple’s chance of being selected is based on its weight
– A classification model Mi is derived from Di
– Its error rate is calculated using Di as a test set
– If a tuple is misclassified, its weight is increased, o.w. it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj (1 if
misclassified, 0 otherwise). Classifier Mi’s error rate is the sum of the
weights of the misclassified tuples:

error(Mi) = Σ_{j=1..d} wj × err(Xj)

• The weight of classifier Mi’s vote is

log((1 - error(Mi)) / error(Mi))
72
Random Forest (Breiman 2001)
• Random Forest:
– Each classifier in the ensemble is a decision tree classifier and is generated
using a random selection of attributes at each node to determine the split
– During classification, each tree votes and the most popular class is returned
• Two Methods to construct Random Forest:
– Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology is
used to grow the trees to maximum size
– Forest-RC (random linear combinations): Creates new attributes (or
features) that are a linear combination of the existing attributes (reduces
the correlation between individual classifiers)
• Comparable in accuracy to Adaboost, but more robust to errors and outliers
• Insensitive to the number of attributes selected for consideration at each split,
and faster than bagging or boosting

73
Other Classification Methods
 CART(Classification and Regression Tree)
✔ Variant of decision tree, as it uses Gini index for splitting criterion.
 Neural Network
 Case Based Reasoning (CBR)
 Genetic Algorithm
 Rough Set Approach
 Fuzzy Logic

74
Neural Network
 Neural Network is a set of connected INPUT/OUTPUT UNITS, where each
connection has a WEIGHT associated with it.
 Neural Network learning is also called CONNECTIONIST learning due to the
connections between units.
 A Neural Network learns by adjusting the weights so that it is able to
correctly classify the training data and, after the testing phase, to classify
unknown data.
 Strengths of Neural Networks:
• They can handle complex data (i.e., problems with many parameters).
• They can handle noise in the training data.
• Prediction accuracy is generally high.
• They are robust and work well even when training examples contain
errors.
• They can handle missing data well.

75
Case-Based Reasoning (CBR)
• CBR: Uses a database of problem solutions to solve new problems
• Store symbolic description (tuples or cases)—not points in a Euclidean space
• Applications: Customer-service (product-related diagnosis), legal ruling
• Methodology
– Instances represented by rich symbolic descriptions (e.g., function graphs)
– Search for similar cases, multiple retrieved cases may be combined
– Tight coupling between case retrieval, knowledge-based reasoning, and
problem solving
• Challenges
– Find a good similarity metric
– Indexing based on syntactic similarity measure, and when failure,
backtracking, and adapting to additional cases

76
Genetic Algorithms (GA)

• Genetic Algorithm: based on an analogy to biological evolution


• An initial population is created consisting of randomly generated rules
– Each rule is represented by a string of bits
– E.g., if A1 and ¬A2 then C2 can be encoded as 100
– If an attribute has k > 2 values, k bits can be used
• Based on the notion of survival of the fittest, a new population is formed to
consist of the fittest rules and their offspring
• The fitness of a rule is represented by its classification accuracy on a set of
training examples
• Offspring are generated by crossover and mutation
• The process continues until a population P evolves in which each rule
satisfies a prespecified fitness threshold
• Slow but easily parallelizable
77
Rough Set Approach
• Rough sets are used to approximately or “roughly” define equivalent
classes
• A rough set for a given class C is approximated by two sets: a lower
approximation (certain to be in C) and an upper approximation (cannot be
described as not belonging to C)
• Finding the minimal subsets (reducts) of attributes for feature reduction is
NP-hard but a discernibility matrix (which stores the differences between
attribute values for each pair of data tuples) is used to reduce the
computation intensity

78
Fuzzy Logic

• Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree
of membership (such as in a fuzzy membership graph)
• Attribute values are converted to fuzzy values. Ex.:
– Income, x, is assigned a fuzzy membership value to each of the
discrete categories {low, medium, high}, e.g. $49K belongs to “medium
income” with fuzzy value 0.15 but belongs to “high income” with fuzzy
value 0.96
– Fuzzy membership values do not have to sum to 1.
• Each applicable rule contributes a vote for membership in the categories
• Typically, the truth values for each predicted category are summed, and
these sums are combined

79
Regression
 Regression is a data mining function that predicts a number or value.
 Age, weight, distance, temperature, income, or sales attributes can be
predicted using regression techniques.
 For example, a regression model could be used to predict children's
height, given their age, weight and other factors.
 A regression task begins with a data set in which the target values are
known.
 For example, a regression model that predicts children's height could
be developed based on observed data for many children over a period
of time.
 The data might track age, height, weight, developmental milestones,
family history etc.

80
Regression (Cont..)
 Height would be the target, the other attributes would be the predictors,
and the data for each child would constitute a case.
 Regression models are tested by computing various statistics that
measure the difference between the predicted values and the expected
values.
 It is not necessary to understand the mathematics used in regression
analysis to develop quality regression models for data mining.
 The goal of regression analysis is to determine the values of parameters
for a function that cause the function to fit best in a set of data
observations that we provide.

81
Linear Regression
 The simplest form of regression to visualize is linear regression with a
single predictor.
 A linear regression technique can be used if the relationship between x
and y can be approximated with a straight line.
 Linear regression with a single predictor, i.e. one dependent and one
independent variable, is expressed by the following equation:
Y = b1X + b0 + u
• Where Y = dependent variable which we are trying to predict, b1 = the
slope, X = independent variable, b0 = the intercept, u = random
error/residual
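A minimal sketch of fitting this equation by least squares (Python with numpy assumed; the observations are made-up):

import numpy as np

# Illustrative observations (made-up numbers).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(X, Y, deg=1)  # least-squares slope and intercept
print(b1, b0)                     # estimated b1 and b0
print(b1 * 6.0 + b0)              # predicted Y for a new X = 6.0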

82
Linear Regression (Cont..)
[Figure: scatter plot of observations with the fitted straight regression line.]
83
Nonlinear Regression
 Often the relationship between x and y cannot be approximated with a
straight line; in such cases a nonlinear regression technique may be
used to fit a curve.
 Alternatively, the data could be preprocessed to make the relationship
linear.

84
Multiple Linear Regression
 Multiple Linear regression is an extension of simple linear regression
analysis.
It uses two or more independent variables to predict the outcome of a
single continuous dependent variable, as follows:
Y = b0 + b1X1 + b2X2 + b3X3 + …. + bnXn + u
 Where Y = dependent variable or response variable
X1,X2,….Xn are the independent variables or predictors
u = Random Error
b0,b1,b2,….bn are the regression coefficients

85
Logistic Regression
 A linear regression is not appropriate for predicting the value of a
binary variable for two reasons:
• A linear regression will predict values outside the acceptable range (e.g.
predicting probabilities outside the range 0 to 1).
• Since each observation can only take one of two possible values, the
residuals (random errors) will not be normally distributed about the
predicted line.
 A logistic regression produces a logistic curve, which is limited to
values between 0 and 1.
 Logistic regression is similar to a linear regression, but the curve is
constructed using the natural logarithm “odds” of the target
variable, rather than the probability.
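A minimal sketch of that curve (plain Python; b0 and b1 here are illustrative coefficients, not fitted values):

import math

def logistic(x, b0, b1):
    # Inverse of the log-odds: ln(p / (1 - p)) = b0 + b1*x
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

print(logistic(-5.0, 0.0, 1.0))  # close to 0
print(logistic(0.0, 0.0, 1.0))   # 0.5
print(logistic(5.0, 0.0, 1.0))   # close to 1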

86
Logistic Regression (Cont..)
[Figure: the S-shaped logistic curve, bounded between 0 and 1.]
87