Unit 4 Classification
Classification
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in
the data
Classification vs. Numeric Prediction
• Classification
– predicts categorical class labels (discrete or nominal)
– classifies data (constructs a model) based on the training
set and the values (class labels) in a classifying attribute
and uses it in classifying new data
• Numeric Prediction
– models continuous-valued functions, i.e., predicts unknown
or missing values
• Typical applications
– Credit/loan approval: whether an applicant is a good or a bad credit risk
– Medical diagnosis: whether a tumor is cancerous or benign
– Fraud detection: whether a transaction is fraudulent
– Web page categorization: which category a page belongs to
Introduction to classification
Classification is a supervised learning method.
It is a data mining function that assigns items in a collection to
target categories or classes.
The goal of classification is to accurately predict the target class for
each case in the data.
For example, a classification model could be used to identify loan
applicants as low, medium, or high credit risks.
In supervised learning, the learner (a computer program) is provided
with two sets of data: a training data set and a test data set.
The idea is for the learner to “learn” from a set of labeled examples
in the training set so that it can identify unlabeled examples in the
test set with the highest possible accuracy.
Introduction to classification (Cont..)
Suppose a database D is given as D = {t1, t2, …, tn} and a set of desired
classes is C = {C1, …, Cm}.
The classification problem is to define the mapping m so that each tuple of
database D is assigned to a class of C.
In effect, classification divides D into equivalence classes.
Prediction is similar, but it can be viewed as having an infinite number of
classes.
Classification Example
Teachers classify students' grades as A, B, C, D, or E.
Identify individuals' credit risk (high, low, medium, or unknown).
In cricket: batsman, bowler, or all-rounder.
Websites: educational, sports, or music.
Classification Example (Cont..)
How do teachers assign grades to students based on their obtained marks x?
(A code sketch of this rule chain follows the list.)
• If x >= 90 then A grade.
• If 80 <= x < 90 then B grade.
• If 70 <= x < 80 then C grade.
• If 60 <= x < 70 then D grade.
• If x < 60 then E grade.
[Figure: a decision tree that tests x against the thresholds 90, 80, 70 and 60 in turn, with leaves A, B, C, D and E.]
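As a minimal sketch (the function name and the sample mark are mine, not from the slides), this chain of threshold tests can be written directly in Python; it is exactly what the decision tree in the figure encodes:

```python
def grade(x):
    """Assign a letter grade to a mark x using the threshold rules above."""
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "E"

print(grade(85))  # B
```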
Classification: a two-step process
1) Model Construction
[Figure: the training data is fed into a classification algorithm, which constructs the classifier (model).]
Classification: a two-step process (Cont..)
2) Model Usage
[Figure: the classifier built in step 1 is applied first to testing data, to estimate its accuracy, and then to unseen data.]

Name     Rank         Years   Tenured
Tom      Asst. Prof.  2       No
Merlisa  Asso. Prof.  7       No
George   Prof.        5       Yes
Joseph   Asst. Prof.  7       Yes

For the unseen tuple (Jeff, Professor, 4), the classifier predicts Tenured? = Yes.
Classification: a two-step process (Cont..)
1) Model Construction
• Describing a set of predetermined classes :
o Each tuple/sample is assumed to belong to a predefined class, as determined by
the class label attribute.
o The set of tuples used for model construction is called the training set.
o The model is represented as classification rules, decision trees, or
mathematical formulae.
2) Model Usage
• For classifying future or unknown objects
o Estimate accuracy of the model
o The known label of each test sample is compared with the class predicted by the
model.
o Accuracy rate is the percentage of test set samples that are correctly classified
by the model.
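A minimal sketch of that accuracy estimate (the hand-written rule standing in for a trained classifier is my own assumption; the two test tuples come from the tenure example above):

```python
def accuracy(model, test_set):
    """Accuracy rate = percentage of test samples whose known label matches
    the label the model predicts for them."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return 100.0 * correct / len(test_set)

# A hypothetical rule playing the role of the learned classifier.
model = lambda t: "Yes" if t["years"] > 6 or t["rank"] == "Prof." else "No"
test_set = [({"rank": "Asst. Prof.", "years": 2}, "No"),
            ({"rank": "Prof.", "years": 5}, "Yes")]
print(accuracy(model, test_set))  # 100.0
```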
Classification & prediction issues
Data Preparation
Data cleaning
• Pre-process data in order to reduce noise and handle missing values.
Relevance analysis (Feature selection)
• Remove the irrelevant or redundant attributes.
Data transformation
• Generalize the data to higher level concepts using concept hierarchies and/or
normalize data which involves scaling the values.
Classification & prediction methods
Classification: Decision Tree, Neural Network
Prediction: Linear Regression
Decision tree
One of the most common tasks is to build models for the
prediction of the class of an object on the basis of its attributes.
An object can be a customer, a patient, a transaction, an e-mail message, or
even a single character.
Attributes of a patient object can be heart rate, blood pressure, weight,
gender, etc.
The class of the patient object would most commonly be
positive/negative for a certain disease.
Decision tree (Cont..)
In a decision tree, instances are represented by a fixed set of attributes
(e.g. gender) and their values (e.g. male, female), described as
attribute-value pairs.
If each attribute has a small number of disjoint possible values (e.g. high,
medium, low) or there are only two possible classes (e.g. true, false), then
decision tree learning is easy.
Extensions to the decision tree algorithm also handle real-valued attributes
(e.g. salary).
The tree assigns a class label to each instance of the dataset.
Decision tree (Cont..)
A decision tree is a classifier in the form of a tree structure
• Decision node: specifies a test on a single attribute
• Leaf node: indicates the value of the target attribute
• Arc/edge: a split on one attribute
• Path: a conjunction of tests that leads to the final decision
Decision tree representation - example
[Figure: an example decision tree showing the root node, branches (arcs), and leaf nodes.]
Decision Tree Induction: An Example
Training data set: buys_computer. The data set follows an example of Quinlan's ID3 (Playing Tennis).

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31..40   high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31..40   low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31..40   medium   no        excellent       yes
31..40   high     yes       fair            yes
>40      medium   no        excellent       no

Resulting tree: the root tests age?; the <=30 branch tests student? (no → no, yes → yes), the 31..40 branch is a leaf yes, and the >40 branch tests credit_rating? (excellent → no, fair → yes).
Key requirements for classification
Sufficient data:
• Enough training cases should be provided to learn the model.
Attribute-value description:
• Object or case must be expressible in terms of a fixed collection of
properties or attributes (e.g., hot, mild, cold).
Important terms for decision tree
Entropy
Attribute Selection Measure
✔ Information Gain
✔ Gain Ratio
✔ Gini Index
Entropy (E)
It measures the uncertainty of a decision
• 0 if the outcome is completely certain,
• 1 if the outcome is completely uncertain (for a two-class problem),
• Normally the entropy of the data lies between 0 and 1; it is a probability-based
measure used to calculate the amount of uncertainty.
Entropy (E) (Cont..)
It measures how much information we do not know (how uncertain we are about
the data).
It can also be used to measure how much information we gain from an attribute
when the value of the target attribute is revealed to us.
Which attribute is best?
✔ The attribute with the largest expected reduction in entropy is the 'best'
attribute to use next,
✔ because a large expected reduction means that splitting on that attribute
removes the most uncertainty about the class.
Entropy (E) (Cont..)
A decision tree is built top-down from a root node and involves
partitioning the data into subsets that contain instances with
similar values (homogenous).
ID3 algorithm uses entropy to calculate the homogeneity of a
sample.
If the sample is completely homogeneous, the entropy is zero; if the sample is
equally divided between the classes, the entropy is one.
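A small sketch (the helper function is mine; it assumes the two-class case used throughout this unit) showing the two extremes just described:

```python
import math

def entropy(p, n):
    """Entropy of a sample containing p positive and n negative examples."""
    total = p + n
    e = 0.0
    for count in (p, n):
        if count:                      # 0 * log2(0) is treated as 0
            frac = count / total
            e -= frac * math.log2(frac)
    return e

print(entropy(9, 0))   # 0.0 -> completely homogeneous sample
print(entropy(7, 7))   # 1.0 -> equally divided sample
```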
Attribute Selection Measures
It is a heuristic for selecting the splitting criterion that "best" separates a
given data partition, D, of class-labelled training tuples into individual
classes.
Information Gain
Information gain can also be used for continuous-valued (numeric) attributes.
The attribute with the highest information gain is selected for the split.
Assume that there are two classes, P (positive) and N (negative).
Suppose we have S samples; p of these belong to class P and n belong to class N.
The amount of information needed to decide whether an arbitrary sample in S
belongs to P or N is defined as
$$I(p, n) = -\frac{p}{p+n}\log_2\!\left(\frac{p}{p+n}\right) - \frac{n}{p+n}\log_2\!\left(\frac{n}{p+n}\right)$$
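A hedged sketch of this formula and of the gain computation it feeds into (the helper names and the (p_j, n_j) tuple layout of the partitions are mine):

```python
import math

def info(p, n):
    """I(p, n): expected information to classify a sample with p positives, n negatives."""
    result = 0.0
    for count in (p, n):
        if count:
            frac = count / (p + n)
            result -= frac * math.log2(frac)
    return result

def gain(p, n, partitions):
    """Information gain of an attribute that splits the data into (p_j, n_j) partitions."""
    total = p + n
    expected = sum((pj + nj) / total * info(pj, nj) for pj, nj in partitions)
    return info(p, n) - expected

# e.g. the age attribute of the buys_computer data used later in this unit
print(round(gain(9, 5, [(2, 3), (4, 0), (3, 2)]), 3))
# ~0.247 (the slides round intermediate values and report 0.246)
```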
Information Gain (ID3/C4.5)
Expected information (entropy) needed to classify a tuple in D:
$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
For the buys_computer training data this yields, for example,
Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
(the full calculation is worked through in the ID3 example later in this unit).
Gain Ratio (C4.5)
• Information gain measure is biased towards attributes with a
large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\!\left(\frac{|D_j|}{|D|}\right)$$
– GainRatio(A) = Gain(A)/SplitInfo(A)
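A hedged sketch of the gain-ratio correction (the helper names are mine; the partition sizes 4, 6 and 4 are simply the counts of the high, medium and low income values in the buys_computer table used later in this unit):

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum over j of |Dj|/|D| * log2(|Dj|/|D|)."""
    total = sum(partition_sizes)
    return -sum(s / total * math.log2(s / total) for s in partition_sizes if s)

def gain_ratio(gain_a, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    return gain_a / split_info(partition_sizes)

print(round(split_info([4, 6, 4]), 3))         # ~1.557
print(round(gain_ratio(0.029, [4, 6, 4]), 3))  # ~0.019
```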
Gini Index (Cont..)
After splitting T into two subsets T1 and T2 with sizes N1 and N2, the Gini
index of the split data is defined as
$$Gini_{split}(T) = \frac{N_1}{N}\,gini(T_1) + \frac{N_2}{N}\,gini(T_2)$$
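A brief sketch of this weighted computation for a binary split (gini() below uses the standard 1 - sum of squared class proportions definition; the two label lists are invented):

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def gini_split(t1, t2):
    """Weighted Gini of a binary split: (N1/N)*gini(T1) + (N2/N)*gini(T2)."""
    n1, n2 = len(t1), len(t2)
    n = n1 + n2
    return n1 / n * gini(t1) + n2 / n * gini(t2)

left  = ["yes", "yes", "yes", "no"]         # T1
right = ["yes", "no", "no", "no", "no"]     # T2
print(round(gini_split(left, right), 3))    # ~0.344
```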
Example – ID3
Age Income Student Credit_Rating Class : buys_computer
<=30 High No Fair No
<=30 High No Excellent No
31..40 High No Fair Yes
>40 Medium No Fair Yes
>40 Low Yes Fair Yes
>40 Low Yes Excellent No
31..40 Low Yes Excellent Yes
<=30 Medium No Fair No
<=30 Low Yes Fair Yes
>40 Medium Yes Fair Yes
<=30 Medium Yes Excellent Yes
31..40 Medium No Excellent Yes
31..40 High Yes Fair Yes
>40 Medium No Excellent No
Solution – ID3
Class P : buys_computer = “Yes” (9 records)
Class N : buys_computer = “No” (5 records)
Total number of Records 14.
Now, the expected information needed to classify a given sample is
$$I(p, n) = I(9, 5) = -\frac{9}{14}\log_2\!\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\!\left(\frac{5}{14}\right) = 0.940$$
Solution – ID3 (Age <=30,31..40,>40)
Age <=30:
Age     Income   Student   Credit_Rating   Buys_Computer
<=30    High     No        Fair            No
<=30    High     No        Excellent       No
<=30    Medium   No        Fair            No
<=30    Low      Yes       Fair            Yes
<=30    Medium   Yes       Excellent       Yes

Age 31..40:
Age      Income   Student   Credit_Rating   Buys_Computer
31..40   High     No        Fair            Yes
31..40   Low      Yes       Excellent       Yes
31..40   Medium   No        Excellent       Yes
31..40   High     Yes       Fair            Yes

Age >40:
Age    Income   Student   Credit_Rating   Buys_Computer
>40    Medium   No        Fair            Yes
>40    Low      Yes       Fair            Yes
>40    Low      Yes       Excellent       No
>40    Medium   Yes       Fair            Yes
>40    Medium   No        Excellent       No
Solution – ID3
So the expected information needed to classify a given sample, if the samples
are partitioned according to age, is calculated using the values from the table
and the formula given below:
$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$
Solution – ID3 (Age <=30)
Compute the entropy for the partition Age <=30:
• Pi = Yes class = 2
• Ni = No class = 3
So,
$$I(2, 3) = -\frac{2}{5}\log_2\!\left(\frac{2}{5}\right) - \frac{3}{5}\log_2\!\left(\frac{3}{5}\right) = 0.971$$
Solution – ID3
Age      Pi   Ni   I(Pi, Ni)
<=30     2    3    0.971
31..40   4    0    0
>40      3    2    0.971

E(Age) = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) = 0.694
Solution – ID3
Gain (Age) = I (p, n) – E (Age)
= 0.940 – 0.694
= 0.246
Similarly, for the remaining attributes:

Attribute         Gain
age               0.246
income            0.029
student           0.151
credit_rating     0.048
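The whole table can be checked with a short script; this is a sketch that reuses the info/gain helpers from earlier in the unit (repeated here so the snippet runs on its own) and the per-attribute class counts read off the training table:

```python
import math

def info(p, n):
    return -sum(c / (p + n) * math.log2(c / (p + n)) for c in (p, n) if c)

def gain(p, n, partitions):
    total = p + n
    return info(p, n) - sum((pj + nj) / total * info(pj, nj) for pj, nj in partitions)

# (yes, no) counts per attribute value, taken from the buys_computer table
partitions = {
    "age":           [(2, 3), (4, 0), (3, 2)],   # <=30, 31..40, >40
    "income":        [(2, 2), (4, 2), (3, 1)],   # high, medium, low
    "student":       [(6, 1), (3, 4)],           # yes, no
    "credit_rating": [(6, 2), (3, 3)],           # fair, excellent
}
for attr, parts in partitions.items():
    print(attr, round(gain(9, 5, parts), 3))
# age 0.247, income 0.029, student 0.152, credit_rating 0.048
# (the slides round intermediate values, giving 0.246 and 0.151)
```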
Solution – ID3
Age has the highest information gain among all the attributes, so select age as
the test attribute, create a node labelled age, and draw a branch for each
possible value of age for further splitting.
[Figure: a root node age with three branches: <=30, 31..40, and >40.]
Solution – ID3 (Age <= 30)
Age Income Student Credit_Rating Buys_Computer
<=30 High No Fair No
<=30 High No Excellent No
<=30 Medium No Fair No
<=30 Low Yes Fair Yes
<=30 Medium Yes Excellent Yes
Solution – ID3 (Age <=30)
Compute the entropy for the sample S with age <=30.
For age <=30,
• Pi = Yes = 2
• Ni = No = 3
$$I(2, 3) = -\frac{2}{5}\log_2\!\left(\frac{2}{5}\right) - \frac{3}{5}\log_2\!\left(\frac{3}{5}\right) = 0.971$$
Solution – ID3 (Age <= 30, Income)
Income Pi Ni I (Pi , Ni)
High 0 2 0
Medium 1 1 1
Low 1 0 0
In the above table, High (0,2) is homogeneous so I(0,2) = 0, Medium is an equal split so I(1,1) = 1, and Low (1,0) gives I(1,0) = 0.
E(Income) = (2/5)(0) + (2/5)(1) + (1/5)(0) = 0.4
Solution – ID3 (Age <= 30, Student)
student Pi Ni I (Pi , Ni)
No 0 3 0
Yes 2 0 0
E(Student) = 0
Solution – ID3 (Age <= 30, credit_rating)
credit_rating Pi Ni I (Pi , Ni)
Fair 1 2 0.918
Excellent 1 1 1
E(credit_rating) = (3/5) I(1,2) + (2/5) I(1,1) = (3/5)(0.918) + (2/5)(1) = 0.951
Student gives the lowest entropy (E(Student) = 0, i.e. the highest gain) for the age <=30 partition, so this branch is split on student.
[Figure: under the age <=30 branch, a student node with branch no → No and branch yes → Yes.]
Solution – ID3 (Age 31..40)
Age Income Student Credit_Rating Buys_Computer
31..40 High No Fair Yes
31..40 Low Yes Excellent Yes
31..40 Medium No Excellent Yes
31..40 High Yes Fair Yes
Every tuple with age 31..40 has buys_computer = Yes, so this branch becomes a leaf labelled Yes.
[Figure: the partial tree so far: root age; the <=30 branch leads to the student node (no → No, yes → Yes); the 31..40 branch leads to the leaf Yes.]
Solution – ID3 (Age > 40)
Age Income Student Credit_Rating Buys_Computer
>40 Medium No Fair Yes
>40 Low Yes Fair Yes
>40 Low Yes Excellent No
>40 Medium Yes Fair Yes
>40 Medium No Excellent No
Solution – ID3 (Age > 40)
Compute the entropy for the sample S with age > 40.
For age > 40,
• Pi = Yes = 3
• Ni = No = 2
$$I(3, 2) = -\frac{3}{5}\log_2\!\left(\frac{3}{5}\right) - \frac{2}{5}\log_2\!\left(\frac{2}{5}\right) = 0.971$$
Solution – ID3 (Age > 40, Income)
Income Pi Ni I (Pi , Ni)
High 0 0 0
Medium 2 1 0.918
Low 1 1 1
Solution – ID3 (Age > 40, credit_rating)
Credit_rating Pi Ni I (Pi , Ni)
Fair 3 0 0
Excellent 0 2 0
E(credit_rating) = (3/5) I(3,0) + (2/5) I(0,2) = 0
credit_rating gives zero entropy (the highest gain) for the age > 40 partition, so this branch is split on credit_rating.
Decision Tree – ID3
[Figure: the final decision tree. The root tests age; the <=30 branch leads to a student node (no → No, yes → Yes); the 31..40 branch is a leaf Yes; the >40 branch leads to a credit_rating node (excellent → No, fair → Yes).]
Classification rules from decision tree
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31..40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
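These rules translate directly into a small classifier; a sketch (the argument names and value strings are assumed to match the table):

```python
def buys_computer(age, student, credit_rating):
    """Apply the IF-THEN rules read off the ID3 tree."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31..40":
        return "yes"
    # age == ">40"
    return "no" if credit_rating == "excellent" else "yes"

print(buys_computer("<=30", "yes", "fair"))     # yes
print(buys_computer(">40", "no", "excellent"))  # no
```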
Bayesian Classification
This method is named after Thomas Bayes, who proposed the Bayes theorem.
It is a statistical, supervised learning method for classification.
It can solve problems involving both categorical and continuous
valued attributes.
Bayesian classification is used to find conditional probabilities.
The Bayes Theorem
The Bayes theorem:
$$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$$
P(H|X) : Probability that the customer will buy a computer given that we know
his age, credit rating and income. (Posterior Probability of H)
P(H) : Probability that the customer will buy a computer regardless of age,
credit rating, income (Prior Probability of H)
P(X|H) : Probability that the customer is 35 years old, has a fair credit rating, and
earns $40,000, given that he has bought a computer. (Posterior probability of X conditioned on H)
P(X) : Probability that a person from our set of customers is 35 years old, has a
fair credit rating, and earns $40,000. (Prior probability of X)
Naïve Bayes classifier - Example
Age Income Student Credit_Rating Class : buys_computer
<=30 High No Fair No
<=30 High No Excellent No
31..40 High No Fair Yes
>40 Medium No Fair Yes
>40 Low Yes Fair Yes
>40 Low Yes Excellent No
31..40 Low Yes Excellent Yes
<=30 Medium No Fair No
<=30 Low Yes Fair Yes
>40 Medium Yes Fair Yes
<=30 Medium Yes Excellent Yes
31..40 Medium No Excellent Yes
31..40 High Yes Fair Yes
>40 Medium No Excellent No
Naïve Bayes classifier - Solution
Age
P (<=30 | Yes) = 2/9 P (<=30 | No) = 3/5
P (31..40 | Yes) = 4/9 P (31..40 | No) = 0/5 P (Yes) = 9/14
P (> 40 | Yes) = 3/9 P (> 40 | No) = 2/5 P (No) = 5/14
Income
P (High | Yes) = 2/9 P (High | No) = 2/5
P (Medium | Yes) = 4/9 P (Medium | No) = 2/5
P (Low | Yes) = 3/9 P (Low | No) = 1/5
Student
P (No | Yes) = 3/9 P (No | No) = 4/5
P (Yes | Yes) = 6/9 P (Yes | No) = 1/5
Credit_rating
P (Fair | Yes) = 6/9 P (Fair | No) = 2/5
P (Excellent | Yes) = 3/9 P (Excellent | No) = 3/5
Naïve Bayes classifier - Solution
An unseen sample Y = (age <=30, Income = Low, Student = Yes, Credit_rating = Excellent)
P(Y|Yes)·P(Yes) = P(<=30|Yes)·P(Low|Yes)·P(Yes|Yes)·P(Excellent|Yes)·P(Yes)
                = 2/9 * 3/9 * 6/9 * 3/9 * 9/14 = 0.010582
P(Y|No)·P(No) = P(<=30|No)·P(Low|No)·P(Yes|No)·P(Excellent|No)·P(No)
              = 3/5 * 1/5 * 1/5 * 3/5 * 5/14 = 0.005143
Since 0.010582 > 0.005143, the classifier predicts buys_computer = Yes for Y.
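A hedged sketch that reproduces this calculation (the probability tables are the ones listed on the previous slide; the dictionary layout is my own):

```python
# Class priors and conditional probabilities from the tables above
prior = {"yes": 9 / 14, "no": 5 / 14}
cond = {
    "yes": {"age:<=30": 2/9, "income:low": 3/9, "student:yes": 6/9, "credit:excellent": 3/9},
    "no":  {"age:<=30": 3/5, "income:low": 1/5, "student:yes": 1/5, "credit:excellent": 3/5},
}

def score(cls, features):
    """Naive Bayes score P(X|C) * P(C), assuming conditional independence of features."""
    p = prior[cls]
    for f in features:
        p *= cond[cls][f]
    return p

sample = ["age:<=30", "income:low", "student:yes", "credit:excellent"]
for cls in ("yes", "no"):
    print(cls, round(score(cls, sample), 6))
# yes 0.010582, no 0.005143 -> predict buys_computer = yes
```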
Rule Based Classification
It builds rules based on an object's attributes.
Rule-based classifier makes use of a set of IF-THEN rules for
classification.
We can express a rule in the following form
• IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age=youth AND student=yes THEN buy_computer=yes
• The IF part of the rule is called rule antecedent or precondition.
• The THEN part of the rule is called rule consequent (conclusion).
• The antecedent (IF part) consists of one or more attribute tests, which are
logically ANDed.
• The consequent (THEN) part consists of class prediction.
Rule Based Classification (Cont..)
We can also write rule R1 as follows:
R1: ((age = youth) ^ (student = yes)) => (buys_computer = yes)
• If the condition (that is, all of the attribute tests) in a rule antecedent holds
true for a given tuple, we say that the rule antecedent is satisfied and that
the rule covers the tuple.
A rule R can be assessed by its coverage and accuracy.
Given a class-labelled data set D, let n_covers be the number of tuples covered
by R, n_correct be the number of tuples correctly classified by R, and |D| be
the number of tuples in D.
We can define the coverage and accuracy of R as
$$Coverage(R) = \frac{n_{covers}}{|D|} \qquad\qquad Accuracy(R) = \frac{n_{correct}}{n_{covers}}$$
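A minimal sketch of these two measures (the rule encoding, where the rule returns None when its antecedent is not satisfied, and the tiny data set are my own assumptions):

```python
def coverage_and_accuracy(rule, data):
    """data is a list of (tuple, label) pairs; rule(x) returns a class or None."""
    covered = [(x, y) for x, y in data if rule(x) is not None]
    correct = sum(1 for x, y in covered if rule(x) == y)
    coverage = len(covered) / len(data)
    accuracy = correct / len(covered) if covered else 0.0
    return coverage, accuracy

# R1: IF age = youth AND student = yes THEN buys_computer = yes
r1 = lambda x: "yes" if x["age"] == "youth" and x["student"] == "yes" else None
data = [({"age": "youth", "student": "yes"}, "yes"),
        ({"age": "youth", "student": "no"}, "no"),
        ({"age": "senior", "student": "yes"}, "yes")]
print(coverage_and_accuracy(r1, data))  # coverage 1/3, accuracy 1.0
```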
Model Evaluation and Selection
Classifier Evaluation Metrics: Example
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
• Holdout method
– Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies
obtained
• Cross-validation (k-fold, where k = 10 is most popular; a sketch follows this list)
– Randomly partition the data into k mutually exclusive subsets (folds)
D1, …, Dk, each of approximately equal size
– At the i-th iteration, use Di as the test set and the remaining folds as the training set
– Leave-one-out: k folds where k = # of tuples, for small sized
data
– *Stratified cross-validation*: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial data
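The sketch referred to above: a plain k-fold cross-validation loop with no library dependencies (train_fn and accuracy_fn are placeholders for whatever learner and metric are being evaluated):

```python
import random

def k_fold_cross_validation(data, k, train_fn, accuracy_fn):
    """Randomly partition data into k folds; at iteration i, fold i is the test set
    and the remaining folds form the training set. Returns the average accuracy."""
    data = list(data)
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(train)
        scores.append(accuracy_fn(model, test))
    return sum(scores) / k
```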
Ensemble Methods: Increasing the Accuracy
• Ensemble methods
– Use a combination of models to increase accuracy
– Combine a series of k learned models, M 1, M2, …, Mk, with the
aim of creating an improved model M*
• Popular ensemble methods
– Bagging: averaging the prediction over a collection of classifiers (a sketch follows this list)
– Boosting: weighted vote with a collection of classifiers
– Ensemble: combining a set of heterogeneous classifiers
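The bagging sketch mentioned above: k models are trained on bootstrap samples (drawn with replacement) and combined by majority vote; train_fn is a placeholder for any base learner that returns a callable model:

```python
import random
from collections import Counter

def bagging(data, k, train_fn):
    """Bootstrap aggregation: train k models on bootstrap samples and
    combine their predictions by majority vote."""
    models = []
    for _ in range(k):
        sample = [random.choice(data) for _ in range(len(data))]  # with replacement
        models.append(train_fn(sample))

    def ensemble(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return ensemble
```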
Model Selection: ROC Curves
• ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true positive rate and the false positive rate
• The area under the ROC curve is a measure of the accuracy of the model
• Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
• The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model; a model with perfect accuracy has an area of 1.0
[Figure: ROC plot. The vertical axis represents the true positive rate, the horizontal axis the false positive rate; the plot also shows a diagonal line.]
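A brief sketch of computing the points of an ROC curve and its area from ranked scores (this assumes scikit-learn is available; the labels and scores are made up):

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                   # actual classes
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]  # classifier scores, ranked

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))             # (false positive rate, true positive rate) points
print(roc_auc_score(y_true, y_score))  # area under the curve; 1.0 would be a perfect model
```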
Issues Affecting Model Selection
• Accuracy
– classifier accuracy: predicting class label
• Speed
– time to construct the model (training time)
– time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Bagging: Bootstrap Aggregation
Other Classification Methods
CART (Classification and Regression Trees)
✔ A variant of decision tree learning that uses the Gini index as its splitting criterion.
Neural Network
Case Based Reasoning (CBR)
Genetic Algorithm
Rough Set Approach
Fuzzy Logic
Neural Network
Neural Network is a set of connected INPUT/OUTPUT UNITS, where each
connection has a WEIGHT associated with it.
Neural Network learning is also called CONNECTIONIST learning due to the
connections between units.
A neural network learns by adjusting the weights so that it is able to correctly
classify the training data and, after the testing phase, to classify unknown data.
Strengths of Neural Networks:
• It can handle complex data (i.e., problems with many parameters).
• It can handle noise in the training data.
• The Prediction accuracy is generally high.
• Neural Networks are robust, work well even when training examples contain
errors.
• Neural Networks can handle missing data well.
Case-Based Reasoning (CBR)
• CBR: Uses a database of problem solutions to solve new problems
• Store symbolic description (tuples or cases)—not points in a Euclidean space
• Applications: Customer-service (product-related diagnosis), legal ruling
• Methodology
– Instances represented by rich symbolic descriptions (e.g., function graphs)
– Search for similar cases, multiple retrieved cases may be combined
– Tight coupling between case retrieval, knowledge-based reasoning, and
problem solving
• Challenges
– Find a good similarity metric
– Indexing based on syntactic similarity measure, and when failure,
backtracking, and adapting to additional cases
Genetic Algorithms (GA)
Fuzzy Logic
• Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree
of membership (such as in a fuzzy membership graph)
• Attribute values are converted to fuzzy values. Ex.:
– Income, x, is assigned a fuzzy membership value to each of the
discrete categories {low, medium, high}, e.g. $49K belongs to “medium
income” with fuzzy value 0.15 but belongs to “high income” with fuzzy
value 0.96
– Fuzzy membership values do not have to sum to 1.
• Each applicable rule contributes a vote for membership in the categories
• Typically, the truth values for each predicted category are summed, and
these sums are combined
Regression
Regression is a data mining function that predicts a number or value.
Age, weight, distance, temperature, income, or sales attributes can be
predicted using regression techniques.
For example, a regression model could be used to predict children's
height, given their age, weight and other factors.
A regression task begins with a data set in which the target values are
known.
For example, a regression model that predicts children's height could
be developed based on observed data for many children over a period
of time.
The data might track age, height, weight, developmental milestones,
family history etc.
Regression (Cont..)
Height would be the target, the other attributes would be the predictors,
and the data for each child would constitute a case.
Regression models are tested by computing various statistics that
measure the difference between the predicted values and the expected
values.
An understanding of the mathematics used in regression analysis is needed to
develop quality regression models for data mining.
The goal of regression analysis is to determine the values of parameters
for a function that cause the function to fit best in a set of data
observations that we provide.
Linear Regression
The simplest form of regression to visualize is linear regression with a
single predictor.
A linear regression technique can be used if the relationship between x
and y can be approximated with a straight line.
Linear regression with a single predictor (one dependent and one independent
variable) is defined by the formula
Y = b1X + b0 + u
• where Y = the dependent variable we are trying to predict, X = the independent
variable, b1 = the slope, b0 = the intercept, and u = the random error (residual)
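A small sketch of estimating b1 and b0 by ordinary least squares (the helper function and the age/height sample data are invented for illustration):

```python
def fit_simple_linear(xs, ys):
    """Least-squares estimates of slope b1 and intercept b0 in Y = b1*X + b0 + u."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x
    return b1, b0

# e.g. predicting children's height (cm) from age (years)
ages    = [3, 5, 7, 9, 11]
heights = [95, 109, 122, 133, 145]
b1, b0 = fit_simple_linear(ages, heights)
print(round(b1, 2), round(b0, 2))   # slope and intercept
print(round(b1 * 8 + b0, 1))        # predicted height at age 8
```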
Linear Regression (Cont..)
Nonlinear Regression
Often the relationship between x and y cannot be approximated with a straight
line; in such cases a nonlinear regression technique may be used.
Alternatively, the data could be preprocessed to make the relationship
linear.
Multiple Linear Regression
Multiple Linear regression is an extension of simple linear regression
analysis.
It uses two or more independent variables to predict a single continuous
dependent (outcome) variable:
Y = b0 + b1X1+ b2X2+ b3X3 +….+ bnXn + u
Where Y = dependent variable or response variable
X1,X2,….Xn are the independent variables or predictors
u = Random Error
b0,b1,b2,….bn are the regression coefficients
Logistic Regression
A linear regression is not appropriate for predicting the value of a
binary variable for two reasons:
• A linear regression will predict values outside the acceptable range (e.g.
predicting probabilities outside the range 0 to 1).
• Since the response can take only one of two possible values for each
experiment, the residuals (random errors) will not be normally distributed
about the predicted line.
A logistic regression produces a logistic curve, which is limited to
values between 0 and 1.
Logistic regression is similar to linear regression, but the curve is
constructed using the natural logarithm of the "odds" of the target variable,
rather than the probability itself.
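A sketch of the logistic curve itself: the log-odds of the target is linear in x, so the predicted probability always stays between 0 and 1 (the coefficients below are invented):

```python
import math

def logistic(x, b0, b1):
    """P(Y = 1 | x) = 1 / (1 + e^-(b0 + b1*x)); ln(p / (1 - p)) = b0 + b1*x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

b0, b1 = -4.0, 0.1                             # illustrative coefficients
for x in (0, 20, 40, 60, 80):
    print(x, round(logistic(x, b0, b1), 3))    # probabilities remain in (0, 1)
```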
Logistic Regression (Cont..)