Classification and Prediction
Training data set: The data tuples analyzed to build the model collectively form the training data set.
Training samples: The individual tuples making up the training set are called training samples.
Algorithm Description
The tree starts as a single node, N, representing the training tuples
in D (step 1)
If the tuples in D are all of the same class, then node N becomes a
leaf and is labeled with that class (steps 2 and 3).
The algorithm uses the same process recursively to form a decision
tree for the tuples at each resulting partition, Dj, of D (step 14).
The recursive partitioning stops only when any one of the following
terminating conditions is true:
(1) All the tuples in the partition at a given node belong to the same class.
(2) There are no remaining attributes on which the tuples may be further partitioned; in this case, majority voting converts the node into a leaf labeled with the most common class in the partition.
(3) There are no tuples for a given branch, that is, a partition Dj is empty; in this case, a leaf is created with the majority class of the parent partition.
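As an illustration of the recursive procedure above, here is a minimal Python sketch of decision tree induction. The function names (generate_decision_tree, attribute_selection_method), the tuple representation (dicts with a "class" key), and the tree representation are assumptions made for the example, not part of any particular library.

```python
from collections import Counter

def majority_class(D):
    """Return the most common class label among the tuples in D."""
    return Counter(t["class"] for t in D).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list, attribute_selection_method):
    """Recursively build a decision tree from training partition D.

    D -- list of training tuples, each a dict of attribute -> value plus a "class" key
    attribute_list -- candidate splitting attributes still available
    attribute_selection_method -- function(D, attributes) -> best splitting attribute
    """
    classes = {t["class"] for t in D}

    # Terminating condition: all tuples belong to the same class.
    if len(classes) == 1:
        return {"leaf": classes.pop()}

    # Terminating condition: no attributes left -> majority voting.
    if not attribute_list:
        return {"leaf": majority_class(D)}

    # Choose the splitting attribute with the attribute selection measure.
    split_attr = attribute_selection_method(D, attribute_list)
    node = {"attribute": split_attr, "branches": {}}
    remaining = [a for a in attribute_list if a != split_attr]

    # Partition D on each value of the splitting attribute and recurse.
    for value in {t[split_attr] for t in D}:
        Dj = [t for t in D if t[split_attr] == value]
        # (In the general algorithm, an empty partition Dj would become a
        #  leaf labeled with the majority class in D.)
        node["branches"][value] = generate_decision_tree(Dj, remaining, attribute_selection_method)
    return node
```

An information-gain or Gini-based selector, such as the ones sketched later in this section, can be passed in as attribute_selection_method.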
Attribute Selection Measures
Also Known as Splitting rules
They determine how the tuples at a given node are to
be split.
An attribute selection measure provides a ranking
for each attribute describing the given training
tuples.
The attribute having the best score for the measure
is chosen as the splitting attribute for the given
tuples.
Three popular attribute selection measures are:
Information gain
Gain Ratio
Gini Index
Information Gain
This measure is based on the value or information content of messages.
Let node N represent or hold the tuples of partition D.
The attribute with the highest information gain is chosen as the splitting attribute
for node N.
This attribute minimizes the information needed to classify the tuples in the
resulting partitions and reflects the least randomness or impurity in these
partitions.
Such an approach minimizes the expected number of tests needed to classify a
given tuple and guarantees that a simple (but not necessarily the simplest) tree is
found.
The expected information needed to classify a tuple in D is given by
Info(D) = - Σ (i = 1..m) p_i log2(p_i)
where p_i is the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|. Info(D) is also known as the entropy of D.
Now, suppose we were to partition the tuples in D on some attribute A having v distinct values, {a1, a2, ..., av}, splitting D into v partitions {D1, D2, ..., Dv}. The information still needed to arrive at an exact classification is measured by
InfoA(D) = Σ (j = 1..v) (|Dj| / |D|) × Info(Dj)
The term |Dj| / |D| acts as the weight of the jth partition.
InfoA(D) is the expected information required to classify a tuple from D based on the partitioning by A, that is, on the value of A. Information gain is the difference between the original information requirement and the new requirement:
Gain(A) = Info(D) - InfoA(D)
The attribute A with the highest information gain, (Gain(A)), is chosen as
the splitting attribute at node N.
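The following is a minimal sketch of how Info(D), InfoA(D), and Gain(A) might be computed for categorical attributes. The tuple representation (dicts with a "class" key) follows the earlier sketch and is an assumption made for illustration.

```python
import math
from collections import Counter

def info(D):
    """Entropy Info(D) = -sum(p_i * log2(p_i)) over the class distribution of D."""
    counts = Counter(t["class"] for t in D)
    total = len(D)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_a(D, attribute):
    """Expected information InfoA(D) after partitioning D on the given attribute."""
    total = len(D)
    expected = 0.0
    for value in {t[attribute] for t in D}:
        Dj = [t for t in D if t[attribute] == value]
        expected += (len(Dj) / total) * info(Dj)
    return expected

def gain(D, attribute):
    """Information gain Gain(A) = Info(D) - InfoA(D)."""
    return info(D) - info_a(D, attribute)

# The attribute with the highest information gain is chosen as the splitting attribute:
# best = max(candidate_attributes, key=lambda a: gain(D, a))
```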
Gini Index
The Gini index is used in CART. Using the notation described above, the Gini index measures the impurity of D as
Gini(D) = 1 - Σ (i = 1..m) p_i^2
where p_i is the probability that a tuple in D belongs to class Ci, estimated by |Ci,D| / |D|.
The Gini index considers a binary split for each attribute. Let's first
consider the case where A is a discrete-valued attribute having v
distinct values, {a1, a2, ..., av}, occurring in D.
Each subset SA of these values defines a binary test of the form "A ∈ SA?".
Given a tuple, this test is satisfied if the value of A for the tuple is
among the values listed in SA. If A has v possible values, then there
are 2^v possible subsets.
When considering a binary split, we compute a
weighted sum of the impurity of each resulting
partition.
For example, if a binary split on A partitions D
into D1 and D2, the Gini index of D given that
partitioning is
GiniA(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)
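A small sketch, under the same assumed tuple representation, of the Gini index and the weighted Gini of a binary split defined by a subset SA of attribute values:

```python
from collections import Counter

def gini(D):
    """Gini(D) = 1 - sum(p_i^2) over the class distribution of D."""
    counts = Counter(t["class"] for t in D)
    total = len(D)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def gini_binary_split(D, attribute, subset):
    """GiniA(D) for the binary test 'A in subset?' partitioning D into D1 and D2."""
    D1 = [t for t in D if t[attribute] in subset]
    D2 = [t for t in D if t[attribute] not in subset]
    total = len(D)
    return (len(D1) / total) * gini(D1) + (len(D2) / total) * gini(D2)

# The reduction in impurity from the split is Gini(D) - GiniA(D);
# the subset giving the largest reduction defines the best binary split on A.
```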
Tree Pruning
When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers.
Tree pruning addresses this problem.
Common approaches to tree pruning:
Prepruning approach
Postpruning approach
In prepruning, tree construction is halted early at a given node. Upon halting, the node becomes a leaf. The leaf may hold the most
frequent class among the subset tuples or the probability
distribution of those tuples.
In postpruning, subtrees are removed from a fully grown tree. The cost complexity pruning algorithm used in CART considers the cost
complexity of a tree to be a function of the number of leaves in the tree and the error rate of the tree (where the error rate
is the percentage of tuples misclassified by the tree). It starts from the
bottom of the tree.
For each internal node, N, it computes the cost complexity of the subtree at
N, and the cost complexity of the subtree at N if it were to be pruned (i.e.,
replaced by a leaf node).
The two values are compared. If pruning the subtree at node N would result in
a smaller cost complexity, then the subtree is pruned. Otherwise, it is kept.
A pruning set of class-labeled tuples is used to estimate cost complexity. This set is
independent of the training set used to build the unpruned tree and of any test
set used for accuracy estimation.
In general, the smallest decision tree that minimizes the cost complexity is
preferred.
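As one concrete illustration (not the CART procedure described above verbatim), scikit-learn's DecisionTreeClassifier exposes cost-complexity pruning through the ccp_alpha parameter. The sketch below assumes scikit-learn is available and uses the iris data set purely as stand-in training data; the held-out split plays the role of the independent pruning set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out an independent pruning/validation set, as the cost complexity method requires.
X_train, X_prune, y_train, y_prune = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate pruning strengths (alphas) computed from the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Grow one tree per alpha and keep the one that does best on the pruning set;
# a larger alpha means more aggressive pruning (a smaller tree).
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_prune, y_prune)
    if score > best_score:
        best_alpha, best_score = alpha, score

pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
```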
Pessimistic Pruning
Used by C4.5
It is similar to the cost complexity method in that it also uses error
rate estimates to make decisions regarding subtree pruning.
This approach does not require the use of a prune set. Instead, it
uses the training set to estimate error rates.
Problems Encountered in Decision Trees
Repetition: occurs when an attribute is repeatedly tested along a given branch of the tree.
Bayesian classification
Bayesian classification can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
Bayes' Theorem
Let X be a data tuple. In Bayesian terms, X is considered
evidence.
It is described by measurements made on a set of n attributes.
Let H be some hypothesis, such as that the data tuple X belongs to
a specified class C.
We want to determine P(H|X), the probability that the hypothesis H holds
given the evidence or observed data tuple X.
Bayes' theorem provides a way of calculating this posterior probability:
P(H|X) = P(X|H) P(H) / P(X)
P(H) is the prior probability, or a priori probability, of H.
For example, this is the probability that any given
customer will buy a computer, regardless of age,
income, or any other information.
Naïve Bayesian Classification
The naïve Bayesian classifier predicts that X belongs to the class Ci having the highest posterior probability P(Ci|X) = P(X|Ci) P(Ci) / P(X). As P(X) is constant for all classes, only P(X|Ci) P(Ci) need
be maximized.
If the class prior probabilities are not known, then it is
commonly assumed that the classes are equally likely,
that is, P(C1) = P(C2) = ... = P(Cm), and we would therefore
maximize P(X|Ci). Otherwise, we maximize P(X|Ci) P(Ci).
Given data sets with many attributes, it would be
extremely computationally expensive to compute P(X|Ci).
In order to reduce computation in evaluating P(X|Ci), the
naïve assumption of class conditional independence is
made. This presumes that the values of the attributes are
conditionally independent of one another, given the class
label of the tuple (i.e., that there are no dependence
relationships among the attributes). Thus,
P(X|Ci) = Π (k = 1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)
For each attribute, we check whether the attribute is categorical or
continuous-valued. For instance, to compute P(X|Ci), we consider
the following:
If Ak is categorical, then P(xk|Ci) is the number of
tuples of class Ci in D having the value xk for Ak,
divided by |Ci,D |, the number of tuples of class Ci in
D.
If Ak is continuous-valued, it is typically assumed to have a Gaussian
distribution with mean μ and standard deviation σ, defined by
g(x, μ, σ) = (1 / (√(2π) σ)) exp( -(x - μ)^2 / (2σ^2) )
so that P(xk|Ci) = g(xk, μCi, σCi), where μCi and σCi are the mean and standard deviation of the values of attribute Ak for the training tuples of class Ci.
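A small sketch of this Gaussian estimate; the function name gaussian_likelihood is an arbitrary choice for the example.

```python
import math

def gaussian_likelihood(x, mu, sigma):
    """g(x, mu, sigma): Gaussian density used as P(xk | Ci) for a continuous attribute."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# mu and sigma would be estimated from the values of attribute Ak over the
# training tuples of class Ci, e.g. with statistics.mean and statistics.stdev.
```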
In order to predict the class label of X, P(X|Ci) P(Ci) is evaluated for each class Ci, and the classifier predicts the class for which this product is maximum.
For example, with 14 training tuples of which 9 have buys_computer = yes and 5 have buys_computer = no, the class prior probabilities are
P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357
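The following is a minimal naïve Bayesian classifier sketch for categorical attributes, using a tiny made-up training set; the attribute values and counts below are illustrative and are not the textbook example.

```python
from collections import Counter, defaultdict

# Illustrative training tuples: (attribute dict, class label). Values are made up.
training = [
    ({"age": "youth", "student": "yes"}, "yes"),
    ({"age": "youth", "student": "no"}, "no"),
    ({"age": "senior", "student": "yes"}, "yes"),
    ({"age": "senior", "student": "no"}, "no"),
    ({"age": "middle", "student": "no"}, "yes"),
]

def train_naive_bayes(data):
    """Estimate priors P(Ci) and conditionals P(xk | Ci) by counting."""
    class_counts = Counter(label for _, label in data)
    cond_counts = defaultdict(Counter)      # (class, attribute) -> Counter of values
    for attrs, label in data:
        for attr, value in attrs.items():
            cond_counts[(label, attr)][value] += 1
    priors = {c: n / len(data) for c, n in class_counts.items()}
    return priors, cond_counts, class_counts

def predict(x, priors, cond_counts, class_counts):
    """Return the class maximizing P(X|Ci) * P(Ci) under conditional independence."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for attr, value in x.items():
            score *= cond_counts[(c, attr)][value] / class_counts[c]
        if score > best_score:
            best_class, best_score = c, score
    return best_class

priors, cond_counts, class_counts = train_naive_bayes(training)
print(predict({"age": "youth", "student": "yes"}, priors, cond_counts, class_counts))
```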
Rule-Based Classification
The learned model is represented as a set of IF-THEN rules.
A rule-based classifier uses a set of IF-THEN
rules for classification. An IF-THEN rule is an
expression of the form
IF condition THEN conclusion
R1: (age = youth) ^ (student = yes) => (buys_computer = yes)
If the condition in a rule antecedent holds true for a
given tuple, we say that the rule is satisfied and that the rule
covers the tuple.
A rule R can be assessed by its coverage and accuracy.
Given a class-labeled data set D, let ncovers be the number of tuples covered by R,
ncorrect be the number of tuples correctly classified by R,
and |D| be the number of tuples in D. We can define the
coverage and accuracy of R as
coverage(R) = ncovers / |D|
accuracy(R) = ncorrect / ncovers
A rule's coverage is the percentage of tuples in D that are covered by the
rule (i.e., whose attribute values hold true for the rule's antecedent).
For a rule's accuracy, we look at the tuples that it covers and see what percentage of
them the rule can correctly classify.
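A quick sketch of computing coverage and accuracy for a rule, where a rule is represented as an antecedent dict of attribute tests plus a predicted class; this representation is an assumption made for the example.

```python
def covers(rule_antecedent, tuple_):
    """True if every attribute test in the antecedent holds for the tuple."""
    return all(tuple_.get(attr) == value for attr, value in rule_antecedent.items())

def coverage_and_accuracy(rule_antecedent, predicted_class, D):
    """coverage(R) = ncovers / |D|; accuracy(R) = ncorrect / ncovers."""
    covered = [t for t in D if covers(rule_antecedent, t)]
    if not covered:
        return 0.0, 0.0
    correct = sum(1 for t in covered if t["class"] == predicted_class)
    return len(covered) / len(D), correct / len(covered)

# Example for R1: IF age = youth AND student = yes THEN buys_computer = yes
# cov, acc = coverage_and_accuracy({"age": "youth", "student": "yes"}, "yes", D)
```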
We would like to classify a given tuple X according to
buys_computer. X satisfies R1, which triggers the rule.
If R1 is the only rule satisfied, then the rule fires
by returning the class prediction for X.
If more than one rule is triggered, a conflict
resolution strategy is needed to figure out which
rule gets to fire and assign its class prediction to X.
There are two strategies:
size ordering and
rule ordering
Size Ordering:
This scheme assigns the highest priority to the
triggering rule that has the toughest requirements,
where toughness is measured by the rule antecedent
size (left hand side).
That is, the triggering rule with the most attribute
tests is fired.
Rule Ordering:
The rule ordering scheme prioritizes the rules
beforehand.
The ordering may be class-based or rule-based.
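A brief sketch of the size ordering scheme: among all triggered rules, fire the one with the most attribute tests in its antecedent. Rules reuse the representation assumed above, and the default-class fallback is a choice made for the example.

```python
def classify_with_size_ordering(x, rules, default_class):
    """rules: list of (antecedent_dict, predicted_class). Fire the triggered rule
    with the largest antecedent; fall back to a default class if none is triggered."""
    triggered = [(ante, cls) for ante, cls in rules
                 if all(x.get(a) == v for a, v in ante.items())]
    if not triggered:
        return default_class
    best_antecedent, best_class = max(triggered, key=lambda rc: len(rc[0]))
    return best_class
```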
In comparison with a decision tree, the IF-THEN rules
may be easier for humans to understand, particularly if
the decision tree is very large.
Classification by Backpropagation
Backpropagation is a neural network learning algorithm.
A neural network is a set of connected
input/output units in which each connection has
a weight associated with it.
During the learning phase, the network learns by
adjusting the weights so as to be able to predict
the correct class label of the input tuples.
Neural network learning is also referred to as
connectionist learning due to the connections
between units.
Disadvantages of Neural Networks
Neural networks involve long training times and are
therefore more suitable for applications where this is feasible.
Advantages of Neural Networks
They include a high tolerance of noisy data as well as the ability to classify patterns on which they have not been trained.
A Multilayer Feed-Forward Neural Network
The backpropagation algorithm performs learning on a multilayer feed-forward neural network.
Defining a Network Topology
There are no clear rules as to the best number of hidden layer units; network design is largely a trial-and-error process.
Working of Backpropagation
Backpropagation learns by iteratively processing a data set of training tuples, comparing the network's prediction for each tuple with the actual known target value. For each training tuple, the weights are modified so as to reduce the error between the prediction and the target value; these modifications are made in the backwards direction, from the output layer back through each hidden layer.
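A compact sketch of backpropagation for a single-hidden-layer feed-forward network with sigmoid units, trained on a tiny made-up data set (XOR); the layer sizes, learning rate, and epoch count are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny illustrative data set (XOR): 2 inputs, 1 binary output.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.5, size=(2, 4)), np.zeros(4)   # input -> hidden weights/biases
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)   # hidden -> output weights/biases
lr = 0.5

for epoch in range(5000):
    # Forward pass: propagate the inputs through the network.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Backward pass: propagate the error from the output layer back to the hidden layer.
    err_out = (output - y) * output * (1 - output)           # error term at the output units
    err_hidden = (err_out @ W2.T) * hidden * (1 - hidden)    # error term at the hidden units

    # Weight and bias updates (gradient descent on the squared error).
    W2 -= lr * hidden.T @ err_out
    b2 -= lr * err_out.sum(axis=0)
    W1 -= lr * X.T @ err_hidden
    b1 -= lr * err_hidden.sum(axis=0)

# Network predictions after training.
print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))
```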
Associative Classification: Classification by Association Rule Analysis
Association rules show strong associations between attribute-value pairs (or items) that occur frequently in a given data set.
When mining association rules for use in classification, the rules
are of the form condset => y, where condset is a conjunction of
attribute-value tests and y is a class label.
k-Nearest-Neighbor Classifiers
The method is labor intensive when given large training sets.
Each training tuple is described by n attributes and represents a point in an
n-dimensional pattern space. When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k training
tuples that are closest to the unknown tuple.
A good value for k can be determined experimentally: starting with k = 1, a test
set is used to estimate the error rate of the classifier, and the process is repeated
with k incremented to allow for one more neighbor each time.
The k value that gives the minimum error rate may be selected. In
general, the larger the number of training tuples is, the larger the
value of k will be (so that classification and prediction decisions can
be based on a larger portion of the stored tuples).
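A minimal k-nearest-neighbor sketch using Euclidean distance over numeric attribute vectors; the representation (training tuples as (vector, label) pairs) and the majority-vote handling are choices made for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two numeric attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(unknown, training, k=3):
    """training: list of (vector, class_label). Return the majority class
    among the k training tuples closest to the unknown tuple."""
    neighbors = sorted(training, key=lambda tl: euclidean(unknown, tl[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Example usage with made-up points:
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]
print(knn_classify((1.1, 0.9), training, k=3))   # expected to print "A"
```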
Genetic Algorithms
An initial population is created consisting of randomly generated rules, where each rule can be represented by a string of bits.
A new population is formed to consist of the fittest rules in the
current population, as well as offspring of these rules.
Genetic algorithms are easily parallelizable and have been used for
classification as well as other optimization problems.
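A toy sketch of the genetic algorithm loop for evolving bit-string rules: fitness here is simply the training-set accuracy of the rule a bit string encodes. The encoding (one bit per candidate attribute-value test plus a fixed predicted class), the tiny data set, and the population size and crossover/mutation scheme are all assumptions made for illustration.

```python
import random

# Candidate attribute-value tests, one per bit of a rule string (illustrative only).
TESTS = [("age", "youth"), ("age", "senior"), ("student", "yes"), ("income", "high")]
PREDICTED_CLASS = "yes"   # every rule in this toy example predicts this class

# Tiny made-up training data: (attribute dict, class label).
DATA = [
    ({"age": "youth", "student": "yes", "income": "high"}, "yes"),
    ({"age": "youth", "student": "no", "income": "low"}, "no"),
    ({"age": "senior", "student": "yes", "income": "low"}, "yes"),
    ({"age": "senior", "student": "no", "income": "high"}, "no"),
]

def fitness(bits):
    """Fitness of a rule = its classification accuracy on the training data."""
    antecedent = [TESTS[i] for i, b in enumerate(bits) if b]
    correct = 0
    for attrs, label in DATA:
        satisfied = all(attrs.get(a) == v for a, v in antecedent)
        predicted = PREDICTED_CLASS if satisfied else "no"
        correct += predicted == label
    return correct / len(DATA)

def crossover(p1, p2):
    """Single-point crossover: swap substrings of two parent rules."""
    point = random.randrange(1, len(p1))
    return p1[:point] + p2[point:]

def mutate(bits, rate=0.1):
    """Flip each bit with a small probability."""
    return [b ^ 1 if random.random() < rate else b for b in bits]

random.seed(0)
population = [[random.randint(0, 1) for _ in TESTS] for _ in range(10)]

for generation in range(20):
    population.sort(key=fitness, reverse=True)
    survivors = population[:5]                        # keep the fittest rules
    offspring = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                 for _ in range(5)]                   # offspring via crossover and mutation
    population = survivors + offspring                # form the new population

best = max(population, key=fitness)
print(best, fitness(best))
```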
Rule-based systems for classification have the disadvantage that they involve sharp
cutoffs for continuous attributes. For example, consider a rule for credit application
approval, where applicants who have had a job for two or more years and who have
a high income (i.e., of at least $50,000) are approved:
IF (years_employed >= 2) AND (income >= 50000) THEN credit = approved