Data Mining Classification and Prediction
Data Mining Classification and Prediction
Data Mining Classification and Prediction
Alfonso Urso, Antonino Fiannaca, Massimo La Rosa, Valentina Ravì, and Riccardo Rizzo, ICAR-CNR, Palermo, Italy
r 2018 Elsevier Inc. All rights reserved.
Classification is the most common task in machine learning. It works on input data where each item is tagged with a predefined set
of categorical labels or classes (Kesavaraj and Sukumaran, 2013). From such training records, the classifier learns a model able to
distinguish classes or concepts for future predictions of discrete variables.
Using examples labelled with true function values (observation, measurements, etc.), a numerical predictor, or merely a
predictor, learns to construct continuous-valued functions as model that predicts unknown or missing numeric values (Han et al.,
2012).
Many applications of classification task solve several practical problems, for example, credit risk prediction, based on the
knowledge acquired by collecting data from many loan applicants. The output from such a classifier will be one of two possible
categorical labels “safe” or “risky”, asserting if a new applicant for a loan is reliable or not.
Similarly, regression models can be used in several fields of interest, for example, in marketing management to predict how
much a given customer will spend during a sale. Unlike classification, numeric prediction models continuous-valued function f by
0 0
mapping the input vector of attributes X to gives ordered continuous values y as output: y ¼ f(X).
Model Construction
Building a classifier or a predictor needs a training phase, in which a set of input labelled data, known as the training set is shown to
the machine. Each item in training set is described by some features or attribute (qualitative or quantitative) and by a class or a
numerical label. In this step, the candidate algorithms learn predetermined data classes or concepts: this is the operation mode of
supervised learning. The validation set allows selecting the best model among possible candidates, comparing their performances
over data never seen before. Finally, when the model was chosen, it needs to assess the machine's ability in classification task on a
new data set, so called test set. A good classifier or predictor is a machine that inferred the “general” model underlying training data
and can apply it to absolve his task on new unseen data.
Role of features
The term features holds the meaning of learning, representing the ensemble of characteristics useful to bring out the model from
input data. The right model gets the best classification or prediction performances, for this reason, the skill of features extraction is
considered a basic part of the classification/prediction task.
Classification and prediction represent the most used supervised techniques of machine learning (Han et al., 2012). Classifiers and
predictors develop their skills starting from training labelled data, discrete or numerical records, and produce categorical labels
(classes) or numeric predictions (function values) respectively. The base of operation for both machines is the “model” that must
connect the class (discrete or numerical) to the features.
Learning
The first step to constructing a classifier is learning the model from data. The classification model originates from training data,
which must be suitably prepared. Data preparation involves (i) cleaning, (ii) relevance analysis and (iii) data transformation:
(i) In the cleaning step, data are pre-processed to reduce noise and to handle missing values.
(ii) Relevance analysis consists of correlation analysis and attribute selection: these procedures reduce redundant and irrelevant
features, improving classification efficiency and scalability.
(iii) Then the information contained in training records have to be generalized into concepts, with data dimensionality
reduction.
In training phase, the machine is fed with pre-processed data, which is a list of instances, each of them described by some
attributes and an assigned qualitative (for a classifier) or numerical (for predictor) label for each. This label represents the “true”
class or the function values. Symbolically, the i–th instance is presented to the algorithm by the pair (Xi,yi), where X¼ (x1,…,xn) is
the attribute vector, which is associated to a predefined label yi. The algorithm increases its knowledge using the information from
the training records. The larger is the training set, the better is the classifier/predictor. In particular, a numerical predictor in training
phase constructs a function that minimizes the difference between the predicted and true function values.
Model selection depends in comparing different models learned during the training by using validation data set.
p
Accuracy ¼ ð1Þ
t
where p is the number of right classifications and t is the total number of test cases.
For a predictor the accuracy is provided by measuring the “distance” between the actual value yi and the predicted value yi0
corresponding to each vector of input attribute Xi. Loss functions provide these kind of error measures. The following expressions
(2,3) shows the most common loss functions:
Starting from these expression, the average loss over the test set is obtained by the error rates in expressions (4) and (5):
Pd
i¼1 jyi yi0 j
Mean absolute error: ð4Þ
d
Pd
i¼1 ðyi yi0 Þ2
Mean squared error: ð5Þ
d
Further Considerations
The simpler classification problem is the binary classification. The binary or two-class classification has several applications, for
example, in problems regarding credit/loan approval or medical diagnosis. Moreover, classification techniques provides solution
also for problems involving more than two classes, dealing with multi-class classification (see also the section about multi-class
classification in SVM section) (Li et al., 2004; Zhou et al., 2017).
It is possible to forecast numeric values with regression. Three types of regression analysis can be conduct:
A discussion of such methods will be subject to the following paragraphs and the next article.
Data Mining: Classification and Prediction 3
Classification by Decision Tree Induction: ID3 (Iterative Dichotomiser), C4.5, CART (Classifier and Regression Trees)
In a classification task, the objects of the universe are described by a set of attributes and their values. To distinguish an object from
others, the classifier has to absolve the induction task (Quinlan, 1986). The induction task consists in developing a classification
rule able to identify the class of any object of the universe from the value of its attributes that represents some important
characteristics or features of object.
A decision tree classifier is a simple and widely used classification technique. As other classification methods, it builds a
classification model learning from a bunch of input data. By analogy with a botanical tree, the structure of the decision tree
consists of a flowchart with a root, branches, internal nodes and leaves. Each structural element of a tree identifies a step in a
splitting data process, whereby the tree algorithm separates input training items by testing their different characteristics, i.e.,
different values of some their attributes.
In particular, the classification process of an object starts from the root that is the feature always tested for each instance. The
different values of root attribute lead to different possible paths on the tree, addressing the algorithm to a specific branch. The
decision process continues in the next internal node, where a new attribute test run over another instance attribute until the reaching
of a leaf node, representing the class that owns the instance considered.
Decision tree classifiers are employed extensively in several fields of interest, such as medical diagnosis, weather prediction,
fraud detection, credit approval, target marketing, customer segmentation (Lavanya and Rani, 2011). Their hallmarks are:
(i) providing intelligible classification rules, (ii) ease to interpret, (iii) speed of construction, (iv) high accuracy. Often expert
systems use decision tree, because they offer results comparable to expert human skills.
Fig. 1 Example of training set and related tree. The example is referred to the question if “Saturday morning” is suitable or not for some activity.
Reproduced from Quinlan, J.R., 1986. Induction of decision trees. Machine Learning 1, 81–106.
4 Data Mining: Classification and Prediction
An adequate choice of features allows build a decision tree that provides a correct classification for each object in the training
data set. Nevertheless, the scope of induction task is implementing a classifier able to label correctly other unseen objects as well.
Therefore, the decision tree must catch the underlying connection between an object’s class and values of its attributes.
It is possible to build more than one tree starting from a training set; in general, it is better to choose the simplest tree in order
to avoid overfitting problems and guarantee a proper classification of new objects. Indeed, it would be expected that the simpler
tree works better than others tree trained on the same data set, because it more likely that correctly classifies more objects unseen
(Quinlan, 1986). For this reason, in the most of cases, the algorithm uses a nontrivial stopping criterion in growing procedure,
considering a mixture of elements with different label class as a leaf node and accepting such as imperfect decision in favor of a
simpler, more generic tree structure. Furthermore, pruning operation generalizes the tree structure where branches are redundant,
therefore removing noise and outliers.
The frequently used decision tree algorithm are Iterative Dichotomiser 3 (ID3), C4.5 and CART.
ID3
ID3 is a decision tree algorithm with an iterative structure based on Hunt’s Concept Learning Systems (Hunt, 1962; Hunt et al.,
1966). During the learning process, only a subset of training data is selected randomly to form a first decision tree. If this tree
correctly classifies also all the other objects in the training set, then it represents the correct tree and the research process finish.
Otherwise, a new subset or window of training records is chosen to form a new tree, including a selection of the incorrectly
classified objects. This iterative process continues until the decision algorithm is able to classify correctly the entire training data
set. The main advantage of ID3 strategy allows identifying the correct decision tree with few iterations even on training set
containing thousands of items.
C4.5 algorithm
An improvement of ID3 is C4.5 algorithm, also developed by J. Ross Quinlan in 1993. C4.5 works with multiway splits in attribute
test (Chen et al., 2015). C4.5 introduces some advantages with respect to ID3, working with both categorical and numerical
(or continuous) attributes, and handling missing attribute values. Moreover, it employs information-based criteria as attribute
selection measure. Furthermore, it uses pruning phase when the tree building is completed. Pruning allows handling the over-
fitting problem, wherein the decision tree algorithm may incur.
Information gain
The decision tree algorithm uses a set of measures of entropy or impurities to evaluate the best choice for the node attribute.
Indeed, a good attribute splits the data by grouping the greatest number of cases that belong to the same class, so that a successor
node is as pure as possible. For this reason, it is indicative a measure of “order”, defined as the number of items belonging to the
same class; obviously, the more items belong to the same class, greater will be the order degree. Entropy is a measure of disorder,
and represents the amount of information. Entropy is null for a pure node (leaf) and becomes maximum when all the classes at
node N are equally likely. Therefore, given a set S of examples and n classes, the entropy is expressed by Eq. (6)
X
n
EðSÞ ¼ pi log2 pi ð6Þ
i¼1
where pi is the portion of examples in i-th class (entropy of set S is constant for all attribute).
To compute the quality of the split on an entire attribute A, the weighted average is calculated over all sets resulting from the
split on all values of A: I(S,A) is the average entropy for attribute A and is defined as in Eq. (7):
X
v
jSj j
IðS; AÞ ¼ EðSj Þ ð7Þ
j
S
Data Mining: Classification and Prediction 5
where Sj represents the j-th subset of S depending on the j-th value of the attribute A. The information gain for attribute A is the
information gained by branching on A is shown in Eq. (8):
GainðS; AÞ ¼ EðSÞ2IðS; AÞ ð8Þ
A good rule to tree growing consists of choosing that attribute for which the gain of information is maximum, or equivalently,
that attribute with minimum average entropy (Han et al., 2012).
Information gain is used as attribute selection measure in ID3 e C4.5 algorithms.
Gini index
Gini Index is an inequality measure, employed in several scientific fields. In decision tree learning, Gini index is another method to
assess purity/impurity of a node. Mathematical expression of Gini impurity of a set Sm of records at a given node m is the following
Eq. (9):
X
n
iðSm Þ ¼ GiniðSm Þ ¼ 1 p2j ð9Þ
j¼1
where pj is the portion of records belonging to j-th class, over a total of n classes. For Gini index, each attribute determines a binary
data split. Gini index measures impurities of each possible pairs of partitions Sm1 and Sm2 for a given attribute A, as shown in
formula (10):
Sm1 Sm2
GiniA ðSm Þ ¼ GiniðSm1 Þ þ GiniðSm2 Þ ð10Þ
Sm Sm
The attribute with the partition having minimum Gini index is chosen to continue the tree building process (Han et al., 2012).
Gini index is usable in problem with two or more categories and is employed as attribute selection measure in CART algorithm.
Rule-Based Classification
Rule models are the second major techniques used in machine learning after the decision tree methods. Trees offer a first, although
rigid example of rules as seen in branching process, since each branch dictates univocal way to browse the tree. Instead, in a rule
model the algorithm may infer further information from the possible overlapping among several rules. As well as the decision
trees, the Rule-Based Classifiers are highly expressive, easy to generate and to interpret. Their performances are comparable to
decision trees. Moreover, rules methods can classify rapidly new instances and can easily handle missing values and numeric
attributes (Tan et al., 2005). Rule-based methods form an integral part of expert systems in artificial intelligence.
If-Then Rules
A rule-based classifier uses a collection of rules of the type, “If…. Then…” to absolve the classification task. The expression
following “If” states a condition or predicate (antecedent rule or left-hand side, LHS) that is a conjunction of attributes value. The
class is defined after the keyword “then”, which is the consequent (or right-hand side, RHS) of the rule. In a data set of records to
classify, an instance is labelled according to the conditions verified by his attributes. The label class will be assigned to the instance
whose attributes verify the condition of a rule r; in this case, the rule r covers the considered instance. If attributes of the instance do
not satisfy any rule, the item label will be that of a default class (Wu et al., 2008).
Fig. 2 Successive iterations of coverage learning: fraction of positive examples (squares), covered by a rule, discarded from original data set.
Fig. 3 Enrichment of a rule by adding a new combination of attribute test leads to a reduction of covered examples (left panel) and allows to
improve the classification (right panel).
information contained in it by means of some test selection criteria such as accuracy or information gain (for accuracy, see below;
for information gain, see the previous section). By adding more conditions to the rule, its coverage is reduced (see Fig. 3). The
growing rule continues until training set cannot be split any further or another stopping criterion is met. The pseudo-code for
sequential covering algorithm is shown in Fig. 4.
Rules accuracy can be expressed in the more compact form A¼ p/t, where p are positive examples, i.e., records with class
correctly predicted by rule and t is the total number of records covered by rule. It is immediate to consider the difference t p as the
number of errors made by rule. Other metrics of rules evaluation are illustrated in Fürnkranz (1999).
Often, accuracy of a rule is adopted as stopping criterion of rule construction. In particular, the building process stops when
accuracy of the rule is maximal, or when increasing in accuracy gets below a given threshold.
When the rule classifier is constructed, test set allows evaluate model accuracy.
• Size ordering: rules with wider LHS condition have the highest priority.
• Class-based ordering: rules belonging to the same class are grouped together and classes appear by decreasing of prevalence,
being class prevalence the fraction of instances that belong to a particular class.
• Rule-based ordering (decision list): rule sets are ordered by priority list, depending on measures of rule quality or experts
knowledge.
Some rules can become over specialized, incurring in the overfitting problem. A solution for this inconvenience is to prune the
rules, by using validation set. Two main strategies are incremental pruning and global pruning (Tan et al., 2005).
Classification by Multilayer Feed-Forward Neural Networks (Backpropagation, Sigmoid Function, Learning Rate)
Conventional machine learning techniques require considerable domain expertise and engineering skills to extract right features
from raw data and so to detect pattern or classify the input. Artificial Neural Networks (ANN) or also Neural Network (NN) allows
solving problems where features detection is difficult. They are used in domains like biomedical field (Dreiseitl and Ohno-
Machado, 2002), computer vision and speech and image recognition, drug discovery and genomics (LeCun et al., 2015).
Neural Networks are computational methods whose structure follows the information processing model of the biological brain
(McCulloch and Pitts, 1943; Widrow and Hoff, 1960; Rosenblatt, 1962; Rumelhart et al., 1986). Feedforward neural networks
architecture is the first and simplest kind of neural network. It represents a powerful tool for handle non-linear classification
(see note in the following section) and regression problems.
Fig. 5 Two-class linear classification, plotted in a 2-dimensional feature space. The red straight line is the decision boundary among positive and
negative examples. w ¼(w1, w2) is the vector of weights and x¼(x1, x2) is the vector of features on which to perform the classification for a given
instance.
Linear y(a)¼ a
(
Step 1 if a40
y ða Þ ¼
1 if a r 0
Fig. 7 Example of a two layer, fully connected network (do not count the input layer).
or error function E ¼ E(y,t), which reports how well the network solves the task. The training process consists in a minimization of
objective function, by varying the weight parameters w. For function minimization is evaluated also the gradient of the objective
function with respect to w by using the backpropagation algorithm.
Fig. 8 Multilayer neural network schematized by fully connected black dots: from left to right – two input unit, two units in the hidden layer and
one output unit. The hidden layer deforms the input space to obtain a linear decision boundary between classes of data, as highlighted from the
shape of the border between colored areas. Reproduced from LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436–444.
backpropagation networks, Sigmoid activation function is preferred to step function used in perceptron; the objective function of
the network is often defined as (using least squares methods as optimization technique. An alternative to optimization function is
Pp
cross-entropy (with the relative maximum likelihood estimation method). E ¼ 12 i ¼ 1 ðyi ti Þ2 , for a training set containing
p records. The learning rule for neural network relies on gradient descent procedure that allows minimizing E starting from its
∂E ∂E ∂E
gradient ∇E ¼ ∂w ;
1 ∂w2
; …; ∂wl with respect to l weights. The weights, initialized with random values, represents the only tuning
∂E
factor to minimize the error function through a change Dwi ¼ Z∂w i
; where Z is the learning rate (Royas, 1996). This latter can be a
constant value or can be a decreasing function of simulation time t such as Z0/t, where t is an integer counting the algorithm steps.
Learning rate denotes the amount of the weights’ variation or the length of the step of each iteration in the negative gradient
descent.
Support vector machine (SVM) is a supervised machine learning technique based on statistical learning theory (Vapnik, 1995;
Scholkopf et al., 1995; Cristianini and Shawe-Taylor, 2000). SVM is a binary classifier (the term “machines” descends from the fact
that SVM algorithm produces a binary output), but it can handle problems with many classes and also numerical prediction tasks
(Wang and Xue, 2014; Wu et al., 2008). Furthermore, it can treat non-linearly separable data. Similarly to neural network, SVM
employ a non-linear mapping of input vectors in a high-dimensional features space, where it can linearly separate data, by
so-called optimal separating hyperplanes (Chen et al., 2015).
SVM algorithms allow approach many problems of different domains, such as pattern recognition, medical diagnosis, mar-
keting (Yingxin and Xiaogang, 2005) or text classification (Minsky, 1961; Joachims, 1998).
Fig. 9 Linearly separable data (A); different possibilities to separate data are shown in (B)–(D) panels.
Fig. 10 Geometry of a SVM classifier. Dashed lines delimit the margin. The support vectors are training examples circled and nearest to decision
boundary t.
The constraint expresses the request that instances remain outside the margin. In Fig. 10, the hyperplanes of equation
w xi t ¼ 7m delimit the margin of separation between two classes yi of data, positives and negatives respectively. In particular,
positives must verify the condition (w xi t)Z1 and negatives (w xi t)r 1), where m is set to 1, as usual (the parameters
t, ||w|| and m can be rescaled for convenience). The product between classes yi ¼ 71 (assigned in binary supervised problems) and
the previous conditions allows formulate a compact constraint (Eq. 13), valid for all instances.
The Lagrange multipliers method approaches the optimization problem applied to the function to minimize under the
P
condition shown by Eq. (13). The Lagrange multipliers algebra allows to obtain the optimal value for the weight ðw ¼ ni¼ 1 ai yi xi ,
where ai are the Lagrange multipliers and n is the number of training examples) w and for the decision boundary t, starting from
the pairs (xi,yi), which are provided as input in training phase of SVM. Laid out these values, SVM will establish the class of a new
instance according to the following formula (14):
y0 ¼ sgn½w x0 t ð14Þ
The Lagrange function incorporates the constraint preceded by the Lagrange multipliers ai, as shown in the Eq. (15) (according
to the “hinge” loss function formulation (Hastie et al., 2008):
12 Data Mining: Classification and Prediction
1 Xn
Lðw; t; a1 ; …; an Þ ¼ jjwjj2 ai ðyi ðw xi t Þ 1Þ
2 i¼1
! !
1 X
n X
n X
n
¼ jjwjj2 w ai yi x i þt ai yi þ ai ð15Þ
2 i¼1 i¼1 i¼1
By setting to zero the partial derivative of the Lagrange function with respect to t and w, it is possible obtain the following
values:
∂ Xn
Lðw; t; a1 ; …; an Þ ¼ 0-w ¼ ai yi x i ð16Þ
∂w i¼1
∂ Xn
Lðw; t; a1 ; …; an Þ ¼ 0- ai yi ¼ 0 ð17Þ
∂t i¼1
Returning these values into the Lagrange function (15), the primal problem changes in the dual optimization problem
expressed in Eq. (18), entirely depending from the Lagrange multipliers:
! !
1 Xn Xn Xn
Lða1 ; …; an Þ ¼ ai yi xi ai yi xi þ ai
2 i¼1 i¼1 i¼1
1Xn Xn Xn
¼ ai aj yi yj xi xj þ ai ð18Þ
2i¼1j¼1 i¼1
To solve the dual problem it is necessary maximize (the dual Lagrange function (19) is maximized being the ai preceded by a
minus sign in Eq. (15)) the function L(a1,…,an) as described in the following expression:
1X n X n Xn
a1 ; …; an ¼ arg max ¼ ai aj yi yj x i x j þ ai
a1 ;…;an 2 i ¼ 1j ¼ 1 i¼1
ð19Þ
X
n
subject to ai 0; 1rirn; ai yi ¼ 0
i¼1
As shown in the Eq. (19), the optimization problem consists in pairwise dot products between training instances, which is
expressed by the element of the Gram matrix Hij ¼ xi xj.
The condition aiZ0 is typical for support vector machines (the constraint in Eq. (15) being non-negative) and include two
possible cases: (i) ai ¼ 0 for an example xi that does not contribute to learn the decision boundary; (ii) ai40 only for support
vectors, namely, for examples, nearest to the decision boundary. The Lagrange multipliers non-zero determine the decision
boundary through the expression (16) and (17) and consequently the class of each new instance through the Eq. (14).
! !
1 X
n X
n
¼ jjwjj2 w ai yi xi þt ai yi
2 i¼1 i¼1
Data Mining: Classification and Prediction 13
Fig. 11 Non-linear separable case: the distance between the red or blue example into the margin is –ξi/||w||.
X
n X
n
þ ai þ ðC ai bi Þξi
i¼1 i¼1
X
n
¼ Lðw; t; ai Þ þ ðC ai bi Þξi : ð21Þ
i¼1
For an optimal solution every partial derivative with respect to ξi should be zero, and consequently for all i follows the Eq. (22):
C ai bi ¼ 0 ð22Þ
Therefore, vanishing the adding term in the last expression of Eq. (21), the non-linear dual problem is reduced to the linear one
in formulation (19) with the addition of the upper bound on ai derived from bi (22): 0rairC.
With regard to the meaning of the upper bound of ai, the equality to C implies that bi ¼ 0, concerning the case in which training
examples are on or inside the margin, being respectively ξi ¼ 0 or ξi40.
Kernel trick
Non linearly separable data can be treated plotting them in a higher dimensional feature space through a mathematical trans-
formation known as kernel function k. Such function replaces the inner products between pairs of input instances xi xj in the SVM
algorithm for linear separable data, and allows to represent training examples without the explicit computation of their coordi-
nates in the new feature space. The formulas (23) and (24) express that in symbol:
k : x-FðxÞ ð23Þ
and
Hij ¼ Fðxi Þ Fðxj Þ ¼ kðxi ; xj Þ ð24Þ
where F(x) is the feature mapping of the instance x in the new feature space, performed by the kernel function k and k(xi,xj) ¼ F
(xi) F(xj) is the Kernel matrix element, which provides the Gram matrix element Hij already seen above transformed by k. Popular
kernels are listed in Table 2.
With kernel function, the mapping in a higher dimensional space can make possible separate linearly data that in the starting
feature space are non-linearly separable, as shown in Fig. 12.
Finally, for each new instance x0 the class is established through the expression:
y0 ¼ sgn½w Uðx0 Þ t ð25Þ
0
Pn 0
Pn 0
where w U(x ) ¼ i ¼ 1 ai yi Uðxi Þ Uðx Þ ¼ i ¼ 1 ai yi kðxi ; x Þ:
This means that classifying a new instances involves training examples with non-zero Lagrange multipliers.
Multiclass SVM
It is possible to use additional methods that extend the SVM classification to more than two classes. (Weston and Watkins, 1999;
Crammer and Singer, 2001; Hsu and Lin, 2002; Wang and Xue, 2014). Among the strategies aimed to adapt SVM to multiclass
classification, there are two common techniques: one-versus-all (Vapnik, 1998) and one-versus-one (Kreßel, 1999). Both approaches
decompose the single multiclass problem into a set of binary classification problems. To solve a problem with k classes, the one-
versus-all approach trains k SVM binary classifiers so that i-th model recognizes as positives the examples in the i-th class and as
14 Data Mining: Classification and Prediction
Linear k0(Xi,Xj)¼ Xi Xj
Polynomial k1(Xi,Xj)¼(Xi Xj þ a)b
jjx i x j jj2
Gaussian k2(Xi,Xj) ¼ exp 2g2
Sigmoidal k4(Xi,Xj)¼tanh(a(Xi Xj) b)
Fig. 12 Left: in the starting 2D feature space date are not linearly separable; right: kernel trick allows to map data in a 3D new feature space,
where data are linearly separable by means of a hyperplane.
negatives all the rest of examples. In the test phase for a given input, the binary classifier that provides positive output value assigns
the class label.
The one-versus-one (Kreßel, 1999) method solves the k classes problem with a pairwise decomposition of classifiers. This
approach uses k(k 1)/2 binary classifiers, each one of them is trained to identify as positive a given class and negative all the rest
(in a similar way as described above). In the test phase, a given input instance is evaluated or “voted” from pairs of classifiers. The
class label of this test instance will arise from the classifier that gives the highest number of votes.
The rules syntax concerns different machine learning areas. As shown above (see Section Rule-Based Classification), supervised
learning employs rules to absolve some classification tasks. But rules paradigm also take place in the unsupervised learning, with
the association rules (Cios et al., 2007). Both approaches have a large use for many application domains (Gosain and Bhugra,
2013), according to their proper characteristics and differences.
rules (CARs). The first implementation of associative classification is the Classification Based on Association algorithm (CBA)
(Liu et al., 1998).
CBA
The CBA algorithm consists of two parts: i) a rule generator (CBA-RG), based on the Apriori algorithm (Agrawal and Srikant, 1994)
appropriately modified to mine class association rules; ii) a classifier builder (CBA-CB) that classifies accurately the rules previously
generated. Comparing CBA with other classifier based on rules, such as C4.5 (Quinlan, 1993) and CART (Breiman et al., 1984),
some differences emerge. Both approaches are based on the “covering method” (Michalski, 1980), but results are different,
depending on the way of application. CBA algorithm (CBA-RG module) adopts covering method, running it over all the rules
mined from the entire training set. Unlike, the traditional application of covering method works on one class at a time (see also
previous section on rules-based classifier), whenever removing covered examples by best rules, and learning new rules from the
remaining cases. The difference among results of CBA and the other classifiers arises from the general validity of rules mined by
CBA that constructs all the rules from the entire training set and then makes the best selection of rules covering training data.
Furthermore, rule-pruning technique is adopted in CBA to avoid overfitting problems. Pruning represents a difference between
CBA algorithm and the association rule mining. Another difference lies in the presence of CBA-CB module, being the AR not based
on the classifier building.
construction of rules, for example, it generates rules directly from training data, instead of creating a large set of candidate rules, as in
associative classification; moreover, it produces a more complete set of rules than traditional rule-based approaches.
CPAR originates from First Order Inductive Learner (FOIL) learning system (Quinlan, 1990; Quinlan and Cameron-Jones,
1993), modified with Predictive Rule Mining (PRM) algorithm. PRM varies the covering method (see Section Rule-Based
Classification) implemented by FOIL, keeping the rules that cover positive examples and associating a “weight” to these rules. In
this way, positives examples are covered by more rules, achieving higher accuracy than FOIL. Furthermore, working with large
database, PRM attains greater efficiency with respect to FOIL, thanks to reducing of time consuming part in rules building. PRM
reaches higher efficiency but lower accuracy than associate classification. PRM results are exceeded by CPAR, that is the evolution
of PRM. It gets better the rule building process, avoiding redundant rule generation and building several rules simultaneously.
Finally, CPAR evaluates the prediction power of every rule, defining its expected accuracy as the probability that an example
observing the antecedent of the rule actually belongs to class predicted in the consequent. So doing, in a set of k best rules for a
class, the rule with the highest expected accuracy will be chosen to class prediction.
References
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.I., 1995. Fast discovery of association rules. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R.
(Eds.), Advances in Knowledge Discovery and Data Mining. Cambridge, MA: AAAI/MIT Press, pp. 307–329.
Agrawal, R., Srikant R., 1994. Fast algorithms for mining association rules. In: Proceedings of International Conference on Very Large Data Bases (VLDB), San Jose, CA.
Agrawal, R., Tomasz I., Swami, A., 1993. Mining association rules between sets of items in large databases. In: Buneman, P., Jajodia, S. (Eds.), Proceedings of the 1993 ACM
SIGMOD International Conference on Management of Data, pp. 207–216. New York, NY: ACM.
Atluri, G., Gupta, R., Fang, G., et al., 2009. Association analysis techniques for bioinformatics problems. In: Rajasekaran, S. (Ed.), Bioinformatics and Computational Biology.
New Orleans, LA: Springer, pp. 1–13.
Baralis, E., Chiusano, S., Garza, P., 2004. On support thresholds in associative classification. In: Proceedings of the 2004 ACM Symposium on Applied Computing,
pp. 553–558. New York, NY: ACM.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Belmont, CA: Wadsworth.
Brin, S., Motwani, R., Silverstein, C., 1997. Beyond market baskets: Generalizing associations rules to correlations. In: Proceedings of the 1997 ACM SIGMOD International
Conference on Management of Data, pp. 265–276. New York, NY: ACM.
Cendrowska, J., 1987. PRISM: An algorithm for inducing modular rules. International Journal of ManMachine Studies 27, 349–370.
Chen, F., Deng, P., Wan, J., et al., 2015. Data mining for the internet of things: Literature review and challenges. International Journal of Distributed Sensor Network 2015, 1–15.
Cios, K.J., Swiniarski, R.W., Pedrycz, W., Kurgan, L.A., 2007. Unsupervised learning: Association rules. In: Kecman, V. (Ed.), Data Mining. Springer, pp. 289–306.
Clark, P., Niblett, T., 1989. The CN2 induction algorithm. Machine Learning 3, 261–283.
Cohen, W.W., 1995. Fast effective rule induction. In: Proceedings of the 12th International Conference on Machine Learning, pp. 115–123. San Mateo, CA: Morgan and Kaufmann.
Cortes, C., Vapnik, V., 1995. Support-vector networks. Machine Learning 20 (3), 273–297.
Crammer, K., Singer, Y., 2001. On the algorithmic implementation of multiclass kernel-based machines. Journal of Machine Learning Research 2, 265–292.
Cristianini, N., Shawe-Taylor, J., 2000. An Introduction to Support Vector Machines. New York, NY: Cambridge University Press.
Dreiseitl, S., Ohno-Machado, L., 2002. Logistic regression and artificial neural network classification models: A methodology review. Journal of Biomedical Informatics 35 (5), 352–359.
Fürnkranz, J., 1999. Separate-and-conquer rule learning. Artificial Intelligence Review 13, 3–54.
Fürnkranz, J., 2010. Rule learning. In: Sammut, C., Webb, G.I. (Eds.), Encyclopedia of Machine Learning. New York, NY: Springer, pp. 875–879.
Fürnkranz, J., Flach, P.A., 2005. Roc ‘n0 rule learning – Towards a better understanding of covering algorithms. Machine Learning 58, 39–77.
Gosain, A., Bhugra, M., 2013. A comprehensive survey of association rules on quantitative data in data mining. In: IEEE Conference Information & Communication
Technologies, JeJu Island, pp. 1003–1008.
Han, J., Kamber, M., Pei, J., 2012. Data mining, Concepts and Techniques, third ed. Waltham, MA: Morgan Kaufmann.
Han, J., Pei, J., Yin, Y., 2000. Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM SIGMOD International conference on Management of
Data, pp. 1–12. New York, NY: ACM.
Hastie, T., Tibshirani, R., Friedman J., 2008. The Elements of Statistical Learning, second ed. Stanford, CA: Springer.
Holte, R.C., 1993. Very simple classification rules perform well on most commonly used datasets. Machine Learning 11, 63–90.
Hsu, C.W., Lin, C.J., 2002. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13 (2), 415–425.
Hunt, E.B., 1962. Concept Learning: An Information Processing Problem. New York, NY: Wiley.
Hunt, E.B., Marin, J., Stone, P.J., 1966. Experiments in Induction. New York, NY: Academic Press.
Jain, A.K., Mao, J., Mohiuddin, K.M., 1996. Artificial neural networks: A tutorial. IEEE Computer-Special Issue in Neural Computing 29 (3), 31–44.
Joachims, T., 1998. Text categorization with support vector machines: Learning with many relevant features. In: Hutchison, D., Kanade, T., Kittler, J., et al. (Eds.), Machine
Learning: European Conference on Machine Learning (in LNCS 1398), pp. 137–142. Berlin: Springer.
Kesavaraj, G., Sukumaran, S., 2013. A study on classification techniques in data mining. In: Proceedings of the International Conference on Computer Communication and
Networking Technologies (ICCCNT'13), pp. 1–7. Tamil Nadu: IEEE.
Kreßel, U., 1999. Pairwise classification and support vector machines. In: Schölkopf, B., Burges, C., Smola, A. (Eds.), Advances in Kernel Methods: Support Vector Learning.
Cambridge: MIT Press, pp. 255–268.
Kulkarni, M., Kulkarni, S., 2016. Knowledge discovery in text mining using association rule extraction. International Journal of Computer Applications 143, 30–36.
Lavanya, D., Rani, K.U., 2011. Performance evaluation of decision tree classifiers on medical datasets. International Journal of Computer Applications 26, 1–4.
Lawrence, R.L., Wright, A., 2001. Rule-based classification systems using classification and regression trees (CART) analysis. Photogrammetric Engineering & Remote Sensing
67, 1137–1142.
LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436–444.
Li, T., Zhang, C., Ogihara, M., 2004. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression.
Bioinformatics 20, 2429–2437.
Liu, B., 2011. Web data mining. Exploring Hyperlinks, Contents and Usage Data. Chicago, IL: Springer.
Liu, B., Hsu, W., Ma, Y., 1998. Integrating classification and association rule mining. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data
Mining, pp. 80–86. New York, NY: AAAI Press.
Li, W., Han, J., Pei, J., 2001. CMAR: Accurate and efficient classification based on multiple class-association rules. In: Proceedings of the IEEE International Conference on
ICDM'01, pp. 369–376. IEEE: San Jose, CA.
Data Mining: Classification and Prediction 17
McCulloch, W.S., Pitts, W., 1943. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5 (4), 115–133. Reprinted in Anderson and
Rosenfeld, Springer.
Michalski, R., 1980. Pattern recognition as rule-guided induction inference. IEEE Transaction On Pattern Analysis and Machine Intelligence 2, 349–361.
Michalski, R.S., 1969. On the quasi-minimal solution of the general covering problem. In: Proceeding of the V International Symposium on the Information Processing, Bled,
Yugoslavia, pp 125–128.
Michalski, R.S., 1975. Synthesis of optimal and quasi-optimal variable-valued logic formulas. In: Proceedings of the 1975 International Symposium on Multiple-Valued Logic,
pp. 76–87. Bloomington, Indiana.
Minsky, M., 1961. Steps toward artificial intelligence. In: Hamburger, F. (Ed.), Proceedings of the IRE, vol. 49 (No. 1), pp. 8–30. New York: IEEE.
Pagallo, G., Haussler, D., 1990. Boolean feature discovery in empirical learning. Machine Learning 5, 71–99.
Quinlan, J.R., 1986. Induction of decision trees. Machine Learning 1, 81–106.
Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. San Francisco, CA: Morgan Kaufmann.
Quinlan, J.R., 1990. Learning Logical Definitions from Relations. Machine Learning, 5. Boston: Kluwer Academic, pp. 239–266.
Quinlan, J.R., Cameron-Jones, R.M., 1993. FOIL: A midterm report. In: Brazdil, P.B. (Ed.), Proceedings of European Conference on Machine Learning, pp. 3–20. Vienna: Springer-Verlag.
Raileanu, L.E., Stoffel, K., 2004. Theoretical comparison between the Gini Index and Information Gain criteria. Annals of Mathematics and Artificial Intelligence 41, 77–93.
Rokach, L., Maimon, O., 2005. Decision trees. In: Rokach, L., Maimon, O. (Eds.), Data Mining and Knowledge Discovery Handbook. New York, NY: Springer, pp. 165–192.
Rosenblatt, F., 1962. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanism. Washington, DC: Spartan Books.
Royas, R., 1996. Neural Networks. A Systematic Introduction. Berlin: Springer.
Rumelhart, D.E., Hinton, G.E., Williams, R.J., 1986. Learning internal representations by error propagation. In: Rumelhart, D.E., McClelland, J.L., PDP Research Group. (Eds.),
Parallel Distributed Processing: Explorations in the Microstructure of Cognition 1. MIT Press, pp. 318–362.
Scholkopf, B., Burges, C., Vapnik, V., 1995. Extracting support data for a given task. In: Fayyad, U.M., Uthurusamy, R. (Eds.), Proceedings of the First International Conference
on Knowledge discovery and Data Mining, pp. 252–257. Menlo Park, CA: AAAI Press.
Tan, P.N., Steinbach, M., Kumar, V., 2005. Introduction to Data Mining. Boston, MA: Pearson Addison Wesley.
Vapnik, V., 1995. The Nature of Statistical Learning Theory. New York, NY: Springer.
Vapnik, V., 1998. Statistical Learning Theory. New York, NY: Wiley.
Wang, Z., Xue, X., 2014. Multi-class support vector machine. In: Ma, Y., Guo, G. (Eds.), Support Vector Machines Applications, first ed. Cham: Springer International
Publishing, pp. 23–48.
Weston, J., Watkins, C., 1999. Support vector machines for multi-class pattern recognition. In: Verleysen, M. (Ed.), Proceedings of European Symposium on Artificial Neural
Networks, pp. 219–224. Bruges: ESANN.
Widrow, B., Hoff, M.E., 1960. Adaptive switching circuits. In: IRE WESCON Convention Record, vol. 4, pp. 96–104. Reprinted in Anderson and Rosenfeld (1988).
Wu, X., Kumar, V., Quinlan, R.J., et al., 2008. Top 10 algorithms in data mining. Knowledge and Information Systems 14, 1–37.
Yingxin, L., Xiaogang, R., 2005. Feature selection for cancer classification based on support vector machine. Journal of Computer Research and Development 42 (10), 1796–1801.
Yin, X., Han, J., 2003. CPAR: Classification based on predictive association rules. In: Barbara, D., Kamath, C. (Eds.), Proceedings of the 2003 SIAM International Conference
on Data Mining, pp. 331–335. San Francisco, CA: Society for Industrial and Applied Mathematics.
Zhou, L., Wang, Q., Fujita, H., 2017. One versus one multi-class classification fusion using optimizing decision directed acyclic graph for predicting listing status of
companies. Information Fusion 36, 80–89.
Further Reading
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press.
Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Singapore: Springer.
Duda, R.O., Hart, P.E., Stork, D.G., 1991. Pattern Classification. New York, NY: John Wiley & Sons.
Flach, P., 2012. Machine Learning. New York, NY: Cambridge University Press.
Harrinton, P., 2012. Machine Learning in Action. Shelte Island, NY: Manning.
MacKay, D.J.C., 2005. Information Theory, Inference and Learning Algorithms. Cambridge University Press.
McKinney, W., 2012. Python for Data Analysis. Sebastopol, CA: O’Reilly.
Mitchell, T.M., 1997. Machine Learning. Portland, OR: McGraw Hill.
Model, M.L., 2009. Bioinformatics Programmingusing Python. Sebastopol, CA: O’Reilly.
Muller, K.R., Mika, S., Ratsch, G., Tsuda, K., Scholkopf, B., 2001. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks 12 (2), 181–201.
Phyu, T.N., 2009. Survey of classification techniques in data mining. In: Ao, S.I., Castillo, O., Douglas, C., Feng, D.D., Lee, J.A. (Eds.), Proceedings of International Multi-
Conference of Engineers and Computer Scientists, pp. 727–731. Hong Kong: Newswood Limited.
Schölkopf, B., Smola, A.J., 2002. Learning With Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
Shawe-Taylor, J., Cristianini, N., 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.
Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y., 2005. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research
6, 1453–1484.
Witten, I.H., Frank, E., 2005. Data Mining: Practical Machine Learnig Tools and Techniques, second ed. San Francisco, CA: Morgan Kaufmann.
Wu, X., Kumar, V., 2009. The Top Ten Algorithms in Data Mining. Boca Raton, FL: Chapman & Hall/CRC.