Data Mining Classification and Prediction

Data Mining: Classification and Prediction
Alfonso Urso, Antonino Fiannaca, Massimo La Rosa, Valentina Ravì, and Riccardo Rizzo, ICAR-CNR, Palermo, Italy
r 2018 Elsevier Inc. All rights reserved.
Basic Concept of Classification and Prediction
Classification is the most common task in machine learning. It works on input data where each item is tagged with a predefined set
of categorical labels or classes (Kesavaraj and Sukumaran, 2013). From such training records, the classifier learns a model able to
distinguish classes or concepts for future predictions of discrete variables.
Using examples labelled with true function values (observation, measurements, etc.), a numerical predictor, or merely a
predictor, learns to construct continuous-valued functions as model that predicts unknown or missing numeric values (Han et al.,
2012).
Many applications of classification task solve several practical problems, for example, credit risk prediction, based on the
knowledge acquired by collecting data from many loan applicants. The output from such a classifier will be one of two possible
categorical labels “safe” or “risky”, asserting if a new applicant for a loan is reliable or not.
Similarly, regression models can be used in several fields of interest, for example, in marketing management to predict how
much a given customer will spend during a sale. Unlike classification, numeric prediction models continuous-valued function f by
0 0
mapping the input vector of attributes X to gives ordered continuous values y as output: y ¼ f(X).
Model Construction
Building a classifier or a predictor needs a training phase, in which a set of input labelled data, known as the training set is shown to
the machine. Each item in training set is described by some features or attribute (qualitative or quantitative) and by a class or a
numerical label. In this step, the candidate algorithms learn predetermined data classes or concepts: this is the operation mode of
supervised learning. The validation set allows selecting the best model among possible candidates, comparing their performances
over data never seen before. Finally, when the model was chosen, it needs to assess the machine's ability in classification task on a
new data set, so called test set. A good classifier or predictor is a machine that inferred the “general” model underlying training data
and can apply it to absolve his task on new unseen data.
Role of features
The term features holds the meaning of learning, representing the ensemble of characteristics useful to bring out the model from
input data. The right model gets the best classification or prediction performances, for this reason, the skill of features extraction is
considered a basic part of the classification/prediction task.
Classifier Versus Predictor: Learning, Testing, Accuracy, Overfit
Classification and prediction represent the most used supervised techniques of machine learning (Han et al., 2012). Classifiers and
predictors develop their skills starting from training labelled data, discrete or numerical records, and produce categorical labels
(classes) or numeric predictions (function values) respectively. The base of operation for both machines is the “model” that must
connect the class (discrete or numerical) to the features.
Learning
The first step to constructing a classifier is learning the model from data. The classification model originates from training data,
which must be suitably prepared. Data preparation involves (i) cleaning, (ii) relevance analysis and (iii) data transformation:
(i) In the cleaning step, data are pre-processed to reduce noise and to handle missing values.
(ii) Relevance analysis consists of correlation analysis and attribute selection: these procedures reduce redundant and irrelevant
features, improving classification efficiency and scalability.
(iii) Then the information contained in training records have to be generalized into concepts, with data dimensionality
reduction.
In training phase, the machine is fed with pre-processed data, which is a list of instances, each of them described by some
attributes and an assigned qualitative (for a classifier) or numerical (for predictor) label for each. This label represents the “true”
class or the function values. Symbolically, the i–th instance is presented to the algorithm by the pair (Xi,yi), where X¼ (x1,…,xn) is
the attribute vector, which is associated to a predefined label yi. The algorithm increases its knowledge using the information from
the training records. The larger is the training set, the better is the classifier/predictor. In particular, a numerical predictor in training
phase constructs a function that minimizes the difference between the predicted and true function values.
Encyclopedia of Bioinformatics and Computational Biology doi:10.1016/B978-0-12-809633-8.20461-5 1

2 Data Mining: Classification and Prediction
Model selection depends in comparing different models learned during the training by using validation data set.
Testing and Accuracy

When the training phase finish and the validation procedure selects the best among possible models, a test evaluates the learning
degree of the machine. For this purpose, the algorithm will examine a set of data in the same form (Xi,yi) of the training set.
Usually, test sample is achieved randomly selecting 20%–30% of the complete labelled data set.
A comparison between the class or numerical label of each instance in the test set and the corresponding output is used to test
the classifier. In particular, the percentage of cases correctly classified in the testing data set gets the accuracy of the classifier, as
shown in the formula (1):
p
Accuracy ¼ ð1Þ
t
where p is the number of right classifications and t is the total number of test cases.
For a predictor the accuracy is provided by measuring the “distance” between the actual value yi and the predicted value yi0
corresponding to each vector of input attribute Xi. Loss functions provide these kind of error measures. The following expressions
(2,3) shows the most common loss functions:
Absolute error: jyi yi0 j ð2Þ
Squared error: ðyi yi0 Þ2 ð3Þ
Starting from these expression, the average loss over the test set is obtained by the error rates in expressions (4) and (5):
Pd
i¼1 jyi yi0 j
Mean absolute error: ð4Þ
d
Pd
i¼1 ðyi yi0 Þ2
Mean squared error: ð5Þ
d
where d is the total number of test instances.

Evaluation of classification and prediction methods can also consider other criteria like:
• speed: time to construct the model; time to use the model;

• robustness: noise and missing values;
• scalability: efficiency in disk-resident databases;
• interpretability: understanding and insight provided by the model.
Generalization and Overfitting

The construction of classification model is based on a suitable choice of features, in type and number. Performances of the
classifier can be enhanced by raising the number of features and producing sophisticated decision boundary among data. This choice
can lead to excellent performances on the test set but can result in poor performances on new patterns. This declining performance
is a symptom of the problem of overfitting. It means that a complicated model with a complex decision boundary can give poor
generalization capability.
To avoid this effect, it is desirable looking for a general model. Indeed, the robustness of the algorithm depends on the degree
of generalization, namely the capacity of working with the crucial features of data and choosing general rules for the model.
Further Considerations
The simpler classification problem is the binary classification. The binary or two-class classification has several applications, for
example, in problems regarding credit/loan approval or medical diagnosis. Moreover, classification techniques provides solution
also for problems involving more than two classes, dealing with multi-class classification (see also the section about multi-class
classification in SVM section) (Li et al., 2004; Zhou et al., 2017).
It is possible to forecast numeric values with regression. Three types of regression analysis can be conduct:
(i) Linear and multiple regression.

(ii) Non-linear regression.
(iii) Other methods, such as generalized linear model (GLM), Poisson regression, regression trees.
A discussion of such methods will be subject to the following paragraphs and the next article.
Data Mining: Classification and Prediction 3
Classification by Decision Tree Induction: ID3 (Iterative Dichotomiser), C4.5, CART (Classifier and Regression Trees)
In a classification task, the objects of the universe are described by a set of attributes and their values. To distinguish an object from
others, the classifier has to absolve the induction task (Quinlan, 1986). The induction task consists in developing a classification
rule able to identify the class of any object of the universe from the value of its attributes that represents some important
characteristics or features of object.
A decision tree classifier is a simple and widely used classification technique. As other classification methods, it builds a
classification model learning from a bunch of input data. By analogy with a botanical tree, the structure of the decision tree
consists of a flowchart with a root, branches, internal nodes and leaves. Each structural element of a tree identifies a step in a
splitting data process, whereby the tree algorithm separates input training items by testing their different characteristics, i.e.,
different values of some their attributes.
In particular, the classification process of an object starts from the root that is the feature always tested for each instance. The
different values of root attribute lead to different possible paths on the tree, addressing the algorithm to a specific branch. The
decision process continues in the next internal node, where a new attribute test run over another instance attribute until the reaching
of a leaf node, representing the class that owns the instance considered.
Decision tree classifiers are employed extensively in several fields of interest, such as medical diagnosis, weather prediction,
fraud detection, credit approval, target marketing, customer segmentation (Lavanya and Rani, 2011). Their hallmarks are:
(i) providing intelligible classification rules, (ii) ease to interpret, (iii) speed of construction, (iv) high accuracy. Often expert
systems use decision tree, because they offer results comparable to expert human skills.
Build a Tree: Divide-and-Conquer Algorithm

The build or growth phase of tree begins from the choice of the root. Numerical computations based on information theory
(see Sections Information gain and Gini index) establish which attribute ought to be the root, or an internal node. His values will
be the branches that separate the training data in subsets (attribute test). In the next step, for each branched subset the algorithm
iterates the numerical computation on remaining attributes in order to find the following node and further divides data. Tree
expansion process stops when a leaf node arrives, namely when each item of the analyzed subset gives the same overcome of the
attribute test: this value represents the class label (Lavanya and Rani, 2011). In this ideal case, leaf node is pure, including elements
with the same category label. However, usually such a leaf node is not obvious, because it can contain a mixture of elements with
different label class. In this more recurrently case, the algorithm uses a different stopping criterion, accepting such as imperfect
decision in favor of a simpler, more generic tree structure. As an example, in Fig. 1 is shown a training set and the related tree,
proposed by Quinlan (1986). Here, the classifier suggests if “Saturday morning” is suitable or not for some activity, respectively
with the leaf node Yes or No, starting from its weather attributes.
Fig. 1 Example of training set and related tree. The example is referred to the question if “Saturday morning” is suitable or not for some activity.
Reproduced from Quinlan, J.R., 1986. Induction of decision trees. Machine Learning 1, 81–106.
An adequate choice of features allows build a decision tree that provides a correct classification for each object in the training
data set. Nevertheless, the scope of induction task is implementing a classifier able to label correctly other unseen objects as well.
Therefore, the decision tree must catch the underlying connection between an object’s class and values of its attributes.
It is possible to build more than one tree starting from a training set; in general, it is better to choose the simplest tree in order
to avoid overfitting problems and guarantee a proper classification of new objects. Indeed, it would be expected that the simpler
tree works better than others tree trained on the same data set, because it more likely that correctly classifies more objects unseen
(Quinlan, 1986). For this reason, in the most of cases, the algorithm uses a nontrivial stopping criterion in growing procedure,
considering a mixture of elements with different label class as a leaf node and accepting such as imperfect decision in favor of a
simpler, more generic tree structure. Furthermore, pruning operation generalizes the tree structure where branches are redundant,
therefore removing noise and outliers.
The frequently used decision tree algorithm are Iterative Dichotomiser 3 (ID3), C4.5 and CART.
ID3
ID3 is a decision tree algorithm with an iterative structure based on Hunt’s Concept Learning Systems (Hunt, 1962; Hunt et al.,
1966). During the learning process, only a subset of training data is selected randomly to form a first decision tree. If this tree
correctly classifies also all the other objects in the training set, then it represents the correct tree and the research process finish.
Otherwise, a new subset or window of training records is chosen to form a new tree, including a selection of the incorrectly
classified objects. This iterative process continues until the decision algorithm is able to classify correctly the entire training data
set. The main advantage of ID3 strategy allows identifying the correct decision tree with few iterations even on training set
containing thousands of items.
C4.5 algorithm
An improvement of ID3 is C4.5 algorithm, also developed by J. Ross Quinlan in 1993. C4.5 works with multiway splits in attribute
test (Chen et al., 2015). C4.5 introduces some advantages with respect to ID3, working with both categorical and numerical
(or continuous) attributes, and handling missing attribute values. Moreover, it employs information-based criteria as attribute
selection measure. Furthermore, it uses pruning phase when the tree building is completed. Pruning allows handling the over-
fitting problem, wherein the decision tree algorithm may incur.
Classification and regression tree

Classifier and regression trees (CART) is a decision tree technique (Breiman et al., 1984) can perform both tasks of classification
and regression, depending on the input data nature, categorical or numerical respectively. In particular, leaves of regression tree
predict real numbers (Rokach and Maimon, 2005).
CART splits data by binary attribute tests, driven by pre-set criteria that establish when the growing process stops (Wu et al.,
2008). The tree represents a series of rules opportunely fixed, learned by training data set, in order to predict a likely class label for
a new examined case (Lawrence and Wright, 2001). Pruning is used to address the overfitting problem, which is recurrent also
for CART.
Attribute Selection Measures

Information Gain and Gini Index are two split criteria widely used in the decision tree literature. Split criteria perform attribute
selection for each node, when growing the tree.
Choosing correctly attributes is extremely important for proper classification, since from this depends the data split and the
consequent performance of the tree. It is not obvious to determine which criterion is the best for a given data set (Raileanu and
Stoffel, 2004).
Information gain
The decision tree algorithm uses a set of measures of entropy or impurities to evaluate the best choice for the node attribute.
Indeed, a good attribute splits the data by grouping the greatest number of cases that belong to the same class, so that a successor
node is as pure as possible. For this reason, it is indicative a measure of “order”, defined as the number of items belonging to the
same class; obviously, the more items belong to the same class, greater will be the order degree. Entropy is a measure of disorder,
and represents the amount of information. Entropy is null for a pure node (leaf) and becomes maximum when all the classes at
node N are equally likely. Therefore, given a set S of examples and n classes, the entropy is expressed by Eq. (6)
X
n
EðSÞ ¼ pi log2 pi ð6Þ
i¼1
where pi is the portion of examples in i-th class (entropy of set S is constant for all attribute).
To compute the quality of the split on an entire attribute A, the weighted average is calculated over all sets resulting from the
split on all values of A: I(S,A) is the average entropy for attribute A and is defined as in Eq. (7):
X
v
jSj j
IðS; AÞ ¼ EðSj Þ ð7Þ
j
S
where Sj represents the j-th subset of S depending on the j-th value of the attribute A. The information gain for attribute A is the
information gained by branching on A is shown in Eq. (8):
GainðS; AÞ ¼ EðSÞ2IðS; AÞ ð8Þ
A good rule to tree growing consists of choosing that attribute for which the gain of information is maximum, or equivalently,
that attribute with minimum average entropy (Han et al., 2012).
Information gain is used as attribute selection measure in ID3 e C4.5 algorithms.
Gini index
Gini Index is an inequality measure, employed in several scientific fields. In decision tree learning, Gini index is another method to
assess purity/impurity of a node. Mathematical expression of Gini impurity of a set Sm of records at a given node m is the following
Eq. (9):
X
n
iðSm Þ ¼ GiniðSm Þ ¼ 1 p2j ð9Þ
j¼1
where pj is the portion of records belonging to j-th class, over a total of n classes. For Gini index, each attribute determines a binary
data split. Gini index measures impurities of each possible pairs of partitions Sm1 and Sm2 for a given attribute A, as shown in
formula (10):
Sm1 Sm2
GiniA ðSm Þ ¼ GiniðSm1 Þ þ GiniðSm2 Þ ð10Þ
Sm Sm
The attribute with the partition having minimum Gini index is chosen to continue the tree building process (Han et al., 2012).
Gini index is usable in problem with two or more categories and is employed as attribute selection measure in CART algorithm.
Rule-Based Classification
Rule models are the second major techniques used in machine learning after the decision tree methods. Trees offer a first, although
rigid example of rules as seen in branching process, since each branch dictates univocal way to browse the tree. Instead, in a rule
model the algorithm may infer further information from the possible overlapping among several rules. As well as the decision
trees, the Rule-Based Classifiers are highly expressive, easy to generate and to interpret. Their performances are comparable to
decision trees. Moreover, rules methods can classify rapidly new instances and can easily handle missing values and numeric
attributes (Tan et al., 2005). Rule-based methods form an integral part of expert systems in artificial intelligence.
If-Then Rules
A rule-based classifier uses a collection of rules of the type, “If…. Then…” to absolve the classification task. The expression
following “If” states a condition or predicate (antecedent rule or left-hand side, LHS) that is a conjunction of attributes value. The
class is defined after the keyword “then”, which is the consequent (or right-hand side, RHS) of the rule. In a data set of records to
classify, an instance is labelled according to the conditions verified by his attributes. The label class will be assigned to the instance
whose attributes verify the condition of a rule r; in this case, the rule r covers the considered instance. If attributes of the instance do
not satisfy any rule, the item label will be that of a default class (Wu et al., 2008).
Building Classification Rules

A method to build a rule-based classifier is to extract rules directly from the data (Kesavaraj and Sukumaran, 2013), as done by
PRISM (Cendrowska, 1987), RIPPER (Cohen, 1995), 1R (Holte, 1993) and CN2 (Clark and Niblett, 1989), which employ the so-
called Direct Method, that will be explained below. Alternatively, it is possible to construct a rule classifier using an Indirect
Method, which extract rules from other classification models, for example, decision trees. The algorithm learns first the tree
structure and then converts it in rules. C4.5rules or PART are examples of rule-based classifier build using indirect method.
Direct method: Sequential covering

Sequential covering method aims to create rules that describe or cover many examples of a class and none or very few of other
classes. Covering algorithm, descending from covering strategy (Michalski, 1969, 1975; Fürnkranz and Flach, 2005), is the typical
rule learning method. It consists in iteratively learning one rule from training set and subsequently removing the example covered
by that rule from the training set (Fürnkranz, 1999, 2010) (see Fig. 2). The training set is therefore separate, and recursively the
algorithm learns another rule that covers some of the remaining examples to conquer (cf. separate-and-conquer method, Pagallo
and Haussler (1990)). In particular, to learn a rule for a class, the algorithm starts from an empty rule, which is unable to discern
among the entire input data set. Then, this rule is enriched by adding new attribute test that can describe a fraction of example in
training set. As for the decision tree, the rule algorithm selects an attribute used to split data according to the amount of
Fig. 2 Successive iterations of coverage learning: fraction of positive examples (squares), covered by a rule, discarded from original data set.
Fig. 3 Enrichment of a rule by adding a new combination of attribute test leads to a reduction of covered examples (left panel) and allows to
improve the classification (right panel).
information contained in it by means of some test selection criteria such as accuracy or information gain (for accuracy, see below;
for information gain, see the previous section). By adding more conditions to the rule, its coverage is reduced (see Fig. 3). The
growing rule continues until training set cannot be split any further or another stopping criterion is met. The pseudo-code for
sequential covering algorithm is shown in Fig. 4.
Evaluation of a rule and stopping criterion

Quality of a classification rule can be evaluated by Coverage and Accuracy. Coverage of a rule r represents the fraction of instances
that satisfy the antecedent of a rule, as expressed by the Eq. (11):
jLHSj
Coverageðr Þ ¼ ð11Þ
n
where n is the total number of records in sample considered. An equivalent form for coverage is C¼ t/n, where t ¼ |LHS| is the total
number of records covered by rules.
Accuracy of a rule r is the fraction of records covered by the rule that belong to the correct class, in RHS of rule, as shown in
Eq. (12):
jLHS-RHSj
AccuracyðrÞ ¼ ð12Þ
jLHSj
Fig. 4 Pseudo-code for the sequential covering algorithm.
Rules accuracy can be expressed in the more compact form A¼ p/t, where p are positive examples, i.e., records with class
correctly predicted by rule and t is the total number of records covered by rule. It is immediate to consider the difference t p as the
number of errors made by rule. Other metrics of rules evaluation are illustrated in Fürnkranz (1999).
Often, accuracy of a rule is adopted as stopping criterion of rule construction. In particular, the building process stops when
accuracy of the rule is maximal, or when increasing in accuracy gets below a given threshold.
When the rule classifier is constructed, test set allows evaluate model accuracy.
Conflict resolution and overlap problem

It can occur that one instance triggers more than one rules. In this case, conflict resolution is operated by three possible strategies:
• Size ordering: rules with wider LHS condition have the highest priority.
• Class-based ordering: rules belonging to the same class are grouped together and classes appear by decreasing of prevalence,
being class prevalence the fraction of instances that belong to a particular class.
• Rule-based ordering (decision list): rule sets are ordered by priority list, depending on measures of rule quality or experts
knowledge.
Some rules can become over specialized, incurring in the overfitting problem. A solution for this inconvenience is to prune the
rules, by using validation set. Two main strategies are incremental pruning and global pruning (Tan et al., 2005).
Classification by Multilayer Feed-Forward Neural Networks (Backpropagation, Sigmoid Function, Learning Rate)
Conventional machine learning techniques require considerable domain expertise and engineering skills to extract right features
from raw data and so to detect pattern or classify the input. Artificial Neural Networks (ANN) or also Neural Network (NN) allows
solving problems where features detection is difficult. They are used in domains like biomedical field (Dreiseitl and Ohno-
Machado, 2002), computer vision and speech and image recognition, drug discovery and genomics (LeCun et al., 2015).
Neural Networks are computational methods whose structure follows the information processing model of the biological brain
(McCulloch and Pitts, 1943; Widrow and Hoff, 1960; Rosenblatt, 1962; Rumelhart et al., 1986). Feedforward neural networks
architecture is the first and simplest kind of neural network. It represents a powerful tool for handle non-linear classification
(see note in the following section) and regression problems.
Difference Between Linear and Non-Linear Classifier

It can be helpful to visualize a geometric description of the classification problem, by plotting the data in a so-called feature space.
The feature space is a mathematical formalization used to represent the data based on a characterizing collection of data features.
The axis in a such space correspond to the features chosen for describe each instance in the data set. A two-class linear classifier
discerns between possible cases using a linear combination of the features and a numerical threshold (Fig. 5). An item will belong
to a particular class rather than to the other one if the weighted sum of its features is above or below a given threshold.
Neural network methods take advantage from composing simple non-linear computational modules that deform the input
data space to make classes of data separable by a linear decision boundary.
Fig. 5 Two-class linear classification, plotted in a 2-dimensional feature space. The red straight line is the decision boundary among positive and
negative examples. w ¼(w1, w2) is the vector of weights and x¼(x1, x2) is the vector of features on which to perform the classification for a given
instance.
Fig. 6 Schematization of a single neuron.
NN: Perceptron, Multi-Layer Perceptron and Feedforward Neural Network Architecture

Neural network algorithms are used for both supervised and unsupervised learning. To absolve the classification task, supervised
neural networks is trained with inputs data and their respective targets or classes.
Neural networks have a peculiar architecture inspired by the biological functioning of neurons. In this analogy, a single neuron,
also called perceptron, works as a classifier: for each item to classify, it receives several feature inputs xi with a corresponding weight
wi and produces an output y (see Fig. 6). The output value of a neuron is called activation a and it is a function of the sum of the
P
weighted inputs: a ¼ i wi xi , where the sum run over the total number n of features. There are several types of output or activation
function y ¼ f(a), Table 1 shows some types of activation function.
A neural network is a connection of many neurons. In the example of Fig. 7, the outputs of a single neuron in a layer feed each
neuron in the successive layer, in the “fully connected” model. The “depth” of the network depends on the number of hidden
layers, rising from a simple model of two layer, where only a layer is hidden, until to a multilayer network.
The most simple grouping is single layer perceptron (SLP), consisting of an input and an output layer (Jain et al., 1996). Multilayer
perceptron (MLP) represents an advanced version of the standard SLP, where more layers of computing units connected to each
other make a wider network. Unlike SLP, the MLP can discern data non-linearly separable by employing sigmoid non-linear
activation function as shown in Fig. 8. The continuity property of sigmoid function will be central in network training.
Feedforward neural network (FNN) is a multilayer perceptron where, as occurs in the single neuron, the decision flow is
unidirectional, advancing from the input to the output in successive layers, without cycles or loops. This aspect makes FNN the
simplest example of neural network.
Training the network

In training phase, a feedforward network examines examples that show the relationship between a feature vector x as input and a
target t. The number of times the entire training set is fed to the network is known as epoch. From this training data, the network
should learn the correct model of classification according to which a given input x correspond to an output y close enough to the
right answer t. The ‘distance’ between the target t and the output y(a)¼ y(x,w) is evaluated for each pair (x,t) by the objective function
Table 1 Synoptic table of deterministic activation functions
Activation function Analytical expression Graphic representation
Linear y(a)¼ a
Sigmoid logistic y ða Þ ¼ 1þe1 a ;

yA(0,1)
Sigmoid tanh y(a)¼ tanh(a);

yA( 1,1)
(
Step 1 if a40
y ða Þ ¼
1 if a r 0
Fig. 7 Example of a two layer, fully connected network (do not count the input layer).
or error function E ¼ E(y,t), which reports how well the network solves the task. The training process consists in a minimization of
objective function, by varying the weight parameters w. For function minimization is evaluated also the gradient of the objective
function with respect to w by using the backpropagation algorithm.
Backpropagation and learning rate

The backpropagation algorithm applies chain rule for derivatives to the objective function E(y,t) with respect to the weights w,
relying on gradient descent. Thus E(y,t) and consequently y(x,w) must guarantee both continuity and differentiability. For
Fig. 8 Multilayer neural network schematized by fully connected black dots: from left to right – two input unit, two units in the hidden layer and
one output unit. The hidden layer deforms the input space to obtain a linear decision boundary between classes of data, as highlighted from the
shape of the border between colored areas. Reproduced from LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436–444.
backpropagation networks, Sigmoid activation function is preferred to step function used in perceptron; the objective function of
the network is often defined as (using least squares methods as optimization technique. An alternative to optimization function is
Pp
cross-entropy (with the relative maximum likelihood estimation method). E ¼ 12 i ¼ 1 ðyi ti Þ2 , for a training set containing
p records. The learning rule for neural network relies on gradient descent procedure that allows minimizing E starting from its
∂E ∂E ∂E
gradient ∇E ¼ ∂w ;
1 ∂w2
; …; ∂wl with respect to l weights. The weights, initialized with random values, represents the only tuning
∂E
factor to minimize the error function through a change Dwi ¼ Z∂w i
; where Z is the learning rate (Royas, 1996). This latter can be a
constant value or can be a decreasing function of simulation time t such as Z0/t, where t is an integer counting the algorithm steps.
Learning rate denotes the amount of the weights’ variation or the length of the step of each iteration in the negative gradient
descent.
Support Vector Machines
Support vector machine (SVM) is a supervised machine learning technique based on statistical learning theory (Vapnik, 1995;
Scholkopf et al., 1995; Cristianini and Shawe-Taylor, 2000). SVM is a binary classifier (the term “machines” descends from the fact
that SVM algorithm produces a binary output), but it can handle problems with many classes and also numerical prediction tasks
(Wang and Xue, 2014; Wu et al., 2008). Furthermore, it can treat non-linearly separable data. Similarly to neural network, SVM
employ a non-linear mapping of input vectors in a high-dimensional features space, where it can linearly separate data, by
so-called optimal separating hyperplanes (Chen et al., 2015).
SVM algorithms allow approach many problems of different domains, such as pattern recognition, medical diagnosis, mar-
keting (Yingxin and Xiaogang, 2005) or text classification (Minsky, 1961; Joachims, 1998).
Linearly Separable Data

Support vector machines mainly works as binary classifier, separating data in a linear way. Data are linearly separable when it is
possible to split them in two groups by using a straight line in a two-dimensional input feature space (the feature space is a
mathematical formalization used to represent the data based on a characterizing collection of data features) or a plane in a 3D
feature space. More generally, for an N-dimensional space the hyperplane N-1-dimensional divides data and it represents the
decision boundary. There are several ways to separate linearly the same data set, as shown in Fig. 9.
SVM algorithm searches for the optimal separating hyperplane, considering the notion of margin m, defined as the distance
between the separating hyperplane and the nearest training examples, at least one for each class. In particular, the points closest to
the decision boundary, known as support vectors, should be far away from it as possible. Indeed, maximizing the margin in training
phase makes the classifier more robust and gives better classification performance on test data . In general, training a SVM aims at
finding the proper decision boundary or separating hyperplane with the largest margin. Support vectors identify the optimal
decision boundary and determine the margin, evaluated as the distance between them (Cristianini and Shawe-Taylor, 2000).
As shown in Fig. 10, the margin has an amplitude equal to m/||w||, measured on w (vector of weights). Therefore, maximizing
the margin corresponds to minimizing ||w||, which results in solving a constrained optimization problem, considering a quadratic
function such as that shown in formula (13):
1
w ; t ¼ arg min jjwjj2 subject to the constraint yi ðw xi t Þ 1 ð13Þ
w;t; 2
where 1rirn and n is the number of training examples.

Fig. 9 Linearly separable data (A); different possibilities to separate data are shown in (B)–(D) panels.
Fig. 10 Geometry of a SVM classifier. Dashed lines delimit the margin. The support vectors are training examples circled and nearest to decision
boundary t.
The constraint expresses the request that instances remain outside the margin. In Fig. 10, the hyperplanes of equation
w xi t ¼ 7m delimit the margin of separation between two classes yi of data, positives and negatives respectively. In particular,
positives must verify the condition (w xi t)Z1 and negatives (w xi t)r 1), where m is set to 1, as usual (the parameters
t, ||w|| and m can be rescaled for convenience). The product between classes yi ¼ 71 (assigned in binary supervised problems) and
the previous conditions allows formulate a compact constraint (Eq. 13), valid for all instances.
The Lagrange multipliers method approaches the optimization problem applied to the function to minimize under the
P
condition shown by Eq. (13). The Lagrange multipliers algebra allows to obtain the optimal value for the weight ðw ¼ ni¼ 1 ai yi xi ,
where ai are the Lagrange multipliers and n is the number of training examples) w and for the decision boundary t, starting from
the pairs (xi,yi), which are provided as input in training phase of SVM. Laid out these values, SVM will establish the class of a new
instance according to the following formula (14):
y0 ¼ sgn½w x0 t ð14Þ
The Lagrange function incorporates the constraint preceded by the Lagrange multipliers ai, as shown in the Eq. (15) (according
to the “hinge” loss function formulation (Hastie et al., 2008):
1 Xn
Lðw; t; a1 ; …; an Þ ¼ jjwjj2 ai ðyi ðw xi t Þ 1Þ
2 i¼1
! !
1 X
n X
n X
n
¼ jjwjj2 w ai yi x i þt ai yi þ ai ð15Þ
2 i¼1 i¼1 i¼1
By setting to zero the partial derivative of the Lagrange function with respect to t and w, it is possible obtain the following
values:
∂ Xn
Lðw; t; a1 ; …; an Þ ¼ 0-w ¼ ai yi x i ð16Þ
∂w i¼1
∂ Xn
Lðw; t; a1 ; …; an Þ ¼ 0- ai yi ¼ 0 ð17Þ
∂t i¼1
Returning these values into the Lagrange function (15), the primal problem changes in the dual optimization problem
expressed in Eq. (18), entirely depending from the Lagrange multipliers:
! !
1 Xn Xn Xn
Lða1 ; …; an Þ ¼ ai yi xi ai yi xi þ ai
2 i¼1 i¼1 i¼1
1Xn Xn Xn
¼ ai aj yi yj xi xj þ ai ð18Þ
2i¼1j¼1 i¼1
To solve the dual problem it is necessary maximize (the dual Lagrange function (19) is maximized being the ai preceded by a
minus sign in Eq. (15)) the function L(a1,…,an) as described in the following expression:
1X n X n Xn
a1 ; …; an ¼ arg max ¼ ai aj yi yj x i x j þ ai
a1 ;…;an 2 i ¼ 1j ¼ 1 i¼1
ð19Þ
X
n
subject to ai 0; 1rirn; ai yi ¼ 0
i¼1
As shown in the Eq. (19), the optimization problem consists in pairwise dot products between training instances, which is
expressed by the element of the Gram matrix Hij ¼ xi xj.
The condition aiZ0 is typical for support vector machines (the constraint in Eq. (15) being non-negative) and include two
possible cases: (i) ai ¼ 0 for an example xi that does not contribute to learn the decision boundary; (ii) ai40 only for support
vectors, namely, for examples, nearest to the decision boundary. The Lagrange multipliers non-zero determine the decision
boundary through the expression (16) and (17) and consequently the class of each new instance through the Eq. (14).
Linearly Inseparable Data

Soft margin
SVM can be adapted to non-linearly separable cases (Cortes and Vapnik, 1995) introducing margin errors ξi, which allow to some of
the training examples to be inside the margin or even at the wrong side of the decision boundary Fig. 11. The slack variables ξi leads
to a soft margin optimization problem:
1 Xn
w ; t ; ξi ¼ arg min jjwjj2 þ C ξi with the constraint yi ðw xi t Þ 1 ξi ð20Þ
w;t;ξi 2
i¼1
where ξiZ0 and 1rirn.

C is a “cost” parameter that controls the tradeoff between margin maximization against slack variable minimization. Based on
the penalization method, C represents the “modulator” of the slack variables, which work as penalty term for deviations on the
wrong side of the margin.
In soft margin case, the Lagrange function becomes the Eq. (21):
1 Xn Xn Xn
Lðw; t; ξi ; a1 ; …; an ; bi Þ ¼ jjwjj2 þ C ξi ai ðyi ðw xi t Þ ð1 ξi ÞÞ bi ξ i
2 i¼1 i¼1 i¼1
! !
1 X
n X
n
¼ jjwjj2 w ai yi xi þt ai yi
2 i¼1 i¼1
Fig. 11 Non-linear separable case: the distance between the red or blue example into the margin is –ξi/||w||.
X
n X
n
þ ai þ ðC ai bi Þξi
i¼1 i¼1
X
n
¼ Lðw; t; ai Þ þ ðC ai bi Þξi : ð21Þ
i¼1
For an optimal solution every partial derivative with respect to ξi should be zero, and consequently for all i follows the Eq. (22):
C ai bi ¼ 0 ð22Þ
Therefore, vanishing the adding term in the last expression of Eq. (21), the non-linear dual problem is reduced to the linear one
in formulation (19) with the addition of the upper bound on ai derived from bi (22): 0rairC.
With regard to the meaning of the upper bound of ai, the equality to C implies that bi ¼ 0, concerning the case in which training
examples are on or inside the margin, being respectively ξi ¼ 0 or ξi40.
Kernel trick
Non linearly separable data can be treated plotting them in a higher dimensional feature space through a mathematical trans-
formation known as kernel function k. Such function replaces the inner products between pairs of input instances xi xj in the SVM
algorithm for linear separable data, and allows to represent training examples without the explicit computation of their coordi-
nates in the new feature space. The formulas (23) and (24) express that in symbol:
k : x-FðxÞ ð23Þ
and
Hij ¼ Fðxi Þ Fðxj Þ ¼ kðxi ; xj Þ ð24Þ
where F(x) is the feature mapping of the instance x in the new feature space, performed by the kernel function k and k(xi,xj) ¼ F
(xi) F(xj) is the Kernel matrix element, which provides the Gram matrix element Hij already seen above transformed by k. Popular
kernels are listed in Table 2.
With kernel function, the mapping in a higher dimensional space can make possible separate linearly data that in the starting
feature space are non-linearly separable, as shown in Fig. 12.
Finally, for each new instance x0 the class is established through the expression:
y0 ¼ sgn½w Uðx0 Þ t ð25Þ
0
Pn 0
Pn 0
where w U(x ) ¼ i ¼ 1 ai yi Uðxi Þ Uðx Þ ¼ i ¼ 1 ai yi kðxi ; x Þ:
This means that classifying a new instances involves training examples with non-zero Lagrange multipliers.
Multiclass SVM
It is possible to use additional methods that extend the SVM classification to more than two classes. (Weston and Watkins, 1999;
Crammer and Singer, 2001; Hsu and Lin, 2002; Wang and Xue, 2014). Among the strategies aimed to adapt SVM to multiclass
classification, there are two common techniques: one-versus-all (Vapnik, 1998) and one-versus-one (Kreßel, 1999). Both approaches
decompose the single multiclass problem into a set of binary classification problems. To solve a problem with k classes, the one-
versus-all approach trains k SVM binary classifiers so that i-th model recognizes as positives the examples in the i-th class and as
Table 2 a,b,g are parameters defining the behavior of kernel function
Kernel function Analytical expression
Linear k0(Xi,Xj)¼ Xi Xj
Polynomial k1(Xi,Xj)¼(Xi Xj þ a)b
jjx i x j jj2
Gaussian k2(Xi,Xj) ¼ exp 2g2
Sigmoidal k4(Xi,Xj)¼tanh(a(Xi Xj) b)
Fig. 12 Left: in the starting 2D feature space date are not linearly separable; right: kernel trick allows to map data in a 3D new feature space,
where data are linearly separable by means of a hyperplane.
negatives all the rest of examples. In the test phase for a given input, the binary classifier that provides positive output value assigns
the class label.
The one-versus-one (Kreßel, 1999) method solves the k classes problem with a pairwise decomposition of classifiers. This
approach uses k(k 1)/2 binary classifiers, each one of them is trained to identify as positive a given class and negative all the rest
(in a similar way as described above). In the test phase, a given input instance is evaluated or “voted” from pairs of classifiers. The
class label of this test instance will arise from the classifier that gives the highest number of votes.
Associative Classification: Classification Based on Association Rules
The rules syntax concerns different machine learning areas. As shown above (see Section Rule-Based Classification), supervised
learning employs rules to absolve some classification tasks. But rules paradigm also take place in the unsupervised learning, with
the association rules (Cios et al., 2007). Both approaches have a large use for many application domains (Gosain and Bhugra,
2013), according to their proper characteristics and differences.
Introduction of Association Rules

Association rules (AR) (Agrawal et al., 1993) represent a technique of unsupervised learning focused on the identification of
relationship of co-occurrence, named associations, among items in a data set. The original application of this kind of rules is
market basket data analysis, aimed at discovering possible regularities of association among items on sale and therefore to support
market decision making. AR are relevant for cross-marketing. However, many other applications range in different contexts, from
Web and text document (Kulkarni and Kulkarni, 2016) to Bioinformatics domain (Atluri et al., 2009).
Rule-Based Classification section is based on a “small set” of rules mined from a database in order to achieve as accurate as
possible result. The objective of AR mining is finding in the database “all” rules that satisfy a given minimum support and
minimum confidence threshold (Agrawal and Srikant, 1994). A basic difference between them is that the target of classification
rule mining is pre-established (namely the class), unlike to what happens in association rule mining.
Associative classification originates from the idea of integrating classification rule mining and association rule mining in order
to gain advantages from both methods, arising computational performances (Liu et al., 1998). This integration is focused on the
subset of AR that have classification class attribute on the right-hand-side. These special rules are also known as class association
rules (CARs). The first implementation of associative classification is the Classification Based on Association algorithm (CBA)
(Liu et al., 1998).
Formalism of association rules

To introduce the definitions of this section, consider a market basket analysis context. The database is a set T of transactions each of
which consists of a set X of items bought in that transaction. In turn, given an object or a set of objects, it is possible to list the
transactions that contain them. An item set X covers a transactions set T if this item set appears in every transaction of T. Support
count of an item set X, denoted by X.count, is the number of transactions in T covered by X. It represents the frequency with which
an item set X appears in a database of transactions T (Liu, 2011).
An expression of the form X-Y represents an AR, where X and Y are disjoint sets of purchasable items. A such rule states that
transactions of the database T which contain X tend to contain Y, as expressed by statement of the kind “98% of customers that
purchase tires and automobile accessories also get automotive service” offered by Agrawal et al. (1995).
In association rule mining, all rules extract will be evaluated based on their strength, selecting only a subset that satisfy
predetermined conditions. Support and confidence of a rule measure its strength. The rule support express the portion of transactions
containing X,Y over the total number n of transactions of T as shown by the formula (26):
ðX,Y Þ:count
support ¼ ð26Þ
n
The confidence of a rule X-Y is the per cent of transactions of T containing X that contain also Y:
ðX,Y Þ:count
confidence ¼ ð27Þ
X
The aim of mining AR is to find all association rules in T with support and confidence greater than or equal to a threshold value
pre-specified named minimum support (minsup) and minimum confidence (minconf).
In order to highlight AR among basket data it is necessary to discover item sets that occurs recurrently (Gosain and Bhugra,
2013). Classical algorithms to mine AR are Apriori (Agrawal and Srikant, 1994) and FP-growth (Han et al., 2000).
CBA
The CBA algorithm consists of two parts: i) a rule generator (CBA-RG), based on the Apriori algorithm (Agrawal and Srikant, 1994)
appropriately modified to mine class association rules; ii) a classifier builder (CBA-CB) that classifies accurately the rules previously
generated. Comparing CBA with other classifier based on rules, such as C4.5 (Quinlan, 1993) and CART (Breiman et al., 1984),
some differences emerge. Both approaches are based on the “covering method” (Michalski, 1980), but results are different,
depending on the way of application. CBA algorithm (CBA-RG module) adopts covering method, running it over all the rules
mined from the entire training set. Unlike, the traditional application of covering method works on one class at a time (see also
previous section on rules-based classifier), whenever removing covered examples by best rules, and learning new rules from the
remaining cases. The difference among results of CBA and the other classifiers arises from the general validity of rules mined by
CBA that constructs all the rules from the entire training set and then makes the best selection of rules covering training data.
Furthermore, rule-pruning technique is adopted in CBA to avoid overfitting problems. Pruning represents a difference between
CBA algorithm and the association rule mining. Another difference lies in the presence of CBA-CB module, being the AR not based
on the classifier building.
Classification Based on Multiple Association Rules

Classification based on Multiple Association Rules or CMAR is a method following to CBA (Li et al., 2001). It introduces some
advances with respect to CBA. First, CMAR identifies the class label analyzing a small set of rules with high confidence instead of
mine the class from a single rule. CMAR uses FP-growth method (Han et al., 2000), a frequent pattern mining algorithm, alternative
to Apriori method, to mine class-association rules. Then, CMAR employs a CR-tree data structure to store and retrieve the large
amount of classification rules previously generated. CR-tree represents a solution to the remarkable effort handling a substantial
amount of rules. Finally, pruning make the classification process more effective and efficient, by removing redundant and noisy
information.
To classify a new object, CMAR compares classification rules that match the given object. The object class is simply assigned
when the class label is the same for all rules matching. Otherwise, CMAR groups together rules with the same class label. Then, it
evaluates the strength of each group, comparing the “combined effect” of correlation (Brin et al., 1997) and support (Baralis et al.,
2004) of contained rules.
Classification Based on Predictive Association Rules

Classification Based on Predictive Association Rules or CPAR (Yin and Han, 2003) is a classification approach that merges the advantages of
(i) associative classification and (ii) traditional rule-based classification. It adopts different strategies with respect to (i) and (ii) about the
construction of rules, for example, it generates rules directly from training data, instead of creating a large set of candidate rules, as in
associative classification; moreover, it produces a more complete set of rules than traditional rule-based approaches.
CPAR originates from First Order Inductive Learner (FOIL) learning system (Quinlan, 1990; Quinlan and Cameron-Jones,
1993), modified with Predictive Rule Mining (PRM) algorithm. PRM varies the covering method (see Section Rule-Based
Classification) implemented by FOIL, keeping the rules that cover positive examples and associating a “weight” to these rules. In
this way, positives examples are covered by more rules, achieving higher accuracy than FOIL. Furthermore, working with large
database, PRM attains greater efficiency with respect to FOIL, thanks to reducing of time consuming part in rules building. PRM
reaches higher efficiency but lower accuracy than associate classification. PRM results are exceeded by CPAR, that is the evolution
of PRM. It gets better the rule building process, avoiding redundant rule generation and building several rules simultaneously.
Finally, CPAR evaluates the prediction power of every rule, defining its expected accuracy as the probability that an example
observing the antecedent of the rule actually belongs to class predicted in the consequent. So doing, in a set of k best rules for a
class, the rule with the highest expected accuracy will be chosen to class prediction.
References
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.I., 1995. Fast discovery of association rules. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R.
(Eds.), Advances in Knowledge Discovery and Data Mining. Cambridge, MA: AAAI/MIT Press, pp. 307–329.
Agrawal, R., Srikant R., 1994. Fast algorithms for mining association rules. In: Proceedings of International Conference on Very Large Data Bases (VLDB), San Jose, CA.
Agrawal, R., Tomasz I., Swami, A., 1993. Mining association rules between sets of items in large databases. In: Buneman, P., Jajodia, S. (Eds.), Proceedings of the 1993 ACM
SIGMOD International Conference on Management of Data, pp. 207–216. New York, NY: ACM.
Atluri, G., Gupta, R., Fang, G., et al., 2009. Association analysis techniques for bioinformatics problems. In: Rajasekaran, S. (Ed.), Bioinformatics and Computational Biology.
New Orleans, LA: Springer, pp. 1–13.
Baralis, E., Chiusano, S., Garza, P., 2004. On support thresholds in associative classification. In: Proceedings of the 2004 ACM Symposium on Applied Computing,
pp. 553–558. New York, NY: ACM.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Belmont, CA: Wadsworth.
Brin, S., Motwani, R., Silverstein, C., 1997. Beyond market baskets: Generalizing associations rules to correlations. In: Proceedings of the 1997 ACM SIGMOD International
Conference on Management of Data, pp. 265–276. New York, NY: ACM.
Cendrowska, J., 1987. PRISM: An algorithm for inducing modular rules. International Journal of ManMachine Studies 27, 349–370.
Chen, F., Deng, P., Wan, J., et al., 2015. Data mining for the internet of things: Literature review and challenges. International Journal of Distributed Sensor Network 2015, 1–15.
Cios, K.J., Swiniarski, R.W., Pedrycz, W., Kurgan, L.A., 2007. Unsupervised learning: Association rules. In: Kecman, V. (Ed.), Data Mining. Springer, pp. 289–306.
Clark, P., Niblett, T., 1989. The CN2 induction algorithm. Machine Learning 3, 261–283.
Cohen, W.W., 1995. Fast effective rule induction. In: Proceedings of the 12th International Conference on Machine Learning, pp. 115–123. San Mateo, CA: Morgan and Kaufmann.
Cortes, C., Vapnik, V., 1995. Support-vector networks. Machine Learning 20 (3), 273–297.
Crammer, K., Singer, Y., 2001. On the algorithmic implementation of multiclass kernel-based machines. Journal of Machine Learning Research 2, 265–292.
Cristianini, N., Shawe-Taylor, J., 2000. An Introduction to Support Vector Machines. New York, NY: Cambridge University Press.
Dreiseitl, S., Ohno-Machado, L., 2002. Logistic regression and artificial neural network classification models: A methodology review. Journal of Biomedical Informatics 35 (5), 352–359.
Fürnkranz, J., 1999. Separate-and-conquer rule learning. Artificial Intelligence Review 13, 3–54.
Fürnkranz, J., 2010. Rule learning. In: Sammut, C., Webb, G.I. (Eds.), Encyclopedia of Machine Learning. New York, NY: Springer, pp. 875–879.
Fürnkranz, J., Flach, P.A., 2005. Roc ‘n0 rule learning – Towards a better understanding of covering algorithms. Machine Learning 58, 39–77.
Gosain, A., Bhugra, M., 2013. A comprehensive survey of association rules on quantitative data in data mining. In: IEEE Conference Information & Communication
Technologies, JeJu Island, pp. 1003–1008.
Han, J., Kamber, M., Pei, J., 2012. Data mining, Concepts and Techniques, third ed. Waltham, MA: Morgan Kaufmann.
Han, J., Pei, J., Yin, Y., 2000. Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM SIGMOD International conference on Management of
Data, pp. 1–12. New York, NY: ACM.
Hastie, T., Tibshirani, R., Friedman J., 2008. The Elements of Statistical Learning, second ed. Stanford, CA: Springer.
Holte, R.C., 1993. Very simple classification rules perform well on most commonly used datasets. Machine Learning 11, 63–90.
Hsu, C.W., Lin, C.J., 2002. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13 (2), 415–425.
Hunt, E.B., 1962. Concept Learning: An Information Processing Problem. New York, NY: Wiley.
Hunt, E.B., Marin, J., Stone, P.J., 1966. Experiments in Induction. New York, NY: Academic Press.
Jain, A.K., Mao, J., Mohiuddin, K.M., 1996. Artificial neural networks: A tutorial. IEEE Computer-Special Issue in Neural Computing 29 (3), 31–44.
Joachims, T., 1998. Text categorization with support vector machines: Learning with many relevant features. In: Hutchison, D., Kanade, T., Kittler, J., et al. (Eds.), Machine
Learning: European Conference on Machine Learning (in LNCS 1398), pp. 137–142. Berlin: Springer.
Kesavaraj, G., Sukumaran, S., 2013. A study on classification techniques in data mining. In: Proceedings of the International Conference on Computer Communication and
Networking Technologies (ICCCNT'13), pp. 1–7. Tamil Nadu: IEEE.
Kreßel, U., 1999. Pairwise classification and support vector machines. In: Schölkopf, B., Burges, C., Smola, A. (Eds.), Advances in Kernel Methods: Support Vector Learning.
Cambridge: MIT Press, pp. 255–268.
Kulkarni, M., Kulkarni, S., 2016. Knowledge discovery in text mining using association rule extraction. International Journal of Computer Applications 143, 30–36.
Lavanya, D., Rani, K.U., 2011. Performance evaluation of decision tree classifiers on medical datasets. International Journal of Computer Applications 26, 1–4.
Lawrence, R.L., Wright, A., 2001. Rule-based classification systems using classification and regression trees (CART) analysis. Photogrammetric Engineering & Remote Sensing
67, 1137–1142.
LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436–444.
Li, T., Zhang, C., Ogihara, M., 2004. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression.
Bioinformatics 20, 2429–2437.
Liu, B., 2011. Web data mining. Exploring Hyperlinks, Contents and Usage Data. Chicago, IL: Springer.
Liu, B., Hsu, W., Ma, Y., 1998. Integrating classification and association rule mining. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data
Mining, pp. 80–86. New York, NY: AAAI Press.
Li, W., Han, J., Pei, J., 2001. CMAR: Accurate and efficient classification based on multiple class-association rules. In: Proceedings of the IEEE International Conference on
ICDM'01, pp. 369–376. IEEE: San Jose, CA.
McCulloch, W.S., Pitts, W., 1943. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5 (4), 115–133. Reprinted in Anderson and
Rosenfeld, Springer.
Michalski, R., 1980. Pattern recognition as rule-guided induction inference. IEEE Transaction On Pattern Analysis and Machine Intelligence 2, 349–361.
Michalski, R.S., 1969. On the quasi-minimal solution of the general covering problem. In: Proceeding of the V International Symposium on the Information Processing, Bled,
Yugoslavia, pp 125–128.
Michalski, R.S., 1975. Synthesis of optimal and quasi-optimal variable-valued logic formulas. In: Proceedings of the 1975 International Symposium on Multiple-Valued Logic,
pp. 76–87. Bloomington, Indiana.
Minsky, M., 1961. Steps toward artificial intelligence. In: Hamburger, F. (Ed.), Proceedings of the IRE, vol. 49 (No. 1), pp. 8–30. New York: IEEE.
Pagallo, G., Haussler, D., 1990. Boolean feature discovery in empirical learning. Machine Learning 5, 71–99.
Quinlan, J.R., 1986. Induction of decision trees. Machine Learning 1, 81–106.
Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. San Francisco, CA: Morgan Kaufmann.
Quinlan, J.R., 1990. Learning Logical Definitions from Relations. Machine Learning, 5. Boston: Kluwer Academic, pp. 239–266.
Quinlan, J.R., Cameron-Jones, R.M., 1993. FOIL: A midterm report. In: Brazdil, P.B. (Ed.), Proceedings of European Conference on Machine Learning, pp. 3–20. Vienna: Springer-Verlag.
Raileanu, L.E., Stoffel, K., 2004. Theoretical comparison between the Gini Index and Information Gain criteria. Annals of Mathematics and Artificial Intelligence 41, 77–93.
Rokach, L., Maimon, O., 2005. Decision trees. In: Rokach, L., Maimon, O. (Eds.), Data Mining and Knowledge Discovery Handbook. New York, NY: Springer, pp. 165–192.
Rosenblatt, F., 1962. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanism. Washington, DC: Spartan Books.
Royas, R., 1996. Neural Networks. A Systematic Introduction. Berlin: Springer.
Rumelhart, D.E., Hinton, G.E., Williams, R.J., 1986. Learning internal representations by error propagation. In: Rumelhart, D.E., McClelland, J.L., PDP Research Group. (Eds.),
Parallel Distributed Processing: Explorations in the Microstructure of Cognition 1. MIT Press, pp. 318–362.
Scholkopf, B., Burges, C., Vapnik, V., 1995. Extracting support data for a given task. In: Fayyad, U.M., Uthurusamy, R. (Eds.), Proceedings of the First International Conference
on Knowledge discovery and Data Mining, pp. 252–257. Menlo Park, CA: AAAI Press.
Tan, P.N., Steinbach, M., Kumar, V., 2005. Introduction to Data Mining. Boston, MA: Pearson Addison Wesley.
Vapnik, V., 1995. The Nature of Statistical Learning Theory. New York, NY: Springer.
Vapnik, V., 1998. Statistical Learning Theory. New York, NY: Wiley.
Wang, Z., Xue, X., 2014. Multi-class support vector machine. In: Ma, Y., Guo, G. (Eds.), Support Vector Machines Applications, first ed. Cham: Springer International
Publishing, pp. 23–48.
Weston, J., Watkins, C., 1999. Support vector machines for multi-class pattern recognition. In: Verleysen, M. (Ed.), Proceedings of European Symposium on Artificial Neural
Networks, pp. 219–224. Bruges: ESANN.
Widrow, B., Hoff, M.E., 1960. Adaptive switching circuits. In: IRE WESCON Convention Record, vol. 4, pp. 96–104. Reprinted in Anderson and Rosenfeld (1988).
Wu, X., Kumar, V., Quinlan, R.J., et al., 2008. Top 10 algorithms in data mining. Knowledge and Information Systems 14, 1–37.
Yingxin, L., Xiaogang, R., 2005. Feature selection for cancer classification based on support vector machine. Journal of Computer Research and Development 42 (10), 1796–1801.
Yin, X., Han, J., 2003. CPAR: Classification based on predictive association rules. In: Barbara, D., Kamath, C. (Eds.), Proceedings of the 2003 SIAM International Conference
on Data Mining, pp. 331–335. San Francisco, CA: Society for Industrial and Applied Mathematics.
Zhou, L., Wang, Q., Fujita, H., 2017. One versus one multi-class classification fusion using optimizing decision directed acyclic graph for predicting listing status of
companies. Information Fusion 36, 80–89.
Further Reading
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press.
Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Singapore: Springer.
Duda, R.O., Hart, P.E., Stork, D.G., 1991. Pattern Classification. New York, NY: John Wiley & Sons.
Flach, P., 2012. Machine Learning. New York, NY: Cambridge University Press.
Harrinton, P., 2012. Machine Learning in Action. Shelte Island, NY: Manning.
MacKay, D.J.C., 2005. Information Theory, Inference and Learning Algorithms. Cambridge University Press.
McKinney, W., 2012. Python for Data Analysis. Sebastopol, CA: O’Reilly.
Mitchell, T.M., 1997. Machine Learning. Portland, OR: McGraw Hill.
Model, M.L., 2009. Bioinformatics Programmingusing Python. Sebastopol, CA: O’Reilly.
Muller, K.R., Mika, S., Ratsch, G., Tsuda, K., Scholkopf, B., 2001. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks 12 (2), 181–201.
Phyu, T.N., 2009. Survey of classification techniques in data mining. In: Ao, S.I., Castillo, O., Douglas, C., Feng, D.D., Lee, J.A. (Eds.), Proceedings of International Multi-
Conference of Engineers and Computer Scientists, pp. 727–731. Hong Kong: Newswood Limited.
Schölkopf, B., Smola, A.J., 2002. Learning With Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
Shawe-Taylor, J., Cristianini, N., 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.
Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y., 2005. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research
6, 1453–1484.
Witten, I.H., Frank, E., 2005. Data Mining: Practical Machine Learnig Tools and Techniques, second ed. San Francisco, CA: Morgan Kaufmann.
Wu, X., Kumar, V., 2009. The Top Ten Algorithms in Data Mining. Boca Raton, FL: Chapman & Hall/CRC.

Data Mining Classification and Prediction

Uploaded by

Copyright:

Available Formats

Data Mining Classification and Prediction

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Data Mining Classification and Prediction

Uploaded by

Copyright:

Available Formats

Data Mining: Classiﬁcation and Prediction

Basic Concept of Classiﬁcation and Prediction

Classiﬁer Versus Predictor: Learning, Testing, Accuracy, Overﬁt

Encyclopedia of Bioinformatics and Computational Biology doi:10.1016/B978-0-12-809633-8.20461-5 1

Testing and Accuracy

Absolute error: jyi yi0 j ð2Þ

Squared error: ðyi yi0 Þ2 ð3Þ

where d is the total number of test instances.

• speed: time to construct the model; time to use the model;

Generalization and Overﬁtting

(i) Linear and multiple regression.

Build a Tree: Divide-and-Conquer Algorithm

Classiﬁcation and regression tree

Attribute Selection Measures

Building Classiﬁcation Rules

Direct method: Sequential covering

Evaluation of a rule and stopping criterion

Fig. 4 Pseudo-code for the sequential covering algorithm.

Conﬂict resolution and overlap problem

Difference Between Linear and Non-Linear Classiﬁer

Fig. 6 Schematization of a single neuron.

NN: Perceptron, Multi-Layer Perceptron and Feedforward Neural Network Architecture

Training the network

Table 1 Synoptic table of deterministic activation functions

Activation function Analytical expression Graphic representation

Sigmoid logistic y ða Þ ¼ 1þe1 a ;

Sigmoid tanh y(a)¼ tanh(a);

Backpropagation and learning rate

Support Vector Machines

Linearly Separable Data

where 1rirn and n is the number of training examples.

Linearly Inseparable Data

where ξiZ0 and 1rirn.

Table 2 a,b,g are parameters deﬁning the behavior of kernel function

Kernel function Analytical expression

Associative Classiﬁcation: Classiﬁcation Based on Association Rules

Introduction of Association Rules

Formalism of association rules

Classiﬁcation Based on Multiple Association Rules

Classiﬁcation Based on Predictive Association Rules

You might also like