Classification Model Evaluation Metrics

Željko Đ. Vujović

(IJACSA) International Journal of Advanced Computer Science and Applications, Volume 12, Issue 6, 2021. UDC: 004.8. DOI: 10.14569/IJACSA.2021.0120670
Abstract - The purpose of this paper was to confirm the basic assumption that classification models are suitable for solving the problem of data set classification. We selected four representative models, BayesNet, NaiveBayes, MultilayerPerceptron, and J48, and applied them to a four-class classification of a specific set of hepatitis C virus data for Egyptian patients. We conducted the study using the classification models of the WEKA software, developed at Waikato University, New Zealand. Discouraging results were obtained: none of the four envisaged classes was determined reliably. We describe all 16 metrics used to evaluate classification models, listing their characteristics, their mutual differences, and the parameter that each of them evaluates. We present comparative, tabular values of each metric for each classification model in a concise form: detailed class accuracy with a table of the best and worst metric values, confusion matrices for all four classification models, and a table of type I and type II errors for all four classification models. In addition to the 16 metrics we describe, we list seven other metrics, which we did not use because we had no opportunity to show their application on the selected data set. The metrics rated the selected, standard, reliable classification models negatively. This led to the conclusion that the data in the selected data set should be pre-processed in order to be reliably classified by the classification models.

Keywords: classification model, evaluation of classification models, worst metric values, four-class classification, metric classification, reliably classified classification models, detailed class accuracy

Subject areas: artificial intelligence and machine learning, software engineering

I INTRODUCTION

A specific set of data on the hepatitis C virus, consisting of 1385 instances described with 29 attributes, was considered. [12] The goal is to classify these instances into four classes, which represent stages of hepatitis disease: class a - portal fibrosis, class b - few septa, class c - many septa, and class d - cirrhosis. [6] This paper challenges this classification. Sources in the literature suggest that classification into five classes would be better: class a - liver inflammation, class b - fibrosis, class c - cirrhosis, class d - end-stage liver disease (ESLD), and class e - cancer. [15]

The initial assumption is that standard, generally accepted classification models, BayesNet, NaiveBayes, MultilayerPerceptron, and J48, are suitable for such a classification. These models exist in the WEKA software and, as such, were applied to the selected data set. Unsatisfactory results were obtained; the available instances were classified very poorly. That was the reason, motive, and incentive to consider why this is so. These four models were chosen at random. In this introduction, we give their generally accepted definitions.

A Bayesian network is defined as a system of event probabilities, with nodes in a directed acyclic graph, in which the probability of an event can be calculated from the probabilities of its predecessors in the graph. The nodes in the network are variables. They can be concrete values, randomly given latent values, or hypotheses, and they are characterized by probability distributions. Probability is a quantity that reflects a state of knowledge or a state of belief. In the Bayesian view, a probability is assigned to a hypothesis; in frequentist thinking, the hypothesis is tested without assigning a probability. The result of Bayesian analysis is Bayesian inference: it updates the prior probability assigned to the hypothesis as more evidence and information are obtained. [3], [16]

Naive Bayes classifiers are based on the naive assumption of mutual independence of the features. In this way, each distribution can be estimated independently as a one-dimensional distribution. This alleviates the problems arising from the "curse of dimensionality", the problematic growth of the number of variables that can be collected from a single sample. An example of this is the need for data sets that scale exponentially with the number of features. [3], [14], [16], [18]

A multilayer perceptron is defined as a system composed of a series of elements (nodes, "neurons") organized into layers. The layers process information so that they react dynamically to external inputs. The input layer has one neuron for each component present in the input data and communicates with the hidden layers in the network. The entire processing of the input data takes place in the hidden layers. The input data are weighted by appropriate coefficients; each neuron accepts them, calculates their sum, and processes it with an activation function, passing the processed data forward. The last hidden layer is connected to the output layer, which has one neuron for each possible output. [3], [14], [16], [18]

J48 is a machine learning model based on the decision tree. It is the WEKA implementation, developed by the WEKA project development team, of the C4.5 algorithm, a successor of ID3 (Iterative Dichotomiser 3). The decision tree presents and analyzes decision-making situations in which one type of decision is derived from another type of decision. This facilitates understanding of selection problems, assessment of available versions of the decision, and coverage
of uncertain events, which affect outcomes and versions of the decision. [3], [14], [16], [18]

The first idea was to consider the metrics used to evaluate the applied classification models. The 16 metrics used by the WEKA software were reviewed, described, and explained. [4] In addition, it was noted that the following metrics also exist: False discovery rate [21], Log Loss [22], Brier score [23], Cumulative gain chart [24], Lift curve [25], and the Kolmogorov-Smirnov test [26]. These metrics were not considered because they are not contained in the WEKA software that was used; therefore, they could not give their ratings of the classification models on the selected data set.

The research made a significant contribution to the interpretation of the 16 mentioned metrics and of the elements and parameters that each of them uses to evaluate the classification models. A significant contribution is also the question: why did the metrics negatively evaluate the classification models applied to the selected data set? As a result of this research, other questions arose. Is the number of attributes per instance of the observed data set too large? How many attributes are needed (what is the optimum), and what are those attributes? Is it necessary to pre-process the data of the observed set? What are the techniques for pre-processing the data in a set? Unobtrusively, the question also arose as to whether the four classes for the classification of the instances of the observed set were correctly determined.

II METRICS

1. Correctly classified instances are the sum of true positives (TP) and true negatives (TN).

2. Incorrectly classified instances are the sum of false positives (FP) and false negatives (FN).

3. Kappa statistic - Cohen's kappa coefficient (K) measures how well the instances classified by the machine learning model match the data labeled as ground truth, controlling for the accuracy of a random classifier as measured by the expected accuracy. The accuracy of a random classifier is 1/k, where k is the number of classes in the data set; in the case of binary classification, k = 2, so the random accuracy is 50%.

$$K = \frac{p_0 - p_e}{1 - p_e}$$

where $p_0$ is the total accuracy of the model and $p_e$ is the random (chance) accuracy of the model. In the binary classification problem, $p_e = p_{e1} + p_{e2}$, where $p_{e1}$ is the probability that the predictions agree randomly with the actual values of class 1 and $p_{e2}$ is the probability that the predictions agree randomly with the actual values of class 2. The assumption is that the two classifiers (the model prediction and the actual class value) are independent. In this case, the probabilities $p_{e1}$ and $p_{e2}$ are calculated by multiplying the share of the actual class and the share of the predicted class. [2], [20]

4. Mean absolute error (MAE) is the mean of the absolute values of the individual prediction errors over all instances in the test set. Each prediction error is the difference between the actual value and the predicted value for the instance. The mean absolute error $E_i$ of an individual model $i$ is calculated by the formula:

$$E_i = \frac{1}{n} \sum_{j=1}^{n} \left| P_{ij} - T_j \right|$$

where $P_{ij}$ is the value predicted by the individual model $i$ for record $j$ (of $n$ records) and $T_j$ is the target value for record $j$. For a perfect prediction, $P_{ij} = T_j$ and $E_i = 0$. Thus, the index $E_i$ ranges from 0 to infinity, with 0 corresponding to the ideal. [14], [28]

5. Root mean squared error (RMSE) - taking the square root of the mean squared error reduces the error to the same dimensions as the quantity being predicted. The root mean squared error $E_i$ of an individual model $i$ is calculated by the formula:

$$E_i = \sqrt{\frac{1}{n} \sum_{j=1}^{n} \left( P_{ij} - T_j \right)^2}$$

where $P_{ij}$ is the value predicted by the individual model $i$ for record $j$ (of $n$ records) and $T_j$ is the target value for record $j$. For a perfect prediction, $P_{ij} = T_j$ and $E_i = 0$. Thus, the index $E_i$ ranges from 0 to infinity, with 0 corresponding to the ideal. [27]

6. Relative absolute error (RAE) is the total absolute error, normalized by dividing it by the total absolute error of the simple predictor (the ZeroR classifier). The relative absolute error $E_i$ of an individual model $i$ is evaluated by the equation:

$$E_i = \frac{\sum_{j=1}^{n} \left| P_{ij} - T_j \right|}{\sum_{j=1}^{n} \left| T_j - \bar{T} \right|}$$

where $P_{ij}$ is the value predicted by the individual model $i$ for record $j$ (of $n$ records), $T_j$ is the target value for record $j$, and $\bar{T}$ is given by the formula:

$$\bar{T} = \frac{1}{n} \sum_{j=1}^{n} T_j$$

For a perfect prediction, the numerator is 0 and $E_i = 0$. Thus, the index $E_i$ ranges from 0 to infinity, with 0 corresponding to the ideal. A good prediction model produces a ratio near zero; a bad model (one that is worse than the naive model) produces a ratio greater than 1 (more than 100%). [27]
7. Root relative squared error (RRSE) reduces the error to the same dimensions as the quantity being predicted. The relative squared error is the total squared error divided by the total squared error of the simple predictor. The root relative squared error $E_i$ of an individual model $i$ is calculated by the formula:

$$E_i = \sqrt{\frac{\sum_{j=1}^{n} \left( P_{ij} - T_j \right)^2}{\sum_{j=1}^{n} \left( T_j - \bar{T} \right)^2}}$$

where $P_{ij}$ is the value predicted by the individual model $i$ for record $j$ (of $n$ records). For a perfect prediction, the numerator is equal to 0 and $E_i = 0$. The index $E_i$ ranges from 0 to infinity, with 0 corresponding to the ideal. [28]

8. Confusion matrix for a binary classifier (Figure 1). Actual values are marked True (1) and False (0); predicted values are marked Positive (1) and Negative (0). Estimates of the capabilities of classification models are derived from the quantities TP, TN, FP, and FN that make up the confusion matrix. [10]

                              Actual class
                         True (1)    False (0)
Predicted   Positive (1)    TP           FP
class       Negative (0)    FN           TN

Figure 1. Confusion matrix for the binary classification problem [7]

TP (True Positive) - a data point in the confusion matrix is a true positive when a positive outcome is predicted and what happened is the same.

FP (False Positive) - a data point in the confusion matrix is a false positive when a positive outcome is predicted but what happened is a negative outcome; a benefit is predicted that does not occur. This scenario is known as a type I error.

FN (False Negative) - a data point in the confusion matrix is a false negative when a negative outcome is predicted but what happened is a positive outcome. This scenario is known as a type II error and is considered as dangerous as a type I error.

TN (True Negative) - a data point in the confusion matrix is a true negative when a negative outcome is predicted and what happens is the same.

The four results of the binary classification are shown in Figure 2.

Figure 2. Elliptical representation of the four binary results of the test set classification [7]

Confusion matrix for four-class classification (Figure 3). Four-class classification is the problem of classifying instances (examples) into four classes: class A, class B, class C, and class D. [13], [17]

Figure 3. Confusion matrix for the four-class classification problem [8]

9. Accuracy is calculated as the sum of the two correct prediction counts (TP + TN) divided by the total size of the data set (P + N). The best accuracy is 1.0 and the worst is 0.0. (Figure 4) [19]

$$ACC = \frac{TP + TN}{TP + TN + FN + FP} = \frac{TP + TN}{P + N}$$

Figure 4. Two ellipses show how the accuracy is calculated [7], [11]

10. TP Rate - the true positive rate, also called sensitivity or recall (REC), is calculated as the number of correct positive predictions (TP) divided by the total number of positives (P). The best TP rate is 1.0 and the worst is 0.0. (Figure 5) [19]

$$SN = \frac{TP}{TP + FN} = \frac{TP}{P}$$

Figure 5. Two ellipses show how the sensitivity is calculated [7]

11. FP Rate - the false positive rate is calculated as the number of false positive predictions (FP) divided by the total number of negatives (N). The best FP rate is 0.0 and the worst is 1.0. It can also be calculated as 1 - specificity. (Figure 6) [19]

$$FPR = \frac{FP}{TN + FP} = 1 - SP$$

Figure 6. Two ellipses show how the false positive rate (FPR) is calculated [7]
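As a complement to Figures 1-6, here is a minimal hedged sketch in Python that derives TP, FP, FN, and TN from two label vectors and then computes the accuracy, TP rate, and FP rate by the formulas above. NumPy, the 0/1 encoding, and the toy vectors are our own assumptions, not taken from the paper or from WEKA.

```python
import numpy as np

def binary_confusion(y_true, y_pred):
    """Counts for the binary confusion matrix of Figure 1 (1 = positive)."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # type I errors
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # type II errors
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # true negatives
    return tp, fp, fn, tn

# Toy example (illustrative values only)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])
tp, fp, fn, tn = binary_confusion(y_true, y_pred)

acc = (tp + tn) / (tp + tn + fp + fn)  # metric 9: accuracy
tpr = tp / (tp + fn)                   # metric 10: sensitivity / recall
fpr = fp / (fp + tn)                   # metric 11: false positive rate
print(acc, tpr, fpr)
```

The same four counts also yield the precision and specificity defined next, so one confusion matrix is enough to evaluate all of these metrics at once.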
12. Precision (PREC) is calculated as the number of correct positive predictions (TP) divided by the total number of positive predictions (TP + FP):

$$PREC = \frac{TP}{TP + FP}$$

13. True Negative Rate - TNR (specificity) is calculated as the number of correct negative predictions (TN) divided by the total number of negatives (N). The best specificity is 1.0 and the worst is 0.0. (Figure 8) [19]

Figure 9. ROC curve [1], [5]

The ROC AUC score shows how good the model is at ranking predictions. It indicates the probability that a randomly selected positive instance is ranked higher than a randomly selected negative instance. [7], [19]
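This ranking interpretation of the ROC AUC score can be checked directly. The sketch below is an illustration under our own assumptions (NumPy, invented scores, and a pairwise-counting estimator), not WEKA code: it estimates AUC as the fraction of positive-negative pairs in which the positive instance receives the higher score, counting ties as one half.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as P(score of a random positive > score of a random negative)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # compare every positive score against every negative score
    wins = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return (wins + 0.5 * ties) / (pos.size * neg.size)

# Toy example (illustrative values only)
y_true = np.array([1, 1, 0, 1, 0, 0])
scores = np.array([0.9, 0.6, 0.55, 0.4, 0.3, 0.2])  # model's ranking scores
print(roc_auc(y_true, scores))  # 1.0 would mean a perfect ranking
```

A score of 0.5 corresponds to random ranking, which is why the diagonal of the ROC plot represents a classifier with no discriminating power.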
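The per-class quantities referred to in the comparative analysis below, TP together with the type I (FP) and type II (FN) errors of each class, can be read off a multiclass confusion matrix. The following is a minimal sketch under our own conventions (rows = actual class, columns = predicted class; the labels a-d and the toy label vectors are invented for illustration), not a reproduction of the paper's Tables 4 and 5.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, labels):
    """Multiclass confusion matrix; rows = actual, columns = predicted."""
    idx = {c: i for i, c in enumerate(labels)}
    m = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[idx[t], idx[p]] += 1
    return m

def per_class_errors(m):
    """For each class: TP, FP (type I errors), FN (type II errors)."""
    tp = np.diag(m)
    fp = m.sum(axis=0) - tp  # predicted as the class but actually another
    fn = m.sum(axis=1) - tp  # actually the class but predicted as another
    return tp, fp, fn

labels = ["a", "b", "c", "d"]  # the four hepatitis classes
y_true = ["a", "b", "b", "c", "d", "a", "c", "d"]
y_pred = ["a", "b", "c", "c", "a", "a", "b", "d"]
m = confusion_matrix(y_true, y_pred, labels)
tp, fp, fn = per_class_errors(m)
print(m)
print(tp, fp, fn)  # low FP/FN per class is the modeling goal stated below
```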
By comparative analysis of the confusion matrices for all four classification models and all four classes, we see that the predictions of true positive results (TP) are not good enough (Table 4). Type I and type II errors are relatively high; the goal of modeling is to reduce these errors to minimum values. Separate consideration of the type I and type II errors for the four applied models shows that NaiveBayes has a type I error value equal to 0 for class d, and type II errors for classes a, b, and c (Table 5). These data further problematize the use of this model. For the other three models, the type I and type II errors are, on average, 2.5 times larger than the correct predictions.

V CONCLUSIONS

In this paper, we have considered in detail the 16 metrics for the evaluation of classification models that exist in the WEKA software, version 3.4.1, developed at the University of Waikato, New Zealand. The consideration is in line with the initial assumption of the paper that classification models are suitable for solving the classification problem applied to a specific set of hepatitis C virus data for Egyptian patients.

In addition to the above 16 metrics, we found in the literature that there are other metrics: false discovery rate, log loss, Brier score, cumulative gain chart, lift curve, the Kolmogorov-Smirnov plot, and the Kolmogorov-Smirnov statistic. We did not describe them because we were unable to demonstrate their application to the data set we selected. These metrics remain to be presented in a later review paper.

All the metrics considered evaluated the classification models we used negatively. This has led to doubts, because these are models that are generally accepted as standard and reliable. Why do the metrics rate them negatively on the selected data set? Is the number of attributes in the selected data set too large? How many attributes are needed, and what are those attributes? Is it necessary to pre-process the data of the selected set?

The special significance of this paper is that it highlights the multitude of metrics used to evaluate each classification model. It emphasizes the diversity of these metrics and of the parameters they measure, for a better understanding of the model and its features.

New questions and problems that arose from this paper are: What are the techniques for pre-processing data in a data set, and how should discretization, purification (cleaning), reduction, and transformation of the data be performed on the specific hepatitis C virus data set for Egyptian patients?

We suggest that the classification be performed with five classes, as provided in the latest professional literature: class a - inflammation of the liver, class b - fibrosis, class c - cirrhosis, class d - end-stage liver disease (ESLD), and class e - cancer.

VI ACKNOWLEDGMENT

To the editor and reviewers of IJACSA - The Science and Information (SAI) Organization. To Dejan Vujović, an engineer for the development and maintenance of application software in the Montenegrin Electricity Transmission System, Podgorica.

VII REFERENCES

[1] T. Fawcett, "ROC Graphs: Notes and Practical Considerations for Researchers." Kluwer Academic Publishers, 2004.
[2] J. Sim, C. C. Wright, "The Kappa Statistic in Reliability Studies: Use, Interpretation, and Sample Size Requirements." Physical Therapy, Volume 85, Issue 3, pp. 257-268, 2005. https://doi.org/10.1093/ptj/85.3.257
[3] J. Đ. Novaković, "Rešavanje klasifikacionih problema mašinskog učenja." Fakultet tehničkih nauka u Čačku, 2013.
[4] R. R. Bouckaert, E. Frank, M. Hall, R. Kirkby, P. Reutemann, A. Seewald, D. Scuse, "WEKA Manual for Version 3-7-8." 2013.
[5] T. Saito, M. Rehmsmeier, "The Precision-Recall Plot is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets." PLoS ONE, 2015. doi: 10.1371/journal.pone.0118432
[6] M. Nasr, K. El-Bahnasy, M. Hamdy, S. M. Kamal, "A novel model based on non-invasive methods for prediction of liver fibrosis." 13th International Computer Engineering Conference (ICENCO), 2017.
[7] T. Saito, M. Rehmsmeier, "Basic evaluation measures from the confusion matrix." WordPress, 2017.
[8] V. Leal, "How to build a confusion matrix for a multiclass classifier?" Cross Validated, Stack Exchange Inc., 2021.
[9] S. Auckland, "Precision-recall curves - what are they and how are they used?" acutecaretesting.org, 2017.
[10] S. Narkhede, "Understanding Confusion Matrix." Towards Data Science, 2018.
[11] A. Mishra, "Metrics to Evaluate your Machine Learning Algorithm." Towards Data Science, 2018.
[12] D. Dua, C. Graff, "UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]." Irvine, CA: University of California, School of Information and Computer Science. Hepatitis C Virus (HCV) for Egyptian patients Data Set, 2019.
[13] A. Iqbal, S. Aftab, U. Ali, Z. Nawaz, L. Sana, M. Ahmad, A. Husen, "Performance Analysis of Machine Learning Techniques on Software Defect Prediction using NASA Datasets." (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 10, No. 5, 2019.
[14] R. Delgado, X.-A. Tibau, "Why Cohen's Kappa should be avoided as a performance measure in classification." PLoS ONE, 14(9), 2019.
[15] J. S. Saladi, "What Are the Stages of Liver Failure?" Healthline, 2019.
[16] Ž. Vujović, "The Big Data and Machine Learning." Journal of Information Technology and Multimedia Systems, Vol. 19, Issue 7, pp. 11-19, DOI: 10.5281/zenodo.427923, 2020.
[17] S. Nandakumar, "Confusion Matrix - are you confused? (Part I and Part II)." Medium, 2020.
[18] A. Albahr, M. Albahar, "An Empirical Comparison of Fake News Detection using different Machine Learning Algorithms." (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 9, 2020.
[19] A. Tharwat, "Classification assessment methods." Applied Computing and Informatics, Volume 17, Issue 1, 2020.
[20] M. Widmann, "Cohen's Kappa: What It Is, When to Use It, and How to Avoid Its Pitfalls." The New Stack, 2020.
[21] S. Room, "False Discovery Rate (FDR)." In W. Dubitzky, O. Wolkenhauer, K.-H. Cho, H. Yokota (eds.), Encyclopedia of Systems Biology, Springer, New York, NY, doi: 10.1007/978-1-4419-9863-7_223, 2013.
[22] G. Dembla, "Intuition behind Log-loss score." Towards Data Science, 2020.
[23] J. Hernández-Orallo, P. A. Flach, C. Ferri, "Brier curves: a new cost-based visualization of classifier performance." Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 585-592, 2011.
[24] T. Jurczyk, "Gains vs ROC curves. Do you understand the difference?" TIBCO Data Science, 2020.
[25] D. S. Coppock, "Data Modeling and Mining: Why Lift?" DM Review and SourceMedia, Inc., 2006.
[26] A. Justel, D. Peña, R. Zamar, "A multivariate Kolmogorov-Smirnov test of goodness of fit." Statistics & Probability Letters, Volume 35, Issue 3, pp. 251-259, 1997. doi: 10.1016/S0167-7152(97)00020-5