Classification Model Evaluation Metrics

Željko Đ. Vujović

(IJACSA) International Journal of Advanced Computer Science and Applications, Volume 12, Issue 6, 2021. UDC: 004.8. DOI: 10.14569/IJACSA.2021.0120670
Abstract - The purpose of this paper was to confirm the basic assumption that classification models are suitable for solving the problem of data set classification. We selected four representative models, BayesNet, NaiveBayes, MultilayerPerceptron, and J48, and applied them to a four-class classification of a specific set of hepatitis C virus data for Egyptian patients. We conducted the study using the classification models of the WEKA software, developed at Waikato University, New Zealand. Discouraging results were obtained: none of the four envisaged classes was determined reliably. We describe all 16 metrics used to evaluate classification models, listing their characteristics, their mutual differences, and the parameter that each of them evaluates. We present comparative, tabular values of each metric for each classification model in a concise form: detailed class accuracy with a table of the best and worst metric values, confusion matrices for all four classification models, and a table of type I and type II errors for all four classification models. In addition to the 16 metrics we describe, we list seven other metrics, which we did not use because we had no opportunity to show their application on the selected data set. The metrics rated the selected, standard, reliable classification models negatively. This led to the conclusion that the data in the selected data set should be pre-processed in order to be reliably classified by the classification models.

Keywords: classification model, evaluation of classification models, worst metric values, four-class classification, metric classification, reliably classified classification models, detailed class accuracy

Subject areas: artificial intelligence and machine learning, software engineering

I INTRODUCTION

A specific set of data on the hepatitis C virus, consisting of 1385 instances described with 29 attributes, was considered. [12] The goal is to classify these instances into four classes, which represent stages of hepatitis disease: class a - portal fibrosis, class b - few septa, class c - many septa, and class d - cirrhosis. [6] This paper challenges this classification. Sources in the literature suggest that classification into five classes would be better: class a - liver inflammation, class b - fibrosis, class c - cirrhosis, class d - end-stage liver disease (ESLD), and class e - cancer. [15]

The initial assumption is that standard, generally accepted classification models, BayesNet, NaiveBayes, MultilayerPerceptron, and J48, are suitable for such a classification. These models exist in the WEKA software and, as such, were applied to the selected data set. Unsatisfactory results were obtained; the available instances were classified very poorly. That was the reason, motive, and incentive to consider why this is so. These four models were chosen at random. In this introduction, we give their generally accepted definitions.

A Bayesian network is defined as a system of event probabilities, with nodes in a directed acyclic graph, in which the probability of an event can be calculated from the probabilities of its predecessors in the graph. The nodes in the network are variables. They can be concrete values, randomly given latent values, or hypotheses, and they are characterized by probability distributions. Probability is a quantity that reflects a state of knowledge or a state of belief. In the Bayesian view, a probability is assigned to a hypothesis; in frequentist thinking, the hypothesis is tested without assigning a probability. The result of Bayesian analysis is Bayesian inference: it updates the prior probability assigned to the hypothesis as more evidence and information are obtained. [3], [16]

Naive Bayes classifiers are based on the naive assumption of mutual independence of the features. In this way, each distribution can be estimated independently as a one-dimensional distribution. This alleviates the problems arising from the "curse of dimensionality", the problematic growth of the number of variables that can be collected from a single sample. An example of this is the need for data sets that scale exponentially with the number of features. [3], [14], [16], [18]

A multilayer perceptron is defined as a system composed of a series of elements (nodes, "neurons") organized into layers. The layers process information so that they react dynamically to external inputs. The input layer has one neuron for each component present in the input data and communicates with the hidden layers in the network. The entire processing of the input data takes place in the hidden layers. The input data are weighted by appropriate coefficients; each neuron accepts them, calculates their sum, and processes it with an activation function, passing the processed data forward. The last hidden layer is connected to the output layer, which has one neuron for each possible output. [3], [14], [16], [18]

J48 is a machine learning model based on the decision tree. It is the WEKA implementation, developed by the WEKA project development team, of the C4.5 algorithm, a successor of ID3 (Iterative Dichotomiser 3). The decision tree presents and analyzes decision-making situations in which one type of decision is derived from another type of decision. This facilitates understanding of selection problems, assessment of available versions of the decision, and coverage
of uncertain events, which affect outcomes and versions of the decision. [3], [14], [16], [18]

The first idea was to consider the metrics used to evaluate the applied classification models. The 16 metrics used by the WEKA software were reviewed, described, and explained. [4] In addition, it was noted that the following metrics also exist: False discovery rate [21], Log Loss [22], Brier score [23], Cumulative gain chart [24], Lift curve [25], and the Kolmogorov-Smirnov test [26]. These metrics were not considered because they are not contained in the WEKA software that was used; therefore, they could not give their ratings of the classification models on the selected data set.

The research made a significant contribution to the interpretation of the 16 mentioned metrics and of the elements and parameters that each of them uses to evaluate the classification models. A significant contribution is also the question: why did the metrics negatively evaluate the classification models applied to the selected data set? As a result of this research, other questions arose. Is the number of attributes per instance of the observed data set too large? How many attributes are needed (what is the optimum), and what are those attributes? Is it necessary to pre-process the data of the observed set? What are the techniques for pre-processing the data in a set? Unobtrusively, the question also arose as to whether the four classes for the classification of the instances of the observed set were correctly determined.

II METRICS

1. Correctly classified instances are the sum of true positives (TP) and true negatives (TN).

2. Incorrectly classified instances are the sum of false positives (FP) and false negatives (FN).

3. Kappa statistic - Cohen's kappa coefficient (K) measures how well the instances classified by the machine learning model match the data labeled as ground truth, controlling for the accuracy of a random classifier as measured by the expected accuracy. The accuracy of a random classifier is 1/k, where k is the number of classes in the data set; in the case of binary classification, k = 2, so the random accuracy is 50%.

$$K = \frac{p_0 - p_e}{1 - p_e}$$

where $p_0$ is the total accuracy of the model and $p_e$ is the random (chance) accuracy of the model. In the binary classification problem, $p_e = p_{e1} + p_{e2}$, where $p_{e1}$ is the probability that the predictions agree randomly with the actual values of class 1 and $p_{e2}$ is the probability that the predictions agree randomly with the actual values of class 2. The assumption is that the two classifiers (the model prediction and the actual class value) are independent. In this case, the probabilities $p_{e1}$ and $p_{e2}$ are calculated by multiplying the share of the actual class and the share of the predicted class. [2], [20]

4. Mean absolute error (MAE) is the mean of the absolute values of the individual prediction errors over all instances in the test set. Each prediction error is the difference between the actual value and the predicted value for the instance. The mean absolute error $E_i$ of an individual model $i$ is calculated by the formula:

$$E_i = \frac{1}{n} \sum_{j=1}^{n} \left| P_{ij} - T_j \right|$$

where $P_{ij}$ is the value predicted by the individual model $i$ for record $j$ (of $n$ records) and $T_j$ is the target value for record $j$. For a perfect prediction, $P_{ij} = T_j$ and $E_i = 0$. Thus, the index $E_i$ ranges from 0 to infinity, with 0 corresponding to the ideal. [14], [28]

5. Root mean squared error (RMSE) - taking the square root of the mean squared error reduces the error to the same dimensions as the quantity being predicted. The root mean squared error $E_i$ of an individual model $i$ is calculated by the formula:

$$E_i = \sqrt{\frac{1}{n} \sum_{j=1}^{n} \left( P_{ij} - T_j \right)^2}$$

where $P_{ij}$ is the value predicted by the individual model $i$ for record $j$ (of $n$ records) and $T_j$ is the target value for record $j$. For a perfect prediction, $P_{ij} = T_j$ and $E_i = 0$. Thus, the index $E_i$ ranges from 0 to infinity, with 0 corresponding to the ideal. [27]

6. Relative absolute error (RAE) is the total absolute error, normalized by dividing it by the total absolute error of the simple predictor (the ZeroR classifier). The relative absolute error $E_i$ of an individual model $i$ is evaluated by the equation:

$$E_i = \frac{\sum_{j=1}^{n} \left| P_{ij} - T_j \right|}{\sum_{j=1}^{n} \left| T_j - \bar{T} \right|}$$

where $P_{ij}$ is the value predicted by the individual model $i$ for record $j$ (of $n$ records), $T_j$ is the target value for record $j$, and $\bar{T}$ is given by the formula:

$$\bar{T} = \frac{1}{n} \sum_{j=1}^{n} T_j$$

For a perfect prediction, the numerator is 0 and $E_i = 0$. Thus, the index $E_i$ ranges from 0 to infinity, with 0 corresponding to the ideal. A good prediction model produces a ratio near zero; a bad model (one that is worse than the naive model) produces a ratio greater than 1 (more than 100%). [27]
7. Root relative squared error (RRSE) reduces the error to the same dimensions as the quantity being predicted. The relative squared error is the total squared error divided by the total squared error of the simple predictor. The root relative squared error $E_i$ of an individual model $i$ is calculated by the formula:

$$E_i = \sqrt{\frac{\sum_{j=1}^{n} \left( P_{ij} - T_j \right)^2}{\sum_{j=1}^{n} \left( T_j - \bar{T} \right)^2}}$$

where $P_{ij}$ is the value predicted by the individual model $i$ for record $j$ (of $n$ records). For a perfect prediction, the numerator is equal to 0 and $E_i = 0$. The index $E_i$ ranges from 0 to infinity, with 0 corresponding to the ideal. [28]

8. Confusion matrix for a binary classifier (Figure 1). Actual values are marked True (1) and False (0); predicted values are marked Positive (1) and Negative (0). Estimates of the capabilities of classification models are derived from the quantities TP, TN, FP, and FN that make up the confusion matrix. [10]

                              Actual class
                         True (1)    False (0)
Predicted   Positive (1)    TP           FP
class       Negative (0)    FN           TN

Figure 1. Confusion matrix for the binary classification problem [7]

TP (True Positive) - a data point in the confusion matrix is a true positive when a positive outcome is predicted and what happened is the same.

FP (False Positive) - a data point in the confusion matrix is a false positive when a positive outcome is predicted but what happened is a negative outcome; a benefit is predicted that does not occur. This scenario is known as a type I error.

FN (False Negative) - a data point in the confusion matrix is a false negative when a negative outcome is predicted but what happened is a positive outcome. This scenario is known as a type II error and is considered as dangerous as a type I error.

TN (True Negative) - a data point in the confusion matrix is a true negative when a negative outcome is predicted and what happens is the same.

The four results of the binary classification are shown in Figure 2.

Figure 2. Elliptical representation of the four binary results of the test set classification [7]

Confusion matrix for four-class classification (Figure 3). Four-class classification is the problem of classifying instances (examples) into four classes: class A, class B, class C, and class D. [13], [17]

Figure 3. Confusion matrix for the four-class classification problem [8]

9. Accuracy is calculated as the sum of the two correct prediction counts (TP + TN) divided by the total size of the data set (P + N). The best accuracy is 1.0 and the worst is 0.0. (Figure 4) [19]

$$ACC = \frac{TP + TN}{TP + TN + FN + FP} = \frac{TP + TN}{P + N}$$

Figure 4. Two ellipses show how the accuracy is calculated [7], [11]

10. TP Rate - the true positive rate, also called sensitivity or recall (REC), is calculated as the number of correct positive predictions (TP) divided by the total number of positives (P). The best TP rate is 1.0 and the worst is 0.0. (Figure 5) [19]

$$SN = \frac{TP}{TP + FN} = \frac{TP}{P}$$

Figure 5. Two ellipses show how the sensitivity is calculated [7]

11. FP Rate - the false positive rate is calculated as the number of false positive predictions (FP) divided by the total number of negatives (N). The best FP rate is 0.0 and the worst is 1.0. It can also be calculated as 1 - specificity. (Figure 6) [19]

$$FPR = \frac{FP}{TN + FP} = 1 - SP$$

Figure 6. Two ellipses show how the false positive rate (FPR) is calculated [7]
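As a complement to Figures 1-6, here is a minimal hedged sketch in Python that derives TP, FP, FN, and TN from two label vectors and then computes the accuracy, TP rate, and FP rate by the formulas above. NumPy, the 0/1 encoding, and the toy vectors are our own assumptions, not taken from the paper or from WEKA.

```python
import numpy as np

def binary_confusion(y_true, y_pred):
    """Counts for the binary confusion matrix of Figure 1 (1 = positive)."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # type I errors
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # type II errors
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # true negatives
    return tp, fp, fn, tn

# Toy example (illustrative values only)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])
tp, fp, fn, tn = binary_confusion(y_true, y_pred)

acc = (tp + tn) / (tp + tn + fp + fn)  # metric 9: accuracy
tpr = tp / (tp + fn)                   # metric 10: sensitivity / recall
fpr = fp / (fp + tn)                   # metric 11: false positive rate
print(acc, tpr, fpr)
```

The same four counts also yield the precision and specificity defined next, so one confusion matrix is enough to evaluate all of these metrics at once.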
12. Precision (PREC) is calculated as the number of correct positive predictions (TP) divided by the total number of positive predictions (TP + FP):

$$PREC = \frac{TP}{TP + FP}$$

13. True Negative Rate - TNR (specificity) is calculated as the number of correct negative predictions (TN) divided by the total number of negatives (N). The best specificity is 1.0 and the worst is 0.0. (Figure 8) [19]

Figure 9. ROC curve [1], [5]

The ROC AUC score shows how good the model is at ranking predictions. It indicates the probability that a randomly selected positive instance is ranked higher than a randomly selected negative instance. [7], [19]
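This ranking interpretation of the ROC AUC score can be checked directly. The sketch below is an illustration under our own assumptions (NumPy, invented scores, and a pairwise-counting estimator), not WEKA code: it estimates AUC as the fraction of positive-negative pairs in which the positive instance receives the higher score, counting ties as one half.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as P(score of a random positive > score of a random negative)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # compare every positive score against every negative score
    wins = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return (wins + 0.5 * ties) / (pos.size * neg.size)

# Toy example (illustrative values only)
y_true = np.array([1, 1, 0, 1, 0, 0])
scores = np.array([0.9, 0.6, 0.55, 0.4, 0.3, 0.2])  # model's ranking scores
print(roc_auc(y_true, scores))  # 1.0 would mean a perfect ranking
```

A score of 0.5 corresponds to random ranking, which is why the diagonal of the ROC plot represents a classifier with no discriminating power.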
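The per-class quantities referred to in the comparative analysis below, TP together with the type I (FP) and type II (FN) errors of each class, can be read off a multiclass confusion matrix. The following is a minimal sketch under our own conventions (rows = actual class, columns = predicted class; the labels a-d and the toy label vectors are invented for illustration), not a reproduction of the paper's Tables 4 and 5.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, labels):
    """Multiclass confusion matrix; rows = actual, columns = predicted."""
    idx = {c: i for i, c in enumerate(labels)}
    m = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[idx[t], idx[p]] += 1
    return m

def per_class_errors(m):
    """For each class: TP, FP (type I errors), FN (type II errors)."""
    tp = np.diag(m)
    fp = m.sum(axis=0) - tp  # predicted as the class but actually another
    fn = m.sum(axis=1) - tp  # actually the class but predicted as another
    return tp, fp, fn

labels = ["a", "b", "c", "d"]  # the four hepatitis classes
y_true = ["a", "b", "b", "c", "d", "a", "c", "d"]
y_pred = ["a", "b", "c", "c", "a", "a", "b", "d"]
m = confusion_matrix(y_true, y_pred, labels)
tp, fp, fn = per_class_errors(m)
print(m)
print(tp, fp, fn)  # low FP/FN per class is the modeling goal stated below
```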
By comparative analysis of the confusion matrices for all four classification models and all four classes, we see that the predictions of true positive results (TP) are not good enough (Table 4). Type I and type II errors are relatively high; the goal of modeling is to reduce these errors to minimum values. Separate consideration of the type I and type II errors for the four applied models shows that NaiveBayes has a type I error value equal to 0 for class d, and type II errors for classes a, b, and c (Table 5). These data further problematize the use of this model. For the other three models, the type I and type II errors are, on average, 2.5 times larger than the correct predictions.

V CONCLUSIONS

In this paper, we have considered in detail the 16 metrics for the evaluation of classification models that exist in the WEKA software, version 3.4.1, developed at the University of Waikato, New Zealand. The consideration is in line with the initial assumption of the paper that classification models are suitable for solving the classification problem applied to a specific set of hepatitis C virus data for Egyptian patients.

In addition to the above 16 metrics, we found in the literature that there are other metrics: false discovery rate, log loss, Brier score, cumulative gain chart, lift curve, the Kolmogorov-Smirnov plot, and the Kolmogorov-Smirnov statistic. We did not describe them because we were unable to demonstrate their application to the data set we selected. These metrics remain to be presented in a later review paper.

All the metrics considered evaluated the classification models we used negatively. This has led to doubts, because these are models that are generally accepted as standard and reliable. Why do the metrics rate them negatively on the selected data set? Is the number of attributes in the selected data set too large? How many attributes are needed, and what are those attributes? Is it necessary to pre-process the data of the selected set?

The special significance of this paper is that it highlights the multitude of metrics used to evaluate each classification model. It emphasizes the diversity of these metrics and of the parameters they measure, for a better understanding of the model and its features.

New questions and problems that arose from this paper are: What are the techniques for pre-processing data in a data set, and how should discretization, purification (cleaning), reduction, and transformation of the data be performed on the specific hepatitis C virus data set for Egyptian patients?

We suggest that the classification be performed with five classes, as provided in the latest professional literature: class a - inflammation of the liver, class b - fibrosis, class c - cirrhosis, class d - end-stage liver disease (ESLD), and class e - cancer.

VI ACKNOWLEDGMENT

To the editor and reviewers of IJACSA - The Science and Information (SAI) Organization. To Dejan Vujović, an engineer for the development and maintenance of application software in the Montenegrin Electricity Transmission System, Podgorica.

VII REFERENCES

[1] T. Fawcett, "ROC Graphs: Notes and Practical Considerations for Researchers." Kluwer Academic Publishers, 2004.
[2] J. Sim, C. C. Wright, "The Kappa Statistic in Reliability Studies: Use, Interpretation, and Sample Size Requirements." Physical Therapy, Volume 85, Issue 3, pp. 257-268, 2005. https://doi.org/10.1093/ptj/85.3.257
[3] J. Đ. Novaković, "Rešavanje klasifikacionih problema mašinskog učenja." Fakultet tehničkih nauka u Čačku, 2013.
[4] R. R. Bouckaert, E. Frank, M. Hall, R. Kirkby, P. Reutemann, A. Seewald, D. Scuse, "WEKA Manual for Version 3-7-8." 2013.
[5] T. Saito, M. Rehmsmeier, "The Precision-Recall Plot is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets." PLoS ONE, 2015. doi: 10.1371/journal.pone.0118432
[6] M. Nasr, K. El-Bahnasy, M. Hamdy, S. M. Kamal, "A novel model based on non-invasive methods for prediction of liver fibrosis." 13th International Computer Engineering Conference (ICENCO), 2017.
[7] T. Saito, M. Rehmsmeier, "Basic evaluation measures from the confusion matrix." WordPress, 2017.
[8] V. Leal, "How to build a confusion matrix for a multiclass classifier?" Cross Validated, Stack Exchange Inc., 2021.
[9] S. Auckland, "Precision-recall curves - what are they and how are they used?" acutecaretesting.org, 2017.
[10] S. Narkhede, "Understanding Confusion Matrix." Towards Data Science, 2018.
[11] A. Mishra, "Metrics to Evaluate your Machine Learning Algorithm." Towards Data Science, 2018.
[12] D. Dua, C. Graff, "UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]." Irvine, CA: University of California, School of Information and Computer Science. Hepatitis C Virus (HCV) for Egyptian patients Data Set, 2019.
[13] A. Iqbal, S. Aftab, U. Ali, Z. Nawaz, L. Sana, M. Ahmad, A. Husen, "Performance Analysis of Machine Learning Techniques on Software Defect Prediction using NASA Datasets." (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 10, No. 5, 2019.
[14] R. Delgado, X.-A. Tibau, "Why Cohen's Kappa should be avoided as a performance measure in classification." PLoS ONE, 14(9), 2019.
[15] J. S. Saladi, "What Are the Stages of Liver Failure?" Healthline, 2019.
[16] Ž. Vujović, "The Big Data and Machine Learning." Journal of Information Technology and Multimedia Systems, Vol. 19, Issue 7, pp. 11-19, DOI: 10.5281/zenodo.427923, 2020.
[17] S. Nandakumar, "Confusion Matrix - are you confused? (Part I and Part II)." Medium, 2020.
[18] A. Albahr, M. Albahar, "An Empirical Comparison of Fake News Detection using different Machine Learning Algorithms." (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 9, 2020.
[19] A. Tharwat, "Classification assessment methods." Applied Computing and Informatics, Volume 17, Issue 1, 2020.
[20] M. Widmann, "Cohen's Kappa: What It Is, When to Use It, and How to Avoid Its Pitfalls." The New Stack, 2020.
[21] S. Room, "False Discovery Rate (FDR)." In W. Dubitzky, O. Wolkenhauer, K.-H. Cho, H. Yokota (eds.), Encyclopedia of Systems Biology, Springer, New York, NY, doi: 10.1007/978-1-4419-9863-7_223, 2013.
[22] G. Dembla, "Intuition behind Log-loss score." Towards Data Science, 2020.
[23] J. Hernández-Orallo, P. A. Flach, C. Ferri, "Brier curves: a new cost-based visualization of classifier performance." Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 585-592, 2011.
[24] T. Jurczyk, "Gains vs ROC curves. Do you understand the difference?" TIBCO Data Science, 2020.
[25] D. S. Coppock, "Data Modeling and Mining: Why Lift?" DM Review and SourceMedia, Inc., 2006.
[26] A. Justel, D. Peña, R. Zamar, "A multivariate Kolmogorov-Smirnov test of goodness of fit." Statistics & Probability Letters, Volume 35, Issue 3, pp. 251-259, 1997. doi: 10.1016/S0167-7152(97)00020-5