Acta Polytechnica Hungarica
Vol. 18, No. 6, 2021
Fast Regular and Interval-based Classification using parSITs
Balázs Tusor1, Annamária R. Várkonyi-Kóczy2

1 Doctoral School of Applied Informatics and Applied Mathematics, Óbuda University, Budapest, Hungary; E-mail: [email protected]

2 Institute of Automation, Óbuda University, Budapest, Hungary; E-mail: [email protected]
Abstract: Parallelized Sequential Indexing Tables (parSITs) are classifiers developed to process large volumes of data rapidly. Their base idea is implementing a sequential indexing table structure with parallelization techniques, using a sequence of Lookup Tables to build a chain of value combinations. Although the inference (evaluation) method that the classifier was originally developed with is very fast, its performance depends significantly on the arbitrary order of the attributes, which in multi-class cases reduces its classification performance. In this work, we introduce a new inference method that increases the classification performance of the classifier at the cost of a small increase in computational complexity.

Keywords: Big Data classification; interval-based classification; parallel computing; sequential indexing tables; lookup tables; machine learning
1 Introduction
Machine learning has been one of the most important areas of computer science in
the past couple of decades. Numerous systems have been developed to address
many different kinds of machine learning problems in a wide range of scientific
fields. To mention a few recent developments, machine learning has been
successfully used in biomedical engineering [1] [2], spam email detection [3],
movement detection [4], the recognition of electromyographic hand gesture
signals for prosthetic hand control [5], Big Data systems [6] [7] [8], etc.
Parallelized Sequential Indexing Tables (parSITs) are classifiers developed to process large volumes of data rapidly. Their base idea is implementing a sequential indexing table structure [9] [10] with parallelization techniques, using a sequence of Lookup Tables [11] to build a chain of value combinations that describes the data extracted from the training data in a compact way. The chain is organized into a layered structure, where each layer takes care of a given dimension of the problem space (i.e. a given attribute of the training data).
In previous work, we developed a training algorithm for the parSIT classifier [12] [13], along with a simple inference (evaluation) method that, for each attribute, finds the index closest to the given input value. Although this leads to very fast inference, its performance significantly depends on the arbitrary order of the attributes (which directly influences the structure itself). The disadvantage is that for an input sample that is very similar to a learned sample in all but one attribute, if that attribute is situated high in the structure, the classifier is likely to misclassify the sample, or not recognize it at all.
To solve this problem, in this paper a new inference method is presented for the parSIT that uses a different approach: instead of choosing the closest values, it evaluates all values within a given range. Although this results in higher computational and implementation complexity, it is shown that it also boosts the classification performance significantly.
The rest of this paper is structured as follows. In Section 2 the parSIT classifier is described alongside the proposed new method: in Subsection 2.1 the general architecture is presented, Subsections 2.2 and 2.3 briefly summarize the training procedure and the proximity-based inference algorithm, respectively, and finally, in Subsection 2.4 the new interval-based inference algorithm is presented in detail. Section 3 illustrates the classification performance of the new method compared to the proximity-based inference in Subsection 3.1, then a complexity analysis is given in Subsection 3.2. Finally, Section 4 concludes the paper and presents some future work possibilities.
2 Parallelized Sequential Indexing Tables

2.1 General Architecture
The parSIT classifier builds and maintains a layered structure that models the problem space based on the data used for its training. The structure built from the data of an N-dimensional classification problem (thus, data having N data attributes and 1 class attribute) consists of N+1 layers, where the first N layers handle the attributes of the data, each one regarding the values of the corresponding attribute of the input data. Fig. 1 (a) shows an example of a trained network built using training dataset X (shown in (b)).
Figure 1
The general architecture of ParSITs (a) trained on a given training dataset (b)
The array elements are colored according to the training sample (t0..t9) they represent. The first "root" layer contains a 1D index array Λ^(0), which stores all different input values gained from the first attribute (x0), sorted and free of duplicates. The elements of the index arrays are called markers in the following. The layers are connected so that for each trained data tuple, an implicit "route" can be followed to reach the class label in the last layer that is associated with the given tuple.
In Fig. 1, it can be seen that x0 has 4 different values in the training dataset, thus the size of the index array Λ^(0) is 4. The first element (Λ^(0)_0 = 1.1) belongs to training tuples t3 and t4. Since these two tuples have different values in their second attribute (x1), in the second layer they take up two different array elements (given by their values: '6' and '9'). However, since they share the same first attribute value, they make up a group in the second layer, which is addressed by the position of the marker of their first value in the previous layer (η0). Each group in a given layer is accounted for by storing its starting location in a 1D array α^(i) (where i denotes which layer it belongs to) and the number of markers in each group in a 1D array β^(i). For example, the starting location of the group (g0) of the mentioned 2 markers in the second layer is α^(1) = 0, and it contains β^(1) = 2 markers. This way, the evaluation is faster, since not all of the markers need to be regarded, only the groups marked by the significant markers in the layer before. In the root layer, there are no additional arrays, since all markers are observed.
The last (class) layer contains one index array Λ^(N) that contains the class labels, and an occurrence array Θ that accounts for how many times each given value sequence has been seen during training. Occurrence values higher than one (Θ_j > 1 for some element j) indicate redundancy in the training data, while groups in Λ^(N) that have more than one marker in them indicate inconsistency in the training data (i.e. there are two or more tuples that share the same attribute values but differ in class label).
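To make the bookkeeping concrete, the sketch below shows one possible in-memory representation of such a trained structure. The miniature dataset (three tuples over two attributes) and all values are hypothetical, chosen only so that the Λ, α, β and Θ arrays are mutually consistent; the actual implementation keeps these arrays on the GPU.

```python
# Hypothetical trained parSIT for 3 tuples: [1.1, 6] -> "A", [1.1, 9] -> "B",
# [2.4, 10] -> "A" (N = 2 attributes, so N + 1 = 3 layers). Illustration only.
Lam = [
    [1.1, 2.4],        # layer 0 (root): sorted, duplicate-free values of x0
    [6, 9, 10],        # layer 1: markers of x1, grouped by their layer-0 parent
    ["A", "B", "A"],   # layer 2 (class layer): class labels
]
alpha = [
    None,              # root layer: no group bookkeeping, all markers are polled
    [0, 2],            # layer-1 group starts, indexed by parent marker position
    [0, 1, 2],         # class-layer group starts
]
beta = [
    None,
    [2, 1],            # layer-1 group sizes: markers 6 and 9 descend from 1.1
    [1, 1, 1],         # class-layer group sizes (consistent data: one label each)
]
Theta = [1, 1, 1]      # occurrence count of each full value sequence
```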
2.2 Training Algorithm
The training of the parSIT structure builds the Λ, α and β arrays for each layer, one
layer at a time. It uses parallel computing to sort the values of the given attribute,
then build a temporary array H, and remove the redundant elements to gain the
compact representation of the given attribute. Let P denote the number of training
tuples.
Figure 2
The training of the first layer of the structure shown in Fig. 1
Fig. 2 shows the training procedure for the root layer, processing attribute 0.
The attribute values of X^(0) (a column array taken from x0) are sorted alongside a simple increasing sequence array S (containing values from 0 to P−1), using X^(0) as key. The sorting results in X'^(0) and S'. After that, the boundaries of the distinct values in X'^(0) are flagged in flag array F (redundant elements carry a 0 flag), for ∀p ∈ [0, P−1]:

$$F_p = \begin{cases} 1, & \text{if } X'^{(0)}_{p-1} \neq X'^{(0)}_p \text{ and } p \neq 0 \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
Then, a parallel computing technique called parallel prefix sum (PPS, [14]) is used on array F to gain array M, which indicates the new place of each attribute value in the reduced array Λ:

$$\Lambda^{(0)}_{M_p} = \begin{cases} X'^{(0)}_0, & \text{if } p = 0 \\ X'^{(0)}_p, & \text{if } p \neq 0 \text{ and } F_p = 1 \end{cases} \tag{2}$$

The size of Λ is determined from the last value of M:

$$m = M_{P-1} + 1 \tag{3}$$
(3)
Furthermore, by using S’ to rearrange M, we gain the temporary array H that is
used to distinguish the groups in the next ((i+1)th) layer:
(𝑖+1)
𝐻𝑆𝑝
= 𝑀𝑝 , ∀𝑝𝜖[0, 𝑃 − 1]
(4)
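As an illustration, the root-layer steps of Eqs. (1)-(4) can be transcribed sequentially as follows. This is only a NumPy sketch of what the parallel primitives (sort, flagging, prefix sum, scatter) compute, assuming an inclusive prefix sum; it is not the CUDA/Thrust implementation itself.

```python
import numpy as np

def train_root_layer(x0):
    """Sketch of Eqs. (1)-(4) for the root layer; NumPy stands in for the
    parallel primitives (sort, flagging, parallel prefix sum, scatter)."""
    P = len(x0)
    S = np.argsort(x0, kind="stable")   # S': original positions after sorting
    Xs = x0[S]                          # X'(0): sorted attribute values
    # Eq. (1): F[p] = 1 where a new (non-redundant) value starts
    F = np.zeros(P, dtype=int)
    F[1:] = (Xs[1:] != Xs[:-1]).astype(int)
    M = np.cumsum(F)                    # inclusive prefix sum of F
    m = M[-1] + 1                       # Eq. (3): number of distinct values
    Lam = np.empty(m)                   # Eq. (2): scatter into the reduced array
    Lam[M] = Xs                         # duplicate positions write equal values
    H_next = np.empty(P, dtype=int)     # Eq. (4): group number of each tuple
    H_next[S] = M
    return Lam, H_next

# Hypothetical data: attribute x0 of 6 training tuples
Lam0, H1 = train_root_layer(np.array([3.2, 1.1, 5.0, 1.1, 2.4, 3.2]))
# Lam0 -> [1.1, 2.4, 3.2, 5.0]; H1 gives each tuple's marker index in Lam0
```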
Figure 3
The training of the second layer of the structure shown in Fig. 1
The training of the rest of the layers (i > 0) is similar (shown in Fig. 3), with the difference that H is also used to sort the data: X^(i), H^(i) and S are all sorted by H first, then by X, thus gaining an ordering (X''^(i), H''^(i) and S'') where each group is represented in the order of its "parent" markers in the previous layer (i.e. the value that its members share in their previous (i−1)-th attribute), and the markers within each group are sorted as well. Thus, H indicates the group numbers.
Similarly to the root layer, in the next steps the flag array F is determined, now flagging the positions where either a new value or a new group starts:

$$F_p = \begin{cases} 1, & \text{if } \left(X''^{(i)}_{p-1} \neq X''^{(i)}_p \text{ or } H''^{(i)}_{p-1} \neq H''^{(i)}_p\right) \text{ and } p \neq 0 \\ 0, & \text{otherwise} \end{cases} \tag{5}$$
which is used to create the marker placement array M by applying parallel prefix sum on array F. Array H^(i+1) is calculated using Eq. (4) and the compact index array is determined:

$$\Lambda^{(i)}_{M_p} = \begin{cases} X''^{(i)}_0, & \text{if } p = 0 \\ X''^{(i)}_p, & \text{if } p \neq 0 \text{ and } F_p = 1 \end{cases} \tag{6}$$
After that, all non-root layers (i > 0) calculate the starting locations (α^(i)) and the group sizes (β^(i)) for each group j:

$$\alpha^{(i)}_j = \min\{k \mid \eta_k = j\} \tag{7}$$

$$\beta^{(i)}_j = \begin{cases} \alpha^{(i)}_{j+1} - \alpha^{(i)}_j, & \text{if } j < g - 1 \\ m - \alpha^{(i)}_j, & \text{if } j = g - 1 \end{cases} \tag{8}$$

where η_k is the group number of marker k and g is the number of groups in the layer.
In the last layer, the algorithm ends after the occurrence array Θ is calculated.
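Under the same assumptions as the previous sketch, the group bookkeeping of Eqs. (7)-(8) can be sketched sequentially as:

```python
import numpy as np

def group_bounds(eta, m, g):
    """Sketch of Eqs. (7)-(8): starting position alpha[j] and size beta[j]
    of each group j, given the group number eta[k] of every marker k,
    the number of markers m and the number of groups g."""
    alpha = np.empty(g, dtype=int)
    beta = np.empty(g, dtype=int)
    for j in range(g):
        alpha[j] = np.nonzero(eta == j)[0].min()   # Eq. (7)
    beta[:-1] = alpha[1:] - alpha[:-1]             # Eq. (8), j < g-1
    beta[-1] = m - alpha[-1]                       # Eq. (8), j = g-1
    return alpha, beta

# Hypothetical layer with m=3 markers in g=2 groups:
alpha, beta = group_bounds(np.array([0, 0, 1]), m=3, g=2)   # -> [0, 2], [2, 1]
```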
2.3 Proximity-based Inference
Figure 4
An illustration for the proximity-based inference algorithm for a single input sample
Fig. 4 depicts the proximity-based inference algorithm. In the root layer, it simply polls each marker Λ^(0)_i for each input sample value x^(0)_p, i.e. calculates their distances, and chooses the index number of the marker that is closest to the given input value:

$$H_p = \operatorname{argmin}_{\forall i}\left(d\left(\Lambda^{(0)}_i,\, x^{(0)}_p\right)\right) \tag{9}$$

where d( ) is a suitable distance measure. In this research, we use the Euclidean distance, but various other measures can be used as well, such as a Gaussian function.
In each subsequent layer, the closest marker is chosen for each input tuple p (p ∈ [0, P−1]), and only the corresponding groups are polled in the next layer:

$$H^{(i)}_p = \operatorname{argmin}_{\forall j}\left(d\left(\Lambda^{(i)}_{\upsilon_p + j},\, x^{(i)}_p\right)\right) \tag{10}$$

where υ_p is the starting position of the group that belongs to the marker chosen in the previous layer for each tuple p:

$$\upsilon_p = \alpha^{(i)}_{H^{(i-1)}_p} \tag{11}$$

Finally, the output of the proximity-based inference is an array y given by the indices in the last layer:

$$y_p = \Lambda^{(N)}_z \tag{12}$$

$$z = \operatorname{argmax}_{\forall j}\left(\Theta_{\upsilon_p + j}\right) \tag{13}$$

where z is the index of the most frequently occurring class that belongs to the marker chosen in the previous layer.
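Put together, the proximity-based inference for one sample amounts to a chain of nearest-marker lookups. The following sketch assumes the hypothetical list-of-arrays representation from the Subsection 2.1 example (Lam, alpha, beta, Theta) and absolute difference as the distance d; it illustrates Eqs. (9)-(13), not the parallel implementation:

```python
import numpy as np

def proximity_inference(sample, Lam, alpha, beta, Theta):
    """Sketch of Eqs. (9)-(13): follow the closest marker in each layer,
    then return the most frequently occurring class of the final group."""
    N = len(sample)
    start, size = 0, len(Lam[0])         # root layer: poll every marker
    for i in range(N):
        group = np.asarray(Lam[i][start:start + size])
        h = start + int(np.argmin(np.abs(group - sample[i])))  # Eqs. (9)-(10)
        start = alpha[i + 1][h]          # Eq. (11): group of the chosen marker
        size = beta[i + 1][h]
    z = start + int(np.argmax(Theta[start:start + size]))      # Eq. (13)
    return Lam[N][z]                     # Eq. (12)

# E.g. with the miniature structure from Subsection 2.1, sample [1.2, 8.5]
# follows marker 1.1 in the root layer and 9 in layer 1, returning "B".
```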
2.4 Interval-based Inference
As mentioned in the Introduction, the proximity-based inference method is very fast, but has the disadvantage that it focuses on only one marker in each layer. For example, in the structure shown in Fig. 1, if the input tuple is t10 = [1.5 10], then the closest learned tuple should be [2.4 10] (as their Euclidean distance is d(t9, t10) = 0.9), but the inference will choose [1.1 9] (d(t3, t10) = 1.07), because 1.5 is closer to 1.1 than to 2.4. This often leads to misclassification, decreasing the classification accuracy of the system.
While the proximity-based inference described in the previous subsection focuses on quickly finding the value closest to the currently examined input value for each input sample, the interval-based inference investigates the area around the known values as well, and thus can regard multiple values for each input sample.
Fig. 5 illustrates the inference algorithm for a single input [2 4], where the ρ ranges of the attributes are defined as ρ = [1.5 3] (i.e. for the first attribute, the interval [0.5, 3.5] is investigated, while for the second attribute the interval [1, 7] is regarded). In the figure, each "route" is color-coded to make the paths of the inference more easily discernible.
In the root layer (L0), the first 3 markers are polled as positive (i.e. being part of the sought interval), so in the next layer (L1), the groups linked to them (markers #0 to #5) are regarded, comparing their values to the interval of the second attribute. In the given example, only markers (η1) #0, #2, #4 and #5 are polled as positive, so in the last (class) layer the class markers (Λ^(2)) and their occurrences (Θ) are counted.
Figure 5
An illustration for the interval-based inference algorithm considering one input sample
The output of the system can be either the class with the highest occurrence rate,
or the whole array of classes with their measured occurrence rates (as statistical
information to enhance classification performance). If the inference stops before
reaching the class layer (i.e. does not find any markers that are close enough to the
input sequence), then the default class is returned (which has the highest overall
occurrence rate).
Remark: An easy way to determine the range value for each attribute is to determine the size of its domain (i.e. the largest value the attribute can take) and set the range to an arbitrary percentage of it.
Figure 6
An illustrative example for the detailed steps of the interval-based inference algorithm processing the
root layer of the structure in Fig. 5, considering 3 input samples (set X)
Figure 7
The continuation of the interval-based inference algorithm, processing the second and third layers of
the structure in Fig. 5
A more detailed illustration of the interval-based inference algorithm is shown in Figures 6 and 7, where the input data set X (with 3 samples: t0 = [2 4], t1 = [3.8 5] and t2 = [1.1 1]) is evaluated on the structure in Fig. 5, using range parameters ρ = [1.5 3]. For easier readability, each row is colored in accordance with its corresponding sample. Since the number of positively polled markers varies among the input samples, the coloring also indicates how many elements are used in each row. Fig. 6 shows the evaluation of the root layer, while Fig. 7 depicts its continuation for the rest of the structure.
The algorithm is implemented through matrices, where each row processes a given input sample. The goal in each layer is to determine which groups of markers have to be regarded in the next layer; at the last layer, the class distribution among the selected markers is determined for each input sample.
Let us consider a 2D array T of size P × S_Λ^(0) in the beginning, where P is the number of input samples and S_Λ^(0) is the size of the 1D index array in the root layer. T is used to store the list of the markers that need to be evaluated in the next layer. For the root layer, T^(i=0) simply contains an increasing sequence of numbers from 0 to S_Λ^(0) − 1:

$$T^{(0)}_{p,j} = j, \quad \forall j \in [0, S_{\Lambda^{(0)}} - 1],\ \forall p \in [0, P-1] \tag{14}$$
In order to determine the size of the array T^(i+1), a temporary 2D array Γ of size P × S_Λ^(i) is created in each given layer. It is initialized with zeros, then in each row p, a given element Γ_{p,T_{p,j}} is set to β^(i+1)_{T_{p,j}} if the corresponding marker value Λ^(i)_{T_{p,j}} is within the ρ range of the given input attribute value X_{p,i}:

$$\Gamma_{p,\,T_{p,j}} = \begin{cases} \beta^{(i+1)}_{T_{p,j}}, & \text{if } X_{p,i} - \rho_i \leq \Lambda^{(i)}_{T_{p,j}} \leq X_{p,i} + \rho_i \\ 0, & \text{otherwise} \end{cases} \tag{15}$$

for all j ∈ [0, S_Λ^(i) − 1] and all p ∈ [0, P−1]. In Figures 6 and 7, the modified cell values are marked with blue font color.
After that, a parallel prefix sum (PPS) is performed on Γ in order to collect, for each row, the number of elements that need to be regarded in the next layer, gaining Γ'. As a result of the PPS step, the last column contains these values, which are collected in array φ (of size P × 1):

$$\varphi_p = \Gamma'_{p,\,S_{\Lambda^{(i)}} - 1} \tag{16}$$

for all p ∈ [0, P−1], and the largest such number in S_T (which is the size of T^(i+1)):

$$S_T = \max_{\forall p} \varphi_p \tag{17}$$
In order to construct T^(i+1), another temporary 2D array Φ is created (of size P × S_T). In Φ, the index numbers of the markers to be regarded in the next layer need to be collected. However, this is not done by directly addressing the elements of Φ: instead, it is done indirectly, by using the elements of Γ that polled positively (υ(p, j) = Γ_{p,j}) to set only the relevant starting positions Φ_{p,υ(p,j)}. These are marked with red font color in Figures 6 and 7:

If the first element in a row of Γ is polled positive (its value is bigger than 0), then the first element in the same row of Φ is set to 0, since that is the leftmost marker:

$$\Phi_{p,0} = \alpha^{(i+1)}_0 \tag{18}$$

If its value in Γ is 0 (i.e. polled negative), then the sought sequence does not start at 0. However, the starting position can still simply be determined from Γ' (using υ(p, j) = Γ'_{p,j}):

$$\Phi_{p,\,\upsilon(p,j-1)} = \alpha^{(i+1)}_j, \quad \text{if } j > 0 \text{ and } \Gamma_{p,j} > 0 \tag{19}$$
Remark: it is important to note that at this point 𝛤 only shows if an element polled
positive or not, while the position information is taken from 𝛤′.
For example, in Fig. 6 the first row (p = 0) of Γ is [2 2 2 0], while Γ' is [2 4 6 6]. This shows that only the first 3 elements polled positive, so the first element of the same row in Φ is set to 0; then the 2nd element (j = 1) sets Φ_{0,υ(0,j−1)} = Φ_{0,υ(0,0)} = Φ_{0,2} to α^(1)_{j=1} = 2, while the 3rd element (j = 2) sets Φ_{0,υ(0,j−1)} = Φ_{0,υ(0,1)} = Φ_{0,4} to α^(1)_{j=2} = 4.
After that, the corresponding ending locations are set for each row p in Φ, for each positively polling element j in Γ:

$$\Phi_{p,\,\upsilon(p,j-1)+\beta^{(i+1)}_j - 1} = \alpha^{(i+1)}_j + \beta^{(i+1)}_j - 1 \tag{20}$$
These elements are marked with green font color in Φ in Figures 6 and 7. If there are no groups with more than 2 elements (β^(i+1)_j > 2), then Φ can already be used as T^(i+1) in the next layer. However, if there are more than 2 elements in at least one group, then there will be "holes" (zeros) in the rows, which is why Φ is only a temporary array used to construct T^(i+1).
With Φ set up, T^(i+1) is initialized (with the same size as Φ), however, with ones. The elements of T^(i+1) are calculated differently for the first element of each row p:

$$T^{(i+1)}_{p,0} = \Phi_{p,0} \tag{21}$$

and for the subsequent elements:

$$T^{(i+1)}_{p,\,\upsilon(p,j-1)} = \Phi_{p,\,\upsilon(p,j-1)} - \Phi_{p,\,\upsilon(p,j-1)-1} \tag{22}$$
This can be seen in Figures 6 and 7, where the affected array elements are marked with cyan font color. After that, a parallel prefix sum is performed on T^(i+1), and the evaluation moves on to the next layer.
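The matrix bookkeeping above (Γ, Γ', Φ and T) exists to keep the search data-parallel; the set of markers it ultimately selects can be stated much more compactly in sequential form. The following sketch, using the same hypothetical list-of-arrays representation as the earlier sketches, computes the surviving marker positions of Eqs. (14)-(22) without the parallel packing:

```python
def interval_candidates(sample, rho, Lam, alpha, beta):
    """Sequential equivalent of Eqs. (14)-(22): keep every marker within
    +/- rho[i] of the sample in layer i, and expand the positively polled
    markers into their groups in the next layer."""
    N = len(sample)
    candidates = list(range(len(Lam[0])))          # root layer: poll all markers
    for i in range(N):
        hits = [h for h in candidates
                if abs(Lam[i][h] - sample[i]) <= rho[i]]   # interval poll
        candidates = [k for h in hits                      # expand the groups
                      for k in range(alpha[i + 1][h],
                                     alpha[i + 1][h] + beta[i + 1][h])]
        if not candidates:
            break                                  # dead end in this layer
    return candidates                              # positions in the class layer
```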
Figure 8
The last step of the interval-based inference algorithm
Fig. 8 shows the last step of the evaluation, when the last layer is reached. A class collector array κ is constructed (of size P × k, where k is the number of classes) and, using T^(N), the occurrences are accumulated in κ for each j ∈ [0, φ_p − 1], ∀p ∈ [0, P−1]:

$$\kappa_{p,c} = \sum_{j=0}^{\varphi_p - 1} \Theta_{T^{(N)}_{p,j}} \cdot \left[\Lambda^{(N)}_{T^{(N)}_{p,j}} = c\right] \tag{23}$$

where [·] is 1 if the condition holds and 0 otherwise, i.e. each surviving marker adds its occurrence count to the collector of its own class c.
Finally, the output array y is calculated for all p:

$$y_p = \begin{cases} \operatorname{argmax}_{\forall j}\, \kappa_{p,j}, & \text{if } \varphi_p > 0 \\ \text{default}, & \text{otherwise} \end{cases} \tag{24}$$
In Fig. 8, for the first sample (p=0) the class distribution of A, B and C is 3/7, 2/7
and 2/7, respectively, thus the output is A. For the third sample (p=2) the search
has reached a dead end at layer #1, thus, the default class (A) is returned.
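The final step (Eqs. (23)-(24)) then reduces the surviving class-layer positions to a single label. Continuing the sketch above (interval_candidates and the hypothetical structure arrays):

```python
def interval_inference(sample, rho, Lam, alpha, beta, Theta, default):
    """Sketch of Eqs. (23)-(24): accumulate the occurrence counts per class
    over the surviving markers; fall back to the default (overall most
    frequent) class when the search has hit a dead end."""
    cand = interval_candidates(sample, rho, Lam, alpha, beta)
    if not cand:                             # phi_p = 0 in Eq. (24)
        return default
    kappa = {}                               # class collector, Eq. (23)
    for z in cand:
        label = Lam[len(sample)][z]
        kappa[label] = kappa.get(label, 0) + Theta[z]
    return max(kappa, key=kappa.get)         # argmax of Eq. (24)
```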
3 Performance Evaluation

3.1 Experimental Results
The proposed new inference method has been tested on two real-life benchmark problems from the UCI data repository [15] that are very commonly used to test the classification performance of machine learning methods. The implementation has been done on an average PC (Intel® Core™ i5-4590 CPU @ 3.30 GHz, 16 GB RAM), using CUDA v9.2 and Thrust v1.9 [16].
In the first set of experiments, the Wisconsin Breast Cancer (WBC) [17] dataset is used to compare the classification performance of the proximity-based inference (PBI) method and the new interval-based inference (IBI) method. The dataset consists of N=9 attributes and P=500 samples, which have been separated into a training and a testing dataset in various training-to-testing ratios (TTRs), from 5:95% (i.e. 5% of the 500 samples are used for training and 95% for testing) to 95:5% (vice versa). The parSIT is trained for each TTR and the same trained structure is used for both the PBI and IBI measurements.
Table 1
The calculation of classification measures: recall, precision and balanced accuracy

Measure              Formula
Recall               TP / (TP + FN)
Precision            TP / (TP + FP)
Balanced accuracy    ( TP/(TP+FN) + TN/(TN+FP) ) / K
To quantify the performance, the recall, precision [18] and balanced accuracy [19] rates have been measured. Table 1 summarizes the formulas from which these values are derived, where K is the number of classes (K=2 in the case of the WBC dataset). The values are calculated from the true positive (TP), true negative (TN), false positive (FP) and false negative (FN) counts of the inference. For each class, the recall ratio shows how many instances of a given class are positively identified, while the precision ratio shows how many of all positive claims are actually true. The balanced accuracy ratio shows how well the classifier can identify both the positive and the negative samples.
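For reference, the measures of Table 1 translate directly to code (a trivial sketch; TP, TN, FP and FN are the per-class counts and K is the number of classes):

```python
def classification_measures(TP, TN, FP, FN, K):
    """Recall, precision and balanced accuracy as defined in Table 1."""
    recall = TP / (TP + FN)
    precision = TP / (TP + FP)
    balanced_accuracy = (TP / (TP + FN) + TN / (TN + FP)) / K
    return recall, precision, balanced_accuracy
```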
The inference results are compared in Figs. 9-11. For the interval-based inference, a ρ = 15% range value is used. As can be seen in Fig. 9, the recall ratio of the proximity-based inference peaks around 96%, then slowly declines to 77% as the training data ratio decreases (relative to the testing data size). The reason for this is the "narrow" search approach the PBI method uses in the problem space, only considering one value for each given attribute. The proposed interval-based method, however, is more stable, providing a 94-97.4% recall rate even with less training data, since it expands the search to a wider region of the problem space and thus can better utilize the same trained structure.
Figure 9
Comparison of the recall rates of the two inference methods using the WBC dataset
As Fig. 10 shows, their precision ratios are roughly the same (90-96%), although for lower TTRs (below 15:85%) the original PBI method performs better, implying that the higher recall rate comes at the cost of a lower precision rate.
In terms of balanced accuracy (Fig. 11), we can also see that the new IBI method
provides a more stable (>95%) performance for the considered TTRs, while the
PBI shows a slow decline for decreasing training set cardinalities.
Figure 10
Comparison of the precision rates of the two inference methods using the WBC dataset
Figure 11
Comparison of the balanced accuracy rates of the two inference methods using the WBC dataset
Table 2 summarizes the results, breaking down the considered TTR spectrum into
5 intervals (from left to right in the figures): very high (the cases where the
training to testing set ratios are more than 4:1 (80:20), i.e., there are more than 4
times as many training samples as testing samples), high (4:1 – 1.5:1), moderate
(1.5:1 – 1:1.5), low (1:1.5 – 1:4) and very low (less than 1:4). The recall, precision
and balanced accuracy rates have been averaged in these intervals, which can be
seen in the table. The difference (∆) in percentage between the IBI and PBI
measures is also shown (marked with bold text), which indicates how the new IBI
algorithm really performs compared to the PBI method.
Table 2
Comparison of the recall, precision and balanced accuracy rates using the WBC dataset

TTR (%)    |   Recall (%)        |   Precision (%)     |   Balanced accuracy (%)
           |   PBI   IBI    ∆    |   PBI   IBI    ∆    |   PBI   IBI    ∆
Very High  |   94.2  97.1   2.9  |   92.7  94.1   1.4  |   95.0  96.8   1.7
High       |   93.2  96.7   3.5  |   93.3  92.9  -0.3  |   92.3  93.6   1.2
Moderate   |   90.8  97.0   6.2  |   92.3  93.6   1.2  |   93.3  96.6   3.4
Low        |   88.5  97.5   9.0  |   91.9  91.5  -0.4  |   92.1  96.2   4.1
Very Low   |   82.6  98.6  16.0  |   91.1  88.3  -2.8  |   88.9  95.6   6.7
As can be seen, the precision is roughly the same (with very little difference) for very high to moderate TTRs, while for lower TTRs the precision of the IBI method is worse than that of the PBI method. On the other hand, a steady increase in favor of the IBI method can be seen in the recall and balanced accuracy differences as the TTRs decrease.
The experiments have been done on a multi-class problem as well: the Iris dataset, which consists of P=150 samples, N=4 attributes and K=3 classes. The results are counted as follows: for each class j, if the classifier marks a given input as part of the class, it is counted as a true positive if correct, and a false positive otherwise. Similarly, if the inference marks the sample as not being part of the class, it counts as a true negative if correct, and a false negative if not. The recall and precision rates are averaged over all classes.
Figs. 12-14 show the results of the classification using the same performance measures. As can be seen, the multiclass problem was much harder for the proximity-based inference method, exactly for the reason outlined in the Introduction. The new inference method, however, provides not only more stable but also much higher rates for all three performance measures.
Figure 12
Comparison of the recall rates of the two inference methods using the Iris dataset
Compared to the 2-class case of the previous experiment, the IBI method shows a decrease in recall rate for lower TTRs, but it still generally provides a better recall rate than the PBI method by at least 15 percentage points, as Fig. 12 indicates.
The precision rate of the IBI, on the other hand, is more stable in comparison to
that of the PBI, providing a rate of ~88% for all the considered TTRs while the
precision rate of the PBI is gradually decreasing with the TTR. This implies that
the IBI method is much better suited for multiclass problems.
Figure 13
Comparison of the precision rates of the two inference methods using the Iris dataset
The balanced accuracy rate (Fig. 14) shows a very slow decline with the TTRs for
the IBI method, but still provides an at least 80% rate for the lowest TTRs, while
the PBI method only provides a ~70% balanced accuracy rate for the same.
In general, the IBI outperforms the PBI method by 10-15%.
Figure 14
Comparison of the balanced accuracy rates of the two inference methods using the Iris dataset
Table 3 summarizes the results (averaged over TTR intervals) for the Iris dataset, in the same way as Table 2 of the previous experiment. According to the results, the recall and precision rates of the PBI method are roughly the same, ~70% on average for very high to moderate TTRs, decreasing to ~60% for lower TTRs. The recall rate of the IBI method slowly decreases from ~90% to ~80% over the TTR spectrum, while its precision rate stays around 88%, meaning that the classes it marks as positive hits are correct the majority of the time. The balanced accuracy of the PBI is relatively stable at ~70-77.5% over the TTR spectrum (with a very slow decrease for lower intervals), while that of the IBI method slowly decreases from ~92.8% to ~85.2%.
Table 3
Comparison of the recall, precision and balanced accuracy rates using the Iris dataset

TTR (%)    |   Recall (%)        |   Precision (%)     |   Balanced accuracy (%)
           |   PBI   IBI    ∆    |   PBI   IBI    ∆    |   PBI   IBI    ∆
Very High  |   69.7  90.5  20.8  |   69.5  88.9  19.4  |   76.1  92.8  16.7
High       |   71.4  88.7  17.3  |   70.9  90.0  19.0  |   70.0  88.8  18.8
Moderate   |   70.1  86.7  16.7  |   70.0  88.8  18.8  |   77.5  89.9  12.4
Low        |   67.5  84.8  17.3  |   67.5  88.2  20.7  |   75.7  88.5  12.9
Very Low   |   61.6  80.4  18.8  |   61.2  87.5  26.3  |   70.0  85.2  15.2
The effect of the range parameter ρ on the classification performance has also been examined. Fig. 15 shows the recall, precision and balanced accuracy rates of the interval-based inference on the Iris dataset (using 70% of the samples for training and 30% for testing). As can be seen, at ρ = 5% all the performance measures are at their maximum, and they maintain a high value until around 15%, where a steady decline begins. The recall decreases to 33%, which is expected for a 3-class problem, as the covered interval becomes large enough to cover the whole domain, thus only returning the default class for any given input. The balanced accuracy falls to 50%, while the precision rises back to ~75% for higher ρ values.
Figure 15
Performance measure analysis of using different range sizes on the Iris dataset
Remark: For large range values, the ~75% precision rate is caused by the way the measure is calculated, i.e. taking the ratio between the true positive findings and all positive claims (TP+FP). If there are no positive claims for a given class at all, then the recall for that class is 0%, while the precision is 100% (since none of the positive claims are wrong). In this case, the classifier only returns the default class, which means 100% precision for the two other classes and only 25-33% precision for the default class, which makes the average approximately 75%.
Interestingly, for the WBC dataset, ρ = 25% provided the best results, even though the performance for the Iris dataset peaked at ρ = 5%, which shows that the optimal ρ value primarily depends on the given data. Thus, it is recommended that for any given problem, the inference be tried with different values between 5% and 30%, to find the most suitable one.
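A minimal sketch of that recommendation, with a caller-supplied scoring routine (the function names and interface are hypothetical, not part of the parSIT implementation):

```python
import numpy as np

def tune_rho(score_fn, train_data, percentages=(5, 10, 15, 20, 25, 30)):
    """Set each attribute's range to a percentage of its domain size and
    keep the best-scoring setting. 'score_fn' is any caller-supplied
    function mapping a rho vector to a held-out classification score
    (e.g. balanced accuracy of the interval-based inference)."""
    domain = train_data.max(axis=0) - train_data.min(axis=0)
    candidates = [domain * p / 100.0 for p in percentages]
    scores = [score_fn(rho) for rho in candidates]
    return candidates[int(np.argmax(scores))]
```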
3.2 Complexity Analysis
The computational (time) complexity of the proximity-based inference is O(N·m), where m is the average number of markers per layer. Since the new interval-based inference method uses parallel prefix sum twice in each layer, it is inherently slower than its proximity-based counterpart, at O(N·m·log2 m). The new method also requires more parallel processing units, which can limit its usability for large index arrays if the range parameter is also large. However, both methods are still only marginally dependent on the amount of input data (the number of input samples contributes to the running time mainly through the overhead of managing the processes in the parallel computing framework).
Remark: It is recommended to order the attributes so that the one with the least value variety is handled by the root layer, since both inference methods have to poll all elements of the index array in the root layer.
Conclusions
In this paper, a new inference method is presented for Parallelized Sequential Indexing Table classifiers. While the original inference method uses a proximity-based algorithm, where the inference considers only one route (a single series of attribute values closest to the input data values) through the problem space, the newly proposed algorithm does a more thorough search by regarding intervals of values for each attribute, and thus provides a more accurate classification.
The new interval-based inference method has been tested on two real-life benchmark problems that are very commonly used to test the classification performance of machine learning methods. Overall, the original proximity-based inference method has a lower computational complexity, thus it operates faster and requires fewer processing units, and according to the test results, it performs reasonably well on 2-class problems (with a balanced accuracy rate of ~88.9-95%), though less so on multiclass problems (~70-77.5%), due to the higher complexity of the problem. The proposed new inference method has a slightly higher computational complexity and thus operates more slowly than the proximity-based inference method, and it is more demanding of processing units, but in return it performs slightly better on 2-class problems (~93.6-96.8% balanced accuracy rate) and much better on multiclass problems (~85.2-92.8%), even though both use the same trained classifier.
The experiments have shown that the proposed inference method can provide good classification metrics even for multiclass problems (~80% recall, ~87.5% precision and ~85.2% balanced accuracy rates), for cases where the testing data cardinality significantly outweighs that of the training data, meaning that the parSIT classifier is a reasonable choice as a low-complexity, fast-training and fast-performing classifier for such problems.
In future work, we will further improve the proposed inference method in order to increase its speed, and based on the classifier, we will develop new methods where the processing order of the inputs is not bound by a single ordering scheme.
Acknowledgement
Supported by the ÚNKP-19-3-IV-OE-56 New National Excellence Program of the
Ministry for Innovation and Technology.
References

[1] S. Hussein, P. Kandel, C. W. Bolan, M. B. Wallace, U. Bagci, "Lung and Pancreatic Tumor Characterization in the Deep Learning Era: Novel Supervised and Unsupervised Learning Approaches," IEEE Transactions on Medical Imaging, Vol. 38, No. 8, pp. 1777-1787

[2] S. Roy et al., "Deep Learning for Classification and Localization of COVID-19 Markers in Point-of-Care Lung Ultrasound," IEEE Transactions on Medical Imaging, Vol. 39, No. 8, pp. 2676-2687

[3] Gibson, B. Issac, L. Zhang, S. M. Jacob, "Detecting Spam Email With Machine Learning Optimized With Bio-Inspired Metaheuristic Algorithms," IEEE Access, Vol. 8, pp. 187914-187932

[4] J. Yun, J. Woo, "A Comparative Analysis of Deep Learning and Machine Learning on Detecting Movement Directions Using PIR Sensors," IEEE Internet of Things Journal, Vol. 7, No. 4, pp. 2855-2868

[5] G. Jia, H.-K. Lam, S. Ma, Z. Yang, Y. Xu, B. Xiao, "Classification of Electromyographic Hand Gesture Signals Using Modified Fuzzy C-Means Clustering and Two-Step Machine Learning Approach," IEEE Trans. on Neural Syst. and Rehabilitation Engineering, Vol. 28, No. 6, pp. 1428-1435

[6] M. Jocic, E. Pap, A. Szakál, D. Obradovic, Z. Konjovic, "Managing Big Data Using Fuzzy Sets by Directed Graph Node Similarity," Acta Polytechnica Hungarica, Vol. 14, No. 2, 2017, pp. 183-200

[7] R. Spir, K. Mikula, N. Peyrieras, "Parallelization and validation of algorithms for Zebrafish cell lineage tree reconstruction from big 4D image data," Acta Polytechnica Hungarica, Vol. 14, No. 5, 2017, pp. 65-84

[8] A. Vukmirović, Z. Rajnai, M. Radojičić, J. Vukmirović, M. J. Milenković, "Infrastructural Model for the Healthcare System based on Emerging Technologies," Acta Polytechnica Hungarica, Vol. 15, No. 2, 2018, pp. 33-48

[9] A. R. Várkonyi-Kóczy, B. Tusor, J. T. Tóth, "A Multi-Attribute Classification Method to Solve the Problem of Dimensionality," in Proc. of the 15th Int. Conf. on Global Research and Education in Int. Sys. (Interacademia'2016), Warsaw, Poland, 2016, pp. PS39-1–PS39-6

[10] B. Tusor, A. R. Várkonyi-Kóczy, J. T. Tóth, "Active Problem Workspace Reduction with a Fast Fuzzy Classifier for Real-Time Applications," IEEE International Conference on Systems, Man, and Cybernetics, Budapest, Hungary, October 9-12, 2016, pp. 4423-4428, ISBN: 978-1-5090-1819-2

[11] B. D. Zarit, B. J. Super, F. K. H. Quek, "Comparison of five color models in skin pixel classification," in Proc. of the International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, Corfu, Greece, Sep. 26-27, 1999, pp. 58-63

[12] B. Tusor, J. T. Tóth, A. R. Várkonyi-Kóczy, "Parallelized Sequential Indexing Tables for Fast High-Volume Data Processing," 2020 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), Dubrovnik, Croatia, 2020, pp. 1-6

[13] B. Tusor, A. R. Várkonyi-Kóczy, "Memory Efficient Exact and Approximate Functional Dependency Extraction with ParSIT," 2020 IEEE 24th International Conference on Intelligent Engineering Systems (INES), Reykjavík, Iceland, 2020, pp. 133-138

[14] M. Safari, W. Oortwijn, S. Joosten, M. Huisman, "Formal Verification of Parallel Prefix Sum," in: Lee R., Jha S., Mavridou A. (eds) NASA Formal Methods (NFM 2020), Lecture Notes in Computer Science, Vol. 12229, Springer, Cham, 2020

[15] D. Dua, C. Graff, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science, 2019

[16] A. V. George, S. Manoj, S. R. Gupte, S. Mitra, S. Sarkar, "Thrust++: Extending Thrust Framework for Better Abstraction and Performance," 2017 IEEE 24th International Conference on High Performance Computing (HiPC), Jaipur, 2017, pp. 368-377, doi: 10.1109/HiPC.2017.00049

[17] O. L. Mangasarian, W. H. Wolberg, "Cancer diagnosis via linear programming," SIAM News, Vol. 23, No. 5, September 1990, pp. 1 & 18

[18] M. Buckland, F. Gey, "The relationship between recall and precision," Journal of the American Soc. for Inf. Science, Vol. 45, No. 5, 1994, pp. 12-19

[19] V. García, R. A. Mollineda, J. S. Sánchez, "Index of Balanced Accuracy: A Performance Measure for Skewed Class Distributions," in: Pattern Recognition and Image Analysis, Springer Berlin Heidelberg, 2009, pp. 441-448