Bagging Using Instance-Level Difficulty For Multi-Class Imbalanced Big Data Classification On Spark
Abstract—Most machine learning methods work under the assumption that classes have a roughly balanced number of instances. However, in many real-life problems some types of instances appear predominantly more frequently than others, which causes a bias towards the majority class during classifier training. This becomes even more challenging when dealing with multiple classes, where relationships between them are not easily defined. Learning from multi-class imbalanced data has not been widely considered in the context of big data mining, despite the fact that this learning difficulty appears frequently in this domain. In this paper, we address this challenge by proposing a comprehensive ensemble-based framework. We propose to analyze each class to extract instance-level characteristics describing its difficulty level. We embed this information into the existing UnderBagging framework. Our ensemble samples instances with probabilities proportional to their difficulty levels. This allows us to focus the learning process on the most difficult instances, better capturing the properties of multi-class imbalanced problems. We implemented our framework on Apache Spark to allow for high-performance computing over big data sets. Our experimental study shows that taking instance-level difficulty into account leads to training significantly more accurate ensembles.

Index Terms—machine learning, ensemble learning, imbalanced data, big data, instance-level difficulty

This research was partially supported by the Amazon AWS Machine Learning Research award.

I. INTRODUCTION

Despite more than two decades of research, learning from imbalanced data is still considered one of the major challenges in the machine learning domain [1], [2]. Most existing classifier training procedures are based on the assumption that a uniform penalty is suffered for an error on all instances. Predictive accuracy and 0-1 loss functions are the most common examples of such metrics used to determine the effectiveness of a training procedure (and related procedures, e.g., parameter selection). However, in many real-life problems the distribution of instances among classes is far from balanced. Some objects appear much more frequently than others, leading to imbalanced datasets. This creates a bias during the training procedure: due to a uniform treatment of all the instances, the classifier will favor the majority class. At the same time, in most cases the minority class is the more interesting one, usually corresponding to important, rare, or dangerous cases. Therefore, special techniques are needed to alleviate the mentioned bias and achieve good recognition rates on the minority class. At the same time, this cannot come at the cost of a performance drop on the majority class, as we should aim for a balanced and well-rounded classifier [3]. Additionally, recent studies show that the imbalance ratio between classes is not the sole reason behind unsatisfactory classifier performance [2]. Data-level characteristics play an at least equally important role, including such problems as small training sets, the existence of class sub-structures, or instance-level difficulties.

Traditionally, imbalanced problems have been considered for two-class datasets, where minority and majority classes are easily defined [1]. In recent years, we can see an increasing interest in analyzing the imbalanced problem in multi-class scenarios, as many contemporary real-life problems suffer from it. When dealing with multiple classes, one cannot simply use the existing algorithms for binary problems. Relationships among classes are no longer easily defined, and a given class can behave like a minority and a majority class at the same time, depending on the context. Binarization of multi-class data (i.e., transforming a multi-class problem into a set of two-class ones) is not an efficient solution, as we lose the information on intra-class relationships [4]. Therefore, algorithms that are capable of directly working on multi-class data are more desirable. The issue of data-level characteristics is as important here as in binary problems. However, there are but few algorithms dedicated to multi-class imbalanced scenarios that take them into account [5], [6].

While imbalanced problems originated from small-scale datasets, the recent advent of big data has significantly affected this domain [7]. Researchers need to deal with the ever-growing size, complexity, and often non-stationary nature of modern datasets that are still plagued by skewed class distributions [8]–[10]. As the size of data makes the usage of standard computers unfeasible, we have observed a rise of high-performance computing solutions.
When applying such solutions to imbalanced big data, one must be aware of their potential limitations. A study on the effects of different sampling techniques on Random Forest showed that SMOTE achieves a highly unsatisfactory performance in the MapReduce environment [19]. This is caused by the lack of control over data partitioning, which in turn creates chunks of data with corrupted spatial relationships. When SMOTE is used in each map with such an unreliable neighborhood, it tends to create artificial instances in incorrect regions, leading to increased overlapping among classes and shifting the true class distributions. Random oversampling and undersampling perform significantly better on CPU clusters. Success stories of MapReduce usage for imbalanced big data include efficient and scalable feature selection [20], rule induction [21], and data sampling [22].

The mentioned techniques were developed for two-class imbalanced big data. To the best of our knowledge, there exist no algorithms dedicated specifically to large-scale multi-class imbalanced data, neither for GPUs nor for CPU clusters [1], [2].

III. INSTANCE-LEVEL DIFFICULTY

Most of the early works on imbalanced data pointed to the disproportion among classes as the main source of learning difficulty. However, one may easily come up with a dummy problem of two classes with an extremely high imbalance ratio but with clear separation. In such a case, even a simple classifier will not be affected by a bias. Therefore, there must exist additional characteristics of imbalanced datasets that cause degradation of a classifier's performance [6], [23]. Recent surveys on this topic identify the following data-level difficulties as the most challenging ones [1], [2].

Small training set size. While the overall size of the training set may be sufficient, the high imbalance ratio may negatively impact the minority class. There simply may not be enough minority instances to estimate a proper distribution of this class and train a competent discrimination model. Additionally, using standard oversampling techniques on extremely small minority classes may lead to a lack of diversity among artificial instances and thus be of no aid to the classification process. Recently, oversampling techniques have been developed that use information from the majority class to alleviate the small size of the minority class. In the context of big data classification, this problem is less likely to appear, as we expect to deal with large data samples.

Class overlapping. Overlapping regions between classes have always been a source of learning difficulties. When combined with skewed class distributions, these regions are the ones where bias towards the majority class is most likely to appear. This problem becomes even more challenging in a multi-class scenario, where more than two classes may be present in a given region. It could be further amplified in big data scenarios, where overlapping regions are likely to be more frequent due to the massive data size.

Small disjuncts. A minority class may not have a unimodal distribution. Instead, it may consist of sub-structures, clusters, or disjuncts. This poses an additional difficulty for a classifier, as it needs to learn a more complex decision boundary, thus being potentially more prone to overfitting. Additionally, any data-level techniques must be aware of these existing sub-structures in order not to mix instances from them when performing, e.g., sampling (as this would compromise the spatial coherence of the minority class). This problem can be further enhanced in the big data context, as by using the MapReduce paradigm we are violating spatial relationships among instances (we lack control over how these partitions are created) [19].

Difficult individual instances. Difficulties with learning may also be present at the instance level [6]. Specific instances may corrupt the underlying distributions, further enhancing the bias towards the majority class. The most popular cases include outliers and noisy instances, which should be detected and handled appropriately. Furthermore, by utilizing the information about difficult instances, one may enrich the sampling and classifier training procedures.

In this work, we will focus on instance-level difficulties. Previously, we have shown that utilizing information about individual instances can significantly improve data-level techniques for both two-class [24] and multi-class problems [6]. We aim at using this information in a big data context, in order to gain additional useful insight into the classified datasets. For the purpose of this paper, we will use an instance-level difficulty taxonomy based on neighborhood analysis (see Figure 1 for visualization; an illustrative labeling sketch follows the list):

Fig. 1: Visualization of instance-level difficulty types for a multi-class imbalanced dataset (four panels showing safe, borderline, rare, and outlier examples)

• Safe instances. These are instances that lie far from other classes and form uniform groups. They usually pose little, if any, learning difficulty and can be easily discriminated.
• Borderline instances. These are instances that lie close to a border between classes. Due to their similar characteristics to other instances, they are more likely to be misclassified. Borderline instances from a minority class will usually be subject to classification bias, as majority instances are likely to be predominant in this part of the decision space.
• Rare instances. These are instances that form small disjuncts far from the main class structure. Due to their small size, they are likely to be ignored by a classifier. Very often they overlap with the majority class, making their proper classification even more challenging.
• Outliers. These are singular instances located far from the main class structure. They can be either noise or valuable rare information. There is no way of determining this without analyzing the context of the considered application. They should be discussed with the domain expert in order to determine the proper course of action. Noisy outliers should be removed, while rare information should be emphasized.
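The following Scala sketch illustrates how such a neighborhood analysis can be mapped onto the four types. The concrete thresholds (for k = 5 neighbors) are an assumption borrowed from the wider instance-difficulty literature; this paper does not fix them here.

// Minimal sketch of mapping a neighborhood analysis onto the four difficulty
// types. The thresholds below (for k = 5 neighbors) are an assumption taken
// from the instance-difficulty literature, not values fixed by this paper.
object DifficultyType extends Enumeration {
  val Safe, Borderline, Rare, Outlier = Value
}

/** Assigns a difficulty type given how many of the k nearest
  * neighbors share the instance's class label. */
def difficultyType(sameClassNeighbors: Int, k: Int = 5): DifficultyType.Value = {
  require(k == 5, "the thresholds below assume k = 5")
  require(sameClassNeighbors >= 0 && sameClassNeighbors <= k)
  sameClassNeighbors match {
    case 4 | 5 => DifficultyType.Safe       // uniform neighborhood
    case 2 | 3 => DifficultyType.Borderline // mixed neighborhood near a class border
    case 1     => DifficultyType.Rare       // almost isolated small disjunct
    case 0     => DifficultyType.Outlier    // no same-class neighbors at all
  }
}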
IV. BAGGING WITH INSTANCE-LEVEL DIFFICULTY

In this section, we present the details of our proposed ensemble framework for multi-class imbalanced big data classification. In the following subsections, we describe how to combine ensemble learning with information about instance-level difficulty to form our proposed algorithm UnderBagging+, as well as how to implement our framework on Apache Spark for high-performance computing.

A. Combining UnderBagging with instance-level difficulty

Bagging-based algorithms are very popular in imbalanced domains. Many studies show that they are especially efficient when combined with undersampling in each bag, for both two-class [25] and multi-class [26] problems. However, all existing algorithms based on the idea of UnderBagging assume that all instances in each class are equally important. We propose to include the information about instance-level characteristics discussed in Section III in the undersampling process conducted in each bag. In order to do so, we need a way of calculating the difficulty of each instance. This is usually done by analyzing the instance neighborhood and assigning a given level based on its uniformness [27]. The more contaminated the neighborhood with instances from other classes, the higher the difficulty level assigned to the analyzed instance. One of four labels (safe / borderline / rare / outlier) is selected, and this information is further used by the sampling or the learning algorithm.

Calculating instance-level difficulty. Previous works used a discrete approach by either selecting or removing all instances with a given difficulty label from the training set. This required a time-consuming manual tuning of all possible combinations of instance types. In this work, we propose to use a probabilistic approach, where the difficulty of each instance will affect the chances of it being selected for a given bag. To this end, let us define the function for calculating the difficulty of each instance based on the number of its neighbors coming from other classes:

ID(x, k) = \frac{1}{k} \sum_{i=1}^{k} \left[ x_i \in kNN(x) \wedge label(x_i) \neq label(x) \right],  (1)

where k is the number of neighbors used for the difficulty analysis and kNN stands for the k nearest neighbors. This function takes values in [0, 1] and assigns higher values to instances that lie in a more contaminated neighborhood (i.e., with a higher number of neighbors originating from other classes).
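As an illustration, a minimal in-memory Scala version of Eq. (1) could look as follows. The Euclidean distance and the brute-force neighbor search are assumptions of this sketch; at scale we rely on the spark-knn package instead (cf. Section IV-B).

// Illustrative, in-memory computation of the instance difficulty ID(x, k)
// from Eq. (1): the fraction of the k nearest neighbors carrying a different
// class label. Euclidean distance and brute-force search are assumptions.
case class Instance(features: Array[Double], label: Int)

def euclidean(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

def instanceDifficulty(x: Instance, data: Seq[Instance], k: Int): Double = {
  val neighbors = data
    .filter(_ ne x)                                    // exclude the instance itself
    .sortBy(n => euclidean(n.features, x.features))    // nearest first
    .take(k)
  neighbors.count(_.label != x.label).toDouble / k     // fraction from other classes
}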
Modification of instance selection probability for UnderBagging. Having established a way to calculate the instance difficulty, we must utilize it in the form of a selection likelihood during the undersampling in each bag. As we deal with a multi-class problem, we must calculate the likelihood for each m-th class independently, in order to conduct our proposed instance difficulty-based undersampling on each of them. The higher the difficulty of a given instance, the more valuable it is to a classifier, as it represents a region where bias towards the majority class may occur. Therefore, we want to enhance the importance of difficult instances during the training phase. We achieve this by giving difficult instances a higher likelihood of being selected in each bag:

f_m(x) = \frac{1}{n_m} + ID(x, k),  (2)

where n_m is the number of instances in the m-th class. Using this approach, we increase the likelihood of preserving difficult instances in each bag. At the same time, we give safe instances a chance to be selected, achieving a mix of different types of instances. This should lead to the creation of a more diverse set of classifiers, as compared to previous works that suggested discarding the entire set of a given instance type.

The likelihood is then transformed into a normalized probability function for selecting a given instance of the m-th class:

p_m(x) = \frac{f_m(x)}{\sum_{i=1}^{n_m} f_m(x_i)}.  (3)

This normalization ensures that the proposed probability function p_m is a proper probability distribution. This is necessary for using it during Bagging instance selection when choosing with replacement.
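A minimal Scala sketch of Eqs. (2)-(3), turning the per-instance difficulties of one class into normalized selection probabilities and drawing a bag with replacement, could look as follows. The function names and the use of plain Scala collections are assumptions of this illustration.

// Sketch of Eqs. (2)-(3): `difficulties` holds ID(x, k) for every instance
// of the m-th class; the result is a proper probability distribution.
def selectionProbabilities(difficulties: Seq[Double]): Seq[Double] = {
  val n = difficulties.size
  val likelihood = difficulties.map(id => 1.0 / n + id) // f_m(x) = 1/n_m + ID(x, k)
  val total = likelihood.sum
  likelihood.map(_ / total)                             // p_m(x), sums to 1
}

/** Draws `bagSize` instance indices with replacement according to p_m. */
def sampleBag(probabilities: Seq[Double], bagSize: Int,
              rng: scala.util.Random = new scala.util.Random): Seq[Int] = {
  val cumulative = probabilities.scanLeft(0.0)(_ + _).tail // CDF for roulette-wheel draws
  Seq.fill(bagSize) {
    val u = rng.nextDouble()
    val i = cumulative.indexWhere(u <= _)
    if (i >= 0) i else probabilities.size - 1 // guard against floating-point round-off
  }
}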
Other characteristics of UnderBagging+. The proposed model differs from existing Bagging-based solutions for imbalanced data by modifying its sampling probability according to the instance difficulty level and by working directly on multi-class problems. The size of each bag is defined as a user-set parameter, as there exist two approaches to this issue: (i) selecting the bag size equal to the size of the class with the lowest number of instances; and (ii) selecting the bag size smaller than the size of the class with the lowest number of instances. As for the prediction phase, UnderBagging+ uses a standard majority voting scheme.

B. Implementation on Apache Spark

UnderBagging+ was developed to use in-depth information about the instances in each class. In order to make it applicable to massive imbalanced datasets, we propose an efficient implementation in the Apache Spark environment.
Spark set-up. Spark uses the MapReduce approach (see Figure 2), a popular distributed computing model that represents data as key/value pairs. It consists of two main phases: map and reduce. The role of the map phase is to take the input data and split it among a given number of computing nodes (workers). Each worker performs a given task in parallel, returning intermediate results. The role of the reduce phase is to combine these intermediate results into the final output. Spark keeps the results between map/reduce steps in main memory when possible. Spark uses a driver node to create a Spark Context object, which is connected to a cluster manager such as Mesos or YARN. The Spark Context is responsible for requesting resources from the cluster manager and running executors on computing nodes. Figure 3 shows the Spark workflow, with solid lines representing data passing.

Fig. 2: The MapReduce architecture showing the three phases: map, shuffle, reduce

Fig. 3: The Spark cluster architecture for resource allocation and data transfer
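As a toy illustration of the map/reduce flow described above, the snippet below counts the instances per class, which is also a natural first step when diagnosing class imbalance. It is a minimal sketch assuming a local SparkSession; the label values are placeholders.

import org.apache.spark.sql.SparkSession

// Minimal sketch of the map/reduce flow: count instances per class.
// The local master and the toy label data are assumptions of this example.
val spark = SparkSession.builder()
  .appName("class-counts")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

val labels = sc.parallelize(Seq(0, 0, 0, 1, 2, 2)) // stand-in for real class labels
val classCounts = labels
  .map(label => (label, 1L)) // map phase: emit one key/value pair per instance
  .reduceByKey(_ + _)        // reduce phase: combine the counts for each class key
classCounts.collect().foreach { case (c, n) => println(s"class $c: $n instances") }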
Undersampling on Spark. The proposed modified undersampling that takes into account the instance-level difficulty is the backbone of our system. It will be executed by each worker on each node multiple times to form the base classifiers of the UnderBagging+ ensemble. Additionally, it must work on each class independently and take as an input parameter the size of each class after the undersampling procedure. We have written efficient undersampling code in Scala that takes advantage of the Spark architecture, in order to offer scalability to massive datasets. Additionally, as UnderBagging+ uses sampling with replacement, we join the selected instances into a DataFrame structure to form a balanced dataset. Details of the proposed undersampling implementation are given in pseudocode form in Algorithm 1; an illustrative Scala counterpart follows.

Algorithm 1 Perform target class undersampling
procedure TargetClassUndersample(DF, NumOfClasses, TargetCount)
    sampledDataFrames ← DataFrame[1 to NumOfClasses]
    for c = 1 to NumOfClasses do
        DF_c ← DF.filter(x → x.class == c)
        if DF_c.count() > TargetCount then
            sampledDataFrames_c ← DF_c.sample(TargetCount)
        else
            sampledDataFrames_c ← DF_c
        end if
    end for
    return sampledDataFrames.union()
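A runnable Scala approximation of Algorithm 1 is sketched below. Note that Spark's DataFrame.sample draws by fraction rather than by exact count, so the resulting bag sizes are approximate; the "class" column name is likewise an assumption of the sketch.

import org.apache.spark.sql.{DataFrame, functions => F}

// Scala counterpart of Algorithm 1, assuming a DataFrame with a numeric
// "class" column. DataFrame.sample takes a fraction, so sizes are approximate.
def targetClassUndersample(df: DataFrame, numOfClasses: Int,
                           targetCount: Long): DataFrame = {
  val perClass = (1 to numOfClasses).map { c =>
    val dfC = df.filter(F.col("class") === c)         // isolate the c-th class
    val n = dfC.count()
    if (n > targetCount)
      dfC.sample(withReplacement = false, targetCount.toDouble / n)
    else
      dfC                                             // class already small enough
  }
  perClass.reduce(_ union _) // merge the per-class samples into one balanced DataFrame
}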
UnderBagging+ on Spark. A separate UnderBagging+ ensemble is constructed in each node of the Spark cluster, and the intermediate results from each ensemble are then combined during the reduce phase to give the final model. Each node will already have obtained a subset of the data from a given map. Therefore, the calculation of instance difficulties according to Eq. 1 is done in each node independently, in order to capture local data characteristics. This leads to the creation of locally specialized classifiers in each node and benefits the overall diversity of the proposed UnderBagging+ architecture. As the instance difficulty calculation requires an analysis of the neighborhood of each instance, we use an efficient k-NN implementation from the spark-knn package.
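For completeness, the sketch below shows the standard majority voting used in the prediction phase to combine the base classifiers' outputs; tie-breaking by the lower class id is an assumption, as the voting scheme is otherwise left standard.

// Standard majority vote over the base classifiers' predictions for one
// instance. Ties are broken by the lower class id (an assumption).
def majorityVote(votes: Seq[Int]): Int = {
  require(votes.nonEmpty, "at least one base classifier must vote")
  votes.groupBy(identity)
    .map { case (cls, vs) => (cls, vs.size) }
    .toSeq
    .sortBy { case (cls, count) => (-count, cls) } // most votes first, low id on ties
    .head._1
}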
V. EXPERIMENTAL STUDY

In order to evaluate the effectiveness of UnderBagging+, we have designed an in-depth experimental study that aims at answering the following research questions:
• RQ1: Does UnderBagging+ display better performance than the existing methods for multi-class imbalanced big data available in Spark?
• RQ2: Does the modification of the sampling used in UnderBagging+ lead to improved results over standard Bagging?
• RQ3: How does the improved sampling method influence the size and diversity of UnderBagging+?
In the following sections, we present the experimental set-up and the obtained results, as well as discuss them.

A. Data benchmarks

Learning from imbalanced big data suffers from the lack of proper benchmark repositories. While there are a couple of popularly used large-scale two-class imbalanced datasets [28], multi-class imbalanced big datasets are still not standardized or widely available.
For the purpose of this study, we have selected and prepared five real-world datasets that have the following important properties: (i) they are originally multi-class; (ii) they have a large volume; and (iii) they are characterized by a significant imbalance among classes and various data-level difficulties. Their details are given in Table I. Please note that we have calculated the ratios of the different types of instances in each class using all data. However, during the actual experiments the instance-level difficulty is calculated only using the training data and independently in each computational node of the Spark cluster. Therefore, we report the global properties of each benchmark, but the runtime local characteristics may differ.

Performance is assessed with the following multi-class metrics, where C is the number of classes, tp_i, tn_i, fp_i, fn_i denote the true/false positives and negatives of the i-th class, and t_i and p_i stand for the numbers of instances truly belonging to and predicted as the i-th class:

AvAcc = \frac{1}{C} \sum_{i=1}^{C} \frac{tp_i + tn_i}{tp_i + tn_i + fp_i + fn_i}

RecM = \frac{1}{C} \sum_{i=1}^{C} recall_i

PrecM = \frac{1}{C} \sum_{i=1}^{C} precision_i

RecU = \frac{\sum_{i=1}^{C} tp_i}{\sum_{i=1}^{C} t_i}

PrecU = \frac{\sum_{i=1}^{C} tp_i}{\sum_{i=1}^{C} p_i}
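A small Scala sketch computing these five metrics from a multi-class confusion matrix could look as follows; the array-based representation of the matrix is an assumption of the illustration.

// Computes the metrics above from a confusion matrix `cm`, where cm(i)(j)
// counts instances of true class i predicted as class j (an assumed layout).
def metrics(cm: Array[Array[Long]]): Map[String, Double] = {
  val C = cm.length
  val total = cm.map(_.sum).sum.toDouble
  val tp = (0 until C).map(i => cm(i)(i).toDouble)
  val t  = (0 until C).map(i => cm(i).sum.toDouble)        // truly class i
  val p  = (0 until C).map(j => cm.map(_(j)).sum.toDouble) // predicted as class j
  val recall    = (0 until C).map(i => if (t(i) > 0) tp(i) / t(i) else 0.0)
  val precision = (0 until C).map(i => if (p(i) > 0) tp(i) / p(i) else 0.0)
  val avAcc = (0 until C).map { i =>
    val tn = total - t(i) - p(i) + tp(i) // neither truly nor predicted class i
    (tp(i) + tn) / total                 // per-class binary accuracy
  }.sum / C
  Map(
    "AvAcc" -> avAcc,
    "RecM"  -> recall.sum / C,
    "PrecM" -> precision.sum / C,
    "RecU"  -> tp.sum / t.sum,
    "PrecU" -> tp.sum / p.sum
  )
}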
TABLE I: Details about multi-class imbalanced big datasets used in experimental study.
Dataset Instances Features Classes Class distribution Safe[%] Borderline[%] Rare[%] Outliers[%]
covtype 581,012 54 7 Class 1 211840 84.97 10.43 3.03 1.57
Class 2 283301 87.62 9.33 2.08 0.97
Class 3 35754 80.81 13.96 3.41 1.82
Class 4 2747 56.86 25.48 9.79 7.86
Class 5 9493 60.32 25.18 8.57 5.93
Class 6 17367 65.80 23.40 6.85 3.94
Class 7 20510 87.25 8.37 2.85 1.54
traffic 1,378,663 26 7 Class 1 163473 22.35 32.59 22.96 22.10
Class 2 214173 25.47 31.73 21.93 20.88
Class 3 275622 34.96 32.68 17.81 14.55
Class 4 343133 39.47 33.74 16.05 10.74
Class 5 157710 24.75 32.21 21.87 21.17
Class 6 179367 27.08 35.69 20.68 16.55
Class 7 45185 22.85 19.79 19.33 38.03
seer 2,532,629 11 10 Class 1 396140 50.53 26.83 9.90 12.75
Class 2 567055 79.01 14.12 4.07 2.80
Class 3 241330 17.91 31.69 23.15 27.25
Class 4 528648 60.71 28.71 6.11 4.48
Class 5 135436 37.15 18.31 16.56 27.98
Class 6 137908 58.65 29.12 6.16 6.06
Class 7 100481 1.25 15.92 28.22 54.60
Class 8 110411 26.39 20.59 23.27 29.75
Class 9 166217 46.04 30.77 11.81 11.38
Class 10 149003 33.03 38.15 16.50 12.33
sensors 2,219,803 5 55 Class 1 43039 53.47 28.28 11.15 7.10
Class 2 46915 71.04 19.21 5.95 3.80
Class 3 46631 55.52 29.91 9.43 5.14
Class 4 43790 65.86 23.22 6.97 3.95
Class 6 35654 62.37 21.71 9.07 6.85
Class 7 55302 64.51 21.62 8.70 5.17
... ... ... ... ... ...
Class 51 25582 27.89 36.90 19.12 16.10
Class 52 28748 33.06 36.84 17.07 13.03
Class 53 2850 72.39 12.35 6.49 8.77
Class 54 2372 81.58 11.68 4.30 2.45
Class 55 4497 89.53 7.69 1.71 1.07
iot 3,000,000 115 11 Class 1 215643 99.90 0.06 0.02 0.02
Class 2 275258 99.25 0.56 0.14 0.05
Class 3 230494 99.96 0.02 0.01 0.00
Class 4 313359 99.97 0.01 0.01 0.01
Class 5 527165 99.69 0.24 0.05 0.02
Class 6 224499 99.84 0.11 0.04 0.02
Class 7 219925 99.96 0.03 0.01 0.00
Class 8 111940 99.93 0.06 0.01 0.00
Class 9 109278 99.95 0.03 0.01 0.01
Class 10 367388 99.99 0.01 0.00 0.00
Class 11 405051 99.98 0.02 0.00 0.00
Standard UnderBagging performs significantly below both versions of Random Forest. However, the proposed UnderBagging+ is capable of outperforming both versions of Random Forest in a statistically significant way. Additionally, as seen in Figure 5, it is capable of doing so while maintaining a much smaller pool of classifiers. This shows the importance of incorporating instance-level information into the ensemble training procedure and how much can be gained by a careful understanding of the underlying data properties in multi-class imbalanced big data.

Role of the improved undersampling with instance-level difficulty. The major difference between standard UnderBagging and the proposed UnderBagging+ lies in the utilized sampling function. UnderBagging assumes a uniform probability of sampling every single instance, while UnderBagging+ correlates the probability of instance selection with its difficulty level. Therefore, any differences in results between these two methods originate from this sampling part. UnderBagging+ is statistically significantly better than its counterpart, sometimes achieving gains of up to 10% on some metrics (which, considering the size of the datasets, is a non-trivial gain). Therefore, we can see that forming bags that preserve difficult instances leads to more locally specialized classifiers and thus to improved ensemble predictive power. Additionally, as the instance difficulty calculation is done independently in each computing node of Spark, we are able to overcome a potentially unfavourable distribution of instances and adapt to the local data partitioning.
TABLE II: Results for UnderBagging+ and reference methods according to nine performance metrics on five multi-class big imbalanced datasets.

Dataset   UnBag+  UnBag   RandF_un  RandF_ov  CART_un  CART_ov
AvAcc
covtype   71.18   61.45   65.11     68.67     52.10    51.04
traffic   84.33   78.39   80.00     80.05     71.03    70.13
seer      92.18   90.03   91.09     91.08     82.78    81.99
sensors   95.72   94.01   94.92     94.93     86.44    86.52
iot       99.66   98.38   99.24     98.38     95.36    95.12
RecM
covtype   33.12   14.87   17.55     16.63     10.34    10.07
traffic   36.43   30.11   31.20     31.58     26.56    26.01
seer      55.88   53.28   53.41     53.40     50.02    49.84
sensors    1.75    1.17    1.81      1.55      0.87     0.82
iot       96.88   91.48   94.56     91.98     88.27    88.44
RecU
covtype    5.46    3.18    0.00      0.00      0.00     0.00
traffic   35.18   28.42   30.47     30.56     21.67    22.15
seer      51.82   49.75   51.91     51.85     43.19    44.01
sensors    6.78    4.11    0.00      0.00      0.00     0.00
iot       96.82   93.88   95.17     90.27     87.56    82.99
PrecM
covtype   17.99   12.74   10.08      9.44      7.52     7.33
traffic   36.19   29.87   30.01     30.18     25.04    25.17
seer      60.07   53.84   55.44     55.39     51.99    50.79
sensors   12.99    8.99    2.62      2.74      1.17     1.43
iot       96.77   94.02   95.83     91.09     88.32    87.61
PrecU
covtype   12.97    8.05    6.47      6.39      4.99     5.28
traffic   37.92   34.09   30.01     30.18     26.62    27.01
seer      55.10   51.92   55.44     55.39     47.81    48.04
sensors   13.01    7.28    2.62      2.55      0.91     0.93
iot       95.83   92.48   95.83     91.09     87.71    81.98
FbM
covtype   18.37   11.94    3.62      4.11      2.07     2.11
traffic   35.61   29.99   30.61     30.76     27.82    28.03
seer      54.82   51.04   52.50     52.15     47.86    46.94
sensors    6.07    3.98    0.00      0.00      0.00     0.00
iot       96.11   94.86   95.05     90.60     87.19    83.48
FbU
covtype   24.81   19.63    6.95      6.75      2.98     2.88
traffic   36.17   32.10   30.01     30.18     23.98    24.83
seer      57.98   54.19   55.44     55.39     49.82    49.75
sensors    4.82    2.97    2.62      2.75      0.99     1.01
iot       97.82   93.99   95.83     91.09     88.49    88.72
AvFb
covtype   20.76   17.84    6.18      6.87      3.98     4.04
traffic   34.17   30.82   28.87     29.00     24.27    23.99
seer      51.88   50.37   49.35     49.28     44.88    44.64
sensors    5.15    2.99    1.65      1.65      0.63     0.64
iot       95.69   92.88   94.82     89.71     84.56    84.60
CBA
covtype   21.28   15.92    5.56      6.24      3.82     3.46
traffic   30.03   23.88   22.86     22.74     17.23    15.62
seer      50.99   43.72   41.37     41.31     30.18    29.08
sensors   12.84    6.98    1.09      1.00      0.17     0.19
iot       93.56   90.36   91.12     84.02     78.23    69.62

Fig. 4: Visualizations of Bayesian sign-rank tests for pairwise statistical comparison between UnderBagging+ and reference methods

Analysis of ensemble size and diversity of UnderBagging+. When analyzing the performance of ensemble classifiers, it is important to verify how the size of the ensemble influences its predictive power and diversity. In the context of big data analytics, it is desirable to create smaller ensembles that can perform similarly to their larger counterparts. Smaller ensembles are faster to compute and require less space in memory. Figure 5 shows that UnderBagging+ achieves its best performance with a much smaller ensemble size than Random Forest, while displaying better predictive power. This can be explained by the creation of more meaningful and less random bootstrap samples for the base classifiers, which makes it possible to cover the decision space sufficiently with a lower number of base models. Interestingly, the diversity analysis shown in Figure 6 points to the fact that the base classifiers in UnderBagging+ are less randomized. We can see a small drop in diversity compared to UnderBagging, which can be explained by the fact that UnderBagging+ forces its base classifiers to focus on difficult instances. This reduced diversity can actually be seen as a positive factor, as it leads to more stable classifiers. Both of these factors further augment the high usability of UnderBagging+ for learning from multi-class imbalanced big data.
Fig. 5: Relationship between ensemble size and the CBA performance metric (CBA [%] vs. ensemble size from 25 to 200, one panel per dataset; methods: UnBag+, UnBag, RandF_un, RandF_ov)

Fig. 6: Analysis of ensemble diversity of UnderBagging+ with respect to ensemble size (ensemble diversity vs. ensemble size from 25 to 200, one panel per dataset; methods: UnBag+, UnBag)
Recognition Systems, CORES 2019, Polanica-Zdrój, Poland, 20-22 May 2019, 2019, pp. 360–369.
[10] P. Zyblewski, P. Ksieniewicz, and M. Wozniak, "Classifier selection for highly imbalanced data streams with minority driven ensemble," in Artificial Intelligence and Soft Computing - 18th International Conference, ICAISC 2019, Zakopane, Poland, June 16-20, 2019, Proceedings, Part I, 2019, pp. 626–635.
[11] M. Wozniak, M. Graña, and E. Corchado, "A survey of multiple classifier systems as hybrid systems," Information Fusion, vol. 16, pp. 3–17, 2014.
[12] S. Sharma, C. Bellinger, B. Krawczyk, O. R. Zaïane, and N. Japkowicz, "Synthetic oversampling with the majority class: A new perspective on handling extreme imbalance," in IEEE International Conference on Data Mining, ICDM 2018, Singapore, November 17-20, 2018, 2018, pp. 447–456.
[13] P. Ksieniewicz, "Combining random subspace approach with SMOTE oversampling for imbalanced data classification," in Hybrid Artificial Intelligent Systems - 14th International Conference, HAIS 2019, León, Spain, September 4-6, 2019, Proceedings, 2019, pp. 660–673.
[14] A. Cano, "A survey on graphic processing unit computing for large-scale data mining," Wiley Interdiscip. Rev. Data Min. Knowl. Discov., vol. 8, no. 1, 2018.
[15] A. Cano, A. Zafra, and S. Ventura, "Weighted data gravitation classification for standard and imbalanced data," IEEE Trans. Cybernetics, vol. 43, no. 6, pp. 1672–1687, 2013.
[16] A. Cano, J. M. Luna, A. Zafra, and S. Ventura, "A classification module for genetic programming algorithms in JCLEC," J. Mach. Learn. Res., vol. 16, pp. 491–494, 2015.
[17] A. Cano, D. T. Nguyen, S. Ventura, and K. J. Cios, "ur-CAIM: improved CAIM discretization for unbalanced and balanced data," Soft Comput., vol. 20, no. 1, pp. 173–188, 2016.
[18] S. Ramírez-Gallego, A. Fernández, S. García, M. Chen, and F. Herrera, "Big data: Tutorial and guidelines on information and process fusion for analytics algorithms with mapreduce," Information Fusion, vol. 42, pp. 51–61, 2018.
[19] S. del Río, V. López, J. M. Benítez, and F. Herrera, "On the use of mapreduce for imbalanced big data using random forest," Inf. Sci., vol. 285, pp. 112–137, 2014.
[20] S. Ramírez-Gallego, H. Mouriño-Talín, D. Martínez-Rego, V. Bolón-Canedo, J. M. Benítez, A. Alonso-Betanzos, and F. Herrera, "An information theory-based feature selection framework for big data under apache spark," IEEE Trans. Systems, Man, and Cybernetics: Systems, vol. 48, no. 9, pp. 1441–1453, 2018.
[21] A. Fernández, E. Almansa, and F. Herrera, "Chi-spark-rs: An spark-built evolutionary fuzzy rule selection algorithm in imbalanced classification for big data problems," in 2017 IEEE International Conference on Fuzzy Systems, FUZZ-IEEE 2017, Naples, Italy, July 9-12, 2017, 2017, pp. 1–6.
[22] I. Triguero, M. Galar, D. Merino, J. Maillo, H. Bustince, and F. Herrera, "Evolutionary undersampling for extremely imbalanced big data classification under apache spark," in IEEE Congress on Evolutionary Computation, CEC 2016, Vancouver, BC, Canada, July 24-29, 2016, 2016, pp. 640–647.
[23] L. I. Kuncheva, Á. Arnaiz-González, J. Díez-Pastor, and I. A. D. Gunn, "Instance selection improves geometric mean accuracy: a study on imbalanced data classification," Progress in AI, vol. 8, no. 2, pp. 215–228, 2019.
[24] P. Skryjomski and B. Krawczyk, "Influence of minority class instance types on SMOTE imbalanced data oversampling," in First International Workshop on Learning with Imbalanced Domains: Theory and Applications, LIDTA@PKDD/ECML 2017, 22 September 2017, Skopje, Macedonia, 2017, pp. 7–21.
[25] J. Blaszczynski and J. Stefanowski, "Neighbourhood sampling in bagging for imbalanced data," Neurocomputing, vol. 150, pp. 529–542, 2015.
[26] M. Lango and J. Stefanowski, "Multi-class and feature selection extensions of roughly balanced bagging for imbalanced data," J. Intell. Inf. Syst., vol. 50, no. 1, pp. 97–127, 2018.
[27] F. N. Walmsley, G. D. C. Cavalcanti, D. V. R. Oliveira, R. M. O. Cruz, and R. Sabourin, "An ensemble generation method based on instance hardness," in 2018 International Joint Conference on Neural Networks, IJCNN 2018, Rio de Janeiro, Brazil, July 8-13, 2018, 2018, pp. 1–8.
[28] I. Triguero, S. del Río, V. López, J. Bacardit, J. M. Benítez, and F. Herrera, "ROSEFW-RF: the winner algorithm for the ecbdl'14 big data competition: An extremely imbalanced big data bioinformatics problem," Knowl.-Based Syst., vol. 87, pp. 69–79, 2015.
[29] P. Branco, L. Torgo, and R. P. Ribeiro, "Relevance-based evaluation metrics for multi-class imbalanced domains," in Advances in Knowledge Discovery and Data Mining - 21st Pacific-Asia Conference, PAKDD 2017, Jeju, South Korea, May 23-26, 2017, Proceedings, Part I, 2017, pp. 698–710.
[30] A. Benavoli, G. Corani, J. Demsar, and M. Zaffalon, "Time for a change: a tutorial for comparing multiple classifiers through bayesian analysis," J. Mach. Learn. Res., vol. 18, pp. 77:1–77:36, 2017.