
2019 IEEE International Conference on Big Data (Big Data)

Bagging Using Instance-Level Difficulty for Multi-Class Imbalanced Big Data Classification on Spark

William C. Sleeman IV
Department of Computer Science
Virginia Commonwealth University
Richmond VA, USA
[email protected]

Bartosz Krawczyk
Department of Computer Science
Virginia Commonwealth University
Richmond VA, USA
[email protected]

Abstract—Most machine learning methods work under the assumption that classes have a roughly balanced number of instances. However, in many real-life problems we may have some types of instances appearing predominantly more frequently than the others, which causes a bias towards the majority class during classifier training. This becomes even more challenging when dealing with multiple classes, where relationships between them are not easily defined. Learning from multi-class imbalanced data has not been widely considered in the context of big data mining, despite the fact that this is a learning difficulty frequently appearing in this domain. In this paper, we address this challenge by proposing a comprehensive ensemble-based framework. We propose to analyze each class to extract instance-level characteristics describing their difficulty levels. We embed this information into the existing UnderBagging framework. Our ensemble samples instances with probabilities proportional to their difficulty levels. This allows us to focus the learning process on the most difficult instances, better capturing the properties of multi-class imbalanced problems. We implemented our framework on Apache Spark to allow for high-performance computing over big data sets. The experimental study shows that taking into account the instance-level difficulty leads to the training of significantly more accurate ensembles.

Index Terms—machine learning, ensemble learning, imbalanced data, big data, instance-level difficulty

This research was partially supported by the Amazon AWS Machine Learning Research award.

I. INTRODUCTION

Despite more than two decades of research, learning from imbalanced data is still considered as one of the major challenges in the machine learning domain [1], [2]. Most of the existing classifier training procedures are based on the assumption that a uniform penalty is suffered for an error on all instances. Predictive accuracy and 0-1 loss functions are the most common examples of such metrics used to determine the effectiveness of a training procedure (and related procedures, e.g., parameter selection). However, in many real-life problems the distribution of instances among classes is far from being balanced. Some objects appear much more frequently than others, leading to imbalanced datasets. This creates a bias during the training procedure, as due to a uniform treatment of all the instances, the classifier will favor the majority class. At the same time, in most of the cases the minority class is the more interesting one, usually corresponding to important, rare, or dangerous cases. Therefore, special techniques are needed to alleviate the mentioned bias and achieve good recognition rates on the minority class. At the same time, this cannot come at the cost of a performance drop on the majority class, as we should aim for a balanced and well-rounded classifier [3]. Additionally, recent studies show that the imbalance ratio between classes is not the sole reason behind unsatisfactory classifier performance [2]. Data-level characteristics play at least as important a role, including such problems as small training sets, the existence of class sub-structures, or instance-level difficulties.

Traditionally, imbalanced problems have been considered for two-class datasets, where minority and majority classes are easily defined [1]. In recent years, we can see an increasing interest in analyzing the imbalanced problem in multi-class scenarios, as many contemporary real-life problems suffer from it. When dealing with multiple classes, one cannot simply use the existing algorithms for binary problems. Relationships among classes are no longer easily defined and a given class can behave like a minority and majority class at the same time, depending on the context. Binarization of multi-class data (i.e., transforming a multi-class problem into a set of two-class ones) is not an efficient solution, as we lose the information on intra-class relationships [4]. Therefore, algorithms that are capable of directly working on multi-class data are more desirable. The issue of data-level characteristics is as important as in the binary problems. However, there are but few algorithms dedicated to multi-class imbalanced scenarios that take them into account [5], [6].

While imbalanced problems originated from small-scale datasets, the recent advent of big data significantly affected this domain [7]. Researchers need to deal with the ever-growing size, complexity, and often non-stationary nature of modern datasets that are still plagued by skewed class distributions [8]–[10]. As the size of data makes the usage of standard computers unfeasible, we observed a rise of high-performance computing
platforms, such as GPUs and clusters. Their specific nature calls not only for customized implementations of existing algorithms for tackling imbalanced data, but even for developing completely new techniques that can take advantage of these computing environments [1]. Most of the existing solutions are designed for two-class datasets, while learning from multi-class imbalanced big data still remains relatively unexplored [2].

In this paper, we propose a novel framework for learning from multi-class imbalanced big datasets. We utilize an ensemble framework [11] that combines the Bagging approach with undersampling. Instead of conducting uniform undersampling, as all existing methods do, we propose to include the instance-level information into the sampling procedure. We analyze the properties of instances in each class and evaluate their difficulty levels. We achieve this by analyzing the neighborhood of each instance. An increased presence of instances from different classes in the neighborhood points to a higher difficulty a given instance will pose to a classifier. We use this information to modify the undersampling probabilities, in order to boost the presence of difficult instances. This leads to better adaptation of base classifiers in the ensemble to local data characteristics. In order to allow for effective processing of big data sets, we propose an implementation of our ensemble using the Apache Spark architecture. Our experimental study shows that the proposed method achieves significantly better predictive performance than the state-of-the-art methods for imbalanced big data classification, while offering very good computational performance and scalability.

II. LEARNING FROM IMBALANCED BIG DATA

In the field of learning from imbalanced data, three main solutions have been proposed to deal with skewed distributions.

Data-level techniques. This group of methods focuses on working directly on the training set in order to balance the distributions among classes. Then such a balanced dataset can be used by any classifier. This is usually achieved by either removing instances from the majority classes (undersampling), or introducing artificial instances into minority classes. Both of these solutions have potential drawbacks. Undersampling, while reducing the size of the training dataset, risks removing instances that may hold informative value to a classifier. Oversampling retains all the instances, but risks shifting the original class distribution or enhancing the noise present in the training set. Modern sampling techniques use more advanced criteria for removing or generating instances, in order to balance the distributions in a meaningful way [1]. They aim at preserving the most important instances, while removing noisy or overlapping cases [5]. When dealing with extremely imbalanced datasets, one should use dedicated methods, as undersampling would reduce the size of the training set too significantly and oversampling has too few meaningful instances to generate anything [12].

Algorithm-level techniques. This group of methods focuses on understanding what specific learning mechanic impairs the performance of a given classifier when dealing with imbalanced data. This requires an in-depth analysis of a specific algorithm. Because of this, algorithm-level techniques offer high effectiveness at the cost of reduced flexibility (as they are usually specific only to a given learner). The most popular approaches include designing skew-insensitive splitting criteria for decision trees, or modifying the training procedures of Support Vector Machines. Additionally, cost-sensitive methods can be categorized here, as they change the used loss function. A higher penalty is assigned to objects coming from minority classes, thus forcefully alleviating the learning bias. The problem lies in how to determine an optimal cost parameter that will lead to a balanced performance [1].

Ensemble techniques. Ensemble learning is widely considered as one of the most powerful tools in machine learning [11]. By combining multiple classifiers that are individually complementary, we may obtain a compound classifier that is more accurate than any of its base components. Ensemble techniques can be efficiently combined with any data-level or algorithm-level solutions to offer improved robustness to imbalanced and difficult problems [13]. The most efficient methods are combinations of Bagging, Boosting, or Random Forest with sampling or cost-sensitive methods [1].

When facing the challenge of learning from imbalanced big data, the mentioned algorithms are not enough. Their bottleneck lies in the fact that they were designed for small-size datasets. In order to make them scalable to big data classification tasks, one needs to combine them with effective high-performance computing architectures. There are two main trends regarding the choice of hardware.

Graphic Processing Units. GPUs offer a powerful high-performance computing environment at a fraction of the cost of a traditional cluster [14]. While they are characterized by excellent data-level parallelism and an unbeatable degree of control over each aspect of data scheduling and partitioning, they require highly specialized skills to write efficient and optimized code. Additionally, at the current moment they are unable to handle terabyte-level datasets, due to limitations on the embedded memory. In the context of imbalanced data, the main success of GPUs lies in a highly efficient implementation of the popular SMOTE technique. Due to full control over data partitioning, one can maintain the meaningful creation of artificial instances in given neighborhoods - a feat that is unreachable in most CPU clusters. GPUs have been successfully applied to skew-insensitive variants of nearest-neighbor classifiers [15], inducing classification rules using genetic programming [16], as well as efficient large-scale discretization of imbalanced data [17].

CPU clusters. In recent years one can observe a rise of distributed computing cluster architectures, especially ones using the MapReduce methodology [18] such as Apache Hadoop and Spark. They offer high elasticity, reliability, and scalability in a user-friendly manner. However, this comes at the cost of an increased monetary price of hardware and the lack of the explicit control over scheduling that is available in GPUs. While CPU clusters seem like a highly attractive solution to imbalanced
data, one must be aware of their potential limitations. A study on the effects of different sampling techniques on Random Forest showed that SMOTE achieves a highly unsatisfactory performance in the MapReduce environment [19]. This is caused by the lack of control over data partitioning and, in turn, the creation of chunks of data with corrupted spatial relationships. When SMOTE is used in each map with such an unreliable neighborhood, it tends to create artificial instances in incorrect regions, leading to increasing overlapping among classes and shifting the true class distributions. Random oversampling and undersampling perform significantly better on CPU clusters. Success stories of MapReduce usage for imbalanced big data include efficient and scalable feature selection [20], rule induction [21], and data sampling [22].

The mentioned techniques are developed for two-class big imbalanced data. To the best of our knowledge, there exist no algorithms dedicated specifically to large-scale multi-class imbalanced data, neither for GPUs nor CPU clusters [1], [2].

III. INSTANCE-LEVEL DIFFICULTY

Most of the early works on imbalanced data pointed to the disproportion among classes as the main source of learning difficulty. However, one may easily come up with a dummy problem of two classes with an extremely high imbalance ratio but with clear separation. In such a case, even a simple classifier will not be affected by a bias. Therefore, there must exist additional characteristics of imbalanced datasets that cause the degradation of a classifier's performance [6], [23]. Recent surveys on this topic identify these data-level difficulties as the most challenging ones [1], [2].

Small training set size. While the overall size of the training set may be sufficient, the high imbalance ratio may negatively impact the minority class. There simply may not be enough minority instances to estimate a proper distribution of this class and train a competent discrimination model. Additionally, using standard oversampling techniques on extremely small minority classes may lead to a lack of diversity among artificial instances and thus be of no aid to the classification process. Recently, oversampling techniques have been developed that use information from the majority class to alleviate the small size of the minority class. In the context of big data classification, this problem is less likely to appear, as we expect to deal with large data samples.

Class overlapping. Overlapping regions between classes have always been a source of learning difficulties. When combined with skewed class distributions, these regions are the ones where bias towards the majority class is most likely to appear. This problem becomes even more challenging in a multi-class scenario, where more than two classes may be present in a given region. This could be further amplified in big data scenarios, where overlapping regions are likely to be more frequent due to the massive data size.

Small disjuncts. A minority class may not have a unimodal distribution. Instead, it may consist of sub-structures, clusters, or disjuncts. This poses additional difficulty for a classifier, as it needs to learn a more complex decision boundary, thus being potentially more prone to overfitting. Additionally, any data-level techniques must be aware of these existing sub-structures in order not to mix instances from them when performing, e.g., sampling (as this would compromise the spatial coherence of the minority class). This problem can be further enhanced in the big data context, as by using the MapReduce paradigm we are violating spatial relationships among instances (we lack control on how these partitions are created) [19].

Difficult individual instances. Difficulties with learning may also be present at the instance level [6]. Specific instances may corrupt the underlying distributions, further enhancing bias towards the majority class. The most popular cases include outliers and noisy instances that should be detected and handled appropriately. Furthermore, by utilizing the information about difficult instances, one may enrich the sampling and classifier training procedures.

Fig. 1: Visualization of instance-level difficulty types for a multi-class imbalanced dataset (four panels: safe, borderline, rare, and outlier examples).

In this work, we will focus on instance-level difficulties. Previously, we have shown that utilizing information about individual instances can significantly improve data-level techniques for both two-class [24] and multi-class problems [6]. We aim at using this information in a big data context, in order to gain additional useful insight into the classified datasets. For the purpose of this paper, we will use an instance-level difficulty taxonomy based on neighborhood analysis (see Figure 1 for visualization; a minimal labeling sketch follows the list below):
• Safe instances. These are instances that lie far from other classes and form uniform groups. They usually pose little, if any, learning difficulty and can be easily discriminated.
• Borderline instances. These are instances that lie close to a border between classes. Due to their similar characteristics to other instances they are more likely to be misclassified. Borderline instances from a minority class will usually be subject to classification bias, as majority instances are likely to be predominant in this part of the decision space.
• Rare instances. These are instances that form small disjuncts far from the main class structure. Due to their
small size they are likely to be ignored by a classifier. Very often they overlap with the majority class, making their proper classification even more challenging.
• Outliers. These are singular instances located far from the main class structure. They can be either noise or valuable rare information. There are no ways of determining this without analyzing the context of the considered application. They should be discussed with the domain expert in order to determine the proper course of action. Noisy outliers should be removed, while rare information should be emphasized.
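To make the taxonomy concrete, the sketch below maps the number of same-class instances among the k = 5 nearest neighbors to one of the four labels. The thresholds follow a convention common in the instance-difficulty literature; the paper does not state exact values, so they should be read as an assumption:

sealed trait DifficultyType
case object Safe extends DifficultyType
case object Borderline extends DifficultyType
case object Rare extends DifficultyType
case object Outlier extends DifficultyType

// Hypothetical thresholds for k = 5 neighbors; not fixed by the paper.
def difficultyType(sameClassNeighbors: Int): DifficultyType =
  sameClassNeighbors match {
    case n if n >= 4 => Safe        // 4-5 of 5 neighbors share the label
    case n if n >= 2 => Borderline  // 2-3 of 5 neighbors share the label
    case 1           => Rare        // a single same-class neighbor
    case _           => Outlier     // no same-class neighbors
  }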
IV. BAGGING WITH INSTANCE-LEVEL DIFFICULTY

In this section, we will present the details of our proposed ensemble framework for multi-class imbalanced big data classification. In the following sections, we will describe how to combine ensemble learning with information about instance-level difficulty to form our proposed algorithm UnderBagging+, as well as how to implement our framework on Apache Spark for high-performance computing.

A. Combining UnderBagging with instance-level difficulty

Bagging-based algorithms are very popular in imbalanced domains. Many studies show that they are especially efficient when combined with undersampling in each bag, for both two-class [25] and multi-class [26] problems. However, all existing algorithms based on the idea of UnderBagging assume that all instances in each class are equally important. We propose to include information about the instance-level characteristics discussed in Section III into the undersampling process conducted in each bag. In order to do so, we need a way of calculating the difficulty of each instance. This is usually done by analyzing the instance neighborhood and assigning a given level based on its uniformness [27]. The more contaminated the neighborhood with instances from other classes, the higher the difficulty level assigned to the analyzed instance. One of four labels (safe / borderline / rare / outlier) is selected and this information is further used by the sampling or the learning algorithm.
Calculating instance-level difficulty. Previous works used a discrete approach by either selecting or removing all instances with a given difficulty label from the training set. This required a time-consuming manual tuning of all possible combinations of instance types. In this work, we propose to use a probabilistic approach, where the difficulty of each instance will affect the chances of it being selected for a given bag. In order to do so, let us define the function for calculating the difficulty of each instance based on the class labels of its nearest neighbors:

ID(x, k) = \frac{\sum_{i=1}^{k} \left[ x_i \in kNN(x) \wedge label(x_i) \neq label(x) \right]}{k},   (1)

where k is the number of neighbors used for the difficulty analysis and kNN stands for the k nearest neighbors. This function takes values in [0, 1] and assigns higher values to instances that are in a more contaminated neighborhood (i.e., with a higher number of neighbors originating from other classes).
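A direct (non-distributed) reading of Eq. (1) is sketched below; the Instance type and the brute-force neighbor search are illustrative stand-ins for the actual Spark implementation described in Section IV-B:

case class Instance(features: Vector[Double], label: Int)

def euclidean(a: Vector[Double], b: Vector[Double]): Double =
  math.sqrt(a.zip(b).map { case (u, v) => (u - v) * (u - v) }.sum)

// Eq. (1): fraction of the k nearest neighbors with a different class label.
def instanceDifficulty(x: Instance, data: Seq[Instance], k: Int): Double = {
  val neighbors = data
    .filter(_ ne x)                                   // exclude x itself
    .sortBy(n => euclidean(n.features, x.features))
    .take(k)
  neighbors.count(_.label != x.label).toDouble / k    // value in [0, 1]
}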
Modification of instance selection probability for UnderBagging. Having established a way to calculate the instance difficulty, we must utilize it in the form of a selection likelihood during the undersampling in each bag. As we deal with a multi-class problem, we must calculate the likelihood for each m-th class independently, in order to conduct our proposed instance difficulty-based undersampling on each of them. The higher the difficulty of a given instance, the more valuable it is to a classifier, as it represents a region where bias towards the majority class may occur. Therefore, we want to enhance the importance of difficult instances during the training phase. We achieve this by giving difficult instances a higher likelihood of being selected in each bag:

f_m(x) = \frac{1}{n_m} + ID(x, k),   (2)

where n_m is the number of instances in the m-th class. Using this approach, we increase the likelihood of preserving difficult instances in each bag. At the same time, we give a chance for safe instances to be selected, achieving a mix of different types of instances. This should lead to the creation of a more diverse set of classifiers as compared to previous works that suggested discarding the entire set of a given instance type.

The likelihood is then transformed into a normalized probability function for selecting a given instance in the m-th class:

p_m(x) = \frac{f_m(x)}{\sum_{i=1}^{n_m} f_m(x_i)}.   (3)

This normalization ensures that the proposed probability function p_m is a proper probabilistic distribution. This is necessary for using it during Bagging instance selection when choosing with replacement.
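Putting Eqs. (2) and (3) together, one bag for a single class can be drawn as below. This is a simplified, self-contained sketch (inverse-CDF sampling over the normalized weights), not the authors' production code:

import scala.util.Random

def sampleBagForClass[A](classInstances: IndexedSeq[A],
                         difficulty: IndexedSeq[Double],  // ID(x, k) per instance
                         bagSize: Int,
                         rng: Random = new Random()): IndexedSeq[A] = {
  val n    = classInstances.size
  val f    = difficulty.map(id => 1.0 / n + id)           // Eq. (2)
  val fSum = f.sum
  val cdf  = f.scanLeft(0.0)(_ + _).tail.map(_ / fSum)    // cumulative Eq. (3)
  IndexedSeq.fill(bagSize) {                              // sampling with replacement
    val u   = rng.nextDouble()
    val idx = cdf.indexWhere(u <= _)
    classInstances(if (idx >= 0) idx else n - 1)          // guard against rounding error
  }
}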
Other characteristics of UnderBagging+. The proposed model differs from existing Bagging-based solutions for imbalanced data by modifying its sampling probability according to the instance difficulty level and by working directly on multi-class problems. The size of each bag is defined as a user-based parameter, as there exist two approaches to this issue: (i) selecting the bag size equal to the size of the class with the lowest number of instances; and (ii) selecting the bag size smaller than the size of the class with the lowest number of instances. As for the prediction phase, UnderBagging+ uses a standard majority voting scheme.

B. Implementation on Apache Spark

UnderBagging+ was developed to use in-depth information about the instances in each class. In order to make it applicable to massive imbalanced datasets, we propose an efficient implementation on the Apache Spark environment.

Spark set-up. Spark uses the MapReduce approach (see Figure 2) - a popular distributed computing model that represents data as pairs of keys/values. It consists of two main phases - map and reduce. The role of the map phase is to take the input data and split it among a given number of computing nodes (workers). Each worker performs a given task in parallel, returning intermediate results. The role of the reduce phase is to combine these intermediate results into the final output. Spark keeps the results between map/reduce steps in main memory when possible. Spark uses a driver node to create a Spark Context object which is connected to a cluster manager such as Mesos or YARN. The Spark Context is responsible for requesting resources from the cluster manager and running executors on computing nodes. Figure 3 shows the Spark workflow, with solid lines representing data passing.

Fig. 2: The MapReduce architecture showing the three phases: map, shuffle, reduce.

Fig. 3: The Spark cluster architecture for resource allocation and data transfer.
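As a toy illustration of the map/shuffle/reduce flow sketched in Figure 2 (our own minimal example, not taken from the paper; the application name and data are made up), counting instances per class label in Spark looks as follows:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MapReduceToyExample")   // hypothetical application name
  .getOrCreate()

val labels = spark.sparkContext.parallelize(Seq(1, 2, 1, 3, 2, 1))
val counts = labels
  .map(label => (label, 1L))        // map phase: emit key/value pairs
  .reduceByKey(_ + _)               // shuffle + reduce: sum counts per key
counts.collect().foreach(println)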
Undersampling on Spark. The proposed modified undersampling that takes into account the instance-level difficulty is the backbone of our system. It will be executed by each worker on each node multiple times to form the base classifiers for the UnderBagging+ ensemble. Additionally, it must work on each class independently and take as an input parameter the size of each class after the undersampling procedure. We have written efficient undersampling code in Scala that takes advantage of the Spark architecture, in order to offer scalability to massive datasets. Additionally, as UnderBagging+ uses sampling with replacement, we join the selected instances into a DataFrame structure to form a balanced dataset. Details of the proposed undersampling implementation are given in pseudocode form in Algorithm 1.

Algorithm 1 Perform target class undersampling
procedure TargetClassUndersample(DF, NumberOfClasses, TargetCount)
  sampledDataFrames ← DataFrame[1 to NumberOfClasses]
  for c = 1 to NumberOfClasses do
    DFc ← DF.filter(x → x.class == c)
    if DFc.count() > TargetCount then
      sampledDataFramesc ← DFc.sample(TargetCount)
    else
      sampledDataFramesc ← DFc
    end if
  end for
  return sampledDataFrames.union()
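A possible Spark rendering of Algorithm 1 is sketched below. The column name "class" is an assumption, and since DataFrame.sample draws by fraction rather than by exact count, the target count is converted into a fraction; this is our reading of the pseudocode rather than the authors' released code:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def targetClassUndersample(df: DataFrame,
                           numberOfClasses: Int,
                           targetCount: Long): DataFrame = {
  val perClass = (1 to numberOfClasses).map { c =>
    val dfc = df.filter(col("class") === c)            // assumed label column
    val cnt = dfc.count()
    if (cnt > targetCount)
      dfc.sample(withReplacement = false, targetCount.toDouble / cnt)
    else
      dfc                                              // small class kept whole
  }
  perClass.reduce(_ union _)                           // merge into one DataFrame
}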
UnderBagging+ on Spark. A separate UnderBagging+ ensemble is constructed in each node of the Spark cluster, and then the intermediate results from each ensemble are combined during the reduce phase to give a final model. Each node will already obtain a subset of data from a given map. Therefore, the calculation of instance difficulties according to Eq. 1 is done in each node independently, in order to capture local data characteristics. This leads to the creation of locally specialized classifiers in each node and benefits the overall diversity of the proposed UnderBagging+ architecture. As the instance difficulty calculation requires an analysis of the neighborhood of each instance, we use an efficient k-NN implementation from the spark-knn package.
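Schematically, this map/reduce training flow can be expressed as below; the BaseClassifier type, the empty per-partition trainer, and the pooling-by-concatenation reduce are our simplifications of the described design, not the actual implementation:

import org.apache.spark.rdd.RDD

type BaseClassifier = Vector[Double] => Int       // hypothetical predict function

// Placeholder for the per-node UnderBagging+ procedure (difficulty analysis
// plus difficulty-weighted bagging of CART trees would happen here).
def trainPartitionEnsemble(part: Seq[(Vector[Double], Int)]): Seq[BaseClassifier] =
  Seq.empty

def trainDistributed(data: RDD[(Vector[Double], Int)]): Seq[BaseClassifier] =
  data.mapPartitions(p => Iterator(trainPartitionEnsemble(p.toSeq))) // map phase
      .reduce(_ ++ _)                                                // reduce: pool members

// Standard majority voting over the pooled ensemble, as used by UnderBagging+.
def predict(ensemble: Seq[BaseClassifier], x: Vector[Double]): Int =
  ensemble.map(_(x)).groupBy(identity).maxBy(_._2.size)._1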
V. EXPERIMENTAL STUDY

In order to evaluate the effectiveness of UnderBagging+, we have designed an in-depth experimental study that aims at answering the following research questions:
• RQ1: Does UnderBagging+ display better performance than existing methods for multi-class imbalanced big data available in Spark?
• RQ2: Does the modification of sampling used in UnderBagging+ lead to improved results over standard Bagging?
• RQ3: How does the improved sampling method influence the size and diversity of UnderBagging+?
In the following section, we will present the experimental set-up and the obtained results, as well as discuss them.

A. Data benchmarks

Learning from imbalanced big data suffers from the lack of proper benchmark repositories. While there are a couple of popularly used large-scale two-class imbalanced datasets [28], multi-class imbalanced big datasets are still not standardized
or widely available. For the purpose of this study we have selected and prepared five real-world datasets that have the following important properties: (i) they are originally multi-class; (ii) they have a large volume; and (iii) they are characterized by a significant imbalance among classes and various data-level difficulties. Their details are given in Table I. Please note that we have calculated the ratios of the different types of instances in each class using all data. However, during the actual experiments the instance-level difficulty is calculated only using the training data and independently in each computational node of the Spark cluster. Therefore, we report global properties of each benchmark, but runtime local characteristics may differ.

B. Set-up

In this section we give details regarding the experimental study set-up and parameters.

Reference methods. As there is practically no work done on multi-class imbalanced data classification with Spark (see [1]), we adapted existing methods dedicated to two-class problems. We compare UnderBagging+ with the standard version of UnderBagging, as well as Random Forest with both under- and oversampling, adapted from [19]. Finally, we use a standard CART tree with both under- and oversampling as a counterpart to the ensemble methods.

Adaptation to multi-class problems. As mentioned above, there are no available methods for learning from multi-class imbalanced data on Spark. However, both Random Forest and CART are capable of handling multi-class problems. We extended them with our multi-class adaptations of under- and oversampling, where each class is treated individually and sampled to a given size.

Parameters. We perform parameter selection for each classifier using internal 3-fold cross-validation on the training data. Each ensemble uses CART as a base classifier. Each ensemble size is chosen from {20, 40, ..., 200}. Oversampling ratios are in {0.5n_max, n_max, 1.5n_max, 2n_max}, where n_max is the size of the biggest class in the analyzed dataset. Undersampling ratios are in {1.5n_min, n_min, 0.5n_min, 0.25n_min}, where n_min is the size of the smallest class.

Training and testing. We have used a stratified 10-fold cross-validation in order to maintain the original class distribution among the folds.

Evaluation metrics. In order to properly evaluate the performance of classifiers on imbalanced domains, skew-insensitive metrics are required. As multi-class imbalanced problems were not popular among researchers until recently, there is a lack of a uniform approach as to which measures should be considered a standard. In order to gain an in-depth insight into the performance of the analyzed classifiers we have used the following nine popular metrics for multi-class imbalanced data [29]:

AvAcc = \frac{1}{C} \sum_{i=1}^{C} \frac{tp_i + tn_i}{tp_i + tn_i + fp_i + fn_i}

Rec^M = \frac{1}{C} \sum_{i=1}^{C} recall_i

Prec^M = \frac{1}{C} \sum_{i=1}^{C} precision_i

Rec^\mu = \frac{\sum_{i=1}^{C} tp_i}{\sum_{i=1}^{C} t_i}

Prec^\mu = \frac{\sum_{i=1}^{C} tp_i}{\sum_{i=1}^{C} p_i}

F_\beta^M = \frac{(1 + \beta^2) \cdot Prec^M \cdot Rec^M}{\beta^2 \cdot Prec^M + Rec^M}

F_\beta^\mu = \frac{(1 + \beta^2) \cdot Prec^\mu \cdot Rec^\mu}{\beta^2 \cdot Prec^\mu + Rec^\mu}

AvF_\beta = \frac{1}{C} \sum_{i=1}^{C} \frac{(1 + \beta^2) \cdot precision_i \cdot recall_i}{\beta^2 \cdot precision_i + recall_i}

CBA = \frac{1}{C} \sum_{i=1}^{C} \frac{mat_{i,i}}{\max\left(\sum_{j=1}^{C} mat_{i,j}, \sum_{j=1}^{C} mat_{j,i}\right)}

where tp_i, tn_i, fp_i, and fn_i denote the true positives, true negatives, false positives, and false negatives for the i-th class, t_i and p_i denote the numbers of instances truly belonging to and predicted as the i-th class, respectively, recall_i and precision_i are the per-class recall and precision, and mat is the C x C confusion matrix.

Additionally, we use an Entropy measure for diversity analysis among ensemble members.
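For concreteness, a few of these measures can be computed from a C x C confusion matrix as sketched below (our illustrative helper, with mat(i)(j) counting instances of true class i predicted as class j):

// Macro-averaged recall, Rec^M.
def macroRecall(mat: Array[Array[Long]]): Double = {
  val c = mat.length
  (0 until c).map { i =>
    val t = mat(i).sum.toDouble                  // instances truly in class i
    if (t == 0) 0.0 else mat(i)(i) / t
  }.sum / c
}

// Micro-averaged recall, Rec^mu.
def microRecall(mat: Array[Array[Long]]): Double = {
  val tp = mat.indices.map(i => mat(i)(i)).sum.toDouble
  tp / mat.map(_.sum).sum                        // divided by the total instance count
}

// Class Balance Accuracy, CBA.
def cba(mat: Array[Array[Long]]): Double = {
  val c = mat.length
  (0 until c).map { i =>
    val trueSum = mat(i).sum.toDouble                          // row marginal (true class i)
    val predSum = (0 until c).map(j => mat(j)(i)).sum.toDouble // column marginal (predicted i)
    mat(i)(i) / math.max(trueSum, predSum)
  }.sum / c
}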
Statistical analysis. We analyze the significance of the obtained results using the Bayesian sign-rank test [30], with its visualization capabilities for the pairwise comparison of classifiers.

High-performance computing environment. All experiments were performed in a Spark environment with 32 computing nodes provided by Amazon Web Services.

C. Results and discussion

The results of our computational experiments with respect to the nine selected skew-insensitive performance metrics are given in Table II, while the outcomes of the Bayesian sign-rank test of statistical significance are depicted in Figure 4. We will discuss the obtained results from the perspective of the three research questions stated at the beginning of this section.

Comparison with reference classifiers. Let us first compare UnderBagging+ with the single CART classifier. Unsurprisingly, UnderBagging+ always achieves significantly better results, as further verified by the Bayesian statistical tests. Regardless of the used type of sampling, CART always offered significantly lower predictive power. This verifies a natural assumption that multi-class big imbalanced data are too complex to be effectively modelled by a single classifier, even when augmented with data-level balancing. Therefore, ensembles are shown to be the most promising direction for future research on big imbalanced data. Random Forest offers much more interesting observations. When combined with sampling for each base tree, Random Forest is considered as the best performing ensemble for imbalanced big data.
2489
TABLE I: Details about multi-class imbalanced big datasets used in experimental study.
Dataset Instances Features Classes Class distribution Safe[%] Borderline[%] Rare[%] Outliers[%]
covtype 581,012 54 7 Class 1 211840 84.97 10.43 3.03 1.57
Class 2 283301 87.62 9.33 2.08 0.97
Class 3 35754 80.81 13.96 3.41 1.82
Class 4 2747 56.86 25.48 9.79 7.86
Class 5 9493 60.32 25.18 8.57 5.93
Class 6 17367 65.80 23.40 6.85 3.94
Class 7 20510 87.25 8.37 2.85 1.54
traffic 1,378,663 26 7 Class 1 163473 22.35 32.59 22.96 22.10
Class 2 214173 25.47 31.73 21.93 20.88
Class 3 275622 34.96 32.68 17.81 14.55
Class 4 343133 39.47 33.74 16.05 10.74
Class 5 157710 24.75 32.21 21.87 21.17
Class 6 179367 27.08 35.69 20.68 16.55
Class 7 45185 22.85 19.79 19.33 38.03
seer 2,532,629 11 10 Class 1 396140 50.53 26.83 9.90 12.75
Class 2 567055 79.01 14.12 4.07 2.80
Class 3 241330 17.91 31.69 23.15 27.25
Class 4 528648 60.71 28.71 6.11 4.48
Class 5 135436 37.15 18.31 16.56 27.98
Class 6 137908 58.65 29.12 6.16 6.06
Class 7 100481 1.25 15.92 28.22 54.60
Class 8 110411 26.39 20.59 23.27 29.75
Class 9 166217 46.04 30.77 11.81 11.38
Class 10 149003 33.03 38.15 16.50 12.33
sensors 2,219,803 5 55 Class 1 43039 53.47 28.28 11.15 7.10
Class 2 46915 71.04 19.21 5.95 3.80
Class 3 46631 55.52 29.91 9.43 5.14
Class 4 43790 65.86 23.22 6.97 3.95
Class 6 35654 62.37 21.71 9.07 6.85
Class 7 55302 64.51 21.62 8.70 5.17
... ... ... ... ... ...
Class 51 25582 27.89 36.90 19.12 16.10
Class 52 28748 33.06 36.84 17.07 13.03
Class 53 2850 72.39 12.35 6.49 8.77
Class 54 2372 81.58 11.68 4.30 2.45
Class 55 4497 89.53 7.69 1.71 1.07
iot 3,000,000 115 11 Class 1 215643 99.90 0.06 0.02 0.02
Class 2 275258 99.25 0.56 0.14 0.05
Class 3 230494 99.96 0.02 0.01 0.00
Class 4 313359 99.97 0.01 0.01 0.01
Class 5 527165 99.69 0.24 0.05 0.02
Class 6 224499 99.84 0.11 0.04 0.02
Class 7 219925 99.96 0.03 0.01 0.00
Class 8 111940 99.93 0.06 0.01 0.00
Class 9 109278 99.95 0.03 0.01 0.01
Class 10 367388 99.99 0.01 0.00 0.00
Class 11 405051 99.98 0.02 0.00 0.00

Standard UnderBagging performs significantly below both versions of Random Forest. However, the proposed UnderBagging+ is capable of outperforming both versions of Random Forest in a statistically significant way. Additionally, as seen in Figure 5, it is capable of doing so while maintaining a much smaller pool of classifiers. This shows the importance of incorporating instance-level information into the ensemble training procedure and how much can be gained by a careful understanding of the underlying data properties in multi-class imbalanced big data.

Role of improved undersampling with instance-level difficulty. The major difference between standard UnderBagging and the proposed UnderBagging+ lies in the utilized sampling function. UnderBagging assumes a uniform probability of sampling every single instance, while UnderBagging+ correlates the probability of instance selection with its difficulty level. Therefore, any differences in results between these two methods originate from this sampling part. UnderBagging+ is statistically significantly better than its counterpart, sometimes achieving gains of up to 10% on some metrics (which, considering the size of the datasets, is a non-trivial gain). Therefore, we can see that forming bags that preserve difficult instances leads to more locally specialized classifiers and thus to improved ensemble predictive power. Additionally, as the instance difficulty calculation is done independently in each computing node of Spark, we are able to overcome a potentially unfavourable distribution of instances and adapt to the local data partitioning.
2490
TABLE II: Results for UnderBagging+ and reference methods according to nine performance metrics on five multi-class big imbalanced datasets.

Dataset UnBag+ UnBag RandF_un RandF_ov CART_un CART_ov
AvAcc
covtype 71.18 61.45 65.11 68.67 52.10 51.04
traffic 84.33 78.39 80.00 80.05 71.03 70.13
seer 92.18 90.03 91.09 91.08 82.78 81.99
sensors 95.72 94.01 94.92 94.93 86.44 86.52
iot 99.66 98.38 99.24 98.38 95.36 95.12
RecM
covtype 33.12 14.87 17.55 16.63 10.34 10.07
traffic 36.43 30.11 31.20 31.58 26.56 26.01
seer 55.88 53.28 53.41 53.40 50.02 49.84
sensors 1.75 1.17 1.81 1.55 0.87 0.82
iot 96.88 91.48 94.56 91.98 88.27 88.44
RecU
covtype 5.46 3.18 0.00 0.00 0.00 0.00
traffic 35.18 28.42 30.47 30.56 21.67 22.15
seer 51.82 49.75 51.91 51.85 43.19 44.01
sensors 6.78 4.11 0.00 0.00 0.00 0.00
iot 96.82 93.88 95.17 90.27 87.56 82.99
PrecM
covtype 17.99 12.74 10.08 9.44 7.52 7.33
traffic 36.19 29.87 30.01 30.18 25.04 25.17
seer 60.07 53.84 55.44 55.39 51.99 50.79
sensors 12.99 8.99 2.62 2.74 1.17 1.43
iot 96.77 94.02 95.83 91.09 88.32 87.61
PrecU
covtype 12.97 8.05 6.47 6.39 4.99 5.28
traffic 37.92 34.09 30.01 30.18 26.62 27.01
seer 55.10 51.92 55.44 55.39 47.81 48.04
sensors 13.01 7.28 2.62 2.55 0.91 0.93
iot 95.83 92.48 95.83 91.09 87.71 81.98
FbM
covtype 18.37 11.94 3.62 4.11 2.07 2.11
traffic 35.61 29.99 30.61 30.76 27.82 28.03
seer 54.82 51.04 52.50 52.15 47.86 46.94
sensors 6.07 3.98 0.00 0.00 0.00 0.00
iot 96.11 94.86 95.05 90.60 87.19 83.48
FbU
covtype 24.81 19.63 6.95 6.75 2.98 2.88
traffic 36.17 32.10 30.01 30.18 23.98 24.83
seer 57.98 54.19 55.44 55.39 49.82 49.75
sensors 4.82 2.97 2.62 2.75 0.99 1.01
iot 97.82 93.99 95.83 91.09 88.49 88.72
AvFb
covtype 20.76 17.84 6.18 6.87 3.98 4.04
traffic 34.17 30.82 28.87 29.00 24.27 23.99
seer 51.88 50.37 49.35 49.28 44.88 44.64
sensors 5.15 2.99 1.65 1.65 0.63 0.64
iot 95.69 92.88 94.82 89.71 84.56 84.60
CBA
covtype 21.28 15.92 5.56 6.24 3.82 3.46
traffic 30.03 23.88 22.86 22.74 17.23 15.62
seer 50.99 43.72 41.37 41.31 30.18 29.08
sensors 12.84 6.98 1.09 1.00 0.17 0.19
iot 93.56 90.36 91.12 84.02 78.23 69.62

Fig. 4: Visualizations of Bayesian sign-rank tests for pairwise statistical comparison between UnderBagging+ and reference methods.

Analysis of ensemble size and diversity of UnderBagging+. When analyzing the performance of ensemble classifiers, it is important to verify how the size of the ensemble influences its predictive power and diversity. In the context of big data analytics it is desirable to create smaller ensembles that can perform similarly to their larger counterparts. Smaller ensembles are faster to compute and require less space in memory. Figure 5 shows that UnderBagging+ achieves its best performance with a much smaller ensemble size than Random Forest, while displaying better performance. This can be explained by the creation of more meaningful and less random bootstrap samples for the base classifiers, thus being able to cover the decision space sufficiently with a lower number of base models. Interestingly, the diversity analysis shown in Figure 6 points to the fact that the base classifiers in UnderBagging+ are less randomized. We can see a small drop in diversity compared to UnderBagging, which can be explained by the fact that UnderBagging+ forces its base classifiers to focus on difficult instances. This reduced diversity can actually be seen as a positive factor, as it leads to more stable classifiers. Both of these factors further augment the high usability of UnderBagging+ for learning from multi-class imbalanced big data.

Fig. 5: Relationship between ensemble size and the CBA performance metric (one panel per dataset: covtype, traffic, seer, sensors, iot; curves for UnBag+, UnBag, RandF_un, RandF_ov).

Fig. 6: Analysis of ensemble diversity of UnderBagging+ with respect to ensemble size (one panel per dataset: covtype, traffic, seer, sensors, iot; curves for UnBag+ and UnBag).

VI. CONCLUSIONS AND FUTURE WORKS

In this paper we have addressed the contemporary issue of learning from big data under multi-class imbalance. We have discussed various challenges related to this domain and pointed to a lack of dedicated solutions for such problems. To bridge this gap, we have proposed UnderBagging+ with an efficient implementation on the Apache Spark architecture that is able to learn from large-scale skewed datasets with multiple classes.

UnderBagging+ uses the idea of incorporating the instance-level difficulty into the instance selection process. For each training instance, we analyze its neighborhood. Based on the number of nearest neighbors not belonging to the same class as a given instance, we assign it a difficulty level. Then UnderBagging+ transforms this into a probability metric and uses it to guide the sampling process for creating training bags. This way, difficult instances from each class are more likely to be selected, leading to focusing the base classifiers on these challenging cases. The experimental study demonstrated that this approach allows UnderBagging+ to significantly outperform the reference methods, while maintaining a smaller ensemble size.

Our future works will concentrate on further improvements in the sampling procedures and adding new steps dedicated to alleviating other difficulties of multi-class imbalanced big data, such as class overlapping and extreme class imbalance.

REFERENCES

[1] A. Fernández, S. García, M. Galar, R. C. Prati, B. Krawczyk, and F. Herrera, Learning from Imbalanced Data Sets. Springer, 2018.
[2] B. Krawczyk, "Learning from imbalanced data: open challenges and future directions," Progress in AI, vol. 5, no. 4, pp. 221–232, 2016.
[3] P. Ksieniewicz and M. Wozniak, "Imbalanced data classification based on feature selection techniques," in Intelligent Data Engineering and Automated Learning - IDEAL 2018 - 19th International Conference, Madrid, Spain, November 21-23, 2018, Proceedings, Part II, 2018, pp. 296–303.
[4] A. Fernández, V. López, M. Galar, M. J. del Jesús, and F. Herrera, "Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches," Knowl.-Based Syst., vol. 42, pp. 97–110, 2013.
[5] B. Krawczyk, M. Koziarski, and M. Wozniak, "Radial-based oversampling for multiclass imbalanced data classification," IEEE Transactions on Neural Networks and Learning Systems, pp. 1–14, doi: 10.1109/TNNLS.2019.2913673.
[6] J. A. Sáez, B. Krawczyk, and M. Wozniak, "Analyzing the oversampling of different classes and types of examples in multi-class imbalanced datasets," Pattern Recognition, vol. 57, pp. 164–178, 2016.
[7] A. Fernández, S. del Río, V. López, A. Bawakid, M. J. del Jesús, J. M. Benítez, and F. Herrera, "Big data with cloud computing: an insight on the computing environment, mapreduce, and programming frameworks," Wiley Interdiscip. Rev. Data Min. Knowl. Discov., vol. 4, no. 5, pp. 380–409, 2014.
[8] F. Ahmed, M. Samorani, C. Bellinger, and O. R. Zaïane, "Advantage of integration in big data: Feature generation in multi-relational databases for imbalanced learning," in 2016 IEEE International Conference on Big Data, BigData 2016, Washington DC, USA, December 5-8, 2016, 2016, pp. 532–539.
[9] J. Klikowski and M. Wozniak, "Multi sampling random subspace ensemble for imbalanced data stream classification," in Progress in Computer Recognition Systems, International Conference on Computer Recognition Systems, CORES 2019, Polanica-Zdrój, Poland, 20-22 May 2019, 2019, pp. 360–369.
[10] P. Zyblewski, P. Ksieniewicz, and M. Wozniak, "Classifier selection for highly imbalanced data streams with minority driven ensemble," in Artificial Intelligence and Soft Computing - 18th International Conference, ICAISC 2019, Zakopane, Poland, June 16-20, 2019, Proceedings, Part I, 2019, pp. 626–635.
[11] M. Wozniak, M. Graña, and E. Corchado, "A survey of multiple classifier systems as hybrid systems," Information Fusion, vol. 16, pp. 3–17, 2014.
[12] S. Sharma, C. Bellinger, B. Krawczyk, O. R. Zaïane, and N. Japkowicz, "Synthetic oversampling with the majority class: A new perspective on handling extreme imbalance," in IEEE International Conference on Data Mining, ICDM 2018, Singapore, November 17-20, 2018, 2018, pp. 447–456.
[13] P. Ksieniewicz, "Combining random subspace approach with smote oversampling for imbalanced data classification," in Hybrid Artificial Intelligent Systems - 14th International Conference, HAIS 2019, León, Spain, September 4-6, 2019, Proceedings, 2019, pp. 660–673.
[14] A. Cano, "A survey on graphic processing unit computing for large-scale data mining," Wiley Interdiscip. Rev. Data Min. Knowl. Discov., vol. 8, no. 1, 2018.
[15] A. Cano, A. Zafra, and S. Ventura, "Weighted data gravitation classification for standard and imbalanced data," IEEE Trans. Cybernetics, vol. 43, no. 6, pp. 1672–1687, 2013.
[16] A. Cano, J. M. Luna, A. Zafra, and S. Ventura, "A classification module for genetic programming algorithms in JCLEC," J. Mach. Learn. Res., vol. 16, pp. 491–494, 2015.
[17] A. Cano, D. T. Nguyen, S. Ventura, and K. J. Cios, "ur-CAIM: improved CAIM discretization for unbalanced and balanced data," Soft Comput., vol. 20, no. 1, pp. 173–188, 2016.
[18] S. Ramírez-Gallego, A. Fernández, S. García, M. Chen, and F. Herrera, "Big data: Tutorial and guidelines on information and process fusion for analytics algorithms with mapreduce," Information Fusion, vol. 42, pp. 51–61, 2018.
[19] S. del Río, V. López, J. M. Benítez, and F. Herrera, "On the use of mapreduce for imbalanced big data using random forest," Inf. Sci., vol. 285, pp. 112–137, 2014.
[20] S. Ramírez-Gallego, H. Mouriño-Talín, D. Martínez-Rego, V. Bolón-Canedo, J. M. Benítez, A. Alonso-Betanzos, and F. Herrera, "An information theory-based feature selection framework for big data under apache spark," IEEE Trans. Systems, Man, and Cybernetics: Systems, vol. 48, no. 9, pp. 1441–1453, 2018.
[21] A. Fernández, E. Almansa, and F. Herrera, "Chi-Spark-RS: An Spark-built evolutionary fuzzy rule selection algorithm in imbalanced classification for big data problems," in 2017 IEEE International Conference on Fuzzy Systems, FUZZ-IEEE 2017, Naples, Italy, July 9-12, 2017, 2017, pp. 1–6.
[22] I. Triguero, M. Galar, D. Merino, J. Maillo, H. Bustince, and F. Herrera, "Evolutionary undersampling for extremely imbalanced big data classification under apache spark," in IEEE Congress on Evolutionary Computation, CEC 2016, Vancouver, BC, Canada, July 24-29, 2016, 2016, pp. 640–647.
[23] L. I. Kuncheva, Á. Arnaiz-González, J. Díez-Pastor, and I. A. D. Gunn, "Instance selection improves geometric mean accuracy: a study on imbalanced data classification," Progress in AI, vol. 8, no. 2, pp. 215–228, 2019.
[24] P. Skryjomski and B. Krawczyk, "Influence of minority class instance types on SMOTE imbalanced data oversampling," in First International Workshop on Learning with Imbalanced Domains: Theory and Applications, LIDTA@PKDD/ECML 2017, 22 September 2017, Skopje, Macedonia, 2017, pp. 7–21.
[25] J. Blaszczynski and J. Stefanowski, "Neighbourhood sampling in bagging for imbalanced data," Neurocomputing, vol. 150, pp. 529–542, 2015.
[26] M. Lango and J. Stefanowski, "Multi-class and feature selection extensions of roughly balanced bagging for imbalanced data," J. Intell. Inf. Syst., vol. 50, no. 1, pp. 97–127, 2018.
[27] F. N. Walmsley, G. D. C. Cavalcanti, D. V. R. Oliveira, R. M. O. Cruz, and R. Sabourin, "An ensemble generation method based on instance hardness," in 2018 International Joint Conference on Neural Networks, IJCNN 2018, Rio de Janeiro, Brazil, July 8-13, 2018, 2018, pp. 1–8.
[28] I. Triguero, S. del Río, V. López, J. Bacardit, J. M. Benítez, and F. Herrera, "ROSEFW-RF: the winner algorithm for the ECBDL'14 big data competition: An extremely imbalanced big data bioinformatics problem," Knowl.-Based Syst., vol. 87, pp. 69–79, 2015.
[29] P. Branco, L. Torgo, and R. P. Ribeiro, "Relevance-based evaluation metrics for multi-class imbalanced domains," in Advances in Knowledge Discovery and Data Mining - 21st Pacific-Asia Conference, PAKDD 2017, Jeju, South Korea, May 23-26, 2017, Proceedings, Part I, 2017, pp. 698–710.
[30] A. Benavoli, G. Corani, J. Demsar, and M. Zaffalon, "Time for a change: a tutorial for comparing multiple classifiers through bayesian analysis," J. Mach. Learn. Res., vol. 18, pp. 77:1–77:36, 2017.