Accepted Manuscript

Title: Parallel Swarm Intelligence Strategies for Large-scale Clustering based on MapReduce with Application to Epigenetics of Aging

Authors: Zakaria Benmounah, Souham Meshoul, Mohamed Batouche, Pietro Lio'

PII: S1568-4946(18)30203-5
DOI: https://doi.org/10.1016/j.asoc.2018.04.012
Reference: ASOC 4816

To appear in: Applied Soft Computing

Received date: 29-4-2016


Revised date: 26-3-2018
Accepted date: 10-4-2018

Please cite this article as: Zakaria Benmounah, Souham Meshoul, Mohamed Batouche, Pietro Lio', Parallel Swarm Intelligence Strategies for Large-scale Clustering based on MapReduce with Application to Epigenetics of Aging, Applied Soft Computing Journal (2018), https://doi.org/10.1016/j.asoc.2018.04.012

This is a PDF file of an unedited manuscript that has been accepted for publication.
As a service to our customers we are providing this early version of the manuscript.
The manuscript will undergo copyediting, typesetting, and review of the resulting proof
before it is published in its final form. Please note that during the production process
errors may be discovered which could affect the content, and all legal disclaimers that
apply to the journal pertain.
Parallel Swarm Intelligence Strategies for Large-scale
Clustering based on MapReduce with Application to
Epigenetics of Aging

Zakaria Benmounah1, Souham Meshoul1, Mohamed Batouche1, Pietro Lio'2

1 Computer Science Department, Constantine 2 University, Constantine, Algeria.
2 Computer Science Department, Cambridge University, Cambridge, UK.
Abstract

Clustering is an important technique for data analysis and knowledge discovery. In the context of big data, it becomes a challenging issue because the huge amount of data recently collected renders conventional clustering algorithms inappropriate. Swarm intelligence algorithms have shown promising results when applied to data clustering of moderate size thanks to their decentralized and self-organized behavior, but they exhibit limited capabilities when large data sets are involved. In this paper, we develop a decentralized distributed big data clustering solution using three swarm intelligence algorithms within the MapReduce framework. The framework allows cooperation between the three algorithms, namely Particle Swarm Optimization, Ant Colony Optimization and Artificial Bee Colony, to achieve largely scalable data partitioning through a migration strategy. The latter takes advantage of the combined exploration and exploitation capabilities of these algorithms to foster diversity. The framework is tested using the Amazon Elastic MapReduce service (EMR), deploying up to 192 computer nodes and 30 gigabytes of data. Parallel metrics such as speed-up, size-up and scale-up are used to measure the elasticity and scalability of the framework. Our results are compared with their big data clustering counterparts and show a significant improvement in terms of time and convergence to good-quality solutions. The developed model has been applied to epigenetics data clustering according to methylation features in CpG islands, gene body, and gene promoter in order to study the epigenetic impact on aging. Experimental results reveal that DNA methylation changes slightly and not aberrantly with aging, corroborating previous studies.
Keywords: Big data, Swarm intelligence, Large-scale clustering, MapReduce, Epigenetics, Aging
2010 MSC: 00-01, 99-00

1. Introduction

Over the last few decades, developed countries in Europe, Asia, and North America have experienced a significant change in the age profile of their demographic structure, with a steady increase in the number of adults aged 65 years or older. Although longer lifespan is a clear sign of progress and improved quality of life, the considerable growth of the elderly population poses far-reaching social and economic challenges regarding the fiscal sustainability of the welfare state. As such, health-care systems and related services have to be rethought to treat aging as a manipulable long-term biological process whose detrimental effects can be limited or postponed rather than passively accepted as inevitable.

Key research challenges need to be effectively and efficiently addressed to cope with the ever-growing amount of epigenetics data that is being exponentially produced. Traditional techniques and tools for data analytics and autonomous learning are no longer suitable, and even unusable, for extracting human-interpretable knowledge and information from such an enormous and complex amount of data. Therefore, new revolutionary approaches and tools are more than required; among these tools we highlight clustering.

Almost 60 years have passed since the first clustering algorithm was proposed [1]. Cluster analysis aims at grouping data points into separate groups called clusters. It plays a versatile role in knowledge discovery, and it is used in a myriad of fields to extract hidden relationships among data. An up-to-date review of the application of clustering analysis can be found in [2]. A plethora of clustering algorithms has been designed to deal with different types and distributions of data. The exponential increase of data makes cluster analysis even more challenging than before. Two broad approaches have emerged to alleviate the issue: either reducing the dimensionality of the data (dimensionality reduction) [3] or reducing the number of samples within a dataset (sampling) [4]. However, both approaches have proved to be ineffective when a single machine is used, as the prohibitively large amount of data cannot be kept on a single computer. Therefore, multiple machines are needed, and parallel processing of data is undeniably necessary.

Hardware accelerators such as Field Programmable Gate Arrays (FPGA) and Graphics Processing Units (GPU) have emerged recently as promising technology drivers [5]. On the other hand, Application Programming Interfaces (APIs) such as the Message Passing Interface (MPI) and OpenMP have traditionally provided a software-oriented approach. However, when dealing with parallel programming languages, additional concerns have to be considered, such as load balancing, the communication flow, the topology choice, the split of data, etc. This makes designing a parallel algorithm a very tedious task. To deal with these concerns, a new open-source framework called Apache Hadoop, consisting of a storage part, namely the Hadoop Distributed File System (HDFS), and a processing part, namely MapReduce (MR), has emerged lately. HDFS is a distributed file system that provides high-performance access to data across Hadoop clusters by managing pools of big data and handling big data analytics applications. MapReduce, designed by Google [6], provides a new methodology for thinking about and developing parallel algorithms suitable for large-scale systems without being concerned about scalability, as MapReduce is auto-scalable. The idea was inspired by the map and reduce primitives characterizing the functional programming language LISP. Hadoop encapsulates the details of parallelization, fault tolerance, data distribution and load balancing. It demonstrates great performance in big data scenarios, especially for challenging tasks such as clustering.

Nevertheless, data clustering is an NP-hard problem. If the number of clusters exceeds three, the number of alternative ways to group the data is k^n/k!, where k is the number of groups and n is the number of data points to be clustered. However, data clustering can easily be cast as a global optimization problem of finding the partition that maximizes/minimizes an objective function. This can be appropriately tackled using metaheuristics such as Ant Colony Optimization (ACO), Artificial Bee Colony (ABC), Particle Swarm Optimization (PSO) and so forth. These metaheuristics exhibit different dynamics leading to distinct strategies that can be effectively combined to handle hard optimization problems such as clustering large datasets.
In this paper, we present a novel scalable and elastic framework for clustering large datasets. The framework is a generalized island model (GIM) that includes three swarm intelligence algorithms cooperating with each other through a migration operator. The main motivation behind the use of such a GIM is to combine the search capabilities of the three algorithms for a more efficient exploration of the search space. The GIM is designed in a parallel and distributed manner so as to help the three cooperating algorithms converge to high-quality solutions in a short time. The parallel and distributed architecture is designed using the MapReduce programming model to deal with the huge data volume. First, a MapReduce model is derived for each of the cooperating algorithms separately; these models are then merged to achieve the overall framework, which we refer to as MapReduce Clustering-GIM (MRC-GIM). The framework is validated on up to 192 interconnected computers and large real datasets (more than 30 GB). The comparative study reveals that MRC-GIM outperforms recently developed algorithms dedicated to big data clustering in terms of convergence time and solution quality.

Subsequently, we use MRC-GIM to investigate the correlation between epigenetics and aging. As a result of this application, we found that epigenetics changes slightly and not aberrantly with aging, which confirms previous evidence shown in [52, 53].

The remainder of the paper is organized as follows: in Section 2, a brief description of the background material is given. In Section 3, a review of algorithms for clustering large data sets is provided. Section 4 is devoted to the presentation of the proposed framework. In Section 5, the performance of MRC-GIM is evaluated using large datasets and parallel metrics. In Section 6, we study the correlation between epigenetics and aging, and finally, in Section 7, conclusions are drawn.

2. Background and literature review

2.1. Clustering Analysis

Clustering analysis is a central topic in computer science and statistics, whose purpose is to find groups in a given data set, according to a notion of similarity or homogeneity that is maximized for elements in the same group (or cluster). Formally, given a data set V, a clustering is a partition τ(V) = {c_1, ..., c_K} of V into non-empty and pairwise disjoint subsets c_i, i = 1, ..., K, whose union is V.

As a major field of research in exploratory data analysis, a large number of clustering algorithms have been proposed over the years to find "good" partitionings that can uncover latent structures in the data. Clustering approaches can be divided into two major classes: partitional clustering and hierarchical clustering.

Partitional clustering algorithms split the dataset into groups based on a two-step iterative process. Given an initial set of cluster representatives, such as centroid locations as in k-means [7] or centroid data points as in k-medoids [8], the procedure alternates an assignment step, where each data point is assigned to the cluster with the closest representative, and an update step, where cluster representatives are recomputed.

Hierarchical clustering algorithms build a hierarchy of clusters by either merging smaller clusters into larger clusters (agglomerative clustering) or by splitting larger clusters into smaller clusters (divisive clustering). As a result, a hierarchical clustering algorithm produces a tree of clusters, called a dendrogram, which shows how clusters and data points are related.
2.2. MapReduce Model

A MapReduce program generally takes place in three successive jobs: the Map, the Auxiliary and the Reduce jobs, as depicted in Figure 1.

[Figure: the whole dataset V is split and processed by map tasks emitting <key, value> pairs (e.g. <k1,v1>, <k2,v2>, <k1,v3>, <k3,v4>); an auxiliary phase sorts, partitions and combines them; reduce tasks receive <key, [values]> pairs such as <k1,[v1,v3]>.]
Figure 1: MapReduce framework, generally composed of the map job followed by the reducer job.

The map job is a preprocessing of the split data where each split part is considered as a value to which a key is attributed; all values with the same key are submitted to the same reducer. More formally, let V denote the whole dataset; after splitting the data we get V = v_1 ∪ v_2 ∪ ... ∪ v_p, where p is the number of split elements. The mapper creates a function V → key, where key = key_1, key_2, ..., key_r and r denotes the number of keys (r << p). A key key_i can be associated with more than one value. The Auxiliary is an optional function that takes place between the mappers and the reducers when some additional preprocessing tasks are required. The reducer receives a set of pairs of the form <key, value>; the values with the same key are grouped together in the same reducer, which results in one key per reducer.
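To make this key/value flow concrete, here is a minimal Python sketch (our own illustrative code, not part of the paper's Hadoop implementation) that simulates the map, auxiliary (shuffle) and reduce phases on in-memory splits; in the toy usage the keys are cluster ids and the reducer averages the values assigned to each key.

from collections import defaultdict

def map_phase(splits, map_fn):
    """Apply the map function to every split and emit <key, value> pairs."""
    pairs = []
    for split in splits:
        pairs.extend(map_fn(split))
    return pairs

def shuffle_phase(pairs):
    """Auxiliary phase: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped, reduce_fn):
    """Each reducer receives one key with the list of its values."""
    return {key: reduce_fn(key, values) for key, values in grouped.items()}

# Toy usage: keys are cluster ids, values are data points; reducers average them.
splits = [[(0, 1.0), (1, 4.0)], [(0, 3.0), (1, 6.0)]]
pairs = map_phase(splits, lambda split: split)           # identity map
centers = reduce_phase(shuffle_phase(pairs),
                       lambda k, vs: sum(vs) / len(vs))  # per-key mean
print(centers)  # {0: 2.0, 1: 5.0}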

2.3. Swarm intelligence clustering

Swarm intelligence algorithms have been widely used in data clustering. In [9], a constrained ACO (C-ACO) has been developed to handle arbitrarily shaped clusters and outlying data. An adaptive ACO algorithm was proposed in [10] to determine the optimal number of clusters and also to improve the convergence rate. It imitates the ants' behavior of grouping dead bodies based on their size so as to keep the nest clean. In fact, the first algorithm using this concept was proposed by Lumer and Faieta [11]. In [12], several ant colonies are used in parallel to find clustering solutions that are sent to the queen (each colony has one queen). At the final stage, the results at queen level are combined using a hypergraph model. From another side, PSO-based clustering was first introduced by Omran et al. in [13]. Their results showed that PSO-based clustering outperforms k-means and FCM (fuzzy c-means). The authors in [14] introduced a quantum-behaved PSO for clustering analysis applied to gene expression profiling. Also, to determine the unknown number of clusters, a recent clustering algorithm was developed by Cura et al. [15]. Clustering has been tackled as well using the ABC metaheuristic [16]. In [17], the authors proposed the ABCK algorithm, which tries to find the centroid initialization to be given as an input to the k-means algorithm. Algorithms based on these nature-inspired metaheuristics suffer from premature convergence, and many tentative methods to avoid this issue have been proposed in the literature. In [18], Ghosh et al. proposed a variant of ACO known as the Aggregation Pheromone Density-Based Clustering (APC) algorithm, which attempts to solve the issue using a pheromone matrix updating policy. Cohen and Castro in [19] proposed a particle swarm clustering algorithm in which the particle velocity update is influenced by the particles' previous positions along with cognitive, social and self-organizing terms; the aim is to guide the particles towards better solutions while avoiding local optima. In [20], an ABC algorithm proposes the introduction of a new group of bees, called the scout bees, to create a random search that helps the employed bees find the food source. All the methods reported so far remain limited in terms of quality of results and induce an extra cost regarding time complexity [21].

2.4. Big data clustering

Since the inception of MapReduce (MR), several MR-based clustering algorithms have been proposed. Most of these methods are based on different MR schemes of k-means. Among them, PKMeans [25] follows the classical k-means procedure and runs in one iterative MR job. In the map phase, centroids are computed as the weighted average of all points within a cluster; then, in the reduce phase, the algorithm updates the new centers. An additional auxiliary phase is set up to combine the intermediate data of the same map task. Further, k-means++ was designed to optimize the selection of initial centers, starting by choosing them uniformly at random and then adding potential centers one by one in a controlled way until reaching k centers. K-means|| [31] came as an improvement of k-means++: rather than sampling a single point in each pass like k-means++, k-means|| samples O(k) points in each round and repeats the process for approximately O(log n) rounds to get better accuracy. Unlike the aforementioned k-means-based methods, which perform a series of k-means tasks serially, Mux-Kmeans [37] performs multiple k-means tasks concurrently by taking various centroids together. The algorithm starts by evaluating the clusters, selecting the best task and finally integrating the new tasks. Efficient k-means++ [38] is another improvement of k-means++ which uses only one MR job to obtain the k centers. The map task consists of the k-means++ initialization, followed by the weighted k-means++ which is performed in the reduce phase. The algorithm integrates a pruning strategy that computes the distances without redundant time; as an additional advantage, the selection of k (the number of clusters) is automatic.

By using k-means with sampling, X. Cui et al. [40] proposed an algorithm that takes place in three MR jobs. Data summarization has been used as a baseline for BigK-Clustering [34]. The key idea behind the algorithm is to divide the data into sub-data and group them separately, which results in micro-clusters, then merge the closest micro-clusters using the equivalence relation, and finally calculate the centers of the final groups. DBCURE-MR [39] and MR-DBSCAN [28] are two density-based algorithms that identify clusters as highly dense areas separated by sparsely dense areas. Both methods employ a sampling approach to reduce the size of the large data and carry out the processing of the reduced data to obtain centers, which are then used to cluster the original data. Abhinandan Das et al. redesigned the MinHash algorithm [23] in a more scalable way using MapReduce; this algorithm can capture multiple interests of users shared within a cluster. DisCo [24] is based on co-clustering which, unlike clustering, attempts to cluster both samples and items at once.
Table 1: Summary of clustering algorithms based on MapReduce (sz: data size, nd: number of computer nodes, pt: data points, dim: dimensions, syn: synthetic)

Algorithm | Data | Platform
Abhinandan Das et al. [23] | Real (943, 5000, 500000) users | none
Spiros Papadimitriou et al. (DisCo) [24] | Raw dataset, sz (100-135 GB); graph dataset, sz (170 MB-4.3 GB) | 39 nd with 2 Intel Xeon (2.7-3 GHz), 8 GB of RAM
Weizhong Zhao et al. (PKMeans) [25] | sz (1 GB-8 GB) | (1-4) nd with 2 cores (2.8 GHz), 4 GB of RAM
Robson L. F. Cordeiro et al. (BOW) [26] | Real: (62 million-1.4 billion) pt, sz (0.014-0.2) TB; syn: (100,000-100 million) pt * 15 dim | Yahoo Cluster M45: 400 nd, 3.5 TB RAM; DISC: 64 machines = 512 cores + 1 TB of RAM
Hai-Guang Li et al. [27] | Small datasets | 1 nd Intel Core Duo (2.10 GHz), 2.00 GB RAM
Yaobin He et al. (MR-DBSCAN) [28] | Real (GPS location records): 0.32-1.92 billion pt, sz (8.4 GB-50.4 GB) | 13 nd Intel Core i7 950 (3.0 GHz), 8 GB DRAM
Alina Ene et al. (Parallel Lloyd's) [29] | syn (Zipf distribution): 10,000-10,000,000 pt | 1 nd Intel Core i7 (2.93 GHz) with 8 GB RAM (100 nd simulation)
Ibrahim Aljarah et al. (MR-CPSO) [30] | Real and syn: (2,000-32,000,000) pt * (2-54) dim, sz (0.14-1320.8) MB | Longhorn Hadoop cluster: 48 nd, 8 Intel Nehalem cores (2.5 GHz), 48 GB RAM; NDSU Hadoop cluster: 18 nd, 4 Intel cores (2.67 GHz each), 6 GB of RAM
Bahman Bahmani et al. (k-means||) [31] | syn: GaussMixture 10,000 points; real: (4601-4.8 M) pt * (42-58) dim | 1968 nd, 2 quad-core (2.5 GHz), 16 GB RAM
Fei Gao et al. [32] | syn: 1024 to 4 million pt * 64 dim; real (Wikipedia): sz (1 M-2 G) | Local cluster: 5 Core2 Duo E6550 (2.33 GHz) with 1 GB DRAM
Trilce Estrada et al. [33] | real (molecular geometries), sz (48 MB-1 TB) | 16 nd Gordon ION (192 cores, 2.7 GHz) with 3 GB of RAM
Yuqing Miao et al. (BigKClustering) [34] | syn (normal dist.), sz (1 GB-3 GB) | 3 nd with 2 Intel Core (3.1 GHz), 4 GB of RAM
Chen Jin et al. (DiSC) [35] | syn: 500,000 data pt * 10 dim | JESUP Hadoop cluster
Pelle Jakovits et al. (CLARA) [36] | real (handwritten digits), 25,000-10,000,000 pt | 17 nd with CPU (2.2 GHz), 500 MB RAM
Chen Li et al. (Mux-Kmeans) [37] | real (17,770-359,330) pt * (40-1000) dim | EC2: 16 nd, 2 ECUs and 1 CPU, 3.7 GiB RAM
Yujie Xu et al. (Efficient K-means++) [38] | real: sz (5.67 GB); syn: 20 million pt * 128 dim, sz (15 GB) | 12 nd with 2 AMD Opteron 2212 (2.00 GHz) and 8 GB of RAM
Younghoon Kim et al. (DBCURE-MR) [39] | Real: (164,860-164,860*50) pt * 3 dim, sz (21 MB-1.05 GB); syn: (5,000,000-25,000,000) pt * (2-8) dim, sz (1.7 GB) | 20 nd with 2 Duo (2.66 GHz), 2 GB of RAM
Xiaoli Cui et al. [40] | syn (Gauss distribution): 10,000 pt * 3 dim; Real: 2,351,710,420 pt * 3 dim; Real: 4,296,075,259 pt * 9 dim | 16 nd, 2-core AMD Opteron (2.00 GHz), 2 GB of RAM
Fei Gao et al. designed an approximate spectral clustering method which uses kernel-based machine learning algorithms. The method has been shown to efficiently process very large-scale datasets; the objective is to optimize the computation and memory overhead required to compute the kernel matrix [32]. The approach is independent of the employed kernel-based machine learning method and was tested on Amazon Elastic MapReduce. Trilce Estrada et al. [33] presented a scalable method dedicated to molecular conformations, such as ligands and peptides, to identify conformations that are accurate representatives of the native conformations. The algorithm starts by reducing the conformation space; afterwards, it combines the geometrical knowledge of conformation similarities based on tree clustering. A considerable decrease of parallel runtime has been witnessed, from 5 hours to 20 minutes (12-196 nodes). Hierarchical clustering has also been reimplemented using MapReduce. As an example, DiSC is a hierarchical clustering algorithm [35] designed according to the MapReduce framework which runs in a fairly small number of MR rounds by reducing the hierarchical clustering single-linkage problem to the minimum spanning tree (MST) problem. CLARA is a medoid-based clustering algorithm [36] which, unlike centroid-based methods, chooses real data points as centers. The algorithm has been redefined in terms of MapReduce jobs, showing how to adapt a non-embarrassingly parallel algorithm to a platform that is dedicated to embarrassingly parallel methods; in spite of this, the efficiency and scalability have been barely affected. Finally, MR-CPSO [30] is based on the particle swarm optimization (PSO) algorithm, originally developed by taking inspiration from the flocking behavior of birds. The algorithm is implemented in three MR stages, each of which consists of a PSO operation. In the first stage, the centroids of each individual particle are updated using the classical PSO equations that update the velocity and the position; afterwards, an evaluation of the fitness generated in the first stage is computed; then, the last stage is devoted to merging the results of the previous stages and identifying the personal best and global best centroids. In Table 1, a summary of all these algorithms is provided, including the basic clustering features used as described above, with extra information related to the type and volume of the data sets and the platform employed for each algorithm.
All the above-described methods based on k-means or k-medoids require the number of clusters to be known in advance, which is also the case in our developed method.

From this brief review, one can notice that attempts to develop MapReduce clustering methods are at their debut, in particular swarm-based algorithms, which still need to be rethought in this context despite their success in solving NP-hard problems and particularly data clustering. On the other side, cooperation among metaheuristics has been shown to be a very promising way to efficiently solve complex problems. Therefore, in our work we investigate the potential of such cooperation to handle clustering of large data sets.

3. Highly scalable clustering based on MapReduce

A generalized island model is a parallel distributed scheme which is amenable to large-scale clustering. To perform clustering of large data sets using a GIM, we first need to redesign each of the used metaheuristics according to the MapReduce model. In the proposed GIM, the three swarm-based algorithms PSO, ACO and ABC cooperate to find the best partition that optimizes a given cluster quality measure. In our study, we consider the total within-cluster variance, or cohesion measure, as the objective function. The latter describes how close the data points within each cluster are to each other. It is defined as follows:

(1/n) Σ_{j=1..k} Σ_{x_i ∈ z_j} ||x_i − z_j||²    (1)

where n is the number of items or data points in the dataset, k the number of clusters, x_i, {i = 1, ..., n}, the location of the i-th item to be clustered and z_j the center of cluster c_j. Therefore, clustering is cast as an optimization task that aims to minimize the cohesion measure.
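As a concrete reading of Eq. (1), the following Python sketch (assuming NumPy arrays for the data points; the helper name cohesion is ours) computes the cohesion of a given assignment of points to cluster centers.

import numpy as np

def cohesion(points, labels, centers):
    """Eq. (1): mean squared distance of every point to the center of its cluster."""
    total = 0.0
    for j, center in enumerate(centers):
        members = points[labels == j]          # points assigned to cluster j
        if len(members) > 0:
            total += np.sum(np.linalg.norm(members - center, axis=1) ** 2)
    return total / len(points)

# Toy usage: 4 two-dimensional points split into k = 2 clusters.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
lbl = np.array([0, 0, 1, 1])
ctr = np.array([pts[lbl == j].mean(axis=0) for j in range(2)])
print(cohesion(pts, lbl, ctr))  # small value => compact clusters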
Our proposed framework is based on the following assumptions:

1. The data is huge and cannot be stored on a single computer,
2. The clustering solution can be distributed among different computers,
3. The clustering solution is constructed gradually.

In the rest of this section, we consider the following notations:

1. v = {item_1, item_2, ..., item_n}: the set of n items or data points to be grouped into clusters,
2. m: the number of artificial ants or bees,
3. k: the number of clusters,
4. p_i = {(item_j, c_l)}, where {i = 1, ..., m}, {j = 1, ..., n} and {l = 1, ..., k}: the clustering partition encoded as a set of pairs, each composed of an item and its assigned cluster,
5. f_i: the cohesion value of the clustering solution p_i,
6. c = {z_1, z_2, ..., z_k}: the set of centroids.

In the following, we describe the proposed MapReduce-based ACO and ABC. Afterward, we describe the overall architecture through which cooperation between ACO, ABC and the PSO introduced in [30] is achieved.
3.1. Ant colony optimization based on MapReduce

ACO is a swarm-based metaheuristic developed by Dorigo et al. [41] to solve NP-hard optimization problems. It is inspired by the foraging behavior of a colony of ants. A detailed description of this metaheuristic and its biological metaphor can be found in [42]. Given an optimization problem to be solved, the typical dynamics of an ACO algorithm can be defined as follows. A set of cooperating artificial ants is used to construct solutions incrementally. Each ant constructs a solution and deposits an amount of pheromone on each solution component according to the overall quality of the solution found. During the solution construction step, an ant selects the next solution component according to a probabilistic rule which depends on the corresponding amount of pheromone and the value of a problem-specific heuristic. Many variants of the ACO algorithm exist, depending on the policies used to update pheromone trails and to select solution components.

In our work, we adopted an ACO algorithm for data clustering inspired by the ant k-means described in [43]. The algorithm aims to find the partitions that minimize the cohesion measure defined above. A solution component is viewed as an assignment of a data point to a cluster. Let us adopt the following notations:

1. γ: the threshold value,
2. φ(i, j), where {i = 1, ..., n}, {j = 1, ..., k}: an n by k matrix that represents the pheromone value of each item in each cluster.

The proposed MapReduce ACO algorithm (MRC-ACO) is composed of three stages; the first and second are parallel and based upon MapReduce, while the last stage is sequential, as shown in Figure 2.
[Figure: the three stages of MRC-ACO. Stage 1: (S*M) mappers <id, sub-vector> assign clusters in parallel; (K*M) reducers <r, [item_i, c_i]> compute the center and the cohesion. Stage 2: (M) mappers <i, Ant_i> compute the cohesion mean, followed by stratifying. Stage 3: update of the pheromone matrix, then loop.]
Figure 2: MRC-ACO: Ant Colony Optimization for clustering large datasets based on the MapReduce framework.

The ACO algorithm for data clustering can be outlined as follows:

1. During an initialization phase, each ant builds a solution by assigning items to clusters randomly. The same amount of pheromone is assigned to each of all possible pairs (item, cluster).
2. During an iterative phase, each ant updates the amount of pheromone of each of its solution components, namely a pair (item, cluster), and constructs another solution by assigning items to clusters according to a probability that depends on the amount of pheromone, the threshold value, and the heuristic value related to the cohesion measure. During the iterations, the best partition, i.e. the one with the smallest cohesion value found so far, is kept.
3. At the end of the iterative process, the best partition is returned.

To handle the big data, we split each p_i vector into s sub-vectors, where s is an integer tuned with regard to the size of the data points and the available memory capacity. The sub-vectors are loaded on the mappers; the mappers then assign in parallel to each item_i the cluster number c_i following a guided probabilistic search described in Algorithm 1. Afterward, the mappers attribute the same key to all sub-vector pairs (item_i, c_i) that satisfy the following two conditions:

1. The pairs belong to the same p_i parent vector.
2. The c_i of each pair has been assigned to the same cluster.

This is done through the use of the MD5 [34] function, which is a cryptographic hash function that provides a digital fingerprint. Subsequently, the mappers transmit the pairs with their keys to the reducers, where the same key is transmitted to the same reducer. Increasing s decreases the granularity (the size of the parallel task) per mapper, giving rise to a shorter processing time. The number of parallel tasks at this stage is m multiplied by s.
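A minimal sketch of this keying step, assuming Python's standard hashlib module and hypothetical arguments parent_id (the index of the p_i parent vector) and cluster_id, could look as follows; pairs that share both values hash to the same key and therefore reach the same reducer.

import hashlib

def make_key(parent_id: int, cluster_id: int) -> str:
    """Build the shuffle key from the parent p_i vector and the assigned cluster."""
    fingerprint = f"{parent_id}:{cluster_id}".encode("utf-8")
    return hashlib.md5(fingerprint).hexdigest()

# Pairs (item, cluster) of the same parent vector and cluster share one key.
print(make_key(parent_id=3, cluster_id=7))   # hexadecimal MD5 digest
print(make_key(3, 7) == make_key(3, 8))      # False: different clusters, different reducers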
At the reducer job, the transmitted pairs are merged to form one cluster of a p_i vector, which is the purpose of applying both conditions above. In parallel, each reducer computes the center and evaluates the clustering quality by computing the cohesion as described in Equation 1. The granularity of the system increases compared to the first stage, as the number of parallel tasks equals k multiplied by m. The purpose of splitting the data per cluster is to contribute to the decrease of the granularity, which yields more tasks in parallel and thus a faster system.

In the second stage, another MapReduce round is launched to evaluate the quality of each p_i vector and to update the pheromone matrix. Firstly, a light map stage receives, in each mapper, the cohesion values of a p_i vector and computes their mean. The mean describes how good the overall clustering solution is. In this round, the number of tasks in parallel is m. At this point, all p_i possess a complete clustering solution with its quality calculated. An auxiliary phase is employed to sort the p_i vectors in ascending order of their cohesion values, preparing them for the pheromone update stage.
Algorithm 1: Do in parallel for each mapper_i
  Input: <key: id, value: sub-vector>
  extract φ;
  for each item_i in sub-vector do
      generate random number r;
      if r ≤ γ then
          pMax = max(φ(item_i, ∀));                 // the highest pheromone value
          c_i = find(φ(item_i, ∀), pMax);           // the cluster where the highest pheromone lies
      else
          pSom = Σ_{h=1..k} φ(item_i, h);           // the sum of the recorded pheromone
          for each c_l, l > 1, in k do
              φ(item_i, c_l) = (φ(item_i, c_{l-1}) + φ(item_i, c_l)) / pSom;   // the shared pheromone
          end
          generate random number rr;
          for each c_l in k do
              if rr ≤ φ(item_i, c_l) then
                  c_i = c_l;
              end
          end
      end
  end
  for each pair (item_i, c_i) in sub-vector do
      generate a key_i;
      transmit (item_i, c_i);                        // via the use of MD5
  end
The last stage is sequential and devoted to updating the pheromone matrix. For each assigned cluster c_i in the best solution found so far, the pheromone value φ is updated as follows:

φ(i, j) = ((1 − ρ) φ(i, j) + 1) / f_best    (2)

where ρ is the evaporation rate and f_best is the best cohesion value found so far.
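The per-item assignment rule of Algorithm 1 and the pheromone update of Eq. (2) can be sketched in Python as below. The pheromone matrix is assumed to be a NumPy array and the helper names are ours; the exploration branch uses a standard roulette-wheel selection as an illustrative reading of the shared-pheromone loop, so this is a sketch rather than the authors' implementation.

import numpy as np

def assign_cluster(pheromone_row, gamma, rng):
    """Pick a cluster for one item: exploit the best pheromone with probability gamma,
    otherwise sample a cluster proportionally to its pheromone (roulette wheel)."""
    if rng.random() <= gamma:
        return int(np.argmax(pheromone_row))                 # exploitation
    probs = pheromone_row / pheromone_row.sum()              # shared pheromone
    return int(rng.choice(len(pheromone_row), p=probs))      # exploration

def update_pheromone(pheromone, best_labels, rho, f_best):
    """Eq. (2): reinforce the (item, cluster) pairs of the best partition found so far."""
    for i, j in enumerate(best_labels):
        pheromone[i, j] = ((1.0 - rho) * pheromone[i, j] + 1.0) / f_best
    return pheromone

rng = np.random.default_rng(0)
phero = np.ones((5, 3))                 # 5 items, 3 clusters, uniform pheromone
labels = [assign_cluster(phero[i], gamma=0.9, rng=rng) for i in range(5)]
phero = update_pheromone(phero, labels, rho=0.1, f_best=2.5)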

3.2. Artificial bee colony clustering based on MapReduce

We designed an artificial bee colony clustering based on MapReduce (MRC-ABC). MRC-ABC is composed of four stages, described in Figure 3.

[Figure: the four stages of MRC-ABC. Stages 1 and 2 run as in MRC-ACO. Stage 3 (employed bees): mappers <cohesion_i, [cohesions]> compute nectar_i; reducers <nectar_i, [nectars]> compute p_i. Stage 4 (onlooker bees): mappers <i, (nectar_i, cycle_i)> choose one solution according to p_i; reducers produce a new solution Pfood_i; then loop.]
Figure 3: MRC-ABC: an Artificial Bee Colony Clustering for large datasets based on the MapReduce framework.

The following assumptions are considered [44]:

1. There are three groups of artificial bees: employed bees, onlookers, and scouts. The employed bees produce a food source (solution) and share it with the onlookers in the dance area. The onlookers choose one food source with a probability related to its nectar amount and produce for it distinct new solutions. The job of the scout is to check whether a solution has not been chosen by the onlookers for many cycles; if so, the abandoned solution is replaced with a newly generated one.
2. nectar_i refers to the nectar amount of the food source; it evaluates how good a solution is and is calculated as follows:

   nectar_i = cohesion(Food_i) / Σ_{s=1..M} cohesion(f_s)    (3)

3. pb_i refers to the probability of a solution p_i being chosen by the onlookers; it is computed as follows:

   pb_i = nectar_i / Σ_{s=1..M} nectar_s    (4)

4. To avoid getting stuck in a local minimum, ABC employs a variable cycle_i for each p_i, which is incremented when a food source is not chosen by any onlooker bee. When cycle_i reaches the maximum cycle number mcn, the scouts eliminate the food source and replace it with a new one.
5. To produce a new potential food position from an old food source, the artificial bees employ the following expression (a sketch of Eqs. (3)-(5) follows this list):

   pfood_i = p_i + ϕ(p_i − p_s)    (5)
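The sketch announced above implements Eqs. (3)-(5) in Python under the assumption that each food source p_i is a flat numeric vector and that the cohesion of every solution is already known; the helper names are ours, and ϕ is drawn uniformly from [-1, 1], a common choice that the paper does not prescribe.

import numpy as np

def nectar(cohesions):
    """Eq. (3): nectar of each food source, from the cohesion of every solution."""
    cohesions = np.asarray(cohesions, dtype=float)
    return cohesions / cohesions.sum()

def onlooker_probabilities(nectars):
    """Eq. (4): probability pb_i of a solution being chosen by an onlooker."""
    nectars = np.asarray(nectars, dtype=float)
    return nectars / nectars.sum()

def new_food_source(p_i, p_s, rng):
    """Eq. (5): perturb solution p_i using a randomly chosen neighbour p_s."""
    phi = rng.uniform(-1.0, 1.0, size=p_i.shape)   # random coefficient per component
    return p_i + phi * (p_i - p_s)

rng = np.random.default_rng(1)
cohesions = [12.0, 9.8, 11.1]                      # one value per food source
pb = onlooker_probabilities(nectar(cohesions))
p_new = new_food_source(np.array([0.2, 0.7]), np.array([0.5, 0.1]), rng)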


In MRC-ABC, we exploit the same technique used in MRC-ACO: all p_i are split into sub-vectors, and the first two stages of MRC-ACO are adopted in MRC-ABC. However, to imitate the dynamics of ABC, the first assignment of clusters is done randomly. After the completion of stages one and two, all vectors possess a cohesion value for their cluster assignment.

Stage three is devoted to computing nectar_i and pb_i for each food source. To compute nectar_i, each mapper receives all the computed cohesion values; the results are then transmitted to the reducers, as shown in Algorithm 2. Afterward, each single reducer receives all the nectar_i computed in the mapper step and calculates for each p_i the value pb_i, as stated in Algorithm 2. The number of mappers/reducers required in this round equals m. By loading only the cohesion values in each mapper and the nectar amounts in each reducer, we drastically reduce the size of the values loaded into the mapper's (or reducer's) memory compared to the first and second stages.
Algorithm 2: MRC-ABC: stage three
  Mapper: Do in parallel for each mapper_i
    Input: <key: cohesion_i, list of values: [cohesions]>
    compute nectar_i for p_i using Eq. (3);
    emit(i, nectar_i) to all reducers;
  Reducer: Do in parallel for each reducer_i
    Input: <key: nectar_i, list of values: [nectars]>
    compute pb_i for p_i using Eq. (4);

After stage three is achieved, all p_i have a complete clustering solution with their nectar_i and pb_i assessed in memory. In the last round, the onlooker bees choose one solution according to its pb_i value and send only the chosen ones to the reducers; if p_i is not chosen, the variable cycle_i is incremented, see Algorithm 3.
Algorithm 3: MRC-ABC: stage four
  Mapper: Do in parallel for each mapper_i
    Input: <key: i, value: (pb_i, cycle_i)>
    if pb_i is chosen then
        emit(i, pb_i);
    else
        cycle_i = cycle_i + 1;
    end
  Reducer: Do in parallel for each reducer_i
    Input: <key: i, value: 0>
    if pb_i is chosen then
        produce new solution Pfood_i using Eq. (5);
    else
        if cycle_i equals mcn then
            produce a random solution and replace p_i;
        end
    end

The reducers of the last stage imitate the scouts: they receive all the chosen pb_i and produce from them a new clustering solution pfood_i using Equation 5. The reducers also check whether any cycle_i has reached the maximum cycle number mcn. If so, the solution is considered abandoned and replaced by a randomly generated one, see Algorithm 3.

3.3. The Generalized Island Model based on MapReduce

GIM (Generalized Island Model) is a parallel model which can be applied to a large class of optimization problems. Based on metaheuristic cooperation, it allows an effective parallel execution of multiple algorithms across multiple islands. This cooperation is maintained by the exchange of solutions between these islands. The exchange operator (migration operator) aims at improving the overall performance of the different algorithms used [8].

To enable cooperation, we set up in parallel our proposed MRC-ACO and MRC-ABC together with the MR-CPSO proposed in [30]. We then integrate an exchange operator to enable solution migration from one algorithm to another. Each algorithm acts as an island, and the islands cooperate by selecting a subset of solutions (according to an elitist selection strategy S), exchanging the selected solutions using the migration operator, and finally recombining them with local solutions following a recombination strategy R (i.e. replacing the worst solutions). In the proposed model, this has been achieved in Hadoop by managing the HDFS, which is logically divided using Centralized Cache Management (CCM). We create four logical memories: three of them represent a distributed memory associated with each algorithm, and one shared memory between the algorithms serves as a migrant solution deposit, acting as a buffer to exchange solutions in an asynchronous way (Figure 4).

[Figure: MRC-ACO, MRC-ABC and MR-CPSO each run their routines over their own logical HDFS memory (HDFS-ACO, HDFS-ABC, HDFS-PSO) and submit/load migrant solutions through an HDFS shared memory.]
Figure 4: HDFS management for the Generalized Island Model in the Hadoop architecture (MRC-GIM).

Without deploying a barrier operation, no processor sits idle waiting for another one. Thus, this technique increases the load balancing, yielding a faster system. The dynamics of MRC-GIM is described in Algorithm 4.
Algorithm 4: Do in parallel for MRC-ACO, MRC-ABC, and MR-CPSO
  P: initial population;
  S: selection strategy;
  R: recombination policy;
  A_i: MRC-ACO, MRC-ABC or MR-CPSO;
  initialize P;                                // initialize population
  while non stop criteria do
      P' = A_i(P);                             // P' is the new population
      M = S(P');                               // select M solutions from P'
      submit M to the shared HDFS memory;      // share the results with the other SI algorithms
      load M' from the HDFS;                   // extract the shared solutions
      P'' = R(P', M');                         // combine M' with P' to generate a new population
      P = P'';
  end
The recombination policy consists of combining the migrant populations with the local ones to obtain more diversity for the next iteration, which is the key task of the cooperation.
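The island loop of Algorithm 4 can be sketched in Python as follows; each algorithm is represented by a placeholder step function and the shared HDFS memory by a plain list, so this is only a schematic view of the selection/migration/recombination cycle, not the Hadoop implementation.

import random

def run_island(step, fitness, population, shared_memory, n_migrants, iterations):
    """One island of the GIM loop in Algorithm 4: evolve, emit and absorb migrants."""
    for _ in range(iterations):
        population = step(population)                          # P' = A_i(P)
        population.sort(key=fitness)                           # lower cohesion is better
        migrants = population[:n_migrants]                     # elitist selection S
        shared_memory.extend(migrants)                         # submit to the shared memory
        incoming = random.sample(shared_memory, min(n_migrants, len(shared_memory)))
        # recombination R: keep the best local solutions, replace the worst with migrants
        population = population[:len(population) - len(incoming)] + incoming
    return population

# Toy usage: solutions are numbers, "fitness" is their absolute value,
# and each island's step just perturbs its solutions a little.
shared = []
pop = [random.uniform(-10, 10) for _ in range(8)]
best = run_island(lambda P: [x + random.uniform(-1, 1) for x in P],
                  abs, pop, shared, n_migrants=2, iterations=50)

In the real framework, the submit and load steps go through the shared HDFS memory managed by the Centralized Cache Management, which is what makes the exchange asynchronous.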

4. Experiments

In the experiment design, we analyze the efficiency of the proposed framework. First, we study the amount of resources, such as time, the number of processing elements and the number of iterations, required to perform clustering. The algorithms are designed to be executed on different platforms with different sizes of data; therefore, we study the scalability, speedup, and sizeup of the algorithms. Finally, we compare the meta-model MRC-GIM to different state-of-the-art large-scale clustering algorithms. Two sets of data of different sizes have been used to test the ability of the algorithms to group both large and small sets of data. The following subsection briefly describes these sets of data.

4.1. Datasets

We validate the proposed algorithms by clustering available large datasets. The datasets used have been downloaded from the Stanford Network Analysis Project (SNAP) [45]. To assess the quality of clustering on both small and large data sets, we employed two sets of data, Friendster and DBLP, as shown in Table 2.
Table 2: The datasets used in our experiments

Dataset | Records | Clusters | Size
Friendster | 65,608,366 | 287,512 | 30.1 GB
DBLP | 317,080 | 13,477 | 10 MB

Friendster is a gaming website which was formerly a social networking website that enabled users to connect with their friends. The data was released at the end of June 2011 and contains 65,608,366 users. Each line of the data files represents the friends list of one user, in the following format: (id of user: separated list of the user's friends); for example, 1: 2, 3 means that the user with id = 1 has a connection with both users two and three. We preprocessed the data to make it suitable for partitional clustering by constructing, for each player, an ascending ordering of all players according to their id and then checking whether a connection exists between this player and the i-th player; if so, the value one is assigned, otherwise zero. A player is considered to have a connection with himself.
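As an illustration of this preprocessing (file handling and scaling to 65 million users omitted), the following Python sketch turns one adjacency-list line such as "1: 2, 3" into a binary connection vector over n users, with the self-connection set to one.

def line_to_vector(line, n_users):
    """Convert 'user_id: friend, friend, ...' into a 0/1 connection vector."""
    user_part, _, friends_part = line.partition(":")
    user_id = int(user_part)
    friends = {int(f) for f in friends_part.split(",") if f.strip()}
    friends.add(user_id)                      # a player is connected to himself
    return [1 if i in friends else 0 for i in range(1, n_users + 1)]

print(line_to_vector("1: 2, 3", n_users=5))   # [1, 1, 1, 0, 0]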

DBLP is a computer science bibliography that provides a comprehensive list of research papers in the computer science field. Two authors are connected if they publish at least one paper together. We followed the same preprocessing to prepare the DBLP data for partitional clustering.

The number of potential clustering solutions within the search space for the Friendster data is 287,512^65,608,366 / 287,512!, which renders the clustering a very challenging task. We conduct our experiments on the Amazon EMR cluster with 192 nodes, where each node has a 2.5 GHz processor and 1.7 GB of memory. The data was moved from the local disk to Amazon using the Jets3t Java library. The size of the memory limits the number of map jobs that can be used per node. We split each node's memory into 20 blocks, where each block is a 64 MB HDFS block. Hadoop uses HDFS data blocks and assigns a single HDFS block to each map task; for the Friendster data, the minimal configuration is to use 24 nodes with 20 map jobs per node.


4.2. Evaluating each swarm intelligence algorithm individually

To evaluate the convergence of the proposed MRC-ABC, MRC-ACO and MRC-GIM, we recorded the cohesion while varying the number of iterations from 100 to 1000 (we stopped at 1000 iterations owing to a limitation of the platform used). A bar chart of the cohesion against the number of iterations is shown in Figure 5(a). The parameter settings used for the three SI algorithms are described in Table 3.

Table 3: Parameter settings used in MRC-ACO, MR-CPSO and MRC-ABC

Algorithm | Parameter setting
ACO | Evaporation rate = 0.1, Threshold = 0.9
PSO | c1 = c2 = 1.2, vmax = 0.9, vmin = 0.4
ABC | Upper bound = 5, limit = 10

Regarding the Friendster data, MRC-GIM returned the highest clustering quality. As the number of iterations increases, the gap between MRC-GIM and the rest of the algorithms increases as well (Figure 5(a)). Increasing the number of iterations enhances the chance of solution migration, thus exploring the search space more efficiently using three different swarm strategies at the same time. Moreover, the cooperative scheme of MRC-GIM improves its quick convergence while avoiding local minima; e.g., the results reached by MRC-GIM in 200 iterations can only be achieved by MR-CPSO after around 800 iterations.

[Figure: three panels, each with a Friendster (a) and a DBLP (b) sub-plot, comparing MRC-ABC, MR-CPSO, MRC-ACO and MRC-GIM: cohesion vs. iterations, boxplots of cohesion, and runtime (s) vs. number of nodes.]
Figure 5: (a) A bar chart showing the cohesion value against the number of iterations, (b) boxplots displaying the distribution of results over 18 runs using both Friendster and DBLP data, (c) the parallel runtime recorded in seconds while increasing the number of processing elements at each step.

Regarding the DBLP data, the algorithms are more competitive, as the search space is restricted compared to the Friendster data. MRC-ACO and MR-CPSO are competitive; MRC-ABC achieved the highest (worst) cohesion value. Subsequently, we set up MRC-GIM with only MR-CPSO and MRC-ACO to evaluate whether MRC-GIM converges better when MRC-ABC is isolated. The results for 100, 200, and 400 iterations are 12.352, 12.110 and 11.461 respectively, indicating that even MRC-ABC, which returned the worst result, improves the convergence of the overall MRC-GIM alongside MR-CPSO and MRC-ACO.

Furthermore, a statistical analysis is conducted to test the randomness of the algorithm outputs. The test illustrates the distributions of cohesion in terms of box-plots over 18 runs and 1000 iterations on 192 computers, as shown in Figure 5(b). The MRC-GIM boxes indicate that the returned results are symmetric (roughly the same on each side when cut down the middle); therefore, it returns a more robust and stable median value compared to the other, skewed boxes.

Afterwards, the parallel runtime is recorded for both datasets. Figure 5(c) shows the runtime against the number of processing elements for each dataset. It shows that the recorded time depends on the input size of the data and the number of processing elements used. The recorded parallel runtime is for 600 iterations. MRC-GIM requires slightly more computational time compared to the rest of the algorithms, since it performs additional tasks such as the selection strategy, the recombination policy, and the transmission and integration of migrant solutions. However, as the number of processing elements increases, the gap between the plots shrinks; the difference becomes very narrow between 96 and 192 nodes. For the DBLP data set, the processing elements exceed the requirements of the data; therefore, only a small gain in seconds has been recorded.
of data, therefore, a small gain in seconds has been recorded.
ce

In the rest of this section, we evaluate, the speed up, scaleup and sizeup. The
500 speedup is defined as the ratio of time recorded to solve a parallel task to the
time required to solve the same problem on a double set of processing elements
Ac

with the same capacity of processing nodes [46], according to the following
expression.
TN
Speedup = (6)
T2N
where TN is the running time using N nodes and T2N is the running time
505 using 2-fold of N . The speedup measures the acceleration of the parallel algo-
rithms from small computational power to a larger one while maintaining the

27

Page 27 of 42
same size of the dataset. We performed the speed up by doubling the number
of nodes used at each step, starting from 24 nodes for Friendster and four nodes
in DBLP as the minimum required configurations.

t
ip
510 While clustering DBLP data sets, all algorithms achieved a speedup larger
than one (figure 6). Therefore, they are all scalable. In Friendster data, MRC-

cr
GIM reached the best speedup imitating the diagonal (figure 6). The MapRe-
duce implementation has helped the algorithms to achieve a good speedup as

us
MapReduce paradigm is proved to be auto-scalable.
Friendster (a) DBLP (b)
2.5 2.5
MR−CPSO MRC−ABC
MRC−ACO MR−CPSO
MRC−ABC MRC−ACO
MRC−GIM MRC−GIM

an
2 2
Speedup
Speedup

1.5 1.5

1 1
48 96 192 8 16 32 64 128
Nodes Nodes
M
16 32 64 128 Nodes 192
1
0.85 MR−CPSO
MRC−ACO MRC−ABC
0.8 MRC−ACO
MRC−ABC
0.8 MRC−GIM
MR−CPSO
0.6 MRC−GIM
Scaleup
Sizeup

0.75
0.4
d
0.7
0.2

0 0.65
0 1.88
3.76 7.52 15.05 30.1 0 1.883.76 7.52 15.05 30.1
Data size(GB) Data size (GB)
te

Figure 6: Plots showing the speed up, size up and scale up metrics. The speed up is computed
for both data sets
p

Secondly, we computed the sizeup by fixing the number of nodes and increasing the size of the datasets by 2-fold at each step [46]:

Sizeup = T_S / T_2S    (7)

where T_S is the parallel runtime for the dataset of size S using N nodes and T_2S is the parallel runtime using 2-fold of S and N nodes. MRC-GIM deals well with the increase of data, whereas the rest of the algorithms recorded very close results compared to each other.

Finally, using the following expression:

Scaleup = T_SN / T_2SN    (8)

we compute the scaleup, which evaluates the ability of a parallel algorithm to solve a significant problem using larger datasets and more processing elements at once [46]. In (8), T_SN is the parallel runtime for the dataset of size S using N nodes and T_2SN is the parallel runtime using 2-fold of S and 2-fold of N nodes.
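Assuming the measured runtimes are stored in a dictionary keyed by (data size, number of nodes), the three metrics of Eqs. (6)-(8) reduce to a few lines of Python; the helper names and the timings in the toy usage are ours.

def speedup(runtimes, size, n):
    """Eq. (6): T_N / T_2N at a fixed data size."""
    return runtimes[(size, n)] / runtimes[(size, 2 * n)]

def sizeup(runtimes, size, n):
    """Eq. (7): T_S / T_2S at a fixed number of nodes."""
    return runtimes[(size, n)] / runtimes[(2 * size, n)]

def scaleup(runtimes, size, n):
    """Eq. (8): T_SN / T_2SN, doubling data and nodes together."""
    return runtimes[(size, n)] / runtimes[(2 * size, 2 * n)]

# Toy usage with made-up timings (seconds), keyed by (data size in GB, nodes).
runtimes = {(15, 96): 3600, (15, 192): 2000, (30, 96): 7000, (30, 192): 3900}
print(speedup(runtimes, 15, 96), sizeup(runtimes, 15, 96), scaleup(runtimes, 15, 96))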

To carry out this experiment, we decreased the size of the Friendster data. We first cluster the data using k-means. Afterward, we balance the number of elements present in each group until reaching 0.94 GB of data (1/32 of the original size), then add back a proportion of the deleted elements to reach 1.88 GB (1/16), 3.76 GB (1/8), 7.52 GB (1/4), 15.05 GB (1/2), and 30.1 GB (1/1). This data is processed on 8, 16, 32, 64, 128, and 192 nodes respectively (because of the limitation of the platform, we used 192 nodes rather than 256 nodes). MRC-ABC recorded the best values, with ratios between 0.69 and 0.79; MRC-GIM also shows a good scaleup, between 0.65 and 0.80, as shown in Figure 6.

4.3. Comparative study

We compare MRC-GIM to Lloyd's algorithm with the partitioning-based scheme proposed in [29], PKmeans [25], BigKClustering [34] and k-means++ with the improved version proposed in [38]. We compare the clustering quality by gathering the cohesion values and the recorded time after 50 runs for 100, 200, 400, 600, 800, and 1000 iterations. The results are shown in Table 4. The number of processing elements used in this section is 124.
Table 4: The recorded cohesion/time required in seconds. All cohesion values are divided by 10^3.

Iterations | MRC-GIM | Lloyd | BigKClustering | Pkmeans | Kmeans++
100 | 12.012/375s | 12.376/398s | 12.509/328s | 12.835/287s | 13.115/264s
200 | 11.235/789s | 11.773/811s | 12.287/720s | 12.482/649s | 12.788/631s
400 | 10.259/1665s | 11.109/1793s | 11.587/1603s | 12.095/1587s | 12.546/1466s
600 | 9.862/2783s | 10.832/2833s | 10.865/2667s | 11.311/2583s | 11.840/409s
800 | 9.121/3597s | 9.563/3786s | 9.745/3422s | 9.976/3389s | 10.248/3266s
1000 | 8.213/4357s | 8.482/4610s | 8.69/4531s | 8.712/4387s | 8.911/4198s

[Figure: boxplots of cohesion on Friendster for Lloyd, PKmeans, Kmeans++, BigKClustering and MRC-GIM, and the corresponding multi-comparison graph; Friedman test p-value = 1.1212e-09.]
Figure 7: Boxplots showing the distribution of results over 50 runs and 1000 iterations and the corresponding multi-comparison graph for MRC-GIM, BigKClustering, PKmeans, Lloyd and Kmeans++ based on MapReduce.
The same initial clustering is used (also the same number k of clusters). Regarding clustering quality, MRC-GIM outperforms the rest of the algorithms, reaching a good partition in a reasonable time. We were unable to run the algorithms for more than 1000 iterations as a result of the platform limitation. Kmeans++ returned many outliers; the best solutions returned in 50 runs are reported in Table 4. Kmeans++ and PKmeans recorded the shortest times, as they follow the classical k-means algorithm and run in one iterative MapReduce job.

In Figure 7, the distribution of results for 1000 iterations and 50 runs on 124 computers is reported. Regarding the median value, the proposed model outperforms the four algorithms and shows more stability and robustness. To study how significant the difference between the algorithms is, a Friedman test followed by a multi-comparison test was conducted. The p-value given by the Friedman test indicates a significant difference between the proposed model and the other algorithms at the significance level. The multi-comparison graph clearly shows that our algorithm is significantly different from the rest of the algorithms. This gain shows the advantage of using cooperation and parallelism between metaheuristics on a MapReduce platform.
4.4. Application on epigenetics datasets to study aging

In the effort to assess the dynamics of DNA methylation as a function of age and underlying pathological events, we extended our cluster analysis to the methylation profiles of CD34 primary cells assayed by bisulfite sequencing in six normal individuals ranging in age from 27 to 50 years (Table 5). Due to the high decomposability of DNA methylation data, we extract the methylation value for three regions within each gene, namely CpG islands, gene promoters and gene bodies, as these regions are highly associated with triggering/deactivating gene expression.

The aim of this analysis is to cluster each individual's data based on the three extracted features and to investigate how the formation of groups (clusters) is guided by DNA methylation features over time and whether epigenetic signatures emerge in specific clusters as a byproduct of aging.
Table 5: DNA methylation data used in this experiment.

Sample Code | Age | Sex | Ethnicity | Methylated BPs | Data Size (Mb)
GSM706858 | 27 | female | Caucasian | 2,729,614 | 118.3
GSM772729 | 27 | female | Hispanic | 2,464,116 | 106.7
GSM706859 | 36 | male | Hispanic | 3,047,608 | 132.2
GSM772730 | 42 | male | Caucasian | 2,690,026 | 116.6
GSM706839 | 43 | male | Caucasian | 2,920,722 | 126.6
GSM772731 | 50 | female | na | 2,726,708 | 118.1
We first examined the average value of methylation (methylation values range from 0 to 1) in CpG islands, gene body, and gene promoter for every individual (Figure 8a). We found a globally low level of methylation in gene bodies and CpG islands and a nearly unmethylated state in promoters. Even though methylation in the gene body increases with age, we did not identify any significant trend or change related to aging, as values remain consistently close to zero. Therefore, we can argue that pro-inflammatory mediators in normal individuals exhibit constantly low methylation, especially in promoter regions and CpG islands, confirming previous general evidence [48, 49, 50].


Figure 8: (a) DNA methylation levels in CpG islands, gene body, and promoter regions across ages, for each individual (F27, F27-1, M36, M42, M43, F50). (b) Cluster sizes (number of genes in clusters 1-7) as a function of DNA methylation features and age.
After clustering the data for each individual (Figure 8b), we noticed the emergence of a giant cluster (containing 65% to 74% of all genes) in every individual, suggesting the presence of a general methylation pattern at each age. In fact, the elements of the largest cluster in each individual correspond to nearly unmethylated genes in CpG islands, gene body, and gene promoter, and surprisingly these genes remain in the largest cluster over time, as shown by measuring the Jaccard index between the largest clusters for each pair of individuals (Table 6). (The Jaccard index measures the similarity between finite sets: given two sets A and B, it is defined as J(A, B) = |A ∩ B| / |A ∪ B|.)
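As a small worked example, the Jaccard index between the largest clusters of two individuals can be computed directly from their gene sets; the gene sets below are placeholders, not the actual cluster contents.

    # Worked example: J(A, B) = |A ∩ B| / |A ∪ B| for the largest clusters of
    # two individuals; the gene sets below are placeholders, not actual results.
    def jaccard(a: set, b: set) -> float:
        union = a | b
        return len(a & b) / len(union) if union else 1.0

    largest_f27 = {"TNFRSF14", "AMH", "PSPN", "GDF7", "INHBB"}
    largest_m36 = {"TNFRSF14", "AMH", "PSPN", "GDF10", "INHBB"}
    print(round(jaccard(largest_f27, largest_m36), 2))   # 4 shared / 6 total = 0.67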

Table 6: Jaccard index for the largest clusters across ages.

         F27     F27-1   M36     M42     M43     F50
F27      1.00    0.91    0.88    0.89    0.90    0.92
F27-1    0.91    1.00    0.92    0.96    0.95    0.91
M36      0.88    0.92    1.00    0.92    0.92    0.96
M42      0.89    0.96    0.92    1.00    0.93    0.93
M43      0.90    0.95    0.92    0.93    1.00    0.92
F50      0.92    0.91    0.96    0.93    0.92    1.00

Finally, we studied how the methylation features (corresponding to values in CpG islands, gene body, and gene promoter) vary across clusters and ages (Figure 9). While the largest cluster in every individual shows nearly unmethylated features, clusters tend to exhibit changes in methylation in CpG islands and gene body as their size decreases, with a significant peak in one of the two features. In particular, we found two recurrent clusters of genes across ages: the first cluster, characterized by an increased level of methylation in the gene body, is composed of three genes belonging to the TNF family (TNFRSF14, TNFRSF6B, TNFRSF25) and two genes in the TGF-β family (AMH, PSPN), while the second cluster, characterized by an increased level of methylation in CpG islands, is composed of two genes in the TNF family (TNFRSF4, TNFRSF11A) and four genes in the TGF-β family (INHBB, TGFB1, GDF7, GDF10). A gene enrichment analysis confirmed the close relationships between the genes in each of the two clusters across a wide number of biological processes. Specifically, genes in the first cluster are involved in the tumor necrosis factor-mediated signaling pathway, urogenital system development, and more generally in the response to chemical, while genes in the second cluster are involved, among others, in the regulation of protein phosphorylation and the apoptotic process.
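A minimal sketch of how such per-cluster methylation profiles can be summarised is given below, assuming each gene carries its cluster label and its three region-level methylation values; the table layout and the toy numbers are illustrative assumptions.

    # Illustrative only: mean methylation per cluster and per region for one
    # individual (the per-panel quantities of Figure 9) plus cluster sizes
    # (cf. Figure 8b). The layout and values are assumptions for the example.
    import pandas as pd

    genes = pd.DataFrame({
        "cluster":    [1, 1, 2, 2, 3],
        "cpg_island": [0.01, 0.02, 0.20, 0.18, 0.03],
        "gene_body":  [0.02, 0.01, 0.04, 0.05, 0.17],
        "promoter":   [0.00, 0.00, 0.01, 0.00, 0.01],
    })

    profile = genes.groupby("cluster").mean()   # rows: cluster id, columns: region means
    sizes = genes["cluster"].value_counts().sort_index()
    print(profile)
    print(sizes)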

Figure 9: DNA methylation levels in CpG islands, gene body, and promoter regions in the different clusters (cluster id 1-7) for each individual (F27, F27-1, M36, M42, M43, F50).
5. Conclusions

The amount of data is growing further in size and complexity. Recent advances in data collection have triggered a pressing need for novel data analysis methods to decipher the complexity of these data. Undoubtedly, parallelization frameworks and distributed computation will play a primary role in the future of interpreting big data in general and biological data in particular. For example, cloud computing will help to scale up the analysis of epigenetic profiles to larger samples of individuals, providing a more detailed characterization of epigenetic mechanisms in fast and cost-efficient ways.
In this paper, we dealt with the issue of clustering big data sets. Data clustering itself is an NP-hard problem, and it becomes even more challenging when the data to be clustered are large. We presented MRC-GIM, a scalable, time-efficient, and robust clustering approach to group very large datasets on different commodity computer architectures. The proposed method aggregates three swarm intelligence strategies. Using different strategies that exhibit different dynamics has the potential to foster the exploration and exploitation capabilities of the search process. MRC-GIM is designed and implemented with Hadoop MapReduce, which makes it auto-scalable and fault-tolerant. The comparative study reveals that MRC-GIM outperforms recently developed large-scale clustering approaches in a reasonable time. Although Hadoop has existed for half a decade, a noticeable shortage of tools dealing with big data analysis has been witnessed.
We extended our parallelization strategy to the study of DNA methylation in pro-inflammatory mediators from a population of normal individuals ranging in age from 27 to 50 years. Exploiting the high decomposability of the dataset, we were able to analyze millions of nucleotides simultaneously, retrieve methylated cytosines, and cluster genes according to methylation features in CpG islands, gene body, and gene promoter. We found a globally low level of methylation in CpG islands and gene body, and a nearly unmethylated status in promoter regions for each individual, suggesting that the epigenetic landscape in normal conditions should not change aberrantly with aging.


te

6. References
[1] Steinhaus, H.: Sur la division des corps materiels en parties, Bulletin de l'Academie Polonaise des Sciences, Volume IV, Issue 12, Pages 801-804 (1956)

[2] Nanda, S.J., Panda, G.: A survey on nature inspired metaheuristic algorithms for partitional clustering, Swarm and Evolutionary Computation, Volume 16, Pages 1-18 (2014)

[3] Nir, A., Bernard, C.: Faster Dimension Reduction, Communications of the ACM, Volume 53, Issue 2, Pages 97-104 (2010)

[4] Kollios, G., Gunopulos, D., Koudas, N., Berchtold, S.: Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Data Sets, IEEE Transactions on Knowledge and Data Engineering, Volume 15, Issue 5, Pages 1170-1187 (2003)

[5] Sarkar, S., Majumder, T., Kalyanaraman, A., Pande, P.: Hardware accelerators for biocomputing: A survey, Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Pages 3789-3792 (2010)

[6] Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters, Communications of the ACM, Volume 51, Issue 1, Pages 107-113 (2008)

[7] MacQueen, J. B.: Some Methods for Classification and Analysis of Multivariate Observations, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Vol. 1, Pages 281-297 (1967)

[8] Kaufman, L., Rousseeuw, P.: Clustering by means of medoids, in Statistical Data Analysis based on the L1-norm and Related Methods, North-Holland, Pages 405-416 (1987)

[9] Chu, S. C., Roddick, J. F., Su, C. J., Pan, J. S.: Constrained ant colony optimization for data clustering, in PRICAI, Pages 534-543 (2004)

[10] Ingaramo, D., Leguizamón, G., Errecalde, M.: Adaptive clustering with artificial ants, J. Comput. Sci. Technol. 4, Pages 264-271 (2005)

[11] Lumer, E., Faieta, B.: Diversity and adaptation in populations of clustering ants, in: Third International Conference on Simulation of Adaptive Behavior, Pages 501-508 (1994)

[12] Yang, Y., Kamel, M.: An aggregated clustering approach using multi-ant colonies algorithms, Pattern Recognition 39, Pages 1278-1289 (2006)

[13] Omran, M., Salman, A., Engelbrecht, A.: Image Classification using Particle Swarm Optimization, Simulated Evolution and Learning 1, Pages 370-374 (2002)
[14] Li, X.: A new intelligent optimization artificial fish swarm algorithm (Doctoral Thesis), Zhejiang University, China (2003)

[15] Cura, T.: A particle swarm optimization approach to clustering, Expert Systems with Applications 39, Pages 1582-1588 (2012)

[16] Karaboga, D., Ozturk, C.: A novel clustering approach: artificial Bee Colony (ABC) algorithm, Applied Soft Computing 11, Pages 652-657 (2011)

[17] Giuliano, A., Mohammad, R.F.: Clustering Analysis with Combination of Artificial Bee Colony Algorithm and K-means Technique, International Journal of Computer Theory and Engineering 6(2), Pages 146-150 (2014)

[18] Ghosh, S., Kothari, M., Halder, A., Ghosh, A.: Use of aggregation pheromone density for image segmentation, Pattern Recognition 30, Pages 939-949 (2009)

[19] Cohen, S.C.M., de Castro, L.N.: Data Clustering with Particle Swarms, IEEE Congress on Evolutionary Computation, Pages 1792-1798 (2006)

[20] Zhang, C., Ouyang, D., Ning, J.: An artificial bee colony approach for clustering, Expert Systems with Applications 37, Pages 4761-4767 (2010)

[21] Abraham, A., Das, S., Roy, S.: Swarm intelligence algorithms for data clustering, Soft Computing for Knowledge Discovery and Data Mining, Pages 279-313 (2008)

[22] Dario, I., Marek, R., Francesco, B.: The Generalized Island Model, Parallel Architectures and Bioinspired Algorithms 415, Pages 151-169 (2012)

[23] Das, A. S., Datar, M., Garg, A., Rajaram, S.: Google News Personalization: Scalable Online Collaborative Filtering, Proceedings of the 16th International Conference on World Wide Web, ACM, Pages 271-280 (2007)

[24] Papadimitriou, S., Sun, J.: DisCo: Distributed Co-clustering with Map-Reduce, Eighth IEEE International Conference on Data Mining (ICDM'08), Pages 512-521 (2008)
[25] Zhao, W., Ma, H., He, Q.: Parallel K-Means Clustering Based on MapReduce, Cloud Computing, LNCS 5931, Pages 674-679 (2009)

[26] Ferreira Cordeiro, R. L., Traina Junior, C., Machado Traina, A. J., López, J., Kang, U., Faloutsos, C.: Clustering very large multi-dimensional datasets with MapReduce, Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Pages 690-698 (2011)

[27] Li, H. G., Wu, G. Q., Hu, X. G., Zhang, J., Li, L., Wu, X.: K-Means Clustering with Bagging and MapReduce, 44th Hawaii International Conference on System Sciences (HICSS), Pages 1-8 (2011)

[28] He, Y., Tan, H., Luo, W., Mao, H., Ma, D., Feng, S., Fan, J.: MR-DBSCAN: An Efficient Parallel Density-based Clustering Algorithm using MapReduce, 17th International Conference on Parallel and Distributed Systems (ICPADS), Pages 473-480 (2011)

[29] Ene, A., Im, S., Moseley, B.: Fast Clustering using MapReduce, Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Pages 681-689 (2011)

[30] Aljarah, I., Ludwig, S.: Parallel Particle Swarm Optimization Clustering Algorithm based on MapReduce Methodology, Fourth World Congress on Nature and Biologically Inspired Computing (NaBIC), Pages 104-111 (2012)

[31] Bahmani, B., Moseley, B., Vattani, A., Kumar, R.: Scalable KMeans++, Proceedings of the VLDB Endowment, Volume 5, Issue 7, Pages 622-633 (2012)

[32] Hefeeda, M., Gao, F., Abd-Almageed, W.: Distributed Approximate Spectral Clustering for Large-Scale Datasets, Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, ACM, Pages 223-234 (2012)

[33] Estrada, T., Zhang, B., Taufer, M., Cicotti, P., Armen, R.: Reengineering high-throughput molecular datasets for scalable clustering using MapReduce, IEEE 14th International Conference on High Performance Computing and Communication, Pages 351-359 (2012)

[34] Miao, Y., Zhang, J., Feng, H., Qiu, L., Wen, Y.: A Fast Algorithm for Clustering with MapReduce, Advances in Neural Networks, Pages 532-538 (2013)

[35] Jin, C., Patwary, M. M. A., Agrawal, A., Hendrix, W., Liao, W. K., Choudhary, A.: DiSC: A Distributed Single-Linkage Hierarchical Clustering Algorithm using MapReduce, ACM 4th International SC Workshop on Data Intensive Computing in the Clouds (2013)

[36] Jakovits, P., Srirama, S. N.: Clustering on the Cloud: Reducing CLARA to MapReduce, ACM Proceedings of the Second Nordic Symposium on Cloud Computing, Pages 64-71 (2013)

[37] Herwig, R., Poustka, A. J., Müller, C., Bull, C., Lehrach, H., O'Brien, J.: Multiplex Kmeans for Clustering Large-scale Data Set, Genome Research, Volume 9, Issue 11, Pages 1093-1105 (2014)

[38] Xu, Y., Qu, W., Li, Z., Min, G., Li, K., Liu, Z.: Efficient K-means++ Approximation with MapReduce, IEEE Transactions on Parallel and Distributed Systems, Volume 25, Issue 12, Pages 3135-3144 (2014)

[39] Kim, Y., Shim, K., Kim, M. S., Lee, J. S.: DBCURE-MR: An efficient density-based clustering algorithm for large data using MapReduce, Information Systems, Volume 42, Pages 15-35 (2014)

[40] Cui, X., Zhu, P., Yang, X., Li, K., Ji, C.: Optimized big data K-means clustering using MapReduce, The Journal of Supercomputing, Volume 70, Issue 3, Pages 1249-1259 (2014)

[41] Dorigo, M.: Optimization, learning and natural algorithms, Ph.D. thesis, Politecnico di Milano (1992)
[42] Dorigo, M., Birattari, M., Stützle, T.: Ant Colony Optimization: Artificial Ants as a Computational Intelligence Technique, IRIDIA (2006)

[43] Kuo, R. J., Wang, H. S., Hu, T. L., Chou, S. H.: Application of ant K-means on clustering analysis, Computers and Mathematics with Applications, Volume 50, Issue 10, Pages 1709-1724 (2005)

[44] Karaboga, D., Ozturk, C.: A novel clustering approach: Artificial Bee Colony (ABC) algorithm, Applied Soft Computing, Volume 11, Issue 1, Pages 652-657 (2011)

[45] Leskovec, J., Krevl, A.: SNAP Datasets: Stanford Large Network Dataset Collection, http://snap.stanford.edu/data (2014)

[46] Grama, A., Gupta, A., Karypis, G., Kumar, V.: Introduction to Parallel Computing, Addison-Wesley (2003)

[47] Rivest, R.: The MD5 message-digest algorithm (1992)

[48] Ben-Avraham, D., Muzumdar, R. H., Atzmon, G.: Epigenetic genome-wide association methylation in aging and longevity, Epigenomics, Volume 4, Pages 503-509 (2012)

[49] Berdasco, M., Esteller, M.: Hot topics in epigenetic mechanisms of aging: 2011, Aging Cell, Volume 11, Pages 181-186 (2012)

[50] Florath, I., Butterbach, K., Müller, H., Bewerunge-Hudler, M., Brenner, H.: Cross-sectional and longitudinal changes in DNA methylation with age: an epigenome-wide analysis revealing over 60 novel age-associated CpG sites, Human Molecular Genetics, Volume 23, Issue 5, Pages 1186-1201 (2014)

[51] Shanmugam, M., Sethi, G.: Role of Epigenetics in Inflammation-Associated Diseases, in Epigenetics: Development and Disease, Volume 61 of Subcellular Biochemistry, Pages 627-657, Springer Netherlands (2013)

[52] Jones, P. A., Laird, P. W.: Cancer epigenetics comes of age, Nature Genetics, Volume 21, Pages 163-169 (1999)
[53] Jones, P. A., Baylin, S. B.: The fundamental role of epigenetic events in cancer, Nature Reviews Genetics, Volume 3, Pages 415-428 (2002)
