Accepted Manuscript

Title: Parallel Swarm Intelligence Strategies for Large-scale Clustering based on MapReduce with Application to Epigenetics of Aging

Authors: Zakaria Benmounah, Souham Meshoul, Mohamed Batouche, Pietro Lio'

PII: S1568-4946(18)30203-5
DOI: https://doi.org/10.1016/j.asoc.2018.04.012
Reference: ASOC 4816

To appear in: Applied Soft Computing

Received date: 29-4-2016


Revised date: 26-3-2018
Accepted date: 10-4-2018

Please cite this article as: Zakaria Benmounah, Souham Meshoul, Mohamed Batouche, Pietro Lio', Parallel Swarm Intelligence Strategies for Large-scale Clustering based on MapReduce with Application to Epigenetics of Aging, Applied Soft Computing Journal (2018), https://doi.org/10.1016/j.asoc.2018.04.012

This is a PDF file of an unedited manuscript that has been accepted for publication.
As a service to our customers we are providing this early version of the manuscript.
The manuscript will undergo copyediting, typesetting, and review of the resulting proof
before it is published in its final form. Please note that during the production process
errors may be discovered which could affect the content, and all legal disclaimers that
apply to the journal pertain.
Parallel Swarm Intelligence Strategies for Large-scale
Clustering based on MapReduce with Application to
Epigenetics of Aging

Zakaria Benmounah1, Souham Meshoul1, Mohamed Batouche1, Pietro Lio'2

1 Computer Science Department, Constantine 2 University, Constantine, Algeria.
2 Computer Science Department, Cambridge University, Cambridge, UK.
Abstract

Clustering is an important technique for data analysis and knowledge discovery. In the context of big data, it becomes a challenging issue because the huge amount of data recently collected renders conventional clustering algorithms inappropriate. Swarm intelligence algorithms have shown promising results when applied to data clustering of moderate size thanks to their decentralized and self-organized behavior, but they exhibit limited capabilities when large data sets are involved. In this paper, we develop a decentralized distributed big data clustering solution using three swarm intelligence algorithms within the MapReduce framework. The framework allows cooperation between the three algorithms, namely Particle Swarm Optimization, Ant Colony Optimization and Artificial Bee Colony, to achieve largely scalable data partitioning through a migration strategy. The latter takes advantage of the combined exploration and exploitation capabilities of these algorithms to foster diversity. The framework is tested using the Amazon Elastic MapReduce service (EMR), deploying up to 192 computer nodes and 30 gigabytes of data. Parallel metrics such as speed-up, size-up and scale-up are used to measure the elasticity and scalability of the framework. Our results are compared with their big data clustering counterparts and show a significant improvement in terms of time and convergence to good-quality solutions. The developed model has been applied to epigenetics data clustering according to methylation features in CpG islands, gene body, and gene promoter in order to study the epigenetic impact on aging. Experimental results reveal that DNA methylation changes slightly and not aberrantly with aging, corroborating previous studies.
Keywords: Big data, Swarm intelligence, Large-scale clustering, MapReduce, Epigenetics, Aging
2010 MSC: 00-01, 99-00

1. Introduction

Over the last few decades, developed countries in Europe, Asia, and North America have experienced a significant change in the age profile of their demographic structure, with a steady increase in the number of adults aged 65 years or older. Although longer lifespan is a clear sign of progress and improved quality of life, the considerable growth of the elderly population poses far-reaching social and economic challenges regarding the fiscal sustainability of the welfare state. As such, health-care systems and related services have to be rethought to treat aging as a manipulable long-term biological process whose detrimental effects can be limited or postponed rather than passively accepted as inevitable.

Key research challenges need to be effectively and efficiently addressed to cope with the ever-growing amount of epigenetics data that is being exponentially produced. Traditional techniques and tools for data analytics and autonomous learning are no longer suitable, and even unusable, for extracting human-interpretable knowledge and information from such an enormous and complex amount of data. Therefore, new revolutionary approaches and tools are more than required; among these tools we highlight clustering.

Almost 60 years have passed since the first clustering algorithm was proposed [1]. Cluster analysis aims at grouping data points into separate groups called clusters. It plays a versatile role in knowledge discovery, and it is used in a myriad of fields to extract hidden relationships among data. An up-to-date review of the application of clustering analysis can be found in [2]. A plethora of clustering algorithms has been designed to deal with different types and distributions of data. The exponential increase of data makes cluster analysis even more challenging than before. Two broad approaches have emerged to alleviate the issue: either reducing the dimensionality of the data (dimensionality reduction) [3] or reducing the number of samples within a dataset (sampling) [4]. However, both approaches have proved to be ineffective when a single machine is used, as the prohibitively large amount of data cannot be kept on a single computer. Therefore, multiple machines are needed, and parallel processing of data is undeniably necessary.

Hardware accelerators such as Field Programmable Gate Arrays (FPGA) and Graphics Processing Units (GPU) have emerged recently as promising technology drivers [5]. On the other hand, Application Programming Interfaces (APIs) such as the Message Passing Interface (MPI) and OpenMP have traditionally provided a software-oriented approach. However, when dealing with parallel programming languages, additional concerns have to be considered, such as load balancing, the communication flow, the topology choice, the split of data, etc. This makes designing a parallel algorithm a very tedious task. To deal with these concerns, a new open-source framework called Apache Hadoop, consisting of a storage part, namely the Hadoop Distributed File System (HDFS), and a processing part, namely MapReduce (MR), has emerged lately. HDFS is a distributed file system that provides high-performance access to data across Hadoop clusters by managing pools of big data and handling big data analytics applications. MapReduce, designed by Google [6], provides a new methodology for thinking about and developing parallel algorithms suitable for large-scale systems without being concerned about scalability, as MapReduce is auto-scalable. The idea was inspired by the map and reduce primitives characterizing the functional programming language LISP. Hadoop encapsulates the details of parallelization, fault tolerance, data distribution and load balancing. It demonstrates great performance in big data scenarios, especially for challenging tasks such as clustering.

Nevertheless, data clustering is an NP-hard problem. If the number of clusters exceeds three, the number of alternative ways to group the data is k^n/k!, where k is the number of groups and n is the number of data points to be clustered. However, data clustering can easily be cast as a global optimization problem of finding the partition that maximizes/minimizes an objective function. This can be appropriately tackled using metaheuristics such as Ant Colony Optimization (ACO), Artificial Bee Colony (ABC), Particle Swarm Optimization (PSO) and so forth. These metaheuristics exhibit different dynamics leading to distinct strategies that can be effectively combined to handle hard optimization problems such as clustering large datasets.
In this paper, we present a novel scalable and elastic framework for clustering large datasets. The framework is a generalized island model (GIM) that includes three swarm intelligence algorithms cooperating with each other through a migration operator. The main motivation behind the use of such a GIM is to combine the search capabilities of the three algorithms for a more efficient exploration of the search space. The GIM is designed in a parallel and distributed manner so as to help the three cooperating algorithms converge to high-quality solutions in a short time. The parallel and distributed architecture is designed using the MapReduce programming model to deal with the huge data volume. First, a MapReduce model is derived for each of the cooperating algorithms separately; these models are then merged to achieve the overall framework, which we refer to as MapReduce Clustering-GIM (MRC-GIM). The framework is validated on up to 192 interconnected computers and large real datasets (more than 30 GB). The comparative study reveals that MRC-GIM outperforms recently developed algorithms dedicated to big data clustering in terms of convergence time and solution quality.

Subsequently, we use MRC-GIM to investigate the correlation between epigenetics and aging. As a result of this application, we found that epigenetics changes slightly and not aberrantly with aging, which confirms previous evidence shown in [52, 53].

The remainder of the paper is organized as follows: in Section 2, a brief description of the background material is given. In Section 3, a review of algorithms for clustering large data sets is provided. Section 4 is devoted to the presentation of the proposed framework. In Section 5, the performance of MRC-GIM is evaluated using large datasets and parallel metrics. In Section 6, we study the correlation between epigenetics and aging, and finally, in Section 7, conclusions are drawn.

2. Background and literature review

2.1. Clustering Analysis

Clustering analysis is a central topic in computer science and statistics, whose purpose is to find groups in a given data set, according to a notion of similarity or homogeneity that is maximized for elements in the same group (or cluster). Formally, given a data set V, a clustering is a partition τ(V) = {c_1, ..., c_K} of V into non-empty and pairwise disjoint subsets c_i, i = 1, ..., K, whose union is V.

As a major field of research in exploratory data analysis, a large number of clustering algorithms have been proposed over the years to find "good" partitionings that can uncover latent structures in the data. Clustering approaches can be divided into two major classes: partitional clustering and hierarchical clustering.

Partitional clustering algorithms split the dataset into groups based on a two-step iterative process. Given an initial set of cluster representatives, such as centroid locations as in k-means [7] or centroid data points as in k-medoids [8], the procedure alternates an assignment step, where each data point is assigned to the cluster with the closest representative, and an update step, where cluster representatives are recomputed.

Hierarchical clustering algorithms build a hierarchy of clusters by either merging smaller clusters into larger clusters (agglomerative clustering) or by splitting larger clusters into smaller clusters (divisive clustering). As a result, a hierarchical clustering algorithm produces a tree of clusters, called a dendrogram, which shows how clusters and data points are related.
2.2. MapReduce Model

A MapReduce program generally takes place in three successive jobs: the Map, the Auxiliary and the Reduce jobs, as depicted in Figure 1.

[Figure: the whole dataset V is split and processed by map tasks emitting <key, value> pairs (e.g. <k1,v1>, <k2,v2>, <k1,v3>, <k3,v4>); an auxiliary phase sorts, partitions and combines them; reduce tasks receive <key, [values]> pairs such as <k1,[v1,v3]>.]
Figure 1: MapReduce framework, generally composed of the map job followed by the reducer job.

The map job is a preprocessing of the split data where each split part is considered as a value to which a key is attributed; all values with the same key are submitted to the same reducer. More formally, let V denote the whole dataset; after splitting the data we get V = v_1 ∪ v_2 ∪ ... ∪ v_p, where p is the number of split elements. The mapper creates a function V → key, where key = key_1, key_2, ..., key_r and r denotes the number of keys (r << p). A key key_i can be associated with more than one value. The Auxiliary is an optional function that takes place between the mappers and the reducers when some additional preprocessing tasks are required. The reducer receives a set of pairs of the form <key, value>; the values with the same key are grouped together in the same reducer, which results in one key per reducer.
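To make this key/value flow concrete, here is a minimal Python sketch (our own illustrative code, not part of the paper's Hadoop implementation) that simulates the map, auxiliary (shuffle) and reduce phases on in-memory splits; in the toy usage the keys are cluster ids and the reducer averages the values assigned to each key.

from collections import defaultdict

def map_phase(splits, map_fn):
    """Apply the map function to every split and emit <key, value> pairs."""
    pairs = []
    for split in splits:
        pairs.extend(map_fn(split))
    return pairs

def shuffle_phase(pairs):
    """Auxiliary phase: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped, reduce_fn):
    """Each reducer receives one key with the list of its values."""
    return {key: reduce_fn(key, values) for key, values in grouped.items()}

# Toy usage: keys are cluster ids, values are data points; reducers average them.
splits = [[(0, 1.0), (1, 4.0)], [(0, 3.0), (1, 6.0)]]
pairs = map_phase(splits, lambda split: split)           # identity map
centers = reduce_phase(shuffle_phase(pairs),
                       lambda k, vs: sum(vs) / len(vs))  # per-key mean
print(centers)  # {0: 2.0, 1: 5.0}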

2.3. Swarm intelligence clustering

Swarm intelligence algorithms have been widely used in data clustering. In [9], a constrained ACO (C-ACO) has been developed to handle arbitrarily shaped clusters and outlying data. An adaptive ACO algorithm was proposed in [10] to determine the optimal number of clusters and also to improve the convergence rate. It imitates the ants' behavior of grouping dead bodies based on their size so as to keep the nest clean. In fact, the first algorithm using this concept was proposed by Lumer and Faieta [11]. In [12], several ant colonies are used in parallel to find clustering solutions that are sent to the queen (each colony has one queen). At the final stage, the results at queen level are combined using a hypergraph model. From another side, PSO-based clustering was first introduced by Omran et al. in [13]. Their results showed that PSO-based clustering outperforms k-means and FCM (fuzzy c-means). The authors in [14] introduced a quantum-behaved PSO for clustering analysis applied to gene expression profiling. Also, to determine the unknown number of clusters, a recent clustering algorithm was developed by Cura et al. [15]. Clustering has been tackled as well using the ABC metaheuristic [16]. In [17], the authors proposed the ABCK algorithm, which tries to find the centroid initialization to be given as an input to the k-means algorithm. Algorithms based on these nature-inspired metaheuristics suffer from premature convergence, and many tentative methods to avoid this issue have been proposed in the literature. In [18], Ghosh et al. proposed a variant of ACO known as the Aggregation Pheromone Density-Based Clustering (APC) algorithm, which attempts to solve the issue using a pheromone matrix updating policy. Cohen and Castro in [19] proposed a particle swarm clustering algorithm in which the particle velocity update is influenced by the particles' previous positions along with cognitive, social and self-organizing terms; the aim is to guide the particles towards better solutions while avoiding local optima. In [20], an ABC algorithm proposes the introduction of a new group of bees, called the scout bees, to create a random search that helps the employed bees find the food source. All the methods reported so far remain limited in terms of quality of results and induce an extra cost regarding time complexity [21].

2.4. Big data clustering

Since the inception of MapReduce (MR), several MR-based clustering algorithms have been proposed. Most of these methods are based on different MR schemes of k-means. Among them, PKMeans [25] follows the classical k-means procedure and runs in one iterative MR job. In the map phase, centroids are computed as the weighted average of all points within a cluster; then, in the reduce phase, the algorithm updates the new centers. An additional auxiliary phase is set up to combine the intermediate data of the same map task. Further, k-means++ was designed to optimize the selection of initial centers, starting by choosing them uniformly at random and then adding potential centers one by one in a controlled way until reaching k centers. K-means|| [31] came as an improvement of k-means++: rather than sampling a single point in each pass like k-means++, k-means|| samples O(k) points in each round and repeats the process for approximately O(log n) rounds to get better accuracy. Unlike the aforementioned k-means-based methods, which perform a series of k-means tasks serially, Mux-Kmeans [37] performs multiple k-means tasks concurrently by taking various centroids together. The algorithm starts by evaluating the clusters, selecting the best task and finally integrating the new tasks. Efficient k-means++ [38] is another improvement of k-means++ which uses only one MR job to obtain the k centers. The map task consists of the k-means++ initialization, followed by the weighted k-means++ which is performed in the reduce phase. The algorithm integrates a pruning strategy that computes the distances without redundant time; as an additional advantage, the selection of k (the number of clusters) is automatic.

By using k-means with sampling, X. Cui et al. [40] proposed an algorithm that takes place in three MR jobs. Data summarization has been used as a baseline for BigK-Clustering [34]. The key idea behind the algorithm is to divide the data into sub-data and group them separately, which results in micro-clusters, then merge the closest micro-clusters using the equivalence relation, and finally calculate the centers of the final groups. DBCURE-MR [39] and MR-DBSCAN [28] are two density-based algorithms that identify clusters as highly dense areas separated by sparsely dense areas. Both methods employ a sampling approach to reduce the size of the large data and carry out the processing of the reduced data to obtain centers, which are then used to cluster the original data. Abhinandan Das et al. redesigned the MinHash algorithm [23] in a more scalable way using MapReduce; this algorithm can capture multiple interests of users shared within a cluster. DisCo [24] is based on co-clustering which, unlike clustering, attempts to cluster both samples and items at once.
Table 1: Summary of clustering algorithms based on MapReduce (sz: data size, nd: number of computer nodes, pt: data points, dim: dimensions, syn: synthetic)

Algorithm | Data | Platform
Abhinandan Das et al. [23] | Real (943, 5000, 500000) users | none
Spiros Papadimitriou et al. (DisCo) [24] | Raw dataset, sz (100-135 GB); graph dataset, sz (170 MB-4.3 GB) | 39 nd with 2 Intel Xeon (2.7-3 GHz), 8 GB of RAM
Weizhong Zhao et al. (PKMeans) [25] | sz (1 GB-8 GB) | (1-4) nd with 2 cores (2.8 GHz), 4 GB of RAM
Robson L. F. Cordeiro et al. (BOW) [26] | Real: (62 million-1.4 billion) pt, sz (0.014-0.2) TB; syn: (100,000-100 million) pt * 15 dim | Yahoo Cluster M45: 400 nd, 3.5 TB RAM; DISC: 64 machines = 512 cores + 1 TB of RAM
Hai-Guang Li et al. [27] | Small datasets | 1 nd Intel Core Duo (2.10 GHz), 2.00 GB RAM
Yaobin He et al. (MR-DBSCAN) [28] | Real (GPS location records): 0.32-1.92 billion pt, sz (8.4 GB-50.4 GB) | 13 nd Intel Core i7 950 (3.0 GHz), 8 GB DRAM
Alina Ene et al. (Parallel Lloyd's) [29] | syn (Zipf distribution): 10,000-10,000,000 pt | 1 nd Intel Core i7 (2.93 GHz) with 8 GB RAM (100 nd simulation)
Ibrahim Aljarah et al. (MR-CPSO) [30] | Real and syn: (2,000-32,000,000) pt * (2-54) dim, sz (0.14-1320.8) MB | Longhorn Hadoop cluster: 48 nd, 8 Intel Nehalem cores (2.5 GHz), 48 GB RAM; NDSU Hadoop cluster: 18 nd, 4 Intel cores (2.67 GHz each), 6 GB of RAM
Bahman Bahmani et al. (k-means||) [31] | syn: GaussMixture 10,000 points; real: (4601-4.8 M) pt * (42-58) dim | 1968 nd, 2 quad-core (2.5 GHz), 16 GB RAM
Fei Gao et al. [32] | syn: 1024 to 4 million pt * 64 dim; real (Wikipedia): sz (1 M-2 G) | Local cluster: 5 Core2 Duo E6550 (2.33 GHz) with 1 GB DRAM
Trilce Estrada et al. [33] | real (molecular geometries), sz (48 MB-1 TB) | 16 nd Gordon ION (192 cores, 2.7 GHz) with 3 GB of RAM
Yuqing Miao et al. (BigKClustering) [34] | syn (normal dist.), sz (1 GB-3 GB) | 3 nd with 2 Intel Core (3.1 GHz), 4 GB of RAM
Chen Jin et al. (DiSC) [35] | syn: 500,000 data pt * 10 dim | JESUP Hadoop cluster
Pelle Jakovits et al. (CLARA) [36] | real (handwritten digits), 25,000-10,000,000 pt | 17 nd with CPU (2.2 GHz), 500 MB RAM
Chen Li et al. (Mux-Kmeans) [37] | real (17,770-359,330) pt * (40-1000) dim | EC2: 16 nd, 2 ECUs and 1 CPU, 3.7 GiB RAM
Yujie Xu et al. (Efficient K-means++) [38] | real: sz (5.67 GB); syn: 20 million pt * 128 dim, sz (15 GB) | 12 nd with 2 AMD Opteron 2212 (2.00 GHz) and 8 GB of RAM
Younghoon Kim et al. (DBCURE-MR) [39] | Real: (164,860-164,860*50) pt * 3 dim, sz (21 MB-1.05 GB); syn: (5,000,000-25,000,000) pt * (2-8) dim, sz (1.7 GB) | 20 nd with 2 Duo (2.66 GHz), 2 GB of RAM
Xiaoli Cui et al. [40] | syn (Gauss distribution): 10,000 pt * 3 dim; Real: 2,351,710,420 pt * 3 dim; Real: 4,296,075,259 pt * 9 dim | 16 nd, 2-core AMD Opteron (2.00 GHz), 2 GB of RAM
Fei Gao et al. designed an approximate spectral clustering method which uses kernel-based machine learning algorithms. The method has been shown to efficiently process very large-scale datasets; the objective is to optimize the computation and memory overhead required to compute the kernel matrix [32]. The approach is independent of the employed kernel-based machine learning method and was tested on Amazon Elastic MapReduce. Trilce Estrada et al. [33] presented a scalable method dedicated to molecular conformations, such as ligands and peptides, to identify conformations that are accurate representatives of the native conformations. The algorithm starts by reducing the conformation space; afterwards, it combines the geometrical knowledge of conformation similarities based on tree clustering. A considerable decrease of parallel runtime has been witnessed, from 5 hours to 20 minutes (12-196 nodes). Hierarchical clustering has also been reimplemented using MapReduce. As an example, DiSC is a hierarchical clustering algorithm [35] designed according to the MapReduce framework which runs in a fairly small number of MR rounds by reducing the hierarchical clustering single-linkage problem to the minimum spanning tree (MST) problem. CLARA is a medoid-based clustering algorithm [36] which, unlike centroid-based methods, chooses real data points as centers. The algorithm has been redefined in terms of MapReduce jobs, showing how to adapt a non-embarrassingly parallel algorithm to a platform that is dedicated to embarrassingly parallel methods; in spite of this, the efficiency and scalability have been barely affected. Finally, MR-CPSO [30] is based on the particle swarm optimization (PSO) algorithm, originally developed by taking inspiration from the flocking behavior of birds. The algorithm is implemented in three MR stages, each of which consists of a PSO operation. In the first stage, the centroids of each individual particle are updated using the classical PSO equations that update the velocity and the position; afterwards, an evaluation of the fitness generated in the first stage is computed; then, the last stage is devoted to merging the results of the previous stages and identifying the personal best and global best centroids. In Table 1, a summary of all these algorithms is provided, including the basic clustering features used as described above, with extra information related to the type and volume of the data sets and the platform employed for each algorithm.
All the above-described methods based on k-means or k-medoids require the number of clusters to be known in advance, which is also the case in our developed method.

From this brief review, one can notice that attempts to develop MapReduce clustering methods are at their debut, in particular swarm-based algorithms, which still need to be rethought in this context despite their success in solving NP-hard problems and particularly data clustering. On the other side, cooperation among metaheuristics has been shown to be a very promising way to efficiently solve complex problems. Therefore, in our work we investigate the potential of such cooperation to handle clustering of large data sets.

3. Highly scalable clustering based on MapReduce

A generalized island model is a parallel distributed scheme which is amenable to large-scale clustering. To perform clustering of large data sets using a GIM, we first need to redesign each of the used metaheuristics according to the MapReduce model. In the proposed GIM, the three swarm-based algorithms PSO, ACO and ABC cooperate to find the best partition that optimizes a given cluster quality measure. In our study, we consider the total within-cluster variance, or cohesion measure, as the objective function. The latter describes how close the data points within each cluster are to each other. It is defined as follows:

(1/n) Σ_{j=1..k} Σ_{x_i ∈ z_j} ||x_i − z_j||²    (1)

where n is the number of items or data points in the dataset, k the number of clusters, x_i, {i = 1, ..., n}, the location of the i-th item to be clustered and z_j the center of cluster c_j. Therefore, clustering is cast as an optimization task that aims to minimize the cohesion measure.
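As a concrete reading of Eq. (1), the following Python sketch (assuming NumPy arrays for the data points; the helper name cohesion is ours) computes the cohesion of a given assignment of points to cluster centers.

import numpy as np

def cohesion(points, labels, centers):
    """Eq. (1): mean squared distance of every point to the center of its cluster."""
    total = 0.0
    for j, center in enumerate(centers):
        members = points[labels == j]          # points assigned to cluster j
        if len(members) > 0:
            total += np.sum(np.linalg.norm(members - center, axis=1) ** 2)
    return total / len(points)

# Toy usage: 4 two-dimensional points split into k = 2 clusters.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
lbl = np.array([0, 0, 1, 1])
ctr = np.array([pts[lbl == j].mean(axis=0) for j in range(2)])
print(cohesion(pts, lbl, ctr))  # small value => compact clusters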
Our proposed framework is based on the following assumptions:

1. The data is huge and cannot be stored on a single computer,
2. The clustering solution can be distributed among different computers,
3. The clustering solution is constructed gradually.

In the rest of this section, we consider the following notations:

1. v = {item_1, item_2, ..., item_n}: the set of n items or data points to be grouped into clusters,
2. m: the number of artificial ants or bees,
3. k: the number of clusters,
4. p_i = {(item_j, c_l)}, where {i = 1, ..., m}, {j = 1, ..., n} and {l = 1, ..., k}: the clustering partition encoded as a set of pairs, each composed of an item and its assigned cluster,
5. f_i: the cohesion value of the clustering solution p_i,
6. c = {z_1, z_2, ..., z_k}: the set of centroids.

In the following, we describe the proposed MapReduce-based ACO and ABC. Afterward, we describe the overall architecture through which cooperation between ACO, ABC and the PSO introduced in [30] is achieved.
3.1. Ant colony optimization based on MapReduce

ACO is a swarm-based metaheuristic developed by Dorigo et al. [41] to solve NP-hard optimization problems. It is inspired by the foraging behavior of a colony of ants. A detailed description of this metaheuristic and its biological metaphor can be found in [42]. Given an optimization problem to be solved, the typical dynamics of an ACO algorithm can be defined as follows. A set of cooperating artificial ants is used to construct solutions incrementally. Each ant constructs a solution and deposits an amount of pheromone on each solution component according to the overall quality of the solution found. During the solution construction step, an ant selects the next solution component according to a probabilistic rule which depends on the corresponding amount of pheromone and the value of a problem-specific heuristic. Many variants of the ACO algorithm exist, depending on the policies used to update pheromone trails and to select solution components.

In our work, we adopted an ACO algorithm for data clustering inspired by the ant k-means described in [43]. The algorithm aims to find the partitions that minimize the cohesion measure defined above. A solution component is viewed as an assignment of a data point to a cluster. Let us adopt the following notations:

1. γ: the threshold value,
2. φ(i, j), where {i = 1, ..., n}, {j = 1, ..., k}: an n by k matrix that represents the pheromone value of each item in each cluster.

The proposed MapReduce ACO algorithm (MRC-ACO) is composed of three stages; the first and second are parallel and based upon MapReduce, while the last stage is sequential, as shown in Figure 2.
[Figure: the three stages of MRC-ACO. Stage 1: (S*M) mappers <id, sub-vector> assign clusters in parallel; (K*M) reducers <r, [item_i, c_i]> compute the center and the cohesion. Stage 2: (M) mappers <i, Ant_i> compute the cohesion mean, followed by stratifying. Stage 3: update of the pheromone matrix, then loop.]
Figure 2: MRC-ACO: Ant Colony Optimization for clustering large datasets based on the MapReduce framework.

The ACO algorithm for data clustering can be outlined as follows:

1. During an initialization phase, each ant builds a solution by assigning items to clusters randomly. The same amount of pheromone is assigned to each of all possible pairs (item, cluster).
2. During an iterative phase, each ant updates the amount of pheromone of each of its solution components, namely a pair (item, cluster), and constructs another solution by assigning items to clusters according to a probability that depends on the amount of pheromone, the threshold value, and the heuristic value related to the cohesion measure. During the iterations, the best partition, i.e. the one with the smallest cohesion value found so far, is kept.
3. At the end of the iterative process, the best partition is returned.

To handle the big data, we split each p_i vector into s sub-vectors, where s is an integer tuned with regard to the size of the data points and the available memory capacity. The sub-vectors are loaded on the mappers; the mappers then assign in parallel to each item_i the cluster number c_i following a guided probabilistic search described in Algorithm 1. Afterward, the mappers attribute the same key to all sub-vector pairs (item_i, c_i) that satisfy the following two conditions:

1. The pairs belong to the same p_i parent vector.
2. The c_i of each pair has been assigned to the same cluster.

This is done through the use of the MD5 [34] function, which is a cryptographic hash function that provides a digital fingerprint. Subsequently, the mappers transmit the pairs with their keys to the reducers, where the same key is transmitted to the same reducer. Increasing s decreases the granularity (the size of the parallel task) per mapper, giving rise to a shorter processing time. The number of parallel tasks at this stage is m multiplied by s.
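A minimal sketch of this keying step, assuming Python's standard hashlib module and hypothetical arguments parent_id (the index of the p_i parent vector) and cluster_id, could look as follows; pairs that share both values hash to the same key and therefore reach the same reducer.

import hashlib

def make_key(parent_id: int, cluster_id: int) -> str:
    """Build the shuffle key from the parent p_i vector and the assigned cluster."""
    fingerprint = f"{parent_id}:{cluster_id}".encode("utf-8")
    return hashlib.md5(fingerprint).hexdigest()

# Pairs (item, cluster) of the same parent vector and cluster share one key.
print(make_key(parent_id=3, cluster_id=7))   # hexadecimal MD5 digest
print(make_key(3, 7) == make_key(3, 8))      # False: different clusters, different reducers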
At the reducer job, the transmitted pairs are merged to form one cluster of a p_i vector, which is the purpose of applying both conditions above. In parallel, each reducer computes the center and evaluates the clustering quality by computing the cohesion as described in Equation 1. The granularity of the system increases compared to the first stage, as the number of parallel tasks equals k multiplied by m. The purpose of splitting the data per cluster is to contribute to the decrease of the granularity, which yields more tasks in parallel and thus a faster system.

In the second stage, another MapReduce round is launched to evaluate the quality of each p_i vector and to update the pheromone matrix. Firstly, a light map stage receives, in each mapper, the cohesion values of a p_i vector and computes their mean. The mean describes how good the overall clustering solution is. In this round, the number of tasks in parallel is m. At this point, all p_i possess a complete clustering solution with its quality calculated. An auxiliary phase is employed to sort the p_i vectors in ascending order of their cohesion values, preparing them for the pheromone update stage.
Algorithm 1: Do in parallel for each mapper_i
  Input: <key: id, value: sub-vector>
  extract φ;
  for each item_i in sub-vector do
      generate random number r;
      if r ≤ γ then
          pMax = max(φ(item_i, ∀));                 // the highest pheromone value
          c_i = find(φ(item_i, ∀), pMax);           // the cluster where the highest pheromone lies
      else
          pSom = Σ_{h=1..k} φ(item_i, h);           // the sum of the recorded pheromone
          for each c_l, l > 1, in k do
              φ(item_i, c_l) = (φ(item_i, c_{l-1}) + φ(item_i, c_l)) / pSom;   // the shared pheromone
          end
          generate random number rr;
          for each c_l in k do
              if rr ≤ φ(item_i, c_l) then
                  c_i = c_l;
              end
          end
      end
  end
  for each pair (item_i, c_i) in sub-vector do
      generate a key_i;
      transmit (item_i, c_i);                        // via the use of MD5
  end
The last stage is sequential and devoted to updating the pheromone matrix. For each assigned cluster c_i in the best solution found so far, the pheromone value φ is updated as follows:

φ(i, j) = ((1 − ρ) φ(i, j) + 1) / f_best    (2)

where ρ is the evaporation rate and f_best is the best cohesion value found so far.
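The per-item assignment rule of Algorithm 1 and the pheromone update of Eq. (2) can be sketched in Python as below. The pheromone matrix is assumed to be a NumPy array and the helper names are ours; the exploration branch uses a standard roulette-wheel selection as an illustrative reading of the shared-pheromone loop, so this is a sketch rather than the authors' implementation.

import numpy as np

def assign_cluster(pheromone_row, gamma, rng):
    """Pick a cluster for one item: exploit the best pheromone with probability gamma,
    otherwise sample a cluster proportionally to its pheromone (roulette wheel)."""
    if rng.random() <= gamma:
        return int(np.argmax(pheromone_row))                 # exploitation
    probs = pheromone_row / pheromone_row.sum()              # shared pheromone
    return int(rng.choice(len(pheromone_row), p=probs))      # exploration

def update_pheromone(pheromone, best_labels, rho, f_best):
    """Eq. (2): reinforce the (item, cluster) pairs of the best partition found so far."""
    for i, j in enumerate(best_labels):
        pheromone[i, j] = ((1.0 - rho) * pheromone[i, j] + 1.0) / f_best
    return pheromone

rng = np.random.default_rng(0)
phero = np.ones((5, 3))                 # 5 items, 3 clusters, uniform pheromone
labels = [assign_cluster(phero[i], gamma=0.9, rng=rng) for i in range(5)]
phero = update_pheromone(phero, labels, rho=0.1, f_best=2.5)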

3.2. Artificial bee colony clustering based on MapReduce

We designed an artificial bee colony clustering based on MapReduce (MRC-ABC). MRC-ABC is composed of four stages, described in Figure 3.

[Figure: the four stages of MRC-ABC. Stages 1 and 2 run as in MRC-ACO. Stage 3 (employed bees): mappers <cohesion_i, [cohesions]> compute nectar_i; reducers <nectar_i, [nectars]> compute p_i. Stage 4 (onlooker bees): mappers <i, (nectar_i, cycle_i)> choose one solution according to p_i; reducers produce a new solution Pfood_i; then loop.]
Figure 3: MRC-ABC: an Artificial Bee Colony Clustering for large datasets based on the MapReduce framework.

The following assumptions are considered [44]:

1. There are three groups of artificial bees: employed bees, onlookers, and scouts. The employed bees produce a food source (solution) and share it with the onlookers in the dance area. The onlookers choose one food source with a probability related to its nectar amount and produce for it distinct new solutions. The job of the scout is to check whether a solution has not been chosen by the onlookers for many cycles; if so, the abandoned solution is replaced with a newly generated one.
2. nectar_i refers to the nectar amount of the food source; it evaluates how good a solution is and is calculated as follows:

   nectar_i = cohesion(Food_i) / Σ_{s=1..M} cohesion(f_s)    (3)

3. pb_i refers to the probability of a solution p_i being chosen by the onlookers; it is computed as follows:

   pb_i = nectar_i / Σ_{s=1..M} nectar_s    (4)

4. To avoid getting stuck in a local minimum, ABC employs a variable cycle_i for each p_i, which is incremented when a food source is not chosen by any onlooker bee. When cycle_i reaches the maximum cycle number mcn, the scouts eliminate the food source and replace it with a new one.
5. To produce a new potential food position from an old food source, the artificial bees employ the following expression (a sketch of Eqs. (3)-(5) follows this list):

   pfood_i = p_i + ϕ(p_i − p_s)    (5)
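The sketch announced above implements Eqs. (3)-(5) in Python under the assumption that each food source p_i is a flat numeric vector and that the cohesion of every solution is already known; the helper names are ours, and ϕ is drawn uniformly from [-1, 1], a common choice that the paper does not prescribe.

import numpy as np

def nectar(cohesions):
    """Eq. (3): nectar of each food source, from the cohesion of every solution."""
    cohesions = np.asarray(cohesions, dtype=float)
    return cohesions / cohesions.sum()

def onlooker_probabilities(nectars):
    """Eq. (4): probability pb_i of a solution being chosen by an onlooker."""
    nectars = np.asarray(nectars, dtype=float)
    return nectars / nectars.sum()

def new_food_source(p_i, p_s, rng):
    """Eq. (5): perturb solution p_i using a randomly chosen neighbour p_s."""
    phi = rng.uniform(-1.0, 1.0, size=p_i.shape)   # random coefficient per component
    return p_i + phi * (p_i - p_s)

rng = np.random.default_rng(1)
cohesions = [12.0, 9.8, 11.1]                      # one value per food source
pb = onlooker_probabilities(nectar(cohesions))
p_new = new_food_source(np.array([0.2, 0.7]), np.array([0.5, 0.1]), rng)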


In MRC-ABC, we exploit the same technique used in MRC-ACO: all p_i are split into sub-vectors, and the first two stages of MRC-ACO are adopted in MRC-ABC. However, to imitate the dynamics of ABC, the first assignment of clusters is done randomly. After the completion of stages one and two, all vectors possess a cohesion value for their cluster assignment.

Stage three is devoted to computing nectar_i and pb_i for each food source. To compute nectar_i, each mapper receives all the computed cohesion values; the results are then transmitted to the reducers, as shown in Algorithm 2. Afterward, each single reducer receives all the nectar_i computed in the mapper step and calculates for each p_i the value pb_i, as stated in Algorithm 2. The number of mappers/reducers required in this round equals m. By loading only the cohesion values in each mapper and the nectar amounts in each reducer, we drastically reduce the size of the values loaded into the mapper's (or reducer's) memory compared to the first and second stages.
Algorithm 2: MRC-ABC: stage three
  Mapper: Do in parallel for each mapper_i
    Input: <key: cohesion_i, list of values: [cohesions]>
    compute nectar_i for p_i using Eq. (3);
    emit(i, nectar_i) to all reducers;
  Reducer: Do in parallel for each reducer_i
    Input: <key: nectar_i, list of values: [nectars]>
    compute pb_i for p_i using Eq. (4);

After stage three is achieved, all p_i have a complete clustering solution with their nectar_i and pb_i assessed in memory. In the last round, the onlooker bees choose one solution according to its pb_i value and send only the chosen ones to the reducers; if p_i is not chosen, the variable cycle_i is incremented, see Algorithm 3.
Algorithm 3: MRC-ABC: stage four
  Mapper: Do in parallel for each mapper_i
    Input: <key: i, value: (pb_i, cycle_i)>
    if pb_i is chosen then
        emit(i, pb_i);
    else
        cycle_i = cycle_i + 1;
    end
  Reducer: Do in parallel for each reducer_i
    Input: <key: i, value: 0>
    if pb_i is chosen then
        produce new solution Pfood_i using Eq. (5);
    else
        if cycle_i equals mcn then
            produce a random solution and replace p_i;
        end
    end

The reducers of the last stage imitate the scouts: they receive all the chosen pb_i and produce from them a new clustering solution pfood_i using Equation 5. The reducers also check whether any cycle_i has reached the maximum cycle number mcn. If so, the solution is considered abandoned and replaced by a randomly generated one, see Algorithm 3.

3.3. The Generalized Island Model based on MapReduce

GIM (Generalized Island Model) is a parallel model which can be applied to a large class of optimization problems. Based on metaheuristic cooperation, it allows an effective parallel execution of multiple algorithms across multiple islands. This cooperation is maintained by the exchange of solutions between these islands. The exchange operator (migration operator) aims at improving the overall performance of the different algorithms used [8].

To enable cooperation, we set up in parallel our proposed MRC-ACO and MRC-ABC together with the MR-CPSO proposed in [30]. We then integrate an exchange operator to enable solution migration from one algorithm to another. Each algorithm acts as an island, and the islands cooperate by selecting a subset of solutions (according to an elitist selection strategy S), exchanging the selected solutions using the migration operator, and finally recombining them with local solutions following a recombination strategy R (i.e. replacing the worst solutions). In the proposed model, this has been achieved in Hadoop by managing the HDFS, which is logically divided using Centralized Cache Management (CCM). We create four logical memories: three of them represent a distributed memory associated with each algorithm, and one shared memory between the algorithms serves as a migrant solution deposit, acting as a buffer to exchange solutions in an asynchronous way (Figure 4).

[Figure: MRC-ACO, MRC-ABC and MR-CPSO each run their routines over their own logical HDFS memory (HDFS-ACO, HDFS-ABC, HDFS-PSO) and submit/load migrant solutions through an HDFS shared memory.]
Figure 4: HDFS management for the Generalized Island Model in the Hadoop architecture (MRC-GIM).

Without deploying a barrier operation, no processor sits idle waiting for another one. Thus, this technique increases the load balancing, yielding a faster system. The dynamics of MRC-GIM is described in Algorithm 4.
Algorithm 4: Do in parallel for MRC-ACO, MRC-ABC, and MR-CPSO
  P: initial population;
  S: selection strategy;
  R: recombination policy;
  A_i: MRC-ACO, MRC-ABC or MR-CPSO;
  initialize P;                                // initialize population
  while non stop criteria do
      P' = A_i(P);                             // P' is the new population
      M = S(P');                               // select M solutions from P'
      submit M to the shared HDFS memory;      // share the results with the other SI algorithms
      load M' from the HDFS;                   // extract the shared solutions
      P'' = R(P', M');                         // combine M' with P' to generate a new population
      P = P'';
  end
The recombination policy consists of combining the migrant populations with the local ones to obtain more diversity for the next iteration, which is the key task of the cooperation.
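The island loop of Algorithm 4 can be sketched in Python as follows; each algorithm is represented by a placeholder step function and the shared HDFS memory by a plain list, so this is only a schematic view of the selection/migration/recombination cycle, not the Hadoop implementation.

import random

def run_island(step, fitness, population, shared_memory, n_migrants, iterations):
    """One island of the GIM loop in Algorithm 4: evolve, emit and absorb migrants."""
    for _ in range(iterations):
        population = step(population)                          # P' = A_i(P)
        population.sort(key=fitness)                           # lower cohesion is better
        migrants = population[:n_migrants]                     # elitist selection S
        shared_memory.extend(migrants)                         # submit to the shared memory
        incoming = random.sample(shared_memory, min(n_migrants, len(shared_memory)))
        # recombination R: keep the best local solutions, replace the worst with migrants
        population = population[:len(population) - len(incoming)] + incoming
    return population

# Toy usage: solutions are numbers, "fitness" is their absolute value,
# and each island's step just perturbs its solutions a little.
shared = []
pop = [random.uniform(-10, 10) for _ in range(8)]
best = run_island(lambda P: [x + random.uniform(-1, 1) for x in P],
                  abs, pop, shared, n_migrants=2, iterations=50)

In the real framework, the submit and load steps go through the shared HDFS memory managed by the Centralized Cache Management, which is what makes the exchange asynchronous.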

4. Experiments

In the experiment design, we analyze the efficiency of the proposed framework. First, we study the amount of resources, such as time, the number of processing elements and the number of iterations, required to perform clustering. The algorithms are designed to be executed on different platforms with different sizes of data; therefore, we study the scalability, speedup, and sizeup of the algorithms. Finally, we compare the meta-model MRC-GIM to different state-of-the-art large-scale clustering algorithms. Two sets of data of different sizes have been used to test the ability of the algorithms to group both large and small sets of data. The following subsection briefly describes these sets of data.

4.1. Datasets

We validate the proposed algorithms by clustering available large datasets. The datasets used have been downloaded from the Stanford Network Analysis Project (SNAP) [45]. To assess the quality of clustering on both small and large data sets, we employed two sets of data, Friendster and DBLP, as shown in Table 2.
Table 2: The datasets used in our experiments

Dataset | Records | Clusters | Size
Friendster | 65,608,366 | 287,512 | 30.1 GB
DBLP | 317,080 | 13,477 | 10 MB

Friendster is a gaming website which was formerly a social networking website that enabled users to connect with their friends. The data was released at the end of June 2011 and contains 65,608,366 users. Each line of the data files represents the friends list of one user, in the following format: (id of user: separated list of the user's friends); for example, 1: 2, 3 means that the user with id = 1 has a connection with both users two and three. We preprocessed the data to make it suitable for partitional clustering by constructing, for each player, an ascending ordering of all players according to their id and then checking whether a connection exists between this player and the i-th player; if so, the value one is assigned, otherwise zero. A player is considered to have a connection with himself.
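As an illustration of this preprocessing (file handling and scaling to 65 million users omitted), the following Python sketch turns one adjacency-list line such as "1: 2, 3" into a binary connection vector over n users, with the self-connection set to one.

def line_to_vector(line, n_users):
    """Convert 'user_id: friend, friend, ...' into a 0/1 connection vector."""
    user_part, _, friends_part = line.partition(":")
    user_id = int(user_part)
    friends = {int(f) for f in friends_part.split(",") if f.strip()}
    friends.add(user_id)                      # a player is connected to himself
    return [1 if i in friends else 0 for i in range(1, n_users + 1)]

print(line_to_vector("1: 2, 3", n_users=5))   # [1, 1, 1, 0, 0]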

DBLP is a computer science bibliography that provides a comprehensive list of research papers in the computer science field. Two authors are connected if they publish at least one paper together. We followed the same preprocessing to prepare the DBLP data for partitional clustering.

The number of potential clustering solutions within the search space for the Friendster data is 287,512^65,608,366 / 287,512!, which renders the clustering a very challenging task. We conduct our experiments on the Amazon EMR cluster with 192 nodes, where each node has a 2.5 GHz processor and 1.7 GB of memory. The data was moved from the local disk to Amazon using the Jets3t Java library. The size of the memory limits the number of map jobs that can be used per node. We split each node's memory into 20 blocks, where each block is a 64 MB HDFS block. Hadoop uses HDFS data blocks and assigns a single HDFS block to each map task; for the Friendster data, the minimal configuration is to use 24 nodes with 20 map jobs per node.


4.2. Evaluating each swarm intelligence algorithm individually

To evaluate the convergence of the proposed MRC-ABC, MRC-ACO and MRC-GIM, we recorded the cohesion while varying the number of iterations from 100 to 1000 (we stopped at 1000 iterations owing to a limitation of the platform used). A bar chart of the cohesion against the number of iterations is shown in Figure 5(a). The parameter settings used for the three SI algorithms are described in Table 3.

Table 3: Parameter settings used in MRC-ACO, MR-CPSO and MRC-ABC

Algorithm | Parameter setting
ACO | Evaporation rate = 0.1, Threshold = 0.9
PSO | c1 = c2 = 1.2, vmax = 0.9, vmin = 0.4
ABC | Upper bound = 5, limit = 10

Regarding the Friendster data, MRC-GIM returned the highest clustering quality. As the number of iterations increases, the gap between MRC-GIM and the rest of the algorithms increases as well (Figure 5(a)). Increasing the number of iterations enhances the chance of solution migration, thus exploring the search space more efficiently using three different swarm strategies at the same time. Moreover, the cooperative scheme of MRC-GIM improves its quick convergence while avoiding local minima; e.g., the results reached by MRC-GIM in 200 iterations can only be achieved by MR-CPSO after around 800 iterations.

[Figure: three panels, each with a Friendster (a) and a DBLP (b) sub-plot, comparing MRC-ABC, MR-CPSO, MRC-ACO and MRC-GIM: cohesion vs. iterations, boxplots of cohesion, and runtime (s) vs. number of nodes.]
Figure 5: (a) A bar chart showing the cohesion value against the number of iterations, (b) boxplots displaying the distribution of results over 18 runs using both Friendster and DBLP data, (c) the parallel runtime recorded in seconds while increasing the number of processing elements at each step.

Regarding the DBLP data, the algorithms are more competitive, as the search space is restricted compared to the Friendster data. MRC-ACO and MR-CPSO are competitive; MRC-ABC achieved the highest (worst) cohesion value. Subsequently, we set up MRC-GIM with only MR-CPSO and MRC-ACO to evaluate whether MRC-GIM converges better when MRC-ABC is isolated. The results for 100, 200, and 400 iterations are 12.352, 12.110 and 11.461 respectively, indicating that even MRC-ABC, which returned the worst result, improves the convergence of the overall MRC-GIM alongside MR-CPSO and MRC-ACO.

Furthermore, a statistical analysis is conducted to test the randomness of the algorithm outputs. The test illustrates the distributions of cohesion in terms of box-plots over 18 runs and 1000 iterations on 192 computers, as shown in Figure 5(b). The MRC-GIM boxes indicate that the returned results are symmetric (roughly the same on each side when cut down the middle); therefore, it returns a more robust and stable median value compared to the other, skewed boxes.

Afterwards, the parallel runtime is recorded for both datasets. Figure 5(c) shows the runtime against the number of processing elements for each dataset. It shows that the recorded time depends on the input size of the data and the number of processing elements used. The recorded parallel runtime is for 600 iterations. MRC-GIM requires slightly more computational time compared to the rest of the algorithms, since it performs additional tasks such as the selection strategy, the recombination policy, and the transmission and integration of migrant solutions. However, as the number of processing elements increases, the gap between the plots shrinks; the difference becomes very narrow between 96 and 192 nodes. For the DBLP data set, the processing elements exceed the requirements of the data; therefore, only a small gain in seconds has been recorded.
of data, therefore, a small gain in seconds has been recorded.
ce

In the rest of this section, we evaluate, the speed up, scaleup and sizeup. The
500 speedup is defined as the ratio of time recorded to solve a parallel task to the
time required to solve the same problem on a double set of processing elements
Ac

with the same capacity of processing nodes [46], according to the following
expression.
TN
Speedup = (6)
T2N
where TN is the running time using N nodes and T2N is the running time
505 using 2-fold of N . The speedup measures the acceleration of the parallel algo-
rithms from small computational power to a larger one while maintaining the

27

Page 27 of 42
same size of the dataset. We performed the speed up by doubling the number
of nodes used at each step, starting from 24 nodes for Friendster and four nodes
in DBLP as the minimum required configurations.

t
ip
510 While clustering DBLP data sets, all algorithms achieved a speedup larger
than one (figure 6). Therefore, they are all scalable. In Friendster data, MRC-

cr
GIM reached the best speedup imitating the diagonal (figure 6). The MapRe-
duce implementation has helped the algorithms to achieve a good speedup as

us
MapReduce paradigm is proved to be auto-scalable.
Friendster (a) DBLP (b)
2.5 2.5
MR−CPSO MRC−ABC
MRC−ACO MR−CPSO
MRC−ABC MRC−ACO
MRC−GIM MRC−GIM

an
2 2
Speedup
Speedup

1.5 1.5

1 1
48 96 192 8 16 32 64 128
Nodes Nodes
M
16 32 64 128 Nodes 192
1
0.85 MR−CPSO
MRC−ACO MRC−ABC
0.8 MRC−ACO
MRC−ABC
0.8 MRC−GIM
MR−CPSO
0.6 MRC−GIM
Scaleup
Sizeup

0.75
0.4
d
0.7
0.2

0 0.65
0 1.88
3.76 7.52 15.05 30.1 0 1.883.76 7.52 15.05 30.1
Data size(GB) Data size (GB)
te

Figure 6: Plots showing the speed up, size up and scale up metrics. The speed up is computed
for both data sets
p

Secondly, we computed the sizeup by fixing the number of nodes and increasing the size of the datasets by 2-fold at each step [46]:

Sizeup = T_S / T_2S    (7)

where T_S is the parallel runtime for the dataset of size S using N nodes and T_2S is the parallel runtime using 2-fold of S and N nodes. MRC-GIM deals well with the increase of data, whereas the rest of the algorithms recorded very close results compared to each other.

Finally, using the following expression:

Scaleup = T_SN / T_2SN    (8)

we compute the scaleup, which evaluates the ability of a parallel algorithm to solve a significant problem using larger datasets and more processing elements at once [46]. In (8), T_SN is the parallel runtime for the dataset of size S using N nodes and T_2SN is the parallel runtime using 2-fold of S and 2-fold of N nodes.
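Assuming the measured runtimes are stored in a dictionary keyed by (data size, number of nodes), the three metrics of Eqs. (6)-(8) reduce to a few lines of Python; the helper names and the timings in the toy usage are ours.

def speedup(runtimes, size, n):
    """Eq. (6): T_N / T_2N at a fixed data size."""
    return runtimes[(size, n)] / runtimes[(size, 2 * n)]

def sizeup(runtimes, size, n):
    """Eq. (7): T_S / T_2S at a fixed number of nodes."""
    return runtimes[(size, n)] / runtimes[(2 * size, n)]

def scaleup(runtimes, size, n):
    """Eq. (8): T_SN / T_2SN, doubling data and nodes together."""
    return runtimes[(size, n)] / runtimes[(2 * size, 2 * n)]

# Toy usage with made-up timings (seconds), keyed by (data size in GB, nodes).
runtimes = {(15, 96): 3600, (15, 192): 2000, (30, 96): 7000, (30, 192): 3900}
print(speedup(runtimes, 15, 96), sizeup(runtimes, 15, 96), scaleup(runtimes, 15, 96))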

To carry out this experiment, we decreased the size of the Friendster data. We first cluster the data using k-means. Afterward, we balance the number of elements present in each group until reaching 0.94 GB of data (1/32 of the original size), then add back a proportion of the deleted elements to reach 1.88 GB (1/16), 3.76 GB (1/8), 7.52 GB (1/4), 15.05 GB (1/2), and 30.1 GB (1/1). This data is processed on 8, 16, 32, 64, 128, and 192 nodes respectively (because of the limitation of the platform, we used 192 nodes rather than 256 nodes). MRC-ABC recorded the best values, with ratios between 0.69 and 0.79; MRC-GIM also shows a good scaleup, between 0.65 and 0.80, as shown in Figure 6.

4.3. Comparative study

We compare MRC-GIM to Lloyd's algorithm with the partitioning-based scheme proposed in [29], PKmeans [25], BigKClustering [34] and k-means++ with the improved version proposed in [38]. We compare the clustering quality by gathering the cohesion values and the recorded time after 50 runs for 100, 200, 400, 600, 800, and 1000 iterations. The results are shown in Table 4. The number of processing elements used in this section is 124.
Table 4: The recorded cohesion/time required in seconds. All cohesion values are divided by 10^3.

Iterations | MRC-GIM | Lloyd | BigKClustering | Pkmeans | Kmeans++
100 | 12.012/375s | 12.376/398s | 12.509/328s | 12.835/287s | 13.115/264s
200 | 11.235/789s | 11.773/811s | 12.287/720s | 12.482/649s | 12.788/631s
400 | 10.259/1665s | 11.109/1793s | 11.587/1603s | 12.095/1587s | 12.546/1466s
600 | 9.862/2783s | 10.832/2833s | 10.865/2667s | 11.311/2583s | 11.840/409s
800 | 9.121/3597s | 9.563/3786s | 9.745/3422s | 9.976/3389s | 10.248/3266s
1000 | 8.213/4357s | 8.482/4610s | 8.69/4531s | 8.712/4387s | 8.911/4198s

[Figure: boxplots of cohesion on Friendster for Lloyd, PKmeans, Kmeans++, BigKClustering and MRC-GIM, and the corresponding multi-comparison graph; Friedman test p-value = 1.1212e-09.]
Figure 7: Boxplots showing the distribution of results over 50 runs and 1000 iterations and the corresponding multi-comparison graph for MRC-GIM, BigKClustering, PKmeans, Lloyd and Kmeans++ based on MapReduce.
The same initial clustering is used (also the same number k of clusters). Regarding clustering quality, MRC-GIM outperforms the rest of the algorithms, reaching a good partition in a reasonable time. We were unable to run the algorithms for more than 1000 iterations as a result of the platform limitation. Kmeans++ returned many outliers; the best solutions returned in 50 runs are reported in Table 4. Kmeans++ and PKmeans recorded the shortest times, as they follow the classical k-means algorithm and run in one iterative MapReduce job.

In Figure 7, the distribution of results for 1000 iterations and 50 runs on 124 computers is reported. Regarding the median value, the proposed model outperforms the four algorithms and shows more stability and robustness. To study how significant the difference between the algorithms is, a Friedman test followed by a multi-comparison test was conducted. The p-value given by the Friedman test indicates a significant difference between the proposed model and the other algorithms at the significance level. The multi-comparison graph clearly shows that our algorithm is significantly different from the rest of the algorithms. This gain shows the advantage of using cooperation and parallelism between metaheuristics on a MapReduce platform.
4.4. Application on epigenetics datasets to study aging

In the effort to assess the dynamics of DNA methylation as a function of age and underlying pathological events, we extended our cluster analysis to the methylation profiles of CD34 primary cells assayed by bisulfite sequencing in six normal individuals ranging in age from 27 to 50 years (Table 5). Due to the high decomposability of DNA methylation data, we extract the methylation value for three regions within each gene, namely CpG islands, gene promoters and gene bodies, as these regions are highly associated with triggering/deactivating gene expression.

The aim of this analysis is to cluster each individual's data based on the three extracted features and to investigate how the formation of groups (clusters) is guided by DNA methylation features over time and whether epigenetic signatures emerge in specific clusters as a byproduct of aging.
Table 5: DNA methylation data used in this experiment.

Sample Code | Age | Sex | Ethnicity | Methylated BPs | Data Size (Mb)
GSM706858 | 27 | female | Caucasian | 2,729,614 | 118.3
GSM772729 | 27 | female | Hispanic | 2,464,116 | 106.7
GSM706859 | 36 | male | Hispanic | 3,047,608 | 132.2
GSM772730 | 42 | male | Caucasian | 2,690,026 | 116.6
GSM706839 | 43 | male | Caucasian | 2,920,722 | 126.6
GSM772731 | 50 | female | na | 2,726,708 | 118.1
We first examined the average value of methylation (methylation values range from 0 to 1) in CpG islands, gene body, and gene promoter for every individual (Figure 8a). We found a globally low level of methylation in gene bodies and CpG islands and a nearly unmethylated state in promoters. Even though methylation in the gene body increases with age, we did not identify any significant trend or change related to aging, as values remain consistently close to zero. Therefore, we can argue that pro-inflammatory mediators in normal individuals exhibit constantly low methylation, especially in promoter regions and CpG islands, confirming previous general evidence [48, 49, 50].


Figure 8: (a) DNA methylation levels in CpG islands, gene body, and promoter regions across ages, for each individual (F27, F27-1, M36, M42, M43, F50). (b) Cluster sizes (number of genes in clusters 1-7) as a function of DNA methylation features and age.
After clustering the data for each individual (Figure 8b), we noticed the emergence of a giant cluster (containing 65% to 74% of all genes) in every individual, suggesting the presence of a general methylation pattern at each age. In fact, the elements of the largest cluster in each individual correspond to nearly unmethylated genes in CpG islands, gene body, and gene promoter, and surprisingly these genes remain in the largest cluster over time, as shown by measuring the Jaccard index between the largest clusters for each pair of individuals (Table 6). (The Jaccard index measures the similarity between finite sets: given two sets A and B, it is defined as J(A, B) = |A ∩ B| / |A ∪ B|.)
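As a small worked example, the Jaccard index between the largest clusters of two individuals can be computed directly from their gene sets; the gene sets below are placeholders, not the actual cluster contents.

    # Worked example: J(A, B) = |A ∩ B| / |A ∪ B| for the largest clusters of
    # two individuals; the gene sets below are placeholders, not actual results.
    def jaccard(a: set, b: set) -> float:
        union = a | b
        return len(a & b) / len(union) if union else 1.0

    largest_f27 = {"TNFRSF14", "AMH", "PSPN", "GDF7", "INHBB"}
    largest_m36 = {"TNFRSF14", "AMH", "PSPN", "GDF10", "INHBB"}
    print(round(jaccard(largest_f27, largest_m36), 2))   # 4 shared / 6 total = 0.67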

Table 6: Jaccard index for the largest clusters across ages.

         F27     F27-1   M36     M42     M43     F50
F27      1.00    0.91    0.88    0.89    0.90    0.92
F27-1    0.91    1.00    0.92    0.96    0.95    0.91
M36      0.88    0.92    1.00    0.92    0.92    0.96
M42      0.89    0.96    0.92    1.00    0.93    0.93
M43      0.90    0.95    0.92    0.93    1.00    0.92
F50      0.92    0.91    0.96    0.93    0.92    1.00

Finally, we studied how the methylation features (corresponding to values in CpG islands, gene body, and gene promoter) vary across clusters and ages (Figure 9). While the largest cluster in every individual shows nearly unmethylated features, clusters tend to exhibit changes in methylation in CpG islands and gene body as their size decreases, with a significant peak in one of the two features. In particular, we found two recurrent clusters of genes across ages: the first cluster, characterized by an increased level of methylation in the gene body, is composed of three genes belonging to the TNF family (TNFRSF14, TNFRSF6B, TNFRSF25) and two genes in the TGF-β family (AMH, PSPN), while the second cluster, characterized by an increased level of methylation in CpG islands, is composed of two genes in the TNF family (TNFRSF4, TNFRSF11A) and four genes in the TGF-β family (INHBB, TGFB1, GDF7, GDF10). A gene enrichment analysis confirmed the close relationships between the genes in each of the two clusters across a wide number of biological processes. Specifically, genes in the first cluster are involved in the tumor necrosis factor-mediated signaling pathway, urogenital system development, and more generally in the response to chemical, while genes in the second cluster are involved, among others, in the regulation of protein phosphorylation and the apoptotic process.
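A minimal sketch of how such per-cluster methylation profiles can be summarised is given below, assuming each gene carries its cluster label and its three region-level methylation values; the table layout and the toy numbers are illustrative assumptions.

    # Illustrative only: mean methylation per cluster and per region for one
    # individual (the per-panel quantities of Figure 9) plus cluster sizes
    # (cf. Figure 8b). The layout and values are assumptions for the example.
    import pandas as pd

    genes = pd.DataFrame({
        "cluster":    [1, 1, 2, 2, 3],
        "cpg_island": [0.01, 0.02, 0.20, 0.18, 0.03],
        "gene_body":  [0.02, 0.01, 0.04, 0.05, 0.17],
        "promoter":   [0.00, 0.00, 0.01, 0.00, 0.01],
    })

    profile = genes.groupby("cluster").mean()   # rows: cluster id, columns: region means
    sizes = genes["cluster"].value_counts().sort_index()
    print(profile)
    print(sizes)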

Figure 9: DNA methylation levels in CpG islands, gene body, and promoter regions in the different clusters (cluster id 1-7) for each individual (F27, F27-1, M36, M42, M43, F50).
5. Conclusions

The amount of data is growing further in size and complexity. Recent advances in data collection have triggered a pressing need for novel data analysis methods to decipher the complexity of these data. Undoubtedly, parallelization frameworks and distributed computation will play a primary role in the future of interpreting big data in general and biological data in particular. For example, cloud computing will help to scale up the analysis of epigenetic profiles to larger samples of individuals, providing a more detailed characterization of epigenetic mechanisms in fast and cost-efficient ways.
In this paper, we dealt with the issue of clustering big data sets. Data clustering itself is an NP-hard problem, and it becomes even more challenging when the data to be clustered are large. We presented MRC-GIM, a scalable, time-efficient, and robust clustering approach to group very large datasets on different commodity computer architectures. The proposed method aggregates three swarm intelligence strategies. Using different strategies that exhibit different dynamics has the potential to foster the exploration and exploitation capabilities of the search process. MRC-GIM is designed and implemented with Hadoop MapReduce, which makes it auto-scalable and fault-tolerant. The comparative study reveals that MRC-GIM outperforms recently developed large-scale clustering approaches in a reasonable time. Although Hadoop has existed for half a decade, a noticeable shortage of tools dealing with big data analysis has been witnessed.
We extended our parallelization strategy to the study of DNA methylation in pro-inflammatory mediators from a population of normal individuals ranging in age from 27 to 50 years. Exploiting the high decomposability of the dataset, we were able to analyze millions of nucleotides simultaneously, retrieve methylated cytosines, and cluster genes according to methylation features in CpG islands, gene body, and gene promoter. We found a globally low level of methylation in CpG islands and gene body, and a nearly unmethylated status in promoter regions for each individual, suggesting that the epigenetic landscape in normal conditions should not change aberrantly with aging.


te

6. References
[1] Steinhaus, H.: Sur la division des corps materiels en parties, Bulletin de l'Academie Polonaise des Sciences, Volume IV, Issue 12, Pages 801-804 (1956)

[2] Nanda, S.J., Panda, G.: A survey on nature inspired metaheuristic algorithms for partitional clustering, Swarm and Evolutionary Computation, Volume 16, Pages 1-18 (2014)

[3] Nir, A., Bernard, C.: Faster Dimension Reduction, Communications of the ACM, Volume 53, Issue 2, Pages 97-104 (2010)

[4] Kollios, G., Gunopulos, D., Koudas, N., Berchtold, S.: Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Data Sets, IEEE Transactions on Knowledge and Data Engineering, Volume 15, Issue 5, Pages 1170-1187 (2003)

[5] Sarkar, S., Majumder, T., Kalyanaraman, A., Pande, P.: Hardware accelerators for biocomputing: A survey, Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Pages 3789-3792 (2010)

[6] Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters, Communications of the ACM, Volume 51, Issue 1, Pages 107-113 (2008)

[7] MacQueen, J. B.: Some Methods for Classification and Analysis of Multivariate Observations, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Vol. 1, Pages 281-297 (1967)

[8] Kaufman, L., Rousseeuw, P.: Clustering by means of medoids, in Statistical Data Analysis based on the L1-norm and Related Methods, North-Holland, Pages 405-416 (1987)

[9] Chu, S. C., Roddick, J. F., Su, C. J., Pan, J. S.: Constrained ant colony optimization for data clustering, in PRICAI, Pages 534-543 (2004)

[10] Ingaramo, D., Leguizamón, G., Errecalde, M.: Adaptive clustering with artificial ants, J. Comput. Sci. Technol. 4, Pages 264-271 (2005)

[11] Lumer, E., Faieta, B.: Diversity and adaptation in populations of clustering ants, in: Third International Conference on Simulation of Adaptive Behavior, Pages 501-508 (1994)

[12] Yang, Y., Kamel, M.: An aggregated clustering approach using multi-ant colonies algorithms, Pattern Recognition 39, Pages 1278-1289 (2006)

[13] Omran, M., Salman, A., Engelbrecht, A.: Image Classification using Particle Swarm Optimization, Simulated Evolution and Learning 1, Pages 370-374 (2002)
[14] Li, X.: A new intelligent optimization artificial fish swarm algorithm (Doctoral Thesis), Zhejiang University, China (2003)

[15] Cura, T.: A particle swarm optimization approach to clustering, Expert Systems with Applications 39, Pages 1582-1588 (2012)

[16] Karaboga, D., Ozturk, C.: A novel clustering approach: artificial Bee Colony (ABC) algorithm, Applied Soft Computing 11, Pages 652-657 (2011)

[17] Giuliano, A., Mohammad, R.F.: Clustering Analysis with Combination of Artificial Bee Colony Algorithm and K-means Technique, International Journal of Computer Theory and Engineering 6(2), Pages 146-150 (2014)

[18] Ghosh, S., Kothari, M., Halder, A., Ghosh, A.: Use of aggregation pheromone density for image segmentation, Pattern Recognition 30, Pages 939-949 (2009)

[19] Cohen, S.C.M., de Castro, L.N.: Data Clustering with Particle Swarms, IEEE Congress on Evolutionary Computation, Pages 1792-1798 (2006)

[20] Zhang, C., Ouyang, D., Ning, J.: An artificial bee colony approach for clustering, Expert Systems with Applications 37, Pages 4761-4767 (2010)

[21] Abraham, A., Das, S., Roy, S.: Swarm intelligence algorithms for data clustering, Soft Computing for Knowledge Discovery and Data Mining, Pages 279-313 (2008)

[22] Dario, I., Marek, R., Francesco, B.: The Generalized Island Model, Parallel Architectures and Bioinspired Algorithms 415, Pages 151-169 (2012)

[23] Das, A. S., Datar, M., Garg, A., Rajaram, S.: Google News Personalization: Scalable Online Collaborative Filtering, Proceedings of the 16th International Conference on World Wide Web, ACM, Pages 271-280 (2007)

[24] Papadimitriou, S., Sun, J.: DisCo: Distributed Co-clustering with Map-Reduce, Eighth IEEE International Conference on Data Mining (ICDM'08), Pages 512-521 (2008)
[25] Zhao, W., Ma, H., He, Q.: Parallel K-Means Clustering Based on MapReduce, Cloud Computing, LNCS 5931, Pages 674-679 (2009)

[26] Ferreira Cordeiro, R. L., Traina Junior, C., Machado Traina, A. J., López, J., Kang, U., Faloutsos, C.: Clustering very large multi-dimensional datasets with MapReduce, Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Pages 690-698 (2011)

[27] Li, H. G., Wu, G. Q., Hu, X. G., Zhang, J., Li, L., Wu, X.: K-Means Clustering with Bagging and MapReduce, 44th Hawaii International Conference on System Sciences (HICSS), Pages 1-8 (2011)

[28] He, Y., Tan, H., Luo, W., Mao, H., Ma, D., Feng, S., Fan, J.: MR-DBSCAN: An Efficient Parallel Density-based Clustering Algorithm using MapReduce, 17th International Conference on Parallel and Distributed Systems (ICPADS), Pages 473-480 (2011)

[29] Ene, A., Im, S., Moseley, B.: Fast Clustering using MapReduce, Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Pages 681-689 (2011)

[30] Aljarah, I., Ludwig, S.: Parallel Particle Swarm Optimization Clustering Algorithm based on MapReduce Methodology, Fourth World Congress on Nature and Biologically Inspired Computing (NaBIC), Pages 104-111 (2012)

[31] Bahmani, B., Moseley, B., Vattani, A., Kumar, R.: Scalable KMeans++, Proceedings of the VLDB Endowment, Volume 5, Issue 7, Pages 622-633 (2012)

[32] Hefeeda, M., Gao, F., Abd-Almageed, W.: Distributed Approximate Spectral Clustering for Large-Scale Datasets, Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, ACM, Pages 223-234 (2012)

[33] Estrada, T., Zhang, B., Taufer, M., Cicotti, P., Armen, R.: Reengineering high-throughput molecular datasets for scalable clustering using MapReduce, IEEE 14th International Conference on High Performance Computing and Communication, Pages 351-359 (2012)

[34] Miao, Y., Zhang, J., Feng, H., Qiu, L., Wen, Y.: A Fast Algorithm for Clustering with MapReduce, Advances in Neural Networks, Pages 532-538 (2013)

[35] Jin, C., Patwary, M. M. A., Agrawal, A., Hendrix, W., Liao, W. K., Choudhary, A.: DiSC: A Distributed Single-Linkage Hierarchical Clustering Algorithm using MapReduce, ACM 4th International SC Workshop on Data Intensive Computing in the Clouds (2013)

[36] Jakovits, P., Srirama, S. N.: Clustering on the Cloud: Reducing CLARA to MapReduce, ACM Proceedings of the Second Nordic Symposium on Cloud Computing, Pages 64-71 (2013)

[37] Herwig, R., Poustka, A. J., Müller, C., Bull, C., Lehrach, H., O'Brien, J.: Multiplex Kmeans for Clustering Large-scale Data Set, Genome Research, Volume 9, Issue 11, Pages 1093-1105 (2014)

[38] Xu, Y., Qu, W., Li, Z., Min, G., Li, K., Liu, Z.: Efficient K-means++ Approximation with MapReduce, IEEE Transactions on Parallel and Distributed Systems, Volume 25, Issue 12, Pages 3135-3144 (2014)

[39] Kim, Y., Shim, K., Kim, M. S., Lee, J. S.: DBCURE-MR: An efficient density-based clustering algorithm for large data using MapReduce, Information Systems, Volume 42, Pages 15-35 (2014)

[40] Cui, X., Zhu, P., Yang, X., Li, K., Ji, C.: Optimized big data K-means clustering using MapReduce, The Journal of Supercomputing, Volume 70, Issue 3, Pages 1249-1259 (2014)

[41] Dorigo, M.: Optimization, learning and natural algorithms, Ph.D. thesis, Politecnico di Milano (1992)
[42] Dorigo, M., Birattari, M., Stützle, T.: Ant Colony Optimization: Artificial Ants as a Computational Intelligence Technique, IRIDIA (2006)

[43] Kuo, R. J., Wang, H. S., Hu, T. L., Chou, S. H.: Application of ant K-means on clustering analysis, Computers and Mathematics with Applications, Volume 50, Issue 10, Pages 1709-1724 (2005)

[44] Karaboga, D., Ozturk, C.: A novel clustering approach: Artificial Bee Colony (ABC) algorithm, Applied Soft Computing, Volume 11, Issue 1, Pages 652-657 (2011)

[45] Leskovec, J., Krevl, A.: SNAP Datasets: Stanford Large Network Dataset Collection, http://snap.stanford.edu/data (2014)

[46] Grama, A., Gupta, A., Karypis, G., Kumar, V.: Introduction to Parallel Computing, Addison-Wesley (2003)

[47] Rivest, R.: The MD5 message-digest algorithm (1992)

[48] Ben-Avraham, D., Muzumdar, R. H., Atzmon, G.: Epigenetic genome-wide association methylation in aging and longevity, Epigenomics, Volume 4, Pages 503-509 (2012)

[49] Berdasco, M., Esteller, M.: Hot topics in epigenetic mechanisms of aging: 2011, Aging Cell, Volume 11, Pages 181-186 (2012)

[50] Florath, I., Butterbach, K., Müller, H., Bewerunge-Hudler, M., Brenner, H.: Cross-sectional and longitudinal changes in DNA methylation with age: an epigenome-wide analysis revealing over 60 novel age-associated CpG sites, Human Molecular Genetics, Volume 23, Issue 5, Pages 1186-1201 (2014)

[51] Shanmugam, M., Sethi, G.: Role of Epigenetics in Inflammation-Associated Diseases, in Epigenetics: Development and Disease, Volume 61 of Subcellular Biochemistry, Pages 627-657, Springer Netherlands (2013)

[52] Jones, P. A., Laird, P. W.: Cancer epigenetics comes of age, Nature Genetics, Volume 21, Pages 163-169 (1999)
[53] Jones, P. A., Baylin, S. B.: The fundamental role of epigenetic events in cancer, Nature Reviews Genetics, Volume 3, Pages 415-428 (2002)
