https://doi.org/10.1007/s10586-018-1799-6
Map Reduce for big data processing based on traffic aware partition
and aggregation
G. Venkatesh1 · K. Arunesh1
Abstract
Big data refers to data sets whose volume can exceed 500 terabytes per day. This velocity, on the order of 2 million records per day, makes the data difficult to capture, manage, process and analyze. Another characteristic of big data is variability, i.e., images, audio, video, sensor data, log files and so on, which makes it difficult to identify the reason for losses. Hadoop can be used to analyze this huge amount of data: with Hadoop, an approximate early result of a partially executed job becomes available to the user even before the job completes, which reduces response time. The layers 3 traffic aware clustering programming model is used for processing big data; it includes the data processing functions map, sort and reduce. The layers 3 traffic aware clustering method is implemented on top of Hadoop, where the input is partitioned into fixed-size HDFS blocks and the intermediate output is generated as a collection of <num, data> pairs. The conventional hash function method is used for partitioning intermediate data among reduce tasks, but it is not traffic efficient. In this paper, to reduce the network traffic cost, the Map Reduce task is designed with a data partitioner and an aggregator that can reduce the traffic merged from multiple map tasks. The proposed algorithm is more efficient in reducing response time, and the simulation results show that the proposal can reduce network traffic.
Keywords Traffic aware cluster · Data aggregation · Map Reduce · k-means cluster · Hadoop · Layers 3 traffic aware clustering
✉ G. Venkatesh
[email protected]

K. Arunesh
[email protected]

1 Department of Computer Science, Sri S Ramasamy Naidu Memorial College (Affiliated to Madurai Kamarajar University), Sattur, Tamil Nadu 626203, India

1 Introduction

Big data is an evolving term that describes combinations of large amounts of structured, unstructured and semi-structured data that cannot be processed using conventional computing techniques [1]. Facebook, YouTube and other social networks hold enormous amounts of data that fall under the category of big data and must be collected and managed on a daily basis. The volume, velocity and variability of these data sets make them difficult to capture, manage and process with traditional methods and tools. The Hadoop Map Reduce programming model is now being used for processing big data. A Map task processes the input data by partitioning it into fixed-size blocks and generates its output as a collection of <num, data> pairs [2]. After completion of the Map tasks, these pairs are shuffled across different reduce tasks based on their <num, data> keys, and each reduce task accepts only one num key at a time; processing the data for that num key produces the result as <num, data> pairs. The Hadoop Map Reduce architecture [3] consists of a single Job Tracker, while many Task Trackers work simultaneously. Online Map Reduce is a modified version of Hadoop Map Reduce that supports online aggregation [4] and reduces response time. The conventional Map Reduce method materializes the intermediate results of the mappers and does not allow pipelining between Map and Reduce tasks. This method has the advantage of simple recovery in case of failures: once the mappers have finished their tasks, the reducers start executing.

Cluster Computing

Fig. 1 Block diagram of the proposed layers 3 traffic aware clustering algorithm

Partitioning clustering allocates documents into a fixed number of clusters, and the well-known partitioning methods are k-means clustering [5] and its variants. The basic k-means clustering method [6] initially allocates a set of objects to a number of clusters at random, from which the mean
of each cluster is calculated at each iteration, and each object is reallocated to the nearest mean. In the Map phase, the mapper takes a single <num, data> pair as input and generates any number of <num, data> pairs as output. The map is stateless, in that its logic operates on a single pair at a time, even if in practice several input pairs are delivered to the same mapper. To recapitulate, for the map stage the user simply designs a map function that maps an input <num, data> pair to any number of output pairs. Frequently the map phase is used to specify the desired location of an input by changing its key. The shuffle stage is handled automatically by the Map Reduce framework, and the engineer has nothing to do at this stage: the system implementing Map Reduce routes all values that aggregate under an individual key to the same reducer. In the reduce stage, the reducer takes all of the values associated with a single key num and outputs any number of <num, data> pairs. One sequential aspect of Map Reduce computation is that all of the map tasks must finish before the reduce stage can begin. The reducer has access to all the values with the same key and can perform sequential computations over them; parallelism is exploited by observing that reducers operating on different num keys can be executed concurrently. Document clustering [7] collects similar documents into classes, where similarity is some function on a document; it does not need a separate training process. Documents in the same cluster are similar, and documents in different clusters are not relevant to one another.

In this paper, a new algorithm is introduced through the layers 3 traffic aware clustering programming model. Figure 1 illustrates the block diagram of the proposed system: the design and implementation of a processing model for massive data sets based on Map Reduce and Hadoop.

2 Problem identification

Big data refers to complex, enormous data sets that have become a buzzword in the market due to the various challenges and major issues that must be handled. Map Reduce [8,9] is the prominent technique for processing this enormous amount of data, and here a new clustering method is designed and developed on top of it. In the map phase, the input is converted into the form <num, data>, after which the output is sorted and shuffled in the shuffle phase of the second layer, where it is sorted again at the aggregator stage [10]. The third layer is the reduce phase, which fetches its own share of the data partitioned from all map tasks to produce the final solution. Here a clustering technique is applied in the map phase: the big data is analyzed and clusters of similar objects are formed. In this three-layer method, the data produced by the map phase is ordered, partitioned and aggregated to the appropriate machines that execute the reduce phase. The traffic [11] generated across the three layers can reach a great volume on the network, which places a serious constraint on the efficiency of data analytic applications; a further problem is the difficulty of processing the data within a given time.

3 Methodology

3.1 Layers 3 traffic aware clustering method

The proposed method is a layers 3 traffic aware clustering algorithm based on parallel K-means [3,12] and a distance metric [13]. Traffic aware clustering combines the benefits of both the K-means and distance metric algorithms. The benefit of parallel K-means is that it is not complex: it forms clusters using the distance metric and traffic awareness, and therefore has a lower execution time. The proposed system works in three layers.

Layer 1: The data is partitioned into different clusters and different parts so as to minimize the boundary points [14].
Layer 2: This layer shuffles and sorts the clusters in a traffic aware manner.
Layer 3: This layer performs the final sort, in which the input is fetched to reduce the clusters.

Clustering is a function of grouping a set of objects so that a single cluster contains only similar objects; the dissimilar objects are formed into another cluster. The goal of the clustering is to partition unstructured data objects into structured data objects. The K-means clustering algorithm is used to partition the data sets into clusters: n objects are partitioned, based on their attributes, into k partitions (k < n) that form a vector space. The work begins with a decision on the value of k, the number of clusters. First the data is partitioned into k clusters; then each data point is taken in sequence and its distance from the centroid of each cluster is calculated. If a cluster contains data points that are closer to the centroid of another cluster, those data points are switched to that cluster. This process is repeated until the centres no longer change. The system processes the given input data, where the data is shuffled and reduced into the small required data. To reduce the network traffic within a Map Reduce task [11,15], data with similar keys is aggregated before the mapper output is sent to the reducer stage. This data aggregation creates opportunities across multiple tasks on different machines, with the goal of minimizing the network traffic [16,17] through data partitioning and aggregation for the Map Reduce task. The layers 3 traffic aware clustering algorithm is proposed for big data applications, and the final results demonstrate the reduction in network traffic cost.

3.2 Layers 3 traffic aware clustering algorithm

Step 1: Input D: a data set of n data points. Let A = {A1, A2, A3, …, An} be the set of data points and V = {V1, V2, V3, V4, …, Vk} be the set of centers.
Step 2: The Map Reduce function executes in three stages, i.e. the map stage, the traffic aware shuffle stage and the reduce stage.
Step 3: Iteratively improve the partitioning of the data into k document clusters.
Step 4: Mapper phase

(i) The input of the map function follows the paradigm of sending the computation to where the data resides; it is in the form of a file or directory stored in the Hadoop distributed file system (HDFS).
(ii) Calculate the distance between each data point and each cluster centre using the distance metric

    D_{XY} = \sqrt{ \sum_{k=1}^{m} ( X_{ik} - X_{jk} )^2 }    (1)

(iii) Each data point is assigned to the cluster centre for which its distance is the minimum over all the cluster centers.
(iv) The input passes to the mapper function line by line, in the form of <num, data> pairs as cluster centre and data points.
(v) The mapper function finds the nearest centre among the k centers for the input point, processes the data and creates small chunks of data.

Function Map is
    Let V = ∅
    Choose a cluster center X uniformly at random; V = V ∪ {X}
    for i = 1; i < m; i++
        if |V| < m
            compute D = the distance between X and its nearest centre that has already been chosen
            choose the next centre with probability proportional to D
    end for
    for i < n do
        find the nearest centre Vi in V for Xi; n++
        map(num, data)
        set i = i + 1
    end for
    for each word num in value
        return (num[i], data[i]); repeat
end function

Algorithm 1. Algorithm for Mapper Phase (Layer 1)

Step 5: Shuffle stage - traffic aware

Function Shuffle is
    if (num[1] == num[2])
        neglect num[2] and choose only num[1]
    else
        check with another set
    repeat up to num[n]
end function

Algorithm 2. Algorithm for Shuffle stage - traffic aware (Layer 2)

Step 6: Reducer phase

(i) The reducer phase is the combination of shuffle and reduce; the reducer's job is to process the data produced as the output of the map function, so that all the data points create a new set of output.
(ii) The new cluster centre is calculated using

    V_i = \frac{1}{C_i} \sum_{j=1}^{C_i} X_j    (2)

    where C_i is the number of data points in the ith cluster.
(iii) The reducer phase calculates the new centre using the data points that are stored in HDFS.
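As a concrete illustration of Steps 4(i)–(v), the mapper phase can be sketched as follows. This is a minimal sketch in Python, not the authors' Hadoop implementation; the function names are assumptions.

```python
import math

# Minimal sketch of the mapper phase (Layer 1). Each input point is
# compared against the current centres with the Euclidean distance of
# Eq. (1) and emitted as a <num, data> pair, where num is the index of
# the nearest centre.

def distance(x, y):
    # Eq. (1): D_XY = sqrt(sum_k (X_ik - X_jk)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def map_phase(points, centres):
    # Steps 4(iii)-(v): assign every point to its nearest centre and
    # emit the resulting (num, data) pairs.
    pairs = []
    for x in points:
        num = min(range(len(centres)), key=lambda i: distance(x, centres[i]))
        pairs.append((num, x))
    return pairs
```

For example, with points (0, 0) and (10, 10) and centres (1, 1) and (9, 9), map_phase emits (0, (0, 0)) and (1, (10, 10)).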
Function Reduce is
    Let V = ∅
    Calculate the new cluster center X; V = V ∪ {X, C}
    for i = 1; i < n; i++
        if |V| < n
            compute the new centre from the aggregated data points
    end for
end function

The aggregator checks whether the same num keys have been combined together before sending them to the reduce tasks. In the last stage, the sorted output is given as the input of the reduce function, where the computation of the co-ordinates of the new cluster centres is done before the process ends. Accuracy is computed as

    \text{Accuracy} = \frac{ \sum_i \max_j P_{ij} }{ \sum_j P_j }    (3)
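To make the shuffle and reduce descriptions concrete, the following sketch (an illustrative assumption in Python, not the paper's Hadoop code) aggregates pairs that share a num key, as in Algorithm 2, and then recomputes each centre with Eq. (2):

```python
from collections import defaultdict

def shuffle_phase(pairs):
    # Traffic-aware aggregation (Algorithm 2): pairs carrying the same
    # num key are combined locally before being sent to reduce tasks,
    # so each key crosses the network once per mapper rather than once
    # per pair.
    groups = defaultdict(list)
    for num, data in pairs:
        groups[num].append(data)
    return dict(groups)

def reduce_phase(groups):
    # Eq. (2): V_i = (1 / C_i) * sum_{j=1..C_i} X_j, where C_i is the
    # number of data points in the ith cluster.
    centres = {}
    for num, points in groups.items():
        dim = len(points[0])
        centres[num] = tuple(sum(p[d] for p in points) / len(points)
                             for d in range(dim))
    return centres
```

For the pairs (0, (0, 0)), (0, (2, 2)) and (1, (4, 4)), the shuffle stage keeps one group per key and the reducer returns the centres (1.0, 1.0) and (4.0, 4.0).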
Table 1 Comparison of various methods with four different data sets: execution time

Data sets            Basic K-means [13]   K-means parallel [23]   Bisecting K-means [24]   DB-scan [4]   Proposed method
Reuters-21578 [19]   59                   68                      61                       69            83
20 Newsgroup [20]    76                   70                      75                       73            89
Web Ace [21]         42                   55                      52                       66            68
TREC [22]            81                   77                      67                       80            89
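Taken together, Steps 1–6 of Sect. 3.2 amount to repeating the map, shuffle and reduce layers until the centres stop changing. The following self-contained toy sketch chains the three layers into one iteration loop; the names, the stopping test and the exact tuple comparison are all illustrative assumptions, not the evaluated implementation.

```python
import math

# Toy sketch chaining the three layers (map, traffic-aware shuffle,
# reduce) into a k-means iteration loop.

def nearest(x, centres):
    # Index of the nearest centre by Euclidean distance (Eq. 1).
    return min(range(len(centres)), key=lambda i: math.dist(x, centres[i]))

def kmeans_mapreduce(points, centres, max_iter=10):
    for _ in range(max_iter):
        # Layer 1 (map): emit <num, data> pairs.
        pairs = [(nearest(x, centres), x) for x in points]
        # Layer 2 (shuffle): aggregate pairs that share a num key.
        groups = {}
        for num, x in pairs:
            groups.setdefault(num, []).append(x)
        # Layer 3 (reduce): recompute each centre as the mean (Eq. 2).
        new = list(centres)
        for num, pts in groups.items():
            dim = len(pts[0])
            new[num] = tuple(sum(p[d] for p in pts) / len(pts)
                             for d in range(dim))
        if new == centres:      # stop when the centres no longer change
            break
        centres = new
    return centres
```

With four points forming two well-separated pairs and one initial centre near each pair, the loop converges to the pair means (0.5, 0.5) and (9.5, 9.5).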