A Dissertation Report on
Mrs. Geetha J
Assistant Professor,
Department of Computer Science and Engineering
Ramaiah Institute of Technology, Bangalore
CERTIFICATE
This is to certify that the project work titled “Application of Zipf’s Law for the
Optimization of Hadoop Shuffle Phase and Comparison of Performance of Hadoop Job
Execution based on Shuffle Parameters” is a bonafide work carried out by Aasia Afreen
(1MS14CS002), Chayanika Bhandary (1MS14CS031) & Jyothi Kumari R
(1MS14CS049) in partial fulfillment of the requirements for the award of the degree of Bachelor of Engineering in Computer Science and Engineering during the year 2018. The project report has been approved as it satisfies the academic requirements with respect to the project work prescribed for the Bachelor of Engineering degree. To the best of our understanding, the work submitted in this report has not been submitted, in part or in full, for the award of the said degree.
External Examiners
DECLARATION
We the students of final semester BE, Department of Computer Science and Engineering,
Ramaiah Institute of Technology, Bangalore, hereby declare that the project entitled
“Application of Zipf’s Law for the Optimization of Hadoop Shuffle Phase and
Comparison of Performance of Hadoop Job Execution based on Shuffle Parameters”,
has been completed and written by us under the guidance of Mrs. Geetha J, Dept. of CSE, Bangalore, in partial fulfillment of the requirements for the award of the degree of Bachelor of Engineering, and has not formed the basis for the award of any other degree or diploma certificate.
Place: Bangalore
Date: 05-05-2018
(1MS14CS002 AASIA AFREEN)
(1MS14CS031 CHAYANIKA BHANDARY)
(1MS14CS049 JYOTHI KUMARI R)
ACKNOWLEDGEMENT
First and foremost, our utmost gratitude goes to Mrs. Geetha J, Dept. of CSE, MSRIT, whose
sincerity and encouragement we will never forget. She has been our inspiration as we
overcame all the obstacles in the completion of this project work.
Dr. Anita Kanavalli, Head of the Department of Computer Science and Engineering, showed kind concern and consideration regarding the project work, and we would like to thank her for her continued support.
We would like to thank our beloved principal Dr. N.V.R Naidu for his support and
encouragement.
This work would not have been possible without the guidance and help of several individuals
who in one way or another contributed their valuable assistance in preparation and
completion of this study.
We would like to express our sincere thanks to all the teaching and non-teaching faculty of the CSE Department and our dear friends who helped in every way while preparing this report.
ABSTRACT
Hadoop is an open source distributed processing framework that manages data processing
and storage for big data applications running in clustered systems. It is at the center of a
growing ecosystem of big data technologies that are primarily used to support advanced
analytics initiatives, including predictive analytics, data mining and machine learning
applications. Hadoop can handle various forms of structured and unstructured data, giving
users more flexibility for collecting, processing and analyzing data than relational databases
and data warehouses provide. Hadoop uses MapReduce for parallel processing, which consists of two major phases: the Map phase and the Reduce phase. There is also an intermediate phase that often goes unnoticed: the shuffle phase. Analyses have found that roughly one-third of the MapReduce processing time is consumed by the shuffle phase, so an effort to optimize Hadoop performance by fine-tuning the shuffle phase is well justified. In this project, we attempt a feasibility test to check whether a statistical law, Zipf's Law, can be applied to make smart decisions while spilling, in order to improve the shuffle phase performance of the Hadoop framework. We also test the law against different datasets of varying sizes to check whether it is biased against some types of datasets. In addition, we compare the time required for Hadoop job execution while altering the different shuffle phase parameters (specifically io.sort.factor, io.sort.mb and io.sort.spill.percent) and create predictive models to estimate the Hadoop job execution time as these shuffle phase parameters are varied.
LIST OF FIGURES
1. List of activities in the schedule
2. Gantt chart (Part 1)
3. Gantt chart (Part 2)
4. Existing Architecture
5. Proposed Architecture
6. Sequence Diagram for Zipf’s Law Optimization
7. Sequence Diagram for Parameter Tuning
8. Map Reduce Architecture
9. Proposed Algorithm
10. Zipf’s Law plot for Pride & Prejudice Dataset
11. Zipf’s Law plot for Don Quixote Dataset
12. Zipf’s Law plot for Jane Eyre – Autobiography Dataset
13. Zipf’s Law plot for War & Peace Dataset
14. Failed Hadoop Job Execution
15. Successful Hadoop Job Execution
16. Parameter plots: io.sort.mb (MB) vs time (ms) with io.sort.spill.percent = 60%
17. Parameter plots: io.sort.mb (MB) vs time (ms) with io.sort.spill.percent = 80%
18. Parameter plots: io.sort.spill.percent vs time (ms) with io.sort.mb = 100 MB
19. Parameter plots: io.sort.spill.percent vs time (ms) with io.sort.mb = 140 MB
20. Correlation Matrix
21. Colored correlation matrix
Contents
Declaration
Acknowledgements
Abstract
List of Figures
1 INTRODUCTION
1.1 General Introduction
1.2 Problem Statement
1.3 Objectives of the project
1.4 Project deliverables
1.5 Current Scope
1.6 Future Scope
2 PROJECT ORGANIZATION
2.1 Software Process Models
2.2 Roles and Responsibilities
3 LITERATURE SURVEY
3.1 Introduction
3.2 Related Works with the citation of the References
3.3 Conclusion of Survey
6 DESIGN
6.1 Introduction
6.2 Architecture Design
6.3 Graphical User Interface
6.4 Sequence Diagram
6.5 Conclusion
7 IMPLEMENTATION
7.1 Tools Introduction
7.2 Technology Introduction
7.3 Overall view of the project in terms of implementation
7.4 Explanation of the Algorithm and how it has been implemented
7.5 Information about the implementation of Modules
7.6 Conclusion
8 TESTING
8.1 Introduction
8.2 Testing Tools and Environment
8.3 Test cases
1. INTRODUCTION
1.1 General Introduction
Matt Aslett defined Big Data as follows: “Big Data is now almost universally understood to refer to the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored due to the limitations of traditional data management technologies.” Today, “Big Data” is more than just a buzzword. It is of real concern, with data growing at an enormous rate and with the increasing demand for processing the data so generated. In such a scenario, Big Data platforms like Hadoop come to the rescue. Tuning Hadoop performance appears inevitable for satisfying present needs and the ever-increasing demands of tomorrow.
We also attempt to test Zipf's law for bias against different datasets of varying sizes; the law can then be used for the optimization of the Shuffle phase of the MapReduce paradigm.
The Hadoop framework has over 190 configuration parameters, and some of them have significant effects on the performance of a Hadoop job. Manually setting optimum values for these parameters is a challenging and time-consuming task. Apart from tuning Hadoop parameters, we also see that optimizing the Hadoop spill and shuffle phases presents immense scope for performance improvement. A sketch of how such parameters can be set when submitting a job is shown below.
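As an illustration of what such tuning looks like in practice, the sketch below launches a WordCount job with explicit shuffle settings from Python and records the wall-clock time. The jar path, the input/output paths and the use of the bundled examples jar are assumptions made for this sketch; recent Hadoop releases rename these properties to mapreduce.task.io.sort.mb, mapreduce.map.sort.spill.percent and mapreduce.task.io.sort.factor.

```python
# Illustrative sketch (not the project's exact script): launch a WordCount job
# with explicit shuffle-phase settings and record the wall-clock time.
# Paths and the examples jar location are assumptions for this sketch.
import subprocess
import time

def run_wordcount(io_sort_mb, spill_percent, sort_factor,
                  input_path="/data/wiki_dump.txt", output_path="/out/wc"):
    cmd = [
        "hadoop", "jar", "hadoop-mapreduce-examples.jar", "wordcount",
        # Older parameter names, as used in this report.
        "-D", f"io.sort.mb={io_sort_mb}",
        "-D", f"io.sort.spill.percent={spill_percent}",   # fraction, e.g. 0.80 for 80%
        "-D", f"io.sort.factor={sort_factor}",
        input_path, output_path,
    ]
    start = time.time()
    subprocess.run(cmd, check=True)
    return time.time() - start   # execution time in seconds

# Example: one data point for the tuning experiments
# elapsed = run_wordcount(io_sort_mb=100, spill_percent=0.80, sort_factor=10)
```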
Since humongous amounts of data are being produced every day, Big Data is here to stay. More and more people and organizations will resort to Big Data technologies to solve their data problems.
2. PROJECT ORGANIZATION
2.1 Software Process Models
The software process model used in our project is the Waterfall model. Since the requirements are fixed and all the procedures are sequential, the Waterfall model suits our case best. We first fixed our problem statement, and then our software and hardware requirements. We finalized our datasets and then started working on our implementation. Testing followed the completion of the implementation. In the end, the entire project process was documented.
3. LITERATURE SURVEY
3.1 Introduction
Below we present a literature review of the relentless efforts of researchers all over the world to optimize the MapReduce paradigm, especially the Shuffle phase.
In the first paper surveyed, the author makes a few assumptions: in CPU-heavy applications, fewer tasks should run in parallel to avoid over-utilization of the CPU, while a larger number of tasks should be spawned overall to distribute the load evenly among the CPUs. In CPU-light applications, a larger number of tasks should be scheduled in parallel, and fewer tasks should be spawned overall to reduce the per-task overhead.
The experiment consisted of two parts: determining whether a task is CPU-heavy or CPU-light by running the job on a single-node cluster, and verifying the theoretical assumptions on a multi-node cluster. The number of splits, which is automatically decided by the Hadoop Map stage, can be monitored using the UNIX file split or cat utilities. The test cases used for the experiment are Grep, PiEstimator and MRIF. Among these three, Grep and PiEstimator were found to be CPU-light tasks, with CPU utilizations of 35% and 4% respectively, while MRIF, a CPU-heavy task, has 88% CPU utilization.
The theories verified in this paper are that, for CPU-light applications, the number of map tasks should be less than or equal to the number of CPUs, and if many map tasks must be run, as many as possible should run in parallel. For CPU-heavy tasks, at least as many map tasks as there are CPUs should be run, and the number of map tasks spawned in parallel should be less than or equal to two.
In the paper Optimizing Hadoop for the Cluster [2], the author, Hansen, claims that the default configuration of Hadoop is not always the most efficient. For best results, the number of mappers should equal the input size divided by the block size, and the number of reducers should equal 1.75 * (number of nodes) * (maximum number of reducers that can run in parallel). The experimental results show that, for the WordCount application, the new configuration performed 2.5 times faster than the default Hadoop configuration and 1.7 times faster than Cloudera's configuration. A small sketch of this heuristic is given below.
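The sizing heuristic from [2] can be expressed in a few lines; the default block size and per-node reducer slot count below are illustrative assumptions, not values from the paper.

```python
# Sketch of the sizing heuristic described in [2]; block size and the
# per-node reducer slot count below are illustrative assumptions.
def suggested_task_counts(input_size_bytes, num_nodes,
                          block_size_bytes=128 * 1024 * 1024,
                          max_parallel_reducers_per_node=2):
    # Number of mappers ~ input size / block size
    num_mappers = max(1, -(-input_size_bytes // block_size_bytes))  # ceiling division
    # Number of reducers ~ 1.75 * nodes * max reducers runnable in parallel
    num_reducers = int(1.75 * num_nodes * max_parallel_reducers_per_node)
    return num_mappers, num_reducers

# e.g. a 10 GB input on a 4-node cluster:
# suggested_task_counts(10 * 1024**3, 4)  ->  (80, 14)
```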
Karthik Kambatla, Abhinav Pathak and Himabindu Pucha, in their paper Towards Optimizing Hadoop Provisioning in the Cloud [3], say that there are no hard and fast configuration rules applicable to all kinds of jobs. They suggest that, to make proper use of the resources, a resource-consumption profile history of the application is necessary. The steps to generate this profile history are: find the resource consumption by running the task against a smaller number of nodes and a smaller dataset, and then match the resource-consumption signature with the signatures of previous tasks stored in a database. The closest signature is then assigned to the task, as sketched below.
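A much-simplified illustration of the signature-matching idea in [3] is given below; the profile format and the use of Euclidean distance are assumptions made for this sketch.

```python
# Simplified illustration of the consumption-signature matching idea in [3].
# The profile format (CPU, memory, disk, network utilisation) and the use of
# Euclidean distance are assumptions made for this sketch.
import math

def closest_signature(new_profile, history):
    """history: dict mapping job name -> resource signature (list of floats)."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(history, key=lambda job: distance(history[job], new_profile))

# profile = [cpu %, mem %, disk MB/s, net MB/s] measured on a small test run
# best_match = closest_signature([72.0, 40.0, 15.0, 5.0],
#                                {"wordcount": [35.0, 30.0, 20.0, 8.0],
#                                 "mrif":      [88.0, 45.0, 10.0, 4.0]})
```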
The paper Hadoop MapReduce Shuffle Phase Management [4], by Narendar Kumar, proposes that the time required for the shuffle phase can be reduced by using an SDN-enabled structure. Having an SDN controller helps control the network traffic and enables re-routing traffic through alternative routes, thus reducing network congestion in the shuffle phase.
For benchmarking, the authors used Hadoop TeraSort. Datasets of different sizes (2 GB, 10 GB, 20 GB and 50 GB) were run on a cluster of 16 nodes having an SDN-based architecture, including an SDN controller.
In the white paper Advanced Hadoop Tuning and Optimizations [5], Sanjay Sharma proposes that tuning Hadoop configuration parameters is a black-box art and suggests that users tune as many parameters as possible for better results. Among the many suggestions provided, the most important ones include: mapred.compress.map.output should be set to TRUE for large clusters and large jobs; mapred.map/reduce.tasks.speculative.execution should be set to FALSE if the average execution time is greater than 1 hour; mapred.tasktracker.map/reduce.tasks.maximum should be set in the range [(cores per node) / 2, 2 * (cores per node)] for large clusters; and the block size should be made larger for large datasets to reduce the number of map tasks spawned. These suggestions are summarized in the sketch below.
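The suggestions quoted from [5] can be summarized in a small helper; the parameter names are the older (Hadoop 1.x era) names used in the white paper, and the concrete values chosen here are only examples.

```python
# Illustrative helper that encodes the suggestions quoted from [5];
# the chosen values are examples, not prescriptions.
def suggested_config(cores_per_node, avg_task_runtime_hours, large_cluster=True):
    max_tasks_low, max_tasks_high = cores_per_node // 2, 2 * cores_per_node
    return {
        "mapred.compress.map.output": large_cluster,                 # TRUE for large clusters/jobs
        "mapred.map.tasks.speculative.execution": avg_task_runtime_hours <= 1,
        "mapred.reduce.tasks.speculative.execution": avg_task_runtime_hours <= 1,
        # pick values inside the suggested range [(cores/2), 2*cores]
        "mapred.tasktracker.map.tasks.maximum": max_tasks_high,
        "mapred.tasktracker.reduce.tasks.maximum": max_tasks_low,
    }

# suggested_config(cores_per_node=8, avg_task_runtime_hours=2)
```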
In the paper Speculative Execution Strategy Based on Node Classification and Hierarchy Index Mechanism for Heterogeneous Hadoop Systems [6], the authors propose to employ node classification along with a different hierarchical storage structure that keeps the completion times of the map and reduce phases of a backup task to help in scheduling. Scheduling is based on an intuitive assumption: a backup task is selected to be scheduled only if its execution time is at least 50% less than the remaining execution time of the currently running task. The authors claim that the new scheduling technique runs the WordCount problem 12 times faster than the original Hadoop. They also claim that hierarchical indexing of the stored data reduces the search time complexity: if the data were stored as a list, the complexity would be O(k), where k is the number of stored items, but with the hierarchical structure each of the n nodes holds roughly k/n items and a binary index over the nodes costs log2(n), so the complexity becomes log2(n) + k/n, which is effectively O(log2(n)).
In the paper Reducing MapReduce Abstraction Costs for Text-Centric Applications [8], the authors propose two optimization techniques to improve MapReduce performance for text-centric tasks. The first approach, Frequency Buffering, makes use of Zipf's law to maintain an in-memory buffer that keeps track of frequently appearing keys during the shuffle phase. To find the frequent keys, the algorithm proposed by Metwally et al. is used for profiling; it uses a table of frequency counts together with a linked list of the keys that occur that many times. After profiling, the top k keys are predicted, reducing sort costs before the combine and reduce phases. With this optimization, 40% of the abstraction costs are removed for WordCount, 30% for InvertedIndex, and 45% for WordPOSTag. A simplified sketch of the profiling idea is given below.
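A much-simplified sketch of the profiling idea is shown below; the actual structure of Metwally et al. bounds its memory with a counter table and a frequency-ordered linked list, whereas a plain Counter is used here purely for illustration.

```python
# Much-simplified profiling sketch in the spirit of Frequency Buffering [8]:
# keep an in-memory record of the (approximately) most frequent keys seen so
# far so they can be treated specially during the shuffle. The real algorithm
# (Metwally et al.) uses a bounded counter structure; a plain Counter is used
# here only to illustrate the idea.
from collections import Counter

class FrequencyBuffer:
    def __init__(self, top_k=1000):
        self.top_k = top_k
        self.counts = Counter()

    def observe(self, key):
        self.counts[key] += 1

    def frequent_keys(self):
        return {key for key, _ in self.counts.most_common(self.top_k)}

# During the map phase, observe() each emitted key; at spill time,
# frequent_keys() gives the set of keys worth keeping in memory.
```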
The second approach presented by the authors is the Spill Matcher, which makes use of two types of threads (support threads and map threads). Here, the support threads sort the data when a spill occurs, followed by a combine() step that writes the output to an intermediate local file. The map threads merge-sort the spilled tuples from the different intermediate local files, after which combine() runs again before the reduce phase. To minimize the wait times between these threads, a bulk-service queuing model can be used, which relies on two assumptions: the producer threads fill the spill buffer as a Poisson process, and the execution times of the consumer threads that process the spills are identically distributed random variables. With this approach, the wait time for WordCount reduces by 90%, for InvertedIndex by 89%, for AccessLogSum by 77%, and for AccessLogJoin by 83%. WordPOSTag has negligible wait time in its slowest thread, and the Spill Matcher is less effective for PageRank, cutting the wait time between the two threads by only 42%.
Native Hadoop uses a Java-based network transport stack on top of the JVM for its merge and shuffle phases, which proves to be a bottleneck in the Hadoop framework. In the paper JVM-Bypass for Efficient Hadoop Shuffling [9], the authors implement a portable plug-in library for JVM bypassing that lets the existing Hadoop framework leverage both the TCP/IP protocol and the RDMA protocol. The lack of RDMA support was an inherent problem in Hadoop, and this proposal addresses it.
Yanfei Guo, Jia Rao, Dazhao Cheng, and Xiaobo Zhou, in their paper iShuffle: Improving Hadoop Performance with Shuffle-on-Write [10], propose to separate the shuffle phase from the reduce phase and make it job-independent. iShuffle claims to solve the problem of skewed input data to the reducers, as it predicts the sizes of the partitions to be fed to the reducers and also balances the distribution of the map outputs across the nodes. iShuffle is designed to have three components: a shuffler, a shuffle manager, and a task scheduler. For iShuffle to function properly with minimal changes to the existing architecture, the shuffler is designed as an independent component in the TaskTracker which takes the input from the combiner and shuffles the data. It performs shuffling each time data is written to the local disks by map tasks, i.e., shuffle-on-write.
It is seen that a lot of research has been undertaken in the field of tuning Hadoop parameters, but it still remains a black box with no hard and fast rules established. This provides immense opportunity to explore further in this regard.
Zipf's law also presents itself as a means of optimizing the Hadoop shuffle phase. If Zipf's law can be tested for bias against different datasets and proves to have no bias against any particular dataset, then the Hadoop shuffle phase can be designed in accordance with this statistical law, and such a concept can also be extended to other popular Big Data platforms like Apache Spark.
In this project we use Big Data tools like Apache Hadoop, along with data analytics, to optimize the shuffle phase in addition to verifying the Zipfian distribution. The code is written in Java (Hadoop word count and sort) and in Python (sorting and graph plotting). For parameter tuning, Python Flask has been used.
6. DESIGN
6.1 Introduction
For implementing our project, we have used Apache Hadoop Big Data platform. We have
also used Python for sorting and getting our final results.
Existing Architecture:
In Hadoop MapReduce, after the Map phase and before the beginning of the Reduce phase is
a handoff process, known as shuffle and sort. Here, data from the mapper tasks is prepared
and moved to the nodes where the reducer tasks will be run. When the mapper task is
complete, the results are sorted by key, partitioned if there are multiple reducers, and then
written to disk. During the reduce-side shuffle, the intermediate results are merged into an ordered file, after which the shuffle phase stops. During the shuffle phase, the intermediate map output is stored in a circular buffer with limited storage capacity (100 MB by default), and when the buffer is 80% full (this value is also configurable), the data is spilled onto the disk. Before the reduce phase starts, all the map output is expected to be on the disk, from where the reducer threads fetch the input for each reducer. This leads to a mandatory spill to the disk at the end of the shuffle phase.
● We see that the spilling phase is a major hindrance to better performance: if the data is spilled more than once to the disk, three extra I/O operations are needed to sort by key. These are writing to the disk the first time, then reading from and writing to the disk again to perform the sort. Since disk access time is 6-7 times the buffer access time, spilling to the disk more than once should be avoided to ameliorate the performance of this phase. It is also important to understand that if the spill thread is slower than the rate at which the map output is produced, the circular buffer fills up and the map task has to be blocked until there is again free space in the buffer.
● In case the intermediate results are geographically distributed across different data centers, data transfer is done through HTTP connections. This process is highly time-consuming, as the bandwidth and topology affect remote file transfer.
Proposed Architecture:
In this project, we propose a new design for the existing Hadoop architecture that uses Zipf's Law to gain better performance in the shuffle phase. To make use of the frequency-rank relationship given by the Zipfian distribution, one needs prior knowledge of the text-centric application; if not, the data needs to be preprocessed to gain insight into the frequencies of the keys appearing in it. Formulating an efficient algorithm for this preprocessing is beyond the scope of the project.
Using prior knowledge of the data and the frequency of appearance of its keys, smart decisions can be made about which keys to spill onto the disk when the circular buffer is nearing saturation: the relatively infrequent keys are spilled.
For the GUI, we have used Python Flask. The user gives four Hadoop shuffle phase tuning parameter values as input, namely dataset size, io.sort.mb, io.sort.spill.percent and io.sort.factor. The output, the execution time predicted by the three prediction models (Regression, SVM and KNN), is displayed along with the accuracy of each model. Depending on the accuracy, the user can select the desired execution time from these three models. A sketch of such an endpoint is shown below.
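A minimal sketch of such an endpoint is given below; the form field names, the model file names and the way accuracies are stored are assumptions made for this sketch, not the project's exact code.

```python
# Minimal sketch of the parameter-tuning GUI backend; form field names and
# the way the trained models are loaded are assumptions for this sketch.
from flask import Flask, request, jsonify
import joblib  # models assumed to have been saved earlier with joblib.dump()

app = Flask(__name__)
models = {name: joblib.load(f"{name}.pkl") for name in ("regression", "svm", "knn")}
accuracy = {"regression": 0.0, "svm": 0.0, "knn": 0.0}  # placeholder test accuracies

@app.route("/predict", methods=["POST"])
def predict():
    features = [[
        float(request.form["dataset_size_mb"]),
        float(request.form["io_sort_mb"]),
        float(request.form["io_sort_spill_percent"]),
        float(request.form["io_sort_factor"]),
    ]]
    return jsonify({
        name: {"predicted_time_ms": float(model.predict(features)[0]),
               "accuracy": accuracy[name]}
        for name, model in models.items()
    })

# if __name__ == "__main__":
#     app.run(debug=True)
```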
6.5 Conclusion
Thus, we propose an efficient design that ensures the Shuffle phase of the Hadoop MapReduce paradigm is optimized by making use of Zipf's Law for faster execution, and that also makes use of tuning parameters to improve job execution.
7. IMPLEMENTATION
7.1 Tools Introduction
The tools used in this project are SciPy, Python, Hadoop and Matplotlib.
The main technology we emphasize in our project is Apache Hadoop and its MapReduce paradigm. Though the Hadoop MapReduce framework has been wholeheartedly accepted in the industry, its efficiency has been questioned time and again, and a lot of research is in progress to assess and improve the performance of the framework as a whole as well as of its components. Here we briefly describe the components of the MapReduce architecture.
The mapper maps the <key, value> input to intermediate <key, value> pairs, and these are put into the circular buffer. The transformed intermediate output is not necessarily of the same form as the input. In the spilling phase, the map output is stored in an in-memory buffer, and when the buffer is almost full (the threshold is usually set to 80%), spilling to the local disk is started by a background thread. Spilling happens at least once, i.e. when the entire map task is completed, because all the partitions, one for each reducer, must be available on the disk.
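The project's WordCount is implemented in Java; purely as an illustration of the <key, value> flow described above, an equivalent pair of Hadoop Streaming scripts in Python might look like the following.

```python
# Illustration only: the project's WordCount is written in Java. An equivalent
# Hadoop Streaming mapper and reducer in Python read stdin and write stdout.
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")            # emit intermediate <word, 1>

def reducer():
    current, total = None, 0
    for line in sys.stdin:                 # input arrives sorted by key
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        mapper()
    else:
        reducer()
```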
By running Hadoop jobs with different shuffle parameter settings, we have generated datasets that contain the Hadoop parameters and the execution time as different columns. These datasets are used to train ML models such as multiple regression and SVMs, which later predict the Hadoop job execution time given the parameter values provided by the user through the GUI. Thus, we can approximately predict the job execution time depending on the tuning performed on the different shuffle phase parameters. A sketch of this training step follows.
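A sketch of this training step is shown below, assuming the recorded runs are stored in a CSV file with the column names used here (both the file name and the column names are assumptions).

```python
# Sketch of the model-training step, assuming the generated data points are in
# a CSV with the columns named below (file name and column names are assumptions).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

data = pd.read_csv("tuning_runs.csv")     # hypothetical file of recorded runs
X = data[["dataset_size_mb", "io_sort_mb", "io_sort_spill_percent", "io_sort_factor"]]
y = data["execution_time_ms"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "regression": LinearRegression(),
    "svm": SVR(kernel="rbf"),
    "knn": KNeighborsRegressor(n_neighbors=5),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "R^2 on held-out runs:", model.score(X_test, y_test))
```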
For our approach, where we suggest applying Zipf's Law using a hash table during the shuffle phase, the proposed algorithm assumes for simplicity that there is only a single spill from the circular buffer to the disk. The frequency-rank relationship of all the keys is presumed to be known (pre-processing of the input data is required). The number of keys to be kept in the buffer is chosen heuristically and is assumed to be 5% of the total keys available. It is also known that the buffer access time is much lower than the disk access time. The algorithm, under these assumptions, shows mathematically that the job execution time in a traditional system is much higher than with the approach proposed in Figure 9.
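As an illustrative sketch only (the actual proposed algorithm is the one given in Figure 9), the spill-selection idea under the stated assumptions can be expressed as follows.

```python
# Illustrative sketch only: the actual proposed algorithm is given in Figure 9.
# Under the stated assumptions (a single spill, frequency ranks known from
# pre-processing, roughly 5% of keys retained), spill selection reduces to
# keeping the highest-ranked keys in the buffer and writing the rest to disk.
def select_spill(buffer, rank_by_key, keep_fraction=0.05):
    """buffer: dict key -> list of values; rank_by_key: 1 = most frequent."""
    keys_by_rank = sorted(buffer, key=lambda k: rank_by_key.get(k, float("inf")))
    keep_count = max(1, int(len(keys_by_rank) * keep_fraction))
    kept = {k: buffer[k] for k in keys_by_rank[:keep_count]}
    spilled = {k: buffer[k] for k in keys_by_rank[keep_count:]}
    return kept, spilled   # 'spilled' would be sorted and written to local disk
```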
7.6 Conclusion
From the obtained results, it is seen that Zipf's law is followed by all the datasets to a large extent, and the correlation matrix shows the relationship between the different parameters, including the execution time itself. The machine learning models used can help us predict the Hadoop job execution time when a set of parameters is given.
8. TESTING
8.1 Introduction
Zipf's law has to be tested for each dataset individually to ensure that the proposed distribution is not biased towards a particular dataset. Even though testing is usually automated in industry, manual testing suffices for small projects like this one. A similar approach is used for parameter tuning.
We have used manual testing as well as graph-based testing for the verification of Zipf's law. Expected results are computed using the Zipfian distribution, and the obtained results are tested against these pre-computed expected results. Hence, visual testing suffices.
The datasets used for the verification of Zipf's Law were obtained from Project Gutenberg. Four datasets of different sizes were chosen, namely Pride and Prejudice (727 KB), Don Quixote (2.4 MB), War and Peace (3.4 MB) and Jane Eyre – Autobiography (1.2 MB).
For parameter tuning, the datasets used were taken from the Wikimedia Data Dumps. The three datasets are of sizes 1.2 GB, 832.4 MB and 2.1 GB respectively.
We are striving to predict the (approximate) job execution time given the Hadoop tuning parameters and the dataset size using different machine learning techniques. (This prediction may not be exactly correct, because the time also depends on the MapReduce business logic, but our effort lies in approximation, so that users have some idea about tuning and do not use Hadoop like a black box.) We also propose to suggest the most optimal tuning parameters for a dataset of a given size.
We see that the datasets used approximately follow Zipf’s Law, and thus this law could be
used to optimize the Shuffle phase of Hadoop MapReduce. Later this optimization could
even be extended to Apache Spark.
10. REFERENCES
11. APPENDIX
The following are the graphs plotted for each of the datasets taken. We can see here that our
output (plotted in red) closely follows the theoretical output (plotted in blue).
Figure 10: Pride & Prejudice Dataset
Figure 11: Don Quixote Dataset
Figure 12: Jane Eyre – Autobiography Dataset
Figure 13: War & Peace Dataset
The datasets are fed as input to the WordCount MapReduce job on the Hadoop cluster, and the output so obtained is sorted in descending order of word frequency. The highest frequency is used as the reference for calculating the expected values in accordance with Zipf's Law. Both the actual results and the expected values are plotted on a graph to visually assess the deviation of the obtained results from the theoretical values.
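A sketch of this verification step is shown below, assuming the sorted WordCount output is available as a list of (word, count) pairs; the plotting details are a presentational choice for this sketch, not the project's exact script.

```python
# Sketch of the verification step, assuming the WordCount output has already
# been sorted by descending frequency into a list of (word, count) pairs.
import matplotlib.pyplot as plt

def plot_zipf(sorted_counts, title):
    actual = [count for _, count in sorted_counts]
    top = actual[0]
    # Zipf's law: expected frequency of the word at rank r is roughly top / r
    expected = [top / rank for rank in range(1, len(actual) + 1)]
    ranks = range(1, len(actual) + 1)
    plt.plot(ranks, expected, "b-", label="expected (Zipf)")   # theoretical, in blue
    plt.plot(ranks, actual, "r-", label="actual")              # observed, in red
    plt.xlabel("rank")
    plt.ylabel("frequency")
    plt.title(title)
    plt.legend()
    plt.show()

# plot_zipf(sorted_word_counts, "Pride & Prejudice")
```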
The screenshots below were taken at the time of running the Hadoop jobs after tuning the
parameters to create different data points according to the dataset.
As highlighted in the figure above, the first screenshot was taken when the Hadoop job failed; “Spill Failed” is the error message shown.
When the job is successful, the failed shuffle count shown is zero, the shuffled maps are indicated, and, as highlighted in the figure below, the time taken for execution can be noted.
We’ve plotted many graphs, out of which a few are displayed here. The following graphs are
a plot of job execution time vs io.sort.mb, where the parameter io.sort.spill.percent is varied
with values 60% & 80% respectively.
Figure 16: io.sort.mb (MB) vs time (ms) with io.sort.spill.percent = 60%
Figure 17: io.sort.mb (MB) vs time (ms) with io.sort.spill.percent = 80%
The following graphs plot job execution time vs io.sort.spill.percent, where the parameter io.sort.mb is varied with values 100 MB & 140 MB respectively.
The following is the correlation matrix that was created, which shows the correlation between the different parameters, including the execution time. A sketch of how both the plain and the colored matrices can be produced is given below.
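A sketch of the correlation-matrix step is shown below, assuming the same (hypothetical) CSV of recorded runs used for model training.

```python
# Sketch of the correlation-matrix step, assuming the same CSV of recorded
# runs used for model training.
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("tuning_runs.csv")     # hypothetical file of recorded runs
corr = data.corr()                        # pairwise correlation incl. execution time
print(corr)

# Coloured version of the same matrix
plt.matshow(corr, cmap="coolwarm")
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.colorbar()
plt.show()
```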
The colored correlation matrix shows the correlation in a visually appealing way. Thus, these are the results that have been obtained for our project.