2012 Efficient Big Data Processing in Hadoop MapReduce
ABSTRACT
This tutorial is motivated by the clear need of many organizations, companies, and researchers to deal with big data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. A popular data processing engine for big data is Hadoop MapReduce. Early versions of Hadoop MapReduce suffered from severe performance problems. Today, this is becoming history. There are many techniques that can be used with Hadoop MapReduce jobs to boost performance by orders of magnitude. In this tutorial we teach such techniques. First, we will briefly familiarize the audience with Hadoop MapReduce and motivate its use for big data processing. Then, we will focus on different data management techniques, going from job optimization to physical data organization like data layouts and indexes. Throughout this tutorial, we will highlight the similarities and differences between Hadoop MapReduce and Parallel DBMS. Furthermore, we will point out unresolved research problems and open issues.

1. INTRODUCTION
Nowadays, dealing with datasets on the order of terabytes or even petabytes is a reality [24, 23, 19]. Therefore, processing such big datasets in an efficient way is a clear need for many users. In this context, Hadoop MapReduce [6, 1] is a big data processing framework that has rapidly become the de facto standard in both industry and academia [16, 7, 24, 10, 26, 13]. The main reasons for this popularity are the ease-of-use, scalability, and failover properties of Hadoop MapReduce. However, these features come at a price: the performance of Hadoop MapReduce is usually far from the performance of a well-tuned parallel database [21]. Therefore, many research works (from industry and academia) have focused on improving the performance of Hadoop MapReduce jobs in many aspects. For example, researchers have proposed different data layouts [16, 9, 18], join algorithms [3, 5, 20], high-level query languages [10, 13, 24], failover algorithms [22], query optimization techniques [25, 4, 12, 14], and indexing techniques [7, 15, 8]. The latter includes HAIL [8]: an indexing technique presented at VLDB 2012. It improves the performance of Hadoop MapReduce jobs by up to a factor of 70 without requiring expensive index creation phases. Over the past years researchers have actively studied the different performance problems of Hadoop MapReduce. Unfortunately, users do not always have a deep knowledge of how to efficiently exploit the different techniques.

In this tutorial, we discuss how to reduce the performance gap to well-tuned database systems. We will point out the similarities and differences between the techniques used in Hadoop and those used in parallel databases. In particular, we will highlight research areas that have not yet been exploited. In the following, we present the three parts in which this tutorial will be structured.

2. HADOOP MAPREDUCE
We will focus on Hadoop MapReduce, which is the most popular open source implementation of the MapReduce framework proposed by Google [6]. Generally speaking, a Hadoop MapReduce job mainly consists of two user-defined functions: map and reduce. The input of a Hadoop MapReduce job is a set of key-value pairs (k, v) and the map function is called for each of these pairs. The map function produces zero or more intermediate key-value pairs (k′, v′). Then, the Hadoop MapReduce framework groups these intermediate key-value pairs by intermediate key k′ and calls the reduce function for each group. Finally, the reduce function produces zero or more aggregated results. The beauty of Hadoop MapReduce is that users usually only have to define the map and reduce functions. The framework takes care of everything else, such as parallelization and failover. The Hadoop MapReduce framework uses a distributed file system to read and write its data. Typically, Hadoop MapReduce uses the Hadoop Distributed File System (HDFS), which is the open source counterpart of the Google File System [11]. Therefore, the I/O performance of a Hadoop MapReduce job strongly depends on HDFS.

In the first part of this tutorial, we will introduce Hadoop MapReduce and HDFS in detail. We will contrast both with parallel databases. In particular, we will show and explain the static physical execution plan of Hadoop MapReduce and how it affects job performance. In this part, we will also survey high-level languages that allow users to run jobs even more easily.
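
The map, group, and reduce steps of the programming model described in Section 2 can be illustrated with a minimal, single-process sketch. This is not Hadoop code and does not use the Hadoop API: the names run_job, word_count_map, and word_count_reduce are hypothetical, and the word-count job and its input records are assumed purely for illustration. A real Hadoop MapReduce job would read its input from HDFS and run the two functions in parallel across a cluster.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Simulate the MapReduce programming model in one process (illustration only)."""
    # Map phase: call map_fn once per input pair (k, v); it may emit
    # zero or more intermediate pairs (k', v').
    groups = defaultdict(list)
    for k, v in records:
        for k2, v2 in map_fn(k, v):
            # Grouping: collect intermediate values by intermediate key k'.
            groups[k2].append(v2)

    # Reduce phase: call reduce_fn once per group; it may emit zero or
    # more results. Keys are visited in sorted order here only to make
    # the output of this sketch deterministic.
    results = []
    for k2 in sorted(groups):
        results.extend(reduce_fn(k2, groups[k2]))
    return results


def word_count_map(offset, line):
    # Hypothetical map function: emit (word, 1) for every word in a line.
    for word in line.split():
        yield word.lower(), 1


def word_count_reduce(word, counts):
    # Hypothetical reduce function: sum the counts of one word.
    yield word, sum(counts)


# Assumed input: (byte offset, line of text) pairs.
records = [(0, "big data in Hadoop"), (19, "Hadoop MapReduce for big data")]
counts = run_job(records, word_count_map, word_count_reduce)
print(counts)
```

In this sketch the user supplies only word_count_map and word_count_reduce; run_job stands in for the framework, which in Hadoop MapReduce additionally handles parallelization, failover, and I/O.
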
One of the major advantages of Hadoop MapReduce is that it [...] through scaling out to very large computing clusters. However, this results in high costs in terms of hardware and power consumption. Therefore, researchers have carried out many research works to effectively adapt the query processing techniques found in parallel databases to the context of Hadoop MapReduce.

In the second part of this tutorial, we will provide an overview of state-of-the-art techniques for optimizing Hadoop MapReduce jobs. We will cover two topics. First, we will survey research works that focus on tuning the configuration parameters of Hadoop MapReduce jobs [4, 12]. Second, we will survey different query optimization techniques for Hadoop MapReduce jobs [25, 14].

CIDR 2011, a CS teaching award for database systems, as well as several presentation and science slam awards. His research focuses on fast access to big data.

Jorge-Arnulfo Quiané-Ruiz is a postdoctoral researcher at Saarland University, Germany. Previous affiliations include INRIA and University of Nantes. He was awarded a Ph.D. fellowship from the Mexican National Council of Technology (CONACyT). He obtained, with highest honors, an M.Sc. in Computer Science from the National Polytechnic Institute of Mexico. His research mainly focuses on big data analytics.
