Unit 3

UNIT -3
MAP REDUCE APPLICATION

MapReduce
• Job – A Job in the context of Hadoop MapReduce is the unit of work to be
performed as requested by the client / user. The information associated with the Job
includes the data to be processed (input data), the MapReduce logic / program /
algorithm, and any other relevant configuration information necessary to execute the
Job.
• Task – Hadoop MapReduce divides a Job into multiple sub-jobs known as Tasks.
These Tasks can be run independently of each other on various nodes across the cluster.
There are primarily two types of Tasks – Map Tasks and Reduce Tasks.
• JobTracker – Just like the storage layer (HDFS), the computation layer (MapReduce)
works in a master-slave / master-worker fashion. A JobTracker node acts as the
Master and is responsible for scheduling Tasks on appropriate nodes, coordinating
their execution, sending the information needed to execute them, collecting the
results after each Task finishes, re-executing failed Tasks, and monitoring the
overall progress of the Job. Since a Job consists of multiple Tasks, a Job’s progress
depends on the status / progress of the Tasks associated with it. There is only one
JobTracker node per Hadoop Cluster.
• TaskTracker – A TaskTracker node acts as the Slave and is responsible for
executing a Task assigned to it by the JobTracker. There is no restriction on the
number of TaskTracker nodes that can exist in a Hadoop Cluster. A TaskTracker
receives the information necessary for execution of a Task from the JobTracker,
executes the Task, and sends the results back to the JobTracker.
• Map() – A Map Task in MapReduce is performed using the Map() function. This part
of MapReduce is responsible for processing one or more chunks of data and
producing intermediate output results.
• Reduce() – The next stage of the MapReduce programming model is the Reduce()
function. This part of MapReduce is responsible for consolidating the results
produced by each of the Map() functions / tasks.
• Data Locality – MapReduce tries to place the data and the compute as close together
as possible. First, it tries to run the compute on the same node where the data resides.
If that cannot be done (for example, because the compute on that node is down or is
busy with some other computation), it tries to run the compute on the node nearest to
the data node(s) containing the data to be processed. This feature of MapReduce is
called “Data Locality”.
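The placement preference described under Data Locality can be sketched in plain Java. This is only an illustration, not Hadoop's scheduler: the class LocalitySketch, the Node type, and the use of a rack name as the measure of "nearest" are all assumptions introduced here.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical model of locality-aware placement; not Hadoop's scheduler.
class LocalitySketch {

    // A compute node: its name, the rack it sits on (assumed notion of
    // "nearness"), and whether it can currently accept work.
    static class Node {
        final String name;
        final String rack;
        final boolean available;

        Node(String name, String rack, boolean available) {
            this.name = name;
            this.rack = rack;
            this.available = available;
        }
    }

    // Prefer an available node that holds the data, then an available node on
    // the same rack as a data holder, then any available node.
    static Optional<String> place(Set<String> dataHolders, List<Node> cluster) {
        for (Node n : cluster) {
            if (n.available && dataHolders.contains(n.name)) {
                return Optional.of(n.name);
            }
        }
        for (Node n : cluster) {
            if (!n.available) {
                continue;
            }
            for (Node holder : cluster) {
                if (dataHolders.contains(holder.name) && holder.rack.equals(n.rack)) {
                    return Optional.of(n.name);
                }
            }
        }
        for (Node n : cluster) {
            if (n.available) {
                return Optional.of(n.name);
            }
        }
        return Optional.empty(); // no node can run the task right now
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
            new Node("n1", "r1", false),
            new Node("n2", "r1", true),
            new Node("n3", "r2", true));
        // n1 holds the data but is unavailable, so a node on the same rack is chosen.
        System.out.println(place(Set.of("n1"), cluster).orElse("none"));
    }
}
```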
The following diagram shows the logical flow of a MapReduce programming model.
Figure: MapReduce Work Flow
The stages depicted in the figure are:
• Input: This is the input data / file to be processed.
• Split: Hadoop splits the incoming data into smaller pieces called “splits”.
• Map: In this step, MapReduce processes each split according to the logic defined in
the map() function. Each mapper works on one split at a time. Each mapper is treated
as a task, and multiple tasks are executed across different TaskTrackers and
coordinated by the JobTracker.
• Combine: This is an optional step used to improve performance by reducing the
amount of data transferred across the network. The combiner has the same form as
the reduce step and aggregates the output of the map() function locally before it
is passed to the subsequent steps.
• Shuffle & Sort: In this step, the outputs from all the mappers are shuffled, sorted
by key, and grouped before being sent to the next step.
• Reduce: This step aggregates the outputs of the mappers using the reduce()
function. The output of the reducer is sent to the next and final step. Each reducer is
treated as a task, and multiple tasks are executed across different TaskTrackers and
coordinated by the JobTracker.
• Output: Finally, the output of the reduce step is written to a file in HDFS.
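The stages listed above can be walked through in memory with a small word count. The sketch below is plain Java rather than Hadoop code; the class PipelineSketch and its method names are hypothetical, and each input string stands in for one split.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory walk-through of the stages; not Hadoop code.
class PipelineSketch {

    // Map: emit (word, 1) for every word in one split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Combine: pre-aggregate one mapper's output before it leaves the "node".
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> local = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapOutput) {
            local.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return local;
    }

    // Shuffle & sort: group values from all mappers by key, keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map<String, Integer>> combined) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> oneMapper : combined) {
            for (Map.Entry<String, Integer> e : oneMapper.entrySet()) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        return grouped;
    }

    // Reduce: sum the grouped values for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Input -> Split -> Map -> Combine -> Shuffle & Sort -> Reduce -> Output.
    static Map<String, Integer> run(List<String> splits) {
        List<Map<String, Integer>> combined = new ArrayList<>();
        for (String split : splits) {
            combined.add(combine(map(split)));
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(combined).entrySet()) {
            output.put(e.getKey(), reduce(e.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b c b")));
    }
}
```

With the combiner in place, the first split sends two pairs to the shuffle instead of three, which is the network saving the Combine step is meant to provide.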
Game Example
Say you are processing a large amount of data and trying to find out what percentage of
your user base was talking about games. First, we will identify the keywords which we
are going to map from the data to conclude that a message is related to games. Next, we
will write a mapping function to identify such patterns in our data. For example, the
keywords can be Gold medals, Bronze medals, Silver medals, Olympic football, basketball,
cricket, etc.
Let us take the following chunks in a big data set and see how to process them.
“Hi, how are you”
“We love football”
“He is an awesome football player”
“Merry Christmas”
“Olympics will be held in China”
“Records broken today in Olympics”
“Yes, we won 2 Gold medals”
“He qualified for Olympics”
Mapping Phase – The map phase of our algorithm will be as follows:
1. Declare a function “Map”.
2. Loop: for each word equal to “football”,
3. Increment the counter.
4. Return the key-value pair “football”=>counter.
In the same way, we can define n mapping functions for mapping various words:
“Olympics”, “Gold Medals”, “cricket”, etc.
Reducing Phase – The reducing function will accept the input from all these mappers in
the form of key-value pairs and then process it. So, the input to the reduce function will
look like the following:
reduce (“football”=>2)
reduce (“Olympics”=>3)
Our algorithm will continue with the following steps:
5. Declare a function reduce to accept the values from the map function.
6. For each key-value pair, add the value to the counter.
7. Return “games”=>counter.
At the end, we will get output like “games”=>5.
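The map and reduce steps of this example can be sketched in plain Java as follows. The class GameKeywordCount and its method names are hypothetical, matching is done by case-insensitive substring search (an assumption; the text does not say how words are matched), and only the two keywords used in the reduce input above are mapped.

```java
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java version of the game example; not Hadoop code.
class GameKeywordCount {

    // The sample chunks from the text.
    static final List<String> CHUNKS = List.of(
        "Hi, how are you",
        "We love football",
        "He is an awesome football player",
        "Merry Christmas",
        "Olympics will be held in China",
        "Records broken today in Olympics",
        "Yes, we won 2 Gold medals",
        "He qualified for Olympics");

    // Map step: count occurrences of one keyword across the chunks and
    // return the pair keyword => counter.
    static Map.Entry<String, Integer> map(String keyword, List<String> chunks) {
        int counter = 0;
        String needle = keyword.toLowerCase();
        for (String chunk : chunks) {
            String text = chunk.toLowerCase();
            int from = text.indexOf(needle);
            while (from >= 0) {
                counter++;
                from = text.indexOf(needle, from + needle.length());
            }
        }
        return Map.entry(keyword, counter);
    }

    // Reduce step: add every mapper's counter into a single "games" total.
    static Map.Entry<String, Integer> reduce(List<Map.Entry<String, Integer>> mapped) {
        int counter = 0;
        for (Map.Entry<String, Integer> pair : mapped) {
            counter += pair.getValue();
        }
        return Map.entry("games", counter);
    }

    public static void main(String[] args) {
        Map.Entry<String, Integer> football = map("football", CHUNKS);
        Map.Entry<String, Integer> olympics = map("Olympics", CHUNKS);
        Map.Entry<String, Integer> games = reduce(List.of(football, olympics));
        System.out.println(games.getKey() + " => " + games.getValue());
    }
}
```

On the sample chunks, the two mappers return 2 and 3, so the reducer's total is 5, in line with the result stated above.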
Looking at the bigger picture, we can write any number of mapper functions here. Let us
say that you want to know who was wishing each other. In this case, you will write a
mapping function to map words like “Wishing”, “Wish”, “Happy”, “Merry”, and then
write a corresponding reducer function.
Here you will need one function for shuffling, which will distinguish between the “games”
and “wishing” keys returned by the mappers and send each to the respective reducer
function. Similarly, you may need a function for splitting initially, to give inputs to the
mapper functions in the form of chunks. The following diagram summarizes the flow of the
MapReduce algorithm:
In the above MapReduce flow:
• The input data can be divided into n chunks depending upon the
amount of data and the processing capacity of an individual unit.
• Next, the chunks are passed to the mapper functions. Please note that all the chunks
are processed simultaneously, which is what makes the processing of data
parallel.
• After that, shuffling happens, which leads to aggregation of similar patterns.
• Finally, reducers combine them all to get a consolidated output as per the logic.
• This algorithm is scalable: depending on the size of the input data, we
can keep increasing the number of parallel processing units.
Unit Tests with MRUnit

Hadoop MapReduce jobs have a unique code architecture that follows a specific template
with specific constructs.
This architecture raises interesting issues when doing test-driven development (TDD) and
writing unit tests.
With MRUnit, you can craft test input, push it through your mapper and/or reducer, and
verify its output, all in a JUnit test.
As with other JUnit tests, this allows you to debug your code using the JUnit test as a driver.
A map/reduce pair can be tested using MRUnit’s MapReduceDriver; a combiner can be
tested using MapReduceDriver as well.
A PipelineMapReduceDriver allows you to test a workflow of map/reduce jobs. Currently,
partitioners do not have a test driver under MRUnit.
MRUnit allows you to do TDD and write lightweight unit tests
which accommodate Hadoop’s specific architecture and constructs.
Example: We’re processing road surface data used to create maps. The input contains both
linear surfaces and intersections. The mapper takes a collection of these mixed surfaces as
input, discards anything that isn’t a linear road surface (i.e., intersections), and then
processes each road surface and writes it out to HDFS. We can keep count and eventually
print out how many non-road surfaces were in the input. For debugging purposes, we can
additionally print out how many road surfaces were processed.
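The filtering and counting logic of this example can be sketched without any Hadoop or MRUnit dependency, so that it can be exercised from an ordinary unit test. Everything named here (RoadSurfaceFilter, Surface, SurfaceType, the two counters, and the in-memory output list standing in for HDFS) is a hypothetical illustration; in a real job the same expectations would be expressed through an MRUnit driver against the actual mapper class.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the mapper in the road surface example.
class RoadSurfaceFilter {

    enum SurfaceType { LINEAR, INTERSECTION }

    enum Counter { ROADS_PROCESSED, NON_ROADS_DISCARDED }

    static class Surface {
        final String id;
        final SurfaceType type;

        Surface(String id, SurfaceType type) {
            this.id = id;
            this.type = type;
        }
    }

    // Counters play the role of Hadoop job counters.
    final Map<Counter, Integer> counters = new EnumMap<>(Counter.class);

    // Collected output; a real mapper would write to the job output instead.
    final List<String> output = new ArrayList<>();

    // Discard anything that is not a linear road surface, count both cases,
    // and emit the id of each road surface that is kept.
    void map(Surface surface) {
        if (surface.type != SurfaceType.LINEAR) {
            counters.merge(Counter.NON_ROADS_DISCARDED, 1, Integer::sum);
            return;
        }
        output.add(surface.id);
        counters.merge(Counter.ROADS_PROCESSED, 1, Integer::sum);
    }

    public static void main(String[] args) {
        RoadSurfaceFilter mapper = new RoadSurfaceFilter();
        mapper.map(new Surface("road-1", SurfaceType.LINEAR));
        mapper.map(new Surface("crossing-1", SurfaceType.INTERSECTION));
        System.out.println(mapper.output + " " + mapper.counters);
    }
}
```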
Anatomy of a MapReduce Job Run

You can run a MapReduce job with a single method call: submit() on a Job object (you can
also call waitForCompletion(), which submits the job if it hasn’t been submitted already,
then waits for it to finish). This method call conceals a great deal of processing behind the
scenes. This section uncovers the steps Hadoop takes to run a job.
The whole process is illustrated in Figure 7-1. At the highest level, there are five
independent entities:
• The client, which submits the MapReduce job.
• The YARN resource manager, which coordinates the allocation of compute resources
on the cluster.
• The YARN node managers, which launch and monitor the compute containers on
machines in the cluster.
• The MapReduce application master, which coordinates the tasks running the
MapReduce job. The application master and the MapReduce tasks run in containers
that are scheduled by the resource manager and managed by the node managers.
• The distributed filesystem, which is used for sharing job files between the other entities.
Figure 7-1. How Hadoop runs a MapReduce job
Classic MapReduce
A job run in classic MapReduce is illustrated in Figure 6-1. At the highest level, there are
four independent entities:
• The client, which submits the MapReduce job.
• The jobtracker, which coordinates the job run. The jobtracker is a Java application whose
main class is JobTracker.
• The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are
Java applications whose main class is TaskTracker.
• The distributed filesystem, which is used for sharing job files between the other entities.
Job Initialization:
When the JobTracker receives a call to its submitJob() method, it puts it into an internal
queue from where the job scheduler will pick it up and initialize it. Initialization involves
creating an object to represent the job being run.
To create the list of tasks to run, the job scheduler first retrieves the input splits computed
by the client from the shared filesystem. It then creates one map task for each split.
Task Assignment:
Tasktrackers run a simple loop that periodically sends heartbeat method calls to the
jobtracker. Heartbeats tell the jobtracker that a tasktracker is alive. As part of the heartbeat,
a tasktracker will indicate whether it is ready to run a new task, and if it is, the jobtracker
will allocate it a task, which it communicates to the tasktracker using the heartbeat return
value.
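The heartbeat exchange can be sketched as a single method whose return value carries the assigned task. This is a simplified, single-threaded model in plain Java, not the JobTracker's actual protocol; the class HeartbeatSketch, the Tracker type, and the task names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical, single-threaded model of heartbeat-driven task assignment.
class HeartbeatSketch {

    // Stand-in for the jobtracker: a queue of pending tasks plus the set of
    // tasktrackers that have reported in.
    static class Tracker {
        final Deque<String> pending = new ArrayDeque<>();
        final Set<String> alive = new HashSet<>();

        Tracker(List<String> tasks) {
            pending.addAll(tasks);
        }

        // A heartbeat marks the caller as alive. If the caller is ready for
        // work and a task is pending, the task is handed back as the return
        // value of the heartbeat; otherwise the result is empty.
        Optional<String> heartbeat(String taskTrackerId, boolean readyForTask) {
            alive.add(taskTrackerId);
            if (!readyForTask) {
                return Optional.empty();
            }
            return Optional.ofNullable(pending.poll());
        }
    }

    public static void main(String[] args) {
        Tracker jobTracker = new Tracker(List.of("map-0", "map-1"));
        System.out.println(jobTracker.heartbeat("tt-1", true));
        System.out.println(jobTracker.heartbeat("tt-2", false));
    }
}
```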
Task Execution:
Now that the tasktracker has been assigned a task, the next step is for it to run the task. First,
it localizes the job JAR by copying it from the shared filesystem to the tasktracker’s
filesystem. It also copies any files needed from the distributed cache by the application to
the local disk. TaskRunner launches a new Java Virtual Machine to run each task in.
Progress and Status Updates:

MapReduce jobs are long-running batch jobs, taking anything from minutes to hours to run.
Because this is a significant length of time, it’s important for the user to get feedback on
how the job is progressing. A job and each of its tasks have a status. When a task is running,
it keeps track of its progress, that is, the proportion of the task completed.
Job Completion:
When the jobtracker receives a notification that the last task for a job is complete (this will
be the special job cleanup task), it changes the status for the job to “successful.”
YARN
YARN (Yet Another Resource Negotiator) takes Hadoop beyond Java MapReduce programming:
it lets other applications, such as HBase and Spark, work on the same cluster. Different
YARN applications can co-exist on one cluster, so MapReduce, HBase, and Spark can all run
at the same time, which brings great benefits for manageability and cluster utilization.
Components Of YARN
o Client: For submitting MapReduce jobs.
o Resource Manager: To manage the use of resources across the cluster.
o Node Manager: For launching and monitoring the compute containers on machines
in the cluster.
o MapReduce Application Master: Coordinates the tasks running the MapReduce job. The
application master and the MapReduce tasks run in containers that are scheduled by
the resource manager and managed by the node managers.
The JobTracker and TaskTracker were used in the previous version of Hadoop, where they
were responsible for handling resources and tracking job progress. Hadoop 2.0 instead
has the Resource Manager and Node Manager, which overcome the shortfalls of the
JobTracker and TaskTracker.
Benefits of YARN
o Scalability: MapReduce 1 hits a scalability bottleneck at 4,000 nodes and 40,000
tasks, but YARN is designed for 10,000 nodes and 100,000 tasks.
o Utilization: The Node Manager manages a pool of resources rather than a fixed
number of designated slots, thus increasing utilization.
o Multitenancy: Different versions of MapReduce can run on YARN, which makes the
process of upgrading MapReduce more manageable.
Sort and Shuffle
The sort and shuffle occur on the output of the Mapper, before the Reducer runs. When a
Mapper task is complete, its results are sorted by key, partitioned if there are multiple
reducers, and then written to disk. Using the output <k2, v2> from each Mapper, we collect
all the values for each unique key k2. This output from the shuffle phase, in the form
<k2, list(v2)>, is sent as input to the reducer phase.
MapReduce Types
Mapping is the core technique of processing a list of data elements that come in pairs of
keys and values. The map function applies to individual elements defined as key-value pairs
of a list and produces a new list. The general idea of map and reduce function of Hadoop
can be illustrated as follows:
map: (K1, V1) -> list (K2, V2)
reduce: (K2, list(V2)) -> list (K3, V3)
The input key and value types, represented by K1 and V1 respectively, may differ from the
map output types K2 and V2. The reduce function accepts the same types that the map
outputs, but the output types of the reduce operation may differ again: K3 and
V3. The Java API for this is as follows:
public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable {
    void map(K1 key, V1 value,
             OutputCollector<K2, V2> output, Reporter reporter) throws IOException;
}
public interface Reducer<K2, V2, K3, V3> extends JobConfigurable, Closeable {
    void reduce(K2 key, Iterator<V2> values,
                OutputCollector<K3, V3> output, Reporter reporter) throws IOException;
}
The OutputCollector is the generalized interface of the MapReduce framework for
collecting the data output by either the Mapper or the Reducer. For the Mapper, these
outputs are the intermediate output of the job. The collector must therefore be
parameterized with the key and value types of the output it collects.
The Reporter lets the MapReduce application report progress and update counters
and status information. If a combine function is used, it has
the same form as the reduce function, and its output is fed to the reduce function. This may
be illustrated as follows:
map: (K1, V1) -> list (K2, V2)
combine: (K2, list(V2)) -> list (K2, V2)
reduce: (K2, list(V2)) -> list (K3, V3)
Note that the combine function has the same form as the reduce function, except that its
output types are the intermediate key and value types (K3 is K2 and V3 is V2).
The partition function operates on the intermediate key-value types. It controls the
partitioning of the keys of the intermediate map outputs. The partition is typically derived
from the key using a hash function. The total number of partitions is the same as the number
of reduce tasks for the job. The partition is determined only by the key; the value is ignored.
public interface Partitioner<K2, V2> extends JobConfigurable {
    int getPartition(K2 key, V2 value, int numberOfPartition);
}
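A hash-based partition function of the kind described above can be sketched in a few lines. The class name HashPartitionSketch is hypothetical and the method is a standalone static function rather than an implementation of the Partitioner interface; the formula (mask off the sign bit of the key's hash code, then take the remainder by the number of partitions) is the usual way a hash partitioner is written.

```java
// Hypothetical standalone sketch of hash partitioning; not the Hadoop class.
class HashPartitionSketch {

    // Only the key decides the partition; the value is accepted but ignored.
    // Masking with Integer.MAX_VALUE clears the sign bit so that keys with a
    // negative hash code still map to a partition in [0, numberOfPartitions).
    static int getPartition(Object key, Object value, int numberOfPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numberOfPartitions;
    }

    public static void main(String[] args) {
        int reducers = 4; // one partition per reduce task
        for (String key : new String[] {"football", "Olympics", "cricket"}) {
            System.out.println(key + " -> partition " + getPartition(key, 1, reducers));
        }
    }
}
```

Because the result depends only on the key, every occurrence of a given key, from any mapper, is sent to the same reduce task.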
