Hadoop MapReduce


INF4101 - Big Data and NoSQL

2. HADOOP MAPREDUCE AND YARN

Dr Mouhim Sanaa
WHAT IS IN IT?
◾ MapReduce Introduction
◾ MapReduce Phases
◾ Word count examples
◾ MapReduce Java code
◾ How Hadoop runs MapReduce jobs
◾ Limitations of MapReduce V1
◾ YARN ARCHITECTURE
◾ YARN COMPONENTS
HOW TO COUNT THE NUMBER OF LINES IN A FILE?

In HDFS, data is stored across the cluster.

• The initial file is split into blocks of 64 MB or 128 MB
• The blocks of a single file are distributed across the cluster
• A given block is typically replicated

How do we count the number of lines in the file?

[Figure: a source file split into data blocks distributed across the cluster]
HOW TO COUNT THE NUMBER OF LINES IN A FILE?

In a traditional programming environment, you would write a
program that performs the task in a specific way:
• Read the file line by line
• Increment a counter
• At the end of the file, print the counter

It will take hours or days to read the data from its source.

[Figure: the source file's data blocks read sequentially by a single program]
HOW TO COUNT THE NUMBER OF LINES IN A FILE?

How can we stop moving data?

Programs are brought to the data, not the data to the programs.

• You are not moving the data
• Many computers are counting for you in parallel

How do we compute the final total number of lines?
HOW TO COUNT THE NUMBER OF LINES IN A FILE?

We break our problem into two parts:

• We run a map function for each data block
• Then, we run a reducer function that takes the output of the map
functions and returns the result
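A minimal Java sketch of these two parts, using the standard Hadoop Mapper and Reducer API; the class names and the single "lines" key are illustrative assumptions:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: called once per record (line) of the block that is local to this node.
class LineCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final Text LINES = new Text("lines");
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(LINES, ONE); // each line contributes a count of 1
    }
}

// Reduce: receives every 1 emitted for the key "lines" and sums them.
class LineCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> ones, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable one : ones) {
            total += one.get();
        }
        context.write(key, new IntWritable(total)); // the final total number of lines
    }
}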
MAPREDUCE

◾ Hadoop MapReduce is a software framework for easily writing applications which process
vast amounts of data in parallel on large clusters (thousands of nodes) of commodity
hardware in a reliable, fault-tolerant manner.
◾ The term MapReduce actually refers to the following two different tasks that Hadoop
programs perform:
• The Map Task: This is the first task, which takes input data and converts it into a set of
data, where individual elements are broken down into tuples (key/value pairs).
• The Reduce Task: This task takes the output from a map task as input and combines those
data tuples into a smaller set of tuples. The reduce task is always performed after the map
task.
MAPREDUCE OVERVIEW
MAPREDUCE MAP PHASE
Mappers
• Small program (typically), distributed across the cluster, local to
data
• Handed a portion of the input data (called a split)
• Each mapper parses, filters, or transforms its input
• Produces grouped <key,value> pairs
MAPREDUCE SHUFFLE PHASE
Shuffle
• The output of each mapper is locally grouped together by key
• One node is chosen to process data for each unique key.
• All of the movement (shuffle) of data is transparently orchestrated by
MapReduce.
MAPREDUCE REDUCER PHASE
Reducers
• Small programs (typically) that aggregate all of the values for the key that they are
responsible for
• Each reducer writes output to its own file
MAPREDUCE COMBINER PHASE
Combiner (optional)

• The data that will go to each reduce node is sorted and merged before going to the reduce
node, pre-doing some of the work of the receiving reduce node in order to minimize traffic
between map and reduce nodes.
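A minimal sketch of enabling a combiner in the job driver, assuming the WordCountMapper and WordCountReducer classes sketched later in this deck; reusing the reducer as a combiner is valid here because summing counts is associative and commutative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

class CombinerSetup {
    static Job configure() throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setMapperClass(WordCountMapper.class);
        // The combiner pre-aggregates map output on the map node, so far
        // less data crosses the network during the shuffle.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        return job;
    }
}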
MAPREDUCE Record Reader

• Mappers don't run directly on the input splits, because the input splits
contain text and mappers don't understand text. Mappers understand
(key, value) pairs only.

• Thus the text in the input splits first needs to be converted to
(key, value) pairs. This is achieved by Record Readers.

• In Hadoop terminology, each line in a text file is termed a 'record'.

• The Record Reader reads one record (line) at a time and converts each record into a (key, value) pair:

 Key – the byte offset of the beginning of the line within the whole file (not within one split).
 Value – the content of the line, excluding line terminators.

Hello I am Mouhim Sanaa → (0, Hello I am Mouhim Sanaa)
How can I help you → (25, How can I help you)
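A small plain-Java sketch that reproduces the offsets above; it assumes one byte per character and a two-byte \r\n line terminator, which is what makes the second record start at byte 25:

class OffsetDemo {
    public static void main(String[] args) {
        String[] records = {"Hello I am Mouhim Sanaa", "How can I help you"};
        long offset = 0;
        for (String record : records) {
            // The Record Reader's key is the byte offset where the line begins.
            System.out.println("(" + offset + ", " + record + ")");
            offset += record.getBytes().length + 2; // +2 for the assumed \r\n terminator
        }
    }
}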
WORD COUNT EXAMPLE

In this example, we have a list of animal names.

• MapReduce can automatically split files on line breaks
• Our file has been split into two blocks on two nodes.
WORD COUNT EXAMPLE
Map Task
We have two requirements for our Map Task (see the sketch below):
• Filter out the non-big-cat rows
• Prepare the count by transforming each row to <Text(name), Integer(1)>
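A minimal sketch of this map task; the BigCatMapper class name and the set of big-cat names are illustrative assumptions:

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BigCatMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final Set<String> BIG_CATS =
            new HashSet<>(Arrays.asList("lion", "tiger", "leopard", "jaguar"));
    private static final IntWritable ONE = new IntWritable(1);
    private final Text name = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String animal = line.toString().trim().toLowerCase();
        if (BIG_CATS.contains(animal)) {   // filter out the non-big-cat rows
            name.set(animal);
            context.write(name, ONE);      // prepare the count: <name, 1>
        }
    }
}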
WORD COUNT EXAMPLE
Shuffle Task
• Shuffle moves all values of one key to the same target node
• Distribution is done by a Partitioner class (normally hash distribution), as sketched after this list
• Reduce tasks can run on any node
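A minimal sketch of hash distribution, mirroring the behaviour of Hadoop's default HashPartitioner; the class name is illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is non-negative, then take the
        // modulo: all values of one key land on the same reduce task.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}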
WORD COUNT EXAMPLE
Reduce
• The reduce task computes aggregated values for each
key
WORD COUNT EXAMPLE
Combiner (optional)
• For performance, a pre-aggregation in the Map task can be helpful
• It reduces the amount of data sent over the network
WORD COUNT EXAMPLE 2
MAX TEMPERATURE EXAMPLE 3
WORD COUNT JAVA CODE
Mapper Class
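A minimal sketch of the classic WordCount mapper, using the standard Hadoop API; it emits <word, 1> for every token in its input split:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Split the line into tokens and emit <word, 1> for each one.
        StringTokenizer tokenizer = new StringTokenizer(line.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}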
WORD COUNT JAVA CODE
Reducer Class
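A minimal sketch of the matching reducer; it sums all the 1s emitted for each word:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get(); // add up all the counts emitted for this word
        }
        result.set(sum);
        context.write(key, result); // e.g. <"lion", 3>
    }
}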
WORD COUNT JAVA CODE
Main Class
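A minimal sketch of the driver class, wiring the mapper and reducer sketched above; the input and output paths come from the command-line arguments of the hadoop jar invocation:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // optional pre-aggregation
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input file/dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}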
HOW DOES HADOOP RUN MAPREDUCE JOBS

The MapReduce framework consists of a single master JobTracker and one
slave TaskTracker per cluster node.

The job execution is controlled by two types of processes:

• A single master process called the JobTracker, which coordinates all jobs
running on the cluster and assigns map and reduce tasks to run on the
TaskTrackers

• A number of subordinate processes called TaskTrackers, which run assigned
tasks and periodically report their progress to the JobTracker
HOW DOES HADOOP RUN MAPREDUCE JOBS

1. Suppose a user wants to run a MapReduce query on sample.txt initially stored in HDFS.
hadoop jar query.jar DriverCode sample.txt result

A client can submit a job to Hadoop by specifying:

• The location of the input and output files.
• The Java classes in the form of a jar file.
• The job configuration.

2. This sends a message to the JobTracker which produces a unique ID for the job.
HOW DOES HADOOP RUN MAPREDUCE JOBS

3. The Job Client copies job resources, such as a jar file containing Java code you have written to
implement the map or the reduce task, to the shared file system, usually HDFS.
HOW DOES HADOOP RUN MAPREDUCE JOBS

4. Once the resources are in HDFS, the Job Client can tell the JobTracker to start the job.
5. The JobTracker does its own initialization for the job. It calculates how to split the data so that it can
send each "split" to a different mapper process to maximize throughput.
• The Name Node then provides the metadata to the Job Tracker.
• The Job Tracker now knows that sample.txt is split into 4 blocks, and knows the location of each block.
• Each of these four blocks has three replicas stored in HDFS.
• For each block, the Job Tracker communicates with only one Task Tracker: the one holding the copy residing nearest to it.
HOW DOES HADOOP RUN MAPREDUCE JOBS

7. The TaskTrackers are continually sending heartbeat messages to the
JobTracker. Now that the JobTracker has work for them, it will return
a map task or a reduce task as a response to the heartbeat.
HOW DOES HADOOP RUN MAPREDUCE JOBS

8. The TaskTrackers need to obtain the code to execute, so they get it from the
shared file system.
9. Then they can launch a Java Virtual Machine with a child process running in it.
10. This child process runs your map code or your reduce code.

The results of the map operation are stored on local disk without replication
for a given TaskTracker node (and, thus, not in HDFS). The results of the
reduce operation are stored to HDFS with standard replication.
HOW DOES HADOOP RUN MAPREDUCE JOBS

The JobTracker is responsible for:

Resource management:

• Maintaining the list of live nodes.
• Maintaining the list of available and occupied map and reduce slots.
• Allocating the available slots to appropriate jobs and tasks according to the selected scheduling policy.

Data processing:

• Instructing TaskTrackers to start map and reduce tasks.
• Monitoring the execution of the tasks.
• Restarting failed tasks.
HOW DOES HADOOP RUN MAPREDUCE JOBS

Fault tolerance

• If a TaskTracker fails to communicate with the JobTracker for
a period of time (by default, 1 minute), the JobTracker will assume that the
TaskTracker in question has crashed.
• The JobTracker knows which map and reduce tasks were assigned to each
TaskTracker.
• It then communicates with a TaskTracker holding another copy of the same
data and directs it to run the task there.

The JobTracker also keeps track of how many tasks each TaskTracker is
currently serving and how many more tasks can be assigned to it.
LIMITATION OF MAPREDUCE V1

Large Hadoop clusters revealed a scalability bottleneck caused by
having a single JobTracker.

 If the JobTracker fails, all running jobs are lost.

 According to Yahoo!, the practical limits of such a design are reached with a cluster of
5,000 nodes and 40,000 tasks running concurrently.
LIMITATION OF MAPREDUCE V1

In Hadoop MapReduce, each slave node has a fixed number of map and reduce
slots, set by the cluster administrator, which are not fungible (they cannot
be exchanged).

 A node cannot run more map tasks than it has map slots at any given moment, even if
no reduce tasks are running.
 This harms cluster utilization: when all map slots are taken (and we
still want more), we cannot use any reduce slots, even if they are available,
or vice versa.
LIMITATION OF MAPREDUCE V1

Hadoop 1.0 was designed to run MapReduce jobs only.

 It cannot run other applications such as:

• Giraph: Apache Giraph is used to perform graph processing. It is
currently used at Facebook to analyze the social graph formed by users
and their connections.
• Spark: an open-source cluster computing framework for real-time processing.
• Storm and Flink: frameworks for processing streaming data.
LIMITATION OF MAPREDUCE V1

Hadoop 1.0 vs Hadoop 2.0

In Hadoop 2.0, MapReduce is just an application that runs on top of YARN.
YARN

• YARN stands for "Yet Another Resource Negotiator".
• Apache Hadoop YARN is the resource management component of a Hadoop cluster.
• YARN/MapReduce2 was introduced in Hadoop 2.0.
• MapReduce2 moves resource management (such as the infrastructure to monitor nodes, allocate
resources and schedule jobs) into YARN.
• YARN is a framework that provides computational resources to execution engines.
COMPONENTS OF YARN

YARN has 2 main components: the ResourceManager and the NodeManager.

• The YARN architecture gives a complete overview of the ResourceManager and the NodeManager.
• The ResourceManager runs on the master node and the NodeManager runs on each slave node.
COMPONENTS OF YARN

Resource Manager (One per cluster)

• Runs as a master daemon on a dedicated machine.
• Tracks how many live nodes and resources are available on the cluster.
• Coordinates which applications submitted by users should get these
resources, and when.

• The ResourceManager has two main components:
 Scheduler
 Application Manager
COMPONENTS OF YARN

• The ResourceManager has two main components:

 Scheduler

• It performs the scheduling of jobs based on policies and priorities
defined by the administrator.

• A scheduler typically handles the resource allocation of the
jobs submitted to YARN.

• If an application or service wants to run and needs 1 GB of RAM and 2 processors for
normal operation, it is the job of the YARN scheduler to allocate resources to this
application in accordance with a defined policy.
COMPONENTS OF YARN

• The ResourceManager has two main components:

 Application Manager

• It allocates the container that executes the application-specific
ApplicationMaster.

• It is also responsible for restarting the ApplicationMaster service in
case of failure.
COMPONENTS OF YARN

Node Manager (One per Data Node)


• The NodeManager monitors the execution of containers
that are started on its node and sends a progress report to
the ResourceManager.
Container:
A container is a place where a unit of work occurs. For instance, each
MapReduce task (not the entire job) runs in one container.
It is the basic unit of allocation, e.g. container X = 2 GB, 1 CPU.
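A hedged sketch of how an ApplicationMaster could describe such a container with the public YARN client API; the values mirror the example above (2 GB of memory, 1 CPU), and the class name is illustrative:

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;

public class ContainerSpec {
    public static AMRMClient.ContainerRequest example() {
        Resource capability = Resource.newInstance(2048, 1); // 2 GB (in MB), 1 vcore
        Priority priority = Priority.newInstance(0);
        // An ApplicationMaster would submit this request to the ResourceManager.
        return new AMRMClient.ContainerRequest(capability, null, null, priority);
    }
}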
COMPONENTS OF YARN

Node Manager (One per Data Node)

Application Master:
The ApplicationMaster is responsible for negotiating
resources from the ResourceManager and for working with the
NodeManager(s) to execute and monitor the containers and
their resource consumption: it requests appropriate resource
containers from the ResourceManager, tracks their status and
monitors their progress.
LAUNCH AN APPLICATION IN YARN
