Hadoop MapReduce


INF4101 - Big Data and NoSQL

2. HADOOP MAPREDUCE AND YARN

Dr Mouhim Sanaa
WHAT IS IN IT?
◾ MapReduce Introduction
◾ MapReduce Phases
◾ Word count examples
◾ MapReduce Java code
◾ How Hadoop runs MapReduce jobs
◾ Limitations of MapReduce V1
◾ YARN ARCHITECTURE
◾ YARN COMPONENTS
HOW TO COUNT THE NUMBER OF LINES IN A FILE?

In HDFS, data is stored across the cluster.

• The initial file is split into blocks of 64 MB or 128 MB
• The blocks of a single file are distributed across the cluster
• A given block is typically replicated

How do we count the number of lines in the file?

[Figure: a source file split into data blocks distributed across the cluster]
HOW TO COUNT THE NUMBER OF LINES IN A FILE?

In a traditional programming environment, you would write a
program that performs the task in a specific way:
• Read the file line by line
• Increment a counter
• At the end of the file, print the counter

It will take hours or days to read the data from its source.

[Figure: the source file's data blocks read sequentially by a single program]
HOW TO COUNT THE NUMBER OF LINES IN A FILE?

How can we stop moving data?

Programs are brought to the data, not the data to the programs.

• You are not moving the data
• Many computers are counting for you in parallel

How do we compute the final total number of lines?
HOW TO COUNT THE NUMBER OF LINES IN A FILE?

We break our problem into two parts:

• We run a map function for each data block
• Then, we run a reducer function that takes the output of the map
functions and returns the result
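A minimal Java sketch of these two parts, using the standard Hadoop Mapper and Reducer API; the class names and the single "lines" key are illustrative assumptions:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: called once per record (line) of the block that is local to this node.
class LineCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final Text LINES = new Text("lines");
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(LINES, ONE); // each line contributes a count of 1
    }
}

// Reduce: receives every 1 emitted for the key "lines" and sums them.
class LineCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> ones, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable one : ones) {
            total += one.get();
        }
        context.write(key, new IntWritable(total)); // the final total number of lines
    }
}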
MAPREDUCE

◾ Hadoop MapReduce is a software framework for easily writing applications which process
vast amounts of data in parallel on large clusters (thousands of nodes) of commodity
hardware in a reliable, fault-tolerant manner.
◾ The term MapReduce actually refers to the following two different tasks that Hadoop
programs perform:
• The Map Task: This is the first task, which takes input data and converts it into a set of
data, where individual elements are broken down into tuples (key/value pairs).
• The Reduce Task: This task takes the output from a map task as input and combines those
data tuples into a smaller set of tuples. The reduce task is always performed after the map
task.
MAPREDUCE OVERVIEW
MAPREDUCE MAP PHASE
Mappers
• Small program (typically), distributed across the cluster, local to
data
• Handed a portion of the input data (called a split)
• Each mapper parses, filters, or transforms its input
• Produces grouped <key,value> pairs
MAPREDUCE SHUFFLE PHASE
Shuffle
• The output of each mapper is locally grouped together by key
• One node is chosen to process data for each unique key.
• All of the movement (shuffle) of data is transparently orchestrated by
MapReduce.
MAPREDUCE REDUCER PHASE
Reducers
• Small programs (typically) that aggregate all of the values for the key that they are
responsible for
• Each reducer writes output to its own file
MAPREDUCE COMBINER PHASE
Combiner (optional)

• The data that will go to each reduce node is sorted and merged before going to the reduce
node, pre-doing some of the work of the receiving reduce node in order to minimize traffic
between map and reduce nodes.
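A minimal sketch of enabling a combiner in the job driver, assuming the WordCountMapper and WordCountReducer classes sketched later in this deck; reusing the reducer as a combiner is valid here because summing counts is associative and commutative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

class CombinerSetup {
    static Job configure() throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setMapperClass(WordCountMapper.class);
        // The combiner pre-aggregates map output on the map node, so far
        // less data crosses the network during the shuffle.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        return job;
    }
}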
MAPREDUCE Record Reader

• Mappers don't run directly on the input splits, because the input splits
contain text and mappers don't understand text. Mappers understand
(key, value) pairs only.

• Thus the text in the input splits first needs to be converted to
(key, value) pairs. This is achieved by Record Readers.

• In Hadoop terminology, each line in a text file is termed a 'record'.

• The Record Reader reads one record (line) at a time and converts each record into a (key, value) pair:

 Key – the byte offset of the beginning of the line within the whole file (not within one split).
 Value – the content of the line, excluding line terminators.

Hello I am Mouhim Sanaa → (0, Hello I am Mouhim Sanaa)
How can I help you → (25, How can I help you)
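A small plain-Java sketch that reproduces the offsets above; it assumes one byte per character and a two-byte \r\n line terminator, which is what makes the second record start at byte 25:

class OffsetDemo {
    public static void main(String[] args) {
        String[] records = {"Hello I am Mouhim Sanaa", "How can I help you"};
        long offset = 0;
        for (String record : records) {
            // The Record Reader's key is the byte offset where the line begins.
            System.out.println("(" + offset + ", " + record + ")");
            offset += record.getBytes().length + 2; // +2 for the assumed \r\n terminator
        }
    }
}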
WORD COUNT EXAMPLE

In this example, we have a list of animal names.

• MapReduce can automatically split files on line breaks
• Our file has been split into two blocks on two nodes.
WORD COUNT EXAMPLE
Map Task
We have two requirements for our Map Task (see the sketch below):
• Filter out the non-big-cat rows
• Prepare the count by transforming each row to <Text(name), Integer(1)>
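A minimal sketch of this map task; the BigCatMapper class name and the set of big-cat names are illustrative assumptions:

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BigCatMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final Set<String> BIG_CATS =
            new HashSet<>(Arrays.asList("lion", "tiger", "leopard", "jaguar"));
    private static final IntWritable ONE = new IntWritable(1);
    private final Text name = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String animal = line.toString().trim().toLowerCase();
        if (BIG_CATS.contains(animal)) {   // filter out the non-big-cat rows
            name.set(animal);
            context.write(name, ONE);      // prepare the count: <name, 1>
        }
    }
}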
WORD COUNT EXAMPLE
Shuffle Task
• Shuffle moves all values of one key to the same target node
• Distribution is done by a Partitioner class (normally hash distribution), as sketched after this list
• Reduce tasks can run on any node
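A minimal sketch of hash distribution, mirroring the behaviour of Hadoop's default HashPartitioner; the class name is illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is non-negative, then take the
        // modulo: all values of one key land on the same reduce task.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}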
WORD COUNT EXAMPLE
Reduce
• The reduce task computes aggregated values for each
key
WORD COUNT EXAMPLE
Combiner (optional)
• For performance, a pre-aggregation in the Map task can be helpful
• It reduces the amount of data sent over the network
WORD COUNT EXAMPLE 2
MAX TEMPERATURE EXAMPLE 3
WORD COUNT JAVA CODE
Mapper Class
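A minimal sketch of the classic WordCount mapper, using the standard Hadoop API; it emits <word, 1> for every token in its input split:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Split the line into tokens and emit <word, 1> for each one.
        StringTokenizer tokenizer = new StringTokenizer(line.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}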
WORD COUNT JAVA CODE
Reducer Class
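A minimal sketch of the matching reducer; it sums all the 1s emitted for each word:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get(); // add up all the counts emitted for this word
        }
        result.set(sum);
        context.write(key, result); // e.g. <"lion", 3>
    }
}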
WORD COUNT JAVA CODE
Main Class
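A minimal sketch of the driver class, wiring the mapper and reducer sketched above; the input and output paths come from the command-line arguments of the hadoop jar invocation:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // optional pre-aggregation
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input file/dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}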
HOW DOES HADOOP RUN MAPREDUCE JOBS

The MapReduce framework consists of a single master JobTracker and one
slave TaskTracker per cluster node.

The job execution is controlled by two types of processes:

• A single master process called the JobTracker, which coordinates all jobs
running on the cluster and assigns map and reduce tasks to run on the
TaskTrackers

• A number of subordinate processes called TaskTrackers, which run assigned
tasks and periodically report their progress to the JobTracker
HOW DOES HADOOP RUN MAPREDUCE JOBS

1. Suppose a user wants to run a MapReduce query on sample.txt initially stored in HDFS.
hadoop jar query.jar DriverCode sample.txt result

A client can submit a job to Hadoop by specifying:

• The location of the input and output files.
• The Java classes in the form of a jar file.
• The job configuration.

2. This sends a message to the JobTracker which produces a unique ID for the job.
HOW DOES HADOOP RUN MAPREDUCE JOBS

3. The Job Client copies job resources, such as a jar file containing Java code you have written to
implement the map or the reduce task, to the shared file system, usually HDFS.
HOW DOES HADOOP RUN MAPREDUCE JOBS

4. Once the resources are in HDFS, the Job Client can tell the JobTracker to start the job.
5. The JobTracker does its own initialization for the job. It calculates how to split the data so that it can
send each "split" to a different mapper process to maximize throughput.
• The Name Node then provides the metadata to the Job Tracker.
• The Job Tracker now knows that sample.txt is split into 4 blocks, and knows the location of each block.
• Each of these four blocks has three replicas stored in HDFS.
• For each block, the Job Tracker communicates with only one Task Tracker: the one holding the copy residing nearest to it.
HOW DOES HADOOP RUN MAPREDUCE JOBS

7. The TaskTrackers are continually sending heartbeat messages to the
JobTracker. Now that the JobTracker has work for them, it will return
a map task or a reduce task as a response to the heartbeat.
HOW DOES HADOOP RUN MAPREDUCE JOBS

8. The TaskTrackers need to obtain the code to execute, so they get it from the
shared file system.
9. Then they can launch a Java Virtual Machine with a child process running in it.
10. This child process runs your map code or your reduce code.

The results of the map operation are stored on local disk without replication
for a given TaskTracker node (and, thus, not in HDFS). The results of the
reduce operation are stored to HDFS with standard replication.
HOW DOES HADOOP RUN MAPREDUCE JOBS

The JobTracker is responsible for:

Resource management:

• Maintaining the list of live nodes.
• Maintaining the list of available and occupied map and reduce slots.
• Allocating the available slots to appropriate jobs and tasks according to the selected scheduling policy.

Data processing:

• Instructing TaskTrackers to start map and reduce tasks.
• Monitoring the execution of the tasks.
• Restarting failed tasks.
HOW DOES HADOOP RUN MAPREDUCE JOBS

Fault tolerance

• If a TaskTracker fails to communicate with the JobTracker for
a period of time (by default, 1 minute), the JobTracker will assume that the
TaskTracker in question has crashed.
• The JobTracker knows which map and reduce tasks were assigned to each
TaskTracker.
• It then communicates with a TaskTracker holding another copy of the same
data and directs it to run the task there.

The JobTracker also keeps track of how many tasks each TaskTracker is
currently serving and how many more tasks can be assigned to it.
LIMITATION OF MAPREDUCE V1

Large Hadoop clusters revealed a scalability bottleneck caused by
having a single JobTracker.

 If the JobTracker fails, all running jobs are lost.

 According to Yahoo!, the practical limits of such a design are reached with a cluster of
5,000 nodes and 40,000 tasks running concurrently.
LIMITATION OF MAPREDUCE V1

In Hadoop MapReduce, each slave node has a fixed number of map and reduce
slots, set by the cluster administrator, which are not fungible (they cannot
be exchanged).

 A node cannot run more map tasks than it has map slots at any given moment, even if
no reduce tasks are running.
 This harms cluster utilization: when all map slots are taken (and we
still want more), we cannot use any reduce slots, even if they are available,
or vice versa.
LIMITATION OF MAPREDUCE V1

Hadoop 1.0 was designed to run MapReduce jobs only.

 It cannot run other applications such as:

• Giraph: Apache Giraph is used to perform graph processing. It is
currently used at Facebook to analyze the social graph formed by users
and their connections.
• Spark: an open-source cluster computing framework for real-time processing.
• Storm and Flink: frameworks for processing streaming data.
LIMITATION OF MAPREDUCE V1

Hadoop 1.0 vs Hadoop 2.0

In Hadoop 2.0, MapReduce is just an application that runs on top of YARN.
YARN

• YARN stands for "Yet Another Resource Negotiator".
• Apache Hadoop YARN is the resource management component of a Hadoop cluster.
• YARN/MapReduce2 was introduced in Hadoop 2.0.
• MapReduce2 moves resource management (such as the infrastructure to monitor nodes, allocate
resources and schedule jobs) into YARN.
• YARN is a framework that provides computational resources to execution engines.
COMPONENTS OF YARN

YARN has 2 main components: the ResourceManager and the NodeManager.

• The YARN architecture gives a complete overview of the ResourceManager and the NodeManager.
• The ResourceManager runs on the master node and the NodeManager runs on each slave node.
COMPONENTS OF YARN

Resource Manager (One per cluster)

• Runs as a master daemon on a dedicated machine.
• Tracks how many live nodes and resources are available on the cluster.
• Coordinates which applications submitted by users should get these
resources, and when.

• The ResourceManager has two main components:
 Scheduler
 Application Manager
COMPONENTS OF YARN

• The ResourceManager has two main components:

 Scheduler

• It performs the scheduling of jobs based on policies and priorities
defined by the administrator.

• A scheduler typically handles the resource allocation of the
jobs submitted to YARN.

• If an application or service wants to run and needs 1 GB of RAM and 2 processors for
normal operation, it is the job of the YARN scheduler to allocate resources to this
application in accordance with a defined policy.
COMPONENTS OF YARN

• The ResourceManager has two main components:

 Application Manager

• It allocates the container that executes the application-specific
ApplicationMaster.

• It is also responsible for restarting the ApplicationMaster service in
case of failure.
COMPONENTS OF YARN

Node Manager (One per Data Node)


• The NodeManager monitors the execution of containers
that are started on its node and sends a progress report to
the ResourceManager.
Container:
A container is a place where a unit of work occurs. For instance, each
MapReduce task (not the entire job) runs in one container.
It is the basic unit of allocation, e.g. container X = 2 GB, 1 CPU.
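A hedged sketch of how an ApplicationMaster could describe such a container with the public YARN client API; the values mirror the example above (2 GB of memory, 1 CPU), and the class name is illustrative:

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;

public class ContainerSpec {
    public static AMRMClient.ContainerRequest example() {
        Resource capability = Resource.newInstance(2048, 1); // 2 GB (in MB), 1 vcore
        Priority priority = Priority.newInstance(0);
        // An ApplicationMaster would submit this request to the ResourceManager.
        return new AMRMClient.ContainerRequest(capability, null, null, priority);
    }
}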
COMPONENTS OF YARN

Node Manager (One per Data Node)

Application Master:
The ApplicationMaster is responsible for negotiating
resources from the ResourceManager and for working with the
NodeManager(s) to execute and monitor the containers and
their resource consumption: it requests appropriate resource
containers from the ResourceManager, tracks their status and
monitors their progress.
LAUNCH AN APPLICATION IN YARN
