Hadoop MapReduce
Dr Mouhim Sanaa
WHAT IS IN IT?
◾ MapReduce Introduction
◾ MapReduce Phases
◾ Word count examples
◾ MapReduce java code
◾ How does Hadoop run MapReduce jobs
◾ Limitation of MapReduce V1
◾ YARN ARCHITECTURE
◾ YARN COMPONENTS
HOW TO COUNT THE NUMBER OF LINES IN A FILE?
◾ Hadoop MapReduce is a software framework for easily writing applications that process vast
amounts of data in parallel on large clusters (thousands of nodes) of commodity
hardware in a reliable, fault-tolerant manner.
◾ The term MapReduce actually refers to the following two different tasks that Hadoop
programs perform:
• The Map Task: This is the first task, which takes input data and converts it into a set of
data, where individual elements are broken down into tuples (key/value pairs).
• The Reduce Task: This task takes the output from a map task as input and combines those
data tuples into a smaller set of tuples. The reduce task is always performed after the map
task.
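For example, in the word count job used later in this deck, the Map task turns each line of text into (word, 1) pairs, and the Reduce task sums those pairs to produce (word, count) for each distinct word.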
MAPREDUCE OVERVIEW
MAPREDUCE MAP PHASE
Mappers
• Small program (typically), distributed across the cluster, local to
data
• Handed a portion of the input data (called a split)
• Each mapper parses, filters, or transforms its input
• Produces grouped <key,value> pairs
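As a concrete sketch of such a mapper, here is a word-count mapper in the Hadoop Java API (the class name WordCountMapper is illustrative, not from these slides):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper: for each line of its split, emit a (word, 1) pair per word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line in the split, value = the line itself
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}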
MAPREDUCE SHUFFLE PHASE
Shuffle
• The output of each mapper is locally grouped together by key
• One node is chosen to process data for each unique key.
• All of the movement (shuffle) of data is transparently orchestrated by
MapReduce.
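For example, every ('hello', 1) pair emitted by any mapper in the cluster is routed to the one reducer responsible for the key 'hello'.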
MAPREDUCE REDUCER PHASE
Reducers
• Small programs (typically) that aggregate all of the values for the key that they are responsible for
• Each reducer writes output to its own file
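A matching sketch of a word-count reducer in the Hadoop Java API (WordCountReducer is an illustrative name); it sums the 1s emitted by the mappers for each word:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count reducer: receives (word, [1, 1, ...]) and writes (word, total).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}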
MAPREDUCE COMBINER PHASE
Combiner (optional)
• The data that will go to each reduce node is sorted and merged before going to the reduce node, pre-doing some of the work of the receiving reduce node in order to minimize traffic between map and reduce nodes.
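A plain-Java illustration (not the Hadoop API) of what a combiner saves in the word count case, by pre-summing the (word, 1) pairs on the map node before they are shuffled:

import java.util.Map;
import java.util.TreeMap;

public class CombinerDemo {
    public static void main(String[] args) {
        // Map output for one split: one (word, 1) pair per word occurrence.
        String[] mapOutputKeys = {"hello", "world", "hello", "hello"};
        // The combiner pre-aggregates locally, so only one pair per distinct
        // word leaves the map node instead of one pair per occurrence.
        Map<String, Integer> combined = new TreeMap<>();
        for (String word : mapOutputKeys) {
            combined.merge(word, 1, Integer::sum);
        }
        System.out.println(combined); // {hello=3, world=1} -> 2 pairs shuffled instead of 4
    }
}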
MAPREDUCE RECORD READER
Key – the byte offset of the beginning of the line within the file (not the whole file, just one split).
Value – the contents of the line. It excludes line terminators.
Example:
Hello I am Mouhim Sanaa → (0, Hello I am Mouhim Sanaa)
How can I help you → (25, How can I help you)
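A small plain-Java sketch (not Hadoop code) of how those (offset, line) pairs are formed; the +2 per line assumes two-byte \r\n terminators, which is what makes the second offset 25 in the example above:

import java.util.List;

public class RecordReaderDemo {
    public static void main(String[] args) {
        List<String> lines = List.of("Hello I am Mouhim Sanaa", "How can I help you");
        long offset = 0;
        for (String line : lines) {
            // Key = byte offset of the line, value = the line without its terminator.
            System.out.println("(" + offset + ", " + line + ")");
            offset += line.length() + 2; // assumes \r\n line endings, matching the example
        }
    }
}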
WORD COUNT EXAMPLE
1. Suppose a user wants to run a MapReduce query on sample.txt initially stored in HDFS.
hadoop jar query.jar DriverCode sample.txt result
2. This sends a message to the JobTracker which produces a unique ID for the job.
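A minimal sketch of what the DriverCode class invoked above could look like, wiring together the WordCountMapper and WordCountReducer sketches from the earlier slides (those class names are assumptions, not from the deck):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DriverCode {
    public static void main(String[] args) throws Exception {
        // args[0] = input path in HDFS (e.g. sample.txt), args[1] = output dir (e.g. result)
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(DriverCode.class);
        job.setMapperClass(WordCountMapper.class);     // mapper sketch shown earlier
        job.setCombinerClass(WordCountReducer.class);  // optional combiner (same logic as reducer)
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}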
HOW DOES HADOOP RUN MAPREDUCE JOBS
3. The Job Client copies job resources, such as a jar file containing Java code you have written to
implement the map or the reduce task, to the shared file system, usually HDFS.
HOW DOES HADOOP RUN MAPREDUCE JOBS
4. Once the resources are in HDFS, the Job Client can tell the JobTracker to start the job.
5.The JobTracker does its own initialization for the job. It calculates how to split the data so that it can
send each "split" to a different mapper process to maximize throughput.
• The NameNode then provides the metadata to the JobTracker.
• The JobTracker now knows that sample.txt is split into 4 blocks and knows the location of each block.
• Since each of these four blocks has three replicas stored in HDFS, the JobTracker communicates with the TaskTracker of only one copy of each block: the one residing nearest to it.
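For illustration (hypothetical sizes, not from the slides): a sample.txt of roughly 400 MB with the default 128 MB HDFS block size would be stored as 4 blocks (128 + 128 + 128 + 16 MB).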
HOW DOES HADOOP RUN MAPREDUCE JOBS
In MapReduce V1, the JobTracker alone handles resource management, data processing (job scheduling and monitoring), and fault tolerance.
LIMITATION OF MAPREDUCE V1
Large Hadoop clusters revealed a scalability bottleneck caused by having a single JobTracker.
Each TaskTracker has a fixed number of map slots and reduce slots: a node cannot run more map tasks than it has map slots at any given moment, even if no reduce tasks are running.
This harms cluster utilization, because when all map slots are taken (and we still want more), we cannot use any reduce slots, even if they are available, or vice versa.
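For example (hypothetical numbers): on a TaskTracker configured with 4 map slots and 4 reduce slots, a map-heavy job can fill all 4 map slots and leave the 4 reduce slots idle, even though more map tasks are still waiting.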
YARN ARCHITECTURE
2 main components:
• The ResourceManager, which runs on the master node.
• The NodeManager, which runs on each slave node.
COMPONENTS OF YARN
Scheduler
• If an application or service needs 1 GB of RAM and 2 processors for normal operation, it is the job of the YARN Scheduler to allocate resources to this application in accordance with a defined policy.
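In the YARN Java API, such a requirement is expressed as a container request handed to the scheduler; a minimal sketch that only builds the request object (class and variable names are illustrative):

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ResourceRequestSketch {
    public static void main(String[] args) {
        // 1024 MB of memory and 2 virtual cores, as in the example above.
        Resource capability = Resource.newInstance(1024, 2);
        ContainerRequest request =
                new ContainerRequest(capability, null, null, Priority.newInstance(0));
        System.out.println("Requesting: " + request.getCapability());
        // An ApplicationMaster would submit this request to the ResourceManager,
        // whose scheduler grants a container according to the configured policy.
    }
}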
COMPONENTS OF YARN
Application Manager
Application Master:
The ApplicationMaster is responsible for negotiating resources from the ResourceManager and for working with the NodeManager(s) to execute and monitor the containers and their resource consumption. It negotiates appropriate resource containers from the ResourceManager, tracks their status, and monitors progress.
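A heavily simplified sketch of that negotiation using the AMRMClient API; it compiles as ordinary Java, but registration and allocation only succeed when the code actually runs inside a container launched as an ApplicationMaster (names and the single-request flow are illustrative):

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ApplicationMasterSketch {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(new YarnConfiguration());
        rm.start();

        // 1. Register this ApplicationMaster with the ResourceManager.
        rm.registerApplicationMaster("", 0, "");

        // 2. Negotiate a container (1 GB, 2 vCores) from the ResourceManager.
        rm.addContainerRequest(new ContainerRequest(
                Resource.newInstance(1024, 2), null, null, Priority.newInstance(0)));

        // 3. Poll allocate(); granted containers come back in the response.
        AllocateResponse response = rm.allocate(0.0f);
        for (Container container : response.getAllocatedContainers()) {
            // Work with the NodeManager (via NMClient) to launch and monitor
            // the task inside this container, tracking its resource consumption.
            System.out.println("Got container " + container.getId());
        }

        // 4. Tell the ResourceManager the application has finished.
        rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
        rm.stop();
    }
}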
LAUNCH AN APPLICATION IN YARN