2. Multiple Workers
⚫Follow whatever the Master asks them to do
Execution Overview
⚫The MapReduce library in the user program
first splits the input file into M pieces.
⚫The MapReduce library in the user program then starts up many copies of the program on a cluster of machines: one master and multiple workers.
⚫There are M map tasks and R reduce tasks to assign. (The figure below depicts a task as data + computation.)
⚫The master picks idle workers and assigns
each one a map task.
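To make the assignment step concrete, here is a minimal Python sketch of a master handing out the M map tasks to idle workers. The names (assign_map_tasks, the task and worker strings) are illustrative assumptions, not the library's API; the real master also tracks task state and data locations.

    from collections import deque

    def assign_map_tasks(map_tasks, workers):
        # Hand each idle worker one pending map task (toy sketch).
        pending = deque(map_tasks)       # the M map tasks to assign
        idle = deque(workers)            # workers awaiting work
        assignments = {}                 # worker -> task
        while pending and idle:
            assignments[idle.popleft()] = pending.popleft()
        return assignments, pending      # leftovers wait for idle workers

    # Example: four map tasks, two idle workers.
    assigned, remaining = assign_map_tasks(["m0", "m1", "m2", "m3"], ["w0", "w1"])
    print(assigned)                      # {'w0': 'm0', 'w1': 'm1'}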
⚫Map Phase (each mapper node; a minimal sketch follows this list)
1) Read in the corresponding input partition.
2) Apply the user-defined map function to each key/value pair in the partition.
3) Partition the results produced by the map function into R regions using the partitioning function.
4) Write the results to its local disk.
5) Notify the master of the locations of the partitioned intermediate results.
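A minimal Python sketch of one map worker carrying out steps 1) through 5) above. map_fn, input_partition, notify_master, and the file naming are assumptions for illustration; the partitioning function shown, a deterministic hash of the key mod R, is the usual default.

    import os
    import pickle
    import zlib

    def partition_of(key, R):
        # Deterministic hash so every mapper routes a given key to the
        # same region (Python's built-in hash() is salted per process).
        return zlib.crc32(str(key).encode("utf-8")) % R

    def run_map_task(task_id, input_partition, map_fn, R, notify_master):
        regions = [[] for _ in range(R)]
        for key, value in input_partition:        # 1) read the partition
            for k2, v2 in map_fn(key, value):     # 2) user-defined map
                regions[partition_of(k2, R)].append((k2, v2))  # 3) partition
        locations = []
        for r, pairs in enumerate(regions):       # 4) write to local disk
            path = os.path.abspath(f"map-{task_id}-region-{r}.bin")
            with open(path, "wb") as f:
                pickle.dump(pairs, f)
            locations.append(path)
        notify_master(task_id, locations)         # 5) report file locations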
⚫After all the map tasks are done, the
master picks idle workers and assigns each
one a reduce task.
⚫Reduce Phase (each reducer node; a minimal sketch follows this list)
1) Read in all the corresponding intermediate result partitions from the mapper nodes.
2) Sort the intermediate results by the intermediate keys.
3) Apply the user-defined reduce function to each intermediate key and the corresponding set of intermediate values.
4) Write the results to one output file.
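And a matching Python sketch of one reduce worker for steps 1) through 4). fetch_region and reduce_fn are assumed helpers, and sorting then grouping with itertools.groupby stands in for the sort pass. For word count, map_fn above would emit (word, 1) pairs and reduce_fn here would simply return sum(values).

    from itertools import groupby
    from operator import itemgetter

    def run_reduce_task(region, mapper_locations, fetch_region, reduce_fn):
        pairs = []
        for loc in mapper_locations:                  # 1) pull this region
            pairs.extend(fetch_region(loc, region))   #    from every mapper
        pairs.sort(key=itemgetter(0))                 # 2) sort by key
        with open(f"reduce-out-{region}.txt", "w") as out:  # 4) one output file
            for key, group in groupby(pairs, key=itemgetter(0)):
                values = [v for _, v in group]
                out.write(f"{key}\t{reduce_fn(key, values)}\n")  # 3) reduce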
Coping With Node Failures
⚫The worst thing that can happen is that the compute node at which the Master is executing fails.
⚫In this case, the entire MapReduce job must be restarted.
⚫But only this one node can bring the entire process down; other failures will be managed by the Master, and the MapReduce job will complete eventually.
⚫Suppose the compute node at which a Map worker resides fails. This failure will be detected by the Master, because it periodically pings the Worker processes. All map tasks that ran on the failed node, even completed ones, must be rescheduled on another worker, because their intermediate results reside on that node's local disk (see the sketch below).
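A toy Python sketch of that ping loop. ping, reschedule, and the worker's map_tasks attribute are assumptions; the point carried over from the MapReduce design is that even completed map tasks on a dead node are re-executed, since their intermediate output lived only on that node's local disk.

    import time

    def monitor_workers(workers, ping, reschedule, interval=10.0):
        # Master-side failure detection (toy sketch; interval is assumed).
        while True:
            for w in list(workers):
                if not ping(w):               # no reply: presume the node dead
                    for task in w.map_tasks:  # completed AND in-progress tasks
                        reschedule(task)      # their local output is lost
                    workers.remove(w)
            time.sleep(interval)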
5. High Processing Overhead
⚫Read/write operations in Hadoop are expensive, since we are dealing with large data sizes, in terabytes or petabytes. In Hadoop, data is read from and written to disk, which makes in-memory computation difficult and leads to high processing overhead.
6. Supports Only Batch Processing
⚫A batch process is one that runs in the background without any interaction with the user. The engines used for these processes inside the Hadoop core are not very efficient, so producing output with low latency is not possible with Hadoop.