Lovely Professional University (LPU) : Mittal School of Business (MSOB)
Submitted by:-
NAME REG.NO ROLL.NO
FAREEDULLAH MOHAMAD 12000633 RQ1E2208
1. How do Map and Reduce work together? Explain the detailed working with a diagram.
Hadoop divides a job into tasks. There are two types of tasks: Map tasks and Reduce tasks.
MapReduce is a software framework and programming model for dealing with massive
amounts of data. The MapReduce programme is divided into two phases: Map and Reduce.
Map tasks deal with data splitting and mapping, whereas Reduce tasks shuffle and reduce data.
MapReduce programmes developed in Java, Ruby, Python, and C++ may all be run on Hadoop.
Map Reduce programmes in cloud computing are parallel in nature, making them ideal for
executing large-scale data processing across a cluster of servers.
Key-value pairs are the input to each phase. A programmer must also declare two functions:
map and reduce.
Mapping
This is the very first step in the map-reduce program's execution. In this phase, each split's data
is handed to a mapping function, which generates output values. In our case, the mapping
phase's task is to count the number of times each word appears in the input segments.
Reducing
In this phase, the output values from the Shuffling phase are consolidated: the values collected
for each key are merged into a single output value. In a nutshell, this stage summarises the
entire dataset.
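To make the two user-defined functions concrete, below is a minimal word-count sketch in Java
using the Hadoop MapReduce API. The class names (WordCountSketch, TokenizerMapper,
IntSumReducer) are illustrative assumptions, not part of any prescribed solution.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // Map phase: for every line of the split, emit (word, 1) key-value pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // key-value pair produced by the map phase
            }
        }
    }

    // Reduce phase: after shuffling, all values for the same word arrive together;
    // summing them gives the final count for that word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);     // (word, total count)
        }
    }
}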
For each split, a map task is created, which then executes the map function for each
record in the split.
Multiple splits are usually useful since the time it takes to process a split is short
compared to the time it takes to process the entire input. Because the splits are processed
in parallel, it is easier to load balance the processing when the splits are smaller.
Splits that are too small, on the other hand, are not ideal. When splits are too small, the
overhead of managing the splits and creating map tasks starts to dominate the total job
execution time.
For most workloads, a split size equal to the size of an HDFS block (which is 64 MB by
default) is preferable.
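As a rough illustration, the driver fragment below pins the input split size to one 64 MB block
using Hadoop's FileInputFormat helpers, so one map task is created per block-sized split. The
class name SplitSizeSketch and the omission of the rest of the job setup are illustrative
assumptions.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-sketch");

        long oneBlock = 64L * 1024 * 1024;                  // 64 MB in bytes
        FileInputFormat.setMinInputSplitSize(job, oneBlock);
        FileInputFormat.setMaxInputSplitSize(job, oneBlock);
        // Mapper, reducer and input/output paths would still be configured
        // before submitting the job; they are omitted in this sketch.
    }
}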
When map jobs are run, the output is written to a local disc on the node, rather than to
HDFS.
The reason for choosing local disc over HDFS is to avoid the replication that occurs
while using HDFS to store data.
Reduce tasks process the map output to produce the final output.
The map output can be discarded once the process is finished. As a result, storing it on
HDFS with replication is excessive.
If a node fails before its map output has been consumed by the reduce task, Hadoop reruns the
map task on another node to recreate the map output.
The map outputs are merged on the node running the reduce task and then passed to the
user-defined reduce function.
Reduce output, unlike map output, is saved in HDFS (the first replica is stored on the
local node and the other replicas are stored on off-rack nodes). As a result, writing the
reduce output does consume network bandwidth, but only as much as a normal HDFS write.
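As a small sketch of how this can be confirmed from the client side, the snippet below reads
the replication factor of a reduce output file in HDFS through the FileSystem API. The path
used is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical reduce output file of a word-count job.
        Path reduceOutput = new Path("/user/hadoop/wordcount/output/part-r-00000");
        FileStatus status = fs.getFileStatus(reduceOutput);

        // Map output never appears in HDFS (it lives on local disks), but the
        // reduce output should report the cluster's replication factor here.
        System.out.println("Replication factor: " + status.getReplication());
    }
}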
The entire execution process (including the Map and Reduce operations) is controlled by two
types of entities:
➢ Jobtracker: acts as the master (responsible for the complete execution of the submitted
job).
➢ Multiple Task Trackers: act as slaves, each of them performing the part of the job assigned
to it.
➢ For every job submitted for execution in the system, there is one Jobtracker that resides on
the Namenode and multiple Tasktrackers that reside on the Datanodes.
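For completeness, a hedged sketch of the driver that submits such a job follows; once submitted,
the framework (the Jobtracker and Tasktrackers in Hadoop 1) takes over scheduling the map and
reduce tasks. The class WordCountDriver and the reuse of the earlier WordCountSketch classes are
illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountSketch.TokenizerMapper.class);   // from the earlier sketch
        job.setReducerClass(WordCountSketch.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory

        // The driver only describes the job; the cluster executes it.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}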
2. What is HDFS and how does it work?
Hadoop is an open source distributed processing framework for big data applications that
manages both data processing and storage. HDFS is an important component of the Hadoop
ecosystem. It provides a reliable platform for managing large data sets and supporting big data
analytics applications.
HDFS allows data to be transferred quickly between compute nodes. It was initially tightly tied
with MapReduce, a data processing framework that filters and divides work among cluster
nodes, then organises and condenses the findings into a cohesive answer to a query. Similarly,
when HDFS receives data, it divides it into individual blocks and distributes them throughout
the cluster's nodes.
Data is written to the server once, then read and reused multiple times with HDFS. A primary
NameNode in HDFS keeps track of where file data in the cluster is stored.
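A minimal sketch of this write-once / read-many pattern with the HDFS Java client is shown
below; the path /user/hadoop/notes.txt is a hypothetical example.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/hadoop/notes.txt");

        // Write the file once; HDFS splits it into blocks behind the scenes
        // and the NameNode records where each block is stored.
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back; this read can be repeated any number of times.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
    }
}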
On a commodity hardware cluster, HDFS also has several DataNodes, typically one per node.
In the data centre, the DataNodes are usually grouped together in the same rack. For storage,
data is divided down into individual blocks and distributed among the numerous DataNodes.
Additionally, blocks are copied between nodes, allowing for extremely efficient parallel
processing.
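To illustrate, the sketch below asks the NameNode, through the FileSystem API, which DataNode
hosts hold each block of a file; the file path is again hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/hadoop/notes.txt"));

        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Each block is replicated, so several DataNode hosts are listed per block.
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(", ", block.getHosts()));
        }
    }
}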
The NameNode understands which DataNode contains which blocks and where in the machine
cluster the DataNodes are located. The NameNode also controls file access, including reads,
writes, creates, and deletes, as well as data block replication between DataNodes.
The DataNodes are in constant communication with the NameNode. As a result, the
cluster may dynamically adjust to changing server capacity requirements in real time by adding
or removing nodes as needed.
HDFS exposes a file system namespace and allows for the storage of user data in files. A file is
divided into one or more blocks, each of which is kept in a collection of DataNodes.
The NameNode is responsible for file system namespace actions such as file and directory
opening, closing, and renaming. The NameNode is also in charge of the mapping of blocks to
DataNodes.
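A small sketch of these namespace operations through the Java FileSystem API follows; all
paths used are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/user/hadoop/reports"));              // create a directory
        fs.rename(new Path("/user/hadoop/reports"),
                  new Path("/user/hadoop/reports-2021"));         // rename it
        fs.delete(new Path("/user/hadoop/reports-2021"), true);   // recursive delete
        // Behind each call the NameNode updates the namespace and, for file data,
        // the block-to-DataNode mapping; the client does not contact DataNodes
        // for these metadata operations.
    }
}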
3. Hadoop 1 vs Hadoop 2
Hadoop is an open source software programming platform for storing and processing massive
amounts of data. Its framework is built on Java programming, with some native C code and
shell scripts thrown in for good measure.
2. Daemons:
Hadoop 1             Hadoop 2
Namenode             Namenode
Datanode             Datanode
Secondary Namenode   Secondary Namenode
Job Tracker          Resource Manager
Task Tracker         Node Manager
3. Functioning:
In Hadoop 1, HDFS is used for storage, and MapReduce handles both resource management and
data processing on top of it. This double burden on MapReduce has an impact on performance.
In Hadoop 2, HDFS is again used for storage, and YARN runs on top of it for resource
management. YARN allocates cluster resources and schedules the work, leaving MapReduce to
handle data processing only.
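As a hedged sketch, the snippet below shows the configuration switch that points a MapReduce
job at YARN in Hadoop 2; the ResourceManager hostname is a placeholder, not a real machine.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");             // run on YARN (Hadoop 2)
        conf.set("yarn.resourcemanager.hostname", "rm.example.com"); // placeholder host

        Job job = Job.getInstance(conf, "word count on yarn");
        // Mapper, reducer and input/output paths would be configured exactly as
        // in Hadoop 1; only resource management moves from MapReduce to YARN.
    }
}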
4. Restrictions:
Hadoop 1 is built on a Master-Slave model: it consists of a single master and a number of
slaves. If the master node crashes, the whole cluster goes down, regardless of how good the
slave nodes are. Rebuilding that cluster then requires copying the system files, image files,
and other files to another machine, which is far too time-consuming for today's enterprises.
Hadoop 2 is similarly built on a Master-Slave model. However, there are many masters
(active namenodes and standby namenodes) and slaves in this configuration.