

MITTAL SCHOOL OF BUSINESS (MSOB)

LOVELY PROFESSIONAL UNIVERSITY (LPU)

ACADEMIC TASK NO. 1

Course Code: INT576

Course Title: BIG DATA-HADOOP

Class: MBA Section: Q2046

Name of the faculty member: Manjari Gangwar

Submitted by:
NAME                    REG. NO.    ROLL NO.
FAREEDULLAH MOHAMAD     12000633    RQ1E2208
1. How do Map and Reduce work together? Explain the detailed working with a diagram.

Hadoop divides the job into tasks. There are two types of tasks:

1. Map tasks (Splits & Mapping)


2. Reduce tasks (Shuffling, Reducing)

MapReduce is a software framework and programming model for processing massive
amounts of data. A MapReduce programme runs in two phases: Map and Reduce.
Map tasks deal with splitting and mapping the data, whereas Reduce tasks shuffle and reduce it.
MapReduce programmes written in Java, Ruby, Python, or C++ can all be run on Hadoop.
Because MapReduce programmes are parallel in nature, they are ideal for
large-scale data processing across a cluster of servers.
Key-value pairs are the input to each phase. A programmer must also declare two functions:
map and reduce.
Mapping
This is the very first phase in the execution of a map-reduce programme. In this phase, the data
in each split is passed to a mapping function, which produces output values. In the word-count
example, the mapping phase's task is to emit an intermediate pair for each word that appears in the input splits.
Reducing
The output values from the Shuffling step are consolidated in this phase. (Shuffling sits between
Map and Reduce: it groups all intermediate pairs that share the same key and routes each group
to a single reducer.) The Reduce phase then merges each group of values into a single output
value per key; in a nutshell, this stage summarises the entire dataset.
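To make the two phases concrete, here is a minimal sketch of the word-count Mapper and Reducer written against the standard Hadoop Java MapReduce API. The class names are chosen for this sketch only; in a real project they would normally live in separate files or as nested static classes.

```java
// Word-count sketch using the standard Hadoop MapReduce API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: for every record (line) of its split, emit a (word, 1) pair.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);          // key = word, value = 1
        }
    }
}

// Reduce phase: shuffling has already grouped the pairs by word, so the
// reducer simply sums the counts for each word and emits (word, total).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```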
➢ For each split, a map task is created, which then executes the map function for each
record in the split.
➢ Having multiple splits is usually useful, because the time needed to process one split is
short compared with the time needed to process the entire input. Since the splits are
processed in parallel, the processing is easier to load balance when the splits are smaller.
➢ Splits that are too small, on the other hand, are not ideal: the overhead of managing the
splits and creating map tasks starts to dominate the total job execution time.
➢ For most workloads, a split size equal to the size of an HDFS block (64 MB by default in
Hadoop 1, 128 MB in Hadoop 2) is preferable; a configuration sketch follows this list.
➢ When map tasks run, their output is written to the local disc of the node, not to HDFS.
Local disc is chosen over HDFS to avoid the replication that HDFS performs when
storing data.
➢ Reduce tasks consume the map output to produce the final output. Once the job has
finished, the map output can be discarded, so storing it in HDFS with replication would
be excessive.
➢ If a node fails before the reduce task has consumed its map output, Hadoop reruns the
map task on another node and recreates the map output.
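As a rough illustration of the split-size point above, the sketch below shows the client-side knobs that influence split size in the org.apache.hadoop.mapreduce API. The concrete sizes and the dfs.blocksize property name (the Hadoop 2 name; Hadoop 1 used dfs.block.size) are assumptions for this example.

```java
// A hedged sketch of how split size relates to the HDFS block size.
// The sizes used here are examples only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Block size used for files written by this client (128 MB here).
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        Job job = Job.getInstance(conf, "split-size-demo");
        // By default each input split covers roughly one HDFS block;
        // these bounds let a job override that behaviour.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
    }
}
```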



➢ The Reduce task does not have the advantage of data locality. The output of every map
task is fed to the reduce task, so the map output is transferred to the machine on which
the reduce task is running.
➢ On that machine the map outputs are merged and then passed to the user-defined reduce
function.
➢ Unlike map output, reduce output is stored in HDFS (the first replica on the local node,
the other replicas on off-rack nodes). Writing the reduce output therefore does consume
network bandwidth, but only as much as a normal HDFS write.
The entire execution process (including the Map and Reduce operations) is managed by two
types of entities:
➢ JobTracker: acts as the master (responsible for the complete execution of a submitted job).
➢ Multiple TaskTrackers: act as slaves, each carrying out its own share of the work.
➢ For every job submitted for execution, there is one JobTracker, which resides on the
NameNode, and multiple TaskTrackers, which reside on the DataNodes.
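For completeness, here is a minimal driver sketch that wires the hypothetical WordCountMapper and WordCountReducer from the earlier sketch into a Job and hands it to the cluster's scheduler (the JobTracker in Hadoop 1, the YARN ResourceManager in Hadoop 2). The input and output paths are placeholders taken from the command line.

```java
// Driver sketch: configure and submit a complete MapReduce job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // from the earlier sketch
        job.setReducerClass(WordCountReducer.class);   // from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths are supplied on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // waitForCompletion() submits the job to the cluster and blocks
        // until all map and reduce tasks have finished.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```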

2. Role of distributed storage (HDFS) in the Hadoop application architecture and its implementation

Hadoop applications use the Hadoop Distributed File System (HDFS) as their primary data
storage system. HDFS is a distributed file system that uses a NameNode and DataNode
architecture to allow high-performance data access across highly scalable Hadoop clusters.

Hadoop is an open source distributed processing framework for big data applications that
manages both data processing and storage. HDFS is an important component of the Hadoop
ecosystem: it provides a reliable platform for managing large data sets and supporting big data
analytics applications.
What is HDFS and how does it work?
HDFS allows data to be transferred quickly between compute nodes. It was initially tightly
coupled with MapReduce, a data processing framework that filters and divides work among
cluster nodes, then organises and condenses the results into a cohesive answer to a query.
Similarly, when HDFS receives data, it divides it into separate blocks and distributes them
across the cluster's nodes.

Data is written to the server once, then read and reused multiple times with HDFS. A primary
NameNode in HDFS keeps track of where file data in the cluster is stored.
On a commodity hardware cluster, HDFS also runs several DataNodes, typically one per node.
In the data centre, the DataNodes are usually grouped together in the same rack. For storage,
data is broken down into individual blocks and distributed among the DataNodes.
Blocks are also replicated between nodes, allowing for highly efficient parallel
processing.
The NameNode understands which DataNode contains which blocks and where in the machine
cluster the DataNodes are located. The NameNode also controls file access, including reads,
writes, creates, and deletes, as well as data block replication between DataNodes.
Together, the DataNodes and the NameNode form the HDFS cluster. As a result, the
cluster can adjust dynamically to changing server capacity requirements in real time by adding
or removing nodes as needed.
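To illustrate the block-to-DataNode mapping held by the NameNode, the sketch below asks the cluster for the block locations of one file through the FileSystem API; the file path is illustrative and assumed for this example.

```java
// Sketch: list which DataNodes hold the blocks of a file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/big-input.txt");   // illustrative path

        FileStatus status = fs.getFileStatus(file);
        // Ask the NameNode for the locations of every block in the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // Each block reports the hosts (DataNodes) storing a replica of it.
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```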



The NameNode and the DataNodes are in constant communication to decide whether the
DataNodes need to execute specific tasks. As a result, the NameNode is always aware of each
DataNode's status. If the NameNode notices that one of the DataNodes isn't functioning
properly, it can reassign the DataNode's responsibility to another node that has the same data
block. DataNodes can also interact with one another, allowing them to work together during
routine file operations.
NameNodes and DataNodes are two components of the HDFS architecture.
A primary/secondary architecture is used by HDFS. The NameNode of an HDFS cluster is the
main server that handles the file system namespace and regulates client file access. The
NameNode, being the central component of the Hadoop Distributed File System, maintains and
controls the file system namespace and grants appropriate access permissions to clients. The
DataNodes in the system are in charge of the storage attached to the nodes they run on.

HDFS exposes a file system namespace and allows for the storage of user data in files. A file is
divided into one or more blocks, each of which is kept in a collection of DataNodes.
The NameNode is responsible for file system namespace actions such as opening, closing, and
renaming files and directories. The NameNode is also in charge of the block-to-DataNode
mapping. The DataNodes handle read and write requests from the file system's clients; when
the NameNode instructs them to, they also handle block creation, deletion, and replication.
Traditional hierarchical file organisation is supported by HDFS. An application or a user can
make directories and then store files within them. A user can create, remove, rename, or move
files from one directory to another in the file system namespace hierarchy, just as they do in
most other file systems.
Any modification to the file system namespace or its characteristics is recorded by the
NameNode. The number of replicas of a file that the HDFS should keep can be specified by an
application. The replication factor of a file, which is stored in the NameNode, is the number of
copies of that file.
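The namespace operations described above (creating directories, writing, renaming, reading and deleting files) can be sketched with the org.apache.hadoop.fs.FileSystem client API. The NameNode URI and the paths below are assumptions for this example.

```java
// Sketch of client-side HDFS namespace and file operations.
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsNamespaceDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The NameNode address is an assumption for this sketch.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path dir = new Path("/user/demo/reports");
        fs.mkdirs(dir);                                     // create a directory

        Path file = new Path(dir, "report.txt");
        try (FSDataOutputStream out = fs.create(file)) {    // write a new file
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        fs.rename(file, new Path(dir, "report-final.txt")); // rename in the namespace

        try (FSDataInputStream in = fs.open(new Path(dir, "report-final.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false); // read the file back
        }

        fs.delete(dir, true);                               // recursive delete
        fs.close();
    }
}
```
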
What are the advantages of HDFS?
The following are five major benefits of using HDFS:
➢ Cost efficiency. The DataNodes that store the data use low-cost off-the-shelf hardware,
which keeps storage costs down. There is also no licensing fee, because HDFS is open
source.
➢ Storage of large data sets. HDFS stores data in a variety of sizes and formats, including
structured and unstructured data, ranging from gigabytes to petabytes.
➢ Recovery time after a hardware failure is short. HDFS is designed to detect and recover
from errors on its own.
➢ Portability. HDFS can be used on any hardware platform and is compatible with a variety
of operating systems, including Windows, Linux, and Mac OS X.
➢ Streaming data access. HDFS is built for high data throughput, which makes it well
suited to streaming access to large data sets.

3. Hadoop 1 vs Hadoop 2
Hadoop is an open source software platform for storing and processing massive amounts of
data. Its framework is written mainly in Java, with some native C code and shell scripts.

1. Components: Hadoop 1 has MapReduce, whereas Hadoop 2 has YARN (Yet Another
Resource Negotiator) together with MapReduce version 2.

   Hadoop 1       Hadoop 2
   HDFS           HDFS
   MapReduce      YARN / MRv2

2. Daemons:

   Hadoop 1             Hadoop 2
   NameNode             NameNode
   DataNode             DataNode
   Secondary NameNode   Secondary NameNode
   JobTracker           ResourceManager
   TaskTracker          NodeManager
3. Functioning:
In Hadoop 1, HDFS is used for storage and MapReduce, running on top of it, handles both
resource management and data processing. This double burden on MapReduce has an
impact on performance.
In Hadoop 2, HDFS is again used for storage, and YARN runs on top of HDFS to handle
resource management: it allocates the cluster's resources and keeps everything running.
4. Restrictions:
Hadoop 1 is built on a master-slave model, made up of a single master and a number of
slaves. If the master node crashes, the whole cluster becomes unusable, regardless of how
good the slave nodes are, and rebuilding the cluster requires copying system files, the file
system image, and other metadata to another machine, which is far too time-consuming for
today's enterprises.
Hadoop 2 is likewise built on a master-slave model, but in this configuration there are
multiple masters (active and standby NameNodes) alongside the slaves.



5. Ecosystem:
[Diagram: comparison of the Hadoop 1 and Hadoop 2 ecosystems]


4. Replication management in HDFS
Replication management:
HDFS provides a dependable way to store massive amounts of data as data blocks in a
distributed environment. The blocks are also replicated to provide fault tolerance; the
replication factor is 3 by default, though it can be changed.
HDFS is a file system designed to store very large files reliably across the machines of a
large cluster. It stores each file as a sequence of blocks; all of the blocks in a file are the same
size, except for the last block. The blocks of a file are replicated for fault tolerance.
How replication works:

Replication ensures the availability of the data. Making a copy of a block is known as
replication, and the number of copies that are kept is known as the replication factor. As
described under file blocks above, HDFS stores data as multiple blocks, and Hadoop is
configured to duplicate those file blocks. By default Hadoop's replication factor is set to 3,
but it can be changed manually to meet your needs. For example, if a file is divided into
4 blocks and the replication factor is 3, then 4 × 3 = 12 blocks are stored in the cluster in total.
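As a sketch of how the replication factor is controlled in practice: the client/cluster default comes from the dfs.replication property, and the replication of an individual file can be changed with FileSystem.setReplication(). The file path below is illustrative only.

```java
// Sketch: controlling the HDFS replication factor.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);   // default replication factor for new files

        FileSystem fs = FileSystem.get(conf);

        // Raise the replication factor of one existing (illustrative) file to 5;
        // the NameNode schedules the extra copies asynchronously.
        boolean accepted = fs.setReplication(new Path("/user/demo/important.csv"), (short) 5);
        System.out.println("replication change accepted: " + accepted);

        fs.close();
    }
}
```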



Why is it necessary to replicate our file blocks?
This is because we use commodity hardware (cheap system hardware) to run Hadoop, and
such hardware can fail at any time; we do not use a supercomputer for our Hadoop setup.
That is why HDFS needs the ability to keep backup copies of those file blocks, which is
what gives it fault tolerance.

