IMTC634_Data Science_Chapter 13


Chapter 13:

Hadoop
Chapter Index
1. Learning Objectives
2. Topic 1: Hadoop
3. Topic 2: Hadoop Architecture
4. Topic 3: Components of Hadoop
5. Topic 4: How does Hadoop Function?
6. Let’s Sum Up
Learning Objectives

 Describe Hadoop architecture
 Discuss the main components of Hadoop: HDFS and MapReduce
 Explain the functioning of Hadoop
1. Hadoop

 Traditional technologies have proved incapable of handling the huge
amounts of data generated in organizations or of fulfilling the
processing requirements of such data.
 Hadoop is an open-source framework built around a distributed file
system, which allows you to store and process massive amounts of data
on a cluster of machines.
 The main benefit of using Hadoop is that, since data is stored across
multiple nodes, it can be processed in a distributed way.
 This means the data stored on a node is processed by the node itself
instead of spending time distributing it over the network (see the
rough estimate below).
 In the case of huge amounts of data, it becomes almost impossible to
store the data in tables, records, and columns.
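The following is a rough, illustrative Python estimate of the network cost
that data locality avoids; the 1 TB data size and the 1 Gb/s link speed are
assumed values, not figures from this chapter.

# Back-of-the-envelope estimate of the network cost Hadoop avoids by
# processing data where it is stored. Data size and link speed are assumptions.
DATA_SIZE_GB = 1000              # assume 1 TB of data to analyze
NETWORK_GBPS = 1                 # assume a 1 gigabit/s link between nodes
network_gb_per_second = NETWORK_GBPS / 8

transfer_seconds = DATA_SIZE_GB / network_gb_per_second
print(f"Moving {DATA_SIZE_GB} GB over a {NETWORK_GBPS} Gb/s link takes about "
      f"{transfer_seconds / 3600:.1f} hours before any processing starts.")
# Processing each block on the node that already stores it avoids this delay.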
1. Hadoop

Real-Time Industry Applications of Hadoop:

 Used to manage traffic on streets.
 Used in the processing of data streams.
 Used in content management and archiving of emails.
 Used for fraud detection and prevention.
 Used in advertisement-targeting platforms to capture and analyze
social media data.
 Used to improve business performance by analyzing customer data in
real time.
 Used by financial agencies to reduce risk, analyze fraud patterns,
and improve customer satisfaction.
1. Hadoop

Hadoop Ecosystem
 Hadoop ecosystem refers to a collection of components of the Apache
Hadoop software library, including the accessories and tools provided by the
Apache Software Foundation. The following figure shows the Hadoop ecosystem:
1. Hadoop

 MapReduce and the Hadoop Distributed File System (HDFS) are two core
components of the Hadoop ecosystem that provide a great starting point
to manage Big Data; however, they are not sufficient to deal with Big
Data challenges.
 MapReduce and HDFS provide the necessary services and basic structure
to deal with the core requirements of Big Data solutions. Other
services and tools of the ecosystem provide the environment and
components required to build and manage purpose-driven Big Data
applications.
2. Hadoop Architecture

 A Hadoop cluster consists of a single master node and multiple worker
nodes. The master node contains a NameNode and a JobTracker, whereas a
slave or worker node acts as both a DataNode and a TaskTracker. The
following figure shows the Hadoop multi-node cluster architecture:
2. Hadoop Architecture

 In a larger cluster, HDFS is managed through a NameNode server, which
hosts the file-system index.
 A secondary NameNode keeps snapshots of the NameNode. At the time of
failure of the primary NameNode, the secondary NameNode replaces it,
thus preventing the file system from getting corrupted and reducing
data loss.
 The secondary NameNode takes snapshots of the primary NameNode's
directory information at regular intervals, and these snapshots are
saved in local or remote directories.
 These checkpoint images can be used to restart a failed primary
NameNode without replaying the entire journal of file-system actions
and then editing the log to create an up-to-date directory structure.
3. Components of Hadoop

 There are two main components of Apache Hadoop: the Hadoop
Distributed File System (HDFS) and the MapReduce parallel processing
framework.
 HDFS is a fault-tolerant storage system that stores very large files,
from terabytes to petabytes in size, across different machines. HDFS
replicates the data over multiple hosts to achieve reliability.
 A file in HDFS is split into large blocks of around 64 to 128
megabytes of data. Each block of the file is independently replicated
on multiple DataNodes (see the sketch at the end of this slide).
 MapReduce is a framework that helps developers write programs to
process large volumes of unstructured data in parallel over a
distributed or standalone architecture in order to get the output in
an aggregated format.
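The sketch below (Python) makes the block-and-replication idea concrete; the
128 MB block size matches the range mentioned above, and the replication
factor of 3 is a common HDFS default. Both values are configurable, and the
function name is purely illustrative.

# Sketch of how a file is laid out in HDFS. Block size and replication factor
# are assumed, configurable values; the function name is illustrative.
import math

BLOCK_SIZE_MB = 128
REPLICATION = 3

def hdfs_layout(file_size_mb):
    # Number of blocks the file is split into, total block copies, raw storage used.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, blocks * REPLICATION, file_size_mb * REPLICATION

blocks, copies, raw_mb = hdfs_layout(file_size_mb=1024)   # a 1 GB file
print(f"{blocks} blocks, {copies} block copies across DataNodes, "
      f"about {raw_mb} MB of raw cluster storage")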
3. Components of Hadoop

 MapReduce consists of several components. Some of the most important
ones are:

  JobTracker: The master node that manages all jobs and resources in
a cluster of commodity computers.

  TaskTrackers: Agents deployed on each machine in the cluster to run
the map and reduce tasks on that machine.

  JobHistoryServer: A component that tracks completed jobs.


4. How does Hadoop function?

 Hadoop facilitates the processing of large amounts of data present in
both structured and unstructured forms. Hadoop clusters are created
from racks of commodity machines.
 Tasks are distributed across these machines (also known as nodes),
which are allowed to work independently and provide their responses to
the starting node.
 Moreover, it is possible to add or remove nodes dynamically in a
Hadoop cluster on the basis of varying workloads.
 Hadoop accomplishes its operations (dividing the computing tasks into
subtasks that are handled by individual nodes) with the help of the
MapReduce model, which comprises two functions: a mapper and a reducer
(see the word-count sketch below).
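Below is a minimal word-count sketch of a mapper and reducer written in the
style of Hadoop Streaming, where both read from standard input and write
tab-separated key-value pairs to standard output. The file name, the
word-count task, and the local test pipeline in the comment are illustrative
assumptions.

#!/usr/bin/env python3
# Minimal word-count mapper and reducer in the style of Hadoop Streaming.
import sys

def mapper(stream):
    # Map phase: emit one "word<TAB>1" pair per word in the input.
    for line in stream:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer(stream):
    # Reduce phase: input arrives sorted by key, so counts for a word are adjacent.
    current, total = None, 0
    for line in stream:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    # Local test: cat input.txt | python wordcount.py map | sort | python wordcount.py reduce
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)

The sort step in the local pipeline stands in for Hadoop's shuffle phase,
which groups records with the same key before they reach the reducer.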
4. How does Hadoop function?

 The MapReduce model implements the MapReduce algorithm, as discussed
earlier, to incorporate the capability of breaking data into manageable
subtasks, processing the data on the distributed cluster
simultaneously, and making the data available for additional processing
or user consumption.
 When an indexing job is provided to Hadoop, it requires the
organizational data to be loaded first.
 Next, the data is divided into various pieces, and each piece is
forwarded to a different individual server.
 Each server has a job code with the piece of data it is required to
process.
 Once a server completes operations on the data provided to it, the
response is forwarded with the job code appended to the result.
 In the end, results from all the nodes are integrated by the Hadoop
software and provided to the user (a small simulation of this flow
appears below).
4. How does Hadoop function?
Features of Hadoop
 Hadoop helps in Big Data analytics by overcoming the obstacles
usually faced in handling Big Data.
 Hadoop allows analysts to break down large computational problems
into smaller tasks, as smaller elements can be analyzed quickly and
economically.
 Hadoop performs well with several nodes without requiring any type of
shared memory or disks among them. Hence, the efficiency-related issues
in the context of storage and access to data get automatically solved.
 Hadoop follows a client-server architecture in which the server works
as a master and is responsible for data distribution among the clients,
which are commodity machines and work as slaves to carry out all the
computational tasks.
 Hadoop improves data processing by running computing tasks on all
available processors working in parallel (a single-machine analogy
appears below).
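As a single-machine analogy of this last point (not Hadoop itself), the same
divide-and-integrate idea can be sketched with Python's multiprocessing
module; the chunk size and the per-chunk computation are assumptions chosen
only for illustration.

# Single-machine analogy: divide the input into chunks and use every CPU core.
from multiprocessing import Pool, cpu_count

def analyze_chunk(chunk):
    # Stand-in for the computation a single node would perform on its share.
    return sum(chunk)

if __name__ == "__main__":
    values = list(range(1_000_000))
    chunk_size = 100_000
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with Pool(processes=cpu_count()) as pool:
        partial_results = pool.map(analyze_chunk, chunks)  # processed in parallel
    print(sum(partial_results))  # integrate the partial results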
Let’s Sum Up

 Hadoop is an open-source framework that provides a distributed file
system for processing Big Data.
 There are two main components of Apache Hadoop: the Hadoop
Distributed File System (HDFS) and the MapReduce parallel processing
framework. Both of these components are open-source projects; HDFS is
used for storage and MapReduce is used for processing.
 Hadoop uses the MapReduce programming model of data processing, which
allows users to split big data sets and process them to extract
meaningful information.
 The Hadoop Distributed File System (HDFS) is a cluster of highly
reliable, efficient, and economical storage solutions that facilitates
the management of files containing related data across machines.
 Hadoop MapReduce is a computational framework used in Hadoop to
perform all the mathematical computations. It is based on a parallel
and distributed implementation of the MapReduce algorithm that provides
high performance.
THANK YOU
