Big Data - Hadoop

Big Data Technologies

Hadoop
Introduction
 Hadoop is an open-source framework designed for the distributed
storage and processing of big data.
 It can be used as an ETL engine or as an analytics engine for
processing large amounts of structured, semi-structured and
unstructured data.
 Hadoop's development was heavily influenced by Google's
MapReduce and Google File System (GFS) papers, which introduced
the concept of distributed computing and fault-tolerant storage.
 Hadoop's architecture and components were designed to address
the challenges posed by the growing volume of data and the
limitations of traditional data processing systems.
Hadoop
Applications of Hadoop
 Data warehousing: Hadoop's ability to handle large
datasets makes it ideal for building data warehouses,
enabling organizations to store and analyze historical
data for informed decision-making.
 Machine learning: Hadoop facilitates large-scale
machine learning tasks, enabling the training and
execution of complex machine learning models.
 Scientific computing: Hadoop empowers scientists to
process and analyze massive scientific datasets,
accelerating research and discovery.
 Social media analysis: Hadoop can be used to analyze
large social media datasets, identifying trends and
patterns in public sentiment.
Components of Hadoop
Hadoop Distributed File System (HDFS)
 A distributed file system that runs
on standard or low-end hardware.
 HDFS provides better data
throughput than traditional file
systems, in addition to high fault
tolerance and native support of
large datasets.
 It is designed to be highly scalable and fault-tolerant, meaning that it can handle large amounts of data and can continue to operate even if some nodes fail (by default, every block is replicated three times across the cluster). A short usage sketch follows below.
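To make this concrete, below is a minimal sketch (not from the slides) of writing and reading a file through the HDFS Java API; the NameNode address and file path are illustrative assumptions.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits files into blocks and replicates
        // each block (3x by default) across DataNodes for fault tolerance.
        Path path = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
        }

        // Read the same file back and print it to standard output.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}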
Yet Another Resource Negotiator (YARN)
 YARN is a resource manager responsible for providing computational resources for application execution.
 It monitors and manages workloads, manages the
high availability features of Hadoop, and
implements security controls.
 YARN supports multiple processing models in addition to MapReduce (e.g. Spark).
 With YARN, the resource management and job
scheduling aspects are separated, allowing for more
diverse and dynamic workloads on the same
Hadoop cluster.
 This separation enables Hadoop to support various distributed computing frameworks beyond MapReduce, making it a more versatile and general-purpose platform.
 The main components of YARN architecture include:
 Client: It submits MapReduce jobs.
 Resource Manager: It is the master daemon of YARN and is
responsible for resource assignment and management
among all the applications. Whenever it receives a
processing request, it forwards it to the corresponding node
manager and allocates resources for the completion of the
request accordingly.
 Node Manager: It takes care of an individual node in a Hadoop cluster and manages applications and workflows on that particular node. Its primary job is to keep up with the Resource Manager. It is also responsible for creating the container process and starting it at the request of the Application Master.
 Application Master: It is a framework-specific library. The Application Master is responsible for negotiating resources with the Resource Manager and for tracking the status and monitoring the progress of a single application. It requests the container from the Node Manager by sending a Container Launch Context (CLC), which includes everything an application needs to run (environment variables, dependencies, security tokens, etc.).
 Container: It is a collection of physical resources such as RAM, CPU cores and disk on a single node. Containers are launched with a Container Launch Context (CLC), a record that contains information such as environment variables, security tokens, dependencies, etc. (see the submission sketch after this list).
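To show how these components fit together, below is a minimal, hypothetical sketch of submitting an application through the YARN client API; the application name, launch command, queue and resource sizes are illustrative assumptions rather than anything from the slides.

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Client: talks to the Resource Manager to create and submit an application.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("demo-app");   // illustrative name
        appContext.setQueue("default");

        // Container Launch Context (CLC): the command, environment, dependencies
        // and security tokens needed to launch the Application Master container.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
                "java -Xmx512m com.example.MyApplicationMaster"));  // hypothetical AM class
        appContext.setAMContainerSpec(amContainer);

        // Resources (RAM in MB, vCPU cores) requested for the Application Master container.
        appContext.setResource(Resource.newInstance(1024, 1));

        // The Resource Manager allocates a container on some Node Manager
        // and starts the Application Master inside it.
        yarnClient.submitApplication(appContext);
    }
}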
MapReduce
 MapReduce is a programming model that processes and analyzes big data sets in parallel by logically splitting them into separate chunks that are processed across the cluster.
 MapReduce is a widely used implementation of a batch processing
framework.
 It is highly scalable and reliable and is based on the principle of divide-
and-conquer, which provides built-in fault tolerance and redundancy.
 MapReduce can be used to process schema-less datasets as it does not
require that the input conform to a particular data model.
Phases of MapReduce
 The MapReduce model has the following 4 phases:
1. Splitting
2. Mapping
3. Shuffling and Sorting
4. Reducing
Splitting
 This is the first stage of the MapReduce programming model.
 It involves dividing the job or dataset into multiple smaller tasks that can be separately processed and analyzed (a sketch of tuning split sizes follows below).
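As a small, hedged illustration, the size of the input splits (and therefore the number of map tasks) can be influenced when configuring a MapReduce job; the 128 MB cap and input path below are arbitrary examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitConfigSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-demo");

        // Each input split is handled by one map task; capping the split size
        // at 128 MB (an illustrative value) controls how the input is divided.
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));  // hypothetical path
    }
}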
Mapping
 Each map task processes an input split and applies a user-defined map function to transform the input data into intermediate key-value pairs.
 The map function takes the input split as input and produces a set of intermediate key-value pairs (a word-count style mapper sketch follows below).
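As an illustration (not taken from the slides), a minimal word-count map function written with the Hadoop MapReduce Java API:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every word in a line of the input split.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate key-value pair
        }
    }
}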
Shuffle and Sorting
 The output of various mapping parts then goes into
the Shuffling and Sorting phase.
 The objective is to consolidate the data based on
the keys so that all data with the same key is sent
to the same reducer in the Reduce phase.
 The intermediate key-value pairs generated by the
map tasks are sorted and partitioned based on their
keys.
 Pairs with the same key are grouped together and
sent to the same reduce task.
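As an illustration of how this routing happens, the sketch below mirrors the behavior of Hadoop's default HashPartitioner: pairs with the same key always hash to the same partition and therefore reach the same reduce task.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Mirrors the default HashPartitioner: the key's hash code decides which
// of the numReduceTasks partitions (and hence which reducer) receives the pair.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}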
Reduce
 The output of the Shuffling and Sorting phase is then the input of the Reduce phase.
 Each reduce task processes a subset of the intermediate key-value pairs.
 It applies a user-defined reduce function to the values associated with a particular key, producing a set of final key-value pairs (a reducer sketch follows below).
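Continuing the hypothetical word-count example from the Mapping phase, a minimal reduce function that sums the counts emitted for each word:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives all counts for one word (grouped during Shuffling and Sorting) and sums them.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));  // final key-value pair
    }
}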
Advantages of using Hadoop
 Scalability - Hadoop allows seamless scalability,
both horizontally and vertically. It can efficiently
handle and process large volumes of data by
distributing it across a cluster of commodity
hardware. This scalability makes it suitable for
organizations dealing with ever-growing datasets.
 Cost-Effective - Hadoop can run on low-cost
commodity hardware, making it a cost-effective
solution for storing and processing large amounts of
data. It utilizes the parallel processing power of
multiple nodes in a cluster, avoiding the need for
expensive, high-end servers.
 Fault Tolerance - Hadoop is designed to be fault-
tolerant. In a distributed computing environment,
hardware failures are common. Hadoop
automatically detects and compensates for these
failures by redistributing tasks to the remaining
nodes. This ensures data integrity and minimizes
the risk of data loss.
 Parallel Processing - The MapReduce
programming model in Hadoop enables parallel
processing of data. It breaks down large datasets
into smaller chunks and processes them
concurrently across nodes in a cluster. This
parallelism significantly speeds up data processing
tasks.
