Lecture 4 Introduction to Hadoop


Lecture 4

Processing Big Data


With Hadoop Map
Reduce Technology
By
Dr. Aditya Bhardwaj

[email protected]

Big Data Analytics and Business Intelligence


Learning Objectives

• Evolution of Hadoop
• Industrial Applications of Hadoop
• HDFS
• Components of Hadoop Architecture
• Discussion
Session 1. History and Introduction to Hadoop

1.1. Introduction to Hadoop

1.1.2 History of Hadoop

1.1.3 How Does Hadoop Work?


Inventor of Hadoop
• In 2005, Doug Cutting, an employee at Yahoo, developed an open-
source implementation of GFS (the Google File System) for a search
engine project, and that open-source project was named Apache Hadoop.

9/14/2024 5
Inventor of Hadoop
The Background:
•Lucene and Nutch: Doug Cutting originally worked on a text search library,
and later on Nutch, a web crawler. Nutch was designed to index and search
the web, but it needed to handle the massive scale of the web, which
required a more robust and distributed system for data processing and
storage.

Adapting the Concepts of GFS: Doug Cutting, inspired by Google's papers on
GFS and MapReduce, realized that these concepts could be adapted to improve
Nutch's ability to process large datasets across a distributed network of
computers.

Development of Hadoop:
•Initial Implementation (2004-2005): Doug Cutting, along with his
collaborator Mike Cafarella, began developing an open-source version of the
distributed file system and MapReduce framework, initially as part of the
Nutch project. They aimed to create a scalable, reliable, and fault-tolerant
system for processing and storing large datasets.
Inventor of Hadoop
•Naming Hadoop: The project was named "Hadoop" after Doug
Cutting's son's toy elephant, reflecting the idea of something large and
capable of handling big tasks.

Keyword ‘Hadoop’ Search on Google

1.1.1 Introduction to Hadoop

• Hadoop is an open-source tool to handle, process, and
analyze Big Data.

• The Hadoop platform provides an improved programming
model, which is used to create and run distributed
systems quickly and efficiently.

• It is licensed under the Apache License, so it is also
called Apache Hadoop.
Hadoop Usage

Industrial Applications of Hadoop
Electronic Health Records (EHR): Healthcare providers
use Hadoop to manage and analyze massive volumes of
patient data, helping to personalize treatment plans and
improve patient outcomes.

Customer Analytics: Retailers use Hadoop to analyze
customer behavior, purchase history, and social media
interactions to create personalized marketing campaigns
and improve customer retention.

Customer Churn Prediction: By analyzing customer
usage patterns and service data, telecommunications
companies use Hadoop to predict and reduce customer
churn.
How Big Data Is Used In Amazon Recommendation Systems

https://youtu.be/S4RL6prqtGQ

While Developing Hadoop, Two Major Concerns for Doug Cutting's Team

How to store large files, from terabytes to petabytes in size,
across different terminals, i.e., a storage framework was
required?

How to facilitate the processing of large amounts of data
in structured and unstructured formats, i.e., a parallel-
processing framework was required?
Functional Architecture of Hadoop
• The core components of Hadoop are HDFS and MapReduce.
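The MapReduce half of this pair can be illustrated without a cluster. The sketch below is a minimal, pure-Python model of the map-shuffle-reduce flow using the canonical word-count example; it mirrors the programming model only, not Hadoop's actual Java API or its distributed execution.

```python
# A toy model of the MapReduce programming model: word count.
# Map emits (key, value) pairs, shuffle groups them by key,
# and reduce aggregates each group. No Hadoop required.
from collections import defaultdict

def map_phase(line):
    # Map: emit (word, 1) for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: sum the counts for one word.
    return (key, sum(values))

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

print(word_count(["big data needs big tools", "hadoop handles big data"]))
```

In a real Hadoop job, the map and reduce functions run on many DataNodes in parallel and the shuffle moves data across the network; the logic per record, however, is exactly this simple.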
Session 2. Understanding HDFS
Syllabus to be covered with this topic:
 2.1.1. Introduction to HDFS
 2.1.2 HDFS Architecture (Using Read and Write
Operations)
 2.1.3 Hadoop Distribution and basic Commands

 2.1.4 HDFS Command Line and Web Interface


2.1.1 High-Level Architecture of a Hadoop Multi-node Cluster

2.1.1 Introduction to HDFS
 A file in HDFS is split into large blocks, 64 MB or
128 MB by default depending on the Hadoop version, and each
block of the file is independently replicated at multiple DataNodes.

 HDFS provides fault tolerance by replicating the data on
three nodes: two on the same rack and one on a different
rack. The NameNode implements this functionality and
actively monitors the information regarding the replicas of a
block. By default, the replication factor is 3.

 HDFS has a master-slave architecture, which comprises
a NameNode and a number of DataNodes.
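Both the block size and the replication factor described above are per-cluster settings. As an illustrative sketch (values chosen to match the defaults mentioned on this slide), they are set in hdfs-site.xml via the standard `dfs.blocksize` and `dfs.replication` properties:

```xml
<!-- hdfs-site.xml: illustrative values matching the defaults above -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- 128 MB, expressed in bytes -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- each block is stored on three DataNodes -->
  </property>
</configuration>
```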

HDFS Components:
There are two major components of Hadoop HDFS- NameNode and
DataNode

i. NameNode
It is also known as the Master node.
The NameNode does not store the actual data or dataset. It stores
metadata: the number of blocks, their locations, which rack and which
DataNode the data is stored on, and other details. The metadata consists
of files and directories.

Tasks of the HDFS NameNode:

Manages the file system namespace.
Regulates clients' access to files.
Executes file system operations such as opening,
closing, and renaming files and directories.
2.1.1 HDFS (contd..)
DataNode: It is also known as the Slave node.
• The HDFS DataNode is responsible for storing the actual data in HDFS.
• The DataNode performs read and write operations as per the requests of
the clients.
2.1.2 HDFS Architecture (Read-Write Operations)
• This figure demonstrates how a client reads a file stored on a DataNode in the HDFS architecture.
2.1.2 HDFS Architecture (contd..)
• This figure demonstrates how a client reads a file stored on a DataNode in the HDFS architecture.
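The read flow in the figures above can be sketched as a toy model: the client asks the NameNode for the file's block list and replica locations (metadata only), then streams each block directly from a DataNode. This is an illustrative simulation of the protocol, not Hadoop's actual client API; all the names and data below are made up.

```python
# Toy model of the HDFS read path. The NameNode holds only metadata;
# the actual bytes live on the DataNodes.

# NameNode-side metadata: file path -> ordered block IDs,
# and block ID -> DataNodes holding a replica.
namespace = {"/logs/app.log": ["blk_1", "blk_2"]}
block_locations = {"blk_1": ["dn1", "dn2", "dn3"],
                   "blk_2": ["dn2", "dn3", "dn4"]}

# DataNode-side storage: the blocks (and bytes) each DataNode holds.
datanodes = {
    "dn1": {"blk_1": b"first block of the file, "},
    "dn2": {"blk_1": b"first block of the file, ", "blk_2": b"second block"},
    "dn3": {"blk_1": b"first block of the file, ", "blk_2": b"second block"},
    "dn4": {"blk_2": b"second block"},
}

def read_file(path):
    data = b""
    # 1. Client asks the NameNode which blocks make up the file.
    for block_id in namespace[path]:
        # 2. The NameNode returns replica locations; the client picks the
        #    first reachable DataNode (real HDFS prefers the closest one).
        for dn in block_locations[block_id]:
            if block_id in datanodes[dn]:
                # 3. Client streams the block directly from that DataNode.
                data += datanodes[dn][block_id]
                break
    return data

print(read_file("/logs/app.log"))
```

Note that the NameNode never touches the file's bytes; this is why a single NameNode can coordinate a large cluster without becoming a data-transfer bottleneck.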
2.1.6 HDFS Basic Commands
hdfs version: Displays the HDFS version.

hdfs dfs -mkdir <path>: Creates directories.

hdfs dfs -ls <path>: Displays a list of the contents of the
directory specified by <path>, showing the name, permissions,
owner, size, and modification date of each entry.

hdfs dfs -put <localSrc> <dest>: Copies a file or directory from
the local file system to the destination within the DFS.
2.1.6 HDFS Basic Commands
hdfs dfs -copyFromLocal <localSrc> <dest>: Similar to the put
command, but the source is restricted to a local file reference.

hdfs dfs -cat <filename>: Displays the contents of the filename
on the console (stdout).

hdfs dfs -mv <source> <destination>: Moves files from source
to destination.

hdfs dfs -chmod <mode> <path>: Changes file permissions.
Thanks Note
