Lecture 4 Introduction to Hadoop


Lecture 4

Processing Big Data


With Hadoop Map
Reduce Technology
By
Dr. Aditya Bhardwaj

[email protected]

Big Data Analytics and Business Intelligence


Learning Objectives

• Evolution of Hadoop
• Industrial Applications of Hadoop
• HDFS
• Components of Hadoop Architecture
• Discussion
Session 1. History and Introduction to Hadoop

1.1. Introduction to Hadoop

1.1.2 History of Hadoop

1.1.3 How Does Hadoop Work?


Inventor of Hadoop
• In 2005, Doug Cutting, an employee at Yahoo, developed an open-
source implementation of GFS (the Google File System) for a search
engine project, and that open-source project was named Apache Hadoop.

9/14/2024 5
Inventor of Hadoop
The Background:
•Lucene and Nutch: Doug Cutting originally worked on a text search library,
and later on Nutch, a web crawler. Nutch was designed to index and search
the web, but it needed to handle the massive scale of the web, which
required a more robust and distributed system for data processing and
storage.

Adapting the Concepts of GFS: Doug Cutting, inspired by Google's papers on
GFS and MapReduce, realized that these concepts could be adapted to improve
Nutch's ability to process large datasets across a distributed network of
computers.

Development of Hadoop:
•Initial Implementation (2004-2005): Doug Cutting, along with his
collaborator Mike Cafarella, began developing an open-source version of the
distributed file system and MapReduce framework, initially as part of the
Nutch project. They aimed to create a scalable, reliable, and fault-tolerant
system for processing and storing large datasets.
Inventor of Hadoop
•Naming Hadoop: The project was named "Hadoop" after Doug
Cutting's son's toy elephant, reflecting the idea of something large and
capable of handling big tasks.

Keyword ‘Hadoop’ Search on Google

1.1.1 Introduction to Hadoop

• Hadoop is an open-source tool to handle, process, and
analyze Big Data.

• The Hadoop platform provides an improved programming
model, which is used to create and run distributed
systems quickly and efficiently.

• It is licensed under the Apache License, so it is also
called Apache Hadoop.
Hadoop Usage

Industrial Applications of Hadoop
Electronic Health Records (EHR): Healthcare providers
use Hadoop to manage and analyze massive volumes of
patient data, helping to personalize treatment plans and
improve patient outcomes.

Customer Analytics: Retailers use Hadoop to analyze
customer behavior, purchase history, and social media
interactions to create personalized marketing campaigns
and improve customer retention.

Customer Churn Prediction: By analyzing customer
usage patterns and service data, telecommunications
companies use Hadoop to predict and reduce customer
churn.
How Big Data Is Used In Amazon Recommendation Systems

https://youtu.be/S4RL6prqtGQ

While Developing Hadoop, Two Major Concerns for Doug Cutting's Team

How to store large files, from terabytes to petabytes in size,
across different terminals, i.e., a storage framework was
required?

How to facilitate the processing of large amounts of data
in structured and unstructured formats, i.e., a parallel-
processing framework was required?
Functional Architecture of Hadoop
• The core components of Hadoop are HDFS and MapReduce.
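The MapReduce half of this pair can be illustrated without a cluster. The sketch below is a minimal, pure-Python model of the map-shuffle-reduce flow using the canonical word-count example; it mirrors the programming model only, not Hadoop's actual Java API or its distributed execution.

```python
# A toy model of the MapReduce programming model: word count.
# Map emits (key, value) pairs, shuffle groups them by key,
# and reduce aggregates each group. No Hadoop required.
from collections import defaultdict

def map_phase(line):
    # Map: emit (word, 1) for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: sum the counts for one word.
    return (key, sum(values))

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

print(word_count(["big data needs big tools", "hadoop handles big data"]))
```

In a real Hadoop job, the map and reduce functions run on many DataNodes in parallel and the shuffle moves data across the network; the logic per record, however, is exactly this simple.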
Session 2. Understanding HDFS
Syllabus to be covered with this topic:
 2.1.1. Introduction to HDFS
 2.1.2 HDFS Architecture (Using Read and Write
Operations)
 2.1.3 Hadoop Distribution and basic Commands

 2.1.4 HDFS Command Line and Web Interface


2.1.1 High-Level Architecture of a Hadoop Multi-node Cluster

2.1.1 Introduction to HDFS
 A file in HDFS is split into large blocks, 64 MB or
128 MB by default depending on the Hadoop version, and each
block of the file is independently replicated at multiple DataNodes.

 HDFS provides fault tolerance by replicating the data on
three nodes: two on the same rack and one on a different
rack. The NameNode implements this functionality and
actively monitors the information regarding the replicas of a
block. By default, the replication factor is 3.

 HDFS has a master-slave architecture, which comprises
a NameNode and a number of DataNodes.
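Both the block size and the replication factor described above are per-cluster settings. As an illustrative sketch (values chosen to match the defaults mentioned on this slide), they are set in hdfs-site.xml via the standard `dfs.blocksize` and `dfs.replication` properties:

```xml
<!-- hdfs-site.xml: illustrative values matching the defaults above -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- 128 MB, expressed in bytes -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- each block is stored on three DataNodes -->
  </property>
</configuration>
```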

HDFS Components:
There are two major components of Hadoop HDFS- NameNode and
DataNode

i. NameNode
It is also known as the Master node.
The NameNode does not store the actual data or dataset. It stores
metadata: the number of blocks, their locations, which rack and which
DataNode the data is stored on, and other details. The metadata consists
of files and directories.

Tasks of the HDFS NameNode:

Manages the file system namespace.
Regulates clients' access to files.
Executes file system operations such as opening,
closing, and renaming files and directories.
2.1.1 HDFS (contd..)
DataNode: It is also known as the Slave node.
• The HDFS DataNode is responsible for storing the actual data in HDFS.
• The DataNode performs read and write operations as per the requests of
the clients.
2.1.2 HDFS Architecture (Read-Write Operations)
• This figure demonstrates how a client reads a file stored on a DataNode in the HDFS architecture.
2.1.2 HDFS Architecture (contd..)
• This figure demonstrates how a client reads a file stored on a DataNode in the HDFS architecture.
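The read flow in the figures above can be sketched as a toy model: the client asks the NameNode for the file's block list and replica locations (metadata only), then streams each block directly from a DataNode. This is an illustrative simulation of the protocol, not Hadoop's actual client API; all the names and data below are made up.

```python
# Toy model of the HDFS read path. The NameNode holds only metadata;
# the actual bytes live on the DataNodes.

# NameNode-side metadata: file path -> ordered block IDs,
# and block ID -> DataNodes holding a replica.
namespace = {"/logs/app.log": ["blk_1", "blk_2"]}
block_locations = {"blk_1": ["dn1", "dn2", "dn3"],
                   "blk_2": ["dn2", "dn3", "dn4"]}

# DataNode-side storage: the blocks (and bytes) each DataNode holds.
datanodes = {
    "dn1": {"blk_1": b"first block of the file, "},
    "dn2": {"blk_1": b"first block of the file, ", "blk_2": b"second block"},
    "dn3": {"blk_1": b"first block of the file, ", "blk_2": b"second block"},
    "dn4": {"blk_2": b"second block"},
}

def read_file(path):
    data = b""
    # 1. Client asks the NameNode which blocks make up the file.
    for block_id in namespace[path]:
        # 2. The NameNode returns replica locations; the client picks the
        #    first reachable DataNode (real HDFS prefers the closest one).
        for dn in block_locations[block_id]:
            if block_id in datanodes[dn]:
                # 3. Client streams the block directly from that DataNode.
                data += datanodes[dn][block_id]
                break
    return data

print(read_file("/logs/app.log"))
```

Note that the NameNode never touches the file's bytes; this is why a single NameNode can coordinate a large cluster without becoming a data-transfer bottleneck.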
2.1.6 HDFS Basic Commands
hdfs version: Displays the HDFS version.

hdfs dfs -mkdir <path>: Creates directories.

hdfs dfs -ls <path>: Displays a list of the contents of the
directory specified by <path>, showing the name, permissions,
owner, size, and modification date of each entry.

hdfs dfs -put <localSrc> <dest>: Copies a file or directory from
the local file system to the destination within the DFS.
2.1.6 HDFS Basic Commands
hdfs dfs -copyFromLocal <localSrc> <dest>: Similar to the put
command, but the source is restricted to a local file reference.

hdfs dfs -cat <filename>: Displays the contents of the filename
on the console (stdout).

hdfs dfs -mv <source> <destination>: Moves files from source
to destination.

hdfs dfs -chmod <mode> <path>: Changes file permissions.
Thanks Note
