HDFS, MapReduce, Yarn

CS 3006
Parallel and Distributed Computing

Hadoop
• Apache Hadoop is an open-source software framework for the storage
and large-scale processing of data sets on
clusters of commodity hardware.
• It consists of the following basic modules:
Hadoop Distributed File System (HDFS)
Hadoop YARN
Hadoop MapReduce
Hadoop Module
Hadoop Distributed File System
• HDFS is a distributed file system written in Java that is fault
tolerant and scalable.
• HDFS is the primary distributed storage for Hadoop applications.
• HDFS provides interfaces for applications to move themselves
closer to data.
• There are two types of machines in an HDFS cluster:
NameNode: the heart of an HDFS filesystem. It maintains and manages
the file system metadata, e.g. which blocks make up a file and on which
DataNodes those blocks are stored.
DataNode: where HDFS stores the actual data. There are usually quite a few
of these.
HDFS Architecture
HDFS Features
• Failure tolerant - data is duplicated across multiple DataNodes to protect
against machine failures. The default is a replication factor of 3 (every
block is stored on three machines).
• Scalability - data transfers happen directly with the DataNodes, so your
read/write capacity scales fairly well with the number of DataNodes
• Space - need more disk space? Just add more DataNodes and re-balance
• Industry standard - other distributed applications are built on top of
HDFS (HBase, Map-Reduce)
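The failure-tolerance point above can be illustrated with a small simulation. This is only a sketch: the round-robin placement, the function names, and the block and DataNode names are assumptions made for illustration, and real HDFS uses its own placement policy.

```python
REPLICATION_FACTOR = 3  # default replication factor stated in the notes


def place_blocks(block_ids, datanodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct DataNodes.

    Round-robin placement is an assumption for this sketch only.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [
            datanodes[(i + k) % len(datanodes)] for k in range(replication)
        ]
    return placement


def readable_blocks(placement, failed):
    """Return the blocks that still have a replica on a live DataNode."""
    return {
        block
        for block, nodes in placement.items()
        if any(node not in failed for node in nodes)
    }


# Hypothetical cluster: four blocks spread over five DataNodes.
placement = place_blocks(
    ["blk_1", "blk_2", "blk_3", "blk_4"],
    ["dn1", "dn2", "dn3", "dn4", "dn5"],
)
survivors = readable_blocks(placement, failed={"dn1", "dn2"})
print(sorted(survivors))
```

With three replicas per block, losing any two DataNodes in this toy cluster still leaves every block readable. A block is lost only when all three of its DataNodes fail.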
Read Operation in HDFS
Write Operation in HDFS
HDFS Security
• Authentication to Hadoop
Simple – insecure; uses the OS username to determine the Hadoop identity
Kerberos – authentication using a Kerberos ticket
✔ Set by hadoop.security.authentication=simple|kerberos
• File and directory permissions follow the POSIX model
read (r), write (w), and execute (x) permissions
each file and directory also has an owner, a group, and a mode
enabled by default (dfs.permissions.enabled=true)
• ACLs are used to implement permissions that differ from the
natural hierarchy of users and groups
enabled by dfs.namenode.acls.enabled=true
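The owner/group/mode model above can be sketched as a POSIX-style check. This is an illustration, not HDFS code: the Inode class, the is_allowed function, and the user and group names are hypothetical, and the sketch ignores the superuser and ACL entries.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    """Hypothetical stand-in for a file or directory entry."""
    owner: str
    group: str
    mode: int  # octal mode bits, e.g. 0o640 means rw-r-----


def is_allowed(inode, user, groups, want):
    """Check one permission ('r', 'w' or 'x') the POSIX way.

    Exactly one permission class applies: owner, else group, else other.
    """
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == inode.owner:
        perms = (inode.mode >> 6) & 7
    elif inode.group in groups:
        perms = (inode.mode >> 3) & 7
    else:
        perms = inode.mode & 7
    return bool(perms & bit)


report = Inode(owner="alice", group="analysts", mode=0o640)
print(is_allowed(report, "alice", {"analysts"}, "w"))  # owner class applies
print(is_allowed(report, "bob", {"analysts"}, "w"))    # group class applies
print(is_allowed(report, "carol", {"guests"}, "r"))    # other class applies
```

An ACL would add extra named-user or named-group entries on top of these three classes, which is what lets permissions depart from the plain owner/group hierarchy.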
Interfaces to HDFS
• Java API (DistributedFileSystem)
• C wrapper (libhdfs)
• HTTP protocol
• WebDAV protocol
• Shell Commands
MapReduce
• MapReduce is a programming model for efficient distributed computing
The processing component of Hadoop; the model was introduced by Google
• It works like a Unix pipeline
cat input | grep | sort | uniq -c | cat > output
Input | Map | Shuffle & Sort | Reduce | Output
• Efficiency from
Streaming through data, reducing seeks
Pipelining
• A good fit for a lot of applications
Log processing
Web index building
MapReduce (Cont.)
MapReduce - Dataflow
MapReduce - Features
• Fine grained Map and Reduce tasks
Improved load balancing
Faster recovery from failed tasks
• Automatic re-execution on failure
In a large cluster, some nodes are always slow or flaky
Framework re-executes failed tasks
• Locality optimizations
With large data, bandwidth to data is a problem
Map-Reduce + HDFS is a very effective solution
Map-Reduce queries HDFS for locations of input data
Map tasks are scheduled close to the inputs when possible
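The locality idea above can be sketched as a toy scheduler that prefers a node already holding the input split. Everything here is an assumption for illustration: the function name, the slot model, and the node and split names are hypothetical, and a real scheduler weighs more factors than this.

```python
def schedule_map_tasks(split_locations, free_slots):
    """Assign each input split to a node, preferring data-local nodes.

    split_locations: dict of split -> list of nodes holding that split
    free_slots:      dict of node -> number of free task slots
    Returns a dict of split -> (node, "local" or "remote").
    """
    slots = dict(free_slots)  # work on a copy; leave the caller's dict intact
    assignment = {}
    for split, holders in split_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            node, kind = local[0], "local"
        else:
            # No holder has a free slot, so the data must cross the network.
            candidates = [node for node, free in slots.items() if free > 0]
            if not candidates:
                raise RuntimeError("no free slots left in the cluster")
            node, kind = candidates[0], "remote"
        slots[node] -= 1
        assignment[split] = (node, kind)
    return assignment


# Hypothetical cluster: all three splits live on n1 and n2 only.
assignment = schedule_map_tasks(
    {"s0": ["n1", "n2"], "s1": ["n1", "n2"], "s2": ["n1", "n2"]},
    {"n1": 1, "n2": 1, "n3": 1},
)
print(assignment)
```

The first two splits run where their data lives. The third falls back to n3 and reads its input remotely, which is the "when possible" caveat in the bullet above.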
Word Count Example
• Mapper
Input: value: a line of input text
Output: key: word, value: 1
• Reducer
Input: key: word, value: set of counts
Output: key: word, value: sum
• Launching program
Defines this job
Submits job to cluster
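The mapper and reducer roles above can be mimicked in a single process. This is a sketch of the programming model only, not the Hadoop API: the function names are hypothetical, and the shuffle step that Hadoop performs between the two phases is simulated with a dictionary.

```python
from collections import defaultdict


def mapper(line):
    """Emit (word, 1) for every word in one line of input text."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key and sort the keys, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reducer(word, counts):
    """Sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Stand-in for the launching program: wires map, shuffle and reduce."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(word, counts) for word, counts in shuffle(pairs).items())


result = word_count(["the quick brown fox", "the lazy dog", "the fox"])
print(result)
```

On a cluster, many mapper and reducer instances would run in parallel on different nodes. The per-record logic stays the same as in this sketch.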
Word Count Dataflow
Yarn
• YARN is the prerequisite for Enterprise Hadoop,
providing resource management and a central platform to deliver
consistent operations, security, and data governance tools across Hadoop
clusters.
YARN Cluster Basics
• In a YARN cluster, there are two types of hosts:
The ResourceManager is the master daemon that communicates with the client, tracks
resources on the cluster, and orchestrates work by assigning tasks to NodeManagers.
A NodeManager is a worker daemon that launches and tracks processes spawned on
worker hosts.
Yarn Resource Monitoring
• YARN currently defines two resources:
vcores (virtual cores)
Memory
• Each NodeManager tracks its own local resources and
communicates its resource configuration to the ResourceManager
• The ResourceManager keeps a running total of the cluster's available resources.
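The reporting relationship above can be sketched as a toy ResourceManager that sums the latest report from each NodeManager. The class layout, the heartbeat method name, and the node names are hypothetical and chosen for illustration; this is not the YARN API.

```python
class ResourceManager:
    """Toy model: remembers the latest resource report from each NodeManager."""

    def __init__(self):
        self.reports = {}  # node id -> (free vcores, free memory in MB)

    def heartbeat(self, node_id, free_vcores, free_memory_mb):
        # Each NodeManager reports its own local resources; newest report wins.
        self.reports[node_id] = (free_vcores, free_memory_mb)

    def cluster_available(self):
        """Running total over all NodeManagers that have reported."""
        vcores = sum(v for v, _ in self.reports.values())
        memory = sum(m for _, m in self.reports.values())
        return {"vcores": vcores, "memory_mb": memory}


rm = ResourceManager()
rm.heartbeat("nm1", 8, 16384)
rm.heartbeat("nm2", 4, 8192)
rm.heartbeat("nm1", 6, 12288)  # nm1 reports again with less free capacity
totals = rm.cluster_available()
print(totals)
```

Keeping only the latest report per node means the total always reflects current capacity rather than an accumulation of past reports.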
Yarn Resource Monitoring (Cont.)
Yarn Container
• Containers
A container is a request to hold resources on the YARN cluster.
A container hold request consists of vcores and memory.
Once granted, a container holds a collection of physical resources.
[Figures: a container as a hold; the task running as a process inside a container]
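The idea of a container as a hold on vcores and memory can be sketched as follows. The ContainerRequest class, the allocate function, and the first-fit strategy are assumptions for illustration only; real YARN schedulers apply their own policies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerRequest:
    """A hold request: how many vcores and how much memory to reserve."""
    vcores: int
    memory_mb: int


def allocate(request, node_free):
    """Place the hold on the first node with enough free resources.

    node_free: dict of node -> {"vcores": int, "memory_mb": int}.
    The chosen node's entry is reduced in place. Returns the node name,
    or None when no node can satisfy the request.
    """
    for node, free in node_free.items():
        fits = (free["vcores"] >= request.vcores
                and free["memory_mb"] >= request.memory_mb)
        if fits:
            free["vcores"] -= request.vcores
            free["memory_mb"] -= request.memory_mb
            return node
    return None


# Hypothetical worker hosts and their free resources.
node_free = {
    "nm1": {"vcores": 2, "memory_mb": 2048},
    "nm2": {"vcores": 4, "memory_mb": 8192},
}
first = allocate(ContainerRequest(vcores=2, memory_mb=4096), node_free)
second = allocate(ContainerRequest(vcores=4, memory_mb=4096), node_free)
print(first, second)
```

The first request skips nm1 because of memory and lands on nm2. The second then finds no node with four free vcores, so it is not granted.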
Yarn Application and ApplicationMaster
• Yarn application
It is a YARN client program that is made up of one or more tasks.
Example: MapReduce Application
• ApplicationMaster
It helps coordinate tasks on the YARN cluster for each running application.
It is the first process run after the application starts.
Hadoop Related Subprojects
• Pig
High-level language for data analysis
• HBase
Table storage for semi-structured data
• Zookeeper
Coordinating distributed applications
• Hive
SQL-like Query language and Metastore
• Mahout
Machine learning
Thank You!
