HDFS, MapReduce, Yarn

Download as pdf or txt
Download as pdf or txt
You are on page 1of 25

CS 3006

Parallel and Distributed Computing


HDFS, MapReduce, Yarn
Hadoop
• Apache Hadoop is an open source software framework for storage
and large scale processing of data-sets on
clusters of commodity hardware.

• It consists of the following basic modules:


Hadoop Distributed File System (HDFS)
Hadoop YARN
Hadoop MapReduce
Hadoop Module
Hadoop Distributed File System
• HDFS is a distributed file system written in Java that is fault
tolerant and scalable.
• HDFS is the primary distributed storage for Hadoop applications.
• HDFS provides interfaces for applications to move themselves
closer to data.
• There are two types of machines in a HDFS cluster.
NameNode is the heart of an HDFS filesystem, it maintains and manages
the file system metadata. E.g; what blocks make up a file, and on which
datanodes those blocks are stored.
DataNode where HDFS stores the actual data, there are usually quite a few
of these.
HDFS Architecture
HDFS Features
• Failure tolerant - data is duplicated across multiple DataNodes to protect
against machine failures. The default is a replication factor of 3 (every
block is stored on three machines).

• Scalability - data transfers happen directly with the DataNodes so your


read/write capacity scales fairly well with the number of DataNodes

• Space - need more disk space? Just add more DataNodes and re-balance

• Industry standard - Other distributed applications are built on top of


HDFS (HBase, Map-Reduce)
Read Operation in HDFS
Write Operation in HDFS
HDFS Security
• Authentication to Hadoop
Simple – insecure way of using OS username to determine hadoop identity
Kerberos – authentication using kerberos ticket
✔ Set by hadoop.security.authentication=simple|kerberos
• File and Directory permissions are same like in POSIX
read (r), write (w), and execute (x) permissions
also has an owner, group and mode
enabled by default (dfs.permissions.enabled=true)
• ACLs are used for implemention permissions that differ from
natural hierarchy of users and groups
enabled by dfs.namenode.acls.enabled=true
Interfaces to HDFS
• Java API (DistributedFileSystem)

• C wrapper (libhdfs)

• HTTP protocol

• WebDAV protocol

• Shell Commands
MapReduce
• MapReduce is a programming model for efficient distributed computing
Processing unit of Hadoop, used by Google
• It works like a Unix pipeline
cat input | grep | sort | uniq -c | cat > output
Input | Map | Shuffle & Sort | Reduce | Output
• Efficiency from
Streaming through data, reducing seeks
Pipelining
• A good fit for a lot of applications
Log processing
Web index building
MapReduce (Cont.)
MapReduce - Dataflow
MapReduce - Features
• Fine grained Map and Reduce tasks
Improved load balancing
Faster recovery from failed tasks
• Automatic re-execution on failure
In a large cluster, some nodes are always slow or flaky
Framework re-executes failed tasks
• Locality optimizations
With large data, bandwidth to data is a problem
Map-Reduce + HDFS is a very effective solution
Map-Reduce queries HDFS for locations of input data
Map tasks are scheduled close to the inputs when possible

Introduction: 1-15
Word Count Example
• Mapper
Input: value: lines of text of input
Output: key: word, value: 1
• Reducer
Input: key: word, value: set of counts
Output: key: word, value: sum
• Launching program
Defines this job
Submits job to cluster
Word Count Dataflow
Yarn
• YARN is the prerequisite for Enterprise Hadoop
Providing resource management and a central platform to deliver
consistent operations, security, and data governance tools across Hadoop
clusters.
YARN Cluster Basics
• In a YARN cluster, there are two types of hosts:
The ResourceManager is the master daemon that communicates with the client, tracks
resources on the cluster, and orchestrates work by assigning tasks to NodeManagers.
A NodeManager is a worker daemon that launches and tracks processes spawned on
worker hosts.
Yarn Resource Monitoring
• YARN currently defines two resources:
v-cores
Memory

• Each NodeManager tracks


its own local resources and
communicates its resource configuration to the ResourceManager

• The ResourceManager keeps


a running total of the cluster’s available resources.
Yarn Resource Monitoring (Cont.)
Yarn Container
• Containers
a request to hold resources on the YARN cluster.
a container hold request consists of vcore and memory
Hold collection of physical resources

Container as a hold The task running as a


process inside a
container
Yarn Application and ApplicationMaster
• Yarn application
It is a YARN client program that is made up of one or more tasks.
Example: MapReduce Application

• ApplicationMaster
It helps coordinate tasks on the YARN cluster for each running application.
It is the first process run after the application starts.
Hadoop Related Subprojects
• Pig
High-level language for data analysis
• HBase
Table storage for semi-structured data
• Zookeeper
Coordinating distributed applications
• Hive
SQL-like Query language and Metastore
• Mahout
Machine learning
Thank You!

You might also like