HDFS, MapReduce, Yarn
HDFS, MapReduce, Yarn
HDFS, MapReduce, Yarn
• Space - need more disk space? Just add more DataNodes and re-balance
• C wrapper (libhdfs)
• HTTP protocol
• WebDAV protocol
• Shell Commands
MapReduce
• MapReduce is a programming model for efficient distributed computing
Processing unit of Hadoop, used by Google
• It works like a Unix pipeline
cat input | grep | sort | uniq -c | cat > output
Input | Map | Shuffle & Sort | Reduce | Output
• Efficiency from
Streaming through data, reducing seeks
Pipelining
• A good fit for a lot of applications
Log processing
Web index building
MapReduce (Cont.)
MapReduce - Dataflow
MapReduce - Features
• Fine grained Map and Reduce tasks
Improved load balancing
Faster recovery from failed tasks
• Automatic re-execution on failure
In a large cluster, some nodes are always slow or flaky
Framework re-executes failed tasks
• Locality optimizations
With large data, bandwidth to data is a problem
Map-Reduce + HDFS is a very effective solution
Map-Reduce queries HDFS for locations of input data
Map tasks are scheduled close to the inputs when possible
Introduction: 1-15
Word Count Example
• Mapper
Input: value: lines of text of input
Output: key: word, value: 1
• Reducer
Input: key: word, value: set of counts
Output: key: word, value: sum
• Launching program
Defines this job
Submits job to cluster
Word Count Dataflow
Yarn
• YARN is the prerequisite for Enterprise Hadoop
Providing resource management and a central platform to deliver
consistent operations, security, and data governance tools across Hadoop
clusters.
YARN Cluster Basics
• In a YARN cluster, there are two types of hosts:
The ResourceManager is the master daemon that communicates with the client, tracks
resources on the cluster, and orchestrates work by assigning tasks to NodeManagers.
A NodeManager is a worker daemon that launches and tracks processes spawned on
worker hosts.
Yarn Resource Monitoring
• YARN currently defines two resources:
v-cores
Memory
• ApplicationMaster
It helps coordinate tasks on the YARN cluster for each running application.
It is the first process run after the application starts.
Hadoop Related Subprojects
• Pig
High-level language for data analysis
• HBase
Table storage for semi-structured data
• Zookeeper
Coordinating distributed applications
• Hive
SQL-like Query language and Metastore
• Mahout
Machine learning
Thank You!