Big Data Analytics, Module 3: MapReduce Paradigm
Faculty: Ms. Varsha Sanap, Dr. Vivek Singh
Lecture 19
Problem:
If nodes fail, how do we store data persistently?
Answer:
Distributed File System:
Provides a global file namespace
Examples: Google GFS, Hadoop HDFS
Typical usage pattern
Huge files (100s of GB to TB)
Data is rarely updated in place
Reads and appends are common
Chunk servers
File is split into contiguous chunks
Typically each chunk is 16-64MB
Each chunk replicated (usually 2x or 3x)
Try to keep replicas in different racks
Master node
a.k.a. Name Node in Hadoop’s HDFS
Stores metadata about where files are stored
Might be replicated
Client library for file access
Talks to master to find chunk servers
Connects directly to chunk servers to access data
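The client-side flow above (ask the master for metadata, then read chunks directly from chunk servers) can be sketched as a toy in-memory model. This is not the real GFS/HDFS API; all class and method names here are hypothetical:

```python
# Toy in-memory model of a DFS master and chunk servers (illustrative only).
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, within the typical 16-64 MB range

class Master:
    """Stores only metadata: file name -> list of (chunk_id, replica servers)."""
    def __init__(self):
        self.metadata = {}

    def locate(self, filename):
        return self.metadata[filename]

class ChunkServer:
    def __init__(self):
        self.chunks = {}  # chunk_id -> bytes

    def read(self, chunk_id):
        return self.chunks[chunk_id]

def dfs_read(master, filename):
    """Client library: talk to the master to find chunk locations,
    then connect directly to chunk servers to fetch the data."""
    data = b""
    for chunk_id, replicas in master.locate(filename):
        data += replicas[0].read(chunk_id)  # any replica would do
    return data
```

Note that the master never touches file data, only metadata; this is why a single (possibly replicated) master scales to huge files.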
Programming Model: MapReduce
Lecture 20: MapReduce
Task: Word Count
Case 1: File too large for memory, but all <word, count> pairs fit in memory
Case 2: Count occurrences of words:
words(doc.txt) | sort | uniq -c
where words takes a file and outputs the words in it, one per line
Case 2 captures the essence of MapReduce
Great thing is that it is naturally parallelizable
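The `words | sort | uniq -c` pipeline above can be mimicked in a few lines of Python (a sketch; `words` is assumed to simply split on whitespace):

```python
from collections import Counter

def words(text):
    # like the hypothetical `words` command: one word per occurrence
    return text.split()

def word_count(text):
    # equivalent of `words(doc.txt) | sort | uniq -c`:
    # Counter tallies how many times each word appears
    return Counter(words(text))
```

The sort step in the pipeline exists only to bring equal words together for `uniq -c`; a hash-based counter achieves the same grouping directly.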
MapReduce Overview
MapReduce: The Map Step
The map function takes input key-value pairs and produces a set of intermediate key-value pairs.
MapReduce: The Reduce Step
The intermediate key-value pairs are grouped by key into key-value groups; reduce is applied to each group to produce the output key-value pairs.
More Specifically
MapReduce: Word Counting
Word Count Using MapReduce
map(key, value):
// key: document name; value: text of the document
for each word w in value:
emit(w, 1)
reduce(key, values):
// key: a word; values: an iterator over counts
result = 0
for each count v in values:
result += v
emit(key, result)
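The pseudocode above can be turned into a small runnable sketch in plain Python, with a sort-and-group step standing in for the framework's shuffle between map and reduce:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(doc_name, text):
    # key: document name; value: text of the document
    for w in text.split():
        yield (w, 1)

def reduce_fn(word, counts):
    # key: a word; values: an iterator over counts
    yield (word, sum(counts))

def mapreduce(docs):
    # map phase: emit all intermediate (word, 1) pairs
    pairs = [p for name, text in docs for p in map_fn(name, text)]
    # shuffle: sort so equal keys become adjacent, then group by key
    pairs.sort(key=itemgetter(0))
    out = {}
    for word, group in groupby(pairs, key=itemgetter(0)):
        for k, v in reduce_fn(word, (c for _, c in group)):
            out[k] = v
    return out
```

In a real framework each map and reduce call runs as a distributed task; only the shuffle machinery differs from this single-process sketch.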
MapReduce: Environment
MapReduce: A Diagram
Map: reads the input and produces a set of key-value pairs.
Group by key: collects all pairs with the same key (hash merge, shuffle, sort, partition).
MapReduce: In Parallel
All phases are distributed with many tasks doing the work
MapReduce
Data Flow
Coordinator: Master
Dealing with Failures
How Many MapReduce Jobs?
Task Granularity & Pipelining
Refinements: Backup Tasks
Problem
Slow workers significantly lengthen the job completion time:
Other jobs on the machine
Bad disks
Weird things
Solution
Near end of phase, spawn backup copies of tasks
Whichever one finishes first “wins”
Effect
Dramatically shortens job completion time
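The backup-task idea above can be sketched as a toy using threads (a real framework launches the backup copy on a different machine; the worker IDs and delays here are made up):

```python
import concurrent.futures as cf
import time

def attempt(worker_id, delay):
    # Simulated task attempt: a straggler worker has a long delay.
    time.sleep(delay)
    return worker_id

def run_with_backup(delays):
    """Near the end of a phase, launch a backup copy of the remaining
    task and take whichever attempt finishes first ("wins")."""
    with cf.ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(attempt, i, d) for i, d in enumerate(delays)]
        done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        return done.pop().result()
```

The job's completion time becomes the minimum over the attempts rather than being held hostage by one slow worker.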
Refinement: Combiners
Often a Map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k
E.g., popular words in the word count example
Can save network time by pre-aggregating values in the mapper:
combine(k, list(v1)) → v2
Combiner is usually same as the reduce function
Works only if reduce function is commutative and associative
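A sketch of the idea in Python: the mapper pre-aggregates its own (word, 1) pairs with the same logic as the reducer before anything crosses the network (the function name is illustrative, not a real framework API):

```python
from collections import defaultdict

def map_with_combiner(text):
    """Emit (word, partial_count) instead of one (word, 1) per occurrence.
    Reusing the reduce logic (summing) as the combiner is safe here
    because addition is commutative and associative."""
    partial = defaultdict(int)
    for w in text.split():
        partial[w] += 1           # combine locally in the mapper
    return list(partial.items())  # far fewer pairs to shuffle
```

For a popular word appearing a million times in one mapper's input, this sends a single (word, 1000000) pair instead of a million (word, 1) pairs.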
Refinement: Partition Function
Thank You