
Big Data Analytics

Module 3: MapReduce Paradigm

Faculty Names: Ms. Varsha Sanap, Dr. Vivek Singh
Index

Lecture19 – Distributed File Systems: Physical Organization of the New Software, Compute Nodes, Large-Scale File-System Organization .......... 4

Lecture20 – MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners, Details of MapReduce Execution, Coping with Node Failures .......... 15

2
Lecture19

Distributed File System


MapReduce

4 Lecture19-Distributed File System


Single Node Architecture

[Diagram: machine learning, statistics, and “classical” data mining running on a single machine]

5 Lecture19-Distributed File System


Motivation: Google Example

 20+ billion web pages x 20 KB = 400+ TB (see the quick check below)
 1 computer reads 30-35 MB/sec from disk
 ~4 months to read the web
 ~1,000 hard drives to store the web
 Takes even more to do something useful with the data!
 Today, a standard architecture for such problems is emerging:
 Cluster of commodity Linux nodes
 Commodity network (Ethernet) to connect them
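A quick back-of-envelope check of these numbers (the sizes and the read rate are the slide's estimates, not measurements):

    # Rough check of the slide's arithmetic (estimates only).
    pages = 20e9                       # 20+ billion web pages
    bytes_per_page = 20e3              # 20 KB each
    total_bytes = pages * bytes_per_page
    print(total_bytes / 1e12)          # 400.0 -> about 400 TB

    read_rate = 35e6                   # one disk reads ~30-35 MB/sec
    days_to_read = total_bytes / read_rate / 86400
    print(round(days_to_read))         # ~132 days, i.e. roughly 4 months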

6 Lecture19-Distributed File System


Cluster Architecture

In 2011 it was guesstimated that Google had 1M machines, http://bit.ly/Shh0RO

7 Lecture19-Distributed File System


Cluster Architecture

8 Lecture19-Distributed File System


Large Scale Computing

 Large-scale computing for data mining problems on commodity hardware


 Challenges:
 How do you distribute computation?
 How can we make it easy to write distributed programs?
 Machines fail:
 One server may stay up 3 years (~1,000 days)
 If you have 1,000 servers, expect to lose one per day
 People estimated Google had ~1M machines in 2011
 So ~1,000 machines fail every day! (see the quick check below)
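A quick check of the failure arithmetic, using the slide's figures:

    # If one server lives ~1,000 days, expected failures per day = servers / 1,000.
    mean_lifetime_days = 1_000
    print(1_000 / mean_lifetime_days)          # 1.0    -> ~1 failure/day with 1,000 servers
    print(1_000_000 / mean_lifetime_days)      # 1000.0 -> ~1,000 failures/day with ~1M machines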

9 Lecture19-Distributed File System


Idea and Solution

 Issue: Copying data over a network takes time


 Idea:
 Bring computation close to the data
 Store files multiple times for reliability
 Map-reduce addresses these problems
 Google’s computational/data manipulation model
 Elegant way to work with big data
 Storage Infrastructure – File system
 Google: GFS. Hadoop: HDFS
 Programming model
 Map-Reduce

10 Lecture19-Distributed File System


Storage Architecture

 Problem:
 If nodes fail, how to store data persistently?
 Answer:
 Distributed File System:
 Provides global file namespace
 Google GFS; Hadoop HDFS;
 Typical usage pattern
 Huge files (100s of GB to TB)
 Data is rarely updated in place
 Reads and appends are common

11 Lecture19-Distributed File System


Distributed File System

 Chunk servers
 File is split into contiguous chunks
 Typically each chunk is 16-64MB
 Each chunk replicated (usually 2x or 3x)
 Try to keep replicas in different racks
 Master node
 a.k.a. Name Node in Hadoop’s HDFS
 Stores metadata about where files are stored
 Might be replicated
 Client library for file access
 Talks to the master to find chunk servers
 Connects directly to chunk servers to access data (see the sketch below)
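The read path described above can be sketched as follows. This is a minimal, in-memory illustration; the class and method names are made up for the example and are not the GFS or HDFS API.

    class Master:
        """Keeps only metadata: which chunk servers hold each chunk."""
        def __init__(self, chunk_locations):
            # (filename, chunk_index) -> list of chunk-server ids (replicas)
            self.chunk_locations = chunk_locations

        def locate(self, filename, chunk_index):
            return self.chunk_locations[(filename, chunk_index)]

    class ChunkServer:
        """Stores the actual chunk bytes (and can also run compute tasks)."""
        def __init__(self, chunks):
            self.chunks = chunks   # (filename, chunk_index) -> bytes

    def read_chunk(master, servers, filename, chunk_index):
        replicas = master.locate(filename, chunk_index)               # 1) ask the master
        return servers[replicas[0]].chunks[(filename, chunk_index)]  # 2) read directly from a replica

    # Example: one chunk of "web.log", replicated on cs1 and cs2.
    servers = {
        "cs1": ChunkServer({("web.log", 0): b"GET /index.html ..."}),
        "cs2": ChunkServer({("web.log", 0): b"GET /index.html ..."}),
    }
    master = Master({("web.log", 0): ["cs1", "cs2"]})
    print(read_chunk(master, servers, "web.log", 0))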

12 Lecture19-Distributed File System


Distributed File System

 Reliable distributed file system


 Data kept in “chunks” spread across machines
 Each chunk replicated on different machines
 Seamless recovery from disk or machine failure

[Diagram: replicated chunks spread across Chunk server 1, Chunk server 2, Chunk server 3, … Chunk server N]

 Bring Computation Directly to Data


 Chunk Servers also serve as Compute Servers

13 Lecture19-Distributed File System


Lecture20

MapReduce
Programming Model: MapReduce

15 Lecture 20-MapReduce
Task: Word Count

Case 1: File too large for memory, but all <word, count> pairs fit in memory
Case 2: Count occurrences of words:
 words(doc.txt) | sort | uniq -c
 where words takes a file and outputs the words in it, one per line (a possible words helper is sketched below)
 Case 2 captures the essence of MapReduce
 The great thing is that it is naturally parallelizable
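The pipeline assumes a small words program; one possible implementation (an assumption for illustration, not given on the slide):

    # words.py -- print the words of a file, one per line, so that
    #   python words.py doc.txt | sort | uniq -c
    # counts the occurrences of each word.
    import sys

    def words(path):
        with open(path) as f:
            for line in f:
                for word in line.split():
                    print(word)

    if __name__ == "__main__":
        words(sys.argv[1])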

16 Lecture 20-MapReduce
MapReduce Overview

 Sequentially read a lot of data


 Map:
 Extract something you care about
 Group by key: Sort and Shuffle
 Reduce:
 Aggregate, summarize, filter or transform
 Write the result

Outline stays the same; Map and Reduce change to fit the problem.

17 Lecture 20-MapReduce
MapReduce: The Map Step

[Diagram: map turns input key-value pairs into intermediate key-value pairs]

18 Lecture 20-MapReduce
MapReduce: The Reduce Step

[Diagram: intermediate key-value pairs are grouped by key into key-value groups; reduce turns each group into output key-value pairs]

19 Lecture 20-MapReduce
More Specifically

 Input: a set of key-value pairs


 Programmer specifies two methods:
 Map(k, v) → <k’, v’>*
 Takes a key-value pair and outputs a set of key-value pairs
 E.g., key is the filename, value is a single line in the file
 There is one Map call for every (k, v) pair
 Reduce(k’, <v’>*) → <k’, v’’>*
 All values v’ with the same key k’ are reduced together and processed in v’ order
 There is one Reduce function call per unique key k’ (the two signatures are made concrete in the sketch below)
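The two signatures can be written out as Python type aliases (a sketch only; the names MapFn and ReduceFn are illustrative, not an API from the slides):

    from typing import Callable, Iterable, Tuple, TypeVar

    K, V = TypeVar("K"), TypeVar("V")        # input key/value types
    K2, V2 = TypeVar("K2"), TypeVar("V2")    # intermediate key/value types
    V3 = TypeVar("V3")                       # output value type

    # Map(k, v) -> <k', v'>* : one call per input pair, any number of output pairs
    MapFn = Callable[[K, V], Iterable[Tuple[K2, V2]]]

    # Reduce(k', <v'>*) -> <k', v''>* : one call per distinct intermediate key
    ReduceFn = Callable[[K2, Iterable[V2]], Iterable[Tuple[K2, V3]]]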

20 Lecture 20-MapReduce
MapReduce: Word Counting

21 Lecture 20-MapReduce
Word Count Using MapReduce

map(key, value):
    // key: document name; value: text of the document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
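The pseudocode above can be run end to end on a single machine; a small Python simulation (function names are illustrative, and the group-by-key step is just an in-memory dictionary standing in for the cluster's shuffle and sort):

    from collections import defaultdict

    def map_fn(doc_name, text):
        for word in text.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        return (word, sum(counts))

    def word_count(documents):
        # Map phase: one map call per (document name, text) pair
        intermediate = []
        for name, text in documents.items():
            intermediate.extend(map_fn(name, text))
        # Group by key: what the shuffle/sort does in a real cluster
        groups = defaultdict(list)
        for word, count in intermediate:
            groups[word].append(count)
        # Reduce phase: one reduce call per distinct word
        return dict(reduce_fn(w, c) for w, c in groups.items())

    print(word_count({"doc.txt": "the quick fox jumps over the lazy dog the end"}))
    # {'the': 3, 'quick': 1, 'fox': 1, 'jumps': 1, 'over': 1, 'lazy': 1, 'dog': 1, 'end': 1}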

22 Lecture 20-MapReduce
MapReduce: Environment

Map-Reduce environment takes care of:


 Partitioning the input data
 Scheduling the program’s execution across a set of machines
 Performing the group by key step
 Handling machine failures
 Managing required inter-machine communication

23 Lecture 20-MapReduce
MapReduce: A Diagram

[Diagram]
Map: reads the input and produces a set of key-value pairs.
Group by key: collects all pairs with the same key (hash merge, shuffle, sort, partition).
Reduce: collects all values belonging to each key and outputs the result.

24 Lecture 20-MapReduce
MapReduce: In Parallel

All phases are distributed with many tasks doing the work

25 Lecture 20-MapReduce
MapReduce

26 Lecture 20-MapReduce
Data Flow

27 Lecture 20-MapReduce
Coordinator: Master

28 Lecture 20-MapReduce
Dealing with Failures

 Map worker failure


 Map tasks completed or in-progress at worker are reset to idle
 Reduce workers are notified when task is rescheduled on another worker
 Reduce worker failure
 Only in-progress tasks are reset to idle
 Reduce task is restarted
 Master failure
 MapReduce task is aborted and client is notified
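A minimal bookkeeping sketch of these rules (illustrative data structures, not Hadoop's or Google's actual implementation):

    IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"

    def handle_worker_failure(failed_worker, map_tasks, reduce_tasks):
        """Each task is a dict: {"state": ..., "worker": ...}."""
        for task in map_tasks.values():
            # Completed map output lives on the failed worker's local disk,
            # so completed *and* in-progress map tasks go back to idle.
            if task["worker"] == failed_worker and task["state"] in (IN_PROGRESS, COMPLETED):
                task["state"], task["worker"] = IDLE, None
        for task in reduce_tasks.values():
            # Completed reduce output is already in the distributed file system,
            # so only in-progress reduce tasks are reset and restarted.
            if task["worker"] == failed_worker and task["state"] == IN_PROGRESS:
                task["state"], task["worker"] = IDLE, None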

29 Lecture 20-MapReduce
How many Map and Reduce jobs?

 M map tasks, R reduce tasks


 Rule of thumb:
 Make M much larger than the number of nodes in the cluster
 One DFS chunk per map task is common (see the example below)
 Improves dynamic load balancing and speeds up recovery from worker failures
 Usually R is smaller than M
 Because output is spread across R files
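A worked example of the rule of thumb (the 1 TB input is a hypothetical figure; 64 MB is the upper end of the chunk sizes mentioned in Lecture 19):

    input_size = 1 * 1024**4       # 1 TB of input (hypothetical)
    chunk_size = 64 * 1024**2      # one map task per 64 MB DFS chunk
    M = input_size // chunk_size
    print(M)                       # 16384 map tasks -- far more than the nodes in a typical cluster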

30 Lecture 20-MapReduce
Task Granularity & Pipelining

 Fine granularity tasks: map tasks >> machines


 Minimizes time for fault recovery
 Can do pipeline shuffling with map execution
 Better dynamic load balancing

31 Lecture 20-MapReduce
Refinements: Backup Tasks

 Problem
 Slow workers significantly lengthen the job completion time:
 Other jobs on the machine
 Bad disks
 Weird things
 Solution
 Near end of phase, spawn backup copies of tasks
 Whichever one finishes first “wins”
 Effect
 Dramatically shortens job completion time
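A toy illustration of "whichever one finishes first wins", using Python threads (a sketch only: a real master would launch the backup copy on a different machine, and the task must be safe to execute twice):

    import concurrent.futures as cf

    def run_with_backup(task, *args, copies=2):
        # Launch duplicate copies of a (possibly straggling) task and
        # return the result of whichever copy finishes first.
        pool = cf.ThreadPoolExecutor(max_workers=copies)
        futures = [pool.submit(task, *args) for _ in range(copies)]
        done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        pool.shutdown(wait=False, cancel_futures=True)   # don't wait for stragglers (Python 3.9+)
        return next(iter(done)).result()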

32 Lecture 20-MapReduce
Refinement: Combiners

 Often a Map task will produce many pairs of the form (k, v1), (k, v2), … for the same key k
 E.g., popular words in the word count example
 Can save network time by pre-aggregating values in the mapper:
 combine(k, list(v1)) → v2
 The combiner is usually the same as the reduce function (see the sketch below)
 Works only if the reduce function is commutative and associative
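For word count, the combiner is just the same summation run inside the map task; a sketch (the function name is illustrative):

    from collections import Counter

    def map_with_combiner(doc_name, text):
        # Without a combiner: emit (word, 1) once per occurrence.
        # With a combiner: pre-sum locally and emit one (word, partial_count)
        # pair per distinct word -- that is all that crosses the network.
        # Safe only because addition is commutative and associative.
        return list(Counter(text.split()).items())

    print(map_with_combiner("doc.txt", "the quick fox the lazy fox"))
    # [('the', 2), ('quick', 1), ('fox', 2), ('lazy', 1)]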

33 Lecture 20-MapReduce
Refinement: Combiners

 Back to our word counting example:


 Combiner combines the values of all keys of a single mapper (single machine):

 Much less data needs to be copied and shuffled!

34 Lecture 20-MapReduce
Refinement: Partition Function
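The diagram on this slide is not reproduced here. As background (from the original MapReduce paper, not stated on the slide), the default partition function sends key k' to reduce task hash(k') mod R, and users can supply their own, e.g. so that all URLs from one host land in the same output file. A sketch:

    import hashlib

    def default_partition(key, R):
        # Stable hash of the key, modulo the number of reduce tasks R.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return h % R

    def host_partition(url_key, R):
        # Custom partitioner: keys for the same host go to the same reduce task.
        host = url_key.split("/")[2] if "://" in url_key else url_key
        return default_partition(host, R)

    print(host_partition("http://example.com/a.html", 4),
          host_partition("http://example.com/b.html", 4))   # same partition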

35 Lecture 20-MapReduce
Thank You
