
Big Data Analytics

Module 3: MapReduce Paradigm

Faculty Names: Ms. Varsha Sanap, Dr. Vivek Singh
Index

Lecture19 – Distributed File Systems: Physical Organization of the New Software, Compute Nodes, Large-Scale File-System Organization .......... 4

Lecture20 – MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners, Details of MapReduce Execution, Coping with Node Failures .......... 15

2
Lecture19

Distributed File System


MapReduce

4 Lecture19-Distributed File System


Single Node Architecture

[Diagram: machine learning, statistics, and “classical” data mining running on a single machine]

5 Lecture19-Distributed File System


Motivation: Google Example

 20+ billion web pages x 20 KB = 400+ TB (see the quick check below)
 1 computer reads 30-35 MB/sec from disk
 ~4 months to read the web
 ~1,000 hard drives to store the web
 Takes even more to do something useful with the data!
 Today, a standard architecture for such problems is emerging:
 Cluster of commodity Linux nodes
 Commodity network (Ethernet) to connect them
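A quick back-of-envelope check of these numbers (the sizes and the read rate are the slide's estimates, not measurements):

    # Rough check of the slide's arithmetic (estimates only).
    pages = 20e9                       # 20+ billion web pages
    bytes_per_page = 20e3              # 20 KB each
    total_bytes = pages * bytes_per_page
    print(total_bytes / 1e12)          # 400.0 -> about 400 TB

    read_rate = 35e6                   # one disk reads ~30-35 MB/sec
    days_to_read = total_bytes / read_rate / 86400
    print(round(days_to_read))         # ~132 days, i.e. roughly 4 months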

6 Lecture19-Distributed File System


Cluster Architecture

In 2011 it was guesstimated that Google had 1M machines, http://bit.ly/Shh0RO

7 Lecture19-Distributed File System


Cluster Architecture

8 Lecture19-Distributed File System


Large Scale Computing

 Large-scale computing for data mining problems on commodity hardware


 Challenges:
 How do you distribute computation?
 How can we make it easy to write distributed programs?
 Machines fail:
 One server may stay up 3 years (~1,000 days)
 If you have 1,000 servers, expect to lose one per day
 People estimated Google had ~1M machines in 2011
 So ~1,000 machines fail every day! (see the quick check below)
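A quick check of the failure arithmetic, using the slide's figures:

    # If one server lives ~1,000 days, expected failures per day = servers / 1,000.
    mean_lifetime_days = 1_000
    print(1_000 / mean_lifetime_days)          # 1.0    -> ~1 failure/day with 1,000 servers
    print(1_000_000 / mean_lifetime_days)      # 1000.0 -> ~1,000 failures/day with ~1M machines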

9 Lecture19-Distributed File System


Idea and Solution

 Issue: Copying data over a network takes time


 Idea:
 Bring computation close to the data
 Store files multiple times for reliability
 Map-reduce addresses these problems
 Google’s computational/data manipulation model
 Elegant way to work with big data
 Storage Infrastructure – File system
 Google: GFS. Hadoop: HDFS
 Programming model
 Map-Reduce

10 Lecture19-Distributed File System


Storage Architecture

 Problem:
 If nodes fail, how to store data persistently?
 Answer:
 Distributed File System:
 Provides global file namespace
 Google GFS; Hadoop HDFS;
 Typical usage pattern
 Huge files (100s of GB to TB)
 Data is rarely updated in place
 Reads and appends are common

11 Lecture19-Distributed File System


Distributed File System

 Chunk servers
 File is split into contiguous chunks
 Typically each chunk is 16-64MB
 Each chunk replicated (usually 2x or 3x)
 Try to keep replicas in different racks
 Master node
 a.k.a. Name Node in Hadoop’s HDFS
 Stores metadata about where files are stored
 Might be replicated
 Client library for file access
 Talks to the master to find chunk servers
 Connects directly to chunk servers to access data (see the sketch below)
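The read path described above can be sketched as follows. This is a minimal, in-memory illustration; the class and method names are made up for the example and are not the GFS or HDFS API.

    class Master:
        """Keeps only metadata: which chunk servers hold each chunk."""
        def __init__(self, chunk_locations):
            # (filename, chunk_index) -> list of chunk-server ids (replicas)
            self.chunk_locations = chunk_locations

        def locate(self, filename, chunk_index):
            return self.chunk_locations[(filename, chunk_index)]

    class ChunkServer:
        """Stores the actual chunk bytes (and can also run compute tasks)."""
        def __init__(self, chunks):
            self.chunks = chunks   # (filename, chunk_index) -> bytes

    def read_chunk(master, servers, filename, chunk_index):
        replicas = master.locate(filename, chunk_index)               # 1) ask the master
        return servers[replicas[0]].chunks[(filename, chunk_index)]  # 2) read directly from a replica

    # Example: one chunk of "web.log", replicated on cs1 and cs2.
    servers = {
        "cs1": ChunkServer({("web.log", 0): b"GET /index.html ..."}),
        "cs2": ChunkServer({("web.log", 0): b"GET /index.html ..."}),
    }
    master = Master({("web.log", 0): ["cs1", "cs2"]})
    print(read_chunk(master, servers, "web.log", 0))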

12 Lecture19-Distributed File System


Distributed File System

 Reliable distributed file system


 Data kept in “chunks” spread across machines
 Each chunk replicated on different machines
 Seamless recovery from disk or machine failure

[Diagram: replicated chunks spread across Chunk server 1, Chunk server 2, Chunk server 3, … Chunk server N]

 Bring Computation Directly to Data


 Chunk Servers also serve as Compute Servers

13 Lecture19-Distributed File System


Lecture20

MapReduce
Programming Model: MapReduce

15 Lecture 20-MapReduce
Task: Word Count

Case 1: File too large for memory, but all <word, count> pairs fit in memory
Case 2: Count occurrences of words:
 words(doc.txt) | sort | uniq -c
 where words takes a file and outputs the words in it, one per line (a possible words helper is sketched below)
 Case 2 captures the essence of MapReduce
 The great thing is that it is naturally parallelizable
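The pipeline assumes a small words program; one possible implementation (an assumption for illustration, not given on the slide):

    # words.py -- print the words of a file, one per line, so that
    #   python words.py doc.txt | sort | uniq -c
    # counts the occurrences of each word.
    import sys

    def words(path):
        with open(path) as f:
            for line in f:
                for word in line.split():
                    print(word)

    if __name__ == "__main__":
        words(sys.argv[1])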

16 Lecture 20-MapReduce
MapReduce Overview

 Sequentially read a lot of data


 Map:
 Extract something you care about
 Group by key: Sort and Shuffle
 Reduce:
 Aggregate, summarize, filter or transform
 Write the result

Outline stays the same; Map and Reduce change to fit the problem.

17 Lecture 20-MapReduce
MapReduce: The Map Step

[Diagram: map turns input key-value pairs into intermediate key-value pairs]

18 Lecture 20-MapReduce
MapReduce: The Reduce Step

[Diagram: intermediate key-value pairs are grouped by key into key-value groups; reduce turns each group into output key-value pairs]

19 Lecture 20-MapReduce
More Specifically

 Input: a set of key-value pairs


 Programmer specifies two methods:
 Map(k, v) → <k’, v’>*
 Takes a key-value pair and outputs a set of key-value pairs
 E.g., key is the filename, value is a single line in the file
 There is one Map call for every (k, v) pair
 Reduce(k’, <v’>*) → <k’, v’’>*
 All values v’ with the same key k’ are reduced together and processed in v’ order
 There is one Reduce function call per unique key k’ (the two signatures are made concrete in the sketch below)
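The two signatures can be written out as Python type aliases (a sketch only; the names MapFn and ReduceFn are illustrative, not an API from the slides):

    from typing import Callable, Iterable, Tuple, TypeVar

    K, V = TypeVar("K"), TypeVar("V")        # input key/value types
    K2, V2 = TypeVar("K2"), TypeVar("V2")    # intermediate key/value types
    V3 = TypeVar("V3")                       # output value type

    # Map(k, v) -> <k', v'>* : one call per input pair, any number of output pairs
    MapFn = Callable[[K, V], Iterable[Tuple[K2, V2]]]

    # Reduce(k', <v'>*) -> <k', v''>* : one call per distinct intermediate key
    ReduceFn = Callable[[K2, Iterable[V2]], Iterable[Tuple[K2, V3]]]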

20 Lecture 20-MapReduce
MapReduce: Word Counting

21 Lecture 20-MapReduce
Word Count Using MapReduce

map(key, value):
    // key: document name; value: text of the document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
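The pseudocode above can be run end to end on a single machine; a small Python simulation (function names are illustrative, and the group-by-key step is just an in-memory dictionary standing in for the cluster's shuffle and sort):

    from collections import defaultdict

    def map_fn(doc_name, text):
        for word in text.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        return (word, sum(counts))

    def word_count(documents):
        # Map phase: one map call per (document name, text) pair
        intermediate = []
        for name, text in documents.items():
            intermediate.extend(map_fn(name, text))
        # Group by key: what the shuffle/sort does in a real cluster
        groups = defaultdict(list)
        for word, count in intermediate:
            groups[word].append(count)
        # Reduce phase: one reduce call per distinct word
        return dict(reduce_fn(w, c) for w, c in groups.items())

    print(word_count({"doc.txt": "the quick fox jumps over the lazy dog the end"}))
    # {'the': 3, 'quick': 1, 'fox': 1, 'jumps': 1, 'over': 1, 'lazy': 1, 'dog': 1, 'end': 1}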

22 Lecture 20-MapReduce
MapReduce: Environment

Map-Reduce environment takes care of:


 Partitioning the input data
 Scheduling the program’s execution across a set of machines
 Performing the group by key step
 Handling machine failures
 Managing required inter-machine communication

23 Lecture 20-MapReduce
MapReduce: A Diagram

[Diagram]
Map: reads the input and produces a set of key-value pairs.
Group by key: collects all pairs with the same key (hash merge, shuffle, sort, partition).
Reduce: collects all values belonging to each key and outputs the result.

24 Lecture 20-MapReduce
MapReduce: In Parallel

All phases are distributed with many tasks doing the work

25 Lecture 20-MapReduce
MapReduce

26 Lecture 20-MapReduce
Data Flow

27 Lecture 20-MapReduce
Coordinator: Master

28 Lecture 20-MapReduce
Dealing with Failures

 Map worker failure


 Map tasks completed or in-progress at worker are reset to idle
 Reduce workers are notified when task is rescheduled on another worker
 Reduce worker failure
 Only in-progress tasks are reset to idle
 Reduce task is restarted
 Master failure
 MapReduce task is aborted and client is notified
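A minimal bookkeeping sketch of these rules (illustrative data structures, not Hadoop's or Google's actual implementation):

    IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"

    def handle_worker_failure(failed_worker, map_tasks, reduce_tasks):
        """Each task is a dict: {"state": ..., "worker": ...}."""
        for task in map_tasks.values():
            # Completed map output lives on the failed worker's local disk,
            # so completed *and* in-progress map tasks go back to idle.
            if task["worker"] == failed_worker and task["state"] in (IN_PROGRESS, COMPLETED):
                task["state"], task["worker"] = IDLE, None
        for task in reduce_tasks.values():
            # Completed reduce output is already in the distributed file system,
            # so only in-progress reduce tasks are reset and restarted.
            if task["worker"] == failed_worker and task["state"] == IN_PROGRESS:
                task["state"], task["worker"] = IDLE, None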

29 Lecture 20-MapReduce
How many Map and Reduce jobs?

 M map tasks, R reduce tasks


 Rule of thumb:
 Make M much larger than the number of nodes in the cluster
 One DFS chunk per map task is common (see the example below)
 Improves dynamic load balancing and speeds up recovery from worker failures
 Usually R is smaller than M
 Because output is spread across R files
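A worked example of the rule of thumb (the 1 TB input is a hypothetical figure; 64 MB is the upper end of the chunk sizes mentioned in Lecture 19):

    input_size = 1 * 1024**4       # 1 TB of input (hypothetical)
    chunk_size = 64 * 1024**2      # one map task per 64 MB DFS chunk
    M = input_size // chunk_size
    print(M)                       # 16384 map tasks -- far more than the nodes in a typical cluster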

30 Lecture 20-MapReduce
Task Granularity & Pipelining

 Fine granularity tasks: map tasks >> machines


 Minimizes time for fault recovery
 Can do pipeline shuffling with map execution
 Better dynamic load balancing

31 Lecture 20-MapReduce
Refinements: Backup Tasks

 Problem
 Slow workers significantly lengthen the job completion time:
 Other jobs on the machine
 Bad disks
 Weird things
 Solution
 Near end of phase, spawn backup copies of tasks
 Whichever one finishes first “wins”
 Effect
 Dramatically shortens job completion time
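A toy illustration of "whichever one finishes first wins", using Python threads (a sketch only: a real master would launch the backup copy on a different machine, and the task must be safe to execute twice):

    import concurrent.futures as cf

    def run_with_backup(task, *args, copies=2):
        # Launch duplicate copies of a (possibly straggling) task and
        # return the result of whichever copy finishes first.
        pool = cf.ThreadPoolExecutor(max_workers=copies)
        futures = [pool.submit(task, *args) for _ in range(copies)]
        done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        pool.shutdown(wait=False, cancel_futures=True)   # don't wait for stragglers (Python 3.9+)
        return next(iter(done)).result()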

32 Lecture 20-MapReduce
Refinement: Combiners

 Often a Map task will produce many pairs of the form (k, v1), (k, v2), … for the same key k
 E.g., popular words in the word count example
 Can save network time by pre-aggregating values in the mapper:
 combine(k, list(v1)) → v2
 The combiner is usually the same as the reduce function (see the sketch below)
 Works only if the reduce function is commutative and associative
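For word count, the combiner is just the same summation run inside the map task; a sketch (the function name is illustrative):

    from collections import Counter

    def map_with_combiner(doc_name, text):
        # Without a combiner: emit (word, 1) once per occurrence.
        # With a combiner: pre-sum locally and emit one (word, partial_count)
        # pair per distinct word -- that is all that crosses the network.
        # Safe only because addition is commutative and associative.
        return list(Counter(text.split()).items())

    print(map_with_combiner("doc.txt", "the quick fox the lazy fox"))
    # [('the', 2), ('quick', 1), ('fox', 2), ('lazy', 1)]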

33 Lecture 20-MapReduce
Refinement: Combiners

 Back to our word counting example:


 Combiner combines the values of all keys of a single mapper (single machine):

 Much less data needs to be copied and shuffled!

34 Lecture 20-MapReduce
Refinement: Partition Function
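The diagram on this slide is not reproduced here. As background (from the original MapReduce paper, not stated on the slide), the default partition function sends key k' to reduce task hash(k') mod R, and users can supply their own, e.g. so that all URLs from one host land in the same output file. A sketch:

    import hashlib

    def default_partition(key, R):
        # Stable hash of the key, modulo the number of reduce tasks R.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return h % R

    def host_partition(url_key, R):
        # Custom partitioner: keys for the same host go to the same reduce task.
        host = url_key.split("/")[2] if "://" in url_key else url_key
        return default_partition(host, R)

    print(host_partition("http://example.com/a.html", 4),
          host_partition("http://example.com/b.html", 4))   # same partition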

35 Lecture 20-MapReduce
Thank You
