Unit-2
Agenda
Hadoop
Hadoop is an open-source software framework for storing and processing
large datasets ranging in size from gigabytes to petabytes.
Hadoop comes in several distributions, such as Cloudera, IBM BigInsights, MapR, and Hortonworks.
History
Core Components of Hadoop
HDFS
HDFS has two daemons: NameNode and DataNode.
• DataNode: The various functions of the DataNode are as follows:
• DataNode runs on the slave machines.
• It stores the actual business data.
• It serves read/write requests from users.
• DataNode does the groundwork of creating, replicating, and deleting blocks on the command of the NameNode.
• By default, it sends a heartbeat to the NameNode every 3 seconds, reporting that it is alive and functioning.
MapReduce
• MapReduce processing consists of two phases:
• Map Phase: This phase applies the business logic to the data. The input data is converted into key-value pairs.
• Reduce Phase: The Reduce phase takes the output of the Map phase as its input. It applies aggregation based on the key of the key-value pairs.
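A minimal, framework-free sketch of these two phases in Python may make the key-value flow concrete; word count stands in for the business logic, and all names below are illustrative rather than Hadoop APIs.

```python
from collections import defaultdict

def map_phase(line):
    # Business logic: turn one input record into key-value pairs.
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    # Aggregation by key: sum all counts for one word.
    return (key, sum(values))

lines = ["big data is big", "hadoop handles big data"]

# Map: each input record yields a list of key-value pairs.
intermediate = [pair for line in lines for pair in map_phase(line)]

# Shuffle/group by key (the framework does this step in real MapReduce).
grouped = defaultdict(list)
for key, value in intermediate:
    grouped[key].append(value)

# Reduce: aggregate every key's list of values.
print(sorted(reduce_phase(k, v) for k, v in grouped.items()))
# [('big', 3), ('data', 2), ('hadoop', 1), ('handles', 1), ('is', 1)]
```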
YARN
• Resource Manager
• Node Manager
• Application Master
YARN – Resource Manager
YARN – Node Manager
MAPREDUCE
• ETL stands for Extract, Transform, Load and it is a process used in data warehousing to
extract data from various sources, transform it into a format suitable for loading into a
data warehouse, and then load it into the warehouse. The process of ETL can be
broken down into the following three stages:
• Extract: The first stage in the ETL process is to extract data from various sources such
as transactional systems, spreadsheets, and flat files. This step involves reading data
from the source systems and storing it in a staging area.
• Transform: In this stage, the extracted data is transformed into a format that is
suitable for loading into the data warehouse. This may involve cleaning and validating
the data, converting data types, combining data from multiple sources, and creating
new data fields.
• Load: After the data is transformed, it is loaded into the data warehouse. This step
involves creating the physical data structures and loading the data into the warehouse.
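As a hedged illustration of the three stages, the short Python sketch below extracts rows from a hypothetical sales.csv file, transforms them, and loads them into a SQLite table standing in for the warehouse; the file name, column names, and table are placeholders, not part of any particular warehouse product.

```python
import csv
import sqlite3

# Extract: read raw rows from a source file (hypothetical sales.csv) into a staging area.
with open("sales.csv", newline="") as src:
    staged = list(csv.DictReader(src))

# Transform: clean and validate rows, convert types, and derive a new field.
transformed = []
for row in staged:
    if not row.get("order_id"):            # drop invalid records
        continue
    transformed.append((
        int(row["order_id"]),
        row["customer"].strip().title(),    # basic cleaning
        float(row["amount"]),               # type conversion
        float(row["amount"]) * 0.18,        # derived field: tax
    ))

# Load: create the physical structure and load the rows into the warehouse table.
db = sqlite3.connect("warehouse.db")
db.execute("CREATE TABLE IF NOT EXISTS sales "
           "(order_id INTEGER, customer TEXT, amount REAL, tax REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", transformed)
db.commit()
db.close()
```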
Hadoop Streaming
How Streaming Works
• In the previous example, both the mapper and the reducer are
executables that read the input from stdin (line by line) and emit
the output to stdout.
• The utility will create a Map/Reduce job, submit the job to an
appropriate cluster, and monitor the progress of the job until it
completes.
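For example, a streaming mapper can be an ordinary Python script that reads lines from stdin and writes tab-separated key-value pairs to stdout; the script names and HDFS paths below are illustrative placeholders.

```python
#!/usr/bin/env python3
# mapper.py - a word-count mapper for Hadoop Streaming (illustrative sketch).
# A typical invocation (jar path and HDFS paths are placeholders):
#   hadoop jar hadoop-streaming.jar \
#       -input /user/demo/input -output /user/demo/output \
#       -mapper mapper.py -reducer reducer.py \
#       -file mapper.py -file reducer.py
import sys

for line in sys.stdin:
    for word in line.split():
        # Emit key<TAB>value; the framework shuffles and sorts by key.
        print(f"{word}\t1")
```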
Streaming Command Options
Hadoop Pipes
• Hadoop Pipes allows C++ code to use Hadoop DFS and map/reduce.
• The primary approach is to split the C++ code into a separate
process that does the application specific code.
• The approach will be similar to Hadoop streaming, but using
Writable serialization to convert the types into bytes that are sent
to the process via a socket.
• If you would like to write your Map and Reduce logic in C++, use Hadoop Pipes.
Hadoop Ecosystem
Hadoop Ecosystem - components
MapReduce
What is MapReduce?
• You only need to express your business logic in the way MapReduce works; the rest is taken care of by the framework. The work (the complete job) submitted by the user to the master is divided into small units of work (tasks) and assigned to the slaves.
• MapReduce is the heart of Hadoop.
• Hadoop is so powerful and efficient because MapReduce performs the processing in parallel.
Why MapReduce
Apache MapReduce Terminologies
Map and Reduce
What is a MapReduce Job?
What is a Task in MapReduce?
How Hadoop MapReduce Works
Phases of MapReduce
• Input Splits (Input Phase)
• Mapping (Map Phase)
• Intermediate Keys
• Combiner (local "group by")
• Shuffling and Sorting
• Reducing
• Output Phase
Example for word count
Developing a MapReduce Application
Reduce Program for Word Count in Python
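The slide's code is not reproduced here, but a minimal streaming-style reducer for word count might look like the sketch below; it assumes the framework has already sorted the mapper output by key.

```python
#!/usr/bin/env python3
# reducer.py - word-count reducer for Hadoop Streaming (illustrative sketch).
# Input arrives on stdin as "word<TAB>count" lines, already sorted by word.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    word, count = line.split("\t", 1)
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

# Flush the last key.
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```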
Unit Test MapReduce using MRUnit
• To make sure that your code is correct, you need to unit test it first.
• Just as you unit test your Java code with the JUnit framework, you can use MRUnit to test MapReduce jobs.
Unit Test MapReduce using MRUnit
Unit Testing: Streaming with Python
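MRUnit targets Java MapReduce code; for streaming jobs written in Python, the mapper logic can be factored into a plain function and tested with the standard unittest module. The map_words helper below is a hypothetical stand-in for your own mapper logic.

```python
import unittest

def map_words(line):
    # Mapper logic factored out of the streaming script so it can be unit tested.
    return [(word, 1) for word in line.split()]

class MapperTest(unittest.TestCase):
    def test_emits_one_pair_per_word(self):
        self.assertEqual(map_words("hadoop streams data"),
                         [("hadoop", 1), ("streams", 1), ("data", 1)])

    def test_empty_line_emits_nothing(self):
        self.assertEqual(map_words("   "), [])

if __name__ == "__main__":
    unittest.main()
```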
Handling Failures in Hadoop MapReduce
• In the real world, user code is buggy, processes crash, and machines
fail.
• One of the major benefits of using Hadoop is its ability to handle
such failures and allow your job to complete successfully.
Handling Failures in Hadoop MapReduce
Task Failure
• The most common occurrence of this failure is when user code in
the map or reduce task throws a runtime exception.
• If this happens, the child JVM reports the error back to its parent
application master before it exits.
• The error ultimately makes it into the user logs.
• The application master marks the task attempt as failed, and frees
up the container so its resources are available for another task.
Handling Failures in Hadoop MapReduce
Tasktracker Failure
• If a tasktracker fails by crashing or running very slowly, it stops sending heartbeats to the jobtracker.
• The jobtracker notices a tasktracker that has stopped sending heartbeats if it has not received one for 10 minutes (the interval is configurable) and removes it from its pool of tasktrackers.
• A tasktracker can also be blacklisted by the jobtracker.
Handling Failures in Hadoop MapReduce
Jobtracker Failure
• This is the most serious failure mode.
• Hadoop has no mechanism for dealing with jobtracker failure.
• The situation is improved with YARN.
How Hadoop runs a MapReduce job using YARN
Job Scheduling in Hadoop
FIFO Scheduler
Capacity Scheduler
• Advantages:
• It maximizes the utilization of resources and the throughput of the Hadoop cluster.
• It provides elasticity for groups or organizations in a cost-effective manner.
• It also gives capacity guarantees and safeguards to the organizations utilizing the cluster.
• Disadvantage:
• It is the most complex of the Hadoop schedulers.
Capacity Scheduler
• The CapacityScheduler allows multiple tenants to securely share a large Hadoop cluster.
• It is designed to run Hadoop applications in a shared, multi-tenant cluster while maximizing the throughput and the utilization of the cluster.
• It supports hierarchical queues to reflect the structure of the organizations or groups that utilize the cluster resources.
Fair Scheduler
MapReduce Types and Formats
MapReduce Types
• The map and reduce functions in Hadoop MapReduce have the following
general form:
• map: (K1, V1) → list(K2, V2)
• reduce: (K2, list(V2)) → list(K3, V3)
• In general, the map input key and value types (K1 and V1) are different
from the map output types (K2 and V2).
• However, the reduce input must have the same types as the map output,
although the reduce output types may be different again (K3 and V3).
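Read against word count, K1/V1 would be a line offset and the line text, K2/V2 a word and a count, and K3/V3 a word and a total. The Python type hints below are only an illustration of these signatures, not a Hadoop API (built-in generics require Python 3.9+).

```python
# Illustrative type signatures for word count (not a Hadoop API).
# map:    (K1, V1)       -> list[(K2, V2)]
# reduce: (K2, list[V2]) -> list[(K3, V3)]

def map_fn(offset: int, line: str) -> list[tuple[str, int]]:
    return [(word, 1) for word in line.split()]

def reduce_fn(word: str, counts: list[int]) -> list[tuple[str, int]]:
    return [(word, sum(counts))]
```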
Input Formats
• How the input files are split up and read in Hadoop is defined by the
InputFormat.
Input Formats
• The files or other objects to be used as input are selected by the InputFormat.
• InputFormat defines the data splits, which determine both the size of each individual map task and its potential execution server.
• InputFormat defines the RecordReader, which is responsible for reading actual records from the input files.
Types of InputFormat in MapReduce
• FileInputFormat in Hadoop
• TextInputFormat
• KeyValueTextInputFormat
• SequenceFileInputFormat
• SequenceFileAsTextInputFormat
• SequenceFileAsBinaryInputFormat
• NLineInputFormat
• DBInputFormat
FileInputFormat in Hadoop
TextInputFormat
KeyValueTextInputFormat
SequenceFileInputFormat
SequenceFileAsTextInputFormat
NLineInputFormat
Hadoop Output Format
• Hadoop RecordWriter
• As we know, the Reducer takes as input the set of intermediate key-value pairs produced by the mapper and runs a reduce function on them to generate output that is again zero or more key-value pairs.
• RecordWriter writes these output key-value pairs from the Reducer phase to output files.
• As we saw above, Hadoop RecordWriter takes output data from
Reducer and writes this data to output files.
• The way these output key-value pairs are written to output files by the RecordWriter is determined by the OutputFormat.
Types of Hadoop Output Formats
• TextOutputFormat
• SequenceFileOutputFormat
• MapFileOutputFormat
• MultipleOutputs
• DBOutputFormat
TextOutputFormat
SequenceFileOutputFormat
MapFileOutputFormat
MultipleOutputs
• It allows writing data to files whose names are derived from the
output keys and values, or in fact from an arbitrary string.
DBOutputFormat
• DBOutputFormat in Hadoop is an OutputFormat for writing to relational databases and HBase.
• It sends the reduce output to a SQL table.
• It accepts key-value pairs.
Features of MapReduce
• Scalability
• Flexibility
• Security and Authentication
• Cost-effective solution
• Fast
• Simple model of programming
• Parallel Programming
• Availability and resilient nature
Features of MapReduce
Scalability
Features of MapReduce
Flexibility
MapReduce programming enables companies to access new sources of data.
It allows enterprises to work with structured as well as unstructured data.
The MapReduce framework also supports multiple languages and data from sources ranging from email and social media to clickstreams.
MapReduce is more flexible in dealing with such data than a traditional DBMS.
Features of MapReduce
Features of MapReduce
Cost-effective solution
Hadoop’s scalable architecture with the MapReduce programming
framework allows the storage and processing of large data sets in a
very affordable manner.
Fast
Hadoop uses distributed storage (HDFS), which essentially implements a mapping system for locating data in a cluster.
The data processing tools, such as MapReduce programs, are generally located on the same servers as the data, which allows for faster processing.
Features of MapReduce
Parallel Programming
One of the major aspects of MapReduce programming is its parallel processing.
It divides the tasks in a manner that allows their execution in parallel.
Parallel processing allows multiple processors to execute these divided tasks.
As a result, the entire program runs in less time.
Features of MapReduce
Hadoop Counters
Built-In Counters in MapReduce
• Hadoop maintains some built-in counters for every job, and these report various metrics. For example, there are counters for the number of bytes and records, which allow us to confirm that the expected amount of input is consumed and the expected amount of output is produced.
• There are several groups for the Hadoop built-in Counters:
• MapReduce Task Counter in Hadoop
• FileSystem Counters
• FileInputFormat Counters
• FileOutputFormat counters
• MapReduce Job Counters
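In addition to the built-in counters, user-defined counters can be updated even from streaming scripts by writing specially formatted lines to stderr; the sketch below uses made-up group and counter names for illustration.

```python
#!/usr/bin/env python3
# Streaming mapper that also updates a user-defined counter via stderr.
import sys

for line in sys.stdin:
    if not line.strip():
        # Hadoop Streaming treats "reporter:counter:<group>,<counter>,<amount>"
        # lines on stderr as counter increments.
        sys.stderr.write("reporter:counter:WordCount,EmptyLines,1\n")
        continue
    for word in line.split():
        print(f"{word}\t1")
```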
Real-World MapReduce
• Demo
Hands-on Session
Let’s put your knowledge to the test
Q & A Time
We have 10 Minutes for Q&A