Unit-2
Agenda
Hadoop
Hadoop is an open-source software framework for storing and processing
large datasets ranging in size from gigabytes to petabytes.
Hadoop comes in several distributions, such as Cloudera, IBM BigInsights, MapR, and Hortonworks.
History
Core Components of Hadoop
HDFS
HDFS has two daemons: NameNode and DataNode.
• DataNode: The various functions of the DataNode are as follows:
• DataNode runs on the slave machines.
• It stores the actual business data.
• It serves read/write requests from users.
• DataNode does the groundwork of creating, replicating, and deleting blocks on the command of the NameNode.
• By default, it sends a heartbeat to the NameNode every 3 seconds, reporting that it is alive and functioning.
MapReduce
• MapReduce processing consists of two phases:
• Map Phase: This phase applies the business logic to the data. The input data is converted into key-value pairs.
• Reduce Phase: The Reduce phase takes the output of the Map phase as its input. It applies aggregation based on the key of the key-value pairs.
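A minimal, framework-free sketch of these two phases in Python may make the key-value flow concrete; word count stands in for the business logic, and all names below are illustrative rather than Hadoop APIs.

```python
from collections import defaultdict

def map_phase(line):
    # Business logic: turn one input record into key-value pairs.
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    # Aggregation by key: sum all counts for one word.
    return (key, sum(values))

lines = ["big data is big", "hadoop handles big data"]

# Map: each input record yields a list of key-value pairs.
intermediate = [pair for line in lines for pair in map_phase(line)]

# Shuffle/group by key (the framework does this step in real MapReduce).
grouped = defaultdict(list)
for key, value in intermediate:
    grouped[key].append(value)

# Reduce: aggregate every key's list of values.
print(sorted(reduce_phase(k, v) for k, v in grouped.items()))
# [('big', 3), ('data', 2), ('hadoop', 1), ('handles', 1), ('is', 1)]
```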
YARN
• Resource Manager
• Node Manager
• Application Master
YARN – Resource Manager
YARN – Node Manager
MAPREDUCE
• ETL stands for Extract, Transform, Load and it is a process used in data warehousing to
extract data from various sources, transform it into a format suitable for loading into a
data warehouse, and then load it into the warehouse. The process of ETL can be
broken down into the following three stages:
• Extract: The first stage in the ETL process is to extract data from various sources such
as transactional systems, spreadsheets, and flat files. This step involves reading data
from the source systems and storing it in a staging area.
• Transform: In this stage, the extracted data is transformed into a format that is
suitable for loading into the data warehouse. This may involve cleaning and validating
the data, converting data types, combining data from multiple sources, and creating
new data fields.
• Load: After the data is transformed, it is loaded into the data warehouse. This step
involves creating the physical data structures and loading the data into the warehouse.
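As a hedged illustration of the three stages, the short Python sketch below extracts rows from a hypothetical sales.csv file, transforms them, and loads them into a SQLite table standing in for the warehouse; the file name, column names, and table are placeholders, not part of any particular warehouse product.

```python
import csv
import sqlite3

# Extract: read raw rows from a source file (hypothetical sales.csv) into a staging area.
with open("sales.csv", newline="") as src:
    staged = list(csv.DictReader(src))

# Transform: clean and validate rows, convert types, and derive a new field.
transformed = []
for row in staged:
    if not row.get("order_id"):            # drop invalid records
        continue
    transformed.append((
        int(row["order_id"]),
        row["customer"].strip().title(),    # basic cleaning
        float(row["amount"]),               # type conversion
        float(row["amount"]) * 0.18,        # derived field: tax
    ))

# Load: create the physical structure and load the rows into the warehouse table.
db = sqlite3.connect("warehouse.db")
db.execute("CREATE TABLE IF NOT EXISTS sales "
           "(order_id INTEGER, customer TEXT, amount REAL, tax REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", transformed)
db.commit()
db.close()
```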
Hadoop Streaming
How Streaming Works
• In the previous example, both the mapper and the reducer are
executables that read the input from stdin (line by line) and emit
the output to stdout.
• The utility will create a Map/Reduce job, submit the job to an
appropriate cluster, and monitor the progress of the job until it
completes.
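For example, a streaming mapper can be an ordinary Python script that reads lines from stdin and writes tab-separated key-value pairs to stdout; the script names and HDFS paths below are illustrative placeholders.

```python
#!/usr/bin/env python3
# mapper.py - a word-count mapper for Hadoop Streaming (illustrative sketch).
# A typical invocation (jar path and HDFS paths are placeholders):
#   hadoop jar hadoop-streaming.jar \
#       -input /user/demo/input -output /user/demo/output \
#       -mapper mapper.py -reducer reducer.py \
#       -file mapper.py -file reducer.py
import sys

for line in sys.stdin:
    for word in line.split():
        # Emit key<TAB>value; the framework shuffles and sorts by key.
        print(f"{word}\t1")
```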
Streaming Command Options
Hadoop Pipes
• Hadoop Pipes allows C++ code to use Hadoop DFS and map/reduce.
• The primary approach is to split the C++ code into a separate
process that does the application specific code.
• The approach will be similar to Hadoop streaming, but using
Writable serialization to convert the types into bytes that are sent
to the process via a socket.
• If you would like to write your Map and Reduce logic in C++, use Hadoop Pipes.
Hadoop Ecosystem
Hadoop Ecosystem - components
MapReduce
What is MapReduce?
• You only need to express your business logic in the way MapReduce works; the rest is taken care of by the framework. The work (the complete job) submitted by the user to the master is divided into small units of work (tasks) and assigned to the slaves.
• MapReduce is the heart of Hadoop.
• Hadoop is so powerful and efficient because MapReduce performs the processing in parallel.
Why MapReduce
Apache MapReduce Terminologies
Map and Reduce
What is a MapReduce Job?
What is a Task in MapReduce?
How Hadoop MapReduce Works
Phases of MapReduce
• Input Splits (Input Phase)
• Mapping (Map Phase)
• Intermediate Keys
• Combiner (local "group by")
• Shuffling and Sorting
• Reducing
• Output Phase
Example for word count
Developing a MapReduce Application
Reduce Program for Word Count in Python
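The slide's code is not reproduced here, but a minimal streaming-style reducer for word count might look like the sketch below; it assumes the framework has already sorted the mapper output by key.

```python
#!/usr/bin/env python3
# reducer.py - word-count reducer for Hadoop Streaming (illustrative sketch).
# Input arrives on stdin as "word<TAB>count" lines, already sorted by word.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    word, count = line.split("\t", 1)
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

# Flush the last key.
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```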
Unit Test MapReduce using MRUnit
• To make sure that your code is correct, you need to unit test it first.
• Just as you unit test your Java code with the JUnit framework, you can use MRUnit to test MapReduce jobs.
Unit Test MapReduce using MRUnit
Unit Testing: Streaming with Python
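MRUnit targets Java MapReduce code; for streaming jobs written in Python, the mapper logic can be factored into a plain function and tested with the standard unittest module. The map_words helper below is a hypothetical stand-in for your own mapper logic.

```python
import unittest

def map_words(line):
    # Mapper logic factored out of the streaming script so it can be unit tested.
    return [(word, 1) for word in line.split()]

class MapperTest(unittest.TestCase):
    def test_emits_one_pair_per_word(self):
        self.assertEqual(map_words("hadoop streams data"),
                         [("hadoop", 1), ("streams", 1), ("data", 1)])

    def test_empty_line_emits_nothing(self):
        self.assertEqual(map_words("   "), [])

if __name__ == "__main__":
    unittest.main()
```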
Handling Failures in Hadoop MapReduce
• In the real world, user code is buggy, processes crash, and machines
fail.
• One of the major benefits of using Hadoop is its ability to handle
such failures and allow your job to complete successfully.
Handling Failures in Hadoop MapReduce
Task Failure
• The most common occurrence of this failure is when user code in
the map or reduce task throws a runtime exception.
• If this happens, the child JVM reports the error back to its parent
application master before it exits.
• The error ultimately makes it into the user logs.
• The application master marks the task attempt as failed, and frees
up the container so its resources are available for another task.
Handling Failures in Hadoop MapReduce
Tasktracker Failure
• If a tasktracker fails by crashing or running very slowly, it stops sending heartbeats to the jobtracker.
• The jobtracker notices a tasktracker that has stopped sending heartbeats if it has not received one for 10 minutes (the interval is configurable) and removes it from its pool of tasktrackers.
• A tasktracker can also be blacklisted by the jobtracker.
Handling Failures in Hadoop MapReduce
Jobtracker Failure
• This is the most serious failure mode.
• Hadoop has no mechanism for dealing with jobtracker failure.
• The situation is improved with YARN.
How Hadoop runs a MapReduce job using YARN
Job Scheduling in Hadoop
FIFO Scheduler
Capacity Scheduler
• Advantages:
• It maximizes the utilization of resources and the throughput of the Hadoop cluster.
• It provides elasticity for groups or organizations in a cost-effective manner.
• It also gives capacity guarantees and safeguards to the organizations utilizing the cluster.
• Disadvantage:
• It is the most complex of the Hadoop schedulers.
Capacity Scheduler
• The CapacityScheduler allows multiple tenants to securely share a large Hadoop cluster.
• It is designed to run Hadoop applications in a shared, multi-tenant cluster while maximizing the throughput and the utilization of the cluster.
• It supports hierarchical queues to reflect the structure of the organizations or groups that utilize the cluster resources.
Fair Scheduler
MapReduce Types and Formats
MapReduce Types
• The map and reduce functions in Hadoop MapReduce have the following
general form:
• map: (K1, V1) → list(K2, V2)
• reduce: (K2, list(V2)) → list(K3, V3)
• In general, the map input key and value types (K1 and V1) are different
from the map output types (K2 and V2).
• However, the reduce input must have the same types as the map output,
although the reduce output types may be different again (K3 and V3).
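Read against word count, K1/V1 would be a line offset and the line text, K2/V2 a word and a count, and K3/V3 a word and a total. The Python type hints below are only an illustration of these signatures, not a Hadoop API (built-in generics require Python 3.9+).

```python
# Illustrative type signatures for word count (not a Hadoop API).
# map:    (K1, V1)       -> list[(K2, V2)]
# reduce: (K2, list[V2]) -> list[(K3, V3)]

def map_fn(offset: int, line: str) -> list[tuple[str, int]]:
    return [(word, 1) for word in line.split()]

def reduce_fn(word: str, counts: list[int]) -> list[tuple[str, int]]:
    return [(word, sum(counts))]
```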
Input Formats
• How the input files are split up and read in Hadoop is defined by the
InputFormat.
Input Formats
• The files or other objects to be used as input are selected by the InputFormat.
• InputFormat defines the data splits, which determine both the size of each individual map task and its potential execution server.
• InputFormat defines the RecordReader, which is responsible for reading actual records from the input files.
Types of InputFormat in MapReduce
• FileInputFormat in Hadoop
• TextInputFormat
• KeyValueTextInputFormat
• SequenceFileInputFormat
• SequenceFileAsTextInputFormat
• SequenceFileAsBinaryInputFormat
• NLineInputFormat
• DBInputFormat
FileInputFormat in Hadoop
TextInputFormat
KeyValueTextInputFormat
SequenceFileInputFormat
SequenceFileAsTextInputFormat
NLineInputFormat
Hadoop Output Format
• Hadoop RecordWriter
• As we know, the Reducer takes as input the set of intermediate key-value pairs produced by the mapper and runs a reduce function on them to generate output that is again zero or more key-value pairs.
• RecordWriter writes these output key-value pairs from the Reducer phase to output files.
• As we saw above, Hadoop RecordWriter takes output data from
Reducer and writes this data to output files.
• The way these output key-value pairs are written to output files by the RecordWriter is determined by the OutputFormat.
Types of Hadoop Output Formats
• TextOutputFormat
• SequenceFileOutputFormat
• MapFileOutputFormat
• MultipleOutputs
• DBOutputFormat
TextOutputFormat
SequenceFileOutputFormat
MapFileOutputFormat
MultipleOutputs
• It allows writing data to files whose names are derived from the
output keys and values, or in fact from an arbitrary string.
DBOutputFormat
• DBOutputFormat in Hadoop is an OutputFormat for writing to relational databases and HBase.
• It sends the reduce output to a SQL table.
• It accepts key-value pairs.
Features of MapReduce
• Scalability
• Flexibility
• Security and Authentication
• Cost-effective solution
• Fast
• Simple model of programming
• Parallel Programming
• Availability and resilient nature
Features of MapReduce
Scalability
Features of MapReduce
Flexibility
MapReduce programming enables companies to access new sources of data.
It allows enterprises to work with structured as well as unstructured data.
The MapReduce framework also supports multiple languages and data from sources ranging from email and social media to clickstreams.
MapReduce is more flexible in dealing with such data than a traditional DBMS.
Features of MapReduce
Features of MapReduce
Cost-effective solution
Hadoop’s scalable architecture with the MapReduce programming
framework allows the storage and processing of large data sets in a
very affordable manner.
Fast
Hadoop uses distributed storage (HDFS), which essentially implements a mapping system for locating data in a cluster.
The data processing tools, such as MapReduce programs, are generally located on the same servers as the data, which allows for faster processing.
Features of MapReduce
Parallel Programming
One of the major aspects of MapReduce programming is its parallel processing.
It divides the tasks in a manner that allows their execution in parallel.
Parallel processing allows multiple processors to execute these divided tasks.
As a result, the entire program runs in less time.
Features of MapReduce
Hadoop Counters
Built-In Counters in MapReduce
• Hadoop maintains some built-in counters for every job, and these report various metrics. For example, there are counters for the number of bytes and records, which allow us to confirm that the expected amount of input is consumed and the expected amount of output is produced.
• There are several groups for the Hadoop built-in Counters:
• MapReduce Task Counter in Hadoop
• FileSystem Counters
• FileInputFormat Counters
• FileOutputFormat counters
• MapReduce Job Counters
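In addition to the built-in counters, user-defined counters can be updated even from streaming scripts by writing specially formatted lines to stderr; the sketch below uses made-up group and counter names for illustration.

```python
#!/usr/bin/env python3
# Streaming mapper that also updates a user-defined counter via stderr.
import sys

for line in sys.stdin:
    if not line.strip():
        # Hadoop Streaming treats "reporter:counter:<group>,<counter>,<amount>"
        # lines on stderr as counter increments.
        sys.stderr.write("reporter:counter:WordCount,EmptyLines,1\n")
        continue
    for word in line.split():
        print(f"{word}\t1")
```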
Real-World MapReduce
• Demo
Hands-on Session
Let’s put your knowledge to the test
Q & A Time
We have 10 Minutes for Q&A