UNIT IV

Main components and Programming model


- Introduction to the Hadoop framework – MapReduce, input splitting, map and reduce
functions, specifying input and output parameters, configuring and running a job –
design of the Hadoop file system, HDFS concepts, command line and Java interface,
dataflow of file read and file write.
Introducing Hadoop
• Hadoop is the Apache Software Foundation top-level
project that holds the various Hadoop subprojects
that graduated from the Apache Incubator.
• It supports the development of open source software
that supplies a framework for highly scalable
distributed computing applications.
• The Hadoop framework handles the processing
details, leaving developers free to focus on
application logic.
Contd..
• The Apache Hadoop project develops open-source software for
reliable, scalable, distributed computing, including:
• Hadoop Core, our flagship sub-project, provides a distributed
filesystem (HDFS) and support for the MapReduce distributed
computing metaphor.
• HBase builds on Hadoop Core to provide a scalable, distributed
database.
• Pig is a high-level data-flow language and execution framework
for parallel computation. It is built on top of Hadoop Core.
• ZooKeeper is a highly available and reliable coordination system.
Distributed applications use ZooKeeper to store and mediate
updates for critical shared state.
• Hive is a data warehouse infrastructure built on Hadoop Core
that provides data summarization, ad hoc querying and analysis
of datasets.
What is Hadoop
• Hadoop is a software framework for distributed
processing of large datasets across large clusters of
computers
– Large datasets: terabytes or petabytes of data
– Large clusters: hundreds or thousands of nodes
• Hadoop is an open-source implementation of Google's
MapReduce
• Hadoop is based on a simple programming model
called MapReduce
• Hadoop is based on a simple data model: any data will
fit

RDBMS compared to MapReduce
              Traditional RDBMS            MapReduce
Data size     Gigabytes                    Petabytes
Access        Interactive and batch        Batch
Updates       Read and write many times    Write once, read many times
Structure     Static schema                Dynamic schema
Integrity     High                         Low
Scaling       Nonlinear                    Linear

What is Hadoop (Cont’d)
• Hadoop framework consists on two main layers
– Distributed file system (HDFS)
– Execution engine (MapReduce)

Contd..
• The Hadoop Core project provides the basic
services for building a cloud computing
environment with commodity hardware, and the
APIs for developing software that will run on
that cloud.
• The two fundamental pieces of Hadoop Core
are the
– MapReduce framework (the cloud computing
environment), and
– Hadoop Distributed File System (HDFS).
Contd..
• The Hadoop Core MapReduce framework requires a shared file
system.
• This shared file system does not need to be a system-level file
system, as long as there is a distributed file system plug-in available
to the framework.
• While Hadoop Core provides HDFS, HDFS is not required.
• In Hadoop JIRA (the issue-tracking system), item 4686 is a tracking
ticket to separate HDFS into its own Hadoop project.
• In addition to HDFS, Hadoop Core supports the CloudStore
(formerly Kosmos) file system (http://kosmosfs.sourceforge.net/)
and the Amazon Simple Storage Service (S3) file system
(http://aws.amazon.com/s3/).
• The Hadoop Core framework comes with plug-ins for HDFS,
CloudStore, and S3.
• Users are also free to use any distributed file system that is visible
as a system-mounted file system, such as Network File System
(NFS), Global File System (GFS), or Lustre.
• When HDFS is used as the shared file system,
Hadoop is able to take advantage of
knowledge about which node hosts a physical
copy of input data, and will attempt to
schedule the task that reads that data to run
on that machine.
Hadoop Core MapReduce
• The Hadoop Core MapReduce environment
provides the user with a sophisticated
framework to manage the execution of map and reduce
tasks across a cluster of machines.
• The user is required to tell the framework the following.
– The location(s) in the distributed file system of the job input
– The location(s) in the distributed file system for the job output
– The input format
– The output format
– The class containing the map function
– Optionally, the class containing the reduce function
– The JAR file(s) containing the map and reduce functions and any
support classes
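As a rough illustration (not part of the slides), this checklist maps onto the old org.apache.hadoop.mapred JobConf API approximately as follows; MyJob, MyMapper, MyReducer, and the paths are placeholders, and FileInputFormat, FileOutputFormat, and Path come from the org.apache.hadoop.mapred and org.apache.hadoop.fs packages.

JobConf conf = new JobConf(MyJob.class);                   // JAR file containing the map/reduce classes
FileInputFormat.setInputPaths(conf, new Path("input"));    // job input location(s) in the distributed file system
FileOutputFormat.setOutputPath(conf, new Path("output"));  // job output location
conf.setInputFormat(KeyValueTextInputFormat.class);        // input format
conf.setOutputFormat(TextOutputFormat.class);              // output format
conf.setMapperClass(MyMapper.class);                       // class containing the map function
conf.setReducerClass(MyReducer.class);                     // optional: class containing the reduce function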
Hadoop Core MapReduce
• If a job does not need a reduce function, the user does
not need to specify a reducer class, and a reduce phase
of the job will not be run.
• The framework will partition the input, and schedule
and execute map tasks across the cluster.
• If requested, it will sort the results of the map task and
execute the reduce task(s) with the map output.
• The final output will be moved to the output directory,
and the job status will be reported to the user.
Hadoop Core MapReduce
• MapReduce is based on key/value pairs.
• The framework will convert each record of input into a key/value
pair, and each pair will be input to the map function once.
• The map output is a set of key/value pairs—nominally one pair
that is the transformed input pair, but it is perfectly acceptable to
output multiple pairs.
• The map output pairs are grouped and sorted by key.
• The reduce function is called one time for each key, in sort
sequence, with the key and the set of values that share that key.
• The reduce method may output an arbitrary number of key/value
pairs, which are written to the output files in the job output
directory.
• If the reduce output keys are unchanged from the reduce input
keys, the final output will be sorted.
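To make the key/value flow concrete, the following is a minimal word-count mapper and reducer sketched with the old org.apache.hadoop.mapred API (this example is not part of the MapReduceIntro code): each map call turns one line of text into (word, 1) pairs, and each reduce call receives all counts that share a word, in key-sorted order, and sums them.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Map: called once per input record; emits one (word, 1) pair per token.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      for (String token : value.toString().split("\\s+")) {
        if (token.length() == 0) continue;
        word.set(token);
        output.collect(word, ONE);
      }
    }
  }

  // Reduce: called once per unique key, with the set of values sharing that key.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}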
Hadoop Core MapReduce
• The framework provides two processes that handle the
management of MapReduce jobs:
– TaskTracker manages the execution of individual map and
reduce tasks on a compute node in the cluster.
– JobTracker accepts job submissions, provides job
monitoring and control, and manages the distribution of
tasks to the TaskTracker nodes.
• There is one JobTracker process per cluster and one or more
TaskTracker processes per node in the cluster.
• The JobTracker is a single point of failure; the
JobTracker will, however, work around the failure of
individual TaskTracker processes.
The Hadoop Distributed File System
• HDFS is a file system designed for
MapReduce jobs that read input in large
chunks, process it, and write potentially large chunks of
output.
• HDFS does not handle random access particularly well.
• For reliability, file data is simply mirrored to multiple
storage nodes. This is referred to as replication in the
Hadoop community.
• As long as at least one replica of a data chunk is
available, the consumer of that data will not know of
storage server failures.
The Hadoop Distributed File System
• HDFS services are provided by two processes:
– NameNode handles management of the file system
metadata, and provides management and control services.
– DataNode provides block storage and retrieval services.
• There is one NameNode process per HDFS file system, and
it is a single point of failure.
• Hadoop Core provides recovery and automatic backup
of the NameNode, but no hot failover services.
• There will be multiple DataNode processes within the
cluster, with typically one DataNode process per
storage node in a cluster.
HDFS Architecture
[Architecture diagram: a NameNode holds the file system metadata (file names, replica counts, block locations); clients send metadata operations to the NameNode and read/write blocks directly on the DataNodes, which are spread across racks (Rack 1, Rack 2) and replicate blocks among themselves.]
Basics of MapReduce
• The user configures and submits a MapReduce
job (or just job for short) to the framework,
which will decompose the job into a set of
map tasks, shuffles, a sort, and a set of reduce
tasks.
• The framework will then manage the
distribution and execution of the tasks, collect
the output, and report the status to the user.
The Parts of a Hadoop MapReduce
Job
• The user is responsible for
– handling the job setup, specifying the input
location(s), and ensuring the input is in the
expected format and location.
• The framework is responsible for
– distributing the job among the TaskTracker nodes
of the cluster; running the map, shuffle, sort, and
reduce phases; placing the output in the output
directory; and informing the user of the job-
completion status.
• The job created by the code in
MapReduceIntro.java will read all of its textual
input line by line, and sort the lines based on that
portion of the line before the first tab character.
• If there are no tab characters in the line, the sort
will be based on the entire line.
• The MapReduceIntro.java file is structured to
provide a simple example of configuring and
running a MapReduce job.
Input Splitting
• For the framework to be able to distribute
pieces of the job to multiple machines, it needs
to fragment the input into individual pieces
• Each fragment of input is called an input split.
• The default rules for how input splits are
constructed from the actual input files are a
combination of configuration parameters and
the capabilities of the class that actually reads
the input records.
• An input split is a contiguous group of records
from a single input file.
• There will be at least N input splits, where N is
the number of input files.
• If the requested number of map tasks is larger than this
number, multiple input splits may be
constructed from each input file.
• The number and size of the input splits
strongly influence overall job performance.
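As a hedged sketch of how a driver can influence splitting (the values are arbitrary, and mapred.min.split.size is the pre-0.21 parameter name):

conf.setNumMapTasks(10);                        // a hint, not a hard limit, for the number of map tasks
conf.set("mapred.min.split.size", "134217728"); // lower bound on split size in bytes (here 128 MB)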
A Simple Map Function:
IdentityMapper
• The Hadoop framework provides a very simple
map function, called IdentityMapper.
• It is used in jobs that only need to reduce the
input, and not transform the raw input.
IdentityMapper.java
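The listing itself is not reproduced here; the framework class org.apache.hadoop.mapred.lib.IdentityMapper is essentially the following:

import java.io.IOException;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Writes each input key/value pair to the output unchanged.
public class IdentityMapper<K, V> extends MapReduceBase
    implements Mapper<K, V, K, V> {
  public void map(K key, V val,
                  OutputCollector<K, V> output, Reporter reporter)
      throws IOException {
    output.collect(key, val);
  }
}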
• The call output.collect(key, val) passes a key/value
pair back to the framework for further
processing.
• All map functions must implement the
Mapper interface.
• The map method is always called with four
arguments: a key that is an instance of a
WritableComparable, a value that is an instance of a
Writable, an output collector, and a
reporter.
• The framework will make one call to your map
function for each record in your input.
• There will be multiple instances of your map
function running, potentially in multiple Java
Virtual Machines (JVMs), and potentially on
multiple machines.
• The framework coordinates all of this.
A Simple Reduce Function: Identity
Reducer
• The Hadoop framework calls the reduce
function one time for each unique key.
• The framework provides the key and the set
of values that share that key.
• The framework-supplied class IdentityReducer
is a simple example that produces one output
record for every value.
IdentityReducer.java
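The listing is likewise omitted here; the framework class org.apache.hadoop.mapred.lib.IdentityReducer is essentially the following:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// For each key, writes one output record per value, with the key unchanged.
public class IdentityReducer<K, V> extends MapReduceBase
    implements Reducer<K, V, K, V> {
  public void reduce(K key, Iterator<V> values,
                     OutputCollector<K, V> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      output.collect(key, values.next());
    }
  }
}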
• If the output of your job is to be sorted, the
reducer function must pass the key objects to
the output.collect() method unchanged.
• The reduce phase is free to output any
number of records, including zero records,
with the same key and different values.
• This particular constraint is also why the map
tasks may be multithreaded, while the reduce
tasks are explicitly only single-threaded.
Configuring a job
• All Hadoop jobs have a driver program that configures the
actual MapReduce job and submits it to the Hadoop
framework.
• This configuration is handled through the JobConf object.
• The sample class MapReduceIntro provides a walk-through
for using the JobConf object to configure and submit a job to
the Hadoop framework for execution.
• The code relies on a class called MapReduceIntroConfig,
which ensures that the input and output directories are set
up and ready.
• First, you must create a JobConf object.
• It is good practice to pass in a class that is
contained in the JAR file that has your map and
reduce functions.
• This ensures that the framework will make the
JAR available to the map and reduce tasks run
for your job.
• JobConf conf = new JobConf(MapReduceIntro.class);
• Next, set the required parameters for the job:
– input and output directory locations,
– the format of the input and output, and
– the mapper and reducer classes.
• All jobs will have a map phase, and the map phase is
responsible for handling the job input.
• The configuration of the map phase requires you to specify
the following (a sketch follows this list):
– the input locations,
– the class that will produce the key/value pairs from the input,
– the mapper class, and
– optionally, the suggested number of map tasks, map output
types, and per-map-task threading.
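A sketch of this part of a driver (the input path and MyMapper are placeholders; Path is org.apache.hadoop.fs.Path):

FileInputFormat.setInputPaths(conf, new Path("/user/demo/input")); // input location(s)
conf.setMapperClass(MyMapper.class);                               // the mapper class
conf.setNumMapTasks(4);                                            // optional hint for the number of map tasks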
Specifying Input formats
• The Hadoop framework provides a large variety of
input formats.
• The major distinctions are between textual input
formats and binary input formats.
• The following are the available formats:
– KeyValueTextInputFormat: Key/value pairs, one per line.
– TextInputFormat: The key is the position of the line in
the file, and the value is the line.
– NLineInputFormat: Similar to KeyValueTextInputFormat,
but the splits are based on N lines of input rather than Y
bytes of input.
– MultiFileInputFormat: An abstract class that lets
the user implement an input format that
aggregates multiple files into one split.
– SequenceFileInputFormat: The input file is a
Hadoop sequence file, containing serialized
key/value pairs.
• KeyValueTextInputFormat and
SequenceFileInputFormat are the most commonly
used input formats.
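Choosing the input format in the driver is a single call; for example (illustrative only):

conf.setInputFormat(KeyValueTextInputFormat.class);
// Alternatives include TextInputFormat, NLineInputFormat, MultiFileInputFormat
// subclasses, and SequenceFileInputFormat, as described above.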
Setting the Output Parameters
• The framework requires that the output
parameters be configured.
• The framework will collect the task output
and place it into the
configured output directory.
• To avoid issues with file name collisions when
placing the task output into the output
directory, the framework requires that the
output directory not exist when you start the
job.
• The MapReduceIntroConfig class ensures
that the output directory does not exist and
provides the path to the output directory.
• The Text class is the functional equivalent of a
String. It implements the WritableComparable
interface, which is necessary for keys, and the
Writable interface (which is actually a subset
of WritableComparable), which is necessary
for values.
• The key feature of a Writable is that the
framework knows how to serialize and
deserialize a Writable object.
• WritableComparable adds the compareTo()
method, so the framework knows how to sort
WritableComparable objects.
• The conf.setOutputKeyClass(Text.class) and
conf.setOutputValueClass(Text.class) settings
inform the framework of the types of the
key/value pairs to expect for the reduce phase.
• Unless instructed otherwise, these classes are also used for
the types the framework will expect from the map output.
• To set the map output value class explicitly, the method is
conf.setMapOutputValueClass(Class<? extends
Writable>); there is a corresponding
conf.setMapOutputKeyClass() for the map output key.
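Putting the output settings together, a driver fragment might look like this (the output path is a placeholder and must not already exist):

FileOutputFormat.setOutputPath(conf, new Path("/user/demo/output")); // output directory
conf.setOutputFormat(TextOutputFormat.class);                        // plain text, one key TAB value per line
conf.setOutputKeyClass(Text.class);                                  // reduce output key type
conf.setOutputValueClass(Text.class);                                // reduce output value type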
WritableComparable.java
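The interface is small; in current Hadoop versions it is essentially:

// org.apache.hadoop.io.WritableComparable: a key type must be both
// serializable (Writable) and sortable (Comparable).
public interface WritableComparable<T> extends Writable, Comparable<T> {
}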
Writable.java
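Similarly, the Writable interface is essentially:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// org.apache.hadoop.io.Writable: the framework calls write() to serialize
// a value and readFields() to deserialize it.
public interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}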
Configuring the reduce phase
• To configure the reduce phase, the user must
supply the framework with five pieces of
information:
– The number of reduce tasks; if zero, no reduce
phase is run
– The class supplying the reduce method
– The input key and value types for the reduce task;
by default, the same as the reduce output
– The output key and value types for the reduce task
– The output file type for the reduce task output
• The time spent sorting the keys for each output file is a function
of the number of keys.
• In addition, the number of reduce tasks determines the
number of output files and the maximum degree of parallelism
in the reduce phase.
• The framework generally has a default number of reduce tasks
configured. This value is set by the mapred.reduce.tasks
parameter, which defaults to 1.
• This will result in a single output file containing all of the output
keys, in sorted order.
• There will be one reduce task, run on a single machine that
processes every key.
• The number of reduce tasks is commonly set in the
configuration phase of a job.
• conf.setNumReduceTasks(1);
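A corresponding driver fragment for the reduce phase (MyReducer is a placeholder; the output key/value types were set earlier):

conf.setNumReduceTasks(1);                    // 0 would skip the reduce phase entirely
conf.setReducerClass(MyReducer.class);        // class supplying the reduce method
conf.setOutputFormat(TextOutputFormat.class); // output file type for the reduce task output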
Running a Job
• The method runJob() submits the
configuration information to the framework
and waits for the framework to finish running
the job.
• The response is provided in the job object.
• The RunningJob class provides a number of
methods for examining the response.
• The most useful is job.isSuccessful().
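A minimal sketch of submitting the job and checking the result (inside a driver main() that declares throws Exception; JobClient and RunningJob are in org.apache.hadoop.mapred):

RunningJob job = JobClient.runJob(conf);   // blocks until the job finishes
if (job.isSuccessful()) {
  System.out.println("The job completed successfully.");
} else {
  System.err.println("The job failed.");
}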
