Part 5


12. What is Text Input Format?

TextInputFormat is the default InputFormat for plain text files in a MapReduce job. Files are broken into lines, where the key is the byte offset of the line within the file and the value is the line of text itself. Compressed input files (for example, files with a .gz extension) are not splittable and are each processed by a single mapper. Programmers can also write their own InputFormat.
The hierarchy is:

java.lang.Object
org.apache.hadoop.mapreduce.InputFormat<K,V>
org.apache.hadoop.mapreduce.lib.input.FileInputFormat<LongWritable,Text>
org.apache.hadoop.mapreduce.lib.input.TextInputFormat
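
A minimal driver sketch (the class name and input path are placeholders, not from this article) showing how TextInputFormat could be set explicitly on a job, even though it is already the default:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class TextInputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "text-input-example");
        job.setJarByClass(TextInputDriver.class);
        // TextInputFormat is the default, but it can be set explicitly.
        // Key = byte offset of the line (LongWritable), value = the line itself (Text).
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... mapper, reducer and output settings would follow here ...
    }
}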
13. What is JobTracker?

JobTracker is the Hadoop service that coordinates the processing of MapReduce jobs in the cluster. It submits and tracks jobs, assigning work to the specific nodes that hold the data. Only one JobTracker runs per Hadoop cluster, in its own JVM process. If the JobTracker goes down, all running jobs halt.

14. Explain job scheduling through JobTracker.

The JobTracker communicates with the NameNode to identify the location of the data and then submits the work to a TaskTracker node. The TaskTracker plays a major role, as it notifies the JobTracker of any task failure. It also acts as a heartbeat reporter, reassuring the JobTracker that it is still alive. Based on these reports, the JobTracker decides on the next action: it may resubmit the job, mark a specific record as unreliable, or blacklist the TaskTracker.

15. What is SequenceFileInputFormat?

SequenceFileInputFormat is an input format for reading sequence files, a compressed binary file format; it extends FileInputFormat. It is used to pass data between MapReduce jobs, where the output of one MapReduce job serves as the input of another, as the sketch below illustrates.
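
A hedged sketch (the job names and intermediate path are hypothetical, and a Configuration object named conf is assumed) of chaining two jobs through sequence files: the first job writes with SequenceFileOutputFormat and the second reads the same path with SequenceFileInputFormat.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Job 1: write intermediate results as a sequence file.
Job stage1 = Job.getInstance(conf, "stage-1");
stage1.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setOutputPath(stage1, new Path("/tmp/intermediate"));

// Job 2: read the sequence file produced by job 1.
Job stage2 = Job.getInstance(conf, "stage-2");
stage2.setInputFormatClass(SequenceFileInputFormat.class);
SequenceFileInputFormat.addInputPath(stage2, new Path("/tmp/intermediate"));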

16. How to set mappers and reducers for Hadoop jobs?

Users can configure the JobConf variable to set the number of mappers and reducers:

job.setNumMapTasks()
job.setNumReduceTasks()
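
A brief sketch assuming the older org.apache.hadoop.mapred.JobConf API. Note that the number of map tasks is only a hint to the framework (the actual count is driven by the number of input splits), whereas the number of reduce tasks is honored exactly; the newer org.apache.hadoop.mapreduce.Job API only offers setNumReduceTasks().

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(WordCountDriver.class); // WordCountDriver is a placeholder driver class
conf.setNumMapTasks(10);    // a hint only; actual count depends on input splits
conf.setNumReduceTasks(4);  // honored exactly by the framework
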
17. Explain JobConf in MapReduce.

JobConf is the primary interface for describing a MapReduce job to the Hadoop framework for execution. JobConf specifies the Mapper, Combiner, Partitioner, Reducer, InputFormat, and OutputFormat implementations, along with other advanced job facets like Comparators.

18. What is a MapReduce Combiner?

Also known as a semi-reducer, the Combiner is an optional class that combines the map output records sharing the same key. The main function of a combiner is to accept the outputs of the Map class and pass those key-value pairs on to the Reducer class, as sketched below.
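
A minimal word-count style sketch (the class names are illustrative) of registering a combiner; the reducer class is commonly reused as the combiner when the operation is associative and commutative:

// TokenizerMapper and IntSumReducer are hypothetical classes, as in the classic word count.
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);  // runs locally on each mapper's output
job.setReducerClass(IntSumReducer.class);   // same class reused as the final reducer
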
19. What is RecordReader in a Map Reduce?

RecordReader is used to read key/value pairs from an InputSplit, converting the byte-oriented view of the input into a record-oriented view that is presented to the Mapper.

20. Define Writable data types in MapReduce.

Hadoop reads and writes data in serialized form through the Writable interface. The Writable interface has several implementations such as Text (for storing String data), IntWritable, LongWritable, FloatWritable, and BooleanWritable. Users are free to define their own Writable classes as well, as sketched below.
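
A hedged sketch of a user-defined Writable (the class and its fields are made up for illustration), implementing the two required methods, write() and readFields():

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PointWritable implements Writable {
    private int x;
    private int y;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(x);   // serialize fields in a fixed order
        out.writeInt(y);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        x = in.readInt();  // deserialize in the same order
        y = in.readInt();
    }
}

If such a class is used as a key rather than a value, it must implement WritableComparable instead, so that keys can be sorted.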

21. What is OutputCommitter?

OutputCommitter describes the commit of task output for a MapReduce job. FileOutputCommitter is the default class available for OutputCommitter in MapReduce. It performs the following operations:

Creates a temporary output directory for the job during initialization.
Cleans up the job by removing the temporary output directory after job completion.
Sets up the task's temporary output.
Identifies whether a task needs a commit, and applies the commit if required.

JobSetup, JobCleanup and TaskCleanup are important tasks during output commit.

22. What is a map in Hadoop?

In Hadoop, a map is a phase of solving a query on HDFS data. A map reads data from an input location and outputs a key-value pair according to the input type.

23. What is a reducer in Hadoop?

In Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a
final output of its own.

24. What are the parameters of mappers and reducers?

The four parameters for mappers are:

LongWritable (input)
Text (input)
Text (intermediate output)
IntWritable (intermediate output)

The four parameters for reducers are:

Text (intermediate output)
IntWritable (intermediate output)
Text (final output)
IntWritable (final output)
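
These parameters correspond to the generic type arguments of the Mapper and Reducer classes. A standard word-count sketch (the classic example, not code from this article) illustrating the signatures:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<LongWritable, Text, Text, IntWritable>: (offset, line) -> (word, 1)
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer<Text, IntWritable, Text, IntWritable>: (word, [1, 1, ...]) -> (word, count)
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}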
25. What are the key differences between Pig vs MapReduce?

Pig is a data flow language; its key focus is to manage the flow of data from the input source to the output store. As part of managing this data flow, it moves data between processing steps, feeding the output of one process into the next. Its core features include preventing the execution of subsequent stages if a previous stage fails, managing temporary storage of data and, most importantly, compressing and rearranging processing steps for faster execution. While this could be done for any kind of processing task, Pig is written specifically for managing the data flow of MapReduce-type jobs; most, if not all, jobs in Pig are MapReduce jobs or data-movement jobs. Pig also allows custom functions to be added for processing, and ships with default ones such as ordering, grouping, distinct, and count.

MapReduce, on the other hand, is a data processing paradigm: it is a framework in which application developers write code so that it scales easily to petabytes of data. This creates a separation between the developer who writes the application and the developer who scales it. Not all applications can be migrated to MapReduce, but a good few can be, ranging from complex ones like k-means clustering to simple ones like counting unique values in a dataset.
