Big Data-Module 1 - VTU July 2019 Solved Paper
Module 1
1. How does the Hadoop MapReduce Data flow work for a word count program? Give an
example.
Parallel execution of MapReduce requires other steps in addition to the mapper and
reducer processes.
The basic steps are as follows:
1. Input Splits:
HDFS distributes and replicates data over multiple servers. The default data
chunk or block size is 64MB. Thus, a 500MB file would be broken into 8 blocks
and written to different machines in the cluster.
The data are also replicated on multiple machines (typically three machines). These
data slices are physical boundaries determined by HDFS and have nothing to do
with the data in the file. Also, while not considered part of the MapReduce process,
the time required to load and distribute data throughout HDFS servers can be
considered part of the total processing time.
The input splits used by MapReduce are logical boundaries based on the input data.
2. Map Step:
The mapping process is where the parallel nature of Hadoop comes into play. For
large amounts of data, many mappers can be operating at the same time.
The user provides the specific mapping process. MapReduce will try to execute the
mapper on the machines where the block resides. Because the file is replicated in
HDFS, the least busy node with the data will be chosen.
If all nodes holding the data are too busy, MapReduce will try to pick a node that is
closest to the node that hosts the data block (a characteristic called rack-awareness).
The last choice is any node in the cluster that has access to HDFS.
3. Combiner Step:
It is possible to provide an optimization or pre-reduction as part of the map stage
where key–value pairs are combined prior to the next stage. The combiner stage is
optional.
4. Shuffle Step:
Before the parallel reduction stage can complete, all similar keys must be combined
and counted by the same reducer process.
Therefore, results of the map stage must be collected by key–value pairs and shuffled
to the same reducer process.
If only a single reducer process is used, the shuffle stage is not needed.
5. Reduce Step:
The final step is the actual reduction. In this stage, the data reduction is performed as
per the programmer’s design.
The reduce step is also optional. The results are written to HDFS. Each reducer will
write an output file.
For example, a MapReduce job running four reducers will create output files called part-0000, part-0001, part-0002, and part-0003.
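As a rough illustration of how these steps are wired together, below is a minimal word count driver sketch that uses Hadoop's built-in TokenCounterMapper and IntSumReducer classes and requests four reduce tasks; the input and output paths come from the command line, and the four-reducer setting is an illustrative choice, not a requirement.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

// Minimal word count driver sketch; paths and reducer count are illustrative.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenCounterMapper.class); // emits (word, 1) for each token
        job.setCombinerClass(IntSumReducer.class);    // optional pre-reduction (combiner step)
        job.setReducerClass(IntSumReducer.class);     // sums the counts for each word
        job.setNumReduceTasks(4);                     // four reducers -> four output files

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With the newer MapReduce API, the four reducers would by default write output files named part-r-00000 through part-r-00003.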
The figure below shows an example of a simple Hadoop MapReduce data flow for a word count
program. The map process counts the words in each split, and the reduce process calculates the
total for each word; the actual computation of the map and reduce stages is up to the
programmer.
The input to the MapReduce application is a file in HDFS with three lines
of text. The goal is to count the number of times each word is used.
The first thing MapReduce will do is create the data splits. For simplicity, each line
will be one split. Since each split requires a map task, there are three mapper
processes that count the number of words in each split.
On a cluster, the results of each map task are written to local disk and not to HDFS.
Next, similar keys need to be collected and sent to a reducer process. The shuffle step
requires data movement and can be expensive in terms of processing time.
Depending on the nature of the application, the amount of data that must be shuffled
throughout the cluster can vary from small to large.
Once the data have been collected and sorted by key, the reduction step can begin. In
some cases, a single reducer will provide adequate performance. In other cases,
multiple reducers may be required to speed up the reduce phase. The number of
reducers is a tunable option for many applications. The final step is to write the output
to HDFS.
A combiner step enables some pre-reduction of the map output data. For instance, in
the previous example, one map produced the counts (run,1), (spot,1), (run,1). The
combiner would pre-reduce these to (run,2), (spot,1), so less data needs to be shuffled
to the reducer.
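A minimal sketch of a hand-written mapper and sum reducer for the same word count is shown below; the class names are illustrative, and the reducer class doubles as the combiner so that one mapper's (run,1), (spot,1), (run,1) is folded into (run,2), (spot,1) before the shuffle.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map step: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: sum the partial counts for one word.
    // Registered as both combiner and reducer.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            total.set(sum);
            context.write(key, total);
        }
    }
}

Reusing the reducer as the combiner is safe here because addition is associative and commutative, which is the usual requirement for a combiner.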
2. Briefly explain HDFS Name Node Federation, NFS Gateway, Snapshots, Checkpoint
and Backups.
HDFS Name Node Federation
Another important feature of HDFS is Name Node Federation.
Older versions of HDFS provided a single namespace for the entire cluster managed
by a single Name Node. Thus, the resources of a single Name Node determined the size of
the namespace.
Federation addresses this limitation by adding support for multiple Name Nodes /
namespaces to the HDFS file system.
The key benefits are as follows:
1. Namespace scalability: HDFS cluster storage scales horizontally without placing a
burden on the Name Node.
2. Better performance: Adding more Name Nodes to the cluster scales the file system
read/write operations throughput by separating the total namespace.
3. System Isolation: Multiple Name Nodes enable different categories of applications to
be distinguished, and users can be isolated to different namespaces.
In Fig 3.4, Name Node1 manages the /research and /marketing namespaces, and Name Node2
manages the /data and /project namespaces.
The Name Nodes do not communicate with each other, and the Data Nodes "just store data
blocks" as directed by either Name Node.
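From a client's point of view, federation simply means addressing whichever Name Node serves the required namespace. The sketch below assumes hypothetical NameNode host names namenode1 and namenode2 on the default RPC port; the paths match the example above.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of a client touching two federated namespaces; host names are placeholders.
public class FederationClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Namespace served by Name Node1 (e.g. /research and /marketing).
        FileSystem fs1 = FileSystem.get(URI.create("hdfs://namenode1:8020/"), conf);
        for (FileStatus status : fs1.listStatus(new Path("/research"))) {
            System.out.println("nn1: " + status.getPath());
        }

        // Namespace served by Name Node2 (e.g. /data and /project).
        FileSystem fs2 = FileSystem.get(URI.create("hdfs://namenode2:8020/"), conf);
        for (FileStatus status : fs2.listStatus(new Path("/data"))) {
            System.out.println("nn2: " + status.getPath());
        }
    }
}

In practice, a client-side mount table (ViewFs) is usually configured so that the separate namespaces appear to users as a single file system view.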
NFS Gateway
The NFS Gateway allows HDFS to be mounted as part of a client's local file system using the
NFSv3 protocol. Users can then browse the HDFS namespace and upload, download, or stream
HDFS data with standard tools, as if it were a local directory.
HDFS Snapshots
Snapshots are created by administrators using the hdfs dfs -createSnapshot command (the
target directory must first be made snapshottable with hdfs dfsadmin -allowSnapshot).
They are read-only point-in-time copies of the file system.
HDFS Snapshots offer the following features:
Snapshots can be taken of a sub-tree of the file system or of the entire file system
They can be used for data backup, protection against user errors, and disaster
recovery
Snapshot creation is instantaneous
Snapshot files record the block list and the file size
Snapshots do not adversely affect regular HDFS operations
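Snapshots can also be managed through the HDFS Java API. The sketch below assumes fs.defaultFS points at HDFS (so the cast to DistributedFileSystem succeeds) and uses a hypothetical directory and snapshot name.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Snapshot sketch; the directory and snapshot name are placeholders,
// and allowing snapshots requires administrator privileges.
public class SnapshotSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/user/project"); // hypothetical snapshottable directory

        // An administrator first marks the directory as snapshottable.
        ((DistributedFileSystem) fs).allowSnapshot(dir);

        // Create a read-only, point-in-time copy; creation is instantaneous.
        Path snapshot = fs.createSnapshot(dir, "backup-2019-07");
        System.out.println("Created snapshot at " + snapshot);
    }
}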
HDFS Checkpoint and Backups
The Name Node stores its metadata in an fsimage file, with changes recorded in an edits log.
A Checkpoint node (also known as the SecondaryNameNode) periodically downloads the fsimage
and edits files from the active Name Node, merges them into a new fsimage, and uploads it back.
An HDFS Backup node goes further: it maintains an up-to-date copy of the file system namespace
both in memory and on disk.
Since it has an up-to-date namespace state in memory, it does not need to download the fsimage
and edits files from the active Name Node.
A Name Node supports one Backup node at a time.
No Checkpoint nodes may be registered if a Backup node is in use.
3. What do you understand by HDFS? Explain all its components with a neat diagram
The Hadoop Distributed File System (HDFS) was designed for Big Data processing.
Although capable of supporting many users simultaneously, HDFS is not designed as
a true parallel file system. Rather, the design assumes a large file write-once/read-many model.
HDFS rigorously restricts data writing to one user at a time.
Bytes are always appended to the end of a stream, and byte streams are guaranteed to
be stored in the order written.
HDFS is designed for data streaming where large amounts of data are read from
disk in bulk.
The HDFS block size is typically 64MB or 128MB.
Due to the sequential nature of the data, there is no local caching mechanism. The large
block and file sizes make it more efficient to reread data from HDFS than to try to
cache the data.
When a client writes data, it first communicates with the NameNode and requests to
create a file. The NameNode determines how many blocks are needed and provides
the client with the DataNodes that will store the data.
As part of the storage process, the data blocks are replicated after they are written to
the assigned node.
Depending on how many nodes are in the cluster, the NameNode will attempt to write
replicas of the data blocks on nodes that are in other separate racks. If there is only
one rack, then the replicated blocks are written to other servers in the same rack.
After the Data Node acknowledges that the file block replication is complete, the
client closes the file and informs the NameNode that the operation is complete.
Note that the NameNode does not write any data directly to the DataNodes. It
does, however, give the client a limited amount of time to complete the operation. If
it does not complete in the time period, the operation is cancelled.
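The write path described above is hidden behind the FileSystem API. A minimal client-side sketch, assuming fs.defaultFS points at the cluster and using a hypothetical path, looks like this:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Client-side write sketch: create() contacts the NameNode for block placement,
// the stream writes to the assigned DataNodes, and close() completes the file.
public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/notes.txt"); // hypothetical output path

        try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
            out.write("example text\n".getBytes(StandardCharsets.UTF_8));
        } // closing the file tells the NameNode the operation is complete
    }
}

Note that the data bytes never pass through the NameNode; the stream pipelines the blocks directly to the DataNodes, which handle replication.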
The client requests a file from the NameNode, which returns the best DataNodes from
which to read the data. The client then accesses the data directly from the DataNodes.
Thus, once the metadata has been delivered to the client, the NameNode steps back
and lets the conversation between the client and the DataNodes proceed. While data
transfer is progressing, the NameNode also monitors the DataNodes by listening for
heartbeats sent from DataNodes.
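The corresponding read path is just as simple from the client's perspective. The sketch below opens the same hypothetical file and streams it line by line, with the NameNode supplying the block locations behind the scenes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Client-side read sketch: open() asks the NameNode for the best DataNodes,
// then the stream pulls the bytes directly from those DataNodes.
public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/notes.txt"); // hypothetical input path

        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // data streamed from DataNodes
            }
        }
    }
}

Because each block is replicated on several DataNodes, the client library can fall back to another replica if a read from one DataNode fails.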
The lack of a heartbeat signal indicates a node failure. Hence the NameNode will
route around the failed Data Node and begin re-replicating the now-missing blocks.
The mappings between data blocks and physical DataNodes are not kept in persistent
storage on the NameNode. The NameNode stores all metadata in memory.
In almost all Hadoop deployments, there is a SecondaryNameNode (Checkpoint
Node). It is not an active failover node and cannot replace the primary NameNode in
case of its failure.
Thus the various important roles in HDFS are:
HDFS uses a master/slave model designed for large file reading or streaming.
The NameNode is a metadata server or “Data traffic cop”.
HDFS provides a single namespace that is managed by the NameNode.
Data is redundantly stored on DataNodes; there is no data on the NameNode.
Secondary NameNode performs checkpoints of NameNode file system’s
state but is not a failover node.
4. Bring out the concepts of HDFS block replication, with an example
When HDFS writes a file, it is replicated across the cluster. The amount of replication
is based on the value of dfs.replication in the hdfs-site.xml file.
For Hadoop clusters containing more than eight DataNodes, the replication value is
usually set to 3. In a Hadoop cluster with fewer DataNodes but more than one DataNode, a
replication factor of 2 is adequate. For a single machine, as in pseudo-distributed mode, the
replication factor is set to 1.
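Besides editing hdfs-site.xml, the replication factor can be influenced from a client. The sketch below is illustrative only, with hypothetical paths and values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Replication sketch: dfs.replication can be overridden for a client session,
// and the replication of an existing file can be changed with setReplication().
public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 2); // e.g. a small multi-node cluster

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/data.csv"); // hypothetical existing file

        // Files created through this client use the factor above; an existing
        // file can be adjusted explicitly, and the NameNode then adds or
        // removes replicas in the background.
        boolean requested = fs.setReplication(file, (short) 3);
        System.out.println("Replication change requested: " + requested);
    }
}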
If several machines must be involved in the serving of a file, then the file could be rendered
unavailable by the loss of any one of those machines. HDFS solves this problem by
replicating each block across a number of machines.
HDFS default block size is often 64MB. If a file size 80MB is written to HDFS, a 64MB
block and a 16MB block will be created.
The HDFS blocks are based on size, while the splits are based on a logical partitioning of
the data.
For instance, if a file contains discrete records, the logical split ensures that a record is not split
physically across two separate servers during processing. Each HDFS block consists of one
or more splits.
The figure above provides an example of how a file is broken into blocks and replicated
across the cluster. In this case, a replication factor of 3 ensures that any one DataNode can fail
and the replicated blocks will still be available on other nodes and subsequently re-replicated on
other DataNodes.
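To see where a file's blocks and their replicas actually landed, a client can query the block locations; the file path in the sketch below is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Block location sketch: list each block of a file and the DataNodes
// holding its replicas.
public class BlockLocationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/big-file.dat"));

        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            System.out.printf("block %d (offset %d, length %d) hosts: %s%n",
                    i, blocks[i].getOffset(), blocks[i].getLength(),
                    String.join(", ", blocks[i].getHosts()));
        }
    }
}

The hdfs fsck command with the -files -blocks -locations options reports the same placement information from the command line.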