
UNIT 2: INTRODUCTION TO HADOOP AND HADOOP ARCHITECTURE
Explain the core components of Hadoop. (5 Marks).
Following are the core components of Hadoop architecture:
1. Hadoop Distributed File System (HDFS)
One of the most critical components of Hadoop architecture is the Hadoop Distributed File System
(HDFS). HDFS is the primary storage system used by Hadoop applications. It’s designed to scale to
petabytes of data and runs on commodity hardware. What sets HDFS apart is its ability to maintain
large data sets across multiple nodes in a distributed computing environment.
HDFS operates on the basic principle of storing large files across multiple machines. It achieves high
throughput by dividing large data into smaller blocks, which are managed by different nodes in the
network. This nature of HDFS makes it an ideal choice for applications with large data sets.
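As a brief illustration, the following sketch uses the Hadoop FileSystem Java API to write and read a file stored in HDFS; the path /user/example/hello.txt and the class name are only illustrative. The application works with ordinary streams, while HDFS splits the file into blocks and replicates them across DataNodes behind this interface.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS (the NameNode address) from the cluster configuration.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/example/hello.txt");

        // Write: HDFS transparently splits the stream into blocks
        // and replicates each block across DataNodes.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("Hello HDFS");
        }

        // Read: the client fetches the blocks from whichever DataNodes hold them.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}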
2. Yet Another Resource Negotiator (YARN)
Yet Another Resource Negotiator (YARN) is responsible for managing resources in the cluster and
scheduling tasks for users. It is a key element in Hadoop architecture as it allows multiple data
processing engines such as interactive processing, graph processing, and batch processing to handle
data stored in HDFS.
YARN separates the functionalities of resource management and job scheduling into separate daemons: a global ResourceManager, a per-node NodeManager, and a per-application ApplicationMaster. This design ensures a more scalable and flexible Hadoop architecture, accommodating a broader range of processing approaches and applications.
3. MapReduce Programming Model
MapReduce is a programming model integral to Hadoop architecture. It is designed to process large
volumes of data in parallel by dividing the work into a set of independent tasks. The MapReduce model
simplifies the processing of vast data sets, making it an indispensable part of Hadoop.
MapReduce is characterized by two primary tasks, Map and Reduce. The Map task takes a set of data
and converts it into another set of data, where individual elements are broken down into tuples. On the
other hand, the Reduce task takes the output from the Map as input and combines those tuples into a
smaller set of tuples.
4. Hadoop Common
Hadoop Common, often referred to as the ‘glue’ that holds Hadoop architecture together, contains
libraries and utilities needed by other Hadoop modules. It provides the necessary Java files and scripts
required to start Hadoop. This component plays a crucial role in ensuring that hardware failures are handled by the Hadoop framework itself, offering a high degree of resilience and reliability.

Explain MapReduce architecture in brief. (5 Marks)


• Hadoop MapReduce is the processing unit of Hadoop.
• In the MapReduce approach, the processing is done at the slave nodes, and the result is sent to the
master node.
• Rather than moving the data to the computation, the processing code is sent to the nodes where the data resides. This code is usually very small in comparison to the data itself.
• Only a few kilobytes of code need to be sent across the network to perform heavy-duty processing on large volumes of data.
• MapReduce program executes in three stages, namely map stage, shuffle stage, and reduce stage.
• Map stage: The map or mapper's job is to process the input data. Generally, the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
• Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
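As an illustration of these stages, here is a minimal word-count sketch using the standard org.apache.hadoop.mapreduce API; the class names TokenizerMapper and SumReducer are only illustrative:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: each input line is split into words and emitted as (word, 1) pairs.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce stage: after the shuffle groups the pairs by word,
// the counts for each word are summed.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}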
1. Apply MapReduce on the following document to count the frequency of words. Show all the
phases properly. (5 Marks)
Welcome to Hadoop class
Hadoop is good
Hadoop is bad
Dog Cat Mouse
Dog Dog Cat
Dog Cat Duck

Bus Car Train


Train Plane Car
Bus Bus Plane

Deer Bear River


Car Car River
Deer Car Bear
The quick brown fox
The fox ate the mouse
Now how brown cow
MapReduce typically consists of two main phases: the Map phase and the Reduce phase.
Map Phase:
In the Map phase, each input record is processed independently to generate a set of intermediate key-
value pairs.
For our document:
Welcome to Hadoop class
Hadoop is good
Hadoop is bad
We'll generate key-value pairs where the key is the word and the value is the count of that word. Here's
how the Map phase will look for each line:
1. Welcome to Hadoop class
o (Welcome, 1)
o (to, 1)
o (Hadoop, 1)
o (class, 1)
2. Hadoop is good
o (Hadoop, 1)
o (is, 1)
o (good, 1)
3. Hadoop is bad
o (Hadoop, 1)
o (is, 1)
o (bad, 1)
Reduce Phase:
In the Reduce phase, the intermediate key-value pairs generated by the Map phase are first grouped by key during the shuffle and sort step, and the grouped counts are then summed to produce the final output, which is the word frequency.
For each unique word, we'll sum up the counts from all occurrences to get the total frequency. Here's
how the Reduce phase will look:
• (Welcome, [1])
• (to, [1])
• (Hadoop, [1, 1, 1])
• (class, [1])
• (is, [1, 1])
• (good, [1])
• (bad, [1])
Finally, we'll aggregate the counts for each word:
• Welcome: 1
• to: 1
• Hadoop: 3
• class: 1
• is: 2
• good: 1
• bad: 1
So, the word frequencies are:
• Welcome: 1
• to: 1
• Hadoop: 3
• class: 1
• is: 2
• good: 1
• bad: 1
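The same steps apply to the next input set:
Dog Cat Mouse
Dog Dog Cat
Dog Cat Duck
Map phase: (Dog, 1), (Cat, 1), (Mouse, 1); (Dog, 1), (Dog, 1), (Cat, 1); (Dog, 1), (Cat, 1), (Duck, 1)
Reduce phase: (Dog, [1, 1, 1, 1]), (Cat, [1, 1, 1]), (Mouse, [1]), (Duck, [1])
So, the word frequencies are:
• Dog: 4
• Cat: 3
• Mouse: 1
• Duck: 1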
Here's how the Map phase will look for each line:
1. Bus Car Train
o (Bus, 1)
o (Car, 1)
o (Train, 1)
2. Train Plane Car
o (Train, 1)
o (Plane, 1)
o (Car, 1)
3. Bus Bus Plane
o (Bus, 1)
o (Bus, 1)
o (Plane, 1)
Reduce Phase:
In the Reduce phase, the intermediate key-value pairs generated by the Map phase are processed to
produce the final output, which is the word frequency.
For each unique word, we'll sum up the counts from all occurrences to get the total frequency. Here's
how the Reduce phase will look:
• (Bus, [1, 1, 1])
• (Car, [1, 1])
• (Train, [1, 1])
• (Plane, [1, 1])
Finally, we'll aggregate the counts for each word:
• Bus: 3
• Car: 2
• Train: 2
• Plane: 2
So, the word frequencies are:
• Bus: 3
• Car: 2
• Train: 2
• Plane: 2
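Similarly, for the input set:
Deer Bear River
Car Car River
Deer Car Bear
Map phase: (Deer, 1), (Bear, 1), (River, 1); (Car, 1), (Car, 1), (River, 1); (Deer, 1), (Car, 1), (Bear, 1)
Reduce phase: (Deer, [1, 1]), (Bear, [1, 1]), (River, [1, 1]), (Car, [1, 1, 1])
So, the word frequencies are:
• Deer: 2
• Bear: 2
• River: 2
• Car: 3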
Here's how the Map phase will look for each line:
1. The quick brown fox
o (The, 1)
o (quick, 1)
o (brown, 1)
o (fox, 1)
2. The fox ate the mouse
o (The, 1)
o (fox, 1)
o (ate, 1)
o (the, 1)
o (mouse, 1)
3. Now how brown cow
o (Now, 1)
o (how, 1)
o (brown, 1)
o (cow, 1)
Reduce Phase:
In the Reduce phase, the intermediate key-value pairs generated by the Map phase are processed to
produce the final output, which is the word frequency.
For each unique word, we'll sum up the counts from all occurrences to get the total frequency. Here's
how the Reduce phase will look:
• (The, [1, 1])
• (quick, [1])
• (brown, [1, 1])
• (fox, [1, 1])
• (ate, [1])
• (the, [1])
• (mouse, [1])
• (Now, [1])
• (how, [1])
• (cow, [1])
Finally, we'll aggregate the counts for each word:
• The: 2
• quick: 1
• brown: 2
• fox: 2
• ate: 1
• the: 1
• mouse: 1
• Now: 1
• how: 1
• cow: 1
So, the word frequencies are:
• The: 2
• quick: 1
• brown: 2
• fox: 2
• ate: 1
• the: 1
• mouse: 1
• Now: 1
• how: 1
• cow: 1
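For completeness, a driver along the following lines would wire a mapper and reducer such as the TokenizerMapper and SumReducer sketched earlier into a single job and run it over the input files; the input and output paths shown are only illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);   // map stage
        job.setCombinerClass(SumReducer.class);      // optional local pre-aggregation
        job.setReducerClass(SumReducer.class);       // reduce stage
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/example/wordcount/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/example/wordcount/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The files written to the output directory then contain one word and its total count per line, matching the frequencies derived above.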

Write the algorithm for Matrix Multiplication using MapReduce.
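One standard formulation using a single MapReduce job, assuming A is an m × n matrix, B is an n × p matrix, and every input record carries (matrix name, row index, column index, value), is:
1. Map: for every element A[i][j], emit p pairs with key (i, k) and value ("A", j, A[i][j]) for k = 1 … p; for every element B[j][k], emit m pairs with key (i, k) and value ("B", j, B[j][k]) for i = 1 … m. Each key (i, k) identifies one cell of the result matrix C.
2. Shuffle and sort: all values emitted for the same key (i, k) are grouped together, so a single reduce call receives every element of row i of A and every element of column k of B.
3. Reduce: for key (i, k), build two lists indexed by j, a[j] = A[i][j] from the "A" values and b[j] = B[j][k] from the "B" values; compute C[i][k] = Σ (j = 1 … n) a[j] × b[j] and emit ((i, k), C[i][k]).
The output of the reduce stage is the complete set of cells of C = A × B.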


Write the HDFS commands for the following with suitable example: (5 Marks)
i. To display recursively the contents in the directory.
ii. To copy files/folders from local file system to HDFS store.
iii. To copy files/folders from HDFS store to local file system.
iv. To give the size of each file in a directory.
v. To change the replication factor of a file/directory in HDFS.
Here are the HDFS commands for the given tasks along with suitable examples:
i. To display recursively the contents in the directory:
hdfs dfs -ls -R /path/to/directory
Example:
hdfs dfs -ls -R /user/example/directory
This command will display the contents of the specified directory recursively.

ii. To copy files/folders from local file system to HDFS store:


hdfs dfs -put /path/to/local/file_or_folder /path/to/HDFS/destination
Example:
hdfs dfs -put /home/user/example.txt /user/example/
This command will copy the file "example.txt" from the local file system to the HDFS
destination directory "/user/example/".

iii. To copy files/folders from HDFS store to local file system:


hdfs dfs -get /path/to/HDFS/file_or_folder /path/to/local/destination
Example:
hdfs dfs -get /user/example/example.txt /home/user/
This command will copy the file "example.txt" from the HDFS directory "/user/example/" to
the local destination directory "/home/user/".

iv. To give the size of each file in a directory:


hdfs dfs -du /path/to/directory
Example:
hdfs dfs -du /user/example/directory
This command will display the size of each file in the specified directory.
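Adding the -h flag (hdfs dfs -du -h /user/example/directory) prints the sizes in human-readable units, and -s shows a single summarized total for the directory instead of per-file sizes.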

v. To change the replication factor of a file/directory in HDFS:


hdfs dfs -setrep -R <replication_factor> /path/to/file_or_directory
Example:
hdfs dfs -setrep -R 3 /user/example/example.txt
This command will change the replication factor of the file "example.txt" to 3 in HDFS.
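An optional -w flag (for example, hdfs dfs -setrep -w 3 /user/example/example.txt) makes the command wait until the new replication factor has actually been achieved for every block of the file.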
