UNIT-II

Hadoop Distributed File System


Hadoop Distributed File System:
Overview:
 Hadoop Ecosystem,
 Hadoop Architecture,
 HDFS Concepts,
 Blocks, Namenodes and Datanodes,
 Hadoop FileSystems,
 The Java Interface,
 Reading Data from a Hadoop URL,
 Writing Data,
 Querying the FileSystem,
 Deleting Data,
 Anatomy of File Read and Write.
Hadoop Ecosystem:
The Hadoop ecosystem is a collection of various tools and frameworks that work together to manage,
store, process, and analyze large datasets efficiently.
Developed under Apache, Hadoop and its ecosystem are foundational in Big Data Analytics because they
address the core challenges of handling big data: volume, velocity, and variety.
Here's an overview of the Hadoop ecosystem components and their roles:
 Hadoop Ecosystem Components
 1. Hadoop Distributed File System
 It is the most important component of the Hadoop ecosystem. HDFS is the primary
storage system of Hadoop. The Hadoop Distributed File System (HDFS) is a Java-based file
system that provides scalable, fault-tolerant, reliable, and cost-efficient data storage
for big data. HDFS is a distributed filesystem that runs on commodity hardware.
HDFS consists of two core components i.e.
• Name node
• Data Node
• The Name Node is the prime (master) node; it holds the metadata (data about data) and therefore
requires comparatively fewer resources than the Data Nodes, which store the actual data.
• Data Nodes run on commodity hardware in the distributed environment, which makes
Hadoop cost effective.
• HDFS maintains all the coordination between the cluster and its hardware, thus working at the
heart of the system.
 2.YARN:
• YARN (Yet Another Resource Negotiator), as the name implies, helps manage the
resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop
system.
• It consists of three major components:
• Resource Manager
• Node Manager
• Application Master
 3.MapReduce:
• By making use of distributed and parallel algorithms, MapReduce makes it possible to express
the processing logic and helps to write applications that transform big data sets into manageable
results.
• MapReduce uses two functions, Map() and Reduce(), whose tasks are (see the word-count sketch after this list):
 The Map function takes a set of data and converts it into another set of data, where individual elements
are broken down into tuples (key/value pairs).
 The Reduce function takes the output from the Map as input and combines those data tuples based on
the key, modifying the value of the key accordingly.
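To make these two roles concrete, here is a minimal word-count sketch written against the org.apache.hadoop.mapreduce API. The class and field names are illustrative, and the job-configuration and driver boilerplate are omitted; it is a sketch of the idea rather than a complete job.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map(): break each input line into (word, 1) key/value tuples
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);   // emit (word, 1)
      }
    }
  }

  // Reduce(): combine all tuples that share a key by summing their values
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // emit (word, total count)
    }
  }
}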
 4.PIG:
Pig was developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.
• It is a platform for structuring the data flow and for processing and analyzing huge data sets.
• Pig executes the commands, and in the background all the activities of
MapReduce are taken care of. After the processing, Pig stores the result in HDFS.
5. HIVE:
• Using an SQL-like methodology and interface, Hive performs reading and writing of
large data sets. Its query language is called HQL (Hive Query Language).
• It is highly scalable, as it allows both real-time and batch processing. Also, all
the SQL datatypes are supported by Hive, making query processing easier.
• Like other query-processing frameworks, Hive comes with two
components: JDBC drivers and the Hive command line.
 6. Mahout:
• Mahout adds machine-learning capability to a system or application.
Machine learning, as the name suggests, helps a system develop itself based
on patterns, user/environment interaction, or algorithms.
 7. Apache Spark:
• It is a platform that handles processing-intensive tasks such as batch processing,
interactive or iterative real-time processing, graph processing, and visualization.
 8.Apache HBase:
• It is a NoSQL database that supports all kinds of data and is thus capable of
handling almost any workload in a Hadoop database. It provides capabilities similar to Google’s
BigTable and so can work on big data sets effectively.
Hadoop Architecture:
 The figure shows the client, the master NameNode, the primary and secondary
master nodes, and the slave nodes in the Hadoop physical architecture.

 Hadoop Distributed File System (HDFS) stores the application data and
file system metadata separately on dedicated servers. NameNode and
DataNode are the two critical components of the Hadoop HDFS
architecture.

 Application data is stored on servers referred to as DataNodes, and file
system metadata is stored on servers referred to as NameNodes. HDFS
replicates the file content on multiple DataNodes based on the
replication factor to ensure reliability of data.
 1.NameNode:
 All the blocks on DataNodes are handled by NameNode, which is known as
the master node. It performs the following functions:
1. Monitors and controls all the DataNode instances.
2. Permits the user to access a file.
3. Stores all of the block records, i.e. which blocks are held on which DataNode instance.
4. EditLogs are committed to disk after every write operation to the NameNode’s
metadata store. The file data itself is replicated to the other data nodes, including
backup DataNodes.
5. All of a DataNode’s blocks must be alive (replicated) on other nodes before those
blocks can be removed from that DataNode.
 There are two kinds of files in NameNode: FsImage files and EditLogs files:
1. FsImage: It contains all the details about the filesystem, including all the directories and
files, in a hierarchical format. It is called a file image because it is a point-in-time
snapshot of the filesystem namespace.
2. EditLogs: The EditLogs file keeps track of the modifications that have been made to the
files of the filesystem since the last FsImage.
 2. DataNode:
 Every slave machine that stores data runs a DataNode. The DataNode stores the actual
data in the form of blocks. DataNodes do the following:
1. DataNodes store all of the actual data.
2. They handle the requested operations on files, such as reading file content and
writing new data, as described above.
3. They follow the NameNode’s instructions, including scrubbing (deleting) data on DataNodes,
establishing replication partnerships with other DataNodes, and so on.
 Secondary NameNode:
 The Secondary NameNode periodically performs a checkpoint: it merges the
NameNode’s EditLogs into the FsImage so that the edit log does not grow without
bound (and the NameNode does not run out of disk space). The Secondary NameNode
performs the following duties.
1. It gathers the transaction log (EditLogs) data into one location so that, when it has to
be replayed, it is all in one single place. Once merged into a new FsImage, the
checkpoint is copied back to the NameNode and can be replicated to other servers,
either directly or via a distributed file system.
2. The information describing the filesystem, i.e. which blocks are stored on which
Data Nodes, is called metadata. The Data Nodes store the data itself; the NameNode
(and its checkpoint) hold this metadata about the cluster nodes.
HDFS Concepts:
1. Blocks
2. Namenodes and Datanodes
1. Blocks:
 A disk has a block size, which is the minimum amount of data that it can read or write.
 Filesystems for a single disk build on this by dealing with data in blocks, which are an
integral multiple of the disk block size. Filesystem blocks are typically a few kilobytes
in size, while disk blocks are normally 512 bytes.
 HDFS also has the concept of a block, but it is a much larger unit: 128 MB by default in
recent Hadoop releases (64 MB in older ones). A sketch that prints the defaults reported
by the cluster follows this list.
 Having a block abstraction for a distributed filesystem brings several benefits. The first
benefit is the most obvious: a file can be larger than any single disk in the network.
 Second, making the unit of abstraction a block rather than a file simplifies the storage
subsystem.
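As a quick way to see these values, the following minimal sketch (the class name ShowBlockDefaults is ours, and it assumes a filesystem reachable through the usual Hadoop configuration, e.g. fs.defaultFS) prints the default block size and replication factor that a client would use:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockDefaults {
  public static void main(String[] args) throws Exception {
    // FileSystem.get() picks up fs.defaultFS from the configuration files
    FileSystem fs = FileSystem.get(new Configuration());
    Path root = new Path("/");
    System.out.println("Default block size (bytes): " + fs.getDefaultBlockSize(root));
    System.out.println("Default replication:        " + fs.getDefaultReplication(root));
  }
}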
2. Name Node:
 The Name Node is the single point of contact for accessing files in HDFS, and it determines
the block IDs and locations for data access. So the Name Node plays the Master role in the
Master/Slaves architecture, whereas the Data Nodes act as Slaves. File system metadata
is stored on the Name Node.
 An HDFS cluster has two types of node operating in a master-worker pattern: a
namenode (the master) and a number of datanodes (workers). The namenode manages
the filesystem namespace. It maintains the filesystem tree and the metadata for all the
files and directories in the tree.
3. Data Nodes:
 Data Nodes are the Slave part of the Master/Slaves architecture; the actual
HDFS files are stored on them in the form of fixed-size chunks of data called
blocks.
 Data Nodes serve read and write requests of clients on HDFS files and also perform
block creation, replication, and deletion.
 Hadoop Filesystems:
 Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation. The Java
abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and there are
several concrete implementations, which are given below.

Filesystem: Local
  URI scheme: file
  Java implementation (under org.apache.hadoop): fs.LocalFileSystem
  Description: A filesystem for a locally connected disk with client-side checksums.

Filesystem: HDFS
  URI scheme: hdfs
  Java implementation (under org.apache.hadoop): hdfs.DistributedFileSystem
  Description: Hadoop’s distributed filesystem. HDFS is designed to work efficiently in conjunction with MapReduce.

Filesystem: HFTP
  URI scheme: hftp
  Java implementation (under org.apache.hadoop): hdfs.HftpFileSystem
  Description: A filesystem providing read-only access to HDFS over HTTP. Often used with distcp to copy data between HDFS clusters running different versions.
 The Java Interface:
 In this section, we dig into Hadoop’s FileSystem class: the API for
interacting with one of Hadoop’s filesystems.
 While we focus mainly on the HDFS implementation, DistributedFileSystem, in general you should
strive to write your code against the FileSystem abstract class, to retain portability across
filesystems.
 Reading Data from a Hadoop URL
 One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL object
to open a stream to read the data from. The general idiom is:
InputStream in = null;
try {
  in = new URL("hdfs://host/path").openStream();
  // process in
} finally {
  IOUtils.closeStream(in);
}
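To make java.net.URL recognize the hdfs:// scheme, Hadoop’s FsUrlStreamHandlerFactory must first be registered (this can be done at most once per JVM). A minimal sketch of a complete program built on the idiom above might look like the following; the class name URLCat and the use of the first command-line argument as the URL are illustrative.

import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
  static {
    // register Hadoop's handler so java.net.URL understands hdfs:// URLs
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
  }

  public static void main(String[] args) throws Exception {
    InputStream in = null;
    try {
      in = new URL(args[0]).openStream();
      // copy the stream to standard output, 4 KB at a time
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}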

Writing Data:

 The FileSystem class has a number of methods for creating a file. The simplest is the
method that takes a Path object for the file to be created and returns an output stream
to write to:

Syntax: public FSDataOutputStream create(Path f) throws IOException


 Copying a local file to a Hadoop filesystem
import java.io.*;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0];
    String dst = args[1];
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    // the Progressable callback prints a dot each time data is written to the cluster
    OutputStream out = fs.create(new Path(dst), new Progressable() {
      public void progress() {
        System.out.print(".");
      }
    });
    IOUtils.copyBytes(in, out, 4096, true); // 4 KB buffer; close both streams when done
  }
}
Querying the Filesystem
 An important feature of any filesystem is the ability to navigate its
directory structure and retrieve information about the files and
directories that it stores. The FileStatus class encapsulates filesystem
metadata for files and directories, including file length, block size,
replication, modification time, ownership, and permission information.
 The method getFileStatus() on FileSystem provides a way of getting a
FileStatus object for a single file or directory.
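As a small illustration of getFileStatus() (the class name ShowFileStatus and the choice of printed fields are ours; the sketch assumes the path supplied in args[0] exists):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowFileStatus {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
    FileStatus stat = fs.getFileStatus(new Path(uri));
    // print the metadata fields described above
    System.out.println("Path:          " + stat.getPath());
    System.out.println("Length:        " + stat.getLen());
    System.out.println("Block size:    " + stat.getBlockSize());
    System.out.println("Replication:   " + stat.getReplication());
    System.out.println("Modified (ms): " + stat.getModificationTime());
    System.out.println("Owner:         " + stat.getOwner());
    System.out.println("Permission:    " + stat.getPermission());
  }
}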
Showing the file statuses for a collection of paths in a Hadoop filesystem
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class ListStatus {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    // build a Path for every command-line argument
    Path[] paths = new Path[args.length];
    for (int i = 0; i < paths.length; i++) {
      paths[i] = new Path(args[i]);
    }
    // listStatus() returns the statuses of the children of each directory path
    FileStatus[] status = fs.listStatus(paths);
    Path[] listedPaths = FileUtil.stat2Paths(status);
    for (Path p : listedPaths) {
      System.out.println(p);
    }
  }
}
 Deleting Data:
 Use the delete() method on FileSystem to permanently remove
files or directories:
 public boolean delete(Path f, boolean recursive) throws IOException
 If f is a file or an empty directory, the value of recursive is ignored; a non-empty
directory is deleted, along with its contents, only if recursive is true.
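A minimal sketch of a delete utility built on this method (the class name DeletePath is ours, and it assumes the path to remove is passed as args[0]):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeletePath {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
    // recursive = true so non-empty directories are removed along with their contents
    boolean deleted = fs.delete(new Path(uri), true);
    System.out.println("Deleted " + uri + ": " + deleted);
  }
}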
 Anatomy of File Read:
 The client opens the file it wishes to read by calling open() on the FileSystem object, which for
HDFS is an instance of DistributedFileSystem (step 1 in Figure).
 DistributedFileSystem calls the namenode, using RPC, to determine the locations of the blocks for
the first few blocks in the file (step 2).
 For each block, the namenode returns the addresses of the datanodes that have a copy of that block.
 The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks)
to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which
manages the datanode and namenode I/O.
 The client then calls read() on the stream (step 3). DFSInputStream, which has stored the datanode
addresses for the first few blocks in the file, then connects to the first (closest) datanode for the first
block in the file.
 Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream
(step 4). When the end of the block is reached, DFSInputStream will close the connection to the
datanode, then find the best datanode for the next block (step 5).
 This happens transparently to the client, which from its point of view is just reading a continuous
stream.
 When the client has finished reading, it calls close() on the FSDataInputStream (step 6).
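From the client’s point of view, the whole read path above reduces to a few calls. The sketch below (the class name FileSystemCat is ours) shows where steps 1, 3-5, and 6 happen in client code; the namenode and datanode traffic is handled inside DFSInputStream.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0]; // e.g. an hdfs:// path to read
    FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
    FSDataInputStream in = null;
    try {
      in = fs.open(new Path(uri));                      // step 1: open()
      IOUtils.copyBytes(in, System.out, 4096, false);   // steps 3-5: read()
    } finally {
      IOUtils.closeStream(in);                          // step 6: close()
    }
  }
}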
 Anatomy of a File Write:
 Next we’ll look at how files are written to HDFS. Although quite detailed, it is
instructive to understand the data flow, since it clarifies HDFS’s coherency model.
 The client creates the file by calling create() on DistributedFileSystem (step 1 in
Figure). DistributedFileSystem makes an RPC call to the namenode to create a new file
in the filesystem’s namespace, with no blocks associated with it (step 2).
 The namenode performs various checks to make sure the file doesn’t already exist, and
that the client has the right permissions to create the file.
 As the client writes data (step 3), DFSOutputStream splits it into packets, which it
writes to an internal queue, called the data queue.
 The data queue is consumed by the DataStreamer, whose responsibility it is to ask the
namenode to allocate new blocks by picking a list of suitable datanodes to store the
replicas. These datanodes form a pipeline: the DataStreamer streams the packets to the
first datanode, which stores each packet and forwards it to the second datanode.
 Similarly, the second datanode stores the packet and forwards it to the third (and last)
datanode in the pipeline (step 4).
 DFSOutputStream also maintains an internal queue of packets waiting to be acknowledged
by datanodes, called the ack queue. A packet is removed from the ack queue only when it
has been acknowledged by all the datanodes in the pipeline (step 5).
 When the client has finished writing data, it calls close() on the stream (step 6).
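Again from the client’s point of view, the write path reduces to create(), write, and close(). A minimal sketch is shown below (the class name and the short string written are illustrative); the data queue, DataStreamer, and replication pipeline all operate inside the returned stream.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path dst = new Path(args[0]);                // destination path to create
    FSDataOutputStream out = fs.create(dst);     // steps 1-2: create()
    out.writeUTF("hello, hdfs");                 // step 3: data is queued in packets
    out.close();                                 // step 6: flush pipeline, complete file
  }
}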
