
UNIT-II

Hadoop Distributed File System


Hadoop Ecosystem:
The Hadoop ecosystem is a collection of tools and frameworks that work together to manage, store, process, and analyze large datasets efficiently.
Developed under the Apache Software Foundation, Hadoop and its ecosystem are foundational in Big Data Analytics because they address the core challenges of big data: volume, velocity, and variety.
The Hadoop ecosystem is a platform, or framework, that helps in solving big data problems. It comprises different components and services (for ingesting, storing, analyzing, and maintaining data). Most of the services in the ecosystem supplement the four core components of Hadoop: HDFS, YARN, MapReduce and Hadoop Common.

The Hadoop ecosystem includes both Apache open source projects and a wide variety of commercial tools and solutions. Well-known open source examples include Spark, Hive, Pig, Sqoop and Oozie.

fig: Hadoop Ecosystem


 Hadoop Ecosystem Components
 1. Hadoop Distributed File System (HDFS)
 It is the most important component of the Hadoop ecosystem and the primary storage system of Hadoop. The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable, fault-tolerant, reliable and cost-efficient data storage for big data. HDFS is a distributed filesystem that runs on commodity hardware.
HDFS consists of two core components, i.e.
• Name Node
• Data Node
• The Name Node is the prime (master) node. It holds the metadata (data about data) and therefore requires comparatively fewer resources than the Data Nodes, which store the actual data.
• Data Nodes run on commodity hardware in the distributed environment, which makes Hadoop cost-effective.
• HDFS coordinates the storage across the cluster's hardware and thus works at the heart of the system.
 2. YARN:
• Yet Another Resource Negotiator (YARN), as the name implies, helps manage resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system.
• It consists of three major components, i.e.
• Resource Manager
• Node Manager
• Application Master
 3. MapReduce:
• By using distributed and parallel algorithms, MapReduce carries the processing logic to the data and helps developers write applications that transform big data sets into manageable results.
• MapReduce uses two functions, Map() and Reduce(), whose tasks are:
 The Map function takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
 The Reduce function takes the output from the Map as input, combines those data tuples based on the key and modifies the value of the key accordingly. A minimal word-count sketch is shown after this list.
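As an illustration (not from the original text), the following is a minimal word-count sketch using the org.apache.hadoop.mapreduce API; the class names and the whitespace-based tokenization are choices made for this example.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: break each input line into (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}

// Reduce: sum the counts emitted for each word (key).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}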
 4. PIG:
Pig was developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.
• It is a platform for structuring data flows and for processing and analyzing huge data sets.
• Pig executes the commands and, in the background, all the activities of MapReduce are taken care of. After processing, Pig stores the result in HDFS.
5. HIVE:
• With the help of SQL methodology and an SQL-like interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
• It is highly scalable, as it supports both real-time (interactive) and batch processing. Also, all the SQL data types are supported by Hive, making query processing easier.
• Like other query processing frameworks, HIVE comes with two components: JDBC Drivers and the HIVE Command Line.
 6. Mahout:
• Mahout adds machine learning capability to a system or application. Machine learning, as the name suggests, helps a system develop itself based on patterns, user/environmental interaction, or algorithms.
 7. Apache Spark:
• It is a platform that handles resource-intensive tasks such as batch processing, interactive or iterative real-time processing, graph computations, and visualization.
 8. Apache HBase:
• It is a NoSQL database that supports all kinds of data and is thus capable of handling anything in a Hadoop database. It provides the capabilities of Google's BigTable, so it can work on big data sets effectively.
Hadoop Architecture:

Fig: Hadoop Distributed File System Architecture


 The figure shows the client, the master NameNode, the secondary NameNode and the slave DataNodes in the Hadoop physical architecture.
 Hadoop Distributed File System (HDFS) stores the application data and file system
metadata separately on dedicated servers. NameNode and DataNode are the two critical
components of the Hadoop HDFS architecture.
 Application data is stored on servers referred to as DataNodes and file system metadata is
stored on servers referred to as NameNode. HDFS replicates the file content on multiple
DataNodes based on the replication factor to ensure reliability of data.

1. NameNode:
All the blocks on the DataNodes are managed by the NameNode, which is known as the master node. It performs the following functions:
1. Monitors and controls all the DataNode instances.
2. Permits the user to access a file.
3. Stores the records of all the blocks, i.e. which blocks are held on which DataNode instance.
4. EditLogs are committed to disk after every write operation to the NameNode's metadata, so that the record of changes survives a restart of the NameNode.
5. Tracks which DataNodes hold live replicas of each block and decides which blocks must be re-replicated to, or removed from, DataNodes.
 There are two kinds of files on the NameNode: the FsImage file and the EditLogs file:
1. FsImage: It contains all the details of the filesystem, including all the directories and files, in a hierarchical format. It is called a file system image because it is a point-in-time snapshot of the filesystem namespace.
2. EditLogs: The EditLogs file keeps track of the modifications that have been made to the files of the filesystem since the last FsImage was written.

2. DataNode:
 Every slave machine that stores data runs a DataNode. DataNodes store the actual contents of HDFS files as blocks on their local file systems. DataNodes do the following:
1. They store all of the actual data blocks.
2. They handle all of the requested operations on files, such as reading file content and writing new data, as described above.
3. They follow the NameNode's instructions, including deleting (scrubbing) blocks, replicating blocks to other DataNodes, and so on.

3. Secondary NameNode:
The Secondary NameNode periodically performs checkpoints on behalf of the NameNode, so that the EditLogs do not grow without bound. It performs the following duties:
1. It retrieves the EditLogs and the FsImage from the NameNode and merges them into a new, up-to-date FsImage (a checkpoint) at regular intervals.
2. It copies the merged FsImage back to the NameNode, so that the NameNode can restart quickly from the latest checkpoint. Note that the Secondary NameNode is a checkpointing helper, not a hot standby for the NameNode.
HDFS Concepts: Blocks, Namenodes and Datanodes
HDFS Blocks
A disk has a block size, which is the minimum amount of data that it can read or write. HDFS is a block-structured file system. Each HDFS file is broken into blocks of a fixed size, usually 128 MB, which are stored across various data nodes in the cluster. Each of these blocks is stored as a separate file on the local file system of a data node (a commodity machine in the cluster). Thus, to access a file on HDFS, multiple data nodes need to be contacted, and the list of data nodes that need to be accessed is determined by the file system metadata stored on the Name Node.

So, any HDFS client trying to access or read an HDFS file will first get the block information from the Name Node, and then, based on the block ids and locations, the data will be read from the corresponding data nodes in the cluster.

HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks.
By making a block large enough, the time to transfer the data from the disk can be made to be
significantly larger than the time to seek to the start of the block. Thus the time to transfer a large
file made of multiple blocks operates at the disk transfer rate.

HDFS's fsck command is useful for getting the file and block details of the file system. The command below lists the blocks that make up each file in the file system.

$ hadoop fsck / -files -blocks

Advantages of Blocks Feature

1. Quick Seek Time:

By default, the HDFS block size is 128 MB, which is much larger than in most other file systems. The large block size is maintained so that the time spent seeking to the start of a block is small compared to the time spent transferring the block's data.

2. Ability to Store Large Files:


Another benefit of this block structure is that there is no need to store all blocks of a file on the same disk or node, so a file can be larger than any single disk or node in the cluster.
3. How Fault Tolerance is Achieved with HDFS Blocks:

The block structure works well with replication to provide fault tolerance and availability. By default, each block is replicated to three separate machines, which insures blocks against corruption, disk failure or machine failure.

If a block becomes unavailable, a copy can be read from another machine. A block that is no longer available due to corruption or machine failure can be replicated from its remaining machines to other live machines to bring the replication factor back to the normal level (3 by default). A small sketch of inspecting these settings through the FileSystem API follows this list.
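As a hedged illustration (the NameNode URI and file path below are placeholders invented for this example), the following sketch reads a file's block size and replication factor through the FileSystem API and then raises the replication factor of that one file:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // "hdfs://namenode:9000" and "/data/sample.txt" are placeholder values.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
    Path file = new Path("/data/sample.txt");

    FileStatus status = fs.getFileStatus(file);
    System.out.println("Block size:  " + status.getBlockSize());   // typically 128 MB
    System.out.println("Replication: " + status.getReplication()); // typically 3

    // Ask HDFS to keep four replicas of this particular file.
    fs.setReplication(file, (short) 4);
    fs.close();
  }
}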

2. Name Node:
The Name Node is the single point of contact for accessing files in HDFS, and it determines the block ids and locations for data access. So, the Name Node plays the Master role in the Master/Slaves architecture, whereas the Data Nodes act as slaves. The file system metadata is stored on the Name Node.

The file system metadata mainly contains file names, file permissions and the locations of each block of every file. The metadata is therefore relatively small and fits into the main memory of a computer. It is kept in the main memory of the Name Node to allow fast access.

Important Components of Name Node:

∙ FsImage: a file on the Name Node's local file system containing the entire HDFS file system namespace (including the mapping of blocks to files and file system properties).

∙ EditLog: a transaction log residing on the Name Node's local file system, containing a record/entry for every change that occurs to the file system metadata.

3. Data Nodes:
Data Nodes are the slave part of the Master/Slaves architecture; the actual HDFS files are stored on them in the form of fixed-size chunks of data called blocks. Data Nodes serve read and write requests of clients on HDFS files and also perform block creation, replication and deletion.

Data Nodes Failure Recovery:

Each data node on a cluster periodically sends a heartbeat message to the name node, which the name node uses to detect data node failures based on missing heartbeats. The name node marks data nodes without recent heartbeats as dead and does not dispatch any new I/O requests to them. Because data located at a dead data node is no longer available to HDFS, data node death may cause the replication factor of some blocks to fall below their specified values. The name node constantly tracks which blocks must be re-replicated and initiates replication whenever necessary. Thus all the blocks on a dead data node are re-replicated on other live data nodes and the replication factor remains normal.

Hadoop FileSystems:

Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation. The
Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and
there are several concrete implementations, which are given below.

Hadoop Filesystem URI Scheme and Implementations


Hadoop provides various Filesystem (FS) URI schemes, allowing access to data stored on
different types of storage backends. Below is a detailed summary of these schemes, their
implementations, and use cases.

Comparison of Filesystem Types


URI Scheme | Access Type          | Use Case                                   | Implementation Class
file       | Local                | Debugging, standalone mode                 | fs.LocalFileSystem
hdfs       | Distributed          | HDFS storage for large-scale applications  | hdfs.DistributedFileSystem
hftp       | Read-only over HTTP  | Data transfer between HDFS clusters        | hdfs.HftpFileSystem
hsftp      | Read-only over HTTPS | Secure data transfer across HDFS clusters  | hdfs.HsftpFileSystem
webhdfs    | Read-write over HTTP | Secure remote HDFS access                  | hdfs.web.WebHdfsFileSystem
har        | Archived             | Long-term storage, reduced namenode usage  | fs.HarFileSystem
kfs        | Distributed          | Cloud storage                              | fs.kfs.KosmosFileSystem
ftp        | FTP server           | Integration with legacy systems            | fs.ftp.FTPFileSystem
s3n        | Native S3            | Access Amazon S3 directly                  | fs.s3native.NativeS3FileSystem
s3         | Block-based S3       | Handle large files on Amazon S3            | fs.s3.S3FileSystem

This overview highlights the diverse capabilities of Hadoop's filesystem implementations, tailored for specific data storage and processing needs.

The Java Interface


Hadoop's FileSystem class is the API for interacting with one of Hadoop's filesystems. While we focus mainly on the HDFS implementation, DistributedFileSystem, in general you should strive to write your code against the abstract FileSystem class, to retain portability across filesystems. This is very useful when testing your program, for example, since you can rapidly run tests using data stored on the local filesystem.
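As a brief sketch of this portability (the NameNode URI and paths below are placeholders, not taken from the text), the same client code can obtain different concrete implementations simply by changing the URI scheme passed to FileSystem.get():

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemBySchemeDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // file:// URI scheme -> LocalFileSystem (useful for testing).
    FileSystem local = FileSystem.get(URI.create("file:///"), conf);

    // hdfs:// URI scheme -> DistributedFileSystem (host and port are placeholders).
    FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);

    // Code written against the abstract FileSystem class works with either instance.
    System.out.println(local.exists(new Path("/tmp")));
    System.out.println(hdfs.exists(new Path("/")));

    local.close();
    hdfs.close();
  }
}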

Reading Data from a Hadoop URL


One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL
object to open a stream to read the data from.

The general idiom is:


InputStream in = null;
try {
in = new URL("hdfs://host/path").openStream();
// process in
} finally {
IOUtils.closeStream(in);
}

Example 3-1 shows a program for displaying files from Hadoop filesystems on standard output, like the Unix cat command.
Example 3-1. Displaying files from a Hadoop filesystem on standard output using
a URLStreamHandler
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {

  static {
    // Register Hadoop's URL stream handler so java.net.URL understands hdfs:// URLs.
    // Note that this factory can only be set once per JVM.
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
  }

  public static void main(String[] args) throws Exception {
    InputStream in = null;
    try {
      in = new URL(args[0]).openStream();
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
We make use of the handy IOUtils class that comes with Hadoop for closing the stream
in the finally clause, and also for copying bytes between the input stream and the
output stream (System.out in this case). The last two arguments to the copyBytes
method are the buffer size used for copying and whether to close the streams when
the copy is complete. We close the input stream ourselves, and System.out doesn’t
need to be closed.

Writing Data
The FileSystem class has a number of methods for creating a file. The simplest is the
method that takes a Path object for the file to be created and returns an output stream
to write to:
public FSDataOutputStream create(Path f) throws IOException
There are overloaded versions of this method that allow you to specify whether to
forcibly overwrite existing files, the replication factor of the file, the buffer size to use
when writing the file, the block size for the file, and file permissions.
public FSDataOutputStream append(Path f) throws IOException
The append operation allows a single writer to modify an already written file by opening
it and writing data from the final offset in the file. With this API, applications that
produce unbounded files, such as logfiles, can write to an existing file after a restart,
for example.
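As a hedged sketch of append() (the NameNode URI and log file path are placeholders for this example; note that not every Hadoop filesystem implements append):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendToLog {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);

    // Reopen an existing file and continue writing from its current end offset.
    FSDataOutputStream out = fs.append(new Path("/logs/app.log"));
    out.write("restarted\n".getBytes("UTF-8"));
    out.close();
    fs.close();
  }
}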

Example 3-4 shows how to copy a local file to a Hadoop filesystem. We illustrate progress
by printing a period every time the progress() method is called by Hadoop, which
is after each 64 K packet of data is written to the datanode pipeline. (Note that this
particular behavior is not specified by the API, so it is subject to change in later versions
of Hadoop. The API merely allows you to infer that “something is happening.”)
Example 3-4. Copying a local file to a Hadoop filesystem
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0];
    String dst = args[1];

    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    OutputStream out = fs.create(new Path(dst), new Progressable() {
      public void progress() {
        System.out.print("."); // called back by Hadoop as data is written
      }
    });
    IOUtils.copyBytes(in, out, 4096, true); // true: close the streams when done
  }
}

Querying the Filesystem:

File metadata: FileStatus


An important feature of any filesystem is the ability to navigate its directory structure and
retrieve information about the files and directories that it stores. The FileStatus class
encapsulates filesystem metadata for files and directories, including file length, block size,
replication, modification time, ownership, and permission information.

The getFileStatus() method on FileSystem provides a way of getting a FileStatus object for a single file or directory.

Example 3-5. Demonstrating file status information


public class ShowFileStatusTest {

  private MiniDFSCluster cluster; // use an in-process HDFS cluster for testing
  private FileSystem fs;

  @Before
  public void setUp() throws IOException {
    Configuration conf = new Configuration();
    if (System.getProperty("test.build.data") == null) {
      System.setProperty("test.build.data", "/tmp");
    }
    cluster = new MiniDFSCluster(conf, 1, true, null);
    fs = cluster.getFileSystem();
    OutputStream out = fs.create(new Path("/dir/file"));
    out.write("content".getBytes("UTF-8"));
    out.close();
  }

  @After
  public void tearDown() throws IOException {
    if (fs != null) { fs.close(); }
    if (cluster != null) { cluster.shutdown(); }
  }

  @Test(expected = FileNotFoundException.class)
  public void throwsFileNotFoundForNonExistentFile() throws IOException {
    fs.getFileStatus(new Path("no-such-file"));
  }

  @Test
  public void fileStatusForFile() throws IOException {
    Path file = new Path("/dir/file");
    FileStatus stat = fs.getFileStatus(file);
    assertThat(stat.getPath().toUri().getPath(), is("/dir/file"));
    assertThat(stat.isDir(), is(false));
    assertThat(stat.getLen(), is(7L));
    assertThat(stat.getModificationTime(),
        is(lessThanOrEqualTo(System.currentTimeMillis())));
    assertThat(stat.getReplication(), is((short) 1));
    assertThat(stat.getBlockSize(), is(64 * 1024 * 1024L));
    assertThat(stat.getOwner(), is("tom"));
    assertThat(stat.getGroup(), is("supergroup"));
    assertThat(stat.getPermission().toString(), is("rw-r--r--"));
  }

  @Test
  public void fileStatusForDirectory() throws IOException {
    Path dir = new Path("/dir");
    FileStatus stat = fs.getFileStatus(dir);
    assertThat(stat.getPath().toUri().getPath(), is("/dir"));
    assertThat(stat.isDir(), is(true));
    assertThat(stat.getLen(), is(0L));
    assertThat(stat.getModificationTime(),
        is(lessThanOrEqualTo(System.currentTimeMillis())));
    assertThat(stat.getReplication(), is((short) 0));
    assertThat(stat.getBlockSize(), is(0L));
    assertThat(stat.getOwner(), is("tom"));
    assertThat(stat.getGroup(), is("supergroup"));
    assertThat(stat.getPermission().toString(), is("rwxr-xr-x"));
  }
}

Deleting Data:
Use the delete() method on FileSystem to permanently remove files or directories:

public boolean delete(Path f, boolean recursive) throws IOException

If f is a file or an empty directory, then the value of recursive is ignored. A nonempty directory is only deleted, along with its contents, if recursive is true (otherwise an IOException is thrown).
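A minimal hedged sketch of calling delete() (the NameNode URI and paths are placeholders chosen for this example):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);

    // Delete a single file; the recursive flag is ignored for files.
    fs.delete(new Path("/tmp/old-file.txt"), false);

    // Delete a non-empty directory and everything under it.
    fs.delete(new Path("/tmp/old-dir"), true);

    fs.close();
  }
}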

Anatomy of a File Read:

Fig: A client reading data from HDFS


 The client opens the file it wishes to read by calling open() on the FileSystem object,
which for HDFS is an instance of DistributedFileSystem (step 1 in Figure).
 DistributedFileSystem calls the namenode, using RPC, to determine the locations of the
blocks for the first few blocks in the file (step 2).
 For each block, the namenode returns the addresses of the datanodes that have a copy of
that block.
 The DistributedFileSystem returns an FSDataInputStream (an input stream that supports
file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a
DFSInputStream, which manages the datanode and namenode I/O.
 The client then calls read() on the stream (step 3). DFSInputStream, which has stored the
datanode addresses for the first few blocks in the file, then connects to the first (closest)
datanode for the first block in the file.
 Data is streamed from the datanode back to the client, which calls read() repeatedly on
the stream (step 4). When the end of the block is reached, DFSInputStream will close the
connection to the datanode, then find the best datanode for the next block (step 5).
 This happens transparently to the client, which from its point of view is just reading a
continuous stream.
 When the client has finished reading, it calls close() on the FSDataInputStream (step 6).
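The steps above correspond to only a few client-side calls. The following sketch (the NameNode URI and path are placeholders) shows the open()/read()/close() sequence that drives the data flow in the figure:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadAnatomyDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);

    // Steps 1-2: open() contacts the namenode (via RPC) for the first block locations.
    FSDataInputStream in = fs.open(new Path("/data/sample.txt"));

    // Steps 3-5: read() streams data from the closest datanode, block by block.
    IOUtils.copyBytes(in, System.out, 4096, false);

    // Step 6: close the stream when finished reading.
    in.close();
    fs.close();
  }
}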

Anatomy of a File Write:


Fig: A client writing data to HDFS

 Next we'll look at how files are written to HDFS. Although quite detailed, it is instructive to understand the data flow since it clarifies HDFS's coherency model.
 The client creates the file by calling create() on DistributedFileSystem (step 1 in Figure).
DistributedFileSystem makes an RPC call to the namenode to create a new file in the
filesystem’s namespace, with no blocks associated with it (step 2).
 The namenode performs various checks to make sure the file doesn’t already exist, and
that the client has the right permissions to create the file.
 As the client writes data (step 3), DFSOutputStream splits it into packets, which it writes
to an internal queue, called the data queue.
 The data queue is consumed by the DataStreamer, whose responsibility it is to ask the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The datanodes form a pipeline, and the DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode.
 Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline (step 4).
 DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline (step 5).
 When the client has finished writing data, it calls close() on the stream (step 6).
