UNIT-2
The Hadoop ecosystem includes both Apache open source projects and a wide variety of
commercial tools and solutions. Some of the well-known open source examples include Spark,
Hive, Pig, Sqoop and Oozie.
1. NameNode:
All the blocks on the DataNodes are handled by the NameNode, which is known as the master node. It
performs the following functions:
1. It monitors and controls all the DataNode instances.
2. It permits the user to access a file.
3. It maintains a record of which blocks are stored on which DataNode instances.
4. It records every change to the filesystem in the EditLogs, which are committed to disk after
every write operation.
5. It tracks DataNode liveness through regular heartbeats and arranges for blocks to be
re-replicated on other DataNodes when a DataNode fails.
There are two kinds of files in NameNode: FsImage files and EditLogs files:
1. FsImage: It contains the complete state of the filesystem, including all the directories and
files, in a hierarchical format. It is called a file system image because it is a point-in-time
snapshot of the filesystem metadata.
2. EditLogs: The EditLogs file keeps track of the modifications that have been made to the
filesystem since the last FsImage was written.
2. DataNode:
Every slave machine that stores data runs a DataNode. DataNodes store the actual file data
as blocks on their local file system. DataNodes do the following:
1. DataNodes store the actual data blocks.
2. They handle the requested operations on files, such as reading block contents and
writing new data.
3. They carry out instructions from the NameNode, such as block creation, replication
and deletion.
3. Secondary NameNode:
The Secondary NameNode is not a standby or failover NameNode; its role is checkpointing. It
periodically performs the following duties:
1. It fetches the current FsImage and the accumulated EditLogs from the NameNode and merges
them into a new, up-to-date FsImage. This keeps the EditLogs from growing too large and
makes NameNode restarts faster.
2. It sends the merged FsImage back to the NameNode, which can then truncate its EditLogs.
The metadata itself (the information about files, directories and blocks) is still served
only by the NameNode.
HDFS Concepts: Blocks, Namenodes and Datanodes
1. HDFS Blocks:
A disk has a block size, which is the minimum amount of data that it can read or write. HDFS is
a block-structured file system. Each HDFS file is broken into blocks of a fixed size, usually 128
MB, which are stored across various data nodes in the cluster. Each of these blocks is stored as a
separate file on the local file system of a data node (a commodity machine in the cluster). Thus, to
access a file on HDFS, multiple data nodes need to be referenced, and the list of data nodes which
need to be accessed is determined by the file system metadata stored on the Name Node.
So, any HDFS client trying to access/read an HDFS file will first get the block information from the
Name Node, and then, based on the block ids and locations, data will be read from the corresponding
data nodes/computer machines in the cluster.
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks.
By making a block large enough, the time to transfer the data from the disk can be made to be
significantly larger than the time to seek to the start of the block. Thus the time to transfer a large
file made of multiple blocks operates at the disk transfer rate.
HDFS’s fsck command is a useful way to get the file and block details of the file system. The
command below will list the blocks that make up each file in the file system.
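A typical invocation (the path argument here is just the filesystem root; it can be any HDFS path):
hdfs fsck / -files -blocks
On older releases the same check is run as: hadoop fsck / -files -blocks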
The HDFS block abstraction fits well with replication for providing fault tolerance and availability.
By default each block is replicated to three separate machines. This protects against corrupted
blocks and against disk or machine failure.
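As a small illustration, the per-file replication factor can be read and changed through the FileSystem API. The sketch below assumes a reachable cluster; the namenode URI and file path are hypothetical:
FileSystem fs = FileSystem.get(URI.create("hdfs://namenode/"), new Configuration());
Path file = new Path("/user/tom/data.txt");              // hypothetical file
short current = fs.getFileStatus(file).getReplication(); // current replication factor
fs.setReplication(file, (short) (current + 1));          // request one extra replica per block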
2. Name Node:
The Name Node is the single point of contact for accessing files in HDFS, and it determines the
block ids and locations for data access. The Name Node therefore plays the Master role in the
Master/Slaves architecture, whereas the Data Nodes act as slaves. The file system metadata is
stored on the Name Node.
3. Data Nodes:
Data Nodes are the slaves in the Master/Slaves architecture, and the actual HDFS files are stored
on them in the form of fixed-size chunks of data called blocks. Data Nodes serve read and write
requests from clients on HDFS files and also perform block creation, replication and deletion.
Hadoop FileSystems:
Hadoop has an abstract notion of a filesystem, of which HDFS is just one implementation. The
Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and there
are several concrete implementations, such as the local filesystem (file://), HDFS (hdfs://,
DistributedFileSystem), WebHDFS (webhdfs://), HAR (har://), and object-store backed filesystems
such as Amazon S3.
Writing Data
The FileSystem class has a number of methods for creating a file. The simplest is the
method that takes a Path object for the file to be created and returns an output stream
to write to:
public FSDataOutputStream create(Path f) throws IOException
There are overloaded versions of this method that allow you to specify whether to
forcibly overwrite existing files, the replication factor of the file, the buffer size to use
when writing the file, the block size for the file, and file permissions.
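For example, one overloaded form takes the overwrite flag, buffer size, replication factor and block size directly. The sketch below is only illustrative; the destination path and the chosen values are hypothetical:
FileSystem fs = FileSystem.get(URI.create("hdfs://namenode/"), new Configuration());
// overwrite = true, buffer = 4 KB, replication = 2, block size = 128 MB
FSDataOutputStream out = fs.create(new Path("/user/tom/data.txt"),
    true, 4096, (short) 2, 128L * 1024 * 1024);
out.writeUTF("hello");
out.close();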
There is also a method for appending to an existing file (on filesystems that support it):
public FSDataOutputStream append(Path f) throws IOException
The append operation allows a single writer to modify an already written file by opening
it and writing data from the final offset in the file. With this API, applications that
produce unbounded files, such as logfiles, can write to an existing file after a restart,
for example.
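A minimal sketch of appending, assuming the underlying filesystem supports append and that fs has been obtained with FileSystem.get() as in the other examples; the logfile path is hypothetical:
Path logFile = new Path("/logs/app.log");      // hypothetical existing logfile
FSDataOutputStream out = fs.append(logFile);   // reopen at the file's final offset
out.writeBytes("another log line\n");          // write past the previous end of the file
out.close();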
Example 3-4 shows how to copy a local file to a Hadoop filesystem. We illustrate progress
by printing a period every time the progress() method is called by Hadoop, which
is after each 64 KB packet of data is written to the datanode pipeline. (Note that this
particular behavior is not specified by the API, so it is subject to change in later versions
of Hadoop. The API merely allows you to infer that “something is happening.”)
Example 3-4. Copying a local file to a Hadoop filesystem
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0];   // local source file
    String dst = args[1];        // destination URI on the Hadoop filesystem
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    // progress() is called each time a packet is written to the datanode pipeline
    OutputStream out = fs.create(new Path(dst), new Progressable() {
      public void progress() {
        System.out.print(".");
      }
    });
    // copy in 4 KB buffers and close both streams when finished
    IOUtils.copyBytes(in, out, 4096, true);
  }
}
The method getFileStatus() on FileSystem provides a way of getting a FileStatus object for a
single file or directory.
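As an illustration, FileStatus exposes the file's length, block size, replication factor, modification time, owner and permissions. A minimal sketch, assuming fs has been obtained with FileSystem.get() as before and using a hypothetical path:
FileStatus stat = fs.getFileStatus(new Path("/user/tom/data.txt"));
System.out.println("length:      " + stat.getLen() + " bytes");
System.out.println("block size:  " + stat.getBlockSize());
System.out.println("replication: " + stat.getReplication());
System.out.println("modified:    " + stat.getModificationTime());
System.out.println("owner:       " + stat.getOwner());
System.out.println("permission:  " + stat.getPermission());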
Deleting Data:
Use the delete() method on FileSystem to permanently remove files or directories:
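Its signature takes the path and a recursive flag:
public boolean delete(Path f, boolean recursive) throws IOException
If f is a file or an empty directory, the value of recursive is ignored. A non-empty directory is deleted, along with its contents, only if recursive is true (otherwise an IOException is thrown).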
Anatomy of a File Write:
Next we’ll look at how files are written to HDFS. Although quite detailed, it is instructive to
understand the data flow, since it clarifies HDFS’s coherency model.
The client creates the file by calling create() on DistributedFileSystem (step 1 in Figure).
DistributedFileSystem makes an RPC call to the namenode to create a new file in the
filesystem’s namespace, with no blocks associated with it (step 2).
The namenode performs various checks to make sure the file doesn’t already exist and that the
client has the right permissions to create the file. If these checks pass, the namenode makes a
record of the new file, and DistributedFileSystem returns an FSDataOutputStream for the client
to start writing data to.
As the client writes data (step 3), DFSOutputStream splits it into packets, which it writes
to an internal queue, called the data queue.
The data queue is consumed by the DataStreamer, whose responsibility it is to ask the
namenode to allocate new blocks by picking a list of suitable datanodes to store the
replicas. The list of datanodes forms a pipeline.
The DataStreamer streams the packets to the first datanode in the pipeline, which stores
each packet and forwards it to the second datanode in the pipeline. Similarly, the second
datanode stores the packet and forwards it to the third (and last) datanode in the
pipeline (step 4).
DFSOutputStream also maintains an internal queue of packets that are waiting to be
acknowledged by the datanodes, called the ack queue. A packet is removed from the ack
queue only when it has been acknowledged by all the datanodes in the pipeline (step 5).
When the client has finished writing data, it calls close() on the stream (step 6). This
flushes all the remaining packets to the datanode pipeline and waits for acknowledgments
before contacting the namenode to signal that the file is complete.