HDFS - YoussefEtman
Definition:
HDFS is a fault tolerant, distributed file system that runs on commodity hardware. It is optimized for storing very large files
with streaming data access.
Fault tolerant: due to the replication of blocks (3 replicas by default), the failure of one node won’t affect access to
the data.
Distributed file system: files are broken into blocks and distributed over a set of different datanodes. In addition, each
block is replicated and stored on different nodes, which provides:
o Resiliency (fault tolerance).
o Parallel read/write capability.
There are many replication policies, but the default one is:
BlockPlacementPolicyDefault: with the default replication factor of 3, it puts one replica on the local machine if the
writer is on a datanode (otherwise on a random datanode in the same rack as the writer), another replica
on a node in a different (remote) rack, and the last one on a different node in that same remote rack (a quick way to
inspect the resulting block placement is sketched right after these definition points).
Runs on commodity hardware: means that it uses readily available, off-the-shelf computers, so there is no need
to buy specialized hardware.
Very large files: HDFS was made for storing very large files, which is why it is not suitable for applications that store
lots of small files (the small files problem, discussed below).
Streaming data access: HDFS is designed more for batch processing rather than interactive use by users. The
emphasis is on high throughput of data access (time to read the whole dataset/table) rather than low latency of
data access (time to read the first record).
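As a quick, hedged sketch of how to see the placement policy in action on a running cluster (the path below is only a placeholder), fsck lists every block of a file together with the datanodes holding its replicas, and setrep changes the replication factor of a single file:
hdfs fsck /user/data/sample.txt -files -blocks -locations
hdfs dfs -setrep -w 2 /user/data/sample.txt   # wait until the file is re-replicated with 2 replicas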
Final note is that HDFS follows WORM model (Write Once – Read Many):
This means that once something is written to HDFS it cannot be modified in place; the only way to change it is to delete
the whole file and write it again. More precisely, you can add data to the end of a file (append) and you can delete the
whole file (truncate), but you cannot update or delete records in place, and this explains why Hive has no
update/delete statements.
HDFS architecture
Namenode: stores metadata (FsImage file, EditLog file)
Namenode:
The namenode stores the file system metadata in the form of:
FsImage:
■ A snapshot of the file system metadata: the list of files and directories with their blocks, permissions, and ownership.
EditLog:
■ A log of every change (transaction) made to the file system metadata since the last checkpoint.
Both of these files are stored persistently on the hard disk; however, the namenode also keeps a copy of the FsImage in
memory. Any write to HDFS is reflected in the in-memory FsImage and is also recorded in the EditLog file stored
on the hard disk.
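For example (the directory name below is only a placeholder), a simple write like the following is applied to the in-memory FsImage and appended to the on-disk EditLog as a transaction (an OP_MKDIR record); it only reaches the on-disk FsImage at the next checkpoint:
hdfs dfs -mkdir /user/demo_dir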
As long as the namenode is running and users keep writing, the EditLog file keeps growing, and this can cause two
problems: the EditLog takes up more and more disk space, and the next namenode startup becomes slower. To
understand the second point, we need to know what happens during namenode startup, where something called a
checkpoint takes place:
Checkpoint:
When the NameNode starts up it enters a state called safe mode. In this mode HDFS is effectively read-only (the
namenode won’t allow any changes to the file system or any block replication), and the namenode won’t leave safe mode
until it has loaded the FsImage, applied the EditLog on top of it (the checkpoint merge), and received block reports from
the datanodes for a sufficient percentage of the blocks.
The purpose of a checkpoint is to make sure that HDFS has a consistent view of the file system metadata by taking a
snapshot of the file system metadata and saving it to FsImage. Even though it is efficient to read a FsImage, it is not
efficient to make incremental edits directly to a FsImage. Instead of modifying FsImage for each edit, we persist the
edits in the EditLog. During the checkpoint the changes from EditLog are applied to the FsImage. A checkpoint can be
triggered at a given time interval (dfs.namenode.checkpoint.period) expressed in seconds, or after a given number of
filesystem transactions have accumulated (dfs.namenode.checkpoint.txns).
So if the EditLog file is large, the checkpoint will take longer.
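As a small sketch against a running cluster (the default values in the comments are assumptions about a typical installation), you can check the two checkpoint settings and even force a checkpoint manually:
hdfs getconf -confKey dfs.namenode.checkpoint.period   # typically 3600 seconds
hdfs getconf -confKey dfs.namenode.checkpoint.txns     # typically 1000000 transactions
hdfs dfsadmin -safemode enter      # saveNamespace only works while in safe mode
hdfs dfsadmin -saveNamespace       # merge the EditLog into the FsImage and save it to disk
hdfs dfsadmin -safemode leave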
Quick lab:
If you want to see the FsImage and the EditLog files and how they are stored on the namenode, you can head to the
apache-hive-image folder that we downloaded in the Hive course and go to:
……….\apache-hive-image\hdfs\namenode\current
But if you try to open them with Notepad++ you won’t be able to read them properly, since they are stored in a binary
format; you can convert them to a more readable form, but from the HDP VM.
If you were successful in running the HDP virtual machine on your device, head to the web shell client:
1. cd /hadoop/hdfs/namenode/current
2. ls
3. sudo hdfs oev -i edits_0000000000000000001-0000000000000000073 -o edit_myformat.xml -p XML
This command converts the editlog file to XML format
4. more edit_myformat.xml
As you can see, all operations are recorded as log entries in the EditLog file; for example, the first record says that a
directory was created, and the third record says that there was a permission change.
When the namenode restarts it loads the FsImage and the EditLog file into memory to perform the checkpoint merge and
then truncates these EditLog files.
You can do the same with the FsImage, but change the command from “oev” to “oiv”; here is a snapshot from the FsImage
after converting it to XML format:
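As a sketch (the exact fsimage file name will differ on your machine; the one below is only a placeholder chosen to match the edits file above):
sudo hdfs oiv -i fsimage_0000000000000000073 -o fsimage_myformat.xml -p XML
more fsimage_myformat.xml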
Namenode resilience:
It’s important to note that without a namenode no client or user can read or write files in HDFS; that’s why it was important
to come up with strategies to protect the namenode from data loss.
NFS backup: backing up the FsImage & EditLog files by writing them synchronously to the local filesystem as well as to a
remote NFS (Network File System) mount.
Secondary namenode: its benefit is that it takes the checkpointing burden off the primary namenode, and it also
guarantees that the EditLog never grows too large, which speeds up the startup process in case of a failure or a needed
restart.
However, we have to keep in mind that there will always be some data loss, even if minor, because the secondary
namenode always lags behind the primary namenode.
Note that in the case of NFS backup or a secondary namenode there is minor data loss, and there is also downtime if the
primary namenode fails or needs to restart. If we want to overcome these problems, we need to set up a high availability
architecture.
High availability:
In Hadoop 2.0 the concept of HA was introduced: two namenodes run in an active/standby configuration. The active
namenode writes every edit to a shared edit log (a quorum of JournalNodes, or a shared NFS directory), the standby
namenode continuously applies those edits so its in-memory state stays up to date, and the datanodes send their block
reports to both namenodes. If the active namenode fails, ZooKeeper-based failover promotes the standby to active, so
there is no downtime and no loss of metadata.
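A minimal sketch of checking and switching the active namenode from the command line, assuming an HA setup where the two namenodes are registered with the (assumed) IDs nn1 and nn2:
hdfs haadmin -getServiceState nn1    # prints active or standby
hdfs haadmin -getServiceState nn2
hdfs haadmin -failover nn1 nn2       # make nn2 the active namenode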
Small files problem
The problem
The small files problem occurs when an application stores lots of files that are small in size (much smaller than the HDFS
block size). As we learned before, each file has its metadata stored in the FsImage residing in the NN’s memory (each
file’s metadata is around 150 bytes), so when we have a lot of small files this can take up a lot of space in the NN’s
memory, eventually causing it to crash.
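A rough back-of-the-envelope sketch using the ~150 bytes figure above (the file count is only an assumption for illustration):
echo $(( 10000000 * 150 ))   # 10 million small files -> 1500000000 bytes, i.e. ~1.5 GB of namenode heap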
The solution
1. One possible solution is HDFS federation:
HDFS federation is the process of horizontally scaling the namenodes: by adding more namenodes, the problem of
small files can be mitigated by distributing the files (and therefore their metadata) among these namenodes.
It’s important to note that these namenodes are independent and don’t communicate with each other (they are isolated),
so if one namenode fails the rest are still up and running.
In addition to helping with the small files problem, HDFS federation can be used to increase the read/write
throughput (the client can now read from multiple namenodes at a time).
Also, the isolation of namenodes can help in isolating different applications and users; for example, we can have
a namenode dedicated to production and another one for development.
2. Another solution to the small files problem is the SequenceFile:
SequenceFiles store data as key-value pairs; in the case of small files, the key is the file’s name and the value is the
file’s content.
This way we pack up small files into one large file that is splittable and therefore can be processed by MapReduce.
3. Another solution is using HAR files (Hadoop Archive) instead of SequenceFiles; a SequenceFile is faster to process, but
HAR files take up less space (see the example command after this list).
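As a sketch (all paths below are placeholders), this is how a directory full of small files could be packed into a Hadoop archive and then listed through the har:// scheme:
hadoop archive -archiveName small-files.har -p /user/data small_files /user/data/archived
hdfs dfs -ls har:///user/data/archived/small-files.har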
Rack Awareness
HDFS tries to serve the clients from the datanodes that are nearest to them; this is done by following these priorities: a
replica on the same node as the client is preferred first, then a replica on another node in the same rack, and finally a
replica in a different rack.
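A quick way to see the rack/datanode tree as the namenode sees it (requires admin privileges on the cluster):
hdfs dfsadmin -printTopology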
Interview Questions:
1. What is the Hadoop Distributed File System (HDFS)? (Sol.: see the Definition section at the top of these notes.)
2. How NameNode tackle Datanode failures in HDFS?
HDFS has a master-slave architecture in which the master is the namenode and the slaves are the datanodes. Each
datanode sends a heartbeat signal to the namenode; when the namenode does not receive heartbeat signals from a
datanode, it assumes that the datanode is either dead or non-functional.
As soon as the datanode is declared dead/non-functional, all the data blocks it hosted are re-replicated to other
datanodes from the remaining replicas, so the replication factor is restored. This is how the namenode handles
datanode failures.
3. How does a client read data from HDFS?
To read data from HDFS, the client needs to communicate with the namenode for metadata. The client gets the names
of the files and their locations from the namenode: the namenode responds with the number of blocks of the file, the
replication factor, and the datanodes where each block is stored.
Now the client communicates with the datanodes where the blocks are actually stored. The client starts reading data in
parallel from the datanodes based on the information received from the namenode. Once the client or application has
received all the blocks of the file, it combines these blocks to form the file.
For read performance improvement, the choice of which replica to read from is based on its distance from the client
(rack awareness).
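Purely as an illustration of this read path (the file path is a placeholder), a simple command-line read goes through exactly this flow, asking the namenode for the block locations and then streaming the blocks from the nearest datanodes:
hdfs dfs -cat /user/data/sample.txt | head -n 10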