HDFS - YoussefEtman
Definition:
HDFS is a fault tolerant, distributed file system that runs on commodity hardware. It is optimized for storing very large files
with streaming data access.
Fault tolerant: due to the replication of blocks (3 replicas by default), the failure of one node won’t affect access to
the data.
Distributed file system: files are broken into blocks and distributed over a set of different datanodes. In addition, each
block is replicated and stored on different nodes, which provides:
o Resiliency (fault tolerance).
o Parallel read/write capability.
There are many replication policies, but the default one is:
BlockPlacementPolicyDefault: with the default replication factor of 3, it puts one replica on the local machine if the
writer is on a datanode (otherwise on a random datanode in the same rack as the writer), another replica
on a node in a different (remote) rack, and the last one on a different node in that same remote rack (a quick way to
inspect the resulting block placement is sketched right after these definition points).
Runs on commodity hardware: means that it uses readily available, off-the-shelf computers, so there is no need
to buy specialized hardware.
Very large files: HDFS was made for storing very large files, which is why it is not suitable for applications that store
lots of small files (the small files problem, discussed below).
Streaming data access: HDFS is designed more for batch processing rather than interactive use by users. The
emphasis is on high throughput of data access (time to read the whole dataset/table) rather than low latency of
data access (time to read the first record).
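As a quick, hedged sketch of how to see the placement policy in action on a running cluster (the path below is only a placeholder), fsck lists every block of a file together with the datanodes holding its replicas, and setrep changes the replication factor of a single file:
hdfs fsck /user/data/sample.txt -files -blocks -locations
hdfs dfs -setrep -w 2 /user/data/sample.txt   # wait until the file is re-replicated with 2 replicas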
Final note is that HDFS follows WORM model (Write Once – Read Many):
This means that once something is written to HDFS it cannot be modified in place; the only way to change it is to delete
the whole file and write it again. More precisely, you can add data to the end of a file (append) and you can delete the
whole file (truncate), but you cannot update or delete records in place, and this explains why Hive has no
update/delete statements.
HDFS architecture
Namenode: stores metadata (FsImage file, EditLog file)
Namenode:
The namenode stores the file system metadata in the form of:
FsImage:
■ A snapshot of the file system metadata: the list of files and directories with their blocks, permissions, and ownership.
EditLog:
■ A log of every change (transaction) made to the file system metadata since the last checkpoint.
Both of these files are stored persistently on the hard disk; however, the namenode also keeps a copy of the FsImage in
memory. Any write to HDFS is reflected in the in-memory FsImage and is also recorded in the EditLog file stored
on the hard disk.
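For example (the directory name below is only a placeholder), a simple write like the following is applied to the in-memory FsImage and appended to the on-disk EditLog as a transaction (an OP_MKDIR record); it only reaches the on-disk FsImage at the next checkpoint:
hdfs dfs -mkdir /user/demo_dir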
As long as the namenode is running and users keep writing, the EditLog file keeps growing, and this can cause two
problems: the EditLog takes up more and more disk space, and the next namenode startup becomes slower. To
understand the second point, we need to know what happens during namenode startup, where something called a
checkpoint takes place:
Checkpoint:
When the NameNode starts up it enters a state called safe mode. In this mode HDFS is effectively read-only (the
namenode won’t allow any changes to the file system or any block replication), and the namenode won’t leave safe mode
until it has loaded the FsImage, applied the EditLog on top of it (the checkpoint merge), and received block reports from
the datanodes for a sufficient percentage of the blocks.
The purpose of a checkpoint is to make sure that HDFS has a consistent view of the file system metadata by taking a
snapshot of the file system metadata and saving it to FsImage. Even though it is efficient to read a FsImage, it is not
efficient to make incremental edits directly to a FsImage. Instead of modifying FsImage for each edit, we persist the
edits in the EditLog. During the checkpoint the changes from EditLog are applied to the FsImage. A checkpoint can be
triggered at a given time interval (dfs.namenode.checkpoint.period) expressed in seconds, or after a given number of
filesystem transactions have accumulated (dfs.namenode.checkpoint.txns).
So if the EditLog file is large, the checkpoint will take longer.
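As a small sketch against a running cluster (the default values in the comments are assumptions about a typical installation), you can check the two checkpoint settings and even force a checkpoint manually:
hdfs getconf -confKey dfs.namenode.checkpoint.period   # typically 3600 seconds
hdfs getconf -confKey dfs.namenode.checkpoint.txns     # typically 1000000 transactions
hdfs dfsadmin -safemode enter      # saveNamespace only works while in safe mode
hdfs dfsadmin -saveNamespace       # merge the EditLog into the FsImage and save it to disk
hdfs dfsadmin -safemode leave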
Quick lab:
If you want to see the FsImage and the EditLog files and how they are stored on the namenode, you can head to the
apache-hive-image folder that we downloaded in the Hive course and go to:
……….\apache-hive-image\hdfs\namenode\current
But if you try to open them with Notepad++ you won’t be able to read them properly, since they are stored in a binary
format; you can convert them to a more readable form, but from the HDP VM.
If you were successful in running the HDP virtual machine on your device, head to the web shell client:
1. cd /hadoop/hdfs/namenode/current
2. ls
3. sudo hdfs oev -i edits_0000000000000000001-0000000000000000073 -o edit_myformat.xml -p XML
This command converts the editlog file to XML format
4. more edit_myformat.xml
As you can see, all operations are recorded as log entries in the EditLog file; for example, the first record says that a
directory was created, and the third record says that there was a permission change.
When the namenode restarts it loads the FsImage and the EditLog file into memory to perform the checkpoint merge and
then truncates these EditLog files.
You can do the same with the FsImage, but change the command from “oev” to “oiv”; here is a snapshot from the FsImage
after converting it to XML format:
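As a sketch (the exact fsimage file name will differ on your machine; the one below is only a placeholder chosen to match the edits file above):
sudo hdfs oiv -i fsimage_0000000000000000073 -o fsimage_myformat.xml -p XML
more fsimage_myformat.xml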
Namenode resilience:
It’s important to note that without a namenode no client or user can read or write files in HDFS; that’s why it was important
to come up with strategies to protect the namenode from data loss.
NFS backup: backing up the FsImage & EditLog files by writing them synchronously to the local filesystem as well as to a
remote NFS (Network File System) mount.
Secondary namenode: its benefit is that it takes the checkpointing burden off the primary namenode, and it also
guarantees that the EditLog never grows too large, which speeds up the startup process in case of a failure or a needed
restart.
However, we have to keep in mind that there will always be some data loss, even if minor, because the secondary
namenode always lags behind the primary namenode.
Note that in the case of NFS backup or a secondary namenode there is minor data loss, and there is also downtime if the
primary namenode fails or needs to restart. If we want to overcome these problems, we need to set up a high availability
architecture.
High availability:
In Hadoop 2.0 the concept of HA was introduced: two namenodes run in an active/standby configuration. The active
namenode writes every edit to a shared edit log (a quorum of JournalNodes, or a shared NFS directory), the standby
namenode continuously applies those edits so its in-memory state stays up to date, and the datanodes send their block
reports to both namenodes. If the active namenode fails, ZooKeeper-based failover promotes the standby to active, so
there is no downtime and no loss of metadata.
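A minimal sketch of checking and switching the active namenode from the command line, assuming an HA setup where the two namenodes are registered with the (assumed) IDs nn1 and nn2:
hdfs haadmin -getServiceState nn1    # prints active or standby
hdfs haadmin -getServiceState nn2
hdfs haadmin -failover nn1 nn2       # make nn2 the active namenode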
Small files problem
The problem
The small files problem occurs when an application stores lots of files that are small in size (much smaller than the HDFS
block size). As we learned before, each file has its metadata stored in the FsImage residing in the NN’s memory (each
file’s metadata is around 150 bytes), so when we have a lot of small files this can take up a lot of space in the NN’s
memory, eventually causing it to crash.
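A rough back-of-the-envelope sketch using the ~150 bytes figure above (the file count is only an assumption for illustration):
echo $(( 10000000 * 150 ))   # 10 million small files -> 1500000000 bytes, i.e. ~1.5 GB of namenode heap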
The solution
1. One possible solution is HDFS federation:
HDFS federation is the process of horizontally scaling the namenodes: by adding more namenodes, the problem of
small files can be mitigated by distributing the files (and therefore their metadata) among these namenodes.
It’s important to note that these namenodes are independent and don’t communicate with each other (they are isolated),
so if one namenode fails the rest are still up and running.
In addition to helping with the small files problem, HDFS federation can be used to increase the read/write
throughput (the client can now read from multiple namenodes at a time).
Also, the isolation of namenodes can help in isolating different applications and users; for example, we can have
a namenode dedicated to production and another one for development.
2. Another solution to the small files problem is the SequenceFile:
SequenceFiles store data as key-value pairs; in the case of small files, the key is the file’s name and the value is the
file’s content.
This way we pack up small files into one large file that is splittable and therefore can be processed by MapReduce.
3. Another solution is using HAR files (Hadoop Archive) instead of SequenceFiles; a SequenceFile is faster to process, but
HAR files take up less space (see the example command after this list).
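As a sketch (all paths below are placeholders), this is how a directory full of small files could be packed into a Hadoop archive and then listed through the har:// scheme:
hadoop archive -archiveName small-files.har -p /user/data small_files /user/data/archived
hdfs dfs -ls har:///user/data/archived/small-files.har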
Rack Awareness
HDFS tries to serve the clients from the datanodes that are nearest to them; this is done by following these priorities: a
replica on the same node as the client is preferred first, then a replica on another node in the same rack, and finally a
replica in a different rack.
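A quick way to see the rack/datanode tree as the namenode sees it (requires admin privileges on the cluster):
hdfs dfsadmin -printTopology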
Interview Questions:
1. What is the Hadoop Distributed File System (HDFS)? (Sol.: see the Definition section at the top of these notes.)
2. How NameNode tackle Datanode failures in HDFS?
HDFS has a master-slave architecture in which the master is the namenode and the slaves are the datanodes. Each
datanode sends a heartbeat signal to the namenode; when the namenode does not receive heartbeat signals from a
datanode, it assumes that the datanode is either dead or non-functional.
As soon as the datanode is declared dead/non-functional, all the data blocks it hosted are re-replicated to other
datanodes from the remaining replicas, so the replication factor is restored. This is how the namenode handles
datanode failures.
3. How does a client read data from HDFS?
To read data from HDFS, the client needs to communicate with the namenode for metadata. The client gets the names
of the files and their locations from the namenode: the namenode responds with the number of blocks of the file, the
replication factor, and the datanodes where each block is stored.
Now the client communicates with the datanodes where the blocks are actually stored. The client starts reading data in
parallel from the datanodes based on the information received from the namenode. Once the client or application has
received all the blocks of the file, it combines these blocks to form the file.
For read performance improvement, the choice of which replica to read from is based on its distance from the client
(rack awareness).
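Purely as an illustration of this read path (the file path is a placeholder), a simple command-line read goes through exactly this flow, asking the namenode for the block locations and then streaming the blocks from the nearest datanodes:
hdfs dfs -cat /user/data/sample.txt | head -n 10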