ISSN: 04532198
Volume 62, Issue 04, April, 2020
Characteristics and Analysis of Hadoop Distributed
Systems
Subhi R. M. Zeebaree1, Hanan M. Shukur2, Lailan M. Haji3, Rizgar R. Zebari4,
Karwan Jacksi5, Shakir M. Abas6
Duhok Polytechnic University, Duhok – Kurdistan Region, Iraq1
Al Kitab University, Kirkuk – Iraq2
University of Zakho, Duhok – Kurdistan Region, Iraq3
Duhok Polytechnic University, Duhok – Kurdistan Region, Iraq4
University of Zakho, Duhok – Kurdistan Region, Iraq5
Duhok Polytechnic University, Duhok – Kurdistan Region, Iraq6
Abstract--In recent years, data volumes and internet use have grown rapidly, giving rise to
big-data problems. To address these problems, many software frameworks are used to increase
the performance of distributed systems and to make large-scale data storage available.
One of the most beneficial software frameworks for working with data in distributed systems is
Hadoop, which clusters machines and coordinates the work among them.
Hadoop consists of two major components: the Hadoop Distributed File System (HDFS) and
MapReduce (MR). With Hadoop we can, for example, process a large file, count each word in it,
and determine how many times each word occurs. In this paper, we explain what Hadoop is,
its architecture, how it works, and its performance in distributed systems as analyzed
by many authors. In addition, we assess the reviewed papers and compare them with each other.
Keywords: Hadoop, Distributed system, Cloud, Map Reduce, HDFS.
1. Introduction
Nowadays, the volume of data on the internet and in distributed systems is increasing rapidly [1], [2].
Generally, about 4 petabytes of data are held on the various servers, and processing data at this
scale efficiently requires sophisticated technologies. The data is stored on many distributed
machines and accessed through parallel processing. As a result, there is growing competition for
access to shared data resources in parallel processing systems [3], [4]. Many techniques can
be used to solve these problems [5]. Studies tend toward a divide-and-conquer approach that
handles huge amounts of data by breaking the problem into pieces and solving the pieces
simultaneously [6], [7]. For instance, a cloud computing distributed system provides a mechanism
for processing an enormous amount of data [8]. Also, the combination of a distributed system and
parallel processing can solve some clients’ problems remotely in minimal time [9], [10]. Shared
memory systems and distributed memory systems are two further approaches; they are combined
in hybrid parallel processing systems, which are used to solve complex network problems such as
limits on data size [11], [12]. The performance of the distributed system is an important
Zeebaree, et.al, 2020
TRKU
issue: the architecture of the system has a vital impact on system performance. For example, a
three-tier architecture (3TA) system was found to be more proficient and precise than a two-tier
architecture (2TA) system when OPNET was used as the assessment and design tool [13]. Many
techniques are used to analyze the performance of distributed systems. One of the most widely
used is Hadoop, an open-source software framework that distributes the processes that store and
manage big-data applications running on clustered systems. It sits at the center of a growing
ecosystem used for advanced analytics, data mining, predictive analysis, and machine learning
applications on big data in distributed systems [14]. Hadoop can hold various forms of
structured and unstructured data, and it allows users to control, process, and analyze data
stored in huge databases and data warehouses [15]. It is generally known as Apache Hadoop
because it is developed as an open-source project within the Apache Software Foundation
(ASF). Hadoop runs on clusters of many servers and can support thousands of hardware
nodes [16]. Hadoop uses a distributed file system designed for fast access to data across the
nodes of the cluster. In Hadoop, the data is divided among all the nodes in a cluster, as
illustrated in Figure 1 [5].
Figure 1: Data distributed across the nodes [14].
In addition, Hadoop tolerates failures: applications continue running if one of the nodes goes
down. Hadoop is a foundation for data management and data analysis in huge distributed
systems [17]. The main components of the first version of Hadoop are the Hadoop
Distributed File System (HDFS) and MapReduce (MR). MR uses two functions, Map and
Reduce, which divide a job into multiple tasks, each of which runs on the nodes of the
cluster [14]. The primary use of Hadoop is analytics; it is also well suited to big-data
environments because it can process and store different data types. In addition, it handles
various forms of structured and unstructured data, such as web server data, internet
clickstreams, and social media posts [18]. The architecture of HDFS is master-slave. Each
Hadoop cluster consists of a single NameNode, which manages the file system, and DataNodes, which
manage the storage of data on each node. Together, the elements of HDFS support
large data applications [19]. HDFS has many features: it is highly fault tolerant and is
designed to be deployed on low-cost hardware. In addition, Hadoop is suitable for accessing large
data sets stored in different locations. The architecture of the master node is similar to that of
the Google File System (GFS) and the General Parallel File System (GPFS) [20]. In this paper, we
explain Hadoop and its performance in distributed systems. In addition, we compare Hadoop with
other software used to analyze and manage huge data quickly, according to several authors.
2. Literature Review
Hadoop is a software framework written in Java; its main job is to run various applications on
huge clusters. It has features similar to those of the Google File System (GFS) and MapReduce.
Hadoop gives applications high-throughput access to data and is well suited to applications with
huge data sets. Even a single machine with more than a hundred CPUs does not process such data
as fast as Hadoop. HDFS divides a large data file and distributes it across different machines
in the cluster. Each piece of data is replicated across machines in the cluster, so if one of
the nodes fails, the results are not affected. Data in the Hadoop framework is record-oriented.
Each input file is divided into records in an application-specific format, and each node in the
cluster processes a subset of the records. The Hadoop framework schedules this processing
according to the location of the records, using knowledge from the distributed file system (DFS).
The DFS breaks the files into chunks, which are processed by the nodes in the cluster [5]. The
data on the internet has become very large and requires technologies that can process it
efficiently. A distributed file system therefore stores the data on different nodes instead of
one node. The data is processed in parallel, which makes accessing and managing it faster than
in other systems [14]. Cluster storage consists of many resources, and each node in the cluster
can read, write, and delete. In contrast to local file systems, the storage resources and clients
are distributed across the network. Users can access data from different locations, and files
are shared in a hierarchical manner. HDFS was created to deal with large data distributed over
several machines on the network. Hadoop is also a fault-tolerant system for huge data in
clustered systems [15]. The most successful internet applications, such as search engines and
social media, depend on data processing, and processing huge data has become increasingly
important for some organizations. Many techniques are used to analyze this data, and the most
widely used software for this purpose is Hadoop. HDFS mainly consists of two types of nodes in
the cluster: a single NameNode and a number of DataNodes. The NameNode runs on the master server
and performs HDFS namespace operations such as opening, saving, closing, and renaming files; it
also maintains the mapping between files and the data blocks on the nodes. The DataNodes manage
the storage of their nodes. The data is divided into blocks, and each block is stored on
different nodes [16].
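As a rough illustration of the block division and replication described above, the following
Python sketch splits a byte string into fixed-size blocks and assigns each block to several
nodes. The function names, the tiny block size, and the round-robin placement are assumptions
made for illustration only; they are not the placement policy of HDFS itself.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the data into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes (hypothetical round-robin)."""
    if not 1 <= replication <= len(nodes):
        raise ValueError("replication must be between 1 and the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def surviving_blocks(placement: dict[int, list[str]], failed_node: str) -> set[int]:
    """Return the blocks that still have at least one replica after one node fails."""
    return {
        block
        for block, holders in placement.items()
        if any(node != failed_node for node in holders)
    }


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=2)
print(blocks)
print(placement)
print(surviving_blocks(placement, failed_node="dn2"))
```

With two replicas per block, every block in this sketch remains readable after any single node
fails, which is the property the text attributes to replication.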
3. Hadoop Distributed File System Architecture
The Hadoop architecture is simple compared with other kinds of file systems. The main
components of HDFS are the master and the slaves. Generally, an HDFS cluster has a single
NameNode, which acts as the master, and many DataNodes, which act as slaves. HDFS stores file
system metadata and application data separately. All servers are connected and communicate
through TCP-based protocols [21]. Figure 2 shows the HDFS architecture. HDFS consists of several
components, each designed with a different architecture and for a different job [17].
3.1 NameNode
The NameNode of the cluster holds all file system metadata and controls the DataNodes. The
NameNode is the central controller of HDFS, but it does not store any file data itself. It
knows only which blocks make up each file and where those blocks are located. The NameNode
directs clients to the DataNodes they need. In addition, it keeps track of the storage capacity
of the cluster. The NameNode is responsible for recording any change to the namespace and the
properties of the file system. HDFS maintains a specified number of replicas of each file [17].
The number of copies of a file is known as the replication factor and is stored by the NameNode.
Clients cannot read or write any files in HDFS without the NameNode, and MapReduce cannot
execute its jobs either. The NameNode is therefore the main node of the HDFS architecture, used
to maintain and manage the data blocks on the slave nodes. The NameNode manages the File System
Namespace (FSN) and runs on a highly available server. HDFS is designed so that user data is
never stored on the NameNode; user data is stored on the DataNodes only [22].
Figure 2: HDFS architecture [5].
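To make the NameNode's role as a metadata-only controller more concrete, the sketch below keeps
three in-memory tables: file to block list, file to replication factor, and block to DataNode
locations. The class name `NameNodeSketch`, its methods, and the block and node identifiers are
hypothetical and greatly simplified; no file contents are held anywhere in this structure.

```python
from dataclasses import dataclass, field


@dataclass
class NameNodeSketch:
    # file path -> ordered list of block ids that make up the file
    file_blocks: dict = field(default_factory=dict)
    # file path -> replication factor requested for the file
    replication: dict = field(default_factory=dict)
    # block id -> set of DataNode names that reported holding the block
    block_locations: dict = field(default_factory=dict)

    def add_file(self, path, block_ids, replication_factor):
        """Record the namespace entry for a file (metadata only)."""
        self.file_blocks[path] = list(block_ids)
        self.replication[path] = replication_factor

    def report_block(self, block_id, datanode):
        """Record that a DataNode holds a replica of a block."""
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """Return (block id, sorted DataNode names) for each block of the file."""
        return [
            (block, sorted(self.block_locations.get(block, ())))
            for block in self.file_blocks[path]
        ]

    def under_replicated(self, path):
        """Return blocks with fewer replicas than the file's replication factor."""
        wanted = self.replication[path]
        return [
            block
            for block in self.file_blocks[path]
            if len(self.block_locations.get(block, ())) < wanted
        ]


nn = NameNodeSketch()
nn.add_file("/logs/a.txt", ["blk_1", "blk_2"], replication_factor=2)
nn.report_block("blk_1", "dn1")
nn.report_block("blk_1", "dn2")
nn.report_block("blk_2", "dn3")
print(nn.locate("/logs/a.txt"))
print(nn.under_replicated("/logs/a.txt"))
```

A client would use the result of `locate` to decide which DataNodes to contact, while the list
from `under_replicated` indicates where additional copies are needed to reach the replication
factor.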
3.2 DataNode
All HDFS application data is stored on the DataNodes. At startup, a DataNode connects to the
NameNode and performs a handshake to verify the namespace ID and the
software version of the DataNode [5]. If either does not match that of the NameNode, the
DataNode automatically shuts down. HDFS checks the connection between the DataNodes and the
NameNode with heartbeat messages. Every three seconds, a DataNode sends a heartbeat to the
NameNode over a TCP connection. Every tenth heartbeat is a block report that informs the
NameNode of all the data held by the DataNode [21]. A DataNode has several jobs: storing the
data and block metadata of its slave node, reporting the state of the slave node through
heartbeats, copying data to other slave nodes according to the replication factor, and balancing
data across the slave nodes [19]. The DataNode process always runs in the background on each
slave node of the Hadoop cluster. Each HDFS file is divided into small pieces known as blocks,
and these blocks are stored on the slave nodes [22].
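The heartbeat timing described above can be sketched as a simple schedule. The code below lists
the messages a single DataNode would send over a period of time, using the three-second interval
and the every-tenth-message block report stated in the text. The function names and the timeout
used to declare a node dead are assumptions made for this illustration.

```python
HEARTBEAT_INTERVAL_S = 3   # interval stated in the text
BLOCK_REPORT_EVERY = 10    # every tenth heartbeat carries a block report


def heartbeat_schedule(duration_s: int) -> list[tuple[int, str]]:
    """Return (time in seconds, message kind) for one DataNode over duration_s seconds."""
    events = []
    count = 0
    for t in range(HEARTBEAT_INTERVAL_S, duration_s + 1, HEARTBEAT_INTERVAL_S):
        count += 1
        kind = "block_report" if count % BLOCK_REPORT_EVERY == 0 else "heartbeat"
        events.append((t, kind))
    return events


def is_alive(last_heartbeat_s: int, now_s: int, timeout_s: int) -> bool:
    """NameNode-side check; timeout_s is a hypothetical threshold."""
    return now_s - last_heartbeat_s <= timeout_s


schedule = heartbeat_schedule(60)
print(len(schedule))
print([t for t, kind in schedule if kind == "block_report"])
print(is_alive(last_heartbeat_s=57, now_s=60, timeout_s=30))
```

In one minute this schedule contains twenty messages, two of which are block reports (at 30 and
60 seconds).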
3.3 Secondary NameNode
Hadoop also includes a server known as the Secondary NameNode. This node connects to the
NameNode periodically, about once an hour, to keep a copy of the NameNode metadata in its own
storage. The copy is returned to the NameNode, and the Secondary NameNode also keeps a copy for
itself [5]. If the NameNode fails, the data stored on the Secondary NameNode can be used by the
cluster. In a busy cluster, the administrator configures the Secondary NameNode to provide this
service [22]. The main job of the Secondary NameNode is to store a copy of the FsImage file and
the edit log file. The EditLog is a record of every change made to the file system metadata, and
the FsImage is a snapshot of the HDFS file system metadata at a given point. The Secondary
NameNode supports two options: geteditsize and checkpoint. The first returns the size of the
edit log on the NameNode; the second creates a checkpoint and, when forced, does so irrespective
of the EditLog size.
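The relationship between the FsImage and the EditLog can be illustrated with a small sketch in
which a checkpoint replays the logged changes on top of the last snapshot and produces a new
snapshot with an empty log. The tuple format of the log entries and the three operations shown
are assumptions made for this example; the real on-disk formats are different.

```python
def apply_edit(namespace: dict, edit: tuple) -> None:
    """Apply one logged change to the namespace (path -> list of block ids)."""
    op = edit[0]
    if op == "create":
        _, path, blocks = edit
        namespace[path] = list(blocks)
    elif op == "rename":
        _, old_path, new_path = edit
        namespace[new_path] = namespace.pop(old_path)
    elif op == "delete":
        _, path = edit
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage: dict, edit_log: list) -> tuple[dict, list]:
    """Merge the edit log into a copy of the FsImage; return (new image, empty log)."""
    merged = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged, []


old_image = {"/a": ["blk_1"]}
log = [
    ("create", "/b", ["blk_2", "blk_3"]),
    ("rename", "/a", "/c"),
    ("delete", "/b"),
]
new_image, new_log = checkpoint(old_image, log)
print(new_image)
print(new_log)
```

Because the merge works on a copy, the previous image stays unchanged until the new one is
complete, which mirrors the idea of keeping a separate copy on the Secondary NameNode.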
3.4 HDFS Clients
A user application accesses the file system through the HDFS client. The HDFS client supports
reading, writing, and deleting files. When an application reads a file, the HDFS client first
asks the NameNode for the list of DataNodes that hold the file's blocks. It then connects to a
DataNode directly to transfer the stored data block [18]. When writing, the client organizes the
pipeline between nodes and sends the data. The HDFS client implements the Hadoop file system
interface and performs file operations; it uses the client protocol to connect to the NameNode
and the DataNodes to read and write. The main role of the client is to ask the master node which
chunk servers to contact for a given file name and byte offset: the client converts the byte
offset into a chunk index and sends it, together with the file name, to the master [20].
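The read path described above can be sketched in a few lines: the client obtains block locations
from the NameNode and then fetches each block from a DataNode that holds it. The dictionaries
standing in for the NameNode and DataNodes, and the function names, are hypothetical; the
chunk index calculation simply divides the byte offset by an assumed fixed chunk size.

```python
def chunk_index(byte_offset: int, chunk_size: int) -> int:
    """Translate a byte offset within a file into the index of the chunk holding it."""
    return byte_offset // chunk_size


def read_file(path: str, namenode: dict, datanodes: dict) -> bytes:
    """Read a file block by block.

    namenode:  path -> list of (block id, [DataNode names])   (metadata only)
    datanodes: DataNode name -> {block id: bytes}              (live nodes only)
    """
    data = b""
    for block_id, holders in namenode[path]:
        for name in holders:
            store = datanodes.get(name)
            if store is not None and block_id in store:
                data += store[block_id]   # data comes from the DataNode, not the NameNode
                break
        else:
            raise OSError(f"no live replica for {block_id}")
    return data


namenode = {"/f": [("b0", ["dn1", "dn2"]), ("b1", ["dn2", "dn3"])]}
# dn1 is absent from this map, which models a DataNode that is down.
datanodes = {
    "dn2": {"b0": b"hello ", "b1": b"world"},
    "dn3": {"b1": b"world"},
}
print(read_file("/f", namenode, datanodes))
print(chunk_index(130, chunk_size=64))
```

In this example the first replica of block `b0` is unreachable, so the client falls back to the
next DataNode in the list returned by the NameNode.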
3.5 Data Replication
HDFS replicates the blocks of files for fault tolerance, and the application can specify the
number of copies of a file. This number can be changed at any time. The placement of replicas is
critical to the reliability and performance of HDFS, which relies on an intelligent replica
placement model [5]. Optimizing this model is what gives HDFS very good performance compared
with other distributed file systems. Generally, a large HDFS environment consists of many
computers (nodes). Communication between DataNodes in different clusters is slower than
communication between DataNodes within the same cluster [22]. For this reason, the NameNode
tries to improve the connections between DataNodes. The NameNode determines the location of each
DataNode [15].
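One way to picture how location information can reduce slow cross-group traffic is to prefer the
replica nearest to the reader. The sketch below is an assumption-laden illustration, not the
HDFS implementation: the `location_of` map, the group labels, and the distance values 0, 1, and
2 are all hypothetical.

```python
def distance(a: str, b: str, location_of: dict) -> int:
    """Hypothetical network distance: 0 same node, 1 same group, 2 different groups."""
    if a == b:
        return 0
    if location_of[a] == location_of[b]:
        return 1
    return 2


def closest_replica(reader: str, replicas: list, location_of: dict) -> str:
    """Pick the replica holder with the smallest distance to the reader."""
    return min(replicas, key=lambda node: distance(reader, node, location_of))


location_of = {"dn1": "group1", "dn2": "group1", "dn3": "group2"}
print(closest_replica("dn2", ["dn3", "dn1"], location_of))
print(closest_replica("dn3", ["dn1", "dn3"], location_of))
```

A reader on `dn2` is served from `dn1`, which is in the same group, rather than from `dn3`, and
a reader that already holds a replica locally reads from itself.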
4. Hadoop MapReduce Framework
Hadoop MapReduce is an important application framework for writing programs that process large
data sets stored in a distributed system on huge clusters. MapReduce divides the input data set
into separate chunks, which the map tasks process in parallel [19]. The framework saves the
outputs of the maps, which then become the input of the reduce tasks. Both the input and the
output data are stored in the file system. MapReduce also schedules the tasks, monitors them,
and re-executes the tasks that fail [14]. Figure 3 illustrates MapReduce.
Figure 3: MapReduce [14].
Usually, the compute nodes and the storage nodes are the same; that is, HDFS and MapReduce run
on the same group of nodes [21]. MapReduce is a very simple programming model for processing
huge data on a parallel system. The MapReduce framework creates multiple tasks according to the
data blocks, and the tasks for the blocks run in parallel [14]. The size of DataReduce is
very small, and MapReduce is distributed across all nodes in the cluster. Data is not
transferred through the NameNode. When a user needs to store data on the nodes, the NameNode
provides the block information to the user [18].
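The word-counting task mentioned in the abstract is the usual way to illustrate the Map and
Reduce functions. The following single-process Python sketch imitates the three stages (map,
grouping by key, reduce) on a list of text chunks. It is only an illustration of the
programming model under simplifying assumptions: the function names are ours, nothing runs in
parallel, and no Hadoop API is used.

```python
from collections import defaultdict


def map_words(chunk: str) -> list[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Group the emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word: str, counts: list[int]) -> tuple[str, int]:
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(chunks: list[str]) -> dict[str, int]:
    """Run map over every chunk, group the pairs, then reduce each group."""
    mapped = []
    for chunk in chunks:           # in Hadoop, each chunk would be a separate map task
        mapped.extend(map_words(chunk))
    return dict(reduce_counts(word, counts) for word, counts in shuffle(mapped).items())


print(word_count(["Hadoop stores data", "hadoop processes data"]))
```

Each chunk is mapped independently of the others, which is what allows the map tasks to run in
parallel on the nodes that hold the corresponding blocks.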
5. HDFS Technique and Performance
MapReduce is considered the heart of Hadoop: it is a programming framework that works on large
data sets in a parallel system, and it sits at the core of the Hadoop package. Hadoop processes
huge data across thousands of servers or more [19]. Hadoop provides a new method for managing,
storing, and processing huge data. It is used for distributed processing on huge computer
clusters and is written in Java [22]. Hadoop has many features that distinguish it from
traditional data management software, and some big companies use it to store and manage large
data on their servers [14]. It is highly capable of running many applications on clusters of a
thousand or more nodes. Hadoop allows the administrator
to easily access new structured or unstructured data and to create new information from that
data. Hadoop is also used for many purposes, such as data warehousing, data processing, and
data analysis [16]. In addition, Hadoop stores data in a unique way in its distributed file
system: the data is processed by tools that are loaded on the same servers that hold the data.
This method is much faster than other techniques [20].
6. Discussion
In reviewing the literature on HDFS performance for this paper, we found that the authors have
different ideas but the same objectives. The main goal of using Hadoop in distributed systems is
to accelerate the storage, processing, analysis, and management of huge data. Each author
explains Hadoop in a different way: what Hadoop is, the architecture of Hadoop, and the
functions of Hadoop in a distributed system. The authors also compare the performance of Hadoop
in distributed systems with other traditional techniques. Table 1 summarizes the functions and
techniques of Hadoop in distributed systems for huge clusters as described by each author.
Table 1: Performance analysis in distributed systems

| Author(s) | Year | Author’s objectives | Description |
|---|---|---|---|
| Fayaz and Tarakeswar [5] | 2016 | Hadoop increases the speed of access to data. | Considers Hadoop the best software for controlling the storage of data in distributed systems. |
| Devulapalli and Lakshmi [14] | 2015 | Hadoop processes and manages huge data in the distributed system. | Hadoop is very fast at creating new data from unstructured data. |
| Mittal et al. [15] | 2015 | The architecture of Hadoop is easy and utilizes the performance of HDFS. | The design of Hadoop helps administrators rearrange data in an efficient manner. |
| Lu and Alwerfali [16] | 2016 | Compares Hadoop with other software and explains its features. | Hadoop is written in a high-level language compared with traditional software. |
| Sajwan et al. [17] | 2015 | Hadoop's capability for maintaining data in a huge cluster system. | The huge data on internet servers can be managed simply by Hadoop. |
| Honnutagi [18] | 2014 | Compares the functions of Hadoop in single and multiple systems. | Hadoop performs better in a distributed system than in a single system. |
| Tomašić et al. [19] | 2012 | Compares Hadoop features in huge data storage with the system of storage. | Hadoop allows the user to ignore many errors without noticing them. |
| Nazini and Sasikala [20] | 2019 | Performance of Hadoop in running applications on multiple servers. | A large application can be divided into pieces, each on a different node, with good performance. |
| Lin and Lin [21] | 2017 | Methods and technology of Hadoop in detail for distributed systems. | Distributed systems can benefit from Hadoop without heavy loading. |
7. Conclusion
HDFS is one of the best techniques for maintaining, analyzing, processing, and managing large
data. It also makes access to data on the different servers of a clustered system easy and fast.
Hadoop can hold petabytes of data and supports large applications on clusters. In addition,
Hadoop stores redundant copies of data in different locations to guarantee that the data is
kept. MapReduce is considered the core of Hadoop and performs many of the jobs that make Hadoop
easy to use. Hadoop can also be used, for example, to process a huge database in a distributed
system and count the words in it. According to the reviewed authors, the performance of Hadoop
in distributed systems is very good compared with other software used for the same purpose.
Many large companies, such as Facebook, use Hadoop.
8. References
[1] S. R. Zeebaree, K. F. Jacksi, and R. R. Zebari, “Impact analysis of SYN flood DDOS
attack on HAPROXY and NLB cluster-base web servers,” Indonesian Journal of Electrical
Engineering and Computer Science, vol. 19, no. 1, Jul. 2020, doi: 10.11591/ijeecs. v19.i1. pp%p.
[2] R. R. Zebari, S. R. Zeebaree, and K. Jacksi, “Impact Analysis of HTTP and SYN Flood
DDoS Attacks on Apache 2 and IIS 10.0 Web Servers,” in 2018 International Conference on
Advanced Science and Engineering (ICOASE), 2018, pp. 156–161.
[3] L. M. Haji, S. R. M. Zeebaree, K. Jacksi, and D. Q. Zeebaree, “A State of Art Survey for
OS Performance Improvement,” Science Journal of University of Zakho, vol. 6, no. 3,
pp. 118–123, Sep. 2018, doi: 10.25271/sjuoz.2018.6.3.516.
[4] O. H. Jader, S. R. Zeebaree, and R. R. Zebari, “A State of Art Survey For Web Server
Performance Measurement And Load Balancing Mechanisms.”
[5] H. G. Fayaz and K. Tarakeswar, “File Systems and Hadoop Distributed File System in
Big Data,” IJARCCE, vol. 5, pp. 36–40, Dec. 2016, doi: 10.17148/IJARCCE.2016.51207.
[6] S. R. M. Zeebaree and O. Mohammed, “Effects of Parallel Processing
Implementation on Balanced Load-Division Depending on Distributed Memory Systems,” J. of
University of Anbar for Pure Science, ISSN: 1991-8941, vol. 5, Nov. 2011.
[7] S. R. Zeebaree, R. R. Zebari, and K. Jacksi, “Performance analysis of IIS 10.0 and
Apache2 Cluster-based Web Servers under SYN DDoS Attack,” 2020.
[8] Z. N. Rashid, S. R. M. Zeebaree, K. H. Sharif, and K. Jacksi, “Distributed Cloud
Computing and Distributed Parallel Computing: A Review,” in 2018 International Conference on
Advanced Science and Engineering (ICOASE), Oct. 2018, pp. 167–172, doi:
10.1109/ICOASE.2018.8548937.
[9] Z. N. Rashid, S. R. M. Zeebaree, and A. Shengul, “Design and Analysis of Proposed
Remote Controlling Distributed Parallel Computing System Over the Cloud,” in 2019
International Conference on Advanced Science and Engineering (ICOASE), Apr. 2019,
pp. 118–123, doi: 10.1109/ICOASE.2019.8723695.
[10] R. R. Zebari, S. R. Zeebaree, K. Jacksi, and H. M. Shukur, “E-Business Requirements for
Flexibility And Implementation Enterprise System: A Review.”
[11] S. R. M. Zeebaree and A. Yowakib, “Improved Approach for Unbalanced Load-Division
Operations Implementation on Hybrid Parallel Processing Systems,” Journal of University of
Zakho, vol. 1, pp. 832–848, Sep. 2013.
[12] S. R. Zeebaree, R. R. Zebari, K. Jacksi, and D. A. Hasan, “Security Approaches For
Integrated Enterprise Systems Performance: A Review.”
[13] B. R. Ibrahim, S. R. M. Zeebaree, and B. K. Hussan, “Performance Measurement for
Distributed Systems using 2TA and 3TA based on OPNET Principles,” Science Journal of
University of Zakho, vol. 7, no. 2, pp. 65–69, Jun. 2019, doi: 10.25271/sjuoz.2019.7.2.603.
[14] S. Devulapalli and A. Lakshmi, “Performance evaluation of Hadoop Distributed File
System In pseudo distributed mode and fully distributed mode,” International Journal of
Computer Sciences and Engineering, vol. 3, no. 9, Sep. 2015.
[15] P. Mittal, V. Jain, and T. Ahuja, “File System and Hadoop Distributed File System-An
Analogy,” International Journal of Innovations & Advancement in Computer Science, vol. 4,
2015.
[16] S. Lu and H. Alwerfali, “Implementation and Performance Analysis of Apache Hadoop,”
IOSR Journal of Computer Engineering (IOSR-JCE), vol. 18, no. 5, pp. 48–58, 2016.
[17] V. Sajwan, V. Yadav, and M. Haider, “The Hadoop Distributed File System: Architecture
and Internals,” International Journal of Combined Research & Development (IJCRD), vol. 4, no.
3, May 2015.
[18] P. S. Honnutagi, “The Hadoop distributed file system,” International Journal of
Computer Science and Information Technologies, vol. 5, no. 5, 2014.
[19] I. Tomašić, J. Ugovšek, A. Rashkovska, and R. Trobec, “Multicluster Hadoop Distributed
File System,” in 2012 Proceedings of the 35th International Convention MIPRO, May 2012,
pp. 301–305.
[20] H. Nazini and T. Sasikala, “Simulating aircraft landing and take off scheduling in
distributed framework environment using Hadoop file system,” Cluster Computing, vol. 22, Nov.
2019, doi: 10.1007/s10586-018-1980-y.
[21] C. Lin and Y. Lin, “An overall approach to achieve load balancing for Hadoop
Distributed File System,” International Journal of Web and Grid Services, vol. 13, p. 448, Jan.
2017, doi: 10.1504/IJWGS.2017.10008333.
[22] J. Lee, J. Chung, and D. Lee, “Efficient data replication scheme based on hadoop
distributed file system,” International Journal of Software Engineering and Its Applications, vol.
9, no. 12, pp. 177–186, Jan. 2015, doi: 10.14257/ijseia.2015.9.12.16.
This work is licensed under a Creative Commons Attribution Non-Commercial 4.0
International License.