
Characteristics and Analysis of Hadoop Distributed Systems

2020, Technology Reports of Kansai University

In recent years, data and the internet have grown rapidly, which has given rise to big-data problems. To address these problems, many software frameworks are used to increase the performance of distributed systems and to make large data storage available. One of the most beneficial software frameworks for using data in distributed systems is Hadoop. It clusters machines and coordinates the work between them. Hadoop consists of two major components: the Hadoop Distributed File System (HDFS) and MapReduce (MR). With Hadoop, we can process a large file, distribute the work, and count how often each word occurs in it. In this paper, we explain what Hadoop is, its architecture, how it works, and its performance in distributed systems according to many authors. In addition, we assess each paper and compare the papers with each other.

ISSN: 04532198
Volume 62, Issue 04, April, 2020

Subhi R. M. Zeebaree1, Hanan M. Shukur2, Lailan M. Haji3, Rizgar R. Zebari4, Karwan Jacksi5, Shakir M. Abas6

1 Duhok Polytechnic University, Duhok – Kurdistan Region, Iraq
2 Al Kitab University, Kirkuk – Iraq
3 University of Zakho, Duhok – Kurdistan Region, Iraq
4 Duhok Polytechnic University, Duhok – Kurdistan Region, Iraq
5 University of Zakho, Duhok – Kurdistan Region, Iraq
6 Duhok Polytechnic University, Duhok – Kurdistan Region, Iraq

Keywords: Hadoop, Distributed system, Cloud, Map Reduce, HDFS.

1. Introduction

Nowadays, the data on the internet and in distributed systems is increasing rapidly [1], [2]. Generally, there are about 4 petabytes of data on the various servers. Processing this huge amount of data efficiently requires complex technologies. The data is stored on many distributed machines and accessed through parallel processing.
Therefore, there is growing competition for access to shared data resources in parallel processing systems [3], [4]. Many techniques can be used to solve these problems [5]. Studies tend towards a decomposition approach that handles huge amounts of data by breaking the problem into pieces and solving these pieces simultaneously [6], [7]. For instance, a cloud computing distributed system provides a mechanism to process an enormous amount of data [8]. Also, the combination of a distributed system and parallel processing can solve some clients' problems remotely in minimum time [9], [10]. Shared memory systems and distributed memory systems are two further approaches; they are combined in hybrid parallel processing systems, which are used to solve complex network problems such as limits on data size [11], [12]. The performance of a distributed system is an important issue, and the architecture of the system has a vital impact on it. For example, a three-tier architecture (3TA) system performs more proficiently and precisely than a two-tier architecture (2TA) system when OPNET is used as the assessment and design tool [13].

There are many techniques for analyzing the performance of distributed systems. One of the most widely used is Hadoop, an open-source software framework that distributes the processes that store and manage big-data applications run on clustered systems. It has become the center of a growing ecosystem used for advanced analytics, data mining, predictive analytics, and machine learning applications of big data in distributed systems [14]. Hadoop handles various forms of structured and unstructured data, and it allows users to control, process, and analyze data stored in huge databases and data warehouses [15].
It is generally known as Apache Hadoop and is developed as an open-source project of the Apache Software Foundation (ASF). Hadoop runs on clusters of many servers and can support thousands of hardware nodes [16]. Hadoop uses a distributed file system designed for fast data access across the nodes of the cluster. In Hadoop, the data is divided among all the nodes in a cluster, as illustrated in Figure 1 [5].

Figure 1: Data distributed across nodes [14].

In addition, Hadoop can tolerate faults: applications continue running if one of the nodes fails. Hadoop is a foundation for data management and data analysis in huge distributed systems [17]. The main components of the first version of Hadoop are the Hadoop Distributed File System (HDFS) and MapReduce (MR). MR uses two functions, Map and Reduce; these functions divide the processing into multiple tasks, each of which runs on the nodes of the cluster [14]. Hadoop is primarily used for analytics, and it is very suitable for big-data environments because it can process and store different data types. These include various structured and unstructured forms of data, such as web server logs, internet clickstream records, and social media posts [18].

The architecture of HDFS is master-slave. Each Hadoop cluster consists of a single NameNode, which manages the file system, and DataNodes, which manage the storage of data on each node. Together, the elements of HDFS support large data applications [19]. HDFS has many features: it is highly fault-tolerant and can be deployed on low-cost hardware. In addition, Hadoop is suitable for accessing large data in different locations. The architecture of the master node is similar to that of the Google File System (GFS) and the General Parallel File System (GPFS) [20].
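As a rough illustration of the Map and Reduce functions mentioned in this introduction, the sketch below counts words in plain Python. It is not Hadoop code and does not use the Hadoop API; the function names (map_words, shuffle, reduce_counts, word_count) are hypothetical, and the list of strings stands in for the blocks of a large file.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(chunk: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input chunk."""
    for word in chunk.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the intermediate values by word between the map and reduce steps."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: sum all the counts emitted for one word."""
    return word, sum(counts)


def word_count(chunks: List[str]) -> Dict[str, int]:
    """Run map over every chunk, group the pairs, then reduce each group."""
    intermediate = [pair for chunk in chunks for pair in map_words(chunk)]
    return dict(reduce_counts(w, c) for w, c in shuffle(intermediate).items())


if __name__ == "__main__":
    # Two strings stand in for two blocks of a large file.
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

In a real cluster the map calls would run on different nodes, one per block, and the grouping step would be carried out by the framework; here everything runs in one process only to show the data flow.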
In this paper, we explain Hadoop and its performance in distributed systems. In addition, we compare Hadoop with other software used to analyze and manage huge data quickly, according to several authors.

2. Literature Review

Hadoop is a software framework written in Java; its main job is to run various applications on huge clusters. It has some features similar to those of the Google File System (GFS) and MapReduce. Hadoop gives applications high-throughput access to data and is very suitable for applications that have huge data. Even a single machine with more than a hundred CPUs is not as fast at processing as Hadoop. HDFS divides a large data file and distributes it across the different machines in the cluster. Each piece of data is replicated across machines in the cluster, so if one of the nodes fails, the results are not affected. Data in the Hadoop framework is record-oriented. A single input is divided into records in various formats, and each node in the cluster processes a subset of the records. The Hadoop framework schedules this processing close to where the records are located, using knowledge from the DFS. The DFS breaks the files into chunks, which are computed by the nodes in the cluster [5].

The data on the internet has become very large, and technologies are needed to process it efficiently. A distributed file system therefore stores the data on different nodes instead of on one node. The data is processed in parallel, which gives quicker access to and management of the data than other systems [14].

Cluster storage consists of many resources, and each node in the cluster can read, delete, and write. Unlike in local systems, the storage resources and the clients are distributed across the network. Users can access data from different locations, and files are shared through a hierarchical path.
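The splitting and replication of a large file described in this review can be sketched as follows. This is a simplified model under stated assumptions, not how HDFS actually places replicas: the round-robin placement, the node names, and the tiny block size are all made up for illustration.

```python
from typing import Dict, List, Set


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut the file content into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes in round-robin order (an assumption)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_blocks(placement: Dict[int, List[str]], failed: Set[str]) -> List[int]:
    """Return the blocks that still have at least one replica on a live node."""
    return [b for b, holders in placement.items() if any(n not in failed for n in holders)]


if __name__ == "__main__":
    blocks = split_into_blocks(b"x" * 10, block_size=4)
    placement = place_replicas(len(blocks), ["node1", "node2", "node3"], replication=2)
    # With two replicas per block, the loss of a single node leaves every block readable.
    print(readable_blocks(placement, failed={"node2"}))
```

With two copies of every block, any single node can fail without losing data, which is the behavior the reviewed papers attribute to HDFS; losing two nodes at once can make a block unreadable in this toy setup.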
HDFS was created to deal with large data distributed over several machines on the network. Hadoop is also a fault-tolerant system for huge data in clustered systems [15].

The most successful applications on the internet, such as search engines and social media, depend on data processing, and processing huge data has become more important for some organizations. Many techniques are used to analyze this data, and the software most used for this purpose is Hadoop. HDFS mainly consists of two types of nodes in the cluster: a single NameNode and a number of DataNodes. The NameNode runs on the master server and performs the namespace operations of HDFS, including opening, saving, closing, and renaming files, and it maintains the mapping between files and the data blocks on the nodes. The DataNodes process and manage the storage of their nodes. The data is divided into blocks, and each block is stored on different nodes [16].

3. Hadoop Distributed File System Architecture

The Hadoop architecture is simple compared with other kinds of file system techniques. The main components of HDFS are the master and the slaves. Generally, an HDFS cluster has a single NameNode as the master and many DataNodes as slaves. HDFS stores file system metadata and application data separately. All servers communicate using TCP-based protocols [21]. Figure 2 shows the HDFS architecture. HDFS consists of many components, each designed with a different architecture and for a different job [17].

3.1 NameNode

The NameNode holds all the file system metadata of the cluster and controls the DataNodes. The NameNode is considered the central controller of HDFS, but it does not store any of the file data itself. It only knows the blocks of each file and where these blocks are located. The NameNode directs clients to the DataNodes. In addition, it keeps track of the storage capacity of the cluster.
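The kind of metadata the NameNode keeps, as described above, can be pictured with a small in-memory model. The class and field names below (NameNodeSketch, file_to_blocks, block_to_datanodes, free_bytes) are hypothetical and are not taken from the Hadoop code base; the sketch only shows that the NameNode answers location queries without holding any file content.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class NameNodeSketch:
    """Holds only metadata: which blocks make up a file and where each block lives."""
    file_to_blocks: Dict[str, List[str]] = field(default_factory=dict)
    block_to_datanodes: Dict[str, List[str]] = field(default_factory=dict)
    free_bytes: Dict[str, int] = field(default_factory=dict)  # capacity left per DataNode

    def add_file(self, path: str, blocks: Dict[str, List[str]]) -> None:
        """Record a new file in the namespace together with its block locations."""
        self.file_to_blocks[path] = list(blocks)  # block IDs in file order
        self.block_to_datanodes.update(blocks)

    def locate(self, path: str) -> List[Tuple[str, List[str]]]:
        """Answer a client: the file's blocks, in order, with the DataNodes holding each."""
        return [(b, self.block_to_datanodes[b]) for b in self.file_to_blocks[path]]

    def cluster_free_bytes(self) -> int:
        """Total remaining storage capacity across the cluster."""
        return sum(self.free_bytes.values())


if __name__ == "__main__":
    nn = NameNodeSketch(free_bytes={"dn1": 100, "dn2": 50})
    nn.add_file("/logs/a.txt", {"blk_1": ["dn1", "dn2"], "blk_2": ["dn2"]})
    print(nn.locate("/logs/a.txt"))
    print(nn.cluster_free_bytes())
```

A client would use the result of locate to contact the DataNodes directly, so the file content itself never passes through this object.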
The NameNode is responsible for recording any change to the namespace and to the properties of the file system. HDFS maintains the number of replicas of each file [17]; this number is known as the replication factor and is stored by the NameNode. Clients cannot read or write any file in HDFS without the NameNode, and MapReduce cannot execute its jobs without it either. The NameNode is the main node of the HDFS architecture and maintains and manages the data blocks on the slave nodes. It manages the File System Namespace (FSN) and runs on highly available servers. The HDFS architecture is designed so that user data is never stored on the NameNode; user data is stored on the DataNodes only [22].

Figure 2: HDFS architecture [5].

3.2 DataNode

All application data in HDFS is stored on the DataNodes. On startup, a DataNode connects to the NameNode and performs a handshake to verify the namespace ID and the version of the DataNode software [5]. If either does not match that of the NameNode, the DataNode automatically shuts down. HDFS checks the connections between the DataNodes and the NameNode using heartbeat messages. Every three seconds, the DataNode sends a heartbeat to the NameNode over a TCP connection. Every tenth heartbeat is a block report, in which the DataNode informs the NameNode of all the data it holds [21]. The DataNode has several jobs: storing the metadata of the slave node, keeping track of slave nodes by means of heartbeats, copying the data of a slave node to other slave nodes according to the replication factor, and balancing data across the slave nodes [19]. The DataNode process always runs in the background on each slave node of the Hadoop cluster. Each file in HDFS is divided into small pieces known as blocks, and these blocks are saved on the slave nodes [22].

3.3 Secondary NameNode

Hadoop includes a server known as the Secondary NameNode.
This node connects to the NameNode periodically, about once an hour, to keep a copy of the NameNode data in its own storage. The copy is sent back to the NameNode, and the Secondary NameNode also keeps a copy for itself [5]. When the NameNode becomes unavailable, the data stored in the Secondary NameNode can be used by the cluster nodes. In a busy cluster, the administrator can configure the Secondary NameNode to provide this service [22]. The main job of the Secondary NameNode is to store a copy of the FsImage file and the EditLog file. The EditLog is a record of every change made to the file system metadata, and the FsImage is an image of the HDFS file system taken at a given point. The Secondary NameNode supports two options, geteditsize and checkpoint: the first returns the size of the edits file on the NameNode, and the second creates a checkpoint, which can be forced irrespective of the EditLog size.

3.4 HDFS Clients

User applications access the file system through the HDFS client, which supports reading, writing, and deleting files. When an application reads a file, the HDFS client asks the NameNode for the list of DataNodes that hold the file's blocks and then connects to a DataNode to transfer the stored data block [18]. The client manages and organizes the path among the nodes and sends the data. The HDFS client implements the Hadoop file system interface and performs the file tasks; it uses the client protocol to connect to the NameNode and the DataNodes to read and write. The client asks the master node which chunk servers to contact, identifying the data by file name and byte offset: it translates these into a chunk index and sends the index to the master [20].

3.5 Data Replication

HDFS replicates the blocks of files for fault tolerance, and the application can specify the number of copies of a file. This number can be changed at any time.
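Because the replication factor can change at any time, something has to bring the actual number of copies back in line with the requested one. The sketch below shows one way to compute such a plan; it is an illustrative assumption, not the algorithm HDFS uses, and the function name plan_replication and the "copy"/"delete" action labels are hypothetical.

```python
from typing import Dict, List, Tuple


def plan_replication(
    replicas: Dict[str, List[str]], live_nodes: List[str], target: int
) -> List[Tuple[str, str, str]]:
    """Return (action, block, node) steps that bring every block to `target` live replicas."""
    plan: List[Tuple[str, str, str]] = []
    for block, holders in sorted(replicas.items()):
        # Only replicas on live nodes count towards the replication factor.
        current = [n for n in holders if n in live_nodes]
        if len(current) < target:
            # Under-replicated: copy to live nodes that do not hold the block yet.
            candidates = [n for n in live_nodes if n not in current]
            for node in candidates[: target - len(current)]:
                plan.append(("copy", block, node))
        elif len(current) > target:
            # Over-replicated: drop the surplus copies.
            for node in current[target:]:
                plan.append(("delete", block, node))
    return plan


if __name__ == "__main__":
    replicas = {"blk_1": ["dn1", "dn2"], "blk_2": ["dn1", "dn2", "dn3"]}
    live = ["dn1", "dn2", "dn3", "dn4"]
    print(plan_replication(replicas, live, target=3))  # raise the factor to 3
    print(plan_replication(replicas, live, target=2))  # lower the factor to 2
```

The same function also covers node failure: removing a node from live_nodes makes its blocks under-replicated, so the plan contains the copies needed to restore the target.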
HDFS relies on an intelligent replica placement model for reliability and performance [5]. Optimizing this model gives HDFS very good performance compared with other distributed file systems. Generally, a large HDFS environment consists of many computers (nodes) installed in different racks. Communication between DataNodes in different racks is slower than communication between DataNodes in the same rack [22]. For this reason, the NameNode tries to optimize the communication between DataNodes, and it determines the rack to which each DataNode belongs [15].

4. Hadoop MapReduce Framework

Hadoop MapReduce is an important software framework for writing programs that process large data sets stored in a distributed system on huge clusters. MapReduce divides the input data set into separate chunks, which the map tasks process in parallel [19]. The framework stores the output of the maps, which then becomes the input to the reduce tasks. Both the input and the output of a job are stored in the file system. The framework also schedules the tasks, monitors them, and re-executes the tasks that fail [14]. Figure 3 illustrates MapReduce.

Figure 3: MapReduce [14].

Usually, the compute nodes and the storage nodes are the same; that is, HDFS and MapReduce run on the same set of nodes [21]. MapReduce is a very simple programming model used to process huge data on a parallel system. The MapReduce framework creates multiple tasks according to the data blocks, and the tasks for all blocks run in parallel [14]. The MapReduce program is very small, and it is distributed to all the nodes in the cluster. Data is not transferred through the NameNode. When a user needs to store data on the nodes, the NameNode provides the block information to the user [18].

5. HDFS Technique and Performance

MapReduce is considered the heart of Hadoop; it is a programming framework that works on large data sets in a parallel system and sits at the center of the Hadoop package. Hadoop processes huge data across thousands of servers or more [19]. Hadoop provides a new method to manage, store, and process huge data. It is used for distributed processing on huge computer clusters and is written in Java [22]. Hadoop has many features that distinguish it from traditional data management software, and some big companies use Hadoop to store and manage large data on servers [14]. It is highly capable of running many applications on a thousand or more nodes in clustered systems. Hadoop allows the administrator to easily access new structured or unstructured data and to create new information from that data. Hadoop is also used for many purposes, such as data warehousing, data processing, and data analysis [16]. In addition, Hadoop has a unique way of storing data in a distributed file system: the data is processed by tools that are loaded on the same servers that hold the data. This method is much faster than other techniques [20].

6. Discussion

In the papers reviewed here on HDFS performance, the authors have different ideas but the same objectives. The main goal of using Hadoop in distributed systems is to accelerate the storage, processing, analysis, and management of huge data. Each author explains Hadoop in a different way: what Hadoop is, the architecture of Hadoop, and the functions of Hadoop in distributed systems. In addition, the authors compare the performance of Hadoop in distributed systems with that of other traditional techniques. Table 1 summarizes the functions and techniques of Hadoop in distributed systems for huge clusters.

Table 1: Performance analysis in distributed systems

| Author(s) | Year | Author's objectives | Description |
|---|---|---|---|
| Fayaz and Tarakeswar [5] | 2016 | Hadoop speeds up access to data. | Considers Hadoop the best software for controlling data storage in distributed systems. |
| Devulapalli and Lakshmi [14] | 2015 | Hadoop processes and manages huge data in distributed systems. | Hadoop is very fast at creating new data from unstructured data. |
| Mittal et al. [15] | 2015 | The architecture of Hadoop is easy and makes use of the performance of HDFS. | The design of Hadoop helps administrators rearrange data efficiently. |
| Lu and Alwerfali [16] | 2016 | Compares Hadoop with other software and explains its features. | Hadoop is written in a high-level language compared with traditional software. |
| Sajwan et al. [17] | 2015 | Hadoop's capability for maintaining data in a huge cluster system. | The huge data on internet servers can be managed simply by Hadoop. |
| Honnutagi [18] | 2014 | Compares the functions of Hadoop in single and multiple systems. | Hadoop performs better in a distributed system than in a single system. |
| Tomašić et al. [19] | 2012 | Compares Hadoop's features for huge data storage with storage systems. | Hadoop allows the user to ignore many errors without noticing them. |
| Nazini and Sasikala [20] | 2019 | Performance of Hadoop when running applications on multiple servers. | A large application can be divided into pieces, each on a different node, with good performance. |
| Lin and Lin [21] | 2017 | Methods and technology of Hadoop in detail for distributed systems. | Distributed systems can benefit from Hadoop without heavy loading. |

7. Conclusion

HDFS is one of the best techniques for maintaining, analyzing, processing, and managing large data. In addition, it makes access to data on different servers in a clustered system very easy and fast. Hadoop can hold petabytes of data and supports large applications on clusters. In addition, Hadoop stores redundant copies of data in different locations to guarantee that the data is kept.
MapReduce is also considered the core of Hadoop; it performs many jobs that make Hadoop easy to use. HDFS, in turn, holds the data when Hadoop is used to process and count the words in a huge database in distributed systems. The performance of Hadoop in distributed systems is clearly very good compared with other software used for the same purpose. Many large companies, such as Facebook, use Hadoop.

8. References

[1] S. R. Zeebaree, K. F. Jacksi, and R. R. Zebari, “Impact analysis of SYN flood DDOS attack on HAPROXY and NLB cluster-base web servers,” Indonesian Journal of Electrical Engineering and Computer Science, vol. 19, no. 1, Jul. 2020, doi: 10.11591/ijeecs.v19.i1.pp%p.
[2] R. R. Zebari, S. R. Zeebaree, and K. Jacksi, “Impact Analysis of HTTP and SYN Flood DDoS Attacks on Apache 2 and IIS 10.0 Web Servers,” in 2018 International Conference on Advanced Science and Engineering (ICOASE), 2018, pp. 156–161.
[3] L. M. Haji, S. R. M. Zeebaree, K. Jacksi, and D. Q. Zeebaree, “A State of Art Survey for OS Performance Improvement,” Science Journal of University of Zakho, vol. 6, no. 3, pp. 118–123, Sep. 2018, doi: 10.25271/sjuoz.2018.6.3.516.
[4] O. H. Jader, S. R. Zeebaree, and R. R. Zebari, “A State of Art Survey For Web Server Performance Measurement And Load Balancing Mechanisms.”
[5] H. G. Fayaz and K. Tarakeswar, “File Systems and Hadoop Distributed File System in Big Data,” IJARCCE, vol. 5, pp. 36–40, Dec. 2016, doi: 10.17148/IJARCCE.2016.51207.
[6] S. R. M. Zeebaree and O. Mohammed, “Effects of Parallel Processing Implementation on Balanced Load-Division Depending on Distributed Memory Systems,” J. of University of Anbar for Pure Science, ISSN: 1991-8941, vol. 5, Nov. 2011.
[7] S. R. Zeebaree, R. R. Zebari, and K. Jacksi, “Performance analysis of IIS 10.0 and Apache2 Cluster-based Web Servers under SYN DDoS Attack,” 2020.
[8] Z. N. Rashid, S. R. M. Zeebaree, K. H. Sharif, and K. Jacksi, “Distributed Cloud Computing and Distributed Parallel Computing: A Review,” in 2018 International Conference on Advanced Science and Engineering (ICOASE), Oct. 2018, pp. 167–172, doi: 10.1109/ICOASE.2018.8548937.
[9] Z. N. Rashid, S. R. M. Zeebaree, and A. Shengul, “Design and Analysis of Proposed Remote Controlling Distributed Parallel Computing System Over the Cloud,” in 2019 International Conference on Advanced Science and Engineering (ICOASE), Apr. 2019, pp. 118–123, doi: 10.1109/ICOASE.2019.8723695.
[10] R. R. Zebari, S. R. Zeebaree, K. Jacksi, and H. M. Shukur, “E-Business Requirements for Flexibility And Implementation Enterprise System: A Review.”
[11] S. R. M. Zeebaree and A. Yowakib, “Improved Approach for Unbalanced Load-Division Operations Implementation on Hybrid Parallel Processing Systems,” Journal of University of Zakho, vol. 1, pp. 832–848, Sep. 2013.
[12] S. R. Zeebaree, R. R. Zebari, K. Jacksi, and D. A. Hasan, “Security Approaches For Integrated Enterprise Systems Performance: A Review.”
[13] B. R. Ibrahim, S. R. M. Zeebaree, and B. K. Hussan, “Performance Measurement for Distributed Systems using 2TA and 3TA based on OPNET Principles,” Science Journal of University of Zakho, vol. 7, no. 2, pp. 65–69, Jun. 2019, doi: 10.25271/sjuoz.2019.7.2.603.
[14] S. Devulapalli and A. Lakshmi, “Performance evaluation of Hadoop Distributed File System In pseudo distributed mode and fully distributed mode,” International Journal of Computer Sciences and Engineering, vol. 3, no. 9, Sep. 2015.
[15] P. Mittal, V. Jain, and T. Ahuja, “File System and Hadoop Distributed File System-An Analogy,” International Journal of Innovations & Advancement in Computer Science, vol. 4, 2015.
[16] S. Lu and H. Alwerfali, “Implementation and Performance Analysis of Apache Hadoop,” IOSR Journal of Computer Engineering (IOSR-JCE), vol. 18, no. 5, pp. 48–58, 2016.
[17] V. Sajwan, V. Yadav, and M. Haider, “The Hadoop Distributed File System: Architecture and Internals,” International Journal of Combined Research & Development (IJCRD), vol. 4, no. 3, May 2015.
[18] P. S. Honnutagi, “The Hadoop distributed file system,” International Journal of Computer Science and Information Technologies, vol. 5, no. 5, 2014.
[19] I. Tomašić, J. Ugovšek, A. Rashkovska, and R. Trobec, “Multicluster Hadoop Distributed File System,” in 2012 Proceedings of the 35th International Convention MIPRO, May 2012, pp. 301–305.
[20] H. Nazini and T. Sasikala, “Simulating aircraft landing and take off scheduling in distributed framework environment using Hadoop file system,” Cluster Computing, vol. 22, Nov. 2019, doi: 10.1007/s10586-018-1980-y.
[21] C. Lin and Y. Lin, “An overall approach to achieve load balancing for Hadoop Distributed File System,” International Journal of Web and Grid Services, vol. 13, p. 448, Jan. 2017, doi: 10.1504/IJWGS.2017.10008333.
[22] J. Lee, J. Chung, and D. Lee, “Efficient data replication scheme based on hadoop distributed file system,” International Journal of Software Engineering and Its Applications, vol. 9, no. 12, pp. 177–186, Jan. 2015, doi: 10.14257/ijseia.2015.9.12.16.

This work is licensed under a Creative Commons Attribution Non-Commercial 4.0 International License.