
Critical Study of Hadoop Implementation and Performance Issues


Madhavi Vaidya, Asst. Professor, Dept. of Computer Science, Vivekanand College, Mumbai, India
Dr. Shriniwas Deshpande, Associate Professor, Head of PG Dept. of Computer Science & Technology, DCPE, HVPM, Amravati, India

Abstract

The MapReduce model has become an important parallel processing model for large-scale data-intensive applications such as data mining and web indexing. Hadoop, an open-source implementation of MapReduce, is widely applied to support cluster computing jobs requiring low response time. This paper discusses the major issues of Hadoop and surveys the solutions proposed for them in the papers studied by the authors. Hadoop is not an easy environment to manage. The current Hadoop implementation assumes that the computing nodes in a cluster are homogeneous, and network delays caused by data movement at run time have largely been ignored in recent Hadoop research. Unfortunately, both the homogeneity and data locality assumptions in Hadoop are optimistic at best and unachievable at worst, and they introduce performance problems in virtualized data centers. We review the analysis of the single points of failure (SPOF) in Hadoop's critical nodes and the metadata replication based solution proposed to give Hadoop high availability. Heterogeneity can be addressed by a data placement scheme which distributes and stores data across multiple heterogeneous nodes according to their computing capacities. Analysts have noted that using the technology to aggregate and store data from multiple sources can create a whole slew of problems related to access control and ownership, and that applications analyzing merged data in a Hadoop environment can create new datasets that may also need to be protected.

Keywords: Fault, Distributed, HDFS, NameNode

Introduction

The phenomenal growth of internet-based applications and web services in the last decade has brought a change in the mindset of researchers. Traditional techniques to store and analyze voluminous data have been improved, and organizations are ready to acquire solutions which are highly reliable. [1]

Behaviour information of web users is concealed in the web log. Web log mining can find characteristics and rules of the users' visiting behaviour in order to improve the quality of service offered to users. Clustering is one of the data mining technologies applied in web log mining; applying clustering to the analysis of users' visiting behaviour makes it possible to group users according to their interests, which in turn helps improve the web site's structure. [2]

Several system architectures have been implemented for data-intensive computing and large-scale data analysis, including parallel and distributed relational database management systems. As a platform for computing and storage, the availability of Hadoop is the foundation of the availability of the applications running on it, so it is necessary to keep the platform available at all times in a production environment. Hadoop employs some methods to enhance the availability of applications running on it, e.g. maintaining multiple replicas of application data and redeploying application tasks on failures, but it does not provide high availability for itself.
In the architecture of Hadoop there exists a SPOF (Single Point of Failure): the whole system stops working when the critical node of which only a single copy is kept fails. [1,2]

MapReduce, proposed by Google, is a programming model and an associated implementation for large-scale data processing on distributed clusters. In the first stage a Map function is applied in parallel to each partition of the input data, and the framework groups the map output by key; in the second stage a Reduce function is applied in parallel to each group produced in the first stage to perform the final aggregation. The MapReduce model allows users to easily develop data analysis programs that can be scaled to thousands of nodes without worrying about the details of parallelism. Its popular open-source implementation, Hadoop, has been used by many companies (such as Yahoo and Facebook) in production for large-scale data analysis in cloud computing. Thus, it is essential to monitor distributed cluster status through MapReduce-based data analysis using Hadoop.

Hadoop Distributed File System

HDFS is the file system component of Hadoop (Refer Figure 1). While the interface to HDFS is patterned after the UNIX file system, faithfulness to standards was sacrificed in favour of improved performance for the applications at hand. [3]

Architecture of Hadoop

A. NameNode

The NameNode maintains the namespace tree and the mapping of file blocks to DataNodes (the physical location of file data). An HDFS client wanting to read a file first contacts the NameNode for the locations of the data blocks comprising the file and then reads the block contents from the DataNode closest to the client. When writing data, the client requests the NameNode to nominate a suite of three DataNodes to host the block replicas; the client then writes the data to the DataNodes in a pipelined fashion.

Fig 1 : Hadoop Architecture

The persistent record of the image stored in the local host's native file system is called a checkpoint. The NameNode also stores the modification log of the image, called the journal, in the local host's native file system. For improved durability, redundant copies of the checkpoint and journal can be made at other servers.

B. DataNodes

During startup each DataNode connects to the NameNode and performs a handshake. The purpose of the handshake is to verify the namespace ID and the software version of the DataNode; if either does not match, the DataNode automatically shuts down. After the handshake the DataNode registers with the NameNode. During normal operation DataNodes send heartbeats to the NameNode; the default heartbeat interval is three seconds. If the NameNode does not receive a heartbeat from a DataNode for ten minutes, it considers the DataNode to be out of service and schedules the creation of new replicas of that DataNode's blocks on other DataNodes. [3]

Fig 2 : MapReduce Framework

Hadoop MapReduce is a framework for executing applications that process vast amounts of data (terabytes) in parallel on large clusters with numerous nodes in a reliable and fault-tolerant manner. Though it can be executed on a single machine, its true power lies in its ability to scale to several thousands of systems, each with several processor cores. Hadoop is designed in such a way that it distributes data efficiently across the various nodes in the cluster.
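To make the two-stage model concrete, the following is a minimal word-count job written against the Hadoop MapReduce Java API; it is an illustrative sketch only, and the class name and input/output paths are not taken from any of the surveyed papers.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map stage: emit (word, 1) for every word in the input split.
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reduce stage: sum the counts collected for each word.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      // Driver: the job configuration discussed in the text; the framework
      // schedules the resulting map and reduce tasks across the cluster.
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory of text files
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }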
Hadoop includes a distributed file system that takes care of distributing huge data sets efficiently across the nodes in the cluster. The MapReduce framework (Refer Figure 2) splits the input of a job into a number of chunks which the map tasks process in parallel. The outputs of the map tasks are sorted by the framework and given to the reduce tasks as input. Both the input and the output of the tasks are stored in a file system. The framework takes care of scheduling the tasks, monitoring them and re-executing failed tasks.

Each cluster has only one JobTracker, a daemon service for submitting and tracking MapReduce jobs in Hadoop. It is therefore a single point of failure for the MapReduce service: if it goes down, all running jobs are halted. The slaves are configured with the node location of the JobTracker and perform tasks as directed by it. Each slave node has only one TaskTracker (Refer Figure 3), which keeps track of task instances and notifies the JobTracker of their status. Applications specify the input and output locations and supply the Map and Reduce functions by implementing the appropriate interfaces and abstract classes; the job configuration comprises these and other parameters. The Hadoop job client submits the job and its configuration to the JobTracker, which distributes the configuration to the slaves, schedules the tasks and monitors them, and then returns a job report to the job client. The report consists of status and diagnostic information about the tasks.

Fig 3 : Role of JobTracker and TaskTracker

Related Work

One of the surveyed papers proposes a metadata replication based solution to enable Hadoop high availability by removing the single point of failure in Hadoop. A key component of Hadoop is the Hadoop Distributed File System (HDFS), which is used to store all input and output data for applications [4]. In the initialization phase, each standby/slave node is registered to the active/primary node and its initial metadata (such as the version file and the file system image) are caught up with those of the active/primary node; in the replication phase, the runtime metadata (such as outstanding operations and lease states) needed for future failover are replicated; in the failover phase, the standby/newly elected primary node takes over all communications. [3,4]

Hadoop employs some methods to enhance the availability of applications running on it, e.g. maintaining multiple replicas of application data and redeploying application tasks on failures, but it does not provide high availability for itself. In the architecture of Hadoop there exists a SPOF (Single Point of Failure): the whole system stops working when the critical node of which only a single copy is kept fails. The SPOF is thus a serious threat to the availability of Hadoop. Providing high availability for Hadoop poses several challenges. (1) SPOF identification: the NameNode and the JobTracker are SPOFs in Hadoop, and identifying exactly which components and state information must be replicated to remove these SPOFs is not an easy job. (2) Low overhead: achieving high availability requires additional time for runtime synchronization among different nodes, so a performance-optimized solution for implementing high availability is necessary. (3) Flexible configuration: to implement high availability for Hadoop, many configurable options should be considered to meet the performance requirements of different workloads in different execution environments (e.g. network bandwidth and latency). [4]
The execution environment for high availability consists of the critical node and one or more nodes used as its backup. The solution proposes two types of node topology for this execution environment: an active-standby topology, which consists of one active critical node and one standby node, and a primary-slaves topology, which consists of one primary critical node and several slave nodes. Three aspects of the solution are worth noting.

1. Replication: replication is the core phase of the suggested solution; the runtime metadata (such as outstanding operations and lease states) needed for failover are replicated, and in the failover phase the standby/newly elected primary node takes over all communications. To reduce the performance penalty of replication, only the metadata, i.e. the most valuable management information for failover, is replicated, rather than a complete copy of the data stored on the active/primary critical node. Note that all management information contained in the JobTracker is stored persistently in HDFS and can be recovered on JobTracker failover, so no specific metadata replication mechanism needs to be designed for the JobTracker.

2. Metadata: metadata are the most important management information replicated for NameNode failover. The initial metadata include two types of files: the version file, which contains the version information of the running HDFS, and the file system image (fsimage) file, which is a persistent checkpoint of the file system.

3. Initialization: the main tasks of the initialization phase are node registration, which registers the slave nodes, and initial metadata synchronization, which makes the initial metadata consistent between the primary node and the slave nodes. [4]

AvatarNode, developed by Facebook, makes it possible for an administrator to switch a live Hadoop cluster's NameNode from one node to another so that maintenance can be performed on the node. The failover must be manually initiated by an administrator, so it does not provide protection from software or hardware failures. In MapR's design, by contrast, if a node fails, the metadata that was on that node is quickly re-replicated to other nodes in the cluster so that the replication factor quickly reaches the configured level again; this is what makes MapR's HA self-healing. [5] HDFS clients are configured to access the AvatarNode via a virtual IP (VIP). When the primary node is down, the Standby AvatarNode takes over: it reopens the edits log and ingests all committed transactions until the end of the file, finishes ingesting all transactions from the shared NFS store, and then leaves safemode, after which the VIP switches from the AvatarNode to the Standby AvatarNode. [6]
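As a small illustration of the client side of such a failover scheme, an HDFS client can be pointed at a logical address (here a virtual IP) rather than at a physical NameNode host, so that the address stays valid after failover. This is a sketch only; the host name, port and use of the classic "fs.default.name" property are assumptions for illustration.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class VipClientExample {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Point the client at the virtual IP (or its DNS name) instead of a
        // specific NameNode host; when the VIP moves to the standby node,
        // clients keep using the same address. "fs.default.name" is the
        // classic (Hadoop 1.x) property; newer releases use "fs.defaultFS".
        conf.set("fs.default.name", "hdfs://namenode-vip.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        // A trivial operation to show the client working against the VIP.
        System.out.println("Root exists: " + fs.exists(new Path("/")));
        fs.close();
      }
    }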
In particular, Hadoop has a single NameNode, where the metadata about the Hadoop cluster is stored. Unfortunately, there is only one of them, which means that the NameNode is a single point of failure for the entire environment. One may go with a different distribution of Hadoop such as MapR, which fixes the NameNode problem. Alternatively, there are companies such as ZettaSet that have built additional tooling around Hadoop, including NameNode high availability, without forking the Apache distribution. Or, since this NameNode issue is specific to HDFS (the Hadoop Distributed File System), one could replace HDFS with IBM's GPFS-SNC, which similarly averts the problem. [7]

Some findings observed in [8] are the following. Hadoop is willing to wait for non-responsive nodes for a long time (on the order of 10 minutes); this conservative design allows Hadoop to tolerate non-responsiveness caused by network congestion or compute node overload. A completed map task whose output data is inaccessible is re-executed very conservatively, which makes sense if the inaccessibility of the data is rooted in congestion or overload; this design decision is in stark contrast to the much more aggressive speculative re-execution of straggler tasks that are still running. The health of a reducer is a function of the progress of the shuffle phase (i.e. the number of successfully copied map outputs). In Hadoop, information about failures is shared neither among different tasks in a job nor among different code-level objects belonging to the same task. At the task level, when a failure is encountered, this information is not shared with the other tasks; therefore, tasks may be impacted by a failure even if the same failure has already been encountered by other tasks, and in particular a task can encounter the same failure that previously affected the initial task. The reason for this lack of task-level information sharing is that HDFS is designed with scalability in mind: to avoid placing an excessive burden on the NameNode, much of the functionality, including failure detection and recovery, is relegated to the compute nodes. Inside a task, information about failures is likewise not shared among the objects composing the task; rather, failure information is stored and used on a per-object basis. [8]

Hadoop's fault tolerance focuses on two failure levels and uses replication to avoid data loss. The first level is the node level, meaning that a node failure should not affect the data integrity of the cluster. The second level is the rack level, meaning that the data remains safe if a whole rack of nodes fails. In traditional Hadoop, the DataNode contacts the NameNode and reports its status, including the size of the disk on the remote node and how much of it is available for Hadoop to store. The NameNode determines what data should be stored on a node based on the node's location, using rack awareness, and on the percentage of the space already used by Hadoop. Rack awareness provides both load balancing and improved fault tolerance for the file system: it is designed to separate nodes into physical failure domains and to balance load. It assumes that bandwidth inside a rack is much larger than the bandwidth between racks, so the NameNode uses rack awareness to place data closer to the source. For fault tolerance, the NameNode uses rack awareness to put data on the source rack and on one other rack, to guard against whole-rack failure. Beyond rack failures, it is also possible that a whole site could fail, and a data placement and replication policy has been suggested which takes site failure into account when placing data blocks. [8]
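Rack awareness is not automatic: the cluster administrator supplies a mapping from host addresses to rack identifiers, typically via a topology script. The following hedged sketch shows how such settings might be expressed programmatically; the script path is illustrative, the exact property name differs between Hadoop versions, and in practice both values are normally placed in the *-site.xml configuration files rather than set in code.

    import org.apache.hadoop.conf.Configuration;

    public class RackAwareSettings {
      public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Keep three copies of every block; with rack awareness enabled the
        // NameNode places replicas on more than one rack so the data survives
        // a whole-rack failure.
        conf.set("dfs.replication", "3");

        // Administrator-provided script that maps a host name or IP address
        // to a rack identifier such as "/dc1/rack1". The path is illustrative;
        // on newer Hadoop releases the property is
        // "net.topology.script.file.name".
        conf.set("topology.script.file.name", "/etc/hadoop/conf/topology.sh");

        System.out.println("replication     = " + conf.get("dfs.replication"));
        System.out.println("topology script = " + conf.get("topology.script.file.name"));
      }
    }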
Another issue with Hadoop is the heterogeneous cluster, which the following surveyed work addresses. MapReduce enjoys wide adoption and is often used for short jobs where low response time is critical, and Hadoop's performance is closely tied to its task scheduler, which implicitly assumes that cluster nodes are homogeneous and that tasks make progress linearly; it uses these assumptions to decide when to speculatively re-execute tasks that appear to be stragglers. [2] A key benefit of MapReduce is that it automatically handles failures, hiding the complexity of fault tolerance from the programmer. If a node crashes, MapReduce reruns its tasks on a different machine. Equally importantly, if a node is available but performing poorly, a condition called a straggler, MapReduce runs a speculative copy of its task (also called a "backup task") on another machine to finish the computation faster. Without this mechanism of speculative execution, a job would be as slow as the misbehaving task. Stragglers can arise for many reasons, including faulty hardware and misconfiguration.

HDFS enables Hadoop MapReduce applications to move processing operations toward the nodes storing the application data to be processed. In a heterogeneous cluster, the computing capacities of nodes may vary significantly: a high-speed node can finish processing the data stored on its local disk faster than its low-speed counterparts. After a fast node completes the processing of its local input data, it must support load sharing by handling unprocessed data located on one or more remote slow nodes. When the amount of data transferred due to load sharing is very large, the overhead of moving unprocessed data from slow nodes to fast nodes becomes a critical issue affecting Hadoop's performance. To boost the performance of Hadoop in heterogeneous clusters, the cited work aims at minimizing data movement between slow and fast nodes. This goal can be achieved by a data placement scheme that distributes and stores data across multiple heterogeneous nodes based on their computing capacities: data movement can be reduced if the number of file fragments placed on the disk of each node is proportional to the node's data processing speed (a minimal sketch of this idea appears at the end of this passage).

To achieve the best I/O performance, one could instead replicate an input data file of a Hadoop application so that each node in the cluster has a local copy of the input data. Such a data replication scheme would, of course, minimize data transfer between slow and fast nodes during the execution of the application, but it has several limitations. First, it is very expensive to create replicas in a large-scale cluster. Second, distributing a large number of replicas can wastefully consume scarce network bandwidth in Hadoop clusters. Third, storing replicas requires an unreasonably large amount of disk capacity, which in turn increases the cost of Hadoop clusters. [2,3]
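The following is a minimal sketch of the capacity-proportional placement idea referred to above; it is an illustration of the principle, not the actual implementation from the cited work, and the node names and capacity scores are hypothetical.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class ProportionalPlacementSketch {

      // Distribute totalFragments across nodes in proportion to each node's
      // measured processing capacity (an arbitrary positive score, e.g. records
      // processed per second in a calibration run).
      static Map<String, Integer> placeFragments(Map<String, Double> capacity,
                                                 int totalFragments) {
        double totalCapacity = 0.0;
        for (double c : capacity.values()) {
          totalCapacity += c;
        }

        Map<String, Integer> assignment = new LinkedHashMap<>();
        int assigned = 0;
        for (Map.Entry<String, Double> e : capacity.entrySet()) {
          int share = (int) Math.floor(totalFragments * e.getValue() / totalCapacity);
          assignment.put(e.getKey(), share);
          assigned += share;
        }
        // Hand out any fragments lost to rounding, one per node, in map order.
        for (Map.Entry<String, Double> e : capacity.entrySet()) {
          if (assigned >= totalFragments) break;
          assignment.put(e.getKey(), assignment.get(e.getKey()) + 1);
          assigned++;
        }
        return assignment;
      }

      public static void main(String[] args) {
        // Hypothetical node names and capacity scores.
        Map<String, Double> capacity = new LinkedHashMap<>();
        capacity.put("fast-node-1", 4.0);
        capacity.put("fast-node-2", 3.0);
        capacity.put("slow-node-1", 1.0);

        // 64 fragments are split roughly 32 : 24 : 8.
        System.out.println(placeFragments(capacity, 64));
      }
    }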
There is a single master managing a number of slaves. The input file, which resides on a distributed file system spread over the cluster, is split into even-sized chunks that are replicated for fault tolerance. Hadoop divides each MapReduce job into a set of tasks. Each chunk of input is first processed by a map task, which outputs a list of key-value pairs generated by a user-defined map function. Map outputs are split into buckets based on key, and when all maps have finished, reduce tasks apply a reduce function to the list of map outputs for each key.

Although several approaches have been proposed to solve the resource allocation problem in a heterogeneous cloud, most of them focus on allocating resources to a single job or overlook resource constraints. In practice, however, the problem is more complex, since multiple jobs are requested by users simultaneously. The work in [11] first formulates the optimization problem of allocating the limited resources to multiple jobs according to job features and node capabilities, with the objective of maximizing the aggregate resulting utility; it then proposes the node Capability-Aware Resource Provisioner (CARP), built on Apache Hadoop [8, 9, 10, 11], to show its feasibility for solving the above optimization problem. By default, Hadoop adopts the FIFO scheduler, which is unfair in a cloud with multiple jobs; the fair scheduler is therefore proposed to share the resources equally among all jobs. [11] In a heterogeneous cloud, however, because each node has a distinct capability and workload, nodes with high capability or low workload must wait for nodes with low capability or high workload before the intermediate results output by these nodes can be integrated, so the job execution time is prolonged. Hence, more intelligent schedulers that achieve better resource provisioning are required to improve the system utility and minimize the execution time of submitted jobs, especially in a heterogeneous cloud with resource constraints. [12] In Hadoop the jobs are treated identically whether the default FIFO scheduler or the fair scheduler is adopted: the FIFO scheduler is based on best-effort resource allocation, while the fair scheduler employs uniform resource allocation. The capacity scheduler in Hadoop, for example, which supports multiple queues and job priority, is more flexible and suitable for heterogeneous clouds with various job types. [10]
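As a hedged illustration of moving away from the default FIFO behaviour, the classic JobTracker-based Hadoop releases let an administrator select a pluggable task scheduler through configuration. The property and class names below apply to Hadoop 1.x-style deployments and would normally be set in mapred-site.xml rather than in code.

    import org.apache.hadoop.conf.Configuration;

    public class SchedulerSelectionExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Replace the default FIFO scheduler with the Fair Scheduler so that
        // concurrently running jobs share the cluster's task slots.
        conf.set("mapred.jobtracker.taskScheduler",
                 "org.apache.hadoop.mapred.FairScheduler");

        // Alternatively, the Capacity Scheduler supports multiple queues and
        // job priorities, as discussed above:
        // conf.set("mapred.jobtracker.taskScheduler",
        //          "org.apache.hadoop.mapred.CapacityTaskScheduler");

        System.out.println("Scheduler: " + conf.get("mapred.jobtracker.taskScheduler"));
      }
    }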
Security is the next issue, and it is discussed here with the support of the reports and research papers noted below. Open-source Hadoop technology allows companies to collect, aggregate, share and analyze huge volumes of structured and unstructured data from enterprise data stores as well as from weblogs, online transactions and social media interactions. A growing number of firms are using Hadoop and related technologies such as Hive, Pig and HBase to analyze data in ways that cannot easily or affordably be done using traditional relational database technologies. JPMorgan Chase, [11] for instance, is using Hadoop to improve fraud detection, while eBay is using Hadoop technology and the HBase open-source database to build a new search engine for its auction site. Analysts have said that IT operations using Hadoop for such applications must be aware of potential security problems: using the technology to aggregate and store data from multiple sources can create a whole slew of problems related to access control and management as well as data entitlement and ownership, and applications analyzing merged data in a Hadoop environment can result in the creation of new datasets that may also need to be protected. Several agencies will not put sensitive data into Hadoop databases because of data access concerns, and several are simply building firewalls around their Hadoop environments. For many Hadoop users, the most effective security approach is to encrypt data at the individual record level, while it is in transit or being stored in the Hadoop environment (a minimal sketch of such record-level encryption is shown below). [13]

More recently, Hadoop is being used in the cloud, where there are numerous security issues because cloud computing encompasses many technologies, including networks, databases, operating systems, resource scheduling, load balancing, concurrency control and memory management. For example, the network that interconnects the systems in a Hadoop cluster has to be secure. Data security involves encrypting the data as well as ensuring that appropriate policies are enforced for data sharing. In addition, resource allocation and memory management algorithms have to be secure, and data mining techniques may be applicable. Hadoop is increasingly useful, and its main security issues are the following. Hadoop holds data in HDFS, and this file system does not have read and write controls, so any user can access the input files and the results. All jobs run as the Hadoop user, which can execute applications without any permission checks; for example, a user with limited access to the jobs they can run can potentially run a job on any data set on the cluster, and any job running on a Hadoop cluster can access any data on that cluster. Hadoop can set access controls, but they are enforced at the client level: access control list checks are performed at the start of any read or write, when they should be enforced at the file system level. User authentication should use a more secure method, such as a password or RSA key authentication. [14]
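The sketch below illustrates the record-level encryption idea using only the standard Java cryptography API. It is an illustration of the concept rather than a recommendation of a particular scheme: the record content is hypothetical, key management (arguably the hard part) is not shown, and a production system would use an authenticated cipher mode rather than the default.

    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class RecordEncryptionSketch {
      public static void main(String[] args) throws Exception {
        // In a real deployment the key would come from a key management
        // service, not be generated on the fly.
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();

        String record = "user42,2013-05-01,transaction,199.00"; // hypothetical record

        // Encrypt each record individually before it is written to HDFS
        // (or sent over the wire), so that raw data never rests in the clear.
        // (A production system would use an authenticated mode such as AES/GCM.)
        Cipher cipher = Cipher.getInstance("AES");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        byte[] encrypted = cipher.doFinal(record.getBytes(StandardCharsets.UTF_8));
        String stored = Base64.getEncoder().encodeToString(encrypted);

        // Decrypt when an authorized job needs to read the record back.
        cipher.init(Cipher.DECRYPT_MODE, key);
        byte[] decrypted = cipher.doFinal(Base64.getDecoder().decode(stored));

        System.out.println("stored   : " + stored);
        System.out.println("recovered: " + new String(decrypted, StandardCharsets.UTF_8));
      }
    }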
Distributed systems are becoming more prominent nowadays. For a large cluster it is very important to detect system anomalies, including erroneous behaviour and unexpectedly long response times, which often result in a system crash. These anomalies may be caused by hardware problems, network communication congestion or software bugs in distributed system components. Owing to the large scale and complexity of distributed systems, it is impossible to detect anomalies by manually checking the logs the system prints, so automatic system anomaly monitoring and detection tools are eagerly required by many distributed systems. Although many log analysis tools exist, most of them are developed for a single node, and it is very time-consuming to diagnose the great number of log messages produced by a large-scale distributed system on just one node. Therefore, there is a great demand for distributed anomaly detection techniques based on log analysis. [15]

With the rapid development of the Internet, e-commerce websites have accumulated unprecedented volumes of records from users. As noted in the introduction, the behaviour information of web users is concealed in the web log, and web log mining can find characteristics and rules of the users' visiting behaviour to improve the quality of service to users. Clustering, one of the data mining technologies applied in web log mining, makes it possible to group users according to their interests and thereby improve the web site's structure; the information can also be applied in recommender systems. However, this information is buried in log files that can be up to a few terabytes in size, and processing such huge datasets consumes a large amount of computation. In general, distributed computing is a good solution: computing tasks are assigned in parallel to multiple machines to improve processing speed. [16]

Conclusion

The authors have tried to identify the performance issues of HDFS on heterogeneous clusters. Motivated by the performance degradation caused by heterogeneity, a data placement mechanism in HDFS has been suggested; the new mechanism distributes fragments of an input file to heterogeneous nodes based on their computing capacities. Regarding security, the problems related to access control and ownership arise because applications analyzing merged data in a Hadoop environment can create new datasets that may also need to be protected; addressing these problems maintains the security of the data that is needed for processing. To boost the performance of Hadoop in heterogeneous clusters, the solution is to minimize data movement between slow and fast nodes, and this goal can be achieved by a data placement scheme that distributes and stores data across multiple heterogeneous nodes based on their computing capacities.

References

[1] Feng Wang, Bo Dong, Jie Qiu, Xinhui Li, Jie Yang, Ying Li, "Hadoop High Availability through Metadata Replication", ACM, 978-1-60558-802-5/09, 2009, pp. 37-44.
[2] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica, "Improving MapReduce Performance in Heterogeneous Environments".
[3] Harcharan Jit Singh, V. P. Singh, "High Scalability of HDFS using Distributed Namespace", International Journal of Computer Applications (0975-8887), Volume 52, No. 17, August 2012.
[4] Jeffrey Shafer, Scott Rixner, Alan L. Cox, "The Hadoop Distributed Filesystem: Balancing Portability and Performance".
[5] MapR White Paper, "High Availability: No Single Points of Failure", 2011.
[6] http://www.slideshare.net/PhilippeJulio/hadooparchitecture
[7] http://www.itdirector.com/technology/data_mgmt/content.php?cid=13041
[8] hadoop.apache.org/
[9] Florin Dinu, T. S. Eugene Ng, "Analysis of Hadoop's Performance under Failures", Rice University.
[10] B. Thirumala Rao, N. V. Sridevi, V. Krishna Reddy, L. S. S. Reddy, "Performance Issues of Heterogeneous Hadoop Clusters in Cloud Computing", Global Journal of Computer Science and Technology, Volume 11, Issue 8, Version 1.0, May 2011.
[11] Wei-Tsung Su, Sun-Ming Wu, "Node Capability Aware Resource Provisioning in a Heterogeneous Cloud", 2012 IEEE International Conference on Communications in China: Advanced Internet and Cloud (AIC), 978-1-4673-2815-9.
[12] http://www.computerworld.com/s/article/9221652/IT_must_prepare_for_Hadoop_security_issues
[13] Jiong Xie, "Improving Performance of Hadoop Clusters", dissertation, Graduate Faculty of Auburn University.
[14] Mahout, http://mahout.apache.org/.
[15] Mingyue Luo, Gang Liu, "Distributed Log Information Processing with Map-Reduce: A Case Study from Raw Data to Final Models", IEEE, 978-1-4244-6943-7, 2010.
[16] Yan Liu, "System Anomaly Detection in Distributed Systems through MapReduce-Based Log Analysis", International Conference on Advanced Computer Theory and Engineering (ICACTE).