
MapReduce: an infrastructure review and research insights

2019, The Journal of Supercomputing

https://doi.org/10.1007/s11227-019-02907-5

In the current decade, searching massive data to find "hidden" and valuable information has become increasingly common. Such searches require heavy processing of very large datasets, which has driven the development of solutions for processing such huge information based on distributed and parallel processing. Among parallel programming models, one that has gained wide popularity is MapReduce. The goal of this paper is to survey research conducted on the MapReduce framework in the context of its open-source implementation, Hadoop, in order to summarize and report on this wide topic area at the infrastructure level. We conducted a systematic review of the prevalent MapReduce topics in seven areas: (1) performance; (2) job/task scheduling; (3) load balancing; (4) resource provisioning; (5) fault tolerance in terms of availability and reliability; (6) security; and (7) energy efficiency. Our study combines a quantitative and qualitative evaluation of the trend in research publications published between January 1, 2014, and November 1, 2017. Since MapReduce is a challenge-prone area for researchers who set out to work on and extend it, this work is a useful guideline for getting oriented and starting research.

Neda Maleki1 · Amir Masoud Rahmani1 · Mauro Conti2

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Keywords MapReduce paradigm · Parallel and distributed programming model · Hadoop · Systematic review

1 Introduction

Over the past years, there has been a flow of data at the scale of petabytes produced by users' jobs [1]. Known as the Big data era, this makes it difficult for enterprises to maintain and extract valuable information for offering efficient and user-friendly services [2].
* Amir Masoud Rahmani [email protected]
1 Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
2 Department of Mathematics, University of Padua, Padua, Italy

Due to the nature of the services provided by these firms, data are available in different formats such as image, log, text, and video [3]. The firms also hold extensive information in different languages because of their many users around the world. Therefore, researchers have found themselves involved in highly complex processes such as data storage techniques, instant data lookup, manipulation, and updating of the data [4]. As the data are extremely large, unstructured, and in need of real-time analysis, the view has taken hold among researchers that a new platform for data retention, transmission, storage, and processing is required [5]: a platform capable of processing and analyzing large volumes of data with acceptable velocity and reasonable cost. From the standpoint of data platform architecture, this necessity led to parallel and distributed computing on clusters and grids. In these environments, built from cost-effective, high-capacity hardware, programming requires attention to data consistency and integrity, node load balancing, skew mitigation, fair resource allocation, and preemption and non-preemption of jobs; programmers thus constantly live in fear of these obstacles [4]. To hide the complexities of the parallel processing system from the user's view and abstract the system characteristics, numerous frameworks have been released. The goal of all these frameworks is to let users focus on their production programs while delegating the complexity and control to the framework [6]. Among these frameworks, MapReduce is known as a distinctive programming pattern.
This pattern is inspired by the functional language Lisp [4], enabling end users to express all kinds of parallel procedures with Map and Reduce functions, without having to consider messy parallelism details like fault tolerance, data distribution, and load balancing. It is of major importance in handling Big data [7]. The basic architecture of the MapReduce framework has two functions, called Map and Reduce, wherein the former feeds the latter's input to carry out the computation. The significance of this pattern in performing batch processing on Big data is clearly visible [6]. In this framework, parallel computing begins by distributing map tasks to different nodes, which simultaneously process disparate data partitions called splits. Eventually, by aggregating the map outputs and applying the reduce function, the final results are produced, thus completing the processing [1]. In recent years, the expansion and evolution of MapReduce, especially in the context of its open-source implementation Hadoop, has brought features such as energy efficiency of jobs, fault tolerance, load balancing of the cluster, scheduling of jobs and tasks, security, performance, and elasticity, which has propelled the publication of many articles in journals and conferences. Other programming models, such as Spark [8] and DataMPI [9], compete with MapReduce. Since MapReduce is a high-performance open-source model used by many big companies for processing batch jobs [10, 11], and since it is our future research line, we chose to conduct the study on the MapReduce programming model. Table 1 compares the features of MapReduce, Spark, and DataMPI. With the help of the recent articles considered in this research study, and by using inclusion and exclusion criteria, an illustration of MapReduce topics in a systematic study template is presented, making the research simple and explicit for readers.
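The split-map-aggregate-reduce flow described above can be sketched in a few lines. The following word-count example is an illustrative Python sketch of the programming model only, not Hadoop's actual Java API:

```python
from collections import defaultdict

# Map: emit (word, 1) for every word in an input split.
def map_fn(split):
    for word in split.split():
        yield (word, 1)

# Shuffle: group intermediate pairs by key (done by the framework).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: aggregate all values observed for one key.
def reduce_fn(key, values):
    return (key, sum(values))

splits = ["deer bear river", "car car river", "deer car bear"]
intermediate = [pair for s in splits for pair in map_fn(s)]
result = dict(reduce_fn(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```

In the real framework, each `map_fn` call runs on the node holding its split, and the shuffle moves intermediate data across the network between the map and reduce phases.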
Table 1 Comparing MapReduce, Spark, and DataMPI

MapReduce (Apache, 2008)
- Data processing type: batch mode
- Data model: key-value pair-based
- Compatible with HDFS: yes
- Cluster manager: Hadoop YARN
- Storage system: Hadoop distributed file system (HDFS)
- Speed: low, because of disk-based processing (multiple stages)
- Scalability: more than 40,000 nodes
- Fault tolerance/execution time: high, because data are written on disks; longer execution time, because all in-progress or pending tasks on a failed node are re-executed
- Languages: Java, Pig Latin, HiveQL
- Cost: number of disks (nodes); low cost of RAM
- Security: Kerberos authentication; service-level authorization
- Applications benefit: suitable for non-iterative and interactive applications

Spark (Apache, 2012)
- Data processing type: real-time stream mode
- Data model: RDD-based
- Compatible with HDFS: yes
- Cluster manager: Hadoop YARN, Apache Mesos
- Storage system: Hadoop distributed file system, MapR file system, Cassandra, OpenStack Swift, Amazon S3, Kudu
- Speed: high, because of in-memory processing (single stage)
- Scalability: up to 10,000 nodes
- Fault tolerance/execution time: relatively low, because data are kept in memory as resilient distributed dataset (RDD) objects; shorter execution time, because a lost partition is automatically recomputed using the original transformations
- Languages: SparkSQL, Python, Scala, Java, R
- Cost: fewer disks (nodes); high cost of RAM
- Security: password authentication; as it can be integrated with HDFS, access control lists are supported
- Applications benefit: suitable for iterative algorithm-based applications like machine learning

DataMPI (DataMPI team, 2014)
- Data processing type: stream, iterative
- Data model: key-value pair-based
- Compatible with HDFS: yes
- Cluster manager: MVAPICH2
- Storage system: same as Hadoop
- Speed: highest
- Scalability: same as Hadoop
- Fault tolerance/execution time: same as Hadoop
- Languages: Java
- Security: –
- Applications benefit: suitable for streaming and iterative applications

The only systematic literature study [4] on MapReduce, a holistic paper, was conducted in 2014; since then, no other systematic and comprehensive review has been done. To the best of our knowledge, our study is the first systematic paper from 2014 to November 2017 which is comprehensive and holistic. In this paper, we have considered prominent and varied topics of MapReduce which require further investigation. We extracted and analyzed data from the relevant studies of MapReduce to answer the research questions (RQs) and have presented the answers as our work's contribution.

The rest of the paper is organized as follows. Section 2 consists of two parts: in part one, we introduce a brief architectural overview of MapReduce and of Hadoop as its most regarded implementation, and in part two, we provide our research methodology. Section 3 reviews the selected papers of three phases. In Sect. 4, we answer the research questions and analyze the results to highlight hot and cold issues in the studies and discuss opportunities for future research. Finally, in Sect. 5 we present our conclusions and the limitations of our research.

2 Background and research methodology

2.1 Background

Hadoop is an open-source Apache project [12] that was inspired by Google's proprietary Google File System and MapReduce framework [13]. The Hadoop distributed file system provides fault-tolerant storage of large datasets [12–14]. Figure 1 shows the HDFS architecture. HDFS supports high-performance access to data using a three-replica data block placement policy: two in-rack block replicas and one off-rack block replica [15]. It has two major components, one NameNode and many DataNodes, in which the metadata are stored on the NameNode and the application data are kept on the DataNodes.
A dedicated server called the Secondary NameNode is employed for file system image recovery in the presence of failure [14], which provides high availability for Hadoop [16].

[Fig. 1 HDFS architecture [16]: a master node hosting the scheduler, NameNode, and Secondary NameNode, and slave nodes hosting the DataNodes]

The NameNode–DataNodes architecture makes the system scalable, and all the nodes communicate through TCP protocols [13]. The scheduler for job assignment across the Hadoop cluster resides in the master node [17]. MapReduce, the processing unit of Hadoop, consists of two main components: one JobTracker and many TaskTrackers, in which the JobTracker coordinates the user's job across the cluster and the TaskTrackers run the tasks and report to the JobTracker [1, 14, 18, 19]. Figure 2 shows the MapReduce job execution flow. The key-value pairs of all input splits are processed in parallel by the mappers [14, 17, 18]. The map output files, which are called intermediate data, are partitioned based on the key, sorted within each partition, and then written to the local disk of the DataNodes [1, 20]. Reducers remotely fetch the data belonging to the same key and produce the reduce output files, which are stored on HDFS [14, 20]. The Hadoop ecosystem consists of many projects, which can be categorized as (1) NoSQL databases and their handler projects such as HBase, Sqoop, and Flume; (2) data collecting and processing projects such as Kafka, Spark, and Storm; (3) workflow and streaming data analysis projects such as Pig, Hive, Mahout, Spark's MLlib, and Drill; (4) administration projects like ZooKeeper and Ambari for providing and coordinating the services in the distributed environment of a Hadoop cluster; and (5) security projects such as the centralized role-based Sentry, the non-role-based Ranger, and Knox [14, 19, 21, 22].
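The partitioning of intermediate data by key, described above, is what guarantees that one reduce task receives every value emitted for a given key. Hadoop's default HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; the sketch below imitates this in Python, with hash() standing in for hashCode():

```python
# Sketch of hash partitioning in the shuffle phase: every occurrence of a
# key is routed to the same reduce task, so that task sees the key's
# complete value list. Python's hash() stands in for Java's hashCode().
def partition(key, num_reducers):
    return (hash(key) & 0x7FFFFFFF) % num_reducers

num_reducers = 3
partitions = {r: [] for r in range(num_reducers)}
for key, value in [("deer", 1), ("car", 1), ("deer", 1), ("bear", 1)]:
    partitions[partition(key, num_reducers)].append((key, value))

# All ("deer", ...) pairs land in exactly one partition, whichever it is.
deer_homes = {r for r, pairs in partitions.items() for k, _ in pairs if k == "deer"}
assert len(deer_homes) == 1
```

Because the assignment depends only on the key, mappers on different nodes independently agree on the destination reducer without any coordination.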
Furthermore, we can name some Hadoop distributions, such as MapR, Cloudera, Hortonworks Data Platform, Pivotal Data Suite, and IBM InfoSphere, and some Hadoop repositories, including HealthData, National Climate Datacenter, and Amazon Web Services datasets [23].

[Fig. 2 MapReduce job execution flow [20]: input pairs (K1,V1) from HDFS are processed by mappers into list(K2,V2); the partitioned and sorted intermediate data are shuffled to reducers, which take (K2, list(V2)) and write the output list(K3,V3) back to HDFS]

2.2 Research methodology

According to [24–26], we classify and select the articles based on the following protocol:
• According to our research area, some research questions are defined.
• According to these research questions, keywords are found.
• Search strings are made based on these keywords, i.e., by logical and proximity search of the keywords in the validated databases used as sources to find the targeted papers.
• Final papers are screened based on some inclusion and exclusion criteria.

2.2.1 Research questions

Research questions are classified into two categories, quantitative and qualitative. Based on this categorization, we bring up the following research questions:

RQ1 What topics have been considered most in the MapReduce field?
RQ2 What are the main parameters investigated by the studies?
RQ3 What are the main artifacts produced by this research?
RQ4 What experimental platforms have been used by the researchers for analysis and evaluation?
RQ5 What kinds of benchmarks and datasets have been used in the experiments?
RQ6 What are the open challenges and future directions in Hadoop MapReduce?

2.2.2 Paper selection process

We use the following libraries as sources to direct the search process:
• IEEE Xplore (http://www.ieee.org/web/publications/xplore/).
• ScienceDirect—Elsevier (http://www.elsevier.com).
• SpringerLink (http://www.springerlink.com).
• ACM Library (http://dl.acm.org/).

We organize the search in three phases. In each phase, we define search terms for finding systematic mapping and literature studies, regular surveys, and primary studies, respectively.

Phase 1. Finding Systematic Studies. We applied "*", representing zero or more alphanumeric characters, to the word "study" to find its variants such as "study" and "studies", and parentheses are used when the word "systematic" appears not in the title but in the abstract or keywords.

Phase 2. Finding Survey Studies. We first applied the search string "Title: MapReduce AND (Title: survey OR Title: review)." However, since we wanted to exclude "systematic review" from our results, we used the "NOT" operator in the search string.

Table 2 Search strings
Phase 1
S1: ((Systematic) OR title: mapping OR title: literature) AND (Title: stud* OR title: review) AND (title: "MapReduce")
S2: ((Systematic) OR title: mapping OR title: literature) AND (Title: stud* OR title: review) AND (title: "Map-Reduce")
Phase 2
S3: "Title: MapReduce AND (title: survey OR Title: review) NOT (systematic)"
Phase 3
S4: "MapReduce AND Hadoop"

Table 3 Inclusion and exclusion criteria for study selection
Inclusion criteria:
• Studies published from January 2014 to November 2017
• Studies focused on Hadoop MapReduce and its various aspects
• Studies that addressed challenges in MapReduce
• Papers that resolve a challenge in MapReduce, not papers in which a challenge is resolved by MapReduce
• The approach and its validation are logically presented
Exclusion criteria:
• Studies published in languages other than English
• Studies not indexed in the ISI
• Interdisciplinary journals
• Studies that do not answer or are irrelevant to the research questions
• Ph.D. theses, academic blogs, editorial notes, technical reports, and open-access journals

Phase 3.
Finding Primary Studies. Since the most regarded implementation of MapReduce is Hadoop, in order to have a holistic search, our search strings were made from the terms "MapReduce" and "Hadoop." As the results were too numerous, we refined the hit list by using an advanced search option over title, abstract, and author keywords [27]. The three-phase search strings are shown in Table 2. To ensure that only qualified publications from January 2014 up to November 2017 are included, we applied the inclusion and exclusion criteria of Table 3 to select the final papers.

Using this strategy, we found 66 papers for conducting the study, in which five studies [4, 26, 28–30] have conducted a systematic review, six studies [2, 31–35] have done a survey on Hadoop MapReduce, and the rest, which will be reviewed in Sect. 3.2, are the primary studies in the MapReduce field. Figure 3 shows the adopted process of article selection in the study. Figure 4 shows the number of articles per year from January 2014 to November 2017. It is observed that the publication of papers at the Hadoop MapReduce infrastructure level has an increasing trend. Figure 5 shows each publisher's share of the publications.

[Fig. 3 Schematic map of the article selection process: select keywords → make search string → apply search string to database sources → investigate hit list based on title, abstract, and full text → 5 articles for Phase 1, 6 articles for Phase 2, 55 articles for Phase 3]

[Fig. 4 Annual distribution of publications, January 2014 to November 2017: 2014, 7%; 2015, 18%; 2016, 38%; 2017, 37%]

[Fig. 5 Percentage of published papers based on various publishers: IEEE, 40%; Elsevier, 27%; Springer, 24%; ACM, 9%]

2.2.3 Studies classification

Based on the reviewed research, we could form a good picture of the main existing challenges in the MapReduce framework. We classified the studies into seven categories according to their main research focus. Figure 6 shows the taxonomy.

[Fig. 6 Taxonomy of the MapReduce studies: Scheduling (adaptive schedulers; resource allocation; data locality-aware schedulers), Resource provisioning (elasticity: provision of CPU, memory, disk, and network at run-time), Security (at the processing level; at the communication level; at the disk I/O level; when attacks are coming), Energy efficiency (based on workload; based on hardware), Load balancing (data skew mitigation; data placement and replication), Fault tolerance (availability; reliability), Performance (makespan; parameter settings; network shuffling; straggler tasks)]

We explain each topic category concisely, from right to left in the taxonomy.

• In the studies where system efficiency is the main concern, indicators such as makespan (job completion time), network traffic in transferring data from map tasks to reduce tasks during the shuffle phase, the number of disk I/Os, tuning of system parameters, and dealing with stragglers (slow tasks) in the cluster are highly essential considerations.
• Reliability and availability parameters are important when the studies intend to consider the fault tolerance of a MapReduce cluster. Since the master node is a single point of failure, how to design fault-tolerant mechanisms that keep the master node available is the main concern. When a data node fails, keeping the data requested by tasks accessible is another concern in the fault tolerance topic. Furthermore, handling map and reduce task failures during processing is another key point.
• When some reduce tasks have more input data, causing an unbalanced load across the cluster, the data skew parameter should be considered. Moreover, when efficient data access is the main focus, where to place data across the cluster and the replication number of each data block are the major parameters.
• When mitigating energy consumption is the major objective, the cluster characteristics and application type should be noted. How to launch a task near its data (data locality) to improve job execution time and, subsequently, energy consumption is an important concern in the energy efficiency studies.
• In the studies focused on security, data security in transfer (data in motion) or in storage (data at rest), secure execution of map and reduce tasks, and secure data flow in the presence of threats and attacks are the critical concerns for MapReduce.
• In cases where a high workload causes high demand for resources, studies provide solutions such as provisioning resources at run-time, i.e., they consider the elasticity parameter.
• In the studies where scheduling is the main topic, solutions such as adaptive schedulers, efficient resource allocation, and data locality-aware schedulers play vital roles. An adaptive scheduler, as the first solution, can be employed to schedule user jobs with various SLAs using job run-time information to improve performance metrics including job execution time, makespan, CPU utilization, etc. Resource allocation, as the second solution, is used to allocate resources to user jobs efficiently. A data locality-aware scheduler is another effective solution to optimize one or a set of performance metrics including data locality, energy consumption, makespan, and so on.

3 Review of studies

In this section, we review primary studies and regular surveys separately.
3.1 Regular surveys

In [31], the authors divided the existing deficiencies in MapReduce into three categories in terms of improvement goals: (1) the native variants, i.e., the studies done by Google as the creator of MapReduce; (2) the Apache variants, i.e., studies focused on Hadoop; and (3) the third-party extensions, most of which have investigated Hadoop platform improvements such as I/O access in the Hadoop platform, enhancement of database operations in Hadoop, and the scheduling scheme of Hadoop map and reduce tasks. This survey also compares parallel DBMSs with MapReduce in terms of scalability and efficiency. The authors also discuss why different parallel processing technologies, specifically MapReduce, attract attention. Furthermore, they reviewed some hybrid systems which integrate a traditional RDBMS alongside MapReduce. However, there is no comparison of these studies' pitfalls and advantages.

Derbeko et al. [32] have studied the security and privacy aspects of the MapReduce framework in a cloud environment. On the one hand, there is a close relationship between the cloud and MapReduce, such that deploying MapReduce in public clouds enables users to process large-scale data in a cost-effective manner and provides ease of processing and management. On the other hand, this deployment raises security and privacy challenges, since it does not guarantee the rigorous security and privacy of computations or of stored data. The authors also investigated security-related projects in the context of MapReduce, such as authentication of users, users' authorization, auditing–confidentiality–integrity–availability (ACIA) of the data and computation pair, and verification of outputs. Additionally, they considered privacy aspects alongside security, such as the ability of each participating party to prevent adversarial parties from observing data, codes, computations, and outputs.
However, the authors did not address some security issues, such as authorization frameworks and trust domains of MapReduce, which require different MapReduce algorithms for data encryption and privacy policies.

Hashem et al. [2] have reviewed the application of MapReduce, as a promising technology, in various domains such as telecommunication, manufacturing, pharmaceutical, and governmental organizations. They also consider algorithms and solutions for improving MapReduce and reducing its challenges between 2006 and 2015. The paper conducts a basic bibliometric study using keywords, abstracts, titles, affiliations, citations, countries, and authorship. Moreover, it investigates the most influential articles on the Scopus platform in the MapReduce improvement domain, covering declarative interfaces, data access, data processing, data transfer, iteration, resource allocation, and communication in MapReduce, as well as their pros and cons.

Li et al. [33] have studied the basic concepts of the MapReduce framework, its limitations, and the proposed optimization methods. These optimization methods are classified into several topics: job scheduling optimization, improvements to the MapReduce programming model, real-time computation support for stream data, hardware speed-ups, performance tuning such as configuration parameters, energy saving as a major cost, and security through stronger authentication and authorization mechanisms. Moreover, some open-source implementation frameworks of MapReduce are presented in Table 4. Although this is a comprehensive study, the mentioned aspects still need more research.

Iyer et al. [34] considered data-intensive processing and its various approaches, along with their advantages and disadvantages, the MapReduce programming model, and the application of MapReduce in diverse fields.
Some platforms which compete with Hadoop for processing large data are as follows: (1) Sector and Sphere, in terms of processing speed on the TeraSort benchmark; (2) DryadLINQ, a sequential programming model combined with LINQ expressions that makes programming easy; and (3) the integration of Kepler and Hadoop for workflow applications, which provides an easy-to-use architecture and impressive performance. The investigated studies are compared in terms of scalability, efficiency, file system type, and cost. The number of comparison criteria is adequate; however, the number of considered papers is not.

Liu et al. [35] have investigated the fault tolerance aspect of MapReduce. In a distributed commodity Hadoop cluster, failures can occur at every level of the system (node level, rack level, and cluster level), giving rise to slow tasks (also known as straggler tasks), so speculative execution of these tasks is essential. Hadoop supports this method by executing a copy of a slow task on another node that can process it faster, improving Hadoop throughput. Additionally, some other speculative methods for heterogeneous Hadoop environments, such as LATE, MCP, Ex-MCP, and ERUL, have been considered.

3.2 Primary studies

In the following sections, we thoroughly consider and analyze each topic presented in Fig. 6. Our observations are summarized in a table in each subsection.
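Before turning to the individual studies, the speculative-execution idea discussed above (LATE and related methods) can be illustrated with a small sketch: estimate each task's remaining time from its observed progress rate and speculate on the task expected to finish last. The numbers below are made up for illustration:

```python
# LATE-style straggler detection (simplified): estimate a task's remaining
# time from its observed progress rate, then pick the worst task as the
# candidate for a speculative copy on another node.
def estimated_time_left(progress, elapsed_seconds):
    rate = progress / elapsed_seconds        # fraction completed per second
    return (1.0 - progress) / rate           # seconds still needed at that rate

# (task id, fraction complete, seconds running) -- illustrative numbers only
tasks = [("map-1", 0.9, 90.0), ("map-2", 0.5, 100.0), ("map-3", 0.2, 120.0)]
straggler = max(tasks, key=lambda t: estimated_time_left(t[1], t[2]))
print(straggler[0])  # map-3: slowest progress rate, largest time-left estimate
```

Real implementations add refinements this sketch omits, such as only speculating on tasks well below the average progress rate and capping the number of concurrent speculative copies.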
The studies are compared in terms of main idea, advantages, weaknesses, investigated parameters, tool and method, benchmarks, dataset and jobs (workload), and the experimental platform, to determine whether each study's contribution has been implemented, simulated, or both.

Table 4 Open-source implementations of MapReduce [33]
(Framework: development language; programming language; deployment environment; operating system)
• QT Concurrent: C++; C++; shared memory system; Linux, Windows, Mac OS
• Phoenix/Phoenix++: Java, C++; Java, C++; shared memory system; Linux
• Disco: Erlang; Python; master–slave clusters; Linux, Mac OS
• GridGain: Java; Java; master–slave clusters; Linux, Windows, Mac OS
• Skynet: Ruby; Ruby; peer-to-peer clusters; Linux
• Twister: Java; Java; shared memory system; Linux
• Misco: Python; Python; master–slave mobile systems; Linux
• Hadoop: Java; Java, C++; master–slave clusters; Linux
• Apache Pig: Java; Pig Latin; master–slave clusters; Linux
• Cascading: Java; Java; master–slave clusters; Linux
• Scalding: Scala; Scala; master–slave clusters; Linux

3.2.1 Energy efficiency studies

Mashayekhy et al. [36] have proposed a framework to improve the energy efficiency of deadline-assigned MapReduce jobs. The authors model the performance of the individual tasks of a job as an integer program. To solve the problem, they provide two heuristic scheduling algorithms that quickly find near-optimal solutions; the schedulers are therefore also suitable for real-time systems. The model was designed to fulfill the service-level agreement in terms of meeting job deadlines. Since there are multiple jobs with different functionalities in a Hadoop cluster, how to model an efficient and distributed scheduler to solve the energy problem has not been considered by the authors. Ibrahim et al.
[37] have investigated the impact of dynamic voltage and frequency scaling (DVFS) on the performance and energy consumption of a Hadoop cluster, and the trade-off between performance and power. There are several modes to mitigate power usage using DVFS and Turbo, including the powersave, conservative, and ondemand governors. However, these governors yield sub-optimal solutions even in different phases of Hadoop and do not reflect their design goals. Furthermore, jobs consume different amounts of power in these modes and do not have the same execution time, which impacts the performance of the entire cluster. Therefore, the authors have provided insights for efficiently deploying and executing MapReduce jobs by determining the job type (CPU-intensive, I/O-intensive, or network-intensive) and then dynamically tuning the suitable governor according to CPU load.

Song et al. [38] have proposed a modulo-based mapping function in which the data blocks are mapped to the nodes of a cluster in order to mitigate data shuffling and save energy. The insight behind such a mapping is that by fairly distributing the data blocks across a heterogeneous cluster and considering the data characteristics, each task can access its data locally and all tasks can be completed simultaneously. To achieve this goal, the authors considered three factors: "fairness of size," "fairness of range," and "best adaptability." However, the proposed algorithm neither considers a block replacement strategy when a node fails nor employs a data replication method in the presence of node failures.

Cai et al. [39] have proposed a network traffic-aware and DVFS-enabled resource allocation scheduler. Based on job profiling, the scheduler allocates resources to deadline-assigned jobs while considering rack-level data locality. Furthermore, the authors improve energy efficiency based on the slack time, in which the CPU frequency is adjusted for upcoming tasks.
By considering worst-case completion time, the proposed solution achieves a better SLA than stock Hadoop. However, the study has not considered the heterogeneity of the system. The authors would need to employ a modified job profiling technique in which job execution is measured either on a small portion of the input dataset or through online estimation of the job execution time when running on servers with different speeds.

Teng et al. [40] have proposed co-optimized energy-aware solutions in which (1) Tight Recipe Packing (TRP) consolidates the reserved virtual Hadoop clusters onto the physical servers to save energy and (2) online time-balancing (OTB) places on-demand virtual machines so as to mitigate mode switching by balancing server performance and utilization. The study only considered off-line and online batch jobs, whereas a general platform should be able to handle various workloads with different SLAs to enhance the energy efficiency of a Hadoop-based cloud datacenter. Besides, the proposed power model should also consider other system resources, such as memory and I/O power, to reach better performance.

Phan et al. [41] have provided two energy-aware speculative execution techniques that also consider system performance. First, a hierarchical slow-job detection technique is employed to reduce the number of killed speculative copies; the hierarchical method eliminates the non-critical stragglers to reduce the energy wasted on unsuccessful speculative copies. Second, based on a performance–energy model, an energy-efficient speculative copy allocation mechanism is used to place the speculative copies. The hierarchical solution can dramatically reduce the energy wasted on removed speculative copies while maintaining good performance compared to the most recent straggler detection mechanisms.
However, rather than eliminating non-critical slow jobs, a reserved resource-based allocation approach could be applied to reach better performance. Arjona et al. [42] have provided a comprehensive empirical analysis of the power and energy consumption of a heterogeneous Hadoop cluster. The authors measured the power consumed by server resources such as the CPU, network I/O, and storage under different configurations to find the optimal operational levels. They found that the system is not energy proportional and that the efficiency of all server resources can be maximized if the number of active CPU cores, their frequency, and the I/O block size are tuned according to the system and network load. Moreover, the authors showed that a job's energy consumption depends on CPU load, storage, and network activity. However, the single application considered is not representative enough to justify the accuracy of the energy model. In addition, the RAM energy consumption and the dynamicity of the CPU load have not been considered. Table 5 shows an overview of the studies on the energy efficiency topic.

3.2.2 Fault tolerance studies

In Hadoop, the minimal unit of scheduling is the "task." Therefore, when a task fails, the whole task has to be re-executed from scratch, which results in poor performance. Wang et al. [20] have presented a finer-grained fault tolerance strategy in which the map tasks generate checkpoints per spill instead of per map output file. Therefore, a retried task can start from the last spill point, saving a lot of time. The proposed fault tolerance strategy, which comes with little overhead, is not static, i.e., it allows a failed task to resume its execution from a checkpoint at an arbitrary point on demand. Parameters such as task id, task attempt id, input range, host location, and size are used to implement this strategy.

Table 5 An overview of existing primary studies focusing on "energy efficiency"
1. Mashayekhy et al. [36]. Experimental platform: simulation; Hadoop cluster (4 nodes). Parameters: energy consumption; makespan; SLA; number of map and reduce tasks. Job/dataset/workload: HiBench benchmark (TeraSort, PageRank, K-means). Main idea: EMRSA, energy-aware scheduling of MapReduce jobs for Big data applications. Advantages: near-optimal job schedules in terms of energy; very fast and practical; load balancing of all machines; significant energy saving. Disadvantages: no pipelining between map and reduce phases.

2. Ibrahim et al. [37]. Experimental platform: Grid'5000 test bed (40 nodes). Parameters: job type; power; job response time; CPU usage. Job/dataset/workload: PUMA benchmark (WordCount, K-means), Pi, Grep, Sort. Main idea: governing energy consumption in Hadoop through CPU frequency scaling. Advantages: improved power consumption; reduction in job response time.

3. Song et al. [38]. Experimental platform: simulation (100 nodes); Hadoop cluster (6 nodes). Parameters: energy consumption; scalability; fault tolerance; disk I/O; CPU workload; network I/O; job execution time. Job/dataset/workload: MRBench (Sort, WordCount, Grep). Main idea: DPA, an energy optimization algorithm based on fair data placement. Advantages: energy optimization; no data loading delay; no additional cost; improvement in job execution time. Disadvantages: not considering node failure; not adaptable.

4. Cai et al. [39]. Experimental platform: simulation using CloudSim (30 PMs). Parameters: energy consumption; data locality; SLA; cluster utilization. Job/dataset/workload: jobs: Sort, matrix multiplication. Main idea: SLA-aware energy-efficient scheduling scheme for Hadoop. Advantages: energy improvement; low resource cost; low network traffic. Disadvantages: not considering the heterogeneity of the environment.

5. Teng et al. [40]. Experimental platform: simulation; Hadoop cluster. Parameters: energy consumption; jobs' SLA. Job/dataset/workload: jobs: Sort, TeraSort, WordCount. Main idea: the energy efficiency of VM consolidation in MapReduce-style IaaS clouds. Advantages: energy saving; performance improvement; a co-optimization solution; SLA guarantee; DVFS-enabled servers.

6. Phan et al.
[41]. Experimental platform: Grid'5000 test bed (21 nodes). Parameters: energy consumption; number of concurrent map tasks; throughput; number of speculative copies. Job/dataset/workload: PUMA benchmark (WordCount, K-means, Sort). Main idea: energy-driven straggler mitigation in MapReduce. Advantages: higher energy saving due to the lower number of killed speculative copies; higher throughput. Disadvantages: low performance gain; not considering the DVFS technique.

7. Arjona et al. [42]. Experimental platform: Hadoop cluster (nemesis, survivor, and erdos servers). Parameters: energy consumption; block size; throughput; number of CPU cores; CPU frequency. Job/dataset/workload: job: PageRank. Main idea: a measurement-based analysis of the energy consumption of data center servers. Advantages: lower power consumption; higher throughput; lower disk I/O power usage; lower network I/O power usage; maximum efficiency; highly accurate energy model. Disadvantages: only considering MRv1; simple power model; not considering the RAM power usage; not considering the CPU load dynamicity.

Fu et al. [43] have conducted their work in three parts: (1) considering the issues of the Hadoop speculation mechanism; (2) classifying the faults and failures in a cluster into two groups, (a) hardware failures, i.e., node failures, and (b) software failures, i.e., task failures, and simulating the hardware failure condition for small and large jobs; and (3) manipulating and adjusting the Hadoop failure timeout and testing the different scenarios. The authors have implemented their strategy in three phases: (1) they use a central information collector which detects faults and failures at run-time; (2) unlike the stock Hadoop speculator, the authors' speculator knows the corresponding node of each task, so when a failed node is detected, all the affected tasks are speculated in an exponential order; and (3) they use a dynamic threshold to determine whether a failure should be speculated or not.
If a node has been unavailable for a time interval longer than the threshold, the tasks on that node are speculated. Tang et al. [44] have investigated node availability and network distance to overcome node failures and low-speed network bandwidth in a large cluster. This work, called ANRDF, is a part of the authors' previous work, entitled "BitDew-MapReduce." BitDew-MapReduce contains nine components: (1) replicator; (2) fault-tolerant mechanism; (3) data lifetime; (4) data locality; (5) distributor; (6) BitDew core service; (7) heartbeat collector; (8) data message; and (9) ANRDF. They have predicted the availability of each node in a cluster using feature-weighted naïve Bayes, which is more accurate than plain naïve Bayes. In addition, for estimating the network distance, a bin-based strategy has been employed in which any node in the cluster, called an "application node," measures its distance from the "landmark nodes" and partitions itself into a bin whose nodes have the minimum latency from each other. To handle omission failures, which are caused by straggler tasks, there are two approaches: (1) copying the slow task and (2) duplicating the resources. Memishi et al. [45] have presented a failure detection and handling approach based on service timeout adjustment. The authors have employed three levels of strictness of failure detection using three different algorithms, so that deadline-assigned jobs have a more accurate failure detector mechanism. The lenient level of detection is suitable for small workloads whose completion time is less than the default timeout of Hadoop; this level adjusts the timeout by estimating the workload completion time. The two other detectors outperform the default timeout of Hadoop under any workload type and failure injection time, and they adjust the timeout dynamically based on the progress score of the user workload.
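A progress-based timeout of this general kind can be sketched as follows. This is a hypothetical illustration, not the detectors of [45]: the linear extrapolation rule and the 30-second floor are our assumptions; only the 600 s default matches Hadoop's stock task-expiry interval.

```python
# Hypothetical sketch of a progress-based failure-detection timeout.
# Hadoop's stock task-expiry interval is a fixed 600 s; here the timeout
# shrinks as the job's progress score grows. The extrapolation rule and
# the bounds are illustrative assumptions, not the algorithms of [45].

DEFAULT_TIMEOUT = 600.0  # seconds (Hadoop's default task timeout)
MIN_TIMEOUT = 30.0       # floor to avoid false suspicions

def adaptive_timeout(progress_score, elapsed):
    """Timeout for declaring a task failed, from job progress in [0, 1]."""
    if progress_score <= 0.0:
        return DEFAULT_TIMEOUT  # no information yet: fall back to default
    # Linearly extrapolate total runtime from the observed progress rate.
    estimated_total = elapsed / progress_score
    remaining = estimated_total - elapsed
    return max(MIN_TIMEOUT, min(DEFAULT_TIMEOUT, remaining))
```

The point of the sketch is the direction of the adjustment: as the job nears completion, a stalled task is suspected much sooner than the fixed default would allow.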
The reliability of Hadoop is entrusted to its core and is fulfilled by re-executing the tasks of a failed node or by input data replication. Yildiz et al. [46] have presented a smart failure-aware scheduler which can act immediately when failure recovery is needed. To mitigate job execution time, the scheduler uses a preemption technique rather than a waiting approach in which tasks would wait an uncertain time until resources are freed. One way to obtain the required resources is to kill the currently running tasks on the other nodes and allocate their resources to the tasks of the failed machine. This method wastes both the resources on which the tasks were running and all the computation already done by these tasks. Therefore, the proposed scheduler benefits from a work-conserving task preemption technique with only a little overhead. Map task preemption is done by a "splitting approach" through a preemption signal: upon receiving the signal, the map task is split into two sub-tasks, where the first one consists of all the key-value pairs processed up to the preemption and is reported to the JobTracker as a completed task, while the second one, which consists of the unprocessed key-value pairs, is added to a pool to be executed later when a slot becomes available. Reduce task preemption is done by a "pause and resume" approach in which the reduce task is paused upon receiving a preemption signal and its data are stored on the local node to be restored upon resume. To choose a task for preemption, the tasks of low-priority jobs are selected. Priority is based on data locality; namely, the scheduler preempts tasks that belong to nodes where the input data of the failed tasks reside. Lin et al. [47] proposed a method to satisfy Hadoop reliability through intermediate data replication.
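The map-side splitting approach described for Chronos [46] can be illustrated with a small sketch. The representation of a task as a range of input records is our own assumption for illustration; Chronos itself operates on Hadoop's internal task structures.

```python
# Illustrative sketch of work-conserving map-task preemption by splitting,
# in the spirit of Chronos [46]. A task is modeled here as a half-open
# range of input records; this representation is an assumed simplification.

def split_map_task(records_total, records_processed):
    """Split a running map task at the preemption point.

    Returns (done, pending): 'done' covers the already-processed records
    and is reported to the JobTracker as a completed task; 'pending'
    covers the rest and is re-queued to run when a slot frees up.
    """
    done = (0, records_processed)
    pending = (records_processed, records_total)
    return done, pending

pending_pool = []  # sub-tasks waiting for an available slot

done, pending = split_map_task(records_total=1000, records_processed=640)
pending_pool.append(pending)
```

Because the processed prefix is reported as complete rather than discarded, the preemption is work-conserving: no already-finished computation is repeated.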
The authors have measured two metrics: (1) the probability that a job can be completed by the cluster and (2) the energy consumed by the cluster to finish the job, under the two different intermediate data replication policies employed in Hadoop. The first policy is the Hadoop default, in which the map outputs are stored on their host nodes, called locally stored (LS). The second policy imitates the handling of reduce outputs, replicating the intermediate data in HDFS, and is called the distributed file system (DFS) policy. The authors have conducted the experiments with two scales of jobs, i.e., small and large jobs, under two levels of parallelism: (1) full parallelization of a job, i.e., all the tasks of a job can be executed in parallel, and (2) full serialization of a job, i.e., none of the tasks of a job can be executed in parallel. Therefore, the authors have considered four scenarios, (1) LS/small jobs, (2) LS/large jobs, (3) DFS/small jobs, and (4) DFS/large jobs, which can help Hadoop administrators choose the best replication configuration for a cluster setting. Table 6 shows an overview of the papers related to fault tolerance in a MapReduce cluster.

3.2.3 Job/task scheduling studies

Xu et al. [48] have provided a dynamic scheduler in which each TaskTracker can automatically adjust its number of tasks based on both its processing capacity and workload changes. The scheduler avoids overloaded and under-loaded nodes using a dynamic slots-to-tasks allocation strategy instead of the static slot allocation of Hadoop. The dynamic strategy is based on the idea that, at each heartbeat, the full capacity of a TaskTracker is not automatically put at the disposal of the tasks; instead, the TaskTracker decides whether or not to accept more tasks by considering its workload. A monitoring module and a task executing module are used for detecting the TaskTracker load condition and for executing the accepted tasks, respectively.
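A per-heartbeat slot decision of this kind might be sketched as follows. This is our own illustration: the thresholds and the choice of load signals are assumptions, not the actual ATSDWA rules of [48].

```python
# Hedged sketch of per-heartbeat slot self-adjustment in the style of the
# dynamic slots-to-tasks strategy of Xu et al. [48]. A TaskTracker decides
# from its own load whether to advertise more or fewer task slots.
# All thresholds below are illustrative assumptions.

def slots_next_heartbeat(current_slots, cpu_util, mem_util, run_queue_len, cores):
    """Return the slot count to advertise in the next heartbeat."""
    overloaded = cpu_util > 0.85 or mem_util > 0.85 or run_queue_len > 2 * cores
    underloaded = cpu_util < 0.50 and mem_util < 0.50
    if overloaded and current_slots > 1:
        return current_slots - 1   # shed load gradually
    if underloaded:
        return current_slots + 1   # spare capacity: accept more tasks
    return current_slots           # hold steady
```

Adjusting by one slot per heartbeat keeps the feedback loop stable, which mirrors the gradual self-regulation the scheduler aims for.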
To achieve the desired results, the monitoring module considers the CPU load (i.e., the number of tasks in the CPU queue that are ready to run), the CPU utilization, and the memory as the load parameters. Lim et al. [49] have formulated the matchmaking and scheduling problem for an open stream of multistage deadline-assigned jobs using constraint programming.

Table 6 An overview of existing primary studies focusing on "fault tolerance"

1. Wang et al. [20]. Experimental platform: Hadoop cluster (16 nodes) on Azure Cloud. Parameters: overhead of the method; scalability; input data size; map buffer size; map and reduce task execution time; CPU, disk, and network usage of reduce tasks; performance of merging on the reduce side; impact of the map-side combiner; reliability. Dataset/workload: HiBench benchmark (WordCount, Hive query). Main idea: BETL, creating checkpoints by allowing map tasks to output their intermediate data in multiple files on multiple nodes; a finer-grained fault-tolerant mechanism. Advantages: faster execution of map tasks due to not merging spill files; the benefit of speculative execution; less JVM overhead; no master overhead; more resilient in terms of creating checkpoints on demand; dramatic performance improvement under failure; adaptable. Disadvantages: reduce tasks have to shuffle and sort more files; network traffic; more disk access; no pipelining between map and reduce phases.

2. Fu et al. [43]. Experimental platform: Hadoop cluster (two private clusters with 21 nodes). Parameters: job size; job execution time; network delay; network stability; failure occurrence time. Dataset/workload: HiBench benchmark (Aggregation, Join, K-means, PageRank, Scan, Sort, TeraSort); built-in YARN benchmark (WordCount, TeraSort, SecondarySort). Main idea: FARMS, a hybrid solution which includes a speculation mechanism and a task scheduling policy to enhance failure awareness and recovery. Advantages: adaptive and available centralized fault analyzer. Disadvantages: not considering failures in the reduce phase.
3. Tang et al. [44]. Experimental platform: simulation using TDEMR (1 PM, 1015 simulated nodes); Hadoop cluster with 36 nodes (16 PMs on a Hadoop cluster, 20 VMs on a cloud). Parameters: job response time; prediction error; node availability. Dataset/workload: dataset: SETI@home traces, BRITE generator; jobs: WordCount, Grep. Main idea: ANRDF, resource availability- and network distance-aware MapReduce over the Internet, using weighted naïve Bayes classifier-based availability prediction and landmark-based network estimation. Advantages: decreased job response time; high estimation accuracy of node availability; low shuffle transfer. Disadvantages: trusting nodes which have the potential of failing; not considering the heterogeneity of the environment; lack of a new network estimator.

4. Memishi et al. [45]. Experimental platform: simulation (25 containers). Parameters: task completion time; workload size; reliability. Dataset/workload: job: Sort. Main idea: solving omission failures using timeout service adjustment. Advantages: more accurate failure detection; performance improvement. Disadvantages: lack of enhancement of the framework behavior in terms of failure detection and its relation to the Hadoop timeout.

5. Yildiz et al. [46]. Experimental platform: Grid'5000 test bed, Parapluie cluster at Rennes (Hadoop cluster with 9 nodes). Parameters: data locality; task completion time; job completion time; reliability; scalability; type of job; failure detection timeout; workload size; network traffic; job execution time. Dataset/workload: dataset: Facebook workload; jobs: map-heavy (WordCount), reduce-heavy (Sort), PUMA benchmark.

6. Lin et al.
[47]. Experimental platform: Hadoop cluster. Parameters: input data size; number of reduce tasks; number of jobs; network traffic. Main idea (Yildiz et al. [46]): Chronos, a failure-aware scheduling strategy in which recovery tasks with a higher priority preempt tasks with a lower priority; advantages: data locality improvement; reduction in job completion time; failure handling improvement in the reduce phase due to the reduction in waiting time; throughput improvement by reducing the waiting time for launching recovery tasks; reduction in network traffic; effectiveness of the reduce task preemption technique; disadvantages: overhead of the profiling and preemption techniques. Main idea (Lin et al. [47]): improving job completion reliability and job energy consumption by replicating intermediate data; advantages: improved job completion reliability; reduced job energy consumption; disadvantages: not considering the heterogeneity of the environment; not considering network failures.

Each job's SLA is characterized by its earliest start time, its execution time, and its end-to-end deadline. MRCP-RM is only applicable to jobs with two phases of execution, such as MapReduce jobs, and its objective is to minimize the number of jobs that miss their deadlines. Kao et al. [15] have investigated the trade-off between data locality and performance for deadline-assigned real-time jobs in a homogeneous server system. Three modules are employed in each node to provide deadline guarantees: (1) a dispatcher; (2) a power controller; and (3) a scheduler. To meet job deadlines, the authors consider the deadline of the map tasks of a job, called the "local deadline." For this purpose, two separate queues, one for map tasks and one for reduce tasks, are maintained on each data node. The dispatcher first assigns a local deadline to the map tasks of a job, and according to this local deadline, the task with the shortest deadline is executed first.
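Shortest-local-deadline-first dispatch of this kind can be sketched with a per-node priority queue. This is a generic earliest-deadline-first illustration under assumed data structures, not the implementation of [15].

```python
# Illustrative sketch: a per-node map queue ordered by "local deadline",
# in the spirit of the dispatcher of Kao et al. [15]. The task with the
# earliest local deadline runs first. The class is our own assumption.
import heapq

class MapQueue:
    def __init__(self):
        self._heap = []  # (local_deadline, task_id) min-heap

    def add(self, task_id, local_deadline):
        heapq.heappush(self._heap, (local_deadline, task_id))

    def next_task(self):
        """Pop the map task with the shortest local deadline."""
        _deadline, task_id = heapq.heappop(self._heap)
        return task_id

q = MapQueue()
q.add("m1", local_deadline=40)
q.add("m2", local_deadline=15)
q.add("m3", local_deadline=25)
```

A binary heap keeps both insertion and extraction logarithmic, which is why earliest-deadline-first dispatchers are typically built on one.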
Using a partition value estimation, the proposed method partitions tasks to data locality-aware nodes for less data transmission and less blocking. Furthermore, to mitigate energy consumption, some nodes are switched to the sleep state. In this work, because of the considerable penalty of data migration, the proposed framework does not consider the precedence of tasks to satisfy data locality. Therefore, shorter jobs are blocked by the non-preemptive execution of larger jobs, which degrades Hadoop performance. Sun et al. [50] have provided a data locality-aware scheduler in which the expected data of map tasks launching in the future are prefetched into memory on the intended nodes. The intended nodes are determined based on the currently pending tasks whose remaining time is less than a threshold but greater than the data block transmission time. Following the consumer-producer model, and to manage the memory buffer effectively, two prefetching buffer units, each with the same size as an HDFS block, are allocated per map slot. By using the prefetching technique, map tasks with rack locality or rack-off locality are not delayed, and consequently, jobs complete earlier. Tang et al. [51] have presented an optimized and self-adaptive scheduler for reduce tasks. The scheduler can decide dynamically according to the job content, including the completion time of the tasks and the size of the map output. In fact, this method prevents the waste of reduce slots during the copy phase by delaying the start time of the reduce tasks of the current job, providing idle reduce slots for other jobs. Thus, at a certain time, when some tasks of the job have completed, the scheduler schedules and assigns the reduce slots to the reduce tasks of that job. This method mitigates the completion time of the reduce tasks, decreases the average system response time, and utilizes the resources efficiently. Bok et al.
[52] have considered data locality and I/O load for deadline-assigned jobs which process multimedia and images. In addition, the approach minimizes job deadline misses using two queues, called the "urgent" and "delay" queues. The paper minimizes the deadline miss ratio caused by I/O load using the urgent queue and maximizes the deadline hit ratio using hot block replication. The delay queue has the same functionality as the job queue of Delay scheduling [53]: a task whose data reside on other nodes should be executed now, but its data do not exist on the host node, so it waits for a short time (D) in the expectation that a slot is freed on those other nodes in the meantime so that it can be executed there. If no slot is freed during the waiting time, the task is then executed on its host node and data locality is not met. The urgent queue allocates slots to the jobs which are not expected to complete within their deadline because of missing data locality or high node workload. When a client submits a job, it is first placed in the delay queue. If the difference between the deadline and the predicted completion time exceeds a user-specified threshold, the job is sent to the urgent queue, where jobs are ordered ascending by this difference for execution. Hashem et al. [54] have proposed a two-objective scheduler for minimizing the cost of cloud services and the job completion time: the cost of resources is considered from the point of view of resource allocation, and the job completion time from the point of view of scheduling. The proposed model thus improves performance when processing Big data using the MapReduce framework. The model applies the earliest finish time algorithm, in which both tasks-to-resources and resources-to-tasks mappings are done to meet the model objectives.
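The two-queue admission described for Bok et al. [52] can be sketched roughly as follows. This is a hypothetical simplification: the slack test (routing a job to the urgent queue when its predicted completion leaves too little margin before the deadline) is our interpretation, and the queue representation is an assumption.

```python
# Rough sketch of two-queue job admission in the spirit of Bok et al. [52]:
# jobs expected to miss (or barely make) their deadline go to the "urgent"
# queue, ordered by slack; the rest wait in the "delay" queue. The slack
# convention and ordering rule are illustrative assumptions.

def route_job(deadline, predicted_completion, threshold, urgent, delay):
    """Place a job into the urgent or delay queue by its deadline slack."""
    slack = deadline - predicted_completion
    if slack < threshold:
        urgent.append((slack, deadline))
        urgent.sort()          # urgent jobs served in ascending slack order
        return "urgent"
    delay.append(deadline)
    return "delay"

urgent, delay = [], []
placed_a = route_job(100, predicted_completion=95, threshold=10,
                     urgent=urgent, delay=delay)
placed_b = route_job(200, predicted_completion=120, threshold=10,
                     urgent=urgent, delay=delay)
```

The first job has only 5 time units of slack and is escalated; the second has 80 and can afford to wait for data locality.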
In the algorithm, the earliest finish time is chosen based on the number of tasks of a job, which is configured by the job owner. In addition, the service method returns a positive value if there are adequate mappers and reducers to finish a workflow job within the specified budget and deadline. Nita et al. [55] have presented a multi-objective scheduler which considers both the deadline and budget constraints from the user side. To find the best matching between deadline-assigned jobs and available slots, the authors define a service function and a decision variable. The service function returns a positive value if there are enough mappers and reducers to complete a MapReduce job within budget and deadline, and the decision variable represents the weight of resource usage. The best assignment between jobs and resources is selected based on the summation of the individual service results. In addition to the costs of map and reduce processing time and their resource usage costs, a penalty for transferred data has been considered due to its non-locality. Tang et al. [56] have presented a scheduling algorithm for jobs in the form of a workflow. Since the execution times of jobs differ according to the job type, i.e., I/O-intensive or CPU-intensive, the algorithm comprises a job prioritizing phase in which the jobs are prioritized with respect to their types, input data size, communication time to other jobs, and type of slots. Moreover, a task assignment phase prioritizes the tasks for scheduling based on the data locality on their intended node. Therefore, the scheduling length and the parallelization of the task workflow are improved. Table 7 shows an overview of the papers related to MapReduce job/task scheduling.

3.2.4 Load balancing studies

Since the imbalance of keys in a Hadoop cluster is intrinsic, Chen et al. [57] have presented a user-transparent partitioning method to solve data skew on the reducer side.
To evenly distribute the map output data between the reduce tasks, this paper benefits from an integrated sampling in which a small fraction of the produced data

Table 7 An overview of existing primary studies focusing on "job/task scheduling"

1. Xu et al. [48]. Experimental platform: Hadoop cluster (7 PMs and 12 VMs using VMware). Parameters: task response time; execution time of all tasks; load balancing; scalability; CPU and memory utilization; number of tasks. Dataset/workload: CPU-intensive job (TeraSort). Main idea: ATSDWA, in which each TaskTracker weighs its load at each heartbeat based on the collected load parameters and dynamically adjusts its maximum number of task slots. Advantages: load balancing of the cluster; a simple, reliable, efficient, and applicable algorithm; reduction in task execution time; enhancement of system responsiveness; the same quality of service as Hadoop; higher resource utilization; no overload bottleneck on the JobTracker (TaskTrackers are self-regulating according to their workload). Disadvantages: not scalable; lower system adaptability when resource utilization is too high.
2. Lim et al. [49]. Experimental platform: simulation; Hadoop cluster (11 nodes on Amazon EC2). Parameters: quality of service; data locality; scalability; task execution time; job arrival rate; system workload parameters; number of resources; job deadline. Dataset/workload: Gutenberg project (synthetic Facebook and WordCount workload). Main idea: MRCP-RM, modeling the resource allocation and scheduling problem using constraint programming. Advantages: a low proportion of jobs miss their deadline; lower average job turnaround time; efficient processing of an open stream of MapReduce jobs with SLAs; flexible in minimizing the number of late jobs, with a small scheduling overhead; scalable; a non-preemptive algorithm, so no computation time is wasted; implementation of a simulator called "SimExec"; reduction in map and reduce task execution time. Disadvantages: not effective in lightly loaded systems; not considering job priority when modeling resource management; no technique to improve performance in the presence of errors caused by user-estimated execution times; not considering complex workloads.

3. Kao et al. [15]. Experimental platform: simulation using CloudSimRT; cloud cluster (20 PMs, 40 VMs). Parameters: data locality; energy; job response time; met-deadline ratio. Dataset/workload: synthetic workload (I/O-bound, CPU-bound); Hive benchmark; Facebook workload; Yahoo! workload; jobs: Batch, Select, TextSearch, Aggregation. Main idea: DamRT, a data locality-aware real-time scheduler for interactive MapReduce jobs. Advantages: run-time power saving; improved quality of service; higher data availability; minimized task response time. Disadvantages: non-elastic.

4. Sun et al.
[50]. Experimental platform: Hadoop cluster (21 nodes). Parameters: data locality; scalability; job execution time; input data size; block size. Dataset/workload: PUMA benchmark (Grep, Histogram rating, Classification, WordCount, Inverted-index). Main idea: HPSO, an intra-block and inter-block prefetching-based task scheduler to improve data locality. Advantages: high data locality; high scalability; reduction in job execution time. Disadvantages: memory overhead due to the prefetching buffer; ineffective prediction method; network transmission cost.

5. Tang et al. [51]. Experimental platform: Hadoop cluster (7 nodes). Parameters: system average response time; task execution time; job completion time; slot usage; input data size; number of map tasks. Dataset/workload: MRBench (WordCount, Pi, TeraSort, GridMix). Main idea: SARS, an optimal and self-adaptive reduce scheduling policy. Advantages: reduction in the shuffle and sort phases; reduction in task execution time; saving of reduce slots; reduction in job completion time. Disadvantages: network I/O bottleneck; not considering the heterogeneity of the environment.

6. Bok et al. [52]. Experimental platform: Hadoop cluster (20 nodes). Parameters: data locality; I/O load; number of deadline-assigned and total jobs; job deadline miss ratio; job completion time; makespan. Dataset/workload: dataset: multimedia data generator; jobs: I/O-light (WordCount), I/O-heavy (TeraSort). Main idea: minimizing the deadline miss ratio of jobs which process large multimedia data using urgent queue scheduling. Advantages: reduction in job completion time; reduced deadline miss ratio of jobs using the speculation method; increased deadline success ratio of jobs using hot data block replication; improved makespan.

7. Hashem et al.
[54]. Experimental platform: Hadoop cluster (10 VMs). Parameters: job completion time; cost; throughput; CPU utilization; makespan; input data size. Dataset/workload: dataset: Hadoop scheduler load simulator; jobs: CPU-intensive (WordCount), I/O-intensive (Sort). Main idea: optimizing job scheduling based on a multi-objective model. Advantages: minimized job completion time; high resource utilization rate; reduced workload processing time; low latency.

8. Nita et al. [55]. Experimental platform: MobiWay test bed (12 nodes). Parameters: number of cores; makespan; elasticity; job completion time; memory size; time of node updating. Dataset/workload: dataset: scheduling load simulator (SLS, Million Song). Main idea: MOMTH, a multi-objective scheduler fulfilling constraints such as deadline and budget. Advantages: development and deployment on the SLS simulator; improvement in job completion time; reduction in task waiting time in the tasks' queue. Disadvantages: not considering the energy consumption; the scheduling decision is taken only based on current knowledge; the biggest job cannot be executed due to deadline miss.

9. Tang et al. [56]. Experimental platform: Hadoop cluster. Parameters: type of job; number of jobs; task execution time; number of tasks; input data size. Dataset/workload: data-intensive job (Montage workflow). Main idea: MRWS, a two-phase optimized workflow scheduler based on job types and data locality. Advantages: parallelism and efficiency speedup under any graph size; practical; makespan improvement; reduction in task execution time. Disadvantages: the higher time cost of dynamically computing the number of map and reduce slots; not considering complex jobs.

during the processing of the first twenty percent of map tasks is sampled. Afterward, the large keys are split according to the servers' capacity. Reducers can immediately shuffle the data which have already been produced; hence, the job execution time is dramatically decreased.
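Sampling-based key splitting of this kind can be sketched as follows. This is our own simplified illustration; the capacity model and the split rule are assumptions, not the exact method of [57].

```python
# Simplified sketch of skew mitigation by splitting large keys across
# reducers according to capacity, in the spirit of Chen et al. [57].
# The capacity model below is an illustrative assumption.

def split_large_keys(sampled_key_counts, reducer_capacity):
    """Split any key whose sampled count exceeds a reducer's capacity.

    Returns a list of (key, share) partitions, where a large key is
    divided into several shares, none bigger than the capacity.
    """
    partitions = []
    for key, count in sampled_key_counts.items():
        while count > reducer_capacity:
            partitions.append((key, reducer_capacity))
            count -= reducer_capacity
        partitions.append((key, count))
    return partitions

# A skewed key ("the") is split into capacity-sized shares; a rare key is not.
parts = split_large_keys({"the": 250, "rare": 40}, reducer_capacity=100)
```

In the real system, each share would be routed to a different reducer so that no single reducer receives the whole hot key.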
Repartitioning the intermediate data to preserve the load balance of a heterogeneous Hadoop cluster incurs high overhead. To tackle this problem, Liu et al. [58] have presented a run-time partitioning skew mitigation. The idea is that, rather than controlling the data size by splitting it among reducers, resources are allocated dynamically; namely, the number of resources allocated to a reducer can be increased or decreased at run-time. A resource allocator module is responsible for granting the number of resources demanded by a reduce task. The required resources are allocated based on a statistical model constructed from the current partition size and the allocated resources of a reduce task, which are sufficient to estimate the reduce task execution time. This method is simple and incurs no overhead. Chen et al. [59] have considered the data placement problem in terms of data locality and remote data access costs for both the map and reduce sides. The authors have presented a replication strategy on the map side in which the data access cost, defined as a function of the distance between nodes and the size of the data to be transferred, is minimized. In the same way, to mitigate the data access frequency on the reduce side, block dependencies are detected and the blocks which have strong dependencies are merged into a single split for processing. Furthermore, to alleviate network traffic, the authors have defined an optimal matrix to place all data blocks based on a topology-aware replica distribution tree. Thus, the data movement during the map and reduce stages is minimized. Li et al. [60] have proposed a programming model, called Map-Balance-Reduce, with an extra middle stage to effectively process the unevenly distributed data caused by unbalanced keys. In this model, the map outputs are estimated in the Balance stage and the balanced output of this stage is fed into the Reduce stage.
This stage is like a mini-reducer stage in which a task that would cause load unbalancing problems is found in advance by preprocessing the map outputs. The stage sums the map outputs of the same key, partitions them into more splits, and feeds them into the reducer nodes. Whether the load is unbalanced is determined by the workload of the reduce task nodes: if the workload of a reduce task node is less than a certain threshold, the adaptive load balancing process is applied, in which the current reduce task is stopped and the keys on the current reduce node are partitioned and distributed to the other n-1 nodes. In this way, the algorithm mitigates the job execution time. Liroz-Gistau et al. [61] have presented a system which has an extra phase between the map and reduce phases, denoted "intermediate reduce." The skewed map outputs benefit from the reducers of this phase and can be processed in parallel. The intermediate reduce function is like a combiner and can be iterated adaptively, based on the partition size, I/O ratio, or partition skew ratio, until a given threshold. Once the map buffer fills up, its spills are maintained in a table. Afterward, they are merged into a partition and, based on a greedy or a data locality strategy, they are fed into the intermediate reducers as input splits. Exploiting the spills as soon as they are ready makes the system fault-tolerant and faster, while it incurs master overhead in terms of keeping the spill metadata. Myung et al. [62] have presented a histogram-based method on the join key attribute to balance the load of reducers for join jobs. In this paper, the data skew problem is relieved by mapping splits to reducers using a partitioning matrix. With a small number of input samplings, the samples from all relations make up the matrix, and the join operations are done based on the key range overlapping.
Namely, the join candidate cells of the matrix provide better performance in the join operations. In range-based partitioning, the most frequently repeated samples are identified as the cause of imbalanced partitioning, i.e., the most skewed relations; unlike range-based partitioning, the matrix also considers the less skewed relations of the join. Furthermore, the proposed partitioning outperforms random-based partitioning, in which the rate of input duplication increases substantially as the input size grows. Liu et al. [63] have presented an architecture in which the workload distribution of reduce tasks is predicted using an online partition size prediction algorithm. Hence, in addition to the map function and the number of reducers, the partition sizes depend on the input dataset. The algorithm uses a small set of random data to profile characteristics of the whole input. Based on the predicted workload, the framework detects tasks with large workloads using a deviation detection method, in linear time and without any knowledge of the statistical distribution of the data. Before allocating resources to the overloaded tasks, the framework determines the relation between task duration and two factors, namely partition size and resource allocation. Thereupon, the framework speeds up job completion by proactively adjusting the allocation of resources to the overloaded tasks. Zhang et al. [64] have presented a two-objective data skew mitigation model. The model is executed in two independent phases, called data transmission and data computation. In the data computation phase, the minimum number of nodes that participate in data processing is calculated. In the data transmission phase, subject to an upper bound on the relative data computation time from the computation phase, both the data transmission time and the network usage are minimized using a greedy algorithm that finds the best network flow.
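A much-simplified sketch of this two-phase structure follows; the node counts, processing rates, and transfer costs are invented for illustration, and the greedy rule below stands in for the paper's richer network-flow formulation:

```python
import math

def min_nodes(total_work, per_node_rate, time_bound):
    """Phase 1 (data computation): the smallest number of nodes whose
    aggregate processing rate meets the computation time bound."""
    return math.ceil(total_work / (per_node_rate * time_bound))

def greedy_transmission(partitions, nodes, transfer_cost):
    """Phase 2 (data transmission): send each partition, largest first,
    to the node with the cheapest transfer, breaking ties by load."""
    load = {n: 0 for n in nodes}
    plan = {}
    for pid, size in sorted(partitions.items(), key=lambda kv: -kv[1]):
        best = min(nodes, key=lambda n: (transfer_cost[(pid, n)], load[n]))
        plan[pid] = best
        load[best] += size
    return plan

nodes_needed = min_nodes(total_work=100, per_node_rate=10, time_bound=2.5)
plan = greedy_transmission(
    {"p1": 50, "p2": 30, "p3": 20},
    ["a", "b"],
    {("p1", "a"): 1, ("p1", "b"): 2, ("p2", "a"): 2,
     ("p2", "b"): 1, ("p3", "a"): 1, ("p3", "b"): 1},
)
```

Separating the two decisions keeps the computation bound fixed while transmission time and network usage are minimized within it.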
In addition, the method allows jobs with higher priority to be processed earlier. Table 8 shows an overview of the papers related to load balancing in a MapReduce cluster.

Table 8 An overview of existing primary studies focusing on "load balancing" (condensed to experimental platform and main idea; the published table also lists parameters, datasets/workloads, advantages, and disadvantages per study)

1. Chen et al. [57]; Hadoop cluster (30 PMs, 15 VMs using KVM, OpenStack OS); LIBRA: solving the data skew problem on the reduce side.
2. Liu et al. [58]; SAVI test bed (Hadoop cluster with 21 VMs using Xen, OpenStack OS); DREAMS: run-time partitioning skew mitigation by controlling the number of resources allocated to each reduce task.
3. Chen et al. [59]; simulation using TopoSim (1080 simulated data nodes) and a Hadoop cluster (18 data nodes); optimal data placement into HDFS by a topology-aware heuristic algorithm.
4. Li et al. [60]; Hadoop cluster (5 nodes); MBR: a Map-Balance-Reduce programming model with preprocessing and self-adaptive scheduling for effectively processing data with unbalanced keys.
5. Liroz-Gistau et al. [61]; Grid'5000 platform (20 nodes); reduce-side skew handling by introducing a new phase called intermediate reduce.
6. Myung et al. [62]; Hadoop cluster (13 nodes); MDRP: a skew handling method based on multi-dimensional range partitioning.
7. Liu et al. [63]; SAVI test bed (Hadoop cluster with 11 VMs); OPTIMA: online partitioning skew mitigation by predicting the workload distribution of reduce tasks at run-time.
8. Zhang et al. [64]; simulation; DTPM: minimizing data transmission time and network bandwidth usage using a distributed two-phase model.

3.2.5 Performance studies

There are two constraints in MapReduce: (1) reduce tasks are executed after map tasks, to assure the logic correctness of MapReduce, and (2) map tasks run in map slots and reduce tasks run in reduce slots. Because of these constraints, Tang et al. [65] have proposed a job ordering algorithm for optimizing two performance factors, makespan and total completion time. To find the best ordering of jobs, the authors define an optimal slot configuration and order the jobs based on it. Furthermore, the authors consider the converse condition, in which an optimal slot configuration is found for every job order. Since there is a trade-off between makespan and total completion time, a greedy job ordering algorithm based on a heuristic slot configuration algorithm has been proposed.
Although the second version of Hadoop introduces YARN, which benefits from a "container" model, there is no control over the number of reduce tasks that can run in a container, so the network bandwidth becomes a bottleneck due to reduce task shuffling. Verma et al. [66] have presented a two-stage scheduler based on the Johnson algorithm to minimize the makespan of multi-wave batch jobs. According to the Johnson algorithm, the jobs are arranged in a queue in ascending order of map execution time; if a job's reduce execution time is shorter than its map execution time, the scheduler puts the job at the tail of the queue. Although this method mitigates the makespan, in scenarios where the number of tasks of a job is smaller than the number of available slots, a local optimum can cause a problem. To tackle this, a heuristic method called "BalancedPools" is employed, in which the jobs are divided into two pools with approximately the same makespan. The paper does not consider a model for jobs whose data become ready during the execution of other jobs, and the complexity of the algorithm is rather high due to repetitive divisions. Since manual configuration of Hadoop is a tedious and time-consuming task, Bei et al. [67] have presented a random forest methodology in which the parameter settings are tuned automatically. In this method, two regression-based models are constructed to accurately predict the performance of the map and reduce stages. Subsequently, for each stage, a genetic algorithm fed by the outputs of these models searches the configuration space. The proposed method is suitable and fast for repetitive, long-running applications with large input data in a Hadoop cluster. Although Hadoop performance is enhanced using task scheduling or load balancing techniques, the heterogeneity of a cluster deteriorates the performance of running jobs that are configured homogeneously. Cheng et al.
[68] have proposed an ant-based architecture that is model-independent and automatically obtains the optimal configuration for large job sets with multi-wave tasks. Namely, task tuning is improved during job execution, starting from random parameter settings and without any job profiling. The proposed architecture consists of two modules: (1) a self-tuning optimizer and (2) a task analyzer, which resides in the JobTracker. The first round of tasks is configured randomly by the optimizer module, and the tasks are sent to TaskTrackers to be executed. Once the first wave is finished, the task analyzer suggests better settings for the next round of tasks to the optimizer module, using a fitness function based on task completion time. Yu et al. [69] have presented an accelerator framework that uses plug-in components to expedite data movement and merge data without repetitive disk access. The key idea of the method is to keep the data on the remote disk nodes until record merge time, i.e., the time when all map tasks are finished, all map output files have been produced, and priority queues can be constructed from the segments (partitions) of the map tasks. This mechanism provides a full pipeline between the Hadoop map, shuffle, and reduce phases and is more scalable than stock Hadoop. Moreover, InfiniBand, which is much faster than Ethernet, is used as the communication hardware. In the shuffle phase, all data partitions are transmitted from the map side to the reduce side to be aggregated and fed into their related reducers. This still-challenging process imposes high network traffic and makes the network bandwidth a bottleneck. Guo et al. [70] have proposed in-network aggregation, in which the map outputs are collected, routed across the network, and processed at intermediate nodes once the transmission phase starts.
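The aggregation step at an intermediate node can be sketched as a combiner over the partial map outputs that pass through it; the stream format below is an assumption of this sketch:

```python
from collections import Counter

def aggregate_at_node(incoming_streams):
    """Merge partial (key, count) streams arriving at an intermediate
    node into one combined stream before forwarding it onward."""
    merged = Counter()
    for stream in incoming_streams:
        for key, count in stream:
            merged[key] += count
    return sorted(merged.items())

# Two mappers route their outputs through one in-network aggregator:
forwarded = aggregate_at_node([[("a", 2), ("b", 1)], [("a", 3), ("c", 1)]])
```

Four records enter the aggregator and three leave it, so the downstream links toward the reducer carry less shuffle traffic.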
To attain this idea, the authors use a tree model to minimize each in-cast transmission, i.e., the data transmission of all maps to one reducer, and a graph model to minimize shuffling, i.e., the data transmissions of all maps to all reducers. The methodology relieves the reduce side's aggregation load by parallelizing the reduce and shuffle phases and diminishes the job completion time. Guo et al. [71] have presented the shuffle phase of Hadoop as a service independent from the map and reduce phases, called "iShuffle." The service acquires the intermediate data proactively, i.e., before the reduce tasks start, through a "shuffle-on-write" operation and makes it ready for the reduce tasks. In the shuffle-on-write operation, after the map buffer on a node fills and its data are written to disk, the node's dedicated shuffler takes a copy of the data. Afterward, the shuffler places the data partitions on the nodes where the reduce tasks will be launched, according to a placement algorithm. This placement algorithm, which is based on partition size prediction solved by linear regression, guarantees an even data distribution on the reduce nodes during data transfer. To gain fault tolerance, the data are not sent to the intended node directly but are first written to the node's disk. In addition, the method uses preemptive scheduling to lessen job completion times. The proposed method in [72] is inspired by this paper; however, its placement mechanism is totally different, and the job type, CPU-intensive or data-intensive, is also considered to balance the node workload. Ke et al. [73] have presented a three-layer model that alleviates network traffic by designing a data partitioning scheme. The proposed model defines a graph with three layers: (1) mapper nodes; (2) intermediate nodes, including aggregation nodes and shadow nodes; and (3) reduce nodes.
The model builds on the default Hadoop placement technique. Depending on the intermediate data size, if the map output for a key partition is large, it is processed on reduce tasks close to the map task node and is not sent to reduce tasks placed on other racks. In the second layer, nodes are treated as potential aggregation nodes if the data to be moved to a reducer are expected to be active there; otherwise, the data are sent directly to the reducer through shadow nodes, which do not physically exist. Therefore, by considering the data locality levels, i.e., node locality, rack locality, and cluster locality, this method achieves data locality while mitigating network traffic. The network traffic minimization is formulated as a linear program and solved by a distributed algorithm using Lagrangian methods. Chen et al. [74] have presented a speculative strategy performed in four steps: (1) detecting the stragglers; (2) predicting the original task's remaining time; (3) selecting the stragglers to back up; and (4) placing the backup tasks on suitable nodes. First, to detect straggler tasks, the authors use the task progress rate and the processing bandwidth in each Hadoop phase. Second, to predict the process speed and the task remaining time, an exponentially weighted moving average method is used. Third, to determine which tasks to back up based on the cluster load, a cost-benefit model has been proposed. Finally, to determine suitable nodes to host the backup tasks, data locality and data skew are considered. The proposed strategy mitigates the job completion time and improves cluster throughput. Load imbalance, i.e., data skew, causes straggler tasks to emerge. To overcome this problem, Guo et al. [75] have proposed a user-transparent speculative strategy for Hadoop in a cloud environment.
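The straggler detection and remaining-time prediction steps can be sketched as follows; the smoothing factor, the slow-task threshold, and the progress bookkeeping are illustrative assumptions, not the paper's tuned values:

```python
def ewma_speed(speed_samples, alpha=0.3):
    """Exponentially weighted moving average of observed process
    speeds, giving recent samples more weight."""
    smoothed = speed_samples[0]
    for s in speed_samples[1:]:
        smoothed = alpha * s + (1 - alpha) * smoothed
    return smoothed

def remaining_time(progress, speed_samples):
    """Unprocessed fraction divided by the smoothed speed."""
    return (1.0 - progress) / ewma_speed(speed_samples)

def find_stragglers(tasks, slow_ratio=0.5):
    """Flag tasks whose progress rate is well below the average rate.

    tasks maps task id -> (progress in [0, 1], elapsed seconds)."""
    rates = {t: p / e for t, (p, e) in tasks.items()}
    avg = sum(rates.values()) / len(rates)
    return [t for t, r in rates.items() if r < slow_ratio * avg]

left = remaining_time(0.5, [0.1])  # time units left for this task
slow = find_stragglers({"t1": (0.8, 10), "t2": (0.8, 10), "t3": (0.1, 10)})
```

A cost-benefit step would then back up only those flagged tasks whose predicted remaining time exceeds the cost of launching a backup.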
When the stragglers are detected, the cloud slots are scaled out so that the stragglers receive more resources to process their input data in less time. The proposed strategy balances resource usage across the cluster by adaptively changing the slot number and slot memory size in an online manner. Therefore, both the data skew and the job completion time are mitigated by this strategy. There are two main strategies for speculative execution: (1) cloning and (2) straggler detection. In cloning, if the computation cost of a task is low and there are enough resources, additional replicas of the task are scheduled in parallel with the initial task. In straggler detection, the progress of each task is monitored, and additional copies are started when a straggler is detected. Xu et al. [76] have divided clusters into lightly loaded and heavily loaded. They introduce the smart cloning algorithm (SCA) for lightly loaded clusters and the enhanced speculative execution (ESE) algorithm, based on the straggler detection approach, for heavily loaded clusters. Jiang et al. [77] have presented a heuristic method for online jobs, which enter the system over time, and an approximation method for off-line jobs, both minimizing the jobs' makespan. The authors' contribution is to employ servers with different speeds; moreover, the assumption of non-parallelizable reduce tasks is another contribution, which makes the makespan minimization problem harder to solve. In this method, the reduce tasks are considered once as preemptive and once as non-preemptive. The main idea is based on the bin packing problem: the reduce tasks are arranged in descending order of execution time and allocated to the faster servers, respectively. Next, the duration for which the reduce tasks will occupy these servers is calculated, and the results are arranged.
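The descending, speed-aware arrangement of reduce tasks can be sketched as a longest-processing-time assignment; the server speeds and task times below are invented, and the earliest-finish rule is a simplification of this sketch:

```python
def assign_reduce_tasks(task_times, server_speeds):
    """Place reduce tasks, longest first, on the server that would
    finish them earliest given its speed and current queue."""
    finish = [0.0] * len(server_speeds)
    plan = [[] for _ in server_speeds]
    for t in sorted(task_times, reverse=True):
        best = min(range(len(server_speeds)),
                   key=lambda s: finish[s] + t / server_speeds[s])
        finish[best] += t / server_speeds[best]
        plan[best].append(t)
    return plan, max(finish)

# One fast server (speed 2) and one slow server (speed 1):
plan, reduce_makespan = assign_reduce_tasks([6, 4, 2], [2, 1])
```

The per-server finish times then delimit the idle intervals into which the corresponding map tasks are packed.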
Using these results, the intervals in which servers are idle are determined, and the map task related to the largest reduce task is scheduled for execution. Therefore, all map tasks are allocated within the reduce task execution intervals. Once the total idle slots in an interval have been occupied, the rest of the map tasks are allocated after that time. Ultimately, according to the MapReduce execution logic, in which map tasks must be executed before reduce tasks, the schedule is reversed, and in case of available slots, the allocation of reduce tasks continues. Veiga et al. [78] have presented an event-driven architecture in which the phases of map and reduce tasks are executed using Java threads called "operations." Rather than Hadoop's container-based resource allocation, the proposed model integrates the map and reduce resources into a pool and allows the operations to use resources when they need them. The operations form the stages of a pipeline and are connected using data structures for reading and writing the data. To reduce memory copies in each stage, the architecture uses references to the data rather than the data itself; in this way, there is no need to convert the data into writable objects. Furthermore, for operations that must be completed before others, e.g., the map operation that should be executed before the merge, the system uses a priority method. The architecture is compatible with Hadoop MapReduce jobs, and no changes are required to the source code of the jobs. According to Hadoop-LATE [79], system load, data locality, and low task priority are the major factors that should be considered as performance model metrics. To precisely estimate the remaining execution time of tasks, Huang et al. [80] have proposed a speculative strategy based on the linear relationship between system load and task execution time. To detect slow nodes, a dynamic average threshold is defined, and for efficient resource usage, an extended maximum cost performance model is proposed; unlike [73], different slot values are considered. These strategies mitigate the running time and the response time of jobs. Tian et al. [81] have presented a framework based on the Johnson model to minimize the makespan of off-line and online jobs. This paper improves on [66]; the idea is that rather than dividing the cluster resources into pools, one pool, i.e., the whole cluster, suffices, and all jobs can use all the available resources. In this way, a better makespan is acquired. In addition, this paper proves that the minimum makespan can be obtained in linear time, so the problem is not NP-hard. The authors also note that although the makespan of each pool is minimal, the makespan of all jobs is not. Wang et al. [82] have presented a speculative execution strategy in which, rather than starting slow tasks from scratch, speculative copies start from checkpoints leveraged from the original tasks. The idea resembles checkpointing for fault tolerance and brings the granularity of fault tolerance to the spill level rather than the task level. The remaining execution time in any speculative strategy should be estimated well in order to select the speculative tasks correctly. Therefore, this method uses two checkpoint types, an input checkpoint and an output checkpoint. The speculative task fetches its data from the output checkpoint, reconstructs its memory state, and skips the data already processed up to the input checkpoint. The authors have also proposed a scheduler to select a speculative task. They calculate the original task's remaining time using the progress score, the progress rate, and the time the task has already taken.
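The selection step can be sketched as a comparison between the original task's projected remaining time and the cost of the checkpoint-resumed copy; treating that cost as checkpoint recovery time plus the processing time of still-unprocessed data is an assumption of this sketch:

```python
def original_remaining(progress, elapsed):
    """Remaining time from the progress score and progress rate."""
    rate = progress / elapsed
    return (1.0 - progress) / rate

def speculation_gain(progress, elapsed, recovery_time, unprocessed_time):
    """Positive gain means the checkpoint-resumed copy should finish
    sooner than letting the original task run on."""
    return original_remaining(progress, elapsed) - (recovery_time + unprocessed_time)

def pick_speculative(candidates, top_k=1):
    """candidates: (task id, progress, elapsed, recovery, unprocessed).
    Schedule the highest-gain tasks, ignoring non-positive gains."""
    scored = sorted(((speculation_gain(*c[1:]), c[0]) for c in candidates),
                    reverse=True)
    return [task for gain, task in scored[:top_k] if gain > 0]

chosen = pick_speculative([("t1", 0.2, 20, 5, 10), ("t2", 0.8, 20, 5, 10)])
```

The slow task t1 is worth backing up, while t2 is nearly done and would gain nothing from a speculative copy.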
To calculate the speculative task's completion time, the recovery time of the partial map output and the execution time of the unprocessed data are used. Based on these two calculated times, and by comparing their sum to the remaining time of the original task, the "speculation gain" is computed; the tasks with the highest gain are selected to be scheduled on the cluster. Table 9 shows an overview of the MapReduce performance-related papers.

3.2.6 Security studies

Fu et al. [83] have investigated data leakage attacks in two platform layers, i.e., the application and operating system layers. They have proposed a framework composed of an on-demand data collector and a data analyzer. The data collector gathers Hadoop logs, FS-image files, and monitor logs from every node, actively or on demand. The collected data are sent to the data analyzer, where they are analyzed with automatic methods to find the stolen data, identify the attacker, and reconstruct the crime scenario. Moreover, the authors have presented a four-dimensional algorithm, with Abnormal Directory, Abnormal User, Abnormal Operation, and Block Proportion dimensions, for detecting suspicious data leakage behaviors. Parmar et al. [84] have identified Hadoop security vulnerabilities and introduced "Kuber" to remove them. The proposed framework uses three levels of security: (1) secure user authentication; (2) encrypted data in transit; and (3) encrypted data at rest. In the proposed framework, the HDFS encryption zone security mechanism is removed entirely, and tasks can directly access data because encryption is applied to each individual data block; this technique eliminates the need to decrypt the complete file. Moreover, the authors use Salsa20 and its variant ChaCha20 rather than AES as the cipher suite because of their speed, safety, and easy implementation.
However, the authors have not tested their framework in a distributed environment to evaluate its performance and scalability. Gupta et al. [85] have presented a multilayer access control framework covering Hadoop ecosystem services, data, applications, and system resources to restrict unauthorized users. The authors enhanced the authorization capabilities of Hadoop by employing the Apache Ranger and Knox frameworks in services such as HDFS, Hive, and HBase. Moreover, they enforced YARN security policies using a Ranger plug-in to prohibit unauthorized users from submitting jobs to the cluster. However, the authors have not investigated fine-grained authorization between the Hadoop core daemons, including the NameNode, DataNodes, and ApplicationMaster. Wang et al. [86] have built a compromised Hadoop cluster in which an attack is launched, and they propose a protective block-based scheme to deal with it. The authors infected a node of the cluster so that it delays job execution, and the toxic node cannot be detected and decommissioned from the cluster. Therefore, the defense scheme monitors the nodes and blocks any node running jobs with more killed tasks, more slow containers, or more running tasks slower than the average task execution time. Such blocked nodes are recognized as attacker nodes. This study focused only on map task attacks; researchers could also consider reduce task attack scenarios to better simulate toxic real systems. There are many encrypted communications in Hadoop, which can lead to sensitive information leakage through the detection of communication patterns. Therefore, Ohrimenko et al. [87] have presented a framework in which secure implementation of jobs is considered and the data traffic between the map and reduce stages is analyzed.
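The block-based defense of [86] amounts to threshold checks over per-node statistics; the limits and the factor of two below are illustrative assumptions, not the paper's values:

```python
def blocked_nodes(node_stats, cluster_avg_task_time,
                  kill_limit=3, slow_limit=3):
    """Flag nodes that look compromised: too many killed tasks, too
    many slow containers, or running tasks far slower than average."""
    flagged = []
    for node, s in node_stats.items():
        if (s["killed_tasks"] > kill_limit
                or s["slow_containers"] > slow_limit
                or s["avg_running_task_time"] > 2 * cluster_avg_task_time):
            flagged.append(node)
    return flagged

stats = {
    "node1": {"killed_tasks": 0, "slow_containers": 1,
              "avg_running_task_time": 11.0},
    "node2": {"killed_tasks": 5, "slow_containers": 0,
              "avg_running_task_time": 10.0},
}
suspects = blocked_nodes(stats, cluster_avg_task_time=10.0)
```

Flagged nodes are then excluded from task placement rather than decommissioned outright, since the toxic node itself cannot be reliably identified.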
They implemented the Melbourne shuffle, a secure mechanism that deals with information leakage caused by adversaries at the system and application levels through interfering with or observing job execution. Ulusoy et al. [88] introduced a fine-grained framework called GuardMR, which enforces security mechanisms at the key-value level. The proposed framework generates dynamic authorized views of data sources using Object Constraint Language (OCL) specifications. Moreover, it guarantees security at the computation level using a modular reference monitor and provides a built-in access control model.

Table 9 An overview of existing primary studies focusing on "performance" (condensed to experimental platform and main idea; the published table also lists parameters, datasets/workloads, advantages, and disadvantages per study)

1. Tang et al. [65]; simulation and a Hadoop cluster (20 nodes) on Amazon EC2; two classes of algorithms for optimizing job ordering and map/reduce slot configuration, with an accurately designed estimator called "MR Estimator."
2. Verma et al. [66]; simulation and a Hadoop cluster (66 nodes); BalancedPools: an optimal two-stage map and reduce job scheduler that minimizes the makespan based on the Johnson algorithm.
3. Bei et al. [67]; Hadoop cluster (10 PMs, 10 VMs); RFHOC: a random forest approach that constructs two groups of performance models for the map and reduce stages.
4. Cheng et al. [68]; Hadoop cluster (9 nodes) and a virtualized Hadoop cluster on a multi-tenant private cloud; an ant-based self-adaptive task tuning approach that automatically searches the optimal configurations for individual tasks running on different nodes.
5. Yu et al. [69]; Hadoop cluster (26 nodes with the InfiniBand software stack OFED); Hadoop-A: an acceleration framework using plug-in components for fast data movement and for merging data without repetition and disk access.
6. Guo et al. [70]; simulation and a Hadoop cluster (6 PMs, 61 VMs); SRS- and IRS-based shuffling: pushing the aggregation computation into the network and parallelizing the shuffle and reduce phases.
7. Guo et al. [71]; Hadoop cluster (32 nodes); iShuffle: decoupling shuffle from reduce tasks and proactively pushing intermediate data to nodes via a novel shuffle-on-write operation.
8. Ke et al. [73]; simulation (5 simulated VMs) and a Hadoop cluster (20 nodes); reducing network traffic for a MapReduce job by designing a novel intermediate data partition scheme, called the "three-layer model."
9. Chen et al. [74]; Hadoop cluster (small: 30 VMs and 15 PMs; large: 101 VMs and 30 PMs); MCP: a maximum cost performance strategy with three phases (finding slow tasks, predicting their remaining time, and selecting which to back up based on cluster load) to improve the effectiveness of speculative execution.
10. Guo et al. [75]; Hadoop cluster (8 PMs and 32 VMs using VMware vSphere); FlexSlot: a user-transparent task slot management scheme that automatically identifies map stragglers and resizes their slots to accelerate task execution.
11. Xu et al. [76]; simulation (one PM and 11,000 simulated VMs); SCA: a cloning-based scheme maximizing overall system utility for a lightly loaded cluster, and ESE: a detection-based scheme mitigating the number of stragglers for a heavily loaded cluster.
12. Jiang et al. [77]; simulation (50 nodes); minimizing the makespan of off-line and online jobs using heuristic and approximation methods over servers with different speeds.
13. Veiga et al. [78]; Hadoop cluster (DAS-4) and a public cloud (Amazon EC2); Flame-MR: an event-driven architecture for improving Hadoop performance by avoiding memory copies and pipelining data movement.
14. Huang et al. [80]; Hadoop cluster (7 PMs and 3 VMs using VirtualBox); ERUL: two speculators for accurately estimating the task remaining time.
Low fault tolerance Scalable Unnecessary disk Portable access Compatible with MapReduce jobs Flexible in terms of same software interface as Hadoop Pipelining between map and reduce phases Reduction in memory and disk usage Reduction in job execution time Minimization overhead of thread creation/destruction Alleviating memory copy operations Failure in evenly Reduction in job reduce task input data completion time assumption Accurate estimation Higher throughput Makespan improvement Task execution time improvement Nos. References Experimental platform Parameters Dataset/workload Main Idea Advantages 15 Tian et al. [81] Hadoop cluster (4 PMs and 32 VMs) Task response time Makespan Resource utilization HScheduler: propos- Makespan improveDataset: Wikipedia ment ing a Johnson article traffic logs, model-based frame- Considering of multiComplex workload wave jobs work for minimizing (Facebook and the makespan of Yahoo!) off-line and online Job: WikiTrends jobs (Select, TextSearch, Aggregation, Join) TeraSort WordCount 16 Wang et al. [82] Hadoop cluster (8 PMs and 15 VMs using Xen) Job completion time The efficiency of speculative execution Network size of shuffling Scalability Waiting timeout of speculative tasks Block size Map selectivity Analytical workload Job: Micro benchmark (WordCount, Grep), Machine learning (K-means) PSE: starting specula- Reduction in operation costs tive tasks from the such as re-reading, checkpoint using re-copying, and partial speculative re-computing the execution approach processed data to reduce operations Reduction in job costs completion time Higher efficiency of speculative execution Scalable Applicable Disadvantages Not considering the preemption Not considering the energy efficiency Higher cost due to additional process such as job setting up, dispatching, and migration Not considering the load balancing MapReduce: an infrastructure review and research insights Table 9 (continued) 13 N. Maleki et al. 
The framework provides a secure environment and does not require hard-coded programming to specify policies and the functions assigned to jobs. Table 10 shows an overview of the Hadoop security-related papers.

3.2.7 Resource provisioning studies

Khan et al. [89] have presented a job performance model to provision resources for deadline-assigned multi-wave jobs. The model is constructed from historical job execution records, the allocated map and reduce slots, and the size of the input dataset. It estimates the job execution time using locally weighted linear regression and provisions the required amount of resources using the Lagrange multiplier technique. To avoid resource provisioning bias (over-provisioning or under-provisioning), the average of the best-case and worst-case execution times of a job is considered.

Nghiem et al. [90] have addressed the resource provisioning problem while considering the trade-off between energy consumption and performance. The authors determine the optimal number of tasks for a set of jobs using actual sampled runtime data from the cluster, with no need to rely on rules of thumb. The optimal number is obtained by balancing data locality against resource utilization, which is handled by tuning the split size for CPU-bound and I/O-bound jobs. The authors' approach depends on the accuracy of optimal resource provisioning per application on a particular system. This method can save energy significantly, up to several million dollars; however, users must establish a database for job profiling.

Application-centric SSD caching for Hadoop applications (AC-SSD), which reduces job completion time, has been proposed by Tang et al. [91]. This approach uses a genetic algorithm to calculate nearly optimal per-VM weights for allocating SSD cache space and for controlling I/O operations per second (IOPS) according to the importance of each VM.
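As a toy illustration of this kind of genetic search (our own simplification, not the actual AC-SSD model: the fitness function, names, and parameters below are assumptions), evolving per-VM cache shares might look as follows:

```python
import math
import random

def evolve_cache_weights(importance, cache_total, pop=30, gens=50, seed=1):
    """Toy genetic search for per-VM SSD cache shares.

    Fitness is a stand-in objective: importance-weighted log utility,
    so more important VMs attract more cache with diminishing returns.
    """
    rng = random.Random(seed)
    n = len(importance)

    def normalize(w):
        s = sum(w)
        return [x / s * cache_total for x in w]

    def fitness(w):
        return sum(imp * math.log(1 + share)
                   for imp, share in zip(importance, normalize(w)))

    population = [[rng.random() for _ in range(n)] for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop // 2]          # elitist selection
        children = []
        for _ in range(pop - len(parents)):
            a, b = rng.sample(parents, 2)
            child = [(x + y) / 2 for x, y in zip(a, b)]          # crossover
            i = rng.randrange(n)
            child[i] = max(1e-6, child[i] + rng.gauss(0, 0.1))   # mutation
            children.append(child)
        population = parents + children
    return normalize(max(population, key=fitness))

# Three VMs; the second is the most important and should get the largest share
weights = evolve_cache_weights([1.0, 3.0, 1.0], cache_total=100.0)
```

Under this toy log-utility objective, the analytic optimum is roughly 19.6/60.8/19.6, and the search should converge toward giving the second VM the bulk of the cache; the real AC-SSD objective additionally accounts for IOPS control and inter-VM relationships.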
Furthermore, AC-SSD proposes a closed-loop adaptation to cope with rapidly changing workloads. Considering the importance of VMs and the relationships among VMs inside an application improves performance. Table 11 shows an overview of the papers.

4 Results and discussion

After synthesizing the data, we answer our research questions RQ1 to RQ6 in this section.

Answer to Question RQ1 What topics have been considered most in the MapReduce field?

Of the 55 studies that provided MapReduce topics, the greatest number (N = 16) addressed the topic of performance. Two other subjects, namely scheduling with nine (16%) articles and load balancing with eight (15%) articles, are the next most investigated research topics. Of the remaining, 7 (13%) articles focused on energy efficiency, 6 (11%) on security, 6 (11%) on fault tolerance, and 3 (5%) on resource provisioning. Figure 7 shows the percentage frequency of each topic on the corresponding slice of the pie chart. Figure 8 shows the most frequent topics investigated by each publisher. IEEE has mostly considered the performance topic, i.e., eleven articles out of sixteen (69%). Elsevier has mostly investigated the fault tolerance topic, i.e., four articles out of six (67%). Springer has mostly considered the energy efficiency topic, i.e., four studies out of six (67%), and ACM has mostly considered the security topic, i.e., three studies out of six (50%).

Answer to Question RQ2 What are the main parameters investigated by the studies?

According to Fig. 9, of the 55 studies included in our research, 25% (N = 14) considered job completion time and makespan as main parameters, 24% (N = 13) considered the scalability and data locality parameters, and 22% (N = 12) considered the input data size parameter.
Job execution time, and the network in terms of network traffic overhead, network I/O (transmission cost), network delay, and network stability, are the next most investigated parameters, considered by 20% (N = 11) of the studies. 18% (N = 10) of the studies considered the number of map and reduce tasks, while 16% (N = 9) considered the size of the intermediate data produced by the map tasks. In 15% (N = 8) of the studies, the execution time of either map or reduce tasks has been considered, and SLA has been considered by 9% (N = 5) of the studies.

Answer to Question RQ3 What are the main artifacts produced by the research?

The four main artifacts produced by the studies on MapReduce are shown in Fig. 10: algorithms, frameworks, architectures, and topologies. When a paper presents a logical view, i.e., like a design pattern, we put it in the architecture category. When a paper implements an architecture, we put it in the framework category. The algorithm category consists of the papers that introduce a method, an algorithm, an approach, a schema, or a strategy to enhance MapReduce functionality; most schedulers belong to this category. Finally, a topology is proposed when the design of the shuffling network is considered. Using this classification, half of the papers contribute an algorithm to enhance MapReduce functionality, whereas topologies are proposed least often. The number of each artifact investigated by the publishers is shown in Fig. 11. Besides, we show the studies belonging to each artifact in Table 12. Categorizing the papers by software versus hardware solutions, about 93% (N = 51) of the studies address MapReduce challenges through software solutions, i.e., algorithms, whereas only 7% (N = 4) of the studies [37, 39, 40, 69] have employed hardware technologies as an improvement tool.
The reason is that, on the one hand, using new and high-speed hardware to face these challenges imposes more costs on the developer; on the other hand, researchers who wish to compare their work with these studies are forced to re-implement them or to spend heavily (if the hardware is accessible at all!) to reproduce the same conditions. Hence, the number of citations of these papers is affected.

Table 10 An overview of existing primary studies focusing on "Security"

1. Fu et al. [83]
Experimental platform: Hadoop cluster (16 nodes using VirtualBox)
Parameters: data leakage; reliability
Dataset/workload: (none reported)
Main idea: a framework for investigating data leakage attacks in a Hadoop cluster.
Advantages: detection of suspicious data-leakage behaviors; an efficient way to locate attacked nodes; finding the attacker and reliable evidence; reconstruction of the entire attack scenario.
Disadvantages: (none reported)

2. Parmar et al. [84]
Experimental platform: Hadoop cluster (single node)
Parameters: encryption cost; machine performance
Dataset/workload: synthetic (six various file sizes)
Main idea: Kuber, a three-dimensional security technique to remove the vulnerabilities of the Hadoop encryption service.
Advantages: cost-effective technique; high performance in encryption/decryption compared to the default Hadoop Encryption Zone; more flexible; manages encryption credentials securely on the client side.
Disadvantages: slow encryption speed; not integrated with the Hadoop KMS; high memory usage.

3. Gupta et al. [85]
Experimental platform: Hadoop cluster
Parameters: SLA
Dataset/workload: (none reported)
Main idea: a multilayer authorization framework for a representative Hadoop ecosystem deployment.
Advantages: meets users' SLA; two-layer data access checking; tag-based data access policy using the Atlas framework.
Disadvantages: does not consider the authorization level in Hadoop daemons.

4. Wang et al. [86]
Experimental platform: Hadoop cluster (9 nodes)
Parameters: cluster performance; makespan; task execution time
Dataset/workload: Dataset: Wikipedia, synthetic (using TeraGen); Jobs: TeraSort, WordCount, WordMean
Main idea: SEINA, a stealthy and effective internal attack in Hadoop systems.
Advantages: higher system performance in the presence of an attack; minor overhead; improvement in task execution time.
Disadvantages: does not consider attacks on reduce tasks.

5. Ohrimenko et al. [87]
Experimental platform: Hadoop cluster (8 nodes)
Parameters: system overhead; memory usage
Dataset/workload: Dataset: census data sample, New York taxi rides; Jobs: aggregate, aggregate filter
Main idea: observing and preventing leakage in MapReduce.
Advantages: lower I/O overhead; evaluation of the framework on a secure implementation of Hadoop (VC3); lower framework overhead due to pre-grouping values with the same key; implementations in Java and C++.
Disadvantages: (none reported)

6. Ulusoy et al. [88]
Experimental platform: Hadoop cluster (7 nodes)
Parameters: scalability; cluster performance
Dataset/workload: Dataset: Twitter, Google images
Main idea: GuardMR, fine-grained security policy enforcement for MapReduce systems.
Advantages: high efficiency; small overhead; scalable; high modularity and flexibility; user-transparent framework; practical policy specification.
Disadvantages: lower performance due to performing reflection operations in the reference monitor.

Table 11 An overview of existing primary studies focusing on "Resource provisioning"

1. Khan et al. [89]
Experimental platform: Hadoop cluster (8 VM nodes); Amazon EC2 cloud (20 instances)
Parameters: map, shuffle, and reduce phase execution time; job execution time; number of reduce tasks; input data size
Dataset/workload: Dataset: TeraGen; Jobs: CPU-intensive (WordCount), I/O-intensive (Sort)
Main idea: estimating job completion time with a Hadoop job performance model based on the locally weighted linear regression technique, and provisioning resources for deadline-assigned jobs using the Lagrange multiplier technique.
Advantages: accuracy of the proposed model in job execution estimation; the proposed HP model is economical in terms of resource provisioning; reduction in job execution time.
Disadvantages: the model over-provisions when there are more virtual machines; the method does not consider jobs with logical dependencies.

2. Nghiem et al. [90]
Experimental platform: Hadoop cluster (24 nodes)
Parameters: energy efficiency; task execution time; job execution time; CPU time; number of reduce tasks
Dataset/workload: Dataset: TeraGen; Job: TeraSort
Main idea: a resource provisioning algorithm with a mathematical formula for obtaining the exact optimal number of task resources for any workload.
Advantages: accurate and optimal number of reduce tasks; improvement in energy consumption; improvement in task execution time; usable in other MapReduce implementation frameworks; reduction in job execution time.
Disadvantages: does not consider the heterogeneity of the environment; does not consider scalability.

3. Tang et al. [91]
Experimental platform: Hadoop cluster (4 PMs forming 3 clusters, 20 VMs)
Parameters: CPU time; job completion time; cache size; network I/O
Dataset/workload: micro benchmarks, TestDFSIO benchmark, WordCount, TeraSort, Sort, Aggregation, Join, Scan, Bayes, PageRank, K-means
Main idea: application-centric SSD cache allocation for Hadoop applications.
Advantages: considers application-centric instead of VM-centric SSD caching schemes; shortest job completion time; higher performance.
Disadvantages: a simple and not accurate solution to detect workload changes; performance degradation during provisioning.

Fig. 7 Research topics ranked by the percentage of publications (performance 29%, job/task scheduling 16%, load balancing 15%, energy efficiency 13%, security 11%, fault tolerance 11%, resource provisioning 5%)

Fig. 8 Percentage of investigated topics per publisher

Answer to Question RQ4 What experimental platforms have been used by the researchers for analysis and evaluation?

We classified the experimental platforms into three categories: simulation, implementation using cloud services, and implementation on a test bed. Based on these categories, 71% (N = 39) of the studies evaluated their results using implementation: the cloud in 7% (N = 4), a test bed in 13% (N = 7), and an in-house Hadoop cluster in 51% (N = 28) [20, 42, 43, 47, 48, 50–52, 54, 56, 57, 60, 62, 67–69, 71, 74, 75, 78, 80–91]. Within the test-bed category, Grid'5000 is used in four studies [37, 41, 46, 61], the SAVI test bed in two studies [58, 63], and MobiWay in one study [55].
9% (N = 5) of the studies [39, 45, 64, 76, 77] used only simulation to evaluate their results; one study [39] used CloudSim, and the rest used the stock simulator. 20% (N = 11) of the studies [15, 36, 38, 40, 44, 49, 59, 65, 66, 70, 73] used both simulation and implementation as the experimental platform; in terms of implementation, studies [15, 44] were implemented in the cloud and the rest on in-house Hadoop clusters. In terms of simulation, studies [15, 44, 59] used their own extended simulators (TDEMR, CloudSimMR, and TopoSim), and the others used the stock simulator. The virtualization tools used in the studies include Xen, VMware, KVM, and VirtualBox. The statistics are shown in Fig. 12.

Fig. 9 Investigated percentage of each parameter (job completion time 25%, makespan 25%, scalability 24%, data locality 24%, input data size 22%, job execution time 20%, network traffic 20%, number of map/reduce tasks 18%, size of intermediate data 16%, map/reduce task execution time 15%, SLA 9%)

Answer to Question RQ5 What kinds of jobs, benchmarks, and datasets have been used in the experiments? And what percentage of each has been used in the studies?

To answer this question, we provide the job name and its functionality, the job shuffle degree in terms of heavy or light shuffling, the dataset, and the benchmarks in Table 13. According to Table 13, jobs are categorized as shuffle-light or shuffle-heavy in terms of the intermediate data produced by the map tasks.

Fig. 10 Four main artifacts of studies (algorithm: 27, framework: 12, architecture: 12, topology: 4)

Of the total number of publications included in this study, six benchmarks have been used (PUMA, HiBench, MicroBench, MRBench, TestDFSIO, and built-in YARN), included in 42% (N = 23) of the studies.
Among them, PUMA is used most frequently, by 44% (N = 10); HiBench is the second most used benchmark, by 26% (N = 6); MicroBench and MRBench are used by 13% (N = 3) and 9% (N = 2) of the studies, respectively; and TestDFSIO and built-in YARN are each used in only 4% (N = 1) of the studies. The remaining studies, 58% (N = 32), used different combinations of the common jobs of Table 13. Figure 13 shows these statistics. Of the 55 articles about the MapReduce framework presented in this study, 51 papers used the jobs shown in Table 13, whereas four studies [47, 64, 83, 85] provide no information about the dataset or jobs used. Figure 14 shows the percentage (popularity) of each job across the 55 articles.

Answer to Question RQ6 What are the open challenges and future directions in Hadoop MapReduce?

• Open challenges

To answer this question, some of the challenges presented in the reviewed papers have been considered. The remaining challenging problems in MapReduce can be summarized as follows:

• Hadoop MapReduce has been widely studied with the aim of improving performance. Some researchers try to improve performance by studying workflow dependencies and achieving data locality. Separating the phases into independent jobs brings better performance; however, most jobs have dependencies among them, so how to establish their independence is still a challenging problem.
• By decoupling the phases to accelerate computation, there is a dilemma between speed and scalability. The MapReduce model is designed for scalability, so how to maintain scalability in a decoupled design is another issue.

Fig. 11 Number of each artifact investigated by the publishers (ACM: 1 algorithm, 1 architecture, 3 frameworks, 0 topologies; IEEE: 8, 7, 4, 3; Elsevier: 9, 2, 4, 0; Springer: 8, 3, 1, 1)

Table 12 Classification of studies based on the artifacts
Algorithm: BeTL [20], Xu et al. [48], MRCP-RM [49], Tang et al. [65], Balanced Pools [66], Chen et al. [74], Xu et al. [76], Khan et al. [89], Ibrahim et al. [37], FARMS [43], Memishi et al. [45], Chronos [46], SARS [51], MDRP [62], Jiang et al. [77], ERUL [80], Song et al. [38], Jia-Chun Lin et al. [47], Bok et al. [52], Hashem et al. [54], Momth [55], Tang et al. [56], Wang et al. [86], HScheduler [81], Yaoguang Wang et al. [82], Arjona et al. [42], LIBRA [57]
Framework: RFHOC [67], Cheng et al. [68], iShuffle [71], FlexSlot [75], Fu [83], OPTIMA [63], Tang et al. [91], Phan et al. [41], Teng et al. [40], Mashayekhy et al. [36], DREAMS [58], Hadoop-A [69]
Architecture: ANRDF [44], Kao et al. [15], FP-Hadoop [61], MBR [60], Ohrimenko et al. [87], Ulusoy et al. [88], Cai et al. [39], Guo et al. [70], Guo et al. [73], Nghiem et al. [90], HPSO [50], Flame-MR [78]
Topology: Paik [59], Zhang et al. [64], Parmar et al. [84], Gupta et al. [85]

• Many production jobs are executed in Hadoop clusters using the MapReduce programming model. Therefore, the makespan of these jobs is an important issue and should be considered an effective performance metric. The order in which jobs are executed has a significant impact on makespan.
• Systematically exploring the Hadoop parameter space and finding a near-optimal configuration is a challenge. New intelligent algorithms and techniques, based on cluster and workload properties, are required to suggest appropriate parameter settings.
• Network overhead is another serious problem that prolongs job execution. To overcome this issue, new algorithms and techniques are required to improve and accelerate the shuffle phase of MapReduce.
• Straggler tasks, which are caused by internal and external problems such as resource competition, hardware heterogeneity, hardware failure, and data skew, should be considered among the other performance metrics.
How to select straggler tasks and how to choose the proper node to host speculative copies are notable challenges in speculative strategies. Moreover, energy consumption models are required to prevent wasting energy on killed speculative copies.
• There are many kinds of MapReduce jobs, such as production, interactive, and deadline-assigned jobs. On the one hand, we should be able to provision resources at run time to meet job requirements; on the other hand, this provisioning should not cause "bias", which influences energy efficiency and performance.
• Enterprises and IoT providers use a Hadoop data lake to store and process data generated by IoT devices. In this situation, security and privacy requirements are critical challenges for prominent technology firms and states. Protective schemes in terms of authentication, authorization, and data confidentiality are imperative to secure a Hadoop system in the presence of attacks. To prevent and confront attacks, Hortonworks [92] has divided Hadoop security vulnerabilities into three parts: (1) systemic; (2) operational; and (3) architectural. By researching and presenting new solutions in each domain, we can overcome Hadoop security problems. Table 14 shows an overview of these security challenges.

Table 14 Three-dimensional security of Hadoop cluster [92]
Systemic: data access and ownership; data-at-rest and data-in-motion protection; multi-tenancy; inter-node communication; client interaction; distributed nodes
Operational: authentication and authorization; administrative data access; configuration and patch management; authentication of applications and nodes; audit and logging; monitoring, filtering, and blocking; API security
Architectural: walled garden; cluster security; data-centric security; enterprise security; embedded security

Fig. 12 Percentage of environments which have been used in the studies (implementation: 71%, N = 39, of which cloud 7%, N = 4, test bed 13%, N = 7, and in-house Hadoop 51%, N = 28; simulation: 9%, N = 5; simulation and implementation: 20%, N = 11)

Table 13 Benchmarks, datasets, job names and functionality, and shuffling degree
Benchmarks: PUMA, HiBench, MicroBench, Built-in YARN, WikiTrends, BigDataBench, ApplicationBench, MRBench, TestDFSIO
Datasets: TeraGen, Wikipedia article traffic logs, complex (event logs of Facebook, Yahoo!), SwimGen, RandomTextWriter, DatasetGen, synthetic, Netflix, WikiMedia, cloud ships and land station traces, BRITEGen, LiveJournal graph data, Google web graph, New York taxi rides, census data samples, Twitter, Google images
Jobs (functionality; shuffling degree):
WordCount: counts the occurrences of each distinct word in a text file (heavy)
PageRank: used for Google search results; ranks Web sites by counting the number and the quality of the links that refer to them (heavy)
Grep: counts the occurrences of strings matching a target in a text file (light)
Connected components: mines a graph to determine its sub-networks (heavy)
Inverted-index: a database index storing a mapping from content, such as words or numbers, to its locations in a table, a document, or a set of documents (heavy)
Term-vector: returns information and statistics about phrases in the context of a particular document (heavy)
Histogram ratings, Histogram movies: return the number of votes about movies registered by users (light)
K-means: a clustering algorithm that divides n observations into k clusters; each observation belongs to the cluster with the closest mean, selected as the prototype (heavy)
TeraSort: sorts a text file with a volume of a terabyte (heavy)
Bayes: predicts class membership probabilities, such as the probability that a given tuple belongs to a particular class (heavy)
SelfJoin: joins a table with itself; each row in the table is combined with itself and with other rows of the table (heavy)

Fig. 13 Percentage of the most commonly used benchmarks in the articles (benchmarks in 42% of the studies vs. other job combinations in 58%; PUMA 44%, HiBench 26%, MicroBench 13%, MRBench 9%, TestDFSIO 4%, built-in YARN 4%)

Fig. 14 Percentage that each job has appeared in the articles (WordCount 56%, Sort 33%, TeraSort 29%, Grep 22%, PageRank 15%, K-means 15%, Join 11%, Inverted-index 11%, Bayes 10%, GridMix 7%, Pi 6%, Scan 4%, SecondarySearch 4%, Histogram 4%, Top-k 2%, Aggregate 2%, Matrix multiplication 2%, WordMean 2%, Connected components 2%, Term-vector 2%)

• Future directions

Although much progress has been made, there are still several open issues in MapReduce at the infrastructure level. Therefore, after studying the related papers on MapReduce, we discuss some issues that can be studied and analyzed further. We enumerate some promising future directions in Hadoop MapReduce as follows:

• General platform: by integrating MapReduce and Spark, we can benefit from a general platform in which batch, streaming, iterative, and in-memory applications can be executed simultaneously in a Hadoop cluster. We can employ a dedicated pool for each application type or group of users and reach better performance and power saving.
• Artificial intelligence approaches: we can build accurate and robust performance prediction models based on historical data from each Hadoop phase and feed the output of these models to algorithms such as genetic algorithms, smart hill climbing, and machine learning.
Using qualified search in the Hadoop configuration space, these methods can find optimal or near-optimal configurations with high probability, sparing developers from manually tuning Hadoop configuration parameters.
• Combination techniques: hardware approaches such as dynamic voltage and frequency scaling, SSD-based in-storage computing, and remote-access data controllers, along with pipelining of the map, sort, shuffle, and reduce phases, can improve the power consumption of a MapReduce cluster.
• Software-based approaches: we can employ algorithms in which the placement of the data produced by mappers is defined in advance, so that the partition belonging to a specific reducer becomes available gradually as the map phase completes. In this way, the heavy shuffling of the shuffle phase is divided into light shuffling, accelerating job execution.
• MapReduce model: by defining an appropriate execution model based on system heterogeneity, such as application type, data type and format, server characteristics, topology and communication medium type, and workload prediction, we can reach higher performance.
• Cloudy MapReduce: since the MapReduce programming model accelerates Big data processing, deploying MapReduce in IaaS clouds can maximize the performance of cloud infrastructure services. Furthermore, we can offer MapReduce as a service to cloud users for running their MapReduce applications in the cloud; besides, we can benefit from fine-grained cluster security using cloud-based MapReduce.
• Cluster topology: shuffling is a network-consuming stage in geo-distributed MapReduce-based datacenters. The default network topology of Hadoop is flat, i.e., a "tree" [14, 59], which does not support scalability and causes higher data computation and communication costs. Although there are two masters (one as a backup) in a Hadoop cluster, how many nodes can be deployed in a sub-cluster and how the masters of the sub-clusters should communicate with each other are still open issues.
• Secure MapReduce: to secure a Hadoop cluster, robust and efficient algorithms are required in four aspects of security: authentication, authorization, auditing, and data access. To prevent and confront attacks, solutions such as new user authentication protocols (e.g., Kerberos [93]), robust encryption algorithms for data communication between Hadoop daemons, and powerful data-at-rest access control mechanisms can be employed. Further, we can design visualization tools and intelligent predictive models that inform the system administrator of data spillage and destructive attacks using attack-pattern detection and provenance logs.
• Cost-effective MapReduce: the mentioned challenges impose costs in terms of energy consumption. To alleviate these costs, we can focus on solutions that mitigate job execution time. Load skew handling, including online solutions (quickly aggregating intermediate data and then estimating the reduce-task workload), customized partitioners, multi-level partitioners, optimal schedulers such as run-time map task split binding or run-time reduce task partition binding, powerful speculative mechanisms, and efficient data replication algorithms reduce job execution time and, subsequently, the required energy. In this way and with this outlook, we reach "Green MapReduce", since carbon emissions are controlled.

Figure 15 shows a summary of challenges and opportunities.

Fig. 15 Challenges and opportunities in the MapReduce area (challenges: performance, elasticity, energy efficiency, fault tolerance, security, scalability, load balancing; opportunities: cloudy MapReduce, general platform, intelligent techniques, hardware-software techniques, secure MapReduce)
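As a toy illustration of the customized-partitioner idea mentioned above (our own sketch, not a scheme from any surveyed paper; the sampling step and the greedy placement are assumptions), a skew-aware partitioner could pin heavy keys to the least-loaded reducer and fall back to hashing for unseen keys:

```python
import hashlib
from collections import Counter

def build_skew_aware_partitioner(sampled_keys, num_reducers):
    """Build a partition function from a sample of the key stream.

    Heavy keys (by sampled frequency) are placed greedily on the
    reducer with the smallest load so far; keys never seen in the
    sample fall back to plain hash partitioning.
    """
    loads = [0] * num_reducers
    table = {}
    for key, cnt in Counter(sampled_keys).most_common():
        r = loads.index(min(loads))   # least-loaded reducer so far
        table[key] = r
        loads[r] += cnt

    def partition(key):
        if key in table:
            return table[key]
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % num_reducers

    return partition

# Skewed sample: the key "the" dominates the intermediate data
sample = ["the"] * 6 + ["a"] * 3 + ["map"] * 3 + ["reduce"] * 2
part = build_skew_aware_partitioner(sample, num_reducers=2)
```

Here "the" and "reduce" share one reducer (sampled load 8) while "a" and "map" share the other (load 6), instead of whatever split hashing happens to produce; a production design would additionally handle sampling cost and re-partitioning of running jobs.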
5 Conclusions and limitations

In this paper, we have conducted a systematic, holistic study of Hadoop MapReduce. First, we gave an architectural overview of Hadoop's main components. After describing our research methodology, we classified the MapReduce studies into seven areas: (1) performance; (2) job/task scheduling; (3) load balancing; (4) resource provisioning; (5) fault tolerance in terms of availability and reliability; (6) security; and (7) energy efficiency. Afterward, we extracted the main idea of each study, discussed strengths and drawbacks, and provided our observations by answering the research questions. The chronology of studies reflects researchers' attention to the challenges of MapReduce as a Big data processing platform. The majority of studies (16 out of 55 articles) have focused on performance as the most significant topic in MapReduce, while scheduling, load balancing, energy efficiency, security, fault tolerance, and resource provisioning are the next most considered topics, respectively. We outlined future directions and presented several potential solutions for researchers interested in the MapReduce area.

We studied the major investigated challenges of the MapReduce framework as well as the best proposed solutions, aiming to provide a comprehensive systematic study. Nevertheless, the study has some limitations, which we plan to address in future work. Searching only digital libraries using search-string keywords is just one of the many channels for finding the research activity stream on a widely studied topic like MapReduce. Two complementary search approaches for future study are: (1) using other sources such as Ph.D. theses, academic blogs, editorial notes, and technical reports; and (2) relaxing some of the strict exclusion criteria, e.g., considering interdisciplinary articles, national journals and conferences, and non-English articles, which might reveal other worthy solutions.

References

1.
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
2. Hashem IAT, Anuar NB, Gani A, Yaqoob I, Xia F, Khan SU (2016) MapReduce: review and open challenges. Scientometrics 109(1):389–422
3. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Zhang N, Antony S, Liu H, Murthy R (2010) Hive—a petabyte scale data warehouse using Hadoop. In: 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010)
4. Polato I, Ré R, Goldman A, Kon F (2014) A comprehensive view of Hadoop research—a systematic literature review. J Netw Comput Appl 46:1–25
5. Hu H, Wen Y, Chua T-S, Li X (2014) Toward scalable systems for big data analytics: a technology tutorial. IEEE Access 2:652–687
6. Chen CP, Zhang C-Y (2014) Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf Sci 275:314–347
7. Chen M, Mao S, Liu Y (2014) Big data: a survey. Mob Netw Appl 19(2):171–209
8. http://spark.apache.org/
9. http://datampi.org/
10. Soualhia M, Khomh F, Tahar S (2017) Task scheduling in big data platforms: a systematic literature review. J Syst Softw 134:170–189
11. Zhang B, Wang X, Zheng Z (2018) The optimization for recurring queries in big data analysis system with MapReduce. Future Gener Comput Syst 87:549–556
12. http://hadoop.apache.org/
13. Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)
14. White T (2009) Hadoop: the definitive guide. O'Reilly Media Inc, Sebastopol
15. Kao Y-C, Chen Y-S (2016) Data-locality-aware MapReduce real-time scheduling framework. J Syst Softw 112:65–77
16. Wang F, Qiu J, Yang J, Dong B, Li X, Li Y (2009) Hadoop high availability through metadata replication. In: Proceedings of the First International Workshop on Cloud Data Management. ACM, Hong Kong, pp 37–44
17.
Li F, Ooi BC, Tamer Ozsu M, Wu S (2014) Distributed data management using MapReduce. ACM Comput Surv 46(3):1–42
18. Singh R, Kaur PJ (2016) Analyzing performance of Apache Tez and MapReduce with Hadoop multinode cluster on Amazon cloud. J Big Data 3(1):19
19. https://www.bogotobogo.com/Hadoop/BigData_hadoop_Ecosystem.php
20. Wang H, Chen H, Du Z, Hu F (2016) BeTL: MapReduce checkpoint tactics beneath the task level. IEEE Trans Serv Comput 9(1):84–95
21. Alapati SR (2016) Expert Hadoop administration: managing, tuning, and securing Spark, YARN, and HDFS. Addison-Wesley Professional, Boston
22. Gupta M, Patwa F, Sandhu R (2017) Object-tagged RBAC model for the Hadoop ecosystem. In: IFIP Annual Conference on Data and Applications Security and Privacy. Springer
23. Erraissi A, Belangour A, Tragha A (2017) A big data Hadoop building blocks comparative study. Int J Comput Trends Technol 48(1):36–40
24. Petersen K, Vakkalanka S, Kuzniarz L (2015) Guidelines for conducting systematic mapping studies in software engineering: an update. Inf Softw Technol 64:1–18
25. Cruz-Benito J (2016) Systematic literature review & mapping. https://doi.org/10.5281/zenodo.165773
26. Lu Q, Zhu L, Zhang H, Wu D, Li Z, Xu X (2015) MapReduce job optimization: a mapping study. In: 2015 International Conference on Cloud Computing and Big Data (CCBD)
27. Charband Y, Navimipour NJ (2016) Online knowledge sharing mechanisms: a systematic review of the state of the art literature and recommendations for future research. Inf Syst Front 18(6):1131–1151
28. Poggi N, Carrera D, Call A, Mendoza S, Becerra Y, Torres J, Ayguadé E, Gagliardi F, Labarta J, Reinauer R, Vujic N, Green D, Blakeley J (2014) ALOJA: a systematic study of Hadoop deployment variables to enable automated characterization of cost-effectiveness. In: 2014 IEEE International Conference on Big Data (Big Data)
29.
Sharma M, Hasteer N, Tuli A, Bansal A (2014) Investigating the inclinations of research and practices in Hadoop: a systematic review. In: 2014 5th International Conference—Confluence the Next Generation Information Technology Summit (Confluence)
30. Thakur S, Ramzan M (2016) A systematic review on cardiovascular diseases using big-data by Hadoop. In: 2016 6th International Conference—Cloud System and Big Data Engineering (Confluence)
31. Lu J, Feng J (2014) A survey of MapReduce based parallel processing technologies. China Commun 11(14):146–155
32. Derbeko P, Dolev S, Gudes E, Sharma S (2016) Security and privacy aspects in MapReduce on clouds: a survey. Comput Sci Rev 20:1–28
33. Li R, Hu H, Li H, Wu Y, Yang J (2016) MapReduce parallel programming model: a state-of-the-art survey. Int J Parallel Prog 44(4):832–866
34. Iyer GN, Silas S (2015) A comprehensive survey on data-intensive computing and MapReduce paradigm in cloud computing environments. In: Rajsingh EB, Bhojan A, Peter JD (eds) Informatics and communication technologies for societal development: proceedings of ICICTS 2014. Springer India, New Delhi, pp 85–93
35. Liu Q, Jin D, Liu X, Linge N (2016) A survey of speculative execution strategy in MapReduce. In: Sun X, Liu A, Chao H-C, Bertino E (eds) Cloud Computing and Security: Second International Conference, ICCCS 2016, Nanjing, China, July 29–31, 2016, Revised Selected Papers, Part I. Springer, Cham, pp 296–307
36. Mashayekhy L, Nejad MM, Grosu D, Zhang Q, Shi W (2015) Energy-aware scheduling of MapReduce jobs for big data applications. IEEE Trans Parallel Distrib Syst 26(10):2720–2733
37. Ibrahim S, Phan T-D, Carpen-Amarie A, Chihoub H-E, Moise D, Antoniu G (2016) Governing energy consumption in Hadoop through CPU frequency scaling: an analysis. Future Gener Comput Syst 54:219–232
38.
Song J, He H, Wang Z, Yu G, Pierson J-M (2016) Modulo based data placement algorithm for energy consumption optimization of MapReduce system. J Grid Comput 1:1–16
39. Cai X, Li F, Li P, Ju L, Jia Z (2017) SLA-aware energy-efficient scheduling scheme for Hadoop YARN. J Supercomput 73(8):3526–3546
40. Teng F, Yu L, Li T, Deng D, Magoulès F (2017) Energy efficiency of VM consolidation in IaaS clouds. J Supercomput 73(2):782–809
41. Phan T-D, Ibrahim S, Zhou AC, Aupy G, Antoniu G (2017) Energy-driven straggler mitigation in MapReduce. In: European Conference on Parallel Processing. Springer
42. Arjona Aroca J, Chatzipapas A, Fernández Anta A, Mancuso V (2014) A measurement-based analysis of the energy consumption of data center servers. In: Proceedings of the 5th International Conference on Future Energy Systems. ACM
43. Fu H, Chen H, Zhu Y, Yu W (2017) FARMS: efficient MapReduce speculation for failure recovery in short jobs. Parallel Comput 61:68–82
44. Tang B, Tang M, Fedak G, He H (2017) Availability/network-aware MapReduce over the internet. Inf Sci 379:94–111
45. Memishi B, Pérez MS, Antoniu G (2017) Failure detector abstractions for MapReduce-based systems. Inf Sci 379:112–127
46. Yildiz O, Ibrahim S, Antoniu G (2017) Enabling fast failure recovery in shared Hadoop clusters: towards failure-aware scheduling. Future Gener Comput Syst 74:208–219
47. Lin J-C, Leu F-Y, Chen Y-P (2015) Analyzing job completion reliability and job energy consumption for a heterogeneous MapReduce cluster under different intermediate-data replication policies. J Supercomput 71(5):1657–1677
48. Xu X, Cao L, Wang X (2016) Adaptive task scheduling strategy based on dynamic workload adjustment for heterogeneous Hadoop clusters. IEEE Syst J 10(2):471–482
49. Lim N, Majumdar S, Ashwood-Smith P (2017) MRCP-RM: a technique for resource allocation and scheduling of MapReduce jobs with deadlines. IEEE Trans Parallel Distrib Syst 28(5):1375–1389
50.
Sun M, Zhuang H, Li C, Lu K, Zhou X (2016) Scheduling algorithm based on prefetching in MapReduce clusters. Appl Soft Comput 38:1109–1118
51. Tang Z, Jiang L, Zhou J, Li K, Li K (2015) A self-adaptive scheduling algorithm for reduce start time. Future Gener Comput Syst 43:51–60
52. Bok K, Hwang J, Lim J, Kim Y, Yoo J (2016) An efficient MapReduce scheduling scheme for processing large multimedia data. Multimed Tools Appl 76(16):1–24
53. Zaharia M, Borthakur D, Sarma JS, Elmeleegy K, Shenker S, Stoica I (2010) Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In: Proceedings of the 5th European Conference on Computer Systems. ACM, Paris, pp 265–278
54. Hashem IAT, Anuar NB, Marjani M, Gani A, Sangaiah AK, Sakariyah AK (2017) Multi-objective scheduling of MapReduce jobs in big data processing. Multimed Tools Appl 77(8):1–16
55. Nita M-C, Pop F, Voicu C, Dobre C, Xhafa F (2015) MOMTH: multi-objective scheduling algorithm of many tasks in Hadoop. Cluster Comput 18(3):1011–1024
56. Tang Z, Liu M, Ammar A, Li K, Li K (2016) An optimized MapReduce workflow scheduling algorithm for heterogeneous computing. J Supercomput 72(6):2059–2079
57. Chen Q, Yao J, Xiao Z (2015) LIBRA: lightweight data skew mitigation in MapReduce. IEEE Trans Parallel Distrib Syst 26(9):2520–2533
58. Liu Z, Zhang Q, Ahmed R, Boutaba R, Liu Y, Gong Z (2016) Dynamic resource allocation for MapReduce with partitioning skew. IEEE Trans Comput 65(11):3304–3317
59. Chen W, Paik I, Li Z (2016) Topology-aware optimal data placement algorithm for network traffic optimization. IEEE Trans Comput 65(8):2603–2617
60. Li J, Liu Y, Pan J, Zhang P, Chen W, Wang L (2017) Map-balance-reduce: an improved parallel programming model for load balancing of MapReduce. Future Gener Comput Syst
61. Liroz-Gistau M, Akbarinia R, Agrawal D, Valduriez P (2016) FP-Hadoop: efficient processing of skewed MapReduce jobs. Inf Syst 60:69–84
62.
Myung J, Shim J, Yeon J, Lee S-G (2016) Handling data skew in join algorithms using MapReduce. Expert Syst Appl 51:286–299
63. Liu Z, Zhang Q, Boutaba R, Liu Y, Wang B (2016) OPTIMA: on-line partitioning skew mitigation for MapReduce with resource adjustment. J Netw Syst Manag 24(4):859–883
64. Zhang X, Jiang J, Zhang X, Wang X (2015) A data transmission algorithm for distributed computing system based on maximum flow. Cluster Comput 18(3):1157–1169
65. Tang S, Lee BS, He B (2016) Dynamic job ordering and slot configurations for MapReduce workloads. IEEE Trans Serv Comput 9(1):4–17
66. Verma A, Cherkasova L, Campbell RH (2013) Orchestrating an ensemble of MapReduce jobs for minimizing their makespan. IEEE Trans Dependable Secure Comput 10(5):314–327
67. Bei Z, Yu Z, Zhang H, Xiong W, Xu C, Eeckhout L, Feng S (2016) RFHOC: a random-forest approach to auto-tuning Hadoop's configuration. IEEE Trans Parallel Distrib Syst 27(5):1470–1483
68. Cheng D, Rao J, Guo Y, Jiang C, Zhou X (2017) Improving performance of heterogeneous MapReduce clusters with adaptive task tuning. IEEE Trans Parallel Distrib Syst 28(3):774–786
69. Yu W, Wang Y, Que X (2014) Design and evaluation of network-levitated merge for Hadoop acceleration. IEEE Trans Parallel Distrib Syst 25(3):602–611
70. Guo D, Xie J, Zhou X, Zhu X, Wei W, Luo X (2015) Exploiting efficient and scalable shuffle transfers in future data center networks. IEEE Trans Parallel Distrib Syst 26(4):997–1009
71. Guo Y, Rao J, Cheng D, Zhou X (2017) iShuffle: improving Hadoop performance with shuffle-on-write. IEEE Trans Parallel Distrib Syst 28(6):1649–1662
72. Maleki N, Rahmani AM, Conti M (2018) POSTER: an intelligent framework to parallelize Hadoop phases. In: Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing. ACM
73. Ke H, Li P, Guo S, Guo M (2016) On traffic-aware partition and aggregation in MapReduce for big data applications.
IEEE Trans Parallel Distrib Syst 27(3):818–828
74. Chen Q, Liu C, Xiao Z (2014) Improving MapReduce performance using smart speculative execution strategy. IEEE Trans Comput 63(4):954–967
75. Guo Y, Rao J, Jiang C, Zhou X (2017) Moving Hadoop into the cloud with flexible slot management and speculative execution. IEEE Trans Parallel Distrib Syst 28(3):798–812
76. Xu H, Lau WC (2017) Optimization for speculative execution in big data processing clusters. IEEE Trans Parallel Distrib Syst 28(2):530–545
77. Jiang Y, Zhu Y, Wu W, Li D (2017) Makespan minimization for MapReduce systems with different servers. Future Gener Comput Syst 67:13–21
78. Veiga J, Expósito RR, Taboada GL, Tourino J (2016) Flame-MR: an event-driven architecture for MapReduce applications. Future Gener Comput Syst 65:46–56
79. Zaharia M, Konwinski A, Joseph AD, Katz R, Stoica I (2008) Improving MapReduce performance in heterogeneous environments. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation. USENIX Association, San Diego, pp 29–42
80. Huang X, Zhang L, Li R, Wan L, Li K (2016) Novel heuristic speculative execution strategies in heterogeneous distributed environments. Comput Electr Eng 50:166–179
81. Tian W, Li G, Yang W, Buyya R (2016) HScheduler: an optimal approach to minimize the makespan of multiple MapReduce jobs. J Supercomput 72(6):2376–2393
82. Wang Y, Lu W, Lou R, Wei B (2015) Improving MapReduce performance with partial speculative execution. J Grid Comput 13(4):587–604
83. Fu X, Gao Y, Luo B, Du X, Guizani M (2017) Security threats to Hadoop: data leakage attacks and investigation. IEEE Netw 31(2):67–71
84. Parmar RR, Roy S, Bhattacharyya D, Bandyopadhyay SK, Kim T (2017) Large-scale encryption in the Hadoop environment: challenges and solutions. IEEE Access 5:7156–7163
85.
Gupta M, Patwa F, Benson J, Sandhu R (2017) Multi-layer authorization framework for a representative Hadoop ecosystem deployment. In: Proceedings of the 22nd ACM Symposium on Access Control Models and Technologies. ACM
86. Wang J, Wang T, Yang Z, Mao Y, Mi N, Sheng B (2017) Seina: a stealthy and effective internal attack in Hadoop systems. In: 2017 International Conference on Computing, Networking and Communications (ICNC). IEEE
87. Ohrimenko O, Costa M, Fournet C, Gkantsidis C, Kohlweiss M, Sharma D (2015) Observing and preventing leakage in MapReduce. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, Denver, pp 1570–1581
88. Ulusoy H, Colombo P, Ferrari E, Kantarcioglu M, Pattuk E (2015) GuardMR: fine-grained security policy enforcement for MapReduce systems. In: Proceedings of the 10th ACM Symposium on Information, Computer and Communications Security. ACM, Singapore, pp 285–296
89. Khan M, Jin Y, Li M, Xiang Y, Jiang C (2016) Hadoop performance modeling for job estimation and resource provisioning. IEEE Trans Parallel Distrib Syst 27(2):441–454
90. Nghiem PP, Figueira SM (2016) Towards efficient resource provisioning in MapReduce. J Parallel Distrib Comput 95:29–41
91. Tang Z, Wang W, Huang Y, Wu H, Wei J, Huang T (2017) Application-centric SSD cache allocation for Hadoop applications. In: Proceedings of the 9th Asia-Pacific Symposium on Internetware. ACM
92. Securosis (2016) Security recommendations for Hadoop environments. White paper, Securosis
93. Garman J (2003) Kerberos: the definitive guide. O'Reilly Media, Inc

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.