
Study of Apache Hadoop


[Patel, 3(12): December, 2014]  ISSN: 2277-9655
International Journal of Engineering Sciences & Research Technology (IJESRT)

Uma Patel, Rakesh Patel, Nimita Patel*
*Student, B.E. (IT), Kirodimal Institute of Technology, Raigarh (C.G.), India
Lecturer, Department of Information Technology, Kirodimal Institute of Technology, Raigarh (C.G.), India
Student, B.E. (IT), Kirodimal Institute of Technology, Raigarh (C.G.), India

Abstract
Apache Hadoop is an open-source software framework for the distributed storage and distributed processing of big data on clusters of commodity hardware. The settings of the Hadoop environment are critical for deriving the full benefit from the underlying hardware and software. Distributions of Apache Hadoop bundle Apache Hadoop with other software components optimized to take advantage of hardware-enhanced performance and security capabilities. The Apache Hadoop project defines HDFS as “the primary storage system used by Hadoop applications”, enabling reliable, extremely rapid computation. The Hadoop Distributed File System (HDFS) splits files into large blocks (64 MB or 128 MB by default) and distributes the blocks among the nodes of the cluster. Hadoop uses a distributed, user-level file system that takes care of storing data and can handle very large amounts of data.

Keywords: Apache Hadoop.

Introduction
Apache Hadoop is open-source software from the Apache Software Foundation. (Apache Hadoop and Hadoop are trademarks of The Apache Software Foundation, used with permission; no endorsement by The Apache Software Foundation is implied by the use of these marks.) Using it, a low-cost, fully realized big data platform can be implemented, for example on the Intel Distribution for Apache Hadoop software. Hadoop is an open-source software stack that runs on a cluster of machines and provides distributed storage and distributed processing for very large data sets. It is an Apache project released under the Apache License 2.0, which is very commercially friendly. Hadoop was originally developed and open-sourced at Yahoo; it is now developed as an Apache Software Foundation project with numerous contributors from Cloudera, Hortonworks, Facebook, and other companies.

Hadoop is open source and the software itself is free. Hadoop runs on a cluster of machines whose size can range from ten nodes to thousands of nodes, so for a large cluster the hardware costs are significant, and the IT/operations cost of standing up and supporting a large Hadoop cluster must also be factored in. Because Hadoop is a comparatively new technology, finding people with experience in this ecosystem is not easy.

Architecture of Hadoop

History
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project.

Components of Hadoop
Hadoop provides two components: 1. the Hadoop Distributed File System (HDFS) and 2. MapReduce.

HDFS Architecture
Hadoop = HDFS + MapReduce. Hadoop provides two things: storage and compute. Storage is provided by the Hadoop Distributed File System (HDFS); compute is provided by MapReduce.
It consists of two parts: the Hadoop Distributed File System (HDFS), which is modeled after Google's GFS, and Hadoop MapReduce, which is modeled after Google's MapReduce. HDFS is built using the Java language; any machine that supports Java can run the NameNode or DataNode software. Use of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software, while each of the other machines in the cluster runs one instance of the DataNode software.

Hadoop Distributed File System (HDFS)
HDFS is the file system, or storage layer, of Hadoop. It takes care of storing data, and it can handle very large amounts of data. HDFS is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, is highly fault-tolerant, and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.

NameNode and DataNode
HDFS is implemented by two services: the NameNode and the DataNode. The architecture does not preclude running multiple DataNodes on the same machine, but in a real deployment that is rarely the case. HDFS does not support hard links or soft links, although the architecture does not preclude implementing these features. The NameNode maintains the file system namespace; it is responsible for maintaining the HDFS directory tree and is a centralized service in the cluster operating on a single node. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS; the number of copies of a file is called the replication factor of that file, and this information is stored by the NameNode. Clients contact the NameNode in order to perform common file system operations, such as open, close, rename, and delete. The NameNode does not store HDFS data itself, but rather maintains a mapping between an HDFS file name, the list of blocks in the file, and the DataNode(s) on which those blocks are stored.

HDFS is one of many components and projects contained within the community Hadoop ecosystem. The Apache Hadoop project defines HDFS as “the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.”

The file system namespace
HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file. HDFS does not yet implement user quotas or access permissions.
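To make the NameNode's role concrete, the following minimal Java sketch uses the public HDFS Java API (see reference 1) to perform the common namespace operations mentioned above. The NameNode address and paths are illustrative placeholders only, not values from any particular deployment.

// Minimal sketch: a client driving common namespace operations through the NameNode.
// The fs.defaultFS address and the paths below are hypothetical examples.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsNamespaceExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in practice this comes from the cluster configuration.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path dir = new Path("/user/demo/reports");
            fs.mkdirs(dir);                                      // create a directory
            Path file = new Path(dir, "daily.txt");
            fs.create(file).close();                             // create (and immediately close) a file
            fs.setReplication(file, (short) 3);                  // per-file replication factor
            fs.rename(file, new Path(dir, "daily-renamed.txt")); // rename
            fs.delete(dir, true);                                // recursive delete
        }
    }
}

Each of these calls is a metadata operation handled by the NameNode; only reads and writes of file contents involve the DataNodes.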
Data replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file: an application can specify the number of replicas of a file, and the replication factor can be specified at file creation time and changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding the replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster; receipt of a Heartbeat implies that the DataNode is functioning properly, and a Blockreport contains a list of all blocks on a DataNode.

The Benefits of HDFS
There is little debate that HDFS provides a number of benefits for those who choose to use it. Below are some of the most commonly cited.

Built-In Redundancy and Failover: HDFS supplies out-of-the-box redundancy and failover capabilities that require little to no manual intervention (depending on the use case). Having such features built into the storage layer allows system administrators and developers to concentrate on other responsibilities rather than having to create monitoring systems or programming routines to compensate for storage software that lacks those capabilities. Moreover, with downtime being a real threat to many modern businesses' bottom line, features that minimize outages and contribute to keeping a batch analytic data store up, operational, and feeding any online system that requires its input are welcomed by both IT and business professionals.

Big Data Capable: The hallmark of HDFS is its ability to tackle big data use cases and most of the characteristics that comprise them (data velocity, variety, and volume). The rate at which HDFS can supply data to the programming layers of Hadoop translates into faster batch processing times and quicker answers to complex analytic questions.

Portability: Any tenured data professional can relay horror stories of having to transfer, migrate, and convert huge data volumes between disparate storage/software vendors. One benefit of HDFS is its portability between the various Hadoop distributions, which helps minimize vendor lock-in.

Cost-Effective: HDFS is open-source software, which translates into real cost savings for its users. As many companies can attest, high-priced storage solutions can take a significant bite out of IT budgets and are often completely out of reach for small or startup companies. Other benefits of HDFS exist, but the four above are the primary reasons why many users deploy HDFS as their analytic storage solution.

MapReduce
MapReduce takes care of distributed computing. It reads the data, usually from its storage layer (HDFS). A Hadoop MapReduce job mainly consists of two user-defined functions: map and reduce. The input of a Hadoop MapReduce job is a set of key-value pairs (k, v), and the map function is called for each of these pairs. The map function produces zero or more intermediate key-value pairs (k′, v′). The Hadoop MapReduce framework then groups these intermediate key-value pairs by intermediate key k′ and calls the reduce function for each group. Finally, the reduce function produces zero or more aggregated results. The beauty of Hadoop MapReduce is that users usually only have to define the map and reduce functions; the framework takes care of everything else, such as parallelisation and failover. The Hadoop MapReduce framework uses a distributed file system to read and write its data, typically the Hadoop Distributed File System (HDFS), which is the open-source counterpart of the Google File System. Therefore, the I/O performance of a Hadoop MapReduce job strongly depends on HDFS.
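As a concrete illustration of these two user-defined functions, the classic word-count job can be written as in the minimal sketch below, using the standard org.apache.hadoop.mapreduce API. The class names and command-line paths are chosen purely for illustration.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map: (byte offset, line of text) -> (word, 1) for every word in the line
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce: (word, [1, 1, ...]) -> (word, total count)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory; must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Only the map and reduce bodies are application-specific; splitting the input, shipping intermediate pairs, grouping by key, and recovering from task failures are handled by the framework.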
In the MapReduce model, computation is divided into a map function and a reduce function. The map function takes a key/value pair and produces one or more intermediate key/value pairs. The reduce function then takes these intermediate key/value pairs and merges all values corresponding to a single key. The map function can run independently on each key/value pair, exposing enormous amounts of parallelism. Similarly, the reduce function can run independently on each intermediate key, also exposing significant parallelism.

Hadoop Customization
A user can customize and optimize a Hadoop MapReduce job by supplying additional functions besides just map and reduce. In this section, we consider whether users are exploiting this feature in practice. MapReduce splits computation into multiple tasks, called Mappers and Reducers.

Job Customization
Most job customizations are related to ways of partitioning data and aggregating data from earlier stages:

Combiner: The Combiner performs partial aggregation during the local sort phase in a map task and a reduce task. In general, if the application semantics support it, a combiner is recommended. In OPENCLOUD, 62% of users have used this optimization at least once; in the M45 and WEB MINING clusters, 43% and 80% of users, respectively, have used it.

Secondary Sort: This function is applied during the reduce phase. By grouping different reduce keys for a single reduce call, secondary sort allows users to implement an optimized join algorithm as well as other complex operations. In the M45 cluster, no user applied a secondary sort. In the WEB MINING cluster, only one user used secondary sort, and that was through the use of Pig, which implements certain join algorithms using secondary sort. In OPENCLOUD, 14% of users have used secondary sort, perhaps suggesting a higher level of sophistication or more stringent performance requirements.

Custom Partitioner: A user can also take full control over how map output is redistributed to the reduce tasks by using a custom partitioner, as sketched below. In the OPENCLOUD cluster, as many as 35% of users have used a custom partitioner, whereas only two users in the M45 cluster and one user in the WEB MINING cluster applied one.
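The following minimal sketch shows what such a custom partitioner can look like. The class name and the routing rule (sending keys to reducers based on their first character) are illustrative choices, not code from the clusters studied above.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: all keys that start with the same letter are routed to
// the same reduce task, so each reducer aggregates one slice of the alphabet.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0; // empty keys all go to the first partition
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        return first % numPartitions; // char values are non-negative, so this index is valid
    }
}

A partitioner like this is attached to a job with job.setPartitionerClass(FirstLetterPartitioner.class); without it, Hadoop uses hash partitioning on the intermediate key.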
Mapper
As an analogy, imagine sorting the participants at an event into age groups and then producing a result for each group. The "sorter" (the girl asking "how old are you?") is concerned only with sorting people into the appropriate groups (in our case, by age); she is not concerned with the next step of the computation. In MapReduce parlance, the girl is known as the Mapper.

Reducer
Once the participants are sorted into appropriate age groups, the guy wearing the bowtie interviews each age group to produce the final result for that group. There are a few subtle things happening here:
• The result for one age group is not influenced by the result of any other age group, so the groups can be processed in parallel.
• We can be certain that each group contains all the participants for that group; for example, all the twenty-somethings are in the 20s group. If the mapper did her job right, this will be the case.
• With these assumptions, the guy in the bowtie can produce a result for a particular age group independently.

The benefits of MapReduce programming
So what are the benefits of MapReduce programming? It summarizes a great deal of the experience of scientists and practitioners in the design of distributed processing systems. It resolves or avoids several complications of distributed computing, it allows effectively unlimited computations on an unlimited amount of data, and it simplifies the developer's life. Although it looks deceptively simple, it is very powerful, and a great number of sophisticated (and profitable) applications have been written in this framework. For those who prefer not to write Java code directly, higher-level tools built on top of MapReduce can be used instead, and new frameworks in this space continue to appear.

Custom Input and Output Format: Hadoop provides an InputFormat and OutputFormat framework to simplify the handling of custom data formats and non-native storage systems. In OPENCLOUD, 27% of users applied a custom input format at least once and 10% of users applied a custom output format. In M45, only 4 users applied a custom input format and only 1 user applied a custom output format. In WEB MINING, only one user applied a custom input format and none applied a custom output format. In general, job customizations help with performance, and thus a visible fraction of users leverage them, especially the optional combiners. OPENCLOUD users tend to use more optimization techniques than users of the other two clusters. A driver that wires these customizations together is sketched below.
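As an illustration of how these customizations are attached to a job, the driver below wires together the mapper, reducer, combiner, and partitioner sketched earlier in this section and selects explicit input and output format classes. All user-defined class names are the hypothetical examples introduced above, not code from the analyzed workloads.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CustomizedJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "customized job");
        job.setJarByClass(CustomizedJobDriver.class);

        // Mapper, reducer, combiner and partitioner from the illustrative sketches above.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);   // optional partial aggregation
        job.setPartitionerClass(FirstLetterPartitioner.class); // custom redistribution of map output

        // Input/output formats: plain text in, binary SequenceFiles out.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Custom InputFormat and OutputFormat classes written by users plug into the same two setter calls; here the built-in text and SequenceFile formats stand in for them.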
Configuration Tuning
Hadoop exposes a variety of configuration parameters for tuning performance and reliability. Here we discuss a few configuration parameters that are typically considered important for performance and fault tolerance.

Failure Parameters: Users can control how failures are handled, as well as how erroneous inputs are treated. In OPENCLOUD, 7 users explicitly specified a higher threshold for retrying failed tasks, 6 users specified a higher "skip" setting to ignore bad input records, and 1 user specified a higher threshold for the number of tolerable failed tasks. In M45, 3 users set a higher threshold for the number of tolerable failed tasks. All WEB MINING users stayed with the cluster default values.

Java Virtual Machine (JVM) Options: The native Hadoop MapReduce interface is implemented in Java. If a map or reduce task requires a large memory footprint, the programmer must manually adjust the heap and stack sizes: 29 OPENCLOUD users, 11 M45 users, and 3 WEB MINING users changed the default JVM options for their jobs.

Speculative Execution: Speculative execution is the default mechanism for handling straggler tasks. Only two users, from OPENCLOUD and M45, changed the cluster default value for their applications.

Sort Parameters: Hadoop runs a merge sort at the end of the map phase and just before the reduce phase, and four parameters relate directly to these sorts. Two users of the WEB MINING cluster adjusted the io.sort.mb parameter to 200, and one user of the M45 cluster adjusted io.sort.mb to 10; all other users used the cluster default values.

HDFS Parameters: The HDFS block size and replication factor affect the behavior of writing the final output of a MapReduce job. In OPENCLOUD, 11 users tried different values for the replication factor. In M45, two users adjusted the block size and only one user tried a different replication factor. Other than these, all users kept the cluster default values.

In summary, users tend to tune parameters directly related to failures. JVM options are used to prevent "Out of Memory" errors; by talking with the administrators of OPENCLOUD, we learned that many of their users explicitly tuned these options in response to poor failure behavior. In contrast, users rarely tune parameters related to performance, perhaps because their performance requirements were generally being met, or perhaps because these parameters are more difficult to understand and manipulate.
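The sketch below illustrates how such per-job settings can be expressed through the Hadoop Configuration API. It assumes the classic (Hadoop 1.x era) property names such as io.sort.mb and mapred.child.java.opts; later releases rename several of these keys, so the exact names and the values shown should be treated as illustrative rather than definitive.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TunedJobConfiguration {
    public static Job buildJob() throws Exception {
        Configuration conf = new Configuration();

        // Failure handling: task retry attempts, skipping of bad input records,
        // and the fraction of failed map tasks the job may tolerate.
        conf.setInt("mapred.map.max.attempts", 6);
        conf.setLong("mapred.skip.map.max.skip.records", 1);
        conf.setInt("mapred.max.map.failures.percent", 5);

        // JVM options for map/reduce tasks, e.g. a larger heap to avoid
        // "Out of Memory" errors in memory-hungry user code.
        conf.set("mapred.child.java.opts", "-Xmx2048m");

        // Speculative execution of straggler map tasks (enabled by default).
        conf.setBoolean("mapred.map.tasks.speculative.execution", true);

        // Buffer used by the map-side merge sort, in megabytes.
        conf.setInt("io.sort.mb", 200);

        // HDFS settings that affect how the job's final output is written.
        conf.setLong("dfs.block.size", 128L * 1024 * 1024); // 128 MB blocks
        conf.setInt("dfs.replication", 3);

        return Job.getInstance(conf, "tuned job");
    }
}

These settings correspond to the knobs discussed above: the first group targets fault tolerance, while the sort and HDFS parameters target performance.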
Job optimization
One of the major advantages of Hadoop MapReduce is that it allows non-expert users to easily run analytical tasks over big data. Hadoop MapReduce gives users full control over how input data sets are processed: users code their queries in Java rather than SQL. This makes Hadoop MapReduce accessible to a larger number of developers, since no background in databases is required, only a basic knowledge of Java. However, Hadoop MapReduce jobs are far behind parallel databases in their query processing efficiency. Hadoop MapReduce jobs achieve decent performance by scaling out to very large computing clusters, but this results in high costs in terms of hardware and power consumption. Therefore, researchers have carried out many research works to adapt the query processing techniques found in parallel databases to the context of Hadoop MapReduce.

Data layouts and indexes
One of the main performance problems of Hadoop MapReduce is its physical data organization, including data layouts and indexes.

Data layouts: Hadoop MapReduce jobs often suffer from a row-oriented layout. The disadvantages of row layouts have been thoroughly researched in the context of column stores. However, in a distributed system, a pure column store has severe drawbacks, as the data for different columns may reside on different nodes, leading to high network costs. Thus, whenever a query references more than one attribute, columns have to be sent through the network in order to merge the different attribute values into a row (tuple reconstruction). This can significantly decrease the performance of Hadoop MapReduce jobs. Therefore, other, more effective data layouts have been proposed in the literature for Hadoop MapReduce.

Conclusion
We find that in the three workloads, a majority of users submitted many small, single-stage applications implemented in Java, although the rest of the workloads are highly diverse in application styles and data-processing characteristics. We see underuse of Hadoop features, extensions, and optimization tools. Our conclusion is that the use of Hadoop for academic research is still in its adolescence; easing the use of Hadoop and improving system designs in response to changing use cases are crucial research directions for the future. Data confidentiality can be addressed through encryption and decryption performed without a performance penalty in the storage layer (HDFS), taking full advantage of enhancements such as the Advanced Encryption Standard New Instructions. The MapReduce programming model has been used successfully at Google for many different purposes, and we attribute this success to several reasons.

References
1. HDFS Java API: http://hadoop.apache.org/core/docs/current/api/
2. HDFS source code: http://hadoop.apache.org/core/version_control.html
3. Apache Hadoop: http://hadoop.apache.org
4. Kai Ren, YongChul Kwon, Magdalena Balazinska, and Bill Howe. Hadoop's Adolescence: A Comparative Workload Analysis from Three Research Clusters. Parallel Data Laboratory, Carnegie Mellon University, Pittsburgh, PA 15213-3890.
5. Hadoop MapReduce: http://hadoop.apache.org/mapreduce/
6. A. Floratou et al. Column-Oriented Storage Techniques for MapReduce. PVLDB, 4(7):419–429, 2011.
7. "Hadoop HDFS," Hadoop.Apache.org: http://hadoop.apache.org/hdfs/