Hadoop Lab Report
Objectives
To understand the current trends and future developments in big data processing using Hadoop
To familiarize with the Hadoop Distributed File System (HDFS) and the MapReduce programming model
To learn how to handle large amounts of data and perform distributed processing
To understand how to use the different components of the Hadoop ecosystem, such as Pig, Hive, and HBase
To understand how to optimize the performance of a Hadoop cluster
To understand the concepts of data warehousing and data mining using Hadoop
Theory
Introduction:
Apache Hadoop is an open-source framework that is used to efficiently store and process
large datasets ranging in size from gigabytes to petabytes of data. Instead of using one
large computer to store and process the data, Hadoop allows clustering multiple
computers to analyze massive datasets in parallel more quickly.
Hadoop Distributed File System (HDFS) – A distributed file system that runs
on standard or low-end hardware. HDFS provides better data throughput than
traditional file systems, in addition to high fault tolerance and native support
of large datasets.
Hadoop Common – Provides common Java libraries that can be used across
all modules.
Hadoop History
As the World Wide Web grew in the late 1990s and early 2000s, search engines and
indexes were created to help locate relevant information amid the text-based content. In
the early years, search results were returned by humans. But as the web grew from dozens
to millions of pages, automation was needed. Web crawlers were created, many as
university-led research projects, and search engine start-ups took off (Yahoo, AltaVista,
etc.).
Why is Hadoop important
Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that's a key consideration.
Fault tolerance. Data and application processing are protected against hardware
failure. If a node goes down, jobs are automatically redirected to other nodes to
make sure the distributed computing does not fail. Multiple copies of all data are
stored automatically.
Low cost. The open-source framework is free and uses commodity hardware to
store large quantities of data.
Scalability. You can easily grow your system to handle more data simply by
adding nodes. Little administration is required.
The modest cost of commodity hardware makes Hadoop useful for storing and combining
data such as transactional, social media, sensor, machine, scientific, click streams, etc.
The low-cost storage lets you keep information that is not deemed currently critical but
that you might want to analyze later.
Sandbox for discovery and analysis
Because Hadoop was designed to deal with volumes of data in a variety of shapes and
forms, it can run analytical algorithms. Big data analytics on Hadoop can help your
organization operate more efficiently, uncover new opportunities and derive next-level
competitive advantage. The sandbox approach provides an opportunity to innovate with
minimal investment.
Data lake
Data lakes support storing data in its original or exact format. The goal is to offer a raw or unrefined view of data to data scientists and analysts for discovery and analytics. It helps
them ask new or difficult questions without constraints. Data lakes are not a replacement
for data warehouses. In fact, how to secure and govern data lakes is a huge topic for IT.
They may rely on data federation techniques to create a logical data structure.
Complement your data warehouse
We're now seeing Hadoop beginning to sit beside data warehouse environments, as well as certain data sets being offloaded from the data warehouse into Hadoop or new types of data going directly to Hadoop. The end goal for every organization is to have the right platform for storing and processing data of different schemas, formats, etc. to support different use cases that can be integrated at different levels.
IoT and Hadoop
Things in the IoT need to know what to communicate and when to act. At the core of the IoT is a streaming, always-on torrent of data. Hadoop is often used as the data store for
millions or billions of transactions. Massive storage and processing capabilities also allow
you to use Hadoop as a sandbox for discovery and definition of patterns to be monitored
for prescriptive instruction. You can then continuously improve these instructions,
because Hadoop is constantly being updated with new data that doesn’t match previously
defined patterns.
Challenges of using Hadoop
MapReduce programming is not a good match for all problems. It is good for simple information requests and problems that can be divided into independent units, but it is not efficient for iterative and interactive analytic tasks. MapReduce is file-intensive. Because
the nodes don’t intercommunicate except through sorts and shuffles, iterative algorithms
require multiple map-shuffle/sort-reduce phases to complete. This creates multiple files
between MapReduce phases and is inefficient for advanced analytic computing.
There’s a widely acknowledged talent gap. It can be difficult to find entry-level
programmers who have sufficient Java skills to be productive with MapReduce. That's
one reason distribution providers are racing to put relational (SQL) technology on top of
Hadoop. It is much easier to find programmers with SQL skills than MapReduce skills.
And, Hadoop administration seems part art and part science, requiring low-level
knowledge of operating systems, hardware and Hadoop kernel settings.
Data security. Another challenge centers around the fragmented data security issues,
though new tools and technologies are surfacing. The Kerberos authentication protocol is
a great step toward making Hadoop environments secure.
Full-fledged data management and governance. Hadoop does not have easy-to-use,
full-feature tools for data management, data cleansing, governance and metadata.
Especially lacking are tools for data quality and standardization.
Hadoop Architecture
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS)
and provides a distributed file system that is designed to run on commodity hardware. It
has many similarities with existing distributed file systems. However, the differences
from other distributed file systems are significant. It is highly fault-tolerant and is
designed to be deployed on low-cost hardware. It provides high throughput access to
application data and is suitable for applications having large datasets.
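As a concrete illustration of how an application accesses HDFS, the short sketch below uses the standard org.apache.hadoop.fs.FileSystem API to write a small file and read it back. This is only a minimal sketch; the class name and the path /user/hadoop/sample.txt are illustrative assumptions, not something defined in this lab.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Assumed example path in HDFS
        Path file = new Path("/user/hadoop/sample.txt");

        // Write a small file to HDFS
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hadoop\n");
        }

        // Read the file back line by line
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}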
Apart from HDFS and MapReduce, the two core components mentioned above, the Hadoop framework also includes the following two modules −
Hadoop Common − These are Java libraries and utilities required by other Hadoop
modules.
Hadoop YARN − This is a framework for job scheduling and cluster resource
management.
It is quite expensive to build bigger servers with heavy configurations that handle large-scale processing. As an alternative, you can tie together many commodity computers with single CPUs into a single functional distributed system, and in practice the clustered machines can read the dataset in parallel and provide much higher throughput. Moreover, it is cheaper than one high-end server. So the first motivational factor behind using Hadoop is that it runs across clustered, low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the following core
tasks that Hadoop performs −
Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 128 MB or 64 MB (preferably 128 MB); the block size and replication factor are configurable, as shown in the sketch after this list.
These files are then distributed across various cluster nodes for further
processing.
HDFS, being on top of the local file system, supervises the processing.
Blocks are replicated for handling hardware failure.
Checking that the code was executed successfully.
Performing the sort that takes place between the map and reduce stages.
Sending the sorted data to a certain computer.
Writing the debugging logs for each job
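The block size and replication factor mentioned in this list are ordinary HDFS settings. As a minimal sketch (the values shown are only illustrative, not settings prescribed by this lab), a client can set them through the standard dfs.blocksize and dfs.replication configuration properties:

import org.apache.hadoop.conf.Configuration;

public class HdfsBlockConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // 128 MB block size (value is in bytes); 64 MB would be 64 * 1024 * 1024
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        // Keep three copies of every block to tolerate node failures
        conf.setInt("dfs.replication", 3);

        // Print the values back to confirm what a client would use
        System.out.println("dfs.blocksize  = " + conf.get("dfs.blocksize"));
        System.out.println("dfs.replication = " + conf.get("dfs.replication"));
    }
}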
Advantages of Hadoop
Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines, in turn utilizing the underlying parallelism of the CPU cores.
Hadoop does not rely on hardware to provide fault-tolerance and high availability
(FTHA), rather Hadoop library itself has been designed to detect and handle failures
at the application layer.
Servers can be added or removed from the cluster dynamically and Hadoop continues
to operate without interruption.
Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms since it is Java-based.
Source Code:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCountTest {
    // Mapper: emits (word, 1) for every token in each input line
    public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Text outputKey = new Text();
        private final IntWritable outputValue = new IntWritable(1);
        public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                outputKey.set(tokens.nextToken());
                con.write(outputKey, outputValue);
            }
        }
    }
    // Reducer: sums the counts emitted for each word
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            con.write(word, new IntWritable(sum));
        }
    }
    // Driver: configures the job and submits it, taking input and output paths as arguments
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        Path inputFolder = new Path(args[0]);
        Path outputFolder = new Path(args[1]);
        job.setJarByClass(WordCountTest.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, inputFolder);
        FileOutputFormat.setOutputPath(job, outputFolder);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Folder Structure:
Create the input data folder, create an input folder inside it, and upload words.txt into the input folder. Also upload the compiled WordCount.jar into the input data folder.
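The same folder setup can also be done programmatically with the Hadoop FileSystem API. Below is a minimal sketch; the HDFS paths /inputdata and /inputdata/input and the local file locations are assumptions chosen to match the description above, not paths fixed by the lab.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadInputSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Assumed layout: an input data folder with an input folder inside it
        Path inputData = new Path("/inputdata");
        Path inputFolder = new Path("/inputdata/input");
        fs.mkdirs(inputFolder);   // also creates /inputdata if it does not exist

        // Upload words.txt into the input folder and the jar into the input data folder
        fs.copyFromLocalFile(new Path("words.txt"), new Path(inputFolder, "words.txt"));
        fs.copyFromLocalFile(new Path("WordCount.jar"), new Path(inputData, "WordCount.jar"));
    }
}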
Output:
Discussion:
In this Hadoop lab, I was able to learn about Hadoop, the implementation of Hadoop, and its advantages, disadvantages, and challenges.
Apache Hadoop is an open-source framework that is used to efficiently store and process
large datasets ranging in size from gigabytes to petabytes of data. Instead of using one
large computer to store and process the data, Hadoop allows clustering multiple
computers to analyze massive datasets in parallel more quickly. As the World Wide Web
grew in the late 1990s and early 2000s, search engines and indexes were created to help
locate relevant information amid the text-based content. In the early years, search results
were returned by humans. But as the web grew from dozens to millions of pages,
automation was needed. Web crawlers were created, many as university-led research
projects, and search engine start-ups took off (Yahoo, AltaVista, etc.).
Sandbox for discovery and analysis, data lake, complementing your data warehouse, and IoT and Hadoop are the popular usages of Hadoop.
The challenge areas for Hadoop are that MapReduce programming is not a good match for all problems, that there is a widely acknowledged talent gap, data security, and full-fledged data management and governance. MapReduce, HDFS, the YARN framework, and the common utilities are the most important components of the Hadoop architecture. Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs:
Data is initially divided into directories and files, and files are divided into uniformly sized blocks of 128 MB or 64 MB (preferably 128 MB). These files are then distributed across various cluster nodes for further processing. HDFS, being on top of the local file system, supervises the processing, and blocks are replicated to handle hardware failure. Hadoop also checks that the code was executed successfully, performs the sort that takes place between the map and reduce stages, sends the sorted data to a certain computer, and writes the debugging logs for each job.
Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines, in turn utilizing the underlying parallelism of the CPU cores. Hadoop does not rely on hardware
to provide fault-tolerance and high availability (FTHA), rather Hadoop library itself has
been designed to detect and handle failures at the application layer. Servers can be added
or removed from the cluster dynamically and Hadoop continues to operate without
interruption. Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms since it is Java-based.
Conclusion
According to my Hadoop lab objectives, my goals were fully achieved. I was able to learn about Hadoop, its implementation, usages, advantages, and disadvantages.