Introduction To Big Data and Hadoop
Agenda
Introduction to Hadoop
Hadoop Distribution
Introduction to Big Data
Every day, we create quintillions of bytes of data, and 90% of the data in the world today has been created in the last four to five years alone.
This data comes from everywhere:
sensors used to gather climate information,
posts to social media sites,
digital pictures and videos,
purchase transaction records, and
cell phone GPS signals, and many more.
This huge volume of data is Big Data.
Three Vs of Big Data
Big data is commonly characterized by three Vs: Volume (the sheer scale of the data), Velocity (the speed at which data is generated and must be processed), and Variety (the many forms the data takes, from structured records to text, images, and video).
Introduction to Hadoop
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming
models. It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.
Hadoop introduced a new way to simplify the analysis of large data sets, and in a very
short time reshaped the big data market. In fact, today Hadoop is often synonymous
with the term big data.
History and Uses of Hadoop
The genesis of Hadoop came from the Google File System paper that was published in
October 2003. This paper spawned another research paper from Google - MapReduce:
Simplified Data Processing on Large Clusters.
Development started in the Apache Nutch project, but was moved to the new Hadoop
subproject in January 2006. Doug Cutting, who was working at Yahoo! at the time,
named it after his son's toy elephant.
The first committer added to the Hadoop project was Owen O’Malley in March 2006.
Hadoop 0.1.0 was released in April 2006, and the project continues to evolve through the work of its many contributors to Apache Hadoop.
In 2007, Yahoo! started running Hadoop on a 1,000-node cluster.
The first stable release of the Hadoop 2.x line arrived in 2013.
History and Uses of Hadoop
Hadoop
Hadoop is an open source framework that supports the processing of large data sets in a distributed computing environment.
Hadoop is part of the Apache Software Foundation.
It can be used efficiently for big data storage, processing, access, analysis, governance, security, operations and deployment.
Hadoop consists of MapReduce, the Hadoop Distributed File System (HDFS) and a number of related projects such as Apache Hive, HBase and ZooKeeper. MapReduce and HDFS are the main components of Hadoop.
HDFS is the storage layer of Hadoop, while MapReduce is responsible for processing the data.
Hadoop Cluster
A Hadoop cluster is a special type of computational cluster designed for processing vast amounts of data in a distributed computing environment. These clusters run on low-cost commodity computers.
Hadoop Cluster Architecture
• The nodes are spread across different racks.
• At the top of each rack there is a rack switch.
• A few machines act as the NameNode and the ResourceManager; these are referred to as Masters.
• The majority of the machines act as DataNodes and NodeManagers; these are referred to as Slaves.
• Masters have a configuration favoring more RAM and CPU and less local storage.
• Slave nodes have lots of local disk storage and moderate amounts of CPU and RAM.
History and Uses of Hadoop
Most of the services in the Hadoop ecosystem exist to supplement the core components of Hadoop: HDFS, YARN and MapReduce.
Hadoop core components
HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It consists of two key components: the NameNode and the DataNodes.
Namenode (Master):
• HDFS has a master/slave architecture. An HDFS
cluster consists of a single NameNode, a master
server that manages the file system namespace
and regulates access to files by clients.
• It executes file system namespace operations like
opening, closing, and renaming files and directories.
It also determines the mapping of blocks to
DataNodes
• It periodically receives heartbeats and block reports from the DataNodes in the cluster to confirm that they are alive and working.
Datanode (Slave):
• They manage storage attached to the nodes that
they run on.
• Internally, a file is split into one or more blocks and
these blocks are stored in a set of DataNodes.
• The DataNodes are responsible for serving read
and write requests from the file system’s clients.
• The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode (the resulting block-to-DataNode mapping can be inspected from a client, as sketched below).
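As a minimal, hedged sketch of the block-to-DataNode mapping described above, the following Java snippet asks the NameNode, through the Hadoop FileSystem client API, where the blocks of a file are stored. The path /data/sample.txt is hypothetical, and the snippet assumes the client is configured (via core-site.xml) to reach the cluster's NameNode.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS in core-site.xml points at the NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path, used for illustration only.
        Path file = new Path("/data/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // Ask the NameNode which DataNodes hold each block of the file.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                + " length=" + block.getLength()
                + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}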
Hadoop core components……………..contd.
Points to remember
Master/slave architecture: the NameNode acts as the master, while the DataNodes act as slaves.
The NameNode (NN) manages the filesystem namespace.
All client interaction starts with the NN.
DataNodes (DNs) store files as blocks.
Each DN sends heartbeats and block reports to the NN.
A file is broken into blocks, and the blocks are stored on DNs.
The NN maintains the file-to-block mapping, block locations, the order of blocks, and other metadata.
The default block size is 128 MB.
The default block size can be overridden for an individual file (see the sketch after this list).
Clients interact directly with DNs when reading and writing file data.
Data is kept in different racks. Keeping two copies in one rack and one copy in another rack ensures that, if an entire rack fails, one copy still survives in the other rack. This feature is called Rack Awareness.
The client does not send a block to all three DataNodes identified by the NameNode, because the client would be choked by transmitting the data three times. Instead, the client sends the block to the first DataNode, and it is that DataNode's job to contact and transmit the data to the next DataNode. This is done through the data pipeline.
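A minimal, hedged sketch of the client side of a write, assuming a reachable cluster: it uses the Hadoop FileSystem API to create a file with a per-file replication factor and block size, overriding the 128 MB default mentioned above. The path /data/output.txt, the 256 MB block size and the replication factor of 3 are illustrative.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithCustomBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/output.txt");   // hypothetical path

        // Per-file overrides: 3 replicas and a 256 MB block size
        // (the cluster-wide default block size is 128 MB).
        short replication = 3;
        long blockSize = 256L * 1024 * 1024;
        int bufferSize = 4096;

        // The client streams data to the first DataNode in the pipeline;
        // that DataNode forwards the block to the next replica, and so on.
        try (FSDataOutputStream out =
                 fs.create(file, true, bufferSize, replication, blockSize)) {
            out.write("hello hadoop".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}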
Hadoop core components……………..contd.
Reading data from HDFS
The HDFS client asks the NameNode for the list of DataNodes that store replicas of the blocks of the given file (a file consists of blocks). For each block, the NameNode returns the addresses of the DataNodes that hold a copy of that block (the metadata information).
Once the client has received this metadata, it requests the blocks directly from the DataNodes.
For each block, the locations are ordered by their distance from the reader; the client tries to read from the closest replica first and, if that fails, moves on to the next replica, and so on. The data is streamed from the DataNode back to the client. A sketch of this read path follows.
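A minimal, hedged sketch of the read path just described, using the Hadoop FileSystem client API; the path /data/sample.txt is hypothetical. The open() call obtains the block locations from the NameNode, and the returned stream then reads each block from the closest available DataNode replica.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);        // contacts the NameNode first

        Path file = new Path("/data/sample.txt");    // hypothetical path

        // open() fetches the block metadata from the NameNode; the stream
        // then pulls each block from the nearest DataNode holding a replica.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                 new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}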
Hadoop core components……………..contd.
MapReduce
MapReduce is a software framework that enables us to write applications that process large data sets using distributed and parallel algorithms in a Hadoop environment.
MapReduce is a processing technique and a programming model for distributed computing, based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.
Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
The Reduce task takes the output of a map as its input and combines those data tuples into a smaller set of tuples.
It is easy to scale data processing over multiple computing nodes using MapReduce.
The MapReduce paradigm is generally based on sending the computation to where the data resides.
The framework manages all the details of data passing, such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
After completion of the given tasks, the cluster collects and reduces the data to form the final result and makes it available to the client.
Hadoop core components……………..contd.
Example
Problem Statement:
Count the number of occurrences of each word in a dataset.
Input DataSet:
For simplicity, we will use a small sample dataset; real applications, however, work with very large volumes of data. A sketch of the corresponding MapReduce job is shown below.
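The following is a hedged, minimal sketch of how this word-count problem is typically solved with Hadoop MapReduce, based on the classic WordCount program in Java; the class names and input/output paths are illustrative, and the slide's sample dataset is not reproduced. For a hypothetical input line "apple banana apple", the mapper emits (apple, 1), (banana, 1), (apple, 1); the shuffle step groups these tuples by key, so the reducer sees apple -> [1, 1] and banana -> [1] and emits (apple, 2) and (banana, 1).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for each input line, emit a (word, 1) tuple per word.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: after the shuffle groups tuples by word, sum the counts.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, the job would typically be submitted with: hadoop jar wordcount.jar WordCount <input path> <output path>.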
Hadoop core components……………..contd.
MapReduce – Shuffle Function (Sorting Step)
(Diagram slides walking through the word-count example: splitting the input, the map output, the shuffle function that sorts and groups the intermediate tuples by key, and the reduce output.)
Hadoop core components……………..contd.
The diagram shows the flowchart of YARN operations when a job is submitted. In outline: the client submits the application to the ResourceManager; the ResourceManager allocates a container on a NodeManager and launches the ApplicationMaster in it; the ApplicationMaster registers with the ResourceManager and requests containers for the tasks; the NodeManagers launch those task containers; and the ApplicationMaster tracks progress and unregisters from the ResourceManager when the job completes.
Hadoop Ecosystem –other projects
Hive
It is an open source data warehouse system for querying and analyzing large datasets stored in Hadoop files.
Hive uses a language called HiveQL (HQL), which is similar to SQL. Hive automatically translates the SQL-like queries into MapReduce jobs that execute on Hadoop (an example of running a HiveQL query from a Java client is sketched after this list).
Ambari
It is a management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters.
ZooKeeper
It manages and coordinates a large cluster of machines. It is an open-source technology that maintains configuration information and provides synchronization as well as group services, which are deployed on a Hadoop cluster to administer the infrastructure.
Sqoop
It is used to export/import data between external datastores and the Hadoop Distributed File System or related Hadoop ecosystem projects such as Hive and HBase.
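As an illustration of the SQL-like HiveQL mentioned in the Hive entry above, here is a minimal, hedged sketch that runs a query against HiveServer2 through the standard JDBC API in Java. It assumes the Hive JDBC driver is on the classpath, a HiveServer2 instance reachable at the hypothetical address localhost:10000, and an illustrative table named words.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        // Hive JDBC driver; assumed to be on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 endpoint, for illustration only.
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection con = DriverManager.getConnection(url, "hive", "");
             Statement stmt = con.createStatement()) {

            // A SQL-like HiveQL query; Hive translates it into jobs
            // (e.g. MapReduce) that run on the Hadoop cluster.
            ResultSet rs = stmt.executeQuery(
                "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word");

            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}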
Hadoop Ecosystem – other projects
…and more.
Hadoop Distribution
Vendor distributions are, of course, designed to overcome issues with the open
source edition and provide additional value to customers, with a focus on things
such as:
• Reliability. The vendors react faster when bugs are detected. They promptly
deliver fixes and patches, which makes their solutions more stable.
Three of the top Hadoop distributions are provided by Cloudera, MapR and
Hortonworks.