Introduction to Big Data and Hadoop

Agenda

• Introduction to Big Data
• Introduction to Hadoop
• History and uses of Hadoop
• Hadoop and Hadoop Cluster
• Hadoop & traditional RDBMS
• Hadoop cluster architecture
• Hadoop core components
• Hadoop ecosystem – other projects
• Hadoop Distribution

Introduction to Big Data

Every day we create quintillions of bytes of data, and roughly 90% of the data in the world today has been created in the last four to five years alone.
This data comes from everywhere:
• sensors used to gather climate information,
• posts to social media sites,
• digital pictures and videos,
• purchase transaction records, and
• cell phone GPS signals, and many more.
This huge volume of data is Big Data.

Big Data is high-volume, high-velocity, and high-variety information that demands cost-effective, innovative forms of information processing for enhanced insight and decision making.

Three Vs of Big Data

Volume, velocity, and variety (illustrated by a figure in the original slide).

Introduction to Hadoop

Large dataset challenges:
• Storing large volumes of data (terabytes, petabytes, exabytes)
• Processing the data in a timely manner
• Processing a variety of data (structured, semi-structured, unstructured)
• Costly high-end infrastructure

The Solution – Apache Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Hadoop introduced a new way to simplify the analysis of large data sets, and in a very short time it reshaped the big data market. In fact, today Hadoop is often synonymous with the term big data.

History and Uses of Hadoop

• The genesis of Hadoop came from the Google File System paper published in October 2003. That paper spawned another research paper from Google, "MapReduce: Simplified Data Processing on Large Clusters."
• Development started in the Apache Nutch project but was moved to the new Hadoop subproject in January 2006. Doug Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.
• The first committer added to the Hadoop project was Owen O'Malley, in March 2006.
• Hadoop 0.1.0 was released in April 2006, and the project continues to evolve through the many contributors to Apache Hadoop.
• In 2007, Yahoo! started running Hadoop on a 1,000-node cluster.
• The first stable release of the Hadoop 2.x line arrived in 2013.

History and Uses of Hadoop

• Hadoop is in use at most organizations that handle big data, for example:
  Yahoo!, Facebook, Amazon, Netflix, Google, and many others.

• Some examples of scale:
  Yahoo!'s Search Webmap runs on a 10,000-core Linux cluster and powers Yahoo! Web search.
  Facebook's Hadoop cluster hosts 100+ PB of data and is growing at roughly 0.5 PB per day.


Hadoop and Hadoop Cluster

Hadoop
• Hadoop is an open source framework that supports the processing of large data sets in a distributed computing environment.
• Hadoop is a project of the Apache Software Foundation.
• It can be used efficiently for big data storage, processing, access, analysis, governance, security, operations, and deployment.
• Hadoop consists of MapReduce, the Hadoop Distributed File System (HDFS), and a number of related projects such as Apache Hive, HBase, and ZooKeeper. MapReduce and HDFS are the main components of Hadoop.
• HDFS is the storage layer of Hadoop, while MapReduce is responsible for processing data.

Hadoop Cluster
A Hadoop cluster is a special type of computational cluster designed for processing vast amounts of data in a distributed computing environment. These clusters run on low-cost commodity computers.

Hadoop Cluster Architecture
• The machines are organized into different racks.
• At the top of each rack there is a rack switch.
• A few machines act as the NameNode and the Resource Manager; these are referred to as masters.
• The majority of the machines act as DataNodes and Node Managers; these are referred to as slaves.
• Masters have configurations favoring more RAM and CPU and less local storage.
• Slave nodes have lots of local disk storage and moderate amounts of CPU and RAM.


Hadoop ecosystem
There are a number of projects underway at the Apache Software Foundation, in various states of maturity. The original slide includes a high-level diagram of the different Hadoop ecosystem projects (for ingesting, storing, analyzing, and maintaining data).

Most of the services available in the Hadoop ecosystem supplement the core components of Hadoop, which are HDFS, YARN, and MapReduce.

Hadoop core components
• HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It consists of two key components: the NameNode and the DataNode.

NameNode (Master):
• HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.
• It executes file system namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes (see the sketch at the end of this slide).
• It regularly receives heartbeats and block reports from the DataNodes in the cluster to ensure they are alive and working.

DataNode (Slave):
• DataNodes manage the storage attached to the nodes they run on.
• Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes.
• The DataNodes are responsible for serving read and write requests from the file system's clients.
• The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
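
As a rough illustration of the NameNode's role as a pure metadata server, the sketch below (not from the original slides; the file path is hypothetical) uses the Hadoop Java API to ask for the block locations of a file. The NameNode answers with the block-to-DataNode mapping; no block data is transferred.

```java
// Minimal sketch: listing the block locations of an HDFS file via the NameNode.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // client handle backed by the NameNode
        Path file = new Path("/data/sample.txt");   // hypothetical file

        FileStatus status = fs.getFileStatus(file);
        // The NameNode answers this metadata query; no block data is read.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```
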
Hadoop core components (contd.)

Points to remember
• Master/slave architecture: the NameNode (NN) acts as the master, while the DataNodes (DN) act as slaves.
• The NN manages the filesystem namespace.
• All client interaction starts with the NN.
• DNs store files as blocks.
• Each DN sends heartbeats and block reports to the NN.
• A file is broken into blocks, which are stored across DNs.
• The NN maintains the file-to-block mapping, block locations, order of blocks, and other metadata.
• The default block size is 128 MB.
• You can change the default block size per file (see the sketch after this list).
• The client interacts directly with DNs for reading and writing file data.
• Data is kept in different racks. Keeping two copies in one rack and one copy in another rack ensures that if an entire rack fails we still have a copy in another rack. This feature is called rack awareness.
• The client does not send blocks to all three DataNodes identified by the NameNode, because the client would be choked by transmitting to all of them at once. Instead, the client sends a block to the first DataNode, and it is then that DataNode's responsibility to contact and forward the data to the next DataNode. This is done through a data pipeline.
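
A minimal sketch of changing the block size, assuming the standard "dfs.blocksize" property; the 256 MB value and the path are illustrative, not from the slides:

```java
// Minimal sketch: writing an HDFS file with a non-default block size.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithCustomBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default (normally set in hdfs-site.xml); 128 MB unless overridden.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);

        // Per-file override: this create() overload takes replication and block size.
        Path out = new Path("/data/big-output.txt");        // hypothetical path
        try (FSDataOutputStream stream = fs.create(
                out,
                true,                     // overwrite if it exists
                4096,                     // buffer size in bytes
                (short) 3,                // replication factor
                256L * 1024 * 1024)) {    // block size for this file only
            stream.writeUTF("hello HDFS");
        }
        fs.close();
    }
}
```
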

Hadoop core components (contd.)
• Reading data from HDFS

The HDFS client asks the NameNode for the list of DataNodes that store replicas of the blocks making up the given file. For each block, the NameNode returns the addresses of the DataNodes that hold a copy of that block (metadata information).
Once the client receives this metadata, it requests the blocks directly from the DataNodes.
The locations of each block are ordered by their distance from the reader, and the client tries to read from the closest replica first; if that fails, it tries the next replica, and so on. Data is streamed from the DataNode back to the client.
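
A minimal sketch of this read path with the Hadoop Java API (the file path is an assumption): open() obtains the block metadata from the NameNode, and the returned stream then pulls the bytes from the DataNodes.

```java
// Minimal sketch: reading an HDFS file and copying it to stdout.
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // picks up core-site.xml
        FileSystem fs = FileSystem.get(conf);

        // open() asks the NameNode for block metadata; the stream reads from DataNodes.
        try (InputStream in = fs.open(new Path("/data/sample.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);  // stream bytes to stdout
        }
        fs.close();
    }
}
```
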
Hadoop core components (contd.)

MapReduce
• MapReduce is a software framework that enables us to write applications that process large data sets using distributed and parallel algorithms in a Hadoop environment.
• MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.
• Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
• The Reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples.
• It is easy to scale data processing over multiple computing nodes using MapReduce.
• The MapReduce paradigm is generally based on sending the computation to where the data resides.
• The framework manages all the details of data passing, such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
• After the given tasks are completed, the cluster collects and reduces the data to form an appropriate result and sends it back to the Hadoop server.

Hadoop core components (contd.)
• Example

Problem statement:
Count the number of occurrences of each word in a dataset.

Input dataset:
For simplicity, we use a small dataset (shown in the original slide); real applications use very large amounts of data. The slide also shows the sample input and the final result the client requires (each word with its count).

The original slides then illustrate the word-count job step by step with figures (a code sketch follows this list):
• MapReduce – Map function (Split step)
• MapReduce – Map function (Mapping step)
• MapReduce – Shuffle function (Merge step)
• MapReduce – Shuffle function (Sorting step)
• MapReduce – Reduce function (Reduce step)
• The full three-step process (Map → Shuffle → Reduce) with the WordCount example
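
A hedged code sketch of these steps, following the classic Hadoop WordCount structure (the class names are illustrative, not taken from the slides): map() emits (word, 1) pairs, the framework shuffles and sorts them by key, and reduce() sums the counts for each word.

```java
// Sketch of WordCount: Mapper emits (word, 1), Reducer sums the counts per word.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map step: split each input line into words and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: after the shuffle/sort groups pairs by word, sum the ones.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}
```

The driver that submits this job to the cluster is sketched in the YARN section below.
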
Hadoop core components (contd.)

YARN (Yet Another Resource Negotiator)
• YARN is the framework responsible for providing the computational resources (e.g., CPUs, memory) needed for application execution. Its two key components are the Resource Manager (master) and the Node Manager (slave).
• YARN was introduced in Hadoop 2.0 to remove the bottleneck of the Job Tracker that was present in Hadoop 1.0.
• The YARN architecture essentially separates the resource management layer from the processing layer.

The key YARN components are as follows (a job-submission sketch follows this list):
• Resource Manager: manages the use of resources across the cluster.
• Container: the name given to a package of resources including RAM, CPU, disk, etc.
• Node Manager: oversees the containers running on the cluster nodes.
• Application Master: negotiates with the Resource Manager for resources and runs the application-specific processes (map or reduce tasks) in those containers.
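
As a hedged illustration of how an application reaches YARN, the driver below configures and submits the WordCount job sketched earlier; the input/output paths and the explicit "mapreduce.framework.name" setting are assumptions for the example. On submission, the Resource Manager launches an Application Master, which negotiates containers for the map and reduce tasks.

```java
// Sketch: submitting the WordCount job to a YARN-managed cluster.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Run on YARN (the default in Hadoop 2.x when yarn-site.xml is present).
        conf.set("mapreduce.framework.name", "yarn");

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);   // from the earlier sketch
        job.setCombinerClass(WordCount.SumReducer.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/data/input"));    // hypothetical
        FileOutputFormat.setOutputPath(job, new Path("/data/output")); // must not exist yet

        // Submits the job; the Resource Manager starts an Application Master,
        // which asks for containers and runs the map and reduce tasks in them.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
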

Hadoop core components (contd.)

The original slide shows a flowchart of the YARN operations performed when a job is submitted; the numbered steps of the figure are not reproduced here.

Hadoop Ecosystem – other projects

• Hive
Hive is an open source data warehouse system for querying and analyzing large datasets stored in Hadoop files.
Hive uses a language called HiveQL (HQL), which is similar to SQL. HiveQL automatically translates SQL-like queries into MapReduce jobs that execute on Hadoop (see the JDBC sketch at the end of this slide).

• Ambari
A management platform for provisioning, managing, monitoring, and securing an Apache Hadoop cluster.

• ZooKeeper
ZooKeeper manages and coordinates a large cluster of machines. It is an open-source technology that maintains configuration information and provides synchronization and group services, deployed on the Hadoop cluster to administer the infrastructure.

• Sqoop
Used to import/export data between external datastores and the Hadoop Distributed File System or related Hadoop ecosystem projects such as Hive and HBase.
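
As a hedged sketch of HiveQL in practice (not from the slides; the server address and the "sales" table are assumptions), the example below submits a SQL-like query through the HiveServer2 JDBC driver, which Hive compiles into MapReduce jobs on the cluster.

```java
// Sketch: running a HiveQL query through the HiveServer2 JDBC driver.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-server:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             // A SQL-like HiveQL query; "sales" is a hypothetical table.
             ResultSet rs = stmt.executeQuery(
                     "SELECT region, COUNT(*) FROM sales GROUP BY region")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```
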
Hadoop Ecosystem – other projects

• HBase: a NoSQL database
• Pig: a high-level scripting language (Pig Latin)
• Chukwa: Hadoop log aggregation
• Scribe: more general log aggregation
• Cassandra: a column-store database on a P2P backend
• Dumbo: a Python library for streaming
• Kafka: a high-throughput, distributed messaging system
• Falcon: a data management framework
• Ranger: comprehensive security for Hadoop
• Knox: provides a single point of authentication and access for Apache Hadoop services in a cluster
• Atlas: metadata and governance platform
• Tez: a next-generation Hadoop query processing framework written on top of YARN
• Storm: an Apache stream processing framework
• Slider: a framework to deploy, manage, and monitor existing distributed applications on YARN
• Trident: provides a high-level abstraction on top of Storm and makes it easier to build topologies
• Impala: promoted for analysts and data scientists to perform analytics on data stored in Hadoop via SQL or business intelligence tools
• Solr: a standalone enterprise search server with a REST-like API

…and more.
Hadoop Distribution

Because Hadoop is an open source project, a number of vendors have developed their own distributions (built on stable Hadoop versions), adding new functionality or improving the code base.

Vendor distributions are, of course, designed to overcome issues with the open source edition and provide additional value to customers, with a focus on things such as:

• Reliability. The vendors react faster when bugs are detected. They promptly deliver fixes and patches, which makes their solutions more stable.

• Support. A variety of companies provide technical assistance, which makes it possible to adopt the platforms for mission-critical and enterprise-grade tasks.

• Completeness. Hadoop distributions are very often supplemented with other tools to address specific tasks.

Three of the top Hadoop distributions are provided by Cloudera, MapR, and Hortonworks.
