Understanding Hadoop Ecosystem


Hadoop ecosystem
• The Hadoop ecosystem is the collection of tools and technologies that work alongside Hadoop, collectively termed the Hadoop ecosystem, to enable the development and deployment of big data solutions in a cost-effective manner.
• The core components of the Hadoop ecosystem are HDFS and MapReduce.
• However, these two alone are not sufficient.
• All of these enable users to process large data sets (including in real time), provide tools to support various types of Hadoop projects, schedule jobs, and manage cluster resources.
•HDFS-Distributed storage layer for Hadoop.
•YARN – Resource management layer introduced in Hadoop 2.x.
•Hadoop Map-Reduce – Parallel processing layer for Hadoop.
•HBase – A column-oriented NoSQL database that runs on top of HDFS and does not use the structured query language. It suits sparse data sets well.
•Hive – Apache Hive is a data warehousing infrastructure based on Hadoop that enables easy data summarization using SQL-like queries.
•Pig – A high-level scripting language used with Hadoop. Pig enables writing complex data processing tasks without Java programming.
•Flume – It is a reliable system for efficiently collecting large amounts of log data
from many different sources in real-time.
•Sqoop – A tool designed to transfer huge volumes of data between Hadoop and relational databases (RDBMS).
•Oozie – A Java web application used to schedule Apache Hadoop jobs. It combines multiple jobs sequentially into one logical unit of work.
•Zookeeper – A centralized service for maintaining configuration information,
naming, providing distributed synchronization, and providing group services.
•Mahout – A library of scalable machine-learning algorithms, implemented on top of Apache Hadoop and using the MapReduce paradigm.
Advantages of Hadoop
• Stores data in native format: HDFS stores data in its native format; no structure is imposed while storing data.
• Scalable: Hadoop can store and distribute large data sets across commodity hardware.
• Cost effective: its scale-out architecture reduces the cost per terabyte of storage.
• Flexible: it can work with all kinds of data.
Elements at various stages of data processing
HDFS:
• HDFS is an efficient, fault-tolerant, and distributed approach for storing and managing huge volumes of data.
• Data loaded into a Hadoop cluster is first broken into smaller chunks called blocks and distributed across multiple nodes.
• These smaller subsets of data are then operated upon by the map and reduce functions.
• The results from each of these operations are combined to provide the aggregate outcome, i.e. the big data solution.
• HDFS keeps track of the distributed pieces of data using filesystem metadata.
• Metadata is "data about data".
• It acts as a template that provides the following information (a small example using the Java FileSystem API follows the list):
1. The time of creation, last access, modification, and deletion of a file.
2. The storage location of the blocks of a file on the cluster.
3. Access permissions to view and modify a file.
4. The number of files stored on the cluster.
5. The number of data nodes connected to form the cluster.
6. The location of the transaction log on the cluster.
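As a rough illustration of this metadata, the sketch below uses Hadoop's Java FileSystem API to print the modification time, permissions, replication factor, and block locations of a file. The path /user/demo/sample.txt is a hypothetical example, and the cluster address is assumed to come from core-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMetadataExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path, used only for illustration
        Path file = new Path("/user/demo/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // Metadata maintained by the name node: times, permissions, replication
        System.out.println("Modification time: " + status.getModificationTime());
        System.out.println("Permissions:       " + status.getPermission());
        System.out.println("Replication:       " + status.getReplication());

        // Where the blocks of this file are stored on the cluster
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block hosts: " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}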
Hadoop (master-slave architecture)
Name node and data node?
• Data nodes confirm their connectivity with the name node by sending periodic heartbeat messages.
• Whenever the name node stops receiving heartbeat messages from a data node, it unmaps that data node from the cluster.
• When heartbeat messages reappear or a new heartbeat message is received, the respective data node is added back to the cluster.
Secondary node and checkpointing?
HDFS data blocks?
Failure of a node?
Replication factor
HDFS Write Mechanism (Pipeline setup)
HDFS Write Mechanism (Writing a block)
HDFS Write Mechanism (Acknowledgement)
HDFS Multi-block write mechanism
HDFS (Read Mechanism)
HDFS commands
• -mkdir <path>
Creates a directory named path in HDFS.
• -ls <path>
Lists the contents of the directory specified by path, showing the names, permissions, owner,
size and modification date for each entry.
• -mv <src> <dest>
Moves the file or directory indicated by src to dest, within HDFS.
• -cp <src> <dest>
Copies the file or directory identified by src to dest, within HDFS.
• -get <src> <localDest>
Copies the file or directory in HDFS identified by src to the local file system path identified by
localDest.
• -rm <path>
Removes the file or empty directory identified by path.
• -cat <filename>
Displays the contents of filename on stdout.
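For comparison, the sketch below performs the same kinds of operations programmatically through Hadoop's Java FileSystem API. All paths are hypothetical examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Equivalent of -mkdir /user/demo/input
        fs.mkdirs(new Path("/user/demo/input"));

        // Copy a local file into HDFS; the local path is hypothetical
        fs.copyFromLocalFile(new Path("file:///tmp/data.txt"),
                             new Path("/user/demo/input/data.txt"));

        // Equivalent of -ls /user/demo/input
        for (FileStatus entry : fs.listStatus(new Path("/user/demo/input"))) {
            System.out.println(entry.getPath() + "  " + entry.getLen() + " bytes");
        }

        // Equivalent of -mv (rename within HDFS)
        fs.rename(new Path("/user/demo/input/data.txt"),
                  new Path("/user/demo/input/data-renamed.txt"));

        // Equivalent of -rm (the second argument controls recursive deletion)
        fs.delete(new Path("/user/demo/input/data-renamed.txt"), false);

        fs.close();
    }
}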
Package org.apache.hadoop.io:
Interfaces:

•RawComparator – A Comparator that operates directly on byte representations of objects.
•Stringifier – An interface that offers two methods to convert an object to a string representation and restore the object given its string representation.
•Writable – A serializable object which implements a simple, efficient serialization protocol, based on DataInput and DataOutput.
•WritableComparable – A Writable which is also Comparable.
•WritableFactory – A factory for a class of Writable.
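As a rough sketch of how Writable and WritableComparable are typically used, the hypothetical composite key below (year, temperature) implements the serialization protocol and a sort order. The class and field names are illustrative only.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical MapReduce key type serialized with the Writable protocol
public class YearTempKey implements WritableComparable<YearTempKey> {
    private int year;
    private int temperature;

    public YearTempKey() { }   // no-arg constructor required for deserialization

    public YearTempKey(int year, int temperature) {
        this.year = year;
        this.temperature = temperature;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialize
        out.writeInt(year);
        out.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialize
        year = in.readInt();
        temperature = in.readInt();
    }

    @Override
    public int compareTo(YearTempKey other) {                  // sort order used during the shuffle
        int cmp = Integer.compare(year, other.year);
        return cmp != 0 ? cmp : Integer.compare(temperature, other.temperature);
    }

    @Override
    public int hashCode() {                                     // used by the default HashPartitioner
        return year * 163 + temperature;
    }
}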
Package org.apache.hadoop.io: classes
HDFS HA Architecture:
•In Hadoop 1.x the NameNode was a single point of failure. The HA architecture solves this problem of NameNode availability by allowing us to have two NameNodes in an active/passive configuration.
•So, we have two NameNodes running at the same time in a High Availability cluster:
Active NameNode
Standby/Passive NameNode
•Hadoop 2.0 overcomes the single point of failure by providing support for an additional NameNode.
•The HDFS NameNode High Availability architecture provides the option of running two redundant NameNodes in the same cluster in an active/passive configuration with a hot standby.
Active NameNode – It handles all client operations in the cluster.
Passive NameNode – It is a standby NameNode, which holds the same data as the active NameNode.
•It acts as a slave, maintaining enough state to provide a fast failover if necessary.
•If the active NameNode fails, the passive NameNode takes over all the responsibilities of the active node and the cluster continues to work.
There are two issues in maintaining consistency in the HDFS High Availability cluster:

1. The active and standby NameNodes should always be in sync with each other, i.e. they should have the same metadata. This allows the Hadoop cluster to be restored to the namespace state where it crashed and therefore provides a fast failover.
2. There should be only one active NameNode at a time, because two active NameNodes would lead to corruption of the data. This kind of scenario is termed a split-brain scenario, where a cluster gets divided into smaller clusters, each one believing that it is the only active cluster. To avoid such scenarios, fencing is done. Fencing is the process of ensuring that only one NameNode remains active at a particular time.
You can deploy an HA cluster by preparing the following:
1. NameNode machines:
These are the machines on which you run the active and standby NameNodes.
These NameNodes must have similar hardware configurations.
2. Shared storage:
Both NameNode machines must have read/write access to a shared directory.
Features of HDFS:
•Highly scalable – HDFS is highly scalable as it can scale to hundreds of nodes in a single cluster.
•Replication – Due to unfavorable conditions, the node containing the data may be lost. To overcome such problems, HDFS always maintains a copy of the data on a different machine.
•Fault tolerance – In HDFS, fault tolerance signifies the robustness of the system in the event of failure. HDFS is highly fault-tolerant: if any machine fails, another machine containing a copy of that data automatically takes over.
•Distributed data storage – This is one of the most important features of HDFS and makes Hadoop very powerful. Here, data is divided into multiple blocks and stored across the nodes.
•Portable – HDFS is designed in such a way that it can easily be ported from one platform to another.
MapReduce
•MapReduce is a data processing tool used to process data in parallel in a distributed manner.
•MapReduce is a paradigm with two phases: the mapper phase and the reducer phase.
•In the mapper, the input is given in the form of key-value pairs.
•The output of the mapper is fed to the reducer as input.
•The reducer runs only after the mapper phase is over.
•The reducer also takes its input in key-value format, and the output of the reducer is the final output (a minimal word count sketch follows).
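A minimal word count sketch of these two phases, written against the standard org.apache.hadoop.mapreduce API; the class names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: input key = byte offset of the line, input value = the line of text;
// output key = word, output value = 1
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: receives (word, [1, 1, ...]) after the shuffle and emits (word, total count)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}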
YARN
•YARN (Yet Another Resource Negotiator) takes Hadoop beyond Java MapReduce programming and lets other applications such as HBase and Spark work on the same cluster.
•Different YARN applications can co-exist on the same cluster, so MapReduce, HBase, and Spark can all run at the same time, bringing great benefits for manageability and cluster utilization.
Components of YARN:
•Client: For submitting MapReduce jobs (a driver sketch follows this list).
•Resource Manager: Manages the use of resources across the cluster.
•Node Manager: Launches and monitors the compute containers on machines in the cluster.
•MapReduce Application Master: Coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
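As a rough sketch of the client role, the driver below configures and submits a MapReduce job, reusing the hypothetical word count mapper and reducer shown earlier. The input and output paths are illustrative; on a YARN cluster the job is handed to the ResourceManager, which schedules the application master and task containers.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // mapper from the earlier sketch
        job.setReducerClass(WordCountReducer.class); // reducer from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical paths; the output directory must not already exist
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // Submits the job and waits; containers are scheduled by the resource manager
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}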
HBase
•HBase is an open-source, sorted map datastore built on top of Hadoop.
•It is column-oriented and horizontally scalable.
•It is based on Google's Bigtable.
•It has a set of tables which keep data in key-value format.
•HBase is well suited for sparse data sets, which are very common in big data use cases.
•HBase provides APIs enabling development in practically any programming language (a small client sketch follows this list).
•It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System.
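A minimal sketch of this key-value access pattern using the standard HBase Java client API. The table name "users" and column family "info" are hypothetical and assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseKeyValueExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for the ZooKeeper quorum, etc.
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write: row key -> column family:qualifier -> value
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random real-time read by row key
            Get get = new Get(Bytes.toBytes("user1"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}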
Why HBase?

•An RDBMS becomes exponentially slower as the data grows large.
•It expects data to be highly structured, i.e. able to fit in a well-defined schema.
•Any change in the schema might require downtime.
•For sparse datasets, there is too much overhead in maintaining NULL values.
Regions

•Regions are nothing but tables that are split up and spread across the
region servers.
Region server:
•A region server hosts regions and performs the following functions:
•Communicates with the client and handles data-related operations.
•Handles read and write requests for all the regions under it.
•Decides the size of the regions by following the region size thresholds.
•When we take a deeper look into the region server, it contains regions and stores, as described below:
•A store contains the MemStore and HFiles.
•The MemStore is just like a cache memory.
•Anything that is entered into HBase is stored here initially.
•Later, the data is transferred and saved in HFiles as blocks, and the MemStore is flushed.
