CS19741-Cloud Computing-Unit 3 Notes
Design of HDFS, Concepts and Java Interface, Dataflow of File read & File write, Map Reduce, Input
splitting, map and reduce functions, Specifying input and output parameters, Configuring and Running a
Job. Hadoop Vs Spark.
Case Study: Design and Implementation of Hive, Pig, HBase.
HADOOP
Hadoop is an open source distributed processing framework that manages data processing and
storage for big data applications in scalable clusters of computer servers.
It's at the center of big data technologies that are primarily used to support advanced analytics
initiatives, including predictive analytics, data mining and machine learning.
Unstructured information: data that does not have any particular internal structure; for
example, plain text or image data, internet clickstream records, web server and mobile
application logs, social media posts, customer emails and sensor data from the Internet of
Things (IoT).
RAID (redundant array of independent disks) is a way of storing the same data in different places
(thus, redundantly) on multiple hard disks. It is a data storage virtualization technology that combines
multiple physical disk drives into a single logical unit for the purposes of data redundancy, performance
improvement, or both.
HDFS clusters do not benefit from using RAID for DataNode storage (although RAID is
recommended for the NameNode's disks, to protect against corruption of its metadata). The redundancy
that RAID provides is not needed, since HDFS handles it by replication between nodes.
The Hadoop Distributed File System (HDFS) is the primary storage system used
by Hadoop applications.
RAID is not used for DataNodes, but for some master-node processes, such as the Hive Metastore, it can be appropriate.
Processing Techniques:
Batch Processing: Processing of previously collected jobs in a single batch.
The component to provide online access was HBase, a key-value store that uses HDFS for its
underlying storage. HBase provides both online read/write access of individual rows and batch operations
for reading and writing data in bulk, making it a
good solution for building applications on.
Goals of HDFS
• Very Large Distributed File System
– 10K nodes, 100 million files, 10PB
• Assumes Commodity Hardware
– Files are replicated to handle hardware failure,
– Detect failures and recover from them
• Optimized for Batch Processing
– Data locations exposed so that computations can move to where data resides
– Provides very high aggregate bandwidth
Distributed File System
• Single Namespace for entire cluster
• Data Coherency
– Write-once-read-many access model
– Client can only append to existing files
• Files are broken up into blocks
– Typically 64MB block size
– Each block replicated on multiple DataNodes
• Intelligent Client
– Client can find location of blocks
– Client accesses data directly from DataNode
HDFS Architecture
Functions of a NameNode
• Manages File System Namespace
– Maps a file name to a set of blocks
– Maps a block to the DataNodes where it resides
• Cluster Configuration Management
• Replication Engine for Blocks
NameNode Metadata
• Metadata in Memory
– The entire metadata is in main memory
– No demand paging of metadata
• Types of metadata
– List of files
– List of Blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
• A Transaction Log
– Records file creations, file deletions etc
DataNode
• A Block Server
– Stores data in the local file system (e.g. ext3)
– Stores metadata of a block (e.g. CRC)
– Serves data and metadata to Clients
• Block Report
– Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
Block Placement
• Current Strategy
– One replica on local node
– Second replica on a remote rack
– Third replica on same remote rack
– Additional replicas are randomly placed
• Clients read from nearest replicas
• Would like to make this policy pluggable
Heartbeats
• DataNodes send heartbeat to the NameNode: Once every 3 seconds
• NameNode uses heartbeats to detect DataNode failure
Replication Engine
• NameNode detects DataNode failures
– Chooses new DataNodes for new replicas
– Balances disk usage
– Balances communication traffic to DataNodes
Data Correctness
• Use Checksums to validate data: Use CRC32
• File Creation
– Client computes checksum per 512 bytes
– DataNode stores the checksum
• File access
– Client retrieves the data and checksum from DataNode
– If Validation fails, Client tries other replicas
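As an illustration only (not HDFS's internal code), the following sketch uses java.util.zip.CRC32 to checksum a buffer in 512-byte chunks, mirroring the idea described above; the class and method names are made up for this example.
import java.util.zip.CRC32;

public class ChunkChecksum {
    // Illustrative sketch: compute one CRC32 value per 512-byte chunk of a buffer,
    // mirroring the way an HDFS client checksums data in small chunks.
    public static long[] checksumChunks(byte[] data, int bytesPerChecksum) {
        int chunks = (data.length + bytesPerChecksum - 1) / bytesPerChecksum;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int offset = i * bytesPerChecksum;
            int length = Math.min(bytesPerChecksum, data.length - offset);
            crc.reset();
            crc.update(data, offset, length);
            sums[i] = crc.getValue();  // checksum kept alongside the chunk on the DataNode
        }
        return sums;
    }
}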
NameNode Failure
• A single point of failure
• Transaction Log stored in multiple directories
– A directory on the local file system
– A directory on a remote file system (NFS/CIFS)
• Need to develop a real HA solution
Data Pipelining
• Client retrieves a list of DataNodes on which to place replicas of a block
• Client writes block to the first DataNode
• The first DataNode forwards the data to the next node in the Pipeline
• When all replicas are written, the Client moves on to write the next block in file
Rebalancer
• Goal: % disk full on DataNodes should be similar
– Usually run when new DataNodes are added
– Cluster is online when Rebalancer is active
– Rebalancer is throttled to avoid network congestion
– Command line tool
Secondary NameNode
• Copies FsImage and Transaction Log from Namenode to a temporary directory
• Merges FSImage and Transaction Log into a new FSImage in temporary directory
• Uploads new FSImage to the NameNode: Transaction Log on NameNode is purged
Java Interface, Dataflow of File read & File write,
Java Interface
The Hadoop FileSystem class is the API for interacting with one of Hadoop's filesystems.
Although we focus mainly on the HDFS implementation, DistributedFileSystem, in general you should
strive to write your code against the FileSystem abstract class, to retain portability across filesystems. This
is very useful when testing your program, for example, because you can rapidly run tests using data stored
on the local filesystem. One of the simplest ways to read a file from a Hadoop filesystem is by using a
java.net.URL object to open a stream to read the data from.
Table 3.1. Hadoop filesystems
Commands for the HDFS User:
% hadoop dfs -mkdir /foodir
% hadoop dfs -cat /foodir/myfile.txt
% hadoop dfs -rm /foodir/myfile.txt
Commands for HDFS Administrator
% hadoop dfsadmin -report
% hadoop dfsadmin -decommission datanodename
Create a directory
To see how it is displayed in the listing:
% hadoop fs -mkdir books
% hadoop fs -ls .
Found 2 items
drwxr-xr-x - tom supergroup 0 2014-10-04 13:22 books
-rw-r----- 1 tom supergroup 119 2014-10-04 13:21 quangle.txt
The mode string (for example rw- r-- ---) shows the permissions for the file's owner, group, and others, in that order.
Copying a file from the local filesystem to HDFS:
% hadoop fs -copyFromLocal input/docs/quangle.txt \
  hdfs://localhost/user/tom/quangle.txt
% hadoop fs -ls .
quangle.txt
% export HADOOP_CLASSPATH=hadoop-examples.jar
% hadoop URLCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
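The URLCat program invoked above is not listed in these notes; the following sketch, along the lines of the standard example, reads a file by registering Hadoop's FsUrlStreamHandlerFactory so that java.net.URL understands hdfs:// URLs.
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
    static {
        // Can only be called once per JVM, so this approach cannot be used if
        // another part of the program already sets a URLStreamHandlerFactory.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}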
Displaying files from a Hadoop filesystem on standard output by using the FileSystem directly
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
The program runs as follows:
% hadoop FileSystemCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
Displaying files from a Hadoop filesystem on standard output twice, by using seek()
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemDoubleCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.seek(0); // go back to the start of the file
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
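The examples above cover only the read path. A sketch of the write path, copying a local file into a Hadoop filesystem with FileSystem.create() and printing a dot whenever progress is reported, along the lines of the standard FileCopyWithProgress example, is:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];   // local input file
        String dst = args[1];        // destination, e.g. an hdfs:// URI (illustrative)
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        // create() returns an output stream; progress() is called periodically
        // as data is written to the DataNode pipeline.
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() {
                System.out.print(".");
            }
        });
        IOUtils.copyBytes(in, out, 4096, true);  // true closes the streams when done
    }
}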
MapReduce - What?
• MapReduce is a programming model for efficient distributed computing
• It works like a Unix pipeline
– cat input | grep | sort | uniq -c | cat > output
– Input | Map | Shuffle & Sort | Reduce | Output
• Efficiency from
– Streaming through data, reducing seeks
– Pipelining
• A good fit for a lot of applications
– Log processing
– Web index building
MapReduce - Features
• Fine grained Map and Reduce tasks
– Improved load balancing
– Faster recovery from failed tasks
• Automatic re-execution on failure
– In a large cluster, some nodes are always slow or flaky
– Framework re-executes failed tasks
• Locality optimizations
– With large data, bandwidth to data is a problem
– Map-Reduce + HDFS is a very effective solution
– Map-Reduce queries HDFS for locations of input data
– Map tasks are scheduled close to the inputs when possible
MapReduce – Dataflow
Input splitting, map and reduce functions, Specifying input and output parameters,
MapReduce:
MapReduce is a programming framework that allows us to perform distributed and parallel processing
on large data sets in a distributed environment.
MapReduce consists of two distinct tasks – Map and Reduce.
As the name MapReduce suggests, the reducer phase takes place after the mapper phase has been
completed.
So, the first is the map job, where a block of data is read and processed to produce key-value pairs
as intermediate outputs.
The output of a Mapper or map job (key-value pairs) is input to the Reducer.
The reducer receives the key-value pair from multiple map jobs.
Then, the reducer aggregates those intermediate data tuples (intermediate key-value pair) into a
smaller set of tuples or key-value pairs which is the final output.
Let us understand more about MapReduce and its components. MapReduce mainly has the following
three classes:
Mapper Class
The first stage in data processing using MapReduce is the Mapper class. Here, the RecordReader
processes each input record and generates the corresponding key-value pair. Hadoop stores this
intermediate data on the local disk.
Input Split: It is the logical representation of data. It represents a chunk of work that is processed by a
single map task in the MapReduce program.
RecordReader: It interacts with the input split and converts the obtained data into key-value pairs.
Reducer Class
The Intermediate output generated from the mapper is fed to the reducer which processes it and
generates the final output which is then saved in the HDFS.
Driver Class
The major component in a MapReduce job is a Driver Class. It is responsible for setting up a
MapReduce job to run in Hadoop. We specify the names of the Mapper and Reducer classes along with the
data types and their respective job names.
Advantages of MapReduce
The two biggest advantages of MapReduce are:
1. Parallel Processing:
In MapReduce, we are dividing the job among multiple nodes and each node works with a part of
the job simultaneously. So, MapReduce is based on Divide and Conquer paradigm which helps us to
process the data using different machines. As the data is processed by multiple machines instead of a
single machine in parallel, the time taken to process the data gets reduced by a tremendous amount as
shown in the figure below.
Fig.: Traditional Way vs. MapReduce Way
2. Data Locality:
Instead of moving data to the processing unit, we are moving the processing unit to the data in the
MapReduce Framework. In the traditional system, we used to bring data to the processing unit and
process it. But, as the data grew and became very huge, bringing this huge amount of data to the
processing unit posed the following issues:
Moving huge data to processing is costly and deteriorates the network performance.
Processing takes time as the data is processed by a single unit which becomes the bottleneck.
The master node can get over-burdened and may fail.
Now, MapReduce allows us to overcome the above issues by bringing the processing unit to the data: the
data is distributed among multiple nodes, and each node processes the part of the data residing on it. This
allows us to have the following advantages:
It is very cost-effective to move the processing unit to the data.
The processing time is reduced, as all the nodes work on their part of the data in parallel.
Every node gets a part of the data to process, and therefore there is no chance of a node getting
overburdened.
Hadoop MapReduce Example: Word Count
• Mapper
– Input: value: lines of text of input
– Output: key: word, value: 1
• Reducer
– Input: key: word, value: set of counts
– Output: key: word, value: sum
• Launching program
– Defines this job
– Submits job to cluster
Let us understand, how a MapReduce works by taking an example where I have a text file called
example.txt whose contents are as follows:
Dear, Bear, River, Car, Car, River, Deer, Car and Bear
Now, suppose we have to perform a word count on example.txt using MapReduce. So, we will
be finding the unique words and the number of occurrences of those unique words.
First, we divide the input into three splits as shown in the figure. This will distribute the work
among all the map nodes.
Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to each of the
tokens or words. The rationale behind giving a hardcoded value equal to 1 is that every word, in
itself, will occur once.
Now, a list of key-value pairs will be created where the key is nothing but the individual word and
the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs: Dear, 1; Bear, 1;
River, 1. The mapping process remains the same on all the nodes.
After the mapper phase, a partition process takes place where sorting and shuffling happen so that
all the tuples with the same key are sent to the corresponding reducer.
So, after the sorting and shuffling phase, each reducer will have a unique key and a list of values
corresponding to that very key. For example, Bear, [1,1]; Car, [1,1,1].., etc.
Now, each Reducer counts the values which are present in that list of values. As shown in the
figure, reducer gets a list of values which is [1,1] for the key Bear. Then, it counts the number of
ones in the very list and gives the final output as – Bear, 2.
Finally, all the output key/value pairs are then collected and written in the output file.
Word Count Mapper
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private Text word = new Text();
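The fragment above stops before the map() method. A sketch of the remainder, in the style of the classic word count written against the same old (org.apache.hadoop.mapred) API, is shown below; it assumes the usual imports from java.io, java.util, org.apache.hadoop.io and org.apache.hadoop.mapred.
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Tokenize the line and emit (word, 1) for every token.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Sum the counts for this word and emit (word, sum).
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}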
Job is typically used to specify the Mapper, combiner (if any), Partitioner, Reducer, InputFormat,
OutputFormat implementations. FileInputFormat indicates the set of input files
(FileInputFormat.setInputPaths(Job, Path…)/FileInputFormat.addInputPath(Job, Path)) and
(FileInputFormat.setInputPaths(Job, String…)/FileInputFormat.addInputPaths(Job, String)) and
where the output files should be written (FileOutputFormat.setOutputPath(Path)).
Optionally, Job is used to specify other advanced facets of the job such as the Comparator to be
used, files to be put in the DistributedCache, whether intermediate and/or job outputs are to be compressed
(and how), whether job tasks can be executed in a speculative manner
(setMapSpeculativeExecution(boolean)/setReduceSpeculativeExecution(boolean)), maximum number
of attempts per task (setMaxMapAttempts(int)/ setMaxReduceAttempts(int)) etc.
Configured Parameters:The following properties are localized in the job configuration for each task’s
execution:
Name Type Description
mapreduce.job.id String The job id
mapreduce.job.jar String job.jar location in job directory
mapreduce.job.local.dir String The job specific shared scratch space
mapreduce.task.id String The task id
mapreduce.task.attempt.id String The task attempt id
mapreduce.task.is.map boolean Is this a map task
mapreduce.task.partition int The id of the task within the job
mapreduce.map.input.file String The filename that the map is reading from
mapreduce.map.input.start long The offset of the start of the map input split
mapreduce.map.input.length long The number of bytes in the map input split
mapreduce.task.output.dir String The task's temporary output directory
JobConfs :
• Jobs are controlled by configuring JobConf
• JobConfs are maps from attribute names to string values
• The framework defines attributes to control how the job is Executed
– conf.set("mapred.job.name", "MyApp");
• Applications can add arbitrary values to the JobConf
– conf.set("my.string", "foo");
– conf.setInt("my.integer", 12);
• JobConf is available to all tasks
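A minimal driver sketch in the style of the classic old-API word count, tying the JobConf settings above to the Map and Reduce classes sketched earlier (class and path names are illustrative):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");                 // job name shown by the JobTracker

        conf.setOutputKeyClass(Text.class);           // types of the reducer output
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);               // mapper and reducer classes
        conf.setCombinerClass(Reduce.class);          // combiner reuses the reducer
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);   // how input is split and read
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output directory

        JobClient.runJob(conf);                       // submit the job and wait for completion
    }
}
It would typically be packaged into a jar and run with a command along the lines of
% hadoop jar wordcount.jar WordCount /user/tom/input /user/tom/output (the jar name and paths are illustrative).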
Hadoop Vs Spark:
“Parallel computing is the simultaneous use of more than one processor to solve a problem.”
“Distributed computing is the simultaneous use of more than one computer to solve a problem.”
Hadoop is designed to handle batch processing efficiently whereas Spark is designed to handle
real-time data efficiently. Hadoop is a high latency computing framework, which does not have an
interactive mode whereas Spark is a low latency computing and can process data interactively.
1. HDFS – Hadoop Distributed File System. This is the file system that manages the storage of
large sets of data across a Hadoop cluster. HDFS can handle both structured and unstructured data.
The storage hardware can range from any consumer-grade HDDs to enterprise drives.
2. MapReduce. The processing component of the Hadoop ecosystem. It assigns the data fragments
from the HDFS to separate map tasks in the cluster. MapReduce processes the chunks in parallel
to combine the pieces into the desired result.
3. YARN. Yet Another Resource Negotiator. Responsible for managing computing resources and job
scheduling.
4. Hadoop Common. The set of common libraries and utilities that other modules depend on.
Another name for this module is Hadoop core, as it provides support for all other Hadoop
components.
What is Hadoop?
Apache Hadoop is a platform that handles large datasets in a distributed fashion. The framework
uses MapReduce to split the data into blocks and assign the chunks to nodes across a cluster. MapReduce
then processes the data in parallel on each node to produce a unique output.
Every machine in a cluster both stores and processes data. Hadoop stores the data to disks
using HDFS. The software offers seamless scalability options. You can start with as low as one machine
and then expand to thousands, adding any type of enterprise or commodity hardware.
The Hadoop ecosystem is highly fault-tolerant. Hadoop does not depend on hardware to achieve
high availability. At its core, Hadoop is built to look for failures at the application layer. By replicating
data across a cluster, when a piece of hardware fails, the framework can build the missing parts from
another location. The nature of Hadoop makes it accessible to everyone who needs it. The open-source
community is large and paved the path to accessible big data processing.
What is Spark?
Apache Spark is an open-source tool. This framework can run in a standalone mode or on a cloud
or cluster manager such as Apache Mesos, and other platforms. It is designed for fast performance and
uses RAM for caching and processing data.
Spark performs different types of big data workloads. This includes MapReduce-like batch
processing, as well as real-time stream processing, machine learning, graph computation, and interactive
queries. With easy to use high-level APIs, Spark can integrate with many different libraries,
including PyTorch and TensorFlow.
The Spark engine was created to improve the efficiency of MapReduce and keep its benefits. Even
though Spark does not have its file system, it can access data on many different storage solutions. The data
structure that Spark uses is called Resilient Distributed Dataset, or RDD.
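To make the contrast with the MapReduce version concrete, here is a minimal sketch of a word count using Spark's RDD API from Java; the application name and the input and output paths are illustrative.
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile(args[0]);              // e.g. an HDFS path
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())  // split into words
                .mapToPair(word -> new Tuple2<>(word, 1))                    // emit (word, 1)
                .reduceByKey((a, b) -> a + b);                               // sum counts per word

        counts.saveAsTextFile(args[1]);                            // output directory
        sc.close();
    }
}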
There are five main components of Apache Spark:
1. Apache Spark Core. The basis of the whole project. Spark Core is responsible for necessary
functions such as scheduling, task dispatching, input and output operations, fault recovery, etc.
Other functionalities are built on top of it.
2. Spark Streaming. This component enables the processing of live data streams. Data can originate
from many different sources, including Kafka, Kinesis, Flume, etc.
3. Spark SQL. Spark uses this component to gather information about the structured data and how
the data is processed.
4. Machine Learning Library (MLlib). This library consists of many machine learning algorithms.
MLlib's goal is scalability and making machine learning more accessible.
5. GraphX. A set of APIs used for facilitating graph analytics tasks.
Hive Storage Format: Items to be considered while choosing a file format for storage include:
Support for columnar storage
Splitability
Compression
Schema evolution
Indexing capabilities
A Simple Query
• Find all page views coming from xyz.com on March 31st:
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url LIKE '%xyz.com';
• If the table is partitioned by date, Hive reads only the partitions for 2008-03-01 through 2008-03-31 instead of scanning the entire table
Pig Limitations:
Pig does not support random reads or queries in the order of tens of milliseconds.
Pig does not support random writes to update small portions of data, all writes are bulk,
streaming writes, just like MapReduce.
Low latency queries are not supported in Pig, thus it is not suitable for OLAP and OLTP.
Pig Architecture: The architecture conveys that:
1. Pig Latin scripts or Pig commands from the Grunt shell are submitted to the Pig Engine.
2. The Pig Engine parses, compiles, optimizes, and fires MapReduce statements.
3. MapReduce accesses HDFS and returns the results.
One of the common use cases of Pig is data pipelines. A common example is web companies bringing in
logs from their web servers, cleansing the data, and pre-computing common aggregates before loading
them into their data warehouse. In this case, the data is loaded onto the grid, and Pig is used to clean out
records from bots and records with corrupt data. It is also used to join web event data against user
databases so that user cookies can be connected with known user information.
What is the need for Pig when we already have Mapreduce?
MapReduce is a low-level data set processing paradigm, whereas Pig provides a high level of
abstraction for processing large data sets. Though both MapReduce and Pig are used for processing data
sets, and Pig transformations are converted into a series of MapReduce jobs, the major differences between
the MapReduce processing framework and the Pig framework are listed below.
Pig Latin provides all of the standard data-processing operations, such as join, filter, group
by, order by, union, etc. MapReduce provides the group by operation directly, but order by, filter,
projection, join are not provided and must be written by the user.
Mapreduce vs. Pig
• Mapreduce: It is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain and reuse. Pig: High-level programming.
• Mapreduce: The development cycle is very long; writing mappers and reducers, compiling and packaging the code, submitting jobs, and retrieving the results is a time-consuming process. Pig: No need for compiling or packaging of code; Pig operators are converted into map or reduce tasks internally.
• Mapreduce: To extract a small portion of data from a large dataset, Mapreduce is preferable. Pig: Pig is not suitable for extracting small portions of data from a large dataset, since it is set up to scan the whole dataset, or at least large portions of it.
• Mapreduce: Not easily extendable; we need to write functions starting from scratch. Pig: UDFs tend to be more reusable than the libraries developed for writing MapReduce programs.
• Mapreduce: We need MapReduce when we need very deep and fine-grained control over the way we want to process our data. Pig: Sometimes it is not very convenient to express what we need exactly in terms of Pig and Hive queries.
• Mapreduce: Performing dataset joins is very difficult. Pig: Joins are simple to achieve in Pig.
Difference Between Hive and Pig:
Hive can be treated as a competitor for Pig in some cases, and Hive also operates on HDFS similar to
Pig, but there are some significant differences. HiveQL is a query language based on SQL, but Pig Latin is
not a query language; it is a data-flow scripting language.
Since Pig Latin is procedural, it fits very naturally in the pipeline paradigm. HiveQL on the other
hand is declarative.
Example Problem
Suppose you have user data in a file,
website data in another, and you need to find the top
5 most visited pages by users aged 18-25.
HBase
HBase is a column-oriented non-relational database management system that runs on top
of Hadoop Distributed File System (HDFS). HBase provides a fault-tolerant way of storing sparse data
sets, which are common in many big data use cases. It is well suited for real-time data processing or
random read/write access to large volumes of data.
Unlike relational database systems, HBase does not support a structured query language like SQL;
in fact, HBase isn't a relational data store at all. HBase applications are written in Java™ much like a
typical Apache MapReduce application. HBase does support writing applications in Apache Avro, REST
and Thrift.
An HBase system is designed to scale linearly. It comprises a set of standard tables with rows and
columns, much like a traditional database. Each table must have an element defined as a primary key, and
all access attempts to HBase tables must use this primary key.
Avro, as a component, supports a rich set of primitive data types including: numeric, binary data and
strings; and a number of complex types including arrays, maps, enumerations and records. A sort order can
also be defined for the data.
HBase relies on ZooKeeper for high-performance coordination. ZooKeeper is built into HBase,
but if you're running a production cluster, it's suggested that you have a dedicated ZooKeeper cluster
that's integrated with your HBase cluster.
HBase works well with Hive, a query engine for batch processing of big data, to enable fault-
tolerant big data applications.
HBase - What?
• Modeled on Google's Bigtable
• Row/column store
• Billions of rows/millions of columns
• Column-oriented - nulls are free
• Untyped - stores byte[]
The important topics that I will be taking you through in this HBase architecture blog are:
HBase Data Model
HBase Architecture and its Components
HBase Write Mechanism
HBase Read Mechanism
HBase Performance Optimization Mechanisms
HBase Data Model
HBase is a column-oriented NoSQL database. Although it looks similar to a relational database,
which contains rows and columns, it is not a relational database. Relational databases are row-oriented
while HBase is column-oriented. To better understand the difference between column-oriented and
row-oriented databases, consider a table of records: a row-oriented database stores all the fields of a
record together, row by row, whereas a column-oriented database such as HBase stores the values of
each column together.
HDFS: HDFS is in contact with the HBase components and stores a large amount of data in a
distributed manner.
HBase read and write operations from the client into HFiles can be described by the following steps.
Step 1) The client wants to write data; it first communicates with the Region server and then with the regions.
Step 2) The regions contact the MemStore for storing data associated with the column family.
Step 3) Data is first stored in the MemStore, where it is sorted, and after that it is flushed into an HFile.
The main reason for using the MemStore is to store data in a distributed file system based on the row key.
The MemStore is placed in the Region server's main memory, while HFiles are written into HDFS.
Step 4) The client wants to read data from the regions.
Step 5) The client can have direct access to the MemStore, and it can request data from it.
Step 6) The client approaches the HFiles to get the data. The data are fetched and retrieved by the client.
The MemStore holds in-memory modifications to the store. The hierarchy of objects in HBase regions,
from top to bottom, is:
Table – the HBase table present in the HBase cluster
Store – one store per ColumnFamily for each region of the table
Memstore – one MemStore for each store for each region of the table; it sorts data before flushing into
HFiles, and write and read performance increase because of this sorting
StoreFile – StoreFiles for each store for each region of the table
HBase vs. HDFS
• HBase: Accessed through shell commands, a client API in Java, REST, Avro or Thrift. HDFS: Primarily accessed through MapReduce (MR) jobs.
• HBase: Both storage and processing can be performed. HDFS: It is only for storage.
Some typical IT industry applications use HBase operations along with Hadoop. Applications
include stock exchange data and online banking data operations, for which HBase is a well-suited
processing solution.
HTable table = …
Text row = new Text("enclosure1");
Text col1 = new Text("animal:type");
Text col2 = new Text("animal:size");

BatchUpdate update = new BatchUpdate(row);
update.put(col1, "lion".getBytes("UTF-8"));
update.put(col2, "big".getBytes("UTF-8"));
table.commit(update);

update = new BatchUpdate(row);
update.put(col1, "zebra".getBytes("UTF-8"));
table.commit(update);
HBase - Querying
• Retrieve a cell
Cell = table.getRow("enclosure1").getColumn("animal:type").getValue();
• Retrieve a row
RowResult = table.getRow("enclosure1");
• Scan through a range of rows
Scanner s = table.getScanner(new String[] { "animal:type" });
Assignment 4:
Case Study: Map Reduce techniques:
Try it by yourself, this work will help during your placement time
Specifying input and output parameters,
Input splitting,
Mapping and
Reducing functions,
Design and Implementation of Hive, Pig, HBase.
Configuring and Running a Job using hadoop, .
Consider your own student data set. The data set can also come from different formats (Facebook, Twitter,
Oracle database, notepad, Excel, etc.). The data set consists of different sets of marks/grades, students
from different locations (city/village), categories, various family backgrounds, family education, different
modes of education (native language/specific language), students from different streams of study
(school/polytechnic/college/university; the education may be +2/Diploma/Engineering/Arts/U.G./
P.G./Ph.D.), present work and earnings, certificate courses and placement record. From this data set,
analyse which type of students are performing well in the outside world/social sector. The analysis is
based on the age group between 25 and 30.
Assume the data set is located in various locations in various forms, that is, the records are stored in
a distributed fashion. For example, school student records are stored in different school database servers,
and college students' academic records are available in the university records.
Work:
1. Set up your own Hadoop platform and use different types of tools (Spark, Hive, HBase, Pig, based on your
requirements); a minimum of two tools should be used in your work.
2. Consider different types of data formats adopted by different sectors.
3. Create your own database.
4. Design your own architecture for analysis purposes.
Output:
1. Screenshots of the setup
2. The steps involved while setting up
3. Configuration
4. Analysis report / your prediction
Work:
1. Reference contents and installation guides from various web sites, e.g.
https://www.tutorialspoint.com
2. Hands-on, try yourself: Configuring and Running a Job; Hadoop vs Spark.
Interview Based Questions
1) What is MapReduce?
It is a framework or a programming model that is used for processing large data sets over clusters
of computers using distributed programming.
13) How can we change the split size if our commodity hardware has less storage space?
If our commodity hardware has less storage space, we can change the split size by writing a
'custom splitter'. There is a feature of customization in Hadoop which can be called from the main
method.
17) Why we cannot do aggregation (addition) in a mapper? Why we require reducer for that?
We cannot do complete aggregation (addition) in a mapper because sorting and grouping by key are
not done on the mapper side; they happen during the shuffle and sort phase before the reducer. Each
mapper works only on its own input split, so the values for a given key may be spread across many
mappers, and no single mapper sees all of them. The reducer receives all the values for a key after
sorting, and can therefore aggregate them correctly.
20) What is the difference between an HDFS Block and Input Split?
HDFS Block is the physical division of the data and Input Split is the logical division of the data.
25) What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop
Cluster?
JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. There
is only one JobTracker process running on any Hadoop cluster. The JobTracker runs in its own JVM
process. In a typical production cluster it runs on a separate machine. Each slave node is configured with
the JobTracker node location. The JobTracker is a single point of failure for the Hadoop MapReduce
service. If it goes down, all running jobs are halted. The JobTracker in Hadoop performs the following
actions (from the Hadoop wiki):
Client applications submit jobs to the Job tracker.
The JobTracker talks to the NameNode to determine the location of the data
The JobTracker locates TaskTracker nodes with available slots at or near the data
The JobTracker submits the work to the chosen TaskTracker nodes.
The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they
are deemed to have failed and the work is scheduled on a different TaskTracker.
A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do
then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it
may even blacklist the TaskTracker as unreliable.
When the work is completed, the JobTracker updates its status.
Client applications can poll the JobTracker for information.
27) What is a Task Tracker in Hadoop? How many instances of TaskTracker run on a Hadoop
Cluster
A TaskTracker is a slave node daemon in the cluster that accepts tasks (Map, Reduce and
Shuffle operations) from a JobTracker. There is only one TaskTracker process running on any Hadoop
slave node. The TaskTracker runs in its own JVM process. Every TaskTracker is configured with a set of
slots; these indicate the number of tasks that it can accept. The TaskTracker starts a separate JVM
process to do the actual work (called a Task Instance); this is to ensure that process failure does not
take down the TaskTracker. The TaskTracker monitors these task instances, capturing the output and
exit codes. When the task instances finish, successfully or not, the TaskTracker notifies the
JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few
minutes, to reassure the JobTracker that it is still alive. These messages also inform the JobTracker of
the number of available slots, so the JobTracker can stay up to date with where in the cluster work can
be delegated.
30) What is configuration of a typical slave node on Hadoop cluster? How many JVMs run on a
slave node?
Single instance of a Task Tracker is run on each Slave node. Task tracker is run as a separate
JVM process. Single instance of a DataNode daemon is run on each Slave node. DataNode daemon is
run as a separate JVM process. One or Multiple instances of Task Instance is run on each slave node.
Each task instance is run as a separate JVM process. The number of Task instances can be controlled
by configuration. Typically a high end machine is configured to run more task instances.
35) What are combiners? When should I use a combiner in my MapReduce Job?
Combiners are used to increase the efficiency of a MapReduce program. They are used to
aggregate intermediate map output locally on individual mapper outputs. Combiners can help you
reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer
code as a combiner if the operation performed is commutative and associative. The execution of the
combiner is not guaranteed; Hadoop may or may not execute a combiner, and if required it may
execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's
execution.
37) What is the Hadoop MapReduce API contract for a key and value Class?
The Key must implement the org.apache.hadoop.io.WritableComparable interface.
The value must implement the org.apache.hadoop.io.Writable interface.
42) What is HDFS Block size? How is it different from traditional file system block size?
In HDFS, data is split into blocks and distributed across multiple nodes in the cluster. Each block is
typically 64 MB or 128 MB in size. Each block is replicated multiple times; the default is to replicate each
block three times, and replicas are stored on different nodes. HDFS utilizes the local file system to store
each HDFS block as a separate file. The HDFS block size cannot be compared with the traditional file
system block size.
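For example, a 1 GB (1,024 MB) file stored with a 128 MB block size is split into 8 blocks; with the default replication factor of 3, the cluster holds 24 block replicas, i.e. about 3 GB of raw disk for 1 GB of data.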
43) What is a NameNode? How many instances of NameNode run on a Hadoop Cluster?
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all
files in the file system, and tracks where across the cluster the file data is kept. It does not store the
data of these files itself. There is only one NameNode process running on any Hadoop cluster. The
NameNode runs in its own JVM process. In a typical production cluster it runs on a separate machine. The
NameNode is a single point of failure for the HDFS cluster. When the NameNode goes down, the
file system goes offline. Client applications talk to the NameNode whenever they wish to locate a file,
or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by
returning a list of relevant DataNode servers where the data lives.
44) What is a DataNode? How many instances of DataNode run on a Hadoop Cluster?
A DataNode stores data in the Hadoop File System (HDFS). There is only one DataNode
process running on any Hadoop slave node. The DataNode runs in its own JVM process. On startup, a
DataNode connects to the NameNode. DataNode instances can talk to each other; this happens mostly
when they are replicating data.