Big Data and Hadoop
By
Ujjwal Kumar Gupta
Contents
Why Big Data & Hadoop
Drawbacks of Traditional Database
Hadoop History
What is Hadoop & How it Works
Hadoop Cluster
Hadoop Ecosystem
Course Topics
Week 1
Understanding Big Data
Hadoop Architecture
Week 2
Introduction to Hadoop 2.x
Data Loading Techniques
Hadoop Project Environment
Week 3
Hadoop MapReduce Framework
Programming MapReduce
Hadoop Installation and Cluster Setup
Week 4
Analytics using Pig
Understanding Pig Latin
Week 5
Analytics using Hive
Understanding Hive QL
Sqoop Connectivity
Week 6
NoSQL Databases
Understanding HBase
Zookeeper
Week 7
Apache Spark Framework
Programming Spark
Week 8
Real World Datasets and Analysis
Project Discussion
Drawbacks of Traditional Databases
Expensive: out of reach for small and mid-size companies
Scalability: as data grows, expanding the system is a challenging task
Time consuming: storing and processing the data takes a lot of time
What is Hadoop
Open source framework designed for storage and
processing of large scale data on clusters of
commodity hardware
Created by Doug Cutting in 2006.
Cutting named the program after his son's toy elephant.
3 Vs of Big Data
Volume, Velocity, and Variety
Hadoop Cluster
Hadoop Commands
1. Print the Hadoop version
hadoop version
2. List the contents of the root directory in HDFS
hadoop fs -ls /
3. Report the amount of space used and available on the currently mounted
filesystem
hadoop fs -df hdfs:/
4. Count the number of directories, files and bytes
hadoop fs -count hdfs:/
5. Run a DFS filesystem checking utility
hadoop fsck /
Hadoop Commands
6. Create a new directory
hadoop fs -mkdir /user/
Hadoop Commands
11. Download File from HDFS to local system
hadoop fs -get /user/test.txt /home/hadoop/
12. Copy a file from one directory to another
hadoop fs -cp /usr/text.txt /input/
13. Move a file from one directory to another
hadoop fs -mv /text.txt /input/
14. Change the replication factor
hadoop fs -setrep -w 2 apache_hadoop/sample.txt
15. Copy data from one cluster to another
hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop
MapReduce Overview
MapReduce Features
Fault tolerance
Parallel processing: the job is split into individual units of work that run in parallel
Map Phase
Reads the assigned input split from HDFS, parses the input into records
(key/value pairs), applies the map function to each record, and informs the
master node of its completion.
Partition Phase
Each mapper determines which reducer will receive each of its outputs. For any
key, the destination partition is the same. The number of partitions equals the
number of reducers.
Shuffle Phase
Fetches input data from all map tasks for the portion corresponding to the
reduce task's bucket.
Sort Phase
Merge-sorts all map outputs into a single run.
Reduce Phase
Applies the user-defined reduce function to the merged run. Arguments: a key
and the corresponding list of values. Writes output to a file in HDFS.
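For concreteness, a minimal WordCount sketch in Java, the classic Hadoop example of the phases above (class names and paths are illustrative, not taken from these slides): the mapper emits (word, 1) pairs, and the reducer sums the values for each key.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: parse each input record into (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // partitioned by key, then shuffled and sorted
      }
    }
  }

  // Reduce phase: receive a key and the list of its values, sum the counts.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // written to a file in HDFS
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}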
Introduction to Pig
Pig is one of the components of the Hadoop eco-system.
Pig is a high-level data flow scripting language.
Pig runs on the Hadoop clusters.
Pig is an Apache open-source project.
Pig uses HDFS for storing and retrieving data and Hadoop MapReduce for
processing Big Data.
Data Models
As part of its data model, Pig supports four basic types:
Atom - a single atomic value of a simple type such as int, long, float, or chararray
Tuple - an ordered set of fields
Bag - a collection of tuples
Map - an associative array; the key must be a chararray but the value can
be any type
Example: [name#Mike,phone#5551212]
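For illustration, a hedged Pig Latin LOAD statement whose schema uses all four types; the file path and field names are made up, not from the course material.

people = LOAD '/user/hadoop/people.txt'
         AS (name:chararray,                          -- atom
             home:tuple(city:chararray, zip:int),     -- tuple
             phones:bag{t:tuple(number:chararray)},   -- bag of tuples
             props:map[]);                            -- map with chararray keys
DUMP people;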
Installing Pig
1. Download the Pig tar file from the Apache website
2. Extract the tar file
$ tar xvzf pig-0.15.0.tar.gz
3. Move file to install location
$ sudo mv pig-0.15.0 /usr/local/pig
4. Set path in bashrc
$ sudo gedit ~/.bashrc
Add the following lines at the end of the file
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin
5. Reload bashrc to apply the changes
$ source ~/.bashrc
Pig Commands
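A short Pig Latin sketch of some common commands; the file paths, relation names, and field names below are illustrative, not from the course material.

-- load a comma-separated file with an explicit schema
emps = LOAD '/user/hadoop/emp.csv' USING PigStorage(',')
       AS (id:int, name:chararray, dept:chararray, salary:float);
-- filter, group, and aggregate
high_paid = FILTER emps BY salary > 50000.0;
by_dept = GROUP high_paid BY dept;
avg_sal = FOREACH by_dept GENERATE group AS dept, AVG(high_paid.salary) AS avg_salary;
-- print to the console or store the result back into HDFS
DUMP avg_sal;
STORE avg_sal INTO '/user/hadoop/avg_salary' USING PigStorage(',');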
Introduction to Hive
Hive is a data warehouse system for Hadoop that facilitates ad-hoc
queries and the analysis of large data sets stored in Hadoop.
It provides a SQL-like language called HiveQL (HQL). Due to its SQL-like
interface, Hive is a popular choice for Hadoop analytics.
It provides massive scale-out and fault-tolerance capabilities for data
storage and processing on commodity hardware.
Relying on MapReduce for execution, Hive is batch-oriented and has high
latency for query execution.
Hive architecture: client interfaces (JDBC, ODBC, Web Interface), Thrift Server,
Driver (Compiler, Optimizer, Executor), and Metastore, running on top of Hadoop
(NameNode, DataNode, Node Manager, Resource Manager).
Metastore
Metastore is the component that stores the system catalog and metadata about tables,
columns, partitions, and so on. Metadata is stored in a traditional RDBMS format.
Apache Hive uses the Derby database by default. Any JDBC-compliant database, such as
MySQL, can be used for the Metastore.
Metastore Configuration
The key attributes that should be configured for Hive Metastore are given below:
Parameter: javax.jdo.option.ConnectionURL
Description: JDBC connection URL for the metastore database
Example: jdbc:derby://localhost:1527/metastore_db;create=true

Parameter: javax.jdo.option.ConnectionDriverName
Description: JDBC driver class for the metastore database
Example: org.apache.derby.jdbc.ClientDriver

Parameter: javax.jdo.option.ConnectionUserName
Description: User name for connecting to the metastore database
Example: APP

Parameter: javax.jdo.option.ConnectionPassword
Description: Password for connecting to the metastore database
Example: mine
Metastore Configuration Template
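As a sketch, a minimal hive-site.xml using the Derby values from the previous slide; the property names are the standard Hive Metastore keys, and the file usually lives at $HIVE_HOME/conf/hive-site.xml (the exact location may differ per installation).

<configuration>
  <!-- values taken from the example column above -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.ClientDriver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>APP</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>mine</value>
  </property>
</configuration>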
Driver
Driver is the component that:
manages the lifecycle of a Hive Query Language (HiveQL) statement as
it moves through Hive; and
maintains a session handle and any session statistics.
Query Compiler
Query compiler compiles HiveQL into a Directed Acyclic Graph (DAG) of
MapReduce tasks.
Query optimizer:
consists of a chain of transformations, so that the operator DAG resulting from
one transformation is passed as an input to the next transformation.
performs tasks such as column pruning, partition pruning, and repartitioning of data.
Hive Server
Hive Server provides a thrift interface and a Java Database Connectivity/Open
Database Connectivity
(JDBC/ODBC) server. It enables the integration of Hive with other applications.
Client Components
A developer uses the client component to perform development in Hive. The
client component
includes the Command Line Interface (CLI), the web user interface (UI), and the
JDBC/ODBC driver.
Client components: the Command Line Interface, the Web Interface, and the
JDBC/ODBC drivers, which connect to Hive through the Thrift Server.
Hive Tables
Tables in Hive are analogous to tables in relational databases. A Hive table logically
comprises the data being stored and the associated meta data. Each table has a
corresponding directory in HDFS.
Two types of tables in Hive
Managed Tables - tables managed by Hive
External Tables - tables managed by the user
CREATE TABLE t1(ds string, ctry float, li array<map<string, struct<p1:int,
p2:int>>>);
CREATE EXTERNAL TABLE test_extern(c1 string, c2 int) LOCATION
'/user/mytables/mydata';
Primitive Types
Integers: TINYINT, SMALLINT, INT, and BIGINT
Boolean: BOOLEAN
Floating point numbers: FLOAT and DOUBLE
String: STRING
Complex Types
Structs: {a INT; b INT}
Maps: M['group']
Arrays: ['a', 'b', 'c'], A[1] returns 'b'
User-defined Types
Structures with attributes; attributes can be of any type
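As an illustration, a hedged HiveQL sketch that declares and reads the complex types above; the table and column names are made up.

-- declare columns using the complex types listed above
CREATE TABLE employees_demo (
  name STRING,
  salary FLOAT,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address STRUCT<street:STRING, city:STRING, zip:INT>
);
-- arrays are indexed by position, maps by key, structs by field name
SELECT subordinates[0], deductions['tax'], address.city
FROM employees_demo;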
Installing Hive
1. Download the Hive tar file from the Apache website
2. Extract the tar file
$ tar xvzf apache-hive-2.0.0-bin.tar.gz
3. Move file to install location
$ sudo mv apache-hive-2.0.0-bin /usr/local/hive
4. Set path in bashrc
$ sudo gedit ~/.bashrc
Add the following lines at the end of the file
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
5. Reload bashrc to apply the changes
$ source ~/.bashrc
6. Initialize the Metastore database
$ schematool -initSchema -dbType derby
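To verify the installation (a minimal check; the exact prompt and the databases listed depend on your setup), start the Hive CLI and run a simple statement:

$ hive
hive> SHOW DATABASES;
hive> quit;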
Static Partition
Static partitioning is the default type of partitioning in Hive. With static
partitioning, we have to create every partition of the table manually and load
the data for each partition separately.
1. Creating a partitioned table
CREATE TABLE employee (id INT, name STRING)
PARTITIONED BY (country STRING, state STRING);
2. Loading data into a partition
LOAD DATA LOCAL INPATH '/home/user/sample.txt'
INTO TABLE employee
PARTITION (country = 'US', state = 'CA');
3. Listing the partitions of a table
SHOW PARTITIONS employee;
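For illustration, a hedged query against the employee table above: because the WHERE clause restricts only the partition columns, Hive reads just the matching partition directory in HDFS.

SELECT id, name FROM employee
WHERE country = 'US' AND state = 'CA';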
Static Partition
1. Altering a partitioned table
ALTER TABLE employee
ADD PARTITION (year=2013)
LOCATION '/2012/part2012';
ALTER TABLE employee PARTITION (year=1203)
RENAME TO PARTITION (Yoj=1203);
ALTER TABLE employee DROP [IF EXISTS]
PARTITION (year=1203);
2. Inserting into a partition with a query
INSERT OVERWRITE TABLE test_part PARTITION (ds='2009-01-01', hr=12)
SELECT * FROM t;
Dynamic Partition
Dynamic partitioning creates partitions in a table automatically; we do not have to
provide a separate file for each partition. Dynamic partitioning is disabled by default,
so it must be enabled before use.
Enabling Dynamic Partition
SET hive.exec.dynamic.partition=true;
SET hive.exec.max.dynamic.partitions=2048;
SET hive.exec.max.dynamic.partitions.pernode=256; -- in case of a cluster
SET hive.exec.dynamic.partition.mode=nonstrict;
Dynamic Partition
Loading Data in Dynamic Partition
CREATE TABLE part_u (
id INT, name STRING)
PARTITIONED BY (
year INT, month INT, day INT);
CREATE TABLE users (
id INT, name STRING, dt DATE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/home/user/sample.txt'
INTO TABLE users;
INSERT INTO TABLE part_u PARTITION (year, month, day)
SELECT id, name, year(dt), month(dt), day(dt)
FROM users;
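The INSERT creates one partition for each distinct (year, month, day) combination found in users; the generated partitions can be listed with:

SHOW PARTITIONS part_u;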