Apache Hive


Apache Hive
Hive provides a SQL-like interface for users to extract data from the Hadoop system. In the Hadoop stack, Hive sits above the other layers:

• Hive: SQL
• Batch processing
• Resource management
• Storage: HDFS, HBase

Example query:

SELECT t1.a1 AS c1, t2.b1 AS c2
FROM t1 JOIN t2 ON (t1.a2 = t2.b2);
Features of Hive

• Originally developed by Facebook around 2007
• Is an open-source Apache project
• High-level abstraction layer on top of MapReduce and Apache Spark
• Uses HiveQL
• Suitable for structured data

Case Study
A leading online education company uses Hive and Impala to analyze social media coverage.

The organization analyzes positive, negative, and neutral reviews using Hive.
Hive Architecture
The major components of Hive architecture are: Hadoop core components, Metastore, Driver, and Hive clients.

• Hive clients: Hive CLI, Hive Thrift Server (accessed via JDBC/ODBC), and Hive Web Interface
• Driver: Parser, Planner, Optimizer, and Execution
• Metastore: backed by an RDBMS
• Hadoop core: MapReduce for execution, HDFS for storage
Job Execution Flow in Hive

Receive SQL query
1. Parse HiveQL
2. Make optimizations
3. Plan execution
4. Submit job(s) to cluster
5. Monitor progress
6. Process data in MapReduce or Apache Spark
7. Store the data in HDFS
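The parse/optimize/plan steps above can be observed without running any job by using Hive's EXPLAIN statement, which prints the execution plan. A minimal sketch, reusing the hypothetical t1/t2 tables from the earlier example:

```sql
-- Print the stages Hive plans for this query (parse, optimize, plan)
-- without submitting anything to the cluster.
EXPLAIN
SELECT t1.a1 AS c1, t2.b1 AS c2
FROM t1 JOIN t2 ON (t1.a2 = t2.b2);
```

The output lists the stage dependencies and the map/reduce (or Spark) work each stage performs, which maps directly onto steps 1-4 of the flow.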


Interfaces to Run Hive Queries
Hive offers several interfaces for running queries:

• Command-line shell
  – Beeline
• Hue Web UI
  – Hive Query Editor
  – Metastore Manager
• ODBC / JDBC
Connecting with Hive

Hive can be run using Beeline

Hue can be used to write a Hive query from a UI


Running Hive Queries Using Beeline

'!' is used to execute Beeline commands.

Below are a few commands for running Beeline:

• !exit – exit the shell
• !help – show a list of all commands
• !verbose – show additional detail for queries
Running Beeline from Command Line
Below are the command lines for running Beeline:

To execute a script file, use the -f option:
  beeline -u ... -f sample.hql

To run HiveQL directly from the command line, use the -e option:
  beeline -u ... -e 'SELECT * FROM users'

To continue running a script even after an error, use the --force option:
  beeline -u ... --force=true -f sample.hql
Running Hive Query
SQL commands are terminated with a semicolon (;):

hive> SELECT * FROM device LIMIT 5;
OK
1   2008-10-21 00:00:00   Sorrento   F00L   phone
2   2010-04-19 00:00:00   Titanic    2100   phone
3   2011-02-18 00:00:00   MeeToo     3.0    phone
4   2011-09-21 00:00:00   MeeToo     3.1    phone
5   2008-10-21 00:00:00   iFruit     1
Time taken: 0.296 seconds, Fetched: 5 row(s)
Data Types
Data Types in Hive
The data types in Hive are as follows:

Primitive types
• Integers: TINYINT, SMALLINT, INT, and BIGINT
• Boolean: BOOLEAN
• Floating point numbers: FLOAT and DOUBLE
• String: STRING

Complex types
• Structs: {a INT; b INT}
• Maps: M['group']
• Arrays: ['a', 'b', 'c']; A[1] returns 'b'

User-defined types
• Structures with attributes
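The complex types above can be combined in a single table definition and accessed with the syntax just listed. A minimal sketch, using a hypothetical `people` table:

```sql
CREATE TABLE people (
  name    STRING,
  address STRUCT<street:STRING, city:STRING>,  -- struct: fields accessed by name
  scores  MAP<STRING, INT>,                    -- map: values accessed by key
  phones  ARRAY<STRING>                        -- array: values accessed by 0-based index
);

-- Access syntax for each complex type:
SELECT
  address.city,     -- struct field
  scores['math'],   -- map lookup by key
  phones[0]         -- first array element
FROM people;
```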
Changing Table Data Location

• By default, table data is stored in the default warehouse location, /user/hive/warehouse
• Use LOCATION to specify the HDFS directory where you want your data to reside

CREATE TABLE sample (
  id INT,
  class STRING,
  fees INT,
  posted TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/bda/sample';
External Table

• Tables are "managed" or "internal" by default: when a managed table is dropped, its data is also deleted
• Use EXTERNAL to create an external (unmanaged) table
• Dropping an external table removes only its metadata; the data itself remains in HDFS

CREATE EXTERNAL TABLE sample (
  id INT,
  class STRING,
  fees INT,
  posted TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/bda/sample';
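Whether a given table is managed or external can be checked with DESCRIBE FORMATTED; the Table Type field in its output distinguishes the two. A sketch, assuming the sample table above exists:

```sql
-- Inspect the table's metadata, including its type and HDFS location.
DESCRIBE FORMATTED sample;
-- Relevant lines in the output:
--   Table Type:  EXTERNAL_TABLE   (or MANAGED_TABLE)
--   Location:    .../bda/sample
```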
Validation of Data
Hive follows “schema on read”

• Unlike RDBMS, Hive does not validate data on insert


• Files are simply moved into place, which makes loading data into tables faster in Hive
• Errors in file formats are discovered when queries are performed

Missing data is represented as NULL.
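Schema on read can be demonstrated directly: a value that does not match the declared column type is not rejected at load time, it simply reads back as NULL. A sketch, using a hypothetical table with an INT column:

```sql
CREATE TABLE readings (id INT, temp INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- Suppose the underlying text file contains the line:  1,not_a_number
-- The file is moved into place without validation; the type mismatch
-- surfaces only when the data is read:
SELECT id, temp FROM readings;
-- temp comes back as NULL for that row
```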


Loading of Data
• Data can be moved from an HDFS file directly into the Hive table's directory
hdfs dfs -mv /sample/data /user/hive/warehouse/sample/

• Data can be loaded using the following query:

LOAD DATA INPATH '/sample/data'

OVERWRITE INTO TABLE sample;
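LOAD DATA can also take files from the local filesystem by adding the LOCAL keyword, and without OVERWRITE the new files are added alongside the table's existing data. A sketch with hypothetical paths:

```sql
-- LOCAL copies a file from the local filesystem into the table's
-- HDFS directory (the local original is left in place).
LOAD DATA LOCAL INPATH '/home/training/data.txt'
INTO TABLE sample;

-- Without LOCAL, the HDFS file is moved; without OVERWRITE,
-- existing table data is kept.
LOAD DATA INPATH '/sample/more_data'
INTO TABLE sample;
```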


Loading Data from RDBMS
• Sqoop provides support for importing data into Hive
• Using the --hive-import option in Sqoop, you can:
  o create a table in the Hive metastore
  o import data from the RDBMS into the table's directory in HDFS

sqoop import \
  --connect jdbc:mysql://localhost/sample \
  --username training \
  --password training \
  --fields-terminated-by '\t' \
  --table employees \
  --hive-import

--hive-import creates a table accessible in Hive.
Hive DDL and DML
Defining Database and Table

• Databases and tables are created and managed using the DDL (Data Definition Language) of HiveQL

• They are very similar to standard SQL DDL

• For example, Create/Drop/Alter/Use Database
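For example, switching databases and listing what exists works much as in standard SQL. A short sketch (the property name below is a hypothetical illustration):

```sql
SHOW DATABASES;   -- list all databases in the metastore
USE sample;       -- make 'sample' the current database
SHOW TABLES;      -- list tables in the current database

-- ALTER DATABASE can attach key/value properties:
ALTER DATABASE sample SET DBPROPERTIES ('owner' = 'training');
```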


Creating a Database

• To create a new database:

  CREATE DATABASE <dbname>;

  CREATE DATABASE sample;

• The above statement adds the database definition to the metastore and also creates a storage directory in HDFS in the default location, for example:

  /user/hive/warehouse/sample.db

• To avoid an error in case the database already exists:

  CREATE DATABASE IF NOT EXISTS <dbname>;

  CREATE DATABASE IF NOT EXISTS sample;

Deleting a Database

• Removing a database is similar to creating it:
  o replace CREATE with DROP

  DROP DATABASE <dbname>;

  DROP DATABASE sample;

• To avoid an error in case the database does not exist:

  DROP DATABASE IF EXISTS <dbname>;

• To remove a database that still contains tables:

  DROP DATABASE <dbname> CASCADE;

  Caution: this might also remove the tables' data in HDFS.
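Before dropping with CASCADE it is worth checking what the database still contains, since managed-table data in HDFS may be deleted along with it. A short sketch, reusing the sample database:

```sql
-- See which tables the CASCADE drop would remove
SHOW TABLES IN sample;

-- Drops the database together with all its tables;
-- data of managed tables is deleted from HDFS as well.
DROP DATABASE sample CASCADE;
```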


Creating a New Table

Syntax to create a new table:

CREATE TABLE tablename (colname DATATYPE, ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE|SEQUENCEFILE|...};

• The statement creates a subdirectory in the database's warehouse directory in HDFS:
  o Default database: /user/hive/warehouse/tablename
  o Named database: /user/hive/warehouse/dbname.db/tablename
Table Creation: Example

• The following example shows how to create a new table named sample
• Data is stored as text with four comma-separated fields per line

CREATE TABLE sample (
  id INT,
  class STRING,
  fees INT,
  posted TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Hive Metastore
Managing Data with Hive
Hive uses the Metastore service to store metadata for Hive tables.

• A table is an HDFS directory containing zero or more files

Default path: /user/hive/warehouse/<table_name>

• Table supports many formats for data storage and retrieval


• The Metastore stores the created metadata
  o Contained in an RDBMS such as MySQL
• Hive tables are stored in HDFS, and the relevant metadata is stored in the Metastore
What Is Hive Metastore?
The Metastore is the component that stores the system catalog which
contains metadata about tables, columns, and partitions.
Use of Metastore in Hive

• Hive uses the metastore to get the table structure and the location of the data
• The Hive server then queries the actual data, which is stored in HDFS

1. Get table structure and location of data (metadata in RDBMS)
2. Query actual data (data in HDFS files)
Data Warehouse Directory Structure

• By default, all data gets stored in

/user/hive/warehouse

• Each table is a directory within the default location containing one or more files

Customers table:

customer_id   name      country
001           Alice     us
002           Bob       ca
003           Carlos    mx
004           Dieter    de
....          ....      ....
392           Maria     it
393           Nigel     uk
394           Ophelia   dk
....          ....      ....

Stored in HDFS under /user/hive/warehouse/customers, split across files:
• File 01: rows 001-003, ...
• File 02: rows 392-394, ...

In HDFS, Hive data can be split into more than one file.
