Apache Hive


Apache Hive
Hive provides a SQL-like interface for users to extract data from the Hadoop system. In the Hadoop stack, Hive sits above the other layers:

• Hive: SQL
• Batch processing
• Resource management
• Storage: HDFS, HBase

Example query:

SELECT t1.a1 AS c1, t2.b1 AS c2
FROM t1 JOIN t2 ON (t1.a2 = t2.b2);
Features of Hive

• Originally developed by Facebook around 2007
• Is an open-source Apache project
• High-level abstraction layer on top of MapReduce and Apache Spark
• Uses HiveQL
• Suitable for structured data

Case Study
A leading online education company uses Hive and Impala to analyze social media coverage.

The organization analyzes positive, negative, and neutral reviews using Hive.
Hive Architecture
The major components of Hive architecture are: Hadoop core components, Metastore, Driver, and Hive clients.

• Hive clients: Hive CLI, Hive Thrift Server (accessed via JDBC/ODBC), and Hive Web Interface
• Driver: Parser, Planner, Optimizer, and Execution
• Metastore: backed by an RDBMS
• Hadoop core: MapReduce for execution, HDFS for storage
Job Execution Flow in Hive

Receive SQL query
1. Parse HiveQL
2. Make optimizations
3. Plan execution
4. Submit job(s) to cluster
5. Monitor progress
6. Process data in MapReduce or Apache Spark
7. Store the data in HDFS
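The parse/optimize/plan steps above can be observed without running any job by using Hive's EXPLAIN statement, which prints the execution plan. A minimal sketch, reusing the hypothetical t1/t2 tables from the earlier example:

```sql
-- Print the stages Hive plans for this query (parse, optimize, plan)
-- without submitting anything to the cluster.
EXPLAIN
SELECT t1.a1 AS c1, t2.b1 AS c2
FROM t1 JOIN t2 ON (t1.a2 = t2.b2);
```

The output lists the stage dependencies and the map/reduce (or Spark) work each stage performs, which maps directly onto steps 1-4 of the flow.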


Interfaces to Run Hive Queries
Hive offers several interfaces for running queries:

• Command-line shell
  – Beeline
• Hue Web UI
  – Hive Query Editor
  – Metastore Manager
• ODBC / JDBC
Connecting with Hive

Hive can be run using Beeline

Hue can be used to write a Hive query from a UI


Running Hive Queries Using Beeline

'!' is used to execute Beeline commands.

Below are a few commands for running Beeline:

• !exit – exit the shell
• !help – show a list of all commands
• !verbose – show additional detail for queries
Running Beeline from Command Line
Below are the command lines for running Beeline:

To execute a script file, use the -f option:
  beeline -u ... -f sample.hql

To run HiveQL directly from the command line, use the -e option:
  beeline -u ... -e 'SELECT * FROM users'

To continue running a script even after an error, use the --force option:
  beeline -u ... --force=true -f sample.hql
Running Hive Query
SQL commands are terminated with a semicolon (;):

hive> SELECT * FROM device LIMIT 5;
OK
1   2008-10-21 00:00:00   Sorrento   F00L   phone
2   2010-04-19 00:00:00   Titanic    2100   phone
3   2011-02-18 00:00:00   MeeToo     3.0    phone
4   2011-09-21 00:00:00   MeeToo     3.1    phone
5   2008-10-21 00:00:00   iFruit     1
Time taken: 0.296 seconds, Fetched: 5 row(s)
Data Types
Data Types in Hive
The data types in Hive are as follows:

Primitive types
• Integers: TINYINT, SMALLINT, INT, and BIGINT
• Boolean: BOOLEAN
• Floating point numbers: FLOAT and DOUBLE
• String: STRING

Complex types
• Structs: {a INT; b INT}
• Maps: M['group']
• Arrays: ['a', 'b', 'c']; A[1] returns 'b'

User-defined types
• Structures with attributes
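The complex types above can be combined in a single table definition and accessed with the syntax just listed. A minimal sketch, using a hypothetical `people` table:

```sql
CREATE TABLE people (
  name    STRING,
  address STRUCT<street:STRING, city:STRING>,  -- struct: fields accessed by name
  scores  MAP<STRING, INT>,                    -- map: values accessed by key
  phones  ARRAY<STRING>                        -- array: values accessed by 0-based index
);

-- Access syntax for each complex type:
SELECT
  address.city,     -- struct field
  scores['math'],   -- map lookup by key
  phones[0]         -- first array element
FROM people;
```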
Changing Table Data Location

• By default, table data is stored in the default warehouse location, /user/hive/warehouse
• Use LOCATION to specify the HDFS directory where you want your data to reside

CREATE TABLE sample (
  id INT,
  class STRING,
  fees INT,
  posted TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/bda/sample';
External Table

• Tables are "managed" or "internal" by default: when a managed table is dropped, its data is also deleted
• Use EXTERNAL to create an external (unmanaged) table
• Dropping an external table removes only its metadata; the data itself remains in HDFS

CREATE EXTERNAL TABLE sample (
  id INT,
  class STRING,
  fees INT,
  posted TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/bda/sample';
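Whether a given table is managed or external can be checked with DESCRIBE FORMATTED; the Table Type field in its output distinguishes the two. A sketch, assuming the sample table above exists:

```sql
-- Inspect the table's metadata, including its type and HDFS location.
DESCRIBE FORMATTED sample;
-- Relevant lines in the output:
--   Table Type:  EXTERNAL_TABLE   (or MANAGED_TABLE)
--   Location:    .../bda/sample
```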
Validation of Data
Hive follows “schema on read”

• Unlike RDBMS, Hive does not validate data on insert


• Files are simply moved into place, which makes loading data into tables faster in Hive
• Errors in file formats are discovered when queries are performed

Missing data is represented as NULL.
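Schema on read can be demonstrated directly: a value that does not match the declared column type is not rejected at load time, it simply reads back as NULL. A sketch, using a hypothetical table with an INT column:

```sql
CREATE TABLE readings (id INT, temp INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- Suppose the underlying text file contains the line:  1,not_a_number
-- The file is moved into place without validation; the type mismatch
-- surfaces only when the data is read:
SELECT id, temp FROM readings;
-- temp comes back as NULL for that row
```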


Loading of Data
• Data can be moved from an HDFS file directly into the Hive table's directory
hdfs dfs -mv /sample/data /user/hive/warehouse/sample/

• Data can be loaded using the following query:

LOAD DATA INPATH '/sample/data'

OVERWRITE INTO TABLE sample;
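LOAD DATA can also take files from the local filesystem by adding the LOCAL keyword, and without OVERWRITE the new files are added alongside the table's existing data. A sketch with hypothetical paths:

```sql
-- LOCAL copies a file from the local filesystem into the table's
-- HDFS directory (the local original is left in place).
LOAD DATA LOCAL INPATH '/home/training/data.txt'
INTO TABLE sample;

-- Without LOCAL, the HDFS file is moved; without OVERWRITE,
-- existing table data is kept.
LOAD DATA INPATH '/sample/more_data'
INTO TABLE sample;
```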


Loading Data from RDBMS
• Sqoop provides support for importing data into Hive
• Using the --hive-import option in Sqoop, you can:
  o create a table in the Hive metastore
  o import data from the RDBMS into the table's directory in HDFS

sqoop import \
  --connect jdbc:mysql://localhost/sample \
  --username training \
  --password training \
  --fields-terminated-by '\t' \
  --table employees \
  --hive-import

--hive-import creates a table accessible in Hive.
Hive DDL and DML
Defining Database and Table

• Databases and tables are created and managed using the DDL (Data Definition Language) of HiveQL

• They are very similar to standard SQL DDL

• For example, Create/Drop/Alter/Use Database
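For example, switching databases and listing what exists works much as in standard SQL. A short sketch (the property name below is a hypothetical illustration):

```sql
SHOW DATABASES;   -- list all databases in the metastore
USE sample;       -- make 'sample' the current database
SHOW TABLES;      -- list tables in the current database

-- ALTER DATABASE can attach key/value properties:
ALTER DATABASE sample SET DBPROPERTIES ('owner' = 'training');
```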


Creating a Database

• To create a new database:

  CREATE DATABASE <dbname>;

  CREATE DATABASE sample;

• The above statement adds the database definition to the metastore and also creates a storage directory in HDFS in the default location, for example:

  /user/hive/warehouse/sample.db

• To avoid an error in case the database already exists:

  CREATE DATABASE IF NOT EXISTS <dbname>;

  CREATE DATABASE IF NOT EXISTS sample;

Deleting a Database

• Removing a database is similar to creating it:
  o replace CREATE with DROP

  DROP DATABASE <dbname>;

  DROP DATABASE sample;

• To avoid an error in case the database does not exist:

  DROP DATABASE IF EXISTS <dbname>;

• To remove a database that still contains tables:

  DROP DATABASE <dbname> CASCADE;

  Caution: this might also remove the tables' data in HDFS.
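Before dropping with CASCADE it is worth checking what the database still contains, since managed-table data in HDFS may be deleted along with it. A short sketch, reusing the sample database:

```sql
-- See which tables the CASCADE drop would remove
SHOW TABLES IN sample;

-- Drops the database together with all its tables;
-- data of managed tables is deleted from HDFS as well.
DROP DATABASE sample CASCADE;
```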


Creating a New Table

Syntax to create a new table:

CREATE TABLE tablename (colname DATATYPE, ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE|SEQUENCEFILE|...};

• The statement creates a subdirectory in the database's warehouse directory in HDFS:
  o Default database: /user/hive/warehouse/tablename
  o Named database: /user/hive/warehouse/dbname.db/tablename
Table Creation: Example

• The following example shows how to create a new table named sample
• Data is stored as text with four comma-separated fields per line

CREATE TABLE sample (
  id INT,
  class STRING,
  fees INT,
  posted TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Hive Metastore
Managing Data with Hive
Hive uses the Metastore service to store metadata for Hive tables.

• A table is an HDFS directory containing zero or more files

Default path: /user/hive/warehouse/<table_name>

• Table supports many formats for data storage and retrieval


• The Metastore stores the created metadata
  o Contained in an RDBMS such as MySQL
• Hive tables are stored in HDFS, and the relevant metadata is stored in the Metastore
What Is Hive Metastore?
The Metastore is the component that stores the system catalog which
contains metadata about tables, columns, and partitions.
Use of Metastore in Hive

• Hive uses the metastore to get the table structure and the location of the data
• The Hive server then queries the actual data, which is stored in HDFS

1. Get table structure and location of data (metadata in RDBMS)
2. Query actual data (data in HDFS files)
Data Warehouse Directory Structure

• By default, all data gets stored in

/user/hive/warehouse

• Each table is a directory within the default location containing one or more files

Customers table:

customer_id   name      country
001           Alice     us
002           Bob       ca
003           Carlos    mx
004           Dieter    de
....          ....      ....
392           Maria     it
393           Nigel     uk
394           Ophelia   dk
....          ....      ....

Stored in HDFS under /user/hive/warehouse/customers, split across files:
• File 01: rows 001-003, ...
• File 02: rows 392-394, ...

In HDFS, Hive data can be split into more than one file.
