Hive


INF4101-Big Data et NoSQL

4. HIVE

Dr Mouhim Sanaa
WHAT IS IN IT FOR YOU?

• What is Hive?
• Architecture of Hive
• Data flow in Hive
• Hive data modelling
• Hive data types
• Create table partitions
• Querying Hive tables
WHAT IS HIVE?

Hive is a data warehousing infrastructure built on top of Apache Hadoop.


Hive is used for querying and analyzing large datasets stored in HDFS.
Hive provides an SQL-like interface which enables users to do querying,
summarization and data analysis easily.
Hive uses a query language called HiveQL, which is similar to SQL.
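A first taste of HiveQL, as a sketch: the table and column names below are illustrative, not from any real deployment.

```sql
-- HiveQL reads almost exactly like SQL.
-- The employees table and its columns are hypothetical.
SELECT name, dept
FROM employees
WHERE dept = 'sales'
LIMIT 10;
```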
WHAT IS HIVE?

When can Hive be used?

• Hive can be used for Online Analytical Processing (OLAP), i.e. for
reporting and data analysis (BI).
• It is scalable, fast and flexible.
• SQL users can write SQL-like queries to interact with large datasets
stored in HDFS.
WHAT IS HIVE?

When can Hive not be used?

It can not be used:

• Hive is not a relational database.
• Not for Online Transaction Processing (OLTP), such as financial
transaction systems (read, insert, update and delete).
• Not for real-time updates and queries.
• Not for low-latency scenarios.
HIVE ARCHITECTURE

HIVE Clients
Apache Hive supports different types of client applications for
performing queries on Hive:

 Thrift Clients: Since the Hive server is based on Apache Thrift, Thrift
clients can talk to it directly. Thrift is used for defining and creating
services for numerous languages and forms a remote procedure call (RPC)
framework.

 JDBC Clients: Hive allows Java applications to connect to it using


the JDBC driver which is defined in the class
org.apache.hadoop.hive.jdbc.HiveDriver.

 ODBC Clients: The Hive ODBC Driver allows applications that


support the ODBC protocol to connect to Hive.
HIVE ARCHITECTURE

HIVE services
Hive provides many services. Let us have a look at each of them:

 Hive CLI (Command Line Interface): This is the default shell


provided by the Hive where you can execute your Hive queries
and commands directly.
 Apache Hive Web Interfaces: Apart from the command line
interface, Hive also provides a web based GUI for executing Hive
queries and commands.
 Hive Server: Hive Server is built on Apache Thrift and is therefore
also referred to as the Thrift Server. It allows different clients to
submit requests to Hive and retrieve the final result, with bindings for:
ActionScript, C, C++, C#, Cappuccino, Cocoa, Delphi, Erlang, Go,
Haskell, Java, Node.js, Objective-C, OCaml, Perl, PHP, Python, Ruby
and Smalltalk.
HIVE ARCHITECTURE

HIVE services
Apache Hive Driver: It is responsible for receiving the queries
submitted through the CLI, the web UI, Thrift, ODBC or JDBC
interfaces by a client.
 Then, the driver passes the query to the compiler, where parsing,
type checking and semantic analysis take place with the help of the
schema present in the metastore.
 In the next step, an optimized logical plan is generated in the
form of a DAG (Directed Acyclic Graph) of map-reduce tasks and
HDFS tasks.
 Finally, the execution engine executes these tasks in the order of
their dependencies, using the underlying Hadoop framework.

 Metastore: Repository for storing all the Hive metadata
information.
HIVE DATA FLOW

(diagram of the Hive query/metadata flow, including a sendMetaData step)

HIVE MODES
HIVE DATA MODELLING
Databases: Namespaces function to avoid naming conflicts for tables, views, partitions, columns, and so on.

Tables: Homogeneous units of data which have the same schema.

Example of a page_views table, where each row could comprise the following columns (schema):

timestamp — of INT type, corresponding to a UNIX timestamp of when the page was viewed.

userid — of BIGINT type, identifying the user who viewed the page.

page_url — of STRING type, capturing the location of the page.

IP — of STRING type, capturing the IP address from where the page request was made.

country — of STRING type, indicating the country.
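The schema described above can be written as a table definition. This is a sketch: the field delimiter is an assumption, not stated on the slides.

```sql
-- Sketch of the page_views table described above.
-- The tab delimiter is an assumption.
CREATE TABLE page_views (
  `timestamp` INT,      -- UNIX timestamp of the page view
  userid      BIGINT,   -- user who viewed the page
  page_url    STRING,   -- location of the page
  ip          STRING,   -- client IP address
  country     STRING    -- country of the request
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
```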


HIVE DATA MODELLING
Partitions: Each table can have one or more partition keys, which determine how the data is stored.

• Each unique value of the partition keys defines a partition of the table.
• For the country field in page_views, 'US' is a partition, 'FR' is a partition, etc.
• This speeds up the analysis significantly.

Buckets (or Clusters): Data in each partition may in turn be divided into buckets, based on the value of a hash function of some column of the table.

• For example, the page_views table may be bucketed by userid, which is one of
the columns of the page_views table.
• Buckets can be used to efficiently sample the data.
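Partitioning and bucketing as described above can be sketched in DDL; the bucket count of 32 is an arbitrary illustrative choice.

```sql
-- page_views partitioned by country and bucketed by userid.
-- The number of buckets (32) is an illustrative assumption.
CREATE TABLE page_views_part (
  `timestamp` INT,
  userid      BIGINT,
  page_url    STRING,
  ip          STRING
)
PARTITIONED BY (country STRING)        -- one HDFS subdirectory per country value
CLUSTERED BY (userid) INTO 32 BUCKETS; -- hash(userid) % 32 selects the bucket file
```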
HIVE DATA MODELLING
 Data files are just regular HDFS files

 Warehouse directory in HDFS

• Specified by hive.metastore.warehouse.dir in hive-site.xml
• E.g. /user/hive/warehouse

 One can think of tables, partitions and buckets as directories, subdirectories and files respectively

Hive Entity   Sample            Sample location*
database      testdb            $WH/testdb.db
table         T                 $WH/testdb.db/T
partition     date='01012021'   $WH/testdb.db/T/date=01012021

*in this example $WH is a variable that holds the warehouse path

HIVE DATA MODELLING
 Creating Databases
 Create a database named "mydatabase"

hive> CREATE DATABASE mydatabase;

 Create a database named "mydatabase" and override the configured Hive warehouse location

hive> CREATE DATABASE mydatabase
    > LOCATION '/myfavorite/folder/';

 Create a database named "mydatabase" and add a descriptive comment

hive> CREATE DATABASE mydatabase
    > COMMENT 'This is my database';
HIVE DATA MODELLING
 Show Databases
 List the databases in the Hive system

hive > SHOW DATABASES;


 Show some basic information about Database

hive > DESCRIBE DATABASE mydatabase;

 Show more detailed information about a Database

hive > DESCRIBE DATABASE EXTENDED mydatabase;


HIVE DATA MODELLING
Database Use/Drop/Alter
 Hive's "default" database is used if no database is specified

 Tell Hive that your following statements will use a specific database
hive> USE mydatabase;
 Delete a database
hive> DROP DATABASE IF EXISTS mydatabase;

Note: The database directory is also deleted

 CASCADE is used to drop the respective tables before dropping the database

hive> DROP DATABASE IF EXISTS mydatabase CASCADE;

 Alter a database

hive> ALTER DATABASE mydatabase SET DBPROPERTIES ('createdby'='bigdatauser2');

Note: Only the DBPROPERTIES metadata can be updated


HIVE DATA MODELLING
Data type properties

Integer types
 TINYINT: 1-byte integer
 SMALLINT: 2-byte integer
 INT: 4-byte integer
 BIGINT: 8-byte integer
Boolean type
 BOOLEAN: TRUE/FALSE
Floating point numbers
 FLOAT: single precision
 DOUBLE: double precision
Fixed point numbers
 DECIMAL: a fixed point value of user-defined scale and precision
String types
 STRING: sequence of characters in a specified character set
 VARCHAR: sequence of characters in a specified character set, with a maximum length
 CHAR: sequence of characters in a specified character set, with a defined length
Date and time types
 TIMESTAMP: a specific point in time, up to nanosecond precision
 DATE: a date
Binary types
 BINARY: a sequence of bytes
HIVE DATA MODELLING
Data type properties
Complex types can be built up from primitive types and other composite types
Arrays: Indexable lists containing elements of the same type
 Format: ARRAY<data_type>
 Literal syntax example: array('user1','user2')
 Accessing elements: [n] notation, where n is an index into the array
Structs: Collection of elements of different types
 Format: STRUCT<col_name:data_type, ...>
 Accessing elements: dot (.) notation, e.g. structname.firstname
Maps: Collection of key-value tuples
 Format: MAP<primitive_type,data_type>
 Accessing elements: ['element_name'] notation
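A sketch combining the three complex types; the table and column names here are illustrative assumptions.

```sql
-- Hypothetical table using ARRAY, STRUCT and MAP columns.
CREATE TABLE users_profile (
  name     STRING,
  children ARRAY<STRING>,                       -- e.g. array('Liz','Karen')
  address  STRUCT<street:STRING, city:STRING>,  -- accessed as address.city
  scores   MAP<STRING, INT>                     -- accessed as scores['math']
);

-- Accessing the complex-typed columns:
SELECT name,
       children[0],    -- first element of the array
       address.city,   -- struct field via dot notation
       scores['math']  -- map value via key lookup
FROM users_profile;
```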
HIVE DATA MODELLING
Creating tables

 Inspecting a table
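The screenshots for this slide are not reproduced in the text; as a sketch, creating and then inspecting a simple table looks like this (the table and column names are illustrative):

```sql
-- Create a simple managed table (illustrative names).
CREATE TABLE mytable (
  col1 INT,
  col2 STRING
);

-- Inspect the table.
DESCRIBE mytable;            -- column names and types
DESCRIBE FORMATTED mytable;  -- plus location, table type and properties
```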
HIVE DATA MODELLING
Creating tables
 The ALTER statement
Rename a table: ALTER TABLE name RENAME TO new_name;

Change a column: ALTER TABLE name CHANGE old_name new_name new_type;


HIVE DATA MODELLING
Creating tables
 The ALTER statement
Add columns: ALTER TABLE name ADD COLUMNS (new_column type, ...);

Replace (deletes all existing columns): ALTER TABLE name REPLACE COLUMNS (name type, ...);
HIVE DATA MODELLING

Example:

1. Create a database university
2. Create a table students with the fields (id, name, gpa, field)
3. Add some values to the table students
4. Show the fields of the table
5. Show the content of the table students
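A possible solution sketch for the exercise above; the inserted values are made up.

```sql
-- 1. Create the database and switch to it.
CREATE DATABASE university;
USE university;

-- 2. Create the students table.
CREATE TABLE students (
  id    INT,
  name  STRING,
  gpa   FLOAT,
  field STRING
);

-- 3. Add some (made-up) values.
INSERT INTO students VALUES
  (1, 'Alice', 3.8, 'CS'),
  (2, 'Omar',  3.2, 'Math');

-- 4. Show the fields of the table.
DESCRIBE students;

-- 5. Show the content of the table.
SELECT * FROM students;
```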
HIVE DATA MODELLING
Creating tables
 Creating a delimited table

File: users.dat
1|1|Bob Smith|Mary
2|1|Frank Barney|James:Liz:Karen
3|2|Ellen Lacy|Randy:Martin
4|3|Jake Gray|
5|4|Sally Fields|John:Fred:Sue:Hank:Robert

 Inspecting the table

Note: for more details you can use


HIVE DATA MODELLING
Creating tables
 Creating a delimited table with a header row

File: users2.dat
id|office_id|name|children
1|1|Bob Smith|Mary
2|1|Frank Barney|James:Liz:Karen
3|2|Ellen Lacy|Randy:Martin
4|3|Jake Gray|
5|4|Sally Fields|John:Fred:Sue:Hank:Robert

hive> CREATE TABLE users2
    > (id int, office_id int, name string, children array<string>)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    > COLLECTION ITEMS TERMINATED BY ':'
    > TBLPROPERTIES ("skip.header.line.count"="1");
HIVE DATA MODELLING
Internal and External Tables

 Internal or Managed Tables

• Hive is responsible for both the metadata and the data.
• If the table is dropped, then both metadata and data are removed.
• The concept of a table in Hive is very similar to a table in a relational database.
• For example, /user/hive/warehouse/employee is created by Hive in HDFS for the employee table.
• All the data in the table is kept in that directory. Such Hive tables are also referred to as internal or
managed tables.
• Hive fully manages the life cycle of the table and data.
HIVE DATA MODELLING
Internal and External Tables

 External Tables

• Hive is responsible only for the metadata.
• If an external table is dropped, the metadata is deleted, but the data is kept.
• When there is data already in HDFS, an external Hive table can be created to describe the data.
• It is called EXTERNAL because the data location is specified in the LOCATION property
instead of the default warehouse directory.
• Useful when sharing data with other tools.
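An external-table sketch for data already sitting in HDFS; the table name and HDFS path are illustrative assumptions.

```sql
-- The data files already live under /data/users (hypothetical path);
-- Hive only tracks the metadata for this table.
CREATE EXTERNAL TABLE users_ext (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/data/users';

-- DROP TABLE users_ext removes only the metadata; the files stay in HDFS.
```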
LOAD DATA INTO HIVE
Loading data from input file

hive> LOAD DATA LOCAL INPATH 'users.dat' OVERWRITE INTO TABLE users;

• The LOCAL keyword indicates the source data is on the local file system
• Local data is copied to its final location
• Hive does not do any transformation while loading data into tables.

Loading data from HDFS

hive> LOAD DATA INPATH 'users.dat' OVERWRITE INTO TABLE users;

• The lack of the LOCAL keyword means the source data is on the HDFS file system
• When a directory is given as the INPATH, all of the files in the directory are moved into Hive.
• The OVERWRITE keyword causes the contents of the target table to be deleted and replaced
CREATE STATIC PARTITIONS
Example

Create partitions based on state

1. Create the zip database:

hive> create database zip;

hive> use zip;


CREATE STATIC PARTITIONS
Example

Create partitions based on state

2. Create table zipCode partitioned by state:

hive> CREATE TABLE zipCode (recordNumber int,
    > country string, city string, zipCode int)
    > PARTITIONED BY (state string)
    > ROW FORMAT DELIMITED FIELDS
    > TERMINATED BY '\t';
CREATE STATIC PARTITIONS
Example

Create partitions based on state

3. Upload zipCode.txt into local file system or HDFS


CREATE STATIC PARTITIONS
Example

Create partitions based on state

4. Load data from SampleData/zipCode.txt into the zipCode table, one partition at a time:

hive> LOAD DATA LOCAL INPATH 'SampleData/zipcode.txt'
    > INTO TABLE zipCode partition (state="PR");
CREATE STATIC PARTITIONS
Example

The same load is repeated for the other state partitions:

hive> LOAD DATA LOCAL INPATH 'SampleData/zipcode.txt'
    > INTO TABLE zipCode partition (state="TX");

hive> LOAD DATA LOCAL INPATH 'SampleData/zipcode.txt'
    > INTO TABLE zipCode partition (state="AL");

hive> LOAD DATA LOCAL INPATH 'SampleData/zipcode.txt'
    > INTO TABLE zipCode partition (state="NC");
CREATE STATIC PARTITIONS
Example

Create partitions based on state

5. Check that the partitions were created.

hive> show partitions zipCode;


CREATE STATIC PARTITIONS
Example

Create partitions based on state

6. Check the corresponding partition directories on HDFS:

hdfs dfs -ls /user/hive/warehouse/zip.db/zipcode


CREATE DYNAMIC PARTITIONS

1. Put the CSV file employees.csv on the local file system or HDFS
2. Create an unpartitioned table that holds all the data:

hive> create external table employees(eid int, name string,
    > position string, dept string) row format delimited fields
    > terminated by ',' location '/user/hive';
CREATE DYNAMIC PARTITIONS

3. Put the data from the CSV file into the employees table:

hive> load data local inpath 'SampleData/employees.csv'
    > overwrite into table employees;

hive> select * from employees;


CREATE DYNAMIC PARTITIONS

4. Create a partitioned table:

hive> CREATE EXTERNAL TABLE EMP_PART (eid int, name string, position string)
    > PARTITIONED BY (dept string);
CREATE DYNAMIC PARTITIONS

5. Set the dynamic partition mode so that partition directories are created dynamically when data is inserted:

SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE DYNAMIC PARTITIONS

6. Insert data from the unpartitioned table (all the data) into the partitioned table , dynamically creating the partitions.

INSERT INTO TABLE EMP_PART PARTITION (DEPT)


SELECT eid,name,position,dept FROM employees;
CREATE DYNAMIC PARTITIONS

7. Check that the partitions were created.

SHOW PARTITIONS emp_part;


CREATE DYNAMIC PARTITIONS

8. Check the corresponding partition directories on HDFS.


EXPORTING DATA OUT OF HIVE

 Query results can be inserted into file system directories (local or HDFS)
 If the LOCAL keyword is used, Hive will write the data to a directory on the local file system.
 INSERT OVERWRITE statements to HDFS are a good way to extract large amounts of data from Hive.

insert overwrite local directory 'testexporthive'
select eid, name from emp_part where dept="service";
SELECT FROM
Simple SELECT QUERY
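The slide's screenshot is not reproduced; as a sketch, a simple SELECT against the employees table used earlier in the deck looks like:

```sql
-- Project specific columns.
SELECT eid, name FROM employees;

-- Filter rows with WHERE.
SELECT * FROM employees WHERE dept = 'service';
```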
WHERE
Built In Operators
 Relational Operators
A = B, A != B, A < B, A <= B, A > B, A >= B, A IS NULL, A IS NOT NULL,
A LIKE B, A RLIKE B (NULL if A or B is NULL), A REGEXP B (same as RLIKE).

 Arithmetic Operators
A+B, A-B, A*B, A/B, A%B, ~A (bitwise NOT of A).

 Logical Operators
A AND B, A && B, A OR B, A || B, NOT A (true if A is false), !A (like NOT A).
WHERE
Built In Functions
WHERE

 select where query with STRUCT column (address):

 select where query with Group by clause:
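The two bullets above can be sketched as follows; the customers table and its address STRUCT are hypothetical, while employees follows the table used earlier in the deck.

```sql
-- WHERE on a STRUCT column (table and fields are hypothetical):
SELECT name
FROM customers
WHERE address.city = 'Casablanca';

-- WHERE combined with GROUP BY (employees table as loaded earlier):
SELECT dept, count(*)
FROM employees
WHERE position = 'engineer'
GROUP BY dept;
```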


SELECT JOINS

 JOIN is used to combine records from two or more tables in the database.
 It is similar to SQL JOIN.

There are different types of joins, as follows:

 JOIN: the JOIN clause is used to combine and retrieve the records from multiple tables.
 LEFT OUTER JOIN: the HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there are
no matches in the right table.
 RIGHT OUTER JOIN: the HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if
there are no matches in the left table.
 FULL OUTER JOIN: the HiveQL FULL OUTER JOIN returns all the rows from both the left and the right
table, with NULLs where the join condition finds no match.
SELECT JOINS
 JOIN: the JOIN clause is used to combine and retrieve the records from multiple tables.

Tables: sales(id, name), products(pid, pname)

hive> SET hive.auto.convert.join=false;

select sales.*, products.* from sales JOIN products
ON sales.id=products.pid;

select s.*, p.* from sales s JOIN products p ON s.id=p.pid;
SELECT JOINS
 LEFT OUTER JOIN: the HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there are
no matches in the right table.

Tables: sales(id, name), products(pid, pname)

select sales.*, products.* from sales LEFT OUTER JOIN
products ON sales.id=products.pid;
SELECT JOINS
 RIGHT OUTER JOIN: the HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if
there are no matches in the left table.

Tables: sales(id, name), products(pid, pname)

select sales.*, products.* from sales RIGHT OUTER JOIN
products ON sales.id=products.pid;
SELECT JOINS
 FULL OUTER JOIN: the HiveQL FULL OUTER JOIN returns all the rows from both the left and the right
table, with NULLs where the join condition finds no match.

Tables: sales(id, name), products(pid, pname)

select sales.*, products.* from sales FULL OUTER JOIN
products ON sales.id=products.pid;
GROUP BY
 The GROUP BY clause groups the rows of a Hive table by the values of the specified columns, so that
aggregates can be computed per group.

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];
GROUP BY
 The GROUP BY clause groups rows by the specified columns, so that aggregates can be computed per group.

Example: hive> select dept, count(*) from employees GROUP BY dept;
ORDER BY
 The ORDER BY clause sorts the result set by one or more columns.

Example: hive> select id, name, dept from employees ORDER BY dept;
VIEWS
 A view allows a query to be saved and treated like a table. It is a logical construct and does not store data
like a table: materialized views are not currently supported by Hive.
