4. HIVE
Dr Mouhim Sanaa
WHAT IS IN IT FOR YOU?
• What is Hive?
• Architecture of Hive
• Data flow in Hive
• Hive data modelling
• Hive data types
• Create table partitions
• Querying Hive tables
WHAT IS HIVE?
• Hive can be used for Online Analytic Processing (OLAP), i.e. for reporting and data analysis (BI).
• It is scalable, fast and flexible.
• SQL users can write SQL-like queries to interact with large datasets stored in HDFS.
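A minimal sketch of the kind of SQL-like query a Hive user might run (the table and column names here are illustrative):
hive> SELECT country, COUNT(*) FROM page_views GROUP BY country;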
WHAT IS HIVE?
HIVE Clients
Apache Hive supports different types of client applications for performing queries on Hive, for example Thrift clients, JDBC clients and ODBC clients:
HIVE services
Hive provides many services. Let us have a look at each of them:
HIVE services
Apache Hive Driver: It is responsible for receiving the queries submitted by a client through the CLI, the web UI, Thrift, ODBC or JDBC interfaces.
The driver then passes the query to the compiler, where parsing, type checking and semantic analysis take place with the help of the schema present in the metastore.
In the next step, an optimized logical plan is generated in the form of a DAG (Directed Acyclic Graph) of map-reduce tasks and HDFS tasks.
Finally, the execution engine executes these tasks in the order of their dependencies, using Hadoop.
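The plan the compiler produces can be inspected with EXPLAIN; a hedged sketch (the employees table appears later in this lecture, any query works):
hive> EXPLAIN SELECT dept, COUNT(*) FROM employees GROUP BY dept;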
(Data flow in Hive diagram; step 4: sendMetaData)
HIVE MODES
HIVE DATA MODELLING
Databases: Namespaces function to avoid naming conflicts for tables, views, partitions, columns, and so on.
Tables: homogeneous units of data sharing the same schema. Example: a page_views table, where each row could consist of the following columns (schema):
• timestamp: of INT type, corresponding to the UNIX timestamp of when the page was viewed.
• userid: of BIGINT type, identifying the user who viewed the page.
• ip: of STRING type, capturing the IP address from which the page request was made.
Partitions:
• Each unique value of a field of the partition keys defines a partition of the Table.
• For a country field in page_views, 'US' is a partition, 'FR' is a partition, and so on.
• Partitioning speeds up the analysis significantly, because queries that filter on the partition key only read the matching partitions.
Buckets (or Clusters): data in each partition may in turn be divided into buckets, based on the value of a hash function of some column of the table.
• For example, the page_views table may be bucketed by userid, which is one of the columns of the page_views table.
• Buckets can be used to sample the data efficiently (see the sketch below).
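A minimal sketch tying these units together (the bucket count of 32 is an arbitrary choice, and the timestamp column is named viewtime here because TIMESTAMP is a reserved type name in HiveQL):
hive> CREATE TABLE page_views (
          viewtime INT,
          userid BIGINT,
          ip STRING COMMENT 'IP address of the user'
      )
      PARTITIONED BY (country STRING)
      CLUSTERED BY (userid) INTO 32 BUCKETS;
With this layout, a query restricted to country = 'US' only reads the files under the US partition directory.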
HIVE DATA MODELLING
Data files are just regular HDFS files.
One can think of tables, partitions and buckets as directories, subdirectories and files, respectively.
Use a database: tell Hive that the following statements should run against a specific database
hive> USE mydatabase;
Delete a database
hive> DROP DATABASE IF EXISTS mydatabase;
Alter a database
Inspecting tables
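A hedged sketch of the usual statements behind "Alter a database" and "Inspecting tables" (the DBPROPERTIES key and value are hypothetical; users is the table loaded later in this lecture):
hive> ALTER DATABASE mydatabase SET DBPROPERTIES ('edited-by' = 'sanaa');
hive> DESCRIBE DATABASE mydatabase;
hive> SHOW TABLES;
hive> DESCRIBE FORMATTED users;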
HIVE DATA MODELLING
Creating tables
The ALTER statement
• Rename a table: ALTER TABLE name RENAME TO new_name;
• Replace (delete all existing columns and define new ones): ALTER TABLE name REPLACE COLUMNS (name type, ...);
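A few more ALTER TABLE forms, sketched on a hypothetical table (mytable and its columns are illustrative, not part of the original examples):
hive> ALTER TABLE mytable ADD COLUMNS (email STRING COMMENT 'contact address');
hive> ALTER TABLE mytable CHANGE name full_name STRING;
hive> ALTER TABLE mytable RENAME TO mytable_renamed;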
HIVE DATA MODELLING
Example:
Inspecting Table
External Tables
hive> LOAD DATA LOCAL INPATH 'users.dat' OVERWRITE INTO TABLE users;
• The LOCAL keyword means the source data is on the local file system; without it, the source path is taken to be on HDFS.
• The contents of users.dat are copied into the table's directory in the Hive warehouse.
• The OVERWRITE keyword causes the contents of the target table to be deleted and replaced.
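By contrast, an external table keeps the data where it already lives in HDFS; a hedged sketch (the path, schema and delimiter are assumptions):
hive> CREATE EXTERNAL TABLE users_ext (
          id INT,
          name STRING
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/data/users';
Dropping an external table removes only the metadata; the files under /data/users are left in place.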
CREATE STATIC PARTITIONS
Example
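A minimal sketch of creating a partitioned table and loading one static partition, reusing the emp_part and employees names that appear in the following steps (the column types are assumptions):
hive> CREATE TABLE emp_part (
          eid INT,
          name STRING,
          position STRING
      )
      PARTITIONED BY (dept STRING);
hive> INSERT INTO TABLE emp_part PARTITION (dept = 'service')
      SELECT eid, name, position FROM employees WHERE dept = 'service';
Here the partition value 'service' is spelled out by hand; the next steps show how to let Hive derive the partition values dynamically.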
5. Set the dynamic partition mode to create partitioned directories of data dynamically when data is inserted:
SET hive.exec.dynamic.partition.mode=nonstrict;
6. Insert data from the unpartitioned table (all the data) into the partitioned table, dynamically creating the partitions:
INSERT INTO TABLE emp_part PARTITION (dept) SELECT eid, name, position, dept FROM employees;
Query results can be inserted into file system directories (local or HDFS).
• If the LOCAL keyword is used, Hive writes the data to a directory on the local file system.
• INSERT OVERWRITE statements that write to HDFS are a good way to extract large amounts of data from Hive.
insert overwrite local directory 'testexporthive' select eid, name from emp_part
where dept="service";
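Dropping the LOCAL keyword writes the results to an HDFS directory instead; a sketch with a hypothetical target path:
hive> INSERT OVERWRITE DIRECTORY '/user/hive/export/service_employees'
      SELECT eid, name FROM emp_part WHERE dept = 'service';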
SELECT FROM
Simple SELECT QUERY
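A minimal example of a simple SELECT, reusing the emp_part table from the previous slides:
hive> SELECT * FROM emp_part LIMIT 10;
hive> SELECT name, dept FROM emp_part;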
WHERE
Built In Operators
Relational Operators
A = B, A!= B, A < B, A <= B, A > B, A >= B, A IS NULL, A IS NOT NULL, A
LIKE B, A RLIKE B (Null if A or B is null), A REGEXP B (same as RLIKE).
Arithmetic Operators
A+B, A-B, A*B, A / B, A % B, ~A (Not of A).
Logical Operators
A AND B, A && B, A OR B, A || B, NOT A (true if A is false), !A (same as NOT A).
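A sketch combining several of these operators in a WHERE clause (the literal values are arbitrary):
hive> SELECT eid, name FROM emp_part
      WHERE dept = 'service' AND eid >= 100 AND name LIKE 'A%';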
WHERE
Built In Functions
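A few common built-in functions, sketched on the employees table from the partitioning example (the choice of functions is illustrative):
hive> SELECT upper(name), length(position), concat(name, ' - ', dept) FROM employees;
hive> SELECT dept, count(*) FROM employees GROUP BY dept;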
SELECT JOINS
JOIN is used to combine records from two or more tables in the database. It is similar to SQL JOIN.
JOIN: the JOIN clause is used to combine and retrieve the records from multiple tables.
LEFT OUTER JOIN: the HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there are no matches in the right table.
RIGHT OUTER JOIN: the HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if there are no matches in the left table.
FULL OUTER JOIN: the HiveQL FULL OUTER JOIN returns all the rows from both the left and the right tables, filling in NULL for the columns of the table that has no match.
SELECT JOINS
JOIN: JOIN clause is used to combine and retrieve the records from multiple tables.
(Diagrams: a sales table with columns id, name and a products table with columns pid, pname, illustrating each join type.)
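A sketch of the joins pictured above, assuming sales also carries the product id in a pid column (the exact join key is not visible in the slides and is an assumption):
hive> SELECT s.id, s.name, p.pname
      FROM sales s JOIN products p ON (s.pid = p.pid);
hive> SELECT s.id, s.name, p.pname
      FROM sales s LEFT OUTER JOIN products p ON (s.pid = p.pid);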