Hive


What is HIVE

Hive is a data warehouse system used to analyze structured data. It is built on top
of Hadoop and was originally developed by Facebook.

Hive provides the functionality of reading, writing, and managing large datasets residing in
distributed storage. It runs SQL-like queries, called HQL (Hive Query Language), which are
internally converted into MapReduce jobs.

HIVE Architecture

Hive Client

Hive allows writing applications in various languages, including Java, Python, and C++. It
supports several types of clients:

o Thrift Server - It is a cross-language service provider platform that serves requests
from all programming languages that support Thrift.
o JDBC Driver - It is used to establish a connection between Hive and Java applications.
The JDBC driver is implemented in the class org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - It allows applications that support the ODBC protocol to connect
to Hive.
Hive Services

The following are the services provided by Hive:

o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute
Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI. It
provides a web-based GUI for executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores the structure information of all
tables and partitions in the warehouse. It also includes metadata about each column
and its type, the serializers and deserializers used to read and write data, and the
corresponding HDFS files where the data is stored.
o Hive Server - It is also referred to as the Apache Thrift Server. It accepts requests from
different clients and forwards them to the Hive Driver.
o Hive Driver - It receives queries from different sources such as the web UI, CLI, Thrift,
and the JDBC/ODBC drivers, and transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform
semantic analysis on the different query blocks and expressions. It ultimately converts
HiveQL statements into a plan of MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG
of MapReduce tasks and HDFS tasks. The execution engine then executes the
incoming tasks in the order of their dependencies.

HIVE Data Types

Hive data types are categorized into numeric types, string types, miscellaneous types, and complex types.
A list of Hive data types is given below.

Integer Types
Type Size Range

TINYINT 1-byte signed integer -128 to 127

SMALLINT 2-byte signed integer -32,768 to 32,767

INT 4-byte signed integer -2,147,483,648 to 2,147,483,647

BIGINT 8-byte signed integer -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807


Floating Point Types
Type Size Description

FLOAT 4-byte Single-precision floating point number

DOUBLE 8-byte Double-precision floating point number

1. Create Database Statement


Create Database is a statement used to create a database in Hive. A database in Hive is
a namespace or a collection of tables.
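As a sketch, the statement below creates a database (the name student_db is illustrative); IF NOT EXISTS suppresses the error if the database already exists:

```sql
-- Create a database if it does not already exist.
CREATE DATABASE IF NOT EXISTS student_db;

-- Verify that it was created.
SHOW DATABASES;
```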

2. Drop Database Statement

Drop Database is a statement that drops all the tables and deletes the database.
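For example, continuing with the illustrative student_db database, the statement below removes it; CASCADE also drops any tables the database still contains:

```sql
-- Drop the database along with any tables it contains.
DROP DATABASE IF EXISTS student_db CASCADE;
```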
3. Create Table Statement

Create Table is a statement used to create a table in Hive.

Sr.No Field Name Data Type


1 Student_Name string
2 Student_Rollno int
3 Student_Marks float

On successful creation of the table, Hive returns an OK response.

We can then check the data of the student table with a simple SELECT query.
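A minimal sketch of the CREATE TABLE statement for the student table above; the row format is an assumption for comma-delimited text data:

```sql
-- Create the student table with the fields listed above.
CREATE TABLE IF NOT EXISTS student (
    Student_Name   STRING,
    Student_Rollno INT,
    Student_Marks  FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Inspect the schema and the (initially empty) data.
DESCRIBE student;
SELECT * FROM student;
```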
4. Load Data Statement
Hive provides the functionality to load data into pre-created tables either from the local file
system or from HDFS. The LOAD DATA statement is used to load data into a Hive table.

We can load data into the student Hive table with the LOAD DATA statement.

We can then view the student table's content to observe the effect of the load.
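A sketch of the LOAD DATA statement for the student table; the file paths are illustrative:

```sql
-- Load from the local file system (the file is copied into the warehouse).
LOAD DATA LOCAL INPATH '/home/user/student.csv' INTO TABLE student;

-- Alternatively, load from HDFS (note: the source file is moved, not copied).
-- LOAD DATA INPATH '/data/student.csv' INTO TABLE student;

-- Observe the effect of the load.
SELECT * FROM student;
```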
5. Alter Table Statement

It is used to alter a table in Hive.
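A few common ALTER TABLE forms, sketched against the student table; the new names and the added column are illustrative:

```sql
-- Rename the table.
ALTER TABLE student RENAME TO student_details;

-- Add a new column.
ALTER TABLE student_details ADD COLUMNS (Student_Dept STRING);

-- Rename a column and change its type.
ALTER TABLE student_details CHANGE Student_Marks Student_Score DOUBLE;
```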

Partitioning and Bucketing

Hive organizes tables into partitions. Partitioning is a way of dividing a table into related parts
based on the values of partition columns such as date, city, and department. Partitioning
makes it easy to query only a portion of the data.

Tables or partitions can be further sub-divided into buckets to provide extra structure to the
data, which may be used for more efficient querying. Bucketing works based on the hash
value of a column of the table.
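As a sketch, a bucketed table can be declared with CLUSTERED BY; here the hash of the id column distributes rows into 4 buckets (the table name and bucket count are illustrative):

```sql
-- Rows are assigned to buckets by the hash of the id column.
CREATE TABLE emp_bucketed (
    id   INT,
    name STRING,
    dept STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS;
```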

For example, a table named Tab1 contains employee data such as id, name, dept, and yoj
(i.e., year of joining). Suppose you need to retrieve the details of all employees who joined in
2012. Without partitioning, a query has to search the whole table for the required information.
However, if you partition the employee data by year, each year is stored in its own directory
and the query processing time is reduced. The following example shows how to partition a
table and its data:
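A sketch of the Tab1 table partitioned by year of joining; note that the partition column yoj is declared in PARTITIONED BY, not in the column list:

```sql
-- Employee table partitioned by year of joining.
CREATE TABLE Tab1 (
    id   INT,
    name STRING,
    dept STRING
)
PARTITIONED BY (yoj INT);

-- This query now needs to scan only the yoj=2012 partition.
SELECT * FROM Tab1 WHERE yoj = 2012;
```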
The following query is used to add a partition to the employee table.
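A sketch, assuming an employee table partitioned by a year column:

```sql
-- Add a partition for employees who joined in 2012.
ALTER TABLE employee ADD PARTITION (year = 2012);
```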

Renaming a Partition

The syntax of this command is as follows.
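A sketch of the rename syntax, continuing with the assumed employee table:

```sql
-- Rename the year=2012 partition to year=2013.
ALTER TABLE employee PARTITION (year = 2012)
RENAME TO PARTITION (year = 2013);
```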

Dropping a Partition

The following syntax is used to drop a partition:
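Continuing with the assumed employee table:

```sql
-- Drop the year=2012 partition (and its data, for a managed table).
ALTER TABLE employee DROP IF EXISTS PARTITION (year = 2012);
```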

Aggregate Functions in Hive

In Hive, an aggregate function returns a single value resulting from computation over many
rows. Some commonly used aggregate functions are COUNT, SUM, AVG, MIN, and MAX.
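As a sketch over an assumed employee table with a Salary column, the common aggregates can be used like this:

```sql
SELECT COUNT(*)    AS total_employees,  -- number of rows
       SUM(Salary) AS total_salary,     -- sum over all rows
       AVG(Salary) AS avg_salary,       -- arithmetic mean
       MIN(Salary) AS min_salary,       -- smallest value
       MAX(Salary) AS max_salary        -- largest value
FROM employee;
```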
GROUP BY Clause

The HQL GROUP BY clause is used to group data from multiple records based on one
or more columns. It is generally used in conjunction with aggregate functions (such as SUM,
COUNT, MIN, MAX, and AVG) to perform an aggregation over each group.

Example

Let us take an example of the SELECT...GROUP BY clause. Assume an employee table as given
below, with Id, Name, Salary, Designation, and Dept fields. Generate a query to retrieve the
number of employees in each department.
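One way to write the query for the scenario above:

```sql
-- Count the employees in each department.
SELECT Dept, COUNT(*) AS emp_count
FROM employee
GROUP BY Dept;
```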

The ORDER BY clause is used to retrieve details based on one or more columns and sort the
result set in ascending or descending order.
Example

Let us take an example of the SELECT...ORDER BY clause. Assume an employee table as given
below, with the fields Id, Name, Salary, Designation, and Dept. Generate a query to
retrieve the employee details ordered by department name.
The following query retrieves the employee details for this scenario:
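One way to write it (ascending order is the default; add DESC for descending):

```sql
-- Retrieve employee details sorted by department name.
SELECT Id, Name, Salary, Designation, Dept
FROM employee
ORDER BY Dept;
```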
