Hive
Hive is a data warehouse system used to analyze structured data. It is built on top of Hadoop and was originally developed by Facebook.
Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries, called HQL (Hive Query Language), which are internally converted into MapReduce jobs.
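For example, an HQL query reads much like standard SQL; the table and column names below are purely illustrative:

SELECT name, salary
FROM employee          -- hypothetical table
WHERE salary > 30000;  -- Hive compiles this query into MapReduce jobs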
Hive Architecture
Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types of clients, such as:
o Thrift Server - It is a cross-language service provider platform that serves requests from all programming languages that support Thrift.
o JDBC Driver - It is used to establish a connection between Hive and Java applications. The JDBC driver is provided by the class org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - It allows the applications that support the ODBC protocol to connect
to Hive.
Hive Services
o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute
Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores the structural information of the various tables and partitions in the warehouse. This includes metadata about columns and their types, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
o Hive Server - It is also referred to as the Apache Thrift Server. It accepts requests from different clients and forwards them to the Hive Driver.
o Hive Driver - It receives queries from different sources such as the web UI, CLI, Thrift, and the JDBC/ODBC drivers, and transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform
semantic analysis on the different query blocks and expressions. It converts HiveQL
statements into MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks. The execution engine then executes the incoming tasks in the order of their dependencies.
Hive data types are categorized into numeric types, string types, miscellaneous types, and complex types. A list of Hive data types is given below.
Integer Types
Type       Size      Range
TINYINT    1 byte    -128 to 127
SMALLINT   2 bytes   -32,768 to 32,767
INT        4 bytes   -2,147,483,648 to 2,147,483,647
BIGINT     8 bytes   -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
2.Drop Database Statement
The Drop Database statement drops all the tables and deletes the database.
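A minimal sketch, assuming a database named demo_db; CASCADE is needed when the database still contains tables:

DROP DATABASE IF EXISTS demo_db CASCADE;  -- demo_db is an assumed name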
3.Create Table Statement
The CREATE TABLE statement is used to create a new table in Hive. After creating the student table, we can check its data with the help of the command below.
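A minimal sketch, assuming a student table with id, name, and age columns backed by comma-delimited text files:

CREATE TABLE IF NOT EXISTS student (
  id INT,
  name STRING,
  age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- check the (initially empty) table
SELECT * FROM student;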
4.Load Data Statement
Hive provides the functionality to load data into a pre-created table either from the local file system or from HDFS. The LOAD DATA statement is used to load data into a Hive table.
We can load data into the student table with the help of the command below.
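A minimal sketch, assuming the data sits in the hypothetical local file /home/user/student.csv; drop the LOCAL keyword to load from an HDFS path instead:

LOAD DATA LOCAL INPATH '/home/user/student.csv' INTO TABLE student;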
Let's look at the student table content to observe the effect, using the command below.
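A plain SELECT does this:

SELECT * FROM student;  -- shows the rows loaded in the previous step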
5.Alter Table Statement
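The ALTER TABLE statement changes the definition of an existing table. A minimal sketch, again using the assumed student table:

-- add a new column
ALTER TABLE student ADD COLUMNS (email STRING);
-- rename an existing column and redeclare its type
ALTER TABLE student CHANGE age age_years INT;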
TRUNCATE
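TRUNCATE removes all rows from a table while keeping its definition. A sketch with the assumed student table:

TRUNCATE TABLE student;  -- deletes the data, keeps the table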
Hive organizes tables into partitions. Partitioning is a way of dividing a table into related parts based on the values of partitioning columns such as date, city, and department. Using partitions, it is easy to query a portion of the data.
Tables or partitions are sub-divided into buckets to provide extra structure to the data, which may be used for more efficient querying. Bucketing works based on the value of a hash function applied to a column of the table, as shown in the sketch below.
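A minimal sketch of a bucketed table; the table name, columns, and bucket count are assumptions:

CREATE TABLE emp_bucketed (
  id INT,
  name STRING,
  dept STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS;  -- rows are assigned to buckets by hashing the id column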
For example, a table named Tab1 contains employee data such as id, name, dept, and yoj (i.e., year of joining). Suppose you need to retrieve the details of all employees who joined in 2012. Without partitioning, a query searches the whole table for the required information. However, if you partition the employee data by year and store each year in a separate file, the query processing time is reduced. The following example shows how to partition a table and its data:
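A sketch of such a partitioned employee table; the exact column layout and file path are assumptions:

CREATE TABLE employee (
  id INT,
  name STRING,
  dept STRING
)
PARTITIONED BY (yoj STRING);  -- yoj = year of joining; each value gets its own sub-directory

-- load the 2012 records into their own partition (hypothetical local path)
LOAD DATA LOCAL INPATH '/home/user/employee_2012.txt'
INTO TABLE employee PARTITION (yoj='2012');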
The following query is used to add a partition to the employee table.
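For example, assuming the employee table sketched above:

ALTER TABLE employee ADD PARTITION (yoj='2013')
LOCATION '/user/hive/warehouse/employee/yoj=2013';  -- hypothetical HDFS location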
Renaming a Partition
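For example, to correct a mistyped partition value (an assumed scenario):

ALTER TABLE employee PARTITION (yoj='2105') RENAME TO PARTITION (yoj='2015');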
Dropping a Partition
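For example:

ALTER TABLE employee DROP IF EXISTS PARTITION (yoj='2013');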
In Hive, an aggregate function returns a single value resulting from a computation over many rows. Let's see some commonly used aggregate functions:
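For example, a few of them applied to an employee table that is assumed to have a salary column (as in the ORDER BY example later in this section):

SELECT COUNT(*), MIN(salary), MAX(salary), AVG(salary), SUM(salary)
FROM employee;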
GROUP BY Clause
The HQL GROUP BY clause is used to group data from multiple records based on one or more columns. It is generally used in conjunction with aggregate functions (such as SUM, COUNT, MIN, MAX, and AVG) to perform an aggregation over each group.
Example
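A sketch, assuming the employee table has dept and salary columns:

SELECT dept, SUM(salary) AS total_salary
FROM employee
GROUP BY dept;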
ORDER BY Clause
The ORDER BY clause is used to retrieve the details based on one column and sort the result set in ascending or descending order.
Example
Let us take an example of the SELECT...ORDER BY clause. Assume an employee table as given below, with the fields Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the employee details ordered by department name.
The following query retrieves the employee details using the above scenario:
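A sketch of such a query, using the field names listed above:

SELECT id, name, salary, designation, dept
FROM employee
ORDER BY dept;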