Hive Lecture Notes
Hive is a “data warehouse software” that enables you to query and manipulate data using a SQL-like language known as HiveQL. It was developed at Facebook so that people with SQL experience could query large datasets without having to learn new paradigms such as MapReduce or new programming languages.
The use cases that the creators of Hive had in mind were as follows:
● Data Cleaning: Data engineers usually receive data that is not very clean, and it may not be in the format required for analytics. Hive helps them clean this data and convert it into the required formats without too much effort.
● Reporting: Data teams are also tasked with publishing reports and summaries on a regular basis using classic data warehouses such as Oracle and Vertica. A classic data warehouse like Oracle is highly effective when the data set is small and highly structured. In such reporting use cases, the most important task of Hive is to clean the high-volume data, extract the essential facts or metadata, and publish them to the classic data warehouse. All reports are then generated and published quickly from the data present in the classic data warehouse.
● Data Applications: Data engineers can build data-driven applications on big data using Hive as the data processing layer. If required, a combination of HiveQL and an external programming language such as Python is used to perform complex data processing. Some popular examples of data-driven applications that can be built conveniently using Hive are:
○ Contact recommendations on social networking websites.
○ Generation of summary reports.
Hive stores and queries data using its data models. The purpose of using data models is to make querying
convenient and fast.
There are four main components in Hive data models, which are similar to how an RDBMS stores data:
1. Databases
2. Tables
3. Partitions
4. Buckets
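For example, creating and switching to a database uses DDL much as in an RDBMS. A minimal sketch (the database name is illustrative):
create database if not exists airline_db;
use airline_db;
show tables;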
3. Log in to your EC2 instance and use the ls command to check if the data has been transferred
4. To look at the headers of the file, use the command head data_2008.csv
Note: You can use the head command on either your local machine or EC2. If using your local machine, you will have to change data_2008.csv to the exact location of your file.
5. Now start the Hive Command Line Interface by entering the command hive on your EC2 instance
6. To create an internal table, use the below command
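A create statement along these lines should work; this sketch mirrors the external table defined in step 10, minus the external keyword and the location clause:
create table flights_data( `Year` int, `Month` int, `DayofMonth` int,
`DayOfWeek` int, `DepTime` int, `CRSDepTime` int, `ArrTime` int,
`CRSArrTime` int, `UniqueCarrier` string, `FlightNum` string, `TailNum` string,
`ActualElapsedTime` int, `CRSElapsedTime` int, `AirTime` int, `ArrDelay` int,
`DepDelay` int, `Origin` string, `Dest` string, `Distance` bigint, `TaxiIn`
int, `TaxiOut` int, `Cancelled` int, `CancellationCode` string, `Diverted` int,
`CarrierDelay` string, `WeatherDelay` string, `NASDelay` string,
`SecurityDelay` string, `LateAircraftDelay` string)
row format delimited fields terminated by ',';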
8. To load data into the internal table, use the command below
load data local inpath 'location of your dataset' overwrite into table
flights_data;
9. To view the data in the table, you can use the below command
select * from flights_data limit 10;
10. To create an external table, you need to modify the create table statement as given below
create external table flights_data_ext( `Year` int, `Month` int, `DayofMonth`
int, `DayOfWeek` int, `DepTime` int, `CRSDepTime` int, `ArrTime` int,
`CRSArrTime` int, `UniqueCarrier` string, `FlightNum` string, `TailNum` string,
`ActualElapsedTime` int, `CRSElapsedTime` int, `AirTime` int, `ArrDelay` int,
`DepDelay` int, `Origin` string, `Dest` string, `Distance` bigint, `TaxiIn`
int, `TaxiOut` int, `Cancelled` int, `CancellationCode` string, `Diverted` int,
`CarrierDelay` string, `WeatherDelay` string, `NASDelay` string,
`SecurityDelay` string, `LateAircraftDelay` string)
row format delimited fields terminated by ','
location 's3a://yourS3bucketname/flights_data';
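To confirm whether a table is managed (internal) or external, you can inspect it with
describe formatted flights_data_ext;
which reports the table type (MANAGED_TABLE or EXTERNAL_TABLE) along with its location.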
Some example queries that can be run in Hive (illustrated here on a product reviews dataset, electronics_columns_s3) are given below.
1. Average number of reviews per reviewer
select avg(review_count)
from (
select reviewerid, count(*) as review_count
from electronics_columns_s3
group by reviewerid
)a;
3. Reviews by date
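A sketch of such a query, assuming the table stores the review date in a reviewdate column (the column name is illustrative):
select reviewdate, count(*) as review_count
from electronics_columns_s3
group by reviewdate;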
Hive also provides built-in string functions. Two examples, run against a students table:
● upper()
select upper(name) from students;
● regexp_replace()
select regexp_replace(concat(upper(name),id),' ','') as username from students;
Hive has two special data models that make it a really powerful tool for data transformation and analytics: partitioning and bucketing.
1. Partitioning: Partitions make queries faster by dividing a table into smaller parts using partition key columns. For example, in a table with details of employees in a company (name, salary, department, designation), you can create a partition for each department. Hive stores each partition separately and scans only the partition that is needed, thereby making the query faster.
2. Bucketing: Like partitioning, bucketing also divides the data into smaller and more manageable parts. The key difference is that bucketing uses the hash of a column to create multiple 'buckets', whereas partitions are more straightforward: they are simply directories segmented according to the value of a column (year, month etc.).
For creating partitions, the "PARTITIONED BY" clause is used; similarly, for creating buckets, the "CLUSTERED BY" clause is used (see the sketch after these points).
● You typically perform partitioning on columns (or groups of columns) that produce a manageable number of partitions (directories), such as year, month, state or country.
● However, if you partition on, say, customer ID, the number of partitions will be enormous and may actually reduce performance; in that case you can choose to create buckets on customer ID instead.
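A sketch of both clauses (table and column names are illustrative):
-- Partitioning: Hive creates one directory per department
create table employees (name string, salary double, designation string)
partitioned by (department string);
-- Bucketing: rows are distributed by hash(customer_id) into 32 files
create table customers (customer_id bigint, name string)
clustered by (customer_id) into 32 buckets;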
Hive offers two options when loading partitions: static partitioning and dynamic partitioning.
Note: You should download the file to your local system if you are using the Quickstart VM. If you are using Hue, you can upload the file to HDFS as shown in the video in the previous session.
5. Load data into the partitioned table using the previously created normal table
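For a static load, the partition value is fixed in the statement itself. A sketch, assuming a hypothetical flights_data_part table with a cut-down set of the flight columns, partitioned by (`Month` int):
insert overwrite table flights_data_part partition (`Month` = 1)
select `Year`, `DayofMonth`, `ArrDelay`, `DepDelay`
from flights_data
where `Month` = 1;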
9. If you need to change the file referred to by a partition, you can use the query below
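A sketch, reusing the hypothetical flights_data_part table (the new path is illustrative):
alter table flights_data_part partition (`Month` = 1)
set location 's3a://yourS3bucketname/flights_data/month_1';
Dynamic partitioning must be enabled before Hive will accept inserts that decide the target partitions at run time; the settings below do that, and the last two raise the limit on how many partitions a single job may create: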
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=3000;
set hive.exec.max.dynamic.partitions.pernode=3000;
3. To load data into the table, use the below command (this query may take 12-15 minutes to finish)
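A sketch of a dynamic-partition load, again using the hypothetical flights_data_part table; note that the partition column is selected last and is not given a fixed value:
insert overwrite table flights_data_part partition (`Month`)
select `Year`, `DayofMonth`, `ArrDelay`, `DepDelay`, `Month`
from flights_data;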
4. To compare query performance between normal and partitioned table, use the below queries
a. For partitioned table
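A representative pair, reusing the hypothetical flights_data_part table; the filter on the partition column lets Hive scan a single partition directory instead of the full table:
select count(*) from flights_data_part where `Month` = 6;
b. For normal table
select count(*) from flights_data where `Month` = 6;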
5. Load data into the bucketed table using the previously created normal table
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.enforce.bucketing=true;
Note: To enable bucketing, you need to enable partitioning also.
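With those settings in place, a bucketed (and partitioned) load might look like this; the flights_data_buck table and its column set are illustrative:
create table flights_data_buck (`Year` int, `DayofMonth` int,
`FlightNum` string, `ArrDelay` int, `DepDelay` int)
partitioned by (`Month` int)
clustered by (`FlightNum`) into 8 buckets;
insert overwrite table flights_data_buck partition (`Month`)
select `Year`, `DayofMonth`, `FlightNum`, `ArrDelay`, `DepDelay`, `Month`
from flights_data;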
2. You will also need to run the below commands to set the maximum number of dynamic partitions back to the default value of 1000
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=1000;
4. To load data into the table, use the below command (this query may take 12-15 minutes to finish)
As you already know, Hive was first developed at Facebook to give users a tool to query large amounts of data using SQL-like syntax. It was later open-sourced as an Apache project, which has helped its further development. In the beginning, Hive used the MapReduce execution engine to run its queries, but over time newer execution engines have been added to Hadoop, making query processing with Hive even faster.
The introduction of YARN in Hadoop 2.x was pivotal as Hadoop now had a resource manager responsible
for allocating resources to various jobs being run in the ecosystem. With the advent of YARN, all nodes
were treated as worker or processing nodes and they were not classified into fixed groups such as Map
slots or Reduce slots. In other words, any free node can perform any task such as a map or reduce task.
● Tez does not write data to the disk during the intermediary steps of a Hive query. Tez makes use of Directed Acyclic Graphs, and the data from an intermediary step is passed on to the next step in the graph instead of being written to disk, as happens with the MapReduce engine. Removing these IO operations saves a lot of time when dealing with large amounts of data.
● Tez and YARN together enable you to reuse objects in a container across applications. If two applications require the same object (say, a data frame) and are running within the same container, you need not create the same object again and again; you can reuse it. This leads to better management of resources and also helps improve performance.
To find out the current execution engine of your Hive setup, you can run the below command
set hive.execution.engine;
If your current execution engine is something other than Tez, you can use the below command to set the execution engine to Tez (the value is case-insensitive).
set hive.execution.engine=tez;
Hive on Tez greatly improved the performance of queries, but one area where Hive was found lacking was interactive querying. In industry there are use cases that involve querying data at high speed and using the previously generated results to drill down deeper for further insights. Hive with Tez handles batch processing jobs (long-running queries on large datasets, run on a schedule) very well, but high-speed interactive querying was still lacking. This is where Hive LLAP (Live Long and Process) comes into the picture.
LLAP makes Hive capable of running interactive queries by making use of a daemon that runs permanently on each of the data nodes: it lives long and processes the data. When you run a query, Hive first checks whether its LLAP daemon has information about the table and the chunk of data you are querying. If it does, Hive passes the query to the daemon instead of to Tez and its containers. The daemon keeps all its data in an in-memory cache, so the response time for queries handled by the daemon is very low.
The various options available to you for choosing the execution engine are:
● The first is that you don't use LLAP at all and use Tez for running your queries. This method is best when you are using Hive for batch queries and the same data is not queried repeatedly over a short period of time.
● The second option is to configure Hive to always look for LLAP first and then fall back to Tez.
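On clusters where LLAP is installed, this choice is typically controlled through the hive.llap.execution.mode setting; the two values below correspond to the two options above:
set hive.llap.execution.mode=none; -- bypass LLAP and run everything in Tez containers
set hive.llap.execution.mode=all; -- prefer LLAP, falling back to containers when needed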
The need for improved performance has also led to advances in the field of data storage. One of the most widely used and most optimal file formats out there is the Optimized Row Columnar (ORC) file format, which is natively supported by Hive. This format stores data in a columnar fashion, which means the data for a table is stored in column-major order: if there are three columns in a table, column 1 is written first in contiguous memory units of the disk, then column 2, and so on.
When dealing with big data, schemas keep growing and most data sources have a large number of columns. The advantage of columnar storage is that when your query touches only a few columns of a table, Hive can scan just the data for those columns instead of all of them, thereby reducing disk seeks. This reduction in disk seeks increases query performance manifold.
● Because of the homogeneity of the data, i.e. values of the same type being stored together, data compression is significant.
● Because the data is organised into stripes, scanning of non-relevant data is avoided.
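Switching a table to ORC from Hive is a one-line change to the table definition. A sketch using create-table-as-select (the table name is illustrative):
create table flights_data_orc stored as orc as
select * from flights_data;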
Similar to the ORC file format, there is another very popular and powerful file format in the Apache ecosystem: the Parquet file format. It was developed jointly by Cloudera and Twitter, based on Google's Dremel paper. The advantages of the Parquet file format are as follows:
● Parquet is built with complex data structures in mind, so you can also store all sorts of nested data structures in it.
● It gives you the flexibility to choose the encoding on a per-column basis. So, depending on the type of data associated with a column in your table, you can choose a suitable compression scheme.
● All the metadata is written at the end of the Parquet file, which allows the file to be written in a single pass. This is a huge advantage when dealing with large data files.
Another very interesting enhancement that has been added to the Hive ecosystem is the ability to run vectorized operations. Vectorization is a style of programming in which an operation is applied to an entire collection of data at a time instead of to each element individually. Say you have an array of ten integers and want to increase each value by 2. The most obvious way to solve this is a for loop that accesses each element one by one and increases its value by 2. Vectorization instead introduces data-level parallelism: the operations required to transform a set of elements can be performed on all elements of the collection at the same time. That is, a single instruction is applied to multiple data elements in parallel.
Vectorization converts the data from a column into a vector while running the query, which leads to faster execution. In the video, you saw how vectorization reduced the query runtime by 60%. Do note that, as shown in the video, vectorization works best with the ORC file format and does not work with JSON files.
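Vectorized execution is switched on through a Hive setting:
set hive.vectorized.execution.enabled = true;
With this enabled, Hive processes rows in batches (about 1024 at a time) instead of one by one.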