Hive Lecture Notes


Hive is a tool that was developed to help people work with Big Data without having to spend time learning MapReduce. Hive is a "data warehouse software" that enables you to query and manipulate data using a SQL-like language known as HiveQL. It was developed at Facebook so that people with SQL experience could query datasets without learning new paradigms such as MapReduce or new programming languages.

The main features of Hive are:

● An SQL-like interface to write queries on large datasets
● The ability to process all variants of data, i.e. structured, semi-structured and unstructured
● A variety of built-in functions for working with dates, strings, etc.
● Easy ETL (extraction, transformation and loading) of data
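
To give a feel for the SQL-like interface and the built-in functions, here is a minimal illustrative query; the employees table and its columns are hypothetical and only serve to show the syntax.

-- aggregate over a hypothetical employees table using built-in functions
select department,
       count(*) as headcount,
       round(avg(salary), 2) as avg_salary
from employees
where upper(country) = 'USA'
group by department;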

The use cases that the creators of Hive had in mind while working on it were as follows:

● Data Cleaning: Data engineers usually receive data that is not very clean; it may also not be in the format required for analytics. Hive helps them clean this data and convert it into the required formats without too much effort.
● Reporting: Data teams are also tasked with publishing reports and summaries on a regular basis using classic data warehouses such as Oracle and Vertica. A classic data warehouse is most effective when the data set is small and highly structured. In such reporting use cases, the most important task of Hive is to clean the high-volume data, extract the essential facts or metadata from it, and publish those facts or metadata to the classic data warehouse. All reports are then generated and published quickly using the data present in the classic data warehouse.
● Data Applications: Data engineers can build data-driven applications from big data using Hive as the data processing layer. If required, a combination of HiveQL and an external programming language such as Python is used to perform complex data processing activities. Some popular examples of data-driven applications that can be built conveniently using Hive are:
○ Contact recommendations on social networking websites.
○ Generation of summary reports.



Hive provides multiple data types that you can use. You can learn about the various data types in Hive
here

Hive stores and queries data using its data models. The purpose of using data models is to make querying
convenient and fast.

There are four main components in Hive data models, which are similar to how an RDBMS stores data:
1. Databases
2. Tables
3. Partitions
4. Buckets

Let’s start with tables. Hive has two types of tables:

● Managed (or internal) table
● External table

Note: 'Managed table' and 'internal table' are synonymous terms.

You should use external tables when:

● You want to use the data outside of Hive as well, for example, when another application running on the same cluster also needs to read it
● You want the data to remain on HDFS even after the table is dropped, because Hive does not delete data stored outside the Hive warehouse
● You do not want Hive to control the storage of your data (location, directories, etc.)

On the other hand, you use managed tables when:

● The data is temporary, so that when the Hive table is dropped, the data stored in the internal table is deleted along with the metadata
● You want Hive to manage the life cycle of the data completely, i.e. both store and process it
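
If you are unsure whether an existing table is managed or external, a quick optional check is to inspect its metadata with DESCRIBE FORMATTED (flights_data here is simply the table created later in these notes):

-- look for "Table Type: MANAGED_TABLE" or "Table Type: EXTERNAL_TABLE" in the output
describe formatted flights_data;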



To create tables in Hive, you need to use the CREATE TABLE command. To create a table using the airlines dataset, follow the instructions below:
1. Download the file to your laptop.
2. Open the terminal on your laptop and use the command below to send the file to your EC2 instance:

scp -i locationofpemfile locationofdownloadeddataset ec2-user@yourpublicip:/home/ec2-user

3. Log in to your EC2 instance and use the ls command to check whether the data has been transferred.

4. To look at the header and first few rows of the file, use the command head data_2008.csv

Note: You can use the head command on either your local machine or EC2. If using your local machine, you will have to change data_2008.csv to the exact location of your file.
5. Now enter the Hive command-line interface by entering the command hive on your EC2 instance.
6. To create an internal table, use the command below:

create table flights_data(
  `Year` int, `Month` int, `DayofMonth` int, `DayOfWeek` int,
  `DepTime` int, `CRSDepTime` int, `ArrTime` int, `CRSArrTime` int,
  `UniqueCarrier` string, `FlightNum` string, `TailNum` string,
  `ActualElapsedTime` int, `CRSElapsedTime` int, `AirTime` int,
  `ArrDelay` int, `DepDelay` int, `Origin` string, `Dest` string,
  `Distance` bigint, `TaxiIn` int, `TaxiOut` int, `Cancelled` int,
  `CancellationCode` string, `Diverted` int, `CarrierDelay` string,
  `WeatherDelay` string, `NASDelay` string, `SecurityDelay` string,
  `LateAircraftDelay` string)
row format delimited fields terminated by ',';

7. To view the current tables, use the command show tables;

8. To load data into the internal table, use the command below:

load data local inpath 'location of your dataset' overwrite into table flights_data;

9. To view the data in the table, you can use the command below:

select * from flights_data limit 10;

10. To create an external table, you need to modify the CREATE TABLE statement as given below:

create external table flights_data_ext(
  `Year` int, `Month` int, `DayofMonth` int, `DayOfWeek` int,
  `DepTime` int, `CRSDepTime` int, `ArrTime` int, `CRSArrTime` int,
  `UniqueCarrier` string, `FlightNum` string, `TailNum` string,
  `ActualElapsedTime` int, `CRSElapsedTime` int, `AirTime` int,
  `ArrDelay` int, `DepDelay` int, `Origin` string, `Dest` string,
  `Distance` bigint, `TaxiIn` int, `TaxiOut` int, `Cancelled` int,
  `CancellationCode` string, `Diverted` int, `CarrierDelay` string,
  `WeatherDelay` string, `NASDelay` string, `SecurityDelay` string,
  `LateAircraftDelay` string)
row format delimited fields terminated by ','
location 's3a://yourS3bucketname/flights_data';

11. To load data into this table, use the command:

load data local inpath 'location of your dataset' overwrite into table flights_data_ext;



During the module, you ran Hive queries on the Amazon Reviews dataset, which contains customer reviews for Amazon electronic items. The data was hosted on a public S3 bucket; for analysis, you need to copy the data to your own bucket using the command below:

aws s3 cp s3://hivedata-bde/Electronics_5.json s3://yourbucketname/
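
The queries below assume a table named electronics_columns_s3 defined over this JSON data; it was created in a previous session. As a reminder, a minimal sketch of such a DDL looks like the following, assuming the JSON SerDe shipped with hive-hcatalog is available on your cluster and that the location points at the folder containing Electronics_5.json; the exact statement used in the session may differ.

create external table if not exists electronics_columns_s3
(reviewerid string, asin string, reviewername string, helpful array<int>,
 reviewtext string, overall double, summary string, unixreviewtime bigint)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location 's3a://yourbucketname/';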

Some of the queries that can be run on this dataset using Hive are given below.

1. Number of reviews

select count(*) as review_count, count(distinct reviewerid) as reviewer_count,
       count(distinct asin) as product_count, min(unixreviewtime) as min_time,
       max(unixreviewtime) as max_time
from electronics_columns_s3;

2. Average reviews per reviewer

select avg(review_count)
from (
  select reviewerid, count(*) as review_count
  from electronics_columns_s3
  group by reviewerid
) a;

3. Reviews by year

select year(from_unixtime(unixreviewtime)) as yr, count(*) as review_count
from electronics_columns_s3
group by year(from_unixtime(unixreviewtime));



In addition to standard operations such as SELECT and GROUP BY, Hive has many advanced built-in functions that can be used to transform data (both at the row level and the column level) and to gain insights from it. Some of the advanced functions covered in the session video are:

1. explode()

select explode(feedetails) from students;
select explode(subjects) from students;
select explode(feedetails) from students where name = "Alexa";

2. upper()

select upper(name) from students;

3. regexp_replace()

select regexp_replace(concat(upper(name), id), ' ', '') as username from students;
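
Note that explode() on its own can only return the exploded column. If you also want to keep other columns from the row (the student's name, for instance), the usual pattern is LATERAL VIEW; a minimal sketch using the same hypothetical students table:

-- each subject in the subjects array becomes its own row, alongside the student's name
select name, subject
from students
lateral view explode(subjects) subj_view as subject;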

You can view a list of built-in functions available in Hive here.

Hive has two special data models that make it a really powerful tool for data transformation and analytics: partitioning and bucketing.

1. Partitioning: Partitions are used to make queries faster by dividing tables into smaller parts using partition key columns. For example, in a table with details of the employees in a company (name, salary, department, designation), you can create a partition for each department. Hive will store each partition separately and will scan only the partition that is needed, thereby making the query faster.
2. Bucketing: Like partitioning, bucketing also divides the data into smaller and more manageable parts. The key difference between partitioning and bucketing is that bucketing uses the hash of a column to create multiple 'buckets', whereas partitions are more straightforward: they are simply segmented directories according to the value of a column (year, month, etc.).

For creating partitions, the PARTITIONED BY clause is used. Similarly, for creating buckets, the CLUSTERED BY clause is used.



Guidelines for when to use partitioning and bucketing:

● You typically perform partitioning on columns (or groups of columns) that produce a manageable number of partitions (directories), such as year, month, state, country, etc.
● However, if you partition on, say, customer ID, the number of partitions will be enormous and may actually reduce performance; in that case, you can choose to create buckets on customer ID instead (a short sketch of both clauses follows this list)
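
To make the guideline concrete, here is a minimal sketch; the orders table, its columns and the bucket count are hypothetical and only illustrate placing the two clauses together:

-- partition on low-cardinality columns, bucket on a high-cardinality one
create table orders_part_bucket(
  order_id bigint, customer_id bigint, amount double)
partitioned by (country string, yr int)
clustered by (customer_id) into 16 buckets;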

Hive has two options available when using partitions: static partitioning and dynamic partitioning.

Static partitioning example:

You need to download the “Data Files” given below the video.

Note: You should download the file to your local system if you are using the Quickstart VM. If you are using Hue, you can upload the file to HDFS as shown in the video in the previous session.

1. To create the table, use the query below:

CREATE TABLE IF NOT EXISTS customer_partitioned(
  customer_fname varchar(64), customer_lname varchar(64),
  customer_addr string, city varchar(64))
PARTITIONED BY (country VARCHAR(64), state VARCHAR(64))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

2. To load data into the table, use the command below:

LOAD DATA INPATH 'yourfilepath'
INTO TABLE customer_partitioned
PARTITION(country='usa', state='az');

Note: If you are using the VM and the data is on your local machine, use the command below:

LOAD DATA LOCAL INPATH 'yourfilepath'
INTO TABLE customer_partitioned
PARTITION(country='usa', state='az');



Also, note that you will need to provide the path to the data file of the particular country and state that you mention in the PARTITION clause. Similarly, you can load data for other countries as well.
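
As a quick, optional check that the static partition was registered, you can list the table's partitions:

show partitions customer_partitioned;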

Dynamic partitioning example:

You need to download the “customer.dat” file given below the video.

Note: You should download the file to your local system if you are using the Quickstart VM. If you are using Hue, you can upload the file to HDFS as shown in the video in the previous session.

1. To create the table, use the query below:

CREATE TABLE customer(
  customer_fname VARCHAR(64), customer_lname VARCHAR(64),
  customer_addr STRING, city VARCHAR(64), state VARCHAR(64),
  country VARCHAR(64))
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

2. To load data into the table, use the command below:

LOAD DATA INPATH 'yourdatapath' INTO TABLE customer;

Note: If you are using the VM and the data is on your local machine, use the command below:

LOAD DATA LOCAL INPATH 'yourdatapath' INTO TABLE customer;

Note that here you load the full customer.dat file; the country and state values present in the data will be used to create the partitions dynamically in the following steps.

3. By default, dynamic partitioning is disabled in Hive to prevent accidental partition creation. To enable it, we use:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;



4. Create a partitioned table using the query below:

CREATE TABLE partitioned_customer(
  customer_fname varchar(64), customer_lname varchar(64),
  customer_addr string, city varchar(64))
PARTITIONED BY (country VARCHAR(64), state VARCHAR(64));

5. Load data into the partitioned table using the previously created normal table:

insert into table partitioned_customer partition(country, state)
select customer_fname, customer_lname, customer_addr, city, country, state
from customer;

6. To show all the partitions:

show partitions partitioned_customer;

7. A simple select query using the partition columns:

select * from partitioned_customer where state='az' and country='usa';

8. If you want to drop any partition, it can be done as:

ALTER TABLE partitioned_customer DROP IF EXISTS PARTITION(country='usa', state='nw');

9. If you need to change the location that a partition points to, you can use the query below:

ALTER TABLE customer_partitioned PARTITION (country='usa', state='az')
SET LOCATION 'newpath';



1. You will need to enable dynamic partitioning in Hive and also increase the maximum number of partitions before you run the queries. Use the commands below to do so:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=3000;
set hive.exec.max.dynamic.partitions.pernode=3000;

2. To create the table, use the query below:

create external table if not exists electronics_columns_s3_partitioned_year_month
(reviewerid string, asin string, reviewername string, helpful array<int>,
 reviewtext string, overall double, summary string, unixreviewtime bigint)
partitioned by (yr int, mnth int)
location 's3a://yourbucketname/yourfoldername';

3. To load data into the table, use the command below (this query may take 12-15 minutes to finish):

insert overwrite table electronics_columns_s3_partitioned_year_month partition(yr, mnth)
select reviewerid, asin, reviewername, helpful, reviewtext,
       overall, summary, unixreviewtime,
       year(from_unixtime(unixreviewtime)) as yr,
       month(from_unixtime(unixreviewtime)) as mnth
from electronics_columns_s3;

Note: You had already created the “electronics_columns_s3” table in the last session. If you have deleted it, please follow the instructions in that segment to recreate the table before loading the data.

4. To compare query performance between the normal and the partitioned table, use the queries below:
a. For the partitioned table

select overall, count(*) as review_count
from electronics_columns_s3_partitioned_year_month
where yr = 2004 and mnth = 1
group by overall;

b. For the normal table

select overall, count(*) as review_count
from electronics_columns_s3
where year(from_unixtime(unixreviewtime)) = 2004
and month(from_unixtime(unixreviewtime)) = 1
group by overall;
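
If you want to see why the partitioned version is faster, one optional way (a quick sketch; the exact plan output depends on your Hive version and execution engine) is to prefix either query with EXPLAIN and check which paths and partitions are scanned:

explain
select overall, count(*) as review_count
from electronics_columns_s3_partitioned_year_month
where yr = 2004 and mnth = 1
group by overall;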

You need to download the “products.dat” file given on the segment.

Note: You should download the file to your local system if you are using the Quickstart VM. If you are using Hue, you can upload the file to HDFS as shown in the video in the previous session.

1. To create a normal table, use the query below:

CREATE TABLE IF NOT EXISTS products_input(
  productid int, transactiondate date, amount int, store string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

2. To load data into the table, use the command below:

load data inpath 'yourdatapath' into table products_input;

Note: If you are using the VM and the data is on your local machine, use the command below:

load data local inpath 'yourdatapath' into table products_input;

3. To enable bucketing, use:

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.enforce.bucketing=true;



4. Create a bucketed table using the query below:

CREATE TABLE IF NOT EXISTS products_bucket(productid int, amount int)
PARTITIONED BY (store string, transactiondate date)
CLUSTERED BY (productid) INTO 4 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

5. Load data into the bucketed table using the previously created normal table:

insert into table products_bucket partition(store, transactiondate)
select productid, amount, store, transactiondate from products_input;

6. A simple select query using buckets:

select * from products_bucket where store="walmart" and
transactiondate="2018-01-01" and productid = 123;

7. The equivalent select query on the normal table, for comparison:

select * from products_input where store="walmart" and
transactiondate="2018-01-01" and productid = 123;
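
A side benefit of bucketing is efficient sampling: because rows are hashed into a fixed number of buckets, you can ask Hive to read just one of them. A small optional sketch on the table created above:

-- read roughly a quarter of the data (1 bucket out of 4) instead of scanning everything
select * from products_bucket tablesample(bucket 1 out of 4 on productid);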

1. To enable bucketing, use the commands below.

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.enforce.bucketing=true;

Note: To enable bucketing, you need to enable partitioning as well.

2. You will also need to run the commands below to set the maximum number of partitions back to the default value of 1000:

set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=1000;



3. To create the table, use the queries below:

drop table electronics_columns_s3_partitioned_year_month_clustered;

create external table if not exists
electronics_columns_s3_partitioned_year_month_clustered
(reviewerid string, asin string, reviewername string, helpful array<int>,
 reviewtext string, overall double, summary string, unixreviewtime bigint)
partitioned by (yr int, mnth int)
clustered by (reviewerid) into 4 buckets
location 's3://yourbucketname/yourfoldername';

4. To load data into the table, use the command below (this query may take 12-15 minutes to finish):

insert overwrite table electronics_columns_s3_partitioned_year_month_clustered partition(yr, mnth)
select reviewerid, asin, reviewername, helpful, reviewtext,
       overall, summary, unixreviewtime, yr, mnth
from electronics_columns_s3_partitioned_year_month;

Note: You had already created the “electronics_columns_s3_partitioned_year_month” table in the last segment. If you have deleted it, please follow the instructions in that segment to recreate the table before loading the data.

As you already know, Hive was first developed at Facebook to provide users with a tool to query large amounts of data using SQL-like syntax. It was later open-sourced as an Apache project, which has helped the further development of Hive. In the beginning, Hive used the MapReduce execution engine to run its queries, but over time newer execution engines have been added to Hadoop, which has made query processing using Hive even faster.

The introduction of YARN in Hadoop 2.x was pivotal, as Hadoop now had a resource manager responsible for allocating resources to the various jobs running in the ecosystem. With the advent of YARN, all nodes are treated as worker or processing nodes and are no longer classified into fixed groups such as map slots or reduce slots. In other words, any free node can perform any task, such as a map or reduce task.



The benefits that Tez provides over the MapReduce execution engine while using Hive are:

● Tez does not write data to disk during the intermediary steps of a Hive query. Tez makes use of Directed Acyclic Graphs (DAGs), and the data from an intermediary step is passed on to the next step in the graph instead of being written to disk, as is done when using the MapReduce engine. Removing these I/O operations saves a lot of time when dealing with large amounts of data.
● Tez and YARN together enable you to reuse objects in a container across applications. If two applications require the same object (say a data frame) and are running within the same container, you need not create the same object again and again; you can reuse it. This leads to better management of resources and also helps improve performance.

To find out the current execution engine of your Hive setup, you can run the command below:

set hive.execution.engine;

If your current execution engine is something other than Tez, you can use the command below to set the execution engine to Tez (the value is case-insensitive):

set hive.execution.engine=tez;

Hive Execution Engine: LLAP

Hive on Tez really improved the performance of queries, but one area where Hive was found lacking was interactive querying. In the industry, there are use cases that involve querying data at high speed and then using the previously generated results to drill down deeper for further insights. Hive with Tez handles batch-processing jobs (long-running queries on large datasets on a schedule) very well, but high-speed interactive querying was still lacking. This is where Hive LLAP (Live Long and Process) comes into the picture.

LLAP makes Hive capable of running interactive queries by making use of a daemon that is always running on the data nodes; it lives long and processes the data. What this means is that a daemon runs on each of the data nodes, and when you run a query, Hive first checks its LLAP daemon to see whether it has the information about the table and the chunk of data that you are querying. If yes, Hive passes that query on to the daemon instead of Tez and containers. The daemon keeps the data in its in-memory cache, so the response time for queries handled by the daemon is very low.

The various options available to you for choosing the execution engine are:

● The first is that you don't use LLAP at all and use Tez for running your queries. This method is best when you are using Hive for processing batch queries and the same data is not queried again and again over a short period of time.
● The second option is to configure Hive to always look for LLAP first and then fall back to Tez.
● The third option is to configure Hive to use LLAP based on some criteria and to use Tez if those criteria are not met.
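
In practice, this choice is controlled through Hive configuration. A minimal sketch follows; the property name is from stock Apache Hive, but the values available (none, map, auto, all, only) and their exact behaviour depend on your Hive version and on LLAP daemons actually running on your cluster:

-- route everything through Tez only (no LLAP)
set hive.llap.execution.mode=none;
-- let Hive decide per query whether to use LLAP, falling back to Tez
set hive.llap.execution.mode=auto;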

The need for improved performance has also led to advances in the field of data storage.

One of the most widely used and most optimal file formats out there is the Optimized Row Columnar (ORC) file format. It is natively supported by Hive. This format stores data in a columnar fashion, which means the data for a table is stored in column-major order, i.e. if there are three columns in a table, then column 1 is written first in contiguous memory units of the disk, then column 2, and so on.

When dealing with big data, schemas keep growing and most data sources have a large number of columns. The advantage of columnar storage is that when your query deals with only a few columns of a table, it is possible to scan only the data needed for those particular columns instead of all the columns, thereby reducing disk seeks. This reduction in disk seeks increases query performance manifold.

Some of the benefits of using the ORC file format are:

● Because of the homogeneity of data, i.e. data of the same type being stored together, data compression is significant
● Because of the inclusion of stripes, scanning of non-relevant data is avoided
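
Using ORC from Hive is mostly a matter of the DDL; a minimal sketch (the table and column names are hypothetical, and the SNAPPY codec shown is just one common, optional choice):

-- store the table in ORC with an explicit compression codec
create table reviews_orc(reviewerid string, overall double, summary string)
stored as orc
tblproperties ("orc.compress"="SNAPPY");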

Similar to the ORC file format, there is another very popular and powerful file format in the Apache ecosystem called the Parquet file format. It was developed jointly by Cloudera and Twitter, based on the Dremel paper; you can read the detailed paper at this link. The advantages of the Parquet file format are as follows:

● Parquet is built with complex data structures in mind, so you can also store all sorts of nested data structures in it.
● It gives you the flexibility to choose the encoding on a per-column basis. So, depending on the type and the data associated with each column in your table, you can choose a suitable compression scheme.
● All the metadata is written at the end of the Parquet file, which allows you to write the file in a single pass. This is a huge advantage when dealing with large data files.



● It separates metadata from the column data. What this essentially means for processing engines is that you can very easily split these files: column data can be split and sent across different nodes.
● A single metadata file is capable of referencing multiple Parquet column data files. This kind of flexibility is really helpful when you are working in a distributed environment.
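
As with ORC, using Parquet from Hive is mostly a matter of the DDL; a minimal sketch with a hypothetical table:

create table reviews_parquet(reviewerid string, overall double, summary string)
stored as parquet;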

Another very interesting enhancement that has been added to the Hive ecosystem is the ability to run vectorized operations. Vectorisation is a style of computer programming in which an operation is applied to an entire collection of data at a time instead of being applied individually to each element one by one. Let's say you have an array of ten integer values and you want to increase the value of each integer by 2. The most obvious way of solving this problem is to use a for loop to access each element one by one and then increase its value by 2. But vectorisation introduces data-level parallelism, which means that the operations required to transform a set of elements can be performed on all elements of the collection at the same time. That is, a single instruction can be applied to multiple data elements in parallel.

Vectorization converts the data from a particular column into a vector while running the query, which leads to faster execution. In the video, you saw how using vectorization we were able to reduce the query runtime by 60%. Do note that, as shown in the video, vectorization works best with the ORC file format and does not work with JSON files.
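
If you want to try this yourself, vectorized execution is controlled by a Hive setting (introduced around Hive 0.13; whether it actually kicks in depends on the file format and the operators in your query):

set hive.vectorized.execution.enabled=true;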



Disclaimer​: All content and material on the UpGrad website is copyrighted material, either belonging to UpGrad or
its bonafide contributors and is purely for the dissemination of education. You are permitted to access print and
download extracts from this site purely for your own education only and on the following basis:

● You can download this document from the website for self-use only.
● Any copies of this document, in part or full, saved to disc or to any other storage medium may only be used
for subsequent, self-viewing purposes or to print an individual extract or copy for non-commercial personal
use only.
● Any further dissemination, distribution, reproduction, copying of the content of the document herein or the
uploading thereof on other websites or use of content for any other commercial/unauthorized purposes in
any way which could infringe the intellectual property rights of UpGrad or its contributors, is strictly
prohibited.
● No graphics, images or photographs from any accompanying text in this document will be used separately
for unauthorised purposes.
● No material in this document will be modified, adapted or altered in any way.
● No part of this document or UpGrad content may be reproduced or stored in any other web site or included
in any public or private electronic retrieval system or service without UpGrad’s prior written permission.
● Any rights not expressly granted in these terms are reserved.

© Copyright 2018. UpGrad Education Pvt. Ltd. All rights reserved
