Hive Lecture Notes


Hive is a tool that was developed to help people work with Big Data without having to spend time learning MapReduce. Hive is a "data warehouse software" that enables you to query and manipulate data using a SQL-like language known as HiveQL. It was developed at Facebook so that people with SQL experience could query datasets without learning new paradigms such as MapReduce or new programming languages.

The main features of Hive are:

● An SQL-like interface to write queries on large datasets
● The ability to process all variants of data, i.e. structured, semi-structured and unstructured
● A variety of built-in functions for working with dates, strings, etc.
● Easy ETL (extraction, transformation and loading) of data
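
To give a feel for the SQL-like interface and the built-in functions, here is a minimal illustrative query; the employees table and its columns are hypothetical and only serve to show the syntax.

-- aggregate over a hypothetical employees table using built-in functions
select department,
       count(*) as headcount,
       round(avg(salary), 2) as avg_salary
from employees
where upper(country) = 'USA'
group by department;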

The use cases that the creators of Hive had in mind while working on it were as follows:

● Data Cleaning: Data engineers usually receive data that is not very clean; it may also not be in the format required for analytics. Hive helps them clean this data and convert it into the required formats without too much effort.
● Reporting: Data teams are also tasked with publishing reports and summaries on a regular basis using classic data warehouses such as Oracle and Vertica. A classic data warehouse is most effective when the data set is small and highly structured. In such reporting use cases, the most important task of Hive is to clean the high-volume data, extract the essential facts or metadata from it, and publish those facts or metadata to the classic data warehouse. All reports are then generated and published quickly using the data present in the classic data warehouse.
● Data Applications: Data engineers can build data-driven applications from big data using Hive as the data processing layer. If required, a combination of HiveQL and an external programming language such as Python is used to perform complex data processing activities. Some popular examples of data-driven applications that can be built conveniently using Hive are:
○ Contact recommendations on social networking websites.
○ Generation of summary reports.



Hive provides multiple data types that you can use. You can learn about the various data types in Hive
here

Hive stores and queries data using its data models. The purpose of using data models is to make querying
convenient and fast.

There are four main components in Hive data models, which are similar to how an RDBMS stores data:
1. Databases
2. Tables
3. Partitions
4. Buckets

Let’s start with tables. Hive has two types of tables:

● Managed (or internal) table
● External table

Note: 'Managed table' and 'internal table' are synonymous terms.

You should use external tables when:

● You want to use the data outside of Hive as well, for example, when another application running on the same cluster also needs to read it
● You want the data to remain on HDFS even after the table is dropped, because Hive does not delete data stored outside the Hive warehouse
● You do not want Hive to control the storage of your data (location, directories, etc.)

On the other hand, you use managed tables when:

● The data is temporary, so that when the Hive table is dropped, the data stored in the internal table is deleted along with the metadata
● You want Hive to manage the life cycle of the data completely, i.e. both store and process it
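
If you are unsure whether an existing table is managed or external, a quick optional check is to inspect its metadata with DESCRIBE FORMATTED (flights_data here is simply the table created later in these notes):

-- look for "Table Type: MANAGED_TABLE" or "Table Type: EXTERNAL_TABLE" in the output
describe formatted flights_data;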



To create tables in Hive, you need to use the CREATE TABLE command. To create a table using the airlines dataset, follow the instructions below:
1. Download the file to your laptop.
2. Open the terminal on your laptop and use the command below to send the file to your EC2 instance:

scp -i locationofpemfile locationofdownloadeddataset ec2-user@yourpublicip:/home/ec2-user

3. Log in to your EC2 instance and use the ls command to check whether the data has been transferred.

4. To look at the header and first few rows of the file, use the command head data_2008.csv

Note: You can use the head command on either your local machine or EC2. If using your local machine, you will have to change data_2008.csv to the exact location of your file.
5. Now enter the Hive command-line interface by entering the command hive on your EC2 instance.
6. To create an internal table, use the command below:

create table flights_data(
  `Year` int, `Month` int, `DayofMonth` int, `DayOfWeek` int,
  `DepTime` int, `CRSDepTime` int, `ArrTime` int, `CRSArrTime` int,
  `UniqueCarrier` string, `FlightNum` string, `TailNum` string,
  `ActualElapsedTime` int, `CRSElapsedTime` int, `AirTime` int,
  `ArrDelay` int, `DepDelay` int, `Origin` string, `Dest` string,
  `Distance` bigint, `TaxiIn` int, `TaxiOut` int, `Cancelled` int,
  `CancellationCode` string, `Diverted` int, `CarrierDelay` string,
  `WeatherDelay` string, `NASDelay` string, `SecurityDelay` string,
  `LateAircraftDelay` string)
row format delimited fields terminated by ',';

7. To view the current tables, use the command show tables;

8. To load data into the internal table, use the command below:

load data local inpath 'location of your dataset' overwrite into table flights_data;

9. To view the data in the table, you can use the command below:

select * from flights_data limit 10;

10. To create an external table, you need to modify the CREATE TABLE statement as given below:

create external table flights_data_ext(
  `Year` int, `Month` int, `DayofMonth` int, `DayOfWeek` int,
  `DepTime` int, `CRSDepTime` int, `ArrTime` int, `CRSArrTime` int,
  `UniqueCarrier` string, `FlightNum` string, `TailNum` string,
  `ActualElapsedTime` int, `CRSElapsedTime` int, `AirTime` int,
  `ArrDelay` int, `DepDelay` int, `Origin` string, `Dest` string,
  `Distance` bigint, `TaxiIn` int, `TaxiOut` int, `Cancelled` int,
  `CancellationCode` string, `Diverted` int, `CarrierDelay` string,
  `WeatherDelay` string, `NASDelay` string, `SecurityDelay` string,
  `LateAircraftDelay` string)
row format delimited fields terminated by ','
location 's3a://yourS3bucketname/flights_data';

11. To load data into this table, use the command:

load data local inpath 'location of your dataset' overwrite into table flights_data_ext;



During the module, you ran Hive queries on the Amazon Reviews dataset, which contains customer reviews for Amazon electronic items. The data was hosted on a public S3 bucket; for analysis, you need to copy the data to your own bucket using the command below:

aws s3 cp s3://hivedata-bde/Electronics_5.json s3://yourbucketname/
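
The queries below assume a table named electronics_columns_s3 defined over this JSON data; it was created in a previous session. As a reminder, a minimal sketch of such a DDL looks like the following, assuming the JSON SerDe shipped with hive-hcatalog is available on your cluster and that the location points at the folder containing Electronics_5.json; the exact statement used in the session may differ.

create external table if not exists electronics_columns_s3
(reviewerid string, asin string, reviewername string, helpful array<int>,
 reviewtext string, overall double, summary string, unixreviewtime bigint)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location 's3a://yourbucketname/';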

Some of the queries that can be run on this dataset using Hive are given below.

1. Number of reviews

select count(*) as review_count, count(distinct reviewerid) as reviewer_count,
       count(distinct asin) as product_count, min(unixreviewtime) as min_time,
       max(unixreviewtime) as max_time
from electronics_columns_s3;

2. Average reviews per reviewer

select avg(review_count)
from (
  select reviewerid, count(*) as review_count
  from electronics_columns_s3
  group by reviewerid
) a;

3. Reviews by year

select year(from_unixtime(unixreviewtime)) as yr, count(*) as review_count
from electronics_columns_s3
group by year(from_unixtime(unixreviewtime));



In addition to standard operations such as SELECT and GROUP BY, Hive has many advanced built-in functions that can be used to transform data (both at the row level and the column level) and to gain insights from it. Some of the advanced functions covered in the session video are:

1. explode()

select explode(feedetails) from students;
select explode(subjects) from students;
select explode(feedetails) from students where name = "Alexa";

2. upper()

select upper(name) from students;

3. regexp_replace()

select regexp_replace(concat(upper(name), id), ' ', '') as username from students;
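
Note that explode() on its own can only return the exploded column. If you also want to keep other columns from the row (the student's name, for instance), the usual pattern is LATERAL VIEW; a minimal sketch using the same hypothetical students table:

-- each subject in the subjects array becomes its own row, alongside the student's name
select name, subject
from students
lateral view explode(subjects) subj_view as subject;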

You can view a list of built-in functions available in Hive here.

Hive has two special data models that make it a really powerful tool for data transformation and analytics: partitioning and bucketing.

1. Partitioning: Partitions are used to make queries faster by dividing tables into smaller parts using partition key columns. For example, in a table with details of the employees in a company (name, salary, department, designation), you can create a partition for each department. Hive will store each partition separately and will scan only the partition that is needed, thereby making the query faster.
2. Bucketing: Like partitioning, bucketing also divides the data into smaller and more manageable parts. The key difference between partitioning and bucketing is that bucketing uses the hash of a column to create multiple 'buckets', whereas partitions are more straightforward: they are simply segmented directories according to the value of a column (year, month, etc.).

For creating partitions, the PARTITIONED BY clause is used. Similarly, for creating buckets, the CLUSTERED BY clause is used.



Guidelines for when to use partitioning and bucketing:

● You typically perform partitioning on columns (or groups of columns) that produce a manageable number of partitions (directories), such as year, month, state, country, etc.
● However, if you partition on, say, customer ID, the number of partitions will be enormous and may actually reduce performance; in that case, you can choose to create buckets on customer ID instead (a short sketch of both clauses follows this list)
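
To make the guideline concrete, here is a minimal sketch; the orders table, its columns and the bucket count are hypothetical and only illustrate placing the two clauses together:

-- partition on low-cardinality columns, bucket on a high-cardinality one
create table orders_part_bucket(
  order_id bigint, customer_id bigint, amount double)
partitioned by (country string, yr int)
clustered by (customer_id) into 16 buckets;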

Hive has two options available when using partitions: static partitioning and dynamic partitioning.

Static partitioning example:

You need to download the “Data Files” given below the video.

Note: You should download the file to your local system if you are using the Quickstart VM. If you are using Hue, you can upload the file to HDFS as shown in the video in the previous session.

1. To create the table, use the query below:

CREATE TABLE IF NOT EXISTS customer_partitioned(
  customer_fname varchar(64), customer_lname varchar(64),
  customer_addr string, city varchar(64))
PARTITIONED BY (country VARCHAR(64), state VARCHAR(64))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

2. To load data into the table, use the command below:

LOAD DATA INPATH 'yourfilepath'
INTO TABLE customer_partitioned
PARTITION(country='usa', state='az');

Note: If you are using the VM and the data is on your local machine, use the command below:

LOAD DATA LOCAL INPATH 'yourfilepath'
INTO TABLE customer_partitioned
PARTITION(country='usa', state='az');



Also, note that you will need to provide the path to the data file of the particular country and state that you mention in the PARTITION clause. Similarly, you can load data for other countries as well.
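
As a quick, optional check that the static partition was registered, you can list the table's partitions:

show partitions customer_partitioned;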

Dynamic partitioning example:

You need to download the “customer.dat” file given below the video.

Note: You should download the file to your local system if you are using the Quickstart VM. If you are using Hue, you can upload the file to HDFS as shown in the video in the previous session.

1. To create the table, use the query below:

CREATE TABLE customer(
  customer_fname VARCHAR(64), customer_lname VARCHAR(64),
  customer_addr STRING, city VARCHAR(64), state VARCHAR(64),
  country VARCHAR(64))
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

2. To load data into the table, use the command below:

LOAD DATA INPATH 'yourdatapath' INTO TABLE customer;

Note: If you are using the VM and the data is on your local machine, use the command below:

LOAD DATA LOCAL INPATH 'yourdatapath' INTO TABLE customer;

Note that here you load the full customer.dat file; the country and state values present in the data will be used to create the partitions dynamically in the following steps.

3. By default, dynamic partitioning is disabled in Hive to prevent accidental partition creation. To enable it, we use:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;



4. Create a partitioned table using the query below:

CREATE TABLE partitioned_customer(
  customer_fname varchar(64), customer_lname varchar(64),
  customer_addr string, city varchar(64))
PARTITIONED BY (country VARCHAR(64), state VARCHAR(64));

5. Load data into the partitioned table using the previously created normal table:

insert into table partitioned_customer partition(country, state)
select customer_fname, customer_lname, customer_addr, city, country, state
from customer;

6. To show all the partitions:

show partitions partitioned_customer;

7. A simple select query using the partition columns:

select * from partitioned_customer where state='az' and country='usa';

8. If you want to drop any partition, it can be done as:

ALTER TABLE partitioned_customer DROP IF EXISTS PARTITION(country='usa', state='nw');

9. If you need to change the location that a partition points to, you can use the query below:

ALTER TABLE customer_partitioned PARTITION (country='usa', state='az')
SET LOCATION 'newpath';



1. You will need to enable dynamic partitioning in Hive and also increase the maximum number of partitions before you run the queries. Use the commands below to do so:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=3000;
set hive.exec.max.dynamic.partitions.pernode=3000;

2. To create the table, use the query below:

create external table if not exists electronics_columns_s3_partitioned_year_month
(reviewerid string, asin string, reviewername string, helpful array<int>,
 reviewtext string, overall double, summary string, unixreviewtime bigint)
partitioned by (yr int, mnth int)
location 's3a://yourbucketname/yourfoldername';

3. To load data into the table, use the command below (this query may take 12-15 minutes to finish):

insert overwrite table electronics_columns_s3_partitioned_year_month partition(yr, mnth)
select reviewerid, asin, reviewername, helpful, reviewtext,
       overall, summary, unixreviewtime,
       year(from_unixtime(unixreviewtime)) as yr,
       month(from_unixtime(unixreviewtime)) as mnth
from electronics_columns_s3;

Note: You had already created the “electronics_columns_s3” table in the last session. If you have deleted it, please follow the instructions in that segment to recreate the table before loading the data.

4. To compare query performance between the normal and the partitioned table, use the queries below:
a. For the partitioned table

select overall, count(*) as review_count
from electronics_columns_s3_partitioned_year_month
where yr = 2004 and mnth = 1
group by overall;

b. For the normal table

select overall, count(*) as review_count
from electronics_columns_s3
where year(from_unixtime(unixreviewtime)) = 2004
and month(from_unixtime(unixreviewtime)) = 1
group by overall;
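
If you want to see why the partitioned version is faster, one optional way (a quick sketch; the exact plan output depends on your Hive version and execution engine) is to prefix either query with EXPLAIN and check which paths and partitions are scanned:

explain
select overall, count(*) as review_count
from electronics_columns_s3_partitioned_year_month
where yr = 2004 and mnth = 1
group by overall;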

You need to download the “products.dat” file given on the segment.

Note: You should download the file to your local system if you are using the Quickstart VM. If you are using Hue, you can upload the file to HDFS as shown in the video in the previous session.

1. To create a normal table, use the query below:

CREATE TABLE IF NOT EXISTS products_input(
  productid int, transactiondate date, amount int, store string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

2. To load data into the table, use the command below:

load data inpath 'yourdatapath' into table products_input;

Note: If you are using the VM and the data is on your local machine, use the command below:

load data local inpath 'yourdatapath' into table products_input;

3. To enable bucketing, use:

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.enforce.bucketing=true;



4. Create a bucketed table using the query below:

CREATE TABLE IF NOT EXISTS products_bucket(productid int, amount int)
PARTITIONED BY (store string, transactiondate date)
CLUSTERED BY (productid) INTO 4 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

5. Load data into the bucketed table using the previously created normal table:

insert into table products_bucket partition(store, transactiondate)
select productid, amount, store, transactiondate from products_input;

6. A simple select query using buckets:

select * from products_bucket where store="walmart" and
transactiondate="2018-01-01" and productid = 123;

7. The equivalent select query on the normal table, for comparison:

select * from products_input where store="walmart" and
transactiondate="2018-01-01" and productid = 123;
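
A side benefit of bucketing is efficient sampling: because rows are hashed into a fixed number of buckets, you can ask Hive to read just one of them. A small optional sketch on the table created above:

-- read roughly a quarter of the data (1 bucket out of 4) instead of scanning everything
select * from products_bucket tablesample(bucket 1 out of 4 on productid);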

1. To enable bucketing, use the commands below.

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.enforce.bucketing=true;

Note: To enable bucketing, you need to enable partitioning as well.

2. You will also need to run the commands below to set the maximum number of partitions back to the default value of 1000:

set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=1000;



3. To create the table, use the queries below:

drop table electronics_columns_s3_partitioned_year_month_clustered;

create external table if not exists
electronics_columns_s3_partitioned_year_month_clustered
(reviewerid string, asin string, reviewername string, helpful array<int>,
 reviewtext string, overall double, summary string, unixreviewtime bigint)
partitioned by (yr int, mnth int)
clustered by (reviewerid) into 4 buckets
location 's3://yourbucketname/yourfoldername';

4. To load data into the table, use the command below (this query may take 12-15 minutes to finish):

insert overwrite table electronics_columns_s3_partitioned_year_month_clustered partition(yr, mnth)
select reviewerid, asin, reviewername, helpful, reviewtext,
       overall, summary, unixreviewtime, yr, mnth
from electronics_columns_s3_partitioned_year_month;

Note: You had already created the “electronics_columns_s3_partitioned_year_month” table in the last segment. If you have deleted it, please follow the instructions in that segment to recreate the table before loading the data.

As you already know, Hive was first developed at Facebook to provide users with a tool to query large amounts of data using SQL-like syntax. It was later open-sourced as an Apache project, which has helped the further development of Hive. In the beginning, Hive used the MapReduce execution engine to run its queries, but over time newer execution engines have been added to Hadoop, which has made query processing using Hive even faster.

The introduction of YARN in Hadoop 2.x was pivotal, as Hadoop now had a resource manager responsible for allocating resources to the various jobs running in the ecosystem. With the advent of YARN, all nodes are treated as worker or processing nodes and are no longer classified into fixed groups such as map slots or reduce slots. In other words, any free node can perform any task, such as a map or reduce task.



The benefits that Tez provides over the MapReduce execution engine while using Hive are:

● Tez does not write data to disk during the intermediary steps of a Hive query. Tez makes use of Directed Acyclic Graphs (DAGs), and the data from an intermediary step is passed on to the next step in the graph instead of being written to disk, as is done when using the MapReduce engine. Removing these I/O operations saves a lot of time when dealing with large amounts of data.
● Tez and YARN together enable you to reuse objects in a container across applications. If two applications require the same object (say a data frame) and are running within the same container, you need not create the same object again and again; you can reuse it. This leads to better management of resources and also helps improve performance.

To find out the current execution engine of your Hive setup, you can run the command below:

set hive.execution.engine;

If your current execution engine is something other than Tez, you can use the command below to set the execution engine to Tez (the value is case-insensitive):

set hive.execution.engine=tez;

Hive Execution Engine: LLAP

Hive on Tez really improved the performance of queries, but one area where Hive was found lacking was interactive querying. In the industry, there are use cases that involve querying data at high speed and then using the previously generated results to drill down deeper for further insights. Hive with Tez handles batch-processing jobs (long-running queries on large datasets on a schedule) very well, but high-speed interactive querying was still lacking. This is where Hive LLAP (Live Long and Process) comes into the picture.

LLAP makes Hive capable of running interactive queries by making use of a daemon that is always running on the data nodes; it lives long and processes the data. What this means is that a daemon runs on each of the data nodes, and when you run a query, Hive first checks its LLAP daemon to see whether it has the information about the table and the chunk of data that you are querying. If yes, Hive passes that query on to the daemon instead of Tez and containers. The daemon keeps the data in its in-memory cache, so the response time for queries handled by the daemon is very low.

The various options available to you for choosing the execution engine are:

● The first is that you don't use LLAP at all and use Tez for running your queries. This method is best when you are using Hive for processing batch queries and the same data is not queried again and again over a short period of time.
● The second option is to configure Hive to always look for LLAP first and then fall back to Tez.
● The third option is to configure Hive to use LLAP based on some criteria and to use Tez if those criteria are not met.
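
In practice, this choice is controlled through Hive configuration. A minimal sketch follows; the property name is from stock Apache Hive, but the values available (none, map, auto, all, only) and their exact behaviour depend on your Hive version and on LLAP daemons actually running on your cluster:

-- route everything through Tez only (no LLAP)
set hive.llap.execution.mode=none;
-- let Hive decide per query whether to use LLAP, falling back to Tez
set hive.llap.execution.mode=auto;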

The need for improved performance has also led to advances in the field of data storage.

One of the most widely used and most optimal file formats out there is the Optimized Row Columnar (ORC) file format. It is natively supported by Hive. This format stores data in a columnar fashion, which means the data for a table is stored in column-major order, i.e. if there are three columns in a table, then column 1 is written first in contiguous memory units of the disk, then column 2, and so on.

When dealing with big data, schemas keep growing and most data sources have a large number of columns. The advantage of columnar storage is that when your query deals with only a few columns of a table, it is possible to scan only the data needed for those particular columns instead of all the columns, thereby reducing disk seeks. This reduction in disk seeks increases query performance manifold.

Some of the benefits of using the ORC file format are:

● Because of the homogeneity of data, i.e. data of the same type being stored together, data compression is significant
● Because of the inclusion of stripes, scanning of non-relevant data is avoided
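
Using ORC from Hive is mostly a matter of the DDL; a minimal sketch (the table and column names are hypothetical, and the SNAPPY codec shown is just one common, optional choice):

-- store the table in ORC with an explicit compression codec
create table reviews_orc(reviewerid string, overall double, summary string)
stored as orc
tblproperties ("orc.compress"="SNAPPY");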

Similar to the ORC file format, there is another very popular and powerful file format in the Apache ecosystem called the Parquet file format. It was developed jointly by Cloudera and Twitter, based on the Dremel paper; you can read the detailed paper at this link. The advantages of the Parquet file format are as follows:

● Parquet is built with complex data structures in mind, so you can also store all sorts of nested data structures in it.
● It gives you the flexibility to choose the encoding on a per-column basis. So, depending on the type and the data associated with each column in your table, you can choose a suitable compression scheme.
● All the metadata is written at the end of the Parquet file, which allows you to write the file in a single pass. This is a huge advantage when dealing with large data files.



● It separates metadata from the column data. What this essentially means for processing engines is that you can very easily split these files: column data can be split and sent across different nodes.
● A single metadata file is capable of referencing multiple Parquet column data files. This kind of flexibility is really helpful when you are working in a distributed environment.
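
As with ORC, using Parquet from Hive is mostly a matter of the DDL; a minimal sketch with a hypothetical table:

create table reviews_parquet(reviewerid string, overall double, summary string)
stored as parquet;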

Another very interesting enhancement that has been added to the Hive ecosystem is the ability to run vectorized operations. Vectorisation is a style of computer programming in which an operation is applied to an entire collection of data at a time instead of being applied individually to each element one by one. Let's say you have an array of ten integer values and you want to increase the value of each integer by 2. The most obvious way of solving this problem is to use a for loop to access each element one by one and then increase its value by 2. But vectorisation introduces data-level parallelism, which means that the operations required to transform a set of elements can be performed on all elements of the collection at the same time. That is, a single instruction can be applied to multiple data elements in parallel.

Vectorization converts the data from a particular column into a vector while running the query, which leads to faster execution. In the video, you saw how using vectorization we were able to reduce the query runtime by 60%. Do note that, as shown in the video, vectorization works best with the ORC file format and does not work with JSON files.
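
If you want to try this yourself, vectorized execution is controlled by a Hive setting (introduced around Hive 0.13; whether it actually kicks in depends on the file format and the operators in your query):

set hive.vectorized.execution.enabled=true;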



Disclaimer​: All content and material on the UpGrad website is copyrighted material, either belonging to UpGrad or
its bonafide contributors and is purely for the dissemination of education. You are permitted to access print and
download extracts from this site purely for your own education only and on the following basis:

● You can download this document from the website for self-use only.
● Any copies of this document, in part or full, saved to disc or to any other storage medium may only be used
for subsequent, self-viewing purposes or to print an individual extract or copy for non-commercial personal
use only.
● Any further dissemination, distribution, reproduction, copying of the content of the document herein or the
uploading thereof on other websites or use of content for any other commercial/unauthorized purposes in
any way which could infringe the intellectual property rights of UpGrad or its contributors, is strictly
prohibited.
● No graphics, images or photographs from any accompanying text in this document will be used separately
for unauthorised purposes.
● No material in this document will be modified, adapted or altered in any way.
● No part of this document or UpGrad content may be reproduced or stored in any other web site or included
in any public or private electronic retrieval system or service without UpGrad’s prior written permission.
● Any rights not expressly granted in these terms are reserved.

© Copyright 2018. UpGrad Education Pvt. Ltd. All rights reserved
