
But HQL is not meant for operating on rows randomly, such as reading, inserting, updating, or deleting individual records at random.

In older versions of Hive (such as 0.7.x, 0.8.x, ...), Hive does not support update and delete operations.

But in the latest versions (0.14 onward), Hive supports update and delete.

HIVE INSTALLATION

Hive installation steps

1. Start by downloading the most recent stable release of Hive from one of the Apache download mirrors through the browser (for example hive-0.14.0.tar.gz).

2. Copy the Hive tar file into the work folder.

3. Bigdata@master:~$ tar -xzf hive-0.14.0.tar.gz

4. Set the environment variable HIVE_HOME in the bashrc file.

export HIVE_HOME=/home/matuser/work/hive-0.14.0
export PATH=$PATH:$HIVE_HOME/bin
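
To verify, reload the bashrc and check the variable (a quick sanity check; the output assumes the install path above):

Bigdata@master:~$ source ~/.bashrc
Bigdata@master:~$ echo $HIVE_HOME
/home/matuser/work/hive-0.14.0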

Create the following directory before starting Hive for the first time. Once we create a table in Hive, a folder with the table name will be created in /user/hive/warehouse; this path is already configured in hive-site.xml.

Bigdata@master:~$ hadoop fs -mkdir /user/hive/warehouse/

In hive-site.xml we will find the following property:

<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the
warehouse</description>
</property>

By default Hive is connected to an embedded database called Derby to store metadata.

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
Derby data can't be shared between Hive users. To share metadata between Hive clients, configure an external database (like MySQL).

To configure MySQL as a metastore

Steps:

$ mysql -u root -p
Enter password:
mysql> create database metastore_db;

Create a new user in MySQL:

mysql> CREATE USER 'matuser'@'localhost' IDENTIFIED BY 'matuser';
mysql> GRANT ALL PRIVILEGES on metastore_db.* to matuser@localhost;

Then go to hive-site.xml and modify the following properties to communicate with MySQL:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/metastore_db</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>

<property>

<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC
metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>matuser</value>
<description>username to use against metastore
database</description>
</property>

<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>matuser</value>
<description>password to use against metastore
database</description>
</property>

To enter into Hive:

Bigdata@master:~$ hive
hive>

Hive DDL(data definition language)

create

alter

drop

In Hive, each table should be associated with a database. A database is a logical group of tables and views. The default database in Hive is "default".

If you don't select any database, your table will be stored in the "default" database.

The HDFS location of the default database is as follows:

/user/hive/warehouse

When a table is created in the Hive default database, a directory will be created in the warehouse directory.

Ex:

If you create a table with the name table1, then a table1 directory will be created in the following path:

/user/hive/warehouse/table1

We can also create custom databases (user defined).
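
For example, a minimal sketch of a user-defined database (the name mydb is illustrative):

hive> create database mydb;
hive> use mydb;
hive> create table t1(line String);

In HDFS, the table directory then appears under /user/hive/warehouse/mydb.db/t1 (database directories get a .db suffix).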

hive complex data types & simple data types

hive simple types

1) Tinyint (1 byte)
2) Smallint (2 bytes)
3) Int (4 bytes)
4) Bigint (8 bytes)
5) Float (4 bytes)
6) Double (8 bytes)
7) Decimal (hive-0.11.0)
8) Date (hive-0.12.0)
9) Timestamp (hive-0.8.0)
10) String (max size 2 GB)
11) Varchar (hive-0.12.0, supports 1 to 65535 characters)
12) Char (hive-0.13.0)
13) Boolean ---> true/false
14) Binary --> hive-0.8.0

hive complex data types

Collection data types:

1) Array --> collection of elements
2) Map --> collection of key & value pairs
3) Struct --> collection of attributes with different data types.

Hive tables:

Hive tables can be classified in the following ways:

1) Internal and external tables
2) Partitioned and non-partitioned tables
3) Bucketed and non-bucketed tables

Example of internal and external tables

Creation of a file with data in the local system:

$cat > file1

Hadoop

Pig

Hive

^d

Inner table (by default each table is an inner table)

hive> create table sample(line String);

In HDFS, a directory with the table name "sample" will be created: /user/hive/warehouse/sample.

Loading the file into the table:

hive> load data local inpath '/home/matuser/file1' into table sample;

Now we will find file1 under /user/hive/warehouse/sample/file1, i.e. file1 is copied into the sample directory from local.

Hive> select * from sample;


o/p

hadoop

pig

hive

ex:2

cat > file2

hbase

oozie

flume

^d

hive> load data local inpath '/home/bigdata/file2' into table sample;

$ hadoop fs -ls /user/hive/warehouse/sample

o/p

/user/hive/warehouse/sample/file1

/user/hive/warehouse/sample/file2

Now in hdfs , sample directory has two files file1, file2.

Hive>select * from sample;

Now hive reads all rows of file1 and file2.

Dropping a table

hive> drop table sample;

The sample directory will be deleted from the HDFS warehouse directory.

ex:

The /user/hive/warehouse/sample directory will be removed (file1, file2 deleted).

We know Hive maintains metadata in the embedded Derby database or an external RDBMS (like MySQL, Oracle).

Now sample (the table metadata) and the data (the HDFS directory) are both deleted.

Creating an external table:

hive> create external table extab(line String);

In HDFS, the extab folder is created as follows:

/user/hive/warehouse/extab

hive> load data local inpath '/home/matuser/file1' into table extab;

In HDFS:

/user/hive/warehouse/extab/file1

hive> select * from extab;

hadoop
pig
hive

hive> drop table extab;

From Hive, extab is deleted [metadata deleted].

hive> select * from extab;

table not found error

In HDFS, /user/hive/warehouse/extab/file1 is still available.

Here only the table is deleted, not the data.

Reusing the data

hive> create table extab(line String);

Now an extab table is created, but no extab folder is created in the warehouse, since it is already available there. Hive uses the existing directory present in the warehouse, so it will reuse the location:

/user/hive/warehouse/extab/file1

Without loading any data, if we execute the following command:

hive> select * from extab;    (as inner table)

o/p

hadoop
pig
hive

But note that this time we created extab as an inner table. If you drop extab now, both the table and the directory will be deleted.
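
A quick way to confirm whether a table is inner (managed) or external is DESCRIBE FORMATTED; a sketch (output trimmed):

hive> describe formatted extab;
...
Table Type:          MANAGED_TABLE
...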

Both inner and external tables can use a custom HDFS location.

Ex:

hive> create table sample1(line String) location '/user/matuser';

In Hive, sample1 is created.
In HDFS, the /user/matuser directory is created.

hive> load data local inpath '/home/matuser/file1' into table sample1;

In HDFS: /user/matuser/file1

Note that the table name is sample1 while the directory name is matuser.

hive> drop table sample1;

/user/matuser will be deleted, because sample1 is an inner table.

An external table can also use a custom location:

hive> create external table sample2(line String) location '/user/usertab';

If usertab exists, Hive will use it; if it does not exist, a directory will be created with the given name.

hive> load data local inpath '/home/matuser/file1' into table sample2;

hive> drop table sample2;

Still /user/usertab is available (because sample2 is an external table).

(note: if HDFS is in safe mode, leave it first: hdfs dfsadmin -safemode leave)

LOADING DATA INTO TABLES

When you load a file into a table, the file is copied into the table's HDFS directory.

Ex:

$ cat > sample1
1
^d

$ cat > sample2
...
^d

hive> create table sample(line String);

hive> load data local inpath 'sample1' into table sample;

hive> load data local inpath 'sample2' into table sample;

Now in HDFS the sample directory has two files, sample1 and sample2. That means for every load the file is copied into the table directory, so the data keeps appending.

Another way to load data into a table:

$ cat > sample3
...
^d

$ hadoop fs -copyFromLocal sample3 /user/hive/warehouse/sample

Now the sample directory has 3 files: sample1, sample2 and sample3.

hive> select * from sample;

o/p

... (rows from sample1)
... (rows from sample2)
... (rows from sample3)

How to overwrite the data of a table while loading:

$ cat > sample4
10
11
12
^d

hive> load data local inpath 'sample4' overwrite into table sample;

Now all previous files of the sample directory (sample1, sample2 and sample3) are deleted, and the sample4 file is copied into sample.

hive> select * from sample;

10
11
12

hive> load data local inpath 'sample4' into table sample;

$ hadoop fs -ls /user/hive/warehouse/sample

..... sample4
..... sample4_copy_1

(Loading a file with the same name again keeps both copies; Hive renames the new one to sample4_copy_1.)

hive> select * from sample;

10
11    <- from sample4
12
10
11    <- from sample4_copy_1
12

Loading data from HDFS to a Hive table:

hive> load data inpath '/user/training/file1' into table sample;

Once file1 is loaded into the table, the source file (file1) will be deleted (moved) from HDFS.

In all previous examples we worked with single-column tables. What if a file has multiple fields? Then we need to create a table with multiple columns.

$cat>test1

100,200,300

400,500,600

700,800,900

^d

hive> create table test1(a int, b int, c int);

hive> load data local inpath '/home/matuser/test1' into table test1;

hive> select * from test1;

NULL NULL NULL
NULL NULL NULL
NULL NULL NULL

The problem is the delimiter. The default delimiter for a Hive table is "^A" (\001), but the input file has a "," (comma) delimiter.

For the table test1 the delimiter is ^A (\001), so the entire line is treated as a single field, and the whole line would be stored in the first column of the table (a int). The first column of test1 is int, but the line is alphanumeric; if data does not match the column type, Hive keeps a null value. There are no values left for the 2nd and 3rd fields, which is why the 'b' and 'c' column values also became NULL.

Solution: modify the delimiter to ',':

hive> create table test2(a int, b int, c int) row format delimited fields terminated by ',';

(For a tab-separated file you would use '\t' instead.)

Now for the table test2, ',' is the delimiter. The file test1 has two commas, so the number of fields in the file is 3.

The 1st field will be assigned to column 'a'.
The 2nd field will be assigned to column 'b'.
The 3rd field will be assigned to column 'c'.

hive> load data local inpath '/home/matuser/test1' into table test2;

hive> select * from test2;

100 200 300
400 500 600
700 800 900

Queries

$ cat > emp

101,srinivas,20000,M,1001

102,satya,50000,M,1002

103,sailaja,25000,F,1002

104,amit,30000,M,1003

104,saradha,50000,F,1004

105,anusha,45000,F,1003

106,krishna,60000,M,1005

107,rajesh,42000,M,1002

108,mohit,40000,M,1001

109,bhargavi,45000,F,1004

110,amrutha,35000,F,1005
hive>create table emp(ecode int,name string,salary
int,sex string,deptno int)row format delimited fields
terminated by ',';

hive> load data local inpath '/home/matuser/emp' into table emp;

hive> select * from emp;

o/p: all rows

hive> select * from emp where sex='F';

o/p: all females

hive> select * from emp where deptno=1003 and sex='M';

o/p: all males whose deptno is 1003

hive> select * from emp where ((deptno=1001 or deptno=1002) and sex='M')
      or ((deptno=1004 or deptno=1005) and sex='F');

o/p
all males of 1001, 1002
all females of 1004, 1005

hive> select * from emp where deptno=1001 or deptno=1002 or deptno=1003;

or

hive> select * from emp where deptno in (1001,1002,1003);

hive> select * from emp where deptno!=1001 and deptno!=1005;

o/p: all rows except departments 1001 and 1005

note: hive does not support "NOT IN"

hive> select * from emp where salary>25000 and salary<50000;

note: hive does not support the "BETWEEN" and "NOT BETWEEN" operators

hive> select * from emp limit 5;

o/p: top 5 rows of the table

order by:

Used to sort the data in ascending or descending order.

hive> select * from emp order by name;        -- default is ascending order

hive> select * from emp order by salary desc;

hive> select * from emp order by salary desc, deptno, sex desc;

This sorts the data based on salary in descending order (high salary gets priority). If two salaries are equal, it checks dept numbers; no sort mode is specified for deptno, so the default ascending order applies and the lower dept number is prioritised. If two dept numbers are equal, it checks sex; the higher value is prioritised.

note: there is no limit on the maximum number of sort columns.

Adding new columns to a table:

hive> alter table emp add columns(tax int);

hive> select * from emp;

101 satya 50000 M 1001 NULL
102 sravani 10000 F 1002 NULL

hive> insert overwrite table emp
      select ecode,name,salary,sex,deptno,salary*0.1 from emp;

note: in the above example the target table and source table are the same, so the data will be overwritten.

o/p
101 satya 50000 M 1001 5000
102 sravani 10000 F 1002 1000

Updating a table

Increment the salary by 10% in each row:

hive> insert overwrite table emp
      select ecode,name,salary+(salary*0.1),sex,deptno,tax from emp;

note: by this overwriting technique we can update data.

if function

Used to perform conditional transformations.

if(condition, value1, value2)

The 1st argument is the condition, the 2nd argument is the true value, and the 3rd argument is the false value.

note: if the condition is true, value1 will be returned; otherwise value2 will be returned.

ex:

$cat > file1

100,200

300,400

644,545
789,456

^d

hive> create table tab1(a int, b int) row format delimited fields terminated by ',';

hive> load data local inpath '/home/matuser/file1' into table tab1;

hive> alter table tab1 add columns(big int);

hive> insert overwrite table tab1
      select a,b,if(a>b,a,b) from tab1;

hive> select * from tab1;

100 200 200
300 400 400
644 545 644
789 456 789

nested if statements:

$ cat > file2
100,200,230
300,400,75
644,545,20
789,456,98
^d

Among the 3 columns, which is the biggest value?

hive> create table tab2(a int, b int, c int) row format delimited fields terminated by ',';

hive> load data local inpath 'file2' into table tab2;

hive> alter table tab2 add columns(big int);

hive> insert overwrite table tab2
      select a,b,c,if(a>b,if(a>c,a,c),if(b>c,b,c)) from tab2;

hive>select * from tab2;

100 200 230 230

300 400 75 400

644 545 20 644

789 456 98 789

Let us work with temperature data using Hive:

$ cat > tempr.txt
xxxxx2010xxx20xx
xxxxx2010xxx21xx
xxxxx2011xxx27xx
xxxxx2011xxx23xx
^d

This file has no delimiters, but we have clear positional clues for the attributes:

Year: 6th to 9th characters
Temperature: 13th to 14th characters

Task: for each year, find the max temperature.

hive> create table raw(line String);

hive> load data local inpath '/home/matuser/tempr.txt' into table raw;

hive> create table tmpr(y int, t int);

hive> insert overwrite table tmpr
      select substr(line,6,4), substr(line,13,2) from raw;

hive> select * from tmpr;

2010 20
2010 21
2011 27
2011 23

substr(line,6,4) extracts the year: 6 is the start position, 4 is the length of the substring.
substr(line,13,2) extracts the temperature.

hive> create table result(y int, max int);

hive> insert overwrite table result
      select y, max(t) from tmpr group by y;

hive> select * from result;

2010 21
2011 27

In the previous examples we used "insert overwrite table <table name>"; the overwrite keyword is mandatory there. But from Hive 0.14, "insert into" is available to append rows, whereas "insert overwrite" will overwrite the previous data.
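
A minimal sketch of the append behaviour (reusing the raw/tmpr tables above):

hive> insert into table tmpr
      select substr(line,6,4), substr(line,13,2) from raw;

This adds the selected rows to whatever tmpr already holds, instead of replacing it.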

One more example on temperature data, where the air temperature has both positive and negative values.

$ cat > tmpr.txt
xxxxx2010xxx20xx
xxxxx2010xxx-15xx
xxxxx2010xxx-10xx
xxxxx2010xxx23xx
xxxxx2011xxx-15xx
xxxxx2011xxx-10xx
^d

This file has both negative and positive values. Now for the temperature: if positive, the length is 2; if negative, the length is 3.

How to make it structured and analyse it?

Task: for each year, the max and min temperature is required.

hive> create table raw(line string);

hive> load data local inpath '/home/matuser/tmpr.txt' into table raw;

hive> create table tmpr(y int, t int);

hive> insert overwrite table tmpr
      select * from (
        select substr(line,6,4), substr(line,13,2) from raw
        where substr(line,13,1)!='-'
        union all
        select substr(line,6,4), substr(line,13,3) from raw
        where substr(line,13,1)='-'
      ) t;

Using "union all", both row sets are merged.

--> in Hive there is no "UNION"; only "UNION ALL" is available.
--> "union all" should be applied as a subquery.
--> the subquery should have an alias (ex: t).

The merged results are loaded into the tmpr table.

hive> select * from tmpr;

o/p
2010 20
2010 -15
2010 -10
2010 23
2011 -15
2011 -10

The same result without union, using if():

hive> insert overwrite table tmpr
      select substr(line,6,4),
             if(substr(line,13,1)!='-',
                substr(line,13,2), substr(line,13,3)) from raw;

hive> select * from tmpr;

o/p

2010 20
2010 -15
2010 -10
.........

If the 13th character is not '-', it retrieves the two-character positive temperature; otherwise it retrieves the three-character negative temperature.

note: substr()'s return value is of string type. In the above case the string contains pure numbers, so it can be implicitly cast to an int value; in our table, y and t are int datatypes.

hive> create table result2(y int, max int, min int) row format delimited fields terminated by ',';

hive> insert overwrite table result2
      select y, max(t), min(t) from tmpr group by y;

hive> select * from result2;

o/p

2010 23 -15
2011 -10 -15

----------------------------------------------------------------------

Complex Data Types

Working with Arrays

Bigdata@Bigdata:~$ cat > student
Kalam,20,Btech*Mtech*PHD
praneeth,21,Bsc*Msc*Mtech
satya,38,Bsc*Msc*PHD
^d

hive> create table student(name string, age int, qualification Array<string>) row format delimited fields terminated by ',' collection items terminated by '*';

hive> load data local inpath '/home/Bigdata/student' into table student;

hive> select * from student;

Kalam 20 ["Btech","Mtech","PHD"]
praneeth 21 ["Bsc","Msc","Mtech"]
satya 38 ["Bsc","Msc","PHD"]

hive> select qualification[0] from student;

Btech
Bsc
Bsc
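
Arrays also work with built-in functions like size() and explode(); a short sketch using the table above:

hive> select name, size(qualification) from student;
hive> select name, q from student lateral view explode(qualification) qtab as q;

The second query produces one row per (name, qualification) pair.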

Working with Map

$ cat > trans
p101,pendrive$400*camera$5000*mobile$8500
p102,Tv$30000*CDPlayer$10000*pendrive$800*computer$25000
^d

hive> create table trans(prodid string, products map<string,int>) row format delimited fields terminated by ',' collection items terminated by '*' map keys terminated by '$';

hive> load data local inpath 'trans' into table trans;

hive> select * from trans;

p101 {"pendrive":400,"camera":5000,"mobile":8500}
p102 {"Tv":30000,"CDPlayer":10000,"pendrive":800,"computer":25000}

hive> select prodid, products['pendrive'] from trans;

o/p

p101 400
p102 800
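
Maps can be inspected with built-in functions too; a small sketch on the same table:

hive> select prodid, map_keys(products), size(products) from trans;

map_keys() returns the keys as an array and size() returns the number of entries.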

Working with Structs

$ cat > profile
gagan datta,27,manaswi#23#hyd,hyd
krishna,30,saradha#25#hyd,hyd
^d

hive> create table profile(name string, age int, wife struct<name:string,age:int,city:string>, city string) row format delimited fields terminated by ',' collection items terminated by '#';

hive> load data local inpath 'profile' into table profile;

hive> select * from profile;

gagan datta 27 {"name":"manaswi","age":23,"city":"hyd"} hyd
krishna 30 {"name":"saradha","age":25,"city":"hyd"} hyd

hive> select name, wife.name from profile;

gagan datta manaswi
krishna saradha

Like:

hive> create table staff like emp;

A staff table will be created with the emp structure. emp has the ',' delimiter, so staff also has the ',' delimiter.

hive> insert overwrite table staff select * from emp;

Deleting rows of a table

note: in hive-0.14 we have delete and update statements.

Delete the female rows from the staff table:

hive> insert overwrite table staff
      select * from staff where sex='M';

Only the male rows are copied back into staff, so the female rows are gone.
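
For reference, the actual DELETE statement added in hive-0.14 works only on transactional tables; a hedged sketch (assumes ORC storage, bucketing, and the ACID settings enabled in the configuration):

hive> create table staff_txn(eid int, name string, sex string)
      clustered by (eid) into 2 buckets
      stored as orc
      tblproperties('transactional'='true');
hive> delete from staff_txn where sex='F';

For plain text tables, the overwrite technique above remains the practical approach.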

Cleaning null values

$ cat > file3
100,,200
200,,
,,700
150,240,
,130,140
175,300,250
^d

hive> create table tab1(a int, b int, c int) row format delimited fields terminated by ',';

hive> load data local inpath 'file3' into table tab1;

hive> select * from tab1;

100  NULL 200
200  NULL NULL
NULL NULL 700
150  240  NULL
NULL 130  140
175  300  250

hive> select a+b+c from tab1;

NULL
NULL
NULL
NULL
NULL
725

because, for example, 100+NULL+200 = NULL; any arithmetic expression containing a NULL evaluates to NULL.

NOTE: if null values exist in an arithmetic operation, the final result will be null. So before arithmetic operations we need to clear the nulls.

To replace nulls with zeros:

hive> insert overwrite table tab1
      select if(a is null,0,a), if(b is null,0,b), if(c is null,0,c) from tab1;

hive> select * from tab1;

100 0   200
200 0   0
0   0   700
150 240 0
0   130 140
175 300 250

hive> select a+b+c from tab1;

300
200
700
390
270
725

Null values can't be compared with the '=' and '!=' operators.

ex:

hive> select * from emp where salary is null;
hive> select * from emp where salary != null;    // wrong - a NULL comparison never matches
hive> select * from emp where salary is not null;
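
Nulls can also be neutralized at query time, without rewriting the table, using the standard coalesce() function:

hive> select coalesce(a,0)+coalesce(b,0)+coalesce(c,0) from tab1;

coalesce() returns the first non-null argument.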

Let's do some more transformations.

hive> create table emp(ecode int, name String, sal int, sex String, dno int) row format delimited fields terminated by ',';

hive> load data local inpath 'emp.txt' into table emp;

hive> select * from emp;

101,srinivas,20000,M,1001

102,satya,50000,M,1002

103,sailaja,25000,F,1002

104,amit,30000,M,1003

104,saradha,50000,F,1004

105,anusha,45000,F,1003

106,krishna,60000,M,1005
107,rajesh,42000,M,1002

108,mohit,40000,M,1001

109,bhargavi,45000,F,1004

110,amrutha,35000,F,1005

In the sex column:

'M' is to be transformed into 'male'
'F' is to be transformed into 'female'

In deptno, the dept numbers are to be transformed into department names:

deptno ------> deptname
1001 -------> marketing
1002 -------> HR
1003 -------> finance
remaining ----> others

salary ----> grade
salary < 30000 ---> D
salary >= 30000 & salary < 50000 --> C
salary >= 50000 & salary < 70000 --> B
salary >= 70000 ----> A

hive> create table employee(eid int, ename string, salary int, sex String, deptname String, grade string);

hive> insert overwrite table employee
      select ecode, name, sal,
        if(sex='F','female','male'),
        if(dno=1001,'marketing',
         if(dno=1002,'HR',
          if(dno=1003,'Finance','others'))),
        if(sal>=70000,'A',
         if(sal>=50000,'B',
          if(sal>=30000,'C','D')))
      from emp;

hive> select * from employee;

101 srinivas 20000 male marketing D
102 satya 50000 male HR B
103 sailaja 25000 female HR D
104 amit 30000 male Finance C
104 saradha 50000 female others B
105 anusha 45000 female Finance C
.......

case statements in hive

We can also perform conditional transformations using a case statement.

hive> select case deptno
        when 1001 then 'marketing'
        when 1002 then 'HR'
        when 1003 then 'finance'
        else 'others'
      end
      from emp;

For range conditions, the searched form of case (with no expression after case) is used:

hive> select name, case
        when salary>=70000 then 'A'
        when salary>=50000 then 'B'
        when salary>=30000 then 'C'
        else 'D'
      end
      from emp;

Aggregate functions in hive:

hive> select sum(salary) from emp;
hive> select sum(salary) from emp where sex='F';
hive> select avg(salary) from emp;
hive> select max(salary) from emp;
hive> select min(salary) from emp;
hive> select count(salary) from emp;

note: count(column) will not count the nulls of that column; count(*) counts all rows, including those where the column is null.

How to check whether a column has null values or not?

hive> select count(*)-count(salary) from emp;

If the result is 0, there are no nulls; if the result is >0, there are nulls.

or

hive> select count(*) from emp where salary is null;

If the result is 0, there are no nulls; if the result is >0, there are nulls.

group by: to perform aggregations on groups separately we use group by.

ex1: find the sum of the salaries based on gender

hive> select sex, sum(salary) from emp group by sex;

ex2: find the sum of the salaries of each department

hive> select deptno, sum(salary) from emp group by deptno;

ex3:

hive> select deptno, sum(salary), avg(salary), min(salary), max(salary), count(*) from emp group by deptno;

having clause:

It works only with the group by clause. It is used to filter grouped columns.

ex1: display the sum of the salaries of only departments 1001, 1002.

hive> select deptno, sum(salary) from emp group by deptno having deptno in (1001,1002);

ex2: display the sum of the salaries of only departments 1001, 1002, for males whose salary is > 25000:

hive> select deptno, sum(salary) from emp where sex='M' and salary>25000 group by deptno having deptno in (1001,1002);

In the above example, sex and salary are not grouping columns, so you can't use them in the having clause; deptno could also be filtered in the where clause.

note: it is always better to filter grouping columns in the HAVING clause for better performance.

ex: display all departments whose sum of the salaries is >= 50000:

hive> select deptno from emp group by deptno having sum(salary)>=50000;

hive> select deptno, sum(salary) from emp where deptno in (1002,1004) group by deptno;

// row filtering happens at the mapper level.

hive> select deptno, sum(salary) from emp group by deptno having deptno in (1002,1003);

Combination of order by and group by

Display the top 3 departments by total salary:

hive> select deptno, sum(salary) as tot from emp group by deptno order by tot desc limit 3;

o/p

1002 117000
1004 95000
1005 95000

ex: top 5 pages from a weblog table based on visits:

hive> select page, count(*) as cnt from weblog group by page order by cnt desc limit 5;

Eliminating duplicates:

The distinct() function is used to eliminate duplicates.

TAB (column x):
10
20
30
20
20
10

hive> select distinct(x) from tab;

o/p
10
20
30

Merging tables

UNION ALL

NOTE: Hive supports UNION ALL; it does not support UNION.

UNION eliminates duplicate rows; UNION ALL allows duplicate rows.

If you do not want duplicate values, first merge the tables, then apply distinct or group by to eliminate the duplicate rows.

In HQL, union should be a subquery, and the subquery must have an alias.

table -->emp1

eid ename salary sex deptno

101 satya 75000 M 1001

102 gagan 100000 M 1002

103 lalitha 150000 F 1003

............

...........

1000 rows

table -->emp2

eid ename salary sex deptno

201 saradha 80000 F 1004

202 krishna 60000 M 1002

303 laxman 40000 M 1001

................

.............

500 rows
ex: merge emp1 and emp2 into emp

note: the schemas of all the tables are the same.

hive>insert overwrite table emp

select * from(

select * from emp1

union all

select * from emp2)e;

hive>select * from emp;

o/p is 1000+500=1500 rows

note: what if the tables have a different column order?

emp1: eid,ename,salary,sex,deptno
emp2: eid,ename,deptno,salary,sex

hive> create table staff like emp;

hive> insert overwrite table staff
      select * from (
        select eid,ename,salary,sex,deptno from emp1
        union all
        select eid,ename,salary,sex,deptno from emp2
      ) e;

note: we should not use * in the inner select statements because the tables have different column orders.

note: in the above two tables the number of columns is the same.

How to merge if the numbers of columns are different?

emp1: eid,ename,salary,sex,deptno,designation
emp2: eid,ename,salary,deptno,city

In emp1 there is no city column; in emp2 there is no sex or designation.

hive> create table emp1(eid int, ename string, sal int, dno int, sex string, designation string) row format delimited fields terminated by ',';

hive> create table emp2(eid int, ename string, sal int, dno int, city string) row format delimited fields terminated by ',';

hive> load data local inpath 'emp1' into table emp1;
hive> load data local inpath 'emp2' into table emp2;

hive> create table emp3(eid int, ename string, sal int, dno int, sex string, city string, designation string);

hive> insert overwrite table emp3
      select * from (
        select eid,ename,sal,dno,sex,' ' as city, designation from emp1
        union all
        select eid,ename,sal,dno,' ' as sex, city, ' ' as designation from emp2
      ) e;

hive> select * from emp3;

o/p

eid ename   salary deptno sex city designation
101 satya   75000  1001   M        TL
102 gagan   100000 1002   M        SE
201 saradha 80000  1004       hyd

Grouped aggregations with multiple tables

model1:

hive> select dno, sum(sal) from (
        select dno, sal from emp1
        union all
        select dno, sal from emp2
      ) e
      group by dno;

model2:

hive> select dno, sum(tot) as tot from (
        select dno, sum(sal) as tot from emp1 group by dno
        union all
        select dno, sum(sal) as tot from emp2 group by dno
      ) e
      group by dno;

model1 provides better performance.

How to append the rows of one table to another table?
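
A short answer using INSERT INTO, which appends instead of overwriting (assuming both tables have the same schema):

hive> insert into table emp1 select * from emp2;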

Joins using hive

Joins are used to collect data from two or more tables, based on common columns.

Joins are basically of two types:

1) Inner joins
2) Outer joins

Inner join

Only the rows matching the join condition are returned.

A    B
1    3
2    4
3    5
4    8
5    9

o/p
3,3
4,4
5,5

Outer joins

Matching rows and non-matching rows with the join condition.

Outer joins are of 3 types:

left outer join
right outer join
full outer join

Left outer join

All rows of the first table and the matching rows of the second table:

o/p

1,null
2,null
3,3
4,4
5,5

Right outer join:

All rows of the second table and the matching rows of the first table:

3,3
4,4
5,5
null,8
null,9

Full outer join

All matching and non-matching rows of both the left and right tables:

3,3

4,4

5,5

1,null

2,null

null,8

null,9

Emp
eid ename salary sex deptno

101 Satya 75000 M 1001

102 Gagan 100000 M 1002

103 Lalitha 150000 F 1003

104 Anusha 50000 F 1001

105 Krishna 60000 M 1004

106 Anandh 40000 M 1005

107 Padma 90000 F 1002

108 laxman 35000 M 1006

Dept

deptno deptname deptloc

1001 Accounts Hyderabad

1002 Finance Delhi

1003 HR Chennai

1007 Marketing Hyderabad

1008 Sales Delhi

Inner Join

hive> select eid, ename, salary, sex, deptname, deptloc
      from emp e join dept d on (e.deptno=d.deptno);

Left outer join

hive> select eid, ename, salary, sex, deptname, deptloc
      from emp e left outer join dept d on (e.deptno=d.deptno);

Right outer join

hive> select eid, ename, salary, sex, deptname, deptloc
      from emp e right outer join dept d on (e.deptno=d.deptno);

Full outer join

hive> select eid, ename, salary, sex, deptname, deptloc
      from emp e full outer join dept d on (e.deptno=d.deptno);

For each city, the total salary is required.

In this task, the city is in the dept table and the salary is in the emp table, so a join is required.

hive> select deptloc, sum(salary) from emp e join dept d on (e.deptno=d.deptno) group by deptloc;

o/p

Chennai 150000
Delhi 190000
Hyderabad 125000

But if you apply joins every time you collect data from multiple tables, it will take a lot of processing time. So denormalize the tables into a single table.

Denormalization:

hive> create table empdept(eid int, name string, salary int, sex string, deptno1 int, deptno2 int, dname string, dloc string) row format delimited fields terminated by ',';

Apply a full outer join, so no data is missed:

hive> insert overwrite table empdept
      select eid, ename, salary, sex, e.deptno, d.deptno, deptname, deptloc
      from emp e full outer join dept d on (e.deptno=d.deptno);

Now all rows of emp and dept are available in empdept, and without applying joins every time you can query empdept directly.

hive> select dloc, sum(salary) from empdept group by dloc;

We loaded the "full outer join" content into empdept, which contains the matching rows and the non-matching rows based on the join condition (e.deptno=d.deptno).

To get the matching rows:
hive> select * from empdept where deptno1=deptno2;

To get the matching rows and the non-matching rows of the left side:
hive> select * from empdept where deptno1 is not null;

To get the matching rows and the non-matching rows of the right side:
hive> select * from empdept where deptno2 is not null;

To get only the non-matchings of the left side:
hive> select * from empdept where deptno2 is null;

To get only the non-matchings of the right side:
hive> select * from empdept where deptno1 is null;

To get only the non-matchings of the left and right sides:
hive> select * from empdept where deptno1 is null or deptno2 is null;

sales datasets

$cat sales

13-10-2010 261.54

01-10-2012 10123.02

01-10-2012 244.57

10-07-2011 4965.7595

28-08-2010 394.27

28-08-2010 146.69

17-06-2011 93.54

17-06-2011 905.08

24-03-2011 2781.82

26-02-2010 228.41

23-11-2010 196.85

23-11-2010 124.56

08-06-2012 716.84

08-06-2012 1474.33

04-08-2012 80.61

The dataset has two fields: 1. date  2. amt

hive> create table temp(dt string, amt float) row format delimited fields terminated by '\t';

hive> load data local inpath 'sales' into table temp;


hive> select * from temp;

01-01-2001 3000.89
............

This date is not in the Hive date format. Hive's default date format is yyyy-mm-dd, so convert the given date format to the Hive date format.

hive> create table sales(dt array<string>, amt float);

hive> insert overwrite table sales
      select split(dt,'-'), amt from temp;

note: split returns a string array.

split('01-05-2005','-')  o/p ["01","05","2005"]

hive> create table sales1(dt string, amt float);

hive> insert overwrite table sales1
      select concat(dt[2],'-',dt[1],'-',dt[0]), amt from sales;

hive> select * from sales1;

2001-01-01 3000.78
.................
.................
2014-12-31 200.09

Now the date is in the Hive date format, so you can apply all the Hive date functions:

month('2001-01-05')  o/p 1
year('2001-01-05')   o/p 2001
day('2001-01-05')    o/p 5

Yearly sales report

hive> create table yearlyreport(year int, totsales float);

hive> insert overwrite table yearlyreport
      select year(dt), sum(amt) from sales1 group by year(dt);

Monthly sales report for year 2010

hive> create table report2010(month int, totsales float);

hive> insert overwrite table report2010
      select month(dt), sum(amt) from sales1
      where year(dt)=2010 group by month(dt);

Daily sales report (day wise) for January 2010

hive> create table report2010_jan(day int, totsales float);

hive> insert overwrite table report2010_jan
      select day(dt), sum(amt) from sales1
      where year(dt)=2010 and month(dt)=1
      group by day(dt);

Quarterly sales report for year 2010

quarter1 ---> months 1,2,3
quarter2 ---> months 4,5,6
quarter3 ---> months 7,8,9
quarter4 ---> months 10,11,12

hive> alter table sales1 add columns(quarter int);

hive> insert overwrite table sales1
      select dt, amt, case
        when month(dt)<4 then 1
        when month(dt)<7 then 2
        when month(dt)<10 then 3
        else 4
      end from sales1;

Now a quarter number is generated for every row.

hive> create table quarterly_report2010(quarter int, totamt float);

hive> insert overwrite table quarterly_report2010
      select quarter, sum(amt) from sales1
      where year(dt)=2010
      group by quarter;

Half-yearly sales report for each year (multi grouping)

hive> alter table sales1 add columns(halfyear int);

hive> insert overwrite table sales1
      select dt, amt, quarter, if(quarter<=2,1,2) from sales1;

hive> create table halfyearly_report(year int, halfyear int, totsales int);

hive> insert overwrite table halfyearly_report
      select year(dt), halfyear, sum(amt) from sales1
      group by year(dt), halfyear;

hive> select * from quarterly_report2010;

quarter totamt
1 50000
2 75000
3 60000
4 90000

Now the requirement is: for each quarter, when compared with its previous quarter's sales, how much did sales increase or decrease in percentage terms?

ex: 2nd quarter sales increased by 50% when compared with 1st quarter sales.

Now each quarter's sales volume has to be compared with the previous row's sales volume. Within a single row-by-row scan that's not possible, so we take the help of a "cartesian product" (a self join).

hive> create table cart1(q1 int, tot1 int, q2 int, tot2 int);

hive> insert overwrite table cart1
      select l.quarter, l.totamt, r.quarter, r.totamt
      from quarterly_report2010 l join quarterly_report2010 r;

q1 tot1  q2 tot2
1  50000 1  50000
1  50000 2  75000
1  50000 3  60000
1  50000 4  90000
........

So at the time of building the cartesian product, apply a row filter: where q1-q2=1.

hive> insert overwrite table cart1
      select l.quarter, l.totamt, r.quarter, r.totamt
      from quarterly_report2010 l join quarterly_report2010 r
      where l.quarter-r.quarter=1;

hive> select * from cart1;

q1 tot1  q2 tot2
2  75000 1  50000
3  60000 2  75000
4  90000 3  60000

hive> alter table cart1 add columns(percent_growth float);

hive> insert overwrite table cart1
      select q1, tot1, q2, tot2,
             ((tot1-tot2)*100)/tot2 from cart1;

hive> select * from cart1;

q1 tot1  q2 tot2  percent_growth
2  75000 1  50000 50.0
3  60000 2  75000 -20.0
4  90000 3  60000 50.0

percentage growth formula = ((present value - previous value) / previous value) * 100

Classification of tables

1) partitioned tables
2) non-partitioned tables

1) Partitioned tables:

Partitioning – Hive organizes tables into partitions for grouping the same type of data together based on a column or partition key. Each table in Hive can have one or more partition keys to identify a particular partition. Using partitions we can make queries on slices of the data faster.
Why is Partitioning Important?

A huge amount of data, in the range of petabytes, is stored in HDFS, so it becomes very difficult for Hadoop users to query this huge amount of data.

Hive was introduced to lower this burden of data querying. Apache Hive converts SQL queries into MapReduce jobs and then submits them to the Hadoop cluster. When we submit a SQL query, Hive reads the entire data-set, so it becomes inefficient to run MapReduce jobs over a large table. This is resolved by creating partitions in tables. Apache Hive makes implementing partitions very easy, creating them by its automatic partition scheme at the time of table creation.

In the partitioning method, all the table data is divided into multiple partitions. Each partition corresponds to specific value(s) of the partition column(s) and is kept as a sub-directory inside the table's directory in HDFS. Therefore, on querying a particular table, only the appropriate partition that contains the query value is read. This decreases the I/O time required by the query and hence increases performance.

To create data partitioning in Hive, the following command is used:

CREATE TABLE table_name (column1 data_type, column2 data_type)


PARTITIONED BY (partition1 data_type, partition2 data_type,….);

Hive Data Partitioning Example

Example: Consider a table named Tab1 containing client details: id, name, dept, and yoj (year of joining).

Suppose we need to retrieve the details of all the clients who joined in 2012. The query would then search the whole table for the required information. But if we partition the client data by year and store it in a separate file, the query processing time is reduced.

The following example will help us learn how to partition a file and its data.

Say the file named file1 contains the client data table:


tab1/clientdata/file1

id, name, dept, yoj

1, sunny, SC, 2009

2, animesh, HR, 2009

3, sumeer, SC, 2010

4, sarthak, TP, 2010

Partition the above data into two files using the year:

tab1/clientdata/2009/file2

1, sunny, SC, 2009

2, animesh, HR, 2009

tab1/clientdata/2010/file3

3, sumeer, SC, 2010

4, sarthak, TP, 2010

Now when we are retrieving the data from the table, only the data of the specified
partition will be queried. Creating a partitioned table is as follows:

CREATE TABLE table_tab1 (id INT, name STRING, dept STRING, yoj INT) PARTITIONED BY (year STRING)
row format delimited fields terminated by ',';

load data local inpath 'file:///home/madhu/2009_f1' OVERWRITE into table table_tab1 PARTITION(year='2009');

load data local inpath 'file:///home/madhu/2010_f2' OVERWRITE into table table_tab1 PARTITION(year='2010');
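
After these loads, the partitions can be listed with SHOW PARTITIONS (standard HiveQL):

hive> show partitions table_tab1;
year=2009
year=2010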

Types of Hive Partitioning


There are two types of Partitioning in Apache Hive-

● Static Partitioning
● Dynamic Partitioning

Hive Static Partitioning

● Inserting input data files individually into a partition table is static partitioning.

● Usually, when loading (big) files into Hive tables, static partitions are preferred.

● Static partitioning saves time in loading data compared to dynamic partitioning.

● You "statically" add a partition to the table and move the file into the partition of the table.

● We can alter the partitions in static partitioning.

● You can get the partition column value from the file name, day of date etc. without reading the whole big file.

● If you want to use static partitioning in Hive you should set the property hive.mapred.mode = strict; this property is set by default in hive-site.xml.

● Static partitioning is in strict mode.

● You should use a where clause on the partition column to limit queries in static partitioning.

● You can perform static partitioning on a Hive managed table or an external table.

Hive Dynamic Partitioning

● A single insert into a partition table is known as dynamic partitioning.

● Usually, dynamic partitioning loads the data from a non-partitioned table.

● Dynamic partitioning takes more time in loading data compared to static partitioning.

● When you have large data stored in a table, dynamic partitioning is suitable.

● If you want to partition by a number of columns but you don't know how many, dynamic partitioning is also suitable.

● With dynamic partitioning, no where clause is required to limit the load.

● We can't perform alter on a dynamic partition.

● You can perform dynamic partitioning on Hive external tables and managed tables.

● If you want to use dynamic partitioning in Hive, the mode must be non-strict.

● Here are the Hive dynamic partition properties you should enable (see the settings after this list):
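
As used later in these notes:

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;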

Hive Partitioning – Advantages and Disadvantages

a) Hive Partitioning Advantages

● Partitioning in Hive distributes the execution load horizontally.

● With partitions, queries over a low volume of data execute faster. For example, searching the population of Vatican City returns very fast instead of searching the entire world population.

● There is no need to search the entire table for a single record.

b) Hive Partitioning Disadvantages

● There is the possibility of too many small partitions being created, i.e. too many directories.

● Partitioning is effective for low-volume data, but some queries, like group by on a high volume of data, still take a long time to execute. For example, grouping the population of China will take a long time compared to grouping the population of Vatican City.

Bucketing – In Hive, tables or partitions are subdivided into buckets based on the hash function of a column in the table, to give extra structure to the data that may be used for more efficient queries.

A hash function is a function which, when given a key, generates an address in the table. The hash function is used to index the original value or key, and is then used each time the data associated with the value or key is to be retrieved.

CREATE TABLE table_name PARTITIONED BY (partition1 data_type, partition2 data_type,….) CLUSTERED BY (column_name1, column_name2, …) [SORTED BY (column_name [ASC|DESC], …)] INTO num_buckets BUCKETS;

Each bucket in Hive is just a file in the table directory (unpartitioned table) or in the partition directory. Say you have chosen to divide the partitions into n buckets; then you will have n files in each of your partition directories. For example, if you bucket each partition into 2 buckets, then each partition, say EEE, will have two files, each storing part of the EEE students' data.
Case Study on Partitioning and Bucketing

A client has some e-commerce data belonging to India operations, in which each state's (38 states) operations are mentioned as a whole. If we take the state column as the partition key and perform partitions on that India data as a whole, we get a number of partitions (38) equal to the number of states (38) present in India. Each state's data can then be viewed separately in partition tables.

Creation of table allstates:

create table allstates(state string, District string, Enrolments string)
row format delimited
fields terminated by ',';

Loading data into the created table allstates:

load data local inpath '/home/user/Datasets/AllStates.csv' into table allstates;

Creation of the partition table:

create table state_part(District string, Enrolments string) PARTITIONED BY (state string);

For dynamic partitioning we have to set this property (run at the hive prompt):

set hive.exec.dynamic.partition.mode=nonstrict;

Loading data into the partition table:

INSERT OVERWRITE TABLE state_part PARTITION(state)
SELECT district, enrolments, state from allstates;

Actual processing and formation of the partition tables happens based on state as the partition key. There are going to be 38 partition outputs in HDFS storage, with the directory name as the state name.

To recap the steps:

1. Loading data into table allstates
2. Creation of the partition table with state as partition key
3. Setting the partition mode to non-strict (this mode activates dynamic partition mode)
4. Loading data into partition table state_part
5. Actual processing and formation of partition tables based on state as partition key
6. There are going to be 38 partition outputs in HDFS storage, with the directory name as the state name
What are Buckets?

Buckets in Hive are used to segregate Hive table data into multiple files or directories. They are used for efficient querying.

● The data present in the partitions can be divided further into buckets.
● The division is performed based on a hash of particular columns that we select in the table.
● Buckets use some form of hashing algorithm at the back end to read each record and place it into a bucket.
● In Hive, we have to enable bucketing by using: set hive.enforce.bucketing=true;

Step 1) Creating the bucketed table (see the sketch below).

● We are creating sample_bucket with column names such as first_name, job_id, department, salary and country.
● We are creating 4 buckets here.
● Once the data gets loaded, it is automatically placed into the 4 buckets.
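
A minimal sketch of the table creation described above (the column types and the choice of country as the bucketing column are assumptions):

hive> create table sample_bucket(first_name string, job_id int,
      department string, salary int, country string)
      clustered by (country) into 4 buckets;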

Step 2) Loading data into table sample_bucket

Assuming that the "employees" table is already created in the Hive system, in this step we will see the loading of data from the employees table into the sample_bucket table. Before we start moving employees data into buckets, make sure that it consists of column names such as first_name, job_id, department, salary and country.

Here we are loading data into sample_bucket from the employees table.
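
A hedged sketch of the load (assuming the employees table has matching columns):

hive> set hive.enforce.bucketing=true;
hive> insert overwrite table sample_bucket
      select first_name, job_id, department, salary, country from employees;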

Step 3) Displaying the 4 buckets created in Step 1

After the insert, the data from the employees table is transferred into the 4 bucket files created in step 1 (4 files appear in the table's HDFS directory).

The sales table has transaction history from 2010-01-01 till date.

Suppose you want to fetch only a particular date (ex: the 2013-03-10 transactions):

hive> select * from sales where dt='2013-03-10';

Suppose your output is 100 records. Now observe the problem: to fetch 100 records of the given date, Hive has to read all rows of the table (10 crore records). This causes high latency and is expensive in terms of time and CPU resources.

If you are able to keep each day's transactions in a separate partition, then when you request that date, Hive will point at that partition and read all rows of that partition, without reading all rows of the table. This saves a lot of reading time.

Non-partitioned:

hive> create database empdetails;
hive> use empdetails;

hive> create table emp(eid int, ename string, salary int, sex string, deptno int) row format delimited fields terminated by ',';

hive> load data local inpath '/home/matuser/emp.txt' into table emp;

Let us assume there are 200 males and 300 females, 500 rows in total.

hive> select * from emp where sex='F';

Now Hive has to read all 500 rows; after reading each row it validates the condition (sex='F'), and if true it writes the row.

We overcome the above problem by keeping all the females in one partition and all the males in another partition.

Creating partitioned tables:

hive> create table emp(eid int, ename string, salary int, sex string, deptno int) partitioned by (sex string);

The above query will fail, because the column sex is repeated as both a general column and a partitioned column; the two sets of columns must be unique.

hive> create table emp(eid int, ename string, salary int, sex string, deptno int) partitioned by (gender string) row format delimited fields terminated by ',';

hive> describe emp;

eid int
name string
salary int      -> physical columns (the data file contains only these 5 fields)
sex string
deptno int

gender string   -> logical column (partition column)

The table has 5 physical columns and one logical column (6 columns in total).

After the create statements, the following directory is created in the warehouse dir:

/user/hive/warehouse/empdetails.db/emp

note: here we will not find any partitions in the emp table yet, since partitions are created at the time of loading data into partitions, not at the time of creating tables.

We can load data from a file to partitions and from a table to partitions. Initially we work with loading from non-partitioned tables into partitioned tables.

Loading data from a non-partitioned table to a partitioned table:

hive> insert overwrite table emp select * from emp where sex='F';

The above query fails because emp is not a normal table; it is a partitioned table, so you need to specify the partition name.

sol:

hive> insert overwrite table emp           // emp is the partitioned table
      partition(gender='F')                // partition name
      select * from default.emp where sex='F';

Now you have loaded all female rows into the gender=F partition.

hive> insert overwrite table emp           // emp is the partitioned table
      partition(gender='M')                // partition name
      select * from default.emp where sex='M';

Now you have loaded all male rows into the gender=M partition.

note: HiveQL supports multi-loading (multiple insertions) in a single statement.

Multiple Insertion

hive> from default.emp
      insert overwrite table emp partition(gender='F')
        select * where sex='F'
      insert overwrite table emp partition(gender='M')
        select * where sex='M';

In HDFS:

/user/hive/warehouse/empdetails.db/emp/gender=F/000000_0
/user/hive/warehouse/empdetails.db/emp/gender=M/000000_0

Observe the above layout:

emp -> the directory for the table
gender=F -> a subdirectory of emp
gender=F/000000_0 -> file containing only female data
gender=M -> a subdirectory of emp
gender=M/000000_0 -> file containing only male data
Reading data from partitions

hive> select * from emp;
-> Hive reads all partitions.

hive> select * from emp where sex='F';
-> Hive reads all partitions, because 'sex' is not the partition column; 'gender' is our partition column. To take advantage of partitions, you should use the partition column in the where clause.

hive> select * from emp where gender='F';

Now Hive directly points at the /user/hive/warehouse/empdetails.db/emp/gender=F partition and reads all rows of the 000000_0 file, without checking the condition. So for this query there is no need to read all rows of the table, and a lot of reading time is saved.
Creating partitioned tables using multiple columns:

hive> create table emp1(ecode int, name String, sal int)
      partitioned by (dno int, sex String)
      row format delimited fields terminated by ',';

hive> describe emp1;

ecode int
name String -----> physical columns
sal int

dno int
sex String -----> logical columns for partitions

hive> from emp
      insert overwrite table emp1 partition(dno=1001,sex='F')
        select eid,ename,salary where deptno=1001 and sex='F'
      insert overwrite table emp1 partition(dno=1001,sex='M')
        select eid,ename,salary where deptno=1001 and sex='M'
      insert overwrite table emp1 partition(dno=1002,sex='F')
        select eid,ename,salary where deptno=1002 and sex='F'
      insert overwrite table emp1 partition(dno=1002,sex='M')
        select eid,ename,salary where deptno=1002 and sex='M';

In HDFS:

/user/hive/warehouse/mydb.db/emp1/dno=1001/sex=F/000000_0
/user/hive/warehouse/mydb.db/emp1/dno=1001/sex=M/000000_0
/user/hive/warehouse/mydb.db/emp1/dno=1002/sex=F/000000_0
/user/hive/warehouse/mydb.db/emp1/dno=1002/sex=M/000000_0

Now you have created the emp1 table, partitioned by (dno int, sex String). Here, dno is the primary partition: the primary partition is a subdirectory of the table's directory, and a sub-partition is a subdirectory of the primary partition. In this way you can go for multi-level sub-partitioning.
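
The partitions can also be listed with SHOW PARTITIONS:

hive> show partitions emp1;
dno=1001/sex=F
dno=1001/sex=M
dno=1002/sex=F
dno=1002/sex=M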

hive> select * from emp1;
-> Hive reads all rows of all partitions.

hive> select * from emp1 where dno=1002;

Hive reads:
/user/hive/warehouse/mydb.db/emp1/dno=1002/sex=F/000000_0
/user/hive/warehouse/mydb.db/emp1/dno=1002/sex=M/000000_0

hive> select * from emp1 where dno=1002 and sex='M';

Hive reads only:
/user/hive/warehouse/mydb.db/emp1/dno=1002/sex=M/000000_0

hive> select * from emp1 where sex='F';

Hive reads:
/user/hive/warehouse/mydb.db/emp1/dno=1001/sex=F/000000_0
/user/hive/warehouse/mydb.db/emp1/dno=1002/sex=F/000000_0

In the previous example, emp1, we loaded the 1001,M / 1001,F / 1002,F / 1002,M partitions: 2 departments (dno) × 2 sex groups, so 2*2=4 partitions, and we used 4 insert statements for loading. If we had 100 departments (dno) in the emp table, then we would need to load into 100*2=200 partitions, which is a very expensive process.

We need a dynamic loading facility.

Hive supports dynamic loading into partitions; by default, the dynamic loading feature is disabled.

hive> set hive.exec.dynamic.partition=true;

Loading data into partitions dynamically:

hive> create table dept1 like emp1;

hive> insert overwrite table dept1 partition(dno,sex)
      select ecode,name,sal,dno,sex from emp;

This query will fail, because (in strict mode) the primary partition can't be dynamic; except for the primary partition (dno), the remaining partitions can be dynamic. In table dept1, dno can be static and sex can be dynamic:

hive> from emp
      insert overwrite table dept1 partition(dno=1001,sex)   // dno -> static, sex -> dynamic
        select ecode,name,sal,sex where dno=1001
      insert overwrite table dept1 partition(dno=1002,sex)
        select ecode,name,sal,sex where dno=1002;

But if we need the primary partition (dno) to also be dynamic, then set one more option:

hive> set hive.exec.dynamic.partition.mode=nonstrict;
-> by default this option's value is "strict"

Now we can load dynamically into the primary partition and its sub-partitions:

hive> insert overwrite table dept1 partition(dno,sex)
      select ecode,name,sal,dno,sex from emp;

So to load dynamically into partitions, we need to set two options:

i) set hive.exec.dynamic.partition=true;
ii) set hive.exec.dynamic.partition.mode=nonstrict;

These options are temporary; once you quit the hive shell they are set back to their default values, false and strict.
Let's work on the sales table.

hive> create table sales(dt String, amt int)
      row format delimited fields terminated by ',';

hive> load data local inpath '/user/matdb/sales' into table sales;

hive> select * from sales;

dt amt
2001-01-01 250000
......... ......
......... ......
2015-07-19 3000000
I want to create a partitioned table, with year as the primary partition, month as a sub-partition of year, and day as a sub-partition of month.

-> That's almost 15 years of data; under each year 12 months, under each month 30 (28/30/31) days, and on each day an average of 100 records (transactions).

hive> create table newsales(dt String, amt int)
      partitioned by (y int, m int, d int)
      row format delimited fields terminated by ',';

Now, loading dynamically into the newsales table from the sales table:

hive> set hive.exec.dynamic.partition=true;            // to enable dynamic loading
hive> set hive.exec.dynamic.partition.mode=nonstrict;  // to make the primary partition dynamic

hive> insert overwrite table newsales partition(y,m,d)
      select dt, amt, year(dt), month(dt), day(dt) from sales;

In HDFS:

/user/hive/warehouse/mydb.db/newsales
    /y=2001
        /m=1
            /d=1/000000_0
            /d=2/000000_0
            ............
            /d=31/000000_0
        ...
        /m=12
            /d=1/000000_0
            /d=2/000000_0
            ............
            /d=31/000000_0
    ...
    /y=2015
        /m=1
            /d=1/000000_0
            /d=2/000000_0
            ............
            /d=31/000000_0
        ...
        /m=7
            /d=1/000000_0
            /d=2/000000_0
            ............
            /d=19/000000_0

hive> select * from newsales;
-> Hive reads all partitions.

hive> select * from newsales where y=2014;
-> Hive reads 365 partitions.
How? 2014 has 12 sub-partitions for month, and each month has 28/30/31 sub-partitions for day.

hive> select * from newsales where m=1;
-> Hive reads 15*31 partitions, because an m=1 partition is available in each of the 15 years, and m=1 has 31 partitions for day.

hive> select * from newsales where y=2015 and m=6 and d=20;
-> Hive reads only one partition.
task: I want to fetch the sales transactions between May 2003 and July 2010.

hive> select * from newsales where (y=2003 and m>=5)
      or (y>=2004 and y<2010)
      or (y=2010 and m<=7);

How to load data into partitions from files

hive> create database matdb;
hive> use matdb;

hive> create table enquiries(name String, qualification String, email String, phone String, course String, date_of_enquiry String)
      partitioned by (y int, m int, d int)
      row format delimited fields terminated by ',';

hive> load data local inpath '/user/matdb/enquiries1'
      into table enquiries partition(y=2015, m=7, d=1);

hive> load data local inpath '/user/matdb/enquiries2'
      into table enquiries partition(y=2015, m=7, d=2);

hive> load data local inpath '/user/matdb/enquiries3'
      into table enquiries partition(y=2015, m=7, d=3);

In HDFS,
/user/hive/warehouse/matdb.db/enquiries
    /y=2015
        /m=7
            /d=1/enquiries1
            /d=2/enquiries2
            /d=3/enquiries3
            ...............
I want to list all enquiries from 2015-10-1 to 2015-10-5:

hive> select * from enquiries where y=2015 and m=10 and d<6;
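Partitions loaded this way can also be dropped individually; as a minimal sketch (the partition spec here is just for illustration), one day's data could be removed with:

hive> alter table enquiries drop partition(y=2015, m=7, d=1);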
Now the enquiries table is already partitioned by year, month and day. But I want to access data based on the course name.

hive> create table enquiries_on_course(name String, qualification String,
      email String, phone String, date_of_enquiry String)
      partitioned by (course String);
How to load a partitioned table from another table

We can't load data from a file into the course partitions, because our enquiry file contains all courses. We don't have separate files for each course, and while loading from files we can't filter rows. So a table-to-table copy is the only way.

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;

hive> insert overwrite table enquiries_on_course partition(course)
      select name,qualification,email,phone,date_of_enquiry,course
      from enquiries;
hive> select * from enquiries_on_course where course='hadoop';
A partitioned table can be inner (managed) or external.

Creating an external partitioned table:

hive> create external table ext_emp(ecode int, name String, sal int)
      partitioned by (dno int, sex String)
      row format delimited fields terminated by ','
      location '/user/matdb/mydata';

A partitioned table can use a custom location, as above.
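For an external partitioned table, partitions are usually attached explicitly, and each partition may point to its own directory. A hedged sketch (the partition directory path is an assumption for illustration):

hive> alter table ext_emp add partition(dno=1001, sex='F')
      location '/user/matdb/mydata/1001F';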
Note: if a table has a large number of partitions and the partitions are large, querying data from all partitions is expensive in terms of CPU resources and processing time.

So, to prevent users from querying data using non-partition columns, set the following option:

hive> set hive.mapred.mode=strict;
hive> select * from emp1;
-> statement failed, because you are requesting all partitions.

hive> select * from emp1 where sal>=25000;
-> statement failed, because sal isn't a partition column, so you are again requesting all partitions.

hive> select * from emp1 where dno=1001;
// hive reads only the
/user/hive/warehouse/......./emp1/dno=1001/sex=F
/user/hive/warehouse/......./emp1/dno=1001/sex=M
partitions. (In the emp1 table, dno and sex are the partition columns.)

If you set
hive> set hive.mapred.mode=nonstrict;
// you can again request all partitions of the table,
// i.e., you can use any column (partitioned or non-partitioned) in the where clause.
Up to now we have seen the advantages of partitioned tables.

Drawbacks of partitioned tables
For example, suppose you partition your table by the city column as the primary partition and dno (department number) as the sub-partition.

hive> create table emp(------)
      partitioned by(city String, dno int);

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;

If you have 100 unique cities and 100 unique department numbers (dnos), hive has to create 10000 partitions, and for each partition one data file is created: 10000 files. The NameNode then has to maintain metadata for 10000 files, which is a great overhead (burden) for the NameNode. And what if you partition your sales table by product_id and you have millions of products? That puts a very heavy metadata burden on the NameNode. In such cases, we choose bucketed tables.
Hive UDFs:

1) Create a new class derived from UDF and write one or more evaluate methods using Hadoop datatypes.

2) Create the required package directory:

/home/matuser$ mkdir -p com/mat/hive/udf
package com.mat.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class UpperUDF extends UDF
{
    // Hive calls evaluate() once per row; Text is the Hadoop writable for strings.
    public Text evaluate(final Text s)
    {
        if (s == null)
        {
            return null;   // propagate NULLs instead of failing on them
        }
        return new Text(s.toString().toUpperCase());
    }
}
3) Save the above program under the path created above:

/home/matuser$ gedit com/mat/hive/udf/UpperUDF.java
4) Compile the source code using the statement below:

$ javac -cp /home/matuser/hadoop-1.0.4/hadoop-core-1.0.4.jar:/home/matuser/hive-0.10.0/lib/hive-exec-0.10.0.jar com/mat/hive/udf/UpperUDF.java
5) Package the class into a jar:

$ jar -cvf upperudf.jar com/mat/hive/udf

The above statement creates the upperudf.jar file under /home/matuser/.
6) The next step is to add the jar with your UDF code to the hive classpath. The easiest way is to set HIVE_AUX_JARS_PATH to a directory containing any jars you need to add before starting hive:

$ export HIVE_AUX_JARS_PATH=/home/matuser
7) Start hive normally. Once hive is running, the last step is to register your function:

hive> create temporary function myupper as 'com.mat.hive.udf.UpperUDF';
8) Now you can use it:

hive> select myupper(ename) from emp;
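A temporary function disappears when the hive shell exits. In later Hive versions (0.13 and above), a function can instead be registered permanently from a jar kept in HDFS; a hedged sketch (the HDFS path is an assumption for illustration):

hive> create function myupper as 'com.mat.hive.udf.UpperUDF'
      using jar 'hdfs:///user/matuser/upperudf.jar';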
Views:

A view allows a query to be saved and treated like a table. It is a logical construct: it does not store data like a table.

When a query references a view, the information in its definition is combined with the rest of the query by Hive's query planner. Logically, you can imagine that Hive executes the view and then uses the results in the rest of the query.
Views to Reduce Query Complexity

When a query becomes long or complicated, a view may be used to hide the complexity by dividing the query into smaller, more manageable pieces; this is similar to writing a function in a programming language or the concept of layered design in software. Encapsulating the complexity makes it easier for end users to construct complex queries from reusable parts. For example, consider the following query with a nested subquery:

FROM (
    SELECT * FROM empmaster e1 JOIN dept d
    ON (e1.deptid=d.deptid) WHERE d.deptid=1001
) e SELECT e.* WHERE ename LIKE 'g%';
It is common for Hive queries to have many levels of nesting. In the following example, the nested portion of the query is turned into a view:

CREATE VIEW shorter_join AS
SELECT e.eid, e.ename, e.salary, d.deptid, d.deptname
FROM empmaster e JOIN dept d ON (e.deptid=d.deptid)
WHERE d.deptid=1001;

The view is then used like any other table. In this query we add a WHERE clause to the SELECT statement, which exactly emulates the original query:

SELECT * FROM shorter_join WHERE ename LIKE 'g%';
Views that Restrict Data Based on Conditions

A common use case for views is restricting the result rows based on the value of one or more columns. Some databases allow a view to be used as a security mechanism: rather than give the user access to the raw table with sensitive data, the user is given access to a view with a WHERE clause that restricts the data. Hive does not currently support this feature, as the user must have access to the entire underlying raw table for the view to work. However, the concept of a view created to limit data access can still be used to protect information from the casual query:

CREATE TABLE userinfo (firstname string, lastname string, user string, password string)
row format delimited fields terminated by ' ';

load data local inpath '/home/matuser/hivequeries/userinfo' into table userinfo;

CREATE VIEW safer_user_info AS
SELECT firstname, lastname FROM userinfo;
Here is another example where a view is used to restrict data based on a WHERE clause. In this case, we wish to provide a view on an employee table that only exposes employees from a specific department:

hive> CREATE TABLE employee (firstname string, lastname string,
    > user string, password string, department string);

hive> CREATE VIEW techops_employee AS
    > SELECT firstname, lastname, user FROM employee WHERE department='techops';
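The view is then queried like a table and removed with DROP VIEW when no longer needed (a minimal usage sketch):

hive> SELECT * FROM techops_employee;
hive> DROP VIEW techops_employee;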
Bucketing tables

A bucket is a file in the table's directory (HDFS). At the time of creating the table, we can choose how many buckets we want. Each bucket may hold multiple key groups: if a key is sent to bucket 1, all records with the same key will be sent to that bucket.

For example, suppose the product_ids are distributed as follows:
all p1, p4 product_ids are sent to 000000_0
all p2, p5 product_ids are sent to 000001_0
all p3, p6 product_ids are sent to 000002_0
Now all data records are distributed across 3 buckets.
How to create bucketed tables:

hive> create table emp_buckets(ecode int, name String, sal int, sex String, dno int)
      clustered by (dno) into 3 buckets;

Loading into buckets:

hive> set hive.enforce.bucketing=true;
Now the bucketing feature is enabled.

hive> insert overwrite table emp_buckets select * from emp;

hive> dfs -ls /user/hive/warehouse/mydb.db/emp_buckets;
-> /user/hive/warehouse/mydb.db/emp_buckets/000000_0
   /user/hive/warehouse/mydb.db/emp_buckets/000001_0
   /user/hive/warehouse/mydb.db/emp_buckets/000002_0
Hive maintains a hashtable recording which key is stored in which bucket. When you request data using the bucketing column, hive first looks up the bucket of the key in the hashtable, and then reads data only from that bucket, without touching the other buckets.

Hashtable (hashcode of key -> bucket):
11 -> 000000_0
12 -> 000001_0
13 -> 000002_0
14 -> 000000_0
15 -> 000000_0
16 -> 000002_0
17 -> 000001_0
18 -> 000000_0
...
hive> select * from emp_buckets;
-> hive reads all buckets.

hive> select * from emp_buckets where dno=16;
-> now hive reads data only from the 000002_0 bucket.
You can create bucketed tables using multiple columns.
Example (the bucket count of 4 here is just an example):

hive> create table bucks_tab(ecode int, name String, sal int, sex String, dno int)
      clustered by (dno, sex) into 4 buckets;

hive> set hive.enforce.bucketing=true;
hive> insert overwrite table bucks_tab select * from emp;

Now the combination of dno and sex acts as a composite key: if (dno, sex) = (11, 'F') is entered into bucket 1, all remaining (11, 'F') records will be sent to bucket 1.

hive> select * from bucks_tab where dno=11 and sex='F';
-> hive reads only from bucket 1, the 000000_0 file.
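Bucketing also enables efficient sampling, since each bucket is a separate file. As a hedged sketch against the emp_buckets table above, TABLESAMPLE can read just one of the three buckets:

hive> select * from emp_buckets tablesample(bucket 1 out of 3 on dno);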
Case study: using HIVE on top of HADOOP for statistical analysis. In this example we have a predefined dataset (cricket.csv) with more than 20 columns and more than 10000 records. The example below shows how to display a few records based on the attributes of cricket. The cricket.csv file has attributes such as:

name, country, Player_iD, Year, stint, teamid, run, igl, D, G, G_batting, AB, R, H, 2B, 3B, HR, RBI, SB, CS, BB, SO, IBB, HBP, SH, SF, GIDP, G_OLD

Follow the steps below to retrieve the data from the big tables.
1. Start Hadoop:
$ start-all.sh

2. Create a folder and copy all the predefined datasets:
$ hadoop fs -mkdir sampledataset
$ hadoop fs -put /home/sai/hivedataset/ /user/sai/sampledataset
Note: hivedataset holds the big data tables in .csv format; they are copied from the local file system into Hadoop.

3. Start Hive:
$ hive

4. Create a table in Hive:
hive> create table temp_cricket (col_value STRING);

5. Load the data from the csv file into the temp_cricket table:
hive> LOAD DATA INPATH '/user/sai/sampledataset/hivedataset/cricket.csv' OVERWRITE INTO TABLE temp_cricket;
6. Create another table to hold the extracted data:
hive> create table batting (player_id STRING, year INT, runs INT);

7. Insert data into the newly created table by extracting fields:

hive> insert overwrite table batting
      SELECT regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
             regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
             regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run
      from temp_cricket;

In the above query, player_id, year and run are extracted from the temp_cricket table and loaded into the batting table.
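To see what the pattern does, regexp_extract can be tested on a literal string: '^(?:([^,]*)\,?){n}' matches the first n comma-separated fields, and capture group 1 retains the n-th one. A hedged sketch (the sample row is made up for illustration):

hive> select regexp_extract('sachin,india,p001,2001', '^(?:([^,]*)\,?){3}', 1);
p001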
Execute a few queries and observe the results.

Example 1: display the year and the maximum runs scored in each year.
hive> SELECT year, max(runs) FROM batting GROUP BY year;

Total jobs = 1
Hadoop job information for stage-1: number of mappers: 1; number of reducers: 1
Stage-1 map = 0%, reduce = 0%
Stage-1 map = 100%, reduce = 0%, cumulative CPU 3.45 sec
Stage-1 map = 100%, reduce = 100%, cumulative CPU 6.35 sec
MapReduce total cumulative CPU time: 6.35 sec
Time taken: 46.09 sec
Example 2: display the player_id, year and maximum runs scored in each year.
hive> SELECT a.year, a.player_id, a.runs FROM batting a
      JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b
      ON (a.year = b.year AND a.runs = b.runs);

Total jobs = 2
Launching job 1 out of 2
Hadoop job information for stage-2: number of mappers: 1; number of reducers: 1
Stage-2 map = 0%, reduce = 0%
Stage-2 map = 100%, reduce = 0%, cumulative CPU 3.0 sec
Stage-2 map = 100%, reduce = 100%, cumulative CPU 5.95 sec
MapReduce total cumulative CPU time: 5.95 sec
Launching job 2 out of 2
Hadoop job information for stage-3: number of mappers: 1; number of reducers: 0
Stage-3 map = 0%, reduce = 0%
Stage-3 map = 100%, reduce = 0%, cumulative CPU 3.12 sec
Stage-3 map = 100%, reduce = 100%, cumulative CPU 3.12 sec
MapReduce total cumulative CPU time: 3.12 sec
Time taken: 65.743 sec
As Example 2 contains two SELECT statements (the outer query and the subquery), two jobs are executed. The CPU times vary according to the system configuration.

Conclusion: Hive works with Hadoop to allow you to query and manage large-scale data using a familiar SQL-like interface, and it provides a command-line shell for interactive use.