4. Set the environment variable HIVE_HOME in the .bashrc file
Such as reading records randomly, and inserting/updating/deleting records randomly.
HIVE INSTALLATION
export HIVE_HOME=/home/matuser/work/hive-0.14.0
export PATH=$PATH:$HIVE_HOME/bin;
Bigdata@master:~$hadoop fs -mkdir /user/hive/warehouse/
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
Derby metadata cannot be shared between Hive users. To share the metastore between Hive clients, configure an external database (such as MySQL):
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/metastore_db</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>matuser</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>matuser</value>
<description>password to use against metastore database</description>
</property>
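To use the MySQL metastore, the MySQL JDBC driver jar must be on Hive's classpath and the metastore schema must be initialized. A minimal sketch (the driver jar file name varies by version; paths are assumptions):
$ cp mysql-connector-java.jar $HIVE_HOME/lib/
$ $HIVE_HOME/bin/schematool -dbType mysql -initSchema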
create
alter
drop
In Hive, the default warehouse location is:
/user/hive/warehouse
Ex:
if you create a table with the name table1, then a table1 directory will be created in the following path:
/user/hive/warehouse/table1
1)TinyInt(1 byte)
2)SmallInt(2 bytes)
3)Int(4 bytes)
4)BigInt(8 bytes)
5)Float(4 bytes)
6)Double(8 bytes)
7)Decimal(hive-0.11.0)
8)Date(hive-0.12.0)
9)Timestamp(hive-0.8.0)
10)String
11)Varchar(hive-0.12.0)
12)Char(hive-0.13.0)
13)Boolean --->true/false
14)Binary --->hive-0.8.0
Hive tables:-
Hadoop
Pig
Hive
^d
hadoop
pig
hive
ex:2
hbase
oozie
flume
^d
o/p
/user/hive/warehouse/sample/file1
/user/hive/warehouse/sample/file2
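The create and load commands for the sample table are not shown above; a minimal sketch that would produce the two files listed in the output (the table name sample is taken from the output paths):
hive> create table sample(line string);
hive> load data local inpath 'file1' into table sample;
hive> load data local inpath 'file2' into table sample;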
Dropping a table
ex:
/user/hive/warehouse/extab
In hdfs
/user/hive/warehouse/extab/file1
hello
pig
hive
/user/hive/warehouse/extab/file1
hadoop
pig
hive
Both inner and external tables can use a custom HDFS location.
Ex:
In hdfs /user/matuser/file1
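A minimal sketch of pointing tables at a custom HDFS location (the path /user/matuser/mydata is an assumption based on the example above):
hive> create table innertab(line string) location '/user/matuser/mydata';
hive> create external table extab(line string) location '/user/matuser/mydata';
Dropping the external table removes only its metadata and leaves the files in place; dropping the inner (managed) table removes the data files as well.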
Ex:
$cat >sample1
1
^d
$cat >sample2
^d
$cat >sample3
^d
$hadoop fs -copyFromLocal samp3 /user/hive/warehouse/sample
o/p
2 from samp1
5 from samp2
8 from samp3
$cat >sample4
10
11
12
^d
Hive>load data local inpath 'sample4' overwrite into
table sample;
10
11
12
……Sample4
…..sample4_copy_1
10
11 from sample4
12
10
11 from sample4-copy-1
12
$cat>test1
100,200,300
400,500,600
700,800,900
^d
Queries
101,srinivas,20000,M,1001
102,satya,50000,M,1002
103,sailaja,25000,F,1002
104,amit,30000,M,1003
104,saradha,50000,F,1004
105,anusha,45000,F,1003
106,krishna,60000,M,1005
107,rajesh,42000,M,1002
108,mohit,40000,M,1001
109,bhargavi,45000,F,1004
110,amrutha,35000,F,1005
hive>create table emp(ecode int,name string,salary
int,sex string,deptno int)row format delimited fields
terminated by ',';
or
((deptno=1004
o/p
all males of 1001,1005
or deptno=1002 or deptno=1003;
or
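The where-clause fragments above are heavily elided; queries of the kind they suggest (a reconstruction, not the original text):
hive> select * from emp where sex='M' and (deptno=1001 or deptno=1005);
hive> select * from emp where deptno=1002 or deptno=1003 or deptno=1004;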
order by:
select ecode,name,salary,sex,deptno,salary*0.1
from emp;
o/p
101 satya 50000 M 1001 5000
updating a table
select ecode,name,salary+
(salary*0.1),sex,deptno,tax from emp;
if function
if(condition,value1,value2)
ex:
100,200
300,400
644,545
789,456
^d
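A minimal sketch of if() over the two-column data above (the table name pairs and columns a, b are assumptions):
hive> create table pairs(a int, b int) row format delimited fields terminated by ',';
hive> select a, b, if(a>b, a, b) as bigger from pairs;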
nested if statements;
$cat >file2
100,200,230
300,400,75
644,545,20
789,456,98
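Nested if() can pick the largest of the three columns in file2 (the table name triples and columns a, b, c are assumptions):
hive> create table triples(a int, b int, c int) row format delimited fields terminated by ',';
hive> load data local inpath 'file2' into table triples;
hive> select if(a>b, if(a>c,a,c), if(b>c,b,c)) as biggest from triples;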
xxxxx2010xxx20xx
xxxxx2010xxx21xx
xxxxx2011xxx27xx
xxxxx2011xxx23xx
^d
2010 20
2010 21
2010 20
2011 27
2011 23
In substr(line,6,4): 6 is the start position and 4 is the length of the substring.
2011 27
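A minimal sketch of the substr query behind the year/temperature output above (the table name tmpr with a single string column line is an assumption):
hive> create table tmpr(line string);
hive> load data local inpath 'tmpr.txt' into table tmpr;
hive> select substr(line,6,4), substr(line,13,2) from tmpr;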
Overwrite is mandatory
$cat >tmpr.txt
xxxxx2010xxx20xx
xxxxx2010xxx-15xx
xxxxx2010xxx-10xx
xxxxx2010xxx23xx
xxxxx2011xxx-15xx
xxxxx2011xxx-10xx
^d
if the temperature is positive, its length is 2
if negative, its length is 3 (including the minus sign)
where substr(line,13,1)!='-'
union all
o/p
2010 20
2010 -15
2010 -10
2010 23
2011 -15
2011 -10
select substr(line,6,4),
if(substr(line,13,1)!='-', substr(line,13,2), substr(line,13,3))
from tmpr;
o/p
2010 20
2010 -15
2010 -10
.........
group by y;
o/p
2012 23 -15
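A reconstruction of the group-by query these fragments point at, combining substr and if to handle negative temperatures (the subquery alias e is an assumption):
hive> select y, max(t), min(t) from (
        select substr(line,6,4) as y,
               cast(if(substr(line,13,1)!='-',
                       substr(line,13,2),
                       substr(line,13,3)) as int) as t
        from tmpr) e
      group by y;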
---------------------------------------------------------------------
Bigdata@Bigdata:~$ cat>student
Kalam,20,Btech*Mtech*PHD
praneeth,21,Bsc*Msc*Mtech
satya,38,Bsc*Msc*PHD
gagan 20 ["btech","mtech","phd"]
praveen 21 ["bsc","msc","mtech"]
Btech
Bsc
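The table definition for student is not shown; a minimal sketch consistent with the '*'-separated qualifications and the array output above:
hive> create table student(name string, age int, qual array<string>)
      row format delimited fields terminated by ','
      collection items terminated by '*';
hive> load data local inpath 'student' into table student;
hive> select qual[0] from student;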
$ cat >trans
p101,pendrive$400*camera$5000*mobile$8500
p102,Tv$30000*CDPlayer$10000*pendrive$400*computer$25000
p102
{"Tv":30000,"CDPlayer":10000,"pendrive":800,"computer":25000}
o/p
p101 400
p102 800
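A minimal sketch for the trans data, where '*' separates map entries and '$' separates keys from values:
hive> create table trans(pid string, items map<string,int>)
      row format delimited fields terminated by ','
      collection items terminated by '*'
      map keys terminated by '$';
hive> select pid, items['pendrive'] from trans;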
$ cat >profile
gagan datta,27,manaswi#23#hyd,hyd
krishna,30,saradha#25#hyd,hyd
^d
gagan datta 27 {"name":"manaswi","age":23,"city":"hyd"} hyd
krishna 30 {"name":"saradha","age":25,"city":"hyd"} hyd
hive>select name ,wife.name from profile;
krishna saradha
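A minimal sketch for the profile data, with wife as a struct whose fields are separated by '#':
hive> create table profile(name string, age int,
        wife struct<name:string,age:int,city:string>,
        city string)
      row format delimited fields terminated by ','
      collection items terminated by '#';
hive> select name, wife.name from profile;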
Like:
$cat >file3
100,,200
200,,
,,700
150,240,
,130,140
175,300,250
^d
10 20 NULL
NULL 30 40
100 50 25
NULL
NULL
NULL
NULL
NULL
NULL
175
because
100+200=300
NULL+20 =NULL
200-NULL=NULL
100 0 200
200 0 300
0 0 300
0 300 0
10 20 0
0 30 40
100 50 25
300
500
300
30
70
175
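The fragments above show that arithmetic involving NULL yields NULL; a minimal sketch of replacing NULLs with 0 before adding (the table name file3tab and columns a, b, c are assumptions):
hive> create table file3tab(a int, b int, c int)
      row format delimited fields terminated by ',';
hive> load data local inpath 'file3' into table file3tab;
hive> select a+b+c from file3tab;   -- NULL whenever any column is NULL
hive> select coalesce(a,0)+coalesce(b,0)+coalesce(c,0) from file3tab;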
ex:-
101,srinivas,20000,M,1001
102,satya,50000,M,1002
103,sailaja,25000,F,1002
104,amit,30000,M,1003
104,saradha,50000,F,1004
105,anusha,45000,F,1003
106,krishna,60000,M,1005
107,rajesh,42000,M,1002
108,mohit,40000,M,1001
109,bhargavi,45000,F,1004
110,amrutha,35000,F,1005
in the sex column: F ---> female, M ---> male
deptno: 1001 ---> marketing, 1002 ---> HR, 1003 ---> Finance, others ---> others
salary ---> grade:
salary>=70000 ---> A, salary>=50000 ---> B, salary>=30000 ---> C, salary<30000 ---> D
select ecode,name,salary,
if(sex='F','female','male'),
if(deptno=1001,'marketing',
if(deptno=1002,'HR',
if(deptno=1003,'Finance','others'))),
if(salary>=70000,'A',
if(salary>=50000,'B',
if(salary>=30000,'C','D')))
from emp;
101,srinivas,20000,Male,marketing
102,satya,50000,Male,HR
103,sailaja,25000,Female,HR
104,amit,30000,Male,Finance
104,saradha,50000,Female,others
105,anusha,45000,Female,Finance
.......
else 'others'
end
from emp;
else 'D'
end
from emp;
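The case fragments above are elided; a reconstruction of the full query they suggest, equivalent to the nested-if version:
hive> select ecode, name, salary,
        case when sex='F' then 'female' else 'male' end,
        case when deptno=1001 then 'marketing'
             when deptno=1002 then 'HR'
             when deptno=1003 then 'Finance'
             else 'others' end,
        case when salary>=70000 then 'A'
             when salary>=50000 then 'B'
             when salary>=30000 then 'C'
             else 'D' end
      from emp;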
or
ex3:
hive>select deptno,sum(salary),avg(salary),min(salary),max(salary),count(*)
from emp group by deptno;
having clause:
o/p
1002 117000
1004 95000
1005 95000
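The having query itself is elided; a minimal sketch consistent with the output above:
hive> select deptno, sum(salary) from emp
      group by deptno
      having sum(salary) >= 95000;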
Eliminating duplicates:
TAB          o/p
10           10
20           20
20           30
20
10
30
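A minimal sketch (the table name tab and column num are assumptions):
hive> select distinct num from tab;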
merging tables
UNION ALL
table -->emp1
............
...........
1000 rows
table -->emp2
................
.............
500 rows
ex: merge emp1,emp2 into emp
select * from(
union all
emp1:eid,ename,salary,sex,deptno
emp2:eid,ename,deptno,salary,sex
select * from(
union all
emp1:eid,ename,salary,sex,deptno,designation
emp2:eid,ename,salary,deptno,city.
select * from(
union all
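Reconstructions of the three union patterns sketched above (column lists are inferred from the schemas shown; the null padding in the third case is an assumption):
-- 1) identical schemas
hive> select * from (select * from emp1 union all select * from emp2) e;
-- 2) same columns in a different order: project them in one order
hive> select * from (select eid,ename,salary,sex,deptno from emp1
                     union all
                     select eid,ename,salary,sex,deptno from emp2) e;
-- 3) different columns: pad the missing ones with nulls
hive> select * from (select eid,ename,salary,deptno,designation,
                            cast(null as string) as city from emp1
                     union all
                     select eid,ename,salary,deptno,
                            cast(null as string) as designation, city from emp2) e;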
o/p
model1:
union all
group by dno;
model2:
hive>select dno,sum(tot) as tot from(
group by dno
union all
group by dno)e
group by dno;
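Reconstructions of the two aggregation models (the column names dno and salary are assumptions):
model1 -- union first, aggregate once:
hive> select dno, sum(salary) as tot from (
        select dno, salary from emp1
        union all
        select dno, salary from emp2) e
      group by dno;
model2 -- aggregate each table, union the partial totals, aggregate again:
hive> select dno, sum(tot) as tot from (
        select dno, sum(salary) as tot from emp1 group by dno
        union all
        select dno, sum(salary) as tot from emp2 group by dno) e
      group by dno;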
1)Inner joins
2)Outer Joins
Inner join
A B
1 3
2 4
3 5
4 8
5 9
o/p 3,3
4,4
5,5
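A minimal sketch, assuming single-column tables A(x) and B(y) holding the values above:
hive> select a.x, b.y from A a join B b on (a.x = b.y);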
outer joins
o/p
1 null
2 null
3,3
4,4
5,5
Right outer join:-
3,3
4,4
5,5
null,8
null,9
Full outer join:- all matching rows and non-matching rows of both the left and right tables
3,3
4,4
5,5
1,null
2,null
null,8
null,9
Emp
eid ename salary sex deptno
Dept
deptno dname city
1003 HR Chennai
InnerJoin
o/p
delhi 100000
hyderabad 125000
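A sketch of the join-plus-aggregation that would yield the city-wise salary totals above (the Dept column names are assumptions):
hive> select d.city, sum(e.salary)
      from emp e join dept d on (e.deptno = d.deptno)
      group by d.city;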
denormalization:
now, without applying joins every time, you can directly query empdept.
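A minimal sketch of building the denormalized empdept table once with CTAS:
hive> create table empdept as
      select e.eid, e.ename, e.salary, e.sex, e.deptno, d.dname, d.city
      from emp e join dept d on (e.deptno = d.deptno);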
sales datasets
$cat sales
13-10-2010 261.54
01-10-2012 10123.02
01-10-2012 244.57
10-07-2011 4965.7595
28-08-2010 394.27
28-08-2010 146.69
17-06-2011 93.54
17-06-2011 905.08
24-03-2011 2781.82
26-02-2010 228.41
23-11-2010 196.85
23-11-2010 124.56
08-06-2012 716.84
08-06-2012 1474.33
04-08-2012 80.61
01-01-2001 3000.89
............
2001-01-01 3000.78
.................
.................
2014-12-31 200.09
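The conversion from dd-MM-yyyy strings to Hive's yyyy-MM-dd format is not shown; a minimal sketch using string functions (a staging table raw_sales with columns dt string, amt double is an assumption):
hive> create table sales as
      select concat(substr(dt,7,4),'-',substr(dt,4,2),'-',substr(dt,1,2)) as dt, amt
      from raw_sales;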
Once the date is in Hive's date format (yyyy-MM-dd), you can apply all Hive date functions:
month('2001-01-05') o/p 1
day('2001-01-05') o/p 5
group by year(dt);
group by day(dt);
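Completed forms of the grouping fragments above (the sum(amt) aggregate is an assumption):
hive> select year(dt), sum(amt) from sales group by year(dt);
hive> select day(dt), sum(amt) from sales group by day(dt);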
select dt,amt,case
else 4
group by year(dt),halfyear;
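A reconstruction of the quarter-style query the case fragment suggests (the quarter boundaries are an assumption; Hive requires repeating the case expression in the group by):
hive> select year(dt),
        case when month(dt)<=3 then 1
             when month(dt)<=6 then 2
             when month(dt)<=9 then 3
             else 4 end as quarter,
        sum(amt) as totamt
      from sales
      group by year(dt),
        case when month(dt)<=3 then 1
             when month(dt)<=6 then 2
             when month(dt)<=9 then 3
             else 4 end;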
quarter totamt
1 50000
2 75000
3 60000
4 90000
select l.quarter,l.totamt,r.quarter,r.totamt
from quarterly_report2010 l
join quarterly_report2010 r;
q1 tot1 q2 tot2
1 50000 1 50000
1 50000 2 75000
1 50000 3 60000
1 50000 4 90000
........
where q1-q2=1;
q1 tot1 q2 tot2
2 75000 1 50000
3 60000 2 75000
4 90000 3 60000
select q1 ,tot1,q2,tot2
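Putting the self-join together, a sketch of the quarter-over-quarter comparison (alias names follow the output headers; cross join is used because the filter is not an equi-join condition):
hive> select q1, tot1, q2, tot2
      from (select l.quarter as q1, l.totamt as tot1,
                   r.quarter as q2, r.totamt as tot2
            from quarterly_report2010 l
            cross join quarterly_report2010 r) e
      where q1 - q2 = 1;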
Classification of tables
1) partitioned tables
Hive was introduced to reduce the burden of data querying. Apache Hive converts SQL queries into MapReduce jobs and submits them to the Hadoop cluster. When we submit a SQL query, Hive reads the entire data set, so running MapReduce jobs over a large table becomes inefficient. This is resolved by creating partitions in tables. Apache Hive makes implementing partitions easy through its automatic partition scheme at table-creation time.
In the partitioning method, all the table data is divided into multiple partitions, each corresponding to a specific value (or values) of the partition column(s). A partition is kept as a sub-directory inside the table's directory in HDFS. Therefore, when a particular table is queried, only the partition containing the queried value is read. This decreases the I/O time required by the query and hence increases performance.
Suppose we need to retrieve the details of all the clients who joined in 2012. The query would search the whole table for the required information. But if we partition the client data by year and store it in separate files, the query processing time is reduced.
The following example will help us learn how a partitioned table lays out its data:
tab1/clientdata/2009/file2
tab1/clientdata/2010/file3
Now when we are retrieving the data from the table, only the data of the specified
partition will be queried. Creating a partitioned table is as follows:
CREATE TABLE table_tab1 (id INT, name STRING, dept STRING, yoj INT) PARTITIONED BY (year STRING)
row format delimited fields terminated by ',';
load data local inpath 'file:///home/madhu/2009_f1' OVERWRITE into table table_tab1 PARTITION(year='2009');
load data local inpath 'file:///home/madhu/2010_f2' OVERWRITE into table table_tab1 PARTITION(year='2010');
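With the partitions in place, a query that filters on the partition column reads only that partition's directory:
hive> select * from table_tab1 where year = '2009';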
● Static Partitioning
● Dynamic Partitioning
● Usually when loading files (big files) into Hive tables static partitions
are preferred.
● You “statically” add a partition in the table and move the file into the
partition of the table.
● You can get the partition column value from the filename, day of date
etc without reading the whole big file.
● If you want to use static partitions in Hive, you should set the property hive.mapred.mode=strict (this property can be set in hive-site.xml).
● You should use a where clause (a partition filter) when querying a static partition in strict mode; see the sketch after this list.
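A minimal static-partition sketch (the partition value and file name are assumptions):
hive> alter table table_tab1 add partition (year='2011');
hive> load data local inpath '/home/madhu/2011_f3' into table table_tab1 partition (year='2011');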
● When you have large data stored in a table then the Dynamic partition
is suitable.
● If you want to partition by a column but do not know in advance how many partition values it has, dynamic partitioning is also suitable.
● You can perform dynamic partition on hive external table and managed
table.
● To use dynamic partitioning in Hive, the mode must be non-strict (hive.exec.dynamic.partition.mode=nonstrict), as sketched after this list.
● There is the possibility of too many small partition creations- too many
directories.
● Partitioning is effective for low-volume data, but some queries, such as a group by on a high volume of data, still take a long time to execute. For example, grouping the population of China will take much longer than grouping the population of Vatican City.
● There is no need to search the entire table for a single record.
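A minimal dynamic-partition sketch (the staging table staging_tab1 is an assumption; the partition column must come last in the select list):
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert overwrite table table_tab1 partition(year)
      select id, name, dept, yoj, year from staging_tab1;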
Bucketing – In Hive, tables or partitions are subdivided into buckets based on the hash of a column in the table, to give extra structure to the data that may be used for more efficient queries.
Each bucket in Hive is just a file in the table directory (for an unpartitioned table) or in the partition directory. If you choose to divide the partitions into n buckets, you will have n files in each of your partition directories. For example, if each partition is bucketed into 2 buckets, then each partition, say EEE, will have two files, each storing part of the EEE students' data.
Case Study on Partitioning and Bucketing
A client has e-commerce data for its India operations, with each state's (38 states) operations recorded as a whole. If we take the state column as the partition key and partition that India data, we get 38 partitions, equal to the number of states, so that each state's data can be viewed separately in the partitioned table.
Creation of the table allstates with 3 columns: state, district, and enrollment.
set hive.exec.dynamic.partition.mode=nonstrict
Actual processing and formation of the partition tables uses state as the partition key; there will be 38 partition outputs in HDFS storage, each directory named after a state. A code sketch of these steps follows below.
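A sketch of the case-study flow (table and column names follow the description above; the staging/partitioned table split is an assumption):
hive> create table allstates(state string, district string, enrollment int)
      row format delimited fields terminated by ',';
hive> create table state_part(district string, enrollment int)
      partitioned by (state string);
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert overwrite table state_part partition(state)
      select district, enrollment, state from allstates;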
● The data i.e. present in that partitions can be divided further into
Buckets
● The division is performed based on Hash of particular columns that
we selected in the table.
● Buckets use some form of Hashing algorithm at back end to read
each record and place it into buckets
● In Hive, we have to enable buckets by using:
set hive.enforce.bucketing=true;
Before we start moving employees data into buckets, make sure that it consists of columns such as first_name, job_id, department, salary and country.
Here we are loading data into sample bucket from employees table.
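A sketch of the bucket load described above (the bucket count of 4, the table name sample_bucket, and the column types are assumptions; the column names are from the text):
hive> set hive.enforce.bucketing=true;
hive> create table sample_bucket(first_name string, job_id int,
        department string, salary int, country string)
      clustered by (country) into 4 buckets;
hive> insert overwrite table sample_bucket
      select first_name, job_id, department, salary, country from employees;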
non partitioned:
hive>use empdetails;
now hive has to read all 500 rows; after reading each row it validates the condition (sex='F'), and if true, it writes the row.
hive>describe emp;
eid int
name string
deptno int
/user/hive/warehouse/empdetails.db/emp
sol:
partition(gender='F')//partition name
partition(gender='M')//partition name
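A sketch of the solution: create a table partitioned by gender and fill each partition from emp (the table name emp_part is an assumption, and emp is assumed to carry a sex column):
hive> create table emp_part(eid int, name string, deptno int)
      partitioned by (gender string);
hive> insert overwrite table emp_part partition(gender='F')
      select eid, name, deptno from emp where sex='F';
hive> insert overwrite table emp_part partition(gender='M')
      select eid, name, deptno from emp where sex='M';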
note:
Multiple Insertion
hive>from default.emp
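The multi-insert fragment completed (a sketch; it scans emp once and writes both partitions):
hive> from default.emp
      insert overwrite table emp_part partition(gender='F')
        select eid, name, deptno where sex='F'
      insert overwrite table emp_part partition(gender='M')
        select eid, name, deptno where sex='M';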
In HDFS,
/user/hive/warehouse/empdetails.db
 emp/
  gender=F/000000_0
  gender=M/000000_0
Observe the layout above:
emp -> is directory for table
gender=F -> is subdirectory of emp
gender=F/000000_0 file contains only female data
gender=M -> is subdirectory of emp
gender=M/000000_0 file contains only male data
Reading data from partitions
dno int
sex string -----> logical columns for partitions
/dno=1001
 -> sex=F/000000_0
 -> sex=M/000000_0
/dno=1002
 -> sex=F/000000_0
 -> sex=M/000000_0
Now you created the emp1 table, partitioned by (dno int, sex string).
Here, dno is the primary partition and sex is the sub-partition.
The primary partition is a subdirectory of the table's directory, and a sub-partition is a subdirectory of the primary partition.
In this way you can go for multi-level sub-partitioning.
/user/hive/warehouse/mydb.db/emp1/dno=1001/sex=F/000000_0
/user/hive/warehouse/mydb.db/emp1/dno=1001/sex=M/000000_0
/user/hive/warehouse/mydb.db/emp1/dno=1002/sex=F/000000_0
/user/hive/warehouse/mydb.db/emp1/dno=1002/sex=M/000000_0
In HDFS,
/user/hive/warehouse/mydb.db/newsales
 /y=2001
  -> /m=1
     d=1/000000_0
     d=2/000000_0
     ............
     d=31/000000_0
  ............
  -> /m=12
     d=1/000000_0
     d=2/000000_0
     ............
     d=31/000000_0
 ............
 /y=2015
  -> /m=1
     d=1/000000_0
     d=2/000000_0
     ............
     d=31/000000_0
  ............
  -> /m=7
     d=1/000000_0
     d=2/000000_0
     ............
     d=19/000000_0
In HDFS,
/user
/hive
/warehouse
/matdb.db
/enquiries
/y=2015
/m=7
/d=1/enquiries1
/d=2/enquiries2
/d=3/enquiries3
...............
...............
I want to list out all enquiries from 2015-10-1 to 2015-10-5.
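A sketch of that query, using the partition columns from the layout above:
hive> select * from enquiries
      where y=2015 and m=10 and d>=1 and d<=5;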
// hive reads
/user/hive/warehouse/......./emp1/d=1001/grade=F
/user/hive/warehouse/......./emp1/d=1001/grade=M
partitions.
if you set,
hive> set hive.mapred.mode=nonstrict;
// Now you can request all partitions of the table
now hive has to create 10000 partitions: if you have 100 unique cities and 100 unique department numbers (dnos), one partition data file is created for each combination, so 10000 files. The NameNode has to maintain metadata for all 10000 files, which is a great overhead (burden) for the NameNode. What if you partition your sales table by product_id and you have millions of products? That puts a very heavy burden on the NameNode for maintaining metadata. In such cases, we choose bucketed tables.
Hive UDFS:
package com.mat.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
// A simple Hive UDF that converts its string argument to upper case.
public final class UpperUDF extends UDF
{
// Hive calls evaluate() once per input row.
public Text evaluate(final Text s)
{
if(s==null)
{
return null; // a NULL input stays NULL in the output
}
return new Text(s.toString().toUpperCase());
}
}
5)
6)
$ export HIVE_AUX_JARS_PATH=/home/matuser
Once you have Hive running, the last step is to register your function:
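A sketch of registering and using the UDF from the Hive shell (the jar and function names are assumptions):
hive> add jar /home/matuser/upperudf.jar;
hive> create temporary function myupper as 'com.mat.hive.udf.UpperUDF';
hive> select myupper(name) from emp;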
views:
FROM(
SELECT * FROM empmaster e1 JOIN dept d
ON(e1.deptid=d.deptid) WHERE d.deptid=1001
)e SELECT e.* WHERE ename LIKE 'g%';
the view is used like any other table. In this query we added a
WHERE clause to the
SELECT statement. This exactly emulates the original query:
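The CREATE VIEW statement itself is not shown; a sketch of the view this section implies, followed by the query that uses it:
hive> create view emp_dept1001 as
      select e1.* from empmaster e1 join dept d
      on (e1.deptid = d.deptid)
      where d.deptid = 1001;
hive> select * from emp_dept1001 where ename like 'g%';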
Bucketing tables
/user/hive/warehouse/mudb.db/emp_buckets/000001_0
/user/hive/warehouse/mudb.db/emp_buckets/000002_0
Hive maintains a hashtable recording which key is stored in which bucket. When you request data using the bucketing column, Hive first determines the address of the (bucketed) key by reading the hashtable, and reads data only from that bucket, without reading from the other buckets.
Hashtable
Hashcode of key    Bucket
11                 000000_0
12                 000001_0
13                 000002_0
14                 000000_0
15                 000000_0
16                 000002_0
17                 000003_0
18                 000003_0
...
Case study on how to use Hive on top of Hadoop for different statistical analyses. In this example we have a predefined dataset (cricket.csv) with more than 20 columns and more than 10000 records in it. The example below explains how to display a few records based on attributes of cricket. The cricket.csv file has different attributes like:
name,country,Player_ID,Year,stint,teamid,run,igl,D,G,G_batting,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP,G_OLD
Follow the steps below to retrieve data from big tables.
5. Load the data from the csv file into the table temp_cricket:
hive> LOAD DATA INPATH '/user/sai/sampledataset/hivedataset/cricket.csv' OVERWRITE INTO TABLE temp_cricket;
6. Create another table to extract the data:
hive> create table batting (player_id STRING, year INT, runs INT);
Example 1: Display the year and the maximum runs scored in each year:
hive> SELECT year, max(runs) FROM batting GROUP BY year;
Example 2: Display the player_id, year and maximum runs scored in each year:
hive> SELECT a.year, a.player_id, a.runs FROM batting a
      JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b
      ON (a.year = b.year AND a.runs = b.runs);
Total jobs = 2
Launching job 1 out of 2
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
Stage-2 map = 0%, reduce = 0%
Stage-2 map = 100%, reduce = 0%, cumulative CPU 3.0 sec
Stage-2 map = 100%, reduce = 100%, cumulative CPU 5.95 sec
MapReduce total cumulative CPU time: 5.95 sec
Launching job 2 out of 2
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
Stage-3 map = 0%, reduce = 0%
Stage-3 map = 100%, reduce = 0%, cumulative CPU 3.12 sec
Stage-3 map = 100%, reduce = 100%, cumulative CPU 3.12 sec
MapReduce total cumulative CPU time: 3.12 sec
Time taken: 65.743 sec
As there are two queries (SELECT statements), two jobs were executed. The CPU time varies according to the system configuration.
Conclusion: Hive works with Hadoop to allow you to query and manage large-scale data using a familiar SQL-like interface. Hive provides a command-line interface to the shell.