Apache Hive: An Introduction

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 51

Apache Hive

An Introduction

Agenda
Overview
A quick look at Hive and its background.

Structure
A peek at the structure of Hive.

Language
How to write DDL and DML statements in Hive.

Hive at Yahoo
Working with Hive on Yahoos grids.

Advanced Features
Some more things you can do with Hive.

More Information
Where to look when you need more details or help.

Overview

Motivation for Hive


Companies are no longer dealing with gigabytes, but rather terabytes

Large amount of data to analyze


Researchers want to study and understand the data Business fold wants to slice and dice the data & metrics in various ways Every one impatient Give me answers now

Joining across large data sets is quite tricky

Motivation for Hive


Started in January 2007 at Facebook Query data on Hadoop without having to write complex

MapReduce programs in Java each time


SQL chosen for familiarity and tools-support

An active open-source project since August 2008


Top-level Apache project (hive.apache.org)

Used in many companies; a diverse set of contributors

What Hive Is
A Hadoop-based system for managing and querying structured data

Hive provides an view of your data as tables with rows


and columns Uses HDFS for storing data Provides a SQL-like interface for querying data Uses MapReduce for executing queries Scales well to handle massive data-sets
6

Example
SELECT COUNT(1) AS job_count, t.wait_time FROM (SELECT ROUND(wait_time/1000) AS wait_time, job_id FROM starling_jobs WHERE grid = MB AND dt >= 2011_07_11

AND dt <= 2011_07_13) t


GROUP BY t.wait_time;

8 Simple steps
Login to grid gateway machine. Create a hdfs file to store your hive metadata, Ex:hadoop fs -mkdir /user/vmoorthy/warehouse Go to hive shell by running hive SET mapred.job.queue.name=unfunded; job in the unfunded queue -- to run your

8 Simple steps ()
Create a database specifying the location for meta data store. Ex:CREATE DATABASE autos LOCATION '/user/vmoorthy/warehouse'; USE autos; -- to work with previously created database named 'autos CREATE TABLE used_car(chromeTrimId INT,trimId INT, usedCarCondition STRING, usedCarMileage INT, usedCarPrice INT, chromeModelId INT, modelId INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' LOCATION '/user/vmoorthy/usedCarTrim'; -- create a table for the tab separated hdfs file named usedCarTrim

8 Simple steps ()
Now, you are ready to run select queries on the above table. Ex:3030; SELECT * FROM used_car WHERE chrometrimid >

10

Structure

11

Architecture
Hive
JDBC Command-line Interface Web Interface ODBC

Thrift Server Meta-store

Driver (Compiler, Optimizer, Executor)

Hadoop

Database

12

Query Execution
Query

Parser

Logical Plan Generator

Optimizer

Physical Plan Generator

MapReduce Job(s)

Executor

13

Storage
Table metadata is stored in meta-store Directories for databases, tables and partitions

Files for table-data


<warehouse-directory> <database-directory> <table-directory>

<partition-directory>
<data-file1> <data-file2> []

<data-filen>

14

Language

15

Data Model
Database a namespace for tables and other units of data (default if none specified)

Table a row-based store for data in a database; each row


having one or more columns Partition a key-based separation of data in a table for reducing the amount of data scanned (Optional) Bucket cluster of data in a partition based on a hashing a column-value (Optional)

16

Primitive Data-types
Integers TINYINT (1 byte), SMALLINT (2 bytes), INT (4 bytes), BIGINT (8 bytes)

Boolean BOOLEAN (TRUE / FALSE)


Floating-Point FLOAT, DOUBLE

String STRING
Implicit and explicit casting supported

17

Complex Data-types
Arrays a list of elements of the same data-type accessible using an index. A[n] denotes the element at index

n(starts from zero) in array A


Structs a record with named elements. foo.bar denotes the field bar in the struct foo Maps maintains mappings from keys to respective values. M[foo] denotes the value for foo in the map M

Collections can be nested arbitrarily

18

Operators
Relational =, !=, <, <=, etc. as well as IS NULL, IS NOT NULL, LIKE, etc. Generate TRUE

or FALSE based on comparison


Arithmetic +, -, *, /, etc. Generate number based on the result of the arithmetic operation Logical AND, OR, NOT, etc. Generate TRUE or FALSE based on combining Boolean expressions

19

Built-in Fuctions
Mathematical round(), floor(), rand(), etc. String concat(), substr(), regexp_replace(),

etc.
Time to_date(), from_unixtime(), year(), month(), etc. Aggregates count(), sum(), min(), max(), avg()

and quite a lot more

20

Creating a Table
CREATE TABLE employees(name STRING, age INT); or CREATE TABLE IF NOT EXISTS employees(name STRING, age INT); or CREATE TABLE employees(name STRING, age INT) PARTITIONED BY (join_dt STRING); or CREATE TABLE employees(name STRING, age INT) STORED AS SequenceFile; etc.

21

Loading Data
LOAD DATA INPATH '/foo/bar/snafu.txt' INTO TABLE employees; or LOAD DATA LOCAL INPATH '/homes/wombat/emp_2011-12-01.txt' INTO TABLE employees PARTITION (join_dt=2011_12_01);

or
INSERT OVERWRITE TABLE employees SELECT name, age FROM all_employees

WHERE location = 'Bangalore';

22

Querying Data
SELECT * FROM employees; or SELECT * FROM employees LIMIT 10; or SELECT name, age FROM employees WHERE age > 30; or SET hive.exec.compress.output=false; SET hive.cli.print.header=true; INSERT OVERWRITE LOCAL DIRECTORY /homes/wombat/blr SELECT * FROM all_employees WHERE location = Bangalore; etc.

23

External Tables
Data not managed by Hive Useful when data is already processed and in a usable state

Manually clean up after dropping tables/partitions

CREATE EXTERNAL TABLE foo(name string, age int)

LOCATION /user/bar/wombat;

24

Altering a Table
ALTER TABLE employees RENAME TO blr_employees;

ALTER TABLE employees REPLACE COLUMNS (emp_name STRING, emp_age INT);

ALTER TABLE employees ADD COLUMNS (emp_id STRING);

ALTER TABLE all_employees DROP PARTITION (location=Slackville);

25

Databases
CREATE DATABASE foo;

or
CREATE DATABASE IF NOT EXISTS foo;

or
CREATE DATABASE foo LOCATION /snafu/wombat; USE foo; SELECT * FROM bar LIMIT 10;

or
SELECT * FROM foo.bar LIMIT 10; DROP DATABASE foo;

or
DROP DATABASE IF EXISTS foo;

26

Other Operations
SHOW TABLES; SHOW PARTITIONS all_employees; SHOW PARTITIONS all_employees PARTITION (location=Bangalore); DESCRIBE employees; DROP TABLE employees; or DROP TABLE IF EXISTS employees;

27

Joins
SELECT e.name, d.dept_name FROM departments d JOIN all_employees e ON (e.dept_id = d.dept_id);

or

SELECT e.name, d.dept_name


FROM departments d LEFT OUTER JOIN all_employees e

ON (e.dept_id = d.dept_id);

28

Ordering of Data
ORDER BY global ordering of results based on the selected columns

SORT BY local ordering of results on each reducer


based on the selected columns

29

File-formats
TextFile plain-text files; fields delimited with ^A by default

SequenceFile serialized objects, possiblycompressed RCFile columnar storage of serialized objects, possibly-compressed

30

TextFile Delimiters
Default field-separator is ^A; row-separator is \n John Doe^A36\n

Jane Doe^A33\n
Default list-separator is ^B; value-separator is ^C John Doe^Adept^Cfinance^Bemp_id^C2357\n

CREATE TABLE employees(name STRING, age INT)

ROW FORMAT DELIMITED


FIELDS TERMINATED by '\t';

31

Buckets
Distribute partition-data into files based on columns Improves performance for filters with these columns Works best when data is uniformly distributed
CREATE TABLE employees(name STRING, age INT) CLUSTERED BY (name) INTO 31 BUCKETS;

32

Compressed Storage
Saves space and generally improves performance Direct support for reading compressed files
LOAD DATA LOCAL INPATH /foo/bar/emp_data.bz2 INTO TABLE all_employees;

Compressed TextFile cannot usually be split

SequenceFile or RCFile recommended instead

33

Tips
Judicious use of partitions and buckets can drastically improve the performance of your queries

Put always-used Hive CLI commands in $HOME/.hiverc


(e.g. SET mapred.job.queue.name=unfunded;) Use EXPLAIN to analyze a query before executing it Use RCFile with compression to save storage and to improve performance

34

Hive at Yahoo

35

Specifics
Hive CLI available as /home/y/bin/hive on gateways of supported grids

Mandatory LOCATION clause in CREATE TABLE


Must specify MapReduce queue for submitted Jobs

(e.g. SET mapred.job.queue.name=unfunded;)


No JDBC / ODBC support Integrated with HCatalog

36

Advanced Features

37

User-defined Functions
Many very useful built-in functions
SHOW FUNCTIONS; DESCRIBE FUNCTION foo;

Extensible using user-defined functions User-defined Function (UDF) for one-to-one mapping
E.g. round(), concat(), unix_timestamp(), etc.

User-defined Aggregate Function (UDAF) for many-to-one mapping


E.g. sum(), avg(), stddev(), etc.

User-defined Table-generating Function (UDTF) for one-to-

many mapping
E.g. explode(), etc.

38

Custom UDF
package com.yahoo.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF; import org.apache.hadoop.hive.ql.exec.Description; import org.apache.hadoop.io.Text;

@Description( name = "toupper", value = "_FUNC_(str) - Converts a string to uppercase", extended = "Example:\n" + " > SELECT toupper(author_name) FROM authors a;\n" +

" STEPHEN KING"


)

39

Custom UDF ()
public class ToUpper extends UDF {

public Text evaluate(Text s) { Text to_value = new Text(""); if (s != null) { try { to_value.set(s.toString().toUpperCase()); } catch (Exception e) { // Should never happen

to_value = new Text(s);


} } return to_value; } }

40

UDF Usage
add jar build/ql/test/test-udfs.jar; CREATE TEMPORARY FUNCTION TO_UPPER AS com.yahoo.hive.udf.ToUpper'; SELECT TO_UPPER(src.value) FROM src; DROP TEMPORARY FUNCTION TO_UPPER;

41

Overloaded UDF
public class UDFExampleAdd extends UDF { public Integer evaluate(Integer a, Integer b) { if (a == null || b == null) return null; return a + b; } public Double evaluate(Double a, Double b) {

if (a == null || b == null) return null;


return a + b; }

42

Overloaded UDF
add jar build/contrib/hive_contrib.jar; CREATE TEMPORARY FUNCTION example_add AS 'org.apache.hadoop.hive.contrib.udf.UDFExampleAdd'; SELECT example_add(1, 2) FROM src; SELECT example_add(1.1, 2.2) FROM src;

43

UDAF Example
SELECT page_url, count(1), count(DISTINCT user_id) FROM mylog;
public class UDAFCount extends UDAF { public static class Evaluator implements UDAFEvaluator { private int mCount; public void init() {mcount = 0;} public boolean iterate(Object o) { if (o!=null) mCount++; return true;} public Integer terminatePartial() {return mCount;} public boolean merge(Integer o) {mCount += o; return true;} public Integer terminate() {return mCount;} }

44

Overloaded UDAF
public class UDAFSum extends UDAF { public static class IntEvaluator implements UDAFEvaluator {

private int mSum;


public void init() {mSum = 0;} public boolean iterate(Integer o) {mSum += o; return true;} public Integer terminatePartial() {return mSum;} public boolean merge(Integer o) {mSum += o; return true;}

public Integer terminate() {return mSum;}


}

45

Overloaded UDAF
public static class DblEvaluator implements UDAFEvaluator { private double mSum; public void init() {mSum = 0;}

public boolean iterate(Double o)


{mSum += o; return true;} public Double terminatePartial() {return mSum;}

public boolean merge(Double o)


{mSum += o; return true;} public Double terminate() {return mSum;} } }

46

What Hive Is Not


Not suitable for small data-sets Does not provide real-time results Does not support row-level updates Imposes a schema on the data Does not support transactions Does not need expensive server-class hardware,

RDBMS licenses or god-like DBAs to scale

47

More Information

48

External References
Hive home-page: hive.apache.org Hive wiki: cwiki.apache.org/confluence/display/Hive Hive tutorial: cwiki.apache.org/confluence/display/Hive/Tutorial Hive language manual:
cwiki.apache.org/confluence/display/Hive/LanguageManual

Mailing-list: [email protected]

49

Internal References
Hive at Yahoo: wiki.corp.yahoo.com/view/Grid/Hive Hive FAQ: wiki.corp.yahoo.com/view/Grid/HiveFAQ

Troubleshooting:
wiki.corp.yahoo.com/view/Grid/HiveTroubleShooting

Internal mailing-list: [email protected] Hive CLI yinst package: hive_cli Installation instructions:
wiki.corp.yahoo.com/view/Grid/HiveInstallation

50

Questions?

51

You might also like