Pig and Hive
• Support for:
• Grouping
• Joins
• Filtering
• Aggregation
• Extensibility
• Support for User Defined Functions (UDF’s)
• Leverages the same massive parallelism as native MapReduce
Pig Basics
-- Extract words from each line and put them into a pig bag named ‘words’
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
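• For context, a minimal sketch of the full word-count script this fragment comes from (file and alias names are illustrative):
-- load each line of a (hypothetical) text file as a single chararray field
input_lines = LOAD 'input.txt' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- group identical words together and count each group
word_groups = GROUP words BY word;
word_count = FOREACH word_groups GENERATE group AS word, COUNT(words) AS count;
STORE word_count INTO 'wordcount_output';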
• Help is available
$ pig -h
• Pig supports HDFS commands
grunt> pwd
• put, get, cp, ls, mkdir, rm, mv, etc.
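• Example (paths are hypothetical):
grunt> ls /user/hadoop
grunt> mkdir /user/hadoop/output
grunt> cp /user/hadoop/data.txt /tmp/data.txt
grunt> rm /tmp/data.txt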
Pig Data Types
• Tuple: ordered set of fields (a “row” / record)
• Bag: collection of tuples (a “result set” / table)
• Map: set of key-value pairs; keys must be of type chararray
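• Example (a minimal sketch; file and field names are hypothetical):
-- a schema using all three complex types; DESCRIBE prints it back
A = LOAD 'data.txt' AS (t:tuple(a:int, b:int), bg:bag{tp:tuple(x:int, y:int)}, m:map[chararray]);
DESCRIBE A;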
Pig Data Formats
• BinStorage
• Loads and stores data in machine-readable (binary) format
• PigStorage
• Loads and stores data as structured, field delimited text files
• TextLoader
• Loads unstructured data in UTF-8 format
• PigDump
• Stores data in UTF-8 format
• YourOwnFormat!
• via UDFs
Loading and Storing Data in Pig
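• LOAD reads data from HDFS, optionally with a load function such as PigStorage or TextLoader
• Example (a minimal sketch; file names and schemas are hypothetical):
records = LOAD 'sales.csv' USING PigStorage(',') AS (id:int, amount:double);
lines   = LOAD 'log.txt' USING TextLoader() AS (line:chararray);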
• STORE
• Writes output to an HDFS file in a specified directory
grunt> STORE processed INTO 'processed_txt';
• Fails if directory exists
• Writes output files, part-[m|r]-xxxxx, to the directory
• PigStorage can be used to specify a field delimiter (see the sketch after DUMP below)
• DUMP
• Write output to screen
grunt> DUMP processed;
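• As noted under STORE above, PigStorage can supply the field delimiter (a minimal sketch; the directory name is illustrative):
grunt> STORE processed INTO 'processed_csv' USING PigStorage(',');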
Relational Operators
• FOREACH
• Applies expressions to every record in a bag
• FILTER
• Filters by expression
• GROUP
• Collect records with the same key
• ORDER BY
• Sorting
• DISTINCT
• Removes duplicates
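• Quick sketches of FILTER, GROUP, and DISTINCT, assuming a hypothetical relation people with fields (name, age, city); FOREACH and ORDER BY are covered on the next slides:
adults  = FILTER people BY age >= 18;
by_city = GROUP people BY city;          -- one tuple per city, holding a bag of people
cities  = FOREACH people GENERATE city;
uniq    = DISTINCT cities;               -- duplicate city values removed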
FOREACH…GENERATE
• Use the FOREACH…GENERATE operator to work with rows of data, call functions, etc.
• Basic syntax:
alias2 = FOREACH alias1 GENERATE expression;
• Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = FOREACH alias1 GENERATE col1, col2;
DUMP alias2;
(1,2) (4,2) (8,3) (4,3) (7,2) (8,4)
ORDER…BY
• Use the ORDER…BY operator to sort a relation based on one or more fields
• Basic syntax:
alias = ORDER alias BY field_alias [ASC|DESC];
• Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = ORDER alias1 BY col3 DESC;
DUMP alias2;
(7,2,5) (8,3,4) (1,2,3) (4,3,3) (8,4,3) (4,2,1)
Other Relational Operators
• FLATTEN
• Used to un-nest tuples as well as bags
• INNER JOIN
• Used to perform an inner join of two or more relations based on common field values
• OUTER JOIN
• Used to perform left, right or full outer joins
• SPLIT
• Used to partition the contents of a relation into two or more relations
• SAMPLE
• Used to select a random data sample with the stated sample size
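• Example (a minimal sketch, assuming a hypothetical relation logs with a status field):
SPLIT logs INTO ok IF status == 200, errors IF status != 200;
sampled = SAMPLE logs 0.1;               -- roughly 10% of the records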
INNER JOIN
• Use the JOIN operator to perform an inner, equi-join of two or more relations
based on common field values
• The JOIN operator always performs an inner join
• Inner joins ignore null keys
• Filter null keys before the join
• JOIN and COGROUP operators perform similar functions
• JOIN creates a flat set of output records
• COGROUP creates a nested set of output records
INNER JOIN Example
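• A minimal sketch (relation and file names are hypothetical):
customers   = LOAD 'customers.txt' AS (cust_id:int, name:chararray);
orders      = LOAD 'orders.txt' AS (order_id:int, cust_id:int, total:double);
cust_orders = JOIN customers BY cust_id, orders BY cust_id;
DUMP cust_orders;    -- flat output records combining fields from both relations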
OUTER JOIN
• Use the OUTER JOIN operator to perform left, right, or full outer joins
• Pig Latin syntax closely adheres to the SQL standard
• The keyword OUTER is optional
• The keywords LEFT, RIGHT, and FULL imply left outer, right outer, and full outer joins, respectively
• Outer joins will only work provided the relations which need to produce nulls (in the
case of non-matching keys) have schemas
• Outer joins will only work for two-way joins
• To perform a multi-way outer join, use multiple two-way outer join statements
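• Example (a minimal sketch, reusing the hypothetical customers and orders relations from the inner-join example):
-- customers with no matching orders appear with nulls for the order fields
all_customers = JOIN customers BY cust_id LEFT OUTER, orders BY cust_id;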
User-Defined Functions
REGISTER /src/myfunc.jar;                      -- make the jar containing the UDF available to Pig
A = LOAD 'students';
B = FOREACH A GENERATE myfunc.MyEvalFunc($0);  -- apply the UDF to the first field of each record
DEFINE
• DEFINE can be used to work with UDFs and also streaming commands
• Useful when dealing with complex input/output formats
/* read and write comma-delimited data; X is a relation loaded earlier */
DEFINE Y `stream.pl` INPUT(stdin USING PigStreaming(','))
                     OUTPUT(stdout USING PigStreaming(','));
A = STREAM X THROUGH Y;
• http://pig.apache.org
Apache Hive
• Hive is schema-on-read
• Schema is only enforced when the data is read (at query time)
• Allows greater flexibility: same data can be read using multiple schemas
• Contrast with an RDBMS, which is schema-on-write
• Schema is enforced when the data is loaded
• Speeds up queries at the expense of load times
Create Table Syntax
• CREATE TABLE
• LOAD: file moved into Hive’s data warehouse directory
• DROP: both metadata and data deleted
• CREATE EXTERNAL TABLE
• LOAD: no files moved
• DROP: only metadata deleted
• Use this when sharing with other Hadoop applications, or when you want to use multiple
schemas on the same data
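• A minimal sketch of both forms (table names, columns, and the HDFS path are hypothetical):
-- managed table: LOAD moves files into the warehouse; DROP deletes data and metadata
CREATE TABLE web_logs (user_id STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- external table: files stay in place; DROP removes only the metadata
CREATE EXTERNAL TABLE web_logs_ext (user_id STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/web_logs';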
Partitioning
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND
page_views.date <= '2008-03-31'
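• A sketch of how page_views might be declared so the query above reads only the March 2008 partitions instead of scanning the whole table (columns are assumed; date is backticked because it is a keyword in newer Hive versions):
CREATE TABLE page_views (user_id STRING, url STRING)
PARTITIONED BY (`date` STRING);

-- each partition is loaded into (and stored as) its own directory
LOAD DATA INPATH '/data/2008-03-01' INTO TABLE page_views PARTITION (`date` = '2008-03-01');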
Relational Operators
• GROUP BY
• Group data by column values
• The SELECT list can only include columns that appear in the GROUP BY clause, plus aggregate functions
• ORDER BY / SORT BY
• ORDER BY performs total ordering
• Slow: all data flows through a single reducer
• SORT BY performs partial ordering
• Sorts output from each reducer
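• Brief sketches, assuming the stocks table used later in this section:
-- non-aggregated SELECT columns must appear in the GROUP BY clause
SELECT symbol, avg(price_close) FROM stocks GROUP BY symbol;

-- ORDER BY: total order through a single reducer; SORT BY: order within each reducer's output
SELECT * FROM stocks ORDER BY price_close DESC;
SELECT * FROM stocks SORT BY price_close DESC;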
Advanced Hive Operations
• JOIN
• If the joins all use the same key column of the shared table, Hive runs them as a single MapReduce job
• This results in 1 MapReduce job, because both ON clauses use b.key:
SELECT * FROM a JOIN b ON a.key = b.key JOIN c ON b.key = c.key
• If multiple tables are joined, put the biggest table last: the reducer streams the last table and buffers the others
• Use left semi-joins to take the place of IN/EXISTS
SELECT a.key, a.val FROM a LEFT SEMI JOIN b ON a.key = b.key;
Advanced Hive Operations
• JOIN
• Do not specify join conditions in the WHERE clause
• Hive does not know how to optimise such queries
• Will compute a full Cartesian product before filtering it
• Join Example
SELECT
a.ymd, a.price_close, b.price_close
FROM stocks a
JOIN stocks b ON a.ymd = b.ymd
WHERE a.symbol = 'AAPL' AND
b.symbol = 'IBM' AND
a.ymd > '2010-01-01';
Hive Stinger
• Initiative to make Hive substantially faster: the Tez execution engine, the ORC file format, and vectorized query execution
• http://hive.apache.org