Untitled 1
Untitled 1
Untitled 1
Table of Contents
Get the extended Hive Guide as PDF!
About Apache Hive
Hive Example Query Catalog
SELECT Statement Syntax
Computing with Columns
WHERE Clauses
GROUP BY Clauses
HAVING Clauses
Example Queries
INSERT Statement Syntax
Hive Syntax Checker
See Other Documents
WHERE Clauses
A WHERE clause is used to filter the result set by using predicate operators and logical operators.
Functions can also be used to compute the condition.
GROUP BY Clauses
A GROUP BY clause is frequently used with aggregate functions, to group the result set by
columns and apply aggregate functions over each group. Functions can also be used to compute the
grouping key.
HAVING Clauses
A HAVING clause lets you filter the groups produced by GROUP BY, by applying predicate
operators to each groups.
Example Queries
Here are some basic examples. The underlying table consists of three fields: ip, url, and time.
# Number of Records
SELECT COUNT(1) FROM www_access;
TD_TIME_RANGE UDF is simple and efficient to use. Please refer to the Performance Tuning
article for more information.
The `time` column is a special column that is always present and stores the UNIX timestamp of the
log.
INSERT Statement Syntax
Here’s the syntax of Hive’s INSERT statement.
-- append new rows to tablename1
INSERT INTO TABLE tablename1 select_statement1 FROM from_statement;
INSERT INTO will append to the table, keeping the existing data intact.
If record doesn’t include time column, time column is imported
TD_SCHEDULED_TIME().
If record includes time column, time column should be Unixtime.
All INSERT INTO statements require the FROM clause.
TreasureData HiveQL does not support Hive Multi-Table Inserts. More about INSERT
OVERWRITE:
On tables NOT receiving streaming updates, INSERT OVERWRITE will delete any existing data
in the table and write the new rows.
INSERT OVERWRTITE is not recommendeed for use on tables receiving streaming updates.
INSERT OVERWRITE will not delete recently received streaming update rows or updates that
arrive during the execution of INSERT OVERWRITE. If you need to perform INSERT
OVERWRITE on a table tbat normally receives streaming updates, stop the streaming update
before performing INSERT OVERWRTITE.
Table of Contents
Predicate Operators
Arithmetic Operators
Logical Operators
Operators for Complex Types
Predicate Operators
Operator Types Description
All
A=B primitive TRUE if expression A is equal to expression B otherwise FALSE
types
All Returns same result with EQUAL(=) operator for non-null operands, but
A <=> B primitive returns TRUE if both are NULL, FALSE if one of the them is NULL (as
types of version 0.9.0)
A == B None! Fails because of invalid syntax. SQL uses =, not ==
All
NULL if A or B is NULL, TRUE if expression A is NOT equal to
A <> B primitive
expression B otherwise FALSE
types
All
A != B primitive a synonym for the <> operator
types
All
NULL if A or B is NULL, TRUE if expression A is less than expression
A<B primitive
B otherwise FALSE
types
All
NULL if A or B is NULL, TRUE if expression A is less than or equal to
A <= B primitive
expression B otherwise FALSE
types
All
NULL if A or B is NULL, TRUE if expression A is greater than
A>B primitive
expression B otherwise FALSE
types
All
NULL if A or B is NULL, TRUE if expression A is greater than or equal
A >= B primitive
to expression B otherwise FALSE
types
A [NOT] All NULL if A, B or C is NULL, TRUE if A is greater than or equal to B
BETWEEN primitive AND A less than or equal to C otherwise FALSE. This can be inverted by
B AND C types using the NOT keyword. (as of version 0.9.0)
A IS NULL all types TRUE if expression A evaluates to NULL otherwise FALSE
A IS NOT
All types FALSE if expression A evaluates to NULL otherwise TRUE
NULL
NULL if A or B is NULL, TRUE if string A matches the SQL simple
regular expression B, otherwise FALSE. The comparison is done
character by character. The _ character in B matches any character in
A [NOT]
strings A(similar to . in posix regular expressions) while the % character in B
LIKE B
matches an arbitrary number of characters in A(similar to .* in posix
regular expressions) e.g. 'foobar' like 'foo' evaluates to FALSE where as
'foobar' like 'foo_ _ _' evaluates to TRUE and so does 'foobar' like 'foo%'
NULL if A or B is NULL, TRUE if any (possibly empty) substring of A
A [NOT] matches the Java regular expression B, otherwise FALSE. E.g. 'foobar'
strings
RLIKE B RLIKE 'foo' evaluates to FALSE whereas 'foobar' RLIKE '^f.*r$'
evaluates to TRUE.
A REGEXP
strings Same as RLIKE
B
Arithmetic Operators
Operator Types Description
Gives the result of adding A and B. The type of the result is the same as the
common parent(in the type hierarchy) of the types of the operands. e.g. since
A+B Numbers
every integer is a float, therefore float is a containing type of integer so the +
operator on a float and an int will result in a float.
Gives the result of subtracting B from A. The type of the result is the same as
A-B Numbers
the common parent(in the type hierarchy) of the types of the operands.
Gives the result of multiplying A and B. The type of the result is the same as
the common parent(in the type hierarchy) of the types of the operands. Note
A*B Numbers
that if the multiplication causing overflow, you will have to cast one of the
operators to a type higher in the type hierarchy.
A/B Numbers Gives the result of dividing B from A. The result is a double type.
Gives the reminder resulting from dividing A by B. The type of the result is
A%B Numbers the same as the common parent(in the type hierarchy) of the types of the
operands.
Gives the result of bitwise AND of A and B. The type of the result is the same
A&B Numbers
as the common parent(in the type hierarchy) of the types of the operands.
Gives the result of bitwise OR of A and B. The type of the result is the same
A|B Numbers
as the common parent(in the type hierarchy) of the types of the operands.
Gives the result of bitwise XOR of A and B. The type of the result is the same
A^B Numbers
as the common parent(in the type hierarchy) of the types of the operands.
Gives the result of bitwise NOT of A. The type of the result is the same as the
~A Numbers
type of A.
Logical Operators
Operator Types Description
TRUE if both A and B are TRUE, otherwise FALSE. NULL if A or
A AND B boolean
B is NULL
A && B boolean Same as A AND B
TRUE if either A or B or both are TRUE; FALSE OR NULL is
A OR B boolean
NULL; otherwise FALSE
A || B boolean Same as A OR B
NOT A boolean TRUE if A is FALSE or NULL if A is NULL. Otherwise FALSE.
!A boolean Same as NOT A
A IN (val1, val2, ...) boolean TRUE if A is equal to any of the values
A NOT IN (val1,
boolean TRUE if A is not equal to any of the values
val2, ...)
Table of Contents
Mathematical Functions
Collection Functions
Type Conversion Functions
Date Functions
Conditional Functions
String Functions
Misc Functions
Mathematical Functions
Return
Name(Signature) Description
Type
double round(double a) Returns the rounded BIGINT value of the double
double round(double a, int d) Returns the double rounded to d decimal places
Returns the maximum BIGINT value that is equal or less than the
bigint floor(double a)
double
ceil(double a), Returns the minimum BIGINT value that is equal or greater than the
bigint
ceiling(double a) double
Returns a random number (that changes from row to row) that is
double rand(), rand(int seed) distributed uniformly from 0 to 1. Specifiying the seed will make sure
the generated random number sequence is deterministic.
double exp(double a) Returns ea where e is the base of the natural logarithm
double ln(double a) Returns the natural logarithm of the argument
double log10(double a) Returns the base-10 logarithm of the argument
double log2(double a) Returns the base-2 logarithm of the argument
log(double base,
double Return the base "base" logarithm of the argument
double a)
pow(double a, double
double p), power(double a, Return ap
double p)
double sqrt(double a) Returns the square root of a
Returns the number in binary format (see
string bin(bigint a) http://dev.mysql.com/doc/refman/5.0/en/string-
functions.html#function_bin)
If the argument is an int, hex returns the number as a string in hex
format. Otherwise if the number is a string, it converts each character
hex(bigint a)
string into its hex representation and returns the resulting string. (see
hex(string a)
http://dev.mysql.com/doc/refman/5.0/en/string-
functions.html#function_hex)
Inverse of hex. Interprets each pair of characters as a hexidecimal
string unhex(string a)
number and converts to the character represented by the number.
conv(bigint num, int
from_base, int
Converts a number from a given base to another (see
to_base),
string http://dev.mysql.com/doc/refman/5.0/en/mathematical-
conv(STRING num,
functions.html#function_conv)
int from_base, int
to_base)
double abs(double a) Returns the absolute value
int double pmod(int a, int b) Returns the positive value of a mod b
pmod(double a,
double b)
double sin(double a) Returns the sine of a (a is in radians)
double asin(double a) Returns the arc sin of x if -1<=a<=1 or null otherwise
double cos(double a) Returns the cosine of a (a is in radians)
double acos(double a) Returns the arc cosine of x if -1<=a<=1 or null otherwise
double tan(double a) Returns the tangent of a (a is in radians)
double atan(double a) Returns the arctangent of a
double degrees(double a) Converts value of a from radians to degrees
double radians(double a) Converts value of a from degrees to radians
positive(int a),
int double Returns a
positive(double a)
negative(int a),
int double Returns -a
negative(double a)
float sign(double a) Returns the sign of a as '1.0' or '-1.0'
double e() Returns the value of e
double pi() Returns the value of pi
Collection Functions
Return
Name(Signature) Description
Type
int size(Map<K.V>) Returns the number of elements in the map type
int size(Array<T>) Returns the number of elements in the array type
Returns an unordered array containing the keys of the input
array<K> map_keys(Map<K.V>)
map
Returns an unordered array containing the values of the
array<V> map_values(Map<K.V>)
input map
array_contains(Array<T>,
boolean Returns TRUE if the array contains value
value)
Sorts the input array in ascending order according to the
array<t> sort_array(Array<T>)
natural ordering of the array elements and returns it
Date Functions
Return
Name(Signature) Description
Type
from_unixtime(bigint unixtime[, Converts the number of seconds from unix epoch (1970-01-
string
string format]) 01 00:00:00 UTC) to a string representing the timestamp of
that moment in the current system time zone in the format of
"1970-01-01 00:00:00"
bigint unix_timestamp() Gets current time stamp using the default time zone.
Converts time string in format yyyy-MM-dd HH:mm:ss
bigint unix_timestamp(string date) to Unix time stamp, return 0 if fail: unix_timestamp('2009-
03-20 11:30:01') = 1237573801
Convert time string with given pattern (see here) to Unix
unix_timestamp(string date,
bigint time stamp, return 0 if fail: unix_timestamp('2009-03-20',
string pattern)
'yyyy-MM-dd') = 1237532400
Returns the date part of a timestamp string: to_date("1970-
string to_date(string timestamp)
01-01 00:00:00") = "1970-01-01"
Returns the year part of a date or a timestamp string:
int year(string date) year("1970-01-01 00:00:00") = 1970, year("1970-01-01") =
1970
Returns the month part of a date or a timestamp string:
int month(string date) month("1970-11-01 00:00:00") = 11, month("1970-11-01")
= 11
day(string date) Return the day part of a date or a timestamp string:
int
dayofmonth(date) day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1
Returns the hour of the timestamp: hour('2009-07-30
int hour(string date)
12:58:59') = 12, hour('12:58:59') = 12
int minute(string date) Returns the minute of the timestamp
int second(string date) Returns the second of the timestamp
Return the week number of a timestamp string:
int weekofyear(string date) weekofyear("1970-11-01 00:00:00") = 44,
weekofyear("1970-11-01") = 44
datediff(string enddate, string Return the number of days from startdate to enddate:
int
startdate) datediff('2009-03-01', '2009-02-27') = 2
date_add(string startdate, int Add a number of days to startdate: date_add('2008-12-31',
string
days) 1) = '2009-01-01'
date_sub(string startdate, int Subtract a number of days to startdate: date_sub('2008-12-
string
days) 31', 1) = '2008-12-30'
from_utc_timestamp(timestamp, Assumes given timestamp ist UTC and converts to given
timestamp
string timezone) timezone
to_utc_timestamp(timestamp, Assumes given timestamp is in given timezone and converts
timestamp
string timezone) to UTC
Conditional Functions
Return
Name(Signature) Description
Type
if(boolean testCondition, T valueTrue, Return valueTrue when testCondition is true,
T
T valueFalseOrNull) returns valueFalseOrNull otherwise
Return the first v that is not NULL, or NULL if
T COALESCE(T v1, T v2, ...)
all v's are NULL
CASE a WHEN b THEN c [WHEN d When a = b, returns c; when a = d, return e;
T
THEN e]* [ELSE f] END else return f
T CASE WHEN a THEN b [WHEN c When a = true, returns b; when c = true, return
THEN d]* [ELSE e] END d; else return e
String Functions
Return Type Name(Signature) Description
Returns the numeric value of the first character
int ascii(string str)
of str
Returns the string or bytes resulting from
concatenating the strings or bytes passed in as
concat(string|binary A,
string parameters in order. e.g. concat('foo', 'bar')
string|binary B...)
results in 'foobar'. Note that this function can
take any number of input strings.
Returns the top-k contextual N-grams from a set
context_ngrams(array<array
array<struct<string,d of tokenized sentences, given a string of
<string>>, array<string>,
ouble>> "context". See StatisticsAndDataMining for
int K, int pf)
more information.
concat_ws(string SEP, Like concat() above, but with custom separator
string
string A, string B...) SEP.
concat_ws(string SEP, Like concat_ws() above, but taking an array of
string
array<string>) strings.
Returns the first occurance of str in strList
where strList is a comma-delimited string.
find_in_set(string str, string
int Returns null if either argument is null. Returns 0
strList)
if the first argument contains any commas. e.g.
find_in_set('ab', 'abc,b,ab,c,def') returns 3
Formats the number X to a format like
format_number(number x, '#,###,###.##', rounded to D decimal places,
string
int d) and returns the result as a string. If D is 0, the
result has no decimal point or fractional part.
Extract json object from a json string based on
json path specified, and return json string of the
extracted json object. It will return null if the
get_json_object(string input json string is invalid. The json path can
string
json_string, string path) only have the characters [0-9a-z_], i.e., no
upper-case or special characters. Also, the keys
*cannot start with numbers.* This is due to
restrictions on Hive column names.
in_file(string str, string Returns true if the string str appears as an entire
boolean
filename) line in filename.
Returns the position of the first occurence of
int instr(string str, string substr)
substr in str
int length(string A) Returns the length of the string
locate(string substr, string Returns the position of the first occurrence of
int
str[, int pos]) substr in str after position pos
Returns the string resulting from converting all
lower(string A) lcase(string
string characters of B to lower case e.g.
A)
lower('fOoBaR') results in 'foobar'
lpad(string str, int len, string Returns str, left-padded with pad to a length of
string
pad) len
Returns the string resulting from trimming
string ltrim(string A) spaces from the beginning(left hand side) of A
e.g. ltrim(' foobar ') results in 'foobar '
Returns the top-k N-grams from a set of
array<struct<string,d ngrams(array<array<string> tokenized sentences, such as those returned by
ouble>> >, int N, int K, int pf) the sentences() UDAF. See
StatisticsAndDataMining for more information.
Returns the specified part from the URL. Valid
values for partToExtract include HOST, PATH,
QUERY, REF, PROTOCOL, AUTHORITY,
FILE, and USERINFO. e.g.
parse_url(string urlString, parse_url('http://facebook.com/path1/p.php?k1=
string string partToExtract [, string v1&k2=v2#Ref1', 'HOST') returns
keyToExtract]) 'facebook.com'. Also a value of a particular key
in QUERY can be extracted by providing the
key as the third argument, e.g.
parse_url('http://facebook.com/path1/p.php?k1=
v1&k2=v2#Ref1', 'QUERY', 'k1') returns 'v1'.
printf(String format, Obj... Returns the input formatted according do printf-
string
args) style format strings
Returns the string extracted using the pattern.
e.g. regexp_extract('foothebar', 'foo(.*?)(bar)',
2) returns 'bar.' Note that some care is necessary
in using predefined character classes: using '\s'
regexp_extract(string as the second argument will match the letter s; '
string subject, string pattern, int s' is necessary to match whitespace, etc. The
index) 'index' parameter is the Java regex Matcher
group() method index. See
docs/api/java/util/regex/Matcher.html for more
information on the 'index' or Java regex group()
method.
Returns the string resulting from replacing all
substrings in INITIAL_STRING that match the
java regular expression syntax defined in
regexp_replace(string
PATTERN with instances of REPLACEMENT,
INITIAL_STRING, string
string e.g. regexp_replace("foobar", "oo|ar", "")
PATTERN, string
returns 'fb.' Note that some care is necessary in
REPLACEMENT)
using predefined character classes: using '\s' as
the second argument will match the letter s; '
s' is necessary to match whitespace, etc.
string repeat(string str, int n) Repeat str n times
string reverse(string A) Returns the reversed string
rpad(string str, int len, string Returns str, right-padded with pad to a length of
string
pad) len
Returns the string resulting from trimming
string rtrim(string A) spaces from the end(right hand side) of A e.g.
rtrim(' foobar ') results in ' foobar'
Tokenizes a string of natural language text into
sentences(string str, string
array<array<string>> words and sentences, where each sentence is
lang, string locale)
broken at the appropriate sentence boundary and
returned as an array of words. The 'lang' and
'locale' are optional arguments. e.g.
sentences('Hello there! How are you?') returns (
("Hello", "there"), ("How", "are", "you") )
string space(int n) Return a string of n spaces
array split(string str, string pat) Split str around pat (pat is a regular expression)
Splits text into key-value pairs using two
delimiters. Delimiter1 separates text into K-V
str_to_map(text[,
map<string,string> pairs, and Delimiter2 splits each K-V pair.
delimiter1, delimiter2])
Default delimiters are ',' for delimiter1 and '='
for delimiter2.
Returns the substring or slice of the byte array
of A starting from start position till the end of
substr(string|binary A, int
string A e.g. substr('foobar', 4) results in 'bar'
string start) substring(string|binary
(see
A, int start)
http://dev.mysql.com/doc/refman/5.0/en/string-
functions.html#function_substr)
Returns the substring or slice of the byte array
substr(string|binary A, int
of A starting from start position with length len
start, int len)
string e.g. substr('foobar', 4, 1) results in 'b' (see
substring(string|binary A,
http://dev.mysql.com/doc/refman/5.0/en/string-
int start, int len)
functions.html#function_substr)
Translates the input string by replacing the
characters present in the from string with the
translate(string input, string corresponding characters in the to string. This
string
from, string to) is similar to the translate function in
PostgreSQL. If any of the parameters to this
UDF are NULL, the result is NULL as well
Returns the string resulting from trimming
string trim(string A) spaces from both ends of A e.g. trim(' foobar ')
results in 'foobar'
Returns the string resulting from converting all
upper(string A) ucase(string
string characters of A to upper case e.g.
A)
upper('fOoBaR') results in 'FOOBAR'
Misc Functions
Return Type Name(Signature) Description
int hash(a1[, a2...]) Returns a hash value of the arguments
You might notice occasional, inconsistent results between Hive and Presto geolocation functions.
Both Hive and Presto UDFs use a geolocation database supplied by Maxmind. However, due to
release schedules, the release level of the Maxmind database used by Hive and Presto might be
different.
An example of different results is as follows:
Table of Contents
Hive native UDFs
Hivemall generic UDFs
Treasure Data specific UDFs
TD_TIME_RANGE
TD_SCHEDULED_TIME
TD_TIME_ADD
TD_TIME_FORMAT
TD_TIME_PARSE
TD_DATE_TRUNC
TD_LAST
TD_FIRST
TD_X_RANK
TD_SESSIONIZE (deprecated)
TD_SESSIONIZE_WINDOW
TD_PARSE_USER_AGENT
TD_PARSE_AGENT
TD_MD5
TD_URL_DECODE
TD_IP_TO_COUNTRY_CODE
TD_IP_TO_COUNTRY_NAME
TD_IP_TO_SUBDIVISION_NAMES
TD_IP_TO_MOST_SPECIFIC_SUBDIVISION_NAME
TD_IP_TO_LEAST_SPECIFIC_SUBDIVISION_NAME
TD_IP_TO_CITY_NAME
TD_IP_TO_LATITUDE
TD_IP_TO_LONGITUDE
TD_IP_TO_METRO_CODE (US Only)
TD_IP_TO_TIME_ZONE
TD_IP_TO_POSTAL_CODE
TD_IP_TO_CONNECTION_TYPE
TD_IP_TO_DOMAIN
TD_LAT_LONG_TO_COUNTRY
TD_SUBSTRING_INENCODING
TD_DIVIDE
TD_SUMIF
TD_AVGIF
TD_NUMERIC_RANGE
TD_ARRAY_INDEX
Notice that the time of day ("00:00:00") can be omitted, as shown above. Alternately, the time of
day can be specified up to seconds. In general, the time string should be formatted as either
"YYYY-MM-DD" or "YYYY-MM-DD hh:mm:ss", e.g., "2013-01-01" or "1999-01-01 07:00:00".
TD_SCHEDULED_TIME
Signature
long TD_SCHEDULED_TIME()
Description
This UDF returns the exact time when the job was scheduled by the scheduled query feature. The
returned value may differ from NOW() because the actual query start time may be delayed.
If the query is not a scheduled query, the UDF returns the time when the job was issued. You may
use this UDF with TD_TIME_ADD for incremental aggregation.
TD_TIME_ADD
Signature
"year" and "month" durations are NOT supported, because these have complicated implications. A
month can be 28, 29, 30, or 31 days, and a year could be 365 or 366 days. To implement these, this
function will become a lot heavier and impact performance.
Example
This example selects records with timestamps “2013-01-01 00:00:00 UTC” or later but older than
“2013-01-02 00:00:00 UTC”.
A typical use of this function within scheduled queries is in conjunction with TD_TIME_RANGE
and TD_SCHEDULED_TIME to narrow the time range to determined set of days, hours, minutes, or
seconds. For example:
TD_TIME_FORMAT
Signature
Description
This UDF converts a UNIX timestamp to a string with the specified format (see the Supported time
formats in TD_TIME_FORMAT UDF page for available formats). For example, “yyyy-MM-dd
HH:mm:ss z” converts 1325376000 to “2012-01-01 00:00:00 UTC”. If no timezone is specified, the
UDF uses UTC.
Example
This example formats a UNIX timestamp into a date formatted string
TD_TIME_PARSE
Signature
Description
This UDF converts a time string into a UNIX timestamp.
default_timezone is used to interpret time. If time itself has timezone (e.g. “2012-01-01 +0700”),
then default_timezone is ignored. If default_timezone is not specified while time also does not
specify a timezone, then the UDF uses ‘UTC’ as the timezone for time. A list of supported time
zones can be found here.
If the format of the time string is invalid, the UDF returns NULL.
TD_DATE_TRUNC
Signature
Description
This UDF performs a timestamp truncation at the level specified by the ‘unit’ parameter. The
supported units are:
‘minute’
‘hour’
‘day’
‘week’
‘month’
‘quarter’
‘year’
An optional ‘timezone’ parameter can be specified to indicate an alternative reference timezone the
‘unit’ is referenced to. While the input ‘time’ is in global Unix time format, in different timezones
‘day’ that is the start of a day corresponds to different times.
This function mimics the functionality of native Presto’s date_trunc function, the main
difference being that Presto’s date_trunc does not allow specification of the timezone, since it
uses the sessions' reference timezone.
Example
with time equal 1416787667 corresponding to ‘2014-11-24 00:07:47 UTC’ will return
1416787200 corresponding to ‘2014-11-24 00:00:00 UTC’.
With the same value and timezone ‘PST’ instead:
the function returns 1416758400 since the start of day for the ‘PST’ timezone is 8 hours behind
the start of day for ‘UTC’.
TD_LAST
Signature
Description
This aggregate UDF finds the row with the largest value in the ‘comparison column’ (cmp_col1)
and returns that row’s ‘retrieve column’ (ret_col) value. Additional comparison columns
(cmp_col2, cmp_col3, etc.) are used as tiebreakers when the preceding comparison column has
more than one row with the largest value.
Example
This example selects the URL of the most recent access log (the row with the largest time) for each
user.
TD_FIRST
Signature
TD_FIRST(ret_col, cmp_col1, cmp_col2, ...)
Description
This aggregate UDF finds the row with the smallest value in the ‘comparison column’ (cmp_col1)
and returns that row’s ‘retrieve column’ (ret_col) value. Additional comparison columns
(cmp_col2, cmp_col3, etc.) are used as tiebreakers when the preceding comparison column has
more than one row with the smallest value.
Example
This example selects the referer URL of the earliest access log (the row with the smallest time) for
each page_id.
TD_X_RANK
Signature
long TD_X_RANK(keys)
Description
Returns the rank of each row within the partition of the result set. The rank of a row is one plus the
number of ranks that come before the row.
Example
You need to CLUSTER BY or ORDER BY within a sub query to use this feature. CLUSTER
BY is more scalable than ORDER BY because it doesn’t require total order across multiple nodes,
thus allowing us to process the query in parallel.
TD_SESSIONIZE (deprecated)
TD_SESSIONIZE has been deprecated due to performance issues and sometimes-inconsistent
results. TD_SESSIONIZE will be removed in a future release of TD Presto.
TD_SESSIONIZE_WINDOW is the replacement function. TD_SESSIONIZE_WINDOW is a
Presto window function that is equivalent in function with equivalent functionality, more consistent
results and more reliable performance. Discontinue use of TD_SESSIONIZE as soon as possible,
and replace it with use of TD_SESSIONIZE_WINDOW.
Signature
Description
Sessionization of a table of event data groups a series of event rows associated with users into
individual sessions for analysis. As long as the series of events is associated with the same user
identifier (typically IP address, email, cookie or similar identifier) and events are separated by no
more than a chosen timeout interval, they will be grouped into a session.
This UDF takes three arguments:
1. The time field specified in unix epoch
2. A timeout window in seconds (when this amount of time elapses, it indicates the start of a
new session)
3. The field name to sessionize by
It returns a UUID for the session of the request.
Example
You need to ORDER BY the sessionize_by column and the unix_timestamp field within a sub
query to use this UDF. The following query below sessionizes based on user_id and the
timestamp. You may want to use user_id or cookie_id instead of ip_address for non-anonymous
logs.
TD_SESSIONIZE_WINDOW
Signature
Description
Sessionization of a table of event data groups a series of event rows associated with users into
individual sessions for analysis. As long as the series of events is associated with the same user
identifier (typically IP address, email, cookie or similar identifier) and events are separated by no
more than a chosen timeout interval, they will be grouped into a session.
TD_SESSIONIZE_WINDOW is a UDF window function used for sessionization. It replaces
TD_SESSIONIZE, providing consistent results and better performance.
TD_SESSIONIZE_WINDOW takes two arguments:
1. The time field specified in unix epoch
2. A timeout interval in seconds (when this amount of time elapses between events, it indicates
the start of a new session)
Other usage notes:
SELECT
TD_SESSIONIZE_WINDOW(time, 3600)
OVER (PARTITION BY ip_address ORDER BY time)
as session_id,
time,
ip_address,
path
FROM
web_logs
TD_PARSE_USER_AGENT
Signature
Description
TD_PARSE_USER_AGENT is now deprecated, will not be updated from now on and will be
disabled in future. We recommend to use TD_PARSE_AGENT.
This UDF returns the result of parsing a user agent string. The user agent is parsed by the basis of
rules. Accepts the following options as a string, as user options:
os and ua return json. With _family, _major and _minor return a string. The device option also
returns a string.
Example
The example shows the result of parsing user agent from access log.
SELECT TD_PARSE_USER_AGENT(agent) AS agent FROM www_access
> {"user_agent": {"family": "IE", "major": "9", "minor": "0", "patch": ""},
"os": {"family": "Windows 7", "major": "", "minor": "", "patch": "",
"patch_minor": ""}, "device": {"family": "Other"}}
SELECT TD_PARSE_USER_AGENT(agent, 'os') AS agent_os FROM www_access
> {"family": "Windows 7", "major": "", "minor": "", "patch": "", "patch_minor":
""}
SELECT TD_PARSE_USER_AGENT(agent, 'os_family') AS agent_os_family FROM
www_access
> Windows 7
TD_PARSE_AGENT
Signature
This UDF returns a Map value of result to parse a user agent string. The UDF is implemented by
Woothee.
Example
The example shows the result of parsing user agent from access log.
TD_MD5
Signature
string TD_MD5(col)
Description
This UDF calculates the MD5 hash digest from a given string.
Example
SELECT TD_MD5(column) FROM tbl
TD_URL_DECODE
Signature
string TD_URL_DECODE(col)
Description
This UDF applies URL decoding for a given string.
Example
TD_IP_TO_COUNTRY_CODE
Signature
Description
This UDF converts IP address to country code. This UDF supports IPv4 and IPv6.
Example
SELECT
TD_IP_TO_COUNTRY_CODE('106.142.252.8') AS ipv4,
TD_IP_TO_COUNTRY_CODE('2001:268:d005:f5be:c43e:af35:81f:8f60') AS ipv6
TD_IP_TO_COUNTRY_NAME
Signature
Description
This UDF converts IP address to country code. This UDF supports IPv4 and IPv6.
Example
SELECT
TD_IP_TO_COUNTRY_NAME('106.142.252.8') AS ipv4,
TD_IP_TO_COUNTRY_NAME('2001:268:d005:f5be:c43e:af35:81f:8f60') AS ipv6
The function returns Japan in this example.
TD_IP_TO_SUBDIVISION_NAMES
Signature
Description
This UDF converts IP address to list of subdivisions (e.g. US states, JP prefectures, etc). This UDF
supports IPv4 and IPv6.
Example
SELECT
TD_IP_TO_SUBDIVISION_NAMES('106.142.252.8') AS ipv4,
TD_IP_TO_SUBDIVISION_NAMES('2001:268:d005:f5be:c43e:af35:81f:8f60') AS ipv6
TD_IP_TO_MOST_SPECIFIC_SUBDIVISION_NAME
Signature
Description
This UDF converts IP address to the most specific subdivisions (e.g. US states, JP prefectures, etc).
This UDF supports IPv4 and IPv6.
Example
SELECT
TD_IP_TO_MOST_SPECIFIC_SUBDIVISION_NAME('106.142.252.8') AS ipv4,
TD_IP_TO_MOST_SPECIFIC_SUBDIVISION_NAME('2001:268:d005:f5be:c43e:af35:81f:8f60')
AS ipv6
TD_IP_TO_LEAST_SPECIFIC_SUBDIVISION_NAME
Signature
Description
This UDF converts IP address to the least specific subdivisions (e.g. US states, JP prefectures, etc).
This UDF supports IPv4 and IPv6.
Example
SELECT
TD_IP_TO_LEAST_SPECIFIC_SUBDIVISION_NAME('106.142.252.8') AS ipv4,
TD_IP_TO_LEAST_SPECIFIC_SUBDIVISION_NAME('2001:268:d005:f5be:c43e:af35:81f:8f60'
) AS ipv6
TD_IP_TO_CITY_NAME
Signature
Description
This UDF converts IP address to city name. This UDF supports IPv4 and IPv6.
Example
SELECT
TD_IP_TO_CITY_NAME('106.142.252.8') AS ipv4,
TD_IP_TO_CITY_NAME('2001:268:d005:f5be:c43e:af35:81f:8f60') AS ipv6
TD_IP_TO_LATITUDE
Signature
Description
This UDF converts IP address to latitude. This UDF supports IPv4 and IPv6.
Example
SELECT
TD_IP_TO_LATITUDE('106.142.252.8') AS ipv4,
TD_IP_TO_LATITUDE('2001:268:d005:f5be:c43e:af35:81f:8f60') AS ipv6
TD_IP_TO_LONGITUDE
Signature
Description
This UDF converts IP address to longitude. This UDF supports IPv4 and IPv6.
Example
SELECT
TD_IP_TO_LONGITUDE('106.142.252.8') AS ipv4,
TD_IP_TO_LONGITUDE('2001:268:d005:f5be:c43e:af35:81f:8f60') AS ipv6
Description
This UDF converts IP address to metro code (US Only). This UDF supports IPv4 and IPv6.
Example
SELECT
TD_IP_TO_METRO_CODE('106.142.252.8') AS ipv4,
TD_IP_TO_METRO_CODE('2001:268:d005:f5be:c43e:af35:81f:8f60') AS ipv6
TD_IP_TO_TIME_ZONE
Signature
Description
This UDF converts IP address to time zone. This UDF supports IPv4 and IPv6.
Example
SELECT
TD_IP_TO_TIME_ZONE('106.142.252.8') AS ipv4,
TD_IP_TO_TIME_ZONE('2001:268:d005:f5be:c43e:af35:81f:8f60') AS ipv6
TD_IP_TO_POSTAL_CODE
Signature
Description
This UDF converts IP address to postal code. This UDF supports IPv4 and IPv6.
Example
SELECT
TD_IP_TO_POSTAL_CODE('106.142.252.8') AS ipv4,
TD_IP_TO_POSTAL_CODE('2001:268:d005:f5be:c43e:af35:81f:8f60') AS ipv6
TD_IP_TO_CONNECTION_TYPE
Signature
Description
This UDF converts IP address to connection type. This UDF supports IPv4 and IPv6.
Example
SELECT
TD_IP_TO_CONNECTION_TYPE('106.142.252.8') AS ipv4,
TD_IP_TO_CONNECTION_TYPE('2001:268:d005:f5be:c43e:af35:81f:8f60') AS ipv6
TD_IP_TO_DOMAIN
Signature
Description
This UDF converts IP address to domain. This UDF supports IPv4 and IPv6.
Example
SELECT
TD_IP_TO_DOMAIN('106.142.252.8') AS ipv4,
TD_IP_TO_DOMAIN('2001:268:d005:f5be:c43e:af35:81f:8f60') AS ipv6
TD_LAT_LONG_TO_COUNTRY
Signature
Description
This UDF converts geo location information (latitude/longitude) to country name.
Example
SELECT
TD_LAT_LONG_TO_COUNTRY('FULL_NAME', 37, -122)
TD_LAT_LONG_TO_COUNTRY('THREE_LETTER_ABBREVIATION', 37, -122)
TD_LAT_LONG_TO_COUNTRY('POSTAL_ABBREVIATION', 37, -122)
TD_LAT_LONG_TO_COUNTRY('SORTABLE_NAME', 37, -122)
TD_SUBSTRING_INENCODING
Signature
Description
This UDF returns the substring or slice of the byte array of str from the 0-index position at most
max_len_inbytes with charset encoding. charset can be selected from java.nio.charset.Charset.
Example
TD_DIVIDE
Signature
Description
This UDF returns the division of numeric types safely even if denominator is zero. If denominator
is zero, it returns 0 and in other case it returns the numerator divided by denominator.
Example
TD_SUMIF
Signature
Description
This UDF returns the sum of the column which satisfies the predidate.
Example
TD_AVGIF
Signature
TD_NUMERIC_RANGE
Signature
Description
This UDF generates a range of integers from a to b incremented by c or the elements of a map into
multiple rows and columns.
Example
SELECT TD_NUMERIC_RANGE(0,10,2)
=> 0
2
4
6
8
TD_ARRAY_INDEX
Signature
Description
This UDF returns an array’s value at the i’th index.
Example
When two values (of any type) are divided in Hive the result of the division is always automatically
upgraded to ‘double’.
To obtain an integer result from a division please cast the result of the division to the desired type.
For example:
SELECT
CAST((column1 / column2) AS bigint) as division
FROM
table
The Presto engine behaves differently with divisions and does not automatically upgrades the result
to double.
Divide By 0
When the dividend of a division is 0, the result will be a ‘NaN’.
This can commonly happen when the dividing the values of 2 columns and the value of the column
used as divided is a 0. For example:
SELECT
column1,
column2,
column3 / column4 as division
FROM
table
SELECT
TRIM(' a') as value_keyword
FROM
table
=> ' a' (Expected result is 'a')
SELECT
regexp_replace(TRIM(' a '), '^ +| +$', '') as value_keyword
FROM
table