Hive REGEXP_EXTRACT returning null results

Question

I am trying to extract R7080075 and X1234567 from the sample data below. The format is always a single upper case character followed by 7 digit number. This ID is also always preceded by an underscore. Since it's user generated data, sometimes it's the first underscore in the record and sometimes all preceding spaces have been replaced with underscores.

I'm querying HDP Hive with this in the select statement:

REGEXP_EXTRACT(column_name,'[(?:(^_A-Z))](\d{7})',0)

I've tried addressing positions 0-2 and none return an error or any data. I tested the code on regextester.com and it highlighted the data I want to extract. When I then run it in Zepplin, it returns NULLs.

My regex experience is limited so I have reviewed the articles here on regexp_extract (+hive) and talked with a colleague. Thanks in advance for your help.

Sample data:

Sept Wk 5 Sunny Sailing_R7080075_12345
Holiday_Wk2_Smiles_X1234567_ABC

Vasan · Accepted Answer · 2018-06-04 17:02:09Z

1

The Hive manual says this:

Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.

Also, your expression includes unnecessary characters in the character class.

Try this:

REGEXP_EXTRACT(column_name,'_[A-Z](\\d{7})',0)

Since you want only the part without underscore, use this:

REGEXP_EXTRACT(column_name,'_([A-Z]\\d{7})',1)

It matches the entire pattern, but extracts only the second group instead of the entire match.

Or alternatively:

REGEXP_EXTRACT(column_name,'(?<=_)[A-Z]\\d{7}', 0)

This uses a regexp technique called "positive lookbehind". It translates to : "find me an upper case alphabet followed by 7 digits, but only if they are preceded by an _". It uses the _ for matching but doesn't consider it part of the extracted match.

edited Jun 4, 2018 at 17:02

answered May 31, 2018 at 21:47

Vasan

4,9584 gold badges21 silver badges40 bronze badges

Thanks, Vasan. This gets me most of the way there. The reason for some of the extra syntax is I wanted to identify it by using the leading underscore, but I did not want to return the underscore. Also, I cannot count on it being the only [A-Z](\\d{7}) in the string. In the end, '[(^_A-Z)](\\d{7})' gave me what I needed. Thanks for your help with the double backslash reference!
– Leigh
Commented May 31, 2018 at 23:03
@LeighF. But that's still incorrect syntax though - whatever you include within [] will get interpreted literally and lose its special regex meaning. Also, they're treated like they were separated by an "OR". So your expression translates to "match a single ^ OR a _ OR a ( OR a ) OR an uppercase alphabet, then match 7 digits". Maybe it still works with that, but please be mindful of this.
– Vasan
Commented May 31, 2018 at 23:21
Thanks for this additional detail. While it worked, I see that it may not always. I want to match on the _[A-Z]####### but only return [A-Z]####### I found that [A-Z](\\d{7}) does this, but only because the [A-Z] is matching, not because it also matched based on the condition of a preceding _. Your example also returns the underscore. Is there a way to include the underscore in the match but not the extract?
– Leigh
Commented Jun 4, 2018 at 16:49
@LeighF. I edited my answer with what you need. Please try this and let me know.
– Vasan
Commented Jun 4, 2018 at 17:08
Thanks @Vasan. Option two works great! REGEXP_EXTRACT(column_name,'_([A-Z]\\d{7})',1)
– Leigh
Commented Jun 12, 2018 at 19:36

Add a comment |

Collectives™ on Stack Overflow

Hive REGEXP_EXTRACT returning null results

1 Answer 1

Your Answer

Not the answer you're looking for? Browse other questions tagged
regex
hive
extract
or ask your own question.

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged regexhiveextract or ask your own question.

Related

Not the answer you're looking for? Browse other questions tagged
regex
hive
extract
or ask your own question.