0

I am trying to extract R7080075 and X1234567 from the sample data below. The format is always a single upper case character followed by 7 digit number. This ID is also always preceded by an underscore. Since it's user generated data, sometimes it's the first underscore in the record and sometimes all preceding spaces have been replaced with underscores.

I'm querying HDP Hive with this in the select statement:

REGEXP_EXTRACT(column_name,'[(?:(^_A-Z))](\d{7})',0) 

I've tried addressing positions 0-2 and none return an error or any data. I tested the code on regextester.com and it highlighted the data I want to extract. When I then run it in Zepplin, it returns NULLs.

My regex experience is limited so I have reviewed the articles here on regexp_extract (+hive) and talked with a colleague. Thanks in advance for your help.

Sample data:

Sept Wk 5 Sunny Sailing_R7080075_12345
Holiday_Wk2_Smiles_X1234567_ABC

1 Answer 1

1

The Hive manual says this:

Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.

Also, your expression includes unnecessary characters in the character class.

Try this:

REGEXP_EXTRACT(column_name,'_[A-Z](\\d{7})',0) 

Since you want only the part without underscore, use this:

REGEXP_EXTRACT(column_name,'_([A-Z]\\d{7})',1)

It matches the entire pattern, but extracts only the second group instead of the entire match.

Or alternatively:

REGEXP_EXTRACT(column_name,'(?<=_)[A-Z]\\d{7}', 0)

This uses a regexp technique called "positive lookbehind". It translates to : "find me an upper case alphabet followed by 7 digits, but only if they are preceded by an _". It uses the _ for matching but doesn't consider it part of the extracted match.

5
  • Thanks, Vasan. This gets me most of the way there. The reason for some of the extra syntax is I wanted to identify it by using the leading underscore, but I did not want to return the underscore. Also, I cannot count on it being the only [A-Z](\\d{7}) in the string. In the end, '[(^_A-Z)](\\d{7})' gave me what I needed. Thanks for your help with the double backslash reference!
    – Leigh
    Commented May 31, 2018 at 23:03
  • @LeighF. But that's still incorrect syntax though - whatever you include within [] will get interpreted literally and lose its special regex meaning. Also, they're treated like they were separated by an "OR". So your expression translates to "match a single ^ OR a _ OR a ( OR a ) OR an uppercase alphabet, then match 7 digits". Maybe it still works with that, but please be mindful of this.
    – Vasan
    Commented May 31, 2018 at 23:21
  • Thanks for this additional detail. While it worked, I see that it may not always. I want to match on the _[A-Z]####### but only return [A-Z]####### I found that [A-Z](\\d{7}) does this, but only because the [A-Z] is matching, not because it also matched based on the condition of a preceding _. Your example also returns the underscore. Is there a way to include the underscore in the match but not the extract?
    – Leigh
    Commented Jun 4, 2018 at 16:49
  • @LeighF. I edited my answer with what you need. Please try this and let me know.
    – Vasan
    Commented Jun 4, 2018 at 17:08
  • Thanks @Vasan. Option two works great! REGEXP_EXTRACT(column_name,'_([A-Z]\\d{7})',1)
    – Leigh
    Commented Jun 12, 2018 at 19:36

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.