2

I am trying to extract a portion of the below string using regexp_extract but am not having any success:

CUST_NEW_ACCOUNTS_LINES_2019-03-03.dat.gz

I want to just get the date portion. On the regex101.com website this seemed to work, but hive is giving me an error message.

regexp_extract(meta_source_filename,'^(?:[^_]+_){4}([^_]+)') file_date

Can someone help me understand what is incorrect here? I am not at all familiar with regexp_extract syntax so have been using another function as a starting point.
Thank you!

0

2 Answers 2

2
with your_data as (
select 'CUST_NEW_ACCOUNTS_LINES_2019-03-03.dat.gz' str
)

select regexp_extract(str,'_(\\d{4}(-\\d{2}){2})\\.',1)
from your_data;

Result:

OK
2019-03-03
Time taken: 0.062 seconds, Fetched: 1 row(s)

Expression '_(\\d{4}(-\\d{2}){2})\\.' means:

underscore _ four digits \\d{4} repeat (hyphen and two digits) two times (-\\d{2}){2} dot\\.

Capture group number one (date only): (\\d{4}(-\\d{2}){2}) . In Hive you need to use \\ for shielding.

1

You have captured the substring you need into a capturing group. You should use the number, ID of the group as the third argument:

regexp_extract(meta_source_filename,'^(?:[^_]+_){4}([^_]+)', 1) file_date
                                                             ^

See the regexp_extract(string subject, string pattern, int index) docs:

The 'index' parameter is the Java regex Matcher group() method index. See docs/api/java/util/regex/Matcher.html for more information on the 'index' or Java regex group() method.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.