12

I am having some problems with regexp_extract:

I am querying a tab-delimited file; the column I'm checking contains strings that look like this:

abc.def.ghi

Now, if I do:

select distinct regexp_extract(name, '[^.]+', 0) from dummy;

The MR job runs, it works, and I get "abc" from index 0.

But now, if I want to get "def" from index 1:

select distinct regexp_extract(name, '[^.]+', 1) from dummy;

Hive fails with:

2011-12-13 23:17:08,132 Stage-1 map = 0%,  reduce = 0%
2011-12-13 23:17:28,265 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201112071152_0071 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

Log file says:

java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row

Am I doing something fundamentally wrong here?

Thanks, Mario

2 Answers

33

From the docs https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF it appears that regexp_extract() returns, for each record, the part of the string that matches the pattern.

It seems to stop at the first match rather than matching globally. The index therefore refers to a capture group within that one match, not to the nth match. Your pattern '[^.]+' has no capture groups, which is presumably why index 1 fails.

0 = the entire match
1 = capture group 1
2 = capture group 2, etc ...

Paraphrased from the manual:

regexp_extract('foothebar', 'foo(.*?)(bar)', 2)
                                  ^    ^   
               groups             1    2

This returns 'bar'.
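
To make the index semantics concrete, here is a minimal Java sketch. It assumes regexp_extract behaves like java.util.regex (find the first match, then return the requested group). The class RegexpExtractSketch and its helpers extract and groupMissing are hypothetical names made up for this illustration; they are not Hive code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexpExtractSketch {
    // Hypothetical stand-in for regexp_extract: use the first match only,
    // then return the requested group. Returning null when nothing matches
    // is a choice made for this sketch.
    static String extract(String subject, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(index) : null;
    }

    // True when the pattern has no capture group with the given index,
    // in which case Matcher.group(int) throws IndexOutOfBoundsException.
    static boolean groupMissing(String subject, String regex, int index) {
        try {
            extract(subject, regex, index);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Example from the manual: group 2 is "bar".
        System.out.println(extract("foothebar", "foo(.*?)(bar)", 2));
        // Index 0 is the whole first match: "abc".
        System.out.println(extract("abc.def.ghi", "[^.]+", 0));
        // With parentheses, group 1 is still the first match: "abc".
        System.out.println(extract("abc.def.ghi", "([^.]+)", 1));
        // No group 1 in a pattern without parentheses: true.
        System.out.println(groupMissing("abc.def.ghi", "[^.]+", 1));
    }
}
```

If Hive does delegate to the Java library in this way, the missing group would explain the runtime error in the question, and also why '([^.]+)' with index 1 still gives "abc".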

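So, in your case, to get the text after the first dot, something like this might work:
regexp_extract(name, '\.([^.]+)', 1)
or this
regexp_extract(name, '[.]([^.]+)', 1)
(The Hive manual notes that backslashes in the pattern string may need doubling, e.g. '\\.', so the [.] form is probably the safer of the two.)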
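
A small Java sketch of the two patterns, again assuming Java-style regex behind regexp_extract. DotSegmentSketch and extract are hypothetical names for illustration. The third call shows what happens if the backslash is lost before the regex engine sees the pattern: the bare dot then matches any character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotSegmentSketch {
    // Hypothetical helper: first match only, return the requested group,
    // or null when there is no match (a choice made for this sketch).
    static String extract(String subject, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        String name = "abc.def.ghi";
        // Escaped dot (the Java source needs "\\." to produce the regex \.).
        System.out.println(extract(name, "\\.([^.]+)", 1)); // def
        // Character class, no backslash needed.
        System.out.println(extract(name, "[.]([^.]+)", 1)); // def
        // Unescaped dot: '.' matches 'a', so group 1 is "bc".
        System.out.println(extract(name, ".([^.]+)", 1));   // bc
    }
}
```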

edit

I got interested in this again. Just an FYI: there could be a shortcut/workaround for you.

It looks like you want a particular segment of a dot-separated string, which is almost like a split.
Most likely the regex engine overwrites a capture group each time the quantified group around it repeats.
You can take advantage of that with something like this:

Returns the first segment (abc) of abc.def.ghi:
regexp_extract(name, '^(?:([^.]+)\.?){1}', 1)

Returns the second segment (def) of abc.def.ghi:
regexp_extract(name, '^(?:([^.]+)\.?){2}', 1)

Returns the third segment (ghi) of abc.def.ghi:
regexp_extract(name, '^(?:([^.]+)\.?){3}', 1)

The index doesn't change (because the index still refers to capture group 1); only the regex repetition changes.
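
The repetition trick can be tried out directly in Java. This is a sketch under the assumption that regexp_extract uses java.util.regex; RepeatedGroupSketch and nthSegment are hypothetical names, and the repetition count is simply spliced into the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupSketch {
    // Build ^(?:([^.]+)\.?){n} and return capture group 1, which holds the
    // text captured by the last (nth) repetition. Returns null on no match
    // (a choice made for this sketch).
    static String nthSegment(String name, int n) {
        Pattern p = Pattern.compile("^(?:([^.]+)\\.?){" + n + "}");
        Matcher m = p.matcher(name);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String name = "abc.def.ghi";
        System.out.println(nthSegment(name, 1)); // abc
        System.out.println(nthSegment(name, 2)); // def
        System.out.println(nthSegment(name, 3)); // ghi
    }
}
```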

Some notes:

  • This regex ^(?:([^.]+)\.?){n} has problems though.
    It requires that there be something between the dots in each segment, or the regex won't match.

  • It could be this ^(?:([^.]*)\.?){n} but this will match even if there are fewer than n-1 dots,
    including the empty string. This is probably not desirable.
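
Both caveats can be reproduced with a short Java sketch (assuming Java-style regex; SegmentPitfallSketch, plusVariant and starVariant are hypothetical names). It also shows a related effect of the + variant that I noticed while testing: because the dot is optional, backtracking lets a dot-free string be split, so "abc" with {2} yields "c".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SegmentPitfallSketch {
    // Return group 1 of the first match, or null when there is none
    // (null is a choice made for this sketch).
    static String group1(String regex, String name) {
        Matcher m = Pattern.compile(regex).matcher(name);
        return m.find() ? m.group(1) : null;
    }

    // ^(?:([^.]+)\.?){n} : every segment must be non-empty.
    static String plusVariant(String name, int n) {
        return group1("^(?:([^.]+)\\.?){" + n + "}", name);
    }

    // ^(?:([^.]*)\.?){n} : segments may be empty.
    static String starVariant(String name, int n) {
        return group1("^(?:([^.]*)\\.?){" + n + "}", name);
    }

    public static void main(String[] args) {
        // Empty middle segment: the + variant does not match at all.
        System.out.println(plusVariant("a..c", 2));  // null
        // No dots at all, yet the + variant still "finds" a second segment.
        System.out.println(plusVariant("abc", 2));   // c
        // The * variant matches with too few dots and returns an empty string.
        System.out.println(starVariant("abc", 3).isEmpty()); // true
        // It even matches the empty string.
        System.out.println(starVariant("", 1).isEmpty());    // true
    }
}
```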

There is a way to do it where it doesn't require text between the dots, but still requires at least n-1 dots.
This uses a lookahead assertion and capture buffer 2 as a flag.

^(?:(?!\2)([^.]*)(?:\.|$())){2}

Everything else is the same. So, if Hive uses Java-style regex, then this should work:
regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){2}', 1)
Change {2} to whichever segment is needed (this one returns segment 2).

It still returns capture buffer 1 after the Nth iteration.

Here it is broken down

^                # Beginning of string
 (?:             # Grouping
    (?!\2)            # Assertion: Capture buffer 2 is UNDEFINED
    ( [^.]*)          # Capture buffer 1, optional non-dot chars, many times
    (?:               # Grouping
        \.                # Dot character
      |                 # or,
        $ ()              # End of string, set capture buffer 2 DEFINED (prevents further iterations once the end of string is reached)
    )                 # End grouping
 ){3}            # End grouping, repeat group exactly 3 (or N) times (overwrites capture buffer 1 each time)

If the engine doesn't support lookahead assertions, then this won't work!
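
Here is a Java sketch of the lookahead version, assuming Java-style regex behind regexp_extract. LookaheadSegmentSketch and nthSegment are hypothetical names. It relies on java.util.regex accepting the forward reference \2 and treating a backreference to an unset group as a failed match, so the negative lookahead passes until the end-of-string branch has set group 2.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadSegmentSketch {
    // Build ^(?:(?!\2)([^.]*)(?:\.|$())){n} and return capture group 1.
    // Returns null when the string has fewer than n segments
    // (null is a choice made for this sketch).
    static String nthSegment(String name, int n) {
        Pattern p = Pattern.compile("^(?:(?!\\2)([^.]*)(?:\\.|$())){" + n + "}");
        Matcher m = p.matcher(name);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(nthSegment("abc.def.ghi", 2)); // def
        System.out.println(nthSegment("abc.def.ghi", 3)); // ghi
        // Only three segments, so asking for a fourth does not match.
        System.out.println(nthSegment("abc.def.ghi", 4)); // null
        // Empty segments are allowed: the second segment of "a..c" is "".
        System.out.println(nthSegment("a..c", 2).isEmpty()); // true
        // At least n-1 dots are required.
        System.out.println(nthSegment("abc", 2)); // null
    }
}
```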

2

I think you have to use capture groups, no?

select distinct regexp_extract(name, '([^.]+)', 1) from dummy;

(untested)

I think it behaves like the Java library, so this should work. Let me know though.

2
  • ha, weird, the MR job runs with that, but it still returns the contents of index 0!
    – Mario
    Commented Dec 13, 2011 at 22:33
  • and I just tried for kicks with index 2, and then it fails with 2011-12-13 23:33:41,377 Stage-1 map = 0%, reduce = 0% 2011-12-13 23:34:01,465 Stage-1 map = 100%, reduce = 100% Ended Job = job_201112071152_0074 with errors FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask again.... weird weird weird....
    – Mario
    Commented Dec 13, 2011 at 22:39
