Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization bug preventing a regex containing (*SKIP)(*FAIL) to work correctly #21534

Closed
florian-pe opened this issue Sep 29, 2023 · 4 comments · Fixed by #21536
Closed

Optimization bug preventing a regex containing (*SKIP)(*FAIL) to work correctly #21534

florian-pe opened this issue Sep 29, 2023 · 4 comments · Fixed by #21536

Comments

@florian-pe
Copy link

The following regex doesn't match while it should.

"dir/file.mp3" =~ m{ ([^/]+) (?: / (*SKIP)(*FAIL) | \z ) }x

Disabling the optimization (I don't know which one causes this) with (?{}) makes the match succeed as it should.

say $& if "dir/file.mp3" =~ m{ ([^/]+) (?: (?{}) / (*SKIP)(*FAIL) | \z ) }x

Additionally, Using a possessive quantifier makes the bug go away.

$ perl -E 'say $& if "dir/file.mp3" =~ m{ ([^/]++) (?: / (*SKIP)(*FAIL) | \z ) }x'
file.mp3

When running the faulty regex with use re "debug"; the message Contradicts stclass... [regexec_flags] appears.

$ perl -E 'use re "debug"; say $& if "dir/file.mp3" =~ m{ ([^/]+) (?: / (*SKIP)(*FAIL) | \z ) }x'
Compiling REx " ([^/]+) (?: / (*SKIP)(*FAIL) | \z ) "
Final program:
   1: OPEN1 (3)
   3:   PLUS (6)
   4:     NANYOFM[/] (0)
   6: CLOSE1 (8)
   8: BRANCH (buf:1/1) (16)
  10:   EXACT </> (12)
  12:   SKIP (14)
  14:   OPFAIL (20)
  16: BRANCH (buf:1/1) (FAIL)
  18:   EOS (20)
  19: TAIL (20)
  20: END (0)
stclass NANYOFM[/] plus minlen 1
Matching REx " ([^/]+) (?: / (*SKIP)(*FAIL) | \z ) " against "dir/file.mp3"
Matching stclass NANYOFM[/] against "dir/file.mp3" (12 bytes)
   0 <> <dir/file.m>         |   0| 1:OPEN1(3)
   0 <> <dir/file.m>         |   0| 3:PLUS(6)
                             |   0| NANYOFM[/] can match 3 times out of 2147483647...
   3 <dir> </file.mp3>       |   1|  6:CLOSE1(8)
   3 <dir> </file.mp3>       |   1|  8:BRANCH (buf:1/1)(16)
   3 <dir> </file.mp3>       |   2|   10:EXACT </>(12)
   4 <dir/> <file.mp3>       |   2|   12:SKIP(14)
   4 <dir/> <file.mp3>       |   3|    14:OPFAIL(20)
                             |   3|    failed...
                             |   2|   failed...
Contradicts stclass... [regexec_flags]
Match failed
Freeing REx: " ([^/]+) (?: / (*SKIP)(*FAIL) | \z ) "
@florian-pe florian-pe changed the title Optimization bug preventing a regex contiaing (*SKIP)(*FAIL) to work correctly Optimization bug preventing a regex containing (*SKIP)(*FAIL) to work correctly Sep 29, 2023
@demerphq
Copy link
Collaborator

I will try to find some time for this, but I don't know when. I agree that it is a bug. It is surprising to me that the stclass is getting in the way here, also that the possessive quantifier changes the behavior.

@demerphq
Copy link
Collaborator

demerphq commented Sep 30, 2023

@khwilliamson you created the related patches: 21d1ed5 and 56ff060.

I think there is a thinko in these patches, as they update previous_occurence_end after executing reg_try(), which seems to partly or mostly defeat the point of checking if s == previous_occurrenc_end in the first place. Unfortunately this logic is inside of a set of macros (as the fundamental logic is repeated over and over with different functions), so it is a bit hard to see.

In the following logic the macro parameter 'f' is a function call which does a scan of the string to find how many characters match a given class:

        REXEC_FBC_UTF8_FIND_NEXT_SCAN(
                        (char *) find_span_end_mask((U8 *) s, (U8 *) strend,
                                                    (U8) ARG1u(c), FLAGS(c)));

The macros are as follows:

/* We keep track of where the next character should start after an occurrence
 * of the one we're looking for.  Knowing that, we can see right away if the
 * next occurrence is adjacent to the previous.  When 'doevery' is FALSE, we
 * don't accept the 2nd and succeeding adjacent occurrences */
#define FBC_CHECK_AND_TRY                                           \
        if (   (   doevery                                          \
                || s != previous_occurrence_end)                    \
            && (   reginfo->intuit                                  \
                || (s <= reginfo->strend && regtry(reginfo, &s))))  \
        {                                                           \
            goto got_it;                                            \
        }

#define REXEC_FBC_UTF8_FIND_NEXT_SCAN(f)                    \
    while (s < strend) {                                    \
        s = (char *) (f);                                   \
        if (s >= strend) {                                  \
            break;                                          \
        }                                                   \
                                                            \
        FBC_CHECK_AND_TRY                                   \
        s += UTF8SKIP(s);                                   \
        previous_occurrence_end = s;                        \
    }

I believe that the update to previous_occurrence_end should be directly after f() has returned, not after reg_try(), which may advance s considerably itself.

EDIT: actually not /directly/ after f() has returned, but before regtry() is executed.

demerphq added a commit that referenced this issue Sep 30, 2023
This fixes #21534

When the previous_occurence_end logic was added it did not account for
patterns where "s" is updated by regtry(). See 21d1ed5 which introduced
this logic. That patch seemed to assume that regtry() would not alter s,
but it does. This fix seems to work around the problem for the verb's,
but it feels like a bodge instead of a proper fix, which I think would
involve reworking the previous_occurrence_end logic to take into account
that regtry() can modify s. But for now this should work.
@khwilliamson
Copy link
Contributor

khwilliamson commented Sep 30, 2023 via email

@demerphq
Copy link
Collaborator

My original theory on how to fix this issue was wrong, or at least more complex than I expected. So I went with a different solution. @khwilliamson it would be nice if you reviewed the previous_occurrence_end logic in light of PREGf_SKIP and the fact that regtry() may modify the string pointer as a side-effect, the existing logic doesn't entirely make sense to me.

Consider a similar case to the one in this bug report:

./perl -Ilib -Mre=debug -le'"AAAAAAABBBBBBBAAAC"=~/([Aa]+(B+(*SKIP)(*FAIL)|[CD]))/ and print $1'

When we start we start matching at pos 0, and match ""AAAAAAABBBBBBB" before we hit the /(*SKIP)(*FAIL)/, and which causes regtry() to set 's' to point at the last 'B', which is then incremented to point at the first 'A' following the 'B's. We then loop around in the FBC macro, and compute f() again, which returns an 's' pointing at the same spot, and since do_every is false, due to PREGf_SKIP being true, we do not try the pattern again when we should. With the fix in this patch we notice that reginfo->cutpoint is set, thus 's' was updated by regtry() and so we DO try again. But it feels to me like we should not need to inspect reginfo->cutpoint like that, and that we should have noticed in some other way that we were no longer in the original sequence of 'A's, the check on previous_occurrence_end seems just a little simple for what we are trying to do here.

demerphq added a commit that referenced this issue Sep 30, 2023
This fixes #21534

We have an optimisation that applies to patterns starting with a PLUS
style pattern, so that things like /A+B/ do not try to match at every A
when B does not match after the the first attempt. Eg,
"AAAAAABBBBAAAAC"=~/[Aa]+[CD]/ should not try to match at every 'A' in
the first sequence of 'A's. This optimisation is signalled by the
presence of an PREGf_SKIP flag.

The PLUS optimisation did not play nicely with patterns which were doing
a similar task using the (*SKIP) operator. Essentially we need to
disable the former when the latter has been used, or it can get
confused. Consider the case of

    "AAAAAABBBBBBBAAAAAC"=~/[Aa]+(?:[Bb]+(*SKIP)(*FAIL)|[CD])/.

The idea is to signal to the regex engine that once "AAAAABBBBBBBBB" is
matched it can continue from after the final 'B'. Because of the way the
PLUS optimizatin is implemented, this advancing of the pointer to the
last B confused things, and it failed to match the final AAAAC sequence.
This patch is somewhat of a bodge, it shouldnt be necessary to inspect
inside of reginfo() after a call to regtry() and it is a bit
counter-intuitive to do so. This patch wraps the check in a macro so
that at least it is somewhat self documenting what it is doing.
demerphq added a commit that referenced this issue Oct 11, 2023
This fixes #21534

We have an optimisation that applies to patterns starting with a PLUS
style pattern, so that things like /A+B/ do not try to match at every A
when B does not match after the the first attempt. Eg,
"AAAAAABBBBAAAAC"=~/[Aa]+[CD]/ should not try to match at every 'A' in
the first sequence of 'A's. This optimisation is signalled by the
presence of an PREGf_SKIP flag.

The PLUS optimisation did not play nicely with patterns which were doing
a similar task using the (*SKIP) operator. Essentially we need to
disable the former when the latter has been used, or it can get
confused. Consider the case of

    "AAAAAABBBBBBBAAAAAC"=~/[Aa]+(?:[Bb]+(*SKIP)(*FAIL)|[CD])/.

The idea is to signal to the regex engine that once "AAAAABBBBBBBBB" is
matched it can continue from after the final 'B'. Because of the way the
PLUS optimizatin is implemented, this advancing of the pointer to the
last B confused things, and it failed to match the final AAAAC sequence.
This patch is somewhat of a bodge, it shouldnt be necessary to inspect
inside of reginfo() after a call to regtry() and it is a bit
counter-intuitive to do so. This patch wraps the check in a macro so
that at least it is somewhat self documenting what it is doing.
demerphq added a commit that referenced this issue Oct 11, 2023
This fixes #21534

We have an optimisation that applies to patterns starting with a PLUS
style pattern, so that things like /A+B/ do not try to match at every A
when B does not match after the the first attempt. Eg,
"AAAAAABBBBAAAAC"=~/[Aa]+[CD]/ should not try to match at every 'A' in
the first sequence of 'A's. This optimisation is signalled by the
presence of an PREGf_SKIP flag.

The PLUS optimisation did not play nicely with patterns which were doing
a similar task using the (*SKIP) operator. Essentially we need to
disable the former when the latter has been used, or it can get
confused. Consider the case of

    "AAAAAABBBBBBBAAAAAC"=~/[Aa]+(?:[Bb]+(*SKIP)(*FAIL)|[CD])/.

The idea is to signal to the regex engine that once "AAAAABBBBBBBBB" is
matched it can continue from after the final 'B'. Because of the way the
PLUS optimizatin is implemented, this advancing of the pointer to the
last B confused things, and it failed to match the final AAAAC sequence.
This patch is somewhat of a bodge, it shouldnt be necessary to inspect
inside of reginfo() after a call to regtry() and it is a bit
counter-intuitive to do so. This patch wraps the check in a macro so
that at least it is somewhat self documenting what it is doing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants