Optimization bug preventing a regex containing (SKIP)(FAIL) to work correctly #21534

florian-pe · 2023-09-29T03:25:29Z

The following regex doesn't match while it should.

"dir/file.mp3" =~ m{ ([^/]+) (?: / (*SKIP)(*FAIL) | \z ) }x

Disabling the optimization (I don't know which one causes this) with (?{}) makes the match succeed as it should.

say $& if "dir/file.mp3" =~ m{ ([^/]+) (?: (?{}) / (*SKIP)(*FAIL) | \z ) }x

Additionally, Using a possessive quantifier makes the bug go away.

$ perl -E 'say $& if "dir/file.mp3" =~ m{ ([^/]++) (?: / (*SKIP)(*FAIL) | \z ) }x'
file.mp3

When running the faulty regex with use re "debug"; the message Contradicts stclass... [regexec_flags] appears.

$ perl -E 'use re "debug"; say $& if "dir/file.mp3" =~ m{ ([^/]+) (?: / (*SKIP)(*FAIL) | \z ) }x'
Compiling REx " ([^/]+) (?: / (*SKIP)(*FAIL) | \z ) "
Final program:
   1: OPEN1 (3)
   3:   PLUS (6)
   4:     NANYOFM[/] (0)
   6: CLOSE1 (8)
   8: BRANCH (buf:1/1) (16)
  10:   EXACT </> (12)
  12:   SKIP (14)
  14:   OPFAIL (20)
  16: BRANCH (buf:1/1) (FAIL)
  18:   EOS (20)
  19: TAIL (20)
  20: END (0)
stclass NANYOFM[/] plus minlen 1
Matching REx " ([^/]+) (?: / (*SKIP)(*FAIL) | \z ) " against "dir/file.mp3"
Matching stclass NANYOFM[/] against "dir/file.mp3" (12 bytes)
   0 <> <dir/file.m>         |   0| 1:OPEN1(3)
   0 <> <dir/file.m>         |   0| 3:PLUS(6)
                             |   0| NANYOFM[/] can match 3 times out of 2147483647...
   3 <dir> </file.mp3>       |   1|  6:CLOSE1(8)
   3 <dir> </file.mp3>       |   1|  8:BRANCH (buf:1/1)(16)
   3 <dir> </file.mp3>       |   2|   10:EXACT </>(12)
   4 <dir/> <file.mp3>       |   2|   12:SKIP(14)
   4 <dir/> <file.mp3>       |   3|    14:OPFAIL(20)
                             |   3|    failed...
                             |   2|   failed...
Contradicts stclass... [regexec_flags]
Match failed
Freeing REx: " ([^/]+) (?: / (*SKIP)(*FAIL) | \z ) "

The text was updated successfully, but these errors were encountered:

demerphq · 2023-09-29T07:54:18Z

I will try to find some time for this, but I don't know when. I agree that it is a bug. It is surprising to me that the stclass is getting in the way here, also that the possessive quantifier changes the behavior.

demerphq · 2023-09-30T14:23:11Z

@khwilliamson you created the related patches: 21d1ed5 and 56ff060.

I think there is a thinko in these patches, as they update previous_occurence_end after executing reg_try(), which seems to partly or mostly defeat the point of checking if s == previous_occurrenc_end in the first place. Unfortunately this logic is inside of a set of macros (as the fundamental logic is repeated over and over with different functions), so it is a bit hard to see.

In the following logic the macro parameter 'f' is a function call which does a scan of the string to find how many characters match a given class:

        REXEC_FBC_UTF8_FIND_NEXT_SCAN(
                        (char *) find_span_end_mask((U8 *) s, (U8 *) strend,
                                                    (U8) ARG1u(c), FLAGS(c)));

The macros are as follows:

/* We keep track of where the next character should start after an occurrence
 * of the one we're looking for.  Knowing that, we can see right away if the
 * next occurrence is adjacent to the previous.  When 'doevery' is FALSE, we
 * don't accept the 2nd and succeeding adjacent occurrences */
#define FBC_CHECK_AND_TRY                                           \
        if (   (   doevery                                          \
                || s != previous_occurrence_end)                    \
            && (   reginfo->intuit                                  \
                || (s <= reginfo->strend && regtry(reginfo, &s))))  \
        {                                                           \
            goto got_it;                                            \
        }

#define REXEC_FBC_UTF8_FIND_NEXT_SCAN(f)                    \
    while (s < strend) {                                    \
        s = (char *) (f);                                   \
        if (s >= strend) {                                  \
            break;                                          \
        }                                                   \
                                                            \
        FBC_CHECK_AND_TRY                                   \
        s += UTF8SKIP(s);                                   \
        previous_occurrence_end = s;                        \
    }

I believe that the update to previous_occurrence_end should be directly after f() has returned, not after reg_try(), which may advance s considerably itself.

EDIT: actually not /directly/ after f() has returned, but before regtry() is executed.

This fixes #21534 When the previous_occurence_end logic was added it did not account for patterns where "s" is updated by regtry(). See 21d1ed5 which introduced this logic. That patch seemed to assume that regtry() would not alter s, but it does. This fix seems to work around the problem for the verb's, but it feels like a bodge instead of a proper fix, which I think would involve reworking the previous_occurrence_end logic to take into account that regtry() can modify s. But for now this should work.

khwilliamson · 2023-09-30T16:15:32Z

On 9/30/23 08:23, Yves Orton wrote: @khwilliamson <https://github.com/khwilliamson> you created the related patches: 21d1ed5 <21d1ed5> and 56ff060 <56ff060>. I think there is a thinko in these patches, as they update |previous_occurence_end| after executing reg_try(), which seems to partly or mostly defeat the point of checking if |s == previous_occurrenc_end| in the first place. Unfortunately this logic is inside of a set of macros (as the fundamental logic is repeated over and over with different functions), so it is a bit hard to see.

I still think it is right. But more comments and maybe renaming the variable would be helpful. 'previous_occurrence_end' is set by the macros to point 1 beyond the current occurrence end, in preparation for the next iteration when it will indeed mean the end of the previous occurrence. I also bisected the failing test. It did not pass on the initial commit that added *SKIP, nor in any official release so far. (*FAIL was added earlier than *SKIP was)

…

In the following logic the macro parameter 'f' is a function call which does a scan of the string to find how many characters match a given class: |REXEC_FBC_UTF8_FIND_NEXT_SCAN( (char *) find_span_end_mask((U8 *) s, (U8 *) strend, (U8) ARG1u(c), FLAGS(c))); | The macros are as follows: |/* We keep track of where the next character should start after an occurrence * of the one we're looking for. Knowing that, we can see right away if the * next occurrence is adjacent to the previous. When 'doevery' is FALSE, we * don't accept the 2nd and succeeding adjacent occurrences */ #define FBC_CHECK_AND_TRY \ if ( ( doevery \ || s != previous_occurrence_end) \ && ( reginfo->intuit \ || (s <= reginfo->strend && regtry(reginfo, &s)))) \ { \ goto got_it; \ } #define REXEC_FBC_UTF8_FIND_NEXT_SCAN(f) \ while (s < strend) { \ s = (char *) (f); \ if (s >= strend) { \ break; \ } \ \ FBC_CHECK_AND_TRY \ s += UTF8SKIP(s); \ previous_occurrence_end = s; \ } | I believe that the update to |previous_occurrence_end| should be directly after |f()| has returned, not after |reg_try()|, which may advance s considerably itself. — Reply to this email directly, view it on GitHub <#21534 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAA2DHYWFINNYJ34S2XDW6TX5ATNXANCNFSM6AAAAAA5L3RS3U>. You are receiving this because you were mentioned.Message ID: ***@***.***>

demerphq · 2023-09-30T16:59:39Z

My original theory on how to fix this issue was wrong, or at least more complex than I expected. So I went with a different solution. @khwilliamson it would be nice if you reviewed the previous_occurrence_end logic in light of PREGf_SKIP and the fact that regtry() may modify the string pointer as a side-effect, the existing logic doesn't entirely make sense to me.

Consider a similar case to the one in this bug report:

./perl -Ilib -Mre=debug -le'"AAAAAAABBBBBBBAAAC"=~/([Aa]+(B+(*SKIP)(*FAIL)|[CD]))/ and print $1'

When we start we start matching at pos 0, and match ""AAAAAAABBBBBBB" before we hit the /(*SKIP)(*FAIL)/, and which causes regtry() to set 's' to point at the last 'B', which is then incremented to point at the first 'A' following the 'B's. We then loop around in the FBC macro, and compute f() again, which returns an 's' pointing at the same spot, and since do_every is false, due to PREGf_SKIP being true, we do not try the pattern again when we should. With the fix in this patch we notice that reginfo->cutpoint is set, thus 's' was updated by regtry() and so we DO try again. But it feels to me like we should not need to inspect reginfo->cutpoint like that, and that we should have noticed in some other way that we were no longer in the original sequence of 'A's, the check on previous_occurrence_end seems just a little simple for what we are trying to do here.

This fixes #21534 We have an optimisation that applies to patterns starting with a PLUS style pattern, so that things like /A+B/ do not try to match at every A when B does not match after the the first attempt. Eg, "AAAAAABBBBAAAAC"=~/[Aa]+[CD]/ should not try to match at every 'A' in the first sequence of 'A's. This optimisation is signalled by the presence of an PREGf_SKIP flag. The PLUS optimisation did not play nicely with patterns which were doing a similar task using the (*SKIP) operator. Essentially we need to disable the former when the latter has been used, or it can get confused. Consider the case of "AAAAAABBBBBBBAAAAAC"=~/[Aa]+(?:[Bb]+(*SKIP)(*FAIL)|[CD])/. The idea is to signal to the regex engine that once "AAAAABBBBBBBBB" is matched it can continue from after the final 'B'. Because of the way the PLUS optimizatin is implemented, this advancing of the pointer to the last B confused things, and it failed to match the final AAAAC sequence. This patch is somewhat of a bodge, it shouldnt be necessary to inspect inside of reginfo() after a call to regtry() and it is a bit counter-intuitive to do so. This patch wraps the check in a macro so that at least it is somewhat self documenting what it is doing.

florian-pe added the Needs Triage label Sep 29, 2023

xenu added type-regex and removed Needs Triage labels Sep 29, 2023

florian-pe changed the title ~~Optimization bug preventing a regex contiaing (*SKIP)(*FAIL) to work correctly~~ Optimization bug preventing a regex containing (*SKIP)(*FAIL) to work correctly Sep 29, 2023

demerphq mentioned this issue Sep 30, 2023

regexec.c - make find_byclass() work with (*SKIP) and friends #21536

Merged

demerphq closed this as completed in #21536 Oct 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimization bug preventing a regex containing (SKIP)(FAIL) to work correctly #21534

Optimization bug preventing a regex containing (SKIP)(FAIL) to work correctly #21534

florian-pe commented Sep 29, 2023

demerphq commented Sep 29, 2023

demerphq commented Sep 30, 2023 •

edited

Loading

khwilliamson commented Sep 30, 2023 via email

demerphq commented Sep 30, 2023

Optimization bug preventing a regex containing (*SKIP)(*FAIL) to work correctly #21534

Optimization bug preventing a regex containing (*SKIP)(*FAIL) to work correctly #21534

Comments

florian-pe commented Sep 29, 2023

demerphq commented Sep 29, 2023

demerphq commented Sep 30, 2023 • edited Loading

khwilliamson commented Sep 30, 2023 via email

demerphq commented Sep 30, 2023

Optimization bug preventing a regex containing (SKIP)(FAIL) to work correctly #21534

Optimization bug preventing a regex containing (SKIP)(FAIL) to work correctly #21534

demerphq commented Sep 30, 2023 •

edited

Loading