-
Notifications
You must be signed in to change notification settings - Fork 560
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimization bug preventing a regex containing (*SKIP)(*FAIL) to work correctly #21534
Comments
I will try to find some time for this, but I don't know when. I agree that it is a bug. It is surprising to me that the stclass is getting in the way here, also that the possessive quantifier changes the behavior. |
@khwilliamson you created the related patches: 21d1ed5 and 56ff060. I think there is a thinko in these patches, as they update In the following logic the macro parameter 'f' is a function call which does a scan of the string to find how many characters match a given class:
The macros are as follows:
I believe that the update to EDIT: actually not /directly/ after |
This fixes #21534 When the previous_occurence_end logic was added it did not account for patterns where "s" is updated by regtry(). See 21d1ed5 which introduced this logic. That patch seemed to assume that regtry() would not alter s, but it does. This fix seems to work around the problem for the verb's, but it feels like a bodge instead of a proper fix, which I think would involve reworking the previous_occurrence_end logic to take into account that regtry() can modify s. But for now this should work.
On 9/30/23 08:23, Yves Orton wrote:
@khwilliamson <https://github.com/khwilliamson> you created the related
patches: 21d1ed5
<21d1ed5> and 56ff060 <56ff060>.
I think there is a thinko in these patches, as they update
|previous_occurence_end| after executing reg_try(), which seems to
partly or mostly defeat the point of checking if |s ==
previous_occurrenc_end| in the first place. Unfortunately this logic is
inside of a set of macros (as the fundamental logic is repeated over and
over with different functions), so it is a bit hard to see.
I still think it is right. But more comments and maybe renaming the
variable would be helpful. 'previous_occurrence_end' is set by the
macros to point 1 beyond the current occurrence end, in preparation for
the next iteration when it will indeed mean the end of the previous
occurrence.
I also bisected the failing test. It did not pass on the initial commit
that added *SKIP, nor in any official release so far. (*FAIL was added
earlier than *SKIP was)
…
In the following logic the macro parameter 'f' is a function call which
does a scan of the string to find how many characters match a given class:
|REXEC_FBC_UTF8_FIND_NEXT_SCAN( (char *) find_span_end_mask((U8 *) s,
(U8 *) strend, (U8) ARG1u(c), FLAGS(c))); |
The macros are as follows:
|/* We keep track of where the next character should start after an
occurrence * of the one we're looking for. Knowing that, we can see
right away if the * next occurrence is adjacent to the previous. When
'doevery' is FALSE, we * don't accept the 2nd and succeeding adjacent
occurrences */ #define FBC_CHECK_AND_TRY \ if ( ( doevery \ || s !=
previous_occurrence_end) \ && ( reginfo->intuit \ || (s <=
reginfo->strend && regtry(reginfo, &s)))) \ { \ goto got_it; \ } #define
REXEC_FBC_UTF8_FIND_NEXT_SCAN(f) \ while (s < strend) { \ s = (char *)
(f); \ if (s >= strend) { \ break; \ } \ \ FBC_CHECK_AND_TRY \ s +=
UTF8SKIP(s); \ previous_occurrence_end = s; \ } |
I believe that the update to |previous_occurrence_end| should be
directly after |f()| has returned, not after |reg_try()|, which may
advance s considerably itself.
—
Reply to this email directly, view it on GitHub
<#21534 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAA2DHYWFINNYJ34S2XDW6TX5ATNXANCNFSM6AAAAAA5L3RS3U>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
My original theory on how to fix this issue was wrong, or at least more complex than I expected. So I went with a different solution. @khwilliamson it would be nice if you reviewed the previous_occurrence_end logic in light of PREGf_SKIP and the fact that regtry() may modify the string pointer as a side-effect, the existing logic doesn't entirely make sense to me. Consider a similar case to the one in this bug report:
When we start we start matching at pos 0, and match ""AAAAAAABBBBBBB" before we hit the /(*SKIP)(*FAIL)/, and which causes regtry() to set 's' to point at the last 'B', which is then incremented to point at the first 'A' following the 'B's. We then loop around in the FBC macro, and compute f() again, which returns an 's' pointing at the same spot, and since do_every is false, due to PREGf_SKIP being true, we do not try the pattern again when we should. With the fix in this patch we notice that reginfo->cutpoint is set, thus 's' was updated by regtry() and so we DO try again. But it feels to me like we should not need to inspect reginfo->cutpoint like that, and that we should have noticed in some other way that we were no longer in the original sequence of 'A's, the check on previous_occurrence_end seems just a little simple for what we are trying to do here. |
This fixes #21534 We have an optimisation that applies to patterns starting with a PLUS style pattern, so that things like /A+B/ do not try to match at every A when B does not match after the the first attempt. Eg, "AAAAAABBBBAAAAC"=~/[Aa]+[CD]/ should not try to match at every 'A' in the first sequence of 'A's. This optimisation is signalled by the presence of an PREGf_SKIP flag. The PLUS optimisation did not play nicely with patterns which were doing a similar task using the (*SKIP) operator. Essentially we need to disable the former when the latter has been used, or it can get confused. Consider the case of "AAAAAABBBBBBBAAAAAC"=~/[Aa]+(?:[Bb]+(*SKIP)(*FAIL)|[CD])/. The idea is to signal to the regex engine that once "AAAAABBBBBBBBB" is matched it can continue from after the final 'B'. Because of the way the PLUS optimizatin is implemented, this advancing of the pointer to the last B confused things, and it failed to match the final AAAAC sequence. This patch is somewhat of a bodge, it shouldnt be necessary to inspect inside of reginfo() after a call to regtry() and it is a bit counter-intuitive to do so. This patch wraps the check in a macro so that at least it is somewhat self documenting what it is doing.
This fixes #21534 We have an optimisation that applies to patterns starting with a PLUS style pattern, so that things like /A+B/ do not try to match at every A when B does not match after the the first attempt. Eg, "AAAAAABBBBAAAAC"=~/[Aa]+[CD]/ should not try to match at every 'A' in the first sequence of 'A's. This optimisation is signalled by the presence of an PREGf_SKIP flag. The PLUS optimisation did not play nicely with patterns which were doing a similar task using the (*SKIP) operator. Essentially we need to disable the former when the latter has been used, or it can get confused. Consider the case of "AAAAAABBBBBBBAAAAAC"=~/[Aa]+(?:[Bb]+(*SKIP)(*FAIL)|[CD])/. The idea is to signal to the regex engine that once "AAAAABBBBBBBBB" is matched it can continue from after the final 'B'. Because of the way the PLUS optimizatin is implemented, this advancing of the pointer to the last B confused things, and it failed to match the final AAAAC sequence. This patch is somewhat of a bodge, it shouldnt be necessary to inspect inside of reginfo() after a call to regtry() and it is a bit counter-intuitive to do so. This patch wraps the check in a macro so that at least it is somewhat self documenting what it is doing.
This fixes #21534 We have an optimisation that applies to patterns starting with a PLUS style pattern, so that things like /A+B/ do not try to match at every A when B does not match after the the first attempt. Eg, "AAAAAABBBBAAAAC"=~/[Aa]+[CD]/ should not try to match at every 'A' in the first sequence of 'A's. This optimisation is signalled by the presence of an PREGf_SKIP flag. The PLUS optimisation did not play nicely with patterns which were doing a similar task using the (*SKIP) operator. Essentially we need to disable the former when the latter has been used, or it can get confused. Consider the case of "AAAAAABBBBBBBAAAAAC"=~/[Aa]+(?:[Bb]+(*SKIP)(*FAIL)|[CD])/. The idea is to signal to the regex engine that once "AAAAABBBBBBBBB" is matched it can continue from after the final 'B'. Because of the way the PLUS optimizatin is implemented, this advancing of the pointer to the last B confused things, and it failed to match the final AAAAC sequence. This patch is somewhat of a bodge, it shouldnt be necessary to inspect inside of reginfo() after a call to regtry() and it is a bit counter-intuitive to do so. This patch wraps the check in a macro so that at least it is somewhat self documenting what it is doing.
The following regex doesn't match while it should.
Disabling the optimization (I don't know which one causes this) with
(?{})
makes the match succeed as it should.Additionally, Using a possessive quantifier makes the bug go away.
When running the faulty regex with
use re "debug";
the messageContradicts stclass... [regexec_flags]
appears.The text was updated successfully, but these errors were encountered: