Python 3.11 re.compile raises SRE code error for valid regex. #98740

Closed
Waszker opened this issue Oct 26, 2022 · 2 comments
Labels
3.10 only security fixes 3.11 only security fixes 3.12 bugs and security fixes topic-regex type-bug An unexpected behavior, bug, or error

Comments


Waszker commented Oct 26, 2022

Bug report

The following regex causes re.compile() to raise RuntimeError: invalid SRE code:

    import re

    re.compile(
        r"(?P<h>^([01][0-9]|2[0-3]))"
        r"((?P<m>([0-5][0-9]))?"
        r"(?(5)(?P<s>([0-5][0-9]|60))?)"
        r"(?(7)(\.(?P<ms>([0-9]{1,6})?))?))$"
    )
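To check a given interpreter without aborting a script, a small wrapper along these lines may help. This is only a sketch: `try_compile` and `PATTERN` are hypothetical names introduced here, and the pattern is the one from this report, unchanged. It treats both `re.error` (a genuinely invalid pattern) and `RuntimeError` (the error reported here) as failures.

```python
import re
import sys

# The pattern from the report above, unchanged.
PATTERN = (
    r"(?P<h>^([01][0-9]|2[0-3]))"
    r"((?P<m>([0-5][0-9]))?"
    r"(?(5)(?P<s>([0-5][0-9]|60))?)"
    r"(?(7)(\.(?P<ms>([0-9]{1,6})?))?))$"
)


def try_compile(pattern):
    """Return None if the pattern compiles, else a short error description."""
    try:
        re.compile(pattern)
    except (re.error, RuntimeError) as exc:
        # RuntimeError is what this report describes; re.error covers
        # patterns that are rejected as syntactically invalid.
        return f"{type(exc).__name__}: {exc}"
    return None


outcome = try_compile(PATTERN)
print(sys.version.split()[0], "->", outcome or "compiled fine")
```

Running the same file under several interpreters (for example in separate virtualenvs) should show which versions are affected.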

Your environment

Python 3.11

  • CPython versions tested on: 3.11
  • Operating system and architecture: Linux (docker image as well as virtualenv)

I've checked, and this was not an issue in any earlier Python version I tried, going back to 3.6 (the oldest I tested).
What's more, the regex is accepted without problems by other regex implementations, e.g. the online tool https://regex101.com/

I've already asked about this on the mailing list, where it was confirmed to be a bug.

@serhiy-storchaka has confirmed that the cause of this bug has already been found.

@Waszker Waszker added the type-bug An unexpected behavior, bug, or error label Oct 26, 2022
@AlexWaygood AlexWaygood added topic-regex 3.11 only security fixes 3.12 bugs and security fixes labels Oct 26, 2022
Contributor

hroncok commented Oct 26, 2022

I've bisected this to f703c96

Member

serhiy-storchaka commented Oct 27, 2022

The simplified example is:

re.compile('()()()()()()()(?(1)()?)')

It is caused by a fundamental flaw in the RE validation code, which checks whether the last word in the "then" branch of a conditional expression equals the JUMP opcode. JUMP was 16 in 3.10 and below and became 15 in 3.11. Unfortunately, that value also matches the argument of another opcode ("MARK 15"), which marks the end of the 8th capturing group.
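The ambiguity can be illustrated with a toy model. This is a sketch only: `JUMP = 15` is the 3.11 value quoted above, while `LITERAL`, `ANY`, the word layout, and the helper `naive_ends_with_jump` are hypothetical and do not mirror the real SRE encoding. The point is that a check which peeks at one raw word cannot tell an opcode from an argument that happens to have the same value.

```python
# Toy model only: JUMP = 15 is the 3.11 value quoted in this thread.
# LITERAL, ANY and the flat word layout are made up for illustration.
JUMP = 15
LITERAL = 100   # hypothetical opcode taking one argument
ANY = 101       # hypothetical opcode taking no argument


def naive_ends_with_jump(words):
    """Guess whether a branch ends with 'JUMP <skip>' by peeking at one raw word."""
    return len(words) >= 2 and words[-2] == JUMP


real_jump = [LITERAL, ord("a"), JUMP, 3]   # really ends with JUMP 3
lookalike = [LITERAL, 15, ANY]             # LITERAL 15, then ANY; no JUMP at all

print(naive_ends_with_jump(real_jump))   # True
print(naive_ends_with_jump(lookalike))   # True, although 15 is only an argument here
```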

The bug is not new. An even simpler example for 3.11 is:

re.compile(r'()(?(1)\x0f?)')

and for 3.10 and below:

re.compile(r'()(?(1)\x10?)')

Whatever the value of the JUMP opcode, there is always an example that fails.
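A quick way to see which of the simplified examples fail on the interpreter at hand is sketched below. `EXAMPLES` and `failing_examples` are hypothetical names; the patterns are the ones quoted in this comment. Which of them fail, if any, depends on the Python version in use.

```python
import re
import sys

# Simplified patterns quoted in this comment.
EXAMPLES = [
    "()()()()()()()(?(1)()?)",
    r"()(?(1)\x0f?)",   # given above as the 3.11 example
    r"()(?(1)\x10?)",   # given above as the example for 3.10 and below
]


def failing_examples(patterns=EXAMPLES):
    """Return (pattern, message) pairs for the patterns that fail to compile."""
    failures = []
    for pattern in patterns:
        try:
            re.compile(pattern)
        except (re.error, RuntimeError) as exc:
            failures.append((pattern, str(exc)))
    return failures


print("Python", sys.version.split()[0])
for pattern, message in failing_examples():
    print(f"  {pattern!r}: {message}")
```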

Fixing this will not be easy and may require changing the semantics of some opcodes.

serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Oct 27, 2022
In very rare circumstances the JUMP opcode could be confused with the
argument of the opcode in the "then" part which doesn't end with the
JUMP opcode. This led to incorrect detection of the final JUMP opcode
and incorrect calculation of the size of the subexpression.

NOTE: Changed return value of functions _validate_inner() and
_validate_charset() in Modules/_sre/sre.c.  Now they return 0 on success,
-1 on failure, and 1 if the last op is JUMP (which usually is a failure).
Previously they returned 1 on success and 0 on failure.
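The return convention described in this commit message can be sketched in Python as a toy validator. This is not the code in Modules/_sre/sre.c: `toy_validate`, `ARG_COUNT`, and every opcode number except `JUMP = 15` are assumptions made for illustration. The sketch walks the words opcode by opcode, so it knows which word is an opcode and which is an argument, and reports 0 on success, -1 on failure, and 1 if the last op is JUMP.

```python
# Toy model only: JUMP = 15 is the 3.11 value from this thread; the other
# opcode numbers and the argument counts are hypothetical.
JUMP = 15
LITERAL = 100
ANY = 101
ARG_COUNT = {JUMP: 1, LITERAL: 1, ANY: 0}


def toy_validate(words):
    """Return 0 on success, -1 on failure, 1 if the last op is JUMP."""
    i = 0
    last_op = None
    while i < len(words):
        op = words[i]
        if op not in ARG_COUNT:
            return -1          # unknown opcode
        i += 1 + ARG_COUNT[op]  # skip the opcode and its arguments
        if i > len(words):
            return -1          # argument runs past the end
        last_op = op
    return 1 if last_op == JUMP else 0


print(toy_validate([LITERAL, 15, ANY]))            # 0: the 15 is an argument, not JUMP
print(toy_validate([LITERAL, ord("a"), JUMP, 3]))  # 1: the last op really is JUMP
print(toy_validate([LITERAL]))                     # -1: missing argument
```

Because the caller learns from the return value whether the branch ended with JUMP, it no longer needs to inspect a raw word at a fixed position.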
serhiy-storchaka added a commit that referenced this issue Nov 3, 2022
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Nov 3, 2022
…onGH-98764)

(cherry picked from commit e9ac890)
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Nov 3, 2022
pythonGH-98764)

(cherry picked from commit e9ac890)
miss-islington added a commit that referenced this issue Nov 3, 2022
(cherry picked from commit e9ac890)
serhiy-storchaka added a commit that referenced this issue Nov 3, 2022
…98764) (GH-99046)

(cherry picked from commit e9ac890)
@serhiy-storchaka serhiy-storchaka added the 3.10 only security fixes label Nov 3, 2022