When computing the anchors on the traceback, results may be wrong if unicode chars are used #99103

Closed
fabioz opened this issue Nov 4, 2022 · 4 comments
Labels: topic-unicode, type-bug (an unexpected behavior, bug, or error)

Comments

@fabioz
Contributor

fabioz commented Nov 4, 2022

Bug report

Consider the code below:

d = {
    "ó": {
        "á": {
            "í": {
                "theta": 1
            }
        }
    }
}

try:
    result = d["ó"]["á"]["í"]["beta"]
except:
    import traceback;traceback.print_exc()

The output provided is:

Traceback (most recent call last):
  File "W:\pydev.debugger\check\snippet2.py", line 12, in <module>
    result = d["ó"]["á"]["í"]["beta"]
             ~~~~~~~~~~~~~~~~~~~^^^^^^^^
KeyError: 'beta'

Notice that an extra '~' is added for each non-ASCII character on the line.

This seems to happen because, when computing the anchors in traceback._extract_caret_anchors_from_line_segment, the column offsets of the AST nodes produced by ast.parse are UTF-8 byte offsets rather than character offsets.
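
To illustrate the mismatch, here is a minimal sketch, assuming the segment handed to ast.parse is just the expression from the failing line. The helper name subscript_offsets is hypothetical and not part of traceback; it only compares the end_col_offset reported by ast (documented as a UTF-8 byte offset) with the equivalent character offset.

```python
import ast


def subscript_offsets(segment):
    """Return (byte_col, char_col) for the end of the subscripted value.

    Hypothetical helper for illustration; not part of the traceback module.
    """
    tree = ast.parse(segment)
    expr = tree.body[0].value  # the outermost ast.Subscript
    # ast reports column offsets as UTF-8 byte offsets.
    byte_col = expr.value.end_col_offset
    # Convert to a character offset by decoding the bytes up to that point.
    char_col = len(segment.encode("utf-8")[:byte_col].decode("utf-8"))
    return byte_col, char_col


segment = 'd["ó"]["á"]["í"]["beta"]'
byte_col, char_col = subscript_offsets(segment)
print(byte_col, char_col)
```

For this segment the two values differ by three, one for each non-ASCII character, while an all-ASCII segment gives identical values.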

Your environment

  • CPython versions tested on: 3.11.0
  • Operating system and architecture: Windows 10
@fabioz
Contributor Author

fabioz commented Nov 4, 2022

Note: in another example it gets a bit worse and ends up throwing an internal failure:

#coding: utf-8

try:
    á = 1
    í = 2
    c = tuple
    
    result = á + í + c
except:
    import traceback;traceback.print_exc()

Gives me:

Traceback (most recent call last):
Traceback (most recent call last):
  File "W:\pydev.debugger\check\snippet2.py", line 8, in <module>
    result = á + í + c
             ~~~~~~~~^
TypeError: unsupported operand type(s) for +: 'int' and 'type'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "W:\pydev.debugger\check\snippet2.py", line 10, in <module>
    import traceback;traceback.print_exc()
                     ^^^^^^^^^^^^^^^^^^^^^
  File "C:\bin\Miniconda\envs\py311_tests\Lib\traceback.py", line 183, in print_exc
    print_exception(*sys.exc_info(), limit=limit, file=file, chain=chain)
  File "C:\bin\Miniconda\envs\py311_tests\Lib\traceback.py", line 125, in print_exception
    te.print(file=file, chain=chain)
  File "C:\bin\Miniconda\envs\py311_tests\Lib\traceback.py", line 977, in print
    for line in self.format(chain=chain):
  File "C:\bin\Miniconda\envs\py311_tests\Lib\traceback.py", line 914, in format
    yield from _ctx.emit(exc.stack.format())
                         ^^^^^^^^^^^^^^^^^^
  File "C:\bin\Miniconda\envs\py311_tests\Lib\traceback.py", line 531, in format
    formatted_frame = self.format_frame_summary(frame_summary)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\bin\Miniconda\envs\py311_tests\Lib\traceback.py", line 478, in format_frame_summary
    colno = _byte_offset_to_character_offset(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\bin\Miniconda\envs\py311_tests\Lib\traceback.py", line 566, in _byte_offset_to_character_offset
    return len(as_utf8[:offset + 1].decode("utf-8"))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 13: unexpected end of data
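
The failure can be reproduced in isolation. The sketch below assumes as_utf8 is the UTF-8 encoding of the source line (only the final return statement is visible in the traceback above), and naive_char_offset is a hypothetical name used for illustration.

```python
def naive_char_offset(line, offset):
    # Assumption: mirrors the slice-and-decode shown in the traceback frame.
    as_utf8 = line.encode("utf-8")
    return len(as_utf8[:offset + 1].decode("utf-8"))


line = "    result = á + í + c"

# Byte 13 is the first of the two bytes that encode "á", so the slice
# as_utf8[:14] ends in the middle of that character and cannot be decoded.
try:
    naive_char_offset(line, 13)
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

An offset that lands on the last byte of a character (14 here) decodes without error, so the crash depends on where the byte offset falls.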

@fabioz
Contributor Author

fabioz commented Nov 4, 2022

Note: _byte_offset_to_character_offset could use the code below to compute the offset without decoding a byte slice, which avoids failing when the slice ends in the middle of a multi-byte character:

_utf8_with_2_bytes = 0x80
_utf8_with_3_bytes = 0x800
_utf8_with_4_bytes = 0x10000


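def _utf8_byte_offset_to_character_offset(s, offset):
    """Return the 1-based index of the character in `s` containing UTF-8 byte `offset`.

    Walks code points instead of decoding a byte slice, so an offset that
    falls inside a multi-byte character cannot raise UnicodeDecodeError.
    """
    byte_offset = 0
    char_offset = 0
    for char_offset, character in enumerate(s):
        # Every character takes at least one byte in UTF-8.
        byte_offset += 1

        codepoint = ord(character)

        # Add the continuation bytes for multi-byte characters.
        if codepoint >= _utf8_with_4_bytes:
            byte_offset += 3

        elif codepoint >= _utf8_with_3_bytes:
            byte_offset += 2

        elif codepoint >= _utf8_with_2_bytes:
            byte_offset += 1

        # Stop at the first character whose bytes extend past `offset`.
        if byte_offset > offset:
            break

    # Make 1 based.
    char_offset += 1
    return char_offset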
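
For comparison only, a more compact sketch of the same idea: char_offset_for_byte is a hypothetical name, and it measures each character with len(ch.encode("utf-8")) instead of comparing code points against thresholds. It assumes the line contains no lone surrogates, which cannot be encoded to UTF-8.

```python
def char_offset_for_byte(s, offset):
    """1-based index of the character containing UTF-8 byte `offset`.

    Hypothetical helper for illustration; assumes `s` has no lone surrogates.
    """
    consumed = 0  # bytes covered by the characters seen so far
    for index, ch in enumerate(s, start=1):
        consumed += len(ch.encode("utf-8"))
        if consumed > offset:
            return index
    # Offset past the end: fall back to the last character (1 for an empty string).
    return max(len(s), 1)


line = "    result = á + í + c"
# Bytes 13 and 14 both belong to "á", the 14th character on the line.
print(char_offset_for_byte(line, 13), char_offset_for_byte(line, 14))
```

Both bytes of "á" map to the same character index, so an offset in the middle of a character no longer raises.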

@mdboom
Contributor

mdboom commented Nov 4, 2022

The second and third comments seem to be duplicates of #98744, which I confirmed is now fixed on main. The original issue still seems to exist on main, however.

isidentical added a commit to isidentical/cpython that referenced this issue Nov 5, 2022
isidentical added a commit to isidentical/cpython that referenced this issue Nov 12, 2022
miss-islington pushed a commit that referenced this issue Nov 12, 2022
isidentical added a commit to isidentical/cpython that referenced this issue Nov 12, 2022
…t the current line (pythonGH-99145)

Automerge-Triggered-By: GH:isidentical.
(cherry picked from commit 57be545)

Co-authored-by: Batuhan Taskaya <[email protected]>
pablogsal pushed a commit that referenced this issue Nov 21, 2022
…current line (#99423)

[3.11] gh-99103: Normalize specialized traceback anchors against the current line (GH-99145)

Automerge-Triggered-By: GH:isidentical.
(cherry picked from commit 57be545)

Co-authored-by: Batuhan Taskaya <[email protected]>
@hauntsaninja
Contributor

Looks like this has been fixed and backported, thank you for reporting!
