
Python 2 Page representation (repr) returns non-ASCII bytes
Closed, Declined, Public

Description

The repr of a Page instance uses the console encoding to encode the title. In Python 2 this makes some operations fail: when the resulting byte string is combined with a unicode string, Python tries to decode the representation and assumes it is ASCII-encoded.

https://code.djangoproject.com/ticket/18063 provides a very good explanation of the problem and approach being taken by Pywikibot.
Hasten the day of Python 2 being de-supported; until then, we give the best possible output for users of Python 2.

Examples:

>>> import pywikibot
>>> p = pywikibot.Page(pywikibot.Site(), u'öäöä')
>>> p.title()
u'\xf6\xe4\xf6\xe4'
>>> '%r' % ([p],)
'[Page(\xc3\xb6\xc3\xa4\xc3\xb6\xc3\xa4)]'
>>> u'%r' % ([p],)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
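
The failure above can be reproduced outside Python 2 by doing the implicit step by hand. The following is a minimal sketch that runs on Python 3; the helper name `implicit_coercion` is hypothetical and only mimics the ASCII decode that Python 2 performs when a byte string is formatted into a unicode string.

```python
# The byte string that the Python 2 repr produced on a UTF-8 console.
title = u'\xf6\xe4\xf6\xe4'
repr_bytes = b'[Page(' + title.encode('utf-8') + b')]'


def implicit_coercion(raw):
    """Mimic Python 2's implicit bytes-to-unicode coercion (ASCII only)."""
    return raw.decode('ascii')


try:
    implicit_coercion(repr_bytes)
    failure = None
except UnicodeDecodeError as error:
    # error.start is the index of the first byte that is not ASCII.
    failure = (error.start, error.reason)

print(failure)
```

The first non-ASCII byte sits right after `[Page(`, which is why the traceback reports position 6.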

Here are all results of using repr or str, directly or indirectly, on Python 2:

>>> p = pywikibot.Page(s, u'Ümlaut')
>>> print(u'Hello: %r!' % [p])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
>>> print(u'Hello: %s!' % [p])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
>>> print(u'Hello: %r!' % p)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)
>>> print(u'Hello: %s!' % p)
Hello: [[test:Ümlaut]]!

The same on Python 3:

>>> p = pywikibot.Page(s, 'Ümlaut')
>>> print('Hello: %r!' % [p])
Hello: [Page(b'\xc3\x9cmlaut')]!
>>> print('Hello: %s!' % [p])
Hello: [Page(b'\xc3\x9cmlaut')]!
>>> print('Hello: %r!' % p)
Hello: Page(b'\xc3\x9cmlaut')!
>>> print('Hello: %s!' % p)
Hello: [[test:Ümlaut]]!

On Windows 7 it can also fail when the console encoding doesn't even support the characters:

>>> import pywikibot as py
>>> s = py.Site('af')
>>> p = py.Page(s, 'user:xqt')
>>> p
Page(Gebruiker:Xqt)
>>> s = py.Site('fa')
>>> p = py.Page(s, 'user:xqt')
>>> p

Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    p
  File "pywikibot\page.py", line 224, in __repr__
    self.title().encode(config.console_encoding))
  File "C:\Python27\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4: character maps to <undefined>
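
The encoding failure does not depend on Windows itself, only on the code page. As a rough sketch (Python 3, with the title written as escapes), the same error can be triggered with the cp850 codec directly. The `backslashreplace` call at the end is only an illustration of how an encoder can be told not to raise; it is not claimed to be what Pywikibot does.

```python
# Persian namespace name plus user name; escapes keep this source ASCII-only.
title = '\u06a9\u0627\u0631\u0628\u0631:Xqt'

try:
    title.encode('cp850')
    failed_span = None
except UnicodeEncodeError as error:
    # start/end delimit the run of characters that cp850 cannot represent.
    failed_span = (error.start, error.end)

# An error handler replaces unencodable characters instead of raising.
escaped = title.encode('cp850', errors='backslashreplace')

print(failed_span)
print(escaped)
```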

Event Timeline

bzimport raised the priority of this task from to Low. Nov 22 2014, 3:21 AM
bzimport set Reference to bz64958.

Taking as I've been working on this for a while due to the cache key bug.

We now have a test case for this in page_tests, which fails after a clean install of pywikibot.

One way to work around this problem is to set the console codepage to 65001:

C:\pywikibot\core>chcp 65001
Active code page: 65001

And set pywikibot console_encoding = 'utf-8'.

On my console, ar and fa don't display properly (boxes are shown instead), but copying the console text does put the proper ar/fa text into the clipboard buffer.

However in pwb shell, 'print p' causes another exception.

>>> pywikibot.output(p)
Page(کاربر:John Vandenberg)
>>> p
Page(کاربر:John Vandenberg)Traceback (most recent call last):
  File "<console>", line 1, in <module>
IOError: [Errno 0] Error
jayvdb set Security to None.
This comment was removed by Dalba.
jayvdb renamed this task from representation string fails for page object to representation (repr) string fails for page object. Jun 22 2015, 7:19 AM
jayvdb removed jayvdb as the assignee of this task.
jayvdb raised the priority of this task from Low to Medium.
jayvdb removed a project: Pywikibot-General.
XZise renamed this task from representation (repr) string fails for page object to Python 2 Page representation (repr) returns non-ASCII bytes. Jun 25 2015, 10:53 AM
XZise raised the priority of this task from Medium to High.
XZise updated the task description.

The suggestion in T66958#705611 only handles the original problem discussed here, where the codec itself can't encode characters present in the title. But on every system it still fails whenever a page with a non-ASCII title (or namespace) has its repr inserted into a unicode string. This mainly affects Python 2: in Python 3, repr can return Unicode characters, and putting bytes into a str just inserts the escaped byte values.
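
To make the Python 3 side of that comparison concrete, here is a small sketch. `UnicodeReprPage` is a hypothetical stand-in class, not Pywikibot's Page; it only shows that a repr containing non-ASCII text formats cleanly, and that bytes formatted into a str show up as an escaped literal rather than raising.

```python
class UnicodeReprPage:
    """Hypothetical stand-in whose repr returns non-ASCII text."""

    def __init__(self, title):
        self.title = title

    def __repr__(self):
        return 'Page({})'.format(self.title)


page = UnicodeReprPage('\xdcmlaut')

# A list uses repr() on its items; on Python 3 the result is simply a str.
in_list = 'Hello: %r!' % [page]

# Bytes are never decoded implicitly; their escaped literal is inserted.
with_bytes = 'Hello: %r!' % '\xdcmlaut'.encode('utf-8')

print(in_list)
print(with_bytes)
```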

Change 220613 had a related patch set uploaded (by XZise):
[FEAT] page_tests: Page repr encoding test

https://gerrit.wikimedia.org/r/220613

Change 219618 had a related patch set uploaded (by XZise):
[bugfix] Workaround UnicodeDecodeError on api error

https://gerrit.wikimedia.org/r/219618

Change 220613 merged by jenkins-bot:
[FEAT] page_tests: Page repr encoding test

https://gerrit.wikimedia.org/r/220613

Just a few notes in relation to T89589: Usage of unicode_literals from __future__ package and {1e54a7d6}: the patch changed Page.__repr__ so that it returned unicode in Python 2, which also does not comply with Python's specification, and that change was mostly reverted in {853e6b0b}. If you compare the method before the unicode_literals patch was merged with the method after the revert, the diff just shows that code has been moved and that it uses str.format instead of %-notation:

     def __repr__(self):
         """Return a more complete string representation."""
-        return "%s(%s)" % (self.__class__.__name__,
-                           self.title().encode(config.console_encoding))
+        title = self.title().encode(config.console_encoding)
+        return str('{0}({1})').format(self.__class__.__name__, title)

So in the end the unicode_literals patch does not touch the representation. It also does not touch how the representation of lists looks. The only difference with that patch is that where the list's representation was previously inserted into a bytes instance, the target is now most likely a unicode instance, so Python now tries to decode the representation.
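
The equivalence of the two variants in the diff can be checked with a short sketch. `StubPage`, `CONSOLE_ENCODING`, `repr_percent` and `repr_format` are hypothetical names standing in for the Page class, `config.console_encoding` and the two versions of `__repr__`. The sketch runs on Python 3, where the encoded title is a bytes object, which is also why the Python 3 transcript in the description shows `Page(b'...')`.

```python
CONSOLE_ENCODING = 'utf-8'  # assumed stand-in for config.console_encoding


class StubPage:
    """Hypothetical minimal object offering only title()."""

    def __init__(self, title):
        self._title = title

    def title(self):
        return self._title


def repr_percent(page):
    """Body of __repr__ before the change (%-notation)."""
    return "%s(%s)" % (page.__class__.__name__,
                       page.title().encode(CONSOLE_ENCODING))


def repr_format(page):
    """Body of __repr__ after the revert (str.format)."""
    title = page.title().encode(CONSOLE_ENCODING)
    return str('{0}({1})').format(page.__class__.__name__, title)


page = StubPage('\xdcmlaut')
old_style = repr_percent(page)
new_style = repr_format(page)
print(old_style)
```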

Change 219618 merged by jenkins-bot:
[bugfix] Workaround UnicodeDecodeError on api error

https://gerrit.wikimedia.org/r/219618

Change 223880 had a related patch set uploaded (by Merlijn van Deen):
[bugfix] Workaround UnicodeDecodeError on api error

https://gerrit.wikimedia.org/r/223880

Change 223880 merged by jenkins-bot:
[bugfix] Workaround UnicodeDecodeError on api error

https://gerrit.wikimedia.org/r/223880

Change 228620 had a related patch set uploaded (by Xqt):
[FIX] Workaround fix for UnicodeEncodeError in api.py

https://gerrit.wikimedia.org/r/228620

Change 228620 abandoned by Xqt:
[FIX] Workaround fix for UnicodeEncodeError in api.py

https://gerrit.wikimedia.org/r/228620

jayvdb lowered the priority of this task from High to Low. Aug 21 2015, 10:36 AM
jayvdb removed a project: Patch-For-Review.

With https://gerrit.wikimedia.org/r/231566 , I believe we've got a good solution, which means this bug won't have much impact.

The mentioned django bug report does not mention https://www.python.org/dev/peps/pep-3138/#motivation:

  • Convert other non-printable characters(0x00-0x1f, 0x7f) and non-ASCII characters (>= 0x80) to '\xXX'.

[…]

For Unicode strings, the following additional conversions are done.

  • Convert leading surrogate pair characters without trailing character (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\uXXXX'.
  • Convert 16-bit characters (>= 0x100) to '\uXXXX'.
  • Convert 21-bit characters (>= 0x10000) and surrogate pair characters to '\U00xxxxxx'.

[…]

This algorithm converts any string to printable ASCII, and repr() is used as a handy and safe way to print strings for debugging or for logging.

So at least when Python 3 was implemented, the repr() function was defined as returning ASCII. But who knows whether they meant bytes or an ASCII-compatible str. So in the end it might be legitimate to return unicode, as long as it contains only ASCII characters (Python would then automatically convert one into the other without a problem).
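
One way to read that idea is a repr that escapes everything outside ASCII. This is only a sketch under that assumption: `AsciiReprPage` is a hypothetical class, and the approach is not claimed to be what Pywikibot or the linked Gerrit change implemented.

```python
class AsciiReprPage:
    """Hypothetical page-like object with an ASCII-only repr."""

    def __init__(self, title):
        self._title = title

    def __repr__(self):
        # Escape every non-ASCII character, then return text again so the
        # result can be mixed with byte or unicode strings without failing.
        escaped = self._title.encode('ascii', 'backslashreplace').decode('ascii')
        return '{0}({1})'.format(self.__class__.__name__, escaped)


text = repr([AsciiReprPage('\xdcmlaut')])
print(text)
```

The price is readability: a title in a non-Latin script turns into a row of escape sequences, so such a repr is useful for logging and debugging rather than for display.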

Xqt lowered the priority of this task from Low to Lowest. Mar 10 2020, 10:51 AM

Declined because Python 2 has been dropped.

Change 608690 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [tests] Remove failing test_ASCII_compatible test

https://gerrit.wikimedia.org/r/c/pywikibot/core/+/608690

Change 608690 merged by jenkins-bot:
[pywikibot/core@master] [tests] Remove failing test_ASCII_compatible test

https://gerrit.wikimedia.org/r/c/pywikibot/core/+/608690