Python 2 Page representation (repr) returns non-ASCII bytes
Closed, Declined (Public)

Assigned To

None

Authored By

	Xqt
	May 6 2014, 5:02 PM

Description

The repr of a Page instance uses the console encoding to encode the title. This causes some operations in Python 2 to fail when Python tries to decode the representation back to unicode and assumes it is ASCII-encoded.

https://code.djangoproject.com/ticket/18063 provides a very good explanation of the problem and approach being taken by Pywikibot.
Hasten the day of Python 2 being de-supported; until then, we give the best possible output for users of Python 2.

Examples:

>>> import pywikibot
>>> p = pywikibot.Page(pywikibot.Site(), u'öäöä')
>>> p.title()
u'\xf6\xe4\xf6\xe4'
>>> '%r' % ([p],)
'[Page(\xc3\xb6\xc3\xa4\xc3\xb6\xc3\xa4)]'
>>> u'%r' % ([p],)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
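
The failure above can be reproduced in Python 3 with an explicit decode. This is only a sketch: it assumes that Python 2 implicitly decodes a byte string with the ASCII codec when the byte string is combined with a unicode string, and it hard-codes the UTF-8 encoded repr shown above instead of importing pywikibot.

```python
# Byte repr as produced by '%r' % ([p],) in the Python 2 session above.
byte_repr = b'[Page(\xc3\xb6\xc3\xa4\xc3\xb6\xc3\xa4)]'

failed_at = None
try:
    # Python 2 performs this step implicitly for u'%r' % ([p],).
    byte_repr.decode('ascii')
except UnicodeDecodeError as exc:
    failed_at = exc.start  # index of the first non-ASCII byte
    print(exc)
```

The reported position is the index of the first byte after `[Page(`, which is why the traceback points at position 6 for a list and position 5 for a single page.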

Here are all results of using repr or str, directly or indirectly, on Python 2:

>>> p = pywikibot.Page(s, u'Ümlaut')
>>> print(u'Hello: %r!' % [p])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
>>> print(u'Hello: %s!' % [p])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
>>> print(u'Hello: %r!' % p)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)
>>> print(u'Hello: %s!' % p)
Hello: [[test:Ümlaut]]!

The same on Python 3:

>>> p = pywikibot.Page(s, 'Ümlaut')
>>> print('Hello: %r!' % [p])
Hello: [Page(b'\xc3\x9cmlaut')]!
>>> print('Hello: %s!' % [p])
Hello: [Page(b'\xc3\x9cmlaut')]!
>>> print('Hello: %r!' % p)
Hello: Page(b'\xc3\x9cmlaut')!
>>> print('Hello: %s!' % p)
Hello: [[test:Ümlaut]]!

On Windows 7 it can also fail when the console encoding doesn't even support the characters:

>>> import pywikibot as py
>>> s = py.Site('af')
>>> p = py.Page(s, 'user:xqt')
>>> p
Page(Gebruiker:Xqt)
>>> s = py.Site('fa')
>>> p = py.Page(s, 'user:xqt')
>>> p

Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    p
  File "pywikibot\page.py", line 224, in __repr__
    self.title().encode(config.console_encoding))
  File "C:\Python27\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4: character maps to <undefined>
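
A small Python 3 sketch of the same encoding failure without pywikibot. The title is an assumption (the Persian user namespace name followed by the user name), and the `backslashreplace` error handler is shown only as one possible way to avoid the exception, not as what pywikibot does.

```python
# Assumed title of 'user:xqt' on the fa site; hypothetical value for illustration.
title = 'کاربر:Xqt'

strict_failed = False
try:
    # Same call shape as in Page.__repr__ with console_encoding = 'cp850'.
    title.encode('cp850')
except UnicodeEncodeError as exc:
    strict_failed = True
    print('strict encoding failed:', exc.reason)

# Unencodable characters are replaced by backslash escapes instead of raising.
safe = title.encode('cp850', errors='backslashreplace')
print(safe)
```

With this handler the result contains only ASCII bytes, so it is also safe to decode as ASCII later.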

Details

Reference: bz64958

Subject	Repo	Branch	Lines +/-
[tests] Remove failing test_ASCII_compatible test	pywikibot/core	master	+0 -8
[FIX] Workaround fix for UnicodeEncodeError in api.py	pywikibot/core	master	+10 -5
[bugfix] Workaround UnicodeDecodeError on api error	pywikibot/core	2.0	+14 -2
[bugfix] Workaround UnicodeDecodeError on api error	pywikibot/core	master	+15 -4
[FEAT] page_tests: Page repr encoding test	pywikibot/core	master	+10 -0

Related Objects

Status	Assigned	Task
Resolved	Xover	T265640 phe-tools: Match&Split bot is not running because of python2 deprecation in pywikibot
Resolved	Xqt	T213287 Drop support of Python 2.7
Resolved	Xqt	T243770 Pywikibot Python 2 compatibility (tracking)
Declined	None	T66958 Python 2 Page representation (repr) returns non-ASCII bytes

Event Timeline

• bzimport raised the priority of this task to Low. Nov 22 2014, 3:21 AM

• bzimport added a project: Pywikibot-General.

• bzimport set Reference to bz64958.

Xqt created this task. May 6 2014, 5:02 PM

Taking as I've been working on this for a while due to the cache key bug.

We now have a test case for this in page_tests, which fails after a clean install of pywikibot.

One way to work around this problem is to set the console code page to 65001:

C:\pywikibot\core>chcp 65001
Active code page: 65001

And set pywikibot's console_encoding = 'utf-8'.

On my console, ar and fa don't display properly (boxes are used), but copying the console text does put the proper ar/fa text into the clipboard buffer.

However, in the pwb shell, 'print p' causes another exception:

>>> pywikibot.output(p)
Page(کاربر:John Vandenberg)
>>> p
Page(کاربر:John Vandenberg)Traceback (most recent call last):
  File "<console>", line 1, in <module>
IOError: [Errno 0] Error

Aklapper added a project: Pywikibot. Nov 27 2014, 4:18 PM

Dalba subscribed. Apr 1 2015, 4:06 AM

jayvdb updated the task description. Apr 1 2015, 6:55 AM

jayvdb set Security to None.

Dalba added a comment. Apr 1 2015, 7:42 AM

This comment was removed by Dalba.

jayvdb renamed this task from "representation string fails for page object" to "representation (repr) string fails for page object". Jun 22 2015, 7:19 AM

jayvdb removed jayvdb as the assignee of this task.

jayvdb raised the priority of this task from Low to Medium.

jayvdb removed a project: Pywikibot-General.

jayvdb mentioned this in T95809: Page's repr returns invalid data causing Python to error. Jun 22 2015, 7:21 AM

XZise merged a task: T95809: Page's repr returns invalid data causing Python to error. Jun 25 2015, 10:49 AM

XZise added subscribers: gerritbot, valhallasw, Aklapper, XZise.

XZise renamed this task from "representation (repr) string fails for page object" to "Python 2 Page representation (repr) returns non-ASCII bytes". Jun 25 2015, 10:53 AM

XZise raised the priority of this task from Medium to High.

XZise updated the task description.

The suggestion in T66958#705611 only handles the original problem discussed here, where the codec itself can't encode characters present in the title. But on all systems it will always fail when a non-ASCII title (or namespace) is used and the repr is inserted into a unicode string. This mainly affects Python 2 only, as repr in Python 3 can return Unicode characters, and if you try putting bytes into a str it just adds the byte values.
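
The Python 3 behaviour described here can be checked with a short sketch. It uses plain strings and bytes rather than a real Page object, and it assumes the title is encoded as UTF-8, as in the examples in the task description.

```python
title = 'Ümlaut'
encoded = title.encode('utf-8')  # bytes, as produced inside Page.__repr__

# Python 3 never decodes the bytes implicitly; it inserts their repr.
text = 'Hello: %r!' % encoded
print(text)
```

No exception is raised, but the byte values end up in the text as escape sequences, which matches the `b'\xc3\x9cmlaut'` output in the Python 3 session above.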

XZise updated the task description. Jun 25 2015, 11:03 AM

Change 220613 had a related patch set uploaded (by XZise):
[FEAT] page_tests: Page repr encoding test

https://gerrit.wikimedia.org/r/220613

Change 219618 had a related patch set uploaded (by XZise):
[bugfix] Workaround UnicodeDecodeError on api error

https://gerrit.wikimedia.org/r/219618

jayvdb removed a project: Patch-For-Review. Jun 26 2015, 1:09 AM

Change 220613 merged by jenkins-bot:
[FEAT] page_tests: Page repr encoding test

https://gerrit.wikimedia.org/r/220613

XZise mentioned this in rPWBC18ca364a57c2: [FEAT] page_tests: Page repr encoding test. Jun 26 2015, 1:10 AM

Just a few notes in relation to T89589: Usage of unicode_literals from __future__ package and {1e54a7d6}. The patch changed Page.__repr__ so that it returned unicode in Python 2, which is also not compliant with Python's specification, and it was mostly reverted in {853e6b0b}. If you compare the method before the unicode_literals patch was merged with the method after the revert, the diff just shows that code has been moved and that it uses str.format instead of %-notation:

     def __repr__(self):
         """Return a more complete string representation."""
-        return "%s(%s)" % (self.__class__.__name__,
-                           self.title().encode(config.console_encoding))
+        title = self.title().encode(config.console_encoding)
+        return str('{0}({1})').format(self.__class__.__name__, title)

So in the end, the unicode_literals patch does not touch the representation, nor does it touch how the representation of lists looks. The only difference with that patch is that where the list's representation was previously inserted into an existing bytes instance, the target is now most likely a unicode instance, so Python now tries to decode the representation.

Change 219618 merged by jenkins-bot:
[bugfix] Workaround UnicodeDecodeError on api error

https://gerrit.wikimedia.org/r/219618

XZise mentioned this in rPWBC06925875262d: [bugfix] Workaround UnicodeDecodeError on api error. Jun 28 2015, 1:18 PM

Change 223880 had a related patch set uploaded (by Merlijn van Deen):
[bugfix] Workaround UnicodeDecodeError on api error

https://gerrit.wikimedia.org/r/223880

gerritbot added a project: Patch-For-Review. Jul 9 2015, 6:44 PM

Change 223880 merged by jenkins-bot:
[bugfix] Workaround UnicodeDecodeError on api error

https://gerrit.wikimedia.org/r/223880

valhallasw mentioned this in rPWBCc7a12b5eb7aa: [bugfix] Workaround UnicodeDecodeError on api error. Jul 9 2015, 7:56 PM

XZise mentioned this in T107428: UnicodeEncodeError in Page.__repr__(). Jul 31 2015, 9:14 AM

Change 228620 had a related patch set uploaded (by Xqt):
[FIX] Workaround fix for UnicodeEncodeError in api.py

https://gerrit.wikimedia.org/r/228620

Change 228620 abandoned by Xqt:
[FIX] Workaround fix for UnicodeEncodeError in api.py

https://gerrit.wikimedia.org/r/228620

jayvdb lowered the priority of this task from High to Low. Aug 21 2015, 10:36 AM

jayvdb removed a project: Patch-For-Review.

With https://gerrit.wikimedia.org/r/231566, I believe we've got a good solution, which means this bug won't have much impact.

jayvdb updated the task description. Oct 2 2015, 8:30 AM

XZise mentioned this in T72976: codec encoding problems on win32. Oct 2 2015, 9:49 AM

The mentioned django bug report does not mention https://www.python.org/dev/peps/pep-3138/#motivation:

Convert other non-printable characters (0x00-0x1f, 0x7f) and non-ASCII characters (>= 0x80) to '\xXX'.

[…]

For Unicode strings, the following additional conversions are done.

Convert leading surrogate pair characters without trailing character (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\uXXXX'.

Convert 16-bit characters (>= 0x100) to '\uXXXX'.

Convert 21-bit characters (>= 0x10000) and surrogate pair characters to '\U00xxxxxx'.

[…]

This algorithm converts any string to printable ASCII, and repr() is used as a handy and safe way to print strings for debugging or for logging.
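
The conversion rules quoted above can be observed in Python 3 through the built-in ascii(), which PEP 3138 added to behave like the Python 2 repr(). The sample strings below are chosen only for illustration.

```python
# One sample per escape form described in the quoted rules.
samples = {
    'below 0x100': 'öäöä',
    '16-bit range': 'کاربر',
    'beyond 16 bits': '\U0001F600',
}

# ascii() escapes every non-ASCII character, so the result is pure ASCII.
escaped = {label: ascii(text) for label, text in samples.items()}
for label, value in escaped.items():
    print(label, value)
```

Each result is a str that contains only ASCII characters, including the surrounding quotes that repr() adds for strings.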

So at least when Python 3 was implemented, the repr() function was defined as returning ASCII. But who knows whether they meant bytes or an ASCII-compatible str. So in the end it might be legitimate to return unicode, as long as it only contains ASCII characters (Python would then automatically convert one into the other without a problem).
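
A sketch of what such an ASCII-only representation could look like. FakePage is a hypothetical stand-in and not pywikibot's implementation; the choice of the `backslashreplace` error handler is likewise an assumption made for this example.

```python
class FakePage(object):
    """Hypothetical stand-in for a Page; not pywikibot's implementation."""

    def __init__(self, title):
        self._title = title

    def title(self):
        return self._title

    def __repr__(self):
        # Escape all non-ASCII characters instead of using the console encoding.
        escaped = self.title().encode('ascii', 'backslashreplace').decode('ascii')
        return '{0}({1})'.format(self.__class__.__name__, escaped)


page = FakePage('Ümlaut')
text = 'Hello: %r!' % [page]
print(text)
```

Because the result contains only ASCII characters, it can be mixed with byte strings or unicode strings without triggering an implicit decode error. A title that already contains literal backslashes would be ambiguous in this form.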

Dvorapa mentioned this in T60574: unicodeDecodeError in url2unicode(). Jun 3 2018, 4:24 PM

Xqt added a parent task: T243770: Pywikibot Python 2 compatibility (tracking). Jan 27 2020, 3:01 PM

Xqt lowered the priority of this task from Low to Lowest. Mar 10 2020, 10:51 AM

Declined because Python 2 has been dropped.

Change 608690 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [tests] Remove failing test_ASCII_compatible test

https://gerrit.wikimedia.org/r/c/pywikibot/core/+/608690

gerritbot added a project: Patch-For-Review. Jun 30 2020, 4:54 PM

Change 608690 merged by jenkins-bot:
[pywikibot/core@master] [tests] Remove failing test_ASCII_compatible test

https://gerrit.wikimedia.org/r/c/pywikibot/core/+/608690

Xqt mentioned this in rPWBC1084bb4668e9: [tests] Remove failing test_ASCII_compatible test. Jul 1 2020, 7:55 AM
