Python, unicode and exceptions

2020-06-03

Code compatibility between Python 2 and Python 3 is tedious.
Python 3's str corresponds to Python 2's unicode, and Python 3's bytes corresponds to Python 2's str. See this page for a nice overview. Python 2 also has the u"..." prefix to specify a unicode string; it was removed in Python 3.0 and reintroduced in Python 3.3 to ease code compatibility.
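As a small illustration of the mapping above, the following sketch encodes one character and compares lengths. The variable names are only for illustration, and the coding comment matters only to Python 2, where it has to be on the first or second line of the file.

```python
# -*- coding: utf-8 -*-
text = u"我"                  # text: str in Python 3, unicode in Python 2
data = text.encode("utf-8")   # bytes: bytes in Python 3, str in Python 2

print(len(text))  # number of code points
print(len(data))  # number of UTF-8 bytes
print(data.decode("utf-8") == text)
```

The text has one code point while its UTF-8 encoding takes three bytes, which is why mixing the two types silently goes wrong so easily.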

String formatting and unicode

In Python 2, %s (old style) was the usual way to format a string. The {}.format() method (new style) was introduced later and is preferred in Python 3, leaving f-strings aside. See pyformat for more details about both styles.

However, %s and {}.format() behave differently in Python 2 when the template is a str and the argument is unicode:

"{}".format(u"我") raises UnicodeEncodeError: the result stays str, so the argument is implicitly encoded to ASCII
"%s" % u"我" works: the result is promoted to unicode
u"{}".format(u"我") and u"%s" % u"我" both work

So, as a rule of thumb, prefix every text literal with u"" (valid in Python 2 and in Python >= 3.3), and decode/encode with UTF-8 only where bytes are actually needed.
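One way to apply the "decode/encode when needed" part is a pair of small helpers. This is only a sketch: to_text, to_bytes, PY2 and text_type are hypothetical names, not part of any library mentioned here.

```python
import sys

PY2 = sys.version_info[0] == 2
# `unicode` only exists in Python 2; the conditional never evaluates it in Python 3.
text_type = unicode if PY2 else str  # noqa: F821


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings if needed (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value if isinstance(value, text_type) else text_type(value)


def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding text if needed (hypothetical helper)."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value
```

Calling to_text on everything that enters the program and to_bytes on what leaves it keeps the rest of the code working with text only.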

Note: the import from __future__ import unicode_literals makes every unprefixed string literal in the importing module unicode in Python 2, but it may break some things depending on the code base. More details on python-future.

Exceptions and unicode

When a Python 2 exception is printed, str is called on it. This fails if the message is a unicode string with non-ASCII characters, and the message is replaced by <exception str() failed>. So the message must be encoded to UTF-8, but that breaks in Python 3: the exception receives bytes and prints their repr (b'...' with escaped bytes) instead of the text. See this question on Stackoverflow for more context.
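The Python 3 side of the problem can be reproduced with a short sketch (run it with Python 3; the variable names are only for illustration):

```python
# -*- coding: utf-8 -*-
# Python 3: a message that was encoded "for Python 2" is now a bytes object.
msg = u"test message 我".encode("utf-8")

# Printing an exception calls str() on it, which gives the repr of the bytes.
shown = str(Exception(msg))
print(shown)
```

The printed message starts with b' and shows the character as escaped bytes rather than as 我.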

In practice, all messages should be in UTF-8, so to raise an exception whose message contains non-ASCII characters, each version has to be handled separately:

# -*- coding: utf-8 -*-
import sys

msg = u"test message 我"
# Python 2 only: encode the unicode message so that str(exception) succeeds
if sys.version_info.major == 2 and isinstance(msg, unicode):
    msg = msg.encode("utf-8")
raise Exception(msg)  # Works in Python 2 and 3

This behavior does not occur when printing the traceback with traceback.print_exc().
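If many exceptions are raised this way, the version check can live in one place. The helper below is a sketch of that idea; exception_message is a hypothetical name.

```python
# -*- coding: utf-8 -*-
import sys


def exception_message(msg):
    """Return msg in the form that prints correctly on the running Python version."""
    # Python 2 only: unicode must be encoded to UTF-8 bytes.
    # `unicode` is never evaluated in Python 3 because of short-circuiting.
    if sys.version_info[0] == 2 and isinstance(msg, unicode):  # noqa: F821
        msg = msg.encode("utf-8")
    return msg


try:
    raise ValueError(exception_message(u"test message 我"))
except ValueError as exc:
    printed = str(exc)  # what printing the exception would display
```

In Python 3 the message is passed through unchanged, so printed holds the original text.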

__str__ and __unicode__ methods

The __str__ method in a class returns the human-readable string of an object when called with str or in a format string.

In Python 2, __str__ must return bytes (str), and in Python 3 it must return text (str, the former unicode). In Python 2, returning unicode with non-ASCII characters from __str__ raises a UnicodeEncodeError, so the special method __unicode__ must be used instead, but it is ignored in Python 3.

To make code compatible with both versions, the six package has a decorator called six.python_2_unicode_compatible that, when used on a class, will:

In Python 2, move the __str__ method to __unicode__ and define a new __str__ that returns the result encoded in UTF-8
Do nothing in Python 3
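To make the two cases concrete, here is a simplified stand-in that does roughly the same thing. It is a sketch for illustration, not six's actual code; unicode_compatible and Greeting are hypothetical names.

```python
# -*- coding: utf-8 -*-
import sys


def unicode_compatible(cls):
    """Simplified stand-in for six.python_2_unicode_compatible (illustration only)."""
    if sys.version_info[0] == 2:
        # Keep the text-returning method available as __unicode__ ...
        cls.__unicode__ = cls.__str__
        # ... and make __str__ return UTF-8 encoded bytes, as Python 2 expects.
        cls.__str__ = lambda self: self.__unicode__().encode("utf-8")
    # Python 3: the class is returned unchanged.
    return cls


@unicode_compatible
class Greeting(object):
    def __str__(self):
        return u"我"


print(str(Greeting()))
```

The class itself only ever defines __str__ returning text; the Python 2 specifics stay inside the decorator.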

For example, the following will work for both Python versions:

import six

@six.python_2_unicode_compatible
class Test(object):
    def __str__(self):
        return u"我"

Be careful: __repr__ has the same problem, and the decorator does not fix it.
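One possible workaround, sketched below under the assumption that UTF-8 output is acceptable for repr, is to encode by hand in Python 2. The Point class and its name attribute are made up for the example.

```python
# -*- coding: utf-8 -*-
import sys


class Point(object):
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        text = u"Point(name={})".format(self.name)
        # Python 2 expects __repr__ to return a byte string.
        if sys.version_info[0] == 2:
            return text.encode("utf-8")
        return text


print(repr(Point(u"我")))
```

Another option is to keep __repr__ ASCII-only, for example by escaping non-ASCII characters, which avoids the version check entirely.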
