Python, unicode and exceptions

Keeping code compatible with both Python 2 and Python 3 is tedious, and string handling is a large part of the reason.
In Python 3, str is what Python 2 called unicode, and bytes is what Python 2 called str. See this page for a nice overview. Python 2 also has the u"..." prefix to mark a unicode string; the prefix was removed in Python 3.0 and reintroduced in Python 3.3 to ease code compatibility.

String formatting and unicode

In Python 2, %s (old style) was the usual way to format a string. {}.format() (new style) was introduced later and is preferred in Python 3, leaving f-strings aside. See pyformat for more details about both styles.

However %s and {}.format() behave differently in Python 2:

  • "{}".format(u"我") raises UnicodeEncodeError, because the result stays a str and the argument is implicitly encoded as ASCII
  • "%s" % u"我" works, because the result is promoted from str to unicode
  • u"{}".format(u"我") and u"%s" % u"我" both work

So, as a rule of thumb, prefix every string literal with u"" (which requires Python >= 3.3 on the Python 3 side), and encode to or decode from UTF-8 only where needed.
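A minimal sketch of that rule of thumb, meant to run unchanged on Python 2.7 and Python 3.3+. The variable names are illustrative only, and the coding comment is an assumption needed by Python 2 when the source file itself contains non-ASCII characters:

```python
# -*- coding: utf-8 -*-
# Every literal carries the u"" prefix, so both formatting styles
# produce text (unicode in Python 2, str in Python 3).
name = u"我"
new_style = u"{}".format(name)
old_style = u"%s" % name

# Encode only at the boundary where bytes are required (file, socket, ...),
# and decode as soon as bytes come back in.
encoded = new_style.encode("utf-8")
decoded = encoded.decode("utf-8")
```

Keeping text as unicode internally and converting only at the edges avoids the implicit ASCII conversions that cause the errors listed above.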

Note: from __future__ import unicode_literals makes every unprefixed string literal in a Python 2 module a unicode string, but it may break things depending on the code base. More details on python-future.

Exceptions and unicode

In Python 2, printing an exception calls str on it, which fails with <exception str() failed> if the message is a unicode string containing non-ASCII characters. The message must therefore be encoded to UTF-8, but that breaks in Python 3: the exception then holds bytes and prints their escaped representation (for example b'\xe6\x88\x91') instead of the text. See this question on Stackoverflow for more context.

In practice, to raise an exception whose message contains non-ASCII characters, the code has to handle both versions:

import sys

msg = u"test message 我"
major_version = sys.version_info.major
if major_version == 2 and isinstance(msg, unicode):
    msg = msg.encode("utf-8")
raise Exception(msg)  # Works in python 2 and 3

The <exception str() failed> error does not occur when the traceback is printed with traceback.print_exc().
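The version check above can be wrapped in a small helper so it is not repeated at every raise. This is only a sketch: native_message is a hypothetical name, not a standard library function, and traceback.format_exc() is used here simply because it returns the text that traceback.print_exc() would print:

```python
# -*- coding: utf-8 -*-
import sys
import traceback

PY2 = sys.version_info[0] == 2


def native_message(msg):
    """Return msg in the string type that Exception prints correctly."""
    # type(u"") is `unicode` on Python 2 and `str` on Python 3.
    if PY2 and isinstance(msg, type(u"")):
        # Python 2 exceptions need a byte string, so encode to UTF-8.
        return msg.encode("utf-8")
    # Python 3 exceptions take text directly.
    return msg


try:
    raise ValueError(native_message(u"test message 我"))
except ValueError:
    # Capture the formatted traceback instead of writing it to stderr.
    formatted = traceback.format_exc()
```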

__str__ and __unicode__ methods

A class's __str__ method returns the human-readable string for an object; it is called by str and by string formatting.

In Python 2, __str__ must return bytes; in Python 3, it must return unicode text. In Python 2, returning unicode with non-ASCII characters from __str__ raises UnicodeEncodeError. The special method __unicode__ must be used instead, but it is ignored in Python 3.

To make code compatible with both versions, the six package provides a class decorator, six.python_2_unicode_compatible, which will:

  • In Python 2, move the class's __str__ method to __unicode__ and define a new __str__ that returns the result encoded as UTF-8
  • Do nothing in Python 3

For example, the following will work for both Python versions:

import six

@six.python_2_unicode_compatible
class Test(object):
    def __str__(self):
        return u"我"
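As a rough illustration of what such a decorator does, here is a simplified stand-in that does not depend on six. The name unicode_compatible_sketch is hypothetical, and this is not six's actual source, only an approximation of its behavior:

```python
# -*- coding: utf-8 -*-
import sys

PY2 = sys.version_info[0] == 2


def unicode_compatible_sketch(cls):
    """Simplified, hypothetical stand-in for six.python_2_unicode_compatible."""
    if PY2:
        # Keep the text-returning method under the name unicode() looks up.
        cls.__unicode__ = cls.__str__
        # str() must return bytes on Python 2, so encode the text as UTF-8.
        cls.__str__ = lambda self: self.__unicode__().encode("utf-8")
    # On Python 3 the class is returned untouched.
    return cls


@unicode_compatible_sketch
class Greeting(object):
    def __str__(self):
        return u"我"


text = str(Greeting())
```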

Be careful: __repr__ has the same problem, and the decorator does not fix it.
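One possible workaround, sketched below under the assumption that a UTF-8 encoded repr is acceptable on Python 2, is to do the version check inside __repr__ itself. The Person class and its name attribute are illustrative only:

```python
# -*- coding: utf-8 -*-
import sys

PY2 = sys.version_info[0] == 2


class Person(object):
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        text = u"Person(name={})".format(self.name)
        if PY2:
            # Python 2 expects a byte string from __repr__.
            return text.encode("utf-8")
        # Python 3 expects text.
        return text


shown = repr(Person(u"我"))
```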