Code compatibility between Python 2 and Python 3 is tedious.
In Python 3, str is the unicode of Python 2 and bytes is the str of Python 2. See this page for a nice overview. Python 2 also has the u"..." prefix to specify a unicode string; it was reintroduced in Python 3.3 to ease code compatibility.
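As a minimal sketch of that relationship, the snippet below round-trips one character between text and bytes. It is written to behave the same under both versions; the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
text = u"ζ"                  # unicode in Python 2, str in Python 3
data = text.encode("utf-8")  # str in Python 2, bytes in Python 3

# One character of text becomes two UTF-8 bytes.
assert len(text) == 1
assert data == b"\xce\xb6"

# Decoding with the same codec gives the original text back.
assert data.decode("utf-8") == text
```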
String formatting and unicode
In Python 2, %s (old style) was the usual way to format a string; {}.format() (new style) was introduced in Python 2.6 and is the preferred style in Python 3, f-strings aside. See pyformat for more details about both styles.
However, %s and {}.format() behave differently in Python 2:
- "{}".format(u"ζ") will raise UnicodeEncodeError because the result stays str
- "%s" % u"ζ" works and will upgrade the str to unicode
- u"{}".format(u"ζ") or u"%s" % u"ζ" both work
So, as a rule of thumb, it is good to prefix every text string with u"" (valid in Python 2 and in Python >= 3.3), and to decode/encode in UTF-8 only where bytes are actually needed.
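One possible way to apply this rule is to funnel conversions through two small helpers. The names to_text and to_bytes are hypothetical (they are not part of the standard library), and the sketch assumes every byte string in the code base is UTF-8.

```python
# -*- coding: utf-8 -*-
text_type = type(u"")  # unicode on Python 2, str on Python 3


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return text_type(value)


def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding it first if it is text."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value


# Formatting stays in unicode; bytes are only produced at the boundary.
label = u"{} = {}".format(to_text(b"\xce\xb6"), u"zeta")
payload = to_bytes(label)
```

With helpers like these, the format string and all its arguments are text on both versions, so the Python 2 differences between %s and {}.format() listed above no longer come into play.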
Note: the import from __future__ import unicode_literals turns every unprefixed string literal of a Python 2 module into unicode, but it may break some things depending on the code base. More details on python-future.
Exceptions and unicode
In Python 2, printing an exception calls str on it, which fails with the error <exception str() failed> if the message is a unicode string containing non-ASCII characters. Encoding the message to UTF-8 fixes that, but it breaks Python 3: the exception then holds bytes and prints their repr (for example b'test message \xce\xb6') instead of the text. See this question on Stack Overflow for more context.
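The Python 3 half of the problem can be reproduced with a short sketch. It encodes the message unconditionally, which is the naive fix for Python 2, and then looks at what str returns; the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
import sys

msg = u"test message ζ"
encoded = msg.encode("utf-8")  # what Python 2 needs

# str() of an exception with a single argument is str() of that argument.
shown = str(Exception(encoded))

if sys.version_info.major >= 3:
    # Python 3 keeps the bytes as-is and shows their repr, escapes included.
    assert shown == "b'test message \\xce\\xb6'"
else:
    # Python 2 returns the UTF-8 byte string unchanged.
    assert shown == encoded
```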
In practice, all messages should be in UTF-8, so to raise an exception whose message contains unicode characters, the message has to be encoded in Python 2 only:
import sys
msg = u"test message ζ"
major_version = sys.version_info.major
if major_version == 2 and isinstance(msg, unicode):
    msg = msg.encode("utf-8")
raise Exception(msg) # Works in python 2 and 3
This behavior does not occur when printing the traceback with traceback.print_exc().
__str__ and __unicode__ methods
The __str__ method in a class returns the human-readable string of an object when called with str or in a format string.
In Python 2, __str__ must return bytes, and in Python 3 it returns unicode. In Python 2, returning a unicode string with non-ASCII characters from __str__ raises a UnicodeEncodeError; the text has to be returned by the special method __unicode__ instead, but that method is ignored in Python 3.
To make code compatible with both versions, the six package has a decorator called six.python_2_unicode_compatible that, when used on a class, will:
- In Python 2, rename the __str__ method to __unicode__ and add a new __str__ that returns the same string encoded in UTF-8
- Do nothing in Python 3
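To make those two steps concrete, here is a simplified stand-in for the decorator. The name unicode_compatible is hypothetical and this is only a sketch of the behavior described above, not the six source; real code should keep using six.

```python
# -*- coding: utf-8 -*-
import sys

PY2 = sys.version_info[0] == 2


def unicode_compatible(cls):
    """Hypothetical stand-in for six.python_2_unicode_compatible."""
    if PY2:
        if "__str__" not in cls.__dict__:
            raise ValueError("%s must define __str__()" % cls.__name__)
        # The text-returning method becomes __unicode__ ...
        cls.__unicode__ = cls.__str__
        # ... and the new __str__ returns that text encoded in UTF-8.
        cls.__str__ = lambda self: self.__unicode__().encode("utf-8")
    # On Python 3 the class is returned untouched.
    return cls


@unicode_compatible
class Greeting(object):
    def __str__(self):
        return u"ζ"
```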
For example, the following will work for both Python versions:
import six
@six.python_2_unicode_compatible
class Test(object):
    def __str__(self):
        return u"ζ"
Be careful: __repr__ has the same problem in Python 2, and the decorator does not fix it.
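One possible workaround for __repr__, sketched below, is to build the representation as text and encode it by hand on Python 2 only. The Tag class and its name attribute are hypothetical, and the sketch assumes UTF-8 output is acceptable wherever the repr is shown.

```python
# -*- coding: utf-8 -*-
import sys


class Tag(object):
    """Hypothetical class whose repr may contain non-ASCII text."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        text = u"Tag(name={})".format(self.name)
        if sys.version_info.major == 2:
            # Python 2 expects a byte string from __repr__.
            return text.encode("utf-8")
        return text


shown = repr(Tag(u"ζ"))
```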