Code compatibility between Python 2 and Python 3 is tedious.
In Python 3, str is what unicode was in Python 2, and bytes is what str was in Python 2. See this page for a nice overview. Python 2 also has the u"..." prefix to mark a unicode string; the prefix was dropped in Python 3.0 and reintroduced in Python 3.3 to ease code compatibility.
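As a small illustration of the mapping described above, the following sketch round-trips one character between text and bytes. The variable names are only illustrative, and the character is written as an escape so the snippet stays ASCII-only in both Python versions.

```python
text = u"\u03b6"             # the character ζ; the u prefix is accepted on Python 3.3+
data = text.encode("utf-8")  # text -> bytes

print(type(text).__name__)   # str on Python 3, unicode on Python 2
print(type(data).__name__)   # bytes on Python 3, str on Python 2
print(data == b"\xce\xb6")   # True: ζ takes two bytes in UTF-8
print(data.decode("utf-8") == text)  # True: decoding restores the text
```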
String formatting and unicode
In Python 2, %s (old style) was the usual way to format a string. Then {}.format() (new style) formatting was introduced, and it is the preferred style in Python 3, leaving f-strings aside. See pyformat for more details about both styles.
However, %s and {}.format() behave differently in Python 2:
- "{}".format(u"ζ") raises UnicodeEncodeError because the result stays a str
- "%s" % u"ζ" works and upgrades the str to unicode
- u"{}".format(u"ζ") and u"%s" % u"ζ" both work
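The four cases can be checked with a short script like the one below. It is only a sketch: try_format is a hypothetical helper introduced here, and the character is written as an escape to avoid needing a source encoding declaration in Python 2. On Python 3 all four calls succeed; on Python 2.7 the first one is expected to report UnicodeEncodeError.

```python
ZETA = u"\u03b6"  # the character ζ, written as an escape to stay ASCII-only


def try_format(label, func):
    """Run one formatting call and record its result or its encoding error."""
    try:
        return label, func(), None
    except UnicodeEncodeError as exc:
        return label, None, exc


results = [
    try_format('"{}".format(u)', lambda: "{}".format(ZETA)),
    try_format('"%s" % u', lambda: "%s" % ZETA),
    try_format('u"{}".format(u)', lambda: u"{}".format(ZETA)),
    try_format('u"%s" % u', lambda: u"%s" % ZETA),
]

for label, value, error in results:
    status = "ok" if error is None else type(error).__name__
    print("%-18s %s" % (label, status))
```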
So, as a rule of thumb, it is good to prefix string literals with u"" everywhere (which requires Python >= 3.3 on the Python 3 side), and to decode/encode in UTF-8 when needed.
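One way to keep the decode/encode step in a single place is a pair of small helpers, sketched below. The names to_text and to_bytes are hypothetical and not part of any library mentioned here; the sketch assumes UTF-8 everywhere.

```python
import sys

PY2 = sys.version_info[0] == 2
# The conditional expression only evaluates the name unicode on Python 2.
text_type = unicode if PY2 else str  # noqa: F821


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding it first if it is text."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value
```

Calling to_text at the boundaries where data comes in, and to_bytes where it goes out, lets the rest of the code work with text only.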
Note: the import from __future__ import unicode_literals makes every unprefixed string literal in a Python 2 module a unicode, but it may break things depending on the code base. More details on python-future.
Exceptions and unicode
In Python 2, an Exception calls str on its message when printed, and fails with the error <exception str() failed> if the message is a unicode containing non-ASCII characters. So the message must be encoded to UTF-8, but that breaks in Python 3 because the exception then gets bytes and displays escaped byte values when printed. See this question on Stackoverflow for more context.
In practice, all messages should be in UTF-8, so raising an exception whose message contains unicode characters requires handling both versions:
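import sys

msg = u"test message ζ"
major_version = sys.version_info.major
# Python 2 exceptions need a byte string; Python 3 exceptions need text.
# The name unicode is only evaluated on Python 2 thanks to short-circuiting.
if major_version == 2 and isinstance(msg, unicode):
    msg = msg.encode("utf-8")
raise Exception(msg)  # Works in Python 2 and 3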
This behavior does not happen when printing the traceback with traceback.print_exc().
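If the version check for exception messages is needed in several places, it could be wrapped in a function. The sketch below assumes a hypothetical helper named native_message; it applies the same encoding rule as the snippet above and leaves the message untouched on Python 3.

```python
import sys

PY2 = sys.version_info[0] == 2


def native_message(msg):
    """Return msg in the string type that Exception prints correctly."""
    # unicode only exists on Python 2; the PY2 check short-circuits first.
    if PY2 and isinstance(msg, unicode):  # noqa: F821
        return msg.encode("utf-8")
    return msg


try:
    raise ValueError(native_message(u"test message \u03b6"))
except ValueError as exc:
    print(str(exc))
```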
__str__ and __unicode__ methods
The __str__ method of a class returns the human-readable string of an object when it is passed to str or used in a format string. In Python 2, __str__ must return bytes, and in Python 3, unicode. In Python 2, a UnicodeEncodeError is raised when __str__ returns a unicode containing non-ASCII characters. The special method __unicode__ must be used instead, but it is ignored in Python 3.
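Without any helper library, both methods can be defined by hand. The following is a minimal sketch with a hypothetical class name: the text lives in __unicode__, and __str__ is chosen at class-definition time depending on the interpreter version.

```python
import sys

PY2 = sys.version_info[0] == 2


class Letter(object):
    def __unicode__(self):
        # Single source of truth: always returns text
        return u"\u03b6"

    if PY2:
        def __str__(self):
            # Python 2 expects bytes from __str__
            return self.__unicode__().encode("utf-8")
    else:
        # Python 3 expects text from __str__, so reuse the same method
        __str__ = __unicode__


print(str(Letter()))
```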
To make code compatible with both versions, the six package has a decorator called six.python_2_unicode_compatible that when used on a class will:
- In Python 2, move the __str__ method to __unicode__ and add a new __str__ that returns the UTF-8 encoded result
- Do nothing in Python 3
For example, the following will work for both Python versions:
import six

@six.python_2_unicode_compatible
class Test(object):
    def __str__(self):
        # Always return text; the decorator handles the Python 2 encoding
        return u"ζ"
Be careful: __repr__ has the same problem, and the decorator does not fix it.
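One possible workaround for __repr__, shown here only as a sketch with a hypothetical class, is to return an ASCII-only representation by escaping non-ASCII characters. This sidesteps the encoding problem in Python 2 at the cost of a less readable output.

```python
class Tag(object):
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        # unicode_escape turns non-ASCII characters into \uXXXX sequences,
        # so the result is pure ASCII and safe under both versions.
        escaped = self.name.encode("unicode_escape").decode("ascii")
        return "Tag('%s')" % escaped


print(repr(Tag(u"\u03b6")))
```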