I did find that if you add the unicode_literals directive, you should also add something like:
# -*- coding: utf-8 -*-
as the first or second line of your .py file. Otherwise, lines such as:
foo = "barré"
result in an error such as:
SyntaxError: Non-ASCII character '\xc3' in file mumble.py on line 198,
but no encoding declared; see http://www.python.org/peps/pep-0263.html
for details
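To make the "first or second line" rule concrete, here is a rough sketch that looks for a declaration using the pattern given in PEP 263. The names CODING_RE and declared_encoding are made up for illustration, and this is a simplification of what the interpreter really does (it ignores BOMs, for instance), so treat it as an approximation only:

```python
import re

# Pattern quoted from PEP 263; the name CODING_RE is just for this sketch.
CODING_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')


def declared_encoding(source_lines):
    """Return the encoding named in the first two lines, or None.

    Hypothetical helper: a simplified approximation of the real lookup.
    """
    for line in source_lines[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None


# Both spellings used in this answer are recognised by the pattern.
emacs_style = declared_encoding(['# -*- coding: utf-8 -*-', 'foo = 1'])
short_style = declared_encoding(['#!/usr/bin/env python', '# encoding: utf-8'])
too_late = declared_encoding(['import os', 'import sys', '# coding: utf-8'])
```

Note that "# encoding: utf-8" matches because the pattern only looks for "coding:" somewhere inside a comment, and a declaration on the third line or later is not found.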
The main source of problems I've had when working with unicode strings is mixing UTF-8 encoded byte strings with unicode ones.
For example, consider the following scripts.
two.py
# encoding: utf-8
name = 'helló wörld from two'
one.py
# encoding: utf-8
from __future__ import unicode_literals
import two
name = 'helló wörld from one'
print name + two.name
The output of running python one.py is:
Traceback (most recent call last):
  File "one.py", line 5, in <module>
    print name + two.name
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
In this example, two.name is a UTF-8 encoded byte string (not unicode) because two.py did not import unicode_literals, while one.name is a unicode string. When you mix the two, Python tries to decode the byte string, assuming it is ASCII, in order to convert it to unicode, and fails. It would work if you did print name + two.name.decode('utf-8').
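If you cannot control which modules use unicode_literals, one option is to normalise values at the point where you combine them. A minimal sketch, assuming the byte strings are UTF-8 encoded; to_text is a hypothetical helper, and two_name merely stands in for two.name from the example above:

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def to_text(value, encoding='utf-8'):
    """Return value as a unicode string, decoding byte strings explicitly.

    Hypothetical helper; the UTF-8 default is an assumption about the data.
    """
    if isinstance(value, bytes):  # bytes is an alias for str in Python 2
        return value.decode(encoding)
    return value


one_name = 'helló wörld from one'                   # unicode literal
two_name = 'helló wörld from two'.encode('utf-8')   # stands in for two.name

# Both operands are unicode by the time they are concatenated,
# so no implicit ASCII decode takes place.
combined = to_text(one_name) + to_text(two_name)
```

The point is only that the decode happens explicitly, with an encoding you chose, instead of implicitly with ASCII.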
The same thing can happen if you encode a string and later mix it with unicode strings.
For example, this works:
# encoding: utf-8
html = '<html><body>helló wörld</body></html>'
if isinstance(html, unicode):
    html = html.encode('utf-8')
print 'DEBUG: %s' % html
Output:
DEBUG: <html><body>helló wörld</body></html>
But after adding the unicode_literals import, it does NOT:
# encoding: utf-8
from __future__ import unicode_literals
html = '<html><body>helló wörld</body></html>'
if isinstance(html, unicode):
    html = html.encode('utf-8')
print 'DEBUG: %s' % html
Output:
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    print 'DEBUG: %s' % html
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 16: ordinal not in range(128)
It fails because 'DEBUG: %s' is a unicode string, so Python tries to decode html, which by that point is a UTF-8 encoded byte string, using the ASCII codec. A couple of ways to fix the print are print str('DEBUG: %s') % html or print 'DEBUG: %s' % html.decode('utf-8').
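The implicit step that Python 2 performs here can be reproduced by hand, which makes the "position 16" in the traceback easier to read. This is only a sketch of the equivalent explicit calls, not what the interpreter literally runs; the variable names are mine:

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

html_bytes = '<html><body>helló wörld</body></html>'.encode('utf-8')

# Roughly what happens behind the scenes when a byte string is mixed
# into a unicode format string in Python 2: an ASCII decode.
try:
    html_bytes.decode('ascii')
    failure = None
except UnicodeDecodeError as exc:
    failure = exc

# The first non-ASCII byte is the 0xc3 that starts the encoded "ó".
offending_byte = html_bytes[failure.start:failure.start + 1]

# Decoding explicitly with the right codec avoids the error.
fixed = 'DEBUG: %s' % html_bytes.decode('utf-8')
```

Here failure.start is 16 because '<html><body>hell' is sixteen ASCII bytes long, which matches the position reported in the traceback.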
I hope this helps you understand the potential gotchas when using unicode strings.
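Also take into account that unicode_literals will affect eval() but not repr() (an asymmetric behavior which imho is a bug), i.e. eval(repr(b'\xa4')) won't be equal to b'\xa4' (as it would be with Python 3). In Python 2.7, repr(b'\xa4') produces '\xa4' without a b prefix, and eval() then reads that as a unicode literal.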
Ideally, the following code would be an invariant, which should always work, for all combinations of unicode_literals and Python {2.7, 3.x} usage:
from __future__ import unicode_literals
bstr = b'\xa4'
assert eval(repr(bstr)) == bstr # fails in Python 2.7, holds in 3.1+
ustr = '\xa4'
assert eval(repr(ustr)) == ustr # holds in Python 2.7 and 3.1+
The second assertion happens to work, since repr('\xa4') evaluates to u'\xa4' in Python 2.7.
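A possible workaround for the first assertion is to compile the repr() output without inheriting the calling module's future flags. This is a sketch under the assumption that passing dont_inherit=True to compile() keeps unicode_literals from applying to the evaluated source in 2.7; eval_repr is a hypothetical helper name, and I would test it on your own interpreter before relying on it:

```python
from __future__ import unicode_literals


def eval_repr(value):
    """Evaluate repr(value) without inheriting this module's __future__ flags.

    Hypothetical helper, only meant for literals you produced yourself.
    """
    # Positional arguments: flags=0, dont_inherit=True. The source is then
    # compiled as if no future statements were active in the caller.
    code = compile(repr(value), '<repr>', 'eval', 0, True)
    return eval(code)


bstr = b'\xa4'
ustr = '\xa4'
roundtrip_bytes = eval_repr(bstr)
roundtrip_text = eval_repr(ustr)
```

On Python 3 both round trips hold with or without the helper, so it should only make a difference on 2.7.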