In Python 2 there are two variants of string: those made of bytes (type str) and those made of text (type unicode).
In Python 2, an object of type str is always a byte sequence, but it is commonly used for both text and binary data.
A string literal is interpreted as a byte string.
s = 'Cafe' # type(s) == str
There are two exceptions. You can define a Unicode (text) literal explicitly by prefixing the literal with u:
s = u'Café' # type(s) == unicode
b = 'Lorem ipsum' # type(b) == str
Alternatively, you can specify that all string literals in a module should create Unicode (text) objects:
from __future__ import unicode_literals
s = 'Café' # type(s) == unicode
b = 'Lorem ipsum' # type(b) == unicode
In order to check whether your variable is a string (either Unicode or a byte string), you can use:
isinstance(s, basestring)
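For code that has to run on both versions, one common pattern is to look up basestring and fall back to str when the name does not exist. This is only a minimal sketch; the names string_types and is_string are illustrative and not part of any standard API:

```python
# basestring exists only in Python 2; on Python 3 the lookup raises NameError.
try:
    string_types = (basestring,)  # Python 2: matches both str and unicode
except NameError:
    string_types = (str,)         # Python 3: str is the only string type


def is_string(value):
    """Return True if value counts as a string on the running interpreter."""
    return isinstance(value, string_types)


print(is_string('Cafe'))  # True on both versions
print(is_string(42))      # False on both versions
```

Note that on Python 3 this helper rejects bytes objects, while on Python 2 it accepts byte strings, because str is a byte string there.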
In Python 3, the str type is a Unicode text type.
s = 'Cafe' # type(s) == str
s = 'Café' # type(s) == str (note the accented trailing e)
Additionally, Python 3 added a bytes object, suitable for binary "blobs" or writing to encoding-independent files. To create a bytes object, you can prefix a string literal with b or call the string's encode method:
# Or, if you really need a byte string:
s = b'Cafe' # type(s) == bytes
s = 'Café'.encode() # type(s) == bytes
To test whether a value is a string, use:
isinstance(s, str)
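A short sketch of how the two Python 3 types stay separate may help here. It assumes Python 3 and a UTF-8 source file; the variable names are arbitrary:

```python
text = 'Café'          # str: a sequence of Unicode code points
data = text.encode()   # bytes: the UTF-8 encoding of that text

print(isinstance(text, str))  # True
print(isinstance(data, str))  # False: bytes is not a kind of str
print(len(text), len(data))   # 4 5, because é takes two bytes in UTF-8

# Python 3 refuses to mix the two types implicitly.
try:
    text + data
except TypeError as exc:
    print('cannot mix str and bytes:', exc)
```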
It is also possible to prefix string literals with u to ease compatibility between Python 2 and Python 3 code bases (Python 3.3 and later accept the prefix). Since all strings in Python 3 are Unicode by default, prefixing a string literal with u has no effect:
u'Cafe' == 'Cafe'
Python 2’s raw Unicode string prefix ur is not supported, however:
>>> ur'Café'
File "<stdin>", line 1
ur'Café'
^
SyntaxError: invalid syntax
Note that you must encode a Python 3 text (str) object to convert it into a bytes representation of that text. The default encoding of this method is UTF-8.
You can use decode to ask a bytes object what Unicode text it represents:
>>> b = 'Café'.encode()
>>> b.decode()
'Café'
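Because encode and decode both accept an explicit codec name, it can be useful to see what happens when the two sides disagree. The following is a sketch for Python 3; latin-1 is chosen only as a convenient second codec for comparison:

```python
text = 'Café'

utf8 = text.encode('utf-8')      # é becomes two bytes
latin1 = text.encode('latin-1')  # é becomes one byte
print(utf8)    # b'Caf\xc3\xa9'
print(latin1)  # b'Caf\xe9'

# Decoding must use the codec the bytes were written with.
print(utf8.decode('utf-8') == text)      # True
print(latin1.decode('latin-1') == text)  # True

# The wrong codec either fails outright...
try:
    latin1.decode('utf-8')
except UnicodeDecodeError as exc:
    print('decode failed:', exc.reason)

# ...or silently produces different text (no error, but not the original).
mojibake = utf8.decode('latin-1')
print(mojibake == text)                  # False
```

Passing the codec explicitly on both sides avoids relying on the UTF-8 default and makes the intent visible to readers.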
While the bytes type exists in both Python 2 and 3 (in Python 2.6 and later it is simply an alias for str), the unicode type only exists in Python 2. To use Python 3's implicit Unicode strings in Python 2, add the following to the top of your code file:
from __future__ import unicode_literals
print(repr("hi"))
# u'hi'
Another important difference is that indexing bytes in Python 3 results in an int:
b"abc"[0] == 97
Slicing with a width of one, by contrast, yields a bytes object of length 1:
b"abc"[0:1] == b"a"
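The same int-versus-bytes distinction shows up when iterating or converting back. This is a small Python 3 sketch of the usual ways to move between the two:

```python
data = b"abc"

print(list(data))         # [97, 98, 99]: iterating yields ints as well
print(bytes([data[0]]))   # b'a': wrap the int in a list to get bytes back
print(chr(data[0]))       # a (a str, not bytes)
print(data[:1] == b"a")   # True

# Caution: bytes(n) with a bare int creates n zero bytes, not the byte n.
print(bytes(3))           # b'\x00\x00\x00'
```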
In addition, Python 3 fixes some surprising Unicode behavior from Python 2. For example, reversing a plain str literal in Python 2 reverses its bytes, which splits multi-byte characters; in Python 3 the same literal is text and reverses correctly:
# -*- coding: utf8 -*-
print("Hi, my name is Łukasz Langa.")
print(u"Hi, my name is Łukasz Langa."[::-1])
print("Hi, my name is Łukasz Langa."[::-1])
# Output in Python 2
# Hi, my name is Łukasz Langa.
# .agnaL zsakuŁ si eman ym ,iH
# .agnaL zsaku�� si eman ym ,iH
# Output in Python 3
# Hi, my name is Łukasz Langa.
# .agnaL zsakuŁ si eman ym ,iH
# .agnaL zsakuŁ si eman ym ,iH
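The Python 2 result can be reproduced in Python 3 by reversing the encoded bytes explicitly. This sketch assumes UTF-8, as in the example above, and uses ascii() only so that the output is safe to print on any terminal:

```python
name = 'Łukasz'
encoded = name.encode('utf-8')   # Ł is stored as the two bytes C5 81

# Reversing text reverses code points, so every character stays intact.
print(name[::-1] == 'zsakuŁ')    # True

# Reversing the bytes swaps the two bytes of Ł, which is no longer valid UTF-8.
reversed_bytes = encoded[::-1]
print(reversed_bytes[-2:])       # b'\x81\xc5'
try:
    reversed_bytes.decode('utf-8')
except UnicodeDecodeError:
    print('reversed bytes are not valid UTF-8')

# Decoding with errors='replace' shows the damage as replacement characters.
damaged = reversed_bytes.decode('utf-8', errors='replace')
print(ascii(damaged))
```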