Python 2 has two string types: str, which holds bytes, and unicode, which holds text.
In Python 2, an object of type str is always a byte sequence, but is commonly used for both text and binary data.
A string literal is interpreted as a byte string.
s = 'Cafe' # type(s) == str
There are two exceptions: You can define a Unicode (text) literal explicitly by prefixing the literal with u:
s = u'Café' # type(s) == unicode
b = 'Lorem ipsum' # type(b) == str
Alternatively, you can specify that a whole module's string literals should create Unicode (text) literals:
from __future__ import unicode_literals
s = 'Café' # type(s) == unicode
b = 'Lorem ipsum' # type(b) == unicode
In order to check whether your variable is a string (either Unicode or a byte string), you can use:
isinstance(s, basestring)
In Python 3, the str type is a Unicode text type.
s = 'Cafe' # type(s) == str
s = 'Café' # type(s) == str (note the accented trailing e)
Additionally, Python 3 added a bytes type, suitable for binary "blobs" or for writing to files without applying a text encoding. To create a bytes object, prefix a string literal with b or call the string's encode method:
# Or, if you really need a byte string:
s = b'Cafe' # type(s) == bytes
s = 'Café'.encode() # type(s) == bytes
To test whether a value is a string, use:
isinstance(s, str)
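For code that has to run on both versions, the two checks can be folded into one helper. The following is only a minimal sketch, not a standard API: the names string_types and is_string are made up for illustration, and the version is detected with sys.version_info.

```python
import sys

# Hypothetical names: choose the base string type(s) for the running interpreter.
if sys.version_info[0] >= 3:
    string_types = (str,)          # Python 3: text only; bytes is a separate type
else:
    string_types = (basestring,)   # Python 2: covers both str and unicode


def is_string(value):
    """Return True if value is a string on the running interpreter."""
    return isinstance(value, string_types)


print(is_string('Cafe'))    # True on both versions
print(is_string(42))        # False on both versions
print(is_string(b'Cafe'))   # True on Python 2 (bytes is str), False on Python 3
```

The else branch is never executed on Python 3, so the reference to basestring there does not raise a NameError.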
Python 3.3 and later also accept the u prefix on string literals, which eases compatibility between Python 2 and Python 3 code bases. Since all strings in Python 3 are Unicode by default, the u prefix has no effect:
u'Cafe' == 'Cafe'
Python 2’s raw Unicode string prefix ur is not supported, however:
>>> ur'Café'
  File "<stdin>", line 1
    ur'Café'
           ^
SyntaxError: invalid syntax
Note that a Python 3 text (str) object must be encoded to obtain a bytes representation of that text. The encode method uses UTF-8 by default.
You can use decode to ask a bytes object for what Unicode text it represents:
>>> b = 'Café'.encode()
>>> b.decode()
'Café'
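The choice of encoding matters when going back and forth. The sketch below, written for Python 3, encodes the same text with two codecs and shows what happens when the bytes are decoded with the wrong one. The variable names are illustrative only.

```python
text = 'Caf\u00e9'   # 'Café', written with an escape so the source stays ASCII

utf8_bytes = text.encode('utf-8')       # é becomes two bytes: 0xC3 0xA9
latin1_bytes = text.encode('latin-1')   # é becomes one byte: 0xE9

print(len(text), len(utf8_bytes), len(latin1_bytes))   # 4 5 4

# Decoding with the codec that produced the bytes round-trips cleanly.
print(utf8_bytes.decode('utf-8') == text)       # True
print(latin1_bytes.decode('latin-1') == text)   # True

# Decoding with the wrong codec raises UnicodeDecodeError ...
try:
    latin1_bytes.decode('utf-8')
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True

# ... unless an error handler is given, in which case bad bytes become U+FFFD.
lossy = latin1_bytes.decode('utf-8', errors='replace')
print(mismatch_detected, lossy == 'Caf\ufffd')   # True True
```

Passing the encoding explicitly, rather than relying on the default, makes the intent visible at each conversion point.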
While the bytes type exists in both Python 2 and 3 (in Python 2.6 and later it is simply an alias for str), the unicode type only exists in Python 2. To get Python 3's Unicode-by-default string literals in Python 2, add the following to the top of your code file:
from __future__ import unicode_literals
print(repr("hi"))
# u'hi'
Another important difference is that indexing a bytes object in Python 3 returns an int:
b"abc"[0] == 97
Taking a slice of length one, by contrast, returns a bytes object of length 1:
b"abc"[0:1] == b"a"
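A short Python 3 sketch of the same behavior, including the conversion back from an int to a single byte, which is a common stumbling block:

```python
data = b"abc"

first = data[0]        # indexing gives an int
head = data[0:1]       # slicing gives a bytes object of length 1
values = list(data)    # iterating also yields ints

print(first, head, values)   # 97 b'a' [97, 98, 99]

# To turn an int back into one byte, wrap it in a list or tuple.
one_byte = bytes([first])
print(one_byte == b"a")      # True

# bytes(n) with a bare int creates n zero bytes instead.
zeros = bytes(3)
print(zeros == b"\x00\x00\x00")   # True
```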
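In addition, Python 3 fixes some surprising behavior around Unicode, such as reversing strings that contain non-ASCII characters. In Python 2, reversing a plain str reverses its individual bytes, which breaks up multi-byte UTF-8 sequences. For example, the following issue is resolved: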
# -*- coding: utf8 -*-
print("Hi, my name is Łukasz Langa.")
print(u"Hi, my name is Łukasz Langa."[::-1])
print("Hi, my name is Łukasz Langa."[::-1])
# Output in Python 2
# Hi, my name is Łukasz Langa.
# .agnaL zsakuŁ si eman ym ,iH
# .agnaL zsaku�� si eman ym ,iH
# Output in Python 3
# Hi, my name is Łukasz Langa.
# .agnaL zsakuŁ si eman ym ,iH
# .agnaL zsakuŁ si eman ym ,iH
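The garbled Python 2 output can be reproduced in Python 3 by reversing the encoded bytes explicitly. This is a small sketch with illustrative variable names; it assumes UTF-8, as declared in the coding line of the example above.

```python
name = "\u0141ukasz"   # 'Łukasz', written with an escape so the source stays ASCII

# Reversing text reverses code points, so the result is still valid text.
reversed_text = name[::-1]
print(reversed_text == "zsaku\u0141")   # True

# Reversing the encoded bytes turns Ł (0xC5 0x81 in UTF-8) into 0x81 0xC5,
# which is no longer a valid UTF-8 sequence.
reversed_bytes = name.encode("utf-8")[::-1]
print(reversed_bytes)                   # b'zsaku\x81\xc5'

# Each of the two stray bytes is replaced by U+FFFD when decoding leniently.
garbled = reversed_bytes.decode("utf-8", errors="replace")
print(garbled.count("\ufffd"))          # 2
```

The two replacement characters correspond to the two unreadable characters in the third line of the Python 2 output.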