Python Language Strings: Bytes versus Unicode


Example

Python 2.x (2.7+)

In Python 2 there are two kinds of string: byte strings, of type str, and text strings, of type unicode.

In Python 2, an object of type str is always a byte sequence, but is commonly used for both text and binary data.

A string literal is interpreted as a byte string.

s = 'Cafe'    # type(s) == str

There are two exceptions. First, you can define a Unicode (text) literal explicitly by prefixing the literal with u:

s = u'Café'   # type(s) == unicode
b = 'Lorem ipsum'  # type(b) == str

Alternatively, you can specify that all string literals in a module should be Unicode (text) literals:

from __future__ import unicode_literals

s = 'Café'   # type(s) == unicode
b = 'Lorem ipsum'  # type(b) == unicode

In order to check whether your variable is a string (either Unicode or a byte string), you can use:

isinstance(s, basestring)
Python 3.x (3.0+)

In Python 3, the str type is a Unicode text type.

s = 'Cafe'           # type(s) == str
s = 'Café'           # type(s) == str (note the accented trailing e)

Additionally, Python 3 added a bytes type, suitable for binary "blobs" or for writing to files that are not tied to a text encoding. To create a bytes object, prefix a string literal with b or call the string's encode method:

# Or, if you really need a byte string:
s = b'Cafe'          # type(s) == bytes
s = 'Café'.encode()  # type(s) == bytes
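
As a minimal sketch of the two ways to create a bytes object, the following assumes Python 3 and uses illustrative variable names. It also shows that the number of bytes depends on the encoding chosen, which is why naming the encoding explicitly is often worthwhile:

```python
# Assumes Python 3; the variable names are illustrative only.
text = 'Café'

# Bytes literals may contain only ASCII characters, so non-ASCII bytes
# are written with \x escapes.
literal = b'Caf\xc3\xa9'

# str.encode() defaults to UTF-8; naming the encoding makes it explicit.
utf8_bytes = text.encode('utf-8')
latin1_bytes = text.encode('latin-1')

print(utf8_bytes == literal)   # True
print(len(text))               # 4 characters
print(len(utf8_bytes))         # 5 bytes: the accented e takes two bytes in UTF-8
print(len(latin1_bytes))       # 4 bytes: the accented e takes one byte in Latin-1
```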

To test whether a value is a string, use:

isinstance(s, str)
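
For code that has to run on both Python 2 and Python 3, one possible approach is to choose the type to test against at import time. The names string_types and is_string below are hypothetical, picked only for this sketch:

```python
# Hypothetical helper: `string_types` and `is_string` are illustrative names.
try:
    string_types = (basestring,)   # Python 2: covers both str and unicode
except NameError:
    string_types = (str,)          # Python 3: basestring no longer exists

def is_string(value):
    """Return True if value is a string type on the running Python version."""
    return isinstance(value, string_types)

print(is_string('Cafe'))   # True on both versions
print(is_string(42))       # False on both versions
```

Note that the result for a bytes literal differs by version: is_string(b'Cafe') is True on Python 2, where bytes is str, and False on Python 3.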
Python 3.x (3.3+)

String literals can also carry a u prefix, which eases compatibility between Python 2 and Python 3 code bases. Since all strings in Python 3 are Unicode by default, prepending u to a string literal has no effect:

u'Cafe' == 'Cafe'

Python 2’s raw Unicode string prefix ur is not supported, however:

>>> ur'Café'
  File "<stdin>", line 1
    ur'Café'
           ^
SyntaxError: invalid syntax

Note that you must encode a Python 3 text (str) object to convert it into a bytes representation of that text. The encode method uses UTF-8 by default.

Use decode to recover the Unicode text that a bytes object represents:

>>> b = 'Café'.encode()
>>> b.decode()
'Café'
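
The round trip between text and bytes can be sketched as follows, assuming Python 3. The example also illustrates what can go wrong when the codec used for decoding does not match the one used for encoding; the variable names are illustrative:

```python
# Assumes Python 3.
original = 'Café'
encoded = original.encode('utf-8')

# Decoding with the same codec recovers the original text.
round_tripped = encoded.decode('utf-8')
print(round_tripped == original)   # True

# Decoding with a different codec does not fail here, but yields the wrong text.
mojibake = encoded.decode('latin-1')
print(mojibake == original)        # False

# Bytes that are invalid in the chosen codec raise UnicodeDecodeError
# unless an error handler such as 'replace' is given.
bad = b'Caf\xe9'                   # Latin-1 bytes, not valid UTF-8
try:
    bad.decode('utf-8')
except UnicodeDecodeError:
    recovered = bad.decode('utf-8', errors='replace')
```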
Python 2.x (2.6+)

While the bytes type exists in both Python 2 and 3 (in Python 2.6 and later it is simply an alias for str), the unicode type exists only in Python 2. To get Python 3's Unicode-by-default string literals in Python 2, add the following to the top of your code file:

from __future__ import unicode_literals
print(repr("hi"))
# u'hi'
Python 3.x (3.0+)

Another important difference is that indexing a bytes object in Python 3 returns an int:

b"abc"[0] == 97

Slicing out a single element, by contrast, returns a bytes object of length 1:

b"abc"[0:1] == b"a"
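
The difference between indexing and slicing can be seen side by side in the short sketch below, which assumes Python 3 and uses illustrative names. It also shows one way to turn a single int back into a bytes object:

```python
# Assumes Python 3.
data = b"abc"

first_int = data[0]          # indexing gives an int
first_bytes = data[0:1]      # slicing gives a bytes object

# Iterating over bytes also yields ints.
values = list(data)

# bytes([n]) turns a single int back into a length-1 bytes object.
rebuilt = bytes([first_int])

print(first_int, first_bytes, values, rebuilt)
```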

In addition, Python 3 fixes some unusual behavior with Unicode, such as reversing strings. In Python 2, reversing a byte string reverses its individual bytes, which breaks any multi-byte character. For example, the following issue is resolved:

# -*- coding: utf8 -*-
print("Hi, my name is Łukasz Langa.")
print(u"Hi, my name is Łukasz Langa."[::-1])
print("Hi, my name is Łukasz Langa."[::-1])

# Output in Python 2
# Hi, my name is Łukasz Langa.
# .agnaL zsakuŁ si eman ym ,iH
# .agnaL zsaku�� si eman ym ,iH

# Output in Python 3
# Hi, my name is Łukasz Langa.
# .agnaL zsakuŁ si eman ym ,iH
# .agnaL zsakuŁ si eman ym ,iH
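
The Python 2 result can be reproduced in Python 3 by reversing the encoded bytes explicitly. The following is a sketch under that assumption, with illustrative names; it shows that the reversed byte sequence is no longer valid UTF-8:

```python
# Assumes Python 3.
name = "Łukasz"

# Reversing text reverses code points, so the first letter stays intact.
reversed_text = name[::-1]

# Reversing the UTF-8 bytes swaps the two bytes that encode that letter.
reversed_bytes = name.encode("utf-8")[::-1]

try:
    reversed_bytes.decode("utf-8")
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False

print(reversed_text == "zsakuŁ")   # True
print(decodes_cleanly)             # False
```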