Comparing string in a case insensitive way seems like something that's trivial, but it's not. This section only considers unicode strings (the default in Python 3). Note that Python 2 may have subtle weaknesses relative to Python 3 - the later's unicode handling is much more complete.
The first thing to note it that case-removing conversions in unicode aren't trivial. There is text for which text.lower() != text.upper().lower()
, such as "ß"
:
>>> "ß".lower()
'ß'
>>> "ß".upper().lower()
'ss'
But let's say you wanted to caselessly compare "BUSSE"
and "Buße"
. Heck, you probably also want to compare "BUSSE"
and "BUẞE"
equal - that's the newer capital form. The recommended way is to use casefold
:
>>> help(str.casefold)
"""
Help on method_descriptor:
casefold(...)
S.casefold() -> str
Return a version of S suitable for caseless comparisons.
"""
Do not just use lower
. If casefold
is not available, doing .upper().lower()
helps (but only somewhat).
Then you should consider accents. If your font renderer is good, you probably think "ê" == "ê"
- but it doesn't:
>>> "ê" == "ê"
False
This is because they are actually
>>> import unicodedata
>>> [unicodedata.name(char) for char in "ê"]
['LATIN SMALL LETTER E WITH CIRCUMFLEX']
>>> [unicodedata.name(char) for char in "ê"]
['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT']
The simplest way to deal with this is unicodedata.normalize
. You probably want to use NFKD normalization, but feel free to check the documentation. Then one does
>>> unicodedata.normalize("NFKD", "ê") == unicodedata.normalize("NFKD", "ê")
True
To finish up, here this is expressed in functions:
import unicodedata
def normalize_caseless(text):
return unicodedata.normalize("NFKD", text.casefold())
def caseless_equal(left, right):
return normalize_caseless(left) == normalize_caseless(right)