Regular Expressions Regex modifiers (flags) UNICODE modifier


The UNICODE modifier, usually expressed as u (PHP, Python) or U (Java), makes the regex engine treat the pattern and the input string as Unicode strings and patterns, make the pattern shorthand classes like \w, \d, \s, etc. Unicode-aware.


is a PHP regex to match strings that consist of 1 or more Unicode letters. See the regex demo.

Note that in PHP, the /u modifier enables the PCRE engine to handle strings as UTF8 strings (by turning on PCRE_UTF8 verb) and make the shorthand character classes in the pattern Unicode aware (by enabling PCRE_UCP verb, see more at

Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.

In Python 2.x, the re.UNICODE only affects the pattern itself: Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.

An inline version: (?u) in Python, (?U) in Java. For example:

print(re.findall(ur"(?u)\w+", u"Dąb")) # [u'D\u0105b']
print(re.findall(r"\w+", u"Dąb"))      # [u'D', u'b']

System.out.println("Dąb".matches("(?U)\\w+")); // true
System.out.println("Dąb".matches("\\w+"));     // false