The UNICODE modifier, usually expressed as u (PHP, Python) or U (Java), makes the regex engine treat both the pattern and the input string as Unicode, and makes shorthand classes in the pattern such as \w, \d, and \s Unicode-aware.
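To make the effect on shorthand classes concrete, here is a minimal sketch. It assumes Python 3, where Unicode matching is already the default for str patterns and the re.ASCII flag switches it off, so the contrast is shown by toggling re.ASCII rather than by adding re.UNICODE. The sample text and variable names are illustrative only.

```python
import re

# "Dąb", two Arabic-Indic digits, and "café", written with escapes
# so the example does not depend on the source file encoding.
text = "D\u0105b \u0663\u0664 caf\u00e9"

# Python 3 default for str patterns: Unicode-aware \w, \d, \s.
unicode_words = re.findall(r"\w+", text)
unicode_digits = re.findall(r"\d+", text)
unicode_space = re.search(r"\s", "\u00a0")  # non-breaking space

# re.ASCII restricts the same classes to their ASCII-only meaning.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
ascii_digits = re.findall(r"\d+", text, flags=re.ASCII)
ascii_space = re.search(r"\s", "\u00a0", flags=re.ASCII)

print(unicode_words)   # three tokens, non-ASCII letters and digits included
print(ascii_words)     # tokens are cut at every non-ASCII character
print(unicode_digits)  # the Arabic-Indic digits count as \d
print(ascii_digits)    # no ASCII digits in the text
```

In ASCII mode the non-ASCII letters act as separators, which is the same behaviour the unflagged Python 2 and Java patterns show.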
/\A\p{L}+\z/u is a PHP regex that matches strings consisting of one or more Unicode letters. See the regex demo.
Note that in PHP, the /u modifier makes the PCRE engine handle the pattern and subject as UTF-8 strings (by turning on the PCRE_UTF8 option) and makes the shorthand character classes in the pattern Unicode-aware (by enabling the PCRE_UCP option; see more at pcre.org).
Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.
In Python 2.x, the re.UNICODE flag only affects the pattern itself: it makes \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.
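The word-boundary assertions \b and \B change along with \w, which is easy to overlook. The following is a minimal sketch, again assuming Python 3 (Unicode-aware by default, with re.ASCII used to reproduce the non-Unicode behaviour); the variable names are hypothetical.

```python
import re

word = "D\u0105b"  # "Dąb"

# Unicode mode: "ą" is a word character, so the whole string is one token
# and there is no word boundary between "ą" and "b".
unicode_tokens = re.findall(r"\b\w+\b", word)
unicode_boundary = re.search(r"\bb", word)

# ASCII mode: "ą" is a non-word character, so boundaries appear on both
# sides of it and the string splits into two tokens.
ascii_tokens = re.findall(r"\b\w+\b", word, flags=re.ASCII)
ascii_boundary = re.search(r"\bb", word, flags=re.ASCII)

print(unicode_tokens)
print(ascii_tokens)
print(unicode_boundary is None)
print(ascii_boundary.start())
```

A pattern that relies on \b to find whole words can therefore match inside a non-ASCII word when the Unicode behaviour is switched off.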
The modifier also has an inline version: (?u) in Python, (?U) in Java. For example:
# Python 2.x (the ur"" prefix is not valid in Python 3)
import re
print(re.findall(ur"(?u)\w+", u"Dąb")) # [u'D\u0105b']
print(re.findall(r"\w+", u"Dąb")) # [u'D', u'b']

// Java
System.out.println("Dąb".matches("(?U)\\w+")); // true
System.out.println("Dąb".matches("\\w+")); // false