Regular Expressions Tutorial => UNICODE modifier

Example

The UNICODE modifier, usually written as u (PHP, Python) or U (Java), makes the regex engine treat both the pattern and the input string as Unicode, and makes shorthand character classes such as \w, \d and \s Unicode-aware.

/\A\p{L}+\z/u

is a PHP regex that matches strings consisting of one or more Unicode letters.
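
To make the meaning of \p{L} concrete, here is a minimal Python 3 sketch of the same check. Python's built-in re module does not support \p{L}, so the sketch inspects each character's Unicode general category instead of using a regex. The helper name is_all_letters is hypothetical and not part of any library.

```python
import unicodedata


def is_all_letters(text):
    # Hypothetical helper: True when text is non-empty and every code point
    # has a Unicode general category starting with "L" (Lu, Ll, Lt, Lm, Lo),
    # which is the set of characters that \p{L} stands for.
    return bool(text) and all(
        unicodedata.category(ch).startswith("L") for ch in text
    )


print(is_all_letters("Dąb"))    # True
print(is_all_letters("Dąb42"))  # False
print(is_all_letters(""))       # False
```

Like the PHP pattern, this rejects combining marks: a decomposed "a" followed by U+0328 (combining ogonek) is not a letter, even though it renders the same as ą.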

Note that in PHP, the /u modifier makes the PCRE engine handle strings as UTF-8 (by turning on the PCRE_UTF8 option) and makes the shorthand character classes in the pattern Unicode-aware (by enabling the PCRE_UCP option; see pcre.org for details).

Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject has been checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five- and six-octet UTF-8 sequences have been regarded as invalid since PHP 5.3.4 (PCRE 7.3, 2007-08-28); previously they were regarded as valid UTF-8.

In Python 2.x, re.UNICODE only affects the pattern itself: it makes \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.
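
For comparison, here is a short sketch of the Python 3 behaviour, which the paragraph above does not cover. In Python 3, str patterns are Unicode-aware by default, so re.UNICODE is redundant there, and re.ASCII restores the ASCII-only shorthand classes. The sample string is made up for illustration.

```python
import re

# Made-up sample: a Polish word, ASCII digits, and an Arabic-Indic digit (U+0663).
text = "Dąb 42 \u0663"

unicode_words = re.findall(r"\w+", text)                     # default in Python 3
explicit_flag = re.findall(r"\w+", text, flags=re.UNICODE)   # same as the default
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)       # \w is [a-zA-Z0-9_] only

print(len(unicode_words))              # 3
print(explicit_flag == unicode_words)  # True
print(ascii_words)                     # ['D', 'b', '42']
```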

The modifier also has an inline form: (?u) in Python and (?U) in Java. For example:

# -*- coding: utf-8 -*-
import re  # Python 2 syntax: the ur"" prefix does not exist in Python 3

print(re.findall(ur"(?u)\w+", u"Dąb")) # [u'D\u0105b']
print(re.findall(r"\w+", u"Dąb"))      # [u'D', u'b']

System.out.println("Dąb".matches("(?U)\\w+")); // true
System.out.println("Dąb".matches("\\w+"));     // false
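
The same inline syntax can be sketched for Python 3, as an illustration that goes beyond the Python 2 example above: (?u) is still accepted but redundant for str patterns, while (?a) is the inline form of re.ASCII and switches the shorthand classes back to ASCII-only.

```python
import re

word = "Dąb"

# (?u) is redundant for str patterns in Python 3, but still valid.
print(bool(re.fullmatch(r"(?u)\w+", word)))  # True

# (?a) is the inline form of re.ASCII, so the letter ą no longer matches \w.
print(bool(re.fullmatch(r"(?a)\w+", word)))  # False
print(re.findall(r"(?a)\w+", word))          # ['D', 'b']
```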
