Regular Expressions UTF-8 matchers: Letters, Marks, Punctuation etc. Matching letters in different alphabets


The examples below are given in Ruby, but the same matchers should be available in most modern languages.

Let’s say we have the string "AℵNaïve", produced by Messy Artificial Intelligence, where the ï is written as an i followed by a combining diaeresis (U+0308). It consists of letters only, but the generic \w matcher won’t match much:

▶ "AℵNaïve"[/\w+/]
#⇒ "A"

The correct way to match a Unicode letter together with its combining marks is to use \X, which matches a grapheme cluster. There is a caveat for Ruby, though: Onigmo, Ruby’s regex engine, still uses the old definition of a grapheme cluster. It has not yet been updated to the Extended Grapheme Cluster defined in Unicode Standard Annex #29.
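As a minimal sketch of what \X does with the sample string, the snippet below builds the string from explicit escapes (an assumption, so that the decomposed ï is unambiguous) and splits it into clusters. Under either definition of a grapheme cluster, a base letter followed by a combining diaeresis forms a single cluster. Note that \X matches any cluster, not only letters.

```ruby
# "A", ALEF SYMBOL (U+2135), "N", "a", "i" + COMBINING DIAERESIS (U+0308), "v", "e"
s = "A\u2135Nai\u0308ve"

# Split the string into grapheme clusters.
clusters = s.scan(/\X/)

p s.length                    # 8 code points
p clusters.size               # 7 grapheme clusters
p clusters[4] == "i\u0308"    # true: the "i" and its mark stay together
```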

So for Ruby we need a workaround. \p{L} almost does the job, except that it stops at the combining diacritical mark on the i:

▶ "AℵNaïve"[/\p{L}+/]
#⇒ "AℵNai"
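The match stops there because the combining diaeresis is a mark, not a letter. A small sketch that makes this visible, assuming the ï is stored in decomposed form and using String#unicode_normalize to produce the precomposed form for comparison (the variable names are only illustrative):

```ruby
decomposed  = "Nai\u0308ve"                        # "i" followed by U+0308 COMBINING DIAERESIS
precomposed = decomposed.unicode_normalize(:nfc)  # "ï" as the single code point U+00EF

decomposed_match  = decomposed[/\p{L}+/]
precomposed_match = precomposed[/\p{L}+/]

p decomposed_match                   # "Nai": the mark ends the run of letters
p precomposed_match == precomposed   # true: U+00EF is itself a letter
```

Normalizing to NFC helps only when a precomposed code point exists for the letter and mark in question, so it is not a general replacement for matching marks explicitly.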

By adding the “Mark” category, \p{M}, to the character class, we can finally match everything:

▶ "AℵNaïve"[/[\p{L}\p{M}]+/]
#⇒ "AℵNaïve"
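The same character class can pull every run of letters out of mixed text. The helper below is a hypothetical illustration (letter_runs is not a built-in), and the sample input is made up for the example:

```ruby
# Hypothetical helper: return every run of letters, keeping combining marks
# attached to the letters they modify.
def letter_runs(text)
  text.scan(/[\p{L}\p{M}]+/)
end

runs = letter_runs("Nai\u0308ve, caf\u00E9 42")
p runs.size    # 2
p runs         # the decomposed "Naïve" and the precomposed "café"
```

Since [\p{L}\p{M}]+ also accepts a run that starts with a stray mark, a stricter variant would be (?:\p{L}\p{M}*)+, which requires every mark to follow a letter.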