Regular Expressions Tutorial => Matching letters in different...

Example

The examples below are given in Ruby, but equivalent matchers should be available in most modern regex engines.

Let’s say we have the string "AℵNaïve", produced by Messy Artificial Intelligence. It consists entirely of letters, but the generic \w matcher, which in Ruby covers only ASCII word characters, won’t match much of it:

▶ "AℵNaïve"[/\w+/]
#⇒ "A"
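To see where \w gives up, it can help to collect every run it matches. The sketch below assumes the "ï" in the sample is stored as a plain "i" followed by the combining diaeresis U+0308 (which is what the later results imply), so the string is built from explicit escapes; the variable names are only illustrative.

```ruby
# Assumption: "ï" is decomposed, i.e. "i" + U+0308 (combining diaeresis).
# U+2135 is the alef symbol "ℵ".
text = "A\u2135Nai\u0308ve"

# \w only covers [a-zA-Z0-9_], so every non-ASCII code point ends a run.
ascii_runs = text.scan(/\w+/)

p ascii_runs  # ["A", "Nai", "ve"]
```

The string is split into three fragments: once at the alef symbol and once at the combining mark.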

The correct way to match a Unicode letter together with its combining marks is to use \X, which matches a grapheme cluster. There is a caveat for Ruby, though: at the time of writing, Onigmo, the regex engine for Ruby, still used the old definition of a grapheme cluster and had not yet been updated to the Extended Grapheme Cluster defined in Unicode Standard Annex #29.
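As a rough illustration of what \X does, the sketch below splits the sample string into grapheme clusters. It assumes the "ï" is stored as "i" followed by U+0308; a base letter plus one combining mark should be grouped the same way under both the old and the extended cluster definitions, but results for more complex clusters may depend on the Ruby version.

```ruby
# Assumption: "ï" is decomposed, i.e. "i" + U+0308 (combining diaeresis).
text = "A\u2135Nai\u0308ve"

# Each \X match is one user-perceived character.
clusters = text.scan(/\X/)

p text.length             # 8 code points
p clusters.length         # 7 grapheme clusters
p clusters[4].codepoints  # [105, 776], i.e. "i" and U+0308 kept together
```

Note that \X matches any grapheme cluster, including spaces and punctuation, so on its own it does not restrict a match to letters.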

So, for Ruby, we can use a workaround: \p{L} does almost fine, except that it stops at the combining diacritical mark on the i:

▶ "AℵNaïve"[/\p{L}+/]
#⇒ "AℵNai"
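The reason the match stops can be checked by classifying each code point. This is a minimal sketch, again assuming the "ï" is decomposed into "i" plus U+0308; the `kinds` variable and the labels are made up for illustration.

```ruby
# Assumption: "ï" is decomposed, i.e. "i" + U+0308 (combining diaeresis).
text = "A\u2135Nai\u0308ve"

# Label every code point as a letter (\p{L}), a mark (\p{M}) or something else.
kinds = text.each_char.map do |ch|
  label = if ch =~ /\p{L}/
            "letter"
          elsif ch =~ /\p{M}/
            "mark"
          else
            "other"
          end
  format("U+%04X %s", ch.ord, label)
end

puts kinds
```

The sixth entry is reported as a mark rather than a letter, which is exactly where \p{L}+ ends.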

By adding the Mark category (\p{M}) to the expression, we can finally match everything:

▶ "AℵNaïve"[/[\p{L}\p{M}]+/]
#⇒ "AℵNaïve"
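The same character class can be used to pull words out of mixed-script text. The sketch below is one possible way to do it; the constant names and the sample sentence are invented for illustration. It also shows a slightly stricter variant that requires every mark to follow a letter, since the plain class would also accept a stray leading mark.

```ruby
# Loose: any run of letters and marks, in any order.
LOOSE = /[\p{L}\p{M}]+/
# Strict: each letter may be followed by marks, but a mark cannot come first.
STRICT = /(?:\p{L}\p{M}*)+/

# Hypothetical sample: the original string, a Cyrillic word, digits, punctuation.
sample = "A\u2135Nai\u0308ve, \u041F\u0440\u0438\u0432\u0435\u0442 42!"

loose_words  = sample.scan(LOOSE)
strict_words = sample.scan(STRICT)

p loose_words   # two words: the original string and the Cyrillic one
p strict_words  # same two words for this sample

# The difference only shows up when a mark has no base letter before it.
stray = "\u0308ab"
p stray[LOOSE].length   # 3, the stray mark is included
p stray[STRICT].length  # 2, the stray mark is skipped
```

Digits and punctuation fall outside both classes, so "42" and the comma are not part of any match.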
