Regular expressions (also called "regex" or "regexp") define patterns that can be matched against a string. Type ?regex
for the official R documentation and see the Regex Docs for more details. The most important 'gotcha' that is not covered in the SO regex topics is that most R regex functions require paired backslashes to escape characters in the pattern
parameter.
"[AB]" could be A or B.
"[[:alpha:]]" could be any letter.
"[[:lower:]]" stands for any lower-case letter. Note that "[a-z]" is close but doesn't match, e.g., ú.
"[[:upper:]]" stands for any upper-case letter. Note that "[A-Z]" is close but doesn't match, e.g., Ú.
"[[:digit:]]" stands for any digit: 0, 1, 2, ..., or 9, and is equivalent to "[0-9]".
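The character classes above can be tried out with grepl(). This is a minimal sketch: the vector name x and its contents are made up for illustration, and the result of the last call depends on the locale, so no output is claimed for it.

```r
# Illustrative input; `x` is a hypothetical name, not from the documentation.
x <- c("A", "B", "C", "ab", "7")

grepl("[AB]", x)          # TRUE TRUE FALSE FALSE FALSE
grepl("[[:digit:]]", x)   # FALSE FALSE FALSE FALSE TRUE
grepl("[0-9]", x)         # same result as the [[:digit:]] call above

# Accented letters: compare the explicit range with the POSIX class.
# "\u00fa" is the letter ú written as an R string escape.
grepl("[a-z]", "\u00fa")        # the ASCII range a-z does not list ú
grepl("[[:lower:]]", "\u00fa")  # locale-dependent
```

Because ?regex notes that character ranges are locale- and implementation-dependent, the named classes are usually the safer choice for non-ASCII text.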
+, * and ? apply as usual in regex: + matches at least once, * matches 0 or more times, and ? matches 0 or 1 time.
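As a small sketch of the three quantifiers, the same inputs can be tested against "b" repeated one or more, zero or more, and zero or one times. The vector s is an assumed example, not taken from the documentation.

```r
# Illustrative inputs containing zero, one and two "b" characters.
s <- c("ac", "abc", "abbc")

grepl("ab+c", s)   # + : at least one "b"   -> FALSE TRUE TRUE
grepl("ab*c", s)   # * : zero or more "b"   -> TRUE TRUE TRUE
grepl("ab?c", s)   # ? : zero or one "b"    -> TRUE TRUE FALSE
```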
You can specify the position of the regex in the string:
"^..." forces the regular expression to be at the beginning of the string.
"...$" forces the regular expression to be at the end of the string.
Please note that regular expressions in R often look ever-so-slightly different from regular expressions used in other languages.
R requires double-backslash escapes (because "\" already implies escaping in general in R strings), so, for example, to capture whitespace, one simply types \s in most regular expression engines but \\s in R.
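The position anchors and the doubled backslash can be seen together in a short sketch. The input vector x and the variable trimmed are assumptions chosen for illustration; trimming whitespace is just one convenient use of "^", "$" and "\\s".

```r
# Illustrative input; the names `x` and `trimmed` are hypothetical.
x <- c("  padded  ", "clean")

# In R source, "\\s" is a two-character string: a backslash and an "s".
# Writing "\s" with a single backslash is rejected as an unrecognized escape.
nchar("\\s")                            # 2

# ^\s+ matches whitespace at the start, \s+$ whitespace at the end.
trimmed <- gsub("^\\s+|\\s+$", "", x)
trimmed                                 # "padded" "clean"

# Anchoring both ends forces the pattern to cover the whole string.
grepl("^clean$", x)                     # FALSE TRUE

# An unescaped dot matches any character; "\\." matches a literal dot only.
grepl("a.b", c("a.b", "axb"))           # TRUE TRUE
grepl("a\\.b", c("a.b", "axb"))         # TRUE FALSE
```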
Unicode characters in R should be escaped with a capital U, e.g. [\U{1F600}] and [\U1F600] both match 😀, whereas in, e.g., Ruby, this would be matched with a lower-case u.
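The capital-U escape can be checked directly in an R session. This is a sketch under the assumption that the session can represent the character (for example a UTF-8 locale); the variable name smile and the sample sentences are made up.

```r
# Both spellings of the escape denote the same single character.
smile <- "\U{1F600}"
identical(smile, "\U1F600")        # TRUE

# The character's code point is U+1F600.
utf8ToInt(smile) == 0x1F600        # TRUE

# Using the escape inside a bracket expression in a pattern.
grepl("[\U{1F600}]", "good morning \U1F600")
grepl("[\U{1F600}]", "good morning")
```

The first grepl() call is expected to report a match and the second is not, since only the first string contains the character.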
The site reg101 is a good place to check a regex online before using it in an R script.
The R Programming wikibook has a page dedicated to text processing, with many examples using regular expressions.