When working with regular expressions, one modifier familiar from Perl-style regular expressions is g,
for global match.
In R, the matching and replacement functions come in two versions, first match and global match:
sub(pattern,replacement,text)
replaces the first occurrence of pattern in text with replacement.
gsub(pattern,replacement,text)
does the same as sub, but for every occurrence of pattern.
regexpr(pattern,text)
returns the position (and length) of the first match of pattern in text.
gregexpr(pattern,text)
returns the positions (and lengths) of all matches.
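Some random data (the string shown was generated with an R version older than 3.6.0; newer versions use a different default sampling algorithm, so the same seed gives a different string):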
set.seed(123)
teststring <- paste0(sample(letters,20),collapse="")
# teststring
#[1] "htjuwakqxzpgrsbncvyo"
Let's see how this works if we want to replace vowels with something else:
sub("[aeiouy]"," ** HERE WAS A VOWEL** ",teststring)
#[1] "htj ** HERE WAS A VOWEL** wakqxzpgrsbncvyo"
gsub("[aeiouy]"," ** HERE WAS A VOWEL** ",teststring)
#[1] "htj ** HERE WAS A VOWEL** w ** HERE WAS A VOWEL** kqxzpgrsbncv ** HERE WAS A VOWEL**  ** HERE WAS A VOWEL** "
Now let's see how we can find a consonant immediately followed by one or more vowels (note that this pattern, unlike the previous one, does not count y as a vowel):
regexpr("[^aeiou][aeiou]+",teststring)
#[1] 3
#attr(,"match.length")
#[1] 2
#attr(,"useBytes")
#[1] TRUE
We have a match at position 3 of the string, of length 2, i.e. ju.
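To make the link between position, match.length and the matched text explicit, here is a minimal sketch that cuts the match out by hand with substring. The string is copied literally from the output above so the snippet does not depend on set.seed; the variable names start, len and first_match are only illustrative.

```r
# Literal copy of the example string, so the result does not depend on the R version
teststring <- "htjuwakqxzpgrsbncvyo"

m <- regexpr("[^aeiou][aeiou]+", teststring)

start <- as.integer(m)            # position of the first match (attributes dropped)
len   <- attr(m, "match.length")  # length of that match

# Cut the match out of the original string
first_match <- substring(teststring, start, start + len - 1)
first_match
#[1] "ju"
```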
Now if we want to get all matches:
gregexpr("[^aeiou][aeiou]+",teststring)
#[[1]]
#[1] 3 5 19
#attr(,"match.length")
#[1] 2 2 2
#attr(,"useBytes")
#[1] TRUE
All this is really great, but it only gives us the positions of the matches, and from those it is not so easy to get what was matched. This is where regmatches comes in.
Its purpose is to extract the matched substrings from the result of regexpr or gregexpr, but it has a different syntax: it takes the original string and the match data as two separate arguments.
Let's save our matches in a variable and then extract them from the original string:
matches <- gregexpr("[^aeiou][aeiou]+",teststring)
regmatches(teststring,matches)
#[[1]]
#[1] "ju" "wa" "yo"
It may seem strange not to have a shortcut, but this allows extracting from another string using the matches of the first one (think of comparing two long vectors where you know there is a common pattern in the first but not in the second; this makes the comparison easy):
teststring2 <- "this is another string to match against"
regmatches(teststring2,matches)
#[[1]]
#[1] "is" " i" "ri"
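regmatches also works on character vectors with more than one element, and its return shape depends on which function produced the match data. The following is a small sketch on a made-up vector x (not part of the example above): with regexpr the result is a plain character vector in which elements without a match are dropped, while with gregexpr the result is a list with one entry per element.

```r
# Hypothetical input vector, chosen so that one element has no match
x <- c("apple", "xyz", "banana")

# regexpr: at most one match per element; "xyz" has no match and is dropped
first_only <- regmatches(x, regexpr("[aeiou]+", x))
first_only
#[1] "a" "a"

# gregexpr: a list with one character vector per element of x
all_matches <- regmatches(x, gregexpr("[aeiou]+", x))
all_matches
#[[1]]
#[1] "a" "e"
#
#[[2]]
#character(0)
#
#[[3]]
#[1] "a" "a" "a"
```

Because non-matching elements are dropped in the regexpr case, the result can be shorter than x, which is worth keeping in mind before binding it back to the original data.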
Note: by default the pattern is not interpreted as a Perl Compatible Regular Expression (R uses POSIX-style extended regular expressions unless told otherwise), so some features such as lookarounds are not supported. Each function presented here accepts a perl=TRUE
argument to enable them.
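As an illustration of the perl=TRUE argument, here is a short sketch using a lookahead, again with the example string copied literally. The pattern matches a non-vowel only when a vowel follows, without including the vowel in the match; the names pattern, replaced and found are only illustrative.

```r
# Literal copy of the example string used throughout
teststring <- "htjuwakqxzpgrsbncvyo"

# Lookahead: a non-vowel, only if a vowel comes next (the vowel is not consumed)
pattern <- "[^aeiou](?=[aeiou])"

# Replace only the consonant, leaving the following vowel in place
replaced <- gsub(pattern, "_", teststring, perl = TRUE)
replaced
#[1] "ht_u_akqxzpgrsbncv_o"

# The same pattern with gregexpr and regmatches
found <- regmatches(teststring, gregexpr(pattern, teststring, perl = TRUE))
found
#[[1]]
#[1] "j" "w" "y"
```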