R Language Single and Global match.


When working with regular expressions, one of the PCRE modifiers is g, for global match.

In R, the matching and replacement functions come in two versions, first match and global match:

  • sub(pattern,replacement,text) will replace the first occurrence of pattern by replacement in text

  • gsub(pattern,replacement,text) will do the same as sub but for each occurrence of pattern

  • regexpr(pattern,text) will return the position of the first match of pattern

  • gregexpr(pattern,text) will return the positions of all matches.
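
As a quick illustration of the four functions listed above, here is a minimal sketch on a fixed input. The string "banana" and the variable names are assumptions chosen for this example, not part of the original data.

```r
# Hypothetical input, chosen only for illustration
x <- "banana"

first_only <- sub("a", "A", x)    # replaces only the first "a"
every_one  <- gsub("a", "A", x)   # replaces every "a"

first_pos <- regexpr("a", x)      # position of the first match, plus attributes
all_pos   <- gregexpr("a", x)     # list with one vector of positions per input string

print(first_only)
print(every_one)
print(as.integer(first_pos))      # as.integer() drops the attributes
print(as.integer(all_pos[[1]]))
```

Note that gregexpr returns a list even for a single input string, hence the `[[1]]`.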

Some random data (sample gives a different result on each run; the examples below use the value shown):

teststring <- paste0(sample(letters,20),collapse="")

# teststring
#[1] "htjuwakqxzpgrsbncvyo"

Let's see how this works if we want to replace vowels (here including y) with something else:

sub("[aeiouy]"," ** HERE WAS A VOWEL** ",teststring)
#[1] "htj ** HERE WAS A VOWEL** wakqxzpgrsbncvyo"

gsub("[aeiouy]"," ** HERE WAS A VOWEL** ",teststring)
#[1] "htj ** HERE WAS A VOWEL** w ** HERE WAS A VOWEL** kqxzpgrsbncv ** HERE WAS A VOWEL**  ** HERE WAS A VOWEL** "

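Now let's see how we can find a consonant immediately followed by one or more vowels (note that y is not in [aeiou], so this pattern treats it as a consonant):

regexpr("[^aeiou][aeiou]+",teststring)
#[1] 3
#attr(,"match.length")
#[1] 2
#attr(,"useBytes")
#[1] TRUE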

We have a match at position 3 of the string, with a length of 2, i.e. ju
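
To make the position and length concrete, here is a minimal sketch that cuts the match out by hand with substring(). It assumes teststring holds the sample value shown earlier (hard-coded here because sample is random); the variable names m, start, len and first_match are only illustrative.

```r
# Fixed copy of the sample value shown above
teststring <- "htjuwakqxzpgrsbncvyo"

m     <- regexpr("[^aeiou][aeiou]+", teststring)
start <- as.integer(m)             # start position of the first match
len   <- attr(m, "match.length")   # number of characters matched

# Extract the matched characters from start to start + len - 1
first_match <- substring(teststring, start, start + len - 1)
print(first_match)
```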

Now if we want to get all matches:

gregexpr("[^aeiou][aeiou]+",teststring)
#[[1]]
#[1]  3  5 19
#attr(,"match.length")
#[1] 2 2 2
#attr(,"useBytes")
#[1] TRUE

All this is really great, but it only gives us the positions of the matches, and getting at what was actually matched is not so easy. This is where regmatches comes in: its sole purpose is to extract the matched strings from the result of regexpr or gregexpr, but it has a different syntax.

Let's save our matches in a variable and then extract them from the original string:

matches <- gregexpr("[^aeiou][aeiou]+",teststring)
regmatches(teststring,matches)
#[[1]]
#[1] "ju" "wa" "yo"
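
regmatches accepts match data from regexpr as well as from gregexpr. The sketch below contrasts the two, again assuming the sample value of teststring shown earlier; first and all_matches are illustrative names.

```r
# Fixed copy of the sample value shown above
teststring <- "htjuwakqxzpgrsbncvyo"
pattern    <- "[^aeiou][aeiou]+"

# With regexpr: a character vector holding the first match of each matching string
first <- regmatches(teststring, regexpr(pattern, teststring))

# With gregexpr: a list, one character vector of matches per input string
all_matches <- regmatches(teststring, gregexpr(pattern, teststring))[[1]]

print(first)
print(all_matches)
```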

It may seem strange not to have a shortcut, but this allows extracting from another string using the matches of the first one (think of comparing two long vectors where you know there is a common pattern in the first but not in the second; this allows an easy comparison):

teststring2 <- "this is another string to match against"
regmatches(teststring2,matches)
#[[1]]
#[1] "is" " i" "ri"

Note: by default the pattern is not treated as a Perl Compatible Regular Expression, so some features such as lookarounds are not supported. Each function presented here accepts a perl=TRUE argument to enable them.
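
As a small sketch of what perl=TRUE makes possible, the lookahead below finds each consonant that is followed by a vowel without including the vowel in the match. It assumes the sample value of teststring shown earlier; the pattern and the variable names are illustrative.

```r
# Fixed copy of the sample value shown above
teststring <- "htjuwakqxzpgrsbncvyo"

# (?=[aeiou]) is a lookahead: it requires a following vowel but does not consume it
m <- gregexpr("[^aeiou](?=[aeiou])", teststring, perl = TRUE)

positions  <- as.integer(m[[1]])
consonants <- regmatches(teststring, m)[[1]]

print(positions)
print(consonants)
```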