R Language: Pattern Matching and Replacement: Finding Matches


# example data
test_sentences <- c("The quick brown fox", "jumps over the lazy dog")   

Is there a match?

grepl() checks whether a word or regular expression occurs in each element of a character vector. It returns a logical (TRUE/FALSE, or "Boolean") vector with one value per element.

Notice that we can check each string for the word "fox" and receive a Boolean vector in return.

grepl("fox", test_sentences)
#[1]  TRUE FALSE
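
Because the first argument is treated as a regular expression, the same call also handles patterns beyond a literal word. A minimal sketch, reusing the example data; the variable names starts_with_the and contains_the are made up for illustration:

```r
# example data (same as above)
test_sentences <- c("The quick brown fox", "jumps over the lazy dog")

# "^" anchors the pattern to the start of each string
starts_with_the <- grepl("^The", test_sentences)

# ignore.case = TRUE makes the match case-insensitive,
# so "The" in the first sentence counts as a match for "the"
contains_the <- grepl("the", test_sentences, ignore.case = TRUE)
```

Here starts_with_the is TRUE only for the first sentence, while contains_the is TRUE for both.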

Match locations

grep takes a regular expression and a character vector. It returns an integer vector containing the indexes of the elements that match. The call below reports which sentence contains the word "fox".

grep("fox", test_sentences)
#[1] 1
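
The result has one index per matching element, so its length varies with the number of matches. A short sketch of the two edge cases, reusing the example data; the names both and none are illustrative only:

```r
# example data (same as above)
test_sentences <- c("The quick brown fox", "jumps over the lazy dog")

# every sentence contains an "o", so both indexes come back
both <- grep("o", test_sentences)

# no sentence contains "cat", so the result is an empty integer vector
none <- grep("cat", test_sentences)

# a common way to test for "no match at all"
no_match <- length(none) == 0
```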

Matched values

To select sentences that match a pattern:

# each of the following lines does the job:
test_sentences[grep("fox", test_sentences)]
test_sentences[grepl("fox", test_sentences)]
grep("fox", test_sentences, value = TRUE)
# [1] "The quick brown fox"


Since the "fox" pattern is just a word, rather than a regular expression, we could improve performance (with either grep or grepl) by specifying fixed = TRUE.

grep("fox", test_sentences, fixed = TRUE)
#[1] 1
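
Besides performance, fixed = TRUE also changes the result whenever the pattern contains regular expression metacharacters, because they are then taken literally. A small sketch with made-up data (the versions vector is an assumption for illustration, not part of the original example):

```r
# hypothetical data: only the first element contains a literal "1.0"
versions <- c("v1.0", "v1x0")

# as a regular expression, "." matches any single character
regex_hits <- grepl("1.0", versions)

# with fixed = TRUE, "." has to be a literal dot
literal_hits <- grepl("1.0", versions, fixed = TRUE)
```

regex_hits is TRUE for both elements, whereas literal_hits is TRUE only for "v1.0".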

To select sentences that don't match a pattern, one can use grep with invert = TRUE; or follow subsetting rules with -grep(...) or !grepl(...).
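
The three approaches can be sketched as follows, reusing the example data; the variable names are illustrative. One caveat worth knowing: when nothing matches, grep returns an empty vector, and negative indexing with an empty vector selects nothing rather than everything, so !grepl(...) or invert = TRUE is the safer choice in that case.

```r
# example data (same as above)
test_sentences <- c("The quick brown fox", "jumps over the lazy dog")

# three equivalent ways to keep the sentences WITHOUT "fox"
no_fox_a <- grep("fox", test_sentences, invert = TRUE, value = TRUE)
no_fox_b <- test_sentences[!grepl("fox", test_sentences)]
no_fox_c <- test_sentences[-grep("fox", test_sentences)]

# caveat: "cat" matches nothing, so grep() returns integer(0)
# and x[-integer(0)] is empty instead of the full vector
no_cat_minus <- test_sentences[-grep("cat", test_sentences)]
no_cat_bang  <- test_sentences[!grepl("cat", test_sentences)]
```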

In both grepl(pattern, x) and grep(pattern, x), the x parameter is vectorized but the pattern parameter is not. As a result, you cannot use these functions directly to match pattern[1] against x[1], pattern[2] against x[2], and so on.
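
One possible workaround for element-wise matching is to loop over both vectors in parallel with mapply. This is a sketch rather than the only option, and the patterns vector is made up for illustration:

```r
# example data (same as above) plus one hypothetical pattern per sentence
test_sentences <- c("The quick brown fox", "jumps over the lazy dog")
patterns <- c("fox", "cat")

# mapply calls grepl(patterns[i], test_sentences[i]) for each i;
# USE.NAMES = FALSE keeps the result a plain, unnamed logical vector
pairwise <- mapply(grepl, patterns, test_sentences, USE.NAMES = FALSE)
```

pairwise is TRUE for the first pair ("fox" occurs in the first sentence) and FALSE for the second ("cat" does not occur in the second sentence).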

Summary of matches

After running, for example, the grepl command, you may want an overview of how many results were TRUE and how many were FALSE. This is especially useful with large data sets. To get it, run the summary command:

# example data
test_sentences <- c("The quick brown fox", "jumps over the lazy dog") 

# find matches
matches <- grepl("fox", test_sentences)

# overview