# example data
test_sentences <- c("The quick brown fox", "jumps over the lazy dog")
grepl() checks whether a word or regular expression matches each element of a string or character vector. The function returns a logical (TRUE/FALSE, or "Boolean") vector with one value per element.
Notice that we can check each string for the word "fox" and receive a Boolean vector in return.
grepl("fox", test_sentences)
#[1] TRUE FALSE
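Because the pattern is treated as a regular expression by default, the same call also handles anchors, alternation, and case-insensitive matching. The sketch below reuses the example data; the specific patterns are chosen here purely for illustration.

```r
test_sentences <- c("The quick brown fox", "jumps over the lazy dog")

# "^" anchors the match to the start of the string
starts_with_the <- grepl("^The", test_sentences)
starts_with_the
#[1] TRUE FALSE

# "|" matches either alternative
fox_or_dog <- grepl("fox|dog", test_sentences)
fox_or_dog
#[1] TRUE TRUE

# ignore.case = TRUE makes the match case-insensitive
any_case_fox <- grepl("FOX", test_sentences, ignore.case = TRUE)
any_case_fox
#[1] TRUE FALSE
```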
grep takes a regular expression and a character vector. It returns an integer vector containing the indices of the elements that match. The call below returns the index of the sentence that contains the word "fox".
grep("fox", test_sentences)
#[1] 1
To select sentences that match a pattern:
# each of the following lines does the job:
test_sentences[grep("fox", test_sentences)]
test_sentences[grepl("fox", test_sentences)]
grep("fox", test_sentences, value = TRUE)
# [1] "The quick brown fox"
Since the "fox" pattern is just a word, rather than a regular expression, we could improve performance (with either grep or grepl) by specifying fixed = TRUE.
grep("fox", test_sentences, fixed = TRUE)
#[1] 1
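Besides speed, fixed = TRUE also changes what is matched: regular expression metacharacters are taken literally. A minimal sketch, where the versions vector is hypothetical data made up for this illustration:

```r
# hypothetical example data, not part of the original example
versions <- c("v1.0", "v1x0")

# as a regular expression, "." matches any single character
regex_match <- grepl("1.0", versions)
regex_match
#[1] TRUE TRUE

# with fixed = TRUE, "." only matches a literal dot
fixed_match <- grepl("1.0", versions, fixed = TRUE)
fixed_match
#[1] TRUE FALSE
```

So fixed = TRUE is also a reasonable choice whenever the pattern comes from data and may contain characters such as ".", "(" or "*".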
To select sentences that don't match a pattern, one can use grep with invert = TRUE; or follow subsetting rules with -grep(...) or !grepl(...).
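The three options for inverse selection can be sketched as follows, using the same example data. One caveat worth knowing: when nothing matches, grep returns integer(0), and negative indexing with an empty vector selects nothing, so the -grep(...) form is the least robust of the three.

```r
test_sentences <- c("The quick brown fox", "jumps over the lazy dog")

# three ways to keep the sentences that do NOT contain "fox"
no_fox_invert <- grep("fox", test_sentences, invert = TRUE, value = TRUE)
no_fox_grepl  <- test_sentences[!grepl("fox", test_sentences)]
no_fox_minus  <- test_sentences[-grep("fox", test_sentences)]
no_fox_invert
#[1] "jumps over the lazy dog"

# caveat: "cat" matches nothing, so grep() returns integer(0)
no_cat_minus <- test_sentences[-grep("cat", test_sentences)]
no_cat_minus
#character(0)

# !grepl() still returns both sentences, as intended
no_cat_grepl <- test_sentences[!grepl("cat", test_sentences)]
```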
In both grepl(pattern, x) and grep(pattern, x), the x parameter is vectorized, but the pattern parameter is not. As a result, you cannot use these directly to match pattern[1] against x[1], pattern[2] against x[2], and so on.
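One possible workaround, shown here as a sketch rather than the only approach, is to let mapply call grepl once per pattern/string pair. The patterns vectors below are hypothetical and chosen only to illustrate the pairing.

```r
test_sentences <- c("The quick brown fox", "jumps over the lazy dog")

# hypothetical patterns: patterns[i] is checked against test_sentences[i]
patterns <- c("fox", "dog")

# mapply() calls grepl(pattern = patterns[i], x = test_sentences[i]) for each i
pairwise <- mapply(grepl, pattern = patterns, x = test_sentences,
                   USE.NAMES = FALSE)
pairwise
#[1] TRUE TRUE

# swapping the patterns shows that the matching really is element by element
pairwise_swapped <- mapply(grepl, pattern = rev(patterns), x = test_sentences,
                           USE.NAMES = FALSE)
pairwise_swapped
#[1] FALSE FALSE
```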
After running, for example, the grepl command, you may want an overview of how many matches were TRUE or FALSE. This is useful, for example, with big data sets. To get it, run the summary command:
# example data
test_sentences <- c("The quick brown fox", "jumps over the lazy dog")
# find matches
matches <- grepl("fox", test_sentences)
# overview
summary(matches)
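If only the counts are needed, a few base R alternatives to summary work directly on the logical vector. A brief sketch on the same example data:

```r
test_sentences <- c("The quick brown fox", "jumps over the lazy dog")
matches <- grepl("fox", test_sentences)

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of matches
n_matches <- sum(matches)
n_matches
#[1] 1

# mean() gives the proportion of elements that matched
share_matches <- mean(matches)
share_matches
#[1] 0.5

# table() tabulates the FALSE and TRUE counts
match_table <- table(matches)
```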