On big data sets, such as crawled websites or millions of Tweets, a plain call to grepl("fox", test_sentences) does not perform well.
The first speed-up is the perl = TRUE option, which switches matching to the PCRE engine.
Faster still is the fixed = TRUE option, which treats the pattern as a literal string instead of a regular expression. A complete example would be:
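# example data
test_sentences <- c("The quick brown fox", "jumps over the lazy dog")

# PCRE-based matching
grepl("fox", test_sentences, perl = TRUE)
#[1]  TRUE FALSE

# literal (non-regex) matching
grepl("fox", test_sentences, fixed = TRUE)
#[1]  TRUE FALSE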
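The speed difference only shows up on larger inputs, so a rough way to check it is to time the three variants with system.time(). The following is a minimal sketch using base R only; the vector big_sentences and its size are arbitrary assumptions for illustration, and the absolute timings will vary with the machine and the R version.

```r
# Build a larger sample by repeating the two example sentences
# (the repetition count is an arbitrary choice for illustration)
big_sentences <- rep(c("The quick brown fox", "jumps over the lazy dog"),
                     times = 500000)

# Time each variant and keep the results for a consistency check
t_default <- system.time(res_default <- grepl("fox", big_sentences))
t_perl    <- system.time(res_perl    <- grepl("fox", big_sentences, perl = TRUE))
t_fixed   <- system.time(res_fixed   <- grepl("fox", big_sentences, fixed = TRUE))

# All three variants must find the same matches
stopifnot(identical(res_default, res_perl),
          identical(res_perl, res_fixed))

# Elapsed wall-clock seconds per variant
print(c(default = t_default[["elapsed"]],
        perl    = t_perl[["elapsed"]],
        fixed   = t_fixed[["elapsed"]]))
```

Note that fixed = TRUE is only an option when the pattern contains no regular expression syntax that needs to be interpreted.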
Text mining often works on a corpus (here, one from the tm package) rather than on a character vector. A corpus cannot be passed directly to grepl. Therefore, consider this function:
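library(tm)  # provides tm_index()

# Return one logical per document: TRUE if the pattern matches (case-insensitive)
searchCorpus <- function(corpus, pattern) {
  tm_index(corpus, FUN = function(x) {
    # any() collapses multi-line documents to a single TRUE/FALSE
    any(grepl(pattern, x, ignore.case = TRUE, perl = TRUE))
  })
}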
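The fixed = TRUE speed-up can be carried over to a corpus search as well. The sketch below assumes the tm package is installed; searchCorpusFixed is a hypothetical helper introduced here for illustration, not part of tm. Because grepl ignores ignore.case when fixed = TRUE is set (with a warning), the sketch lower-cases both the search string and the document text instead.

```r
library(tm)  # assumed installed; provides VCorpus(), VectorSource(), tm_index()

# Hypothetical variant of searchCorpus() that matches a literal string.
# ignore.case is not honoured together with fixed = TRUE, so both sides
# are lower-cased before matching.
searchCorpusFixed <- function(corpus, literal) {
  needle <- tolower(literal)
  tm_index(corpus, FUN = function(x) {
    # content(x) extracts the document text as a character vector
    any(grepl(needle, tolower(content(x)), fixed = TRUE))
  })
}

# Small in-memory corpus built from the example sentences
corpus <- VCorpus(VectorSource(c("The quick brown fox",
                                 "jumps over the lazy dog")))

hits <- searchCorpusFixed(corpus, "FOX")
hits                      # one logical value per document
matching <- corpus[hits]  # sub-corpus with only the matching documents
```

The logical vector returned by tm_index can be used to subset the corpus, as in the last line, or passed on to further filtering steps.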