To illustrate the effect of good for loop construction, we will calculate the mean of each column of mtcars in four different ways:
1. a poorly constructed for loop
2. an optimized for loop
3. the *apply family of functions
4. the colMeans function
Each of these options is shown in code, followed by a comparison of the computational time needed to execute each one and, lastly, a discussion of the differences.
# Poorly constructed loop: the output starts as NULL and grows on every iteration
column_mean_poor <- NULL
for (i in 1:length(mtcars)){
  column_mean_poor[i] <- mean(mtcars[[i]])
}
# Optimized loop: the output type and length are declared before the loop starts
column_mean_optimal <- vector("numeric", length(mtcars))
for (i in seq_along(mtcars)){
  column_mean_optimal[i] <- mean(mtcars[[i]])
}
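The two loops above also differ in how the index is generated, which is easy to overlook. The following is a small sketch of why seq_along is usually the safer choice; the zero-column data frame `empty_df` is a hypothetical input introduced only for illustration.

```r
# A zero-column data frame stands in for any empty input (hypothetical example).
empty_df <- mtcars[, 0]

# 1:length(x) counts down from 1 to 0 when x is empty, giving two iterations.
bad_index <- 1:length(empty_df)

# seq_along(x) returns an empty integer vector, so a loop body never runs.
good_index <- seq_along(empty_df)

print(bad_index)           # [1] 1 0
print(length(good_index))  # [1] 0
```

With mtcars itself both forms produce the same indices, so this difference does not affect the timings discussed here.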
vapply Function
column_mean_vapply <- vapply(mtcars, mean, numeric(1))
colMeans Function
column_mean_colMeans <- colMeans(mtcars)
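Before comparing speed, it can be worth confirming that the four approaches agree. The sketch below recomputes the column means each way, assuming each loop stores its result at position i, and compares the values. The short variable names are chosen here for illustration and are not from the original code.

```r
# Recompute the column means four ways on the built-in mtcars data.
poor <- NULL
for (i in 1:length(mtcars)) poor[i] <- mean(mtcars[[i]])

optimal <- vector("numeric", length(mtcars))
for (i in seq_along(mtcars)) optimal[i] <- mean(mtcars[[i]])

via_vapply   <- vapply(mtcars, mean, numeric(1))
via_colMeans <- colMeans(mtcars)

# The loops return unnamed vectors; vapply and colMeans keep the column names,
# so names are dropped before comparing the numeric values.
same_values <- isTRUE(all.equal(poor, optimal)) &&
  isTRUE(all.equal(optimal, unname(via_vapply))) &&
  isTRUE(all.equal(unname(via_vapply), unname(via_colMeans)))
print(same_values)
```

all.equal is used rather than identical because the approaches may differ in the last few bits of floating point precision.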
The results of benchmarking these four approaches are shown below (code not displayed).
Unit: microseconds
     expr     min       lq     mean   median       uq     max neval cld
     poor 240.986 262.0820 287.1125 275.8160 307.2485 442.609   100   d
  optimal 220.313 237.4455 258.8426 247.0735 280.9130 362.469   100   c
   vapply 107.042 109.7320 124.4715 113.4130 132.6695 202.473   100   a
 colMeans 155.183 161.6955 180.2067 175.0045 194.2605 259.958   100   b
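The benchmarking code itself is not given in the source. The layout of the output resembles what the microbenchmark package prints, so the following is a sketch of how such a comparison might be run under that assumption. The wrapper function names are hypothetical, and the timing step is skipped if the package is not installed.

```r
# Each approach is wrapped in a function so it can be timed in isolation.
mean_poor <- function(df) {
  out <- NULL
  for (i in 1:length(df)) out[i] <- mean(df[[i]])
  out
}
mean_optimal <- function(df) {
  out <- vector("numeric", length(df))
  for (i in seq_along(df)) out[i] <- mean(df[[i]])
  out
}
mean_vapply   <- function(df) vapply(df, mean, numeric(1))
mean_colMeans <- function(df) colMeans(df)

# Assumption: the microbenchmark package is available; otherwise skip timing.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  timings <- microbenchmark::microbenchmark(
    poor     = mean_poor(mtcars),
    optimal  = mean_optimal(mtcars),
    vapply   = mean_vapply(mtcars),
    colMeans = mean_colMeans(mtcars),
    times = 100
  )
  print(timings)
}
```

Timings depend on the machine and the R version, and the function wrappers add a small call overhead, so the numbers will not match the table exactly.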
Notice that the optimized for loop edged out the poorly constructed for loop. The poorly constructed for loop grows the output object on every iteration, and at each change of length R must make room for the new element and re-evaluate the class of the object.
The optimized for loop removes some of this overhead by declaring the type and length of the output object before the loop starts.
In this example, however, using vapply roughly doubles the computational efficiency (a median of about 113 microseconds versus about 247 for the optimized loop), largely because we told R that each result had to be a single numeric value (if any one result were not numeric, an error would be returned).
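The type check described above can be seen directly. The sketch below uses a hypothetical helper, `mean_as_text`, that returns a character value. sapply is shown only for contrast, as another member of the *apply family that does not enforce a result type.

```r
# Hypothetical summary function that returns text instead of a number.
mean_as_text <- function(x) as.character(mean(x))

# sapply accepts whatever comes back, so the result is silently character.
loose <- sapply(mtcars, mean_as_text)
print(class(loose))  # [1] "character"

# vapply was promised numeric(1), so the mismatch raises an error instead.
# tryCatch captures the error message so the script keeps running.
strict <- tryCatch(
  vapply(mtcars, mean_as_text, numeric(1)),
  error = function(e) conditionMessage(e)
)
print(strict)
```

Here `strict` ends up holding the error message rather than a vector of means, which is the behavior the paragraph above relies on.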
Use of the colMeans function is a touch slower than the vapply function. This difference is attributable to some error checks performed in colMeans and, mainly, to the as.matrix conversion (needed because mtcars is a data.frame), neither of which is performed in the vapply function.
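If the same data will be summarized repeatedly, one option is to perform the as.matrix conversion once, up front. The following is a minimal sketch of that idea; the name `mtcars_matrix` is introduced here for illustration, and whether this pays off in practice would need to be benchmarked.

```r
# Convert once, outside any timed or repeated code.
mtcars_matrix <- as.matrix(mtcars)

# colMeans on a matrix does not need the data frame to matrix conversion.
means_from_matrix <- colMeans(mtcars_matrix)
means_from_df     <- colMeans(mtcars)

# Both routes give the same named numeric vector.
print(isTRUE(all.equal(means_from_matrix, means_from_df)))
```

This only works cleanly because every column of mtcars is numeric; a data frame with mixed column types would be coerced to a character matrix.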