apply is used to evaluate a function (maybe an anonymous one) over the margins of an array or matrix.
Let's use the iris dataset to illustrate this idea. The iris dataset has measurements of 150 flowers from 3 species. Let's see how this dataset is structured:
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
Now, imagine that you want to know the mean of each of these variables. One way to solve this might be to use a for loop, but R programmers will often prefer to use apply (for reasons why, see Remarks):
> apply(iris[1:4], 2, mean)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
    5.843333     3.057333     3.758000     1.199333
In the first argument, we subset iris to include only the first 4 columns, because mean only works on numeric data.
The second argument, 2, indicates that we want to work on the columns only (the second subscript of the r×c array); 1 would give the row means.
In the same way we can calculate other summary statistics:
# standard deviation
apply(iris[1:4], 2, sd)
# variance
apply(iris[1:4], 2, var)
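To make the margin argument more concrete, here is a minimal sketch on a small made-up matrix m (not part of the iris example). Margin 1 applies the function to each row and margin 2 to each column; the explicit for loop at the end is roughly what the column-wise call stands in for. All variable names are illustrative.

```r
# A small hypothetical matrix, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

row.sums <- apply(m, 1, sum)  # one value per row:    9 12
col.sums <- apply(m, 2, sum)  # one value per column: 3 7 11

# The same column sums written as an explicit loop
loop.sums <- numeric(ncol(m))
for (j in seq_len(ncol(m))) {
  loop.sums[j] <- sum(m[, j])
}
```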
Caveat: R has some built-in functions which are better for calculating column and row sums and means: colSums, rowSums, colMeans and rowMeans.
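As a quick check of this caveat, the following sketch compares colMeans with the apply call used earlier on the same four columns. The two results should agree up to floating-point tolerance; the variable names are only for illustration.

```r
via.apply    <- apply(iris[1:4], 2, mean)
via.colMeans <- colMeans(iris[1:4])

# all.equal compares with a small numeric tolerance
all.equal(via.apply, via.colMeans)
```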
Now, let's do a different and more meaningful task: let's calculate the mean only for those values which are bigger than 0.5. For that, we will create our own mean function.
> our.mean.function <- function(x) { mean(x[x > 0.5]) }
> apply(iris[1:4], 2, our.mean.function)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
    5.843333     3.057333     3.758000     1.665347
(Note the difference in the mean of Petal.Width.)
But, what if we don't want to use this function in the rest of our code? Then, we can use an anonymous function, and write our code like this:
apply(iris[1:4], 2, function(x) { mean(x[x > 0.5]) })
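One possible variation, sketched here on the assumption that the threshold should be adjustable: apply forwards any additional arguments to the function, so the cutoff can be passed in rather than hard-coded. The argument name cutoff and the variable res are made up for this illustration.

```r
# 'cutoff' is a hypothetical argument name; apply hands it to the
# anonymous function through its '...' argument
res <- apply(iris[1:4], 2,
             function(x, cutoff) mean(x[x > cutoff]),
             cutoff = 0.5)
res
```

With cutoff = 0.5 this should reproduce the result of the call above.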
So, as we have seen, we can use apply to execute the same operation on columns or rows of a dataset using only one line.
Caveat: Since apply returns very different kinds of output depending on the length of the results of the specified function, it may not be the best choice in cases where you are not working interactively. Some of the other *apply family functions are a bit more predictable (see Remarks).
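The sketch below illustrates this caveat on the same data: the shape of what apply returns follows the length of each function result. vapply is shown as one of the more predictable alternatives, since it requires the expected type and length of each result to be declared; treat it as an illustration rather than a recommendation for every case.

```r
# Each result has length 1: apply returns a named numeric vector
v <- apply(iris[1:4], 2, mean)
is.vector(v)

# Each result has length 2: apply returns a 2 x 4 matrix
r <- apply(iris[1:4], 2, range)
dim(r)

# Results of differing lengths: apply falls back to a list
l <- apply(iris[1:4], 2, function(x) x[x > 3])
is.list(l)

# vapply declares the expected result up front and stops with an
# error if any result does not match it
vapply(iris[1:4], mean, FUN.VALUE = numeric(1))
```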