# example data
DT = data.table(iris)
DT[, Bin := cut(Sepal.Length, c(4,6,8))]
Suppose we want the summary function output for Sepal.Length along with the number of observations:
DT[, c(
as.list(summary(Sepal.Length)),
N = .N
), by=.(Species, Bin)]
# Species Bin Min. 1st Qu. Median Mean 3rd Qu. Max. N
# 1: setosa (4,6] 4.3 4.8 5.0 5.006 5.2 5.8 50
# 2: versicolor (6,8] 6.1 6.2 6.4 6.450 6.7 7.0 20
# 3: versicolor (4,6] 4.9 5.5 5.6 5.593 5.8 6.0 30
# 4: virginica (6,8] 6.1 6.4 6.7 6.778 7.2 7.9 41
# 5: virginica (4,6] 4.9 5.7 5.8 5.722 5.9 6.0 9
We have to make j a list of columns. Usually, some playing around with c, as.list and . is enough to figure out the correct way to proceed.
Instead of making a summary table, we may want to store a summary statistic in a new column. We can use := as usual. For example,
DT[, is_big := .N >= 25, by=.(Species, Bin)]
If you find yourself wanting to parse column names, like
Take the mean of
x.Length/x.Widthwherextakes ten different values.
then you are probably looking at data embedded in column names, which is a bad idea. Read about tidy data and then reshape to long format.
Data frames and data.tables are well-designed for tabular data, where rows correspond to observations and columns to variables. If you find yourself wanting to summarize over rows, like
Find the standard deviation across columns for each row.
then you should probably be using a matrix or some other data format entirely.