# example data
library(data.table)
DT = data.table(iris)
DT[, Bin := cut(Sepal.Length, c(4,6,8))]
Suppose we want the summary function output for Sepal.Length along with the number of observations:
DT[, c(
    as.list(summary(Sepal.Length)),
    N = .N
), by=.(Species, Bin)]
#       Species   Bin Min. 1st Qu. Median  Mean 3rd Qu. Max.  N
# 1:     setosa (4,6]  4.3     4.8    5.0 5.006     5.2  5.8 50
# 2: versicolor (6,8]  6.1     6.2    6.4 6.450     6.7  7.0 20
# 3: versicolor (4,6]  4.9     5.5    5.6 5.593     5.8  6.0 30
# 4:  virginica (6,8]  6.1     6.4    6.7 6.778     7.2  7.9 41
# 5:  virginica (4,6]  4.9     5.7    5.8 5.722     5.9  6.0  9
We have to make j a list of columns. Usually, some playing around with c(), as.list() and .() is enough to figure out the correct way to proceed.
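To see why j needs to be a list, it can help to compare a vector result with a list result. The following is a minimal sketch on the iris data; the object names stacked, wide and mixed are only illustrative.

```r
library(data.table)
DT <- data.table(iris)

# A plain vector in j is stacked into a single column (V1),
# giving one row per value per group.
stacked <- DT[, c(mean(Sepal.Length), sd(Sepal.Length)), by = Species]

# A list in j gives one column per element and one row per group.
wide <- DT[, .(mean = mean(Sepal.Length), sd = sd(Sepal.Length)), by = Species]

# as.list() turns a named vector into a list, so each name becomes a column;
# c() then appends further elements such as the group size.
mixed <- DT[, c(
    as.list(quantile(Sepal.Length, c(0.25, 0.75))),
    N = .N
), by = Species]

names(stacked)  # "Species" "V1"
names(wide)     # "Species" "mean" "sd"
names(mixed)    # "Species" "25%" "75%" "N"
```

The same pattern should apply to any function that returns a named vector, as summary() does in the example above.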
Instead of making a summary table, we may want to store a summary statistic in a new column. We can use := as usual. For example,
DT[, is_big := .N >= 25, by=.(Species, Bin)]
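As a quick check of what a grouped := does, the sketch below rebuilds the example data and confirms that the table keeps all of its rows and that the single value computed per group is repeated on every row of that group. The names flags and per_group are illustrative only.

```r
library(data.table)
DT <- data.table(iris)
DT[, Bin := cut(Sepal.Length, c(4, 6, 8))]

# := adds the column by reference; no rows are dropped or aggregated.
DT[, is_big := .N >= 25, by = .(Species, Bin)]

# One distinct flag per group, since .N >= 25 is a length-one result
# that is recycled across the rows of the group.
flags <- unique(DT[, .(Species, Bin, is_big)])

# Cross-check against the group sizes computed separately.
per_group <- DT[, .(N = .N, is_big = is_big[1]), by = .(Species, Bin)]
print(per_group)
```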
If you find yourself wanting to parse column names, like
"Take the mean of x.Length/x.Width, where x takes ten different values."
then you are probably looking at data embedded in column names, which is a bad idea. Read about tidy data and then reshape to long format.
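One possible way to do that reshaping is sketched below, using iris, where the prefix takes only two values (Sepal and Petal) rather than ten. It assumes melt() with patterns() and value.name; the columns id, part_id and part and the objects long and ratios are introduced here for illustration.

```r
library(data.table)
DT <- data.table(iris)
DT[, id := .I]  # row identifier, so observations stay distinguishable

# Melt the two Length columns and the two Width columns in parallel.
# part_id is 1 for the first match of each pattern (Sepal), 2 for the second (Petal).
long <- melt(
    DT,
    id.vars       = c("id", "Species"),
    measure.vars  = patterns("Length$", "Width$"),
    variable.name = "part_id",
    value.name    = c("Length", "Width")
)

# Turn the index into a readable label held in a column, not a column name.
long[, part := c("Sepal", "Petal")[as.integer(part_id)]]

# The original request is now an ordinary grouped summary.
ratios <- long[, .(mean_ratio = mean(Length / Width)), by = part]
print(ratios)
```

With the prefix stored as data, adding more prefixes only adds rows; the summary code stays the same.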
Data frames and data.tables are well-designed for tabular data, where rows correspond to observations and columns to variables. If you find yourself wanting to summarize over rows, like
"Find the standard deviation across columns for each row."
then you should probably be using a matrix or some other data format entirely.
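As a rough illustration of the matrix route, the sketch below converts the four numeric iris columns to a matrix and computes a standard deviation per row with apply(). The names m and row_sd are illustrative, and whether a row-wise standard deviation is meaningful depends on the columns being comparable measurements.

```r
# Keep only the numeric measurement columns and store them as a matrix.
m <- as.matrix(iris[, 1:4])

# MARGIN = 1 applies the function over rows.
row_sd <- apply(m, 1, sd)

length(row_sd)  # one value per row of the matrix
head(row_sd)
```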