data.table Computing summary statistics Custom summaries


Example

# example data
DT = data.table(iris)
DT[, Bin := cut(Sepal.Length, c(4,6,8))]

Suppose we want the summary function output for Sepal.Length along with the number of observations:

DT[, c(
    as.list(summary(Sepal.Length)),
    N = .N
), by=.(Species, Bin)]

#       Species   Bin Min. 1st Qu. Median  Mean 3rd Qu. Max.  N
# 1:     setosa (4,6]  4.3     4.8    5.0 5.006     5.2  5.8 50
# 2: versicolor (6,8]  6.1     6.2    6.4 6.450     6.7  7.0 20
# 3: versicolor (4,6]  4.9     5.5    5.6 5.593     5.8  6.0 30
# 4:  virginica (6,8]  6.1     6.4    6.7 6.778     7.2  7.9 41
# 5:  virginica (4,6]  4.9     5.7    5.8 5.722     5.9  6.0  9

We have to make j a list of columns. Usually, some playing around with c, as.list and . is enough to figure out the correct way to proceed.

Assigning summary statistics as new columns

Instead of making a summary table, we may want to store a summary statistic in a new column. We can use := as usual. For example,

DT[, is_big := .N >= 25, by=.(Species, Bin)]

Pitfalls

Untidy data

If you find yourself wanting to parse column names, like

Take the mean of x.Length/x.Width where x takes ten different values.

then you are probably looking at data embedded in column names, which is a bad idea. Read about tidy data and then reshape to long format.

Rowwise summaries

Data frames and data.tables are well-designed for tabular data, where rows correspond to observations and columns to variables. If you find yourself wanting to summarize over rows, like

Find the standard deviation across columns for each row.

then you should probably be using a matrix or some other data format entirely.