data.table Tutorial => Custom summaries

Example

# example data
DT = data.table(iris)
DT[, Bin := cut(Sepal.Length, c(4,6,8))]

Suppose we want the summary function output for Sepal.Length along with the number of observations:

DT[, c(
    as.list(summary(Sepal.Length)),
    N = .N
), by=.(Species, Bin)]

#       Species   Bin Min. 1st Qu. Median  Mean 3rd Qu. Max.  N
# 1:     setosa (4,6]  4.3     4.8    5.0 5.006     5.2  5.8 50
# 2: versicolor (6,8]  6.1     6.2    6.4 6.450     6.7  7.0 20
# 3: versicolor (4,6]  4.9     5.5    5.6 5.593     5.8  6.0 30
# 4:  virginica (6,8]  6.1     6.4    6.7 6.778     7.2  7.9 41
# 5:  virginica (4,6]  4.9     5.7    5.8 5.722     5.9  6.0  9

We have to make j a list of columns. Usually, some playing around with c, as.list and . is enough to figure out the correct way to proceed.

Assigning summary statistics as new columns

Instead of making a summary table, we may want to store a summary statistic in a new column. We can use := as usual. For example,

DT[, is_big := .N >= 25, by=.(Species, Bin)]

Pitfalls

Untidy data

If you find yourself wanting to parse column names, like

Take the mean of x.Length/x.Width where x takes ten different values.

then you are probably looking at data embedded in column names, which is a bad idea. Read about tidy data and then reshape to long format.

Rowwise summaries

Data frames and data.tables are well-designed for tabular data, where rows correspond to observations and columns to variables. If you find yourself wanting to summarize over rows, like

Find the standard deviation across columns for each row.

then you should probably be using a matrix or some other data format entirely.

PDF - Download data.table for free

Previous Next

data.table

Fastest Entity Framework Extensions

Example

Assigning summary statistics as new columns

Pitfalls

Untidy data

Rowwise summaries

Got any data.table Question?

data.table

data.table Computing summary statistics Custom summaries

Fastest Entity Framework Extensions

Example

Assigning summary statistics as new columns

Pitfalls

Untidy data

Rowwise summaries

Got any data.table Question?