data.table Using keys and indices Improving performance for selecting subsets


Example

# example data
set.seed(1)
n  = 1e7
ng = 1e4
DT = data.table(
    g1  = sample(ng, n, replace=TRUE),
    g2  = sample(ng, n, replace=TRUE),
    v  = rnorm(n)
)

Matching on one column

After the first run of a subsetting operation with == or %in%...

system.time(
    DT[ g1 %in% 1:100]
)
#    user  system elapsed 
#    0.12    0.03    0.16 

An index has been created automatically for g1. Subsequent subsetting operations run almost instantly:

system.time(
    DT[ g1 %in% 1:100]
)
#    user  system elapsed 
#       0       0       0

To monitor when an index is created or used, add the verbose=TRUE option or change the global setting options(datatable.verbose=TRUE).

Matching on multiple columns

Currently, matching on two columns does not automatically create an index:

system.time(
    DT[ g1 %in% 1:100 & g2 %in% 1:100]
)
#    user  system elapsed 
#    0.57    0.00    0.57

Re-run this and it will remain slow. Even if we manually add the index with setindex(DT, g1, g2), it will remain slow because this query is not yet optimized by the package.

Fortunately, if we can enumerate the combinations of values we want to search for and an index is available, we can quickly equi-join:

system.time(
    DT[ CJ(g1 = 1:100, g2 = 1:100, unique=TRUE), on=.(g1, g2), nomatch=0]
)
#    user  system elapsed 
#    0.53    0.00    0.54 
setindex(DT, g1, g2)
system.time(
    DT[ CJ(g1 = 1:100, g2 = 1:100, unique=TRUE), on=.(g1, g2), nomatch=0]
)
#    user  system elapsed 
#       0       0       0

With CJ, it's important to watch out for the number of combinations becoming too large.