# example data
set.seed(1)
n = 1e7
ng = 1e4
DT = data.table(
g1 = sample(ng, n, replace=TRUE),
g2 = sample(ng, n, replace=TRUE),
v = rnorm(n)
)
After the first run of a subsetting operation with ==
or %in%
...
system.time(
DT[ g1 %in% 1:100]
)
# user system elapsed
# 0.12 0.03 0.16
An index has been created automatically for g1
. Subsequent subsetting operations run almost instantly:
system.time(
DT[ g1 %in% 1:100]
)
# user system elapsed
# 0 0 0
To monitor when an index is created or used, add the verbose=TRUE
option or change the global setting options(datatable.verbose=TRUE)
.
Currently, matching on two columns does not automatically create an index:
system.time(
DT[ g1 %in% 1:100 & g2 %in% 1:100]
)
# user system elapsed
# 0.57 0.00 0.57
Re-run this and it will remain slow. Even if we manually add the index with setindex(DT, g1, g2)
, it will remain slow because this query is not yet optimized by the package.
Fortunately, if we can enumerate the combinations of values we want to search for and an index is available, we can quickly equi-join:
system.time(
DT[ CJ(g1 = 1:100, g2 = 1:100, unique=TRUE), on=.(g1, g2), nomatch=0]
)
# user system elapsed
# 0.53 0.00 0.54
setindex(DT, g1, g2)
system.time(
DT[ CJ(g1 = 1:100, g2 = 1:100, unique=TRUE), on=.(g1, g2), nomatch=0]
)
# user system elapsed
# 0 0 0
With CJ
, it's important to watch out for the number of combinations becoming too large.