R Language data.table Setting keys in data.table


Yes, you need to SETKEY pre 1.9.6

In the past (pre 1.9.6), your data.table was sped up by setting columns as keys to the table, particularly for large tables. [See intro vignette page 5 of September 2015 version, where speed of search was 544 times better.] You may find older code making use of this setting keys with 'setkey' or setting a 'key=' column when setting up the table.

DT <- data.table(
  x = letters[1:5], 
  y = 5:1, 
  z = (1:5) > 3

#> DT
#   x y     z
#1: a 5 FALSE
#2: b 4 FALSE
#3: c 3 FALSE
#4: d 2  TRUE
#5: e 1  TRUE

Set your key with the setkey command. You can have a key with multiple columns.

setkey(DT, y)

Check your table's key in tables()


> tables()
[1,] DT      5    3  1 x,y,z y  
Total: 1MB

Note this will re-sort your data.

#> DT
#   x y     z
#1: e 1  TRUE
#2: d 2  TRUE
#3: c 3 FALSE
#4: b 4 FALSE
#5: a 5 FALSE

Now it is unnecessary

Prior to v1.9.6 you had to have set a key for certain operations especially joining tables. The developers of data.table have sped up and introduced a "on=" feature that can replace the dependency on keys. See SO answer here for a detailed discussion.

In Jan 2017, the developers have written a vignette around secondary indices which explains the "on" syntax and allows for other columns to be identified for fast indexing.

Creating secondary indices?

In a manner similar to key, you can setindex(DT, key.col) or setindexv(DT, "key.col.string"), where DT is your data.table. Remove all indices with setindex(DT, NULL).

See your secondary indices with indices(DT).

Why secondary indices?

This does not sort the table (unlike key), but does allow for quick indexing using the "on" syntax. Note there can be only one key, but you can use multiple secondary indices, which saves having to rekey and resort the table. This will speed up your subsetting when changing the columns you want to subset on.

Recall, in example above y was the key for table DT:

# x y     z
# 1: e 1  TRUE
# 2: d 2  TRUE
# 3: c 3 FALSE
# 4: b 4 FALSE
# 5: a 5 FALSE

# Let us set x as index 
setindex(DT, x)

# Use indices to see what has been set
# [1] "x"

# fast subset using index and not keyed column
DT["c", on ="x"]
#x y     z
#1: c 3 FALSE

# old way would have been rekeying DT from y to x, doing subset and 
# perhaps keying back to y (now we save two sorts)
# This is a toy example above but would have been more valuable with big data sets