R Language data.table Writing code compatible with both data.frame and data.table


Differences in subsetting syntax

A data.table is one of several two-dimensional data structures available in R, besides data.frame, matrix and (2D) array. All of these classes use a very similar but not identical syntax for subsetting, the A[rows, cols] schema.

Consider the following data stored in a matrix, a data.frame and a data.table:

ma <- matrix(rnorm(12), nrow=4, dimnames=list(letters[1:4], c('X', 'Y', 'Z')))
df <- as.data.frame(ma)
dt <- as.data.table(ma)

ma[2:3]  #---> returns the 2nd and 3rd items, as if 'ma' were a vector (because it is!)
df[2:3]  #---> returns the 2nd and 3rd columns
dt[2:3]  #---> returns the 2nd and 3rd rows!

If you want to be sure of what will be returned, it is better to be explicit.

To get specific rows, just add a comma after the range:

ma[2:3, ]  # \
df[2:3, ]  #  }---> returns the 2nd and 3rd rows
dt[2:3, ]  # /

But, if you want to subset columns, some cases are interpreted differently. All three can be subset the same way with integer or character indices not stored in a variable.

ma[, 2:3]          #  \
df[, 2:3]          #   \
dt[, 2:3]          #    }---> returns the 2nd and 3rd columns
ma[, c("Y", "Z")]  #   /
df[, c("Y", "Z")]  #  /
dt[, c("Y", "Z")]  # /

However, they differ for unquoted variable names

mycols <- 2:3
ma[, mycols]                # \
df[, mycols]                #  }---> returns the 2nd and 3rd columns
dt[, mycols, with = FALSE]  # /

dt[, mycols]                # ---> Raises an error

In the last case, mycols is evaluated as the name of a column. Because dt cannot find a column named mycols, an error is raised.

Note: For versions of the data.table package priorto 1.9.8, this behavior was slightly different. Anything in the column index would have been evaluated using dt as an environment. So both dt[, 2:3] and dt[, mycols] would return the vector 2:3. No error would be raised for the second case, because the variable mycols does exist in the parent environment.

Strategies for maintaining compatibility with data.frame and data.table

There are many reasons to write code that is guaranteed to work with data.frame and data.table. Maybe you are forced to use data.frame, or you may need to share some code that you don't know how will be used. So, there are some main strategies for achieving this, in order of convenience:

  1. Use syntax that behaves the same for both classes.
  2. Use a common function that does the same thing as the shortest syntax.
  3. Force data.table to behave as data.frame (ex.: call the specific method print.data.frame).
  4. Treat them as list, which they ultimately are.
  5. Convert the table to a data.frame before doing anything (bad idea if it is a huge table).
  6. Convert the table to data.table, if dependencies are not a concern.

Subset rows. Its simple, just use the [, ] selector, with the comma:

A[1:10, ]
A[A$var > 17, ]  # A[var > 17, ] just works for data.table

Subset columns. If you want a single column, use the $ or the [[ ]] selector:

colname <- 'var'

If you want a uniform way to grab more than one column, it's necessary to appeal a bit:

B <- `[.data.frame`(A, 2:4)

# We can give it a better name
select <- `[.data.frame`
B <- select(A, 2:4)
C <- select(A, c('foo', 'bar'))

Subset 'indexed' rows. While data.frame has row.names, data.table has its unique key feature. The best thing is to avoid row.names entirely and take advantage of the existing optimizations in the case of data.table when possible.

B <- A[A$var != 0, ]
# or...
B <- with(A, A[var != 0, ])  # data.table will silently index A by var before subsetting

stuff <- c('a', 'c', 'f')
C <- A[match(stuff, A$name), ]  # really worse than: setkey(A); A[stuff, ]

Get a 1-column table, get a row as a vector. These are easy with what we have seen until now:

B <- select(A, 2)    #---> a table with just the second column
C <- unlist(A[1, ])  #---> the first row as a vector (coerced if necessary)