data.table is one of several two-dimensional data structures available in R, besides data.frame, matrix and (2D) array. All of these classes use a very similar, but not identical, syntax for subsetting: the A[rows, cols] schema.
Consider the following data stored in a matrix, a data.frame and a data.table:
    library(data.table)

    ma <- matrix(rnorm(12), nrow=4, dimnames=list(letters[1:4], c('X', 'Y', 'Z')))
    df <- as.data.frame(ma)
    dt <- as.data.table(ma)

    ma[2:3]  #---> returns the 2nd and 3rd items, as if 'ma' were a vector (because it is!)
    df[2:3]  #---> returns the 2nd and 3rd columns
    dt[2:3]  #---> returns the 2nd and 3rd rows!
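To make the difference concrete, the following sketch inspects the shape of each result rather than its values. It assumes the data.table package is installed; set.seed() is there only to make the random data reproducible.

```r
library(data.table)

set.seed(1)  # only for reproducibility of rnorm()
ma <- matrix(rnorm(12), nrow = 4, dimnames = list(letters[1:4], c("X", "Y", "Z")))
df <- as.data.frame(ma)
dt <- as.data.table(ma)

length(ma[2:3])  # 2: two elements of the underlying vector
dim(df[2:3])     # 4 2: all rows, columns Y and Z
dim(dt[2:3])     # 2 3: rows 2 and 3, all columns
```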
If you want to be sure of what will be returned, it is better to be explicit.
To get specific rows, just add a comma after the range:
    ma[2:3, ]  # \
    df[2:3, ]  #  }---> returns the 2nd and 3rd rows
    dt[2:3, ]  # /
Subsetting columns is less uniform, because some cases are interpreted differently. All three can be subset the same way when the integer or character indices are written out literally rather than stored in a variable:
    ma[, 2:3]          #  \
    df[, 2:3]          #   \
    dt[, 2:3]          #    }---> returns the 2nd and 3rd columns
    ma[, c("Y", "Z")]  #   /
    df[, c("Y", "Z")]  #  /
    dt[, c("Y", "Z")]  # /
However, they differ when the indices are passed as an unquoted variable name:
    mycols <- 2:3
    ma[, mycols]                # \
    df[, mycols]                #  }---> returns the 2nd and 3rd columns
    dt[, mycols, with = FALSE]  # /

    dt[, mycols]                # ---> Raises an error
In the last case, mycols is evaluated as the name of a column. Because dt cannot find a column named mycols, an error is raised.
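As a sketch of two ways around this error: besides with = FALSE, recent data.table releases also accept a .. prefix, which asks data.table to look the variable up in the calling scope. The example assumes an installed data.table recent enough to support that prefix; the table contents are made up, and tryCatch() is used only to show that the bare form fails.

```r
library(data.table)

dt <- data.table(X = 1:4, Y = 5:8, Z = 9:12)
mycols <- 2:3

# Bare variable name: data.table looks for a column called "mycols"
res <- tryCatch(dt[, mycols], error = function(e) "error")

# Two explicit ways to say "mycols is a variable, not a column"
a <- dt[, mycols, with = FALSE]
b <- dt[, ..mycols]

names(a)  # "Y" "Z"
names(b)  # "Y" "Z"
```

The .. form is shorter, while with = FALSE also works on older versions, so the choice depends on which versions the code has to support.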
Note: For versions of the data.table package prior to 1.9.8, this behavior was slightly different. Anything in the column index would have been evaluated using dt as an environment. So both dt[, 2:3] and dt[, mycols] would return the vector 2:3. No error would be raised for the second case, because the variable mycols does exist in the parent environment.
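Code that has to run against both old and new releases could branch on the package version. The helper below is a hypothetical illustration, not part of data.table; it relies only on base R version comparison, and the check of the installed package runs only if data.table is available.

```r
# Hypothetical helper: TRUE when a data.table version follows the rules
# introduced in 1.9.8 (column index no longer evaluated inside dt)
uses_new_j_rules <- function(version) {
  package_version(version) >= "1.9.8"
}

uses_new_j_rules("1.9.6")   # FALSE
uses_new_j_rules("1.10.4")  # TRUE

# Checking the installed package, if there is one
if (requireNamespace("data.table", quietly = TRUE)) {
  uses_new_j_rules(as.character(utils::packageVersion("data.table")))
}
```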
There are many reasons to write code that is guaranteed to work with both data.frame and data.table. Maybe you are forced to use data.frame, or you may need to share code without knowing how it will be used. There are a few main strategies for achieving this, in order of convenience:
- Force data.table to behave as data.frame (ex.: call the specific method, such as `[.data.frame`).
- Treat them as list, which they ultimately are.
- Convert the table to a data.frame before doing anything (bad idea if it is a huge table).
- Convert the table to data.table, if dependencies are not a concern.
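Two of these strategies, treating the table as a list and converting it to a data.frame first, can be sketched with base R alone. The helper names (col_classes, take_cols) and the sample data are hypothetical, chosen for illustration; the data.table part runs only if the package happens to be installed.

```r
# Treat the table as a list: works for any data.frame-like object
col_classes <- function(tbl) vapply(tbl, function(x) class(x)[1], character(1))

# Convert to a plain data.frame first, then rely on data.frame semantics.
# This copies the data, so it is a poor fit for very large tables.
take_cols <- function(tbl, cols) {
  as.data.frame(tbl)[, cols, drop = FALSE]
}

df <- data.frame(foo = 1:3, bar = c("a", "b", "c"), baz = c(0.5, 1.5, 2.5))
mycols <- c("foo", "baz")

col_classes(df)        # class of each column, named by column
take_cols(df, mycols)  # data.frame with columns foo and baz

# The same calls on a data.table, if the package is available
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(df)
  col_classes(dt)
  names(take_cols(dt, mycols))
}
```

Because take_cols() always returns a plain data.frame, callers no longer need to care which class they were given, at the cost of one copy.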
Subset rows. It's simple: just use the [, ] selector, with the comma:
    A[1:10, ]
    A[A$var > 17, ]  # A[var > 17, ] just works for data.table
Subset columns. If you want a single column, use the $ or the [[ ]] selector:
    A$var
    colname <- 'var'
    A[[colname]]
    A[[1]]
If you want a uniform way to grab more than one column, you have to resort to a small trick and call the data.frame method directly:
    B <- `[.data.frame`(A, 2:4)

    # We can give it a better name
    select <- `[.data.frame`
    B <- select(A, 2:4)
    C <- select(A, c('foo', 'bar'))
Subset 'indexed' rows. While data.frame has row.names, data.table has its unique key feature. The best thing is to avoid row.names entirely and take advantage of the existing optimizations in the case of data.table when possible.
    B <- A[A$var != 0, ]
    # or...
    B <- with(A, A[var != 0, ])  # data.table can index automatically for some filters (== and %in%)

    stuff <- c('a', 'c', 'f')
    C <- A[match(stuff, A$name), ]  # much slower than the keyed lookup: setkey(A, name); A[stuff, ]
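A minimal sketch comparing the portable match() lookup with a keyed data.table lookup. It assumes data.table is installed; the table contents and the column names name and var are made up for illustration.

```r
library(data.table)

A <- data.table(name = c("a", "b", "c", "d", "e", "f"),
                var  = c(3, 0, 5, 0, 1, 2))
stuff <- c("a", "c", "f")

# Portable version: works for data.frame and data.table alike
C1 <- A[match(stuff, A$name), ]

# data.table version: key the table by name, then look rows up by key value
setkey(A, name)
C2 <- A[stuff]

C1$var  # 3 5 2
C2$var  # 3 5 2
```

Note that setkey() sorts the table by reference, so it changes A in place; the portable version leaves A untouched.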
Get a 1-column table, get a row as a vector. These are easy with what we have seen so far:
    B <- select(A, 2)    #---> a table with just the second column
    C <- unlist(A[1, ])  #---> the first row as a vector (coerced if necessary)
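The "coerced if necessary" remark matters when the columns have different types. A small base-R sketch with made-up data shows what unlist() does to a mixed row:

```r
A <- data.frame(name = c("a", "b"), var = c(1.5, 2.5), stringsAsFactors = FALSE)

row1 <- unlist(A[1, ])  # mixed character/numeric columns are coerced to character
is.character(row1)      # TRUE
names(row1)             # "name" "var"

nums <- unlist(A[1, "var", drop = FALSE])  # a single numeric column stays numeric
is.numeric(nums)        # TRUE
```

If the original types matter, as.list(A[1, ]) keeps each value in its own type instead of flattening the row into one vector.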