A data.table is one of several two-dimensional data structures available in R, besides data.frame, matrix and (2D) array. All of these classes use a very similar but not identical syntax for subsetting, the A[rows, cols] schema.
Consider the following data stored in a matrix, a data.frame and a data.table:
library(data.table)
ma <- matrix(rnorm(12), nrow=4, dimnames=list(letters[1:4], c('X', 'Y', 'Z')))
df <- as.data.frame(ma)
dt <- as.data.table(ma)
ma[2:3] #---> returns the 2nd and 3rd items, as if 'ma' were a vector (because it is!)
df[2:3] #---> returns the 2nd and 3rd columns
dt[2:3] #---> returns the 2nd and 3rd rows!
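The difference is easiest to see by checking the shape of each result. The following is a minimal sketch, assuming the data.table package is installed; the names r_ma, r_df and r_dt are introduced only for this illustration.

```r
library(data.table)

ma <- matrix(rnorm(12), nrow = 4, dimnames = list(letters[1:4], c("X", "Y", "Z")))
df <- as.data.frame(ma)
dt <- as.data.table(ma)

r_ma <- ma[2:3]  # plain numeric vector with two elements
r_df <- df[2:3]  # data.frame holding columns Y and Z
r_dt <- dt[2:3]  # data.table holding rows 2 and 3

is.null(dim(r_ma))  # the matrix structure is gone, only a vector remains
dim(r_df)           # 4 rows, 2 columns
dim(r_dt)           # 2 rows, 3 columns
```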
If you want to be sure of what will be returned, it is better to be explicit.
To get specific rows, just add a comma after the range:
ma[2:3, ] # \
df[2:3, ] # }---> returns the 2nd and 3rd rows
dt[2:3, ] # /
Subsetting columns is less uniform, because some cases are interpreted differently. All three can be subset the same way when the integer or character indices are written out literally rather than stored in a variable:
ma[, 2:3] # \
df[, 2:3] # \
dt[, 2:3] # }---> returns the 2nd and 3rd columns
ma[, c("Y", "Z")] # /
df[, c("Y", "Z")] # /
dt[, c("Y", "Z")] # /
However, they differ when the indices are stored in a variable that is passed unquoted:
mycols <- 2:3
ma[, mycols] # \
df[, mycols] # }---> returns the 2nd and 3rd columns
dt[, mycols, with = FALSE] # /
dt[, mycols] # ---> Raises an error
In the last case, mycols is evaluated as the name of a column. Because dt cannot find a column named mycols, an error is raised.
Note: For versions of the data.table package prior to 1.9.8, this behavior was slightly different. Anything in the column index would have been evaluated using dt as an environment. So both dt[, 2:3] and dt[, mycols] would return the vector 2:3. No error would be raised for the second case, because the variable mycols does exist in the parent environment.
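Recent versions of data.table also accept a `..` prefix on the variable name, which asks `[` to look the variable up in the calling scope instead of among the columns. The sketch below assumes a version that supports this prefix; the names by_with, by_dots and err are hypothetical and used only here.

```r
library(data.table)

dt <- data.table(X = 1:4, Y = 5:8, Z = 9:12)
mycols <- 2:3

by_with <- dt[, mycols, with = FALSE]  # explicit: treat j as column positions
by_dots <- dt[, ..mycols]              # '..' looks up mycols in the calling scope

# A bare symbol is treated as a column name, so this call fails;
# tryCatch captures the error instead of stopping the script
err <- tryCatch(dt[, mycols], error = function(e) e)
inherits(err, "error")
```

Either form keeps the result a data.table with the selected columns; which one to use is mostly a matter of taste.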
There are many reasons to write code that is guaranteed to work with both data.frame and data.table. Maybe you are forced to use data.frame, or you need to share code without knowing how it will be used. There are a few main strategies for achieving this, in order of convenience:
- Force data.table to behave as data.frame (e.g., call the specific method print.data.frame).
- Treat them as list, which they ultimately are.
- Convert the table to a data.frame before doing anything (bad idea if it is a huge table).
- Use data.table, if dependencies are not a concern.
Subset rows. It's simple: just use the [, ] selector, with the comma:
A[1:10, ]
A[A$var > 17, ] # the shorter A[var > 17, ] works only for data.table
Subset columns. If you want a single column, use the $ or the [[ ]] selector:
A$var
colname <- 'var'
A[[colname]]
A[[1]]
If you want a uniform way to grab more than one column, you have to resort to calling the data.frame method directly:
B <- `[.data.frame`(A, 2:4)
# We can give it a better name
select <- `[.data.frame`
B <- select(A, 2:4)
C <- select(A, c('foo', 'bar'))
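As an alternative to calling the data.frame method directly, a small wrapper can branch on the class of its input. This is only a sketch under the assumption that the data.table package is available; select_cols is a hypothetical helper name, not part of any package.

```r
library(data.table)

# Hypothetical helper: select several columns by position or by name
select_cols <- function(x, cols) {
  if (is.data.table(x)) {
    x[, cols, with = FALSE]  # 'cols' is a variable, so data.table needs with = FALSE
  } else {
    x[, cols, drop = FALSE]  # drop = FALSE keeps a table even for a single column
  }
}

df <- data.frame(foo = 1:3, bar = letters[1:3], baz = c(0.5, 1.5, 2.5))
dt <- as.data.table(df)

names(select_cols(df, c("foo", "bar")))
names(select_cols(dt, c("foo", "bar")))
class(select_cols(dt, 2:3))
```

The wrapper returns an object of the same class as its input, which differs from the direct method call above only in how the work is dispatched.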
Subset 'indexed' rows. While data.frame has row.names, data.table has its unique key feature. The best thing is to avoid row.names entirely and take advantage of the existing optimizations in the case of data.table when possible.
B <- A[A$var != 0, ]
# or...
B <- with(A, A[var != 0, ]) # works for both; data.table's automatic indexing covers only == and %in% conditions
stuff <- c('a', 'c', 'f')
C <- A[match(stuff, A$name), ] # much slower than the data.table way: setkey(A, name); A[stuff, ]
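To make the comparison concrete, here is a minimal sketch of both lookups on a toy table. It assumes the data.table package is installed; the table contents and the names by_match and by_key are made up for illustration.

```r
library(data.table)

A <- data.table(name = letters[1:6], var = c(3, 0, 5, 0, 1, 2))
stuff <- c("a", "c", "f")

# Portable version: works for data.frame and data.table alike
by_match <- A[match(stuff, A$name), ]

# data.table-only version: key the table by 'name', then join on the key
setkey(A, name)
by_key <- A[.(stuff)]

by_match$var
by_key$var
```

Both approaches return the same rows here; the keyed lookup pays the cost of sorting once and is then reused by later lookups on the same table.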
Get a 1-column table, get a row as a vector. These are easy with what we have seen so far:
B <- select(A, 2) #---> a table with just the second column
C <- unlist(A[1, ]) #---> the first row as a vector (coerced if necessary)
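The "coerced if necessary" remark matters when columns have different types. The sketch below uses made-up data to show the effect for both classes, assuming data.table is installed; the names row_df, row_dt and nums are hypothetical.

```r
library(data.table)

df <- data.frame(id = 1:2, score = c(9.5, 7.25), label = c("a", "b"),
                 stringsAsFactors = FALSE)
dt <- as.data.table(df)

row_df <- unlist(df[1, ])  # integer, double and character mix: all become character
row_dt <- unlist(dt[1, ])  # same call, same kind of result for a data.table

class(row_df)
names(row_df)

# Restricting the row to numeric columns keeps the result numeric
nums <- unlist(df[1, c("id", "score")])
class(nums)
```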