data.table Joins and merges Equi-join


Example

# example data
a = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_), x = 11:15)
#    id  x
# 1:  1 11
# 2:  1 12
# 3:  2 13
# 4:  3 14
# 5: NA 15

b = data.table(id = 1:2, y = -(1:2))
#    id  y
# 1:  1 -1
# 2:  2 -2

Intuition

Think of x[i] as selecting a subset of x for each row of i. This syntax mirrors matrix subsetting in base R and is consistent with the first argument meaning "where", in DT[where, select|update|do, by].

One might wonder why this new syntax is worth learning, since merge(x,i) still works with data.tables. The short answer is that it we usually wants to merge and then do something further. The x[i] syntax concisely captures this pattern of use and also allows for more efficient computation. For a more detailed explanation, read FAQs 1.12 and 2.14.

Handling multiply-matched rows

By default, every row of a matching each row of b is returned:

a[b, on="id"]
#    id  x  y
# 1:  1 11 -1
# 2:  1 12 -1
# 3:  2 13 -2

This can be tweaked with mult:

a[b, on="id", mult="first"]
#    id  x  y
# 1:  1 11 -1
# 2:  2 13 -2

Handling unmatched rows

By default, unmatched rows of a still show up in the result:

b[a, on="id"]
#    id  y  x
# 1:  1 -1 11
# 2:  1 -1 12
# 3:  2 -2 13
# 4:  3 NA 14
# 5: NA NA 15

To hide these, use nomatch:

b[a, on="id", nomatch=0]
#    id  y  x
# 1:  1 -1 11
# 2:  1 -1 12
# 3:  2 -2 13

Note that x[i] will attempt to match NAs in i.

Counting matches returned

To count the number of matches for each row of i, use .N and by=.EACHI.

b[a, on="id", .N, by=.EACHI]
#    id N
# 1:  1 1
# 2:  1 1
# 3:  2 1
# 4:  3 0
# 5: NA 0