data.table Joins and merges: Update values in a join


When data is "tidy," it is often organized into several tables. To combine the data for analysis, we need to "update" one table with values from another.

For example, we might have sales data for performances, where attributes of the performer (their budget) and of the location (its population) are stored in separate tables:

library(data.table)

mainDT = data.table(
  p_id = rep(LETTERS[1:2], c(2,4)),
  geo_id = sample(rep(state.abb[c(1,25,50)], 3:1)),
  sales = sample(100, 6)
)
pDT   = data.table(id = LETTERS[1:2], budget = c(60, 75))
geoDT = data.table(id = state.abb[c(1,50)], pop = c(100, 200))

mainDT # sales data
#    p_id geo_id sales
# 1:    A     AL    95
# 2:    A     WY    66
# 3:    B     AL    62
# 4:    B     MO     6
# 5:    B     AL    20
# 6:    B     MO    17

pDT # performer attributes
#    id budget
# 1:  A     60
# 2:  B     75

geoDT # location attributes
#    id pop
# 1: AL 100
# 2: WY 200

When we are ready to do some analysis, we need to grab variables from these other tables:

DT = copy(mainDT)

DT[pDT, on=.(p_id = id), budget := i.budget]
DT[geoDT, on=.(geo_id = id), pop := i.pop]

#    p_id geo_id sales budget pop
# 1:    A     AL    95     60 100
# 2:    A     WY    66     60 200
# 3:    B     AL    62     75 100
# 4:    B     MO     6     75  NA
# 5:    B     AL    20     75 100
# 6:    B     MO    17     75  NA

A copy is taken to avoid contaminating the raw data, since := assigns by reference rather than returning a modified copy; we could work directly on mainDT instead if modifying it is acceptable.
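To see why the copy() matters, here is a minimal sketch (using toy tables, not the ones above) of how := writes into a table by reference, and how copy() breaks that link:

```r
library(data.table)

raw = data.table(id = c("A", "B"), x = 1:2)

# := adds a column by reference: no copy is made, so 'raw' itself changes
raw[, y := x * 10]
names(raw)            # "id" "x" "y"

# copy() gives an independent table; the original is untouched by later :=
safe = copy(raw)
safe[, z := 0]
"z" %in% names(raw)   # FALSE: only 'safe' gained the column
```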

Advantages to using separate tables

The advantages of this structure are covered in the paper on tidy data, but in this context:

  1. Tracing missing data. Only rows that match up in the merge receive an assignment. We have no data for geo_id == "MO" above, so its variables are NA in our final table. If we see missing data like this unexpectedly, we can trace it back to the missing observation in the geoDT table and investigate from there whether we have a data problem that can be addressed.

  2. Comprehensibility. In building our statistical model, it might be important to keep in mind that budget is constant for each performer. In general, understanding the structure of the data pays dividends.

  3. Memory size. There might be a large number of performer and location attributes that don't end up in the statistical model. This way, we don't need to include them in the (possibly massive) table used for analysis.
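Point 1 can be made concrete with an anti-join: DT[!geoDT, on = ...] keeps exactly the rows with no match in geoDT, pinpointing which ids lack attributes. A small self-contained sketch (toy tables standing in for those above):

```r
library(data.table)

geoDT  = data.table(id = c("AL", "WY"), pop = c(100, 200))
mainDT = data.table(geo_id = c("AL", "WY", "MO", "MO"), sales = 1:4)

# The ! prefix turns the join into an anti-join: rows of mainDT
# whose geo_id is absent from geoDT$id
missing_ids = mainDT[!geoDT, on = .(geo_id = id), unique(geo_id)]
missing_ids   # "MO"
```

These are the ids whose variables would come back NA in an update join, so they are the place to start investigating a data problem.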

Programmatically determining columns

If there are many columns in pDT, but we only want to select a few, we can use

p_cols = "budget"
DT[pDT, on=.(p_id = id), (p_cols) := mget(sprintf("i.%s", p_cols))]

The parentheses around (p_cols) := are essential, as noted in the doc on creating columns.
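A self-contained sketch of this pattern, using a hypothetical pDT with extra attribute columns (genre and founded are invented for illustration):

```r
library(data.table)

DT  = data.table(p_id = c("A", "A", "B"), sales = c(95, 66, 62))
pDT = data.table(id = c("A", "B"), budget = c(60, 75),
                 genre = c("rock", "jazz"), founded = c(1990, 2005))

# Pull only the columns named in p_cols; mget() looks up the
# i.-prefixed versions, i.e. the columns of pDT, inside the join
p_cols = c("budget", "genre")
DT[pDT, on = .(p_id = id), (p_cols) := mget(sprintf("i.%s", p_cols))]

DT   # gains budget and genre, but not founded
```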