This is for those moving to data.table >= 1.9.8, where the default behaviour of unique() changed.
You have a data set of pet owners and names, but you suspect some repeated data has been captured.
library(data.table)
DT <- data.table(pet = c("dog","dog","cat","dog"),
                 owner = c("Alice","Bob","Charlie","Alice"),
                 entry.date = c("31/12/2015","31/12/2015","14/2/2016","14/2/2016"),
                 key = "owner")
> tables()
NAME NROW NCOL MB COLS KEY
[1,] DT 4 3 1 pet,owner,entry.date owner
Total: 1MB
Recall that keying a table sorts it by the key column(s). Alice has been entered twice.
> DT
pet owner entry.date
1: dog Alice 31/12/2015
2: dog Alice 14/2/2016
3: dog Bob 31/12/2015
4: cat Charlie 14/2/2016
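To confirm the suspicion before removing anything, one option is to count rows per key value. This is only a sketch: the name dup.owners is made up for illustration, and the table is rebuilt so the snippet runs on its own.

```r
library(data.table)

# Same example data as above, keyed (and therefore sorted) by owner
DT <- data.table(pet = c("dog","dog","cat","dog"),
                 owner = c("Alice","Bob","Charlie","Alice"),
                 entry.date = c("31/12/2015","31/12/2015","14/2/2016","14/2/2016"),
                 key = "owner")

# Count rows per owner, then keep only owners that appear more than once
dup.owners <- DT[, .N, by = owner][N > 1L]
print(dup.owners)
```

Because the grouping column is named explicitly, this check does not depend on the default that changed in 1.9.8.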
Say that, before 1.9.8, you used unique() to get rid of duplicates in your data based on the key, keeping the last (here, the most recent) entry for each owner by setting fromLast to TRUE.
clean.DT <- unique(DT, fromLast = TRUE)
> tables()
NAME NROW NCOL MB COLS KEY
[1,] clean.DT 3 3 1 pet,owner,entry.date owner
[2,] DT 4 3 1 pet,owner,entry.date owner
Total: 2MB
Alice's duplicate has been removed. In data.table >= 1.9.8, however, the same call gives a different result:
clean.DT <- unique(DT, fromLast = TRUE)
> tables()
NAME NROW NCOL MB COLS KEY
[1,] clean.DT 4 3 1 pet,owner,entry.date owner
[2,] DT 4 3 1 pet,owner,entry.date owner
This does not work. Still 4 rows!
Use the by= parameter, which no longer defaults to your key but to all columns.
clean.DT <- unique(DT, by = key(DT), fromLast = TRUE)
Now all is well.
> clean.DT
pet owner entry.date
1: dog Alice 14/2/2016
2: dog Bob 31/12/2015
3: cat Charlie 14/2/2016
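Note that fromLast = TRUE keeps the last row per owner in the table's current row order; it does not look at entry.date itself. If the rows might not arrive in date order, a possible approach is to parse the dates and sort first. The sketch below assumes the dates are day/month/year strings as in the example; the column name parsed.date and the result name latest are made up for illustration.

```r
library(data.table)

# Same example data, left unkeyed so the sort order is set explicitly below
DT <- data.table(pet = c("dog","dog","cat","dog"),
                 owner = c("Alice","Bob","Charlie","Alice"),
                 entry.date = c("31/12/2015","31/12/2015","14/2/2016","14/2/2016"))

# Parse the day/month/year strings into real dates (assumed format)
DT[, parsed.date := as.Date(entry.date, format = "%d/%m/%Y")]

# Sort by owner, then by date, so the newest entry is last within each owner
setorder(DT, owner, parsed.date)

# Keep the last row per owner, naming the column explicitly
latest <- unique(DT, by = "owner", fromLast = TRUE)
print(latest)
```

Passing by = "owner" explicitly keeps the call independent of both the key and the changed default.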
See item 1 in the NEWS release notes for details:
Changes in v1.9.8 (on CRAN 25 Nov 2016)
POTENTIALLY BREAKING CHANGES
- By default all columns are now used by unique(), duplicated() and uniqueN() data.table methods, #1284 and #1841. To restore old behaviour: options(datatable.old.unique.by.key=TRUE). In 1 year this option to restore the old default will be deprecated with warning. In 2 years the option will be removed. Please explicitly pass by=key(DT) for clarity. Only code that relies on the default is affected. 266 CRAN and Bioconductor packages using data.table were checked before release. 9 needed to change and were notified. Any lines of code without test coverage will have been missed by these checks. Any packages not on CRAN or Bioconductor were not checked.
So you can set the option as a temporary workaround until your code is fixed, bearing in mind that the option itself is scheduled for deprecation and removal.
options(datatable.old.unique.by.key=TRUE)
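The NEWS item covers duplicated() and uniqueN() as well as unique(), so the same explicit by= fix applies to them. A minimal sketch on the example table, assuming data.table >= 1.9.8 is installed; the variable names are made up for illustration.

```r
library(data.table)

# Same example data as above, keyed (and therefore sorted) by owner
DT <- data.table(pet = c("dog","dog","cat","dog"),
                 owner = c("Alice","Bob","Charlie","Alice"),
                 entry.date = c("31/12/2015","31/12/2015","14/2/2016","14/2/2016"),
                 key = "owner")

# Flag rows whose key value has already been seen (Alice's second row)
dup.flags <- duplicated(DT, by = key(DT))

# Number of distinct owners
n.owners <- uniqueN(DT, by = key(DT))

# With the new default, every column is compared, so all 4 rows are distinct
n.rows.distinct <- uniqueN(DT)

print(dup.flags)
print(n.owners)
print(n.rows.distinct)
```

Spelling out by= in all three calls keeps the result the same whichever default is in effect.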