data.table Cleaning data Handling duplicates


Example

# example data
DT = data.table(id = c(1,2,2,3,3,3))[, v := LETTERS[.I]][]

To deal with "duplicates," combine counting rows in a group and subsetting rows by group.

Keep one row per group

Aka "drop duplicates" aka "deduplicate" aka "uniquify."

unique(DT, by="id")
# or
DT[, .SD[1L], by=id]
#    id v
# 1:  1 A
# 2:  2 B
# 3:  3 D

This keeps the first row. To select a different row, one can fiddle with the 1L part or use order in i.

Keep only unique rows

DT[, if (.N == 1L) .SD, by=id]
#    id v
# 1:  1 A

Keep only nonunique rows

DT[, if (.N > 1L) .SD, by=id]
#    id v
# 1:  2 B
# 2:  2 C
# 3:  3 D
# 4:  3 E
# 5:  3 F