data.table Selecting rows within each group


Example

# example data
DT <- data.table(Titanic)

Suppose that, for each sex, we want the rows with the highest survival numbers:

DT[Survived == "Yes", .SD[ N == max(N) ], by=Sex]

#    Class    Sex   Age Survived   N
# 1:  Crew   Male Adult      Yes 192
# 2:   1st Female Adult      Yes 140

.SD is the subset of data associated with each Sex; and we are subsetting this further, to the rows that meet our condition. If speed is important, instead use an approach suggested by eddi on SO:

DT[ DT[Survived == "Yes", .I[ N == max(N) ], by=Sex]$V1 ]

#    Class    Sex   Age Survived   N
# 1:  Crew   Male Adult      Yes 192
# 2:   1st Female Adult      Yes 140

Pitfalls

In the last line of code, .I refers to the row numbers of the full data.table. However, this is not true when there is no by:

DT[ Survived == "Yes", .I]

# 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16

DT[ Survived == "Yes", .I, by=Sex]$I

# 17 18 19 20 25 26 27 28 21 22 23 24 29 30 31 32