R Language Run-length encoding Identifying and grouping by runs in data.table

Help us to keep this website almost Ad Free! It takes only 10 seconds of your time:
> Step 1: Go view our video on YouTube: EF Core Bulk Insert
> Step 2: And Like the video. BONUS: You can also share it!

Example

The data.table package provides a convenient way to group by runs in data. Consider the following example data:

library(data.table)
(DT <- data.table(x = c(1, 1, 2, 2, 2, 1), y = 1:6))
#    x y
# 1: 1 1
# 2: 1 2
# 3: 2 3
# 4: 2 4
# 5: 2 5
# 6: 1 6

The variable x has three runs: a run of length 2 with value 1, a run of length 3 with value 2, and a run of length 1 with value 1. We might want to compute the mean value of variable y in each of the runs of variable x (these mean values are 1.5, 4, and 6).

The data.table rleid function provides an id indicating the run id of each element of a vector:

rleid(DT$x)
# [1] 1 1 2 2 2 3

One can then easily group on this run ID and summarize the y data:

DT[,mean(y),by=.(x, rleid(x))]
#    x rleid  V1
# 1: 1     1 1.5
# 2: 2     2 4.0
# 3: 1     3 6.0


Got any R Language Question?