R Language Identifying and grouping by runs in base R


Example

One might want to group their data by the runs of a variable and perform some sort of analysis. Consider the following simple dataset:

(dat <- data.frame(x = c(1, 1, 2, 2, 2, 1), y = 1:6))
#   x y
# 1 1 1
# 2 1 2
# 3 2 3
# 4 2 4
# 5 2 5
# 6 1 6

The variable x has three runs: a run of length 2 with value 1, a run of length 3 with value 2, and a run of length 1 with value 1. We might want to compute the mean value of variable y in each of the runs of variable x (these mean values are 1.5, 4, and 6).

In base R, we would first compute the run-length encoding of the x variable using rle:

(r <- rle(dat$x))
# Run Length Encoding
#   lengths: int [1:3] 2 3 1
#   values : num [1:3] 1 2 1

The next step is to compute the run number of each row of our dataset. We know that the total number of runs is length(r$lengths), and the length of each run is r$lengths, so we can compute the run number of each of our runs with rep:

(run.id <- rep(seq_along(r$lengths), r$lengths))
# [1] 1 1 2 2 2 3

Now we can use tapply to compute the mean y value for each run by grouping on the run id:

data.frame(x=r$values, meanY=tapply(dat$y, run.id, mean))
#   x meanY
# 1 1   1.5
# 2 2   4.0
# 3 1   6.0