R Language Run-length encoding to compress and decompress vectors


Example

Long vectors with long runs of the same value can be significantly compressed by storing them in their run-length encoding (the value of each run and the number of times that value is repeated). As an example, consider a vector of length 10 million with a huge number of 1's and only a small number of 0's:

set.seed(144)
dat <- sample(rep(0:1, c(1, 1e5)), 1e7, replace=TRUE)
table(dat)
#       0       1 
#     103 9999897 

Storing 10 million entries will require significant space, but we can instead create a data frame with the run-length encoding of this vector:

rle.df <- with(rle(dat), data.frame(values, lengths))
dim(rle.df)
# [1] 207   2
head(rle.df)
#   values lengths
# 1      1   52818
# 2      0       1
# 3      1  219329
# 4      0       1
# 5      1  318306
# 6      0       1

From the run-length encoding, we see that the first 52,818 values in the vector are 1's, followed by a single 0, followed by 219,329 consecutive 1's, followed by a 0, and so on. The run-length encoding only has 207 entries, requiring us to store only 414 values instead of 10 million values. As rle.df is a data frame, it can be stored using standard functions like write.csv.

Decompressing a vector in run-length encoding can be accomplished in two ways. The first method is to simply call rep, passing the values element of the run-length encoding as the first argument and the lengths element of the run-length encoding as the second argument:

decompressed <- rep(rle.df$values, rle.df$lengths)

We can confirm that our decompressed data is identical to our original data:

identical(decompressed, dat)
# [1] TRUE

The second method is to use R's built-in inverse.rle function on the rle object, for instance:

rle.obj <- rle(dat)                            # create a rle object here
class(rle.obj)
# [1] "rle"

dat.inv <- inverse.rle(rle.obj)               # apply the inverse.rle on the rle object

We can confirm again that this produces exactly the original dat:

identical(dat.inv, dat)
# [1] TRUE