Long vectors with long runs of the same value can be significantly compressed by storing them in their run-length encoding (the value of each run and the number of times that value is repeated). As an example, consider a vector of length 10 million with a huge number of 1's and only a small number of 0's:
set.seed(144)
dat <- sample(rep(0:1, c(1, 1e5)), 1e7, replace=TRUE)
table(dat)
# 0 1
# 103 9999897
Storing 10 million entries will require significant space, but we can instead create a data frame with the run-length encoding of this vector:
rle.df <- with(rle(dat), data.frame(values, lengths))
dim(rle.df)
# [1] 207 2
head(rle.df)
# values lengths
# 1 1 52818
# 2 0 1
# 3 1 219329
# 4 0 1
# 5 1 318306
# 6 0 1
From the run-length encoding, we see that the first 52,818 values in the vector are 1's, followed by a single 0, followed by 219,329 consecutive 1's, followed by a 0, and so on. The run-length encoding only has 207 entries, requiring us to store only 414 values instead of 10 million values. As rle.df
is a data frame, it can be stored using standard functions like write.csv
.
Decompressing a vector in run-length encoding can be accomplished in two ways. The first method is to simply call rep
, passing the values
element of the run-length encoding as the first argument and the lengths
element of the run-length encoding as the second argument:
decompressed <- rep(rle.df$values, rle.df$lengths)
We can confirm that our decompressed data is identical to our original data:
identical(decompressed, dat)
# [1] TRUE
The second method is to use R's built-in inverse.rle
function on the rle
object, for instance:
rle.obj <- rle(dat) # create a rle object here
class(rle.obj)
# [1] "rle"
dat.inv <- inverse.rle(rle.obj) # apply the inverse.rle on the rle object
We can confirm again that this produces exactly the original dat
:
identical(dat.inv, dat)
# [1] TRUE