Suppose we want to read and stack a bunch of similarly-formatted files. The quick solution is:
rbindlist(lapply(list.files(patt="csv$"), fread), id=TRUE)
We might not be satisfied with this for a couple reasons:
fread
or when stacking with rbindlist
due to inconsistent or buggy data formatting.One way to handle this is to make a "files table" and store the contents of each file as a list-column entry on the row associated with it.
Before making the example data below, make sure you're in an empty folder you can write to. Run getwd()
and read ?setwd
if you need to change folders.
# example data
set.seed(1)
for (i in 1:3)
fwrite(data.table(id = 1:2, v = sample(letters, 2)), file = sprintf("file201%s.csv", i))
This part is fairly straightforward:
# First, identify the files you want:
fileDT = data.table(fn = list.files(pattern="csv$"))
# Next, optionally parse the names for metadata using regex:
fileDT[, year := type.convert(sub(".*([0-9]{4}).*", "\\1", fn))]
# Finally construct a string file-ID column:
fileDT[, id := as.character(.I)]
# fn year id
# 1: file2011.csv 2011 1
# 2: file2012.csv 2012 2
# 3: file2013.csv 2013 3
Read in the files as a list column:
fileDT[, contents := .(lapply(fn, fread))]
# fn year id contents
# 1: file2011.csv 2011 1 <data.table>
# 2: file2012.csv 2012 2 <data.table>
# 3: file2013.csv 2013 3 <data.table>
If there's a snag in reading one of the files or you need to change the arguments to fread
depending on the file's attributes, this step can easily be extended, looking like:
fileDT[, contents := {
cat(fn, "\n")
dat = if (year %in% 2011:2012){
fread(fn, some_args)
} else {
fread(fn)
}
.(.(dat))
}, by=fn]
For details on options for reading in CSVs and similar files, see ?fread
.
From here, we want to stack the data:
fileDT[, rbindlist(setNames(contents, id), idcol="file_id")]
# file_id id v
# 1: 1 1 g
# 2: 1 2 j
# 3: 2 1 o
# 4: 2 2 w
# 5: 3 1 f
# 6: 3 2 w
If some problem occurs in stacking (like column names or classes not matching), we can go back to the individual tables in fileDT
to inspect where the problem originated. For example,
fileDT[id == "2", contents[[1]]]
# id v
# 1: 1 o
# 2: 2 w
If the files are not in your current working dir, use
my_dir = "whatever"
fileDT = data.table(fn = list.files(my_dir, pattern="*.csv"))
# and when reading
fileDT[, contents := .(lapply(fn, function(n) fread(file.path(my_dir, n))))]