Casting is used to transform data from long to wide format.
Starting with a long data set:
DT = data.table(ID = rep(letters[1:3],3), Age = rep(20:22,3), Test = rep(c("OB_A","OB_B","OB_C"), each = 3), Result = 1:9)
We can cast our data using the dcast
function in data.table. This returns another data.table in wide format:
dcast(DT, formula = ID ~ Test, value.var = "Result")
ID OB_A OB_B OB_C
1: a 1 4 7
2: b 2 5 8
3: c 3 6 9
class(dcast(DT, formula = ID ~ Test, value.var = "Result"))
[1] "data.table" "data.frame"
A value.var
argument is necessary for a proper cast - if not provided dcast will make an assumption based on your data.
dcast(DT, formula = ID ~ Test, value.var = "Result")
ID OB_A OB_B OB_C
1: a 1 4 7
2: b 2 5 8
3: c 3 6 9
ID OB_A OB_B OB_C
1: a 20 20 20
2: b 21 21 21
3: c 22 22 22
Multiple value.var
s can be provided in a list
dcast(DT, formula = ID ~ Test, value.var = list("Result","Age"))
ID Result_OB_A Result_OB_B Result_OB_C Age_OB_A Age_OB_B Age_OB_C
1: a 1 4 7 20 20 20
2: b 2 5 8 21 21 21
3: c 3 6 9 22 22 22
Casting is controlled using the formula argument in dcast
. This is of the form ROWS ~ COLUMNS
dcast(DT, formula = ID ~ Test, value.var = "Result")
ID OB_A OB_B OB_C
1: a 1 4 7
2: b 2 5 8
3: c 3 6 9
dcast(DT, formula = Test ~ ID, value.var = "Result")
Test a b c
1: OB_A 1 2 3
2: OB_B 4 5 6
3: OB_C 7 8 9
Both rows and columns can be expanded with further variables using +
dcast(DT, formula = ID + Age ~ Test, value.var = "Result")
ID Age OB_A OB_B OB_C
1: a 20 1 4 7
2: b 21 2 5 8
3: c 22 3 6 9
dcast(DT, formula = ID ~ Age + Test, value.var = "Result")
ID 20_OB_A 20_OB_B 20_OB_C 21_OB_A 21_OB_B 21_OB_C 22_OB_A 22_OB_B 22_OB_C
1: a 1 4 7 NA NA NA NA NA NA
2: b NA NA NA 2 5 8 NA NA NA
3: c NA NA NA NA NA NA 3 6 9
#order is important
dcast(DT, formula = ID ~ Test + Age, value.var = "Result")
ID OB_A_20 OB_A_21 OB_A_22 OB_B_20 OB_B_21 OB_B_22 OB_C_20 OB_C_21 OB_C_22
1: a 1 NA NA 4 NA NA 7 NA NA
2: b NA 2 NA NA 5 NA NA 8 NA
3: c NA NA 3 NA NA 6 NA NA 9
Casting can often create cells where no observation exists in the data. By default this is denoted by NA
, as above. We can override this with the fill=
argument.
dcast(DT, formula = ID ~ Test + Age, value.var = "Result", fill = 0)
ID OB_A_20 OB_A_21 OB_A_22 OB_B_20 OB_B_21 OB_B_22 OB_C_20 OB_C_21 OB_C_22
1: a 1 0 0 4 0 0 7 0 0
2: b 0 2 0 0 5 0 0 8 0
3: c 0 0 3 0 0 6 0 0 9
You can also use two special variables in the formula object
.
represents no other variables...
represents all other variablesdcast(DT, formula = Age ~ ., value.var = "Result")
Age .
1: 20 3
2: 21 3
3: 22 3
dcast(DT, formula = ID + Age ~ ..., value.var = "Result")
ID Age OB_A OB_B OB_C
1: a 20 1 4 7
2: b 21 2 5 8
3: c 22 3 6 9
We can also cast and aggregate values in one step. In this case, we have three observations in each of the intersections of Age and ID. To set what aggregation we want, we use the fun.aggregate
argument:
#length
dcast(DT, formula = ID ~ Age, value.var = "Result", fun.aggregate = length)
ID 20 21 22
1: a 3 0 0
2: b 0 3 0
3: c 0 0 3
#sum
dcast(DT, formula = ID ~ Age, value.var = "Result", fun.aggregate = sum)
ID 20 21 22
1: a 12 0 0
2: b 0 15 0
3: c 0 0 18
#concatenate
dcast(DT, formula = ID ~ Age, value.var = "Result", fun.aggregate = function(x){paste(x,collapse = "_")})
ID 20 21 22
1: a 1_4_7
2: b 2_5_8
3: c 3_6_9
We can also pass a list to fun.aggregate
to use multiple functions
dcast(DT, formula = ID ~ Age, value.var = "Result", fun.aggregate = list(sum,length))
ID Result_sum_20 Result_sum_21 Result_sum_22 Result_length_20 Result_length_21 Result_length_22
1: a 12 0 0 3 0 0
2: b 0 15 0 0 3 0
3: c 0 0 18 0 0 3
If we pass more than one function and more than one value, we can calculate all combinations by passing a vector of value.vars
dcast(DT, formula = ID ~ Age, value.var = c("Result","Test"), fun.aggregate = list(function(x){paste0(x,collapse = "_")},length))
ID Result_function_20 Result_function_21 Result_function_22 Test_function_20 Test_function_21 Test_function_22 Result_length_20 Result_length_21
1: a 1_4_7 OB_A_OB_B_OB_C 3 0
2: b 2_5_8 OB_A_OB_B_OB_C 0 3
3: c 3_6_9 OB_A_OB_B_OB_C 0 0
Result_length_22 Test_length_20 Test_length_21 Test_length_22
1: 0 3 0 0
2: 0 0 3 0
3: 3 0 0 3
where each pair is calculated in the order value1_formula1, value1_formula2, ... , valueN_formula(N-1), valueN_formulaN
.
Alternatively, we can evaluate our values and functions one-to-one by passing 'value.var' as a list:
dcast(DT, formula = ID ~ Age, value.var = list("Result","Test"), fun.aggregate = list(function(x){paste0(x,collapse = "_")},length))
ID Result_function_20 Result_function_21 Result_function_22 Test_length_20 Test_length_21 Test_length_22
1: a 1_4_7 3 0 0
2: b 2_5_8 0 3 0
3: c 3_6_9 0 0 3
By default, column name components are seperated by an underscore _
. This can be manually overridden using the sep=
argument:
dcast(DT, formula = Test ~ ID + Age, value.var = "Result")
Test a_20 b_21 c_22
1: OB_A 1 2 3
2: OB_B 4 5 6
3: OB_C 7 8 9
dcast(DT, formula = Test ~ ID + Age, value.var = "Result", sep = ",")
Test a,20 b,21 c,22
1: OB_A 1 2 3
2: OB_B 4 5 6
3: OB_C 7 8 9
This will seperate any fun.aggregate
or value.var
we use:
dcast(DT, formula = Test ~ ID + Age, value.var = "Result", fun.aggregate = c(sum,length), sep = ",")
Test Result,sum,a,20 Result,sum,b,21 Result,sum,c,22 Result,length,a,20 Result,length,b,21 Result,length,c,22
1: OB_A 1 2 3 1 1 1
2: OB_B 4 5 6 1 1 1
3: OB_C 7 8 9 1 1 1