Convert from wide form to long form
Load data USArrests
from datasets
.
data("USArrests")
head(USArrests)
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
Use ?USArrests
to find out more. First, convert to data.table
. The names of states are row names in the original data.frame
.
library(data.table)
DT <- as.data.table(USArrests, keep.rownames=TRUE)
This is data in the wide form. It has a column for each variable. The data can also be stored in long form without loss of information. The long form has one column that stores the variable names. Then, it has another column for the variable values. The long form of USArrests
looks like so.
State Crime Rate
1: Alabama Murder 13.2
2: Alaska Murder 10.0
3: Arizona Murder 8.1
4: Arkansas Murder 8.8
5: California Murder 9.0
---
196: Virginia Rape 20.7
197: Washington Rape 26.2
198: West Virginia Rape 9.3
199: Wisconsin Rape 10.8
200: Wyoming Rape 15.6
We use the melt
function to switch from wide form to long form.
DTm <- melt(DT)
names(DTm) <- c("State", "Crime", "Rate")
By default, melt
treats all columns with numeric data as variables with values. In USArrests
, the variable UrbanPop
represents the percentage urban population of a state. It is different from the other variabes, Murder
, Assault
and Rape
, which are violent crimes reported per 100,000 people. Suppose we want to retain UrbanPop
column. We achieve this by setting id.vars
as follows.
DTmu <- melt(DT, id.vars=c("rn", "UrbanPop" ),
variable.name='Crime', value.name = "Rate")
names(DTmu)[1] <- "State"
Note that we have specified the names of the column containing category names (Murder, Assault, etc.) with variable.name
and the column containing the values with value.name
. Our data looks like so.
State UrbanPop Crime Rate
1: Alabama 58 Murder 13.2
2: Alaska 48 Murder 10.0
3: Arizona 80 Murder 8.1
4: Arkansas 50 Murder 8.8
5: California 91 Murder 9.0
Generating summaries with with split-apply-combine style approach is a breeze. For example, to summarize violent crimes by state?
DTmu[, .(ViolentCrime = sum(Rate)), by=State]
This gives:
State ViolentCrime
1: Alabama 270.4
2: Alaska 317.5
3: Arizona 333.1
4: Arkansas 218.3
5: California 325.6
6: Colorado 250.6