It is sometimes desirable to consolidate factor levels into fewer groups, perhaps because one of the categories has sparse data. The need also arises when category names vary in spelling or capitalization. Consider as an example the factor
set.seed(1)
colorful <- sample(c("red", "Red", "RED", "blue", "Blue", "BLUE", "green", "gren"),
                   size = 20,
                   replace = TRUE)
colorful <- factor(colorful)
Since R is case-sensitive, a frequency table of this vector would appear as below.
table(colorful)
colorful
 blue  Blue  BLUE green  gren   red   Red   RED
    3     1     4     2     4     1     3     2
This table, however, doesn't represent the true distribution of the data: the categories can effectively be reduced to three types, Blue, Green, and Red. Three examples are provided. The first illustrates what seems like an obvious solution but, as its output shows, does not consolidate the levels. The second gives a working solution, but is verbose and computationally expensive. The third is not an obvious solution, but is relatively compact and computationally efficient.
factor (factor_approach)
factor(as.character(colorful),
       levels = c("blue", "Blue", "BLUE", "green", "gren", "red", "Red", "RED"),
       labels = c("Blue", "Blue", "Blue", "Green", "Green", "Red", "Red", "Red"))
 [1] Green Blue  Red   Red   Blue  Red   Red   Red   Blue  Red   Green Green Green Blue  Red   Green
[17] Red   Green Green Red
Levels: Blue Blue Blue Green Green Red Red Red
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
  duplicated levels in factors are deprecated
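Notice that there are duplicated levels: "Blue" still appears as three separate levels, so the levels have not been consolidated. The warning that duplicated levels are deprecated also means that this code may generate an error in the future. This output comes from an older R release. Since R 3.4.0, factor() accepts duplicated labels and merges them into a single level, so the behavior depends on the R version in use.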
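Whether a consolidation attempt really merged the levels can be checked directly rather than read off the printed output. The following is a minimal sketch on a small hand-built factor. The names x and f are illustrative and not part of the example above, and the result depends on the R version (releases from 3.4.0 onward merge duplicated labels).

```r
# Hypothetical miniature example; x and f are illustrative names.
x <- factor(c("blue", "Blue", "green", "BLUE"))

# Same pattern as the factor approach: several levels share one label.
# suppressWarnings() hides the deprecation warning raised by older R versions.
f <- suppressWarnings(
  factor(as.character(x),
         levels = c("blue", "Blue", "BLUE", "green"),
         labels = c("Blue", "Blue", "Blue", "Green"))
)

levels(f)                 # the levels actually stored on the factor
nlevels(f)                # how many levels there are
anyDuplicated(levels(f))  # 0 when no level name is repeated
```

A nonzero value from anyDuplicated() signals that some level name is still repeated and the consolidation did not take effect.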
ifelse (ifelse_approach)
factor(ifelse(colorful %in% c("blue", "Blue", "BLUE"),
              "Blue",
              ifelse(colorful %in% c("green", "gren"),
                     "Green",
                     "Red")))
 [1] Green Blue  Red   Red   Blue  Red   Red   Red   Blue  Red   Green Green Green Blue  Red   Green
[17] Red   Green Green Red
Levels: Blue Green Red
This code generates the desired result, but requires nested ifelse calls. While there is nothing wrong with this approach, managing nested ifelse calls is tedious and must be done carefully, since every level that is not matched explicitly falls through to the final value ("Red" here).
list (list_approach)
A less obvious way of consolidating levels is to use a list in which the name of each element is the desired category name and the element itself is a character vector of the existing factor levels that should map to that category. This has the added advantage of working directly on the levels attribute of the factor, without having to assign new objects.
levels(colorful) <-
  list("Blue" = c("blue", "Blue", "BLUE"),
       "Green" = c("green", "gren"),
       "Red" = c("red", "Red", "RED"))
colorful
 [1] Green Blue  Red   Red   Blue  Red   Red   Red   Blue  Red   Green Green Green Blue  Red   Green
[17] Red   Green Green Red
Levels: Blue Green Red
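To make the list mapping easy to verify by eye, the same technique can be applied to a small hand-built factor. This is only an illustrative sketch. The names shades and mapping are assumptions introduced here and do not appear in the example above.

```r
# Small hand-built factor so the result can be checked by counting.
# `shades` and `mapping` are illustrative names.
shades <- factor(c("red", "Red", "blue", "BLUE", "gren", "green", "RED"))

# Each list name is a target category; each element lists the old levels
# that should be folded into it.
mapping <- list("Blue"  = c("blue", "Blue", "BLUE"),
                "Green" = c("green", "gren"),
                "Red"   = c("red", "Red", "RED"))

# Reassign the levels attribute in place.
levels(shades) <- mapping

levels(shades)  # "Blue" "Green" "Red"
table(shades)   # Blue 2, Green 2, Red 3
```

Keeping the mapping in a named object, as here, also makes it straightforward to reuse the same consolidation on another factor with the same raw categories.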
The time required to execute each of these approaches is summarized below. (For the sake of space, the code to generate this summary is not shown.)
Unit: microseconds
          expr     min      lq      mean   median      uq     max neval cld
        factor  78.725  83.256  93.26023  87.5030  97.131 218.899   100  b
        ifelse 104.494 107.609 123.53793 113.4145 128.281 254.580   100   c
 list_approach  49.557  52.955  60.50756  54.9370  65.132 138.193   100 a
The list approach runs about twice as fast as the ifelse approach. However, except with very large amounts of data, the differences in execution time will likely be measured in microseconds or milliseconds. With such small time differences, efficiency need not guide the decision of which approach to use. Instead, use an approach that is familiar and comfortable, and which you and your collaborators will understand on future review.
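The code behind the timing summary is not given in the text. One way such a comparison could be produced is sketched below, assuming the contributed microbenchmark package is installed. The three wrapper functions are hypothetical helpers written for this sketch, and timings will differ by machine and R version.

```r
# Sketch only: the original benchmarking code is not shown in the text.
set.seed(1)
colorful <- factor(sample(c("red", "Red", "RED", "blue", "Blue", "BLUE",
                            "green", "gren"),
                          size = 20, replace = TRUE))

# Hypothetical wrappers around the three approaches.
factor_approach <- function(x) {
  # Older R versions warn about duplicated levels; silence that here.
  suppressWarnings(
    factor(as.character(x),
           levels = c("blue", "Blue", "BLUE", "green", "gren", "red", "Red", "RED"),
           labels = c("Blue", "Blue", "Blue", "Green", "Green", "Red", "Red", "Red"))
  )
}

ifelse_approach <- function(x) {
  factor(ifelse(x %in% c("blue", "Blue", "BLUE"),
                "Blue",
                ifelse(x %in% c("green", "gren"), "Green", "Red")))
}

list_approach <- function(x) {
  levels(x) <- list("Blue"  = c("blue", "Blue", "BLUE"),
                    "Green" = c("green", "gren"),
                    "Red"   = c("red", "Red", "RED"))
  x
}

# The two working approaches should agree before their timings are compared.
stopifnot(identical(as.character(ifelse_approach(colorful)),
                    as.character(list_approach(colorful))))

# microbenchmark is a contributed package and may not be installed.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    factor        = factor_approach(colorful),
    ifelse        = ifelse_approach(colorful),
    list_approach = list_approach(colorful),
    times = 100
  ))
}
```

Wrapping each approach in a function keeps the benchmark from modifying colorful itself, so every repetition starts from the same unconsolidated factor.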