R Language Factors Consolidating Factor Levels with a List


Example

There are times in which it is desirable to consolidate factor levels into fewer groups, perhaps because of sparse data in one of the categories. It may also occur when you have varying spellings or capitalization of the category names. Consider as an example the factor

set.seed(1)
colorful <- sample(c("red", "Red", "RED", "blue", "Blue", "BLUE", "green", "gren"),
                   size = 20,
                   replace = TRUE)
colorful <- factor(colorful)

Since R is case-sensitive, a frequency table of this vector would appear as below.

table(colorful)
colorful  
blue  Blue  BLUE green  gren   red   Red   RED  
   3     1     4     2     4     1     3     2

This table, however, doesn't represent the true distribution of the data, and the categories may effectively be reduced to three types: Blue, Green, and Red. Three examples are provided. The first illustrates what seems like an obvious solution, but won't actually provide a solution. The second gives a working solution, but is verbose and computationally expensive. The third is not an obvious solution, but is relatively compact and computationally efficient.

Consolidating levels using factor (factor_approach)

factor(as.character(colorful),
       levels = c("blue", "Blue", "BLUE", "green", "gren", "red", "Red", "RED"),
       labels = c("Blue", "Blue", "Blue", "Green", "Green", "Red", "Red", "Red"))
 [1] Green Blue  Red   Red   Blue  Red   Red   Red   Blue  Red   Green Green Green Blue  Red   Green
[17] Red   Green Green Red  
Levels: Blue Blue Blue Green Green Red Red Red
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated

Notice that there are duplicated levels. We still have three categories for "Blue", which doesn't complete our task of consolidating levels. Additionally, there is a warning that duplicated levels are deprecated, meaning that this code may generate an error in the future.

Consolidating levels using ifelse (ifelse_approach)

factor(ifelse(colorful %in% c("blue", "Blue", "BLUE"),
       "Blue",
       ifelse(colorful %in% c("green", "gren"),
              "Green",
              "Red")))
 [1] Green Blue  Red   Red   Blue  Red   Red   Red   Blue  Red   Green Green Green Blue  Red   Green
[17] Red   Green Green Red  
Levels: Blue Green Red

This code generates the desired result, but requires the use of nested ifelse statements. While there is nothing wrong with this approach, managing nested ifelse statements can be a tedious task and must be done carefully.

Consolidating Factors Levels with a List (list_approach)

A less obvious way of consolidating levels is to use a list where the name of each element is the desired category name, and the element is a character vector of the levels in the factor that should map to the desired category. This has the added advantage of working directly on the levels attribute of the factor, without having to assign new objects.

levels(colorful) <- 
     list("Blue" = c("blue", "Blue", "BLUE"),
          "Green" = c("green", "gren"),
          "Red" = c("red", "Red", "RED"))
 [1] Green Blue  Red   Red   Blue  Red   Red   Red   Blue  Red   Green Green Green Blue  Red   Green
[17] Red   Green Green Red  
Levels: Blue Green Red

Benchmarking each approach

The time required to execute each of these approaches is summarized below. (For the sake of space, the code to generate this summary is not shown)

Unit: microseconds
          expr     min      lq      mean   median      uq     max neval cld
        factor  78.725  83.256  93.26023  87.5030  97.131 218.899   100  b 
        ifelse 104.494 107.609 123.53793 113.4145 128.281 254.580   100   c
 list_approach  49.557  52.955  60.50756  54.9370  65.132 138.193   100 a

The list approach runs about twice as fast as the ifelse approach. However, except in times of very, very large amounts of data, the differences in execution time will likely be measured in either microseconds or milliseconds. With such small time differences, efficiency need not guide the decision of which approach to use. Instead, use an approach that is familiar and comfortable, and which you and your collaborators will understand on future review.