R Language Factors

Help us to keep this website almost Ad Free! It takes only 10 seconds of your time:
> Step 1: Go view our video on YouTube: EF Core Bulk Insert
> Step 2: And Like the video. BONUS: You can also share it!

Syntax

  1. factor(x = character(), levels, labels = levels, exclude = NA, ordered = is.ordered(x), nmax = NA)
  2. Run ?factor or see the documentation online.

Remarks

An object with class factor is a vector with a particular set of characteristics.

  1. It is stored internally as an integer vector.
  2. It maintains a levels attribute the shows the character representation of the values.
  3. Its class is stored as factor

To illustrate, let us generate a vector of 1,000 observations from a set of colors.

set.seed(1)
Color <- sample(x = c("Red", "Blue", "Green", "Yellow"), 
                size = 1000, 
                replace = TRUE)
Color <- factor(Color)

We can observe each of the characteristics of Color listed above:

#* 1. It is stored internally as an `integer` vector
typeof(Color)
[1] "integer"
#* 2. It maintains a `levels` attribute the shows the character representation of the values.
#* 3. Its class is stored as `factor`
attributes(Color)
$levels
[1] "Blue"   "Green"  "Red"    "Yellow"

$class
[1] "factor"

The primary advantage of a factor object is efficiency in data storage. An integer requires less memory to store than a character. Such efficiency was highly desirable when many computers had much more limited resources than current machines (for a more detailed history of the motivations behind using factors, see stringsAsFactors: an Unauthorized Biography). The difference in memory use can be seen even in our Color object. As you can see, storing Color as a character requires about 1.7 times as much memory as the factor object.

#* Amount of memory required to store Color as a factor.
object.size(Color)
4624 bytes
#* Amount of memory required to store Color as a character
object.size(as.character(Color))
8232 bytes

Mapping the integer to the level

While the internal computation of factors sees the object as an integer, the desired representation for human consumption is the character level. For example,

head(Color)
[1] Blue   Blue   Green  Yellow Red    Yellow  
Levels: Blue Green Red Yellow

is a easier for human comprehension than

head(as.numeric(Color))
[1] 1 1 2 4 3 4

An approximate illustration of how R goes about matching the character representation to the internal integer value is:

head(levels(Color)[as.numeric(Color)])
[1] "Blue"   "Blue"   "Green"  "Yellow" "Red"    "Yellow"

Compare these results to

head(Color)
[1] Blue   Blue   Green  Yellow Red    Yellow  
Levels: Blue Green Red Yellow

Modern use of factors

In 2007, R introduced a hashing method for characters the reduced the memory burden of character vectors (ref: stringsAsFactors: an Unauthorized Biography). Take note that when we determined that characters require 1.7 times more storage space than factors, that was calculated in a recent version of R, meaning that the memory use of character vectors was even more taxing before 2007.

Owing to the hashing method in modern R and to far greater memory resources in modern computers, the issue of memory efficiency in storing character values has been reduced to a very small concern. The prevailing attitude in the R Community is a preference for character vectors over factors in most situations. The primary causes for the shift away from factors are

  1. The increase of unstructured and/or loosely controlled character data
  2. The tendency of factors to not behave as desired when the user forgets she is dealing with a factor and not a character

In the first case, it makes no sense to store free text or open response fields as factors, as there will unlikely be any pattern that allows for more than one observation per level. Alternatively, if the data structure is not carefully controlled, it is possible to get multiple levels that correspond to the same category (such as "blue", "Blue", and "BLUE"). In such cases, many prefer to manage these discrepancies as characters prior to converting to a factor (if conversion takes place at all).

In the second case, if the user thinks she is working with a character vector, certain methods may not respond as anticipated. This basic understanding can lead to confusion and frustration while trying to debug scripts and codes. While, strictly speaking, this may be considered the fault of the user, most users are happy to avoid using factors and avoid these situations altogether.



Got any R Language Question?