See ?factor or see the documentation online.

An object with class factor is a vector with a particular set of characteristics:

1. It is stored internally as an integer vector.
2. It maintains a levels attribute that shows the character representation of the values.
3. Its class is stored as factor.

To illustrate, let us generate a vector of 1,000 observations from a set of colors.
set.seed(1)
Color <- sample(x = c("Red", "Blue", "Green", "Yellow"),
                size = 1000,
                replace = TRUE)
Color <- factor(Color)
We can observe each of the characteristics of Color listed above:
#* 1. It is stored internally as an `integer` vector
typeof(Color)
[1] "integer"
#* 2. It maintains a `levels` attribute that shows the character representation of the values.
#* 3. Its class is stored as `factor`
attributes(Color)
$levels
[1] "Blue" "Green" "Red" "Yellow"

$class
[1] "factor"
The primary advantage of a factor object is efficiency in data storage. An integer requires less memory to store than a character. Such efficiency was highly desirable when many computers had far more limited resources than current machines (for a more detailed history of the motivations behind using factors, see stringsAsFactors: an Unauthorized Biography). The difference in memory use can be seen even in our Color object. As the sizes below show, storing Color as a character requires more than 1.7 times as much memory as the factor object.
#* Amount of memory required to store Color as a factor.
object.size(Color)
4624 bytes
#* Amount of memory required to store Color as a character
object.size(as.character(Color))
8232 bytes
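The ratio can also be computed directly instead of being read off the printed sizes. The following is a minimal sketch that rebuilds the Color vector so it runs on its own; the names size_factor and size_character are introduced here for illustration, and the exact byte counts may differ between R versions and platforms.

```r
# Recreate the example vector so this block is self-contained.
set.seed(1)
Color <- factor(sample(x = c("Red", "Blue", "Green", "Yellow"),
                       size = 1000,
                       replace = TRUE))

# object.size() returns a byte count; convert to plain numbers before dividing.
size_factor    <- as.numeric(object.size(Color))
size_character <- as.numeric(object.size(as.character(Color)))

# Ratio of character storage to factor storage.
size_character / size_factor
```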
While the internal computation of factors sees the object as an integer, the desired representation for human consumption is the character level. For example,
head(Color)
[1] Blue Blue Green Yellow Red Yellow
Levels: Blue Green Red Yellow
is easier for a human to comprehend than
head(as.numeric(Color))
[1] 1 1 2 4 3 4
An approximate illustration of how R goes about matching the character representation to the internal integer value is:
head(levels(Color)[as.numeric(Color)])
[1] "Blue" "Blue" "Green" "Yellow" "Red" "Yellow"
Compare these results to
head(Color)
[1] Blue Blue Green Yellow Red Yellow
Levels: Blue Green Red Yellow
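The same comparison can be made programmatically rather than by eye. This is a small sketch, assuming a factor with no missing values; manual_lookup is a name introduced here for illustration.

```r
# Recreate the example vector so this block is self-contained.
set.seed(1)
Color <- factor(sample(x = c("Red", "Blue", "Green", "Yellow"),
                       size = 1000,
                       replace = TRUE))

# Index the levels by the internal integer codes.
manual_lookup <- levels(Color)[as.integer(Color)]

# TRUE when the manual lookup reproduces the character representation.
identical(manual_lookup, as.character(Color))
```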
In 2007, R introduced a hashing method for characters that reduced the memory burden of character vectors (ref: stringsAsFactors: an Unauthorized Biography). Take note that when we determined that characters require more than 1.7 times the storage space of factors, that was calculated in a recent version of R, meaning that the memory use of character vectors was even more taxing before 2007.
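Owing to the hashing method in modern R and to the far greater memory resources of modern computers, memory efficiency in storing character values has become a very minor concern. The prevailing attitude in the R community is a preference for character vectors over factors in most situations. The primary causes for the shift away from factors are

1. the growth of unstructured or loosely controlled character data, and
2. the tendency of factors to behave unexpectedly when the user forgets she is working with a factor rather than a character vector.
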
In the first case, it makes no sense to store free text or open response fields as factors, as there is unlikely to be any pattern that allows for more than one observation per level. Alternatively, if the data structure is not carefully controlled, it is possible to get multiple levels that correspond to the same category (such as "blue", "Blue", and "BLUE"). In such cases, many prefer to resolve these discrepancies while the data are still characters, before converting to a factor (if conversion takes place at all).
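As a sketch of that workflow, inconsistent spellings can be normalized while the data are still a character vector, and the factor created only afterwards. The vector raw_color and the choice of tolower() as the cleaning rule are assumptions made for illustration; real data may need other cleaning steps.

```r
# Hypothetical loosely controlled input: two categories, five spellings.
raw_color <- c("blue", "Blue", "BLUE", "Red", "red")

# Converting directly keeps every distinct spelling as its own level.
nlevels(factor(raw_color))
# [1] 5

# Normalize case first, then convert.
clean_color <- factor(tolower(raw_color))
levels(clean_color)
# [1] "blue" "red"
```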
In the second case, if the user thinks she is working with a character vector, certain methods may not respond as anticipated. This basic misunderstanding can lead to confusion and frustration while trying to debug scripts and code. While, strictly speaking, this may be considered the fault of the user, most users are happy to avoid factors and these situations altogether.
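A common instance of this surprise is numeric conversion. The sketch below uses a hypothetical vector, dose, of numbers that arrived as text: calling as.numeric() on the factor returns the internal integer codes, not the values shown as labels.

```r
# Hypothetical data: numbers stored as text and then converted to a factor.
dose <- factor(c("10", "20", "5"))

# Returns the internal codes (positions in levels(dose)), not 10, 20, 5.
as.numeric(dose)

# Converting to character first recovers the intended values.
as.numeric(as.character(dose))
# [1] 10 20  5
```

Had dose been kept as a character vector, as.numeric(dose) would have returned the intended values directly.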