### Stats

119 Wednesday, May 3, 2017

# Factors

## Syntax

1. `factor(x = character(), levels, labels = levels, exclude = NA, ordered = is.ordered(x), nmax = NA)`
2. Run `?factor` or see the documentation online.
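As a minimal sketch of how the main arguments interact, the following builds one factor with the defaults and one with explicit `levels`, `labels`, and `ordered = TRUE`. The `responses` vector and its values are hypothetical, chosen only for illustration.

```r
# Hypothetical survey responses; the values are made up for illustration.
responses <- c("med", "low", "high", "med", "low")

# Default call: the levels are the sorted unique values.
f_default <- factor(responses)
levels(f_default)   # "high" "low" "med"

# Explicit level order, friendlier labels, and an ordered factor.
f_ordered <- factor(responses,
                    levels  = c("low", "med", "high"),
                    labels  = c("Low", "Medium", "High"),
                    ordered = TRUE)
levels(f_ordered)   # "Low" "Medium" "High"

# With an ordered factor, comparisons follow the level order.
f_ordered < "High"
```

Supplying `levels` explicitly is what controls the order; without it, the order is alphabetical rather than meaningful.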

## Remarks

An object with class `factor` is a vector with a particular set of characteristics.

1. It is stored internally as an `integer` vector.
2. It maintains a `levels` attribute that shows the character representation of the values.
3. Its class is stored as `factor`.

To illustrate, let us generate a vector of 1,000 observations from a set of colors.

``````
set.seed(1)
Color <- sample(x = c("Red", "Blue", "Green", "Yellow"),
                size = 1000,
                replace = TRUE)
Color <- factor(Color)
``````

We can observe each of the characteristics of `Color` listed above:

``````
#* 1. It is stored internally as an `integer` vector.
typeof(Color)
``````
``````
[1] "integer"
``````
``````
#* 2. It maintains a `levels` attribute that shows the character representation of the values.
#* 3. Its class is stored as `factor`.
attributes(Color)
``````
``````
$levels
[1] "Blue"   "Green"  "Red"    "Yellow"

$class
[1] "factor"
``````

The primary advantage of a factor object is efficiency in data storage: an integer requires less memory to store than a character string. Such efficiency was highly desirable when computers had far more limited resources than current machines (for a more detailed history of the motivations behind using factors, see `stringsAsFactors`: an Unauthorized Biography). The difference in memory use can be seen even in our `Color` object: storing `Color` as a character vector requires about 1.8 times as much memory as storing it as a factor (8232 bytes versus 4624 bytes).

``````
#* Amount of memory required to store Color as a factor.
object.size(Color)
``````
``````
4624 bytes
``````
``````
#* Amount of memory required to store Color as a character.
object.size(as.character(Color))
``````
``````
8232 bytes
``````

# Mapping the integer to the level

While the internal computation of factors sees the object as an integer, the desired representation for human consumption is the character level. For example,

``````
head(Color)
``````
``````
[1] Blue   Blue   Green  Yellow Red    Yellow
Levels: Blue Green Red Yellow
``````

is easier for a human to read than

``````
head(as.numeric(Color))
``````
``````
[1] 1 1 2 4 3 4
``````

An approximate illustration of how R goes about matching the character representation to the internal integer value is:

``````
head(levels(Color)[as.numeric(Color)])
``````
``````
[1] "Blue"   "Blue"   "Green"  "Yellow" "Red"    "Yellow"
``````

Compare these results to

``````
head(Color)
``````
``````
[1] Blue   Blue   Green  Yellow Red    Yellow
Levels: Blue Green Red Yellow
``````
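The same lookup can be checked on a small, hand-built factor whose values do not depend on random sampling. This is only a sketch of the idea; the names `f`, `codes`, and `lv` are illustrative and not part of the original example.

```r
# A small hand-built factor, so the result does not depend on sample().
f <- factor(c("Blue", "Blue", "Green", "Yellow", "Red", "Yellow"))

codes <- as.integer(f)  # internal integer codes
lv    <- levels(f)      # lookup table of labels

# Indexing the lookup table by the codes reproduces the labels.
identical(lv[codes], as.character(f))  # TRUE

# unclass() drops the class and shows the codes with the levels attribute.
unclass(f)
```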

# Modern use of factors

In 2007, R introduced a hashing method for characters that reduced the memory burden of character vectors (ref: `stringsAsFactors`: an Unauthorized Biography). Note that the comparison above, in which the character version of `Color` needed about 1.8 times as much storage as the factor (8232 versus 4624 bytes), was made in a recent version of R, so the memory cost of character vectors was even greater before 2007.

Owing to the hashing method in modern R and to the far greater memory resources of modern computers, memory efficiency in storing character values has become a very small concern. The prevailing attitude in the R community is a preference for character vectors over factors in most situations. The primary causes for the shift away from factors are:

1. The increase of unstructured and/or loosely controlled character data
2. The tendency of factors to not behave as desired when the user forgets she is dealing with a factor and not a character

In the first case, it makes no sense to store free text or open response fields as factors, as there is unlikely to be any pattern that allows for more than one observation per level. Alternatively, if the data structure is not carefully controlled, it is possible to get multiple levels that correspond to the same category (such as "blue", "Blue", and "BLUE"). In such cases, many prefer to resolve these discrepancies while the data are still characters, before converting to a factor (if conversion takes place at all).
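One way to handle the "blue", "Blue", "BLUE" situation is sketched below: clean the values as characters, then convert. The `raw` vector is hypothetical, and `tolower()` plus `trimws()` is only one possible cleaning step, assumed here for illustration.

```r
# Hypothetical raw input with inconsistent capitalization and stray whitespace.
raw <- c("blue", "Blue", "BLUE ", "red", " Red")

# Converting straight to a factor yields one level per distinct spelling.
nlevels(factor(raw))  # 5

# Clean the values as characters first, then convert.
clean <- tolower(trimws(raw))
f <- factor(clean, levels = c("blue", "red"))
levels(f)  # "blue" "red"
table(f)
```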

In the second case, if the user thinks she is working with a character vector, certain methods may not respond as anticipated. This basic misunderstanding can lead to confusion and frustration while trying to debug scripts and code. While, strictly speaking, this may be considered the fault of the user, most users are happy to avoid using factors and avoid these situations altogether.
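Two common surprises of this kind are sketched below, using hypothetical values: numeric conversion returns the internal codes rather than the numbers the labels spell out, and assigning a value that is not an existing level yields `NA`.

```r
# Numbers stored as a factor (hypothetical values).
f <- factor(c("10", "5", "20"))

# as.numeric() returns the internal codes, not the numbers in the labels.
as.numeric(f)                # 1 3 2

# Going through character first recovers the intended values.
as.numeric(as.character(f))  # 10 5 20

# Assigning a value that is not an existing level produces NA.
# R normally warns here; the warning is suppressed to keep the output quiet.
g <- factor(c("Red", "Blue"))
suppressWarnings(g[1] <- "Purple")
is.na(g[1])                  # TRUE
```

Neither result would occur with a plain character vector, which is why the mistake is easy to miss.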