Factors are one way to represent categorical variables in R. Given a vector x whose values can be converted to characters using as.character(), factor() and as.factor() with their default arguments assign an integer code to each distinct element of the vector and store the set of distinct values in a levels attribute. Levels are the values x can possibly take. Labels are the names displayed for those levels; they default to the level values themselves but can be set by the user through the labels argument of factor().
To illustrate how factors work, we will create a factor with default attributes, then one with custom levels, and finally one with custom levels and labels.
# standard
factor(c(1,1,2,2,3,3))
[1] 1 1 2 2 3 3
Levels: 1 2 3
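To see the integer codes and the levels that a factor stores, the object can be inspected with base R functions. The following is a minimal sketch; the variable name f and the values 10, 20, 30 are chosen only for illustration, so that the codes visibly differ from the original values.

```r
# f and its values are illustrative, not part of the example above
f <- factor(c(10, 10, 20, 30))

levels(f)                     # "10" "20" "30": distinct values, stored as character
as.integer(f)                 # 1 1 2 3: the integer code of each element
as.numeric(as.character(f))   # 10 10 20 30: recovers the original numbers
```

Note that as.integer() returns the codes, not the original values, so converting a numeric-looking factor back to numbers needs to go through as.character() first.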
Instances can arise where the user knows that the number of possible values a factor can take is greater than the number of distinct values currently in the vector. In that case we assign the levels ourselves in factor().
factor(c(1,1,2,2,3,3),
levels = c(1,2,3,4,5))
[1] 1 1 2 2 3 3
Levels: 1 2 3 4 5
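One place where declared but unused levels become visible is in tabulation. As a small sketch (the variable names f and counts are illustrative), table() reports a zero count for every level that does not occur, and droplevels() removes such levels again if they are not wanted:

```r
# f and counts are illustrative names
f <- factor(c(1, 1, 2, 2, 3, 3),
            levels = c(1, 2, 3, 4, 5))

counts <- table(f)      # levels 4 and 5 are listed with a count of 0
print(counts)

levels(droplevels(f))   # "1" "2" "3": unused levels removed
```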
For style purposes the user may wish to assign labels to each level. By default, labels are the character representation of the levels. Here we assign labels for each of the possible levels in the factor.
factor(c(1,1,2,2,3,3),
levels = c(1,2,3,4,5),
labels = c("Fox","Dog","Cow","Brick","Dolphin"))
[1] Fox Fox Dog Dog Cow Cow
Levels: Fox Dog Cow Brick Dolphin
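It can help to check what the labelled factor actually stores. In this sketch (the variable name animals is illustrative), the labels replace the level names, while the underlying integer codes stay the same as before:

```r
# animals is an illustrative name for the factor built above
animals <- factor(c(1, 1, 2, 2, 3, 3),
                  levels = c(1, 2, 3, 4, 5),
                  labels = c("Fox", "Dog", "Cow", "Brick", "Dolphin"))

levels(animals)        # "Fox" "Dog" "Cow" "Brick" "Dolphin"
as.integer(animals)    # 1 1 2 2 3 3: codes are unchanged by the labels
as.character(animals)  # "Fox" "Fox" "Dog" "Dog" "Cow" "Cow"
```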
Normally, factors can only be compared using == and !=, and only if the factors have the same levels. The following comparison fails, even though the factors appear equal, because their level sets differ.
factor(c(1,1,2,2,3,3),levels = c(1,2,3)) == factor(c(1,1,2,2,3,3),levels = c(1,2,3,4,5))
Error in Ops.factor(factor(c(1, 1, 2, 2, 3, 3), levels = c(1, 2, 3)), :
level sets of factors are different
This makes sense: the extra levels in the right-hand factor mean that the two factors describe different sets of possible values, so R does not have enough information to compare them in a meaningful way.
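If an element-wise comparison is still wanted, two possible workarounds are sketched here. Both are suggestions rather than the only approach, and the names a, b and a2 are illustrative: either compare the character representations, or rebuild one factor so that both share the same level set.

```r
# a, b and a2 are illustrative names
a <- factor(c(1, 1, 2, 2, 3, 3), levels = c(1, 2, 3))
b <- factor(c(1, 1, 2, 2, 3, 3), levels = c(1, 2, 3, 4, 5))

# Option 1: compare the values as character strings
as.character(a) == as.character(b)

# Option 2: give a the same level set as b, then compare the factors
a2 <- factor(a, levels = levels(b))
a2 == b
```

Both comparisons return TRUE for every element here, since the underlying values are identical.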
The operators <, <=, > and >= are only usable for ordered factors. These represent categorical values that still have a linear order. An ordered factor can be created by passing the ordered = TRUE argument to the factor() function or by using the ordered() function.
x <- factor(1:3, labels = c('low', 'medium', 'high'), ordered = TRUE)
print(x)
[1] low medium high
Levels: low < medium < high
y <- ordered(3:1, labels = c('low', 'medium', 'high'))
print(y)
[1] high medium low
Levels: low < medium < high
x < y
[1] TRUE FALSE FALSE
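Because an ordered factor carries its ordering in the levels, it can also be compared against a single level name, and functions such as max() and sort() follow the level order rather than alphabetical order. A short sketch, with the illustrative variable name sizes:

```r
# sizes is an illustrative name; the level order is given explicitly
sizes <- factor(c("low", "high", "medium"),
                levels = c("low", "medium", "high"),
                ordered = TRUE)

sizes >= "medium"   # FALSE TRUE TRUE
max(sizes)          # high
sort(sizes)         # low medium high
```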
For more information, see the Factor documentation.