Missing data is often not coded as NA in datasets. In SPSS, for example, missing values are often represented by the value 99.
num.vec <- c(1, 2, 3, 99, 5)
num.vec
## [1]  1  2  3 99  5
It is possible to assign NA directly using subsetting:
num.vec[num.vec == 99] <- NA
However, the preferred method is to use is.na<- as shown below. The help file (?is.na) states:
"is.na<- may provide a safer way to set missingness. It behaves differently for factors, for example."
is.na(num.vec) <- num.vec == 99
Both methods produce the same result:
num.vec
## [1]  1  2  3 NA  5
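When a dataset uses more than one sentinel code, the same is.na<- idiom can be extended. The following is a minimal sketch; the vector survey.vec and the codes 99 and 999 are made up for illustration. %in% is used instead of == because it never returns NA, so values that are already missing do not affect the logical index.

```r
# Hypothetical vector with two sentinel codes (99 and 999) and one existing NA
survey.vec <- c(1, 99, 3, 999, NA)

# Mark every element that matches one of the sentinel codes as missing
is.na(survey.vec) <- survey.vec %in% c(99, 999)

survey.vec
## [1]  1 NA  3 NA NA
```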
Missing values can be removed from a vector in several ways:
num.vec[!is.na(num.vec)]
num.vec[complete.cases(num.vec)]
na.omit(num.vec) # also stores the removed positions in an "na.action" attribute
## [1] 1 2 3 5
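The three approaches give the same values, but na.omit additionally attaches an "na.action" attribute recording which positions were dropped. A short sketch of how to inspect that attribute and, if a plain vector is wanted, strip it again (the name cleaned is chosen here only for illustration):

```r
num.vec <- c(1, 2, 3, NA, 5)

cleaned <- na.omit(num.vec)

# Positions of the removed elements are stored in the "na.action" attribute
attr(cleaned, "na.action")

# as.vector() drops the attributes and leaves a plain numeric vector
as.vector(cleaned)
## [1] 1 2 3 5
```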
When arithmetic functions are applied to a vector with missing values, the result is a missing value:
mean(num.vec) # returns: [1] NA
The na.rm parameter tells the function to exclude the NA values from the calculation:
mean(num.vec, na.rm = TRUE) # returns: [1] 2.75
# an alternative to using 'na.rm = TRUE':
mean(num.vec[!is.na(num.vec)]) # returns: [1] 2.75
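Other base summary functions such as sum and max accept the same na.rm argument. Before dropping values, it can also be useful to check how many are missing. A brief sketch using the same vector:

```r
num.vec <- c(1, 2, 3, NA, 5)

n.missing <- sum(is.na(num.vec))       # number of missing values: 1
has.missing <- anyNA(num.vec)          # TRUE if at least one value is missing

total <- sum(num.vec, na.rm = TRUE)    # 11
largest <- max(num.vec, na.rm = TRUE)  # 5
```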
Some R functions, like lm, have an na.action parameter. Its default value is taken from getOption("na.action"), which is na.omit unless it has been changed. The default behavior of R can be changed with options(na.action = 'na.exclude').
If the default behavior does not need to change but a different na.action is needed in a specific situation, pass the na.action parameter in the function call, e.g.:
lm(y2 ~ y1, data = anscombe, na.action = 'na.exclude')
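The practical difference between na.omit and na.exclude shows up in functions such as residuals: with na.exclude, the result is padded with NA so that it lines up with the rows of the original data. The built-in anscombe data contains no missing values, so the sketch below introduces two artificially (the copy dat and the choice of rows 2 and 5 are assumptions made purely for illustration).

```r
# Copy the built-in data and introduce two missing values for illustration
dat <- anscombe
dat$y1[c(2, 5)] <- NA

fit.omit <- lm(y2 ~ y1, data = dat, na.action = na.omit)
fit.excl <- lm(y2 ~ y1, data = dat, na.action = na.exclude)

# na.omit: residuals only for the 9 complete rows
length(residuals(fit.omit))

# na.exclude: one residual per original row (11), NA where a row was dropped
length(residuals(fit.excl))
which(is.na(residuals(fit.excl)))
```

Both fits use the same nine complete rows and give the same coefficients; only the shape of the residuals and fitted values differs.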