R Language Omitting or replacing missing values


Recoding missing values

Missing data is often not coded as NA in datasets. In SPSS, for example, missing values are often represented by a sentinel value such as 99.

num.vec <- c(1, 2, 3, 99, 5)
## [1]  1  2  3 99  5

It is possible to assign NA directly using subsetting:

num.vec[num.vec == 99] <- NA

However, the preferred method is to use is.na<- as below. The help file (?is.na) states:

is.na<- may provide a safer way to set missingness. It behaves differently for factors, for example.

is.na(num.vec) <- num.vec == 99

After either assignment, num.vec contains:

## [1]  1  2  3 NA  5
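The same recoding idea carries over to a data frame with several columns. The following is a minimal sketch; the data frame df and its columns age and score are hypothetical and used only for illustration.

```r
# Hypothetical data frame in which 99 marks a missing value
df <- data.frame(age = c(23, 99, 41), score = c(99, 7, 8))

# df == 99 is a logical matrix; is.na<- sets the matching cells to NA
is.na(df) <- df == 99

# age is now 23, NA, 41 and score is now NA, 7, 8
df
```

Note that this treats 99 as missing in every column, so it is only appropriate when 99 cannot occur as a genuine value.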

Removing missing values

Missing values can be removed from a vector in several ways, for example by subsetting with num.vec[!is.na(num.vec)], which gives:

## [1] 1 2 3 5
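Three common ways to drop the NA values can be sketched as follows. This assumes num.vec holds the recoded vector from the previous section; the names kept.a, kept.b and kept.c are illustrative only.

```r
num.vec <- c(1, 2, 3, NA, 5)

# 1. Logical subsetting with is.na
kept.a <- num.vec[!is.na(num.vec)]

# 2. complete.cases (from the stats package) flags elements without NA
kept.b <- num.vec[complete.cases(num.vec)]

# 3. na.omit drops the NA values but attaches an "na.action" attribute
#    recording what was removed; as.vector strips that attribute again
kept.c <- as.vector(na.omit(num.vec))
```

All three give the vector 1, 2, 3, 5. Without as.vector, the result of na.omit also prints its attributes, which is why its output looks different from the other two.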

Excluding missing values from calculations

Arithmetic and summary functions propagate missing values: if the input vector contains an NA, the result is NA:

mean(num.vec) # returns: [1] NA

The na.rm parameter tells the function to exclude the NA values from the calculation:

mean(num.vec, na.rm = TRUE) # returns: [1] 2.75

# an alternative to using 'na.rm = TRUE':
mean(num.vec[!is.na(num.vec)]) # returns: [1] 2.75

Some R functions, such as lm, have an na.action parameter. Its default is taken from options("na.action"), which is na.omit in a fresh R session; this global default can be changed with options(na.action = 'na.exclude').
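As a small sketch of changing the global setting temporarily: options returns the previous values, so they can be restored afterwards. The variable names old and current are illustrative only.

```r
# Change the global default and keep the previous setting
old <- options(na.action = "na.exclude")

current <- getOption("na.action")  # "na.exclude"

# Restore whatever was set before
options(old)
```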

To use a different na.action in a single call without changing the global default, pass the na.action argument explicitly, e.g.:

lm(y2 ~ y1, data = anscombe, na.action = 'na.exclude')
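The built-in anscombe data contains no missing values, so the effect of na.action is not visible there. A minimal sketch of the difference, using a small made-up data frame dat (hypothetical values, for illustration only):

```r
# Made-up data with one missing response value
dat <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, NA, 6.2, 7.9, 10.1))

fit.omit <- lm(y ~ x, data = dat, na.action = na.omit)
fit.excl <- lm(y ~ x, data = dat, na.action = na.exclude)

# na.omit drops the incomplete row: 4 residuals
length(residuals(fit.omit))

# na.exclude pads the dropped row with NA: 5 residuals, aligned with dat
length(residuals(fit.excl))
```

Both fits estimate the same coefficients; na.exclude only affects how residuals and fitted values are reported, which is convenient when they need to be added back to the original data frame as a column.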