Data frames are likely the data structure you will use most in your analyses. A data frame is a special kind of list that stores same-length vectors of different classes. You create data frames using the data.frame function. The example below shows this by combining a numeric and a character vector into a data frame. It uses the : operator, which creates a vector containing all integers from 1 to 3.
df1 <- data.frame(x = 1:3, y = c("a", "b", "c"))
df1
##   x y
## 1 1 a
## 2 2 b
## 3 3 c
class(df1)
## [1] "data.frame"
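Because a data frame is a list of equal-length vectors, the usual list tools work on it. The following is a minimal sketch that rebuilds df1 from the example above and inspects it as a list; the particular set of checks is only an illustration, not an exhaustive tour.

```r
# Rebuild the example data frame from the text
df1 <- data.frame(x = 1:3, y = c("a", "b", "c"))

is.list(df1)   # TRUE: a data frame is a list under the hood
length(df1)    # number of list elements, i.e. columns
names(df1)     # names of the list elements (the column names)
nrow(df1)      # the common length shared by all column vectors

# Columns can be pulled out like list elements
df1$x          # extract by name with $
df1[["y"]]     # extract by name with [[ ]]
```

Each column comes back as an ordinary vector, so anything you already know about vectors applies to the extracted column.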
Data frame objects do not print with quotation marks, so the class of the columns is not always obvious.
df2 <- data.frame(x = c("1", "2", "3"), y = c("a", "b", "c"))
df2
##   x y
## 1 1 a
## 2 2 b
## 3 3 c
Without further investigation, the "x" columns in df1 and df2 cannot be told apart. The str function describes an object in more detail than class does.
str(df1)
## 'data.frame': 3 obs. of 2 variables:
## $ x: int 1 2 3
## $ y: Factor w/ 3 levels "a","b","c": 1 2 3
str(df2)
## 'data.frame': 3 obs. of 2 variables:
## $ x: Factor w/ 3 levels "1","2","3": 1 2 3
## $ y: Factor w/ 3 levels "a","b","c": 1 2 3
Here you see that df1 is a data.frame with 3 observations of 2 variables, "x" and "y". Then you are told that "x" has the data type integer (the distinction is not important for this class; for our purposes it behaves like a numeric) and "y" is a factor with three levels (another data class we are not discussing). It is important to note that in R versions before 4.0.0, which produced the output above, data.frame coerces character vectors to factors by default. From R 4.0.0 onward the default is to leave them as characters. In either case the behavior can be set explicitly with the stringsAsFactors parameter:
df3 <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
str(df3)
## 'data.frame': 3 obs. of 2 variables:
## $ x: int 1 2 3
## $ y: chr "a" "b" "c"
Now the "y" column is a character vector. As mentioned above, each "column" of a data frame must have the same length. Trying to create a data.frame from vectors whose lengths do not match, and cannot be recycled to match, will result in an error. (Try running data.frame(x = 1:3, y = 1:4) to see the resulting error.)
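If you would rather see the error without stopping a script, one option is to wrap the call in tryCatch. This is only a sketch: the variable names result and recycled are made up for illustration. The second call shows the one case where unequal lengths are accepted, namely when the longer length is an exact multiple of the shorter one and R recycles the shorter vector.

```r
# Capture the error message instead of halting execution
result <- tryCatch(
  data.frame(x = 1:3, y = 1:4),          # lengths 3 and 4 are incompatible
  error = function(e) conditionMessage(e)
)
result   # a character string describing the mismatch in row counts

# Lengths 2 and 4 are compatible: x is recycled to fill 4 rows
recycled <- data.frame(x = 1:2, y = 1:4)
nrow(recycled)
recycled$x
```

Recycling is convenient for constants, but it can hide mistakes, so it is worth checking nrow after building a data frame from vectors of different lengths.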
R ships with several built-in data sets that are handy as test cases for data frames. One of them is iris, which you can copy into your own variable as follows:
mydataframe <- iris
str(mydataframe)
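Besides str, a few other base R functions give a quick overview of a data frame. The sketch below applies them to the same mydataframe copy of iris; which of them you reach for first is a matter of taste.

```r
# iris is provided by the datasets package that loads with base R
mydataframe <- iris

dim(mydataframe)             # number of rows and columns
names(mydataframe)           # column names
head(mydataframe, n = 3)     # first three rows
sapply(mydataframe, class)   # class of each column
summary(mydataframe)         # per-column summary statistics
```

The sapply(mydataframe, class) call is a compact way to answer the question raised earlier about which columns are numeric, character, or factor.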