dplyr introduces a grammar of data manipulation in R. It provides a consistent interface to work with data no matter where it is stored: data.frame, data.table, or a database. The key pieces of dplyr are written using Rcpp, which makes it very fast for working with in-memory data.
dplyr's philosophy is to have small functions that do one thing well. The five simple functions (filter, arrange, select, mutate, and summarise) can be combined to describe most data manipulation tasks. When combined with group_by, these functions can be used to calculate group-wise summary statistics.
All these functions have a similar syntax: the first argument is always a data frame, and columns are referred to directly by their bare names (i.e., without using $).
We will use the built-in mtcars dataset to explore dplyr's single table verbs. Before converting mtcars to a tbl_df (which makes printing cleaner), we add the row names of the dataset as a column using the rownames_to_column function from the tibble package.
library(dplyr) # This documentation was written using version 0.5.0
mtcars_tbl <- as_data_frame(tibble::rownames_to_column(mtcars, "cars"))
# examine the structure of data
head(mtcars_tbl)
# A tibble: 6 x 12
# cars mpg cyl disp hp drat wt qsec vs am gear carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
filter subsets the rows that match certain criteria. The first argument is the name of the data.frame, and the second (and subsequent) arguments are the criteria used to filter the data. Each criterion should evaluate to either TRUE or FALSE for every row.
Subset all cars that have 4 cylinders - cyl:
filter(mtcars_tbl, cyl == 4)
# A tibble: 11 x 12
# cars mpg cyl disp hp drat wt qsec vs am gear carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#2 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#3 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#4 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#5 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# ... with 6 more rows
We can pass multiple criteria separated by commas; a row is kept only if it satisfies all of them. To subset the cars that have either 4 or 6 cylinders (cyl) and 5 gears (gear):
filter(mtcars_tbl, cyl == 4 | cyl == 6, gear == 5)
# A tibble: 3 x 12
# cars mpg cyl disp hp drat wt qsec vs am gear carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
#2 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
#3 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
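The following is a minimal sketch of how comma-separated criteria behave. It uses the plain mtcars data frame rather than mtcars_tbl to stay self-contained, and the object names (with_comma, with_and, with_in) are made up for illustration. It shows that a comma acts like a logical AND, and that %in% can stand in for a chain of == joined by |.

```r
library(dplyr)

# Comma-separated criteria: every criterion must be TRUE for a row to be kept
with_comma <- filter(mtcars, cyl == 4 | cyl == 6, gear == 5)

# The same subset written as a single condition with an explicit AND
with_and <- filter(mtcars, (cyl == 4 | cyl == 6) & gear == 5)

# %in% is a compact alternative to chaining == with |
with_in <- filter(mtcars, cyl %in% c(4, 6), gear == 5)

nrow(with_comma)  # 3 rows, as in the output above
```

All three calls should return the same three cars.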
filter selects rows based on criteria. To select rows by position, use slice.
slice takes only 2 arguments: the first one is a data.frame and the second is integer row values.
To select rows 6 through 9:
slice(mtcars_tbl, 6:9)
# A tibble: 4 x 12
# cars mpg cyl disp hp drat wt qsec vs am gear carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Valiant 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1
#2 Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
#3 Merc 240D 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
#4 Merc 230 22.8 4 140.8 95 3.92 3.15 22.90 1 0 4 2
Or:
slice(mtcars_tbl, -c(1:5, 10:n()))
This results in the same output as slice(mtcars_tbl, 6:9).
n() represents the number of observations in the current group. Since mtcars_tbl is not grouped here, it is the total number of rows (32).
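As a small sketch of positional selection, the code below checks that the positive and negative forms pick the same rows, and uses n() to take the last two rows. It works on plain mtcars for self-containment; the object names are illustrative only.

```r
library(dplyr)

# Keep rows 6 through 9
by_position <- slice(mtcars, 6:9)

# Drop rows 1-5 and 10 through the last row; n() is the row count here
by_exclusion <- slice(mtcars, -c(1:5, 10:n()))

# n() also makes it easy to count from the end: the last two rows
last_two <- slice(mtcars, (n() - 1):n())

by_position$mpg  # 18.1 14.3 24.4 22.8
```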
arrange is used to sort the data by one or more variables. Just like the previous verb (and all other functions in dplyr), the first argument is a data.frame, and subsequent arguments are used to sort the data. If more than one variable is passed, the data is first sorted by the first variable, then by the second variable, and so on.
To order the data by horsepower (hp):
arrange(mtcars_tbl, hp)
# A tibble: 32 x 12
# cars mpg cyl disp hp drat wt qsec vs am gear carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#2 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#3 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#4 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#5 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#6 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# ... with 26 more rows
To arrange the data by miles per gallon - mpg in descending order, followed by number of cylinders - cyl:
arrange(mtcars_tbl, desc(mpg), cyl)
# A tibble: 32 x 12
# cars mpg cyl disp hp drat wt qsec vs am gear carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#2 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#3 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#4 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#5 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#6 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# ... with 26 more rows
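To make the sort order concrete, here is a hedged sketch that compares arrange with base R's order() on plain mtcars. The object names are illustrative; only the sort keys are compared, so tie-breaking among equal rows does not matter.

```r
library(dplyr)

# Sort by mpg (highest first), breaking ties by cyl (lowest first)
sorted <- arrange(mtcars, desc(mpg), cyl)

# Base R equivalent: negate the numeric column to sort it in descending order
base_sorted <- mtcars[order(-mtcars$mpg, mtcars$cyl), ]

head(sorted$mpg, 3)  # 33.9 32.4 30.4
```

desc() also works on non-numeric columns, where negating with a minus sign is not an option.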
select is used to select only a subset of variables. To select only mpg, disp, wt, qsec, and vs from mtcars_tbl:
select(mtcars_tbl, mpg, disp, wt, qsec, vs)
# A tibble: 32 x 5
# mpg disp wt qsec vs
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 21.0 160.0 2.620 16.46 0
#2 21.0 160.0 2.875 17.02 0
#3 22.8 108.0 2.320 18.61 1
#4 21.4 258.0 3.215 19.44 1
#5 18.7 360.0 3.440 17.02 0
#6 18.1 225.0 3.460 20.22 1
# ... with 26 more rows
: notation can be used to select consecutive columns. To select columns from cars through disp and vs through carb:
select(mtcars_tbl, cars:disp, vs:carb)
# A tibble: 32 x 8
# cars mpg cyl disp vs am gear carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Mazda RX4 21.0 6 160.0 0 1 4 4
#2 Mazda RX4 Wag 21.0 6 160.0 0 1 4 4
#3 Datsun 710 22.8 4 108.0 1 1 4 1
#4 Hornet 4 Drive 21.4 6 258.0 1 0 3 1
#5 Hornet Sportabout 18.7 8 360.0 0 0 3 2
#6 Valiant 18.1 6 225.0 1 0 3 1
# ... with 26 more rows
The same result can be obtained by dropping the columns from hp through qsec instead: select(mtcars_tbl, -(hp:qsec))
For datasets that contain several columns, it can be tedious to select several columns by name. To make life easier, there are a number of helper functions (such as starts_with(), ends_with(), contains(), matches(), num_range(), one_of(), and everything()) that can be used in select. To learn more about how to use them, see ?select_helpers and ?select.
Note: While referring to columns directly in select(), we use bare column names, but quotes should be used while referring to columns in helper functions.
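A few of the helper functions are sketched below on plain mtcars (which has no cars column). The object names are illustrative. Note that the patterns passed to the helpers are quoted strings, while carb is a bare column name.

```r
library(dplyr)

# Columns whose names start with "d"
d_cols <- select(mtcars, starts_with("d"))
names(d_cols)   # "disp" "drat"

# Columns whose names end with "t"
t_cols <- select(mtcars, ends_with("t"))
names(t_cols)   # "drat" "wt"

# Columns whose names contain "ar"
ar_cols <- select(mtcars, contains("ar"))
names(ar_cols)  # "gear" "carb"

# everything() selects all remaining columns; handy for moving a column first
carb_first <- select(mtcars, carb, everything())
```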
To rename columns while selecting:
select(mtcars_tbl, cylinders = cyl, displacement = disp)
# A tibble: 32 x 2
# cylinders displacement
# <dbl> <dbl>
#1 6 160.0
#2 6 160.0
#3 4 108.0
#4 6 258.0
#5 8 360.0
#6 6 225.0
# ... with 26 more rows
As expected, this drops all other variables.
To rename columns without dropping other variables, use rename:
rename(mtcars_tbl, cylinders = cyl, displacement = disp)
# A tibble: 32 x 12
# cars mpg cylinders displacement hp drat wt qsec vs
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0
#2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0
#3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1
#4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1
#5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0
#6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1
# ... with 26 more rows, and 3 more variables: am <dbl>, gear <dbl>, carb <dbl>
mutate can be used to add new columns to the data. Like all other functions in dplyr, mutate doesn't add the newly created columns to the original data; it returns a new data.frame, with the new columns added at the end.
mutate(mtcars_tbl, weight_ton = wt/2, weight_pounds = weight_ton * 2000)
# A tibble: 32 x 14
# cars mpg cyl disp hp drat wt qsec vs am gear carb weight_ton weight_pounds
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1.3100 2620
#2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 1.4375 2875
#3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 1.1600 2320
#4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 1.6075 3215
#5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 1.7200 3440
#6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 1.7300 3460
# ... with 26 more rows
Note the use of weight_ton while creating weight_pounds. Unlike base R, mutate allows us to refer to columns that we just created to be used for a subsequent operation.
To retain only the newly created columns, use transmute instead of mutate:
transmute(mtcars_tbl, weight_ton = wt/2, weight_pounds = weight_ton * 2000)
# A tibble: 32 x 2
# weight_ton weight_pounds
# <dbl> <dbl>
#1 1.3100 2620
#2 1.4375 2875
#3 1.1600 2320
#4 1.6075 3215
#5 1.7200 3440
#6 1.7300 3460
# ... with 26 more rows
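The sketch below, on plain mtcars with illustrative object names, checks the points made above: mutate leaves the input untouched, later expressions can use columns created earlier in the same call, and transmute keeps only the new columns.

```r
library(dplyr)

# weight_pounds refers to weight_ton, which is created in the same call
with_weights <- mutate(mtcars,
                       weight_ton = wt / 2,
                       weight_pounds = weight_ton * 2000)

# transmute keeps only the newly created columns
only_weights <- transmute(mtcars,
                          weight_ton = wt / 2,
                          weight_pounds = weight_ton * 2000)

ncol(mtcars)        # still 11: the original data is unchanged
ncol(with_weights)  # 13: two columns added at the end
ncol(only_weights)  # 2
```

To keep the new columns, assign the result to a name, as done here.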
summarise calculates summary statistics of variables by collapsing multiple values to a single value. It can calculate multiple statistics and we can name these summary columns in the same statement.
To calculate the mean and standard deviation of mpg and disp of all cars in the dataset:
summarise(mtcars_tbl, mean_mpg = mean(mpg), sd_mpg = sd(mpg),
mean_disp = mean(disp), sd_disp = sd(disp))
# A tibble: 1 x 4
# mean_mpg sd_mpg mean_disp sd_disp
# <dbl> <dbl> <dbl> <dbl>
#1 20.09062 6.026948 230.7219 123.9387
group_by can be used to perform group-wise operations on data. When the verbs defined above are applied to grouped data, they are automatically applied to each group separately.
To find mean and sd of mpg by cyl:
by_cyl <- group_by(mtcars_tbl, cyl)
summarise(by_cyl, mean_mpg = mean(mpg), sd_mpg = sd(mpg))
# A tibble: 3 x 3
# cyl mean_mpg sd_mpg
# <dbl> <dbl> <dbl>
#1 4 26.66364 4.509828
#2 6 19.74286 1.453567
#3 8 15.10000 2.560048
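As a rough cross-check of what grouped summarise does, this sketch counts the cars in each group with n() and compares the group means against base R's tapply. It uses plain mtcars; the names stats and base_means are illustrative.

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

# n() counts the rows in each group; mean(mpg) is computed per group
stats <- summarise(by_cyl, n_cars = n(), mean_mpg = mean(mpg))

# Base R equivalent of the per-group mean
base_means <- tapply(mtcars$mpg, mtcars$cyl, mean)

stats$n_cars  # 11 7 14
```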
We select the columns from cars through hp, plus gear, order the rows by cyl and then from highest to lowest mpg, group the data by gear, and finally keep only those cars that have mpg > 20 and hp > 75:
selected <- select(mtcars_tbl, cars:hp, gear)
ordered <- arrange(selected, cyl, desc(mpg))
by_gear <- group_by(ordered, gear)
filter(by_gear, mpg > 20, hp > 75)
Source: local data frame [9 x 6]
Groups: gear [3]
# cars mpg cyl disp hp gear
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Lotus Europa 30.4 4 95.1 113 5
#2 Porsche 914-2 26.0 4 120.3 91 5
#3 Datsun 710 22.8 4 108.0 93 4
#4 Merc 230 22.8 4 140.8 95 4
#5 Toyota Corona 21.5 4 120.1 97 3
# ... with 4 more rows
If we are not interested in the intermediate results, we can achieve the same result as above by nesting the function calls:
filter(
  group_by(
    arrange(
      select(
        mtcars_tbl, cars:hp, gear
      ), cyl, desc(mpg)
    ), gear
  ), mpg > 20, hp > 75
)
This can be a little difficult to read. Instead, dplyr operations can be chained using the pipe operator %>%. The above code translates to:
mtcars_tbl %>%
  select(cars:hp, gear) %>%
  arrange(cyl, desc(mpg)) %>%
  group_by(gear) %>%
  filter(mpg > 20, hp > 75)
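The pipe passes its left-hand side as the first argument of the function on its right, so x %>% f(y) is the same as f(x, y). The minimal sketch below, on plain mtcars with illustrative object names, compares a nested call with its piped form.

```r
library(dplyr)

# Nested form: read from the inside out
nested <- filter(select(mtcars, mpg, hp), mpg > 20, hp > 75)

# Piped form: read from top to bottom
piped <- mtcars %>%
  select(mpg, hp) %>%
  filter(mpg > 20, hp > 75)

nrow(piped)  # 9
```

Because every dplyr verb takes a data frame as its first argument and returns a data frame, any sequence of verbs can be chained this way.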
dplyr provides summarise_all() to apply functions to all (non-grouping) columns.
To find the number of distinct values for each column:
mtcars_tbl %>%
summarise_all(n_distinct)
# A tibble: 1 x 12
# cars mpg cyl disp hp drat wt qsec vs am gear carb
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 32 25 3 27 22 22 29 30 2 2 3 6
To find the number of distinct values for each column by cyl:
mtcars_tbl %>%
group_by(cyl) %>%
summarise_all(n_distinct)
# A tibble: 3 x 12
# cyl cars mpg disp hp drat wt qsec vs am gear carb
# <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 4 11 9 11 10 10 11 11 2 2 3 2
#2 6 7 6 5 4 5 6 7 2 2 3 3
#3 8 14 12 11 9 11 13 14 1 2 2 4
Note that we just had to add the group_by statement and the rest of the code is the same. The output now consists of three rows - one for each unique value of cyl.
To summarise several specific columns, use summarise_at:
mtcars_tbl %>%
group_by(cyl) %>%
summarise_at(c("mpg", "disp", "hp"), mean)
# A tibble: 3 x 4
# cyl mpg disp hp
# <dbl> <dbl> <dbl> <dbl>
#1 4 26.66364 105.1364 82.63636
#2 6 19.74286 183.3143 122.28571
#3 8 15.10000 353.1000 209.21429
Helper functions (see ?select_helpers) can be used in place of column names to select specific columns.
To apply multiple functions, either pass the function names as a character vector:
mtcars_tbl %>%
group_by(cyl) %>%
summarise_at(c("mpg", "disp", "hp"),
c("mean", "sd"))
or wrap them inside funs:
mtcars_tbl %>%
group_by(cyl) %>%
summarise_at(c("mpg", "disp", "hp"),
funs(mean, sd))
# A tibble: 3 x 7
# cyl mpg_mean disp_mean hp_mean mpg_sd disp_sd hp_sd
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 26.66364 105.1364 82.63636 4.509828 26.87159 20.93453
#2 6 19.74286 183.3143 122.28571 1.453567 41.56246 24.26049
#3 8 15.10000 353.1000 209.21429 2.560048 67.77132 50.97689
Column names are now appended with the function names to keep them distinct. To change the suffix, name each function with the text to be appended:
mtcars_tbl %>%
group_by(cyl) %>%
summarise_at(c("mpg", "disp", "hp"),
c(Mean = "mean", SD = "sd"))
or, using funs:
mtcars_tbl %>%
group_by(cyl) %>%
summarise_at(c("mpg", "disp", "hp"),
funs(Mean = mean, SD = sd))
# A tibble: 3 x 7
# cyl mpg_Mean disp_Mean hp_Mean mpg_SD disp_SD hp_SD
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 26.66364 105.1364 82.63636 4.509828 26.87159 20.93453
#2 6 19.74286 183.3143 122.28571 1.453567 41.56246 24.26049
#3 8 15.10000 353.1000 209.21429 2.560048 67.77132 50.97689
To select columns conditionally, use summarise_if:
Take the mean of all columns that are numeric grouped by cyl:
mtcars_tbl %>%
group_by(cyl) %>%
summarise_if(is.numeric, mean)
# A tibble: 3 x 11
# cyl mpg disp hp drat wt qsec
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 26.66364 105.1364 82.63636 4.070909 2.285727 19.13727
#2 6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714
#3 8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214
# ... with 4 more variables: vs <dbl>, am <dbl>, gear <dbl>,
# carb <dbl>
However, some variables are discrete, and the mean of these variables doesn't make sense.
To take the mean of only the continuous variables by cyl, we treat a numeric column with more than 6 distinct values as continuous:
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise_if(function(x) is.numeric(x) && n_distinct(x) > 6, mean)
# A tibble: 3 x 7
# cyl mpg disp hp drat wt qsec
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 26.66364 105.1364 82.63636 4.070909 2.285727 19.13727
#2 6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714
#3 8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214
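In dplyr 1.0.0 and later, the same conditional selection can be written with across() and where() inside a plain summarise. The sketch below assumes such a version and uses plain mtcars. The helper is_continuous is a hypothetical name introduced here, and the cutoff of 6 distinct values is the heuristic from the example above.

```r
library(dplyr)

# Hypothetical helper: treat a numeric column with more than 6 distinct
# values as continuous (the same heuristic as in the example above)
is_continuous <- function(x) is.numeric(x) && n_distinct(x) > 6

# Assumes dplyr >= 1.0.0, which provides across() and where()
result <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(where(is_continuous), mean))

names(result)  # "cyl" "mpg" "disp" "hp" "drat" "wt" "qsec"
```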