# Short Examples of Best Practices When Writing Functions That Call dplyr Verbs

Tags: dplyr

dplyr, the foundational tidyverse package, makes a trade-off: it is easy to code in interactively, but harder to write functions with. The source of the trade-off is how `dplyr` evaluates column names (specifically, allowing unquoted column names as argument inputs). Tidy evaluation has been under major development over the last couple of years in order to make programming with `dplyr` easier.

During this development, there have been a variety of proposed methods for programming with `dplyr`. In this post, I will document the current ‘best-practices’ with `dplyr` 1.0.0. In the Older approaches section I provide analogous examples that someone (i.e. myself) might have used during this maturation period.

For a fuller discussion of this topic, see the “Programming with dplyr” vignette in `dplyr`’s documentation and the various links referenced there.

# Function expecting one column

``library(tidyverse)``

Pretend we want to create a function that calculates the sum of a given variable in a dataframe:

``````sum_var <- function(df, var){

summarise(df, {{var}} := sum({{var}}))
}``````

To run this function:

``sum_var(mpg, cty)``
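To make the result easy to check by hand, here is a minimal sketch on a small made-up tibble (`toy` is hypothetical and not part of the original examples). It repeats the `sum_var()` definition so the block runs on its own.

```r
library(dplyr)

# Same definition as above: {{ var }} forwards the unquoted column name,
# and := allows the embraced name on the left-hand side.
sum_var <- function(df, var) {
  summarise(df, {{ var }} := sum({{ var }}))
}

# Hypothetical toy data, used only for illustration
toy <- tibble(g = c("a", "a", "b"), x = c(1, 2, 3), y = c(10, 20, 30))

res <- sum_var(toy, x)
res
```

The result is a one-row tibble whose single column keeps the input name (`x`) and holds the total.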

If you wanted to edit the variable in place and avoid using the special assignment operator `:=`, you could use the new (in `dplyr` 1.0.0) `across()` function.

``````sum_vars <- function(df, vars){

summarise(df, across({{vars}}, sum))
}``````

# Functions allowing multiple columns

Using the `across()` approach also allows you to input more than one variable, e.g. a user could call the following to get summaries on both `cty` and `hwy`.

``sum_vars(mpg, c(cty, hwy))``
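Because `across()` selects columns with tidyselect, the same argument should also accept selection helpers. A small sketch, again on a hypothetical `toy` tibble rather than `mpg`:

```r
library(dplyr)

sum_vars <- function(df, vars) {
  summarise(df, across({{ vars }}, sum))
}

# Hypothetical toy data, used only for illustration
toy <- tibble(g = c("a", "a", "b"), x = c(1, 2, 3), y = c(10, 20, 30))

by_name   <- sum_vars(toy, c(x, y))            # explicit columns
by_helper <- sum_vars(toy, where(is.numeric))  # every numeric column
```

Both calls summarise `x` and `y` here, since those are the only numeric columns in `toy`.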

If you wanted to compute multiple column summaries with different functions and you wanted to glue the function name onto your output column names [1], you could instead pass a named list of functions into the `.fns` argument of `across()`.

``````sum_vars <- function(df, vars){

summarise(df, across({{vars}}, list(sum = sum, mean = mean)))
}``````
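With a named list, `across()` names each output as the column name followed by the function name (e.g. `hwy_sum`). If a different pattern is wanted, the `.names` argument takes a glue specification. The sketch below assumes a recent `dplyr` release, where the placeholders are `{.col}` and `{.fn}`; the function name `sum_vars_named()` and the `toy` tibble are made up for illustration.

```r
library(dplyr)

sum_vars_named <- function(df, vars) {
  summarise(
    df,
    across(
      {{ vars }},
      list(sum = sum, mean = mean),
      .names = "{.fn}_of_{.col}"  # e.g. sum_of_x instead of the default x_sum
    )
  )
}

# Hypothetical toy data, used only for illustration
toy <- tibble(x = c(1, 2, 3), y = c(10, 20, 30))

res <- sum_vars_named(toy, c(x, y))
names(res)
```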

You might want to create a function that can take in multiple sets of columns, e.g. the function below allows you to `group_by()` one set of variables and `summarise()` another set:

``````sum_group_vars <- function(df, group_vars, sum_vars){
df %>%
group_by(across({{group_vars}})) %>%
summarise(across({{sum_vars}}, list(sum = sum, mean = mean)))
}``````

How a user would run `sum_group_vars()`:

``````sum_group_vars(mpg,
c(model, year),
c(hwy, cty))``````
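On a small made-up tibble the output is easy to verify: one row per combination of grouping values, with one column per summarised variable and function. With several grouping columns, `summarise()` in `dplyr` 1.0.0 also keeps part of the grouping and prints a message about it; the variant below passes `.groups = "drop"` to return an ungrouped result. That argument is an addition for this sketch, not part of the function above.

```r
library(dplyr)

sum_group_vars <- function(df, group_vars, sum_vars) {
  df %>%
    group_by(across({{ group_vars }})) %>%
    summarise(
      across({{ sum_vars }}, list(sum = sum, mean = mean)),
      .groups = "drop"  # added for this sketch: return an ungrouped tibble
    )
}

# Hypothetical toy data, used only for illustration
toy <- tibble(
  g1 = c("a", "a", "b", "b"),
  g2 = c("u", "u", "u", "v"),
  x  = c(1, 2, 3, 4)
)

res <- sum_group_vars(toy, c(g1, g2), x)
res
```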

If you’re feeling fancy, you could also make the input to `.fns` an argument to `sum_group_vars()` [2].
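A sketch of that idea, assuming a hypothetical `fns` argument whose default reproduces the behaviour above. The list is an ordinary R object, so it is passed straight through without any embracing.

```r
library(dplyr)

# `fns` is a hypothetical argument name; its default matches the earlier version
sum_group_vars <- function(df, group_vars, sum_vars,
                           fns = list(sum = sum, mean = mean)) {
  df %>%
    group_by(across({{ group_vars }})) %>%
    summarise(across({{ sum_vars }}, fns))
}

# Hypothetical toy data, used only for illustration
toy <- tibble(g = c("a", "a", "b"), x = c(1, 2, 3))

res <- sum_group_vars(toy, g, x, fns = list(min = min, max = max))
res
```

One caveat of this design: if `df` happened to contain a column called `fns`, the name could be picked up from the data instead, so a less collision-prone argument name may be preferable in real code.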

# Older approaches

Generally, I find the new `across()` approaches introduced in `dplyr` 1.0.0 easier and more consistent to use than the methods that preceded them. However, the methods in this section still work and are supported. They are just no longer the ‘recommended’ or most ‘modern’ approach available for creating functions that pass column names into `dplyr` verbs.

Prior to the introduction of the bracket-bracket, `{{}}`, you would have used the `enquo()` + bang-bang approach [3]. The function below is equivalent to the `sum_var()` function shown at the start of this post.

``````sum_var <- function(df, var){
var_quo <- enquo(var)
summarise(df, !!var_quo := sum(!!var_quo))
}``````
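For inputs that may hold several columns, the `enquos()` + `!!!` pairing mentioned in footnote 3 works along the same lines. A minimal sketch, where the function name `sum_by_old()`, the fixed output name `total`, and the `toy` tibble are all hypothetical:

```r
library(dplyr)

sum_by_old <- function(df, sum_var, ...) {
  sum_quo    <- enquo(sum_var)  # capture a single column
  group_quos <- enquos(...)     # capture any number of grouping columns
  df %>%
    group_by(!!!group_quos) %>%         # splice the captured columns in
    summarise(total = sum(!!sum_quo))   # unquote the single column
}

# Hypothetical toy data, used only for illustration
toy <- tibble(
  g1 = c("a", "a", "b", "b"),
  g2 = c("u", "u", "u", "v"),
  x  = c(1, 2, 3, 4)
)

res <- sum_by_old(toy, x, g1, g2)
res
```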

To modify variables in-place you would have used the `*_at()`, `*_if()` or `*_all()` function variants (which are now superseded by `across()`).

``````sum_vars <- function(df, vars){
# `.vars` is an ordinary argument, so `vars` is passed as-is (no `{{ }}` needed)
summarise_at(df, vars, sum)
}``````

Similar to using `across()`, this method allows multiple variables to be input. However, what is weird about this function is that it requires the user to wrap the variable names in `vars()` [4]. Hence, to use the previously created function, a user would run:

``sum_vars(mpg, vars(hwy, cty))``

Alternatively, you could have the variable name inputs be character vectors by modifying the function like so:

``````sum_var <- function(df, vars){

summarise_at(df, vars(one_of(vars)), sum)
}``````

Which could be called by a user as:

``sum_var(mpg, c("hwy", "cty"))``
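For comparison, character-vector inputs can also be handled with `across()` by wrapping the vector in `all_of()`. This is a sketch of the newer equivalent rather than one of the older approaches; `sum_chr_vars()` and `toy` are hypothetical names.

```r
library(dplyr)

sum_chr_vars <- function(df, vars) {
  # all_of() tells tidyselect to treat `vars` as a character vector of
  # column names, and it errors if any of them is missing from `df`
  summarise(df, across(all_of(vars), sum))
}

# Hypothetical toy data, used only for illustration
toy <- tibble(x = c(1, 2, 3), y = c(10, 20, 30))

res <- sum_chr_vars(toy, c("x", "y"))
res
```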

These `*_at()` variants also support a list of functions as input, e.g. the function below would output both the sums and means.

``````sum_var <- function(df, var){

summarise_at(df, vars(one_of(var)), list(sum = sum, mean = mean))
}``````

For multiple grouping variables and multiple variables to be summarised you could create:

``````groupsum <- function(df, group_vars, sum_vars){
df %>%
group_by_at(vars(one_of(group_vars))) %>%
summarise_at(vars(one_of(sum_vars)), list(sum = sum, mean = mean))
}``````

Which would be called by a user:

``````groupsum(mpg,
c("model", "year"),
c("hwy", "cty"))``````

There are a variety of similar spins you might take on handling tidy evaluation when creating these or similar types of functions.

One other older approach perhaps worth mentioning is “passing the dots”. Here is an example for when we want to `group_by()` multiple columns and then `summarise()` on just one column:

``````sum_group_var <- function(df, sum_var, ...){
df %>%
group_by(...) %>%
summarise({{sum_var}} := sum({{sum_var}}))
}``````

The limitation with this approach is that only one set of your inputs can have more than one variable in it, i.e. wherever you pass the `...` in your function.
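A quick usage sketch of the dots version on a made-up tibble (`toy` is hypothetical): everything after `sum_var` is forwarded to `group_by()`, while `sum_var` itself can only name a single column.

```r
library(dplyr)

sum_group_var <- function(df, sum_var, ...) {
  df %>%
    group_by(...) %>%
    summarise({{ sum_var }} := sum({{ sum_var }}))
}

# Hypothetical toy data, used only for illustration
toy <- tibble(
  g1 = c("a", "a", "b", "b"),
  g2 = c("u", "u", "u", "v"),
  x  = c(1, 2, 3, 4)
)

# x is summed; g1 and g2 travel through `...` into group_by()
res <- sum_group_var(toy, x, g1, g2)
res
```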

# Appendix

Image shared on social media was created using `xaringan` and `flair`. See dplyr-1.0.0-example for details.

1. `dplyr` 1.0.0 also now has support for using the glue package syntax for modifying variable names.

2. Doing this doesn’t require any tidy evaluation knowledge.

3. There are also `rlang::enquos()` and the `!!!` operator for when the input has length greater than one.

4. A niche function specific to tidy evaluation (which users might not think of).