Introduction
dplyr 1.0.0 is coming soon!
I’m really excited about this release.
I think rowwise
and across
are excellent new functions that make
common operations feel more intuitive.
You can try them out for yourself by installing the development version
from Github until there is a CRAN release.
This post is just me playing around with some of the new features.
Stop Repeating Names
I’ve probably typed something like this a thousand times:
library(ggplot2)
library(tidyr)
library(dplyr, warn.conflicts = FALSE)
mtcars <- as_tibble(mtcars)
mtcars %>%
mutate(vs = as.factor(vs)) %>%
mutate(am = as.factor(am))
## # A tibble: 32 x 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # … with 22 more rows
Here’s the general pattern: I’m applying a function to a column, but I want the column name to stay the same.
It also happens in summaries, where you want to find column means, but you just want the columns to be named the same:
mtcars %>%
summarise(disp = mean(disp), hp = mean(hp))
## # A tibble: 1 x 2
## disp hp
## <dbl> <dbl>
## 1 231. 147.
You are always repeating
column_name = func(column_name)
dplyr::across
is a way to get around that. In mutates and summarises,
by default across
keeps the same column name after you apply a function:
mtcars %>%
mutate(across(vs, as.factor))
## # A tibble: 32 x 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # … with 22 more rows
This is a nice, but small benefit. The cool thing is that you can
easily apply this to any number of columns you want using tidyselect
and functions.
mtcars %>%
summarise(across(c(disp, hp), mean))
## # A tibble: 1 x 2
## disp hp
## <dbl> <dbl>
## 1 231. 147.
mtcars %>%
summarise(across(is.numeric & -disp & -hp, mean))
## # A tibble: 1 x 9
## mpg cyl drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 20.1 6.19 3.60 3.22 17.8 0.438 0.406 3.69 2.81
Summary Rows
One nice thing about Excel is summary rows.
For example, say I have the following summarised table:
diamonds_sum <- diamonds %>%
mutate(across(cut, as.character)) %>%
group_by(cut) %>%
summarise(across(price, sum), across(carat, n_distinct), n = n())
diamonds_sum
## # A tibble: 5 x 4
## cut price carat n
## <chr> <int> <int> <int>
## 1 Fair 7017600 185 1610
## 2 Good 19275009 199 4906
## 3 Ideal 74513487 232 21551
## 4 Premium 63221498 251 13791
## 5 Very Good 48107623 231 12082
In Excel, I could easily add a total row for each column. You can do it with dplyr, but it doesn’t feel great.
sum_of_sum <- diamonds_sum %>%
summarise(across(-cut, sum))
sum_of_sum
## # A tibble: 1 x 3
## price carat n
## <int> <int> <int>
## 1 212135217 1098 53940
diamonds_sum %>%
bind_rows(
sum_of_sum %>%
mutate(cut = "total")
)
## # A tibble: 6 x 4
## cut price carat n
## <chr> <int> <int> <int>
## 1 Fair 7017600 185 1610
## 2 Good 19275009 199 4906
## 3 Ideal 74513487 232 21551
## 4 Premium 63221498 251 13791
## 5 Very Good 48107623 231 12082
## 6 total 212135217 1098 53940
It’s a somewhat common problem. Here’s a StackOverflow post that documents various solutions.
Funny enough, you can actually do this in the new dplyr using rowwise
.
The intution here is to tranpose the dataframe and think of price
, carat
and n
as rows.
Then we can add a rowwise
summary, tranpose back and get our total row!
So we start by transposing the dataframe:
transpose_df <- function(x, col, name) {
x %>%
pivot_longer(-{{ col }}, names_to = name) %>%
pivot_wider(names_from = {{ col }})
}
diamonds_sum %>%
transpose_df(cut, "feature")
## # A tibble: 3 x 6
## feature Fair Good Ideal Premium `Very Good`
## <chr> <int> <int> <int> <int> <int>
## 1 price 7017600 19275009 74513487 63221498 48107623
## 2 carat 185 199 232 251 231
## 3 n 1610 4906 21551 13791 12082
Then use rowwise
to add a summary column for each feature:
diamonds_sum %>%
transpose_df(cut, "feature") %>%
rowwise(feature) %>%
mutate(total = sum(c_across()))
## # A tibble: 3 x 7
## # Rowwise: feature
## feature Fair Good Ideal Premium `Very Good` total
## <chr> <int> <int> <int> <int> <int> <int>
## 1 price 7017600 19275009 74513487 63221498 48107623 212135217
## 2 carat 185 199 232 251 231 1098
## 3 n 1610 4906 21551 13791 12082 53940
Then we can transpose again to get back to our original shape:
diamonds_sum %>%
transpose_df(cut, "feature") %>%
rowwise(feature) %>%
mutate(total = sum(c_across())) %>%
transpose_df(feature, "cut")
## # A tibble: 6 x 4
## cut price carat n
## <chr> <int> <int> <int>
## 1 Fair 7017600 185 1610
## 2 Good 19275009 199 4906
## 3 Ideal 74513487 232 21551
## 4 Premium 63221498 251 13791
## 5 Very Good 48107623 231 12082
## 6 total 212135217 1098 53940
This is also instructive to see how we can tranpose data using
the pivot_*
family of functions, and why it’s not so straight-forward to
arbitrarily tranpose data. It’s easy to lose information!