# Tests for Pairwise Mean Differences in R

## Introduction

Inspired by Jonas K. Lindeløv’s excellent website

common statistical tests are linear models

this post will walk through common statistical tests used when analyzing categorical variables in R.

I’ll cover 5 situations:

1. pairwise differences between members of a category
2. comparison to the overall category mean
3. pairwise differences within a category
4. consecutive comparisons of time-based or sequential factors
5. before-and-after comparisons

## How to use contrasts in R

In short: don’t bother.1

Like many before me, one of my stats classes technically “taught” me contrasts. But I didn’t get the point and using them was cumbersome, so I promptly ignored them for years.

Luckily for me, someone came along and fixed the situation: emmeans. emmeans frames contrasts as a question you pose to a model: you can ask for all pairwise comparisons and get back that. lm and summary treat the same problem as fitting abstract coefficients, and you are left to answer your own question.

emmeans works with lm, glm, and the Bayesian friends in brms and rstanarm, so the process is applicable no matter the tool.

And you don’t have to learn (much) about contrasts to take advantage of it.

### Cheatsheet

This article is summarized in the following table. For more information on emmeans and contrasts, be sure to visit the extensive emmeans vignettes.

Template:

lm(outcome ~ cat, data = df) %>%
emmeans(pairwise ~ cat)
test emmeans formula diagram
pairwise pairwise ~ cat
mean [eff | del.eff] ~ cat
pairwise in category pairwise ~ cat_cmp | cat_within
consecutive consec ~ cat
before and after mean_chg ~ cat

## Pairwise differences

The goal of this section is to look at pairwise differences between values of a category. The primary example will be pairwise differences in air time between airlines.

We’ll start the analysis by grabbing 100 random flights from the top 5 airlines, using data from the nycflights13 package.

library(tidyverse, warn.conflicts = TRUE)
library(emmeans)
library(broom)
library(nycflights13)

set.seed(158)

flights <- nycflights13::flights %>%
left_join(nycflights13::airlines)

top_carriers <- flights %>%
count(name) %>%
slice_max(n, n = 5)

flights <- flights %>%
filter(name %in% top_carriers$name) %>% mutate(across(c(name, month), as.factor)) %>% mutate(across(name, str_replace, " Inc.", "")) %>% mutate(time_of_day = case_when( between(sched_dep_time, 0, 1159) ~ "am", TRUE ~ "pm" )) %>% group_by(name) %>% slice_sample(n = 100) %>% ungroup() For each airline, we can we find the mean air time. But we also want to test if the airlines differ significantly. flights %>% group_by(name) %>% summarise(across(air_time, mean, na.rm = TRUE), n = n()) ## # A tibble: 5 x 3 ## name air_time n ## <chr> <dbl> <int> ## 1 American Airlines 183. 100 ## 2 Delta Air Lines 169. 100 ## 3 ExpressJet Airlines 98.4 100 ## 4 JetBlue Airways 167. 100 ## 5 United Air Lines 206. 100 To figure this out, we’ll fit a linear model using the airlines as predictors. lm_airlines <- lm(air_time ~ name, data = flights) summary(lm_airlines) ## ## Call: ## lm(formula = air_time ~ name, data = flights) ## ## Residuals: ## Min 1Q Median 3Q Max ## -170.36 -51.17 -15.44 33.35 434.64 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 183.093 8.371 21.873 < 2e-16 *** ## nameDelta Air Lines -13.941 11.778 -1.184 0.2371 ## nameExpressJet Airlines -84.680 11.838 -7.153 3.13e-12 *** ## nameJetBlue Airways -16.366 11.778 -1.390 0.1653 ## nameUnited Air Lines 23.267 11.749 1.980 0.0482 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 82.44 on 487 degrees of freedom ## (8 observations deleted due to missingness) ## Multiple R-squared: 0.1609, Adjusted R-squared: 0.1541 ## F-statistic: 23.35 on 4 and 487 DF, p-value: < 2.2e-16 The classic lm-to-summary workflow. Notice that American Airlines is no-where to be found. That’s because it’s the “default” category. R calls this contrast treatment but the more common name is dummy encoding. All the other name[Airline] rows in the table show the estimated change in air time from American Airlines, and how confident the model is on that change based on the p-value. This tells us some nice stuff. ExpressJet Airlines has much lower air time than American, while United has a little more. But that’s it. We can’t answer our questions about pairwise differences: • Are Delta and JetBlue really that different? • How do our two top airlines compare? • What about the bottom two? To find all these pairwise differences, we need emmeans::emmeans: lm_airlines %>% emmeans(pairwise ~ name) ##$emmeans
##  name                emmean   SE  df lower.CL upper.CL
##  American Airlines    183.1 8.37 487      167      200
##  Delta Air Lines      169.2 8.29 487      153      185
##  ExpressJet Airlines   98.4 8.37 487       82      115
##  JetBlue Airways      166.7 8.29 487      150      183
##  United Air Lines     206.4 8.24 487      190      223
##
## Confidence level used: 0.95
##
## $contrasts ## contrast estimate SE df t.ratio p.value ## American Airlines - Delta Air Lines 13.94 11.8 487 1.184 0.7608 ## American Airlines - ExpressJet Airlines 84.68 11.8 487 7.153 <.0001 ## American Airlines - JetBlue Airways 16.37 11.8 487 1.390 0.6347 ## American Airlines - United Air Lines -23.27 11.7 487 -1.980 0.2771 ## Delta Air Lines - ExpressJet Airlines 70.74 11.8 487 6.006 <.0001 ## Delta Air Lines - JetBlue Airways 2.42 11.7 487 0.207 0.9996 ## Delta Air Lines - United Air Lines -37.21 11.7 487 -3.183 0.0134 ## ExpressJet Airlines - JetBlue Airways -68.31 11.8 487 -5.800 <.0001 ## ExpressJet Airlines - United Air Lines -107.95 11.7 487 -9.188 <.0001 ## JetBlue Airways - United Air Lines -39.63 11.7 487 -3.391 0.0067 ## ## P value adjustment: tukey method for comparing a family of 5 estimates emmeans explicitly returns the estimated differences and p-values for every combination of airlines. We notice that the difference between the middle two airlines Delta and JetBlue is not significant. American has about 23 minutes less air time than United, but it’s not significant either. And not surprisingly, any difference over 35 minutes seems significant. ## Comparison to overall mean We might also be interested in comparing each airline not to each other, but to the overall mean. This is useful for benchmarking the companies based on which ones are above and below the overall average. lm_airlines %>% emmeans(eff ~ name) %>% pluck("contrasts") ## contrast estimate SE df t.ratio p.value ## American Airlines effect 18.34 7.47 487 2.454 0.0241 ## Delta Air Lines effect 4.40 7.42 487 0.594 0.6913 ## ExpressJet Airlines effect -66.34 7.47 487 -8.876 <.0001 ## JetBlue Airways effect 1.98 7.42 487 0.267 0.7898 ## United Air Lines effect 41.61 7.39 487 5.632 <.0001 ## ## P value adjustment: fdr method for 5 tests Using average air time across airlines as a benchmark for comparison, Delta and JetBlue fall right in the middle. United and Delta lead, with ExpressJet trailing far behind. We can also see how one company compares against all the others by using the del.eff contrast: lm_airlines %>% emmeans(del.eff ~ name) %>% pluck("contrasts") ## contrast estimate SE df t.ratio p.value ## American Airlines effect 22.93 9.34 487 2.454 0.0241 ## Delta Air Lines effect 5.50 9.27 487 0.594 0.6913 ## ExpressJet Airlines effect -82.92 9.34 487 -8.876 <.0001 ## JetBlue Airways effect 2.47 9.27 487 0.267 0.7898 ## United Air Lines effect 52.01 9.24 487 5.632 <.0001 ## ## P value adjustment: fdr method for 5 tests ## Pairwise comparisons within groups It’s common to check pairwise comparisons within groups. For example, you might want to see if students who attended an ACT prep class scored higher on the test than those who didn’t. If students from multiple schools were eligible to take the prep class, you’d want to test the effect of the class within schools, to control for variation. Going back to air travel, it’s pretty clear that flights are delayed longer for departure in the PM than in the AM. lm(dep_delay ~ time_of_day, data = flights) %>% tidy() ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1.16 2.59 0.449 0.654 ## 2 time_of_daypm 17.8 3.27 5.43 0.0000000871 On average, flights in the PM are delayed 17 minutes longer than flights leaving in the AM. And this difference is very significant. But is this effect consistent across airlines? Are some airlines performing better than others, or is there one bad airline causing most of this delay? The normal summary on linear regression doesn’t answer these questions: lm_delay <- lm(dep_delay ~ name * time_of_day, data = flights) tidy(lm_delay) ## # A tibble: 10 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -0.414 6.52 -0.0635 0.949 ## 2 nameDelta Air Lines -0.443 8.47 -0.0523 0.958 ## 3 nameExpressJet Airlines 4.78 8.51 0.561 0.575 ## 4 nameJetBlue Airways 2.19 8.56 0.256 0.798 ## 5 nameUnited Air Lines 0.789 9.00 0.0877 0.930 ## 6 time_of_daypm 12.3 7.76 1.59 0.113 ## 7 nameDelta Air Lines:time_of_daypm 8.21 10.5 0.779 0.436 ## 8 nameExpressJet Airlines:time_of_daypm 13.4 10.6 1.27 0.206 ## 9 nameJetBlue Airways:time_of_daypm 4.78 10.6 0.452 0.652 ## 10 nameUnited Air Lines:time_of_daypm 3.59 10.8 0.332 0.740 If we just looked at the p-values uncritically, we might conclude that there are no significant differences between AM and PM delays across the airlines, since the smallest p-value is around 0.1. But remember that by default lm is only computing the time difference (the interactions) relative to American Airlines. We don’t care if the airlines differ from American: we care how they compare to themselves in the morning and the evening. Therefore, we need to do a pairwise comparison of AM and PM within each group: lm_delay %>% emmeans(pairwise ~ time_of_day | name) %>% pluck("contrasts") ## name = American Airlines: ## contrast estimate SE df t.ratio p.value ## am - pm -12.3 7.76 483 -1.588 0.1131 ## ## name = Delta Air Lines: ## contrast estimate SE df t.ratio p.value ## am - pm -20.5 7.14 483 -2.879 0.0042 ## ## name = ExpressJet Airlines: ## contrast estimate SE df t.ratio p.value ## am - pm -25.8 7.21 483 -3.572 0.0004 ## ## name = JetBlue Airways: ## contrast estimate SE df t.ratio p.value ## am - pm -17.1 7.19 483 -2.380 0.0177 ## ## name = United Air Lines: ## contrast estimate SE df t.ratio p.value ## am - pm -15.9 7.52 483 -2.116 0.0348 Compared to lm, emmeans tell a completely different story. We are confident that every airline, save American, has more delays in the PM. It’s also more clear why the default contrast in lm gave such a different outlook. The magnitude of the differences between airlines don’t different all that much, ranging from -12 to -25. Those differences aren’t significant when compared to American, but the differences from morning to night certainly are. ## Consecutive comparisons Consecutive comparisons frequently occur with yearly aggregations. It’s natural to ask how 2020 compares to 2019, and how 2019 compared to 2018. Changes from year-to-year might be an effect of random variation, or the company might want to measure the impact of a new release or strategy. In my line of work, this is particularly important for usability testing. For example, do users who started in 2018 find it easier to navigate a website, compared to newer users who started in 2019? In this section, we’ll show how to do consecutive comparisons, using airplane data as an example. planes <- nycflights13::planes %>% filter(year > 2003) %>% mutate(across(year, as.factor)) Let’s analyze the average number of airplane seats per year. More seats are great for airlines, because they can sell more tickets, but travelers don’t like being crammed and uncomfortable. planes %>% group_by(year) %>% summarise(across(seats, mean, na.rm = TRUE), n = n()) %>% arrange(desc(year)) ## # A tibble: 10 x 3 ## year seats n ## <fct> <dbl> <int> ## 1 2013 192. 92 ## 2 2012 199. 95 ## 3 2011 197. 66 ## 4 2010 152. 48 ## 5 2009 182. 84 ## 6 2008 136. 147 ## 7 2007 121. 123 ## 8 2006 127. 126 ## 9 2005 113. 162 ## 10 2004 116. 192 It’s pretty clear that this is a increasing year-over-year trend, as a linear regression confirms: lm(seats ~ as.numeric(year), data = planes) %>% tidy() ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 95.7 4.12 23.2 7.24e-98 ## 2 as.numeric(year) 10.4 0.752 13.9 1.96e-40 But 2013, 2012, and 2011 look really similar. Let’s see how significant those differences are by doing consecutive comparisons: lm_seats <- lm(seats ~ year, data = planes) lm_seats %>% emmeans(consec ~ year) %>% pluck("contrasts") ## contrast estimate SE df t.ratio p.value ## 2005 - 2004 -3.48 7.82 1125 -0.445 0.9997 ## 2006 - 2005 14.34 8.70 1125 1.648 0.5509 ## 2007 - 2006 -5.65 9.29 1125 -0.608 0.9972 ## 2008 - 2007 14.29 8.95 1125 1.596 0.5903 ## 2009 - 2008 46.18 10.02 1125 4.609 <.0001 ## 2010 - 2009 -29.39 13.25 1125 -2.217 0.1957 ## 2011 - 2010 44.23 13.90 1125 3.183 0.0129 ## 2012 - 2011 1.98 11.74 1125 0.169 1.0000 ## 2013 - 2012 -6.69 10.72 1125 -0.624 0.9966 ## ## P value adjustment: mvt method for 9 tests Consecutive comparisons highlight the years where there were significant peaks in the average number of airplanes seats: 2009 and 2011. And like we thought, there is little change from 2011 to 2013. ## Before and after The last type of contrast we’ll talk about today is before and after mean comparisons. It’s similar to consecutive comparisons, because it often only makes sense for sequential factors like time. This one’s also unique, because the unit of comparison is not necessarily an individual member of a category, but rather groups compared to other groups. Continuing with seats from the previous section, an exampe of a before and after comparison would be comparing the mean number of seats from 2003 - 2008 to the mean number of seats from 2009 - 2013. This type of analysis is useful for non-linear trends, where you want to identity the change point (hence the name before and after). Returning to delays, how much variation is there month-to-month? As always, we’ll run the tests using linear regression: lm_month <- lm(dep_delay ~ month, data = flights) tidy(lm_month) ## # A tibble: 12 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 10.5 5.41 1.95 0.0520 ## 2 month2 -2.96 7.90 -0.374 0.708 ## 3 month3 6.32 7.79 0.810 0.418 ## 4 month4 10.3 7.45 1.39 0.166 ## 5 month5 8.39 7.79 1.08 0.282 ## 6 month6 7.51 7.65 0.982 0.327 ## 7 month7 20.3 7.74 2.62 0.00904 ## 8 month8 -5.44 7.74 -0.702 0.483 ## 9 month9 -10.9 7.45 -1.46 0.145 ## 10 month10 -3.08 7.79 -0.396 0.692 ## 11 month11 -2.90 7.90 -0.368 0.713 ## 12 month12 -8.35 8.21 -1.02 0.309 The default contrast tells us that month 7, July, has significantly longer delays than January. But how does this average change over the course of the year? lm_month %>% emmeans(mean_chg ~ month) ##$emmeans
##  month emmean   SE  df lower.CL upper.CL
##  1     10.535 5.41 481  -0.0936    21.16
##  2      7.579 5.75 481  -3.7272    18.89
##  3     16.850 5.61 481   5.8301    27.87
##  4     20.875 5.12 481  10.8153    30.93
##  5     18.925 5.61 481   7.9051    29.94
##  6     18.047 5.41 481   7.4180    28.68
##  7     30.829 5.54 481  19.9446    41.71
##  8      5.098 5.54 481  -5.7871    15.98
##  9     -0.333 5.12 481 -10.3931     9.73
##  10     7.450 5.61 481  -3.5699    18.47
##  11     7.632 5.75 481  -3.6746    18.94
##  12     2.182 6.17 481  -9.9507    14.31
##
## Confidence level used: 0.95
##
## \$contrasts
##  contrast estimate   SE  df t.ratio p.value
##  1|2         1.750 5.66 481  0.309  1.0000
##  2|3         3.698 4.32 481  0.856  0.9499
##  3|4         0.646 3.72 481  0.174  1.0000
##  4|5        -2.731 3.38 481 -0.808  0.9633
##  5|6        -4.824 3.25 481 -1.486  0.5733
##  6|7        -6.659 3.21 481 -2.074  0.2173
##  7|8       -13.257 3.27 481 -4.053  0.0005
##  8|9       -11.860 3.44 481 -3.447  0.0051
##  9|10       -8.513 3.84 481 -2.219  0.1607
##  10|11      -8.679 4.56 481 -1.903  0.3009
##  11|12     -10.862 6.39 481 -1.699  0.4237
##
## P value adjustment: mvt method for 11 tests

As I mentioned before, this one is a little more complicated to understand. Each row measures the difference using

after_average - before_average

So row 7 | 8, the first significant row, says that the average delay from August to December is 11 minutes lower than the average delay from January to August.

This indicates that August turning point in the year (the before and after moment), when delays start to shrink.

1. If you are really curious, you might find these posts about contrasts in R useful:
1. http://www.clayford.net/statistics/tag/sum-contrasts/
2. https://www.r-bloggers.com/using-and-interpreting-different-contrasts-in-linear-models-in-r/
3. https://blogs.uoregon.edu/rclub/2015/11/03/anova-contrasts-in-r/
↩︎