Tests for Pairwise Mean Differences in R

Introduction
~~How to use contrasts in R~~
- Cheatsheet
Pairwise differences
Comparison to overall mean
Pairwise comparisons within groups
Consecutive comparisons
Before and after

Introduction

Inspired by Jonas K. Lindeløv’s excellent website

common statistical tests are linear models

this post will walk through common statistical tests used when analyzing categorical variables in R.

I’ll cover 5 situations:

pairwise differences between members of a category
comparison to the overall category mean
pairwise differences within a category
consecutive comparisons of time-based or sequential factors
before-and-after comparisons

How to use contrasts in R

In short: don’t bother.¹

Like many before me, one of my stats classes technically “taught” me contrasts. But I didn’t get the point and using them was cumbersome, so I promptly ignored them for years.

Luckily for me, someone came along and fixed the situation: emmeans. emmeans frames contrasts as a question you pose to a model: you can ask for all pairwise comparisons and get back that. lm and summary treat the same problem as fitting abstract coefficients, and you are left to answer your own question.

emmeans works with lm, glm, and the Bayesian friends in brms and rstanarm, so the process is applicable no matter the tool.

And you don’t have to learn (much) about contrasts to take advantage of it.

Cheatsheet

This article is summarized in the following table. For more information on emmeans and contrasts, be sure to visit the extensive emmeans vignettes.

Template:

lm(outcome ~ cat, data = df) %>%
  emmeans(pairwise ~ cat)

test	emmeans formula	diagram
pairwise	`pairwise ~ cat`
mean	`[eff \| del.eff] ~ cat`
pairwise in category	`pairwise ~ cat_cmp \| cat_within`
consecutive	`consec ~ cat`
before and after	`mean_chg ~ cat`

Pairwise differences

The goal of this section is to look at pairwise differences between values of a category. The primary example will be pairwise differences in air time between airlines.

We’ll start the analysis by grabbing 100 random flights from the top 5 airlines, using data from the nycflights13 package.

library(tidyverse, warn.conflicts = TRUE)
library(emmeans)
library(broom)
library(nycflights13)

set.seed(158)

flights <- nycflights13::flights %>%
  left_join(nycflights13::airlines)

top_carriers <- flights %>%
  count(name) %>%
  slice_max(n, n = 5)

flights <- flights %>%
  filter(name %in% top_carriers$name) %>%
  mutate(across(c(name, month), as.factor)) %>%
  mutate(across(name, str_replace, " Inc.", "")) %>%
  mutate(time_of_day = case_when(
    between(sched_dep_time, 0, 1159) ~ "am",
    TRUE ~ "pm"
  )) %>%
  group_by(name) %>%
  slice_sample(n = 100) %>%
  ungroup()

For each airline, we can we find the mean air time. But we also want to test if the airlines differ significantly.

flights %>%
  group_by(name) %>%
  summarise(across(air_time, mean, na.rm = TRUE), n = n())

## # A tibble: 5 x 3
##   name                air_time     n
##   <chr>                  <dbl> <int>
## 1 American Airlines      183.    100
## 2 Delta Air Lines        169.    100
## 3 ExpressJet Airlines     98.4   100
## 4 JetBlue Airways        167.    100
## 5 United Air Lines       206.    100

To figure this out, we’ll fit a linear model using the airlines as predictors.

lm_airlines <- lm(air_time ~ name, data = flights)
summary(lm_airlines)

## 
## Call:
## lm(formula = air_time ~ name, data = flights)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -170.36  -51.17  -15.44   33.35  434.64 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              183.093      8.371  21.873  < 2e-16 ***
## nameDelta Air Lines      -13.941     11.778  -1.184   0.2371    
## nameExpressJet Airlines  -84.680     11.838  -7.153 3.13e-12 ***
## nameJetBlue Airways      -16.366     11.778  -1.390   0.1653    
## nameUnited Air Lines      23.267     11.749   1.980   0.0482 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82.44 on 487 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.1609, Adjusted R-squared:  0.1541 
## F-statistic: 23.35 on 4 and 487 DF,  p-value: < 2.2e-16

The classic lm-to-summary workflow. Notice that American Airlines is no-where to be found. That’s because it’s the “default” category. R calls this contrast treatment but the more common name is dummy encoding.

All the other name[Airline] rows in the table show the estimated change in air time from American Airlines, and how confident the model is on that change based on the p-value.

This tells us some nice stuff. ExpressJet Airlines has much lower air time than American, while United has a little more. But that’s it.

We can’t answer our questions about pairwise differences:

Are Delta and JetBlue really that different?
How do our two top airlines compare?
What about the bottom two?

To find all these pairwise differences, we need emmeans::emmeans:

lm_airlines %>%
  emmeans(pairwise ~ name)

## $emmeans
##  name                emmean   SE  df lower.CL upper.CL
##  American Airlines    183.1 8.37 487      167      200
##  Delta Air Lines      169.2 8.29 487      153      185
##  ExpressJet Airlines   98.4 8.37 487       82      115
##  JetBlue Airways      166.7 8.29 487      150      183
##  United Air Lines     206.4 8.24 487      190      223
## 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast                                estimate   SE  df t.ratio p.value
##  American Airlines - Delta Air Lines        13.94 11.8 487  1.184  0.7608 
##  American Airlines - ExpressJet Airlines    84.68 11.8 487  7.153  <.0001 
##  American Airlines - JetBlue Airways        16.37 11.8 487  1.390  0.6347 
##  American Airlines - United Air Lines      -23.27 11.7 487 -1.980  0.2771 
##  Delta Air Lines - ExpressJet Airlines      70.74 11.8 487  6.006  <.0001 
##  Delta Air Lines - JetBlue Airways           2.42 11.7 487  0.207  0.9996 
##  Delta Air Lines - United Air Lines        -37.21 11.7 487 -3.183  0.0134 
##  ExpressJet Airlines - JetBlue Airways     -68.31 11.8 487 -5.800  <.0001 
##  ExpressJet Airlines - United Air Lines   -107.95 11.7 487 -9.188  <.0001 
##  JetBlue Airways - United Air Lines        -39.63 11.7 487 -3.391  0.0067 
## 
## P value adjustment: tukey method for comparing a family of 5 estimates

emmeans explicitly returns the estimated differences and p-values for every combination of airlines.

We notice that the difference between the middle two airlines Delta and JetBlue is not significant. American has about 23 minutes less air time than United, but it’s not significant either. And not surprisingly, any difference over 35 minutes seems significant.

Comparison to overall mean

We might also be interested in comparing each airline not to each other, but to the overall mean. This is useful for benchmarking the companies based on which ones are above and below the overall average.

lm_airlines %>%
  emmeans(eff ~ name) %>%
  pluck("contrasts")

##  contrast                   estimate   SE  df t.ratio p.value
##  American Airlines effect      18.34 7.47 487  2.454  0.0241 
##  Delta Air Lines effect         4.40 7.42 487  0.594  0.6913 
##  ExpressJet Airlines effect   -66.34 7.47 487 -8.876  <.0001 
##  JetBlue Airways effect         1.98 7.42 487  0.267  0.7898 
##  United Air Lines effect       41.61 7.39 487  5.632  <.0001 
## 
## P value adjustment: fdr method for 5 tests

Using average air time across airlines as a benchmark for comparison, Delta and JetBlue fall right in the middle. United and Delta lead, with ExpressJet trailing far behind.

We can also see how one company compares against all the others by using the del.eff contrast:

lm_airlines %>%
  emmeans(del.eff ~ name) %>%
  pluck("contrasts")

##  contrast                   estimate   SE  df t.ratio p.value
##  American Airlines effect      22.93 9.34 487  2.454  0.0241 
##  Delta Air Lines effect         5.50 9.27 487  0.594  0.6913 
##  ExpressJet Airlines effect   -82.92 9.34 487 -8.876  <.0001 
##  JetBlue Airways effect         2.47 9.27 487  0.267  0.7898 
##  United Air Lines effect       52.01 9.24 487  5.632  <.0001 
## 
## P value adjustment: fdr method for 5 tests

Pairwise comparisons within groups

It’s common to check pairwise comparisons within groups. For example, you might want to see if students who attended an ACT prep class scored higher on the test than those who didn’t. If students from multiple schools were eligible to take the prep class, you’d want to test the effect of the class within schools, to control for variation.

Going back to air travel, it’s pretty clear that flights are delayed longer for departure in the PM than in the AM.

lm(dep_delay ~ time_of_day, data = flights) %>%
  tidy()

## # A tibble: 2 x 5
##   term          estimate std.error statistic      p.value
##   <chr>            <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)       1.16      2.59     0.449 0.654       
## 2 time_of_daypm    17.8       3.27     5.43  0.0000000871

On average, flights in the PM are delayed 17 minutes longer than flights leaving in the AM. And this difference is very significant.

But is this effect consistent across airlines? Are some airlines performing better than others, or is there one bad airline causing most of this delay?

The normal summary on linear regression doesn’t answer these questions:

lm_delay <- lm(dep_delay ~ name * time_of_day, data = flights)
tidy(lm_delay)

## # A tibble: 10 x 5
##    term                                  estimate std.error statistic p.value
##    <chr>                                    <dbl>     <dbl>     <dbl>   <dbl>
##  1 (Intercept)                             -0.414      6.52   -0.0635   0.949
##  2 nameDelta Air Lines                     -0.443      8.47   -0.0523   0.958
##  3 nameExpressJet Airlines                  4.78       8.51    0.561    0.575
##  4 nameJetBlue Airways                      2.19       8.56    0.256    0.798
##  5 nameUnited Air Lines                     0.789      9.00    0.0877   0.930
##  6 time_of_daypm                           12.3        7.76    1.59     0.113
##  7 nameDelta Air Lines:time_of_daypm        8.21      10.5     0.779    0.436
##  8 nameExpressJet Airlines:time_of_daypm   13.4       10.6     1.27     0.206
##  9 nameJetBlue Airways:time_of_daypm        4.78      10.6     0.452    0.652
## 10 nameUnited Air Lines:time_of_daypm       3.59      10.8     0.332    0.740

If we just looked at the p-values uncritically, we might conclude that there are no significant differences between AM and PM delays across the airlines, since the smallest p-value is around 0.1.

But remember that by default lm is only computing the time difference (the interactions) relative to American Airlines.

We don’t care if the airlines differ from American: we care how they compare to themselves in the morning and the evening.

Therefore, we need to do a pairwise comparison of AM and PM within each group:

lm_delay %>%
  emmeans(pairwise ~ time_of_day | name) %>%
  pluck("contrasts")

## name = American Airlines:
##  contrast estimate   SE  df t.ratio p.value
##  am - pm     -12.3 7.76 483 -1.588  0.1131 
## 
## name = Delta Air Lines:
##  contrast estimate   SE  df t.ratio p.value
##  am - pm     -20.5 7.14 483 -2.879  0.0042 
## 
## name = ExpressJet Airlines:
##  contrast estimate   SE  df t.ratio p.value
##  am - pm     -25.8 7.21 483 -3.572  0.0004 
## 
## name = JetBlue Airways:
##  contrast estimate   SE  df t.ratio p.value
##  am - pm     -17.1 7.19 483 -2.380  0.0177 
## 
## name = United Air Lines:
##  contrast estimate   SE  df t.ratio p.value
##  am - pm     -15.9 7.52 483 -2.116  0.0348

Compared to lm, emmeans tell a completely different story. We are confident that every airline, save American, has more delays in the PM.

It’s also more clear why the default contrast in lm gave such a different outlook. The magnitude of the differences between airlines don’t different all that much, ranging from -12 to -25. Those differences aren’t significant when compared to American, but the differences from morning to night certainly are.

Consecutive comparisons

Consecutive comparisons frequently occur with yearly aggregations. It’s natural to ask how 2020 compares to 2019, and how 2019 compared to 2018. Changes from year-to-year might be an effect of random variation, or the company might want to measure the impact of a new release or strategy.

In my line of work, this is particularly important for usability testing. For example, do users who started in 2018 find it easier to navigate a website, compared to newer users who started in 2019?

In this section, we’ll show how to do consecutive comparisons, using airplane data as an example.

planes <- nycflights13::planes %>%
  filter(year > 2003) %>%
  mutate(across(year, as.factor))

Let’s analyze the average number of airplane seats per year. More seats are great for airlines, because they can sell more tickets, but travelers don’t like being crammed and uncomfortable.

planes %>%
  group_by(year) %>%
  summarise(across(seats, mean, na.rm = TRUE), n = n()) %>%
  arrange(desc(year))

## # A tibble: 10 x 3
##    year  seats     n
##    <fct> <dbl> <int>
##  1 2013   192.    92
##  2 2012   199.    95
##  3 2011   197.    66
##  4 2010   152.    48
##  5 2009   182.    84
##  6 2008   136.   147
##  7 2007   121.   123
##  8 2006   127.   126
##  9 2005   113.   162
## 10 2004   116.   192

It’s pretty clear that this is a increasing year-over-year trend, as a linear regression confirms:

lm(seats ~ as.numeric(year), data = planes) %>%
  tidy()

## # A tibble: 2 x 5
##   term             estimate std.error statistic  p.value
##   <chr>               <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)          95.7     4.12       23.2 7.24e-98
## 2 as.numeric(year)     10.4     0.752      13.9 1.96e-40

But 2013, 2012, and 2011 look really similar. Let’s see how significant those differences are by doing consecutive comparisons:

lm_seats <- lm(seats ~ year, data = planes)

lm_seats %>%
  emmeans(consec ~ year) %>%
  pluck("contrasts")

##  contrast    estimate    SE   df t.ratio p.value
##  2005 - 2004    -3.48  7.82 1125 -0.445  0.9997 
##  2006 - 2005    14.34  8.70 1125  1.648  0.5509 
##  2007 - 2006    -5.65  9.29 1125 -0.608  0.9972 
##  2008 - 2007    14.29  8.95 1125  1.596  0.5903 
##  2009 - 2008    46.18 10.02 1125  4.609  <.0001 
##  2010 - 2009   -29.39 13.25 1125 -2.217  0.1957 
##  2011 - 2010    44.23 13.90 1125  3.183  0.0129 
##  2012 - 2011     1.98 11.74 1125  0.169  1.0000 
##  2013 - 2012    -6.69 10.72 1125 -0.624  0.9966 
## 
## P value adjustment: mvt method for 9 tests

Consecutive comparisons highlight the years where there were significant peaks in the average number of airplanes seats: 2009 and 2011.

And like we thought, there is little change from 2011 to 2013.

Before and after

The last type of contrast we’ll talk about today is before and after mean comparisons. It’s similar to consecutive comparisons, because it often only makes sense for sequential factors like time.

This one’s also unique, because the unit of comparison is not necessarily an individual member of a category, but rather groups compared to other groups.

Continuing with seats from the previous section, an exampe of a before and after comparison would be comparing the mean number of seats from 2003 - 2008 to the mean number of seats from 2009 - 2013.

This type of analysis is useful for non-linear trends, where you want to identity the change point (hence the name before and after).

Returning to delays, how much variation is there month-to-month?

As always, we’ll run the tests using linear regression:

lm_month <- lm(dep_delay ~ month, data = flights)
tidy(lm_month)

## # A tibble: 12 x 5
##    term        estimate std.error statistic p.value
##    <chr>          <dbl>     <dbl>     <dbl>   <dbl>
##  1 (Intercept)    10.5       5.41     1.95  0.0520 
##  2 month2         -2.96      7.90    -0.374 0.708  
##  3 month3          6.32      7.79     0.810 0.418  
##  4 month4         10.3       7.45     1.39  0.166  
##  5 month5          8.39      7.79     1.08  0.282  
##  6 month6          7.51      7.65     0.982 0.327  
##  7 month7         20.3       7.74     2.62  0.00904
##  8 month8         -5.44      7.74    -0.702 0.483  
##  9 month9        -10.9       7.45    -1.46  0.145  
## 10 month10        -3.08      7.79    -0.396 0.692  
## 11 month11        -2.90      7.90    -0.368 0.713  
## 12 month12        -8.35      8.21    -1.02  0.309

The default contrast tells us that month 7, July, has significantly longer delays than January. But how does this average change over the course of the year?

lm_month %>%
  emmeans(mean_chg ~ month)

## $emmeans
##  month emmean   SE  df lower.CL upper.CL
##  1     10.535 5.41 481  -0.0936    21.16
##  2      7.579 5.75 481  -3.7272    18.89
##  3     16.850 5.61 481   5.8301    27.87
##  4     20.875 5.12 481  10.8153    30.93
##  5     18.925 5.61 481   7.9051    29.94
##  6     18.047 5.41 481   7.4180    28.68
##  7     30.829 5.54 481  19.9446    41.71
##  8      5.098 5.54 481  -5.7871    15.98
##  9     -0.333 5.12 481 -10.3931     9.73
##  10     7.450 5.61 481  -3.5699    18.47
##  11     7.632 5.75 481  -3.6746    18.94
##  12     2.182 6.17 481  -9.9507    14.31
## 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast estimate   SE  df t.ratio p.value
##  1|2         1.750 5.66 481  0.309  1.0000 
##  2|3         3.698 4.32 481  0.856  0.9499 
##  3|4         0.646 3.72 481  0.174  1.0000 
##  4|5        -2.731 3.38 481 -0.808  0.9633 
##  5|6        -4.824 3.25 481 -1.486  0.5733 
##  6|7        -6.659 3.21 481 -2.074  0.2173 
##  7|8       -13.257 3.27 481 -4.053  0.0005 
##  8|9       -11.860 3.44 481 -3.447  0.0051 
##  9|10       -8.513 3.84 481 -2.219  0.1607 
##  10|11      -8.679 4.56 481 -1.903  0.3009 
##  11|12     -10.862 6.39 481 -1.699  0.4237 
## 
## P value adjustment: mvt method for 11 tests

As I mentioned before, this one is a little more complicated to understand. Each row measures the difference using

after_average - before_average

So row 7 | 8, the first significant row, says that the average delay from August to December is 11 minutes lower than the average delay from January to August.

This indicates that August turning point in the year (the before and after moment), when delays start to shrink.

If you are really curious, you might find these posts about contrasts in R useful:
1. http://www.clayford.net/statistics/tag/sum-contrasts/
2. https://www.r-bloggers.com/using-and-interpreting-different-contrasts-in-linear-models-in-r/
3. https://blogs.uoregon.edu/rclub/2015/11/03/anova-contrasts-in-r/
↩︎