library(equivalence)
library(tidyverse)
library(reshape2)

1 Introduction

Many of us are steeped in the tradition of Null Hypothesis Significance Testing (NHST). Under the null hypothesis (H0), there is no difference between conditions. Under NHST, the burden of proof is upon us to reject the null. That’s usually okay, but it’s a big problem when we are really interested in evaluating equivalence. That’s when you need an equivalence test. Some people use Bayes Factors (BF01) as oblique evidence for equivalence. Others, such as past me, plead ignorance and report the null as evidence for equivalence. The good news is that even you can test for equivalence.

2 Equivalence Testing

2.1 TOST Test: 2 Means

Here’s the Two One-Sided Tests (TOST) procedure, which evaluates equivalence between two distributions. Let’s first simulate some data that we know to be exactly equal by duplicating a single random normal sample.

set.seed(1234)
a <- rnorm(1000, 10, 2)  #1000 observations with a mean of 10 and an SD=2
b <- a
c <- data.frame(a, b)
head(c)
##           a         b
## 1  7.585869  7.585869
## 2 10.554858 10.554858
## 3 12.168882 12.168882
## 4  5.308605  5.308605
## 5 10.858249 10.858249
## 6 11.012112 11.012112

Variables ‘a’ and ‘b’ are equivalent. We created them so. Now we need a statistic. In standard NHST, the null hypothesis is the no-difference hypothesis. If you ‘accept’ the null, are you justified in concluding that the distributions are equivalent? No! Absence of evidence for a difference is not evidence of equivalence. This is an easy trap to fall into – much like our legal system, where the assumption is ‘innocent until proven guilty’.

With equivalence testing, however, the null hypothesis is flipped on its head. That is, you start with the assumption of a difference of at least epsilon – the equivalence margin, i.e., the largest difference in means you are still willing to call ‘equivalent’. Then you must overcome the burden of proof to show equivalence (i.e., p < .05 indicates equivalence, not difference). As a sanity check, let’s run a TOST on the two identical distributions we generated above – treating them as independent samples.

tost(c$a, c$b, epsilon = 1, paired = F, var.equal = T, conf.level = 0.95)
## 
##  Two Sample TOST
## 
## data:  c$a and c$b
## df = 1998
## sample estimates:
## mean of x mean of y 
##  9.946806  9.946806 
## 
## Epsilon: 1 
## 95 percent two one-sided confidence interval (TOST interval):
##  -0.1467966  0.1467966
## Null hypothesis of statistical difference is: rejected 
## TOST p-value: 1.24855e-28

Yes! That’s one fine kettle of fish. We now have a p-value to support our suspicion of equivalence.
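
If you’re curious what TOST is doing under the hood: it runs two one-sided t-tests against the −epsilon and +epsilon bounds, and (by the usual TOST convention, which I’m assuming the equivalence package follows) reports the larger of the two one-sided p-values. Here’s a minimal sketch reproducing that by hand:

# Two one-sided tests against the -epsilon and +epsilon bounds (epsilon = 1).
# Sketch assuming the reported TOST p is the larger of the two one-sided p's.
p.lower <- t.test(c$a, c$b, mu = -1, alternative = "greater", var.equal = T)$p.value
p.upper <- t.test(c$a, c$b, mu = 1, alternative = "less", var.equal = T)$p.value
max(p.lower, p.upper)  # should match the TOST p-value above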

Let’s wiggle the distributions a bit by shifting the mean of the second variable. Here’s what the distributions look like when mean a = 10 and mean z = 11, shown as a density plot:

jamie.theme <- theme_bw() + theme(axis.line = element_line(colour = "black"), 
    panel.grid.minor = element_blank(), panel.grid.major = element_blank(), 
    panel.border = element_blank(), panel.background = element_blank(), legend.title = element_blank())
set.seed(1234)
a <- rnorm(1000, 10, 2)
z <- rnorm(1000, 11, 2)  # now the z variable has a mean of 11, so the two distributions will start to pull apart
d <- data.frame(a, z)
d.11 <- d %>% melt(measure.vars = 1:2)

dense11 <- ggplot(d.11, aes(x = value, fill = variable)) + geom_density(alpha = 0.5, 
    colour = "white") + jamie.theme
print(dense11)

Are distributions ‘a’ and ‘z’ equivalent? Let’s check with a TOST test.

tost(d$a, d$z, epsilon = 1, paired = F, var.equal = T, conf.level = 0.95)
## 
##  Two Sample TOST
## 
## data:  d$a and d$z
## df = 1998
## sample estimates:
## mean of x mean of y 
##  9.946806 11.029016 
## 
## Epsilon: 1 
## 95 percent two one-sided confidence interval (TOST interval):
##  -1.2278241 -0.9365976
## Null hypothesis of statistical difference is: not rejected 
## TOST p-value: 0.8235223

No! You fail to reject the null. That means you cannot conclude that ‘a’ and ‘z’ are equivalent. Note the wording here – the paradox is that ‘not shown to be equivalent’ does not mean the same thing as ‘different’. I know. I know. Semantics!

Now, just as another sanity check, let’s run a standard t-test on ‘a’ and ‘z’.

t.test(d$a, d$z)
## 
##  Welch Two Sample t-test
## 
## data:  d$a and d$z
## t = -12.23, df = 1997.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.2557445 -0.9086773
## sample estimates:
## mean of x mean of y 
##  9.946806 11.029016
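
Keep in mind that the TOST verdict hinges on the epsilon you choose. A quick sketch – not part of the original analysis – shows that with a sloppier margin, wider than the 1-unit shift we built in, the very same data would pass:

# Hypothetical re-run with a wider margin (epsilon = 1.5):
tost(d$a, d$z, epsilon = 1.5, paired = F, var.equal = T, conf.level = 0.95)
# The TOST interval (~ -1.23 to -0.94, unchanged from above) now sits
# inside +/-1.5, so this run would declare 'equivalence' despite the
# built-in 1-unit shift. Choose epsilon on substantive grounds, not to
# get the answer you want.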


Here are some real data from a study we are working on now, in which we examined very subtle pupillary responses. It is important to control for blink rates in this type of experiment. During peer review, a referee raised the logical question of whether there were more blinks as participants completed the task of listening for a particular word while sitting in a dark room vs. a bright room. We hypothesized that blink rates would be equivalent because the task is equally demanding in both lighting conditions, but we needed evidence! Enter the TOST.


Here are some analyses of mean blink rate (number of samples with 0 mm values / total number of samples) per participant, per lighting condition.

exp1 <- read.csv("Blinks_Exp1.csv", header = T)
exp2 <- read.csv("Blinks_Exp2.csv", header = T)
head(exp2)
##   X lightCond participant numNAs numObservations percentBlinks
## 1 1    bright           2   3224           21514     14.985591
## 2 2    bright           3   3246           21507     15.092760
## 3 3    bright           4   1501           21453      6.996690
## 4 4    bright           5    755           21561      3.501693
## 5 5    bright           6   1882           15170     12.406065
## 6 6    bright           7   3525           18252     19.312952
e1 <- exp1 %>% filter(Lighting != "Dark") %>% select(2:3, 6)
head(e1)
##   participant Lighting percentBlinks
## 1           2   Bright      0.000000
## 2           2   Middle      2.122477
## 3           3   Bright      0.000000
## 4           3   Middle      0.000000
## 5           4   Bright     12.045214
## 6           4   Middle     11.361162
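
As a quick sanity check on the blink-rate definition above, percentBlinks should just be the blink-sample count over the total number of samples, recomputed here from the raw exp2 columns (the pct.check name is mine):

# Recompute blink rate from the raw counts (illustrative check):
pct.check <- with(exp2, 100 * numNAs / numObservations)
head(pct.check)  # should match exp2$percentBlinks above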

Let’s look at the distribution of blinks for Experiment 1 (counting pure tones).

newvec <- c("yellow", "black")
ggplot(e1, aes(x = Lighting, percentBlinks, shape = Lighting, fill = Lighting)) + 
    geom_point(shape = 21, color = "black", size = 2.3, alpha = 0.6, position = position_jitter(w = 0.03, 
        h = 0)) + scale_fill_manual(values = newvec) + ylab("Blink Rate") + 
    stat_summary(fun.y = mean, fun.ymin = mean, fun.ymax = mean, geom = "crossbar", 
        color = "black", size = 0.4, width = 0.3) + jamie.theme

The means look pretty close; there’s a fair bit of spread, though. Let’s try an equivalence test. This one is paired (within-subject), and we will not assume equal variance. I couldn’t figure out how to run tost() with a grouping variable, so instead I spread the data into wide form using tidyr’s spread() (as opposed to reshape2’s dcast).
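
For reference, the reshape2 route would be a single dcast() call – a sketch, with ecast2 as an illustrative name:

# reshape2 equivalent of the tidyr spread() used below (illustrative):
ecast2 <- dcast(e1, participant ~ Lighting, value.var = "percentBlinks")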

ecast <- e1 %>% spread(Lighting, percentBlinks)
head(ecast, n = 10)
##    participant    Bright    Middle
## 1            2  0.000000  2.122477
## 2            3  0.000000  0.000000
## 3            4 12.045214 11.361162
## 4            5  7.612096 10.855950
## 5            6  5.493741  9.334986
## 6            8  3.026087  5.638705
## 7            9  5.352113 10.869565
## 8           10  0.000000  0.973913
## 9           11  6.917233  6.066946
## 10          12  4.700557  1.845404
tost(ecast$Bright, ecast$Middle, epsilon = 1, paired = T, var.equal = F, conf.level = 0.95)
## 
##  Paired TOST
## 
## data:  ecast$Bright and ecast$Middle
## df = 23
## sample estimates:
## mean of the differences 
##              -0.4953428 
## 
## Epsilon: 1 
## 95 percent two one-sided confidence interval (TOST interval):
##  -1.2884059  0.2977203
## Null hypothesis of statistical difference is: not rejected 
## TOST p-value: 0.1433708

Here’s a case where we can’t report equivalence – the means are close (within half a percent), but the spread is large: the lower bound of the TOST interval (−1.29) falls outside the ±1 epsilon margin.

So, do we blink slightly more in the dark? Let’s run a standard difference test.

t.test(ecast$Bright, ecast$Middle, paired = T, var.equal = F)
## 
##  Paired t-test
## 
## data:  ecast$Bright and ecast$Middle
## t = -1.0705, df = 23, p-value = 0.2955
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.4525768  0.4618911
## sample estimates:
## mean of the differences 
##              -0.4953428

Now we are stuck with a real paradox. The TOST cannot assert equivalence, and yet the standard t-test does not support a difference. This contrast is like that little strip of land when you are leaving one state but haven’t reached the welcome sign for the next. Where are you?


One option here is to report effect size and leave it at that – or turn to Bayes Factors.
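
For example, here is a minimal sketch of both options on the blink data. Cohen’s d_z (the mean paired difference over the SD of the differences) is one common paired effect size; the Bayes Factor lines assume the BayesFactor package, which isn’t loaded above, so treat them as an illustration rather than part of the original analysis:

# Standardized paired effect size: Cohen's d_z
diffs <- ecast$Bright - ecast$Middle
mean(diffs)/sd(diffs)

# Bayes Factor option (assumes the BayesFactor package is installed):
# library(BayesFactor)
# ttestBF(x = diffs)  # reports BF10; 1/BF10 gives BF01, evidence for the null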