Thursday, November 10, 2016

For a Type 1 error rate of 5%, accept H1 if BF10 > 1 (wait, what?)

So I was looking at Daniel Lakens' fascinating recent blog post The Dance of the Bayes Factors, where he shows how Bayes factors are subject to a pretty substantial degree of sampling variability. When used as a basis for a decision about whether a null hypothesis is true or false, they can also (of course) result in errors. For example, Daniel shows that for a two-sample comparison of means, with n = 75 in each group and a true effect size of Cohen's δ = .3, quite high rates of Type 2 errors can result. In particular, Daniel notes that we will:
  • Falsely support the null 25% of the time if we set a cutoff of BF10 < 1/3 for supporting the null
  • Falsely support the null 12.5% of the time if we set a stricter cutoff of BF10 < 1/10 for supporting the null
I wondered: What kind of Type 2 error rate would result if we set our evidence cut-offs so as to ensure a Type 1 error rate of 5%? To figure this out, I simulated some data where the null is true, and where we again have n of 75 in each group, and see what cut-off would preserve this Type 1 error rate:

get.BF = function(cohensd, n){
  x<-rnorm(n = n, mean = 0, sd = 1) #produce N simulated participants
  y<-rnorm(n = n, mean = cohensd, sd = 1) #produce N simulated participants
  z<-t.test(x,y) #perform the t-test
simsH0 = replicate(10000, get.BF(cohensd = 0, n = 75))
quantile(simsH0, probs = 0.95)
##       95% 
## 0.9974437

Wait… what? It turns out that even if we accept any Bayes factor greater than 1 as a basis to support H1, we will only reject H0 falsely 5% of the time! That seems bizarre - a Bayes factor of 1, after all, means that the data is just as consistent with H0 as it is with H1! (Note: The cut-off for a 5% error rate differs according to the H1 prior specified and the sample size - don't take the title of this post too literally!)

How about if we used stricter cut-offs of BF10 > 3 (“positive” evidence according to Kass and Raftery, 1995), or > 20 (“strong” evidence)?
sum(simsH0 > 3)/length(simsH0)
## [1] 0.0114
sum(simsH0 > 20)/length(simsH0)
## [1] 0.0015

It turns out that if we use the BF10 > 3 criterion (which still seems very liberal), our Type 1 error rate is only about 1%. And it we use a rule of BF10 > 20 as a cut-off for supporting H1, our Type 1 error rate becomes absolutely miniscule - about 0.15%.

Weird. But let's follow this exercise to its conclusion… what Type 2 error rate results if we the BF10 > 1 rule to support H1? Well, presuming Daniel's original effect size of δ = 0.3:
simsH1 = replicate(10000, get.BF(cohensd = 0.3, n = 75))
sum(simsH1 > 1)/length(simsH1) #We support H1 46% of the time
## [1] 0.4621
power.t.test (75, 0.3) #Which is virtually identical to the power of a frequentist t-test
##      Two-sample t test power calculation 
##               n = 75
##           delta = 0.3
##              sd = 1
##       sig.level = 0.05
##           power = 0.4463964
##     alternative = two.sided
## NOTE: n is number in *each* group

So, on one level, we have a reassuring result: It is possible to set a Bayes factor cut-off for “decisiveness” that gives us basically the same Type 1 and Type 2 error rates as a conventional significance test. Which makes sense: The Bayes factor is just a monotonic transformation of the p value, after all.

But on the other hand… a lot of this doesn't make a lot of intuitive sense. How can we maintain a low Type 1 error rate with such a liberal cut-off for supporting H1? And why does the same incredibly liberal cut-off not result in particularly good power for supporting H1 when the true effect size is small but definitely not zero?

The answer to this lies in thinking about the models we're actually comparing here. We might think of this test as comparing an hypothesis of an exactly zero effect (H0) to one of a non-zero effect (H1), but that isn't actually what we're doing. The H1 model being tested is that the true effect size δ is randomly drawn from a Cauchy distribution with location = 0 and scale = 0.707.

Now when the true effect size δ is zero, the majority of sample effect sizes we see are indeed near zero. And the null (H0) model tends to fit the resulting sample estimates quite well, as it should (since it's the true model). On the other hand, the H1 model of Cauchy (0, 0.707) is quite spread out: It implies a 50% chance of an effect size |δ| > 0.707. Correspondingly it suggests that sample effect sizes very near zero are quite improbable. So when H0 is true, the H1 model tends not to fit the data well, and we will accidentally support it only very rarely.

But what's the story when the true effect size is δ = 0.3? Why is the Bayes factor test still so shy to support the H1 model, even if the null hypothesis is actually false, and even if we take the slightest sliver of evidence (BF10 > 1) as enough to say H1 is true?

Well, first of all, in this situation neither the H0 model nor the H1 model is true. The H1 model suggests that the true effect size is randomly drawn from a Cauchy distribution, and places zero prior probability on the true effect size taking any specific point value. But here the true effect size is a fixed, small, value. So both models are technically false, but either could sometimes fit the resulting data quite well. And, as it happens, because the H1 model of Cauchy (0, 0.707) suggests that a small true effect size isn't very probable, it tends not to fit very well to data simulated according to a fixed, small, effect size. On the other hand, the null model often fits reasonably well with such data. Hence in this situation we tend to be much more likely to support H0 than H1, even though H0 is false.

In one sense, the result I come up with above is just a statistical curiousity, and one that probably won't be very surprising to people very familiar with Bayes factors. But I think there are some simple philosophy-of-science considerations to keep in mind here that might be of interest to some.

Basically, when we use a Bayes factor test of a null hypothesis test, the H1 model we specify is a falsifiable (but literally false) model that we use as a stand-in for the possibly-true (but unfalsifiable) substantive hypothesis: That the true effect size simply isn't zero. That isn't a fault of the test: No statistical test can make up for the fact that psychological hypotheses are often too vaguely specified to be directly tested without being reformulated. But we need to be wary of over-interpreted a Bayes factor result that supports the null hypothesis: Such a result implies that H0 fits the data better than the specific H1 model we've specified… but it doesn't necessarily mean the null hypothesis is true.


  1. You might be interested in these other posts that followed up Daniel's. Tim did similar simulations, and I showed how you can easily do this stuff analytically by using the t CDF (since t is a sufficient statistic in this special case of the BF. Not always!).

    I would also note that simulating from a single alternative value is not really the right simulation for this model specification. The proper Bayesian simulation would sample from the alternative prior in such a way as to obtain the prior predictive distribution. Such as is done in Jeff's paper here:

    1. Thanks Alex! The posts you link to definitely do a much more solid and comprehensive job at looking at this than I have - my post is really more just pointing out the curious way the error rates pan out in the scenario Daniel discussed. I wasn't aware of those posts so thanks (and I've added your RSS feed now :) )