Monday, June 13, 2016

My talk at the M3 conference on Bayes factor null hypothesis tests

Recently I visited the USA for the first time. In between consuming vast quantities of pizza, Del Taco, and Blue Moon, I presented at the Modern Modeling Methods (M3) conference in Connecticut. (Slides included below).

The M3 conference was jam-packed full of presentations about technical issues relating to quite complex statistical models (sample presentation title: "Asymptotic efficiency of the pseudo-maximum likelihood estimator in multi-group factor models with pooled data"). The questions were serious, the matrix algebra was plentiful, and the word "SPSS" was never uttered.

My presentation was a little different: I wanted to talk about the default statistical methods used by everyday common-or-garden researchers: The ones who like t-tests, do their analyses by point-and-click, and think a posterior is something from a Nicki Minaj song (but may know lots and lots about things other than statistics!). I believe that the practices of everyday researchers are important to look at: If we want to fix the replication crisis, we need most researchers to be doing their research and analysis better - not just the ones who pass their time by discussing floating-point computation on Twitter.

And in terms of problems with the data analyses used by everyday researchers, the big issue that jumps out at me is the (mis)use of null hypothesis significance tests (NHST). Positions obviously vary on whether NHST should be used at all, but I think there's reasonable agreement amongst methodology geeks that the form of Fisher/Neyman-Pearson hybrid NHST that dominates current practice is really problematic. To me, a couple of the biggest problems are that:
  1. Hybrid NHST (and frequentist analysis in general) doesn't directly tell us how certain we can be that a particular hypothesis is true. It tells us P(Data | Hypothesis), but we'd generally rather like to know P(Hypothesis | Data).
  2. Hybrid NHST in particular has a problem with asymmetry of result: p < .05 is interpreted as meaning the null can be rejected, the alternate hypothesis is supported, publication results, and rejoicing is heard throughout the land. But p > .05 is often read as indicating only uncertainty: We can't say the null is true, only that there is insufficient evidence to reject it*. This may be a contributor to publication bias and the replication crisis: Part of the preference against publishing null results may be that they are often interpreted as not actually communicating anything other than uncertainty.
But what do we replace hybrid NHST with? Abandoning inferential statistics entirely is obviously foolish. There are several options (Deborah Mayo's error statistics approach, Bayesian estimation, etc.). But an approach that's gaining especial traction in psychology at the moment is that of using Bayes factor tests: Particularly, using Bayes factors to compare the evidence for a point null vs. a vague alternate hypothesis (although obviously this isn't the only way in which Bayes factors can be used).

My talk was a gentle critique of this approach of Bayes factor null hypothesis testing. And I do mean gentle - as I mention in slide 9 of the talk, I think Bayes factor tests of null hypotheses have some great advantages over conventional NHST. I mention several of these in my slides below, but perhaps the biggest advantage is that they compare two hypotheses in such a way that either hypothesis might end up supported (unlike hybrid NHST, where only the alternate can possibly "win"!). So I wouldn't want to stomp on the flower of Bayes factor testing. And certainly my talk critiques only a particular implementation of Bayes factors: To test a point null vs. a non-directional diffuse alternate. Much of the fantastic development of methods and software by guys like Rouder, Morey and Wagenmakers can be applied more broadly than just to Bayes factor tests of point null hypotheses.

But I do think that Bayes factor tests of point null hypotheses have some problems that mean they may not be a suitable default approach to statistical analysis in psychology. (And currently this does seem to be the default way in which Bayes factors are applied).

To begin with, a Bayes factor is a statement only about the (ratio of the) likelihoods of the data under the null and alternate hypotheses. It isn't a statement about posterior odds or posterior probabilities. For a Bayesian analysis, that seems to be a little unsatisfying to me. Researchers presumably want to know how certain they can be that their hypotheses are correct; that is, they want to know about posterior probabilities (even if they don't use those words). In fact, researchers often try to interpret p values in a Bayesian way - as a statement about the probability that the null is true.
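To make the "ratio of likelihoods" idea concrete, here is a rough Python sketch of a Bayes factor for a point null against a Cauchy alternate. It is only an illustration under simplifying assumptions of my own: the observed standardised effect is treated as normally distributed around the true δ with a known standard error, the Cauchy scale of 0.707 is just an illustrative default, and the integral is done with a crude midpoint rule. The function names are hypothetical, and this is not the t-test Bayes factor implemented in the software mentioned above.

```python
import math
from statistics import NormalDist


def cauchy_pdf(x, scale):
    """Density of a Cauchy(0, scale) distribution at x."""
    return 1.0 / (math.pi * scale * (1.0 + (x / scale) ** 2))


def bf10_normal_approx(d_hat, se, scale=0.707, half_width=20.0, steps=40000):
    """Approximate BF10 for H0: delta = 0 vs H1: delta ~ Cauchy(0, scale).

    Assumes d_hat ~ Normal(delta, se), which is a simplification.
    """
    # Likelihood of the observed effect under the point null.
    like_null = NormalDist(0.0, se).pdf(d_hat)

    # Marginal likelihood under the alternate: average the likelihood over
    # the Cauchy prior, using a midpoint rule on [-half_width, half_width].
    # Normal(d_hat, se).pdf(delta) equals Normal(delta, se).pdf(d_hat).
    likelihood = NormalDist(d_hat, se)
    step = 2.0 * half_width / steps
    marginal_alt = 0.0
    for i in range(steps):
        delta = -half_width + (i + 0.5) * step
        marginal_alt += likelihood.pdf(delta) * cauchy_pdf(delta, scale) * step

    return marginal_alt / like_null


# An observed effect of exactly zero favours the null (BF10 below 1),
# while a clearly non-zero effect favours the alternate (BF10 above 1).
print(bf10_normal_approx(0.0, 0.1))
print(bf10_normal_approx(0.5, 0.1))
```

Note that nothing in this calculation involves the prior probability of either hypothesis: the output only says how much better one hypothesis predicted the data than the other.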

And I suspect that a similar thing will happen if Bayes factor null hypothesis tests become commonplace: People will (at least implicitly) interpret them as statements about the posterior odds that the alternate hypothesis is correct. In fact, I think that kind of interpretation is almost supported by the availability of qualitative interpretation guidelines for Bayes factors: The notion that Bayes factors can be directly interpreted themselves - rather than converted first to posterior odds - seems to me to reinforce the idea that they're the endpoint of an analysis: that the Bayes factor directly tells us about how certain we can be that a particular hypothesis is correct. I know that Jeff Rouder has explicitly argued against this interpretation - instead saying that researchers should report Bayes factors and let readers select and update their own priors (perhaps aided by suggestions from the researchers). In an ideal world, that's exactly how things would work, but I don't think that this is realistic for everyday readers and researchers with limited statistical expertise.

So everyday researchers will naturally want a statement about the posterior (about how certain they can be that an hypothesis is correct) if doing a Bayes factor analysis. And I think it's likely that they will in fact interpret the Bayes factor as providing this information. But in what circumstance can a Bayes factor be interpreted as the posterior odds that the alternate hypothesis is correct? Well this is fairly obvious: The Bayes factor is the posterior odds that the alternate hypothesis is correct if we placed a prior probability of 0.5 on the null being true, and 0.5 on the alternate being true.
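The arithmetic behind this is just Bayes' rule in odds form: posterior odds = Bayes factor × prior odds. A minimal Python sketch (the function name and example numbers are mine, purely for illustration):

```python
def posterior_prob_alternative(bf10, prior_prob_alt=0.5):
    """Posterior probability of the alternate, given BF10 and a prior probability."""
    prior_odds = prior_prob_alt / (1.0 - prior_prob_alt)
    posterior_odds = bf10 * prior_odds  # Bayes' rule in odds form
    return posterior_odds / (1.0 + posterior_odds)


# With a 50:50 prior, the prior odds are 1, so the Bayes factor *is* the
# posterior odds: BF10 = 3 gives odds of 3:1, i.e. a probability of 0.75.
print(posterior_prob_alternative(3.0))

# With a more sceptical prior, the same Bayes factor gives a lower probability.
print(posterior_prob_alternative(3.0, prior_prob_alt=0.2))
```

So reading a Bayes factor of 3 as "the alternate is three times as likely as the null" is only right if the two hypotheses were considered equally likely beforehand.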

The thing is... that's a really weird prior. It's a prior that takes a "tower and hill" shape (see slides 13 and 14), and suggests that one particular value of the effect size (δ = 0) is infinitely** more likely than any other value. It is absolutely and definitely a very informative prior, and yet also one that seems unlikely to represent our actual state of prior knowledge about any given parameter.

So this is a problematic prior - and when researchers use the Bayes factor as the endpoint of a statistical analysis without explicitly drawing on prior information, I would argue that this prior implicitly underlies the resulting conclusions. For this reason I don't think that a Bayes factor test of a point null hypothesis vs. a non-directional alternate, with the Bayes factor serving as the major statistical endpoint of the analysis, is an ideal default approach in psychology.

Ok... but what would be a better default approach? The relatively slight modification I suggested in the talk was to take into account the fact that most (not all) hypotheses tested in psychology are directional. Therefore, instead of focusing on the "strawman" of a point null hypothesis, we could focus on testing whether a particular effect is positive or negative.

This is most obviously achieved by using MCMC Bayesian estimation of a parameter and then tallying up the proportion of draws from the posterior that are greater than zero (i.e., the posterior probability that the effect is positive). However, it could also be achieved by using Bayes factors, by comparing an hypothesis that the effect is positive to an hypothesis that it is negative (with a half-Cauchy prior for each, say). So the statistical methods and programs (e.g., JASP) developed by supporters of Bayes factor testing could readily be adapted to this slightly altered goal. Either way, this would allow us to dispense with our inordinate focus on the point null, and directly test the hypothesis that's likely to be of most interest: that there is an effect in a specific direction.
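The tallying step itself is about as simple as statistics gets. Here is a minimal Python sketch, in which the "posterior draws" are simulated from a normal distribution purely as a stand-in for real MCMC output (the mean, spread and seed are made-up numbers, and `prob_positive` is a hypothetical helper name):

```python
import random


def prob_positive(draws):
    """Proportion of posterior draws that are greater than zero."""
    draws = list(draws)
    return sum(d > 0 for d in draws) / len(draws)


# Stand-in for MCMC output: in a real analysis these draws would come
# from a sampler, not from random.gauss.
rng = random.Random(2016)
fake_posterior_draws = [rng.gauss(0.3, 0.2) for _ in range(20000)]

# Estimated posterior probability that the effect is positive.
print(prob_positive(fake_posterior_draws))
```

The posterior probability that the effect is negative is just one minus this value (ignoring draws of exactly zero, which essentially never occur for a continuous parameter).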

However, in performing directional tests we need to take into account the fact that most effects in psychology are small: If we were to use uninformative priors, this would lead to excess confidence in our statements about the directionality of effects. That is, the posterior probability that the true effect is in the same direction as that found in the sample will be closer to 1 with an uninformative prior than it would be if we took into account the fact that most effects are small. I don't think it's feasible to expect everyday researchers to select their own informative priors, but methodologists could suggest default informative priors for psychology: Given that we know only that a particular parameter describes a psychological effect, which values of the parameter are most likely?
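A toy calculation shows the size of this overconfidence. The Python sketch below assumes, for convenience only, a normal likelihood with known standard error and a normal prior centred on zero (so that the posterior has a closed form); the observed effect, standard error and prior SD are all made-up numbers, and the function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def prob_positive_flat(d_hat, se):
    """P(delta > 0 | data) under a flat prior: posterior is Normal(d_hat, se)."""
    return STD_NORMAL.cdf(d_hat / se)


def prob_positive_informative(d_hat, se, prior_sd):
    """P(delta > 0 | data) under a Normal(0, prior_sd) prior (conjugate update)."""
    shrinkage = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    post_mean = shrinkage * d_hat           # estimate is pulled toward zero
    post_sd = (shrinkage * se ** 2) ** 0.5  # posterior is also a bit narrower
    return STD_NORMAL.cdf(post_mean / post_sd)


# Made-up example: observed d = 0.5 with standard error 0.3.
print(prob_positive_flat(0.5, 0.3))                       # about 0.95
print(prob_positive_informative(0.5, 0.3, prior_sd=0.4))  # about 0.91
```

Same data, but the prior that expects small effects is noticeably less sure about the sign, and the gap grows as the data get noisier.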

To achieve this, I suggest that we could find default prior variances for common analyses in psychology empirically. As a very rough idea of how this could work, I point out that Richard et al.’s (2003) meta-meta-analysis of 25,000 social psychological studies found a mean absolute effect of r = .21, which equates to a Cohen's d value of 0.43. We might use this information to set a default prior for a two-sample means comparison, for example, of Cauchy(0, 0.43), implying a 50% chance that the true standardised effect δ is somewhere between -0.43 and +0.43. (Note: There could be better choices of meta-meta-analysis to use that aren't restricted to social psych, and we would want to correct for publication bias, but this serves as a rough illustration of how we could set a default prior empirically). Use of this prior would allow us to make statements about the posterior probability that an effect is in a particular direction, without the statement being made overconfident due to use of an uninformative prior.
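The "50% chance" claim follows from the fact that the scale of a Cauchy distribution centred on zero is also its upper quartile. A quick Python check (the function names are mine; the cut-off of 1 in the last line is just an arbitrary example of a "large" effect):

```python
import math


def cauchy_cdf(x, loc=0.0, scale=1.0):
    """Cumulative distribution function of a Cauchy(loc, scale) distribution."""
    return 0.5 + math.atan((x - loc) / scale) / math.pi


def prior_mass_within(half_width, scale):
    """Prior probability that delta lies in [-half_width, +half_width]."""
    return cauchy_cdf(half_width, scale=scale) - cauchy_cdf(-half_width, scale=scale)


# Half of the prior mass lies within one scale unit of zero.
print(prior_mass_within(0.43, scale=0.43))

# The heavy tails still leave plenty of room for big effects:
# the prior probability that |delta| exceeds 1 is roughly a quarter.
print(1.0 - prior_mass_within(1.0, scale=0.43))
```

So the prior concentrates on small effects without ruling out large ones, which is the behaviour we'd want from a default.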

So that was my idea, and here are the slides from my presentation:





But. I'm definitely not perfectly happy with the directional approach I suggested. It deals with the problematic implicit prior seemingly underlying Bayes factor tests of null hypotheses. And unlike either NHST or non-directional Bayes factor testing it also quantifies how certain we can be that the true parameter falls in a particular direction. But three problems remain:
  • The approach doesn't distinguish trivially small effects from substantial ones (one could maybe deal with this by adding a default ROPE?)
  • Like most alternatives to NHST being discussed currently, it relies heavily on the use of standardised effect sizes, which have quite substantial problems.
  • The approach doesn't prevent - and in fact facilitates - the testing of theories that make only vague directional predictions. Good theories would do better than suggesting only the sign of a parameter, and nothing about its magnitude.
To a large extent I think these three problems cannot be resolved by statistics alone: We won't be able to do better than standardised effect sizes as long as psychologists mostly use scales that lack clear units of measurement, and we won't be able to do better than directional analyses while most psychological theories make only directional predictions.

But at the least I think that Bayes factor tests are a better default approach than hybrid NHST. And I hesitantly suggest that the directional approach I outline is in turn a slightly better default approach than using Bayes factors to test point null hypotheses against non-directional alternates.


*This problem doesn't apply to the original form of Neyman-Pearson testing, where p > .05 is read as indicating support for a decision that the null is true.
**Thanks @AlxEtz for the correction.