The pathological science: Psychology, skepticism, and statistics: Separating model from hypothesis in the Bayes factor test

Premise

When using statistical analyses, we will often test a statistical model that has one or more parts that we regard as forming a hypothesis, and other parts that we see as being auxiliary assumptions (or as representing auxiliary information). Consider, for example, the one-sample t test, which tests a model where the data generating mechanism is N(0, σ²). Here we typically treat the part of the model that says that the mean is zero as the (null) hypothesis, whereas we treat the part of the model that says that the observations are normal, independent and identically distributed with variance σ² as just representing auxiliary assumptions: stuff we don't really care about, but assume so that we can run the test. This division is useful, but is not intrinsic to the test: We could just as well treat these apparently auxiliary parts of the model as being substantive parts of the hypothesis we are trying to test. How might this differentiation into hypothesis and auxiliary "stuff" play out in the context of Bayes factors?

Bayes factor tests

A Bayes factor is the ratio of the probability of a set of observed data given one model to the probability of the data given another model. Bayes factors can be used for many purposes, but a popular current application is as a replacement for null hypothesis significance tests.
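In symbols, the definition above can be written roughly as follows. The notation is my own shorthand rather than anything standardised in this post: D stands for the observed data, M₀ and M₁ for the two models, and BF₁₀ for the Bayes factor in favour of M₁. The second expression is the usual rule for updating prior odds to posterior odds.

```latex
\mathrm{BF}_{10} = \frac{p(D \mid M_1)}{p(D \mid M_0)},
\qquad
\frac{p(M_1 \mid D)}{p(M_0 \mid D)}
  = \mathrm{BF}_{10} \times \frac{p(M_1)}{p(M_0)}
```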

Currently, researchers using Bayes factors typically don't really differentiate the hypotheses they're testing from the statistical models they've specified. Instead the terms "hypothesis" and "model" are used almost interchangeably.

Treating the models as the hypotheses

Let's take a simple paired mean difference test. Here a default Bayes factor analysis as implemented in JASP would compare two models for the paired differences Y:

M₀: Y ~ N(0, σ²)
M₁: Y ~ N(σδ, σ²); δ ~ Cauchy(0, 0.707).

In short, the null model M₀ says the paired differences are all drawn from a population of normally distributed observations that has fixed zero mean and some fixed unknown variance¹. The alternative model M₁ again says that the observations are drawn from a normally distributed population with fixed unknown variance, but here the standardised mean difference from zero is itself drawn from a Cauchy distribution with location parameter 0 and scale parameter 0.707 (i.e., the "prior" on effect size).
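To make the comparison concrete, here is a minimal Python sketch of how a Bayes factor for these two models could be computed from a t statistic. It is not JASP's code. It assumes a Jeffreys prior on the variance (p(σ²) ∝ 1/σ²), which the post leaves unspecified, and it uses the standard representation of the Cauchy(0, r) prior as a scale mixture of normals (δ | g ~ N(0, g), with g inverse-gamma distributed with shape 1/2 and scale r²/2). The function name `jzs_bf10` and its arguments are hypothetical, chosen only for illustration.

```python
import math
from scipy.integrate import quad


def jzs_bf10(t, n, r=0.707):
    """Sketch of BF10 for M1 (Cauchy(0, r) prior on delta) versus M0 (delta = 0).

    t is the one-sample / paired t statistic and n the number of paired
    differences. Assumes a Jeffreys prior on the variance under both models.
    """
    nu = n - 1  # degrees of freedom

    # Marginal likelihood under M0, up to a constant that M1 shares.
    null_like = (1 + t**2 / nu) ** (-(nu + 1) / 2)

    def integrand(g):
        # Marginal likelihood under M1 given g, with the same constant dropped.
        like = (1 + n * g) ** -0.5 * (
            1 + t**2 / ((1 + n * g) * nu)
        ) ** (-(nu + 1) / 2)
        # Inverse-gamma(1/2, r^2/2) density on g, computed on the log scale
        # so that very small g cannot overflow.
        log_prior = (
            math.log(r)
            - 0.5 * math.log(2 * math.pi)
            - 1.5 * math.log(g)
            - r**2 / (2 * g)
        )
        return like * math.exp(log_prior)

    # Averaging over g is what turns the normal prior into a Cauchy prior.
    alt_like, _ = quad(integrand, 0, math.inf)
    return alt_like / null_like
```

Calling `jzs_bf10(t, n)` with a different `r` changes the second model itself, which is exactly why the choice of prior matters so much under this interpretation.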

We could interpret this analysis as simply comparing these two models (and perhaps even call them hypotheses). We could then calculate the Bayes factor, use that along with our prior odds to calculate the posterior odds showing the relative credibility of the two models, and thereby determine which model is more likely to be true. Right?

Actually, I think there are some real problems with this lack of differentiation of model and hypothesis. Specifically:

Here, changing the prior on effect size under M₁ changes the models being compared. As such, the prior selection under M₁ determines what question the analysis asks.

Relatedly, it would make little sense to test the "robustness" of this analysis to different choices of prior on effect size under M₁; different prior specifications imply completely different models being compared.

The models compared are just two of an infinite number of models we might propose in relation to the paired difference (and particularly the standardised effect size). We have taken no position about the prior probability of other hypotheses: If a theorist comes up and says they have a hypothesis that δ lies specifically in the region [0.8, 0.9], we'd have to be agnostic and simply say that our analysis hasn't considered that hypothesis. Consequently:

While we can multiply the Bayes factor by the prior odds to calculate the posterior odds indicating how many times more probable one model is than the other, we can't calculate a posterior probability that a particular model is true: We have not taken into account the possibility that another model might be much more probable than either of those compared.
We can't really say that the findings "support" the null model (even if the posterior odds are in its favour): The findings might suggest that the data is more plausible under the null than this alternate model, but what about all the other potential alternate models?

The notion of the alternate model being "true" is not particularly meaningful. The alternate model places a continuous prior probability distribution on the standardised mean difference δ, with a zero prior probability of the parameter taking any fixed point effect size. As such, the alternate model is technically false if the true parameter value takes any fixed value (whether zero or non-zero)². The idea of testing a model known to be false seems incompatible with the intended role of Bayes factor tests in psychology (i.e., as tools for hypothesis testing).

Distinguishing the models from the hypotheses

I believe that some of the above problems can be resolved if we differentiate the statistical models specified in a Bayes factor test from the hypotheses being tested. Specifically, I would suggest that it is possible to view a non-directional Bayes factor test of a point null hypothesis as comparing two hypotheses:

H₀: δ = 0
H₁: δ =! 0

With the auxiliary assumptions that in each model the error terms are independently, identically and normally distributed with fixed but unknown variance.

Now here we cannot calculate the likelihood of the data under H₁ without adding more information, so we add an additional piece of information:

Prior information: If δ =! 0, δ ~ P(δ)

In other words, our null hypothesis is that the effect size is zero, and our alternative hypothesis is simply that the effect size is non-zero (or perhaps that it lies in a particular direction). In addition to this specification of our hypotheses, we add the extra prior information that if the true effect size is non-zero, the credibility of other values is distributed according to the prior probability distribution P(δ). For example, when using the defaults in JASP, P(δ) might be Cauchy(0, 0.707). Note that this prior probability distribution is not part of the alternative hypothesis H₁: It is extra information we use to allow us to test H₁.

In this view:

Changing the prior on effect size under H₁ could change the results, but won't change the essential question being asked. Consequently, it's perfectly reasonable to test the robustness of the results to alternative priors on the effect size.
If H₁ is non-directional, the hypotheses compared exhaust all possibilities regarding the standardised effect size. (It must be either zero or not zero).

It is therefore sensible (and probably appropriate) to convert the resulting Bayes factor not only to a posterior odds, but to a posterior probability that H₁ is true. This posterior probability will, of course, be conditional on the prior odds and the prior on effect size.
It is therefore also perfectly appropriate to say that the findings support a particular hypothesis (if the resulting posterior probability is in its favour).
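The conversion from a Bayes factor to a posterior probability is simple arithmetic once H₀ and H₁ are taken to exhaust the possibilities. A minimal sketch, with a hypothetical function name and a made-up Bayes factor of 3 used purely for illustration:

```python
def posterior_prob_h1(bf10, prior_odds=1.0):
    """Posterior probability of H1, assuming H0 and H1 are exhaustive.

    bf10 is the Bayes factor in favour of H1; prior_odds is P(H1) / P(H0).
    """
    posterior_odds = bf10 * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# Illustrative only: the same (made-up) Bayes factor under different prior odds.
for odds in (0.25, 1.0, 4.0):
    print(odds, round(posterior_prob_h1(3.0, odds), 3))
```

The loop makes the conditionality visible: the same Bayes factor yields quite different posterior probabilities depending on the prior odds one starts from.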

There is no meaningfulness problem: H₁ is true if and only if the true value of the standardised effect size is not zero.

In effect, this alternative view implies taking on a stronger prior commitment: We are specifying a prior in the conventional Bayesian sense, where we specifically assume that some parameter regions are more probable than others. We are not being agnostic about other possibilities: If we know that some theorist has suggested that the true effect size δ is in the region [0.8, 0.9], and we have specified a prior on effect size of Cauchy (0, 0.707), then our prior implies that we think the theorist is full of rubbish.³
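How much prior probability the Cauchy(0, 0.707) prior gives to the theorist's region can be checked directly from the Cauchy cumulative distribution function. A small sketch, with helper names that are my own:

```python
import math


def cauchy_cdf(x, loc=0.0, scale=0.707):
    """Cumulative distribution function of a Cauchy(loc, scale) distribution."""
    return 0.5 + math.atan((x - loc) / scale) / math.pi


def prior_mass(lo, hi, scale=0.707):
    """Prior probability that delta lies in [lo, hi] under Cauchy(0, scale)."""
    return cauchy_cdf(hi, scale=scale) - cauchy_cdf(lo, scale=scale)


mass = prior_mass(0.8, 0.9)
print(f"{mass:.3f}")  # about 0.018
```

This is the probability conditional on the effect not being exactly zero; any prior probability placed on H₀ would shrink it further.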

Now skeptical readers can always say "Hey - your findings are conditional on your priors, but I don't believe in the same priors!" But that's the case for any Bayesian analysis, so I don't see that as really presenting a problem that is specific to this approach.

TL;DR If we differentiate the hypotheses to be tested from the statistical models compared in a Bayes factor test, the interpretation of the findings becomes a lot more straightforward.

¹ I've ignored the priors on the variance parameters for the sake of brevity since they're common to both models.
² The only scenario in which the alternate model would be "true" is if the underlying true causal effect is literally randomly drawn anew from a Cauchy(0, 0.707) distribution every time the experiment is replicated - a rather implausible scenario.
³ Specifically, our Cauchy(0, 0.707) prior says that even if the effect size isn't exactly zero, there is still only a 1.8% chance that their hypothesis that the effect size lies in the region [0.8, 0.9] is true.

2 comments:

Jeff Rouder, April 12, 2017 at 10:14 PM
Thanks for the blog post. I am going to take "=!" to be "not equals to."

For me, there are no hypotheses, only models. And models are defined as those things that make predictions about data. By predictions, I mean provide statements like Pr(Y in interval)=x.

You have 3 points for the top mode. I agree with all three. Robustness is ill defined, comparisons are among specified models, support for null is relative to stated alternative, truth is unhelpful. The unhelpfulness of truth, for me, is a universal :)

That said, I don't see how your semantic change really changes anything. It feels semantic to me. Just calling one prior information and the other part of the model doesn't really change the situation. Still against truth.

Best, Jeff

Monday, April 10, 2017
