The pathological science: Psychology, skepticism, and statistics

Thursday, October 12, 2017

Our article on Bayes factors (and estimation, and ROPEs)

Last week - while I was down in Wellington destroying my ancient knees at volleyball nationals - an article by Rasmus, Michael and I was published in Research in Human Development. The article is called "Using Bayes Factors to test hypotheses in developmental research" (No, I still don't have much of a talent for snappy article titles). The version of record is here, e-prints here for the few clickers, and postprints here.

This article has a bit of an unusual origin story. Last year, I was at the Modern Modeling Methods conference in Connecticut, and gave a talk discussing Bayes factor tests of null hypotheses, raising the problem of how they implicitly put a huge spike of prior probability on a point null (which, at the time, seemed somewhat silly to me).

Kyle Lang was in the audience for the talk, and mentioned afterwards that he was putting together a special issue of Research in Human Development about methodological issues in developmental research. He had already put together quite a detailed proposal for this special issue, including an abstract for an article on Bayes factors. He suggested that I write this article, along with whichever collaborators seemed suitable. I reached out to Michael (a fantastic colleague of mine at Massey) and Rasmus (whose work I knew through things like his brilliant Bayesian Estimation Supersedes the t-test (BEST) - online app.

So away we went, trying to write a gentle and balanced introduction to Bayes factors targeted at developmental researchers. (We use Wynn's classic study purportedly showing arithmetic abilities in babies as a running example). I would have to say our article is less a brand-new original contribution to knowledge than the type of introduction to a methodological issue one might often find in a textbook. But as an ECA and member of the precariat, I am definitely not in the habit of turning down invites to publish journal articles!

Now the small group of people who follow me on Twitter, hang out on PsychMAD, and are most likely to be reading this post are hardly likely to need a gentle introduction to Bayes factors; most of y'all run Markov chains before breakfast. And there do already exist some really great general introductions to Bayesian estimation and Bayes factors; Rouder et al. (2009), Kruschke and Liddell (2017), and Etz and Vandekerckhove (2017) are some of my favourites. Tailoring an introduction to a particular sub-discipline is obviously just a small contribution to the literature on Bayesian inference.

That said, there are a few things about our introduction to Bayes factors that are a bit unusual for the genre, and that might persuade some of you that this is a useful introduction to share:

1. No pitchforks raised at frequentists

As one must in these situations, we begin by pointing out some problems with null hypothesis significance testing that might prompt someone to want to try a Bayesian approach (e.g., a p value is a statement about uncertainty that is difficult to intuitively understand or communicate to the public; NHST doesn't deal well with optional stopping; there is an asymmetry within NHST wherein only significant results are regarded as informative; etc.). But we aren't rabid Bayesians about this; some of my favourite Twitter-friends are frequentists, after all. We acknowledge that the identified problems with hybrid NHST don't apply to every form of statistical significance testing (c.f. equivalence testing), and that we are just presenting a Bayesian response to some of these problems (the implicit message being that this is only one way of going about an analysis).

2. Making it clear what Bayes factors are good for, and what they aren't.

The role and importance of Bayes factors within the broader Bayesian church can be a little confusing. By one view, the Bayes factor is just the link between the prior odds and the posterior odds for any two models, so almost any Bayesian analysis uses a Bayes factor in some sense. By another view, running a data analysis focusing on a Bayes factor is a quaint activity undertaken only by that funny sort of Bayesian who thinks point 'nil' hypotheses are plausible.

In our article, we draw on Etz and Wagenmakers' excellent historical account of the Bayes factor, mentioning how the factor came about in part as Harold Jeffreys' solution to a particular problem with Bayesian estimation: For some choices of prior, even if a parameter takes some specific hypothesised value (a "general law"), and we collect data and estimate the value of the parameter, we may never be able to collect enough data to produce a posterior probability distribution that suggests that the parameter actually takes this hypothesised value. Jeffreys discussed the problem in relation to non-informative discrete prior distributions, but the problem is especially acute for continuous priors, where the prior probability that the parameter takes any specific point value is zero. In effect, if you want to estimate a parameter where some prior information or theory suggests that the parameter should take some specific point value - such as zero - conventional Bayesian estimation is sort of a pain: You can't just place a continuous prior on the parameter.

The Bayes factor approach solves this problem by allowing us to focus our attention on a comparison of two models: One in which a parameter is forced to take some fixed value (almost always of zero), and one in which it is not (having instead some continuous prior). Comparing a point null hypothesis to a less restricted alternate hypothesis is not the only type of match-up one can use Bayes factors to analyse, but it is the type of match-up that a Bayes factor analysis is particularly useful for examining.

In real research, sometimes this type of comparison will be relevant - i.e., sometimes there will be good reason to treat both a point null hypothesis and a less restrictive alternate as plausible - and sometimes it won't. There do exist experimental manipulations that probably have absolutely no effect on the dependent variable of interest (e.g., in Bem's ESP studies). But there also exist many contexts where exactly-zero relationships are simply implausible. For example, this is true of almost any correlational research in psychology - a correlation between two variables A and B can arise from an effect of A on B, B on A, or via a shared effect of any of a huge array of potential third variables, so it defies belief that any such correlation would be exactly zero in size.

Now obviously, Bayes factor analyses can be used to test hypotheses other than point nulls - e.g., a null interval, or in other words a hypothesis that a parameter is trivially small (rather than exactly zero) in size. But if you're testing a null interval rather than a point null, then it's easy enough to just use Bayesian estimation with a continuous prior for the parameter, and the special utility of an analysis with Bayes factors is obviated. If you aren't actually wanting to test a point null hypothesis, then as Jake Chambers would say:

Image yoinked from Pinterest.

3. Speaking of which, ROPEs...

As well as Bayes factors, we discuss the approach of Bayesian estimation with a defined Region of Practical Equivalence, as advocated by John Kruschke (e.g., here). This is a really useful approach when there is no reason to believe that a parameter is exactly zero in size, but we want to work out if it's reasonably large (or negligibly small).

4. If you do use a Bayes factor, it's a step on the path, not the endpoint.

We suggest that a researcher using a Bayes factor analysis needs to select a prior odds for the compared hypotheses, and report the posterior odds (and ideally posterior probabilities), rather than just treating the Bayes factor as the endpoint of an analysis. A Bayes factor alone is not an adequate summary of uncertainty and the risk of error: It provides neither the known long-term error probability that we might get in a frequentist analysis, nor the posterior probability of ``being wrong'' that we can achieve with a full-blown Bayesian analysis (see this earlier post for more on this). Likewise, we urge against interpreting Bayes factors directly using qualitative cut-offs such as BF above 3 = substantial evidence.

5. You can test point nulls using Bayesian estimation

Although it'll be a point that's somewhat obvious to a lot of you, we note that it's perfectly possibly to test a point hypothesis (and calculate a non-zero posterior probability for that point hypothesis) when using Bayesian estimation. You can do this simply including a `toggling' parameter that determines whether or not the main parameter of interest is fixed at a specific value, with a Bernoulli prior on this toggling parameter. Doing so admittedly requires a little more programming work, but it's hardly impossible.

By emphasising this connection between estimation and Bayes factors, hopefully the rather weird nature of Bayes factors become clear: They're a Bayesian method that implicitly place a large chunk of prior probability on a specific point value, usually of an exactly zero effect. The implicit prior probability distribution for an analysis with a Bayes factor thus looks more like the Dark Tower sitting atop a gentle hill than a non-informative prior distribution. If you don't see a parameter value of exactly zero as being remarkably and especially plausible, maybe a Bayes factor analysis isn't the right choice for you.

6. Got an app with that?

Why yes, we do! Rasmus coded up a nifty web app for us (screenshot below), which allows you to compare two means (as in a t test) while calculating the posterior odds that the mean difference is exactly zero in size (as in a Bayes factor analysis), and also the probability that the difference in means falls within some region of practical equivalence (as in a ROPE analysis). You get to define the prior probability of an exactly zero difference, the scale parameter on the Cauchy prior on effect size under H1, and the width of the ROPE.

The article isn't perfect; one thing that I realised too late in the game to change was that we strike rather too optimistic a tone about how Bayes factors perform with optional stopping. Bayes factors do retain their original interpretation with optional stopping, which jibes up well with our focus on converting Bayes factors to posterior odds and thus interpreting them, rather than using qualitative cut-offs to interpret Bayes factors directly. But when they are used in combination with simple decision rules (e.g., if the BF is greater than 10, I'll support H1), and optional stopping is utilised, inflated error rates can result. We should have included a warning about this. Instead, at this late point, I'll direct you to this blog post by Stephen Martin for more.

I hope that some of you find the article useful, or at least not too egregious an affront to your own proclivities (be they Bayesian, Bayes-factorian, or frequentist). Comments welcome!

Tuesday, June 13, 2017

A really minimal reproducible workflow using point-and-click in SPSS

Shortly after I finished my Master's research project, I decided - for some reason now lost in the sands of time - that I wanted to double-check some results I had produced in my thesis. (My project was a psychometric validation study using confirmatory factor analysis in AMOS*). I found my final data file and my model specification, re-ran the analysis with the same options selected, right down to the random seed for the bootstrapping of results, and.... all the factor loadings were very slightly different to those that I had published.

Shit. Frantic investigations ensued.

...

The reason for the discrepancy in results, it ultimately turned out, was this: The bootstrapping procedure obviously selects a random sample of rows in each bootstrapped sample, and at some point I had run my analyses with my dataset sorted in a different way than I'd found it when trying to reproduce the analyses. Hence very slightly different numbers coming out. A tiny and seemingly inconsequential aspect of my workflow had resulted in my findings becoming unreproducible.

Clearly, there was a flaw in my workflow, and the consequences could have been even worse then: When you rely on point-and-click programs and manual data manipulation in programs like Excel, SPSS and AMOS, it's damn easy to end up with a study folder that's jam-packed with datafiles that look "Data134", "Finaldata34", "FinalfinaldataEM", and "SUPERDATABASE.sav"**, screeds of output files, and no idea whatsoever how you ended up with that table of results on page 34 of your thesis.

Now in an ideal world, we'd all be writing our analyses in R, using Markdown or knitr to produce output for publication, with neatly commented scripts, our original data saved for posterity on osf, and version control of our scripts using git. (Or something along those lines).

The thing is, though, is that there are many people analysing data out there who have minimal programming skills, and I don't think that reproducible workflows should only be the province of people with those skills. For example, I teach a postgraduate multivariate data analysis course at a university where most of my students haven't had any experience with programming, and very little experience with data analysis full stop; I don't think it's reasonable to expect them all to use tools like R and github.

So I'm interested in how we people using SPSS via point-and-click can produce reproducible workflows without having to battle with programming code. The constraint of doing this via point-and-click is obviously going to result in an imperfect solution, but anything is better than a workflow consisting of 20 different versions of your datafile, 75 output files all based on different versions of the data, and published results tables based on some haphazard assortment of the above combined with a few typos. Without further ado, here is a draft workflow:

Set up a folder for your project, and set up a backup solution. This could be as simple as having the project folder within your My Dropbox folder, if you use Dropbox.
If you have to type in raw data into SPSS yourself (e.g., from surveys), do this carefully and then double-check every single data point. Ideally have someone else check all the data points again. Be absolutely sure you're happy with the raw data before doing any data analysis or processing.
Now save a "raw data.sav" copy of your dataset. Do not ever change it again - this is the key departure point for everything else.
From this point forward, do all your data processing and analysis using menu commands, but always click "Paste" instead of "Ok" once you've specified an analysis or command. This will paste the syntax for the analysis or command you've specified into a syntax file. To actually run the analysis or command, select (highlight) it in the syntax file and click the green "run selection" triangle. Crucially, you should NOT:

manually edit data in the data view (don't even sort it within the data view; use Data > Sort cases for that).
manually edit variable properties in the variable view. If you want to change variable properties (label, value labels, measure type, width etc.) use the Data > Define Variable Properties command.

You can now do whatever data processing you need to now using SPSS (e.g., reverse-coding items, imputing missing data, combining items into scale scores using Compute variable, labelling variables and levels, etc). Again, make sure you do all this via menu commands.
Keep your syntax file neat - if you run lots of analyses and click paste every time, you will end up with lots and lots of code chunks in your syntax. Delete the code if you've run analyses that you don't need (though remember not to run lots of analyses looking for "interesting" results unless your project is explicitly exploratory - stick to whatever analyses you set out in your pre-registration!)
Carefully comment your syntax file, starting with a description at the top of the dataset it applies to, the date, who wrote it, and any other crucial contextual information. Begin a comment in SPSS syntax by typing an asterisk, and end the comment by typing a full stop. Include comments explaining what the various analyses and data processing commands achieve.
As you work, you may end up with several versions of your syntax file(s) for your data analysis and processing. This is ok, but include a comment at the beginning of the syntax file explaining what version it is and what the key differences are from previous versions. Make absolutely sure it's clear what dataset is used with the syntax file. Give the files sensible sequential filenames (e.g., "analysis script version 1.sps", "analysis script version 2.sps").
When you get to the point of copying SPSS output into your analysis, add a comment next to the analysis in the syntax file (e.g., "This analysis goes in Table 3 in manuscript".) Avoid manually typing numbers into your manuscript whenever possible; copy and paste instead (if necessary into paste Excel first for rounding).
Before submitting a paper or thesis, make sure that you can take your original raw data and a syntax file (or more than one syntax file) and easily reproduce every single figure reported in the paper. Ensure you keep and back up copies of the files necessary to do this, and if at all possible post copies somewhere where others can access them (e.g., osf.io).

So that's my basic idea for an unintimidating reproducible workflow for beginner data analysis. It's imperfect but at least results in a syntax file and raw data file that should be sufficient to reproduce any published results.

Do any readers of my blog have suggestions for how this can be improved? I know my own analysis practices aren't as neat and organised as they could be, so I'm open to feedback. Ideally this could eventually be something I can pass on to my students for their own data analyses.

*I know: Ew.
**Actual example from one of my old projects.

Wednesday, May 10, 2017

On the idea that covariate imbalance lead to bias, even with randomisation

I recently came across an in-press article by Nguyen et al. (2017) with the title "Simple randomization did not protect against bias in smaller trials". The article suggests that with finite samples, covariate imbalance between groups can result in "accidental bias" (even with randomisation). The essential idea is that if we have a sample of people who differ on a number of variables that affect the dependent variables (covariates), randomising them to groups will result in the groups having similar mean values on each covariate - but not exactly the same mean values. These differences in mean covariate values will be larger for smaller samples. As such, with finite sample sizes, a difference we observe between the groups on the dependent variable might not actually be due to the manipulated variable, but actually due to pre-existing differences on the covariates.

Based on a set of simulations, Nguyen et al. conclude that N > 1000 is needed to protect against bias occurring due to covariate imbalance. Subsequently, a preprint by Gjalt-Jorn Peters used simulation studies to estimate the probability that randomisation will produce groups that are somewhat imbalanced on at least one of a set of covariates, for a given sample size; his findings suggested that this probability is high (~45%) for the studies contained in the Reproducability Project: Psychology.

Polite debate ensued on social media (e.g., here). This post is an attempt to explain:

Why some of us felt that it was invalid to say that small samples result in "bias" in this scenario
But why the claim of "bias" isn't wrong per se (it just hinges on a different definition of bias) - following a brief email conversation I had with Long Nguyen
Why I don't think that we need samples with N > 1000 - i.e., why I think the problem raised in these studies is actually dealt with by the inferential statistics we use already, and thus isn't something we really need to worry about.

Firstly, let's consider the idea of bias. To a statistician, a statistic is a biased estimator of a parameter if the expected value of the statistic is different than the true value of the parameter. Because the idea of bias refers to an expected value, it's explicitly a long-run concept. For example, imagine I am interested in estimating the average height of adult New Zealanders. If I randomly draw a sample of 30 individuals from this population and calculate their average height, I will get an estimate that's different from the true average height of the population. But if I did this again and again, and calculated the average estimate across the repeated samplings, this the average estimate would be close to the true mean... and with an infinite number of repeated samplings, it would be exactly the same as the true population mean.

Bias is different from random error: Any given estimate of a parameter will contain some error, but bias refers to a systematic tendency to over- or under-estimate the parameter.

When we randomly assign participants to conditions, the random assignment process means that there is no systematic tendency to assign participants with a higher level of some covariate to one condition rather than the other. Consequently, the expected value of each covariate is equal across conditions, and the estimated effect of the treatment is an unbiased estimator of the true treatment effect. I.e., on average across replications, the estimate of the treatment effect will not tend to systematically under- or over-estimate the true treatment effect. This is why some of us were surprised to see Nguyen et al. and Peters claim that estimates of the treatment effect are "biased" with small samples even if randomisation is used: By a conventional definition of bias, this would not be true. Estimates of the treatment effect will contain some error, and randomly occurring covariate imbalance is a cause of such error, but this isn't really bias - or not in the sense that we usually think of bias.

But this is where the definitional issue comes in: Nguyen et al. use a term called "accidental bias", which I believe is due to Efron (1971), with a more accessible discussion in Lachin (1988). Efron and Lachin's articles - though they go over my head in a lot of ways - seem to be largely concerned with assignment methods other than simple/complete randomisation - e.g., assignment methods such as biased coin designs, used to ensure equal sample size across experimental cells. Efron and Lachin both use the term accidental bias to refer to a difference between the true estimated treatment effect that occurs to imbalance on omitted covariates. Importantly, this bias has expectation of zero for simple randomisation as well as for related assignment methods (permuted block, biased coin, etc.); in other words, it isn't bias at all in the conventional sense. However, the variance of the accidental bias does differ for these different random-like assignment methods: In other words, some of the assignment methods they considered tend to be more susceptible to (random) inaccuracy in their estimates due to covariate imbalance than others.

Crucially, "accidental bias" is a very different form of bias: It is not a subtype of "bias" in the sense used above, but something else entirely (which we would often consider simply as a source of random error). Distinguishing error in estimation arising from covariate imbalance from other sources of error in estimation was something that was relevant for Efron and Lachin when considering the properties of different random-like assignment methods, but perhaps not so relevant outside of this specific methodological focus. So Nguyen et al. and Peters aren't really wrong to say that estimates of treatment effects can be biased with small samples with randomisation; it's just that this claim hinges on a somewhat unusual definition of bias. This is something both papers could make clearer, I think.

But what if we set the semantic issues aside? Is it not the case that random assignment can sometimes leave us with sub-samples that are drastically imbalanced on some covariate, and that this might result in a crappy inference about the treatment effect (regardless of whether we think of the cause as "bias" or "error")? Well, yes. And if we concluded that a treatment definitely has some effect size simply because that was the difference in our sample, this would be a major problem.

The thing is, though, is that isn't how causal inferences are actually done: In reality, we will always use some kind of statistical inference (whether via p values, confidence intervals, Bayes factors, Bayesian estimation, whatever). It's important to recognise here that inferential statistics have a crucial role in causal inference, even if we're only making an inference about the effect that a manipulation had within our sample itself: It is not the case that the estimated effect in the sample perfectly captures the true effect in that sample, and that we then use inferential statistics to generalise beyond the sample. Rather, we need inferential statistics even to make inferences about effects within the sample.

As it happens, virtually any inferential statistic we are likely to use to estimate a treatment effect will include a random error term - i.e., the model will take into account variability in the dependent variable that is not attributable to the independent variable. And randomly-occurring imbalance on covariates is just another source of random error. As such, the inferential statistics we use to estimate treatment effects don't assume a complete lack of covariate imbalance: Rather, they assume only that the expected value of the random error term is zero for each observational unit. This assumption would be breached if the assignment process was such that it systematically tended to produce different covariate levels in different groups, but that isn't the case with randomisation.

Let's look at how this play out in practice, in a case where there is a highly relevant omitted covariate (gender in a study of the effect of a diet), randomisation is used, sample size is small, and the true effect size is zero: Does "accidental bias" increase our error rate?

In other words, in this scenario of a truly zero effect size, where there is an omitted and highly relevant covariate, and a tiny sample size, our Type 1 error rate remains at 5%. If we conclude that there is a true causal effect when p < 0.05, we will be in error sometimes - and covariate imbalance will be to blame for some of those errors - but the rate of such errors is exactly as expected given the alpha level.

How about power? Let's imagine a scenario where we have a bigger sample, and the treatment does work, so we expect good power:

Yes - power remains as advertised despite the omitted covariate, so our error rate isn't harmed. Of course, if there was greater variability in post-test scores than we specified in our power calculation - perhaps as a result of an omitted covariate - then our estimate of power for a given unstandardised effect size would be incorrect. But in practice we typically scale our hypothesised effect size relative to the post-test variability anyway, so this isn't really a worry.

The upshot of all this is that, sometimes by sheer bad luck, we will end up with groups that are quite imbalanced with respect to some important covariate - even if we use random assignment. And this may, occasionally, cause us to make an inferential error (i.e., an error in our inference about whether a causal effect exists). However, given that we use inferential statistics to make such causal inferences, we can control the probability of making an error. Our existing inferential tools already allow us to do this: They implicitly take into account the possibility of covariate imbalance.

Do we need N of 1000 or more for a clinical trial, as Nguyen suggests? No. The N > 1000 rule isn't based on a concern of direct relevance to inference (e.g., to achieve some specific level of power, or to achieve some specific level of precision of confidence intervals, or to ensure some maximum discrepancy between the true and estimated effect sizes). Rather, the rule is intended to minimise to a very tiny magnitude one source of variability in the estimated treatment effect, while ignoring other sources of variability. This doesn't seem like a good basis for sample size determination.

If we want to keep our Type 2 error rate to some very small rate, or ensure our confidence intervals reach some specific level of precision, we already have the inferential tools necessary to work out what sample size we require. Sometimes N > 1000 may be necessary to achieve some specific power or precision goal, and sometimes a much smaller sample size will do.

Monday, April 10, 2017

Separating model from hypothesis in the Bayes factor test

Premise

When using statistical analyses, we will often test a statistical model that has one or more parts that we regard as forming an hypothesis, and other parts that we see as being auxiliary assumptions (or as representing auxiliary information). Consider, for example, the one-sample t test, which tests a model where the data generating mechanism is N(0, σ²). Here we typically treat the part of the model that says that the mean is zero as the (null) hypothesis, whereas we treat the part of the model that says that the observations are normal, independent and identically distributed with variance σ² as just representing auxiliary assumptions - stuff we don't really care about, but assume so that we can run the test. This division is useful, but is not intrinsic to the test: We could just as well treat these apparently auxiliary parts of the model as being substantive parts of the hypothesis we are trying to test. How might this differentiation into hypothesis and auxiliary "stuff" play out in the context of Bayes factors?

Bayes factor tests

A Bayes factor is the ratio of the probability of a set of observed data given one model to the probability of the data given another model. Bayes factors can be used for many purposes, but a popular current application is as a replacement for null hypothesis significance tests.

Currently, researchers using Bayes factors typically don't really differentiate the hypotheses they're testing from the statistical models they've specified. Instead the terms "hypothesis" and "model" are used almost interchangeably.

Treating the models as the hypotheses

Let's take a simple paired mean difference test. Here a default Bayes factor analysis as implemented in JASP would compare two models for the paired differences Y:

M₀: Y ~ N(0, σ²)
M₁: Y ~ N(σδ, σ²); δ ~ Cauchy(0, 0.707).

In short, the null model M₀says the paired differences are all drawn from a population of normally distributed observations that has fixed zero mean and some fixed unknown variance¹. The alternative model M₁ again says that the observations are drawn from a normally distributed population with fixed unknown variance, but here the standardised mean difference from zero is itself drawn from a Cauchy distribution with location parameter of 0 and scale parameter 0.707 (i.e., the "prior" on effect size").

We could interpret this model as simply testing these two models (and perhaps even call them hypotheses). We could then calculate the Bayes factor, use that along with our prior odds to calculate the posterior odds showing the relative credibility of the two models, and thereby determine which model is more likely to be true. Right?

Actually, I think there are some real problems with this lack of differentiation of model and hypothesis. Specifically:

Here, changing the prior on effect size under M₁changes the models being compared. As such, the prior selection under M₁ determines what question the analysis asks.

Relatedly, it would make little sense to test the "robustness" of this analysis to different choices of prior on effect size under M₁; different prior specifications imply completely different models being compared.

The models compared are just two of an infinite number of models we might propose in relation to the paired difference (and particularly the standardised effect size). We have taken no position about the prior probability of other hypotheses: If a theorist comes up and says they have a hypothesis that δ lies specifically in the region [0.8, 0.9], we'd have to be agnostic and simply say that our analysis hasn't considered that hypothesis. Consequently:

While we can multiply the Bayes factor by the prior odds to calculate the posterior odds indicating how many times more probable one model is than the other, we can't calculate a posterior probability that a particular model is true: We have not taken into account the possibility that another model might be much more probable than either of those compared.
We can't really say that the findings "support" the null model (even if the posterior odds are in its favour): The findings might suggest that the data is more plausible under the null than this alternate model, but what about all the other potential alternate models?

The notion of the alternate model being "true" is not particularly meaningful. The alternate model places a continuous prior probability distribution on the standardised mean difference δ, with a zero prior probability of the parameter taking any fixed point effect size. As such, the alternate model is technically false if the true parameter value takes any fixed value (whether zero or non-zero)². The idea of testing a model known to be false seems to incompatible with the intended role of Bayes factor tests in psychology (i.e., as tools for hypothesis testing).

Distinguishing the models from the hypotheses

I believe that some of the above problems can be resolved if we differentiate the statistical models specified in a Bayes factor test from the hypotheses being tested. Specifically, I would suggest that it is possibly to view a non-directional Bayes factor test of a point null hypothesis as comparing two hypotheses:

H₀: δ = 0
H₁: δ =! 0

With the auxiliary assumptions that in each model the error terms are independently, identically and normally distributed with fixed but unknown variance.

Now here we cannot calculate the likelihood of the data under H₁without adding more information, so we add an additional piece of information:

Prior information: If δ =! 0, δ ~ P(δ)

In other words, our null hypothesis is that the effect size is zero, and our alternative hypothesis is simply that the effect size is non-zero (or perhaps that it lies in a particular direction). In addition to this specification of our hypotheses, we add the extra prior information that if the true effect size is non-zero, the credibility of other values is distributed according to the prior probability distribution P(δ). For example, when using the defaults in JASP, P(δ) might be Cauchy(0, 0.707). Note that this prior probability distribution is not part of the alternative hypothesis H₁: It is extra information we use to allow us to test H₁.

In this view:

Changing the prior on effect size under H₁could change the results, but won't change the essential question being asked. Consequently, it's perfectly reasonable to test the robustness of the results to alternative priors on the effect size.
If H₁ is non-directional, the hypotheses compared exhaust all possibilities regarding the standardised effect size. (It must be either zero or not zero).

It is therefore sensible (and probably appropriate) to convert the resulting Bayes factor not only to a posterior odds, but to a posterior probability that H₁ is true. This posterior probability will, of course, be conditional on the prior odds and the prior on effect size.
It is therefore also perfectly appropriate to say that the findings support a particular hypothesis (if the resulting posterior probability is in its favour).

There is no meaningfulness problem: H₁ is true if and only if the true value of the standardised effect size is not zero.

In effect, this alternative view implies taking on a stronger prior commitment: We are specifying a prior in the conventional Bayesian sense, where we specifically assume that some parameter regions are more probable than others. We are not being agnostic about other possibilities: If we know that some theorist has suggested that the true effect size δ is in the region [0.8, 0.9], and we have specified a prior on effect size of Cauchy (0, 0.707), then our prior implies that we think the theorist is full of rubbish.³

Now skeptical readers can always say "Hey - your findings are conditional on your priors, but I don't believe in the same priors!" But that's the case for any Bayesian analysis, so I don't see that as really presenting a problem that is specific to this approach.

TL;DR If we differentiate the hypotheses to be tested from the statistical models compared in a Bayes factor test, the interpretation of the findings becomes a lot more straightforward.

¹ I've ignore the priors on the variance parameters for the sake of brevity since they're common to both models.
² The only scenario in which the alternate model would be "true" is if the underlying true causal effect is literally randomly drawn anew from a Cauchy (0, 0.707) distribution every time the experiment is replicated - a rather implausible scenario.
³Specifically, our Cauchy (0, 0.707) prior says that even if the effect size isn't exactly zero, there is still only a 1.8% chance that their hypothesis that the effect size lies in the region [0.8, 0.9] is true.

Tuesday, February 14, 2017

Bayesian analysis (but not pure Bayes factor analysis) allows clear reporting of error probabilities

I noticed this exchange on Twitter last week. Daniel Lakens posted a screenshot from the reviewer guidelines of Comprehensive Results in Social Psychology. The guidelines promote Bayes factor analysis, and in particular accepting a hypothesis when the Bayes factor in favour of that hypothesis is greater than 3.

Deborah Mayo responded criticising this approach for lacking error control:

I’ve noticed variations on this argument – that Bayes factor analyses lack error control - floating around a fair bit over the last while. Being a Bayesian myself (mostly), and being concerned about making errors in inference, I figured it might be useful to sort through my thoughts on this issue and post some relevant points.

Point 1: Frequentist significance tests involve decision rules that control error rates; Bayesians report the probability of error
Significance tests - or at least most types of them - involve a decision rule set up in such a way that it will control the rate of some type of error. Almost always, the decision rule will be set up to keep the probability of a Type 1 error (a false rejection of the null hypothesis) to 5%. The decision rule is typically intended to be a rigid and intrinsic part of the analysis: We reject the null hypothesis if p < .05, and do not reject it if not.

On the other hand, in a Bayesian analysis, we don't necessarily have to apply some firm decision rule. A Bayesian analysis is capable of providing a statement like "Given the priors specified and data observed, there is an 85% probability that this hypothesis is correct". (A statement which implies, in turn, a 15% chance of being in error if one was to decide in favour of the hypothesis). A researcher reporting a Bayesian conclusion like the one above could leave simply it at that; they could apply a decision rule (e.g., posterior probability > 90% is required to support a hypothesis), but they don't have to. Therefore, because a significance test is essentially a decision rule while Bayes theorem isn't, it is probably best to say that a frequentist analysis seeks to control error rates, while a Bayesian analysis will communicate the error probability (i.e., the probability that a particular conclusion would be incorrect). Both of these approaches seem to be reasonable alternatives to me. But what are we assuming when we report error probabilities in this Bayesian way?

Point 2: Frequentist and Bayesian approaches are both concerned with the probability of error, but they condition on different things

Frequentist hypothesis tests are typically aimed at controlling the rate of error, conditional on a particular hypothesis being true. As mentioned above, a statistical significance test for a parameter is almost always set up to such that conditional on the null hypothesis being true, we will erroneously reject it only 5% of the tim.

The situation in a Bayesian analysis is a bit different. A Bayesian test of a hypothesis produces a posterior probability for that hypothesis. We can subtract this posterior probability from 1 to get an error probability. But this error probability isn't the probability of error, conditional on a particular hypothesis being true; it's the probability that a particular conclusion is incorrect, conditional on the priors specified and data observed.

So we have two types of error probabilities: The probability of error conditional on a hypothesis, and the probability of error conditional on the prior and observed data. Both of these are perfectly reasonable things to care about. I don’t think it’s reasonable for anyone to dictate that researchers should care about only one of these.

For me personally, once I’ve finished collecting data, run an analysis, and reached a conclusion, I’m most interested in the probability that my conclusion is wrong (which is part of the reason why I’m keen on Bayesian analysis). But this doesn’t mean I’m completely disinterested in error rates conditional on particular hypotheses being true: This is something I might take into account when planning a study (e.g., in terms of sample size planning). I’m also broadly interested in how different types of Bayesian analyses perform in terms of error rates conditional on hypotheses. Which leads me to the next point...

Point 3: An analysis may focus on one type of error probability, but we can still study how it performs in terms the other.

A frequentist significance test will typically be set up specifically to control the rate of falsely rejecting the null, conditional on the null hypothesis being true, to 5% (or some other alpha level). But this doesn’t mean we can’t work out the probability that a specific conclusion is incorrect, conditional on the data observed (i.e., the Bayesian type of error probability). Ioannidis’ paper claiming that most published research findings are false deals with exactly this problem (albeit based on some shaky assumptions): He calculated the probability that an effect actually exists, conditional on a significant test statistic, for a range of priors and power scenarios. There is admittedly a bit of a conceptual problem here with trying to draw Bayesian conclusions from a frequentist test, but the mathematics are fairly straightforward.

Similarly, while a Bayesian analysis is set up to tell us the probability that a particular conclusion is in error, given the priors set and data observed, there’s no reason why we can’t work out the rate at which a Bayesian analysis with a particular decision rule will result in errors, conditional on a particular hypothesis being true. This previous blog of mine is concerned with this kind of problem, as are these better ones by Daniel Lakens, Alex Etz and Tim van der Zee.

The upshot is that while a particular type of analysis (frequentist or Bayesian) may be tailored toward focusing on a particular type of error probability, we can still use a bit of mathematics or simulation to work out how it performs in terms of the other type. So choosing to focus on controlling one type of error probability hardly implies that a researcher is disregarding the other type of error probability entirely.

Point 4: The Bayesian error probability depends on the prior

A word of caution about Bayesian error probabilities: Given its dependence on a prior, the type of inference and error rate I get with a Bayesian analysis is useful primarily provided I can set a prior that’s relevant to my readership. If I set a prior where most of the readers of my study can say “yes, that seems a reasonable description of what we knew in advance”, and I draw a conclusion, then the posterior probability distribution tells them something useful about the probability of my conclusion being in error. But if they think my prior is utterly unreasonable, then I haven’t given the reader direct and useful information about the chance that my conclusion is wrong.

Point 5: Back to that editorial recommendation of support H₁ if BF₁₀>3

Finally I want to reflect on how error probabilities are dealt with in a very specific type of analysis, which is a Bayes factor analysis using a default prior for the H1 model, and a decision rule based solely on the Bayes factor rather than a posterior probability distribution. This is more or less the type of analysis recommended by the “Comprehensive Results in Social Psychology” editorial policy that started off this whole discussion.

Now in this case, the analysis is not set up to control the rate of error, conditional on a particular hypothesis being true, to a specific value (as in a frequentist analysis). However, because it doesn’t involve a commitment to a particular prior probability that the null hypothesis is correct, it also doesn’t directly tell us the probability that a particular conclusion is false, conditional on the data (the Bayesian type of error probability).

This latter problem is reasonably easily remedied: The author can specify prior probabilities for the null and alternate, and easily calculate the posterior probability that each hypothesis is correct (provided these prior probabilities sum to 1):

Posterior odds = P(H1)/P(H0) * BF_H1

P(H1|Data) = Posterior odds/(1+posterior odds)

For example, say we're doing a t-test and take an uber-default assumption of P(H0) = P(H1) = 0.5, with a Cauchy (0, 0.707) on standardised effect size under H1 (the JASP default, last time I checked). We then observe a BF₁₀ of 3. This implies a posterior probability of 3/4 = 75% in favour of H1. In other words, if you observed such a Bayes factor and concluded in favour of the alternate, there'd be a 25% chance you'd be in error (conditional on the data and prior specification).

That said, this error probability is only informative to the extent that the priors are reasonable. The prior used here takes a spike-and-slab prior form which implies that there's a 50% chance the effect size is exactly zero, yet a substantial chance that it's extremely large in size (e.g., a 15% prior probability that the absolute true effect size is greater than 3). A reader may well see this prior probability distribution as an unreasonable description of what is known about a given parameter in a given research setting.

So I think it is crucial that authors using Bayes factor analyses actually report posterior probabilities for some reasonable and carefully selected priors on H1 and H0 (or even a range of priors). If this isn’t done, the reader isn’t provided with direct information about either type of error probability. Of course, a savvy reader could assume some priors and work both types of error probability out. But I think we have an obligation to not leave that entire process to the reader. We should at least give clear information about one of the types of error rates applicable to our study, if not both. So I disagree with the reviewer guidelines that provoked this post: We shouldn't make statistical inferences from Bayes factors alone. (If you use Bayes factors, make the inference based on the posterior probability you calculate using the Bayes factor).

Secondly, the priors we set on H0 and H1 (and the prior on effect size under H1) in a Bayes factor analysis should be selected carefully such that they're a plausible description of what was known ahead of the study. Otherwise, we'll get a posterior probability distribution and some information about the probability of error, but this probability of error will be conditional on assumptions that aren't relevant to the research question nor to the reader.

Monday, January 23, 2017

Lots of problems with the Reliable Change Index

In a previous post, I talked about the Reliable Change Index (RCI: Jacobson & Truax, 1991), and a rather unusual misconception about its implications.

Today, as I promised in that earlier post, I want to talk about the RCI a bit more broadly, and discuss some problems with its use. (TL;DR: The RCI doesn't test the hypothesis we're interested in, and has extremely poor statistical power for testing the hypothesis that it does).

Introduction to the RCI

For those who haven't heard of it, the RCI is something that's popular amongst clinical psychologists conducting single case experimental designs. Basically the idea is:

We obtain a pre score on the measure of interest for our lone participant, administer an intervention, then obtain a post score, and calculate the observed difference between the two. (I.e., how much did the participant improve?)
Using reliability and variability estimates from a psychometric validation study of the measure (or whatever is available), we calculate the standard error of measurement of a set of scores: SEM = SD*sqrt(1 - reliability)
Using the SEM, we calculate the standard error of measurement of the difference between pre and post scores: Sdiff = sqrt(2*SEM^2)
We compare the observed improvement in the participant's score to the critical ratio 1.96*Sdiff; if it is greater than this critical ratio, the improvement is "reliable"/statistically significant.

Now collecting large samples of data for trials of psychotherapy interventions is a pretty expensive and painful process, so single case experimental designs are still reasonably popular in clinical psych (well, at least here in New Zealand they are). And to be able to do a single case experimental design and yet still run a proper objective inferential statistic has its appeal - it feels much more objective than all that business of just qualitatively interpreting plots of single case data, doesn't it?

What the RCI does and doesn't test

However, while the RCI is indeed an inferential statistic, I'm not sure that it allows you to make an inference about anything you really care about.

Consider, first of all, the null hypothesis that the RCI is testing. The null hypothesis is that the participant's true score at time 1 was the same as her true score at time 2. What is a true score on a measure? Well, a true score is a concept from classical test theory. Specifically, a true score is the average score the participant would get if we could administer the measure to the participant an infinite number of times, with each administration being independent of the others (i.e., we'd have to wipe their memories after each administration), and without their actual level of the measured attribute changing (so imagine a counterfactual world where we could keep administering the measure repeatedly at exactly the same point in time). Importantly, the true score is not the participant's actual level of the attribute we're trying to measure.

So the true score is an abstract and rather strange concept; I'm not sure we really care about whether a participant's true score has changed after the intervention. Rather, the key question is: Did the intervention cause a change in the person's actual level of the attribute of interest?

Now the RCI does actually relate to that causal question, in the sense that it helps to rule out one specific alternative explanation for an observed difference in scores: It helps us rule out the possibility that a difference we've seen is purely due to random measurement error.

But the problem is that it doesn't help us rule out any much more pressing alternative explanations (other than an actual causal effect of the intervention). These alternative explanations include:

Maturation: The change in scores could be due to some natural time-related process (e.g., aging)
History: The change in scores could be due to some external event not caused by the intervention
Regression to the mean: The change in scores could be due to the selection of a participant with an extreme observed level of the variable of interest, who in probability would receive a less extreme score on re-testing (regardless of intervention)
Testing: The sheer act of taking the pre-test measurement could have the participant
Instrumentation: There could have been a change in the measurement method in the intervention period (related to the idea of systematic measurement error, which the RCI does not address)

All of these threats to internal validity are just as pressing as the threat of random measurement error, but the RCI doesn't address any of them. In fact, the RCI does not even address the issue of whether or not there is some kind of enduring change to explain at all. If the actual attribute of interest (e.g., depression, anxiety) fluctuates from day to day (as distinct from fluctuation due to measurement error), there might be a difference in true scores between two measurement occasions, but without any enduring change whatsoever in the person's typical level of the attribute of interest.

Power problems

Unfortunately, even if we accept that the RCI doesn't really tell us what we are most interested in knowing, and we're willing to accept making inferences only about differences in true scores rather than causal effects, there is still a major lurking problem. That problem is that in realistic conditions, the RCI has absolutely abysmal statistical power. A plot showing statistical power relative to effect size for measures of various levels of reliability is shown in the figure below.

As you can see, the picture is very worrying: You can achieve 80% power only if the effect size is large and the reliability of your measure stupendous (e.g., effect size d of around 0.8 and reliability of 0.95). In more realistic circumstances - say a medium effect size of d = 0.5 and reliability of 0.8 - your power is well under 30%. And with measures of lower reliability and smaller effects, the statistical power of the test is barely higher than the Type 1 error rate. In these circumstances, even if by some magical coincidence we do come across a significant result, we shouldn't place any trust in it: the lower the power, the lower the probability that a significant finding reflects a true effect (see Ioannidis, 2005).

Why is the power of the RCI so low? Well there's no getting away from the fact that in a single case study, N = 1. By deliberately focusing only on ruling out measurement error variation (rather than all extraneous variation) we increase the power of the test a little, but the reality is that two measurements from a single person simply can't allow us to make precise inferences.

Assumptions

And we can go further down the rabbit hole still: There are three more specific distributional assumptions that are problematic in regard to the RCI:

The RCI may be biased if it is based on an estimate of reliability whose distributional assumptions were not met (which is likely to be the case if it is based on a Cronbach's alpha estimate)
The RCI is based on the assumption that the measurement error variance is constant across people, and specifically that the quantity of measurement error is the same for your participant as it is in the sample you got your reliability estimate from. If, as could very well be the case, measurement error is higher or lower for your participant than that estimate, then the RCI will give biased inferences.
The RCI is based on the assumption that the measurement error for your participant is normally distributed over testing occasions. If this (untestable) assumption does not hold, the fact that N = 1 means that there is no protection to be gained from the central limit theorem. Consequently small departures from normality could distort Type 1 error rates.

So what to do?

While I totally respect the intentions of those trying to use the RCI to add rigour to single case experimental designs, I'm not sure that the RCI is worth reporting. It doesn't really address the real question of interest in a single case experimental design, and its poor power means that it's likely to produce untrustworthy inferences.

But what to report instead? If you're wedded to a single case design, I think a couple of strategies that might help are:

Collect lots of data points. If we can see how variable an individual participant's scores are within the baseline and intervention periods, this gives a much more direct (albeit subjective) impression of whether random fluctuation in scores could account for an observed difference between baseline and treatment. And with a very large number of data points collected at regular intervals you may even be able to use methods designed for time series data (e.g., change point regression) to detect a change in behaviour at the time of the intervention.
If you can, use a withdrawal design (e.g., ABAB). Hell, if you can go back and forth between treatment and baseline a bunch of times at random, you could even run a perfectly conventional significance test to compare the two conditions. Unfortunately I don't think this is a feasible option for the types of intervention most psychologists are interested in nowadays though (you can't "withdraw" teaching the participant how to practice mindfulness, say).

But at the end of the day, I think there's no getting around the fact that making quantitative claims on the basis of data from a single participant is a pretty risky process. If you're trialling an intervention with one of the H.M.s of the world - someone whose condition is so rare you just can't possibly get a decently sized sample even in principle - then stick with a single case design and make do. But if the reality is that you just want to trial a psychotherapy with people who have a reasonably common condition - and you're using a single case design only because running a large sample RCT is beyond your current means - then I think it's worth re-thinking things more completely and considering whether another research topic might be more deserving of your attention.

Appendix: R code for plot

Thursday, November 10, 2016

For a Type 1 error rate of 5%, accept H1 if BF10 > 1 (wait, what?)

So I was looking at Daniel Lakens' fascinating recent blog post The Dance of the Bayes Factors, where he shows how Bayes factors are subject to a pretty substantial degree of sampling variability. When used as a basis for a decision about whether a null hypothesis is true or false, they can also (of course) result in errors. For example, Daniel shows that for a two-sample comparison of means, with n = 75 in each group and a true effect size of Cohen's δ = .3, quite high rates of Type 2 errors can result. In particular, Daniel notes that we will:

Falsely support the null 25% of the time if we set a cutoff of BF10 < 1/3 for supporting the null
Falsely support the null 12.5% of the time if we set a stricter cutoff of BF10 < 1/10 for supporting the null

I wondered: What kind of Type 2 error rate would result if we set our evidence cut-offs so as to ensure a Type 1 error rate of 5%? To figure this out, I simulated some data where the null is true, and where we again have n of 75 in each group, and see what cut-off would preserve this Type 1 error rate:

require(BayesFactor)

set.seed(123)
get.BF = function(cohensd, n){
  x<-rnorm(n = n, mean = 0, sd = 1) #produce N simulated participants
  y<-rnorm(n = n, mean = cohensd, sd = 1) #produce N simulated participants
  z<-t.test(x,y) #perform the t-test
  BF10<-exp(ttest.tstat(z$statistic,n,n,rscale=sqrt(2)/2)$bf)
  BF10
}
simsH0 = replicate(10000, get.BF(cohensd = 0, n = 75))
quantile(simsH0, probs = 0.95)

##       95% 
## 0.9974437

Wait… what? It turns out that even if we accept any Bayes factor greater than 1 as a basis to support H1, we will only reject H0 falsely 5% of the time! That seems bizarre - a Bayes factor of 1, after all, means that the data is just as consistent with H0 as it is with H1! (Note: The cut-off for a 5% error rate differs according to the H1 prior specified and the sample size - don't take the title of this post too literally!)

How about if we used stricter cut-offs of BF10 > 3 (“positive” evidence according to Kass and Raftery, 1995), or > 20 (“strong” evidence)?

sum(simsH0 > 3)/length(simsH0)

## [1] 0.0114

sum(simsH0 > 20)/length(simsH0)

## [1] 0.0015

It turns out that if we use the BF10 > 3 criterion (which still seems very liberal), our Type 1 error rate is only about 1%. And it we use a rule of BF10 > 20 as a cut-off for supporting H1, our Type 1 error rate becomes absolutely miniscule - about 0.15%.

Weird. But let's follow this exercise to its conclusion… what Type 2 error rate results if we the BF10 > 1 rule to support H1? Well, presuming Daniel's original effect size of δ = 0.3:

simsH1 = replicate(10000, get.BF(cohensd = 0.3, n = 75))
sum(simsH1 > 1)/length(simsH1) #We support H1 46% of the time

## [1] 0.4621

power.t.test (75, 0.3) #Which is virtually identical to the power of a frequentist t-test

## 
##      Two-sample t test power calculation 
## 
##               n = 75
##           delta = 0.3
##              sd = 1
##       sig.level = 0.05
##           power = 0.4463964
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

So, on one level, we have a reassuring result: It is possible to set a Bayes factor cut-off for “decisiveness” that gives us basically the same Type 1 and Type 2 error rates as a conventional significance test. Which makes sense: The Bayes factor is just a monotonic transformation of the p value, after all.

But on the other hand… a lot of this doesn't make a lot of intuitive sense. How can we maintain a low Type 1 error rate with such a liberal cut-off for supporting H1? And why does the same incredibly liberal cut-off not result in particularly good power for supporting H1 when the true effect size is small but definitely not zero?

The answer to this lies in thinking about the models we're actually comparing here. We might think of this test as comparing an hypothesis of an exactly zero effect (H0) to one of a non-zero effect (H1), but that isn't actually what we're doing. The H1 model being tested is that the true effect size δ is randomly drawn from a Cauchy distribution with location = 0 and scale = 0.707.

Now when the true effect size δ is zero, the majority of sample effect sizes we see are indeed near zero. And the null (H0) model tends to fit the resulting sample estimates quite well, as it should (since it's the true model). On the other hand, the H1 model of Cauchy (0, 0.707) is quite spread out: It implies a 50% chance of an effect size |δ| > 0.707. Correspondingly it suggests that sample effect sizes very near zero are quite improbable. So when H0 is true, the H1 model tends not to fit the data well, and we will accidentally support it only very rarely.

But what's the story when the true effect size is δ = 0.3? Why is the Bayes factor test still so shy to support the H1 model, even if the null hypothesis is actually false, and even if we take the slightest sliver of evidence (BF10 > 1) as enough to say H1 is true?

Well, first of all, in this situation neither the H0 model nor the H1 model is true. The H1 model suggests that the true effect size is randomly drawn from a Cauchy distribution, and places zero prior probability on the true effect size taking any specific point value. But here the true effect size is a fixed, small, value. So both models are technically false, but either could sometimes fit the resulting data quite well. And, as it happens, because the H1 model of Cauchy (0, 0.707) suggests that a small true effect size isn't very probable, it tends not to fit very well to data simulated according to a fixed, small, effect size. On the other hand, the null model often fits reasonably well with such data. Hence in this situation we tend to be much more likely to support H0 than H1, even though H0 is false.

In one sense, the result I come up with above is just a statistical curiousity, and one that probably won't be very surprising to people very familiar with Bayes factors. But I think there are some simple philosophy-of-science considerations to keep in mind here that might be of interest to some.

Basically, when we use a Bayes factor test of a null hypothesis test, the H1 model we specify is a falsifiable (but literally false) model that we use as a stand-in for the possibly-true (but unfalsifiable) substantive hypothesis: That the true effect size simply isn't zero. That isn't a fault of the test: No statistical test can make up for the fact that psychological hypotheses are often too vaguely specified to be directly tested without being reformulated. But we need to be wary of over-interpreted a Bayes factor result that supports the null hypothesis: Such a result implies that H0 fits the data better than the specific H1 model we've specified… but it doesn't necessarily mean the null hypothesis is true.