Monday, January 23, 2017

Lots of problems with the Reliable Change Index

In a previous post, I talked about the Reliable Change Index (RCI: Jacobson & Truax, 1991), and a rather unusual misconception about its implications.

Today, as I promised in that earlier post, I want to talk about the RCI a bit more broadly, and discuss some problems with its use. (TL;DR: The RCI doesn't test the hypothesis we're interested in, and has extremely poor statistical power for testing the hypothesis that it does).

Introduction to the RCI

For those who haven't heard of it, the RCI is something that's popular amongst clinical psychologists conducting single case experimental designs. Basically the idea is:
  • We obtain a pre score on the measure of interest for our lone participant, administer an intervention, then obtain a post score, and calculate the observed difference between the two. (I.e., how much did the participant improve?)
  • Using reliability and variability estimates from a psychometric validation study of the measure (or whatever is available), we calculate the standard error of measurement of a set of scores: SEM = SD*sqrt(1 - reliability)
  • Using the SEM, we calculate the standard error of measurement of the difference between pre and post scores: Sdiff = sqrt(2*SEM^2)
  • We compare the observed improvement in the participant's score to the critical ratio 1.96*Sdiff; if it is greater than this critical ratio, the improvement is "reliable"/statistically significant.
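To make the steps above concrete, here is a minimal sketch of the calculation in Python. The function names and the example values (SD = 10, reliability = 0.80) are made up for illustration, and taking the absolute value of the difference (i.e., a two-sided test) is my assumption rather than part of the recipe above.

```python
import math

def rci_threshold(sd, reliability, z_crit=1.96):
    """Return (SEM, Sdiff, critical difference) following the steps above."""
    sem = sd * math.sqrt(1 - reliability)   # standard error of measurement
    s_diff = math.sqrt(2 * sem ** 2)        # standard error of the difference
    return sem, s_diff, z_crit * s_diff

def is_reliable_change(pre, post, sd, reliability, z_crit=1.96):
    """True if the absolute pre-post difference exceeds z_crit * Sdiff.

    Using abs() makes this a two-sided check, which is an assumption here.
    """
    _, _, critical = rci_threshold(sd, reliability, z_crit)
    return abs(post - pre) > critical

# Hypothetical values: SD and reliability as if taken from a validation study
sem, s_diff, critical = rci_threshold(sd=10.0, reliability=0.80)
print(round(sem, 2), round(s_diff, 2), round(critical, 2))
print(is_reliable_change(pre=30, post=15, sd=10.0, reliability=0.80))
```

Note how large the critical difference is relative to the SD even with a reliability of 0.80; that foreshadows the power problem discussed further down.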
Now collecting large samples of data for trials of psychotherapy interventions is a pretty expensive and painful process, so single case experimental designs are still reasonably popular in clinical psych (well, at least here in New Zealand they are). And to be able to do a single case experimental design and yet still run a proper objective inferential statistic has its appeal - it feels much more objective than all that business of just qualitatively interpreting plots of single case data, doesn't it?

What the RCI does and doesn't test

However, while the RCI is indeed an inferential statistic, I'm not sure that it allows you to make an inference about anything you really care about.

Consider, first of all, the null hypothesis that the RCI is testing. The null hypothesis is that the participant's true score at time 1 was the same as her true score at time 2. What is a true score on a measure? Well, a true score is a concept from classical test theory. Specifically, a true score is the average score the participant would get if we could administer the measure to the participant an infinite number of times, with each administration being independent of the others (i.e., we'd have to wipe their memories after each administration), and without their actual level of the measured attribute changing (so imagine a counterfactual world where we could keep administering the measure repeatedly at exactly the same point in time). Importantly, the true score is not the participant's actual level of the attribute we're trying to measure.

So the true score is an abstract and rather strange concept; I'm not sure we really care about whether a participant's true score has changed after the intervention. Rather, the key question is: Did the intervention cause a change in the person's actual level of the attribute of interest?

Now the RCI does actually relate to that causal question, in the sense that it helps to rule out one specific alternative explanation for an observed difference in scores: It helps us rule out the possibility that a difference we've seen is purely due to random measurement error.

But the problem is that it doesn't help us rule out any of the other, much more pressing, alternative explanations for a change in scores (that is, explanations other than an actual causal effect of the intervention). These alternative explanations include:
  • Maturation: The change in scores could be due to some natural time-related process (e.g., aging)
  • History: The change in scores could be due to some external event not caused by the intervention
  • Regression to the mean: The change in scores could be due to the selection of a participant with an extreme observed level of the variable of interest, who in all probability would receive a less extreme score on re-testing (regardless of intervention)
  • Testing: The sheer act of taking the pre-test measurement could have affected the participant's score at post-test
  • Instrumentation: There could have been a change in the measurement method in the intervention period (related to the idea of systematic measurement error, which the RCI does not address)
All of these threats to internal validity are just as pressing as the threat of random measurement error, but the RCI doesn't address any of them. In fact, the RCI does not even address the issue of whether or not there is some kind of enduring change to explain at all. If the actual attribute of interest (e.g., depression, anxiety) fluctuates from day to day (as distinct from fluctuation due to measurement error), there might be a difference in true scores between two measurement occasions, but without any enduring change whatsoever in the person's typical level of the attribute of interest.

Power problems

Unfortunately, even if we accept that the RCI doesn't really tell us what we are most interested in knowing, and we're willing to accept making inferences only about differences in true scores rather than causal effects, there is still a major lurking problem. That problem is that in realistic conditions, the RCI has absolutely abysmal statistical power. A plot showing statistical power relative to effect size for measures of various levels of reliability is shown in the figure below.

As you can see, the picture is very worrying: You can achieve 80% power only if the effect size is large and the reliability of your measure stupendous (e.g., effect size d of around 0.8 and reliability of 0.95). In more realistic circumstances - say a medium effect size of d = 0.5 and reliability of 0.8 - your power is well under 30%. And with measures of lower reliability and smaller effects, the statistical power of the test is barely higher than the Type 1 error rate. In these circumstances, even if by some magical coincidence we do come across a significant result, we shouldn't place any trust in it: the lower the power, the lower the probability that a significant finding reflects a true effect (see Ioannidis, 2005).
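For anyone who wants to play with these numbers, here is a rough sketch of how the power of the RCI can be approximated in Python. It rests on assumptions that are mine and need not match exactly how the figure was produced: the effect size d is the change in the participant's true score expressed in units of the SD used to compute the SEM, measurement errors are normal and independent across the two occasions, and the test is two-sided. The function name is made up for illustration.

```python
import math
from statistics import NormalDist

def rci_power(d, reliability, z_crit=1.96):
    """Approximate two-sided power of the RCI for a true-score change of d SDs.

    Assumes normal, independent measurement errors with the SEM implied by
    the given reliability (an illustrative assumption).
    """
    # Expected value of (observed difference / Sdiff) when the true change is d * SD
    shift = d / math.sqrt(2 * (1 - reliability))
    z = NormalDist()
    # Probability that the standardised difference falls beyond either critical value
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

for d, rel in [(0.0, 0.80), (0.2, 0.70), (0.5, 0.80)]:
    print(f"d = {d}, reliability = {rel}: power = {rci_power(d, rel):.3f}")
```

With d = 0 the function simply returns the Type 1 error rate, which is a handy sanity check on the formula.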

Why is the power of the RCI so low? Well there's no getting away from the fact that in a single case study, N = 1. By deliberately focusing only on ruling out measurement error variation (rather than all extraneous variation) we increase the power of the test a little, but the reality is that two measurements from a single person simply can't allow us to make precise inferences.


And we can go further down the rabbit hole still: There are three more specific distributional assumptions that are problematic in regard to the RCI:
  1. The RCI may be biased if it is based on an estimate of reliability whose distributional assumptions were not met (which is likely to be the case if it is based on a Cronbach's alpha estimate)
  2. The RCI is based on the assumption that the measurement error variance is constant across people, and specifically that the quantity of measurement error is the same for your participant as it is in the sample you got your reliability estimate from. If, as could very well be the case, measurement error is higher or lower for your participant than that estimate, then the RCI will give biased inferences.
  3. The RCI is based on the assumption that the measurement error for your participant is normally distributed over testing occasions. If this (untestable) assumption does not hold, the fact that N = 1 means that there is no protection to be gained from the central limit theorem. Consequently small departures from normality could distort Type 1 error rates.
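To get a feel for how much the second assumption matters, here is a small sketch in Python. It assumes (hypothetically) that your participant's real SEM is some multiple of the SEM you plugged into the RCI, that errors are still normal, and that the null hypothesis of no true-score change holds. The function name and the ratios tried are made up for illustration.

```python
from statistics import NormalDist

def actual_type1_rate(sem_ratio, z_crit=1.96):
    """Two-sided false-positive rate when the participant's real SEM is
    sem_ratio times the SEM assumed in the RCI (errors still normal)."""
    # The observed difference really has SD sem_ratio * Sdiff,
    # but it is still compared against z_crit * Sdiff.
    return 2 * (1 - NormalDist().cdf(z_crit / sem_ratio))

for ratio in (0.5, 1.0, 1.5, 2.0):
    print(f"SEM ratio {ratio}: Type 1 error rate = {actual_type1_rate(ratio):.3f}")
```

A ratio of 1.0 recovers the nominal 5% rate; ratios above 1 inflate the false-positive rate, and ratios below 1 make the test overly conservative.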

So what to do? 

While I totally respect the intentions of those trying to use the RCI to add rigour to single case experimental designs, I'm not sure that the RCI is worth reporting. It doesn't really address the real question of interest in a single case experimental design, and its poor power means that it's likely to produce untrustworthy inferences.

But what to report instead? If you're wedded to a single case design, I think a couple of strategies that might help are:
  • Collect lots of data points. If we can see how variable an individual participant's scores are within the baseline and intervention periods, this gives a much more direct (albeit subjective) impression of whether random fluctuation in scores could account for an observed difference between baseline and treatment. And with a very large number of data points collected at regular intervals you may even be able to use methods designed for time series data (e.g., change point regression) to detect a change in behaviour at the time of the intervention.
  • If you can, use a withdrawal design (e.g., ABAB). Hell, if you can go back and forth between treatment and baseline a bunch of times at random, you could even run a perfectly conventional significance test to compare the two conditions. Unfortunately I don't think this is a feasible option for the types of intervention most psychologists are interested in nowadays though (you can't "withdraw" teaching the participant how to practice mindfulness, say).
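As a rough illustration of that second strategy, here is a sketch of one test you could run on data from a design where treatment and baseline occasions really were assigned at random. I've used a Monte Carlo permutation test rather than something like a t-test, simply because it leans directly on the random assignment; that choice, the function name, and the scores are all mine and purely illustrative. It also assumes the observations are exchangeable under the null (no carryover effects or serial dependence), which is a strong assumption for repeated measurements on one person.

```python
import random
from statistics import mean

def permutation_test(baseline, treatment, n_resamples=5000, seed=0):
    """Two-sided Monte Carlo permutation test on the difference in condition means."""
    rng = random.Random(seed)
    observed = mean(treatment) - mean(baseline)
    pooled = list(baseline) + list(treatment)
    n_base = len(baseline)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # re-assign scores to conditions at random
        diff = mean(pooled[n_base:]) - mean(pooled[:n_base])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero
    return (extreme + 1) / (n_resamples + 1)

# Made-up scores for illustration only (lower = fewer symptoms)
baseline = [22, 25, 21, 24, 23, 26]
treatment = [18, 17, 20, 16, 19, 18]
print(permutation_test(baseline, treatment))
```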
But at the end of the day, I think there's no getting around the fact that making quantitative claims on the basis of data from a single participant is a pretty risky process. If you're trialling an intervention with one of the H.M.s of the world - someone whose condition is so rare you just can't possibly get a decently sized sample even in principle - then stick with a single case design and make do. But if the reality is that you just want to trial a psychotherapy with people who have a reasonably common condition - and you're using a single case design only because running a large sample RCT is beyond your current means - then I think it's worth re-thinking things more completely and considering whether another research topic might be more deserving of your attention.

Appendix: R code for plot