Today, as I promised in that earlier post, I want to talk about the RCI a bit more broadly, and discuss some problems with its use. (TL;DR: The RCI doesn't test the hypothesis we're actually interested in, and has extremely poor statistical power for testing the hypothesis that it does test.)
Introduction to the RCI

For those who haven't heard of it, the RCI is something that's popular amongst clinical psychologists conducting single case experimental designs. Basically the idea is:
- We obtain a pre score on the measure of interest for our lone participant, administer an intervention, then obtain a post score, and calculate the observed difference between the two. (I.e., how much did the participant improve?)
- Using reliability and variability estimates from a psychometric validation study of the measure (or whatever is available), we calculate the standard error of measurement of a set of scores: SEM = SD*sqrt(1 - reliability)
- Using the SEM, we calculate the standard error of the difference between pre and post scores: Sdiff = sqrt(2*SEM^2)
- We compare the observed improvement in the participant's score to the critical ratio 1.96*Sdiff; if it is greater than this critical ratio, the improvement is "reliable"/statistically significant.
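The steps above can be sketched in a few lines of Python (the scores, SD, and reliability below are made-up numbers for illustration):

```python
import math

def rci(pre, post, sd, reliability):
    """Reliable Change Index: observed change divided by the SE of the difference."""
    sem = sd * math.sqrt(1 - reliability)  # standard error of measurement
    s_diff = math.sqrt(2 * sem ** 2)       # SE of a pre-post difference score
    return (post - pre) / s_diff

# Made-up example: a 10-point improvement on a measure with SD = 10
# and reliability = .80 in the validation sample
print(round(rci(pre=30, post=40, sd=10, reliability=0.80), 2))  # 1.58 < 1.96: not "reliable"
```

Note that even a full-SD improvement doesn't clear the 1.96 criterion here, which foreshadows the power problems discussed below.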
What the RCI does and doesn't test

However, while the RCI is indeed an inferential statistic, I'm not sure that it allows you to make an inference about anything you really care about.
Consider, first of all, the null hypothesis that the RCI is testing: that the participant's true score at time 1 was the same as her true score at time 2. What is a true score on a measure? Well, a true score is a concept from classical test theory. Specifically, a true score is the average score the participant would get if we could administer the measure to the participant an infinite number of times, with each administration being independent of the others (i.e., we'd have to wipe their memories after each administration), and without their actual level of the measured attribute changing (so imagine a counterfactual world where we could keep administering the measure repeatedly at exactly the same point in time). Importantly, the true score is not the participant's actual level of the attribute we're trying to measure.
So the true score is an abstract and rather strange concept; I'm not sure we really care about whether a participant's true score has changed after the intervention. Rather, the key question is: Did the intervention cause a change in the person's actual level of the attribute of interest?
Now the RCI does actually relate to that causal question, in the sense that it helps to rule out one specific alternative explanation for an observed difference in scores: It helps us rule out the possibility that a difference we've seen is purely due to random measurement error.
But the problem is that it doesn't help us rule out several much more pressing alternative explanations (explanations, that is, other than an actual causal effect of the intervention). These include:
- Maturation: The change in scores could be due to some natural time-related process (e.g., aging)
- History: The change in scores could be due to some external event not caused by the intervention
- Regression to the mean: The change in scores could be due to the selection of a participant with an extreme observed level of the variable of interest, who in all probability would receive a less extreme score on re-testing (regardless of the intervention)
- Testing: The sheer act of taking the pre-test measurement could itself have changed the participant's score at post-test (e.g., via practice or sensitisation effects)
- Instrumentation: There could have been a change in the measurement method in the intervention period (related to the idea of systematic measurement error, which the RCI does not address)
Power problems

Unfortunately, even if we set aside the fact that the RCI doesn't really tell us what we are most interested in knowing, and we're willing to make inferences only about differences in true scores rather than causal effects, there is still a major lurking problem: in realistic conditions, the RCI has absolutely abysmal statistical power. A plot showing statistical power relative to effect size for measures of various levels of reliability is shown in the figure below.
Why is the power of the RCI so low? Well there's no getting away from the fact that in a single case study, N = 1. By deliberately focusing only on ruling out measurement error variation (rather than all extraneous variation) we increase the power of the test a little, but the reality is that two measurements from a single person simply can't allow us to make precise inferences.
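Assuming normally distributed measurement error, the power of the two-sided RCI test can be computed analytically; here is a minimal sketch, with the true change expressed in SD units of the measure:

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def rci_power(true_change, reliability):
    """Approximate power of the two-sided RCI test, with the true change
    expressed in SD units of the measure (normal measurement error assumed)."""
    s_diff = math.sqrt(2 * (1 - reliability))  # Sdiff in SD units of the measure
    z = true_change / s_diff
    return (1 - norm_cdf(1.96 - z)) + norm_cdf(-1.96 - z)

# Even a large 0.8 SD true change, on a measure with reliability .80,
# is detected less than a quarter of the time
print(round(rci_power(0.8, 0.80), 2))
```

Setting the true change to zero recovers the nominal Type 1 error rate of .05, which is a handy sanity check on the formula.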
And we can go further down the rabbit hole still: there are three more specific distributional assumptions that are problematic for the RCI:
- The RCI may be biased if it is based on an estimate of reliability whose distributional assumptions were not met (which is likely to be the case if it is based on a Cronbach's alpha estimate)
- The RCI is based on the assumption that the measurement error variance is constant across people, and specifically that the quantity of measurement error is the same for your participant as it is in the sample you got your reliability estimate from. If, as could very well be the case, measurement error is higher or lower for your participant than that estimate, then the RCI will give biased inferences.
- The RCI is based on the assumption that the measurement error for your participant is normally distributed over testing occasions. If this (untestable) assumption does not hold, the fact that N = 1 means that there is no protection to be gained from the central limit theorem. Consequently small departures from normality could distort Type 1 error rates.
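A quick simulation illustrates that last point. Here measurement error is drawn from a contaminated-normal mixture (the mixture weights and scales are arbitrary choices for illustration) whose SD still matches the SEM the RCI assumes, yet the Type 1 error rate drifts away from the nominal .05:

```python
import math
import random

random.seed(42)

def contaminated_error(sem):
    """Measurement error with overall SD equal to `sem` but heavy tails:
    90% N(0, sem/3), 10% N(0, 3*sem). Weights chosen for illustration."""
    scale = 3.0 if random.random() < 0.10 else 1.0 / 3.0
    return random.gauss(0, scale * sem)

# Monte Carlo Type 1 error of the RCI when the true score never changes,
# but the error distribution is non-normal (its variance still matches the SEM)
sem, crit = 1.0, 1.96 * math.sqrt(2)  # Sdiff = sqrt(2)*SEM
n_sims = 50_000
rejections = sum(
    abs(contaminated_error(sem) - contaminated_error(sem)) > crit * sem
    for _ in range(n_sims)
)
print(rejections / n_sims)  # ~0.07 here, versus the nominal 0.05
```

With a sample of scores the central limit theorem would wash much of this out, but with N = 1 the error distribution hits the test statistic directly.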
While I totally respect the intentions of those trying to use the RCI to add rigour to single case experimental designs, I'm not sure that the RCI is worth reporting. It doesn't really address the real question of interest in a single case experimental design, and its poor power means that it's likely to produce untrustworthy inferences.
So what to do?
What should we report instead? If you're wedded to a single case design, I think a couple of strategies that might help are:
- Collect lots of data points. If we can see how variable an individual participant's scores are within the baseline and intervention periods, this gives a much more direct (albeit subjective) impression of whether random fluctuation in scores could account for an observed difference between baseline and treatment. And with a very large number of data points collected at regular intervals you may even be able to use methods designed for time series data (e.g., change point regression) to detect a change in behaviour at the time of the intervention.
- If you can, use a withdrawal design (e.g., ABAB). Hell, if you can go back and forth between treatment and baseline a bunch of times at random, you could even run a perfectly conventional significance test to compare the two conditions. Unfortunately, I don't think this is a feasible option for the types of intervention most psychologists are interested in nowadays (you can't "withdraw" teaching the participant how to practice mindfulness, say).
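If session order really is randomized, the "perfectly conventional" test mentioned above can be a simple randomization (permutation) test; here's a sketch with hypothetical session scores:

```python
import random

random.seed(0)

def randomization_test(a_scores, b_scores, n_perms=10_000):
    """Monte Carlo randomization test comparing baseline (A) and treatment (B)
    sessions; valid when the session order was genuinely randomized."""
    n_a = len(a_scores)
    observed = sum(b_scores) / len(b_scores) - sum(a_scores) / n_a
    pooled = list(a_scores) + list(b_scores)
    exceed = 0
    for _ in range(n_perms):
        random.shuffle(pooled)  # random reassignment of sessions to conditions
        perm = sum(pooled[n_a:]) / len(b_scores) - sum(pooled[:n_a]) / n_a
        if abs(perm) >= abs(observed):
            exceed += 1
    return (exceed + 1) / (n_perms + 1)  # add-one-smoothed two-sided p-value

# Hypothetical symptom scores from six baseline and six treatment sessions
baseline = [22, 25, 19, 24, 23, 21]
treatment = [15, 17, 14, 18, 16, 13]
print(randomization_test(baseline, treatment))  # p well below .05
```

Unlike the RCI, this test's validity rests on the randomization itself rather than on borrowed reliability estimates and normality assumptions.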