Thursday, November 10, 2016

For a Type 1 error rate of 5%, accept H1 if BF10 > 1 (wait, what?)

So I was looking at Daniel Lakens' fascinating recent blog post The Dance of the Bayes Factors, where he shows how Bayes factors are subject to a pretty substantial degree of sampling variability. When used as a basis for a decision about whether a null hypothesis is true or false, they can also (of course) result in errors. For example, Daniel shows that for a two-sample comparison of means, with n = 75 in each group and a true effect size of Cohen's δ = .3, quite high rates of Type 2 errors can result. In particular, Daniel notes that we will:
  • Falsely support the null 25% of the time if we set a cutoff of BF10 < 1/3 for supporting the null
  • Falsely support the null 12.5% of the time if we set a stricter cutoff of BF10 < 1/10 for supporting the null
I wondered: What kind of Type 2 error rate would result if we set our evidence cut-offs so as to ensure a Type 1 error rate of 5%? To figure this out, I simulated some data where the null is true, again with n = 75 in each group, to see what cut-off would preserve this Type 1 error rate:

 
require(BayesFactor)
set.seed(123)
get.BF <- function(cohensd, n){
  x <- rnorm(n = n, mean = 0, sd = 1)       # simulate n participants in group 1
  y <- rnorm(n = n, mean = cohensd, sd = 1) # simulate n participants in group 2
  z <- t.test(x, y)                         # perform the t-test
  # ttest.tstat returns the log Bayes factor, so exponentiate it
  BF10 <- exp(ttest.tstat(z$statistic, n, n, rscale = sqrt(2)/2)$bf)
  BF10
}
simsH0 = replicate(10000, get.BF(cohensd = 0, n = 75))
quantile(simsH0, probs = 0.95)
 
##       95% 
## 0.9974437

Wait… what? It turns out that even if we accept any Bayes factor greater than 1 as a basis to support H1, we will only reject H0 falsely 5% of the time! That seems bizarre - a Bayes factor of 1, after all, means that the data is just as consistent with H0 as it is with H1! (Note: The cut-off for a 5% error rate differs according to the H1 prior specified and the sample size - don't take the title of this post too literally!)

How about if we used stricter cut-offs of BF10 > 3 (“positive” evidence according to Kass and Raftery, 1995), or > 20 (“strong” evidence)?
 
sum(simsH0 > 3)/length(simsH0)
## [1] 0.0114
sum(simsH0 > 20)/length(simsH0)
## [1] 0.0015

It turns out that if we use the BF10 > 3 criterion (which still seems very liberal), our Type 1 error rate is only about 1%. And if we use a rule of BF10 > 20 as a cut-off for supporting H1, our Type 1 error rate becomes absolutely minuscule - about 0.15%.

Weird. But let's follow this exercise to its conclusion… what Type 2 error rate results if we use the BF10 > 1 rule to support H1? Well, assuming Daniel's original effect size of δ = 0.3:
 
simsH1 = replicate(10000, get.BF(cohensd = 0.3, n = 75))
sum(simsH1 > 1)/length(simsH1) #We support H1 46% of the time
## [1] 0.4621
power.t.test(75, 0.3) #Which is virtually identical to the power of a frequentist t-test
## 
##      Two-sample t test power calculation 
## 
##               n = 75
##           delta = 0.3
##              sd = 1
##       sig.level = 0.05
##           power = 0.4463964
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

So, on one level, we have a reassuring result: It is possible to set a Bayes factor cut-off for “decisiveness” that gives us basically the same Type 1 and Type 2 error rates as a conventional significance test. Which makes sense: For a given test and sample size, the Bayes factor is just a monotonic transformation of the p value, after all.
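As a check on that monotonicity claim, here's a rough numerical sketch in Python - not the BayesFactor package's own code, but a direct numerical version of the same JZS integral it evaluates (assuming scipy is available; the specific t values are just illustrative):

```python
import numpy as np
from scipy import integrate, stats

def jzs_bf10(t, n, rscale=np.sqrt(2) / 2):
    """Approximate two-sample JZS Bayes factor for a t statistic.

    Marginal likelihood of t under H1 (delta ~ Cauchy(0, rscale)),
    divided by its likelihood under H0 (delta = 0).
    """
    df = 2 * n - 2
    mult = np.sqrt(n * n / (n + n))  # converts delta into a noncentrality parameter

    def integrand(delta):
        return stats.nct.pdf(t, df, delta * mult) * stats.cauchy.pdf(delta, 0, rscale)

    # Integrate over a wide finite range; the Cauchy tails beyond +/-5
    # contribute almost nothing once multiplied by the t likelihood.
    m1, _ = integrate.quad(integrand, -5, 5, points=[t / mult])
    m0 = stats.t.pdf(t, df)
    return m1 / m0

# BF10 rises monotonically with |t|, just as p falls with |t|
print(jzs_bf10(1.0, 75), jzs_bf10(1.976, 75), jzs_bf10(3.0, 75))
```

Reassuringly, at the two-tailed .05 critical value (t ≈ 1.976 with 148 df) this comes out very close to the ≈1 cut-off found in the simulation above.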

But on the other hand… a lot of this doesn't make a lot of intuitive sense. How can we maintain a low Type 1 error rate with such a liberal cut-off for supporting H1? And why does the same incredibly liberal cut-off not result in particularly good power for supporting H1 when the true effect size is small but definitely not zero?

The answer to this lies in thinking about the models we're actually comparing here. We might think of this test as comparing an hypothesis of an exactly zero effect (H0) to one of a non-zero effect (H1), but that isn't actually what we're doing. The H1 model being tested is that the true effect size δ is randomly drawn from a Cauchy distribution with location = 0 and scale = 0.707.

Now when the true effect size δ is zero, the majority of sample effect sizes we see are indeed near zero. And the null (H0) model tends to fit the resulting sample estimates quite well, as it should (since it's the true model). On the other hand, the H1 model of Cauchy (0, 0.707) is quite spread out: It implies a 50% chance of an effect size |δ| > 0.707. Correspondingly it suggests that sample effect sizes very near zero are quite improbable. So when H0 is true, the H1 model tends not to fit the data well, and we will accidentally support it only very rarely.
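Incidentally, that 50% figure is easy to verify numerically; a quick sketch in Python, assuming scipy:

```python
from scipy import stats

# The default H1 prior: delta ~ Cauchy(0, 0.707)
h1_prior = stats.cauchy(loc=0, scale=0.707)

# Prior probability mass placed beyond |delta| = 0.707
p_big = 1 - (h1_prior.cdf(0.707) - h1_prior.cdf(-0.707))
print(p_big)  # 0.5: the quartiles of a Cauchy sit at +/- its scale parameter
```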

But what's the story when the true effect size is δ = 0.3? Why is the Bayes factor test still so shy to support the H1 model, even if the null hypothesis is actually false, and even if we take the slightest sliver of evidence (BF10 > 1) as enough to say H1 is true?

Well, first of all, in this situation neither the H0 model nor the H1 model is true. The H1 model suggests that the true effect size is randomly drawn from a Cauchy distribution, and places zero prior probability on the true effect size taking any specific point value. But here the true effect size is a fixed, small value. So both models are technically false, but either could sometimes fit the resulting data quite well. And, as it happens, because the H1 model of Cauchy (0, 0.707) suggests that a small true effect size isn't very probable, it tends not to fit very well to data simulated according to a fixed, small effect size. On the other hand, the null model often fits reasonably well with such data. Hence in this situation we tend to be much more likely to support H0 than H1, even though H0 is false.

In one sense, the result I come up with above is just a statistical curiosity, and one that probably won't be very surprising to people very familiar with Bayes factors. But I think there are some simple philosophy-of-science considerations to keep in mind here that might be of interest to some.

Basically, when we use a Bayes factor test of a null hypothesis, the H1 model we specify is a falsifiable (but literally false) model that we use as a stand-in for the possibly-true (but unfalsifiable) substantive hypothesis: That the true effect size simply isn't zero. That isn't a fault of the test: No statistical test can make up for the fact that psychological hypotheses are often too vaguely specified to be directly tested without being reformulated. But we need to be wary of over-interpreting a Bayes factor result that supports the null hypothesis: Such a result implies that H0 fits the data better than the specific H1 model we've specified… but it doesn't necessarily mean the null hypothesis is true.

Sunday, October 30, 2016

No, a breach of normality probably won't cause a 17% Type 1 error rate

Over the weekend I came across this article on the PsychMAP Facebook group

Cain, M. K., Zhang, Z., & Yuan, K.-H. (2016). Univariate and multivariate skewness and kurtosis for measuring nonnormality: Prevalence, influence and estimation. Behavior Research Methods. https://doi.org/10.3758/s13428-016-0814-1

From the abstract:
"this study examined 1,567 univariate distributions and 254 multivariate distributions collected from authors of articles published in Psychological Science and the American Education Research Journal. We found that 74 % of univariate distributions and 68 % of multivariate distributions deviated from normal distributions. In a simulation study using typical values of skewness and kurtosis that we collected, we found that the resulting type I error rates were 17 % in a t-test and 30 % in a factor analysis. Hence, we argue that it is time to routinely report skewness and kurtosis..."


Hang on... a 17% Type I error rate purely due to non-normality? This seemed implausibly high to me. In an OLS model such as a t-test (or ANOVA, or ANCOVA, or regression, or whatever), the assumption of normally distributed error terms really isn't very important. For example, it isn't required for the estimates to be unbiased or consistent or efficient. It is required for the sampling distribution of test statistics such as the t statistic to follow their intended distributions, and thus for significance tests and confidence intervals to be trustworthy... but even then, as the sample size grows larger, as per the CLT these sampling distributions will converge toward normality anyway, virtually regardless of the original error distribution. So the idea of non-normal "data" screwing up the Type 1 error rate to this degree was hard to swallow.

So I looked closer, and found a few things that bugged me about this article. Firstly, in an article attempting to give methodological advice, it really bugs me that the authors misstate the assumptions of OLS procedures. They say:

" we conducted simulations on the one-sample t-test, simple regression, one-way ANOVA, and confirmatory factor analysis (CFA).... Note that for all of these models, the interest is in the normality of the dependent variable(s)."

This isn't really true. We do not assume that the marginal distribution of the dependent variable is normal in t-tests, regression, or ANOVA. Rather we assume that the errors are normal. This misconception is something I've grumbled about before. But we can set this aside for the moment given that the simulation from the article that I'll focus on pertains to the one-sample t-test, where the error distribution and the marginal distribution of the data coincide.

So let's move on to looking at the simulations of the effects of non-normality on Type I error rates for the one-sample t-test (Table 4 in the article). This is where the headline figure of a "17%" Type 1 error rate comes from, albeit that the authors seem to be mis-rounding 17.7%. Basically what the authors did here is:
  • Estimate percentiles of skewness and (excess) kurtosis from the data they collected from authors of published psychology studies
  • Use the PearsonDS package to repeatedly simulate data from the Pearson family of distributions under different conditions of skewness and sample size
  • Run one-sample t-tests on this data.
There are a number of conditions they looked at, but the 17.7% figure comes about when N = 18 and skewness = 6.32 (this is the 99th percentile for skewness in the datasets they observed).

However,  they also note that "Because kurtosis has little influence on the t-test, it was kept at the 99th percentile, 95.75, throughout all conditions." I'm unsure where they get this claim from: to the extent that non-normality might cause problems with Type 1 error, both skewness and kurtosis could do so. On the face of it, this is a strange decision: It means that all the results for the one-sample t-test simulation are based on an extremely high and unusual level of kurtosis, a fact that is not emphasised in the article.

As it happens, the real reason they chose this level of kurtosis is presumably that the simulation simply wouldn't run if they tried to combine a more moderate degree of kurtosis with a high level of skewness like 6.32: The minimum possible kurtosis for any probability distribution is skewness^2 + 1.
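To see what that bound implies for the skewness value of 6.32, a quick sketch (the function name is mine):

```python
def min_kurtosis(skewness):
    # For any probability distribution, kurtosis >= skewness^2 + 1
    return skewness ** 2 + 1

skew = 6.32
print(min_kurtosis(skew))      # ~40.94: minimum possible (non-excess) kurtosis
print(min_kurtosis(skew) - 3)  # ~37.94: minimum possible *excess* kurtosis
```

So at skewness 6.32 any excess kurtosis below about 37.9 is mathematically impossible - a "moderate" value simply cannot be simulated alongside it.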

That technical note aside, the net effect is that the headline figure of a Type I error rate of 17% is based on a tiny sample size (18) and an extremely unusual degree of non-normality: An extremely high degree of skewness (6.32, the 99th percentile for skewness across their observed datasets), and an extremely high degree of excess kurtosis (95.75, again the 99th percentile). To give you an idea of what this distribution looks like, check the histogram below (10,000 data points drawn from this distribution). It's a little hard to imagine even a very poor researcher not noticing that there is something amiss if this was the distribution of their raw data!

In fact, it's worth noting that the authors "contacted 339 researchers with publications that appeared in Psychological Science from January 2013 to June 2014 and 164 more researchers with publications that appeared in the American Education Research Journal from January 2010 to June 2014", but we have no idea how many of these authors applied analyses that actually assumed normality. No doubt, some of the extreme levels of skewness and kurtosis arose in cases such as reaction time measurements or count data, where even without checking diagnostics the original authors might have been well aware that a normality assumption was inappropriate.

So what about more plausible degrees of non-normality? If anything, a close read of the article shows how even quite dramatic non-normality causes few problems: For example, take the case of N = 18, skewness of 2.77 and excess kurtosis of 95.75 (a small sample size with still quite extreme non-normality, as visualised in the q-q plot below - albeit not as extreme as the case discussed previously). The authors find that the Type 1 error rate (with alpha of 0.05) in this scenario is... 6.4%. That's hardly something to get too stressed out about!
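As a rough independent check of that kind of figure, here's a sketch in Python - not the authors' simulation (I'm substituting an exponential distribution, which has skewness 2 and excess kurtosis 6, for their Pearson-family draws), but the same small sample size:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps, alpha = 18, 10000, 0.05

rejections = 0
for _ in range(reps):
    # Exponential data: skewness 2, excess kurtosis 6 - clearly non-normal
    x = rng.exponential(scale=1.0, size=n)
    result = stats.ttest_1samp(x, popmean=1.0)  # true mean is 1, so H0 is true
    rejections += result.pvalue < alpha

print(rejections / reps)  # modestly above the nominal 0.05, not wildly so
```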



Ok, so non-normality only really causes Type I problems in extremely severe situations: E.g., pathologically high levels of skewness and kurtosis combined with small sample sizes. But why am I picking on the authors? There is still at least some small danger here of inflated Type 1 error - isn't it good to be conservative and check for these problems?

Well, I have two main worries with this perspective.

Firstly, it's my perception that psychologists worry far too much about normality, and ignore the other assumptions of common OLS procedures. E.g., in descending order of importance:
  • Conditional mean of the errors for any combination of predictors is assumed to be zero (breached if correlated measurement error is present, or any measurement error at all in the Xs, or randomisation not conducted, or unmodelled non-linear relationships present between Xs and Y)
  • Error terms are assumed to be independent
  • Error terms are assumed to have the same variance, regardless of the levels of the predictors.

Now breaches of some of these assumptions (especially the first one) are much more serious, because they can result in estimates that aren't unbiased and consistent and efficient; consequently Type I error rates can of course be substantially inflated (or deflated). As methodologists, we need to bang on about normality less, and the other assumptions more. As Andrew Gelman has said, normality is a minor concern in comparison to some of the other assumptions (note that his phrasing of the OLS/regression assumptions is a little different to how I've written them above, though the ideas are similar).
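To illustrate how much more damaging a breach of the independence assumption can be, here's a hypothetical sketch in Python: clustered data analysed with an ordinary t-test that ignores the clustering. All the numbers (cluster counts, sizes, variances) are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
reps, clusters_per_group, cluster_size = 5000, 6, 10

def simulate_group():
    # A shared cluster effect makes errors correlated within clusters
    cluster_effects = rng.normal(0, 0.5, clusters_per_group)
    noise = rng.normal(0, 1, clusters_per_group * cluster_size)
    return np.repeat(cluster_effects, cluster_size) + noise

rejections = 0
for _ in range(reps):
    result = stats.ttest_ind(simulate_group(), simulate_group())  # H0 is true
    rejections += result.pvalue < 0.05

print(rejections / reps)  # far above the nominal 0.05
```

With these settings the intraclass correlation is only 0.2, yet the Type 1 error rate is inflated several-fold - far worse than anything non-normality did above.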

Secondly, I worry that by encouraging researchers to check for non-normality (and then potentially change their analysis method if problems are found), we introduce extra researcher degrees of freedom into the analysis process. Deciding whether error non-normality is present is a subjective and challenging task, and there are many potential solutions to non-normality researchers can try. It thus strikes me that the potential for selective reporting (and even p-hacking) involved in encouraging researchers to check for non-normality and then attempt various remedies is actually a much greater threat to validity than any direct effect of non-normality.

In conclusion... non-normality can cause problems, but only in very unusual situations. If you're running a t-test or regression, chances are that there are more important things for you to be worrying about.


Wednesday, September 7, 2016

What if the mean difference is less than the standard error of measurement?


[Warning: Very Basic Stuff contained within]

Last weekend I was at a conference where there was an interesting tutorial-style talk on the Reliable Change Index (RCI). The RCI is generally used to try and determine if a single person has displayed 'real' change over the course of some intervention. (In other words, could an observed change in score plausibly have occurred purely due to random measurement error?)

I have some conceptual problems with the RCI in terms of whether it really tells us anything we really want to know (which I'll save for another day), but it was an interesting and well-delivered presentation. That said, I want to pick on an idea that was mentioned in the talk, and that I've heard others repeat recently.

The idea relates to extending the RCI beyond single cases. Specifically, the speaker suggested that in a group study, if a mean difference is less than the standard error of measurement, the apparent effect might be spurious (i.e., purely the result of measurement error) - even if the mean difference is statistically significant. His reasoning was that a statistical significance test focuses on sampling error, not measurement error.

Now, for a single case, a change in score that is less than the standard error of measurement is indeed one that would be quite consistent with a null hypothesis that the true score of the participant has not actually changed. (This isn't to say that this null is true, just that the observation isn't overtly inconsistent with the null). The RCI framework formalises this idea further by:

  1. Using the SEM to calculate the standard error of the difference, Sdiff = sqrt(2*SEM^2). Since both the pre score and the post score are subject to errors of measurement, the standard error of the difference is a little more than the SEM.
  2. Using 1.96*Sdiff as the cut-off for reliable change, drawing on the usual goal of a 5% Type 1 error rate.
All good so far. However, if we are comparing two sample means, the picture changes. At each time point we now have multiple observations (for different people), each with a different quantity of measurement error. The mean of the measurement error across people will itself have a variance that is less than the variance of the measurement error variable itself. This should be intuitively obvious: The variance of the mean of a sample of observations is always less than the variance of the underlying variable itself (well, provided the sample has N > 1 and the observations aren't perfectly correlated.)

In fact, when the sample is reasonably large, the standard error of the mean of the measurement error for the sample will be drastically less than the standard error of measurement itself. So an observation that a mean difference is less than the standard error of measurement is not necessarily consistent with the null hypothesis of no true change occurring.
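A small numerical illustration of the gap between the single-case cutoff and the group-level standard error, using a made-up SEM of 5 points:

```python
import math

sem = 5.0                        # hypothetical standard error of measurement
s_diff = math.sqrt(2 * sem**2)   # standard error of a single pre-post difference
rci_cutoff = 1.96 * s_diff       # reliable change cutoff for one person
print(s_diff, rci_cutoff)        # ~7.07 and ~13.86

# But the *mean* of n people's measurement-error differences has a much
# smaller standard error:
n = 50
print(s_diff / math.sqrt(n))     # 1.0: a small fraction of the single-case value
```

So with 50 participants, a mean difference of (say) 3 points sits well under the SEM of 5, yet is three times the standard error that measurement error alone could produce in the mean.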

So do we need to calculate the standard error of measurement for a particular sample, and use that along with a significance test (or Bayesian test) when assessing mean differences?

No.

Standard inferential tests do not only deal with sampling error. Any test you're likely to use to look at a mean difference will include an error term (often, but not necessarily, assumed to be normal and i.i.d. with mean zero). This error term bundles up any source of purely unsystematic random variability in the dependent variable - including both sampling error and unsystematic measurement error. So your standard inferential test already deals with unsystematic measurement error. Looking at the standard error of measurement in a group analysis tells you nothing extra about the likelihood that a true effect exists.
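Here's a quick simulation sketch of that point: even with substantial measurement error added to every observation, an ordinary two-sample t-test keeps its nominal Type 1 error rate, because its error term absorbs both sources of noise. The numbers are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
reps, n = 10000, 30

def observed_scores():
    true_scores = rng.normal(0, 1, n)          # same distribution in both groups
    measurement_error = rng.normal(0, 0.8, n)  # unsystematic measurement error
    return true_scores + measurement_error

rejections = 0
for _ in range(reps):
    result = stats.ttest_ind(observed_scores(), observed_scores())  # H0 is true
    rejections += result.pvalue < 0.05

print(rejections / reps)  # ~0.05: the nominal Type 1 error rate is preserved
```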

Monday, June 13, 2016

My talk at the M3 conference on Bayes factor null hypothesis tests

Recently I visited the USA for the first time. In between consuming vast quantities of pizza, Del Taco, and Blue Moon, I presented at the Modern Modeling Methods (M3) conference in Connecticut. (Slides included below).

The M3 conference was jam-packed full of presentations about technical issues relating to quite complex statistical models (sample presentation title: "Asymptotic efficiency of the pseudo-maximum likelihood estimator in multi-group factor models with pooled data"). The questions were serious, the matrix algebra was plentiful, and the word "SPSS" was never uttered.

My presentation was a little different: I wanted to talk about the default statistical methods used by everyday common-or-garden researchers: The ones who like t-tests, do their analyses by point-and-click, and think a posterior is something from a Nicki Minaj song (but may know lots and lots about things other than statistics!). I believe that the practices of everyday researchers are important to look at: If we want to fix the replication crisis, we need most researchers to be doing their research and analysis better - not just the ones who pass their time by discussing floating-point computation on Twitter.

And in terms of problems with the data analyses used by everyday researchers, the big issue that jumps out at me is the (mis)use of null hypothesis significance tests (NHST). Positions obviously vary on whether NHST should be used at all, but I think there's reasonable agreement amongst methodology geeks that the form of Fisher/Neyman-Pearson hybrid NHST that dominates current practice is really problematic. To me, a couple of the biggest problems are that:
  1. Hybrid NHST (and frequentist analysis in general) doesn't directly tell us how certain we can be that a particular hypothesis is true. It tells us P(Data | Hypothesis), but we'd generally rather like to know P(Hypothesis | Data).
  2. Hybrid NHST specifically has a problem with asymmetry of result: p < .05 is interpreted as meaning the null can be rejected, the alternate hypothesis is supported, publication results, and rejoicing is heard throughout the land. But p > .05 is often read as indicating only uncertainty: We can't say the null is true, only that there is insufficient evidence to reject it*. This may be a contributor to publication bias and the replication crisis: Part of the preference against publishing null results may be that they are often interpreted as not actually communicating anything other than uncertainty.
But what do we replace hybrid NHST with? Abandoning inferential statistics entirely is obviously foolish. There are several options (Deborah Mayo's error statistics approach, Bayesian estimation, etc.). But an approach that's gaining especial traction in psychology at the moment is that of using Bayes factor tests: Particularly, using Bayes factors to compare the evidence for a point null vs. a vague alternate hypothesis (although obviously this isn't the only way in which Bayes factors can be used).

My talk was a gentle critique of this approach of Bayes Factor null hypothesis testing. And I do mean gentle - as I mention in slide 9 of the talk, I think Bayes factor tests of null hypotheses have some great advantages over conventional NHST. I mention several of these in my slides below, but perhaps the biggest advantage is that, unlike hybrid NHST, they compare two hypotheses in such a way that either hypothesis might end up supported (unlike NHST, where only the alternate can possibly "win"!) So I wouldn't want to stomp on the flower of Bayes factor testing. And certainly my talk critiques only a particular implementation of Bayes factors: To test a point null vs. a non-directional diffuse alternate. Much of the fantastic development of methods and software by guys like Rouder, Morey and Wagenmakers can be applied more broadly than just to Bayes factor tests of point null hypotheses.

But I do think that Bayes factor tests of point null hypotheses do have some problems that mean they may not be a suitable default approach to statistical analysis in psychology. (And currently this does seem to be the default way in which Bayes factors are applied).

To begin with, a Bayes factor is a statement only about the ratio of the likelihoods of the data under the null and alternate hypotheses. It isn't a statement about posterior odds or posterior probabilities. For a Bayesian analysis, that seems a little unsatisfying to me. Researchers presumably want to know how certain they can be that their hypotheses are correct; that is, they want to know about posterior probabilities (even if they don't use those words). In fact, researchers often try to interpret p values in a Bayesian way - as a statement about the probability that the null is true.

And I suspect that a similar thing will happen if Bayes factor null hypothesis tests become commonplace: People will (at least implicitly) interpret them as statements about the posterior odds that the alternate hypothesis is correct. In fact, I think that kind of interpretation is almost supported by the availability of qualitative interpretation guidelines for Bayes factors: The notion that Bayes factors can be directly interpreted themselves - rather than converted first to posterior odds - seems to me to reinforce the idea that they're the endpoint of an analysis: that the Bayes factor directly tells us about how certain we can be that a particular hypothesis is correct. I know that Jeff Rouder has explicitly argued against this interpretation, saying instead that researchers should report Bayes factors and let readers select and update their own priors (perhaps aided by suggestions from the researchers). In an ideal world, that's exactly how things would work, but I don't think it's realistic for everyday readers and researchers with limited statistical expertise.

So everyday researchers will naturally want a statement about the posterior (about how certain they can be that an hypothesis is correct) if doing a Bayes factor analysis. And I think it's likely that they will in fact interpret the Bayes factor as providing this information. But in what circumstance can a Bayes factor be interpreted as the posterior odds that the alternate hypothesis is correct? Well this is fairly obvious: The Bayes factor is the posterior odds that the alternate hypothesis is correct if we placed a prior probability of 0.5 on the null being true, and 0.5 on the alternate being true.

The thing is... that's a really weird prior. It's a prior that takes a "tower and hill" shape (see slides 13 and 14), and suggests that one particular value of the effect size (δ = 0) is infinitely** more likely than any other value. It is absolutely and definitely a very informative prior, and yet also one that seems unlikely to represent our actual state of prior knowledge about any given parameter.

So this is a problematic prior - and when researchers use the Bayes factor as the endpoint of a statistical analysis without explicitly drawing on prior information, I would argue that this prior implicitly underlies the resulting conclusions. For this reason I don't think that a Bayes factor test of a point null hypothesis vs. a non-directional alternate, with the Bayes factor serving as the major statistical endpoint of the analysis, is an ideal default approach in psychology.

Ok... but what would be a better default approach? The relatively slight modification I suggested in the talk was to take into account the fact that most (not all) hypotheses tested in psychology are directional. Therefore, instead of focusing on the "strawman" of a point null hypothesis, we could focus on testing whether a particular effect is positive or negative.

This is most obviously achieved by using MCMC Bayesian estimation of a parameter and then tallying up the proportion of draws from the posterior that are greater than zero (i.e., the posterior probability that the effect is positive). However, it could also be achieved by using Bayes factors, by comparing an hypothesis that the effect is positive to an hypothesis that it is negative (with a half-Cauchy prior for each, say). So the statistical methods and programs (e.g., JASP) developed by supporters of Bayes factor testing could readily be adapted to this slightly altered goal. Either way, this would allow us to dispense with our inordinate focus on the point null, and directly test the hypothesis that's likely to be of most interest: that there is an effect in a specific direction.
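The tallying step is trivial once you have posterior draws; a sketch with made-up MCMC output (the posterior here is just an illustrative normal, not output from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical MCMC draws from the posterior of a standardised effect delta
posterior_draws = rng.normal(loc=0.3, scale=0.15, size=20000)

# Posterior probability that the effect is positive
p_positive = np.mean(posterior_draws > 0)
print(p_positive)
```

With a posterior centred at 0.3 with standard deviation 0.15, about 98% of draws fall above zero, which is the directional statement the researcher actually wants.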

However, in performing directional tests we need to take into account the fact that most effects in psychology are small: If we were to use uninformative priors, this would lead to excess confidence in our statements about the directionality of effects. That is, the posterior probability that the true effect is in the same direction as that found in the sample will be closer to 1 with an uninformative prior than it would be if we took into account the fact that most effects are small. I don't think it's feasible to expect everyday researchers to select their own informative priors, but methodologists could suggest default informative priors for psychology: Given that we know only that a particular parameter describes a psychological effect, which values of the parameter are most likely?

To achieve this, I suggest that we could find default prior variances for common analyses in psychology empirically. As a very rough idea of how this could work, I point out that Richard et al.’s (2003) meta-meta-analysis of 25,000 social psychological studies found a mean absolute effect of r = .23, which equates to a Cohen's d value of 0.43. We might use this information to set a default prior for a two-sample means comparison, for example, of Cauchy (0, 0.43), implying a 50% chance that the true standardised effect δ is somewhere between -0.43 and +0.43. (Note: There could be better choices of meta-meta-analysis to use that aren't restricted to social psych, and we would want to correct for publication bias, but this serves as a rough illustration of how we could set a default prior empirically). Use of this prior would allow us to make statements about the posterior probability that an effect is in a particular direction, without the statement being made overconfident due to use of an uninformative prior.

So that was my idea, and here are the slides from my presentation:





But. I'm definitely not perfectly happy with the directional approach I suggested. It deals with the problematic implicit prior seemingly underlying Bayes factor tests of null hypotheses. And unlike either NHST or non-directional Bayes factor testing it also quantifies how certain we can be that the true parameter falls in a particular direction. But three problems remain:
  • The approach doesn't distinguish trivially small effects from substantial ones (one could maybe deal with this by adding a default ROPE?)
  • Like most alternatives to NHST being discussed currently, it relies heavily on the use of standardised effect sizes, which have quite substantial problems.
  • The approach doesn't prevent - and in fact facilitates - the testing of theories that make only vague directional predictions. Good theories would do better than suggesting only the sign of a parameter, and nothing about its magnitude.
To a large extent I think these three problems cannot be resolved by statistics alone: We won't be able to do better than standardised effect sizes as long as psychologists mostly use scales that lack clear units of measurement, and we won't be able to do better than directional analyses while most psychological theories make only directional predictions.

But at the least I think that Bayes factor tests are a better default approach than hybrid NHST. And I hesitantly suggest that the directional approach I outline is in turn a slightly better default approach than using Bayes factors to test point null hypotheses against non-directional alternates.


*This problem doesn't apply to the original form of Neyman-Pearson testing, where p > .05 is read as indicating support for a decision that the null is true.
**Thanks @AlxEtz for the correction.

Wednesday, May 4, 2016

Is good fit of a latent variable model positive evidence for measurement validity?

This week I reviewed a paper attempting to explain why latent variable modelling is useful for working out whether a measure is valid or not. (Latent variable modelling meaning either exploratory or confirmatory factor analysis). The author drew on Borsboom, Mellenbergh and van Heerden's theory of validity.

In Borsboom et al's theory, a test is a valid measure of an attribute if:
 1) The attribute exists
 2) Variation in the attribute causes variation in the measurement outcomes.

Therefore, the author suggested that latent variable models - which test models in which unobserved "latent" variables have causal effects on observed indicators - are useful for testing the validity of a measure. For example, if you hypothesise that a test is a valid measure of a specific individual attribute, and you fit a unidimensional CFA model and find "good" fit, then this supports the idea that the measure is valid. (We'll set aside the controversy surrounding what constitutes "good" fit of a latent variable model for the moment).

Now I don't want to pick on the paper I reviewed too much here - this is a line of reasoning that I suspect a lot of psychologists explicitly or implicitly follow when fitting latent variable models (or measurement models anyway). I've certainly published conventional psychometric papers that are at least indirectly based on this line of reasoning (example). But the more I think about it, the more it seems to me that this line of reasoning just doesn't work at all.

Why? The problem is the auxiliary hypothesis of conditional independence.

When we're examining the validity of a set of items as a measure of some attribute, we will typically have a substantive hypothesis that variation in the attribute causes variation in the item responses. This is fine. The problem is that this hypothesis is only testable in conjunction with the hypothesis that, controlling for the effects of the latent attribute, the item responses are uncorrelated with each other (the assumption of conditional independence). At most, we might be able to free some of these error correlations; we cannot allow all of them to be freely estimated, or the model will be unidentified, having more free parameters than there are observed variances and covariances.
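To see why all the error correlations cannot be freed, it helps to count parameters for the five-indicator case used in the simulations in this post:

```r
# Parameter counting for a one-factor model with p = 5 indicators
p <- 5
observed.moments <- p * (p + 1) / 2  # unique variances and covariances: 15

# Free parameters if ALL error covariances were estimated:
# p loadings + p error variances + p(p-1)/2 error covariances
# (latent variance fixed to 1 for scale identification)
free.params <- p + p + p * (p - 1) / 2  # 20

free.params > observed.moments  # TRUE: more parameters than data points
```

With 20 free parameters and only 15 observed moments, the fully free model cannot be identified, so some conditional independence constraints must be imposed.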

Problematically, the assumption of conditional independence is typically not part of the substantive hypothesis we are testing: if variation in the attribute causes variation in the measurement outcomes, then the measure is valid, regardless of whether conditional independence holds. There are occasional cases where we are genuinely interested in trying to explain the correlations between item scores - e.g., the g explanation for the positive manifold of IQ tests - but for the most part, conditional independence is just an assumption we make for convenience, not a part of the substantive theory. In Popper's terms, conditional independence is an auxiliary hypothesis.

Importantly, conditional independence is also an auxiliary hypothesis that typically isn't very plausible: For a pair of items, it means that we assume that responses to the two items have exactly zero effects on each other, and that aside from the latent variable specified, there exists no other variable whatsoever that has any direct effect on the responses to both of the two items.

What this all means is that if a hypothesised latent variable model doesn't fit the data well, it could be because the test isn't a valid measure of the attribute - but it could also be that the test is valid and the assumption of conditional independence simply doesn't hold: in other words, the items have relationships with one another that aren't perfectly explained by the shared effects of the latent variable.

To some extent, I suspect researchers are aware of this: it might be part of the reason why most researchers use fairly lax standards for testing the fit of latent variable models, and why many researchers are reasonably open to post-hoc modifications to models to try and account for problematic error correlations.

But what I think is less widely appreciated is that breaches of conditional independence can also lead to the opposite problem: a finding that a latent variable model fits "well", with significant and positive loadings of the latent variable on the items, despite the latent variable actually having no effect on any of them. For a unidimensional model, this can occur when the error correlations are homogeneous, but the latent variable has no true effect.
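The reason this can happen is that a homogeneous correlation matrix is exactly reproduced by a one-factor model with equal loadings: with all off-diagonal correlations equal to 0.3, loadings of λ = √0.3 imply a correlation of λ² = 0.3 between every pair of items. A minimal check in base R:

```r
# A homogeneous correlation matrix (all off-diagonals = 0.3) is exactly
# reproduced by a one-factor model with equal loadings lambda = sqrt(0.3)
lambda <- sqrt(0.3)
implied <- lambda^2 * matrix(1, 5, 5)  # lambda_i * lambda_j for every pair
diag(implied) <- 1                     # unit variances on the diagonal

homogeneous <- matrix(0.3, 5, 5)
diag(homogeneous) <- 1

all.equal(implied, homogeneous)  # TRUE: the matrices are identical
```

So the factor model fits the homogeneous-error data perfectly even though no common cause exists; the model cannot tell the two data-generating stories apart.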

I have attached simulations below demonstrating examples of both cases.
require(lavaan)
require(MASS)
 
#Scenario 1: Latent variable does affect observed outcomes
#but lack of conditional independence means model fits poorly
 
  set.seed(123) #for replicability
  latent.var = rnorm(1000, 0, 1) #Standard normal latent variable
 
  #In the population, error correlations vary between 0 and 0.3 in size
  Sigma1 = matrix(runif(25, 0, 0.3), ncol = 5)
  Sigma1 = (Sigma1 + t(Sigma1))/2 #symmetrise: a covariance matrix must be symmetric
  diag(Sigma1) <- rep(1, times = 5) 
  errors1 = mvrnorm(n = 1000, mu = rep(0, times = 5), Sigma = Sigma1)
 
  #The latent variable has true effect of beta = 0.5 on all items
  data1 = as.data.frame(apply(errors1, 2, FUN = function(x){
    x+latent.var*0.5})) 
 
  #fit a unidimensional latent variable model to the data
  #assuming conditional independence
  mod1 = cfa('latent.var =~ V1 + V2 + V3 + V4 + V5', data = data1) 
  summary(mod1, fit.measures = TRUE) 
  #The model fits poorly per the chi-square and RMSEA
  #yet the latent variable does have positive effects 
  #on the observed outcomes
  #I.e., the observed measure IS valid
  #yet the latent variable model doesn't fit 
  #due to the lack of conditional independence.
 
 
#Scenario 2: No effects of latent variable on observed outcomes
#but lack of conditional independence means
#model fits well (one latent, five indicators)
 
  set.seed(123) #for replicability
 
  #There is a standard normal latent variable
  latent.var = rnorm(1000, 0, 1) 
 
  #In the population, the error correlation matrix is homogeneous 
  #with all error correlations equalling 0.3
  Sigma2 = matrix(rep(0.3, times = 25), ncol = 5)
  diag(Sigma2) <- rep(1, times = 5) 
  errors2 = mvrnorm(n = 1000, mu = rep(0, times = 5), Sigma = Sigma2)
 
  #The latent variable has no effect on any of the variables. 
  #(so observed variables are just the errors)
  data2 = as.data.frame(apply(errors2, 2, FUN = function(x){
    x+latent.var*0})) 
 
  #fit a unidimensional latent variable model to the data
  #assuming conditional independence
  mod2 = cfa('latent.var =~ V1 + V2 + V3 + V4 + V5', data = data2) 
  summary(mod2, fit.measures = TRUE) 
 
  #The model fits extremely well by any measure, 
  #and all the estimated effects of the latent variable on observed 
  #variables are positive and significant. 
  #Yet in reality the latent variable does not have a causal effect 
  #on observed outcomes; the measure is not valid.



Does this mean that latent variable modelling has no place in psychometric validation research? Probably not. But certainly I think we need to be more aware that the statistical models we're testing when we estimate latent variable models can be very different from the substantive hypotheses we're trying to test. When conditional independence is an assumption, rather than a part of the substantive theory we want to test, the fit of a latent variable model (whether good or poor) probably doesn't tell us an awful lot.