Wednesday, September 7, 2016

What if the mean difference is less than the standard error of measurement?


[Warning: Very Basic Stuff contained within]

Last weekend I was at a conference where there was an interesting tutorial-style talk on the Reliable Change Index (RCI). The RCI is generally used to try and determine if a single person has displayed 'real' change over the course of some intervention. (In other words, could an observed change in score plausibly have occurred purely due to random measurement error?)

I have some conceptual problems with the RCI in terms of whether it really tells us anything we want to know (which I'll save for another day), but it was an interesting and well-delivered presentation. That said, I want to pick on an idea that was mentioned in the talk, and that I've heard others repeat recently.

The idea relates to extending the RCI beyond single cases. Specifically, the speaker suggested that in a group study, if a mean difference is less than the standard error of measurement, the apparent effect might be spurious (i.e., purely the result of measurement error) - even if the mean difference is statistically significant. His reasoning was that a statistical significance test deals with sampling error, not measurement error.

Now, for a single case, a change in score that is less than the standard error of measurement is indeed one that would be quite consistent with a null hypothesis that the true score of the participant has not actually changed. (This isn't to say that this null is true, just that the observation isn't overtly inconsistent with it). The RCI framework formalises this idea further by:

  1. Using the SEM to calculate the standard error of the difference, Sdiff = sqrt(2*SEM^2). Since both the pre score and the post score are subject to errors of measurement, the standard error of the difference is sqrt(2) (roughly 1.41) times the SEM.
  2. Using 1.96*Sdiff as the cut-off for reliable change, drawing on the usual goal of a 5% Type 1 error rate (see the sketch just below).
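Here's what that calculation looks like in R; the reliability, standard deviation, and scores are made-up values, purely for illustration:

  sd.x <- 10                       #standard deviation of the measure (hypothetical)
  rxx <- 0.80                      #test-retest reliability (hypothetical)
  SEM <- sd.x*sqrt(1 - rxx)        #standard error of measurement
  Sdiff <- sqrt(2*SEM^2)           #standard error of the difference

  pre <- 50                        #hypothetical pre and post scores for one person
  post <- 58
  RCI <- (post - pre)/Sdiff
  abs(RCI) > 1.96                  #TRUE if the change exceeds the reliable change cut-off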
All good so far. However, if we are comparing two sample means, the picture changes. At each time point we now have multiple observations (for different people), each with its own measurement error. The mean of the measurement errors across people will itself have a variance that is smaller than the variance of the individual measurement errors. This should be intuitively obvious: The variance of the mean of a sample of observations is always less than the variance of the underlying variable itself (provided the sample has N > 1 and the observations aren't perfectly correlated).

In fact, when the sample is reasonably large, the standard error of the mean of the measurement errors will be drastically smaller than the standard error of measurement itself - for uncorrelated errors it shrinks to SEM/sqrt(N). So an observation that a mean difference is less than the standard error of measurement is not necessarily consistent with the null hypothesis of no true change occurring.
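A quick simulation makes the point (the SEM and sample size here are arbitrary):

  set.seed(1)
  SEM <- 5           #hypothetical standard error of measurement
  N <- 100           #hypothetical sample size

  #Simulate many samples of measurement errors and take each sample's mean
  error.means <- replicate(10000, mean(rnorm(N, 0, SEM)))

  sd(error.means)    #close to SEM/sqrt(N) = 0.5, far smaller than SEM = 5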

So do we need to calculate the standard error of the mean measurement error for our particular sample, and use that alongside a significance test (or Bayesian test) when assessing mean differences?

No.

Standard inferential tests do not only deal with sampling error. Any test you're likely to use to look at a mean difference will include an error term (often, but not necessarily, assumed to be normal and i.i.d. with mean zero). This error term bundles up any source of purely unsystematic random variability in the dependent variable - including both sampling error and unsystematic measurement error. So your standard inferential test already deals with unsystematic measurement error. Looking at the standard error of measurement in a group analysis tells you nothing extra about the likelihood that a true effect exists.
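To illustrate with an entirely made-up example: below, every participant's true score increases by 2 points, which is well under a SEM of 5 points, yet an ordinary paired t-test has no trouble here, because its error term already soaks up the unsystematic measurement error at both time points.

  set.seed(2)
  N <- 200
  SEM <- 5

  true.pre <- rnorm(N, 50, 10)               #true scores at baseline
  true.post <- true.pre + 2                  #every true score shifts by 2 (< SEM)

  obs.pre <- true.pre + rnorm(N, 0, SEM)     #observed scores include measurement error
  obs.post <- true.post + rnorm(N, 0, SEM)

  #The standard error of the mean difference is about sqrt(2*SEM^2/N) = 0.5,
  #so the ~2-point true change is readily distinguishable from zero
  t.test(obs.post, obs.pre, paired = TRUE)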

Monday, June 13, 2016

My talk at the M3 conference on Bayes factor null hypothesis tests

Recently I visited the USA for the first time. In between consuming vast quantities of pizza, Del Taco, and Blue Moon, I presented at the Modern Modeling Methods (M3) conference in Connecticut. (Slides included below).

The M3 conference was jam-packed full of presentations about technical issues relating to quite complex statistical models (sample presentation title: "Asymptotic efficiency of the pseudo-maximum likelihood estimator in multi-group factor models with pooled data"). The questions were serious, the matrix algebra was plentiful, and the word "SPSS" was never uttered.

My presentation was a little different: I wanted to talk about the default statistical methods used by everyday common-or-garden researchers: The ones who like t-tests, do their analyses by point-and-click, and think a posterior is something from a Nicki Minaj song (but may know lots and lots about things other than statistics!). I believe that the practices of everyday researchers are important to look at: If we want to fix the replication crisis, we need most researchers to be doing their research and analysis better - not just the ones who pass their time by discussing floating-point computation on Twitter.

And in terms of problems with the data analyses used by everyday researchers, the big issue that jumps out at me is the (mis)use of null hypothesis significance tests (NHST). Positions obviously vary on whether NHST should be used at all, but I think there's reasonable agreement amongst methodology geeks that the form of Fisher/Neyman-Pearson hybrid NHST that dominates current practice is really problematic. To me, a couple of the biggest problems are that:
  1. Hybrid NHST (and frequentist analysis in general) doesn't directly tell us how certain we can be that a particular hypothesis is true. It tells us P(Data | Hypothesis), but we'd generally rather like to know P(Hypothesis | Data).
  2. Hybrid NHST specifically has a problem with asymmetry of results: p < .05 is interpreted as meaning the null can be rejected, the alternate hypothesis is supported, publication results, and rejoicing is heard throughout the land. But p > .05 is often read as indicating only uncertainty: We can't say the null is true, only that there is insufficient evidence to reject it*. This may be a contributor to publication bias and the replication crisis: Part of the preference against publishing null results may be that they are often interpreted as not actually communicating anything other than uncertainty.
But what do we replace hybrid NHST with? Abandoning inferential statistics entirely is obviously foolish. There are several options (Deborah Mayo's error statistics approach, Bayesian estimation, etc.). But an approach that's gaining especial traction in psychology at the moment is that of using Bayes factor tests: Particularly, using Bayes factors to compare the evidence for a point null vs. a vague alternate hypothesis (although obviously this isn't the only way in which Bayes factors can be used).

My talk was a gentle critique of this approach to Bayes factor null hypothesis testing. And I do mean gentle - as I mention in slide 9 of the talk, I think Bayes factor tests of null hypotheses have some great advantages over conventional NHST. I mention several of these in my slides below, but perhaps the biggest advantage is that they compare two hypotheses in such a way that either hypothesis might end up supported - unlike hybrid NHST, where only the alternate can possibly "win"! So I wouldn't want to stomp on the flower of Bayes factor testing. And certainly my talk critiques only a particular implementation of Bayes factors: testing a point null vs. a non-directional diffuse alternate. Much of the fantastic development of methods and software by guys like Rouder, Morey and Wagenmakers can be applied more broadly than just to Bayes factor tests of point null hypotheses.

But I do think that Bayes factor tests of point null hypotheses have some problems that mean they may not be a suitable default approach to statistical analysis in psychology. (And currently this does seem to be the default way in which Bayes factors are applied).

To begin with, a Bayes factor is a statement only about the ratio of the likelihoods of the data under the null and alternate hypotheses. It isn't a statement about posterior odds or posterior probabilities. For a Bayesian analysis, that seems a little unsatisfying to me. Researchers presumably want to know how certain they can be that their hypotheses are correct; that is, they want to know about posterior probabilities (even if they don't use those words). In fact, researchers often try to interpret p values in a Bayesian way - as a statement about the probability that the null is true.

And I suspect that a similar thing will happen if Bayes factor null hypothesis tests become commonplace: People will (at least implicitly) interpret them as statements about the posterior odds that the alternate hypothesis is correct. In fact, I think that kind of interpretation is almost encouraged by the availability of qualitative interpretation guidelines for Bayes factors: The notion that Bayes factors can be interpreted directly - rather than converted first to posterior odds - seems to me to reinforce the idea that they're the endpoint of an analysis: that the Bayes factor directly tells us how certain we can be that a particular hypothesis is correct. I know that Jeff Rouder has explicitly argued against this interpretation, saying instead that researchers should report Bayes factors and let readers select and update their own priors (perhaps aided by suggestions from the researchers). In an ideal world, that's exactly how things would work, but I don't think it is realistic for everyday readers and researchers with limited statistical expertise.

So everyday researchers will naturally want a statement about the posterior (about how certain they can be that an hypothesis is correct) if doing a Bayes factor analysis. And I think it's likely that they will in fact interpret the Bayes factor as providing this information. But in what circumstance can a Bayes factor be interpreted as the posterior odds that the alternate hypothesis is correct? Well, this is fairly obvious: The Bayes factor equals the posterior odds in favour of the alternate hypothesis if we place a prior probability of 0.5 on the null being true, and 0.5 on the alternate being true.
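That conversion is just Bayes' rule in odds form: posterior odds = Bayes factor x prior odds. A couple of lines of R make the point (the Bayes factor here is an arbitrary example value):

  BF <- 6                       #hypothetical Bayes factor favouring the alternate over the null
  prior.odds <- 0.5/0.5         #a 50/50 prior on the two hypotheses
  post.odds <- BF*prior.odds    #equals the Bayes factor itself
  post.odds/(1 + post.odds)     #posterior probability of the alternate: 6/7, about 0.86

  #With a more sceptical prior, the same Bayes factor implies a weaker conclusion
  prior.odds <- 0.2/0.8
  post.odds <- BF*prior.odds
  post.odds/(1 + post.odds)     #posterior probability of the alternate: 1.5/2.5 = 0.6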

The thing is... that's a really weird prior. It's a prior that takes a "tower and hill" shape (see slides 13 and 14), and suggests that one particular value of the effect size (δ = 0) is infinitely** more likely than any other value. It is absolutely and definitely a very informative prior, and yet also one that seems unlikely to represent our actual state of prior knowledge about any given parameter.

So this is a problematic prior - and when researchers use the Bayes factor as the endpoint of a statistical analysis without explicitly drawing on prior information, I would argue that this prior implicitly underlies the resulting conclusions. For this reason I don't think that a Bayes factor test of a point null hypothesis vs. a non-directional alternate, with the Bayes factor serving as the major statistical endpoint of the analysis, is an ideal default approach in psychology.

Ok... but what would be a better default approach? The relatively slight modification I suggested in the talk was to take into account the fact that most (not all) hypotheses tested in psychology are directional. Therefore, instead of focusing on the "strawman" of a point null hypothesis, we could focus on testing whether a particular effect is positive or negative.

This is most obviously achieved by using MCMC Bayesian estimation of a parameter and then tallying up the proportion of draws from the posterior that are greater than zero (i.e., the posterior probability that the effect is positive). However, it could also be achieved by using Bayes factors, by comparing an hypothesis that the effect is positive to an hypothesis that it is negative (with a half-Cauchy prior for each, say). So the statistical methods and programs (e.g., JASP) developed by supporters of Bayes factor testing could readily be adapted to this slightly altered goal. Either way, this would allow us to dispense with our inordinate focus on the point null, and directly test the hypothesis that's likely to be of most interest: that there is an effect in a specific direction.
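As a rough sketch of what this could look like in practice, here is one way to do it with the BayesFactor R package (the data are simulated purely for illustration, and the package's default 'medium' Cauchy scale is used rather than a carefully chosen prior):

  library(BayesFactor)

  set.seed(3)
  x <- rnorm(50, 0.3, 1)    #simulated scores for group 1
  y <- rnorm(50, 0.0, 1)    #simulated scores for group 2

  #Compare the hypothesis that delta > 0 with the hypothesis that delta < 0
  #(each directional hypothesis takes half of the Cauchy prior)
  bf <- ttestBF(x = x, y = y, nullInterval = c(0, Inf))
  bf[1]/bf[2]               #Bayes factor for a positive vs. a negative effect

  #Or estimate the posterior directly and tally the draws above zero
  post <- ttestBF(x = x, y = y, posterior = TRUE, iterations = 10000)
  mean(post[, "delta"] > 0) #posterior probability that the effect is positive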

However, in performing directional tests we need to take into account the fact that most effects in psychology are small: If we were to use uninformative priors, this would lead to excess confidence in our statements about the directionality of effects. That is, the posterior probability that the true effect is in the same direction as that found in the sample will be closer to 1 with an uninformative prior than it would be if we took into account the fact that most effects are small. I don't think it's feasible to expect everyday researchers to select their own informative priors, but methodologists could suggest default informative priors for psychology: Given that we know only that a particular parameter describes a psychological effect, which values of the parameter are most likely?

To achieve this, I suggest that we could find default prior variances for common analyses in psychology empirically. As a very rough idea of how this could work, I point out that Richard et al.’s (2003) meta-meta-analysis of 25,000 social psychological studies found a mean absolute effect of r = .23, which equates to a Cohen's d value of 0.43. We might use this information to set a default prior for a two-sample means comparison, for example, of Cauchy (0, 0.43), implying a 50% chance that the true standardised effect δ is somewhere between -0.43 and +0.43. (Note: There could be better choices of meta-meta-analysis to use that aren't restricted to social psych, and we would want to correct for publication bias, but this serves as a rough illustration of how we could set a default prior empirically). Use of this prior would allow us to make statements about the posterior probability that an effect is in a particular direction, without the statement being made overconfident due to use of an uninformative prior.
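As a sketch of how such a default prior might be used, suppose we summarise a two-sample comparison by its t statistic and approximate the posterior for δ on a grid (the t value and sample sizes below are invented):

  t.obs <- 2.1                       #hypothetical observed t statistic
  n1 <- 40; n2 <- 40                 #hypothetical group sizes
  df <- n1 + n2 - 2

  #Grid over the standardised effect size delta
  delta <- seq(-3, 3, by = 0.001)

  #Likelihood of the observed t for each delta (non-central t),
  #multiplied by the Cauchy(0, 0.43) default prior, then normalised
  prior <- dcauchy(delta, location = 0, scale = 0.43)
  likelihood <- dt(t.obs, df = df, ncp = delta*sqrt(n1*n2/(n1 + n2)))
  posterior <- prior*likelihood
  posterior <- posterior/sum(posterior)

  sum(posterior[delta > 0])          #posterior probability that the effect is positive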

So that was my idea, and here are the slides from my presentation.

But. I'm definitely not perfectly happy with the directional approach I suggested. It deals with the problematic implicit prior seemingly underlying Bayes factor tests of null hypotheses. And unlike either NHST or non-directional Bayes factor testing it also quantifies how certain we can be that the true parameter falls in a particular direction. But three problems remain:
  • The approach doesn't distinguish trivially small effects from substantial ones (one could maybe deal with this by adding a default ROPE?)
  • Like most alternatives to NHST being discussed currently, it relies heavily on the use of standardised effect sizes, which have quite substantial problems.
  • The approach doesn't prevent - and in fact facilitates - the testing of theories that make only vague directional predictions. Good theories would do better than suggesting only the sign of a parameter, and nothing about its magnitude.
To a large extent I think these three problems cannot be resolved by statistics alone: We won't be able to do better than standardised effect sizes as long as psychologists mostly use scales that lack clear units of measurement, and we won't be able to do better than directional analyses while most psychological theories make only directional predictions.

But at the least I think that Bayes factor tests are a better default approach than hybrid NHST. And I hesitantly suggest that the directional approach I outline is in turn a slightly better default approach than using Bayes factors to test point null hypotheses against non-directional alternates.


*This problem doesn't apply to the original form of Neyman-Pearson testing, where p > .05 is read as indicating support for a decision that the null is true.
**Thanks @AlxEtz for the correction.

Wednesday, May 4, 2016

Is good fit of a latent variable model positive evidence for measurement validity?

This week I reviewed a paper attempting to explain why latent variable modelling is useful for working out whether a measure is valid or not. (Latent variable modelling meaning either exploratory or confirmatory factor analysis). The author drew on Borsboom, Mellenbergh and van Heerden's theory of validity.

In Borsboom et al's theory, a test is a valid measure of an attribute if:
 1) The attribute exists
 2) Variation in the attribute causes variation in the measurement outcomes.

Therefore, the author suggested that latent variable models - which test models in which unobserved "latent" variables have causal effects on observed indicators - are useful for testing the validity of a measure. For example, if you hypothesise that a test is a valid measure of a specific individual attribute, and you fit a unidimensional CFA model and find "good" fit, then this supports the idea that the measure is valid. (We'll set aside the controversy surrounding what constitutes "good" fit of a latent variable model for the moment).

Now I don't want to pick on the paper I reviewed too much here - this is a line of reasoning that I suspect a lot of psychologists explicitly or implicitly follow when fitting latent variable models (or measurement models, anyway). I've certainly published conventional psychometric papers that are at least indirectly based on this line of reasoning (example). But the more I think about it, the more it seems to me that this line of reasoning just doesn't work at all.

Why? The problem is the auxiliary hypothesis of conditional independence.

When we're examining the validity of a set of items as a measure of some attribute, we will typically have a substantive hypothesis that variation in the attribute causes variation in the item responses. This is fine. The problem is that this hypothesis is only testable in conjunction with the hypothesis that, controlling for the effects of the latent attribute, the item responses are uncorrelated with each other (the assumption of conditional independence). At most, we might be able to free some of these error correlations, but we cannot allow all of them to be freely estimated, otherwise the model will be unidentifiable.
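For instance, in lavaan syntax (the package used in the simulations further below), freeing a single error correlation looks like this; the data set name is just a placeholder:

  library(lavaan)

  #Unidimensional model for five items, with one residual correlation
  #(between V1 and V2) freed. Freeing all ten residual correlations
  #would leave the model unidentified.
  model <- '
    attribute =~ V1 + V2 + V3 + V4 + V5
    V1 ~~ V2
  '
  #fit <- cfa(model, data = my.item.data)   #my.item.data is a placeholder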

Problematically, the assumption of conditional independence is typically not part of the substantive hypothesis we are testing - if variation in the attribute causes variation in the measurement outcomes, then the measure is valid, regardless of whether conditional independence holds. There are occasional cases where we are genuinely interested in trying to explain correlations between item scores - say, the g explanation for the positive manifold of IQ tests - but for the most part, the assumption of conditional independence is just an assumption we make for convenience, not a part of the substantive theory. In Popper's terms, conditional independence is an auxiliary hypothesis.

Importantly, conditional independence is also an auxiliary hypothesis that typically isn't very plausible: For a pair of items, it means we assume that responses to the two items have exactly zero effect on each other, and that aside from the specified latent variable, there exists no other variable whatsoever that has any direct effect on the responses to both items.

What this all means is that if an hypothesised latent variable model doesn't fit the data well, it could be because the test isn't a valid measure of the attribute - but it could also be that the test is valid and the assumption of conditional independence simply doesn't hold: In other words, the items have relationships with one another that aren't perfectly explained by shared effects of the latent variable.

To some extent, I suspect researchers are aware of this: It might be part of the reason why most researchers use fairly lax standards for assessing the fit of latent variable models, and why many researchers are reasonably open to post-hoc modifications to models to try and account for problematic error correlations.

But what I think is less widely appreciated is that breaches of conditional independence can also lead to the opposite problem: A finding that a latent variable model fits "well", with significant, positive estimated loadings for all of the items, despite the latent variable actually having no effect on any of them. For a unidimensional model, this can occur when the error correlations are homogeneous but the latent variable has no true effect.

I have attached simulations below demonstrating examples of both cases.
require(lavaan)
require(MASS)
 
#Scenario 1: Latent variable does affect observed outcomes
#but lack of conditional independence means model fits poorly
 
  set.seed(123) #for replicability
  latent.var = rnorm(1000, 0, 1) #Standard normal latent variable
 
  #In the population, error correlations vary between 0 and 0.3 in size
  #(symmetrise the matrix so that it is a valid correlation matrix)
  Sigma1 = matrix(runif(25, 0, 0.3), ncol = 5)
  Sigma1[upper.tri(Sigma1)] <- t(Sigma1)[upper.tri(Sigma1)]
  diag(Sigma1) <- rep(1, times = 5)
  errors1 = mvrnorm(n = 1000, mu = rep(0, times = 5), Sigma = Sigma1)
 
  #The latent variable has true effect of beta = 0.5 on all items
  data1 = as.data.frame(apply(errors1, 2, FUN = function(x){
    x+latent.var*0.5})) 
 
  #fit a unidimensional latent variable model to the data
  #assuming conditional independence
  mod1 = cfa('latent.var =~ V1 + V2 + V3 + V4 + V5', data = data1) 
  summary(mod1, fit.measures = TRUE) 
  #The model fits poorly per the chi-square and RMSEA
  #yet the latent variable does have positive effects 
  #on the observed outcomes
  #I.e., the observed measure IS valid
  #yet the latent variable model doesn't fit 
  #due to the lack of conditional independence.
 
 
#Scenario 2: No effects of latent variable on observed outcomes
#but lack of conditional independence means
#model fits well (one latent, five indicators)
 
  set.seed(123) #for replicability
 
  #There is a standard normal latent variable
  latent.var = rnorm(1000, 0, 1) 
 
  #In the population, the error correlation matrix is homogenous 
  #with all error correlations equalling 0.3
  Sigma2 = matrix(rep(0.3, times = 25), ncol = 5)
  diag(Sigma2) <- rep(1, times = 5) 
  errors2 = mvrnorm(n = 1000, mu = rep(0, times = 5), Sigma = Sigma2)
 
  #The latent variable has no effect on any of the variables. 
  #(so observed variables are just the errors)
  data2 = as.data.frame(apply(errors2, 2, FUN = function(x){
    x+latent.var*0})) 
 
  #fit a unidimensional latent variable model to the data
  #assuming conditional independence
  mod2 = cfa('latent.var =~ V1 + V2 + V3 + V4 + V5', data = data2) 
  summary(mod2, fit.measures = TRUE) 
 
  #The model fits extremely well by any measure, 
  #and all the estimated effects of the latent variable on observed 
  #variables are positive and significant. 
  #Yet in reality the latent variable does not have a causal effect 
  #on observed outcomes; the measure is not valid.


Does this mean that latent variable modelling has no place in psychometric validation research? Probably not. But certainly I think we need to be more aware that the statistical models we're testing when we estimate latent variable models can be very different from the substantive hypotheses we're trying to test. When conditional independence is an assumption, rather than a part of the substantive theory we want to test, the fit of a latent variable model (whether good or poor) probably doesn't tell us an awful lot.