*similar* mean values on each covariate - but not exactly the same mean values. These differences in mean covariate values will be larger for smaller samples. As such, with finite sample sizes, a difference we observe between the groups on the dependent variable might not be due to the manipulated variable at all, but rather to pre-existing differences on the covariates.
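To put a rough number on "larger for smaller samples", here is the standard result for a difference between two group means. It assumes (my simplification, for illustration) that participants are independent draws from a population in which covariate $X$ has standard deviation $\sigma$, and that they are randomly assigned to groups of size $n_1$ and $n_2$:

$$
\mathrm{E}\left[\bar{X}_1 - \bar{X}_2\right] = 0, \qquad \mathrm{SD}\left(\bar{X}_1 - \bar{X}_2\right) = \sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}
$$

With equal groups of size $n$ this is $\sigma\sqrt{2/n}$: about $0.45\sigma$ for $n = 10$ per group, but only about $0.06\sigma$ for $n = 500$ per group. The expected imbalance is zero at every sample size; only its spread shrinks.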

Based on a set of simulations, Nguyen et al. conclude that N > 1000 is needed to protect against bias occurring due to covariate imbalance. Subsequently, a preprint by Gjalt-Jorn Peters used simulation studies to estimate the probability that randomisation will produce groups that are somewhat imbalanced on at least one of a set of covariates, for a given sample size; his findings suggested that this probability is high (~45%) for the studies contained in the Reproducibility Project: Psychology.

Polite debate ensued on social media (e.g., here). This post is an attempt to explain:

- Why some of us felt that it was invalid to say that small samples result in "bias" in this scenario
- But why the claim of "bias" isn't wrong per se (it just hinges on a different definition of bias) - following a brief email conversation I had with Long Nguyen
- Why I don't think that we need samples with N > 1000 - i.e., why I think the problem raised in these studies is actually dealt with by the inferential statistics we use already, and thus isn't something we really need to worry about.

Firstly, let's consider the idea of *bias*. To a statistician, a statistic is a biased estimator of a parameter if the expected value of the statistic is different from the true value of the parameter. Because the idea of bias refers to an expected value, it's explicitly a long-run concept. For example, imagine I am interested in estimating the average height of adult New Zealanders. If I randomly draw a sample of 30 individuals from this population and calculate their average height, I will get an estimate that's different from the true average height of the population. But if I did this again and again, and calculated the average estimate across the repeated samplings, this average estimate would be close to the true mean... and with an infinite number of repeated samplings, it would be exactly the same as the true population mean.
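The height example can be sketched as a quick simulation. This is only an illustration: the population mean and standard deviation below are made-up values (not real New Zealand data), and heights are assumed to be normally distributed.

```python
import random

def simulate_sample_means(pop_mean, pop_sd, n, reps, rng):
    """Draw `reps` random samples of size n and return each sample's mean."""
    means = []
    for _ in range(reps):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        means.append(sum(sample) / n)
    return means

# Hypothetical population values, chosen only for illustration.
TRUE_MEAN_CM = 170.0
TRUE_SD_CM = 10.0

rng = random.Random(1)
sample_means = simulate_sample_means(TRUE_MEAN_CM, TRUE_SD_CM, n=30, reps=5000, rng=rng)

# Any single estimate misses the true mean (random error)...
print("first three estimates:", [round(m, 2) for m in sample_means[:3]])

# ...but the average across repeated samplings sits close to it (no bias).
long_run_mean = sum(sample_means) / len(sample_means)
print("average of all estimates:", round(long_run_mean, 2))
```

Individual estimates scatter around the true value, while their long-run average homes in on it. That is what "unbiased" means here.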

Bias is different from random error: Any given estimate of a parameter will contain some error, but bias refers to a *systematic* tendency to over- or under-estimate the parameter.

When we randomly assign participants to conditions, the random assignment process means that there is no systematic tendency to assign participants with a higher level of some covariate to one condition rather than the other. Consequently, the expected value of each covariate is equal across conditions, and the estimated effect of the treatment is an unbiased estimator of the true treatment effect. I.e., on *average across replications*, the estimate of the treatment effect will not tend to systematically under- or over-estimate the true treatment effect. This is why some of us were surprised to see Nguyen et al. and Peters claim that estimates of the treatment effect are "biased" with small samples even if randomisation is used: By a conventional definition of bias, this would not be true. Estimates of the treatment effect will contain some error, and randomly occurring covariate imbalance is a cause of such error, but this isn't really *bias* - or not in the sense that we usually think of bias.
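A small simulation can make the distinction between error and bias concrete. This is a sketch under assumptions of my own: 20 participants randomised into two equal groups, a true treatment effect of 1, and a binary covariate (gender) that shifts the outcome by 2 but is left out of the analysis. All of these numbers are hypothetical.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def one_experiment(rng, n_total=20, true_effect=1.0, gender_effect=2.0):
    """Return (error in the effect estimate, gender imbalance) for one study."""
    # Omitted covariate: gender shifts the outcome but is never modelled.
    is_male = [rng.random() < 0.5 for _ in range(n_total)]
    # Random assignment into two equal groups.
    order = list(range(n_total))
    rng.shuffle(order)
    treated = set(order[: n_total // 2])
    y_t, y_c, g_t, g_c = [], [], [], []
    for i in range(n_total):
        y = gender_effect * is_male[i] + rng.gauss(0.0, 1.0)
        if i in treated:
            y_t.append(y + true_effect)
            g_t.append(is_male[i])
        else:
            y_c.append(y)
            g_c.append(is_male[i])
    estimate = mean(y_t) - mean(y_c)
    imbalance = mean(g_t) - mean(g_c)  # difference in proportion of men
    return estimate - true_effect, imbalance

rng = random.Random(2)
results = [one_experiment(rng) for _ in range(10000)]

# Averaged over replications, the error is close to zero: no bias.
mean_error = mean([err for err, _ in results])
# Within a replication, the error follows the direction of the imbalance.
err_more_men_treated = mean([err for err, imb in results if imb > 0])
err_more_men_control = mean([err for err, imb in results if imb < 0])

print("mean error overall:", round(mean_error, 3))
print("mean error, more men in treatment group:", round(err_more_men_treated, 3))
print("mean error, more men in control group:", round(err_more_men_control, 3))
```

Imbalance pushes individual estimates up or down depending on which way it happens to fall, but because it is equally likely to fall either way, those errors cancel in the long run.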

But this is where the definitional issue comes in: Nguyen et al. use a term called "accidental bias", which I believe is due to Efron (1971), with a more accessible discussion in Lachin (1988). Efron and Lachin's articles - though they go over my head in a lot of ways - seem to be largely concerned with assignment methods other than simple/complete randomisation - e.g., assignment methods such as biased coin designs, used to ensure equal sample size across experimental cells. Efron and Lachin both use the term *accidental bias* to refer to a difference between the true and estimated treatment effect that occurs due to imbalance on omitted covariates. Importantly, this bias has an expectation of zero for simple randomisation as well as for related assignment methods (permuted block, biased coin, etc.); in other words, it isn't bias at all in the conventional sense.

*However*, the variance of the accidental bias does differ for these different random-like assignment methods: In other words, some of the assignment methods they considered tend to be more susceptible to (random) inaccuracy in their estimates due to covariate imbalance than others.

Crucially, "accidental bias" is a very different form of bias: It is *not* a subtype of "bias" in the sense used above, but something else entirely (which we would often consider simply as a source of random error). Distinguishing error in estimation arising from covariate imbalance from other sources of error in estimation was something that was relevant for Efron and Lachin when considering the properties of different random-like assignment methods, but perhaps not so relevant outside of this specific methodological focus. So Nguyen et al. and Peters aren't really *wrong* to say that estimates of treatment effects can be biased with small samples with randomisation; it's just that this claim hinges on a somewhat unusual definition of bias. This is something both papers could make clearer, I think.

But what if we set the semantic issues aside? Is it not the case that random assignment can sometimes leave us with sub-samples that are drastically imbalanced on some covariate, and that this might result in a crappy inference about the treatment effect (regardless of whether we think of the cause as "bias" or "error")? Well, yes. And if we concluded that a treatment definitely has some effect size simply because that was the difference in our sample, this would be a major problem.

The thing is, though, that this isn't how causal inferences are actually done: In reality, we will always use some kind of statistical inference (whether via *p* values, confidence intervals, Bayes factors, Bayesian estimation, whatever). It's important to recognise here that inferential statistics have a crucial role in causal inference, even if we're only making an inference about the effect that a manipulation had within our sample itself: It is *not* the case that the estimated effect in the sample perfectly captures the true effect in that sample, and that we then use inferential statistics to generalise beyond the sample. Rather, we need inferential statistics even to make inferences about effects *within* the sample.

As it happens, virtually any inferential statistic we are likely to use to estimate a treatment effect will include a random error term - i.e., the model will take into account variability in the dependent variable that is not attributable to the independent variable. And randomly-occurring imbalance on covariates is just another source of random error. As such, the inferential statistics we use to estimate treatment effects *don't* assume a complete lack of covariate imbalance: Rather, they assume only that the expected value of the random error term is zero for each observational unit. This assumption would be breached if the assignment process was such that it systematically tended to produce different covariate levels in different groups, but that isn't the case with randomisation.

Let's look at how this plays out in practice, in a case where there is a highly relevant omitted covariate (gender in a study of the effect of a diet), randomisation is used, sample size is small, and the true effect size is zero: Does "accidental bias" increase our error rate?
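A scenario like this can be simulated along the following lines. Treat it as a minimal sketch with assumed settings rather than a definitive implementation: 10 participants per group, gender shifting the outcome by 2 residual standard deviations, and an ordinary pooled-variance *t* test that ignores gender entirely. To keep the example free of external libraries, the critical value of *t* is hard-coded.

```python
import math
import random

N_PER_GROUP = 10
# Two-sided 5% critical value of Student's t with 2 * 10 - 2 = 18 df.
# Hard-coded to stay stdlib-only; update it if N_PER_GROUP changes.
T_CRIT = 2.101

def mean(xs):
    return sum(xs) / len(xs)

def pooled_t(a, b):
    """Independent-samples t statistic with pooled variance."""
    ma, mb = mean(a), mean(b)
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((x - mb) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (len(a) + len(b) - 2)
    return (ma - mb) / math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))

def null_study_rejects(rng, gender_effect=2.0):
    """Simulate one diet study with a true effect of zero; True if p < .05."""
    n_total = 2 * N_PER_GROUP
    # Outcome depends on gender (omitted from the analysis), never on the diet.
    outcomes = [gender_effect * (rng.random() < 0.5) + rng.gauss(0.0, 1.0)
                for _ in range(n_total)]
    # Random assignment: after shuffling, first half is diet, rest is control.
    rng.shuffle(outcomes)
    diet, control = outcomes[:N_PER_GROUP], outcomes[N_PER_GROUP:]
    return abs(pooled_t(diet, control)) > T_CRIT

rng = random.Random(3)
reps = 5000
type1_rate = sum(null_study_rejects(rng) for _ in range(reps)) / reps
print("proportion of null studies with p < .05:", round(type1_rate, 3))
```

The quantity to look at is the proportion of simulated studies that reject the null hypothesis, compared against the alpha level of 0.05.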

In other words, in this scenario of a truly zero effect size, where there is an omitted and highly relevant covariate, and a tiny sample size, our Type 1 error rate remains at 5%. If we conclude that there is a true causal effect when *p* < 0.05, we will be in error sometimes - and covariate imbalance will be to blame for some of those errors - but the rate of such errors is exactly as expected given the alpha level.

How about power? Let's imagine a scenario where we have a bigger sample, and the treatment does work, so we expect good power:
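Here is one way to sketch that power scenario, again with settings I have assumed for illustration: 64 participants per group and a true effect of 0.5 standard deviations, where the standard deviation is that of the post-test scores *including* the variability contributed by gender. A conventional power calculation for that design gives roughly 80% power, so that is the benchmark. As before, gender is omitted from the analysis and the critical value of *t* is hard-coded to avoid external libraries.

```python
import math
import random

N_PER_GROUP = 64
# Two-sided 5% critical value of Student's t with 2 * 64 - 2 = 126 df.
# Hard-coded to stay stdlib-only; update it if N_PER_GROUP changes.
T_CRIT = 1.979
STD_EFFECT = 0.5      # hypothetical true effect, in post-test SD units
GENDER_EFFECT = 2.0   # hypothetical shift in the outcome for men

# Post-test SD: residual variance of 1 plus variance due to a 50/50 covariate.
TOTAL_SD = math.sqrt(1.0 + GENDER_EFFECT ** 2 * 0.25)

def mean(xs):
    return sum(xs) / len(xs)

def pooled_t(a, b):
    """Independent-samples t statistic with pooled variance."""
    ma, mb = mean(a), mean(b)
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((x - mb) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (len(a) + len(b) - 2)
    return (ma - mb) / math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))

def baseline_score(rng):
    """Outcome before any treatment effect: gender shift plus noise."""
    return GENDER_EFFECT * (rng.random() < 0.5) + rng.gauss(0.0, 1.0)

def study_detects_effect(rng):
    """Simulate one study with a real treatment effect; True if p < .05."""
    # Participants are independent draws, so filling the groups in turn is
    # equivalent to random assignment; gender never enters the analysis.
    treated = [baseline_score(rng) + STD_EFFECT * TOTAL_SD
               for _ in range(N_PER_GROUP)]
    control = [baseline_score(rng) for _ in range(N_PER_GROUP)]
    return abs(pooled_t(treated, control)) > T_CRIT

rng = random.Random(4)
reps = 3000
estimated_power = sum(study_detects_effect(rng) for _ in range(reps)) / reps
print("proportion of studies with p < .05:", round(estimated_power, 3))
```

The proportion of significant results is the simulated power, to be compared with the roughly 80% that the power calculation promises.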

Yes - power remains as advertised despite the omitted covariate, so our error rate isn't harmed. Of course, if there was greater variability in post-test scores than we specified in our power calculation - perhaps as a result of an omitted covariate - then our estimate of power for a given unstandardised effect size would be incorrect. But in practice we typically scale our hypothesised effect size relative to the post-test variability anyway, so this isn't really a worry.

The upshot of all this is that, sometimes by sheer bad luck, we will end up with groups that are quite imbalanced with respect to some important covariate - even if we use random assignment. And this may, occasionally, cause us to make an inferential error (i.e., an error in our inference about whether a causal effect exists). However, given that we use inferential statistics to make such causal inferences, we can control the probability of making an error. Our existing inferential tools already allow us to do this: They implicitly take into account the possibility of covariate imbalance.

Do we need N of 1000 or more for a clinical trial, as Nguyen suggests? No. The N > 1000 rule isn't based on a concern of direct relevance to inference (e.g., to achieve some specific level of power, or to achieve some specific level of precision of confidence intervals, or to ensure some maximum discrepancy between the true and estimated effect sizes). Rather, the rule is intended to minimise to a very tiny magnitude one source of variability in the estimated treatment effect, while ignoring other sources of variability. This doesn't seem like a good basis for sample size determination.

If we want to keep our Type 2 error rate to some very small rate, or ensure our confidence intervals reach some specific level of precision, we already have the inferential tools necessary to work out what sample size we require. Sometimes N > 1000 may be necessary to achieve some specific power or precision goal, and sometimes a much smaller sample size will do.