The fallacy of the large sample size argument
Another bureaucratic solution to the problems of science is increasingly being
heard. This is the idea that papers reporting experiments with small sample
sizes should be "desk rejected". This is a relatively new phenomenon. Recently
I have had two referees on two different papers raise this point, and I have
seen tweets suggesting that anything less than n = 100 won't do. This is
supposed to contribute to solving the 'crisis of reproducibility' - especially
in psychology and related disciplines.
Such proposals for bureaucratic solutions with fixed norms take no account of
elementary statistics. The sample size needed for estimation or hypothesis
testing depends on the variance of the variables involved.
Take the simplest example - a normally distributed random variable X with
unknown mean m and known variance s^2. The experiment will deliver a sample of
n independent observations on X. The goal is to estimate m with a 95%
confidence interval, u to v, and critically we want to choose a sample size n
such that v - u < e. The 95% confidence interval is xbar ± 1.96*s/sqrt(n),
where xbar is the sample mean, so its width is v - u = 2*1.96*s/sqrt(n) =
3.92*s/sqrt(n). Therefore we require 3.92*s/sqrt(n) < e, leading to
n > (3.92*s/e)^2.
The sample size needed is proportional to the variance. For example, suppose
that s = e. Then a sample size of 16 is good enough. If s = 0.80*e, then a
sample size of 10 would be good enough. A sample size of 100 would only be
needed if s were 2.55 times e. It is clearly the variance in relation to the
required error size that matters.
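For readers who want to check the arithmetic, here is a minimal Python sketch that reproduces these numbers; the helper name required_n and the choice of expressing s as a multiple of e are just illustrative, not part of the original argument.

```python
import math

def required_n(s, e):
    """Smallest n with 3.92 * s / sqrt(n) < e, i.e. n > (3.92 * s / e) ** 2."""
    return math.floor((3.92 * s / e) ** 2) + 1

# Express s as a multiple of e (the three cases discussed above).
for ratio in [1.0, 0.80, 2.55]:
    print(f"s = {ratio:.2f} * e  ->  n = {required_n(ratio, 1.0)}")
# s = 1.00 * e  ->  n = 16
# s = 0.80 * e  ->  n = 10
# s = 2.55 * e  ->  n = 100
```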
This really
matters because running experiments is typically very expensive. If a sample
size of 10 would do, why use 100? It would only be to satisfy a slogan, not for
any real contribution to more reliable statistical inference.
The supposed
failure of reproducibility is built into the system and will not be overcome by
adopting slogans. In classical statistical inference with a 5% significance
level, the theory itself implies that a true null hypothesis will be wrongly
rejected approximately 5% of the time. No researcher can know if their experiment
is one of the 5%! This is the point of repeat studies. Yet repeat studies are
hard to publish (no novelty) and if indeed they do find results at odds with
the original study then the researchers on the original study may have their
integrity questioned. Yet who can know whether the second study is one of the
20% that falsely fail to reject the null hypothesis when a real effect is
present (assuming a power of 80%)!
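To see how these two error rates are built into the machinery, here is a small simulation sketch in Python; the sample size of 25, the unit standard deviation, and the effect size chosen to give roughly 80% power are illustrative assumptions, not values from any particular study.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 25, 1.0
z_crit = 1.96            # two-sided 5% critical value
n_experiments = 100_000

# Case 1: the null hypothesis (mean = 0) is actually true.
samples = rng.normal(0.0, sigma, size=(n_experiments, n))
z = samples.mean(axis=1) / (sigma / np.sqrt(n))
print("False rejection rate under the null:", np.mean(np.abs(z) > z_crit))   # ~0.05

# Case 2: a real effect, sized so the two-sided z-test has ~80% power
# (effect = (z_0.975 + z_0.80) * sigma / sqrt(n), with z_0.80 ~ 0.8416).
effect = (z_crit + 0.8416) * sigma / np.sqrt(n)
samples = rng.normal(effect, sigma, size=(n_experiments, n))
z = samples.mean(axis=1) / (sigma / np.sqrt(n))
print("Failure-to-reject rate at 80% power:", np.mean(np.abs(z) <= z_crit))  # ~0.20
```

No single researcher can tell which of their experiments fall into those 5% or 20% slices; only repetition reveals it.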
The Bayesian approach does not have these problems. An experiment results in
probabilities about hypotheses, or probability distributions over parameters,
rather than fixed answers or conclusions. As more data are collected through
repeat studies, these probabilities are updated. It should also be noted that
an argument similar to the example above holds for finding a Bayesian credible
interval.
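A minimal sketch of this kind of updating, assuming a normal mean with known variance, a conjugate normal prior, and three repeat studies of n = 10 each (all of these choices are illustrative, not prescribed by the argument), might look like this:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0                       # known standard deviation of X
true_mean = 0.3                   # unknown in practice; used here only to simulate data
post_mean, post_var = 0.0, 4.0    # a weakly informative normal prior for m

for study, n in enumerate([10, 10, 10], start=1):   # three repeat studies of n = 10
    x = rng.normal(true_mean, sigma, size=n)
    # Conjugate normal-normal update of the posterior for m.
    post_precision = 1 / post_var + n / sigma**2
    post_mean = (post_mean / post_var + x.sum() / sigma**2) / post_precision
    post_var = 1 / post_precision
    lo = post_mean - 1.96 * np.sqrt(post_var)
    hi = post_mean + 1.96 * np.sqrt(post_var)
    print(f"After study {study}: 95% credible interval ({lo:.3f}, {hi:.3f})")
```

Each repeat study simply narrows the credible interval; no study is branded a "failure to replicate", and the same variance-versus-required-precision trade-off governs how much each study contributes.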
To paraphrase a well-known saying: "It's the variance, stupid!"