Mel Slater's Presence Blog

Thoughts about research and radical new applications of virtual reality - a place to write freely without the constraints of academic publishing, and have some fun.

14 November, 2016

The fallacy of the large sample size argument

Another bureaucratic solution to problems of science is being increasingly heard. This is the idea that papers that report experiments with small sample sizes should be "desk rejected". This is a relatively new phenomenon. Recently I have had two referees on two different papers mention this point. I have seen tweets suggesting that anything less than n = 100 won't do. This is supposed to contribute to the solution of the 'crisis of reproducibility' - especially in psychology and related disciplines.

Such proposals for bureaucratic solutions with fixed norms do not take account of elementary statistics. The sample size needed for estimation or hypothesis testing is relative to the variance of the variables involved. 

Take the simplest example - where a normally distributed random variable X has unknown mean m and known variance s^2. The experiment will deliver a sample of n independent observations on X. The goal is to estimate m with a 95% confidence interval, u to v. Critically, we want to choose a sample size n such that v - u < e, where e is the widest interval we are prepared to accept.

The 95% confidence interval is xbar ± 1.96*s/sqrt(n), where xbar is the sample mean, so its width is v - u = 2*1.96*s/sqrt(n) = 3.92*s/sqrt(n).

Therefore we require:
3.92s/sqrt(n) < e, leading to n > (3.92s/e)^2. 

For a fixed e, the sample size needed is proportional to the variance. For example, suppose that s = e. Then n must exceed 3.92^2, which is about 15.4, so a sample size of 16 is good enough. Suppose that s = 0.80*e, then a sample size of around 10 would be good enough. A sample size of 100 would be needed only if s were about 2.55 times e. It is clearly the variance in relation to the required error size that is important.
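The arithmetic is easy to check in a few lines of Python. This is only a sketch of the bound n > (3.92s/e)^2 for the known-variance case; the function name min_sample_size and its arguments are made up for illustration, and the 1.96 is taken from the standard library's normal distribution rather than typed in.

```python
import math
from statistics import NormalDist


def min_sample_size(s, e, confidence=0.95):
    """Smallest integer n whose confidence interval width is strictly below e.

    Assumes a normal variable with known standard deviation s, so the
    interval width is 2*z*s/sqrt(n) and we need n > (2*z*s/e)^2.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    bound = (2 * z * s / e) ** 2
    return math.floor(bound) + 1  # first integer strictly above the bound


# The three cases from the text, with e fixed at 1 so that s is the ratio s/e.
for ratio in (0.80, 1.0, 2.55):
    print(f"s = {ratio}*e -> n = {min_sample_size(s=ratio, e=1.0)}")
```

Only the ratio s/e enters the calculation, which is the whole point: no fixed n can be right for every experiment.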

This really matters because running experiments is typically very expensive. If a sample size of 10 would do, why use 100? It would only be to satisfy a slogan, not for any real contribution to more reliable statistical inference.

The supposed failure of reproducibility is built into the system and will not be overcome by adopting slogans. In classical statistical inference with 5% significance levels, according to the theory a true null hypothesis will be wrongly rejected approximately 5% of the time anyway. No researcher can know if their experiment is one of the 5%! This is the point of repeat studies. Yet repeat studies are hard to publish (no novelty), and if they do find results at odds with the original study then the researchers on the original study may have their integrity questioned. Yet who can know whether the second study is one of the 20% that will fail to reject a false null hypothesis (assuming a power of 80%)!
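A small simulation makes the 5% and the 20% concrete. The sketch below assumes the same known-variance setting as the example above and a two-sided z-test of the null hypothesis m = 0; the function rejection_rate, the seed, the number of trials and the effect size of 0.7 are all choices made here for illustration (0.7 with n = 16 and s = 1 gives a theoretical power of roughly 80%).

```python
import math
import random
from statistics import NormalDist


def rejection_rate(true_mean, n, trials=20000, s=1.0, alpha=0.05, seed=1):
    """Fraction of simulated experiments that reject H0: m = 0.

    Each experiment draws n observations from N(true_mean, s^2) and applies
    a two-sided z-test with known standard deviation s.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        xbar = sum(rng.gauss(true_mean, s) for _ in range(n)) / n
        z = xbar / (s / math.sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


# Null hypothesis true: every rejection is a false positive.
false_positive_rate = rejection_rate(true_mean=0.0, n=16)
# Null hypothesis false: one minus this is the rate of missed effects.
power = rejection_rate(true_mean=0.7, n=16)

print(f"false positive rate: {false_positive_rate:.3f}")
print(f"power: {power:.3f}")
```

The two rates should come out close to 0.05 and 0.80. Nothing in a single run tells the experimenter which kind of experiment they have just performed.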

The Bayesian approach does not have these problems. An experiment will result in probabilities about hypotheses or probability distributions over parameters rather than fixed answers or conclusions. As more data are collected through repeat studies, these probabilities will be updated. It should also be noted that an argument similar to the one in the example above holds for finding a Bayesian credible interval.
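For the same known-variance example, the updating can be sketched with the standard conjugate result for a normal prior on m. Everything specific here is an assumption made for illustration: the vague prior (mean 0, variance 100), the two study summaries, and the function names update_normal and credible_width.

```python
import math
from statistics import NormalDist


def update_normal(prior_mean, prior_var, xbar, n, s):
    """Posterior mean and variance of m after one study.

    Normal prior on m, n observations with sample mean xbar, known sd s.
    Precisions (inverse variances) add; the mean is precision-weighted.
    """
    precision = 1 / prior_var + n / s**2
    post_var = 1 / precision
    post_mean = post_var * (prior_mean / prior_var + n * xbar / s**2)
    return post_mean, post_var


def credible_width(var, level=0.95):
    """Width of the central credible interval of a normal posterior."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return 2 * z * math.sqrt(var)


mean, var = 0.0, 100.0            # hypothetical vague prior
studies = [(0.9, 16), (0.6, 16)]  # made-up (sample mean, n) for two studies
widths = []
for xbar, n in studies:
    mean, var = update_normal(mean, var, xbar, n, s=1.0)
    widths.append(credible_width(var))
    print(f"posterior mean {mean:.3f}, 95% credible width {widths[-1]:.3f}")
```

The posterior variance depends on s^2 and n in much the same way as before, so the width of the credible interval is again governed by the variance and not by a fixed n. A repeat study simply narrows the distribution; it does not have to confirm or contradict anything.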

To paraphrase a well known saying: "It's the variance, stupid!"