Pre-Registration of Analysis of Experiments is Dangerous for Science
The idea of pre-registration of
experimental analysis is supposed to be one amongst a set of solutions to the crisis
caused by the explosion of false results making it into the scientific literature -
especially in psychology and related sciences. It is argued that pre-registration
of analyses will curb the over-analysis of data that is carried out in the search
for ‘significant’ results.
Here I argue that such a move will not only fail to solve any crisis but will have detrimental effects on science. The idea arises from a false notion of how scientific research operates.

In the accepted ideology the scientist has a ‘hypothesis’ and then formulates this as a ‘null hypothesis’ (H0) and an ‘alternative hypothesis’ (H1) (of course there could be sets of these for any particular experiment). H0 is set up to mean that the original scientific hypothesis is ‘false’. Specifically, H0 is chosen to satisfy simple mathematical properties that make the resulting statistical analysis easy to do, following well-known textbook formulae. For example, the scientist may believe that experimental condition E will produce a higher value on some critical variable y than experimental condition C (other things being equal). In such a case H0 would typically be that μC = μE, and H1 that μE > μC. Under the null hypothesis the distribution of the test statistic t for the difference of two means is known, but critically only under certain statistical assumptions. Depending on the value of the test statistic t computed after the experiment, H0 is rejected (if t falls in some critical region) or not. H0 not being rejected is not evidence that it is ‘true’. Classical statistics does not even permit the researcher to give odds or make probability statements about the support for H0 or H1.
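To make this recipe concrete, here is a minimal sketch in Python with simulated, purely illustrative data; the sample sizes, the means and the use of scipy are my own assumptions and not part of any particular experiment:

```python
import numpy as np
from scipy import stats

# Simulated responses under conditions E and C (purely illustrative).
rng = np.random.default_rng(0)
y_E = rng.normal(loc=1.3, scale=1.0, size=30)
y_C = rng.normal(loc=1.0, scale=1.0, size=30)

# Two-sample t-test of H0: muC = muE against H1: muE > muC, under the usual
# assumptions of normality, equal variances and independent observations.
result = stats.ttest_ind(y_E, y_C, equal_var=True, alternative="greater")

# Reject H0 at the 5% level if t falls in the critical region; note that
# failing to reject is not evidence that H0 is 'true'.
alpha = 0.05
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}, "
      f"reject H0: {result.pvalue < alpha}")
```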
Anyone who has ever done an experiment in psychology and related sciences knows that this ideology is just that, an ‘ideology’. In our own work we carry out experiments in virtual reality where under different conditions we examine participants’ responses. For example, we want to see how conditions E and C influence responses on a set of response variables y1,...,yp, but where there are also covariates x1,...,xk. In the (rare) simplest case p = 1 and k = 0. The conditions E and C are typically ‘Experimental’ and ‘Control’ conditions. Theory will make predictions about how the response variables may be influenced by the experimental conditions. However, it will rarely tell us the expected influence of the covariates (e.g., age, gender, education, and so on).
Let’s start with the simplest case - a between-groups experimental design with two conditions E and C, one response variable, and no covariates. The null and alternative hypotheses can be as above. This is very simple: we would register this experimental design, and the main test would be a t-test for the difference between the two sample means.
The experiment is carried out and the data is collected. Now the problems start. The t-test relies on various assumptions, such as normality of the response variable (under conditions E and C), equal variances, and independent observations. The last of these can usually be assured by the experimental protocol. You do the analysis and check the normality of the response variable: it is not normally distributed under either E or C, and the variances are nowhere near equal. The t-test is inappropriate. A bar chart showing sample means and standard errors does show that the mean response under E is greater than under C, but one of the standard errors is worryingly large. You plot the histograms and find that the response variable is bimodal under both conditions, so even the means are not useful for understanding the data. You eventually realise that the bimodality is caused by the sex of the participants, and therefore nothing makes sense without including this. Moreover, the design happened to be balanced for sex. If you take sex into account the variation in these data is very well explained by the statistical model; if you don’t, nothing is ‘significant’.
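To make those checks concrete, here is a hedged sketch in Python; the data frame, the column names (‘y’, ‘condition’, ‘sex’) and the way sex drives the simulated bimodality are all hypothetical, chosen only to mirror the scenario above:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Simulated data in which the sex of the participant drives a bimodal response,
# mirroring the scenario above; all numbers are purely illustrative.
rng = np.random.default_rng(1)
n = 40  # participants per condition
df = pd.DataFrame({
    "condition": np.repeat(["E", "C"], n),
    "sex": np.tile(np.repeat(["F", "M"], n // 2), 2),
})
base = np.where(df["sex"] == "F", 3.0, 0.0)          # bimodality caused by sex
effect = np.where(df["condition"] == "E", 0.8, 0.0)  # the condition effect of interest
df["y"] = base + effect + rng.normal(scale=0.5, size=len(df))

y_E = df.loc[df["condition"] == "E", "y"]
y_C = df.loc[df["condition"] == "C", "y"]

# Check the t-test assumptions: normality within each condition (Shapiro-Wilk)
# and equality of variances (Levene).
print("Shapiro E:", stats.shapiro(y_E))
print("Shapiro C:", stats.shapiro(y_C))
print("Levene:   ", stats.levene(y_E, y_C))

# Histograms per condition reveal the bimodality that a bar chart of means hides.
fig, axes = plt.subplots(1, 2, sharex=True)
axes[0].hist(y_E, bins=15)
axes[0].set_title("Condition E")
axes[1].hist(y_C, bins=15)
axes[1].set_title("Condition C")
plt.show()
```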
So now your experimental design is 2x2, with Condition (C, E) and Sex as the two factors. Here everything makes sense: we find mean(Condition-E, Male) > mean(Condition-C, Male) and mean(Condition-E, Female) > mean(Condition-C, Female), both by a wide margin. Everything that needs to be compatible with normality now is. The earlier result is explained by the overlap between the male and female distributions.
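A minimal sketch of that 2x2 analysis follows, reusing the hypothetical simulated data frame from the previous sketch; the statsmodels formula interface is just one of several ways to fit it:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# 'df' is the simulated data frame constructed in the previous sketch.
# Two-way ANOVA with Condition (C, E) and Sex as crossed factors; the
# interaction term lets a differential effect of E vs C between the sexes show up.
model = smf.ols("y ~ C(condition) * C(sex)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))

# Cell means: the prediction is mean(E) > mean(C) within each sex separately.
print(df.groupby(["condition", "sex"])["y"].mean().round(2))
```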
So what do we do now? Do we register a new
experiment and do it all over again? In practice this is often not possible:
the PhD student has finished, the grant money has run out, circumstances have
changed so that the previous equipment or software is just no longer available,
and so on. Do you kill the paper? In fact it is a kind of mysticism to suppose that
what you had registered, i.e., the thoughts you had in your head before the
experiment, is somehow going to influence what happened. What happened is as
described: the means tell their story in spite of what you may have thought
beforehand. Of course a post hoc analysis is not supposed to be valid in
the classical statistical framework, because you have already ‘looked at’ the
data. So are the results to be discounted? The argument of the Registrationists is
that you should register another experimental design.
So you register another design in which
sex is explicitly an experimental factor, giving a 2x2 design. But to play it safe
you also record another set of covariates (age, education, number of hours per
week playing computer games, etc.). You run the experiment again, and just as
before find the difference in means and all is well. Moreover, between the 4
groups of the 2x2 design there are no significant differences in any of the
covariates. However, while preparing the paper you are generating graphs, and
you notice an almost perfect linear relationship between your response variable
and the number of hours a week playing computer games. You then include this as
a covariate in an ANCOVA, and find that all other effects are wiped out, and
that the explanation may have to do with game playing, not sex, and not even
E and C. In fact, looking in more depth, you find that E and C do have the predicted
differential effect, but only within the group of females. You also find that
game playing is linked to age and to education, so that there is a complex
relationship between all these variables that is impossible to describe with
single-equation models such as ANOVA or ANCOVA; what is needed is a path
analysis. You set up a path analysis according to what you suspect and it is an
exceedingly good fit to the data, and in fact a very simple model.
Unfortunately, this means that you have to throw the data away, and register
another experiment, and start again.
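For illustration, here is a sketch of the ANCOVA step; the data frame, the column name hours_gaming, and the simulated dependence of the response on game playing are all hypothetical, and the path analysis itself is left out since it needs dedicated modelling beyond a single equation:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical second experiment: a 2x2 design (condition x sex) plus a recorded
# covariate. In this simulation 'hours_gaming' drives the response, and males
# happen to play more, so a naive analysis attributes the effect to sex.
rng = np.random.default_rng(2)
n = 80
df2 = pd.DataFrame({
    "condition": np.repeat(["E", "C"], n // 2),
    "sex": np.tile(["F", "M"], n // 2),
})
df2["hours_gaming"] = rng.uniform(0, 20, size=n) + np.where(df2["sex"] == "M", 8.0, 0.0)
df2["y"] = 0.5 * df2["hours_gaming"] + rng.normal(scale=1.0, size=n)

# ANOVA without the covariate: the effect appears to belong to sex.
m1 = smf.ols("y ~ C(condition) * C(sex)", data=df2).fit()
print(sm.stats.anova_lm(m1, typ=2))

# ANCOVA with the covariate: the apparent sex effect is absorbed by hours of
# game playing, which is what the graphs suggested in the first place.
m2 = smf.ols("y ~ C(condition) * C(sex) + hours_gaming", data=df2).fit()
print(sm.stats.anova_lm(m2, typ=2))
```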
And so it goes on.
Is this ‘fishing’? It might be described
that way, but in fact it is the principled way to analyse data. The Registrationists would have us believe
that you set up the hypotheses, state the tests, do the experiment, run the
tests, report the results. It is an ‘input-output’ model of science where
thought is eliminated. It is tantamount to saying “No human was involved in the
analysis of this data”. That surely eliminates bias, fishing and error, but it
is wrong, and can lead to completely
false conclusions, as the simple example above illustrates. In the example above
we would have stopped immediately after the first experiment, reported the results
(“No significant difference between the means of E and C”) and that would be it
- assuming it were even possible to publish a paper with P > 0.05. The more
complex relationships would never have been discovered, and since P > 0.05
no other researcher would be tempted to go into this area.
Of course “fishing” is bad. You do an
experiment, find ‘nothing’, and then spend months analysing and reanalysing the
data in order to find a P < 0.05 somewhere. This is not investigation along the
lines described above; it is “Let’s try this”, “Let’s try that”. It is not driven
by the logic of what is actually found in the data; it is not investigative but
trial and error.
The above was a very simple example.
Now suppose there are multiple response variables (p > 1) and covariates (k > 0). There are so many more things that can go wrong.
Residual errors of model fits may not be normal, so you have to transform some
of the response variables; a subset of the covariates might be highly
correlated, so that using a principal component score in their stead may lead to
a simpler and more elegant model; there may be a clustering effect where
response variables are better understood by combining some of them; there may
be important non-linearity in the relationships that cannot be dealt with in
the ANCOVA model, and so on. Each one of these potentially requires the
registration of another experiment, since
such circumstances could not have been foreseen and were not included in the
registered experiment.
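As one concrete example of the principal-component step mentioned above, here is a hedged sketch with hypothetical covariates x1, x2, x3 that share a common underlying factor (scikit-learn is assumed purely for convenience):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical covariates x1, x2, x3 that are highly correlated because they
# share a common underlying factor.
rng = np.random.default_rng(3)
n = 100
common = rng.normal(size=n)
covs = pd.DataFrame({
    "x1": common + 0.2 * rng.normal(size=n),
    "x2": common + 0.2 * rng.normal(size=n),
    "x3": common + 0.2 * rng.normal(size=n),
})
print(covs.corr().round(2))  # the block of strongly correlated covariates

# Standardise the covariates and keep the first principal component as a
# single score that can replace x1..x3 in the (AN)COVA model.
z = StandardScaler().fit_transform(covs)
pca = PCA(n_components=1)
score = pca.fit_transform(z)[:, 0]
print(f"variance explained by the first component: {pca.explained_variance_ratio_[0]:.2f}")
```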
The distinction is rightly made between ‘exploratory’
and ‘confirmatory’ experiments. In practice experiments are a mixture of these
- what started out as ‘confirmatory’ can quickly become ‘exploratory’ in the light
of the data. Perhaps only in physics are there such clear-cut ‘experiments’, for
example, confirming the predictions of the theory of relativity by observing
light bending during eclipses. Once we deal with data in the behavioural,
psychological or medical sciences things are just messy, and the attempt
to force them into the mould of physics is damaging.
Research is not carried out in a vacuum: there
are strong economic and personal pressures on researchers - not least of which
for some is the issue of ‘tenure’. At the most basic level, tenure may depend on
how many P < 0.05 results can be found, and how often. Faced with life-changing
pressures researchers may choose to run their experiment, then ‘fish’ for some
result, and then register it, reporting it only some months after the actual experiment
was done.
Registration is going to solve nothing at
all. It encourages an input-output type of research and analysis. Register the
design, do the experiment, run the results through SPSS, publish the paper (if
there is a P < 0.05). In fact too many papers follow this formula, as
evidenced by results sections that open with F-tests and t-tests without first
presenting, in a straightforward way with tables and graphs, what the data show. This
further illustrates that, apart from the initial idea that led to the
experiment, there is no intervention of thought in the rest of the process;
presenting data in a useful way and discussing it before ever presenting a ‘test’
requires thought. In this process discovery
goes out the window, because discovery occurs precisely when you get results that
were not predicted by the experiment. At best discovery becomes extremely
expensive, since the ‘discovery’ could only be followed up through the
registration of yet another experiment specifically to address it.
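By way of contrast, here is a small sketch of what ‘presenting the data first’ might look like, reusing the hypothetical data frame from the earlier sketches: a plain table of group sizes, means and spreads, and a plot, before any test statistic is reported.

```python
import matplotlib.pyplot as plt

# 'df' is the simulated data frame from the earlier sketch.
# A plain table of what the data show, per cell of the design, before any test.
summary = df.groupby(["condition", "sex"])["y"].agg(["count", "mean", "std"])
print(summary.round(2))

# And a picture of the distributions, again before any F- or t-test.
df.boxplot(column="y", by=["condition", "sex"])
plt.suptitle("")  # drop the automatic pandas title
plt.ylabel("y")
plt.show()
```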