Pre-Registration of Analysis of Experiments is Dangerous for Science
The idea of pre-registration of
experimental analyses is offered as one of a set of solutions to the crisis
caused by the flood of false results entering the scientific literature -
especially in psychology and related sciences. The argument is that pre-registration
of analyses will curb the over-analysis of data that is carried out in search
of ‘significant’ results.
Here I argue that such a move will not only
not solve any crisis but will have detrimental effects on science.

The idea arises from a false notion of how scientific research operates. In the accepted ideology the scientist has a ‘hypothesis’ and then formulates this as a ‘null hypothesis’ (H0) and an ‘alternative hypothesis’ (H1) (of course there could be sets of these for any particular experiment). H0 is set up to mean that the original scientific hypothesis is ‘false’. Specifically, H0 is chosen to satisfy simple mathematical properties that make the resulting statistical analysis easy to do, following well-known textbook formulae. For example, the scientist may believe that experimental condition E will produce a higher value on some critical variable y than experimental condition C (other things being equal). In such a case H0 would typically be that μC = μE, and H1 that μE > μC. Under the null hypothesis the distribution of the test statistic t for the difference of two means is known, but only under certain statistical assumptions. Depending on the value of the computed test statistic t after the experiment, H0 is rejected (if t falls in some critical region) or not. H0 not being rejected is not evidence for it being ‘true’. Classical statistics does not even permit the researcher to give odds or make probability statements about the support for H0 or H1.
Anyone who has ever done an experiment in psychological and related sciences knows that this ideology is just that, an ‘ideology’. In our own work we carry out experiments in virtual reality where under different conditions we examine participants’ responses. For example, we want to see how conditions E and C influence participants’ responses on a set of response variables y1,...,yp, but also where there are covariates x1,...,xk. In the (rare) simplest case p = 1 and k = 0. The conditions E and C are typically ‘Experimental’ and ‘Control’ conditions. Theory will make predictions about how the response variables may be influenced by the experimental conditions. However, it will rarely tell us the expected influence of the covariates (e.g., age, gender, education, etc.).
Let’s start with the simplest case - a between-groups
experimental design with two conditions E and C, one response
variable, and no covariates. The null and alternative hypotheses can be as above. So this is very
simple, and we would register this experimental design, and the main test would
be a t-test for the difference between the two sample means.
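For concreteness, here is a minimal sketch of what that registered analysis amounts to, in Python. The data are simulated, and the group sizes, means and variable names are illustrative assumptions rather than part of any real protocol.

```python
# Minimal sketch of the registered analysis: a two-sample t-test of
# H0: muC = muE against H1: muE > muC. All values are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y_E = rng.normal(loc=5.5, scale=1.0, size=30)  # responses under condition E (simulated)
y_C = rng.normal(loc=5.0, scale=1.0, size=30)  # responses under condition C (simulated)

# alternative="greater" gives the one-sided test (requires a recent SciPy).
t, p = stats.ttest_ind(y_E, y_C, equal_var=True, alternative="greater")
print(f"t = {t:.3f}, one-sided p = {p:.3f}")
```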
The experiment is carried out and the data
is collected. Now the problems start. The t-test relies on various
assumptions, such as normality of the response variable (under conditions E and
C), equal variances, and independent observations. The latter can usually
be assured by experimental protocols. You do the analysis and check the
normality of the response variable, and it is not normally distributed under E or C, and the
variances are nowhere near each other. The t-test is inappropriate. A bar chart
showing sample means and standard errors indeed shows that the mean response
under E is greater than under C, but one of the standard errors is worryingly
large. You plot the histograms and find that the
response variable is bimodal under both conditions, so even the means are not
useful to understand the data. You realise eventually that the bimodality is
caused by the sex of the participants, and therefore nothing now makes sense
without including this. Moreover, the design happened to be balanced for sex. If you take sex into account, the variation in these data is
very well explained by the statistical model; if you don’t, then nothing is ‘significant’.
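A sketch of those assumption checks follows. The DataFrame, the column names and the effect sizes are all assumptions, chosen only to reproduce the bimodal situation described above:

```python
# Sketch of the assumption checks; data and column names are illustrative.
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 40  # participants per condition, balanced for sex
df = pd.DataFrame({
    "condition": np.repeat(["E", "C"], n),
    "sex": np.tile(np.repeat(["F", "M"], n // 2), 2),
})
# Simulate a bimodal response: sex shifts the mean far more than condition does.
df["y"] = (rng.normal(0, 1, 2 * n)
           + np.where(df["sex"] == "M", 4.0, 0.0)
           + np.where(df["condition"] == "E", 1.0, 0.0))

y_E = df.loc[df["condition"] == "E", "y"]
y_C = df.loc[df["condition"] == "C", "y"]

# Normality within each condition (Shapiro-Wilk) and equality of variances (Levene).
print("Shapiro E:", stats.shapiro(y_E))
print("Shapiro C:", stats.shapiro(y_C))
print("Levene:  ", stats.levene(y_E, y_C))

# Histograms per condition: the bimodality shows up here, and
# re-plotting split by sex reveals its source.
fig, axes = plt.subplots(1, 2, sharex=True)
for ax, (label, y) in zip(axes, [("E", y_E), ("C", y_C)]):
    ax.hist(y, bins=15)
    ax.set_title(f"Condition {label}")
plt.show()
```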
So now your experimental design is 2x2 with (C,E) and Sex as the two factors. Here everything makes sense: we find
mean(Condition-E-Male) > mean(Condition-C-Male) and mean(Condition-E-Female)
> mean(Condition-C-Female), both by a wide margin. Everything that needs to be
compatible with normality is so. The earlier result is explained by the
overlap between the Male and Female distributions.
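A sketch of that 2x2 analysis, reusing the illustrative DataFrame simulated in the previous sketch (statsmodels is one of several ways to fit it):

```python
# Two-way ANOVA with Condition and Sex as factors, using the simulated `df` above.
import statsmodels.api as sm
import statsmodels.formula.api as smf

model = smf.ols("y ~ C(condition) * C(sex)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))               # ANOVA table for both factors
print(df.groupby(["condition", "sex"])["y"].mean())  # cell means: E > C within each sex
```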
So what do we do now? Do we register a new
experiment and do it all over again? In practice this is often not possible:
the PhD student has finished, the grant money has run out, circumstances have
changed so that the previous equipment or software is just no longer available,
etc. Do you kill the paper? In fact it is a kind of mysticism to suppose that
what you had registered, i.e., the thoughts you had in your head before the
experiment, is somehow going to influence what happened. What happened is as we
described, the means tell their story in spite of what you may have thought
beforehand. Of course doing a post hoc analysis is not supposed to be valid in
the classical statistical framework, because you have already ‘looked at’ the
data. So are the results to be discounted? The ‘registration’ argument is
that you should register another experimental design.
So you register another design where now
sex is explicitly an experimental factor giving a 2x2 design. But to play safe
you also record another set of covariates (age, education, number of hours per
week playing computer games, etc). You run the experiment again, and just as
before find the difference in means and all is well. Moreover, between the 4
groups of the 2x2 design there are no significant differences in any of the
covariates. However, while preparing the paper you are generating graphs, and
you notice an almost perfect linear relationship between your response variable
and the number of hours a week playing computer games. You then include this as
a covariate in an ANCOVA, and find that all other effects are wiped out, and
that the explanation may be to do with game playing and not gender and not even
E and C. In fact, looking in more depth, you find that E and C do have the predicted
differential effect, but only within the group of females. You also find that
game playing is linked to age and to education, so that there is a complex
relationship between all these variables that is impossible to describe with
single-equation models of the ANOVA or ANCOVA type; what is needed is a path
analysis. You set up a path analysis according to what you suspect, and it is an
exceedingly good fit to the data, and in fact a very simple model.
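To make the ANCOVA step concrete, here is a hedged sketch on simulated data. The covariate name ‘game_hours’ and all the numbers are illustrative assumptions, constructed so that the covariate dominates, as in the scenario above. (The path-analysis fit would need a structural-equation package and is not sketched here.)

```python
# ANOVA vs. ANCOVA on simulated data from the (hypothetical) second experiment.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 80
df2 = pd.DataFrame({
    "condition": rng.choice(["E", "C"], n),
    "sex": rng.choice(["F", "M"], n),
    "game_hours": rng.uniform(0, 20, n),  # hours per week playing computer games
})
# Simulate the situation described: the response tracks game playing almost linearly.
df2["y"] = 2.0 + 0.5 * df2["game_hours"] + rng.normal(0, 0.5, n)

anova = smf.ols("y ~ C(condition) * C(sex)", data=df2).fit()
ancova = smf.ols("y ~ C(condition) * C(sex) + game_hours", data=df2).fit()
print(sm.stats.anova_lm(anova, typ=2))
print(sm.stats.anova_lm(ancova, typ=2))  # the covariate absorbs most of the explained variance
```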
Unfortunately, this means that you have to throw the data away, and register
another experiment, and start again.
And so it goes on.
Is this ‘fishing’? It might be described
that way, but in fact it is the principled way to analyse data. The Registrationists would have us believe
that you set up the hypotheses, state the tests, do the experiment, run the
tests, report the results. It is an ‘input-output’ model of science where
thought is eliminated. It is tantamount to saying “No human was involved in the
analysis of this data”. That surely eliminates bias, fishing and error, but it
is wrong, and can lead to completely
false conclusions as the simple example above illustrates. In the example above
we would have stopped immediately after the first experiment, reported the results
(“No significant difference between the means of E and C”) and that would be it
- assuming it were even possible to publish a paper with P > 0.05. The more
complex relationships would never have been discovered, and since P > 0.05
no other researcher would be tempted to go into this area.
Of course “fishing” is bad. You do an
experiment, find ‘nothing’ and then spend months analysing and reanalysing the
data in order to find somewhere a P
< 0.05. This is not investigation along the lines above; it is “Let’s try this”,
“Let’s try that”. It is not driven by the logic of what is actually found in
the data; it is not investigative but trial and error.
Now the above was a very simple example.
But now suppose there are multiple response variables (p > 1) and covariates (k > 0). There are so many more things that can go wrong.
Residual errors of model fits may not be normal, so you have to transform some
of the response variables; a subset of the covariates might be highly
correlated so that using a principal component score in their stead may lead to
a more elegant and simpler model; there may be a clustering effect where
response variables are better understood by combining some of them; there may
be important non-linearity in the relationships that cannot be dealt with in
the ANCOVA model, and so on. Each one of these potentially requires the
registration of another experiment, since
such circumstances could not have been foreseen and were not included in the
registered experiment.
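One of those situations - a correlated subset of covariates replaced by a single principal-component score - can be sketched as follows; the data and the decision to keep one component are purely illustrative:

```python
# Replacing three highly correlated covariates with one principal-component score.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 100
latent = rng.normal(size=n)
# Three covariates that are little more than noisy copies of one underlying quantity.
X = np.column_stack([latent + rng.normal(0, 0.2, n) for _ in range(3)])

pca = PCA(n_components=1)
score = pca.fit_transform(StandardScaler().fit_transform(X)).ravel()
print("variance explained by the first component:", pca.explained_variance_ratio_[0])
# `score` could then replace the three covariates in an ANCOVA-type model.
```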
The distinction is rightly made between ‘exploratory’
and ‘confirmatory’ experiments. In practice experiments are a mixture of these
- what started out as ‘confirmatory’ can quickly become ‘exploratory’ in the light
of data. Perhaps only in physics are there such clear-cut ‘experiments’ - for
example, confirming the predictions of the theory of relativity by observing
light bending during eclipses. Once we deal with data in behavioural,
psychological or medical sciences things are just messy, and the attempt
to force them into the mould of physics is damaging.
Research is not carried out in a vacuum: there
are strong economic and personal pressures on researchers - not least of which
for some is the issue of ‘tenure’. At the most base level tenure may depend on
how many and how often P < 0.05 can be found. Faced with life-changing
pressures researchers may choose to run their experiment, then ‘fish’ for some
result, and then register it, reporting it only some months after the actual experiment
was done.
Registration is going to solve nothing at
all. It encourages an input-output type of research and analysis. Register the
design, do the experiment, run the results through SPSS, publish the paper (if
there is a P < 0.05). In fact too many papers already follow this formula -
evidenced by results sections that start with F-tests and t-tests without
first presenting, in tables and graphs, what the data show. This
further illustrates that, apart from the initial idea that led to the
experiment, there is no intervention of thought in the rest of the process -
presenting data in a useful way and discussing it before ever presenting a ‘test’
requires thought. In this process discovery
goes out the window, because discovery occurs when you get results that were
not predicted by the experiment. At best discovery becomes extremely
expensive, since the ‘discovery’ could only be followed through by the
registration of another experiment that would specifically address it.
9 Comments:
To summarize, your argument is "It's hard to do reproducible science and it sometimes requires more experiments, so let's instead continue to do non-reproducible science." The distinction between exploratory and confirmatory analyses was so very clear in your own example; you then dismissed it by just claiming that it's not always clear cut? Of course it's not always clear cut but the whole point is that we should try whatever it takes to move away from non-reproducible to reproducible science. What do you propose to achieve that?
This argument is not about reproducible science. It is saying that "registration" is an incorrect and dangerous way to try to get good science. I specifically refer to registration of analyses (there's nothing wrong with registration of experimental designs). Registration of analyses is completely unrealistic, because real data is rarely the neat stuff you read in textbooks. It has outliers (when to 'remove' them and when not), it has missing values, it doesn't follow standard distributions, no amount of 'transformation' of the variables gets you to a mathematically neat distribution, etc. So the analysis is rarely the straightforward "Do an ANOVA" following a textbook formula.
So this case is nothing to do with reproducible science. My argument is against bureaucratic sham solutions to problems, and registration is at best a bureaucratic approach that will solve nothing, and may put people off science altogether.
Rather at the end of the day what is wanted is better education in statistical methods, so that people understand what's going on, and stop resorting to input-output methods reliant on computer programs that give an "answer" (P < 0.05).
Great post. I agree in many ways. I also think that many of the prereg crowd are researchers that are interested in psychological phenomena that are more easily researched within a prereg framework, and possibly make the mistake of assuming that what works for them works for everyone. E.g. conducting research within clinical contexts (which rely upon relationships with service providers, not to mention the many recruitment and engagement issues that frequently arise), makes prereg unsuitable in certain contexts. Overall, some of these prereg adherents are presenting as increasingly dogmatic and ideological in their outlook, which helps no one IMO.
Hello, I've written a response to this post where I attempt to defend preregistration. It can be found on my blog here: https://psychbrief.com/2016/05/24/in-defense-of-preregistration/
Hopefully I've represented your views accurately but if I've made a mistake I apologise and I will correct it.
Thanks - I agree with this response completely.
But then my concern is: if, after pre-registering and doing the analysis, you find everything different from what was predicted, and you can include it all in the paper anyway - what is the point?
I strongly believe that the way to solve problems in science is NOT to do so through what is essentially a bureaucratic means.
I also want to emphasise that I don't see any problem about registering experimental designs and hypotheses, but I'm far less in favour of registering proposed highly specific analyses - for all the reasons I post above.
Finally, people who would "fish", find a result, then pre-register it and report the experiment some months later, would of course be engaging in fraud, and they would know that they are doing wrong. Labelling it as fraud isn't going to make any difference to such people - hopefully an extremely tiny minority.
Thanks for your post - I'm happy that this is being discussed in such a constructive way.
But then my concern is: if, after pre-registering and doing the analysis, you find everything different from what was predicted, and you can include it all in the paper anyway - what is the point?
Because then everyone knows whether the analysis was planned or prompted by the way the data fell, and this is helpful because these two analyses are not equivalently useful for building future research.
Yes, this is nice, and it doesn't imply any harm (unless it becomes some sort of binding obligation imposed on us) but I still don't see the point.
It is very difficult to express what I want to say now, and it can easily be misunderstood.
There is a mysticism in statistics that implicitly says that what you had in your head beforehand can somehow affect the results. Now of course at one level this is true - because the thoughts you had in your head determine what questions you ask and therefore which data you collect, what the variables are, etc. But once all that is determined, whether beforehand you thought that meanX < meanY or something else is irrelevant to what actually happens - you get the result that "nature" delivers. So if I register beforehand that I expect "meanX < meanY" this may be interesting as part of a theoretical argument or model, but in terms of what the data actually brings up it really can't have an influence (except in a mystical way). So when I analyse after getting the data to see if meanX < meanY, this can be stated in terms of my theory (about why I expected it and why I am testing it). If I pre-registered this I also have to say: I pre-registered this and it didn't come out the way I expected, so now I have to put this in a special section of the paper. What extra information does this add? Isn't it better to say: my theory implied that meanX < meanY, but actually I didn't find that, so my theory might be wrong, or some other condition was not taken into account, etc.? This avoids the bureaucracy and puts things exactly in the right context.
So my view of what it comes down to is the context in which the analysis takes place - i.e., the theoretical background or model, and how what is being estimated and tested is part of that, so that readers can understand what is being done in the appropriate context.
Thank you for the reply. I can see where you are coming from with regards to it being more bureaucratic (though I don't think it would make it much more so). The reason why it's good is because you are being open about what you expected and then what you did. You aren't pretending you were going to examine this co-variable from the start, which is what currently happens; a new co-variable is discovered and the study is rewritten to imply it was always being examined. Preregistering allows readers to know what changes have been made and it therefore allows us to clearly see whether the statistical analyses they performed are applicable to their study (or how far down the garden of forking paths they've gone https://twitter.com/PsychologyBrief/media and http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf). If you preregister your analyses and then find you have to do different ones, that's fine but you can't pretend you did them from the beginning. We will also know if the researcher has dropped groups/analysed new groups in search of a significant value.
Preregistration would not inhibit analysis of the results with regards to the theory it is testing. It just ensures we are being honest and open about what has been done so we can be more confident in what the data is showing.
I agree labeling it "fraud" isn't going to stop most of the people who do it anyway, but they would need to have a time-machine to go back and change their study after the reviewer asks for amendments at Stage 1 (https://twitter.com/PsychologyBrief). It's unlikely there would be no amendments to make for a study, so it would need to be changed from the original idea. This would help protect against fraud.
Thanks.
All that you say is true, in the sense that the background theory or context of why you did the experiment must be in the paper.
But basically I don't care what your predictions are. They cannot influence the results, unless the mystical path is followed.
I only care about what happened.
As for covariates influencing the results - great, we discover something new.
Are we to pretend that it is somehow false if I didn't think of the possible effects of these covariates in advance?
If the data shows that these were important in explaining the results, then why should I be ashamed of this?
I think that basically it comes down to different philosophies. I think of statistical analysis as finding a succinct, minimal model with which we explain the variation in the data. This is opposed to the (in reality non-existent) null hypothesis testing methodology. Of course I would have had a starting point for the model. What difference does it make to anything at all if I would have registered the starting point? No difference whatsoever. What difference does it make if someone else looks at my registered starting point? Nothing at all. "Ah I see before s/he did the experiment s/he thought that X would happen. But it didn't. Got you!" I don't think science should be like that.
If the intention is to stop fraud then it won't.
Basically, my point of view is that if some people want to register their experiments, it is fine with me. But don't claim that this is somehow superior, or try to impose it on others. It is like religion. I don't care what others believe. But don't try to force those beliefs on me.