Mel Slater's Presence Blog

Thoughts about research and radical new applications of virtual reality - a place to write freely without the constraints of academic publishing, and have some fun.

17 April, 2016

Pre-Registration of Analysis of Experiments is Dangerous for Science

The idea of pre-registration of experimental analysis is supposed to be one among a set of solutions to the crisis caused by the explosion of false results making it into the scientific literature - especially in psychology and related sciences. It is argued that pre-registration of analyses will curb the over-analysis of data carried out in a search for ‘significant’ results.

Here I argue that such a move will not only fail to solve any crisis but will have detrimental effects on science. The idea arises from a false notion of how scientific research operates. In the accepted ideology the scientist has a ‘hypothesis’ and then formulates this as a ‘null hypothesis’ (H0) and an ‘alternative hypothesis’ (H1) (of course there could be sets of these for any particular experiment). H0 is set up to mean that the original scientific hypothesis is ‘false’. Specifically, H0 is chosen to satisfy simple mathematical properties that will make the resulting statistical analysis quite easy to do, following well-known textbook formulae. For example, the scientist may believe that experimental condition E will produce a higher value on some critical variable y than experimental condition C (other things being equal). In such a case H0 would typically be that μC = μE, and H1 that μE > μC. Under the null hypothesis the distribution of the test statistic t for the difference of two means is known, but critically only under certain statistical assumptions. Depending on the value of the computed test statistic t after the experiment, H0 is rejected (if t falls in some critical region) or not. H0 not being rejected is not evidence for it being ‘true’. Classical statistics does not even permit the researcher to give odds or make probability statements about the support for H0 or H1.
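The classical setup just described can be sketched in a few lines. This is a minimal illustration using scipy, with invented numbers for y under C and E (nothing here comes from a real experiment):

```python
# Minimal sketch of the classical two-sample test described above.
# The data are invented purely for illustration.
from scipy import stats

# Hypothetical measurements of the response variable y under the
# Control (C) and Experimental (E) conditions.
y_C = [1.0, 2.0, 3.0, 4.0, 5.0]
y_E = [3.0, 4.0, 5.0, 6.0, 7.0]

# H0: mu_C = mu_E  vs  H1: mu_E > mu_C (one-sided).
# equal_var=True gives the classical pooled-variance t-test, which is
# valid only under the assumptions discussed later in this post.
t, p = stats.ttest_ind(y_E, y_C, equal_var=True, alternative="greater")

print(t, p)  # H0 is rejected at the 5% level if p < 0.05
```

Note that even in this toy case the conclusion "reject H0" is conditional on normality, equal variances, and independence - the assumptions the rest of this post is about.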

Anyone who has ever done an experiment in psychological and related sciences knows that this ideology is just that, an ‘ideology’. In our own work we carry out experiments in virtual reality where under different conditions we examine participants' responses. For example, we want to see how conditions E and C influence participants' responses on a set of response variables y1,...,yp, but also where there are covariates x1,...,xk. In the (rare) simplest case p = 1 and k = 0. The conditions E and C are typically ‘Experimental’ and ‘Control’ conditions. Theory will make predictions about how the response variables may be influenced by the experimental conditions. However, it will rarely tell us the expected influence of the covariates (e.g., things like age, gender, education, etc.).

Let’s start with the simplest case - a between-groups experimental design with two conditions E and C, one response variable, and no covariates. The null and alternative hypotheses can be as above. So this is very simple, and we would register this experimental design, and the main test would be a t-test for the difference between the two sample means.

The experiment is carried out and the data are collected. Now the problems start. The t-test relies on various assumptions, such as normality of the response variable (under conditions E and C), equal variances, and independent observations. The latter can usually be assured by experimental protocols. You do the analysis and check the normality of the response variable, and it is not normally distributed under E or C, and the variances are nowhere near each other. The t-test is inappropriate. A bar chart showing sample means and standard errors indeed shows that the mean response under E is greater than under C, but one of the standard errors is worryingly large. You plot the histograms and find that the response variable is bimodal under both conditions, so even the means are not useful for understanding the data. You realise eventually that the bimodality is caused by the sex of the participants, and therefore nothing now makes sense without including this. Moreover, the design happened to be balanced for sex. If you take sex into account the variation in these data is very well explained by the statistical model; if you don’t then nothing is ‘significant’. So now your experimental design is 2x2 with (C,E) and Sex as the two factors. Here everything makes sense: we find mean(Condition-E-Male) > mean(Condition-C-Male) and mean(Condition-E-Female) > mean(Condition-C-Female), both by a wide margin. Everything that needs to be compatible with normality is so. The earlier non-result is explained by the overlap between the Male and Female distributions.

So what do we do now? Do we register a new experiment and do it all over again? In practice this is often not possible: the PhD student has finished, the grant money has run out, circumstances have changed so that the previous equipment or software is just no longer available, etc. Do you kill the paper? In fact it is a kind of mysticism to suppose that what you had registered, i.e., the thoughts you had in your head before the experiment, are somehow going to influence what happened. What happened is as we described: the means tell their story in spite of what you may have thought beforehand. Of course doing a post hoc analysis is not supposed to be valid in the classical statistical framework, because you have already ‘looked at’ the data. So are the results to be discounted? The argument of the ‘registration’ camp is that you should register another experimental design.

So you register another design where sex is explicitly an experimental factor, giving a 2x2 design. But to play safe you also record another set of covariates (age, education, number of hours per week playing computer games, etc.). You run the experiment again, and just as before find the difference in means, and all is well. Moreover, between the 4 groups of the 2x2 design there are no significant differences in any of the covariates. However, while preparing the paper you are generating graphs, and you notice an almost perfect linear relationship between your response variable and the number of hours a week playing computer games. You then include this as a covariate in an ANCOVA, and find that all the other effects are wiped out, and that the explanation may be to do with game playing and not sex, and not even E and C. In fact, looking in more depth, you find that E and C do have the predicted differential effect, but only within the group of females. You find also that game playing is linked to age, and also education, so that there is a complex relationship between all these variables that is impossible to describe with single-equation models such as ANOVA or ANCOVA; what is needed is a path analysis. You set up a path analysis according to what you suspect and it is an exceedingly good fit to the data, and in fact a very simple model. Unfortunately, this means that you have to throw the data away, register another experiment, and start again.
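The "wiped out" effect is worth seeing concretely. The sketch below (invented numbers, plain least squares rather than a full ANCOVA package) shows a response driven entirely by a hypothetical hours-of-gaming covariate: without the covariate the condition ‘effect’ looks large, because one group happens to play more; with it, the condition coefficient collapses to zero.

```python
# Sketch of the ANCOVA story above, with invented numbers: the apparent
# E-vs-C difference vanishes once the covariate is included.
import numpy as np

hours = np.array([1., 2., 3., 4., 5., 6., 7., 8.])  # hypothetical covariate
cond  = np.array([0., 0., 0., 0., 1., 1., 1., 1.])  # 0 = C, 1 = E
y     = 2.0 * hours                                 # response driven by hours alone

ones = np.ones_like(y)

# Model 1: y ~ condition only. The condition 'effect' looks big,
# because the E group happens to play more hours.
b1, *_ = np.linalg.lstsq(np.column_stack([ones, cond]), y, rcond=None)

# Model 2: y ~ condition + hours (an ANCOVA-style fit). The condition
# coefficient collapses to zero; hours explains everything.
b2, *_ = np.linalg.lstsq(np.column_stack([ones, cond, hours]), y, rcond=None)

print(b1[1])  # condition 'effect' without the covariate: 8.0
print(b2[1])  # condition effect with the covariate: ~0.0
print(b2[2])  # hours coefficient: ~2.0
```

In a real analysis one would use a proper ANCOVA with significance tests, but the arithmetic of the confound is exactly this.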

And so it goes on.

Is this ‘fishing’? It might be described that way, but in fact it is the principled way to analyse data. The Registrationists would have us believe that you set up the hypotheses, state the tests, do the experiment, run the tests, and report the results. It is an ‘input-output’ model of science from which thought is eliminated. It is tantamount to saying “No human was involved in the analysis of this data”. That would surely eliminate bias, fishing and error, but it is wrong, and can lead to completely false conclusions, as the simple example above illustrates. In the example above we would have stopped immediately after the first experiment, reported the results (“No significant difference between the means of E and C”) and that would be it - assuming it were even possible to publish a paper with P > 0.05. The more complex relationships would never have been discovered, and since P > 0.05 no other researcher would be tempted to go into this area.

Of course “fishing” is bad. You do an experiment, find ‘nothing’, and then spend months analysing and reanalysing the data in order to find somewhere a P < 0.05. This is not investigation along the lines above; it is “Let’s try this”, “Let’s try that”, not driven by the logic of what is actually found in the data; it is not investigative but trial and error.

Now the above was a very simple example. But suppose there are multiple response variables (p > 1) and covariates (k > 0). There are so many more things that can go wrong. Residual errors of model fits may not be normal, so you have to transform some of the response variables; a subset of the covariates might be highly correlated, so that using a principal component score in their stead may lead to a more elegant and simpler model; there may be a clustering effect where the response variables are better understood by combining some of them; there may be important non-linearity in the relationships that cannot be dealt with in the ANCOVA model; and so on. Each one of these potentially requires the registration of another experiment, since such circumstances could not have been foreseen and were not included in the registered experiment.
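One of the fixes just mentioned - replacing a set of highly correlated covariates with their first principal component score - can be sketched with numpy alone. The three covariates below are invented and almost collinear, so one component carries essentially all their variance:

```python
# Sketch of the principal-component fix mentioned above: three
# strongly correlated covariates collapse onto one score.
# All numbers are invented for illustration.
import numpy as np

# Three hypothetical, nearly collinear covariates (e.g. age,
# education, hours per week playing computer games).
x1 = np.array([1., 2., 3., 4., 5., 6.])
x2 = 2.0 * x1 + np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1])
x3 = -1.0 * x1 + np.array([-0.1, 0.1, -0.1, 0.1, -0.1, 0.1])

X = np.column_stack([x1, x2, x3])
Xc = X - X.mean(axis=0)            # centre each covariate

# The singular value decomposition gives the principal components.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)    # proportion of variance per component

pc1_score = Xc @ Vt[0]             # first principal component score,
                                   # usable as a single covariate
print(explained[0])                # close to 1.0: PC1 dominates
```

Using `pc1_score` in place of the three original covariates gives a simpler model with almost no loss of information - but, of course, this is exactly the kind of data-driven decision that could not have been pre-registered.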

The distinction is rightly made between ‘exploratory’ and ‘confirmatory’ experiments. In practice experiments are a mixture of these - what started out as ‘confirmatory’ can quickly become ‘exploratory’ in the light of the data. Perhaps only in physics are there such clear-cut ‘experiments’, for example, confirming the predictions of the theory of relativity by observing light bending during eclipses. Once we deal with data in the behavioural, psychological or medical sciences things are just messy, and the attempt to force them into the mould of physics is damaging.

Research is not carried out in a vacuum: there are strong economic and personal pressures on researchers - not least of which, for some, is the issue of ‘tenure’. At the most basic level tenure may depend on how many and how often P < 0.05 can be found. Faced with life-changing pressures, researchers may choose to run their experiment, ‘fish’ for some result, and then register it, reporting it only some months after the actual experiment was done.

Registration is going to solve nothing at all. It encourages an input-output type of research and analysis. Register the design, do the experiment, run the results through SPSS, publish the paper (if there is a P < 0.05). In fact too many papers follow this formula - evidenced by the fact that the results section starts with F-tests and t-tests without first presenting, in tables and graphs, a straightforward view of what the data show. This further illustrates the fact that, apart from the initial idea that led to the experiment, there is no intervention of thought in the rest of the process - presenting data in a useful way and discussing it before ever presenting a ‘test’ requires thought. In this process discovery goes out the window, because discovery occurs when you get results that were not predicted before the experiment. At best discovery becomes extremely expensive - since the ‘discovery’ could only be followed through by the registration of another experiment that would specifically address it.

Overall, ‘registration’ of experiments is not only not a good idea, it is a damaging one. It encourages wrong views of the process of scientific discovery and will not even prevent what it was designed to prevent - ‘fishing’. It will have a negative impact, reducing everything to formulaic ways of proceeding that the unscrupulous will find ways to get round, and that those interested more in discovery than in accumulating P values will find frustrating and alien.