In the Presence of Bayesian Statistics for VR
My original degree and Masters were in Statistics. Then I also studied sociology and psychology. Then, via statistics and through my involvement in the statistical language GLIM, I moved to computer science and computer graphics while at Queen Mary (University of London). We had our first head-mounted display from Division Inc in 1991, and together with the postdoc Dr Martin Usoh we started investigating presence in virtual reality. At this point my own background suddenly made sense - through statistics I knew how to design experiments and analyse data; through the social sciences I could read and understand papers in psychology and related fields; and computer science, computer graphics, programming and so on meant that I could understand what was feasible given the hardware and software resources. I embarked on a research program in virtual reality mainly rooted in experimental studies, concerned with various aspects of how people respond to virtual environments. I moved to UCL in 1995 and then to Barcelona, and continued with this series of studies.
Over many years, for the analyses of the results of these experiments I used classical statistics - hypothesis testing, significance levels, confidence intervals, and so on. Think about the meaning of "significance level". A lot of people automatically interpret it as the probability of the null hypothesis. So if the significance level is low (< 0.05) then the probability of the null hypothesis being true is low and it should be "rejected". However, it doesn't mean that at all. The significance level is the conditional probability of rejecting the null hypothesis given that it is true. In the long run, with a significance level of 5%, the statistical test will reject the null hypothesis 5% of the time when the null hypothesis is in fact true. This is far from the meaning of the probability of the null hypothesis. In fact in classical statistics the probability of the null hypothesis being true is 0 or 1 (only we don't know which). This is because in classical statistics probability is associated with the frequency of occurrence of an event. Only if there were parallel universes, in some of which the null hypothesis were true and in others not, and we could access data from these universes, could we think of "the probability of the null hypothesis" in classical statistics.
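To make the long-run interpretation concrete, here is a minimal sketch in R (base R only, with made-up data) that simulates many experiments in which the null hypothesis really is true, and counts how often a standard t-test "rejects" it at the 5% level:

```r
# Simulate experiments where the null hypothesis (no difference in means) is true,
# and record the p-value of a t-test for each one.
set.seed(1)
p_values <- replicate(10000, {
  x <- rnorm(15, mean = 0, sd = 1)   # group 1
  y <- rnorm(15, mean = 0, sd = 1)   # group 2, same distribution: the null is true
  t.test(x, y)$p.value
})
# Proportion of tests that "reject" at the 5% level - close to 0.05 by construction.
# This is what the significance level controls; it is not P(null hypothesis is true).
mean(p_values < 0.05)
```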
When people study statistics, even if they are mathematically sophisticated, many find the concepts - Type I error, Type II error, significance levels, power, confidence intervals - very mysterious. They are! Think about a confidence interval. You're told: the 95% confidence interval for the true mean is between 20 and 30. Automatically this is interpreted as "the probability of the true mean being between 20 and 30 is 0.95". Again, this is a wrong interpretation. It means that if we had multiple repetitions of the same experiment, then 95% of the time the calculated confidence interval would contain the true mean. This says nothing about the particular interval we have found - is it one of the 95% or one of the 5%? The mathematical theory of classical statistics, such as the Neyman-Pearson Lemma, is very elegant. But the interpretation is somewhat convoluted.
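The long-run reading can be checked the same way - again a minimal sketch in base R with made-up data:

```r
# Repeat the same experiment many times and check how often the calculated 95%
# confidence interval contains the true mean.
set.seed(2)
true_mean <- 25
covered <- replicate(10000, {
  x  <- rnorm(20, mean = true_mean, sd = 5)
  ci <- t.test(x)$conf.int           # the 95% interval from this one sample
  ci[1] <= true_mean && true_mean <= ci[2]
})
mean(covered)   # close to 0.95: the *procedure* covers the true mean 95% of the time
```

But for any single interval, nothing tells us whether it is one of the 95% or one of the 5%.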
Classical statistics is easy to do. To apply it you don't need to understand it. You run the data through a statistical package and out comes a number: t = 2.2 on 18 degrees of freedom, and P < 0.05. (But is it one-sided or two-sided?) P < 0.05, so we have "significance": write the paper! For every different type of problem there is another type of test. There are underlying assumptions that must be obeyed - typically that the random errors in the response variable under consideration must be normally distributed. If not, try to find some transformation of the data (like taking the log) to make it so. Otherwise use "non-parametric" statistics, and learn another whole set of tests (also available in the statistical package). In principle it is easy.
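For illustration only, that workflow might look like this in R, with hypothetical skewed data (the functions are just the standard base R ones):

```r
set.seed(3)
a <- rlnorm(18, meanlog = 1.0)   # hypothetical skewed response, condition A
b <- rlnorm(18, meanlog = 1.4)   # hypothetical skewed response, condition B

t.test(a, b)              # default Welch t-test (two-sided, by default)
t.test(log(a), log(b))    # log-transform to make the normality assumption more plausible
wilcox.test(a, b)         # or give up on normality and use a non-parametric test
```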
But what about the "power"? This is the probability of rejecting the null hypothesis when it is not true. You're supposed to compute the power before you do the experiment, because this will help in determining the required sample size. But how can you compute the power? To do so you need to know at least the variances of the response variables. To know those you must have already collected sufficient data. But what was the power involved in that data collection exercise?
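Base R's power.t.test function makes the circularity visible: before any data have been collected you must already supply a guessed effect size and standard deviation. A sketch (the numbers are invented):

```r
# Sample size needed per group for 80% power at the 5% level,
# given a guessed difference in means (delta) and standard deviation (sd).
power.t.test(delta = 1.0,       # assumed true difference - a guess
             sd = 2.0,          # assumed standard deviation - another guess
             sig.level = 0.05,
             power = 0.8)
```

Change the guessed sd and the "required" sample size changes with it.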
Bayesian statistics has a completely different philosophy. You start off with your prior probabilities for the hypothesis in question (e.g., that the true mean is in a certain range). You collect the data, and then use Bayes' Theorem to update your prior probabilities to posterior probabilities. So you end up with a revised probability conditional on the data. Of course this is also done through a statistical package, since apart from very simple problems the integrals involved in finding the posterior probabilities must be evaluated using numerical simulation.
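As a minimal sketch of the idea (a standard conjugate textbook example, not a model from any of our papers, and assuming for simplicity that the data standard deviation is known), the update from prior to posterior for a mean can be written in a few lines of base R, ending with a direct probability statement about the parameter:

```r
sigma <- 5                         # assumed known data standard deviation
prior_mean <- 0; prior_sd <- 20    # a vague prior for the true mean

set.seed(4)
y <- rnorm(20, mean = 25, sd = sigma)    # hypothetical data

# Standard conjugate update for the mean of a normal distribution with known sd
post_prec <- 1 / prior_sd^2 + length(y) / sigma^2
post_mean <- (prior_mean / prior_sd^2 + sum(y) / sigma^2) / post_prec
post_sd   <- sqrt(1 / post_prec)

# The posterior probability that the true mean lies between 20 and 30, given the data
pnorm(30, post_mean, post_sd) - pnorm(20, post_mean, post_sd)
```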
People argue that there is a subjective element in this. Yes, it is true: for Bayesian statistics probability is not based on frequency but is subjective (though of course you can use informative frequency data). It is subjective, but as you add more and more data, results starting from different subjective priors will converge to essentially the same posteriors. And ... think back to power calculations in classical statistics - these are based on guesswork, "estimations" of variance that typically have no basis in any actual data. Power calculations make everyone relax ("Oh great, it has a power of 80%!") but in reality I think that such power calculations are meaningless.
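The convergence point can be seen in a small beta-binomial sketch (standard conjugate results, with invented counts): two analysts start from quite different priors for a proportion, and with enough data their posteriors all but coincide.

```r
# Posterior mean of a proportion with a Beta(a, b) prior and binomial data:
# the posterior is Beta(a + successes, b + failures).
posterior_mean <- function(a, b, successes, n) (a + successes) / (a + b + n)

# With only 10 observations (7 successes) the priors still matter:
posterior_mean(1, 1, 7, 10)       # flat prior Beta(1, 1)
posterior_mean(10, 2, 7, 10)      # optimistic prior Beta(10, 2)

# With 1000 observations (700 successes) they barely matter at all:
posterior_mean(1, 1, 700, 1000)
posterior_mean(10, 2, 700, 1000)
```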
In Bayesian statistics there are not lots of different tests for different situations. There is only one principle, based on Bayes' Theorem. You do need good statistical software though, software that allows you to properly express your prior distributions and the likelihood (the probability distribution of the response variables conditional on the parameters under investigation), and to compute the posterior distributions.
A few years ago I came across the language called BUGS. It appealed to the computer science side of me because it is a declarative language in which you can elegantly express prior distributions and likelihoods. So I started analysing the results of our experimental studies using Bayesian statistics - I think that this is the first paper where I dared to do this. I thought that reviewers would come down heavily against this, but to my pleasant surprise they were quite favourable, and have not raised questions since!
Later I came across the Stan probabilistic programming language, which at first I used in conjunction with MATLAB, and later with R and the R interface for Stan (RStan).
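For example, a minimal Stan model for estimating a mean, run from R via the rstan package, has exactly the ingredients described above - priors, a likelihood, and simulation of the posterior. This is only an illustrative sketch (the priors and data are invented), not a model from any of our papers:

```r
library(rstan)

model_code <- "
data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 20);       // prior for the mean
  sigma ~ normal(0, 10);    // prior for the sd (half-normal, via the constraint)
  y ~ normal(mu, sigma);    // likelihood
}
"

set.seed(5)
y <- rnorm(20, mean = 25, sd = 5)    # hypothetical data
fit <- stan(model_code = model_code,
            data = list(N = length(y), y = y),
            chains = 4, iter = 2000)
print(fit)    # posterior summaries for mu and sigma
```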
Bayesian analysis appeals to me because of its simplicity in concept and interpretation, but also because it overcomes a major problem in classical statistics. When we carry out an experiment there are typically multiple response variables of interest - e.g., the results of a questionnaire, physiological measures, behavioural responses, and so on. Bearing in mind the meaning of "significance", when you carry out more than one statistical test you lose control of the significance level. For example, if the significance level is 0.05 for each test, then the probability that at least one will be "significant" by chance alone is not 0.05 but greater. There are ad hoc ways around this, like the extreme Bonferroni correction, or multiple comparison tests like Scheffé's, but these are not principled, even if clever.
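The arithmetic behind this is simple: with k independent tests each at the 0.05 level, the chance of at least one spurious "significant" result is 1 - 0.95^k. In R:

```r
# Probability of at least one "significant" result by chance alone,
# for k independent tests each at the 0.05 level.
k <- 1:10
data.frame(tests = k, p_at_least_one = 1 - (1 - 0.05)^k)
# With 10 tests it is already about 0.40.
```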
In Bayesian statistics, if you have k parameters of interest, then the joint probability distribution of all these parameters together is computed, and you can read off as many probability statements as you like from it without losing anything. It is as if you have a page of text with lots of facts. In classical statistics, the more facts you read off the page, the less the reliability of your conclusions. But in Bayesian statistics you can read as many "facts" off the page as you like, and nothing is affected. If you compute a particular set of probabilities from the joint distribution, then they are all valid - those are the probabilities, and that's it.
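In practice this means working directly with draws from the joint posterior distribution. A sketch in base R, using simulated draws as a stand-in for the output of a real Stan fit (the parameter names are hypothetical, and real posterior draws would of course carry the correlations between parameters):

```r
set.seed(6)
n_draws <- 4000
# Stand-ins for joint posterior draws of three parameters from a fitted model
beta1 <- rnorm(n_draws, 0.8, 0.3)
beta2 <- rnorm(n_draws, 0.2, 0.3)
beta3 <- rnorm(n_draws, -0.5, 0.3)

mean(beta1 > 0)               # P(beta1 > 0 | data)
mean(beta2 > 0)               # P(beta2 > 0 | data)
mean(beta1 > beta2)           # P(beta1 > beta2 | data)
mean(beta1 > 0 & beta3 < 0)   # a joint statement, read off the same draws
```

Each of these is a valid probability statement in its own right; no correction for the number of statements is needed.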
After many experimental studies using Bayesian statistics and Stan I decided to write a textbook summarising what I had learned over the years. The book starts from an introduction to probability and probability distributions, and after introducing Bayes' Theorem goes through a series of different experimental setups, with the corresponding Bayesian model and how to program it in Stan. There is a set of slides (with commentary) available, and every program in the book is available on my Kaggle page, where they can be executed online (look for the files called "Slater-Bayesian-Statistics-*").
Finally, you might enjoy reading about some statistics of virtual reality.