Mel Slater's Presence Blog

Thoughts about research and radical new applications of virtual reality - a place to write freely without the constraints of academic publishing, and have some fun.

25 May, 2023

 

Eliciting Sentiment with ChatGPT





Background

 

In two previous papers we used sentiment analysis to better understand the responses of participants to an unusual experience in virtual reality – attending a rock concert amongst a virtual audience (1, 2). The sentiment analysis allowed us to get ‘behind the scenes’ and discover aspects of the experiences of participants that we would never have found with questionnaires or behavioural or physiological responses. 

 

I recently had to do sentiment analysis of more than 1200 short pieces of text. In the previous work I used the following R packages:

 

·       sentimentr (3), which uses 9 dictionaries and pays particular attention to ‘valence shifters’, i.e., modifiers, so that ‘I do not like it’ is correctly recognized as negative and ‘I really like it’ as an enhanced positive.

·       The VADER system (4), which was designed for the analysis of social media text but is also used more generally. I used the R implementation by Katherine Roehrick.

·       The syuzhet package (5), which includes 4 sentiment lexicons and was originally designed for analysis of the latent structure in narrative, although it has been used widely for other applications.

·       SentimentAnalysis, the fourth R package we used (6, 7), which exploits 3 different dictionaries and was designed with the analysis of financial text in mind, but again has been used more widely.
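
For anyone who wants to try the same kind of scoring, here is a minimal sketch in R of how these four packages can be called. It is illustrative only (not the scripts used in the papers); the example text is the one used later in this post, and the exact arguments and output columns are as documented in each package.

# install.packages(c("sentimentr", "vader", "syuzhet", "SentimentAnalysis"))
library(sentimentr)
library(vader)
library(syuzhet)
library(SentimentAnalysis)

mytext <- "I really liked it but aspects of it were uncomfortable"

sentiment_by(get_sentences(mytext))$ave_sentiment  # sentimentr: average polarity, valence shifters included
vader_df(mytext)$compound                          # vader: compound score in [-1, 1]
get_sentiment(mytext, method = "syuzhet")          # syuzhet: lexicon-based score (not bounded to [-1, 1])
analyzeSentiment(mytext)$SentimentQDAP             # SentimentAnalysis: score from the QDAP dictionary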

 

I was interested to see how well ChatGPT4 would do with sentiment analysis, given that it is a highly sophisticated language model based on data far more extensive than would have been possible with the packages above. Also, there have been some studies showing that it has a lot of promise in this area (8, 9).

 

I was interested to know (i) how to actually do it in a simple way, without any programming whatsoever, and (ii) whether the scores obtained would better reflect the sentiment expressed in the text.

 

 

Method

 

In order to make the text available to ChatGPT4, at first I tried prompts such as:

 

“In the following are short texts (phrases and sentences) separated by ‘$’. Please give me sentiment scores for each text in the range -1 to 1, while maintaining the positions of the ‘$’. Please don’t output anything other than the numbers themselves.”

 

(Yes, I can’t help using “please” and “thank you” in my communications).

 

This method sort of worked, and it would give me output such as -0.4$0.5$0.9$-0.8$ …

However, it invariably missed some, and since I had a large amount of text and had to break it into segments, a single missing score would cost me a lot of time tracking down where it had gone wrong.

 

Instead, the following method worked well. The text was read into R (e.g., into a variable “mytext”), and then printing mytext in R would give, for example,

 

[1] "It may have had an impact, but there is a lot of need for appeasement towards the therapist in the answers given about the evaluation of the VR session"                                                                                                                                                                                                                                                     

[2] "It has helped her manage emotions generated through VR"                                                                                                                                                                                                                                                                                                                                                       

[3] "Although it has allowed her to understand the concepts of psychological violence, she has not identified with her attitude and behaviour."                                                                                                                                                                                                                                                                     

[4] "She denies the events that occurred and maintains a distant role regarding the activity"                                                                                                                                                                                                                                                                                                                     

[5] "She remains distant regarding the activity, referring more utility to the contents worked on in the rest of the sessions"                           
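
For completeness, this step is just a couple of lines of R (a sketch; the file name is hypothetical). When the texts are long, printing the character vector shows one element per line with its [n] index, or the numbering can be produced explicitly:

# Read the short texts, one per line, and show them in the numbered [n] "..." form.
mytext <- readLines("responses.txt")      # hypothetical file of short texts
print(mytext)                             # long elements print one per line, prefixed by [n]
cat(sprintf('[%d] "%s"', seq_along(mytext), mytext), sep = "\n")   # or build the numbering explicitly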

 

To ChatGPT:

 

Prompt: “I'm going to give a number of phrases each prefixed by a number in square brackets [1] to [n] where n is some number,  and each text is in quote marks "...". Please give a sentiment score between -1 and 1 for each text, maintaining the number in brackets for each one. Please do not provide any other explanation but just give the output in the form [1] score, [2] score, and so on. OK?”

 

The output was:

 

[1] -0.36

[2] 0.56

[3] -0.04

[4] -0.31

[5] -0.22

 

In the many such streams of text that I provided it never missed a single one, and the scores seemed appropriate.
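
To bring the scores back into R for the comparisons below, the reply can simply be pasted into a text file and the [n] prefixes stripped. A sketch (file name hypothetical):

# Parse lines of the form "[1] -0.36" into a numeric vector.
reply <- readLines("chatgpt_scores.txt")
chatgpt_scores <- as.numeric(sub("^\\[\\d+\\]\\s*", "", reply))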

 

Comparisons

 

Here I give comparisons with the methods from the R packages.

 

Text: "It may have had an impact, but there is a lot of need for appeasement towards the therapist in the answers given about the evaluation of the VR session."
sentimentr: -0.10   vader: 0.00   syuzhet: -0.25   SentimentAnalysis: 0.09   ChatGPT4: -0.36

Text: "It has helped her manage emotions generated through VR"
sentimentr: 0.40   vader: 0.00   syuzhet: 1.20   SentimentAnalysis: 0.50   ChatGPT4: 0.56

Text: "Although it has allowed her to understand the concepts of psychological violence, she has not identified with her attitude and behaviour."
sentimentr: 0.11   vader: -0.62   syuzhet: -0.50   SentimentAnalysis: 0.11   ChatGPT4: -0.04

Text: "She denies the events that occurred and maintains a distant role regarding the activity."
sentimentr: -0.13   vader: -0.42   syuzhet: -0.50   SentimentAnalysis: 0.12   ChatGPT4: -0.31

Text: "She remains distant regarding the activity, referring more utility to the contents worked on in the rest of the sessions."
sentimentr: 0.26   vader: 0.00   syuzhet: 0.65   SentimentAnalysis: 0.30   ChatGPT4: -0.22

 

Overall, it seems that ChatGPT had a better understanding of the text, and generated values more in accord with the sentiment expressed. 


A caveat is that I did not spend time cleaning the text, and with ChatGPT we have no idea how it numerically assigns these scores.

 

However, with ChatGPT we are able to ask it for explanations.

 

Below, each text is followed by its ChatGPT4 score and ChatGPT4’s explanation of that score.

It may have had an impact, but there is a lot of need for appeasement towards the therapist in the answers given about the evaluation of the VR session.

-0.36

The negative sentiment in this statement arises from the mention of "a lot of need for appeasement towards the therapist." This suggests that the user's responses about the evaluation of the VR session may not reflect their true feelings or experiences accurately. The use of the term "appeasement" implies that the user may feel compelled to provide positive or favorable feedback to please the therapist or avoid any potential conflict. While the statement also acknowledges that the VR session may have had an impact, the overall negative sentiment is driven by the perceived need for appeasement, which suggests potential reservations or incongruence between the user's genuine experience and their expressed evaluation of the VR session.

It has helped her manage emotions generated through VR.

0.56

The positive sentiment in this statement arises from the mention of the VR experience being helpful in managing emotions. By stating that it has helped her manage emotions, it implies that the VR session had a positive impact on the user's emotional state. This suggests that the user found the VR experience beneficial in terms of emotional regulation or control. The use of the word "helped" indicates that the VR session had a positive influence, enabling the user to effectively deal with the emotions evoked during the VR experience. Overall, the statement reflects a positive outcome of the VR session in terms of emotional management.

Although it has allowed her to understand the concepts of psychological violence, she has not identified with her attitude and behavior.

-0.04

The negative sentiment in this statement arises from the user's inability to identify with her own attitude and behavior despite gaining an understanding of psychological violence through the VR experience. While the user has acquired knowledge about the concepts of psychological violence, the fact that she does not identify with her own attitude and behavior suggests a disconnect or lack of self-reflection. The use of the word "although" implies a contrast between the user's understanding of psychological violence and her inability to relate to it personally. This discrepancy may contribute to a slightly negative sentiment as it indicates a potential limitation or challenge in the user's engagement with the VR experience. Overall, the statement reflects a mixed sentiment, with a slight negative tone stemming from the user's lack of identification with her attitude and behavior despite gaining conceptual understanding of psychological violence.

She denies the events that occurred and maintains a distant role regarding the activity.

-0.31

The negative sentiment in this statement is primarily driven by two factors. First, the user's denial of the events that occurred suggests a refusal to acknowledge or accept the significance or impact of those events. This denial can be seen as a defensive mechanism, potentially hindering personal growth or understanding from the VR activity. Second, the user's maintenance of a distant role regarding the activity implies a lack of active engagement or emotional connection with the content. By keeping a distance, the user may be avoiding or downplaying the potential impact or relevance of the activity. Together, these factors contribute to a negative sentiment as they indicate resistance, denial, and a lack of active involvement in the VR activity. The user's denial and emotional distance suggest a limited willingness to explore or address the underlying issues associated with the activity, which can hinder the effectiveness of the intervention.

She remains distant regarding the activity, referring more utility to the contents worked on in the rest of the sessions.

-0.22

The negative sentiment in this statement is primarily driven by the user's sense of distance or detachment from the activity. By remaining distant, the user is indicating a lack of active engagement or emotional connection with the VR activity. This detachment can potentially hinder the effectiveness of the intervention, as active engagement is crucial for meaningful learning and growth. Furthermore, the user's statement about finding more utility in the contents worked on in the rest of the sessions suggests that they perceive the VR activity as less valuable or impactful compared to other aspects of the therapy. This perception further reinforces the sentiment of negativity or skepticism towards the VR activity. Overall, the statement reflects a certain level of disengagement and skepticism towards the VR activity, contributing to the slightly negative sentiment score. The user's preference for the contents of the other sessions indicates a perceived lack of value or relevance in the VR activity, which can impact its effectiveness in achieving therapeutic goals.

 

Consistency

 

Here I consider whether ChatGPT4 is consistent across different evaluations of the same text. I gave it the same 50 texts on two occasions, a few days apart, and asked it to find the sentiment scores using the same prompt as above.

 

The figure below shows the comparison between the two sets of scores. The correlation is r = 0.82 with 95% confidence interval 0.70 to 0.89. So the correlation is very strong but there are clearly 3 outlying points with quite different scores on each occasion. 

 



Figure – The x-axis corresponds to the original sentiment scores and the y-axis scores for the same text a few days later.
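
Such a consistency check is straightforward in R; a sketch, assuming the two runs of scores are in vectors scores1 and scores2 (hypothetical names):

cor.test(scores1, scores2)    # Pearson r with a 95% confidence interval
plot(scores1, scores2, xlab = "First run score", ylab = "Second run score")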

 

I asked ChatGPT why there were discrepancies, and it replied:

 

“It's important to note that sentiment analysis is inherently challenging and can be influenced by individual interpretation. Different human annotators may assign different sentiment scores to the same sentence based on their personal perspectives, experiences, and understanding of the context. Similarly, different sentiment analysis models may yield varying scores due to differences in their algorithms, training data, and biases.  Therefore, while sentiment analysis can provide valuable insights, it is essential to consider it as an approximation and be aware of its limitations, especially when interpreting and comparing scores from different sources or occasions.”

ChatGPT Versions

The results from ChatGPT3.5 and ChatGPT4 are quite different.

I tried this with ChatGPT4:

Prompt: Please give a sentiment score between -1 and 1 for the following text: "I really liked it but aspects of it were uncomfortable".

ChatGPT4: Assigning a sentiment score to a text depends on the balance of positive and negative expressions. Here, we have a mix of both positive ("I really liked it") and negative sentiment ("aspects of it were uncomfortable"). Therefore, the sentiment score might be slightly positive, considering the expressed like outweighs the discomfort. I'd assign a score of 0.3. Please note that sentiment analysis can be subjective and can vary based on the nuances of language.

Prompt: Please give a sentiment score between -1 and 1 for the following text "Aspects of it were uncomfortable but I really liked it".

ChatGPT4: In this sentence, even though it starts with a negative expression ("Aspects of it were uncomfortable"), it ends on a positive note ("but I really liked it"). Similar to the previous example, there is a mix of both positive and negative sentiments here. Given the overall positive sentiment, I would assign a score of 0.3. As noted before, sentiment analysis can be subjective and can vary based on the nuances of language.

Now here is the same with ChatGPT3.5:

Prompt: Please give a sentiment score between -1 and 1 for the following text: "I really liked it but aspects of it were uncomfortable".

ChatGPT3.5: Based on the given text, the sentiment score can be calculated as follows:
Positive sentiment: "I really liked it" - This indicates a positive sentiment. Negative sentiment: "aspects of it were uncomfortable" - This indicates a negative sentiment.
To assign a sentiment score between -1 and 1, we can consider the overall sentiment of the text, which combines both positive and negative aspects. Since the positive sentiment outweighs the negative sentiment, we can assign a sentiment score closer to 1, indicating a predominantly positive sentiment. However, the discomfort mentioned does introduce some negative sentiment.
Considering the balance between positive and negative sentiments, I would assign a sentiment score of approximately 0.6 to the given text.

Prompt: Please give a sentiment score between -1 and 1 for the following text "Aspects of it were uncomfortable but I really liked it".

ChatGPT3.5: Based on the given text, "Aspects of it were uncomfortable but I really liked it," the sentiment score can be calculated as follows:
Positive sentiment: "I really liked it" - This indicates a positive sentiment. Negative sentiment: "Aspects of it were uncomfortable" - This indicates a negative sentiment.
Similar to the previous example, we need to consider the overall sentiment of the text. In this case, although there is positive sentiment expressed, the presence of discomfort is more prominent. Therefore, the negative sentiment outweighs the positive sentiment to some extent.
Considering the balance between positive and negative sentiments, I would assign a sentiment score of approximately -0.4 to the given text, indicating a predominantly negative sentiment.

So the version of ChatGPT used is very important: ChatGPT4 gave the same score (0.3) regardless of the order of the clauses, whereas ChatGPT3.5 flipped from 0.6 to -0.4 when the clauses were reversed.

Conclusions

Using ChatGPT in a very simple way for sentiment analysis is feasible, and it gives good results. I don’t recommend applying it blindly: check carefully that the results it gives seem reasonable, bearing in mind the last quoted comment from ChatGPT about the subjective nature of sentiment interpretation. Do several different runs with ChatGPT and compare the results between them for consistency, and compare with other mainstream methods of sentiment analysis.

References

 

1.        A. Beacco, R. Oliva, C. Cabreira, J. Gallego, M. Slater, Disturbance and Plausibility in a Virtual Rock Concert: A Pilot Study. 2021 IEEE Virtual Reality and 3D User Interfaces (VR), 538-545 (2021). doi:10.1109/VR50410.2021.00078.

2.        M. Slater, C. Cabriera, G. Senel, D. Banakou, A. Beacco, R. Oliva, J. Gallego, The sentiment of a virtual rock concert. Virtual Reality (2022). doi:10.1007/s10055-022-00685-9.

3.        T. W. Rinker, sentimentr: Calculate Text Polarity Sentiment. Version 2.9.0, https://github.com/trinker/sentimentr (2021).

4.        C. Hutto, E. Gilbert, VADER: A parsimonious rule-based model for sentiment analysis of social media text. Proceedings of the International AAAI Conference on Web and Social Media, vol. 8 (2014).

5.        M. Jockers, Package ‘syuzhet’. https://cran.r-project.org/web/packages/syuzhet (2017).

6.        S. Feuerriegel, N. Proellochs, Package ‘SentimentAnalysis’. CRAN (2018).

7.        S. Feuerriegel, N. Proellochs, SentimentAnalysis Vignette. https://mran.microsoft.com/snapshot/2017-08-06/web/packages/SentimentAnalysis/vignettes/SentimentAnalysis.html (2019).

8.        Z. Wang, Q. Xie, Z. Ding, Y. Feng, R. Xia, Is ChatGPT a good sentiment analyzer? A preliminary study. arXiv preprint arXiv:2304.04339 (2023).

9.        T. Susnjak, Applying BERT and ChatGPT for sentiment analysis of Lyme disease in scientific literature. arXiv preprint arXiv:2302.06474 (2023).

17 December, 2022

In the Presence of Bayesian Statistics for VR

 





My original degree and Masters were in Statistics. Then I also studied sociology and psychology. Then via statistics, through my involvement in the statistical language GLIM, I moved to computer science and computer graphics, while at Queen Mary (University of London). We had our first head-mounted display from Division Inc in 1991, and together with the postdoc Dr Martin Usoh, we started investigating presence in virtual reality. At this point my own background suddenly made sense - through statistics I knew how to design experiments and analyse data. Through the social sciences I could read and understand papers in psychology and related sciences. Computer science, computer graphics, programming, etc., meant that I could understand what was feasible given the hardware and software resources. I embarked on a research program in virtual reality mainly rooted in experimental studies, concerned with various aspects of how people respond to virtual environments. I moved to UCL in 1995 and then to Barcelona, and continued with this series of studies.

Over many years, for the analyses of the results of these experiments I used classical statistics - hypothesis testing, significance levels, confidence intervals, and so on. Think about the meaning of "significance level". A lot of people automatically interpret it as the probability of the null hypothesis. So if the significance level is low (< 0.05) then the probability of the null hypothesis being true is low and it should be "rejected". However, it doesn't mean that at all. The significance level is the conditional probability of rejecting the null hypothesis if it is true. This means that in the long run, a significance level of 5% means that 5% of the time the statistical test will reject the null hypothesis when it is true. This is far from the meaning of the probability of the null hypothesis. In fact in classical statistics, the probability of the null hypothesis being true is 0 or 1 (only we don't know which). This is because in classical statistics probability is associated with frequency of occurrence of an event. Only if there were parallel universes, in some of which the null hypothesis were true, and others not, and we could access data from these universes, could we think of "the probability of the null hypothesis" in classical statistics.

When people study statistics, even if they are mathematically sophisticated, many find the concepts - type 1 error, type 2 error, significance levels, power, confidence intervals - very mysterious. They are! Think about a confidence interval. You're told - the 95% confidence interval for the true mean is between 20 and 30. Automatically this is interpreted as "the probability of the true mean being between 20 and 30 is 0.95." Again, this is a wrong interpretation. It means that if we had multiple repetitions of the same experiment, then 95% of the time the calculated confidence limits would contain the true mean. This says nothing about the particular interval we have found - is this one of the 95% or one of the 5%? The mathematical theory of classical statistics, like the Neyman-Pearson Lemma, is very elegant. But interpretation is somewhat convoluted.
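
The repeated-sampling interpretation is easy to see in a small simulation; here is a sketch in R with arbitrary illustrative values for the true mean and standard deviation:

# Draw many samples from a known population and count how often the
# 95% confidence interval contains the true mean.
set.seed(1)
true_mean <- 25
covered <- replicate(10000, {
  x <- rnorm(20, mean = true_mean, sd = 5)
  ci <- t.test(x)$conf.int
  ci[1] < true_mean && true_mean < ci[2]
})
mean(covered)   # close to 0.95 - but this says nothing about any single interval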

Classical statistics is easy to do. To apply it you don't need to understand it. You run the data through a statistical package and out comes a number t=2.2, on 18 degrees of freedom, and P < 0.05. (But is it one-sided or two-sided?). P < 0.05, so we have "significance": write the paper! For every different type of problem there is another type of test. There are underlying assumptions that must be obeyed - typically that the random errors in the variable under consideration with respect to the hypothesis, must be normally distributed. If not - try to find some transformation of the data (like taking the log) to make it so. Otherwise use "non-parametric" statistics, and learn another whole set of tests (also available in the statistical package). It is in principle easy.

But what about the "power"? This is the probability of rejecting the null hypothesis when it is not true. You're supposed to compute the power before you do the experiment, because this will help in determining the required sample size. But how can you compute the power? To do so you need to know at least the variances of the response variables. To know those you must have already collected sufficient data. But what was the power involved in that data collection exercise?
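
In practice a conventional power calculation amounts to something like the following sketch in R, where the assumed effect size and standard deviation are exactly the kind of guesses discussed above:

# Sample size per group for a two-sample t test, given a guessed difference and sd.
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)   # roughly 64 per group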

Bayesian statistics has a completely different philosophy. You start off with your prior probabilities of the hypothesis in question (e.g., that the true mean is in a certain range). You collect the data, and then use Bayes' Theorem to update your prior probabilities to posterior probabilities. So you end up with a revised probability conditional on the data. Of course this is also done through a statistical package, since apart from very simple problems, the integrals involved in finding the posterior probabilities must be evaluated using numerical simulation.

People argue that there is a subjective element in this. Yes, it is true, for Bayesian statistics probability is not based on frequency but is subjective (though of course you can use informative frequency data). It is subjective, but as you add more and more data, results starting from different subjective priors will converge to the same posteriors. And ... think back to power calculations in classical statistics - these are based on guesswork, "estimations" of variance that typically have no basis in any actual data. Power calculations make everyone relax ("Oh great, it has a power of 80%!") but in reality I think that such power calculations are meaningless.

In Bayesian statistics there are not lots of different tests for different situations. There is only one principle - based on Bayes' Theorem. You need good statistical software though, that allows you to properly express your prior distributions and the likelihood (the probability distribution of the response variables conditional on the parameters under investigation), and to compute the posterior distributions.

A few years ago I came across the language called BUGS. It appealed to the computer science side of me because it is a functional programming language where you can elegantly express prior distributions and likelihoods. So I started analysing the results of our experimental studies using Bayesian statistics - I think that this is the first paper where I dared to do this. I thought that reviewers would come down heavily against this, but to my pleasant surprise they were quite favourable, and have never raised questions ever since!

Later I came across the Stan probabilistic programming language, which I at first used in conjunction with MATLAB and then with R, via the R interface for Stan.
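
As an illustration only (not a model from any particular paper), this is the kind of thing a minimal analysis looks like with Stan called from R: a normal likelihood with priors on the mean and standard deviation, fitted to a few made-up data points.

# A minimal Bayesian model in Stan, run from R via rstan (illustrative values throughout).
library(rstan)

stan_code <- "
data { int<lower=1> n; vector[n] y; }
parameters { real mu; real<lower=0> sigma; }
model {
  mu ~ normal(25, 10);     // prior for the true mean
  sigma ~ normal(0, 5);    // half-normal prior (via the lower bound) for the sd
  y ~ normal(mu, sigma);   // likelihood
}
"

y <- c(24, 27, 22, 30, 26, 25)   # made-up data
fit <- stan(model_code = stan_code, data = list(n = length(y), y = y))
mu_draws <- extract(fit)$mu
quantile(mu_draws, c(0.025, 0.975))   # a 95% credible interval for the mean
mean(mu_draws > 25)                   # the posterior probability that the mean exceeds 25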

Bayesian analysis appeals to me because of its simplicity in concept and interpretation, but also because it overcomes a major problem in classical statistics. When we carry out an experiment there are typically multiple response variables of interest - e.g., the results of a questionnaire, physiological measures, behavioural responses, and so on. Bearing in mind the meaning of "significance", when you carry out more than one statistical test you lose control of significance. E.g., if the significance level is 0.05 for each test, then the probability that at least one will be "significant" by chance is not 0.05 but greater (for k independent tests it is 1 - 0.95^k, which is already about 0.23 for k = 5). There are ad hoc ways around this, like the extreme Bonferroni correction, or multiple comparison tests like Scheffé's, but these are not principled, even if clever.

In Bayesian statistics if you have k parameters of interest, then what happens is that the joint probability distribution of all these parameters together is computed, and then you can read off as many probability statements as you like from this, without losing anything. It is as if you have a page of text with lots of facts. In classical statistics the more facts that you read off the page, the less the reliability of your conclusions. But in Bayesian statistics you can read as many "facts" off the page as you like, and nothing is affected. If you compute a particular set of probabilities from the joint distribution, then they are all valid - those are the probabilities, and that's it.
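
Continuing the sketch above: if a model had, say, two parameters mu1 and mu2 (hypothetical), any number of statements can be read from the same joint posterior draws, with no correction needed.

# All of these come from the same joint posterior draws of a hypothetical two-parameter model.
draws <- extract(fit2)                 # fit2: a fitted model with parameters mu1 and mu2
mean(draws$mu1 > draws$mu2)            # P(mu1 > mu2 | data)
mean(draws$mu1 > 0)                    # P(mu1 > 0 | data)
mean(draws$mu1 > 0 & draws$mu2 > 0)    # P(both exceed 0 | data)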

After many experimental studies using Bayesian statistics and Stan I decided to write a textbook summarising what I had learned over the years. The book starts from an introduction to probability and probability distributions, and after introducing Bayes' Theorem goes through a series of different experimental setups, with the corresponding Bayesian model, and how to program it in Stan. There is a set of slides (with commentary) available, and also every program in the book is available on my Kaggle page where they can be executed online (look for the files called "Slater-Bayesian-Statistics-*").

Finally, you might enjoy reading about some statistics of virtual reality.





Classical Statistics


Bayesian Statistics

13 April, 2020

A Shocking Outcome in Virtual Reality

This blog post is on Behavioural and Social Sciences Nature Research.

29 March, 2020

Virtual reality for training


Mel Slater

Simulation has been used for training for decades, most notably for flight simulation. These go back to the 1920s, the first being the Link Trainer. A flight simulator is essentially a complete aircraft cockpit, but with all of the visual, auditory and haptic information that signifies the state of the flight being produced by digital means. So the window views are generated by computer graphics, the sound is digitally controlled, and the simulator is in a huge box that is mounted on a platform that delivers the forces to which the simulated aircraft is subject. This is a kind of mixed reality system – the cockpit and all the controls are real, but the displays and forces acting on the physical structure are digitally produced.

Virtual Reality (VR) offers the possibility of training through mainly digital feedback, in the form of graphics, sound and haptic feedback where appropriate. Also it can apply to a very wide variety of circumstances without the need for highly expensive and non-portable platforms. For example, people in a company who have to present in front of clients can learn how to improve their performance by rehearsing in front of an entirely virtual audience, with the help of a trainer. A quite different application might be preparing for emergencies such as a fire in an office. All of these different types of application require essentially the same, and these days quite portable, hardware. 

Training is about acquiring new skills, procedures, and knowledge through action, eventually leading to expertise. It should be generalisable, so that when new and slightly different situations arise the practitioner can deal with these based on acquired knowledge of similar situations.
Training involves skills transfer, where existing skills are used to learn new ones - for example, a mechanic may have learned very well how to forge a particular component, and can then more rapidly learn what mistakes to avoid when making a different but similar enough component. It is important that training is in line with organisational needs, and follows the organisation’s ethos.

Why is VR any use for this? Previously I have referred to three perceptual illusions which arise from the use of VR: Place Illusion (the illusion of being in the place depicted by the VR displays), Plausibility Illusion (that events happening there are really happening) and the Body Ownership Illusion (that your co-located virtual body is your body). Together, these mean that you will tend to act in VR much as you would in similar circumstances in reality. So when you stand over a precipice in VR, even though you 100% know that it is not real, and that there is nothing there, you can’t help but have feelings of anxiety. You respond as if it is real. However, for VR to be useful in training we have to have evidence that people do respond realistically in VR. An example of the type of study that is necessary was given by Bhagavathula et al. (2018), who compared pedestrian behaviour in reality and VR. It was found that pedestrians made similar decisions about crossing the road in both VR and reality, and there was no difference between VR and reality regarding perceived risk and estimations of the distance of vehicles. However, there was a difference in estimation of the speed of an approaching vehicle. On the one hand it is remarkable that the differences between reality and VR were so low, because an essential feature of VR is that everyone knows that even if an approaching car were to hit them, nothing would happen. However, remember that VR operates through perceptual illusions. If your sensory system is showing you an approaching car, then no matter what you might be thinking about it, the safe thing to do is to get out of the way! This is how Place Illusion and Plausibility operate.

So VR is excellent for training because (i) it can make the abstract tangible. Instead of learning about a complex manoeuvre by reading about it, or practicing it in an artificial way, one can actually do it in situ, in a virtual environment. Or, for understanding the implications of some complex mathematics, one can use a visualization that involves mobilizing body movements in order to literally grasp it in a concrete way.

(ii) VR enables doing, not simply observing. Doing engages the whole body in a multisensory way, and the more the body is engaged the greater the chance of learning and retention. In maintenance training, for example, the learner actually carries out the procedure rather than only watching someone else do it or watching a video. Operators can train in facilities before they have even been built, or learn the process of making a delivery while maintaining safety for themselves and their customers.

(iii) VR is highly suited for training in complex circumstances, where for practical, ethical or safety reasons training in the real site is not possible. Our previous example illustrated this – in that case people were able to train on an installation that did not yet exist. In these complex or dangerous circumstances people can train over and over again, without additional cost of providing materials, for example. Accidents or problems that occur during the virtual training are without physical consequences, but are, of course, ideal for learning. We earlier saw one example of training for fire hazards. But there are also complex problems that involve only interaction with other humans, without complex machinery or installations. For example, how do medical doctors learn to deal with intransigent patients? One study showed that doctors faced with unreasonable demands from patients for antibiotics reacted much as they would in reality, and that therefore such environments could be used for ethical and social relationship training.

(iv) VR can offer multiple perspectives on the same scenario, which also gives a greater chance of understanding. In VR it is even possible to have a different perspective on yourself, as well as experiencing a scenario from different points of view. The training can also be collaborative, involving several remote participants, or bringing together experts from all over the world.

(v) VR is excellent for measurement – since everything that the trainees do can, in principle, be recorded and measured: their overall behaviour, their bodily movements, their physiological and brain responses. The trainee could also re-enter the environment and observe their own recorded data being played out.

Having considered where VR might be good for training, we also need to pay attention to the pitfalls. For example, consider the learning of a sport like table tennis in VR. You could become an excellent VR table tennis player. But as anyone who seriously plays table tennis in reality knows, it is a highly complex skill involving multiple factors associated with every strike of the ball, probably mostly below the conscious awareness of the player. Unless the simulation of table tennis were perfect, it is likely that the skill from virtual table tennis would not translate to real table tennis. Even worse, an already skilled table tennis player might find their skills weakened after playing virtual table tennis because of negative transfer of training. This means that the skilled player may pick up habits that work well in VR but which do not work well in reality. The biggest danger of VR for training is such negative transfer, where people apparently learn something in VR, but where the real world is different enough, perhaps in very subtle ways, that the learning simply does not transfer or makes things worse.

Let’s look at some evidence. Basically if you think that VR is good for training then you will find evidence to support that. On the other hand if you think that it is not good, you will find evidence to support that too.

Winther et al (2020) evaluated VR training for pump maintenance. A comparison was made between VR and traditional methods of video training and pairwise training, where one person helps another to learn. The VR did not have any haptic feedback, whereas with the video and pairwise training trainees could work directly on the real equipment. The sample size was n=36 in a between-groups study with the 3 groups. On almost all measures of training outcome the VR method did not perform as well as the other two methods. This may not be too surprising since the VR method is the only one where trainees could not work directly on the actual machinery.

Leder (2019) compared VR and PowerPoint instruction for safety training. It was found that VR had no advantage. However, problems with this study were that the VR (in a CAVE) was non-interactive, so that trainees seemed essentially just to watch something rather than be engaged, and that the sample size was small.

Sankaranarayanan et al (2018) compared VR training for a fire in an operating theatre with a control group. VR was found to be superior in learning to successfully put out the fire – 20% of the control group compared to 70% of the VR group were able to complete all the steps correctly one week after the exposure. Note that the sample sizes were quite small (10 in each group) and also it is not clear what the control group actually did. 

Murcia-López et al (2018) compared virtual and physical training for a bimanual assembly task. The VR training seemed not to differ in outcomes from the best performing physical condition. The sample size involved (n = 60, over three conditions) was reasonable. Retention of learned skills over time was not particularly good for any of the conditions.

Gavish et al (2015) compared VR and AR for training on industrial assembly tasks. The comparisons were VR (with haptics) against a video-only control, and augmented reality working with the real device against its own control. There were 10 engineers randomly assigned to each group, and the test was to do the real assembly. No difference was found between the VR and the VR-control condition, but AR resulted in fewer errors overall. However, as usual, look for the cautionary aspects – it is not entirely clear, but it seems that the VR was non-interactive.

Borsci et al’s 2015 paper reported a meta-study of mixed reality (MR) and VR studies for car service maintenance. They found only 8 papers with sufficient rigour to be included in the meta-study, and mostly these were on mixed reality – even though hundreds of papers had been published. They concluded:

  • MR systems seemed to be more useful than VR
  • MR/VR resulted in fewer errors, and required less time for training than previous methods
  • Trainees found the VR/MR more interesting than other methods, with better generalizability
  • VR/MR methods were adaptable to individual expertise, but too much reliance on MR/VR is less effective than on-the-job training.

A number of limitations of the studies were identified:
·      The evaluation studies tend to ignore the organisational setting and organisational needs.
·      They use a limited set of evaluation criteria - focussed too much on time and errors, ignoring cybersickness, skill recall and decay, motivation, acceptance, trust, prior attitudes and cognitive skills.

The authors suggested a wider set of evaluation criteria:

·      The effect of, and taking account of, cognitive skills - including visuospatial abilities
·      Levels of trust/acceptance of VR/MR tools 
·      Motivation in use by the trainees and trainers
·      Trainee attitudes towards the systems
·      Their previous experience
·      The impact of cybersickness
·      Physiological Reactions – e.g., attention shift, cognitive load, stress
·      Level of presence and engagement
·      Technical aspects and tool features – e.g. the effect of designed features, expected and experienced system functioning.

Overall we can conclude that immersive technologies can provide an excellent method for training. It is concrete, it is multisensory, involving and engaging the whole body, it can break out of the constraints of reality and give people perspectives that they cannot ever attain in reality, it can lead to high quality measurement, it is infinitely repeatable, and because of this it is ultimately low cost.

The cautionary message is not to assume, just because of these advantages, that it is going to produce good results. In some circumstances a seemingly unimportant detail that is wrong can lead to negative transfer of training. Before advocating a particular solution for training it must be studied extensively.

Also consider whether the application can be equally well done with other methods (direct traditional teaching, role play, video). Ask yourself why, in this particular application, is VR necessary? Take into account cost, the time scale, the logistics and the feasibility of producing a good enough training scenario without negative transfer. 

Additional Resources


Here are some aspects of training that I have not considered, but which are discussed in this paper to which the section numbers refer:

      VR in surgical training (Section 2.4)
      Navigation rehearsal (Section 2.1.3) (the body is critical)
      Military (rehearsal of scenarios)
      VR in sports (negative transfer considerations critical).
      Model based VR vs 360 video (section 7.2).
      Ethics


Readers might also find the following useful:


On negative transfer:


On comparative evaluation:

      Bhagavathula et al (2018). "The Reality of Virtual Reality: A Comparison of Pedestrian Behavior in Real and Virtual Environments", in: Proceedings of the Human Factors and Ergonomics Society Annual Meeting: SAGE Publications Sage CA: Los Angeles, CA), 2056-2060.
      Sankaranarayanan et al (2018). Immersive virtual reality-based training improves response in a simulated operating room fire scenario. Surgical endoscopy, 1-11.
      Murcia-Lopez, M., and Steed, A. (2018). A Comparison of Virtual and Physical Training Transfer of Bimanual Assembly Tasks. IEEE transactions on visualization and computer graphics 24, 1574-1583.
      Gavish, N., et al (2015). Evaluating virtual reality and augmented reality training for industrial maintenance and assembly tasks. Interactive Learning Environments 23, 778-798. 
