Eliciting Sentiment with ChatGPT
Background
In two previous papers we used sentiment analysis to better understand the responses of participants to an unusual experience in virtual reality – attending a rock concert amongst a virtual audience (1, 2). The sentiment analysis allowed us to get ‘behind the scenes’ and discover aspects of the experiences of participants that we would never have found with questionnaires or behavioural or physiological responses.
I recently had to do sentiment analysis of more than 1200 short pieces of text. In the previous work I used the following R packages:
· sentimentr (3), which uses 9 dictionaries and pays particular attention to ‘valence shifters’, i.e., modifiers such as negators and intensifiers, so that ‘I do not like it’ is correctly recognized as negative and ‘I really like it’ as an enhanced positive valence.
· vader, the R implementation[1] by Katherine Roehrick of the VADER system (4), which was designed for the analysis of social media text but is also used more generally.
· syuzhet (5)[2], which includes 4 sentiment lexicons and was originally designed for analysis of the latent structure in narrative, although it has been used widely for other applications.
· SentimentAnalysis[3] (6, 7), which exploits 3 different dictionaries and has applications in the analysis of financial text, but again has been used more widely.
I was interested to see how well ChatGPT4 would do with sentiment analysis, given that it is a highly sophisticated language model based on data far more extensive than would have been possible with the packages above. There have also been some studies showing that it has a lot of promise in this area (8, 9).
In particular, I wanted to know (i) how to actually do it in a simple way without any programming whatsoever and (ii) whether the scores obtained would better reflect the sentiment expressed in the text.
Method
In order to make the text available to ChatGPT4 I tried at first to give prompts such as:
“In the following are short texts (phrases and sentences) separated by ‘$’ Please give me sentiment scores for each text in the range -1 to 1, while maintaining the positions of the ‘$’. Please don’t output anything other than the numbers themselves.”
(Yes, I can’t help using “please” and “thank you” in my communications).
This method sort of worked, and it would give me output such as -0.4$0.5$0.9$-0.8$ …
However, it invariably missed some texts, and since I had a large amount of text that had to be broken into segments, a single missed score cost me a lot of time tracking down where it had gone wrong.
Instead the following method worked well. The text was read into R (e.g., into a variable “mytext”), and printing mytext in R then gives, for example:
[1] "It may have had an impact, but there is a lot of need for appeasement towards the therapist in the answers given about the evaluation of the VR session"
[2] "It has helped her manage emotions generated through VR"
[3] "Although it has allowed her to understand the concepts of psychological violence, she has not identified with her attitude and behaviour."
[4] "She denies the events that occurred and maintains a distant role regarding the activity"
[5] "She remains distant regarding the activity, referring more utility to the contents worked on in the rest of the sessions"
…
To ChatGPT:
Prompt: “I'm going to give a number of phrases each prefixed by a number in square brackets [1] to [n] where n is some number, and each text is in quote marks "...". Please give a sentiment score between -1 and 1 for each text, maintaining the number in brackets for each one. Please do not provide any other explanation but just give the output in the form [1] score, [2] score, and so on. OK?”
The output was:
[1] -0.36
[2] 0.56
[3] -0.04
[4] -0.31
[5] -0.22
…
In the many such streams of text that I provided it never missed a single one, and the scores seemed appropriate.
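The check that no text was missed can also be done mechanically. The following is a minimal sketch in Python, not the R workflow used here: the function names and the example reply are hypothetical, and it assumes the reply follows the "[n] score" form requested in the prompt.

```python
import re

def number_texts(texts):
    """Format texts in the style of R's printed output: [1] "text"."""
    return "\n".join(f'[{i}] "{t}"' for i, t in enumerate(texts, start=1))

def parse_scores(reply):
    """Extract {index: score} pairs from a reply such as '[1] -0.36'."""
    pattern = re.compile(r"\[(\d+)\]\s*(-?\d+(?:\.\d+)?)")
    return {int(i): float(s) for i, s in pattern.findall(reply)}

def missing_indices(scores, n):
    """List the indices from 1 to n for which no score was returned."""
    return [i for i in range(1, n + 1) if i not in scores]

texts = [
    "It has helped her manage emotions generated through VR",
    "She denies the events that occurred and maintains a distant role regarding the activity",
]
block = number_texts(texts)          # text to paste after the prompt
reply = "[1] 0.56\n[2] -0.31"        # hypothetical reply copied from the chat
scores = parse_scores(reply)
print(missing_indices(scores, len(texts)))  # an empty list means nothing was missed
```

Because every score carries its own index, a missed text shows up as a gap at a known position rather than as a shift in everything that follows.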
Comparisons
Here I give comparisons with the methods from the R packages.
Text | sentimentr | vader | syuzhet | SentimentAnalysis | ChatGPT4 |
It may have had an impact, but there is a lot of need for appeasement towards the therapist in the answers given about the evaluation of the VR session. | -0.10 | 0.00 | -0.25 | 0.09 | -0.36 |
It has helped her manage emotions generated through VR | 0.40 | 0.00 | 1.20 | 0.50 | 0.56 |
Although it has allowed her to understand the concepts of psychological violence, she has not identified with her attitude and behaviour. | 0.11 | -0.62 | -0.50 | 0.11 | -0.04 |
She denies the events that occurred and maintains a distant role regarding the activity. | -0.13 | -0.42 | -0.50 | 0.12 | -0.31 |
She remains distant regarding the activity, referring more utility to the contents worked on in the rest of the sessions. | 0.26 | 0.00 | 0.65 | 0.30 | -0.22 |
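Overall, it seems that ChatGPT had a better understanding of the text, and generated values more in accord with the sentiment expressed. For example, for the last text three of the four packages give clearly positive scores (0.26, 0.65 and 0.30) and vader gives 0.00, whereas ChatGPT gives -0.22, reflecting the distance that the text describes. Note that the package scores are not all on the same scale (syuzhet gives 1.20 for the second text), so it is the sign and relative size of the scores that should be compared.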
A caveat is that I did not spend time cleaning the text, and also with ChatGPT we have no idea about how it numerically assigns these scores.
However, with ChatGPT we are able to ask it for explanations.
Text | Score | Explanation |
It may have had an impact, but there is a lot of need for appeasement towards the therapist in the answers given about the evaluation of the VR session. | -0.36 | The negative sentiment in this statement arises from the mention of "a lot of need for appeasement towards the therapist." This suggests that the user's responses about the evaluation of the VR session may not reflect their true feelings or experiences accurately. The use of the term "appeasement" implies that the user may feel compelled to provide positive or favorable feedback to please the therapist or avoid any potential conflict. While the statement also acknowledges that the VR session may have had an impact, the overall negative sentiment is driven by the perceived need for appeasement, which suggests potential reservations or incongruence between the user's genuine experience and their expressed evaluation of the VR session. |
It has helped her manage emotions generated through VR. | 0.56 | The positive sentiment in this statement arises from the mention of the VR experience being helpful in managing emotions. By stating that it has helped her manage emotions, it implies that the VR session had a positive impact on the user's emotional state. This suggests that the user found the VR experience beneficial in terms of emotional regulation or control. The use of the word "helped" indicates that the VR session had a positive influence, enabling the user to effectively deal with the emotions evoked during the VR experience. Overall, the statement reflects a positive outcome of the VR session in terms of emotional management. |
Although it has allowed her to understand the concepts of psychological violence, she has not identified with her attitude and behavior. | -0.04 | The negative sentiment in this statement arises from the user's inability to identify with her own attitude and behavior despite gaining an understanding of psychological violence through the VR experience. While the user has acquired knowledge about the concepts of psychological violence, the fact that she does not identify with her own attitude and behavior suggests a disconnect or lack of self-reflection. The use of the word "although" implies a contrast between the user's understanding of psychological violence and her inability to relate to it personally. This discrepancy may contribute to a slightly negative sentiment as it indicates a potential limitation or challenge in the user's engagement with the VR experience. Overall, the statement reflects a mixed sentiment, with a slight negative tone stemming from the user's lack of identification with her attitude and behavior despite gaining conceptual understanding of psychological violence. |
She denies the events that occurred and maintains a distant role regarding the activity. | -0.31 | The negative sentiment in this statement is primarily driven by two factors. First, the user's denial of the events that occurred suggests a refusal to acknowledge or accept the significance or impact of those events. This denial can be seen as a defensive mechanism, potentially hindering personal growth or understanding from the VR activity. Second, the user's maintenance of a distant role regarding the activity implies a lack of active engagement or emotional connection with the content. By keeping a distance, the user may be avoiding or downplaying the potential impact or relevance of the activity. Together, these factors contribute to a negative sentiment as they indicate resistance, denial, and a lack of active involvement in the VR activity. The user's denial and emotional distance suggest a limited willingness to explore or address the underlying issues associated with the activity, which can hinder the effectiveness of the intervention. |
She remains distant regarding the activity, referring more utility to the contents worked on in the rest of the sessions. | -0.22 | The negative sentiment in this statement is primarily driven by the user's sense of distance or detachment from the activity. By remaining distant, the user is indicating a lack of active engagement or emotional connection with the VR activity. This detachment can potentially hinder the effectiveness of the intervention, as active engagement is crucial for meaningful learning and growth. Furthermore, the user's statement about finding more utility in the contents worked on in the rest of the sessions suggests that they perceive the VR activity as less valuable or impactful compared to other aspects of the therapy. This perception further reinforces the sentiment of negativity or skepticism towards the VR activity. Overall, the statement reflects a certain level of disengagement and skepticism towards the VR activity, contributing to the slightly negative sentiment score. The user's preference for the contents of the other sessions indicates a perceived lack of value or relevance in the VR activity, which can impact its effectiveness in achieving therapeutic goals. |
Consistency
Here I consider whether ChatGPT4 is consistent across different evaluations of the same text. I gave it 50 texts and asked it for sentiment scores using the same prompt as above, and then repeated this with the same texts a few days later.
The figure below shows the comparison between the two sets of scores. The correlation is r = 0.82 with 95% confidence interval 0.70 to 0.89. So the correlation is very strong but there are clearly 3 outlying points with quite different scores on each occasion.
Figure – The x-axis corresponds to the original sentiment scores and the y-axis to the scores for the same texts a few days later.
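The text does not say how the confidence interval was computed. As a sketch, assuming the usual Fisher z-transformation (the method R's cor.test uses for Pearson's r), the reported interval can be checked from r = 0.82 and n = 50 alone; the function name is hypothetical.

```python
import math

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% confidence interval for a Pearson correlation
    using the Fisher z-transformation (assumed method, see lead-in)."""
    z = math.atanh(r)                 # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

lo, hi = fisher_ci(0.82, 50)
print(f"{lo:.2f} to {hi:.2f}")  # prints "0.70 to 0.89"
```

Under this assumption the sketch gives the same interval as the one reported above.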
I asked ChatGPT why there were discrepancies:
“It's important to note that sentiment analysis is inherently challenging and can be influenced by individual interpretation. Different human annotators may assign different sentiment scores to the same sentence based on their personal perspectives, experiences, and understanding of the context. Similarly, different sentiment analysis models may yield varying scores due to differences in their algorithms, training data, and biases. Therefore, while sentiment analysis can provide valuable insights, it is essential to consider it as an approximation and be aware of its limitations, especially when interpreting and comparing scores from different sources or occasions.”
ChatGPT Versions
Assigning a sentiment score to a text depends on the balance of positive and negative expressions. Here, we have a mix of both positive ("I really liked it") and negative sentiment ("aspects of it were uncomfortable"). Therefore, the sentiment score might be slightly positive, considering the expressed like outweighs the discomfort. I'd assign a score of 0.3. Please note that sentiment analysis can be subjective and can vary based on the nuances of language.
Prompt: “Please give a sentiment score between -1 and 1 for the following text "Aspects of it were uncomfortable but I really liked it".”
In this sentence, even though it starts with a negative expression ("Aspects of it were uncomfortable"), it ends on a positive note ("but I really liked it"). Similar to the previous example, there is a mix of both positive and negative sentiments here. Given the overall positive sentiment, I would assign a score of 0.3. As noted before, sentiment analysis can be subjective and can vary based on the nuances of language.
Based on the given text, the sentiment score can be calculated as follows:
Positive sentiment: "I really liked it" - This indicates a positive sentiment.
Negative sentiment: "aspects of it were uncomfortable" - This indicates a negative sentiment.
To assign a sentiment score between -1 and 1, we can consider the overall sentiment of the text, which combines both positive and negative aspects. Since the positive sentiment outweighs the negative sentiment, we can assign a sentiment score closer to 1, indicating a predominantly positive sentiment. However, the discomfort mentioned does introduce some negative sentiment.
Considering the balance between positive and negative sentiments, I would assign a sentiment score of approximately 0.6 to the given text.
Prompt: “Please give a sentiment score between -1 and 1 for the following text "Aspects of it were uncomfortable but I really liked it".”
Based on the given text, "Aspects of it were uncomfortable but I really liked it," the sentiment score can be calculated as follows:
Positive sentiment: "I really liked it" - This indicates a positive sentiment.
Negative sentiment: "Aspects of it were uncomfortable" - This indicates a negative sentiment.
Similar to the previous example, we need to consider the overall sentiment of the text. In this case, although there is positive sentiment expressed, the presence of discomfort is more prominent. Therefore, the negative sentiment outweighs the positive sentiment to some extent.
Considering the balance between positive and negative sentiments, I would assign a sentiment score of approximately -0.4 to the given text, indicating a predominantly negative sentiment.
Conclusions
Using ChatGPT in a very simple way for sentiment analysis is feasible, and it gives good results. I don’t recommend applying it blindly: check carefully that the results seem reasonable, bearing in mind ChatGPT’s own comment, quoted above, about the subjective nature of sentiment interpretation. Do several different runs with ChatGPT and compare the results for consistency, and also compare them with other mainstream methods of sentiment analysis.
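Comparing runs can be as simple as listing the texts whose scores differ by more than some tolerance. The following is a minimal sketch in Python: the function name, the threshold of 0.5 and the second run of scores are all hypothetical, and only the first run is taken from the output shown earlier.

```python
def flag_discrepancies(run_a, run_b, threshold=0.5):
    """Return 1-based positions where two runs differ by more than threshold."""
    if len(run_a) != len(run_b):
        raise ValueError("runs must score the same number of texts")
    return [i for i, (a, b) in enumerate(zip(run_a, run_b), start=1)
            if abs(a - b) > threshold]

run_1 = [-0.36, 0.56, -0.04, -0.31, -0.22]  # scores from the output shown earlier
run_2 = [-0.30, 0.60, 0.10, -0.35, 0.40]    # hypothetical second run
print(flag_discrepancies(run_1, run_2))     # positions of texts to re-read by hand
```

The flagged texts are the ones worth reading again and, if needed, scoring by hand; the threshold is a judgment call that depends on how the scores will be used.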
References
1. A. Beacco, R. Oliva, C. Cabreira, J. Gallego, M. Slater, Disturbance and Plausibility in a Virtual Rock Concert: A Pilot Study. 2021 IEEE Virtual Reality and 3D User Interfaces (VR), 538-545 (2021). doi.org/10.1109/VR50410.2021.00078.
2. M. Slater, C. Cabriera, G. Senel, D. Banakou, A. Beacco, R. Oliva, J. Gallego, The sentiment of a virtual rock concert. Virtual Reality, (2022). doi.org/10.1007/s10055-022-00685-9.
3. T. W. Rinker, sentimentr: Calculate Text Polarity Sentiment. version 2.9.0, https://github.com/trinker/sentimentr (2021).
4. C. Hutto, E. Gilbert, VADER: A parsimonious rule-based model for sentiment analysis of social media text, in Proceedings of the International AAAI Conference on Web and Social Media. (2014), vol. 8.
5. M. Jockers, Package ‘syuzhet’. URL: https://cran.r-project.org/web/packages/syuzhet (2017).
6. S. Feuerriegel, N. Proellochs, M. S. Feuerriegel, Package ‘SentimentAnalysis’. CRAN: London, UK, (2018).
7. S. Feuerriegel, N. Proellochs, SentimentAnalysis Vignette. Massachusetts Institute of Technology https://mran.microsoft.com/snapshot/2017-08-06/web/packages/SentimentAnalysis/vignettes/SentimentAnalysis.html (2019).
8. Z. Wang, Q. Xie, Z. Ding, Y. Feng, R. Xia, Is ChatGPT a good sentiment analyzer? A preliminary study. arXiv preprint arXiv:2304.04339, (2023).
9. T. Susnjak, Applying BERT and ChatGPT for sentiment analysis of Lyme disease in scientific literature. arXiv preprint arXiv:2302.06474, (2023).