Abstract
Sentiment analysis of open-ended survey responses is a complex but essential task in understanding public opinion. This study compares the performance of three large language models (LLMs)—GPT-4o, Llama-3.3-70B-Instruct, and Gemini-2.0-Flash—against dedicated sentiment classification neural networks, specifically Twitter-RoBERTa-base and DeBERTa-v3-base-absa-v1.1. Using survey data from two COVID-19 studies, we evaluated these models based on accuracy, precision, recall, and F1-score across sentiment categories. Our results show that LLMs consistently outperform dedicated models, with GPT-4o achieving the highest overall accuracy. Additionally, we find that the performance gap between LLMs and neural networks becomes more pronounced when considering only survey responses with full annotator agreement. Unlike earlier LLM versions, our selected LLMs exhibit no significant positive bias, reinforcing their reliability for sentiment classification. However, LLMs require significantly more computational resources and incur higher costs, making dedicated neural networks a competitive alternative in resource-constrained settings. This study highlights the trade-offs between performance, scalability, and cost, offering insights for researchers and practitioners choosing between LLMs and dedicated sentiment analysis models. Future research should prioritize optimizing LLMs and further evaluating their inherent biases with diverse datasets, while also testing upcoming new models to ensure fair and reliable performance as these models evolve. Additionally, future work should explore sentiment analysis beyond traditional polarity classification to capture a broader spectrum of emotions, as well as investigate multimodal approaches for more comprehensive sentiment classification.
1. Introduction
In an era dominated by a constant stream of digital information, capturing the nuances of human emotions has become increasingly valuable across multiple fields. Whether in understanding consumer behavior, gauging public opinion, or enhancing human-computer interactions, the ability to discern underlying emotional tones is paramount for improving the outcomes of these diverse applications. The capability to do this automatically, driven by the advent of modern artificial intelligence (AI), is at the heart of many of these modern applications1,2,3.
In this paper, we explore methods to uncover such emotional insights from survey data, comparing the traditional machine learning approach of dedicated sentiment classification neural networks with newer Large Language Models (LLMs). Surveys are typically composed of questions that have been meticulously developed by domain experts to ensure clarity and avoid ambiguity. While responses to structured, multiple-choice questions offer simplicity in analysis, open-ended, free-response questions provide deeper insights that cannot be easily quantified. These responses capture the subtleties and complexities of human thought and emotion, enabling a wide variety of applications, such as identifying responses with specific sentiments that structured questions might miss. For instance, a school principal may want to focus on negative comments to address actionable concerns, and sentiment analysis can highlight these efficiently. Additionally, ongoing research into free-response questions in areas such as education underscores the growing interest in this field4.
Despite this growing interest, analyzing these nuanced responses poses significant challenges. Manual coding and interpretation of qualitative data are time-consuming and prone to human error and bias, which can lead to inconsistencies in analysis. Traditional qualitative data analysis methods, such as thematic and deductive coding, require substantial human effort and rely on predefined tags, preventing new themes from emerging5. Recent work on automating this process with advanced tools such as dedicated sentiment analysis neural networks has demonstrated that these technologies can address issues of speed and consistency in qualitative data analysis6. By reducing the reliance on manual efforts for interpreting survey responses, organizations can quickly and consistently uncover valuable insights within qualitative data, allowing for more timely and informed decision-making.
Nevertheless, despite these technological advancements, challenges remain. These new techniques often come with significant barriers, such as the need for extensive technical expertise, specialized software, additional computational resources, and regular model updates to incorporate new data. These constraints limit the broader application of these methods, keeping them primarily within the domain of specialized research.
In light of these challenges, however, the more recent emergence of LLMs with generative AI capabilities presents a potentially transformative opportunity. LLMs have the potential to overcome many challenges associated with traditional techniques, opening possibilities for more accessible and efficient survey analysis to a broader range of users. For example, the ability to use such models through natural language instructions, via simple web interfaces, or at scale through application programming interfaces (APIs), lessens the need for dedicated machines or expensive software. While these models have become more widely available and user-friendly, they have not yet been thoroughly benchmarked against more specialized machine learning models for analyzing free-text survey data. Specifically, several research gaps exist: (1) LLM versions: Recent LLMs like GPT-4o, Llama-3.3, and Gemini-2.0-Flash have not been evaluated in this context. (2) Scope of dedicated neural networks: The DeBERTa model has been overlooked for sentiment analysis. (3) Stochastic behaviors of LLMs: Previous studies often used a single inference run for LLMs, ignoring the potential benefits of ensembling results for a more robust conclusion. (4) Positive bias evaluation: Evaluations have primarily focused on accuracy and macro F1-score, neglecting formal testing of model output biases. (5) Model sensitivity to noisy ground-truth labels: There is a lack of sensitivity analyses examining how noise in human-annotated labels affects LLM inference in categorization tasks. We discuss these gaps further in the “Related Works” section.
In this study, we aim to address these gaps by reassessing the comparison between dedicated neural networks and modern LLMs for sentiment analysis on healthcare survey data. Our goal is to demonstrate that these general LLMs can perform sentiment analysis at a level comparable to or better than dedicated sentiment analysis models. Currently, there is public skepticism of LLMs due to issues like hallucinations, where outputs seem plausible but deviate from user input or factual knowledge, as well as challenges with ambiguity, incomplete responses, and bias rooted in training data7,8. Given these shortcomings, we must rigorously test LLMs to build trust and ensure they produce reliable, unbiased results in tasks such as sentiment analysis of open-ended survey data, where accuracy and fairness are paramount.
2. Background
This section presents an overview of core concepts—spanning Natural Language Processing, sentiment analysis, machine learning, neural networks, and large language models—that underpin our subsequent comparison of dedicated models and LLMs for sentiment analysis.
2.1 Natural Language Processing (NLP)
NLP is a field that enables computers to interpret and generate human language. Its research focuses on identifying underlying linguistic patterns in text or speech data.
2.1.1 Sentiment Analysis
Sentiment analysis, also known as opinion mining, is a subset of NLP that focuses on analyzing the overall sentiment of a text or identifying sentiment towards specific topics or events9. In its simplest form, sentiment analysis assigns one of three polarity labels: negative, neutral, or positive. Automating this task is essential because manual annotation is time-consuming. Lexicon-based, corpus-based, dictionary-based, and machine learning techniques are commonly used. AI now plays a central role in sentiment analysis because it enables automated processing of free text, which has led to the development of dedicated neural networks widely used for sentiment analysis10.
2.2 Machine Learning (ML)
Machine Learning (ML), first developed in the 1950s11, is an umbrella concept that covers a wide range of topics. ML algorithms iteratively learn from data and are well-suited for tasks that require understanding language as they can leverage large annotated corpora. Nonetheless, while continuous development has improved ML models, they often struggle with the subtleties of natural language, including contextual or idiomatic expressions. These limitations paved the way for neural networks and ultimately LLMs.
2.3 Neural Networks
Neural networks are a powerful tool within ML that mimics the structure and function of the human brain to process complex patterns and data. They originated with the perceptron, which did not rely on the often-unknown conditions specific to biological organisms12,13. Unlike traditional ML models, which use manually engineered features, neural networks can automatically learn hierarchical representations through multiple layers. This ability makes neural networks particularly suitable for tasks involving sequential or contextual data, such as those encountered in NLP. To further enhance their effectiveness in these tasks, more advanced architectures have been developed, such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks14,15. These architectures are specifically designed to capture dependencies in sequential data, allowing for a better understanding of the relationships between words and phrases in a text, which is pivotal for accurate sentiment analysis and other NLP applications. Approaches based on these early architectures for sentiment analysis have shown promise. However, these early neural networks require fine-tuning on large volumes of pre-existing labeled data and specialized software. These issues have led to the development of more accessible LLMs with broader, general-purpose generative AI capabilities5.
2.3.1 Large Language Models (LLMs)
LLMs are a type of neural network that represents a milestone in NLP research and development. Trained on massive text data, they seek to understand and generate human-like text. Despite the recent boom of LLMs within many industries, LLMs are relatively new when compared to their predecessors in NLP, such as statistical and neural network language models. Current LLMs were not derived from these models directly, but rather initially from pre-trained language models (PLMs). PLMs, trained in a self-supervised setting on large text corpora to learn generic representations, were widely used until machines became capable of handling tens to hundreds of billions of parameters and training on larger datasets. This shift, along with the introduction of the Transformer architecture16, enabled the first generation of LLMs, which could efficiently handle long-range dependencies in text. Although this marked a major advancement, these initial LLMs still struggled with aligning responses to user intent, and generated biased outputs and violent content without explicit prompting17. Further development, such as the adoption of multi-head attention mechanisms and large-scale training on massive datasets, resolved many of these issues, leading to today’s LLMs that leverage large infrastructure like high-performance GPU clusters, dramatically improving training times and performance18. These advancements have enabled LLMs to process data more effectively, capturing the intricate patterns necessary for complex NLP tasks like sentiment analysis.
3. Related Works
3.1 BERT-based Models Across Industries
Over the past decade, sentiment analysis of open-ended survey responses has advanced markedly across industries – from healthcare and customer service to finance, education, and e-commerce. Early approaches often struggled with domain-specific language (for example, sentiment tools built for social media performed poorly on clinical survey text in healthcare)19. The introduction of transformer-based models like BERT and its variants (e.g., RoBERTa) brought significant improvements. Fine-tuned models adapted to industry data achieved high accuracy and F1-scores, far surpassing earlier lexicon- or SVM-based methods. In customer service, for instance, a RoBERTa model attained 96.97% accuracy on airline feedback sentiment (binary classification) and about 86.9% on three-class sentiment20 – a near-human level of performance that greatly aids airlines in mining customer reviews. In finance, specialized models such as FinBERT (a BERT tuned for financial language) similarly pushed accuracy from ~71% (the previous state of the art) to 86%, even reaching 97% on highly agreed-upon financial phrase data. Such domain-tailored RoBERTa models excel at capturing jargon and context (e.g., medical sentiment or product review tone), yielding robust F1-scores in real-world deployments. Education has seen comparable trends too: researchers analyzing student feedback in Massive Open Online Courses (MOOCs) found transformer models reliable – all tested models (BERT, RoBERTa, GPT-3.5, etc.) showed strong performance, with RoBERTa slightly leading (peaking around 87% across metrics) in classifying student sentiment21. Likewise in e-commerce, BERT-based classifiers fine-tuned on retail product reviews (e.g., a multilingual BERT used by NLPTown) can predict star ratings in multiple languages with high accuracy19, enabling consistent sentiment analysis of customer opinions worldwide. Collectively, these successes highlight the versatility and high performance of BERT-based approaches in a wide range of industrial contexts, laying the groundwork for further advancements in open-ended survey analysis.
3.2 LLMs vs. BERT-based Models – Comparative Performance
The rise of LLMs like GPT-3 introduced a new paradigm for sentiment analysis, often matching or even surpassing dedicated fine-tuned models on open-ended text. LLMs bring the ability to perform sentiment classification with minimal task-specific training. Studies have shown, for example, that GPT-3.5 in a zero-shot setting achieves sentiment accuracy on par with a fine-tuned BERT, even approaching state-of-the-art levels on standard benchmarks19. In real-world applications, LLMs have demonstrated impressive results across domains. In healthcare, one study on COVID-19 surveys found an instruction-tuned LLM outperformed all other tools – ChatGPT outscored a fine-tuned OPT model and traditional analyzers, improving accuracy by ~6% and boosting F1-score by 4–7%22. In education, research on survey feedback analysis showed that ChatGPT (GPT-4) surpassed RoBERTa, achieving ~21% higher accuracy and a ~32% increase in F1-score5.
However, there are also instances where dedicated BERT-based models outperform LLMs, typically when domain-specific fine-tuning and ample data give the smaller model a specialized edge. For instance, in education feedback analysis (a more structured domain), a fine-tuned RoBERTa even marginally outperformed GPT-3.5, attaining the highest precision/recall (around 87% F1-score) whereas GPT-3.5 was a few points lower23. These outcomes highlight underlying factors: LLMs leverage broad knowledge and context modeling (shining in zero/few-shot scenarios and complex language understanding), whereas dedicated models benefit from domain focus and consistency (especially for industry-specific terminology or when computational efficiency is paramount).
In practice, organizations choose based on needs – some customer service teams now use GPT-4 to interpret nuanced support ticket sentiments in real time, while others deploy lean BERT-based classifiers for known domains to maximize speed and control. Both approaches have reported strong accuracy and F1-scores in production, and numerous case studies underscore that neither is universally “best” – the optimal choice hinges on the dataset and context. Ultimately, the trend shows LLMs rapidly closing the gap with specialized sentiment models, and in many cases overtaking them, but domain-specific fine-tuned models (like RoBERTa variants) remain highly competitive, often serving as the benchmark for state-of-the-art performance in targeted sentiment analysis tasks.
3.3 Research Gaps and Our Focus
After reviewing prior studies on the comparative analysis of LLMs and dedicated neural networks, we identified several gaps in existing research. First, prior studies compared LLMs to dedicated neural networks using the latest models available at the time of the research. As time has progressed, more recent LLMs — such as GPT-4o, Llama-3.3, and Gemini-2.0-Flash — have not yet been evaluated in this context. Second, DeBERTa, a dedicated neural network model, has been largely overlooked and not thoroughly explored for sentiment analysis. Third, previous studies typically ran each LLM only once on the experimental dataset without leveraging ensemble results from multiple runs. Given the non-deterministic nature of LLM outputs, this single-run approach could introduce variability and limit the reliability of the findings. Fourth, most previous evaluations have focused on overall accuracy and macro F1-score, overlooking the assessment of model output biases, which should be conducted through rigorous statistical testing. Finally, there is a lack of sensitivity analyses investigating how noise in human-annotated labels affects LLM inference in categorization tasks, particularly for survey data.
To address these gaps, this study revisits the comparison between dedicated neural networks and modern LLMs for sentiment analysis on healthcare survey data. The models considered include Twitter-RoBERTa-base, DeBERTa-v3-base-absa-v1.1, GPT-4o, Llama-3.3-70B-Instruct, and Gemini-2.0-Flash. Dedicated models such as Twitter-RoBERTa-base and DeBERTa-v3-base-absa-v1.1 were pre-trained specifically for sentiment-related tasks, providing focused outputs for classification24,25. Meanwhile, LLMs like GPT-4o, Llama-3.3-70B-Instruct, and Gemini-2.0-Flash are designed for broader applications, leveraging large-scale datasets and advanced architectures. These LLMs aim to balance general-purpose functionality with domain-specific tasks, though they come with higher computational requirements. Table 1 provides a high-level summary of these models.
Table 1. High-level summary of the evaluated models.

| Model Name | Model Type | # Parameters | Specialization | Key Features |
| --- | --- | --- | --- | --- |
| GPT-4o | LLM | ~200 billion | Multimodal input (text, image, audio) | Combines multimodal capabilities with high efficiency for non-English and vision-related tasks. |
| Llama-3.3-70B-Instruct | LLM | 70 billion | General NLP tasks | Instruction-tuned; handles broad and nuanced language tasks. |
| Gemini-2.0-Flash | LLM | Undisclosed | High volume and frequency tasks | Next-generation features, speed, and multimodal generation for a diverse variety of tasks. |
| RoBERTa-base-sentiment | Dedicated NN | 125 million | Sentiment analysis | Extension of BERT focused on positive, neutral, and negative classifications. |
| DeBERTa-v3-base-absa | Dedicated NN | 184 million | Aspect-based sentiment analysis | Uses disentangled attention for modeling content and position more effectively. |
4. Methodology
In this section, we provide a detailed overview of the methodology, including data collection, model implementation, LLM prompt engineering, and performance evaluation. Additionally, we discuss the limitations encountered during the process.
4.1 Data Collection
The COVID-19 data used in this paper were provided by the authors of Lossio-Ventura et al. (2023, 2024), who give a detailed description of the dataset. The data were collected from two COVID-19 surveys conducted by the National Institutes of Health (NIH) and Stanford University26. These surveys were designed to assess general topics experienced by participants during the pandemic lockdown. The NIH and Stanford datasets were collected through web-based surveys during the COVID-19 pandemic, using convenience sampling via social media and prior study participants. The NIH survey (April 2020–May 2021) included 3,655 participants, with 2,497 providing free-text responses to the question “Is there anything else you would like to tell us that might be important that we did not ask about?”, yielding 26,411 sentences. The Stanford survey (March–September 2020) recruited 4,582 participants, with 3,349 responding to at least one of three free-text questions (“Although this is a challenging time, can you tell us about any positive effects or ‘silver linings’ you have experienced during this crisis?”, “What are the reasons you are not self-isolating more?”, and “Have you experienced any difficulties due to the coronavirus crisis?”), generating 21,266 sentences. The survey responses were manually annotated for sentiment polarity by three independent annotators, with annotations provided by the creators of the original dataset. The final sentiment labels were determined through a rigorous three-step agreement process. The dataset used in this study comprises a total of 1,000 randomly chosen sentences, with 500 sentences selected from each survey.
The dataset was chosen based on its suitability for testing the capabilities of both LLMs and neural networks in handling sentiment analysis tasks. It also offers flexibility, as the pre-labeled data came in two .xlsx files, enabling easy manipulation and processing through various models. This facilitated efficient experimentation and fine-tuning through the Python library Pandas (v2.2.2). The dataset provides a balanced mix of positive, negative, and neutral sentiments, allowing for a comprehensive evaluation of the models’ performance. In our experiment, the dedicated neural network models were run locally on laptops, processing the survey responses from the .xlsx files. In contrast, the LLMs were hosted on remote servers, with the laptops communicating with these servers to send data and receive the results.
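As an illustration of this data-handling step, the following minimal sketch loads the two pre-labeled .xlsx files with Pandas and inspects the label balance; the file and column names are hypothetical placeholders rather than the dataset’s actual schema.

```python
# Minimal sketch: load the pre-labeled survey sentences and check label balance.
# File and column names below are hypothetical placeholders.
import pandas as pd

nih = pd.read_excel("nih_survey_sentences.xlsx")
stanford = pd.read_excel("stanford_survey_sentences.xlsx")
data = pd.concat([nih, stanford], ignore_index=True)

# Sentiment labels assumed to be coded as -1 (negative), 0 (neutral), 1 (positive).
print(data["sentiment_label"].value_counts(normalize=True))
```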
4.2 Model Implementation
4.2.1 Dedicated Neural Networks
We employed Twitter-RoBERTa-base and DeBERTa-v3-base-absa-v1.1 locally through Python. Using Hugging Face’s Transformers library, each pre-trained model was accessed directly through a pipeline, allowing the models to run on our own hardware and computational resources. We wrote Python functions to iterate over each survey response sentence and classify its sentiment with each model. Each model’s output was then written to an .xlsx file alongside the corresponding sentence.
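A minimal sketch of this setup is given below; the Hugging Face checkpoint identifiers are our assumption of publicly available versions of these models, and the label names returned vary by checkpoint.

```python
# Sketch: classify survey sentences locally with Hugging Face pipelines.
# Checkpoint identifiers are assumptions; the ABSA model normally expects a
# sentence/aspect pair but is applied here to the full sentence.
import pandas as pd
from transformers import pipeline

sentences = pd.read_excel("survey_sentences.xlsx")["sentence"].tolist()  # hypothetical file/column

roberta = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment-latest")
deberta = pipeline("text-classification", model="yangheng/deberta-v3-base-absa-v1.1")

rows = []
for sentence in sentences:
    rows.append({
        "sentence": sentence,
        "roberta_label": roberta(sentence)[0]["label"],
        "deberta_label": deberta(sentence)[0]["label"],
    })

pd.DataFrame(rows).to_excel("dedicated_model_outputs.xlsx", index=False)
```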
4.2.2 LLMs on Microsoft Azure AI (Llama-3.3-70B-Instruct and GPT-4o)
For this experiment, both the Meta-Llama-3.3-70B-Instruct and GPT-4o models were hosted on the Microsoft Azure AI platform. We first deployed the models as serverless endpoints accessible from a compute instance virtual machine within the Azure AI workspace. To process the data and conduct the sentiment analysis, we utilized Microsoft Prompt Flow, which structured the workflow into three key steps. In step 1, the “SQL_Connection_Input.py” Python node retrieved the 1,000 sentences of the COVID-19 dataset from a MySQL database. Step 2 involved the “Process_Text_Content.jinja2” node, which connected to the deployed models and prompted them to conduct sentiment analysis based on the instructions provided. Finally, in step 3, the models’ output was saved into an .xlsx file. Due to the throughput limit of the models, the 1,000 sentences were divided into 10 smaller chunks of 100 sentences each and processed sequentially to ensure efficient handling of the data.
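Our implementation used the Prompt Flow nodes described above; purely as an illustration, the sketch below shows a functionally similar chunked workflow calling an Azure-hosted GPT-4o deployment through the AzureOpenAI client. The endpoint, deployment name, API version, and abbreviated prompt wording (given in full in Section 4.3) are assumptions.

```python
# Illustrative alternative to the Prompt Flow pipeline: send the 1,000 sentences
# to an Azure-hosted chat model in 10 chunks of 100. Endpoint, deployment name,
# and API version are placeholders, not the study's actual configuration.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

def classify(sentence: str) -> str:
    """Ask the deployed model for a single integer sentiment label (full prompt in Section 4.3)."""
    response = client.chat.completions.create(
        model="gpt-4o",  # Azure deployment name (assumed)
        temperature=0.8,
        messages=[
            {"role": "system", "content": "You are an AI assistant that helps people conduct sentiment analysis."},
            {"role": "user", "content": f"Classify the sentiment of this COVID-19 survey sentence as 1, 0, or -1. Sentence: {sentence}"},
        ],
    )
    return response.choices[0].message.content.strip()

def chunks(items, size=100):
    for start in range(0, len(items), size):
        yield items[start:start + size]

# labels = [classify(s) for chunk in chunks(sentences) for s in chunk]
```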
4.2.3 Gemini-2.0-Flash
We began by requesting a Gemini API key from Google AI Studio to establish a connection with the model. We then wrote a Python script that utilized the model.generate_content(prompt, generation_config) function from the Google Generative AI library to perform the sentiment analysis. The script processed the 1,000 sentences collected during the COVID-19 period, performing sentiment analysis on each. Due to the token limit of the Gemini API, the 1,000 survey responses were processed in 20 smaller chunks of 50 sentences each and then sequentially compiled for further evaluation in our sentiment analysis.
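A minimal sketch of this script is shown below using the Google Generative AI Python library; the temperature value of 0.8 reflects the setting described in Section 4.5.2, while the per-chunk prompt format (one label per line) is an illustrative assumption.

```python
# Sketch: sentiment analysis with Gemini-2.0-Flash, processing 50 sentences per call.
# The per-chunk prompt format shown here is illustrative.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")
generation_config = genai.types.GenerationConfig(temperature=0.8)

def classify_chunk(sentences):
    """Return one label (1, 0, or -1) per input sentence, parsed line by line."""
    prompt = (
        "Analyze the sentiment of each of the following sentences from the COVID-19 period. "
        "For each sentence, return only 1 (Positive), 0 (Neutral), or -1 (Negative), one per line.\n\n"
        + "\n".join(f"{i + 1}. {s}" for i, s in enumerate(sentences))
    )
    response = model.generate_content(prompt, generation_config=generation_config)
    return [line.strip().split()[-1] for line in response.text.splitlines() if line.strip()]
```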
4.3 LLM Prompt Engineering
Prompt engineering typically consists of refining three sections: the system prompt, which describes the role of the LLM; the user prompt, which provides specific task instructions and context; and the assistant specification, which defines the exact output format required for our analysis and automation. Through each LLM’s API, we began by testing simple prompts to ensure basic functionality. We iteratively refined the prompts until the models could classify each survey response without significant issues. After this refinement process, we finalized the following general prompts for the LLMs.
For the system section, we specified the role of the model as: “You are an AI assistant that helps people conduct sentiment analysis.”
The user prompt provided the model with task-specific instructions and context. The final user prompt was: “Analyze the sentiment of the following sentence, considering it is a comment from the COVID-19 period. The analysis aims to understand people’s sentiment during the COVID period. The sentiment categories are Positive (1), Neutral (0), and Negative (-1).” This allowed the model to focus on the relevant context and perform sentiment classification accordingly.
For the assistant section, we required the output to be in a format suitable for automation. The prompt instructed the LLMs to return only the sentiment label as an integer, corresponding to Positive (1), Neutral (0), or Negative (-1), without any additional text.
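Putting these pieces together, the finalized prompt can be represented as the message structure below; the small parsing helper that converts the raw reply into an integer label is our own illustrative addition rather than part of the prompt itself.

```python
# The finalized prompt as chat messages, plus a defensive parser mapping the raw
# reply to an integer label in {-1, 0, 1}. The parser is an illustrative addition.
SYSTEM_PROMPT = "You are an AI assistant that helps people conduct sentiment analysis."

USER_PROMPT_TEMPLATE = (
    "Analyze the sentiment of the following sentence, considering it is a comment "
    "from the COVID-19 period. The analysis aims to understand people's sentiment "
    "during the COVID period. The sentiment categories are Positive (1), Neutral (0), "
    "and Negative (-1). Return only the sentiment label as an integer.\n\nSentence: {sentence}"
)

def build_messages(sentence: str) -> list:
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT_TEMPLATE.format(sentence=sentence)},
    ]

def parse_label(reply: str) -> int:
    """Return the first valid label found in the reply; raise if none is present."""
    for token in reply.replace(",", " ").split():
        if token in {"-1", "0", "1"}:
            return int(token)
    raise ValueError(f"Unexpected model output: {reply!r}")
```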
4.4 Evaluation
The outputs from each model were stored in separate .xlsx files, and a Python script using Pandas (v2.2.2) was employed to compute key evaluation metrics, including accuracy, precision, recall, and F1-score for negative, neutral, and positive sentiments. Each neural network (Twitter-RoBERTa-base and DeBERTa-v3-base-absa-v1.1) was executed once, whereas each LLM model (Gemini-2.0-Flash, Llama-3.3-70B-Instruct, and GPT-4o) was run five times to account for their inherent non-deterministic behavior. The evaluations were conducted on both the complete model output datasets and the subsets containing only responses with full annotator agreement.
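For reference, the per-class metrics can be computed from the stored outputs as in the sketch below; the file and column names, and the use of scikit-learn, are illustrative assumptions.

```python
# Sketch: compute overall accuracy plus per-class precision, recall, and F1
# from a stored model-output file. File and column names are placeholders.
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

df = pd.read_excel("gpt4o_outputs.xlsx")
y_true = df["annotator_label"]   # ground truth: -1, 0, 1
y_pred = df["model_label"]       # model output: -1, 0, 1

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=[-1, 0, 1], zero_division=0
)

for name, p, r, f in zip(["negative", "neutral", "positive"], precision, recall, f1):
    print(f"{name}: precision={p:.2%} recall={r:.2%} f1={f:.2%}")
print(f"overall accuracy={accuracy:.2%}")
```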
4.5 Limitations
4.5.1 Dataset
The data for this study consists of randomly selected sentences from full survey responses, even when specific questions were asked, resulting in a lack of context that may be necessary for accurately classifying sentiment. Participants may have responded under specific conditions or with particular expectations, further complicating sentiment classification. Additionally, the annotation process involved three annotators assessing the polarity of each sentence, with the final sentiment labels assigned through a structured agreement process. When the annotators failed to reach consensus, sentences defaulted to a neutral label, which could skew the results by misclassifying nuanced positive or negative sentiments as neutral, potentially impacting model performance and the dataset’s overall sentiment distribution. This motivated us to conduct an additional analysis based solely on data with full annotation agreement.
Despite these limitations, the dataset is useful as it provides a transparent view of the sentiment annotation process. This transparency allows for a clear understanding of the level of agreement among annotators and highlights the subjectivity involved in sentiment analysis, allowing us to better compare the performance of each model to the sentiment labeled by each annotator. Furthermore, the structured agreement process used to determine the final sentiment labels ensures a systematic approach to dealing with disagreements, contributing to the overall reliability of the dataset.
4.5.2 LLM Hyperparameter Selection
In this study, we employed three large language models—GPT-4o, Llama-3.3-70B-Instruct on Azure, and Gemini-2.0-Flash—to perform sentiment analysis on our survey dataset. For all three models, we fixed the temperature hyperparameter at 0.8 while leaving all other configuration settings at their default values. This decision was made to maintain consistency across experiments and to simplify the comparative analysis between dedicated neural networks and LLMs. However, this approach also presents a limitation. The performance and output quality of LLMs are known to be sensitive to various configuration hyperparameters, such as temperature, top-p, and top-k, among others. By only adjusting the temperature hyperparameter, we did not explore the full range of possible model behaviors that might emerge under different settings. As a result, the results reported in this paper represent just one configuration scenario, which may not fully capture the models’ potential or the nuances of sentiment classification. Future work should consider a more comprehensive hyperparameter search to better understand how these settings affect performance, thereby offering a more robust comparison between LLMs and dedicated neural network approaches.
4.5.3 Environmental Variability in Model Deployment
Due to technological limitations, the study deploys dedicated neural networks locally and LLMs on remote servers (e.g., Microsoft Azure AI and Google AI Studio). This introduces variability not just in model architecture but also in operational environment, latency, and potential API constraints. Such differences might affect performance comparisons, especially when the LLMs are subject to rate limitations or token constraints that do not affect locally run models. Moreover, OpenAI, Meta and Google periodically optimize their cloud-based models, improving inference efficiency, latency, or response coherence. These changes might lead to unintended variations in model performance over time, making reproducibility more challenging. Future work should consider running controlled experiments that minimize environmental disparities, such as evaluating LLMs on self-hosted instances where possible or benchmarking them under different API conditions to assess variability across deployments.
5. Results
This section presents the findings from our experiments, starting with a comparison of model performance in terms of accuracy, precision, recall and F1-scores. In the second half, we evaluate model output biases, with a particular focus on testing the existence of positive biases in LLMs.
5.1 Accuracy, Precision, Recall, and F1-score
Using the confusion matrices in Tables 2–6, we evaluated the performance of dedicated sentiment classification neural networks compared to LLMs in sentiment analysis. The performance metrics include overall accuracy, as well as precision, recall, and F1-score for each sentiment category. These metrics follow their standard definitions, restated below for completeness.
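Here, TP, FP, and FN denote the true positives, false positives, and false negatives counted for a given sentiment class:

$$
\text{Accuracy} = \frac{\text{number of correctly classified sentences}}{\text{total number of sentences}}
$$

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$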
Table 2. Confusion matrix for Twitter-RoBERTa-base; columns give the ground-truth labels under all types of annotator agreement (“all”) and full annotator agreement (“full”).

| Model output | Negative (all) | Neutral (all) | Positive (all) | Negative (full) | Neutral (full) | Positive (full) |
| --- | --- | --- | --- | --- | --- | --- |
| Negative | 375 | 84 | 2 | 309 | 30 | 1 |
| Neutral | 74 | 244 | 65 | 47 | 171 | 46 |
| Positive | 8 | 21 | 127 | 3 | 8 | 110 |
Table 3. Confusion matrix for DeBERTa-v3-base-absa-v1.1; columns as in Table 2.

| Model output | Negative (all) | Neutral (all) | Positive (all) | Negative (full) | Neutral (full) | Positive (full) |
| --- | --- | --- | --- | --- | --- | --- |
| Negative | 320 | 94 | 3 | 263 | 34 | 1 |
| Neutral | 104 | 207 | 76 | 73 | 151 | 56 |
| Positive | 33 | 48 | 115 | 23 | 24 | 100 |
Table 4. Confusion matrix for Gemini-2.0-Flash; columns as in Table 2.

| Model output | Negative (all) | Neutral (all) | Positive (all) | Negative (full) | Neutral (full) | Positive (full) |
| --- | --- | --- | --- | --- | --- | --- |
| Negative | 411 | 116 | 0 | 344 | 36 | 0 |
| Neutral | 40 | 201 | 22 | 14 | 161 | 10 |
| Positive | 6 | 32 | 172 | 1 | 12 | 147 |
Table 5. Confusion matrix for Llama-3.3-70B-Instruct; columns as in Table 2.

| Model output | Negative (all) | Neutral (all) | Positive (all) | Negative (full) | Neutral (full) | Positive (full) |
| --- | --- | --- | --- | --- | --- | --- |
| Negative | 386 | 78 | 1 | 330 | 17 | 0 |
| Neutral | 67 | 234 | 19 | 26 | 173 | 8 |
| Positive | 4 | 37 | 174 | 3 | 19 | 149 |
Table 6. Confusion matrix for GPT-4o; columns as in Table 2.

| Model output | Negative (all) | Neutral (all) | Positive (all) | Negative (full) | Neutral (full) | Positive (full) |
| --- | --- | --- | --- | --- | --- | --- |
| Negative | 406 | 86 | 1 | 346 | 20 | 0 |
| Neutral | 49 | 239 | 18 | 11 | 181 | 7 |
| Positive | 2 | 24 | 175 | 2 | 8 | 150 |
Table 7. Overall accuracy, precision, recall, and F1-score (%) on all survey responses (all types of annotator agreement).

| Model | Overall Accuracy | Precision (Negative) | Precision (Neutral) | Precision (Positive) | Recall (Negative) | Recall (Neutral) | Recall (Positive) | F1 (Negative) | F1 (Neutral) | F1 (Positive) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Twitter-RoBERTa-base | 74.60 | 81.34 | 63.71 | 81.41 | 82.06 | 69.91 | 65.46 | 81.70 | 66.67 | 72.57 |
| DeBERTa-v3-base-absa-v1.1 | 64.20 | 76.74 | 53.49 | 58.67 | 70.02 | 59.31 | 59.28 | 73.23 | 56.25 | 58.97 |
| Gemini-2.0-Flash | 78.40 (0.29)* | 77.99 (0.30) | 76.43 (0.64) | 81.90 (0.67) | 89.93 (0.65) | 57.59 (1.08) | 88.66 (1.06) | 83.54 (0.31) | 65.69 (0.69) | 85.15 (0.69) |
| Llama-3.3-70B-Instruct | 79.40 (0.26) | 83.01 (0.28) | 73.13 (0.53) | 80.93 (0.48) | 84.46 (0.50) | 67.05 (0.68) | 89.69 (0.67) | 83.73 (0.28) | 69.96 (0.42) | 85.09 (0.45) |
| GPT-4o | 82.00 (0.45) | 82.35 (0.36) | 78.10 (1.06) | 87.06 (0.39) | 88.84 (0.50) | 68.48 (0.47) | 90.21 (0.78) | 85.47 (0.33) | 72.98 (0.64) | 88.61 (0.49) |

*: Values in parentheses represent standard deviation estimates (%) from multiple runs.
Table 8. Overall accuracy, precision, recall, and F1-score (%) on survey responses with full annotator agreement.

| Model | Overall Accuracy | Precision (Negative) | Precision (Neutral) | Precision (Positive) | Recall (Negative) | Recall (Neutral) | Recall (Positive) | F1 (Negative) | F1 (Neutral) | F1 (Positive) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Twitter-RoBERTa-base | 81.38 | 90.88 | 64.77 | 90.91 | 86.07 | 81.82 | 70.06 | 88.41 | 72.30 | 79.14 |
| DeBERTa-v3-base-absa-v1.1 | 70.90 | 88.26 | 53.93 | 68.03 | 73.26 | 72.25 | 63.69 | 80.06 | 61.76 | 65.79 |
| Gemini-2.0-Flash | 89.93 (0.20)* | 90.53 (0.41) | 87.03 (0.73) | 91.88 (0.57) | 95.82 (0.37) | 77.03 (1.04) | 93.63 (0.73) | 93.10 (0.22) | 81.73 (0.39) | 92.74 (0.48) |
| Llama-3.3-70B-Instruct | 89.93 (0.25) | 95.10 (0.24) | 83.57 (0.90) | 87.13 (0.55) | 91.92 (0.58) | 82.78 (0.59) | 94.90 (0.45) | 93.48 (0.32) | 83.17 (0.35) | 90.85 (0.25) |
| GPT-4o | 93.38 (0.35) | 94.54 (0.47) | 90.95 (0.68) | 93.75 (0.52) | 96.38 (0.25) | 86.60 (0.99) | 95.54 (0.57) | 95.45 (0.28) | 88.73 (0.73) | 94.64 (0.53) |

*: Values in parentheses represent standard deviation estimates (%) from multiple runs.
Table 9. McNemar’s χ² test statistics for pairwise comparisons of overall accuracy (all types of annotator agreement).

| Model | Twitter-RoBERTa-base | DeBERTa-v3-base-absa-v1.1 | Gemini-2.0-Flash | Llama-3.3-70B-Instruct | GPT-4o |
| --- | --- | --- | --- | --- | --- |
| Twitter-RoBERTa-base | NA | 38.63 (< 0.01)* | 6.39 (0.01) | 10.67 (< 0.01) | 26.33 (< 0.01) |
| DeBERTa-v3-base-absa-v1.1 | | NA | 66.33 (< 0.01) | 73.58 (< 0.01) | 104.91 (< 0.01) |
| Gemini-2.0-Flash | | | NA | 0.72 (0.39) | 10.45 (< 0.01) |
| Llama-3.3-70B-Instruct | | | | NA | 8.05 (< 0.01) |
| GPT-4o | | | | | NA |

*: Values in parentheses represent the associated p-values.
Table 10. McNemar’s χ² test statistics for pairwise comparisons of overall accuracy (full annotator agreement).

| Model | Twitter-RoBERTa-base | DeBERTa-v3-base-absa-v1.1 | Gemini-2.0-Flash | Llama-3.3-70B-Instruct | GPT-4o |
| --- | --- | --- | --- | --- | --- |
| Twitter-RoBERTa-base | NA | 30.08 (< 0.01)* | 26.33 (< 0.01) | 27.86 (< 0.01) | 59.60 (< 0.01) |
| DeBERTa-v3-base-absa-v1.1 | | NA | 88.99 (< 0.01) | 89.83 (< 0.01) | 130.88 (< 0.01) |
| Gemini-2.0-Flash | | | NA | 0 (1.00) | 9.62 (< 0.01) |
| Llama-3.3-70B-Instruct | | | | NA | 13.30 (< 0.01) |
| GPT-4o | | | | | NA |

*: Values in parentheses represent the associated p-values.
Table 11. 2×2 contingency table of correct and incorrect predictions by GPT-4o and Twitter-RoBERTa-base on responses with full annotator agreement.

| Frequency | Twitter-RoBERTa-base incorrect | Twitter-RoBERTa-base correct |
| --- | --- | --- |
| GPT-4o incorrect | 28 | 20 |
| GPT-4o correct | 107 | 570 |
Table 12. Relative improvement in overall accuracy of each LLM over the dedicated baseline models.

| Model | All types of annotator agreement for polarity | Full annotator agreement for polarity |
| --- | --- | --- |
| Baseline model: Twitter-RoBERTa-base | | |
| Gemini-2.0-Flash | 5.09% | 10.51% |
| Llama-3.3-70B-Instruct | 6.43% | 10.51% |
| GPT-4o | 9.92% | 14.75% |
| Baseline model: DeBERTa-v3-base-absa-v1.1 | | |
| Gemini-2.0-Flash | 22.12% | 26.84% |
| Llama-3.3-70B-Instruct | 23.68% | 26.84% |
| GPT-4o | 27.73% | 31.71% |
The results of our experiments, as shown in Tables 7–8, provide a direct comparison of the performance of dedicated sentiment classification neural networks and LLMs on survey data. Specifically, each model was evaluated on its ability to classify sentiment into negative, neutral, and positive categories, using both partial and full annotator agreement for polarity. To ensure the observed performance differences are unlikely to be due to chance, we evaluated the statistical significance of the overall accuracy comparisons presented in Tables 9–10 using McNemar’s test27, given a sufficiently large number of discordant pairs (i.e., the sum of the off-diagonal elements in each 2×2 contingency table > 10) in our data. This test is specifically designed for comparing two classifiers on the same dataset, as it accounts for the correlation between predictions arising from their being applied to identical test instances28. Unlike a standard proportion comparison test, such as a z-test for proportions, McNemar’s test is appropriate because it considers the dependency structure of the data, ensuring a more reliable assessment of statistical significance.

It is also important to consider how repeated LLM inference runs on the same dataset should be handled in the statistical test. Because LLM inference was executed five times over the same dataset while neural network inference was executed once, one option is to conduct McNemar’s test for each LLM run, resulting in multiple comparisons against each neural network. A drawback of this approach is having multiple p-values; if a single conclusion is desired, one must correct for multiple testing (e.g., Bonferroni, Holm-Bonferroni). Another, less appealing, approach is to compute the McNemar test statistic by aggregating the contingency tables across runs, yielding a single p-value. However, this method ignores the correlation across runs that arises because they share the same test dataset, potentially inflating the sample size or understating variance and thus leading to a misleading p-value. We instead chose to ensemble the LLM inference results from the five runs via a majority voting rule (randomly assigning one of the tied values if there is a tie) before applying McNemar’s test. This approach provides a single p-value without artificially inflating the sample size or ignoring correlation across runs.
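A minimal sketch of this procedure, assuming the predictions from the five LLM runs and the single neural network run are available as label lists, is shown below; the majority vote uses random tie-breaking as described above, and the 2×2 table follows the layout of Table 11.

```python
# Sketch: majority-vote ensemble of five LLM runs, then McNemar's test against a
# dedicated neural network using the 2x2 disagreement table (layout as in Table 11).
import random
from collections import Counter

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def majority_vote(runs):
    """runs: list of five label lists (one per LLM run); ties are broken at random."""
    voted = []
    for labels in zip(*runs):
        counts = Counter(labels)
        top = max(counts.values())
        voted.append(random.choice([lab for lab, c in counts.items() if c == top]))
    return voted

def disagreement_table(y_true, pred_a, pred_b):
    """2x2 table of incorrect/correct predictions for classifiers a (rows) and b (columns)."""
    a_correct = np.asarray(pred_a) == np.asarray(y_true)
    b_correct = np.asarray(pred_b) == np.asarray(y_true)
    return np.array([
        [np.sum(~a_correct & ~b_correct), np.sum(~a_correct & b_correct)],
        [np.sum(a_correct & ~b_correct), np.sum(a_correct & b_correct)],
    ])

# result = mcnemar(disagreement_table(y_true, majority_vote(llm_runs), nn_pred),
#                  exact=False, correction=False)
# print(result.statistic, result.pvalue)
```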
To further demonstrate how McNemar’s test was applied, Table 11 presents the 2×2 contingency table used to test the statistical significance of the difference in accuracy between GPT-4o and Twitter-RoBERTa-base, based on survey data with full annotator agreement on polarity. It provides a snapshot of how GPT-4o and Twitter-RoBERTa-base perform relative to each other, highlighting both their correct/incorrect agreements and the instances where they disagree. The χ² test statistic was calculated from the off-diagonal cell values, and the p-value was computed with 1 degree of freedom; in this case, the χ² test statistic is 59.60 (see Table 10) with a p-value < 0.01. Note that diagonal cell values are excluded from the computation, as they represent classifier agreement (both models making the same decision, whether correct or incorrect) and do not help identify systematic performance differences. Instead, we use the off-diagonal values, which capture disagreements between the classifiers.
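Concretely, using the off-diagonal counts from Table 11 and no continuity correction, the statistic is

$$
\chi^2 = \frac{(107 - 20)^2}{107 + 20} = \frac{7569}{127} \approx 59.60,
$$

which, referred to a χ² distribution with 1 degree of freedom, yields p < 0.01, in agreement with the value reported in Table 10.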
As shown in Table 7, LLMs demonstrate better overall performance in sentiment analysis compared to dedicated neural networks. In particular, GPT-4o consistently outperforms the dedicated models with statistical significance (p-value < 0.01 compared to Twitter-RoBERTa-base; p-value < 0.01 compared to DeBERTa-v3-base-absa, as shown in Table 9) and achieves the highest overall accuracy across all models. This suggests that LLMs can better generalize across different sentiment categories in survey data and, as a result, handle the complexities of human responses more effectively than specialized models like Twitter-RoBERTa-base or DeBERTa-v3-base-absa. This trend becomes more pronounced when considering only sentences with full annotator agreement, where sentiment classification is less noisy (see Tables 8, 10, and 12). In Table 8, accuracy increases across all models with full annotator agreement, as clearer polarity features make sentiment detection easier by emphasizing more distinct indicators. However, the performance gap between LLMs and neural networks also widens (see Table 12), with stronger statistical significance, as indicated by higher χ² test statistics and lower p-values (see Table 10).
5.2 Model Bias Assessment
While recent LLMs outperform dedicated neural networks on our data, it is crucial to examine their positive bias, one of many shortcomings of LLMs. Deploying positively biased LLM models in sentiment analysis during the COVID-19 lockdown could have resulted in misrepresentation of public distress, leading to inadequate mental health support, flawed policymaking, and a false perception of public well-being. Such bias could extend to other sectors, such as finance, education, and workplace management, where overly optimistic sentiment analysis may skew economic forecasts, underplay student or employee dissatisfaction, and misguide decision making.
Earlier-generation LLMs or those without specialized fine-tuning (e.g., GPT-2, GPT-3.5, LLaMA-2) tend to overpredict positive sentiment, likely reflecting a general positivity bias in language data. Newer instruction-tuned models like GPT-4 more carefully calibrate sentiment to mitigate this issue, with their effectiveness empirically demonstrated in some studies29. Our study further confirms the absence of significant positive bias in newer LLMs through two established assessment methods: the False Positive Rate approach and the Distribution-Based Bias approach30,31. The False Positive Rate approach evaluates the extent of positive misclassification of ground-truth neutral and negative sentiments. Table 13 presents this metric across models, showing that under the full annotator agreement setting, all three LLMs had a false positive rate below 4.00%, whereas neural networks, such as DeBERTa-v3-base-absa-v1.1, exhibited the highest rate at 8.27%. The Distribution-Based Bias approach, also known as the Kullback–Leibler divergence method, provides a more rigorous assessment by quantifying the divergence between a model’s predicted sentiment distribution and the ground-truth distribution. Table 14 presents results from this analysis using fully agreed-upon annotations, indicating that all three LLMs passed the distribution bias test, while the two neural network models failed. Table 15 further demonstrates that the proportion of positive sentiment in model outputs closely aligns with human-labeled data. Specifically, under full annotator agreement, the positive sentiment proportion was 21.66% for human annotations, 22.07% for GPT-4o and Gemini-2.0-Flash, and 23.59% for Llama-3.3-70B-Instruct. It is worth noting that, although the positive sentiment proportion in DeBERTa-v3-base-absa-v1.1 outputs was close to the ground truth, the model failed the Distribution-Based Bias test due to misalignment in the neutral and negative sentiment distributions. These findings suggest that recent LLMs achieve better sentiment calibration, reducing concerns about systematic positive bias.
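To illustrate the two checks, the sketch below computes the false positive rate and the Kullback–Leibler divergence between a model’s predicted label distribution and the ground-truth distribution; the direction of the divergence shown is one common convention, and the significance test behind the p-values in Table 14 follows the cited methodology and is not reproduced here.

```python
# Sketch: positive-bias checks. False positive rate = share of ground-truth
# neutral or negative sentences labeled positive by the model; the KL divergence
# compares predicted vs. ground-truth label distributions over {-1, 0, 1}.
import numpy as np
from scipy.stats import entropy

LABELS = (-1, 0, 1)

def false_positive_rate(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    non_positive = y_true != 1
    return float(np.mean(y_pred[non_positive] == 1))

def label_distribution(labels):
    labels = np.asarray(labels)
    return np.array([np.mean(labels == k) for k in LABELS])

def kl_divergence(y_pred, y_true):
    # scipy's entropy(p, q) computes sum(p * log(p / q)); direction is a convention choice.
    return float(entropy(label_distribution(y_pred), label_distribution(y_true)))
```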
Table 13. False positive rate (%) of each model by degree of annotator agreement.

| Degree of annotator agreement | Twitter-RoBERTa-base | DeBERTa-v3-base-absa-v1.1 | Gemini-2.0-Flash | Llama-3.3-70B-Instruct | GPT-4o |
| --- | --- | --- | --- | --- | --- |
| Full | 1.94 | 8.27 | 2.29 | 3.87 | 1.76 |
| All (full, partial, zero) | 3.60 | 10.05 | 4.71 | 5.09 | 3.23 |
Table 14. Distribution-based bias assessment (Kullback–Leibler divergence) on responses with full annotator agreement.

| | Twitter-RoBERTa-base | DeBERTa-v3-base-absa-v1.1 | Gemini-2.0-Flash | Llama-3.3-70B-Instruct | GPT-4o |
| --- | --- | --- | --- | --- | --- |
| Kullback–Leibler distance | 0.016 | 0.022 | 0.0029 | 0.0011 | 0.00047 |
| p-value | < 0.01 | < 0.01 | 0.35 | 0.67 | 0.84 |
| Model bias detected? | YES | YES | NO | NO | NO |
Table 15. Proportion (%) of positive sentiment in human annotations and model outputs.

| Degree of annotator agreement | Annotator | Twitter-RoBERTa-base | DeBERTa-v3-base-absa-v1.1 | Gemini-2.0-Flash | Llama-3.3-70B-Instruct | GPT-4o |
| --- | --- | --- | --- | --- | --- | --- |
| Full | 21.66 | 16.69 | 20.28 | 22.07 | 23.59 | 22.07 |
| All (full, partial, zero) | 19.40 | 15.60 | 19.60 | 21.00 | 21.50 | 20.10 |
6. Discussion
Building on the observations above, we now delve into the underlying reasons for these performance differences by examining how the architectures and training strategies of the models influence their outcomes. LLMs leverage extensive pretraining on diverse datasets and instruction-tuning to align with human intent, enabling them to generalize across complex and ambiguous inputs. Additionally, LLMs have knowledge cutoff dates in 2023/2024—specifically, October 2023 for GPT-4o, August 2024 for Gemini-2.0-Flash, and December 2023 for Llama-3.3-70B-Instruct. These cutoff dates postdate the onset of COVID-19, allowing the models to recognize the context of the survey responses. In contrast, Twitter-RoBERTa-base was trained on tweets from January 2018 to December 2021, and DeBERTa-v3-base-absa-v1.1 was fine-tuned for aspect-based sentiment analysis using datasets such as the ABSADatasets, which are domain-specific and focused on identifying sentiment tied to particular aspects rather than generalized contexts. These differences in both the timeframes and content of the training datasets give the LLMs a broader contextual understanding than the specialized but narrower focus of the dedicated neural networks, contributing to their superior performance on free-response survey data.

Another critical factor influencing performance differences is model architecture. RoBERTa, optimized for general language understanding, emphasizes efficient pretraining strategies and self-attention mechanisms but lacks innovations like disentangled attention. Its architecture excels at processing noisy, short-form text such as social media data, due to its robust training on datasets like TweetEval. In contrast, DeBERTa incorporates disentangled attention and enhanced mask decoding, which effectively separates content and positional embeddings. This makes DeBERTa particularly well-suited for fine-grained aspect-based sentiment tasks but limits its ability to generalize across broader, more ambiguous survey contexts. LLMs, however, are designed for versatility and scale, utilizing architectures with billions of parameters, advanced attention mechanisms, and extensive pretraining on diverse datasets. This allows them to handle linguistic subtleties and complex contexts far better than the specialized models. Their general-purpose nature enables them to excel across a wide variety of tasks, including sentiment classification, language generation, and summarization, making them more adaptable and effective in understanding free-response survey data.
In summary, these results highlight that while LLMs are not specifically designed for sentiment analysis, their extensive pretraining and robust architectures enable them to excel in this task, often outperforming dedicated neural networks. The adaptability of models like GPT-4o, Llama-3.3-70B-Instruct, and Gemini-2.0-Flash allows them to capture nuanced and contextual information across diverse and ambiguous free-response survey data. However, the findings also underscore certain trade-offs. As Table 16 shows, for pure sentiment analysis efficiency, smaller dedicated models (such as Twitter-RoBERTa-base and DeBERTa-v3-base-absa-v1.1) remain the fastest, cheapest, and most lightweight options. Among the LLMs, Gemini-2.0-Flash offers improved efficiency and is generally more budget-friendly than GPT-4o and Llama-3.3-70B-Instruct. Even so, all three large-scale models demand far more computational resources than the specialized ones. If the goal is to minimize runtime and cost specifically for sentiment tasks, fine-tuned smaller dedicated models are significantly more efficient and resource-friendly than relying on giant general-purpose LLMs.
Table 16. Computation cost, inference speed, and hardware requirements of the evaluated models.

| Model | Computation Cost ($ per 1 million tokens) | Inference Speed on Single A100 80GB (seconds per token) | Hardware Requirement |
| --- | --- | --- | --- |
| GPT-4o (on Azure Cloud) | Input: $2.50; Output: $10.00 | ~5–10 (theoretical) | GPU: 8–32 × 80GB; VRAM: 200–400+ GB |
| Llama-3.3-70B-Instruct (on Azure Cloud) | Input: $0.71; Output: $0.71 | ~0.25–1.0 | GPU: 2–4 × 80GB; VRAM: ~140–170 GB |
| Gemini-2.0-Flash (on Google Cloud) | Input: $0.10; Output: $0.40 | ~0.005–0.01 | GPU: 1 × 40–48GB; VRAM: ~16–32 GB |
| Twitter-RoBERTa-base | Negligible cloud cost if self-hosted | ~0.000002–0.000003 | GPU: 1 × 2GB; VRAM: ~0.24 GB |
| DeBERTa-v3-base-absa-v1.1 | Negligible cloud cost if self-hosted | ~0.000002–0.000003 | GPU: 1 × 2GB; VRAM: ~0.35 GB |
7. Conclusion
We found that LLMs consistently outperformed dedicated neural network models by achieving higher accuracy in sentiment classification. Specifically, GPT-4o showed superior performance in capturing the subtleties of human emotions, making it more effective in handling the complexities of free-response survey data. In our evaluation, the order of performance from highest to lowest was: GPT-4o > Llama-3.3-70B-Instruct ≈ Gemini-2.0-Flash > Twitter-RoBERTa-base > DeBERTa-v3-base-absa-v1.1. Moreover, by evaluating model performance using different levels of annotator agreement as the ground-truth sentiment category, we found that the superiority of LLMs over dedicated neural networks was more pronounced in the full-agreement setting. Furthermore, unlike earlier versions of LLMs, our selected models showed no strong statistical evidence of positive bias in sentiment classification, reinforcing their reliability for unbiased sentiment analysis. However, while LLMs outperform dedicated neural networks, they require significantly more resources and come at a higher cost for users. This makes dedicated neural networks, such as Twitter-RoBERTa-base, a competitive alternative when resource and cost constraints are a concern in practical applications.

Given these findings, it is important to weigh the trade-offs between performance and practicality. Our general recommendation is: if computational cost, resource constraints, data privacy, and security are not a concern, LLMs—particularly GPT-4o via API—are recommended due to their superior accuracy and ease of use, even for non-technical practitioners. However, real-world applications often impose additional constraints that might influence model selection. Factors such as deployment environment, real-time processing needs, and overall system architecture can play a crucial role in deciding which model is most appropriate.

With these considerations in mind, when selecting between LLMs and dedicated neural networks, practitioners should evaluate the following factors: (1) Performance and accuracy: evaluate whether the enhanced accuracy and the ability to capture subtle nuances justify opting for an LLM over a dedicated model. (2) Resource requirements: assess the computational power, memory, and infrastructure needed; LLMs generally demand more resources. (3) Cost implications: consider both deployment and operational costs. While LLMs may incur higher expenses, using an API can alleviate infrastructure burdens, though it introduces per-call costs. (4) Deployment and scalability: determine the ease of integration and scalability within existing systems. API-based LLMs, for instance, simplify deployment by transferring maintenance responsibilities to the service provider. (5) Data privacy and security: examine potential risks related to transmitting sensitive data, particularly when using external API services for LLMs.
While this study provides insights into the relative strengths and limitations of each approach, several challenges remain. Future research should prioritize optimizing LLMs and further evaluating their inherent biases using diverse datasets, while also testing upcoming new models such as OpenAI’s o1 and o3 to ensure fair and reliable performance as these models evolve. In addition to ensembling strategies, fine-tuning LLMs with domain-specific data—such as those from healthcare, finance, and education—could significantly enhance their contextual understanding and performance, ensuring their effective application across various specialized fields.
Further work should also expand sentiment analysis beyond the traditional positive, neutral, and negative classifications to include more nuanced emotion detection, such as stress, anxiety, depression, and frustration. This granularity is particularly critical in public health, where tracking emotional distress can help policymakers and healthcare providers respond proactively to mental health crises. Similarly, recognizing complex emotions in customer feedback, employee engagement surveys, and educational assessments would provide deeper insights into public sentiment.
Lastly, future research on survey data sentiment analysis would benefit from integrating multimodal sources such as images, videos, or audio recordings alongside text. By fusing these diverse inputs, researchers can gain deeper insights into participants’ emotional states and contextual cues. For instance, an LLM might analyze not only the textual response but also voice stress markers or relevant images respondents attach. This approach could uncover previously overlooked signals, resulting in more robust and holistic sentiment classification. Addressing these challenges and broadening research directions will enhance the accuracy, fairness, and applicability of sentiment analysis across a wide range of real-world applications.
8. References
- E. Cambria. Affective computing and sentiment analysis. IEEE Intelligent Systems 31(2), 102–107 (2016).
- F. Anzum, M. L. Gavrilova. Emotion detection from micro-blogs using novel input representation. IEEE Access: Practical Innovations, Open Solutions 11, 19512–19522 (2023).
- A. Hussain, E. Cambria, S. Poria, A. Hawalah, F. Herrera. Information fusion for affective computing and sentiment analysis. Information Fusion 71, 97–98 (2021).
- S. L. Wallas, A. K. Lewis, M. D. Allen. The state of the literature on student evaluations of teaching and an exploratory analysis of written comments: who benefits most? College Teaching 67(1), 1–14 (2018).
- M. J. Parker, C. Anderson, C. Stone, Y. Oh. A large language model approach to educational survey feedback analysis. International Journal of Artificial Intelligence in Education (2024).
- L. Zhang, S. Wang, B. Liu. Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8(4), e1253 (2018).
- L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, T. Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems 43(2), 1–55 (2025).
- E. M. Bender, T. Gebru, A. McMillan‐Major, S. Shmitchell. On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623 (2021).
- B. Liu. Opinion Mining and Sentiment Analysis. In: Web Data Mining. Data-Centric Systems and Applications. Berlin, Heidelberg: Springer, 459–526 (2011).
- J. Wang, R. Xu. Performance analysis of sentiment classification based neural network. Applied and Computational Engineering 5(1), 513–518 (2023).
- A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development 3(3), 210–229 (1959).
- F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65(6), 386–408 (1958).
- J. Schmidhuber. Deep learning in neural networks: an overview. Neural Networks: The Official Journal of the International Neural Network Society 61, 85–117 (2015).
- J. L. Elman. Finding structure in time. Cognitive Science 14(2), 179–211 (1990).
- S. Hochreiter, J. Schmidhuber. Long short-term memory. Neural Computation 9(8), 1735–1780 (1997).
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems 30, 5998–6008 (2017).
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, R. Lowe. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022).
- S. Han, M. Wang, J. Zhang, D. Li, J. Duan. A review of large language models: Fundamental architectures, key technological evolutions, interdisciplinary technologies integration, optimization and compression techniques, applications, and challenges. Electronics 13(24) (2024).
- J. A. Lossio-Ventura, R. Weger, A. Y. Lee, E. P. Guinee, J. Chung, L. Atlas, E. Linos, F. Pereira. A comparison of ChatGPT and fine-tuned Open Pre-Trained Transformers (OPT) against widely used sentiment analysis tools: sentiment analysis of COVID-19 survey data. JMIR Ment Health 11, e50150 (2024).
- Z. Li, C. Yang, C. Huang. A comparative sentiment analysis of airline customer reviews using Bidirectional Encoder Representations from Transformers (BERT) and its variants. Mathematics 12(1), 53 (2024).
- A. Marrhich, I. Lafram, N. Berbiche. Multi-task learning with BERT, RoBERTa, GPT-3.5, ELECTRA, and XLNet for urgency classification, topic similarity, and sentiment analysis in MOOCs. Ingénierie des Systèmes d’Information 29(5), 1891–1901 (2024).
- J. A. Lossio-Ventura, R. Weger, A. Y. Lee, E. P. Guinee, J. Chung, L. Atlas, E. Linos, F. Pereira. A comparison of ChatGPT and fine-tuned Open Pre-Trained Transformers (OPT) against widely used sentiment analysis tools: sentiment analysis of COVID-19 survey data. JMIR Ment Health 11, e50150 (2024).
- A. Marrhich, I. Lafram, N. Berbiche. Multi-task learning with BERT, RoBERTa, GPT-3.5, ELECTRA, and XLNet for urgency classification, topic similarity, and sentiment analysis in MOOCs. Ingénierie des Systèmes d’Information 29(5), 1891–1901 (2024).
- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach (2019).
- P. He, X. Liu, J. Gao, W. Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention (2021).
- J. A. Lossio-Ventura, R. Weger, A. Lee, E. Guinee, J. Chung, A. Lauren, E. Linos, F. Pereira. Sentiment analysis test dataset created from two COVID-19 surveys: National Institutes of Health (NIH) and Stanford University (Version 2). Figshare (2023).
- Q. McNemar. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12(2), 153–157 (1947).
- T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 10(7), 1895–1923 (1998).
- J. Krugmann, J. Hartmann. Sentiment analysis in the age of generative AI. Customer Needs and Solutions 11(3) (2024).
- L. Dixon, J. Li, J. Sorensen, N. Thain, L. Vasserman. Measuring and mitigating unintended bias in text classification. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (2018).
- J. P. Venugopal, A. A. V. Subramanian, G. Sundaram, M. Rivera, P. Wheeler. A comprehensive approach to bias mitigation for sentiment analysis of social media data. Applied Sciences 14(23), 11471 (2024).