Abstract
This paper discusses the development of my models for author prediction and sentiment analysis of quotes. I created two models, one for author prediction and one for sentiment analysis. The main motive was to create a model that would accurately predict the author and sentiment of a quote based on the text entered. I used Python for the coding part of this project, along with the Python library TensorFlow and NLTK, an NLP-focused library. The dataset chosen was a collection of quotes, their authors, and the emotions behind them in a CSV file; garbage data was removed manually. Both the sentiment analysis and author prediction models were trained for 50 epochs and used SimpleRNN and LSTM layers. The accuracy achieved by the author prediction model is 4.2%, and the accuracy for sentiment analysis is 50%.
Introduction
Text analysis using natural language processing is widely used in coding projects. Analyzing text has many potential benefits in the field of computing, and people make use of it in many ways. Research in this field proved very fascinating to me, and I would like to provide some examples of existing research to give a better outlook on my own work and how it relates to what has already been done in this field.
For example, a student at Princeton University created a new tool to combat plagiarism: an app called GPTZero that aims to determine whether a text was written by a human or by AI. GPTZero uses natural language processing to analyze text and two variables to determine whether the author of a particular text is human: perplexity, or how complex the writing is, and burstiness, or how variable it is. Text that is more complex, with varied sentence length, tends to be human-written, while prose that is more uniform and familiar to GPTZero tends to be written by AI1.
Text analysis can do a lot more than answer the question of who wrote a text. It can also reveal the purpose behind the text through sentiment analysis. Sentiment analysis is the process of gathering and analyzing people’s opinions, thoughts, and impressions regarding various topics, products, subjects, and services. People’s opinions can be beneficial to corporations, governments, and individuals for collecting information and making decisions. Sentiment analysis has gained widespread acceptance in recent years, not just among researchers but also among businesses, governments, and organizations. The growing popularity of the Internet has made the web the principal source of universal information, and many users turn to online resources to express their views and opinions. To constantly monitor public opinion and aid decision-making, this user-generated data must be analyzed automatically. As a result, sentiment analysis has grown in popularity across research communities in recent years.2
A notable example of a sentiment analysis bot is the Smart Simon Bot with Public Sentiment Analysis for Novel Covid-19 Tweets Stratification. In this project, the authors collected tweets from Twitter between February 2020 and July 2020 to analyze people's sentiments about COVID-19. The analysis examined how the pandemic created detrimental impacts like fear, terror, anxiety, and stress among people. The critical issues discussed in the paper are (i) leveraging Twitter data, specifically tweets, for sentiment analysis; (ii) community sentiment correlated with the advancement of the virus and COVID-19; (iii) detailed textual analysis and information visualization; and (iv) comparison of the text stratification procedures used in machine learning3.
Unlike most sentiment analysis bots on the internet, my model does not classify sentiments simply as positive, neutral, or negative. It gives the actual classification of the quote, such as humor, inspirational, or deep meaning. Author classification is also a relatively new concept, with few preexisting bots in this area.
Methodology
I used Python for the coding part of this project, along with the Python library TensorFlow and NLTK, an NLP-focused library. TensorFlow bundles together a slew of machine learning and deep learning models and algorithms (also known as neural networks) and makes them useful by way of common programmatic metaphors4. NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum5.
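A minimal sketch of how these two libraries fit together for text preprocessing is shown below; the tokenizer settings, vocabulary size, and sequence length are illustrative assumptions rather than the exact values from my code.

```python
# Sketch: NLTK for tokenization, TensorFlow/Keras for turning text into sequences.
import nltk
import tensorflow as tf
from nltk.tokenize import word_tokenize

nltk.download("punkt")      # tokenizer models (newer NLTK versions may also need "punkt_tab")

quote = "If there's a book that you want to read, but it hasn't been written yet, then you must write it."
tokens = word_tokenize(quote)   # NLTK splits the quote into word tokens
print(tokens[:8])

# TextVectorization maps words to integer indices and pads each quote to a fixed length.
vectorizer = tf.keras.layers.TextVectorization(max_tokens=10000, output_sequence_length=40)
vectorizer.adapt([quote])       # in practice this would be adapted on the whole dataset
print(vectorizer([quote]))      # padded sequence of word indices
```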
The dataset was chosen from Kaggle and is a collection of quotes, their authors, and the emotions behind them in a CSV file. Initially, each quote had multiple tags, but these were reduced to a single tag for greater model efficiency. The dataset contains 2505 entries with 878 different authors and 4 different tags. In many cases, I came across rows with empty tags and had to classify them manually according to the previously classified data. For example:
“If there’s a book that you want to read, but it hasn’t been written yet, then you must write it.” | Toni Morrison | inspirational |
This quote did not have a tag and was manually classified as ‘inspirational’. There were also some duplicate entries, which were deleted by hand.
While running the bot to print the author list taken from the CSV file, I ran into an error and noticed that many strange characters appeared in place of authors’ names. After a lengthy scan of all 2505 initial data values, I manually deleted the trash data that was affecting the bot, bringing the dataset down to 2447 values. These garbage rows were a disturbance to the bot, since both the quote and the author were strings of nonsensical characters, adding unnecessary and unusable data.
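The kind of cleaning described above can also be expressed in code. The sketch below is illustrative only: the file name and the column names ("quote", "author", "tag") are assumptions about the Kaggle CSV, and the garbage threshold is an arbitrary choice, since the actual cleaning was done by hand.

```python
# Illustrative cleaning sketch (the real cleaning was manual).
import pandas as pd

df = pd.read_csv("quotes.csv")                    # assumed file name
df = df.drop_duplicates(subset=["quote"])         # remove duplicate quotes

# Flag rows where the quote or author is mostly non-alphabetic noise.
def looks_like_garbage(text) -> bool:
    if not isinstance(text, str) or not text.strip():
        return True
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    return alpha / len(text) < 0.6                # arbitrary illustrative threshold

df = df[~(df["quote"].apply(looks_like_garbage) | df["author"].apply(looks_like_garbage))]

# Rows with an empty tag were labelled manually; here we just surface them.
missing_tags = df[df["tag"].isna()]
print(len(df), "clean rows,", len(missing_tags), "rows still need a manual tag")
```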
After experimenting with the code and with different layers for neural network training, I found that the SimpleRNN, Bidirectional LSTM, and Dense layers worked best, as they gave the highest accuracy. I tuned their parameters and found that the best results were obtained at 200 units.
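A minimal sketch of this layer stack is shown below. The layer types and the 200-unit setting match the description above, but the exact ordering, vocabulary size, embedding size, and output size are assumptions for illustration.

```python
# Sketch of the layer stack: SimpleRNN + Bidirectional LSTM + Dense, 200 units each.
import tensorflow as tf

NUM_CLASSES = 4      # e.g. the four sentiment tags; for author prediction this would be the number of authors
VOCAB_SIZE = 10000   # assumed vocabulary size
SEQ_LEN = 40         # assumed maximum quote length in tokens

model = tf.keras.Sequential([
    tf.keras.Input(shape=(SEQ_LEN,), dtype="int32"),
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),                 # word indices -> dense vectors
    tf.keras.layers.SimpleRNN(200, return_sequences=True),      # recurrent layer, 200 units
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(200)),   # reads the sequence in both directions
    tf.keras.layers.Dense(200, activation="relu"),              # fully connected layer
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),   # one probability per class
])
model.summary()
```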
Dense layers are the most common type of layer in neural networks. A dense layer is a layer where each neuron is connected to every neuron in the previous layer. The primary advantage of dense layers is that they are able to capture complex patterns in data by allowing each neuron to interact with all the neurons in the previous layer. For text data, dense layers are often used in NLP tasks, such as text classification, sentiment analysis, and language translation. This is because text data often have a complex structure with long-term dependencies that can be better captured by a dense layer with a high degree of connectivity6.
LSTM stands for Long Short-Term Memory. An LSTM is a type of recurrent neural network, but it is better than a traditional recurrent neural network in terms of memory. Because it has a good hold over memorizing certain patterns, an LSTM performs considerably better. As with every other neural network, an LSTM can have multiple hidden layers, and as data passes through every layer, the relevant information is kept and the irrelevant information is discarded in every single cell. Non-neural-network classification techniques, by contrast, are trained on individual words as separate inputs that carry no meaning as a sentence, so when predicting the class they give an output based on statistics rather than meaning; every single word is classified into one of the categories. This is not the case with an LSTM. With an LSTM we can use a multi-word string to find the class to which it belongs, which is very helpful when working with natural language processing. If we use appropriate embedding and encoding layers in an LSTM, the model will be able to find the actual meaning of the input string and give the most accurate output class7.
A Recurrent Neural Network (RNN) is a type of neural network where the output from the previous step is fed as input to the current step. A recurrent neural network consists of multiple fixed activation function units, one for each time step. Each unit has an internal state called the hidden state, which represents the past knowledge the network holds at a given time step. This hidden state is updated at every time step to reflect the change in the network's knowledge about the past. An RNN thus remembers information through time, and this ability to remember previous inputs is what makes it useful for sequence and time-series prediction8.
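As a toy illustration of the hidden-state update described above (not the project code), the recurrence h_t = tanh(W_x·x_t + W_h·h_(t-1) + b) can be written out directly; all sizes and weights here are made up for demonstration.

```python
# Toy demonstration of how an RNN's hidden state carries past information forward.
import numpy as np

rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 3))        # input-to-hidden weights (hidden size 4, input size 3)
W_h = rng.normal(size=(4, 4))        # hidden-to-hidden (recurrent) weights
b = np.zeros(4)

h = np.zeros(4)                      # hidden state starts empty
sequence = rng.normal(size=(5, 3))   # 5 time steps, e.g. 5 word embeddings

for x_t in sequence:
    h = np.tanh(W_x @ x_t + W_h @ h + b)   # each step mixes the new input with the old state

print(h)                             # final hidden state summarising the whole sequence
```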
Both the sentiment analysis and author prediction models were trained for 50 epochs in batches of 100, using SimpleRNN, Bidirectional LSTM, and Dense layers. Although there were some deviations across the 50 epochs of training, the results stayed very close, which is expected.
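Continuing the model sketch above, the training setup with these settings might look like the following; the optimizer and loss function are my assumptions, and the arrays below are dummy stand-ins for the vectorized quotes and their integer-encoded labels.

```python
# Training sketch: 50 epochs, batch size 100 (optimizer and loss are assumed).
import numpy as np

X_train = np.random.randint(0, VOCAB_SIZE, size=(500, SEQ_LEN))   # placeholder vectorized quotes
y_train = np.random.randint(0, NUM_CLASSES, size=(500,))          # placeholder integer labels

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",   # integer-encoded tags or author IDs
    metrics=["accuracy"],
)
history = model.fit(X_train, y_train, epochs=50, batch_size=100)
```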
The sentiment tags originally contained multiple values, so they had to be manually reduced to a single value each. After finishing the changes on the first 250 quotes, I trained my model on those values, predicted the tags for the rest using the model, inserted the predictions back into the dataset, and then trained the model again on the updated dataset.
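This bootstrapping step could look roughly like the sketch below, which continues the cleaning and model sketches from earlier; the variable names and label-encoding details are assumptions, not the project's exact code.

```python
# Sketch of the bootstrapping loop: train on hand-labelled quotes, predict the rest, retrain.
import numpy as np

labelled = df[df["tag"].notna()]       # the manually tagged quotes
unlabelled = df[df["tag"].isna()]      # quotes still missing a tag

tag_names = sorted(labelled["tag"].unique())
tag_to_id = {t: i for i, t in enumerate(tag_names)}

X_lab = vectorizer(labelled["quote"].tolist())
y_lab = np.array([tag_to_id[t] for t in labelled["tag"]])
model.fit(X_lab, y_lab, epochs=50, batch_size=100)             # first pass on hand labels

X_unlab = vectorizer(unlabelled["quote"].tolist())
pred_ids = model.predict(X_unlab).argmax(axis=1)               # model fills in the missing tags
df.loc[unlabelled.index, "tag"] = [tag_names[i] for i in pred_ids]

X_all = vectorizer(df["quote"].tolist())
y_all = np.array([tag_to_id[t] for t in df["tag"]])
model.fit(X_all, y_all, epochs=50, batch_size=100)             # retrain on the updated dataset
```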
After training, the model did not show any significant underfitting or overfitting, so no regularization techniques were employed.
Results
The accuracy figures were obtained by training the model on a portion of the data (30% of the dataset) and testing it on the rest over a set number of epochs. The accuracy of the author prediction bot is very low because of the large and diverse data, along with the fact that a quote is merely a line or two, so the bot has difficulty finding a consistent pattern for an author. The accuracy of the sentiment analysis bot, on the other hand, is fairly high by comparison, as there are only four classes, which makes the division easier.
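For reference, the evaluation described above could be sketched as follows, continuing the earlier sketches; sklearn's train_test_split is used here purely for illustration and is an assumption rather than the project's actual splitting code.

```python
# Evaluation sketch: 30% of the data for training, the remaining 70% for testing.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_all.numpy(), y_all, train_size=0.3, random_state=42
)
model.fit(X_train, y_train, epochs=50, batch_size=100)
loss, accuracy = model.evaluate(X_test, y_test)
print(f"test accuracy: {accuracy:.1%}")
```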
The accuracy shown by the author prediction bot is not very high: there were only 2447 entries for 841 different authors, which lowered the accuracy by a huge margin to a mere 4.2%. This makes it unfit for practical use at the moment.
The accuracy of the sentiment analysis bot was 50%, which can be considered fairly high given that there were four different sentiments without simple values for the bot to easily differentiate between. It could be used for practical applications but would not act as a reliable source.
Conclusion
Sometimes you come across a quote that hits differently, and you want to know who wrote it and what they were thinking while writing it; my project focuses on predicting the writer of a quote and the sentiment to which the quote relates. I think my model would be very useful in cases where people are traveling to different places and encounter unfamiliar quotes that fuel their curiosity to know more. In addition, I believe my bot is one of the few available that can predict authors from one line of text or less. Even though the accuracy rates of the models are lower than expected (4.2% for author prediction and 50% for sentiment analysis), I believe these models will be very useful once I improve them in the future.
Even though my model has performed reasonably well considering these factors, it has many limitations as well. For example, it is not fit for practical use at the moment because of its dataset size and data quality, which in turn lead to very low accuracy rates, but that is something I plan to fix in the future.
Compared to machine learning projects that analyze the writing style of long texts to predict the author, my project analyzes a single quote to give you both the author and the sentiment behind it. By using sentiment analysis, the bot is enhanced: it gives not only the author but also the sentiment that inspired the author to write the quote.
Prediction of authors has not been done as frequently as sentiment analysis, which is very common in machine learning projects nowadays. In most cases, sentiment analysis only classifies text as positive, neutral, or negative, but my bot gives the specific sentiment behind it (humor, deep meaning, inspirational, love), as I believe quotes are more than just positive or negative. Each quote has a purpose or a message to deliver, and my bot predicts the basics of that message: whether the quote was made to make a person laugh, think, or feel inspired, or simply to express a feeling of love.
I believe that the main idea in creating this model is: “UNDERSTANDING TRUE EMOTIONS BEHIND GREAT WORDS”.
Future Plans
Due to time constraints and my limited understanding of this field, many of my ideas could not be incorporated into this AI model and bot, and I plan to add them when I have the necessary skills to do so.
In the future, I want to improve the accuracy rate of the author prediction bot by adding an abundance of data to compensate for the diversity of the data values. I also plan to add a few more sentiments to the model so it covers a larger variety of sentiments. The ones I would particularly add are wisdom and life, as they expand my classification of sentiment. Additional data would also help raise the accuracy rates, as well as help me study how my model is classifying quotes and identify a set pattern in the classifications.
For now, the dataset cleaning and manual sentiment annotation were done by me alone, and that is something I want to work on, as I would like more input on my choices of sentiments, since some of them can be debated.
I plan to incorporate my two separate bots (one for author prediction and one for sentiment analysis) into a mobile app that can scan quotes using the camera and run them through the bots. This would make my work not only easier to access but also easier for the public to use.
Acknowledgments
I thank Ihita Mandal from Carnegie Mellon University for guidance in the development of this research paper.
Extra Info on Figures, Tables and Equations
My GitHub repository with the code for my bots can be found at: https://github.com/KushagraGoel2024/Quote-analysis-author-and-sentiment
References
- Margaret Osborne (2023) https://www.smithsonianmag.com/smart-news/student-creates-app-to-detect-essays-written-by-ai-180981463/ [↩]
- Wankhade, M., Rao, A.C.S. & Kulkarni, C. A survey on sentiment analysis methods, applications, and challenges. Artif Intell Rev 55, 5731–5780 (2022). https://link.springer.com/article/10.1007/s10462-022-10144- [↩]
- Ramya, B.N., Shetty, S.M., Amaresh, A.M. et al. Smart Simon Bot with Public Sentiment Analysis for Novel Covid-19 Tweets Stratification. SN COMPUT. SCI. 2, 227 (2021). https://link.springer.com/article/10.1007/s42979-021-00625-5 [↩]
- Serdar Yegulalp (2022) https://www.infoworld.com/article/3278008/what-is-tensorflow-the-machine-learning-library-explained.html [↩]
- NLTK Project (2023) https://www.nltk.org/ [↩]
- Baeldung (2024) https://www.baeldung.com/cs/neural-networks-dense-sparse [↩]
- Shraddha Shekhar (2024) https://www.analyticsvidhya.com/blog/2021/06/lstm-for-text-classification/ [↩]
- GeeksforGeeks (2024) https://www.geeksforgeeks.org/introduction-to-recurrent-neural-network/ [↩]