Sentiment Analysis on COVID-19 Twitter Data
Purnendu Karmakar
Electronics and Communications Engineering
The LNM Institute of Information Technology
Jaipur, India
[email protected]
© IEEE 2021. This article is free to access and download, along with rights for full text and data mining, re-use and analysis.
Authorized licensed use limited to: IEEE Xplore. Downloaded on March 26,2025 at 11:49:45 UTC from IEEE Xplore. Restrictions apply.
better understanding of the emotion level of people, this model can be further extended to more complex multi-emotion levels such as joy, panic, happiness, and sorrow.
II. DATA COLLECTION AND PREPROCESSING
This section explores how the data was collected using APIs and what pre-processing steps were followed. For analysis, the tweet data was collected from Nov 2019 to May 2020. All the tweets collected in that window were separated based on the (Indian) states they belong to and the months they were collected from. The second phase of data collection included a collection of tweets separated by each day of each month.
A. Data Collection
Data collection was one of the lengthy and major tasks to perform.
1) Phase - 1: Month-wise Collection
Data was collected using the GetOldTweets3 [4] API (Python) and was stored in multiple files separated state-wise and month-wise. Three keywords (corona, COVID, COVID-19) were searched through Twitter [5] for those months to collect tweets shared among different states. For instance, a file named "Assam 01 tweets.csv" contains the tweets shared in Assam during January.
We created a list of all Indian states and their approximate radius to find the data required within a restricted region at a time. We looped through all the months for all the states while collecting the month-wise data and saved the collected tweets in separate CSV files. This phase of data collection resulted in a collection of approx 140,000 raw tweets over the duration of December 2019 to May 2020.
2) Phase - 2: Day-wise Collection
The second phase of data collection was performed to collect all tweets from a given region (Indian state), but this time separated by individual dates. This phase was separated from the month-wise collection because month-wise analysis of tweets required fast data collection to enable easy and rapid verification of ideas, while day-wise analysis [6] required more data points and computation-intensive graphical calculations on large numbers of data points.
Automated query generation using the GetOldTweets3 [4] API in Python eased the task of collecting approx 250,000 data points and took about a day of scraping time. The scraping process was broken down into threads running on Google Colaboratory and local machines simultaneously. It took around 10 hours to scrape the data.
B. Pre-processing
The pre-processing phase was comparatively easier and more enlightening than the other phases. This phase included both data pre-processing and part of the Exploratory Data Analysis.
1) Phase - 1: Data Cleaning
As we delved into the process of cleaning the data to find useful features, we realized that the raw tweets were incapable of generating unbiased outputs in sentiment prediction. The main hurdles were #tags (hashtags), @mentions, web links (URLs), and stop words present in tweets. We used regular-expression-based substitution to remove the #tags, @mentions, and URLs from the tweets. Stop words were handled by the NLTK library in Python, which TextBlob (a Python library) uses under the hood.
2) Phase - 2: Finding Polarity
This phase was the most essential pre-processing step in the process. With the help of the TextBlob module in Python, we estimated the polarity score of each tweet in the dataset. The cleaned tweets from the previous phase were subjected to multiple evaluation models using TextBlob, and a generalized polarity score was found for each tweet. This score was directly correlated with the groups of words present in the text, i.e., unigrams, bigrams, trigrams, etc.
The TextBlob API returns a value in the range [-1,+1], where +1 implies extreme positive polarity, -1 implies extreme negative polarity, and 0 implies neutral. This score was added as a separate attribute in the dataset.
3) Phase - 3: Finding Sentiments
This phase was an extension of the last phase, categorizing the polarity scores of tweets into 3 classes (namely Positive, Negative, and Neutral). Positive sentiments are those with polarity in the range (0,1], negative sentiments those in the range [-1,0), and neutral sentiments those with a polarity of 0.0. A simple loop through the dataset, applying these filters, concluded this phase. The 3 classes were stored as a separate feature in the dataset called "Sentiments".
4) Phase - 4: Combining the Dataset
In this phase, we combined the various datasets created during the Data Collection and Finding Polarity phases into more manageable and workable datasets. We combined the polarity datasets state-wise, i.e., we created a common dataset with a "Month" column for each state. After that, we also combined all the newly formed state-wise datasets to form a large combined dataset with a new attribute named "State".
The whole process of pre-processing took just over a day of processing and coding time. This process revealed a lot of interesting properties of the data. Surprisingly, positive polarity was more common than neutral and negative polarity, even during the most hyper-active months of Corona. Neutral polarity took the second spot. The reasons for such occurrences are discussed in the Analysis and Results sections.
III. FRAMEWORKS
We collected and analyzed the data on a system with an 8th-generation i5 processor and 8 GB of RAM running the Windows operating system. The TextBlob API, which uses Natural Language Processing (NLP), was used to get the polarity of each tweet [3]. TextBlob uses multiple NLP techniques to get the sentiments [7]. The techniques used in this process are described in the following subsections.
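The cleaning, polarity, and labelling phases of Section II-B can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the regular expressions, the function names (`clean_tweet`, `label_sentiment`, `tweet_polarity`), and the sample tweet are hypothetical. Only the class boundaries (0,1], [-1,0), and 0.0 and the use of TextBlob's polarity score come from the text. TextBlob is a third-party package, so it is imported lazily and the other helpers run with the standard library alone.

```python
import re

# Hypothetical patterns for the three kinds of noise named in the text.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def clean_tweet(text):
    """Remove URLs, @mentions and #tags, then collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def label_sentiment(polarity):
    """Map a polarity in [-1, +1] to Positive, Negative or Neutral."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"


def tweet_polarity(text):
    """Polarity of a cleaned tweet via TextBlob (needs the textblob package)."""
    from textblob import TextBlob  # third-party, imported only when called
    return TextBlob(clean_tweet(text)).sentiment.polarity


# Made-up tweet, used only to exercise the cleaning step.
raw = "Stay safe everyone! #COVID19 @WHO https://t.co/abc123"
print(clean_tweet(raw))
print(label_sentiment(0.35), label_sentiment(-0.2), label_sentiment(0.0))
```

Keeping the labelling rule separate from the scoring call makes it easy to check the thresholds without installing TextBlob.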
A. Parts Of Speech Tagging (POS)
POS tagging assigns a tag to each word in the sentence; these tags are then used in Lemmatization [8].
For example: "Corona is making economy of World down".
For this sentence, it produces a list of tuples in Python:
[('Corona', 'NNP'), ('is', 'VBZ'), ('making', 'VBG'), ('economy', 'NN'), ('of', 'IN'), ('World', 'NNP'), ('down', 'IN')]
Here NNP denotes a proper noun, VBZ a verb (third-person singular present), VBG a verb (gerund or present participle), NN a common noun, and IN a preposition. This tagging is used in the Lemmatization of words.
B. Lemmatization
Lemmatization reduces each word to its base form [9]. For instance, make and made carry the same meaning, but if Lemmatization is not used they are treated as separate words, which increases the number of features and creates redundancy in the analysis. After Lemmatization, made and make are both converted to make, and hence the count of make in the sentence becomes 2.
C. Stemming
Stemming reduces inflected words to their root form. For instance, making and make would be counted as 2 different words in a sentence, affecting the accuracy of the analysis. After Stemming, the suffix ing is removed from the word making and it is converted to make [9]. Suffixes like es, ies, ing, and many more are removed from words, which lowers the dimension of the final vector.
D. Stopwords Removal
Stopwords are words like is, have, has, etc., which diminish the significance of the analysis [8]. Hence, these were also removed. This reduces the redundancy of words and improves the analysis, as the meaning of the sentence is not changed. The sentence "Corona is making economy of World down" is converted to "Corona make economy world down". This sentence is then passed as a parameter to the already trained model of TextBlob, which predicts its polarity based on a unigram Bag of Words model and returns a floating-point number.
B. Getting Polarities
The algorithm used to generate the Polarities and Sentiments is shown in Algorithm 2.
Algorithm 2: Polarity and Sentiments Generation Algorithm
Input: Each statename month csv Dataset
Output: Each statename month csv Dataset with Polarities and Sentiments of each Tweet
Required Libraries Imported
A list of all statename month csv files is created
while List not Empty do
    try:
        Get Polarity of Each Tweet in each file
        if Polarity is greater than 0 then
            The sentiment of that tweet is saved Positive
        else if Polarity is less than 0 then
            The sentiment of that tweet is saved Negative
        else
            The sentiment of that tweet is saved Neutral
        end
        Save the Polarity and Sentiment in statename month Polarity csv file
    catch Exception:
        Print(This state is empty)
end
V. ANALYSIS
In this paper, TextBlob is used to find the polarity of the scraped tweets and the Natural Language Toolkit (NLTK) is used for word frequency [10]. First, a state-wise analysis is done, and then the frequencies of Positive, Negative, and Neutral tweets are calculated. Each state is analyzed month-wise separately. Certain insights about the data were discovered using visualizations [11].
As per the frequency plot in Fig. 1 it is observed that there
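Algorithm 2 can be sketched in Python roughly as below. This is an illustration under stated assumptions rather than the authors' implementation: the column name `text`, the output naming `<name>_Polarity.csv`, the helper names, and the stand-in scorer `toy_polarity` are all hypothetical, and the demo file is created in a temporary directory so that the sketch runs without any real tweet data or third-party packages.

```python
import csv
import tempfile
from pathlib import Path


def classify(polarity):
    """Positive for (0, 1], Negative for [-1, 0), Neutral for 0.0."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"


def add_polarity_and_sentiment(csv_paths, polarity_fn, text_column="text"):
    """Write a '<name>_Polarity.csv' copy of each input with two new columns.

    `polarity_fn`, `text_column` and the output naming are assumptions made
    for this sketch; they are not taken from the paper.
    """
    pending = [Path(p) for p in csv_paths]
    written = []
    while pending:  # "List not Empty"
        path = pending.pop(0)
        try:
            with path.open(newline="", encoding="utf-8") as handle:
                rows = list(csv.DictReader(handle))
            if not rows:
                raise ValueError("no tweets in file")
            for row in rows:
                row["Polarity"] = polarity_fn(row[text_column])
                row["Sentiments"] = classify(row["Polarity"])
            out_path = path.with_name(path.stem + "_Polarity.csv")
            with out_path.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
                writer.writeheader()
                writer.writerows(rows)
            written.append(out_path)
        except (OSError, ValueError):
            # Missing or empty file: report it and move on to the next one.
            print(f"This state is empty: {path.name}")
    return written


def toy_polarity(text):
    """Stand-in scorer so the demo runs without TextBlob installed."""
    return 0.5 if "hope" in text else (-0.5 if "fear" in text else 0.0)


workdir = Path(tempfile.mkdtemp())
sample = workdir / "Assam_01_tweets.csv"  # hypothetical file name
sample.write_text("text\nwe have hope\nso much fear\nlockdown today\n", encoding="utf-8")
outputs = add_polarity_and_sentiment([sample, workdir / "missing.csv"], toy_polarity)
with outputs[0].open(newline="", encoding="utf-8") as handle:
    labelled = list(csv.DictReader(handle))
print([row["Sentiments"] for row in labelled])
```

In practice `polarity_fn` could be something like `lambda text: TextBlob(text).sentiment.polarity`, assuming the `textblob` package is installed. Passing the scorer in as a parameter keeps the file loop independent of the sentiment library.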
Fig. 4. Overall sentiments on Lockdown and nearby dates.
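Date-wise counts of the kind plotted in Fig. 4 can be derived from the labelled tweets with a simple tally. The sketch below is only an illustration: the function name `sentiment_counts_by_date` and the sample records are hypothetical, and the dates and labels are made up rather than taken from the dataset.

```python
from collections import Counter, defaultdict


def sentiment_counts_by_date(records):
    """Tally sentiment labels per date from (date, sentiment) pairs."""
    counts = defaultdict(Counter)
    for date, sentiment in records:
        counts[date][sentiment] += 1
    return dict(counts)


# Made-up labelled tweets, for illustration only.
sample_records = [
    ("2020-03-01", "Positive"),
    ("2020-03-01", "Negative"),
    ("2020-03-01", "Positive"),
    ("2020-03-02", "Neutral"),
]

daily = sentiment_counts_by_date(sample_records)
for date in sorted(daily):
    print(date, dict(daily[date]))
```

Each per-date `Counter` returns 0 for a class with no tweets on that date, so the three series can be plotted against the same date axis without extra checks.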
December. They picked up some concern about Corona during January-February, but remained mostly unconcerned.
In March, people started to realize the impact of the viral disease, and the increase in the no. of tweets reflects that. People remained mostly positive about the situation, but the no. of negative tweets confirms that the situation had started to take a serious turn in the state and in people's psychology.
In April, the effect magnified itself to a great extent, but more people kept hope, maybe as a result of the constant efforts of the government to keep the population away from anxiety.
May data tends to show that people were now less concerned about the pandemic and more hopeful about the future.
Fig. 6 represents the line plot of overall sentiment with
Fig. 7. Word Frequency plot in Tweets.
frequently used with a frequency count of more than 50,000 in tweets from November 2019 to May 2020. The words country, government, day, and don were used less than 6000 times. The top 10 words which were used frequently are:
corona, COVID, 19, covid19, virus, India, people, amp, lockdown, fight
This shows that people's sentiments varied from moderate to highly negative emotions. Overall, India spent the months in Lockdown mostly in fear and partly in anguish.
The lower end of the graph contains words like
like, country, day, today, government, spread, safe
The right part of the above graph tells a different story. It shows that people's emotions in that region varied from health worries and outcomes of the pandemic to political scenarios.
VI. RESULTS
The analysis phase of the process gave us great insights into the emotional state of people in India, and also into how sentiments varied from state to state on a daily basis. A similar
study was carried out in [12] for Nepal. In that study, the initial steps for data collection and the tools used are the same. However, the number of tweets collected and the range of dates are smaller. Further, the authors in [12] focused more on word cloud variation and on sentiment variation presented in the form of a pie chart. The authors in [13] also took the same approach and used a map and bar charts to visualize the sentiment. In our study we have taken a different approach and presented the sentiment variation over time in the form of a graph. We have also presented the tweet frequency of different states. Then, as a case study, Maharashtra has been taken. This approach of analysis allowed us to realize that emotional versatility in the Indian population can be very much affected by the state government and the measures it takes to control a pandemic like corona, by the amount of information available to people of a certain region, and by how actively people are willing to accept the negative side of a situation. In this work, our aim was to show that physical happenings in the real world are also reflected in the virtual social network. The peaks in positive tweets coincide with the lockdown announcements, as can be seen in Fig. 4. Further, we can see that corona and COVID-19 dominate the Twitter space in the observed duration. Our results will help Govt. or policy-making agencies to decide upon the framework in the event of a future pandemic or similar events.
A. Polarity Analysis Results
From the analysis of Fig. 1, it is clear that even in such harsh situations of strict Lockdowns and increasing corona cases, people were mostly Positive. Negative tweets are mainly caused by trigger events that are more political in nature.
B. Most Affected States
As seen in Fig. 2, it is an interesting observation that some states were more affected by the pandemic than others. The tweet frequency in a state directly correlates with how many cases of corona were recorded in that state. People's emotions varied the same way. The no. of tweets from a state also reflects the vulnerability of the people of that state to the pandemic; hence they were more active and more informed about the situation. Maharashtra, Rajasthan, and Madhya Pradesh were greatly affected but remained more hopeful and Positive compared to states like Jammu and Kashmir and Uttar Pradesh.
C. Most Active Months
Fig. 3 shows that India was most active during the months of March 2020 to May 2020. This is directly correlated with the fact that India was introduced to the virus in March and the no. of active cases increased vastly in March. In India, April was the most active month because a lot of trigger events like Lockdown announcements occurred then, and the no. of deaths was also at its peak during that time.
of Maharashtra in the month of March, but it speaks for all states of India. So does Fig. 5: in most of the states of India, Positivity remained intact compared to Negativity.
E. Variation of Sentiments in Each State
Fig. 7 is the result of categorizing the sentiments predicted from the tweets into the classes Positive, Neutral, and Negative and plotting them on a date-vs-frequency curve. Fig. 7 shows not only the active dates from Nov 2019 to May 2020 but also how much people were bothered by the situation on any given date. The highest peak in March is when the government first announced the Lockdown. The word frequency curve can also be analyzed to predict the sentimental state of the people. The top 30 words show that people were indeed fearful and anxious about the effects of corona, but remained hopeful and trusting of governmental measures.
VII. CONCLUSIONS
From the analysis in this paper, it is observed that people in India were mostly expressing their thoughts with positive sentiments. This paper concludes that there was a sudden hike in tweets on every date when there was an announcement of Lockdown. States like Madhya Pradesh, Maharashtra, Jammu and Kashmir, and Rajasthan had a higher no. of tweets compared to other states because they had a higher no. of positive cases of coronavirus. While people in India were posting tweets with negative sentiments, the Twitter audience of India seems to reward positive sentiment much more than negative sentiment, reversing the overall polarity to positive. It seems that although we have been faced with fear and anticipation about the Coronavirus and the future, the trust in the Government of India to address the Corona crisis supersedes all such fears and anticipation. This analysis was unique in that the Lockdown dates show peaks, and those peaks were positive. Most people were criticizing the Lockdown in person, but Twitter was full of positive tweets.
This analysis can be taken further towards emotion analysis. Rather than classifying tweets as Positive, Negative, and Neutral, we can analyze them based on emotions. Text can also represent the emotions of the person writing it. Tweets can carry emotions like Hate, Respect, Agreement, Anger, and Happiness. Each tweet can have multiple emotions, and we can pick the emotion with the maximum score. Only a few states like Maharashtra are more affected by Corona, and states like Jharkhand have very few cases in comparison to others. In these states, people can have multiple emotions, and several interesting insights can be generated. In the future, this approach holds the potential to discover the mindset of people. In this work, we have only used tweets in the English language. Tweets in other Indian languages, if collected, would give a better representation of people's sentiments.
covid-19,last accessed on 20-10-2020
[2] Manguri, Kamaran H, Ramadhan Rebaz N Amin, and Pshko R Mohammed, "Twitter Sentiment Analysis on Worldwide COVID-19 Outbreaks," Kurdistan Journal of Applied Research, 54–65, 2020.
[3] C. Kaur and A. Sharma, "Twitter Sentiment Analysis on Coronavirus using Textblob," EasyChair 2516-2314, 2020.
[4] Release, GetOldTweets3 API in Python Release v3 0.0.11, November 2019.
[5] E. Kouloumpis, T. Wilson, and J. Moore, "Twitter sentiment analysis: The good the bad and the omg!," Fifth International AAAI Conference on Weblogs and Social Media, 2011.
[6] A. D. Dubey, "Twitter Sentiment Analysis during COVID19 Outbreak," 2020.
[7] P. Tyagi and R. Tripathi, "A Review towards the Sentiment Analysis Techniques for the Analysis of Twitter Data," 2019.
[8] V. Sahayak, V. Shete, and A. Pathan, "Sentiment Analysis on Twitter Data," International Journal of Innovative Research in Advanced Engineering (IJIRAE), ISSN: 2349-2163, Issue 1, Volume 2, January 2015.
[9] N. Godbole, M. Srinivasaiah, and S. Skiena, "Large-Scale Sentiment Analysis for News and Blogs," Google Inc., New York NY, USA and Dept. of Computer Science, Stony Brook University, Stony Brook, NY 11794-4400, USA.
[10] N. K. Rajput, B. A. Grover, and V. K. Rathi, "Word frequency and sentiment analysis of twitter messages during Coronavirus pandemic," 2020.
[11] M. Ra, B. Ab, and S. Kc, "COVID-19 Outbreak: Tweet based Analysis and Visualization towards the Influence of Coronavirus in the World," 2020.
[12] B. P. Pokharel, "Twitter Sentiment Analysis During Covid-19 Outbreak in Nepal," Available at SSRN 3624719, 2020.
[13] V. S. P. R. Guntaka, A. K. Gupta, and S. Somisetty, "Twitter sentiment analysis and visualization," in Proceedings: 16th Annual Symposium on Graduate Research and Scholarly Projects, Wichita, KS: Wichita State University, p. 31, 2020.