Sentiment Analysis

This document summarizes and compares different techniques for sentiment analysis of Telugu data. It discusses how sentiment analysis is important for understanding customer opinions and improving products. It also notes that while much research has been done on English sentiment analysis, less work has been done on Telugu. The document proposes using an ensemble technique to combine text preprocessing methods like tf-idf with classifiers like decision trees and neural networks to perform binary sentiment classification of Telugu text. Related work applying techniques like word embeddings, NLP preprocessing, and ensemble methods to other languages is also discussed.

Sentiment Analysis of Telugu data and comparing advanced ensemble techniques

Srikiran Boddupalli∗ , Anitha Sai Saranya† , Usha Mundra‡ , Pratyusha Dasam§ , Padmamala Sriram¶
Department of Computer Science and Engineering,
Amrita School of Engineering,
Amrita Vishwa Vidyapeetham, Amritapuri, India
Email: ∗ [email protected], † [email protected], ‡ [email protected],
§ [email protected], ¶ [email protected]

Abstract—Predicting or classifying the sentiment of a sentence or review is very important for companies that launch or upgrade products, because the younger generation has started reviewing what they buy, which helps both the company and the next buyer of the product. Sentiment analysis plays a crucial role in this process. The two classes under binary classification are positive and negative: the positive class indicates a favorable opinion and the negative class an unfavorable one. A valid method for predicting sentiment could enable us to extract opinions from the internet and predict customers' tastes. In this paper, we compare advanced ensemble techniques and briefly discuss where our algorithm can be improved.

Keywords—Sentiment Analysis, Binary Classification, Advanced ensemble techniques, tf–idf.

I. INTRODUCTION

In recent years sentiment analysis has become a very active topic all around the world. Each word that we utter or write carries emotion or sentiment that reflects the view of the user or customer. This generation is heavily involved in online shopping, social media, product launches, etc.

Research shows that online shoppers surpassed the 120 million mark in 2018, and trends show that last year there was a 116 percent increase in online shopping, which is expected to continue. Due to this trend, the market is also drifting towards the web, and far fewer people go to outlets to buy the things they need. Online shopping has also drawn in social media, which helps companies launch their products online, and users or customers are free to comment on or review a particular product. This process helps both customers and companies.

Depending on the reviews, customers can decide whether or not to buy the product, whereas a company can analyze the reviews and comments on the product and come up with a better model, keeping the previous reviews in mind.

A company can go through each product, read the reviews or comments, and note down the positivity or negativity of each one. As this is a tedious task, we can try to automate it by storing all the related data and using our model: given an input review, it tells whether the statement is negative or positive. This procedure helps the company grow in online marketing. Every part of this process has some limitations, and we have tried to overcome them by merging some techniques available online; machines must be made reliable and efficient to interpret and understand human emotions and feelings.

Most people tend to express views or emotions more in their regional language than in other languages. In India, Telugu is the third most popular language and is spoken by around 7 percent of the Indian population, which is 74 million people. This is a huge number, yet not much research has been done in this area.

Binary classifiers are used in this project, with positive and negative as the classes. When we give an input to a trained model, it classifies the statement into one of these classes. We use tf–idf (term frequency-inverse document frequency) for text pre-processing, compare four types of advanced ensemble techniques, and propose an improved model.

This paper is organized as follows: Section II states the problem, Section III reviews related work, Section IV describes the proposed system and reports the results, and Section V concludes and outlines future work.

II. PROBLEM STATEMENT

As we are working on Telugu, only a very small amount of resources or data is available compared to English, and it is a tedious process to collect data from different Telugu sources and annotate it manually to train our model. Considering only product reviews would not be enough for training, so we took data from news channels, which gives us a good base for training our model and a higher chance of good classification or prediction when testing it.

Companies launching a product online must know whether their product is successful in the market, and they learn this through comments, reviews, etc. Storing and reading each and every statement is hard, so we automate this step: we store all related data and apply our model, which outputs whether a statement is positive or negative towards the particular issue or product. This process helps companies upgrade their existing products or invent new ones by taking new inputs from users or customers. We also compare ensemble techniques with base algorithms and specify where our ensemble algorithm can be improved.

III. RELATED WORK

• Oscar Araque (2017) proposed Enhancing deep learning sentiment analysis with ensemble techniques. Deep learning approaches provide automatic feature extraction, richer representation capabilities, and better performance than traditional feature-based techniques. The paper proposes several models in which classic hand-crafted features are combined with automatically extracted embedding features.[1]

• Bruno Trstenjak (2013) discusses the possibility of using a KNN model with the tf–idf method in a framework for text classification. They analyzed the output of the model and showed that unused words in a document affect the quality of classification. To overcome this, pre-processing should be done, and the combination of tf–idf and KNN with minor modifications gives a superior model.[2]
  pre-processing: tf–idf; algorithm used: KNN

• Li-Ping Jing (2002) described tf–idf in their paper. They calculated precision from the output results. The number of unused terms affects precision and accuracy, so pre-processing should be done carefully to obtain better accuracy.[3]
  pre-processing: tf–idf

• Monisha Kanakaraj (2015) analyzed the mood of people about a particular news item from Twitter posts. To increase accuracy they included NLP and word sense disambiguation. The mined text was then put through ensemble classifiers. The ensemble was shown to outperform the base models.[4]
  pre-processing: NLP and word sense disambiguation.
  algorithm used: SVM, Naive Bayes, Max Entropy, Baseline.

• Sandeep Sricharan Mukku took data from news channels, assigned a sentiment to each sentence, and translated the sentences into English. NLP was used for basic pre-processing and a doc2vec model for text pre-processing. The doc2vec output was then fed into machine learning algorithms and the accuracies of the algorithms were compared.[5]

• Shalini K worked on sentiment analysis of Indian languages using CNN. They used code-mixed data, in which a sentence contains more than one language. This phenomenon is very common in movie dialogues and song lyrics. They used a CNN to classify each sentence, as CNNs are more effective when used with NLP. They also worked on Telugu data separately, performing basic pre-processing using word2vec and then passing the output to a machine learning algorithm. The accuracy can be improved with word embeddings or SentiWordNet.[6]

• Vidisha M. Pradhan surveyed sentiment analysis algorithms for opinion mining. First, basic pre-processing is done on the data, which is then passed to an opinion mining technique to find the polarity of each sentence. They applied various algorithms to the pre-processed data: under supervised learning they used a decision tree classifier; under linear classifiers, neural networks and SVM; then a rule-based classifier; and finally probabilistic classifiers such as maximum entropy, naive Bayes and Bayesian networks. Accuracies were calculated and compared.[7]

• Radhika Mamidi presented Annotated Corpus for Telugu Sentiment Analysis. Very little work has been done on sentiment analysis in Telugu compared to English. They took data from news channels, annotated it, and created a data-set suitable for classification or prediction.[8]

• Mohammad Karim Sohrabi presented An efficient preprocessing procedure for supervised sentiment analysis by converting sentences to numerical vectors. word2vec is used to transform text into numerical values. RapidMiner and Python were used to implement different opinion mining procedures, and the results of these runs were compared and evaluated. word2vec is compared to tf–idf. Neural networks and SVM gave much higher accuracy than the other methods.[9]
  tools used: word2vec, NLP, tf–idf.

• Ahmed Oussous worked on Impact of Text pre-processing and Ensemble Learning on Arabic Sentiment Analysis. The goal of the paper was to measure the impact on the output when the methods are applied to Arabic data. Naive Bayes, SVM and maximum entropy were used with k-fold cross-validation. Much work has been done in English but less in Arabic, which is also very difficult due to its morphology and the absence of free lexical resources. They focused on pre-processing, which they identify as the main part of their project. An n-gram model is used for feature extraction, and accuracy, precision, recall and F1 score are compared for the implemented algorithms.[10]
  tools used: NLP.

• Gangula Rama Rohit Reddy presented Impact of Translation on Sentiment Analysis. The paper uses product and book reviews in English. Sentiment may change when a text is translated, so some data is translated manually and then compared with the original sentiment. It was evident that automatic translation is sometimes misleading, i.e., it yields a different sentiment than the original. Manual translation is preferred, as automatic translation can lose the original sentiment.
  tools used: doc2vec, Google translator, NLP.

• Reddy Naidu worked on Sentiment Analysis Using Telugu SentiWordNet. Due to the limited availability of annotated data, very little work has been done in Indian languages. In this project the data first undergoes subjectivity classification: a statement that contains no sentiment is classified as objective, otherwise as subjective. With SentiWordNet, each sentence is then given a sentiment. The doc2vec tool is used for text pre-processing. SentiWordNet can be represented as WordNet + sentiment information. The proposed system achieved an accuracy of 74% for subjectivity classification and 81% for sentiment classification.[11]

IV. PROPOSED SYSTEM

A. Data-set

Annotated data-sets for the Telugu language are rarely available online. Collecting data is hard, and manually assigning sentiment is very hard for large data-sets. As other researchers were working on the same topic, we requested the data-set from Sandeep Sricharan Mukku [5], who collected news from news channels and manually assigned sentiment to create a working data-set. The data-set contains 1500 positive and 1500 negative sentences.

B. Basic pre-processing steps

• Stop words removal
  Words that have very little impact on the outcome are termed stop words. They do not affect the output, so they can be removed to make the remaining data-set easier to work with.

• Special characters removal
  Social media text contains a lot of noise, such as special characters, that is not required, so we remove it and retain only emoticons, which do affect sentiment.

• Tokenization
  In this step we divide a larger string into smaller tokens or single words.

C. TF-IDF

Before classification, it is important to transform the text data into numerical vectors. Each document is transformed into a vector using a TF-IDF vectorizer. Each weight in a vector is a combination of two values:

1. Term Frequency (TF)
2. Inverse Document Frequency (IDF)

Term Frequency is the number of occurrences of the term in a document.

Inverse Document Frequency is the logarithm of the ratio of the number of documents in the corpus to the number of documents in which the term occurs at least once.

IDF formula:

IDF(w) = log(D / DF(w))

D: total number of documents
DF(w): number of documents containing word w

Now,

TF-IDF(w) = TF(w) * IDF(w)

D. Ensemble Learning

Integrating the predictions of different independent models to improve performance is termed ensemble learning. Some of the advanced ensemble learning techniques are:

1) Stacking: The training data is divided into n parts; a model is trained on n-1 parts and tested on the nth part. This is repeated n times, and the trained model is then used to predict the class labels of the test set. Logistic Regression and Naive Bayes models are trained this way and their predictions are recorded. A Support Vector Classification model is then trained using the predictions of these models as predictors.

2) Bagging: A powerful method applied to models with high variance. High variance arises when a model learns exceptions or peculiarities of the training data (i.e., it becomes too specific to the training data). To overcome this, subsets of the original data are drawn with replacement and models are trained on these subsets, which differ from each other. The Random Forest classifier follows the bagging approach: it selects a set of predictors from which the best split is decided at each node of the decision tree. Decision trees are thus constructed on subsets of the predictors and of the data, which handles high variance.

3) Boosting: An ensemble method that builds a strong classifier from a set of weak classifiers. Models are built sequentially, with each subsequent model trying to correct the errors of the previous one. Models are added until the training data is classified perfectly or the maximum number of models is reached. Once all weak models have been added, we are left with a set of weak models, each with a cut value. Classifications are made by computing the weighted average of the weak classifiers: the classification of the ensemble is taken as the aggregate of the weighted predictions. This procedure is the Ada-Boost ensemble method, the first successful boosting method developed for binary classification.

E. Results

Model          Accuracy (%)  Precision (%)  Recall (%)  F1 Score (%)
Stacking       64.20         63.88          68.30       63.88
Random Forest  64.66         65.32          65.18       65.25
Ada-Boost      61.48         62.65          60.28       61.43

V. CONCLUSION AND FUTURE WORK

Sentiment analysis is very important for organizations that want to know whether their product, model or movie is a success or a failure. Sentiment analysis of Indian languages has acquired high importance in almost all fields, such as the entertainment industry, forensics, business, etc. We therefore worked on an Indian language, Telugu, as it is one of the most widely spoken languages in India and very little has been contributed in this field. Few annotated data-sets are available in this language, and it is very hard to create a new annotated data-set, so we obtained data from Sandeep Sricharan Mukku [5] and then applied basic pre-processing techniques. The main basic pre-processing steps are stop word removal, tokenization, and removal of special characters, which are considered noise and do not affect the output. For text pre-processing we used the basic tf–idf technique to convert the data into numerical vectors. In this paper we compare advanced ensemble models; we implemented four types and compared them, with the output shown above. To improve our output, we can concentrate more on basic pre-processing: there are many stop words in Telugu, but we used only some of them, which may affect the output slightly. We can also tune our data-set by adding more relevant data and increasing its size so that the model is better trained. The parameters of the algorithms affect the output of the model, so we have to find the optimal values for our model and train it with those values to get better output.
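
As a concrete illustration of the TF-IDF weighting defined in Section IV-C, the following Python sketch computes TF(w) * IDF(w) with IDF(w) = log(D / DF(w)) for a toy corpus. It is a minimal sketch under stated assumptions, not the implementation used for the reported results: the function names `tokenize` and `tf_idf`, the whitespace tokenizer, and the English placeholder sentences are all introduced here for illustration. Telugu sentences and a Telugu stop-word list would be passed in the same way.

```python
import math
from collections import Counter


def tokenize(text, stop_words=frozenset()):
    """Split on whitespace and drop stop words (assumed, simplified pre-processing)."""
    return [tok for tok in text.split() if tok not in stop_words]


def tf_idf(documents, stop_words=frozenset()):
    """Return (vocabulary, vectors) where each weight is TF(w) * log(D / DF(w))."""
    tokenized = [tokenize(doc, stop_words) for doc in documents]
    n_docs = len(tokenized)  # D: total number of documents

    # DF(w): number of documents containing w at least once
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vocabulary = sorted(doc_freq)
    idf = {w: math.log(n_docs / doc_freq[w]) for w in vocabulary}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # TF(w): raw count of w in this document
        vectors.append([tf[w] * idf[w] for w in vocabulary])
    return vocabulary, vectors


# Toy corpus (placeholder sentences, not taken from the paper's data-set)
docs = ["good phone good battery", "bad phone", "good camera"]
vocabulary, vectors = tf_idf(docs)
```

In this toy corpus "good" occurs in two of the three documents, so its IDF is log(3/2), while a word that occurs in every document would receive weight 0. Library vectorizers often apply a smoothed IDF and normalize the vectors, so their values can differ from this plain formula.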

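The boosting procedure described in Section IV-D can also be sketched in a few lines. The code below is an illustrative AdaBoost with one-feature threshold rules (decision stumps) as the weak classifiers, assuming labels +1 (positive) and -1 (negative). The function names, the choice of stumps, and the one-dimensional toy data are assumptions made for this sketch; in the paper's setting each input would be a TF-IDF vector instead.

```python
import math


def stump_predict(stump, x):
    """A stump is (feature index, threshold, polarity); it votes +1 or -1."""
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity


def best_stump(X, y, weights):
    """Exhaustively pick the stump with the lowest weighted training error."""
    best, best_err = None, float("inf")
    for feature in range(len(X[0])):
        for threshold in sorted({x[feature] for x in X}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(w for x, t, w in zip(X, y, weights)
                          if stump_predict(stump, x) != t)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err


def adaboost_fit(X, y, n_rounds=10):
    """Add weak models one by one, re-weighting misclassified samples each round."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []  # list of (model weight alpha, stump)
    for _ in range(n_rounds):
        stump, err = best_stump(X, y, weights)
        if err >= 0.5:  # no better than chance: stop adding models
            break
        err = max(err, 1e-10)  # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        if err <= 1e-10:  # training data classified perfectly
            break
        # Increase the weight of misclassified samples, decrease the rest
        weights = [w * math.exp(-alpha * t * stump_predict(stump, x))
                   for x, t, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, x):
    """Classify by the sign of the weighted sum of the weak classifiers' votes."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy data: no single threshold separates the classes
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost_fit(X, y, n_rounds=3)
predictions = [adaboost_predict(model, x) for x in X]
```

On this toy data a single stump misclassifies two of the six points, while the weighted vote of three stumps classifies all of them correctly, which is the behavior the boosting description relies on.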
REFERENCES
[1] O. Araque, I. Corcuera-Platas, J. F. Sánchez-Rada, and C. A. Iglesias, “Enhancing deep learning sentiment analysis with ensemble techniques in social applications,” Expert Systems with Applications, vol. 77, pp. 236–246, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0957417417300751
[2] B. Trstenjak, S. Mikac, and D. Donko, “KNN with TF-IDF based framework for text categorization,” Procedia Engineering, vol. 69, pp. 1356–1364, 2014, 24th DAAAM International Symposium on Intelligent Manufacturing and Automation, 2013. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1877705814003750
[3] Li-Ping Jing, Hou-Kuan Huang, and Hong-Bo Shi, “Improved feature selection approach TFIDF in text mining,” in Proceedings. International Conference on Machine Learning and Cybernetics, vol. 2, Nov 2002, pp. 944–946.
[4] M. Kanakaraj and R. M. R. Guddeti, “Performance analysis of ensemble methods on Twitter sentiment analysis using NLP techniques,” in Proceedings of the 2015 IEEE 9th International Conference on Semantic Computing (IEEE ICSC 2015), Feb 2015, pp. 169–170.
[5] S. S. Mukku and R. Mamidi, “ACTSA: Annotated corpus for Telugu sentiment analysis,” in Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems. Copenhagen, Denmark: Association for Computational Linguistics, Sep. 2017, pp. 54–58. [Online]. Available: https://www.aclweb.org/anthology/W17-5408
[6] K. Shalini, A. Ravikumar, R. C. Vineetha, R. D. Aravinda, K. M. Anand, and K. S. Soman, “Sentiment analysis of Indian languages using convolutional neural networks,” 2018 International Conference on Computer Communication and Informatics (ICCCI), pp. 1–4, 2018.
[7] V. M, J. Vala, and P. Balani, “A survey on sentiment analysis algorithms for opinion mining,” International Journal of Computer Applications, vol. 133, pp. 7–11, Jan 2016.
[8] S. Mukku and R. Mamidi, “ACTSA: Annotated corpus for Telugu sentiment analysis,” Sep 2017.
[9] M. K. Sohrabi and F. Hemmatian, “An efficient preprocessing method for supervised sentiment analysis by converting sentences to numerical vectors: a Twitter case study,” Multimedia Tools and Applications, May 2019. [Online]. Available: https://doi.org/10.1007/s11042-019-7586-4
[10] G. Al-Sukkar, I. Aljarah, and H. Alsawalqah, “Enhancing the Arabic sentiment analysis using different preprocessing operators,” Apr 2017.
[11] R. Naidu, S. K. Bharti, K. S. Babu, and R. K. Mohapatra, “Sentiment analysis using Telugu SentiWordNet,” in 2017 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), March 2017, pp. 666–670.

[12] J. Isaac and S. Harikumar, “Logistic regression within DBMS,” in Proceedings of the 2016 2nd International Conference on Contemporary Computing and Informatics, IC3I 2016, 2016, pp. 661–666.
