Paper 11-Normalization of Unstructured and Informal Text

1) The document discusses normalization of unstructured and informal text for sentiment analysis from social media sources like Twitter. 2) It proposes a mechanism for normalization consisting of four phases: noise reduction, part-of-speech tagging, stop word removal, and stemming and lemmatization. 3) Experiments on Twitter data using the normalization mechanism improved sentiment classification accuracy from 75.42% to 82.357%, demonstrating the benefits of normalization for sentiment analysis.

(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 9, No. 10, 2018
www.ijacsa.thesai.org

Normalization of Unstructured and Informal Text in Sentiment Analysis

Muhammad Javed1, Shahid Kamal2
Institute of Computing and Information Technology, Gomal University,
Dera Ismail Khan, K.P.K, Pakistan.

Abstract—Sentiment Analysis is a problem of natural language processing which deals with the extraction and analysis of public sentiments shared about target entities over microblogging websites. This field has gained great attention due to the huge availability of textual content that supports decision making. Sentiment Analysis has numerous application areas, such as market analysis, service analysis, showbiz analysis, movies and sports; even the popularity and acceptance rate of political policies can be predicted via sentiment analysis systems. Although a tremendous volume of opinionative text is available, it is unstructured and noisy, due to which sentiment classifiers cannot achieve good outcomes. Normalization is the process used to clean noise from unstructured text for sentiment analysis. In this study we propose a mechanism for the normalization of informal and unstructured text. The proposed mechanism comprises four essential phases: Noise Reduction, Part of Speech Tagging, Stop Word Removal, and Stemming and Lemmatization. Numerous experiments are performed on a Twitter data set with unsupervised lexicons and dictionaries. Python and the Natural Language Toolkit are used for performing all four essential steps. This study demonstrates that utilization and normalization of informal tokens in tweets improved the overall classification accuracy from 75.42% to 82.357%.

Keywords—Informal; normalization; opinion mining; roman; sentiment analysis; text preprocessing

I. INTRODUCTION

Text Mining is a computer-assisted process introduced to help business organizations by providing effective decision-making answers and future trends. Text Mining is the method of mining high-quality information from text using patterns with additional knowledge of linguistic rules. Text Mining fulfils the needs of government, research and business, e.g. E-discovery, Scientific Discovery and National Security [1]. The extraction and recognition of patterns is performed through the intersection of Machine Learning, Artificial Intelligence and Database systems [2]. Social sites such as blogs and microblogs are widely used for social communication. Blogs are utilized for publishing articles, news, or any other topic of interest. Many organizations, such as Exact, Themovieblog, ESOMAR (European Society for Opinion and Market Research) and AAPOR (American Association for Public Opinion Research), as well as individuals, have their own blogs for communication. Blogging sites provide immediate feedback from reviewers about their products, articles and publications. On the other hand, sites that allow short text for chatting, communication and exchanging views about interests are considered microblogs. These sites allow the posting of short text and messages. Microblogs are designed for expressing real-world actions in an instant environment. The common characteristics of microblogging sites are: (i) short text, (ii) instant messaging, (iii) pictorial symbols, (iv) slang terms and (v) real time [3]. The well-known microblogging sites are Tumblr, Plurk, Friendfeed and Twitter. Twitter is the most popular microblogging site; it allows its users to publish short messages (tweets) for communication.

The emergence of social media sites has changed the public communication style, so research directions have shifted from information retrieval to “Opinion Mining”. Online users share a bulk of opinionative information over these social networking websites, so observers and analysts take advantage of this available information by collecting and summarizing the relevant opinionative information in order to monitor authors’ moods about launched products, services and even political policies for better decision making. Social monitoring is performed by means of Sentiment Analysis and Opinion Mining. Sentiment Analysis or Opinion Mining is a novel field of text classification and a problem of Natural Language Processing. Opinion Mining is the computational study of public sentiments, feelings and opinions shared in the form of text over social media sites. Extracting public opinions from user-generated content is not the hard part; rather, identifying and summarizing opinions about the desired entity and gauging their strength is the challenging task. Efficient classification of opinions requires knowledge of machine learning and classification algorithms with appropriate linguistic rules. The rapid growth of social communication devices and channels has produced newer challenges for observers and analysts. Online users publish their views and opinions in a distinctive and informal way which is not directly translatable for a machine learning system. Additionally, they adopt acronyms, emotion icons and other microblogging features for communication.

Sentiment analysis cannot be performed directly on these published reviews; instead it requires a massive effort of input text preparation. In the past, numerous experiments have been performed on text normalization and preprocessing. Text Normalization is a data mining task in which text is cleaned of undesired tags and symbols. Normalization (a.k.a. preprocessing) is the process of cleaning user-generated text for analysis and prediction [4]. One cannot extract the actual opinion without assessing opinionative text precisely, so the quality of a decision depends directly on the quality of the text. In the past, preprocessing has been performed via many different supervised [5], semi-supervised [6] and unsupervised [7] methods. Sentiment Analysis is applicable in almost every field of life. In this research we have decided to normalize user-generated content in the political domain to build a valuable dataset for
the sake of analysis. We offer a mechanism in which text is cleaned using four key steps: Noise Reduction, Part of Speech Tagging, Stop Word Removal, and Stemming & Lemmatization. The Python Natural Language Toolkit is used for performing all four steps. This study demonstrates that utilization and normalization of informal tokens in tweets can improve the overall classification accuracy. The rest of the article is organized as follows: Section 2 presents related work, Section 3 the method, Section 4 the results and discussion, and Section 5 the conclusion and future work.

II. RELATED WORK

The increasing growth of electronic documents on the World Wide Web has changed the way of analysis dramatically. Social media is growing rapidly due to the availability of millions of online user-generated opinionative contents on social sites. The expressed views and suggestions are considered sentiments and opinions. These sentiments are mined for better decision making and also for the purpose of analysis and evaluation. Sentiment Analysis or Opinion Mining is the computational study of public moods. Dave et al [8] used the term “Opinion Mining” for the first time in 2003. Opinion mining or sentiment analysis is a problem of NLP. Sentiment analysis on Twitter is a new and challenging area. Reasonable efforts have already been made in this area, and with the increasing number of online users it is growing day by day for analyzing various entities, but the quality of text is a big issue for observers and analysts.

Sentiment Analysis for politics is a hot topic; in the past a lot of work has been done on predicting elections or political events. In fact, microblogging sites are becoming the most popular platform for the political arena [9]. Before the US election of 2008 the use of the internet was limited to the exchange of information, but this changed dramatically when Barack Obama started his campaign on social media [10]. It was the first political campaign run on social media. After that campaign, social media became one of the most valuable sources for political conversation. Kim, D [11] found that Twitter was heavily used for seeking political information during the Korean election of 2010. Gaffney [12] tracked the #IranElection hashtag to study the use of microblogging sites, specifically Twitter, during the 2009 Iran election; due to the heavy usage of Twitter, its maintenance was stopped at one stage on the order of the US State Department [13]. Political incumbents and challengers used Twitter for political benefit during the US midterm elections held in 2010 [14]. Although many systems exist for the extraction of public sentiment, the informal nature of text is a big hurdle for all of them. Cleaning or normalizing opinionative text is a challenging issue in sentiment analysis. In the last few years data cleaning and preprocessing has been viewed as an important activity and research topic, and various supervised and unsupervised methods have been tried in a number of domains for analysis purposes.

Hariharakrishnan, J. et al [15] reviewed numerous techniques of text preprocessing and highlighted the significance and multiple aspects of preprocessing, such as noise reduction, outlier identification and inconsistent data. They raised the very logical point that most experiments on text preprocessing are performed either in the data collection phase or for homogeneous data. There is a lack of effort on heterogeneous text preprocessing, and there is no system which detects and cleans data during classification. They are planning to develop a hybrid system for cleaning homogenous data in different situations. Haddi, E et al [16] explored the significance of text preprocessing for extracting public opinions from social media contents. They performed various experiments over a movie reviews dataset with supervised algorithms and concluded that SVM achieved significantly better accuracy in comparison with other algorithms. They used three different features, Feature Presence (FP), Feature Frequency (FF) and Term Frequency Inverse Document Frequency (TF-IDF), and achieved 93% overall accuracy. They stated that sentiment analysis is a hard problem and one cannot achieve promising outcomes without cleaning text. Singh, T. et al [17] proposed a system for efficient preprocessing of text for Twitter sentiment analysis. They explored the importance of slang words in sentiment analysis by combining these with existing features. Various experiments were performed in which SVM was used as the base classifier. Their results demonstrate that the proposed system achieved promising outcomes with the combination of conditional random fields and n-grams. They achieved 94% average accuracy after normalization of text. Hemalatha, I. et al [18] offered a three-step preprocessing strategy in which URLs are removed in the first step, special and repeated characters are removed in the second step, and question words are removed in the third step. They claimed that with this preprocessing algorithm one can easily perform sentiment analysis with any machine learning algorithm. Angiani, G. et al [19] compared various existing machine learning methods for text preprocessing and stated that appropriate preprocessing can improve the valuable information gained from available text. They evaluated the performance of numerous preprocessing strategies over Twitter data and concluded that using a dictionary is not a useful idea for improving classification performance. Additionally, they suggested that combining different preprocessing filters can improve the classification accuracy.

Although sentiments and opinions are mostly mined and analyzed in English-like languages, opinionative contents in other languages are also available at a high rate over social networking websites. Duwairi, R. & El-Orfali, M [20] presented their work on Arabic text, which they investigated from three perspectives. The first is the multi-aspect text representation, such as the significant role of stemming, n-grams and feature correlation for the Arabic language. Secondly, the performance of three existing machine learning classifiers, NB, SVM and K-NN, was examined. The last perspective was to analyse the impact of different characteristics of the dataset on sentiment analysis. Experiments were performed on two datasets: a manually compiled dataset for politics and an existing corpus of movie reviews. Their results demonstrate that the Naïve Bayes classifier outperforms the other two on both domains (politics and movies), achieving 96.6% and 85.7% accuracy respectively. They stated that preprocessing and the behavior of the dataset have a great impact on sentiment analysis accuracy. Dos Santos & Ladeira, M [21] performed experiments on Portuguese reviews to present the role of text preprocessing in sentiment analysis.
Additionally, this research presented a large corpus of 759 thousand reviews as its contribution. They concluded that text preprocessing has an insignificant role in text classification and sentiment analysis. They stated that the accuracy and performance of sentiment analysis systems depend on the nature of the datasets and the sentences/reviews used in the dataset, because preprocessing sometimes lowers the accuracy by removing valuable and necessary information from the target text. Toman, M [22] proposed a lemmatization system which uses the multilingual semantic thesaurus Eurowordnet. They evaluated the performance of the proposed system on two different corpora. Their findings suggest that the proposed system achieved promising outcomes, and they concluded that conversion of inflected forms into their roots does not affect the classification accuracy, while on the other hand Christopher, D.M et al. [23] stated that stemming lowers the precision. Dařena, F., & Žižka, J. [24] reviewed the existing preprocessing methods and applications for non-standard short text and uncovered several informative patterns. They stated that smaller datasets are inappropriate and produce inefficient outcomes. Additionally, this study found that preprocessing results depend highly on the language of the data and the algorithm. The positive point about preprocessing is that it reduces the dictionary size of the data collection.

Noise and unclear data are not an issue of the English language alone; all languages used over internet-based social media require preprocessing of text. In fact, preprocessing experiments have been performed for almost all human languages, including Arabic, French, Hindi, Chinese, Japanese and Turkish. Hidayatullah, A. F et al [25] experimented with the Indonesian language in order to clean text, more specifically tweets, for further analysis. They divided these experiments into two parts: common preprocessing and specific preprocessing. Their results demonstrate that they achieve good results with specific text preprocessing tasks. Additionally, they suggested that preprocessing can be improved by introducing novel algorithms and systems for automatic recognition of non-standard words. The rapid advance of the web came with newer communication indicators such as # tags, @ tags and emotion icons. Ignoring such symbols during preprocessing can affect the quality of the dataset. One cannot directly remove meaningful punctuation during preprocessing, because meaningful punctuation (emoticons) conveys opinion towards target entities. Wegrzyn-Wolska, K et al [26] compared three emoticon preprocessing methods: emoticon deletion (emodel), emoticon 2-values translation (emo2label) and emoticon explanation (emo2explanation). An emoticon weight lexicon was used with a Naïve Bayes classifier in order to assess the effect of emotion icons. They concluded that emotion icons act as a verbal indicator of sentiment and can enhance sentiment analysis accuracy. They achieved 78% average accuracy with the Naïve Bayes classifier. Gull, R et al [27] proposed an approach for the analysis of qualitative and quantitative data in a specified time. They transformed the extracted text into a structured format and prepared a politics dataset for the analysis of a political party using linguistic features and classifiers. Two classifiers, Naïve Bayes and SVM, were employed, and they concluded that SVM produced better outcomes. In future, they are planning to analyze multilingual text.

Stemming is one of the key phases of preprocessing, because identifying and converting inflected forms of opinionative terms may increase the quality of the dataset. Arjun et al. [28] compared two stemming techniques, Porter’s and Krovetz’s algorithms. Their findings suggested that both algorithms have a few limitations in specific scenarios: Porter’s algorithm [29] is context based and also leads to a large degree of conversion, whereas the Krovetz algorithm [30] produced inefficient results with large datasets. In the past, preprocessing has been performed in a sequential manner using a pipeline of preliminary tasks, but a few systems exist which use a unified phase for all essential tasks. Clark, A [31] presented the design and implementation of a tool for the preprocessing of noisy corpora. He coped with typographical errors and white space issues using a trainable stochastic transducer model over a 100 million word corpus of Usenet news. He stated that preprocessing can be improved by merging various models for different sorts of typographical errors. Bao, Y et al [32] explored the significance of text preprocessing for Twitter sentiment analysis. They examined the impact of URLs, negation, repeated letters, stemming and lemmatization. The experiments were performed on the Stanford Twitter dataset. Their findings show that handling URLs, negations and repeated characters can improve accuracy, while on the other hand stemming and lemmatization decrease the classification accuracy. They achieved an average accuracy of 85.5% with the original feature space. Petz, G et al. [33] compared various existing preprocessing techniques for sentiment analysis in real-world situations. They stated that three preprocessing tasks are essential to achieve satisfactory outcomes: sentence tokenization, replacement of slang symbols and stemming of inflected forms. They achieved an average F1-score of 83.42% for eight different techniques. Krouska, A et al. [34] reviewed recent research on text preprocessing and sentiment analysis. They provided an in-depth analysis of preprocessing techniques by performing various experiments over a manually compiled Twitter dataset and stated that appropriate feature selection and proper representation can improve classification accuracy. They compared four key classifiers, NB, SVM, KNN and C4.5, over three different datasets, OMD, HCR and STS-Gold. Raza, A et al [35] reported that the modern linguistic style has a number of variant features, such as the use of Roman script, slang, and Urdu language terms and sentences for expressing likes and dislikes about hundreds of real-world entities. Additionally, a system built for one domain and language is inapplicable to other languages and domains. Therefore, text normalization is important for coping with many tasks such as plagiarism detection, pattern discovery, sentence recognition, topic modelling and information retrieval.

Liu, B. and Zhang, L [36] demonstrated that sentiment analysis is performed at three granularity levels: document, sentence and phrase level. Previous research showed that document-level sentiment analysis has gained much focus; in the past, [37, 38] performed document-level sentiment analysis using the minimum cuts algorithm. Kian, K.D et al [39] performed preprocessing experiments to determine the qualitative difference between high and low agreement data. Raza, Raza, A et al [40] used preprocessing and a lexicon-based sentiment analysis system to capture public opinion shared about a political protest over Twitter. They stated that effective preprocessing can enhance the classification accuracy. Yu et al [41] proposed a system for sentence-level subjectivity classification. In this research
we have proposed a mechanism for text preprocessing at sentence level for sentiment analysis of political contents using existing and manually built dictionaries and lexicons.

III. METHODOLOGY

Sentiment Analysis is the process of acquiring users’ sentiments shared on social networking websites. There exist two main methods of assigning polarities to public sentiments: supervised and unsupervised. Whichever method is used for sentiment analysis, it always needs quality text in the decision making process.

Fig. 1. Mechanism for Normalization of Informal Text.

Today social sites pose numerous challenges for gathering quality text. In this research a mechanism is proposed for the normalization of user-generated opinionative contents in order to perform optimized analysis. The proposed mechanism comprises the following essential phases, as depicted above in Fig. 1.

A. Normalization

Normalization in sentiment analysis refers to the process of cleaning or removing irrelevant data from a huge collection of extracted data. The extracted data is full of noise containing URLs, tags, links etc. Data preprocessing is performed to remove such noise from the extracted text to make it clearer and more consistent. In every text mining process data must be preprocessed before going to the analysis phase, so we preprocessed the extracted data for further processing. The URLs and tags are removed from the extracted data; generally these URLs have no use in the sentiment analysis process. The following tasks are involved in the text preprocessing process:

B. Noise Reduction

The text extracted from social media sites is full of noise. This text contains URLs, symbols, undesired punctuation and some special communication symbols, i.e. @, RT, < > etc. These symbols and tags have no role in sentiment classification tasks, so all such punctuation must be eliminated before mining and analysis. In this phase of preprocessing we removed these undesired symbols and tags using an HTML parser.

C. Definition of Informal Tokens

In the past, extracted text was preprocessed using English repositories, but social sites today provide a bulk of opinionative data in informal style, including words of other languages. To capture opinion from numerous geographical areas there is therefore a dire need to collect and summarize opinion of different styles (formal and informal) and formats. Twitter and other microblogging services allow users to share short, informal, slangy terms which are not easily detectable by a machine and sometimes not even by a human reader. So in this study we have captured non-English opinionative words used in English sentences for the sake of efficient sentiment analysis. We first detected all the slang terms, and then a proper definition was assigned to each extracted token using a manually compiled list of slang and non-standard terms. Slang refers to misspelled English opinionative terms, whereas non-standard refers to Roman Urdu terms used in English sentences to express positivity or negativity about the concerned entity. A list of Roman Urdu opinionative terms is created for effective identification of both slang and non-standard terms. The Python Natural Language Toolkit is used for cleaning formal and informal opinionative tokens.

D. Part of Speech Tagging

The noise-free text is passed to the part of speech tagging phase, which labels each target token with an appropriate part of speech tag. Part of Speech (POS) tagging is the process of assigning parts of speech to each desired term. In this research tokenized text is labeled according to its grammatical nature, i.e. adjective, verb, adverb, noun etc. We use Python NLTK for assigning part of speech tags to the extracted text.

E. Stop Word Removal

Words with high frequency, i.e. the most frequently used terms such as “for”, “the” and “a”, are considered stop words. In sentiment analysis, stop words are removed to obtain more concise text for analysis. So we removed all such stop words from the extracted data by providing the tokens to Python NLTK. Python NLTK [42] is a collection of built-in libraries and software for Natural Language Processing. A corpus containing lists of words of various languages is part of Python NLTK, so stop words are removed from the extracted text using this corpus.

F. Stemming and Lemmatization

The process of reducing all terms with the same stem to a common form is called stemming; a stem is a root form. For example, the stem of the words “fishing”, “fished” and “fisher” is “fish”. Lemmatization is the process of removing inflectional endings and replacing the inflected word with a base word; the module used for this process is known as a Lemmatizer. The Lemmatizer uses an additional dictionary to replace inflected forms with their base form. For example, in stemming the terms “begging”, “beggar” and “beginning” will be replaced with the term “beg”, while in lemmatization these terms will be replaced with the terms “beg”, “beg” and “begin” respectively.
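As a rough illustration of the phases described in this section, the sketch below chains noise reduction, definition of informal tokens, stop word removal and a dictionary-based lemma lookup using only the Python standard library. It is not the authors' implementation: the paper uses an HTML parser and Python NLTK, and part of speech tagging is left out here because it needs a trained tagger. The regular expressions, the tiny `SLANG_DEFINITIONS`, `STOP_WORDS` and `LEMMA_DICTIONARY` tables and all function names are hypothetical stand-ins, chosen so that the example runs without any corpus download.

```python
import html
import re

# Hypothetical noise patterns: URLs, markup tags, @mentions and the RT marker.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
MENTION_RE = re.compile(r"@\w+")
RETWEET_RE = re.compile(r"\bRT\b")
TOKEN_RE = re.compile(r"[a-z0-9']+")

# Toy lookup tables (assumptions); the paper uses manually compiled lists
# of slang and Roman Urdu terms, the NLTK stop word corpus and WordNet.
SLANG_DEFINITIONS = {"gr8": "great", "luv": "love", "zbrdst": "great"}
STOP_WORDS = {"for", "the", "a", "an", "is", "are", "to", "of", "and"}
LEMMA_DICTIONARY = {"fishing": "fish", "fished": "fish"}


def reduce_noise(text: str) -> str:
    """Decode HTML entities, then blank out URLs, tags, mentions and RT."""
    text = html.unescape(text)
    for pattern in (URL_RE, TAG_RE, MENTION_RE, RETWEET_RE):
        text = pattern.sub(" ", text)
    return text


def tokenize(text: str) -> list:
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return TOKEN_RE.findall(text.lower())


def define_informal(tokens: list) -> list:
    """Replace each known informal token with its assigned definition."""
    return [SLANG_DEFINITIONS.get(token, token) for token in tokens]


def remove_stop_words(tokens: list) -> list:
    """Drop tokens that appear in the stop word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def lemmatize(tokens: list) -> list:
    """Map inflected forms to a base form through a dictionary lookup."""
    return [LEMMA_DICTIONARY.get(token, token) for token in tokens]


def normalize(tweet: str) -> list:
    """Run the phases in order and return the cleaned tokens."""
    tokens = tokenize(reduce_noise(tweet))
    return lemmatize(remove_stop_words(define_informal(tokens)))


if __name__ == "__main__":
    print(normalize("RT @user: The new policy is gr8 <br> http://t.co/xyz #win"))
```

Informal tokens are defined before stop words are removed so that a slang term is judged by its assigned definition rather than by its raw spelling; with real lexicons the same ordering would apply.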

respectively. The output generated by Lemmatizer is more This section presents the experimental findings of
accurate as compared to that of stemmer. So we replaced all proposed mechanism. In order to underline the impact of
the inflected form to their base form by using python NLTK preprocessing we have presented precision, recall, f-measure
Lemmatizer. The Python NLTK Lemmatizer uses WordNet & accuracy of tweets collection for both formal and informal
database for finding lemmas of inflected terms. The canonical opinion bearing terms. We have computed precision, recall f-
form of the word is known as lemma. measure separately for both terms just to emphasize the
effectiveness of handling informal opinions in sentiment
The preprocessed text is saved in separate file as dataset analysis.
for classification and analysis. We have evaluated the
effectiveness of our preprocessed text using existing A. Precision
classification technique. Precision is actually the fraction between retrieved and
IV. RESULTS AND DISCUSSION relevant instances as shown below in equation 1.
To evaluate the effectiveness of proposed mechanism (1)
comprehensive experiments are performed on twitter data
about Pakistan Political Parties and Leaders. The data is Whereas TP is used for True Positive, FP is for False
extracted about Pakistan politics from publically available Positive. TN shows True negative and FN is for False
reviews of twitter using twitter APIs. Manual annotation is Negative. In this study we used these terms for specifying the
performed to assign polar classes to each extracted tweet so a following criteria, TP: Correctly identified as positive by the
set of 1400 tweets in which 700 positive whereas other 700 proposed framework, FP: Incorrectly identified as positive.
negative are labelled as benchmark in order to evaluate the Similarly, for negative, TN is for correctly identified as
performance of preprocessing mechanism. Table I. shows the negative while FN shows the terms which are incorrectly
statistics for positive & negative tweets for both formal & identified as negative.
informal opinions. Precision for formal positive instances:
All the necessary steps of normalization mentioned in
section 3 are performed using Python natural language toolkit. = = 81.31%
This study presents a novel mechanism of text normalization
Precision for formal negative instances:
in the classification of informal opinion bearing text. In past
there exist many methods for text normalization but still there = = 79.72%
is sufficient gap for improvement so we proposed a
mechanism which detect all the opinionative feature in The precision for informal positive and negative tweets is
preprocessing phase. Table.II presents the opinionative as follow;
features which are considered in this study just to increase the
classification accuracy. Precision for informal positive instances:

TABLE I. HUMAN ANNOTATED DATASET OF FORMAL AND INFORMAL

= = 83.03%
OPINIONATIVE TWEETS
Precision for informal negative instances:
MANUALLY LABELED OPINION POSITIVE NEGATIVE
FORMAL OPINION 500 500 = = 92.04%
INFORMAL OPINION 200 200
Similarly, precision for both formal & informal positive
TABLE II. FORMAL AND INFORMAL OPINIONATIVE FEATURES
and negative tweets is described as below;

OPINION Precision for both formal & informal positive instances:

S.NO. NATURE DEFINITION
FEATURES
Qualifying Word: A word that shows
= = 81.85%
1 f1: Adjective Formal
the quality of an entity
Precision for both formal & informal negative instances:
Action: A word that shows some
2 f2: Verb Formal
action on an entity
= = 82.87%
An adverb is a word that emphasis an
3 f3: Adverb Formal
adjective, verb.
B. Recall
Informal opinionative words that are
f4: Slangs & Recall is used to find the numbers of relevant from
4 Informal left misspelled intentionally in some
Acronyms
particular context. retrieved instances as shown below in eq.2.
Writing Urdu script with English
f5: Roman (2)
5 Informal letter according to its appropriate
Urdu Terms
pronunciation.
Punctuations and combination of Recall for formal positive & negative tweets is presented
6 f6: Emoticon Symbolic characters used to show facial as follow;
expressions.
Recall for formal positive instances:

82 | P a g e
www.ijacsa.thesai.org
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 9, No. 10, 2018

$\frac{396}{396 + 104} = 79.2\%$

Recall for formal negative instances:

$\frac{409}{409 + 91} = 81.8\%$

Recall for informal positive & negative tweets is shown below;

Recall for informal positive instances:

$\frac{186}{186 + 14} = 93\%$

Recall for informal negative instances:

$\frac{162}{162 + 38} = 81\%$

Similarly, recall for both formal & informal positive & negative tweets is shown below;

Recall for both formal & informal positive instances:

$\frac{582}{582 + 118} \approx 83.14\%$

Recall for both formal & informal negative instances:

$\frac{571}{571 + 129} \approx 81.57\%$

TABLE III. OPINIONATIVE TWEETS HAVING BOTH FORMAL & INFORMAL FEATURES

S. No | OPINIONATIVE TWEETS | OPINION FEATURES | LABEL
2 | Riaz has bribed many politicians but he must know he can never bribe me: PTI chief | F2 | Positive
3 | Once a drbari always darbari | F5 | Negative
4 | Patwari (zehni ghulam), darbari and bhikari All are in shock That what happened to us. | F2, F5 | Negative
5 | Chal patwari get lost... | F4, F5 | Negative
6 | Aala to good zbrdst | F1, F5 | Positive
7 | Fucking Daghi :-P | F2, F5 | Negative
8 | That's great janbaz #Bilawal | F1, F5, F6 | Positive
9 | Wah,,bahut aala,,,pti linked offshore companies are neat and clean like imran niazi,,,, hahahaha | F1, F4, F5 | Positive
10 | I have seen KPK hospitals , they are better than Punjab hospitals, 1000 times better and i am a doctor also i know better than you Mr brainless Patwari | F1, F5 | Positive
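The informal feature labels attached to the tweets in Table III (F4 for slang and acronyms, F5 for Roman Urdu terms, F6 for emoticons) can be illustrated with a minimal Python sketch. The lexicons and the emoticon pattern used here are hypothetical toy stand-ins and are not the resources used in this study; the function name `informal_features` is likewise an assumption. Formal features (F1 to F3) are left out because they would require a part-of-speech tagger.

```python
import re

# Hypothetical toy lexicons for illustration only; a real system would
# load far larger, curated word lists.
ROMAN_URDU_LEXICON = {"patwari", "darbari", "aala", "zbrdst", "bahut", "janbaz"}
SLANG_LEXICON = {"lol", "gr8", "omg"}

# Assumed pattern covering a few common Western-style emoticons such as :) ;-( :-P
EMOTICON_PATTERN = re.compile(r"[:;=][-^]?[)(DPp]")


def informal_features(tweet):
    """Return the sorted informal feature labels (F4, F5, F6) found in a tweet."""
    tokens = re.findall(r"[a-z0-9']+", tweet.lower())
    features = set()
    if any(token in SLANG_LEXICON for token in tokens):
        features.add("F4")  # slang or acronym
    if any(token in ROMAN_URDU_LEXICON for token in tokens):
        features.add("F5")  # Roman Urdu term
    if EMOTICON_PATTERN.search(tweet):
        features.add("F6")  # emoticon
    return sorted(features)


if __name__ == "__main__":
    for text in ["Aala to good zbrdst", "Chal patwari get lost... :-P"]:
        print(text, "->", informal_features(text))
```

A dictionary lookup of this kind only recognizes spellings that are already listed, so spelling variants of the same Roman Urdu word would each need their own entry or a separate matching step.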

C. F-Measure

F-Measure is also a statistical measure used in binary classification. It measures accuracy by considering both precision and recall, as shown in eq. 3.

$\text{F-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ (3)

F-Measure = 82.21%

D. Accuracy

Accuracy is the degree of correctness; its mathematical representation is shown in eq. 4.

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ (4)

$\text{Accuracy} = \frac{582 + 571}{1400} = 82.357\%$

It is observed that out of 1153 opinionative tweets, 1054 tweets are identified as opinionative with formal and informal opinion bearing words, whereas the rest of the opinionative tweets are identified using informal tokens only, where no single formal opinionative token was present. This increases accuracy by 6.937 percentage points, from 75.42% to 82.357%. Table III shows a subset of the tweet collection marked as opinionative with formal and informal opinion.

E. Confusion Matrix

A Confusion Matrix, or Contingency Table, is used for the evaluation of the proposed system. It is a two dimensional array used to visualize the performance of the proposed mechanism. Table IV shows the confusion matrix of the experimental results, in which rows show the number of manually annotated instances whereas columns show the machine/system annotated instances.

TABLE IV. CONFUSION MATRIX OF FORMAL AND INFORMAL OPINIONATIVE TERMS

POLARITY CLASS LABELS FOR FORMAL & INFORMAL OPINIONATIVE TERMS | HUMAN ANNOTATED CLASS LABELS | MACHINE ANNOTATED: Positive | MACHINE ANNOTATED: Negative | TOTAL
Formal opinion | Positive | 396 (TP) | 104 (FN) | 500
Formal opinion | Negative | 91 (FP) | 409 (TN) | 500
Formal opinion | TOTAL | 487 | 513 | 1000
Informal opinion | Positive | 186 (TP) | 14 (FN) | 200
Informal opinion | Negative | 38 (FP) | 162 (TN) | 200
Informal opinion | TOTAL | 224 | 176 | 400
Formal & informal opinions | Positive | 582 (TP) | 118 (FN) | 700
Formal & informal opinions | Negative | 129 (FP) | 571 (TN) | 700
Formal & informal opinions | TOTAL | 711 | 689 | 1400
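The per-class figures reported in this section can be recomputed from the combined formal & informal counts in Table IV. The following is a minimal Python sketch under that assumption; the helper name `class_metrics` and its parameter names are illustrative and do not come from the original study. Each class is treated in turn as the target class, so the instances missed for one class are the false alarms for the other.

```python
def class_metrics(correct, missed, false_alarms):
    """Precision, recall and F-measure for one target class."""
    precision = correct / (correct + false_alarms)
    recall = correct / (correct + missed)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


# Combined formal & informal counts copied from Table IV.
TP, FN, FP, TN = 582, 118, 129, 571

# Positive class: TP are correct, FN are missed, FP are false alarms.
positive = class_metrics(correct=TP, missed=FN, false_alarms=FP)
# Negative class: TN are correct, FP are missed, FN are false alarms.
negative = class_metrics(correct=TN, missed=FP, false_alarms=FN)
# Accuracy is the share of all instances that were labeled correctly.
accuracy = (TP + TN) / (TP + FN + FP + TN)

if __name__ == "__main__":
    for name, scores in (("positive", positive), ("negative", negative)):
        for metric, value in scores.items():
            print(f"{name} {metric}: {value * 100:.2f}%")
    print(f"accuracy: {accuracy * 100:.3f}%")
```

Small differences in the last decimal place against the reported values can arise from rounding versus truncation.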


Fig. 2. Impact of Opinionative Features.

In Table IV, the vertical representations of positive and negative instances show the outcomes of the machine, whereas the horizontal instances indicate manually labeled instances. We have decomposed the whole data set into formal and informal parts, so Table IV presents a three-part confusion matrix in order to provide a clear picture of informal opinions. The third and last confusion matrix shows the overall results of both formal & informal opinions, in which the machine labeled 711 opinionative tweets as positive while 689 are marked as negative, with accuracy of 81.9% and 82.85% respectively. In this preprocessing mechanism six opinionative features are considered, as shown in Table II. Experimental results demonstrate that all these features have great impact on real world results. Fig. 2 shows that a few of them have increased the classification accuracy dramatically.

Fig. 2 shows that all considered features have increased the average accuracy of sentiment analysis. If we consider only verbs from the whole collection, the accuracy is 31.42%; for verbs and adjectives, accuracy jumps to 72.58%; and for all features the overall accuracy reaches 82.357%, which shows a significant contribution to sentiment analysis.

Table V shows the comparative results of the proposed preprocessing mechanism. It is noticed that the proposed system outperformed the existing systems by achieving an average precision, recall and accuracy of 81.9%, 82.35%, and 82.357% respectively.

TABLE V. COMPARATIVE ANALYSIS OF PROPOSED PREPROCESSING MECHANISM

STUDIES | PRECISION | RECALL | ACCURACY
Pang, B et al. [37] | 83% | 80.58% | 81.5%
Haddi, E [43] | 63% | 60% | 60% lexical
Etaiwi, W et al. [44] | 51.8% | 74.9% | 72.96%
Proposed | 81.9% | 82.35% | 82.357%

V. CONCLUSION AND FUTURE WORK

Sentiment Analysis is the computational study of users' opinions about real world entities. Analyses are performed on publicly available data over social media sites. Machine Learning algorithms need a well formed quality dataset for analysis, so publicly available text is first normalized in order to achieve decision making results. One cannot mine public opinions accurately without inputting meaningful instances. In fact, the quality of analysis directly depends on the size and nature of the input data. This research proposes a novel mechanism for normalization of publicly available opinionative data for the sake of sentiment analysis. Text normalization is not just a single step; in fact, it is the process of performing a flow of essential actions sequentially, i.e. Tokenization, Stop word removal, Part of Speech Tagging, Stemming and Lemmatization. In this study we have considered six opinion bearing indicators as classification attributes: Verb, Adjective, Adverb, Slang, Roman Urdu terms and Emoticon. In the past, non-standard and unstructured terms were handled at the classification phase, which sometimes lowers the classification accuracy. To overcome this deficiency, the proposed study provides a proper definition for each extracted informal & non-standard term at the normalization phase. Twitter data is first crawled using Twitter APIs and then a separate file is generated for performing normalization tasks. The normalization steps, namely noise reduction, informal definition, parts of speech tagging, stemming and lemmatization, are performed in an incremental manner. To evaluate the results of the proposed mechanism, experiments are performed on a manually annotated collection of 1400 tweets equally distributed between positive and negative opinion bearing instances. Experimental results demonstrate that informal opinions have great impact on the classification accuracy, as we achieved 82.357% accuracy with an increment of 6.937%. The proposed mechanism is robust and can be applied to multidimensional domains. We encourage future researchers to experiment with novel opinionative features to provide quality datasets for many real world entities.

REFERENCES
[1] Glass, K. and Colbaugh, R., 2012. Estimating the sentiment of social media content for security informatics applications. Security Informatics (a Springer open journal), Vol. 1, Issue 1, pp. 1-16.
[2] Lahoti, A.A., 2014. Data Mining Technique its Needs and Using Applications. IJCSMC, Vol. 3, Issue 4, pp. 572-579.
[3] Kumar, A. & Sebastian, T.M., 2012. Sentiment Analysis: A Perspective on its Past, Present and Future. International Journal of Intelligent Systems and Applications, Vol. 4, Issue 10, pp. 1-14.
[4] Jianqiang, Z. and Xiaolin, G., 2017. Comparison Research on Text Pre-processing Methods on Twitter Sentiment Analysis. IEEE Access, 5, pp. 2870-2879.
[5] Kotsiantis, S.B., Kanellopoulos, D. and Pintelas, P.E., 2006. Data preprocessing for supervised leaning. International Journal of Computer Science, 1(2), pp. 111-117.
[6] Xiang, B. and Zhou, L., 2014. Improving twitter sentiment analysis with topic-based mixture modeling and semi-supervised training. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (Vol. 2, pp. 434-439).
[7] Karl, M., Bayer, J. and van der Smagt, P., 2016. Unsupervised preprocessing for Tactile Data.
[8] Dave, K. et al., 2003. Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. Proceedings of the 12th


international conference on World Wide Web, pp.519–528. New York, NY, USA.
[9] Tumasjan, A., Sprenger, T.O., Sandner, P.G. and Welpe, I.M., 2011. Election forecasts with Twitter: How 140 characters reflect the political landscape. Social science computer review, 29(4), pp.402-418.
[10] Anderson, D., 2009. How has Web 2.0 reshaped the presidential campaign in the United States? In Proceedings of the WebSci'09: Society On-Line, 18-20 March 2009, Athens, Greece.
[11] Kim, D., 2011. Tweeting politics: Examining the motivations for Twitter use and the impact on political participation. In 61st Annual Conference of the International Communication Association.
[12] Gaffney, D., 2010. iran Election : Quantifying Online Activism. Analysis, (pp.1-8). Web Science Conf. 2010, April 26-27, 2010, Raleigh, NC, USA.
[13] Pleming, S., 2009. US State Department speaks to Twitter over Iran. Reuters, June, 16.
[14] Cozma, R. and Chen, K., 2011, May. Congressional Candidates’ Use of Twitter During the 2010 Midterm Elections: A Wasted Opportunity?. In 61st Annual Conference of the International communication association.
[15] Hariharakrishnan, J., Mohanavalli, S. and Kumar, K.S., 2017, January. Survey of pre-processing techniques for mining big data. In Computer, Communication and Signal Processing (ICCCSP), 2017 International Conference on (pp. 1-5). IEEE.
[16] Haddi, E., Liu, X. and Shi, Y., 2013. The role of text pre-processing in sentiment analysis. Procedia Computer Science, 17, pp.26-32.
[17] Singh, T. and Kumari, M., 2016. Role of Text Pre-processing in Twitter Sentiment Analysis. Procedia Computer Science, 89, pp.549-554.
[18] Hemalatha, I., Varma, G.S. and Govardhan, A., 2012. Preprocessing the informal text for efficient sentiment analysis. International Journal of Emerging Trends & Technology in Computer Science (IJETTCS), 1(2), pp.58-61.
[19] Angiani, G., Ferrari, L., Fontanini, T., Fornacciari, P., Iotti, E., Magliani, F. and Manicardi, S., 2016. A Comparison between Preprocessing Techniques for Sentiment Analysis in Twitter. In KDWeb.
[20] Duwairi, R. and El-Orfali, M., 2014. A study of the effects of preprocessing strategies on sentiment analysis for Arabic text. Journal of Information Science, 40(4), pp.501-513.
[21] Dos Santos, F.L. and Ladeira, M., 2014, October. The role of text pre-processing in opinion mining on a social media language dataset. In Intelligent Systems (BRACIS), 2014 Brazilian Conference on (pp. 50-54). IEEE.
[22] Toman M, Tesar R, Jezek K. Influence of word normalization on text classification. Proceedings of InSciT. 2006 Oct 25;4:354-8.
[23] Christopher, D.M., Prabhakar, R. and Hinrich, S.C.H.Ü.T.Z.E., 2008. Introduction to information retrieval. An Introduction To Information Retrieval, 151, p.177.
[24] Dařena, F. and Žižka, J., 2015. Interdependence of text mining quality and the input data preprocessing. In Artificial Intelligence Perspectives and Applications (pp. 141-150). Springer, Cham.
[25] Hidayatullah, A.F. and Ma’arif, M.R., 2017, January. Pre-processing Tasks in Indonesian Twitter Messages. In Journal of Physics: Conference Series (Vol. 801, No. 1, p. 012072). IOP Publishing.
[26] Wegrzyn-Wolska, K., Bougueroua, L., Yu, H. and Zhong, J., EXPLORE THE EFFECTS OF EMOTICONS ON TWITTER SENTIMENT ANALYSIS. Computer Science & Information Technology, p.65.
[27] Gull, R., Shoaib, U., Rasheed, S., Abid, W. and Zahoor, B., 2016. Pre Processing of Twitter's Data for Opinion Mining in Political Context. Procedia Computer Science, 96, pp.1560-1570.
[28] Arjun Srinivas Nayak, Ananthu P Kanive, Naveen Chandavekar, Bala subramani R., 2016. Survey on Pre-Processing Techniques for Text Mining. International Journal of Engineering and Computer Science, 5(6), pp. 16875-16879.
[29] Moral, C., de Antonio, A., Imbert, R. and Ramírez, J., 2014. A survey of stemming algorithms in information retrieval. Information Research: An International Electronic Journal, 19(1), p.n1.
[30] Ramasubramanian, C. and Ramya, R., 2013. Effective pre-processing activities in text mining using improved porter’s stemming algorithm. International Journal of Advanced Research in Computer and Communication Engineering, 2(12), pp.4536-8.
[31] Clark, A., 2003, March. Pre-processing very noisy text. In Proc. of Workshop on Shallow Processing of Large Corpora (pp. 12-22).
[32] Bao, Y., Quan, C., Wang, L. and Ren, F., 2014, August. The role of pre-processing in twitter sentiment analysis. In International Conference on Intelligent Computing (pp. 615-624). Springer, Cham.
[33] Petz, G., Karpowicz, M., Fürschuß, H., Auinger, A., Winkler, S., Schaller, S. and Holzinger, A., 2012. On text preprocessing for opinion mining outside of laboratory environments. Active media technology, pp.618-629.
[34] Krouska, A., Troussas, C. and Virvou, M., 2016, July. The effect of preprocessing techniques on Twitter sentiment analysis. In Information, Intelligence, Systems & Applications (IISA), 2016 7th International Conference on (pp. 1-5). IEEE.
[35] Raza, A.A., Habib, A., Ashraf, J. and Javed, M., 2017. A Review on Urdu Language Parsing. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 8(4), pp.93-97.
[36] Liu, B. and Zhang, L., 2012. A survey of opinion mining and sentiment analysis. In Mining text data (pp. 415-463). Springer US.
[37] Pang, B., Lee, L. and Vaithyanathan, S., 2002, July. Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10 (pp. 79-86). Association for Computational Linguistics.
[38] Pang, B. and Lee, L., 2004, July. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd annual meeting on Association for Computational Linguistics (p. 271). Association for Computational Linguistics.
[39] Kenyon-Dean, K., Ahmed, E., Fujimoto, S., Georges-Filteau, J., Glasz, C., Kaur, B., Lalande, A., Bhanderi, S., Belfer, R., Kanagasabai, N. and Sarrazingendron, R., 2018. Sentiment Analysis: It’s Complicated!. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (Vol. 1, pp. 1886-1895).
[40] Raza, A.A., Habib, A., Ashraf, J. and Javed, M., 2018. Semantic Orientation Based Decision Making Framework for Big Data Analysis of Sporadic News Events. Journal of Grid Computing, pp.1-17.
[41] Yu, H. and Hatzivassiloglou, V., 2003, July. Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. In Proceedings of the 2003 conference on Empirical methods in natural language processing (pp. 129-136). Association for Computational Linguistics.
[42] Maynard, D. & Funk, A., 2012. Automatic Detection of Political Opinions in Tweets. In CEUR Workshop Proceedings. pp. 88–99, Venezia, Italy.
[43] Haddi, E., 2015. Sentiment analysis: text, pre-processing, reader views and cross domains (Doctoral dissertation, Brunel University London).
[44] Etaiwi, W. and Naymat, G., 2017. The Impact of applying Different Preprocessing Steps on Review Spam Detection. Procedia Computer Science, 113, pp.273-279.