Data Preprocessing in Sentiment Analysis Using Twitter Data: July 2019
All content following this page was uploaded by T. Nikil Prakash Ph.D on 25 July 2019.
Abstract: Data preprocessing is an important step for Data Mining (DM) algorithms. Twitter data is an unstructured data set: a collection of information in which people express their feelings, opinions, attitudes, product reviews, emotions, etc. This type of information is growing day by day on the internet. Many companies want to analyze customers' opinions about their products and services. The proposed work analyzes Twitter trending information and collects various information from users. It improves the accuracy of sentiment analysis on Twitter data and makes it easy to identify people's reactions or opinions. Additionally, it improves the performance of the data preprocessing tool.
Keywords: Twitter, Data preprocessing, Sentiment analysis, Data cleaning, Data preparation.
INTERNATIONAL EDUCATIONAL APPLIED RESEARCH JOURNAL (IEARJ)
Volume 03, Issue 07, July 2019
E-ISSN: 2456-6713
problem by providing a characterization of how results vary with changes in preprocessing. These techniques analyze the sensitivity of a researcher's results to preprocessing decisions and compare different preprocessing specifications across the relevant documents. The approach is theoretically grounded and decreases the risk for researchers.

Vivek Kumar et al. [8] proposed a source language sentence extraction (SLSE) framework model for a language translator. SLSE creates a bilingual dictionary, N-grams, an inverse term index, etc. SLSE requires a training data set, so building the model framework is a very difficult task. Sentence selection is described by a well-defined function, and sentences are extracted according to the frequency of each generated query.

Shichao Zhang et al. [9] proposed data preparation and data cleaning for data mining using machine learning systems. Data preparation is very important for research work because it exposes critical issues in the dataset. The authors described possible directions for data preparation and set out its many challenges and issues. Data preparation is a very important criterion for data mining algorithms because real data is often impure and system performance depends on high-quality data. Moreover, high-quality data produces concentrative patterns.

S. B. Kotsiantis et al. [10] presented various algorithms for data preprocessing that improve performance on a data set. Preprocessing determines the quality of the instance data. Irrelevant and redundant information, or noisy and unreliable data, makes the knowledge discovery phase more difficult. In machine learning problems, data preparation is also known to take a considerable amount of processing time, depending on the amount of data. The data preprocessing steps are described for machine learning algorithms. These algorithms can produce less accurate and less understandable results, or fail to discover anything. The preprocessing steps address several problems, including noisy data, redundant data, and missing data.
Method | Types of approaches | Data source | Responsibility: Advantages | Responsibility: Disadvantages
S. B. Kotsiantis et al. | Machine learning | Twitter | Resolve noisy data, missing data, and redundancy | Less accurate; less understandable
Stamatios-Aggelos N. Alexandropoulos et al. | Data mining | Twitter | Handled large data | Cannot provide better results
Salvador García et al. | Lexicon and machine learning | Social media | Covered feature selection, imperfect data, etc. | --
A. Astorino et al. | Machine learning | Social media | Reduced the number of unlabeled points; increased performance | --
Matthew J. Denny et al. | Machine learning | Twitter | Decreasing the risk for researchers | Understanding the researchers' opinions
Vivek Kumar et al. | Machine learning | Social media and blogs | Language sentence extraction | Cannot understand the other language
Shichao Zhang et al. | Machine learning | Social media | Produces concentrative patterns; high-quality performance | Data cleaning problems
Data Normalization:
Data normalization rescales attributes that are expressed in different measurement units or ranges to a common scale.
a. Min-Max Normalization:
$$v' = \frac{v - \min}{\max - \min}\,(\text{new\_max} - \text{new\_min}) + \text{new\_min}$$
b. Z-Score Normalization:
$$v' = \frac{v - \text{mean}}{\text{stand\_dev}}$$
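The two normalization formulas can be illustrated with a minimal Python sketch. The function names and the default target range [0, 1] are assumptions made for illustration, as is the use of the population standard deviation; the paper does not specify which variant it applies.

```python
from statistics import mean, pstdev


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map every value to new_min (assumed convention).
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]


def z_score_normalize(values):
    """Center values on the mean and divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        # No spread in the data; every z-score is defined here as 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


if __name__ == "__main__":
    print(min_max_normalize([10, 20, 30]))
    print(z_score_normalize([2, 4, 4, 4, 5, 5, 7, 9]))
```

Min-max normalization preserves the relative spacing of the values inside a fixed range, while z-score normalization is less sensitive to the exact minimum and maximum of the data.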
Figure 1: Twitter Data Preprocessing

Data Preparation:
Data preparation is a set of techniques that initialize the data properly so that it can serve as input to a DM algorithm. In this paper, the data is collected from Twitter and relates to the Indian parliament election results.

Sentiment Classification:
In general, the sentiment score for each tweet is calculated as shown below:
Score = number of positive words - number of negative words
If Score > 0, it is positive sentiment
If Score < 0, it is negative sentiment
If Score = 0, it is neutral sentiment
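The scoring rule above can be sketched in Python as follows. This is only an illustration: the word lists, the regular-expression tokenizer, and the function names are hypothetical and are not the lexicon or implementation used in the paper.

```python
import re

# Tiny illustrative word lists (assumptions, not the paper's lexicon).
POSITIVE_WORDS = {"good", "great", "win", "happy", "victory"}
NEGATIVE_WORDS = {"bad", "lose", "sad", "poor", "defeat"}


def tokenize(tweet):
    """Lowercase the tweet and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", tweet.lower())


def sentiment_score(tweet):
    """Score = number of positive words - number of negative words."""
    tokens = tokenize(tweet)
    positives = sum(1 for t in tokens if t in POSITIVE_WORDS)
    negatives = sum(1 for t in tokens if t in NEGATIVE_WORDS)
    return positives - negatives


def classify(tweet):
    """Map the score to a label using the thresholds stated in the text."""
    score = sentiment_score(tweet)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for text in ["Great victory today", "A sad and bad defeat", "Results at noon"]:
        print(text, "->", classify(text))
```

A word-count score of this kind ignores negation and context, so a tweet such as "not good" would still be counted as positive under this sketch.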
Table 2 shows the percentage results of the Twitter sentiment analysis. This research work collected 1000 Twitter words from the Twitter website using the hashtag #electionresults2019. The proposed algorithm makes it easy to identify people's opinions and find the sentiment score of Twitter sentiment data.

CONCLUSION:

Data preprocessing is a process that prepares data for DM algorithms. The proposed work identifies people's sentiment about the 2019 election results. The algorithm improves the accuracy of Twitter data. Noisy data and stop words are removed, which gives better results in the data cleaning algorithms. The algorithm makes it easy to classify results after data preprocessing. Future work will improve data quality and identify context words in sentiment, and will further improve the accuracy of Twitter data.

REFERENCES:

[1] Jayaram Hariharakrishnan, Mohanavalli S., Srividya, and Sundhara Kumar K. B., "Survey of Pre-processing Techniques for Mining Big Data," IEEE International Conference on Computer, Communication, and Signal Processing (ICCCSP-2017), Year: 2017.
[2] Chen Min, Shiwen Mao, and Yunhao Liu, "Big data: a survey," Mobile Networks and Applications, PP: 171-209, Year: 2014.
[3] Przemyslaw Grzegorzewski and Andrzej Kochanski, "Data Preprocessing in Industrial Manufacturing," Springer Nature Switzerland AG, DOI: https://doi.org/10.1007/978-3-030-03201-2_3, Year: 2019.
[4] Stamatios-Aggelos N. Alexandropoulos, Sotiris B. Kotsiantis, and Michael N. Vrahatis, "Data preprocessing in predictive data mining," Cambridge University Press, Vol. 34, PP: 1-33, DOI: 10.1017/S026988891800036X, Year: 2019.
[5] Salvador García, Sergio Ramírez-Gallego, Julián Luengo, Jose Manuel Benítez, and Francisco Herrera, "Big data preprocessing: methods and prospects," Big Data Analytics, DOI: 10.1186/s41044-016-0014-0, PP: 1-9, Year: 2016.
[6] A. Astorino, E. Gorgone, M. Gaudioso, and D. Pallaschke, "Data preprocessing in semi-supervised SVM classification," DOI: 10.1080/02331931003692557, VOL: 60:1-2, PP: 143-151, Year: 2017.
[7] Matthew J. Denny and Arthur Spirling, "Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It," DOI: https://doi.org/10.1017/pan.2017.44, Year: 2017.
[8] Vivek Kumar, Abhishek Verma, Namita Mittal, and Sergey V. Gromov, "Anatomy of Preprocessing of Big Data for Monolingual Corpora Paraphrase Extraction: Source Language Sentence Selection," Springer Nature Singapore Pte Ltd, DOI: https://doi.org/10.1007/978-981-13-1501-5_43, Year: 2019.
[9] Shichao Zhang, Chengqi Zhang, and Qiang Yang, "Data preparation for data mining," Applied Artificial Intelligence, DOI: 10.1080/713827180, VOL: 17:5-6, PP: 375-381, Year: 2018.
[10] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data Preprocessing for Supervised Learning," International Journal of Computer Science, Volume 1, ISSN 1306-4428, Year: 2006.