DATA PREPROCESSING IN SENTIMENT ANALYSIS USING TWITTER DATA

Article · July 2019

2 authors:
T. Nikil Prakash Ph.D, St. Joseph's College of Tiruchchirappalli
Aloysius Amalanathan, St. Joseph's College of Tiruchchirappalli

All content following this page was uploaded by T. Nikil Prakash Ph.D on 25 July 2019.



INTERNATIONAL EDUCATIONAL APPLIED RESEARCH JOURNAL (IEARJ)
Volume 03, Issue 07, July 2019
E-ISSN: 2456-6713

DATA PREPROCESSING IN SENTIMENT ANALYSIS USING TWITTER DATA

T. Nikil Prakash¹, Dr. A. Aloysius²
¹ Research Scholar, Department of Computer Science, St. Joseph’s College, Trichy, India.
² Assistant Professor, Department of Computer Science, St. Joseph’s College, Trichy, India.

Abstract: Data preprocessing is an important step for Data Mining (DM) algorithms. Twitter data is an unstructured data set: a collection of information in which people express their feelings, opinions, attitudes, product reviews, emotions, etc. This type of information is growing day by day on the internet. Many companies want to analyze customers' opinions of their products and services. The proposed work analyzes trending Twitter information and collects a variety of information from users. It improves the accuracy of Twitter data analysis and makes it easy to identify people's reactions or opinions. Additionally, it improves the performance of the data preprocessing tool.

Keywords: Twitter, Data preprocessing, Sentiment analysis, Data cleaning, Data preparation.

INTRODUCTION:

Big data preprocessing is a crucial task for the many researchers, administrators, organizations and companies that collect and analyze huge amounts of specific data or information [1]. Different or similar types of data are collected from different web sources and places. Such data can be improperly measured, which shows up as noisy, missing, wrong or inconsistent data. Low-quality data leads to wrong results and wrong conclusions in data analysis, pattern recognition and decision making. A particular challenge is that mixed and unstructured data must be preprocessed into structured or ordered data representations.

Data preparation is highly recommended for many reasons, such as data or database quality, the data analysis process, and the possibility of applying related algorithms to remove noisy and missing data and increase data reliability; in short, high-quality data models require high-quality data [2].

Sentiment analysis (SA) involves taking a given text, first preprocessing it to detect stop words, symbols, etc., and then checking its subjective content. The polarity of the opinion content is then determined by either machine learning methods or lexicon-based methods. SA categorizes the content as positive, negative or neutral. SA also makes use of context-dependent knowledge; for example, a single word can have multiple meanings. This can be resolved by applying the proper context: using knowledge of the context increases the accuracy of sentiment classification [3].

LITERATURE REVIEW:

Stamatios-Aggelos N. Alexandropoulos et al. [4] proposed a data preprocessing algorithm for predictive data mining algorithms. It enables knowledge discovery from large datasets. DM algorithms cannot provide good results when they receive unreliable or noisy data. The most well-known and widely used up-to-date algorithms for the data preprocessing steps are presented, and their performance is compared. The DM algorithm has to handle data whose quality suffers because it is too large or contains noisy values. Moreover, various classification algorithms are applied to similar problems.

Salvador García et al. [5] proposed data preprocessing methods, characteristics and approaches. They relate big data and data preprocessing across all families of methods and technologies, and cover the state of the art in big data. Additionally, the work focuses on the development of big data frameworks, for example Hadoop, Spark and Flink. These applications and methods follow new learning paradigms. They described the key issues of big data preprocessing and covered the families of data preprocessing methods for big data, namely feature selection, imperfect data, imbalanced learning and instance reduction. Moreover, they covered developments in big data preprocessing frameworks.

A. Astorino et al. [6] proposed data preprocessing for semi-supervised SVM classification. Semi-supervised SVM models use both labeled and unlabeled data in the optimization, with the aim of finding a good-quality separating hyperplane. There are two types of approach: mixed integer linear programming problems and continuous optimization problems. Both kinds of problem are very hard to solve and increase the computational cost of the model. The preprocessing reduces the number of unlabeled points and increases classification performance.

Matthew J. Denny et al. [7] proposed unsupervised learning methods for political science text data research and preprocessing decisions. They introduced a statistical procedure and software that examine the sensitivity of findings. This approach helps in understanding the researcher's problem by characterizing the variability caused by changes in preprocessing. These techniques analyze how sensitive the researcher's results are to preprocessing and support the resulting decisions. Data preprocessing is compared under different specifications on the relevant documents. The approach is theoretically grounded and decreases the risk for researchers.

Vivek Kumar et al. [8] proposed the source language sentence extraction (SLSE) framework model for a language translator. SLSE creates a bilingual dictionary, N-grams, an inverse term index, etc. SLSE works on a training data set, so building the model framework is a very difficult task. Sentence selection is described by a well-defined function, and sentences are extracted according to the frequency of each generated query.

Shichao Zhang et al. [9] proposed data preparation and data cleaning for data mining using machine learning systems. Data preparation is very important for research work, and it uncovers the critical issues in a dataset. They described possible directions for data preparation and identified many of its challenges and issues. Data preparation is a very important criterion for data mining algorithms because real-world data needs improvement and good system performance requires high-quality data. Moreover, quality data produces concentrative patterns.

S. B. Kotsiantis et al. [10] proposed various algorithms for data preprocessing techniques to obtain the best performance on a data set. Performance depends on the quality of the instance data. Irrelevant and redundant data, or noisy and unreliable data, make knowledge discovery during the training phase more difficult. In machine learning problems, data preparation is also known to take a considerable amount of processing time, depending on the amount of data. The paper described data preprocessing steps for machine learning algorithms. Such algorithms can produce less accurate and less understandable results, or fail to discover anything at all. The preprocessing steps address several problems, including noisy data, redundant data and missing data.

Table 1: The comparison results of existing work.

Method | Types of approaches | Data source | Responsibility: Advantages | Responsibility: Disadvantages
S. B. Kotsiantis et al. [10] | Machine learning | Twitter | Resolves noisy data, missing data and redundancy | Less accurate; less understandable
Stamatios-Aggelos N. Alexandropoulos et al. [4] | Data mining | Twitter | Handles large data | Cannot provide better results
Salvador García et al. [5] | Lexicon and machine learning | Social media | Covers feature selection, imperfect data, etc. | --
A. Astorino et al. [6] | Machine learning | Social media | Reduces the number of unlabeled points; increases performance | --
Matthew J. Denny et al. [7] | Machine learning | Twitter | Decreases the risk for researchers | Understanding the researchers' opinions
Vivek Kumar et al. [8] | Machine learning | Social media and blogs | Language sentence extraction | Cannot understand other languages
Shichao Zhang et al. [9] | Machine learning | Social media | Produces concentrative patterns; high-quality performance | Data cleaning problems

PROPOSED WORK:

Data Preprocessing:
Data preprocessing transforms the data into a basic form that is easy to work with. It is an important technique for improving the performance of Data Mining algorithms. Several preprocessing tools are available in DM. Figure 1 shows the data preprocessing techniques for Twitter data.
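As a rough illustration of what transforming a raw tweet into a "basic form" can look like, the following Python sketch lowercases the text, strips links, mentions and punctuation, and drops stop words. The regular expressions, the tiny stop-word list and the function name `clean_tweet` are assumptions made for this example; the paper does not specify them.

```python
import re
import string
from typing import List

# Hypothetical, deliberately tiny stop-word list (plus the Twitter token "rt");
# a real pipeline would use a fuller list.
STOP_WORDS = {"rt", "a", "an", "the", "is", "are", "was", "to", "of",
              "and", "in", "on", "for", "it", "this"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Maps every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_tweet(text: str) -> List[str]:
    """Return the lowercase content tokens of a raw tweet."""
    text = text.lower()
    text = URL_RE.sub(" ", text)       # drop links
    text = MENTION_RE.sub(" ", text)   # drop @user mentions
    text = text.translate(PUNCT_TABLE) # "great!!!" -> "great", "#tag" -> "tag"
    tokens = text.split()
    # Remove stop words and tokens that are only digits.
    return [tok for tok in tokens if tok not in STOP_WORDS and not tok.isdigit()]


if __name__ == "__main__":
    raw = "RT @news: The results are GREAT!!! #ElectionResults2019 https://example.com/x"
    print(clean_tweet(raw))  # ['results', 'great', 'electionresults2019']
```

The hashtag word is kept and only the `#` symbol is dropped, on the assumption that hashtags carry topical information worth retaining.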


Figure 1: Twitter Data Preprocessing

Data Preparation:
Data preparation is a set of techniques that initialize the data properly so that it can serve as input for a given DM algorithm. In this paper, the data is collected from Twitter and relates to the Indian parliament election results.

Data Cleaning:
Data cleaning is an important technique in a data preprocessing tool and is part of the DM process. It removes erroneous data and reduces unnecessary information. Handling missing data is also part of data cleaning. The presence of noisy data may affect the intrinsic characteristics of a classification problem.

Stemming:
Stemming is the process of removing inflectional affixes from words, for example playing → play, studies → study. Stemming works on particular languages, mainly English and Spanish.

Lemmatization:
Lemmatization takes the morphological analysis of words into consideration. It properly reduces inflected words to the root words they belong to in the sentence. The root word is called the lemma, which is the dictionary form, citation form or canonical form of a set of words.

Data Reduction:
Data reduction is a set of techniques that, in one way or another, obtain a reduced representation of the original data. Data preparation is distinguished from it as the set of techniques needed to make the data appropriately suit the input of the DM task.

Data Normalization:
Data normalization is used when attributes are expressed in different measurement units; it brings the attributes into a common range or scale.

a. Min-Max Normalization:
$$v' = \frac{v - \min}{\max - \min}\,(\mathrm{new\_max} - \mathrm{new\_min}) + \mathrm{new\_min}$$

b. Z-Score Normalization:
$$v' = \frac{v - \mathrm{mean}}{\mathrm{stand\_dev}}$$

where v is the old feature value and v' is the new one; min, max, mean and stand_dev are the minimum, maximum, mean and standard deviation of the feature, and [new_min, new_max] is the target range.

Sentiment Classification:
In general, the sentiment score of each tweet is calculated as shown below:
Score = number of positive words - number of negative words
If Score > 0, it is positive sentiment
If Score < 0, it is negative sentiment
If Score = 0, it is neutral sentiment

Algorithm:
Input: Training Twitter data set t, positive sentiment Pt, negative sentiment nt, neutral sentiment nut
Output: Sentiment classifier C, total tweets tw
Step 1: Import the Twitter API
Step 2: Set the Twitter authentication // create Twitter account API
Step 3: Create the class and set the data cleaning method
        Create the stemming and lemmatization function
Step 4: Try
        If sentiment > 0 : positive
        Else if sentiment < 0 : negative
        Else : neutral
Step 5: Main definition
        Call the Twitter class function
Step 6: Find the percentage of Twitter sentiment score // accuracy
Step 7: Print the Twitter data

Table 2: Twitter data sentiment calculation

Total Twitter Words | 1000
Positive Sentiment Percentage | 32 %
Negative Sentiment Percentage | 18 %
Neutral Sentiment Percentage | 39 %
Total Accuracy Percentage | 89 %
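The scoring rule of the Sentiment Classification subsection and steps 4 and 6 of the algorithm can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the two word lists are hypothetical mini-lexicons, the input is assumed to be already cleaned and tokenized, percentages are computed per tweet, and the Twitter API steps (1 and 2) are left out.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Hypothetical mini-lexicons for illustration; the paper does not list its word lists.
POSITIVE_WORDS = {"good", "great", "win", "happy", "best"}
NEGATIVE_WORDS = {"bad", "worst", "lose", "sad", "fraud"}

LABELS = ("positive", "negative", "neutral")


def sentiment_score(tokens: Iterable[str]) -> int:
    """Score = number of positive words - number of negative words."""
    tokens = list(tokens)
    positives = sum(tok in POSITIVE_WORDS for tok in tokens)
    negatives = sum(tok in NEGATIVE_WORDS for tok in tokens)
    return positives - negatives


def classify(score: int) -> str:
    """Map a score to a label: > 0 positive, < 0 negative, 0 neutral."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def sentiment_percentages(tweets: List[List[str]]) -> Dict[str, float]:
    """Percentage of tweets per label; all zeros for an empty input."""
    counts = Counter(classify(sentiment_score(tokens)) for tokens in tweets)
    total = len(tweets)
    return {label: (100.0 * counts[label] / total if total else 0.0)
            for label in LABELS}


if __name__ == "__main__":
    sample = [
        ["great", "win"],            # score  2 -> positive
        ["worst", "fraud", "good"],  # score -1 -> negative
        ["results", "today"],        # score  0 -> neutral
        ["happy"],                   # score  1 -> positive
    ]
    print(sentiment_percentages(sample))
```

A tweet that contains no lexicon words at all scores 0 and is counted as neutral, which follows directly from the rule Score = 0.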

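The two formulas of the Data Normalization subsection can be illustrated with a short Python sketch. The function names and the handling of constant features are assumptions for this example, and the population standard deviation is used because the paper does not say which variant is meant.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def min_max_normalize(values: Sequence[float],
                      new_min: float = 0.0,
                      new_max: float = 1.0) -> List[float]:
    """v' = (v - min) / (max - min) * (new_max - new_min) + new_min"""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: the formula would divide by zero, so map to new_min.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score_normalize(values: Sequence[float]) -> List[float]:
    """v' = (v - mean) / stand_dev, using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Constant feature: every value equals the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


if __name__ == "__main__":
    feature = [2, 4, 6, 8]
    print(min_max_normalize(feature))         # rescaled to [0, 1]
    print(min_max_normalize(feature, -1, 1))  # rescaled to [-1, 1]
    print(z_score_normalize(feature))         # zero mean, unit standard deviation
```

Min-max normalization keeps the values inside a fixed target range, while z-score normalization centres them on zero, so the choice depends on whether the downstream DM algorithm expects bounded inputs.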

Table 2 shows the percentage results of the Twitter sentiment analysis. This research work collected 1000 Twitter words from the Twitter website using the hashtag #electionresults2019. The proposed algorithm makes it easy to identify people's opinions and to find the sentiment score of Twitter sentiment data.

CONCLUSION:

Data preprocessing is part of the DM process. The proposed work identifies people's sentiment about the 2019 election results. The algorithm improves the accuracy of Twitter data analysis. Noisy data and stop words are removed, which gives better results from the data cleaning algorithms. The algorithm makes it easy to classify the results after the data preprocessing techniques are applied. Future work is to improve the data quality and to find the context words in sentiment, and moreover to further improve the accuracy on Twitter data.

REFERENCES:
[1] Jayaram Hariharakrishnan, Mohanavalli S., Srividya, and Sundhara Kumar K. B., "Survey of Pre-processing Techniques for Mining Big Data," IEEE International Conference on Computer, Communication, and Signal Processing (ICCCSP-2017), Year: 2017
[2] Chen Min, Shiwen Mao, and Yunhao Liu, "Big data: a survey," Mobile Networks and Applications, PP: 171-209, Year: 2014
[3] Przemyslaw Grzegorzewski and Andrzej Kochanski, "Data Preprocessing in Industrial Manufacturing," Springer Nature Switzerland AG, DOI: https://doi.org/10.1007/978-3-030-03201-2_3, Year: 2019
[4] Stamatios-Aggelos N. Alexandropoulos, Sotiris B. Kotsiantis and Michael N. Vrahatis, "Data preprocessing in predictive data mining," Cambridge University Press, Vol. 34, 1–33, DOI: 10.1017/S026988891800036X, Year: 2019
[5] Salvador García, Sergio Ramírez-Gallego, Julián Luengo, Jose Manuel Benítez, and Francisco Herrera, "Big data preprocessing: methods and prospects," Big Data Analytics, DOI: 10.1186/s41044-016-0014-0, PP: 1-9, Year: 2016
[6] A. Astorino, E. Gorgone, M. Gaudioso and D. Pallaschke, "Data preprocessing in semi-supervised SVM classification," DOI: 10.1080/02331931003692557, VOL: 60:1-2, PP: 143-151, Year: 2017
[7] Matthew J. Denny and Arthur Spirling, "Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It," DOI: https://doi.org/10.1017/pan.2017.44, Year: 2017
[8] Vivek Kumar, Abhishek Verma, Namita Mittal and Sergey V. Gromov, "Anatomy of Preprocessing of Big Data for Monolingual Corpora Paraphrase Extraction: Source Language Sentence Selection," Springer Nature Singapore Pte Ltd, DOI: https://doi.org/10.1007/978-981-13-1501-5_43, Year: 2019
[9] Shichao Zhang, Chengqi Zhang and Qiang Yang, "Data preparation for data mining," Applied Artificial Intelligence, DOI: 10.1080/713827180, VOL: 17:5-6, PP: 375-381, Year: 2018
[10] S. B. Kotsiantis, D. Kanellopoulos and P. E. Pintelas, "Data Preprocessing for Supervised Learning," International Journal of Computer Science, Volume 1, ISSN 1306-4428, Year: 2006
