Article in Computer Applications in Engineering Education · May 2020
DOI: 10.1002/cae.22253
Author: Aytuğ Onan, İzmir Katip Çelebi University


Received: 10 November 2019 | Accepted: 21 April 2020
DOI: 10.1002/cae.22253

SPECIAL ISSUE ARTICLE

Sentiment analysis on massive open online course evaluations: A text mining and deep learning approach

Aytuğ ONAN
Department of Computer Engineering, Faculty of Engineering and Architecture, İzmir Katip Çelebi University, İzmir, Turkey

Correspondence: Aytuğ ONAN, Department of Computer Engineering, Faculty of Engineering and Architecture, İzmir Katip Çelebi University, 35620 İzmir, Turkey. Email: [email protected]

Abstract
Massive open online courses (MOOCs) are recent innovative approaches in distance education, which provide learning content to participants without age‐, gender‐, race‐, or geography‐related barriers. The purpose of our research is to present an efficient sentiment classification scheme with high predictive performance on MOOC reviews, by pursuing the paradigms of ensemble learning and deep learning. In this contribution, we seek to answer several research questions on sentiment analysis of educational data. First, the predictive performance of conventional supervised learning methods, ensemble learning methods, and deep learning methods has been evaluated. In addition, the efficiency of text representation schemes and word‐embedding schemes has been evaluated for sentiment analysis on MOOC evaluations. For the evaluation task, we have analyzed a corpus containing 66,000 MOOC reviews with the use of machine learning, ensemble learning, and deep learning methods. The empirical analysis indicates that deep learning‐based architectures outperform ensemble learning methods and supervised learning methods for the task of sentiment analysis on educational data. For all the compared configurations, the highest predictive performance has been achieved by long short‐term memory networks in conjunction with GloVe word‐embedding‐based representation, with a classification accuracy of 95.80%.

KEYWORDS
deep learning, massive open online courses, sentiment analysis, text mining

1 | INTRODUCTION

Information and communication technologies have considerably affected many domains, including education. Massive open online courses (MOOCs) are recent innovative approaches in distance education, which provide learning content to participants without age‐, gender‐, race‐, or geography‐related barriers. A MOOC can be defined as an online course provided to an unlimited number of participants from geographically scattered locations [33]. The number of people enrolled in MOOCs has been steadily increasing, and there are several platforms, such as Coursera, edX, MiríadaX, or FutureLearn, that provide free open courses from prestigious institutions worldwide to thousands of participants [47]. MOOC platforms generally employ conventional learning content, such as short video lectures, slide presentations, reading texts, problem sets, live chat, and online learning assessments [31,43]. In addition, MOOCs are generally supported by discussion forums to enhance the interactions between the stakeholders of the learning process. MOOCs are typically characterized by free registration and easy access to
Comput Appl Eng Educ. 2020;1–18. wileyonlinelibrary.com/journal/cae © 2020 Wiley Periodicals, Inc. | 1
educational information and resources. The use of MOOCs provides instructors the opportunity to reach many students throughout the world, and it also provides students the opportunity to select courses from a large set of courses offered by instructors from prestigious higher institutions, which may not be possible otherwise [6,34]. MOOCs support the conceptualization of ongoing professional learning and lifelong personalized learning [21]. In addition, MOOCs are a means to provide a more accessible and democratized model of higher education [35]. MOOCs have, however, been constrained by several disadvantages. The completion rates for MOOCs are very low compared with conventional face‐to‐face courses or closed online courses. For instance, Breslow et al [15] identified a completion rate of 15% for a physics MOOC offered in 2012 to 154,763 students. Similarly, Ho et al [30] examined the completion rates of the 17 HarvardX and MITx MOOCs offered between Fall 2012 and Summer 2013, identifying that only 5% of the students received the course certificate. In addition, instructor‐student interaction and learning assessment are limited in MOOCs [31].

Educational data mining (EDM) is an emerging research field concerned with the application of tools and techniques from data mining, machine learning, and statistics to data obtained from educational settings, with the aim of better understanding students and the settings of the learning process [65]. EDM can provide educational policy makers useful insights to improve the efficiency and quality of teaching and learning [56]. Classification, clustering, association rule mining, and statistical techniques have been employed on educational data. Students' learning performance assessment and monitoring, dropout and retention prediction, and modeling of learning behavior are several application tasks for EDM [7]. Text mining is the process of extracting useful information from unstructured text documents. Compared with the other applications, such as classification, clustering, association rule mining, or regression, the number of earlier works dedicated to text mining on educational data is very limited [7]. In the domain of text mining, the main techniques may be broadly classified as exploratory analysis, concept extraction, summarization, text categorization, and sentiment analysis [17].

Sentiment analysis (also known as opinion mining) is the computational field concerned with contextually mining unstructured text documents so that subjective information (such as opinions, sentiments, attitudes, evaluations, or emotions) can be extracted [54]. With the use of sentiment analysis, structured and insightful knowledge can be obtained from unstructured text documents, which can be useful for decision support and individual decision makers [52]. Sentiment analysis can be employed for many different tasks, such as assessment of online hotel reviews [38], identification of opinions toward the refugee crisis [74], and disaster management [64]. In the education domain, sentiment analysis can be utilized to evaluate the progress of group discussion [22], to recognize and regulate e‐learners' emotions [66], and to identify learning‐related emotions of students in text feedback [9].

Sentiment analysis methods have typically been grouped into two classes, that is, lexicon‐based methods and machine learning‐based methods [44]. In the lexicon‐based methods, the sentiment orientation of a text document is identified with the use of a dictionary with positive and negative sentiment values for each word. In this way, the semantic orientation of words and phrases is computed to identify the sentiment of the text. In contrast, the machine learning‐based methods model sentiment analysis as a supervised learning task. In this scheme, the learning model is obtained by training a supervised learning algorithm on a labeled set of text documents. In machine learning‐based sentiment analysis, conventional supervised learning algorithms, such as the Naïve Bayes algorithm, support vector machines, and the k‐nearest neighbor algorithm, have been successfully employed. Ensemble learning and deep learning methods may also be employed for sentiment analysis tasks. Ensemble learning (also known as multiple classifier systems) is the process of training multiple learning algorithms and combining their predictions by regarding them as a committee of decision makers to enhance the predictive performance of the learning model. Ensemble learning has been successfully employed in a wide range of application fields, including energy demand prediction [60], medical data analysis [49], and text genre classification [61]. For instance, Rodger [60] analyzed the predictive performance of machine learning algorithms (such as artificial neural networks and k‐nearest neighbor) for energy demand prediction, presenting an ensemble classification scheme based on the fuzzy‐rough nearest neighbor algorithm. Similarly, Rodger [49] presented an ensemble classification scheme based on the k‐nearest neighbor algorithm for medical data analysis.

Deep learning is an emerging field of machine learning that processes data on the basis of multiple layers or stages of nonlinear information processing in a hierarchical way [19]. Ensemble learning and deep learning can yield promising results for sentiment analysis tasks [25,58].

In this study, we have collected 66,000 student reviews for MOOCs from coursetalk.com, a comprehensive platform for MOOC reviews. We present a text mining approach to analyze MOOC reviews with the use of machine learning, ensemble learning, and deep learning methods. In the machine learning‐based approach, we utilized three term weighting schemes (i.e., term
presence [TP], term frequency [TF], and TF‐IDF). The representation schemes have been evaluated in conjunction with five supervised learners (i.e., Naïve Bayes, support vector machines, logistic regression, k‐nearest neighbor, and random forest) and five ensemble learning methods (i.e., AdaBoost, Bagging, Random Subspace, voting, and Stacking). In the deep learning‐based approach, we have utilized three word‐embedding schemes (i.e., word2vec, fastText, and GloVe) in conjunction with five deep learning architectures (i.e., convolutional neural network, recurrent neural network, bidirectional recurrent neural network with attention mechanism, gated recurrent unit, and long short‐term memory). To the best of our knowledge, this is the first comprehensive study on sentiment analysis of massive open online course reviews in which the predictive performances of conventional classification algorithms, ensemble learning methods, and deep learning algorithms have been reported. In the empirical analysis, the following research questions have been addressed:

(1) Is there a statistically meaningful difference between the predictive performance of conventional supervised learning methods, ensemble learning algorithms, and deep learning methods?
(2) Do ensemble methods enhance the predictive performance of supervised learning methods?
(3) Which text representation scheme yields the highest predictive performance for sentiment analysis on MOOC reviews?
(4) Is there a significant difference in predictive performance between different word‐embedding schemes on MOOC reviews?
(5) Is there a significant difference in predictive performance between different deep learning architectures on MOOC reviews?

The rest of this paper is structured as follows. In Section 2, related works on sentiment analysis are presented. Section 3 presents the materials and methods of the study, namely the text corpus, text representation schemes, machine learning classifiers, ensemble learning methods, word‐embedding schemes, and deep learning architectures. Section 4 reports the experimental procedure and the empirical results. Finally, concluding remarks are presented in Section 5.

2 | RELATED WORKS

Sentiment analysis on educational data can be employed to obtain feedback on learning content and resources, which can provide useful insights to enhance the quality of learning content and to identify the learning behavior of students. This section briefly presents the earlier works in the field. Adamopoulos [2] analyzed user‐generated online reviews about MOOCs to identify the effects of factors, such as course, platform, and university, on student retention. In another study, Valakunde and Patwardhan [67] employed sentiment analysis on review comments provided by students. In this scheme, the term frequency‐inverse document frequency (TF‐IDF) weighting scheme has been utilized to represent review comments in conjunction with two machine learning algorithms, that is, Naïve Bayes and support vector machines. Similarly, Wen et al [71] employed sentiment analysis on MOOC discussion forums. In this scheme, forum posts from three MOOCs have been utilized to identify dropout characteristics of students. The analysis revealed that there is a significant correlation between the sentiment expressed in course forum posts and the completion rates for MOOCs. In another study, Altrabsheh et al [9] introduced a machine learning‐based sentiment analysis scheme to extract learning‐related emotions of students from text feedback. In this scheme, student feedback, opinions, and feelings about different courses, such as calculus, communication skills, database, and engineering, have been collected using Twitter. To represent text documents, three conventional N‐gram models (namely unigram, bigram, and trigram) and their combinations have been examined. In the classification phase, the Naïve Bayes algorithm, support vector machines, the maximum entropy classifier, and the random forest algorithm have been utilized. In another study, Adinolfi et al [3] presented a sentiment analysis framework to examine student satisfaction on different platforms, such as massive open online courses, learning diaries, and Twitter. In the presented scheme, the students' and teachers' behaviors have also been modeled. Similarly, Bogdan [12] employed sentiment analysis to improve course content and to identify students' opinions regarding the integration of MOOCs into embedded system blended courses. In another study, machine learning‐based sentiment analysis has been employed to mine opinions from students' comments about the performance of instructors [26]. In the presented scheme, support vector machines and the random forest algorithm have been applied for sentiment analysis. Moreno‐Marcos et al [46] presented empirical results for lexicon‐based and machine learning‐based approaches to sentiment classification on forum messages in MOOCs to extract patterns of learners' behavior. In this scheme, logistic regression, support vector machines, decision trees, random forest, and the Naïve Bayes algorithm have been utilized as the supervised learning methods. The empirical analysis indicated that the random forest
algorithm outperforms the other supervised learning methods and the lexicon‐based framework for sentiment analysis on MOOC reviews. Text mining and sentiment analysis have also been employed to extract information about the drivers of higher educational institutions' online success [63]. In the presented scheme, topic modeling and topic profiling analysis methods have been employed to enhance the international attractiveness of higher educational institutions. In another study, Abdi et al [1] presented a query‐based, multi‐document, opinion‐oriented summarization approach. The presented scheme utilizes sentiment analysis to extract sentiment orientation and subjective information. In the summarization module, sentences relevant to the user's query have been identified. Belbachir and Boughanem [10] utilized language models used in information retrieval to represent the query and document for sentiment analysis. Moreover, Al‐Smadi et al [8] employed morphological, syntactic, and semantic features for sentiment analysis.

Recently, Bustillos et al [53] presented a comprehensive analysis of machine learning and deep learning methods for opinion mining in an intelligent learning environment. In this work, several machine learning algorithms (such as Bernoulli Naïve Bayes, multinomial Naïve Bayes, support vector machines, linear support vector machines, stochastic gradient descent, and the k‐nearest neighbor algorithm) and several deep learning architectures (such as convolutional neural networks and long short‐term memory) have been employed. The highest predictive performance, with a classification accuracy of 88.26%, has been obtained with the use of a deep learning‐based architecture. Similarly, Cabada et al [16] employed two deep learning‐based architectures (i.e., convolutional neural network and long short‐term memory) on educational reviews, obtaining a classification accuracy of 84.32%. Nguyen and Nguyen [48] introduced a convolutional N‐gram bidirectional LSTM word‐embedding architecture for sentiment analysis on video comments. In the presented scheme, a word has been represented with semantic and contextual information over short and long distances. Lin et al [40] examined the predictive performance of knowledge‐based and machine learning‐based approaches for sentiment analysis on student evaluations of teaching. In another study, López et al [42] presented a framework on the basis of opinion mining and semantic profiling on an educational resource platform. Recently, Onan [50] examined the predictive performance of conventional classification algorithms, ensemble methods, and deep learning algorithms on student evaluations of teaching. In another study, Wang et al [70] presented a hybrid deep learning‐based scheme for sentiment analysis on the basis of convolutional neural networks and long short‐term memory. Similarly, Wang et al [69] introduced a stacked residual long short‐term memory‐based architecture to identify the sentiment intensity of text documents.

3 | METHODOLOGY

In this section, the text corpus, the machine learning‐based approach for sentiment analysis, and the deep learning‐based approach for sentiment analysis are presented.

3.1 | Corpus

To collect a text corpus on MOOC reviews, we have crawled coursetalk.com, a comprehensive platform for MOOC reviews. Course reviews from a diverse range of fields, such as accounting, algebra, aerospace engineering, agriculture, computer science, data science, and education, have been considered. In this way, we have collected approximately 93,000 MOOC reviews. On this platform, feedback on courses can be provided with the use of a 5‐point scale score, from which an overall quality score is computed for a course. To obtain a labeled corpus, we have utilized the quality scores provided by the learners. In this way, evaluation reviews with quality scores of 1 and 2 have been labeled as "negative," whereas evaluation reviews with quality scores of 4 and 5 have been labeled as "positive." After the labeling process, we obtained a corpus with approximately 33,000 negative reviews and approximately 37,000 positive reviews. To obtain a balanced corpus, our final corpus consists of 66,000 reviews, with 33,000 negative and 33,000 positive sentiments. In Table 1, sample MOOC reviews from the corpus are presented.

To process text documents with machine learning algorithms, several preprocessing tasks should be conducted. For the preprocessing task, we have adopted the preprocessing stages outlined in [1,48]. First, we employed text normalization. In this stage, all letters in the text corpus have been converted into lowercase letters. All punctuation marks have been eliminated. Abbreviations have been converted into their expanded versions. In addition, URLs, stop words, irrelevant words, and sparse terms have been eliminated. Tokenization (the process of separating the sentences and words of the documents into tokens) has been performed. In addition, stemming has been performed on the text corpus to reduce words to their word stems. For the stemming task, we have utilized the Snowball stemming algorithm [57]. Text preprocessing tasks have been implemented with the Natural Language Toolkit [41].
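The preprocessing pipeline described in Section 3.1 can be sketched in a few lines. This is a simplified, library-free stand-in: the paper uses the Natural Language Toolkit with the Snowball stemmer and additionally expands abbreviations and removes sparse terms, steps omitted here, and the small stop-word list is an illustrative assumption rather than the one actually used.

```python
import re

# Illustrative stop-word list (assumption: the paper does not list the
# exact stop words it removes).
STOP_WORDS = {"the", "a", "an", "and", "are", "is", "it", "of", "to", "i", "this"}

def preprocess(review):
    """Lowercase, strip URLs and punctuation, tokenize, drop stop words.
    Stemming (Snowball, via NLTK in the paper) is intentionally omitted."""
    text = review.lower()                      # text normalization
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # remove punctuation and digits
    tokens = text.split()                      # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The videos are very informative and subtitled!"))
# -> ['videos', 'very', 'informative', 'subtitled']
```

Each review is thus reduced to a normalized token list before feature construction.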
TABLE 1 Sample massive open online course reviews of students

Sentiment orientation MOOC evaluation review


Negative I was impressed in the beginning, but afterward I figured out that it is a total waste of time and money, as
everything is made to demonstrate how muscular and strong the professor is, and how they can invent
puzzling exercise and the hardest problems ever. This does not convey knowledge; also, the irrational
amount of content in the given time is for a full‐time student, not for a self‐based student, as they
claimed
Negative Course content/quality is very average, and it is even loosely connected. What is the use of learning
association rules mining without learning even all the different rules? There is no application of those
things to real‐world examples; the course only deals with the coding part—just complete the coding.
How do we even justify 500 bucks for this? Plus, the lack of communication between staff and students is
really demoralizing. Bugs in notebooks are reported in forum for days and there are still no replies or
solution from staff. Audit learners cannot even get access to exams despite being informed that there is
an exam at the beginning of the course. Seriously, it should be assured that the quality is of MicroMaster
level before charging 500 for it. The Supply Chain MIT courses charge only 150–200, with much better
content/organizations.
Negative The lessons do not teach much and do not help at all with the "homework" assignments. I would not
recommend this course.
Positive This course covers all the important aspects of software engineering from requirement elicitation and
architectural patterns to build and release management. The short videos, quizzes, and programming
exercises helped to understand the topics, and the instructors answered open questions using Slack. I
really enjoyed this course and want to thank all the instructors for the effort they put into this course!
Positive Great course. The videos are very informative and subtitled. The single‐graded in‐course quizzes provide a
frequent check‐up for the learned content. The final transcribing projects are a team effort. I will
certainly enroll in the next course of the Deciphering Secrets series as well.
Positive It is the best online course available for developing basic CFD and FEA concepts, and learning Ansys side by
side. I really look forward to an advanced course or may be a specialization series in CFD and/or FEA
with Ansys.

Abbreviation: CFD, computational fluid dynamics; FEA, finite element analysis; MOOC, massive open online courses.
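The score-to-label mapping of Section 3.1 can be sketched as follows. How 3-star reviews were handled is not stated in the text; excluding them is an assumption consistent with the two-class corpus, as is downsampling the larger class to balance it.

```python
# Reviews rated 1-2 become "negative" and 4-5 become "positive"; the neutral
# score 3 is excluded (assumption -- the paper does not say how it was handled).
def label_review(score):
    if score in (1, 2):
        return "negative"
    if score in (4, 5):
        return "positive"
    return None

reviews = [("great course", 5), ("waste of time", 1), ("okay", 3)]
labeled = [(text, label_review(score))
           for text, score in reviews
           if label_review(score) is not None]
# To balance the corpus, the larger class (positives here: ~37,000) would
# then be downsampled to the size of the smaller one (~33,000).
```

Applied to the crawled reviews, this rule yields the balanced 66,000-review corpus described above.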

3.2 | Machine learning‐based sentiment analysis

In machine learning‐based sentiment analysis, there are two main stages, namely the extraction of features from the data and their representation in terms of feature vectors, and the training of the supervised learning algorithms on the feature vectors to obtain the learning model. Based on the obtained learning model, the class labels for unseen instances are determined [36]. For machine learning‐based sentiment analysis, we have employed three term weighting schemes (i.e., TP, TF, and TF‐IDF) and three N‐gram models (namely the unigram, bigram, and trigram models). These text representation schemes have been utilized to train five supervised learning algorithms (i.e., Naïve Bayes, support vector machines, logistic regression, k‐nearest neighbor, and random forest) and five ensemble methods (i.e., the AdaBoost algorithm, Bagging, Random Subspace, voting, and Stacking). The rest of this section presents the text representation schemes, supervised learning algorithms, and ensemble learning methods utilized in the machine learning‐based sentiment analysis.

3.2.1 | Feature construction

To process text documents in conjunction with supervised learning algorithms, the conversion of documents into feature vectors is a crucial task. In text mining and information retrieval tasks, one common scheme that has been frequently and successfully employed is the bag‐of‐words (BOW) framework. In this framework, a text document is regarded as a bag of words and represented by a vector containing all the words encountered in the document, without taking into account syntax, word ordering, and grammar [27]. In this framework, each text document is represented on the basis of the frequency of each word. The set of features is utilized to train the supervised learning algorithm to obtain the learning model. Based on the bag‐of‐words framework, there are three types of weighting schemes that may be employed, namely the TP, TF, and TF‐IDF schemes. For TP‐based weighting, it is considered whether a word occurs in a text document or not. In this scheme, a binary‐valued feature vector represents each text document, such that one is used to indicate that a word has occurred and zero is used to indicate that
the word has not occurred in the document. For TF‐based weighting, the number of occurrences of each word encountered in a document is computed. In this way, frequently encountered words are assigned higher scoring values, whereas rarely encountered words are assigned lower scoring values. This may be problematic, as frequent words will have dominance over the rarely encountered ones. For some tasks in natural language processing, some rarely encountered words may be domain‐specific words that are more informative about the context. To eliminate the problems associated with TF‐based weighting, the inverse document frequency may be utilized to measure the frequency of rare words across the text documents. This scheme is known as TF‐IDF weighting. In this scheme, the frequencies of words are rescaled on the basis of the number of occurrences in all documents, so that frequently encountered words are penalized. In text mining tasks, the N‐gram model is an important representation scheme. In this model, n‐character slices of a text document are extracted. The common N‐gram models utilized in sentiment analysis are the unigram model (N = 1), the bigram model (N = 2), and the trigram model (N = 3). For the experimental analysis, we have modeled the text corpus using three weighting schemes (TF, TP, and TF‐IDF) and three N‐gram models (unigram, bigram, and trigram). In this way, nine different configurations have been obtained for the corpus.

3.2.2 | Supervised learning methods

To obtain learning models based on the feature sets outlined in Section 3.2.1, we have considered five supervised learning methods. The algorithms are briefly described below:

• The Naïve Bayes algorithm is a statistical supervised learning algorithm based on Bayes' theorem and the conditional independence assumption [39].
• Support vector machines (SVM) are linear supervised learning algorithms that may be utilized for classification and regression tasks. SVM finds a hyperplane in a higher‐dimensional space to separate instances of different classes [68].
• Logistic regression is a linear classification algorithm, which provides a scheme for applying linear regression to classification problems. In this scheme, a linear regression model and transformed target variables are employed to obtain a linear classification scheme [28].
• K‐nearest neighbor (KNN) is an instance‐based supervised learning algorithm that may be utilized for classification and regression tasks. In this scheme, an instance is assigned a class label based on the majority vote of its neighbors. KNN involves storing all the instances until the time of classification [4].
• The random forest algorithm is an ensemble of the bagging algorithm and the random subspace algorithm. In this algorithm, decision trees are employed as the base learner. Each tree is built on the basis of bootstrap samples of the training data. The diversity among the base learners is provided by random feature selection. In this way, the model yields satisfactory results in the presence of noisy or irrelevant data [14].

3.2.3 | Ensemble learning methods

Ensemble learning (also known as multiple classifier systems) is the process of training multiple learning algorithms and combining their predictions by regarding them as a committee of decision makers. Ensemble learning aims to identify a learning model with a higher predictive performance [51]. The methods are briefly described below:

• AdaBoost is a boosting‐based ensemble learning algorithm, in which a more robust classification model is obtained by focusing on the instances that are harder to learn [24].
• Bagging (also known as bootstrap aggregating) is another ensemble algorithm, which combines base learning algorithms trained on different training subsets obtained from the original training set by bootstrap sampling [13].
• The Random Subspace algorithm is another ensemble learning method, which achieves diversity among the members of the ensemble through feature space‐based partitioning [29].
• Voting is a simple way to combine the predictions of the individual learning algorithms of the ensemble. In general, voting schemes can be divided into two categories: unweighted voting schemes and weighted voting schemes. The unweighted voting schemes include minimum probability, maximum probability, majority voting, product of probabilities, and average of probabilities [52]. In the empirical analysis, we have considered the five unweighted voting schemes to combine the supervised learning algorithms.
• Stacking (also known as stacked generalization) is another ensemble combination scheme. The Stacking algorithm employs a two‐staged structure to combine the predictions of multiple learning algorithms [72].
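The three weighting schemes of Section 3.2.1 can be made concrete with a toy, library-free sketch over a two-document corpus. A real experiment would use an optimized vectorizer, and the idf definition log(n/df) used here is one common variant; the paper does not state which variant it uses.

```python
import math
from collections import Counter

# Toy two-document corpus; the vocabulary is fixed over the corpus so that
# every word has a nonzero document frequency.
docs = [["good", "course", "good"], ["bad", "course"]]
vocab = ["good", "course", "bad"]

def tp(doc, vocab):
    # Term presence: 1 if the word occurs in the document, else 0.
    return [1 if w in doc else 0 for w in vocab]

def tf(doc, vocab):
    # Term frequency: raw occurrence counts.
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def tf_idf(doc, vocab, corpus):
    # Rescale counts by log(n / document frequency), penalizing words that
    # occur in many documents ("course" gets weight 0 here).
    n = len(corpus)
    return [c * math.log(n / sum(1 for d in corpus if w in d))
            for w, c in zip(vocab, tf(doc, vocab))]

def bigrams(tokens):
    # The bigram model (N = 2) pairs each token with its successor.
    return list(zip(tokens, tokens[1:]))
```

Pairing each of the three weightings with the unigram, bigram, and trigram token streams yields the nine corpus configurations mentioned in Section 3.2.1.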
ONAN | 7

3.3 | Deep learning‐based sentiment analysis

For the deep learning‐based sentiment analysis, the text corpus has been represented by three word‐embedding schemes (namely, word2vec, fastText, and GloVe). To process text, we have utilized five deep learning architectures (i.e., convolutional neural network, recurrent neural network, bidirectional recurrent neural network with attention mechanism, gated recurrent unit, and long short‐term memory).

3.3.1 | Word embedding‐based representation

For text mining and sentiment analysis tasks, one common representation scheme is the bag‐of‐words scheme. Yet, the bag‐of‐words scheme cannot capture the semantic relations among the components of text documents. In addition, this scheme yields a sparse data representation with a high‐dimensional feature space [5]. For text classification, word embedding‐based representation is an effective scheme, which can be utilized in conjunction with machine learning algorithms and deep learning architectures. The use of word embeddings makes it possible to represent text documents in a compact and more expressive way. Word embedding‐based representation provides learning by distributed expressions of words in a low‐dimensional space [11]. Word embeddings are based on the distributional hypothesis, according to which words with similar meanings should be encountered in similar contexts. Hence, the vector‐based representation aims to capture characteristics of the neighbors of a word; in this way, the similarity between words can be captured. In this scheme, a large unsupervised set of documents has been utilized to extract the semantic and syntactic meaning among the words [59].
For this study, we have considered three word‐embedding schemes (namely, word2vec, fastText, and GloVe) in conjunction with the deep learning architectures. The representation schemes are briefly described below:

• The word2vec is an unsupervised and computationally efficient prediction model to learn word embeddings from text documents. The word2vec model consists of two models, that is, the continuous bag of words (CBOW) model and the continuous skip‐gram model [45]. The CBOW model predicts the target word from the context words that surround it across a window of size k. In contrast, the skip‐gram model predicts the context words, given the target word.
• The fastText is a computationally efficient representation scheme to learn word embeddings from text documents. In this scheme, each word has been regarded as a bag of character n‐grams [32]. As compared with word2vec, the fastText scheme can yield a higher predictive performance for morphologically rich languages and rare words [20].
• The global vectors (GloVe) is an unsupervised prediction model to obtain vector representations for words. In this scheme, the local context‐based learning of the word2vec model has been integrated with global matrix factorization. Training has been conducted based on the global word‐word co‐occurrence statistics obtained from the text corpus. Based on the training process, linear structures of the word vector space have been extracted [55].

3.3.2 | Deep learning architectures

Deep learning algorithms and architectures have already been employed in a wide range of applications, including computer vision, pattern recognition, and natural language processing. For the conventional machine learning schemes presented in Section 3.2, a high‐dimensional and sparse feature representation has been utilized in conjunction with classifiers. Deep learning architectures provide the learning of multi‐level feature representations. The architectures aim to identify learning models on the basis of multiple layers or stages of nonlinear information processing in a hierarchical way [37]. The rest of this section briefly describes the deep learning architectures utilized in the empirical analysis.

• Convolutional neural networks (CNN) are deep neural network‐based architectures, which process data with the use of a grid‐based topology. CNN has been characterized by a special kind of mathematical operation, referred to as convolution. The convolution operation has been handled in one or more convolutional layers. A typical convolutional neural network architecture consists of an input layer, an output layer, and hidden layers. The hidden layers of the architecture comprise several layers, that is, convolutional layers, pooling layers, fully connected layers, and normalization layers. In the convolutional layers, the convolution operation has been employed on the input data and the feature maps have been obtained. To add nonlinearity to the architecture, activation functions (such as the rectified linear unit) have been utilized in conjunction with the feature maps. The pooling layers have been utilized to combine the outputs of neuron clusters. In this way,
the spatial size of the feature spaces has been reduced and the models' ability to deal with overfitting has been enhanced. In the pooling layers, maximum pooling has been employed. The fully connected layers have been utilized to obtain the final output of the architecture [23].
• Recurrent neural network (RNN) is another type of deep learning architecture to process sequential data. In RNN, the connections between neurons constitute a directed graph. In this architecture, an internal state has been utilized to process sequences of inputs. Hence, the architecture can be successfully employed for sequential tasks, such as speech recognition. In RNN, each output has been determined by recurrently processing the same task over the instances of the sequence; based on all the earlier computations, the output has been determined [73].
• Long short‐term memory network (LSTM) is another deep learning architecture based on recurrent neural networks. The conventional RNN architecture suffers from the exploding or vanishing gradient problem; in RNN, arbitrarily long sequences of input cannot be properly handled. In response, LSTM utilizes forget gates to overcome these problems. In the LSTM architecture, the back propagation of error has been allowed for a limited number of time steps. A typical LSTM unit contains a cell and three kinds of gates, namely an input gate, an output gate, and a forget gate. The open and close operations at the gates have been utilized to control which information should be preserved and when the information should be accessed [62].
• Gated recurrent unit (GRU) is another deep learning architecture based on recurrent neural networks. In a typical GRU architecture, there are two gates (namely, a reset gate and an update gate) [18].
• Recurrent neural network with attention mechanism (RNN‐AM) is another deep learning architecture based on recurrent neural networks. The conventional encoder‐decoder frameworks must encode all information, which may not be relevant to the current task [50,62]. For long input sequences, it is not possible to fully capture information‐rich and selective encoding. To deal properly with this problem, attention mechanisms have been employed. In the bidirectional recurrent neural network architecture with attention mechanism, each output word yt corresponds to a weighted combination of the input states. In this scheme, the weight values define the contribution of each input state to the output state. Based on this scheme, the decoder pays varying attention to the states [62].

4 | EXPERIMENTS AND RESULTS

In this section, the performance measures, experimental procedure, and experimental results obtained from the conventional classifiers, ensemble learning algorithms, and deep learning architectures have been presented.

4.1 | Performance measures

To evaluate the performance of the machine learning models, classification accuracy (ACC) and F‐measure have been considered in this study.
Classification accuracy is one of the most widely employed measures for the evaluation of supervised learning algorithms, which is computed as given by Equation (1):

ACC = (TN + TP) / (TP + FP + FN + TN)  (1)

where TN, TP, FP, and FN denote the number of true negatives, true positives, false positives, and false negatives, respectively.
F‐measure is another common measure for the performance evaluation of supervised learning algorithms, which is the harmonic mean of precision and recall. Precision (PRE) is the proportion of the true positives against the true positives and false positives, as given by Equation (2). Recall (REC) is the proportion of the true positives against the true positives and false negatives, as given by Equation (3). Based on Equations (2) and (3), F‐measure has been computed as given by Equation (4):

PRE = TP / (TP + FP)  (2)

REC = TP / (TP + FN)  (3)

F‐measure = (2 × PRE × REC) / (PRE + REC)  (4)

4.2 | Experimental procedure

To evaluate the predictive performance of the models for sentiment analysis on EDM, we have conducted two sets of experiments, namely, machine learning‐based sentiment analysis and deep learning‐based sentiment analysis.
For the evaluation task on machine learning methods, the text corpus has been represented by three traditional text weighting schemes (i.e., the TP, TF, and TF‐IDF schemes) and three N‐gram models (i.e., the unigram, bigram, and trigram models). In this way, we have obtained nine configurations on the text corpus. The feature
TABLE 2 Classification accuracy values obtained by machine learning algorithms

Unigram+TP Unigram+TF Unigram+TF‐IDF Bigram+TP Bigram+TF Bigram+TF‐IDF Trigram+TF Trigram+TP Trigram+TF‐IDF
KNN 75.89 76.30 76.11 74.87 75.42 75.08 72.49 74.38 73.70
SVM 78.13 78.24 78.15 77.95 78.06 77.99 77.74 77.88 77.82
LR 77.49 77.69 77.57 77.10 77.28 77.19 76.48 76.99 76.83
NB 79.14 79.23 79.17 79.04 79.10 79.08 78.91 79.01 78.94
RF 78.67 78.75 78.71 78.48 78.59 78.54 78.31 78.43 78.40
AdaBoost (KNN) 82.86 82.92 82.91 82.77 82.84 82.79 82.69 82.76 82.73
AdaBoost (SVM) 83.37 83.48 83.41 83.31 83.34 83.33 83.18 83.24 83.22
AdaBoost (LR) 83.13 83.17 83.15 83.06 83.10 83.08 82.95 83.02 82.98
AdaBoost (NB) 84.18 84.33 84.23 84.09 84.16 84.13 83.85 84.06 83.95
AdaBoost (RF) 83.72 83.81 83.74 83.60 83.64 83.62 83.51 83.57 83.53
Bagging (KNN) 79.61 79.74 79.69 79.49 79.58 79.55 79.29 79.41 79.37
Bagging (SVM) 80.53 80.59 80.55 80.39 80.45 80.40 80.30 80.37 80.34
Bagging (LR) 80.21 80.27 80.22 80.10 80.20 80.14 79.90 79.99 79.96
Bagging (NB) 81.18 81.22 81.20 81.07 81.16 81.14 81.00 81.05 81.03
Bagging (RF) 80.85 80.97 80.90 80.77 80.81 80.79 80.67 80.74 80.72
RS (KNN) 84.56 84.70 84.60 84.45 84.54 84.50 84.37 84.41 84.40
RS (SVM) 86.02 86.17 86.09 85.67 85.94 85.74 85.47 85.60 85.51
RS (LR) 85.33 85.41 85.40 85.21 85.27 85.24 85.02 85.17 85.08
RS (NB) 89.41 89.62 89.51 88.24 88.63 88.47 87.68 87.96 87.81
RS (RF) 87.10 87.53 87.33 86.61 86.90 86.68 86.28 86.52 86.40
Voting (Minimum probability) 81.40 81.47 81.43 81.35 81.38 81.37 81.25 81.28 81.27
Voting (Maximum probability) 81.69 81.75 81.72 81.59 81.67 81.66 81.50 81.57 81.55
Voting (Majority voting) 81.97 82.01 81.99 81.84 81.89 81.86 81.77 81.82 81.79
Voting (Product of probability) 82.21 82.33 82.25 82.15 82.17 82.17 82.04 82.09 82.05
Voting (Average of probabilities) 82.59 82.65 82.61 82.50 82.58 82.53 82.40 82.45 82.43
Stacking 84.95 85.00 84.97 84.82 84.88 84.85 84.72 84.78 84.74

Abbreviations: KNN, k‐nearest neighbor algorithm; LR, logistic regression; NB, Naïve Bayes; RF, random forest; RS, random subspace method; SVM, support vector machines; TF, term frequency; TF‐IDF, term frequency‐inverse document frequency; TP, term presence.
Note: The highest values have been indicated in bold and the second highest values in italics.
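The nine configurations compared in Table 2 pair an N‐gram model with one of the three term weighting schemes. The study ran these representations in WEKA; the fragment below is only a from‐scratch sketch of TP, TF, and TF‐IDF weighting for unigrams, on an invented two‐document corpus.

```python
import math
from collections import Counter

def weight(docs, scheme):
    """Return one {term: weight} dict per document for TP, TF, or TF-IDF (unigrams)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # document frequency of each term, needed for the IDF factor
    df = Counter(t for doc in tokenized for t in set(doc))
    out = []
    for doc in tokenized:
        tf = Counter(doc)
        if scheme == "TP":        # term presence: 1 if the term occurs, else absent
            out.append({t: 1 for t in tf})
        elif scheme == "TF":      # raw term frequency
            out.append(dict(tf))
        elif scheme == "TF-IDF":  # frequency damped by how common the term is
            out.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
        else:
            raise ValueError(scheme)
    return out

corpus = ["great course great lectures", "boring course"]
print(weight(corpus, "TF")[0])  # {'great': 2, 'course': 1, 'lectures': 1}
```

Under TF‐IDF, "course" receives weight 0 in both documents, since it occurs in every document and its IDF factor is log(2/2) = 0; bigram and trigram variants would tokenize into overlapping 2‐ and 3‐word sequences instead of single words.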
representations have been evaluated in conjunction with five supervised learning methods (namely, Naïve Bayes, support vector machines, logistic regression, K‐nearest neighbor, and the random forest algorithm) and five ensemble learning methods (i.e., AdaBoost, Bagging, Random Subspace, voting, and Stacking). For the voting ensemble, the five supervised learning algorithms have been combined by using either the minimum probability, maximum probability, majority voting, product of probability, or average of probabilities combination rule. For the Stacking ensemble, the five supervised learning algorithms have been utilized as the base‐level classifiers, and logistic regression has been utilized as the meta‐level classifier. Ten‐fold cross validation has been employed on
TABLE 3 F‐measure values obtained by machine learning algorithms

Unigram+TP Unigram+TF Unigram+TF‐IDF Bigram+TP Bigram+TF Bigram+TF‐IDF Trigram+TF Trigram+TP Trigram+TF‐IDF
KNN 0.77 0.77 0.77 0.76 0.77 0.77 0.71 0.76 0.75
SVM 0.79 0.79 0.79 0.79 0.79 0.79 0.79 0.79 0.79
LR 0.78 0.78 0.78 0.78 0.78 0.78 0.78 0.78 0.78
NB 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80
RF 0.80 0.80 0.80 0.80 0.80 0.80 0.79 0.79 0.79
AdaBoost (KNN) 0.82 0.82 0.82 0.82 0.82 0.82 0.82 0.82 0.82
AdaBoost (SVM) 0.83 0.83 0.83 0.83 0.83 0.83 0.83 0.83 0.83
AdaBoost (LR) 0.83 0.83 0.83 0.83 0.83 0.83 0.82 0.83 0.82
AdaBoost (NB) 0.84 0.84 0.84 0.83 0.84 0.84 0.83 0.83 0.83
AdaBoost (RF) 0.83 0.83 0.83 0.83 0.83 0.83 0.83 0.83 0.83
Bagging (KNN) 0.81 0.81 0.81 0.80 0.81 0.81 0.80 0.80 0.80
Bagging (SVM) 0.81 0.81 0.81 0.81 0.81 0.81 0.81 0.81 0.81
Bagging (LR) 0.81 0.81 0.81 0.81 0.81 0.81 0.81 0.81 0.81
Bagging (NB) 0.82 0.82 0.82 0.82 0.82 0.82 0.82 0.82 0.82
Bagging (RF) 0.82 0.82 0.82 0.81 0.82 0.81 0.81 0.81 0.81
RS (KNN) 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85
RS (SVM) 0.87 0.87 0.87 0.87 0.87 0.87 0.86 0.87 0.86
RS (LR) 0.86 0.86 0.86 0.86 0.86 0.86 0.86 0.86 0.86
RS (NB) 0.90 0.91 0.90 0.88 0.89 0.89 0.88 0.88 0.88
RS (RF) 0.88 0.88 0.88 0.87 0.88 0.87 0.87 0.87 0.87
Voting (Minimum probability) 0.84 0.84 0.84 0.84 0.84 0.84 0.84 0.84 0.84
Voting (Maximum probability) 0.84 0.84 0.84 0.84 0.84 0.84 0.84 0.84 0.84
Voting (Majority voting) 0.84 0.84 0.84 0.84 0.84 0.84 0.84 0.84 0.84
Voting (Product of probability) 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85
Voting (Average of probabilities) 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85
Stacking 0.86 0.86 0.86 0.86 0.86 0.86 0.85 0.86 0.86

Abbreviations: KNN, k‐nearest neighbor algorithm; LR, logistic regression; NB, Naïve Bayes; RF, random forest; RS, random subspace method; SVM, support vector machines; TF, term frequency; TF‐IDF, term frequency‐inverse document frequency; TP, term presence.
Note: The highest values have been indicated in bold and the second highest values in italics.
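The accuracy and F‐measure values reported in Tables 2 and 3 follow Equations (1)–(4); a minimal sketch of those formulas, applied to made‐up confusion‐matrix counts:

```python
def scores(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F-measure as in Equations (1)-(4)."""
    acc = (tn + tp) / (tp + fp + fn + tn)     # Equation (1)
    pre = tp / (tp + fp)                      # Equation (2)
    rec = tp / (tp + fn)                      # Equation (3)
    f_measure = 2 * pre * rec / (pre + rec)   # Equation (4): harmonic mean
    return acc, pre, rec, f_measure

# Hypothetical confusion counts for a binary sentiment classifier:
acc, pre, rec, f1 = scores(tp=80, fp=10, fn=20, tn=90)
print(round(acc, 3), round(f1, 3))  # prints: 0.85 0.842
```

Note that F‐measure penalizes the imbalance between precision (0.889 here) and recall (0.800), which is why it falls below the plain accuracy.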
the experimental evaluations. For the evaluation task on machine learning methods, all the experimental procedures have been conducted in WEKA 3.9, with the default parameters being considered.
For the evaluation task on deep learning‐based architectures, the text corpus has been represented by three word‐embedding schemes (namely, word2vec, fastText, and GloVe) in conjunction with five deep‐learning architectures (namely, convolutional neural network, recurrent neural network, bidirectional recurrent neural network with attention mechanism, gated recurrent unit, and long short‐term memory). To implement and train the deep learning architectures, we have utilized TensorFlow and Keras. To obtain the optimal predictive performance
from the models, we have employed hyperparameter optimization, based on Bayesian optimization using Gaussian processes. For the word2vec and fastText word embeddings, the continuous skip‐gram and continuous bag of words (CBOW) methods have been evaluated with varying vector sizes (vector sizes of 200 and 300) and different dimensions for the projection layers (dimension sizes of 100 and 200). For the corpus, 80% of the data have been utilized as the training set, whereas the rest of the data have been utilized as the testing set.

4.3 | Experimental results

In this section, the classification accuracy and F‐measure values obtained by conventional supervised learning methods, ensemble learning methods, and deep‐learning architectures have been presented.
Table 2 presents the classification accuracy values obtained by supervised learning algorithms and ensemble learning methods on the nine configurations of the text corpus. We have considered five widely utilized supervised learning algorithms (i.e., Naïve Bayes, support vector machines, logistic regression, K‐nearest neighbor, and the random forest algorithm) and five ensemble learning methods (i.e., AdaBoost, Bagging, Random Subspace, voting, and Stacking) in the empirical analysis. Regarding the predictive performance of the supervised learning methods, the highest predictive performance in terms of classification accuracy has been obtained by the Naïve Bayes algorithm. The second highest predictive performance has been obtained by the random forest algorithm, and support vector machines have obtained the third highest predictive performance. The classification accuracies obtained by ensemble learning methods indicate that ensemble learners outperform the conventional supervised learners for the sentiment classification task on EDM. The Random Subspace algorithm outperforms the other ensemble methods analyzed in the empirical evaluations. The Stacking algorithm also yields higher predictive performances, compared with the other ensemble methods, Bagging, AdaBoost, and voting. Regarding the predictive performance of the conventional text representation schemes on EDM, we have evaluated nine different configurations. The unigram features with TF‐based representation have obtained the highest classification accuracies. The unigram features with TF‐IDF weighting have obtained the second highest predictive performances, and unigram features with TP have obtained the third highest predictive performances. As it can be seen from the results listed in Table 2, the unigram model outperforms the bigram and trigram models, and TF‐based representation outperforms the TP and TF‐IDF‐based weighting schemes.
The highest predictive performance (with a classification accuracy of 89.62%) among the compared schemes has been obtained by a random subspace ensemble of Naïve Bayes, when unigram features with TF‐based representation have represented the text corpus. In Table 3, the F‐measure values obtained by the supervised learning algorithms and ensemble learning methods have been presented. As it can be observed from the results listed in Table 3, the patterns obtained by the classification algorithms and ensemble learning methods in terms of classification accuracy are also valid for the results obtained in terms of F‐measure. Regarding the predictive performances of the supervised learning algorithms, the Naïve Bayes algorithm and the random forest algorithm outperform the
FIGURE 1 Main effects plot for classification accuracy values on machine learning‐based sentiment analysis methods
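A main effects plot such as Figure 1 averages the accuracy for each level of one factor over all levels of the other factor. The averaging step can be sketched as follows, on an invented accuracy grid (the numbers are illustrative, not the study's results):

```python
from statistics import mean

# Hypothetical accuracy grid: keys are (classifier, representation) pairs.
acc = {
    ("NB", "unigram"): 79.1, ("NB", "bigram"): 79.0,
    ("SVM", "unigram"): 78.2, ("SVM", "bigram"): 78.0,
}

def main_effect(acc, factor_index):
    """Mean accuracy for each level of one factor, averaged over the other factor."""
    levels = {key[factor_index] for key in acc}
    return {lvl: mean(v for k, v in acc.items() if k[factor_index] == lvl)
            for lvl in levels}

print(main_effect(acc, 0))  # per-classifier means, e.g. NB averages to 79.05
```

Plotting these per‐level means against the factor levels reproduces the shape of a main effects plot.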
TABLE 4 Classification accuracy values obtained by deep learning algorithms
Word embedding Vector size Dimension of projection layer CNN RNN LSTM GRU RNN‐AM
word2vec (Skip‐gram) 200 100 84.14 85.52 89.42 86.54 87.63
word2vec (Skip‐gram) 200 200 84.28 85.58 89.56 86.57 87.66
word2vec (Skip‐gram) 300 100 84.37 85.65 89.81 86.63 87.76
word2vec (Skip‐gram) 300 300 84.48 85.69 89.98 86.75 87.82
word2vec (CBOW) 200 100 84.50 85.73 90.02 86.78 87.86
word2vec (CBOW) 200 200 84.52 85.78 90.04 86.82 87.94
word2vec (CBOW) 300 100 84.63 85.86 90.15 86.87 87.95
word2vec (CBOW) 300 300 84.66 85.88 90.32 86.89 88.00
fastText (Skip‐gram) 200 100 84.75 85.90 90.38 86.92 88.09
fastText (Skip‐gram) 200 200 84.85 85.94 90.59 86.94 88.17
fastText (Skip‐gram) 300 100 84.93 85.99 90.83 87.00 88.20
fastText (Skip‐gram) 300 300 85.01 86.01 90.96 87.01 88.37
fastText (CBOW) 200 100 85.08 86.10 91.05 87.06 88.46
fastText (CBOW) 200 200 85.13 86.15 91.24 87.09 88.60
fastText (CBOW) 300 100 85.15 86.24 91.36 87.17 88.77
fastText (CBOW) 300 300 85.19 86.25 91.64 87.25 88.82
GloVe 200 100 85.22 86.29 91.77 87.41 88.93
GloVe 200 200 85.30 86.34 92.87 87.45 88.98
GloVe 300 100 85.36 86.41 93.91 87.53 89.04
GloVe 300 300 85.46 86.46 95.80 87.58 89.25
Abbreviations: CBOW, continuous bag of words; CNN, convolutional neural network; GRU, gated recurrent units; LSTM, long short‐term memory networks; RNN, recurrent neural network; RNN‐AM, recurrent neural network with attention mechanism.
Note: The highest values have been indicated in bold and the second highest values in italics.
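The LSTM unit described in Section 3.3.2 guards its cell state with input, forget, and output gates. The sketch below implements a single LSTM step with scalar states; all weights are invented toy values chosen only to show how the gates modulate what is stored and exposed, not a trained model.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step with scalar input/state; w maps gate name -> (w_x, w_h, b)."""
    gate = lambda name: sigmoid(w[name][0] * x + w[name][1] * h_prev + w[name][2])
    f = gate("forget")   # how much of the old cell state to keep
    i = gate("input")    # how much of the candidate to write
    o = gate("output")   # how much of the cell state to expose
    c_tilde = math.tanh(w["cand"][0] * x + w["cand"][1] * h_prev + w["cand"][2])
    c = f * c_prev + i * c_tilde   # updated cell state
    h = o * math.tanh(c)           # new hidden state
    return h, c

# Illustrative (untrained) parameters: every gate uses the same toy weights.
w = {"forget": (0.5, 0.5, 0.0), "input": (0.5, 0.5, 0.0),
     "output": (0.5, 0.5, 0.0), "cand": (1.0, 1.0, 0.0)}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 1.0]:         # a tiny input sequence
    h, c = lstm_step(x, h, c, w)
print(-1.0 < h < 1.0)  # prints True: the hidden state stays bounded by the tanh squashing
```

A GRU collapses this into two gates (reset and update) without a separate cell state, which is why it is cheaper than LSTM while behaving similarly on many sequence tasks.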
other classification algorithms. The predictive performance of the conventional supervised learning algorithms has been enhanced with the use of ensemble learners. The highest predictive performance in terms of F‐measure has been obtained with a random subspace ensemble of Naïve Bayes, with an F‐measure of 0.91. In Figure 1, the main effects plot for the accuracy results has been presented to summarize the main findings of the empirical results on machine learning‐based sentiment analysis.
In Tables 4 and 5, the classification accuracy and F‐measure values obtained by five deep learning architectures (convolutional neural network, recurrent neural network, long short‐term memory, gated recurrent unit, and recurrent neural network with attention mechanism) on three word‐embedding schemes have been presented, respectively.
Five word embedding‐based representation schemes (namely, the word2vec skip‐gram model, the word2vec continuous bag of words (CBOW) model, the fastText skip‐gram model, the fastText continuous bag of words (CBOW) model, and GloVe) have been considered here. The predictive performance results (in terms of classification accuracy) listed in Table 4 indicate that the GloVe word‐embedding scheme yields a higher predictive performance, compared with the other word‐embedding schemes. The second highest predictive performances have been obtained by the fastText CBOW model, which is followed by the fastText skip‐gram model. The lowest predictive performances in terms of classification accuracy have been obtained by the word2vec skip‐gram model. For the different vector sizes and dimensions of projection layers considered in the empirical analysis, a vector size of 300 and a projection layer dimension of 300 yield a higher predictive performance.
Regarding the predictive performances of the deep learning‐based architectures for sentiment analysis, long short‐term memory networks (LSTM) have obtained the highest predictive performances. The second highest predictive performances have been generally achieved
TABLE 5 F‐measure values obtained by deep learning algorithms
Word Embedding Vector size Dimension of projection layer CNN RNN LSTM GRU RNN‐AM
word2vec (Skip‐gram) 200 100 0.85 0.86 0.88 0.86 0.87
word2vec (Skip‐gram) 200 200 0.85 0.86 0.91 0.86 0.87
word2vec (Skip‐gram) 300 100 0.85 0.86 0.91 0.86 0.87
word2vec (Skip‐gram) 300 300 0.85 0.86 0.91 0.86 0.87
word2vec (CBOW) 200 100 0.85 0.86 0.91 0.87 0.87
word2vec (CBOW) 200 200 0.85 0.86 0.91 0.87 0.87
word2vec (CBOW) 300 100 0.85 0.86 0.91 0.87 0.87
word2vec (CBOW) 300 300 0.85 0.86 0.91 0.87 0.87
fastText (Skip‐gram) 200 100 0.85 0.86 0.91 0.87 0.87
fastText (Skip‐gram) 200 200 0.85 0.86 0.92 0.87 0.87
fastText (Skip‐gram) 300 100 0.85 0.86 0.92 0.87 0.87
fastText (Skip‐gram) 300 300 0.85 0.86 0.92 0.87 0.87
fastText (CBOW) 200 100 0.85 0.86 0.92 0.87 0.88
fastText (CBOW) 200 200 0.85 0.86 0.92 0.87 0.88
fastText (CBOW) 300 100 0.85 0.86 0.92 0.87 0.88
fastText (CBOW) 300 300 0.85 0.86 0.93 0.87 0.88
GloVe 200 100 0.85 0.86 0.93 0.87 0.88
GloVe 200 200 0.86 0.86 0.93 0.87 0.88
GloVe 300 100 0.86 0.86 0.94 0.87 0.88
GloVe 300 300 0.86 0.86 0.96 0.87 0.88
Abbreviations: CBOW, continuous bag of words; CNN, convolutional neural network; GRU, gated recurrent units; LSTM, long short‐term memory networks; RNN, recurrent neural network; RNN‐AM, recurrent neural network with attention mechanism.
Note: The highest values have been indicated in bold and the second highest values in italics.
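The word2vec and fastText variants compared in Tables 4 and 5 differ in their training objective (Section 3.3.1): CBOW predicts the target word from its surrounding context, whereas skip‐gram predicts each context word from the target. The sketch below shows how the two models build training pairs from a window of size k; the sentence is an invented example, and a real implementation would map these pairs to the embedding and projection layers.

```python
def training_pairs(tokens, k, model):
    """Build (input, target) pairs within a context window of size k.

    CBOW pairs a tuple of context words with the target word;
    skip-gram emits one (target, context word) pair per context word.
    """
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
        if model == "cbow":
            pairs.append((tuple(context), target))
        elif model == "skipgram":
            pairs.extend((target, c) for c in context)
        else:
            raise ValueError(model)
    return pairs

sentence = "the course was great".split()
print(training_pairs(sentence, 1, "cbow")[1])       # (('the', 'was'), 'course')
print(training_pairs(sentence, 1, "skipgram")[:2])  # [('the', 'course'), ('course', 'the')]
```

fastText would additionally decompose each token into character n‐grams before look‐up, which is what helps it with rare and morphologically rich words.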
FIGURE 2 Main effects plot for classification accuracy values on deep learning‐based sentiment analysis methods
by the recurrent neural network with attention mechanism (RNN‐AM). Gated recurrent units (GRU) have generally obtained the third highest predictive performances. The empirical results listed in Tables 4 and 5 indicate that the LSTM, GRU, and RNN‐AM architectures outperform the conventional recurrent neural network. For the task of sentiment analysis on EDM, the lowest predictive performance has been achieved by the convolutional neural network architecture. For all the compared configurations, long short‐term memory networks (LSTM) have achieved the highest predictive performance with the GloVe word‐embedding scheme‐based representation, with a classification accuracy of 95.80%. In the empirical results, we also seek to identify whether deep learning architectures outperform the conventional supervised learning methods and ensemble learning methods: deep learning methods outperform both, and, in addition, ensemble learning methods outperform conventional supervised learning methods. In Figure 2, the main effects plot for the accuracy results has been presented to summarize the main findings of the empirical results on deep learning‐based sentiment analysis.
To further evaluate the statistical significance of the results obtained in the empirical analysis, we have performed one‐way analysis of variance (ANOVA) tests in the Minitab statistical program. In Table 6, the statistical significance of the results for the conventional supervised learning methods and ensemble learning methods listed in Tables 2 and 3 has been evaluated, where DF, SS, MS, F, and p denote the degrees of freedom, adjusted sum of squares, adjusted mean square, F‐value, and probability value, respectively. According to the one‐way ANOVA test results presented in Table 6, the higher predictive performances obtained by ensemble learning methods are statistically significant (p < .0001). Similarly, there is a statistically meaningful difference between the results obtained by the conventional text representation schemes.
In Table 7, the statistical significance of the results for the different word‐embedding schemes and deep learning algorithms listed in Tables 4 and 5 has been evaluated. As it can be observed from the results presented in Table 7, there is a statistically meaningful difference in the predictive performance between the different word‐embedding schemes and deep learning architectures (p < .0001). In Figure 3, the confidence intervals for the mean values of supervised learning methods, ensemble learning algorithms, and deep learning architectures for a confidence level of 95% are presented. Based on the statistical significances between the results, the figure has been divided into three regions denoted by red dashed lines. The predictive performances obtained by conventional supervised learning methods, ensemble learning methods, and deep learning algorithms are statistically meaningful.

TABLE 6 One‐way analysis of variance test results for compared classifiers and ensemble learning methods

Accuracy values
Source              DF   SS        MS        F‐value   p‐Value
Classifier          25   2177.08   87.0833   1274.41   .000
Text representation 8    8.37      1.0467    15.32     .000
Error               200  13.67     0.0683
Total               233  2199.12

F‐measure values
Source              DF   SS        MS        F‐value   p‐Value
Classifier          25   0.209637  0.008385  467.51    .000
Text representation 8    0.001050  0.000131  7.32      .000
Error               200  0.003587  0.000018
Total               233  0.214275

TABLE 7 One‐way analysis of variance test results for deep learning‐based sentiment analysis

Accuracy values
Source                     DF  Adj SS   Adj MS   F‐value  p‐Value
Word embeddings            4   29.912   7.478    24.39    .000
Deep learning architecture 4   461.210  115.302  376.04   .000
Error                      91  27.903   0.307
Lack of fit                16  17.910   1.119    8.40     .000
Pure error                 75  9.993    0.133
Total                      99  519.024

F‐measure values
Source                     DF  Adj SS    Adj MS    F‐value  p‐Value
Word embeddings            4   0.002036  0.000509  9.83     .000
Deep learning architecture 4   0.050651  0.012663  244.62   .000
Error                      91  0.004711  0.000052
Lack of fit                16  0.002440  0.000153  5.04     .000
Pure error                 75  0.002270  0.000030
Total                      99  0.057398
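The ANOVA entries in Tables 6 and 7 were produced with Minitab; purely to illustrate the underlying mechanics, a one‐way ANOVA F‐statistic can be computed from group samples as below. The three groups here are invented accuracy values, not the study's data.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (SS_between, SS_within, F) for a one-way ANOVA over sample groups."""
    all_obs = [x for g in groups for x in g]
    grand = mean(all_obs)
    # variation of group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # variation of observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_obs) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f

# Invented accuracy samples for three method families:
classic = [78.1, 79.0, 77.5]
ensemble = [84.2, 85.0, 83.8]
deep = [89.4, 90.1, 91.6]
_, _, f = one_way_anova([classic, ensemble, deep])
print(f > 1.0)  # prints True: between-group variance dominates within-group variance
```

A large F (compared against the F distribution with the stated degrees of freedom) yields the small p‐values reported in the tables.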
FIGURE 3 Interval plot of accuracy values for compared methods

5 | CONCLUSION

Massive open online courses (MOOCs) are recent innovative approaches in distance education, which provide learning content to participants without age‐, gender‐, race‐, or geography‐related barriers. Sentiment analysis on educational data can be employed to obtain feedback on learning content, which can help teachers improve their teaching process and help learners reach high‐quality educational resources.
In this paper, we have analyzed a corpus containing 66,000 MOOC reviews with the use of machine learning, ensemble learning, and deep learning methods. In the machine learning‐based approach, we utilized three term weighting schemes (i.e., TP, TF, and TF‐IDF). The representation schemes have been evaluated in conjunction with five supervised learners (i.e., Naïve Bayes, support vector machines, logistic regression, k‐nearest neighbor, and random forest) and five ensemble learning methods (i.e., AdaBoost, Bagging, Random Subspace, voting, and Stacking). In the deep learning‐based approach, we have utilized three word‐embedding schemes (i.e., word2vec, fastText, and GloVe) in conjunction with five deep learning architectures (i.e., convolutional neural network, recurrent neural network, bidirectional recurrent neural network with attention mechanism, gated recurrent unit, and long short‐term memory).
This research aims to present an efficient sentiment classification scheme with a high predictive performance in the educational domain, by pursuing the paradigms of ensemble learning and deep learning. To the best of our knowledge, this is the first comprehensive empirical study on sentiment analysis of MOOC reviews, in which the predictive performances of conventional classification algorithms, ensemble learning methods, and deep learning algorithms have been reported.
This research has some theoretical and practical implications. The results of the empirical analysis indicate that ensemble learning methods yield a higher predictive performance in educational data mining, compared with the conventional supervised learning methods. Regarding the performance of the different ensemble learners, the random subspace ensemble outperforms the other ensemble methods. Regarding the empirical results obtained by the conventional text representation schemes, the highest predictive performances have been obtained by unigram sets. For the evaluation tasks, TF, a term weighting scheme, outperforms the TF‐IDF and TP‐based representations. The results on MOOC reviews indicate that deep learning architectures outperform conventional supervised learning methods and ensemble learning methods. For the deep learning architectures, long short‐term memory networks (LSTM) have obtained the highest predictive performances. The second highest predictive performances have been generally achieved by the recurrent neural network with attention mechanism (RNN‐AM). Gated recurrent units (GRU) have generally obtained the third highest predictive performances. The empirical analysis indicates that the GloVe word‐embedding scheme yields a higher predictive performance, compared with the other word‐embedding schemes. The second highest predictive performances have been obtained by the fastText CBOW model, which is followed by the fastText skip‐gram model.
There are several practical implications of the research. The identification of an appropriate representation scheme is a critical issue in developing machine learning‐based sentiment classification schemes. In this regard, the experimental analysis presents comprehensive empirical results for different text representation schemes, supervised learning methods, ensemble learning models, and deep learning architectures for educational data mining, which may be utilized as baseline empirical results for the field. In addition, we present the first corpus of massive open online course reviews, which may be helpful for further research.

ORCID
Aytuğ ONAN http://orcid.org/0000-0002-9434-5880

REFERENCES
1. A. Abdi, S. M. Shamsuddin, and R. M. Aliguliyev, QMOS: Query‐based multi‐documents opinion‐oriented summarization, Inform. Process. Manag. 54 (2018), no. 2, 318–338.
2. P. Adamopoulos, What makes a great MOOC? An interdisciplinary analysis of student retention in online courses, in: International Conference on Information Systems, ICIS 2013.
3. P. Adinolfi et al., Sentiment analysis to evaluate teaching performance, Int. J. Knowl. Soc. Res. (IJKSR) 7 (2016), no. 4, 86–107.
16 | ONAN

4. D. W. Aha, D. Kibler, and M. K. Albert, Instance‐based learning algorithms, Mach. Learn. 6 (1991), no. 1, 37–66.
5. N. Alami, M. Meknassi, and N. En‐nahnahi, Enhancing unsupervised neural networks based text summarization with word embedding and ensemble learning, Expert Syst. Appl. 123 (2019), 195–211.
6. C. Alario‐Hoyos et al., Delving into participants' profiles and use of social tools in MOOCs, IEEE Transact. Learn. Technol. 7 (2014), no. 3, 260–266.
7. H. Aldowah, H. Al‐Samarraie, and W. M. Fauzy, Educational data mining and learning analytics for 21st century higher education: A review and synthesis, Telematics Inform. 39 (2019), no. 4, 13–49.
8. M. Al‐Smadi et al., Enhancing aspect‐based sentiment analysis of Arabic hotels' reviews using morphological, syntactic and semantic features, Inform. Process. Manag. 56 (2019), no. 2, 308–319.
9. N. Altrabsheh, M. Cocea, and S. Fallahkhair, Predicting learning‐related emotions from students' textual classroom feedback via Twitter, in: Proceedings of the 8th International Conference on Educational Data Mining (EDM), Madrid, Spain, 2015.
10. F. Belbachir and M. Boughanem, Using language models to improve opinion detection, Inform. Process. Manag. 54 (2018), no. 6, 958–968.
11. Y. Bengio et al., A neural probabilistic language model, J. Mach. Learn. Res. 3 (2003), 1137–1155.
12. R. Bogdan, Sentiment analysis on embedded systems blended courses, BRAIN. Broad Res. Artif. Intell. Neurosci. 8 (2017), no. 1, 17–23.
13. L. Breiman, Bagging predictors, Mach. Learn. 24 (1996), no. 2, 123–140.
14. L. Breiman, Random forests, Mach. Learn. 45 (2001), no. 1, 5–32.
15. L. Breslow et al., Studying learning in the worldwide classroom: Research into edX's first MOOC, Res. Pract. Assess. 8 (2013), 13–25.
16. R. Z. Cabada, M. L. B. Estrada, and R. O. Bustillos, Mining of educational opinions with deep learning, J. Univers. Comput. Sci. 24 (2018), no. 11, 1604–1626.
17. G. Chakraborty, M. Pagolu, and S. Garla, Text mining and analysis: Practical methods, examples and case studies using SAS, SAS Institute, 2013.
18. K. Cho et al., Learning phrase representations using RNN encoder‐decoder for statistical machine translation, arXiv preprint arXiv:1406.1078 (2014).
19. D. Ciregan, U. Meier, and J. Schmidhuber, Multi‐column deep neural networks for image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012. https://doi.org/10.1109/CVPR.2012.6248110
20. W. Di, A. Bhardwaj, and J. Wei, Deep learning essentials: Your hands‐on guide to the fundamentals of deep learning and neural network modeling, Packt Publishing, Birmingham, UK, 2018.
21. S. Donitsa‐Schmidt and B. Topaz, Massive open online courses as a knowledge base for teachers, Journal of Education for Teaching 44 (2018), no. 5, 608–620.
22. L. P. Dringus and T. Ellis, Using data mining as a strategy for assessing asynchronous discussion forums, Comput. Educ. 45 (2005), no. 1, 141–160.
23. J. L. Elman, Finding structure in time, Cogn. Sci. 14 (1990), no. 2, 179–211.
24. Y. Freund and R. E. Schapire, Experiments with a new boosting algorithm, in: Machine Learning: Proceedings of the Thirteenth International Conference, 1996, pp. 148–156.
25. X. Glorot, A. Bordes, and Y. Bengio, Domain adaptation for large‐scale sentiment classification: A deep learning approach, in: Proceedings of the 28th International Conference on Machine Learning (ICML‐11), 2011, pp. 513–520.
26. G. Gutiérrez et al., Mining: Students comments about teacher performance assessment using machine learning algorithms, Int. J. Combin. Optim. Prob. Inform. 9 (2018), no. 3, 26–40.
27. G. Hackeling, Mastering machine learning with scikit‐learn, Packt Publishing Ltd, Birmingham, UK, 2017.
28. T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning: Data mining, inference, and prediction, 2nd ed., Springer, Berlin, 2009, pp. 83–85.
29. T. K. Ho, The random subspace method for constructing decision forests, IEEE Trans. Pattern Anal. Mach. Intell. 20 (1998), no. 8, 832–844.
30. A. D. Ho et al., HarvardX and MITx: The first year of open online courses, fall 2012–summer 2013, HarvardX and MITx Working Paper No. 1, 2014.
31. M. Jia et al., Who can benefit more from massive open online courses? A prospective cohort study, Nurse Educ. Today 76 (2019), 96–102.
32. A. Joulin et al., FastText.zip: Compressing text classification models, arXiv preprint arXiv:1612.03651 (2016).
33. A. M. Kaplan and M. Haenlein, Higher education and the digital revolution: About MOOCs, SPOCs, social media, and the Cookie Monster, Bus. Horiz. 59 (2016), no. 4, 441–450.
34. I. U. Khan et al., Predicting the acceptance of MOOCs in a developing country: Application of task‐technology fit model, social motivation, and self‐determination theory, Telematics Inform. 35 (2018), no. 4, 964–978.
35. V. Kovanović et al., Exploring communities of inquiry in massive open online courses, Comput. Educ. 119 (2018), 44–58.
36. P. C. Lane, D. Clarke, and P. Hender, On developing robust models for favourability analysis: Model choice, feature sets and imbalanced data, Decis. Support Syst. 53 (2012), no. 4, 712–718.
37. Y. LeCun, Generalization and network design strategies, in: Connectionism in Perspective, Elsevier, Amsterdam, The Netherlands, 1989.
38. P. J. Lee, Y. H. Hu, and K. T. Lu, Assessing the helpfulness of online hotel reviews: A classification‐based approach, Telematics Inform. 35 (2018), no. 2, 436–445.
39. D. D. Lewis, Naive (Bayes) at forty: The independence assumption in information retrieval, in: European Conference on Machine Learning, Springer, Berlin, Heidelberg, 1998, pp. 4–15.
40. Q. Lin et al., Lexical based automated teaching evaluation via students' short reviews, Comput. Appl. Eng. Educ. 27 (2019), no. 1, 194–205.
41. E. Loper and S. Bird, NLTK: The Natural Language Toolkit, in: Proceedings of the ACL‐02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, 2002. https://doi.org/10.3115/1225403.1225421
42. M. B. López et al., EduRP: An educational resources platform based on opinion mining and semantic web, J. Univers. Comput. Sci. 24 (2018), no. 11, 1515–1535.
43. W. D. Maxwell et al., Massive open online courses in US healthcare education: Practical considerations and lessons learned from implementation, Curr. Pharm. Teach. Learn. 10 (2018), no. 6, 736–743.
44. W. Medhat, A. Hassan, and H. Korashy, Sentiment analysis algorithms and applications: A survey, Ain Shams Eng. J. 5 (2014), no. 4, 1093–1113.
45. T. Mikolov et al., Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
46. P. M. Moreno‐Marcos et al., Sentiment analysis in MOOCs: A case study, in: 2018 IEEE Global Engineering Education Conference (EDUCON), IEEE, 2018, pp. 1489–1496.
47. P. J. Muñoz‐Merino et al., Precise effectiveness strategy for analyzing the effectiveness of students with educational resources and activities in MOOCs, Comput. Human. Behav. 47 (2015), 108–118.
48. H. T. Nguyen and M. Le Nguyen, Multilingual opinion mining on YouTube – A convolutional N‐gram BiLSTM word embedding, Inform. Process. Manag. 54 (2018), no. 3, 451–462.
49. A. Onan, An ensemble scheme based on language function analysis and feature engineering for text genre classification, J. Inform. Sci. 44 (2018), no. 1, 28–47.
50. A. Onan, Mining opinions from instructor evaluation reviews: A deep learning approach, Comput. Appl. Eng. Educ. 28 (2020), no. 1, 117–138.
51. A. Onan, S. Korukoğlu, and H. Bulut, Ensemble of keyword extraction methods and classifiers in text classification, Expert Syst. Appl. 57 (2016), 232–247.
52. A. Onan, S. Korukoğlu, and H. Bulut, A multiobjective weighted voting ensemble classifier based on differential evolution algorithm for text sentiment classification, Expert Syst. Appl. 62 (2016), 1–16.
53. R. Oramas Bustillos et al., Opinion mining and emotion recognition in an intelligent learning environment, Comput. Appl. Eng. Educ. 27 (2019), no. 1, 90–101.
54. B. Pang and L. Lee, Opinion mining and sentiment analysis, Foundations and Trends in Information Retrieval 2 (2008), no. 1–2, 1–135.
55. J. Pennington, R. Socher, and C. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
56. A. Peña‐Ayala, Educational data mining: A survey and a data mining‐based analysis of recent works, Expert Syst. Appl. 41 (2014), no. 4, 1432–1462.
57. M. F. Porter, Snowball: A language for stemming algorithms, 2001.
58. R. Prabowo and M. Thelwall, Sentiment analysis: A combined approach, Journal of Informetrics 3 (2009), no. 2, 143–157.
59. S. M. Rezaeinia et al., Sentiment analysis based on improved pre‐trained word embeddings, Expert Syst. Appl. 117 (2019), 139–147.
60. J. A. Rodger, A fuzzy nearest neighbor neural network statistical model for predicting demand for natural gas and energy cost savings in public buildings, Expert Syst. Appl. 41 (2014), no. 4, 1813–1829.
61. J. A. Rodger, Discovery of medical Big Data analytics: Improving the prediction of traumatic brain injury survival rates by data mining patient informatics processing software Hybrid Hadoop Hive, Inform. Med. Unlocked 1 (2015), 17–26.
62. L. M. Rojas‐Barahona, Deep learning for sentiment analysis, Lang. Linguist. Compass 10 (2016), no. 12, 701–719.
63. C. L. Santos, P. Rita, and J. Guerreiro, Improving international attractiveness of higher education institutions based on text mining and sentiment analysis, Int. J. Educ. Manag. 32 (2018), no. 3, 431–447.
64. B. Shah et al., Twitter analysis for disaster management, in: 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), IEEE, 2018, pp. 1–4.
65. G. Siemens and R. S. Baker, Learning analytics and educational data mining: Towards communication and collaboration, in: Proceedings of the 2nd International Conference on Learning Analytics and Knowledge, ACM, 2012, pp. 252–254.
66. F. Tian et al., Recognizing and regulating e‐learners' emotions based on interactive Chinese texts in e‐learning systems, Knowl.‐Based Syst. 55 (2014), 148–164.
67. N. D. Valakunde and M. S. Patwardhan, Multi‐aspect and multi‐class based document sentiment analysis of educational data catering accreditation process, in: 2013 International Conference on Cloud & Ubiquitous Computing & Emerging Technologies, IEEE, 2013, pp. 188–192.
68. V. Vapnik, The nature of statistical learning theory, Springer Science & Business Media, 2013.
69. J. Wang, B. Peng, and X. Zhang, Using a stacked residual LSTM model for sentiment intensity prediction, Neurocomputing 322 (2018), 93–101.
70. J. Wang et al., Tree‐structured regional CNN‐LSTM model for dimensional sentiment analysis, IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2019), 581–591.
71. M. Wen, D. Yang, and C. Rose, Sentiment analysis in MOOC discussion forums: What does it tell us?, in: Educational Data Mining 2014, 2014.
72. D. H. Wolpert, Stacked generalization, Neural Netw. 5 (1992), no. 2, 241–259.
73. L. Zhang, S. Wang, and B. Liu, Deep learning for sentiment analysis: A survey, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8 (2018), no. 4, e1253.
74. N. Öztürk and S. Ayvaz, Sentiment analysis on Twitter: A text mining approach to the Syrian refugee crisis, Telematics Inform. 35 (2018), no. 1, 136–147.

AUTHOR BIOGRAPHY

Aytuğ Onan received the BS degree in computer engineering from the Izmir University of Economics, İzmir, Turkey, in 2010, and the MS and PhD degrees in computer engineering from Ege University, Turkey, in 2013 and 2016, respectively. He has been an Associate Professor with the Department of Computer Engineering, Izmir Katip Celebi University, Izmir, Turkey, since April 2019. He has published several journal articles on machine learning and computational linguistics. Dr. Onan has been an editor for the KSII Transactions on Internet and Information Systems and an Associate Editor for the Journal of King Saud University Computer and Information Sciences.

How to cite this article: Onan A. Sentiment analysis on massive open online course evaluations: A text mining and deep learning approach. Comput Appl Eng Educ. 2020;1–18. https://doi.org/10.1002/cae.22253
