


Various Approaches to Aspect-based Sentiment Analysis
Amlaan Bhoi
Department of Computer Science
University of Illinois at Chicago
Chicago, IL, USA
[email protected]

Sandeep Joshi
Department of Computer Science
University of Illinois at Chicago
Chicago, IL, USA
[email protected]
arXiv:1805.01984v1 [cs.CL] 5 May 2018

Abstract— The problem of aspect-based sentiment analysis deals with classifying sentiments (negative, neutral, positive) for a given aspect in a sentence. A traditional sentiment classification task treats the entire sentence as a text document and classifies sentiment based on all the words. Suppose we have a sentence such as "the acceleration of this car is fast, but the reliability is horrible". This is a difficult sentence because it contains two aspects with conflicting sentiments about the same entity. Using machine learning (or deep learning) techniques, how do we encode the information that we are interested in one aspect and its sentiment but not the other? We explore various pre-processing steps, features, and methods used to solve this task.

I. INTRODUCTION

Aspect-based sentiment analysis is defined as follows: given a text document d_i and an aspect term/phrase a_j, we wish to determine the sentiment [-1, 0, +1] of the aspect term in the document [1]. In this setting, we assume aspect extraction has already been performed and is not a primary task. Otherwise, we would also need to extract aspects and cluster them into some predefined classes.

We discuss the various pre-processing steps we used to generate features for our models. There is a plethora of approaches to aspect-based sentiment analysis. In this paper, we consider classical machine learning techniques and two deep learning approaches. Finally, we look at the results, which model performed best, and why that might be the case.

II. TECHNIQUES

We now explore the different techniques, methods, and features used in this experiment. We divide this section into three parts: pre-processing steps and features common to all methods, classical machine learning models, and deep learning models. Some features and pre-processing steps overlap between the latter two parts, and this is mentioned wherever appropriate.

A. Common Pre-processing Steps & Features

We start with the first feature we use, called ID-encoding. ID-encoding works by assigning each unique aspect in the text corpus (collection of documents/sentences) a unique ID that cannot be repeated. We then encode that ID in a zero-vector at the original aspect term's locations. Formally, for each unique aspect term a_i in the corpus, we assign a unique ID id_i and store it in a dictionary as a key-value pair. Then, for each sentence s_j, we create a zero-vector and replace the zeros with id_i at the aspect locations. We call this new vector an aspect-sequence. For example, if we have the sentence "the battery life of the phone is too short", then, assuming our unique ID is x, our aspect-sequence becomes:

[0 x x 0 0 0 0 0 0]

Another variation is to treat the whole aspect term as one token. Then, our aspect-sequence becomes:

[0 x 0 0 0 0 0 0]

Deep learning models do not work well with sequences of different lengths, so we applied zero-padding [2] to pad the sequences to a fixed length. Our next approach to feature engineering is simple and is called bit-masking. It follows the same concept as ID-encoding, but instead of applying a unique ID to each aspect, we encode a 1 at each aspect term location. Thus, the previous example's bit-masking vector becomes:

[0 1 1 0 0 0 0 0 0]

Our final effort to encode aspect information into our model is called location-encoding. In this approach, for each aspect term a_i and sentence s_j, we encode the location of each context word c_k with respect to the aspect term in the sentence. Following our previous example, our location-sequence becomes:

[1 1 2 3 4 5 6]

where we do not include the aspect term location, and the aspect term is considered a single location regardless of its size (single word or phrase). For tokenization of text, we used Keras' text_to_word_sequence method [3].
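The paper does not include code, but the three encodings can be illustrated with a minimal Python sketch on the running example; the helper names and the whitespace tokenizer below are our own simplifying assumptions.

```python
# Minimal sketch of ID-encoding, bit-masking, and location-encoding for the
# running example. Helper names and the tokenizer are our own assumptions;
# the paper's actual pre-processing code is not published.

def id_encode(tokens, aspect_tokens, aspect_id):
    """Unique aspect ID at every aspect-term position, zeros elsewhere."""
    return [aspect_id if t in aspect_tokens else 0 for t in tokens]

def bit_mask(tokens, aspect_tokens):
    """1 at every aspect-term position, zeros elsewhere."""
    return [1 if t in aspect_tokens else 0 for t in tokens]

def location_encode(tokens, aspect_tokens):
    """Distance of each context word from the aspect term, which is treated
    as a single location and excluded from the output sequence."""
    positions = [i for i, t in enumerate(tokens) if t in aspect_tokens]
    start, end = positions[0], positions[-1]
    return [start - i if i < start else i - end
            for i in range(len(tokens)) if i < start or i > end]

tokens = "the battery life of the phone is too short".split()
aspect = {"battery", "life"}

print(id_encode(tokens, aspect, aspect_id=7))  # [0, 7, 7, 0, 0, 0, 0, 0, 0]
print(bit_mask(tokens, aspect))                # [0, 1, 1, 0, 0, 0, 0, 0, 0]
print(location_encode(tokens, aspect))         # [1, 1, 2, 3, 4, 5, 6]
# In practice, these sequences are then zero-padded to a fixed length before
# being fed to the deep learning models.
```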
B. Machine Learning Models

In our machine learning models, the specific feature engineering we did was TF-IDF vectorization. This converts our text sentences into TF-IDF vectors, and the whole corpus becomes a TF-IDF matrix. The tf-idf score is formulated as:

tfidf_{t,d} = tf_{t,d} × idf_t

Please refer to the book by Manning et al. [5] for full details.
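As an illustration of this step, here is a minimal sketch using scikit-learn's TfidfVectorizer; the paper does not state the exact vectorizer settings, so library defaults are assumed.

```python
# Sketch of TF-IDF vectorization with scikit-learn; vectorizer settings are
# assumed defaults, since the paper does not specify them.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the battery life of the phone is too short",
    "the acceleration of this car is fast but the reliability is horrible",
]

vectorizer = TfidfVectorizer()        # computes tfidf_{t,d} = tf_{t,d} * idf_t (smoothed by default)
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix, one row per sentence

print(X.shape)                        # (2, vocabulary size)
print(sorted(vectorizer.vocabulary_)[:5])
```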
Let us look at the various machine learning models we used in this experiment. The features mentioned above are used in all the models; we give only a brief overview of each. Every model except XGBoost is from the Scikit-Learn library [4].

• Naive Bayes: Naive Bayes classifiers apply Bayes' theorem with the assumption of conditional independence between every pair of features. We chose this classifier because it has been shown to perform surprisingly well on text problems.
• Decision Tree: A decision tree is a non-parametric learning method that predicts the value of a target variable by learning decision rules. We chose it because decision trees can learn inherent rules in the dataset that are not apparent to the user.
• Support Vector Machines (SVM): Support vector machines have been shown to be extremely capable of handling aspect-based sentiment analysis in domains such as, but not limited to, customer reviews of laptops and restaurants [6]. Unsurprisingly, SVMs performed best among the classical machine learning methods we tried. We shall see those results later.
• Random Forest Classifier: A random forest classifier is a meta-estimator that fits decision trees on sub-samples of the input dataset. It can be regarded as an ensemble method. Fitting various decision trees on sub-samples of the data can help prevent overfitting. We chose it because we believed it could be a better classifier than vanilla decision trees.
• Extra Trees Classifier: An extra trees classifier is another meta-estimator (also categorized as an ensemble method) which fits randomized decision trees (a.k.a. extra trees) on various sub-samples of the dataset. The motivation behind using this method is the same as for random forest classifiers.
• Extreme Gradient Boosting (XGBoost): Extreme gradient boosting is an optimized, parallel implementation of gradient boosting [7]. The algorithm is based on boosting of decision trees. Please refer to the paper by Friedman [7] for the intuition behind gradient boosting. We chose this algorithm because it performed well in our previous machine learning experiments.
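For concreteness, below is a condensed sketch of how these classifiers might be fit on TF-IDF features. The toy sentences, labels, and default hyperparameters are placeholders, not the paper's actual training script (which, per the results table, also pairs models with one-hot or location-encoding features); XGBoost is used through its scikit-learn-compatible wrapper.

```python
# Condensed sketch: fitting the classical models on TF-IDF features.
# The sentences and labels here are toy placeholders, not the SemEval data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from xgboost import XGBClassifier  # XGBoost is the one model outside Scikit-Learn

sentences = [
    "the battery life of the phone is too short",
    "the screen is gorgeous",
    "the keyboard is okay",
    "service was slow and rude",
    "the pasta was delicious",
    "the menu is average",
]
labels = [0, 2, 1, 0, 2, 1]  # 0 = negative, 1 = neutral, 2 = positive

X = TfidfVectorizer().fit_transform(sentences)

models = {
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(),
    "Extra Trees": ExtraTreesClassifier(),
    "XGBoost": XGBClassifier(),
}
for name, model in models.items():
    model.fit(X, labels)
    print(name, model.score(X, labels))  # training accuracy only, for illustration
```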
C. Deep Learning Models

The only specific pre-processing step we applied for the deep learning models is stop-word removal [8]. For word embeddings, we used Stanford's GloVe embeddings [9]. Let us now explore the two deep learning approaches we tried, and why one worked while the other did not. Both methods are built upon the LSTM [10] and attention [11] concepts.

• Attention LSTM-RNN: The first model we tried was a vanilla LSTM-RNN [10] with an attention layer to learn the weights of context words with respect to the aspect (a sketch of this layer pattern is given after this list). This approach did not work well, as a single attention layer is not sufficient to learn the abstract features relating the aspect and context terms. The problem was compounded by the fact that our encoding of features was not sufficient for a single-layer pass to work. We tried Bidirectional LSTMs, dropout, recurrent dropout, MaskedGlobalAveragePooling, BatchNormalization, LeakyReLU, and more. The basic architecture that we built can be found here.
• Deep Memory Network: The second model we implemented is the deep memory network (a.k.a. MemNet) [12]. Instead of using just one attention+linear layer, it uses multiple such layers, called hops. Hops are needed because this task requires multiple levels of abstraction. Each hop performs the aforementioned operation and feeds its output into a softmax layer. The model computes context attention as well as location attention. The input to this model is the sentence tokenized into unique word IDs, the aspect term/phrase, the location of the aspect term, and a location vector denoting each context word's location with respect to the aspect term. This model performed the best, as we shall see.
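As a rough illustration of the attention idea both models build on, here is a minimal Keras sketch of an LSTM encoder with a single attention layer queried by a pooled aspect representation. The layer sizes, the use of Keras' built-in dot-product Attention layer, and all variable names are our assumptions rather than the authors' architecture; MemNet differs in that it stacks several such attention "hops" over word embeddings instead of LSTM states.

```python
# Rough sketch of the attention-LSTM idea: encode the sentence with an LSTM and
# attend over its states with a query derived from the aspect term. All sizes,
# names, and the choice of Keras' dot-product Attention layer are assumptions.
from tensorflow.keras import layers, Model

max_len, vocab_size, embed_dim, hidden = 40, 10000, 100, 128

word_ids = layers.Input(shape=(max_len,), name="word_ids")
aspect_ids = layers.Input(shape=(max_len,), name="aspect_ids")  # e.g., the ID-encoded aspect-sequence

embed = layers.Embedding(vocab_size, embed_dim)  # could be initialized with GloVe vectors
states = layers.LSTM(hidden, return_sequences=True)(embed(word_ids))   # (batch, max_len, hidden)

aspect_vec = layers.GlobalAveragePooling1D()(embed(aspect_ids))        # pooled aspect representation
query = layers.Reshape((1, hidden))(layers.Dense(hidden)(aspect_vec))  # (batch, 1, hidden)

attended = layers.Attention()([query, states])   # weight the LSTM states by relevance to the aspect
sentence_vec = layers.Flatten()(attended)        # (batch, hidden)
output = layers.Dense(3, activation="softmax")(sentence_vec)  # negative / neutral / positive

model = Model([word_ids, aspect_ids], output)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```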
The Attention LSTM-RNN is built using Keras [3], while MemNet was built using TensorFlow [2]. All models were trained with stratified k-fold cross-validation to avoid bias and to confirm the metrics of precision, recall, F1 score, and overall accuracy for each dataset. For prediction, we shuffled and split the train and test datasets, saved the trained model, and loaded it back up for testing.
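A minimal sketch of the stratified k-fold evaluation loop described above, assuming scikit-learn's StratifiedKFold; the fold count, macro averaging, and toy data are our assumptions.

```python
# Minimal sketch of stratified k-fold evaluation; the number of folds and the
# averaging mode are assumptions, and `sentences`/`labels` are placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from sklearn.svm import SVC

sentences = np.array([
    "the battery life of the phone is too short", "the screen is gorgeous",
    "the keyboard is okay", "service was slow and rude",
    "the pasta was delicious", "the menu is average",
])
labels = np.array([0, 2, 1, 0, 2, 1])

X = TfidfVectorizer().fit_transform(sentences)
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(skf.split(X, labels)):
    clf = SVC().fit(X[train_idx], labels[train_idx])
    preds = clf.predict(X[test_idx])
    p, r, f1, _ = precision_recall_fscore_support(
        labels[test_idx], preds, average="macro", zero_division=0)
    print(f"fold {fold}: P={p:.3f} R={r:.3f} F1={f1:.3f} "
          f"acc={accuracy_score(labels[test_idx], preds):.3f}")
```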
III. RESULTS

The dataset used for our experiments is a modified version of the SemEval-2016 Task 5 challenge [15]. We apply the algorithms to the Tech Reviews and Food Reviews domains.

In our experiments, MemNet worked best, with an overall accuracy of 0.713 on the tech reviews dataset and 0.7866 on the food reviews dataset. We trained all models on Google Cloud Platform with 16 Intel Skylake CPUs, 60 GB RAM, and an Nvidia P100 GPU. The second-best consistent performer was the SVM with one-hot encoding of the text. One surprisingly good model is the ETC on the Tech dataset; however, it failed to perform well on the Food dataset, with a positive-class F1 score of 0.3745, which is below par.
TABLE I
Test Results

                              Positive Class            Negative Class            Neutral Class
Classifier          Dataset   P       R       F1        P       R       F1        P       R       F1        Accuracy
Naive Bayes + OH    Tech      0.6349  0.6153  0.625     0.5722  0.5625  0.5673    0.3142  0.3586  0.335     0.5399
Naive Bayes + OH    Food      0.71466 0.6261  0.6674    0.3915  0.4037  0.3974    0.2458  0.3333  0.2829    0.5201
Decision Tree + OH  Tech      0.6489  0.6256  0.6370    0.5073  0.5852  0.5435    0.3513  0.2826  0.3132    0.5397
Decision Tree + OH  Food      0.7058  0.8130  0.7557    0.5405  0.3726  0.4411    0.4196  0.3560  0.3852    0.6269
SVM + OH            Tech      0.6945  0.7230  0.7085    0.6235  0.6306  0.6271    0.3780  0.3369  0.3563    0.6112
SVM + OH            Food      0.7231  0.8668  0.7885    0.5750  0.4285  0.4911    0.4318  0.2878  0.3454    0.6629
RFC + LE            Tech      0.5689  0.7274  0.6376    0.7079  0.6897  0.6975    0.5454  0.2916  0.3753    0.6247
RFC + LE            Food      0.4136  0.1783  0.2438    0.6579  0.9040  0.7599    0.3448  0.1514  0.2061    0.6147
XGBoost + LE        Tech      0.5669  0.8435  0.6771    0.7768  0.6976  0.7342    0.6220  0.1911  0.2882    0.6511
XGBoost + LE        Food      0.6387  0.1151  0.1877    0.6436  0.9655  0.7713    0.5018  0.1167  0.1779    0.6333
ETC + LE            Tech      0.6414  0.8121  0.7159    0.7973  0.7440  0.7687    0.6275  0.4043  0.4873    0.7021
ETC + LE            Food      0.5027  0.3002  0.3745    0.6907  0.8790  0.7728    0.3732  0.2000  0.2550    0.6363
MemNet              Tech      0.8249  0.7694  0.7960    0.7008  0.6776  0.6835    0.4475  0.5178  0.4669    0.7130
MemNet              Food      0.8446  0.9241  0.8819    0.7222  0.6921  0.7044    0.6175  0.4401  0.5051    0.7866

(P = precision, R = recall, F1 = F1 score; OH = one-hot features, LE = location-encoding features.)

IV. CONCLUSION

In conclusion, it is apparent that MemNet works best on this task based on our experiments. Surprisingly, ETC performed better than expected on the tech reviews dataset. The main problem is still class imbalance, which leads to low precision, recall, and F1 scores for the neutral class. This could further be addressed by over-sampling or resampling. Other directions we can tackle are dependency parsing as a feature, an opinion dictionary, a sentiment dictionary, using ELMo embeddings instead of GloVe embeddings, modularizing our pipeline for faster experiments, and gaining a deeper understanding of why some models do not work on these datasets or these types of problems. For other future experiments, we want to try Tree-LSTMs [13] and Interactive Attention Networks [14].

References
[1] Liu, Bing. Web data mining: exploring hyperlinks, contents, and usage data. Springer Science & Business Media, 2007.
[2] Abadi, Martin, et al. "TensorFlow: A system for large-scale machine learning." OSDI. Vol. 16. 2016.
[3] Chollet, Francois. "Keras." (2015).
[4] Buitinck, Lars, et al. "API design for machine learning software: experiences from the scikit-learn project." arXiv preprint arXiv:1309.0238 (2013).
[5] Schutze, Hinrich, Christopher D. Manning, and Prabhakar Raghavan. Introduction to information retrieval. Vol. 39. Cambridge University Press, 2008.
[6] Kiritchenko, Svetlana, et al. "NRC-Canada-2014: Detecting aspects and sentiment in customer reviews." Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014). 2014.
[7] Friedman, Jerome H. "Greedy function approximation: a gradient boosting machine." Annals of Statistics (2001): 1189-1232.
[8] Bird, Steven, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc., 2009.
[9] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014.
[10] Hochreiter, Sepp, and Jurgen Schmidhuber. "Long short-term memory." Neural Computation 9.8 (1997): 1735-1780.
[11] Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based neural machine translation." arXiv preprint arXiv:1508.04025 (2015).
[12] Tang, Duyu, Bing Qin, and Ting Liu. "Aspect level sentiment classification with deep memory network." arXiv preprint arXiv:1605.08900 (2016).
[13] Tai, Kai Sheng, Richard Socher, and Christopher D. Manning. "Improved semantic representations from tree-structured long short-term memory networks." arXiv preprint arXiv:1503.00075 (2015).
[14] Ma, Dehong, et al. "Interactive Attention Networks for Aspect-Level Sentiment Classification." arXiv preprint arXiv:1709.00893 (2017).
[15] Pontiki, Maria, et al. "SemEval-2016 Task 5: Aspect based sentiment analysis." Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). 2016.

