BERT Fine-Tuning For Sentiment Analysis On Indonesian Mobile Apps Reviews
Haftittah Wuswilahaken DW
Faculty of Computer Science, Brawijaya University, Indonesia, [email protected]
Novanto Yudistira
Faculty of Computer Science, Brawijaya University, Indonesia, [email protected]
User reviews play an essential role in the success of a mobile app. User reviews in textual form are unstructured data and are therefore highly complex to process for sentiment analysis. Previous approaches often ignore the context of reviews, and the relatively small amount of data makes models prone to overfitting. A new approach, BERT, has been introduced as a transfer learning model whose pre-trained weights already provide a good representation of context. This study examines the effectiveness of fine-tuning BERT for sentiment analysis using two different pre-trained models: besides the multilingual pre-trained model, we use a pre-trained model trained only on Indonesian. The dataset consists of Indonesian user reviews of the ten best apps of 2020 on the Google Play site. We also perform hyper-parameter tuning to find the optimal trained model, and we test two training-data labeling approaches, score-based and lexicon-based, to determine their effect on model effectiveness. The experimental results show that the pre-trained model trained on Indonesian achieves better average accuracy on lexicon-based data. Its highest accuracy is 84%, with 25 epochs and a training time of 24 minutes. These results are better than all of the machine learning baselines and the multilingual pre-trained model.
Additional Keywords and Phrases: BERT fine-tuning, Sentiment analysis, Apps review
1 INTRODUCTION
Apps are developed to meet human needs in various aspects of life; mobile apps, specifically, are designed to run on mobile devices such as smartphones and tablets. Growing human needs have become a business opportunity to develop many mobile apps. Several factors affect users' adoption of apps, such as usability, ratings, and ease of use [1]. In the app development process, usability is the most essential aspect because it reflects how easy an app is to use [2]. Usability evaluation always involves users, and one source of user input is user reviews. User reviews contain rich information about users' experiences with an app. A user review usually has two essential parts: a rating and an opinion. The rating indicates the overall evaluation of the user experience on a numerical scale, while the opinion is a more detailed account of what the user experienced when using the app. Ratings do not necessarily provide honest reviews, and they do not contain information that helps improve an app's usability. Reading the opinions in user reviews can provide better insight and comprehension. There are no restrictions on how users write text-based reviews, which makes the review dataset less structured and challenging to work with; reading user opinions manually is therefore prohibitively complex. Sentiment analysis can be used to summarize the textual content of each opinion, determining whether the user likes or dislikes the app. By knowing user preferences through reviews, developers can improve usability and develop the business even further.
Sentiment analysis, also known as opinion mining, classifies user opinions by polarity. It comprises different tasks, methods, and types of analysis. Three methods used in sentiment analysis are machine learning (ML), hybrid learning, and lexicon-based approaches [3]. The most popular and widely used machine learning method is supervised learning. This approach trains a model on labeled data to predict outputs for new, unlabeled data [4]. The most commonly used models are the naïve Bayes (NB) classifier [5], [6] and support vector machines (SVM) [7], [8]. Although this method has shown promising results, it requires feature engineering to represent the text [9]. The more complete the features used to capture information, the better the classification results, which complicates the overall machine learning process. Meanwhile, the traditional method for sentiment analysis is lexicon-based [10]. This method inspects each sentence for words that express a positive or negative feeling in order to classify sentiment. Because sentiment analysis is domain-dependent, the lexicon used needs to be matched to the domain; this is a disadvantage of the lexicon-based method, as matching is a time-consuming process. However, the lexicon-based method does not require training data, which is an advantage [11]. There are two main approaches to creating sentiment lexicons: dictionary-based [12], [13] and corpus-based [14], [15]. In many cases, the dictionary-based approach works well for the general domain, while corpus-based lexicons can be adapted to specific domains. The hybrid method extends machine learning models with lexicon-based knowledge [16], overcoming the limitations of both previous approaches.
Deep neural networks (DNNs) have recently shown remarkable results and are widely used across research areas. Two main DNN architectures are convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Generally, CNNs are hierarchical architectures, while RNNs are sequential architectures [17]. CNNs show the best results in image processing [18], and the same architecture has been applied to text processing [19]. Kim and Jeong used a CNN architecture for sentiment analysis on three different datasets [20]. Their results show that employing consecutive convolutional layers is effective for longer texts and provides better results than various machine learning methods. Meanwhile, Kurniasari and Setyanto used an RNN architecture for sentiment analysis in Indonesian [21]. Their results show that the RNN achieves better accuracy than a CNN, although the RNN model may overfit during testing. RNNs also suffer from the vanishing gradient problem, which hinders learning on long text sequences. Long short-term memory (LSTM) enhances the RNN architecture to solve the vanishing gradient problem by capturing long-term dependencies [22]. Consequently, LSTM has reportedly outperformed both CNNs and plain RNNs for sentiment analysis [23], [24]. More recently, language models (LMs), large networks trained on unlabeled data that can be fine-tuned for downstream tasks, have made breakthroughs in NLP.
One of the successes of the language model approach comes from transferring the encoder of a transformer-based encoder-decoder architecture [25]. Successful language models include GPT-2 [26] by OpenAI1, BERT [27] by the Google AI Language team2, and XLNet [28] by the Google AI Brain team3. BERT is one of the best-performing models because it improves the understanding of context. Several extensions of BERT have therefore been developed, such as RoBERTa [29], which improves the training procedure, and ALBERT [30], which reduces the model size. BERT also provides pre-trained models that are readily and easily fine-tuned across a variety of downstream tasks, and pre-trained models are available in various languages [31]–[33]. Sun et al. [34] conducted an in-depth experiment to investigate the effectiveness of BERT fine-tuning for text classification, showing state-of-the-art results for the sentiment analysis of reviews. Nguyen et al. used BERT fine-tuning for sentiment analysis in Vietnamese, showing that BERT outperforms LSTM, TextCNN, and RCNN models with impressive results [35].
Based on recent studies and the need for user evaluation of mobile apps in Indonesian, we conduct sentiment analysis using BERT fine-tuning. Although much sentiment analysis research has been conducted on Indonesian datasets, in-depth experiments using BERT fine-tuning for app reviews in Indonesian have not been done before. The main contribution of this paper is to determine the effectiveness of the fine-tuning method for sentiment analysis in Indonesian. We collected 10,615 Indonesian user reviews of the ten best apps of 20204 from the Google Play store5. We use two different pre-trained models: the multilingual BERT-Base [27], trained on multilingual data, and IndoBERT-Base [31], trained specifically on Indonesian data. Next, we perform hyper-parameter tuning for the best results. As evaluation criteria, we use accuracy and computation time for each model. The paper is structured as follows: Section 2 describes the proposed methodology, Section 3 describes the experimental design and discusses the experimental results, and Section 4 concludes the work and outlines future work.
1 https://openai.com.
2 https://research.google/teams/language.
3 https://research.google/teams/brain.
4 https://www.appannie.com/en/insights/market-data/2020-mobile-recap-how-to-succeed-in-2021.
5 https://play.google.com/store.
2 METHODOLOGY
Figure 1: Training data labeling methods to obtain sentiment polarity: (a) using score-to-sentiment mapping; (b) using the lexicon-based method with the InSet lexicon.
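Since the experiments rely on these two labeling strategies, a rough Python sketch of both is given below. The score threshold and the toy InSet entries are our own illustrative assumptions, not the paper's exact rules; InSet [36] assigns Indonesian words signed polarity weights that can be summed per review.

```python
# Hedged sketch of the two labeling strategies in Figure 1. The score
# threshold and the toy lexicon entries are illustrative assumptions.

def label_by_score(rating: int) -> str:
    # Assumed mapping: high star ratings -> positive, low -> negative.
    return "positive" if rating >= 3 else "negative"

def label_by_lexicon(tokens: list, inset_weights: dict) -> str:
    # Sum the InSet polarity weight of every known word; the sign of the
    # total decides the sentiment label.
    total = sum(inset_weights.get(token, 0) for token in tokens)
    return "positive" if total >= 0 else "negative"

# Toy lexicon entries ("good", "easy", "error"); real InSet weights
# cover a much larger vocabulary.
inset_weights = {"bagus": 4, "mudah": 2, "error": -3}
print(label_by_score(5))                                                  # positive
print(label_by_lexicon(["aplikasi", "sering", "error"], inset_weights))   # negative
```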
Table 1: Parameter comparison of the pre-trained models. L = number of transformer layers, H = hidden embedding size, A = number of attention heads.
In the attention mask, '1' indicates a real token and '0' indicates padding. This mask tells the self-attention mechanism not to include [PAD] tokens when interpreting sentences.
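As an illustration, the following minimal sketch builds such inputs with the Hugging Face transformers library (footnote 8); the multilingual checkpoint name is the standard Hub identifier, and the maximum length of 64 tokens is an assumed value.

```python
# Minimal sketch of BERT input preparation with Hugging Face transformers.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

reviews = [
    "aplikasinya bagus dan mudah digunakan",  # "the app is good and easy to use"
    "sering error setelah update",            # "often crashes after the update"
]

encoded = tokenizer(
    reviews,
    padding="max_length",  # shorter reviews are filled with [PAD] tokens
    truncation=True,
    max_length=64,         # assumed maximum sequence length
    return_tensors="pt",
)

# attention_mask: 1 marks real tokens, 0 marks [PAD] positions that
# self-attention should ignore when interpreting the sentence.
print(encoded["input_ids"][0])
print(encoded["attention_mask"][0])
```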
To fine-tune the model for classification, an additional fully-connected dense layer is applied on top of its output layer, and the entire model is retrained on the domain-specific dataset. BERT takes only the final hidden state of the [CLS] token as the representation of the entire sentence to feed into the fully-connected dense layer. The size of this last fully-connected dense layer equals the number of labels. Softmax activation with sparse categorical cross-entropy is added on top of the model to predict the likelihood of each label. Figure 2 shows how the fine-tuned model is used for classification. According to the original paper, the authors give ranges of values for fine-tuning on downstream tasks: batch size [16, 32], learning rate (Adam) [5e-5, 3e-5, 2e-5], and number of epochs [2, 3, 4]. A sketch of this setup follows.
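The PyTorch sketch below illustrates this classification setup, assuming the Hugging Face transformers library (footnotes 7 and 8); the class name, checkpoint identifier, and label count are illustrative choices rather than the paper's exact implementation. Note that PyTorch's CrossEntropyLoss combines softmax with cross-entropy over integer labels, which corresponds to the softmax plus sparse categorical cross-entropy described above.

```python
# Sketch of a BERT classifier: a fully-connected layer on the [CLS] state.
import torch
import torch.nn as nn
from transformers import AutoModel

class BertSentimentClassifier(nn.Module):
    def __init__(self, checkpoint: str = "bert-base-multilingual-cased",
                 num_labels: int = 2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(checkpoint)
        # Output size of the dense layer equals the number of labels.
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Only the final hidden state of [CLS] (position 0) is used as the
        # representation of the whole sentence.
        cls_state = outputs.last_hidden_state[:, 0, :]
        return self.classifier(cls_state)  # raw logits

model = BertSentimentClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # within the recommended range
# Applies softmax internally over integer class labels, i.e. softmax +
# sparse categorical cross-entropy as described in the text.
loss_fn = nn.CrossEntropyLoss()
```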
Figure 2: BERT input formatting scheme, in which the tokenizer adds the special tokens
2.3 Evaluation criteria
We use accuracy to measure the performance of all models. Accuracy, one of the most intuitive metrics, measures the proportion of correctly identified cases. It is most appropriate when all classes are equally important, i.e., when true positives (TP) and true negatives (TN) matter most. Accuracy is defined in Equation (1), where TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives. We also record the computation time of the training phase.
\[ \text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]
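As a quick worked example of Equation (1), with invented counts of 420 TP, 410 TN, 90 FP, and 80 FN, accuracy is (420 + 410) / 1000 = 0.83:

```python
# Accuracy as defined in Equation (1); the example counts are invented.
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=420, tn=410, fp=90, fn=80))  # 0.83
```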
6 https://colab.research.google.com.
7 https://pytorch.org.
8 https://huggingface.co/transformers.
Figure 4: Comparison of dataset labeling results: (a) score-based labeling produces balanced data, while (b) lexicon-based labeling produces imbalanced data.
Table 3: Fine-tuning results of the pre-trained models on score-based labeled data with learning rate 1e-5 over 10 epochs.
Table 4: Fine-tuning results of the pre-trained models on lexicon-based labeled data with learning rate 1e-5 over 10 epochs.
The first fine-tuning experiment for the two pre-trained models on score-based labeled data, performed with a base learning rate of 1e-5 at batch sizes of 16 and 32 over ten epochs, is shown in Table 3. IndoBERT-Base with a batch size of 16 achieves the best test accuracy of 0.5856. However, the results show that fine-tuning BERT on the score-based labeled dataset is not much better than the SVM baseline of 0.5736.
Table 5: IndoBERT-Base fine-tuning results for lexicon-based data labeling on 25 epochs

Batch Size | Learning Rate | Avg Training Accuracy | Avg Validation Accuracy | Training Time | Test Accuracy
16         | 1e-5          | 0.9701                | 0.8215                  | 23min 21s     | 0.8267
32         | 1e-5          | 0.9709                | 0.8301                  | 16min 40s     | 0.8342
16         | 2e-5          | 0.9650                | 0.8337                  | 24min 20s     | 0.8493
32         | 2e-5          | 0.9718                | 0.8413                  | 16min 56s     | 0.8286
16         | 3e-5          | 0.9332                | 0.8141                  | 24min 52s     | 0.7909
32         | 3e-5          | 0.9678                | 0.8269                  | 16min 55s     | 0.8173
Figure 5: Training and validation accuracy on the best fine-tuning result. IndoBERT-Base with learning rate=2e-5; batch size=16;
number of epochs=25
In the second fine-tuning experiment, we used the same hyper-parameters as before on the lexicon-based labeled data; the results are shown in Table 4. IndoBERT-Base again obtained the best test accuracy, 0.8248, at batch size 16. This is because the multilingual BERT-Base is trained only on a Wikipedia corpus, while the dataset used here contains many slang sentences; IndoBERT-Base, by contrast, is trained on larger Indonesian data containing both formal and slang language, such as tweets. In the last experiment, we focused only on lexicon-based labeled data, because score-based labeling showed unfavorable results, and used only IndoBERT-Base, because previous tests showed it to be stable. We kept the learning rate variations [1e-5, 2e-5, 3e-5] at batch sizes of 16 and 32 but increased the number of epochs to 25 to improve on the accuracy obtained in the previous tests. The best test accuracy, 0.8493, is obtained with a learning rate of 2e-5 and a batch size of 16, as shown in Table 5. The training history of this best configuration is shown in Figure 5. Across all experiments, a smaller batch size increases the overall computation time, and IndoBERT-Base, pre-trained specifically on Indonesian, shows better results than the multilingual pre-trained model. A sketch of the best configuration's training loop follows.
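The sketch below outlines one fine-tuning run under the best configuration from Table 5 (learning rate 2e-5, batch size 16, 25 epochs), again assuming PyTorch and Hugging Face transformers; the tensors are random stand-ins for the tokenized review dataset, so the numbers it produces are meaningless.

```python
# Hedged sketch of one fine-tuning run with the Table 5 best configuration.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Stand-in batch of 64 reviews, 64 tokens each; real input_ids and
# attention_mask come from the tokenizer, labels from the labeling step.
input_ids = torch.randint(0, 1000, (64, 64))
attention_mask = torch.ones_like(input_ids)
labels = torch.randint(0, 2, (64,))
loader = DataLoader(TensorDataset(input_ids, attention_mask, labels),
                    batch_size=16, shuffle=True)

model.train()
for epoch in range(25):
    for ids, mask, y in loader:
        out = model(input_ids=ids, attention_mask=mask, labels=y)
        out.loss.backward()  # cross-entropy loss computed by the model head
        optimizer.step()
        optimizer.zero_grad()
```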
4 CONCLUSION
This study has successfully conducted in-depth experiments on sentiment analysis of Indonesian-language user reviews. BERT, a recent state-of-the-art approach to natural language processing, has shown strong results on downstream tasks such as sentiment analysis. Through fine-tuning, pre-trained models can perform transfer learning, yielding much better results than traditional machine learning. The method of labeling the data affects the final result: lexicon-based labeling is proven to give much better accuracy than score-based labeling. The best results are obtained with the IndoBERT-Base model, whose highest accuracy is 84%, with 25 epochs and a training time of 24 minutes. Future work will investigate the specific effects of each preprocessing procedure and of other tuning settings on further language models for sentiment analysis.
REFERENCES
[1] H.-W. Kim, H.-L. Lee, and S.-J. Choi, “An Exploratory Study on the Determinants of Mobile Application Purchase,” The Journal of Society for e-
Business Studies, vol. 16, no. 4, pp. 173–195, Nov. 2011, doi: 10.7838/jsebs.2011.16.4.173.
[2] R. Alturki and V. Gay, “Usability attributes for mobile applications: A systematic review,” in EAI/Springer Innovations in Communication and
Computing, Springer Science and Business Media Deutschland GmbH, 2019, pp. 53–62.
[3] A. Ligthart, C. Catal, and B. Tekinerdogan, “Systematic reviews in sentiment analysis: a tertiary study,” Artificial Intelligence Review, pp. 1–57, 2021,
doi: 10.1007/s10462-021-09973-3.
[4] S. Sah, “Machine Learning: A Review of Learning Types,” no. July, 2020, doi: 10.20944/preprints202007.0230.v1.
[5] M. Wongkar and A. Angdresey, “Sentiment Analysis Using Naive Bayes Algorithm Of The Data Crawler: Twitter,” Oct. 2019, doi:
10.1109/ICIC47613.2019.8985884.
[6] Y. Nurdiansyah, S. Bukhori, and R. Hidayat, “Sentiment analysis system for movie review in Bahasa Indonesia using naive bayes classifier method,”
in Journal of Physics: Conference Series, Apr. 2018, vol. 1008, no. 1, p. 012011, doi: 10.1088/1742-6596/1008/1/012011.
[7] H. Tuhuteru, “Analisis Sentimen Masyarakat Terhadap Pembatasan Sosial Berskala Besar Menggunakan Algoritma Support Vector Machine,”
Information System Development (ISD), vol. 5, no. 2, pp. 7–13, 2020.
[8] M. Ahmad, S. Aftab, M. S. Bashir, N. Hameed, I. Ali, and Z. Nawaz, “SVM optimization for sentiment analysis,” International Journal of Advanced
Computer Science and Applications, vol. 9, no. 4, pp. 393–398, 2018, doi: 10.14569/IJACSA.2018.090455.
[9] M. Avinash and E. Sivasankar, “A study of feature extraction techniques for sentiment analysis,” in Advances in Intelligent Systems and Computing,
Jun. 2019, vol. 814, pp. 475–486, doi: 10.1007/978-981-13-1501-5_41.
[10] F. Hemmatian and M. K. Sohrabi, “A survey on classification techniques for opinion mining and sentiment analysis,” Artificial Intelligence Review,
vol. 52, no. 3, pp. 1495–1545, Oct. 2019, doi: 10.1007/s10462-017-9599-6.
[11] S. Shayaa et al., “Sentiment analysis of big data: Methods, applications, and open challenges,” IEEE Access, vol. 6, pp. 37807–37827, Jun. 2018, doi:
10.1109/ACCESS.2018.2851311.
[12] G. N. Alemneh, A. Rauber, and S. Atnafu, “Dictionary Based Amharic Sentiment Lexicon Generation,” in Communications in Computer and
Information Science, May 2019, vol. 1026, pp. 311–326, doi: 10.1007/978-3-030-26630-1_27.
[13] S. Park and Y. Kim, “Building thesaurus lexicon using dictionary-based approach for sentiment classification,” in 2016 IEEE/ACIS 14th International
Conference on Software Engineering Research, Management and Applications, SERA 2016, Jul. 2016, pp. 39–44, doi: 10.1109/SERA.2016.7516126.
[14] M. Darwich, S. A. Mohd Noah, N. Omar, and N. A. Osman, “Corpus-Based Techniques for Sentiment Lexicon Generation: A Review,” Journal of
Digital Information Management, vol. 17, no. 5, p. 296, Oct. 2019, doi: 10.6025/jdim/2019/17/5/296-305.
[15] I. Guellil, A. Adeel, F. Azouaou, and A. Hussain, “SentiALG: Automated Corpus Annotation for Algerian Sentiment Analysis,” in Lecture Notes in
Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Jul. 2018, vol. 10989 LNAI, pp. 557–
567, doi: 10.1007/978-3-030-00563-4_54.
[16] R. Narayan, M. Roy, and S. Dash, “Ensemble based Hybrid Machine Learning Approach for Sentiment Classification- A Review,” International
Journal of Computer Applications, vol. 146, no. 6, pp. 31–36, Jul. 2016, doi: 10.5120/ijca2016910813.
[17] W. Yin, K. Kann, M. Yu, and H. Schütze, “Comparative Study of CNN and RNN for Natural Language Processing,” Feb. 2017.
[18] N. Sharma, V. Jain, and A. Mishra, “An Analysis of Convolutional Neural Networks for Image Classification,” in Procedia Computer Science, Jan.
2018, vol. 132, pp. 377–384, doi: 10.1016/j.procs.2018.05.198.
[19] Y. Kim, “Convolutional neural networks for sentence classification,” in EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language
Processing, Proceedings of the Conference, Aug. 2014, pp. 1746–1751, doi: 10.3115/v1/d14-1181.
[20] H. Kim and Y. S. Jeong, “Sentiment classification using Convolutional Neural Networks,” Applied Sciences (Switzerland), vol. 9, no. 11, pp. 1–14,
2019, doi: 10.3390/app9112347.
[21] L. Kurniasari and A. Setyanto, “Sentiment Analysis using Recurrent Neural Network,” in Journal of Physics: Conference Series, Mar. 2020, vol. 1471,
no. 1, p. 012018, doi: 10.1088/1742-6596/1471/1/012018.
[22] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997, doi:
10.1162/neco.1997.9.8.1735.
[23] D. Li and J. Qian, “Text sentiment analysis based on long short-term memory,” in 2016 1st IEEE International Conference on Computer Communication
and the Internet, ICCCI 2016, Dec. 2016, pp. 471–475, doi: 10.1109/CCI.2016.7778967.
[24] B. N. Saha and A. Senapati, “Long Short Term Memory (LSTM) based Deep Learning for Sentiment Analysis of English and Spanish Data,” in 2020
International Conference on Computational Performance Evaluation, ComPE 2020, Jul. 2020, pp. 442–446, doi: 10.1109/ComPE49325.2020.9200054.
[25] A. Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, pp. 5999–6009, 2017.
[26] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language Models are Unsupervised Multitask Learners,” OpenAI Blog, vol. 1, no. 8, pp. 1–7, 2019.
[27] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, vol. 1, pp. 4171–4186, 2019.
[28] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized autoregressive pretraining for language understanding,” arXiv preprint, pp. 1–18, 2019.
[29] Y. Liu et al., “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint, 2019.
[30] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “ALBERT: A Lite BERT for Self-supervised Learning of Language
Representations,” pp. 1–17, 2019, [Online]. Available: http://arxiv.org/abs/1909.11942.
[31] B. Wilie et al., “IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding,” Proceedings of the 1st Conference
of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing,
pp. 843–857, 2020.
[32] M. Polignano, P. Basile, M. de Gemmis, G. Semeraro, and V. Basile, “AlBERTo: Italian BERT language understanding model for NLP challenging
tasks based on tweets,” in CEUR Workshop Proceedings, 2019, vol. 2481.
[33] R. Scheible, F. Thomczyk, P. Tippmann, V. Jaravine, and M. Boeker, “GottBERT: a pure German Language Model,” Dec. 2020.
[34] C. Sun, X. Qiu, Y. Xu, and X. Huang, “How to Fine-Tune BERT for Text Classification?,” in Lecture Notes in Computer Science (including subseries
Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Oct. 2019, vol. 11856 LNAI, pp. 194–206, doi: 10.1007/978-3-030-32381-
3_16.
[35] Q. T. Nguyen, T. L. Nguyen, N. H. Luong, and Q. H. Ngo, “Fine-Tuning BERT for Sentiment Analysis of Vietnamese Reviews,” in Proceedings - 2020
7th NAFOSTED Conference on Information and Computer Science, NICS 2020, Nov. 2020, pp. 302–307, doi: 10.1109/NICS51282.2020.9335899.
[36] F. Koto and G. Y. Rahmaningtyas, “InSet lexicon: Evaluation of a word list for Indonesian sentiment analysis in microblogs,” in Proceedings of the 2017 International Conference on Asian Language Processing, IALP 2017, Feb. 2018, pp. 391–394, doi: 10.1109/IALP.2017.8300625.