Large-Scale_News_Classification_using_BERT_Languag (1)
Approach
Novanto Yudistira
Department of Informatics Engineering, Faculty of Computer Science, Brawijaya University, Indonesia
The rise of big data analytics on top of NLP is increasing the computational burden of text processing at scale. Text in NLP is
very high-dimensional, so processing it takes substantial computational resources. MapReduce allows parallelization of large
computations and can improve the efficiency of text processing. This research aims to study the effect of big data processing on NLP
tasks based on a deep learning approach. We classify a large corpus of news topics by fine-tuning pre-trained BERT models. Five
pre-trained models with different numbers of parameters were used in this study. To measure the efficiency of this method, we
compared the performance of BERT with the pipelines from Spark NLP. The results show that BERT without Spark NLP gives
higher accuracy than BERT with Spark NLP. The average accuracy and training time across all models are 0.9187 and 35
minutes using BERT, versus 0.8444 and 9 minutes using BERT with the Spark NLP pipeline. Bigger models take more computational
resources and need more time to complete the task. However, the accuracy of BERT with Spark NLP decreased by an average of only
5.7%, while the training time was reduced significantly, by 62.9%, compared to BERT without Spark NLP.
Additional Keywords and Phrases: Large-scale text classification, distributed NLP architectures, BERT language model,
Spark NLP
1 INTRODUCTION
Natural language processing (NLP) is a subfield of artificial intelligence (AI) that studies human and computer
interactions through natural languages, covering the meaning of words, phrases, and sentences as well as syntactic and
semantic processing. Early NLP research used rule-based methods to understand and reason about text. Experts manually
created these rules for various NLP tasks [1]. Managing the rules becomes complicated when their number is large, so
researchers consider this approach obsolete [2]. The development of the Internet made data easy to collect, which made it
possible to resolve NLP tasks with a statistical learning approach, known as the machine learning approach. With
feature engineering, this approach brought significant improvements to many NLP tasks [3]. Deep learning approaches
were then introduced to NLP in 2012 after their success in image recognition [4] and speech recognition [5]. Deep
learning outperformed the other approaches with surprisingly better results.
In NLP, language modeling (LM) provides a context that differentiates similar words and phrases according to the
context in which they appear. The deep learning-based NLP framework for language modeling has entered a new
chapter, characterized by a constantly evolving set of deep learning architectures and models for solving NLP tasks.
A previously successful architecture is the bidirectional LSTM (bi-LSTM), based on the recurrent neural
network (RNN), in which the model reads the context from left to right and from right to left [6]. The main limitation
of bi-LSTM is its sequential nature, which makes parallel training very difficult. The transformer architecture
overcomes this limitation by replacing the LSTM cells with an "attention" mechanism [7]. With attention, the model can see
the entire sequence of context as a whole, making it easier to train in parallel. The transformer has made great
progress on many different NLP benchmarks. There are many transformer-based language models, including BERT [8],
RoBERTa [9], GPT-2 [10], and XLNet [11].
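To make the attention idea concrete, the following is a minimal sketch of scaled dot-product attention as described in [7], written in plain Python. The toy two-dimensional token vectors and the function names are illustrative assumptions; learned projections, multiple heads, and masking are omitted. Each query is processed independently of the others, which is what makes the computation easy to parallelize.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query: a weighted average of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:  # each query is independent, so this loop could run in parallel
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Toy self-attention: three "tokens" attend to each other (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = scaled_dot_product_attention(tokens, tokens, tokens)
```

Unlike an LSTM, no output here depends on the output for a previous position, so all positions can be computed at once.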
The rise of "big data" analytics on top of NLP has led to an increasing need to ease the computational burden of
processing text at scale [12]. The amount of unstructured textual data has led to increased interest in information
extraction technology from academia and industry. One of the problems faced in NLP is that text has very high
dimensionality [13], so computation must be capable of processing high-dimensional textual data quickly. Input data is
distributed across clusters of machines so that processing completes within a reasonable time. MapReduce allows easy
parallelization of large computations and uses re-execution as the primary mechanism for fault tolerance [14]. Previous
research has used this concept to perform sentiment analysis tasks. The results show that MapReduce can
improve the efficiency of processing large amounts of text, even though the performance obtained is similar to traditional
sentiment analysis [15].
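The MapReduce programming model [14] can be illustrated with a single-process word count. This is only a sketch of the map, shuffle, and reduce phases. The function names and the two toy documents are assumptions, and a real deployment would run the map and reduce calls on different machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)

def word_count(documents):
    # Each document could be mapped on a different worker.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    # Each key could be reduced on a different worker.
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["stocks rise on earnings", "Stocks fall on rates"])
```

Because the map calls share no state, a failed map task can simply be re-executed, which is the fault-tolerance mechanism the model relies on.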
This research aims to study big data processing on NLP tasks based on a deep learning approach. We classify a large
number of news articles by topic using BERT, which is based on the transformer architecture. Training BERT from scratch
requires a huge dataset and takes much time. Therefore, we use existing pre-trained models [8], [16]. To demonstrate the
efficiency of this method, we conducted extensive experiments on our proposed approach. We use Spark NLP, a library built
on top of Apache Spark that can scale the entire classification process in a distributed environment [17]. We
compared the performance of the base model with the classifier pipelines from Spark NLP. Apart from observing
the model's accuracy, we also look at the computation time and computational resources used during the training and
testing process.
2 RELATED WORK
Big data often comes in an unstructured format, mainly textual data, called big text [18]. Social media contributes the
most to big text. In addition, other online sources such as online news portals, blogs, health records, and government
data provide rich textual data for research. Despite the abundance of data sources, this field has attracted less attention
from academia. In this section, we present literature studies carried out in the fields of deep learning for text classification
and big data frameworks for large-scale text processing. We reviewed prior work to understand its limitations so that we
can use them to refine our research.
Deep learning has great potential in the NLP field [19]. Many studies have contributed to text classification tasks
using deep neural networks. Some successful architectures include convolutional neural network (CNN) based models,
for example, VD-CNN [20] and DP-CNN [21]; recurrent neural network (RNN) based models, for example, SANN [22];
and attention-based models, for example, HAN [23] and DiSAN [24]. These models use pre-trained word embeddings
[25], [26] to improve performance in downstream tasks. Although many impressive results have been achieved, the
dependency problem still limits improvements in model performance. Even with the development of
contextualized word vectors such as CoVe [27] and ELMo [28], the model architecture still needs to be designed for each
particular task. Pre-training language models and fine-tuning them on downstream tasks have led to breakthroughs in NLP.
Howard and Ruder proposed ULMFiT [29], whereas Radford et al. proposed OpenAI GPT [30], which uses a multi-layer
transformer architecture to learn language representations from large-scale text. To overcome the unidirectional language
representation of OpenAI GPT, Devlin et al. proposed BERT [8], which uses deep bidirectional representations. Compared
to previous models, BERT does not require a specific architecture for each downstream task, and it has
achieved great success in many NLP downstream tasks [31].
Hadoop is a MapReduce platform used for distributed processing. One of the Hadoop framework's major problems is
that it expresses every computation as a MapReduce job [12]. In NLP, this would require re-implementing each NLP
pipeline, which is inefficient. Apache Spark addresses this problem by extending the Hadoop ecosystem with a parallel
computational programming model, including resilient distributed datasets (RDDs) and learning algorithms [32]. Next,
Meng et al. introduced MLlib¹ as a machine learning library running on Spark [33]. Fu et al. analyzed
the Spark framework by running a machine learning instance using MLlib and highlighted Spark's advantages [34].
Spark is also used as a distributed framework for solving NLP tasks such as sentiment analysis [35], [36] and document
classification [37]. These results show that Spark has a speed advantage in large-scale text processing. As deep
learning models have succeeded in NLP, there is a need to apply pre-trained models to large data in
distributed use cases. John Snow Labs² developed Spark NLP as a library built on top of Apache Spark and Spark MLlib
that provides an NLP pipeline and pre-trained models [17]. The library offers the ability to train, customize, and save
models so they can be run on clusters, run on other machines, or stored.
3 METHODOLOGY
3.1 Dataset
We use a corpus of news articles from the AG dataset [38]. It contains 1 million news articles that have been gathered
from more than 2,000 news sources by ComeToMyHead. This dataset includes 120,000 training samples and 7,600 test
samples. We use only the description as the sample and the category as the label. Each sample is a short text assigned
to one of four labels.
1 https://spark.apache.org/mllib.
2 https://nlp.johnsnowlabs.com.
3.2 BERT
BERT is a deep learning architecture that can be used for downstream NLP tasks. The architecture consists of stacked
encoder layers from the transformer [7]. There are two main steps in BERT: pre-training and fine-tuning [8]. During
pre-training, BERT is trained on a large unlabeled corpus with two unsupervised tasks, masked language modeling (MLM) and
next sentence prediction (NSP), to produce a pre-trained model. For fine-tuning, the model is initialized with the
pre-trained parameters, and all the parameters are fine-tuned using labeled data for a specific task such as classification.
We can treat the pre-trained model as a black box that outputs a vector of size H (H = 768 for BERT-Base) for each input
token in a sequence. A sequence can be one sentence or a pair of sentences separated by a [SEP] token, and it begins with
a [CLS] token. For the classification task, we add an output layer to the model and fine-tune all parameters end to end.
In practice, we use only the output for the [CLS] token as the representation of the whole sequence. The entire
fine-tuning BERT architecture for the classification task is shown in Figure 1. A simple SoftMax classifier is added on
top of the model to predict the probability of label c, as shown in Equation 1, where W is the task-specific parameter
matrix. We fine-tune all the parameters of BERT jointly with W by maximizing the log-probability of the correct label.
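Equation 1 is not reproduced in this text. A common form of such a classifier, assumed here, is p(c | h) = softmax(W h), where h is the final hidden state of the [CLS] token. The sketch below illustrates that assumed form with a toy hidden size of 3 and four labels. The numbers in `h_cls` and `W` are hypothetical and stand in for values that would come from the fine-tuned model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_cls(h_cls, W):
    """Assumed form of Equation 1: p(c | h) = softmax(W h)."""
    logits = [sum(w * h for w, h in zip(row, h_cls)) for row in W]
    return softmax(logits)

def negative_log_likelihood(probs, label):
    """Minimizing this loss maximizes the log-probability of the correct label."""
    return -math.log(probs[label])

# Hypothetical [CLS] vector (H = 3) and task-specific matrix W (4 labels x H).
h_cls = [0.5, -1.0, 2.0]
W = [
    [0.1, 0.0, 0.2],
    [0.0, 0.3, -0.1],
    [0.2, 0.1, 0.0],
    [-0.1, 0.0, 0.1],
]
probs = classify_cls(h_cls, W)
loss = negative_log_likelihood(probs, label=0)
```

During fine-tuning, the gradient of this loss would flow into both W and the BERT encoder that produces `h_cls`.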
In this study, we use five pre-trained models, as shown in Table 1. Following the original paper, L represents the number of
transformer layers (stacked encoders), H represents the hidden embedding size, and A represents the number of
attention heads [8]. Smaller model architectures use fewer parameters to train and can be used with limited computational
resources. The number of parameters in each pre-trained model is shown in Table 2.
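As a rough illustration of how L and H determine model size, the sketch below estimates the parameter count of a BERT encoder. It assumes the standard BERT configuration, which is not stated in this text: a WordPiece vocabulary of 30,522 tokens, 512 positions, 2 segment types, a feed-forward size of 4H, and a pooler layer. The counts are estimates under those assumptions and may differ from the figures in Table 2.

```python
def bert_parameter_count(num_layers, hidden_size, vocab_size=30522,
                         max_positions=512, type_vocab_size=2):
    """Estimate encoder parameters; the defaults are assumed, not from the paper."""
    h = hidden_size
    intermediate = 4 * h  # assumed feed-forward width
    # Token, position, and segment embeddings plus one LayerNorm (scale and bias).
    embeddings = (vocab_size + max_positions + type_vocab_size) * h + 2 * h
    # Query, key, value, and output projections (weights and biases) plus LayerNorm.
    attention = 4 * (h * h + h) + 2 * h
    # Two dense layers (weights and biases) plus LayerNorm.
    feed_forward = (h * intermediate + intermediate) + (intermediate * h + h) + 2 * h
    # Dense layer applied to the [CLS] output.
    pooler = h * h + h
    return embeddings + num_layers * (attention + feed_forward) + pooler

# (L, H) pairs taken from Table 1.
models = {"Tiny": (2, 128), "Mini": (4, 256), "Small": (4, 512),
          "Medium": (8, 512), "Base": (12, 768)}
estimates = {name: bert_parameter_count(L, H) for name, (L, H) in models.items()}
```

Under these assumptions the Base estimate is roughly 110 million parameters and the Tiny estimate roughly 4.4 million, and the embedding table dominates the smallest models.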
Figure 1: Fine-tuning BERT architecture for the classification task. We just use the [CLS] output token for classification along with
some added Linear and SoftMax layers.
Table 1: Pre-trained BERT models used. We focus on five models: Tiny (L=2, H=128), Mini (L=4, H=256), Small (L=4, H=512),
Medium (L=8, H=512), and Base (L=12, H=768).
Table 2: The number of parameters on the pre-trained BERT model.
The optimal hyperparameter values are task-specific. We use Adam with β1 = 0.9 and β2 = 0.999. The base learning rate
is 1e-4 and the warm-up proportion is 0.1. We empirically set the maximum number of epochs to 4 and save the best model
on the validation set for testing.
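The warm-up setting can be illustrated with a small learning-rate schedule. This sketch assumes linear warm-up followed by linear decay to zero, as in the original BERT implementation; the text above specifies only the base rate and the warm-up proportion. The total step count also assumes a batch size of 32 and all 120,000 training samples, which are assumptions for this configuration.

```python
def learning_rate(step, total_steps, base_lr=1e-4, warmup_proportion=0.1):
    """Linear warm-up to base_lr, then an assumed linear decay to zero."""
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / decay_steps)

# Assumed: 120,000 samples, batch size 32, 4 epochs.
steps_per_epoch = 120_000 // 32
total_steps = steps_per_epoch * 4
schedule = [learning_rate(s, total_steps) for s in range(0, total_steps + 1, 1500)]
```

With these numbers the rate rises over the first 1,500 steps, peaks at 1e-4, and then decreases over the remaining 13,500 steps.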
Figure 2: Spark NLP pipeline as a sequence for text classification. Each annotator applied adds a new column to a DataFrame that is
fed into the pipeline.
3 https://spacy.io.
4 https://textblob.readthedocs.io/en/dev.
5 https://radimrehurek.com/gensim.
To create a classifier in Spark NLP, we use ClassifierDL. ClassifierDL is a multi-class text classifier in Spark NLP that
takes various text embeddings as input for text classification. ClassifierDL uses a deep neural network (DNN)
built in TensorFlow⁶. Classification is carried out after the text processing stages above.
We train each pre-trained model for 4 epochs with a batch size of 32 and a learning rate of 1e-4. Spark NLP
writes the training logs to the annotator_logs folder in our directory.
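A minimal sketch of such a pipeline in the Spark NLP Python API might look as follows. The column names, the pre-trained model name `sent_small_bert_L2_128`, and the choice of `BertSentenceEmbeddings` as the embedding stage are assumptions made for illustration; the intermediate text-processing stages of Figure 2 are omitted, so this is not necessarily the exact pipeline used in the experiments.

```python
def build_classifier_pipeline(text_col="description", label_col="category",
                              bert_model="sent_small_bert_L2_128"):
    """Assemble a document -> BERT sentence embedding -> ClassifierDL pipeline."""
    # Imported here so the function can be defined without Spark installed.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import BertSentenceEmbeddings, ClassifierDLApproach

    # Turn the raw text column into Spark NLP's document annotation.
    document = (DocumentAssembler()
                .setInputCol(text_col)
                .setOutputCol("document"))

    # Assumed embedding stage: a pre-trained small BERT sentence encoder.
    embeddings = (BertSentenceEmbeddings.pretrained(bert_model, "en")
                  .setInputCols(["document"])
                  .setOutputCol("sentence_embeddings"))

    # Hyperparameters follow the values stated in the text.
    classifier = (ClassifierDLApproach()
                  .setInputCols(["sentence_embeddings"])
                  .setOutputCol("class")
                  .setLabelColumn(label_col)
                  .setMaxEpochs(4)
                  .setLr(1e-4)
                  .setBatchSize(32)
                  .setEnableOutputLogs(True))

    return Pipeline(stages=[document, embeddings, classifier])
```

After starting a session with `sparknlp.start()`, calling `build_classifier_pipeline().fit(train_df)` on a DataFrame with the two assumed columns would train the classifier, and each stage adds its output column to the DataFrame.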
6 https://www.tensorflow.org.
7 https://colab.research.google.com.
8 https://wandb.ai.
Figure 3: Computational resources used during training with BERT without the Spark NLP pipeline. (a) GPU utilization, (b) GPU
memory allocated, and (c) process memory in use.
Table 3: Comparison of accuracy and computation time during the training process between the BERT without Spark NLP and BERT
with Spark NLP pipelines. We also calculate the reduction in accuracy and computation time (in percent) to determine the
effectiveness of the proposed pipeline.
The results of all experiments show that BERT without Spark NLP gives higher accuracy than BERT
with Spark NLP on all pre-trained models. However, BERT with Spark NLP has an advantage in efficiency. As shown in Table
3, BERT with Spark NLP still gives good accuracy while taking less time to complete the task. Computation time decreases
significantly, by 62.9%, when using BERT with Spark NLP, with a decrease in accuracy of 5.7%. Even though
Spark NLP uses much more RAM, the efficiency of this method is clear.
5 CONCLUSION
The BERT model is well suited to large-scale NLP tasks such as news classification. Larger models give
higher accuracy but take more time to complete the task. Larger training and test datasets also increase the time
it takes to complete the task. Using Spark NLP is advantageous when we want to use a
BERT-Large model and process large amounts of data. In this study, we found that using BERT with Spark NLP is more
efficient than using BERT without Spark NLP. Using BERT with Spark NLP, the average drop in accuracy is 5.7% and the
average drop in training time is 62.9% compared to BERT without Spark NLP. In the near future, we plan to expand and
improve our framework by exploring more architectures and pre-trained models to improve classification performance
and make better use of computational resources. Furthermore, we want to explore the effects of text preprocessing prior to training.
REFERENCES
[1] G. Sidorov, A. Gupta, M. Tozer, D. Catala, A. Catena, and S. Fuentes, “Rule-based system for automatic grammar correction using syntactic n-grams
for english Language Learning (L2),” CoNLL 2013 - 17th Conference on Computational Natural Language Learning, Proceedings of the Shared Task, pp.
96–101, 2013.
[2] L. Chiticariu, Y. Li, and F. R. Reiss, “Rule-based information extraction is dead! Long live rule-based information extraction systems!,” EMNLP 2013 -
2013 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, no. October, pp. 827–832, 2013.
[3] Y. Xu, K. Hong, J. Tsujii, and E. I. C. Chang, “Feature engineering combined with machine learning and rule-based methods for structured information
extraction from narrative clinical discharge summaries,” Journal of the American Medical Informatics Association, vol. 19, no. 5, pp. 824–832, 2012, doi:
10.1136/amiajnl-2011-000776.
[4] L. Fei-Fei, J. Deng, and K. Li, “ImageNet: Constructing a large-scale image database,” Journal of Vision, vol. 9, no. 8, pp. 1037–1037, 2010, doi:
10.1167/9.8.1037.
[5] W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, and A. Stolcke, “The Microsoft 2017 conversational speech recognition system,” arXiv, 2017.
[6] P. Zhou, Z. Qi, S. Zheng, J. Xu, H. Bao, and B. Xu, “Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling,”
Conference of 26th International Conference on Computational Linguistics, COLING 2016, vol. 2, no. 1, pp. 3485–3495, 2016.
[7] A. Vaswani et al., “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 2017-Decem, no. Nips, pp. 5999–6009, 2017.
[8] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” NAACL HLT
2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of
the Conference, vol. 1, no. Mlm, pp. 4171–4186, 2019.
[9] Y. Liu et al., “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv, no. 1, 2019.
[10] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language Models are Unsupervised Multitask Learners,” OpenAI Blog,
vol. 1, no. May, pp. 1–7, 2020.
[11] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized autoregressive pretraining for language understanding,”
arXiv, no. NeurIPS, pp. 1–18, 2019.
[12] R. Agerri, X. Artola, Z. Beloki, G. Rigau, and A. Soroa, “Big data for Natural Language Processing: A streaming approach,” Knowledge-Based Systems,
vol. 79, pp. 36–42, 2015, doi: 10.1016/j.knosys.2014.11.007.
[13] J. Wang, Y. Li, J. Shan, J. Bao, C. Zong, and L. Zhao, “Large-Scale Text Classification Using Scope-Based Convolutional Neural Network: A Deep
Learning Approach,” IEEE Access, vol. 7, pp. 171548–171558, 2019, doi: 10.1109/ACCESS.2019.2955924.
[14] J. Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Association for Computing Machinery, vol. 51, no. January
2008, pp. 107–113, 2008, doi: 10.1145/1327452.1327492.
[15] I. Ha, B. Back, and B. Ahn, “MapReduce functions to analyze sentiment information from social big data,” International Journal of Distributed Sensor
Networks, vol. 2015, 2015, doi: 10.1155/2015/417502.
[16] I. Turc, M. W. Chang, K. Lee, and K. Toutanova, “Well-read students learn better: On the importance of pre-training compact models,” arXiv, no. Mlm,
pp. 1–13, 2019.
[17] V. Kocaman and D. Talby, “Spark NLP: Natural Language Understanding at Scale,” Software Impacts, vol. 8, no. January, p. 100058, 2021, doi:
10.1016/j.simpa.2021.100058.
[18] M. Sokolova, “Big Text advantages and challenges: classification perspective,” International Journal of Data Science and Analytics, vol. 5, no. 1, pp. 1–
10, 2018, doi: 10.1007/s41060-017-0087-5.
[19] J. Dai and C. Chen, “Text classification system of academic papers based on hybrid Bert-BiGRU model,” Proceedings of 2020 12th International
Conference on Intelligent Human-Machine Systems and Cybernetics, IHMSC 2020, vol. 2, pp. 40–44, 2020, doi: 10.1109/IHMSC49165.2020.10088.
[20] A. Conneau, H. Schwenk, Y. Le Cun, and L. Barrault, “Very deep convolutional networks for text classification,” Proceedings of 15th Conference of the
European Chapter of the Association for Computational Linguistics, EACL 2017, vol. 1, no. 2001, pp. 1107–1116, 2017, doi: 10.18653/v1/e17-1104.
[21] R. Johnson and T. Zhang, “Deep pyramid convolutional neural networks for text categorization,” Proceedings of 55th Annual Meeting of the Association
for Computational Linguistics, vol. 1, pp. 562–570, 2017, doi: 10.18653/v1/P17-1052.
[22] F. Kokkinos and A. Potamianos, “Structural attention neural networks for improved sentiment analysis,” Proceedings of 15th Conference of the European
Chapter of the Association for Computational Linguistics, vol. 2, pp. 586–591, 2017.
[23] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, “Hierarchical Attention Networks for Document Classification,” Proceedings of the North
American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2016, pp. 1480–1489, 2016.
[24] T. Shen, J. Jiang, T. Zhou, S. Pan, G. Long, and C. Zhang, “DiSAN: Directional self-attention network for RNN/CNN-free language understanding,”
The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pp. 5446–5455, 2018.
[25] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” Proceedings of 1st International
Conference on Learning Representations, ICLR 2013, pp. 1–12, 2013.
[26] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global Vectors for Word Representation,” Proceedings of the Empirical Methods in Natural
Language Processing (EMNLP 2014), vol. 12, pp. 1532–1543, 2014.
[27] B. McCann, J. Bradbury, C. Xiong, and R. Socher, “Learned in translation: Contextualized word vectors,” Advances in Neural Information Processing
Systems, pp. 6295–6306, 2017.
[28] M. E. Peters et al., “Deep contextualized word representations,” NAACL HLT 2018 - 2018 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, vol. 1, pp. 2227–2237, 2018, doi: 10.18653/v1/n18-1202.
[29] J. Howard and S. Ruder, “Universal Language Model Fine-tuning for Text Classification,” Proceedings of the 56th Annual Meeting of the Association for
Computational Linguistics, pp. 328–339, 2018, doi: 10.18653/v1/P18-103.
[30] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving Language Understanding by Generative Pre-Training,” OpenAI, 2018.
[31] A. Adhikari, A. Ram, R. Tang, and J. Lin, “DocBERT: BERT for document classification,” arXiv, 2019.
[32] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” 2nd USENIX Workshop on Hot
Topics in Cloud Computing, HotCloud 2010, 2010.
[33] X. Meng et al., “MLlib: Machine learning in Apache Spark,” Journal of Machine Learning Research, vol. 17, pp. 1–7, 2016.
[34] J. Fu, J. Sun, and K. Wang, “SPARK-A Big Data Processing Platform for Machine Learning,” Proceedings - 2016 International Conference on Industrial
Informatics - Computing Technology, Intelligent Technology, Industrial Information Integration, ICIICII 2016, pp. 48–51, 2017, doi:
10.1109/ICIICII.2016.0023.
[35] S. Al-Saqqa, G. Al-Naymat, and A. Awajan, “A large-scale sentiment data classification for online reviews under apache spark,” Procedia Computer
Science, vol. 141, pp. 183–189, 2018, doi: 10.1016/j.procs.2018.10.166.
[36] N. Nodarakis, A. Tsakalidis, S. Sioutas, and G. Tzimas, “Large scale sentiment analysis on twitter with spark,” CEUR Workshop Proceedings, vol. 1558,
2016.
[37] L. Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L. A. Zadeh, and J. M. Zurada, “Distributed Classification of Text Documents on Apache
Spark Platform,” Artificial intelligence and soft computing: 15th international conference, ICAISC 2016, no. ML, pp. 621–630, 2016, doi: 10.1007/978-3-
319-39378-0.
[38] A. Gulli, “AG’s corpus of news articles,” AG’s corpus of news articles. http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html (accessed May
20, 2021).
[39] S. Bird and E. Loper, “NLTK: The Natural Language Toolkit,” Proceedings of the ACL Interactive Poster and Demonstration Sessions, pp. 214–217, 2004,
[Online]. Available: https://www.aclweb.org/anthology/P04-3031.
[40] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of tricks for efficient text classification,” 15th Conference of the European Chapter of the
Association for Computational Linguistics, EACL 2017 - Proceedings of Conference, vol. 2, pp. 427–431, 2017, doi: 10.18653/v1/e17-2068.
[41] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching Word Vectors with Subword Information,” Transactions of the Association for
Computational Linguistics, vol. 5, pp. 135–146, 2017, doi: 10.1162/tacl_a_00051.
[42] C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky, “The Stanford CoreNLP Natural Language Processing Toolkit,” pp. 55–60,
2014, doi: 10.3115/v1/p14-5010.