A Deep-Word and Character Based Approach To Offensive Language Identification
Proceedings of the 13th International Workshop on Semantic Evaluation (SemEval-2019), pages 617–621
Minneapolis, Minnesota, USA, June 6–7, 2019. ©2019 Association for Computational Linguistics
learning approaches. Also, a couple of surveys have been published covering various work addressing the identification of abusive, toxic, and offensive language, hate speech, etc., and their methodology, including (Schmidt and Wiegand, 2017) and (Fortuna and Nunes, 2018). Additionally, there have been several workshops and shared tasks on offensive language identification and related problems, including TA-COS (http://ta-cos.org/), Abusive Language Online (https://sites.google.com/site/abusivelanguageworkshop2017/), TRAC (Kumar et al., 2018; https://sites.google.com/view/trac1/home), and GermEval (Wiegand et al., 2018), which shows the significance of the problem.
3 Methodology and Data
The methodology used for both subtask A, offensive language identification, and subtask B, automatic categorization of offense types, consists of a preprocessing phase and a deep classification phase. We first introduce the preprocessing phase, then elaborate on the classification phase.

3.1 Preprocessing

The preprocessing phase consists of (1) replacing obfuscated offensive words with their correct form and (2) tweet tokenization using the NLTK tweet tokenizer (Bird et al., 2009). In social media, some words are distorted in a way that escapes offense detection systems or reduces the impertinence. For instance, ‘asshole’ may be written as ‘a$$hole’, ‘a$sh0le’, ‘a**hole’, etc. Given a list of English offensive words, we can create a list containing most of the possible permutations; a sketch of this step is shown below. Using such a list eases the job for the classifier, and searching in it is computationally cheap. Furthermore, replacing contractions, e.g. ‘I’m’ with ‘I am’, and replacing common social media abbreviations, e.g. ‘w/’ with ‘with’, were not helpful and were not used to train the final model.
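The paper does not include code for this step; the following is a minimal Python sketch of how such a replacement list could be built and applied, assuming an illustrative SUBSTITUTIONS map and a one-word OFFENSIVE_WORDS seed list (both hypothetical):

from itertools import product
from nltk.tokenize import TweetTokenizer

# Hypothetical character substitutions commonly used to obfuscate words.
SUBSTITUTIONS = {
    "a": ["a", "@", "*"],
    "s": ["s", "$", "*"],
    "o": ["o", "0", "*"],
}

# Illustrative seed entry; the paper assumes a full list of English offensive words.
OFFENSIVE_WORDS = ["asshole"]

def obfuscation_variants(word):
    """Enumerate most of the possible character-level obfuscations of a word."""
    options = [SUBSTITUTIONS.get(ch, [ch]) for ch in word]
    return {"".join(combo) for combo in product(*options)}

# Map every variant back to its canonical form; lookup is a cheap dict access.
VARIANT_TO_CANONICAL = {
    variant: word
    for word in OFFENSIVE_WORDS
    for variant in obfuscation_variants(word)
}

def preprocess(tweet):
    """Step 1: replace obfuscated offensive words; step 2: NLTK tweet tokenization."""
    normalized = " ".join(
        VARIANT_TO_CANONICAL.get(token.lower(), token) for token in tweet.split()
    )
    return TweetTokenizer().tokenize(normalized)

print(preprocess("you are such an a$$h0le"))
# -> ['you', 'are', 'such', 'an', 'asshole']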
3.2 Deep Classifier

Given a tweet, we want to know whether it is offensive or not (subtask A) and whether the offense is targeted (subtask B). Since both subtasks are binary classification problems, we used one architecture to tackle both. To define the problem, given a tweet x, we want to predict the label y: OFF or NOT in subtask A, and TIN or UNT in subtask B. Two representations are therefore created for each input x:

1. xc, the indexed representation of the tweet based on its characters, padded to the length of the longest word in the corpus. The indices cover the 256 most common characters, plus 0 for padding and 1 for unknown characters.

2. xw, the embeddings of the words in the input tweet based on FastText’s 600B-token common crawl model (Mikolov et al., 2018).

Then, xc is fed into an embedding layer with an output size of 32, followed by a CNN layer. xc is then concatenated with xw, and both are fed to a unidirectional RNN with an LSTM cell of size 256, whose output is the input to two consecutive fully-connected layers that map their input to an R^128 and an R^2 space, respectively. We also applied dropout with a keep rate of 0.5 to the CNN’s output, xw, the RNN’s output, and the first fully-connected layer’s output.

The CNN layer consists of four consecutive sub-layers:

1. a CNN with 64 filters, a kernel size of 2, a stride of 1, same padding, and ReLU activation;

2. a max-pooling layer with pool size and stride of 2;

3. another CNN, same as the first one, but with 128 filters;

4. the same max-pooling again.

Finally, we used an AdamOptimizer (Kingma and Ba, 2014) with a learning rate of 1e-3 and a batch size of 32 to train the model. A sketch of this architecture is given below.
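The authors' implementation is not published; the following tf.keras sketch is one plausible reading of the description above. The sequence lengths, the per-word character windows, the Flatten step that turns each word's character features into a single vector, and the ReLU on the first fully-connected layer are assumptions of ours, not details from the paper.

import tensorflow as tf
from tensorflow.keras import layers, models

# Assumed dimensions for illustration (not stated in the paper).
MAX_WORDS = 50        # tweet length in words, after padding
MAX_WORD_CHARS = 20   # longest-word length in characters, after padding
CHAR_VOCAB = 258      # 256 most common characters + padding (0) + unknown (1)
EMB_DIM = 300         # FastText common-crawl embedding size

# x_c: per-word character indices; x_w: pre-computed FastText word embeddings.
x_c = layers.Input(shape=(MAX_WORDS, MAX_WORD_CHARS), dtype="int32", name="x_c")
x_w = layers.Input(shape=(MAX_WORDS, EMB_DIM), dtype="float32", name="x_w")

# Character embedding of size 32, then the four CNN sub-layers, applied word by word.
h = layers.Embedding(CHAR_VOCAB, 32)(x_c)                       # (batch, words, chars, 32)
h = layers.TimeDistributed(layers.Conv1D(64, 2, padding="same", activation="relu"))(h)
h = layers.TimeDistributed(layers.MaxPooling1D(pool_size=2, strides=2))(h)
h = layers.TimeDistributed(layers.Conv1D(128, 2, padding="same", activation="relu"))(h)
h = layers.TimeDistributed(layers.MaxPooling1D(pool_size=2, strides=2))(h)
h = layers.TimeDistributed(layers.Flatten())(h)                 # one char-CNN vector per word
char_feat = layers.Dropout(0.5)(h)                              # keep rate 0.5 = dropout rate 0.5

# Concatenate the char-CNN features with the (dropout-regularised) word embeddings.
merged = layers.Concatenate()([char_feat, layers.Dropout(0.5)(x_w)])

# Unidirectional LSTM of size 256, then FC layers mapping to R^128 and R^2.
rnn_out = layers.Dropout(0.5)(layers.LSTM(256)(merged))
hidden = layers.Dropout(0.5)(layers.Dense(128, activation="relu")(rnn_out))
output = layers.Dense(2, activation="softmax")(hidden)

model = models.Model(inputs=[x_c, x_w], outputs=output)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])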
3.3 Baseline Methods

We used two baseline methods for subtask A:

• an SVM with 1- to 3-gram word TF-IDF and 1- to 5-gram character count feature vectors as input;

• an SVM with BERT representations of the tweets (using average pooling (Xiao, 2018)) as input, using the BERT-Large, Uncased model (a sketch of this extraction step follows the list).
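The paper does not detail how the BERT representations were extracted. A minimal sketch with the bert-as-service client (Xiao, 2018) could look as follows; the model directory path is a placeholder, and running the server with mean pooling is our assumption based on the "average pooling" description:

from bert_serving.client import BertClient

# Assumes a bert-as-service server is already running, e.g.:
#   bert-serving-start -model_dir /path/to/uncased_L-24_H-1024_A-16 \
#                      -pooling_strategy REDUCE_MEAN
bc = BertClient()

# Each tweet becomes a single fixed-size vector (average-pooled BERT states),
# which is then used as the feature vector for the second SVM baseline.
tweets = ["you are such an a$$h0le", "have a great day"]
features = bc.encode(tweets)   # shape: (num_tweets, 1024) for BERT-Large
print(features.shape)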
The SVMs were trained for 15 epochs with stochastic gradient descent, hinge loss, an alpha of 1e-6, an elasticnet penalty, and a random state of 5. The SVMs were implemented using Scikit-learn (Pedregosa et al., 2011).
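A minimal Scikit-learn sketch of the first baseline is given below. The exact pipeline is not published; the FeatureUnion wiring and the use of max_iter to stand in for the 15 training epochs are assumptions:

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import SGDClassifier

# Word-level 1- to 3-gram TF-IDF features plus character-level 1- to 5-gram counts.
features = FeatureUnion([
    ("word_tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 3))),
    ("char_count", CountVectorizer(analyzer="char", ngram_range=(1, 5))),
])

# A linear SVM trained with SGD: hinge loss, alpha 1e-6, elasticnet penalty,
# 15 passes over the data, random_state 5.
svm_baseline = Pipeline([
    ("features", features),
    ("clf", SGDClassifier(loss="hinge", alpha=1e-6, penalty="elasticnet",
                          max_iter=15, random_state=5)),
])

# train_tweets / train_labels are assumed to be the preprocessed OLID training split.
# svm_baseline.fit(train_tweets, train_labels)
# predictions = svm_baseline.predict(test_tweets)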
3.4 Data

The main dataset used to train the model is the Offensive Language Identification Dataset (OLID) (Zampieri et al., 2019a). The dataset is annotated hierarchically to identify offensive language (OFFensive or NOT), whether it is targeted (Targeted INsult or UNTargeted), and, if so, its target (INDividual, GRouP, or OTHer). We divided the 13,240 samples in the training set into 12,000 samples for training and 1,240 samples for validation.
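The split itself can be expressed as a simple slice of the shuffled OLID training file; the file name, column layout, and shuffling seed below are assumptions for illustration:

import pandas as pd

# Assumed file name; OLID ships as a tab-separated file with a "tweet" column
# and a "subtask_a" label column (OFF / NOT), among others.
olid = pd.read_csv("olid-training-v1.0.tsv", sep="\t")
olid = olid.sample(frac=1.0, random_state=0)  # shuffle before splitting

train_df = olid.iloc[:12000]       # 12,000 samples for training
val_df = olid.iloc[12000:]         # remaining 1,240 samples for validation
print(len(train_df), len(val_df))  # 12000 1240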
As neural networks require a huge amount of training data, we tried adding more data from the dataset of the First Workshop on Trolling, Aggression, and Cyberbullying (TRAC-1) (Kumar et al., 2018), which was not helpful. However, adding the training data from the Toxic Comment Classification Challenge on Kaggle (Conversation AI, 2017) increased the macro-averaged F1-score on the validation set by ∼2%. This data comprises tweets with positive and negative tags in six categories: toxic, severe toxic, obscene, threat, insult, and identity hate.

4 Results

DeepModel was trained on the training data (excluding the validation data) and DeepModel+val on the combination of the training and validation data. The results for subtask A are presented in Table 1; the best performance is in bold.

System            Macro F1   Accuracy
All NOT baseline  0.4189     0.7209
All OFF baseline  0.2182     0.2790
SVM               0.7452     0.8011
BERT-SVM          0.7507     0.8011
DeepModel         0.7788     0.8326
DeepModel+val     0.7793     0.8337

Table 1: Results for subtask A

The best performance belongs to DeepModel+val, by a margin of more than 2.8 percent over the best baseline performance, BERT-SVM. However, it should be mentioned that the results in the first two rows belong to a model trained only on OLID. The confusion matrix for the best performance is shown in Figure 1.

[Figure 1: Confusion matrix for the best performance in subtask A (axes: True label / Predicted label; NOT row: 572, 48).]
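The macro-averaged F1-score and accuracy reported above can be computed directly with Scikit-learn; a short sketch with placeholder labels and predictions for subtask A:

from sklearn.metrics import accuracy_score, f1_score

# Placeholder gold labels and system predictions (NOT / OFF).
y_true = ["NOT", "NOT", "OFF", "OFF", "NOT"]
y_pred = ["NOT", "OFF", "OFF", "NOT", "NOT"]

# Macro F1 averages the per-class F1 scores, so the minority OFF class counts
# as much as the majority NOT class; accuracy is the plain fraction correct.
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
print("Accuracy:", accuracy_score(y_true, y_pred))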
In subtask B, the deep models again outperformed the baseline results by a large margin, as in subtask A. The results for subtask B are presented in Table 3.

System            Macro F1   Accuracy
All TIN baseline  0.4702     0.8875
All UNT baseline  0.1011     0.1125
DeepModel         0.6065     0.8583
DeepModel+val     0.6400     0.8875

Table 3: Results for subtask B

This time, adding the validation data made a considerable difference, as the training data for subtask B is smaller. The confusion matrix for DeepModel+val is shown in Figure 2.

[Figure 2: The confusion matrix for DeepModel+val in subtask B (axes: True label / Predicted label; TIN row: 206, 7).]

The confusion matrix shows that the performance of the model is good for TIN but poor for UNT. Table 4 shows the detailed results for DeepModel+val in subtask B, which indicates that the class imbalance is worse than in subtask A and that the poor performance on UNT is mainly due to low recall.

Class   Precision   Recall   F1-score
TIN     0.9115      0.9671   0.9385
UNT     0.5000      0.2593   0.3415

Table 4: Detailed DeepModel+val results in subtask B
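Per-class precision, recall, and F1 as in Table 4 can be obtained from Scikit-learn; a sketch with placeholder predictions for subtask B (TIN / UNT):

from sklearn.metrics import classification_report

# Placeholder gold labels and system predictions.
y_true = ["TIN", "TIN", "TIN", "UNT", "UNT"]
y_pred = ["TIN", "TIN", "UNT", "TIN", "UNT"]

# Reports precision, recall, and F1 for each class plus macro/weighted averages;
# a low UNT recall corresponds to many UNT tweets being predicted as TIN.
print(classification_report(y_true, y_pred, digits=4))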
5 Analysis

In subtask A, DeepModel+val outperformed the second best method, BERT-SVM, by 2.86% Macro F1-score. The BERT-SVM results, however, were not much better than those of the SVM with TF-IDF and count features, probably due to the fact that the BERT model requires fine-tuning for more task-specific representations.

The majority of DeepModel+val’s errors are in the OFF class and can be categorized into (1) sarcasm: the model is unable to detect sarcastic language, which is difficult even for humans to detect; (2) emotion: discerning emotions, such as anger, seems to be a challenge for the model; (3) ethnic and racial slurs, etc. Solving these problems requires a more comprehensive knowledge of the context and the language, which was examined in works such as (Poria et al., 2016) and improved the results. However, experimenting with emotion embeddings in the current work was not helpful, so they do not appear in the final results. Being aware of the emotion of the text, the personality of the author, and the sentiment of the sentences is helpful for detecting offensive language, as much offensive content has an angry tone (ElSherief et al., 2018).

6 Conclusion

In this paper, we introduced the Ghmerti team’s approach to the problems of ‘offensive language identification’ and ‘automatic categorization of offense type’ in shared task 6 of SemEval 2019, OffensEval. In subtask A, the neural network-based model outperformed the other methods, including an SVM with word TF-IDF and character count features and another SVM with BERT-encoded tweets as input. Furthermore, analysis of the results indicates that sarcastic language, inability to discern emotions such as anger, and ethnic and racial slurs constitute a considerable portion of the errors. Such deficiencies demand larger training corpora and a variety of other features, such as information on sarcasm, emotion, personality, etc.

References
Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.

Erik Cambria, Praphul Chandra, Avinash Sharma, and Amir Hussain. 2010. Do Not Feel the Trolls. ISWC, Shanghai.

Conversation AI. 2017. Toxic Comment Classification Challenge: Identify and Classify Toxic Online Comments.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated Hate Speech Detection and the Problem of Offensive Language. In Proceedings of ICWSM.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.

Karthik Dinakar, Birago Jones, Catherine Havasi, Henry Lieberman, and Rosalind Picard. 2012. Common Sense Reasoning for Detection, Prevention, and Mitigation of Cyberbullying. ACM Transactions on Interactive Intelligent Systems (TiiS), 2(3):18.

Mai ElSherief, Vivek Kulkarni, Dana Nguyen, William Yang Wang, and Elizabeth Belding. 2018. Hate Lingo: A Target-based Linguistic Analysis of Hate Speech in Social Media. arXiv preprint arXiv:1804.04257.

Paula Fortuna and Sérgio Nunes. 2018. A Survey on Automatic Detection of Hate Speech in Text. ACM Computing Surveys (CSUR), 51(4):85.

Björn Gambäck and Utpal Kumar Sikdar. 2017. Using Convolutional Neural Networks to Classify Hate-speech. In Proceedings of the First Workshop on Abusive Language Online, pages 85–90.

Edel Greevy and Alan F. Smeaton. 2004. Classifying Racist Texts Using a Support Vector Machine. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 468–469. ACM.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.

Ritesh Kumar, Atul Kr. Ojha, Shervin Malmasi, and Marcos Zampieri. 2018. Benchmarking Aggression Identification in Social Media. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC), Santa Fe, USA.

Shervin Malmasi and Marcos Zampieri. 2017. Detecting Hate Speech in Social Media. In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP), pages 467–472.

Shervin Malmasi and Marcos Zampieri. 2018. Challenges in Discriminating Profanity from Hate Speech. Journal of Experimental & Theoretical Artificial Intelligence, 30:1–16.

Yashar Mehdad and Joel Tetreault. 2016. Do Characters Abuse More Than Words? In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 299–303.

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in Pre-Training Distributed Word Representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive Language Detection in Online User Content. In Proceedings of the 25th International Conference on World Wide Web, pages 145–153. International World Wide Web Conferences Steering Committee.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Soujanya Poria, Erik Cambria, Devamanyu Hazarika, and Prateek Vij. 2016. A Deeper Look into Sarcastic Tweets Using Deep Convolutional Neural Networks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1601–1612.

Anna Schmidt and Michael Wiegand. 2017. A Survey on Hate Speech Detection Using Natural Language Processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1–10, Valencia, Spain. Association for Computational Linguistics.

Michael Wiegand, Melanie Siegel, and Josef Ruppenhofer. 2018. Overview of the GermEval 2018 Shared Task on the Identification of Offensive Language. In Proceedings of GermEval.

Han Xiao. 2018. bert-as-service. https://github.com/hanxiao/bert-as-service.

Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. 2019a. Predicting the Type and Target of Offensive Posts in Social Media. In Proceedings of NAACL.

Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. 2019b. SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval). In Proceedings of the 13th International Workshop on Semantic Evaluation (SemEval).

Ziqi Zhang, David Robinson, and Jonathan Tepper. 2018. Detecting Hate Speech on Twitter Using a Convolution-GRU Based Deep Neural Network. In Lecture Notes in Computer Science. Springer Verlag.