Generalizing Hate Speech Detection Using Multi-Task Learning
Keywords: Hate speech, Abusive speech, Multi-task learning, Public political figures, Transfer learning

Automatic identification of hateful and abusive content is vital in combating the spread of harmful online content and its damaging
effects. Most existing works evaluate models by examining the generalization error on train–test splits on hate speech datasets. These
datasets often differ in their definitions and labeling criteria, leading to poor generalization performance when predicting across new
domains and datasets. This work proposes a new Multi-task Learning (MTL) pipeline that trains simultaneously across multiple hate
speech datasets to construct a more encompassing classification model. Using a dataset-level leave-one-out evaluation (designating a
dataset for testing and jointly training on all others), we trial the MTL detection on new, previously unseen datasets. Our results
consistently outperform a large sample of existing work. We show strong results when examining the generalization error in train–test
splits and substantial improvements when predicting on previously unseen datasets. Furthermore, we assemble a novel dataset, dubbed
PubFigs, focusing on the problematic speech of American Public Political Figures. We crowdsource-label using Amazon MTurk more than
20,000 tweets and machine-label problematic speech in all the 305,235 tweets in PubFigs. We find that the abusive and hate tweeting
mainly originates from right-leaning figures and relates to six topics, including Islam, women, ethnicity, and immigrants. We show that
MTL builds embeddings that can simultaneously separate abusive from hate speech, and identify its topics.
1. Introduction
With the increasing prevalence of online media platforms in our day-to-day lives, detecting hateful and abusive content has
become necessary to prevent the pollution of online platforms by problematic and malicious users (Schneider and Rizoiu, 2023).
Automatic detection of such harmful content has recently received significant attention from the research community. Currently,
most existing works evaluate their models in context using train–test splits: the model is sequentially trained and then tested on the
same dataset. However, several recent works (Arango et al., 2019; Swamy et al., 2019; Yin and Zubiaga, 2021; Fortuna et al., 2021)
raised concerns over the poor generalization performance of such existing models when applied to hate speech datasets other than
those used to train the model. This poor performance persists even for datasets gathered from the same platform.
A key challenge in building generalizable models is the lack of a universally agreed-upon definition of hate speech that is
specific enough to be operationalized. There are many hateful and abusive speech facets (dubbed in this work as domains), such as
racism, sexism, ableism, bullying, harassment, incitement of violence, and extremism. Most prior works concerned with hate speech
detection concentrate on specific domains, which translates to differences in the labeling criteria. As a result, each dataset captures
Fig. 1. Different approaches to addressing labeling bias in hate speech datasets. The traditional Machine learning approach increases the size of the training
dataset by adding more labeled rows with the same labeling definition, leading to additional bias to that labeling criteria. Our novel multi-task learning approach
allows for increasing the number of datasets and definitions in the training pipeline for a more general representation.
only a fraction of the hate speech, partially explaining why models trained on single datasets generalize poorly to other datasets
concerned with other domains.
Problem Statement. The work presents a method for training hate speech detection models that account for multiple definitions
of hate speech. Given the absence of a universally agreed-upon definition, existing works leverage machine learning models to
estimate what constitutes hate speech. Such models learn from labeled data what represents hate speech, therefore shifting the
difficulty from defining hate speech to assembling representative datasets. Due to how they are built (typically using human labelers
in a narrow domain), such datasets will always suffer from sample size and labeling biases. As a result, the trained hate speech
classifiers are unlikely to generalize to new domains and datasets.
Research Questions. We address the problem by examining two open questions concerning hate speech detection. The
first research question relates to constructing models that account for the various definitions of hate across different datasets.
Most existing works adopt a narrow definition, construct labeled datasets, and measure the generalization error on train–test
splits (Davidson et al., 2017; Waseem and Hovy, 2016; Basile et al., 2019; de Gibert et al., 2018). Given the discussion in the
previous paragraph, the results are overly optimistic and fail to generalize to new domains (Arango et al., 2019; Swamy et al., 2019).
The question is: can we construct a model that utilizes multiple hate speech datasets, capturing their differing definitions
of hate, to improve classification performance? The second question concerns detecting
hate speech in previously unseen domains and datasets. A limited number of works attempted to train on multiple datasets (Yuan
et al., 2023; Waseem et al., 2018) but generally do not study predictive performance on previously unseen datasets and domains. The research
question is can we build a detection approach that generalizes to an entirely new dataset containing hateful and abusive
speech?
Our Solution. Addressing the first question, we contribute a novel transfer learning pipeline that accounts for the multiple
facets of hate and abusive speech by training on multiple datasets in parallel. Intuitively, this reduces the bias and increases the
generalization of the constructed detection model. Fig. 1 illustrates this process. Any labeled dataset (top-left panel of Fig. 1) embeds
a degree of bias in defining hate speech compared to the global population — all possible types of hate speech at a given moment.
This bias is shown schematically (after dimensionality reduction) in the bottom-left panel. The typical but naive solution to increasing the
performance of hate speech detection is to increase the size of the training dataset (top-middle panel). However, as the same criteria
for constructing the dataset are applied, this results in a similar or even more embedded bias (bottom-middle panel), as the model
overfits to a single definition of hate speech. Our proposed method learns from multiple labeled datasets (top-right panel), each
with its own hate speech domain and labeling criteria. These datasets have biases (arrows in the bottom-right panel), which get
canceled and averaged out as our model learns a single, more encompassing definition of hate speech. We adopt a Multi-Task
Learning (MTL) technique with Hard Parameter Sharing (Baxter, 1997) to the task of hate speech detection. We train on eight
publicly available datasets (as opposed to prior literature, which uses 2 or 3 datasets). Furthermore, we construct a novel dataset
containing social media postings of American public political figures human-annotated for hate and abuse speech.1 We fine-tune
1 We publicly release the trained model, the detection and training code, and the American Political Figures dataset, which are available at https:
//code.research.uts.edu.au/14386080/mtl_hatespeech.
a single BERT language model (Devlin et al., 2018) to which we attach as many classification heads as datasets to train. Each
classification head is adapted to detect the hate classes specific to its datasets; however, the gradients are back-propagated into
a single language model. Intuitively, this constructs a single representation that captures a more generic definition of hate. Our
approach differs from existing works in our pipeline architecture and our utilization of MTL across a significantly higher number
of hate speech datasets. Our model produces strong predictive performances in train/test split scenarios; we compare MTL against
nine state-of-the-art hate speech detection architectures reported on several datasets. MTL outperforms the competing approaches
(12 wins, 1 draw, 7 losses) on their reported performance measures.
We address the second open question in a sequence of three steps. In the first step, we employ a leave-one-out scheme to evaluate
unseen hate speech detection. We train MTL on all but the target dataset and evaluate on the target dataset that MTL never saw
during training. MTL obtains the best prediction performances compared to existing literature, except for datasets whose labeling is
so specific that using a generalized model hurts performances. In the second step, we use MTL to study the generalization of classifiers
trained on individual datasets to new unseen datasets. We find that specific pairs of datasets have high mutual generalization —
typically those proposed by the same authors in the same work. This confirms that the dataset construction and labeling reflect the
author-specific definition of hate speech. We also note that MTL generalizes best for 7 (out of 9) datasets.
The third and final step is to test hate speech detection in a novel, previously unexplored hate speech domain using MTL. We
construct a brand new labeled hate speech Twitter dataset of 305,235 tweets from 15 American public figures across both left and
right political leanings, dubbed the PubFigs dataset. We use Amazon Mechanical Turk crowdsourcing to label a subset of 20,327
tweets — dubbed the PubFigs-L dataset. To our knowledge, this is the first dataset focusing on the hate speech of American public
figures. Our dataset has previously unexplored particularities, such as dealing with covert hate speech and focusing on a small
number of public political figures instead of many anonymous users on an online social media platform. We apply the proposed
MTL pipeline model to the PubFigs-L dataset to examine the posting behavior of the figures in the dataset. We uncover that right-
leaning figures in our dataset post more inappropriate content than left-leaning figures and identify that hateful and abusive speech
primarily concentrates on 6 topics: Islam, Women, Race and Ethnicity, Immigration and Refugees, Terrorism and Extremism, and
American Politics. We examine the effects of MTL training on the BERT embedding space, finding that MTL training increases the
distinctness of hate speech and abusive content in the embedding space, both from neutral content and from each other. We further
examine the distinctness of particular facets of hate, finding that misogyny and Islamophobia, in particular, became significantly
more distinct in the space compared to other facets of hate.
The main contributions of this work are as follows:
• A Multi-Task Learning pipeline using multiple datasets to construct a more encompassing representation of hate speech;
• An extensive analysis of the generalization of hate speech detection to new, unseen datasets and domains;
• A novel Twitter dataset containing posts from 15 American public political figures, annotated for covert hate and problematic
speech.
2. Background
This section discusses the definitions of hate speech, related works in hate speech classification, and transfer learning approaches.
Defining Hate Speech. Hate speech is not easily quantifiable as a concept (MacAvaney et al., 2019; Kong et al., 2022, 2020,
2021). It lies on a continuum with offensive speech and other abusive content such as bullying and harassment. Some definitions
given in the literature are as follows: The United Nations (United Nations, 2019) defines hate speech as ‘‘any kind of communication
in speech, writing or behavior, that attacks or uses pejorative or discriminatory language concerning a person or a group based on
who they are, in other words, based on their religion, ethnicity, nationality, race, color, descent, gender or other identity factor’’.
Davidson et al. (2017) defines hate speech as ‘‘language that is used to express hatred towards a targeted group or is intended to
be derogatory, to humiliate, or to insult the members of the group’’. Fortuna and Nunes (2018) surveyed definitions of hate speech
and produced their own: ‘‘Hate speech is language that attacks or diminishes, that incites violence or hate against groups, based on
specific characteristics such as physical appearance, religion, descent, national or ethnic origin, sexual orientation, gender identity or
other, and it can occur with different linguistic styles, even in subtle forms or when humor is used’’. In contrast, Schmidt and Wiegand
(2017) adopt a much broader definition of hate speech, treating it as an umbrella term for all content on the hateful-to-abusive continuum.
Most of these definitions are enumerations of the facets of hate speech, which makes them difficult to operationalize for detection. These
definitions, while similar, have subtle semantic differences which can cause ambiguity in dataset construction and detection.
Hateful and Abusive Speech Classification. Early approaches for hate speech classification utilized non-neural network-
based classifiers, usually in conjunction with manual feature engineering. Examples of such works are Davidson et al. (2017)
and Waseem and Hovy (2016) which used various engineered features in conjunction with a logistic regression classifier. More
recently, MacAvaney et al. (2019) utilized a Multi-view SVM with feature engineering, reporting results similar to modern neural
network models.
Recent advances in deep learning have seen the state-of-the-art dominated by deep neural network-based models. Initial
models utilized recurrent or convolutional neural networks (Zhang and Luo, 2018) with textual features, often in conjunction with
non-neural classifiers (Badjatiya et al., 2017) and feature engineering (Agrawal and Awekar, 2018; Rizoiu et al., 2016).
The introduction of large pre-trained transformer language models such as the ‘‘Bidirectional Encoder Representations from
Transformers’’ (BERT) (Devlin et al., 2018), and its variants have shown impressive performance in several NLP-related tasks,
including hate speech detection (Mozafari et al., 2019; Madukwe et al., 2020b; Swamy et al., 2019; Roy et al., 2022). Our work
extends upon these works by exploring transfer learning and multi-task learning in conjunction with transformer-based models.
Transfer and Multi-Task Learning. Transfer learning is the exploitation of knowledge gained in one setting to improve the
generalization performance in another setting (Goodfellow et al., 2016). Formally, given source domain 𝐷𝑆 and source task 𝑇𝑆 ,
target domain 𝐷𝑇 and target task 𝑇𝑇 where 𝐷𝑆 ≠ 𝐷𝑇 or 𝑇𝑆 ≠ 𝑇𝑇 , transfer learning seeks to make an improvement to the learning
of the target predictive function 𝑓𝑇 (⋅) in 𝐷𝑇 using knowledge in 𝐷𝑆 and 𝑇𝑆 (Pan and Yang, 2010).
The most common form of transfer learning is Sequential Transfer Learning (STL): the model is trained on related tasks one at
a time, then fine-tuned to adapt the source knowledge to the target domain. Multi-Task Learning (MTL) is an alternative paradigm
known as parallel transfer learning. MTL seeks to transfer knowledge between several target tasks simultaneously and jointly, rather
than sequentially towards a single target task (Baxter, 1997). The tasks act as regularizers for each other in the joint model.
Closest Related Works. Several existing works relate closely to our work. We discuss these works in two groups.
The first group examines the performance of hate speech detection models in cross-dataset settings, and hate speech classification
in previously unseen hate speech datasets. Guimarães et al. (2023) investigated the composition, vocabulary and targets of hate in
several hate speech datasets, drawing attention to the conflicting definitions of hate speech contained within the datasets. They
examined the cross-dataset classification performance between 6 different hate speech datasets. Similar to our work, they fine-
tuned a BERT classification model on a single dataset and then tested on the other five datasets. They find that datasets with closer
definitions of hate speech and similar compositions in the type of hate speech tend to achieve better cross-dataset performance with
each other. Our work differs in our employment of MTL to transfer knowledge between datasets and as a means to improve the cross-
dataset generalization performance. Chiril et al. (2021) explored the ability of hate speech detection models to transfer knowledge
from generic hate speech datasets to more granular topic-specific hate speech detection tasks. They explore two evaluation schemes:
the first scheme trains on a single topic-general hate speech dataset and then tests on one of several topic-specific datasets. The
second scheme concatenates the training set of all topic-specific datasets and uses it to train a single model. They examined various
classification models based on LSVM, LSTM, CNN, ELMo, and BERT. Our work extends upon transferring knowledge from multiple
datasets by exploring multi-task learning over concatenation to learn a generalized representation implicitly.
The second group of existing literature examines the application of Multi-Task Learning for detecting hateful and abusive speech.
Various studies have explored MTL’s utility in this area. Plaza-Del-Arco et al. (2021) harnessed MTL for hate speech detection,
integrating multiple detection tasks within polarity and emotion knowledge classification to augment the hate speech classifier. Our
study sets itself apart by applying MTL across disparate datasets instead of varying tasks within the same dataset. Waseem et al.
(2018) employed MTL across multiple datasets, utilizing hard parameter sharing within a Recurrent Neural Network classification
framework. Yuan et al. (2023) adopted an MTL strategy, generating generalized embeddings via a bi-directional LSTM model across
two datasets. Kapil and Ekbal (2020) investigated multiple MTL configurations using five datasets with distinct labels related to
hate speech. Unlike these studies, our approach leverages a BERT architecture, diverging from more traditional neural network
models like LSTMs and RNNs. We incorporate a broader array of datasets, and, most importantly, we concentrate on classifications
in previously unseen datasets.
Ghosh et al. (2023) proposed an MTL framework that tackles hate speech detection and aggressive posting identification.
Our methodology differs significantly in the MTL architecture employed; we use a shared BERT unit across various datasets. In
contrast, Ghosh et al. (2023) utilized two independent neural network channels for each task, with a shared XLMR encoder that
merges with the channels before connecting to a classification head. Plaza-Del-Arco et al. (2021) explored an MTL configuration to
enhance hate speech classification on Twitter in Spanish, utilizing auxiliary sentiment and emotion classification tasks to aid the hate
speech detection task. Unlike their approach, our research does not leverage auxiliary tasks. Instead, we train across multiple hate
speech datasets with varying definitions of hate, aiming to derive a generalized hate speech representation to bolster predictions on
new, unseen datasets.
3. Methodology
This section details our MTL pipeline (Section 3.1), its training (Section 3.2), and the classification of unseen datasets
(Section 3.3).
3.1. Model
Fig. 2 shows the schema of our proposed BERT-based (Devlin et al., 2018) multi-task transfer learning pipeline. It consists of the
following:
The Preprocessing Unit standardizes the text input by removing capitalization, repetitive punctuation, redundant white spaces,
emojis, and URLs in text. It uses a single space character to separate all words and punctuation. After preprocessing, we filter out
sequences that are empty or contain only the Twitter-specific retweet flag ‘‘RT’’.
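The following Python sketch illustrates this preprocessing step; the exact regular expressions and filters are illustrative assumptions rather than the released implementation.

```python
import re
from typing import Optional

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # rough emoji ranges
REPEAT_PUNCT_RE = re.compile(r"([!?.,])\1+")                     # e.g. "!!!" -> "!"

def preprocess(text: str) -> Optional[str]:
    """Lower-case, strip URLs and emojis, collapse repeated punctuation and whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = REPEAT_PUNCT_RE.sub(r"\1", text)
    text = re.sub(r"\s+", " ", text).strip()
    # drop sequences that are empty or contain only the retweet flag "rt"
    return None if text in {"", "rt"} else text
```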
The BERT Unit consists of the pre-trained BERT tokenizer, the pre-trained BERT language model, and a pooling layer. The
BERT tokenizer creates a sequence of tokens as input for the BERT model from the preprocessed text. The BERT model creates
a representation for each token in the sequence. The BERT pooling layer constructs a fixed-sized sentence embedding by pooling
together the individual token representations. It takes inputs from the BERT token outputs and the hidden state of the last BERT
layer after processing the first token of the sequence (the ‘‘[CLS]’’ token). The pooling layer consists of a fully connected layer and
a 𝑡𝑎𝑛ℎ activation function, formally defined as
\[ \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \]
Fig. 2. Schema of the Multi-Task Learning pipeline. An arbitrary number of datasets are used to train a single model jointly. Each dataset-specific classification
head propagates its loss through a single shared BERT unit to produce a generalized representation of hate speech.
The Classification Heads each take the pooled sentence embedding as input and consist of a fully connected hidden layer followed
by a softmax output,
\[ \mathrm{softmax}(\vec{z})_i = \frac{e^{z_i}}{\sum_{k=1}^{K} e^{z_k}} \]
for the 𝑖th class of 𝐾 classes, where 𝑧⃗ ∈ R^𝐾 is the logit output from the hidden layer. The classification heads do not share weights,
producing different predictions based on the single set of BERT encodings. This incentivizes BERT to construct representations useful
to all classifier heads.
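A minimal PyTorch sketch of this architecture is shown below; the module names, class counts, and use of Hugging Face's `transformers` are illustrative assumptions, and the released code may differ.

```python
import torch.nn as nn
from transformers import BertModel

class MTLHateSpeech(nn.Module):
    """One shared BERT encoder with an independent classification head per dataset."""

    def __init__(self, classes_per_dataset, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)   # includes the tanh pooling layer
        hidden = self.bert.config.hidden_size
        # one linear head per dataset (softmax is applied in the loss); heads share no weights
        self.heads = nn.ModuleList([nn.Linear(hidden, k) for k in classes_per_dataset])

    def forward(self, input_ids, attention_mask, dataset_idx):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.pooler_output                            # pooled "[CLS]" representation
        return self.heads[dataset_idx](pooled)                # logits for that dataset's classes
```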
3.2. Training

Dataset Preprocessing. The training constructs a task-agnostic representation of hateful and abusive speech by jointly fine-
tuning BERT’s parameters using several datasets. The datasets can have arbitrary numbers of classes (see Table 1). Hate speech
datasets are known to have a heavy class imbalance (Yuan et al., 2023; Madukwe et al., 2020a); therefore, we use a stratified
random split of 8:1:1 (training : validation : test set) to ensure that all subsets contain the same ratio of classes. We further use
random oversampling of the minority classes during training.
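The stratified 8:1:1 split and random oversampling can be sketched as follows, assuming scikit-learn is available; the seeds and helper names are ours.

```python
import random
from collections import Counter
from sklearn.model_selection import train_test_split

def stratified_8_1_1(texts, labels, seed=42):
    """Stratified train/validation/test split preserving the class ratios."""
    x_tr, x_rest, y_tr, y_rest = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=seed)
    x_val, x_te, y_val, y_te = train_test_split(
        x_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
    return (x_tr, y_tr), (x_val, y_val), (x_te, y_te)

def oversample_minorities(texts, labels, seed=42):
    """Randomly duplicate minority-class examples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    largest = max(counts.values())
    texts, labels = list(texts), list(labels)
    for cls, n in list(counts.items()):
        pool = [t for t, y in zip(texts, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(largest - n)]
        texts += extra
        labels += [cls] * len(extra)
    return texts, labels
```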
Training and Loss Function. We train the MTL pipeline for 10 epochs with a learning rate of 2𝑒 − 5. Each epoch consists of a
pass over each of the 𝑛 datasets’ training sets. We use mini-batching with a batch size of 512. Once all heads have completed the
epoch, losses are accumulated and summed over all classification heads, then propagated using the Adam optimizer (Kingma and
Ba, 2017). This results in the individual classification heads having a different loss to the shared BERT unit: each head optimizes for
its own respective dataset classification task, while the shared BERT must optimize for all datasets’ classification tasks simultaneously.
The loss for each dataset classification head is defined as its cross-entropy loss, formally given for dataset 𝑑𝑖 with labels 𝑦𝑖 and
predictions 𝑦̂𝑖 as:
\[ H(y_i, \hat{y}_i) = - \sum_{j \in d_i} y_{ij} \cdot \log(\hat{y}_{ij}) \]
The total BERT loss at each epoch is the sum of the individual losses over all 𝑛 datasets. Formally, this is defined as:
\[ \mathit{Loss} = \sum_{i=1}^{n} H(y_i, \hat{y}_i) = - \sum_{i=1}^{n} \sum_{j \in d_i} y_{ij} \cdot \log(\hat{y}_{ij}) \]
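A simplified PyTorch training loop for this joint objective is sketched below; the exact batching schedule (here, one optimizer step per round of mini-batches, one batch per dataset) is our assumption rather than the released implementation.

```python
import torch
import torch.nn as nn

def train_mtl(model, loaders, epochs=10, lr=2e-5, device="cuda"):
    """Sum the cross-entropy losses of all classification heads and back-propagate
    them jointly through the shared BERT unit."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        model.train()
        for batches in zip(*loaders):                 # one mini-batch per dataset
            optimizer.zero_grad()
            loss = 0.0
            for i, (input_ids, mask, labels) in enumerate(batches):
                logits = model(input_ids.to(device), mask.to(device), dataset_idx=i)
                loss = loss + criterion(logits, labels.to(device))
            loss.backward()                           # gradients flow into the shared BERT
            optimizer.step()
    return model
```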
Computational Complexity. The MTL approach is very computationally efficient. Unlike Sequential Transfer Learning, which
requires learning from the different tasks sequentially, one after the other (see Section 2), the Parallel Transfer Learning paradigm
that we employ in MTL allows the concurrent computation of each dataset classification head. Consequently, all dataset computations
can be achieved in parallel with sufficient computational resources (i.e., GPUs). As a result, the computation time increase for
Table 1
The datasets used in this work with the number of labeled examples. Datasets contributed from this work are highlighted
in bold. Problematic classes are shown in red italics. Note that PubFigs is not manually labeled; only the sample PubFigs-L
is labeled for problematic speech.
Dataset Classes #Neutral #Problematic #Total
Davidson (Davidson et al., 2017) Neither, Offensive, Hate 4,162 20,620 24,782
Waseem (Waseem and Hovy, 2016) Neither, Racism, Sexism 11,501 5,406 16,907
Reddit (Qian et al., 2019) Non-Hate, Hate 10,053 3,130 13,183
Gab (Qian et al., 2019) Non-Hate, Hate 15,111 11,046 26,157
Fox (Gao and Huang, 2017) Non-Hate, Hate 919 332 1,251
Mandl (Mandl et al., 2019) Neutral, Profane, Hate, Offensive 3,135 1,883 5,018
Stormfront (de Gibert et al., 2018) Non-Hate, Hate, Skip/Unclear, Relation 9,330 1,089 10,419
HatEval (Basile et al., 2019) Non-Hate, Hate 6,832 4,926 11,758
PubFigs-L Neutral, Abuse, Hate 17,963 2,364 20,327
PubFigs – – – 305,235
adding new datasets is only marginal and relates mainly to initialization and data transfer. The update of the shared model
requires completing all classification heads’ processing before its backward pass, but additional datasets do not increase the overall
computational footprint. This results in a training runtime complexity that scales linearly with the number of datasets.
Selecting Best Model. We evaluate using the holdout validation set at each epoch. We select the model weights with the highest
validation macro-F1 score over all epochs as the final model weights. We also explored selecting the final weights using the validation
loss but found the performance worse than macro-F1.
Single Dataset Baselines use the same architecture as MTL, but with only one input dataset and one classification head.
Therefore, we tune the BERT and classifier on a single task. We use the Single Dataset Baseline in Section 6.1 to evaluate dataset
pairwise generalization.
3.3. Classification of unseen datasets

Each classification head of the MTL pipeline independently predicts over an arbitrary number of classes, depending on
each dataset’s labeling (see Table 1). As the number of classes in new, previously unseen datasets is unknown, we construct a binary
classification of content as problematic/harmless. We build a binary mapping for each dataset by joining the classes shown in red
italic font in Table 1 into a single problematic class (and the others into the harmless class).
We propose two schemes for classifying unseen datasets: New Classification Head (NCH) and Majority Vote (MV). NCH trains a
new head on the binarized versions of the available datasets. First, we build a new training set that concatenates training instances
from all datasets. Second, we build a validation set that concatenates all datasets’ validation and testing sets. Finally, we freeze
the MTL-tuned BERT and train a new binary classifier head for 10 epochs, selecting the final weights based on the best validation
performance. MV leverages the trained dataset-specific classifier heads. Each classifier makes an individual prediction — binarized
using its specific dataset mapping. A majority vote of classifiers selects the final classification label for each instance in the unseen
dataset.
NCH and MV differ in the effect of dataset sizes on the final classification. In the NCH scheme, each dataset contributes
proportionally to its size, as more training information originates from larger datasets than smaller ones. By contrast, the MV scheme
gives smaller datasets equal weight to larger datasets in the majority vote. We investigate both schemes in Section 6. For single
dataset baselines (see Section 3.2), we only binarize the output of the classification head — i.e., the MV scheme with a single voter.
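A sketch of the binary mapping and the MV scheme follows; the per-dataset class indices are illustrative and must mirror each dataset's actual label encoding (Table 1).

```python
# Class indices considered "problematic" for each dataset (illustrative; see Table 1).
PROBLEMATIC = {
    "davidson": {1, 2},   # Offensive, Hate
    "waseem":   {1, 2},   # Racism, Sexism
    "reddit":   {1},      # Hate
    # ... one entry per training dataset
}

def majority_vote(model, input_ids, attention_mask, dataset_names):
    """MV scheme: binarize each head's prediction, then take a majority vote across heads.
    Operates on a single instance (batch of size one)."""
    votes = 0
    for i, name in enumerate(dataset_names):
        logits = model(input_ids, attention_mask, dataset_idx=i)
        predicted_class = int(logits.argmax(dim=-1))
        votes += int(predicted_class in PROBLEMATIC[name])
    return int(votes > len(dataset_names) / 2)        # 1 = problematic, 0 = harmless
```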
4. Datasets
In this section, we discuss the eight publicly available datasets that we use in this work. We summarize them in Table 1, together
with their number of instances and classes, highlighting the ones we deem problematic. The rest of this section details each dataset;
the following section (Section 5) introduces the novel datasets we construct — dubbed PubFigs.
Davidson (Davidson et al., 2017) is a widely used Twitter dataset in hate speech classification and other related applications (Bad-
jatiya et al., 2017; Mozafari et al., 2019; Zhang and Luo, 2018). The dataset explicitly differentiates between hateful speech and
offensive speech. The dataset was constructed by first collating a list of potentially hateful tweets by conducting a keyword search
using words and phrases from Hatebase.org. Crowdsourced workers from CrowdFlower then labeled a sample of 25,000 tweets.
Workers were provided with definitions of the classes and asked to consider the entire context of the text rather than focusing on
individual words. Workers were told that the presence of particular terms, however offensive they may be, does not indicate hate
speech. At least three workers labeled each data instance.
Waseem (Waseem and Hovy, 2016) is a Twitter dataset widely used in prior works. It focuses exclusively on two specific facets
of hate: racism and sexism. The authors collated a corpus of tweets based on a keyword set that contains frequently occurring terms
in hateful tweets and references to specific entities, such as TV shows which often incite racist and sexist tweets. The authors added
the tweets of manually identified prolific posters of hateful content to this corpus. The labeling was done by the two authors and
an outside annotator. Several existing works, including Waseem (2016) himself, have raised questions regarding the quality of the
Table 2
The American Political Figures captured in the PubFigs-L dataset. Counts of Neutral, Abusive and Hate tweets from each figure, alongside their brief description.
Figure Perceived political leaning #Neutral #Abuse #Hate #Total Description
AJ Right 1666 182 173 2021 Far right show host and prominent conspiracy theorist.
MG Right 486 45 27 558 American Republican politician. Georgia Congress repr.
CO Right 261 38 21 320 American conservative influencer and political commentator.
AK Right 1086 61 23 1170 Right winged American political activist.
AC Right 5143 523 464 6130 American conservative media pundit and author.
LI Right 1474 28 10 1512 American conservative television host.
BS Right 1753 129 56 1938 American conservative political commentator.
DT Right 1972 197 85 2254 Former Republican US president.
DJ Right 1862 166 56 2084 Relative of DT.
TS Left 247 1 0 248 Popular American Musician.
BR Left 282 2 0 284 American Democrat politician.
BO Left 76 0 0 76 Former Democrat US president.
MO Left 34 0 0 34 Former US first lady.
AO Left 354 4 4 362 American Democrat politician. New York Congress repr.
IO Left 1267 46 23 1336 American Democrat politician. Minnesota Congress repr.
annotations in the dataset. Arango et al. (2019) highlight the skewed distribution of users who contribute to the racist and sexist
classes, with 8 users accounting for all tweets labeled as racist.
Reddit (Qian et al., 2019): Reddit is a well-known social media platform comprising subreddits, where user-created communities
discuss topics or themes. The Reddit dataset was collected from ten subreddits based on their tendency to contain toxicity and hate
speech, such as r/DankMemes, r/MensRights, and r/TheDonald. The top 200 posts of each subreddit, sorted by Reddit’s
‘‘Hottest’’ ordering, were collated. The conversation threads were filtered using the hate keywords from ElSherief et al. (2018) and
annotated using Amazon Mechanical Turk as either hateful or non-hateful; three different workers annotated each thread.
Gab (Qian et al., 2019): Gab is a social media and microblogging platform with functionalities similar to Twitter, well known for
its far-right user base (Jasser et al., 2023). The Gab dataset originates from the same work as the Reddit dataset and uses a similar
methodology. The authors used hateful keywords (ElSherief et al., 2018) to identify potentially hateful posts, which are then labeled
as hateful or non-hateful using Amazon Mechanical Turk. Each post was assigned to 3 different workers.
Fox (Gao and Huang, 2017) is a small binary – hate speech or not – dataset that contains user comments from ten news discussion
threads on the Fox News website. Four of the ten threads were annotated by two native English speakers, who discussed the labeling
criteria before annotation. The other six threads were annotated by only one of the two annotators.
Mandl (Mandl et al., 2019) is a Twitter dataset released as a part of the HASOC track of the 2019 ACM Forum for Information
Retrieval Evaluation (FIRE) conference. We only use the English dataset. Labeling was done in two steps. First, tweets were labeled
as either hateful/offensive or neither. Hateful and offensive tweets were then re-examined and further divided into 3
classes, resulting in a final set of 4 classes: Neutral, Hate, Offensive, and Profane. For this work, we consider only the classes Hate
and Offensive as problematic; according to most definitions of hate speech (see Section 2), profanities (while displeasing) cannot be
considered problematic content. ‘‘Several juniors’’ performed the annotations after being given rough guidelines and definitions.
Stormfront (de Gibert et al., 2018): Stormfront is a neo-Nazi Internet forum considered one of the major racial hate hubs. The
Stormfront dataset contains 9916 sentences from 500 posts posted across 22 sub-forums on the Stormfront website between 2002
and 2017. Three annotators first labeled 1144 sentences as either hate, no hate, or skip/unclear. A relation label is also used for
posts that are hateful given the conversation chain’s full context, but not by themselves. Our work treats each sentence as an individual data
instance. Hence, we consider the skip/unclear and relation classes non-hate.
HatEval (Basile et al., 2019) is a Twitter dataset containing tweets from July to September 2018. The hate speech class in this
work is focused explicitly on hate against immigrants and women. The original work contributes a Spanish dataset in addition to
the English one, and it provides additional labels for the target of hate and the aggressiveness of the content. We do not use these
in this work.
5. The PubFigs dataset

This section presents the construction of the PubFigs dataset we contribute and its human-labeled subset PubFigs-L. Fig. 3
schematically shows the construction process of PubFigs and PubFigs-L. Tweets are first collected from Twitter to construct the
unlabeled PubFigs (details in Section 5.1). Next, we use the MTL classifier to construct a subset that we manually label using Amazon
Mechanical Turk, giving us PubFigs-L (Section 5.2). Finally, we train a new classifier with PubFigs-L added into the MTL training
process, which we then use to machine-label PubFigs (Section 5.5).
Fig. 3. The construction process of PubFigs and PubFigs-L datasets. We detail each action (purple hexagon) in Section 5: Twitter Data Collection in Section 5.1,
PubFigs-L subset Selection in Section 5.2, Amazon Mechanical Turk Labeling in Section 5.3, and Machine Labeling in Section 5.5.
5.1. Twitter data collection

We gather historical tweets from 15 American public figures — such as former presidents, conservative politicians, far-right
conspiracy theorists, media pundits, and left-leaning representatives perceived as very progressive. The dataset mainly pertains to
public figures directly associated with American politics with a social media following. Some figures are not directly involved in
American politics but have a significant public or social media following. We select figures based on their perceived conservative
(right-) vs. liberal (left-) political leaning and media presence to cover a range of personalities and social media behavioral patterns.
We collected 305,235 tweets from 16 Twitter accounts (one figure used two accounts with similar amounts of activity).
Table 2 outlines the figures in the dataset, with each figure only being referred to by a pseudonym. The collection process of the
tweets was conducted as follows. We retrieved archived tweets (Wu et al., 2020) from www.polititweet.com, a website dedicated to
archiving Twitter postings of public figures. These archived tweets may be truncated due to the API settings used by polititweet. As
such, we re-obtained the non-truncated tweets (where still available) from the Twitter API using the tweet ID from the polititweet
postings.
5.2. PubFigs-L subset selection

Labeling more than 300,000 postings is time and cost-prohibitive; therefore, we label a subset — dubbed PubFigs-L. Hate speech
is rare, and hate speech from public figures is even rarer; hence, selecting the subset through random sampling would yield very few
hateful tweets. To build a more balanced set, we employ an active sampling procedure. First, we train a binary classifier specialized
for classifying unseen datasets using our MTL framework (see the MTL-NCH classifier in Section 3.3 for further details). We train
this classifier using the eight hate speech datasets described in Section 4 for 10 epochs at a learning rate of 2𝑒 − 5 and a batch size
of 512, as described in Section 3.2. We refer to this classifier as MTL-NCH[8].
We apply MTL-NCH[8] to PubFigs, machine-labeling it with the ‘‘harmless’’ and ‘‘problematic’’ labels. We build a balanced subset
as follows. First, we build the likely positive set by selecting all tweets labeled as problematic. Second, we build the likely negative
tweet set by undersampling an equal number of tweets labeled as harmless for each figure so that the dataset is not overly skewed
towards any single figure. We obtain a final subset of 20,327 tweets. Next, we human-annotate this subset via crowd-sourcing using
Amazon Mechanical Turk.
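The active-sampling construction of the labeling subset can be sketched as follows; `classify` stands in for the MTL-NCH[8] binary classifier, and the exact per-figure balancing is our assumption.

```python
import random

def build_labeling_subset(tweets_by_figure, classify, seed=42):
    """Keep every tweet machine-labeled as problematic; undersample an equal number of
    machine-labeled harmless tweets per figure to avoid skew towards any single figure."""
    rng = random.Random(seed)
    likely_positive, likely_negative = [], []
    for figure, tweets in tweets_by_figure.items():
        positives = [t for t in tweets if classify(t) == "problematic"]
        negatives = [t for t in tweets if classify(t) == "harmless"]
        likely_positive.extend(positives)
        likely_negative.extend(rng.sample(negatives, min(len(negatives), len(positives))))
    return likely_positive + likely_negative
```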
Table 3
Comparison of tie-breaking strategies applied to a sample of the Davidson dataset for the
3-class problem (‘‘neutral’’, ‘‘abusive’’, and ‘‘hate’’) and the binary problem (‘‘neutral’’ vs
‘‘abusive’’/‘‘hate’’). Highest scores shown in boldface.
Strategy NMV-H NMV-L HH HL LH LL
3 Class F1 0.389 0.431 0.468 𝟎.𝟒𝟖𝟏 0.453 0.422
Binary F1 0.568 0.674 0.709 𝟎.𝟕𝟎𝟗 0.615 0.615
5.3. Amazon Mechanical Turk labeling

We preprocess the tweets to remove links, identification data, and non-textual data such as videos and images. Links were
substituted with a replacement token to indicate where a link existed in the text. Workers were asked to label each tweet as either
‘‘Hate’’, ‘‘Abuse’’, or ‘‘Neutral’’. We provided the workers with definitions of each class and positive and negative examples (we
show the instructions and worker interface in the supplementary materials). The definitions were given as follows: ‘‘Hate’’: content
that directly or indirectly attacks, discriminates, incites violence and/or hate against a person or group on the basis of who they
are. ‘‘Abuse’’: content that is abusive but not hateful based on the criteria above. Any content that bullies, harasses, insults, or is
otherwise offensive to an individual or group but not on the basis of their identity. ‘‘Neutral’’: any content not fitting the other
two classes. We restrict participation to workers in English-speaking countries (the UK, the United States, Canada, and Australia)
to increase the likelihood of native speakers with socio-political background knowledge. We further require workers to have a 98%
approval rate and have completed at least 5000 MTurk HITS. In total, 456 workers contributed to the labeling process, and 8 workers
labeled each tweet. Further details regarding the MTurk Labeling process, such as payment and task batch size, can be found in the
supplementary materials. We determined the final annotations based on a two-stage majority vote procedure described in the next
section.
5.3.1. Tie-breaking
Hate speech identification is a difficult problem even for humans; therefore, it is expected to have diverse labels among the 8
annotators of each tweet. A simple majority vote among the annotators is unlikely to yield the expected results. For example, a
tweet labeled by the workers as 3:2:3 (neutral : abuse : hate) is unlikely to be neutral, as five annotators considered it abusive or
harmful. We start from the hypothesis that it is easier to distinguish between neutral and problematic speech – we dub speech as
problematic when it is either abusive or hateful – than between abuse and hate. As such, we devise a two-stage procedure that we
dub tie-breaking. First, we break between neutral and problematic, then between hateful and abusive. In stage 1, we compare the
number of annotators that selected neutral versus those that selected abuse or hate, and select the option with the higher count. If we selected
abuse/hate, then we compare the amount of abuse versus the amount of hate. In the example above, we chose problematic in stage
one, and hate in stage two.
However, at any stage, there may be an equal amount of annotators for both options. As a result, we explored several tie-breaking
strategies: NMV-H : breaking naïvely through a single majority vote between all 3 classes to more hateful class (e.g., 4:4:2 yields
abuse); NMV-L: breaking naïvely through a single majority vote between all 3 classes to less hateful class (e.g., 4:4:2 yields neutral);
HH : breaking to the more hateful class for both stages — in case of equality, select problematic for stage 1, and hate at stage 2
(e.g., 3:2:3 yields hate); LL: breaking to the less hateful class for both stages — neutral at stage 1, and abuse at stage 2 (3:2:3 yields
neutral); HL: breaking first to the more hateful class at stage 1 (i.e., problematic), and to the less hateful class at stage 2 (i.e., abuse)
(3:3:3 would yield abuse); and; LH : breaking first to the less hateful class at stage 1 and to the more hateful class at stage 2 (3:3:3
would yield neutral).
To select a tie-breaking strategy, we took a small sample of 200 instances from the Davidson dataset and re-labeled them following
our methodology described in Section 5.3 using the Amazon MTurkers. The intuition is that the best tie-breaking strategy will yield
the closest results to the Davidson annotations. We chose the Davidson dataset due to its closeness to our labeling task. Both the
Davidson and our labeling tasks use tweets, labeled with 3 classes: neutral, offensive, hate for Davidson and neutral, abusive, hate
for our task. The offensive class was mapped to the abusive class in the labeling process, as both classes followed roughly similar
definitions. We treat each tie-breaking strategy as a classifier and the already assigned labels from the Davidson dataset as the ground
truth. We compute typical classification error metrics (such as the F1-score), and consider that the tie-breaking method with the
highest F1 is the best suited. Table 3 shows that HL achieves the highest three-class F1-score — if a post has an equal number of
neutral and abuse/hate, we chose abuse/hate in stage 1; in stage 2, if it has an equal number of abuse and hate annotations, we
chose abuse. We use this strategy to build the final annotation for each post in PubFigs-L.
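A sketch of the selected HL strategy is given below, assuming stage one compares the neutral count against the combined abuse and hate count, as described above.

```python
def hl_tiebreak(n_neutral: int, n_abuse: int, n_hate: int) -> str:
    """Two-stage HL tie-breaking: ties favor 'problematic' in stage 1 and 'abuse' in stage 2."""
    # Stage 1: neutral vs. problematic (abuse + hate); ties go to problematic.
    if n_neutral > n_abuse + n_hate:
        return "neutral"
    # Stage 2: abuse vs. hate; ties go to abuse.
    return "hate" if n_hate > n_abuse else "abuse"

assert hl_tiebreak(3, 3, 3) == "abuse"    # example from the text: 3:3:3 yields abuse under HL
assert hl_tiebreak(7, 1, 0) == "neutral"  # clear neutral majority
```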
Table 4
Comparison of mean agreement with majority vote for MTurk worker annotations before and after blocklisting
underperforming workers. The columns show the mean agreement with the majority when using 3 labels
(‘‘neutral’’, ‘‘abusive’’, ‘‘hate’’) and two labels (‘‘neutral’’ vs ‘‘abusive’’/‘‘hate’’).
3 Class Binary Krippendorff’s
mean agreement mean agreement Alpha
With underperforming workers 0.649 0.701 0.063
After blocklisting and re-labeling 0.803 0.813 0.124
Improvement 23.72% 15.97% 96.82%
Fig. 4. Amazon Turkers’ mean agreement with the majority for the final labels in the PubFigs-L dataset. Eight workers label each tweet via majority vote.
The 𝑥-axis shows how many workers selected the majority label. The 𝑦-axis shows the percentage of total instances assigned to a label with a given majority
vote split. (a) binary: harmless (neutral) vs. problematic (abuse and hate); (b) 3 class labeling (neutral, abuse, hate).
The annotation task was done in batches of 50 to 100 HITs (Human Intelligence Tasks), with each HIT consisting of 10 tweets for
labeling. For each batch, we examined the top 25 workers who completed the most work and blocklisted them if the work was poor
quality. We identified suspect workers using two metrics: Firstly, we examined the workers’ Krippendorff’s Alpha (Krippendorff,
2011) — their mean consensus with the final instance label of the majority vote. We operate under the assumption that most
workers do not behave maliciously; hence, users with a low Krippendorff’s Alpha value are suspicious as it indicates that they
tend to systematically disagree with the majority of other workers across many tweets. Secondly, we examined the distribution of
user-assigned labels. We know that hate speech is rare among public figures; therefore, we consider as suspicious any worker with
an unusual class distribution; for example, skewed towards abuse or hate, equal among the three classes, or constantly labeling only
one or two classes. We manually examined the work of suspected workers for quality. In particular, we examined posts labeled as
hateful and abusive for the presence of overtly benign content. Overtly underperforming workers were blocklisted and unable to
further work on our task. Tweets labeled by more than two underperforming workers were redone. To analyze the effectiveness of
this filtering and reannotation process, we examined the mean agreement with the majority vote across all instances for both the 3
class problem and the binary problem of neutral vs non-neutral. A higher agreement indicates a more robust label since we assume
that most workers do not behave maliciously.
Table 4 shows that blocklisting underperforming workers improves the mean agreement with the majority by 23.72% and the
inter-worker Krippendorff’s Alpha by 96.82%.
The crowdsourced annotation yielded 17,963 neutral instances, 1,422 abusive instances, and 942 hate instances. We find that
all figures had consistently more neutral than problematic (abuse, hate) labeled tweets — highlighting the class imbalance. The
vast majority of problematic tweets (96.62%) originate from right-leaning figures: 915 (out of 942) hateful tweets and 1,369 (out of 1,422)
abusive tweets. Within the hate class, every right-leaning figure in the dataset had at least 10 hateful tweets. AC, a controversial
right-leaning political commentator, accounts for just under half of all hate-labeled instances (464 out of 942). Comparatively, the
left-leaning figure with the most hate tweets was IO (23), although these are most likely false positives (see Section 6.4). Four
left-leaning figures (BR, TS, BO, and MO) had no tweets labeled hate.
Hate Speech Annotation — A Difficult Problem. Annotating hate speech is a challenging task. This difficulty can be quantified
by examining the inter-annotator agreement, as a difficult task often leads to lower consensus among workers. We use the mean
agreement with the majority vote over all tweets as a measure of difficulty rather than the more widely used Krippendorff’s Alpha.
The latter is skewed to low values when there is a low overlap of annotated instances between workers.
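The mean agreement with the majority vote can be computed as in the sketch below; the exact formulation (the fraction of a tweet's eight annotators that chose its majority label, averaged over tweets) is our assumption.

```python
from collections import Counter

def mean_agreement_with_majority(annotations):
    """annotations: one list of worker labels per tweet (eight labels each)."""
    fractions = []
    for labels in annotations:
        majority_count = max(Counter(labels).values())
        fractions.append(majority_count / len(labels))
    return sum(fractions) / len(fractions)

# A 3:2:3 (neutral:abuse:hate) split contributes an agreement of 3/8 = 0.375.
print(mean_agreement_with_majority([["n"] * 3 + ["a"] * 2 + ["h"] * 3]))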
We analyze workers’ agreement in two tasks. The first is differentiating between harmless content (neutral) and problematic
content (abuse and hate). The second task is differentiating between the three classes. We expect workers to perform better in the
first task as it is easier to identify harmless from problematic content than differentiating between flavors of problematic (hate
Table 5
Unseen dataset performance of existing works and our MTL pipeline. MTL is trained on all datasets except the testing dataset. The best results are in bold;
we report the same number of decimals as the original work. For MTL we show ±𝑠𝑡.𝑑𝑒𝑣. We could not obtain the Founta and OLID datasets, and we do not
include them in our work. The number of wins in parentheses.
Prior work Model Train dataset Test dataset Metric Reported (3/8) MTL-NCH (5/8)
Arango et al. (2019) Badjatiya et al. (2017) LSTM + GBDT (binary) Waseem + Davidson HatEval Macro F1 0.516 0.645 ± 0.006
Arango et al. (2019) Agrawal and Awekar (2018) Bi-LSTM baseline (binary) Waseem + Davidson HatEval Macro F1 0.541 0.645 ± 0.006
Swamy et al. (2019) BERT binary Waseem Davidson Macro F1 0.5296 0.6822 ± 0.0197
Swamy et al. (2019) BERT binary Founta Davidson Macro F1 0.5824 0.6822 ± 0.0197
Swamy et al. (2019) BERT binary OLID Davidson Macro F1 0.5982 0.6822 ± 0.0197
Swamy et al. (2019) BERT binary Davidson Waseem Macro F1 0.6928 0.4995 ± 0.0085
Swamy et al. (2019) BERT binary Founta Waseem Macro F1 0.6049 0.4995 ± 0.0085
Swamy et al. (2019) BERT binary OLID Waseem Macro F1 0.6269 0.4995 ± 0.0085
and abuse). Fig. 4 shows the average worker agreement with the majority vote for the first binary task (Fig. 4(a)) and the second
three-class task (Fig. 4(b)). The 𝑥-axis shows the number of annotators who agreed with the label assigned via the majority vote;
the 𝑦-axis shows the percentage of instances with the given agreement between annotators.
Intuitively, a difficult decision is signaled by a high percentage of instances (high y-axis) for which few annotators agree (low
x-axis). Fig. 4(a) shows that annotators have a higher agreement for the harmless class (most tweets achieve a consensus of seven
annotators out of eight, 7∕8) than for the problematic class (consensus of 4∕8). We posit that the language used by public figures
may not be overtly abusive or hateful, making it difficult for annotators to identify. In the three-class task (Fig. 4(b)), the abuse and
hate classes have similar distributions over the final majority vote, indicating that both classes are similarly challenging to identify.
Finally, about 18% of the abuse and hate instances have a low agreement outcome (3∕8), which can occur when as many as six
annotators out of eight agree that the instance is problematic. This suggests that identifying the explicit facet of problematic content
is difficult even when there is overall agreement that the tweet is problematic.
The final action in Fig. 3 is machine-labeling the entire PubFigs dataset. To achieve this, we train an MTL classifier (see Section 3)
using the eight public datasets introduced in Section 4 and the human-labeled PubFigs-L dataset. We optimize the MTL model for
each of the 9 datasets simultaneously; once the training is complete, we leverage the classification head corresponding to PubFigs-L
to machine-label all the 305,235 tweets in the PubFigs dataset according to the PubFigs-L labels. We obtain 298,803 neutral, 5299
abusive, and 1133 hateful tweets. The minority nature of the abusive and hateful tweets – comprising 1.74% and less than 1% of
the total tweets, respectively – was as anticipated given Twitter’s policy on hateful conduct and its efforts to eliminate hate speech
from its platform (Twitter Inc., 2023).
The PubFigs dataset is about 15 times larger than the PubFigs-L subset; however, it contains only marginally more hate tweets
(1133 in PubFigs vs 942 in PubFigs-L) and 4 times more abusive tweets (5299 in PubFigs vs 1422 in PubFigs-L). The classifier labeled
the vast majority of tweets as neutral. This is expected given the construction of the PubFigs-L subset. In Section 5.2, we used an
MTL classifier trained on eight public datasets to identify potentially problematic speech. As it turns out, that classifier had a good
performance, missing only very few hateful tweets. We perform additional error analysis in Section 6.1.
This section discusses the evaluation of our MTL approach. First, in Section 6.1, we present the hate speech detection on unseen
datasets and the improvements in train–test splits. Next, in Section 6.4, we apply the MTL model on the PubFigs dataset and analyze
the inappropriate speech patterns of a sample of American public figures. All reported results are mean averages over 10 experimental
runs.
Metrics. We use macro-averaged F1 as our evaluation metric, defined for 𝑁 classes as:
\[ \text{Macro F1} = \frac{1}{N} \sum_{i=1}^{N} F1_i = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \times \mathit{precision}_i \times \mathit{recall}_i}{\mathit{precision}_i + \mathit{recall}_i} \]
The F1 score is the harmonic mean between Precision and Recall and can be understood as a trade-off between the two metrics.
Macro averaging is selected over other averaging methods as it assigns equal importance to each class irrespective of their size. This
is beneficial in our evaluation tasks as hate speech datasets are known to often be highly imbalanced (Yuan et al., 2023; Madukwe
et al., 2020a), with only a few positive examples for some classes.
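For reference, this corresponds to scikit-learn's `f1_score` with `average="macro"`, which weights every class equally regardless of its size; the toy labels below are illustrative.

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 2]   # toy labels with a heavy class imbalance
y_pred = [0, 0, 0, 1, 1, 2]
print(f1_score(y_true, y_pred, average="macro"))  # unweighted mean of per-class F1 scores
```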
Fig. 5. The problem setup for unseen dataset classification. We adopt a leave-one-out evaluation scheme for evaluating the pipeline’s ability to generalize on a
completely new dataset.
Table 6
Macro-F1 prediction performances on a target dataset, unseen during training (shown by column headers). (above horizontal ruler) Our two
MTL flavors (NCH and MV) trained on all datasets except the target dataset. (below horizontal ruler) Transferability between pairs of datasets. A
single dataset baseline (see Section 3.2) is trained on the source dataset (rows) and tested on the target dataset (columns). The best results are
in bold.
Testing Dataset
Model Davidson Waseem Reddit Gab Fox StormFront Mandl HatEval PubFigs-L Wins
MTL
MTL-NCH 0.6822 0.3801 0.8456 0.8738 0.6150 0.6826 0.5312 0.6449 0.6175 6
MTL-MV 0.6455 0.4048 0.8263 0.8660 0.6030 0.6771 0.4834 0.6315 0.6231 1
BERT baseline trained on:
Davidson 0.5556 0.5914 0.6731 0.4932 0.4597 0.5690 0.5414 0.5469 0
Waseem 0.6136 0.6000 0.6427 0.5519 0.5356 0.5099 0.5784 0.5611 0
Reddit 0.6135 0.4957 0.8083 0.5229 0.5559 0.4900 0.5741 0.5402 0
Here, we examine whether the MTL pipeline can train models that detect hateful and abusive speech in previously unseen
datasets.
Experiment Setup and Evaluation. We use a leave-one-out setup: given 𝑛 datasets, we train the shared BERT representation
using 𝑛 − 1 datasets, leaving out the dataset 𝑑𝑡 . We use the MTL-trained BERT – in the NCH or MV classification scheme (see
Section 3.3) – to evaluate on 𝑑𝑡 . Note that no portion of 𝑑𝑡 is observed during training; thus, 𝑑𝑡 is an entirely new dataset. We
rotate the left-out dataset 𝑑𝑡 until we have evaluated all datasets. A diagram of the evaluation scheme can be seen in Fig. 5. We
also evaluate the transferability between datasets by training single dataset baselines (see Section 3.2). We train on one dataset
and test on another for each possible pair of datasets (81 unique pairs). Datasets with multiple problematic classes are binarized as
problematic/harmless using the mapping in Table 1.
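The leave-one-out protocol can be sketched as follows; `train_mtl_pipeline` and `evaluate_macro_f1` are placeholders for the training and evaluation routines of Section 3.

```python
def leave_one_out(datasets, train_mtl_pipeline, evaluate_macro_f1):
    """Train MTL on n-1 datasets and test on the held-out target, rotating the target."""
    scores = {}
    for target in datasets:
        training_sets = {name: data for name, data in datasets.items() if name != target}
        model = train_mtl_pipeline(training_sets)              # shared BERT + one head per training set
        scores[target] = evaluate_macro_f1(model, datasets[target])  # target never seen during training
    return scores
```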
Outperforming the State-of-the-Art. We start by assessing the performance of MTL against existing state-of-the-art that
examined predicting hate speech on unseen datasets. In Table 5, we select prior works that evaluate using the same datasets used
in our work. Our model outperforms the existing models in 5 out of 8 comparisons, achieving nearly 10% higher performance
on HatEval and Davidson datasets. We underperform solely on the Waseem dataset, which is known to have idiosyncratic labeling criteria
and to contain many false positives (Yuan et al., 2023; Waseem, 2016). Notably, the Waseem dataset was not constructed for hate
speech classification but rather to assess the agreement between amateur and expert annotators, as discussed in Yuan et al. (2023)
Table 7
MTL against 9 baselines (column ‘‘Model in related work’’) for targeted hate speech prediction (see the ‘‘Setup’’ in Section 6.3). We copy the performances
reported by the works (column ‘‘Reported’’). We report using the performance metrics and the same number of decimals as reported in the original papers
(column ‘‘Metric’’), ± standard deviation. The number of wins is shown in parentheses. Arango et al. (2019) use a binary hate/non-hate label mapping that we
replicate for the corresponding MTL comparisons. Waseem et al. (2018) do not specify the averaging method; hence we assume micro-F1.

Work                        Model in related work                            Dataset   Metric        Reported (7/20)   MTL (12/20)
Mozafari et al. (2019) BERT + CNN Waseem Weighted F1 0.88 0.83 ± 0.01
Mozafari et al. (2019) BERT + CNN Davidson Weighted F1 0.92 0.90 ± 0.01
MacAvaney et al. (2019) BERT finetune HatEval Macro F1 0.7452 0.7526 ± 0.0154
MacAvaney et al. (2019) mSVM HatEval Macro F1 0.7481 0.7526 ± 0.0154
Zhang and Luo (2018) CNN + sCNN Waseem Micro F1 0.83 0.83 ± 0.01
Zhang and Luo (2018) CNN + sCNN Waseem Macro F1 0.77 0.78 ± 0.01
Zhang and Luo (2018) CNN + sCNN Davidson Micro F1 0.94 0.90 ± 0.01
Zhang and Luo (2018) CNN + sCNN Davidson Macro F1 0.64 0.74 ± 0.01
Arango et al. (2019) Badjatiya et al. (2017) LSTM + GBDT baseline (binary) Waseem Micro F1 0.807 0.828 ± 0.008
Arango et al. (2019) Badjatiya et al. (2017) LSTM + GBDT baseline (binary) Waseem Macro F1 0.731 0.802 ± 0.010
Arango et al. (2019) Agrawal and Awekar (2018) Bi-LSTM baseline (binary) Waseem Micro F1 0.843 0.828 ± 0.008
Arango et al. (2019) Agrawal and Awekar (2018) Bi-LSTM baseline (binary) Waseem Macro F1 0.796 0.802 ± 0.010
Madukwe et al. (2020b) GA (BERT + CNN+LSTM) Davidson Weighted F1 0.87 0.90 ± 0.01
Madukwe et al. (2020b) GA (BERT + CNN+LSTM) Davidson Macro F1 0.73 0.74 ± 0.01
Waseem et al. (2018) BOW Waseem Micro F1 0.87 0.83 ± 0.01
Waseem et al. (2018) BOW Davidson Micro F1 0.89 0.90 ± 0.01
Yuan et al. (2023) Bi-LSTM Waseem Macro F1 0.7809 0.7823 ± 0.0133
Yuan et al. (2023) Bi-LSTM Davidson Macro F1 0.7264 0.7450 ± 0.0052
Kapil and Ekbal (2020) SP-MTL + CNN Waseem Macro F1 0.8916 0.7823 ± 0.0133
Kapil and Ekbal (2020) SP-MTL + CNN Davidson Macro F1 0.9115 0.7450 ± 0.0052
and the supplementary appendix. We contend that a generalized model is not ideal for such datasets, leading to lower prediction
performance.
In Table 6, we further investigate the classification performances of MTL and single dataset baselines. We note two key
observations. Firstly, the MTL flavors consistently outperform the single dataset baselines, achieving the best results on seven out
of nine datasets. This finding indicates that MTL successfully enhances the predictive ability on unseen datasets. For the remaining
two datasets (Waseem and Mandl), the single dataset baseline trained on PubFigs-L, which is introduced in this work, achieves the
best performance. Secondly, we find that MTL-NCH outperforms MTL-MV in seven out of the nine datasets, and it is within 2% of
the MTL-MV performance on the remaining two (Waseem and PubFigs-L). As a result, we focus solely on the MTL-NCH flavor in all
discussions in this paper.
Error Analysis on PubFigs-L. We compare the classification results of the MTL-NCH classifier against the human-annotated
labels in PubFigs-L to gain insight into the types of errors the classifier makes. This classifier is equivalent to MTL-NCH[8]
described in Section 5.2. The confusion matrix of MTL-NCH[8] is shown in Table 8. The classifier performs well on
both hateful and abusive tweets, correctly identifying 82% of abusive or hateful tweets as problematic. We find
that the main error type is the false positive, where MTL-NCH considers a tweet labeled neutral by humans to be problematic. Of the
10,170 tweets classified as problematic, the vast majority (8230 out of 10,170) were deemed neither hateful nor abusive by the
human annotators. This indicates that the MTL-NCH classifier is overly liberal in assigning the problematic label in the
PubFigs-L dataset.
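The 82% figure and the false-positive pattern above can be recomputed from a table like Table 8 with a few lines of pandas. The sketch below is illustrative only; the function name and the label strings (`Problematic`, `Abuse`, `Hate`) are assumptions rather than the paper's code.

```python
import pandas as pd

def error_table(pred_binary, human_labels):
    """Cross-tabulate the binary MTL-NCH decision against the 3-class MTurk labels
    and compute the share of abusive/hateful tweets flagged as problematic."""
    tab = pd.crosstab(pd.Series(pred_binary, name="MTL-NCH"),
                      pd.Series(human_labels, name="MTurk label"))
    flagged = tab.loc["Problematic", ["Abuse", "Hate"]].sum()
    recall_problematic = flagged / tab[["Abuse", "Hate"]].to_numpy().sum()
    return tab, recall_problematic
```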
We further study how transferable each dataset is to each other. Given that no definition of hate is universally agreed upon,
each dataset brings with it its own biases. We hypothesize that certain datasets may generalize better to others due to shared
biases in their definitions. Fig. 6 presents the distribution of macro-F1 scores using violin plots, with each violin representing a
target dataset and summarizing how well other datasets transfer to it. The left half of each violin illustrates the performance
distribution of all single dataset baselines, whereas the right half shows the performance of MTL-NCH. We observe that MTL-NCH
exhibits higher mean prediction performance and lower variance than single baselines. Additionally, the single baselines for Waseem
and Mandl exhibit two visible modes, while HatEval, Reddit, and Gab show less pronounced bimodality. This phenomenon arises
from the uneven transferability of models — some datasets generalize better to specific others, potentially due to similarities in their
hate speech facets or annotation procedures. This is shown in Table 6, where higher performance indicates better generalization
to unseen datasets. For instance, Gab and Reddit exhibit bidirectional generalization, achieving over 80% macro-F1 performance
in both directions. This is unsurprising, given that they were proposed in the same work. Other dataset pairs generalize only one
way, e.g., Davidson to Gab, PubFigs-L to Davidson, PubFigs-L to Gab, and Waseem to Reddit. This may occur when one dataset covers
more hate speech facets than the other. We also analyze dataset similarity in terms of their vocabulary usage. We measure the
Ruzicka similarity (Ruzicka, 1958), a weighted Jaccard similarity quantifying the overlap in the terms occurring in the problematic
and harmless classes. While Gab and Reddit have similar language use for both classes, Gab and PubFigs-L are only similar for the
harmless class (Ruzicka = 0.37 for harmless, 0.08 for problematic), which still leads to a moderate generalization (macro-F1 = 0.66).
More details on the Ruzicka similarities are available in the supplementary appendix.
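For reference, the Ruzicka (weighted Jaccard) similarity is the ratio of summed minima to summed maxima over the two term-weight profiles. The sketch below uses naive whitespace tokenization and relative term frequencies; the paper's exact preprocessing and weighting may differ.

```python
from collections import Counter

def ruzicka_similarity(texts_a, texts_b):
    """Weighted Jaccard (Ruzicka) similarity between the relative term-frequency
    profiles of two text collections: sum(min) / sum(max) over the joint vocabulary."""
    freq_a, freq_b = Counter(), Counter()
    for text in texts_a:
        freq_a.update(text.lower().split())
    for text in texts_b:
        freq_b.update(text.lower().split())
    total_a, total_b = sum(freq_a.values()), sum(freq_b.values())
    if not total_a or not total_b:
        return 0.0
    vocab = set(freq_a) | set(freq_b)
    num = sum(min(freq_a[w] / total_a, freq_b[w] / total_b) for w in vocab)
    den = sum(max(freq_a[w] / total_a, freq_b[w] / total_b) for w in vocab)
    return num / den if den else 0.0
```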
Fig. 6. Unseen classification performance (macro-F1) on each target dataset by the single dataset trained baselines (green, left) and the MTL pipeline (orange,
right). The magenta lines indicate quartiles.
Table 8
MTL-NCH[8] label counts compared to the final MTurk human-assigned labels. Red italics are errors.

MTL-NCH[8]        MTurk-assigned label
classification    Neutral   Abuse   Hate
Neutral           9733      294     130
Problematic       8230      1128    812
Fig. 7. The problem setup for improving targeted classification. We train on all datasets while designating a target dataset, using its test set to evaluate the
pipeline’s ability to improve classification performance on a known dataset through transfer learning.
Here, we examine MTL’s ability to improve the classification performance for a known target dataset (in a train–test split), by
transferring knowledge from the other datasets.
Experiment Setup and Evaluation. Fig. 7 shows the setup schema of the task. We leverage all 𝑛 datasets in training; one dataset
is designated as the target 𝑑𝑡 on which we aim to improve classification and the other 𝑛−1 are jointly leveraged by MTL. We optimize
the MTL model for each of the 𝑛 tasks simultaneously: on the training set of 𝑑𝑡 and the 𝑛 − 1 additional datasets. Finally, we evaluate
using 𝑑𝑡 ’s testing set (10% of the dataset). This differs from the new unseen setup used in Section 6.1 as the training part of the
Fig. 8. Results of MTL targeted training. (a) Mean macro F1 over 10 runs for the targeted dataset improvement task. MTL is trained with all 9 datasets
while the baseline classifier is trained with only the target dataset. MTL’s improvement over the baseline is statistically significant (p < 0.05) for all datasets.
(b) Diminishing returns for increasing the datasets used in the MTL targeted training. Each line represents classification on a different target dataset with the
leftmost and rightmost points being equivalent to the baseline and MTL results shown in (a). The 𝑥-axis shows the number of datasets used in training MTL
while the 𝑦-axis shows the macro F1 performance on the testing set. Shaded areas are the 95% confidence intervals over 10 runs.
Table 9
Confusion matrix of the PubFigs-L testing set for the classification head attached to an MTL unit trained on the PubFigs-L
training set and 8 other datasets. Red italics are errors.

Predicted     True label
label         Neutral   Abuse   Hate
Neutral       1698      96      65
Abuse         78        42      11
Hate          21        4       18
target dataset 𝑑𝑡 is not omitted. We test against a baseline model in which we use the same model architecture as our MTL model
but do not leverage any additional datasets, instead training only on 𝑑𝑡 .
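The sketch below illustrates the shared-encoder, one-head-per-dataset architecture that this setup assumes, using Hugging Face `transformers` and PyTorch. It is a simplified stand-in for the pipeline of Section 3 (pooling, dropout, and loss-weighting details are omitted), not the exact implementation.

```python
from torch import nn
from transformers import AutoModel

class SharedEncoderMTL(nn.Module):
    """Shared BERT encoder with one linear classification head per dataset (task)."""
    def __init__(self, classes_per_task, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.heads = nn.ModuleList(nn.Linear(hidden, c) for c in classes_per_task)

    def forward(self, input_ids, attention_mask, task_id):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_repr = out.last_hidden_state[:, 0]   # [CLS] token as the pooled representation
        return self.heads[task_id](cls_repr)     # logits from the head of the selected dataset
```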
Outperforming the Baselines. We compare against the baseline model trained on only the target dataset. Fig. 8(a) compares the
mean macro-F1 score over 10 experiment runs of the single dataset baseline and of MTL. To test for statistical
significance, we perform Welch’s t-test (Welch, 1947) using the experiment runs as our samples. We find that, on all 9 datasets, our
experiments show a statistically significant (𝑝 < 0.05) improvement for MTL compared to training the model only on the target
dataset. This demonstrates the effect of transferring knowledge from the other datasets through the MTL framework.
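As a minimal illustration of the significance test: Welch's t-test is a two-sample t-test that does not assume equal variances, available in SciPy as `ttest_ind(..., equal_var=False)`. The scores below are illustrative placeholders, not the paper's numbers.

```python
from scipy import stats

# Macro-F1 over 10 runs for one target dataset (illustrative values only).
mtl_runs      = [0.76, 0.75, 0.77, 0.76, 0.74, 0.75, 0.76, 0.77, 0.75, 0.76]
baseline_runs = [0.72, 0.71, 0.73, 0.70, 0.72, 0.71, 0.73, 0.72, 0.71, 0.72]

# Welch's t-test: a two-sample t-test without the equal-variance assumption.
t_stat, p_value = stats.ttest_ind(mtl_runs, baseline_runs, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4g}")  # improvement is significant if p < 0.05
```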
We conducted an extensive literature review to identify prior studies that reported predictions on the same datasets as ours.
Table 7 compares the performance of MTL against these works using the same performance metrics they report; the column
labeled Reported lists the results reported in the respective studies. Our analysis shows that MTL outperforms
the baselines consistently, with 12 wins, 1 draw, and 7 losses. Even in cases where MTL underperforms, it still achieves comparable
performance in most cases. As an aside (and intuitively), the targeted classification obtains significantly better results than the
unseen setting. For example, the unseen macro-F1 for HatEval is 0.6449 (see Table 6), whereas the targeted macro-F1 is 0.7526 (see Table 7).
Diminishing Prediction Returns. The above results show that MTL successfully improves prediction performance for single
datasets by transferring knowledge from the other datasets. In Fig. 8, we explore how the prediction performance on a target
dataset 𝑑𝑡 varies with 𝑖, the number of additional datasets used in MTL. For each increment of 𝑖, we iterate 10 times: sample
without replacement 𝑖 datasets (non-identical to 𝑑𝑡 ), train MTL on the 𝑖 datasets and the training part of 𝑑𝑡 , and report performance
on the testing subset of 𝑑𝑡 .
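A sketch of this sampling procedure is given below; `train_mtl` and `evaluate` are hypothetical stand-ins for the MTL training and macro-F1 evaluation routines.

```python
import random
from statistics import mean, stdev

def diminishing_returns(target, auxiliary, train_mtl, evaluate, runs=10):
    """For each i, repeatedly sample i auxiliary datasets without replacement,
    train MTL jointly with the target's training split, and score its test split."""
    curve = {}
    for i in range(len(auxiliary) + 1):
        scores = []
        for _ in range(runs):
            sampled = random.sample(auxiliary, i)            # i datasets, none equal to the target
            model = train_mtl([target["train"]] + sampled)
            scores.append(evaluate(model, target["test"]))
        curve[i] = (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
    return curve
```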
We find that the mean macro-F1 increases and its variance decreases with 𝑖. However, most datasets exhibit a diminishing
returns effect, with little additional improvement when 𝑖 > 5. We posit this happens due to two phenomena. First, some datasets transfer
significantly better to others (see Table 6) and, as 𝑖 increases, so does the chance of having them in the MTL training. Second,
there is likely significant overlap in the transferable knowledge between the datasets; this makes newly added datasets increasingly
redundant compared to datasets already used.
Fig. 9. Understanding what makes texts abusive and hateful. (a) Most common terms in instances labeled Abuse (top) and Hate (bottom) by the MTL
classifier in the PubFigs dataset. Larger words indicate more appearances. (b) Top 20 terms with highest SHAP values from all instances labeled ‘‘Hate’’ in the
PubFigs dataset.
Error Analysis on PubFigs-L. We perform error analysis by focusing on the performance of the PubFigs-L classification head to
gain insight into the types of errors the classifier makes. Table 9 shows the confusion matrix of the PubFigs-L classification
head attached to the MTL model. We find that the MTL model made very few mistakes when classifying neutral tweets. However,
predictive performance on both the hate and abuse classes was poor, with the majority of both classes being misclassified as neutral.
The abuse class appears especially difficult for the classifier to differentiate, with the two most common errors being mistaking
abuse for neutral and neutral for abuse. Performance on the hate class was also weak, with only 18 of 94 hateful tweets correctly classified.
We attribute these results to the low number of abusive and hate speech instances, as well as to differences in the definitions of abusive
and hate speech captured by the datasets. PubFigs-L exclusively features public figures who, by virtue of their prominent public
profiles, are markedly less inclined to engage in overt instances of abuse or hate. This contrasts with the other datasets, which tend
to feature more overt and explicit instances of hateful and abusive posts. As such, the classifier overlooks this subtlety in the abusive and
hateful speech of PubFigs-L: there is less overlap with the abusive and hate speech definitions captured by the other datasets,
which makes knowledge transfer through MTL less effective.
Here, we analyze the tweets of the 15 American public figures in PubFigs to gain insight into the distribution of hateful and
abusive tweets and the language, topics, and targets of hate speech and abuse in their postings.
More Hateful and Abusive Speech by Right-Leaning Figures. We find that right-leaning figures on Twitter produce
significantly more hateful posts than left-leaning figures. Specifically, out of the 5299 abusive posts, 5093 were generated by right-
leaning figures, while out of the 1133 hateful posts, 1083 were generated by right-leaning figures. These findings reinforce our earlier
conclusions from profiling the PubFigs-L dataset, as described in Section 5.4. Notably, all left-leaning figures – except the left-leaning
Democrat politician IO – had fewer than ten hateful posts. Upon manual inspection of IO’s tweets classified as Abuse and Hate, we
observed that many addressed topics related to Muslims and Islam, which are common themes in Islamophobic content. IO is a
practicing Muslim and often tweets regarding the topic. We posit that some of these classifications may be false positives, as the
model may conflate Islam-related content with Islamophobic content due to the occurrence of similar terms in hateful posts.
Topics and Targets of Hate Speech. We explore the main topics and targets of hateful and abusive speech by examining the
most commonly occurring words in tweets labeled as hateful and abusive across all figures. Fig. 9(a) shows the word clouds of the
top 100 most common words for posts classified as Abuse (top) and Hate (bottom) by the MTL classifier in the PubFigs dataset.
The most frequent abuse and hate words were related to six topics: Islam, immigrants and refugees, race and ethnicity, women,
terrorism and extremism, and American politics.
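The word clouds summarize simple term frequencies over the machine-labeled classes. A minimal sketch of such a count, with naive tokenization and an assumed stop-word list, is shown below; the paper's actual preprocessing may differ.

```python
from collections import Counter

def top_terms(texts, stopwords=frozenset(), k=100):
    """Most frequent terms in a set of tweets, e.g. all tweets the classifier labeled
    Abuse or Hate, for word-cloud style inspection of topics and targets."""
    counts = Counter()
    for text in texts:
        counts.update(w for w in text.lower().split()
                      if w not in stopwords and len(w) > 2 and not w.startswith(("http", "@")))
    return counts.most_common(k)
```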
The top terms from instances labeled as hateful were mainly concerned with the first five topics, while those labeled as abusive
mainly related to American politics. The latter mostly target individuals in American politics – such as Donald Trump, Hillary
Clinton, and Joe Biden – whereas the instances labeled as hateful have broader targets. The Democrats Hillary Clinton and Joe Biden
Fig. 10. MTL training increases the distinctiveness of the constructed embedding space. Clustering tendency measured using Hopkins Statistic value of all
instances in PubFigs grouped by assigned label (left) and hateful instances grouped by topic keywords (right).
Fig. 11. Abusive and hateful content becomes more localized in the post-MTL embeddings, making it more distinct and identifiable. Overlaid density
comparison of the embedding space constructed by pre-MTL (the off-the-shelf BERT) (left) and post-MTL (the MTL-tuned BERT) (right). The embedding is
projected to two dimensions using UMAP for visualization. A darker color indicates a higher probability for instances of a particular label to be present at that
point in the space.
are the most common targets of abuse, mainly from right-leaning figures who generate most of the abusive tweets. As PubFigs covers
both the 2016 and 2020 elections, it follows that the Democratic nominees are the most visible.
What Makes A Tweet Hateful? We apply SHAP (Lundberg and Lee, 2017) analysis to uncover the terms the classifier deems
vital for determining whether a post is hateful. Fig. 9(b) shows the top 20 terms with the highest absolute SHAP values. None of
these top terms exhibit a negative SHAP value, indicating that no term strongly contributes to a post being labeled as ‘‘not hateful’’.
Furthermore, the top keywords with the highest absolute SHAP values differ from the most common terms, with indicator terms such
as ‘‘beasts’’, ‘‘parasites’’ and ‘‘reds’’ emerging as highly important in determining whether a post is hateful. Terms related to terrorism
– such as ‘‘isil’’ and ‘‘taliban’’ – also had high SHAP values; a qualitative inspection reveals that they relate mostly to immigrants and
refugees, Islam, race and ethnicity. Our findings suggest that the most vulnerable groups to hate speech are Muslims, immigrants
and refugees, people of color, and women. At the same time, American politicians are the primary targets of abusive speech. Such
findings highlight the need to better distinguish between Islam-related and Islamophobic content.
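For reference, a SHAP explanation of this kind can be obtained by wrapping a fine-tuned text-classification model in a `transformers` pipeline and passing it to `shap.Explainer`. The checkpoint path and the output label name `HATE` below are assumptions for illustration, not the paper's artifacts.

```python
import shap
from transformers import pipeline

# Hypothetical fine-tuned checkpoint; top_k=None returns scores for all labels,
# which SHAP needs to attribute each output class.
clf = pipeline("text-classification", model="path/to/mtl-finetuned-bert", top_k=None)

explainer = shap.Explainer(clf)              # token-level SHAP values via a text masker
sv = explainer(["example tweet to explain"])

# Rank terms by mean absolute SHAP value for the (assumed) hateful output class.
shap.plots.bar(sv[:, :, "HATE"].mean(0), max_display=20)
```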
MTL Training Increases the Distinctness of the Hate Speech Embedding. We investigate the impact of the MTL training on the
generated embedding space. The embeddings are large vectorial representations generated by the BERT language model for each
tweet, on which downstream tasks (such as hate speech detection) can be performed. We compare the embedding representations
constructed by BERT before and after MTL tuning — dubbed pre-MTL and post-MTL, respectively. Post-MTL refers to the embeddings
generated by the BERT model after fine-tuning in the targeted training setup described in the Setup of Section 6.3. We aim to
determine whether MTL improves embedding utility by examining how distinct hateful and abusive posts are in each embedding
representation. In post-MTL, the distinction between hateful, abusive and harmless posts should increase, improving hate speech
detection performance.
We evaluate distinctness using the Hopkins Statistic — a measure of clustering tendency, i.e., how likely instances with
similar properties are to be located close together in the embedding space, forming clusters. The Hopkins Statistic compares, for each sampled
point, the nearest-neighbor distance in the original dataset against that in a random dataset generated with the same size and dimension as the
original. A value closer to 1 indicates a strong clustering tendency, while a value closer to 0.5 indicates a random distribution with no
clustering tendency. Fig. 10 plots the Hopkins Statistic for abusive and hateful tweets (left) and for each of the identified six topics of
problematic speech (right) (see Topics and Targets of Hate Speech) using the pre- and post-MTL embeddings. Post-MTL increases
the clustering tendency for hateful tweets and slightly decreases it for abusive tweets. We posit this to be because all datasets used
in training relate to hate speech, while only three (out of nine) datasets explicitly differentiate abusive (but not hateful) speech. This
causes abusive instances to become less distinct. Zooming in on the topics of hate speech, we find that five (of the six) topics display
an increased clustering tendency. The clustering tendency increases substantially for two topics: Islam and Women. Islamophobia
and misogyny are prominent facets of hate speech, especially among the far-right (Sian, 2018). The increased clustering tendency
shows that the improvement in hate speech detection may be due to BERT better distinguishing Islamophobic and misogynistic
posts in its post-MTL embedding space.
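For completeness, the sketch below computes the Hopkins Statistic in its standard form on an embedding matrix; the sample size and other implementation details may differ from those used in the paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_statistic(X, sample_size=500, seed=0):
    """Hopkins Statistic on an embedding matrix X (n_points, dim):
    ~0.5 for uniformly random data, approaching 1 for strongly clustered data."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = min(sample_size, n)
    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # w: distance from sampled real points to their nearest *other* real point.
    idx = rng.choice(n, size=m, replace=False)
    w = nn.kneighbors(X[idx], n_neighbors=2)[0][:, 1]

    # u: distance from uniform random points (in the bounding box) to the nearest real point.
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(U, n_neighbors=1)[0][:, 0]

    return u.sum() / (u.sum() + w.sum())
```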
We also visually inspect the distinctness of hateful and abusive posts. We use UMAP (McInnes et al., 2018) – a widely used
dimensionality reduction technique – to project the embedding space into two dimensions. Fig. 11 shows the two-dimensional
density plots of the three populations of posts in PubFigs: neutral (green), abusive (blue) and hateful (red). The left panel shows
the pre-MTL embedding, and the right panel the post-MTL. Pre-MTL achieves minimal separation between the two problematic
classes, with the two distributions overlapping each other and the neutral class. These are signs of low distinctiveness. Post-MTL
significantly increases the separation between hateful and abusive instances, now clustered into several dense areas. This may be
due to specific patterns within the textual content, such as common terms or similar topics. Interestingly, the neutral distribution
remains unchanged and maintains the same shape as pre-MTL.
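A minimal sketch of this visualization step is given below, using `umap-learn` and a scatter plot as a simple stand-in for the density plots of Fig. 11; the class names and colours are assumptions for illustration.

```python
import numpy as np
import umap
import matplotlib.pyplot as plt

def plot_embedding(embeddings, labels, title):
    """Project BERT embeddings (n_tweets, hidden_dim) to 2-D with UMAP and
    colour the points by their assigned class."""
    coords = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
    labels = np.asarray(labels)
    fig, ax = plt.subplots()
    for cls, colour in [("neutral", "green"), ("abuse", "blue"), ("hate", "red")]:
        pts = coords[labels == cls]
        ax.scatter(pts[:, 0], pts[:, 1], s=2, alpha=0.3, c=colour, label=cls)
    ax.set_title(title)
    ax.legend(markerscale=4)
    return fig
```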
This work addresses the problem of poor generalizability in hate speech detection models when evaluated on new datasets. We
propose a Multi-task Learning (MTL) pipeline that leverages multiple datasets to construct a more encompassing representation of
hate to improve generalization. Our results show that our method outperforms existing works when evaluating new unseen datasets
and is comparable with state-of-the-art ones for improving performance on a known dataset.
Furthermore, we contribute a machine-labeled dataset of the online Twitter postings of American Public Political figures and
a human-labeled subset. We apply the MTL classifier to machine-label the more than 300 thousand tweets in our dataset. We
investigate the patterns of usage of abusive and hateful speech by public political figures and the effects of MTL training on the shared
embedding space representation. We find that right-leaning figures produce significantly more abusive and hateful content than left-
leaning figures, with the majority of problematic content centered around six topics. We also find that MTL training increases the
distinctiveness of hateful and abusive speech from neutral speech.
Limitations and Future Work. Several directions for future work can be explored based on the findings of this study. Firstly,
classification on new domains currently involves binarizing dataset class labels using a label mapping, which
limits the specificity of the results. A natural extension is to increase the specificity beyond the ‘‘problematic vs. harmless’’
binary problem. As our results show that MTL training increases the distinctiveness of hate in the shared embedding space, improving
the specificity regarding the targets of hate speech is a promising avenue for future work.
Secondly, future work can focus on incorporating additional context into the MTL framework. In this work, we process only
textual data from a single posting and discard the accompanying context, such as the conversation chain and attached media (images
or videos). Incorporating this additional context into the learning framework could improve the model’s predictive power.
Thirdly, pre-trained transformer models have seen significant advancement in recent years. This work uses a shared BERT
model as its pooling point. Future work can explore using different and more recent models such as DeBERTa, XLM-RoBERTa,
and GPT-based models over BERT.
Fourthly, this work identified several topics and targets of hate in the online discourse of public figures. In particular, we
identified Muslims, women, and immigrants and refugees as targets of hate. Future research aims to analyze the impact of the
detected hateful tweets on public discourse and how figures interact with such hateful speech, whether by following existing
discourse or starting new discussions.
Finally, while the unseen classification results outperform the baselines and existing works, the raw F1 value remains relatively
low. The MTL pipeline trains on several datasets simultaneously, forcing the shared model to learn representations useful to classify
all datasets used, reducing the bias introduced from each dataset. This operates under the assumption that there is no significant
overlap in the biases of each dataset, which may or may not be true in practice. Multiple datasets with shared biases may, in fact,
further exacerbate the problem. The exact definition of hate learned by the MTL and single dataset baselines is unknown due to
the black-box nature of the model. Therefore, examining dataset characteristics and exploring explainability methods could provide
insight into the learned definitions and the characteristics that make text hateful. Explicitly studying the types of biases and their
overlap between datasets is a direction for future work.
Before performing any human labeling, we underwent an ethics review process by our university’s Institutional Review Board
(IRB) Human Ethics Committee. The committee assessed the project against ethical standards and guidelines and consulted relevant experts regarding potential
ethical concerns and risks. We subsequently conducted a comprehensive review of our project and its procedures, considering worker
consent, privacy, and data security issues. Based on the IRB review, our project complies with ethical research standards (approval
number: ETH22-7031).
Lanqin Yuan: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software,
Validation, Visualization, Writing – original draft, Writing – review & editing, Resources. Marian-Andrei Rizoiu: Conceptualization,
Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation,
Writing – review & editing.
The authors declare that they have no known competing financial interests or personal relationships that could have appeared
to influence the work reported in this paper.
Data availability
Acknowledgments
This research was supported by an Australian Government Research Training Program (RTP) Scholarship, by the Commonwealth
of Australia (represented by the Defence Science and Technology Group) through a Defence Science Partnerships Agreement, and
by the Australian Department of Home Affairs. This work has been partially supported by the National Science Centre, Poland
(Project No. 2021/41/B/HS6/02798). This research was undertaken with the assistance of resources and services from the National
Computational Infrastructure (NCI), which is supported by the Australian Government.
References
Agrawal, Sweta, Awekar, Amit, 2018. Deep Learning for Detecting Cyberbullying Across Multiple Social Media Platforms. ISBN: 978-3-319-76940-0, pp. 141–153.
http://dx.doi.org/10.1007/978-3-319-76941-7_11.
Arango, Aymé, Pérez, Jorge, Poblete, Barbara, 2019. Hate speech detection is not as easy as you may think: A closer look at model validation. In: Proceedings of the
42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. (Paris, France) New York, NY, USA, ISBN: 9781450361729,
pp. 45–54. http://dx.doi.org/10.1145/3331184.3331262.
Badjatiya, Pinkesh, Gupta, Shashank, Gupta, Manish, Varma, Vasudeva, 2017. Deep learning for hate speech detection in tweets. http://dx.doi.org/10.1145/
3041021.3054223.
Basile, Valerio, Bosco, Cristina, Fersini, Elisabetta, Nozza, Debora, Patti, Viviana, Pardo, Francisco Manuel Rangel, Rosso, Paolo, Sanguinetti, Manuela, 2019.
SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In: Proceedings of the 13th International Workshop
on Semantic Evaluation. Association for Computational Linguistics, Minneapolis, Minnesota, USA, pp. 54–63. http://dx.doi.org/10.18653/v1/S19-2007.
Baxter, Jonathan, 1997. A Bayesian/information theoretic model of learning to learn via multiple task sampling, vol. 28. (ISSN: 1573-0565) pp. 7–39.
http://dx.doi.org/10.1023/A:1007327622663.
Chiril, Patricia, Pamungkas, Endang Wahyu, Benamara, Farah, Moriceau, Véronique, Patti, Viviana, 2021. Emotionally informed hate speech detection: A
multi-target perspective. Cogn. Comput. 14, 322–352, https://api.semanticscholar.org/CorpusID:235671981.
Davidson, Thomas, Warmsley, Dana, Macy, Michael W., Weber, Ingmar, 2017. Automated hate speech detection and the problem of offensive language, CoRR
abs/1703.04009. http://arxiv.org/abs/1703.04009.
de Gibert, Ona, Perez, Naiara, García-Pablos, Aitor, Cuadros, Montse, 2018. Hate speech dataset from a white supremacy forum. In: Proceedings of the 2nd
Workshop on Abusive Language Online (ALW2). Association for Computational Linguistics, Brussels, Belgium, pp. 11–20. http://dx.doi.org/10.18653/v1/W18-
5102.
Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, Toutanova, Kristina, 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
CoRR abs/1810.04805. arxiv:1810.04805.
ElSherief, Mai, Nilizadeh, Shirin, Nguyen, Dana, Vigna, Giovanni, Belding, Elizabeth M., 2018. Peer to peer hate: Hate speech instigators and their targets. CoRR
abs/1804.04649. arxiv:1804.04649.
Fortuna, Paula, Nunes, Sérgio, 2018. A survey on automatic detection of hate speech in text. ACM Comput. Surv. (ISSN: 0360-0300) 51 (4), 30. http:
//dx.doi.org/10.1145/3232676, Article 85.
Fortuna, Paula, Soler-Company, Juan, Wanner, Leo, 2021. How well do hate speech, toxicity, abusive and offensive language classification models generalize
across datasets? Inf. Process. Manage. (ISSN: 0306-4573) 58 (3), 102524. http://dx.doi.org/10.1016/j.ipm.2021.102524.
Gao, Lei, Huang, Ruihong, 2017. Detecting online hate speech using context aware models. In: Proceedings of the International Conference Recent Advances in
Natural Language Processing. RANLP 2017, INCOMA Ltd. Varna, Bulgaria, pp. 260–266. http://dx.doi.org/10.26615/978-954-452-049-6_036.
Ghosh, Soumitra, Priyankar, Amit, Ekbal, Asif, Bhattacharyya, Pushpak, 2023. A transformer-based multi-task framework for joint detection of aggression and
hate on social media data. Nat. Lang. Eng. 29 (6), 1495–1515. http://dx.doi.org/10.1017/S1351324923000104.
Goodfellow, Ian, Bengio, Yoshua, Courville, Aaron, 2016. Deep Learning. The MIT Press, ISBN: 0262035618.
Guimarães, Samuel, Kakizaki, Gabriel, Melo, Philipe, Silva, Márcio, Murai, Fabricio, Reis, Julio C.S., Benevenuto, Fabrício, 2023. Anatomy of hate speech datasets:
Composition analysis and cross-dataset classification. In: Proceedings of the 34th ACM Conference on Hypertext and Social Media. Rome, Italy, New York,
NY, USA, ISBN: 9798400702327, p. 11. http://dx.doi.org/10.1145/3603163.3609158, Article 33.
Jasser, Greta, McSwiney, Jordan, Pertwee, Ed, Zannettou, Savvas, 2023. ‘Welcome to #GabFam’: Far-right virtual community on Gab. New Media Soc. (ISSN:
1461-4448) 25 (7), 1728–1745. http://dx.doi.org/10.1177/14614448211024546.
Kapil, Prashant, Ekbal, Asif, 2020. A deep neural network based multi-task learning approach to hate speech detection. Knowl.-Based Syst. (ISSN: 0950-7051)
210, 106458. http://dx.doi.org/10.1016/j.knosys.2020.106458.
Kingma, Diederik P., Ba, Jimmy, 2017. Adam: A method for stochastic optimization. arxiv:1412.6980[cs.LG].
Kong, Quyu, Booth, Emily, Bailo, Francesco, Johns, Amelia, Rizoiu, Marian-Andrei, 2022. Slipping to the extreme: A mixed method to explain how extreme
opinions infiltrate online discussions. In: Proceedings of the International AAAI Conference on Web and Social Media, Vol. 16, No. 1. (ISSN: 2334-0770) pp.
524–535. http://dx.doi.org/10.1609/icwsm.v16i1.19312.
Kong, Quyu, Ram, Rohit, Rizoiu, Marian-Andrei, 2021. Evently: Modeling and Analyzing Reshare Cascades with Hawkes Processes. In: Proceedings of the 14th
ACM International Conference on Web Search and Data Mining. ISBN: 9781450382977, pp. 1097–1100. http://dx.doi.org/10.1145/3437963.3441708.
Kong, Quyu, Rizoiu, Marian-Andrei, Xie, Lexing, 2020. Describing and Predicting Online Items with Reshare Cascades via Dual Mixture Self-exciting
Processes. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management. ISBN: 9781450368599, pp. 645–654.
http://dx.doi.org/10.1145/3340531.3411861.
Krippendorff, Klaus, 2011. Computing Krippendorff’s alpha-reliability.
Lundberg, Scott, Lee, Su-In, 2017. A unified approach to interpreting model predictions. http://dx.doi.org/10.48550/ARXIV.1705.07874.
MacAvaney, Sean, Yao, Hao-Ren, Yang, Eugene, Russell, Katina, Goharian, Nazli, Frieder, Ophir, 2019. Hate speech detection: Challenges and solutions. PLOS
ONE 14 (8), 1–16. http://dx.doi.org/10.1371/journal.pone.0221152.
Madukwe, Kosisochukwu, Gao, Xiaoying, Xue, Bing, 2020a. In data we trust: A critical analysis of hate speech detection datasets. In: Proceedings of the Fourth
Workshop on Online Abuse and Harms. Association for Computational Linguistics, pp. 150–161. http://dx.doi.org/10.18653/v1/2020.alw-1.18, Online.
Madukwe, Kosisochukwu Judith, Gao, Xiaoying, Xue, Bing, 2020b. A GA-based approach to fine-tuning BERT for hate speech detection. In: 2020 IEEE Symposium
Series on Computational Intelligence. SSCI, pp. 2821–2828. http://dx.doi.org/10.1109/SSCI47803.2020.9308419.
Mandl, Thomas, Modha, Sandip, Majumder, Prasenjit, Patel, Daksh, Dave, Mohana, Mandlia, Chintak, Patel, Aditya, 2019. Overview of the HASOC track at FIRE
2019: Hate speech and offensive content identification in Indo-European languages. In: Proceedings of the 11th Forum for Information Retrieval Evaluation
(Kolkata, India). New York, NY, USA, ISBN: 9781450377508, pp. 14–17. http://dx.doi.org/10.1145/3368567.3368584.
McInnes, Leland, Healy, John, Melville, James, 2018. UMAP: Uniform manifold approximation and projection for dimension reduction. http://dx.doi.org/10.48550/
ARXIV.1802.03426.
Mozafari, Marzieh, Farahbakhsh, Reza, Crespi, Noel, 2019. A BERT-based transfer learning approach for hate speech detection in online social media.
arxiv:1910.12574.
Pan, Sinno Jialin, Yang, Qiang, 2010. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22 (10), 1345–1359. http://dx.doi.org/10.1109/TKDE.2009.191.
Plaza-Del-Arco, Flor Miriam, Dolores Molina-González, M., Alfonso Ureña-López, L., Martín-Valdivia, María Teresa, 2021. A multi-task learning approach to hate
speech detection leveraging sentiment analysis. IEEE Access 9 (2021), 112478–112489. http://dx.doi.org/10.1109/ACCESS.2021.3103697.
Qian, Jing, Bethke, Anna, Liu, Yinyin, Belding, Elizabeth M., Wang, William Yang, 2019. A benchmark dataset for learning to intervene in online hate speech.
CoRR abs/1909.04251. arxiv:1909.04251.
Rizoiu, Marian-Andrei, Xie, Lexing, Caetano, Tiberio, Cebrian, Manuel, 2016. Evolution of privacy loss in wikipedia. In: International Conference on Web Search
and Data Mining. WSDM’16, ACM, ACM Press, New York, New York, USA, ISBN: 9781450337168, pp. 215–224. http://dx.doi.org/10.1145/2835776.2835798,
arxiv:1512.03523.
Roy, Pradeep Kumar, Bhawal, Snehaan, Subalalitha, Chinnaudayar Navaneethakrishnan, 2022. Hate speech and offensive language detection in Dravidian languages
using deep ensemble framework. Comput. Speech Lang. 75 (2022), 101386.
Ruzicka, M., 1958. Anwendung mathematisch-statisticher methoden in der geobotanik (synthetische bearbeitung von aufnahmen). Biol. Bratisl 13 (1958), 647–661.
Schmidt, Anna, Wiegand, Michael, 2017. A survey on hate speech detection using natural language processing. In: Proceedings of the Fifth International Workshop
on Natural Language Processing for Social Media. Association for Computational Linguistics, Valencia, Spain, pp. 1–10. http://dx.doi.org/10.18653/v1/W17-
1101.
Schneider, Philipp J., Rizoiu, Marian-Andrei, 2023. The effectiveness of moderating harmful online content. Proc. Natl. Acad. Sci. (ISSN: 0027-8424) 120 (34),
1–3. http://dx.doi.org/10.1073/pnas.2307360120.
Sian, Katy, 2018. Stupid Paki Loving Bitch: The Politics of Online Islamophobia and Misogyny. Springer International Publishing, Cham, ISBN: 978-3-319-71776-0,
pp. 117–138. http://dx.doi.org/10.1007/978-3-319-71776-0_7.
Swamy, Steve Durairaj, Jamatia, Anupam, Gambäck, Björn, 2019. Studying generalisability across abusive language detection datasets. In: Proceedings of
the 23rd Conference on Computational Natural Language Learning. CoNLL, Association for Computational Linguistics, Hong Kong, China, pp. 940–950.
http://dx.doi.org/10.18653/v1/K19-1088.
Twitter Inc., 2023. Twitter’s policy on hateful conduct | twitter help. https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy.
United Nations, 2019. United nations strategy and plan of action on hate speech SYNOPSIS. https://www.un.org/en/genocideprevention/hate-speech-strategy.
shtml.
Waseem, Zeerak, 2016. Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In: Proceedings of the First Workshop
on NLP and Computational Social Science. Association for Computational Linguistics, Austin, Texas, pp. 138–142. http://dx.doi.org/10.18653/v1/W16-5618.
Waseem, Zeerak, Hovy, Dirk, 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In: Proceedings of the NAACL
Student Research Workshop. Association for Computational Linguistics, San Diego, California, pp. 88–93. http://dx.doi.org/10.18653/v1/N16-2013.
Waseem, Zeerak, Thorne, James, Bingel, Joachim, 2018. Bridging the gaps: Multi task learning for domain transfer of hate speech detection. Online Harassment
2018, 29–55. http://dx.doi.org/10.1007/978-3-319-78583-7_3.
Welch, B.L., 1947. The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika (ISSN: 0006-3444) 34 (1–2), 28–35.
http://dx.doi.org/10.1093/biomet/34.1-2.28, https://academic.oup.com/biomet/article-pdf/34/1-2/28/553093/34-1-2-28.pdf.
Wu, Siqi, Rizoiu, Marian-Andrei, Xie, Lexing, 2020. Variation across scales: Measurement fidelity under Twitter data sampling. In: International AAAI Conference
on Web and Social Media. ICWSM’20, pp. 1–10, arxiv:2003.09557.
Yin, Wenjie, Zubiaga, Arkaitz, 2021. Towards generalisable hate speech detection: a review on obstacles and solutions. arxiv:2102.08886[cs.CL].
Yuan, Lanqin, Wang, Tianyu, Ferraro, Gabriela, Suominen, Hanna, Rizoiu, Marian-Andrei, 2023. Transfer learning for hate speech detection in social media. J.
Comput. Soc. Sci. (ISSN: 2432-2717) 6 (2), 1081–1101. http://dx.doi.org/10.1007/s42001-023-00224-9.
Zhang, Ziqi, Luo, Lei, 2018. Hate speech detection: A solved problem? The challenging case of long tail on Twitter. http://dx.doi.org/10.48550/ARXIV.1803.03662.