Hate Speech Chapter Final Preprint
Thomas R. Davidson
Department of Sociology
Rutgers University–New Brunswick
[email protected]
This is a draft of a chapter that has been accepted by Oxford University Press in the forthcoming book The Oxford Handbook of the Sociology of Machine Learning, edited by Juan Pablo Pardo-Guerra and Christian Borch, due for publication in 2023. This reference will be updated upon publication.
Online hate speech has recently drawn widespread attention from journalists, policymakers, and
academics. While it only constitutes a fraction of a percent of the content shared online—a
report by the Alan Turing Institute estimated that 0.001% of the content shared on major
platforms is abusive or hateful (Vidgen, Margetts, and Harris 2019)—this still equates to an
enormous amount of hateful material given the scale of online information flows. Facebook
reports detecting tens of millions of violations of its hate speech policies every quarter.1 In 2021,
41% of Americans reported experiences of online harassment due to their gender, race and
ethnicity, religion, sexual orientation, or political views (Pew Research Center 2021). Women,
racial minorities, and LGBTQ+ individuals are particularly likely to be victims of hate speech
and other types of online harassment. The Anti-Defamation League (2021) has reported
increased online harassment targeting LGBTQ+ individuals, Jewish Americans, and Asian Americans. With millions of
people exposed, directly or indirectly, on a regular basis, online hate speech fosters a hostile
online environment for marginalized groups and can result in significant offline harms. While
these downstream effects are difficult to measure, there is evidence that online hate speech is
associated with increases in hate crimes (Müller and Schwarz 2021) and can elicit a violent
backlash from members of targeted groups (Mitts 2019). The case of Myanmar paints a stark
picture of what can happen when hate speech goes unchecked, as ultra-nationalists and members
of the military used Facebook to attack the Rohingya Muslim minority, contributing to genocidal
violence. At the time, Facebook was unprepared to address the problem because the company
only had two staff who spoke Burmese to review posts, even though one-third of Myanmar's population used the platform.
To help tackle online hate speech, researchers and practitioners have turned to automated
methods to detect and measure it. There are several reasons why automation is desirable.
1. https://transparency.fb.com/policies/community-standards/hate-speech/#data (Accessed 10/7/2022).
Automated approaches enable researchers to perform tasks like measuring the prevalence of hate
speech over time or across different platforms, which would be unfeasible using qualitative
methods. Social media platforms employ tens of thousands of people to moderate hate speech
and other kinds of content, but even with such large workforces, automation is required to handle
the massive volume of content (Gillespie 2018). Speed has also become critical to these
endeavors, in part, due to legal obligations imposed on platforms (Kaye 2019). For example,
Germany’s Network Enforcement Act (NetzDG for short), which took effect in 2018, requires
that platforms remove hate speech within twenty-four hours or face substantial fines. In addition
to scalability and speed, automation can alleviate the burden placed on content moderators, who
can experience post-traumatic stress disorder and other negative psychological consequences
from continual exposure to content such as animal abuse, child pornography, and hate speech
(Roberts 2019).
Methods developed to detect online hate speech at scale have proven valuable tools in
several areas of social scientific research. Some studies have addressed the impact of online hate
speech. Tamar Mitts (2019) showed how localities in several European countries with high anti-
Muslim hate speech on Twitter experienced increased support for the terrorist group Islamic
State. Others examine hate speech as an outcome, such as a study showing how Egyptian player
Mohamed Salah’s successful tenure at Liverpool Football Club reduced fans’ use of anti-Muslim
language online (Alrababa’h et al. 2021). In another study analyzing 750 million tweets written
during the 2016 US election campaign, researchers found no evidence that Trump’s divisive
rhetoric increased online hate speech (Siegel et al. 2021). Terrorist attacks have been shown to
lead to spikes in the use of hate speech on platforms, including Twitter and Reddit (Olteanu et al.
2018). Changes to social media platform policies have been used as natural experiments to assess
their impact on online hate speech (Chandrasekharan et al. 2017a). Researchers have also
considered approaches to reduce online hate speech, including experimental work targeting the users who produce it (Munger 2016).
Common sense understandings of hate speech might imply that detecting it automatically
is relatively straightforward. Like Supreme Court Justice Potter Stewart’s famous remark on
obscenity, you know it when you see it.3 The basic idea is that once some examples of hate
speech have been identified, machine learning can augment human judgments by learning a set
of rules to detect hate speech in new data. However, hate speech is often highly subjective,
ambiguous, and context-dependent, making it difficult for both humans and computers to detect.
Mark Zuckerberg has remarked that it is "much easier to build an AI system to detect a nipple" than to determine what constitutes hate speech.4 To understand why, consider some differences between nudity and hate speech detection. Images and video data containing nudity can now be detected with relatively high accuracy. While there is still debate regarding whose body parts are acceptable in
different contexts,5 if computers are shown enough images of naked bodies, features of these
images, like nipples or bare skin, make it relatively straightforward to predict whether an image
contains nudity. Hate speech, on the other hand, is not so easy for computers to identify. Some
features, such as slurs, can function equivalently to nipples, allowing us to identify some
instances of hate speech easily, but many cases require more nuanced contextual understanding
and interpretation. For example, slurs can be used by people repeating or contesting hate speech of which they have been the target, or in alternative contexts such as reclaimed slurs used by members of the targeted groups.
2. See Siegel (2020) for a more thorough review of the social scientific literature on online hate speech.
3. Jacobellis v. Ohio, 1964.
4. https://www.engadget.com/2018-04-25-zuckerberg-easier-for-ai-detect-nipples-than-hate-speech.html (retrieved 10/6/2022).
5. See Gillespie 2018, Chapters 1 and 6.
Due to these challenges, the case of hate speech detection provides valuable insight into
the promises and pitfalls of supervised text classification for sociological analyses. This chapter
describes the process of developing annotated datasets and training machine learning models to
detect hate speech, including some of the cutting-edge techniques used to improve these
methods. A key theme running through the chapter is how ostensibly minor methodological decisions can have significant downstream societal impacts. In particular, I discuss racial bias in hate speech detection systems,
examining why models can discriminate against the same groups they are designed to protect and
the approaches that have been proposed to identify and mitigate such biases. Hate speech
detection is not only an instructive case for sociologists interested in using supervised text
classification, but the use of these approaches and related content moderation tools to govern
online speech at a global scale makes it a particularly urgent topic of sociological inquiry.
Before going any further, it is important to define hate speech. As alluded to in the previous
section, most readers will have some common sense understanding of hate speech and will
recognize certain speech acts as hateful. There are some examples, such as the overt use of
antisemitic, racist, or homophobic language to attack others, that most reasonable observers can
agree upon without requiring any formal definition. Legal definitions range from strict,
enforceable prohibitions on certain speech in countries like Germany to wide-ranging free speech
protections like those articulated in the First Amendment. In the United States, the issue has been
a source of tremendous controversy and debate since the 1980s, particularly as institutions like
universities have developed and attempted to enforce their own regulations (Wilson and Land
2021). Most recently, social media platforms have developed codes of conduct and policies that
delimit hateful speech and content. For example, Twitter’s “Hateful conduct policy” prohibits the
promotion of violence, attacks, and threats based on race, gender, sexual orientation, and a range
of other categories. They provide a rationale for the policy, emphasizing how they are "committed to combating abuse motivated by hatred, prejudice or intolerance, particularly abuse that seeks to silence the voices of those who have been historically marginalized."6 These
policies have evolved as platforms have responded to emerging types of hateful conduct, shifting
from general guidance to specific proscriptions related to the type of attack, the target, and so on
(Kaye 2019),7 and transitioning from more permissive rules, inspired by the First Amendment,
towards the European regulatory model (Wilson and Land 2021). While I do not provide a
singular definition of hate speech here, I outline four core dimensions that make it difficult to define and detect.
The first major challenge is that hate speech can be subjective. Different people have
different understandings of hate speech. Even if we agree upon a specific definition, there is still
room for differences in interpretation. This subjectivity can enter into machine learning systems,
such that their underlying models of hate speech may reflect certain perspectives at the expense
of others. Moreover, hate speech can also be ambiguous. The complexity of human language and
interaction makes it difficult to agree upon whether a particular statement meets a given
definition. Ambiguity is one of the reasons why commercial content moderation policies have
ballooned in size as companies have increasingly added layers of specificity to their protocols (Kaye 2019).
6. https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy (Accessed 10/6/2022).
7. Facebook's Hate Speech policy has changed 23 times since May 25, 2018, according to its new transparency page: https://transparency.fb.com/policies/community-standards/hate-speech/ (Accessed 10/6/2022).
Ceteris paribus, more ambiguous cases will be more difficult for humans to agree
upon and thus more difficult for computers to detect. A third challenge is presented by the fact
that hate speech is also context-dependent. What is considered hateful in one context may be
innocuous in another. Specific cultural knowledge is often required to reason about the
implications of a statement, the intent of the speaker, and other salient factors. From a legal
perspective, context is critical as it allows us to better understand intent and potential harms
(Wilson and Land 2021). Yet, human content moderators and machine learning systems often
evaluate texts abstracted from their context. Finally, hate speech is politicized. In the U.S., there
has been a backlash against efforts to moderate online hate speech, with accusations that content
moderation efforts are a symptom of political correctness and bias against conservatives.
Opinion polls find that conservatives think online hate speech is taken too seriously, while
liberals believe it is not taken seriously enough (Pew Research Center 2020). In what follows, it
is essential to consider these issues—that hate speech can be subjective, ambiguous, context-dependent, and politicized—when examining how it is detected and measured by machine learning algorithms.
Hate speech detection is a type of supervised text classification task. The goal is to take a text as
an input and assign it to a discrete class, i.e., whether or not it should be considered hate speech.
This section begins by considering simple rule-based methods using lexicons and highlighting
their limitations. I then walk through the process of training machine learning classifiers to detect
hate speech, moving from sampling and annotation to feature generation and algorithm selection.
Throughout, I emphasize key methodological decisions and consider how these choices can have
significant consequences for the way machine learning algorithms behave. The goal is to explore
the main contours of the topic rather than systematically review this vast and growing literature
(see Fortuna and Nunes (2018) and Vidgen and Derczynski (2020)).
At first glance, racist, sexist, homophobic, and other hateful language is often given away by
slurs and related words or phrases. If we assume hate speech is indicated by certain words, we
can detect it by enumerating all of these hateful words in a lexicon and checking whether a text
uses any of them. Resources like the crowd-sourced Hatebase lexicon, which features 3,894 such
terms across 98 different languages, promise such a solution.8 A simple decision rule can be used
to classify documents: a statement is considered hate speech if it contains one or more words
from the lexicon. Lexicon-based methods have been used to detect hate speech in online fora
with some success (Wiegand et al. 2018). While these approaches can indeed locate many
instances of hate speech, they will generate excessive false positives if the words are also used in non-hateful contexts.
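To make the decision rule concrete, the following minimal sketch implements lexicon-based classification in Python; the lexicon entries and the example posts are hypothetical stand-ins rather than actual Hatebase terms.
```python
# Minimal lexicon-based classification: flag a post as hate speech if it
# contains any term from the lexicon. Terms and posts are hypothetical stand-ins.
import re

LEXICON = {"vermin", "parasites", "subhuman"}  # placeholders for real lexicon entries

def lexicon_classify(text, lexicon):
    tokens = set(re.findall(r"[a-z']+", text.lower()))  # crude word tokenizer
    return "hate" if tokens & lexicon else "not_hate"

posts = [
    "those people are subhuman vermin",              # flagged, and genuinely hateful
    "the shed is full of vermin again this spring",  # flagged, but benign (false positive)
    "people like you don't belong in this country",  # missed: hateful but no lexicon match
]
for post in posts:
    print(lexicon_classify(post, LEXICON), "|", post)
```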
To better understand the limitations of lexicons, it’s helpful to consider two metrics
commonly used in machine learning: precision and recall. Precision quantifies the accuracy of
the hate speech predictions: a low precision score indicates that many statements are mistakenly
flagged as hateful. Recall is defined as the proportion of hate speech correctly flagged as hate
speech, where a low recall score would imply that many hateful statements were missed.9 Lexicons tend to have high recall but low precision, particularly when many words or phrases are included.
8. https://hatebase.org/ (Accessed 10/7/2022).
9. Formally, precision is defined as (# true positives) / (# true positives + # false positives) and recall as (# true positives) / (# true positives + # false negatives).
A fishing metaphor is useful here: a lexicon is like a trawler dragging a large net across
the seabed. The aim might be to catch tuna, but the trawler dredges up everything else in its path,
including porpoises, sea turtles, and sharks. This by-catch makes trawling extremely inefficient
and damaging to entire ecosystems. Lexicons can catch a lot of hate speech but are also prone to
sweeping up many other documents. For example, only around 5% of the statements in a sample
of almost thirty-thousand tweets containing words from the Hatebase lexicon were considered
hate speech by human raters (Davidson et al. 2017). This low precision occurs because many
words that indicate hate speech are also used in other contexts. The most obvious examples are
curse words, including slurs, which are frequently used online in a variety of contexts (Kwok and
Wang 2013). At the same time, other hate speech can slip through the nets if relevant words have
not been included in the lexicon. Language is constantly evolving, and internet users often find creative ways to express hatred while evading keyword-based detection.
For example, users of the forum 4Chan responded to Google’s efforts to reduce online abuse by
using the company’s name and those of other major companies like Yahoo and Skype as
euphemisms for hateful slurs (Magu, Joshi, and Luo 2017). Given these limitations, lexicons
should only be used if the goal is to precisely detect particular types of hate speech where there is
high confidence that all relevant words have been included and that each term in the lexicon does not regularly appear in other, benign contexts.10
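The trade-off between these metrics can be computed directly from a set of predictions and human labels, as in the short sketch below; the label vectors are invented for illustration and follow the definitions given in note 9.
```python
# Compute precision and recall by comparing predictions with human annotations.
# Labels are hypothetical; 1 = hate speech, 0 = not hate speech.
human_labels = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
predictions  = [1, 1, 0, 1, 1, 1, 0, 0, 0, 1]  # e.g., the output of a lexicon rule

tp = sum(1 for y, p in zip(human_labels, predictions) if y == 1 and p == 1)
fp = sum(1 for y, p in zip(human_labels, predictions) if y == 0 and p == 1)
fn = sum(1 for y, p in zip(human_labels, predictions) if y == 1 and p == 0)

precision = tp / (tp + fp)  # share of flagged statements that are truly hateful
recall = tp / (tp + fn)     # share of hateful statements that were flagged
print(f"precision={precision:.2f}, recall={recall:.2f}")  # low precision, moderate recall
```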
Machine learning can help to address the shortcomings of lexicons. Rather than starting with
a set of keywords, these methods work by taking a set of texts annotated for hate speech and
training a computer to distinguish between hate speech and other content. The goal is to replicate
the human annotations as accurately as possible by using features of the text to identify hate speech.
10. Small keyword lexicons are ideal since large lexicons can suffer serious problems when used for document selection, as experimental research demonstrates (King, Lam, and Roberts 2017).
This approach tends to work better than lexicon-based classification because human
raters can make holistic judgments based on their interpretations of the texts. Machine learning
models can automatically select from a large array of features to best replicate the human
judgments. Rather than acting like trawlers, these approaches should be more like the technique
of line-and-pole fishing, enabling us to catch the tuna but leave the other creatures in the sea.
Ideally, we want to catch as much hate speech as possible (maximizing recall) while minimizing
the amount of by-catch (maximizing precision). These methods provide a powerful, flexible
approach for detecting hate speech and can easily be generalized to other kinds of content.
Nonetheless, machine learning methods are not immune to the issues faced by lexicons and can inherit many of their shortcomings, as the following sections show.
To begin developing a hate speech detection model, one must identify a set of documents to use
as training data. Much of the current work uses data from Twitter due to the prominence of the
platform and the relative ease of data collection using its application programming interface
(API), and the majority of studies also focus on English content, although there is a growing
interest in other languages (Siegel 2020; Vidgen and Derczynski 2020). Since hate speech is
relatively rare, randomly sampling posts from social media would require vast samples to
identify a sufficient number of examples, so keyword lexicons have been widely used to sample
documents more likely to be hateful. Studies have used existing resources like the Hatebase
lexicon (Davidson et al. 2017) or developed custom sets of keywords (Golbeck et al. 2017).
Unfortunately, this means that documents without keywords are missed, and those containing
commonly used keywords appear with high frequency. To address these issues, scholars have
developed hybrid approaches, augmenting random samples with keyword-based samples to more
accurately reflect the distribution of hate speech online (Founta et al. 2018). A promising avenue
of research is the use of synthetic training texts, including both human (Vidgen, Thrush, et al.
2021) and machine-generated texts as training data (Hartvigsen et al. 2022). Synthetic data
provide greater control over content, enabling researchers to improve predictive performance by generating examples of rare or difficult forms of hate speech.
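To illustrate the hybrid strategy, the sketch below combines a purely random sample with a keyword-boosted sample drawn from the same pool of posts; the keyword list, the corpus, and the sample sizes are hypothetical.
```python
# Hybrid sampling: combine a small random sample with a keyword-matched sample
# so that rare hateful content is not entirely missed. All data are hypothetical.
import random

random.seed(42)
keywords = {"vermin", "go back"}  # hypothetical boost terms

def contains_keyword(post, keywords):
    text = post.lower()
    return any(k in text for k in keywords)

corpus = [f"post number {i}" for i in range(10_000)] + [
    "they are vermin", "go back to where you came from"]

random_sample = random.sample(corpus, k=500)
keyword_pool = [p for p in corpus if contains_keyword(p, keywords)]
keyword_sample = random.sample(keyword_pool, k=min(500, len(keyword_pool)))

training_pool = random_sample + keyword_sample  # documents to send for annotation
print(len(training_pool), "documents selected for annotation")
```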
Once data have been collected, they need to be annotated. Early methods used binary coding schemas, such as hateful/not hateful (Warner and Hirschberg 2012) or racist/not racist (Kwok and Wang 2013).
extended to multiple types of hateful speech, like racist/sexist/not racist or sexist (Waseem and
Hovy 2016). This leaves little room for nuance, so offensive language, such as cursing and
reclaimed slurs, is often misclassified as hate speech. Models trained on these data can function
like lexicons, labeling anything containing these terms as hate speech. To address this issue,
Davidson and colleagues (2017) developed a ternary schema, distinguishing between hate
speech, offensive language, and other content. Annotators were instructed not to make decisions
based on the presence of certain words but to holistically evaluate the entire statement.
Subsequent work has also accounted for other content like spam (Founta et al. 2018) and used
more detailed hierarchical coding schemes, asking annotators whether specific individuals or
groups are targeted (Zampieri et al. 2019) or whether there is a clear intent to offend (Sap et al.
2020).
Once a schema has been developed, annotators must read each example and determine the appropriate label. The challenge is to annotate a
sufficient quantity of examples to train a model that generalizes well to new data. Researchers
often employ crowdworkers from Mechanical Turk, CrowdFlower, and similar platforms due to
the scale of the task. Crowdworkers are provided with instructions and paid a small fee for each
annotation, with checks in place to monitor performance.11 Most studies use multiple decisions
for each example to improve reliability, but recent work on other classification tasks suggests
that it is more efficient to produce a larger dataset with a single annotator per example (Barberá
et al. 2020). Advances in machine learning like active learning (Kirk, Vidgen, and Hale 2022)
and few-shot learning (Chiu, Collins, and Alexander 2022) will help to reduce the cost of
annotation, allowing models to achieve strong predictive performance with relatively small
amounts of training data. Others have side-stepped the need for annotations altogether. For
example, researchers trained a model to identify hateful content by distinguishing between posts
from Reddit communities known to be hateful and other communities on the site
(Chandrasekharan et al. 2017b). In many cases, those interested in applying these methods can
leverage existing resources rather than developing training data from scratch. The website hatespeechdata.com, for example, catalogs dozens of publicly available annotated datasets (Vidgen and Derczynski 2020).
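When multiple crowdworkers label each example, as noted above, their judgments must be aggregated before model training. The sketch below uses a simple majority vote over a hypothetical set of annotations coded with the ternary hate/offensive/neither schema; documents without a majority are flagged for further review.
```python
# Aggregate multiple crowdworker judgments per document by majority vote.
# Annotations are hypothetical and use a ternary schema:
# "hate", "offensive", or "neither".
from collections import Counter

annotations = {
    "doc_1": ["hate", "hate", "offensive"],
    "doc_2": ["offensive", "offensive", "neither"],
    "doc_3": ["neither", "offensive", "hate"],  # no majority: flag for review
}

def majority_label(labels):
    (label, count), = Counter(labels).most_common(1)
    return label if count > len(labels) / 2 else "needs_review"

gold_labels = {doc: majority_label(labels) for doc, labels in annotations.items()}
print(gold_labels)
```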
Once documents have been annotated, the texts must be converted into representations that can
be input into machine learning classifiers. These inputs, known as features, are used to predict
the class each document belongs to. The most common way to represent text for supervised learning is known as the bag-of-words approach: a matrix is constructed where each unique word in the corpus is represented as a column and each document as a row.
11. For an overview and evaluation of crowd-sourced text analysis, see Benoit et al. (2016). The topic of crowd-work has also been subject to extensive discussion regarding ethics and potential exploitation (e.g., Fieseler, Bucher, and Hoffmann 2019).
Each element of this matrix denotes how often a specific term is used in a given
document, often weighted to account for the frequency with which each term occurs in the corpus as a whole. Classifiers then learn to predict the human annotations as weighted functions of the document vectors. For example, a word like "kill" might get a relatively high
weight, whereas unrelated words will have weights close to zero. These representations can be
considered an extension of the lexicon approach, but rather than deciding whether certain words
indicate hate speech a priori, classifiers use patterns in the data to identify the words associated with each class.
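The sketch below constructs a document-term matrix with scikit-learn, fits a logistic regression classifier, and inspects the learned term weights; the four labeled posts are hypothetical placeholders for an annotated corpus of thousands of documents.
```python
# Bag-of-words features with scikit-learn: rows are documents, columns are terms,
# and a classifier learns which terms are associated with the "hate" class.
# The labeled examples are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "we should kill all of them",           # hate
    "they are vermin and must be removed",  # hate
    "what a great game last night",         # not hate
    "congratulations on the new job",       # not hate
]
labels = [1, 1, 0, 0]  # 1 = hate speech, 0 = other

vectorizer = TfidfVectorizer(lowercase=True)  # frequency-weighted document-term matrix
X = vectorizer.fit_transform(texts)           # sparse matrix: documents x terms
clf = LogisticRegression().fit(X, labels)

# Terms with the largest positive coefficients are most associated with hate speech.
weights = sorted(zip(vectorizer.get_feature_names_out(), clf.coef_[0]),
                 key=lambda pair: pair[1], reverse=True)
print(weights[:5])
print(clf.predict(vectorizer.transform(["they will kill us all"])))
```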
There are typically thousands of unique words in a corpus of text—the set of unique words is
often referred to as a vocabulary—but each document only contains a minuscule fraction. Bag-
of-words representations are thus considered sparse representations since most elements of each
document vector are zero. Sparsity is particularly acute when we consider that most user-
generated texts on social media are relatively short (e.g., tweets are limited to 280 characters).
This sparsity adds computational complexity, as models have to estimate many parameters and
store large matrices. Moreover, much of the information in a sparse matrix is irrelevant to the
prediction task. Very frequent words like “and,” “of,” and “the” (often known as stopwords)
provide little information, and very infrequent words and hapax legomena (words that occur only
once) are too rare to provide any useful patterns. It is thus conventional to pre-process texts,
performing tasks such as setting words to lowercase and dropping punctuation, stopwords, and
rare words to remove redundant content and reduce the dimensionality of the data. These tasks
are often treated as routine, but such decisions must be performed with care because some of this
information can be relevant to classification tasks, making downstream results sensitive to these seemingly innocuous choices (Denny and Spirling 2018).
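A minimal pre-processing routine of the kind described here might look like the sketch below; the stopword list is a tiny hypothetical stand-in, and, as noted, each of these choices can alter downstream results.
```python
# Simple text pre-processing: lowercase, strip punctuation, drop stopwords and
# very rare words. The stopword list is a tiny hypothetical stand-in.
import re
from collections import Counter

STOPWORDS = {"and", "of", "the", "a", "to", "is"}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())  # lowercase and drop punctuation

def preprocess(corpus, min_count=2):
    tokenized = [tokenize(doc) for doc in corpus]
    counts = Counter(tok for doc in tokenized for tok in doc)
    return [[tok for tok in doc
             if tok not in STOPWORDS and counts[tok] >= min_count]
            for doc in tokenized]

corpus = ["The game was great!", "The game was awful.", "What a great goal!"]
print(preprocess(corpus))
```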
Over the past decade, word embeddings have become a common type of feature
representation used in hate speech classification and other tasks. Word embeddings are dense vector representations of words, generated by training language models on large corpora of text (Mikolov et al. 2013). Each word is assigned a vector,
and its relative position in the vector space captures something about its meaning: words used in
similar contexts occupy similar positions, a property consistent with distributional semantics, a
longstanding concept in linguistic theory (Firth 1957). Models using embeddings as features
have achieved strong predictive performance at hate speech classification tasks (Djuric et al.
2015). Sociologists have demonstrated how embeddings encode rich information about
stereotypes and cultural assumptions (Kozlowski, Taddy, and Evans 2019). These
representations are particularly effective for hate speech detection because they can capture more
subtle, implicit types of hate speech (Waseem et al. 2017). For example, proximity to a hateful
slur in the embedding space signals that a word is often used in a hateful context, allowing classifiers to identify related terms even when they do not appear in any lexicon. Standard word embeddings assign a single vector to each word regardless of context. For example, the vector for the word "crane" is an average over the instances where the word is used, whether it refers to the bird or to construction machinery. More recent contextual embeddings allow the representation of a word—and hence the meaning of a word—to vary according to the context in which it is used (Smith 2020).
If the word “egg” occurs in the document, then the vector has a meaning closer to the bird,
whereas proximity to a word like "hardhat" or "girder" signals an alternative meaning. Such
representations are a promising avenue for detecting hate speech, enabling computers to better
distinguish between counter-speech, reclaimed slurs, and genuine hate speech by incorporating information from the surrounding words.
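As a rough illustration of how embeddings enter the pipeline, the sketch below represents each document as the average of pretrained word vectors and trains a classifier on those dense features. It assumes the gensim and scikit-learn packages and gensim's downloadable "glove-twitter-25" vectors; contextual models such as BERT would instead produce a different vector for each occurrence of a word.
```python
# Dense document features from pretrained word embeddings: average the vectors of
# the words in each document, then train a classifier on those features.
# Assumes gensim's downloadable GloVe Twitter vectors; labels are hypothetical.
import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression

wv = api.load("glove-twitter-25")  # 25-dimensional GloVe vectors trained on tweets

def embed(text):
    vectors = [wv[tok] for tok in text.lower().split() if tok in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(wv.vector_size)

texts = ["they are vermin and must be removed", "what a great game last night"]
labels = [1, 0]  # hypothetical annotations: 1 = hate speech, 0 = other

X = np.vstack([embed(t) for t in texts])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(np.vstack([embed("those people are vermin")])))
```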
Non-textual information and other metadata have been used as inputs for hate speech
detection classifiers. For example, emojis can be central to interpreting some hateful posts (Kirk,
Vidgen, Rottger, et al. 2022). Beyond the text, we can incorporate information about the speaker,
target, and context. We might expect a late-night reply to a person expressing an opinion on a
controversial political topic to have a higher likelihood of abusiveness than a lunchtime post
celebrating a birthday. Several studies have used social features to improve hate speech
classification models (Ribeiro et al. 2017; Vidgen, Nguyen, et al. 2021). This chapter focuses on
identifying hate speech in written text, but our online social lives are increasingly mediated
through image and video, making audiovisual and multimodal approaches particularly urgent
areas for future research (Kiela et al. 2020). Most of the principles and procedures outlined here
generally apply when these methods are extended to non-textual and multimodal data.
A wide array of algorithms has been used for hate speech detection, from logistic regression and
support vector machines to random forests and deep neural networks. While the estimation
procedures differ, these approaches all share a common objective: to learn a function y = f(X), where X is the matrix of features and y is a vector representing the hate speech annotation
associated with each document. The objective is to accurately identify hate speech in the training
data and generalize to new examples. In computer science research, it is conventional to test
several different algorithms and experiment with associated hyperparameters to identify the best-
performing model.
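The sketch below follows this convention with scikit-learn, comparing three common algorithms by cross-validated F1 score; the feature matrix X and label vector y are assumed to come from the annotation and feature-generation steps described above.
```python
# Compare several candidate algorithms using cross-validation, as is conventional
# in computer science research. X and y are assumed to come from the earlier
# feature-generation and annotation steps.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
    "random_forest": RandomForestClassifier(n_estimators=200),
}

def compare(X, y, cv=5):
    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
        print(f"{name}: mean F1 = {scores.mean():.3f}")

# e.g., compare(X_train, y_train) before tuning hyperparameters of the best model
```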
Models using transformer architectures such as BERT (Devlin et al. 2019) and GPT
(Brown et al. 2020) have recently achieved record-breaking performance on various natural
language processing tasks and have become the gold standard approach for hate speech detection
and other classification problems. These models are known as large language models (LLMs)
because they are trained on enormous corpora of text and can have billions of internal
parameters. Rather than starting from scratch by training a model on an annotated dataset, pre-
trained models are adapted to new tasks through a process known as fine-tuning, a technique
borrowed from image classification. Due to the pre-training on large amounts of text, LLMs can
perform reasonably well at tasks like hate speech detection with few training examples (Chiu et
al. 2022). These models can also generate text, opening up new possibilities for the creation of
synthetic training data (Hartvigsen et al. 2022), explanations for why particular cases have been
flagged as hateful (Sap et al. 2020), and analyses of how context shapes perceptions of offensive statements (Zhou et al. 2023).
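As a sketch of the fine-tuning workflow, the code below adapts a pretrained BERT model to a binary hate speech task using the Hugging Face transformers and datasets libraries; the two labeled examples are hypothetical placeholders, and a real application would use thousands of annotated documents plus a held-out evaluation set.
```python
# Fine-tune a pretrained transformer for binary hate speech classification using
# the Hugging Face transformers and datasets libraries. The labeled examples are
# hypothetical placeholders for an annotated training set.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

data = Dataset.from_dict({
    "text": ["they are vermin and must be removed", "what a great game last night"],
    "label": [1, 0],  # 1 = hate speech, 0 = other
})
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                     padding="max_length", max_length=64),
                batched=True)

args = TrainingArguments(output_dir="hate-speech-model", num_train_epochs=1,
                         per_device_train_batch_size=2)
trainer = Trainer(model=model, args=args, train_dataset=data)
trainer.train()
```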
The performance of hate speech detection models should always be evaluated using out-
of-sample annotated data never used during training. A key challenge is to avoid overfitting,
where models learn to predict the training examples but fail to generalize to new data (see
Molina and Garip 2019). Predictive performance is measured by using the predictions for these
out-of-sample data to calculate precision, recall, and the F1-score, the harmonic mean of the two.
Confusion matrices are often used to identify where a model makes incorrect predictions (e.g.,
Davidson et al. 2017). Qualitative inspection of predictions and associated documents can help to
reveal how the model is making decisions. Misclassified documents can be particularly fruitful,
allowing analysts to identify the patterns that cause problems for classifiers (Warner and
Hirschberg 2012). Evaluation templates can provide more nuanced data on model performance,
allowing models to be tested against various curated examples, including difficult-to-judge cases
such as counter speech, reclaimed slurs, and texts with intentional misspellings (Röttger et al.
2021).
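In the spirit of such templates, the sketch below defines a handful of functional test cases with expected labels and checks an arbitrary predict function against them; the cases and the function are hypothetical stand-ins rather than items from a published test suite.
```python
# Template-style functional testing: curated cases with expected labels probe
# specific model behaviors. The cases and the predict() function are hypothetical
# stand-ins, not items from a published test suite.
test_cases = [
    {"text": "I despise all of them, they are vermin", "expected": "hate",
     "functionality": "explicit dehumanization"},
    {"text": "Calling us vermin is disgusting and wrong", "expected": "not_hate",
     "functionality": "counter-speech quoting an attack"},
    {"text": "proud to be a queer artist", "expected": "not_hate",
     "functionality": "reclaimed / neutral identity term"},
    {"text": "they are v3rm1n", "expected": "hate",
     "functionality": "intentional misspelling"},
]

def evaluate_functionality(predict, cases):
    for case in cases:
        prediction = predict(case["text"])
        status = "PASS" if prediction == case["expected"] else "FAIL"
        print(f"{status} [{case['functionality']}]: {case['text']!r} -> {prediction}")
```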
Scholars have drawn attention to bias and discrimination in machine learning systems, ranging
from Google Search (Noble 2018) and face-recognition models (Buolamwini and Gebru 2018) to
loan risk (O’Neil 2016) and welfare allocation models (Eubanks 2018). Bias in hate speech
detection systems is particularly pernicious because victims of hate speech and members of
targeted groups can be mistakenly accused of being hate speakers. Hate speech detection
classifiers, including Google's state-of-the-art Perspective API, have been shown to discriminate against African Americans by disproportionately flagging texts written in African American English (AAE) as hateful and abusive (Davidson, Bhattacharya, and Weber 2019; Sap et al. 2019).
2019). Such biases extend to other groups, including sexual and religious minorities. For
example, the Perspective API tended to classify statements containing neutral identity terms like
“queer,” “lesbian,” and “muslim” as toxic (Dixon et al. 2018). Other commercial hate speech
detection systems are more difficult to audit, but reports from users and information uncovered by journalists suggest that Facebook often mislabeled posts about racism as hate speech12 and missed violent threats against
Muslims.13 Most existing studies and reporting have focused on the U.S., and little is known
about the extent to which these systems discriminate against minority groups in other national, linguistic, and cultural contexts.
12. https://www.washingtonpost.com/technology/2021/11/21/facebook-algorithm-biased-race/ (Accessed 10/6/22).
13. https://www.propublica.org/article/facebook-hate-speech-censorship-internal-documents-algorithms (Accessed 10/6/22).
I argue that two aspects of the training procedure are the root cause of these biases. The
first is annotation. Until recently, most researchers paid scant attention to who provided
annotations or whether they harbored attitudes that could influence their judgments. Social
psychologists have shown that perceptions of hate speech vary according to the race and gender
of the observer (Cowan and Hodge 1996), as well as their attitudes, with individuals scoring
higher on anti-Black racism scales less likely to consider violent acts against Black people as
hate crimes (Roussos and Dovidio 2018). Survey evidence shows that males and conservatives
find online hate speech “less disturbing” than others (Costello et al. 2019). These biases can
enter into datasets, as white, male, conservative annotators are less likely to consider texts
containing anti-Black racism to be hateful (Sap et al. 2022). Even well-intentioned researchers
and annotators can produce problematic data: Waseem and Hovy (2016) drew upon critical race
theory to develop an annotation scheme and employed a gender studies student and a “non-
activist feminist” to check annotations for bias, yet models trained on their dataset
disproportionately classify tweets written in AAE as racist and sexist (Davidson et al. 2019).
A second area that has received less attention is sampling. Lexicon-based sampling can
result in the oversampling of texts from targeted groups. For example, texts extracted from
Twitter using the keywords from the Hatebase lexicon included a disproportionate amount of
content produced by African-Americans, mainly due to the inclusion of the reclaimed slur
"n*gga," and annotators almost universally tended to flag tweets using this term as hateful or
offensive (Waseem, Thorne, and Bingel 2018). This over-representation led classifiers trained on
these data to associate AAE with hate speech (Davidson et al. 2019; Harris et al. 2022).
Sampling bias can also arise in other ways. The bias identified in the Perspective classifier
occurred because neutral identity terms were often used pejoratively in the training data (Dixon
et al. 2018). Such problems may become more ubiquitous as we increasingly rely on language
models that learn stereotypical associations present in the vast corpora of text scraped from the web (Bender et al. 2021).
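One way to surface these disparities is to compare false positive rates across groups of non-hateful documents, for example posts written in AAE versus other posts, or posts containing neutral identity terms; the sketch below assumes a trained predict function and a hypothetical evaluation set tagged by group.
```python
# Audit a classifier for unintended bias: compare false positive rates on
# non-hateful documents across groups (e.g., posts written in AAE vs. other posts,
# or posts containing neutral identity terms). Data and predict() are hypothetical.
from collections import defaultdict

def false_positive_rates(predict, documents):
    """documents: list of dicts with 'text', 'label' (0/1), and 'group' keys."""
    fp = defaultdict(int)  # non-hateful posts flagged as hateful, per group
    n = defaultdict(int)   # total non-hateful posts, per group
    for doc in documents:
        if doc["label"] == 0:          # only non-hateful posts can be false positives
            n[doc["group"]] += 1
            if predict(doc["text"]) == 1:
                fp[doc["group"]] += 1
    return {group: fp[group] / n[group] for group in n}

# Large gaps in false positive rates across groups indicate that benign posts from
# some groups are disproportionately flagged as hateful.
```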
What can be done to address bias in machine learning classifiers? Computer scientists
have attempted to develop technical solutions for post hoc bias mitigation with limited success,
often obscuring underlying problems (Gonen and Goldberg 2019) or generating other
problematic behavior (Xu et al. 2021). It is difficult to develop simple technical fixes for
ingrained issues that systematically skew model outputs—as the adage goes, “garbage in,
garbage out.” Regarding annotation, it is plausible that annotators more attuned to context, if not
members of the targeted groups, would help to mitigate biases, an argument Ruha Benjamin
(2019) makes in her analysis of race and technology. For example, formerly gang-involved youth
have been employed as domain experts to identify tweets related to offline violence in Chicago,
demonstrating the value of localized knowledge (Frey et al. 2020). Better training and guidance
are another promising avenue. Sap and colleagues (2019) found that instructing annotators about
racial bias and dialect variation resulted in fewer racialized misclassifications. Turning to
sampling, synthetic examples generated either by humans or large language models can yield
training data that is more representative and less susceptible to bias (Hartvigsen et al. 2022;
Vidgen, Thrush, et al. 2021). Methods have also been proposed to evaluate annotated datasets
and trained models. Templates can be used to audit classifiers for different kinds of bias (Röttger
et al. 2021), and unsupervised methods like topic modeling can help to surface biases in
annotated corpora (Davidson and Bhattacharya 2020). Initiatives to provide more detailed
documentation will also make it easier to identify potential problems with training data and
models (Mitchell et al. 2019; Gebru et al. 2021). There is no panacea for addressing bias in
supervised text classification systems, so it is critical that researchers understand how these
problems can arise at each step in the machine learning pipeline, from the collection and annotation of training data to the evaluation and deployment of models.
CONCLUSION
Supervised text classification used in conjunction with digital trace data from social media
enables us to study social life and social problems at an unprecedented scale. Even highly
subjective phenomena like hate speech can be identified automatically with relatively high
accuracy, given sufficient, high-quality annotated training data. This chapter provides an
overview of the methodology behind these approaches and highlights some cutting-edge
technologies that computational social scientists are only beginning to utilize. These methods
have become more powerful and accessible, opening up new opportunities for social scientific
research. However, this chapter also provides a cautionary tale, demonstrating how classification
systems can have unintended consequences, reproducing some of the same types of
discriminatory behavior we seek to mitigate. When developing training data or using a pre-
trained model, we should carefully consider how sampling and annotation procedures can result
in downstream biases. There is still much work to be done to better understand and mitigate
these issues.
Beyond the methodological concerns, it is important to emphasize how these systems can
impact millions of people’s lives. This makes the analysis of hate speech detection and other
related forms of content moderation an essential topic for sociological inquiry. Social media
platforms now routinely conduct content moderation at a vast, global scale (Gillespie 2018;
Roberts 2019): Facebook now claims to proactively detect 94.7% of hate speech, removing
millions of hateful posts every month.14 In this respect, platforms now operate more like
governments than companies as they regulate speech across the globe (Klonick 2018). These
systems can make life better for many by reducing their exposure to hateful and abusive content,
but they can also backfire, leading to discrimination and censorship. Social scientists have a role
to play in identifying these issues and informing debates to help make these systems more
democratic, accountable, and transparent. Understanding how these technologies work and when they fail is a necessary first step.
References
Alrababa'h, Ala', William Marble, Salma Mousa, and Alexandra A. Siegel. 2021. "Can
Exposure to Celebrities Reduce Prejudice? The Effect of Mohamed Salah on
Islamophobic Behaviors and Attitudes.” American Political Science Review
115(4):1111–28.
Anti-Defamation League. 2021. Online Hate and Harassment: The American Experience 2021.
Anti-Defamation League.
Barberá, Pablo, Amber E. Boydstun, Suzanna Linn, Ryan McMahon, and Jonathan Nagler. 2020.
“Automated Text Classification of News Articles: A Practical Guide.” Political Analysis
1–24.
Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021.
“On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” Pp. 610–23
in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and
Transparency. ACM.
Benjamin, Ruha. 2019. Race after Technology: Abolitionist Tools for the New Jim Code. Polity.
Benoit, Kenneth, Drew Conway, Benjamin E. Lauderdale, Michael Laver, and Slava Mikhaylov.
2016. “Crowd-Sourced Text Analysis: Reproducible and Agile Production of Political
Data.” American Political Science Review 110(2):278–95.
Brown, Tom, et al. 2020. "Language Models Are Few-Shot Learners." Pp. 1877–
1901 in Advances in Neural Information Processing Systems. Vol. 33.
14. https://ai.facebook.com/blog/training-ai-to-detect-hate-speech-in-the-real-world/ (Accessed 10/6/22).
Buolamwini, Joy, and Timnit Gebru. 2018. “Gender Shades: Intersectional Accuracy Disparities
in Commercial Gender Classification.” Pp. 1–15 in Proceedings of Machine Learning
Research. Vol. 81.
Caselli, Tommaso, Valerio Basile, Jelena Mitrović, and Michael Granitzer. 2021. “HateBERT:
Retraining BERT for Abusive Language Detection in English.” Pp. 17–25 in
Proceedings of the 5th Workshop on Online Abuse and Harms. ACL.
Chandrasekharan, Eshwar, Mattia Samory, Anirudh Srinivasan, and Eric Gilbert. 2017b. “The
Bag of Communities: Identifying Abusive Behavior Online with Preexisting Internet
Data.” Pp. 3175–87 in Proceedings of the CHI Conference on Human Factors in
Computing Systems. ACM Press.
Chiu, Ke-Li, Annie Collins, and Rohan Alexander. 2022. “Detecting Hate Speech with GPT-3.”
Costello, Matthew, James Hawdon, Colin Bernatzky, and Kelly Mendes. 2019. “Social Group
Identity and Perceptions of Online Hate.” Sociological Inquiry 89(3):427–52.
Cowan, Gloria, and Cyndi Hodge. 1996. “Judgments of Hate Speech: The Effects of Target
Group, Publicness, and Behavioral Responses of the Target.” Journal of Applied Social
Psychology 26(4):355–74.
Davidson, Thomas, and Debasmita Bhattacharya. 2020. “Examining Racial Bias in an Online
Abuse Corpus with Structural Topic Modeling.” in ICWSM Data Challenge Workshop.
Davidson, Thomas, Debasmita Bhattacharya, and Ingmar Weber. 2019. “Racial Bias in Hate
Speech and Abusive Language Detection Datasets.” Pp. 25–35 in Proceedings of the
Third Workshop on Abusive Language Online: ACL.
Davidson, Thomas, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. “Automated Hate
Speech Detection and the Problem of Offensive Language.” Pp. 512–15 in Proceedings
of the 11th International Conference on Web and Social Media.
Denny, Matthew J., and Arthur Spirling. 2018. “Text Preprocessing For Unsupervised Learning:
Why It Matters, When It Misleads, And What To Do About It.” Political Analysis
26(02):168–89.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-
Training of Deep Bidirectional Transformers for Language Understanding.” Pp. 4171–86
in Proceedings of NAACL-HLT 2019. ACL.
Dixon, Lucas, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. “Measuring
and Mitigating Unintended Bias in Text Classification.” Pp. 67–73 in Proceedings of the
2018 Conference on AI, Ethics, and Society. ACM Press.
Djuric, Nemanja, Jing Zhou, Robin Morris, Mihajlo Grbovic, Vladan Radosavljevic, and
Narayan Bhamidipati. 2015. “Hate Speech Detection with Comment Embeddings.” Pp.
29–30 in WWW 2015 Companion. ACM Press.
Eubanks, Virginia. 2018. Automating Inequality: How High-Tech Tools Profile, Police, and
Punish the Poor. St. Martin’s Press.
Fieseler, Christian, Eliane Bucher, and Christian Pieter Hoffmann. 2019. “Unfairness by Design?
The Perceived Fairness of Digital Labor on Crowdworking Platforms.” Journal of
Business Ethics 156(4):987–1005.
Fortuna, Paula, and Sérgio Nunes. 2018. “A Survey on Automatic Detection of Hate Speech in
Text.” ACM Computing Surveys 51(4):1–30. doi: 10.1145/3232676.
Frey, William R., Desmond U. Patton, Michael B. Gaskell, and Kyle A. McGregor. 2020.
“Artificial Intelligence and Inclusion: Formerly Gang-Involved Youth as Domain Experts
for Analyzing Unstructured Twitter Data.” Social Science Computer Review 38(1):42–56.
Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna
Wallach, Hal Daumé III, and Kate Crawford. 2021. "Datasheets for Datasets."
Communications of the ACM 64(12):86–92.
Gillespie, Tarleton. 2018. Custodians of the Internet: Platforms, Content Moderation, and the
Hidden Decisions That Shape Social Media. Yale University Press.
Golbeck, Jennifer, et al. 2017. "A Large Labeled Corpus for Online Harassment
Research.” Pp. 229–33 in Proceedings of the 2017 ACM on Web Science Conference.
ACM Press.
Gonen, Hila, and Yoav Goldberg. 2019. “Lipstick on a Pig: Debiasing Methods Cover up
Systematic Gender Biases in Word Embeddings But Do Not Remove Them.” Pp. 609–14
in Proceedings of NAACL-HLT. ACL.
Harris, Camille, Matan Halevy, Ayanna Howard, Amy Bruckman, and Diyi Yang. 2022.
“Exploring the Role of Grammar and Word Choice in Bias Toward African American
English (AAE) in Hate Speech Classification.” Pp. 789–98 in Conference on
Fairness, Accountability, and Transparency. ACM Press.
Hartvigsen, Thomas, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece
Kamar. 2022. “ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and
Implicit Hate Speech Detection.” Pp. 3309–26 in Proceedings of the 60th Annual
Meeting of the Association for Computational Linguistics. ACL.
Kaye, David A. 2019. Speech Police: The Global Struggle to Govern the Internet. Columbia
Global Reports.
Kiela, Douwe, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik
Ringshia, and Davide Testuggine. 2020. “The Hateful Memes Challenge: Detecting Hate
Speech in Multimodal Memes.” Pp. 1–14 in 34th Conference on Neural Information
Processing Systems.
King, Gary, Patrick Lam, and Margaret E. Roberts. 2017. “Computer-Assisted Keyword and
Document Set Discovery from Unstructured Text.” American Journal of Political
Science 61(4):971–88.
Kirk, Hannah, Bertie Vidgen, and Scott Hale. 2022. “Is More Data Better? Re-Thinking the
Importance of Efficiency in Abusive Language Detection with Transformers-Based
Active Learning.” Pp. 52–61 in Proceedings of the Third Workshop on Threat,
Aggression and Cyberbullying. ACL.
Kirk, Hannah, Bertie Vidgen, Paul Rottger, Tristan Thrush, and Scott Hale. 2022. “Hatemoji: A
Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-
Based Hate.” Pp. 1352–68 in Proceedings of the 2022 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language
Technologies. ACL.
Klonick, Kate. 2018. “The New Governors: The People, Rules, and Processes Governing Online
Speech.” Harvard Law Review 131(6):1598–1670.
Kozlowski, Austin C., Matt Taddy, and James A. Evans. 2019. “The Geometry of Culture:
Analyzing the Meanings of Class through Word Embeddings.” American Sociological
Review 84(5):905–49.
Kwok, Irene, and Yuzhou Wang. 2013. “Locate the Hate: Detecting Tweets against Blacks.” Pp.
1621–22 in Proceedings of the Twenty-Seventh AAAI Conference on Artificial
Intelligence.
Magu, Rijul, Kshitij Joshi, and Jiebo Luo. 2017. “Detecting the Hate Code on Social Media.” Pp.
608–11 in Proceedings of the 11th International Conference on Web and Social Media.
Magu, Rijul, and Jiebo Luo. 2018. “Determining Code Words in Euphemistic Hate Speech Using
Word Embedding Networks.” Pp. 93–100 in Proceedings of the 2nd Workshop on
Abusive Language Online. ACL.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. “Distributed
Representations of Words and Phrases and Their Compositionality.” Pp. 3111–19 in
Advances in Neural Information Processing Systems.
Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben
Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. “Model
Cards for Model Reporting.” Pp. 220–29 in Proceedings of the Conference on Fairness,
Accountability, and Transparency. ACM Press.
Mitts, Tamar. 2019. “From Isolation to Radicalization: Anti-Muslim Hostility and Support for
ISIS in the West.” American Political Science Review 113(1):173–94.
Molina, Mario, and Filiz Garip. 2019. “Machine Learning for Sociology.” Annual Review of
Sociology 45:27–45.
Müller, Karsten, and Carlo Schwarz. 2021. “Fanning the Flames of Hate: Social Media and Hate
Crime.” Journal of the European Economic Association 19(4):2131–67.
Munger, Kevin. 2016. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist
Harassment.” Political Behavior.
Noble, Safiya Umoja. 2018. Algorithms of Oppression: How Search Engines Reinforce Racism.
New York, NY: NYU Press.
Olteanu, Alexandra, Carlos Castillo, Jeremy Boy, and Kush Varshney. 2018. “The Effect of
Extremist Violence on Hateful Speech Online.” Proceedings of the International AAAI
Conference on Web and Social Media 12(1).
O’Neil, Cathy. 2016. Weapons of Math Destruction: How Big Data Increases Inequality and
Threatens Democracy. Broadway Books.
Pew Research Center. 2020. “Partisans in the U.S. Increasingly Divided on Whether Offensive
Content Online Is Taken Seriously Enough.” Pew Research Center. Retrieved October 7,
2022 (https://www.pewresearch.org/fact-tank/2020/10/08/partisans-in-the-u-s-
increasingly-divided-on-whether-offensive-content-online-is-taken-seriously-enough/).
Ribeiro, Manoel Horta, Pedro H. Calais, Yuri A. Santos, Virgílio A. F. Almeida, and Wagner
Meira Jr. 2017. “Characterizing and Detecting Hateful Users on Twitter.” in Proceedings
of the International AAAI Conference on Web and Social Media.
Röttger, Paul, Bertie Vidgen, Dong Nguyen, Zeerak Waseem, Helen Margetts, and Janet
Pierrehumbert. 2021. “HateCheck: Functional Tests for Hate Speech Detection Models.”
Pp. 41–58 in Proceedings of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International Joint Conference on Natural
Language Processing. ACL.
Roussos, Gina, and John F. Dovidio. 2018. “Hate Speech Is in the Eye of the Beholder: The
Influence of Racial Attitudes and Freedom of Speech Beliefs on Perceptions of Racially
Motivated Threats of Violence.” Social Psychological and Personality Science 9(2):176–
85.
Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. “The Risk of
Racial Bias in Hate Speech Detection.” Pp. 1668–78 in Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics. ACL.
Sap, Maarten, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. 2020.
“Social Bias Frames: Reasoning about Social and Power Implications of Language.” Pp.
5477–90 in Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics. ACL.
Sap, Maarten, Swabha Swayamdipta, Laura Vianna, Xuhui Zhou, Yejin Choi, and Noah Smith.
2022. “Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic
Language Detection.” Pp. 5884–5906 in Proceedings of the NAACL-HLT. ACL.
Siegel, Alexandra A. 2020. “Online Hate Speech.” Social Media and Democracy: The State of
the Field, Prospects for Reform 56–88.
Siegel, Alexandra A., Evgenii Nikitin, Pablo Barberá, Joanna Sterling, Bethany Pullen, Richard
Bonneau, Jonathan Nagler, and Joshua A. Tucker. 2021. “Trumping Hate on Twitter?
Online Hate Speech in the 2016 US Election Campaign and Its Aftermath.” Quarterly
Journal of Political Science 16(1):71–104.
Smith, Noah A. 2020. “Contextual Word Representations: Putting Words into Computers.”
Communications of the ACM 63(6):66–74.
Vidgen, Bertie, and Leon Derczynski. 2020. “Directions in Abusive Language Training Data, a
Systematic Review: Garbage in, Garbage Out.” PLoS ONE 15(12):e0243300.
Vidgen, Bertie, Helen Margetts, and Alex Harris. 2019. “How Much Online Abuse Is There: A
Systematic Review of Evidence for the UK.” Alan Turing Institute.
Vidgen, Bertie, Dong Nguyen, Helen Margetts, Patricia Rossini, and Rebekah Tromble. 2021.
“Introducing CAD: The Contextual Abuse Dataset.” Pp. 2289–2303 in Proceedings of
the NAACL-HLT. ACL.
Vidgen, Bertie, Tristan Thrush, Zeerak Waseem, and Douwe Kiela. 2021. “Learning from the
Worst: Dynamically Generated Datasets to Improve Online Hate Detection.” Pp. 1667–
82 in Proceedings of the 59th Annual Meeting of the Association for Computational
Linguistics and the 11th International Joint Conference on Natural Language
Processing. ACL.
Warner, William, and Julia Hirschberg. 2012. “Detecting Hate Speech on the World Wide Web.”
Pp. 19–26 in Proceedings of the Second Workshop on Language in Social Media. ACL.
Waseem, Zeerak, Thomas Davidson, Dana Warmsley, and Ingmar Weber. 2017. “Understanding
Abuse: A Typology of Abusive Language Detection Subtasks.” Pp. 78–84 in
Proceedings of the First Workshop on Abusive Language Online. ACL.
Waseem, Zeerak, and Dirk Hovy. 2016. “Hateful Symbols or Hateful People? Predictive
Features for Hate Speech Detection on Twitter.” Pp. 88–93 in Proceedings of NAACL-
HLT.
Waseem, Zeerak, James Thorne, and Joachim Bingel. 2018. “Bridging the Gaps: Multi Task
Learning for Domain Transfer of Hate Speech Detection.” Pp. 29–55 in Online
Harassment, Human-Computer Interaction Series, edited by J. Goldbeck. New York,
NY: Springer.
Wiegand, Michael, Josef Ruppenhofer, Anna Schmidt, and Clayton Greenberg. 2018. “Inducing
a Lexicon of Abusive Words – a Feature-Based Approach.” Pp. 1046–56 in Proceedings
of the NAACL-HLT. ACL.
Wilson, Richard Ashby, and Molly K. Land. 2021. “Hate Speech on Social Media: Content
Moderation in Context.” Connecticut Law Review 52(3):1029–76.
Xu, Albert, Eshaan Pathak, Eric Wallace, Suchin Gururangan, Maarten Sap, and Dan Klein.
2021. “Detoxifying Language Models Risks Marginalizing Minority Voices.” Pp. 2390–
97 in Proceedings of NAACL-HLT. ACL.
Zampieri, Marcos, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh
Kumar. 2019. “Predicting the Type and Target of Offensive Posts in Social Media.” Pp.
1415–20 in Proceedings of NAACL-HLT. ACL.
Zhou, Xuhui, Hao Zhu, Akhila Yerukola, Thomas Davidson, Jena D. Hwang, Swabha
Swayamdipta, and Maarten Sap. 2023. “COBRA Frames: Contextual Reasoning about
Effects and Harms of Offensive Statements.” in Proceedings of the 61st Annual Meeting
of the Association for Computational Linguistics. ACL.