
Hate Speech Detection and Bias in Supervised Text Classification

Thomas R. Davidson
Department of Sociology
Rutgers University–New Brunswick
[email protected]

This chapter introduces hate speech detection, describing the process of developing annotated datasets and training machine learning models and
highlighting cutting-edge techniques used to improve these methods. The case of
hate speech detection provides valuable insight into the promises and pitfalls of
supervised text classification for sociological analyses. A key theme running
through the chapter is how ostensibly minor methodological decisions—
particularly those related to the development of training datasets—can have
profound downstream societal impacts. In particular, I examine racial bias in
these systems, explaining why models intended to detect hate speech can
discriminate against the groups they are designed to protect and discussing efforts
to mitigate these problems. I argue that hate speech detection and other forms of
content moderation should be an important topic of sociological inquiry as
platforms increasingly use these tools to govern speech on a global scale.

This is a draft of a chapter that has been accepted by Oxford University Press in the forthcoming

book The Oxford Handbook of the Sociology of Machine Learning edited by Juan Pablo Pardo-

Guerra and Christian Borch due for publication in 2023. This reference will be updated upon

publication. Please cite the published version.

Online hate speech has recently drawn widespread attention from journalists, policymakers, and

academics. While it only constitutes a fraction of a percent of the content shared online—a

report by the Alan Turing Institute estimated that 0.001% of the content shared on major

platforms is abusive or hateful (Vidgen, Margetts, and Harris 2019)—this still equates to an

enormous amount of hateful material given the scale of online information flows. Facebook

reports detecting tens of millions of violations of its hate speech policies every quarter. 1 In 2021,

41% of Americans reported experiences of online harassment due to their gender, race and

ethnicity, religion, sexual orientation, or political views (Pew Research Center 2021). Women,

racial minorities, and LGBTQ+ individuals are particularly likely to be victims of hate speech

and other types of online harassment. The Anti-Defamation League (2021) has reported

increased online harassment against LGBTQ+ people, Jewish Americans, and Asian Americans. With millions of

people exposed, directly or indirectly, on a regular basis, online hate speech fosters a hostile

online environment for marginalized groups and can result in significant offline harms. While

these downstream effects are difficult to measure, there is evidence that online hate speech is

associated with increases in hate crimes (Müller and Schwarz 2021) and can elicit a violent

backlash from members of targeted groups (Mitts 2019). The case of Myanmar paints a stark

picture of what can happen when hate speech goes unchecked, as ultra-nationalists and members

of the military used Facebook to attack the Rohingya Muslim minority, contributing to genocidal

violence. At the time, Facebook was unprepared to address the problem because the company

only had two staff who spoke Burmese to review posts, even though one-third of Myanmar’s

population—approximately eighteen million people—actively used the platform (Kaye 2019).

To help tackle online hate speech, researchers and practitioners have turned to automated

methods to detect and measure it. There are several reasons why automation is desirable.
1. https://transparency.fb.com/policies/community-standards/hate-speech/#data (Accessed 10/7/2022).

Automated approaches enable researchers to perform tasks like measuring the prevalence of hate

speech over time or across different platforms, which would be unfeasible using qualitative

methods. Social media platforms employ tens of thousands of people to moderate hate speech

and other kinds of content, but even with such large workforces, automation is required to handle

the massive volume of content (Gillespie 2018). Speed has also become critical to these

endeavors, in part, due to legal obligations imposed on platforms (Kaye 2019). For example,

Germany’s Network Enforcement Act (NetzDG for short), which took effect in 2018, requires

that platforms remove hate speech within twenty-four hours or face substantial fines. In addition

to scalability and speed, automation can alleviate the burden placed on content moderators, who

can experience post-traumatic stress disorder and other negative psychological consequences

from continual exposure to content such as animal abuse, child pornography, and hate speech

(Roberts 2019).

Methods developed to detect online hate speech at scale have proven valuable tools in

several areas of social scientific research. Some studies have addressed the impact of online hate

speech. Tamar Mitts (2019) showed how localities in several European countries with high anti-

Muslim hate speech on Twitter experienced increased support for the terrorist group Islamic

State. Others examine hate speech as an outcome, such as a study showing how Egyptian player

Mohamed Salah’s successful tenure at Liverpool Football Club reduced fans’ use of anti-Muslim

language online (Alrababa’h et al. 2021). In another study analyzing 750 million tweets written

during the 2016 US election campaign, researchers found no evidence that Trump’s divisive

rhetoric increased online hate speech (Siegel et al. 2021). Terrorist attacks have been shown to

lead to spikes in the use of hate speech on platforms, including Twitter and Reddit (Olteanu et al.

2018). Changes to social media platform policies have been used as natural experiments to assess

their impact on online hate speech (Chandrasekharan et al. 2017a). Researchers have also

considered approaches to reduce online hate speech, including experimental work targeting hate

speakers on social media (Munger 2016).2

Common sense understandings of hate speech might imply that detecting it automatically

is relatively straightforward. Like Supreme Court Justice Potter Stewart’s famous remark on

obscenity, you know it when you see it.3 The basic idea is that once some examples of hate

speech have been identified, machine learning can augment human judgments by learning a set

of rules to detect hate speech in new data. However, hate speech is often highly subjective,

ambiguous, and context-dependent, making it difficult for both humans and computers to detect.

Mark Zuckerberg has remarked that it is “much easier to build an AI system to detect a nipple

than it is to detect hate speech.”4 Reflecting on Zuckerberg’s statement, it is instructive to

consider some differences between nudity and hate speech detection. Images and video data

contain context-independent information that enables computers to recognize nudity with

relatively high accuracy. While there is still debate regarding whose body parts are acceptable in

different contexts,5 if computers are shown enough images of naked bodies, features of these

images, like nipples or bare skin, make it relatively straightforward to predict whether an image

contains nudity. Hate speech, on the other hand, is not so easy for computers to identify. Some

features, such as slurs, can function equivalently to nipples, allowing us to identify some

instances of hate speech easily, but many cases require more nuanced contextual understanding

and interpretation. For example, slurs can be used by people repeating or contesting hate speech

2. See Siegel (2020) for a more thorough review of the social scientific literature on online hate speech.
3. Jacobellis v. Ohio, 1964.
4. https://www.engadget.com/2018-04-25-zuckerberg-easier-for-ai-detect-nipples-than-hate-speech.html (Retrieved 10/6/2022).
5. See Gillespie 2018, Chapters 1 and 6.

of which they have been the target or in alternative contexts such as reclaimed slurs used by

African-Americans and the LGBTQ+ community.

Due to these challenges, the case of hate speech detection provides valuable insight into

the promises and pitfalls of supervised text classification for sociological analyses. This chapter

describes the process of developing annotated datasets and training machine learning models to

detect hate speech, including some of the cutting-edge techniques used to improve these

methods. A key theme running through the chapter is how ostensibly minor methodological

decisions—particularly those related to the development of training datasets—can have profound

downstream societal impacts. In particular, I discuss racial bias in hate speech detection systems,

examining why models can discriminate against the same groups they are designed to protect and

the approaches that have been proposed to identify and mitigate such biases. Hate speech

detection is not only an instructive case for sociologists interested in using supervised text

classification, but the use of these approaches and related content moderation tools to govern

online speech at a global scale makes it a particularly urgent topic of sociological inquiry.

HATE SPEECH: DO WE KNOW IT WHEN WE SEE IT?

Before going any further, it is important to define hate speech. As alluded to in the previous

section, most readers will have some common sense understanding of hate speech and will

recognize certain speech acts as hateful. There are some examples, such as the overt use of

antisemitic, racist, or homophobic language to attack others, that most reasonable observers can

agree upon without requiring any formal definition. Legal definitions range from strict,

enforceable prohibitions on certain speech in countries like Germany to wide-ranging free speech

protections like those articulated in the First Amendment. In the United States, the issue has been

a source of tremendous controversy and debate since the 1980s, particularly as institutions like

universities have developed and attempted to enforce their own regulations (Wilson and Land

2021). Most recently, social media platforms have developed codes of conduct and policies that

delimit hateful speech and content. For example, Twitter’s “Hateful conduct policy” prohibits the

promotion of violence, attacks, and threats based on race, gender, sexual orientation, and a range

of other categories. They provide a rationale for the policy, emphasizing how they are

“committed to combating abuse motivated by hatred, prejudice or intolerance, particularly abuse

that seeks to silence the voices of those who have been historically marginalized.”6 These

policies have evolved as platforms have responded to emerging types of hateful conduct, shifting

from general guidance to specific proscriptions related to the type of attack, the target, and so on

(Kaye 2019),7 and transitioning from more permissive rules, inspired by the First Amendment,

towards the European regulatory model (Wilson and Land 2021). While I do not provide a

singular definition of hate speech here, I outline four core dimensions that make it difficult to

detect and highlight the implications for automated systems.

The first major challenge is that hate speech can be subjective. Different people have

different understandings of hate speech. Even if we agree upon a specific definition, there is still

room for differences in interpretation. This subjectivity can enter into machine learning systems,

such that their underlying models of hate speech may reflect certain perspectives at the expense

of others. Moreover, hate speech can also be ambiguous. The complexity of human language and

interaction makes it difficult to agree upon whether a particular statement meets a given

definition. Ambiguity is one of the reasons why commercial content moderation policies have

ballooned in size as companies have increasingly added layers of specificity to their protocols

6. https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy (Accessed 10/6/2022).
7. Facebook’s Hate Speech policy has changed 23 times since May 25, 2018, according to its new transparency page: https://transparency.fb.com/policies/community-standards/hate-speech/ (Accessed 10/6/2022).

(Kaye 2019). Ceteris paribus, more ambiguous cases will be more difficult for humans to agree

upon and thus more difficult for computers to detect. A third challenge is presented by the fact

that hate speech is also context-dependent. What is considered hateful in one context may be

innocuous in another. Specific cultural knowledge is often required to reason about the

implications of a statement, the intent of the speaker, and other salient factors. From a legal

perspective, context is critical as it allows us to better understand intent and potential harms

(Wilson and Land 2021). Yet, human content moderators and machine learning systems often

evaluate texts abstracted from their context. Finally, hate speech is politicized. In the U.S., there

has been a backlash against efforts to moderate online hate speech, with accusations that content

moderation efforts are a symptom of political correctness and bias against conservatives.

Opinion polls find that conservatives think online hate speech is taken too seriously, while

liberals believe it is not taken seriously enough (Pew Research Center 2020). In what follows, it

is essential to consider these issues—that hate speech can be subjective, ambiguous, context-

dependent, and politicized—as we consider how human judgments become embedded in

algorithms.

THE NUTS-AND-BOLTS OF HATE SPEECH DETECTION

Hate speech detection is a type of supervised text classification task. The goal is to take a text as

an input and assign it to a discrete class, i.e., whether or not it should be considered hate speech.

This section begins by considering simple rule-based methods using lexicons and highlighting

their limitations. I then walk through the process of training machine learning classifiers to detect

hate speech, moving from sampling and annotation to feature generation and algorithm selection.

Throughout, I emphasize key methodological decisions and consider how these choices can have

significant consequences for the way machine learning algorithms behave. The goal is to explore

the main contours of the topic rather than systematically review this vast and growing literature

(see Fortuna and Nunes (2018) and Vidgen and Derczynski (2020)).

The Limitations of Lexicons and the Promise of Machine Learning

At first glance, racist, sexist, homophobic, and other hateful language is often given away by

slurs and related words or phrases. If we assume hate speech is indicated by certain words, we

can detect it by enumerating all of these hateful words in a lexicon and checking whether a text

uses any of them. Resources like the crowd-sourced Hatebase lexicon, which features 3,894 such

terms across 98 different languages, promise such a solution.8 A simple decision rule can be used

to classify documents: a statement is considered hate speech if it contains one or more words

from the lexicon. Lexicon-based methods have been used to detect hate speech in online fora

with some success (Wiegand et al. 2018). While these approaches can indeed locate many

instances of hate speech, they will generate excessive false positives if words are used in

alternative contexts, as well as false negatives if relevant words are omitted.
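To make the decision rule concrete, the following minimal sketch implements lexicon matching in Python. The three lexicon terms and the documents are invented placeholders for illustration; a real application would substitute terms from a resource such as Hatebase.

```python
# Minimal sketch of lexicon-based classification. The lexicon terms and
# documents below are invented placeholders.
def lexicon_classifier(text, lexicon):
    """Flag a document as hate speech (1) if it contains any lexicon term."""
    tokens = set(text.lower().split())
    return int(bool(tokens & lexicon))

lexicon = {"slur1", "slur2", "slur3"}  # stand-ins for terms from a hate lexicon
documents = ["an innocuous post about football",
             "a post containing slur1 aimed at a group"]

predictions = [lexicon_classifier(doc, lexicon) for doc in documents]
print(predictions)  # [0, 1]
```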

To better understand the limitations of lexicons, it’s helpful to consider two metrics

commonly used in machine learning: precision and recall. Precision quantifies the accuracy of

the hate speech predictions: a low precision score indicates that many statements are mistakenly

flagged as hateful. Recall is defined as the proportion of hate speech correctly flagged as hate

speech, where a low recall score would imply that many hateful statements were missed. 9

Lexicons tend to have high recall but low precision, particularly when many words or phrases are
8. https://hatebase.org/ (Accessed 10/7/2022).
9. Formally, precision is defined as true positives / (true positives + false positives) and recall as true positives / (true positives + false negatives).
included. A fishing metaphor is useful here: a lexicon is like a trawler dragging a large net across

the seabed. The aim might be to catch tuna, but the trawler dredges up everything else in its path,

including porpoises, sea turtles, and sharks. This by-catch makes trawling extremely inefficient

and damaging to entire ecosystems. Lexicons can catch a lot of hate speech but are also prone to

sweeping up many other documents. For example, only around 5% of the statements in a sample

of almost thirty-thousand tweets containing words from the Hatebase lexicon were considered

hate speech by human raters (Davidson et al. 2017). This low precision occurs because many

words that indicate hate speech are also used in other contexts. The most obvious examples are

curse words, including slurs, which are frequently used online in a variety of contexts (Kwok and

Wang 2013). At the same time, other hate speech can slip through the nets if relevant words have

not been included in the lexicon. Language is constantly evolving, and internet users often find

ways to circumvent content moderation by developing alternative spellings or neologisms.

For example, users of the forum 4Chan responded to Google’s efforts to reduce online abuse by

using the company’s name and those of other major companies like Yahoo and Skype as

euphemisms for hateful slurs (Magu, Joshi, and Luo 2017). Given these limitations, lexicons

should only be used if the goal is to precisely detect particular types of hate speech where there is

high confidence that all relevant words have been included and that each term in the lexicon does

not generate excessive false positives.10
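As a brief illustration of how these metrics are computed in practice, the sketch below compares hypothetical lexicon predictions against hypothetical human annotations using scikit-learn. Both label vectors are invented for illustration.

```python
# Minimal sketch: evaluating lexicon predictions against human annotations.
# Both label vectors are invented placeholders.
from sklearn.metrics import precision_score, recall_score, f1_score

human_labels = [1, 0, 1, 0, 0, 1]   # annotator judgments (1 = hate speech)
lexicon_preds = [1, 1, 1, 1, 0, 0]  # decisions from a keyword lexicon

print("precision:", precision_score(human_labels, lexicon_preds))  # TP / (TP + FP)
print("recall:   ", recall_score(human_labels, lexicon_preds))     # TP / (TP + FN)
print("F1:       ", f1_score(human_labels, lexicon_preds))         # harmonic mean
```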

Machine learning can help to address the shortcomings of lexicons. Rather than starting with

a set of keywords, these methods work by taking a set of texts annotated for hate speech and

training a computer to distinguish between hate speech and other content. The goal is to replicate

the human annotations as accurately as possible by using features of the text to identify hate

10. Small keyword lexicons are ideal since large lexicons can suffer serious problems when used for document selection, as experimental research demonstrates (King, Lam, and Roberts 2017).

speech. This approach tends to work better than lexicon-based classification because human

raters can make holistic judgments based on their interpretations of the texts. Machine learning

models can automatically select from a large array of features to best replicate the human

judgments. Rather than acting like trawlers, these approaches should be more like the technique

of line-and-pole fishing, enabling us to catch the tuna but leave the other creatures in the sea.

Ideally, we want to catch as much hate speech as possible (maximizing recall) while minimizing

the amount of by-catch (maximizing precision). These methods provide a powerful, flexible

approach for detecting hate speech and can easily be generalized to other kinds of content.

Nonetheless, machine learning methods are not immune to the issues faced by lexicons and can

suffer from other types of biases.

Collecting and Annotating Training Data

To begin developing a hate speech detection model, one must identify a set of documents to use

as training data. Much of the current work uses data from Twitter due to the prominence of the

platform and the relative ease of data collection using its application programming interface

(API), and the majority of studies also focus on English content, although there is a growing

interest in other languages (Siegel 2020; Vidgen and Derczynski 2020). Since hate speech is

relatively rare, randomly sampling posts from social media would require vast samples to

identify a sufficient number of examples, so keyword lexicons have been widely used to sample

documents more likely to be hateful. Studies have used existing resources like the Hatebase

lexicon (Davidson et al. 2017) or developed custom sets of keywords (Golbeck et al. 2017).

Unfortunately, this means that documents without keywords are missed, and those containing

commonly used keywords appear with high frequency. To address these issues, scholars have

developed hybrid approaches, augmenting random samples with keyword-based samples to more

accurately reflect the distribution of hate speech online (Founta et al. 2018). A promising avenue

of research is the use of synthetic training texts, including both human (Vidgen, Thrush, et al.

2021) and machine-generated texts as training data (Hartvigsen et al. 2022). Synthetic data

provide greater control over content, enabling researchers to improve predictive performance by

prioritizing the most difficult examples to detect.

Once data have been collected, they need to be annotated. Early methods used binary

schemas, assigning documents to classes like antisemitic/not antisemitic (Warner and

Hirschberg 2012) or racist/not racist (Kwok and Wang 2013). These binary schemas have been

extended to multiple types of hateful speech, like racist/sexist/not racist or sexist (Waseem and

Hovy 2016). This leaves little room for nuance, so offensive language, such as cursing and

reclaimed slurs, is often misclassified as hate speech. Models trained on these data can function

like lexicons, labeling anything containing these terms as hate speech. To address this issue,

Davidson and colleagues (2017) developed a ternary schema, distinguishing between hate

speech, offensive language, and other content. Annotators were instructed not to make decisions

based on the presence of certain words but to holistically evaluate the entire statement.

Subsequent work has also accounted for other content like spam (Founta et al. 2018) and used

more detailed hierarchical coding schemes, asking annotators whether specific individuals or

groups are targeted (Zampieri et al. 2019) or whether there is a clear intent to offend (Sap et al.

2020).

Once a schema has been developed, annotators must read each example and determine the

appropriate class. Typically, tens of thousands of documents must be annotated to provide a

sufficient quantity of examples to train a model that generalizes well to new data. Researchers

often employ crowdworkers from Mechanical Turk, CrowdFlower, and similar platforms due to

the scale of the task. Crowdworkers are provided with instructions and paid a small fee for each

annotation, with checks in place to monitor performance.11 Most studies use multiple decisions

for each example to improve reliability, but recent work on other classification tasks suggests

that it is more efficient to produce a larger dataset with a single annotator per example (Barberá

et al. 2020). Advances in machine learning like active learning (Kirk, Vidgen, and Hale 2022)

and few-shot learning (Chiu, Collins, and Alexander 2022) will help to reduce the cost of

annotation, allowing models to achieve strong predictive performance with relatively small

amounts of training data. Others have side-stepped the need for annotations altogether. For

example, researchers trained a model to identify hateful content by distinguishing between posts

from Reddit communities known to be hateful and other communities on the site

(Chandrasekharan et al. 2017b). In many cases, those interested in applying these methods can

leverage existing resources rather than developing training data from scratch. The website

https://hatespeechdata.com/ contains an up-to-date list of annotated hate speech datasets in

over twenty languages (Vidgen and Derczynski 2020).
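As a small illustration of how multiple judgments per example are combined, the sketch below takes a majority vote over hypothetical crowdworker annotations using the ternary hate/offensive/neither schema described above. The annotation table is invented for illustration.

```python
# Minimal sketch: majority-vote aggregation of crowdworker annotations.
# The annotation table is an invented placeholder.
from collections import Counter

annotations = {
    "doc1": ["hate", "hate", "offensive"],
    "doc2": ["offensive", "offensive", "neither"],
    "doc3": ["neither", "neither", "neither"],
}

labels = {doc: Counter(votes).most_common(1)[0][0]
          for doc, votes in annotations.items()}
print(labels)  # {'doc1': 'hate', 'doc2': 'offensive', 'doc3': 'neither'}
```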

Computational Representations of Text

Once documents have been annotated, the texts must be converted into representations that can

be input into machine learning classifiers. These inputs, known as features, are used to predict

the class each document belongs to. The most common way to represent text for supervised

classification is to treat each text as an unordered bag-of-words. A term-document matrix is

constructed where each unique word in the corpus is represented as a column and each document

11. For an overview and evaluation of crowd-sourced text analysis, see Benoit et al. (2016). The topic of crowd-work has also been subject to extensive discussion regarding ethics and potential exploitation (e.g., Fieseler, Bucher, and Hoffmann 2019).

as a row. Each element of this matrix denotes how often a specific term is used in a given

document, often weighted to account for the frequency with which each term occurs in the

corpus. Classifiers trained on bag-of-words representations make predictions based on weighted

functions of the document vectors. For example, a word like “kill” might get a relatively high

weight, whereas unrelated words will have weights close to zero. These representations can be

considered an extension of the lexicon approach, but rather than deciding whether certain words

indicate hate speech a priori, classifiers use patterns in the data to identify the words associated

with hate speech.
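The sketch below constructs a small weighted (TF-IDF) term-document matrix with scikit-learn; the three documents are invented placeholders, and any classifier could be fit on the resulting matrix.

```python
# Minimal sketch: a weighted bag-of-words (term-document) matrix.
# The documents are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["they want to kill us all",
             "what a great game last night",
             "kill the lights and watch the game"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)    # sparse matrix: documents x vocabulary
print(X.shape)                             # (3, number of unique terms)
print(vectorizer.get_feature_names_out())  # the vocabulary (one column per term)
```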

There are typically thousands of unique words in a corpus of text—the set of unique words is

often referred to as a vocabulary—but each document only contains a minuscule fraction. Bag-

of-words representations are thus considered sparse representations since most elements of each

document vector are zero. Sparsity is particularly acute when we consider that most user-

generated texts on social media are relatively short (e.g., tweets are limited to 280 characters).

This sparsity adds computational complexity, as models have to estimate many parameters and

store large matrices. Moreover, much of the information in a sparse matrix is irrelevant to the

prediction task. Very frequent words like “and,” “of,” and “the” (often known as stopwords)

provide little information, and very infrequent words and hapax legomena (words that occur only

once) are too rare to provide any useful patterns. It is thus conventional to pre-process texts,

performing tasks such as setting words to lowercase and dropping punctuation, stopwords, and

rare words to remove redundant content and reduce the dimensionality of the data. These tasks

are often treated as routine, but such decisions must be performed with care because some of this

information can be relevant to classification tasks, making downstream results sensitive to these

choices (Denny and Spirling 2018).
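The sketch below illustrates how these pre-processing choices shrink the vocabulary; the toy corpus and thresholds are placeholders for illustration, not recommendations.

```python
# Minimal sketch: pre-processing choices and their effect on vocabulary size.
# The corpus and thresholds are illustrative placeholders.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The game was great!", "the Game last night was great",
          "They want to KILL us", "kill the lights"]

raw = CountVectorizer(lowercase=False)
reduced = CountVectorizer(lowercase=True,        # case-fold
                          stop_words="english",  # drop frequent function words
                          min_df=2)              # drop terms in fewer than 2 documents

for name, vec in [("raw", raw), ("reduced", reduced)]:
    vec.fit(corpus)
    print(name, "vocabulary size:", len(vec.vocabulary_))
```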

Over the past decade, word embeddings have become a common type of feature

representation used in hate speech classification and other tasks. Word embeddings are dense

representations consisting of relatively short, real-valued vectors developed by training neural

language models on large corpora of text (Mikolov et al. 2013). Each word is assigned a vector,

and its relative position in the vector space captures something about its meaning: words used in

similar contexts occupy similar positions, a property consistent with distributional semantics, a

longstanding concept in linguistic theory (Firth 1957). Models using embeddings as features

have achieved strong predictive performance at hate speech classification tasks (Djuric et al.

2015). Sociologists have demonstrated how embeddings encode rich information about

stereotypes and cultural assumptions (Kozlowski, Taddy, and Evans 2019). These

representations are particularly effective for hate speech detection because they can capture more

subtle, implicit types of hate speech (Waseem et al. 2017). For example, proximity to a hateful

slur in the embedding space signals that a word is often used in a hateful context, allowing

researchers to detect euphemistic hate speech (Magu and Luo 2018).
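The sketch below loads a set of pre-trained static embeddings with gensim, inspects nearest neighbors, and averages word vectors into a simple document representation. The "glove-wiki-gigaword-100" name refers to one of gensim's downloadable vector sets and requires an internet connection; any pre-trained KeyedVectors file would work the same way.

```python
# Minimal sketch: pre-trained static word embeddings via gensim.
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # maps each word to a 100-dim vector

# Words used in similar contexts occupy nearby positions in the space.
print(vectors.most_similar("hateful", topn=5))

# A simple document representation: average the vectors of its in-vocabulary words.
def embed(text):
    words = [w for w in text.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

features = embed("an example post to classify")  # dense feature vector for a classifier
```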

A limitation of early word embedding approaches is the capacity to represent polysemy.

For example, the vector for the word “crane” is an average over the instances where the word is

used to refer to a bird or a piece of construction equipment, or used as a verb. Representations known as

contextualized embeddings provide more nuanced representations by allowing the vector—and

hence the meaning of a word—to vary according to the context in which it is used (Smith 2020).

If the word “egg” occurs in the document, then the vector has a meaning closer to the bird,

whereas proximity to a word like “hardhat” or “girder” signals an alternative meaning. Such

representations are a promising avenue for detecting hate speech, enabling computers to better

distinguish between counter-speech, reclaimed slurs, and genuine hate speech by incorporating

contextual information (Caselli et al. 2021).
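The sketch below illustrates the “crane” example using a contextualized model from the Hugging Face transformers library. It assumes "crane" is a single token in the bert-base-uncased vocabulary; other tokenizers may split words into subwords, requiring extra handling.

```python
# Minimal sketch: the same word receives different vectors in different contexts.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def crane_vector(sentence):
    """Return the contextual vector for the token 'crane' in a sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    idx = tokens.index("crane")  # position of the word of interest
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (sequence length, 768)
    return hidden[idx]

bird = crane_vector("the crane stood in the pond and ate a fish")
bird2 = crane_vector("a crane waded through the marsh hunting frogs")
machine = crane_vector("the crane lifted the steel girder onto the roof")

cos = torch.nn.functional.cosine_similarity
print(cos(bird, bird2, dim=0))    # same sense: higher similarity
print(cos(bird, machine, dim=0))  # different sense: lower similarity
```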

Non-textual information and other metadata have been used as inputs for hate speech

detection classifiers. For example, emojis can be central to interpreting some hateful posts (Kirk,

Vidgen, Rottger, et al. 2022). Beyond the text, we can incorporate information about the speaker,

target, and context. We might expect a late-night reply to a person expressing an opinion on a

controversial political topic to have a higher likelihood of abusiveness than a lunchtime post

celebrating a birthday. Several studies have used social features to improve hate speech

classification models (Ribeiro et al. 2017; Vidgen, Nguyen, et al. 2021). This chapter focuses on

identifying hate speech in written text, but our online social lives are increasingly mediated

through image and video, making audiovisual and multimodal approaches particularly urgent

areas for future research (Kiela et al. 2020). Most of the principles and procedures outlined here

generally apply when extending them to non-textual and multimodal data.

Training and Evaluating Hate Speech Classifiers

A wide array of algorithms has been used for hate speech detection, from logistic regression and

support vector machines to random forests and deep neural networks. While the estimation

procedures differ, these approaches all share a common objective: to learn a function y = f(X),

where X is the matrix of features and y is a vector representing the hate speech annotation

associated with each document. The objective is to accurately identify hate speech in the training

data and generalize to new examples. In computer science research, it is conventional to test

several different algorithms and experiment with associated hyperparameters to identify the best-

performing model.
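A minimal sketch of this workflow with scikit-learn appears below: split the annotated data, fit a logistic regression on bag-of-words features, and search over one hyperparameter by cross-validation. The texts and labels are trivial placeholders standing in for an annotated hate speech corpus.

```python
# Minimal sketch: the supervised classification workflow. The texts and labels
# are trivial placeholders standing in for an annotated hate speech corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline

texts = [f"hateful placeholder post {i}" for i in range(10)] + \
        [f"benign placeholder post {i}" for i in range(10)]
labels = [1] * 10 + [0] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Try several values of the regularization strength C via cross-validation.
search = GridSearchCV(pipeline, {"logisticregression__C": [0.1, 1, 10]}, cv=3)
search.fit(X_train, y_train)
print(search.best_params_)
print("held-out accuracy:", search.score(X_test, y_test))
```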

Models using transformer architectures such as BERT (Devlin et al. 2019) and GPT

(Brown et al. 2020) have recently achieved record-breaking performance on various natural

language processing tasks and have become the gold standard approach for hate speech detection

and other classification problems. These models are known as large language models (LLMs)

because they are trained on enormous corpora of text and can have billions of internal

parameters. Rather than starting from scratch by training a model on an annotated dataset, pre-

trained models are adapted to new tasks through a process known as fine-tuning, a technique

borrowed from image classification. Due to the pre-training on large amounts of text, LLMs can

perform reasonably well at tasks like hate speech detection with few training examples (Chiu et

al. 2022). These models can also generate text, opening up new possibilities for the creation of

synthetic training data (Hartvigsen et al. 2022), explanations for why particular cases have been

flagged as hateful (Sap et al. 2020), and analyses of how context shapes perceptions of

offensiveness (Zhou et al. 2023).
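The sketch below shows the general shape of fine-tuning with the Hugging Face transformers and datasets libraries. The two toy examples stand in for an annotated corpus, and the model name and training settings are illustrative rather than recommended.

```python
# Minimal sketch: fine-tuning a pre-trained transformer for binary hate speech
# classification. The two toy examples stand in for an annotated corpus.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["placeholder hateful post", "placeholder innocuous post"]
labels = [1, 0]  # 1 = hate speech, 0 = other content

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length",
                     max_length=64)

train_data = Dataset.from_dict({"text": texts, "label": labels}).map(
    tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hate-speech-model", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=train_data,
)
trainer.train()  # updates the pre-trained weights on the annotated examples
```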

The performance of hate speech detection models should always be evaluated using out-

of-sample annotated data never used during training. A key challenge is to avoid overfitting,

where models learn to predict the training examples but fail to generalize to new data (see

Molina and Garip 2019). Predictive performance is measured by using the predictions for these

out-of-sample data to calculate precision, recall, and the F1-score, the harmonic mean of the two.

Confusion matrices are often used to identify where a model makes incorrect predictions (e.g.,

Davidson et al. 2017). Qualitative inspection of predictions and associated documents can help to

reveal how the model is making decisions. Misclassified documents can be particularly fruitful,

allowing analysts to identify the patterns that cause problems for classifiers (Warner and

Hirschberg 2012). Evaluation templates can provide more nuanced data on model performance,

allowing models to be tested against various curated examples, including difficult-to-judge cases

such as counter speech, reclaimed slurs, and texts with intentional misspellings (Röttger et al.

2021).
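The sketch below illustrates this evaluation step with scikit-learn: a per-class report, a confusion matrix, and a qualitative pass over misclassified documents. The held-out labels, predictions, and documents are invented placeholders.

```python
# Minimal sketch: out-of-sample evaluation. The held-out labels, predictions,
# and documents are invented placeholders.
from sklearn.metrics import classification_report, confusion_matrix

y_test = [1, 0, 1, 0, 1, 0]      # human annotations for held-out documents
predicted = [1, 0, 0, 0, 1, 1]   # model predictions for the same documents
held_out_docs = [f"held-out document {i}" for i in range(6)]

print(classification_report(y_test, predicted))  # precision, recall, F1 per class
print(confusion_matrix(y_test, predicted))       # rows = true class, columns = predicted

# Qualitative inspection: read the documents the model got wrong.
for doc, true, pred in zip(held_out_docs, y_test, predicted):
    if true != pred:
        print(true, pred, doc)
```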

BIAS AND DISCRIMINATION IN HATE SPEECH DETECTION SYSTEMS

Scholars have drawn attention to bias and discrimination in machine learning systems, ranging

from Google Search (Noble 2018) and face-recognition models (Buolamwini and Gebru 2018) to

loan risk (O’Neil 2016) and welfare allocation models (Eubanks 2018). Bias in hate speech

detection systems is particularly pernicious because victims of hate speech and members of

targeted groups can be mistakenly accused of being hate speakers. Hate speech detection

classifiers, including Google’s state-of-the-art Perspective API, have been shown to discriminate

against African-Americans by disproportionately flagging tweets written in African-American

English (AAE) as hateful and abusive (Davidson, Bhattacharya, and Weber 2019; Sap et al.

2019). Such biases extend to other groups, including sexual and religious minorities. For

example, the Perspective API tended to classify statements containing neutral identity terms like

“queer,” “lesbian,” and “muslim” as toxic (Dixon et al. 2018). Other commercial hate speech

detection systems are more difficult to audit, but reports from users and information uncovered

by journalists and leaked by whistleblowers suggest widespread problems. For example,

Facebook often mislabeled posts about racism as hate speech12 and missed violent threats against

Muslims.13 Most existing studies and reporting have focused on the U.S., and little is known

about the extent to which these systems discriminate against minority groups in other national,

linguistic, and cultural contexts.

12. https://www.washingtonpost.com/technology/2021/11/21/facebook-algorithm-biased-race/ (Accessed 10/6/22).
13. https://www.propublica.org/article/facebook-hate-speech-censorship-internal-documents-algorithms (Accessed 10/6/22).

I argue that two aspects of the training procedure are the root cause of these biases. The

first is annotation. Until recently, most researchers paid scant attention to who provided

annotations or whether they harbored attitudes that could influence their judgments. Social

psychologists have shown that perceptions of hate speech vary according to the race and gender

of the observer (Cowan and Hodge 1996), as well as their attitudes, with individuals scoring

higher on anti-Black racism scales less likely to consider violent acts against Black people as

hate crimes (Roussos and Dovidio 2018). Survey evidence shows that males and conservatives

find online hate speech “less disturbing” than others (Costello et al. 2019). These biases can

enter into datasets, as white, male, conservative annotators are less likely to consider texts

containing anti-Black racism to be hateful (Sap et al. 2022). Even well-intentioned researchers

and annotators can produce problematic data: Waseem and Hovy (2016) drew upon critical race

theory to develop an annotation scheme and employed a gender studies student and a “non-

activist feminist” to check annotations for bias, yet models trained on their dataset

disproportionately classify tweets written in AAE as racist and sexist (Davidson et al. 2019).

A second area that has received less attention is sampling. Lexicon-based sampling can

result in the oversampling of texts from targeted groups. For example, texts extracted from

Twitter using the keywords from the Hatebase lexicon included a disproportionate amount of

content produced by African-Americans, mainly due to the inclusion of the reclaimed slur

“n*gga,” and annotators almost universally tended to flag tweets using this term as hateful or

offensive (Waseem, Thorne, and Bingel 2018). This over-representation led classifiers trained on

these data to associate AAE with hate speech (Davidson et al. 2019; Harris et al. 2022).

Sampling bias can also arise in other ways. The bias identified in the Perspective classifier

occurred because neutral identity terms were often used pejoratively in the training data (Dixon

et al. 2018). Such problems may become more ubiquitous as we increasingly rely on language

models that learn stereotypical associations present in the vast corpora of text scraped from the

internet (Bender et al. 2021).

What can be done to address bias in machine learning classifiers? Computer scientists

have attempted to develop technical solutions for post hoc bias mitigation with limited success,

often obscuring underlying problems (Gonen and Goldberg 2019) or generating other

problematic behavior (Xu et al. 2021). It is difficult to develop simple technical fixes for

ingrained issues that systematically skew model outputs—as the adage goes, “garbage in,

garbage out.” Regarding annotation, it is plausible that annotators more attuned to context, if not

members of the targeted groups, would help to mitigate biases, an argument Ruha Benjamin

(2019) makes in her analysis of race and technology. For example, formerly gang-involved youth

have been employed as domain experts to identify tweets related to offline violence in Chicago,

demonstrating the value of localized knowledge (Frey et al. 2020). Better training and guidance

are another promising avenue. Sap and colleagues (2019) found that instructing annotators about

racial bias and dialect variation resulted in fewer racialized misclassifications. Turning to

sampling, synthetic examples generated either by humans or large language models can yield

training data that is more representative and less susceptible to bias (Hartvigsen et al. 2022;

Vidgen, Thrush, et al. 2021). Methods have also been proposed to evaluate annotated datasets

and trained models. Templates can be used to audit classifiers for different kinds of bias (Röttger

et al. 2021), and unsupervised methods like topic modeling can help to surface biases in

annotated corpora (Davidson and Bhattacharya 2020). Initiatives to provide more detailed

documentation will also make it easier to identify potential problems with training data and

models (Mitchell et al. 2019; Gebru et al. 2021). There is no panacea for addressing bias in

supervised text classification systems, so it is critical that researchers understand how these

problems can arise at each step in the machine learning pipeline, from the collection and

annotation of training data to model evaluation and deployment.
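To make the template idea concrete, the sketch below scores otherwise identical sentences that differ only in an identity term, in the spirit of Dixon et al. (2018) and Röttger et al. (2021). Here `classifier` is an assumed fitted model with a predict_proba method (for example, the `search` pipeline from the earlier sketch), and the template and identity terms are illustrative.

```python
# Minimal sketch: a template-style bias audit. `classifier` is an assumed
# fitted model with a predict_proba method; the template and identity terms
# are illustrative.
identity_terms = ["queer", "lesbian", "muslim", "christian", "american"]
template = "I am a proud {} person"

for term in identity_terms:
    sentence = template.format(term)
    hate_probability = classifier.predict_proba([sentence])[0][1]
    print(f"{hate_probability:.2f}  {sentence}")

# Large differences in scores across identity terms on otherwise identical,
# innocuous sentences indicate unintended bias against particular groups.
```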

CONCLUSION

Supervised text classification used in conjunction with digital trace data from social media

enables us to study social life and social problems at an unprecedented scale. Even highly

subjective phenomena like hate speech can be identified automatically with relatively high

accuracy, given sufficient, high-quality annotated training data. This chapter provides an

overview of the methodology behind these approaches and highlights some cutting-edge

technologies that computational social scientists are only beginning to utilize. These methods

have become more powerful and accessible, opening up new opportunities for social scientific

research. However, this chapter also provides a cautionary tale, demonstrating how classification

systems can have unintended consequences, reproducing some of the same types of

discriminatory behavior we seek to mitigate. When developing training data or using a pre-

trained model, we should carefully consider how sampling and annotation procedures can result

in downstream biases. There is still much work to be done to better understand and mitigate

these issues.

Beyond the methodological concerns, it is important to emphasize how these systems can

impact millions of people’s lives. This makes the analysis of hate speech detection and other

related forms of content moderation an essential topic for sociological inquiry. Social media

platforms now routinely conduct content moderation at a vast, global scale (Gillespie 2018;

Roberts 2019): Facebook now claims to proactively detect 94.7% of hate speech, removing

millions of hateful posts every month.14 In this respect, platforms now operate more like

governments than companies as they regulate speech across the globe (Klonick 2018). These

systems can make life better for many by reducing their exposure to hateful and abusive content,

but they can also backfire, leading to discrimination and censorship. Social scientists have a role

to play in identifying these issues and informing debates to help make these systems more

democratic, accountable, and transparent. Understanding how these technologies work and when

they can go awry is critical to these efforts.

References

Alrababa’h, Ala’, William Marble, Salma Mousa, and Alexandra A. Siegel. 2021. “Can
Exposure to Celebrities Reduce Prejudice? The Effect of Mohamed Salah on
Islamophobic Behaviors and Attitudes.” American Political Science Review
115(4):1111–28.

Anti-Defamation League. 2021. Online Hate and Harassment: The American Experience 2021.
Anti-Defamation League.

Barberá, Pablo, Amber E. Boydstun, Suzanna Linn, Ryan McMahon, and Jonathan Nagler. 2020.
“Automated Text Classification of News Articles: A Practical Guide.” Political Analysis
1–24.

Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021.
“On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” Pp. 610–23
in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and
Transparency. ACM.

Benjamin, Ruha. 2019. Race after Technology: Abolitionist Tools for the New Jim Code. Polity.

Benoit, Kenneth, Drew Conway, Benjamin E. Lauderdale, Michael Laver, and Slava Mikhaylov.
2016. “Crowd-Sourced Text Analysis: Reproducible and Agile Production of Political
Data.” American Political Science Review 110(2):278–95.

Brown, Tom, and many others. 2020. “Language Models Are Few-Shot Learners.” Pp. 1877–
1901 in Advances in Neural Information Processing Systems. Vol. 33.

14. https://ai.facebook.com/blog/training-ai-to-detect-hate-speech-in-the-real-world/ (Accessed 10/6/22).

Buolamwini, Joy, and Timnit Gebru. 2018. “Gender Shades: Intersectional Accuracy Disparities
in Commercial Gender Classification.” Pp. 1–15 in Proceedings of Machine Learning
Research. Vol. 81.

Caselli, Tommaso, Valerio Basile, Jelena Mitrović, and Michael Granitzer. 2021. “HateBERT:
Retraining BERT for Abusive Language Detection in English.” Pp. 17–25 in
Proceedings of the 5th Workshop on Online Abuse and Harms. ACL.

Chandrasekharan, Eshwar, Umashanthi Pavalanathan, Anirudh Srinivasan, Adam Glynn, Jacob


Eisenstein, and Eric Gilbert. 2017a. “You Can’t Stay Here: The Efficacy of Reddit’s
2015 Ban Examined through Hate Speech.” Proceedings of the ACM on Human-
Computer Interaction 1:1–22.

Chandrasekharan, Eshwar, Mattia Samory, Anirudh Srinivasan, and Eric Gilbert. 2017b. “The
Bag of Communities: Identifying Abusive Behavior Online with Preexisting Internet
Data.” Pp. 3175–87 in Proceedings of the CHI Conference on Human Factors in
Computing Systems. ACM Press.

Chiu, Ke-Li, Annie Collins, and Rohan Alexander. 2022. “Detecting Hate Speech with GPT-3.”

Costello, Matthew, James Hawdon, Colin Bernatzky, and Kelly Mendes. 2019. “Social Group
Identity and Perceptions of Online Hate.” Sociological Inquiry 89(3):427–52.

Cowan, Gloria, and Cyndi Hodge. 1996. “Judgments of Hate Speech: The Effects of Target
Group, Publicness, and Behavioral Responses of the Target.” Journal of Applied Social
Psychology 26(4):355–74.

Davidson, Thomas, and Debasmita Bhattacharya. 2020. “Examining Racial Bias in an Online
Abuse Corpus with Structural Topic Modeling.” in ICWSM Data Challenge Workshop.

Davidson, Thomas, Debasmita Bhattacharya, and Ingmar Weber. 2019. “Racial Bias in Hate
Speech and Abusive Language Detection Datasets.” Pp. 25–35 in Proceedings of the
Third Workshop on Abusive Language Online: ACL.

Davidson, Thomas, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. “Automated Hate
Speech Detection and the Problem of Offensive Language.” Pp. 512–15 in Proceedings
of the 11th International Conference on Web and Social Media.

Denny, Matthew J., and Arthur Spirling. 2018. “Text Preprocessing For Unsupervised Learning:
Why It Matters, When It Misleads, And What To Do About It.” Political Analysis
26(02):168–89.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-
Training of Deep Bidirectional Transformers for Language Understanding.” Pp. 4171–86
in Proceedings of NAACL-HLT 2019. ACL.

Dixon, Lucas, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. “Measuring
and Mitigating Unintended Bias in Text Classification.” Pp. 67–73 in Proceedings of the
2018 Conference on AI, Ethics, and Society. ACM Press.

Djuric, Nemanja, Jing Zhou, Robin Morris, Mihajlo Grbovic, Vladan Radosavljevic, and
Narayan Bhamidipati. 2015. “Hate Speech Detection with Comment Embeddings.” Pp.
29–30 in WWW 2015 Companion. ACM Press.

Eubanks, Virginia. 2018. Automating Inequality: How High-Tech Tools Profile, Police, and
Punish the Poor. St. Martin’s Press.

Fieseler, Christian, Eliane Bucher, and Christian Pieter Hoffmann. 2019. “Unfairness by Design?
The Perceived Fairness of Digital Labor on Crowdworking Platforms.” Journal of
Business Ethics 156(4):987–1005.

Firth, John R. 1957. “A Synopsis of Linguistic Theory, 1930-1955.” Studies in Linguistic


Analysis.

Fortuna, Paula, and Sérgio Nunes. 2018. “A Survey on Automatic Detection of Hate Speech in
Text.” ACM Computing Surveys 51(4):1–30. doi: 10.1145/3232676.

Founta, Antigoni-Maria, Constantinos Djouvas, Despoina Chatzakou, Ilias Leontiadis, Jeremy


Blackburn, Gianluca Stringhini, Athena Vakali, Michael Sirivianos, and Nicolas
Kourtellis. 2018. “Large Scale Crowdsourcing and Characterization of Twitter Abusive
Behavior.” Pp. 491–500 in Proceedings of the Twelth International AAAI Conference on
Web and Social Media.

Frey, William R., Desmond U. Patton, Michael B. Gaskell, and Kyle A. McGregor. 2020.
“Artificial Intelligence and Inclusion: Formerly Gang-Involved Youth as Domain Experts
for Analyzing Unstructured Twitter Data.” Social Science Computer Review 38(1):42–56.

Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna
Wallach, Hal Daumé Iii, and Kate Crawford. 2021. “Datasheets for Datasets.”
Communications of the ACM 64(12):86–92.

Gillespie, Tarleton. 2018. Custodians of the Internet: Platforms, Content Moderation, and the
Hidden Decisions That Shape Social Media. Yale University Press.

Golbeck, Jennifer, and many others. 2017. “A Large Labeled Corpus for Online Harassment
Research.” Pp. 229–33 in Proceedings of the 2017 ACM on Web Science Conference.
ACM Press.

Gonen, Hila, and Yoav Goldberg. 2019. “Lipstick on a Pig: Debiasing Methods Cover up
Systematic Gender Biases in Word Embeddings But Do Not Remove Them.” Pp. 609–14
in Proceedings of NAACL-HLT. ACL.

Harris, Camille, Matan Halevy, Ayanna Howard, Amy Bruckman, and Diyi Yang. 2022.
“Exploring the Role of Grammar and Word Choice in Bias Toward African American
English (AAE) in Hate Speech Classification.” Pp. 789–98 in Conference on
Fairness, Accountability, and Transparency. ACM Press.

Hartvigsen, Thomas, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece
Kamar. 2022. “ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and
Implicit Hate Speech Detection.” Pp. 3309–26 in Proceedings of the 60th Annual
Meeting of the Association for Computational Linguistics. ACL.

Kaye, David A. 2019. Speech Police: The Global Struggle to Govern the Internet. Columbia
Global Reports.

Kiela, Douwe, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik
Ringshia, and Davide Testuggine. 2020. “The Hateful Memes Challenge: Detecting Hate
Speech in Multimodal Memes.” Pp. 1–14 in 34th Conference on Neural Information
Processing Systems.

King, Gary, Patrick Lam, and Margaret E. Roberts. 2017. “Computer-Assisted Keyword and
Document Set Discovery from Unstructured Text.” American Journal of Political
Science 61(4):971–88.

Kirk, Hannah, Bertie Vidgen, and Scott Hale. 2022. “Is More Data Better? Re-Thinking the
Importance of Efficiency in Abusive Language Detection with Transformers-Based
Active Learning.” Pp. 52–61 in Proceedings of the Third Workshop on Threat,
Aggression and Cyberbullying. ACL.

Kirk, Hannah, Bertie Vidgen, Paul Rottger, Tristan Thrush, and Scott Hale. 2022. “Hatemoji: A
Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-
Based Hate.” Pp. 1352–68 in Proceedings of the 2022 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language
Technologies. ACL.

Klonick, Kate. 2018. “The New Governors: The People, Rules, and Processes Governing Online
Speech.” Harvard Law Review 131(6):1598–1670.

Kozlowski, Austin C., Matt Taddy, and James A. Evans. 2019. “The Geometry of Culture:
Analyzing the Meanings of Class through Word Embeddings.” American Sociological
Review 84(5):905–49.

Kwok, Irene, and Yuzhou Wang. 2013. “Locate the Hate: Detecting Tweets against Blacks.” Pp.
1621–22 in Proceedings of the Twenty-Seventh AAAI Conference on Artificial
Intelligence.

Magu, Rijul, Kshitij Joshi, and Jiebo Luo. 2017. “Detecting the Hate Code on Social Media.” Pp.
608–11 in Proceedings of the 11th International Conference on Web and Social Media.

Magu, Rijul, and Jiebo Luo. 2018. “Determining Code Words in Euphemistic Hate Speech Using
Word Embedding Networks.” Pp. 93–100 in Proceedings of the 2nd Workshop on
Abusive Language Online. ACL.

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. “Distributed
Representations of Words and Phrases and Their Compositionality.” Pp. 3111–19 in
Advances in Neural Information Processing Systems.

Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben
Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. “Model
Cards for Model Reporting.” Pp. 220–29 in Proceedings of the Conference on Fairness,
Accountability, and Transparency. ACM Press.

Mitts, Tamar. 2019. “From Isolation to Radicalization: Anti-Muslim Hostility and Support for
ISIS in the West.” American Political Science Review 113(1):173–94.

Molina, Mario, and Filiz Garip. 2019. “Machine Learning for Sociology.” Annual Review of
Sociology 45:27–45.

Müller, Karsten, and Carlo Schwarz. 2021. “Fanning the Flames of Hate: Social Media and Hate
Crime.” Journal of the European Economic Association 19(4):2131–67.

Munger, Kevin. 2016. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist
Harassment.” Political Behavior.

Noble, Safiya Umoja. 2018. Algorithms of Oppression: How Search Engines Reinforce Racism.
New York, NY: NYU Press.

Olteanu, Alexandra, Carlos Castillo, Jeremy Boy, and Kush Varshney. 2018. “The Effect of
Extremist Violence on Hateful Speech Online.” Proceedings of the International AAAI
Conference on Web and Social Media 12(1).

O’Neil, Cathy. 2016. Weapons of Math Destruction: How Big Data Increases Inequality and
Threatens Democracy. Broadway Books.

Pew Research Center. 2020. “Partisans in the U.S. Increasingly Divided on Whether Offensive
Content Online Is Taken Seriously Enough.” Pew Research Center. Retrieved October 7,
2022 (https://www.pewresearch.org/fact-tank/2020/10/08/partisans-in-the-u-s-
increasingly-divided-on-whether-offensive-content-online-is-taken-seriously-enough/).

Pew Research Center. 2021. “The State of Online Harassment.”

Ribeiro, Manoel Horta, Pedro H. Calais, Yuri A. Santos, Virgílio A. F. Almeida, and Wagner
Meira Jr. 2017. “Characterizing and Detecting Hateful Users on Twitter.” in Proceedings
of the International AAAI Conference on Web and Social Media.

Roberts, Sarah T. 2019. Behind the Screen. Yale University Press.

Röttger, Paul, Bertie Vidgen, Dong Nguyen, Zeerak Waseem, Helen Margetts, and Janet
Pierrehumbert. 2021. “HateCheck: Functional Tests for Hate Speech Detection Models.”
Pp. 41–58 in Proceedings of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International Joint Conference on Natural
Language Processing. ACL.

Roussos, Gina, and John F. Dovidio. 2018. “Hate Speech Is in the Eye of the Beholder: The
Influence of Racial Attitudes and Freedom of Speech Beliefs on Perceptions of Racially
Motivated Threats of Violence.” Social Psychological and Personality Science 9(2):176–
85.

Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. “The Risk of
Racial Bias in Hate Speech Detection.” Pp. 1668–78 in Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics. ACL.

Sap, Maarten, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. 2020.
“Social Bias Frames: Reasoning about Social and Power Implications of Language.” Pp.
5477–90 in Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics. ACL.

Sap, Maarten, Swabha Swayamdipta, Laura Vianna, Xuhui Zhou, Yejin Choi, and Noah Smith.
2022. “Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic
Language Detection.” Pp. 5884–5906 in Proceedings of the NAACL-HLT. ACL.

Siegel, Alexandra A. 2020. “Online Hate Speech.” Social Media and Democracy: The State of
the Field, Prospects for Reform 56–88.

Siegel, Alexandra A., Evgenii Nikitin, Pablo Barberá, Joanna Sterling, Bethany Pullen, Richard
Bonneau, Jonathan Nagler, and Joshua A. Tucker. 2021. “Trumping Hate on Twitter?
Online Hate Speech in the 2016 US Election Campaign and Its Aftermath.” Quarterly
Journal of Political Science 16(1):71–104.

Smith, Noah A. 2020. “Contextual Word Representations: Putting Words into Computers.”
Communications of the ACM 63(6):66–74.

Vidgen, Bertie, and Leon Derczynski. 2020. “Directions in Abusive Language Training Data, a
Systematic Review: Garbage in, Garbage Out.” PLoS ONE 15(12):e0243300.

Vidgen, Bertie, Helen Margetts, and Alex Harris. 2019. “How Much Online Abuse Is There: A
Systematic Review of Evidence for the UK.” Alan Turing Institute.

Vidgen, Bertie, Dong Nguyen, Helen Margetts, Patricia Rossini, and Rebekah Tromble. 2021.
“Introducing CAD: The Contextual Abuse Dataset.” Pp. 2289–2303 in Proceedings of
the NAACL-HLT. ACL.

Vidgen, Bertie, Tristan Thrush, Zeerak Waseem, and Douwe Kiela. 2021. “Learning from the
Worst: Dynamically Generated Datasets to Improve Online Hate Detection.” Pp. 1667–
82 in Proceedings of the 59th Annual Meeting of the Association for Computational
Linguistics and the 11th International Joint Conference on Natural Language
Processing. ACL.

Warner, William, and Julia Hirschberg. 2012. “Detecting Hate Speech on the World Wide Web.”
Pp. 19–26 in Proceedings of the Second Workshop on Language in Social Media. ACL.

Waseem, Zeerak, Thomas Davidson, Dana Warmsley, and Ingmar Weber. 2017. “Understanding
Abuse: A Typology of Abusive Language Detection Subtasks.” Pp. 78–84 in
Proceedings of the First Workshop on Abusive Language Online. ACL.

Waseem, Zeerak, and Dirk Hovy. 2016. “Hateful Symbols or Hateful People? Predictive
Features for Hate Speech Detection on Twitter.” Pp. 88–93 in Proceedings of NAACL-
HLT.

Waseem, Zeerak, James Thorne, and Joachim Bingel. 2018. “Bridging the Gaps: Multi Task
Learning for Domain Transfer of Hate Speech Detection.” Pp. 29–55 in Online
Harassment, Human-Computer Interaction Series, edited by J. Goldbeck. New York,
NY: Springer.

Wiegand, Michael, Josef Ruppenhofer, Anna Schmidt, and Clayton Greenberg. 2018. “Inducing
a Lexicon of Abusive Words – a Feature-Based Approach.” Pp. 1046–56 in Proceedings
of the NAACL-HLT. ACL.

Wilson, Richard Ashby, and Molly K. Land. 2021. “Hate Speech on Social Media: Content
Moderation in Context.” Connecticut Law Review 52(3):1029–76.

Xu, Albert, Eshaan Pathak, Eric Wallace, Suchin Gururangan, Maarten Sap, and Dan Klein.
2021. “Detoxifying Language Models Risks Marginalizing Minority Voices.” Pp. 2390–
97 in Proceedings of NAACL-HLT. ACL.

Zampieri, Marcos, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh
Kumar. 2019. “Predicting the Type and Target of Offensive Posts in Social Media.” Pp.
1415–20 in Proceedings of NAACL-HLT. ACL.

Zhou, Xuhui, Hao Zhu, Akhila Yerukola, Thomas Davidson, Jena D. Hwang, Swabha
Swayamdipta, and Maarten Sap. 2023. “COBRA Frames: Contextual Reasoning about
Effects and Harms of Offensive Statements.” in Proceedings of the 61st Annual Meeting
of the Association for Computational Linguistics. ACL.
