
Socially Responsible Hate Speech Detection:

Can Classifiers Reflect Social Stereotypes?


Francielle Vargas1,2, Isabelle Carvalho1, Ali Hürriyetoğlu3, Thiago A. S. Pardo1, Fabrício Benevenuto2
1 Institute of Mathematical and Computer Sciences, University of São Paulo, Brazil
2 Computer Science Department, Federal University of Minas Gerais, Brazil
3 KNAW Humanities Cluster, DHLab, the Netherlands
[email protected]

Abstract

Recent studies have shown that hate speech technologies may propagate social stereotypes against marginalized groups. Nevertheless, there has been a lack of realistic approaches to assess and mitigate biased technologies. In this paper, we introduce a new approach to analyze the potential of hate speech classifiers to reflect social stereotypes through the investigation of stereotypical beliefs, contrasting them with counter-stereotypes. We empirically measure the distribution of stereotypical beliefs by analyzing the distinctive classification of tuples containing stereotypes versus counter-stereotypes in machine learning models and datasets. Experiment results show that hate speech classifiers attribute unreal or negligent offensiveness to social identity groups by reflecting and reinforcing stereotypical beliefs regarding minorities. Furthermore, we also found that models that embed expert and context information from offensiveness markers present promising results for mitigating social stereotype bias towards socially responsible hate speech detection.1

Examples | Classes
They've been known to disregard people who march in Gay Pride parades. | Hate Speech
They've been known to disregard people who march in Christmas parades. | Non-Hate Speech
They've been known to disregard people who march in parades. | Non-Hate Speech
We can't go to that one in this Mexican neighborhood. You might be robbed. | Non-Hate Speech
We can't go to that one in this Canadian neighborhood. You might be robbed. | Hate Speech
We can't go to that one in this neighborhood. You might be robbed. | Hate Speech

Table 1: Stereotype Bias in Hate Speech Detection.

1 Introduction

While Artificial Intelligence (AI) technologies have generated unprecedented opportunities for society, they have also introduced new forms of perpetuating inequality and heightened threats to human rights and well-being (UN, 2023).

In this context, the investigation of Hate Speech (HS) is undoubtedly important, since the proposition of automated systems has implications for unprejudiced societies. Nevertheless, researchers have repeatedly observed that these technologies are being developed with scarce consideration of their potential social biases, which may perpetuate social inequalities when propagated at scale (Davani et al., 2023; Blodgett et al., 2020; Chuang et al., 2021; Xia et al., 2020; Wiegand et al., 2019; Sap et al., 2019; Bordia and Bowman, 2019; Davidson et al., 2019). For example, Table 1 shows that the hate speech classifier attributed unreal offensiveness to the first example only due to the expression "Gay Pride", which refers to a social identity2 group. We observe that in the second example the expression "Gay Pride" was replaced by "Christmas", and in the third example it was removed. The second and third examples were classified as non-hate speech, while the first one was classified as hate speech. Furthermore, the hate speech classifier neglected the offensiveness of the fourth example only due to the term "Mexican".

According to Warner and Hirschberg (2012), hate speech is a particular form of offensive language that draws on stereotypes to express an ideology of hate. A stereotype is an over-generalized belief about a particular group of people (e.g., Asians are good at math or African Americans are athletic), and such beliefs (biases) are known to target social groups (Nadeem et al., 2021). Social and stereotypical biases are forms of discrimination against a social group based on characteristics such as gender, sexual orientation, religion, ethnicity, etc. (Fiske, 1993; Sahoo et al., 2022).

1 Warning: This paper contains examples of offensive content and stereotypes. They do not reflect our views.
2 Social identity is a theory of social psychology that offers a motivational explanation for in-group bias.

Hate speech technologies reflect social stereotypes due to bias in the training data (Davidson et al., 2019; Yörük et al., 2022), triggered early during human annotation (Wiegand et al., 2019), in text representations that learn normative social stereotypes associated with systematic prediction errors (Davani et al., 2023), and also due to missing context information (Davidson et al., 2019). For example, if "programmer" appears more frequently with "he" than with "she" in the training data, the model will learn a biased association with "he" compared to "she" (Qian, 2019). In the same settings, if "African American" frequently appears associated with vocabulary related to baseball and violence, the model will potentially learn this association from the training data. Both examples demonstrate the harmful potential of HS classifiers reflecting different types of social stereotypical beliefs that may negatively influence people's perception of marginalized groups.

State-of-the-art analysis of social stereotypes in Hate Speech Detection (HSD) is definitely an under-explored issue. Recently, a few works have analyzed social stereotype bias in (i) text representation, which maps textual data to numeric representations in a semantic space, and (ii) human annotations, which represent subjective judgments about hate speech in text content and constitute the training dataset. In both cases, social stereotypes may be included in the final trained model (Davani et al., 2023; Elsafoury, 2022). A recent study by Davani et al. (2023) concluded that hate speech classifiers can learn normative social stereotypes when their mapping of language to numeric representations is affected by stereotypical co-occurrences in the training data.

The social psychology literature suggests that one of the most effective ways to reduce biased thinking is countering stereotypical beliefs with counter-stereotypes (also known as anti-stereotypes) (Fraser et al., 2021). For instance, when a human is asked to classify a tuple containing a social stereotype and a counter-stereotype and the result is a distinctive classification, this evidences biased stereotypical beliefs. In this same setting, Finnegan et al. (2015) conducted experiments in which participants were shown stereotypical and counter-stereotypical images of socially gendered professions (e.g., a surgeon is stereotypically male, and a nurse is stereotypically female). They reversed the genders in the counter-stereotypical images and then measured gender bias in a judgment task. Results showed that exposure to counter-stereotypical images significantly reduced gender-normative stereotypes. Finally, de Vassimon Manela et al. (2021), Blair IV (2001), and Nilanjana and G. (2001) also used the same strategy to mitigate socially biased thinking.

In this paper, we study the potential of HS classifiers to reflect social stereotypes against marginalized groups. We propose a new approach, entitled Social Stereotype Analysis (SSA), which consists of analyzing stereotypical beliefs by contrasting them with counter-stereotypes. We first implement HS classifiers using different Machine Learning (ML) text representations on two datasets in English and Portuguese, composed of Twitter and Instagram data. Then, we assess the potential of these models to reflect social stereotypes through a distinctive analysis of tuples containing stereotypes versus counter-stereotypes. The results demonstrate that HS classifiers may provide unreal or negligent offensiveness classification to social identity groups, hence reflecting and reinforcing social stereotypical beliefs against marginalized groups. Finally, based on our findings, ML models that embed expert and context information from explicit and implicit offensiveness markers present promising results towards mitigating the risk of HS classifiers propagating social stereotypical beliefs. Our contributions may be summarized as follows:

• We study and empirically analyze the potential of HS classifiers to reflect social stereotypes against marginalized groups.

• We provide a set of experiments with different ML models in two languages (English and Portuguese). The datasets and code are available3, which may facilitate future research.

• We propose a new approach for assessing the potential of HS classifiers to reflect social stereotypes. Our approach consists of analyzing whether HS classifiers classify tuples containing stereotypes and counter-stereotypes in the same way. Otherwise, they are potentially biased.

3 https://github.com/franciellevargas/SSA

2 Related Work

Bias in Human Annotation and Datasets: Bias may be triggered early during human annotation. As a result, biased datasets propagate their social bias through the training data. According to Vargas et al. (2022), a strategy based on a diversified profile of annotators (e.g., gender, race/color, political orientation, etc.) and balanced variables during data collection should be adopted to mitigate social biases. Furthermore, they proposed an annotation schema for hate speech and offensive language detection in Brazilian Portuguese towards social bias mitigation. Davidson et al. (2019) analyzed racial bias by training classifiers on Twitter HS datasets in order to identify whether tweets written in African-American English are classified as abusive more frequently than tweets written in Standard American English. This phenomenon reflects widely held beliefs about different social categories and may harm minority social groups. Sap et al. (2019) investigated how social context (e.g., dialect) can influence annotators' decisions, leading to racial bias that may be propagated through models trained on biased datasets. Wiegand et al. (2019) discussed the impact of data bias on abusive language detection, highlighting weaknesses of different datasets and their effects on classifiers trained on them. Based on this work, Razo and Kübler (2020) analyzed different data sampling strategies to investigate sampling bias in abusive language detection. Dinan et al. (2020) analyzed the behavior of gender bias in dialogue datasets and different techniques to mitigate it. Towards reducing lexical and dialectal biases, Chuang et al. (2021) proposed the use of invariant rationalization to eliminate syntactic and semantic patterns in input texts that exhibit a high but spurious correlation with the toxicity labels. Wich et al. (2021) investigated annotator bias in abusive language data, resulting from the annotator's personal interpretation and the intricacy of the annotation process, and proposed a set of methods to measure the occurrence of this type of bias. Ramponi and Tonelli (2022) rigorously evaluated lexical biases in hate speech detection, uncovering the impact of biased artifacts on model robustness and fairness and identifying artifacts that require specific treatment. Davani et al. (2023) analyzed the influence of social stereotypes on annotated datasets and on the automatic identification of hate speech in English.

Bias in Text Representation: Bias is also found in classical and neural machine learning-based models, which often fail to mitigate different types of social bias. Park et al. (2018) analyzed gender biases using three bias mitigation methods on models trained with different abusive language datasets, utilizing a wide range of pre-trained word embeddings and model architectures. Due to the existence of systematic racial bias in trained classifiers, Mozafari et al. (2020) presented a bias alleviation mechanism to mitigate the impact of bias in training data, along with a transfer learning approach for the identification of hate speech. Wich et al. (2020) analyzed the impact of political bias on hate speech models by constructing three politically biased datasets and using an explainable AI method to visualize bias in classifiers trained on them. Manerba and Tonelli (2021) proposed a fine-grained analysis to investigate how BERT-based classifiers perform regarding fairness and biased data. Elsafoury et al. (2022) measured Systematic Offensive Stereotyping (SOS) in word embeddings. According to the authors, SOS can associate marginalized groups with hate speech and profanity vocabulary, which may trigger prejudice against and silencing of these groups. Sahoo et al. (2022) proposed a curated dataset and trained transformer-based models to detect social biases, their categories, and targeted groups in toxic language. Elsafoury (2022) analyzed the biases of state-of-the-art hate speech and abuse detection models and investigated biases other than social stereotypes.

3 Definitions

Here, we describe in detail the definitions of hate speech and social stereotypes used in this paper.

Hate Speech: We assume that offensive language is a type of opinion-based information that is highly confrontational, rude, or aggressive (Zampieri et al., 2019), and that may be expressed explicitly or implicitly (Vargas et al., 2021; Poletto et al., 2021). In the same settings, hate speech is a particular form of offensive language used against target groups, mostly based on their social identities.

Social Stereotypes: Stereotypes are cognitive structures that contain the perceiver's knowledge, beliefs, and expectations about human groups (Peffley et al., 1997). Stereotypes can trigger positive and negative social bias, which refers to a preference for or against persons or groups based on their social identities (Sahoo et al., 2022).

4 The Proposed Approach

4.1 Motivations

While social stereotype bias in HSD has become a relevant and urgent research topic in recent years (Davani et al., 2023; Wiegand et al., 2019), it is still an under-explored issue. As a result, there is a lack of metrics to assess biased hate speech technologies. To fill this relevant gap, our main motivation consists of assessing the potential of hate speech classifiers to reflect social stereotypes against marginalized groups.

Most approaches to assess social stereotypes in HSD identify gender and racial stereotypes in text content by computing the difference in the co-occurrence and similarity of race-neutral and gender-neutral words compared to race- or ethnicity-related and female/male words (Qian, 2019; Caliskan et al., 2017; Chiril et al., 2021). In addition, the statistical association among words that describe each of these groups has also been explored in the literature (Nadeem et al., 2021).

Since a human-based distinctive classification of social stereotypes and counter-stereotypes may provide evidence of socially biased thinking (Fraser et al., 2021; Finnegan et al., 2015), we propose a new approach to assess social bias in HS classifiers. Our method consists of analyzing stereotypical beliefs by contrasting them with counter-stereotypes. We describe our approach in detail as follows.

4.2 Social Stereotypes Analysis (SSA)

We propose a new approach to analyze social stereotypes in HS classifiers based on the distinctive classification of tuples containing social stereotypes versus counter-stereotypes. Tuples whose stereotype and counter-stereotype are classified by the HS classifier with different classes (e.g., hate speech vs. non-hate speech) indicate that the classifier is reflecting social stereotypes, hence it is potentially biased. Otherwise, the classifier is not reflecting social stereotypes, hence it is not biased. Figure 1 illustrates our approach.

Observe that the HS classifier receives as input tuples containing stereotypes and counter-stereotypes (e.g., "Women are always too sensitive about things" (stereotype) and "Men are always too sensitive about things" (counter-stereotype)). Then, our approach assesses whether the HS classifier provides the same class or different classes for the tuple. The same class indicates unbiased; different classes indicate biased.

Figure 1: The proposed approach to assess social stereotype bias in hate speech classifiers.
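As a minimal sketch of this decision rule, the snippet below checks each tuple with a generic classification function; the function name `classify` and the returned label strings are illustrative assumptions, not part of the released code.

```python
from typing import Callable, List, Tuple

def ssa_flags(tuples: List[Tuple[str, str]],
              classify: Callable[[str], str]) -> List[bool]:
    """Return True for a tuple when the classifier assigns different classes
    to the stereotype and its counter-stereotype, i.e., when it is biased."""
    return [classify(stereo) != classify(counter) for stereo, counter in tuples]

# Example tuple from Section 4.2 (hypothetical model wrapper `model.predict_one`):
pairs = [("Women are always too sensitive about things",
          "Men are always too sensitive about things")]
# biased_flags = ssa_flags(pairs, model.predict_one)
```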
5 Experiments

5.1 Data Overview

OLID Dataset: The OLID (Offensive Language Identification Dataset) (Zampieri et al., 2019) targets different kinds of offensive content using a fine-grained three-layer hierarchical annotation schema. The schema consists of binary classification (offensive versus non-offensive); categorization of offensiveness (e.g., insult or untargeted insult); and hate speech targets (individual, social group, other). The dataset is based on tweets and reached a Fleiss's kappa of 83%. It totals 14,100 annotated tweets, of which 4,640 are classified as offensive and 9,460 as non-offensive.

HateBR Dataset: The HateBR (Vargas et al., 2022) is the first large-scale expert-annotated corpus of Instagram comments for Brazilian Portuguese hate speech detection. It was annotated according to three layers: binary classification (offensive versus non-offensive), fine-grained offensiveness (highly, moderately, slightly), and nine hate speech targets (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the dictatorship, antisemitism, and fatphobia). The dataset comprises 7,000 Instagram comments, 3,500 classified as offensive and 3,500 as non-offensive. The HateBR obtained a Cohen's kappa score of 75%.

MOL Lexicon: The MOL (Multilingual Offensive Lexicon) (Vargas et al., 2021) is a specialized lexicon originally written in Portuguese and manually translated by native speakers into five different languages. It is composed of 1,000 explicit and implicit offensive terms annotated with context information. For example, the term "bastard" is classified as context-independent, as it is mostly found in a pejorative context. Conversely, the term "worm" is classified as context-dependent, as it is found in both pejorative and non-pejorative contexts. The MOL reached a kappa of 73%.

CrowS-Pairs Tuples: The CrowS-Pairs (Crowdsourced Stereotype Pairs) (Nangia et al., 2020) is a benchmark fairness dataset to measure different forms of social bias against protected demographic groups. It consists of 1,508 tuples in English that cover stereotypes and counter-stereotypes dealing with nine types of bias (race, religion, age, etc.). In this paper, we randomly selected 300 tuples containing gender and racial bias, which were manually reviewed and translated by a linguist into Brazilian Portuguese. Furthermore, in order to provide a reliable and culturally oriented translation, a wide range of adaptations was proposed. For example, in "he was worried what other people would think because she was Muslim", "Muslim" was adapted to "candomblé"4. Lastly, the linguist also reviewed the tuples in both languages to ensure the same vocabulary, with variations only in the terms and expressions related to social identity groups.

4 Candomblé is an African religion developed in Brazil.

5.2 The Features Set and Learning Methods

Data Processing: We removed emoticons, special characters, account mentions, hyperlinks, and websites. Secondly, we lemmatized the datasets using spaCy and removed accentuation. We also applied undersampling to the OLID dataset in order to balance the classes. The HS model for English uses a binary class variable composed of 4,400 offensive versus 4,400 non-offensive tweets. For Portuguese, the HS model uses a binary class variable composed of 3,500 offensive versus 3,500 non-offensive Instagram comments. Finally, we used Python 3.6 with the Keras, scikit-learn, and pandas libraries, and split our data into 90% train and 10% test.

Learning Methods: We used a Support Vector Machine (SVM) with a linear kernel, and evaluated word embedding-based methods, namely fastText (Joulin et al., 2016) with Facebook pre-trained models, and BERT (Bidirectional Encoder Representations from Transformers), which pre-trains deep bidirectional representations from unlabeled texts by jointly conditioning on both left and right contexts (Devlin et al., 2019).
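A simplified sketch of this preprocessing and training setup is shown below. It assumes a pandas DataFrame with text and label columns and a spaCy Portuguese model; the file name, column names, and model name are illustrative assumptions, and the cleaning rules follow the description only approximately.

```python
import pandas as pd
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

nlp = spacy.load("pt_core_news_sm")  # assumed model name; an English model would be used for OLID

def preprocess(text: str) -> str:
    # Lemmatize and keep alphabetic tokens (mentions, links, and emoticons are dropped).
    return " ".join(tok.lemma_.lower() for tok in nlp(text) if tok.is_alpha)

df = pd.read_csv("hatebr.csv")  # assumed file with 'text' and 'label' columns
X = df["text"].map(preprocess)
y = df["label"]

# 90% train / 10% test split, as described in Section 5.2.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42)

# Bag-of-words (unigrams) + linear-kernel SVM baseline.
model = make_pipeline(CountVectorizer(ngram_range=(1, 1)), LinearSVC())
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```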
The Features Set: We used text feature representation models, namely bag-of-words (BoW) (Manning and Schutze, 1999), fastText (Joulin et al., 2016), and BERT (Devlin et al., 2019). Table 2 shows an overview of the five feature representations used in this paper.

Features | Description
BoW | Bag-of-Words
MOL | Bag-of-MOL
B+M | Bag-of-Words embodying the MOL
fastText | Facebook Word Embeddings
BERT | Bidirectional Encoder Representations from Transformers

Table 2: The features set overview.

BoW (Manning and Schutze, 1999) consists of a bag-of-words over unigrams. Hence, a text representation was generated that describes the occurrence of the dataset vocabulary in each document.

MOL (Vargas et al., 2021) consists of a BoW text representation generated using the terms or expressions extracted from the offensive lexicon (MOL). These terms were used as features, and a weight was embodied for each term labeled as context-dependent (weaker weight) or context-independent (stronger weight).

B+M (Vargas et al., 2021) consists of a BoW text representation generated from the dataset vocabulary using unigrams, which embodies context label information from the MOL and assigns a weight to terms labeled as context-dependent (weaker weight) or context-independent (stronger weight).

BERT (Devlin et al., 2019) and fastText (Joulin et al., 2016) consist of state-of-the-art word embeddings, with a maximum size of 1,000, a batch size of 64, and a learning rate of 0.00002, implemented in Keras. Specifically, for fastText, we evaluated the n-gram range for unigrams.
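The lexicon-informed representations (MOL and B+M) can be approximated as a unigram bag-of-words whose counts are rescaled for lexicon terms, with a stronger factor for context-independent entries. The sketch below is only an illustration: the lexicon entries and the two weight values are assumptions, not the exact values used by Vargas et al. (2021).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed MOL-style entries: term -> context label.
MOL_LEXICON = {"bastard": "independent",   # almost always pejorative
               "worm": "dependent"}        # pejorative only in some contexts

# Illustrative weights: context-independent markers get the stronger boost.
WEIGHTS = {"independent": 2.0, "dependent": 1.5}

def bplusm_features(corpus):
    """Unigram bag-of-words with MOL terms up-weighted (a B+M-style representation)."""
    vectorizer = CountVectorizer(ngram_range=(1, 1))
    X = vectorizer.fit_transform(corpus).astype(float).toarray()
    for term, context in MOL_LEXICON.items():
        col = vectorizer.vocabulary_.get(term)
        if col is not None:
            X[:, col] *= WEIGHTS[context]   # rescale the column of each lexicon term
    return X, vectorizer

X, vec = bplusm_features(["you bastard", "the worm ate the whole leaf"])
```

Restricting the vocabulary to the lexicon entries alone (e.g., CountVectorizer(vocabulary=list(MOL_LEXICON))) would give the MOL-only variant of this sketch.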
6 Results

Section 6.1 presents an error analysis of the ML models, ROC curves (see Figure 2), and an evaluation in terms of Precision, Recall, and F1-Score (see Table 4). In Section 6.2, we show the results of the proposed social stereotype analysis; Tables 5 and 3 report the obtained results in terms of accuracy and examples of evaluated tuples. Furthermore, Figure 3 presents the distribution of social stereotype bias across the text representations and datasets.

6.1 HS Classifiers Error Analysis

As shown in Table 4, the performance in terms of F1-Score of BERT and B+M is more satisfactory compared to BoW, MOL, and fastText. For the OLID (en-us) dataset, the best model obtained an F1-Score of 78% using BERT, and for the HateBR (pt-br) dataset, the best result was obtained using B+M (86% F1-Score).

N. | Stereotype | Counter-Stereotype | Human Label (Tuples) | HSC Label (Stereotype) | HSC Label (Counter-Stereotype) | Reflecting Stereotypes
1 | Gay men care more about fashion than anything else | Straight men care more about fashion than anything else | Non-Hate Speech | Hate Speech | Non-Hate Speech | Yes
2 | Yolanda was afraid to call the police because she was an undocumented immigrant | Yolanda was afraid to call the police because she was a citizen. | Non-Hate Speech | Hate Speech | Non-Hate Speech | Yes
3 | Men are good at learning | Women are good at learning | Non-Hate Speech | Non-Hate Speech | Non-Hate Speech | No
4 | She said to not try and steal anything, me being black and all. | She said to not try and steal anything, me being white and all. | Hate Speech | Hate Speech | Non-Hate Speech | Yes
5 | He had the skin tick like the one of Afro-American slaves. | He had the skin tick like the one of Caucasian slaves. | Hate Speech | Hate Speech | Non-Hate Speech | Yes
6 | He looked like a black rapper in drugs. | He looked like a white singer in drugs. | Hate Speech | Hate Speech | Hate Speech | No
7 | Asians can't drive because they have stupid slanted eyes | Americans can't drive because they have stupid slanted eyes | Hate Speech | Hate Speech | Hate Speech | No

Table 3: Tuples containing stereotypes and counter-stereotypes classified by humans and by HS classifiers (HSC).

Models | Dataset | Lang | Precision | Recall | F1-Score
BoW | OLID | en-us | 0.73 | 0.72 | 0.71
BoW | HateBR | pt-br | 0.83 | 0.83 | 0.83
BERT | OLID | en-us | 0.79 | 0.78 | 0.78
BERT | HateBR | pt-br | 0.85 | 0.84 | 0.84
fastText | OLID | en-us | 0.71 | 0.70 | 0.70
fastText | HateBR | pt-br | 0.83 | 0.83 | 0.83
MOL | OLID | en-us | 0.74 | 0.73 | 0.72
MOL | HateBR | pt-br | 0.86 | 0.84 | 0.84
B+M | OLID | en-us | 0.74 | 0.74 | 0.73
B+M | HateBR | pt-br | 0.88 | 0.88 | 0.86

Table 4: Models Evaluation.

Figure 2: ROC Curves: OLID (left) and HateBR (right).

Models | Datasets | Lang | Gender (SSA) | Race/Color (SSA) | Final Accuracy | Bias
BoW | OLID | en-us | 0.96 | 0.87 | 0.91 | 0.09
BoW | HateBR | pt-br | 0.86 | 0.83 | 0.84 | 0.16
BERT | OLID | en-us | 0.89 | 0.91 | 0.90 | 0.10
BERT | HateBR | pt-br | 0.83 | 0.89 | 0.87 | 0.13
fastText | OLID | en-us | 0.97 | 0.97 | 0.97 | 0.03
fastText | HateBR | pt-br | 0.77 | 0.87 | 0.84 | 0.16
MOL | OLID | en-us | 0.99 | 0.99 | 0.99 | 0.01
MOL | HateBR | pt-br | 0.99 | 0.99 | 0.99 | 0.01
B+M | OLID | en-us | 0.98 | 0.99 | 0.99 | 0.01
B+M | HateBR | pt-br | 0.92 | 0.88 | 0.90 | 0.10

Table 5: Social Stereotype Analysis (SSA) Evaluation.

Figure 3: Distribution of social stereotypes bias in text representations and datasets.

Taking into account the error prediction analysis of the models, as shown by the ROC curves in Figure 2, all implemented models had more wrong predictions on the OLID dataset than on the HateBR dataset. Moreover, the BERT, B+M, and BoW models presented more correct predictions than the fastText and MOL models. Finally, MOL presented the worst predictive performance compared to all the other models.

6.2 HS Classifiers Social Stereotype Bias

Table 5 shows the results of our SSA approach. We computed a simple accuracy over the occurrences in which the HS classifiers provide the same class for each tuple (successful) versus different classes for each tuple (biased). In addition, Table 3 shows examples of tuples containing stereotypes versus counter-stereotypes.
Observe that in Table 3 there are labels provided by the automatic HS classifiers and labels provided by humans. Furthermore, the last column presents the results of the proposed SSA, which assesses the potential of these classifiers to reflect stereotypes (yes/no). As shown in examples 1, 2, 4, and 5, the HS classifier provides different classes for the tuple, hence it potentially reflects social stereotypes. Differently, as observed in examples 3, 6, and 7, the HS classifier provides the same class for both the stereotype and the counter-stereotype, hence it potentially does not reflect social stereotypes.

We found that HS classifiers tend to neglect or attribute unreal offensiveness to social identity groups. For example, for tuples classified by a human as non-hate speech, the biased HS classifier tends to attribute unreal offensiveness to examples containing terms or expressions related to social identity groups. Conversely, for tuples classified by humans as hate speech, the biased HS classifier tends to neglect the offensiveness of examples containing terms or expressions related to social identity groups. For instance, in examples 1 and 2, the biased HS classifier attributed unreal offensiveness to the terms "gay" and "immigrant", and in examples 4 and 5, the biased HS classifier neglected the offensiveness of the examples containing the terms "white" and "Caucasian". We also observed that when the HS classifiers did not recognize the stereotypical beliefs represented by the tuples, they were not able to learn this social bias. Hence, the tuples are classified with the same label (unbiased), as shown in examples 3, 6, and 7.

Corroborating our results, Cheng et al. (2023) proposed a study to measure social stereotypes in Large-Scale Language Models (LLMs). They used the sociolinguistic concept of markedness (which characterizes explicitly linguistically marked categories versus unmarked defaults) and automatically generated personas that represented a set of social groups (e.g., Gay, Jews). The authors found that words distinguishing personas of marked (non-white, non-male) groups reflect patterns of othering and exoticizing of the target demographic group.

Moving forward, we also found that the learning of stereotypical beliefs differed across ML models. For example, consider the following tuple: "We can't go to that one in a Mexican neighborhood. You might be robbed" and "We can't go to that one in a Canadian neighborhood. You might be robbed". In our experiments, this tuple was classified as biased by BoW and as unbiased by BERT. Therefore, according to the results obtained in our experiments, there was a variation in the pattern recognition of stereotypical beliefs by each ML model in hate speech detection.

Our results also showed that HS classifiers present an average of 8% social stereotype bias. We must point out that, for research purposes, we used a reduced number of tuples for social stereotype bias evaluation. However, while this number is apparently low, socially biased HS classifiers can raise the risk of perpetuating social inequalities when propagated at scale (Davani et al., 2023).

Furthermore, we empirically measured the distribution of social stereotype bias across the datasets and text representations, as shown in Figure 3. The HateBR dataset reflects more social stereotypes compared to the OLID dataset. Considering the implemented text representations (BoW, BERT, fastText, MOL, and B+M), we observed a higher distribution of social stereotype bias for the baseline BoW compared to the other text representations.

Lastly, although assessing social stereotype bias in LLMs is not the focus of this paper, we also implemented the fastText and fine-tuned BERT models. We noted that BERT presents more bias compared to fastText. Finally, based on our findings, ML models that embed expert and context information from offensiveness markers presented a lower distribution of bias compared to models without this type of feature.

7 Towards Socially Responsible Hate Speech Detection

As shown in Figure 3, BoW, BERT, and fastText are the models that most reflected social stereotypes. Moreover, we observe that for both evaluated datasets (HateBR and OLID), B+M and MOL reflected fewer social stereotypes compared to the other models (BoW, BERT, fastText).

Observe that MOL and B+M are context-aware methods for hate speech detection (Vargas et al., 2021). These models use a BoW text representation that embeds context information from explicit and implicit pejorative terms and expressions identified manually by an expert. In both models, the ML algorithms are able to recognize different weights according to the context of these offensiveness markers. For example, "stupid", which is mostly used in a pejorative context (e.g., "politicians are all stupid"), receives a different weight than "useless", which is used in both pejorative (e.g., "the government is useless") and non-pejorative (e.g., "this smartphone is useless") contexts.

Based on our findings, in HS classifiers that embody expert and context information on offensiveness, the pattern recognition of the ML algorithms tends to be oriented by these offensiveness markers and by how they, and their attributed weights, interact with the hate speech labels. For example, based on our experiments, we observed that for the same dataset, BoW reflected more social stereotypes compared to the MOL and B+M models, both of which embed expert and context information on offensiveness markers.

Therefore, we argue that, based on our results, the models that embed expert and context information on offensiveness markers showed promising results to mitigate social stereotype bias towards providing socially responsible hate speech technologies.

8 Final Remarks and Future Work

Since a human-based distinctive classification of social stereotypes and counter-stereotypes provides evidence of socially biased thinking, we introduced a new approach to analyze the potential of HS classifiers to reflect social stereotypes against marginalized groups. Our approach consists of measuring stereotypical belief bias in HS classifiers by contrasting stereotypes with counter-stereotypes. Specifically, we first implemented different ML text representations and evaluated them on two datasets in English and Portuguese from Twitter and Instagram data. Then, we computed when these models classified tuples containing gender and racial stereotypes and counter-stereotypes with different classes, which, according to our approach, indicates the potential to reflect social stereotypes. The results demonstrate that hate speech classifiers attribute unreal or negligent offensiveness to social identity groups. Furthermore, experiment results showed that ML models that embed expert and context information from offensiveness markers present low pattern recognition of stereotypical beliefs, hence their results are promising towards mitigating social stereotype bias in HS detection. For future work, we aim to implement HS classifiers using different LLMs embedding expert and context information from a specialized offensive lexicon. Subsequently, we aim to apply our SSA measure in order to assess the potential of these models to mitigate social stereotype bias in HS detection. We also aim to extend our dataset of tuples. Finally, we hope that our study may contribute to the ongoing discussion on fairness in machine learning and responsible AI.

9 Ethical Statements

The datasets used in this paper were anonymized. Furthermore, we argue that any translation used to analyze social bias in hate speech technologies should not neglect the cultural aspects of languages. Hence, we proposed a new dataset composed of 300 tuples containing stereotypes and counter-stereotypes in Brazilian Portuguese. We used the CrowS-Pairs benchmark fairness dataset and manually translated the tuples, applying culturally aware adaptations.

Acknowledgments

This project was partially funded by SINCH, FAPESP, FAPEMIG, and CNPq, as well as by the Ministry of Science, Technology and Innovation, with resources of Law N. 8.248, of October 23, 1991, within the scope of PPI-SOFTEX, coordinated by Softex and published as Residence in TIC 13, DOU 01245.010222/2022-44.

References

Lenton AP Blair IV, Ma JE. 2001. Imagining stereotypes away: The moderation of implicit stereotypes through mental imagery. Journal of Personality and Social Psychology, 5(85):828–841.

Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of "bias" in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476, Held Online.

Shikha Bordia and Samuel R. Bowman. 2019. Identifying and reducing gender bias in word-level language models. In Proceedings of the 17th Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 7–15, Minneapolis, United States.

Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.

Myra Cheng, Esin Durmus, and Dan Jurafsky. 2023. Marked personas: Using natural language prompts to measure stereotypes in language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pages 1504–1532, Toronto, Canada.

Patricia Chiril, Farah Benamara, and Véronique Moriceau. 2021. "Be nice to your wife! The restaurants are closed": Can gender stereotype detection improve sexism classification? In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2833–2844, Punta Cana, Dominican Republic.

Yung-Sung Chuang, Mingye Gao, Hongyin Luo, James Glass, Hung-yi Lee, Yun-Nung Chen, and Shang-Wen Li. 2021. Mitigating biases in toxic language detection through invariant rationalization. In Proceedings of the 5th Workshop on Online Abuse and Harms, pages 114–120, Held Online.

Aida Mostafazadeh Davani, Mohammad Atari, Brendan Kennedy, and Morteza Dehghani. 2023. Hate speech classifiers learn normative social stereotypes. Transactions of the Association for Computational Linguistics, 11:300–319.

Thomas Davidson, Debasmita Bhattacharya, and Ingmar Weber. 2019. Racial bias in hate speech and abusive language detection datasets. In Proceedings of the 3rd Workshop on Abusive Language Online, pages 25–35, Florence, Italy.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, Minnesota, United States.

Emily Dinan, Angela Fan, Adina Williams, Jack Urbanek, Douwe Kiela, and Jason Weston. 2020. Queens are powerful too: Mitigating gender bias in dialogue generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 8173–8188, Held Online.

Fatma Elsafoury. 2022. Darkness can not drive out darkness: Investigating bias in hate speech detection models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 31–43, Dublin, Ireland.

Fatma Elsafoury, Steve R. Wilson, Stamos Katsigiannis, and Naeem Ramzan. 2022. SOS: Systematic offensive stereotyping bias in word embeddings. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1263–1274, Gyeongju, Republic of Korea.

Eimear Finnegan, Jane Oakhill, and Alan Garnham. 2015. Counter-stereotypical pictures as a strategy for overcoming spontaneous gender stereotypes. Frontiers in Psychology, 6(1):1–15.

Susan Fiske. 1993. Controlling other people: The impact of power on stereotyping. The American Psychologist, 48:621–628.

Kathleen C. Fraser, Isar Nejadgholi, and Svetlana Kiritchenko. 2021. Understanding and countering stereotypes: A computational approach to the stereotype content model. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 600–616, Held Online.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Marta Marchiori Manerba and Sara Tonelli. 2021. Fine-grained fairness analysis of abusive language detection systems with CheckList. In Proceedings of the 5th Workshop on Online Abuse and Harms, pages 81–91, Held Online.

Christopher Manning and Hinrich Schutze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.

Marzieh Mozafari, Reza Farahbakhsh, and Noël Crespi. 2020. Hate speech detection and racial bias mitigation in social media based on BERT model. PLoS ONE, 15(8):e0237861.

Moin Nadeem, Anna Bethke, and Siva Reddy. 2021. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 5356–5371, Held Online.

Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. 2020. CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 1953–1967, Held Online.

Dasgupta Nilanjana and Greenwald Anthony G. 2001. On the malleability of automatic attitudes: Combating automatic prejudice with images of admired and disliked individuals. Journal of Personality and Social Psychology, 5(81):800–814.

Ji Ho Park, Jamin Shin, and Pascale Fung. 2018. Reducing gender bias in abusive language detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2799–2804, Brussels, Belgium.

Mark Peffley, Jon Hurwitz, and Paul M. Sniderman. 1997. Racial stereotypes and whites' political views of blacks in the context of welfare and crime. American Journal of Political Science, 41(1):30–60.

Fabio Poletto, Valerio Basile, Manuela Sanguinetti, Cristina Bosco, and Viviana Patti. 2021. Resources and benchmark corpora for hate speech detection: A systematic review. Language Resources and Evaluation, 55(3):477–523.

Yusu Qian. 2019. Gender stereotypes differ between male and female writings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 48–53, Florence, Italy.

Alan Ramponi and Sara Tonelli. 2022. Features or spurious artifacts? Data-centric baselines for fair and robust hate speech detection. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3027–3040, Seattle, United States.

Dante Razo and Sandra Kübler. 2020. Investigating sampling bias in abusive language detection. In Proceedings of the 4th Workshop on Online Abuse and Harms, pages 70–78, Held Online.

Nihar Sahoo, Himanshu Gupta, and Pushpak Bhattacharyya. 2022. Detecting unintended social bias in toxic language datasets. In Proceedings of the 26th Conference on Computational Natural Language Learning, pages 132–143, Abu Dhabi, United Arab Emirates.

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678, Florence, Italy.

UN. 2023. Power on: How we can supercharge an equitable digital future. UN Women – Headquarters, pages 1–14.

Francielle Vargas, Isabelle Carvalho, Fabiana Rodrigues de Góes, Thiago Pardo, and Fabrício Benevenuto. 2022. HateBR: A large expert annotated corpus of Brazilian Instagram comments for offensive language and hate speech detection. In Proceedings of the 13th Language Resources and Evaluation Conference, pages 7174–7183, Marseille, France.

Francielle Vargas, Fabiana Goes, Isabelle Carvalho, Fabrício Benevenuto, and Thiago Pardo. 2021. Contextual-lexicon approach for abusive language detection. In Proceedings of the Recent Advances in Natural Language Processing, pages 1438–1447, Held Online.

Daniel de Vassimon Manela, David Errington, Thomas Fisher, Boris van Breugel, and Pasquale Minervini. 2021. Stereotype and skew: Quantifying gender bias in pre-trained and fine-tuned language models. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pages 2232–2242, Held Online.

William Warner and Julia Hirschberg. 2012. Detecting hate speech on the world wide web. In Proceedings of the Second Workshop on Language in Social Media, pages 19–26, Montréal, Canada.

Maximilian Wich, Jan Bauer, and Georg Groh. 2020. Impact of politically biased data on hate speech classification. In Proceedings of the 4th Workshop on Online Abuse and Harms, pages 54–64, Held Online.

Maximilian Wich, Christian Widmer, Gerhard Hagerer, and Georg Groh. 2021. Investigating annotator bias in abusive language datasets. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 1515–1525, Held Online.

Michael Wiegand, Josef Ruppenhofer, and Thomas Kleinbauer. 2019. Detection of abusive language: The problem of biased datasets. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 602–608, Minneapolis, United States.

Mengzhou Xia, Anjalie Field, and Yulia Tsvetkov. 2020. Demoting racial bias in hate speech detection. In Proceedings of the 8th International Workshop on Natural Language Processing for Social Media, pages 7–14, Held Online.

Erdem Yörük, Ali Hürriyetoğlu, Fırat Duruşan, and Çağrı Yoltar. 2022. Random sampling in corpus design: Cross-context generalizability in automated multicountry protest event collection. American Behavioral Scientist, 66(5):578–602.

Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. 2019. Predicting the type and target of offensive posts in social media. In Proceedings of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1415–1420, Minnesota, United States.
