Attentive Fusion: A Transformer-based Approach to Multimodal Hate Speech Detection

Atanu Mandal 1†, Gargi Roy 2‡∗, Amit Barman 3†, Indranil Dutta 4†, and Sudip Kumar Naskar 5†

† Jadavpur University, Kolkata, INDIA
‡ Optum Global Solutions Private Limited, Bengaluru, INDIA

1,2,3,5 {atanumandal0491, roygargi1997, amitbarman811, sudip.naskar}@gmail.com
4 [email protected]

∗ The work was carried out when the author was at Jadavpur University.
is of utmost importance. Researchers have been diligently working for the past decade on distinguishing between content that promotes hatred and content that does not. Traditionally, the main focus has been on analyzing textual content. However, recent research has also begun to address the identification of audio-based content. Nevertheless, studies have shown that relying solely on audio or text-based content may be ineffective, as a recent upsurge indicates that individuals often employ sarcasm in their speech and writing. To overcome these challenges, we present an approach to identify whether a speech promotes hate or not, utilizing both audio and textual representations. Our methodology is based on the Transformer framework and incorporates both audio and text sampling, accompanied by our own layer termed “Attentive Fusion”. The results of our study surpassed previous state-of-the-art techniques, achieving an impressive macro F1 score of 0.927 on the test set.

1 Introduction

In recent years, the explosive growth of digital communication platforms has facilitated unprecedented levels of information exchange, enabling individuals from diverse backgrounds to interact and share ideas. However, this surge in online interactions has also led to the emergence of a concerning issue: the rise of hate speech (Davidson et al., 2017). Hate speech, characterized by offensive, discriminatory, or derogatory language targeting individuals or groups based on race, ethnicity, religion, gender, or sexual orientation, poses significant challenges to maintaining a safe and inclusive online environment (Schmidt and Wiegand, 2017).

While text-based approaches have yielded some success, they often struggle to capture the nuanced nature of speech, as the exact same text might be interpreted differently when considering context, tone, and intent (Fortuna and Nunes, 2018). To address these limitations, researchers are turning to a more holistic approach that combines the text and speech modalities to enhance the accuracy and robustness of hate speech detection systems (Rana and Jha, 2022).

Figure 1: Identification of “Hate” or “Not Hate” using the multimodality approach. The examples shown are “I can’t really say South Park is stupid, but it actually is” (labelled Hate) and “And it’s basically about, it’s about Dracula and he goes around killing people and sucking the blood” (labelled Not Hate), each presented as both a speech cue and a text cue.

This multidimensional approach, referred to as multimodal hate speech detection, leverages not only the textual content of messages but also the acoustic cues and prosodic features present in speech. By simultaneously analyzing both text and speech-based characteristics, it aims to capture a more comprehensive representation of communication, considering not only the words used but also the emotional nuances conveyed through speech intonation, pitch, and rhythm. Figure 1 illustrates examples of “Hate” and “Not Hate” under the multimodal setting; each example in Figure 1 is presented through both a speech cue and a text cue.
In this paper, we investigate multimodal hate speech detection, exploring the synergies between text and speech for identifying hate speech instances. We examine the challenges posed by hate speech in the digital age, the limitations of traditional text-based detection methods, and the potential advantages of integrating speech data into the detection process. By leveraging insights from various prior studies, we improve upon state-of-the-art (SOTA) methodologies in the subsequent manner:

• Our system consists of a sequence of interconnected modules built around the Transformer framework¹.

Figure 2: Contribution of each dataset to the data used in this work, shown as pie charts; Figures 2a, 2b, and 2c correspond to the training, development, and test splits. Common Voice accounts for roughly 82% of each split, while LJ Speech contributes about 1%; CMU-MOSEI (≈6%), CMU-MOSI (≈4%), Social-IQ (≈3%), MELD (≈2%), and VCTK (≈1.4%) make up the remainder.
2 Dataset

For our experiments, we used fragments of the DeToxy dataset (Ghosh et al., 2022)², a dataset for detecting Hatred within spoken English speech. This dataset is derived from diverse open-source datasets. The specifics regarding the number of samples utilized from the various datasets are outlined in Table 1.

Our experiments were carried out on a comprehensive dataset that encompassed all seven datasets combined. Each dataset contained entries that fell into either the “Hate” or “Not Hate” category, along with a transcription for each audio. To facilitate understanding, we have depicted the distribution of each dataset’s contribution to our framework through pie charts, as showcased in Figure 2. Figures 2a, 2b, and 2c illustrate the respective contributions to the training, development, and test data. There exists a significant disparity in the number of samples across the various datasets, but the proportional representation of the training, development, and test datasets remains consistent. Notably, Common Voice comprises the majority of the data, while LJ Speech is the least represented. The statistical analysis of the “Hate” and “Not Hate” classes is presented in Table 2. Meanwhile, the bar plot showcasing the sample count for both classes can be seen in Figure 3. A comprehensive description of the datasets is given in Sections 2.1 through 2.7.

¹ Code is publicly available on GitHub.
² Ghosh et al. (2022) used 20,271 samples drawn from CMU-MOSEI, CMU-MOSI, Common Voice, IEMOCAP, LJ Speech, MELD, MSP-Improv, MSP-Podcast, Social-IQ, Switchboard, and VCTK, of which IEMOCAP, MSP-Improv, MSP-Podcast, and Switchboard are not open-sourced; therefore, we were unable to use those subsets.
                          Hate                    Not Hate
Dataset           Train    Dev   Test     Train     Dev    Test
CMU-MOSEI           149     33     35       448     100      95
CMU-MOSI             47     10     10       134      30      29
Common Voice      2,013    442    433     6,037   1,326   1,300
LJ Speech            28      6      6        74      17      17
MELD                 99     22     21       294      65      64
Social-IQ            83     18     19       242      56      50
VCTK                 34      8      8       104      23      22
Total             2,453    539    532     7,333   1,617   1,577

Table 1: Statistics of the dataset used for identification of Hatred.
Figure 3: Sample count for “Hate” and “Not Hate”.

2.2 CMU-MOSI

The Carnegie Mellon University Multimodal Corpus of Sentiment Intensity (CMU-MOSI) (Zadeh et al., 2016) is another dataset by Carnegie Mellon University, consisting of 2,199 video clips of different opinions annotated with sentiment. It is annotated in the range [−3, 3], using various parameters for sentiment intensity, subjectivity, and per-millisecond annotations of audio features. It contains 97% non-toxic and nearly 3% toxic utterances. Figure 4b shows the number of samples for “Hate” and “Not Hate”, and Figure 5b plots the number of samples against audio duration.

2.3 Common Voice

This dataset (Ardila et al., 2020) by the Mozilla Developer Network is an open-source dataset of voices in multiple languages for training speech-enabled systems, with 20,217 hours of recorded audio and 14,973 hours of validated speech audio. Figure 4c shows the number of samples for “Hate” and “Not Hate”, and Figure 5c plots the number of samples against audio duration.
2.4 LJ Speech

Figure 5d plots the number of samples against audio duration.
2.5 MELD

The Multimodal EmotionLines Dataset (MELD) (Poria et al., 2019) has over 1,400 dialogues and 13,000 utterances from the television show “Friends”. Each utterance in a dialogue has been labelled with one of the emotions Anger, Disgust, Sadness, Joy, Neutral, Surprise, or Fear. MELD also has annotations for sentiment: positive, negative, and neutral. Figure 4e shows the number of samples for “Hate” and “Not Hate”, and Figure 5e plots the number of samples against audio duration.

2.6 Social-IQ

Another dataset (Zadeh et al., 2019) by Carnegie Mellon University, Social-IQ contains videos that are thoroughly validated and annotated, along with questions, answers, and annotations for the level of complexity of the said questions and answers. Figure 4f shows the number of samples for “Hate” and “Not Hate”, and Figure 5f plots the number of samples against audio duration.

2.7 VCTK

The VCTK corpus (Yamagishi et al., 2019) contains speech data from 110 English speakers with various accents. Each speaker reads a set of passages selected from newspapers, archives, and other sources. Figure 4g shows the number of samples for “Hate” and “Not Hate”, and Figure 5g plots the number of samples against audio duration.
3 Experiments

This section presents our techniques for detecting Hatred within speech. The section is divided into several subsections for ease of understanding. Section 3.1 presents the methods we used to prepare the dataset for our proposed framework, Section 3.2 describes the framework itself, Section 3.3 discusses the parameters used for the proposed framework, and Section 4 compares the results of our approach with other benchmark frameworks.

3.1 Dataset Pre-processing

During pre-processing, we carefully selected data with comparable audio lengths and disregarded instances with excessively long or short durations. We excluded excessively long audio because processing it would require extensive computational resources; conversely, extremely short audio lacked sufficiently rich audio features.
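As an illustration of this filtering step, a minimal sketch is given below. The 1–30 second bounds and the use of the soundfile library are our own assumptions rather than thresholds reported in the paper; the 30-second upper bound simply mirrors the chunk length in Table 3.

```python
import soundfile as sf

# Hypothetical duration bounds in seconds; the paper does not report its exact thresholds.
MIN_DUR, MAX_DUR = 1.0, 30.0

def keep_clip(path: str) -> bool:
    """True if the clip's duration is neither excessively short nor excessively long."""
    duration = sf.info(path).duration  # duration in seconds
    return MIN_DUR <= duration <= MAX_DUR

def filter_dataset(paths):
    """Keep only clips of comparable length, as done during pre-processing."""
    return [p for p in paths if keep_clip(p)]
```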
We conducted a series of experiments with our framework, exploring various feature extraction methods such as Mel-frequency cepstral coefficients (MFCCs) and filter banks. However, we found that the “log mel spectrogram” yields superior accuracy in comparison to the alternatives, as it captures auditory information in a manner akin to human perception. To extract these features, we established the optimal parameters empirically; they are detailed in Table 3.

Parameter                        Value
Sample Rate                      16,000 Hz
Number of FFT                    400
Number of Mel Channels           80
Hop Length                       160
Chunk Length                     30 s
Number of Samples                480,000
Number of Frames                 3,000
Number of Samples per Token      320
Frame Duration                   10 ms
Token Duration                   25 ms

Table 3: Audio feature extraction parameters.
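To make the extraction concrete, the sketch below computes a log mel spectrogram with the parameters of Table 3, assuming the librosa library. Padding or trimming each clip to the 30-second chunk and using power_to_db for the logarithmic scaling are our assumptions about how the (80 × time_step) feature of Section 3.2 is obtained, not the authors' exact implementation.

```python
import numpy as np
import librosa

SAMPLE_RATE = 16_000     # Hz
N_FFT = 400
HOP_LENGTH = 160
N_MELS = 80
CHUNK_SAMPLES = 480_000  # 30 s at 16 kHz -> roughly 3,000 frames

def log_mel_spectrogram(path: str) -> np.ndarray:
    """Return an (80 x time_step) log mel spectrogram for one audio file."""
    audio, _ = librosa.load(path, sr=SAMPLE_RATE)
    # Pad or trim to the 30-second chunk length from Table 3 (assumption).
    if len(audio) < CHUNK_SAMPLES:
        audio = np.pad(audio, (0, CHUNK_SAMPLES - len(audio)))
    else:
        audio = audio[:CHUNK_SAMPLES]
    mel = librosa.feature.melspectrogram(
        y=audio, sr=SAMPLE_RATE, n_fft=N_FFT,
        hop_length=HOP_LENGTH, n_mels=N_MELS)
    return librosa.power_to_db(mel)  # logarithmic scale on the mel energies
```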
For feature extraction from text, we used the pre-trained Albert Tokenizer (Lan et al., 2020) from IndicBART (Dabre et al., 2022), developed by AI4Bharat. To tokenize each sentence, we introduced the symbols “<s>” and “</s>” at the start and end, respectively, signifying the commencement and conclusion of the sentence.
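A minimal sketch of this tokenization step is shown below. It assumes the publicly released ai4bharat/IndicBART checkpoint on Hugging Face and a hypothetical max_length of 64; the paper itself only specifies that “<s>” and “</s>” are added and that the tokenized text has dimension (max_length × 1).

```python
from transformers import AlbertTokenizer

# Assumption: the AI4Bharat IndicBART checkpoint, which ships an ALBERT-style tokenizer.
tokenizer = AlbertTokenizer.from_pretrained(
    "ai4bharat/IndicBART", do_lower_case=False, keep_accents=True)

MAX_LENGTH = 64  # hypothetical value; the paper only states "(max_length x 1)"

def tokenize_sentence(sentence: str):
    """Wrap the sentence in <s> ... </s> and return padded token ids."""
    wrapped = "<s> " + sentence + " </s>"
    ids = tokenizer(wrapped,
                    add_special_tokens=False,
                    padding="max_length",
                    truncation=True,
                    max_length=MAX_LENGTH)["input_ids"]
    return ids

print(tokenize_sentence("And it's basically about Dracula."))
```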
3.2 Framework

We used the Transformer (Vaswani et al., 2017) framework, which has gained widespread recognition and is considered SOTA in the domains of Speech Recognition and Machine Translation (MT) due to its exceptional ability to handle the complexities of these tasks. To provide a clear overview of our methodology, Figure 6 presents an overview of our framework.

Figure 6: Overview of the proposed framework: two Transformer pipelines (“Pipeline 1” and “Pipeline 2”), each built from Multi-Head Attention and Add & Norm blocks and followed by an LSTM, whose outputs are combined by the Attentive Fusion layer and passed through a Linear layer and a Softmax.

The speech feature is extracted by the “log mel spectrogram” technique discussed in Section 3.1. This technique computes a spectrogram that represents the frequency content of an audio signal over time, using a logarithmic scale for the frequency axis. The resulting spectrogram has a dimension of (80 × time_step) and is then passed to the Speech Sampling Block. The Speech Sampling Block is responsible for selecting a subset of the input spectrogram based on certain criteria (described in Section 3.2.1). On the other hand, the tokenized text, which is obtained through the process described in Section 3.1, has a dimension of (max_length × 1) and is passed to the Text Sampling Block (discussed in Section 3.2.2). The Text Sampling Block performs a similar function to the Speech Sampling Block, but on the tokenized text instead of the spectrogram.

The resulting subset of Speech Sampling is fed to the Encoder of the first Transformer module and the Decoder of the second Transformer module. Similarly, the resulting subset of Text Sampling is fed to the Decoder of the first Transformer module and the Encoder of the second Transformer module. The motivation behind this approach is to investigate whether the text in the Decoder can learn from the audio in the Encoder, and vice versa. This is inspired by the idea of MT, where the target text in the Decoder learns from the source text in the Encoder. By applying this concept to the audio and text domains, we aim to explore the potential for cross-modal learning and the transfer of knowledge between different modalities.

To further process the outputs of the two Transformer modules, we introduce a Long Short-Term Memory (LSTM) block that consists of a single LSTM layer. This LSTM block is responsible for sequentially learning the knowledge from each step of the output. After going through this process, we obtain two outputs: one from the first LSTM and another from the second LSTM. The combination of the first Transformer with the first LSTM is represented as “Pipeline 1”, and the combination of the second Transformer with the second LSTM is represented as “Pipeline 2”.
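Since the excerpt gives no code for this dual-pipeline design, the sketch below is our own minimal PyTorch rendering of the description above: each pipeline is a Transformer whose encoder receives one modality and whose decoder receives the other, followed by a single LSTM layer. The hidden size of 512, the projection of the 80-dimensional mel frames, and the way the two LSTM outputs are merged are assumptions; in particular, the Attentive Fusion layer is only stubbed out with a plain linear fusion here, since its definition does not appear in the excerpt.

```python
import torch
import torch.nn as nn

class Pipeline(nn.Module):
    """One Transformer (encoder + decoder) followed by a single LSTM layer."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=2, num_decoder_layers=2, batch_first=True)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, enc_in, dec_in):
        out = self.transformer(enc_in, dec_in)  # (batch, dec_len, d_model)
        _, (h_n, _) = self.lstm(out)            # keep the final hidden state
        return h_n[-1]                          # (batch, d_model)

class DualPipelineHateDetector(nn.Module):
    """Speech feeds Encoder 1 / Decoder 2; text feeds Decoder 1 / Encoder 2 (Section 3.2)."""
    def __init__(self, vocab_size: int = 64_014, d_model: int = 512, n_mels: int = 80):
        super().__init__()
        self.speech_proj = nn.Linear(n_mels, d_model)  # project 80-dim mel frames
        self.text_emb = nn.Embedding(vocab_size, d_model)
        self.pipeline1 = Pipeline(d_model)  # Pipeline 1: speech encoder, text decoder
        self.pipeline2 = Pipeline(d_model)  # Pipeline 2: text encoder, speech decoder
        # Stand-in for the Attentive Fusion layer, which is not specified in the excerpt:
        self.fusion = nn.Linear(2 * d_model, d_model)
        self.classifier = nn.Linear(d_model, 2)  # "Hate" vs. "Not Hate"

    def forward(self, mel, token_ids):
        speech = self.speech_proj(mel.transpose(1, 2))  # (batch, time, d_model)
        text = self.text_emb(token_ids)                 # (batch, max_length, d_model)
        out1 = self.pipeline1(speech, text)
        out2 = self.pipeline2(text, speech)
        fused = torch.relu(self.fusion(torch.cat([out1, out2], dim=-1)))
        return torch.softmax(self.classifier(fused), dim=-1)

# Example with random inputs; the 300-frame spectrogram stands in for the output of
# the Speech Sampling Block, which shortens the full 3,000-frame input.
model = DualPipelineHateDetector()
probs = model(torch.randn(2, 80, 300), torch.randint(0, 64_014, (2, 64)))
print(probs.shape)  # torch.Size([2, 2])
```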
3.2.1 Speech Sampling

3.2.2 Text Sampling

The tokenized text, augmented with the symbols marking the commencement and conclusion of the sentences (refer to Section 3.1), is passed on to the Word Embedding. The subsequent output is then directed to the Positional Encoder. Subsequently, the outputs of the Word Embedding and the Positional Encoder are combined and conveyed to the subsequent hierarchical module. The representation of the “Text Sampling” framework can be seen in Figure 8.

Figure 8: The “Text Sampling” block: a Word Embedding layer followed by a Positional Encoder.
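Because the excerpt describes Text Sampling only at a high level, the following is a speculative sketch: an embedding layer whose output is combined with a positional encoding, using the vocabulary size of 64,014 and hidden dimension of 512 mentioned in Section 3.3.1. The sinusoidal form of the positional encoding and the additive combination are assumptions.

```python
import math
import torch
import torch.nn as nn

class TextSampling(nn.Module):
    """Word Embedding followed by a Positional Encoder, combined additively (assumption)."""
    def __init__(self, vocab_size: int = 64_014, d_model: int = 512, max_len: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Fixed sinusoidal positional encoding (one common choice; the paper may differ).
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, max_length) -> (batch, max_length, d_model)
        emb = self.embedding(token_ids)
        return emb + self.pe[: token_ids.size(1)]

# Example: a batch of two tokenized sentences of length 64.
block = TextSampling()
print(block(torch.randint(0, 64_014, (2, 64))).shape)  # torch.Size([2, 64, 512])
```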
The value w_i obtained from Equation 3 was introduced into the subsequent module, which incorporates a Linear layer to differentiate between the different classes.

3.3 Hyperparameters

3.3.1 Speech Sampling

For the two Convolutional layers, we used filter sizes of 4096 and 1024, respectively, and a kernel size of 3 for both. Strides of 1 and 2 were used for the 1st and 2nd Convolutional layers, respectively. For the LSTM layer, 512 units with the “tanh” activation function and the “sigmoid” recurrent activation were used. For the Positional Encoder, a vocabulary size of 64,014 and a hidden dimension of 512 were used.
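Putting these hyperparameters together, one possible reading of the Speech Sampling block is sketched below, interpreting “filter sizes” as the number of filters. The layer ordering, the padding, and the GELU nonlinearities between the convolutions are our assumptions, since the block diagram is not reproduced in full here; PyTorch's LSTM already uses tanh/sigmoid internally, matching the stated activation and recurrent activation.

```python
import torch
import torch.nn as nn

class SpeechSampling(nn.Module):
    """Speech Sampling block assembled from the Section 3.3.1 hyperparameters (a sketch)."""
    def __init__(self, n_mels: int = 80, d_model: int = 512):
        super().__init__()
        self.conv1 = nn.Conv1d(n_mels, 4096, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv1d(4096, 1024, kernel_size=3, stride=2, padding=1)
        self.act = nn.GELU()
        self.lstm = nn.LSTM(1024, d_model, batch_first=True)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 80, time_step) log mel spectrogram
        x = self.act(self.conv1(mel))
        x = self.act(self.conv2(x))   # stride 2 halves the time dimension
        x = x.transpose(1, 2)         # (batch, time_step / 2, 1024)
        out, _ = self.lstm(x)         # (batch, time_step / 2, 512)
        return out

# Example: a 3,000-frame spectrogram is downsampled to 1,500 steps of width 512.
print(SpeechSampling()(torch.randn(1, 80, 3000)).shape)  # torch.Size([1, 1500, 512])
```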