0% found this document useful (0 votes)

16 views

Attentive

Uploaded by

EDARA KEERTHI LAHARI

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

16 views

Attentive

Uploaded by

EDARA KEERTHI LAHARI

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 9

Attentive Fusion: A Transformer-based Approach to Multimodal Hate

Speech Detection
1† 2‡∗ 3†
Atanu Mandal and Gargi Roy and Amit Barman
4† 5†
Indranil Dutta and Sudip Kumar Naskar
†
Jadavpur University, Kolkata, INDIA
‡
Optum Global Solutions Private Limited, Bengaluru, INDIA
1 2 3 5
{ atanumandal0491, roygargi1997, amitbarman811, sudip.naskar}@gmail.com,
4
[email protected]

Abstract Traditional methods of hate speech detection

primarily focused on analyzing text-based con-
With the recent surge and exponential growth tent, leveraging natural language processing (NLP)
of social media usage, scrutinizing social media
techniques to identify offensive language patterns.
content for the presence of any hateful content
arXiv:2401.10653v1 [cs.CL] 19 Jan 2024

is of utmost importance. Researchers have been While these approaches have yielded some success,
diligently working since the past decade on dis- they often struggle to capture the nuanced nature
tinguishing between content that promotes ha- of speech, as the exact text might be interpreted dif-
tred and content that does not. Traditionally, ferently when considering context, tone, and intent
the main focus has been on analyzing textual (Fortuna and Nunes, 2018). To address these limi-
content. However, recent research attempts tations, researchers are turning to a more holistic
have also commenced into the identification
approach combining both text and speech modali-
of audio-based content. Nevertheless, stud-
ies have shown that relying solely on audio
ties to enhance the accuracy and robustness of hate
or text-based content may be ineffective, as re- speech detection systems (Rana and Jha, 2022).
cent upsurge indicates that individuals often
employ sarcasm in their speech and writing. To "I can't really say South
overcome these challenges, we present an ap- Park is stupid, but it
actually is"
proach to identify whether a speech promotes
"I can't really say South
hate or not utilizing both audio and textual rep- Park is stupid, but it
resentations. Our methodology is based on the actually is"
Transformer framework that incorporates both
!!
Hate
audio and text sampling, accompanied by our
very own layer called “Attentive Fusion”. The
results of our study surpassed previous state-
of-the-art techniques, achieving an impressive "And it's basically about,
it's about Dracula and he
macro F1 score of 0.927 on the Test Set. goes around killing people
and sucking the blood"
"And it's basically about,
1 Introduction it's about Dracula and he
goes around killing people
and sucking the blood"
In recent years, the explosive growth of digital com-
munication platforms has facilitated unprecedented Hate
!!
Not
levels of information exchange, enabling individu-
als from diverse backgrounds to interact and share
ideas. However, this surge in online interactions Figure 1: Identification of “Hate” or “Not Hate” using
has also led to the emergence of a concerning is- multimodality approach
sue: the increase of hate speech (Davidson et al.,
2017). Hate speech, characterized by offensive, This multidimensional approach referred to as
discriminatory, or derogatory language targeting multimodal hate speech detection, leverages not
individuals or groups based on race, ethnicity, re- only the textual content of messages but also the
ligion, gender, or sexual orientation, poses signifi- acoustic cues and prosodic features present in
cant challenges to maintaining a safe and inclusive speech. By simultaneously analyzing both text and
online environment (Schmidt and Wiegand, 2017). speech-based characteristics, this approach aims
∗
The work was carried out when the author was at Ja- to capture a more comprehensive representation of
davpur University. communication, considering not only the words
used, but also the emotional nuances conveyed for detecting Hatred within spoken English speech.
through speech intonation, pitch, and rhythm. Fig- This dataset is derived from diverse open-source
ure 1 illustrates two examples each for “Hate” and datasets. The specifics regarding the number of
“Not Hate” using the Multimodality. In the two samples utilized from various datasets are precisely
cues shown (figure 1), represents the speech cue outlined in Table 1.
and represents the text cue.
VCTK
In this paper, we investigate multimodal hate 1.4% CMU-MOSEI
Social-IQ 6.1%
speech detection exploring the synergies between 3.3%
MELD
CMU-MOSI
1.8%
text and speech for identifying hate speech in- 4.0%
LJ Speech
1.0%
stances. We examine the challenges posed by hate
speech in the digital age, the limitations of tradi-
tional text-based detection methods, and the poten-
tial advantages of integrating speech data into the
Common Voice
detection process. By leveraging insights from var- 82.3%

ious disciplines such as NLP, audio signal process-

ing, and machine learning, multimodal approaches
hold promise in achieving higher detection accu- (a) Training Data
racy and reducing false positives, ultimately foster- VCTK
1.4% CMU-MOSEI
ing safer and more inclusive online environments. Social-IQ
3.4%
6.2%
CMU-MOSI
MELD
Our methodology sets itself apart from other 4.0%
1.9%

LJ Speech
state-of-the-art (SOTA) methodologies in the sub- 1.1%

sequent manner:
• Our system consists of a sequence of inter-
connected systems enclosing the Transformer
1
framework . Common Voice
82.0%

• We have introduced a layer termed “Attentive

Fusion” that augments the results.
(b) Development Data
2 Dataset Description VCTK
1.4% CMU-MOSEI
Social-IQ 6.2%
3.3% CMU-MOSI
Number of Samples MELD 1.8%
Dataset 4.0%
Train Dev Test LJ Speech
1.1%
CMU-MOSEI
(Bagher Zadeh et al., 2018) 597 133 130
CMU-MOSI
(Zadeh et al., 2016) 181 40 39
Common Voice
(Ardila et al., 2020) 8,050 1,768 1,733 Common Voice
82.2%
LJ Speech
(Ito and Johnson, 2017) 102 23 23
MELD
(Poria et al., 2019) 393 87 85
Social-IQ (c) Test Data
(Zadeh et al., 2019) 325 74 69
VCTK Figure 2: Pictorial representation of the contribution of
(Yamagishi et al., 2019) 138 31 30
datasets
9,786 2,156 2,109

Table 1: Statistics of the dataset used for Identification Our experiments were carried out on a compre-
of Hatred. hensive dataset that encompassed all seven datasets
combined. Each dataset contained entries that fell
For our experiments, we used fragments of the into either the "Hate" or "Not-Hate" category, along
2
DeToxy dataset (Ghosh et al., 2022) , a dataset with a transcription for each audio. To facilitate
1
Code is publicly available in GitHub.
understanding, we have depicted the distribution
2
Ghosh et al. (2022) used 20,271 data consisting of CMU- and VCTK of which IEMOCAP, MSP-Improv, MSP-Podcast,
MOSEI, CMU-MOSI, Common Voice, IEMOCAP, LJ Speech, Switchboard are not open-sourced therefore we were unable
MELD, MSP-Improv, MSP-Podcast, Social-IQ, Switchboard, to use the dataset.
Hate Not Hate
Dataset Train Dev Test Train Dev Test
CMU-MOSEI 149 33 35 448 100 95
CMU-MOSI 47 10 10 134 30 29
Common Voice 2,013 442 433 6,037 1,326 1,300
LJ Speech 28 6 6 74 17 17
MELD 99 22 21 294 65 64
Social-IQ 83 18 19 242 56 50
VCTK 34 8 8 104 23 22
2,453 539 532 7,333 1,617 1,577

Table 2: Data Statistics of “Hate” and “Not Hate”

the pictorial information of the number of samples

to audio duration.

2.2 CMU-MOSI
Carnegie Mellon University - Multimodal Corpus
of Sentiment Intensity (CMU-MOSI) (Zadeh et al.,
2016) is another dataset by Carnegie Mellon Uni-
versity, which consists of 2199 video clips of dif-
ferent opinions, annotated with sentiment. It is
annotated in the range [−3, 3], using various pa-
Figure 3: Sample count for “Hate” and “Not Hate” rameters for sentiment intensity, subjectivity, and
per-millisecond annotations of audio features. It
contains 97% non-toxic and nearly 3% toxic utter-
of each dataset’s contribution to our framework
ances. Figure 4b provides the information on the
through a pie chart, as showcased in Figure 2. Fig-
number of samples for “Hate” and “Not Hate” and
ures 2a, 2b, and 2c accordingly illustrate the respec-
Figure 5b provides the pictorial information of the
tive contributions of the training data, development
number of samples to audio duration.
data, and testing data. There exists a significant dis-
parity in the number of samples across the various
2.3 Common Voice
datasets but, the proportional representation of the
training, development, and test datasets remains This dataset (Ardila et al., 2020) by Mozilla Devel-
consistent. Notably, Common Voice comprises the oper Network is an open-source, dataset of voices
majority of the data, while LJ Speech is the least of multiple languages for the use of training speech-
represented. The statistical analysis of the “Hate” enabled systems, with 20,217 hours of recorded
and “Not Hate” classes is presented in Table 2. audio and 14,973 hours of validated speech audios.
Meanwhile, the bar plot showcasing the sample Figure 4c provides the information on the number
count for both classes can be seen in Figure 3. A of samples for “Hate” and “Not Hate” and Figure
comprehensive description of datasets is described 5c provides the pictorial information of the number
in Section 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, and 2.7. of samples to audio duration.

2.1 CMU-MOSEI 2.4 LJ Speech

Carnegie Mellon University - Multimodal Opinion This is another open-source dataset (Ito and John-
Sentiment and Emotion Intensity (CMU-MOSEI) son, 2017) which has 13,100 clips of short audio
(Bagher Zadeh et al., 2018) is considered the largest segments, with one speaker reading texts from a
and most extensive dataset for emotion recognition collection of seven books of non-fiction. Every clip
tasks and multimodal sentiment analysis. Figure 4a is transcribed and has a varying length of 1 to 10
provides the information on the number of samples seconds. Figure 4d provides the information on the
for “Hate” and “Not Hate” and Figure 5a provides number of samples for “Hate” and “Not Hate” and
(a) CMU-MOSEI (b) CMU-MOSI (c) Common Voice (d) LJ Speech

(e) MELD (f) Social-IQ (g) VCTK

Figure 4: Sample count for “Hate” and “Not Hate”

(a) CMU-MOSEI (b) CMU-MOSI (c) Common Voice (d) LJ Speech

(e) MELD (f) Social-IQ (g) VCTK

Figure 5: Scatter representation of Datasets according to audio length

Figure 5d provides the pictorial information of the validated and annotated, along with questions, an-
number of samples to audio duration. swers, and annotations for the level of complexity
2.5 MELD of the said questions and answers. Figure 4f pro-
vides the information on the number of samples for
Multimodal Emotion Lines Dataset (MELD) (Poria “Hate” and “Not Hate” and Figure 5f provides the
et al., 2019) has over 1,400 dialogues and 13,000 pictorial information of the number of samples to
dialogues from the television show “Friends”. Each audio duration.
utterance in dialogue has been labelled by one of
the emotions – Anger, Disgust, Sadness, Joy, Neu-
tral, Surprise, and Fear. MELD also has annota- 2.7 VCTK
tions for sentiments – positive, negative, and neu-
tral. Figure 4e provides the information on the The VCTK corpus (Yamagishi et al., 2019) con-
number of samples for “Hate” and “Not Hate” and tains 110 speakers’ speech data spoken in English,
Figure 5e provides the pictorial information of the having various accents. Every single speaker reads
number of samples to audio duration. a passage, selected from newspapers, archives, and
so on. Figure 4g provides the information on the
2.6 Social-IQ number of samples for “Hate” and “Not Hate” and
Another dataset (Zadeh et al., 2019) by Carnegie Figure 5g provides the pictorial information of the
Mellon University has videos that are thoroughly number of samples to audio duration.
3 Experiments troduced the symbols “< s >” and “< /s >” at
the start and end, respectively, signifying the com-
This section demonstrates our innovative tech- mencement and conclusion of the sentences.
niques for detecting Hatred within a speech. The
section is divided into numerous subsections for 3.2 Framework
understanding our approach with ease. Section 3.1
presents the methods we used to prepare the dataset We have used the Transformer (Vaswani et al.,
for our suggested framework. Section 3.2 describes 2017) framework which has gained widespread
our suggested framework. Section 3.3 discusses recognition and is considered SOTA in the domains
the parameters used for our proposed framework of Speech Recognition and Machine Translation
and Section 4 discusses the results of our approach (MT) due to its exceptional ability to handle the
with other benchmark frameworks. complexities of these complex tasks. To provide
a clear overview of our methodology, Figure 6
3.1 Dataset Pre-processing presents an overview of our framework.
In the task of pre-processing, we carefully selected The Speech Feature is extracted by the “log mel
the data that possessed comparable lengths of au- spectrogram” technique, which has been discussed
dio. We disregarded instances with excessively in section 3.1. This technique involves the computa-
long or short duration. Our inclination to overlook tion of a spectrogram that represents the frequency
excessively long audio duration stemmed from the content of an audio signal over time, using a loga-
understanding that it would necessitate extensive rithmic scale for the frequency axis. The resulting
computational resources. Conversely, audio with spectrogram has a dimension of “(80×time_step)”
extremely short duration lacked the richness of au- and is then passed to the Speech Sampling Block.
dio features. The Speech Sampling Block is responsible for se-
lecting a subset of the input spectrogram, based on
Values certain criteria (described in Section 3.2.1). On the
Sample Rate 16,000 Hz other hand, the tokenized Text, which is obtained
Number of FFT 400 through a process described in section 3.1, has a
Number of MELs Channel 80 dimension of “(max_length × 1)” and is passed
Hop Length 160 to the Text Sampling Block (discussed in Section
Chunk Length 30 3.2.2). The Text Sampling Block performs a simi-
Number of Samples 4,80,000 lar function as the Speech Sampling Block but on
Number of Frames 3,000 the tokenized Text instead of the spectrogram.
Number of Samples per Token 320 The resulting subset of Speech Sampling is fed
Frames per Second 10 ms to the Encoder of the first Transformer module and
Tokens per Second 25 ms the Decoder of the second Transformer module.
Similarly, the resulting subset of Text Sampling is
Table 3: Audio feature extraction parameters fed to the Decoder of the first Transformer module
and the Encoder of the second Transformer module.
We conducted a series of experiments with our The motivation behind this approach is to investi-
framework, exploring various methods of extract- gate whether the text in the Decoder can learn from
ing features such as Mel-frequency cepstral coef- the audio in the Encoder, and vice versa. This is
ficients (MFCCs) and filter banks. However, we inspired by the idea of MT, where the target text
discovered that the “log mel spectrogram” yields in the Decoder learns from the source text in the
superior accuracy in comparison to other alterna- Encoder. By applying this concept to the audio and
tives, as it captures auditory information in a man- text domains, we aim to explore the potential for
ner akin to human perception. To extract these cross-modal learning and the transfer of knowledge
features, we established the optimal parameters between different modalities.
empirically, which are detailed in Table 3. To further process the outputs of the two Trans-
For feature extraction from text, we used the former modules, we introduce a Long short-term
pre-trained Albert Tokenizer (Lan et al., 2020) memory (LSTM) block that consists of a single
from IndicBART (Dabre et al., 2022) developed LSTM layer. This LSTM block is responsible for
by AI4Bharat. To tokenize each sentence, we in- sequentially learning the knowledge from each step
Softmax

Linear

Attentive Fusion

Pipeline 1 Pipeline 2

LSTM LSTM

Add & Norm Add & Norm

Feed Forward Feed Forward

Add & Norm Add & Norm

Add & Norm Add & Norm
Multi-Head Multi-Head
Feed Forward Attention Attention Feed Forward

Add & Norm Add & Norm Add & Norm Add & Norm
Multi-Head Multi-Head Multi-Head Multi-Head
Attention Attention Attention Attention

Speech Sampling Text Sampling

Figure 6: Overview of our approach

of the output. After going through this process, we 3.2.1 Speech Sampling
obtain two outputs: one from the first LSTM and
another from the second LSTM. The combination
of the first Transformer with the first LSTM is rep- Convolutional
resented as “Pipeline 1” and the combination of
the second Transformer with the second LSTM is Convolutional

represented as “Pipeline 2”. These two outputs

Positional
are then passed to the proposed “Attentive Fusion” LSTM
Encoder
Layer (described in Section 3.2.3). The Attentive
Fusion Layer is designed to learn the knowledge
from both outputs in a joint manner, combining the
information from the two pipelines. The output of
Figure 7: Overview of Speech Sampling
the Attentive Fusion Layer is then fed to a Linear
Layer with Softmax activation, where it undergoes
Our module for “Speech Sampling” was influ-
further processing and classification according to
enced by the work of Radford et al. (2022) with mi-
the specific classes.
nor modifications. This module comprises a pair of
A comprehensive exploration of the Audio Sam- Convolutional layers, with each layer being accom-
pling Module, Text Sampling Module, and our pro- panied by a Gaussian Error Linear Unit (GELU)
posed Attentive Fusion layer are presented in Sec- activation function. The outcome of the Convolu-
tion 3.2.1, 3.2.2, and 3.2.3, respectively. All hyper- tional layer was passed through a Positional En-
parameter configurations can be found in Section coder and an LSTM layer separately. The results
3.3. from the Positional Encoder and LSTM were com-
bined. The Speech Sampling framework is shown (ϵ). The entire outcome was then subjected to mul-
in Figure 7. tiplication with wi itself. Equation 3 illustrates our
approach.
3.2.2 Text Sampling
Our “Text Sampling” module comprises a simplis- ′ wi
tic approach containing Word Embedding and a wi = × wi (3)
∑ i wi + ϵ
Positional Encoder. The raw text was tokenized
by appending with “< s >” and “< /s >” at the ′

commencement and conclusion of the sentences The value, wi obtained from equation 3 was in-
(refer to Section 3.1) and passed on to Word Em- troduced into the subsequent module that incorpo-
bedding. The subsequent output is then directed to rates a Linear Layer to differentiate between differ-
the Positional Encoder. Subsequently, the output of ent classes.
the Word Embedding and the Positional Encoder
are combined and conveyed to the subsequent hi- 3.3 Hyperparameters
erarchical module. The representation of the “Text 3.3.1 Speech Sampling
Sampling” framework can be seen in Figure 8.
For the two Convolutional layers, we used filter
sizes of “4096” and “1024”, respectively and kernel
st
size of “3” for both. Strides of “1” for 1 and “2”
nd
Word Embedding
for 2 Convolutional layer was used. For LSTM
layer units of “512” with activation function “tanh”
Positional Encoder and recurrent activation of “sigmoid” was used. For
the Positional Encoder vocab size of “64,014”, the
hidden dimension of “512” was passed.

3.3.2 Text Sampling

Figure 8: Overview of Text Sampling
For the Word Embedding, we used a vocab size of
“64,014”, and a hidden dimension of “512” with
3.2.3 Attentive Fusion Layer True mask zero. For the Positional Encoder vocab
The layer we have named the “Attentive Fusion” size of “64,014”, the hidden dimension of “512”
layer is a layer that we have devised for the pur- was passed.
pose of detecting hatred within a speech. In our
methodology (as illustrated in equation 1), we have 3.3.3 Transformer
seamlessly integrated the outcomes from Pipeline For the transformer framework, the number of
1 and Pipeline 2, allowing them to flow into their heads and the number of layers for the Encoder
respective Linear layers individually, thereby ensur- and Decoder were kept “4”. The hidden dimen-
ing the preservation of their unique characteristics. sions were kept “512”, and the dropout (Srivastava
et al., 2014) rate of “0.3”.
L1 = Linear(x1 )
(1)
L2 = Linear(x2 ) 3.3.4 LSTM
The result, L1 and L2 underwent a process of For the LSTM layer, units of “512” with activation
cross multiplication, after which a hyperbolic tan- function “tanh” and recurrent activation of “sig-
gent function (tanh) was used. To enhance the moid” were used, with a dropout rate of “0.3”. Use
disparity of each tensor value (wi ), an exponen- of bias and Forget bias with return sequence kept
tial (e) function was applied, as demonstrated in True.
equation 2.
3.3.5 Learning Rate
(tanh(L1 ×L2 ))
wi = e (2)
We use “AdamW” optimizer (Loshchilov and Hut-
The outcome of equation 2 wi underwent divi- ter, 2019) with β1 of “0.9”, β2 of “0.98”, ϵ of
−6
sion by the summation of every element of wi . To “1 × 10 ” and decay of “0.1” with adaptive learn-
prevent division by zero, we introduced an epsilon ing rate.
5 Ablation Studies
√
arg1 = cs 5.1 “Attentive Fusion layer” vs “Concatenate
−1.5 layer”
arg2 = cs × ws
√ (4)
Instead of using our proposed Attentive Fusion
lr = dmodel × min(arg1 , arg2 )
Layer, we used the Concatenate Layer to study
lrate = min(lr, 0.0004)
the effectiveness of the Attentive Fusion layer and
found it even outperformed the Concatenate layer.
In equation 4, cs is current step, warmup steps The differences in results are shown in Table 5.
(ws) were set to “2048” and dmodel of “512”.
Dev Test
4 Results Concatenate Layer 0.908 0.909
Attentive Fusion Layer 0.931 0.927
Table 4 provides the benchmark macro F1 scores Table 5: Macro F1 Score Result on replacing “Attentive
of the frameworks proposed by Ghosh et al. (2022). Fusion layer” with “Concatenate layer”
The proposed frameworks were trained only with
the audio sequences. Our study suggests that only
the audio sequences cannot provide a better under- 5.2 Using Pipeline 1 and Pipeline 2 separately
standing of hatred in speech. Current trends show We also checked whether alone text learning from
that persons use hateful words in spoken sentences audio representation or audio learning from textual
but the tones, frequency and amplitudes are kept representation outperformed our baseline result or
normal, which can also be remarked as “Sarcasmic not. For the evaluation, we used Pipeline 1 and
Behaviour”. To overcome the situation we used Pipeline 2 separately followed by Linear Layer,
multimodality where an audio specimen along with and found that it is unable to score at par result to
its transcripts are used. Using multimodality has our baseline. The differences in results are shown
an increase in macro F1 score compared with the F- in Table 6.
Bank framework proposed by Ghosh et al. (2022).
Dev Test
In cases of unfrozen wav2vec-2.0. the differ-
ences are very nominal as wav2vec-2.0 provides an Pipeline 1 0.910 0.909
embedding knowledge of each token of the audio Pipeline 2 0.910 0.899
specimen. In contrast, we didn’t use any embed- Our Baseline 0.931 0.927
ding knowledge of speech tokens, which will be
Table 6: Macro F1 Score Result on Pipeline 1 and
experimented with in our upcoming works. The re-
Pipeline 2 separately and compared with our baseline.
searcher has shown two wav2vec-2.0 among which
is one wav2vec-2.0 (9 layer). In this system, the
researcher took the representation token from the 6 Conclusion
th
9 layer.
In this work, we proposed a framework that can
classify whether a speech promotes Hatred or not.
System Category Dev Test
For the speech feature extraction, we used a log
F-Bank - 0.610 0.620 mel spectrogram feature extraction technique. Our
Freezed 0.448 0.457 framework consists of Speech Sampling and Text
wav2vec-2.0
Unfreezed 0.877 0.869 sampling followed by two separate transformer
wav2vec-2.0 Unfreezed 0.897 0.877 frameworks that serve different efforts. Each Trans-
(9 layer)
former framework is followed by an LSTM layer,
Proposed - 0.931 0.927
Framework the output of which is fed to our proposed layer,
and further sent to Linear Layer for Classification.
Table 4: Evaluation and Result of Different Systems The whole framework was able to outperform the
Proposed by Ghosh et al. (2022). Unfrozen wav2vec-2.0 existing benchmark macro F1 score. The only limi-
th
setup with representations taken from the 9 layer as tation of our approach is it is limited to the English
reported by Ghosh et al. (2022) language. In future work, we would like to test its
robustness for other languages.
Acknowledgements Soujanya Poria, Devamanyu Hazarika, Navonil Ma-
jumder, Gautam Naik, Erik Cambria, and Rada Mi-
This research was supported by the TPU Research halcea. 2019. MELD: A multimodal multi-party
Cloud (TRC) program, a Google Research initia- dataset for emotion recognition in conversations. In
tive. Proceedings of the 57th Annual Meeting of the As-
sociation for Computational Linguistics, pages 527–
536, Florence, Italy. Association for Computational
Linguistics.
References
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brock-
Rosana Ardila, Megan Branson, Kelly Davis, Michael
man, Christine McLeavey, and Ilya Sutskever. 2022.
Kohler, Josh Meyer, Michael Henretty, Reuben
Robust speech recognition via large-scale weak su-
Morais, Lindsay Saunders, Francis Tyers, and Gre-
pervision.
gor Weber. 2020. Common voice: A massively-
multilingual speech corpus. In Proceedings of the Aneri Rana and Sonali Jha. 2022. Emotion based hate
Twelfth Language Resources and Evaluation Confer- speech detection using multimodal learning. ArXiv,
ence, pages 4218–4222, Marseille, France. European abs/2202.06218.
Language Resources Association.
Anna Schmidt and Michael Wiegand. 2017. A survey
AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, on hate speech detection using natural language pro-
Erik Cambria, and Louis-Philippe Morency. 2018. cessing. In Proceedings of the Fifth International
Multimodal language analysis in the wild: CMU- Workshop on Natural Language Processing for So-
MOSEI dataset and interpretable dynamic fusion cial Media, pages 1–10, Valencia, Spain. Association
graph. In Proceedings of the 56th Annual Meeting of for Computational Linguistics.
the Association for Computational Linguistics (Vol-
ume 1: Long Papers), pages 2236–2246, Melbourne, Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky,
Australia. Association for Computational Linguistics. Ilya Sutskever, and Ruslan Salakhutdinov. 2014.
Dropout: A simple way to prevent neural net-
Raj Dabre, Himani Shrotriya, Anoop Kunchukuttan, works from overfitting. J. Mach. Learn. Res.,
Ratish Puduppully, Mitesh Khapra, and Pratyush Ku- 15(1):1929–1958.
mar. 2022. IndicBART: A pre-trained model for indic
natural language generation. In Findings of the As- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
sociation for Computational Linguistics: ACL 2022, Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz
pages 1849–1863, Dublin, Ireland. Association for Kaiser, and Illia Polosukhin. 2017. Attention is all
Computational Linguistics. you need. In Advances in Neural Information Pro-
cessing Systems, volume 30. Curran Associates, Inc.
Thomas Davidson, Dana Warmsley, Michael Macy, and
Ingmar Weber. 2017. Automated hate speech detec- Junichi Yamagishi, Christophe Veaux, and Kirsten Mac-
tion and the problem of offensive language. Proceed- Donald. 2019. CSTR VCTK Corpus: English multi-
ings of the International AAAI Conference on Web speaker corpus for CSTR voice cloning toolkit (ver-
and Social Media, 11. sion 0.92).
Paula Fortuna and Sérgio Nunes. 2018. A survey on Amir Zadeh, Michael Chan, Paul Pu Liang, Edmund
automatic detection of hate speech in text. ACM Tong, and Louis-Philippe Morency. 2019. Social-
Comput. Surv., 51(4). iq: A question answering benchmark for artificial
social intelligence. In 2019 IEEE/CVF Conference
Sreyan Ghosh, Samden Lepcha, S Sakshi, Rajiv Ratn on Computer Vision and Pattern Recognition (CVPR),
Shah, and Srinivasan Umesh. 2022. DeToxy: A pages 8799–8809.
Large-Scale Multimodal Dataset for Toxicity Classi-
fication in Spoken Utterances. In Proc. Interspeech Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-
2022, pages 5185–5189. Philippe Morency. 2016. Multimodal sentiment in-
tensity analysis in videos: Facial gestures and verbal
Keith Ito and Linda Johnson. 2017. The lj messages. IEEE Intelligent Systems, 31(6):82–88.
speech dataset. https://keithito.com/
LJ-Speech-Dataset/.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman,

Kevin Gimpel, Piyush Sharma, and Radu Soricut.
2020. ALBERT: A lite BERT for self-supervised
learning of language representations. In 8th Inter-
national Conference on Learning Representations,
ICLR 2020, Addis Ababa, Ethiopia, April 26-30,
2020. OpenReview.net.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled

weight decay regularization.

Uzima Borehole Drilling System
100% (4)
Uzima Borehole Drilling System
39 pages
FortiOS 7.0.10 Administration Guide
No ratings yet
FortiOS 7.0.10 Administration Guide
2,688 pages
CCBA-mock Test
No ratings yet
CCBA-mock Test
10 pages
The History of Sulu by Najeeb M. Saleeby (1908)
No ratings yet
The History of Sulu by Najeeb M. Saleeby (1908)
307 pages
Questionnaire On ONLINE TRANSACTION
78% (18)
Questionnaire On ONLINE TRANSACTION
3 pages
Multi-Modal Hate Speech Detection Using Machine
No ratings yet
Multi-Modal Hate Speech Detection Using Machine
5 pages
Multi-Modal Hate Speech Detection Using Machine Learning
No ratings yet
Multi-Modal Hate Speech Detection Using Machine Learning
4 pages
2402.01967v1
No ratings yet
2402.01967v1
7 pages
DETECTION OF HATE BASED POLITICAL SPEECH
No ratings yet
DETECTION OF HATE BASED POLITICAL SPEECH
5 pages
Multi Modal Hate Speech Detection Using Machine Learning
100% (1)
Multi Modal Hate Speech Detection Using Machine Learning
5 pages
Kec Ai Gryffindor Dravidianlangtech Naacl 2025
No ratings yet
Kec Ai Gryffindor Dravidianlangtech Naacl 2025
7 pages
7473-Article Text-10855-1-10-20200925
No ratings yet
7473-Article Text-10855-1-10-20200925
4 pages
A Multilingual Evaluation For Online Hate Speech Detection
No ratings yet
A Multilingual Evaluation For Online Hate Speech Detection
22 pages
2022 Aacl-Srw 5
No ratings yet
2022 Aacl-Srw 5
8 pages
Countering Hate Speech On Social Media
No ratings yet
Countering Hate Speech On Social Media
2 pages
Multimodal_Hate_Speech_Detection_in_Memes_Using_Contrastive_Language-Image_Pre-Training
No ratings yet
Multimodal_Hate_Speech_Detection_in_Memes_Using_Contrastive_Language-Image_Pre-Training
17 pages
540900a328
No ratings yet
540900a328
6 pages
Deep Learning for hate speech detection: Compararive Study
No ratings yet
Deep Learning for hate speech detection: Compararive Study
18 pages
Zeorth Review
No ratings yet
Zeorth Review
9 pages
2024.ltedi-1.20
No ratings yet
2024.ltedi-1.20
6 pages
STRUKTURA RADA
No ratings yet
STRUKTURA RADA
8 pages
Deep Learning Based Fusion Approach For Hate Speech Detection
No ratings yet
Deep Learning Based Fusion Approach For Hate Speech Detection
7 pages
Hate Speech Detection - Challenges and Solutions - PLOS ONE
No ratings yet
Hate Speech Detection - Challenges and Solutions - PLOS ONE
9 pages
Contextual-Aware and Expert Data Resources For Bra
No ratings yet
Contextual-Aware and Expert Data Resources For Bra
22 pages
Integrating_Handcrafted_Features_with_Machine_Lear
No ratings yet
Integrating_Handcrafted_Features_with_Machine_Lear
13 pages
Multimodal Language Analysis in The Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph
No ratings yet
Multimodal Language Analysis in The Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph
11 pages
Hate Speech Detection: Challenges and Solutions: A1111111111 A1111111111 A1111111111 A1111111111 A1111111111
No ratings yet
Hate Speech Detection: Challenges and Solutions: A1111111111 A1111111111 A1111111111 A1111111111 A1111111111
16 pages
Expert Systems - 2024 - Gandhi - Hate speech detection A comprehensive review of recent works
No ratings yet
Expert Systems - 2024 - Gandhi - Hate speech detection A comprehensive review of recent works
24 pages
5 Hate - Speech - Detection - in - Low-Resourced - Indian - Lang
No ratings yet
5 Hate - Speech - Detection - in - Low-Resourced - Indian - Lang
22 pages
Semester Project Report by Qaiser
No ratings yet
Semester Project Report by Qaiser
5 pages
Multimodal Prompt Transformer With Hybrid Contrastive Learning For Emotion Recognition in Conversation
No ratings yet
Multimodal Prompt Transformer With Hybrid Contrastive Learning For Emotion Recognition in Conversation
11 pages
Ousidhoun Multilingual and Multi Aspect HAte Speech Analysis
No ratings yet
Ousidhoun Multilingual and Multi Aspect HAte Speech Analysis
10 pages
Computers and Electrical Engineering: Jincai Chen Chao Sun Sheng Zhang Jiangfeng Zeng
No ratings yet
Computers and Electrical Engineering: Jincai Chen Chao Sun Sheng Zhang Jiangfeng Zeng
14 pages
Investigating Deep Learning Approaches For Hate
No ratings yet
Investigating Deep Learning Approaches For Hate
12 pages
Learning Style
No ratings yet
Learning Style
17 pages
Deep Learning Based Fusion Approach for Hate Speech Detection
No ratings yet
Deep Learning Based Fusion Approach for Hate Speech Detection
8 pages
Hate Speech Detection: A Solved Problem? The Challenging Case of Long Tail On Twitter
No ratings yet
Hate Speech Detection: A Solved Problem? The Challenging Case of Long Tail On Twitter
21 pages
A Voting Enabled Predictive Approach For Hate Speech Detection
No ratings yet
A Voting Enabled Predictive Approach For Hate Speech Detection
5 pages
Detecting and Visualizing Hate Speech in Social Media: A Cyber Watchdog For Surveillance-Modha2020
No ratings yet
Detecting and Visualizing Hate Speech in Social Media: A Cyber Watchdog For Surveillance-Modha2020
11 pages
Gitari - A Lexicon-Based Approach For Hate Speech Detection
0% (1)
Gitari - A Lexicon-Based Approach For Hate Speech Detection
16 pages
Marathi Hate Speech Detection IEEE Paper
No ratings yet
Marathi Hate Speech Detection IEEE Paper
5 pages
A Survey On Hate Speech Detection Using Natural Language Processing
No ratings yet
A Survey On Hate Speech Detection Using Natural Language Processing
10 pages
D19-1474
No ratings yet
D19-1474
10 pages
Semantic Speech Analysis Using Machine Learning and Deep Learning Techniques: A Comprehensive Review
No ratings yet
Semantic Speech Analysis Using Machine Learning and Deep Learning Techniques: A Comprehensive Review
30 pages
Ensemble_Text_Classification_with_TF-IDF_Vectorization_for_Hate_Speech_Detection_in_Social_Media
No ratings yet
Ensemble_Text_Classification_with_TF-IDF_Vectorization_for_Hate_Speech_Detection_in_Social_Media
7 pages
Hate speech detection is not as easy as you may think
No ratings yet
Hate speech detection is not as easy as you may think
17 pages
1 Generalizing Hate Speech Detection Using Multi-Task Learning
No ratings yet
1 Generalizing Hate Speech Detection Using Multi-Task Learning
20 pages
applsci-11-08575
No ratings yet
applsci-11-08575
21 pages
Design and Implementation of A Multichannel Convolutional Neural Network For Hate Speech Detection in Social Networks
No ratings yet
Design and Implementation of A Multichannel Convolutional Neural Network For Hate Speech Detection in Social Networks
10 pages
An inter-modal attention-based deep learning framework using unified modality for multimodal fake news, hate speech and offensive language detection
No ratings yet
An inter-modal attention-based deep learning framework using unified modality for multimodal fake news, hate speech and offensive language detection
11 pages
Machine Learning Based Automatic Hate Speech Recognition System
No ratings yet
Machine Learning Based Automatic Hate Speech Recognition System
4 pages
A Survey On Automatic Detection of Hate Speech in Text
No ratings yet
A Survey On Automatic Detection of Hate Speech in Text
30 pages
Exploring Intensities of Hate Speech on Social Media
No ratings yet
Exploring Intensities of Hate Speech on Social Media
6 pages
1 Identification of Hate Speech in Social Media
No ratings yet
1 Identification of Hate Speech in Social Media
6 pages
A comprehensive review on automatic hate speech detection in the age of the transformer
No ratings yet
A comprehensive review on automatic hate speech detection in the age of the transformer
25 pages
MM 4
No ratings yet
MM 4
7 pages
FDIA 2023 Paper 4
No ratings yet
FDIA 2023 Paper 4
12 pages
Hatexplain: A Benchmark Dataset For Explainable Hate Speech Detection
No ratings yet
Hatexplain: A Benchmark Dataset For Explainable Hate Speech Detection
12 pages
paper 12
No ratings yet
paper 12
11 pages
12 V May 2024
No ratings yet
12 V May 2024
8 pages
A296 D Stamped
No ratings yet
A296 D Stamped
4 pages
Deep Learning: Fundamentals and Applications
From Everand
Deep Learning: Fundamentals and Applications
Fouad Sabry
No ratings yet
Language Identification: Fundamentals and Applications
From Everand
Language Identification: Fundamentals and Applications
Fouad Sabry
No ratings yet
Agentive Cognitive Construction Grammar: Mind, Agency and the Materiality of Language: Agentive Cognitive Construction Grammar
From Everand
Agentive Cognitive Construction Grammar: Mind, Agency and the Materiality of Language: Agentive Cognitive Construction Grammar
Sergio Torres-Martínez
No ratings yet
Analysis of a Medical Research Corpus: A Prelude for Learners, Teachers, Readers and Beyond
From Everand
Analysis of a Medical Research Corpus: A Prelude for Learners, Teachers, Readers and Beyond
Georgette Nicolas Jabbour
No ratings yet
CIFAKE Image Classification and Explainable Identification of AI-Generated Synthetic Images
No ratings yet
CIFAKE Image Classification and Explainable Identification of AI-Generated Synthetic Images
10 pages
Real-Time Human Tracking Using Multi-Features Visual With CNN-LSTM and Q-Learning
No ratings yet
Real-Time Human Tracking Using Multi-Features Visual With CNN-LSTM and Q-Learning
15 pages
Real-Time Human Tracking Using Multi-Features Visual With CNN-LSTM and Q-Learning
100% (1)
Real-Time Human Tracking Using Multi-Features Visual With CNN-LSTM and Q-Learning
16 pages
Benchmarking Probabilistic Deep Learning Methods For License Plate Recognition
No ratings yet
Benchmarking Probabilistic Deep Learning Methods For License Plate Recognition
14 pages
BS Iso Iec 15067-3-3-2019
No ratings yet
BS Iso Iec 15067-3-3-2019
30 pages
Dell Inspiron 14z-5423 - WISTRON DMB40 - 11289-1
No ratings yet
Dell Inspiron 14z-5423 - WISTRON DMB40 - 11289-1
105 pages
Installation & Commissioning of HX260 Modems1
No ratings yet
Installation & Commissioning of HX260 Modems1
56 pages
What Is Cluster Analysis?: - Cluster: A Collection of Data Objects
No ratings yet
What Is Cluster Analysis?: - Cluster: A Collection of Data Objects
9 pages
Patch Roll-Up CPR9 SRX - Apr 2024 (Released 4 - 2024)
No ratings yet
Patch Roll-Up CPR9 SRX - Apr 2024 (Released 4 - 2024)
4 pages
En Data Sheet 1258
No ratings yet
En Data Sheet 1258
2 pages
Cse - Ai
No ratings yet
Cse - Ai
13 pages
Boon
No ratings yet
Boon
18 pages
SignedBirthCertificate 10022966 507
No ratings yet
SignedBirthCertificate 10022966 507
1 page
Engaging Children With Educational Content Via Gamification: Research Open Access
No ratings yet
Engaging Children With Educational Content Via Gamification: Research Open Access
15 pages
Report On Stock Verification
100% (3)
Report On Stock Verification
3 pages
Display Advertising PowerPoint Presentation
No ratings yet
Display Advertising PowerPoint Presentation
17 pages
LiDAR Notes
No ratings yet
LiDAR Notes
70 pages
VB Script Help
No ratings yet
VB Script Help
275 pages
Mobile Based Attendance System
100% (1)
Mobile Based Attendance System
21 pages
CORE I
No ratings yet
CORE I
2 pages
L-032 L2 Samyuktha Mandampully PPS Experiment 1 PDF Algorithms Computer Programming 4
No ratings yet
L-032 L2 Samyuktha Mandampully PPS Experiment 1 PDF Algorithms Computer Programming 4
1 page
In GATE 2010.compressed
No ratings yet
In GATE 2010.compressed
18 pages
Instagram
No ratings yet
Instagram
14 pages
E Waste Related Communication
No ratings yet
E Waste Related Communication
3 pages
Automation in Construction: Hossein Naderi, Alireza Shojaei, Reachsak Ly
No ratings yet
Automation in Construction: Hossein Naderi, Alireza Shojaei, Reachsak Ly
19 pages
Oracle-Fusion-Financials Sample Resumes-1-1
No ratings yet
Oracle-Fusion-Financials Sample Resumes-1-1
7 pages
Comparison Between Phase-And Level-Shifted PWM Schemes For Flying Capacitor Multilevel Inverter
No ratings yet
Comparison Between Phase-And Level-Shifted PWM Schemes For Flying Capacitor Multilevel Inverter
4 pages
FZT955
No ratings yet
FZT955
7 pages
Programming in C: Sri Krishna Arts and Science College Coimbatore
No ratings yet
Programming in C: Sri Krishna Arts and Science College Coimbatore
38 pages

Attentive

Uploaded by

Attentive

Uploaded by

Attentive Fusion: A Transformer-based Approach to Multimodal Hate

Abstract Traditional methods of hate speech detection

ious disciplines such as NLP, audio signal process-

• We have introduced a layer termed “Attentive

Table 2: Data Statistics of “Hate” and “Not Hate”

the pictorial information of the number of samples

2.1 CMU-MOSEI 2.4 LJ Speech

(e) MELD (f) Social-IQ (g) VCTK

Figure 4: Sample count for “Hate” and “Not Hate”

(a) CMU-MOSEI (b) CMU-MOSI (c) Common Voice (d) LJ Speech

(e) MELD (f) Social-IQ (g) VCTK

Figure 5: Scatter representation of Datasets according to audio length

Add & Norm Add & Norm

Feed Forward Feed Forward

Add & Norm Add & Norm

Speech Sampling Text Sampling

Figure 6: Overview of our approach

represented as “Pipeline 2”. These two outputs

3.3.2 Text Sampling

Zhenzhong Lan, Mingda Chen, Sebastian Goodman,

Ilya Loshchilov and Frank Hutter. 2019. Decoupled

You might also like