
Marathi Hate Speech Detection IEEE Paper

This document presents an approach for detecting hate speech in Marathi language tweets. It uses a dataset of 25,000 tweets in Marathi classified into 4 categories: hate, offensive, profane and neutral. Several machine learning classifiers are tested on preprocessed tweet data, with XGBoost achieving the highest accuracy of 74.93%. Preprocessing steps include removing stopwords, punctuation, URLs and tokenization, lemmatization and part-of-speech tagging for feature extraction. Prior research on hate speech detection using machine learning techniques on other languages is also reviewed.

Uploaded by

ANJALI DEORE
Copyright
© © All Rights Reserved

Marathi Hate Speech Detection

Nihar Chaudhari Anushka Chavan Anjali Deore Siddharth Bhorge



Department of Electronics and Telecommunication, Vishwakarma Institute of Technology, Pune, India, 411037

Abstract - The massive expansion of social networking websites has made it easier for people with various cultural and psychological backgrounds to communicate directly with one another. This has led to an increase in online conflicts between them. This paper proposes an approach to detect Marathi hate speech. A publicly available dataset of tweets in the Marathi language is used. Data preprocessing includes the removal of stopwords, punctuation, emojis, numbers, URLs, etc., and feature extraction is carried out using tokenization, lemmatization and POS tagging. The performances of XGBoost, Random Forest, Logistic Regression and SVM are compared in this study for the detection of hate speech. The XGBoost classifier provided the highest accuracy of 74.93%.

Keywords: Hate speech, Marathi tweets, Machine learning.

I. Introduction

Internet users are becoming more and more interested in online social media. The services offered by social networking providers like Twitter, Instagram and Facebook are extremely popular among internet users. Because of their prominence in the social networking space, these providers frequently struggle to handle rude and hateful language. Hence, such companies need to invest considerable attention and resources to tackle this problem and provide a permanent solution.

Hate speech is the use of hostile, violent, or offensive language directed at a certain group of people who share a characteristic, such as gender, ethnicity, race, or religious views. There is a critical need for a solution that detects hate speech automatically. This would automate decision-making and help turn social networking sites into a welcoming space for information sharing.

In this work, Marathi hate speech detection is carried out. Four classes (hate, offensive, profane and neutral) have been analyzed.

II. Literature Survey

In this paper, the authors have studied research works carried out between 2017 and 2022. An effective approach for detecting hate speech patterns and the most prevalent unigrams has been proposed in [1]. The tweets are classified into three classes, namely clean, offensive and hateful. Semantic, sentiment, unigram and pattern features are extracted in this approach. The accuracy achieved for binary classification was 87%, whereas it was 78.4% for ternary classification.

A method based on a deep neural network combining convolutional and gated recurrent networks (GRU) is proposed in [2]. The use of GRU over LSTM helped achieve better accuracy. This approach classifies tweets based on sexism and racism. Three deep neural network architectures are suggested by [3] to identify hate speech on Twitter: GRU, which is strong at capturing sequence order; CNN, which is good at feature extraction; and ULMFiT, which employs transfer learning. The ULMFiT model provided the best results with an accuracy of 97.5%.

A supervised learning model has been proposed in [4] to classify hate towards women on Twitter. Turkish tweets about women's clothing were used, and the machine learning algorithms achieved a maximum accuracy of 72%. Flesch-Kincaid grade level and Flesch Reading Ease scores are used to assess the quality of the tweets.

Automatic detection of hate tweets using machine learning with bag of words and TF-IDF features is proposed in [5]. A publicly available dataset of English tweets from Kaggle has been used for experimentation. An accuracy of 94% is obtained using each of the two features separately. The logistic regression classifier is used to classify whether the content is hateful or not. The approach in [6][7] used n-grams as features and passed their TF-IDF values to different machine learning models. An accuracy of 95.6% was achieved using this approach.

South African English tweets [8] are used to detect speech that is hateful and offensive. Word n-gram, character n-gram, negative emotion and syntactic-based features were extracted and analyzed. The Gradient Boost classifier achieved an accuracy of 80.3% for hate speech.

A method of classifying online hate using machine learning that makes use of word embeddings such as Distributed Bag of Words (DBOW) and Distributed Memory Mean (DMM), as well as Word2vec Convolutional Neural Networks (CNNs), is proposed in [9]. Two publicly available datasets consisting of 35000 and 25000 tweets are used in this approach to classify hate and non-hate tweets.

Hate speech detection was carried out on a dataset consisting of Urdu tweets in [10]. The Variable Global Feature Selection Scheme for dimensionality reduction and the Synthetic Minority Optimization Technique for class imbalance were used to get better performance. [11] uses a Marathi tweet dataset to classify the text as hate and non-hate. The text is encoded using encoding techniques like Bag of Words, n-gram and Word2Vec. The TF-IDF feature extraction technique is implemented and an accuracy of 77% was achieved. Subjective and semantic features are considered in [12], and a lexicon is created from hate and semantic features, which is further used for developing a hate speech detection model.

III. Methodology

A. Dataset Description

The dataset consists of 25,000 tweets with four main categories: hate, offensive, profane and neutral. Each category has 5300 tweets as shown in Fig 1.

Fig.1. Class wise tweet count bar plot

B. Preprocessing

Typically, the data is presented in phrases or paragraphs, which is how people naturally communicate. Therefore, the data must be transformed and cleaned before analysis so that the computer can interpret it.

The first step in the preprocessing of the data was noise removal. The noise in the dataset includes URL links and Twitter handle names. Using regular expressions, URL links (anything that comes after http/https) are removed first, followed by Twitter handle names. Then punctuation and special characters are eliminated, since the sentiment is not discernible from them, followed by stop words like a, an, the, is, etc. Since stop words have no real significance, it is not necessary to include them in order to understand the statement's sentiment.

The Python strip() method is used to eliminate the extra spaces left at the beginning and end of each line. The distribution of tweet length and tweet character count for the hate and profane classes is shown in Fig 2, and for the offensive and neutral classes in Fig 3.
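The noise-removal steps described under B. Preprocessing can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function name clean_tweet, the regular expressions and the tiny stop-word list are assumptions made for the example. Punctuation, emojis and digits are dropped by Unicode category rather than with a \w pattern, because \w in Python does not match Devanagari vowel signs and would damage Marathi words.

```python
import re
import unicodedata

# Illustrative stop-word list (an assumption; the paper does not publish its list).
STOP_WORDS = {"आहे", "आणि", "हे", "a", "an", "the", "is"}

URL_RE = re.compile(r"https?://\S+")   # anything that follows http/https
HANDLE_RE = re.compile(r"@\w+")        # Twitter handle names


def clean_tweet(text: str) -> str:
    """Remove URLs, handles, punctuation/symbols/digits and stop words."""
    text = URL_RE.sub(" ", text)
    text = HANDLE_RE.sub(" ", text)
    # Unicode categories: P* = punctuation, S* = symbols (incl. emojis), N* = numbers.
    # Combining marks (M*) are kept so Devanagari words stay intact.
    text = "".join(
        " " if unicodedata.category(ch)[0] in "PSN" else ch for ch in text
    )
    # strip() trims leading/trailing spaces, as mentioned in the text.
    tokens = [tok for tok in text.strip().split() if tok.lower() not in STOP_WORDS]
    return " ".join(tokens)
```

For example, clean_tweet("@user हे ट्विट वाईट आहे!!! https://t.co/abc 123") returns "ट्विट वाईट" under the stop-word list assumed above.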
Fig.2. Distribution of words in HATE and PRFN

Fig.3. Distribution of words in NOT and OFF

C. Feature Extraction

To extract features from text, part-of-speech tagging has been used. A unigram model is used to predict the probability of words. The following assumptions are made by the unigram model:

1. Each word's probability is independent of the words that came before it.

2. It depends on how frequently the word appears overall in the training text.

The raw text is divided into its component words; this process is called tokenization. The individual words are then tagged using part-of-speech (POS) tagging, also called grammatical tagging, in which each word is marked with its part of speech (noun, pronoun, adjective) based on the context. The process of feature extraction is shown in Fig 4.

Fig.4. Feature extraction process for hate speech detection

Once the tokens are tagged, they are passed to the unigram model, which gives the probability of individual words. The words are then lemmatized to remove unnecessary processing. Lemmatization groups together the different forms of a word under its base word. The system workflow is shown in Fig 5.

Fig.5. Overall workflow of Hate Speech Detection

The resulting tokens were treated as a series of words and the features were then extracted. For feature extraction, CountVectorizer is used. We used a unigram model to map the POS tags and then form the corresponding sentences. The sentences are broken down into words using CountVectorizer to tokenize the text, and this vocabulary is then used to encode new texts.
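The two unigram assumptions and the count-based encoding can be illustrated with a small plain-Python sketch. It is a stand-in written for this explanation, not scikit-learn's CountVectorizer and not the authors' pipeline; the function names and the toy English tokens are hypothetical.

```python
from collections import Counter


def unigram_probabilities(token_lists):
    """Maximum-likelihood unigram model: P(w) = count(w) / total token count.

    Each word's probability depends only on its overall frequency in the
    training text, not on the words that precede it.
    """
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def fit_vocabulary(token_lists):
    """Map every distinct training token to a column index (sorted for stability)."""
    words = sorted({tok for tokens in token_lists for tok in tokens})
    return {word: i for i, word in enumerate(words)}


def encode(tokens, vocabulary):
    """Count vector for one text; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for tok in tokens:
        idx = vocabulary.get(tok)
        if idx is not None:
            vector[idx] += 1
    return vector


# Toy training data (hypothetical, already tokenized and lemmatized).
train = [["bad", "tweet"], ["good", "tweet", "tweet"]]
probs = unigram_probabilities(train)   # probs["tweet"] is 3/5 = 0.6
vocab = fit_vocabulary(train)          # {"bad": 0, "good": 1, "tweet": 2}
new_vector = encode(["tweet", "new", "bad", "tweet"], vocab)  # [1, 0, 2]
```

The vocabulary is fixed at training time, so a word that appears only in a new text (here "new") contributes nothing to its vector.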
D. Classification

There are four classes of speech, i.e. hate, offensive, profane and neutral. Classification is performed using four algorithms: XGBoost, Logistic Regression, Random Forest and SVM.

The first classifier used is XGBoost. The equation for XGBoost is given by eq. 1.

$$y_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \tag{1}$$

where K is the number of trees, f_k is a function in the functional space F, and F is the set of CARTs.

To improve the predicted accuracy of a dataset, Random Forest combines a number of decision trees built on various subsets of the data and averages the results. The equation for Random Forest is given by eq. 2.

$$RF = \frac{\sum c}{T} \tag{2}$$

where c is the entropy of all trees and T is the total count of trees in the forest.

SVM is used for both classification and regression. Classification in SVM is carried out by finding the hyperplane which differentiates the two classes. The equation for the hyperplane is given by eq. 3.

$$w \cdot x + b = 0 \tag{3}$$

where x is the data point, w is the vector normal to the hyperplane, and b is the bias.

In logistic regression, the dependent variable is modelled using a logistic function. The hyperparameter used is a random state whose value has been taken as 0. The equation of logistic regression is given by eq. 4.

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k \tag{4}$$

IV. Results

The testing accuracies achieved by the classifiers XGBoost, Random Forest, Logistic Regression and SVM are 74.93%, 74.80%, 72.32% and 70.12%, respectively. Table 1 presents the testing accuracies along with the other performance evaluation metrics.

Table 1 : Performance evaluation metrics

Classifier              Accuracy (%)  Precision (%)  Recall (%)  F1 score (%)
Logistic Regression     72.32         73.23          74.34       74.00
Random Forest           74.80         80.20          70.42       75.41
Support Vector Machine  70.12         74.15          70.35       71.05
XGBoost                 74.93         80.23          75.45       70.24

Recall, precision and F1 score are the other performance evaluation metrics used. Comparing these values, XGBoost gives the highest accuracy, precision (80.23%) and recall (75.45%), with an F1 score of 70.24%.

V. Conclusion

This paper presents a novel method for hate speech detection from Marathi tweets. This machine learning approach classifies the tweets as hate, offensive, profane and neutral. A comparative study using four machine learning models (Random Forest, SVM, XGBoost, Logistic Regression) is performed, and the XGBoost classifier provides the maximum accuracy of 74.93%.

VI. References

[1] Watanabe, Hajime, Mondher Bouazizi, and Tomoaki
Ohtsuki. "Hate speech on twitter: A pragmatic
approach to collect hateful and offensive expressions
and perform hate speech detection." IEEE access
2018, pp. 13825-13835.
[2] Zhang, Ziqi, David Robinson, and Jonathan Tepper.
"Detecting hate speech on twitter using a
convolution-gru based deep neural network." In
European semantic web conference, 2018, pp.
745-760.
[3] Amrutha, B. R., and K. R. Bindu. "Detecting hate
speech in tweets using different deep neural network
architectures." In 2019 International Conference on
Intelligent Computing and Control Systems (ICCS),
2019, pp. 923-926.
[4] Şahi, Havvanur, Yasemin Kılıç, and Rahime Belen
Saǧlam. "Automated detection of hate speech towards
woman on Twitter." In 2018 3rd international
conference on computer science and engineering
(UBMK), 2018, pp. 533-536.
[5] Koushik, Garima, K. Rajeswari, and Suresh Kannan
Muthusamy. "Automated hate speech detection on
Twitter." In 2019 5th International Conference On
Computing, Communication, Control And Automation
(ICCUBEA), 2019, pp. 1-4.
[6] Gaydhani, Aditya, Vikrant Doma, Shrikant Kendre,
and Laxmi Bhagwat. "Detecting hate speech and
offensive language on twitter using machine learning:
An n-gram and tf idf based approach." arXiv preprint
arXiv:1809.08651 (2018).
[7] Davidson, Thomas, Dana Warmsley, Michael Macy,
and Ingmar Weber. "Automated hate speech detection
and the problem of offensive language." In
Proceedings of the international AAAI conference on
web and social media, 2017, vol. 11, no. 1, pp.
512-515.
[8] Oriola, Oluwafemi, and Eduan Kotzé. "Evaluating
machine learning techniques for detecting offensive
and hate speech in South African tweets." IEEE
Access 8 (2020): 21496-21509.
[9] Ketsbaia, Lida, Biju Issac, and Xiaomin Chen.
"Detection of hate tweets using machine learning and
deep learning." In 2020 IEEE 19th International
Conference on Trust, Security and Privacy in
Computing and Communications (TrustCom), 2020,
pp. 751-758.
[10] Ali, Muhammad Z., Sahar Rauf, Kashif Javed, and
Sarmad Hussain. "Improving hate speech detection of
Urdu tweets using sentiment analysis." IEEE Access 9
(2021): 84296-84305.
[11] Gajbhiye, Disha, Swapnil Deshpande, Prerna Ghante,
Abhijeet Kale, and Deptii Chaudhari. "Machine
Learning Models for Hate Speech Identification in
Marathi Language." In Forum for Information
Retrieval Evaluation (Working Notes) (FIRE),
CEUR-WS.org, 2021.
[12] Gitari, Njagi Dennis, Zhang Zuping, Hanyurwimfura
Damien, and Jun Long. "A lexicon-based approach for
hate speech detection." International Journal of
Multimedia and Ubiquitous Engineering 10, no. 4
(2015): 215-230.
