
Group Document

Uploaded by

Nikitha Shinde
A Project Report

On

ONLINE ABUSE DETECTION USING MACHINE LEARNING

Submitted in partial fulfillment of the requirements for the award of degree of
Bachelor of Engineering
In
Information Technology

Submitted by

S. Nikitha (1608-20-737-010)
K. Ashish Yadav (1608-20-737-018)
B. Sri Phani (1608-20-737-022)

Under the Esteemed Guidance of
Dr. K. Vikram Reddy
Associate Professor, Department of IT

Department of Information Technology

MATRUSRI ENGINEERING COLLEGE

(An Autonomous Institution)
(Sponsored by Matrusri Education Society, Estd. 1980)
(Affiliated to Osmania University, Approved by AICTE)
Saidabad, Hyderabad, Telangana-500059
(2023-24)
CERTIFICATE
This is to certify that the project report entitled “ONLINE ABUSE DETECTION
USING MACHINE LEARNING”, submitted by S. Nikitha (1608-20-737-010),
K. Ashish Yadav (1608-20-737-018), and B. Sri Phani (1608-20-737-022) in partial fulfilment
of the requirements for the award of the degree of Bachelor of Engineering in
“Information Technology”, O.U., Hyderabad, during the year 2023-2024, is a record of
bonafide work carried out by them under my guidance. The results presented in this report
have been verified and are found to be satisfactory.

Dr. K. Vikram Reddy                    Dr. G. Shyama Chandra Prasad
Associate Professor                    Professor & Head
Project Guide                          Dept. of IT

External Examiner(s)

ACKNOWLEDGEMENT

We wish to express our gratitude to our project guide for his valuable supervision and
timely assistance with our project work.
We wish to express our gratitude to our project coordinator, Mr. K. Vikram Reddy,
Assistant Professor, Department of Information Technology, Matrusri Engineering
College, for his indefatigable inspiration, guidance, cogent discussion, constructive
criticism, and encouragement throughout this dissertation work.
We wish to express our gratitude to Dr. G. Shyama Chandra Prasad, Head of
Department, Information Technology, Matrusri Engineering College, for permitting us to
undertake this project.
We wish to express our gratitude to Dr. D. Hanumantha Rao, Principal of
Matrusri Engineering College, who permitted us to carry out the project work as part of
our academics.
We would like to thank the IT Department for giving us the opportunity to share
and contribute our part of the work and complete the project on time, and all the teaching and
supporting staff for their steadfast support and encouragement.
The satisfaction that accompanies the successful completion of any task would
be incomplete without mention of our family members and friends, who made it
possible and whose encouragement and guidance have been a source of inspiration.

DECLARATION

We hereby declare that the project entitled “ONLINE ABUSE DETECTION USING
MACHINE LEARNING”, submitted to Osmania University in partial fulfilment of
the requirements for the award of the degree of B.E. in IT, is a bonafide and original
work done by us under the guidance of Mr. K. Vikram Reddy, Assistant Professor,
Information Technology, and that this project work has not been submitted to any other
university for the award of any degree or diploma.

S. Nikitha (1608-20-737-010)
K. Ashish Yadav (1608-20-737-018)
B. Sri Phani (1608-20-737-022)

ABSTRACT

The emergence of social media platforms has given rise to an unparalleled level of hate speech in
public conversations. The number of tweets that contain hate speech and target individual users
increases every day. Unfortunately, any user engaged on these platforms runs the risk of being
targeted or harassed through abusive language that expresses hate towards race, colour,
religion, descent, gender, nation, etc. Such bullying, trolling, and harassment can be very
serious, and in several cases may lead to the suicide of the victim.

Hate speech can also take the form of sarcasm or an indirect taunt, which makes it hard for
users or systems to understand the intent behind a tweet. Therefore, the need for an automatic and
scalable detection system has become a priority. To stop hate speech spreaders, a machine learning
system needs to be developed that automatically detects spreaders of hate speech based on the
contents of their posts, where the model should be able to infer the meaning of a word with
respect to its context.

The main objective of the project is to overcome the limitations of existing systems and introduce a
model with greater accuracy. The word embeddings are made task-specific by training on the
dataset provided. This reduces the out-of-vocabulary problem to some extent. Various
classifiers were paired with these embeddings and tested, and the CNN was found to be the most
effective. Hatred spreading through the use of language on social media platforms and in online
groups is becoming a well-known phenomenon. By comparing two text representations, bag of words
(BoW) and pre-trained word embeddings using GloVe, we used a binary classification approach to
automatically process user content to detect hate speech. The Naive Bayes Algorithm (NBA), Logistic
Regression Model (LRM), Support Vector Machines (SVM), Random Forest Classifier (RFC), and
the one-dimensional Convolutional Neural Network (1D-CNN) are the models proposed. With a
weighted macro-F1 score of 0.66 and an accuracy of 0.90, the 1D-CNN with GloVe
embeddings performed best among all the models.

LIST OF FIGURES

Figure No.  Figure Name

1.1  The problem statement
2.1  Hierarchy of hate speech concepts
4.1  The input and output of Word2Vec
4.2  Word2Vec windows of different sizes and their respective training samples
4.4  The architecture of CNN for image detection
4.5  Illustration of a Convolutional Neural Network (CNN) architecture for sentence classification
4.6  Narrow vs. Wide Convolution. Filter size 5, input size 7. Source: A Convolutional Neural Network for Modelling Sentences (2014)
4.7  Convolution Stride Size. Left: Stride size 1. Right: Stride size 2.
4.8  Max pooling in CNN
4.9  System Architecture for Hate Speech Spreader Detection
5.1  Use Case Diagram for Hate Speech Spreaders Detection
5.2  Activity Diagram for Hate Speech Spreaders Detection
5.3  State Chart Diagram for Hate Speech Spreaders Detection
5.4  Sequence Diagram for Hate Speech Spreaders Detection
6.1  Predicted Result 1
6.2  Predicted Result 2
6.3  Predicted Result 3

TABLE OF CONTENTS

TITLE
CERTIFICATE
ACKNOWLEDGEMENT
DECLARATION
ABSTRACT
LIST OF FIGURES
CHAPTER 1 - INTRODUCTION
1.1 Introduction
1.2 Proposed Solution
1.3 Objective
1.4 Scope
CHAPTER 2 - LITERATURE SURVEY
2.1 Literature Survey
2.2 Feasibility Study
2.2.1 Operational Feasibility
2.2.2 Operational Feasibility
2.2.3 Technical Feasibility
2.2.4 Economic Feasibility
CHAPTER 3 - REQUIREMENTS SPECIFICATION
3.1 Software Requirements
3.2 Hardware Requirements
CHAPTER 4 - SYSTEM ANALYSIS
4.1 Existing System
4.2 Proposed System
4.2.1 Word2Vec
4.2.2 TF-IDF
4.2.3 Embedding Layer
4.2.4 Convolutional Neural Networks
4.2.5 Character-Level CNNs
4.2.6 Developed Models
4.3 System Architecture
4.4 Module Description
4.5 Advantages of Proposed System
4.6 Applications of Proposed System
CHAPTER 5 - DESIGN
5.1 UML Diagrams
5.1.1 Use Case Diagram
5.1.2 Activity Diagram
5.1.3 State Chart Diagram
5.1.4 Sequence Diagram
CHAPTER 6 - IMPLEMENTATION
6.1 Sample Code
6.2 Output Screens
CHAPTER 7 - CONCLUSION
7.1 Conclusion and Future Enhancements
CHAPTER 8 - REFERENCES
8.1 References

CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
Social media has become an important communication medium. Social media platforms such as
Facebook, Twitter, LinkedIn, and WhatsApp enable users to interact with each other by both
sharing and consuming information. Information can spread widely, and even go viral, on social
media very quickly. Unfortunately, hate speech can spread just as easily, which not only harms
individual victims but also has an adverse impact on society.

Users can now share everything they desire with their followers and express their opinions in an
instant. While freedom of speech is important and generally should not be restricted, not all
kinds of speech should be tolerated. This brings us to the subject of hate speech. Freedom of
speech in the virtual environment has given users a false sense of security, a sense that even hate
speech can be shared without repercussions. Hate speech is commonly defined as any
communication that uses offensive and threatening language to target specific groups of
people on the basis of characteristics such as religion, ethnicity, nationality, race, color, gender,
or other attributes. The huge amount of user-generated content on the web and social
media has given rise to a variety of challenges, including the spreading and sharing of hate
speech messages.

However, classifying users as hate speech spreaders or not from their posts is a challenging task.
Detecting offensive language on social media is not an easy research problem because of the
different levels of ambiguity present in natural language and the noisy nature of social media
language. In addition, social media subscribers come from linguistically diverse communities.
The PAN at CLEF 2021 task “Profiling Hate Speech Spreaders on Twitter” deals with the detection
of hate speech spreaders in two languages, English and Spanish, meaning that classification needs
to be done at the user level and not at the post level. Since the network is huge, it is almost
impossible to stop hate speech spreaders using only human resources. Therefore, numerous
computational methods are being developed to enable automated detection of hate spreaders on
social networks. Advances in natural language processing (NLP) and text analysis techniques
provide an automated way to identify hate speech by analyzing social media posts.
To stop hate speech spreaders, a machine learning system needs to be developed that
automatically detects spreaders of hate speech based on the contents of their posts, such that the
model is trained on contextual embeddings.

Figure 1.1 depicts the problem statement: each user's 200 tweets act as the input, and the user is
classified as either a Hate Speech Spreader or a Non-Hate Speech Spreader.

Figure 1.1: The problem statement

1.2 PROPOSED SOLUTION


Based on the results and analysis presented in the study, the proposed solution for detecting hate
speech in social media posts uses a one-dimensional Convolutional Neural Network
(1D-CNN) with pre-trained GloVe word embeddings. The solution breaks down as follows:
1. Model Selection: The study compared various machine learning models and deep learning
architectures for hate speech detection, including the Naive Bayes Algorithm (NBA), Logistic
Regression Model (LRM), Support Vector Machines (SVM), Random Forest Classifier
(RFC), and the 1D-CNN.
2. Feature Representation: Two main types of feature representation were used: Bag-of-Words
(BoW) and pre-trained GloVe word embeddings. GloVe embeddings, which capture semantic
relationships between words, were used for the 1D-CNN model.
3. Training Data: The study used a dataset annotated at the sentence level, focusing on Internet
forum posts in English. The data were manually labeled into two classes: Hate and No-Hate.
4. Evaluation Metrics: The performance of each model was evaluated using metrics such as
accuracy and F1 score. The 1D-CNN model with GloVe embeddings achieved the highest
weighted macro-F1 score of 0.66 and an accuracy of 0.90.
5. Solution Summary: The proposed solution uses the 1D-CNN model with GloVe
embeddings to automatically detect hate speech in social media posts.
6. Improvements: To improve the performance of the GloVe embeddings, increasing the size of
the training dataset or fine-tuning the embeddings specifically for hate speech detection could be
considered. Additionally, exploring other deep learning architectures or ensemble methods may
further improve detection accuracy.
Overall, the solution offers an efficient and accurate method for identifying hate speech in online
content, which can contribute to mitigating the negative impacts of hate speech on society.
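
The gap between the reported accuracy (0.90) and macro-F1 (0.66) is typical of imbalanced data, where most posts belong to the No-Hate class. The following is a minimal sketch, not the project's evaluation code, of how the two metrics can be computed by hand; the toy labels and all function names are assumptions made for illustration.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score when `positive` is treated as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions, so F1 is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# Hypothetical imbalanced toy labels: 1 = Hate, 0 = No-Hate.
y_true = [0] * 18 + [1] * 2
y_pred = [0] * 18 + [1, 0]  # one of the two hate posts is missed

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.2f} macro-F1={mf1:.2f}")
```

Because macro-F1 averages the classes equally, a single missed hate post lowers it far more than it lowers accuracy, which is why both numbers are worth reporting.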

1.3 OBJECTIVE
The objective of the model described in the paper is to develop an effective method for
automatically detecting hate speech in social media posts. This involves:
1. Comparing different text representations, such as Bag of Words (BoW) and GloVe word
embeddings.
2. Evaluating various machine learning and deep learning models, including Naive Bayes
Algorithm (NBA), Logistic Regression Model (LRM), Support Vector Machines (SVM), Random
Forest Classifier (RFC), and one-dimensional Convolutional Neural Networks (1D-CNN).
3. Training and testing the models on a dataset from a white supremacist forum to assess their
performance in terms of accuracy and F1 score.
4. Identifying the most effective model, which in this case was found to be 1D-CNN with GloVe
embeddings, for hate speech detection on social media platforms.
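
To make the first of these text representations concrete, here is a minimal Bag-of-Words sketch in plain Python. It is an illustration under simple assumptions (lowercasing and whitespace tokenization; the helper names are hypothetical), not the feature extraction code used in the project. GloVe, by contrast, maps each word to a dense pre-trained vector and is not shown here.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def bow_vector(document, vocabulary):
    """Count vector for one document; unknown tokens are ignored."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical two-document corpus.
docs = ["you are great", "you are you"]
vocab = build_vocabulary(docs)
vectors = [bow_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Each document becomes a fixed-length count vector, so word order is lost; that is the main limitation the embedding-based models are meant to address.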

1.4 SCOPE
The proposed hate speech detection model for social media posts involves:
• Text Classification: Identifying hate speech or non-hate speech content in social media posts.
• Feature Representation: Utilizing methods like Bag-of-Words (BoW) and GloVe word
embeddings to capture text semantics effectively.
• Performance Evaluation: Assessing model effectiveness through metrics like
accuracy, precision, recall, and F1 score.
• Implementation and Deployment: Deploying the most effective model in real-time
applications like social media moderation tools.
• Validation and Ethics: Validating model robustness with unseen data and addressing
ethical concerns like censorship and bias to ensure fairness and transparency.

CHAPTER 2
LITERATURE SURVEY

2.1 LITERATURE SURVEY


Hate speech is a poisonous discourse that can swiftly spread on social media or arise from prejudices
or disputes between different groups within and across countries [1]. A hate crime is a crime
committed against a person because of their actual or perceived affiliation with a specific group
(https://www.ilga-europe.org/what-we-do/our-advocacy-work/hate-crime-hate-speech, accessed on
1 March 2022). Facebook defines hate speech as an attack on an individual’s dignity based on
protected characteristics, including race, origin, or ethnicity. According to Twitter policies, tweets
should not be used to threaten or harass others because of their ethnicity, gender, religion, or any
other factor. YouTube likewise censors content that promotes violence or hatred toward certain
persons or groups, and additionally covers attributes such as age, caste, and disability. Hate speech
is often studied in relation to online radicalization or criminal activities [2]; however, it has also
been discussed in other contexts [3].

There is strong motivation to study the automatic detection of hate speech due to the
overwhelming online spread of information [4]. The detection of hate speech is crucial to reducing
crime and protecting people’s beliefs. This study is especially important in the face of ongoing
wars, distorting reality and dehumanizing the attacked Ukrainian nation. Studies have shown an
increase in hate speech against China on online social media, especially racist and abusive content
accusing people of causing the COVID-19 outbreak. On the other hand, a lower rate of hate speech
reduces crime, such as cyberbullying, which significantly affects social tranquility [5], leading to
minimal cyber-attacks [6].

Despite the many studies in the field, hate speech is still problematic and challenging. The
literature reports that both humans and machine learning models have difficulty detecting hate
speech due to the complexity and variety of hate speech categories. Hate is characterized by more
extreme behaviors associated with prejudice. Figure 2.1 shows numerous aspects of hate speech.
Hate speech is seen as a layer between aggressive and abusive text; however, all of these share the
offensive aspect. On the other hand, sexist, homophobic, and religious hate are relatively different
as they target a group of people or a gender. The figure shows that separating the concepts is
complicated, a challenge that has been previously discussed [7]. Specific hate speech definitions
are contentious [7]; racist and homophobic tweets, for example, are more likely to be labeled as
hate speech than other types of offensive or abusive content. Therefore, there is no general way to
decide whether an inflammatory text is hate speech [8]. The representation of
short documents, such as those on Twitter, is a significant issue; they pose additional challenges to the
traditional bag-of-words model, resulting in data sparsity due to insufficient contextual
information [9]. Furthermore, many datasets [10] differ because of these classifications and
definitions, making it challenging to compare machine learning models. In other words, hate
speech notions all fall under the umbrella of abusive text, according to Poletto et al. [1].

Figure 2.1 Hierarchy of hate speech concepts.


The literature contains several review articles; however, most of them target one area of
the literature or are relatively old [7,11,12,13,14]. The study in [7] was devoted to
building a generic metadata architecture for hate speech classification based on predefined score
groups using semantic analysis and fuzzy logic analysis. The study in [11] targeted hate speech
concerning gender, religion, and race related to cyberterrorism and international legal frameworks;
however, the study did not focus on Twitter or datasets for machine learning. Most of the papers
cited in [12] are related to the legal literature that defines hate speech for criminal sanctions. The
study in [13] was devoted to the geographical aspects of hate speech, social media platform diversity,
and the generic qualitative or quantitative methods used by researchers. To the best of our
knowledge, no review has been dedicated to the analysis of English hate speech datasets. This paper
aimed to review hate speech concepts, methods, and datasets in depth to give researchers
insight into the latest state-of-the-art studies in hate speech detection. This study carries out a
systematic literature review based on the methodology of Tranfield et al. [15]. The method
synthesizes and extracts results from evidence-based systematic literature along four
aspects: the research question, review criteria, final literature review, and data extraction and
synthesis. The studied topic is heterogeneous in its wide range of methods and homogeneous in
its use of textual Twitter hate speech content; because it is bounded by a specific research question,
the methodology of Tranfield et al. is applicable [16].

Table 2.1: Literature Survey of Hate Speech Detection

S.No | Name | Author | Algorithm | Conclusion | Drawbacks
1 | Keeping Children Safe Online: Analyzing What is Seen/Heard | Aleksandar Jevremovic, Mladen Veinovic, Milan Cabarkapa | BERT; for images CNN, LSTM, BLSTM | Achieved 95% accuracy (overall) and 91% (audio). Future focus: online grooming, self-harm. | Future focus on online grooming and self-harm detection.
2 | Detecting Toxic Remarks in Online Conversation | Pushpit Gautam | Gaussian Naïve Bayes, SVM | Label power set method with multinomial Naïve Bayes for multiple types. | Dataset issues: frequent kernel crashes, errors (over 150,000 comments).
3 | Detecting Islamic Radicalism in Arabic Tweets | Khalid T. Mursi, Ahmed S. Alghamdi | Word2Vec | Aids law enforcement in analyzing radical extremism on social media. | Small dataset, limited range of keywords.
4 | Text Mining and Text Analytics of Research Articles | Akshaya Udgave, Prasanna Kulkarni | Text Mining, NLP | Future prospects: diverse algorithms, multilingual challenges. | Challenges: handling ambiguity, multilingual text refinement.
5 | CNN for Toxic Comment Classification | Spiros V. Georgakopoulos | CNN, Word2Vec | Outperforms established methods for toxic comment classification. | Dataset biases, model interpretability, computational resources for training.
6 | Multilingual Toxic Text Detection | Guizhe Song, Degen Huang, Zhifeng | XLM-RoBERTa, MBERT | Emphasizes sample size reconstruction for effective detection. | Data imbalance, inference with large-scale language models.
7 | Toxic Comments Detection using LSTM | Krishna Dubey, Rahul Nair | ML, LSTM, NLP, ANN | Achieved 94% accuracy. Potential for precision improvement noted. | Underutilization of ELMo model, precision improvement.
8 | Hate Speech Detection Using DCNN | Pradeep Kumar Roy | DCNN | Achieved recall values of 0.88 for hate speech and 0.99 for non-hate speech. | Inability to predict 53% of tweets due to dataset imbalance towards non-hate tweets.

CHAPTER 3
REQUIREMENT SPECIFICATION
3.1 SOFTWARE REQUIREMENTS:
The following text contains the libraries required and their versions used while developing the
project:
i. Python 3

The whole project is developed in Python 3 in Google Colab.

ii. Google Colab


Google Colaboratory, or "Google Colab" for short, is a free online platform provided by Google
that allows users to write and run Python code in a web browser. Colab provides access to a
cloud-based runtime environment with pre-installed software packages and hardware accelerators,
making it a convenient tool for running deep learning models and other computationally intensive
tasks. Users can also easily share Colab notebooks with others, making it a popular platform for
collaborative research and education. Colab also integrates with other Google services such
as Google Drive, making it easy to import and export data and models. For the notebooks that
build models, we worked mostly on a local runtime in order to use local resources (RAM and
GPU).

iii. Keras
Version: 2.12.0
Keras is a high-level deep learning API, developed at Google, for implementing neural networks. It
is written in Python and makes neural networks easy to implement. It also
supports multiple backends for neural network computation.

Keras is relatively easy to learn and work with because it provides a Python frontend with a high
level of abstraction while offering a choice of backends for computation. This
makes Keras slower than other deep learning frameworks, but extremely beginner-friendly.
iv. TensorFlow
Version: 2.12.0

TensorFlow is an open-source library developed by the Google Brain team and released in 2015.
Python is by far the most common language used with TensorFlow. You can import the TensorFlow
library into your Python environment and carry out deep learning development. Programs follow a
fixed execution pattern: you first create nodes, which process the data, in the form of a graph.
The data is stored in the form of tensors, and the tensor data flows between the nodes. One of
TensorFlow’s best qualities is that it makes code development easy. The readily available APIs
save users from rewriting code that would otherwise be time-consuming to write.
TensorFlow speeds up the process of training a model. Additionally, the chance of errors in the
program is reportedly reduced, typically by 55 to 85 percent.
v. scikit-learn

Version: 1.2.0

Scikit-learn (sklearn) is one of the most robust libraries for machine learning in Python. It is
open source and built upon NumPy, SciPy, and Matplotlib. It provides a range of tools for machine
learning and statistical modeling, including dimensionality reduction, clustering, regression, and
classification, through a consistent interface in Python. Additionally, it provides many other tools
for evaluation, selection, model development, and data preprocessing.
vi. Pickle
The Python pickle module is used for serializing and de-serializing a Python object structure. Most
Python objects can be pickled so that they can be saved on disk. Pickle
“serializes” the object before writing it to a file, that is, it converts a Python object
(list, dict, etc.) into a byte stream. This byte stream contains all the
information necessary to reconstruct the object in another Python script.
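
The round trip described above can be sketched as follows. The dictionary is a hypothetical stand-in for whatever artifact is being saved (for example a fitted vocabulary), and the file name is an assumption; only standard-library calls are used.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a fitted artifact such as a tokenizer vocabulary.
artifact = {"vocab": {"hate": 0, "speech": 1}, "max_len": 200}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "artifact.pkl"

    # Serialize: the object is converted to a byte stream and written to disk.
    with path.open("wb") as fh:
        pickle.dump(artifact, fh)

    # De-serialize: the byte stream is read back and the object is rebuilt.
    with path.open("rb") as fh:
        restored = pickle.load(fh)

print(restored == artifact)
```

Note that files must be opened in binary mode ("wb"/"rb"), and that pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.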

vii. NumPy

Version: 1.23.5

NumPy is a general-purpose array-processing package. It provides a high-performance
multidimensional array object, and tools for working with these arrays. It is the fundamental
package for scientific computing with Python.

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional
container of generic data.
viii. Pandas
Version: 1.5.3
Pandas is an open-source library built on top of the NumPy library. It is a Python package that
offers various data structures and operations for manipulating numerical data and time series. It is
popular mainly because it makes importing and analyzing data much easier. Pandas is fast and
offers high performance and productivity for users.

ix. nltk

Version: 3.8.1

NLTK, the Natural Language Toolkit, must be installed before the project code can be run.
NLTK is a large toolkit aimed at supporting the entire Natural Language Processing (NLP)
workflow.

To install NLTK, run the following command in a terminal:

    sudo pip install nltk

Then start the Python shell by typing python in the terminal, and enter:

    import nltk
    nltk.download('all')

This download takes quite some time because of the large number of tokenizers,
chunkers, other algorithms, and corpora that are fetched.

x. collections

The collections module in Python provides different types of containers. A container is an object
that is used to store different objects and provides a way to access the contained objects and iterate
over them. Some of the built-in containers are tuple, list, and dict. The collections module adds
specialized containers such as Counter and defaultdict.
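
As a brief illustration of how such containers can help with text data, the sketch below groups posts by author with defaultdict and counts tokens with Counter. The sample posts and variable names are hypothetical and are not taken from the project's code.

```python
from collections import Counter, defaultdict

# Hypothetical (author, text) pairs.
posts = [
    ("user_a", "so much hate"),
    ("user_b", "nice day"),
    ("user_a", "hate hate"),
]

# defaultdict(list) creates an empty list the first time an author is seen.
tweets_by_author = defaultdict(list)
for author, text in posts:
    tweets_by_author[author].append(text)

# Counter tallies how often each whitespace-separated token occurs overall.
token_counts = Counter(
    tok
    for texts in tweets_by_author.values()
    for text in texts
    for tok in text.split()
)
top = token_counts.most_common(1)
print(top)
```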
xi. Gensim

Version: 4.3.1

Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with
large corpora. It is designed to handle large text collections efficiently, with a
focus on unsupervised learning techniques such as latent semantic analysis (LSA), latent Dirichlet
allocation (LDA), and random projections. Gensim also includes tools for text preprocessing, such
as tokenization and stopword removal, and supports models such as word2vec and fastText for
natural language processing (NLP) tasks. The library is open source and can be installed
via pip.

3.2 HARDWARE REQUIREMENTS:
• Operating System : Windows 10 64-bit
• RAM : Minimum 8 GB
• Processor : Intel i5 (12th gen) or equivalent
• GPU : NVIDIA GeForce RTX or equivalent with 4 GB memory
The project was developed on a machine running Windows 11 with 16 GB RAM, an
AMD Ryzen 5 5600 processor, and an NVIDIA GeForce RTX 3050 GPU with 4 GB of memory.

CHAPTER 4
SYSTEM ANALYSIS
4.1 PROBLEMS WITH EXISTING SYSTEM:
The existing models work well to a certain extent and have achieved good accuracies. A few of the
models used simple techniques such as the Linear SVM classifier and Logistic Regression, while
others used neural networks such as RNNs, LSTMs, Convolutional Neural Networks, and Artificial
Neural Networks. Apart from the classifiers, the main issue was with the generation of word
embeddings. The models had embeddings that:
 were pretrained
 had difficulty recognizing new words
 made classification computationally very expensive
 were based entirely on n-grams, which may not be needed for our task
 depended primarily on the dataset and word frequencies, which suits information retrieval
better than sentiment analysis
 did not capture contextual information

4.2 PROPOSED SYSTEM


The following section discusses the procedure and techniques we used throughout the process of
developing the solution.
Preprocessing:
The data set for this task was given in English. Each language corpus had data from 200 authors in
the train set and 100 authors in the test set. Half of the authors in both sets were labeled 1,
indicating that the author spreads hate speech, while the other half were labeled 0, indicating
that the author does not. The data was collected from users who posted on the Twitter social
network and was made available on the PAN competition website. Each of the 300 authors had 200
unique tweets, which sums to 60,000 tweets and seems a reasonable number. However, since the task
is to predict at author level rather than at tweet level, and the test set cannot be used for
creating the model, we end up with only 200 training examples. Moreover, not all posts from an
author who spreads hate speech are necessarily hate speech. To tackle this problem, the tweets of
an author are concatenated to form a single sample. Thus, we got 200 training examples for 200
authors. The test set was likewise restructured by concatenating the 200 tweets of each author into
a single example. However, that did not result in a larger test corpus, since we still predicted at
author level. To mitigate the effect of noise, other research papers analyzed which indicators
could be used to better discriminate between hate speech spreaders and non-spreaders. We considered
such indicators miscellaneous and did not expect them to add enough weight.
After removal of the XML tags, the emojis, hashtags, hyperlinks, and the retweet marker ‘RT’ were
removed. Contractions such as “isn’t” were expanded. Stop words were not removed, on the assumption
that they carry contextual information; negative stop words such as ‘not’ in particular were
preserved. The text was lowercased and lemmatized.
The preprocessed text is given as input to the Word2Vec model, which produces the word
embeddings.
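
The cleaning steps described above can be sketched with regular expressions. This is a minimal illustration and not the project's actual code: the contraction table is a tiny hypothetical sample, the patterns are assumptions about what counts as a tag, link, or hashtag, and lemmatization is left out because it needs an external library such as NLTK.

```python
import re

# Hypothetical, deliberately tiny contraction table; a real one would be larger.
CONTRACTIONS = {"isn't": "is not", "don't": "do not", "can't": "cannot"}

def clean_tweet(text):
    """Apply the cleaning steps in order; stop words such as 'not' are kept."""
    text = re.sub(r"<[^>]+>", " ", text)        # XML tags
    text = re.sub(r"https?://\S+", " ", text)   # hyperlinks
    text = re.sub(r"#\w+", " ", text)           # hashtags
    text = re.sub(r"\bRT\b", " ", text)         # retweet marker (before lowercasing)
    text = text.replace("’", "'").lower()       # normalize apostrophes, lowercase
    for short, full in CONTRACTIONS.items():    # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)       # emojis, digits, punctuation
    return " ".join(text.split())               # collapse repeated whitespace

example = clean_tweet("<doc>RT This isn't OK 😡 #angry https://t.co/x</doc>")
```

Expanding contractions before stripping punctuation matters here: otherwise “isn’t” would lose its apostrophe and the negation would no longer be recognizable.
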
There are two ways to handle this project:
A. Training on tweet level
This may seem equivalent to sentiment analysis: each tweet of every author is considered a
training sample. Each tweet is given the label of its author (in the dataset, the labels are
assigned to authors, not to tweets).

Note that although this may seem simpler to implement, it is not semantically correct for our
project, and it is not sound to assign tweets the labels of their authors, because not every tweet
of a hate speech spreader is hate speech.

Some of the previous approaches followed the tweet-level process and defined a threshold on the
number of tweets for an author to be a 'Hate Speech Spreader'.

B. Training on Author level


All the tweets of an author are concatenated to form a single sample. Thus, in each language,
there are 200 samples for the 200 authors in the training phase. This approach seemed
appropriate. We first ran experiments at tweet level and then shifted to author level; the
following sections describe the work done.
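
The author-level restructuring can be sketched as a simple group-and-join. The function name and the (author_id, tweet_text) input format are assumptions made for this illustration, not the layout of the PAN data files.

```python
from collections import defaultdict

def build_author_samples(tweets):
    """Concatenate tweets per author.

    tweets: iterable of (author_id, tweet_text) pairs (assumed input format).
    Returns a dict mapping each author_id to one concatenated text sample.
    """
    grouped = defaultdict(list)
    for author_id, text in tweets:
        grouped[author_id].append(text)
    return {author: " ".join(texts) for author, texts in grouped.items()}

# Hypothetical toy data: two authors instead of 200, two tweets instead of 200.
samples = build_author_samples([
    ("a1", "first tweet"),
    ("a2", "hello there"),
    ("a1", "second tweet"),
])
```

With the real data, this turns 200 authors with 200 tweets each into 200 long samples, one per author label.
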
The main factors that affect the performance of the model are:

i. Word Embeddings
ii. Classifier
There are many popular methods to generate word embeddings. We tried the following:

i. Word2Vec: It is a very popular method introduced by Google in 2013. It is pretrained on a large
corpus. GloVe and FastText are as well known as Word2Vec, but the way they are built is
different, and each has its own pros and cons. The Word2Vec mechanism is discussed in depth in
section 4.2.1.

ii. TF-IDF: This method generates embeddings from the occurrence of a word in a document
relative to its occurrence in the entire corpus. These embeddings do not contain information
about surrounding words. They are described in section 4.2.2.

iii. Embedding layer: Models such as Word2Vec, GloVe, and FastText face the out-of-vocabulary
problem, as they are pre-trained and limited to a specific set of words. FastText is based on
n-grams, so it may handle a few new words, but those embeddings might not be relevant. To
overcome this issue, we chose to build our own vocabulary and generate embeddings for each
word using an Embedding layer while iterating through each sample. It is explained in section
4.2.3.

The classifier used is:

i. CNN: CNNs are typically used for image classification tasks because of their ability to
capture local context and generalize it using pooling layers. However, 1D CNN layers can be
used for sequences such as text. This is explained in depth in sections 4.2.4 and 4.2.5.

4.2.1. Word2Vec:
Word2Vec is a family of models that learn distributed representations of the words in a corpus C.
Word2Vec (W2V) accepts a text corpus as input and outputs a vector representation for each word,
as shown in figure 4.1.

Figure 4.1 : The input and output of Word2Vec

There are two flavors of this algorithm: CBOW and Skip-Gram. Given a set of sentences (also
called a corpus), the model loops over the words of each sentence and either uses the current word
w to predict its neighbors (i.e., its context), an approach called “Skip-Gram”, or uses each of
these contexts to predict the current word w, in which case the method is called “Continuous Bag
Of Words” (CBOW).

Figure 4.2: Word2Vec windows of different sizes and their respective training samples
The vectors we use to represent words are called neural word embeddings, and representations are
strange. One thing describes another, even though those two things are radically different. As Elvis
Costello said: “Writing about music is like dancing about architecture.” Word2vec “vectorizes”
about words, and by doing so it makes natural language computer-readable — we can start to
perform powerful mathematical operations on words to detect their similarities.
So, a neural word embedding represents a word with numbers. It’s a simple, yet unlikely,
translation. Word2vec is similar to an autoencoder, encoding each word in a vector, but rather than
training against the input words through reconstruction, as a restricted Boltzmann machine does,
word2vec trains words against other words that neighbour them in the input corpus. As shown in
figure 4.2, it does so in one of two ways: either using context to predict a target word (a method
known as ‘Continuous Bag Of Words’, or CBOW), or using a word to predict a target context,
which is called ‘Skip-Gram’. We use the latter method because it produces more accurate results on
large datasets.

Figure 4.3: Comparing the architectures of CBOW and Skip-gram


When the feature vector assigned to a word cannot be used to accurately predict that word’s
context, the components of the vector are adjusted. Each word’s context in the corpus is the teacher
sending error signals back to adjust the feature vector. The vectors of words judged similar by
their context are nudged closer together by adjusting the numbers in the vector. In this project,
we focus on the Skip-Gram model which, in contrast to CBOW, takes the center word as input, as
depicted in figure 4.3, and predicts the context words.
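
To make the Skip-Gram setup concrete, the sketch below generates the (center word, context word) training pairs for a given window size. It only builds the training samples; the neural network that learns the vectors from these pairs is not shown. The function name and the example sentence are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word and its neighbors
    within `window` positions on either side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                  # clip the window at the start
        hi = min(len(tokens), i + window + 1)    # clip the window at the end
        for j in range(lo, hi):
            if j != i:                           # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["hate", "speech", "spreads", "fast"], window=1)
```

A larger window produces more pairs per word and captures broader topical context, while a smaller window emphasizes immediate neighbors.
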

4.2.2 TF-IDF:
TF-IDF stands for "Term Frequency-Inverse Document Frequency". It is a numerical statistic that
is used to evaluate how important a word is to a document in a collection of documents. TF-IDF
embeddings are generated by first computing the term frequency (TF) and inverse document
frequency (IDF) values for each word in the training data.

The term frequency (TF) value of a word is the number of times the word appears in a document.
The inverse document frequency (IDF) value of a word is a measure of how much information the
word provides, i.e., how rare or common it is in the collection of documents. The IDF value is
computed as the logarithm of the total number of documents divided by the number of documents
that contain the word.
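
The definitions above translate directly into code. The sketch below uses the raw count as TF and log(N / document frequency) as IDF, exactly as stated in the text; libraries such as scikit-learn use smoothed variants by default, so their numbers will differ. The function name and toy documents are illustrative assumptions.

```python
import math

def tf_idf(documents):
    """documents: list of token lists. Returns one {word: score} dict per document."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each word appear?
    doc_freq = {}
    for doc in documents:
        for word in set(doc):
            doc_freq[word] = doc_freq.get(word, 0) + 1

    scores = []
    for doc in documents:
        row = {}
        for word in set(doc):
            tf = doc.count(word)                      # term frequency (raw count)
            idf = math.log(n_docs / doc_freq[word])   # inverse document frequency
            row[word] = tf * idf
        scores.append(row)
    return scores

scores = tf_idf([["you", "are", "kind"], ["you", "are", "awful", "awful"]])
```

Words that occur in every document, such as "you" here, receive a score of zero, while a word repeated in only one document receives the highest weight.
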

TF-IDF embeddings are useful for tasks like sentiment analysis because they capture the
importance of each word in a document, and the resulting vector representations can be used as
features for machine learning models. For example, the TF-IDF vectors can be used as input to a
classification algorithm, such as logistic regression or support vector machines, to predict the
sentiment of a document. By using TF-IDF embeddings, the model can give more weight to the
words that are most important for the sentiment classification task.

4.2.3. Embedding Layer:


The Embedding Layer in a neural network is a trainable layer that is used to create word
embeddings. It maps each word in the input text to a dense vector of fixed size. During
training, the weights of the embedding layer are adjusted to minimize the loss function of the
neural network. The optimization process updates the embedding vectors such that words that are
similar in meaning are mapped to similar vectors in the embedding space.
The process of training the embedding layer involves the following steps:
 Tokenization: The text is split into words or subwords using a tokenizer.
 Indexing: Each word or subword is mapped to a unique integer index.
 Padding: The sequences of word indices are padded to a fixed length to make them uniform in
size.

 Embedding: Each word index is converted to a dense vector of fixed size by looking up
its corresponding embedding vector in a lookup table.
 Training: The neural network is trained on the embedded sequences, with the weights of
the embedding layer updated during the backpropagation process.
After the training process, the embedding layer produces a fixed-size vector representation of each
word in the input text. These word embeddings can be used as input to downstream tasks such as
sentiment analysis, machine translation, and text classification.
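
The first four steps (tokenization, indexing, padding, and embedding lookup) can be sketched without any deep learning library. The helper names are hypothetical, the lookup table is filled with random numbers, and the training step, in which backpropagation updates the table, is deliberately omitted.

```python
import random

def build_index(texts):
    """Map each distinct word to an integer; 0 is reserved for padding."""
    vocab = {}
    for text in texts:
        for word in text.split():               # whitespace tokenization
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode(text, vocab, max_len):
    """Convert text to word indices, truncated or zero-padded to max_len."""
    ids = [vocab[w] for w in text.split() if w in vocab][:max_len]
    return ids + [0] * (max_len - len(ids))

def make_table(vocab_size, dim, seed=0):
    """Randomly initialized lookup table with one row per index (plus padding)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size + 1)]

texts = ["you are not kind", "you are rude"]
vocab = build_index(texts)
ids = encode("you are rude", vocab, max_len=5)
table = make_table(len(vocab), dim=4)
embedded = [table[i] for i in ids]              # embedding lookup: one row per token
```

In a framework such as Keras, the table corresponds to the weights of the embedding layer, and the lookup is performed inside the layer during the forward pass.
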

4.2.4. Convolutional Neural Networks:


CNNs are essentially several layers of convolutions with nonlinear activation functions like ReLU
(Rectified Linear Unit) or tanh applied to the results. In a traditional feedforward neural network,
we connect each input neuron to each output neuron in the next layer. That’s also called a fully
connected layer, or affine layer. In CNNs we instead use convolutions over the input layer to
compute the output. This results in local connections, where each region of the input is connected
to a neuron in the output. During the training phase, a CNN automatically learns the values of its
filters based on the task you want to perform. For example, as in the architecture shown in figure
4.4, an image classification CNN may learn to detect edges from raw pixels in the first layer, then
use the edges to detect simple shapes in the second layer, and then use these shapes to detect
higher-level features, such as facial shapes, in higher layers. The last layer is then a classifier
that uses these high-level features.

Figure 4.4: The architecture of CNN for image detection


There are two aspects of this computation worth paying attention to: Location Invariance and
Compositionality. Let’s say you want to classify whether or not there’s an elephant in an image.
Because you are sliding your filters over the whole image, you don’t really care where the elephant
occurs. In practice, pooling also gives you invariance to translation, rotation and scaling, but more
on that later. The second key aspect is (local) compositionality: each filter composes a local
patch of lower-level features into a higher-level representation.

So, how does any of this apply to NLP?

Instead of image pixels, the input to most NLP tasks are sentences or documents represented as a
matrix. Each row of the matrix corresponds to one token, typically a word, but it could be a
character. That is, each row is vector that represents a word. Typically, these vectors are word
embeddings (low-dimensional representations) like word2vec or GloVe, but they could also be
one-hot vectors that index the word into a vocabulary. For a 10-word sentence using a
100-dimensional embedding, we would have a 10×100 matrix as our input. That’s our “image”.

Figure 4.5 depicts a Convolutional Neural Network (CNN) architecture for sentence classification.
It uses three filter region sizes, 2, 3 and 4, each of which has 2 filters. Every filter performs
a convolution on the sentence matrix and generates a (variable-length) feature map. Then 1-max
pooling is performed over each map, i.e., the largest number from each feature map is recorded.
Thus, a univariate feature is generated from each of the six maps, and these 6 features are
concatenated to form a feature vector for the penultimate layer. The final softmax layer then
receives this feature vector as input and uses it to classify the sentence; here we assume binary
classification and hence depict two possible output states.

Figure 4.5: Illustration of a Convolutional Neural Network (CNN) architecture for sentence
classification

What about the nice intuitions we had for Computer Vision? Location Invariance and local
Compositionality made intuitive sense for images, but not so much for NLP. You probably do care a
lot where in the sentence a word appears. Pixels close to each other are likely to be semantically
related (part of the same object), but the same isn’t always true for words. In many languages, parts
of phrases could be separated by several other words. The compositional aspect isn’t obvious
either. Clearly, words compose in some ways, like an adjective modifying a noun, but how exactly
this works and what higher-level representations actually “mean” isn’t as obvious as in the
Computer Vision case.
Given all this, it seems like CNNs wouldn’t be a good fit for NLP tasks. Recurrent Neural
Networks make more intuitive sense. They resemble how we process language, or at least how we
think we process language: Reading sequentially from left to right. Fortunately, this doesn’t mean
that CNNs don’t work. All models are wrong, but some are useful. It turns out that CNNs applied
to NLP problems perform quite well. The simple Bag of Words model is an obvious
oversimplification with incorrect assumptions, but has nonetheless been the standard approach for
years and led to pretty good results.
A big argument for CNNs is that they are fast. Very fast. Convolutions are a central part of
computer graphics and are implemented on a hardware level on GPUs. Compared to something like
n-grams, CNNs are also efficient in terms of representation. With a large vocabulary, computing
anything more than 3-grams can quickly become expensive. Even Google doesn’t provide
anything beyond 5-grams. Convolutional filters learn good representations automatically, without
needing to represent the whole vocabulary. It’s completely reasonable to have filters of size larger
than 5. We like to think that many of the learned filters in the first layer capture features quite
similar (but not limited) to n-grams, but represent them in a more compact way.

CNN Hyperparameters
Before explaining how CNNs are applied to NLP tasks, let’s look at some of the choices you
need to make when building a CNN. Hopefully this will help you better understand the literature in
the field.
i. Narrow vs. Wide convolution:
The explanation of convolutions above left out a detail of how the filter is applied.
Applying a 3×3 filter at the center of the matrix works fine, but what about the edges? How would
you apply the filter to the first element of a matrix that doesn’t have any neighboring elements to
the top and left? You can use zero-padding.

All elements that would fall outside of the matrix are taken to be zero. By doing this you can
apply the filter to every element of your input matrix, and get a larger or equally sized output.
Adding zero-padding is also called wide convolution, and not using zero-padding would be a
narrow convolution. An example in 1D looks like this:

Figure 4.6: Narrow vs. Wide Convolution. Filter size 5, input size 7. Source: A Convolutional
Neural Network for Modelling Sentences (2014)
From figure 4.6, you can see how wide convolution is useful, or even necessary, when you have a
large filter relative to the input size. In the figure, the narrow convolution yields an output of
size (7−5)+1=3, and the wide convolution an output of size (7+2∗4−5)+1=11. More generally, the
formula for the output size is
n_out = (n_in + 2*n_padding − n_filter) + 1
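
The formula can be checked with a few lines of code. The function name is an illustrative assumption; the formula is the stride-1 version given above, with n_padding counting the zeros added on each side.

```python
def conv_output_size(n_in, n_filter, n_padding=0):
    """Output length of a stride-1 convolution with n_padding zeros per side."""
    return (n_in + 2 * n_padding - n_filter) + 1

# The two cases from the narrow vs. wide example: input size 7, filter size 5.
narrow = conv_output_size(7, 5)                 # no zero-padding
wide = conv_output_size(7, 5, n_padding=4)      # n_filter - 1 zeros on each side
```

With n_padding = n_filter − 1, every position at which the filter overlaps the input by at least one element produces an output, which is why the wide output is longer than the input.
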

ii. Stride Size:


Another hyperparameter for your convolutions is the stride size, defining by how much you want
to shift your filter at each step. In all the examples above the stride size was 1, and consecutive
applications of the filter overlapped. A larger stride size leads to fewer applications of the filter
and a smaller output size. Figure 4.7, from the Stanford cs231n website, shows stride sizes of 1
and 2 applied to a one-dimensional input.

Figure 4.7 : Convolution Stride Size. Left: Stride size 1. Right: Stride size 2.
In the literature we typically see stride sizes of 1, but a larger stride size may allow you to build a
model that behaves somewhat similarly to a Recursive Neural Network, i.e., looks like a tree.
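
The effect of the stride can be illustrated by listing the positions at which the filter is applied. This is a sketch for an unpadded one-dimensional input; the function name and the sizes used are assumptions for the example.

```python
def window_starts(n_in, n_filter, stride=1):
    """Start indices at which a filter of length n_filter is applied
    to an unpadded input of length n_in."""
    return list(range(0, n_in - n_filter + 1, stride))

stride_1 = window_starts(7, 3, stride=1)   # consecutive windows overlap by 2
stride_2 = window_starts(7, 3, stride=2)   # consecutive windows overlap by 1
```

The number of start positions is the output size, so doubling the stride roughly halves the output length.
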

iii. Pooling Layers:
A key aspect of Convolutional Neural Networks is pooling layers, typically applied after the
convolutional layers. Pooling layers subsample their input. The most common way to do pooling is
to apply a max operation to the result of each filter. You don’t necessarily need to pool over the
complete matrix; you could also pool over a window. For example, figure 4.8 shows max
pooling for a 2×2 window. In NLP we typically apply pooling over the complete output,
yielding just a single number for each filter:

Figure 4.8: Max pooling in CNN.


Why pooling? There are a couple of reasons. One property of pooling is that it provides a fixed
size output matrix, which typically is required for classification. For example, if you have 1,000
filters and you apply max pooling to each, you will get a 1000-dimensional output, regardless of
the size of your filters, or the size of your input.
This allows you to use variable size sentences, and variable size filters, but always get the same
output dimensions to feed into a classifier.
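
The fixed-size property can be demonstrated with a plain-Python sketch: each filter is slid over the sequence without padding, and only the maximum response is kept. The filters and inputs are made-up numbers; as in most deep learning libraries, the "convolution" here is computed without flipping the kernel.

```python
def conv1d_valid(sequence, kernel):
    """Slide the kernel over the sequence (no padding, stride 1)."""
    k = len(kernel)
    return [sum(sequence[i + j] * kernel[j] for j in range(k))
            for i in range(len(sequence) - k + 1)]

def max_over_time(sequence, kernels):
    """One number per filter: the largest response anywhere in the sequence.
    The sequence must be at least as long as the largest kernel."""
    return [max(conv1d_valid(sequence, kernel)) for kernel in kernels]

kernels = [[1, -1], [1, 1, 1]]                  # two filters of different sizes
short_features = max_over_time([1, 3, 2, 5], kernels)
long_features = max_over_time([0, 1, 0, 4, 4, 1, 0, 2], kernels)
```

Both inputs produce a feature vector with one entry per filter, even though the inputs and the filters have different lengths.
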
Pooling also reduces the output dimensionality but (hopefully) keeps the most salient information.
You can think of each filter as detecting a specific feature, such as detecting if the sentence
contains a negation like “not amazing” for example. If this phrase occurs somewhere in the
sentence, the result of applying the filter to that region will yield a large value, but a small value in
other regions. By performing the max operation you are keeping information about whether or not
the feature appeared in the sentence, but you are losing information about where exactly it
appeared. But isn’t this information about locality really useful? Yes, it is and it’s a bit similar to
what a bag of n-grams model is doing. You are losing global information about locality (where in a
sentence something happens), but you are keeping local information captured by your filters, like
“not amazing” being very different from “amazing not”.
In image recognition, pooling also provides basic invariance to translation (shifting) and
rotation. When you are pooling over a region, the output will stay approximately the same even if
you shift or rotate the image by a few pixels, because the max operations will pick out the same
value regardless.

iv. Channels:
The last concept we need to understand is channels. Channels are different “views” of your input
data. For example, in image recognition you typically have RGB (red, green, blue) channels. You
can apply convolutions across channels, either with different or equal weights. In NLP you could
imagine having various channels as well: you could have separate channels for different word
embeddings (word2vec and GloVe, for example), or you could have a channel for the same
sentence represented in different languages, or phrased in different ways.

4.2.5 Character-Level CNNs:


So far, all of the models presented were based on words. But there has also been research in
applying CNNs directly to characters. [26] learns character-level embeddings, joins them with
pre-trained word embeddings, and uses a CNN for Part of Speech tagging.
GloVe is more efficient than Word2Vec. GloVe stands for Global Vectors, with “global” referring to
global corpus statistics and “vectors” referring to word representations. To obtain the inputs to
the deep learning network, we used the GloVe pre-trained model, which was trained on a massive
corpus of 2B tweets. The machine learning algorithms used the BoW features. We performed
optimization using the Adam algorithm. The 1D-CNN learns to encode properties of the input
sequence that are useful for detecting hate speech in a sentence. CNN-based text classifiers can
learn features from words or phrases in different positions in the text.

4.2.6 Developed Models


The different models built are:
1. Word2Vec with CNN:
Word2Vec embeddings are given to the CNN model. If a word is not in the vocabulary, a new random
embedding is generated for it. The model works at author level, and the embedding size is 100. Not
all of the following models were tested on the test data, so for those we have only the
training/validation accuracy and no test accuracy. A few of them were tested, and their test
accuracy was found to be very low.
2. Word2Vec with CNN [Tweet level]:
This is similar to the Word2Vec with CNN model above, but it ignores new words and is trained at
tweet level.
3. Word2Vec, TF-IDF + CNN, LR, SVM:
The Word2Vec embeddings are multiplied by TF-IDF weights. These models are trained at tweet level.
The training accuracy was good, but the test accuracy was poor. Our interpretation is that
multiplying the Word2Vec embeddings by TF-IDF weights alters the meaning encoded in the
embeddings.
4. Trained Embeddings + CNN:
Two individual models were developed for two languages with the same architecture. Figure 4.9
depicts the architecture. The model works at author level. It was run for 30 epochs with 5-fold
cross-validation. We experimented with an Average Pooling 1D layer and a Max Pooling 1D layer
after the Convolution 1D layer, with various hyperparameters. The layers and parameters in
table 4.8 worked best for us:

Layer                   Hyperparameters
Embedding Layer         Embedding size: 100
Convolution 1D Layer    No. of kernels: 36, Kernel size: 24
Max Pooling 1D Layer    Pool size: 3

Table 4.8: The layers and hyperparameters used in CNN model


level model, having the same architecture. This was done considering the performance of models
using n-grams and FastText.
5. Trained char-level embeddings + CNN:
It is similar to the Trained Embeddings + CNN model, but here each character is treated as a
separate feature, and the model learns patterns in sequences of characters.
6. Ensemble model using CNN:
Considering the good scores of the Trained Embeddings + CNN model, we created an ensemble of
5 CNN models built to the same specifications. Voting is used to decide the label.
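
The voting step of an ensemble can be sketched as follows. The report does not state the exact voting rule, so simple majority voting over binary labels is assumed here; the function name and the predictions are made up for the example.

```python
def majority_vote(predictions):
    """predictions: one list of 0/1 labels per model, all of the same length.
    Returns the label chosen by more than half of the models for each sample
    (assumed rule: simple majority)."""
    n_models = len(predictions)
    return [1 if sum(votes) * 2 > n_models else 0
            for votes in zip(*predictions)]

# Hypothetical outputs of 5 models on 3 authors.
model_outputs = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
]
labels = majority_vote(model_outputs)
```

Using an odd number of models, such as 5, avoids ties in a binary task.
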
The preprocessed text is used to build the vocabulary. Words that occur at least 2 times are
preserved. Tokenization is done using the tokenizer imported from Keras, and the tokens are
given to the embedding layer, where the embedding size parameter is set to 100. Various
embedding sizes were tried, but this was found to be the best. The embedding layer is connected
to a Convolutional Neural Network that consists of a Convolution 1D layer, a Max Pooling 1D layer
and a Global Average Pooling 1D layer. Parameters such as the number of kernels and the kernel
size are tuned, and the values that give the highest accuracy are kept. Finally, a sigmoid layer
is used for binary classification.
The model is compiled with the Adam optimizer and binary cross-entropy as the loss function.
Cross-validation is performed and the model is run for 30 epochs.
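
To see how the data changes shape as it passes through this stack, the sketch below walks an input of a given length through the layers using the hyperparameters from table 4.8. It is shape arithmetic only, not the model itself, and it assumes no padding, a convolution stride of 1, and a pooling stride equal to the pool size; the input length of 500 tokens is a made-up value.

```python
def cnn_shapes(seq_len, embed_dim=100, n_kernels=36, kernel_size=24, pool_size=3):
    """Output shape (per sample) after each layer of the described stack."""
    conv_len = seq_len - kernel_size + 1                 # unpadded, stride 1
    pool_len = (conv_len - pool_size) // pool_size + 1   # stride = pool size
    return {
        "embedding": (seq_len, embed_dim),
        "conv1d": (conv_len, n_kernels),
        "max_pool": (pool_len, n_kernels),
        "global_avg_pool": (n_kernels,),   # average over the remaining positions
        "sigmoid": (1,),                   # probability of the positive class
    }

shapes = cnn_shapes(500)
```

Global average pooling is what removes the dependence on the input length, so authors with longer or shorter concatenated texts still yield a 36-dimensional vector for the final sigmoid unit.
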

4.3 SYSTEM ARCHITECTURE

Figure 4.9: System Architecture for Hate Speech Spreader Detection

From figure 4.9, it can be observed that the tweets of a user are first cleaned of XML tags and
preprocessed. In pre-processing, contractions are expanded, and punctuation, emojis and other
unwanted characters are removed. Then, tokenization is done on the pre-processed text and tokens
are generated. Meanwhile, the vocabulary is built using the tokens. The sentences, as sequences of
encoded tokens, are given to the Sequential model. The embedding layer produces embeddings, which
are given to the Convolution 1D layer. It performs the convolution operation with filters of the
specified parameters. Then, Max Pooling and Global Average Pooling are done. A sigmoid layer is
used for binary classification.

4.4 MODULE DESCRIPTION
USER MODULE

 Users can sign up to the web application by registering with details such as user name and
password.
 Registered users can sign in to their profile using their user id and password.
 They can post videos, stories and photos in the web application.
 Users can send friend requests to other users and can also chat with their friends.
 Users can view, like and comment on the videos and photos posted by their friends in the
web application.

ADMIN MODULE

 Admin can handle and make changes in the web application.
 They can also view the requests from users.
 They can also view the comments that have been classified as hate and non-hate speech.
 They can manage the notifications of users.

MACHINE LEARNING MODULE

 The Machine Learning module is responsible for classifying comments and messages
as hate speech or non hate speech
 From a vast set of comments and messages, the 1D CNN is used to predict bullying
comments and messages.
This module includes the following steps:
1. Data collection
2. Data preprocessing
3. Segmentation
4. Feature extraction
5. Training
6. Testing
1. DATA COLLECTION
 Collecting data for training the Machine Learning model is the basic step in the machine
learning pipeline.
 The predictions made by Machine Learning systems can only be as good as the data on
which they have been trained.

 In this system, the dataset contains bullying as well as non-bullying comments and
messages.
 The data set is downloaded from the Kaggle website.
 80% of the dataset is used for training and the remaining 20% is used for testing.
2. DATA PREPROCESSING
 Real-world raw data and images are often incomplete, inconsistent and lacking in certain
behaviors or trends. They are also likely to contain many errors. So, once collected, they
are pre-processed into a format the machine learning algorithm can use for the model.
 Data preprocessing in Machine Learning is a crucial step that helps enhance the quality of
data to promote the extraction of meaningful insights from the data.
 The preprocessing step also includes the removal of stop words and special characters and the
conversion of uppercase letters to lowercase.
 The lemmatization step converts an inflected word into its root word. For example, the
word running is converted to its root word run.
3. SEGMENTATION
 Segmentation can be defined as the process of separating sentences into different tokens.
 N-grams are used for grouping tokens.
 N-grams are used for a variety of things. Some examples include auto completion of
sentences.
 In this project, 2-gram is used to group tokens.
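
Grouping tokens into 2-grams can be sketched in one line of Python. The function name and the example tokens are illustrative assumptions.

```python
def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["you", "are", "not", "kind"])
```

Bigrams keep short-range word order, so "not kind" is preserved as a unit instead of being split into two unrelated tokens.
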
4. FEATURE EXTRACTION
 Feature extraction is the process of taking out a list of words from the text data and then
transforming them into a feature set which is usable by a classifier.
 In this system, TF-IDF vectorizer is used for feature extraction.
 TF-IDF stands for term frequency-inverse document frequency and it is a measure, used to
quantify the importance or relevance of string representations in a document.

5. TRAINING
 Model training is the key step in machine learning that results in a model ready to be
validated, tested, and deployed.
 The performance of the model determines the quality of the applications that are built
using it.
 Quality of training data and the training algorithm are both important assets during the
model training phase.

 Typically, dataset is split for training and testing.
6. TESTING
 In machine learning, model testing is referred to as the process where the performance of a
fully trained model is evaluated on a testing set.
 The testing set, consisting of a set of testing samples, should be kept separate from both the
training and validation sets, but it should follow the same probability distribution as the
training set.
 Each testing sample has a known value of the target.
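
The 80/20 split used for training and testing can be sketched with the standard library. The function name, the fixed seed, and the toy data are assumptions for this example; shuffling before splitting is one simple way to keep both parts close to the same distribution.

```python
import random

def split_dataset(samples, test_fraction=0.2, seed=42):
    """Shuffle the samples and split them into (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)        # reproducible shuffle
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_dataset(range(10))
```

Every sample ends up in exactly one of the two parts, so the test set stays unseen during training.
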

4.5 ADVANTAGES OF PROPOSED SYSTEM


 Automatically learn features, capture local word dependencies (e.g., "crazy" vs. "you're
crazy").
 Capture semantic meaning (similar words close in vector space), reduce data complexity
for better model performance.
 Improve accuracy and handle large datasets efficiently. (But beware of potential bias!)

4.6 APPLICATIONS IN PROPOSED SYSTEM


 Social Media Platforms: These platforms can use this approach to automatically flag or
remove hateful content, promoting a safer and more inclusive online environment.
 Content Moderation: Websites and forums can leverage this technology to filter out
hateful comments before they are published, improving user experience and reducing the
spread of negativity.
 Customer Service: Companies can use hate speech detection to identify and address
instances of abuse directed towards customer service representatives.
 Law Enforcement: This technology can potentially assist law enforcement in
identifying and investigating online hate crimes.

CHAPTER 5
DESIGN
5.1 UML Diagrams
The Unified Modeling Language (UML) is a standard language for writing software blueprints.
The UML is a language for visualizing, specifying, constructing, and documenting the artifacts of a
software-intensive system.
Some of the frequently used diagrams in software development are:

 Use Case diagrams


 Activity diagrams
 State Chart diagrams
 Sequence diagrams
 Class diagrams

5.1.1 Use case Diagram


A use case is a description of a set of sequences of actions that a system performs to yield an
observable result of value to a particular actor. Actors are the entities that interact with a
system. Although in most cases actors are used to represent the users of a system, actors can
actually be anything that needs to exchange information with the system. So, an actor may be
people, computer hardware, other systems, etc.

Figure 5.1: Use Case Diagram for Hate Speech Spreaders Detection

As shown in figure 5.1, our system has two actors: Twitter User and Hate Speech Spreader Detector.
The actions of the Twitter User are to upload posts and view posts.
The Hate Speech Spreader Detector is the tool we built. It collects users' tweets, preprocesses
the tweets, builds the vocabulary, builds word embeddings, trains the classifier, and predicts on
the test data.

5.1.2 Activity Diagram


An activity diagram is a special case of a state diagram. An activity diagram is like a flowchart
showing the flow of control from one activity to another. An activity diagram is used to model
the dynamic aspects of the system. Activities are nothing but the functions of a system. A number
of activity diagrams are prepared to capture the entire flow in a system.

Figure 5.2: Activity Diagram for Hate Speech Spreaders Detection

5.1.3 State Chart Diagram
A state diagram represents the condition of the system, or of part of the system, at finite
instants of time. It is a behavioral diagram that represents behavior using finite state
transitions. State diagrams are also referred to as state machines and state-chart diagrams;
these terms are often used interchangeably. Put simply, a state diagram models the
dynamic behavior of a class in response to time and changing external stimuli. Every class
has a state, but we do not model every class with a state diagram. We prefer to
model only classes that have three or more states.

Figure 5.3: State Chart Diagram for Hate Speech Spreaders Detection
As shown in Figure 5.3, the project passes through the following states: removing XML tags from
tweets, removing emojis and hashtags, lemmatization and tokenization, and adding new words to the
vocabulary. Then an embedding layer is added to the Sequential model, the embeddings are convolved,
pooling is applied, the loss is computed, and the weights are updated through backpropagation.
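
The preprocessing states above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the report's own implementation: the function names, the regular expressions, and the emoji ranges are assumptions, and lemmatization is left out because it needs an external library such as NLTK.

```python
import re

# Hypothetical patterns for illustration only.
XML_TAG = re.compile(r"<[^>]+>")
HASHTAG = re.compile(r"#\w+")
# Rough emoji ranges (an assumption; not exhaustive).
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def preprocess(tweet: str) -> list[str]:
    """Strip XML tags, hashtags and emojis, then lowercase and tokenize."""
    tweet = XML_TAG.sub(" ", tweet)
    tweet = HASHTAG.sub(" ", tweet)
    tweet = EMOJI.sub(" ", tweet)
    return re.findall(r"[a-z0-9']+", tweet.lower())


def update_vocabulary(vocab: dict[str, int], tokens: list[str]) -> None:
    """Give every unseen token the next free index (0 is kept for padding)."""
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab) + 1


vocab: dict[str, int] = {}
update_vocabulary(vocab, preprocess("<doc>Hello #tag World 😀</doc>"))
print(vocab)
```

Each call to `update_vocabulary` corresponds to the "adding new words to the vocabulary" state; the resulting indices are what an embedding layer would look up.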

5.1.4 Sequence Diagram
A sequence diagram depicts interactions between objects in sequential order, i.e., the order
in which the interactions take place. Each participant is drawn as a lifeline, a named element
that represents an individual participant. Communication is shown as messages that appear in
sequential order along the lifelines. Sequence diagrams establish the roles of objects and help provide
essential information to determine class responsibilities and interfaces.

Figure 5.4: Sequence Diagram for Hate Speech Spreaders Detection


Figure 5.4 shows that the Hate Speech Spreader Detector (HSSD) object imports user details
from the GUI (Graphical User Interface). It sends the users' tweets to the preprocessor,
which removes emojis and hashtags and performs lemmatization.

CHAPTER 6
IMPLEMENTATION
Python code for building the vocabulary, tokenization, encoding, padding sequences, generating
embeddings, building the CNN model, and fitting it using cross-validation:

6.1 SAMPLE CODE


Creating Front End using Python
import streamlit as st
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from tensorflow.keras.preprocessing import text, sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, MaxPooling1D
import re
import pickle
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stopword = set(stopwords.words('english'))
stemmer = nltk.SnowballStemmer("english")

# Preprocess text data
def clean(text):
    # Convert text to lowercase
    text = str(text).lower()
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w\d\w', '', text)
    # Remove stop words
    text = [word for word in text.split(' ') if word not in stopword]
    text = " ".join(text)
    # Stem the remaining words
    text = [stemmer.stem(word) for word in text.split(' ')]
    text = " ".join(text)
    return text

# Streamlit UI
st.title("Hate Speech Recognition")

# Input text
text = st.text_area("Enter text to check for hate speech")

# Load the trained classifier and the fitted vectorizer
with open('classifier.pkl', 'rb') as file:
    clf = pickle.load(file)
with open('count.pkl', 'rb') as file:
    cv = pickle.load(file)

if st.button('Predict'):
    # 1. preprocess (done once, here, so the text is not stemmed twice)
    transformed_sms = clean(text)
    # 2. vectorize
    vector_input = cv.transform([transformed_sms]).toarray()
    # 3. predict
    st.header(clf.predict(vector_input)[0])
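
The front end displays the raw value returned by the classifier. A small lookup table can turn a numeric class id into a readable label. The sketch below assumes the classifier emits the same integer ids as the back-end script, where `np.bincount` is unpacked in the order hate, offensive, neither; the names `LABEL_NAMES` and `describe_prediction` are hypothetical.

```python
# Assumed id order, taken from the back-end script:
# 0 = hate, 1 = offensive, 2 = neither.
LABEL_NAMES = {0: "Hate speech", 1: "Offensive language", 2: "Neither"}


def describe_prediction(class_id) -> str:
    """Map a predicted class id to a display string."""
    return LABEL_NAMES.get(int(class_id), f"Unknown class {class_id}")


print(describe_prediction(1))
```

In the Streamlit script this could be used as `st.header(describe_prediction(clf.predict(vector_input)[0]))`, provided the pickled classifier was trained on these integer labels.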

Creating Back End using Python
import numpy as np
import pandas as pd
import pickle
import tensorflow as tf
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing import text, sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, MaxPooling1D
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
import seaborn as sns

import os
# List the available input files.
# The loop body is missing from the source listing; printing each path is assumed.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

df = pd.read_csv('data.csv')

!pip install nltk

import nltk
import string

nltk.download('punkt')
try:
    nltk.download('stopwords')
except Exception as e:
    print(f"Error downloading stopwords: {e}")

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def transform_text(tweet):
    # Assumed step (missing from the source listing): lowercase and tokenize
    tweet = nltk.word_tokenize(tweet.lower())

    # Keep alphanumeric tokens only
    y = []
    for i in tweet:
        if i.isalnum():
            y.append(i)

    tweet = y[:]
    y.clear()

    # Drop stop words and punctuation
    for i in tweet:
        if i not in stopwords.words('english') and i not in string.punctuation:
            y.append(i)

    tweet = y[:]
    y.clear()

    # Stem each remaining token
    for i in tweet:
        y.append(ps.stem(i))

    return " ".join(y)

transform_text("I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.")

ps.stem('loving')
df['transformed_text'] = df['tweet'].apply(transform_text)

from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer


cv = CountVectorizer()
tfidf = TfidfVectorizer(max_features=3000)

X = tfidf.fit_transform(df['transformed_text']).toarray()
y

df.head()

c = df['class']
df.rename(columns={'transformed_text': 'text', 'class': 'category'}, inplace=True)

df = pd.concat([a, b, c], axis=1)

df.rename(columns={'class': 'label'}, inplace=True)

# Grouping data by label
df['category'].value_counts()

hate, offensive, neither = np.bincount(df['label'])
total = hate + offensive + neither
print('Examples:\n Total: {}\n hate: {} ({:.2f}% of total)\n'.format(
    total, hate, 100 * hate / total))
print('Examples:\n Total: {}\n Offensive: {} ({:.2f}% of total)\n'.format(
    total, offensive, 100 * offensive / total))
print('Examples:\n Total: {}\n Neither: {} ({:.2f}% of total)\n'.format(
    total, neither, 100 * neither / total))
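
The three near-identical print statements report the share of each class. As a rough sketch, the same summary can be computed for any number of classes with the standard library; `class_distribution` is a hypothetical helper and the label list below is made-up toy data, not the project's dataset.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: (count, percentage of total)} sorted by label."""
    counts = Counter(labels)
    total = len(labels)
    return {
        label: (count, 100.0 * count / total)
        for label, count in sorted(counts.items())
    }


# Toy labels for illustration only.
for label, (count, pct) in class_distribution([0, 1, 1, 2]).items():
    print(f"class {label}: {count} ({pct:.2f}% of total)")
```

Checking these percentages before training matters here because a strongly imbalanced label distribution makes accuracy alone a misleading metric.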

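# Hold out 10% of the data as the test set, stratified by label
X_train_, X_test, y_train_, y_test = train_test_split(
    df.index.values,
    df.label.values,
    test_size=0.10,
    random_state=42,
    stratify=df.label.values,
)

# Split the remaining 90% again to obtain a validation set
X_train, X_val, y_train, y_val = train_test_split(
    df.loc[X_train_].index.values,
    df.loc[X_train_].label.values,
    test_size=0.10,
    random_state=42,
    stratify=df.loc[X_train_].label.values,
)

df['data_type'] = ['not_set'] * df.shape[0]
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'
# Assumed step (missing from the source listing): mark the held-out rows
df.loc[X_test, 'data_type'] = 'test'

df.groupby(['category', 'label', 'data_type']).count()

df

df_train = df.loc[df["data_type"] == "train"]
df_val = df.loc[df["data_type"] == "val"]
# Right-hand side assumed; the source line is cut off after "df_test"
df_test = df.loc[df["data_type"] == "test"]

df_train.head()

df_train_plus_val = pd.concat([df_train, df_val], axis=0)

df_train_plus_val.head()

x = df_train_plus_val.text.values
# Right-hand side assumed; the source line is cut off after "y"
y = df_train_plus_val.label.values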
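
Because the second split is taken from the 90% that remains after the first, the effective proportions are not 80/10/10. A short calculation makes this explicit. It is a sketch of the arithmetic only: scikit-learn rounds the actual row counts, and `split_fractions` is a hypothetical helper name.

```python
def split_fractions(test_size=0.10, val_size=0.10):
    """Effective fractions of the full dataset after two nested splits."""
    remaining = 1.0 - test_size      # data left after the test split
    val = remaining * val_size       # validation share of the full data
    train = remaining - val          # what is left for training
    return {"train": train, "val": val, "test": test_size}


for name, fraction in split_fractions().items():
    print(f"{name}: {fraction:.2%}")
```

With both sizes set to 0.10, roughly 81% of the rows are used for training, 9% for validation, and 10% for testing.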

max_features = 20000
max_text_length = 512

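x_tokenizer = text.Tokenizer(max_features)
# Assumed step (missing from the source listing): build the vocabulary before encoding
x_tokenizer.fit_on_texts(list(x))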

x_tokenized = x_tokenizer.texts_to_sequences(x)
x_train_val= sequence.pad_sequences(x_tokenized, maxlen=max_text_length)

x_test_tokenized = x_tokenizer.texts_to_sequences(df_test.text.values)
x_test = sequence.pad_sequences(x_test_tokenized,maxlen=max_text_length)
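
`sequence.pad_sequences` is called with its default settings, which pad and truncate at the front of each sequence. The plain-Python sketch below mimics that default behavior for a single sequence so the effect is easy to see; `pad_pre` is a hypothetical helper, not part of Keras, and it assumes `maxlen` is positive.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic pad_sequences defaults: padding='pre', truncating='pre'."""
    seq = list(seq)[-maxlen:]                   # keep the last maxlen ids
    return [value] * (maxlen - len(seq)) + seq  # left-pad with the fill value


print(pad_pre([4, 9, 2], 5))
print(pad_pre([4, 9, 2, 7, 1, 8], 4))
```

Short tweets are therefore left-padded with zeros up to `max_text_length`, and anything longer keeps only its final `max_text_length` token ids.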

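import numpy as np

embedding_dim = 100
embeddings_index = {}

# Load the pre-trained GloVe vectors into a word -> vector dictionary.
# Lines cut off in the source listing are completed with the usual GloVe-loading idiom (assumed).
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print(f'Found {len(embeddings_index)} word vectors')

# Rows for words without a GloVe vector stay all-zero
embedding_matrix = np.zeros((max_features, embedding_dim))
for word, index in x_tokenizer.word_index.items():
    if index > max_features - 1:
        break
    else:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[index] = embedding_vector

# One-hot encode the three class labels
y_train_plus_val = tf.keras.utils.to_categorical(y, num_classes=3)
y_test = tf.keras.utils.to_categorical(df_test.label, num_classes=3)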

"""# Building 1 D CNN Model"""

model = Sequential()
model.add(Embedding(max_features,
embedding_dim,
embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
trainable=False))

model.add(Dropout(0.2))

model.add(Conv1D(64,2,padding='valid',activation='relu'))
model.add(MaxPooling1D())
model.add(Conv1D(64,2,padding='valid',activation='relu'))
model.add(MaxPooling1D())

model.add(Conv1D(32,2,padding='valid',activation='relu'))
model.add(MaxPooling1D())
model.add(Conv1D(32,2,padding='valid',activation='relu'))
model.add(GlobalMaxPooling1D())

model.add(Dense(16, activation='relu'))
# Activation assumed; the source line is cut off after "Dense(16,"
model.add(Dense(16, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.summary()
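
To see how the stacked convolution and pooling layers shrink the 512-step input, the sequence length after each layer can be traced by hand. The sketch below assumes the values used in the listing: kernel size 2 with `padding='valid'` for every `Conv1D`, and the Keras default `pool_size=2` for `MaxPooling1D`. The helper names are hypothetical.

```python
def conv1d_valid_len(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1


def maxpool1d_len(length, pool_size=2):
    """Output length of max pooling with stride equal to pool_size."""
    return (length - pool_size) // pool_size + 1


def trace_lengths(input_len=512):
    """Sequence length after each Conv1D / MaxPooling1D layer in the model."""
    lengths = [input_len]
    n = input_len
    for _ in range(3):                  # three Conv1D + MaxPooling1D pairs
        n = conv1d_valid_len(n, 2)
        lengths.append(n)
        n = maxpool1d_len(n)
        lengths.append(n)
    n = conv1d_valid_len(n, 2)          # final Conv1D before global pooling
    lengths.append(n)
    return lengths


print(trace_lengths())
```

The final convolution leaves 62 time steps with 32 channels each; `GlobalMaxPooling1D` then keeps the maximum over those steps, giving a 32-value vector for the dense layers.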

model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])

history = model.fit(x_train_val, y_train_plus_val, batch_size= 64, validation_split=0.2, epochs=10)

# Plot loss
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training loss vs. Epochs')
plt.legend()
plt.show()

# Plot accuracy
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.title('Training accuracy vs. Epochs')
plt.legend()
plt.show()

model.evaluate(x_test, y_test, batch_size=64)

y_pred = model.predict(x_test)
y_pred

# Convert class probabilities to class ids
y_pred = np.array([np.argmax(y) for y in y_pred])

y_pred

y_test_labels = df_test.label

cm = confusion_matrix(y_test_labels, y_pred)
fig = sns.heatmap(cm, annot=True, fmt="d")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.show()

# Save the fitted TF-IDF vectorizer and the trained model
import pickle
with open('vectorized.pkl', 'wb') as f:
    pickle.dump(tfidf, f)
model.save('model.h5')
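
The confusion matrix also lets accuracy and macro-F1 be computed directly, which helps explain why the two can differ widely on imbalanced data. The sketch below uses plain Python and a made-up 3x3 matrix (rows are actual labels, columns are predicted labels, as in scikit-learn). The numbers are illustrative only and are not the project's results.

```python
def per_class_f1(cm):
    """F1 score of each class from a square confusion matrix."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[r][k] for r in range(n)) - tp   # predicted k, actually other
        fn = sum(cm[k]) - tp                        # actually k, predicted other
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_f1(cm):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1(cm)) / len(cm)


def accuracy(cm):
    """Fraction of all samples on the diagonal."""
    return sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))


# Hypothetical matrix: the first class is rare and mostly misclassified.
toy_cm = [[2, 8, 0],
          [1, 95, 4],
          [0, 5, 35]]
print(f"accuracy = {accuracy(toy_cm):.2f}, macro-F1 = {macro_f1(toy_cm):.2f}")
```

Because macro-F1 weights every class equally, a poorly recognized minority class pulls it well below the accuracy, which is dominated by the majority class.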

6.2 OUTPUT SCREENS

To start the project, open the ‘streamlit_file.ipynb’ file on a Google Colab server or in Jupyter,
connect to a kernel, and double-click Run.

In the above screen, Google Colab has started. Now open a browser at https://nine-carrots-switch.loca.lt,
enter ‘34.68.93.235’ as the URL, and confirm to reach the home page shown below.

Predicted Result 1:

Predicted Result 2:

Predicted Result 3:

CHAPTER 7
CONCLUSION AND FUTURE ENHANCEMENTS

This study addresses the difficulty of automatically recognizing hate speech in social media posts.
It presents a hate speech dataset that was collected from a white supremacist online community and
manually labeled. We found that the analysis surfaced significant hateful preconceptions, as well as
varying levels of ethnic and religion-based stereotypes. Our findings show that the choice of word
embeddings, the selected parameters, and the optimizer have a high impact on the results achieved.

Hate speech on social media, which can have a negative impact on society, was detected easily,
and the model's high accuracy can bring many benefits while reducing that damage. By assessing
and comparing the performance of the various hate speech detection models, we found that word
embeddings combined with a 1D-CNN are an important tool for hate speech detection.

1D-CNN, a deep learning model, achieved the highest weighted macro-F1 score of 0.66 with a
0.90 accuracy. The results of the confusion matrix graphs in figures 1 to 5 demonstrated that
GloVe embedding features were unable to correctly
Word2Vec, GloVe, BERT, and TF-IDF each have drawbacks: the pre-trained embeddings cannot handle
out-of-vocabulary words, BERT is computationally very expensive, FastText is also pre-trained and
based on n-grams, and TF-IDF is a bag-of-words model that depends primarily on the dataset and word
frequencies. TF-IDF is better suited to information retrieval than to sentiment analysis.

A CNN is more suitable than an RNN/LSTM for this problem because each sample contains many tweets
whose relative order does not matter, while local context needs to be captured. A CNN is also fast.
Future work includes adding features based on emojis and hashtags, combining predictions from
traditional machine learning models and deep learning models, and building a robust model using
ensemble learning methods.

CHAPTER 8
REFERENCES
[1] Davidson T, Warmsley D, Macy MW, Weber I. Automated Hate Speech Detection and the
Problem of Offensive Language. ICWSM. 2017.
[2] Zimmerman S, Kruschwitz U, Fox C. Improving Hate Speech Detection with Deep
Learning Ensembles. In: LREC; 2018.
[3] Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. arXiv:1810.04805 [cs]. 2018.
[4] Hagen M, Potthast M, Büchner M, Stein B. Webis: An Ensemble for Twitter Sentiment
Detection. In: SemEval@NAACL-HLT; 2015.
[5] Joulin A, Grave E, Bojanowski P, Mikolov T. Bag of Tricks for Efficient Text Classification.
In: Proceedings of the 15th Conference of the European Chapter of the Association for
Computational Linguistics: Volume 2, Short Papers. ACL; 2017. p. 427–431.
[6] Zhang Z, Robinson D, Tepper J. Detecting hate speech on Twitter using a convolution-GRU
based deep neural network. In: European Semantic Web Conference. Springer; 2018. p. 745–760.
[7] MacAvaney, Sean & Yao, Hao-Ren & Yang, Eugene & Russell, Katina & Goharian, Nazli &
Frieder, Ophir. (2019). Hate speech detection: Challenges and solutions. PloS one. 14. e0221152.
10.1371/journal.pone.0221152.
[8] Neuman Y, Assaf D, Cohen Y, Last M, Argamon S, Howard N, et al. Metaphor Identification
in Large Texts Corpora. PLoS ONE. 2013; 8(4). https://doi.org/10.1371/journal.pone.0062343
[9] Hatebase. Available from: https://hatebase.org/.
[10] P. Fortuna, S. Nunes, A survey on automatic detection of hate speech in text, ACM
Computing Surveys (CSUR) 51 (2018) 1–30.
[11] A. Schmidt, M. Wiegand, A survey on hate speech detection using natural language
processing, in: Proceedings of the Fifth International workshop on natural language processing for
social media, 2017, pp. 1–10.
[12] dennybritz.com/posts/wildml/understanding-convolutional-neural-networks-for-nlp/
[13] Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. Proceedings of
the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014),
1746–1751.

[14] Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A Convolutional Neural Network
for Modelling Sentences. Acl, 655–665.
[15] Santos, C. N. dos, & Gatti, M. (2014). Deep Convolutional Neural Networks for Sentiment
Analysis of Short Texts. In COLING-2014 (pp. 69–78).
[16] Johnson, R., & Zhang, T. (2015). Effective Use of Word Order for Text Categorization with
Convolutional Neural Networks. To Appear: NAACL-2015, (2011).
[17] Johnson, R., & Zhang, T. (2015). Semi-supervised Convolutional Neural Networks for Text
Categorization via Region Embedding.
[18] Wang, P., Xu, J., Xu, B., Liu, C., Zhang, H., Wang, F., & Hao, H. (2015). Semantic
Clustering and Convolutional Neural Network for Short Text Categorization. Proceedings ACL
2015, 352–357.
[19] Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners’ Guide to)
Convolutional Neural Networks for Sentence Classification, Nguyen, T. H., & Grishman, R.
(2015). Relation Extraction: Perspective from Convolutional Neural Networks. Workshop on
Vector Modeling for NLP, 39–48.
[20] Sun, Y., Lin, L., Tang, D., Yang, N., Ji, Z., & Wang, X. (2015). Modeling Mention , Context
and Entity with Neural Networks for Entity Disambiguation, (Ijcai), 1333–1339.
[21] Zeng, D., Liu, K., Lai, S., Zhou, G., & Zhao, J. (2014). Relation Classification via
Convolutional Deep Neural Network. Coling, (2011), 2335–2344.
[22] Gao, J., Pantel, P., Gamon, M., He, X., & Deng, L. (2014). Modeling Interestingness with
Deep Neural Networks.
[23] Shen, Y., He, X., Gao, J., Deng, L., & Mesnil, G. (2014). A Latent Semantic Model with
Convolutional-Pooling Structure for Information Retrieval. Proceedings of the 23rd ACM
International Conference on Conference on Information and Knowledge Management – CIKM
’14, 101–110.
[24] Weston, J., & Adams, K. (2014). #TagSpace: Semantic Embeddings from Hashtags,
1822–1827.
