Group Document
On
"ONLINE ABUSE DETECTION USING MACHINE LEARNING"
Submitted by
S. Nikitha (1608-20-737-010)
K. Ashish Yadav (1608-20-737-018)
B. Sri Phani (1608-20-737-022)
Under the Esteemed Guidance of
Dr. K. Vikram Reddy
Associate Professor, Department of Information Technology
Matrusri Engineering College, Hyderabad, Telangana-500059
(2023-24)
CERTIFICATE
This is to certify that the project report entitled "ONLINE ABUSE DETECTION USING MACHINE LEARNING", being submitted by S. Nikitha (1608-20-737-010), K. Ashish Yadav (1608-20-737-018), and B. Sri Phani (1608-20-737-022) in partial fulfilment of the requirements for the award of the degree of Bachelor of Engineering in "Information Technology", O.U., Hyderabad, during the year 2023-2024, is a record of bonafide work carried out by them under my guidance. The results presented in this report have been verified and are found to be satisfactory.
External Examiner(s)
ACKNOWLEDGEMENT
We wish to express our gratitude to our project guide for his valuable supervision and timely assistance with regard to our project work.
We wish to express our gratitude to our project coordinator, Mr. K. Vikram Reddy, Assistant Professor, Department of Information Technology, Matrusri Engineering College, for his indefatigable inspiration, guidance, cogent discussion, constructive criticism, and encouragement throughout this dissertation work.
We wish to express our gratitude to Dr. G. Shyama Chandra Prasad, Head of the Department, Information Technology, Matrusri Engineering College, for permitting us to do this project.
We wish to express our gratitude to Dr. D. Hanumantha Rao, Principal of Matrusri Engineering College, who permitted us to carry out the project work as part of the academics.
We would like to thank the IT Department for providing us this opportunity to contribute our part of the work and accomplish the project on time, and all the teaching and supporting staff for their steadfast support and encouragement.
The satisfaction that accompanies the successful completion of any task would be incomplete without the mention of our family members and friends, who made it possible and whose encouragement and guidance have been a source of inspiration.
DECLARATION
We hereby declare that the project entitled "ONLINE ABUSE DETECTION USING MACHINE LEARNING", submitted to Osmania University in partial fulfilment of the requirements for the award of the degree of B.E. in IT, is a bonafide and original work done by us under the guidance of Mr. K. Vikram Reddy, Assistant Professor, Information Technology, and that this project work has not been submitted to any other university for the award of any degree or diploma.
S.Nikitha (1608-20-737-010)
K.Ashish Yadav (1608-20-737-018)
B.Sri Phani (1608-20-737-022)
ABSTRACT
The emergence of social media platforms has given rise to an unparalleled level of hate speech in public conversations. The number of tweets containing hate speech and targeting one user or another increases every day. Unfortunately, any user engaged on these platforms runs the risk of being targeted or harassed through abusive language expressing hate towards race, colour, religion, descent, gender, nationality, etc. Such bullying, trolling, and harassment can be very serious and in several cases may even lead to the suicide of the victim.
Hate speech can also appear in the form of sarcasm or indirect taunts, making it difficult for users or systems to understand the intent behind a tweet. Therefore, an automatic and scalable detection system has become a priority. To stop hate speech spreaders, a machine learning system needs to be developed that automatically detects spreaders of hate speech based on the contents of their posts, where the model should be able to infer the meaning of a word with respect to its context.
The main objective of the project is to overcome the limitations of existing systems and introduce a model with greater accuracy. The word embeddings are made task-specific by training on the dataset provided, which reduces the out-of-vocabulary problem to some extent. Various classifiers were paired with these embeddings and tested, and the CNN was found to be the most efficient. Hatred spreading through the use of language on social media platforms and in online groups has become a well-known phenomenon. By comparing two text representations, bag of words (BoW) and pre-trained word embeddings using GloVe, we used a binary classification approach to automatically process user content and detect hate speech. The Naive Bayes Algorithm (NBA), Logistic Regression Model (LRM), Support Vector Machines (SVM), Random Forest Classifier (RFC), and one-dimensional Convolutional Neural Networks (1D-CNN) are the models proposed. With a weighted macro-F1 score of 0.66 and an accuracy of 0.90, the 1D-CNN with GloVe embeddings performed best among all the models.
LIST OF FIGURES
6.1 Predicted Result 1..............................................................................................................43
6.2 Predicted Result 2..............................................................................................................43
6.3 Predicted Result 3..............................................................................................................43
TABLE OF CONTENTS
TITLE.........................................................................................................................................I
CERTIFICATE.........................................................................................................................II
ACKNOWLEDGEMENT......................................................................................................III
DECLARATION.....................................................................................................................IV
ABSTRACT..............................................................................................................................V
LIST OF FIGURES................................................................................................................VI
CHAPTER 1- INTRODUCTION 1- 4
1.1 Introduction................................................................................................................1-2
1.2 Proposed Solution......................................................................................................2-3
1.3 Objective......................................................................................................................3
1.4 Scope............................................................................................................................4
CHAPTER 2-LITERATURE SURVEY 5-7
2.1 Literature Survey.......................................................................................................5-6
2.2 Feasibility Study..........................................................................................................7
2.2.1 Operational Feasibility.......................................................................................7
2.2.2 Operational Feasibility.......................................................................................7
2.2.3 Technical Feasibility..........................................................................................7
2.2.4 Economic Feasibility.........................................................................................7
CHAPTER 3-REQUIREMENTS SPECIFICATION 8-11
3.1 Software Requirements...........................................................................................8-10
3.2 Hardware Requirements............................................................................................11
CHAPTER 4-SYSTEM ANALYSIS 12-27
4.1 Existing System...............................................................................................................12
4.2 Proposed System..........................................................................................................12-24
4.2.1. Word2Vec.....................................................................................................14-16
4.2.2 TF-IDF................................................................................................................16
4.2.3. Embedding Layer.........................................................................................16-17
4.2.4. Convolutional Neural Networks...................................................................17-22
4.2.5 Character-Level CNNs...........................................................................................22
4.2.6 Developed Models........................................................................................22-24
4.3 System Architecture...................................................................................................24
4.4 Module Description...............................................................................................25-27
4.5 Advantages Of Proposed System...............................................................................27
4.6 Applications Of Proposed System..............................................................................27
CHAPTER 5-DESIGN 28-31
5.1 UML Diagrams.....................................................................................................28-31
5.1.1 Use case Diagram.........................................................................................28-29
5.1.2 Activity Diagram...............................................................................................29
5.1.3 State Chart Diagram..........................................................................................30
5.1.4 Sequence Diagram.............................................................................................31
CHAPTER 6-IMPLEMENTATION 32-43
6.1 Sample Code..........................................................................................................32-41
6.2 Output Screens.......................................................................................................42-43
CHAPTER 7-CONCLUSION 44
7.1 Conclusion and Future Enhancements........................................................................44
CHAPTER 8-REFERENCES 45-46
8.1 References....................................................................................................................45-46
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
Social media has become an important communication medium. Social media platforms such as
Facebook, Twitter, LinkedIn, and WhatsApp enable users to interact with each other by both
sharing and consuming information. Information can spread widely and even viral on social
media very quickly. Unfortunately, hate speech can also spread easily that not only harm
individual victims but also create an adverse impact onsociety.
Users can now share everything they desire with their followers and express their opinions in an
instant. While the freedom of speech is important and generally should not be restricted, not all
kinds of free speech should be tolerated. This brings us to the subject of hate speech. Freedom of speech in the virtual environment gave users a false sense of security, a sense that even hate speech could be shared without repercussions. Hate speech is commonly defined as any communication that uses offensive and threatening language targeting specific groups of people based on characteristics such as religion, ethnicity, nationality, race, colour, gender, or other attributes. A huge amount of user-generated content on the web and social media has given rise to a variety of challenges, including the spreading and sharing of hate speech messages.
However, classifying users as haters or not from their posts is a challenging task. Still, the
detection of offensive language from social media is not an easy research problem due to the
different levels of ambiguities present in natural language and the noisy nature of social media
language. In addition, social media subscribers come from linguistically diverse communities.
PAN at CLEF 2021, with the task "Profiling Hate Speech Spreaders on Twitter", deals with the detection of hate speech spreaders in two languages, English and Spanish, meaning that classification needs to be done at the user level and not at the post level. Since the network is huge, it is almost impossible to stop hate speech spreaders using only human resources. Therefore, numerous computational methods are being developed to enable automated detection of hate spreaders on social networks. Advances in natural language processing (NLP) and text analysis techniques provide an automated way to identify hate speech posts by analysing social media posts.
To stop hate speech spreaders, a machine learning system needs to be developed that
automatically detects spreaders of hate speech based on the contents of their posts, such that the
model is trained on contextual embeddings.
Figure 1.1 depicts the problem statement: each user has 200 tweets acting as the input, and the user is classified either as a Hate Speech Spreader or a Non-Hate Speech Spreader.
1.2 PROPOSED SOLUTION
3. Training Data: The study used a dataset annotated at the sentence level, focusing on Internet forum posts in English. The data were manually labeled into two classes: Hate and No-Hate.
4. Evaluation Metrics: The performance of each model was evaluated using metrics such as accuracy and F1 score. The 1D-CNN model with GloVe embeddings achieved the highest weighted macro-F1 score of 0.66 and an accuracy of 0.90.
5. Solution Summary: The proposed solution involves using the 1D-CNN model with GloVe embeddings to automatically detect hate speech in social media posts.
6. Improvements: To enhance the performance of the GloVe embeddings, increasing the size of the training dataset or fine-tuning the embeddings specifically for hate speech detection could be considered. Additionally, exploring other deep learning architectures or ensemble methods may further improve the detection accuracy.
Overall, the solution offers an efficient and accurate method for identifying hate speech in online
content, which can contribute to mitigating the negative impacts of hate speech on society.
1.3 OBJECTIVE
The objective of the model described in the paper is to develop an effective method for
automatically detecting hate speech in social media posts. This involves:
1. Comparing different text representations, such as Bag of Words (BoW) and GloVe word
embeddings.
2. Evaluating various machine learning and deep learning models, including Naive Bayes
Algorithm (NBA), Logistic Regression Model (LRM), Support Vector Machines (SVM), Random
Forest Classifier (RFC), and one-dimensional Convolutional Neural Networks (1D-CNN).
3. Training and testing the models on a dataset from a white supremacist forum to assess their
performance in terms of accuracy and F1 score.
4. Identifying the most effective model, which in this case was found to be 1D-CNN with GloVe
embeddings, for hate speech detection on social media platforms.
1.4 SCOPE
The proposed hate speech detection model for social media posts involves:
Text Classification: Identifying hate speech or non-hate speech content in social media posts.
Feature Representation: Utilizing methods like Bag-of-Words (BoW) and GloVe word
embeddings to capture text semantics effectively.
Performance Evaluation: Assessing model effectiveness through metrics like
accuracy, precision, recall, and F1 score.
Implementation and Deployment: Deploying the most effective model in real-time applications like social media moderation tools.
Validation and Ethics: Validating model robustness with unseen data and addressing
ethical concerns like censorship and bias to ensure fairness and transparency.
CHAPTER 2
LITERATURE SURVEY
There is strong motivation to study the automatic detection of hate speech due to the
overwhelming online spread of information [4]. The detection of hate speech is crucial to reducing
crime and protecting people's beliefs. This study is especially important in the face of ongoing wars that distort reality and dehumanize the attacked Ukrainian nation. Studies have shown an
increase in hate speech against China on online social media, especially racist and abusive content
accusing people of causing the COVID-19 outbreak. On the other hand, a lower rate of hate speech
reduces crime, such as cyberbullying, which significantly affects social tranquility [5], leading to
minimal cyber-attacks [6].
Despite the many studies in the field, hate speech is still problematic and challenging. The
literature reports that both humans and machine learning models have difficulty detecting hate
speech due to the complexity and variety of hate speech categories. Hate is characterized by more
extreme behaviors associated with prejudice. Figure 1 shows numerous aspects of hate speech.
Hate speech is seen as a layer between aggressive and abusive text; however, all of these share the
offensive aspect. On the other hand, sexist, homophobic, and religious hate are relatively different
as they target a group of people or a gender. The figure shows that the separation of the concept is
complicated, a challenge that has been previously discussed [7]. Specific hate speech definitions
are contentious [7]; racist and homophobic tweets, for example, are more likely to be labeled as
hate speech than other types of offensive or abusive content. Therefore, there is no way to
generalize whether an inflammatory text is hate speech [8]. The representation of short documents, such as tweets, is a significant issue; they pose additional challenges to the traditional bag-of-words model, resulting in data sparsity due to insufficient contextual information [9]. Furthermore, many datasets [10] may differ due to these classifications and
definitions, making it challenging to compare machine learning models. In other words, hate
speech notions are all within the umbrella of abusive text, according to Poletto et al. [1].
Table 2.1: Literature Survey of Hate Speech Detection
CHAPTER 3
REQUIREMENT SPECIFICATION
3.1 SOFTWARE REQUIREMENTS:
The following text contains the libraries required and their versions used while developing the
project:
i. Python 3
iii. Keras
Version: 2.12.0
Keras is a high-level deep learning API developed by Google for implementing neural networks. It is written in Python and is used to make the implementation of neural networks easy. It also supports multiple backends for neural network computation.
Keras is relatively easy to learn and work with because it provides a python frontend with a high
level of abstraction while having the option of multiple back-ends for computation purposes. This
makes Keras slower than other deep learning frameworks, but extremely beginner-friendly.
iv. Tensorflow
Version: 2.12.0
TensorFlow is an open-source library that the Google Brain team developed in 2012. Python is by
far the most common language that TensorFlow uses. You can import the TensorFlow library into
your Python environment and perform deep learning development. The program is executed in a specific way: you first create nodes, which process the data in the form of a graph.
The data gets stored in the form of tensors, and the tensor data flows to various nodes. One of
TensorFlow’s best qualities is that it makes code development easy. The readily available APIs
save users from rewriting some of the code that would otherwise have been time-consuming.
TensorFlow speeds up the process of training a model. Additionally, the chances of errors in the
program are also reduced, typically by 55 to 85 percent.
v. scikit-learn
Version: 1.2.0
Scikit Learn or Sklearn is one of the most robust libraries for machine learning in Python. It is
open source and built upon NumPy, SciPy, and Matplotlib. It provides a range of tools for machine
learning and statistical modeling including dimensionality reduction, clustering, regression and
classification, through a consistent interface in Python. Additionally, it provides many other tools
for evaluation, selection, model development, and data preprocessing.
vi. Pickle
The Python pickle module is used for serializing and de-serializing a Python object structure. Any object in Python can be pickled so that it can be saved on disk. Pickle "serializes" the object first before writing it to a file. Pickling is a way to convert a Python object (list, dict, etc.) into a character stream, with the idea that this character stream contains all the information necessary to reconstruct the object in another Python script.
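As a brief illustration (the file name and the object being pickled below are placeholders, not project artifacts), the module can be used as follows:

import pickle

data = {"model_name": "1D-CNN", "accuracy": 0.90}

with open("example.pkl", "wb") as f:      # serialize ("pickle") the object to disk
    pickle.dump(data, f)

with open("example.pkl", "rb") as f:      # de-serialize ("unpickle") it back
    restored = pickle.load(f)

print(restored == data)                   # prints: True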
vii. Numpy
Version: 1.23.5
NumPy is the fundamental package for numerical computation in Python, providing support for large multi-dimensional arrays and matrices. Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data.
viii. Pandas
Version: 1.5.3
Pandas is an open-source library that is built on top of NumPy library. It is a Python package that
offers various data structures and operations for manipulating numerical data and time series. It is popular mainly because it makes importing and analyzing data much easier. Pandas is fast and offers high performance and productivity to its users.
ix. nltk
Version: 3.8.1
To run the Python programs below, the Natural Language Toolkit (NLTK) has to be installed on your system. The NLTK module is a massive toolkit aimed at helping you with the entire natural language processing (NLP) methodology.
After installation, enter the Python shell in your terminal by typing python, and then run:
import nltk
nltk.download('all')
The download will take quite some time due to the massive number of tokenizers, chunkers, other algorithms, and corpora to be downloaded.
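If the full bundle is not required, a lighter alternative (the resource names below are an assumption based on the preprocessing steps used later in this report) is to download only what is needed:

import nltk

# download only the tokenizer models, stop word lists, and WordNet data
for resource in ("punkt", "stopwords", "wordnet"):
    nltk.download(resource)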
x. collections
The collection Module in Python provides different types of containers. A Container is an object
that is used to store different objects and provide a way to access the contained objects and iterate
over them. Some of the built-in containers are Tuple, List, Dictionary, etc.; the collections module provides additional specialized container types beyond these built-ins.
xi. Gensim
Version: 4.3.1
Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with
large corpora. It is designed to handle large text collections efficiently and performantly, with a
focus on unsupervised learning techniques such as latent semantic analysis (LSA), latent Dirichlet
allocation (LDA), and random projections. Gensim also includes tools for text preprocessing, such
as tokenization and stopword removal, and provides a number of pre-trained models for use in
natural language processing (NLP) tasks, such as word2vec and fastText. The library is open-source and can be installed via pip.
3.2 HARDWARE REQUIREMENTS:
Operating System: Windows 10 64-bit
RAM: Minimum 8 GB
Processor: Intel i5 (12th gen) or equivalent
GPU: NVIDIA GeForce RTX or equivalent with 4 GB memory
Our project was developed on a device with the following specifications: Windows 11, 16 GB RAM, AMD Ryzen 5 5600 processor, and an NVIDIA GeForce RTX 3050 GPU with 4 GB of memory.
CHAPTER 4
SYSTEM ANALYSIS
4.1 PROBLEMS WITH EXISTING SYSTEM:
The existing models are good to a certain extent and have achieved good accuracies. A few of the models used simple techniques such as the linear SVM classifier and Logistic Regression, while others used neural networks such as RNNs, LSTMs, Convolutional Neural Networks, and Artificial Neural Networks. Apart from the classifiers, the main issue was with the generation of word embeddings. The models had embeddings that:
i. are pre-trained, and therefore have difficulty recognizing new words;
ii. make classification computationally very expensive;
iii. are based entirely on n-grams, which may not be needed for our task;
iv. are primarily dependent on the dataset and word frequencies, and are better suited to information retrieval than sentiment analysis;
v. do not comprise contextual information.
training example. However, that did not result in a larger test corpus, since we still predicted at the author level. In other research papers, to mitigate the noise effect, the authors analysed which indicators could be used to better discriminate between hate speech spreaders and non-hate speech spreaders. We, however, considered such indicators miscellaneous and felt they did not add enough weight.
After the removal of XML tags, the emojis, hashtags, hyperlinks, and 'RT' markers were removed. Contractions in the text, such as "isn't", are expanded. Stop words are not removed, on the assumption that they carry contextual information; in particular, negative stop words such as 'not' are preserved. The text is lowercased and lemmatized.
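A minimal sketch of this cleaning step is shown below; the regular expressions, the contraction-expansion helper, and the lemmatizer are illustrative choices, not the exact code used in the project:

import re
import contractions                                    # assumed helper package for expanding "isn't" -> "is not"
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def clean_tweet(text):
    text = re.sub(r"<[^>]+>", "", text)                # XML/HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # hyperlinks
    text = re.sub(r"[#@]\w+|\bRT\b", "", text)         # hashtags, mentions, and 'RT'
    text = text.encode("ascii", "ignore").decode()     # drop emojis / non-ASCII symbols
    text = contractions.fix(text)                      # expand contractions
    tokens = [lemmatizer.lemmatize(w) for w in text.lower().split()]
    return " ".join(tokens)                            # stop words such as 'not' are kept

print(clean_tweet("RT @user I'm NOT okay with this!! #hate <br/>"))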
The preprocessed text is given as input to models such as Word2Vec, which produce the word embeddings.
There are two ways to handle this project:
A. Training on tweet level
This may seem equivalent to sentiment analysis, where each tweet of every author is considered a training sample. Each tweet is given the label of its author (in the dataset, labels are assigned to authors, not tweets).
The thing to note here is that although this may seem simpler to implement, it is not semantically correct for our project, and it is not sound to explicitly assign to tweets the labels of their author, because not all tweets of a hate speech spreader need to be hate speech.
Some of the previous approaches followed the tweet-level process and defined a threshold on the number of tweets for an author to be considered a 'Hate Speech Spreader'.
i. Word Embeddings
ii. Classifier
There are many popular methods to generate word embeddings. We tried the following:
i. Word2Vec: generates embeddings by learning from the corpus. GloVe and FastText are as well known as Word2Vec, but the way they are built is different, and each has its own pros and cons. The Word2Vec mechanism is discussed in depth in Section 4.2.1.
ii. TF-IDF: This method generates embeddings considering the occurrence of a word in a document relative to its occurrence in the entire corpus. These embeddings do not comprise information about surrounding words. TF-IDF is described in Section 4.2.2.
iii. Embedding layer: Models such as Word2Vec, GloVe, and FastText face an out-of-vocabulary problem, as they are pre-trained and limited to a specific set of words. FastText is based on n-grams, so it may handle a few new words, but those embeddings might not be relevant. To overcome this issue, we chose to build our own vocabulary and generate embeddings for each word using an Embedding layer while iterating through each sample. This is explained in Section 4.2.3.
i. CNN: CNNs are typically used for image classification tasks because of their ability to capture local context clearly and generalize it using pooling layers. However, 1D CNN layers can also be used for sequences such as text. This is explained in depth in Sections 4.2.4 and 4.2.5.
4.2.1. Word2Vec:
Word2vec is a group of models used to produce distributed representations of words in a corpus C. Word2Vec (W2V) is an algorithm that accepts a text corpus as input and outputs a vector representation for each word, as shown in Figure 4.1.
There are two flavors of this algorithm, namely CBOW and Skip-Gram. Given a set of sentences (also called a corpus), the model loops over the words of each sentence and either tries to use the current word w to predict its neighbours (i.e., its context), an approach called "Skip-Gram", or uses each of these contexts to predict the current word w, in which case the method is called "Continuous Bag of Words" (CBOW).
Figure 4.2: Word2Vec windows of different sizes and their respective training samples
The vectors we use to represent words are called neural word embeddings, and such representations are strange: one thing describes another, even though those two things are radically different. As Elvis Costello said, "Writing about music is like dancing about architecture." Word2vec "vectorizes" words, and by doing so it makes natural language computer-readable; we can start to perform powerful mathematical operations on words to detect their similarities.
So, a neural word embedding represents a word with numbers. It is a simple, yet unlikely, translation. Word2vec is similar to an autoencoder, encoding each word in a vector, but rather than training against the input words through reconstruction, as a restricted Boltzmann machine does, word2vec trains words against other words that neighbour them in the input corpus. As shown in Figure 4.2, it does so in one of two ways: either using the context to predict a target word (a method known as "Continuous Bag of Words", or CBOW), or using a word to predict a target context, which is called "Skip-Gram". We use the latter method because it produces more accurate results on large datasets.
Words that appear in similar contexts are nudged closer together by adjusting the numbers in their vectors. In this project, we focus on the Skip-Gram model, which, in contrast to CBOW, takes the center word as input, as depicted in the figure above, and predicts the context words.
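As a brief illustration (the corpus and hyperparameters below are toy values, not project settings), task-specific Word2Vec embeddings can be trained with Gensim roughly as follows:

from gensim.models import Word2Vec

corpus = [["users", "spread", "hate", "speech"],
          ["this", "tweet", "is", "harmless"]]

# sg=1 selects Skip-Gram (predict the context from the center word); sg=0 selects CBOW
w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1)

vector = w2v.wv["hate"]                        # 100-dimensional embedding for a word
similar = w2v.wv.most_similar("hate", topn=3)  # nearest neighbours in the embedding space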
4.2.2 TF-IDF:
TF-IDF stands for "Term Frequency-Inverse Document Frequency". It is a numerical statistic that
is used to evaluate how important a word is to a document in a collection of documents. TF-IDF
embeddings are generated by first computing the term frequency (TF) and inverse document
frequency (IDF) values for each word in the training data.
The term frequency (TF) value of a word is the number of times the word appears in a document.
The inverse document frequency (IDF) value of a word is a measure of how much information the
word provides, i.e., how rare or common it is in the collection of documents. The IDF value is
computed as the logarithm of the total number of documents divided by the number of documents
that contain the word.
TF-IDF embeddings are useful for tasks like sentiment analysis because they capture the
importance of each word in a document, and the resulting vector representations can be used as
features for machine learning models. For example, the TF-IDF vectors can be used as input to a
classification algorithm, such as logistic regression or support vector machines, to predict the
sentiment of a document. By using TF-IDF embeddings, the model can give more weight to the
words that are most important for the sentiment classification task.
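As a short sketch (the documents and labels below are toy examples, not project data), TF-IDF features can be generated with scikit-learn and fed to such a classifier as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["i hate this group", "have a nice day", "such hateful people"]
labels = [1, 0, 1]                               # 1 = hate, 0 = no-hate

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)                    # sparse matrix of TF-IDF weights

clf = LogisticRegression().fit(X, labels)
print(clf.predict(tfidf.transform(["nice people"])))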
Embedding: Each word index is converted to a dense vector of fixed size by looking up
its corresponding embedding vector in a lookup table.
Training: The neural network is trained on the embedded sequences, with the weights of
the embedding layer updated during the backpropagation process.
After the training process, the embedding layer produces a fixed-size vector representation of each
word in the input text. These word embeddings can be used as input to downstream tasks such as
sentiment analysis, machine translation, and text classification.
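A minimal sketch of such a trainable embedding layer in Keras is given below; the vocabulary size, sequence length, and embedding dimension are illustrative assumptions, not the project's actual values:

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["you are awful", "have a great day"]
tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(texts)                         # build the project-specific vocabulary
seqs = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=50)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(50,), dtype="int32"),
    tf.keras.layers.Embedding(input_dim=20000, output_dim=100),  # embeddings learned during training
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")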
Instead of image pixels, the input to most NLP tasks consists of sentences or documents represented as a matrix. Each row of the matrix corresponds to one token, typically a word, but it could be a character. That is, each row is a vector that represents a word. Typically, these vectors are word embeddings (low-dimensional representations) like word2vec or GloVe, but they could also be one-hot vectors that index the word into a vocabulary. For a 10-word sentence using a 100-dimensional embedding we would have a 10×100 matrix as our input. That is our "image".
Figure 4.5 depicts a Convolutional Neural Network (CNN) architecture for sentence classification. There are three filter region sizes (2, 3, and 4), each of which has 2 filters. Every filter performs convolution on the sentence matrix and generates a (variable-length) feature map. Then 1-max pooling is performed over each map, i.e., the largest number from each feature map is recorded. Thus, a univariate feature vector is generated from all six maps, and these 6 features are concatenated to form a feature vector for the penultimate layer. The final softmax layer then receives this feature vector as input and uses it to classify the sentence; here binary classification is assumed, and hence two output classes are depicted.
Unlike with image pixels, we probably do care a lot about where in the sentence a word appears. Pixels close to each other are likely to be semantically related (part of the same object), but the same isn't always true for words. In many languages, parts of phrases can be separated by several other words. The compositional aspect isn't obvious either. Clearly, words compose in some ways, such as an adjective modifying a noun, but how exactly this works, and what the higher-level representations actually "mean", isn't as obvious as in the computer vision case.
Given all this, it seems like CNNs wouldn’t be a good fit for NLP tasks. Recurrent Neural
Networks make more intuitive sense. They resemble how we process language, or at least how we
think we process language: Reading sequentially from left to right. Fortunately, this doesn’t mean
that CNNs don’t work. All models are wrong, but some are useful. It turns out that CNNs applied
to NLP problems perform quite well. The simple Bag of Words model is an obvious
oversimplification with incorrect assumptions, but has nonetheless been the standard approach for
years and lead to prettygood results.
A big argument for CNNs is that they are fast. Very fast. Convolutions are a central part of computer graphics and are implemented at the hardware level on GPUs. Compared to something like n-grams, CNNs are also efficient in terms of representation. With a large vocabulary, computing anything more than 3-grams can quickly become expensive; even Google doesn't provide anything beyond 5-grams. Convolutional filters learn good representations automatically, without needing to represent the whole vocabulary. It is completely reasonable to have filters of size larger than 5. Many of the learned filters in the first layer capture features quite similar (but not limited) to n-grams, while representing them in a more compact way.
CNN Hyperparameters
Before explaining how CNNs are applied to NLP tasks, let's look at some of the choices you need to make when building a CNN. Hopefully this will help you better understand the literature in the field.
i. Narrow vs. Wide convolution:
When explaining convolutions above, we neglected a small detail of how the filter is applied. Applying a 3×3 filter at the center of the matrix works fine, but what about the edges? How would you apply the filter to the first element of a matrix that doesn't have any neighboring elements to the top and left? You can use zero-padding.
All elements that would fall outside of the matrix are taken to be zero. By doing this you can
apply the filter to every element of your input matrix, and get a larger or equally sized output.
Adding zero-padding is also called wide convolution, and not using zero-padding would be a
narrow convolution. An example in 1D looks like this:
Figure 4.6: Narrow vs. Wide Convolution. Filter size 5, input size 7. Source: A Convolutional Neural Network for Modelling Sentences (2014)
From Figure 4.6, you can see how wide convolution is useful, or even necessary, when you have a large filter relative to the input size. In the above, the narrow convolution yields an output of size (7 − 5) + 1 = 3, and the wide convolution an output of size (7 + 2*4 − 5) + 1 = 11. More generally, the formula for the output size is
n_out = (n_in + 2*n_padding − n_filter) + 1
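A quick sanity check of this formula in code (the helper function below is written only for illustration):

def conv_output_size(n_in, n_filter, n_padding=0):
    return (n_in + 2 * n_padding - n_filter) + 1

print(conv_output_size(7, 5))               # narrow convolution: 3
print(conv_output_size(7, 5, n_padding=4))  # wide convolution: 11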
ii. Stride Size:
Another hyperparameter of the convolution is the stride size, i.e., by how much the filter is shifted at each step.
Figure 4.7: Convolution Stride Size. Left: Stride size 1. Right: Stride size 2.
In the literature we typically see stride sizes of 1, but a larger stride size may allow you to build a model that behaves somewhat similarly to a Recursive Neural Network, i.e., looks like a tree.
iii. Pooling Layers:
A key aspect of Convolutional Neural Networks is the pooling layers, typically applied after the convolutional layers. Pooling layers subsample their input. The most common way to do pooling is to apply a max operation to the result of each filter. You don't necessarily need to pool over the complete matrix; you could also pool over a window. For example, Figure 4.8 shows max pooling for a 2×2 window. In NLP we typically apply pooling over the complete output, yielding just a single number for each filter: a phrase detected by a filter yields the maximum value regardless of where it appears in the input.
iv. Channels:
The last concept we need to understand is channels. Channels are different "views" of your input data. For example, in image recognition you typically have RGB (red, green, blue) channels. You can apply convolutions across channels, either with different or equal weights. In NLP you could imagine having various channels as well: you could have separate channels for different word embeddings (word2vec and GloVe, for example), or you could have a channel for the same sentence represented in different languages, or phrased in different ways.
4.2.6 Developed Models:
1. Trained Embeddings + CNN:
Two individual models were developed for the two languages with the same architecture, which is depicted in Figure 4.9. The model works at the author level. It was run for 30 epochs, and 5-fold cross-validation was performed. We experimented with an Average Pooling 1D layer and a Max Pooling 1D layer after the Convolution 1D layer, with various hyperparameters. The layers with the parameters listed in Table 4.8 worked best for us:

Layer | Hyperparameters
Embedding Layer | Embedding size: 100
Convolution 1D Layer | No. of kernels: 36, Kernel size: 24
Max Pooling 1D Layer | Pool size: 3
4.3 SYSTEM ARCHITECTURE
From Figure 4.9, it can be observed that the tweets of a user are first cleaned of XML tags and preprocessed. In pre-processing, contractions are expanded, and punctuation, emojis, and other unwanted characters are removed. Then, tokenization is performed on the pre-processed text and tokens are generated. Meanwhile, a vocabulary is built from the tokens. The sentences, as sequences of encoded tokens, are given to a Sequential model. The embedding layer produces embeddings, which are fed to a Convolution 1D layer. It performs the convolution operation with filters of the specified parameters. Then, Max Pooling and Global Average Pooling are applied. A sigmoid layer is used for binary classification.
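A minimal Keras sketch of this architecture, using the hyperparameters from Table 4.8, is shown below; the vocabulary size and padded sequence length are assumptions, and this is an illustration rather than the exact project code:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(512,), dtype="int32"),                     # padded, encoded token sequence
    layers.Embedding(input_dim=20000, output_dim=100),             # Embedding size: 100
    layers.Conv1D(filters=36, kernel_size=24, activation="relu"),  # 36 kernels of size 24
    layers.MaxPooling1D(pool_size=3),                              # Pool size: 3
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),                         # binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()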
4.4 MODULE DESCRIPTION
USER MODULE
Users can sign up to the web application by registering themselves, providing details such as username, password, etc.
Registered users can sign in to their profile using their user ID and password.
They can post videos, stories, and photos in the web application.
Users can send friend requests to other users and can also chat with their friends.
Users can view, like, and comment on the videos and photos posted by their friends in the web application.
ADMIN MODULE
The machine learning module is responsible for classifying comments and messages as hate speech or non-hate speech.
From a vast set of comments and messages, the 1D CNN is used to predict bullying comments and messages.
This module includes the following steps:
1. Data collection
2. Data preprocessing
3. Segmentation
4. Feature extraction
5. Training
6. Testing
1. DATA COLLECTION
Collecting data for training the Machine Learning model is the basic step in the machine
learning pipeline.
The predictions made by Machine Learning systems can only be as good as the data on
which they have been trained.
In this system, the dataset contains bullying as well as non-bullying comments and messages.
The dataset is downloaded from the Kaggle website.
80% of the dataset is used for training and the remaining 20% is used for testing.
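A minimal sketch of this 80/20 split (the file path and column names are assumptions):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")                       # Kaggle dataset
X_train, X_test, y_train, y_test = train_test_split(
    df["tweet"], df["class"], test_size=0.20, random_state=42, stratify=df["class"])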
2. DATA PREPROCESSING
Real-world raw data and images are often incomplete, inconsistent and lacking in certain
behaviors or trends. They are also likely to contain many errors. So, once collected, they
are pre-processed into a format the machine learning algorithm can use for the model.
Data preprocessing in Machine Learning is a crucial step that helps enhance the quality of
data to promote the extraction of meaningful insights from the data.
The preprocessing step also includes the removal of stop words and special characters and the conversion of uppercase letters to lowercase.
The lemmatization step converts an inflected word into its root word. For example, the word "running" is converted to its root word "run".
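A tiny illustration of this lemmatization step (assuming the NLTK WordNet data is available):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))    # prints: run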
3. SEGMENTATION
Segmentation can be defined as the process of separating sentences into different tokens.
N-grams are used for grouping tokens.
N-grams are used for a variety of purposes; one example is the auto-completion of sentences.
In this project, 2-grams are used to group tokens.
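A short sketch of grouping tokens into 2-grams with NLTK (the sentence is a toy example):

from nltk import ngrams
from nltk.tokenize import word_tokenize

tokens = word_tokenize("you are not welcome here")
bigrams = list(ngrams(tokens, 2))
# [('you', 'are'), ('are', 'not'), ('not', 'welcome'), ('welcome', 'here')]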
4. FEATURE EXTRACTION
Feature extraction is the process of taking out a list of words from the text data and then
transforming them into a feature set which is usable by a classifier.
In this system, TF-IDF vectorizer is used for feature extraction.
TF-IDF stands for term frequency-inverse document frequency and it is a measure, used to
quantify the importance or relevance of string representations in a document.
5. TRAINING
Model training is the key step in machine learning that results in a model ready to be
validated, tested, and deployed.
The performance of the model determines the quality of the applications that are built
using it.
Quality of training data and the training algorithm are both important assets during the
model training phase.
Typically, the dataset is split into training and testing sets.
6. TESTING
In machine learning, model testing is referred to as the process where the performance of a
fully trained model is evaluated on a testing set.
The testing set, consisting of a set of testing samples, should be separate from both the training and validation sets, but it should follow the same probability distribution as the training set.
Each testing sample has a known value of the target.
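A brief sketch of this testing step (clf, X_test, and y_test are assumed to come from the training step above):

from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Weighted F1:", f1_score(y_test, y_pred, average="weighted"))
print(confusion_matrix(y_test, y_pred))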
CHAPTER 5
DESIGN
5.1 UML Diagrams
The Unified Modeling Language (UML) is a standard language for writing software blueprints. UML is a language for visualizing, specifying, constructing, and documenting the artifacts of a software-intensive system.
Some of the frequently used diagrams in software development are:
Figure 5.1: Use Case Diagram for Hate Speech Spreaders Detection
As shown in Figure 5.1, our system has two actors: Twitter User and Hate Speech Spreader Detector. The actions of the Twitter User are to upload posts and view posts.
The Hate Speech Spreader Detector is the tool we built; it collects users' tweets, preprocesses them, builds the vocabulary, builds the word embeddings, trains the classifier, and predicts on the test data.
5.1.3 State Chart Diagram
A state diagram is used to represent the condition of the system or part of the system at finite
instances of time. It is a behavioral diagram and it represents the behavior using finite state
transitions. State diagrams are also referred to as State machines and State-chart Diagrams.
These terms are often used interchangeably. Simply put, a state diagram is used to model the dynamic behavior of a class in response to time and changing external stimuli. We can say that each and every class has a state, but we do not model every class using state diagrams; we prefer to model only classes with three or more states.
Figure 5.3: State Chart Diagram for Hate Speech Spreaders Detection
As shown in Figure 5.3, the various states the project goes through are: removal of XML tags from tweets; removal of emojis and hashtags; lemmatization and tokenization; and adding new words to the vocabulary. Then an embedding layer is added to the Sequential model, the embeddings are convolved, pooling is applied, the loss is computed, and the weights are updated through backpropagation.
5.1.4 Sequence Diagram
A sequence diagram depicts the interactions between objects in sequential order, i.e., the order in which these interactions take place. A sequence diagram uses lifelines: a lifeline is a named element that depicts an individual participant. Communication happens as messages appear in sequential order on the lifelines. Sequence diagrams establish the roles of objects and help provide essential information to determine class responsibilities and interfaces.
CHAPTER 6
IMPLEMENTATION
6.1 SAMPLE CODE
Python code for building the vocabulary, tokenization, encoding, padding sequences, generating embeddings, building the CNN model, and fitting it using cross-validation:
Creating Front End using Streamlit

# ... (earlier lines of the clean() function are not shown in the original listing)
    text = re.sub('\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)          # drop words containing digits (pattern assumed from the truncated listing)
    text = [word for word in text.split(' ') if word not in stopword]
    text = " ".join(text)
    text = [stemmer.stem(word) for word in text.split(' ')]
    text = " ".join(text)
    return text

# Streamlit UI
st.title("Hate Speech Recognition")

# Input text
text = st.text_area("Enter text to check for hate speech")
text = clean(text)

import pickle

with open('classifier.pkl', 'rb') as file:
    clf = pickle.load(file)
with open('count.pkl', 'rb') as file:
    cv = pickle.load(file)

if st.button('Predict'):
    # 1. preprocess
    transformed_sms = clean(text)
    # 2. vectorize
    vector_input = cv.transform([transformed_sms]).toarray()
    # 3. predict
    result = st.header(clf.predict(vector_input))
Creating Back End using Python
import numpy as np
import pandas as pd
import pickle
import tensorflow as tf
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing import text, sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, MaxPooling1D
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
import seaborn as sns
import os

# list the files available in the Kaggle input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))   # loop body assumed; not shown in the original listing

df = pd.read_csv('data.csv')

import nltk
nltk.download('punkt')
import string
try:
    nltk.download('stopwords')
except Exception as e:
    print(f"Error downloading stopwords: {e}")

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def transform_text(tweet):
    tweet = nltk.word_tokenize(tweet.lower())     # tokenize (assumed; this step is truncated in the original listing)
    y = []
    for i in tweet:
        if i.isalnum():                           # keep alphanumeric tokens only
            y.append(i)
    tweet = y[:]
    y.clear()
    for i in tweet:
        if i not in stopwords.words('english') and i not in string.punctuation:
            y.append(i)                           # drop stop words and punctuation
    tweet = y[:]
    y.clear()
    for i in tweet:
        y.append(ps.stem(i))                      # stem each remaining token
    return " ".join(y)

transform_text("I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.")

from nltk.stem.porter import PorterStemmer
ps.stem('loving')
df['transformed_text'] = df['tweet'].apply(transform_text)

# TF-IDF vectorizer (its construction is not shown in the original listing)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

X = tfidf.fit_transform(df['transformed_text']).toarray()
y = df['class']                # label column (assumed; the assignment is truncated in the original listing)

df.head()
c = df['class']

df.rename(columns={'transformed_text': 'text',
                   'class': 'category'},
          inplace=True)
# (a print statement summarising the class distribution is truncated in the original listing:
#  ... total, neither, 100 * neither / total))

df['data_type'] = ['not_set'] * df.shape[0]

# X_train / X_val / X_test index splits are assumed to come from an earlier split that is not shown in the listing
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'
df

df_train = df.loc[df["data_type"] == "train"]
df_val = df.loc[df["data_type"] == "val"]
df_test = df.loc[df["data_type"] == "test"]   # completed by analogy; truncated in the original listing

df_train.head()
df_train_plus_val.head()

x = df_train_plus_val.text.values
y = df_train_plus_val.category.values    # labels (assumed; the assignment is truncated in the original listing)

max_features = 20000
max_text_length = 512

x_tokenizer = text.Tokenizer(max_features)
x_tokenizer.fit_on_texts(x)               # build the vocabulary (required before texts_to_sequences)
x_tokenized = x_tokenizer.texts_to_sequences(x)
x_train_val = sequence.pad_sequences(x_tokenized, maxlen=max_text_length)

x_test_tokenized = x_tokenizer.texts_to_sequences(df_test.text.values)
x_test = sequence.pad_sequences(x_test_tokenized, maxlen=max_text_length)

import numpy as np

embedding_dim = 100
embeddings_index = {}                     # word -> GloVe vector; loading of the GloVe file is not shown in the original listing

embedding_matrix = np.zeros((max_features, embedding_dim))
for word, index in x_tokenizer.word_index.items():
    if index >= max_features:
        break
    else:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:              # words without a GloVe vector stay all-zero
            embedding_matrix[index] = embedding_vector

y_train_plus_val = tf.keras.utils.to_categorical(y, num_classes=3)
y_test = tf.keras.utils.to_categorical(df_test.label, num_classes=3)

model = Sequential()
model.add(Embedding(max_features,
                    embedding_dim,
                    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
                    trainable=False))
model.add(Dropout(0.2))
model.add(Conv1D(64, 2, padding='valid', activation='relu'))
model.add(MaxPooling1D())
model.add(Conv1D(64, 2, padding='valid', activation='relu'))
model.add(MaxPooling1D())
model.add(Conv1D(32, 2, padding='valid', activation='relu'))
model.add(MaxPooling1D())
model.add(Conv1D(32, 2, padding='valid', activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(16, activation='relu'))
model.add(Dense(16, activation='relu'))               # activation assumed; truncated in the original listing
model.add(Dense(3, activation='softmax'))
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# history = model.fit(x_train_val, y_train_plus_val, validation_split=..., epochs=..., batch_size=...)
# (the fit call that produces `history` is not shown in the original listing)

# Plot loss
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training loss vs. Epochs')
plt.legend()
plt.show()

# Plot accuracy
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.title('Training accuracy vs. Epochs')
plt.legend()
plt.show()

y_pred = model.predict(x_test)
y_pred = np.argmax(y_pred, axis=1)    # convert probabilities to class indices (assumed; truncated in the original listing)
y_pred
y_test_labels = df_test.label
cm = confusion_matrix(y_test_labels, y_pred)
fig = sns.heatmap(cm, annot=True, fmt="d")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.show()

import pickle
pickle.dump(tfidf, open('vectorized.pkl', 'wb'))
model.save('model.h5')
6.2 OUTPUT SCREENS
To start the project, run the 'streamlit_file.ipynb' file on a Google Colab server or in Jupyter, connect to the kernel, and click Run.
Once Google Colab has started, open the browser at https://nine-carrots-switch.loca.lt, enter '34.68.93.235' when prompted, and press Enter to reach the home page shown below.
Predicted Result 1:
Predicted Result 2:
Predicted Result 3:
CHAPTER 7
CONCLUSION AND FUTURE ENHANCEMENTS
The difficulty of automatically recognizing hate speech in social media posts is addressed in this study. This research uses a hate speech dataset that was manually labeled and collected from a white supremacist online community. We discovered that the analysis revealed significant hate preconceptions, as well as varying levels of ethnic and religious stereotyping. Our findings have shown that the selection of word embeddings, the selected parameters, and the optimizer have a high impact on the output achieved.
Hate speech in the social media space, which can have negative impacts on society, was detected easily, and the high accuracy of the model will bring many benefits while reducing the damage. By assessing and comparing the performance of the various hate detection models, we found that word embeddings combined with a 1D-CNN are an important tool for hate speech detection. The 1D-CNN, a deep learning model, achieved the highest weighted macro-F1 score of 0.66 with an accuracy of 0.90. The results of the confusion matrix graphs demonstrated that the GloVe embedding features were unable to correctly classify some of the instances.
Problems with Word2Vec, GloVe, BERT, and TF-IDF are that they are pre-trained and cannot handle the out-of-vocabulary issue; BERT is computationally very expensive; FastText is also pre-trained and based on n-grams; and TF-IDF is a bag-of-words model that depends primarily on the dataset and word frequencies, making it better suited to information retrieval than sentiment analysis.
The CNN is more suitable than an RNN/LSTM for this problem, because each sample has many tweets but no ordering is needed between them and the local context needs to be captured. Also, the CNN is fast. Future work includes adding more features that consider emojis and hashtags, combining predictions from models built using traditional machine learning and deep learning, and building a robust model using ensemble learning methods.
CHAPTER 8
REFERENCES
[1] Davidson T, Warmsley D, Macy MW, Weber I. Automated Hate Speech Detection and the Problem of Offensive Language. In: ICWSM; 2017.
[2] Zimmerman S, Kruschwitz U, Fox C. Improving Hate Speech Detection with Deep
Learning Ensembles. In: LREC; 2018.
[3] Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs]; 2018.
[4] Hagen M, Potthast M, Büchner M, Stein B. Webis: An Ensemble for Twitter Sentiment Detection. In: SemEval@NAACL-HLT; 2015.
[5] Joulin A, Grave E, Bojanowski P, Mikolov T. Bag of Tricks for Efficient Text Classification.
In: Proceedings of the 15th Conference of the European Chapter of the Association for
Computational Linguistics: Volume 2, Short Papers. ACL; 2017. p. 427–431.
[6] Zhang Z, Robinson D, Tepper J. Detecting hate speech on Twitter using a convolution-GRU based deep neural network. In: European Semantic Web Conference. Springer; 2018. p. 745–760.
[7] MacAvaney, Sean & Yao, Hao-Ren & Yang, Eugene & Russell, Katina & Goharian, Nazli &
Frieder, Ophir. (2019). Hate speech detection: Challenges and solutions. PloS one. 14. e0221152.
10.1371/journal.pone.0221152.
[8] Neuman Y, Assaf D, Cohen Y, Last M, Argamon S, Howard N, et al. Metaphor Identification
in Large Texts Corpora. PLoS ONE. 2013; 8(4). https://doi.org/10.1371/journal.pone.0062343
[9] Hatebase;. Available from: https://hatebase.org/.
[10] P. Fortuna, S. Nunes, A survey on automatic detection of hate speech in text, ACM
Computing Surveys (CSUR) 51 (2018) 1–30.
[11] A. Schmidt, M. Wiegand, A survey on hate speech detection using natural language
processing, in: Proceedings of the Fifth International workshop on natural language processing for
social media, 2017, pp. 1–10.
[12] dennybritz.com/posts/wildml/understanding-convolutional-neural-networks-for-nlp/
[13] Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. Proceedings of
the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014),
1746–1751.
[14] Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A Convolutional Neural Network
for Modelling Sentences. Acl, 655–665.
[15] Santos, C. N. dos, & Gatti, M. (2014). Deep Convolutional Neural Networks for Sentiment
Analysis of Short Texts. In COLING-2014 (pp. 69–78).
[16] Johnson, R., & Zhang, T. (2015). Effective Use of Word Order for Text Categorization with
Convolutional Neural Networks. To Appear: NAACL-2015, (2011).
[17] Johnson, R., & Zhang, T. (2015). Semi-supervised Convolutional Neural Networks for Text
Categorization via Region Embedding.
[18] Wang, P., Xu, J., Xu, B., Liu, C., Zhang, H., Wang, F., & Hao, H. (2015). Semantic
Clustering and Convolutional Neural Network for Short Text Categorization. Proceedings ACL
2015, 352–357.
[19] Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification; Nguyen, T. H., & Grishman, R. (2015). Relation Extraction: Perspective from Convolutional Neural Networks. Workshop on Vector Modeling for NLP, 39–48.
[20] Sun, Y., Lin, L., Tang, D., Yang, N., Ji, Z., & Wang, X. (2015). Modeling Mention, Context and Entity with Neural Networks for Entity Disambiguation. IJCAI, 1333–1339.
[21] Zeng, D., Liu, K., Lai, S., Zhou, G., & Zhao, J. (2014). Relation Classification via
Convolutional Deep Neural Network. Coling, (2011), 2335–2344.
[22] Gao, J., Pantel, P., Gamon, M., He, X., & Deng, L. (2014). Modeling Interestingness with
Deep Neural Networks.
[23] Shen, Y., He, X., Gao, J., Deng, L., & Mesnil, G. (2014). A Latent Semantic Model with
Convolutional-Pooling Structure for Information Retrieval. Proceedings of the 23rd ACM
International Conference on Conference on Information and Knowledge Management – CIKM
’14, 101–110.
[24] Weston, J., & Adams, K. (2014). #TagSpace: Semantic Embeddings from Hashtags, 1822–1827.