Cyberbullying Detection Based On Semantic Enhanced Marginalised Denoising Autoencoder - Report

The document describes a project report on cyber bullying detection using a semantic-enhanced marginalized denoising autoencoder. The report was submitted by two students, Aishwarya Iyer and Anchana R, in partial fulfillment of their Bachelor of Engineering degree in computer science and engineering. The project involved developing a new text representation learning method called semantic-enhanced marginalized denoising autoencoder (SMDA) to automatically detect cyberbullying messages on social media by exploiting the hidden feature structure of bullying information and learning a robust representation of text messages.


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

A PROJECT REPORT

ON

“CYBER BULLYING DETECTION BASED ON SEMANTIC-ENHANCED
MARGINALIZED DENOISING AUTOENCODER”

Submitted in partial fulfillment for the award of the degree of

BACHELOR OF ENGINEERING

IN

COMPUTER SCIENCE AND ENGINEERING

BY

AISHWARYA IYER
1NH14CS007

ANCHANA R
1NH14CS704

Under the guidance of

Mr. Manjunatha Swamy


Senior Assistant Professor, Dept. of CSE

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


(Autonomous Institution Affiliated to VTU & Approved by AICTE)
Accredited by NAAC ‘A’, Accredited by NBA
Outer Ring Road, Panathur Post, Kadubisanahalli,
Bangalore – 560103
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE

It is hereby certified that the project work entitled “CYBER BULLYING DETECTION BASED
ON SEMANTIC-ENHANCED MARGINALISED DENOISING AUTOENCODER” is a bonafide
work carried out by AISHWARYA IYER (1NH14CS007) and ANCHANA.R (1NH14CS704) in
partial fulfilment for the award of Bachelor of Engineering in COMPUTER SCIENCE AND
ENGINEERING of the New Horizon College of Engineering during the year 2019-2020. It is
certified that all corrections/suggestions indicated for Internal Assessment have been
incorporated in the Report deposited in the departmental library. The project report has
been approved as it satisfies the academic requirements in respect of project work
prescribed for the said Degree.

………………………… ……………………….. ………………………………


Signature of Guide Signature of HOD Signature of Principal
(Mr. Manjunath Swamy) (Dr. B. Rajalakshmi) (Dr. Manjunatha)

External Viva

Name of Examiner Signature with date

1. ………………………………………….. ………………………………….

2. …………………………………………… …………………………………..
ABSTRACT

As a side effect of increasingly popular social media, cyberbullying has emerged as
a serious problem affecting children, adolescents and young adults. Machine learning
techniques make automatic detection of cyberbullying messages in social media
possible, and this could help to construct a healthy and safe social media environment.
In this meaningful research area, one critical issue is robust and discriminative numerical
representation learning of text messages. In this paper, we propose a new
representation learning method to tackle this problem. Our method, named semantic-enhanced
marginalized denoising autoencoder (smSDA), is developed via semantic
extension of the popular deep learning model stacked denoising autoencoder (SDA). The
semantic extension consists of semantic dropout noise and sparsity constraints, where
the semantic dropout noise is designed based on domain knowledge and the word
embedding technique. Our proposed method is able to exploit the hidden feature
structure of bullying information and learn a robust and discriminative representation of
text. Comprehensive experiments on two public cyberbullying corpora (Twitter and Myspace)
are conducted, and the results show that our proposed approaches outperform other
baseline text representation learning methods.

Keywords:
Cyberbullying, Autoencoder, Linear SVM, Bag of Words, Word2Vec, Dataset, Myspace,
Preprocessing.

ACKNOWLEDGEMENT

The satisfaction and euphoria that accompany the successful completion of any task
would be impossible without the mention of the people who made it possible, whose
constant guidance and encouragement crowned our efforts with success.

I have great pleasure in expressing my deep sense of gratitude to Dr. Mohan
Manghnani, Chairman of New Horizon Educational Institutions, for providing the necessary
infrastructure and creating a good environment.

I take this opportunity to express my profound gratitude to Dr. Manjunatha, Principal,
NHCE, for his constant support and encouragement.

I am grateful to Dr. Prashanth C.S.R, Dean Academics, for his unfailing encouragement
and the suggestions given to me in the course of my project work.

I would also like to thank Dr. B. Rajalakshmi, Professor and Head, Department of
Computer Science and Engineering, for her constant support.

I express my gratitude to Mr. Manjunatha Swamy, Senior Assistant Professor, my project
guide, for constantly monitoring the development of the project and setting up precise
deadlines. His valuable suggestions were the motivating factors in completing the work.

Finally, a note of thanks to the teaching and non-teaching staff of Dept of Computer
Science and Engineering, for their cooperation extended to me, and my friends, who
helped me directly or indirectly in the course of the project work.

AISHWARYA IYER(1NH14CS007)
ANCHANA.R(1NH14CS704)

CONTENTS

ABSTRACT
ACKNOWLEDGEMENT
LIST OF FIGURES
LIST OF TABLES

1. INTRODUCTION
1.1. GENERAL INTRODUCTION
1.2. PROBLEM DEFINITION
1.3. PROJECT PURPOSE
1.4. PROJECT FEATURES
1.5. MODULE DESCRIPTION
1.5.1. DATASET
1.5.2. PREPROCESSING
1.5.3. WORD2VEC
1.5.4. AUTOENCODER PRETRAINING
1.5.5. DENOISING AUTOENCODER
1.5.6. LINEAR SVM

2. LITERATURE SURVEY
2.1. RELATED WORK
2.2. TEXT REPRESENTATION LEARNING
2.3. DATA SEARCH AND SELECTION
2.4. DATA ABSTRACTION
2.5. DIMENSIONS OF CHARACTERIZATION
2.6. EXISTING SYSTEM
2.7. PROPOSED SYSTEM
2.8. SOFTWARE DESCRIPTION
2.8.1. PYTHON
2.8.2. GOOGLE COLAB
2.8.3. OPENCV
2.8.4. GUI
2.9. SYSTEM STUDY

3. REQUIREMENT ANALYSIS
3.1. FUNCTIONAL REQUIREMENTS
3.2. NON-FUNCTIONAL REQUIREMENTS
3.2.1. ACCESSIBILITY
3.2.2. MAINTAINABILITY
3.2.3. SCALABILITY
3.2.4. PORTABILITY
3.2.5. RELIABILITY
3.3. HARDWARE REQUIREMENTS
3.4. SOFTWARE REQUIREMENTS
3.5. SYSTEM REQUIREMENT SPECIFICATION DOCUMENT

4. DESIGN
4.1. WORKFLOW
4.1.1. FILTERING
4.2. ARCHITECTURE DIAGRAM

5. IMPLEMENTATION
5.1. PREPROCESSING
5.2. AUTOENCODER
5.3. GUI

6. TESTING
7. SNAPSHOTS
8. CONCLUSION AND FUTURE ENHANCEMENT
8.1. CONCLUSION
8.2. FUTURE ENHANCEMENT

REFERENCES

LIST OF FIGURES

Fig. No    Figure Description

1.5.       Module Description
4.1.       Workflow
4.1.1.     Text Filtering
4.2.       Architecture Diagram
7.1.       GUI
7.2.       Loading of XML1
7.3.       Loading of XML2
7.4.       Loading of XML3
7.5.       Loading of XML4
7.6.       Loading of XML5
7.7.       Loading of XML6
7.8.       Processed Body1
7.9.       Processed Body2
7.10.      Vector Representation1
7.11.      Vector Representation2
7.12.      Classification1
7.13.      Classification2
7.14.      Classification3

LIST OF TABLES

Table No   Table Description

2.5.       Studies, Tasks Performed and Approach Categories
6.1.       Test Case for File Operation
6.2.       Test Case for XML File
6.3.       Test Case for Pre-processing
6.4.       Test Case for Vector Representation
6.5.       Test Case for Classification (Bullying)
6.6.       Test Case for Classification (Non-Bullying)

Cyberbullying Detection based on Semantic-Enhanced
Marginalized Denoising Autoencoder

CHAPTER 1

INTRODUCTION

1.1 GENERAL INTRODUCTION


Cyberbullying detection is a supervised learning problem: the system is trained on
messages that humans have labelled as bullying or non-bullying, and the trained system
is then used to detect cyberbullying automatically. The first step in cyberbullying
detection is to learn a reliable numeric representation for the text messages. This is
achieved by training a word2vec model and using cosine similarity to extend a small set
of insulting seed words to a larger set. The insulting seeds are words that can be
regarded as bullying words, together with their variants (different spellings,
repetitions, etc.). The bullying features constructed in this way are then given to a
stacked denoising autoencoder. The output of the autoencoder provides robust and
discriminative features, which are used with a support vector machine (SVM)
classifier. The classifier performs binary classification and decides whether a given
message is bullying or not. The dataset used in this project is openly available and is
taken from social networking websites such as Myspace or Twitter.
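
The seed-extension step described above can be sketched as follows. This is only an illustration under stated assumptions: the three-dimensional vectors in `TOY_EMBEDDINGS`, the function name `expand_seeds` and the threshold of 0.9 are all invented here, whereas in the project the vectors would come from the trained word2vec model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def expand_seeds(seeds, embeddings, threshold=0.8):
    """Return the seeds plus every vocabulary word close enough to some seed."""
    expanded = set(seeds)
    for word, vec in embeddings.items():
        for seed in seeds:
            if seed in embeddings and cosine(vec, embeddings[seed]) >= threshold:
                expanded.add(word)
                break
    return expanded

# Hypothetical 3-dimensional embeddings, invented for illustration only.
TOY_EMBEDDINGS = {
    "idiot":  [0.90, 0.10, 0.00],
    "idiott": [0.88, 0.12, 0.02],   # misspelling that lies close to the seed
    "stupid": [0.80, 0.30, 0.10],
    "hello":  [0.00, 0.20, 0.95],   # unrelated word
}

bullying_words = expand_seeds({"idiot"}, TOY_EMBEDDINGS, threshold=0.9)
```

With these toy vectors the misspelling and the near-synonym are added to the bullying set, while the unrelated word is left out.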

Dept of CSE, NHCE 1


1.2 PROBLEM DEFINITION


Cyberbullying detection is a supervised learning problem that uses a cyberbullying
corpus labelled by humans to train a classifier, which should then recognize a bullying
message automatically.

1.3 PROJECT PURPOSE


The purpose of the project is to provide a representation learning model specialized for
cyberbullying detection, one that yields a robust and discriminative representation of
messages for the detection task.

1.4 PROJECT FEATURES


Myspace is a social networking website similar to Twitter. A series of messages
captured from it is stored as a dataset. The dataset is read, cleaned (to remove noise
such as special characters and stop words like “this”, “to” and “was”), and
preprocessed (to create a numerical representation for each word) using a word2vec
model that we train.
Deep learning algorithms are neural networks that help in learning complex
underlying representations. A neural network is trained on the preprocessed data
to learn the features of bullying and non-bullying messages.
Classification is performed using a classifier algorithm such as a Support Vector
Machine (SVM). A linear SVM is a classifier capable of separating two classes
of data (bullying/non-bullying) based on the given features.
In Python, the OpenCV library is available for image processing. OpenCV is a library of
programming functions mainly aimed at real-time computer vision.

1.5 MODULE DESCRIPTION


This project comprises six modules, which together provide its major
functionality.

Fig 1.5 Module Description

1.5.1 DATASET
Myspace is a social networking website similar to Twitter. A series of messages
captured from it is stored as a dataset. The dataset is available online
and can be accessed by anyone.

1.5.2 PREPROCESSING
This is the second module in the implementation of our project. In this step we remove
special characters, stop words, meaningless words etc. from our acquired dataset so
that processing will be easier.
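
A minimal sketch of this cleaning step is shown below. The stop-word list is a tiny illustrative set and the function name `preprocess` is hypothetical; the project's actual cleaning rules may differ.

```python
import re

# A deliberately tiny stop-word list; a real one would be far larger.
STOP_WORDS = {"this", "to", "was", "the", "a", "an", "is", "and", "of"}

def preprocess(message):
    """Lower-case a message, strip special characters, and drop stop words."""
    text = message.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # keep letters and whitespace only
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = preprocess("This was SO rude!!! Go back to school, loser :(")
```

The resulting token list is what the later modules (word2vec, autoencoder) would consume.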

1.5.3 WORD2VEC
The key idea of word2vec is to achieve better performance not by using a more complex
model (i.e., with more layers), but by allowing a simpler (shallower) model to be trained
on much larger amounts of data.

There are two algorithms for learning word vectors:

• CBOW: predict the target word from its context.
• Skip-gram: predict the context words from the target word.
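
The difference between the two algorithms is easiest to see in the training examples each one builds from a sentence. The sketch below only generates those examples; it does not train a model, and the function names and window size are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) pairs: the target word predicts each neighbour."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """(context, target) pairs: the surrounding words predict the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            pairs.append((context, target))
    return pairs

sentence = ["you", "are", "a", "loser"]
sg = skipgram_pairs(sentence, window=1)
cb = cbow_pairs(sentence, window=1)
```

Skip-gram produces one example per (word, neighbour) combination, whereas CBOW produces one example per word with all its neighbours grouped together.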

1.5.4 AUTOENCODER PRETRAINING

An autoencoder is a neural network that is trained to attempt to copy its input to its
output. Internally, it has a hidden layer h that describes a code used to represent the
input. The network consists of two parts:

• Encoder: h = f(x)
• Decoder: r = g(h)
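
The encoder and decoder can be written out as a small forward pass. This is a sketch under assumptions: the sigmoid activation, the tied weights (the decoder reuses the encoder matrix) and the weight values are all chosen here for illustration and are untrained.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def encode(x, W, b):
    """Encoder h = f(x) = sigmoid(W x + b)."""
    return [sigmoid(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

def decode(h, W, c):
    """Decoder r = g(h) = sigmoid(W^T h + c), reusing W (tied weights)."""
    n_inputs = len(W[0])
    return [sigmoid(sum(W[i][j] * h[i] for i in range(len(h))) + c[j])
            for j in range(n_inputs)]

def reconstruction_error(x, r):
    """Squared error between input and reconstruction; training minimises it."""
    return sum((x_j - r_j) ** 2 for x_j, r_j in zip(x, r))

# Hypothetical, untrained weights: 4 inputs compressed to a 2-unit code.
W = [[0.5, -0.5, 0.5, -0.5],
     [-0.5, 0.5, -0.5, 0.5]]
b = [0.0, 0.0]
c = [0.0, 0.0, 0.0, 0.0]

x = [1.0, 0.0, 1.0, 0.0]
h = encode(x, W, b)          # the hidden code
r = decode(h, W, c)          # the reconstruction of x
error = reconstruction_error(x, r)
```

Training would adjust `W`, `b` and `c` so that `error` shrinks across the dataset; only the forward computation is shown here.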

1.5.5 DENOISING AUTOENCODER

This is the fifth module, in which we use a denoising autoencoder to pretrain our deep
neural network.

Unsupervised initialization of layers with an explicit denoising criterion appears to help
capture interesting structure in the input distribution. This leads to intermediate
representations much better suited for subsequent learning tasks such as supervised
classification.

The denoising criterion provides better empirical results and also helps the model
handle missing values.
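
The corruption step that gives the denoising autoencoder its name can be sketched as below. It assumes plain masking (dropout) noise with a single drop probability; the semantic dropout noise mentioned in the abstract would choose which features to drop using domain knowledge, which this sketch does not attempt.

```python
import random

def corrupt(x, drop_prob, rng):
    """Masking (dropout) noise: zero each feature with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else value for value in x]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
bow_vector = [1.0, 0.0, 2.0, 1.0, 0.0, 3.0]
noisy = corrupt(bow_vector, drop_prob=0.5, rng=rng)
```

The autoencoder is then trained to reconstruct the clean `bow_vector` from `noisy`, which is why the learned representation tolerates missing words.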

1.5.6 LINEAR SVM


A linear support vector machine (linear SVM) is a fast machine learning algorithm for
classification on very large datasets. It learns a linear decision boundary (a
hyperplane) that separates the classes with the largest possible margin. In this project
it performs the final binary classification of messages as bullying or non-bullying,
using the features produced by the autoencoder.
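
As a rough illustration of how a linear SVM is fitted, the sketch below runs sub-gradient descent on the L2-regularised hinge loss over a toy two-dimensional dataset. The data, learning rate, regularisation strength and function names are assumptions made for this example; it is not the solver used in the project.

```python
def train_linear_svm(samples, labels, lr=0.1, lam=0.01, epochs=50):
    """Sub-gradient descent on the L2-regularised hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1.0 - lr * lam) for wi in w]      # regularisation shrink
            if margin < 1.0:                             # inside margin or misclassified
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """+1 stands for 'bullying', -1 for 'non-bullying' in this sketch."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Toy, linearly separable feature vectors (invented for illustration).
X = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
     [-2.0, -2.0], [-3.0, -1.0], [-1.0, -3.0]]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```

In the project the inputs would be the autoencoder's output features rather than hand-written points.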

CHAPTER 2
LITERATURE SURVEY
A literature survey is the most important step in the software development process.
Before developing the tool, it is necessary to determine the time factor, the economy
and the company's strength. Once these are satisfied, the next step is to determine
which operating system and language can be used for developing the tool. Once the
programmers start building the tool, they need a lot of external support, which can be
obtained from senior programmers, from books or from websites. These considerations
were taken into account before building the proposed system.

2.1 Related Work


This work aims to learn a robust and discriminative text representation for cyberbullying
detection. Text representation and automatic cyberbullying detection are both related
to our work. In the following, we briefly review the previous work in these two areas.

2.2 Text Representation Learning


In text mining, information retrieval and natural language processing, effective
numerical representation of linguistic units is a key issue. The Bag-of-words (BoW)
model is the most classical text representation and the cornerstone of some
state-of-the-art models, including Latent Semantic Analysis (LSA) and topic models.
The BoW model represents a document in a textual corpus using a vector of real numbers
indicating the occurrence of words in the document. Although the BoW model has proven
to be efficient and effective, the representation is often very sparse. To address this
problem, LSA applies Singular Value Decomposition (SVD) to the word-document matrix of
the BoW model to derive a low-rank approximation. Each new feature is a linear
combination of all original features, which alleviates the sparsity problem. Topic
models, including Probabilistic Latent Semantic Analysis and Latent Dirichlet
Allocation, have also been proposed. The basic idea behind topic models is that word
choice in a document is influenced probabilistically by the topic of the document.
Topic models try to define the generation process of each word that occurs in a
document.

Like the approaches mentioned above, our proposed approach takes the BoW
representation as input. However, our approach has some distinct merits. First, the
multiple layers and non-linearity of our model ensure a deep learning architecture for
text representation, which has been proven to be effective for learning high-level
features. Second, the applied dropout noise makes the learned representation more
robust. Third, specific to cyberbullying detection, our method employs semantic
information, including bullying words and a sparsity constraint imposed on the mapping
matrix in each layer, which in turn produces a more discriminative representation.
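
To make the BoW input concrete, the sketch below builds count vectors for three toy documents. The documents and function names are invented for illustration; note how most entries are zero even at this tiny scale, which is the sparsity problem described above.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for doc in documents for tok in doc})
    return {tok: idx for idx, tok in enumerate(vocab)}

def bow_vector(document, vocab):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(document)
    return [counts.get(tok, 0) for tok in vocab]

docs = [["you", "are", "a", "loser"],
        ["nice", "to", "meet", "you"],
        ["loser", "loser", "go", "away"]]

vocab = build_vocabulary(docs)     # a, are, away, go, loser, meet, nice, to, you
matrix = [bow_vector(doc, vocab) for doc in docs]
```

Each row of `matrix` is one document and each column one vocabulary word, which is the word-document layout that LSA and the autoencoder both start from.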

2.3 Data Search and Selection


An electronic literature search was conducted across Scopus, the ACM Digital Library,
and the IEEE Xplore digital library. The main search strategy was the discovery of
academic literature relevant to the theme “automated detection of electronic bullying,
anti-social behaviour and harassment” using the following query phrases without any
publication year filter applied: “cyber-bull* or cyberbull* detection”, “detecting
cyberbull* or cyberbull*”, “electronic or online bullying detection”, “detecting electronic
or online bullying, cyber-bull*” or “cyberbull* prevention tool”, “cyber-bull* or
cyberbull* prevention software”, “cyber-bull* or cyberbull* software”, “anti cyber-bull*
or anti cyberbull*” or “anti-cyberbull* or anti-cyber-bull*” or “anticyberbull* or
anticyberbull*”, “detecting electronic or online harassment”.

A citation trail was performed on the discovered papers using the papers’ references as
a starting point, and a total of 89 academic papers were discovered as a result of the
search. The papers were initially assessed for relevance via a review of their titles,
abstracts, and concluding arguments: 18 papers were not considered relevant to the
survey and so were removed. The full text of the remaining papers was reviewed and
papers whose primary focus did not include any of the 4 cyberbullying detection tasks
we identified in Section 1 were discounted. This led to the removal of a further 18
papers. These included papers that dealt with themes such as youth violence
involvement detection (Sigel and Harpin, 2013), story matching to identify distressed
teens (Dinakar et al., 2012b; Macbeth et al., 2013), and cyberbully prevention policies
(Al Mazari, 2013). To eliminate the effects of language on cyberbully detection when
comparing the reviewed studies, we excluded papers using non-English corpora; thus, a
further 7 papers were excluded: Ptaszynski et al. (2010a; b), Honjo et al. (2011),
Nitta et al. (2013), Li and Tagami (2014), Margono et al. (2014) and Van Hee et al.
(2015). The remaining 46 papers were included in the final list of papers examined by
this study.

2.4 Data Abstraction


For the included papers, we performed data abstraction using characteristics such as
detection tasks performed, data sources, the size and availability of the datasets,
detection techniques, annotation judgement, features extracted, external resources
used, and pre-processing steps. We used the total number of documents (i.e., messages,
posts, comments, etc.) as a measure of the data size as opposed to using other metrics
such as the number of users or threads in a dataset; thus, a sample containing 50
messages generated by 70 users was assigned 50 as the data size value.

2.5 Dimensions of Characterization


Tables 1 and 2 present a summary of key information abstracted from the reviewed
studies. Table 1 provides a quick overview of approach categories and detection tasks
for each of the 46 papers. Table 2 presents additional information about the studies,
such as features and techniques used, pre-processing steps performed, and any external
resources used (e.g., WordNet, Urban Dictionary, etc.). Table 3 presents details (where
available) of the datasets used by the papers. Finally, the best available results per
detection task for each corpus category are presented in Table 4.

Our survey revealed binary classification as the most common task performed in
cyberbullying detection. In this regard, bullying messages are considered members of a
“bullying” class and all other documents belong to the “other” or “non-bullying” class.
The key task then is the identification of documents that possess the core attributes of
the “bullying” class. Out of the 46 studies reviewed, 34 performed binary classification
either as the sole detection task or in combination with other tasks. This classification of
messages is often facilitated by sentiment analysis using emotive wordlists, supervised
learning, and lexicon-based systems.

Studies such as Yin et al. (2009), Dinakar et al. (2011), Xu et al. (2012a), and Rafiq et al.
(2015) performed sentiment analysis using supervised-learning techniques. Others such
as Burn-Thornton and Burman (2012), Kontostathis et al. (2013), Nahar et al. (2013;
2014), Munezero et al. (2014), Nandhini and Sheeba (2015a; b), and Zhao et al. (2016),
while also implementing binary classification, did not perform the message classification
via sentiment analysis. Interestingly, Xu et al. (2012a) is the only instance we found
whereby sentiment analysis is performed not for the purpose of binary classification but
to understand the emotions expressed in what they term “bully traces”, which are
tweets containing any of the words “bully”, “bullied” and “bullying” (i.e., tweets
containing bullying references or reportage – e.g., “I saw a girl got bullied at school
today #bullyingisnotcool”). Role identification is the next most performed task (11
papers), featuring heavily in studies such as Sanchez and Kumar (2011), Chen et al.
(2012), Dadvar et al. (2014), and Galán-García et al. (2014). Determining the severity of
cyberbullying by computing a score indicative of the bullying severity of messages
and/or sender is performed by studies such as Chen et al. (2012), Perez et al. (2012),
Dadvar et al. (2013a), Del Bosque and Garza (2014), and Potha and Maragoudakis
(2014). Dadvar et al. (2012b) and Squicciarini et al. (2015) were the only studies we
found that proposed the relatively novel task of detecting and classifying the events that
occur after a cyberbullying incident. While cyberbullying occurs across various forms of
electronic media – such as SMS (Short Messaging Service), MMS (Multimedia Messaging
Service), email, forums, chat rooms – and social media platforms like Facebook, Twitter,
YouTube and SnapChat, social media was the main source of data for many of the
studies reviewed. This can be attributed to the availability of social media data which is
often freely accessible in the public domain; emails, SMS, MMS and chat rooms are, in
contrast, very personal means of communication and, as such, communications via
these media are less likely to be publicly available.
Twitter and MySpace are the most common data sources. Twitter is used in many
studies including Sanchez and Kumar (2011), Xu et al. (2012a; b), Huang et al. (2014),
Galán-García et al. (2014), and Zhao et al. (2016). MySpace is used by Yin et al. (2009),
Parime and Suri (2014), Nandhini and Sheeba (2015a; b), and Squicciarini et al. (2015)
amongst others. YouTube is in second place with Dinakar et al. (2011), Chen et al.
(2012), Dadvar et al. (2013a; b; 2014) using corpora that included YouTube data.
Burn-Thornton and Burman (2012) is the only paper in our sample that uses an email
corpus. 14 papers publicly shared their datasets: 9 of these make use of the Barcelona
Media dataset (a publicly available dataset of social media data) and the remaining 5
papers sourced the corpus themselves. With supervised-learning methods proving
popular amongst the reviewed studies (34 papers), the means by which judgements on
annotated data were arrived at is of interest. Traditional means of labelling data using
annotators or by the researchers themselves still proved to be popular, with 25 studies
employing annotators, experts, or researchers to label data. Crowd-sourcing annotators
is also gaining traction within the cyberbullying research community, with studies such
as Sanchez and Kumar (2011), Kontostathis et al. (2013) and Hosseinmardi et al. (2015)
using crowdsourcing services like Amazon Mechanical Turk (MTurk) and CrowdFlower to
label data. Given the ease, relative low cost, and huge time savings of crowdsourcing, we
expected to find higher utilisation of crowdsourcing services amongst the studies but
perhaps researchers’ need to ensure high-quality annotated data currently presents a
barrier that crowdsourcing services will need to overcome in order to become more
widely used. Interestingly only 3 papers (Dinakar et al., 2011; 2012a; Rafiq et al., 2015)
employed experts to annotate data. This is surprising since a natural assumption would
be that the use of experts for annotation likely presents the best chance of achieving
quality, labelled data.
A possible reason for this low utilization of experts for labelling data could be the
subjective nature of bullying, a consequence of which could be that researchers’ and
experts’ views on cyberbullying may differ greatly. Thus, researchers adopting a specific
definition of cyberbullying would naturally want the annotators to be guided by this
definition.

2.6 EXISTING SYSTEM


Many cyberbullying detection algorithms are available, most of which are supervised
algorithms. These algorithms generally use a bag-of-words representation to
distinguish between bullying and non-bullying messages.

Disadvantages:
They cannot correlate similar words. For example, an algorithm that uses bag of words
will treat “cat” and “kitten” as two unrelated entities rather than related ones,
unless the relationship is specified explicitly.
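
The cat/kitten limitation can be checked numerically. In the sketch below the bag-of-words vectors follow directly from a three-word vocabulary, while the two-dimensional dense vectors are hypothetical values standing in for learned word embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

# Bag-of-words view: one dimension per vocabulary word (cat, kitten, car).
bow = {"cat": [1, 0, 0], "kitten": [0, 1, 0], "car": [0, 0, 1]}

# Hypothetical dense embeddings, invented for illustration only.
dense = {"cat": [0.9, 0.8], "kitten": [0.85, 0.9], "car": [-0.7, 0.2]}

bow_sim = cosine(bow["cat"], bow["kitten"])          # 0: unrelated under BoW
dense_sim = cosine(dense["cat"], dense["kitten"])    # close to 1: related
dense_other = cosine(dense["cat"], dense["car"])     # negative: unrelated
```

Under bag of words every pair of distinct words is equally dissimilar, so no relationship between “cat” and “kitten” can be recovered from the vectors alone.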

2.7 PROPOSED SYSTEM


The proposed system makes cyberbullying detection more dynamic: instead of the
bag-of-words approach, it uses a word2vec model, which maps words into a
multi-dimensional vector space and can therefore recognise related entities without
being explicitly programmed to do so. This is expected to improve detection
performance over existing bag-of-words models.

2.8 SOFTWARE DESCRIPTION


2.8.1 PYTHON

Python is a general-purpose, interpreted, interactive, object-oriented, high-level
programming language. Its design philosophy emphasizes code readability (notably
using whitespace indentation to delimit code blocks rather than curly brackets or
keywords), and its syntax allows programmers to express concepts in fewer lines of
code than might be used in languages such as C++ or Java. It provides constructs that
enable clear programming on both small and large scales. Python interpreters are
available for many operating systems. CPython, the reference implementation of
Python, is open source software and has a community-based development model, as do
nearly all of its variant implementations.

Python is managed by the non-profit Python Software Foundation. Python features a
dynamic type system and automatic memory management. It supports multiple
programming paradigms, including object-oriented, imperative, functional and
procedural, and has a large and comprehensive standard library.

2.8.2 GOOGLE COLAB

Google Colab is a free cloud service that supports free GPU use. It can be used to
improve Python coding skills and to develop deep learning applications using popular
libraries such as Keras, TensorFlow, PyTorch, and OpenCV.
It is an online, browser-based platform that allows us to train our models on remote
machines for free. Colab provides 25GB RAM, so even large datasets can be loaded
entirely into memory. The speed-up was found to be around 2.5x with the same
data generation steps.

2.8.3 OPENCV
OpenCV (Open source computer vision) is a library of programming functions mainly
aimed at real-time computer vision. OpenCV is written in C++ and its primary interface is
in C++, but it still retains a less comprehensive though extensive older C interface. There
are bindings in Python, Java and MATLAB/OCTAVE. The API for these interfaces can be
found in the online documentation. Wrappers in other languages such as C#, Perl, Ch,
Haskell and Ruby have been developed to encourage adoption by a wider audience.
All of the new developments and algorithms in OpenCV are now developed in the C++
interface. OpenCV runs on the following desktop operating systems: Windows, Linux,
macOS, FreeBSD, NetBSD, OpenBSD. OpenCV runs on the following mobile operating
systems: Android, iOS, Maemo, BlackBerry 10. The user can get official releases from
SourceForge or take the latest sources from GitHub. OpenCV uses CMake.
OpenCV (Open Source Computer Vision Library) is released under a BSD license and
hence it’s free for both academic and commercial use. It has C++, Python and Java
interfaces and supports Windows, Linux, Mac OS, iOS and Android. OpenCV was
designed for computational efficiency and with a strong focus on real-time applications.
Written in optimized C/C++, the library can take advantage of multi-core processing.
Enabled with OpenCL, it can take advantage of the hardware acceleration of the
underlying heterogeneous compute platform.

2.8.4 GRAPHICAL USER INTERFACE (GUI)


The tkinter package (“Tk interface”) is the standard Python interface to the Tk GUI
toolkit. Both Tk and tkinter are available on most Unix platforms, as well as on Windows
systems. (Tk itself is not part of Python; it is maintained at ActiveState.)
Running python -m tkinter from the command line should open a window
demonstrating a simple Tk interface.

2.9 SYSTEM STUDY


FEASIBILITY STUDY
The feasibility of the project is analyzed in this phase, and a business proposal is
put forth with a very general plan for the project and some cost estimates. During
system analysis, the feasibility study of the proposed system is carried out to
ensure that the proposed system is not a burden to the company. For feasibility
analysis, some understanding of the major requirements for the system is essential.
Three key considerations are involved in the feasibility analysis:
• ECONOMICAL FEASIBILITY
• TECHNICAL FEASIBILITY
• SOCIAL FEASIBILITY

ECONOMICAL FEASIBILITY
This study is carried out to check the economic impact that the system will have on
the organization. The amount of funds that the company can pour into the research
and development of the system is limited, so the expenditures must be justified. The
developed system was well within the budget, because most of the technologies used
are freely available. Only the customized products had to be purchased.

TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical
requirements of the system. Any system developed must not place a high demand on
the available technical resources, as this would in turn place high demands on the
client. The developed system must have modest requirements, so that only minimal or
no changes are required to implement it.


SOCIAL FEASIBILITY
This aspect of the study checks the level of acceptance of the system by the user.
This includes the process of training the user to use the system efficiently. The user
must not feel threatened by the system, but must instead accept it as a necessity. The
level of acceptance by users depends solely on the methods employed
to educate them about the system and to make them familiar with it. Their level of
confidence must be raised so that they are also able to offer constructive
criticism, which is welcomed, as they are the final users of the system.


CHAPTER 3
REQUIREMENT ANALYSIS

3.1 FUNCTIONAL REQUIREMENTS


The functional requirements of the cyberbullying detection system are:
1. Making Posts
2. Commenting on Posts
3. Reporting and Blocking
Making Posts: The user can make posts on this social network. A post can consist of
words or images. Posts are visible to all other registered users who follow this user.
Commenting on Posts: Users can comment on the posts of other users whom they
follow.
Reporting and Blocking: Users can report other users if they find
them annoying or disturbing. They can also block other users whom they find
problematic.

3.2 NON-FUNCTIONAL REQUIREMENTS


Non-functional requirements describe how a system must behave and establish
constraints on its functionality. This type of requirement is also known as the
system’s quality attributes. Attributes such as performance, security, usability, and
compatibility are not features of the system; they are required characteristics. They
are emergent properties that arise from the whole arrangement, and hence we
cannot write a particular line of code to implement them.
Any attributes required by the customer are described by the specification. We must
include only those requirements that are appropriate for our project.


Some Non-Functional Requirements are as follows:


• Reliability
• Maintainability
• Performance
• Portability
• Scalability
• Flexibility

3.2.1 ACCESSIBILITY:

Accessibility is a general term used to describe the degree to which a product, device,
service, or environment is accessible by as many people as possible.

3.2.2 MAINTAINABILITY:

In software engineering, maintainability is the ease with which a software product can
be modified in order to:

• Correct defects

• Meet new requirements

New functionalities can be added in the project based on the user requirements.

Since the programming is very simple, it is easier to find and correct the defects and to
make the changes in the project.

3.2.3 SCALABILITY:

The system is capable of handling increased total throughput under an increased load
when resources (typically hardware) are added.

The system can work normally under conditions such as low bandwidth and a large
number of users.

3.2.4 PORTABILITY:

Portability is one of the key concepts of high-level programming. Portability is the
ability of a software code base to reuse existing code instead of creating new
code when moving software from one environment to another.

The project can be executed under different operating conditions provided the
environment meets its minimum configuration. Only system files and dependent
assemblies would have to be configured in such a case.

3.2.5 RELIABILITY:

Software reliability is the probability of failure-free software operation for a specified
period of time in a specified environment. Software reliability is also an important
factor affecting system reliability.


3.3 HARDWARE REQUIREMENTS

• System : Pentium IV 2.4 GHz.

• Hard Disk : 40 GB.

• Floppy Drive : 1.44 MB.

• Monitor : 14" Colour Monitor.

• Mouse : Optical Mouse.

• RAM : 4 GB.

3.4 SOFTWARE REQUIREMENTS

• Operating system : Windows XP, 7, 8, 8.1, 10

• Coding language : Python

• Front end : Python

• IDE : Jupyter

• Libraries : sklearn, word2vec, pandas and TensorFlow


3.5 SYSTEM REQUIREMENT SPECIFICATION DOCUMENT


The Software Requirement Specification (SRS) is the starting point of the software
development activity. As systems grew more complex, it became evident that the goals
of the entire system could not be easily comprehended; hence the need for the
requirement phase arose. A software project is initiated by the client's needs. The SRS
is the means of translating the ideas in the minds of the clients (the input) into a
formal document (the output of the requirement phase).

The SRS phase consists of two basic activities:

1. Problem/requirement analysis:

This activity is the harder and more nebulous of the two. It deals with understanding
the problem, the goal and the constraints.

2. Requirement Specification:

Here, the focus is on specifying what was found during analysis. Issues such as
representation, specification languages and tools, and checking the specifications are
addressed during this activity. The requirement phase terminates with the production
of the validated SRS document. Producing the SRS document is the basic goal of this
phase.

Role of SRS:

The purpose of the Software Requirement Specification is to reduce the communication
gap between the clients and the developers. The Software Requirement Specification is
the medium through which the client and user needs are accurately specified. It forms
the basis of software development. A good SRS should satisfy all the parties involved
in the system.


CHAPTER 4
DESIGN
4.1 WORKFLOW

Fig 4.1: Workflow

4.1.1 Filtering
All content in this social network is filtered before it reaches the
user. Filtering is governed by a set of rules and is of two types: image filtering and
text filtering. For text filtering, a collection of words called a bag of words is
constructed, and the words included in it are filtered.

These words are filtered directly and are also extracted from their latent structure.
Based upon the percentage of the bullying contain that word the behavior of the user is

Fig 4.1.1: Text filtering
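
The text-filtering idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's filter: the word list BULLYING_WORDS, the function names bullying_ratio and should_filter, and the 20% threshold are all hypothetical placeholders, since the report does not list the actual bag of words or cut-off.

```python
import re

# Hypothetical bag of words; the report does not list the actual entries.
BULLYING_WORDS = {"idiot", "loser", "stupid"}


def bullying_ratio(message, bag=BULLYING_WORDS):
    """Return the fraction of tokens in `message` that appear in `bag`."""
    tokens = re.findall(r"[a-z']+", message.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in bag)
    return hits / len(tokens)


def should_filter(message, threshold=0.2):
    """Flag a message when its share of listed words reaches `threshold` (an assumed cut-off)."""
    return bullying_ratio(message) >= threshold


print(bullying_ratio("You are a stupid loser"))  # 2 of 5 tokens are in the bag
print(should_filter("Have a nice day"))
```

A plain word-list ratio like this only covers the direct filtering; the latent-structure part is what the autoencoder in Chapter 5 is meant to address.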

4.2 ARCHITECTURE DIAGRAM

Fig 4.2 Architecture Diagram


CHAPTER 5
IMPLEMENTATION
5.1 PREPROCESSING
• Extract labels: Posts related to a specific topic are grouped into series of
ten or fewer and stored in a file.

import zipfile

# List the consensus spreadsheets contained in the label archive.
z = zipfile.ZipFile('./datasets/Myspace/Human Concensus.zip', 'r')
a = z.namelist()
for i in a:
    print(i)

Output (excerpt):
Human Concensus/Packet1Consensus.xlsx
Human Concensus/Packet2Consensus.xlsx

• Extraction of information from xlsx files and formation of data
frame: Each packet contains around 200 files and we have 10 packets. Each file
is tested for cyberbullying and marked accordingly.

import pandas as pd

def _format(x):
    # File names such as 4046827.0001 are read as floats; restore four decimals.
    if(type(x) == float):
        return '{:.4f}'.format(x)
    else:
        return x

df = None
for file in a:
    data = z.open(file)
    df1 = pd.read_excel(data)
    df1.columns = ['file', 'label', 'C', 'D']
    df1['file'] = df1['file'].apply(_format)
    print('df1', df1.shape)
    if df is None:
        df = df1[['file', 'label']]
    else:
        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
        df = pd.concat([df, df1[['file', 'label']]])
    print('df', df.shape)
    print(df)

Output (excerpt):
df1 (180, 4)
df (180, 2)
           file                      label
0           nan                        NaN
1     File Name  Is Cyberbullying Present?
2       4046827                          N
3  4046827.0001                          Y

from bs4 import BeautifulSoup
import glob
import re

l_labels = []
l_content = []
l_files = []
l_text = []

# Matches file names of the form <digits>.<digits>, e.g. 4046827.0001
rx = re.compile(r'(\d+)(\.)(\d+)')

for archive_name in glob.glob('./datasets/Myspace/xml*.zip'):
    z = zipfile.ZipFile(archive_name, 'r')
    b = z.namelist()
    for i in b:
        res = rx.search(i)
        if res is not None:
            name = res.group()
            # keep only files that have a label in df, the label table built above
            if name in list(df['file']):
                _xml = z.open(i)
                d = BeautifulSoup(_xml, "lxml-xml")
                text = d.find_all("body")
                l_text.extend(text)

                l_content.append(text)
                l_files.append(name)

                # the last character of the matching label is 'Y' or 'N'
                lab = (df['label'][df['file'] == name.replace('.xml', '')]).to_string()[-1]
                if (lab == 'N'):
                    l_labels.append(0)
                else:
                    l_labels.append(1)

                print(text)
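
File names such as 4046827.0001 are numeric, so a spreadsheet reader may return them as floating-point values, which is presumably why the listing above formats them back to four decimal places before matching them against the XML file names. The sketch below illustrates that normalisation on its own; the name normalise_name is illustrative, the first sample value is taken from the output above, and the second is a made-up value chosen to show a trailing zero being restored.

```python
def normalise_name(value):
    """Format float cells to four decimals; leave every other cell type untouched."""
    if isinstance(value, float):
        return "{:.4f}".format(value)
    return value


print(normalise_name(4046827.0001))  # four decimals are kept as a string
print(normalise_name(4046827.001))   # a dropped trailing zero is restored
print(normalise_name("File Name"))   # header cells pass through unchanged
```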


• Function for clean-up of the Myspace data: The data are pre-processed here
by removing stop words, repeated words and special characters.

def t_r(text):
    # Remove the <body> tags, collapse non-word characters, and tokenise.
    text = re.sub('<body>', ' ', text)
    text = re.sub('</body>', ' ', text)
    text = re.sub(r'\W+', ' ', text)
    text = re.sub(r'(\W+)(\d+)(\W+)', ' ', text)
    text = text.split()
    return text

def t_s(text):
    # Same clean-up as t_r, but returns a single string instead of a token list.
    text = re.sub('<body>', ' ', text)
    text = re.sub('</body>', ' ', text)
    text = re.sub(r'\W+', ' ', text)
    text = re.sub(r'(\W+)(\d+)(\W+)', ' ', text)
    return text

# Build the vocabulary: the set of distinct tokens across all message bodies.
new = set()
for i in l_text:
    data1 = t_s(str(i))
    new.update(data1.split())
print(new)

def _len(item):
    # Keep tokens that are 3 to 14 characters long.
    if(15 > len(item) > 2):
        return True
    else:
        return False

_new = filter(_len, new)


l_new = list(_new)
len(l_new)

Output:
13645
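
The clean-up steps in this section (tag removal, stripping of special characters, tokenising, and the length filter) can be tried on a single message as follows. This is a condensed sketch rather than the project's code: clean_tokens is a hypothetical helper that folds the steps into one function, the lower-casing mirrors what the GUI code in Section 5.3 does, and the sample message is made up.

```python
import re


def clean_tokens(raw):
    """Strip <body> tags and punctuation, then keep lower-cased tokens of 3 to 14 characters."""
    text = re.sub(r"</?body>", " ", raw)   # drop the XML wrapper tags
    text = re.sub(r"\W+", " ", text)       # replace runs of non-word characters
    return [tok.lower() for tok in text.split() if 2 < len(tok) < 15]


print(clean_tokens("<body>Hey!! U r SO annoying... go away</body>"))
# -> ['hey', 'annoying', 'away']
```

Note that the length filter drops very short tokens such as "U" and "SO", so informal abbreviations are lost at this stage.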
The notebook was run with a Python 3 kernel (Python 3.6.5, notebook format 4.2).


5.2 AUTOENCODER
# Written against the TensorFlow 1.x API (tf.placeholder, tf.Session).
import tensorflow as tf
import numpy as np
import pickle
from sklearn.model_selection import train_test_split
import sys
import joblib
from sklearn.svm import SVC
import os

BATCH_SIZE = 15
GRID_ROWS = 8
GRID_COLS = 8


def masking_noise(data, v, sess):
    """
    Applies masking noise to data in X.
    In other words a fraction v of elements of X
    (chosen at random) is forced to zero.
    :param data: array_like, Input data
    :param sess: TensorFlow session
    :param v: fraction of elements to distort, float
    :return: transformed data
    """
    data_noise = data.copy()
    rand = tf.random_uniform(data.shape)
    # zero every element whose random draw falls below v
    data_noise[sess.run(tf.nn.relu(tf.sign(v - rand))).astype(bool)] = 0
    return data_noise
def salt_and_pepper_noise(X, v):
    """Apply salt and pepper noise to data in X.
    In other words, v elements of each sample of X
    (chosen at random) are set to the maximum or minimum value according to a
    fair coin flip.
    The min (max) value in X is taken as the minimum (maximum).
    :param X: array_like, Input data
    :param v: int, number of elements to distort per sample
    :return: transformed data
    """
    X_noise = X.copy()
    n_features = X.shape[1]
    mn = X.min()
    mx = X.max()
    for i, sample in enumerate(X):
        # pick v feature indices of this sample to corrupt
        mask = np.random.randint(0, n_features, v)
        for m in mask:
            if np.random.random() < 0.5:
                X_noise[i][m] = mn
            else:
                X_noise[i][m] = mx
    return X_noise
def corrupt_input(corr_type, data, v, sess):
    """ Corrupt a fraction 'v' of 'data' according to the
    noise method of this autoencoder.
    :return: corrupted data
    """
    if corr_type == 'masking':
        x_corrupted = masking_noise(data, v, sess)
    elif corr_type == 'salt_and_pepper':
        x_corrupted = salt_and_pepper_noise(data, v)
    elif corr_type == 'none':
        x_corrupted = data
    else:
        x_corrupted = None
    return x_corrupted
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)


def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)


def fc_layer(prev, input_size, output_size):
    # fully connected layer: prev @ W + b
    W = weight_variable([input_size, output_size])
    b = bias_variable([output_size])
    return tf.matmul(prev, W) + b


def autoencoder(x, x_corr):
    # encoder: 8285 -> 1000 -> 500 -> 100 (l3 is the latent code)
    l1 = tf.nn.tanh(fc_layer(x_corr, 8285, 1000))
    l2 = tf.nn.tanh(fc_layer(l1, 1000, 500))
    l3 = fc_layer(l2, 500, 100)
    # decoder: 100 -> 500 -> 1000 -> 8285
    l4 = tf.nn.tanh(fc_layer(l3, 100, 500))
    l5 = tf.nn.tanh(fc_layer(l4, 500, 1000))
    out = fc_layer(l5, 1000, 8285)
    # reconstruction error against the clean input x
    loss = tf.reduce_mean(tf.squared_difference(x, out))
    return loss, out, l3

def batch_gen(data, batch_size, current=[0]):
    # `current` is a mutable default used as a persistent batch counter
    data_len = data.shape[0]
    _current = current[0]
    # slice the _current-th window of batch_size rows
    batch = data[batch_size * _current:batch_size * (_current + 1)]
    current[0] += 1
    # wrap around once the next window would start past the end of the data
    if (batch_size * current[0] >= data_len):
        current[0] = 0
    return batch
def main():
    if not os.path.exists("train_rep.p"):
        # load the data
        labels_np = pickle.load(open("labels_np.pkl", "rb"))
        data_vec_np = pickle.load(open("data_vec_np.pkl", "rb"))
        # split the data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(
            data_vec_np, labels_np, test_size=0.2, stratify=labels_np)
        # initialize inputs
        x = tf.placeholder(tf.float32, shape=[None, 8285])
        x_corr = tf.placeholder(tf.float32, shape=[None, 8285])
        # build the model
        loss, output, latent = autoencoder(x, x_corr)
        # initialize optimizer
        train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)
        # run the training loop
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            for i in range(2000):
                batch = batch_gen(X_train, BATCH_SIZE)
                corr_type = 'masking'
                v = 0.1
                # batch_0 = batch[0].reshape(1,-1)
                batch_0 = batch
                x_corr_input = corrupt_input(corr_type, batch_0, v, sess)
                feed = {x: batch_0, x_corr: x_corr_input}
                if i % 500 == 0:
                    train_loss = sess.run([loss], feed_dict=feed)
                    print("Step: {}. Loss: {}".format(i, train_loss))
                train_step.run(feed_dict=feed)
            # encode both splits from the uncorrupted inputs
            train_rep = latent.eval(feed_dict={x: X_train, x_corr: X_train})
            test_rep = latent.eval(feed_dict={x: X_test, x_corr: X_test})
            print(test_rep.shape)
            print(train_rep.shape)
            joblib.dump(test_rep, "test_rep.p")
            joblib.dump(train_rep, "train_rep.p")
            joblib.dump(y_train, "train_lab.p")
            joblib.dump(y_test, "test_lab.p")
    else:
        _train_rep = joblib.load("train_rep.p")
        _test_rep = joblib.load("test_rep.p")
        train_lab = joblib.load("train_lab.p")
        test_lab = joblib.load("test_lab.p")

        from sklearn.decomposition import PCA

        # reduce the 100-dimensional latent codes to 20 components
        _pca = PCA(n_components=20)
        _pca.fit(_train_rep)
        train_rep = _pca.transform(_train_rep)
        test_rep = _pca.transform(_test_rep)
        # classify the reduced representations with an SVM
        clf = SVC()
        clf.fit(train_rep, train_lab)
        print("Training score:", clf.score(train_rep, train_lab))
        print("Test score:", clf.score(test_rep, test_lab))
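
The masking-noise corruption used to train the denoising autoencoder above can be illustrated without TensorFlow. The sketch below is a plain-Python stand-in, not the project's implementation: mask_rows and its seed parameter are hypothetical names, and each element is zeroed independently with probability v, which matches the comparison of a uniform random draw against v in masking_noise.

```python
import random


def mask_rows(rows, v, seed=None):
    """Return a copy of `rows` in which each element is set to 0.0 with probability `v`."""
    rng = random.Random(seed)
    return [[0.0 if rng.random() < v else value for value in row] for row in rows]


clean = [[1.0] * 10 for _ in range(1000)]
noisy = mask_rows(clean, v=0.1, seed=0)

# Roughly a fraction v of the 10,000 elements should now be zero.
zeroed = sum(value == 0.0 for row in noisy for value in row)
print(zeroed / 10000)
```

The autoencoder is then trained to reconstruct the clean rows from the noisy ones, which is why the loss in the listing compares the output with x rather than with x_corr.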

5.3 GRAPHICAL USER INTERFACE (GUI)


from tkinter import *
import tkinter
from PIL import Image
from PIL import ImageTk
from tkinter import filedialog
import cv2
import matplotlib.pyplot as plt
import numpy as np
import re
from bs4 import BeautifulSoup
from gensim.models import Word2Vec
import pickle
from sklearn.decomposition import PCA
from sklearn.svm import SVC
import joblib
import warnings
import sys
import os

warnings.filterwarnings("ignore")

# one-element lists used as module-level state shared by the menu callbacks
loaded_xml = [0]
content_list = [0]


# Here, we are creating our class, Window, and inheriting from the Frame class. Frame
# is a class from the tkinter module. (see Lib/tkinter/__init__)
class Window(Frame):
    # Define settings upon initialization.
    def __init__(self, master=None):
        # parameters that you want to send through the Frame class.
        Frame.__init__(self, master)
        # reference to the master widget, which is the tk window
        self.master = master
        self.w = 750
        self.h = 450
        self.load = Image.open("laptop.jpg")
        # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter
        self.load = self.load.resize((self.w, self.h), Image.LANCZOS)
        render = ImageTk.PhotoImage(self.load)
        # labels can be text or images
        self.img = Label(master, image=render)
        self.img.pack(side=LEFT)
        self.text2 = Text(master, height=40, width=50)
        self.scroll = Scrollbar(master, command=self.text2.yview)
        # with that, we want to then run init_window, defined below
        self.init_window()
    # Creation of init_window
    def init_window(self):
        # changing the title of our master widget
        self.master.title("GUI")
        # allowing the widget to take the full space of the root window
        self.pack(fill=BOTH, expand=1)
        # creating a menu instance
        menu = Menu(self.master)
        self.master.config(menu=menu)
        # creating the file object
        file = Menu(menu)
        # add the menu commands; each label is bound to the method it runs
        file.add_command(label="Load xml with messages", command=self.showXml)
        file.add_command(label="Show the processed body", command=self.showBody)
        file.add_command(label="Show Vector Representation", command=self.showRes)
        file.add_command(label="Classification", command=self.classf)
        file.add_command(label="Exit", command=self.client_exit)
        # adding "file" to our menu
        menu.add_cascade(label="File", menu=file)
        load = Image.open("laptop.jpg")
        resized_load = load.resize((self.w, self.h), Image.LANCZOS)
        render = ImageTk.PhotoImage(resized_load)
        # labels can be text or images
        self.img.configure(image=render, text="Cyber Bullying Detection App")
        # keep a reference so the image is not garbage-collected
        self.img.image = render
        self.img.pack(side=LEFT)
    def showXml(self):
        filename1 = filedialog.askopenfilename()
        loaded_xml[0] = filename1
        print("Reading XML:", loaded_xml[0])
        # parse the chosen file and collect every <body> element
        with open(filename1) as _xml:
            d = BeautifulSoup(_xml, "lxml-xml")
        text = d.find_all("body")
        load = Image.open("myspace.png")
        resized_load = load.resize((self.w//2, self.h), Image.LANCZOS)
        render = ImageTk.PhotoImage(resized_load)
        # labels can be text or images
        self.img.configure(image=render)
        self.img.image = render
        self.img.pack(side=LEFT)
        self.text2.delete('1.0', END)
        self.text2.configure(yscrollcommand=self.scroll.set)
        self.text2.tag_configure('bold_italics', font=('Arial', 12, 'bold', 'italic'))
        self.text2.tag_configure('big', font=('Verdana', 20, 'bold'))
        self.text2.tag_configure('color', foreground='#476042',
                                 font=('Tempus Sans ITC', 12, 'bold'))
        # self.text2.tag_bind('follow', '<1>', lambda e, t=self.text2: t.insert(END, "Not now, maybe later!"))
        self.text2.insert(END, 'Content of XML file\n\n', 'big')
        quote = text
        self.text2.insert(END, quote, 'color')
        self.text2.pack(side=LEFT)
        self.scroll.pack(side=RIGHT, fill=Y)
    def predict(self):
        # look up the stored representation of the loaded file
        reps = pickle.load(open("reps.pkl", "rb"))
        files = pickle.load(open("l_files.pkl", "rb"))
        fname = os.path.basename(loaded_xml[0]).replace(".xml", "")
        print("File name for predict:", fname)
        ind = files.index(fname)
        rep = reps[ind]
        _train_rep = joblib.load("train_rep.p")
        _test_rep = joblib.load("test_rep.p")
        train_lab = joblib.load("train_lab.p")
        test_lab = joblib.load("test_lab.p")
        # refit PCA and the SVM on the saved training representations
        _pca = PCA(n_components=20)
        _pca.fit(_train_rep)
        train_rep = _pca.transform(_train_rep)
        test_rep = _pca.transform(_test_rep)
        clf = SVC()
        clf.fit(train_rep, train_lab)
        val_rep = _pca.transform([rep])
        pred = clf.predict(val_rep.reshape(1, -1))
        labs = ["Non-bullying", "Bullying"]
        print("Prediction is ", labs[pred[0]])
        return labs[pred[0]]

    def t_r(self, text):
        # remove the <body> tags, collapse non-word characters, and tokenise
        text = re.sub('<body>', '', text)
        text = re.sub('</body>', '', text)
        text = re.sub(r'\W+', ' ', text)
        text = re.sub(r'(\W+)(\d+)(\W+)', ' ', text)
        text = text.split()
        return text

    def _len(self, item):
        # keep tokens that are 3 to 14 characters long
        if(15 > len(item) > 2):
            return True
        else:
            return False
    def showBody(self):
        load = Image.open("textm.jpg")
        resized_load = load.resize((self.w//2, self.h), Image.LANCZOS)
        render = ImageTk.PhotoImage(resized_load)
        with open(loaded_xml[0]) as _xml:
            d = BeautifulSoup(_xml, "lxml-xml")
        text = d.find_all("body")
        # tokenise every message body
        sen = []
        for i in text:
            data1 = self.t_r(str(i))
            sen.extend(data1)
        print(len(sen))
        print(sen)
        print(type(sen))
        # apply the length filter and lower-case the remaining tokens
        proc_content = list(filter(self._len, sen))
        proc_content = [i.lower() for i in proc_content]
        content_list[0] = proc_content
        # labels can be text or images
        self.img.configure(image=render)
        self.img.image = render
        self.img.pack(side=LEFT)
        self.text2.delete('1.0', END)
        self.text2.configure(yscrollcommand=self.scroll.set)
        self.text2.tag_configure('bold_italics', font=('Arial', 12, 'bold', 'italic'))
        self.text2.tag_configure('big', font=('Verdana', 20, 'bold'))
        self.text2.tag_configure('color', foreground='#476042',
                                 font=('Tempus Sans ITC', 12, 'bold'))
        # self.text2.tag_bind('follow', '<1>', lambda e, t=self.text2: t.insert(END, "Not now, maybe later!"))
        self.text2.insert(END, 'Processed Content\n\n', 'big')
        quote = proc_content
        self.text2.insert(END, quote, 'color')
        self.text2.pack(side=LEFT)
        self.scroll.pack(side=RIGHT, fill=Y)
    def classf(self):
        load = Image.open("classf.jpg")
        resized_load = load.resize((self.w//2, self.h), Image.LANCZOS)
        render = ImageTk.PhotoImage(resized_load)
        # labels can be text or images
        self.img.configure(image=render)
        self.img.image = render
        self.img.pack(side=LEFT)
        self.text2.delete('1.0', END)
        self.text2.configure(yscrollcommand=self.scroll.set)
        self.text2.tag_configure('bold_italics', font=('Arial', 12, 'bold', 'italic'))
        self.text2.tag_configure('big', font=('Verdana', 20, 'bold'))
        self.text2.tag_configure('color', foreground='#476042',
                                 font=('Tempus Sans ITC', 12, 'bold'))
        # self.text2.tag_bind('follow', '<1>', lambda e, t=self.text2: t.insert(END, "Not now, maybe later!"))
        self.text2.insert(END, 'Results\n\n', 'big')
        # show the predicted label for the loaded file
        quote = self.predict()
        self.text2.insert(END, quote, 'color')
        self.text2.pack(side=LEFT)
        self.scroll.pack(side=RIGHT, fill=Y)
    def showRes(self):
        load = Image.open("word2vec.png")
        resized_load = load.resize((self.w//2, self.h), Image.LANCZOS)
        render = ImageTk.PhotoImage(resized_load)
        wv_model = pickle.load(open("wv_model.pkl", "rb"))
        rep = []
        words = []
        # look up the vector of every processed word; skip out-of-vocabulary words
        for i in content_list[0]:
            try:
                rep.append(wv_model.wv.get_vector(i))
                words.append(i)
            except KeyError:
                pass
        rep_str_list = [str(j) + " : " + str(i) + '\n\n' for i, j in zip(rep, words)]
        rep_str = " ".join(rep_str_list)
        # labels can be text or images
        self.img.configure(image=render)
        self.img.image = render
        self.img.pack(side=LEFT)
        self.text2.delete('1.0', END)
        self.text2.configure(yscrollcommand=self.scroll.set)
        self.text2.tag_configure('bold_italics', font=('Arial', 12, 'bold', 'italic'))
        self.text2.tag_configure('big', font=('Verdana', 20, 'bold'))
        self.text2.tag_configure('color', foreground='#476042',
                                 font=('Tempus Sans ITC', 12, 'bold'))
        # self.text2.tag_bind('follow', '<1>', lambda e, t=self.text2: t.insert(END, "Not now, maybe later!"))
        self.text2.insert(END, 'Results\n\n', 'big')
        quote = rep_str
        self.text2.insert(END, quote, 'color')
        self.text2.pack(side=LEFT)
        self.scroll.pack(side=RIGHT, fill=Y)

    def client_exit(self):
        sys.exit(0)


# Here, the root window is created. It is the only window in this application, but
# further windows can later be created within it.
root = Tk()  # A root window for displaying objects
root.geometry("750x450")
# creation of an instance
app = Window(master=root)
app.mainloop()
root.destroy()


CHAPTER 6
TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, sub-assemblies, assemblies and/or a finished product. It is
the process of exercising software with the intent of ensuring that the software system
meets its requirements and user expectations and does not fail in an unacceptable
manner. There are various types of tests. Each test type addresses a specific testing
requirement.


Project Information
  Project Name    : Cyberbullying detection based on semantic-enhanced marginalized denoising autoencoder
  Project ID      : P 01
  Platform        : Python
Test Information
  Test Type       : File Operation
  Test Name       : File Testing
  Original Author : Tester 1
  Test Case ID    : T 01

Test Objective – Testing to ensure that each file doesn’t contain more than 10 posts.

Step Number | Description                               | Test Date  | Expected Result                               | Result
01          | File shouldn’t contain more than 10 posts | 26-04-2020 | Each file doesn’t contain more than 10 posts. | Successful

Table 6.1: Test case for File operation
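
A check like the one in Table 6.1 could be automated along the following lines. This is a sketch under assumptions rather than the test that was actually run: it assumes one <body> element per post (the implementation code extracts messages with find_all("body")), and the element names conversation and post, as well as the helper names count_posts and within_limit, are hypothetical.

```python
import xml.etree.ElementTree as ET

MAX_POSTS = 10  # limit stated in the test objective


def count_posts(xml_text):
    """Count <body> elements, assuming one <body> per post."""
    root = ET.fromstring(xml_text)
    return sum(1 for _ in root.iter("body"))


def within_limit(xml_text, limit=MAX_POSTS):
    """Return True when the file holds at most `limit` posts."""
    return count_posts(xml_text) <= limit


# Made-up sample file with three posts.
sample = "<conversation>" + "<post><body>hi</body></post>" * 3 + "</conversation>"
print(count_posts(sample), within_limit(sample))
```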


Project Information
  Project Name    : Cyberbullying detection based on semantic-enhanced marginalized denoising autoencoder
  Project ID      : P 02
  Platform        : Python
Test Information
  Test Type       : Loading of XML file
  Test Name       : XML Testing
  Original Author : Tester 1
  Test Case ID    : T 02

Test Objective – Testing to ensure that the XML file is being loaded properly.

Step Number | Description                       | Test Date  | Expected Result                    | Result
01          | XML file is being loaded properly | 26-04-2020 | XML file should be loaded properly | Successful

Table 6.2: Test case for XML file


Project Information
  Project Name    : Cyberbullying detection based on semantic-enhanced marginalized denoising autoencoder
  Project ID      : P 03
  Platform        : Python
Test Information
  Test Type       : Removal of special characters, stop words and repeated words
  Test Name       : Preprocessing
  Original Author : Tester 1
  Test Case ID    : T 03

Test Objective – Testing to ensure that preprocessing occurred correctly.

Step Number | Description                    | Test Date  | Expected Result                                              | Result
01          | Preprocessing occurs correctly | 26/04/2020 | Removal of special characters, stop words and repeated words | Successful

Table 6.3: Test case for Preprocessing


Project Information
  Project Name    : Cyberbullying detection based on semantic-enhanced marginalized denoising autoencoder
  Project ID      : P 04
  Platform        : Python
Test Information
  Test Type       : Converting words to vector
  Test Name       : Vector Representation
  Original Author : Tester 02
  Test Case ID    : T 04

Test Objective – Using the word2vec model, we check whether the words are correctly
represented as vectors.

Step Number | Description                                                     | Test Date  | Expected Result             | Result
01          | Check to see if the words are correctly represented as vectors. | 26/04/2020 | Converting words to vectors | Successful

Table 6.4: Test case for Vector Representation


Project Information
  Project Name    : Cyberbullying detection based on semantic-enhanced marginalized denoising autoencoder
  Project ID      : P 05
  Platform        : Python
Test Information
  Test Type       : Classification
  Test Name       : Bullying
  Original Author : Tester 02
  Test Case ID    : T 05

Test Objective – Testing to ensure that the loaded XML file results in correct
classification

Step Number | Description                                                             | Test Date  | Expected Result | Result
01          | To ensure whether the loaded XML file results in correct classification | 26/04/2020 | Bullying        | Successful

Table 6.5: Test case for Classification (Bullying)


Project Information
  Project Name    : Cyberbullying detection based on semantic-enhanced marginalized denoising autoencoder
  Project ID      : P 06
  Platform        : Python
Test Information
  Test Type       : Classification
  Test Name       : Non-Bullying
  Original Author : Tester 02
  Test Case ID    : T 06

Test Objective – Testing to ensure that the loaded XML file results in correct
classification

Step Number | Description                                                             | Test Date  | Expected Result | Result
01          | To ensure whether the loaded XML file results in correct classification | 26/04/2020 | Non-Bullying    | Successful

Table 6.6: Test case for Classification (Non-Bullying)


CHAPTER 7
SNAPSHOTS

Fig 7.1 Graphical User Interface (GUI)


Fig 7.2 Loading of XML 1 Fig 7.3 Loading of XML 2

Fig 7.4 Loading of XML 3


Fig 7.5 Loading of XML 4

Fig 7.6 Loading of XML 5

Fig 7.7 Loading of XML 6


Fig 7.8 Processed Body 1

Fig 7.9 Processed Body 2


Fig 7.10 Vector Representation 1

Fig 7.11 Vector Representation 2


Fig 7.12 Classification 1

Fig 7.13 Classification 2

Fig 7.14 Classification 3


CHAPTER 8
CONCLUSION AND FUTURE ENHANCEMENT
8.1 CONCLUSION
We were able to implement the proposed system satisfactorily with respect to its scope
and its relevance to social networking. Incidents of cyberbullying are very common in
modern life, and users of every age group are vulnerable to such threats. Our project
therefore has the steady intention of eradicating such incidents through our social
network, which can be used by all age groups. Because the interface is user friendly, it
is easily accessible to anyone, and given its relevance it is expected to be a success.
This project also addresses the text-based cyberbullying detection problem, where
robust and discriminative representations of messages are critical for an effective
detection system. By designing semantic dropout noise and enforcing sparsity, we have
developed the semantic-enhanced marginalized denoising autoencoder as a specialized
representation learning model for cyberbullying detection. In addition, word
embeddings have been used to automatically expand and refine bullying word lists that
are initialized from domain knowledge. The performance of our approach has been
experimentally verified on two cyberbullying corpora from social media: Twitter and
Myspace. As a next step, we plan to further improve the robustness of the learned
representation by considering word order in messages.
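
To make the representation learning step more concrete, the following is a minimal sketch of a single marginalized denoising autoencoder layer in Python with NumPy. It uses the standard closed-form solution in which dropout noise is averaged out analytically instead of being sampled. Several points are assumptions made for illustration and are not taken from the project code: semantic dropout noise is modelled here simply as a higher dropout probability for known bullying words, the sparsity constraint is omitted, and the function names (fit_mda, encode), the toy vocabulary, and the noise levels are hypothetical.

```python
import numpy as np


def fit_mda(X, keep_prob, reg=1e-5):
    """Closed-form marginalized denoising autoencoder for one layer.

    X         : (d, n) array, one column per message (e.g. bag-of-words counts).
    keep_prob : (d,) array, probability that each feature survives dropout.
    reg       : small ridge term so the linear system is always solvable.
    Returns W : (d, d + 1) mapping from [x; 1] to the reconstruction of x.
    """
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])   # append a constant bias feature
    q = np.append(keep_prob, 1.0)          # the bias feature is never dropped
    S = Xb @ Xb.T                          # scatter matrix of the clean data

    # Expected scatter of the corrupted inputs, E[x~ x~^T].
    Q = S * np.outer(q, q)
    np.fill_diagonal(Q, q * np.diag(S))
    # Expected cross term between clean targets and corrupted inputs.
    P = S[:d, :] * q

    # W = P (Q + reg*I)^-1; Q is symmetric, so solve the transposed system.
    return np.linalg.solve(Q + reg * np.eye(d + 1), P.T).T


def encode(W, X):
    """Map messages to the learned representation h = tanh(W [x; 1])."""
    Xb = np.vstack([X, np.ones((1, X.shape[1]))])
    return np.tanh(W @ Xb)


# Toy vocabulary (hypothetical): feature 0 stands in for a bullying word.
vocab = ["idiot", "you", "are", "nice"]
X = np.array([
    [1, 1, 0, 0, 1, 0],   # "idiot"
    [1, 1, 1, 1, 0, 1],   # "you"
    [1, 0, 1, 1, 1, 0],   # "are"
    [0, 0, 1, 1, 0, 1],   # "nice"
], dtype=float)

# Assumed noise levels: bullying features are dropped more often than others.
is_bullying = np.array([True, False, False, False])
keep_prob = np.where(is_bullying, 0.2, 0.7)

W = fit_mda(X, keep_prob)
H = encode(W, X)          # one column of learned features per message
print(W.shape, H.shape)
```

The columns of H could then be passed to any standard classifier. Stacking several such layers, each trained on the output of the previous one, would give a deeper representation in the same way.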


8.2 FUTURE ENHANCEMENT


The proposed system's classification accuracy is around 80%, which was the best we were
able to achieve when training the models, since such complicated textual data is
difficult to classify. By gathering more data related to this topic, we may be able to
obtain a more effective and efficient model. New parallel computing techniques, combined
with higher accuracy, can help in the future enhancement of this system.


