Cyberbullying Detection Based On Semantic Enhanced Marginalised Denoising Autoencoder - Report
A PROJECT REPORT
ON
CYBERBULLYING DETECTION BASED ON SEMANTIC-ENHANCED MARGINALISED DENOISING AUTOENCODER
IN PARTIAL FULFILMENT FOR THE AWARD OF
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING
BY
AISHWARYA IYER
1NH14CS007
ANCHANA R
1NH14CS704
CERTIFICATE
It is hereby certified that the project work entitled “CYBER BULLYING DETECTION BASED
ON SEMANTIC-ENHANCED MARGINALISED DENOISING AUTOENCODER” is a bonafide
work carried out by AISHWARYA IYER (1NH14CS007) and ANCHANA.R (1NH14CS704) in
partial fulfilment for the award of Bachelor of Engineering in COMPUTER SCIENCE AND
ENGINEERING of the New Horizon College of Engineering during the year 2019-2020. It is
certified that all corrections/suggestions indicated for Internal Assessment have been
incorporated in the Report deposited in the departmental library. The project report has
been approved as it satisfies the academic requirements in respect of project work
prescribed for the said Degree.
External Viva
1. ………………………………………….. ………………………………….
2. …………………………………………… …………………………………..
ABSTRACT
As a side effect of increasingly popular social media, cyberbullying has emerged as
a serious problem affecting children, adolescents and young adults. Machine learning
techniques make automatic detection of cyberbullying messages in social media
possible, and this could help to construct a healthy and safe social media environment.
In this meaningful research area, one critical issue is robust and discriminative numerical
representation learning of text messages. In this paper, we propose a new
representation learning method to tackle this problem. Our method, named semantic-
enhanced marginalized denoising autoencoder (smSDA), is developed via semantic
extension of the popular deep learning model stacked denoising autoencoder (SDA). The
semantic extension consists of semantic dropout noise and sparsity constraints, where
the semantic dropout noise is designed based on domain knowledge and the word
embedding technique. Our proposed method is able to exploit the hidden feature
structure of bullying information and learn a robust and discriminative representation of
text. Comprehensive experiments on two public cyberbullying corpora (Twitter and Myspace)
are conducted, and the results show that our proposed approaches outperform other
baseline text representation learning methods.
Keywords:
Cyberbullying, Autoencoder, Linear SVM, Bag of Words, Word2Vec, Dataset, Myspace,
Preprocessing.
ACKNOWLEDGEMENT
The satisfaction and euphoria that accompany the successful completion of any task
would be incomplete without the mention of the people who made it possible, whose
constant guidance and encouragement crowned our efforts with success.
I would also like to thank Dr. B. Rajalakshmi, Professor and Head, Department of
Computer Science and Engineering, for her constant support.
Finally, a note of thanks to the teaching and non-teaching staff of Dept of Computer
Science and Engineering, for their cooperation extended to me, and my friends, who
helped me directly or indirectly in the course of the project work.
AISHWARYA IYER(1NH14CS007)
ANCHANA.R(1NH14CS704)
CONTENTS
ABSTRACT
ACKNOWLEDGEMENT
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
1.5.1. DATASET
1.5.2. PREPROCESSING
1.5.3. WORD2VEC
2. LITERATURE SURVEY
2.1. RELATED WORK
2.8.1. PYTHON
2.8.3. OPENCV
2.8.4. GUI
3. REQUIREMENT ANALYSIS
3.1. FUNCTIONAL REQUIREMENTS
3.2.1. ACCESSIBILITY
3.2.2. MAINTAINABILITY
3.2.3. SCALABILITY
3.2.4. PORTABILITY
3.2.5. RELIABILITY
4. DESIGN
4.1. WORKFLOW
4.1.1. FILTERING
5. IMPLEMENTATION
5.1. PREPROCESSING
5.2. AUTOENCODER
5.3. GUI
6. TESTING
7. SNAPSHOTS
8. CONCLUSION AND FUTURE ENHANCEMENT
8.1. CONCLUSION
REFERENCES

LIST OF FIGURES
4.1. Workflow
7.1. GUI
7.12. Classification1
7.13. Classification2
7.14. Classification3

LIST OF TABLES
CHAPTER 1
INTRODUCTION
1.5.1 DATASET
Myspace is a social networking website similar to Twitter. A series of messages captured
from the site is stored as the dataset. The dataset is available online and can be
accessed by anyone.
1.5.2 PREPROCESSING
This is the second module in the implementation of our project. In this step we remove
special characters, stop words, and meaningless words from the acquired dataset so
that subsequent processing is easier.
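A minimal clean-up routine might look like the following sketch; the stop-word list here is a small stand-in, not the project's actual list.

import re

# Illustrative stop-word list; a real run would use a fuller list.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and"}

def clean_message(text):
    # Replace special characters with spaces, keeping word characters.
    text = re.sub(r"[^\w\s]", " ", text)
    # Drop stop words and collapse whitespace.
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_message("You're SUCH a loser!!!"))  # -> "you re such loser"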
1.5.3 WORD2VEC
The Key idea of word2vec is to achieve better performance not by using a more complex
model (i.e., with more layers), but by allowing a simpler (shallower) model to be trained
on much larger amounts of data.
Two algorithms are used for learning word vectors:
• CBOW: predict the target word from its context (the focus of what follows).
• Skip-gram: predict the context from the target word.
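For illustration, both variants can be trained with the gensim library (assuming gensim 4.x); the toy corpus and the parameter values below are placeholders, not the settings used in this project.

from gensim.models import Word2Vec

# Toy corpus of tokenised messages (placeholder data).
sentences = [["you", "are", "a", "loser"],
             ["nobody", "likes", "you"],
             ["have", "a", "nice", "day"]]

# sg=0 selects CBOW (predict the target from its context);
# sg=1 would select skip-gram (predict the context from the target).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
print(model.wv["loser"].shape)  # a 50-dimensional word vector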
An autoencoder is a neural network that is trained to attempt to copy its input to its
output. Internally, it has a hidden layer h that describes a code used to represent the
input.
• Hidden layer h
• Two parts:
– Encoder: h = f(x)
– Decoder: r = g(h)
This is the fifth module, where we use a denoising autoencoder to pretrain our deep
neural network. This leads to intermediate representations much better suited for
subsequent learning tasks such as supervised classification.
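As a concrete illustration of the encoder/decoder pair, the numpy sketch below runs one forward pass of a denoising autoencoder; the layer sizes, the dropout rate and the random weights are arbitrary illustrations.

import numpy as np

rng = np.random.default_rng(0)
n_input, n_hidden = 20, 5  # illustrative sizes

W1, b1 = rng.normal(size=(n_input, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, n_input)), np.zeros(n_input)

def encoder(x):  # h = f(x)
    return np.tanh(x @ W1 + b1)

def decoder(h):  # r = g(h), linear output layer
    return h @ W2 + b2

x = rng.normal(size=n_input)
x_noisy = x * (rng.random(n_input) > 0.3)  # denoising: drop ~30% of inputs
r = decoder(encoder(x_noisy))
loss = np.mean((x - r) ** 2)  # reconstruct the *clean* input
print(loss)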
CHAPTER 2
LITERATURE SURVEY
Literature survey is the most important step in the software development process. Before
developing the tool, it is necessary to determine the time factor, economy and company
strength. Once these things are satisfied, the next step is to determine which
operating system and language can be used for developing the tool. Once the
programmers start building the tool, they need a lot of external support. This
support can be obtained from senior programmers, from books or from websites. Before
building the system, the above considerations are taken into account for developing the
proposed system.
Each new feature is a linear combination of all original features, which alleviates the sparsity
problem. Topic models, including Probabilistic Latent Semantic Analysis and Latent
Dirichlet Allocation, have also been proposed. The basic idea behind topic models is that word
choice in a document will be influenced by the topic of the document probabilistically.
Topic models try to define the generation process of each word occurring in a
document. Similar to the aforementioned approaches, our proposed approach takes the
BoW representation as the input. However, our approach has some distinct merits.
Firstly, the multiple layers and non-linearity of our model provide a deep learning
architecture for text representation, which has been proven to be effective for learning
high-level features. Secondly, the applied dropout noise makes the learned
representation more robust. Thirdly, specific to cyberbullying detection, our method
employs semantic information, including bullying words and the sparsity constraint
imposed on the mapping matrix in each layer, and this will in turn produce a more
discriminative representation.
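To make the idea of semantic dropout noise concrete, the sketch below corrupts features that correspond to bullying words with a higher probability than ordinary features, so reconstruction forces the model to exploit correlated words. The vocabulary, indices and probabilities are illustrative assumptions, not the settings used in the paper.

import numpy as np

rng = np.random.default_rng(42)

vocab = ["hate", "loser", "stupid", "school", "today", "friend"]
bullying_idx = {0, 1, 2}  # assumed positions of bullying words in vocab

def semantic_dropout(x, p_bully=0.8, p_normal=0.3):
    # Zero each BoW feature with a probability that is higher for
    # bullying words than for ordinary words.
    probs = np.array([p_bully if i in bullying_idx else p_normal
                      for i in range(len(x))])
    keep = rng.random(len(x)) >= probs
    return x * keep

x = np.array([1, 2, 0, 1, 1, 3])  # toy BoW counts over vocab
print(semantic_dropout(x))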
A citation trail was performed on the discovered papers using the papers’ references as
a starting point, and a total of 89 academic papers were discovered as a result of the
search. The papers were initially assessed for relevance via a review of their titles,
abstract, and concluding arguments: 18 papers were not considered relevant to the
survey and so were removed. The full text of the remaining papers was reviewed and
papers whose primary focus did not include any of the 4 cyberbullying detection tasks
we identified in Section 1 were discounted. This led to the removal of a further 18
papers. These included papers that dealt with themes such as youth violence
involvement detection (Sigel and Harpin, 2013), story matching to identify distressed
teens (Dinakar et al., 2012b; Macbeth et al., 2013), and cyberbully prevention policies
(Al Mazari, 2013). To eliminate the effects of language on cyberbully detection when
comparing the reviewed studies, we excluded papers using non-English corpora; thus, a
further 7 papers were excluded. These included papers such as Ptazynski et al. (2010a; b),
Honjo et al. (2011), Nitta et al. (2013), Li and Tagami (2014), Margono et al. (2014) and Van Hee
et al. (2015) which were removed as they used non-English corpora. The remaining 46
papers were included in the final list of papers examined by this study.
Our survey revealed binary classification as the most common task performed in
cyberbullying detection. In this regard, bullying messages are considered members of a
“bullying” class and all other documents belong to the “other” or “non-bullying” class.
The key task then is the identification of documents that possess the core attributes of
the “bullying” class. Out of the 46 studies reviewed, 34 performed binary classification
either as the sole detection task or in combination with other tasks. This classification of
messages is often facilitated by sentiment analysis using emotive wordlists, supervised
learning, and lexicon-based systems.
Studies such as Yin et al. (2009), Dinakar et al. (2011), Xu et al. (2012a), and Rafiq et al.
(2015) performed sentiment analysis using supervised-learning techniques. Others such
as Burn-Thorton and Burman (2012), Kontostathis et al. (2013), Nahar et al. (2013;
2014), Munezero et al. (2014), Nandhini and Sheeba (2015a;b), and Zhao et al. (2016),
while also implementing binary classification, did not perform the message classification
via sentiment analysis. Interestingly, Xu et al. (2012a) is the only instance we found
whereby sentiment analysis is performed not for the purpose of binary classification but
to understand the emotions expressed in what they term “bully traces”, which are
tweets containing any of the words “bully”, “bullied” and “bullying” (i.e., tweets
containing bullying references or reportage – e.g., “I saw a girl got bullied at school
today #bullyingisnotcool”). Role identification is the next most performed task (11
papers), featuring heavily in studies such as Sanchez and Kumar (2011), Chen et al.
(2012), Dadvar et al. (2014), and Galán-García et al. (2014). Determining the severity of
cyberbullying by computing a score indicative of the bullying severity of messages
and/or sender is performed by studies such as Chen et al. (2012), Perez et al. (2012),
Dadvar et al. (2013a), Del Bosque and Garza (2014), and Potha and Maragoudakis
(2014). Dadvar et al. (2012b) and Squicciarini et al. (2015) were the only studies we
found that proposed the relatively novel task of detecting and classifying the events that
occur after a cyberbullying incident. While cyberbullying occurs across various forms of
electronic media – such as SMS (Short Messaging Service), MMS (Multimedia Messaging
Service), email, forums, chat rooms – and social media platforms like Facebook, Twitter,
YouTube and SnapChat, social media was the main source of data for many of the
studies reviewed. This can be attributed to the availability of social media data which is
often freely accessible in the public domain; emails, SMS, MMS and chat rooms are, in
contrast, very personal means of communication and, as such, communications via
these media are less likely to be publicly available.
Twitter and MySpace are the most common data sources. Twitter is used in many
studies including Sanchez and Kumar (2011), Xu et al. (2012a; b), Huang et al. (2014),
Galán-García et al. (2014), and Zhao et al. (2016). MySpace is used by Yin et al. (2009),
Parime and Suri (2014), Nandhini and Sheeba (2015a; b), and Squicciarini et al. (2015)
amongst others. YouTube is in second place with Dinakar et al. (2011), Chen et al.
(2012), Dadvar et al. (2013a; b; 2014) using corpora that included YouTube data.
Burn-Thorton and Burman (2012) is the only paper in our sample that uses an email
corpus. 14 papers publicly shared their datasets: 9 of these make use of the Barcelona
Media dataset (a publicly available dataset of social media data) and the remaining 5
papers sourced the corpus themselves. With supervised-learning methods proving
popular amongst the reviewed studies (34 papers), the means by which judgements on
annotated data were arrived at is of interest. Traditional means of labelling data using
annotators or by the researchers themselves still proved to be popular, with 25 studies
employing annotators, experts, or researchers to label data. Crowd-sourcing annotators
is also gaining traction within the cyberbullying research community, with studies such
as Sanchez and Kumar (2011), Kontostathis et al. (2013) and Hosseinmardi et al. (2015)
using crowdsourcing services like Amazon Mechanical Turk (MTurk) and CrowdFlower to
label data. Given the ease, relatively low cost, and huge time savings of crowdsourcing, we
expected to find higher utilisation of crowdsourcing services amongst the studies, but
perhaps researchers’ need to ensure high-quality annotated data currently presents a
barrier that crowdsourcing services will need to overcome in order to become more
widely used. Interestingly, only 3 papers (Dinakar et al., 2011; 2012a; Rafiq et al., 2015)
employed experts to annotate data. This is surprising since a natural assumption would
be that the use of experts for annotation likely presents the best chance of achieving
quality labelled data.
A possible reason for this low utilization of experts for labelling data could be the
subjective nature of bullying, a consequence of which could be that researchers’ and
experts’ views on cyberbullying may differ greatly. Thus, researchers adopting a specific
definition of cyberbullying would naturally want the annotators to be guided by this
definition.
Disadvantages:
These methods cannot correlate similar words. For example, an algorithm using bag of
words will classify "cat" and "kitten" as two different entities instead of classifying them
as related entities, unless the relation is explicitly specified.
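The point can be seen directly with a bag-of-words vectoriser (this sketch uses scikit-learn; the two sentences are made up): "cat" and "kitten" occupy unrelated columns, so their vectors share nothing.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "a kitten sat on the mat"]
vec = CountVectorizer()
X = vec.fit_transform(docs).toarray()

print(vec.get_feature_names_out())  # 'cat' and 'kitten' are separate columns
print(X)  # nothing in the counts links the two related words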
2.8.1 PYTHON
Python provides constructs that enable clear programming on both small and large scales.
Python interpreters are available for many operating systems. CPython, the reference
implementation of Python, is open source software and has a community-based
development model, as do nearly all of its variant implementations.
2.8.3 OpenCV
OpenCV (Open Source Computer Vision) is a library of programming functions mainly
aimed at real-time computer vision. OpenCV is written in C++ and its primary interface is
in C++, but it still retains a less comprehensive though extensive older C interface. There
are bindings in Python, Java and MATLAB/Octave. The API for these interfaces can be
found in the online documentation. Wrappers in other languages such as C#, Perl, Ch,
Haskell and Ruby have been developed to encourage adoption by a wider audience.
All of the new developments and algorithms in OpenCV are now developed in the C++
interface. OpenCV runs on the following desktop operating systems: Windows, Linux,
macOS, FreeBSD, NetBSD, OpenBSD. OpenCV runs on the following mobile operating
systems: Android, iOS, Maemo, BlackBerry 10. The user can get official releases from
SourceForge or take the latest sources from GitHub. OpenCV uses CMake.
OpenCV (Open Source Computer Vision Library) is released under a BSD license and
hence it’s free for both academic and commercial use. It has C++, Python and Java
interfaces and supports Windows, Linux, Mac OS, iOS and Android. OpenCV was
designed for computational efficiency and with a strong focus on real-time applications.
Written in optimized C/C++, the library can take advantage of multi-core processing.
Enabled with OpenCL, it can take advantage of the hardware acceleration of the
underlying heterogeneous compute platform.
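As a quick illustration of the Python bindings, the snippet below loads an image and converts it to grayscale; the file names are placeholders.

import cv2

img = cv2.imread("example.jpg")  # placeholder file name
if img is not None:
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    cv2.imwrite("example_gray.jpg", gray)  # save the grayscale copy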
ECONOMICAL FEASIBILITY
This study is carried out to check the economic impact that the system will have on
the organization. The amount of funds that the company can pour into the research
and development of the system is limited, so the expenditures must be justified. The
developed system is well within the budget, and this was achieved because most
of the technologies used are freely available. Only the customized products had to
be purchased.
TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical
requirements of the system. Any system developed must not place a high demand on
the available technical resources, as this would lead to high demands being placed on
the client. The developed system must have modest requirements, as only minimal or
no changes are required for implementing this system.
SOCIAL FEASIBILITY
This aspect of the study checks the level of acceptance of the system by the user.
This includes the process of training the user to use the system efficiently. The user
must not feel threatened by the system, but must instead accept it as a necessity. The
level of acceptance by the users solely depends on the methods that are employed
to educate the user about the system and to make him familiar with it. His level of
confidence must be raised so that he is also able to make some constructive
criticism, which is welcomed, as he is the final user of the system.
CHAPTER 3
REQUIREMENT ANALYSIS
3.2.1 ACCESSIBILITY:
Accessibility is a general term used to describe the degree to which a product, device,
service, or environment is accessible by as many people as possible.
3.2.2 MAINTAINABILITY:
In software engineering, maintainability is the ease with which a software product can
be modified in order to:
• Correct defects
New functionalities can be added in the project based on the user requirements.
Since the programming is very simple, it is easier to find and correct the defects and to
make the changes in the project.
3.2.3 SCALABILITY:
The system is capable of handling an increase in total throughput under an increased
load when resources (typically hardware) are added. The system can work normally
under situations such as low bandwidth and a large number of users.
3.2.4 PORTABILITY:
The project can be executed under different operating conditions provided it meets its
minimum configuration. Only system files and dependent assemblies would have to be
configured in such a case.
3.2.5 RELIABILITY:
• Ram : 4GB.
• IDE : Jupyter
1. Problem/requirement analysis:
This process is the harder and more nebulous of the two; it deals with understanding the
problem, the goal and the constraints.
2. Requirement Specification:
Here, the focus is on specifying what has been found during analysis. Issues such as
representation, specification languages and tools, and checking of the specifications are
addressed during this activity. The requirement phase terminates with the production
of the validated SRS document; producing the SRS document is the basic goal of this
phase.
Role of SRS:
CHAPTER 4
DESIGN
4.1 WORKFLOW
4.1.1 Filtering
All the content in this social network is filtered, and only after that does it reach the
user. Rules are defined for this filtering, which is of two types: image filtering and text
filtering. For text filtering, a collection of words called a bag of words is constructed, and
the words included in it are filtered. These words are filtered directly and are also
extracted from their latent structure. Based upon the percentage of bullying content
containing that word, the behaviour of the user is assessed; a minimal sketch of such a
filter follows.
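The sketch below illustrates such a bag-of-words text filter; the word list and the threshold are illustrative assumptions, not the project's actual configuration.

# Hypothetical bag of bullying words and a simple percentage rule.
BULLY_WORDS = {"hate", "loser", "stupid", "idiot"}

def bully_fraction(message):
    # Fraction of tokens in the message that are bullying words.
    tokens = message.lower().split()
    if not tokens:
        return 0.0
    return sum(t in BULLY_WORDS for t in tokens) / len(tokens)

def should_filter(message, threshold=0.2):  # threshold is an assumption
    return bully_fraction(message) >= threshold

print(should_filter("you are such a loser"))  # True: 1 of 5 tokens matches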
CHAPTER 5
IMPLEMENTATION
5.1 PREPROCESSING
• Extract labels: Here we club the posts related to a specific topic, group
them in series of ten or fewer, and store them in a file.
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Human Concensus/Packet1Consensus.xlsx\n",
"Human Concensus/Packet2Consensus.xlsx\n",
]
}
],
"source": [
"import zipfile\n",
"z = zipfile.ZipFile('./datasets/Myspace/Human
Concensus.zip','r')\n",
"a = z.namelist()\n",
"for i in a:\n",
" print(i)"
]
" b= z.namelist()\n",
" for i in b:\n",
" rx = re.compile(r'(\\d+)(\\.)(\\d+)')\n",
" res = rx.search(i)\n",
" if res is not None:\n",
" name = res.group()\n",
" if name in list(data['file']):\n",
" _xml = z.open(i)\n",
" d = BeautifulSoup(_xml, \"lxml-xml\")\n",
" text= d.find_all(\"body\")\n",
" l_text.extend(text)\n",
" \n",
" l_content.append(text)\n",
" l_files.append(name)\n",
" \n",
" lab = (df['label'][df['file']==name.replace('.xml', '')]).to_string()[-1]\n",
" if (lab == 'N'): \n",
" l_labels.append(0)\n",
" else:\n",
" l_labels.append(1)\n",
" \n",
" print(text) “
]
• Function for clean-up of the Myspace data: pre-processing of the data
by removing stop words, repeated words and special characters is done here.
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def t_r(text):\n",
" text = re.sub('<body>',' ',text)\n",
" text = re.sub('</body>',' ',text)\n",
" text = re.sub('\\W+',' ',text)\n",
" text = re.sub('(\\W+)(\\d+)(\\W+)',' ',text)\n",
" text = text.split()\n",
"\n",
" return text\n"
] "cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"def t_s(text):\n",
" text = re.sub('<body>',' ',text)\n",
" text = re.sub('</body>',' ',text)\n",
" text = re.sub('\\W+',' ',text)\n",
" text = re.sub('(\\W+)(\\d+)(\\W+)',' ',text)\n",
"\n",
" return text"
]
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"source": [
"new=set()\n",
"for i in l_text:\n",
" data1= t_s(str(i))\n",
" new.update(data1.split())\n",
"print(new)"
]
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"def _len(item):\n",
" if(15>len(item)>2):\n",
" return True\n",
" else:\n",
" return False\n",
" \n",
"_new = filter(_len, new)"
]
l_new = list(_new)
len(l_new)  # 13645 tokens remain after filtering
5.2 AUTOENCODER
import tensorflow as tf
import numpy as np
import pickle
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA  # used to reduce the learned representations
import sys
import joblib
from sklearn.svm import SVC
import os

BATCH_SIZE = 15
GRID_ROWS = 8
GRID_COLS = 8
def masking_noise(data, v, sess):
"""
Applies masking noise to data in X.
In other words a fraction v of elements of X
(chosen at random) is forced to zero.
:param data: array_like, Input data
:param sess: TensorFlow session
:param v: fraction of elements to distort, float
:return: transformed data
"""
data_noise = data.copy()
rand = tf.random_uniform(data.shape)
data_noise[sess.run(tf.nn.relu(tf.sign(v - rand))).astype(np.bool)] = 0
return data_noise
def salt_and_pepper_noise(X, v):
    """Apply salt and pepper noise to data in X.
    A fraction v of the elements of X (chosen at random) is set
    to the minimum or the maximum value found in the data."""
    X_noise = X.copy()
    n = int(v * X.size)
    idx = np.random.choice(X.size, n, replace=False)
    X_noise.flat[idx[:n // 2]] = X.min()
    X_noise.flat[idx[n // 2:]] = X.max()
    return X_noise

def corrupt_input(corr_type, data, v, sess):
    """Corrupt the input according to the selected noise type."""
    if corr_type == 'masking':
        x_corrupted = masking_noise(data, v, sess)
    elif corr_type == 'salt_and_pepper':
        x_corrupted = salt_and_pepper_noise(data, v)
    elif corr_type == 'none':
        x_corrupted = data
    else:
        x_corrupted = None
    return x_corrupted
def weight_variable(shape):
initial = tf.truncated_normal(shape, stddev=0.1)
return tf.Variable(initial)
def bias_variable(shape):
initial = tf.constant(0.1, shape=shape)
return tf.Variable(initial)
def fc_layer(prev, input_size, output_size):
W = weight_variable([input_size, output_size])
b = bias_variable([output_size])
return tf.matmul(prev, W) + b
def autoencoder(x, x_corr):
    # Encoder: 8285-dimensional BoW input -> 100-dimensional code (l3).
    l1 = tf.nn.tanh(fc_layer(x_corr, 8285, 1000))
    l2 = tf.nn.tanh(fc_layer(l1, 1000, 500))
    l3 = fc_layer(l2, 500, 100)
    # Decoder: reconstruct the clean input from the code.
    l4 = tf.nn.tanh(fc_layer(l3, 100, 500))
    l5 = tf.nn.tanh(fc_layer(l4, 500, 1000))
    out = fc_layer(l5, 1000, 8285)
    # Reconstruction loss against the uncorrupted input x.
    loss = tf.reduce_mean(tf.squared_difference(x, out))
    return loss, out, l3
# Train unless cached latent representations already exist on disk.
# (The placeholders x and x_corr, the optimiser step train_step, the
# session sess and the batches are created in code not reproduced here.)
if not os.path.exists("train_rep.p"):
    v = 0.1
    for i in range(num_steps):  # num_steps is set in the elided setup code
        # batch_0 = batch[0].reshape(1, -1)
        batch_0 = batch
        x_corr_input = corrupt_input(corr_type, batch_0, v, sess)
        feed = {x: batch_0, x_corr: x_corr_input}
        if i % 500 == 0:
            train_loss = sess.run([loss], feed_dict=feed)
            print("Step: {}. Loss: {}".format(i, train_loss))
        train_step.run(feed_dict=feed)

    # Extract the 100-dimensional latent codes for train and test data.
    train_rep = latent.eval(feed_dict={x: X_train, x_corr: X_train})
    test_rep = latent.eval(feed_dict={x: X_test, x_corr: X_test})
    print(test_rep.shape)
    print(train_rep.shape)
    joblib.dump(test_rep, "test_rep.p")
    joblib.dump(train_rep, "train_rep.p")
    joblib.dump(y_train, "train_lab.p")
    joblib.dump(y_test, "test_lab.p")
else:
    # Load the cached representations instead of retraining.
    _train_rep = joblib.load("train_rep.p")
    _test_rep = joblib.load("test_rep.p")
    train_lab = joblib.load("train_lab.p")
    test_lab = joblib.load("test_lab.p")

# Reduce the learned representations with PCA, then train an SVM.
_pca = PCA(n_components=100)  # component count is an assumed value
_pca.fit(_train_rep)
train_rep = _pca.transform(_train_rep)
test_rep = _pca.transform(_test_rep)

clf = SVC()
clf.fit(train_rep, train_lab)
print("Training score:", clf.score(train_rep, train_lab))
print("Test score:", clf.score(test_rep, test_lab))
5.3 GUI
import os
import sys
import pickle
import warnings
from tkinter import *
from tkinter import filedialog
from bs4 import BeautifulSoup
from PIL import Image, ImageTk

warnings.filterwarnings("ignore")
loaded_xml = [0]
content_list = [0]
Here, we are creating our class, Window, and inheriting from the Frame class. Frame
is a class from the tkinter module. (see Lib/tkinter/__init__)
class Window(Frame):
# Define settings upon initialization. Here you can specify
def __init__(self, master=None):
# parameters that you want to send through the Frame class.
Frame.__init__(self, master)
#reference to the master widget, which is the tk window
self.master = master
self.w = 750
self.h = 450
self.load = Image.open("laptop.jpg")
self.load = self.load.resize((self.w, self.h), Image.ANTIALIAS)
render = ImageTk.PhotoImage(self.load)
# labels can be text or images
self.img = Label(root, image=render)
self.img.pack(side=LEFT)
self.text2 = Text(root, height=40, width=50)
self.scroll = Scrollbar(root, command=self.text2.yview)
#with that, we want to then run init_window, which doesn't yet exist
self.init_window()
#Creation of init_window
def init_window(self):
filename1 = filedialog.askopenfilename()
loaded_xml[0] = filename1
print("Reading XML:", loaded_xml[0])
_xml = open(filename1)
d = BeautifulSoup(_xml, "lxml-xml")
text= d.find_all("body")
load = Image.open("myspace.png")
resized_load = load.resize((self.w//2, self.h), Image.ANTIALIAS)
render = ImageTk.PhotoImage(resized_load)
# labels can be text or images
self.img.configure(image=render)
self.img.image = render
self.img.pack(side=LEFT)
self.text2.delete(1.0,END)
self.text2.configure(yscrollcommand=self.scroll.set)
self.text2.tag_configure('bold_italics', font=('Arial', 12, 'bold', 'italic'))
self.text2.tag_configure('big', font=('Verdana', 20, 'bold'))
self.text2.tag_configure('color', foreground='#476042',
font=('Tempus Sans ITC', 12, 'bold'))
#self.text2.tag_bind('follow', '<1>', lambda e, t=self.text2: t.insert(END, "Not now, maybe later!"))
self.text2.insert(END,'Content of XML file\n\n', 'big')
quote = text
self.text2.insert(END, quote, 'color')
self.text2.pack(side=LEFT)
self.scroll.pack(side=RIGHT, fill=Y)
def predict(self):
    reps = pickle.load(open("reps.pkl", "rb"))
    # The classification of the loaded representations happens here;
    # the remainder of the method is elided in this listing.
    return text
def _len(self, item):
if(15>len(item)>2):
return True
else:
return False
def showBody(self):
load = Image.open("textm.jpg")
resized_load = load.resize((self.w//2, self.h), Image.ANTIALIAS)
render = ImageTk.PhotoImage(resized_load)
_xml = open(loaded_xml[0])
d = BeautifulSoup(_xml, "lxml-xml")
text= d.find_all("body")
sen = []
for i in text:
data1= self.t_r(str(i))
sen.extend(data1)
print(len(sen))
print(sen)
print(type(sen))
proc_content = list(filter(self._len, sen))
proc_content = [i.lower() for i in proc_content]
content_list[0] = proc_content
# labels can be text or images
self.img.configure(image=render)
self.img.image = render
self.img.pack(side=LEFT)
self.text2.delete(1.0,END)
self.text2.configure(yscrollcommand=self.scroll.set)
self.text2.tag_configure('bold_italics', font=('Arial', 12, 'bold', 'italic'))
self.text2.tag_configure('big', font=('Verdana', 20, 'bold'))
self.text2.tag_configure('color', foreground='#476042',
font=('Tempus Sans ITC', 12, 'bold'))
#self.text2.tag_bind('follow', '<1>', lambda e, t=self.text2: t.insert(END, "Not now, maybe later!"))
self.text2.insert(END,'Processed Content\n\n', 'big')
quote = proc_content
self.text2.insert(END, quote, 'color')
self.text2.pack(side=LEFT)
self.scroll.pack(side=RIGHT, fill=Y)
def classf(self):
load = Image.open("classf.jpg")
resized_load = load.resize((self.w//2, self.h), Image.ANTIALIAS)
render = ImageTk.PhotoImage(resized_load)
# labels can be text or images
self.img.configure(image=render)
self.img.image = render
self.img.pack(side=LEFT)
self.text2.delete(1.0, END)
self.text2.configure(yscrollcommand=self.scroll.set)
self.text2.tag_configure('bold_italics', font=('Arial', 12, 'bold', 'italic'))
self.text2.tag_configure('big', font=('Verdana', 20, 'bold'))
self.text2.tag_configure('color', foreground='#476042', font=('Tempus Sans ITC', 12, 'bold'))
#self.text2.tag_bind('follow', '<1>', lambda e, t=self.text2: t.insert(END, "Not now, maybe later!"))
self.text2.insert(END, 'Results\n\n', 'big')
# rep_str (the classification result string) is built in code elided
# from this listing.
quote = rep_str
self.text2.insert(END, quote, 'color')
self.text2.pack(side=LEFT)
self.scroll.pack(side=RIGHT, fill=Y)
def client_exit(self):
sys.exit(0)
Here, the root window is created. In this program it is the only window, but you can later
have windows within windows.
root = Tk() # A root window for displaying objects
root.geometry("750x450")
#creation of an instance
app = Window(master=root)
app.mainloop()
root.destroy()
CHAPTER 6
TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, sub-assemblies, assemblies and/or a finished product. It is
the process of exercising software with the intent of ensuring that the software system
meets its requirements and user expectations and does not fail in an unacceptable
manner. There are various types of tests; each test type addresses a specific testing
requirement.
Test Objective – Testing to ensure that each file doesn’t contain more than 10 posts.
Test Objective – Testing to ensure that the XML file is being loaded properly.
Test Objective – Using the word2vec model, we check whether the words are correctly
represented as vectors (a small sanity check of this kind is sketched below).
Test Objective – Testing to ensure that the loaded XML file results in correct
classification.
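For the word-vector check, a sanity test along these lines could be used; it assumes a trained gensim Word2Vec model named model, and the probe words are only examples.

# Every vocabulary word should map to a vector of the configured size,
# and related words should appear among each other's nearest neighbours.
for word in ["bully", "school"]:
    if word in model.wv:
        assert model.wv[word].shape == (model.vector_size,)
print(model.wv.most_similar("bully", topn=3))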
CHAPTER 7
SNAPSHOTS
CHAPTER 8
CONCLUSION AND FUTURE ENHANCEMENT
8.1 CONCLUSION
In our proposed system we were able to implement our project in a fair and good manner
with respect to the scope and relevance of social networking. In modern life the
incidence of cyberbullying is very high, and users of every age group are vulnerable to
such threats. Our project therefore has the steady intention of eradicating such incidents
through our social network. This is a social network which can be used by all age groups.
Since it is very user friendly it is easily accessible to anyone, and above all, owing to its
relevance, it is expected to be a success. The interface is very user friendly and hence
provides ease of usability.
This project also addresses the text-based cyberbullying detection problem, where
robust and discriminative representations of messages are critical for an effective
detection system. By designing semantic dropout noise and enforcing sparsity, we have
developed the semantic-enhanced marginalized denoising autoencoder as a specialized
representation learning model for cyberbullying detection. In addition, word
embeddings have been used to automatically expand and refine bullying word lists that
are initialized by domain knowledge. The performance of our approaches has been
experimentally verified through two cyberbullying corpora from social media:
Twitter and Myspace. As a next step we are planning to further improve the robustness
of the learned representation by considering word order in messages.
REFERENCES
[1] Rui Zhao and Kezhi Mao, “Cyberbullying Detection based on Semantic-Enhanced
Marginalized Denoising Auto-Encoder”, IEEE Transactions on Affective Computing, 2016.
[2] A. M. Kaplan and M. Haenlein, “Users of the world, unite! The challenges and
opportunities of social media,” Business Horizons, vol. 53, no. 1, pp. 59–68, 2010.
[3] R. M. Kowalski, G. W. Giumetti, A. N. Schroeder, and M. R. Lattanner, “Bullying in the
digital age: A critical review and meta-analysis of cyberbullying research among youth,”
2014.
[4] M. Ybarra, “Trends in technology-based sexual and non-sexual aggression over time
and linkages to nontechnology aggression,” National Summit on Interpersonal Violence
and Abuse Across the Lifespan: Forging a Shared Agenda, 2010.
[5] B. K. Biggs, J. M. Nelson, and M. L. Sampilo, “Peer relations in the anxiety–depression
link: Test of a mediation model,” Anxiety, Stress, & Coping, vol. 23, no. 4, pp. 431–447,
2010.
[6] S. R. Jimerson, S. M. Swearer, and D. L. Espelage, Handbook of Bullying in Schools: An
International Perspective. Routledge/Taylor & Francis Group, 2010.
[7] G. Gini and T. Pozzoli, “Association between bullying and psychosomatic problems: A
meta-analysis,” Pediatrics, vol. 123, no. 3, pp. 1059–1065, 2009.
[8] A. Kontostathis, L. Edwards, and A. Leatherman, “Text mining and cybercrime,” Text
Mining: Applications and Theory. John Wiley & Sons, Ltd, Chichester, UK, 2010.
[9] Q. Huang, V. K. Singh, and P. K. Atrey, “Cyber bullying detection using social and
textual analysis,” in Proceedings of the 3rd International Workshop on Socially-Aware
Multimedia. ACM, 2014, pp. 3–6.
[10] D. Yin, Z. Xue, L. Hong, B. D. Davison, A. Kontostathis, and L. Edwards, “Detection of
harassment on Web 2.0,” Proceedings of the Content Analysis in the Web, vol. 2, pp. 1–7,
2009.