b.e-cse-batchno-168
SOCIAL MEDIA
by
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC
MARCH - 2021
JEPPIAAR NAGAR, RAJIV GANDHI SALAI, CHENNAI - 600 119
www.sathyabama.ac.in
BONAFIDE CERTIFICATE
Internal Guide
DECLARATION
DATE:
ACKNOWLEDGEMENT
I would like to express my sincere and deep sense of gratitude to my Project Guide
Dr. G.MEERA GANDHI M.E, Ph.D., for her valuable guidance, suggestions and
constant encouragement, which paved the way for the successful completion of my
project work.
I wish to express my thanks to all Teaching and Non-teaching staff of the
Department of COMPUTER SCIENCE & ENGINEERING who were helpful in many
ways for the completion of the project.
ABSTRACT
This work argues, for practical reasons, for a focus on the latter problem. The
project shows that it is a much more challenging task, as an analysis of the
language in typical datasets shows that hate speech lacks unique, discriminative
features and is therefore found in the ‘long tail’ of a dataset, where it is
difficult to discover. The project then proposes Deep Neural Network structures
that serve as feature extractors and are particularly effective at capturing the
semantics of hate speech. These methods are evaluated on the largest collection of
Twitter-based hate speech datasets and are shown to outperform the state of the
art by up to 6 percentage points in macro-average F1, or 9 percentage points in
the more challenging case of identifying hateful content.
TABLE OF CONTENTS
1 INTRODUCTION
1.1 Objective
Existing System
Existing System Disadvantage
Proposed System
2 LITERATURE SURVEY
Motivation
Literature Survey
3 METHODOLOGY
Hardware Requirements
Software Requirements
Software Features
Python
NumPy Introduction
Pandas
Matplotlib
Machine Learning
Supervised Learning
Unsupervised Learning
TensorFlow
Modules Explanation
Data Preprocessing
Cyberbully Detection
Classification Algorithms
Experimental Results
Conclusion
REFERENCES
APPENDIX
LIST OF FIGURES
CHAPTER 1
INTRODUCTION
The exponential growth of social media such as Twitter and community
forums has revolutionised communication and content publishing, but it is also
increasingly exploited for the propagation of hate speech and the organisation of
hate-based activities [1, 3]. The anonymity and mobility afforded by such media have
made the breeding and spread of hate speech, eventually leading to hate crime,
effortless in a virtual landscape beyond the realms of traditional law enforcement.
The term ‘hate speech’ was formally defined as ‘any communication that disparages
a person or a group on the basis of some characteristics (to be referred to as types
of hate or hate classes) such as race, colour, ethnicity, gender, sexual orientation,
nationality, religion, or other characteristics’. In the UK, there has been a significant
increase in hate speech towards the migrant and Muslim communities following
recent events including leaving the EU and the Manchester and London attacks. In
the EU, surveys and reports focusing on young people in the EEA (European
Economic Area) region show rising hate speech and related crimes based on
religious beliefs, ethnicity, sexual orientation or gender: 80% of respondents have
encountered hate speech online and 40% felt attacked or threatened. Statistics also
show that in the US, hate speech and hate crime have been on the rise since the
Trump election. The urgency of this matter has been increasingly recognised, and a
range of international initiatives have been launched towards the qualification of the
problems and the development of countermeasures.
responses including being scared, frustrated, angry, or depressed. Cyberbullying can
take place on social media sites such as Facebook, Myspace, and Twitter. "By 2008,
93% of young people between the ages of 12 and 17 were online. In fact, youth
spend more time with media than any single other activity besides sleeping." The last
decade has witnessed a surge of cyberbullying, which is categorized as bullying that
occurs through the use of electronic communication technologies, such as e-mail,
instant messaging, social media, online gaming, or through digital messages or
images sent to a cellular phone.
There are many risks attached to social media sites, and cyberbullying is one of the
larger risks. One million children were harassed, threatened or subjected to other
forms of cyberbullying on Facebook during the past year, while 90 percent of social-
media-using teens who have witnessed online cruelty say they have ignored mean
behaviour on social media, and 35 percent have done so frequently. Ninety-five
percent of social-media-using teens who have witnessed cruel behaviour on social
networking sites say they have seen others ignoring the mean behaviour, and 55
percent have witnessed this frequently. Terms like "Facebook depression" have
been coined specifically in regard to the result of extended social media use, with
cyberbullying playing a large part in this.
OBJECTIVE
This project aims to classify textual content into non-hate or hate speech, in
which case the method may also identify the targeting characteristics (i.e., types
of hate, such as race and religion) in the hate speech.
PROBLEM STATEMENT
classify the polarity of the tweet where it is either positive or negative. If the
tweet has both positive and negative elements, the more dominant sentiment
should be picked as the final label. We use a dataset from Kaggle which was
crawled and labeled positive/negative. The data provided comes with
emoticons, usernames and hashtags, which must be processed and
converted into a standard form. We also need to extract useful features from the
text, such as unigrams and bigrams, which are a form of representation of the
tweet. We use various machine learning algorithms to conduct sentiment
analysis using the extracted features. However, relying on individual models alone
did not give high accuracy, so we pick the top few models to generate a model
ensemble. Ensembling is a meta-learning technique in which we
combine different classifiers in order to improve prediction accuracy. Finally,
we report our experimental results and findings at the end.
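The two ideas in the paragraph above, n-gram features and ensembling, can be illustrated with a minimal sketch. It is not the project's implementation: the helper names `ngrams` and `majority_vote`, the example sentence and the three model outputs are all hypothetical, and simple majority voting is only one of several possible ensembling schemes.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples (n=1 unigrams, n=2 bigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def majority_vote(predictions):
    """Combine the labels predicted by several models for one tweet.

    Ties are resolved in favour of the label that appears first in the list.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


tokens = "this movie was not good".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

# Hypothetical outputs of three classifiers for the same tweet
# (1 = positive, 0 = negative).
model_outputs = [0, 1, 0]
ensemble_label = majority_vote(model_outputs)

print(bigrams)
print(ensemble_label)
```

The bigram ("not", "good") shows why bigrams can help: the unigram "good" alone would suggest a positive tweet.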
EXISTING SYSTEM
PROPOSED SYSTEM
Also, as we shall show in the following, this problem may not be easily
mitigated by conventional methods of over- or under-sampling, because the real
challenge is the lack of unique, discriminative linguistic characteristics in
hate Tweets compared to non-hate Tweets.
As a proxy to quantify and compare the linguistic characteristics of hate and
non-hate Tweets, we propose to study the ‘uniqueness’ of the vocabulary for
each class.
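One way to make the ‘uniqueness’ of a class vocabulary concrete is sketched below. The definition used here (the fraction of a class's distinct words that occur in no other class) is an assumption chosen for illustration, not necessarily the measure used in this project; the function names and the toy tweets are hypothetical.

```python
def class_vocabulary(tweets):
    """Collect the set of distinct lower-cased words used in a list of tweets."""
    vocab = set()
    for tweet in tweets:
        vocab.update(tweet.lower().split())
    return vocab


def uniqueness(tweets_by_class):
    """For each class, return the share of its vocabulary found in no other class."""
    vocabs = {label: class_vocabulary(ts) for label, ts in tweets_by_class.items()}
    scores = {}
    for label, vocab in vocabs.items():
        others = set().union(*(v for other, v in vocabs.items() if other != label))
        scores[label] = len(vocab - others) / len(vocab) if vocab else 0.0
    return scores


# Toy data, invented for this example only.
toy = {
    "hate": ["you people are awful", "awful people go away"],
    "non_hate": ["you are kind", "people are great today friends"],
}
scores = uniqueness(toy)
print(scores)
```

A low score for the hate class would indicate that its tweets share most of their words with non-hate tweets, which is the difficulty described above.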
CHAPTER 2
LITERATURE SURVEY
MOTIVATION
LITERATURE SURVEY
[1] In Afsaneh Asaei et al., Perceptual Information Loss due to Impaired Speech
Production, phonological classes define articulatory-free and articulatory-bound
phone attributes. A deep neural network is used to estimate the probability of
phonological classes from the speech signal. In theory, a unique combination of
phone attributes forms a phoneme identity. Probabilistic inference of phonological
classes thus enables estimation of their compositional phoneme probabilities. A
novel information-theoretic framework is devised to quantify the information
conveyed by each phone attribute and to assess the speech production quality for
the perception of phonemes. As a use case, the authors hypothesize that disruption
in speech production leads to information loss in phone attributes, and thus to
confusion in phoneme identification. They quantify the amount of information loss
due to dysarthric articulation recorded in the TORGO database. A novel information
measure is formulated to evaluate the deviation from an ideal phone attribute
production, making it possible to distinguish healthy production from pathological
speech.
[2] Duanpei, M. Tanaka and R. Chen et al., A robust speech detection algorithm for
speech activated hands-free applications, describes a novel noise-robust speech
detection algorithm that can operate reliably in severe car noise conditions. High
performance has been obtained with the following techniques: (1) noise suppression
based on principal component analysis for pre-processing, (2) robust endpoint
detection using dynamic parameters and (3) speech verification using periodicity of
voiced signals with harmonic enhancement. Noise suppression improves the SNR as
compared with nonlinear spectrum subtraction by around 20 dB. This enables the
endpoint detection to operate reliably in SNRs down to -10 dB. In car environments,
road bump noises are problematic for speech detectors, causing mis-detection
errors. Speech verification helps to remove these errors. This technology is being
used in Sony car navigation products.
[3] M. Izzad, Nursuriati Jamil and Zainab Abu Bakar et al., Speech/Non-Speech
Detection in Malay Language Spontaneous Speech, aims to discriminate speech and
non-speech segments in Malay language spontaneous speech, as speech/non-speech
detection is important in many speech processing applications. Inaccurate sentence
boundaries are a major cause of errors in automatic speech recognition, and a
preprocessing stage that segments the speech signal into periods of speech and
non-speech is invaluable in improving the recognition accuracy. The authors
proposed a combination of three audio features, namely energy, zero crossing rate
(ZCR) and fundamental frequency (F0), for speech/non-speech detection, as each
feature has unique properties for discriminating speech and non-speech segments.
Experiments are conducted on one hour of Malay language spontaneous speech
consisting of more than 20,000 speech/non-speech segments. An accuracy assessment
reveals that the proposed method achieved a 97.8% accuracy rate. Non-speech
segments will further be used as candidates for sentence boundaries in their next
experiment.
[4] Bujar Raufi, Ildi Xhaferri et al., Application of Machine Learning Techniques for
Hate Speech Detection in Mobile Applications. The proliferation of data through
various platforms and applications is constantly increasing. The versatility of data
and its omnipresence make it very hard to detect the trustworthiness and intention
of the source. This is very evident in dynamic environments such as mobile
applications. As a result, designing mobile applications that monitor, control and
block any type of malicious intent is important. Techniques used: automatic hate
speech detection, machine learning, artificial neural networks (ANNs).
[5] Arum Sucia Saksesi, Muhammad Nasrun, Casi Setianingsih et al., Analysis Text
of Hate Speech Detection Using Recurrent Neural Network. Twitter matters greatly to
the building and the destruction of a person's image because of the many
opinionated sentences its users can post. Phrases with malicious meaning directed
at others are examples of hate speech. Such malicious views can be categorized as
hate speech, which is regulated in Article 28 of the ITE Law. Not a few people,
intentionally or unintentionally, post content on social media that contains hate
speech. Unfortunately, social media does not have the ability to aggregate
information about an existing conversation into a conclusion.
[6] Ioanna K. Lekea, Panagiotis Karampelas et al., Detecting Hate Speech within the
Terrorist Argument: A Greek Case. Hate speech can be used by a terrorist group as
a means of judging possible targets’ guilt and deciding on their punishment, as well
as a means of making people accept acts of terror, or even as propaganda for
possibly attracting new members. To decide how the automatic classification will
be performed, the authors experimented with different text analysis techniques such
as critical discourse and content analysis. Based on the preliminary results of these
techniques, a classification algorithm is proposed that can classify the communiqués
into three categories depending on the presence of hate speech.
[7] Axel Rodríguez, Carlos Argueta, Yi-Ling Chen et al., Automatic Detection of Hate
Speech on Facebook Using Sentiment and Emotion Analysis. Hate speech has been an
issue since the start of the Internet, but the advent of social media has brought it to
unimaginable heights. To address such an important issue, the paper explores a novel
framework to effectively detect highly discussed topics that generate hate speech on
Facebook. With the use of graph, sentiment, and emotion analysis techniques, posts on
prominent Facebook pages are clustered and analyzed. Hate speech is defined as conduct
that uses direct assault with words on people who have particular traits; this kind of
assault usually has a tendency towards violence or carries a tone of debasement.
CHAPTER 3
METHODOLOGY
The primary goal of this project is to detect cyberbullying words or tweets
in social media. One million children were harassed, threatened or subjected to other
forms of cyberbullying on social media during the past year, while 90 percent of
social-media-using teens who have witnessed online cruelty say they have ignored
the mean behaviour.
To stop this type of harassment, the project detects bullying words or tweets in
social media and tries to discourage the behaviour of the user.
HARDWARE REQUIREMENTS:
MOUSE : 3 BUTTONS
SOFTWARE REQUIREMENTS:
hidden, known or unknown, expected or unexpected from client's point of view.
TOOLS USED :
ANACONDA NAVIGATOR
PYTHON LIBRARIES
o NUMPY
o PANDAS
o MATPLOTLIB
o SKLEARN
o SEABORN
SOFTWARE FEATURES
In this part, the software used to develop the system is discussed.
PYTHON
Python is a high-level, interpreted, interactive and object-oriented scripting
language. Python is designed to be highly readable. It uses English keywords
frequently whereas other languages use punctuation, and it has fewer syntactical
constructions than other languages.
Python is Interactive: You can sit at a Python prompt and interact with the
interpreter directly to write your programs.
Python is Object-Oriented: Python supports Object-Oriented style or
technique of programming that encapsulates code within objects.
HISTORY OF PYTHON
Python was developed by Guido van Rossum in the late eighties and early
nineties at the National Research Institute for Mathematics and Computer Science in
the Netherlands.
Python is derived from many other languages, including ABC, Modula-3, C, C++,
Algol-68, SmallTalk, Unix shell, and other scripting languages.
Python is copyrighted. Like Perl, Python source code is freely available; it is
distributed under the open-source Python Software Foundation License, which is
compatible with the GNU General Public License (GPL).
FEATURES OF PYTHON
A simple language which is easier to learn, Python has a very simple and
elegant syntax. It's much easier to read and write Python programs compared to
other languages like C++, Java, C#. Python makes programming fun and allows you
to focus on the solution rather than syntax. If you are a newbie, it's a great choice to
start your journey with Python.
changes to the Python's source code. Python has a large community
constantly improving it in each iteration.
Portability
You can move Python programs from one platform to another and run them
without any changes. Python runs seamlessly on almost all platforms including
Windows, Mac OS X and Linux.
NUMPY INTRODUCTION
Numeric, the ancestor of NumPy, was developed by Jim Hugunin. Another package
Numarray was also developed, having some additional functionalities. In 2005,
Travis Oliphant created NumPy package by incorporating the features of Numarray
into Numeric package. There are many contributors to this open source project.
NumPy is often used along with packages like SciPy (Scientific Python) and
Matplotlib (plotting library). This combination is widely used as a replacement for
MATLAB, a popular platform for technical computing. However, the Python alternative
to MATLAB is now seen as a more modern and complete programming language.
NumPy − Environment
The best way to enable NumPy is to use an installable binary package
specific to your operating system. These binaries contain full SciPy stack
(inclusive of NumPy, SciPy, matplotlib, IPython, SymPy and nose packages
along with core Python).
PANDAS
MATPLOTLIB
Matplotlib is a plotting library for the Python programming language and its
numerical mathematics extension NumPy. It provides an object-oriented API for
embedding plots into applications using general-purpose GUI toolkits
like Tkinter, wxPython, Qt, or GTK+. There is also a procedural "pylab" interface
based on a state machine (like OpenGL), designed to closely resemble that
of MATLAB, though its use is discouraged. SciPy makes use of Matplotlib.
Matplotlib was originally written by John D. Hunter. Since then it has had an active
development community, and it is distributed under a BSD-style license. Michael
Droettboom was nominated as Matplotlib's lead developer shortly before John
Hunter's death in August 2012, and was later joined by Thomas Caswell.
Matplotlib 2.0.x supports Python versions 2.7 through 3.6. Python 3 support started
with Matplotlib 1.2. Matplotlib 1.4 is the last version to support Python 2.6. Matplotlib
has pledged not to support Python 2 past 2020 by signing the Python 3 Statement.
MACHINE LEARNING
[Figure: diagram with the labels DATA, RULES, COMPUTER and OUTPUT]
The core objective of machine learning is learning and inference. First of all, the
machine learns through the discovery of patterns. This discovery is made thanks to
the data. One crucial task of the data scientist is to choose carefully which data to
provide to the machine. The list of attributes used to solve a problem is called
a feature vector. You can think of a feature vector as a subset of data that is used
to tackle a problem. The machine uses algorithms to simplify reality and transform
this discovery into a model. Therefore, the learning stage is used to describe the
data and summarize it into a model.
Fig 3.5.2 learning phase
For instance, suppose the machine is trying to understand the relationship between
the wage of an individual and the likelihood of going to a fancy restaurant. It turns
out the machine finds a positive relationship between wage and going to a high-end
restaurant: this is the model.
Inferring
Fig 3.5.4 Machine learning algorithms
Machine learning can be grouped into two broad learning tasks: supervised and
unsupervised. There are many other algorithms.
SUPERVISED LEARNING
An algorithm uses training data and feedback from humans to learn the relationship
of given inputs to a given output. For instance, a practitioner can use marketing
expense and weather forecast as input data to predict the sales of cans.
You can use supervised learning when the output data is known. The algorithm will
predict new data.
Classification task
Regression task
Classification
Imagine you want to predict the gender of a customer for a commercial. You will start
gathering data on the height, weight, job, salary, purchasing basket, etc. from your
customer database. You know the gender of each of your customers; it can only be
male or female. The objective of the classifier is to assign a probability of being
male or female (i.e., the label) based on the information (i.e., the features) you have
collected. Once the model has learned how to recognize male or female, you can use
new data to make a prediction. For instance, you have just received new information
from an unknown customer, and you want to know whether the customer is male or
female. If the classifier predicts male = 70%, the algorithm is 70% sure that this
customer is male and 30% sure that the customer is female.
Regression
When the output is a continuous value, the task is a regression. For instance, a
financial analyst may need to forecast the value of a stock based on a range of
features such as equity, previous stock performance and macroeconomic indices. The
system will be trained to estimate the price of the stock with the lowest possible
error.
UNSUPERVISED LEARNING
You can use unsupervised learning when you do not know how to classify the data
and you want the algorithm to find patterns and group the data for you.
TENSORFLOW
The most famous deep learning library in the world is Google's TensorFlow.
Google uses machine learning in all of its products to improve the search
engine, translation, image captioning and recommendations.
To give a concrete example, Google users can experience a faster and more refined
search with AI. If the user types a keyword in the search bar, Google provides a
recommendation about what the next word could be.
TensorFlow Architecture
It is called TensorFlow because a tensor goes in, flows through a list of
operations, and then comes out the other side.
MODULES EXPLANATION
Fig 3.6.1 architecture diagram
Data Preprocessing
Social network data are noisy, so preprocessing is applied to improve the
quality of the input data. This includes removing stop words. Stop words are
common words such as “a”, “as”, “have”, “is”, “the”, “or”, etc. They are removed
mainly because they consume memory space and processing time.
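Stop-word removal can be sketched in a few lines. The stop-word set below contains only the examples named in the text and is an assumption for illustration (a real system would use a fuller list, such as the NLTK English list used in the appendix); the function name and the sample sentence are hypothetical.

```python
import re

# Illustrative stop-word set built from the examples in the text above;
# it is an assumption, not the list used in this project.
STOP_WORDS = {"a", "as", "have", "is", "the", "or"}


def remove_stop_words(tweet):
    """Lower-case a tweet, split it into word tokens and drop the stop words."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    return [token for token in tokens if token not in STOP_WORDS]


cleaned = remove_stop_words("The weather is nice, or so I have heard")
print(cleaned)
```

The regular expression also discards punctuation and digits, so the comma in the sample sentence does not reach the output.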
Feature Extraction
Parsing nouns, adjectives and pronouns involves two steps: part-of-speech
tagging, and extracting the nouns, adjectives and pronouns from the tagged output.
Part-of-speech tagging (POS tagging or POST), also called grammatical
tagging, is the process of marking up a word in a text as corresponding to a part of
speech.
The part-of-speech tagging is carried out using the package provided by
Stanford Natural Language Processing.
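The second step, extraction from the tagged output, can be sketched as follows. The example assumes the tagger emits Penn Treebank tags (NN* for nouns, JJ* for adjectives, PRP and PRP$ for pronouns); the tagged sentence is written by hand for illustration rather than produced by the Stanford tagger, and the function name is hypothetical.

```python
# Tag prefixes to keep, assuming Penn Treebank tags:
# NN, NNS, NNP, NNPS (nouns); JJ, JJR, JJS (adjectives); PRP, PRP$ (pronouns).
KEEP_PREFIXES = ("NN", "JJ", "PRP")


def extract_noun_adj_pronoun(tagged_tokens):
    """Keep only the words whose tag marks a noun, adjective or pronoun."""
    return [word for word, tag in tagged_tokens if tag.startswith(KEEP_PREFIXES)]


# Hand-tagged example sentence (not real tagger output).
tagged = [
    ("She", "PRP"),
    ("posted", "VBD"),
    ("a", "DT"),
    ("rude", "JJ"),
    ("comment", "NN"),
    ("yesterday", "NN"),
]
selected = extract_noun_adj_pronoun(tagged)
print(selected)
```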
Frequency Extraction
Cyberbully Detection
In cyberbully detection, the bullying words in the tweet contents are detected
using machine learning algorithms. The output file from the preprocessing step is
sent to the classification algorithms, where the trained classifier is used for
detection. The training dataset consists of a list of cyberbullying words. Using the
training dataset, the preprocessed Twitter dataset is tested for the presence of
bullying words. Natural language processing and logistic regression are the main
techniques used to detect the cyberbullying words present in the tweet contents.
Data Description:
The data is given as comma-separated values (CSV) files containing tweets and
their corresponding sentiments. The training dataset is a CSV file with rows of the
form tweet_id,sentiment,tweet, where tweet_id is a unique integer identifying the
tweet, sentiment is either 1 (positive) or 0 (negative), and tweet is the tweet text
enclosed in double quotes (""). Similarly, the test dataset is a CSV file with rows of
the form tweet_id,tweet.
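A file in the training format described above could be read as sketched below. The sketch assumes the file starts with a header row naming the three columns, which the text does not state; the sample rows and the function name are invented for illustration.

```python
import csv
import io

# Two invented rows in the tweet_id,sentiment,tweet layout. The header row is
# an assumption; the text above does not say whether the real file has one.
SAMPLE = (
    "tweet_id,sentiment,tweet\n"
    '1,1,"what a lovely, sunny day"\n'
    '2,0,"this is awful"\n'
)


def load_training_rows(handle):
    """Parse rows into (tweet_id, sentiment, tweet) tuples with integer ids and labels."""
    reader = csv.DictReader(handle)
    return [(int(r["tweet_id"]), int(r["sentiment"]), r["tweet"]) for r in reader]


rows = load_training_rows(io.StringIO(SAMPLE))
print(rows)
```

Because the tweet field is quoted, a comma inside a tweet (as in the first sample row) does not split it into extra columns.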
CLASSIFICATION ALGORITHMS
LeandRomaf, an experienced software engineer who is passionate about teaching
people how artificial intelligence systems work, says that “in recent years, there have
been significant breakthroughs in empowering computers to understand language
just as we do.”
This section gives a simple introduction to Natural Language Processing and how it
can be achieved.
Syntactic analysis and semantic analysis are the main techniques used to complete
Natural Language Processing tasks.
Syntax
Syntax refers to the arrangement of words in a sentence such that they make
grammatical sense.
In NLP, syntactic analysis is used to assess how the natural language aligns with the
grammatical rules.
Computer algorithms are used to apply grammatical rules to a group of words and
derive meaning from them.
Lemmatization: It entails reducing the various inflected forms of a word into a single
form for easy analysis.
Part-of-speech tagging: It involves identifying the part of speech for every word.
Semantics
Semantics refers to the meaning that is conveyed by a text. Semantic analysis is one
of the difficult aspects of Natural Language Processing that has not been fully
resolved yet.
Named entity recognition (NER): It involves determining the parts of a text that
can be identified and categorized into preset groups. Examples of such groups
include names of people and names of places.
Natural language generation: It involves using databases to derive semantic
intentions and convert them into human language.
LOGISTIC REGRESSION
In statistics, the logistic model (or logit model) is used to model the probability of a
certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick.
This can be extended to model several classes of events such as determining
whether an image contains a cat, dog, lion, etc. Each object being detected in the
image would be assigned a probability between 0 and 1 and the sum adding to one.
Logistic regression is a statistical model that in its basic form uses a logistic function
to model a binary dependent variable, although many more complex extensions
exist. In regression analysis, logistic regression (or logit regression) is estimating
the parameters of a logistic model (a form of binary regression). Mathematically, a
binary logistic model has a dependent variable with two possible values, such as
pass/fail which is represented by an indicator variable, where the two values are
labelled "0" and "1". In the logistic model, the log-odds (the logarithm of the odds) for
the value labelled "1" is a linear combination of one or more independent variables
("predictors"); the independent variables can each be a binary variable (two classes,
coded by an indicator variable) or a continuous variable (any real value). The
corresponding probability of the value labeled "1" can vary between 0 (certainly the
value "0") and 1 (certainly the value "1"), hence the labeling; the function that
converts log-odds to probability is the logistic function, hence the name. The unit of
measurement for the log-odds scale is called a logit, from logistic unit, hence the
alternative names. Analogous models with a different sigmoid function instead of the
logistic function can also be used, such as the probit model; the defining
characteristic of the logistic model is that increasing one of the independent variables
multiplicatively scales the odds of the given outcome at a constant rate, with each
independent variable having its own parameter; for a binary dependent variable this
generalizes the odds ratio.
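The relationship between log-odds, probability and the odds ratio described above can be checked numerically. The sketch below uses a single predictor; the intercept and coefficient values are made up for illustration and are not fitted to the project's data.

```python
import math


def logistic(z):
    """Convert log-odds z into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    """Convert a probability into log-odds (the logit), the inverse of logistic."""
    return math.log(p / (1.0 - p))


def odds(p):
    """Odds of the outcome labelled "1" for probability p."""
    return p / (1.0 - p)


# Hypothetical model parameters for one predictor x.
INTERCEPT = -1.0
BETA = 0.8


def predict_probability(x):
    """Probability of the value labelled "1": logistic of a linear combination."""
    return logistic(INTERCEPT + BETA * x)


p1 = predict_probability(1.0)
p2 = predict_probability(2.0)

# Raising x by one unit multiplies the odds by exp(BETA), whatever the start value.
odds_ratio = odds(p2) / odds(p1)
print(p1, p2, odds_ratio, math.exp(BETA))
```

The last two printed values agree, which is the "multiplicatively scales the odds at a constant rate" property in the text.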
CHAPTER 4
RESULTS
The data is tested with bigram features and the accuracy is calculated.
The data is tested in TF-IDF mode and the accuracy score is calculated,
along with the macro average and the weighted average.
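The difference between the macro average and the weighted average can be seen on a small example. The labels and predictions below are invented and are not the project's results; the helper names are hypothetical, and the sketch computes F1 only, whereas a full classification report also lists precision and recall.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return the F1 score of every label that occurs in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true, y_pred):
    """Macro average treats classes equally; weighted average weights by support."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[label] * support[label] for label in scores) / len(y_true)
    return macro, weighted


# Invented toy labels: four "neither" tweets and two "hate" tweets.
y_true = ["neither", "neither", "neither", "neither", "hate", "hate"]
y_pred = ["neither", "neither", "neither", "hate", "hate", "neither"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1, weighted_f1 = macro_and_weighted_f1(y_true, y_pred)
print(accuracy, macro_f1, weighted_f1)
```

With imbalanced classes, the macro average is pulled down by the weaker minority class, which is why it is the stricter figure for hate speech detection.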
This is the final output, in which the true values and the predicted values are
shown. The values denote the percentage of words classified as hate, offensive or
neither.
CHAPTER 5
CONCLUSION
Conclusion:
As hate speech continues to be a societal problem, the need for automatic hate
speech detection systems becomes more apparent. In this report, we proposed a
solution to the detection of hate speech and offensive language on Twitter through
machine learning using Bag of Words and TF-IDF values. We performed a
comparative analysis of Logistic Regression, Naive Bayes, Decision Tree,
Random Forest and Gradient Boosting on various sets of feature values and
model parameters. The results showed that Logistic Regression performs
comparatively better with the TF-IDF approach. We presented the current
problems for this task and our system, which achieves reasonable accuracy (89%)
as well as recall (84%). Given all the challenges that remain, there is a need for
more research on this problem, including both technical and practical matters.
Future work
Future work could add new features, such as removing cyberbullying tweets that are
posted on social media, increasing the accuracy of prediction, and warning users
about their tweets. We believe there are ways that design can help stop online
aggression. Adding live detection, or printing *** in place of bullying words, can
help reduce bullying and online distress. Better feedback from sites can encourage
users to report aggression or harassment. Finally, existing designs can help
support low-risk interventions. As we have seen, designing to help bystanders takes
careful planning. It also requires sensitivity to the ways people use social media.
Still, there is no shortage of ways to empower bystanders to stand up against
online bullying.
REFERENCES
[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of
Machine Learning Research, 3:993–1022, 2003.
[2] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM
Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[12] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed
representations of words and phrases and their compositionality. In Advances in
neural information processing systems, pages 3111–3119, 2013.
[13] V. Nahar, X. Li, and C. Pang. An effective approach for cyberbullying detection.
Communications in Information Science and Management Engineering, 2012.
[14] V. Nahar, X. Li, and C. Pang. An effective approach for cyberbullying detection.
Communications in Information Science and Management Engineering, 3(5):238,
2013.
[16] J.-M. Xu, K.-S. Jun, X. Zhu, and A. Bellmore. Learning from bullying traces in
social media. In Proceedings of the 2012 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies,
pages 656–666. Association for Computational Linguistics, 2012.
[18] R. Zhao and K. Mao. Semi-random projection for dimensionality reduction and
extreme learning machine in high-dimensional space. IEEE Computational
Intelligence Magazine, 10(3):30–41, 2015.
APPENDIX
SOURCE CODE
import string
import nltk
import seaborn
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
%matplotlib inline
## 2. Tokenizing
## 4. Stemming
stopwords = nltk.corpus.stopwords.words("english")
stopwords.extend(other_exclusions)
stemmer = PorterStemmer()
    '[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    # regex=True is required for pattern replacement in pandas 2.0 and later
    punc_remove = tweets.str.replace("[^a-zA-Z]", " ", regex=True)
    # tokenizing
dataset['processed_tweets'] = processed_tweets
dataset
# visualizing which of the word is most commonly used in the twitter dataset

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()
# visualizing which of the word is most commonly used for offensive speech
random_state=21, max_font_size=110).generate(offensive_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()
bigram_vectorizer = CountVectorizer(ngram_range=(1,2),max_df=0.75, min_df=1, max_features=10000)
bigram = bigram_vectorizer.fit_transform(processed_tweets).toarray()
#TF-IDF Features
tfidf
X = panda.DataFrame(bigram)
y = dataset['class'].astype(int)
# the default lbfgs solver does not support the l1 penalty, so liblinear is set explicitly
model = LogisticRegression(class_weight='balanced',penalty="l1", C=0.01, solver='liblinear').fit(X_train_bow,y_train)
y_preds = model.predict(X_test_bow)
print(report)
X = tfidf
y = dataset['class'].astype(int)
model = LogisticRegression().fit(X_train_tfidf,y_train)
y_preds = model.predict(X_test_tfidf)
print(report)