Final Project Report
ON
Submitted by
BACHELOR OF TECHNOLOGY
in
NOV 2024
SRM INSTITUTE OF SCIENCE & TECHNOLOGY
(Under Section 3 of UGC Act, 1956)
BONAFIDE CERTIFICATE
SIGNATURE
SIGNATURE
Cyberbullying has emerged as a significant social issue with the rapid growth of
social media platforms, impacting mental health and well-being. Despite various
the Hinglish language, a popular blend of Hindi and English widely used in social
analyze and classify these messages effectively. Key NLP preprocessing steps,
refine the dataset. To enhance feature extraction, we used the Term Frequency-Inverse
Document Frequency (TF-IDF) model, capturing relevant features that represent the
In the machine learning phase, multiple algorithms were implemented and evaluated,
language in Hinglish.
Streamlit, allowing users to input text and receive real-time feedback on potential
The findings from this research underline the feasibility of using machine learning for
models in creating safe online environments. Future research can build upon this
expanding the application to detect other forms of abusive content across additional
multilingual settings.
ACKNOWLEDGEMENTS
atmosphere for doing research. All through the work, in spite of his busy
Karan Rana
Author
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENTS
ABBREVIATIONS
LIST OF SYMBOLS
1 INTRODUCTION
1.1 Background and Problem Statement
1.2 Objective of the Study
1.3 Scope and Significance of Project
2 LITERATURE SURVEY
2.1 Challenges and Existing Approaches
2.2 Machine Learning and NLP in Cyberbullying Detection
3 SYSTEM ANALYSIS
4 SYSTEM DESIGN
4.1 Tables and Figures
6 CONCLUSION
7 FUTURE ENHANCEMENTS
8 REFERENCES
LIST OF TABLES AND FIGURES
4. Pie chart depicting the distribution of the bullying and non-bullying dataset
5. Research methodology
7. Server-side working
AI - Artificial Intelligence
BI - Business Intelligence
FN - False Negative
FP - False Positive
LR - Logistic Regression
ML - Machine Learning
NB - Naïve Bayes
TN - True Negative
TP - True Positive
UI - User Interface
UX - User Experience
CHAPTER 1
INTRODUCTION
With the explosive growth of social media, communication barriers are shrinking,
enabling people to connect across the world in real time. However, alongside the
positive effects, this digital revolution has fueled the rise of cyberbullying—a form of
online harassment that can severely impact mental well-being. In India and similar
detection systems that typically focus on single-language texts. This project tackles
results obtained, and the implications of this approach for creating safer digital
specific solutions.
1. Background and Problem Statement
In recent years, cyberbullying has emerged as a serious social issue, with cases
rising due to increased digital connectivity. This form of bullying differs from
cyberbullying in Hinglish.
and K-Nearest Neighbors (KNN) to find the most effective model for
Hinglish.
where users can enter text and receive real-time feedback on potentially
3. Objective of this Subtopic: Clearly state what the study aims to achieve and the
practical goals of the project in both technical and user-centered terms.
this project fills a critical gap in current research, which primarily focuses on
model, the project provides valuable insights into handling other language
CHAPTER 2
LITERATURE SURVEY
Cyberbullying detection has gained significant attention because of its detrimental impact
on mental health and online safety. Early work in this domain relied on rule-based approaches and
keyword matching to identify bullying content in social media. These methods proved
inadequate for complex linguistic structures, sarcasm, and indirect bullying language, and
machine learning techniques have since emerged as a more effective approach to automatic
cyberbullying detection.
Studies such as those by Dadvar et al. (2013) and Kwok & Wang (2013) leveraged supervised
learning algorithms (e.g., SVM, Decision Trees) to classify texts as abusive or non-abusive. Their
findings demonstrated that while traditional models could capture simple bullying patterns, they
struggled with more nuanced forms of bullying, such as covert harassment, indirect
communication, or bullying through images.
More recent work incorporates deep learning models such as Recurrent Neural
Networks (RNNs) and Convolutional Neural Networks (CNNs) for improved feature extraction
from textual data. Gambäck & Sikdar (2017) proposed using character-level embeddings to
address the challenges of informal text such as slang, emojis, and code-switching. These
advances have improved detection accuracy, yet challenges remain for code-mixed languages
like Hinglish, where traditional models are less effective.
Objective of this Subtopic: To highlight the evolution of cyberbullying detection techniques,
focusing on the shift from rule-based to machine learning and deep learning models, and to discuss
the limitations of existing approaches in multilingual contexts.
Machine learning, particularly NLP techniques, has proven essential in enhancing the accuracy of
cyberbullying detection systems. NLP methods such as tokenization, sentiment analysis, and part-
of-speech tagging are commonly employed to extract meaningful features from social media content.
Machine learning classifiers like Naive Bayes (NB), Logistic Regression (LR), and Support
Vector Machines (SVM) are often used to categorize texts based on whether they contain bullying
or non-bullying content.
A study by Founta et al. (2018) explored the use of TF-IDF (Term Frequency-Inverse Document
Frequency) for feature extraction and applied classifiers like SVM and Random Forests to detect
cyberbullying in social media posts. The results showed that SVM performed particularly well in
binary classification tasks, demonstrating its ability to handle high-dimensional feature spaces
effectively. However, these models still face challenges when dealing with informal, non-standard
text forms like Hinglish, which requires specialized preprocessing techniques for better
classification.
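To make the TF-IDF weighting mentioned above concrete, the following is a minimal sketch in plain Python. It uses the textbook form tf × log(N / df); libraries often apply smoothed variants, so the exact numbers may differ from those produced by a production toolkit. The function name `tf_idf` and the three sample sentences are illustrative assumptions and are not taken from the project dataset.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses the unsmoothed textbook form: tf(term, doc) * log(N / df(term)).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Illustrative romanized Hinglish sentences (hypothetical examples).
docs = [
    "tu bahut accha hai".split(),
    "tu bahut bura hai".split(),
    "match bahut accha tha".split(),
]
weights = tf_idf(docs)

# "bahut" occurs in every document, so its weight is zero;
# "bura" occurs in only one document, so it receives the highest weight there.
print(weights[1])
```

A word that appears in every message carries no discriminating information under this scheme, which is why rare, message-specific terms end up dominating the feature vector.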
The need for custom language models has been emphasized in research focusing on languages with
significant code-switching, like Hinglish. Vyas et al. (2019) proposed a hybrid model combining
rule-based methods with machine learning to improve detection accuracy for Hinglish. Their
approach involved a custom dictionary of common Hinglish terms and slang expressions, which
helped boost the model’s performance in detecting bullying content in mixed-language texts.
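The idea of a custom dictionary can be illustrated with a short sketch. This is not the method of the cited study; it only shows, under stated assumptions, how a lookup table might map common romanized spelling variants to a single canonical form before classification. The dictionary entries and the function name `normalize` are hypothetical.

```python
import re

# Hypothetical spelling-variant dictionary; a real one would be far larger
# and curated from the target data.
HINGLISH_VARIANTS = {
    "nhi": "nahi",
    "nahin": "nahi",
    "kyu": "kyun",
    "h": "hai",
    "bhot": "bahut",
    "bht": "bahut",
}


def normalize(text, variants=HINGLISH_VARIANTS):
    """Lowercase, extract alphabetic tokens, and map known variants to one form."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [variants.get(tok, tok) for tok in tokens]


print(normalize("Tu bhot accha h, kyu nhi aaya?"))
# ['tu', 'bahut', 'accha', 'hai', 'kyun', 'nahi', 'aaya']
```

Collapsing variants in this way reduces vocabulary sparsity, so that a feature extractor sees "bhot" and "bht" as the same term.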
Objective of this Subtopic: To examine the role of machine learning and NLP in detecting
cyberbullying, with a focus on the algorithms and feature extraction techniques that have been
employed in recent studies, as well as the limitations faced in multilingual environments.
CHAPTER 3
SYSTEM ANALYSIS
1. Functional Requirements
● Text Input and Preprocessing: The system must accept social media posts,
comments, or tweets in Hinglish as input. The text should undergo
preprocessing, including tokenization, language detection, and conversion of
Hinglish text into a usable format for the model.
● Cyberbullying Classification: The core functionality of the system is to
classify input text into "bullying" or "non-bullying" categories. This
classification will be done using machine learning models, with accuracy
optimized for Hinglish.
● Model Training and Evaluation: The system must train a machine learning
model on a labeled dataset of Hinglish text to identify bullying content. It
should also evaluate model performance using metrics such as accuracy,
precision, recall, and F1-score.
● User Interface: A simple, user-friendly interface (such as a web application)
will allow users to input Hinglish text and receive feedback on whether the text
contains cyberbullying content.
● Real-time Feedback: The system must provide real-time feedback, displaying
results almost instantly after a user submits text. The output should inform the
user whether the message is classified as cyberbullying, along with a
confidence score indicating the certainty of the classification.
● Multilingual Handling: The system should be able to detect Hinglish, with the
capability to handle mixed languages, informal vocabulary, and slang terms
common in Indian social media.
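The evaluation metrics named in the requirements above follow directly from the TP, FP, TN and FN counts. The sketch below computes them in plain Python; the function name `classification_metrics`, the label strings, and the sample predictions are assumptions made for illustration only.

```python
def classification_metrics(y_true, y_pred, positive="bullying"):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    # Guard each ratio against division by zero.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical ground truth and predictions: TP=2, FN=1, FP=1, TN=4.
y_true = ["bullying"] * 3 + ["non-bullying"] * 5
y_pred = ["bullying", "bullying", "non-bullying",
          "bullying", "non-bullying", "non-bullying",
          "non-bullying", "non-bullying"]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

When the two classes are imbalanced, accuracy alone can look good for a model that rarely flags bullying, which is why precision, recall and F1 are reported alongside it.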
2. Non-Functional Requirements
In addition to functional requirements, the system must also fulfill the following
non-functional requirements:
3. System Architecture
The system architecture is designed to ensure seamless processing from text input to
classification output. The architecture includes several layers that handle different
tasks, as outlined below:
● Data Collection Layer: This layer is responsible for collecting Hinglish data
from social media platforms (e.g., Twitter, Facebook, Instagram). Data can be
scraped using APIs or collected from publicly available datasets of social
media posts. The dataset will contain both bullying and non-bullying messages
for training purposes.
● Data Preprocessing Layer: This layer is essential for cleaning and preparing
the Hinglish text data for machine learning. The preprocessing steps include:
○ Tokenization: Splitting the text into individual words or tokens.
○ Language Detection: Identifying the Hindi-English language mix and
normalizing the text.
○ Noise Removal: Removing unnecessary symbols, links, or stop words.
○ Slang Detection and Translation: Identifying Hinglish-specific slang
and regional variations and converting them into a more structured form.
● Machine Learning Layer: After preprocessing, the data is passed to the
machine learning models. Several algorithms such as Support Vector
Machine (SVM), Naive Bayes (NB), and Logistic Regression (LR) are
trained using the labeled data. These models are evaluated based on
performance metrics to determine the most effective one for cyberbullying
detection.
● Prediction and Feedback Layer: This layer handles user inputs and runs
predictions using the trained machine learning models. Once the model
classifies the text, feedback is generated, and the user is notified in real-time
whether the text contains cyberbullying content or not.
● User Interface Layer: The final layer is the user interface, which allows users
to interact with the system through a simple web application. The interface will
accept text inputs, process them, and display results to the user.
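The tokenization and noise-removal steps of the preprocessing layer described above can be sketched as follows. This is a minimal illustration that assumes the input is romanized (Latin-script) Hinglish; text in Devanagari would be stripped by the letter filter. The stop-word list, the regular expressions, and the function name `preprocess` are hypothetical choices, not the project's actual implementation.

```python
import re

# Hypothetical stop-word list mixing English and romanized Hindi function words.
STOP_WORDS = {"the", "is", "a", "hai", "ka", "ki", "ko", "aur"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # links
TAG_RE = re.compile(r"[@#]\w+")                 # mentions and hashtags
NON_LETTER_RE = re.compile(r"[^a-z\s]")         # digits, punctuation, emojis


def preprocess(text):
    """Lowercase, strip noise, tokenize on whitespace, drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]


print(preprocess("@rahul yeh movie bahut boring hai!!! https://example.com #review"))
# ['yeh', 'movie', 'bahut', 'boring']
```

Dropping hashtags entirely is a simplification; a system that wants to keep their content could strip only the leading symbol instead.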
4. Tools and Technologies
CHAPTER 4
SYSTEM DESIGN
1. System Architecture
The system follows a layered architecture to separate different concerns and ensure
modularity. It consists of the following components:
● Data Collection Layer: This module is responsible for gathering social media
posts or comments that are written in Hinglish. The data is drawn from
publicly available datasets or fetched from social media APIs (such as Twitter
or Facebook) using keywords related to cyberbullying. The data consists of a
mix of text types, including informal language, slang, and code-switched
language. The collected data is stored in a structured format, such as a CSV file
or a database, for later preprocessing and model training.
● Preprocessing Layer: The raw text data collected from social media is
preprocessed to make it suitable for machine learning. This module performs
several steps:
○ Tokenization: Breaking down the text into individual words or phrases.
○ Noise Removal: Removing unnecessary characters, links, stop words, or
irrelevant data.
○ Text Normalization: Converting informal text or Hinglish into a
standard format. This includes converting slang terms into their
corresponding English meanings or standardized forms.
○ Feature Extraction: Using techniques like TF-IDF (Term Frequency-
Inverse Document Frequency) or Word2Vec for converting text into
numerical features that can be used by machine learning algorithms.
● Machine Learning Layer: This core component contains the machine learning
algorithms that are trained on the labeled data to classify text as
"cyberbullying" or "non-cyberbullying." The training process involves:
○ Data Splitting: Dividing the dataset into training, validation, and testing
sets.
○ Model Training: Using various classification algorithms like Support
Vector Machines (SVM), Naive Bayes (NB), and Logistic Regression
(LR). The models are trained using the processed and feature-extracted
data.
○ Model Evaluation: Assessing the model’s performance using accuracy,
precision, recall, and F1-score. The best-performing model is selected
based on these metrics.
● Prediction and Feedback Layer: After training the models, this module takes
input text from the user and classifies it using the selected model. Once a user
submits a text (e.g., a comment or tweet), the system processes the text through
the preprocessing module and then feeds it into the trained model for
prediction. The model outputs a label indicating whether the text is classified as
"cyberbullying" or "non-cyberbullying." Additionally, a confidence score is
generated to indicate the certainty of the prediction.
● User Interface Layer: The system provides an interface where users can
interact with the cyberbullying detection system. This user-friendly interface is
built using web technologies such as HTML5, CSS3, and JavaScript, and is
implemented through a framework like Flask or Django. The user enters the
text into an input field, and the result is displayed within seconds. The interface
is designed to be simple and intuitive, with options to enter different forms of
social media text and get real-time results.
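As an illustration of how a classifier in the machine learning layer can produce both a label and a confidence score, the following is a small multinomial Naive Bayes with add-one smoothing written in plain Python. It is a sketch under stated assumptions, not the project's trained model: the class name `NaiveBayes`, the four training sentences, and their labels are hypothetical, and a real system would train on the full labeled dataset.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict_proba(self, tokens):
        n_docs = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / n_docs)  # class prior
            for tok in tokens:
                if tok not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total + len(self.vocab)))
            log_scores[label] = score
        # Normalize log scores into probabilities that sum to 1.
        peak = max(log_scores.values())
        exp = {k: math.exp(v - peak) for k, v in log_scores.items()}
        norm = sum(exp.values())
        return {k: v / norm for k, v in exp.items()}

    def predict(self, tokens):
        probs = self.predict_proba(tokens)
        label = max(probs, key=probs.get)
        return label, probs[label]


# Hypothetical toy training data (already tokenized).
train_docs = [
    "tu pagal hai".split(),
    "tu bekaar insaan hai".split(),
    "aaj match accha tha".split(),
    "kal milte hai dost".split(),
]
train_labels = ["cyberbullying", "cyberbullying",
                "non-cyberbullying", "non-cyberbullying"]

model = NaiveBayes().fit(train_docs, train_labels)
label, confidence = model.predict("tu pagal insaan".split())
print(label, round(confidence, 2))
```

Naive Bayes probabilities tend to be overconfident, so the confidence score is best read as a relative indication and not as a calibrated probability.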
1. User Input: The user enters a social media post or comment written in
Hinglish into the system via the web interface.
2. Preprocessing: The text is passed through the preprocessing module, where it
is tokenized, normalized, and transformed into features that the machine
learning model can understand.
3. Prediction: The preprocessed text is sent to the trained machine learning
model, which classifies the text as "cyberbullying" or "non-cyberbullying."
4. Output: The prediction result is displayed to the user, along with a confidence
score.
5. Feedback: Users are given feedback to help understand the results. If the text
is classified as cyberbullying, a message can be provided with suggestions or
warnings.
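The five steps above can be tied together in a single request handler. The sketch below is only an outline of that flow: `keyword_score` is a stand-in for the trained model and simply returns a bullying probability from a hypothetical word list, while the 0.5 threshold, the feedback messages, and all function names are assumptions made for illustration.

```python
import re

# Hypothetical word list used by the stand-in scorer below.
FLAGGED = {"pagal", "bekaar", "stupid", "loser"}


def preprocess(text):
    """Step 2: lowercase the text and extract alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_score(tokens):
    """Stand-in for the trained model: bullying probability in [0, 1]."""
    if not tokens:
        return 0.0
    hits = sum(tok in FLAGGED for tok in tokens)
    return min(1.0, 2 * hits / len(tokens))


def handle_submission(text, score_fn=keyword_score, threshold=0.5):
    """Steps 1-5: take user text and return label, confidence and feedback."""
    tokens = preprocess(text)                       # step 2: preprocessing
    p_bullying = score_fn(tokens)                   # step 3: prediction
    is_bullying = p_bullying >= threshold
    label = "cyberbullying" if is_bullying else "non-cyberbullying"
    # Step 4: confidence is the probability of the predicted class.
    confidence = p_bullying if is_bullying else 1.0 - p_bullying
    # Step 5: feedback message shown next to the result.
    feedback = ("This message may be hurtful. Consider rephrasing it."
                if is_bullying else "No cyberbullying detected.")
    return {"label": label,
            "confidence": round(confidence, 2),
            "feedback": feedback}


print(handle_submission("Tu pagal hai"))
# {'label': 'cyberbullying', 'confidence': 0.67, 'feedback': 'This message may be hurtful. Consider rephrasing it.'}
```

Because the scorer is passed in as `score_fn`, the same handler could wrap any trained model that exposes a bullying probability.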
FIGURES
[Figure: Pie chart depicting the distribution of bullying and non-bullying messages in the dataset]
[Figure: Research methodology]
[Figure: Server-side working]
CONCLUSION
FUTURE ENHANCEMENT
● Web Scraping and API Integrations: Integrating the system with social media
platforms like Twitter, Facebook, or Instagram using their APIs would allow the
system to scan posts and comments in real-time. This can be achieved by setting up
scheduled scrapers or streaming APIs to collect social media data as it is posted.
● Batch Processing for Large Datasets: To ensure scalability, implementing batch
processing for large datasets can improve performance. In situations where real-time
analysis is not necessary, processing large amounts of historical data can be done in
batches, using distributed computing frameworks like Apache Spark or Hadoop.
● Cloud Integration: Deploying the system on a cloud platform like AWS, Google
Cloud, or Microsoft Azure would allow the system to scale as needed. The cloud
infrastructure can accommodate spikes in traffic and large data volumes, ensuring
seamless real-time performance. Cloud services can also facilitate the storage and
access of huge datasets, which is essential for handling social media content.
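The batch-processing idea above can be sketched on a single machine before moving to a distributed framework. The example below is a minimal illustration that splits any iterable of messages into fixed-size batches and passes each batch to a classifier callback; `batched`, `classify_in_batches`, and the default batch size are hypothetical names and values, and the sketch does not use Apache Spark or Hadoop.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def classify_in_batches(texts, classify_batch, batch_size=1000):
    """Apply classify_batch to each batch and collect results in order."""
    results = []
    for batch in batched(texts, batch_size):
        results.extend(classify_batch(batch))
    return results


# Usage with a placeholder classifier that just returns message lengths.
lengths = classify_in_batches(["a", "bb", "ccc"],
                              lambda batch: [len(t) for t in batch],
                              batch_size=2)
print(lengths)  # [1, 2, 3]
```

Processing in batches keeps memory use bounded, since only one batch of historical messages needs to be held at a time.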
REFERENCES
[1] B. Dean, “How many people use social media in 2021? (65+ statistics),”
Sep. 2021. [Online]. Available: https://backlinko.com/social-media-users