
Technical Answers for Real World Problems (TARP)

CSE1901
Assignment 3
Fake Social Media Account Detection using ML
In partial fulfilment for the award of the degree of B. Tech in Computer Science and
Engineering

Submitted by:
Pramit Karki (20BCE2896)
Anurag Karki (20BCE2907)
Shreya Karki (20BCE2899)

Under the Guidance of


Prof. Suresh A.

Fall Semester 2023-24


Fake Social Media Account Detection Using ML
SCOPE – Vellore Institute of Technology

1. Proposed Methodology:
a) Data Collection:
Current social media datasets collected using an Instagram scraper and Kaggle.
b) Data Preprocessing:
Cleaning up unnecessary data, filling in empty rows and columns, and fixing or
eliminating inaccurate, incomplete, or duplicate records from the dataset.
c) Feature Extraction:
Extraction of relevant features from the pre-processed data.
d) Training:
Train our machine learning algorithms on the extracted features.
e) Feature Importance Analysis:
Analyse the importance of each feature in the ML model.
f) Fine-Tuning:
The model is trained with different hyperparameters to improve its
performance.
g) Threshold Selection:
Develop adaptive thresholding strategies informed by anomaly detection and real-
time monitoring.
h) Model Deployment:
Deploy the trained model to detect fake accounts in real-time or batch processing.
i) Model Evaluation:
Performance is evaluated on a held-out test set.
j) Monitoring and Maintenance:
Create a continuous learning system that updates with fresh data and detects model
drift.
Fig. 1: Block Diagram. The pipeline flows from Data Collection (social media
dataset) into Data Pre-Processing (data cleaning, imputation, oversampling,
data integration, normalization, reduction), then Feature Extraction, then
Training (AdaBoost and SVM on the training set), then Evaluation (AdaBoost and
SVM classification on the test set), ending in two output classes: Fake
Accounts (bots) and Actual Accounts (real users).


1. Data Collection
As machine learning grows more prevalent, collecting and labelling massive
volumes of data becomes more crucial. The core process in the machine-learning
pipeline is gathering the data needed to train the ML model, and the accuracy of
the predictions made by an ML system is only as good as its training datasets.
The effectiveness and accuracy of the algorithm depend on the quality and
correctness of the data entered into the system, so the datasets ultimately
determine the output. We therefore gather a large dataset of Instagram profiles,
including both real and fake accounts; web scraping techniques or API access can
be used to collect this data.

2. Data Preprocessing:
In this step we handle missing data and outliers, normalize or scale the feature
values as necessary, and split the dataset into training and testing sets to
evaluate the model's performance.
We are only interested in data normalization, as our data is already clean and
has no missing values.
• Data normalization: The amount of processing and memory required for
training iterations depends on the magnitude of the data. Normalization
reduces the order of magnitude of the feature values, hence reducing the
computational load.
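The normalization step can be sketched with scikit-learn's MinMaxScaler; the three-row array below is a toy stand-in for the real dataset, and the column names in the comment are assumptions rather than the project's actual features:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy profile data: columns are (hypothetically) followers, following, posts.
X = np.array([[150.0, 300.0, 12.0],
              [9000.0, 120.0, 340.0],
              [45.0, 2000.0, 3.0]])

scaler = MinMaxScaler()             # rescales each column to the [0, 1] range
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```

After scaling, every column spans [0, 1], so features with large raw magnitudes (such as follower counts) no longer dominate training.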

3. Feature extraction:
The next step is to extract relevant features from the pre-processed data. This could
involve using techniques like dimensionality reduction or feature selection to identify
the most informative features. These features could include: profile picture
analysis (e.g., image quality, face detection), number of followers and
followings, engagement metrics (likes, comments, posts), account age, bio and
caption text analysis, frequency and timing of posts, user activity (e.g.,
frequency of logins), hashtag usage, etc.
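As a minimal sketch of this step, the function below turns one scraped profile record into a numeric feature vector; the field names (`username`, `followers`, `bio`, and so on) are assumptions about what the scraper returns, not the project's actual schema:

```python
def extract_features(profile: dict) -> list:
    """Map a raw profile record to numeric features (illustrative only)."""
    username = profile["username"]
    followers = profile["followers"]
    following = profile["following"]
    bio = profile.get("bio", "")
    return [
        followers,
        following,
        followers / max(following, 1),       # follower/following ratio
        sum(c.isdigit() for c in username),  # digits in the username
        len(bio),                            # bio length
        int(profile.get("has_profile_pic", False)),
    ]

sample = {"username": "user12345", "followers": 20,
          "following": 1500, "bio": "", "has_profile_pic": False}
print(extract_features(sample))
```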
4. Training:
The next step is to train our machine learning algorithms on the extracted
features. For instance, the AdaBoost classifier is a sequence of weak
classifiers: each weak classifier is trained on the weighted training data, and
the weights assigned to misclassified samples are increased at each iteration so
that the next weak classifier focuses on the harder examples.
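A minimal training run might look as follows; synthetic data from `make_classification` stands in for the extracted profile features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the extracted features (6 features per profile).
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Default base estimator: depth-1 decision trees (decision stumps).
clf = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```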
5. Feature Importance Analysis:
The next step is to analyse the importance of each feature in the AdaBoost model. This
can help us understand which features are most informative for detecting fake accounts.
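After fitting, scikit-learn's AdaBoost exposes a `feature_importances_` attribute; the sketch below ranks stand-in features by it (the feature names are illustrative, not the project's real columns):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=1)
names = ["followers", "following", "ratio", "bio_len", "post_freq"]  # assumed

clf = AdaBoostClassifier(random_state=1).fit(X, y)
ranked = sorted(zip(names, clf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```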

6. Fine-Tuning:

The model is then trained with different hyperparameters of the AdaBoost
algorithm to improve its performance. This might include adjusting the number
of weak classifiers (base estimators) or the learning rate.
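This tuning step can be sketched with scikit-learn's GridSearchCV; the grid values below are illustrative, not tuned for the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=6, random_state=2)

grid = GridSearchCV(
    AdaBoostClassifier(random_state=2),
    param_grid={"n_estimators": [25, 50, 100],   # number of weak classifiers
                "learning_rate": [0.5, 1.0]},
    cv=3, scoring="f1")
grid.fit(X, y)
print(grid.best_params_)
```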

7. Threshold Selection:

An appropriate threshold is then determined for classifying profiles as fake or
real, for example one that balances precision and recall.
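One common way to pick such a threshold is from the precision-recall curve of the predicted probabilities; the sketch below (on synthetic stand-in data) selects the threshold that maximises F1:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import precision_recall_curve

X, y = make_classification(n_samples=400, n_features=6, random_state=3)
clf = AdaBoostClassifier(random_state=3).fit(X, y)
proba = clf.predict_proba(X)[:, 1]      # estimated P(fake) per profile

prec, rec, thresholds = precision_recall_curve(y, proba)
f1 = 2 * prec * rec / (prec + rec + 1e-12)
best = thresholds[f1[:-1].argmax()]     # threshold maximising F1
labels = (proba >= best).astype(int)    # 1 = fake, 0 = real
print(float(best))
```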

8. Model Deployment:

The next step is to deploy the trained AdaBoost model to detect fake Instagram accounts
in real-time or batch processing. Implement mechanisms for periodic model updates to
adapt to changing fake account patterns.

9. Evaluation:
After deploying the model, the next step is to evaluate its performance on a
held-out test set. The AdaBoost model is evaluated using appropriate metrics
such as accuracy, precision, recall, F1-score, and ROC AUC on the testing
dataset.
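The listed metrics can be computed on a held-out split as follows (synthetic data stands in for the real test set):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=6, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=4)

clf = AdaBoostClassifier(random_state=4).fit(X_tr, y_tr)
pred = clf.predict(X_te)
scores = {
    "accuracy": accuracy_score(y_te, pred),
    "precision": precision_score(y_te, pred),
    "recall": recall_score(y_te, pred),
    "f1": f1_score(y_te, pred),
    "roc_auc": roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]),
}
for metric, value in scores.items():
    print(f"{metric}: {value:.3f}")
```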

10. Monitoring and Maintenance:

Then the model's performance is continuously monitored and adapted to emerging
fake account strategies. The training dataset is regularly updated to include
new examples of fake and real accounts.
Overall, this architecture provides a framework to detect fake Instagram accounts. By
comparing multiple classifiers, we can use the model with the best performance and
identify fake accounts more accurately.

The algorithm we use to train the model is briefly explained below:

AdaBoost
Adaptive Boosting (AdaBoost) is a popular ensemble learning algorithm that can
be used for classification tasks. It works by combining multiple weak
classifiers to form a strong classifier. Each weak classifier is trained on the
weighted training data, and the weights of misclassified samples are increased
to improve the performance of the next weak classifier. AdaBoost does not use
bootstrapping. Classifiers with higher accuracy are assigned higher weight in
the final output, which is given by:

H(x) = sign( Σt αt ht(x) )

where,
ht(x) = weak classifier t's output for x,
αt = weight allotted to classifier t.
αt can be calculated as: αt = 1/2 * ln((1 - E)/E),
where E is the error rate of weak classifier t. Each training sample has an
equal weight initially. Boosting benefits most when the component classifiers
are weak; applying the boosting technique to already robust classifiers yields
little additional gain.
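The weight formula can be checked numerically; this small sketch shows that lower-error weak classifiers receive larger weights:

```python
import math

def alpha(E: float) -> float:
    """Classifier weight: alpha_t = 1/2 * ln((1 - E) / E)."""
    return 0.5 * math.log((1 - E) / E)

print(alpha(0.1))   # low error    -> large positive weight
print(alpha(0.5))   # chance level -> zero weight (classifier ignored)
print(alpha(0.6))   # below chance -> negative weight (vote inverted)
```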
Platform Used:

(a) Hardware:
The evaluation of the suggested model is performed on a laptop PC with the
following hardware configuration:
1) an Intel Core i5-10300H processor (4 cores / 8 threads, 2.50 GHz base clock);
2) 16 GB of physical memory;
3) an NVIDIA GeForce GTX 1650 Ti graphics card.

(b) Software:
Programming Language:
Python: Python is a high-level programming language widely used for data analysis,
machine learning, web development, and more.
Libraries:
1) Scikit-learn: Scikit-learn is a powerful Python library for machine learning that
provides tools for classification, regression, clustering, and more.
2) NumPy: NumPy is a fundamental Python library for scientific computing that
enables numerical operations on multi-dimensional arrays and matrices.
3) Matplotlib: Matplotlib is a Python library for data visualization that allows
creating charts, graphs, and other graphical representations of data.
4) Instaloader: Instaloader is a Python package that enables downloading pictures,
videos, and other media from Instagram.
IDE:
1) Jupyter Notebook: Jupyter Notebook is an interactive web-based environment
that allows writing, executing, and sharing code in various programming
languages, including Python.
2) VS Code: Visual Studio Code is a free source code editor developed by
Microsoft that supports many programming languages, debugging, Git integration,
and more.
