Tarp Da 3
CSE1901
Assignment 3
Fake Social Media Account Detection using ML
In partial fulfilment for the award of the degree of B. Tech in Computer Science and
Engineering
Submitted by:
Pramit Karki (20BCE2896)
Anurag Karki (20BCE2907)
Shreya Karki (20BCE2899)
1. Proposed Methodology:
a) Data Collection:
Current social media datasets are collected using an Instagram scraper and from
Kaggle.
b) Data Preprocessing:
Process of cleaning up the unnecessary data, filling in the empty rows and columns,
and fixing or eliminating inaccurate, incomplete, or duplicate data from a dataset.
c) Feature Extraction:
Extraction of relevant features from the pre-processed data.
d) Training:
Train our machine learning algorithms on the extracted features.
e) Feature Importance Analysis:
Analyse the importance of each feature in the ML model.
f) Fine-Tuning:
The model is experimented with different hyperparameters to improve its
performance.
g) Threshold Selection:
Develop adaptive thresholding strategies informed by anomaly detection and real-
time monitoring.
h) Model Deployment:
Deploy the trained model to detect fake accounts in real-time or batch processing.
i) Model Evaluation:
Performance is evaluated on a held-out test set.
j) Monitoring and Maintenance:
Create a continuous learning system that updates with fresh data and detects model
drift.
[Architecture diagram: Data Collection → Social Media Dataset → Reduction → Feature Extraction → AdaBoost / SVM (Training) → Evaluation on test datasets → AdaBoost / SVM (Classification)]
2. Data preprocessing:
In this step we handle missing data and outliers, normalize or scale the feature values
as necessary, and split the dataset into training and testing sets to evaluate the model's
performance.
We are only interested in data normalization, as our data is already clean and has no
missing values.
• Data normalization: The processing time and memory required per training
iteration depend on the magnitude of the values in the dataset. Normalization
rescales the features to a common range, reducing their order of magnitude.
3. Feature extraction:
The next step is to extract relevant features from the pre-processed data. This could
involve using techniques like dimensionality reduction or feature selection to identify
the most informative features. These features could include: Profile picture analysis
(e.g., image quality, face detection), Number of followers and followings, Engagement
metrics (likes, comments, posts), Account age, Bio and caption text analysis, Frequency
and timing of posts, User activity (e.g., frequency of logins), Hashtag usage, etc.
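A minimal sketch of turning a raw profile into a numeric feature vector; the field names and the `extract_features` helper are hypothetical, not the actual scraper output:

```python
# Hypothetical raw profile record (field names are illustrative).
profile = {
    "followers": 42,
    "followings": 1800,
    "posts": 3,
    "bio": "",
    "has_profile_pic": False,
    "account_age_days": 20,
}

def extract_features(p):
    """Turn a raw profile dict into a numeric feature vector."""
    return [
        p["followers"],
        p["followings"],
        # Followers-to-followings ratio: fake accounts often follow
        # many users while attracting few followers.
        p["followers"] / max(p["followings"], 1),
        p["posts"],
        len(p["bio"]),                  # bio length as a cheap text proxy
        int(p["has_profile_pic"]),
        p["account_age_days"],
    ]

features = extract_features(profile)
print(features)
```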
4. Training:
The next step is to train our machine learning algorithms on the extracted features. For
instance, the AdaBoost classifier is a sequence of weak classifiers, and each weak
classifier is trained on a subset of the data. The weights assigned to misclassified
samples are adjusted at each iteration to improve the performance of the next weak
classifier.
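The training step can be sketched with scikit-learn's `AdaBoostClassifier`, shown here on synthetic data standing in for the extracted profile features (the real data comes from the scraper and Kaggle):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the extracted profile features.
X, y = make_classification(n_samples=500, n_features=7, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The default base estimator is a depth-1 decision tree ("stump"),
# the classic weak learner; each round re-weights misclassified samples.
clf = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
clf.fit(X_train, y_train)

acc = clf.score(X_test, y_test)
print(acc)
```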
5. Feature Importance Analysis:
The next step is to analyse the importance of each feature in the AdaBoost model. This
can help us understand which features are most informative for detecting fake accounts.
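With scikit-learn, the fitted model exposes `feature_importances_`, a weighted average of the importances assigned by each weak learner (synthetic data is used for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, n_features=5,
                           n_informative=3, random_state=0)
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

# Importances are normalized across features; higher means the
# ensemble relied on that feature more often when splitting.
for i, imp in enumerate(clf.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```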
6. Fine-Tuning:
Then the model is experimented with different hyperparameters of the AdaBoost
algorithm to improve its performance. This might include adjusting the number of
weak classifiers (base estimators) or the learning rate.
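A grid search over these two hyperparameters could look like the following sketch (synthetic data; the grid values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=7, random_state=0)

# Cross-validated search over the two hyperparameters named above.
param_grid = {
    "n_estimators": [25, 50, 100],
    "learning_rate": [0.5, 1.0],
}
search = GridSearchCV(AdaBoostClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
```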
7. Threshold Selection:
Then an appropriate threshold is determined for classifying profiles as fake or real,
for example one that balances precision and recall.
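One way to pick such a threshold is to sweep the precision-recall curve and keep the threshold with the best F1 score; the sketch below uses synthetic data in place of real profiles:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=7, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

clf = AdaBoostClassifier(random_state=1).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]  # estimated P(fake) per profile

# F1 at each candidate threshold; keep the threshold maximizing it.
prec, rec, thresholds = precision_recall_curve(y_te, scores)
f1 = 2 * prec * rec / np.maximum(prec + rec, 1e-12)
best = thresholds[f1[:-1].argmax()]  # f1[:-1] aligns with thresholds

print(f"chosen threshold: {best:.3f}")
```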
8. Model Deployment:
The next step is to deploy the trained AdaBoost model to detect fake Instagram accounts
in real-time or batch processing. Implement mechanisms for periodic model updates to
adapt to changing fake account patterns.
9. Evaluation:
After deploying the model, the next step is to evaluate its performance on a held-out
test set. The AdaBoost model is evaluated using appropriate metrics such as accuracy,
precision, recall, F1-score, and ROC AUC on the testing dataset.
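These metrics can be computed with scikit-learn; the labels and scores below are a small hand-made example (1 = fake, 0 = real), not real evaluation results:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Hand-made toy labels: 1 = fake, 0 = real.
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]   # ground truth
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]   # thresholded predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]  # predicted P(fake)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_score))  # uses scores, not labels
```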
Then the model's performance is continuously monitored and adapted to emerging fake
account strategies. The training dataset is regularly updated to include new examples of
fake and real accounts.
Overall, this architecture provides a framework to detect fake Instagram accounts. By
comparing multiple classifiers, we can use the model with the best performance and
identify fake accounts more accurately.
The algorithm we use to train the model is briefly explained below:
AdaBoost
Adaptive Boosting (AdaBoost) is a popular ensemble learning algorithm that can be
used for classification tasks. It works by combining multiple weak classifiers to form a
strong classifier. Each weak classifier is trained on a subset of the data and assigns
weights to misclassified samples to improve the performance of the next weak
classifier. Unlike bagging, AdaBoost does not use bootstrapping. Classifiers with
higher accuracy are assigned higher weights in computing the final output.
The final strong classifier is:

H(x) = sign( Σt αt ht(x) )

where,
ht(x) = weak classifier t's output for x
αt = weight allotted to classifier t.
αt is computed from the classifier's error rate E as: αt = 1/2 * ln((1 - E)/E).
Initially, every training sample is given an equal weight.
The component classifiers gain the most from boosting when they are weak; applying
the boosting technique to already robust classifiers yields little additional gain.
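A quick numeric check of the weight formula: a weak classifier with 20% error receives a positive weight, one at chance level (E = 0.5) receives zero, and one worse than chance would receive a negative weight:

```python
import math

def classifier_weight(error):
    """AdaBoost classifier weight: alpha_t = 1/2 * ln((1 - E) / E)."""
    return 0.5 * math.log((1 - error) / error)

print(classifier_weight(0.2))  # ~0.693 (better than chance: positive weight)
print(classifier_weight(0.5))  # 0.0    (chance level: no say in the vote)
```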
Platform Used:
(a) Hardware:
The evaluation of the suggested model is performed on a laptop PC with the following
hardware configuration:
1) an Intel Core i5-10300H processor (8 logical cores, 2.50 GHz base clock);
2) 16 GB of physical memory;
3) an NVIDIA GeForce GTX 1650 Ti graphics card.
(b) Software:
Programming Language:
Python: Python is a high-level programming language widely used for data analysis,
machine learning, web development, and more.
Libraries:
1) Scikit-learn: Scikit-learn is a powerful Python library for machine learning that
provides tools for classification, regression, clustering, and more.
2) NumPy: NumPy is a fundamental Python library for scientific computing that
enables numerical operations on multi-dimensional arrays and matrices.
3) Matplotlib: Matplotlib is a Python library for data visualization that allows
creating charts, graphs, and other graphical representations of data.
4) Instaloader: Instaloader is a Python package that enables downloading pictures,
videos, and other media from Instagram.
IDE:
1) Jupyter Notebook: Jupyter Notebook is an interactive web-based environment
that allows writing, executing, and sharing code in various programming
languages, including Python.
2) VS Code: Visual Studio Code is a free source code editor developed by Microsoft
that supports many programming languages, debugging, Git integration, and more.