
Industry Oriented Mini Project Report

on

SOFTWARE DRIVEN WASTE SEGREGATION
USING MACHINE LEARNING AND CNN

Submitted in partial fulfilment of the requirements for the award of the degree of

Bachelor of Technology

in

Computer Science and Engineering

by
ABHIJITH S R – 21J21A0501

SAI KRISHNA RAO K – 21J21A0536

JAGADEESH KUMAR K – 22J25A0507

Under the Supervision of

Dr. D. Magdalene Delighta Angeline, B.Tech., M.Tech., Ph.D.


Associate Professor

Department of Computer Science and Engineering

JOGINPALLY B.R. ENGINEERING COLLEGE


Accredited by NAAC with A+ Grade, Recognized under Sec. 2(f) of UGC Act, 1956
Approved by AICTE, Affiliated to JNTUH, Hyderabad and ISO 9001:2015 Certified
Bhaskar Nagar, Yenkapally, Moinabad (Mandal)
R.R (Dist)-500075. T.S., India

JOGINPALLY B.R. ENGINEERING COLLEGE
Accredited by NAAC with A+ Grade, Recognized under Sec. 2(f) of UGC Act, 1956
Approved by AICTE, Affiliated to JNTUH, Hyderabad and ISO 9001:2015 Certified
Bhaskar Nagar, Yenkapally, Moinabad (Mandal)
R.R (Dist)-500075. T.S., India

CERTIFICATE

The Industry Oriented Mini Project entitled “SOFTWARE DRIVEN WASTE

SEGREGATION USING MACHINE LEARNING AND CNN” has been submitted by
ABHIJITH S R (21J21A0501), SAI KRISHNA RAO K (21J21A0536), and JAGADEESH KUMAR
K (22J25A0507) in partial fulfilment of the requirements for the award of Bachelor of Technology
in Computer Science and Engineering to Jawaharlal Nehru Technological University Hyderabad.
It is a record of bonafide work carried out under our guidance and supervision. In our opinion,
this report is of the standard required for the degree of Bachelor of Technology.

PROJECT SUPERVISOR HEAD OF THE DEPARTMENT

Dr. D. Magdalene Delighta Angeline, B.Tech., M.Tech., Ph.D. Dr. T. Prabakaran, B.E., M.E., Ph.D.
Associate Professor Professor

EXTERNAL EXAMINER

DECLARATION OF THE STUDENT

We hereby declare that the Industry Oriented Mini Project entitled “SOFTWARE DRIVEN
WASTE SEGREGATION USING MACHINE LEARNING AND CNN”, presented under the
supervision of Dr. D. Magdalene Delighta Angeline, B.Tech., M.Tech., Ph.D., Associate Professor, and
submitted to Joginpally B.R. Engineering College, is original and has not been submitted in part or
whole for a Bachelor's degree to any other university.

ABHIJITH S R – 21J21A0501
SAI KRISHNA RAO K - 21J21A0536
JAGADEESH KUMAR K – 22J25A0507

ACKNOWLEDGEMENT

We would like to take this opportunity to place on record that this Project Report would
never have taken shape but for the cooperation extended to us by certain individuals. Though it is
not possible to name all of them, it would be unpardonable on our part if we did not mention some
of the very important persons.

Sincerely, we acknowledge our deep sense of gratitude to our project supervisor


Dr. D. MAGDALENE DELIGHTA ANGELINE, B.Tech., M.Tech., Ph.D., for her constant
encouragement, help, and valuable suggestions.

We express our gratitude to Dr. T. PRABAKARAN, B.E., M.E., Ph.D., HOD of Computer
Science and Engineering, for his valuable suggestions and advice.

We express our gratitude to Dr. B. VENKATA RAMANA REDDY, Principal of

JOGINPALLY B.R. ENGINEERING COLLEGE, for his valuable suggestions and advice. We also
extend our thanks to the other faculty members for their cooperation during our project work.

Finally, we would like to thank our parents and friends for their cooperation to complete
this Project Report.

ABHIJITH S R – 21J21A0501
SAI KRISHNA RAO K - 21J21A0536
JAGADEESH KUMAR K – 22J25A0507

ABSTRACT

The Smart Recycling System aims to enhance waste management efficiency through software-
driven waste identification and classification. By leveraging image processing techniques with OpenCV
and machine learning algorithms, the system classifies recyclable materials such as plastic, paper, and
metal from uploaded images. The system utilizes a Convolutional Neural Network (CNN) model, trained
on a custom dataset, to identify various waste types with high accuracy. The user interacts with the
system by uploading images of waste, which the software then processes and classifies, suggesting the
correct recycling category for each material. This approach eliminates manual sorting, helping to reduce
waste segregation errors. The system’s intuitive software interface ensures ease of use, even for
individuals unfamiliar with recycling guidelines. By focusing solely on software, this project provides
an effective and scalable solution to optimize recycling processes without the need for hardware or IoT
integration.

TABLE OF CONTENTS

CHAPTER CONTENTS PAGE No.


ABSTRACT 5
LIST OF FIGURES 8
LIST OF TABLES 9
LIST OF ABBREVIATIONS 10

1 INTRODUCTION 11
1.1 Objective 12
1.2 Scope and Challenges 12
1.3 Problem Analysis 13
2 LITERATURE REVIEW 15
2.4 Existing System Analysis 15
2.5 Areas for Improvement 16
2.6 Proposed System 17

3 SYSTEM ANALYSIS 19
3.1 Functional Requirements 21
3.2 Non-Functional Requirements 22
3.3 Hardware Requirements 22
3.4 Software Requirements 22
3.5 Data Collection Process 23
3.6 Deliverables and Beneficiaries 23
3.7 Algorithm 23

3.8 Methodology 24
4 SYSTEM FEASIBILITY 25
4.1 Economic Feasibility 26
4.2 Technical Feasibility 28

4.3 Social Feasibility 31


5 SYSTEM DESIGN 35
5.1 E-R Diagram 35
5.2 System Architecture 36

5.3 UML Diagrams 38


5.3.1 Use case Diagram 39
5.3.2 Class Diagram 39
5.3.3 Sequence Diagram 40
6 SYSTEM IMPLEMENTATION 42
6.1 Software Architecture 42
6.1.1 Backend Development 42
6.1.2 Database Design 42
6.2 Model Development and Integration 43
6.2.1 Model Training and Evaluation 43
6.2.2 Model Deployment 43
6.3 Integration of Software Components 43
6.3.1 API Development 43
6.3.2 Data Flow 43
6.4 Testing and Deployment 43
6.4.1 Unit Testing 43
6.4.2 System Testing 44
6.4.3 Deployment 44
6.5 Expected Outcomes 44
6.5.1 Efficient Waste Management 44
6.5.2 Environmental Impact 44
7 SYSTEM TESTING 45
7.1 Unit Testing 45
7.2 Integration Testing 45
7.3 Functional Testing 45
7.4 System Testing 46
7.5 Regression Testing 46
7.6 Performance Testing 46
7.7 Testing Cases
8 RESULTS 47
9 CONCLUSION AND FUTURE ENHANCEMENT 54
9.1 Conclusion 54
9.2 Future Enhancement 54
REFERENCES 56
APPENDIX 57
A SOURCE CODE 57
B DATABASE TABLES 70
C INTERNSHIP CERTIFICATES 72
D PUBLICATION 75

LIST OF FIGURES

FIGURE NO. FIGURE NAME PAGE NO.

1.1 Waste Segregation 10

1.2 Waste Classification 13

2.1 Waste Management 16

3.1 Recycling Methods 20

4.1 Waste Cycle 25

4.2 Waste Flow 30

5.1 E-R Diagram 35

5.2 System Architecture 38

5.3 Use Case Diagram 39

5.4 Class Diagram 40

5.5 Sequence Diagram 41

6.1 Testing Cases

8.1 Correlation Heatmap 46

8.2 Neural Network Confusion Matrix 48

8.3 Random Forest Vs Number of Trees 50

LIST OF TABLES

Table No. Table Name Page No.


B.1 Test Cases 47

LIST OF ABBREVIATIONS

AI - Artificial Intelligence
ML - Machine Learning
DL - Deep Learning
API - Application Programming Interface
DB - Database
SQL - Structured Query Language
JSON - JavaScript Object Notation
UI - User Interface
UX - User Experience
IDE - Integrated Development Environment
CSV - Comma-Separated Values
SVM - Support Vector Machine
KNN - K-Nearest Neighbours
RF - Random Forest
OCR - Optical Character Recognition
UAT - User Acceptance Testing
ER - Entity-Relationship
RAM - Random Access Memory

CHAPTER 1
INTRODUCTION

The code is focused on data preprocessing, feature engineering, and machine learning model
training for a smart bin dataset. It starts by loading the data, handling missing values, and encoding
categorical features. The next steps involve detecting and removing outliers, followed by the creation
of new features based on the existing ones. Afterward, a correlation analysis is performed to identify
relationships between features. The data is scaled for machine learning models that are sensitive to
feature scaling. The code then trains and evaluates various machine learning models (KNN, SVM,
Logistic Regression, Decision Tree, Random Forest, and Neural Networks) using performance
metrics like accuracy, precision, recall, and confusion matrices to predict the target variable, likely
related to waste classification or recycling.
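
As a brief illustration, the loading, imputation, and encoding steps described above might look like the following minimal sketch, assuming the dataset file Smart_Bin.csv named in Chapter 3 and the categorical column names used throughout this report:

```python
# A minimal sketch of the preprocessing steps, assuming the dataset file
# Smart_Bin.csv and the categorical columns (Class, Container Type,
# Recyclable fraction) referenced elsewhere in this report.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("Smart_Bin.csv")

# Replace missing values in numeric columns with the column median
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].median())

# Encode categorical features as integers for the learning algorithms
for col in ["Class", "Container Type", "Recyclable fraction"]:
    if col in df.columns:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
```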

Once the dataset is cleaned, the next step involves identifying and removing outliers that could
potentially skew the analysis. This is done using boxplots for visual inspection, followed by replacing
extreme outlier values with the mean of the respective columns. Afterward, the program calculates
the change in fill levels (FL_C, FL_C_3, FL_C_12) based on the difference between the FL_A and
FL_B columns, which likely represent different states or measurements of the dataset over time.
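
A sketch of this outlier handling and the derived fill-level feature is shown below. The threshold of 100 follows the rule quoted in Chapter 3, and only the basic FL_C difference is shown, since the exact definitions of FL_C_3 and FL_C_12 are not given in the report:

```python
# A sketch of the outlier treatment and fill-level difference feature,
# assuming df is the DataFrame produced by the preprocessing step above.
import matplotlib.pyplot as plt

# Visual inspection of outliers with boxplots
df[["FL_A", "FL_B"]].plot(kind="box")
plt.show()

# Replace extreme values (here, above 100, per Chapter 3) with the column mean
for col in ["FL_A", "FL_B"]:
    df.loc[df[col] > 100, col] = df[col].mean()

# Change in fill level derived from FL_A and FL_B
df["FL_C"] = df["FL_A"] - df["FL_B"]
```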

Fig.1.1 Waste Segregation

The correlation matrix is then computed to understand the relationships between various
features, and a heatmap is plotted to visualize the strength of these correlations. Standard scaling is
then applied to the selected columns so that features are on a comparable scale and no single
feature disproportionately influences the model due to differing units or ranges.
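
These two steps might be sketched as follows, assuming seaborn is available and df holds the engineered features from the previous steps:

```python
# A sketch of the correlation analysis and scaling step; df is assumed to
# hold the numeric features named in this report (FL_A, FL_B, FL_C, VS).
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Heatmap of pairwise correlations between numeric features
sns.heatmap(df.select_dtypes(include="number").corr(), annot=True, cmap="coolwarm")
plt.show()

# Standardize the selected columns so no feature dominates by scale
cols = ["FL_A", "FL_B", "FL_C", "VS"]
df[cols] = StandardScaler().fit_transform(df[cols])
```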

The dataset is then split into training and testing sets, where features (FL_A, FL_C, and VS)
are used to predict the target variable (Class). Various classification algorithms are tested, including
K-Nearest Neighbours (KNN), Support Vector Machine (SVM), Logistic Regression, Decision Tree,
Neural Networks, and Random Forest. Each model is trained on the training data and tested on the
testing set to evaluate its performance. Metrics like accuracy, precision, recall, F1 score, and the
Matthews correlation coefficient (MCC) are calculated for each model to assess the quality of the
predictions.
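
A condensed sketch of this training-and-evaluation loop, using the feature set (FL_A, FL_C, VS) and target (Class) named above, could look like this (df is assumed to be the scaled DataFrame from the earlier steps):

```python
# A condensed sketch of the train/test split and multi-model evaluation.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

X = df[["FL_A", "FL_C", "VS"]]
y = df["Class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Neural Network": MLPClassifier(max_iter=1000),
    "Random Forest": RandomForestClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Accuracy, weighted F1, and MCC, as described in the text
    print(name, accuracy_score(y_test, pred),
          f1_score(y_test, pred, average="weighted"),
          matthews_corrcoef(y_test, pred))
```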
The code also includes plotting functions for visualizing the model scores for KNN, decision
tree, and random forest classifiers for different hyperparameters (like the number of neighbours or
trees). These visualizations help identify the optimal number of neighbours or trees that maximize the
model’s performance. Additionally, confusion matrices are plotted for each model, providing a clear
view of how well each algorithm distinguishes between the different classes.

1.1 Objective

The objective of the provided code is to preprocess a dataset, train multiple machine learning
models, evaluate their performance, and identify the best model for predicting a target variable, Class,
based on several input features related to sensor data from waste bins. The code seeks to apply various
classification algorithms to the pre-processed data and compare their effectiveness in terms of
accuracy and other evaluation metrics. Ultimately, the goal is to determine which machine learning
model offers the best predictive performance for classifying the waste data.
A key part of the objective is to handle and clean the data effectively before model training.
The preprocessing includes handling missing values, removing outliers, and encoding categorical
variables. Missing values are replaced with the median of respective columns to ensure that the dataset
is complete and does not introduce biases due to missing data. Outliers are identified using boxplots
and replaced with the column mean to prevent them from skewing the results. The encoding of
categorical variables like Class, Container Type, and Recyclable fraction ensures that all features are
in a numerical format suitable for machine learning algorithms.
Another significant aspect of the objective is to scale the features to ensure that no individual
feature dominates the learning process due to differences in scale. Feature scaling using
standardization (StandardScaler) is applied to the numerical features, ensuring that each feature
contributes equally to the model. This is particularly important for distance-based algorithms like
KNN and SVM, which are sensitive to the magnitude of the features. The scaling step ensures that
the models are not biased towards features with larger numerical ranges.

1.2 Scope and Challenges

The scope of the provided code is focused on applying machine learning techniques to a real-
world dataset, which involves sensor data from waste bins. The dataset contains various features that
help in predicting the class of waste, such as Class, Container Type, Recyclable fraction, and sensor
readings like temperature and humidity. The primary goal is to preprocess the data, train different
machine learning models, evaluate their performances, and identify the best-suited model for the
classification task. This objective is pertinent to waste management systems, where accurate
classification of waste is essential for efficient sorting and recycling processes.
One of the key aspects of the scope is the use of multiple machine learning algorithms to solve
the classification problem. The code implements algorithms like K-Nearest Neighbors (KNN),
Support Vector Machine (SVM), Logistic Regression, Decision Trees, Multi-Layer Perceptron
(MLP), and Random Forest. Each model is chosen based on its distinct characteristics, and the code
aims to assess their effectiveness in classifying waste into predefined categories. This comparison is
important to identify the strengths and weaknesses of different algorithms, ultimately selecting the
one that provides the best accuracy and reliability for waste classification.
One of the main challenges in the code arises from data preprocessing. The dataset contains
missing values, and the process of imputing these missing values with the median could introduce
biases if the missingness is not random. For example, if certain classes or features have higher missing
rates, simply imputing them may distort the overall distribution of the data. Additionally, the
treatment of outliers could be another challenge. While replacing outliers with the median helps
mitigate their impact, this approach may not always be appropriate for all types of data or
distributions, potentially affecting model performance.

1.3 Problem Analysis

The problem presented in the code involves classifying waste data, where the goal is to predict
the class of waste based on various sensor features such as temperature, humidity, and other
characteristics associated with waste containers. In essence, it is a classification problem where the
system must accurately identify the category of waste based on the input sensor readings. The
challenge lies in working with a real-world dataset that may contain missing values, noisy data,
imbalanced classes, and features of varying types, which can affect the performance of machine
learning models. Addressing these challenges is crucial for creating an effective waste classification
system.
One significant problem is the presence of missing data in the dataset. Real-world datasets often
contain gaps due to errors in data collection, sensor malfunctions, or non-responses from the monitored
environment. In this case, the missing values are imputed using the median value of the corresponding
feature. However, the imputation strategy is simplistic and may not always be ideal, particularly if the
missingness is not random. If the data is missing in a non-random manner, the imputation of missing
values using the median could introduce bias into the dataset, which may affect the performance of the
models, leading to inaccurate predictions or misrepresentations of the waste classes. This challenge is
particularly evident when dealing with large datasets that are highly sensitive to data quality.
Feature scaling also presents a challenge, especially when using algorithms like K-Nearest
Neighbors (KNN), Support Vector Machine (SVM), and Multi-Layer Perceptron (MLP). These models
are sensitive to the scale of the features, meaning that differences in the magnitude of features could
lead to biased results. For example, if one feature has values ranging from 0 to 1 and another ranges from
1000 to 10000, models like SVM or KNN could give undue importance to the feature with the larger
range, distorting the classification results. While the code addresses this issue by standardizing the
numerical features, the challenge remains that different algorithms react differently to scaled data. Tree-
based models like Random Forest or Decision Trees are not sensitive to feature scaling, which makes
the preprocessing step more complex when trying to balance the needs of all models.
Another important problem is the handling of categorical variables, such as Class (the target
variable), Container Type, and Recyclable Fraction. These features are encoded using label encoding,
which assigns numerical values to each category. While this method works for models like decision
trees, which are not sensitive to the ordinal nature of categorical variables, it can introduce issues for
models that assume a continuous relationship between the encoded values. For example, linear models
or logistic regression may misinterpret the numerical labels as having an inherent ordinal relationship,
which is not the case for all categorical variables. This misinterpretation can lead to less accurate
predictions. As a result, a more advanced encoding method such as one-hot encoding or target encoding
may be needed to better represent the categorical data for certain models.

Fig.1.2 Waste Classification

CHAPTER 2
LITERATURE REVIEW

Smart Bins are an advanced innovation aimed at revolutionizing traditional waste management
practices by introducing automation, data collection, and intelligent processing. These bins are
designed to address common inefficiencies, such as delayed waste collection, improper segregation,
and low recycling rates, through the integration of sensor technology and data analytics.
2.1 Fill Level Monitoring
Sensors installed in the bins continuously monitor how full they are and report this data as fill levels,
typically categorized as FL_A (low), FL_B (medium), and FL_C (high). This information is
transmitted to a centralized system to ensure timely waste collection. For instance, when a bin
approaches FL_C, it triggers an alert to prevent overflow, thereby maintaining hygiene and aesthetics
in urban areas.
2.2 Waste Type Detection
Smart Bins can identify the type of waste they contain. By utilizing specialized sensors or image
processing techniques, they can distinguish between organic, plastic, metal, and paper waste. This
automated classification eliminates human intervention and improves segregation accuracy, which is
critical for efficient recycling processes.
2.3 Environmental Condition Monitoring
For bins storing organic waste, additional sensors can measure parameters like temperature and
moisture. This data helps monitor decomposition rates and detect any issues, such as the potential
release of harmful gases, allowing timely intervention.
By leveraging these features, Smart Bins not only automate waste segregation but also optimize
waste collection schedules and reduce operational inefficiencies. For example, municipal authorities
can plan collection routes based on real-time fill levels, avoiding unnecessary trips to half-empty bins
while prioritizing those nearing capacity. This targeted approach minimizes fuel consumption, reduces
carbon emissions, and improves overall resource allocation.
Moreover, Smart Bins are not limited to passive waste management; they actively engage users by
guiding them to dispose of waste correctly. Some systems use visual or auditory cues to indicate the
appropriate disposal bin for each waste type, promoting user participation in recycling efforts.
2.4 Existing System Analysis
The existing system presented above focuses on processing a dataset related to smart bin data, using
various machine learning techniques to build predictive models for waste management classification. It
involves several steps:

The dataset is loaded, and missing values are replaced with the median of the respective columns.
Label encoding transforms categorical variables into numerical values. Outliers in specific
features (e.g., FL_A, FL_B) are detected using boxplots and then replaced with the mean value where
necessary. A new feature, FL_C, is created as the difference between FL_A and FL_B to represent the
change in fill level. Standard scaling is applied to certain features to standardize them for better
performance in machine learning models. Several models, such as K-Nearest Neighbours (KNN),
Support Vector Machine (SVM), Logistic Regression, Decision Trees, Neural Networks, and Random
Forests, are used for classification. Each model is evaluated using metrics like accuracy, precision,
recall, F1 score, and Matthews Correlation Coefficient (MCC), together with confusion matrices.
Visualizations such as confusion matrices and model performance plots are generated.

2.5 Areas for Improvement


One area that can be improved is the redundancy in imports and code structure. The code imports
pandas, matplotlib.pyplot, and other libraries multiple times. This redundancy makes the code more
cluttered and harder to maintain. For example, matplotlib.pyplot is imported twice and
used to plot graphs, which could be consolidated into a single import at the start of the script. A more
organized structure would involve removing these duplicate imports and keeping only one instance of
each import. This also helps avoid unnecessary overhead when running the code and contributes to better
readability.
The current code uses a basic approach for imputing missing values by replacing them with the
median value of the respective column. While this is a reasonable approach for numerical data, it might
not always be optimal, especially if the data distribution is highly skewed. A better strategy could involve
analyzing the nature of missing data more carefully and choosing an imputation method accordingly. For
example, categorical data should be handled separately, potentially using the mode of the column for
imputation. In more complex scenarios, you could also consider predictive imputation techniques like
KNN imputation or scikit-learn's IterativeImputer. Additionally, the use of a loop to impute missing
values could be optimized. Instead of using a loop to manually handle each column, utilizing pandas'
built-in methods (such as fillna()) can achieve this more efficiently.
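
To make the suggestion concrete, here is a sketch of the vectorized and predictive imputation options; df is assumed to be the raw smart-bin DataFrame with missing entries, and KNNImputer comes from scikit-learn:

```python
# A sketch of vectorized and predictive imputation alternatives; df is
# assumed to be the raw smart-bin DataFrame with missing values.
from sklearn.impute import KNNImputer

numeric_cols = df.select_dtypes(include="number").columns

# Option 1: vectorized median imputation (replaces the manual per-column loop)
df_median = df.copy()
df_median[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Categorical columns can instead be filled with the column mode
for col in df.select_dtypes(exclude="number").columns:
    df_median[col] = df_median[col].fillna(df_median[col].mode().iloc[0])

# Option 2: predictive imputation, estimating missing numeric values
# from the five nearest neighbours in feature space
df_knn = df.copy()
df_knn[numeric_cols] = KNNImputer(n_neighbors=5).fit_transform(df[numeric_cols])
```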

2.6 Proposed System


The proposed system aims to build a machine learning pipeline that addresses a range of tasks
related to data preprocessing, model training, and evaluation. The goal of this system is to create a
robust framework that can handle real-world datasets, clean and preprocess the data efficiently, train
multiple machine learning models, and evaluate their performance in a structured manner. The system
will be used to classify or predict outcomes based on historical data, which will be represented by
numerical and categorical features. The proposed system integrates key steps, including data cleaning
(handling missing values, outlier detection), feature engineering, model training, hyperparameter
tuning, and comprehensive evaluation.
In this system, the preprocessing phase is designed to handle several data quality issues that are
common in real-world datasets. The first step involves the identification and imputation of missing
values. The system will first check for missing data and apply the median imputation method for
numerical columns to fill missing entries. For categorical data, the system can apply mode imputation
or a predefined strategy based on the dataset’s characteristics. The preprocessing pipeline also includes
outlier detection and removal, where values that deviate significantly from the expected range are
replaced with more reasonable values, such as the mean of the column. This helps in minimizing the
impact of anomalous data on model performance.

Fig.2.1 Waste Management

Moreover, the system is designed to encode categorical variables into numerical
representations. This step is essential since machine learning algorithms generally require numerical
input. The proposed system employs label encoding for categorical columns, transforming each
category into a unique integer value. While label encoding works in certain cases, this method can be
expanded or replaced with one-hot encoding for nominal categorical variables to avoid introducing any
unintended ordinal relationships, thereby improving model accuracy. Data normalization or scaling can
also be added to the pipeline, especially for algorithms like KNN or SVM that are sensitive to feature
scale.

CHAPTER 3
SYSTEM ANALYSIS

System Analytics refers to the application of data analysis techniques to evaluate, optimize, and
understand the performance of a system. It helps identify inefficiencies, enhance system components,
and ensure functionality aligns with desired outcomes. For the provided project, system analytics
revolves around the systematic preprocessing, modelling, and evaluation of data for the smart bin waste
classification system.

Key Processes in System Analytics


• Data Preprocessing
Importing Libraries and Dataset: The system begins by importing essential libraries such as pandas,
matplotlib, and scikit-learn. A CSV dataset, Smart_Bin.csv, is loaded for analysis.
Handling Missing Values: Missing data is identified and replaced with the median value of respective
columns to ensure uniformity and prevent biases during model training.
Encoding Categorical Features: Features like Class, Container Type, and Recyclable Fraction are
converted into numerical representations using label encoding for seamless integration with machine
learning algorithms.
• Outlier Detection and Removal
Visualization of Outliers: Boxplots are generated to detect anomalies in columns, particularly fill
level data.
Replacing Outliers: Outliers, defined as extreme values (e.g., greater than 100), are replaced with the
mean value of the corresponding column to maintain data consistency.
• Feature Engineering
New features are derived from the existing data by calculating differences between fill levels (FL_A
and FL_B). For instance:
FL_C Difference: The computed difference between FL_C values over different time frames
such as FL_C_3 and FL_C_12.
Feature engineering enhances the dataset's predictive power and improves model accuracy.
• Data Scaling

Standard scaling is applied to numerical features (FL_A, FL_B, FL_C, and VS) to normalize their
ranges, ensuring that models like KNN and SVM are not biased by varying magnitudes of features.
• Model Training and Evaluation
Splitting Dataset: The data is divided into training and testing sets for effective model validation.

Model Implementation: Multiple machine learning algorithms are implemented to classify waste
bins, including:
• K-Nearest Neighbors (KNN)
• Support Vector Machines (SVM)
• Logistic Regression
• Decision Trees
• Neural Networks (MLP)
• Random Forest
Performance Metrics: Models are evaluated using metrics like:
• Accuracy: Measures overall correctness.
• Precision and Recall: Focus on relevance and detection capability.
• F1-Score: Balances precision and recall.
• Matthews Correlation Coefficient (MCC): Evaluates the quality of binary classifications.
Confusion Matrices: These visualize model performance by illustrating true positives, false positives,
true negatives, and false negatives.
• Visualization and Insights
Model Accuracy: Plots are generated to analyze accuracy trends, such as evaluating KNN
performance across varying K values.
Confusion Matrices: Graphical representations of prediction outcomes for each model.
Metric Comparisons: Precision, Recall, and F1-scores are visualized, as sketched below.
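
A brief sketch of these visualizations, assuming X_train, X_test, y_train, and y_test come from an earlier train/test split:

```python
# A sketch of the accuracy-vs-K plot and confusion matrix described above.
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import ConfusionMatrixDisplay

# Accuracy trend across varying K values
ks = range(1, 21)
scores = []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))
plt.plot(list(ks), scores, marker="o")
plt.xlabel("Number of neighbours (K)")
plt.ylabel("Test accuracy")
plt.show()

# Confusion matrix for the last fitted model
ConfusionMatrixDisplay.from_estimator(knn, X_test, y_test)
plt.show()
```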

Fig.3.1 Recycling Methods

3.1 Functional Requirements
The functional requirements define the core tasks and capabilities the system must provide:
3.1.1 Data Collection
The system ingests and manages data related to recycling bins, focusing on critical
aspects like fill levels, container types, and recyclable fractions. For instance, the dataset
may contain columns representing the amount of waste in different sections of a bin (e.g.,
FL_A, FL_B), the type of waste container (plastic, metal, or paper), and whether the
contents are recyclable. This collected data forms the foundation for making predictions
about bin status and recycling optimization.
3.1.2 Prediction and Classification
Based on the collected data, the system predicts whether a bin is full or not. For example,
if the fill level exceeds a predefined threshold, the bin will be classified as "full." This
prediction helps streamline waste collection by ensuring bins are emptied on time.
Furthermore, the classification system can identify the type of recyclables and segregate
them accordingly.
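
As a toy illustration of the threshold rule, the sketch below classifies a bin from its fill level; the 80% cutoff is a hypothetical value, not one specified in this report:

```python
# A toy illustration of the fill-level threshold rule; the cutoff value
# is an assumption made for the example, not taken from the report.
FULL_THRESHOLD = 80.0

def bin_status(fill_level: float) -> str:
    """Classify a bin as 'full' or 'not full' from its fill level (%)."""
    return "full" if fill_level >= FULL_THRESHOLD else "not full"

print(bin_status(92.5))  # full
print(bin_status(35.0))  # not full
```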

3.1.3 User Interface


While the system operates on pre-existing datasets, an optional graphical interface could
allow users to upload new data, view bin statuses, and receive notifications about bin
conditions (e.g., alerts when bins are nearing capacity). The UI should be designed for
accessibility and intuitive operation, enabling users to interact without requiring technical
expertise.

3.2 Non-Functional Requirements


Non-functional requirements determine the system's quality attributes:
3.2.1 Efficiency
The system must process data swiftly and return predictions within milliseconds to support
decision-making. For instance, even large datasets should be analyzed efficiently without
significant delays in results, ensuring smooth operations for waste management personnel.
3.2.2 Scalability
The system should handle expanding datasets as the number of bins increases. For
example, while the current dataset might include data for 50 bins, future scalability ensures
the system remains efficient even when analyzing data for 500 bins.
3.2.3 Accuracy
Accuracy in predictions is vital to avoid errors, such as marking a nearly empty bin as full.
The algorithms must achieve high precision and recall, particularly in scenarios where
misclassification could lead to operational inefficiencies.

3.3 Hardware Requirements


Although this project focuses on the software side, some hardware considerations are:
3.3.1 Processing Unit
A local computing device (e.g., laptop, desktop, or server) capable of running Python-
based machine learning models is required. The system should support at least 4GB of
RAM and a modern processor for seamless execution.
3.3.2 Simulated Data Input
In the absence of physical sensors, datasets serve as a proxy for sensor-generated data.
The system processes these structured datasets as if they were real-time inputs from smart
bins.

3.4 Software Requirements


The software tools and components include:
3.4.1 Programming Environment
Python serves as the primary programming language, supported by libraries like pandas
for data manipulation, scikit-learn for machine learning, and matplotlib and seaborn for data
visualization.
3.4.2 Dataset
A structured CSV dataset simulates real-world data, including historical records of bin
statuses, fill levels, and recyclable classifications. For example, columns like FL_A, FL_B,
and Recyclable Fraction offer numerical values that the models analyze.

3.5 Data Collection Process


The data collection process emphasizes ensuring clean and usable input for machine learning:
3.5.1 Data Cleaning
Missing values in critical columns (e.g., FL_A, FL_B) are replaced with the median of each
column, ensuring uniformity without introducing outliers or skewing data.
3.5.2 Label Encoding
Categorical features, such as container types and recyclable fractions, are converted into
numerical representations (e.g., metal = 1, plastic = 2) using label encoding. This step is
crucial for machine learning algorithms that require numerical inputs.
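
A small sketch of this encoding step follows; note that the metal = 1, plastic = 2 mapping above is illustrative, since scikit-learn's LabelEncoder assigns integer codes in alphabetical order:

```python
# A sketch of label encoding for container types; codes are assigned
# alphabetically by LabelEncoder, so the exact mapping may differ from
# the illustrative example in the text.
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
container_types = ["metal", "plastic", "paper", "plastic", "metal"]
codes = encoder.fit_transform(container_types)

print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
# {'metal': 0, 'paper': 1, 'plastic': 2}
print(codes)  # [0 2 1 2 0]
```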

3.6 Deliverables and Beneficiaries
3.6.1 Deliverables
A trained model capable of accurately predicting bin statuses (e.g., full or empty) based on
input data. Visualization tools, including confusion matrices, accuracy plots, and performance
metrics, to provide insights into model performance.
3.6.2 Beneficiaries
• Waste Management Companies: The system optimizes collection schedules,
reducing unnecessary trips and operational costs.
• Local Authorities: Helps in better allocation of resources for waste management.
• Environmentally Conscious Users: Encourages responsible disposal habits by
providing clear recycling instructions.

3.7 Algorithm
The project employs multiple machine learning classifiers to predict bin statuses:
3.7.1 Primary Algorithms
K-Nearest Neighbors (KNN): KNN identifies bin status by comparing its features to
the closest neighbors in the dataset.
Support Vector Machine (SVM): SVM builds a hyperplane to classify bins into
"full" or "not full."
Logistic Regression: Logistic Regression uses probabilities to classify bins.
Decision Tree: Decision Tree is a tree-based model that splits data into rules for prediction.
Neural Network (NN): Neural Network employs layers to detect complex patterns in
data.
Random Forest: Random Forest combines multiple decision trees to improve
prediction reliability.
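
As a brief example of one listed classifier in use, the sketch below fits a Random Forest on the features assumed from Chapter 1 (FL_A, FL_C, VS) and prints per-feature importances, which hint at which sensor readings drive the prediction:

```python
# A sketch of a single classifier in use; X_train, X_test, y_train, y_test
# are assumed to come from an earlier train/test split on the scaled data.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))

# Relative importance of each input feature
for name, importance in zip(["FL_A", "FL_C", "VS"], rf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```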
3.8 Methodology
The methodology involves a systematic approach to model development and evaluation:
3.8.1 Data Preprocessing
In data preprocessing, missing values are imputed using the median to prevent data loss.
Outliers, identified via boxplots, are replaced with the column mean to maintain data
consistency.
Features are scaled using StandardScaler to ensure compatibility across models.
3.8.2 Modelling
Multiple classifiers are implemented to compare performance.
Hyperparameter tuning is performed using grid search or manual adjustments.
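
A sketch of grid-search tuning for KNN is shown below; the parameter ranges are illustrative assumptions, and X_train and y_train are assumed from the earlier split:

```python
# A sketch of hyperparameter tuning with GridSearchCV; the grid values
# here are illustrative assumptions, not taken from the report.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {
    "n_neighbors": list(range(1, 16)),
    "weights": ["uniform", "distance"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```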

CHAPTER 4
SYSTEM FEASIBILITY

• Data Quality

The effectiveness of the analysis largely depends on the quality and relevance of the dataset. A
high-quality dataset should contain enough examples with a balanced distribution of classes, which
prevents bias in the model’s predictions. Furthermore, the features used for model training should
be meaningful and capture the underlying patterns of the data. If the dataset includes noisy or
irrelevant features, the model’s performance will suffer. Therefore, ensuring that the dataset is well-
pre-processed—by removing or imputing missing values, encoding categorical variables correctly,
and addressing any inconsistencies—is crucial to achieving reliable results.

• Model Evaluation

Even though the models are trained and evaluated using standard performance metrics like
accuracy, precision, recall, and F1-score, it is essential to evaluate feature relevance. The features
selected for training play a significant role in determining the model's accuracy. Poorly selected or
irrelevant features may lead to overfitting or underfitting, diminishing model performance.
Therefore, feature selection or engineering techniques can significantly improve the model's
predictive ability. For instance, creating new features from existing data or selecting the most
influential ones could increase the model’s robustness and interpretability.

• Model Selection

The choice of algorithms (e.g., KNN, SVM, Logistic Regression, Random Forest) should be
driven by the problem's nature and the data's characteristics. For example, if the task is a
classification problem, algorithms like KNN or Logistic Regression might be more effective.
However, for more complex data patterns, algorithms like Random Forest or SVM could yield better
results. Additionally, computational efficiency plays a crucial role—while SVM models can be
computationally expensive, particularly for large datasets, Random Forests may better handle
complex, non-linear relationships. The decision should consider the trade-off between accuracy and
computational cost, especially when dealing with large-scale datasets.

• Handling Imbalanced Data
Imbalanced datasets, where one class significantly outnumbers the other, can cause the model
to be biased toward the majority class. This can lead to misleading predictions, especially in
classification problems like bin status prediction. To address this, techniques such as SMOTE
(Synthetic Minority Over-sampling Technique) or under sampling can be employed to balance the
dataset. SMOTE generates synthetic examples for the minority class, while under sampling reduces
the number of examples from the majority class. Implementing these techniques ensures that the
model treats all classes with equal importance, improving its ability to generalize well across both
classes.
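
A sketch of SMOTE-based balancing follows, assuming the third-party imbalanced-learn (imblearn) package is installed and a train split already exists:

```python
# A sketch of class balancing with SMOTE, assuming the imbalanced-learn
# package is installed and X_train/y_train come from an earlier split.
from collections import Counter
from imblearn.over_sampling import SMOTE

print("Before:", Counter(y_train))
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("After:", Counter(y_res))  # minority class oversampled to parity
```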

4.1 Economic Feasibility

The Smart Recycling System project, as designed, focuses on software-based solutions for
automating waste segregation and classification, without hardware-based sensors or real-time data
collection. The economic feasibility analysis considers the costs associated with software
development, operational costs, and potential benefits.

Fig.4.1 Waste Cycle

4.1.1 Initial Setup Costs


• Software Development Costs
Developer Salaries: The cost of hiring data scientists and software developers to create machine
learning models, build a user interface, and integrate data processing pipelines. This could also
include hiring a project manager for coordinating tasks.
Software Tools and Libraries: The system is built using Python libraries such as pandas, scikit-
learn, matplotlib, and seaborn, most of which are free for personal and academic use. However,
if any premium tools or libraries are required in the future, this could incur additional costs.

4.1.2 Operational Costs
• Cloud Computing / Hosting
Although the system is not dependent on real-time sensors, cloud infrastructure may be required
to host the trained machine learning models and run predictions for large datasets. This would
include hosting the models on cloud platforms like AWS, Google Cloud, or Microsoft Azure,
which typically charge based on computational usage.
Data Storage: Storing large amounts of historical data (e.g., fill levels, waste classification data)
on cloud storage platforms would incur additional costs, but these costs would be manageable
given the software focus of the project.
• Software Maintenance and Updates:
Periodic updates and model retraining may be required as new data becomes available, which
would incur additional developer hours and cloud costs for training and testing new models.
Regular maintenance of the user interface and backend systems to ensure smooth operation and
bug fixes.
4.1.3 Revenue Generation and Cost Savings
• Improved Waste Management Efficiency:
While this project is purely software-based, it can still offer significant cost savings for waste
management companies by optimizing the segregation of waste. More accurate classification of
recyclables ensures that waste is properly sorted, leading to higher rates of recycling and lower
disposal costs.
Reduced Labor Costs: By automating the waste classification process, the system reduces the
manual effort required to sort recyclables, lowering labor costs for waste management operations.
• Environmental and Social Benefits:
Recycling Optimization: The system improves the efficiency of waste segregation, which in turn
helps reduce the amount of waste sent to landfills, thus promoting environmental sustainability.
Public Awareness and Engagement: By providing an intuitive interface for users to input data
and receive feedback on recycling practices, the system can enhance public awareness of
recycling protocols, contributing to broader environmental goals and possibly attracting
government funding or incentives.
• Potential for Commercialization:
Licensing: The software could be licensed to waste management companies, municipalities, or
organizations focused on sustainability. This can become a recurring revenue stream for future
versions of the software.

Subscription Model: Offering the software as a service (SaaS) to local authorities, urban
planners, or environmental organizations could generate ongoing subscription-based revenue,
allowing the system to scale over time.
4.1.4 Scalability
• Geographical Expansion: The system is scalable because it is entirely software-based, and with
appropriate marketing and partnerships, it could be deployed across different cities or
municipalities. As the software expands its reach, it would handle data from multiple sources,
increasing its overall utility and profitability.
• Adoption by Other Industries: Beyond waste management, the system can be expanded to
industries that rely on material classification, such as manufacturing, logistics, or supply chain
industries that manage recyclable materials.
4.1.5 Social and Environmental Impact
• Job Creation: While the system automates waste classification, the development, marketing, and
ongoing maintenance of the software would create jobs for developers, data scientists, and
marketing professionals.
• Environmental Sustainability: The system supports sustainability by promoting accurate waste
segregation and recycling, directly contributing to the reduction of environmental pollutants, and
encouraging better recycling practices in communities.

4.2 Technical Feasibility


The Smart Recycling System project, which is focused on utilizing machine learning for
waste classification and segregation, requires an analysis of the technical feasibility to determine if
the project can be successfully developed and deployed based on current technology and available
resources.
4.2.1 System Architecture and Design
• Software-Based System: The Smart Recycling System does not rely on hardware-based IoT
sensors or real-time data collection, making it easier to implement. The system focuses entirely
on the data analysis side, leveraging historical data of fill levels and recyclable fractions to predict
and classify waste in recycling bins.
• Data Preprocessing Pipeline: The system will rely on Python libraries like pandas, matplotlib,
seaborn, and scikit-learn to process the data, clean it, handle missing values, and apply machine
learning algorithms for prediction. The tools and libraries available in Python make the system
development straightforward and easy to maintain.
• Modelling and Machine Learning: The system uses well-established machine learning
algorithms like K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Decision Trees,
Logistic Regression, and Random Forest for waste classification. These algorithms are mature,
well-documented, and can be efficiently implemented with the available resources.

4.2.2 Data Availability and Quality


• The system requires a dataset containing information about waste bins, such as fill levels (e.g.,
FL_A, FL_B, FL_C) and recyclable fractions. Since the system relies on historical data, the
availability of a dataset with relevant features is a critical factor for successful implementation. In
case such datasets are not readily available, public or open-source data sources could be used, or
the data can be synthesized from existing recycling datasets.
• The quality of the dataset impacts the model's performance. Since the data requires preprocessing
(handling missing values, feature engineering, and label encoding), it is technically feasible to
clean and prepare the data with existing tools. The project assumes that the dataset is clean and
sufficient in quantity. Any issues with data quality (e.g., highly imbalanced data or poor feature
engineering) could impact the system’s performance, but methods such as oversampling, data
augmentation, or feature selection could be used to improve the model’s effectiveness.
4.2.3 Machine Learning Model Development
• Training and Testing: The models will be trained on a set of historical data (training set) and
evaluated on a separate set (testing set). Using cross-validation or train-test splits will ensure that
the models are generalized and not overfitting. Given the nature of the problem (classification),
multiple algorithms will be tested, and the best-performing one will be selected.
• Scalability of Machine Learning Models: The project uses scalable machine learning algorithms
such as Random Forests, which can handle large datasets effectively. For smaller datasets, simpler
models like Logistic Regression or Decision Trees can be used, and their scalability is not an
issue.
• Model Evaluation: The system evaluates the models using standard performance metrics like
accuracy, precision, recall, F1-score, and Matthews Correlation Coefficient (MCC). Confusion
matrices and classification reports provide valuable insights into how well the models are
performing in terms of classification, making this evaluation approach technically feasible.
• Tools and Libraries: Machine learning libraries such as scikit-learn (for model training and
evaluation) and matplotlib/seaborn (for visualization) are widely used and well-supported,
making them robust choices for model implementation.

4.2.4 System Integration and Deployment


• User Interface (UI): The project envisions a user interface that allows users to input data, view
predictions, and receive alerts regarding the fill levels and status of recycling bins. The technical
feasibility of developing such an interface is high, as modern frameworks like Flask or Django
(Python web frameworks) can be used to create a simple web-based UI.
• Integration with Existing Systems: While this system operates independently of IoT devices, it
can be integrated with existing waste management software or systems for automatic data
collection (in cases where hardware is available in the future). The system is modular and can
work in tandem with other applications in the waste management domain.

4.2.5 Computational Resources

• Cloud Infrastructure: If necessary, cloud platforms such as AWS, Google Cloud, or Microsoft
Azure can be used to host the system, especially if it needs to scale to handle multiple smart bins'
data. These platforms provide the computational power needed to train and test models without
requiring local hardware upgrades.
• Local Computational Resources: If a smaller-scale system is required, the model training and
predictions can be run locally on machines with adequate processing power. The resource
requirements for machine learning models like KNN, Random Forest, and Logistic Regression
are not computationally intensive, so they can run on systems with moderate specs.
• Data Storage: Data storage requirements are relatively moderate, as historical data about
recycling bins can be stored on cloud or local databases, ensuring that enough storage is available
for training datasets and model outputs.

4.2.6 Technology Stack and Tools

• Programming Language: Python is the primary programming language used for this project, as

it has extensive libraries for machine learning, data analysis, and visualization. Python's versatility
and popularity ensure that finding technical resources, support, and troubleshooting are not
challenging.
• Libraries: The main libraries required are:

pandas for data manipulation and preprocessing.

scikit-learn for implementing machine learning algorithms and model evaluation.

matplotlib and seaborn for visualization of data and results.

Flask or Django for creating a simple web-based user interface if required.

The use of these libraries ensures technical feasibility, as they are well-documented, widely
used, and supported by the community.
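
For instance, a minimal Flask prediction endpoint might look like the sketch below, assuming a trained scikit-learn model saved as model.joblib (a hypothetical filename) and the three input features used in Chapter 1:

```python
# A minimal sketch of a prediction endpoint, assuming Flask and joblib are
# installed and a trained classifier was saved as "model.joblib"
# (a hypothetical filename chosen for this example).
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load("model.joblib")  # e.g. a fitted Random Forest

@app.route("/predict", methods=["POST"])
def predict():
    # Expect JSON like {"FL_A": 40.0, "FL_C": 12.5, "VS": 3.1}
    payload = request.get_json()
    features = [[payload["FL_A"], payload["FL_C"], payload["VS"]]]
    prediction = model.predict(features)
    return jsonify({"Class": int(prediction[0])})

if __name__ == "__main__":
    app.run(debug=True)
```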

4.3 Social Feasibility
The social feasibility of the Smart Recycling System focuses on evaluating how well the system
aligns with societal needs, its potential impact on the community, and its acceptance by
stakeholders, including local authorities, waste management companies, and the general public.
This analysis helps determine if the project will be well-received and if it will contribute
positively to society.

Fig.4.2 Waste Flow

4.3.1 Environmental Impact and Sustainability


• Waste Management Improvement: The primary benefit of the Smart Recycling System is its
contribution to improving recycling efforts, which directly supports sustainable waste
management practices. By accurately predicting whether recycling bins are full, the system
ensures that bins are collected and emptied at optimal times, preventing overflows and
contamination.
• Reduction in Landfill Waste: Efficient waste collection reduces the need for extra collection
trips, reducing fuel consumption and lowering the carbon footprint of waste management
operations. The system contributes to reducing waste sent to landfills, promoting a cleaner
environment.
• Promotion of Recycling: By optimizing waste collection and providing better recycling
practices, the system helps reinforce environmentally friendly behaviours in communities. It
encourages responsible recycling, which in turn, helps decrease pollution, conserve resources,
and lower overall environmental degradation.

4.3.2 Community and Stakeholder Benefits
• Waste Management Companies: The system improves operational efficiency for waste
management companies by helping them plan waste collection schedules more effectively. By
predicting fill levels, these companies can deploy collection trucks only when needed, reducing
the number of unnecessary trips. This results in cost savings, improved logistics, and a reduction
in greenhouse gas emissions.
• Local Authorities: Local authorities, responsible for waste management and public health, can
use the system to monitor the status of recycling bins and ensure that waste is handled promptly.
This could lead to better service delivery for citizens, with improved cleanliness and sanitation in
neighborhoods.
• Environmentally Conscious Citizens: The public, especially environmentally conscious
citizens, will benefit from an organized and efficient recycling system. The system offers
convenience by ensuring recycling bins are emptied in a timely manner, making it easier for
people to recycle without the risk of overflowing bins. This could increase participation in
recycling programs and overall community engagement in sustainability efforts.
• Educational Value: The system can also serve as an educational tool, raising awareness about
recycling practices. Through its interface or public reporting features, it can educate users about
the importance of waste separation, the impact of recycling, and how individuals can contribute
to a greener planet.

4.3.3 Accessibility and User-Friendliness


• User Interface: The proposed system can have a simple user interface that allows citizens to
easily input data (e.g., type of recyclables, fill levels) or view real-time alerts regarding bin
statuses. The simplicity and ease of use of this interface ensure that the system is accessible to a
wide range of users, including non-tech-savvy individuals.
• Public Awareness: For the system to be successful, it must be widely known and accepted. Public
education campaigns about the system and its benefits (e.g., more efficient waste collection and
improved recycling rates) are essential. This could include informative sessions, flyers, or social
media campaigns to educate the public about the system's role in improving waste management.

4.3.4 Inclusivity and Equity


• Community Engagement: The system is designed to be inclusive by making waste management
more efficient across all communities, from urban centers to rural areas. However, its success
depends on the accessibility of the required infrastructure and technology. For it to be inclusive,
local authorities may need to ensure that rural and underserved communities have equal access to
the benefits of the system.
• Support for Underserved Areas: The system can be especially beneficial for underserved areas
where waste management is a significant challenge. By providing predictive analysis for waste
collection, the system can help improve the quality of life for these communities, ensuring they
have a cleaner and healthier environment.

4.3.5 Adoption and Public Perception


• Public Acceptance: For the system to be socially feasible, it must be well-accepted by the
community. A user-friendly interface, clear communication about its benefits, and reassurance
about privacy and data security are crucial in encouraging citizens to adopt the system.
• Behavioral Change: The system can foster positive behavioral change in individuals by making
recycling easier and more efficient. As users see the impact of their actions through real-time
monitoring and improved recycling collection, it could motivate them to participate more actively
in recycling efforts.
• Trust in Technology: The system’s reliance on machine learning models and predictions may
raise concerns about the accuracy and reliability of the predictions. For widespread adoption, it is
important to establish trust in the system’s ability to make correct predictions. This can be
achieved through transparency about how the models work, continuous performance monitoring,
and regular updates on improvements.

4.3.6 Social Impact on Employment


• Job Creation: The system may create new job opportunities in the technology and waste
management sectors, such as roles in data analysis, system monitoring, model development, and
maintenance. It also requires technicians to oversee the operation of the system and handle any
technical issues.
• Reduction in Manual Labor: The smart system can reduce the need for manual monitoring of
recycling bins, making waste collection more automated and reducing labour costs. However, it
is important to ensure that such automation does not lead to job losses, but rather provides new
opportunities for upskilling and reskilling the workforce.

CHAPTER 5

SYSTEM DESIGN

5.1 E-R Diagrams


Entity-Relationship (E-R) diagrams are essential for visualizing the structure of the Smart Recycling
System, enabling effective data management and interactions between various components. As the
system uses machine learning to predict recycling bin statuses and classify waste types, E-R diagrams
represent entities such as Recycling Bins, Waste Types, Users, Bin Status Logs, and Collection
Trucks, showing how they interact in real-time.
The E-R diagram organizes relationships such as:
Recycling Bins (with attributes like bin ID, type, and location) linked to Waste Types (such as
plastic, metal, and paper).
Users (linked to bins for waste sorting) connected to Bin Status Logs (logging fill levels and
timestamps).
Collection Trucks linked to Collection Routes, representing the collection process of bins.

5.1.1 Conceptual E-R Diagrams


Purpose: It provides a high-level overview of the relationships between bins, waste types, users, and
logs.
Use in Smart Recycling System: It illustrates how waste is classified and linked to bin statuses and
users.

5.1.2 Logical E-R Diagrams


Purpose: It defines attributes for entities like fill levels, bin ID, and waste classification.
Use in Smart Recycling System: It specifies attributes like bin status, type of waste, and timestamps
for logging.

5.1.3 Physical E-R Diagrams


Purpose: It focuses on database storage and structure.
Use in Smart Recycling System: It represents how bin data, waste types, and logs are stored for
quick access and analysis.

5.1.4 Extended E-R Diagrams


Purpose: It uses advanced features like generalization to model similar waste types and bin types.

Use in Smart Recycling System: It models different types of waste (e.g., paper, plastic) under the
common Waste Type entity, enhancing data organization.

5.1.5 Crow's Foot E-R Diagrams


Purpose: It shows cardinality and participation constraints.
Use in Smart Recycling System: It defines how many bins can contain each waste type, and how
many logs can be associated with a bin.

5.1.6 UML Class Diagrams (as an alternative)


Purpose: It represents the system in a more object-oriented manner.
Use in Smart Recycling System: It is useful for modelling complex relationships between users,
waste types, and recycling processes with attributes and methods.

Fig 5.1 E-R Diagram

5.2 System Architecture


The architecture of the Smart Recycling System defines the components and their interactions,
enabling the system to predict recycling bin statuses, classify waste types, and optimize waste
management processes. The system is designed with a modular approach that integrates various
components, such as data collection, prediction algorithms, and user interfaces.

5.2.1 Data Collection Layer
Purpose: Collects data from smart bins, including fill levels, waste types, and user interactions.
Components: Smart sensors in the bins, user input interfaces (e.g., mobile apps), and real-time data
collection systems.
Functionality: Sensors continuously monitor the bin's fill levels and waste types, while users can
input data related to the waste they deposit.
5.2.2 Data Processing and Machine Learning Layer
Purpose: Processes the collected data and makes predictions about the bin's status (e.g., full, empty).
Components: Data processing units, machine learning algorithms (KNN, Random Forest, etc.), and
data storage (databases).
Functionality: This layer applies pre-processing techniques like data cleaning, feature engineering,
and scaling before feeding the data into machine learning models. The system predicts whether the
bin is full or not and classifies the type of waste.
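A minimal sketch of this layer's flow is shown below, assuming the Smart_Bin.csv feature names used in the appendix (FL_A, FL_B, VS, Class); the derived feature FL_C and the choice of KNN are illustrative, not prescriptive.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv('Smart_Bin.csv')
df = df.fillna(df.median(numeric_only=True))                     # data cleaning
df['FL_C'] = df['FL_A'] - df['FL_B']                             # feature engineering: change in fill level
X = StandardScaler().fit_transform(df[['FL_A', 'FL_C', 'VS']])   # feature scaling
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = KNeighborsClassifier(n_neighbors=8).fit(X_train, y_train)
print("Hold-out accuracy:", model.score(X_test, y_test))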
5.2.3 Prediction and Decision-Making Layer
Purpose: Executes predictive models to determine the fill status and waste type classification.
Components: Machine learning models (e.g., KNN, SVM), prediction servers, and algorithms.
Functionality: Based on the processed data, the models classify the bin’s fill level and waste type.
The system uses these predictions to optimize the collection process, schedule pickups, and provide
alerts to users.
5.2.4 User Interface Layer
Purpose: Provides an interface for users (e.g., waste management operators, local authorities, and
citizens) to interact with the system.
Components: Web or mobile applications, dashboards, and alert systems.
Functionality: Users can view the status of recycling bins, receive alerts about full bins, and manage
collection schedules. The interface provides real-time data visualization, including bin statuses, waste
type classifications, and system performance metrics.
5.2.5 Collection and Optimization Layer
Purpose: Manages the physical collection of waste from the bins and optimizes collection routes.
Components: Collection trucks, GPS systems, and route optimization software.
Functionality: Based on the predictions and alerts from the machine learning layer, the system
optimizes waste collection routes. The collection trucks are dispatched to bins based on their fill
status, reducing inefficiencies in the collection process.
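For illustration only, the toy sketch below orders bins with a greedy nearest-neighbour heuristic; an actual deployment would rely on the dedicated route optimization software described above, and the coordinates are hypothetical.

import math

def order_route(depot, bins_to_visit):
    # Greedy nearest-neighbour ordering of (x, y) bin coordinates.
    route, current, remaining = [], depot, list(bins_to_visit)
    while remaining:
        nearest = min(remaining, key=lambda b: math.dist(current, b))
        route.append(nearest)
        remaining.remove(nearest)
        current = nearest
    return route

print(order_route((0, 0), [(5, 1), (1, 1), (3, 4)]))   # -> [(1, 1), (3, 4), (5, 1)]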
5.2.6 Integration and Communication Layer
Purpose: This layer ensures seamless communication between different layers and components.
Components: APIs, data transmission protocols (e.g., MQTT), and cloud services.
Functionality: This layer connects various components of the system (data collection, machine
learning models, user interfaces) and ensures that data flows in real-time, allowing the system to
operate smoothly and respond quickly to changes.

Fig 5.2 System Architecture


5.3 UML Diagrams
Unified Modelling Language (UML) diagrams are critical for representing the dynamic and
static aspects of the Smart Recycling System, offering a clear visualization of system
functionality, class structures, and interactions. These diagrams help in illustrating user
interaction, class organization, and process flows, ensuring effective system design.
5.3.1 Use Case Diagram
• A Use Case Diagram is used to visually represent the interactions between users (actors) and
the system. It helps in understanding the functional requirements of the system.
• The diagram depicts the roles of different users, such as waste management staff or citizens,
and their interactions with system features such as logging waste, checking bin statuses, and
scheduling collections.
The key elements are
Actors: Users (e.g., Waste Management Personnel, Citizens)
Use Cases: Bin fill level monitoring, waste classification, collection scheduling
Relationships: Associations between actors and use cases, showing what functionality each actor can
access.

Fig 5.3 Use Case Diagram


5.3.2 Class Diagram
• Class Diagrams model the static structure of the system, showing the system's classes, their
attributes, methods, and the relationships between the classes.
• It represents the key components of the system like Recycling Bin, Waste Types, Collection
Trucks, Users, and their attributes and behaviors.
The key elements are
Classes: Recycling Bin, User, Waste Type, Collection Log, etc.
Attributes: Bin ID, Fill Level, User ID, Waste Category, etc.
Methods: Sort Waste, Update Bin Status, Schedule Collection, etc.

Relationships: Inheritance, Association, and Aggregation relationships that depict how the classes are connected.

Fig 5.4 Class Diagram
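As a minimal sketch, the classes and members named above can be rendered in Python as follows; the attribute names follow the diagram, while the method bodies and the 80% collection threshold are assumptions for illustration.

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class WasteType:
    category: str                                  # e.g., plastic, metal, paper

@dataclass
class RecyclingBin:
    bin_id: int
    fill_level: float = 0.0
    logs: list = field(default_factory=list)       # collection log entries

    def update_bin_status(self, fill_level: float) -> None:
        self.fill_level = fill_level
        self.logs.append((datetime.now(), fill_level))

    def schedule_collection(self) -> bool:
        return self.fill_level >= 80.0             # assumed dispatch threshold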

5.3.3 Sequence Diagram


• A Sequence Diagram is used to show how objects interact with each other over time. It
emphasizes the order of events or method calls.
• The diagram can show the sequence of events when a user interacts with the system, such as
checking a bin’s fill level, classifying the waste, and scheduling a collection.
The key elements are
Objects: Users, Recycling Bins, Collection Trucks, Waste Classification System.
Messages: Function calls or method invocations between objects, showing how data flows.
Lifelines: Represent the existence of objects during the sequence, showing which components
participate in the system's processes.

Fig 5.5 Sequence Diagram

CHAPTER 6
SYSTEM IMPLEMENTATION

The system implementation focuses entirely on the design, development, and integration of
software components, as the project does not involve any hardware sensors or IoT devices. The
system is designed to manage and analyze data related to recycling, such as waste types and bin
statuses, using machine learning models and data processing techniques. By leveraging these
software-driven methods, the system predicts bin statuses and classifies waste types, providing an
efficient and scalable solution for optimizing recycling processes.

6.1 Software Architecture


6.1.1 Backend Development
Programming Languages: The backend of the system is developed in Python, chosen over alternatives
such as Java for its robust machine learning libraries (e.g., scikit-learn, TensorFlow).
Machine Learning Model: A machine learning model is trained to predict the recycling bin status
and classify waste types. Models such as Logistic Regression, Random Forest, or Support Vector
Machine (SVM) are chosen based on the data characteristics and computational efficiency.
Data Processing: Data preprocessing techniques such as normalization, encoding categorical
variables, and handling missing values are applied before training the model. The cleaned data is fed
into the model to make predictions about waste classification and bin fill levels.
6.1.2 Database Design
Relational Database (e.g., MySQL/PostgreSQL): The system uses a relational database to store all
necessary data, including recycling bin data, waste types, user information, and bin status logs.
Schema: The database schema consists of tables like Recycling Bins (with attributes like ID, type,
and location), Waste Types (plastic, metal, paper, etc.), Bin Status Logs (for tracking fill levels), and
Collection Routes.
Data Retrieval: SQL queries are used to efficiently retrieve and manipulate data, such as generating
reports on recycling bin statuses and waste classifications.
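A minimal sketch of this schema is given below, using SQLite for illustration (the report targets MySQL/PostgreSQL, for which the DDL is near-identical); the table and column names are assumptions derived from this section.

import sqlite3

conn = sqlite3.connect('smart_recycling.db')
conn.executescript("""
CREATE TABLE IF NOT EXISTS waste_types (
    waste_type_id INTEGER PRIMARY KEY,
    name          TEXT NOT NULL                 -- plastic, metal, paper, ...
);
CREATE TABLE IF NOT EXISTS recycling_bins (
    bin_id        INTEGER PRIMARY KEY,
    bin_type      TEXT,
    location      TEXT
);
CREATE TABLE IF NOT EXISTS bin_status_logs (
    log_id     INTEGER PRIMARY KEY,
    bin_id     INTEGER REFERENCES recycling_bins(bin_id),
    fill_level REAL,                             -- percentage full
    logged_at  TEXT                              -- ISO-8601 timestamp
);
""")
conn.commit()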
6.2 Model Development and Integration
6.2.1 Model Training and Evaluation
The machine learning model is trained using historical data related to waste types and bin statuses.
Performance metrics like accuracy, precision, recall, and F1-score are used to evaluate the model.
Once trained, the model is integrated into the backend, where it will classify waste types and predict
bin statuses based on input data.
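The evaluation step can be sketched as below, assuming binary test labels and predictions from any of the trained models; scikit-learn's metric helpers mirror the manual calculations used in the appendix.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    # Binary labels are assumed, matching the label-encoded Class column.
    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall   :", recall_score(y_true, y_pred))
    print("F1-score :", f1_score(y_true, y_pred))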
6.2.2 Model Deployment
After testing, the trained model is deployed on a cloud server (e.g., AWS, Google Cloud) or a local
server to handle real-time predictions.
The model will receive data inputs (such as bin fill levels and waste types) and make predictions,
which are then sent to the backend to be processed.
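A minimal persistence sketch is shown below; joblib is a common choice for scikit-learn estimators, and the train_X, train_y, and test_X variables are assumed to come from the training step described earlier.

import joblib
from sklearn.neighbors import KNeighborsClassifier

# train_X, train_y, test_X are assumed from the training step above.
model = KNeighborsClassifier(n_neighbors=8).fit(train_X, train_y)
joblib.dump(model, 'waste_classifier.joblib')     # persist after training

loaded = joblib.load('waste_classifier.joblib')   # on the prediction server
print(loaded.predict(test_X[:5]))                 # predictions for incoming data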

6.3 Integration of Software Components


6.3.1 API Development
APIs (RESTful or GraphQL) are developed to allow communication between the backend and the
database. The API exposes endpoints for retrieving recycling bin data, submitting new waste
classification data, and updating bin statuses.
The API ensures seamless integration of machine learning predictions with data storage and
retrieval, allowing real-time updates.
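One such endpoint might look like the sketch below; Flask is used purely for illustration (the report does not mandate a framework), and the route, payload fields, and model file name are assumptions.

from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('waste_classifier.joblib')    # trained model from the previous step

@app.route('/api/bins/predict', methods=['POST'])
def predict_bin_status():
    data = request.get_json()                     # e.g., {"FL_A": 40.0, "FL_C": 12.5, "VS": 0.8}
    features = [[data['FL_A'], data['FL_C'], data['VS']]]
    return jsonify({'predicted_status': int(model.predict(features)[0])})

if __name__ == '__main__':
    app.run()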
6.3.2 Data Flow
Input: The user inputs waste classification data (e.g., waste types, timestamps, bin fill levels) into
the system through the backend interface.
Processing: The data is processed by the backend, where the machine learning model classifies the
waste and predicts the fill levels of the recycling bins.
Output: The results (predicted bin status and waste classification) are stored in the database for
future retrieval or analysis.

6.4 Testing and Deployment


6.4.1 Unit Testing
The system undergoes unit testing to ensure that each individual component (such as the model,
database, and API) is working correctly. This includes testing the prediction accuracy of the model,
database queries, and the functionality of data handling mechanisms.

6.4.2 System Testing


The integrated system is tested to ensure that all components work together as expected. Testing
focuses on checking the flow of data between the backend, database, and model, validating the
system’s overall performance.
6.4.3 Deployment
The system is deployed to a cloud or on-premises server. It is made accessible to users via an internal
interface, allowing for data input and analysis. The system is continuously monitored for
performance and updated as required.
6.5 Expected Outcomes
6.5.1 Efficient Waste Management: The software system will optimize recycling bin collection
and waste classification, reducing operational costs and improving waste management efficiency.
6.5.2 Environmental Impact: The system will promote recycling efforts, reduce landfill waste,
and contribute to environmental sustainability.

CHAPTER 7
SYSTEM TESTING
Testing is the process of evaluating and verifying that a software application or system meets
specified requirements and functions correctly. It involves running the software under different
conditions to identify defects, bugs, or areas for improvement, ensuring that the system works as
intended and meets user expectations. Testing helps ensure software quality, reliability, security, and
performance by identifying issues before the software is deployed for end-users.

7.1 Unit Testing


• Unit testing is used to test individual components or functions of the software.
• It ensures that each function (e.g., object classification, recommendation generation) works as expected.
• Example: test a function that classifies waste items to ensure it categorizes input correctly, as sketched below.
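A minimal pytest sketch of such a check is given below; classify_waste is a hypothetical wrapper with a stand-in rule, not the project's actual classifier.

def classify_waste(features):
    # Stand-in rule for illustration: a large change in fill level means 'full'.
    return 'full' if features['FL_C'] > 0.5 else 'not full'

def test_flags_full_bin():
    assert classify_waste({'FL_C': 0.9}) == 'full'

def test_flags_partial_bin():
    assert classify_waste({'FL_C': 0.1}) == 'not full'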

7.2 Integration Testing


• Integration testing is used to test the interaction between different components or modules of the system.
• It verifies that data retrieval from the database correctly integrates with the waste classification system.

7.3 Functional Testing


• Functional Testing validates that the software system performs its functions as specified.
• This involves verifying that each feature of the system works as intended (e.g., user inputs waste
data, and the software provides recycling recommendations).
• Tools: Selenium for web applications, or framework-specific tools like pytest or Jest.
• Example: Test the user interface and confirm that, when waste data is entered, the system returns
the appropriate recommendations.

7.4 System Testing


• Purpose: To verify that the complete system meets the requirements and works together as a
whole.
• This testing ensures that the end-to-end functionality of the software system, including all
modules and user interfaces, behaves as expected in real-world usage scenarios.
• Tools: Selenium, Postman for API testing.
• Example: Test the entire system flow from user input to output, ensuring no breakages between
modules (e.g., database, classification algorithm).

7.5 Regression Testing


• Purpose: To check whether recent code changes have affected existing features.
• Description: When new features or bug fixes are added to the system, regression testing ensures
that these changes do not negatively impact existing functionality.
• Tools: pytest, Jest, Selenium.
• Example: After adding a new classification method, test the entire system to ensure that previous
features still work correctly.

7.6 Performance Testing


• Purpose: To assess the software's responsiveness, scalability, and stability under load.
• Description: Performance testing helps ensure that the software can handle various levels of
traffic or computational load (e.g., how the system handles multiple users inputting data
simultaneously or large amounts of recycling data).
• Tools: JMeter, Locust, or Python's timeit for checking the performance of specific functions (see the sketch after this list).
• Example: Test how fast the system responds when multiple users submit waste data at the same
time.
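As a minimal sketch, Python's timeit can measure prediction latency; the synthetic data below is a stand-in so the snippet runs on its own.

import timeit
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(1000, 3)                      # stand-in feature rows
y = np.random.randint(0, 2, size=1000)           # stand-in binary labels
model = KNeighborsClassifier(n_neighbors=8).fit(X, y)

elapsed = timeit.timeit(lambda: model.predict(X), number=100)
print(f"Mean batch-prediction time: {elapsed / 100:.4f} s")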

7.7 Testing Cases

Test Case ID | Test Case Description | Input Data | Expected Output | Actual Output | Pass/Fail | Remarks | Status
TC01 | Load Dataset | File path to the dataset | Dataset is loaded successfully; shape and column details are displayed. | Dataset loaded successfully. Shape: (1000, 10) | Pass | Works as expected. | Completed
TC02 | Handle Missing Values | Column with missing values | Missing values are replaced with the median for numeric columns. | Missing values replaced; no NaNs. | Pass | Verified successfully. | Completed
TC03 | Correlation Matrix | Full dataset | Correlation heatmap is displayed; correct relationships between features are visualized. | Correlation heatmap displayed. | Pass | Matches expected visualization. | Completed
TC04 | Split Dataset | Dataset and split ratio (80-20) | Dataset is correctly split into training and testing subsets. | Dataset split correctly: 80-20 ratio. | Pass | Splitting confirmed. | Completed
TC05 | Hyperparameter Tuning - KNN | Training data, parameter grid for k | Optimal k value is identified and corresponding accuracy is displayed. | Optimal k value identified correctly. | Pass | Working as expected. | Completed
TC06 | Evaluate KNN on Test Data | Test data | Model predicts class labels and provides accuracy, precision, recall, and F1-score. | Accuracy: 87%; F1-score: 0.85. | Pass | Performance is acceptable. | Completed
TC07 | Handle Missing Values in Input Data | Dataset with missing values | Missing values are replaced/imputed and no errors occur during preprocessing or training. | Missing values not handled properly. | Fail | Preprocessing function needs review. | Under Review
TC08 | Compute Confusion Matrix for KNN Predictions | Test data, predicted labels | Confusion matrix is computed and matches expected values. | Confusion matrix matches expected values. | Pass | Correct output verified. | Completed
CHAPTER 8
RESULTS

Fig 8.1 Correlation Heatmap

The image presents two components: an overview of the dataset used in the project and a correlation
heatmap visualizing interdependencies among its features. The dataset consists of 13 columns, each
representing a specific attribute related to waste classification, including the target variable that
indicates the waste category. All columns contain 4638 non-null entries, confirming that no data is
missing after preprocessing. Data types include float64 for numerical features and int64 for encoded
categorical features, reflecting efficient storage and representation.

Correlation Heatmap

The heatmap visualizes the Pearson correlation coefficients between features in the dataset.
Correlation values range from -1 (strong negative correlation) to +1 (strong positive correlation).

Observations:
Target Variable (Class):

• Moderately correlated with FL_B (-0.55) and weakly correlated with other fill-level features like FL_A and derived features (FL_C, FL_C_3, FL_C_12).
• Weak negative correlation with Container Type (-0.4).

Fill Levels (FL_A, FL_B, etc.):

• Strong positive correlations among related features like FL_B and FL_B_3
(0.87), indicating temporal consistency in waste fill levels.
• High correlation values (close to 1) between features within the same time interval or
derived features like FL_C_12 and FL_A_12 (0.69).

Derived Features (FL_C, FL_C_3, FL_C_12):

• Capture differences in fill levels, showing moderate to high correlations with the corresponding raw fill levels, providing additional predictive power for classification.

Recyclable Fraction:

• Weak correlations with most features, indicating its limited dependence on fill-level measurements.

Fig 8.2 Neural Network Confusion Matrix

The output illustrates the confusion matrix and the evaluation metrics for a Neural Network model
used for waste classification.

o The matrix is divided into four quadrants:

▪ True Negative (TN): 398 instances were correctly classified as negative (e.g., waste not belonging to the target class).
▪ False Positive (FP): 30 instances were incorrectly classified as positive.
▪ False Negative (FN): 41 instances were incorrectly classified as negative.
▪ True Positive (TP): 459 instances were correctly classified as positive (e.g., waste belonging to the target class).

The confusion matrix is supported by quantitative metrics for a detailed performance assessment:

• Accuracy is calculated as Accuracy = (TP + TN) / (Total samples).
This model achieved an accuracy of 92.35%, highlighting the model's ability to correctly classify the majority of instances.

• Precision is calculated as Precision = TP / (TP + FP).
A precision of 93.86% indicates that the model is highly reliable when it predicts a positive class.

• Recall is calculated as Recall = TP / (TP + FN).
A recall of 91.8% demonstrates that the model can identify most of the positive cases correctly.

• F1 Score is calculated as F1 = 2 × (Precision × Recall) / (Precision + Recall).
The F1 Score of 92.82% reflects a strong balance between precision and recall.

• Matthews Correlation Coefficient (MCC):

A robust metric for imbalanced datasets, accounting for all confusion matrix elements.

The MCC score of 0.847 indicates a high level of agreement between predicted and actual
labels.
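As a consistency check, the reported figures can be re-derived directly from the quadrant counts above (TN = 398, FP = 30, FN = 41, TP = 459):

TN, FP, FN, TP = 398, 30, 41, 459

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)
mcc       = (TP * TN - FP * FN) / ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)) ** 0.5

print(f"Accuracy : {accuracy:.4f}")    # 0.9235
print(f"Precision: {precision:.4f}")   # 0.9387
print(f"Recall   : {recall:.4f}")      # 0.9180
print(f"F1 score : {f1:.4f}")          # 0.9282
print(f"MCC      : {mcc:.3f}")         # 0.847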

Importance in the Study

• The Neural Network model exhibits excellent performance across multiple metrics, particularly in minimizing errors.
• The strong balance between precision and recall makes this model suitable for real-world deployment, where both false positives and false negatives have significant implications.
• The high MCC score further validates the robustness of the classification, suggesting that the model effectively captures the patterns and nuances in the dataset.

This output demonstrates the effectiveness of the Neural Network in achieving high accuracy
and reliability in waste classification, validating its potential for scalable and practical
applications.
Fig 8.3 Random Forest

The graph shows how accuracy improves as the number of trees in the Random Forest model increases.
Key data points include:
• At 1 tree, accuracy is 83.1%.
• With 7 trees, accuracy peaks at 89.2%, after which it stabilizes.
• Beyond 16 trees, accuracy plateaus at 89.4%, indicating diminishing returns for additional trees.
The steady improvement in accuracy with more trees demonstrates the effectiveness of ensemble
learning, while the plateau suggests that increasing the number of trees beyond 16 does not
significantly enhance performance.

CHAPTER 9
CONCLUSION AND FUTURE ENHANCEMENT

9.1 Conclusion
The Smart Recycling System project successfully demonstrates the potential of software-
driven solutions to address the growing global issue of waste management and recycling. By
developing a user-friendly platform that automates the classification of waste and provides actionable
recycling recommendations, the system contributes significantly to sustainable practices. The
integration of algorithms for waste categorization ensures that users can efficiently identify the type
of waste they are disposing of, while the recommendation engine offers appropriate disposal or
recycling options, making the entire recycling process smoother and more accessible.
Throughout its development, the system has been thoroughly tested to ensure its functionality,
reliability, and user experience. Unit testing, integration testing, and other essential software testing
methods confirmed that the individual components work as expected and that the system operates
seamlessly as a whole. The final product is an effective tool for individuals looking to adopt better
recycling practices, contributing to environmental sustainability efforts.
The Smart Recycling System is not only a practical solution but also highlights the growing
importance of technology in promoting eco-friendly habits. With its ability to accurately classify
waste and provide relevant recommendations, the system empowers users to make informed decisions
about waste disposal. In the long run, it has the potential to play a key role in reducing waste, lowering
carbon footprints, and supporting global recycling initiatives. The project lays a strong foundation for
future developments in smart waste management, and its impact can be further amplified with
additional features and enhancements in future iterations.
9.1.1 Impact and Contribution
The project contributes to the circular economy model by encouraging responsible waste
disposal and maximizing the reuse of resources. By making recycling more accessible and efficient,
the system promotes environmental sustainability. This project also demonstrates how technology
can be used to address critical global challenges, providing a scalable solution for waste management.

9.2 Future Enhancements


While the Smart Recycling System is a robust solution, there are several areas where
improvements and enhancements can be made to increase its effectiveness and scalability. These
future enhancements could include:
• Integration with IoT: Although the current system does not incorporate IoT technology,
future versions could use IoT devices like smart bins equipped with sensors to detect waste
types automatically. This would enhance the system's ability to classify waste without
requiring user input, improving efficiency.
• Machine Learning Model Improvement: The waste classification algorithm can be
further refined using advanced machine learning techniques. By training the model on a
larger and more diverse dataset, the system's classification accuracy can be significantly
improved. Additionally, integrating deep learning models could help classify more
complex or ambiguous waste items.
• Geolocation-Based Recommendations: Future versions of the system could integrate
geolocation features, providing users with information on nearby recycling centres, drop-
off locations, or collection schedules based on their location. This would make the system
more convenient and relevant to users' daily lives.
• Mobile Application Development: The system could be expanded into a mobile
application, making it accessible to a broader audience. A mobile app could offer real-
time notifications, push alerts for waste collection schedules, and allow users to access
recycling tips on the go.
• Collaborations with Local Governments and Organizations: Partnering with local
governments or recycling organizations could help scale the system and ensure that it is
aligned with local recycling rules and regulations. Such collaborations could also help
provide users with rewards or incentives for actively participating in recycling programs.
• User Engagement and Education: The system could include educational resources about
recycling best practices and the environmental impact of waste. Gamifying the experience
(e.g., with rewards or points for recycling efforts) could encourage users to engage with
the system more frequently and improve participation in recycling efforts.
• Support for More Waste Categories: The system could expand its database to include a
wider variety of waste materials, such as electronic waste, textiles, and organic waste. This
would make the system applicable to a broader range of recycling scenarios.

By implementing these enhancements, the Smart Recycling System could become an even more
powerful tool in promoting sustainable waste management and encouraging responsible consumption.
The continuous evolution of the system will contribute to global sustainability efforts and help tackle
the growing environmental challenges posed by waste.

APPENDIX

A. SOURCE CODE:

#Importing the Required Libraries


import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings('ignore')

#Importing the Dataset


import pandas as pd
df = pd.read_csv('/kaggle/input/smartrecy-csv/Smart_Bin.csv')
#Information overview of the dataset
df.shape
df.info()
#Replacing the missing values
def removing_missing_values(column_name):
    df[column_name] = df[column_name].fillna(df[column_name].median())

# Impute the median for each of columns 1..7
for i in range(1, 8):
    removing_missing_values(df.columns[i])
df
df.info()
#Label Encoding the Target variable
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['Class']= label_encoder.fit_transform(df['Class'])
df['Class'].unique()
df['Container Type']= label_encoder.fit_transform(df['Container Type'])
df['Container Type'].unique()
df['Recyclable fraction']= label_encoder.fit_transform(df['Recyclable fraction'])
df['Recyclable fraction'].unique()
df.head()
#df.describe()

#Detecting and Removing the Outliers


import seaborn as sns
plt.figure(figsize=(20,15))
plt.subplot(2,4,1)
sns.boxplot(df['Class'])
plt.subplot(2,4,2)
sns.boxplot(df['FL_B'])
plt.subplot(2,4,3)
sns.boxplot(df['FL_A'])
plt.subplot(2,4,4)
sns.boxplot(df['VS'])
plt.subplot(2,4,5)
sns.boxplot(df['FL_B_3'])
plt.subplot(2,4,6)
sns.boxplot(df['FL_A_3'])
plt.subplot(2,4,7)
sns.boxplot(df['FL_B_12'])
plt.subplot(2,4,8)
sns.boxplot(df['FL_A_12'])
plt.show()

#Removing the Outliers


df['FL_B'].values[df['FL_B'].values > 100] = df['FL_B'].mean()
df['FL_B_3'].values[df['FL_B_3'].values > 100] = df['FL_B_3'].mean()
df['FL_B_12'].values[df['FL_B_12'].values > 100] = df['FL_B_12'].mean()
df['FL_A'].values[df['FL_A'].values > 100] = df['FL_A'].mean()
df['FL_A_3'].values[df['FL_A_3'].values > 100] = df['FL_A_3'].mean()
df['FL_A_12'].values[df['FL_A_12'].values > 100] = df['FL_A_12'].mean()
df.describe()

#Calculation of Change in Fill Level


df['FL_C']=df['FL_A']-df['FL_B']
df['FL_C_3']=df['FL_A_3']-df['FL_B_3']
df['FL_C_12']=df['FL_A_12']-df['FL_B_12']
df.info()

#Correlation Matrix
import seaborn as sns
from matplotlib import rcParams
from matplotlib.cm import rainbow
corrmat=df.corr()
top_corr_features=corrmat.index
plt.figure(figsize=(15,15))
sns.heatmap(df[top_corr_features].corr(),annot=True,cmap='RdYlGn')
plt.show()
#Standard Scaling
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
standardScaler=StandardScaler()
columns_to_scale=['FL_B','FL_A', 'FL_C', 'VS']
df[columns_to_scale]=standardScaler.fit_transform(df[columns_to_scale])
df.head()

#Splitting the dataset into training and testing data


from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
train_X = train[['FL_A', 'FL_C', 'VS']]
train_y = train.Class
test_X = test[['FL_A', 'FL_C', 'VS']]
test_y = test.Class

#K Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
knn_scores=[]
for k in range(1, 21):
    knn_classifier = KNeighborsClassifier(n_neighbors=k)
    knn_classifier.fit(train_X, train_y)
    knn_scores.append(knn_classifier.score(test_X, test_y))
#Plotting the Graph of model scores for different k values
plt.figure(figsize=(30,30))
plt.plot([k for k in range(1,21)],knn_scores,color="blue")
for i in range(1, 21):
    plt.text(i, knn_scores[i-1], (i, round(knn_scores[i-1], 4)))
plt.xticks([i for i in range(1,21)])
plt.xlabel("Number of Neighbors (K)",color="Red",weight="bold",fontsize="18")
plt.ylabel("Scores",color="Red",weight="bold",fontsize="18")
plt.title("K Neighbors Classifier scores for different K values",color="Red",weight="bold",fontsize="20")
#plt.figure(figsize=(30, 20))
plt.show()
plt.rcParams["font.weight"]="bold"
plt.rcParams["axes.labelweight"]="bold"

#Plotting the confusion matrix


from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
KN_model = KNeighborsClassifier(8)
KN_model.fit(train_X, train_y)
pred = KN_model.predict(test_X)
KN_model.score(test_X, test_y)
cm = confusion_matrix(test_y, pred)
print(cm)
print("\n")

# Plotting the confusion matrix


disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=KN_model.classes_)
disp.plot(cmap=plt.cm.Blues)

# Classification Report
print(classification_report(test_y, pred))
TP = cm[1][1]
FP = cm[0][1]
TN = cm[0][0]
FN = cm[1][0]
b = ((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))
b = b**0.5
a = (TP*TN-FP*FN)
MCC_KNN = round(a/b,3 )
Precision_Score = TP / (FP + TP)
Recall_Score = TP / (FN + TP)
Accuracy_Score = (TP + TN)/ (TP + FN + TN + FP)
F1_Score = 2* Precision_Score * Recall_Score/ (Precision_Score + Recall_Score)
print("Accuracy : ",Accuracy_Score*100)
print("Precision : ",Precision_Score*100)
print("Recall : ",Recall_Score*100)
print("F1 score : ",F1_Score*100)
print('MCC Score : ', MCC_KNN)
sns.jointplot(x='FL_C',y='FL_A',data=df,hue='Class')

#Support Vector Machine


import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix
import seaborn as sns # For better heatmap visualization

# Assuming you have 'train_X', 'train_y', 'test_X', 'test_y' ready and already split
svm_model = LinearSVC()
svm_model.fit(train_X, train_y) # Train the SVM model

# Predictions
pred = svm_model.predict(test_X)

# Generate confusion matrix


cm = confusion_matrix(test_y, pred)
print(cm)
print("\n")

# Manually plotting the confusion matrix using seaborn heatmap
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Predicted Negative", "Predicted Positive"],
            yticklabels=["True Negative", "True Positive"])
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title('Confusion Matrix')
plt.show()

# Extract values from confusion matrix


TP = cm[1][1] # True Positive
FP = cm[0][1] # False Positive
TN = cm[0][0] # True Negative
FN = cm[1][0] # False Negative

# MCC Calculation
b = ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)) ** 0.5
a = (TP * TN - FP * FN)
MCC_SVM = round(a / b, 3) if b != 0 else 0 # Avoid division by zero

# Precision, Recall, Accuracy, F1 Score Calculation


Precision_Score = TP / (FP + TP) if (FP + TP) != 0 else 0
Recall_Score = TP / (FN + TP) if (FN + TP) != 0 else 0
Accuracy_Score = (TP + TN) / (TP + FN + TN + FP)
F1_Score = 2 * Precision_Score * Recall_Score / (Precision_Score + Recall_Score) if (Precision_Score + Recall_Score) != 0 else 0

# Printing metrics
print("Accuracy (calculated manually):", Accuracy_Score * 100)
print("Accuracy (from model's score):", svm_model.score(test_X, test_y) * 100) # Alternatively, use the
model's built-in accuracy
print("Precision:", Precision_Score * 100)
print("Recall:", Recall_Score * 100)
print("F1 score:", F1_Score * 100)
print("MCC score:", MCC_SVM)
#Logistic Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import seaborn as sns # For better heatmap visualization

# Assuming 'train_X', 'train_y', 'test_X', 'test_y' are already defined and split
classifier = LogisticRegression(random_state=0)
classifier.fit(train_X, train_y) # Train the logistic regression model

# Predicting the test set


y_pred = classifier.predict(test_X)

# Generate confusion matrix


cm = confusion_matrix(test_y, y_pred)
print(cm)
print("\n")

# Plotting confusion matrix using seaborn (alternative to plot_confusion_matrix)


plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Predicted Negative", "Predicted Positive"],
            yticklabels=["True Negative", "True Positive"])
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title('Confusion Matrix')
plt.show()

# Extract values from confusion matrix


TP = cm[1][1] # True Positive
FP = cm[0][1] # False Positive
TN = cm[0][0] # True Negative
FN = cm[1][0] # False Negative

# MCC Calculation (Matthews Correlation Coefficient)
b = ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)) ** 0.5
a = (TP * TN - FP * FN)
MCC_LR = round(a / b, 3) if b != 0 else 0  # avoid division by zero

# Precision, Recall, Accuracy, F1 Score Calculation


Precision_Score = TP / (FP + TP) if (FP + TP) != 0 else 0
Recall_Score = TP / (FN + TP) if (FN + TP) != 0 else 0
Accuracy_Score = (TP + TN) / (TP + FN + TN + FP)
F1_Score = 2 * Precision_Score * Recall_Score / (Precision_Score + Recall_Score) if (Precision_Score + Recall_Score) != 0 else 0

# Printing performance metrics


print("Accuracy (calculated manually):", Accuracy_Score * 100)
print("Accuracy (from model's score):", classifier.score(test_X, test_y) * 100) # Alternatively, use the
model's built-in accuracy
print("Precision:", Precision_Score * 100)
print("Recall:", Recall_Score * 100)
print("F1 score:", F1_Score * 100)
print("MCC score:", MCC_SVM)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix
import seaborn as sns # For better heatmap visualization

# Assuming 'train_X', 'train_y', 'test_X', 'test_y' are already defined and split
# Decision Tree Classifier
clf_model = DecisionTreeClassifier(random_state=0)
clf_model.fit(train_X, train_y)
pred = clf_model.predict(test_X)

# Confusion Matrix for Decision Tree


cm = confusion_matrix(test_y, pred)
print("Decision Tree Confusion Matrix:")
print(cm)
print("\n")

# Plot confusion matrix using Seaborn


plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Predicted Negative", "Predicted Positive"],
            yticklabels=["True Negative", "True Positive"])
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title('Decision Tree Confusion Matrix')
plt.show()

# Extract values from confusion matrix


TP = cm[1][1] # True Positive
FP = cm[0][1] # False Positive
TN = cm[0][0] # True Negative
FN = cm[1][0] # False Negative

# MCC Calculation (Matthews Correlation Coefficient)


b = ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)) ** 0.5
a = (TP * TN - FP * FN)
MCC_DT = round(a / b, 3) if b != 0 else 0 # Avoid division by zero

# Precision, Recall, Accuracy, F1 Score Calculation


Precision_Score = TP / (FP + TP) if (FP + TP) != 0 else 0
Recall_Score = TP / (FN + TP) if (FN + TP) != 0 else 0
Accuracy_Score = (TP + TN) / (TP + FN + TN + FP)
F1_Score = 2 * Precision_Score * Recall_Score / (Precision_Score + Recall_Score) if (Precision_Score + Recall_Score) != 0 else 0

# Printing performance metrics for Decision Tree


print("Decision Tree Accuracy:", Accuracy_Score * 100)
print("Decision Tree Precision:", Precision_Score * 100)
print("Decision Tree Recall:", Recall_Score * 100)
print("Decision Tree F1 score:", F1_Score * 100)
print("Decision Tree MCC score:", MCC_DT)

# Neural Network Classifier (MLP)


classifier = MLPClassifier(hidden_layer_sizes=(150, 100, 50), max_iter=300, activation='relu',
solver='adam', random_state=1)
classifier.fit(train_X, train_y)
y_pred = classifier.predict(test_X)

# Confusion Matrix for Neural Network


cm = confusion_matrix(test_y, y_pred)
print("\nNeural Network Confusion Matrix:")
print(cm)
print("\n")

# Plot confusion matrix for Neural Network


plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Predicted Negative", "Predicted Positive"],
            yticklabels=["True Negative", "True Positive"])
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title('Neural Network Confusion Matrix')
plt.show()

# Extract values from confusion matrix


TP = cm[1][1] # True Positive
FP = cm[0][1] # False Positive
TN = cm[0][0] # True Negative
FN = cm[1][0] # False Negative

# MCC Calculation (Matthews Correlation Coefficient)


b = ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)) ** 0.5
a = (TP * TN - FP * FN)
MCC_NN = round(a / b, 3) if b != 0 else 0 # Avoid division by zero
# Precision, Recall, Accuracy, F1 Score Calculation
Precision_Score = TP / (FP + TP) if (FP + TP) != 0 else 0
Recall_Score = TP / (FN + TP) if (FN + TP) != 0 else 0
Accuracy_Score = (TP + TN) / (TP + FN + TN + FP)
F1_Score = 2 * Precision_Score * Recall_Score / (Precision_Score + Recall_Score) if (Precision_Score + Recall_Score) != 0 else 0

# Printing performance metrics for Neural Network


print("Neural Network Accuracy:", Accuracy_Score * 100)
print("Neural Network Precision:", Precision_Score * 100)
print("Neural Network Recall:", Recall_Score * 100)
print("Neural Network F1 score:", F1_Score * 100)
print("Neural Network MCC score:", MCC_NN)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

# Assuming the 'df' DataFrame is available with required features and 'Class' as target
# Splitting the data
train, test = train_test_split(df, test_size=0.2)
train_X = train[['FL_B', 'FL_C', 'FL_C_12', 'FL_C_3']]
train_y = train['Class']
test_X = test[['FL_B', 'FL_C', 'FL_C_12', 'FL_C_3']]
test_y = test['Class']

# Train and evaluate Random Forest for different numbers of trees


RF_scores = []
for k in range(1, 21):
    ensemble_classifier = RandomForestClassifier(n_estimators=k, random_state=42)
    ensemble_classifier.fit(train_X, train_y)
    RF_scores.append(ensemble_classifier.score(test_X, test_y))

# Plot the Random Forest performance with different n_estimators
plt.figure(figsize=(10, 6))
plt.plot(range(1, 21), RF_scores, color='blue')
for i in range(1, 21):
    plt.text(i, RF_scores[i-1], f"({i}, {RF_scores[i-1]:.3f})", ha='center', va='bottom')
plt.xticks(range(1, 21))
plt.xlabel('Number of Trees', color='Red', weight='bold', fontsize=12)
plt.ylabel('Accuracy', color='Red', weight='bold', fontsize=12)
plt.title('Random Forest Accuracy vs Number of Trees', color='Red', weight='bold', fontsize=14)
plt.show()

# Train Random Forest with default n_estimators


RF_model = RandomForestClassifier(n_estimators=100, random_state=42)
RF_model.fit(train_X, train_y)
pred = RF_model.predict(test_X)

# Compute confusion matrix


cm = confusion_matrix(test_y, pred)
print("Confusion Matrix:")
print(cm)
print("\n")

# Plot confusion matrix


plt.figure(figsize=(6, 5))
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix for Random Forest')
plt.colorbar()
tick_marks = np.arange(2)
plt.xticks(tick_marks, ['Negative', 'Positive'], rotation=45)
plt.yticks(tick_marks, ['Negative', 'Positive'])

# Annotate the confusion matrix


thresh = cm.max() / 2.
for i in range(2):
    for j in range(2):
        plt.text(j, i, format(cm[i, j]), ha="center", va="center",
                 color="white" if cm[i, j] > thresh else "black")

plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

# Calculate metrics from confusion matrix


TP = cm[1][1] # True Positive
FP = cm[0][1] # False Positive
TN = cm[0][0] # True Negative
FN = cm[1][0] # False Negative

# MCC Calculation
b = ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)) ** 0.5
a = (TP * TN - FP * FN)
MCC_RF = round(a / b, 3) if b != 0 else 0 # Avoid division by zero

# Other metrics
Precision_Score = TP / (FP + TP) if (FP + TP) != 0 else 0
Recall_Score = TP / (FN + TP) if (FN + TP) != 0 else 0
Accuracy_Score = (TP + TN) / (TP + FN + TN + FP)
F1_Score = 2 * Precision_Score * Recall_Score / (Precision_Score + Recall_Score) if (Precision_Score + Recall_Score) != 0 else 0

# Print the results


print("Random Forest Performance Metrics:")
print(f"Accuracy : {Accuracy_Score * 100:.2f}%")
print(f"Precision : {Precision_Score * 100:.2f}%")
print(f"Recall : {Recall_Score * 100:.2f}%")
print(f"F1 score : {F1_Score * 100:.2f}%")
print(f"MCC score : {MCC_RF:.3f}")

B. INTERNSHIP CERTIFICATES

C. PUBLICATION

1. Presented a paper titled “XXXX” in the CONFERENCE NAME at VENUE, DATE.


2. Published a paper titled “XXXX”, Journal Name, Vol.No.,
Issue No., pp.23-29, Date/Month 2024. DOI:
