Case Study: Prediction on Iris Dataset Using KNN Algorithm
Shreyas Tayade1, Rakhi Gupta2, Deval Kherde3, Chaitanya Ubale4
1Student, Sipna College of Engineering and Technology, Maharashtra, India
2Assistant Professor, Sipna College of Engineering and Technology, Maharashtra, India
3Student, Sipna College of Engineering and Technology, Maharashtra, India
4Student, Sipna College of Engineering and Technology, Maharashtra, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - This case study applies the K-Nearest Neighbors (KNN) algorithm to the well-known Iris dataset. The dataset contains 150 iris flower observations, 50 for each of the three species: Setosa, Versicolor, and Virginica. The goal is to classify each flower into its species using four features: sepal length, sepal width, petal length, and petal width.
KNN is a popular and straightforward classification technique that predicts the class of an observation from the classes of its nearest neighbors. In this study, the dataset is first split into training and testing sets, and the features are scaled so that they all lie on the same scale. A KNN model is then trained with k = 3, so that each prediction takes an observation's three nearest neighbors into account. Finally, the accuracy score is used to assess how well the model performs on the test set.
Key Words: K-Nearest Neighbors, sepal length, sepal width, petal length, petal width
1. INTRODUCTION
The Iris dataset, which contains measurements of three different iris flower species, is well known in the machine learning field. It is a classic example of a problem that can be solved with supervised learning and has been widely used as a benchmark for classification algorithms.
K-Nearest Neighbors (KNN) is a simple and popular classification technique that is well suited to this problem. In this case study, we use the KNN method to classify iris flowers according to four features: sepal length, sepal width, petal length, and petal width.
The main objective of this case study is to outline the fundamental steps of applying KNN to the Iris dataset, from loading the data through assessing the model's performance on unseen data. We first load the dataset, then split it into training and testing sets, normalize the data, train the KNN model, and evaluate its performance.
Fig. 1: Dataset
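The paper does not include its loading code; the short sketch below shows one way to obtain the data pictured in Fig. 1, assuming the copy of the Iris dataset bundled with scikit-learn (a CSV version would work equally well with pandas).

```python
# Minimal loading sketch (assumes scikit-learn's bundled copy of the Iris dataset).
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target      # 150 samples x 4 features, integer class labels

print(X.shape)                     # (150, 4)
print(iris.feature_names)          # sepal length/width, petal length/width (in cm)
print(iris.target_names)           # ['setosa' 'versicolor' 'virginica']
```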
For those who are new to machine learning, the Iris dataset is a good example of a classification problem that can be handled with KNN. The knowledge and methods from this case study can also be applied to other classification problems.
2. ATTRIBUTE SELECTION
Selecting the most informative attributes for KNN is key to achieving good classification accuracy on the Iris dataset. The four features in this dataset are sepal length, sepal width, petal length, and petal width.
Fig. 2: Description of the data
One method for selecting the best attributes is to use feature selection techniques that rank the features by their relevance to the classification task. This can be done in several ways, including selection based on mutual information, correlation, or tree-based feature importance; one such ranking is sketched below.
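As an illustration of the mutual-information approach, the hedged sketch below ranks the four Iris features with scikit-learn's mutual_info_classif; the paper does not name a specific implementation, so this is only one possible choice.

```python
# Rank the four Iris features by mutual information with the species label.
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

iris = load_iris()
scores = mutual_info_classif(iris.data, iris.target, random_state=0)

# Print the features from most to least informative.
for name, score in sorted(zip(iris.feature_names, scores),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {score:.3f}")
# Petal length and petal width typically rank highest, in line with Section 2.
```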
An alternative strategy is to plot the data with scatter plots or other visualization tools and assess how easily the classes can be separated by each feature. For instance, we can plot every pairwise combination of features and see which combination separates the classes best.
In the Iris dataset, petal length and petal width are known to offer the best separation between the three classes, as shown in numerous studies and visualizations. Consequently, these two features are frequently chosen as the most informative attributes for KNN on this dataset.
It is important to remember that the best choice of attributes can vary with the problem and dataset. It is therefore always advisable to experiment with different attribute combinations and evaluate the KNN model's performance on a validation or test set, as in the sketch below.
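The sketch below illustrates such an experiment by comparing KNN accuracy with all four features against the two petal measurements alone. The split ratio and random seed are assumptions, so the exact numbers will vary.

```python
# Compare KNN accuracy using all four features vs. petal length/width only.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target)

subsets = {
    "all four features": slice(0, 4),
    "petal length and width only": slice(2, 4),  # columns 2 and 3 hold the petal measurements
}
for label, cols in subsets.items():
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
    model.fit(X_train[:, cols], y_train)
    print(f"{label}: {model.score(X_test[:, cols], y_test):.3f}")
```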
3. DATA VISUALIZATION
1. Scatter plot: A scatter plot visualizes two continuous features, such as sepal length and sepal width. Scatter plots can reveal patterns or trends in the data, for example whether two features have a linear relationship or whether there are any outliers.
2. Box plot: Box plots show how a continuous variable is distributed across different groups, for instance the range of sepal length for each iris species. The box represents the interquartile range (IQR), which covers the middle 50% of the data, while the whiskers extend to points within 1.5 times the IQR. Box plots make it easy to spot differences in how a measurement is distributed across groups.
3. Histogram: Histograms show the distribution of a single continuous variable, for example petal length across the dataset. They help reveal the shape of the distribution (such as normal or skewed) as well as possible outliers or gaps in the data.
4. Heatmap: Heatmaps show the relationship between two variables as a colored grid. For instance, a heatmap can display how often each combination of (binned) petal length and petal width occurs for each iris species. Heatmaps help reveal patterns such as combinations of values that are more common in one group than in another.
5. Pie chart: Pie charts show the proportion of each category within a single categorical variable. For instance, a pie chart can display the share of each species in the Iris data. Pie charts are useful for comparing the sizes of different groups and for visualizing the distribution of a categorical variable.
These are only a few of the many visualizations that can be produced from the Iris dataset. Data visualization helps researchers and practitioners gain insight from the data and ultimately supports better decision-making. Whether it is the Iris dataset or any other dataset, visualization is an essential stage of the data analysis process; a short plotting sketch follows.
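The paper does not state which plotting library produced its figures; the sketch below, assuming pandas, matplotlib, and seaborn, shows how the kinds of charts described in this section could be drawn (a correlation heatmap is used in place of the count-based heatmap described above).

```python
# Example plots for the Iris data: pairwise scatter plots, a box plot,
# a histogram, a correlation heatmap, and a pie chart of class proportions.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame                                           # four measurements plus a 'target' column
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

sns.pairplot(df.drop(columns="target"), hue="species")    # scatter plots for every feature pair
plt.show()

sns.boxplot(data=df, x="species", y="sepal length (cm)")  # sepal length per species
plt.show()

df["petal length (cm)"].plot.hist(bins=20)                # distribution of petal length
plt.show()

sns.heatmap(df.drop(columns=["target", "species"]).corr(), annot=True)  # correlations between measurements
plt.show()

df["species"].value_counts().plot.pie(autopct="%1.0f%%")  # share of each species
plt.show()
```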
4. MODEL COMPARISON
1. Logistic Regression: Logistic regression is a linear classification model that predicts the probability that an instance belongs to a particular class. It assumes a linear relationship between the features and the target variable. The method is simple and easy to interpret, and it can be applied to binary or multi-class classification problems.
2. Decision Trees: Decision trees are non-linear models that can be applied to both classification and regression problems. They recursively split the data into subsets based on the values of the features and assign each subset the class that dominates it. Decision trees handle both categorical and numerical features and are easy to interpret.
3. Random Forests: Random Forests are an ensemble technique that combines many decision trees to produce a more robust and accurate model. Each tree in the forest is trained on a random subset of the data, and the predictions of all trees are aggregated to produce the final prediction. Random forests are known for their high accuracy and their ability to handle complex datasets.
4. Support Vector Machines (SVM): SVMs are a common choice for binary and multi-class classification problems. They work by finding the hyperplane that best separates the classes while maximizing the margin between them. SVMs can model both linear and non-linear relationships between the features and the target variable, which makes them particularly effective for datasets with distinct class boundaries. A comparison sketch for these models follows this list.
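To show how a comparison like Table 1 can be produced, the hedged sketch below trains six scikit-learn classifiers on the same 70:30 split. The random seed and default hyperparameters are assumptions, so the accuracies will not necessarily match the table exactly.

```python
# Train several classifiers on the same split and report test accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
}

for name, clf in models.items():
    pipeline = make_pipeline(StandardScaler(), clf)  # scale the features, then fit the classifier
    pipeline.fit(X_train, y_train)
    print(f"{name}: {pipeline.score(X_test, y_test):.4f}")
```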
RESULTS AND ANALYSIS
For the Iris dataset, the highest accuracy, 95.50%, was obtained with KNN, and the lowest, 88.88%, with Logistic Regression. The results for all of the models used are tabulated and plotted below.
Fig. 3: Accuracy plot
Model                  Accuracy
KNN                    95.50%
Decision Tree          93.33%
Logistic Regression    88.88%
SVM                    93.33%
Naive Bayes            91.11%
Random Forest          91.11%
Table 1: Comparison of algorithms
5. MODEL TRAINING
In the case of KNN on the Iris dataset, the model training
involves the following steps:
1. Dataset loading: The Iris dataset is first loaded into the machine learning environment. It consists of 150 samples with 4 features.
2. Division of the dataset: The dataset is split into a training set and a testing set so that the KNN model can be assessed on unseen data. A 70:30 split is commonly used, with 70% of the data for training and 30% for testing.
3. Feature scaling: Because KNN is a distance-based algorithm, all features must be on a comparable scale. A standard approach is to standardize each feature by subtracting its mean and dividing by its standard deviation, i.e. z = (x - mean) / standard deviation.
4. KNN model training: The KNN model is trained on the training set. The primary KNN parameter is the number of neighbors to consider (k); for the Iris dataset, k = 3 or k = 5 is frequently used.
5. Model evaluation: A performance metric such as accuracy, precision, recall, or F1 score is used to assess the KNN model on the testing set. For the Iris dataset, the accuracy score is most commonly used.
6. Model tuning: If the performance of the KNN model is not adequate, it can be tuned by changing the value of k or experimenting with other distance measures. An end-to-end sketch of steps 1-5 follows this list.
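The paper does not list its implementation; the following is a minimal end-to-end sketch of steps 1-5, assuming scikit-learn, a 70:30 split, standardization, and k = 3 as described above. The random seed is an assumption, so the resulting accuracy may differ from the 95.50% reported in Table 1.

```python
# End-to-end KNN training on the Iris dataset: load, split, scale, train, evaluate.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Step 1: load the dataset.
X, y = load_iris(return_X_y=True)

# Step 2: 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Step 3: standardize each feature, z = (x - mean) / std, fitting the scaler on the training set only.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 4: train the KNN model with k = 3.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set with the accuracy score.
y_pred = knn.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```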
Fig. 4: Accuracy plot for different k values
Overall, the KNN algorithm is relatively simple and easy to implement for the Iris dataset. The key steps are to split the data, normalize it, train the model, and evaluate its performance. By following these steps and experimenting with different parameter values, for example by sweeping k as sketched below, it is possible to achieve high classification accuracy on the Iris dataset.
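As one way of experimenting with the parameter k, the sketch below sweeps k over a small range and plots test accuracy, producing the kind of curve shown in Fig. 4. The range of k values and the split seed are assumptions.

```python
# Sweep k and plot test accuracy, similar in spirit to Fig. 4.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

k_values = range(1, 26)
accuracies = [
    KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_test, y_test)
    for k in k_values
]

plt.plot(k_values, accuracies, marker="o")
plt.xlabel("k (number of neighbors)")
plt.ylabel("Test accuracy")
plt.title("KNN accuracy on the Iris test set for different k")
plt.show()
```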
6. CONCLUSIONS
As its successful application to the well-known Iris dataset shows, the K-Nearest Neighbors (KNN) algorithm provides a straightforward and practical approach to classification problems, and the Iris dataset remains a good first classification exercise for newcomers to machine learning.
In this case study, we have demonstrated the fundamental steps of applying KNN to the Iris dataset: loading the data, splitting it into training and testing sets, normalizing the features, and finally training and evaluating the KNN model. The model performed well on the test set, demonstrating its effectiveness at distinguishing the different species of iris.
The case study also shows the value of data preprocessing and careful evaluation in obtaining accurate and trustworthy results, and it can serve as a useful reference for anyone interested in applying KNN to classification problems.