
2023 International Conference on Computer Communication and Informatics (ICCCI), Jan, 23-25, 2023, Coimbatore, INDIA

An Improvised Machine Learning Model KNN for Malware Detection and Classification
2023 International Conference on Computer Communication and Informatics (ICCCI) | 979-8-3503-4821-7/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICCCI56745.2023.10128189

M. Raj Shekhar Rao
Dept. of Networking and Communications
SRM Institute of Science and Technology, Kattankulathur
Chennai, India
[email protected]

Deepanshu Yadav
Dept. of Networking and Communications
SRM Institute of Science and Technology, Kattankulathur
Chennai, India
[email protected]

Dr. V. Anbarasu
Dept. of Networking and Communications
SRM Institute of Science and Technology, Kattankulathur
Chennai, India
[email protected]

Abstract: Network security has become a key source of concern in today's connected society. Sabotage and information extortion are among the most significant risks to an organization's security; the wider spectrum of threats also includes large-scale intellectual property theft and software attacks. Inappropriate use of protocols on a network may also represent a security risk to a system or network. In network security, the use of data mining tools to gather, transform, and analyze enormous amounts of data is vital. With the range of data mining technologies now available, it is possible to analyze and forecast data and threats across computer networks. This study attempts to forecast network security threats using several classification methods: the Naive Bayes classifier, the Decision Tree classifier, and the K-Nearest Neighbors classifier. It assesses the efficacy of these classifiers in discovering potential risks and vulnerabilities. Simple machine learning techniques are used to analyze a dataset and extract information about attack categories such as Distributed Denial of Service (DDoS) attacks, R2L attacks, U2R attacks, and Probe attacks. To compare the performance of the proposed machine learning technique with other approaches, entropy, accuracy, recall, and F1 score may be employed.

Keywords: Machine Learning, Computer networks, algorithm, entropy, accuracy, recall.

I. INTRODUCTION

Protecting the network from malware is one of the most important aspects of system or network administration. If a corporation is breached by a hostile attacker, the result can be significant losses for the business, including a data breach, probable downtime, and loss of client trust.

A malware detection system is a piece of software or hardware that watches for potentially harmful, dangerous, or policy-violating activity on networks or systems. Administrators are informed of these violations or intrusions.

A properly functioning network requires a high level of security and protection in order to enable trustworthy knowledge sharing between organisations. Attacks such as cyber intrusions must be dealt with more rigorously in light of the advancements in science and technology. Our lives are greatly facilitated by a malware detection system, which covers the gap in the secure protection of these networks.

II. LITERATURE SURVEY

[1] A malware detection system watches for hostile or unusual activity on networks or other systems. Preventive solutions such as firewalls, robust authentication, and user privilege management are also available. The detection system can be improved by applying data mining methods such as segmentation, frequent pattern mining, and so on, to susceptible data sent across a network. Additionally, Network Behavior Analysis (NBA) software monitors network traffic or network traffic data in order to detect unexpected traffic flows through network monitoring activities.

[2] Classification techniques are used to identify network intrusions, which is a more effective method for distinguishing between normal and anomalous network behaviour. If misconfigurations are present, the security of the whole network is jeopardized and interruptions may even occur. This article is mostly concerned with identifying misconfigurations in IP addresses or other situations that might result in an anomaly.

[3] This article starts by proposing a data-driven defensive architecture for cyber security situational awareness. An inter-learning design for cyber-attack detection via data mining is then presented, along with different data mining approaches for detection. Two detection methods are examined in the investigations: misuse detection and anomaly detection. Numerous data mining techniques are used to detect cyber-attacks, including classification, K-nearest neighbors, Decision Tree, k-means, and association rules. These techniques are used with either thorough inspection or net flow data.

[4] With this study's hybrid detection technique, performance is improved by integrating the benefits of both misuse detection and anomaly detection. By comparing several methods against each other, it may identify the most effective ones. There is a significant increase in detection rates with a low rate of false positives when using the anomaly detection approach, as well as a significant increase in detection rates when using the hybrid system that is developed.

[5] This article discusses a technique known as fuzzy entropy weighted natural nearest neighbour (FEW-NNN) that greatly enhances the accuracy and efficiency of an attack detection system. The density of data points, the core of each class, and the least encircling sphere radius are then computed using the suggested natural nearest neighbour searching algorithms. The article employed data mining to resolve the problem of router system failures. This is achieved by applying clustering algorithms to the file systems of routers across an administrative domain to reveal local, network-specific rules. Discrepancies from these local standards might result in implementations. It is utilized in tandem
with file systems from a major state-wide network, a big university community, and a major research network.

[6] This approach is predicated on the idea that meaningful information may be extracted from big data sets, the same principle that underpins knowledge discovery. The majority of machine learning approaches rely on algorithms using statistical and quantitative models, which are often generated from machine-learning techniques. While these approaches have been effective in segmentation, extrapolation, and grouping, each method reveals only a limited depth of information.

[7] This article defines and illustrates a visual technique for exploring log files. Additionally, visual data analytics is applied. The Three record file system is the most effective approach for spotting faults and anomalies.

[8] This work is based on the implementation of data mining technology into network services; the extraction, transformation, and federalization of enormous amounts of necessary data; and the provision of substantial technology, correlation, and multiple regression analysis. Together these enable legislators and analysts to initiate a global comprehensive assessment of data reserves in a timely manner, extracting helpful information from them in order to efficiently achieve comprehensive network management. Incorporating big data and evolution into intelligent platforms and devices has become an important stage in their progress.

[9] This study describes a threat detection approach for a vast network. The method uses small and light statistical information accessible at distribution locations, and combines analytical reports and data mining methods with efficient signature detection. This strategy is only efficient against network assaults like Denial of Service (DoS).

III. OBJECTIVES

The main goal of malware analysis is to get information from a sample of the infection that will aid in responding to a malware occurrence. Malware study seeks to evaluate, pinpoint, and contain the possible threats posed by malware.

The goal of this project is to create a system for detecting malware and a predictive model that can determine the difference between good (normal) and bad (network intrusion) connections. We will train numerous machine-learning models on the dataset in order to gain a better knowledge of network intrusions for the aforementioned goal.

Fig. 1. Classification of Malware Detection Techniques

IV. METHODOLOGY

We chose to work with a 10% sample of the original dataset because the full dataset is extremely detailed and large. The dataset was created by the Massachusetts Institute of Technology Lincoln Lab and was first designed to research network intrusion.

The dataset [10] was divided into training and testing groups in an 80:20 ratio. There were 116469 samples in the training dataset and 29117 samples in the testing dataset.

The dataset employed for this investigation is exceptionally comprehensive and feature-rich, but it still contains many inconsistencies. Cleaning the data was therefore essential.

For this experiment, we took the 10% sample of the dataset and checked it for inconsistencies.

Feature extraction is a dimensionality reduction technique that condenses an initial set of raw data into more manageable groups for processing. Feature extraction works well when processing resources must be kept to a minimum without significantly losing information. Removing features may also help to cut down on redundant data in a particular investigation.

Rows in the dataset that contained redundant or useless data were excluded from our analysis. The data were then standardized, which is done before the data are supplied to the machine learning models. After that, we normalized the dataset.

After data processing, we used the dataset to train and test our machine learning models.

The following machine learning algorithms were applied to this subset:

• Decision Tree: The decision tree is a classification model. It predicts the class of a given object by analyzing the data using simple decision rules. In this case, the Gini index was used to choose the attribute that best divides the data. Additionally, class weights were adjusted in accordance with the frequency of each class.

• Random Forest: Random Forest is a supervised learning method. For the highest performance and most accurate findings, a random forest classifier with 100 trees was used.

• Gaussian Naïve Bayes: Naive Bayes is a probabilistic classification model. It is a machine learning classifier used to distinguish between items based on particular characteristics.

V. UML DIAGRAMS

Architecture Diagram: The decision tree generation, model evaluation, and performance measures modules are represented by the flow of modules in this diagram. The modules provide step-by-step coverage of the key components of developing the model. A block diagram is therefore required in order to comprehend the flow of modules and their individual contributions to the creation of the entire model.

Fig. 2. Architecture Diagram

Use Case Diagram: The UML Use Case Diagram is the foundation of our entire model, as depicted in Fig. 3.
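The module flow named for the architecture diagram (decision tree generation, model evaluation, performance measures) can be illustrated with a minimal sketch. This is not the authors' implementation: it reduces the tree to a single Gini-optimal split (a "stump"), assumes class weights inversely proportional to class frequency, and uses made-up toy rows with two hypothetical features (connection duration and failed logins). All function names are illustrative.

```python
from collections import Counter

def class_weights(labels):
    """Weights inversely proportional to class frequency (an assumed scheme)."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * n) for c, n in counts.items()}

def weighted_mass(labels, w):
    """Total weight carried by each class in a list of labels."""
    mass = Counter()
    for lab in labels:
        mass[lab] += w[lab]
    return mass

def gini(labels, w):
    """Class-weighted Gini impurity: 1 - sum(p_c ** 2)."""
    mass = weighted_mass(labels, w)
    total = sum(mass.values())
    return 1.0 - sum((m / total) ** 2 for m in mass.values())

def fit_stump(X, y):
    """Decision tree generation, reduced to one Gini-optimal split."""
    w = class_weights(y)
    best = None  # (impurity, feature index, threshold, left label, right label)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            wl = sum(w[lab] for lab in left)
            wr = sum(w[lab] for lab in right)
            score = (wl * gini(left, w) + wr * gini(right, w)) / (wl + wr)
            if best is None or score < best[0]:
                left_mass = weighted_mass(left, w)
                right_mass = weighted_mass(right, w)
                best = (score, f, t,
                        max(left_mass, key=left_mass.get),
                        max(right_mass, key=right_mass.get))
    if best is None:
        raise ValueError("no valid split found")
    return best[1:]

def predict(stump, row):
    """Model evaluation step: route a row to the left or right leaf."""
    f, t, left_label, right_label = stump
    return left_label if row[f] <= t else right_label

def evaluate(y_true, y_pred, positive):
    """Performance measures: accuracy, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(a == positive and b == positive for a, b in pairs)
    fp = sum(a != positive and b == positive for a, b in pairs)
    fn = sum(a == positive and b != positive for a, b in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(a == b for a, b in pairs) / len(pairs)
    return {"accuracy": accuracy, "recall": recall, "f1": f1}

# Toy data (hypothetical): [duration, failed_logins]
X_train = [[0.1, 0], [0.2, 0], [0.3, 1], [0.2, 0], [0.9, 4], [0.8, 5]]
y_train = ["normal", "normal", "normal", "normal", "attack", "attack"]
X_test = [[0.15, 0], [0.85, 3], [0.25, 3], [0.95, 6]]
y_test = ["normal", "attack", "attack", "attack"]

stump = fit_stump(X_train, y_train)
y_pred = [predict(stump, row) for row in X_test]
report = evaluate(y_test, y_pred, positive="attack")
print(stump, report)
```

On this toy data one attack row is missed, so accuracy, recall, and F1 differ from each other, which shows why the paper lists several measures rather than accuracy alone.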


Fig. 3. Use Case Diagram

Fig. 5. Result of Decision Tree Model

VI. EXPERIMENTAL RESULTS


The dataset was split in an 80:20 ratio before our tests were conducted: 80% of the dataset was used for training, while the remaining 20% was used for testing. K-fold validation (K=5) was also used to measure the skill of the machine learning models. The accuracy obtained by each model is reported in the list below.
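The evaluation protocol described in this section (an 80:20 hold-out split plus 5-fold validation) can be sketched in plain Python with a small K-Nearest Neighbour classifier. This is an illustrative sketch only: the toy points, the choice of k=3 neighbours, Euclidean distance, and all function names are assumptions, not details taken from the paper.

```python
import math
import random
from collections import Counter

def train_test_split(rows, test_ratio=0.2, seed=0):
    """Shuffle a copy of the rows and cut it into train/test parts."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_ratio)
    return rows[n_test:], rows[:n_test]

def k_folds(n, k=5):
    """Yield (train_idx, val_idx) index lists for k contiguous folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size

def knn_predict(train, x, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(train, key=lambda r: math.dist(r[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, test, k=3):
    """Fraction of test rows whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

# Toy data (hypothetical): two well-separated clusters of 10 points each.
data = [((i * 0.1, i * 0.05), "normal") for i in range(10)]
data += [((5 + i * 0.1, 5 + i * 0.05), "attack") for i in range(10)]

train, test = train_test_split(data, test_ratio=0.2)
holdout_acc = accuracy(train, test)

fold_scores = []
for tr_idx, va_idx in k_folds(len(train), k=5):
    fold_train = [train[i] for i in tr_idx]
    fold_val = [train[i] for i in va_idx]
    fold_scores.append(accuracy(fold_train, fold_val))

# Arithmetic check on the sample counts reported in the methodology.
n_total = 116469 + 29117
n_test = round(n_total * 0.2)
print(holdout_acc, fold_scores, n_total - n_test, n_test)
```

The last lines check the reported split sizes: the two counts sum to 145586 samples, and taking 20% of that (rounded) gives 29117 test samples and 116469 training samples, consistent with the 80:20 ratio stated in the paper.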

• Gaussian Naïve Bayes Model: 61.7%

• Decision Tree Model: 92.05%

Fig. 4. Result of Gaussian Naïve Bayes Model

Fig. 6. Result of Random Forest Model

Fig. 7. Result of K-Nearest Neighbour

VII. CONCLUSION

This project's goal is to introduce a machine learning solution to the malware problem. Because of the growing prevalence of malware, we require an automated way to identify infected files and potentially harmful websites.

Corrupt and clean executables were used to construct the data set in the project's early stages. The information needed to create the data set was extracted using a Python script. The data set must be ready before machine learning algorithms can be constructed and trained. The three algorithms used and compared were decision trees, Random Forest, and Naive Bayes. After applying the best accuracy techniques, the top accuracy reached was 98.20%. According to this study, the K-Nearest Neighbour model is the best approach for identifying risky code.

If the number of files in the data set used to train the algorithms is increased in the future, this accuracy can be improved. Every algorithm has a number of parameters that can be evaluated with various values to improve accuracy. This project can be taken to the application level with the help of a library called pickle, with which the trained model can be stored and then used to test whether a new file is clean or infected. Static analysis has also been shown to be risk-free and time-efficient during execution.

REFERENCES
[1] S. Ryan, R. Mohammad A, H. F. Md Jobair, S. Hossain, and C. Alfredo, “Ride-hailing for autonomous vehicles: Hyperledger fabric-based secure and decentralize blockchain platform,” IEEE International Conference on Big Data, 2021.
[2] H. F. Md Jobair, M. Paul, C. Ryan, S. Hossain, and C. Victor, “Smart connected aircraft: Towards security, privacy, and ethical hacking,” International Conference on Security of Information and Networks, 2022.
[3] M. J. Hossain Faruk, “Ehr data management: Hyperledger fabric-based health data storing and sharing,” The Fall 2021 Symposium of Student Scholars, 2021.
[4] H. Hassani, E. Silva, S. Unger, M. Tajmazinani, and S. MacFeely, “Artificial intelligence (ai) or intelligence augmentation (ia): What is the future?” AI, vol. 1, p. 1211, 04 2020.
[5] M. O. F. Rokon, R. Islam, A. Darki, E. Papalexakis, and M. Faloutsos, “Sourcefinder: Finding malware source-code from publicly available repositories,” in RAID, 2020.
[6] N. Sharma and B. Arora, “Data mining and machine learning techniques for malware detection,” in Rising Threats in Expert Applications and Solutions, V. S. Rathore, N. Dey, V. Piuri, R. Babo, Z. Polkowski, and J. M. R. S. Tavares, Eds. Singapore: Springer Singapore, 2021, pp. 557–567.
[7] Dr Ekambaram Kesavulu Reddy, “Neural Networks for Intrusion Detection and Its Applications,” published, July 2013.
[8] Srinivas Mukkamala, Guadalupe Janoski and Andrew H. Sung, “Intrusion detection using neural networks and support vector machines,” published, 2002.
[9] Nahla Ben Amor, Salem Benferhat and Zied Elouedi, “Naive Bayes vs decision trees in intrusion detection systems,” published, 2004.
[10] M. Uddin, S. Rudra, and M. Nazim Uddin, “Forecasting the Long Term Economics Status of Bangladesh Using Machine Learning Approaches from 2016-2036,” International Journal of Computer Communication and Informatics, vol. 1, no. 1, pp. 58-64, 2019.
