Network Traffic Classification Using Machine Learning Algorithms

Aditya Singh Yadav (21131410089), Alok Kumar (21131410091)

Abstract - At present, applications and networks do not exchange information with each other over the Internet, which can lead to poor application performance. To overcome this limitation, ideas such as application-aware networking and network-aware application programming are being explored. Internet Service Providers and network security managers analyse traffic using network traffic classification based on machine learning (ML) techniques. The protocols used on a network can be identified by a range of methods, including payload-based, port-based, and neural network-based strategies. Machine learning is now widely used in many fields, and network traffic can be classified by training ML algorithms to categorise it. Using ML for this task brings several advantages: a model trained on suitable data can achieve high accuracy, it can adapt to new types of traffic and applications, it is well suited to large volumes of traffic, it can analyse traffic in real time, and it can handle encrypted traffic. Because of these advantages we use machine learning.


Introduction
Classification of network traffic is very important for Internet Service Providers (ISPs) and network operators, because it lets them identify the different types of network traffic, the topology, the source and destination IP addresses, and other features of the traffic. Now and in the future, understanding users' traffic demands is needed to enable policing and prioritisation processes and Quality of Service (QoS) techniques. This knowledge is also very helpful for network security management and malware detection. With this kind of traffic engineering we can also improve the quality of service of a network connection. Today there are various ways to classify traffic, such as the port-based technique.

The port-based technique is a traditional technique for classifying network traffic. It depends on the port number associated with the traffic's protocol. Each well-known service is assigned a standard port number, for example HTTP traffic uses port 80, DNS traffic port 53, and SMTP traffic port 25. The technique is very simple because it relies only on the mapping of port numbers to services. It does not require complex computation and does not perform deep inspection: it only reads the transport-layer header, so little data has to be processed. It works well for popular, well-known protocols.

However, this technique has several disadvantages. It depends on the port number, but most modern applications do not have a fixed port number and often use non-standard ports. Application security is also improving day by day, so a large share of traffic is now encrypted, and the method struggles with encrypted traffic. In addition, many different kinds of traffic share the same port numbers, for example by running over HTTP/HTTPS.

Network traffic is the flow of data across a computer network. It can be divided into different types according to different criteria.

Network Traffic based on the Protocol-
HTTP/HTTPS Traffic- traffic generated by web applications.
DNS Traffic- Domain Name System traffic, which translates name queries into IP addresses.
TCP Traffic- the Transmission Control Protocol ensures reliable delivery.
UDP Traffic- the User Datagram Protocol is very fast but gives no guarantee that the traffic is delivered.
ICMP Traffic- the Internet Control Message Protocol is used for diagnostics and error reporting, for example by ping and traceroute.
TLS/SSL Traffic- Transport Layer Security protects data such as files and mail by encrypting it in transit.
QUIC Traffic- Quick UDP Internet Connections traffic is mostly used by web applications.
Multicast and Broadcast Traffic- traffic that is delivered to multiple IP addresses at the same time.

Network Traffic based on Application-
Real-Time Traffic- traffic that flows continuously and requires low latency.
Bulk Traffic- traffic that transfers large amounts of data.
Interactive Traffic- traffic in which the user interacts with the application in real time.
Peer to Peer Traffic- traffic that involves direct transfer between hosts without any central server.

Machine Learning Algorithms-
Machine learning is now widely used in many fields. We can classify network traffic by training ML algorithms to categorise it. Using ML for classifying network traffic brings several advantages: a model trained on suitable data can achieve high accuracy, it can adapt to new types of traffic and applications, it is well suited to large volumes of traffic, it can analyse traffic in real time, and it can handle encrypted traffic. Because of these advantages we use machine learning.

K-Nearest Neighbors (KNN)- KNN is a simple and easy-to-use machine learning algorithm for both regression and classification. It predicts the result for every new data point from the values of the closest points in the training data, that is, on the basis of similarity.

Working of KNN-
Choose K: K is the number of nearest neighbours to use. It is a tuning parameter that you choose manually. It represents how many nearby data points will be considered when making a prediction.
Calculate Distance: KNN computes the distance between the new data point and every other point in the dataset. Depending on the kind of data, standard distance metrics include the Manhattan distance, the Euclidean distance, and others.
Identify Neighbours: Using the calculated distances, the algorithm identifies the K data points that are nearest to the new data point.
Vote or Average:
● Using Classification: The class with the most votes among the K nearest neighbours is used. The unknown data point is assigned to the class that occurs most frequently among its K neighbours.
● Using Regression: The predicted value for the unknown data point is obtained by averaging the values of its K nearest neighbours.

Pros:
● Its predictions are easy to obtain and to understand.
● The output it produces has a clear, well-defined form.
● It is very useful for small amounts of data.

Cons:
● The computational cost is very high for large datasets.
● Performance can degrade if data isn't normalised or if irrelevant features are present.
● The correct K value is difficult to obtain and usually requires expertise to find.

Random forest: Random forest is a widely used ensemble machine learning algorithm for classification and regression. To give the most accurate predictions, it builds a number of decision trees and combines their results.

Key idea:
A random forest builds many decision trees, each from a different sample of the data and a different subset of the features, and combines their outputs. The combined prediction is usually more accurate than that of a single tree.

Steps in Random Forest:
● Bootstrap Sampling: From the original dataset, several subsets are created by randomly sampling the data with replacement (this technique is known as bagging). Each subset is used to train a different decision tree.
● Building Decision Trees:
● For every tree, instead of considering all features at each split, random forest randomly selects a subset of features. This introduces diversity among the trees, preventing them from being too similar.
● The trees are trained independently, with every tree learning different patterns because of its specific data sample and feature subset.
● Aggregation: For classification, every tree in the forest makes its own prediction, and the class chosen by the most trees is taken as the final output.
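The random forest steps described above (bootstrap sampling, random feature subsets, independent training, majority-vote aggregation) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in this paper: each "tree" is simplified to a single-split decision stump, and the two features (mean packet size, flow duration), the class labels, and all function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Bagging: draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, feature_ids):
    """Fit a one-split tree (a stump) using only the features in feature_ids."""
    best = None
    for f in feature_ids:
        for threshold in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= threshold]
            right = [y for x, y in rows if x[f] > threshold]
            left_label = majority(left)
            right_label = majority(right) if right else left_label
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    _, f, threshold, left_label, right_label = best
    return lambda x: left_label if x[f] <= threshold else right_label


def train_forest(rows, n_trees=25, n_features=1, seed=0):
    """Train each stump on its own bootstrap sample and random feature subset."""
    rng = random.Random(seed)
    all_features = range(len(rows[0][0]))
    forest = []
    for _ in range(n_trees):
        sample = bootstrap_sample(rows, rng)
        feature_ids = rng.sample(all_features, n_features)
        forest.append(train_stump(sample, feature_ids))
    return forest


def predict(forest, x):
    """Aggregation: every tree votes and the most common class wins."""
    return majority([tree(x) for tree in forest])


# Hypothetical toy flows: ((mean packet size in bytes, duration in s), class)
flows = [
    ((200, 1.0), "web"), ((250, 2.0), "web"), ((300, 1.5), "web"),
    ((350, 3.0), "web"), ((400, 2.5), "web"),
    ((1100, 30.0), "video"), ((1200, 45.0), "video"), ((1250, 60.0), "video"),
    ((1300, 40.0), "video"), ((1400, 50.0), "video"),
]
forest = train_forest(flows)
print(predict(forest, (150, 0.5)), predict(forest, (1500, 70.0)))
```

Because every stump sees a different sample and a different feature, individual stumps can disagree; the vote in predict is what makes the combined result stable.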
Literature Survey

Jing et al. (2023) address the challenges that encryption brings to network traffic classification, such as traffic dispersion and small datasets. They put forward the "Cardinality-Constrained Fuzzy C-Means" clustering algorithm, which uses the interplay among network flows to improve traffic partitioning. This granular computing approach prepares the ground for precise traffic segmentation, which is still a focus of many other studies.

Yan et al. (2023) aim to raise accuracy in dynamic environments and focus on high-speed networks, where short flows defy conventions. Their work integrates entropy-based and chi-square feature tests into a Random Forest model to distinguish encrypted from unencrypted traffic at high speeds. It further underlines the importance of payload analysis and points future studies towards the hurdles of real-time classification.

The research paper "QUIC Network Traffic Classification Using Ensemble Machine Learning Techniques" studies the classification of Quick UDP Internet Connections (QUIC) protocol traffic using ensemble machine learning approaches. It applies five ensemble techniques: Random Forest, Extra Trees, Gradient Boosting Tree, Extreme Gradient Boosting Tree (XGBT), and Light Gradient Boosting Model (LGBM). The ensemble techniques achieved up to 99.40% accuracy. The paper claims that ensemble techniques, particularly XGBT and LGBM, perform well on encrypted traffic classification at low data volumes. (Sultan Almuhammadi, Abdullatif Alnajim, Mohammed Ayub)

The research paper "Network Traffic Identification in Packet Sampling Environment" analyzes problems and solutions regarding the identification of network traffic profiles when heavy packet sampling is adopted, which is typical in modern high-speed networks. Packet sampling has a large effect on traffic identification. The paper proposes a classification technique (DBNAI) that incorporates behaviour features in the space and time domains for enhanced classification accuracy. The results outperformed traditional techniques, thanks to the application of automated learning to the reduced data in multiple network environments. (Shi Dong, Yuanjun Xia)

Another research paper describes data collection, confirmation, processing, feature extraction, model development, and evaluation as some of the key steps in the machine learning method of network traffic classification. As part of the data collection process, packets were recorded at different times of the morning and afternoon over six days (April 26, 27, 28, and May 9, 11, and 15) of 2017. The paper describes the Boruta feature selection method and the identification of optimal features, which lessens the computational burden. It proposes three classifiers, Hoeffding Adaptive Trees (HAT), KNN, and RF, working over sliding windows. With the Boruta feature selection technique the accuracy was up to 95%. (Arwa M. Eldhai, Mosab Hamdan, Ahmed Abdelaziz, Ibrahim Abaker Targio Hashem, et al.)

Another paper aims to enhance network traffic classification using Software Defined Networking integrated with machine learning. It points out the challenges faced by traditional techniques based on ports or payloads due to the presence of encrypted or dynamic traffic. The study applies both supervised and unsupervised ML algorithms, such as Decision Trees, Random Forest, and K-means clustering. Of all the models tested, Decision Tree produced the best result, with an accuracy of 99%.

Another work proposes a federated semi-supervised learning approach to network traffic classification that tackles privacy concerns and the expensive nature of labelled data. It also proposes a new approach for labelling network traffic using Deep Packet Inspection and the Domain Name System on home edge devices. The study extracts a large number of features from both labelled and unlabelled datasets using autoencoders and convolutional neural networks, which reduces the reliance on labelled data. The model achieves high accuracy. (ZiXuan Wang, ZeYi Li, MengYi Fu, YingChun Ye, Pan Wang)

Another research paper reviews the use of ML for the detection and classification of malware. It emphasizes the extent to which ML has become a key component of solving malware issues, as malware uses increasingly sophisticated obfuscation strategies to sidestep traditional detection techniques. According to the research, detection approaches are divided into static, dynamic, and hybrid methods, with an emphasis on deep learning methods. The study also emphasizes emerging trends.

Another research paper is concerned with the detection of unproductive applications that occupy bandwidth. The study focuses on the problems of underutilization and network congestion that result from peer-to-peer file sharing. The methodology involved traffic monitoring with Wireshark and data analysis in MATLAB. Better bandwidth management and policies to improve academic use of the internet were suggested. (S. B. A. Mohammed, Dr. S. M. Sani, Dr. D.)

The report entitled "Network Traffic Analysis using NLP and MATLAB" focuses on network traffic analysis with the aim of detecting unproductive bandwidth-consuming applications that result in underutilization and diminished academic access. The study shows that inadequate bandwidth control and peer-to-peer applications contribute significantly to network congestion. It suggests increasing controls on non-academic bandwidth use and formulating defined internet usage policies for educational use. (Manish R. Joshi, Theyazn Hassn Hadi)

In another work, Wireshark provided valuable insights into network bandwidth utilization and its inefficiencies on a university's network over a period of 90 days. The work analyzes bandwidth wastage due to unproductive applications and highlights how underutilization of the network and poor bandwidth management lead to reduced networked academic activities. It illustrates the need for active monitoring using Wireshark and recommends the establishment of an internet access policy that favors scholarly traffic over non-scholarly traffic. (Vanya Ivanova, Tasho Tashev, Ivo Draganov)

Another research paper set out to study network performance problems caused by bandwidth-consuming unproductive applications. Packet sniffing over 90 days using Wireshark, coupled with data analysis in MATLAB, revealed the network's idleness and excessive use of peer-to-peer (P2P) traffic. The paper calls for effective bandwidth management policies to improve the utilization of useful traffic and proposes sustainable solutions toward improved network performance. (Argha Ghosh, Dr. A. Senthilrajan)

In another paper the authors study techniques for improving the classification of encrypted traffic, which has always posed a challenge to classical methods owing to the widespread use of encryption protocols. Distiller enhances performance using deep learning and multimodal data while keeping the computation low. It adds multitask learning to deep learning models and improves the accuracy of existing models by about 8.45%. The model's accuracy is shown to be robust on a public dataset, and the proposed future work suggests using semi-supervised learning for refinement. (Giuseppe Aceto, Domenico Ciuonzo, Antonio Montieri, Antonio Pescapè)

The authors of the paper "Real-Time Network Traffic Analysis Utilising AI, ML And DL Techniques" focus on more modern methods of network traffic analysis. One of the most important aspects is the application of the random forest algorithm, which led to an accuracy of 99.31%. The paper also covers other fundamental aspects, such as how to deal with large organizational data and make the system scalable while minimizing false positive and false negative detections. Some of the most valuable findings concern the need for security improvement using adaptive machine learning models, capturing system health data in real time, and multisolution tools for comprehensive analysis of network services.

The review "Machine Learning Approaches for Traffic Analysis" concentrates on the impact machine learning (ML) has on the performance of a network and its security. It examines the work done on intrusion detection and network behavior analysis and stresses how effectively ML can identify hostile actions. It makes a distinction between supervised and unsupervised learning. The techniques discussed include flow analysis and anomaly detection. Data management and ever-evolving types of attacks are some of the problems discussed. The review concludes that more advanced ML approaches are needed to augment existing cybersecurity frameworks. (Nour Alqudah, Qussai Yaseen)

The research paper titled "In analysing the network traffic classification" focuses on the role of machine learning (ML) in enhancing the analysis of network traffic and improving security measures. The uses of ML are not limited to anomaly detection but also include predicting network congestion and optimizing network resources. Past work on these issues is reviewed, and the challenges of data, model interpretability, and the ever-increasing need for algorithm improvement to counter newer cyber-attacks are emphasized. The paper calls for more recent ML techniques for traffic analysis to be covered more frequently in scientific studies.

Zulfiqar's paper examines the significance of machine learning for network performance and security through effective traffic classification. The researchers examine the Naïve Bayes and K-nearest neighbour (KNN) algorithms, and KNN was found to be the most effective method for classifying network traffic from live video streams. The authors collected data using Wireshark and extracted features with high precision. They claim that KNN is dependable for the assessment of network traffic and recommend intensive research on sophisticated machine learning algorithms for changing traffic flows. (Lakshmi Santhosh Tripura, Kavya Kurra)

A survey paper analyzes the application of deep learning techniques for classifying the network traffic produced by the rapidly growing number of Internet of Things devices. It reviews numerous deep learning models while providing insights into their strengths and weaknesses and into how they address issues specific to IoT. The paper describes as its main focus complicated patterns of data traffic, the limited resources of an IoT device, and the scarcity of data. It also examines issues related to security.

The paper "Advancements in Cybersecurity Machine Learning Algorithms Applied to Network Traffic Analysis" tries to improve cybersecurity with machine learning approaches. It discusses the increasing danger of cyber threats and presents an Intrusion Detection System (IDS) that uses SVM, Random Forest, CNN, and ANN algorithms. The study was based on the CICIDS2017 dataset, and it found that Random Forest performed with the highest accuracy. The paper calls for more up-to-date datasets and further work on applying machine learning with big data for more efficient threat detection. (Anastasia Victoria, Yelizaveta Elizaveta)

Methodology-
Data gathering, data preparation, feature extraction, model development, and testing are the key steps in the machine learning method of network traffic classification. As part of our data collection, packets were recorded at different times during the morning and afternoon hours over six days (April 26, 27, 28, and May 9, 11, and 15) of 2017 in order to collect information from Universidad Del Cauca, Popayán, Colombia. The next step is data preprocessing, in which we perform data cleaning, type casting, label encoding, and data sampling. The dataset is then ready for training with the KNN and Random Forest algorithms, and we evaluate the models using a confusion matrix.
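The preprocessing steps named in this section (data cleaning, type casting, label encoding) can be illustrated with a small, library-free sketch. It is not the code used for this paper: the column names flow_duration, avg_packet_size, and protocol_name, the sample values, and the helper functions are all hypothetical and stand in for whatever columns the real dataset provides.

```python
def clean(rows):
    """Data cleaning: drop rows with missing values, then exact duplicates."""
    seen, cleaned = set(), []
    for row in rows:
        if any(value is None or value == "" for value in row.values()):
            continue  # incomplete record
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned


def cast_types(rows, numeric_columns):
    """Type casting: turn numeric columns read as text into floats."""
    return [
        {k: float(v) if k in numeric_columns else v for k, v in row.items()}
        for row in rows
    ]


def label_encode(rows, column):
    """Label encoding: replace each class name by an integer code."""
    classes = sorted({row[column] for row in rows})
    mapping = {name: code for code, name in enumerate(classes)}
    encoded = [{**row, column: mapping[row[column]]} for row in rows]
    return encoded, mapping


# Hypothetical raw records, as they might look right after loading a CSV.
raw = [
    {"flow_duration": "120", "avg_packet_size": "540.5", "protocol_name": "GOOGLE"},
    {"flow_duration": "120", "avg_packet_size": "540.5", "protocol_name": "GOOGLE"},
    {"flow_duration": "", "avg_packet_size": "80", "protocol_name": "DNS"},
    {"flow_duration": "15", "avg_packet_size": "80", "protocol_name": "DNS"},
]

cleaned = clean(raw)
typed = cast_types(cleaned, {"flow_duration", "avg_packet_size"})
encoded, mapping = label_encode(typed, "protocol_name")
print(encoded, mapping)
```

In practice the same steps are usually done with pandas; the plain-Python version is shown only to make each step explicit.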
Fig 1 - Network Traffic classification Using ML Algorithms

Fig 1 - Data PreProcessing Flow Chart

The data preprocessing stage outlines simple steps to prepare datasets for analysis and modelling. First, import appropriate libraries such as pandas, numpy, seaborn, and matplotlib to establish a foundation for data management and visualisation. Load the data file and check its structure using functions such as head(), isnull() and info(), which help identify data types and missing values. Data cleaning issues are then addressed by interpolating or deleting missing values and processing duplicates. Use utilities to create or transform variables to increase predictive power, including encoding categorical variables and normalising numerical features. Various visualisations using tools like Seaborn and Matplotlib help to understand data distribution and relationships. This process introduces preprocessing techniques like scaling and encoding to ensure that the data is ready for modelling. The final step highlights the importance of preprocessing for good data analysis by showing the shapes and differences of the cleaned data.

After data preprocessing the data is clean and balanced, and we apply different machine learning (ML) algorithms.

K-Nearest Neighbors (KNN) machine learning algorithm-

Fig 2 - KNN Flow Chart

The K-nearest neighbors (KNN) algorithm is a supervised predictor used for classification and regression. It is one of the most popular and simplest classification and regression algorithms in machine learning, but it is typically used for classification problems.
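As a rough illustration of how KNN classifies a new flow, the sketch below measures the distance from a query point to every labelled point, keeps the K closest, and takes a majority vote. It is a minimal sketch under stated assumptions, not the model trained in this paper: the two normalised features, the labels, and the function name knn_predict are hypothetical, and the Euclidean distance is used.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every labelled point.
    distances = [(math.dist(features, query), label) for features, label in train]
    # Keep the k closest points and count their class labels.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical, already normalised training flows: (features, class label)
train = [
    ((0.10, 0.20), "web"), ((0.20, 0.10), "web"), ((0.15, 0.25), "web"),
    ((0.90, 0.80), "video"), ((0.80, 0.90), "video"), ((0.85, 0.75), "video"),
]

print(knn_predict(train, (0.20, 0.20)))
print(knn_predict(train, (0.80, 0.80)))
```

An odd K is a common choice for two-class problems because it avoids tied votes.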
Compute KNN- There are several methods for distance calculation.

Euclidean distance - This is a method of calculating the straight-line distance between two data points. It is very simple and the most commonly used method. For two points (x1, y1) and (x2, y2) it is
$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
where x2 - x1 is the difference between the x coordinates and y2 - y1 is the difference between the y coordinates of the two points.

Manhattan distance - The Manhattan distance is also a popular distance metric, which sums the absolute differences between the coordinates of two points:
$d = |x_2 - x_1| + |y_2 - y_1|$
Here |x2 - x1| is the absolute difference between the x coordinates of the two points, and similarly |y2 - y1| for the y coordinates.

Minkowski distance - This method generalizes the Manhattan and Euclidean distance metrics:
$d = \left( |x_2 - x_1|^p + |y_2 - y_1|^p \right)^{1/p}$
Here |x2 - x1| and |y2 - y1| are the absolute differences between corresponding coordinates of the two points, and p is a parameter that determines the type of distance metric: p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance. The formula raises each absolute difference to the power p, sums the results, and takes the p-th root. For points with more than two coordinates the sum runs over all of them.

Fig 3 - Confusion Matrix of KNN

Fig 4 - Top 3 Feature Of Confusion Matrix
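The three distance metrics described in this section can be written as one small Python sketch, since the Manhattan and Euclidean distances are the Minkowski distance with p = 1 and p = 2. The function names and the sample points are illustrative assumptions, not part of the original work.

```python
def minkowski(a, b, p):
    """Minkowski distance: p-th root of the summed |difference| ** p."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def manhattan(a, b):
    """Minkowski distance with p = 1: sum of absolute differences."""
    return minkowski(a, b, 1)


def euclidean(a, b):
    """Minkowski distance with p = 2: straight-line distance."""
    return minkowski(a, b, 2)


# Two illustrative points (x1, y1) and (x2, y2).
point_1 = (1, 2)
point_2 = (4, 6)

print(manhattan(point_1, point_2))     # |4 - 1| + |6 - 2|
print(euclidean(point_1, point_2))     # sqrt(3**2 + 4**2)
print(minkowski(point_1, point_2, 3))  # a larger p weights big differences more
```

The same functions work for points with any number of coordinates, which is how they would be applied to flow feature vectors.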

The model was trained many times; the best output, an accuracy of 97%, was obtained on the 10th run.

Random Forest Machine Learning Algorithm-

Fig 5 - Random Forest Flow Chart

Random Forest is an ensemble learning method. It is an extension of the bagging method: it uses both bagging and feature randomness to build a diverse forest of decision trees. Feature randomness means choosing a random subset of features, which keeps the correlation among the decision trees low; this is the key difference between a decision tree and a random forest. Random forest relies on two mechanisms: bootstrap sampling, which draws samples randomly from the selected data, and feature bagging, in which a random subset of features is selected for every decision tree. Through these two processes multiple decision trees are created. There is no single concise formula for random forest, but the algorithm can be expressed mathematically as
$H(x) = \frac{1}{N} \sum_{i=1}^{N} h(x, \theta_i)$
where H(x) is the final prediction, h(x, θ_i) is the prediction from a single tree, θ_i represents the random feature subset and bootstrap sample of that tree, and N is the number of decision trees.

Fig 6 - Confusion Matrix of RF

Fig 7 - Top 5 Feature of Confusion Matrix

Conclusion

In this research paper we use the Dataset-Unicauca-Version2-87Atts dataset, which is publicly available on the internet. The dataset was collected with Wireshark, passed through data preprocessing, and then used with KNN and checked with Random Forest. After data preprocessing we apply KNN and Random Forest to the imbalanced and clean dataset. First we train a K-Nearest Neighbour (KNN) model that classifies the network traffic and predicts the traffic class based on its features. The KNN model achieves 91% accuracy. Second, we train a DecisionTreeClassifier, which reaches 99% accuracy. Then we train a RandomForestClassifier that predicts the class of new data, and the model achieves 99% accuracy. Data quality is very important for good results. Our future work gives priority to network traffic.

Reference-
1. Xuyang Jing, Jingjing Zhao, Zheng Yan, Witold Pedrycz, Xian Li: "Granular classifier: building traffic granules for encrypted traffic classification based on granular computing".
2. Xinge Yan, Liukun He, Yifan Xu, Jiuxin Cao, Liangmin Wang, Guyang Xie: "High-Speed Encrypted Traffic Classification by Using Payload Features".
3. Sultan Almuhammadi, Abdullatif Alnajim, Mohammed Ayub: "QUIC Network Traffic Classification Using Ensemble Machine Learning Techniques".
4. Shi Dong, Yuanjun Xia: "Network Traffic Identification in Packet Sampling Environment".
5. Muhammad Shafiq, Xiangzhan Yu, and Dawei Wang: "Network Traffic Classification Using Machine Learning Algorithms".
6. Arwa M. Eldhai, Mosab Hamdan, Ahmed Abdelaziz, Ibrahim Abaker Targio Hashem, Sharief F. Babiker, M. N. Marsono, Muzaffar Hamzah, and Noor Zaman Jhanjhi: "Improved Feature Selection and Stream Traffic Classification Based on Machine Learning in Software-Defined Networks".
7. ZiXuan Wang, ZeYi Li, MengYi Fu, YingChun Ye, Pan Wang: "Network traffic classification based on federated semi-supervised learning".
8. Ayodeji Olalekan Salau & Melesew Mossie Beyene: "Software defined networking based network traffic classification using machine learning techniques".
9. Daniel Gibert, Carles Mateu, Jordi Planes: "The rise of machine learning for detection and classification of malware: Research developments, trends and challenges".
10. Gianni D'Angelo, Francesco Palmieri: "Network traffic classification using deep convolutional recurrent autoencoder neural networks for spatial-temporal features extraction".
11. Hamid Tahaei, Firdaus Afifi, Adeleh Asemi, Faiz Zaki, Nor Badrul Anuar: "The rise of traffic classification in IoT networks"
