An Optimized K Means Clustering For Improving Accuracy in Traffic Classification
https://doi.org/10.1007/s11277-021-08435-x
Abstract
With the explosive growth of network traffic, traditional port- and payload-based methods can no longer satisfy the requirements of privacy protection and fast, real-time classification in today's traffic classification. Here, a network traffic classification model based on the fusion of Self-Organizing Maps (SOM) and the K-means algorithm is proposed. The traffic data is first clustered by the SOM network to derive the number of clusters and the center value of each cluster. These values are then taken as the initial parameters of the K-means algorithm, which produces the final classification. Compared with the traditional K-means algorithm, the initial clustering performed by the SOM network not only retains the advantages of a simple method and efficient processing, but also reduces the time cost. Moreover, a significant improvement in classification accuracy is achieved with the proposed algorithm.
1 Introduction
More and more new client applications are emerging as the Internet develops, which promotes the innovative development of Internet-based TCP/IP technology, including 5G infrastructure, its three major application scenarios, and the fifth-generation cellular network [1]. On the one hand, effective network management and improved analysis play a key role in these technologies [2]. On the other hand, network traffic, as an important part of the network, records and reflects the activities of both the Internet
* Corresponding authors: Shasha Zhao ([email protected]); Dengying Zhang ([email protected])
1 College of Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing, China
2 College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China
3 Jiangsu Key Laboratory of Broadband Wireless Communication and Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing, China
S. Zhao et al.
and the users [3]. By analyzing network traffic, users' network behavior can be learned. Meanwhile, network traffic classification is also necessary and beneficial for the communication industry [4, 5]. Operators can improve service quality by effectively supervising the network and can estimate the planning capacity of 5G infrastructure more accurately and reasonably. Administrators can control traffic effectively and reduce congestion via traffic analysis. In short, accurate network traffic classification is an important basis for network security and traffic engineering.
Currently, different approaches have been used to improve network traffic classification, such as the port-number-based method [6] and the one based on IP packet payload parsing [7]. However, these methods require accurate information, including the application-layer data and its packet format, to give an accurate classification. Unfortunately, such information is not easy to extract because of privacy and encryption measures. As another way to improve classification accuracy while protecting users' privacy, machine learning methods based on flow statistical features have been proposed and investigated, such as SVM, Bayes, and C4.5 [8–12]. Their accuracy based on supervised learning can be higher than 70%. For example, Moore et al. achieved a classification accuracy of 95% after a series of improvement measures based on a supervised algorithm [10, 11]. Nevertheless, these algorithms still require the training dataset to be labeled manually, which adds considerable cost. Furthermore, some new traffic categories are hard to recognize, making manual labeling impossible. Therefore, clustering algorithms based on unsupervised learning (e.g., DBSCAN, K-means, EM) have been applied to improve traffic classification [13].
Among them, the K-means algorithm can cluster new traffic types efficiently [14, 15]. However, an improper initial clustering center or an unreasonable K value usually causes its clustering result to fall into a local optimum. Although improved ways of choosing clustering centers have been applied to the K-means algorithm [15], the K value still needs to be set manually. Furthermore, the randomness of initialization is another problem that needs to be solved. Additionally, the classification accuracy of these unsupervised algorithms is lower than that of supervised ones and needs further optimization. In this work, a network traffic classification model named the SOM-K fusion algorithm is proposed, based on the fusion of Self-Organizing Maps (SOM) and the K-means algorithm. As an unsupervised machine learning algorithm, it not only demonstrates significantly improved network traffic classification accuracy, but also reduces the time cost compared with the traditional K-means algorithm.
The remainder of this paper is organized as follows. The improved K-means clustering algorithm is introduced in Sect. 2. The data sources, pre-processing methods, and feature selection algorithm are described in Sect. 3. Finally, the experimental results are evaluated in Sect. 4, and a summary together with future challenges is given in Sect. 5.
2 Improved Algorithm
Generally, the number of clusters used in the traditional K-means algorithm must be set manually in advance, and the suitability of this manually chosen K value determines the resulting classification accuracy. In addition, a random initial cluster center can lead the clustering result into a local optimum.
Thus, to obtain a preferred K value, the SOM network is applied here to initialize and optimize the K-means algorithm. The improved K-means clustering algorithm consists of two steps. First, the traffic dataset is input into the SOM network, and the SOM algorithm is executed, continuously updating the weights of the neurons until the set number of iterations is reached; the clustering result, including the center values and the number of clusters, is then output. Second, these output cluster centers and the number of clusters are taken as the initial values of the K-means algorithm, which derives the final clustering result.
2.1 Self‑Organizing Maps
Compared with conventional neural networks, the SOM network has only input and competition layers, without hidden layers. Its operation can be divided into three processes: competition, cooperation, and synaptic adaptation.
In the competition process, the similarity between the input vector and each neuron is calculated, and the most similar neuron is chosen as the winning neuron. Next, the winning neuron determines the topological neighborhood of excited neurons, which in turn provides the basis for the cooperation of adjacent neurons. In the synaptic adaptation process, the weight vector of each neuron in the neighborhood is adjusted after the topological neighborhood of the winning neuron has been determined. The farther a neuron is from the winning neuron, the greater the degree of suppression and the smaller the proportion of the weight update.
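The competition step above can be sketched as follows. This is an illustrative fragment, not the authors' code; the function name and the array layout are our assumptions:

```python
import numpy as np

def find_winner(x, weights):
    # Competition step: compute the Euclidean distance between the input
    # vector x and every neuron's weight vector; the closest neuron wins.
    # `weights` has shape (n_neurons, n_features).
    distances = np.linalg.norm(weights - x, axis=1)
    return int(np.argmin(distances))
```

The winner's index then anchors the topological neighborhood used in the cooperation and adaptation steps.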
The detailed steps of the SOM algorithm are shown in Algorithm 1. When a certain type of data is input into the SOM network, the Euclidean distance between the input vector and the weight vector is calculated for each neuron. The neuron with the smallest
distance is defined as the winning neuron. During the training process, the weight vectors of the winning neuron and its neighboring neurons are continuously adjusted; the farther a neuron is from the winning neuron, the smaller the magnitude of its weight adjustment. As the iterations proceed, the learning rate decreases and the winning neighborhood shrinks continually. Once the predetermined number of iterations T is reached, the preliminary clustering of the data by the SOM network is complete, and the number of clusters and the center of each cluster are obtained.
To avoid the influence of the varied dimensions of different features on classification accuracy, the input vector X is subjected to z-score normalization, described as

X* = (X − μ) / σ (1)

where X* represents the normalized input vector, μ is the mean value, and σ is the standard deviation. The principle of the weight vector update for all neurons in the winning neighborhood Ni*(t) can be described as

Wi(t + 1) = Wi(t) + η(t) e^(−n) [X − Wi(t)],  i ∈ Ni*(t)
Wi(t + 1) = Wi(t),  i ∉ Ni*(t) (2)
where t is the current iteration number, Wi(t) is the weight of neuron i, η(t) represents the learning rate at the t-th iteration and decays as the number of iterations increases, n is the topological distance from the winning neuron, and e^(−n) weights the update by that distance: the larger the topological distance n, the smaller e^(−n), and thus the smaller the weight update ratio. Generally, the initial winning neighborhood N(t0) is set to a larger value, but it shrinks as the number of iterations increases. As a result, the learning rate can be expressed as

η(t) = e^(−n) / (t + 2),  t = 1, 2, ⋯, T (3)
and the winning neighborhood can be described as

N(t) = N(0) − N(0)·t / T,  t = 1, 2, ⋯, T (4)

where T is the total number of iterations and N(0) is the initial neighborhood.
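Under the update rules above, the SOM training loop can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes a one-dimensional neuron lattice and folds Eqs. (2)–(3) into a single decaying update factor.

```python
import numpy as np

def zscore(X):
    # Eq. (1): per-feature z-score normalization of the input vectors
    return (X - X.mean(axis=0)) / X.std(axis=0)

def train_som(X, n_neurons=6, T=200, N0=3, seed=0):
    # Simplified SOM on a 1-D lattice of `n_neurons` neurons.
    rng = np.random.default_rng(seed)
    X = zscore(np.asarray(X, dtype=float))
    W = rng.normal(size=(n_neurons, X.shape[1]))   # random initial weights
    for t in range(1, T + 1):
        radius = N0 - N0 * t / T                   # Eq. (4): shrinking neighborhood
        x = X[rng.integers(len(X))]                # pick one training sample
        winner = int(np.argmin(np.linalg.norm(W - x, axis=1)))
        for i in range(n_neurons):
            n = abs(i - winner)                    # topological distance on the lattice
            if n <= radius:
                eta = np.exp(-n) / (t + 2)         # Eqs. (2)-(3): decaying learning rate
                W[i] += eta * (x - W[i])           # move neuron toward the input
    return W
```

After training, nearby weight vectors can be merged into clusters, whose count and centers seed the K-means step of Sect. 2.2.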
2.2 K‑means Clustering
The similarity between two samples is measured by the Euclidean distance

d(xi, xj) = √( Σ(k=1..n) (xik − xjk)² ) (5)

where xi and xj represent n-dimensional input vectors, and xik and xjk represent the value of the k-th dimension of xi and xj, respectively.
However, for the K-means algorithm, an improper initial clustering center or an unreasonable K value can make the clustering result fall into a local optimum, resulting in poor clustering. Thus, it is necessary to perform the initial clustering of similar data with the SOM network, obtaining the cluster centers and the number of clusters from the SOM algorithm to initialize the K-means clustering.
The detailed steps of K-means clustering are shown in Algorithm 2. First, the K value and the initial cluster centers are set. After the standardized data set is input, the Euclidean distance from each data point to every cluster center is calculated, and according to these distances the data points are assigned to the nearest cluster. The mean vector of each cluster is then recalculated and used as the new cluster center, and all data points are re-assigned. These steps are repeated until the cluster centers no longer change.
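Algorithm 2, seeded with the SOM output, can be sketched as below. The function name and array layout are illustrative assumptions; K is fixed implicitly by the number of initial centers supplied by the SOM step.

```python
import numpy as np

def som_kmeans(X, init_centers, max_iter=100):
    # K-means clustering initialized with externally supplied centers
    # (here assumed to be the SOM cluster centers).
    X = np.asarray(X, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    k = len(centers)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assign every point to its nearest center (Euclidean distance, Eq. (5)).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean vector of its cluster.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # centers no longer change: stop
            break
        centers = new_centers
    return labels, centers
```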
3 Experiment Setup
In this section, the data trace and preprocessing method are introduced first. Then the principle of the feature selection algorithm and the meaning of the selected features are explained. Finally, the evaluation metrics for algorithm performance are briefly introduced.
3.1 Data Traces
To analyze the performance of the algorithm, the Moore_set [10], the most authoritative test data set in current network traffic classification, was chosen as the experimental data set. One subset of the Moore_set was taken as the initial data. Samples with too many missing dimensions were eliminated, and the remaining missing values were filled by the multiple imputation method [16]. However, the proportions of the different categories in the dataset differ greatly: up to 77.9% of the traffic is of the WWW type, while the GAME type accounts for only 0.002%. To improve the accuracy for the different categories, the proportion of each category was balanced. The balanced experimental data set is listed in Table 1.
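The paper does not specify its exact balancing procedure, so the random resampling scheme below is an assumption: rare classes (such as GAME) are oversampled with replacement, while dominant classes (such as WWW) are undersampled.

```python
import numpy as np

def balance_classes(X, y, n_per_class, seed=0):
    # Resample every class to exactly `n_per_class` samples.
    rng = np.random.default_rng(seed)
    Xb, yb = [], []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        # Sample with replacement only when the class is smaller than the target.
        pick = rng.choice(idx, size=n_per_class, replace=len(idx) < n_per_class)
        Xb.append(X[pick])
        yb.append(y[pick])
    return np.concatenate(Xb), np.concatenate(yb)
```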
3.2 Feature Selection
The traffic data in the Moore_set contains 248 statistical feature attributes, covering most of the features used in current traffic classification. These include typical features such as the port numbers used by the server and client, packet byte length statistics, packet inter-arrival time statistics, the total number of bytes transferred, the throughput, and the data transmission time. Nevertheless, a large number of redundant or unrelated features are also included, which increases the data dimension. Furthermore, the complexity of the K-means clustering algorithm is O(Ktn), which depends on the data dimension n; data with redundant or unrelated features therefore increase the algorithm complexity and decrease computational efficiency.
Therefore, to reduce the time consumption and classifier complexity, selecting a small number of attributes via feature selection is necessary. To date, feature selection methods are usually divided into filter, wrapper, and embedded methods, implemented with supervised or unsupervised algorithms [17–19]. Among them, the correlation-based feature selection (CFS) method, one of the most typical filter methods, has two significant advantages: the correlation between different features can be calculated, and its algorithm complexity is lower than that of wrapper- and embedded-based methods. Here, CFS is used to filter the statistical traffic characteristics.
In detail, the feature-class and feature-feature correlation matrices are first calculated with CFS on the training set. Then, the feature subset is found with a best-first search. Assume the algorithm starts with an empty set D. The heuristic estimate, represented by the Merit value, is first calculated for each possible single feature. The feature with the highest Merit value is added to D, making D a one-dimensional feature vector. The feature with the largest Merit value among the remaining features is then selected and added to D. If the Merit value of the resulting two-dimensional feature vector D is smaller than before, this feature is removed, and the feature with the second-largest Merit value is added to D instead. The above process is repeated until the Merit value of the set D ceases to increase. The heuristic estimate of a feature subset, Merit, is defined as follows:
Merit = m·rcf / √( m + m(m − 1)·rff ) (6)
where m is the number of features, r is the Pearson correlation coefficient, rcf is the average feature-class correlation, and rff is the average feature-feature correlation. After several experiments with our data sets, the selected optimal feature subset contains 10 features, labeled total_packets_b_a, mean_data_ip_b_a, var_data_control_a_b, actual_data_bytes_a_b, actual_data_bytes_b_a, max_data_ip, data_xmit_time_b_a, ack_pkts_sent_b_a, Duration, and mean_IAT_a_b. The specific meaning of each identifier is listed in Table 2.
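Eq. (6) can be computed directly. The sketch below estimates rcf and rff with `numpy.corrcoef` and is an illustrative reading of the CFS merit, not the authors' code; class labels are assumed to be numeric.

```python
import numpy as np

def merit(X_subset, y):
    # Eq. (6): Merit = m * rcf / sqrt(m + m(m-1) * rff)
    # X_subset: (n_samples, m) candidate feature subset; y: numeric labels.
    m = X_subset.shape[1]
    # Average absolute feature-class Pearson correlation (rcf).
    rcf = np.mean([abs(np.corrcoef(X_subset[:, j], y)[0, 1]) for j in range(m)])
    if m == 1:
        rff = 0.0
    else:
        # Average absolute feature-feature correlation over all pairs (rff).
        rff = np.mean([abs(np.corrcoef(X_subset[:, i], X_subset[:, j])[0, 1])
                       for i in range(m) for j in range(i + 1, m)])
    return m * rcf / np.sqrt(m + m * (m - 1) * rff)
```

A best-first search would repeatedly call `merit` on candidate subsets and keep the feature whose addition raises the value the most.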
3.3 Evaluation Metrics
The performance of the algorithm is evaluated using the overall accuracy and the precision. The overall accuracy evaluates the ability of the algorithm to generate clusters that contain only a single traffic class. The precision evaluates the accuracy of classifying traffic samples in each traffic category. A sample correctly classified into a certain category is counted as a true positive (TP), while a sample of another category misclassified into it is counted as a false positive (FP). The precision and the overall accuracy are respectively described as

precision_i = TPi / (TPi + FPi),  i = 1, 2, …, k (7)
and

overall_accuracy = ( Σ(i=1..k) TPi ) / total_samples (8)
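Eqs. (7)–(8) translate directly into code; the helper names below are illustrative:

```python
import numpy as np

def precision_per_class(y_true, y_pred, classes):
    # Eq. (7): precision_i = TP_i / (TP_i + FP_i) for each class i.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    result = {}
    for c in classes:
        predicted_c = y_pred == c
        tp = np.sum(predicted_c & (y_true == c))
        result[c] = tp / predicted_c.sum() if predicted_c.sum() else 0.0
    return result

def overall_accuracy(y_true, y_pred):
    # Eq. (8): correctly classified samples over total samples.
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```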
4 Experimental Results
Fig. 1 Time cost depending on the number of features for the three algorithms: SOM, K-means, and SOM-K

Fig. 2 Overall accuracy depending on the number of features for the K-means, SOM, and SOM-K algorithms

Fig. 3 Classification precision for different types of traffic, calculated with the K-means, SOM, and SOM-K algorithms
Figure 3 shows the classification precision, indicating a positive correlation between sample size and accuracy. Notably, the improvement in the classification accuracy rate is most obvious for the "small-sample" categories, including MULTIMEDIA, P2P, and INTERACTIVE. Overall, the SOM-K algorithm proposed in this work improves the traffic classification accuracy rate by 8–20% compared with the traditional K-means algorithm, and by 5–32% compared with the SOM algorithm. Moreover, for the 11 traffic categories, the classification accuracy rate of the SOM-K algorithm approaches or even exceeds 80%.
5 Conclusion
In summary, the SOM-K algorithm, as a method for traffic classification based on unsuper-
vised learning has been proposed in this work. The optimal subset of traffic characteristics
was selected with the CFS algorithm and used to run K-means algorithms. The experi-
mental results show that the overall accuracy of the traffic classification model with this
SOM-K algorithm can reach 87.8%, and the classification accuracy can exceed 90% for the
categories with massive samples. Additionally, SOM-K achieves higher accuracy and less
time cost than SOM and K-means algorithms for the traffic classification methods based on
unsupervised learning, which implies efficient data processing capabilities of our proposed
algorithm in the field of data processing.
References
1. Nahum, C. V., et al. (2020). Testbed for 5G connected artificial intelligence on virtualized networks.
IEEE Access, 8, 223202–223213.
2. Tzanakaki, A., Anastasopoulos, M., Berberana, I., Syrivelis, D., & Flegkas, P. (2017). Wireless-optical
network convergence: Enabling the 5G architecture to support operational and end-user services. IEEE
Communications Magazine, 55(10), 184–192.
3. Bu, Z., Zhou, B., Cheng, P., Zhang, K., & Ling, Z. H. (2020). Encrypted network traffic classification
using deep and parallel network-in-network models. IEEE Access, 8, 132950–132959.
4. Aceto, G., Ciuonzo, D., Montieri, A., & Pescap, A. (2019). Mobile encrypted traffic classification
using deep learning: Experimental evaluation, lessons learned, and challenges. IEEE Transactions on
Network and Service Management, 16(2), 445–458.
5. Wang, P., Chen, X., Ye, F., & Sun, Z. (2019). A survey of techniques for mobile service encrypted traf-
fic classification using deep learning. IEEE Access, 7, 54024–54033.
6. Karagiannis, T., Broido, A., & Faloutsos, M. (2004, October). Transport layer identification of P2P
traffic. Proceedings of Internet Measurement Conference, IEEE.
7. Alizadeh, H., & Zquete, A. (2016). Traffic classification for managing applications networking pro-
files. Security and Communication Networks, 9(14), 2557–2575.
8. Elnawawy, M., Sagahyroon, A., & Shanableh, T. (2020). FPGA-based network traffic classification
using machine learning. IEEE Access, 8, 175637–175650.
9. Pacheco, F., Exposito, E., & Gineste, M. (2019). Towards the deployment of machine learning solu-
tions in network traffic classification: A systematic survey. IEEE Communications Surveys and Tutori-
als, 21(2), 1988–2014.
10. Moore, A., & Zuev, D.(2005). Internet traffic classification using Bayesian analysis techniques. Proceed-
ings of the 2005 ACM SIGMETRICS international conference on measurement and modeling of computer
systems (pp. 50–60).
11. Auld, T., & Moore, A. (2007). Bayesian neural networks for internet traffic classification. IEEE Transac-
tions on Neural Networks, 18(1), 223–239.
12. Hao, S. N., Jing, H., & Liu, S. Y.(2015). Improved SVM method for internet traffic classification based on
feature weight learning. 2015 international conference on control, automation and information sciences
(ICCAIS) (pp. 102–106).
13. Namdev, N., Agrawal, S., & Silkari, S. (2015). Recent advancement in machine learning based Internet
traffic classification. Procedia Computer Science, 60, 784–791.
14. Liu, Y., Li, W., & Li, Y.(2007). Network traffic classification using K-means clustering. In International
multi-symposiums on computer & computational sciences. IEEE Computer Society.
15. Jiang, D., Zheng, W., & Lin, X. (2012). Research on selection of initial center points based on improved
K-means algorithm. In Proceedings of 2012 2nd international conference on computer science and net-
work technology (pp. 1146–1149).
16. Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical
Engineering, 40(1), 6–28.
17. Zou, Y. (2018). Data analysis and processing of massive network traffic based on cloud computing and
research on its key algorithms. Wireless Personal Communications, 102, 3159–3170.
18. Yang, L., Dong, Y., & Rana, M. S. (2018). Fine-grained video traffic classification based on QoE values.
Wireless Personal Communications, 103(4), 1481–1498.
19. Wang, D., Zhang, H., & Liu, R. (2016). Unsupervised feature selection through Gram–Schmidt orthogo-
nalization. A word co-occurrence perspective. Neurocomputing, 173, 845–854.
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Xiaoyu Zhou is currently working towards his B.E. degree in the Col-
lege of Internet of Things, Nanjing University of Posts and Telecom-
munications, Nanjing. His research interests mainly focus on the area
of machine learning.
Dengying Zhang (M'17) received the B.S., M.S., and Ph.D. degrees from Nanjing University of Posts and Telecommunications, Nanjing, China, in 1986, 1989 and 2004, respectively. He is currently a Professor of the School of Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing, China. He was a visiting scholar in the Digital Media Lab, Umea University, Sweden, from 2007 to 2008. His research interests include signal and information processing, networking techniques, and information security.