An Optimized K Means Clustering For Improving Accuracy in Traffic Classification
https://doi.org/10.1007/s11277-021-08435-x
Abstract
With the explosive growth of network traffic, traditional port- and payload-based methods can no longer satisfy the requirements of privacy protection and fast, real-time classification in today's traffic classification. Here, a network traffic classification model based on the fusion of Self-Organizing Maps (SOM) and the K-means algorithm is proposed. The traffic data is first clustered by the SOM network to derive the number of clusters and the center value of each cluster. These values are then taken as the initial parameters of the K-means algorithm, which produces the final classification. Compared with the traditional K-means algorithm, the initial clustering performed by the SOM network not only retains the advantages of a simple method and efficient processing, but also reduces the time cost. Moreover, a significant improvement in classification accuracy is achieved with the proposed algorithm.
1 Introduction
More and more new client applications are emerging as the Internet develops, which promotes the innovative development of Internet-based TCP/IP technology, including 5G infrastructure, its three major application scenarios, and the fifth-generation cellular network [1]. On the one hand, effective network management and improved analysis play a key role in these technologies [2]. On the other hand, network traffic, as an important part of the network, records and reflects the activities of both the Internet
* Corresponding authors: Shasha Zhao ([email protected]); Dengying Zhang ([email protected])
1 College of Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing, China
2 College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China
3 Jiangsu Key Laboratory of Broadband Wireless Communication and Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing, China
S. Zhao et al.
and the users [3]. By analyzing network traffic, users' network behavior can be learned. Meanwhile, network traffic classification is also necessary and beneficial for the communication industry [4, 5]. Operators can improve service quality by effectively supervising the network and can estimate the planning capacity of 5G infrastructure more accurately and reasonably. Administrators can control traffic effectively and reduce congestion via traffic analysis. In short, accurate network traffic classification is an important basis for network security and traffic engineering.
Currently, different approaches have been used to improve network traffic classification, such as the port-number-based method [6] and the one based on IP packet payload parsing [7]. However, these methods require accurate information, including the application-layer data and its packet format, to give an accurate classification. Unfortunately, such information is not easy to extract because of privacy and encryption measures. As another way to improve classification accuracy while protecting users' privacy, machine learning methods based on flow statistical features have been proposed and investigated, such as SVM, Bayes, and C4.5 [8–12]. Their accuracy based on supervised learning can be higher than 70%. For example, Moore et al. achieved a classification accuracy of 95% after a series of improvement measures based on a supervised algorithm [10, 11]. Nevertheless, these algorithms still require the training dataset to be labeled manually, which adds considerable cost. Furthermore, some new traffic categories are hard to recognize, making manual labeling impossible. Therefore, clustering algorithms based on unsupervised learning (e.g., DBSCAN, K-means, EM) have been applied to improve traffic classification [13].
Among them, the K-means algorithm can cluster new traffic types efficiently [14, 15]. However, an improper initial clustering center or an unreasonable K value usually causes its clustering result to fall into a local optimum. Although improved ways of choosing clustering centers have been applied to the K-means algorithm [15], the K value still needs to be set manually. Furthermore, the randomness of initialization is another problem that needs to be solved. Additionally, the classification accuracy of these unsupervised algorithms is lower than that of supervised ones and needs further optimization. In this work, a network traffic classification model named the SOM-K fusion algorithm is proposed, based on the fusion of Self-Organizing Maps (SOM) and the K-means algorithm. As an unsupervised machine learning algorithm, it not only demonstrates significantly improved network traffic classification accuracy, but also reduces the time cost compared with the traditional K-means algorithm.
The remainder of this paper is organized as follows. The improved K-means clustering algorithm is introduced in Sect. 2. The data sources, pre-processing methods, and feature selection algorithm are described in Sect. 3. Finally, the experimental results are evaluated in Sect. 4, and a summary together with future challenges is given in Sect. 5.
2 Improved Algorithm
Generally, the number of clusters used in the traditional K-means algorithm must be set manually in advance, and the suitability of this manually chosen K value determines the resulting classification accuracy. In addition, a random initial cluster center can lead the clustering result into a local optimum.
Thus, to obtain a preferred K value, the SOM network is applied here to initialize and optimize the K-means algorithm. The improved K-means clustering algorithm consists of two steps. First, the traffic dataset is input into the SOM network, and the SOM algorithm is executed, continuously updating the weights of the neurons until the set number of iterations is reached; the clustering result, including the center values and the number of clusters, is then output. Second, these output cluster centers and the number of clusters are taken as the initial values of the K-means algorithm, which derives the final clustering result.
2.1 Self‑Organizing Maps
Compared with conventional neural networks, the SOM network has only input and competition layers, without hidden layers. Its operation can be divided into three processes: competition, cooperation, and synaptic adaptation.
In the competition process, the similarity between the input vector and each neuron is calculated, and the most similar neuron is chosen as the winning neuron. Next, the winning neuron determines the topological neighborhood of excited neurons, which in turn provides the basis for the cooperation of adjacent neurons. In the synaptic adaptation process, the weight vector of each neuron in the neighborhood is adjusted after the topological neighborhood of the winning neuron has been determined. The farther a neuron is from the winning neuron, the greater the degree of suppression and the smaller the proportion of the weight update.
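The competition step above can be sketched as follows. This is an illustrative fragment, not the authors' code; the function name and the array layout are our assumptions:

```python
import numpy as np

def find_winner(x, weights):
    # Competition step: compute the Euclidean distance between the input
    # vector x and every neuron's weight vector; the closest neuron wins.
    # `weights` has shape (n_neurons, n_features).
    distances = np.linalg.norm(weights - x, axis=1)
    return int(np.argmin(distances))
```

The winner's index then anchors the topological neighborhood used in the cooperation and adaptation steps.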
The detailed steps of the SOM algorithm are shown in Algorithm 1. When a certain type of data is input into the SOM network, the Euclidean distance between the input vector and the weight vector is calculated for each neuron. The neuron with the smallest
distance is defined as the winning neuron. During the training process, the weight vectors of the winning neuron and its neighboring neurons are continuously adjusted; the farther a neuron is from the winning neuron, the smaller the magnitude of its weight adjustment. As the iterations proceed, the learning rate decreases and the winning neighborhood shrinks continually. Once the predetermined number of iterations T is reached, the preliminary clustering of the data by the SOM network is complete, and the number of clusters and the center of each cluster are obtained.
To avoid the influence of the varied dimensions of different features on classification accuracy, the input vector X is subjected to z-score normalization, described as

X* = (X − μ) / σ (1)

where X* represents the normalized input vector, μ is the mean value, and σ is the standard deviation. The principle of the weight vector update for all neurons in the winning neighborhood Ni*(t) can be described as

Wi(t + 1) = Wi(t) + η(t) e^(−n) [X − Wi(t)],  i ∈ Ni*(t)
Wi(t + 1) = Wi(t),  i ∉ Ni*(t) (2)
where t is the current iteration number, Wi(t) is the weight of neuron i, η(t) represents the learning rate at the t-th iteration and decays as the number of iterations increases, n is the topological distance from the winning neuron, and e^(−n) weights the update by that distance: the larger the topological distance n, the smaller e^(−n), and thus the smaller the weight update ratio. Generally, the initial winning neighborhood N(t0) is set to a larger value, but it shrinks as the number of iterations increases. As a result, the learning rate can be expressed as

η(t) = e^(−n) / (t + 2),  t = 1, 2, ⋯, T (3)
and the winning neighborhood can be described as

N(t) = N(0) − N(0)·t / T,  t = 1, 2, ⋯, T (4)

where T is the total number of iterations and N(0) is the initial neighborhood.
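Under the update rules above, the SOM training loop can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes a one-dimensional neuron lattice and folds Eqs. (2)–(3) into a single decaying update factor.

```python
import numpy as np

def zscore(X):
    # Eq. (1): per-feature z-score normalization of the input vectors
    return (X - X.mean(axis=0)) / X.std(axis=0)

def train_som(X, n_neurons=6, T=200, N0=3, seed=0):
    # Simplified SOM on a 1-D lattice of `n_neurons` neurons.
    rng = np.random.default_rng(seed)
    X = zscore(np.asarray(X, dtype=float))
    W = rng.normal(size=(n_neurons, X.shape[1]))   # random initial weights
    for t in range(1, T + 1):
        radius = N0 - N0 * t / T                   # Eq. (4): shrinking neighborhood
        x = X[rng.integers(len(X))]                # pick one training sample
        winner = int(np.argmin(np.linalg.norm(W - x, axis=1)))
        for i in range(n_neurons):
            n = abs(i - winner)                    # topological distance on the lattice
            if n <= radius:
                eta = np.exp(-n) / (t + 2)         # Eqs. (2)-(3): decaying learning rate
                W[i] += eta * (x - W[i])           # move neuron toward the input
    return W
```

After training, nearby weight vectors can be merged into clusters, whose count and centers seed the K-means step of Sect. 2.2.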
2.2 K‑means Clustering
The similarity between two samples is measured by the Euclidean distance

d(xi, xj) = √( Σ(k=1..n) (xik − xjk)² ) (5)

where xi and xj represent n-dimensional input vectors, and xik and xjk represent the value of the k-th dimension of xi and xj, respectively.
However, for the K-means algorithm, an improper initial clustering center or an unreasonable K value can make the clustering result fall into a local optimum, resulting in poor clustering. Thus, it is necessary to perform the initial clustering of similar data with the SOM network, obtaining the cluster centers and the number of clusters from the SOM algorithm to initialize the K-means clustering.
The detailed steps of K-means clustering are shown in Algorithm 2. First, the K value and the initial cluster centers are set. After the standardized data set is input, the Euclidean distance from each data point to every cluster center is calculated, and according to these distances the data points are assigned to the nearest cluster. The mean vector of each cluster is then recalculated and used as the new cluster center, and all data points are re-assigned. These steps are repeated until the cluster centers no longer change.
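Algorithm 2, seeded with the SOM output, can be sketched as below. The function name and array layout are illustrative assumptions; K is fixed implicitly by the number of initial centers supplied by the SOM step.

```python
import numpy as np

def som_kmeans(X, init_centers, max_iter=100):
    # K-means clustering initialized with externally supplied centers
    # (here assumed to be the SOM cluster centers).
    X = np.asarray(X, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    k = len(centers)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assign every point to its nearest center (Euclidean distance, Eq. (5)).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean vector of its cluster.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # centers no longer change: stop
            break
        centers = new_centers
    return labels, centers
```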
3 Experiment Setup
In this section, the data trace and preprocessing method are introduced first. Then the principle of the feature selection algorithm and the meaning of the selected features are explained. Finally, the evaluation metrics for algorithm performance are briefly introduced.
3.1 Data Traces
To analyze the performance of the algorithm, the Moore_set [10], the most authoritative test data set in current network traffic classification, was chosen as the experimental data set. One subset of the Moore_set was taken as the initial data. Samples with too many missing dimensions were eliminated, and the remaining missing values were filled by the multiple imputation method [16]. However, the proportions of the different categories in the dataset differ greatly: up to 77.9% of the traffic is of the WWW type, while the GAME type accounts for only 0.002%. To improve the accuracy for the different categories, the proportion of each category was balanced. The balanced experimental data set is listed in Table 1.
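The paper does not specify its exact balancing procedure, so the random resampling scheme below is an assumption: rare classes (such as GAME) are oversampled with replacement, while dominant classes (such as WWW) are undersampled.

```python
import numpy as np

def balance_classes(X, y, n_per_class, seed=0):
    # Resample every class to exactly `n_per_class` samples.
    rng = np.random.default_rng(seed)
    Xb, yb = [], []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        # Sample with replacement only when the class is smaller than the target.
        pick = rng.choice(idx, size=n_per_class, replace=len(idx) < n_per_class)
        Xb.append(X[pick])
        yb.append(y[pick])
    return np.concatenate(Xb), np.concatenate(yb)
```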
3.2 Feature Selection
The traffic data in the Moore_set contains 248 statistical feature attributes, covering most of the features used in current traffic classification. These include typical features such as the port numbers used by the server and client, packet byte length statistics, packet inter-arrival time statistics, the total number of bytes transferred, the throughput, and the data transmission time. Nevertheless, a large number of redundant or unrelated features are also included, which increases the data dimension. Furthermore, the complexity of the K-means clustering algorithm is O(Ktn), which depends on the data dimension n; data with redundant or unrelated features therefore increase the algorithm complexity and decrease computational efficiency.
Therefore, to reduce the time consumption and classifier complexity, selecting a small number of attributes via feature selection is necessary. To date, feature selection methods are usually divided into filter, wrapper, and embedded methods, implemented with supervised or unsupervised algorithms [17–19]. Among them, the correlation-based feature selection (CFS) method, one of the most typical filter methods, has two significant advantages: the correlation between different features can be calculated, and its algorithm complexity is lower than that of wrapper- and embedded-based methods. Here, CFS is used to filter the statistical traffic characteristics.
In detail, the feature-class and feature-feature correlation matrices are first calculated with CFS on the training set. Then, the feature subset is found with a best-first search. Assume the algorithm starts with an empty set D. The heuristic estimate, represented by the Merit value, is first calculated for each possible single feature. The feature with the highest Merit value is added to D, making D a one-dimensional feature vector. The feature with the largest Merit value among the remaining features is then selected and added to D. If the Merit value of the resulting two-dimensional feature vector D is smaller than before, this feature is removed, and the feature with the second-largest Merit value is added to D instead. The above process is repeated until the Merit value of the set D ceases to increase. The heuristic estimate of a feature subset, Merit, is defined as follows:
Merit = m·rcf / √( m + m(m − 1)·rff ) (6)
where m is the number of features, r is the Pearson correlation coefficient, rcf is the average feature-class correlation, and rff is the average feature-feature correlation. After several experiments with our data sets, the selected optimal feature subset contains 10 features, labeled total_packets_b_a, mean_data_ip_b_a, var_data_control_a_b, actual_data_bytes_a_b, actual_data_bytes_b_a, max_data_ip, data_xmit_time_b_a, ack_pkts_sent_b_a, Duration, and mean_IAT_a_b. The specific meaning of each identifier is listed in Table 2.
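Eq. (6) can be computed directly. The sketch below estimates rcf and rff with `numpy.corrcoef` and is an illustrative reading of the CFS merit, not the authors' code; class labels are assumed to be numeric.

```python
import numpy as np

def merit(X_subset, y):
    # Eq. (6): Merit = m * rcf / sqrt(m + m(m-1) * rff)
    # X_subset: (n_samples, m) candidate feature subset; y: numeric labels.
    m = X_subset.shape[1]
    # Average absolute feature-class Pearson correlation (rcf).
    rcf = np.mean([abs(np.corrcoef(X_subset[:, j], y)[0, 1]) for j in range(m)])
    if m == 1:
        rff = 0.0
    else:
        # Average absolute feature-feature correlation over all pairs (rff).
        rff = np.mean([abs(np.corrcoef(X_subset[:, i], X_subset[:, j])[0, 1])
                       for i in range(m) for j in range(i + 1, m)])
    return m * rcf / np.sqrt(m + m * (m - 1) * rff)
```

A best-first search would repeatedly call `merit` on candidate subsets and keep the feature whose addition raises the value the most.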
3.3 Evaluation Metrics
The performance of the algorithm is evaluated using the overall accuracy and the precision. The overall accuracy evaluates the ability of the algorithm to generate clusters that contain only a single traffic class. The precision evaluates the accuracy of classifying traffic samples in each traffic category. A sample correctly classified into a certain category is counted as a true positive (TP), while a sample of another category misclassified into it is counted as a false positive (FP). The precision and the overall accuracy are respectively described as

precision_i = TPi / (TPi + FPi),  i = 1, 2, …, k (7)
and

overall_accuracy = ( Σ(i=1..k) TPi ) / total_samples (8)
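Eqs. (7)–(8) translate directly into code; the helper names below are illustrative:

```python
import numpy as np

def precision_per_class(y_true, y_pred, classes):
    # Eq. (7): precision_i = TP_i / (TP_i + FP_i) for each class i.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    result = {}
    for c in classes:
        predicted_c = y_pred == c
        tp = np.sum(predicted_c & (y_true == c))
        result[c] = tp / predicted_c.sum() if predicted_c.sum() else 0.0
    return result

def overall_accuracy(y_true, y_pred):
    # Eq. (8): correctly classified samples over total samples.
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```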
4 Experimental Results
Fig. 1 Time cost depending on the number of features for the three algorithms: SOM, K-means, and SOM-K

Fig. 2 Overall accuracy depending on the number of features for the K-means, SOM, and SOM-K algorithms

Fig. 3 Classification precision for different types of traffic, calculated with the K-means, SOM, and SOM-K algorithms
Figure 3 shows the classification precision, indicating a positive correlation between sample size and accuracy. Notably, the improvement in the classification accuracy rate is most obvious for the "small-sample" categories, including MULTIMEDIA, P2P, and INTERACTIVE. Overall, the SOM-K algorithm proposed in this work improves the traffic classification accuracy rate by 8–20% compared with the traditional K-means algorithm, and by 5–32% compared with the SOM algorithm. Moreover, for the 11 traffic categories, the classification accuracy rate of the SOM-K algorithm approaches or even exceeds 80%.
5 Conclusion
In summary, the SOM-K algorithm, as a method for traffic classification based on unsuper-
vised learning has been proposed in this work. The optimal subset of traffic characteristics
was selected with the CFS algorithm and used to run K-means algorithms. The experi-
mental results show that the overall accuracy of the traffic classification model with this
SOM-K algorithm can reach 87.8%, and the classification accuracy can exceed 90% for the
categories with massive samples. Additionally, SOM-K achieves higher accuracy and less
time cost than SOM and K-means algorithms for the traffic classification methods based on
unsupervised learning, which implies efficient data processing capabilities of our proposed
algorithm in the field of data processing.
References
1. Nahum, C. V., et al. (2020). Testbed for 5G connected artificial intelligence on virtualized networks.
IEEE Access, 8, 223202–223213.
2. Tzanakaki, A., Anastasopoulos, M., Berberana, I., Syrivelis, D., & Flegkas, P. (2017). Wireless-optical
network convergence: Enabling the 5G architecture to support operational and end-user services. IEEE
Communications Magazine, 55(10), 184–192.
3. Bu, Z., Zhou, B., Cheng, P., Zhang, K., & Ling, Z. H. (2020). Encrypted network traffic classification
using deep and parallel network-in-network models. IEEE Access, 8, 132950–132959.
4. Aceto, G., Ciuonzo, D., Montieri, A., & Pescap, A. (2019). Mobile encrypted traffic classification
using deep learning: Experimental evaluation, lessons learned, and challenges. IEEE Transactions on
Network and Service Management, 16(2), 445–458.
5. Wang, P., Chen, X., Ye, F., & Sun, Z. (2019). A survey of techniques for mobile service encrypted traf-
fic classification using deep learning. IEEE Access, 7, 54024–54033.
6. Karagiannis, T., Broido, A., & Faloutsos, M. (2004, October). Transport layer identification of P2P
traffic. Proceedings of Internet Measurement Conference, IEEE.
7. Alizadeh, H., & Zquete, A. (2016). Traffic classification for managing applications networking pro-
files. Security and Communication Networks, 9(14), 2557–2575.
8. Elnawawy, M., Sagahyroon, A., & Shanableh, T. (2020). FPGA-based network traffic classification
using machine learning. IEEE Access, 8, 175637–175650.
9. Pacheco, F., Exposito, E., & Gineste, M. (2019). Towards the deployment of machine learning solu-
tions in network traffic classification: A systematic survey. IEEE Communications Surveys and Tutori-
als, 21(2), 1988–2014.
10. Moore, A., & Zuev, D.(2005). Internet traffic classification using Bayesian analysis techniques. Proceed-
ings of the 2005 ACM SIGMETRICS international conference on measurement and modeling of computer
systems (pp. 50–60).
11. Auld, T., & Moore, A. (2007). Bayesian neural networks for internet traffic classification. IEEE Transac-
tions on Neural Networks, 18(1), 223–239.
12. Hao, S. N., Jing, H., & Liu, S. Y.(2015). Improved SVM method for internet traffic classification based on
feature weight learning. 2015 international conference on control, automation and information sciences
(ICCAIS) (pp. 102–106).
13. Namdev, N., Agrawal, S., & Silkari, S. (2015). Recent advancement in machine learning based Internet
traffic classification. Procedia Computer Science, 60, 784–791.
14. Liu, Y., Li, W., & Li, Y.(2007). Network traffic classification using K-means clustering. In International
multi-symposiums on computer & computational sciences. IEEE Computer Society.
15. Jiang, D., Zheng, W., & Lin, X. (2012). Research on selection of initial center points based on improved
K-means algorithm. In Proceedings of 2012 2nd international conference on computer science and net-
work technology (pp. 1146–1149).
16. Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical
Engineering, 40(1), 6–28.
17. Zou, Y. (2018). Data analysis and processing of massive network traffic based on cloud computing and
research on its key algorithms. Wireless Personal Communications, 102, 3159–3170.
18. Yang, L., Dong, Y., & Rana, M. S. (2018). Fine-grained video traffic classification based on QoE values.
Wireless Personal Communications, 103(4), 1481–1498.
19. Wang, D., Zhang, H., & Liu, R. (2016). Unsupervised feature selection through Gram–Schmidt orthogo-
nalization. A word co-occurrence perspective. Neurocomputing, 173, 845–854.
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Xiaoyu Zhou is currently working towards his B.E. degree in the Col-
lege of Internet of Things, Nanjing University of Posts and Telecom-
munications, Nanjing. His research interests mainly focus on the area
of machine learning.
Dengying Zhang (M'17) received the B.S., M.S., and Ph.D. degrees from Nanjing University of Posts and Telecommunications, Nanjing, China, in 1986, 1989 and 2004, respectively. He is currently a Professor of the School of Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing, China. He was a visiting scholar in the Digital Media Lab, Umea University, Sweden, from 2007 to 2008. His research interests include signal and information processing, networking techniques, and information security.