Darknet Traffic Classification with Machine Learning Algorithms and SMOTE Method

Hasan Karagöl
Department of Electrical and Electronics Engineering
Trakya University
Edirne, Türkiye
[email protected]

Oğuzhan Erdem
Department of Electrical and Electronics Engineering
Trakya University
Edirne, Türkiye
[email protected]

Barkın Akbaş
Department of Electrical and Electronics Engineering
Trakya University
Edirne, Türkiye
[email protected]

Tuncay Soylu
Department of Occupational Health and Safety Program
University of Health Sciences
İstanbul, Türkiye
[email protected]

2022 7th International Conference on Computer Science and Engineering (UBMK) | 978-1-6654-7010-0/22/$31.00 ©2022 IEEE | DOI: 10.1109/UBMK55850.2022.9919462
(IEEE - UBMK-2022) - VII th International Conference on Computer Science and Engineering - 374
Authorized licensed use limited to: Sharvari Govilkar. Downloaded on June 27,2023 at 06:39:09 UTC from IEEE Xplore. Restrictions apply.
and network trends can be found in the real network by analyzing only the packet header information of the darknet traffic, without examining the payload. In [12], the efficiency of time-related features was studied to solve the problem of characterizing encrypted traffic and detecting Virtual Private Network (VPN) traffic. In [13], the authors presented a time analysis to detect and characterize Tor traffic. They selected exclusively time-based statistical features and showed that these features can be used effectively to detect Tor traffic.

The authors of [14] proposed a darknet traffic detection system and evaluated the performance of six supervised ML techniques. They reported that the Bagged Decision Tree algorithm, with an accuracy of 99.5%, outperforms all the other models. In [15], the authors proposed a model for encrypted traffic classification. They compared the performance of 17 different ML algorithms and reached the best result of 91.6% accuracy with the XGBoost algorithm. The authors of [2] classified darknet traffic using Deep Learning methods. Through multiple Deep Feature Learning, they transform the problem of identifying darknet applications into a typical classification task. They showed that DarknetSec outperformed all other approaches, with the highest accuracy of 92.22%. In [3], the authors eliminate the imbalance of the dataset using SMOTE and perform Principal Component Analysis (PCA) to reduce the number of features. They applied binary classification to the balanced dataset and achieved 99% accuracy with the Extra Tree and Decision Tree algorithms. In [4], the authors studied binary and multiclass classification with the same dataset. In both cases, they tested 5 different ML algorithms on the darknet traffic and obtained the best result of 98% accuracy with the Random Forest algorithm. Soylu et al. were the first to use the Simple CART algorithm for internet traffic classification with 8 application classes [16]. The proposed architecture is implemented on parallel and pipeline architectures in FPGAs and sustains 665 and 914 gigabits per second (Gbps), or 2078 and 2857 million classifications per second (MCPS), respectively, with 96.8125% accuracy.

III. ML-BASED DARKNET TRAFFIC CLASSIFICATION

A. Dataset

The CIC-Darknet2020 dataset [17] was collected by the Canadian Institute for Cybersecurity to detect and characterize VPN and Tor applications. It contains traffic labeled as Benign and Darknet. Darknet traffic consists of Audio-Stream, Browsing, Chat, Email, P2P, File-Transfer, Video-Stream and VOIP. The dataset consists of 85 features and 141530 rows, including 24311 and 134348 samples from Darknet and Benign flows, respectively.

Duplicate data, columns containing the same value, rows with missing values, and rows with NaN values, all of which have a misleading effect on the results, were removed during data preprocessing. After preprocessing, 116999 flows remain, including 24094 Darknet and 92905 Benign flows (Table I), with 63 features (Table II). Table I shows the distribution of the 8 subclasses over the Tor, VPN, Non-Tor and Non-VPN classes. The data under the Tor and VPN labels is considered Darknet, whereas the Non-Tor and Non-VPN data forms the Benign class.

TABLE I. APPLICATION CLASSES IN THE CIC-DARKNET2020 DATASET

                      Darknet               Benign
Class              Tor       VPN       Non-Tor   Non-VPN
Audio-Stream       121     13056          1470      3296
Browsing           263         0         32451         0
Chat                65      4476           410      6520
Email               13       569           490      5054
P2P                110         0         24150         0
File-Transfer      107      2503          6731      1832
Video-Stream       202      1144          3363      5038
VOIP               298      1167             0      2100
Total             1179     22915         69065     23840
General Total         24094                  92905

TABLE II. CANDIDATE FEATURE LIST

F_ID  Feature Description           F_ID  Feature Description
F1    Src IP                        F33   Bwd IAT Min
F2    Src Port                      F34   Fwd PSH Flags
F3    Dst IP                        F35   Fwd Header Length
F4    Dst Port                      F36   Bwd Header Length
F5    Flow Duration                 F37   Fwd Packets/s
F6    Total Fwd Packet              F38   Bwd Packets/s
F7    Total Bwd packets             F39   Packet Length Min
F8    Total Length of Fwd Packet    F40   Packet Length Max
F9    Total Length of Bwd Packet    F41   Packet Length Mean
F10   Fwd Packet Length Max         F42   FIN Flag Count
F11   Fwd Packet Length Min         F43   SYN Flag Count
F12   Fwd Packet Length Mean        F44   RST Flag Count
F13   Fwd Packet Length Std         F45   PSH Flag Count
F14   Bwd Packet Length Max         F46   ACK Flag Count
F15   Bwd Packet Length Min         F47   Down/Up Ratio
F16   Bwd Packet Length Mean        F48   Average Packet Size
F17   Bwd Packet Length Std         F49   Fwd Segment Size Avg
F18   Flow Bytes/s                  F50   Bwd Segment Size Avg
F19   Flow Packets/s                F51   Bwd Packet/Bulk Avg
F20   Flow IAT Mean                 F52   Bwd Bulk Rate Avg
F21   Flow IAT Std                  F53   Subflow Fwd Packets
F22   Flow IAT Max                  F54   Subflow Fwd Bytes
F23   Flow IAT Min                  F55   Subflow Bwd Bytes
F24   Fwd IAT Total                 F56   FWD Init Win Bytes
F25   Fwd IAT Mean                  F57   Bwd Init Win Bytes
F26   Fwd IAT Std                   F58   Fwd Act Data Pkts
F27   Fwd IAT Max                   F59   Fwd Seg Size Min
F28   Fwd IAT Min                   F60   Idle Mean
F29   Bwd IAT Total                 F61   Idle Std
F30   Bwd IAT Mean                  F62   Idle Max
F31   Bwd IAT Std                   F63   Idle Min
F32   Bwd IAT Max

We note that the dataset is unbalanced in terms of both the 8 subclasses and the main Darknet and Benign classes. To ameliorate this problem, we further utilized the SMOTE technique in this paper.

B. Synthetic Minority Oversampling Technique (SMOTE)

The SMOTE [18] algorithm is used to create a balanced dataset by synthetically increasing the number of samples with minority class values in the imbalanced dataset. This method is frequently used in areas where it is relatively difficult or expensive to find data, such as health and internet traffic. It helps avoid the problem of overfitting and provides good classification performance. In SMOTE, the generation of a synthetic instance can be represented as:
    y_{syn} = y_i + (y_j - y_i) * \gamma        (1)
where y_i is the minority class instance under consideration, y_j is an instance randomly chosen from the k-nearest minority neighbors of y_i, and \gamma is a vector in which each component is an arbitrary number from [0,1]; the symbol "*" indicates element-wise multiplication [19].

The SMOTE algorithm was applied in 2 different ways to the dataset presented in Table I. SMOTE (Level 1) denotes equating the sample numbers of the Darknet and Benign classes directly (random oversampling from subclasses such as Audio-Stream, Browsing, etc.), regardless of the balance of subclass sizes (24094 Darknet and 92905 Benign). SMOTE (Level 2) denotes running the SMOTE algorithm to equalize the number of samples in the subclasses (103336 Darknet and 103336 Benign). We can follow from Table I and Table III that the 121 Audio-Stream samples of the Tor class in Table I have been increased to 1470, which is the number of samples in Non-Tor. Similarly, the number of samples belonging to the Non-VPN label is increased from 3296 to 13056 to provide balance with the VPN side. This method was applied so that the 8 different subclasses contribute equally to the classification. The unbalanced initial dataset without SMOTE is referred to as w/o SMOTE in the rest of the paper.

TABLE III. APPLICATION CLASSES IN THE CIC-DARKNET2020 DATASET AFTER SMOTE (LEVEL 2)

                      Darknet               Benign
Class              Tor       VPN       Non-Tor   Non-VPN
Audio-Stream      1470     13056          1470     13056
Browsing         32451         0         32451         0
Chat               410      6520           410      6520
Email              490      5054           490      5054
P2P              24150         0         24150         0
File-Transfer     6731      2503          6731      2503
Video-Stream      3363      5038          3363      5038
VOIP                 0      2100             0      2100
Total            69065     34271         69065     34271
General Total        103336                 103336

C. Feature Selection Algorithms

In machine learning, large amounts of data are collected in order to better train the algorithm, but it is often not useful to work with such large amounts of data. Feature selection (FS) is the process of selecting relevant features or a subset of features. Evaluation criteria are used to find an optimal subset of features [20]. Feature selection algorithms are characterized by their search organization, generation of successors, and evaluation measures. FS makes assessments by probability of error, divergence, dependence, interclass distance, information gain and consistency [21].

In this paper, Information Gain was used as the FS metric with the ranker method. The F1 (Src IP) and F3 (Dst IP) features in Table II are excluded from the dataset before SMOTE is applied, in order to achieve a more accurate classification, since they do not have real values in such datasets. FS was applied both before and after the SMOTE steps individually. The selected features in both scenarios are shown in Table IV.

TABLE IV. SELECTED FEATURES LIST

F_ID  Feature Description           F_ID  Feature Description
F4    Dst Port                      F36   Bwd Header Length
F5    Flow Duration                 F40   Packet Length Max
F8    Total Length of Fwd Packet    F41   Packet Length Mean
F10   Fwd Packet Length Max         F48   Average Packet Size
F14   Bwd Packet Length Max         F50   Bwd Segment Size Avg
F16   Bwd Packet Length Mean        F56   Fwd Init Win Bytes
F19   Flow Packets/s                F57   Bwd Init Win Bytes
F20   Flow IAT Mean                 F60   Idle Mean
F22   Flow IAT Max                  F62   Idle Max
F23   Flow IAT Min                  F63   Idle Min
F35   Fwd Header Length

IV. PERFORMANCE EVALUATION

A. Experimental Setup

In our study, we propose three classification scenarios using the dataset from [17]. In all cases, the imbalance in the datasets was compensated by oversampling with SMOTE. In the first analysis, we performed binary classification where the traffic is decomposed into Darknet and Benign classes (Case 1, DB as their initials). Next, the traffic is classified into four classes as Tor, VPN, Non-Tor and Non-VPN (Case 2, TVNN). Finally, we classified the traffic trace into eight application classes as Audio-Stream, Browsing, Chat, Email, P2P, File-Transfer, Video-Stream and VOIP (Case 3, ABCEPFVV). In this paper, we employed the most common ML algorithms, which are Decision Tree [5], Simple CART [6], Random Forest [6], kNN [7], Naive Bayes [8] and Adaboost [9]. We split the data into 75% for training and 25% for testing. In order to avoid overfitting, the k-fold cross validation [22] technique was used in the classifications, where the k value was chosen as 10. In all cases, the most prominent features of Table II (initial feature set) were included as a result of FS.

The following computer resources were used for modeling and training: Hardware: a Lenovo mobile computer with 16 GB RAM and an Intel i7-7700HQ CPU @ 2.80 GHz; Software: Python and WEKA for modeling and visualization.

B. The Performance Comparison of Datasets

Table V shows the performance comparison of the w/o SMOTE, SMOTE (Level 1) and SMOTE (Level 2) datasets for both Case 1 (DB) and Case 2 (TVNN). In this analysis, we observed the effect of SMOTE both when equating the subclasses and when equating the super classes without equating the subclasses. All 3 sets were trained with the Decision Tree algorithm after performing FS (with Information Gain), and the results are given in terms of the accuracy metric. Table V also shows the optimum number of features after the selection algorithm is applied. The results show that SMOTE (Level 2) outperforms the others in both Case 1 and Case 2 (96.84% and 96.74%, respectively). Furthermore, the smallest number of features is selected for the SMOTE (Level 2) set (8 features for both Case 1 and Case 2).

TABLE V. THE PERFORMANCE COMPARISON OF DATASETS

                      Case 1 (DB)                  Case 2 (TVNN)
Dataset            Accuracy (%)  Num. of Feat.  Accuracy (%)  Num. of Feat.
w/o SMOTE          95.76         9              96.39         11
SMOTE (Level 1)    95.96         10             92.04         15
SMOTE (Level 2)    96.84         8              96.74         8
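As an illustration of the interpolation in Eq. (1), the following is a minimal pure-Python sketch of how synthetic minority samples could be generated. It is not the implementation used in this paper: the function name `smote_samples`, its parameters, the brute-force neighbor search and the toy points are all assumptions made for the example.

```python
import math
import random


def smote_samples(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority-class vectors.

    Hypothetical helper: each synthetic point follows Eq. (1),
    y_syn = y_i + (y_j - y_i) * gamma, with gamma drawn per feature.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # y_i: the minority instance under consideration.
        i = rng.randrange(len(minority))
        y_i = minority[i]
        # k nearest minority neighbors of y_i (Euclidean distance, y_i excluded).
        others = [p for idx, p in enumerate(minority) if idx != i]
        neighbors = sorted(others, key=lambda p: math.dist(y_i, p))[:k]
        # y_j: one of those neighbors, chosen at random.
        y_j = rng.choice(neighbors)
        # gamma: one random number in [0, 1) per feature (element-wise product).
        gamma = [rng.random() for _ in y_i]
        synthetic.append([a + (b - a) * g for a, b, g in zip(y_i, y_j, gamma)])
    return synthetic


# Toy 2-D minority class (illustrative values only, not taken from the dataset).
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_synthetic = smote_samples(toy_minority, n_new=10, k=2, seed=1)
```

Under SMOTE (Level 2), a routine of this kind would be called once per subclass; for instance, bringing the Tor Audio-Stream subclass from 121 to 1470 samples (Table I and Table III) would correspond to n_new = 1470 - 121 = 1349.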
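The Information Gain ranking used for FS can be illustrated with the short sketch below. It assumes features that have already been discretized (continuous flow statistics would first need to be binned), and the feature names and labels are hypothetical; it is a sketch of the general technique, not of the WEKA configuration used in the experiments.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(class) - H(class | feature) for one discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(columns, labels):
    """Return feature names sorted by decreasing information gain."""
    gains = {name: information_gain(values, labels) for name, values in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Hypothetical, already-discretized toy data (not taken from the dataset).
toy_labels = ["Darknet", "Darknet", "Benign", "Benign"]
toy_columns = {
    "parity": ["a", "b", "a", "b"],                 # unrelated to the class
    "dst_port_bin": ["high", "high", "low", "low"],  # separates the classes
}
toy_ranking = rank_features(toy_columns, toy_labels)  # -> ['dst_port_bin', 'parity']
```

A ranker-style selection then keeps the top-ranked features; how many to keep is a separate choice, which the paper reports as the optimum number of features per dataset.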
C. The Performance Comparison of ML Algorithms

Table VI lists the features selected before and after SMOTE for each of the three analyses separately (detailed descriptions of the selected features are given in Table IV). The number of features was reduced from 63 to 8, 8 and 6 after SMOTE for Case 1, 2 and 3, respectively. Table VII compares the performance of the ML algorithms on the selected features in terms of Accuracy [23], Precision [24], Recall [24], Mean Squared Error (MSE) [25], F1-Score [26], and Kappa [27]. In addition, tree depth and the number of leaf nodes are also presented for tree-based algorithms.

TABLE VI. SELECTED FEATURES BEFORE AND AFTER SMOTE

Analyzes            Before SMOTE                                              After SMOTE
Case 1 (DB)         F38, F20, F3, F21, F12, F39, F46, F48, F14                F21, F54, F55, F33, F34, F46, F39, F6
Case 2 (TVNN)       F38, F39, F46, F20, F3, F21, F6, F17, F8, F18, F54        F54, F21, F55, F39, F46, F33, F6, F38
Case 3 (ABCEPFVV)   F2, F39, F46, F38, F60, F20, F58, F61                     F60, F58, F61, F6, F2, F39

Note that each ML algorithm was executed for Case 1, 2 and 3 separately. Thus, three results are presented for each algorithm in Table VII, where the first row is for Case 1 and the last row is for Case 3. For example, the accuracy values for the Decision Tree algorithm before SMOTE are 95.76%, 96.39% and 85.25% for Case 1, 2 and 3, respectively. The results confirm that the SMOTE method increases the accuracy values for all algorithms and cases. The Random Forest algorithm gives the highest accuracy values after SMOTE (97.22%, 97.16% and 85.99% for Case 1, 2 and 3, respectively). Furthermore, this method gives satisfactory results in terms of the other metrics as well (Recall, Precision, MSE, F1-Score, Kappa). In addition, tree depth and number of leaf nodes, which are important parameters for hardware applications of internet traffic classification studies, are also shown in Table VII. These values are at a satisfactory level for the large dataset used in this paper.

V. CONCLUSION

This paper presented a systematic approach to evaluate machine learning algorithms in traffic classification. The cross validation technique was used to avoid over-learning, and feature selection algorithms were used to identify the most prominent features. The SMOTE method was applied to achieve unbiased classification.

As future work, we plan to achieve higher accuracy with fewer features (features selected automatically by a deep learning algorithm instead of hand-crafted feature selection algorithms) by performing deep-learning-based internet traffic classification using the SMOTE method.
TABLE VII. BEFORE/AFTER SMOTE EVOLUTION FOR CASE 1, 2 AND 3.

                                          Accuracy (%)   Recall       Precision    MSE          F1-Score     Kappa        Tree Depth   # of Leaf Nodes
ML Algorithm                        Case  Bef.   Aft.    Bef.  Aft.   Bef.  Aft.   Bef.  Aft.   Bef.  Aft.   Bef.  Aft.   Bef.  Aft.   Bef.  Aft.
Decision Tree (C4.5) (Binary Split) 1     95.76  96.84   0.96  0.97   0.96  0.97   0.87  0.94   0.96  0.97   0.87  0.94   24    28     803   1271
                                    2     96.39  96.74   0.96  0.97   0.96  0.97   0.95  0.96   0.96  0.97   0.94  0.95   25    27     934   1542
                                    3     85.25  85.72   0.85  0.86   0.85  0.87   0.83  0.83   0.85  0.85   0.82  0.82   27    27     1587  2287
Simple CART                         1     95.73  96.73   0.96  0.97   0.96  0.97   0.87  0.94   0.96  0.97   0.87  0.93   24    29     589   1465
                                    2     96.14  96.66   0.96  0.97   0.96  0.97   0.95  0.96   0.96  0.97   0.93  0.95   22    27     606   1297
                                    3     84.96  85.80   0.85  0.86   0.85  0.87   0.82  0.83   0.85  0.85   0.82  0.82   40    35     1301  2299
Random Forest (Num. of Trees=100)   1     95.71  97.22   0.96  0.97   0.96  0.97   0.87  0.94   0.96  0.97   0.87  0.94   -     -      -     -
                                    2     96.55  97.16   0.97  0.97   0.97  0.97   0.95  0.97   0.97  0.97   0.94  0.96   -     -      -     -
                                    3     84.12  85.99   0.84  0.86   0.84  0.87   0.82  0.83   0.84  0.86   0.81  0.82   -     -      -     -
k-NN (k=3)                          1     94.48  95.71   0.95  0.96   0.94  0.96   0.83  0.91   0.94  0.96   0.83  0.91   -     -      -     -
                                    2     94.61  95.51   0.95  0.96   0.95  0.96   0.92  0.94   0.95  0.96   0.91  0.94   -     -      -     -
                                    3     80.66  81.62   0.81  0.82   0.81  0.82   0.77  0.78   0.81  0.81   0.77  0.77   -     -      -     -
Naive Bayes (NB)                    1     92.08  92.62   0.92  0.91   0.92  0.91   0.75  0.82   0.92  0.91   0.75  0.81   -     -      -     -
                                    2     89.28  91.39   0.89  0.91   0.90  0.91   0.90  0.89   0.90  0.91   0.81  0.88   -     -      -     -
                                    3     80.80  81.30   0.81  0.75   0.83  0.82   0.78  0.74   0.81  0.77   0.77  0.70   -     -      -     -
Adaboost                            1     87.44  87.67   0.87  0.88   0.87  0.88   0.60  0.76   0.87  0.88   0.59  0.75   -     -      -     -
                                    2     74.07  96.92   0.71  0.97   0.71  0.97   0.40  0.96   0.71  0.97   0.48  0.96   -     -      -     -
                                    3     84.25  85.91   0.84  0.86   0.84  0.87   0.82  0.83   0.84  0.86   0.81  0.82   -     -      -     -
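For reference, the sketch below shows how Accuracy, Precision, Recall, F1-Score and Kappa follow from the confusion matrix of a binary problem such as Case 1. The function name `binary_metrics` and the counts in the example are made up for illustration and are not results from this paper; for the multiclass cases, per-class values would additionally have to be averaged, which is not shown here.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common classification metrics from binary confusion-matrix counts.

    tp/fp/fn/tn are the true-positive, false-positive, false-negative and
    true-negative counts (hypothetical inputs for this sketch).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: agreement beyond what the class marginals give by chance.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Illustrative counts only.
example = binary_metrics(tp=40, fp=10, fn=10, tn=40)
```

Kappa is useful next to accuracy on imbalanced data, because a classifier that always predicts the majority class can reach a high accuracy while its kappa stays at zero.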
REFERENCES

[1] Y. Zhang, Y. Xiao, K. Ghaboosi, J. Zhang, & H. Deng, "A survey of cyber crimes." Security and Communication Networks (2011), 5(4), 422-437.
[2] J. Lan, X. Liu, B. Li, Y. Li, & T. Geng, "DarknetSec: A novel self-attentive deep learning method for darknet traffic classification and application identification." Computers & Security (2022), 116, 102663.
[3] N. Jadav, N. Dutta, H. K. D. Sarma, E. Pricop, & S. Tanwar, "A machine learning approach to classify network traffic." 13th International Conference on Electronics, Computers and Artificial Intelligence (2021).
[4] L. A. Iliadis, & T. Kaifas, "Darknet traffic classification using machine learning techniques." 10th International Conference on Modern Circuits and Systems Technologies (2021).
[5] J. R. Quinlan, "Induction of decision trees." Machine Learning (1986), 1(1), 81-106.
[6] L. Breiman, J. Friedman, C. J. Stone, & R. A. Olshen, "Classification and regression trees." Taylor & Francis (1984).
[7] N. S. Altman, "An introduction to kernel and nearest-neighbor nonparametric regression." The American Statistician (1992), 46(3), 175.
[8] D. D. Lewis, "Naive (Bayes) at forty: the independence assumption in information retrieval." Machine Learning: (1998), 4-15.
[9] R. S. Schapire, "A brief introduction to boosting." Proceedings of the 16th International Joint Conference on Artificial Intelligence (1999).
[10] J. Liu, & K. Fukuda, "An evaluation of darknet traffic taxonomy." Journal of Information Processing (2018), 26(0), 148-157.
[11] R. Niranjana, V. A. Kumar, & S. Sheen, "Darknet traffic analysis and classification using numerical AGM and mean shift clustering algorithm." SN Computer Science (2019).
[12] G. Draper-Gil, A. H. Lashkari, M. S. I. Mamun, & A. A. Ghorbani, "Characterization of encrypted and VPN traffic using time-related features." Proceedings of the 2nd International Conference on Information Systems Security and Privacy (2016).
[13] A. Habibi Lashkari, G. Draper Gil, M. S. I. Mamun, & A. A. Ghorbani, "Characterization of Tor traffic using time based features." Proceedings of the 3rd International Conference on Information Systems Security and Privacy (2017).
[14] Q. Abu Al-Haija, M. Krichen, & W. Abu Elhaija, "Machine-learning-based darknet traffic detection system for IoT applications." Electronics (2022), 11(4), 556.
[15] N. Gupta, V. Jindal, & P. Bedi, "Encrypted traffic classification using eXtreme gradient boosting algorithm." Advances in Intelligent Systems and Computing (2021), 225-232.
[16] T. Soylu, O. Erdem, & A. Carus, "Bit vector-coded simple CART structure for low latency traffic classification on FPGAs." Computer Networks (2020).
[17] A. Habibi Lashkari, G. Kaur, & A. Rahali, "DIDarknet: A Contemporary Approach to Detect and Characterize the Darknet Traffic using Deep Image Learning." 10th International Conference on Communication and Network Security (2020).
[18] K. Pearson, "On lines and planes of closest fit to systems of points in space." The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science (1901), LIII, 2(11), 559-572.
[19] N. V. Chawla, K. W. Bowyer, L. O. Hall, & W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique." Journal of Artificial Intelligence Research (2002), 16, 321-357.
[20] R. Kohavi, & G. H. John, "Wrappers for feature subset selection." Artificial Intelligence (1997), 97(1-2), 273-324.
[21] L. Molina, L. Belanche, & A. Nebot, "Feature selection algorithms: a survey and experimental evaluation." IEEE International Conference on Data Mining (2002).
[22] M. Stone, "Cross-validatory choice and assessment of statistical predictions." Journal of the Royal Statistical Society: Series B (Methodological) (1974), 36(2), 111-133.
[23] J. A. Swets, "Measuring the Accuracy of Diagnostic Systems." Science (1988), 240(4857), 1285-1293.
[24] M. Buckland, & F. Gey, "The relationship between recall and precision." Journal of the American Society for Information Science (1994), 45(1), 12-19.
[25] E. Lehmann, & G. Casella, "Theory of point estimation." Technometrics (1999), 41(3), 274.
[26] N. Chinchor, "MUC-4 evaluation metrics." Proceedings of the 4th Conference on Message Understanding - MUC4 '92.
[27] J. Cohen, "A coefficient of agreement for nominal scales." Educational and Psychological Measurement (1960), 20(1), 37-46.