2021 IEEE International Conference on Service-Oriented System Engineering (SOSE)

A Review and Analysis of the Bot-IoT Dataset


Jared M. Peterson, Joffrey L. Leevy, Taghi M. Khoshgoftaar
Florida Atlantic University
Email: [email protected], [email protected], [email protected]

978-1-6654-3477-5/21/$31.00 ©2021 IEEE | DOI: 10.1109/SOSE52839.2021.00007

Abstract: Machine learning is rapidly changing the cybersecurity landscape. The use of predictive models to detect malicious activity and identify inscrutable attack patterns is providing levels of automation that are desperately needed to level the playing field between malicious actors and network defenders. This has led to increased research at the intersection of machine learning and cybersecurity and also the creation of many publicly available datasets. This paper provides an in-depth, unique review and analysis of one of the newest datasets, Bot-IoT. The full dataset contains about 73 million instances (big data). Models trained on Bot-IoT are capable of detecting various botnet attacks in Internet of Things (IoT) networks. The purpose of this paper is to provide researchers with a fundamental understanding of Bot-IoT, its features, and some of its pitfalls. We also discuss data cleaning procedures and briefly summarize the use of the dataset in published research.

Index Terms: Bot-IoT, data cleaning, feature analysis, machine learning, big data.

I. INTRODUCTION

With the number of networked devices increasing every year, the challenge of securing these devices is also increasing. Organizations and businesses around the world are charged with protecting these new and vast networks but are overburdening security analysts with a massive influx of data. The most obvious solution has been the turn to automation, specifically machine learning, to identify and respond to attacks without the need for constant human supervision. Machine learning models can drastically improve the detection rate of traditional static signature-based detection systems while reducing human workload. To that end, there has been a large amount of research on machine learning applications for cybersecurity. Many researchers choose to investigate different machine learning models and thereby leverage publicly available datasets to train and test their models.

A significant part of the machine learning process is the analysis of data that is used to train a predictive model. This can be a daunting task, especially when a dataset requires additional knowledge of the problem domain. A clear understanding of a dataset is of critical importance in machine learning, and misunderstood or poorly generated features can lead to models that produce invalid results.

During data cleaning, it is important to understand the significance of a particular feature, its expected type and range of values [1]. Without this information, it would be difficult to discern between valid and invalid values. Understanding a feature’s potential range of values ensures that the proper lower and upper bounds are used in the normalization process, even if the dataset does not contain them. This is important for a model to maintain its ability to generalize. The generalizability of a model can also be affected during the feature selection process. Features that contain environment- or dataset-specific information can cause models to overfit. This is especially true for network security data, which can contain features that are only relevant to a single network.

In this paper, we provide an in-depth analysis of the Bot-IoT dataset developed by Koroniotis et al. [2]. Bot-IoT was created in 2018 and published in 2019 by the University of New South Wales (UNSW) as a modernized and realistic dataset for training models to detect botnet attacks in Internet of Things (IoT) networks. IoT refers to a network of devices, not normally regarded as computers, that have Internet connectivity and limited computing capability [3].

The main contribution of this paper is to provide a clear understanding of Bot-IoT, its subsets, and its features. We also identify features of the dataset that may invalidate performance results, and to the best of our knowledge, we are the first to do this. Our work goes beyond the scope of a traditional review paper. In addition, we analyze published Bot-IoT works and rely on our empirical research to establish the validity of features used in those works.

The remainder of this paper is organized as follows: Section II discusses the composition of Bot-IoT; Section III analyzes the features contained within the dataset; Section IV discusses the data cleaning of Bot-IoT; Section V provides an overview of papers that have incorporated the dataset; and Section VI concludes with the main points of this paper.

II. DATASET DESCRIPTION

Bot-IoT was developed in a testbed consisting of multiple virtual machines with various operating systems (OSs), network firewalls, network taps, the Node-red tool [4], and the Argus network security tool [2], [5]. Bot-IoT is comprised of multiple sets and subsets that are different in file format, size, and feature count. A full listing of features, their definitions, and associated sets and subsets are shown in Table III (Appendix).

The first set, referred to as the Raw Set, contains roughly 70 GB of packet capture (PCAP) files. These files contain the network data that was intercepted in the testbed environment from network taps. Since this set is raw, it needs to be processed by a network analysis tool such as Wireshark [6], Argus [5], or Zeek [7] before it is usable for traditional machine learning models. This also means the features produced by the Raw Set can vary based on the tool utilized to process the PCAP.
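The Introduction's remark about choosing normalization bounds from a feature's potential range, rather than from the values that happen to be present, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the helper `min_max_scale`, the 16-bit domain of 0 to 65,535, and the sample values are hypothetical and are not taken from Bot-IoT or from the authors' procedure.

```python
def min_max_scale(value, lower, upper):
    """Scale `value` into [0, 1] using fixed bounds supplied by the caller."""
    if upper <= lower:
        raise ValueError("upper bound must be greater than lower bound")
    # Clip first so an out-of-range value cannot leave the [0, 1] interval.
    clipped = min(max(value, lower), upper)
    return (clipped - lower) / (upper - lower)


# Hypothetical feature: a 16-bit field whose valid domain is 0..65535.
DOMAIN_MIN, DOMAIN_MAX = 0, 65535

# Hypothetical training sample that covers only part of that domain.
train_values = [80, 443, 8080]
sample_min, sample_max = min(train_values), max(train_values)

# A value first seen after training, outside the sampled range.
unseen = 50000

by_domain = min_max_scale(unseen, DOMAIN_MIN, DOMAIN_MAX)
by_sample = min_max_scale(unseen, sample_min, sample_max)

print(f"scaled with domain bounds: {by_domain:.3f}")  # keeps its relative position
print(f"scaled with sample bounds: {by_sample:.3f}")  # saturates at the upper bound
```

With bounds taken from the sample, every unseen value above the sample maximum collapses to the same scaled value, which is one way a model can lose information when it is moved to another network.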

The second set, referred to as the Full Set, contains comma-separated values (CSV) files that add up to roughly 73 million instances, which are generated by the Argus network security tool. It is important to note that each instance is representative of a network session. The features of a session represent an aggregation of all the bytes and packets related to a single communication session between two hosts. The Full Set has the fewest total features of any of the processed sets or subsets. It contains 26 independent features and 3 dependent features. The 26 independent features only include the network flow data from Argus and do not contain the 14 additional calculated features developed by Koroniotis et al. [2]. It is also worth noting that the CSV files for the Full Set do not contain a header row, and each file has six columns interspersed in the features that do not contain any data. These details can be important when attempting to convert the data into another format or importing the data into a tool for analysis. Table I shows categories and subcategories of network traffic for the Full Set.

TABLE I
BOT-IOT: FULL SET

Category            Subcategory          No. of Instances
Normal              Normal                          9,543
DoS                 TCP                        12,315,997
DoS                 UDP                        20,659,491
DoS                 HTTP                           29,706
DDoS                TCP                        19,547,603
DDoS                UDP                        18,965,106
DDoS                HTTP                           19,771
Reconnaissance      OS Fingerprinting             358,275
Reconnaissance      Service Scanning            1,463,364
Information Theft   Keylogging                      1,469
Information Theft   Data Exfiltration                 118

The first subset of the Bot-IoT dataset, referred to as the 5% Subset, was provided and recommended by Koroniotis et al. [2] as a smaller and more manageable version. It contains 5% of the instances of the original set, roughly 3.6 million, and is a representative sample of the Full Set in terms of attack category. The 5% Subset contains the most features of any processed set or subset of Bot-IoT, with 43 independent features and 3 dependent features. The 43 independent features contain the Argus network flow features and the additional calculated features. The 5% Subset is divided into five CSV files, each containing its own header row with feature names. Table II shows categories and subcategories of traffic for the 5% Subset.

TABLE II
BOT-IOT: 5% SUBSET AND 10-BEST SUBSET

Category            Subcategory          No. of Instances
Normal              Normal                            477
DoS/DDoS            TCP                         1,593,180
DoS/DDoS            UDP                         1,981,230
DoS/DDoS            HTTP                            2,474
Reconnaissance      OS Fingerprinting              17,914
Reconnaissance      Service Scanning               73,168
Information Theft   Keylogging                         73
Information Theft   Data Exfiltration                   6

The final subset, referred to as the 10-Best Subset, contains the same number of instances as the original 5% Subset (see Table II) but contains only 10 of the original 43 independent features. Each file contains 3 dependent features and 16 independent features, of which 6 serve as session identifiers and 10 are considered by Koroniotis et al. [2] to be the best. The 10-Best features were derived through the mapping of the correlation coefficient and joint entropy of the 43 independent features in the 5% Subset. The 10-Best features were selected based on their rankings. This subset is contained within two CSV files that both have a header row with the feature names.

The Raw Set, Full Set, 5% Subset, and 10-Best Subset contain big data. Big data is defined by specific properties, such as volume, variety, variability, velocity, complexity, and value [8]. These properties may increase the difficulty of the classification task for learners trained on big data. With regard to binary classification of normal and attack instances, Tables I and II show that the datasets are severely imbalanced. Class imbalance arises from a disproportionate number of majority class instances (attack instances in Bot-IoT) and could potentially skew the results of big data analytics [9]. We point out that the class imbalance problem is compounded by big data.

III. FEATURE ANALYSIS

A common goal for any cybersecurity machine learning study is to produce generalizable models that can accurately identify malicious behavior. To this end, it is extremely important to understand the data and its features, so that the models can be trained and tested on valid information. We analyzed all 46 features present in the 5% Subset. This section provides information on the data type, meaning, and utility of the various features. The section is divided into several subsections, including Dependent Features, Independent Features, and Invalid Features.

A. Dependent Features

There are three dependent categorical features in the Bot-IoT dataset: category, subcategory, and attack. All three features are present in each of the sets and subsets with the exception of the Raw Set. Category is a string type feature with five potential values: Normal, DoS, DDoS, Reconnaissance, and Information Theft. These values represent the attack category for a given instance. Subcategory is also a string type feature but with eight potential values: Normal, Transmission Control Protocol (TCP), User Datagram Protocol (UDP), Hypertext Transfer Protocol (HTTP), OS Fingerprinting, Service Scanning, Keylogging, and Data Exfiltration. The subcategory values are used in conjunction with the category value to classify an instance with a more specific type of attack.
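To make the class imbalance reported in Table I concrete, the sketch below tallies the published Full Set counts by (category, subcategory) pair and collapses them into a binary normal/attack split. The counts are copied from Table I. The helper names `attack_label` and `binary_totals` are hypothetical, and the rule that every category other than Normal counts as an attack is an assumption made here for illustration.

```python
# Instance counts copied from Table I (Full Set), keyed by (category, subcategory).
FULL_SET_COUNTS = {
    ("Normal", "Normal"): 9_543,
    ("DoS", "TCP"): 12_315_997,
    ("DoS", "UDP"): 20_659_491,
    ("DoS", "HTTP"): 29_706,
    ("DDoS", "TCP"): 19_547_603,
    ("DDoS", "UDP"): 18_965_106,
    ("DDoS", "HTTP"): 19_771,
    ("Reconnaissance", "OS Fingerprinting"): 358_275,
    ("Reconnaissance", "Service Scanning"): 1_463_364,
    ("Information Theft", "Keylogging"): 1_469,
    ("Information Theft", "Data Exfiltration"): 118,
}


def attack_label(category):
    """Assumed binary label: 0 for Normal, 1 for every other category."""
    return 0 if category == "Normal" else 1


def binary_totals(counts):
    """Sum instance counts per binary label."""
    totals = {0: 0, 1: 0}
    for (category, _subcategory), n in counts.items():
        totals[attack_label(category)] += n
    return totals


totals = binary_totals(FULL_SET_COUNTS)
total_instances = totals[0] + totals[1]
normal_share = totals[0] / total_instances
attack_to_normal = totals[1] / totals[0]

print(f"total instances:      {total_instances:,}")
print(f"normal share:         {normal_share:.4%}")
print(f"attack-to-normal ratio: {attack_to_normal:,.1f} to 1")
```

Keying the counts by the pair rather than by subcategory alone matters because several categories reuse the same subcategory names (TCP, UDP, HTTP), so a subcategory on its own does not identify a row of the table.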

It is important to note that both Denial-of-Service (DoS) and Distributed Denial-of-Service (DDoS) categories share the same values for their subcategories. The final dependent feature is attack, which serves as a binary value indicating that an instance is either an attack, with a value of 1, or normal, with a value of 0.

B. Independent Features

The 5% Subset contains 43 independent features. We determined that 6 of these features are invalid, and in Section III-B3, we provide justification for invalidating these features. Of the remaining 37 features, there are 23 standard features and 14 calculated features. The standard features were generated by the Argus network security tool and are included in the Full Set or can be trivially derived from its features. The calculated features were introduced by Koroniotis et al. [2] as “additional features.” These calculated features are present only in the 5% Subset and the 10-Best Subset and require some form of scripting to be calculated.

1) Standard Features: The standard features contain several categorical features. These include flgs, proto, and state, which are all string data types. Each of the three has a categorical integer equivalent, corresponding to flgs_number, proto_number, and state_number, respectively. The integer counterparts are not included in the Full Set but can be easily generated based on the values of the string features. Both sets of features contain the same categorical information, but with different data types. The features sport and dport are the only other categorical features, and they represent a unique port number that can range from 1 to 65,535. There is the possibility, depending on cleaning procedures, for the values 0 and -1 to also be present. Typically, these values represent protocols that do not use port numbers.

The standard features also include many ratio type features. These include features related to the network session of the instance, such as pkts, bytes, dur, spkts, dpkts, sbytes, dbytes, rate, srate, and drate. Each of these 10 features provides insight into the length and throughput of the session of an instance. The features mean, stddev, sum, min, and max contain aggregated session duration information that is calculated by the Argus network security tool. This data can help models compare the current instance duration with previous instances.

2) Calculated Features: The calculated features are all ratio type features that contain additional aggregated session information not included in the standard features. The features TnBPSrcIP, TnBPDstIP, TnP_PSrcIP, TnP_PDstIP, TnP_PerProto, and TnP_Per_Dport represent a total byte or packet count for a particular Internet Protocol (IP) address or protocol. Similarly, N_IN_Conn_P_DstIP and N_IN_Conn_P_SrcIP represent total connections for a source or destination IP address. The final two calculated totals are Pkts_P_State_P_Protocol_P_DestIP and Pkts_P_State_P_Protocol_P_SrcIP. These two features represent the total number of packets for sessions in the same connection state that have the same source or destination IP address. The other four calculated features are average rates: AR_P_Proto_P_SrcIP, AR_P_Proto_P_DstIP, AR_P_Proto_P_Sport, and AR_P_Proto_P_Dport. The average rate these features refer to is the number of packets divided by the duration of the session.

Since Koroniotis et al. [2] did not provide pseudocode for calculating the features, we performed our own calculations. This process gave us additional insight into the relevant algorithms. The features utilize a window of 100 instances that includes the current instance. The various totals and average rates are calculated and use the current instance as a filter. For example, TnBPSrcIP is the total number of bytes per source IP. It is calculated by summing the bytes feature from every instance within the current window that has the same saddr value as the current instance.

The inclusion of the calculated features makes it possible for a model utilizing only the valid features to properly categorize the DDoS and Reconnaissance attack categories. Without the inclusion of these features, it would be extremely difficult for any model, without some type of memory, to accurately identify these attacks. This is due to the nature of the two attacks. A DDoS attack requires multiple different hosts attacking at or near the same time. A DDoS instance would be impossible to differentiate from a DoS instance if the model has no way of knowing how many concurrent or near-concurrent connections exist in proximity. The Reconnaissance category is similar, in that the primary way of identifying service scanning or OS fingerprinting is by identifying a single host attempting connections on many ports. Without a way of determining the number of connections going to many ports that originate from a single host, it would be difficult, if not impossible, for a model to properly categorize this type of attack.

3) Invalid Features: We established that the following features are invalid: pkSeqID, seq, stime, ltime, saddr, and daddr. These features undermine the efficiency of a predictive model because they contribute to overfitting and limit generalization. We explain further in the next three paragraphs.

As pkSeqID and seq are row or sequence identifiers that only contain ordinal information, we consider them to be invalid. Unlike pkSeqID, seq has not been explicitly defined by Koroniotis et al. [2]. Furthermore, seq has been used in several studies as this feature appears in Koroniotis et al.’s 10-Best Feature set. After consulting the Argus developers and Argus documentation, we discovered seq is a monotonically increasing sequence number that provides synchronization and reliability between Argus readers. Apart from providing ordinal information about the network session, seq does nothing else.

The start packet time and last packet time for each instance are denoted by stime and ltime, respectively. Although using stime in combination with ltime enables other key features to be derived, most of the derived information is already available from dur, rate, srate, and drate, which are more generalizable features than stime and ltime. While stime and ltime might help improve detection of intense attacks, these timestamps may simultaneously degrade proper classification

of normal traffic during those same time windows. This can have very undesirable consequences for business operations. For example, if a major website is experiencing a severe DoS attack, it is not desirable for normal customer traffic to be incorrectly classified as attack traffic and dropped.

The source and destination IP addresses for each instance are designated by saddr and daddr, respectively. We excluded them for two reasons. Firstly, we recognize that private IP addresses are not globally unique and may vary between networks. If a trained model associates a particular activity with a specific private IP address, this association will most likely not generalize for the same IP address in another local network. The second reason relates to an inherent issue with Bot-IoT. When analyzing the 5% Subset, we observed that 100% of the attack traffic contains private IPs for both saddr and daddr, but only 36% of the normal traffic contains private IPs for both saddr and daddr. This creates a problem where 64% of normal instances can be identified because their saddr and daddr features display a public IP.

Although we consider the six features (pkSeqID, seq, stime, ltime, saddr, and daddr) to be invalid, this does not mean they are entirely unusable. Machine learning practitioners may find them beneficial in certain situations, such as for troubleshooting models, and cybersecurity analysts may decide the features are useful for forensics. There are many other examples capable of demonstrating the upside of keeping one or more of these features. However, for the purpose of our study, we believe the six features should be removed from Bot-IoT.

IV. DATA CLEANING

In addition to the six invalid features that should be removed, we discovered other areas of the dataset that need to be cleaned. These additional discoveries are derived solely from working with the 5% Subset, but we believe the issues will be found in all the processed sets and subsets. First, many if not all the instances using Internet Control Message Protocol (ICMP) have a hexadecimal value for the sport and dport features. The ICMP port values should be changed to -1 or 0 to indicate that they do not have a valid port value. Changing the values to -1 would have an additional advantage of matching the values for instances that use Address Resolution Protocol (ARP). Our second observation involves the mislabeling of several instances. Specifically, we are referring to instances that use ARP and are not labeled as normal traffic. Based on the description of each of the attack categories, we determined that ARP does not contribute to the attacks. Therefore, mislabeled instances should be relabeled as normal traffic, or all instances that use ARP should be removed.

V. LITERATURE REVIEW

The primary focus of this section is to group the published works of research that have used the Bot-IoT dataset. Our search for papers concluded on June 20, 2021. The search revealed 47 studies that utilized the Bot-IoT dataset. To the best of our knowledge, these are all the published works that used the Bot-IoT dataset.

A. Studies with Invalid Features

There are several works that utilize the 5% Subset and its 43 independent features. As discussed previously, this set contains all six invalid features (pkSeqID, seq, stime, ltime, saddr, and daddr). However, we assume that these works did not include pkSeqID. We make this assumption because pkSeqID was clearly labeled as a row identifier in the Bot-IoT README file and the original paper [2]. In addition to the original paper, we discovered 12 studies that utilized the full independent feature set. Bhuvaneswari and Selvakumar [10], Alhowaide et al. [11], Abdel-Basset et al. [12], Wiyono and Cahyani [13], Ferrag et al. [14], Zhang et al. [15], Liaqat et al. [16], Pacheco and Sun [17], Aldhaheri et al. [18], Soe et al. [19], Ferrag et al. [20], and Alyasiri et al. [21] all mentioned using the 5% Subset with either an accompanying feature list or feature count. Of note, Bhuvaneswari and Selvakumar, Abdel-Basset et al., Wiyono and Cahyani, Zhang et al., and Aldhaheri et al. used both the 5% Subset and the 10-Best Subset. Since these papers would have used at least one if not all the invalid features, we assume their results are not valid and do not represent a good generalizable model.

Another commonly used feature set was the 10-Best Subset. This subset contains the invalid feature seq, and so we assume each of these papers utilized it for training and testing. We found nine studies that utilized the 10-Best Subset, five [10], [12], [13], [15], [18] of which were previously mentioned, since they also use the 5% Subset. Ibitoye et al. [22], Filus et al. [23], Lawal et al. [24], and Sriram et al. [25] all utilized the 10-Best Subset for training and testing. Since seq is not a valid feature that will produce generalizable models, we assume these nine studies did not produce valid results.

We found that only a single study utilized the 26 independent features present in the Full Set. This set includes all of the invalid features and, if utilized in its entirety, will cause a model to lose its ability to generalize. Just like the studies utilizing the 5% Subset, we assume that pkSeqID was not included. We found that Ferrag and Maglaras [26] utilized this set, and therefore, we assume their results are invalid.

Multiple studies did not use one of the provided sets of features but instead developed their own feature lists. These studies did, however, include at least one of the previously mentioned invalid features, which would have affected the validity of their conclusions. The affected studies are as follows: Oreški and Andročec [27], Alkadi et al. [28], Kumar et al. [29], Al-Zewairi et al. [30], Djenna et al. [31], Churcher et al. [32], Popoola et al. [33], Kumar et al. [34], Kumar et al. [35], and Biswas and Roy [36]. The number of invalid features varied from study to study, but the most common invalid feature was seq.

B. Studies with Unlisted Features

A sizeable portion of the studies captured in our search did not state or list the features used. These studies are Guizani

and Ghafoor [37], Alkadi et al. [38], Shafiq et al. [39], Cheema et al. [40], Mulyanto et al. [41], Bagui and Li [42], Susilo and Sari [43], Koroniotis et al. [44], Dwibedi et al. [45], Shafiq et al. [46], Huong et al. [47], Huong et al. [48], Venugopal et al. [49], Nimbalkar and Kshirsagar [50], and Jithu et al. [51]. Since the exact list of features used is unknown, we cannot make any conclusions about the validity of the results.

Another subset of the studies we reviewed did not utilize any of the processed sets or subsets. Instead, PCAP files from the Raw Set were used to generate features. The studies are Ge et al. [52], Costa et al. [53], and Ge et al. [54]. These studies did not use the Argus network security tool, which was used in the original study, and therefore generated completely different features. Because we do not have access to the data generated by these studies, we cannot make any conclusions about the validity of the results.

C. Studies with Only Valid Features

Two studies clearly used only valid features from Bot-IoT. These works are Demirpolat et al. [55] and Soe et al. [56]. Demirpolat et al. used 16 features comprised of the standard Argus network data with the exclusion of session identification data, timestamps, and the seq feature. Soe et al. used a smaller feature set, comprised of eight features, that consisted primarily of the additional features developed by Koroniotis et al. [2].

In their work, Demirpolat et al. evaluated an ensemble of prototypical networks [57] and Support Vector Machines (SVMs) [58] against four other models: Convolutional Neural Network (CNN) [59], SVMs, Naive Bayes [60], and deep autoencoders [61]. The proposed ensemble model uses few-shot learning [62] to compensate for training machine learning models with limited data. All models were trained on three different datasets: Bot-IoT, UNSW-NB15 [63], and a software-defined networking [64] customized set. The models were implemented with Scikit-learn [65], Keras [66], and PyTorch [67]. To address class imbalance, the number of instances in each category of Bot-IoT, except the Normal and Information Theft categories, was down-sampled to 20,000. The instances for Normal and Information Theft were not down-sampled because their numbers are small in comparison. For the training set, Demirpolat et al. randomly selected 100, 400, 800, and 1,000 instances. Also, the researchers randomly selected 100 instances for the validation set and used the remaining instances from the down-sampled dataset as the test set. With regard to accuracy, precision, recall, and F-measure for Bot-IoT, the ensemble model outperformed the other models by roughly 20%. The highest Bot-IoT F-measure score for the ensemble was 96% for the DDoS category. One shortcoming of this work is the relatively small sizes of the training sets. A small sample size can lead to a model with high variance.

For their research, Soe et al. proposed a feature selection algorithm that was implemented on a Raspberry Pi device. The algorithm uses correlation-based feature selection [68] and, for the calculation of the final feature set, relies on the gain-ratio [69] metric. Soe et al. utilized several tree-based classifiers, namely random forest [70], decision tree [71], Hoeffding tree [72], and logistic model tree [73]. All 477 normal instances from the 5% Subset were selected. From the attack categories, the researchers randomly selected 81,977 instances from DDoS, 82,060 from Reconnaissance, and 556 from Information Theft. The dataset was then split in a ratio of 66:34 for training and testing. In general, the results show that using the proposed selection algorithm significantly decreases the number of features without impacting classifier performance. Among the four models, decision tree performed the best, with top scores of 100% precision for DDoS and Reconnaissance, 99.99% F-measure for DDoS and Reconnaissance, 0% false positive rate for DDoS and Reconnaissance, and 99.40% true positive rate for DDoS. The fact that all the classifiers are tree-based is a limitation of this study. Classifier diversity adds credence to the results of a machine learning study.

VI. CONCLUSION

Data analysis is one of the most crucial aspects of the machine learning process. We emphasize that invalid features and/or poor data cleaning practices can lead to non-generalizable models and invalid performance scores. Bot-IoT is an intrusion detection dataset used to train models to detect various botnet attacks in IoT networks. Based on our data analysis of Bot-IoT, we discovered several invalid features. These features were highly utilized in several studies that we reviewed, with over 50% of the papers using one or more of the invalid features. For future studies that incorporate Bot-IoT, we recommend utilizing only valid features and adopting an effective data cleaning procedure.

ACKNOWLEDGMENTS

We would like to thank the reviewers in the Data Mining and Machine Learning Laboratory at Florida Atlantic University. Additionally, we acknowledge partial support by the NSF (CNS-1427536). Opinions, findings, conclusions, or recommendations in this paper are the authors’ and do not reflect the views of the NSF.

REFERENCES

[1] J. L. Leevy and T. M. Khoshgoftaar, “A survey and analysis of intrusion detection models based on CSE-CIC-IDS2018 big data,” Journal of Big Data, vol. 7, no. 1, pp. 1–19, 2020.
[2] N. Koroniotis, N. Moustafa, E. Sitnikova, and B. Turnbull, “Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: Bot-IoT dataset,” Future Generation Computer Systems, vol. 100, pp. 779–796, 2019.
[3] B. B. Zarpelão, R. S. Miani, C. T. Kawakani, and S. C. de Alvarenga, “A survey of intrusion detection in internet of things,” Journal of Network and Computer Applications, vol. 84, pp. 25–37, 2017.
[4] T. O. Foundation, “Node-red: Low-code programming for event-driven applications.” https://nodered.org/.
[5] Argus, “Argus,” https://openargus.org/.
[6] Wireshark, “Wireshark. go deep.” https://www.wireshark.org/.
[7] T. Z. Project, “The zeek network security monitor,” https://zeek.org//.
[8] J. L. Leevy, T. M. Khoshgoftaar, R. A. Bauder, and N. Seliya, “A survey on addressing high-class imbalance in big data,” Journal of Big Data, vol. 5, no. 1, p. 42, 2018.

[9] R. Zuech, J. Hancock, and T. M. Khoshgoftaar, “Detecting web attacks using random undersampling and ensemble learners,” Journal of Big Data, vol. 8, no. 1, pp. 1–20, 2021.
[10] B. A. NG and S. Selvakumar, “Anomaly detection framework for internet of things traffic using vector convolutional deep learning approach in fog environment,” Future Generation Computer Systems, vol. 113, pp. 255–265, 2020.
[11] A. Alhowaide, I. Alsmadi, and J. Tang, “Pca, random-forest and pearson correlation for dimensionality reduction in iot ids,” in 2020 IEEE International IOT, Electronics and Mechatronics Conference (IEMTRONICS). IEEE, 2020, pp. 1–6.
[12] M. Abdel-Basset, V. Chang, H. Hawash, R. K. Chakrabortty, and M. Ryan, “Deep-ifs: Intrusion detection approach for iiot traffic in fog environment,” IEEE Transactions on Industrial Informatics, 2020.
[13] R. T. Wiyono and N. D. W. Cahyani, “Performance analysis of decision tree c4.5 as a classification technique to conduct network forensics for botnet activities in internet of things,” in 2020 International Conference on Data Science and Its Applications (ICoDSA). IEEE, 2020, pp. 1–5.
[14] M. A. Ferrag, L. Maglaras, A. Ahmim, M. Derdour, and H. Janicke, “Rdtids: Rules and decision tree-based intrusion detection system for internet-of-things networks,” Future internet, vol. 12, no. 3, p. 44, 2020.
[15] Y. Zhang, J. Xu, Z. Wang, R. Geng, K.-K. R. Choo, J. A. Pérez-Díaz, and D. Zhu, “Efficient and intelligent attack detection in software defined iot networks,” in 2020 IEEE International Conference on Embedded Software and Systems (ICESS). IEEE, 2020, pp. 1–9.
[16] S. Liaqat, A. Akhunzada, F. S. Shaikh, A. Giannetsos, and M. A. Jan, “Sdn orchestration to combat evolving cyber threats in internet of medical things (iomt),” Computer Communications, vol. 160, pp. 697–705, 2020.
[17] Y. Pacheco and W. Sun, “Adversarial machine learning: A comparative study on contemporary intrusion detection datasets,” in Proceedings of the 7th International Conference on Information Systems Security and Privacy, vol. 1, 2021, pp. 160–171.
[18] S. Aldhaheri, D. Alghazzawi, L. Cheng, B. Alzahrani, and A. Al-Barakati, “Deepdca: novel network-based detection of iot attacks using artificial immune system,” Applied Sciences, vol. 10, no. 6, p. 1909, 2020.
[19] Y. N. Soe, P. I. Santosa, and R. Hartanto, “Ddos attack detection based on simple ann with smote for iot environment,” in 2019 Fourth International Conference on Informatics and Computing (ICIC). IEEE, 2019, pp. 1–5.
[20] M. A. Ferrag, L. Maglaras, S. Moschoyiannis, and H. Janicke, “Deep learning for cyber security intrusion detection: Approaches, datasets, and comparative study,” Journal of Information Security and Applications, vol. 50, p. 102419, 2020.
[21] H. Alyasiri, J. A. Clark, A. Malik, and R. de Fréin, “Grammatical evolution for detecting cyberattacks in internet of things environments.”
[22] O. Ibitoye, O. Shafiq, and A. Matrawy, “Analyzing adversarial attacks against deep learning for intrusion detection in iot networks,” in 2019 IEEE Global Communications Conference (GLOBECOM). IEEE, 2019, pp. 1–6.
[23] K. Filus, J. Domańska, and E. Gelenbe, “Random neural network for lightweight attack detection in the iot,” in Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems. Springer, 2020, pp. 79–91.
[24] M. A. Lawal, R. A. Shaikh, and S. R. Hassan, “An anomaly mitigation framework for iot using fog computing,” Electronics, vol. 9, no. 10, p. 1565, 2020.
[25] S. Sriram, R. Vinayakumar, M. Alazab, and K. Soman, “Network flow based iot botnet attack detection using deep learning,” in IEEE INFOCOM 2020-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE, 2020, pp. 189–194.
[26] M. A. Ferrag and L. Maglaras, “Deepcoin: A novel deep learning and blockchain-based energy exchange framework for smart grids,” IEEE Transactions on Engineering Management, vol. 67, no. 4, pp. 1285–1297, 2019.
[27] D. Oreški and D. Andročec, “Genetic algorithm and artificial neural network for network forensic analytics,” in 2020 43rd International Convention on Information, Communication and Electronic Technology (MIPRO). IEEE, pp. 1200–1205.
[28] O. AlKadi, N. Moustafa, B. Turnbull, and K.-K. R. Choo, “Mixture localization-based outliers models for securing data migration in cloud centers,” IEEE Access, vol. 7, pp. 114 607–114 618, 2019.
[29] P. Kumar, G. P. Gupta, and R. Tripathi, “Toward design of an intelligent cyber attack detection system using hybrid feature reduced approach for iot networks,” Arabian Journal for Science and Engineering, vol. 46, no. 4, pp. 3749–3778, 2021.
[30] M. Al-Zewairi, S. Almajali, and M. Ayyash, “Unknown security attack detection using shallow and deep ann classifiers,” Electronics, vol. 9, no. 12, p. 2006, 2020.
[31] A. Djenna, D. E. Saidouni, and W. Abada, “A pragmatic cybersecurity strategies for combating iot-cyberattacks,” in 2020 International Symposium on Networks, Computers and Communications (ISNCC). IEEE, 2020, pp. 1–6.
[32] A. Churcher, R. Ullah, J. Ahmad, F. Masood, M. Gogate, F. Alqahtani, B. Nour, W. J. Buchanan et al., “An experimental analysis of attack classification using machine learning in iot networks,” Sensors, vol. 21, no. 2, p. 446, 2021.
[33] S. I. Popoola, B. Adebisi, M. Hammoudeh, G. Gui, and H. Gacanin, “Hybrid deep learning for botnet attack detection in the internet of things networks,” IEEE Internet of Things Journal, 2020.
[34] P. Kumar, G. P. Gupta, and R. Tripathi, “Tp2sf: A trustworthy privacy-preserving secured framework for sustainable smart cities by leveraging blockchain and machine learning,” Journal of Systems Architecture, vol. 115, p. 101954, 2021.
[35] P. Kumar, R. Kumar, G. P. Gupta, and R. Tripathi, “A distributed framework for detecting ddos attacks in smart contract-based blockchain-iot systems by leveraging fog computing,” Transactions on Emerging Telecommunications Technologies, p. e4112, 2020.
[36] R. Biswas and S. Roy, “Botnet traffic identification using neural networks,” Multimedia Tools and Applications, pp. 1–25, 2021.
[37] N. Guizani and A. Ghafoor, “A network function virtualization system for detecting malware in large iot based networks,” IEEE Journal on Selected Areas in Communications, vol. 38, no. 6, pp. 1218–1228, 2020.
[38] O. Alkadi, N. Moustafa, B. Turnbull, and K.-K. R. Choo, “A deep blockchain framework-enabled collaborative intrusion detection for protecting iot and cloud networks,” IEEE Internet of Things Journal, 2020.
[39] M. Shafiq, Z. Tian, A. K. Bashir, X. Du, and M. Guizani, “Corrauc: a malicious bot-iot traffic detection method in iot network using machine learning techniques,” IEEE Internet of Things Journal, 2020.
[40] M. A. Cheema, H. K. Qureshi, C. Chrysostomou, and M. Lestas, “Utilizing blockchain for distributed machine learning based intrusion detection in internet of things,” in 2020 16th International Conference on Distributed Computing in Sensor Systems (DCOSS). IEEE, 2020, pp. 429–435.
[41] M. Mulyanto, M. Faisal, S. W. Prakosa, and J.-S. Leu, “Effectiveness of focal loss for minority classification in network intrusion detection systems,” Symmetry, vol. 13, no. 1, p. 4, 2021.
[42] S. Bagui and K. Li, “Resampling imbalanced data for network intrusion detection datasets,” Journal of Big Data, vol. 8, no. 1, pp. 1–41, 2021.
[43] B. Susilo and R. F. Sari, “Intrusion detection in iot networks using deep learning algorithm,” Information, vol. 11, no. 5, p. 279, 2020.
[44] N. Koroniotis, N. Moustafa, and E. Sitnikova, “A new network forensic framework based on deep learning for internet of things networks: A particle deep framework,” Future Generation Computer Systems, vol. 110, pp. 91–106, 2020.
[45] S. Dwibedi, M. Pujari, and W. Sun, “A comparative study on contemporary intrusion detection datasets for machine learning research,” in 2020 IEEE International Conference on Intelligence and Security Informatics (ISI). IEEE, 2020, pp. 1–6.
[46] M. Shafiq, Z. Tian, Y. Sun, X. Du, and M. Guizani, “Selection of effective machine learning algorithm and bot-iot attacks traffic identification for internet of things in smart city,” Future Generation Computer Systems, vol. 107, pp. 433–442, 2020.
[47] T. T. Huong, T. P. Bac, D. M. Long, B. D. Thang, N. T. Binh, T. D. Luong, and T. K. Phuc, “Lockedge: Low-complexity cyberattack detection in iot edge computing,” IEEE Access, vol. 9, pp. 29 696–29 710, 2021.
[48] T. T. Huong, T. P. Bac, D. M. Long, B. D. Thang, T. D. Luong, and N. T. Binh, “An efficient low complexity edge-cloud framework for security in iot networks,” in 2020 IEEE Eighth International Conference on Communications and Electronics (ICCE). IEEE, 2021, pp. 533–539.
[49] S. Venugopal, G. W. Sathianesan, and R. Rengaswamy, “Cyber forensic framework for big data analytics using sunflower jaya optimization-based deep stacked autoencoder,” International Journal of Numerical Modelling: Electronic Networks, Devices and Fields, p. e2892, 2021.

[50] P. Nimbalkar and D. Kshirsagar, “Feature selection for intrusion detection system in internet-of-things (iot),” ICT Express, vol. 7, no. 2, pp. 177–181, 2021.
[51] P. Jithu, J. Shareena, A. Ramdas, and A. Haripriya, “Intrusion detection system for iot botnet attacks using deep learning,” SN Computer Science, vol. 2, no. 3, pp. 1–8, 2021.
[52] M. Ge, N. F. Syed, X. Fu, Z. Baig, and A. Robles-Kelly, “Towards a deep learning-driven intrusion detection approach for internet of things,” Computer Networks, vol. 186, p. 107784, 2021.
[53] W. L. Costa, M. M. Silveira, T. de Araujo, and R. L. Gomes, “Improving ddos detection in iot networks through analysis of network traffic characteristics,” in 2020 IEEE Latin-American Conference on Communications (LATINCOM). IEEE, 2020, pp. 1–6.
[54] M. Ge, X. Fu, N. Syed, Z. Baig, G. Teo, and A. Robles-Kelly, “Deep learning-based intrusion detection for iot networks,” in 2019 IEEE 24th Pacific Rim International Symposium on Dependable Computing (PRDC). IEEE, 2019, pp. 256–25 609.
[55] A. Demirpolat, A. K. Sarica, and P. Angin, “Protédge: A few-shot ensemble learning approach to software-defined networking-assisted edge security,” Transactions on Emerging Telecommunications Technologies, p. e4138, 2020.
[56] Y. N. Soe, Y. Feng, P. I. Santosa, R. Hartanto, and K. Sakurai, “Towards a lightweight detection system for cyber attacks in the iot environment using corresponding features,” Electronics, vol. 9, no. 1, p. 144, 2020.
[57] T. Gao, X. Han, Z. Liu, and M. Sun, “Hybrid attention-based prototypical networks for noisy few-shot relation classification,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 6407–6414.
[58] R. A. Bauder and T. M. Khoshgoftaar, “The detection of medicare fraud using machine learning methods with excluded provider labels,” in The Thirty-First International Flairs Conference, 2018.
[59] C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” Journal of Big Data, vol. 6, no. 1, pp. 1–48, 2019.
[60] R. A. Bauder, T. M. Khoshgoftaar, A. Richter, and M. Herland, “Predicting medical provider specialties to detect anomalous insurance claims,” in 2016 IEEE 28th international conference on tools with artificial intelligence (ICTAI). IEEE, 2016, pp. 784–790.
[61] Z. Salekshahrezaee, J. L. Leevy, and T. M. Khoshgoftaar, “A reconstruction error-based framework for label noise detection,” Journal of Big Data, vol. 8, no. 1, pp. 1–16, 2021.
[62] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, “Generalizing from a few examples: A survey on few-shot learning,” ACM Computing Surveys (CSUR), vol. 53, no. 3, pp. 1–34, 2020.
[63] N. Moustafa and J. Slay, “Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set),” in 2015 military communications and information systems conference (MilCIS). IEEE, 2015, pp. 1–6.
[64] D. Kreutz, F. M. Ramos, P. E. Verissimo, C. E. Rothenberg, S. Azodolmolky, and S. Uhlig, “Software-defined networking: A comprehensive survey,” Proceedings of the IEEE, vol. 103, no. 1, pp. 14–76, 2014.
[65] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in python,” Journal of machine learning research, vol. 12, no. Oct, pp. 2825–2830, 2011.
[66] F. Chollet et al., “Keras,” https://github.com/fchollet/keras, 2015.
[67] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” arXiv preprint arXiv:1912.01703, 2019.
[68] M. A. Hall, “Correlation-based feature selection for machine learning,” 1999.
[69] J. L. Leevy, J. Hancock, R. Zuech, and T. M. Khoshgoftaar, “Detecting cybersecurity attacks across different network features and learners,” Journal of Big Data, vol. 8, no. 1, pp. 1–29, 2021.
[70] V. M. Herrera, T. M. Khoshgoftaar, F. Villanustre, and B. Furht, “Random forest implementation and optimization for big data analytics on lexisnexis’s high performance computing cluster platform,” Journal of Big Data, vol. 6, no. 1, pp. 1–36, 2019.
[71] N. Seliya and T. M. Khoshgoftaar, “The use of decision trees for cost-sensitive classification: an empirical study in software quality prediction,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1, no. 5, pp. 448–459, 2011.
[72] C. Manapragada, G. I. Webb, and M. Salehi, “Extremely fast decision tree,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1953–1962.
[73] M. Sumner, E. Frank, and M. Hall, “Speeding up logistic model tree induction,” in European conference on principles of data mining and knowledge discovery. Springer, 2005, pp. 675–683.

APPENDIX

TABLE III
FEATURES, DESCRIPTIONS, AND SETS

Feature | Description | Full Set | 5% | 10-Best
pkSeqID | Row Identifier | YES | YES | YES (a)
stime | Record start time | YES | YES | NO
flgs | Flow state flags seen in transactions | YES | YES | NO
flgs_number | Numerical representation of feature flags | NO | YES | NO
proto | Textual representation of transaction protocol... | YES | YES | YES (a)
proto_number | Numerical representation of feature proto | NO | YES | NO
saddr | Source IP address | YES | YES | YES (a)
sport | Source port number | YES | YES | YES (a)
daddr | Destination IP address | YES | YES | YES (a)
dport | Destination port number | YES | YES | YES (a)
pkts | Total count of packets in transaction | YES | YES | NO
bytes | Total number of bytes in transaction | YES | YES | NO
state | Transaction state | YES | YES | NO
state_number | Numerical representation of feature state | NO | YES | NO
ltime | Record last time | YES | YES | NO
seq | Argus sequence number | YES | YES | YES
dur | Record total duration | YES | YES | NO
mean | Average duration of aggregated records | YES | YES | YES
stddev | Standard deviation of aggregated records | YES | YES | YES
sum | Total duration of aggregated records | YES | YES | NO
min | Minimum duration of aggregated records | YES | YES | YES
max | Maximum duration of aggregated records | YES | YES | YES
spkts | Source-to-destination packet count | YES | YES | NO
dpkts | Destination-to-source packet count | YES | YES | NO
sbytes | Source-to-destination byte count | YES | YES | NO
dbytes | Destination-to-source byte count | YES | YES | NO
rate | Total packets per second in transaction | YES | YES | NO
srate | Source-to-destination packets per second | YES | YES | YES
drate | Destination-to-source packets per second | YES | YES | YES
attack (b) | Class label: 0 for Normal traffic, 1 for Attac... | YES | YES | YES
category (b) | Traffic category | YES | YES | YES
subcategory (b) | Traffic subcategory | YES | YES | YES
TnBPSrcIP | Total Number of bytes per source IP | NO | YES | NO
TnBPDstIP | Total Number of bytes per Destination IP | NO | YES | NO
TnP_PSrcIP | Total Number of packets per source IP | NO | YES | NO
TnP_PDstIP | Total Number of packets per Destination IP | NO | YES | NO
TnP_PerProto | Total Number of packets per protocol | NO | YES | NO
TnP_Per_Dport | Total Number of packets per dport | NO | YES | NO
AR_P_Proto_P_SrcIP | Average rate per protocol per Source IP (calcu... | NO | YES | NO
AR_P_Proto_P_DstIP | Average rate per protocol per Destination IP | NO | YES | NO
N_IN_Conn_P_SrcIP | Number of inbound connections per source IP | NO | YES | YES
N_IN_Conn_P_DstIP | Number of inbound connections per destination IP | NO | YES | YES
AR_P_Proto_P_Sport | Average rate per protocol per sport | NO | YES | NO
AR_P_Proto_P_Dport | Average rate per protocol per dport | NO | YES | NO
Pkts_P_State_P_Protocol_P_DestIP | Number of packets grouped by state of flows an... | NO | YES | NO
Pkts_P_State_P_Protocol_P_SrcIP | Number of packets grouped by state of flows an... | NO | YES | NO

(a) Included for network session identification only
(b) Dependent feature
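
As an illustration of how the sets in Table III might be used, the following minimal sketch separates the independent features marked YES in the 10-Best column from the session-identification columns (footnote a) and the dependent features (footnote b). It is not taken from the reviewed studies. The underscored column names are an assumption about how the features are spelled in the distributed CSV files, the helper name split_features_and_label is hypothetical, and the two data rows are made up purely to show the call. Note that the table as reproduced here marks nine independent features in the 10-Best column.

```python
import csv
import io

# Footnote (a): kept only to identify the network session, not as model input.
SESSION_ID_COLUMNS = ["pkSeqID", "proto", "saddr", "sport", "daddr", "dport"]

# Independent features marked YES (without a footnote) in the 10-Best column.
# Assumption: underscores as in the distributed CSV headers.
BEST_FEATURES = [
    "seq", "mean", "stddev", "min", "max", "srate", "drate",
    "N_IN_Conn_P_SrcIP", "N_IN_Conn_P_DstIP",
]

# Footnote (b): dependent features (labels).
LABEL_COLUMNS = ["attack", "category", "subcategory"]


def split_features_and_label(rows, label="attack"):
    """Split parsed CSV rows into numeric feature vectors and one label column.

    Hypothetical helper: `rows` is any iterable of dicts keyed by column name,
    such as the rows produced by csv.DictReader.
    """
    if label not in LABEL_COLUMNS:
        raise ValueError(f"unknown label column: {label!r}")
    features, labels = [], []
    for row in rows:
        features.append([float(row[name]) for name in BEST_FEATURES])
        labels.append(row[label])
    return features, labels


# Toy CSV with two made-up rows; the values are NOT taken from Bot-IoT.
toy_csv = io.StringIO(newline="")
writer = csv.writer(toy_csv)
writer.writerow(SESSION_ID_COLUMNS + BEST_FEATURES + LABEL_COLUMNS)
writer.writerow([1, "tcp", "10.0.0.1", 80, "10.0.0.2", 8080]
                + [0.5] * len(BEST_FEATURES) + [0, "Normal", "Normal"])
writer.writerow([2, "udp", "10.0.0.3", 53, "10.0.0.2", 53]
                + [1.5] * len(BEST_FEATURES) + [1, "DDoS", "UDP"])
toy_csv.seek(0)

X, y = split_features_and_label(csv.DictReader(toy_csv))
```

Passing label="category" or label="subcategory" would select one of the other dependent features instead. Keeping the session-identification columns out of the feature vectors reflects footnote a, which states that those columns are included for network session identification only.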
