A Review and Analysis of The Bot-IoT Dataset
Abstract—Machine learning is rapidly changing the cybersecurity landscape. The use of predictive models to detect malicious activity and identify inscrutable attack patterns is providing

model to maintain its ability to generalize. The generalizability of a model can also be affected during the feature selection process. Features that contain environment- or dataset-specific
2021 IEEE International Conference on Service-Oriented System Engineering (SOSE) | 978-1-6654-3477-5/21/$31.00 ©2021 IEEE | DOI: 10.1109/SOSE52839.2021.00007
Authorized licensed use limited to: Consortium - Algeria (CERIST). Downloaded on October 04,2022 at 16:07:41 UTC from IEEE Xplore. Restrictions apply.
TABLE I
BOT-IOT: FULL SET

TABLE II
BOT-IOT: 5% SUBSET AND 10-BEST SUBSET
It is important to note that both Denial-of-Service (DoS) and Distributed Denial-of-Service (DDoS) categories share the same values for their subcategories. The final dependent feature is attack, which serves as a binary value indicating that an instance is either an attack, with a value of 1, or normal, with a value of 0.

B. Independent Features

The 5% Subset contains 43 independent features. We determined that 6 of these features are invalid, and in Section III-B3, we provide justification for invalidating these features. Of the remaining 37 features, there are 23 standard features and 14 calculated features. The standard features were generated by the Argus network security tool and are included in the Full Set or can be trivially derived from its features. The calculated features were introduced by Koroniotis et al. [2] as “additional features.” These calculated features are present only in the 5% Subset and the 10-Best Subset and require some form of scripting to be calculated.

1) Standard Features: The standard features contain several categorical features. These include flgs, proto, and state, which are all string data types. Each of the three has a categorical integer equivalent, corresponding to flgs_number, proto_number, and state_number, respectively. The integer counterparts are not included in the Full Set but can be easily generated based on the values of the string features. Both sets of features contain the same categorical information, but with different data types. The features sport and dport are the only other categorical features, and they represent a unique port number that can range from 1 to 65,535. Depending on cleaning procedures, the values 0 and -1 may also be present. Typically, these values represent protocols that do not use port numbers.

The standard features also include many ratio-type features. These include features related to the network session of the instance, such as pkts, bytes, dur, spkts, dpkts, sbytes, dbytes, rate, srate, and drate. Each of these 10 features provides insight into the length and throughput of the session of an instance. The features mean, stddev, sum, min, and max contain aggregated session duration information that is calculated by the Argus network security tool. This data can help models compare the current instance duration with previous instances.

2) Calculated Features: The calculated features are all ratio-type features that contain additional aggregated session information not included in the standard features. The features TnBPSrcIP, TnBPDstIP, TnP_PSrcIP, TnP_PDstIP, TnP_PerProto, and TnP_Per_Dport represent a total byte or packet count for a particular Internet Protocol (IP) address or protocol. Similarly, N_IN_Conn_P_DstIP and N_IN_Conn_P_SrcIP represent total connections for a source or destination IP address. The final two calculated totals are Pkts_P_State_P_Protocol_P_DestIP and Pkts_P_State_P_Protocol_P_SrcIP. These two features represent the total number of packets for sessions in the same connection state that have the same source or destination IP address. The other four calculated features are average rates: AR_P_Proto_P_SrcIP, AR_P_Proto_P_DstIP, AR_P_Proto_P_Sport, and AR_P_Proto_P_Dport. The average rate these features refer to is the number of packets divided by the duration of the session.

Since Koroniotis et al. [2] did not provide pseudocode for calculating the features, we performed our own calculations. This process gave us additional insight into the relevant algorithms. The features utilize a window of 100 instances that includes the current instance. The various totals and average rates are calculated over this window, using the current instance as a filter. For example, TnBPSrcIP is the total number of bytes per source IP. It is calculated by summing the bytes feature from every instance within the current window that has the same saddr value as the current instance.

The inclusion of the calculated features makes it possible for a model utilizing only the valid features to properly categorize the DDoS and Reconnaissance attack categories. Without the inclusion of these features, it would be extremely difficult for any model, without some type of memory, to accurately identify these attacks. This is due to the nature of the two attacks. A DDoS attack requires multiple different hosts attacking at or near the same time. A DDoS instance would be impossible to differentiate from a DoS instance if the model has no way of knowing how many concurrent or near-concurrent connections exist in proximity. The Reconnaissance category is similar, in that the primary way of identifying service scanning or OS fingerprinting is to detect a single host attempting connections on many ports. Without a way of determining the number of connections going to many ports that originate from a single host, it would be difficult, if not impossible, for a model to properly categorize this type of attack.

3) Invalid Features: We established that the following features are invalid: pkSeqID, seq, stime, ltime, saddr, and daddr. These features undermine the efficiency of a predictive model because they contribute to overfitting and limit generalization. We explain further in the next three paragraphs.

As pkSeqID and seq are row or sequence identifiers that only contain ordinal information, we consider them to be invalid. Unlike pkSeqID, seq has not been explicitly defined by Koroniotis et al. [2]. Furthermore, seq has been used in several studies as this feature appears in Koroniotis et al.’s 10-Best Feature set. After consulting the Argus developers and Argus documentation, we discovered seq is a monotonically increasing sequence number that provides synchronization and reliability between Argus readers. Apart from providing ordinal information about the network session, seq does nothing else.

The start packet time and last packet time for each instance are denoted by stime and ltime, respectively. Although using stime in combination with ltime enables other key features to be derived, most of the derived information is already available from dur, rate, srate, and drate, which are more generalizable features than stime and ltime. While stime and ltime might help improve detection of intense attacks, these timestamps may simultaneously degrade proper classification
of normal traffic during those same time windows. This can have very undesirable consequences for business operations. For example, if a major website is experiencing a severe DoS attack, it is not desirable for normal customer traffic to be incorrectly classified as attack traffic and dropped.

The source and destination IP addresses for each instance are designated by saddr and daddr, respectively. We excluded them for two reasons. First, we recognize that private IP addresses are not globally unique and may vary between networks. If a trained model associates a particular activity with a specific private IP address, this association will most likely not generalize for the same IP address in another local network. The second reason relates to an inherent issue with Bot-IoT. When analyzing the 5% Subset, we observed that 100% of the attack traffic contains private IPs for both saddr and daddr, but only 36% of the normal traffic contains private IPs for both saddr and daddr. This creates a problem where 64% of normal instances can be identified because their saddr and daddr features display a public IP.

Although we consider the six features (pkSeqID, seq, stime, ltime, saddr, and daddr) to be invalid, this does not mean they are entirely unusable. Machine learning practitioners may find them beneficial in certain situations, such as for troubleshooting models, and cybersecurity analysts may decide the features are useful for forensics. There are many other examples capable of demonstrating the upside of keeping one or more of these features. However, for the purpose of our study, we believe the six features should be removed from Bot-IoT.

IV. DATA CLEANING

In addition to the six invalid features that should be removed, we discovered other areas of the dataset that need to be cleaned. These additional discoveries are derived solely from working with the 5% Subset, but we believe the issues will be found in all the processed sets and subsets. First, many, if not all, of the instances using Internet Control Message Protocol (ICMP) have a hexadecimal value for the sport and dport features. The ICMP port values should be changed to -1 or 0 to indicate that they do not have a valid port value. Changing the values to -1 would have an additional advantage of matching the values for instances that use Address Resolution Protocol (ARP). Our second observation involves the mislabeling of several instances. Specifically, we are referring to instances that use ARP and are not labeled as normal traffic. Based on the description of each of the attack categories, we determined that ARP does not contribute to the attacks. Therefore, mislabeled instances should be relabeled as normal traffic, or all instances that use ARP should be removed.

V. LITERATURE REVIEW

The primary focus of this section is to group the published works of research that have used the Bot-IoT dataset. Our search for papers concluded on June 20, 2021. The search revealed 47 studies that utilized the Bot-IoT dataset. To the best of our knowledge, these are all the published works that used the Bot-IoT dataset.

A. Studies with Invalid Features

There are several works that utilize the 5% Subset and its 43 independent features. As discussed previously, this set contains all six invalid features (pkSeqID, seq, stime, ltime, saddr, and daddr). However, we assume that these works did not include pkSeqID. We make this assumption because pkSeqID was clearly labeled as a row identifier in the Bot-IoT README file and the original paper [2]. In addition to the original paper, we discovered 12 studies that utilized the full independent feature set. Bhuvaneswari and Selvakumar [10], Alhowaide et al. [11], Abdel-Basset et al. [12], Wiyono and Cahyani [13], Ferrag et al. [14], Zhang et al. [15], Liaqat et al. [16], Pacheco and Sun [17], Aldhaheri et al. [18], Soe et al. [19], Ferrag et al. [20], and Alyasiri et al. [21] all mentioned using the 5% Subset with either an accompanying feature list or feature count. Of note, Bhuvaneswari and Selvakumar, Abdel-Basset et al., Wiyono and Cahyani, Zhang et al., and Aldhaheri et al. used both the 5% Subset and the 10-Best Subset. Since these papers would have used at least one, if not all, of the invalid features, we assume their results are not valid and do not represent a good generalizable model.

Another commonly used feature set was the 10-Best Subset. This subset contains the invalid feature seq, and so we assume each of these papers utilized it for training and testing. We found nine studies that utilized the 10-Best Subset, five [10], [12], [13], [15], [18] of which were previously mentioned, since they also use the 5% Subset. Ibitoye et al. [22], Filus et al. [23], Lawal et al. [24], and Sriram et al. [25] all utilized the 10-Best Subset for training and testing. Since seq is not a valid feature that will produce generalizable models, we assume these nine studies did not produce valid results.

We found that only a single study utilized the 26 independent features present in the Full Set. This set includes all of the invalid features and, if utilized in its entirety, will cause a model to lose its ability to generalize. Just like the studies utilizing the 5% Subset, we assume that pkSeqID was not included. We found that Ferrag and Maglaras [26] utilized this set, and therefore, we assume their results are invalid.

Multiple studies did not use one of the provided sets of features but instead developed their own feature lists. These studies did, however, include at least one of the previously mentioned invalid features, which would have affected the validity of their conclusions. The affected studies are as follows: Oreški and Andročec [27], Alkadi et al. [28], Kumar et al. [29], Al-Zewairi et al. [30], Djenna et al. [31], Churcher et al. [32], Popoola et al. [33], Kumar et al. [34], Kumar et al. [35], and Biswas and Roy [36]. The number of invalid features varied from study to study, but the most common invalid feature was seq.

B. Studies with Unlisted Features

A sizeable portion of the studies captured in our search did not state or list the features used. These studies are Guizani
and Ghafoor [37], Alkadi et al. [38], Shafiq et al. [39], Cheema et al. [40], Mulyanto et al. [41], Bagui and Li [42], Susilo and Sari [43], Koroniotis et al. [44], Dwibedi et al. [45], Shafiq et al. [46], Huong et al. [47], Huong et al. [48], Venugopal et al. [49], Nimbalkar and Kshirsagar [50], and Jithu et al. [51]. Since the exact list of features used is unknown, we cannot make any conclusions about the validity of the results.

Another subset of the studies we reviewed did not utilize any of the processed sets or subsets. Instead, PCAP files from the Raw Set were used to generate features. The studies are Ge et al. [52], Costa et al. [53], and Ge et al. [54]. These studies did not use the Argus network security tool, which was used in the original study, and therefore generated completely different features. Because we do not have access to the data generated by these studies, we cannot make any conclusions about the validity of the results.

C. Studies with Only Valid Features

Two studies clearly used only valid features from Bot-IoT. These works are Demirpolat et al. [55] and Soe et al. [56]. Demirpolat et al. used 16 features comprised of the standard Argus network data with the exclusion of session identification data, timestamps, and the seq feature. Soe et al. used a smaller feature set, comprised of eight features, that consisted primarily of the additional features developed by Koroniotis et al. [2].

In their work, Demirpolat et al. evaluated an ensemble of prototypical networks [57] and Support Vector Machines (SVMs) [58] against four other models: Convolutional Neural Network (CNN) [59], SVMs, Naive Bayes [60], and deep autoencoders [61]. The proposed ensemble model uses few-shot learning [62] to compensate for training machine learning models with limited data. All models were trained on three different datasets: Bot-IoT, UNSW-NB15 [63], and a software-defined networking [64] customized set. The models were implemented with Scikit-learn [65], Keras [66], and PyTorch [67]. To address class imbalance, the number of instances in each category of Bot-IoT, except the Normal and Information Theft categories, was down-sampled to 20,000. The instances for Normal and Information Theft were not down-sampled because their numbers are small in comparison. For the training set, Demirpolat et al. randomly selected 100, 400, 800, and 1,000 instances. Also, the researchers randomly selected 100 instances for the validation set and used the remaining instances from the down-sampled dataset as the test set. With regard to accuracy, precision, recall, and F-measure for Bot-IoT, the ensemble model outperformed the other models by roughly 20%. The highest Bot-IoT F-measure score for the ensemble was 96% for the DDoS category. One shortcoming of this work is the relatively small sizes of the training sets. A small sample size can lead to a model with high variance.

For their research, Soe et al. proposed a feature selection algorithm that was implemented on a Raspberry Pi device. The algorithm uses correlation-based feature selection [68] and, for the calculation of the final feature set, relies on the gain-ratio [69] metric. Soe et al. utilized several tree-based classifiers, namely random forest [70], decision tree [71], Hoeffding tree [72], and logistic model tree [73]. All 477 normal instances from the 5% Subset were selected. From the attack categories, the researchers randomly selected 81,977 instances from DDoS, 82,060 from Reconnaissance, and 556 from Information Theft. The dataset was then split in a ratio of 66:34 for training and testing. In general, the results show that using the proposed selection algorithm significantly decreases the number of features without impacting classifier performance. Among the four models, decision tree performed the best, with top scores of 100% precision for DDoS and Reconnaissance, 99.99% F-measure for DDoS and Reconnaissance, 0% false positive rate for DDoS and Reconnaissance, and 99.40% true positive rate for DDoS. The fact that all the classifiers are tree-based is a limitation of this study. Classifier diversity adds credence to the results of a machine learning study.

VI. CONCLUSION

Data analysis is one of the most crucial aspects of the machine learning process. We emphasize that invalid features and/or poor data cleaning practices can lead to non-generalizable models and invalid performance scores. Bot-IoT is an intrusion detection dataset for training models to detect various botnet attacks in IoT networks. Based on our data analysis of Bot-IoT, we discovered several invalid features. These features were highly utilized in several studies that we reviewed, with over 50% of the papers using one or more of the invalid features. For future studies that incorporate Bot-IoT, we recommend utilizing only valid features and adopting an effective data cleaning procedure.

ACKNOWLEDGMENTS

We would like to thank the reviewers in the Data Mining and Machine Learning Laboratory at Florida Atlantic University. Additionally, we acknowledge partial support by the NSF (CNS-1427536). Opinions, findings, conclusions, or recommendations in this paper are the authors’ and do not reflect the views of the NSF.

REFERENCES

[1] J. L. Leevy and T. M. Khoshgoftaar, “A survey and analysis of intrusion detection models based on cse-cic-ids2018 big data,” Journal of Big Data, vol. 7, no. 1, pp. 1–19, 2020.
[2] N. Koroniotis, N. Moustafa, E. Sitnikova, and B. Turnbull, “Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: Bot-iot dataset,” Future Generation Computer Systems, vol. 100, pp. 779–796, 2019.
[3] B. B. Zarpelão, R. S. Miani, C. T. Kawakani, and S. C. de Alvarenga, “A survey of intrusion detection in internet of things,” Journal of Network and Computer Applications, vol. 84, pp. 25–37, 2017.
[4] T. O. Foundation, “Node-red: Low-code programming for event-driven applications.” https://nodered.org/.
[5] Argus, “Argus,” https://openargus.org/.
[6] Wireshark, “Wireshark. go deep.” https://www.wireshark.org/.
[7] T. Z. Project, “The zeek network security monitor,” https://zeek.org//.
[8] J. L. Leevy, T. M. Khoshgoftaar, R. A. Bauder, and N. Seliya, “A survey on addressing high-class imbalance in big data,” Journal of Big Data, vol. 5, no. 1, p. 42, 2018.
[9] R. Zuech, J. Hancock, and T. M. Khoshgoftaar, “Detecting web attacks using random undersampling and ensemble learners,” Journal of Big Data, vol. 8, no. 1, pp. 1–20, 2021.
[10] B. A. NG and S. Selvakumar, “Anomaly detection framework for internet of things traffic using vector convolutional deep learning approach in fog environment,” Future Generation Computer Systems, vol. 113, pp. 255–265, 2020.
[11] A. Alhowaide, I. Alsmadi, and J. Tang, “Pca, random-forest and pearson correlation for dimensionality reduction in iot ids,” in 2020 IEEE International IOT, Electronics and Mechatronics Conference (IEMTRONICS). IEEE, 2020, pp. 1–6.
[12] M. Abdel-Basset, V. Chang, H. Hawash, R. K. Chakrabortty, and M. Ryan, “Deep-ifs: Intrusion detection approach for iiot traffic in fog environment,” IEEE Transactions on Industrial Informatics, 2020.
[13] R. T. Wiyono and N. D. W. Cahyani, “Performance analysis of decision tree c4.5 as a classification technique to conduct network forensics for botnet activities in internet of things,” in 2020 International Conference on Data Science and Its Applications (ICoDSA). IEEE, 2020, pp. 1–5.
[14] M. A. Ferrag, L. Maglaras, A. Ahmim, M. Derdour, and H. Janicke, “Rdtids: Rules and decision tree-based intrusion detection system for internet-of-things networks,” Future internet, vol. 12, no. 3, p. 44, 2020.
[15] Y. Zhang, J. Xu, Z. Wang, R. Geng, K.-K. R. Choo, J. A. Pérez-Díaz, and D. Zhu, “Efficient and intelligent attack detection in software defined iot networks,” in 2020 IEEE International Conference on Embedded Software and Systems (ICESS). IEEE, 2020, pp. 1–9.
[16] S. Liaqat, A. Akhunzada, F. S. Shaikh, A. Giannetsos, and M. A. Jan, “Sdn orchestration to combat evolving cyber threats in internet of medical things (iomt),” Computer Communications, vol. 160, pp. 697–705, 2020.
[17] Y. Pacheco and W. Sun, “Adversarial machine learning: A comparative study on contemporary intrusion detection datasets,” in Proceedings of the 7th International Conference on Information Systems Security and Privacy, vol. 1, 2021, pp. 160–171.
[18] S. Aldhaheri, D. Alghazzawi, L. Cheng, B. Alzahrani, and A. Al-Barakati, “Deepdca: novel network-based detection of iot attacks using artificial immune system,” Applied Sciences, vol. 10, no. 6, p. 1909, 2020.
[19] Y. N. Soe, P. I. Santosa, and R. Hartanto, “Ddos attack detection based on simple ann with smote for iot environment,” in 2019 Fourth International Conference on Informatics and Computing (ICIC). IEEE, 2019, pp. 1–5.
[20] M. A. Ferrag, L. Maglaras, S. Moschoyiannis, and H. Janicke, “Deep learning for cyber security intrusion detection: Approaches, datasets, and comparative study,” Journal of Information Security and Applications, vol. 50, p. 102419, 2020.
[21] H. Alyasiri, J. A. Clark, A. Malik, and R. de Fréin, “Grammatical evolution for detecting cyberattacks in internet of things environments.”
[22] O. Ibitoye, O. Shafiq, and A. Matrawy, “Analyzing adversarial attacks against deep learning for intrusion detection in iot networks,” in 2019 IEEE Global Communications Conference (GLOBECOM). IEEE, 2019, pp. 1–6.
[23] K. Filus, J. Domańska, and E. Gelenbe, “Random neural network for lightweight attack detection in the iot,” in Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems. Springer, 2020, pp. 79–91.
[24] M. A. Lawal, R. A. Shaikh, and S. R. Hassan, “An anomaly mitigation framework for iot using fog computing,” Electronics, vol. 9, no. 10, p. 1565, 2020.
[25] S. Sriram, R. Vinayakumar, M. Alazab, and K. Soman, “Network flow based iot botnet attack detection using deep learning,” in IEEE INFOCOM 2020-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE, 2020, pp. 189–194.
[26] M. A. Ferrag and L. Maglaras, “Deepcoin: A novel deep learning and blockchain-based energy exchange framework for smart grids,” IEEE Transactions on Engineering Management, vol. 67, no. 4, pp. 1285–1297, 2019.
[27] D. Oreški and D. Andročec, “Genetic algorithm and artificial neural network for network forensic analytics,” in 2020 43rd International Convention on Information, Communication and Electronic Technology (MIPRO). IEEE, pp. 1200–1205.
[28] O. AlKadi, N. Moustafa, B. Turnbull, and K.-K. R. Choo, “Mixture localization-based outliers models for securing data migration in cloud centers,” IEEE Access, vol. 7, pp. 114 607–114 618, 2019.
[29] P. Kumar, G. P. Gupta, and R. Tripathi, “Toward design of an intelligent cyber attack detection system using hybrid feature reduced approach for iot networks,” Arabian Journal for Science and Engineering, vol. 46, no. 4, pp. 3749–3778, 2021.
[30] M. Al-Zewairi, S. Almajali, and M. Ayyash, “Unknown security attack detection using shallow and deep ann classifiers,” Electronics, vol. 9, no. 12, p. 2006, 2020.
[31] A. Djenna, D. E. Saidouni, and W. Abada, “A pragmatic cybersecurity strategies for combating iot-cyberattacks,” in 2020 International Symposium on Networks, Computers and Communications (ISNCC). IEEE, 2020, pp. 1–6.
[32] A. Churcher, R. Ullah, J. Ahmad, F. Masood, M. Gogate, F. Alqahtani, B. Nour, W. J. Buchanan et al., “An experimental analysis of attack classification using machine learning in iot networks,” Sensors, vol. 21, no. 2, p. 446, 2021.
[33] S. I. Popoola, B. Adebisi, M. Hammoudeh, G. Gui, and H. Gacanin, “Hybrid deep learning for botnet attack detection in the internet of things networks,” IEEE Internet of Things Journal, 2020.
[34] P. Kumar, G. P. Gupta, and R. Tripathi, “Tp2sf: A trustworthy privacy-preserving secured framework for sustainable smart cities by leveraging blockchain and machine learning,” Journal of Systems Architecture, vol. 115, p. 101954, 2021.
[35] P. Kumar, R. Kumar, G. P. Gupta, and R. Tripathi, “A distributed framework for detecting ddos attacks in smart contract-based blockchain-iot systems by leveraging fog computing,” Transactions on Emerging Telecommunications Technologies, p. e4112, 2020.
[36] R. Biswas and S. Roy, “Botnet traffic identification using neural networks,” Multimedia Tools and Applications, pp. 1–25, 2021.
[37] N. Guizani and A. Ghafoor, “A network function virtualization system for detecting malware in large iot based networks,” IEEE Journal on Selected Areas in Communications, vol. 38, no. 6, pp. 1218–1228, 2020.
[38] O. Alkadi, N. Moustafa, B. Turnbull, and K.-K. R. Choo, “A deep blockchain framework-enabled collaborative intrusion detection for protecting iot and cloud networks,” IEEE Internet of Things Journal, 2020.
[39] M. Shafiq, Z. Tian, A. K. Bashir, X. Du, and M. Guizani, “Corrauc: a malicious bot-iot traffic detection method in iot network using machine learning techniques,” IEEE Internet of Things Journal, 2020.
[40] M. A. Cheema, H. K. Qureshi, C. Chrysostomou, and M. Lestas, “Utilizing blockchain for distributed machine learning based intrusion detection in internet of things,” in 2020 16th International Conference on Distributed Computing in Sensor Systems (DCOSS). IEEE, 2020, pp. 429–435.
[41] M. Mulyanto, M. Faisal, S. W. Prakosa, and J.-S. Leu, “Effectiveness of focal loss for minority classification in network intrusion detection systems,” Symmetry, vol. 13, no. 1, p. 4, 2021.
[42] S. Bagui and K. Li, “Resampling imbalanced data for network intrusion detection datasets,” Journal of Big Data, vol. 8, no. 1, pp. 1–41, 2021.
[43] B. Susilo and R. F. Sari, “Intrusion detection in iot networks using deep learning algorithm,” Information, vol. 11, no. 5, p. 279, 2020.
[44] N. Koroniotis, N. Moustafa, and E. Sitnikova, “A new network forensic framework based on deep learning for internet of things networks: A particle deep framework,” Future Generation Computer Systems, vol. 110, pp. 91–106, 2020.
[45] S. Dwibedi, M. Pujari, and W. Sun, “A comparative study on contemporary intrusion detection datasets for machine learning research,” in 2020 IEEE International Conference on Intelligence and Security Informatics (ISI). IEEE, 2020, pp. 1–6.
[46] M. Shafiq, Z. Tian, Y. Sun, X. Du, and M. Guizani, “Selection of effective machine learning algorithm and bot-iot attacks traffic identification for internet of things in smart city,” Future Generation Computer Systems, vol. 107, pp. 433–442, 2020.
[47] T. T. Huong, T. P. Bac, D. M. Long, B. D. Thang, N. T. Binh, T. D. Luong, and T. K. Phuc, “Lockedge: Low-complexity cyberattack detection in iot edge computing,” IEEE Access, vol. 9, pp. 29 696–29 710, 2021.
[48] T. T. Huong, T. P. Bac, D. M. Long, B. D. Thang, T. D. Luong, and N. T. Binh, “An efficient low complexity edge-cloud framework for security in iot networks,” in 2020 IEEE Eighth International Conference on Communications and Electronics (ICCE). IEEE, 2021, pp. 533–539.
[49] S. Venugopal, G. W. Sathianesan, and R. Rengaswamy, “Cyber forensic framework for big data analytics using sunflower jaya optimization-based deep stacked autoencoder,” International Journal of Numerical Modelling: Electronic Networks, Devices and Fields, p. e2892, 2021.
[50] P. Nimbalkar and D. Kshirsagar, “Feature selection for intrusion detection system in internet-of-things (iot),” ICT Express, vol. 7, no. 2, pp. 177–181, 2021.
[51] P. Jithu, J. Shareena, A. Ramdas, and A. Haripriya, “Intrusion detection system for iot botnet attacks using deep learning,” SN Computer Science, vol. 2, no. 3, pp. 1–8, 2021.
[52] M. Ge, N. F. Syed, X. Fu, Z. Baig, and A. Robles-Kelly, “Towards a deep learning-driven intrusion detection approach for internet of things,” Computer Networks, vol. 186, p. 107784, 2021.
[53] W. L. Costa, M. M. Silveira, T. de Araujo, and R. L. Gomes, “Improving ddos detection in iot networks through analysis of network traffic characteristics,” in 2020 IEEE Latin-American Conference on Communications (LATINCOM). IEEE, 2020, pp. 1–6.
[54] M. Ge, X. Fu, N. Syed, Z. Baig, G. Teo, and A. Robles-Kelly, “Deep learning-based intrusion detection for iot networks,” in 2019 IEEE 24th Pacific Rim International Symposium on Dependable Computing (PRDC). IEEE, 2019, pp. 256–25 609.
[55] A. Demirpolat, A. K. Sarica, and P. Angin, “Protédge: A few-shot ensemble learning approach to software-defined networking-assisted edge security,” Transactions on Emerging Telecommunications Technologies, p. e4138, 2020.
[56] Y. N. Soe, Y. Feng, P. I. Santosa, R. Hartanto, and K. Sakurai, “Towards a lightweight detection system for cyber attacks in the iot environment using corresponding features,” Electronics, vol. 9, no. 1, p. 144, 2020.
[57] T. Gao, X. Han, Z. Liu, and M. Sun, “Hybrid attention-based prototypical networks for noisy few-shot relation classification,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 6407–6414.
[58] R. A. Bauder and T. M. Khoshgoftaar, “The detection of medicare fraud using machine learning methods with excluded provider labels,” in The Thirty-First International Flairs Conference, 2018.
[59] C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” Journal of Big Data, vol. 6, no. 1, pp. 1–48, 2019.
[60] R. A. Bauder, T. M. Khoshgoftaar, A. Richter, and M. Herland, “Predicting medical provider specialties to detect anomalous insurance claims,” in 2016 IEEE 28th international conference on tools with artificial intelligence (ICTAI). IEEE, 2016, pp. 784–790.
[61] Z. Salekshahrezaee, J. L. Leevy, and T. M. Khoshgoftaar, “A reconstruction error-based framework for label noise detection,” Journal of Big Data, vol. 8, no. 1, pp. 1–16, 2021.
[62] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, “Generalizing from a few examples: A survey on few-shot learning,” ACM Computing Surveys (CSUR), vol. 53, no. 3, pp. 1–34, 2020.
[63] N. Moustafa and J. Slay, “Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set),” in 2015 military communications and information systems conference (MilCIS). IEEE, 2015, pp. 1–6.
[64] D. Kreutz, F. M. Ramos, P. E. Verissimo, C. E. Rothenberg, S. Azodolmolky, and S. Uhlig, “Software-defined networking: A comprehensive survey,” Proceedings of the IEEE, vol. 103, no. 1, pp. 14–76, 2014.
[65] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in python,” Journal of machine learning research, vol. 12, no. Oct, pp. 2825–2830, 2011.
[66] F. Chollet et al., “Keras,” https://github.com/fchollet/keras, 2015.
[67] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” arXiv preprint arXiv:1912.01703, 2019.
[68] M. A. Hall, “Correlation-based feature selection for machine learning,” 1999.
[69] J. L. Leevy, J. Hancock, R. Zuech, and T. M. Khoshgoftaar, “Detecting cybersecurity attacks across different network features and learners,” Journal of Big Data, vol. 8, no. 1, pp. 1–29, 2021.
[70] V. M. Herrera, T. M. Khoshgoftaar, F. Villanustre, and B. Furht, “Random forest implementation and optimization for big data analytics on lexisnexis’s high performance computing cluster platform,” Journal of Big Data, vol. 6, no. 1, pp. 1–36, 2019.
[71] N. Seliya and T. M. Khoshgoftaar, “The use of decision trees for cost-sensitive classification: an empirical study in software quality prediction,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1, no. 5, pp. 448–459, 2011.
[72] C. Manapragada, G. I. Webb, and M. Salehi, “Extremely fast decision tree,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1953–1962.
[73] M. Sumner, E. Frank, and M. Hall, “Speeding up logistic model tree induction,” in European conference on principles of data mining and knowledge discovery. Springer, 2005, pp. 675–683.
APPENDIX
TABLE III
FEATURES, DESCRIPTIONS, AND SETS
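
The windowed totals described in Section III-B2 can be illustrated with a short Python sketch. It computes TnBPSrcIP as the sum of the bytes feature over the records in the current window that share the current record's saddr. The sketch assumes a trailing window (the current record plus up to 99 preceding records); the text states only that the window holds 100 instances including the current one, so a fixed-block window is equally plausible. The function name and the dictionary-based record layout are illustrative and are not taken from Koroniotis et al. [2].

```python
from collections import deque


def tn_bp_src_ip(records, window=100):
    """Total bytes per source IP over a trailing window ending at each record.

    Assumption: the window slides one record at a time and contains the
    current record plus up to (window - 1) records that precede it.
    """
    recent = deque(maxlen=window)  # (saddr, bytes) pairs inside the window
    totals = []
    for record in records:
        recent.append((record["saddr"], record["bytes"]))
        # The current record acts as the filter: only records in the
        # window with the same saddr contribute to the total.
        totals.append(sum(b for saddr, b in recent if saddr == record["saddr"]))
    return totals


# Hypothetical records, ordered as they appear in the dataset.
records = [
    {"saddr": "192.168.100.3", "bytes": 100},
    {"saddr": "192.168.100.5", "bytes": 50},
    {"saddr": "192.168.100.3", "bytes": 20},
    {"saddr": "192.168.100.3", "bytes": 5},
]

# A window of 3 is used here only to make the sliding behavior visible.
print(tn_bp_src_ip(records, window=3))  # [100, 50, 120, 25]
```

Under the same assumptions, filtering on daddr instead of saddr would give the corresponding per-destination total.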
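
The treatment of invalid features in Section III-B3 can be sketched with the Python standard library alone. The example below drops the six invalid features from a record and measures the share of records whose saddr and daddr are both private, which is the kind of check behind the 100% versus 36% observation. The helper names and sample records are hypothetical. Note also that the `ipaddress` module counts loopback and link-local ranges as private, which may differ slightly from the definition the authors applied.

```python
import ipaddress

# The six features the paper classifies as invalid (Section III-B3).
INVALID_FEATURES = frozenset(
    {"pkSeqID", "seq", "stime", "ltime", "saddr", "daddr"}
)


def drop_invalid_features(record):
    """Return a copy of the record without the six invalid features."""
    return {name: value for name, value in record.items()
            if name not in INVALID_FEATURES}


def both_private(record):
    """True when saddr and daddr both parse as private IP addresses."""
    try:
        return (ipaddress.ip_address(record["saddr"]).is_private
                and ipaddress.ip_address(record["daddr"]).is_private)
    except ValueError:
        # Assumption: values that are not IP addresses (for example MAC
        # addresses on ARP records) are counted as not private.
        return False


def private_share(records):
    """Fraction of records whose saddr and daddr are both private."""
    records = list(records)
    if not records:
        return 0.0
    return sum(both_private(r) for r in records) / len(records)


# Hypothetical records for illustration only.
sample = [
    {"pkSeqID": 1, "saddr": "192.168.100.3", "daddr": "192.168.100.150", "pkts": 4},
    {"pkSeqID": 2, "saddr": "192.168.100.7", "daddr": "8.8.8.8", "pkts": 2},
]

print(private_share(sample))             # 0.5
print(drop_invalid_features(sample[0]))  # {'pkts': 4}
```

Running `private_share` separately on the attack and normal records of a subset would show whether address type alone separates the two classes.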
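
The two cleaning steps recommended in Section IV can be expressed as a small Python sketch: ICMP port values are set to -1, and ARP instances are either relabeled as normal traffic or dropped. The function names, the `arp_policy` parameter, and the assumption that the label columns are called category and subcategory with the value "Normal" for benign traffic are illustrative choices, not details confirmed by the text.

```python
def clean_record(record, arp_policy="relabel"):
    """Apply the two cleaning steps from Section IV to one record.

    Returns a cleaned copy, or None when the record should be dropped.
    """
    cleaned = dict(record)
    proto = str(cleaned.get("proto", "")).lower()

    # ICMP has no ports; replace the hexadecimal placeholders with -1,
    # the value the paper notes is already used for ARP instances.
    if proto == "icmp":
        cleaned["sport"] = -1
        cleaned["dport"] = -1

    if proto == "arp":
        if arp_policy == "drop":
            return None
        # Relabel: the paper finds ARP does not contribute to the attacks.
        cleaned["attack"] = 0
        # Assumption: label columns named category and subcategory use
        # the string "Normal" for benign traffic.
        for label in ("category", "subcategory"):
            if label in cleaned:
                cleaned[label] = "Normal"
    return cleaned


def clean_dataset(records, arp_policy="relabel"):
    """Clean every record, discarding those the policy drops."""
    cleaned = (clean_record(r, arp_policy) for r in records)
    return [r for r in cleaned if r is not None]


# Hypothetical records for illustration only.
raw = [
    {"proto": "icmp", "sport": "0x0303", "dport": "0x5000", "attack": 1},
    {"proto": "arp", "sport": -1, "dport": -1, "attack": 1, "category": "DoS"},
    {"proto": "tcp", "sport": 80, "dport": 443, "attack": 0},
]

for row in clean_dataset(raw):
    print(row)
```

Passing `arp_policy="drop"` applies the alternative the paper mentions, removing all ARP instances instead of relabeling them.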