MapReduce Based Intelligent Model For Intrusion Detection Using Machine Learning Techniques
Article info

Article history:
Received 2 September 2021
Revised 19 November 2021
Accepted 10 December 2021
Available online 16 December 2021

Keywords:
Denial-of-Service
Intrusion detection system
Cyber-attacks
Network traffic
Hadoop distributed file system

Abstract

With the emergence of the Internet of Things (IoT), the phenomenal expansion of computer networks, and an enormous number of relevant applications, data is continuously increasing. As a result, cybersecurity has gained significant importance in protecting networks from different cyber-attacks such as Intrusions, Denial-of-Service (DoS), Eavesdropping, Rushing Attack, etc. A traditional Intrusion Detection System (IDS) combined with the clustering technique plays a vital role in modern security. Still, it is limited in its ability to analyze the vast volumes of data to identify an anomaly intelligently. Machine learning is a technique that may be combined with the MapReduce-Based Intelligent Model for Intrusion Detection (MR-IMID) to automate intrusion detection intelligently. In this research work, MR-IMID is proposed to detect intrusions on a network with multiple data classification tasks. The proposed MR-IMID processes big data sets reliably using commodity hardware. Multiple network sources are utilized in real time for intrusion detection. The MR-IMID detects intrusions by predicting unknown test scenarios and stores the data in the database to minimize future inconsistencies. The detection accuracy of the proposed model during training and validation phases is 97.7% and 95.7%, respectively, which is better than previously published approaches.

© 2021 The Authors. Published by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
https://doi.org/10.1016/j.jksuci.2021.12.008
1319-1578/© 2021 The Authors. Published by Elsevier B.V. on behalf of King Saud University.
M. Asif, S. Abbas, M.A. Khan et al. Journal of King Saud University – Computer and Information Sciences 34 (2022) 9723–9731
illegal accessibility, alteration, or damage, may also be discovered, determined, and identified using an IDS (Wu, 2020). To assist a system’s security, it is necessary to identify different CAs or inconsistencies in a network and construct an efficient IDS that plays an essential part in modern NS. Usually, intrusion detection involves extensive data set exploration.

Image analysis, pattern identification, social networks, massive Network Traffic (NT) analysis, etc., require the exploration of enormous datasets. Sequential techniques cannot process such applications because the data are too large. Conventional clustering-based intrusion detection approaches do not scale efficiently with increasing NT volumes. Additionally, massive NT analysis presents a performance issue when detecting anomalous links, necessitating a parallel method for intrusion detection. Generally, classic parallel algorithms developed using the “Message Passing Interface (MPI)” approach (Snir et al., 2015) encounter a variety of challenges, e.g., efficiently managing network connections and balancing the division of the processing burden across various processors. Furthermore, parallel algorithms can be affected by node failure, which reduces the scalability of the algorithm. Consequently, a scalable “parallel intrusion detection algorithm” that achieves high intrusion identification rates is required.

As an alternative to MPI (Snir et al., 2015), the “MapReduce programming model” (Dean and Ghemawat, 2010) has developed as a parallel processing approach, particularly for data-intensive tasks. The MapReduce technique has many features that make it a viable option for parallelizing data mining jobs, including ease of deployment and the elimination of the need to learn many parallel programming specifics. MapReduce also provides mechanisms for handling “node failure” and “load balancing”. The dataset size and the number of computer nodes determine how MapReduce separates the input dataset into distinct splits. There are two primary functions in MapReduce: the “Map Function (MF)” and the “Reduce Function (RF)”. The MF generates intermediate results as “(key, values list) data pairs” by processing the input data records as “(key, value) data pairs”. Then the RF combines and aggregates the intermediate “(values list)” of the MF with the same intermediate key. “High Availability Distributed Object-Oriented Platform (Hadoop)” (White, 2010) is an “Apache-developed” open-source platform that employs the MapReduce approach. It was designed to handle data-intensive tasks. One of Hadoop’s features is its own distributed file system, known as the “Hadoop Distributed File System (HDFS)”, which is used to handle and process massive datasets. Additionally, Hadoop’s MapReduce is intended to interact with HDFS effectively by bringing the computing process to the data rather than the other way around, allowing Hadoop to attain high data localization.

Another issue with intrusion detection is the scarcity of trained professionals who can monitor and respond to intrusions by analyzing the large data in clusters produced as the output of MapReduce. In the cybersecurity field, ML approaches have been successfully applied to create efficient strategies. ML has a lot of potential for identifying different forms of cyber-attacks, intrusion detection, malware classification and detection, privacy protection, and advanced threat detection, so it is becoming a useful tool for defenders. Present threats are becoming more complex and sophisticated as adversarial techniques evolve rapidly. Most current security technologies, for example, can easily be bypassed by threat variants. As a result, self-learning methods should be able to deal with such issues. ML techniques have emerged as an essential tool for the entire security industry in this regard.

2. Literature review

An IDS is generally employed in the observation and evaluation of day-to-day activity in a computer network. It identifies security concerns or threat activity on a network (Wu, 2020). There have been many cybersecurity studies with the capacity to identify and prevent cyber assaults or intrusions. One of the most well-known technologies in the cyber sector is signature-based network intrusion detection (Sarker and Salim, 2018). This method uses a recognized signature and has recently achieved significant acceptance, as well as economic success. The “anomaly-based method”, on the other hand, offers a benefit over the “signature-based approach” for recognizing hidden or “zero-day attacks” (Wu, 2020). This method analyses important security data to watch NT and identify behavioral attack patterns. Several data mining and ML approaches evaluate such security event patterns and make meaningful choices (Tapiador et al., 2013). The primary disadvantage of the anomaly-based method is that it might result in many false alarms since it can classify formerly unknown system actions as anomalies (Wu, 2020). As a result, minimizing an IDS’s false-positive rate must be a key goal. Therefore, to reduce these concerns, an effective MapReduce-based identification method is required.

ML is a field of AI connected to “computational statistics, data mining, and data science”. It is primarily concerned with teaching machines to learn from data (Li et al., 2012). It is closely linked to “mathematical techniques, statistical analysis, optimization,” and other fields. Thus, ML is a data-driven technique in the cybersecurity domain whose initial step is to comprehend raw security data in order to construct an “intelligent security model” for generating forecasts. ML approaches have commonly utilized association analysis to create “rule-based intelligent systems” (Wagner et al., 2011). Many well-known approaches have been used to construct a data-driven predictive model (Li et al., 2012). These approaches include the “probability-based Naive Bayes (NB) classifier, hyperplane-based Support Vector Machine (SVM), instance-learning-based K-Nearest Neighbor (KNN), the sigmoid function-based Logistic Regression (LR) technique, and rule-based classification, such as Decision Trees (DT)” (Alghamdi et al., 2021; Nadeem et al., 2021; Khan et al., 2021; Ahmad et al., 2021).

Many researchers have employed the ML classification approaches listed above in the cybersecurity area, notably for identifying intrusions or CAs. For example, Li et al. (Kotpalliwar and Wajgi, 2015) demonstrated how to use the hyperplane-based SVM classifier with a Radial Basis Function (RBF) kernel for identifying preset attack types using the well-known Knowledge Discovery in Databases (KDD’99) cup dataset. These types may include “DoS, Probe or Scan, User to Root (U2R), Remote to the user (R2L), and normal traffic”. To develop a faster system, the authors trained the model with big datasets, utilizing a “least-squared SVM classifier”. In Pervez and Farid (2014), the authors categorized the anomalies using an SVM classifier variation. The SVM classifier is used to identify anomalies and various kinds of CAs. These attacks included Network Basic Input/Output System (NetBIOS) scans, Denial of service (DoS) attacks, Post Office Protocol (POP) spam, and Secure Shell (SSH) scans. A “one-class SVM classifier” is employed in Kokila and Selvi (2014) to identify the behavior of unseen computer worms.

Aljarah and Ludwig (2013) proposed an IDS called IDS-MRCPSO, which is based on a parallel Particle Swarm Optimization (PSO) clustering algorithm and the MapReduce technique, to tackle the processing of massive NT. PSO is a particularly effective approach for clustering. It eliminates the sensitivity to the initial cluster centroids. Using “commodity hardware”, this IDS handles big data sets. To assess the system speed, tests were conducted on an actual intrusion data set. The test findings show that “IDS-MRCPSO” scales very close to the optimal speedup while enhancing detection accuracy. Also, it is effective for growing Training Data Set (TDS) volumes. Additionally, to prevent “random sampling” effects, they built the detection model using the entire TDS. As a result,
their method covers more relevant parts of the TDS and creates a better detection model. The findings show that utilizing more TDS improves detection rates while reducing false alarms to a minimum. Their model has a limitation: it does not differentiate among various kinds of intrusions; instead, it merely determines whether an incursion has occurred.

An IDS makes it possible to detect harmful activity and intrusions early on. Most IDS methods, however, are overwhelmed by the present large amounts of NT. Innovative techniques are needed that can handle large volumes of traffic while retaining high efficiency during analysis. Wu (2020) presented a “distributed Network IDS” framework in which data analysis is carried out in a Cloud Computing (CC) setting. Sensors in numerous locations throughout the network, including network devices, servers, and individual workstations, capture NT, operating system logs, and primary application data. The MapReduce approach is used to gather, analyze, and compare data obtained from many sources. It looks for event correlations that might identify intrusion activities or malicious behavior. This framework can easily handle huge amounts of gathered data and heavy processing loads, efficiently scaling to business network settings. In addition, unlike prior IDS models, it can detect sophisticated attacks by correlating data from many sources and discovering patterns that may not be seen in centralized traffic collections or “single host log” analysis. The model’s viability is demonstrated by experiments conducted on an actual cluster and cloud infrastructure.

Mining useful data from enormous data sets stored on the cloud has been a growing business trend. Yet, current IDS systems have proven incapable of adapting to “large-scale log data mining”. As a result, Besharati et al. (2019) proposed an “association rule mining” technique based on the “MapReduce parallel computing” architecture. To begin, the “Apriori Frequent Itemset Mining (AFIM)” method is examined, and the “MapReduce approach” is utilized to parallelize and enhance it so that AFIM may be completed more quickly. Second, Apriori is designed to operate in parallel on the IDS. Lastly, the test was performed by constructing a “Hadoop cluster” using an open-source CC architecture. The findings demonstrate that the technique has a better detection accuracy and takes less processing time on large amounts of data.

In contrast to the previous studies, this investigation presents an MR-IMID using the ML technique for intrusion detection. The proposed model captures the classification of security issues based on their significance. It combines MapReduce with an ML technique, the Artificial Neural Network (ANN), to manage large amounts of data in large networks and identify intrusions efficiently. The MapReduce approach is very effective in parallel clustering of a large dataset. It then constructs a generic architecture for detecting intrusions centered on the identified significant characteristics to address the known problems.

3. Proposed Methodology:

Developing a data-driven intelligent IDS might enable computational security methods to examine distinct cyber event patterns and ultimately anticipate attacks using cybersecurity data. However, modeling CAs is difficult, since modern security datasets may include many security features that are less significant or not required. This paper presents an intrusion detection model, MR-IMID, based on MapReduce, which has proven to be an effective parallelization approach for many tasks. Furthermore, the proposed model includes ML for clustering analysis to recognize the important security features for building an intelligent detection model.

This research elaborates an IDS using the ML technique. The proposed MapReduce-based IDS employs a novel approach to reduce the number of false alarms over time. It is accomplished by obtaining feedback from human professionals and changing the learning model accordingly. The proposed method effectively reduces the probability of many false alarms with similar data. On the other hand, the presented MapReduce-based IDS can consider supervised learning approaches. As a result, there is label information in the training process. It also recommends changes in cases when traditional methods categorize training samples using human professionals.

The proposed IDS provides a method for streamlining the evaluation of human specialists’ decisions. As a result, depending on predefined observations, the system may detect anomalies. Therefore, using “supervised feature learning,” the system can detect human errors in labeling data and offer corrections. Furthermore, the system may use a scoring method to detect new traffic segments. The proposed IDS provides a dynamically updated framework that addresses the previous techniques’ adaptation problem. With little computing cost, the proposed MapReduce-based IDS improves the learning framework based on existing data and novel types of CAs.

Fig. 1 gives a visual representation of the main roles of each strategy in a cyber-event detection system. The legend in Fig. 1 shows the layout for different elements (e.g., “green rectangle for data collecting module”). The same format is used to convey each strategy so that the reader may connect each strategy to the framework, detailed further below.

“Data sources” are those that create large amounts of data that may be gathered and processed to identify and prevent CAs. There are numerous ways of generating relevant data for security analytics (e.g., “windows logs, NetFlow data, and email logs”). The source(s) used vary per company, and data is collected from numerous sources and kept in a “Data Storage (DS)”.

“Data collection” is the process of connecting a system to external data sources. This module collects data from the given data sources using various methods, saves it in a database, and then forwards it for preprocessing. Preparing data for later modules, particularly data processing, is known as “Data Pre-processing.” Feature selection and extraction, elimination of duplicates and faulty records, data validation, and standardization are examples of preprocessing activities. The data is acquired from the DS, and the prepared data is stored back in the DS and fed to data processing. The “Data Processing module” makes use of big data technology to extract useful information about CAs. Results are generated in this module by using the ML technique. The “Data Post-processing module” uses ML to improve the “Data Processing module” findings. To complete the related activities, this module, like “Data Processing”, connects with the DS and is assisted by big data techniques.

Next, the system checks whether a cyber event has been found. If not, a simple message is displayed; if so, a trigger is raised and the triggered results are also stored in a database in the visualization module. The “Visualization module” makes use of a variety of methods (such as a “dashboard, a text or graphic report, and an email notifier”) to convey finished results (such as security alerts) to security professionals.

After security data has been processed in each module, “Cloud Big Data Storage Support” controls its distribution and back-and-forth storage. Not only does the module handle the DS, but it also keeps track of the numerous methods that other modules use to access and change data. To provide the needed DS, the module leverages several DS technologies (e.g., “Highly parallel Integrated Virtual Environment (HIVE), HDFS, MongoDB, Cassandra”). The distribution of data processing among computer nodes is managed by “Big Data Processing Support”. The module uses a big data framework (e.g.,
“Hadoop, Spark, or Storm”). To distribute processing across the computer nodes, the module supports the “data preprocessing, data processing, and data post-processing modules”, as illustrated in Fig. 1. For example, a MapReduce program comprises an MF that filters and sorts data and an RF that conducts a summary process.

The publicly available intrusion data set from the “Kaggle ML Repository” is utilized in this research (kaggle.com). It is divided into two types: “regular and attack”. The intrusion detection dataset includes 41 input features and one output feature. Each sample has a label indicating its traffic type, either regular or malicious traffic. The open attack groups for the NSL-KDD data belong to four major classes: DoS, Remote to Local (R2L), User-to-Root (U2R), and probe. All types of attacks include multiple sample attacks in the information collection. Table 1 presents a full overview of the features. The complete list of threats is presented in the following.

3.1. NSL-KDD dataset attacks description

DoS: back, land, neptune, pod, smurf, teardrop, processtable, udpstorm, mailbomb, apache2
Probe: ipsweep, nmap, saint, mscan, portsweep, satan
R2L: spy, warezclient, guesspassword, ftp_write, imap, multihop, named, phf, snmpgetattack, warezmaster, xlock, xsnoop, httptunnel, sendmail
U2R: bufferoverflow, loadmodule, perl, snmpguess, sqlattack, xterm, rootkit, ps, worm

The pseudocode of the process is shown in Table 2. In the data collecting process, all sorts of attacks comprise several sample attacks. As shown in Fig. 2, the framework flow of the proposed model is divided into two phases: training and validation. Each step is further divided into different stages. In the first phase, data is collected from various cybersecurity sensors installed in multiple locations. In this research, a pre-labeled cybersecurity dataset is selected for the operation of the presented model. There are 216,352 cases in this dataset, with fifty-four features. Fifty-three features are independent. The remaining one, the output class, is dependent. The next important layer is preprocessing, which mitigates noisy data using a moving average, normalization, and data cleaning based on the mean imputation method. Then the preprocessed data of each class is split into a 70% training set and a 30% testing set. After this process, the training data is sent to the training layer, whereas the testing dataset is stored on cloud storage.

After preprocessing, the processed data is passed through the MapReduce procedure. MapReduce is the foundation of the data analysis processing model. Its framework is basic and easy to grasp. The “Map and Reduce” functions are the two most important activities in MapReduce. No matter how complicated the MapReduce operation is, it must go through these two steps, shown in Fig. 3.

$$ \mathrm{DAG} = (W, E, \mathrm{DAGinfo}) \tag{1} $$

$$ W = \{ W_{name}, \{\mathrm{Map}\}, \{\mathrm{Reduce}\}, \mathrm{Param}, \mathrm{Input}, \mathrm{Output} \} \tag{2} $$

Here, W is the number of job streams acquired after processing. “Wname” is the task’s name. “Map and Reduce” are the map and reduce processing procedures. “Param” is the task’s configuration parameter. The kind of data source of the input and output tasks is represented by “Input and Output”. E indicates the connection between two tasks in the “Directed Acyclic Graph (DAG)” diagram. “DAGinfo” is the DAG’s unique identifying information. Eq. (2)’s “Map processing (MP)” may be described as:
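Setting the formal notation aside, the map and reduce roles described in this section can be illustrated with a minimal single-process sketch in Python. It is an illustration under stated assumptions, not the paper's implementation and not Hadoop code: the names `map_fn`, `shuffle`, `reduce_fn`, and `run_job`, the sample records, and the choice of counting records per traffic label are all hypothetical.

```python
from collections import defaultdict

# Hypothetical pre-labeled records as (record_id, traffic_label) pairs.
RECORDS = [
    (0, "normal"), (1, "dos"), (2, "normal"),
    (3, "probe"), (4, "dos"), (5, "dos"),
]


def map_fn(record):
    """MF: turn one (key, value) input record into intermediate (key, value) pairs."""
    _record_id, label = record
    yield (label, 1)


def shuffle(pairs):
    """Group intermediate values by key, giving (key, values list) pairs."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(key, values):
    """RF: aggregate the values list that shares one intermediate key."""
    return key, sum(values)


def run_job(records, n_splits=2):
    """Simulate one MapReduce job in memory over n_splits input splits."""
    # Assumption: a simple round-robin split stands in for HDFS block splits.
    splits = [records[i::n_splits] for i in range(n_splits)]
    intermediate = []
    for split in splits:  # each split would be handled by a separate mapper
        for record in split:
            intermediate.extend(map_fn(record))
    grouped = shuffle(intermediate)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(RECORDS))
```

In a real Hadoop deployment the splits would be stored in HDFS and the grouping step would be performed by the framework; here both are simulated in memory so that the example stays runnable on its own. The per-label counts do not depend on the number of splits, which is the property that lets the map phase run in parallel.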
Table 1
Dataset structure (kaggle.com).
6.3 Validate detection criteria: if no detection criteria are satisfied, grant access; else store the data in the intrusion database
7 Stop

$$ G = \frac{1}{2} \sum_{l} (s_l - q_l)^2 \tag{7} $$

$$ \Delta x_{i,j} = Z \left[ \sum_{l} n_l K_{j,l} \right] q_j (1 - q_j)\, r_i $$

$$ \Delta x_{i,j} = Z\, n_j\, r_i \tag{10} $$

where

$$ n_j = \left[ \sum_{l} n_l K_{j,l} \right] q_j (1 - q_j) $$

The update of the weight and bias between the output and hidden layers is shown in Eq. (11):

$$ K^{+}_{j,l} = K_{j,l} + k_F \, \Delta K_{j,l} \tag{11} $$

The update of the weight and bias between the input layer and the hidden layer is shown in Eq. (12).

4. Simulation results:

This data set is the most widely used standard test set for network IDSs (kaggle.com). The information in this data set is split into two parts: the training dataset and the validation dataset. The training data is labeled, whereas the test data is unlabeled. The test data also includes certain attack types that were not present in the training data. This makes the system’s identification more accurate and trustworthy.

The proposed MR-IMID using the ML technique is implemented on a data set. The data was processed ahead of time to remove inconsistencies and protect it from mistakes. The MapReduce-based IDS looks for malicious behavior or intrusion using different hidden layers (including hidden neurons) and activation functions. Furthermore, many neurons are assessed in the network’s hidden
layers, and various activation functions are implemented. The simulation results of the proposed MapReduce-based IDS are used to estimate the system’s efficiency accurately. The Spyder tool is used for the training and validation process. All experiments were performed on a desktop computer with 64-bit Windows 10 OS and an Intel(R) Xeon(R) processor. The detection speed of the proposed model is measured by using the operating system clock. The detection time was approximately 0.2147 s per record. The data classification consumes only a few milliseconds. The output of the proposed MR-IMID is compared with that of its MapReduce-based IDS counterparts using “multiple statistical measures,” as shown in Eqs. (13) to (21).

$$ \text{Sensitivity} = \frac{\sum \text{True Positive}}{\sum \text{Condition Positive}} \tag{13} $$

$$ \text{Specificity} = \frac{\sum \text{True Negative}}{\sum \text{Condition Negative}} \tag{14} $$

$$ \text{Accuracy} = \frac{\sum \text{True Positive} + \sum \text{True Negative}}{\sum \text{Total Population}} \tag{15} $$

$$ \text{Miss Rate} = 1 - \text{Accuracy} \tag{16} $$

$$ \text{Fallout} = \frac{\sum \text{False Positive}}{\sum \text{Condition Negative}} \tag{17} $$

$$ \text{Likelihood Positive Ratio} = \frac{\text{True Positive Rate}}{\text{False Positive Rate}} \tag{18} $$

$$ \text{Likelihood Negative Ratio} = \frac{\text{False Negative Rate}}{\text{True Negative Rate}} \tag{19} $$

$$ \text{Positive Predictive Value} = \frac{\sum \text{True Positive}}{\sum \text{Predicted Condition Positive}} \tag{20} $$

$$ \text{Negative Predictive Value} = \frac{\sum \text{True Negative}}{\sum \text{Predicted Condition Negative}} \tag{21} $$

Tables 3 and 4 show the training and validation results in terms of detection accuracy and miss rate. The ANN algorithm has been used on a dataset of 216,352 records. For each class, the data is divided into 70 percent for training (151,446 samples) and 30 percent for validation. Various statistical measures are utilized for comparison. Performance is calculated using the following metrics: “detection accuracy, sensitivity, specificity, miss-rate, fall-out, Likelihood Positive Ratio (LR+), Likelihood Negative Ratio (LR−), Precision and Negative Predictive Value (NPV)”. The “True Positive Rate (TPR)” is expressed as sensitivity. The “True Negative Rate (TNR)” is defined as specificity. The “False Negative Rate (FNR)” is described as miss-rate. The “False-Positive Rate (FPR)” is expressed as fallout. The “Positive Predictive Value (PPV)” is expressed as precision.

Table 3
Training of the proposed MR-IMID using ML Technique.
Total number of samples: 151,446
Expected output (input)    Predicted Positive            Predicted Negative
98,657 (Positive)          True Positive (TP): 96,950    False Positive (FP): 1,707
52,789 (Negative)          False Negative (FN): 1,838    True Negative (TN): 50,951

Table 4
Validation of the proposed MR-IMID using ML Technique.
Total number of samples: 64,906
Expected output (input)    Predicted Positive            Predicted Negative
42,192 (Positive)          True Positive (TP): 40,836    False Positive (FP): 1,356
22,714 (Negative)          False Negative (FN): 1,433    True Negative (TN): 21,281

Table 3 shows the intrusion detection of the proposed MR-IMID model on the server during the “Training Phase (TP)”. During TP, a total of 151,446 samples are utilized. They are split into 98,657 positive samples and 52,789 negative samples. Of the positive samples, 96,950 are accurately predicted as “True Positive”, meaning that no intrusion is identified, and 1,707 records are mistakenly forecasted as negative, implying that intrusion is identified. Similarly, 52,789 negative samples are taken, where negative means intrusion is detected. Of these, 50,951 samples are accurately forecasted as negative, indicating intrusion, and 1,838 samples are incorrectly predicted as positive, indicating no intrusion is identified, even though intrusion on the server exists.

Table 4 shows the intrusion detection of the MR-IMID model on the server during the “Validation Phase (VP)”. During VP, a total of 64,906 samples are utilized. They are split into 42,192 positive samples and 22,714 negative samples. “True Positive” is properly identified in 40,836 samples, indicating that no intrusion has occurred, and 1,356 records are mistakenly predicted as negative, indicating that intrusion has occurred. Similarly, 22,714 negative samples are taken, where negative implies intrusion is identified. Of these, 21,281 samples are accurately forecasted as negative, indicating intrusion. Finally, 1,433 samples are incorrectly predicted as positive, indicating no intrusion is identified, even though intrusion on the network exists.

Table 5 shows the proposed model’s performance in terms of “detection accuracy, sensitivity, specificity, miss rate, and precision” during the training and validation phases. During training, the proposed model gives 97.6%, 0.983, 0.965, 2.4%, and 0.981 for detection accuracy, sensitivity, specificity, miss rate, and precision, respectively. During validation, it gives 95.7%, 0.968, 0.937, 4.3%, and 0.966, respectively.

In addition, further statistical measures are reported for the proposed model: FPR, LR+, LR−, and NPV are 0.035, 28.08, 0.025, and 0.967 during training and 0.063, 15.36, 0.045, and 0.940 during validation, respectively.

Table 6 shows the comparison of the performance of the proposed MR-IMID using the ML Technique with previous approaches (Haider et al., 2021; Sheikhan et al., 2012; Gao et al., 2019; Ingre and Yadav, 2015; Khan et al., 2021; Tavallaee et al., 2009; Ibrahim et al., 2013; Ingre and Yadav, 2015; Panda et al., 2010;
Table 5
Performance Evaluation of Proposed MR-IMID Model in Training and Validation Using Different Statistical Measures.
Phase        Detection accuracy (%)   Sensitivity (TPR)   Specificity (TNR)   Miss rate, FNR (%)   Fall-out (FPR)   LR+     LR−     PPV (precision)   NPV
Training     97.6                     0.983               0.965               2.4                  0.035            28.08   0.025   0.981             0.967
Validation   95.7                     0.968               0.937               4.3                  0.063            15.36   0.045   0.966             0.940
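As a hedged cross-check of the reported measures, the sketch below recomputes accuracy, miss rate, sensitivity, specificity, fall-out, PPV, and NPV from the counts in Tables 3 and 4, following Eqs. (13) to (17), (20), and (21). The class and function names (`ConfusionCounts`, `evaluate`) are illustrative assumptions, not part of the paper. Counts are named by expected and predicted class rather than by the TP/FP/FN/TN cell labels, and "positive" follows the paper's convention of meaning no intrusion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Sample counts named by (expected class, predicted class)."""
    pos_pred_pos: int  # expected positive, predicted positive
    pos_pred_neg: int  # expected positive, predicted negative
    neg_pred_pos: int  # expected negative, predicted positive
    neg_pred_neg: int  # expected negative, predicted negative


def evaluate(c: ConfusionCounts) -> dict:
    """Compute the statistical measures from one confusion matrix."""
    condition_pos = c.pos_pred_pos + c.pos_pred_neg
    condition_neg = c.neg_pred_pos + c.neg_pred_neg
    total = condition_pos + condition_neg
    accuracy = (c.pos_pred_pos + c.neg_pred_neg) / total
    return {
        "accuracy": accuracy,
        "miss_rate": 1.0 - accuracy,  # as defined in Eq. (16)
        "sensitivity": c.pos_pred_pos / condition_pos,
        "specificity": c.neg_pred_neg / condition_neg,
        "fallout": c.neg_pred_pos / condition_neg,
        "ppv": c.pos_pred_pos / (c.pos_pred_pos + c.neg_pred_pos),
        "npv": c.neg_pred_neg / (c.neg_pred_neg + c.pos_pred_neg),
    }


# Counts taken from Table 3 (training) and Table 4 (validation).
TRAINING = ConfusionCounts(96_950, 1_707, 1_838, 50_951)
VALIDATION = ConfusionCounts(40_836, 1_356, 1_433, 21_281)

if __name__ == "__main__":
    for name, counts in (("training", TRAINING), ("validation", VALIDATION)):
        rounded = {k: round(v, 3) for k, v in evaluate(counts).items()}
        print(name, rounded)
```

Rounded to three decimals, the training counts give an accuracy of 0.977 and a sensitivity of 0.983, and the validation counts give 0.957 and 0.968. The likelihood ratios are left out of the sketch because they are ratios of the rates already computed here.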
Table 6
Comparison Results of the Proposed Model with Literature.
Alshinina and Elleithy, 2018). The proposed model performance in Alshinina, R., Elleithy, K., 2018. A highly accurate machine learning approach for
developing wireless sensor network middleware. In: 2018 Wireless
terms of ‘‘detection accuracy and miss rate” during the training and
Telecommunications Sym. Phoenix, AZ, pp. 1–7.
validation phase. During training, the proposed model gives 97.6% Besharati, E., Naderan, M., Namjoo, E., 2019. Logistic regression host-based
and 2.4% detection accuracy and miss rate, respectively. And dur- intrusion detection system for cloud environments. J. Ambient Intell.
ing validation, the proposed model gives 95.7% and 4.3% detection Humaniz. Comput. 5, (4), 3669–3692.
Dainotti, A., Pescapé, A., Ventre, G., 2016. Worm traffic analysis and
accuracy and miss rate, respectively. The summary of results anal- characterization. IEEE Commun. 2 (3), 1435–1442.
ysis, which clearly shows that the detection accuracy is improved Dean, J., Ghemawat, S., 2010. MapReduce: simplified data processing on large
around 3% to 22% as compared to previously proposed machine clusters. In: Proceedings of the OSDI, pp. 137–150.
Gao, X., Shan, C., Hu, C., Niu, Z., Liu, Z., 2019. An adaptive ensemble machine learning
techniques. The proposed model is clearly shown that the pre- model for intrusion detection. IEEE Access 7 (3), 82512–82521.
sented approach gives better results than the previously published Haider, A., Khan, M.A., Rehman, A., Kim, H.S., 2021. A real-time sequential deep
approaches. extreme learning machine cybersecurity intrusion detection system. Comput.,

5. Conclusion

This work presents MR-IMID, a MapReduce-based intelligent model that
uses ML techniques for intrusion detection. It combines MapReduce with
an ML technique, an artificial neural network (ANN), to manage large
amounts of data in large networks and identify intrusions efficiently.
The MapReduce approach is effective for parallel clustering of a large
dataset, and ML helps extract similar features from the clustered data
and supports the tagging process. Based on this tagged data, the
proposed model can quickly identify intrusions and store the data in
the database for early identification of future attacks. Test findings
on a real intrusion dataset show that MR-IMID with ANN scales
effectively as datasets grow. The results also show that using more
training data improves detection outcomes while keeping false alarms to
a minimum. The resulting security system achieves a detection accuracy
of 97.6% in training and 95.7% in validation, which is better than
previously published approaches.
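To make the idea of parallel clustering with MapReduce more concrete, the following is a minimal single-process sketch of one clustering iteration written in map/reduce style. It uses k-means as a stand-in clustering technique, which is an assumption; this section does not state which clustering algorithm MR-IMID uses. All function names and the sample records are hypothetical. In a real Hadoop deployment the map calls would run in parallel on separate data partitions, whereas here they run sequentially.

```python
from functools import reduce
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]
Partial = Dict[int, Tuple[Vector, int]]  # cluster id -> (per-dimension sum, count)


def nearest(point: Vector, centroids: List[Vector]) -> int:
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_partition(records: List[Vector], centroids: List[Vector]) -> Partial:
    """Map step: per-cluster sums and counts for one data partition."""
    partial: Partial = {}
    for point in records:
        cid = nearest(point, centroids)
        if cid in partial:
            total, count = partial[cid]
            partial[cid] = (tuple(t + p for t, p in zip(total, point)), count + 1)
        else:
            partial[cid] = (point, 1)
    return partial


def reduce_partials(a: Partial, b: Partial) -> Partial:
    """Reduce step: merge two partial results by adding sums and counts."""
    merged = dict(a)
    for cid, (total, count) in b.items():
        if cid in merged:
            m_total, m_count = merged[cid]
            merged[cid] = (tuple(x + y for x, y in zip(m_total, total)), m_count + count)
        else:
            merged[cid] = (total, count)
    return merged


def kmeans_step(partitions: List[List[Vector]], centroids: List[Vector]) -> List[Vector]:
    """One k-means iteration; centroids with no assigned points stay unchanged."""
    merged = reduce(reduce_partials, (map_partition(p, centroids) for p in partitions), {})
    return [
        tuple(s / merged[i][1] for s in merged[i][0]) if i in merged else c
        for i, c in enumerate(centroids)
    ]


# Hypothetical two-feature records split across two partitions.
partitions = [[(0.0, 0.0), (1.0, 1.0)], [(9.0, 9.0), (10.0, 10.0)]]
centroids = kmeans_step(partitions, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

Because each map call only emits per-cluster sums and counts, the amount of data passed to the reduce step depends on the number of clusters rather than the number of records, which is what lets this pattern scale as datasets grow.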
International Conference on Communication Systems and Network
Technologies, Gwalior, India, pp. 987–990.
Li, Y., Xia, J., Zhang, S., Yan, J., 2012. An efficient intrusion detection system based on
References support vector machines and gradually feature removal method. Expert Syst.
Appl. 2 (39), 424–430.
Mohammadi, S., Mirvaziri, H., Ghazizadeh-Ahsaee, M., Karimipour, H., 2019. Cyber
Ahmad, G., Alanazi, S., Alruwaili, M., Ahmad, F., Khan, M.A., et al., 2021. Intelligent
intrusion detection by combined feature selection algorithm. J. Inf. Secur Appl.
ammunition detection and classification system using convolutional neural
10 (44), 80–88.
network. Comput., Mater. Continua 67 (2), 2585–2600.
Nadeem, L., Azam, M.A., Amin, Y., Ghamdi, M.A., Chai, K.K., et al., 2021. Integration
Alghamdi, M.A., Khan, M.F.N., Khan, A.K., Khan, I., Ahmed, A., et al., 2021. Pv model
of D2D, network slicing, and MEC in 5G cellular networks: Survey and
parameter estimation using modified fpa with dynamic switch probability and
challenges. IEEE Access 9 (7), 37590–37612.
step size function. IEEE Access 9 (4), 42027–42044.
Panda, M., Abraham, A., Patra, M.R., 2010. Discriminative multinomial Naïve Bayes
Aljarah, I., Ludwig, S.A., 2013. MapReduce intrusion detection system based on a
for network intrusion detection. In: Sixth Int. Conf. on Information Assurance
particle swarm optimization clustering algorithm. IEEE Congress Evol. Comput.,
and Security, Atlanta, GA, USA, pp. 5–10.
955–962
9730
M. Asif, S. Abbas, M.A. Khan et al. Journal of King Saud University – Computer and Information Sciences 34 (2022) 9723–9731
Pervez, M.S., Farid, D.M., 2014. Feature selection and intrusion classification in NSL- Tapiador, J.E., Orfila, A., Ribagorda, A., Ramos, B., 2013. Key-recovery attacks on
KDD cup 99 dataset employing SVMs. In: Proceedings of the 8th International KIDS, a keyed anomaly detection system. IEEE Trans. 12 (2), 312–325.
Conference on Software, Knowledge, Information Management and Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.A., 2009. A detailed analysis of the
Applications (SKIMA 2014), Dhaka, Bangladesh, pp. 1–6. KDD CUP 99 data set. In: 2009IEEE Sym. on Computational Intelligence for
Sarker, I.H., Salim, F.D., 2018. Mining user behavioral rules from smartphone data Security and Defense Applications, Chicago, Illinois, USA, pp. 1–6.
through association analysis. In: Proceedings of the 22nd Pacific-Asia Tsai, C.F., Hsu, Y.F., Lin, C.Y., Lin, W.Y., 2017. Intrusion detection by machine
Conference on Knowledge Discovery and Data Mining (PAKDD), Melbourne, learning: a review. Expert Syst. Appl. 8 (36), 11994–12000.
Australia, 3-6 June, pp. 450–461. Wagner, C., François, J., Engel, T., 2011. Machine learning approach for ip-flow
Sheikhan, M., Jadidi, Z., Farrokhi, A., 2012. Intrusion detection using reduced-size record anomaly detection. In: Proceedings of the International Conference on
RNN based on feature grouping. Neural Comput. Appl. 21 (6), 1185–1190. Research in Networking, Valencia, Spain, pp. 28–39.
Snir, M., Otto, S., Walker, D., Dongarra, J., 2015 no. 5. In: MPI: The complete White, T., 2010. Hadoop: The definitive guide, original. O’Reilly Media.
reference. MIT Press, Cambridge, pp. 1–16. Wu, W., 2020. Application of MapReduce parallel association mining on IDS in cloud
Sun, N., Zhang, J., Rimba, P., Gao, S., Zhang, L., 2018. Data-driven cybersecurity computing environment. J. Intell. Fuzzy Syst. Preprint 4 (3), 1–9.
incident prediction: a survey. IEEE Commun. 21 (1), 1744–1772. Xin, Y., Kong, L., Liu, Z., Chen, Y., Li, Y., 2018. Machine learning and deep learning
methods for cybersecurity. IEEE Access 6 (4), 35365–35381.
9731