MapReduce Based Intelligent Model For Intrusion Detection Using Machine Learning Techniques

The document discusses a MapReduce-based intelligent model called MR-IMID for intrusion detection using machine learning techniques. MR-IMID can detect intrusions in networks with large data sets reliably using commodity hardware. It processes data from multiple network sources in real-time for intrusion detection. MR-IMID also stores detected data in a database to minimize future inconsistencies. When tested, MR-IMID achieved a 97.7% detection accuracy during training and 95.7% during validation, outperforming previous approaches.


Journal of King Saud University – Computer and Information Sciences 34 (2022) 9723–9731


MapReduce based intelligent model for intrusion detection using machine learning technique

Muhammad Asif a, Sagheer Abbas a, M.A. Khan b, Areej Fatima c, Muhammad Adnan Khan d,⇑, Sang-Woong Lee d

a School of Computer Science, National College of Business Administration and Economics, Lahore 54000, Pakistan
b Riphah School of Computing & Innovation, Riphah International University, Lahore Campus, Lahore 54000, Pakistan
c Department of Computer Science, Lahore Garrison University, Lahore 54000, Pakistan
d Pattern Recognition and Machine Learning Lab, Department of Software, Gachon University, Seongnam 13557, South Korea

ARTICLE INFO

Article history:
Received 2 September 2021
Revised 19 November 2021
Accepted 10 December 2021
Available online 16 December 2021

Keywords:
Denial-of-Service
Intrusion detection system
Cyber-attacks
Network traffic
Hadoop distributed file system

ABSTRACT

With the emergence of the Internet of Things (IoT), the phenomenal expansion of computer networks, and an enormous number of related applications, data is continuously increasing. Cybersecurity has therefore gained significant importance in protecting networks from different cyber-attacks such as Intrusions, Denial-of-Service (DoS), Eavesdropping, Rushing Attack, etc. A traditional Intrusion Detection System (IDS) combined with the clustering technique plays a vital role in modern security. Still, it is limited in its ability to analyze vast volumes of data and identify an anomaly intelligently. Machine learning is a technique that may be combined with the MapReduce-Based Intelligent Model for Intrusion Detection (MR-IMID) to automate intrusion detection intelligently. In this research work, MR-IMID is proposed to detect intrusions on a network with multiple data classification tasks. The proposed MR-IMID processes big data sets reliably using commodity hardware. Multiple network sources are utilized in real time for intrusion detection. The MR-IMID detects intrusions by predicting unknown test scenarios and stores the data in the database to minimize future inconsistencies. The detection accuracy of the proposed model during the training and validation phases is 97.7% and 95.7%, respectively, which is better than previously published approaches.

© 2021 The Authors. Published by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

⇑ Corresponding author.
E-mail addresses: [email protected] (M. Asif), dr.sagheer@ncbae.edu.pk (S. Abbas), [email protected] (M.A. Khan), [email protected] (A. Fatima), [email protected] (M.A. Khan), [email protected] (S.-W. Lee).
Peer review under responsibility of King Saud University. Production and hosting by Elsevier.
https://doi.org/10.1016/j.jksuci.2021.12.008
1319-1578/© 2021 The Authors. Published by Elsevier B.V. on behalf of King Saud University.

1. Introduction

The need for cybersecurity and defense against different kinds of Cyber-Attacks (CAs) has been steadily rising. The major reasons are the attractiveness of IoT, the phenomenal expansion of Computer Networks (CNs), and the wide variety of related apps utilized by consumers for private or business purposes. Cyber-attacks, e.g., DoS attacks (Sun et al., 2018), computer malware, or unauthorized access (Dainotti et al., 2016), have caused severe disruption and economic difficulties in large-scale networks. For example, according to Tsai et al. (2017), a ransomware outbreak in May 2017 caused $8 billion in losses to various businesses, including banking, healthcare, energy, and colleges. According to other estimates, an information leak costs an impacted business 3.9 million USD on average and 8.19 million USD in the US (Mohammadi et al., 2019). As a result, in line with modern demands in cyberspace, the need for cybersecurity and defense against different forms of CAs is rising each day.

A Network Security (NS) framework and a Computer Security (CS) framework are usually included in a cybersecurity system. Different technologies, e.g., firewalls and encryption, are meant to manage CAs. An IDS is better able to prevent external attacks on CN (Tapiador et al., 2013). As a result, the major goal of the IDS is to identify and prevent different types of harmful network interactions. Traditional solutions, such as firewalls, are incapable of fulfilling the job effectively (Xin et al., 2018). An IDS monitors and evaluates everyday activity in a computer network. It identifies security concerns or attacks, e.g., DoS attacks, while recognizing harmful cyber actions. Inappropriate system activity, such as
illegal accessibility, alteration, or damage, may also be discovered, determined, and identified using an IDS (Wu, 2020). To assist a system's security, it is necessary to identify different CAs or inconsistencies in a network and construct an efficient IDS that plays an essential part in modern NS. Usually, intrusion detection involves extensive data set exploration.

Image analysis, pattern identification, social networks, massive Network Traffic (NT) analysis, etc., require the exploration of enormous datasets. Sequential techniques cannot process such applications because the data are too big. Conventional clustering-based intrusion detection approaches do not scale efficiently with increasing NT volumes. Additionally, massive NT analysis presents a performance issue when detecting anomalous links, necessitating a parallel method for intrusion detection. Generally, classic parallel algorithms developed using the "Message Passing Interface (MPI)" approach (Snir et al., 2015) encounter a variety of challenges, e.g., efficiently managing network connections and balancing the division of processing burden across various processors. Furthermore, parallel algorithms can be affected by node failure, which reduces the scalability of the algorithm. Consequently, building a scalable "parallel intrusion detection algorithm" that achieves high intrusion identification rates is required.

As an alternative to MPI (Snir et al., 2015), the "MapReduce programming model" (Dean and Ghemawat, 2010) has developed as a parallel processing approach, particularly for data-intensive tasks. The MapReduce technique has many features that make it a viable option for parallelizing data mining jobs, including ease of deployment and the elimination of the need to learn many parallel programming specifics. MapReduce also has many options for "node failure" and "load balancing". The dataset size and computer nodes determine how MapReduce separates the input dataset into distinct splits. There are two primary functions in MapReduce: "Map Function (MF)" and "Reduce Function (RF)". The MF generates intermediate results as "(key, values list) data pairs" by processing the input data records as "(key, value) data pairs". Then the RF combines and aggregates the intermediate "(values list)" of the MF with the same intermediate key. "High Availability Distributed Object-Oriented Platform (Hadoop)" (White, 2010) is an "Apache-developed" open-source platform that employs the MapReduce approach. It was designed to handle data-intensive tasks. One of Hadoop's features is its own distributed file system, known as "Hadoop Distributed File System (HDFS)", which is utilized to handle and process massive datasets. Additionally, Hadoop's MapReduce is intended to interact with HDFS effectively by bringing the computing process to the data rather than the other way around, allowing Hadoop to attain high data localization.

Another issue with intrusion detection is the scarcity of trained professionals who can monitor and respond to intrusions by analyzing the large data in clusters as the output of MapReduce. In the cybersecurity field, ML approaches have been successfully applied to create efficient strategies. ML has a lot of potential for identifying different forms of cyber-attacks, intrusion detection, malware classification and detection, privacy protection, and advanced threat detection, so it is becoming a useful tool for defenders. Present threats are becoming more complex and sophisticated as adversarial techniques evolve rapidly. For example, threat variants can easily slip past most current security technologies. As a result, self-learning methods should be able to deal with such issues. ML techniques have emerged as an essential tool for the entire security industry in this regard.

2. Literature review

An IDS is generally employed in the observation and evaluation of day-to-day activity in a computer network. It identifies security concerns or threat activity on a network (Wu, 2020). There have been many cybersecurity studies with the capacity to identify and prevent cyber assaults or intrusions. One of the most well-known technologies in the cyber sector is signature-based network intrusion detection (Sarker and Salim, 2018). This method uses a recognized signature and has recently achieved significant acceptance, as well as economic success. The "anomaly-based method", on the other hand, offers a benefit over the "signature-based approach" for recognizing hidden or "zero-day attacks" (Wu, 2020). This method analyses important security data to watch NT and identify behavioral attack patterns. Several data mining and ML approaches evaluate such security event patterns and make meaningful choices (Tapiador et al., 2013). The primary disadvantage of the anomaly-based method is that it might result in many false alarms since it can classify formerly unknown system actions as anomalies (Wu, 2020). As a result, minimizing an IDS's false-positive rate must be a key goal. Therefore, to reduce these concerns, a MapReduce-based effective identification method is required.

ML is a field of AI connected to "computational statistics, data mining, and data science". It is primarily concerned with teaching machines to learn from data (Li et al., 2012). It is closely linked to "mathematical techniques, statistical analysis, optimization," and other fields. Thus, ML is a data-driven technique in the cybersecurity domain, with the initial step being to comprehend raw security data in order to construct an "intelligent security model" for generating forecasts. ML approaches have commonly utilized association analysis to create "rule-based intelligent systems" (Wagner et al., 2011). Many famous approaches have been used to construct a data-driven predictive model (Li et al., 2012). These approaches include the "probability-based Naive Bayes (NB) classifier, hyperplane-based Support Vector Machine (SVM), instance-learning-based K-Nearest Neighbor (KNN), the sigmoid function-based Logistic Regression (LR) technique, and rule-based classification, such as Decision Trees (DT)" (Alghamdi et al., 2021; Nadeem et al., 2021; Khan et al., 2021; Ahmad et al., 2021).

Many researchers have employed the ML classification approaches listed above in the cybersecurity area, notably for identifying intrusions or CAs. For example, Li et al. (Kotpalliwar and Wajgi, 2015) demonstrated how to use the hyperplane-based SVM classifier with a Radial Basis Function (RBF) kernel for identifying preset attack types using the famous Knowledge Discovery in Databases (KDD'99) cup dataset. These types may include "DoS, Probe or Scan, User to Root (U2R), Remote to the user (R2L), and normal traffic". To develop a faster system, the authors trained the model on big datasets using a "least-squared SVM classifier". In Pervez and Farid (2014), the authors categorized the anomalies using an SVM classifier variation. The SVM classifier is utilized to identify anomalies and various kinds of CAs. These attacks included Network Basic Input/Output System (NetBIOS) scans, Denial of Service (DoS) attacks, Post Office Protocol (POP) spam, and Secure Shell (SSH) scans. A "one-class SVM classifier" is employed in Kokila and Selvi (2014) to identify the behavior of unseen computer worms.

Aljarah and Ludwig (2013) proposed an IDS called IDS-MRCPSO, which is based on a parallel Particle Swarm Optimization (PSO) clustering algorithm and the MapReduce technique, to tackle the processing of massive NT. PSO is a particularly effective approach for clustering. It eliminates the sensitivity to the initial cluster centroids. Using "commodity hardware", this IDS handles big data sets. To assess the system speed, tests were conducted on an actual intrusion data set. The test findings show that "IDS-MRCPSO" scales very close to the optimal speedup while improving detection accuracy. It is also effective for growing Training Data Set (TDS) volumes. Additionally, to prevent "random sampling" impacts, they built the detection model using the entire TDS. As a result,
their method covers more relevant parts of the TDS and creates a better detection model. The findings show that utilizing more TDS improves detection rates while reducing false alarms to a minimum. Their model has a limitation: it does not differentiate among various kinds of intrusions; instead, it merely determines whether an incursion has occurred.

An IDS makes it possible to detect harmful activity and intrusions early on. Most IDS methods, however, are overwhelmed by the present large amounts of NT. Innovative techniques are needed that can handle large volumes of traffic while retaining high efficiency during analysis. Wu (2020) presented a "distributed Network IDS" framework in which data analysis is carried out in a Cloud Computing (CC) setting. Sensors in numerous locations throughout the network, including network devices, servers, and individual workstations, capture NT, operating system logs, and primary application data. The MapReduce approach is used to gather, analyze, and compare data obtained from many sources. It looks for event correlations that might identify intrusion activities or malicious behavior. This framework can easily handle huge amounts of gathered data and heavy processing loads, efficiently scaling to business network settings. In addition, unlike prior IDS models, it can detect sophisticated attacks by correlating data from many sources and discovering patterns that may not be seen in centralized traffic collections or "single host log" analysis. The model's viability is demonstrated by experiments conducted on an actual cluster and cloud infrastructure.

Mining useful data from enormous data sets stored on the cloud has been a growing business trend. Yet, current IDS systems have proven incapable of adapting to "large-scale log data mining". As a result, Besharati et al. (2019) proposed an "association rule mining" technique based on the "MapReduce parallel computing" architecture. To begin, the "Apriori Frequent Itemset Mining (AFIM)" method is examined, and the "MapReduce approach" is utilized to parallelize and enhance it so that AFIM may be completed more quickly. Second, Apriori is intended to operate in parallel on the IDS. Lastly, the test was performed by constructing a "Hadoop cluster" using an open-source CC architecture. The findings demonstrate that the technique has better detection accuracy and takes less processing time on large amounts of data.

In contrast to the previous studies, this investigation presents an MR-IMID using ML Technique for intrusion detection. The proposed model captures the classification of security issues based on their significance. It combines MapReduce with an ML technique, the Artificial Neural Network (ANN), to manage large amounts of data in large networks and identify intrusions efficiently. The MapReduce approach is very effective in parallel clustering of a large dataset. The model then constructs a generic architecture for detecting intrusions centered on the identified significant characteristics to address the known problems.

3. Proposed methodology

Developing a data-driven intelligent IDS might enable computational security methods to examine distinct cyber event patterns and ultimately anticipate attacks using cybersecurity data. However, modeling CAs is difficult, and modern security datasets may include several security features that are less significant or not required. This paper presents an intrusion detection model, MR-IMID, based on MapReduce, which has proven to be an effective parallelization approach for many tasks. Furthermore, the proposed model includes ML for clustering analysis to recognize the important security features for building an intelligent detection model.

This research elaborates an IDS using the ML technique. The proposed MapReduce-based IDS employs a novel approach to reduce the number of false alarms over time. It is accomplished by obtaining human professional feedback and changing the learning model accordingly. The proposed method effectively reduces the probability of many false alarms with similar data. On the other hand, the presented MapReduce-based IDS can consider supervised learning approaches. As a result, there is label information in the training process. It also recommends changes in cases when traditional methods categorize training samples using human professionals.

The proposed IDS provides a method for streamlining the evaluation of human specialists' decisions. As a result, depending on predefined observations, the system may detect anomalies. Therefore, using "supervised feature learning," the system can detect human errors made while labeling data and offer corrections. Furthermore, the system may use a scoring method to detect new traffic segments. The proposed IDS provides a dynamically updated framework that addresses the previous techniques' adaptation problem. With little computing cost, the proposed MapReduce-based IDS improves the learning framework based on existing data and novel types of CAs.

Fig. 1 gives a visual representation of the main roles of each strategy in a cyber-event detection system. The legend in Fig. 1 shows the layout for different elements (e.g., "green rectangle for data collecting module"). The same format is used to convey each strategy so that the reader may connect each strategy to the framework, detailed further below.

Fig. 1. Proposed "MR-IMID" using ML Technique.

"Data sources" are those that create large amounts of data that may be gathered and processed to identify and prevent CAs. There are numerous ways of generating relevant data for security analytics (e.g., "windows logs, NetFlow data, and email logs"). The source(s) used vary per company, and data is collected from numerous sources and kept in a "Data Storage (DS)".

"Data collection" is the process of connecting a system to external data sources. This module collects data from the given data sources using various methods, saves it in a database, and then forwards it for preprocessing. Preparing data for later modules, particularly data processing, is known as "Data Pre-processing." Feature selection and extraction, elimination of duplicates and faulty records, data validation, and standardization are examples of preprocessing activities. The data is acquired from the DS, and the prepared data is stored back in the DS and fed to data processing. The "Data Processing module" makes use of big data technology to extract useful information about CAs. Results are generated in this module by using the ML technique. The "Data Post-processing module" uses ML to improve the findings of the "Data Processing module". To complete the related activities, this module, like "Data Processing", connects with the DS and is assisted by big data techniques.

Next, the system checks whether a cyber event has been found. If not, a simple message is displayed; if so, a trigger is raised and the triggered results are also stored in a database in the visualization module. The "Visualization module" makes use of a variety of methods (such as a "dashboard, a text or graphic report, and an email notifier") to convey finished results (such as security alerts) to security professionals.

After security data is processed in each module, "Cloud Big Data Storage Support" controls its distribution and back-and-forth storage. Not only does the module handle DS, but it also keeps track of the numerous methods that other modules use to access and change data. To provide the needed DS, the module leverages several DS technologies (e.g., "Highly parallel Integrated Virtual Environment (HIVE), HDFS, MongoDB, Cassandra"). The distribution of data processing among computer nodes is managed by "Big Data Processing Support". The module uses a big data framework (e.g.,
"Hadoop, Spark, or Storm"). To distribute processing across the computer nodes, the module supports the "data preprocessing, data processing, and data post-processing modules", as illustrated in Fig. 1. For example, a MapReduce program comprises an MF that filters and sorts data and an RF that conducts a summary process.

The publicly available intrusion data set from the "Kaggle ML Repository" is utilized in this research (kaggle.com). It is divided into two types: "regular and attack". The intrusion detection dataset includes 41 input features and one output. Each sample has a label indicating its traffic type, either regular or malicious. The attack groups in the NSL-KDD data belong to four major classes: DoS, Remote to Local (R2L), User-to-Root (U2R), and probe. All types of attacks include multiple sample attacks in the information collection. Table 1 presents a full overview of the features. The complete list of threats is presented in the following.

3.1. NSL-KDD dataset attacks description

DoS: back, land, neptune, pod, smurf, teardrop, processtable, udpstorm, mailbomb, apache2
Probe: ipsweep, nmap, saint, mscan, portsweep, satan
R2L: spy, warezclient, guesspassword, ftp_write, imap, multihop, named, phf, snmpgetattack, warezmaster, xlock, xsnoop, httptunnel, sendmail
U2R: bufferoverflow, loadmodule, perl, snmpguess, sqlattack, xterm, rootkit, ps, worm

The pseudocode of the process is shown in Table 2. In the data collecting process, all sorts of attacks comprise several sample attacks. As shown in Fig. 2, the framework flow of the proposed model is divided into two phases: training and validation. Each step is further divided into different stages. In the first phase, data is collected from various cybersecurity sensors installed in multiple locations. In this proposed research, a pre-labeled cybersecurity dataset is selected for the operation of the presented model. There are 216,352 cases in this dataset, with fifty-four features. Fifty-three features are independent. The remaining one, the output class, is dependent. The next important layer is preprocessing, which mitigates noisy data using moving average, normalization, and data cleaning with the mean imputation method. Then the preprocessed data of each class is split into a 70% training and a 30% testing data set. After this process, the training data is sent to the training layer, whereas the testing dataset is stored on cloud storage.

After preprocessing, the processed data is passed through the MapReduce procedure. MapReduce is the foundation of the data analysis processing model. Its framework is basic and easy to grasp. The "Map" and "Reduce" functions are the two most important activities in MapReduce. No matter how complicated the MapReduce operation is, it must go through these two steps, shown in Fig. 3.

$$ \mathrm{DAG} = (W, E, \mathrm{DAGinfo}) \tag{1} $$

$$ W = \{ W_{name}, \{\mathrm{Map}\}, \{\mathrm{Reduce}\}, \mathrm{Param}, \mathrm{Input}, \mathrm{Output} \} \tag{2} $$

Here, W is the number of job streams acquired after processing. "Wname" is the task's name. "Map" and "Reduce" are the map and reduce processing procedures. "Param" is the task's configuration parameter. "Input" and "Output" represent the data source kind of the input and output tasks. E indicates the connection between two tasks in the "Directed Acyclic Graph (DAG)" diagram. "DAGinfo" is the DAG's unique identifying information. Eq. (2)'s "Map processing (MP)" may be described as:
Table 1
Dataset structure (kaggle.com).

Sr. No.  Features            Datatype   Sr. No.  Features                      Datatype
1        duration            Integer    22       is_guest_login                Integer
2        protocol_type       Nominal    23       count                         Integer
3        service             Nominal    24       srv_count                     Integer
4        flag                Nominal    25       serror_rate                   Float
5        src_bytes           Integer    26       srv_serror_rate               Float
6        dst_bytes           Integer    27       rerror_rate                   Float
7        land                Integer    28       srv_rerror_rate               Float
8        wrong_fragment      Integer    29       same_srv_rate                 Float
9        urgent              Integer    30       diff_srv_rate                 Float
10       hot                 Integer    31       srv_diff_host_rate            Float
11       num_failed_logins   Integer    32       dst_host_count                Float
12       logged_in           Integer    33       dst_host_srv_count            Float
13       num_compromised     Integer    34       dst_host_same_srv_rate        Float
14       root_shell          Integer    35       dst_host_diff_srv_rate        Float
15       su_attempted        Integer    36       dst_host_same_src_port_rate   Float
16       num_root            Integer    37       dst_host_srv_diff_host_rate   Float
17       num_file_creations  Integer    38       dst_host_serror_rate          Float
18       num_shells          Integer    39       dst_host_srv_serror_rate      Float
19       num_access_files    Integer    40       dst_host_rerror_rate          Float
20       num_outbound_cmds   Integer    41       dst_host_srv_rerror_rate      Float
21       is_host_login       Integer
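
The Map and Reduce functions described in the text, applied to records that carry Table 1's feature names, can be illustrated with a minimal single-process sketch. It is not the authors' Hadoop job: the toy records, the `label` field, and the helper names `map_function`, `shuffle`, `reduce_function`, and `run_job` are assumptions made for illustration. The map step emits (key, value) pairs, the shuffle step groups them into (key, values list) pairs, and the reduce step aggregates the values that share a key.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical toy records: a few Table 1 feature names plus an assumed label.
RECORDS: List[dict] = [
    {"protocol_type": "tcp", "service": "http", "src_bytes": 181, "label": "normal"},
    {"protocol_type": "tcp", "service": "private", "src_bytes": 0, "label": "attack"},
    {"protocol_type": "icmp", "service": "ecr_i", "src_bytes": 1032, "label": "attack"},
    {"protocol_type": "udp", "service": "domain_u", "src_bytes": 45, "label": "normal"},
]


def map_function(record: dict) -> Iterator[Tuple[str, int]]:
    """Map step: emit one (key, value) pair per input record."""
    yield (record["label"], 1)


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate values by key, giving (key, values list) pairs."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_function(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: aggregate all values that share the same key."""
    return (key, sum(values))


def run_job(records: Iterable[dict]) -> Dict[str, int]:
    """Chain map, shuffle, and reduce in a single process."""
    intermediate = (pair for record in records for pair in map_function(record))
    return dict(reduce_function(k, v) for k, v in shuffle(intermediate).items())


label_counts = run_job(RECORDS)
print(label_counts)
```

On a Hadoop cluster the grouping between the map and reduce stages is done by the framework; it is written out here as `shuffle` only so that the data flow is visible.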

$$ \mathrm{Map} = (M_{name}, \mathrm{InQ}, \mathrm{InVal}, \mathrm{OutQ}, \mathrm{OutVal}, \mathrm{Properties}) \tag{3} $$

Here, "Mname" denotes the MP name. "InQ" and "InVal" denote the types of the "Key-Value Pair (KVP)" entered during the Map procedure. "OutQ" and "OutVal" denote the output KVP types. "Properties" represents the property factors necessary for MP. In Eq. (2), the "Reduce processing (RP)" may be represented as follows:

$$ \mathrm{Reduce} = (R_{name}, \mathrm{InQ}, \mathrm{InVal}, \mathrm{OutQ}, \mathrm{OutVal}, \mathrm{Properties}) \tag{4} $$

Here, "Rname" is the Reduce process name. Other parameters are represented in the same way as for the Map.

"E" is represented as described in Eq. (5).

$$ E = (\mathrm{Path}, \mathrm{StartTk}, \mathrm{EndTk}) \tag{5} $$

"Path" denotes the data stream's transmission path. "StartTk" indicates the current task. "EndTk" denotes the next task.

Table 2
Pseudo code for the proposed Model.

Sr No.  Steps
1       Start
2       Initialization
3       Feature extraction
4       Training Phase
4.1       Initialization of $x_{ij}$, nf, M, E, q, $K_{j,l}$; Error (G) = 0 and the number of epochs £ = 0
4.2       For each training pattern r
            a) do the feedforward phase to
               i) Calculate $q_l$ for each hidden layer, using Eq. (6)
            b) Compute output error signals and hidden layer error signals using Eq. (7)
            c) do the backpropagation phase to
               i) Update the hidden layer weights $K^{+}_{j,l}$ using Eq. (11)
               ii) Update the weights $x_{ij}$ of all using Eq. (12)
4.3       £ = £ + 1
4.4       Test detection criteria: if no detection criteria are satisfied, go to step 5. Else go to step 4
4.5       Store in cloud
5       Validation Phase
5.1       Import trained data from cloud
5.2       Import input
5.3       Validate detection criteria: if no detection criteria are satisfied, access grant, else data store in intrusion database
6       Stop

After the MapReduce procedure, a classification process is performed in the training layer to predict cyber events further using the ANN technique. The proposed ANN technique applies a minimum of three levels in the training layer: an input, a hidden, and an output layer. Backpropagation is assembled in several steps, including "weight initialization, feedforward, feedback error propagation, and weight and bias updates". The presented model's "sigmoid function" for the input, hidden, and output layers may be expressed as

$$ q_l = \frac{1}{1 + e^{-\left( c_2 + \sum_{j=1}^{n} K_{j,l} \, \frac{1}{1 + e^{-\left( c_1 + \sum_{i=1}^{m} x_{ij} r_i \right)}} \right)}}, \quad \text{where } j = 1, 2, 3, \ldots, n \text{ and } l = 1, 2, 3, \ldots, r \tag{6} $$

In the above equation, $r_i$, $x_{ij}$, $c_1$, $K_{j,l}$, and $c_2$ represent the input features, the weights between the ith input and jth hidden layer neurons, the bias of the hidden layer, the weights between the jth hidden layer and lth output layer neurons, and the bias of the output layer, respectively. The inner sigmoid term is the output $q_j$ of the jth hidden layer neuron.

The minimum mean square error can be calculated as given below:

$$ G = \frac{1}{2} \sum_{l} \left( s_l - q_l \right)^2 \tag{7} $$

where $s_l$ represents the desired output and $q_l$ the calculated output.

The change in weight for both layers can be calculated as given below:

$$ \Delta W \propto -\frac{\partial G}{\partial W} $$

$$ \Delta K_{j,l} = -Z \frac{\partial G}{\partial K_{j,l}} \tag{8} $$

with Z the constant of proportionality. Eq. (8) can be written as

$$ \Delta K_{j,l} = -Z \frac{\partial G}{\partial q_l} \times \frac{\partial q_l}{\partial w_l} \times \frac{\partial w_l}{\partial K_{j,l}} \tag{9} $$

where $w_l$ is the weighted input of the lth output layer neuron. Applying the same chain rule to the weights between the input and hidden layers and simplifying gives

$$ \Delta x_{i,j} = Z \left[ \sum_{l} \left( s_l - q_l \right) \times q_l \left( 1 - q_l \right) \times K_{j,l} \right] \times q_j \left( 1 - q_j \right) \times r_i $$

Fig. 2. Framework for Data Flow of Proposed ‘‘MR-IMID” using ML Technique.
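
The feedforward and backpropagation steps of the training phase (Table 2, Eqs. (6) to (12)) can be checked numerically with a minimal pure-Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the toy pattern, the initial weights, the function names, and the choice Z = 1 are all hypothetical. The output error signal is taken as n_l = (s_l - q_l) q_l (1 - q_l) and the output-layer change as Z n_l q_j, which is what Eq. (9) evaluates to for a sigmoid output; the text does not spell these two forms out, so they are assumptions here. The biases c1 and c2 are held fixed as a simplification.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


def feedforward(r: List[float], x: Matrix, K: Matrix,
                c1: float, c2: float) -> Tuple[List[float], List[float]]:
    """Eq. (6): hidden outputs q_j first, then network outputs q_l."""
    n_hidden, n_out = len(K), len(K[0])
    q_hidden = [sigmoid(c1 + sum(x[i][j] * r[i] for i in range(len(r))))
                for j in range(n_hidden)]
    q_out = [sigmoid(c2 + sum(K[j][l] * q_hidden[j] for j in range(n_hidden)))
             for l in range(n_out)]
    return q_hidden, q_out


def error(s: List[float], q_out: List[float]) -> float:
    """Eq. (7): G = 1/2 * sum over l of (s_l - q_l)^2."""
    return 0.5 * sum((s_l - q_l) ** 2 for s_l, q_l in zip(s, q_out))


def weight_changes(r, s, x, K, c1, c2, Z=1.0) -> Tuple[Matrix, Matrix]:
    """Return (delta_K, delta_x), i.e. -Z times the gradient of G."""
    q_hidden, q_out = feedforward(r, x, K, c1, c2)
    # Output error signal n_l (assumed form, from the sigmoid derivative).
    n_out = [(s_l - q_l) * q_l * (1.0 - q_l) for s_l, q_l in zip(s, q_out)]
    # Hidden error signal n_j = [sum_l n_l K_jl] * q_j (1 - q_j).
    n_hid = [sum(n_out[l] * K[j][l] for l in range(len(n_out)))
             * q_hidden[j] * (1.0 - q_hidden[j]) for j in range(len(K))]
    delta_K = [[Z * n_out[l] * q_hidden[j] for l in range(len(n_out))]
               for j in range(len(K))]
    delta_x = [[Z * n_hid[j] * r[i] for j in range(len(K))]
               for i in range(len(r))]
    return delta_K, delta_x


def train_step(r, s, x, K, c1, c2, k_F=0.5) -> Tuple[Matrix, Matrix]:
    """Eqs. (11) and (12): add k_F times the weight change to each weight."""
    delta_K, delta_x = weight_changes(r, s, x, K, c1, c2)
    new_K = [[K[j][l] + k_F * delta_K[j][l] for l in range(len(K[0]))]
             for j in range(len(K))]
    new_x = [[x[i][j] + k_F * delta_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return new_x, new_K


# Toy pattern: two input features, two hidden neurons, one output neuron.
r, s = [0.2, 0.7], [1.0]
x = [[0.1, -0.3], [0.4, 0.2]]
K = [[0.5], [-0.6]]
c1, c2 = 0.1, -0.2

initial_error = error(s, feedforward(r, x, K, c1, c2)[1])
for _ in range(50):
    x, K = train_step(r, s, x, K, c1, c2)
final_error = error(s, feedforward(r, x, K, c1, c2)[1])
print(initial_error, final_error)
```

The learning rate k_F scales every update, so a value that is too large can make the error oscillate instead of shrink, which is why convergence depends on its careful selection.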

Fig. 3. MapReduce architecture.

$$\Delta x_{i,j} = Z\left[\sum_{l} n_{l}\,K_{j,l}\right] \times q_{j}\,(1 - q_{j}) \times r_{i}$$

$$\Delta x_{i,j} = Z\,n_{j}\,r_{i} \qquad (10)$$

where

$$n_{j} = \left[\sum_{l} n_{l}\,K_{j,l}\right] \times q_{j}\,(1 - q_{j})$$

The weight and bias between the output and hidden layers are updated as shown in Eq. (11):

$$K^{+}_{j,l} = K_{j,l} + k_{F}\,\Delta K_{j,l} \qquad (11)$$

The weight and bias between the input layer and the hidden layer are updated as shown in Eq. (12):

$$x^{+}_{i,j} = x_{i,j} + k_{F}\,\Delta x_{i,j} \qquad (12)$$

k_F is the learning rate of the proposed "MR-IMID" model for intrusion detection. Convergence of the proposed "MR-IMID" depends upon the careful selection of k_F. After the training layer, its output is sent to the performance layer, which predicts the cyber event and checks, based on detection accuracy and miss rate, whether the learning criteria are met. If the answer is 'No', the training layer is updated again, and so on; if the answer is 'Yes', the output is saved to cloud data storage.
In the validation phase, the test data and the learned patterns stored in the cloud are imported from the cloud database and passed to the ANN-based model to predict whether an intrusion is found. If the answer is 'No', the process is aborted; if the answer is 'Yes', a message states that an intrusion has been found.

4. Simulation results

This data set is the most widely used standard test set for network IDSs (kaggle.com). The information in this data set is split into two parts: the training dataset and the validation dataset. The training data has a unique identification, but the test data is unidentified. The test data also includes certain attack types that were not present in the training data. This makes the system's identification more accurate and trustworthy.
The proposed MR-IMID using the ML technique is implemented on this data set. The data were preprocessed to remove inconsistencies and to guard against errors. The MapReduce-based IDS looks for malicious behavior or intrusion using different hidden layers (including hidden neurons) and activation functions. Furthermore, many neurons are assessed in the network's hidden

layers, and various activation functions are implemented. The simulation results of the proposed MapReduce-based IDS are used to assess the system's efficiency. The Spyder tool is used for the training and validation process. All experiments were performed on a desktop computer running 64-bit Windows 10 with an Intel(R) Xeon(R) processor. The detection speed of the proposed model is measured using the operating system clock. The detection time was approximately 0.2147 s per record. The data classification consumes only a few milliseconds. The output of the proposed MR-IMID is compared with that of its counterparts using multiple statistical measures, as shown in Eqs. (13) to (21).

$$\text{Sensitivity} = \frac{\sum \text{True Positive}}{\sum \text{Condition Positive}} \qquad (13)$$

$$\text{Specificity} = \frac{\sum \text{True Negative}}{\sum \text{Condition Negative}} \qquad (14)$$

$$\text{Accuracy} = \frac{\sum \text{True Positive} + \sum \text{True Negative}}{\sum \text{Total Population}} \qquad (15)$$

$$\text{Miss-Rate} = 1 - \text{Accuracy} \qquad (16)$$

$$\text{Fallout} = \frac{\sum \text{False Positive}}{\sum \text{Condition Negative}} \qquad (17)$$

$$\text{Likelihood Positive Ratio} = \frac{\sum \text{True Positive Ratio}}{\sum \text{False Positive Ratio}} \qquad (18)$$

$$\text{Likelihood Negative Ratio} = \frac{\sum \text{False Negative Ratio}}{\sum \text{True Negative Ratio}} \qquad (19)$$

$$\text{Positive Predictive Value} = \frac{\sum \text{True Positive}}{\sum \text{Predicted Condition Positive}} \qquad (20)$$

$$\text{Negative Predictive Value} = \frac{\sum \text{True Negative}}{\sum \text{Predicted Condition Negative}} \qquad (21)$$

Tables 3 and 4 show the training and validation results in terms of detection accuracy and miss rate. The ANN algorithm has been applied to a dataset of 216,352 records, divided for each class into 70 percent for training (151,446 samples) and 30 percent for validation (64,906 samples). Various statistical measures are utilized for comparison. Performance is calculated using detection accuracy, sensitivity, specificity, miss-rate, fall-out, Likelihood Positive Ratio (LR+), Likelihood Negative Ratio (LR-), precision, and Negative Predictive Value (NPV). The "True Positive Rate (TPR)" is expressed as sensitivity. The "True Negative Rate (TNR)" is defined as specificity. The "False Negative Rate (FNR)" is described as miss-rate. The "False-Positive Rate (FPR)" is expressed as fallout. The "Positive Predictive Value (PPV)" is expressed as precision.

Table 3
Training of the proposed MR-IMID using ML Technique.

Total number of samples (151,446) | Result (output) |
Expected output | Predicted Positive | Predicted Negative
Input: 98,657 (Positive) | True Positive (TP): 96,950 | False Negative (FN): 1,707
Input: 52,789 (Negative) | False Positive (FP): 1,838 | True Negative (TN): 50,951

Table 4
Validation of the proposed MR-IMID using ML Technique.

Total number of samples (64,906) | Result (output) |
Expected output | Predicted Positive | Predicted Negative
Input: 42,192 (Positive) | True Positive (TP): 40,836 | False Negative (FN): 1,356
Input: 22,714 (Negative) | False Positive (FP): 1,433 | True Negative (TN): 21,281

Table 3 shows the intrusion detection of the proposed MR-IMID model on the server during the training phase. During training, a total of 151,446 samples are utilized, split into 98,657 positive samples and 52,789 negative samples. Of the positive samples, 96,950 are accurately predicted as positive (true positives), meaning that no intrusion is identified, and 1,707 records are mistakenly predicted as negative, implying that an intrusion is identified. Similarly, of the 52,789 negative samples, where negative means that an intrusion is detected, 50,951 are accurately predicted as negative, indicating intrusion, and 1,838 are incorrectly predicted as positive, indicating that no intrusion is identified even though an intrusion on the server exists.
Table 4 shows the intrusion detection of the MR-IMID model on the server during the validation phase. During validation, a total of 64,906 samples are utilized, split into 42,192 positive samples and 22,714 negative samples. Of the positive samples, 40,836 are properly identified as positive, indicating that no intrusion has occurred, and 1,356 records are mistakenly predicted as negative, indicating that an intrusion has occurred. Similarly, of the 22,714 negative samples, where negative implies that an intrusion is identified, 21,281 are accurately predicted as negative, indicating intrusion, and 1,433 are incorrectly predicted as positive, indicating that no intrusion is identified even though an intrusion on the network exists.
Table 5 shows the performance of the proposed model in terms of detection accuracy, sensitivity, specificity, miss rate, and precision during the training and validation phases. During training, the proposed model gives 97.6%, 0.983, 0.965, 2.4%, and 0.981 for detection accuracy, sensitivity, specificity, miss rate, and precision, respectively. During validation, it gives 95.7%, 0.968, 0.937, 4.3%, and 0.966, respectively.
In addition, further statistical measures of the proposed model are reported: FPR, LR+, LR-, and NPV are 0.035, 28.08, 0.025, and 0.967 during training and 0.063, 15.36, 0.045, and 0.940 during validation, respectively.
Table 6 shows a comparison of the performance of the proposed MR-IMID using the ML technique with previous approaches (Haider et al., 2021; Sheikhan et al., 2012; Gao et al., 2019; Ingre and Yadav, 2015; Khan et al., 2021; Tavallaee et al., 2009; Ibrahim et al., 2013; Ingre and Yadav, 2015; Panda et al., 2010;

Table 5
Performance Evaluation of Proposed MR-IMID Model in Training and Validation Using Different Statistical Measures.

Phase | Detection Accuracy (%) | Sensitivity (TPR) | Specificity (TNR) | Miss-Rate (FNR) (%) | Fall-out (FPR) | LR+ | LR- | PPV (Precision) | NPV
Training | 97.6 | 0.983 | 0.965 | 2.4 | 0.035 | 28.08 | 0.025 | 0.981 | 0.967
Validation | 95.7 | 0.968 | 0.937 | 4.3 | 0.063 | 15.36 | 0.045 | 0.966 | 0.940
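The measures of Eqs. (13) to (21) can be recomputed directly from the confusion-matrix counts in Tables 3 and 4. The sketch below is illustrative only and the function name `ids_metrics` is hypothetical. It treats the actual-positive records predicted negative (1,707 in training) as false negatives and the actual-negative records predicted positive (1,838 in training) as false positives. Following the text, the miss rate is taken as 1 - accuracy (Eq. (16)) and is also used as the numerator of LR-, which is an assumption about how the reported values were obtained.

```python
def ids_metrics(tp, fn, fp, tn):
    """Compute the evaluation measures of Eqs. (13)-(21) from raw counts."""
    total = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)           # Eq. (13), TPR
    specificity = tn / (tn + fp)           # Eq. (14), TNR
    accuracy = (tp + tn) / total           # Eq. (15)
    miss_rate = 1 - accuracy               # Eq. (16), as defined in the text
    fallout = fp / (fp + tn)               # Eq. (17), FPR
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "miss_rate": miss_rate,
        "fallout": fallout,
        "lr_pos": sensitivity / fallout,   # Eq. (18)
        "lr_neg": miss_rate / specificity, # Eq. (19), with miss rate as FNR (assumption)
        "ppv": tp / (tp + fp),             # Eq. (20), precision
        "npv": tn / (tn + fn),             # Eq. (21)
    }


training = ids_metrics(tp=96950, fn=1707, fp=1838, tn=50951)
validation = ids_metrics(tp=40836, fn=1356, fp=1433, tn=21281)

for name, result in (("Training", training), ("Validation", validation)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

Sensitivity, specificity, fall-out, and precision computed this way round to the values in Table 5. The likelihood ratios come out slightly different when computed from unrounded rates; the tabulated LR+ values are consistent with dividing the rounded rates (for example, 0.983 / 0.035 for training).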


Table 6
Comparison Results of the Proposed Model with Literature.

Reference | Preprocessing Layer | MapReduce | Machine Learning Technique | Phase | Detection Accuracy (%) | Detection Miss-Rate (%)
Haider et al. (2021) | No | No | RTS-DELM | Training (70%) | 96.2 | 3.8
 | | | | Validation (30%) | 92.7 | 7.3
Sheikhan et al. (2012) | No | No | RNN | Training (60%) | 94.1 | 5.9
 | | | | Validation (40%) | 93.8 | 6.2
Gao et al. (2019) | No | No | Adaptive voting algorithm | Training (85%) | 85.2 | 14.8
 | | | | Validation (15%) | 84.5 | 15.5
Ingre and Yadav (2015) | No | No | ANN | Training (85%) | 81.2 | 18.8
 | | | | Validation (15%) | 79.9 | 20.1
Khan et al. (2021) | Yes | No | DELM | Training (70%) | 92.1 | 7.9
 | | | | Validation (30%) | 91.3 | 8.7
Tavallaee et al. (2009) | No | No | SVM | Training (85%) | 93.5 | 6.5
 | | | | Validation (15%) | 92.1 | 7.9
Ibrahim et al. (2013) | No | No | SOMNN | Training (65%) | 75.4 | 24.6
 | | | | Validation (35%) | 73.1 | 26.9
Ingre and Yadav (2015) | No | No | ANN | Training (45%) | 81.2 | 18.8
 | | | | Validation (55%) | 79.9 | 20.1
Panda et al. (2010) | No | No | Discriminative Multinomial Naïve Bayes + RP | Training (93%) | 81.5 | 18.5
 | | | | Validation (7%) | 80.6 | 19.4
Alshinina and Elleithy (2018) | No | No | GANs | Training (85%) | 86.5 | 13.5
 | | | | Validation (15%) | 81.1 | 18.9
Proposed MR-IMID | Yes | Yes | ANN | Training (70%) | 97.6 | 2.4
 | | | | Validation (30%) | 95.7 | 4.3

Alshinina and Elleithy, 2018) in terms of detection accuracy and miss rate during the training and validation phases. During training, the proposed model gives 97.6% detection accuracy and a 2.4% miss rate. During validation, it gives 95.7% detection accuracy and a 4.3% miss rate. The summary of the results shows that detection accuracy is improved by around 3% to 22% compared with previously proposed machine learning techniques. The presented approach therefore gives better results than the previously published approaches.

5. Conclusion

In this research work, an MR-IMID using an ML technique is presented for intrusion detection. It combines MapReduce with an ML technique, ANN, to manage large amounts of data in large networks and identify intrusions efficiently. The MapReduce approach is very effective for parallel clustering of a large dataset, and ML is very helpful for extracting similar features in the clustered data and for the tagging process. Based on this tagged data, the proposed model can quickly identify intrusions and store the data in the database for early identification of future attacks. The test findings on a real intrusion dataset reveal that MR-IMID with ANN scales effectively as datasets grow. The results also show that utilizing more training data improves detection outcomes while reducing false alarms to a minimum. The resulting security system's detection accuracy is 97.6% in training and 95.7% in validation, which is better than previously published approaches.

References

Ahmad, G., Alanazi, S., Alruwaili, M., Ahmad, F., Khan, M.A., et al., 2021. Intelligent ammunition detection and classification system using convolutional neural network. Comput., Mater. Continua 67 (2), 2585–2600.
Alghamdi, M.A., Khan, M.F.N., Khan, A.K., Khan, I., Ahmed, A., et al., 2021. Pv model parameter estimation using modified fpa with dynamic switch probability and step size function. IEEE Access 9 (4), 42027–42044.
Aljarah, I., Ludwig, S.A., 2013. MapReduce intrusion detection system based on a particle swarm optimization clustering algorithm. IEEE Congress Evol. Comput., 955–962.
Alshinina, R., Elleithy, K., 2018. A highly accurate machine learning approach for developing wireless sensor network middleware. In: 2018 Wireless Telecommunications Sym. Phoenix, AZ, pp. 1–7.
Besharati, E., Naderan, M., Namjoo, E., 2019. Logistic regression host-based intrusion detection system for cloud environments. J. Ambient Intell. Humaniz. Comput. 5 (4), 3669–3692.
Dainotti, A., Pescapé, A., Ventre, G., 2016. Worm traffic analysis and characterization. IEEE Commun. 2 (3), 1435–1442.
Dean, J., Ghemawat, S., 2010. MapReduce: simplified data processing on large clusters. In: Proceedings of the OSDI, pp. 137–150.
Gao, X., Shan, C., Hu, C., Niu, Z., Liu, Z., 2019. An adaptive ensemble machine learning model for intrusion detection. IEEE Access 7 (3), 82512–82521.
Haider, A., Khan, M.A., Rehman, A., Kim, H.S., 2021. A real-time sequential deep extreme learning machine cybersecurity intrusion detection system. Comput., Mater. Continua 66 (2), 1785–1798.
Ibrahim, L.M., Basheer, D.T., Mahmod, M.S., 2013. A comparison study for intrusion database (Kdd99, Nsl-Kdd) based on self-organization map (SOM) artificial neural network. J. Eng. Sci. Technol. 8 (1), 107–119.
Ingre, B., Yadav, A.B., 2015. Performance analysis of NSL-KDD dataset using ANN. In: Int. Conf. on Signal Processing and Communication Engineering Systems, Guntur, India. IEEE, pp. 92–96.
Ingre, B., Yadav, A., 2015. Performance analysis of NSL-KDD dataset using ANN. In: IEEE: In 2015 International Conference on Signal Processing and Communication Engineering Systems, pp. 92–96.
<https://www.kaggle.com>.
Khan, A.H., Khan, M.A., Abbas, S., Siddiqui, S.Y., Saeed, M.A., et al., 2021. Simulation, modeling, and optimization of intelligent kidney disease predication empowered with computational intelligence approaches. Comput., Mater. Continua 67 (2), 1399–1412.
Khan, M.A., Rehman, A., Khan, K.M., Almotiri, S.H., 2021. Enhance intrusion detection in computer networks based on deep extreme learning machine. Comput., Mater. Continua 66 (1), 467–480.
Kokila, R., Selvi, S.T., 2014. DDoS detection and analysis in SDN-based environment using support vector machine classifier. In: Proceedings of the 2014 Sixth International Conference on Advanced Computing (ICoAC), Chennai, India, pp. 205–210.
Kotpalliwar, M.V., Wajgi, R., 2015. Classification of attacks using support vector machine on KDDCUP'99 IDS Database. In: Proceedings of the 2015 Fifth International Conference on Communication Systems and Network Technologies, Gwalior, India, pp. 987–990.
Li, Y., Xia, J., Zhang, S., Yan, J., 2012. An efficient intrusion detection system based on support vector machines and gradually feature removal method. Expert Syst. Appl. 2 (39), 424–430.
Mohammadi, S., Mirvaziri, H., Ghazizadeh-Ahsaee, M., Karimipour, H., 2019. Cyber intrusion detection by combined feature selection algorithm. J. Inf. Secur Appl. 10 (44), 80–88.
Nadeem, L., Azam, M.A., Amin, Y., Ghamdi, M.A., Chai, K.K., et al., 2021. Integration of D2D, network slicing, and MEC in 5G cellular networks: Survey and challenges. IEEE Access 9 (7), 37590–37612.
Panda, M., Abraham, A., Patra, M.R., 2010. Discriminative multinomial Naïve Bayes for network intrusion detection. In: Sixth Int. Conf. on Information Assurance and Security, Atlanta, GA, USA, pp. 5–10.


Pervez, M.S., Farid, D.M., 2014. Feature selection and intrusion classification in NSL-KDD cup 99 dataset employing SVMs. In: Proceedings of the 8th International Conference on Software, Knowledge, Information Management and Applications (SKIMA 2014), Dhaka, Bangladesh, pp. 1–6.
Sarker, I.H., Salim, F.D., 2018. Mining user behavioral rules from smartphone data through association analysis. In: Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Melbourne, Australia, 3-6 June, pp. 450–461.
Sheikhan, M., Jadidi, Z., Farrokhi, A., 2012. Intrusion detection using reduced-size RNN based on feature grouping. Neural Comput. Appl. 21 (6), 1185–1190.
Snir, M., Otto, S., Walker, D., Dongarra, J., 2015 no. 5. In: MPI: The complete reference. MIT Press, Cambridge, pp. 1–16.
Sun, N., Zhang, J., Rimba, P., Gao, S., Zhang, L., 2018. Data-driven cybersecurity incident prediction: a survey. IEEE Commun. 21 (1), 1744–1772.
Tapiador, J.E., Orfila, A., Ribagorda, A., Ramos, B., 2013. Key-recovery attacks on KIDS, a keyed anomaly detection system. IEEE Trans. 12 (2), 312–325.
Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.A., 2009. A detailed analysis of the KDD CUP 99 data set. In: 2009 IEEE Sym. on Computational Intelligence for Security and Defense Applications, Chicago, Illinois, USA, pp. 1–6.
Tsai, C.F., Hsu, Y.F., Lin, C.Y., Lin, W.Y., 2017. Intrusion detection by machine learning: a review. Expert Syst. Appl. 8 (36), 11994–12000.
Wagner, C., François, J., Engel, T., 2011. Machine learning approach for ip-flow record anomaly detection. In: Proceedings of the International Conference on Research in Networking, Valencia, Spain, pp. 28–39.
White, T., 2010. Hadoop: The definitive guide, original. O'Reilly Media.
Wu, W., 2020. Application of MapReduce parallel association mining on IDS in cloud computing environment. J. Intell. Fuzzy Syst. Preprint 4 (3), 1–9.
Xin, Y., Kong, L., Liu, Z., Chen, Y., Li, Y., 2018. Machine learning and deep learning methods for cybersecurity. IEEE Access 6 (4), 35365–35381.
