Cyber Hacking Breaches Prediction and Detection Using Machine Learning

2023 2nd International Conference on Vision Towards Emerging Trends in Communication and Networking Technologies (ViTECoN) | 979-8-3503-4798-2/23/$31.00 ©2023 IEEE | DOI: 10.1109/VITECON58111.2023.10157462

K Pujitha, Assistant Professor, Dept of CSSE, Mohan Babu University (Erstwhile Sree Vidyanikethan Engineering College), Tirupati, AP, India, [email protected]
Gorla Nandini, UG Scholar, Dept of CSSE, Mohan Babu University (Erstwhile Sree Vidyanikethan Engineering College), Tirupati, AP, India, [email protected]
K V Teja Sree, UG Scholar, Dept of CSSE, Mohan Babu University (Erstwhile Sree Vidyanikethan Engineering College), Tirupati, AP, India, [email protected]
Banda Nandini, UG Scholar, Dept of CSSE, Mohan Babu University (Erstwhile Sree Vidyanikethan Engineering College), Tirupati, AP, India, [email protected]
Dhodla Radhika, UG Scholar, Dept of CSSE, Mohan Babu University (Erstwhile Sree Vidyanikethan Engineering College), Tirupati, AP, India, Radhika.chinni1914gmail.com

Abstract— Predicting cyber hacking breaches is an emerging area, and recognizing, detecting, and predicting breaches with computer algorithms has been a challenging task. The main goal of applying machine learning to breach detection and prediction is to make malware detection more responsive, scalable, and efficient than traditional systems that call for human involvement. Any of the various types of cyber hacking attacks can harm a person's information and financial reputation. Data from governmental and non-profit organizations, such as user and company information, may be compromised, posing a risk to their finances and reputation. Information collected from websites can trigger a cyberattack. Organizations such as those in the healthcare industry hold sensitive data that needs to be kept discreet and safe. Identity theft, fraud, and other losses may be caused by data breaches. The findings indicate that 70% of breaches affect numerous organizations, including the healthcare industry. The analysis displays the likelihood of a data breach. With the increased usage of computer applications, host and network security is increasingly exposed to the risk of data breaches. Machine learning methods can be used to find these attacks, and machine learning models are used to protect websites from security flaws. The dataset can be obtained from the Privacy Rights Clearinghouse. Data breaches can be decreased by educating staff on the use of modern security measures, which aids in understanding attacks and data security. Machine learning models such as Random Forest, Decision Tree, k-means, and Multi-layer Perceptron are used to predict data breaches.

Keywords—Cyber hacking breaches, Machine learning, Algorithms, Prediction.

I. INTRODUCTION

Over the years, corporations have increasingly become the target of cyberattacks. Besides the ransomware that causes such severe damage, hackers may employ many attack types. Today, maintaining system security, including the confidentiality of both corporate and personal data, is quite difficult. Millions of cyberattacks per day cause tremendous financial losses. Our study has three main objectives. Using actual cybercrime data, the first is to predict a cybercrime strategy, and the accuracy results are then compared. The second is to examine whether the information at hand can be used to predict cybercrime perpetrators. It is employed to hide a system's data. Information theft is caused by poor management of sensitive and highly confidential data. The hackers' techniques might be found by two different methods. One is to move forward with legal action, get in touch with the victim, and let them know about the violations [1]. Organizations should be aware of the sorts, trends, and patterns of attacks so that they can monitor their systems. We present a study on the consequences of these kinds of attacks in an effort to help prevent breaches from occurring. We provide a comprehensive study of the breaches that have occurred at various organizations and of their financial effect. Because of improvements in information technology, declining prices for memory and storage devices, and the expansion of the digital economy, businesses and governmental organizations now acquire more data every day. Businesses and organizations face the threat of data attacks because they collect personal data on their computers. The Privacy Rights Clearinghouse (PRC) found a large number of records that were exposed due to data breaches between 2005 and 2019. An organization may face legal action as a result of data breaches that cause financial and reputational harm. Computer networks are used in manufacturing, healthcare, and research, and information is transferred through these networks every second. Attacks are carried out for profit; they destroy important information or use it for the attacker's own needs, which raises the risk to data [2]. Hybrid-based detection is used to detect the high false positive rates and low false positive rates. Anomaly-based detection analyzes the behavior of traffic, whereas signature-based detection holds records of previous attacks and is able to detect the possibilities. It has a high likelihood of discovering undiscovered hazards. For finding breaches in massive amounts of data, typical detection techniques are ineffective. Machine learning techniques can increase the effectiveness of breach detection [3]. Machine learning techniques include supervised and unsupervised methods. In a supervised method, the model is trained on labelled input [4]. By that label


Authorized licensed use limited to: Florida Institute of Technology. Downloaded on August 05,2023 at 14:29:52 UTC from IEEE Xplore. Restrictions apply.
the model will distinguish between the classes present in the records of the dataset, whereas in an unsupervised method the system is trained on unlabelled input in order to determine the structure that exists in the input dataset. Using Twitter data that has been scraped and categorized, online social networks can serve as platforms and routes for exchanging information [5]. A probabilistic methodology evaluates the relationship between user group sentiment and potential cyberattacks. We use a statistical modelling approach to address these problems and apply it to PRC data, which shows that neither the number of breaches nor their frequency has grown over time. We distinguish between two types of breaches: negligent, which happen when personal information is accidentally revealed, and malicious. In the dataset, both negligent and malicious breach sizes also remain constant. PRC is a nonprofit organization with a focus on privacy problems [6]. The dataset summarizes only the 4571 breaches that involved sensitive data, with the sizes of the data breaches (records exposed) across a ten-year time frame. We confined our analysis to the 2253 breaches in this subset. These statistics have two major drawbacks. The collection only includes breaches that have been publicly recognized, and the number of records listed for each breach is only an approximation of the total number of people affected. The PRC dataset, on the other hand, is the biggest and most comprehensive open dataset of its kind. There is a chance that several data breaches go unreported. Various surveys have found that somewhere between 60% and 89% of security events are unreported. These claims come from unofficial surveys of security experts, and there is no reason to believe that their frequency distributions should be different from PRC. The overall number of records affected by a particular breach is given as S [7]; that is, S designates the number of related entries for each specific breach. The aim is to identify the time-independent distribution that most accurately describes the data. The PRC dataset classifies each breach into one of seven categories. Two groups are normally formed from the seven categories. The first consists of data breaches resulting from "negligence," when documents were accidentally disclosed rather than being actively sought by an attacker, such as when laptops are lost or when sensitive data is unintentionally made public. The second category consists of data breaches that resulted from "mischievous" acts that deliberately target sensitive data, such as hackers breaking into networks, an employee of the organization misusing data, or credit card fraud. The dataset contains data on each type of data breach as well as its category [8]. It is therefore understandable that careless activities happen almost twice as frequently as deliberate data breaches. Negligent breaches are not included in the current research since they are more likely to be caused by human mistake than by cyberattacks. Although the other three sub-categories are important in and of themselves and merit independent investigation, the hacking sub-category will be the focus of this study (hacked breach dataset hereinafter). Hacking (including malware), insider, credit card fraud, and unknown are the four sub-categories of detrimental breaches that were looked at. Breach detection is the process of continuously preventing unauthorized access and monitoring network or computer system activity for cues about potential incidents. This is usually achieved by collecting information from different systems and networks and later vetting the information for any possible security holes. Systems created using these techniques generally suffer from high false-positive and false-negative rates when detecting hazardous behavior, as well as a lack of ongoing adaptability to changing harmful behavior.

II. RELATED WORK

Ayyagari examined the records of breaches between 2005 and 2011 and found that the occurrence of hacking attacks was declining [8]. Smith focused on data breaches that occur in health organizations and also studied the relationship between breaches and storage. He discovered that 72% of attacks come from health companies and involve digital and electronic data. Shu looked into the intended company and implemented a number of measures to prevent the release of personal information. In contrast, Algarni and Malaiya explored the elements that contribute to data breaches and looked at how calculators are used to determine these factors. Horawalavithana investigated vulnerability prediction activities [9]. Security-related algorithms are used. Protecting information on a public website against unauthorized scraping or illegitimate usage is essential in today's technology environment. Data is crucial and has a big financial influence on many business executives and website owners. Security procedures such as honeypots are used to steer a masquerader in the incorrect direction [10]. Gaul and Rehman researched prediction techniques. Jenkins examined the correlation between data breach features using speech act theory. Chen examined how phishing attempts and other organizational characteristics affect the stock value of multinational corporations [11].

McLeod and Dolezel, after coming across data breaches at health organizations, discovered the levels of exposure and security and how they can result in data breaches. Kafali investigated the link between regulations and data leaks. Sen and Borle tested a number of data breach theories and discovered that forgoing information technology protection can increase the likelihood of a data breach. Xu examined the hacking techniques that were employed. Sen and Borle discovered that tight legislation can lower the chance of violations [12].

A model for data security was put forth by Kantarcioglu and Shaon to safeguard systems. Bertino and Ferrari [13,14] researched the ideas around big data security and organizational security. Data on cyberbreach incidents are broken down into technical, social, and socio-technical categories. Liu researched network anomalies and malware, using RiskTeller to analyze 600 thousand devices and examine their malware infections. Many organizations had the breaches, 67.8% and 55.5%, of file sharing activities [15].

Martin suggested a deep learning architecture based on randomized ensembles for identifying breaches. Caminero applied NSL-KDD using data from the AWID database [16]. To demonstrate how much more fragile random forest is, Hang created a gradient tree technique. Huang introduced the extraction and classification of multiscale guided features. Iman extracted the chosen features from the dataset using the Boruta feature selection algorithm [17,18].

Gwebu discovered that companies with a poorer reputation suffer a lower stock market value return than companies with a positive reputation [19]. According to Khandpur's research on social media, 71% of data breaches on Twitter occur there. Shu employed social media as a sensor to decipher social behavior from cyberattacks. Ritter used historical data to identify cyberattacks such as denial of service and data breaches. Sarkar researched the dark web to understand weaknesses [20]. Zhang researched the issues and threats of data security, while Abouelmehdi et al. examined the current challenges with big data security. Ikegami and Kikuchi examined the model design of the breach dataset of Japan. Using the PRC dataset, Peng conducted research on cyberattacks. Eling and Loperfido conducted research on the type-based analysis of data breaches.

A panel regression that evaluates the possibility of data breaches was proposed by Buckman. A fixed-effect model was developed
by Romansky [22] to determine the impact of data breaches. Edwards looked at the frequency and significance of data breaches. The researchers showed that a log-normal distribution [23] may be used to fit the breach size, while a negative binomial distribution can be used to fit the breach frequency. Sun developed a model that can be applied to breach data at any level to support rate making in the underwriting of cybersecurity insurance. We have examined a range of subject areas, such as data clustering and privacy frameworks, before using a multidisciplinary approach to show why data privacy and security are crucial. The author provided a mathematical foundation to characterize and analyze privacy, to aid in his work on preventing breaches that result in privacy loss.

III. EXISTING SYSTEM

Several elements, such as unexpected application existence, network port usage, and strange network activity, can signal a breach. Different kinds of attacks that occur in cloud environments include data hijacking, malicious manipulation of client information, denial of service, risky VM migration, and sniffing/spoofing of virtual networks. All of these cutting-edge techniques might be employed by hackers to attempt to seize control of the cloud service. Data on user profiles, hosts, connections, protocols, and devices are tracked by an intrusion detection system. Firewalls and system monitors frequently check the systems and websites used by businesses that handle sensitive data to help combat these issues. At present, third parties are used to detect data breaches. However, many hackers and security experts make an effort to undermine a company's security measures out of personal hatred or for other reasons.

Each instance of a security breach contains a succinct narrative, the date the breach was first discovered, the count of breaches and records, and the specifics of the offence. We only retain breaches caused by hacking operations. The Veris Community Database is the second source. Launched in 2013, the Veris Community Database is a Verizon threat initiative meant to collect and disseminate information on security incidents from multiple reliable sources. The most recent edition has almost 8000 incidents and is divided into seven threat action categories: malware, hacking, social, misuse, physical, error, and environmental. The major subjects of this study are malware and hacking because they have a lot in common with outside hacking activities.

A security breach is a security incident in which sensitive information from a service or company is fraudulently accessed. Events like the Marriott data breach, which took almost 4 years to be detected, and a company like Verizon, which took almost 6 months to identify a data breach in 2016, are examples of how big corporate behemoths routinely disregard security fundamentals. Information breaches occur when confidential or secure information about an organization is intentionally or accidentally collected. The decision tree algorithm works better with eccentricity but fails over time. In regression models as well, the threshold value significantly biases the results. If the threshold is too high, the entire process will stop working. Neural networks, although very sophisticated, need a lot of information at first. The machine learning framework used for security, which is currently being built in the background, will keep an eye on the website from both an internal and external perspective. This model was created with the help of several datasets. To keep the system under control, the model requires a variety of actions.

IV. PROPOSED SYSTEM

A. Data and Methodology

The dataset used for prediction is taken from Kaggle as its data source. The dataset contains 300 instances; its attributes are organization, website and social network details, year, records, organization type, and sources. During the preprocessing stage, the dataset was analyzed, and several methods were employed to alter the dataset's structure and impute missing values.

Table 1 Attributes and their characteristics

Table 1 lists attributes such as entity, year, records, organization type, and method. Example values in the dataset include Airtel (entity), 2018, 2019, and 2020 (year), and healthcare and social networking (organization type).

B. Data Preprocessing

Data preprocessing is performed to remove null and duplicate values. The adjustments made to the data before it is sent to a particular algorithm are known as data preprocessing. Preprocessing transforms noisy data into clean datasets. In other words, the data is collected in raw form from different sources, which makes it unsuitable for evaluation. The data must be organized properly in order to achieve better outcomes from the machine learning approach used to apply the model. Missing values can be detected during preprocessing. System efficiency and accuracy may suffer if data preprocessing is not performed. Here the output for the null values is not displayed, as they do not consist of numerical values. Machine learning models have specific requirements for the format of the information; for instance, the K-means method does not tolerate null values. Therefore, null values in the initial raw data collection must be handled in order to run the K-means method.

Figure 1 Architecture

Data is collected from various organizations, social networks, and websites. The data is then aggregated and preprocessed. That data is sent to the training stage, where the algorithms are applied to it to train the models. The prediction identifies whether or not a data breach has occurred. By using machine learning models, the system is able to detect data breaches.
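The preprocessing step described in Section IV-B (removing duplicate records and handling null values so that K-means can run) might look like the following minimal sketch. The field names loosely follow Table 1, while the sample rows, the function name `preprocess`, and the choice of median imputation are illustrative assumptions; the paper does not state which imputation method the authors used.

```python
from statistics import median


def preprocess(rows, numeric_field="records"):
    """Drop exact duplicates, drop rows with null categorical fields,
    and fill missing numeric values with the median (an assumed strategy)."""
    # 1. Remove exact duplicate records while preserving order.
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))  # copy so the caller's data is untouched

    # 2. Drop rows whose non-numeric fields are null.
    complete = [
        r for r in unique
        if all(v is not None for k, v in r.items() if k != numeric_field)
    ]

    # 3. Impute missing numeric values with the median of the observed ones.
    observed = [r[numeric_field] for r in complete if r[numeric_field] is not None]
    fill = median(observed) if observed else 0
    for r in complete:
        if r[numeric_field] is None:
            r[numeric_field] = fill
    return complete


# Made-up rows for illustration only (not taken from the paper's dataset).
sample_rows = [
    {"entity": "Org A", "year": 2018, "records": 1200, "org_type": "healthcare", "method": "hacking"},
    {"entity": "Org A", "year": 2018, "records": 1200, "org_type": "healthcare", "method": "hacking"},
    {"entity": "Org B", "year": 2019, "records": None, "org_type": "social networking", "method": "hacking"},
    {"entity": "Org C", "year": 2020, "records": 300, "org_type": None, "method": "hacking"},
    {"entity": "Org D", "year": 2020, "records": 800, "org_type": "healthcare", "method": "hacking"},
]

if __name__ == "__main__":
    for row in preprocess(sample_rows):
        print(row)
```

After this step no field is null, which is the precondition the text gives for running K-means. Other strategies, such as dropping every row that contains a null, would satisfy the same precondition.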
C. Feature Selection

Feature selection is the process of finding the best features among those present in the training data. It is the technique of selecting only suitable data and removing unnecessary data. The correlation coefficient method is used for feature selection. The main aim of feature selection in machine learning is to build useful models, and it reduces the number of input variables when executing the model. With the correlation method, one variable can be predicted from another. It is mainly used because the optimal variables are correlated with the target. If two variables are correlated with each other, we keep the one that is more strongly related to the target variable.

Figure 2 Feature Selection

D. Implementation

We use the Decision Tree, Random Forest, K-Means, and Multi-layer Perceptron algorithms for predicting cyber breaches.

Decision Tree Model:

The decision tree algorithm is a supervised machine learning algorithm. It is used for both classification and regression problems. It makes predictions based on the answers to previously noted questions. Two types of decision trees exist: the categorical decision tree, which predicts the discrete class to which the data belongs, and the regression tree, whose predictions are real numbers. Measures such as the Gini index, information gain, and Chi-square are used by the decision tree to work with the variables of nodes and sub-nodes. The algorithm automatically learns breach signatures and classifies each case as breached or not. The source set is divided into subsets based on the value of an attribute. Repeating this procedure on every resulting subset is referred to as recursive partitioning.

Figure 3 Working of Decision Tree Algorithm

The recursion stops when a split no longer adds value to the predictions, or when the subgroup in a node has the entire target variable (every sample in the node shares the same target value), at which point the iteration is finished.

$$(\mathbf{l}, k) = (l_1, l_2, l_3, \ldots, l_m, k)$$

The target variable, $k$, is what we are attempting to comprehend, categorise, or generalise. The input variables $l_1, l_2, l_3, \ldots$ are represented by the vector $\mathbf{l}$.

Random Forest Model:

The random forest algorithm is a supervised machine learning algorithm. It is a collection of decision trees. Because each tree is built from a different sample, the ensemble is less error-prone than any single tree, and its predictions remain understandable. It can be used for regression and classification problems. It builds a number of decision trees from different samples and takes the majority vote for classification and the average for regression.

$$Z = \arg\max_{x} \left\{ \sum_{n=1}^{N} I\big(h_n(\mathbf{l}) = x\big) \right\} \tag{1}$$

where $x$ ranges over the candidate classes, $I(\cdot)$ is the indicator function, and $h_n(\cdot)$ is a classification tree; the tree with the number $n$ was chosen at random from a pattern. If $B(l, m)$ is the representation of the dataset, each classification tree inside the ensemble is built using a distinct subset of the training data $B(l, m)$. Each tree then functions as a usual decision tree. Data is divided into segments according to a randomly chosen value until it is completely partitioned or the maximum depth is reached.

Figure 4 Working of random forest classifier

K means model:

K-means clustering is an unsupervised learning algorithm used to group an unlabeled dataset into different clusters. It divides the dataset into clusters so that the different groupings within the unlabeled dataset can be identified without training. It is a centroid-based algorithm in which each cluster is associated with a centroid. The main goal of the algorithm is to reduce the sum of the distances between the data points and their cluster centroids.

The unlabeled input dataset is divided into k clusters, and the procedure is repeated until the best cluster groups are found, which are the result. The algorithm works by first choosing the value of k, which determines how many clusters will be produced. It then chooses random points to serve as centroids. Next, each data point is assigned to the centroid nearest to it. A new centroid is then placed for each cluster. The algorithm repeats this process until it finds the optimal clusters.
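The K-means steps above (choose k, pick initial centroids, assign each point to its nearest centroid, place a new centroid for each cluster, repeat) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the standard update in which each new centroid is the mean of its cluster, uses squared Euclidean distance, and accepts an optional `init` argument (a hypothetical convenience) so that a run can be made reproducible. The sample points are made up.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Return (labels, centroids, sse) for a simple K-means run."""
    # Steps 1-2: k is given; start from supplied or randomly chosen points.
    if init is not None:
        centroids = list(init)
    else:
        centroids = random.Random(seed).sample(points, k)

    labels = None
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its members (assumed update).
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))

    # Sum of squared distances from each point to its own centroid.
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, sse


# Two made-up, well-separated groups of 2-D points.
demo_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

if __name__ == "__main__":
    labels, centroids, sse = kmeans(demo_points, k=2, init=[(0, 0), (10, 10)])
    print(labels, centroids, round(sse, 3))
```

The returned `sse` is the sum of squared point-to-centroid distances, the quantity the algorithm tries to reduce. With random initialisation the result can depend on the starting centroids, which is why practical runs often repeat the procedure with several seeds.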
The objective of this approach is to minimize an objective function, in this case the squared error function:

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - y_j \right\|^2 \tag{2}$$

where $\| x_i^{(j)} - y_j \|^2$ is a selected measure of the distance between a data point $x_i^{(j)}$ and the cluster centre $y_j$, and $J$ is an indication of how far the $n$ data points are from each cluster's centre. The centroid is the unknown location of the cluster's centre. In the above formula, $x$ and $y$ are the variables of the squared error function.

Figure 5 Before applying k-means Clustering

Figure 6 After applying k-means clustering

In Figure 5 the training examples are shown as dots. Before k-means clustering is applied, all the different categories are mixed together. In Figure 6 the centroids are shown as stars. After k-means clustering is applied, the different categories are divided into different clusters. The clusters are grouped into categories according to their similar properties.

Multi-layer perceptron model:

A multi-layer perceptron (MLP) contains multiple dense layers that convert any input dimension to the desired dimension. It is a neural network that combines neurons so that the output of some neurons is the input of other neurons. This model has three layers: input layer, hidden layer, and output layer. The input layer takes the input and forwards it for further processing; the hidden layer does the same and forwards it to the output layer. The output layer gives the result of the predictive model.

Input Layer:
The input layer takes the dataset as input. It is also known as the visible layer because it is the exposed part of the network. The neural network is drawn according to its unit neurons. The data is accepted by this layer and passed to the rest of the network.

Hidden Layer:
Hidden layers come after the input layer. They are called hidden layers because they are not directly exposed to the input. They apply the weights to the input and transfer the result, using an activation function, to the output layer, which gives the output.

Output Layer:
This layer gives the desired output. It has its own biases and weights, which allow it to predict the desired output. It produces the prediction directly from the hidden layer.

Signals travel sequentially through the layers of the MLP network, from the input layer to the output layer. In each hidden-layer node, the weighted inputs are summed and then passed through a transfer function, such as a nonlinear sigmoid function. Various transfer functions are used in neural networks; in this model, the logistic sigmoid function is used, which is given by

$$f(x) = \frac{1}{1 + e^{-x}} \tag{3}$$

The sigmoid function is used because its output lies between 0 and 1, so it can be used to predict the outcome of the model in the form of a probability value. Here $x$ is the input variable and $f(x)$ is the sigmoid function.

During network evaluation, 70% of the dataset is used for training, and the weights and biases are adjusted in accordance with the network and the desired output value. 15% is used for validation, in order to prevent overfitting before the network stops training, and 15% is used for testing to determine the network's performance.

Figure 7 Multilayer perceptron

V. RESULTS

We test various predictive models using the prediction methods covered. The two classifiers that perform the best are K-Means and MLP.

No.  Algorithm       Accuracy (%)
1    Decision Tree   90.86
2    Random Forest   91.80
3    K-means         94.19
4    MLP             98.72
Table 2 Comparison of Results

Among the Decision Tree, Random Forest, K-means, and MLP classifiers, K-means and MLP give the better accuracy, as the comparison of results in Table 2 shows. The Decision Tree algorithm gives an accuracy of 90.86 and Random Forest gives 91.80, whereas K-means gives 94.19, which is better, and MLP gives the best accuracy (98.72) of the machine learning models compared.
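The 70/15/15 partition used for network evaluation and the accuracy figure reported in Table 2 can be illustrated with a short sketch. The helper names `split_70_15_15` and `accuracy`, the shuffling seed, and the toy labels are assumptions made for illustration; the paper does not specify how the split was drawn or how ties in rounding were handled.

```python
import random


def split_70_15_15(samples, seed=0):
    """Shuffle the samples and split them into train/validation/test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed: reproducible split
    n = len(shuffled)
    n_train = n * 70 // 100  # integer arithmetic avoids float rounding
    n_val = n * 15 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def accuracy(y_true, y_pred):
    """Percentage of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)


if __name__ == "__main__":
    # 300 instances, matching the dataset size stated in the paper.
    train, val, test = split_70_15_15(range(300))
    print(len(train), len(val), len(test))
    # Toy labels (1 = breach, 0 = no breach), for illustration only.
    print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

For the 300-instance dataset this yields 210 training, 45 validation, and 45 test instances. The validation part is the one a training loop would monitor to decide when to stop, and accuracy would be reported on the held-out test part.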
Figure 8 Comparison of results

VI. CONCLUSION

This paper presented a method for assessing the risk of hacker intrusions, addressing the issue of unreported intrusions, and estimating the exposure of enterprises. Since machine and deep learning techniques are increasingly being employed for a variety of purposes, including cyber security, it is imperative to determine whether, and which category of, algorithms can deliver adequate results. Spam detection, malware analysis, and intrusion detection are three important aspects of cyber security that are explored for these methodologies. According to our research, there are still a number of problems with current machine learning algorithms that reduce their value for cyber security. The dataset containing 300 instances is used to train machine learning algorithms such as the Random Forest, Decision Tree, MLP, and K-Means models. With this method, MLP and K-Means achieved higher accuracy and yielded better output. The proposed system can be efficiently applied to detect and predict breaches.

REFERENCES

[1] M. Xu, K. M. Schweitzer, R. M. Bateman, and S. Xu, “Modeling and predicting cyber hacking breaches,” IEEE Trans. Inf. Forensics Security, vol. 13, no. 11, pp. 2856–2871, 2018.
[2] IBM. (2019). Cost of a data breach report. IBM Security, 76. [Online]. Available: https://www.ibm.com/downloads/cas/ZBZLY7KL
[3] Fernandez Maimo et al., “A self-adaptive deep learning-based system for anomaly detection in 5G networks,” IEEE Access, vol. 6, pp. 7700–7712, 2018.
[4] Kantarcioglu M and Ferrari E (2019) Research Challenges at the Intersection of Big Data, Security and Privacy.
[5] Verizon, “Data breach investigations report,” 2019. [Online]. Available: https://enterprise.verizon.com/resources/reports/dbir/
[6] H. Hammouchi, O. Cherqi, G. Mezzour, M. Ghogho, and M. El Koutbi, “Digging deeper into data breaches: An exploratory data analysis of hacking breaches over time,” Procedia Computer Science, vol. 151, pp. 1004–1009, 2019.
[7] rack T. Majority of malware analysts aware of data breaches not disclosed by their employers. http://www.threattracksecurity.com/press-release/majority-of-malware-analysts-aware-of-data-breaches-not-disclosed-by-their-employers.aspx
[8] K. Pujitha, Kattamanchi Prem Krishna, K. Amala, Annavarapu Yasaswini, Sivakumar Depuru, Kopparam Runvika, “Development of Secured Online Parking Spaces”, Journal of Pharmaceutical Negative Results, vol. 13, no. 4, pp. 1010–1013, Nov. 2022.
[9] Sivakumar Depuru, Anjana Nandam, P.A. Ramesh, M. Saktivel, K. Amala, Sivanantham. (2022). Human Emotion Recognition System Using Deep Learning Technique. Journal of Pharmaceutical Negative Results, 13(4), 1031–1035. https://doi.org/10.47750/pnr.2022.13.04.141 (Original work published November 4, 2022)
[10] S. Depuru, P. Hari, P. Suhaas, S. R. Basha, R. Girish and P. K. Raju, "A Machine Learning based Malware Classification Framework," 2023 5th International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 2023, pp. 1138-1143, doi: 10.1109/ICSSIT55814.2023.10060914
[11] S. Depuru, K. Vaishnavi, B. Manogna, K. J. Sri, A. Preethi and C. Priyanka, "Hybrid CNNLBP using Facial Emotion Recognition based on Deep Learning Approach," 2023 Third International Conference on Artificial Intelligence and Smart Energy (ICAIS), Coimbatore, India, 2023, pp. 972-980, doi: 10.1109/ICAIS56108.2023.10073918.
[12] Ayyagari, R. (2012). An exploratory analysis of data breaches from 2005-2011: Trends and insights. Journal of Information Privacy and Security
[13] Algarni, A. M., Malaiya, Y. K. (2016, May). A consolidated approach for estimation of data security breach costs. In 2016 2nd International Conference on Information Management (ICIM) (pp. 26-39). IEEE.
[14] Kafali, Jones, J., Petruso, M., Williams, L., Singh, M. P. (2017, May). How good is a security policy against real breaches? A HIPAA case study. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE) (pp. 530-540). IEEE.
[15] Sen, R., Borle, S. (2015). Estimating the contextual risk of data breach: An empirical approach. Journal of Management Information Systems, 32(2), 314-341.
[16] Bertino, E., & Ferrari, E. (2018). Big data security and privacy. In A comprehensive guide through the Italian database research over the last 25 years (pp. 425–439). Springer. Gray, J., Gerlitz, C., & Bounegru, L. (2018).
[17] Smith, T. T. (2016). Examining Data Privacy Breaches in Healthcare
[18] A. Bachar, N. E. Makhfi, O.E. Bannay, "Towards a behavioral network intrusion detection system based on the SVM model", in 2020 1st international conference on innovation research in applied science engineering and technology (IRASET), Meknes, Morocco, 2020, pp. 1-7
[19] A. Bachar, N. E. Makhfi, O.E. Bannay, "Towards a behavioral network intrusion detection system based on the SVM model", in 2020 1st international conference on innovation research in applied science engineering and technology (IRASET), Meknes, Morocco, 2020, pp. 1-7
[20] L. Bilge, Y. Han, and M. Dell’Amico, "Riskteller: Predicting the risk of cyber incidents", in Proc. of the 2017 ACM SIGSAC conf. on Computer and Communications Security, 2017, pp. 1299–1311.
[21] S. Sarkar, M. Almukaynizi, J. Shakarian, and P. Shakarian, "Predicting enterprise cyber incidents using social network analysis on dark web hacker forums", The Cyber Defense Review, pp. 87–102, 2019
[22] M. Lopez-Martin, B. Carro, J. I. Arribas, and A. Sanchez-Esguevillas, "Network intrusion detection with a novel hierarchy of distances between embeddings of hash IP addresses", Knowledge-based Syst., vol 219, 2021