
Using Knowledge Graphs and Reinforcement Learning for Malware Analysis

Aritran Piplai∗, Priyanka Ranade∗, Anantaa Kotal∗, Sudip Mittal†, Sandeep Nair Narayanan‡, Anupam Joshi∗
∗Dept. of Computer Science & Electrical Engineering, University of Maryland, Baltimore County, Email: {apiplai1, priyankaranade, anantak1, joshi}@umbc.edu
†Department of Computer Science, University of North Carolina, Wilmington, Email: [email protected]
‡Cisco Research and Development, [email protected]

Abstract—Machine learning algorithms used to detect attacks are limited by the fact that they cannot incorporate the background knowledge that an analyst has. This limits their suitability in detecting new attacks. Reinforcement learning is different from traditional machine learning algorithms used in the cybersecurity domain. Compared to traditional ML algorithms, reinforcement learning does not need a mapping of the input-output space or a specific user-defined metric to compare data points. This is important for the cybersecurity domain, especially for malware detection and mitigation, as not all problems have a single, known, correct answer. Often, security researchers have to resort to guided trial and error to understand the presence of a malware and mitigate it.

In this paper, we incorporate prior knowledge, represented as Cybersecurity Knowledge Graphs (CKGs), to guide the exploration of an RL algorithm to detect malware. CKGs capture semantic relationships between cyber-entities, including those mined from open sources. Instead of trying out random guesses and observing the change in the environment, we aim to take the help of verified knowledge about cyber-attacks to guide our reinforcement learning algorithm to effectively identify ways to detect the presence of malicious filenames so that they can be deleted to mitigate a cyber-attack. We show that such a guided system outperforms a base RL system in detecting malware.

Index Terms—Reinforcement Learning, Knowledge Graphs, Cybersecurity, Artificial Intelligence

I. INTRODUCTION

Cybersecurity aims to protect hardware and software components in a computer system against malicious attacks. A key part of protecting against malicious attacks is identifying them, as early as possible in the Cyber Kill Chain [7]. The most common approaches today are signature based, which of course can be defeated by adversaries through minor modifications of the malware or its dropper. Another approach is based on behavioural anomalies. This can be done by observing the system behavior over time and flagging any aberrant behavior. However, such anomaly detection based approaches often misidentify less frequently occurring legitimate system actions as attacks. Another commonly used approach is to use machine learning on network data to detect attacks [24]. In 2019 for instance, the IEEE Big Data conference organized a competition on the use of machine learning algorithms to detect attacks based on observed network data [8].

Machine learning is an important tool for recognizing patterns in data. In the cybersecurity space, this can be useful to identify the behavior of a system under attack. However, standard machine learning algorithms have limitations in the cybersecurity space, especially in real deployments [28], [35]. In part, this is because the dataset is highly imbalanced – most of the observed data in a real system is not attacks, and unless a dataset is artificially curated to be balanced, it will mostly have benign data. In other words, the base rate of an attack is quite low. Hence, most machine learning algorithms tend to overfit for the data points of non-attack scenarios. Such learning algorithms are prone to weak performance in the real world. They also struggle in their ability to generalize to unseen attacks.

There are also other problems in the cybersecurity space for traditional ML approaches. In other domains like natural language processing or computer vision, it is fairly simple for a human, who is not necessarily an expert in the specific task, to draw the conclusion the Artificial Intelligence (AI) system is trying to reach with high confidence. For example, a human may not need complicated background knowledge, except what is implicit in the image itself, to differentiate the image of a cat from a dog. However, looking at netflow data, it is not easy for a person to decide if it represents an attack – this requires significant expertise and background knowledge. So when we are trying to mimic the way security professionals reach a decision about a task, it is imperative that we consider a way to incorporate their prior knowledge, which is not in the data itself, into the machine learning system. This is because their prior knowledge and experience, as opposed to common knowledge, may dictate how they perform the tasks of their profession. For example, a security professional might use what they know about recent discussions in online forums about vulnerabilities in identifying an attack on their system. Also, in tasks like image recognition, there is not much scope for subjectivity in reaching a conclusion. Usually, an image is either a cat or not. However, what is an attack is not always clear, and for sophisticated APTs, even experts may not recognize an attack as it is happening. It has been reported that often a third party identifies an APT attack, not the in-house security team of the organization being attacked.

Reinforcement Learning (RL) mimics the way human beings tend to process new information. RL does not require sub-optimal actions to be explicitly corrected. Instead, it tries to find a balance between exploration (of new knowledge) and exploitation (of current knowledge). It does not assume
any prior mathematical modelling for the environment. This gives the RL algorithm more flexibility in learning about a new knowledge space.

In this paper, we make two key contributions. First, we use reinforcement learning for malware detection. Second, we also use knowledge mined from open information sources describing the same or similar malware attacks to change the behavior of the RL algorithm, automatically changing the parameters of the RL algorithm to adapt its reward functions and their initial probability distributions. This approach is not only a way to solve the issue of having to rely on pre-defined loss functions for traditional ML systems created by individuals who may not be experts in cybersecurity; it also helps to mimic how security professionals use their own knowledge in identifying attacks.

We organize our paper as follows: Section II talks about the key concepts of the different aspects of our algorithm. In Section III, we discuss our core algorithm. We discuss our findings in Section IV. We also discuss some of the relevant work conducted in this area in Section V, and we finally conclude our paper in Section VI.

II. BACKGROUND

In this section, we discuss key reinforcement learning algorithm details and our general approach of representing extracted open source text in a Cybersecurity Knowledge Graph (CKG).

A. Reinforcement Learning

We utilize methods in model-free reinforcement learning in our approach [30]. Model-free RL agents employ prior experience and inductive reasoning to estimate, rather than generate, the value of a particular action. There are many kinds of model-free reinforcement learning models such as SARSA, Monte Carlo Control, Actor-Critic, and Q-learning. We specifically utilize Q-learning methods [30].

Q-learning agents learn an action-value function (policy), which returns the reward of taking an action given a state. Q-learning utilizes the Bellman equation in order to learn expected future rewards. We calculate the maximum future reward max Q(s′, a′) given a set of multiple actions corresponding to different rewards. Q(s, a) is the current policy of an action a from state s, and γ is the discount factor. The discount factor weights the total reward an agent will receive from the current iteration until termination, and allows us to value short term reward over long term reward. The goal is therefore to maximize the discounted future reward at every iteration [32]:

Q_new(s, a) = Q(s, a) + α · [R(s, a) + γ · max_{a′} Q(s′, a′) − Q(s, a)]

Here α is the learning rate, R(s, a) is the reward, and γ · max_{a′} Q(s′, a′) is the maximum predicted reward given the new state and all possible actions.

We update a policy table, also known as a Q-table, for every action taken from a state. A Q-table is simply a lookup table that preserves the maximum expected reward for an action at each state. The columns represent actions and the rows represent states. The Q-table is improved at each iteration and is controlled by the learning rate α [32].
B. Cybersecurity Knowledge Graphs (CKGs)

Cybersecurity Knowledge Graphs have been widely used to represent Cyber Threat Intelligence (CTI). We use CKGs to store CTI as semantic triples that help in understanding how different cyber entities are related. This representation allows users to query the system and reason over the information. Knowledge graphs for cybersecurity have been used before to represent various entities [23]. Open source CTI has been used to build CKGs and other agents to aid cybersecurity analysts working in an organization [16]–[18], [21], [27]. CKGs have also been used to compare different malware by Liu et al. [13]. Some behavioral aspects have also been incorporated in CKGs, where the authors used system call information [22]. Graph based methods have also been post-processed by machine learning algorithms, as demonstrated by other approaches [2], [9], [10].

III. METHODOLOGY

Fig. 1: An architecture diagram specifying the different steps of our proposed method.

In this section, we discuss the principles of our proposed algorithm. Figure 1 gives us an overview of the different aspects of our system. We provide text describing a piece of malware as an input to our cyber knowledge extraction pipeline and we receive a set of semantic triples as an output. We assert this set of triples to a CKG. We sometimes fuse this CKG with data from other sources such as behavior analysis [25]. This forms our knowledge base that will help us identify malicious activity in a system and also suggest ways to mitigate it. The RL algorithm acts on the malware behavior data. We tune the parameters of the RL algorithm based on the information present in the CKGs, which can be retrieved by querying the CKG.

A. Cyber Knowledge Extraction from prior knowledge sources

We have an established cyber knowledge extraction pipeline that takes malware After Action Reports and automatically populates CKGs [26]. The method uses a trained 'Malware Entity Extractor' to detect cyber named entities. Once the malware entities have been detected, a deep neural network is used to identify the relationships between them [23]. The relationship extractor takes pairs of entities, generates their text embedding using a trained word2vec [15] model, and produces a relationship as an output for the pairs of entities. Finally, the entity-relationship set is asserted into a CKG.

We use these trained models on open-source text describing the malware we use in our experiments, or similar malware. In order to find open source text analysing the malware, we use the known MD5 hash of the malware and perform a web search to look for articles talking about the same malware. If we seek additional information about the malware samples, we can also search for open-source articles about the malware family, if that is known. We produce the open-source text that we gather from the web search as an input to the cyber knowledge extraction pipeline. As a result, we get semantic triples describing the malware asserted in the CKG. Querying this CKG can result in entities that prove to be valuable for modelling our RL algorithm. This CKG populated with information about the malware acts as our prior knowledge source.
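Once the triples are asserted, indicators can be pulled out of the CKG with a SPARQL query. The sketch below uses rdflib with a hypothetical schema; the class ckg:Malware and property ckg:hasIndicator are assumptions for illustration, not the authors' ontology terms.

```python
from rdflib import Graph

# Illustrative sketch: query a CKG for indicators associated with a malware.
# The ontology terms (ckg:Malware, ckg:hasIndicator) are hypothetical.
g = Graph()
g.parse("malware_ckg.ttl", format="turtle")  # triples from the extraction pipeline

query = """
PREFIX ckg: <http://example.org/ckg#>
SELECT ?malware ?indicator WHERE {
    ?malware a ckg:Malware ;
             ckg:hasIndicator ?indicator .
}
"""
for malware, indicator in g.query(query):
    # e.g., indicator = "high nice values", later mapped to behavior features
    print(malware, "->", indicator)
```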
TABLE I: Virtual machine performance metrics [1].
- Status: Process status
- CPU information: CPU usage percent; CPU times in user space; CPU times in system/kernel space; CPU times of children processes in user space; CPU times of children processes in system space
- Context switches: Number of voluntary context switches; number of involuntary context switches
- IO counters: Number of read requests; number of write requests; number of read bytes; number of written bytes; number of read chars; number of written chars
- Memory information: Amount of memory swapped out to disk; proportional set size (PSS); resident set size (RSS); unique set size (USS); virtual memory size (VMS); number of dirty pages; amount of physical memory; text resident set (TRS); memory used by shared libraries; memory shared with other processes
- Threads: Number of used threads
- File descriptors: Number of opened file descriptors
- Network information: Number of received bytes; number of sent bytes
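The metrics in Table I correspond closely to the per-process fields exposed by Python's psutil library. The paper does not name its collection tooling, so the following snapshot collector is only a sketch of how such data could be gathered.

```python
import psutil

# Illustrative per-process snapshot in the spirit of Table I
# (psutil is an assumption; per-process network bytes need extra tooling).
def snapshot():
    rows = []
    for p in psutil.process_iter(attrs=["name", "status", "num_threads", "nice"]):
        try:
            row = dict(p.info)
            row["cpu_times"] = p.cpu_times()            # user/system, incl. children
            row["ctx_switches"] = p.num_ctx_switches()  # voluntary / involuntary
            row["io"] = p.io_counters()                 # read/write requests and bytes
            row["memory"] = p.memory_full_info()        # RSS, VMS, USS, PSS, swap
            rows.append(row)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return rows

# The paper samples one snapshot every 10 seconds for 60 minutes per malware.
print(len(snapshot()), "processes observed")
```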

B. Reinforcement Learning Algorithm

Fig. 2: Diagram showing the state transformation with respect to the actions in our dataset.

We detonated malware samples in an isolated environment and observed system parameters that represent the behavior of the malware sample. We used the same method to collect the malware behavior data as described by Piplai et al. [25]. The malware dataset comprises samples downloaded from VirusTotal¹. The data was collected over a time period of 60 minutes for each malware. We refer to the first 30 minutes, when the malware is not active, as the benign phase. The next 30 minutes form the malicious phase, where the malware is active in the system. Traffic was generated artificially to simulate multiple clients interacting with the malware-infected system. Table I describes the different parameters that were collected during the exercise. In our paper, we aim to use Reinforcement Learning to identify a sequence of malicious processes that may be created by the malware to perform the attack.

¹ VirusTotal. https://www.virustotal.com/

We use Q-Learning for the purpose of identifying malicious process names. The behavior data that we collected resulted in data snapshots at a time interval of 10 seconds. For Q-Learning problems, we define a state space and an action space. To define a state space, we ideally need to look at key identifiers that can be used to isolate a row in our data. We use the 'timestamp' for this purpose, and that helps in creating as many states as there are rows in our data. For a single experiment, we have 26,000 to 28,000 rows. We define our action space by the total number of distinct process names created in a single experiment. Since this number may vary per experiment, we take a superset of all the processes that are created across all our experiments. In our experiments, we see that a total of 99 distinct processes are created over the course of data collection. We use 99 as the size of the action space for our experiments.

Intuitively, we can think of a single process created at timestamp 't' as changing the state of the system. The new state is identified with the timestamp 't+1'. The system features change as the new process is created, and we aim to identify the set of processes responsible for state changes that might signify a malicious attack. Figure 2 demonstrates an example of how we model process names as action spaces and identify different states with timestamps. We use this intuition to create reward functions that may help us identify malicious processes. For example, a high deviation from restful network activity after the creation of a process may signify the presence of a malware in the system. A surge in Input/Output (I/O) read/write bytes may also mean the same. We use this intuition to form reward functions.
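As a concrete illustration of this state/action modelling, the short sketch below derives both spaces from a behavior-data table. The column names 'timestamp' and 'process_name' are assumptions about the dataset layout, not the paper's actual schema.

```python
import pandas as pd

# Illustrative encoding of the paper's state/action spaces.
# Column names ("timestamp", "process_name") are assumptions.
df = pd.read_csv("malware_behavior.csv")

# One state per snapshot row: ~26,000-28,000 states per experiment.
states = {ts: i for i, ts in enumerate(sorted(df["timestamp"].unique()))}

# One action per distinct process name; a superset over all experiments (99 total).
actions = {name: j for j, name in enumerate(sorted(df["process_name"].unique()))}

print(len(states), "states;", len(actions), "actions")
```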
C. Utilizing CKGs in the RL algorithm

As stated in Section I, in a system that calls for complicated analysis to reach a conclusion, we need to consider expert knowledge that may help us better identify or mitigate malicious process names. In Figure 3, we can see a screenshot of an open source text description of a malware sample that an expert has provided. Instead of relying on hand-crafted reward functions to evaluate the quality of a state, we can use the knowledge extracted from open source text directly in the reward function. In Figure 3, the text description serves as input to a CKG that captures the semantic relationship between the malware and the 'high nice values'. The 'high nice values' are an indicator for the malware referenced. The CKG establishes a relationship between the Malware and the Indicator. When we query the knowledge graph about the indicator, it returns 'high nice values' as a result. We can map this entity to features of our behavior data and incorporate them into the parameters of our RL algorithm.

Fig. 3: Diagram showing a knowledge source (a) describing a similar malware and the time series data we collected after detonating a malware (b).

There are two ways we incorporate prior knowledge into our RL algorithm. The first method is inspired by some of the concepts mentioned by Moreno et al. [19]. In Q-Learning, we have an exploration phase and an exploitation phase. In vanilla Q-Learning exploration phases, we try random actions and observe the reward. During the exploitation phase, we choose the action that maximizes the Q-value for the state-action pair. Moreno et al. [19] demonstrate a method of using existing Q-values of a given state-action pair for exploration, leading to faster convergence. We can see this in Equation 1. We can also manipulate the exploration probabilities with the help of the T(s) value.

P(s, a_i) = exp(Q(s, a_i)/T(s)) / Σ_j exp(Q(s, a_j)/T(s))    (1)

In our algorithm, we use the extracted knowledge to tune the probability distribution for exploration of a state-action pair. For example, Equation 2 will change the probability distribution, assigning higher probabilities to actions associated with a 'nice value' of -20. The extracted knowledge from expert sources shows us that 'nice values' are important while searching for malicious processes.

P(s, a_i) = 1 − (nicevalue(a_i) + 20) / (2 · |max_j nicevalue(a_j)|)    (2)

The second method is incorporating the extracted knowledge into our reward functions to identify the malicious processes. For example, if a knowledge source indicates that the processes create multiple threads, we can use that as an additional parameter for our reward function. We will discuss the different reward functions that we have constructed from the knowledge sources and their performance in Section IV.
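A sketch of the two exploration distributions follows. Equation 1 is standard Boltzmann (softmax) exploration; Equation 2 is the nice-value prior. The clipping step is our addition to keep the weights a valid probability distribution, since Equation 2 itself can go negative.

```python
import numpy as np

def softmax_exploration(q_row, temperature):
    """Eq. 1: probability of each action in one state, with temperature T(s)."""
    logits = q_row / temperature
    logits -= logits.max()                 # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def nice_value_prior(nice_values):
    """Eq. 2: favor high-priority actions (nice values near -20)."""
    weights = 1.0 - (nice_values + 20.0) / (2.0 * np.abs(nice_values.max()))
    weights = np.clip(weights, 0.0, None)  # safeguard: Eq. 2 can dip below zero
    return weights / weights.sum()

# A process at nice -20 gets all the exploration mass over ones at 0 and +10.
print(nice_value_prior(np.array([-20.0, 0.0, 10.0])))
```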
IV. EXPERIMENTAL RESULTS

In this section, we discuss the preliminary results from our experiments. Most malware analysis and machine learning research concentrates on evaluating the performance of the machine learning algorithm on a dataset of malware samples. For example, Gavrilut et al. [6] discuss multiple machine learning algorithms to detect malware files using perceptrons and kernelized perceptrons, and they evaluate the performance of their trained algorithm on a test set. Since our approach is aimed at discovering a sequence of process names that we suspect to be malicious, we aim to rank the actions (processes) with respect to their Q-values after some episodes of training. To be precise, we use 140 episodes for training with 10,000 steps in each of them. The high step count compared to the number of episodes is because the time series data consists of 26,000 to 28,000 states. If we used a small value for the step count, we would be able to cover only a small portion of the state space in one episode. Hence, we keep the step count high and the episode count relatively low. We aim to calculate as many Q-values as possible in one episode itself. We then record how our RL system scores the known malicious processes with respect to other processes that are benign.

We use a combination of reward functions from Equations 3, 4, and 5. The first reward function is constructed based on common knowledge and intuition. If the I/O activity or the network activity decreases after a process creation, we can make an educated assumption that the process could be benign. We assign a reward value of '+1' if the I/O or network activity decreases after a process creation. For this we calculate the average of I/O read/write bytes for 5 time-steps before a process creation and 5 time-steps after a process creation. The same approach is used for KiloBytes (KB) sent/received. This is done because the effect of a process creation may not be immediate. In our first experiment (Exp 1) we simply use Equation 3 and Equation 4 as our reward function.

In our second experiment (Exp 2), we keep the reward functions constant. However, we change the exploration criteria for our RL algorithm based on prior knowledge. The exploration probability distribution is stated in Equation 2. This helps the RL algorithm explore state-action pairs that have high-priority nice values, leading to a faster convergence.

Reward(s, a) += 1   (if I/O write/read decreases or KB sent/received decreases)    (3)

Reward(s, a) −= 1   (otherwise)    (4)

Reward(s, a) = w1 · (nicevalue/20) + w2 · (numthreads(state, action) − numthreads(state−1, previousaction))    (5)
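The following compact sketch implements Equations 3-5 under the setup above. The 5-step averaging window follows the text; the exact feature plumbing is an assumption.

```python
import numpy as np

def intuition_reward(activity, t, window=5):
    """Eqs. 3-4: +1 if the average of an activity metric (I/O bytes or
    KB sent/received) drops after the process creation at step t, else -1."""
    before = np.mean(activity[max(0, t - window):t])
    after = np.mean(activity[t:t + window])
    return 1.0 if after < before else -1.0

def knowledge_reward(nice_value, threads_now, threads_prev, w1=1.0, w2=1.0):
    """Eq. 5: weighted sum of the nice-value term and the thread-count change."""
    return w1 * (nice_value / 20.0) + w2 * (threads_now - threads_prev)

io_bytes = np.array([50, 52, 49, 51, 50, 90, 95, 93, 97, 94], dtype=float)
print(intuition_reward(io_bytes, t=5))                        # I/O surged: -1.0
print(knowledge_reward(-20, threads_now=12, threads_prev=8))  # -1 + 4 = 3.0
```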
In the third experiment (Exp 3), we incorporate the prior knowledge source in our reward function. The prior knowledge source states that the nice values of the associated processes are high priority. We use Equation 5 with the value of w2 set to 0. A high-priority nice value will be close to -20. So, a state-action pair for which the nice value is close to -20 will have a negative reward associated with it if the value of w1 is set to 1.

In the last experiment (Exp 4), we select another source of information describing the Bill Gates Botnet family. The source tells us that this malware family spawns a significantly higher number of threads. We use Equations 3, 4, and 5 for this experiment, with a weighted sum of the two prior sources.

TABLE II: Ranking and Q-values of the known malicious process.
- Q-value: Exp-1: -5.1; Exp-2: -5.7; Exp-3: -7; Exp-4: -16.99
- Rank: Exp-1: 9 (of 99); Exp-2: 11 (of 99); Exp-3: 8 (of 99); Exp-4: 1 (of 99)

In Table II we can see that the reward function using the weighted mean of prior information sources is able to identify the known malicious process as the highest ranked process. The Q-values are greater in magnitude because the reward functions have more parameters included in them. This shows that including more knowledge sources in our RL algorithm yields better results.

In Figure 4, we can see the comparison of the time required to complete each episode, where each episode consists of 10,000 steps. We argued previously that tuning the exploration probability distribution would lead to a faster convergence. We observe that this is partly true. The sharp dips signify the exploitation phase of the RL algorithm, when the algorithm knows what it is looking for. The surges signify more time spent finding 'actions' during exploration that fit the state transition. The average time is lower in Figure 4a than in Figure 4b because in the beginning the environment is benign: the tuned probability distribution that is supposed to aid in finding malicious processes hinders the RL algorithm in finding the processes, or actions, that are benign in the beginning and fit the state transformation. However, we do observe some sharp peaks in Figure 4a as the RL algorithm's episodes move into the malicious phase. In contrast, the time per episode is more consistent in Figure 4b. This is the result of the changed probability distribution, which helps the RL algorithm find processes faster in the malicious phase.

Fig. 4: Comparison of time required to complete each episode for two experiments: (a) Exp 1 (user-defined reward functions), (b) Exp 2 (tuned exploration probabilities).

V. RELATED WORK

Security is critical in maintaining the integrity of cyber systems. Technological innovations have led to complex system architectures that are increasingly hard to protect. Pre-emptively identifying attack scenarios and taking mitigating actions is a complex task. There is no perfect solution to it yet. AI, especially ML, can be helpful in answering questions to which there is no easy answer. ML techniques have been previously used in the domain of cybersecurity for intrusion detection [3], malware detection [31], and cyberphysical attacks [4]. However, traditional ML techniques are not ideal for emulating how a security researcher conducts malware analysis. The shortcomings of traditional ML techniques in the domain of cybersecurity have been discussed in Section I.

RL is an alternate strategy for automatic resolution of tasks in domains where the correct answer is not known. This technique is popular in the fields of game playing [12], robotics [11], and even biological data [14]. There are similarities in the problem space of game playing and attack identification and mitigation in the cybersecurity domain. Previous literature explores how RL can be applied to various aspects of cyber security. A study by Feng et al. [5] characterises cyber state dynamics as a function of physical state, control inputs, disturbances, and current cyber attacks and defenses. The cyber defense problem was then modeled as a two-player zero-sum game. An RL algorithm was used to efficiently learn the optimal cyber defense strategy in the problem space. Some of these models generate adversaries to train the RL algorithm by changing some of the parameters of the malware sample, which may make it lose its potency. In our experiments, we use the behavior of actual malware samples. We do not automatically generate adversaries yet, as there are challenges to preserving the malware potency.

Robustness in the cyber-physical space measures the resiliency of a given specification being satisfied. The security problem is to find candidates with minimal robustness (counterexamples) by falsification or changing of inputs and parameters of the system. Conventional methods, such as simulated annealing and cross entropy, require a large number of simulation runs to minimize robustness. Akazaki et al. [34] proposed the use of RL techniques, i.e., Asynchronous Advantage Actor-Critic (A3C) and Double Deep Q Network (DDQN), to reduce the number of simulation runs required to find such counterexamples. Intrusion detection is also an important defense technique, and current techniques leave room for improvement. Xu et al. [33] proposed a kernel-based RL approach using Least-Squares Temporal-Difference (LS-TD) learning for intrusion detection that showed better accuracy than Hidden Markov Models and linear TD algorithms. Shamshirband et al. [29] discuss the challenges in detecting malicious behavior in Wireless Sensor Networks (WSNs). They explain why traditional intrusion detection methods fail to detect distributed denial-of-service attacks and propose a Game-based Fuzzy Q-learning (G-FQL) algorithm that combines a game theoretic approach and fuzzy Q-learning for intrusion detection in WSNs.

Although these papers use RL for malware analysis, these approaches do not take advantage of prior knowledge about the domain when developing an attack detection and mitigation strategy. The inability to use pre-existing knowledge puts prior RL algorithms for cybersecurity at a significant disadvantage. The AI has a very limited idea in the beginning about the environment the malware is acting on, and the possible actions that a human, who is an expert in this domain, would take. The same disadvantage burdened the RL algorithms for mobile robotics. In 2004, Moreno et al. [20] proposed a supervised reinforcement learning approach for mobile robotics that takes advantage of external knowledge and validates it in a "wall-following" behaviour. Although the paper broadly discussed ranking the 'prior human knowledge sources' by changing the exploration probability distribution, it also briefly discussed how this can be used to improve performance. In our study, we use a variant of this technique in one of our experiments. The key difference is that we are utilizing CKGs that already hold knowledge about cybersecurity, and we do not have to rely on expensive human inputs.
VI. CONCLUSION AND FUTURE WORK

In this paper, we propose using RL to detect malware, and incorporating prior knowledge from CKGs into the RL based detection system. This separates our approach from traditional uses of ML approaches to detect attacks. Our approach mimics the way cybersecurity professionals in a SOC analyze the reported sensor data based on their background knowledge to see if the system is currently under attack by a malware. Sometimes they resort to a guided trial-and-error method (manifested in this case by the RL algorithm), and in some cases they use their own knowledge and experience to derive conclusions.

Specifically, in our experiments, we show how prior knowledge taken from text sources describing a malware activity can be used in an RL algorithm to detect malicious processes. We suggest that deleting these processes may prove to be an acceptable mitigation strategy. However, the knowledge stored in the CKGs can actually provide multiple mitigation strategies that can be used as a malware executes. In ongoing work, we are using multiple known mitigation strategies after detonating a malware sample. This can produce better mitigation strategies for a malicious sample. We are also using the malware features to identify the malware family that a new malware sample belongs to. We can then use open source intelligence about that malware family to formulate candidate mitigation steps. This will help us build a robust mitigation strategy generator that will be able to use and integrate the knowledge of security researchers with RL, to yield the best sequence of steps needed for a particular malware attack to be defeated.

ACKNOWLEDGEMENT

The authors would like to thank Dr. Mahmoud Abdelsalam and Dr. Maanak Gupta for the dataset used in this work. This work was supported by a United States Department of Defense grant, a gift from IBM Research, and a National Science Foundation (NSF) grant, award number 2025685.

REFERENCES

[1] Mahmoud Abdelsalam, Ram Krishnan, Yufei Huang, and Ravi Sandhu. Malware detection in cloud infrastructures using convolutional neural networks. In 11th Int. Conf. on Cloud Computing. IEEE, 2018.
[2] Blake Anderson, Daniel Quist, Joshua Neil, Curtis Storlie, and Terran Lane. Graph-based malware detection using dynamic analysis. Journal in Computer Virology, 7(1):247–258, 2011.
[3] Kelton AP da Costa, João P Papa, Celso O Lisboa, Roberto Munoz, and Victor Hugo C de Albuquerque. Internet of things: A survey on machine learning-based intrusion detection approaches. Computer Networks, 151:147–157, 2019.
[4] Derui Ding, Qing-Long Han, Yang Xiang, Xiaohua Ge, and Xian-Ming Zhang. A survey on security control and attack detection for industrial cyber-physical systems. Neurocomputing, 275:1674–1683, 2018.
[5] Ming Feng and Hao Xu. Deep reinforcement learning based optimal defense for cyber-physical system in presence of unknown cyber-attack. In 2017 IEEE Symposium Series on Computational Intelligence (SSCI), pages 1–8. IEEE, 2017.
[6] D. Gavriluţ, M. Cimpoeşu, D. Anton, and L. Ciortuz. Malware detection using machine learning. In 2009 International Multiconference on Computer Science and Information Technology, pages 735–741, 2009.
[7] Eric Hutchins, Michael Cloppert, and Rohan Amin. Intelligence-driven computer network defense informed by analysis of adversary campaigns and intrusion kill chains. https://www.lockheedmartin.com/content/dam/lockheed-martin/rms/documents/cyber/LM-White-Paper-Intel-Driven-Defense.pdf.
[8] A. Janusz, D. Kałuza, A. Chadzyńska-Krasowska, B. Konarski, J. Holland, and D. Ślezak. IEEE BigData 2019 Cup: Suspicious network event recognition. In 2019 IEEE International Conference on Big Data (Big Data), pages 5881–5887, 2019.
[9] Karuna P Joshi, Aditi Gupta, Sudip Mittal, Claudia Pearce, and Tim Finin. ALDA: Cognitive assistant for legal document analytics. In AAAI Fall Symposium on Cognitive Assistance in Government and Public Sector Applications. AAAI Press, 2016.
[10] Maithilee Joshi, Sudip Mittal, Karuna P Joshi, and Tim Finin. Semantically rich, oblivious access control using ABAC for secure cloud storage. In Int. Conf. on Edge Computing, pages 142–149. IEEE, 2017.
[11] Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
[12] Robert Levinson. General game-playing and reinforcement learning. Computational Intelligence, 12(1):155–176, 1996.
[13] Jing Liu, Yuan Wang, and Yongjun Wang. The similarity analysis of malicious software. In Int. Conf. on Data Science in Cyberspace. IEEE, 2016.
[14] Mufti Mahmud, Mohammed Shamim Kaiser, Amir Hussain, and Stefano Vassanelli. Applications of deep learning and reinforcement learning to biological data. IEEE Transactions on Neural Networks and Learning Systems, 29(6):2063–2079, 2018.
[15] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In 26th International Conference on Neural Information Processing Systems - Vol. 2, pages 3111–3119. ACM, 2013.
[16] Sudip Mittal, Prajit Das, Varish Mulwad, Anupam Joshi, and Tim Finin. CyberTwitter: Using Twitter to generate alerts for cybersecurity threats and vulnerabilities. In IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. IEEE Press, 2016.
[17] Sudip Mittal, Anupam Joshi, and Tim Finin. Thinking, fast and slow: Combining vector spaces and knowledge graphs. arXiv preprint arXiv:1708.03310, 2017.
[18] Sudip Mittal, Anupam Joshi, and Tim Finin. Cyber-All-Intel: An AI for security related threat intelligence. arXiv preprint arXiv:1905.02895, 2019.
[19] D. Moreno, Carlos V. Regueiro, R. Iglesias, and S. Barro. Using prior knowledge to improve reinforcement learning in mobile robotics. 2004.
[20] David L Moreno, Carlos V Regueiro, Roberto Iglesias, and Senén Barro. Using prior knowledge to improve reinforcement learning in mobile robotics. Proc. Towards Autonomous Robotics Systems. Univ. of Essex, UK, 2004.
[21] Lorenzo Neil, Sudip Mittal, and Anupam Joshi. Mining threat intelligence about open-source projects and libraries from code repository issues and bug reports. In Int. Conf. on Intelligence and Security Informatics. IEEE, 2018.
[22] Younghee Park, Douglas Reeves, Vikram Mulukutla, and Balaji Sundaravel. Fast malware classification by automated behavioral graph matching. In 6th Annual Workshop on Cyber Security and Information Intelligence Research. ACM, 2010.
[23] Aditya Pingle, Aritran Piplai, Sudip Mittal, Anupam Joshi, James Holt, and Richard Zak. RelExt: Relation extraction using deep learning approaches for cybersecurity knowledge graph improvement. In Int. Conf. on Advances in Social Networks Analysis and Mining. IEEE, 2019.
[24] A. Piplai, S. S. L. Chukkapalli, and A. Joshi. NAttack! Adversarial attacks to bypass a GAN based classifier trained to detect network intrusion. In 2020 IEEE 6th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing (HPSC), and IEEE Intl Conference on Intelligent Data and Security (IDS), pages 49–54, 2020.
[25] Aritran Piplai, Sudip Mittal, Mahmoud Abdelsalam, Maanak Gupta, Anupam Joshi, and Tim Finin. Knowledge enrichment by fusing representations for malware threat intelligence and behavior. In IEEE International Conference on Intelligence and Security Informatics (ISI), 2020.
[26] Aritran Piplai, Sudip Mittal, Anupam Joshi, Tim Finin, James Holt, and Richard Zak. Creating cybersecurity knowledge graphs from malware after action reports. IEEE Access, 2020.
[27] Priyanka Ranade, Sudip Mittal, Anupam Joshi, and Karuna Joshi. Using deep neural networks to translate multi-lingual threat intelligence. In Int. Conf. on Intelligence and Security Informatics. IEEE, 2018.
[28] Maheshkumar Sabhnani and Gursel Serpen. Why machine learning algorithms fail in misuse detection on KDD intrusion detection data set. Intelligent Data Analysis, 2004.
[29] Shahaboddin Shamshirband, Ahmed Patel, Nor Badrul Anuar, Miss Laiha Mat Kiah, and Ajith Abraham. Cooperative game theoretic approach using fuzzy Q-learning for detecting and preventing intrusions in wireless sensor networks. Engineering Applications of Artificial Intelligence, 32:228–241, 2014.
[30] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press, 2018.
[31] Daniele Ucci, Leonardo Aniello, and Roberto Baldoni. Survey of machine learning techniques for malware analysis. Computers & Security, 81:123–147, 2019.
[32] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8:279–292, 1992.
[33] Xin Xu and Yirong Luo. A kernel-based reinforcement learning approach to dynamic behavior modeling of intrusion detection. In International Symposium on Neural Networks, pages 455–464. Springer, 2007.
[34] Yoriyuki Yamagata, Shuang Liu, Takumi Akazaki, Yihai Duan, and Jianye Hao. Falsification of cyber-physical systems using deep reinforcement learning. IEEE Transactions on Software Engineering, 2020.
[35] Roman V. Yampolskiy. Artificial intelligence safety and cybersecurity: a timeline of AI failures. arXiv, 2016.