Graph RL Malware Detection
Abstract—Machine learning algorithms used to detect attacks are limited by the fact that they cannot incorporate the background knowledge that an analyst has. This limits their suitability in detecting new attacks. Reinforcement learning is different from traditional machine learning algorithms used in the cybersecurity domain. Compared to traditional ML algorithms, reinforcement learning does not need a mapping of the input-output space or a specific user-defined metric to compare data points. This is important for the cybersecurity domain, especially for malware detection and mitigation, as not all problems have a single, known, correct answer. Often, security researchers have to resort to guided trial and error to understand the presence of a malware and mitigate it.

In this paper, we incorporate prior knowledge, represented as Cybersecurity Knowledge Graphs (CKGs), to guide the exploration of an RL algorithm to detect malware. CKGs capture semantic relationships between cyber-entities, including those mined from open sources. Instead of trying out random guesses and observing the change in the environment, we aim to take the help of verified knowledge about cyber-attacks to guide our reinforcement learning algorithm to effectively identify ways to detect the presence of malicious filenames so that they can be deleted to mitigate a cyber-attack. We show that such a guided system outperforms a base RL system in detecting malware.

Index Terms—Reinforcement Learning, Knowledge Graphs, Cybersecurity, Artificial Intelligence

I. INTRODUCTION

Cybersecurity aims to protect hardware and software components in a computer system against malicious attacks. A key part of protecting against malicious attacks is identifying them as early as possible in the Cyber Kill Chain [7]. The most common approaches today are signature based, which of course can be defeated by adversaries through minor modifications of the malware or its dropper. Another approach is based on behavioural anomalies: observing the system behavior over time and flagging any aberrant behavior. However, such anomaly detection based approaches often misidentify less frequently occurring legitimate system actions as attacks. Another commonly used approach is to apply machine learning to network data to detect attacks [24]. In 2019, for instance, the IEEE Big Data conference organized a competition on the use of machine learning algorithms to detect attacks based on observed network data [8].

Machine learning is an important tool for recognising patterns in data. In the cybersecurity space, this can be useful to identify the behavior of a system under attack. However, standard machine learning algorithms have limitations in the cybersecurity space, especially in real deployments [28], [35]. In part, this is because the dataset is highly imbalanced – most of the observed data in a real system is not attacks, and unless a dataset is artificially curated to be balanced, it will mostly contain benign data. In other words, the base rate of an attack is quite low. Hence, most machine learning algorithms tend to overfit to the data points of non-attack scenarios. Such learning algorithms are prone to weak performance in the real world. They also struggle to generalize to unseen attacks.

There are also other problems in the cybersecurity space for traditional ML approaches. In other domains like natural language processing or computer vision, it is fairly simple for a human who is not necessarily an expert in the specific task to draw, with high confidence, the conclusion the Artificial Intelligence (AI) system is trying to reach. For example, a human may not need complicated background knowledge, except what is implicit in the image itself, to differentiate the image of a cat from a dog. However, looking at netflow data, it is not easy for a person to decide if it represents an attack – this requires significant expertise and background knowledge. So when we are trying to mimic the way security professionals reach a decision about a task, it is imperative that we consider a way to incorporate into the machine learning system their prior knowledge that is not in the data itself. This is because their prior knowledge and experience, as opposed to common knowledge, may dictate how they perform their professional tasks. For example, a security professional might use what they know about recent discussions of vulnerabilities in online forums when identifying an attack on their system. Also, in tasks like image recognition, there is not much scope for subjectivity in reaching a conclusion: usually, an image either contains a cat or it does not. However, what constitutes an attack is not always clear, and for sophisticated APTs, even experts may not recognize an attack as it is happening. It has been reported that an APT attack is often identified by a third party, not the in-house security team of the organization being attacked.

Reinforcement Learning (RL) mimics the way human beings tend to process new information. RL does not require sub-optimal actions to be explicitly corrected. Instead, it tries to find a balance between exploration (of new knowledge) and exploitation (of current knowledge). It does not assume
any prior mathematical modelling of the environment. This gives the RL algorithm more flexibility in learning about a new knowledge space.

In this paper, we make two key contributions. First, we use reinforcement learning for malware detection. Second, we use knowledge mined from open information sources describing the same or similar malware attacks to change the behavior of the RL algorithm, automatically adjusting its parameters to adapt its reward functions and their initial probability distributions. This approach is not only a way to avoid relying on pre-defined loss functions for traditional ML systems created by individuals who may not be experts in cybersecurity; it also helps to mimic how security professionals use their own knowledge in identifying attacks.

We organize our paper as follows - Section II discusses the key concepts behind the different aspects of our algorithm. In Section III, we discuss our core algorithm. We discuss our findings in Section IV. We also discuss some of the relevant work conducted in this area in Section V, and we finally conclude our paper in Section VI.

II. BACKGROUND

In this section, we discuss key details of the reinforcement learning algorithm and our general approach of representing extracted open source text in a Cybersecurity Knowledge Graph (CKG).

A. Reinforcement Learning

We utilize methods in model-free reinforcement learning in our approach [30]. Model-free RL agents employ prior experience and inductive reasoning to estimate, rather than generate, the value of a particular action. There are many kinds of model-free reinforcement learning models, such as SARSA, Monte Carlo Control, Actor-Critic and Q-learning. We specifically utilize Q-learning methods [30].

Q-learning agents learn an action-value function (policy), which returns the reward of taking an action given a state. Q-learning utilizes the Bellman equation in order to learn expected future rewards. We calculate the maximum future reward max_{a′} Q(s′, a′) given a set of multiple actions corresponding to different rewards. Q(s, a) is the current policy of an action a from state s, and γ is the discount factor. The discount factor controls how strongly rewards received between the current iteration and termination are weighted, and allows us to value short-term reward over long-term reward. The goal is therefore to maximize the discounted future reward at every iteration [32]:

NewQ(s, a) = Q(s, a) + α [R(s, a) + γ max_{a′} Q(s′, a′) − Q(s, a)]

where NewQ(s, a) is the new Q-value, α is the learning rate, R(s, a) is the reward, γ is the discount rate, and max_{a′} Q(s′, a′) is the maximum predicted reward given the new state s′ and all possible actions a′.

We update a policy table, also known as a Q-table, for every action taken from a state. A Q-table is simply a lookup table that stores the maximum expected reward for each action at each state. The columns represent actions and the rows represent states. The Q-table is improved at each iteration, and the update is controlled by the learning rate α [32].

B. Cybersecurity Knowledge Graphs (CKGs)

Cybersecurity Knowledge Graphs have been widely used to represent Cyber Threat Intelligence (CTI). We use a CKG to store CTI as semantic triples, which helps in understanding how different cyber entities are related. This representation allows users to query the system and reason over the information. Knowledge graphs for cybersecurity have been used before to represent various entities [23]. Open source CTI has been used to build CKGs and other agents to aid cybersecurity analysts working in an organization [16]–[18], [21], [27]. CKGs have also been used to compare different malware by Liu et al. [13]. Some behavioral aspects have also been incorporated in CKGs, where the authors used system call information [22]. Graph based methods have also been post-processed by machine learning algorithms, as demonstrated by other approaches [2], [9], [10].

III. METHODOLOGY

In this section, we discuss the principles of our proposed algorithm. Figure 1 gives an overview of the different aspects of our system. We provide text describing a piece of malware as an input to our cyber knowledge extraction pipeline and receive a set of semantic triples as an output. We assert this set of triples to a CKG. We sometimes fuse this CKG with data from other sources such as behavior analysis [25]. This forms our knowledge base, which helps us identify malicious activity in a system and also suggests ways to mitigate it. The RL algorithm acts on the malware behavior data. We tune the parameters of the RL algorithm based on the information present in the CKG, which can be retrieved by querying the CKG.

A. Cyber Knowledge Extraction from prior knowledge sources

We have an established cyber knowledge extraction pipeline that takes malware After Action Reports and automatically populates CKGs [26]. The method uses a trained ‘Malware Entity Extractor’ to detect cyber named entities. Once the malware entities have been detected, a deep neural network is used to identify the relationships between them [23]. The relationship extractor takes pairs of entities, generates their text embedding using a trained word2vec [15] model, and produces a relationship as an output for each pair of entities. Finally, the entity-relationship set is asserted into the CKG.

We use these trained models on open-source text describing the malware we use in our experiments, or similar malware. In order to find open source text analysing the malware, we use the known MD5 hash of the malware and perform a web search to look for articles discussing the same malware.
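As an illustration of how extracted CTI can be stored and queried as semantic triples, the sketch below keeps triples in a plain Python structure. The entity and relationship names are hypothetical examples, not output of the actual extraction pipeline, and this is not the paper's implementation.

```python
# Minimal triple store sketch for extracted CTI.
# Entities/relations below are hypothetical, not real pipeline output.

from collections import defaultdict


class TinyCKG:
    def __init__(self):
        self.triples = set()
        self.by_subject = defaultdict(set)

    def assert_triple(self, subj, rel, obj):
        """Add one (entity, relationship, entity) semantic triple."""
        self.triples.add((subj, rel, obj))
        self.by_subject[subj].add((rel, obj))

    def query(self, subj=None, rel=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return [t for t in self.triples
                if (subj is None or t[0] == subj)
                and (rel is None or t[1] == rel)
                and (obj is None or t[2] == obj)]


ckg = TinyCKG()
# Hypothetical triples of the kind the extraction pipeline might emit.
ckg.assert_triple("EvilMalware", "creates", "dropper.exe")
ckg.assert_triple("EvilMalware", "communicates_with", "203.0.113.7")
ckg.assert_triple("dropper.exe", "spawns", "payload.exe")

# Which processes is this malware known to create?
print(ckg.query(subj="EvilMalware", rel="creates"))
# → [('EvilMalware', 'creates', 'dropper.exe')]
```

Answers to such pattern queries are what the RL component could consume when tuning its parameters.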
Fig. 1: An architecture diagram specifying the different steps of our proposed method
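The Q-table update rule from Section II-A can be sketched as a small self-contained example. The state and action counts, reward values, and the CKG-derived indices below are illustrative stand-ins, not the settings used in our experiments.

```python
import numpy as np

# Illustrative sizes; in the paper the state space has one state per
# behavior-data row and there are 99 actions (distinct process names).
n_states, n_actions = 5, 3
alpha, gamma = 0.1, 0.9              # learning rate and discount factor

Q = np.zeros((n_states, n_actions))  # rows: states, columns: actions


def q_update(s, a, r, s_next):
    """One Bellman update:
    NewQ(s,a) = Q(s,a) + alpha * (R(s,a) + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])


# Example transition: action 1 in state 0 yields reward 1.0.
q_update(s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1])  # 0.1, i.e. alpha * reward, since Q started at zero

# A CKG-guided variant (Section III) could bias initial Q-values toward
# actions (process names) the knowledge graph flags as malicious; the
# index here is a hypothetical query result, not real data.
known_malicious = [2]
Q[:, known_malicious] += 0.5  # optimistic prior for CKG-flagged actions
```

The prior in the last step only shifts where exploration starts; the Bellman updates still correct the table from observed rewards.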
B. Reinforcement Learning Algorithm

We detonated malware samples in an isolated environment and observed system parameters that represent the behavior of each malware sample. We used the same method to collect the malware behavior data as described by Piplai et al. [25]. The malware dataset comprises samples downloaded from VirusTotal1. The data was collected over a time period of 60 minutes for each malware. We refer to the first 30 minutes, when the malware is not active, as the benign phase. The next 30 minutes form the malicious phase, when the malware is active in the system. Traffic was generated artificially to simulate multiple clients interacting with the malware-infected system. Table I describes the different parameters that were collected during the exercise. In our paper, we aim to use Reinforcement Learning to identify a sequence of malicious processes that may be created by the malware to perform the attack.

1 VirusTotal. https://www.virustotal.com/

Fig. 2: Diagram showing the state transformation with respect to the actions in our dataset

We use Q-Learning for the purpose of identifying malicious process names. The behavior data that we collected resulted in data snapshots at a time interval of 10 seconds. For Q-Learning problems, we define a state space and an action space. To define a state space, we ideally need to look at key identifiers that can be used to isolate a row in our data. We use the ‘timestamp’ for this purpose, which creates as many states as there are rows in our data. For a single experiment, we have 26000 to 28000 rows. We define our action space by the total number of distinct process names that are created in a single experiment. Since this number may vary per experiment, we take a superset of all the processes that are created across all our experiments. In our experiments, we see that a total of 99 distinct processes are created over the course of the data collection period. We use
TABLE I: Virtual machines performance metrics [1].
Metric Category Description
Status Process status
CPU information CPU usage percent, CPU times in user space, CPU times in system/kernel space, CPU times of children processes in user
space, CPU times of children processes in system space.
Context switches Number of context switches voluntary, Number of context switches involuntary
IO counters Number of read requests, Number of write requests, Number of read bytes, Number of written bytes, Number of read chars,
Number of written chars
Memory information Amount of memory swapped out to disk, Proportional set size (PSS), Resident set size (RSS), Unique set size (USS), Virtual
memory size (VMS), Number of dirty pages, Amount of physical memory, Text resident set (TRS), Memory used by shared
libraries, Memory shared with other processes
Threads Number of used threads
File descriptors Number of opened file descriptors
Network information Number of received bytes, Number of sent bytes
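The state-space and action-space construction described in Section III-B can be sketched as follows. The field names and sample rows are hypothetical stand-ins for the collected behavior snapshots, not the actual dataset.

```python
# Sketch: derive Q-learning state and action spaces from behavior
# snapshots, as in Section III-B. Rows and field names are hypothetical.

# Each snapshot row: (timestamp, process_name, cpu_percent)
experiment_a = [(0, "svchost.exe", 1.2), (10, "dropper.exe", 55.0)]
experiment_b = [(0, "svchost.exe", 0.9), (10, "payload.exe", 80.0)]

# One state per row, keyed by timestamp (unique within an experiment).
states = [row[0] for row in experiment_a]

# Action space: superset of distinct process names across experiments,
# since the set of created processes varies per experiment.
actions = sorted({row[1]
                  for exp in (experiment_a, experiment_b)
                  for row in exp})
print(actions)  # → ['dropper.exe', 'payload.exe', 'svchost.exe']
```

Taking the superset keeps the action indexing consistent across experiments, so a single Q-table layout can be reused for all of them.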