Journal of Computer Virology and Hacking Techniques

https://doi.org/10.1007/s11416-021-00383-1

ORIGINAL PAPER

Markhor: malware detection using fuzzy similarity of system call dependency sequences

Amir Mohammadzade Lajevardi¹ · Saeed Parsa² · Mohammad Javad Amiri³

Received: 1 November 2020 / Accepted: 4 April 2021


© The Author(s), under exclusive licence to Springer-Verlag France SAS, part of Springer Nature 2021

Abstract
Static malware detection approaches are time-consuming and cannot deal with code obfuscation techniques. Dynamic malware detection approaches address these two challenges but suffer from behavioral ambiguity, such as system call obfuscation. In this paper, we introduce Markhor, a dynamic and behavior-based malware detection approach. Markhor uses system call data dependency and system call control dependency sequences to create a weighted list of malicious patterns. The list is then used to identify malicious processes: the similarity of a file's system call sequences to the malicious patterns is computed with a fuzzy algorithm, and the nature of the file is determined from it. The evaluation results show the effectiveness of Markhor in terms of accuracy (0.982), precision (0.976), and F-measure (0.982).

Amir Mohammadzade Lajevardi: [email protected]
Saeed Parsa: [email protected]
Mohammad Javad Amiri: [email protected]

¹ Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
² Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
³ Department of Computer and Information Science, University of Pennsylvania, Pennsylvania, USA

1 Introduction

Malware detection approaches can be categorized into static and dynamic approaches [1]. Static approaches can determine the nature of the software (i.e., malicious or benign) without running it, but they are time-consuming and vulnerable to code obfuscation. Dynamic approaches, on the other hand, execute the software and determine its nature from the requests it issues, which results in faster detection.

Malware detection approaches can also be categorized into signature-based and behavior-based approaches. Signature-based approaches extract a specific byte code for a malware family and use it to detect all samples of that particular family. While signature-based approaches are very fast, they suffer from several drawbacks. First, the number of signatures increases over time, and such techniques cannot be used to detect new malware families [2]. Moreover, signature-based approaches require a database that must be updated at short intervals [2]. Finally, malware that uses deformation or code obfuscation is not easily detected with this approach. In behavior-based approaches, on the other hand, malware is detected by analyzing its behavior [3,4]. In such approaches, malware behavioral patterns are modeled, and if malicious behavior is recognized, the program is prevented from running.

In software security, malware behavior is detected according to its use of system resources. The behavior of malware is therefore classified into five classes [3,5,6]: file-based behaviors, process-based behaviors, windows-based behaviors, network-based behaviors, and operating system alteration-based behaviors. Figure 1 shows the number of application programming interface (API) calls for the 386 malware samples analyzed in [7]. As shown there, malware behaviors were mostly intended to search files in order to read or write them.

Fig. 1 API calls distribution in 386 malware analyzed in [7]

In this paper, we present Markhor,* a behavior-based and dynamic approach for detecting malicious files or processes. Markhor uses system call data dependency and system call control dependency sequences to create a weighted list of malicious patterns. In Markhor, the similarity of a file's system call sequences to a malicious pattern is computed with a fuzzy algorithm, and the nature of the file is determined at malware run time.

* Markhor (Capra falconeri) is a large Capra species native to Central Asia, Karakoram and the Himalayas. The name is thought to be derived from Persian, as a conjunction of mar (“snake, serpent”) and the suffix khor (“-eater”), interpreted to represent the animal's alleged ability to kill snakes.

The main contributions of this paper are:

1. Using semantic relations of API calls instead of structural relations to construct the real malicious API sequences,
2. Detecting fake API calls from the malicious behaviors of malware, and
3. Proposing a fuzzy-based algorithm to determine whether a program is malicious or benign.

The rest of this paper is organized as follows. Section 2 discusses related work. Markhor is introduced in Sect. 3. Section 4 evaluates the performance of Markhor, and Sect. 5 concludes the paper.

2 Related work

Behavioral feature extraction is one of the most important components of malware detection approaches. If the features are not extracted correctly, the type of software cannot be detected accurately. Behavioral feature extraction can be done statically or dynamically. Static extraction is suitable when malware behavior needs to be analyzed without running it. This is done by analyzing the program source code and the code extracted from the program. Such methods either use the program's import address table [4] or track call operations in the program code [5].

The dynamic extraction of behavioral features requires running the malware first and then tracking its behavior. Dynamic approaches are more efficient than static ones because, in static analysis, the code, the address table, and operations might be obfuscated or encoded. As a result, it might be too difficult to detect their real values. In the dynamic method, in contrast, the malware is decoded and sends its requests to the operating system. These requests cannot be obfuscated, because if the operating system cannot recognize a request, it cannot send the desired response. Dynamic approaches include inline API hooking [6] and tracking with the operating system service descriptor table [8].

After extracting and tracking a program's behavior accurately in a secure environment, it is necessary to determine whether the program is benign or malicious.

In the rest of this section, we discuss malware detection approaches that use behavioral features. Extracting the dynamic-link library dependency tree for suspicious software from the import address table (IAT) without executing the application is proposed in [9] to detect malware. The approach is able to detect fake dynamic-link library injection and uses Dependency Walker [10] to generate the behavioral tree of each file. The main drawback of this approach is its low accuracy for malware in which the IAT is destroyed. The extracted behavioral tree might also have many unimportant nodes, which increases detection time.

Modeling program behavior based on the frequency of API calls is studied in [11]. The method, however, assumes that API calls are independent, which results in lower precision. Most recent malware detection approaches [12–17] use API call sequences to reduce false positives and detect fake API call injection. These approaches use the API call sequence as the malicious pattern: the API call sequence is extracted for each malware and benign program, and data mining techniques are then used to extract the sequences that are effective for detecting malware. Finally, the dependency between API calls is considered based on their call sequence. The main challenge in these approaches is the
way the API call sequence is extracted, which reduces the detection rate.

3 Markhor

The proposed approach consists of four main steps. In the first step, test data including malware and benign files are collected. The second step extracts malware behavioral features by tracking malware behavior. In the third step, useful patterns are extracted from the test data using system call dependency sequences. Finally, in the fourth step, the nature of a program is determined using the extracted patterns. In this section, we present these four steps in detail.

3.1 Test data collection

In most existing approaches, the behavioral model is created using the dataset presented in [18]. This dataset includes information about malware behavior based on some predefined system calls. Since our proposed approach covers a wide range of API calls, the executable file of each malware sample is needed. Therefore, the VirusSign malware database [19] is used to collect malware executables. We collect 18629 malware and 15460 benign programs in total. The benign programs are mainly obtained from the Program Files and Windows directories of the Windows operating system. The malware families used to produce test data and their corresponding counts are shown in Table 1.

Table 1 Malware families used to produce test data

Malware family   Count
Trojan           11735
Virus            4060
Worm             1413
Backdoor         1388
Rootkit          33
Sum              18629

3.2 Feature extraction

To extract the behavior of a malware sample, it needs to be run in an operating system. Running malware, however, affects the operating system; hence, most existing approaches use virtual machines and sandboxes to run malware. In Markhor, we use VirtualBox as the virtual machine, with Windows XP as the guest operating system. We also use API Monitor [20] to track each file's behavior. This software accepts a malware executable as input and records its API calls at run time. Furthermore, by running the malware, data and control dependency sequences are extracted. Note that the collected malware does not use red-pills at run time. Using red-pills, malware can recognize the virtual machine environment and hide its real behaviour. Each malware sample is run for 5 min. During this time, its behaviour is extracted and logged as a sequence of API calls with their arguments. A part of the intercepted API calls for the notepad.exe sample file is shown in Fig. 2.

Fig. 2 A part of intercepted API calls for the notepad.exe sample file

3.3 Pattern extraction from test data

The aim of this step is to extract malicious patterns from the dataset. These patterns are extracted based on control dependency and data dependency, which are described in the rest of this section.

3.3.1 System call dependency sequence (SCDS) extraction

In Markhor, system calls and their dependencies are modeled in a novel way. We use a sample code, shown in Code 1, to explain the proposed method. As shown in Code 1, the program attempts to read or write a file using a sequence of system calls. Condition1 defines the type of operation on the file, which can be either a read or a write. Since this paper is intended for dynamic analysis of malware (run-time analysis), we assume that the condition holds, so the write operation is performed successfully on the file.

    #include <ntddk.h>
    #include <ntstrsafe.h>

    #define BUFFER_SIZE 30

    // The wrapper function and its name are illustrative. Condition1 is the
    // paper's placeholder for the condition that selects write or read.
    NTSTATUS ReadOrWriteFile(BOOLEAN Condition1)
    {
        UNICODE_STRING uniName;
        OBJECT_ATTRIBUTES objAttr;
        HANDLE handle;
        NTSTATUS ntstatus;
        IO_STATUS_BLOCK ioStatusBlock;
        CHAR buffer[BUFFER_SIZE];   // shared by the write and the read branch
        size_t cb;

        // Refer to a file by its object name
        RtlInitUnicodeString(&uniName, L"\\DosDevices\\C:\\WINDOWS\\example.txt");
        InitializeObjectAttributes(&objAttr, &uniName,
                                   OBJ_CASE_INSENSITIVE | OBJ_KERNEL_HANDLE,
                                   NULL, NULL);

        // Obtain a file handle
        if (KeGetCurrentIrql() != PASSIVE_LEVEL)
            return STATUS_INVALID_DEVICE_STATE;
        ntstatus = ZwCreateFile(&handle, GENERIC_WRITE, &objAttr, &ioStatusBlock,
                                NULL, FILE_ATTRIBUTE_NORMAL, 0, FILE_OVERWRITE_IF,
                                FILE_SYNCHRONOUS_IO_NONALERT, NULL, 0);

        if (Condition1)
        {
            // Write to a file
            if (NT_SUCCESS(ntstatus))
            {
                // Routine names follow the paper and Tables 2 and 3; ntstrsafe.h
                // declares them with an A/W suffix (e.g., RtlStringCbPrintfA).
                ntstatus = RtlStringCbPrintf(buffer, sizeof(buffer),
                                             "This is %d test\r\n", 0x0);
                if (NT_SUCCESS(ntstatus))
                {
                    ntstatus = RtlStringCbLength(buffer, sizeof(buffer), &cb);
                    if (NT_SUCCESS(ntstatus))
                    {
                        ntstatus = ZwWriteFile(handle, NULL, NULL, NULL, &ioStatusBlock,
                                               buffer, (ULONG)cb, NULL, NULL);
                    }
                }
                ZwClose(handle);
            }
        }
        else // Not Condition1
        {
            // Read from a file
            LARGE_INTEGER byteOffset;

            if (NT_SUCCESS(ntstatus))
            {
                byteOffset.LowPart = byteOffset.HighPart = 0;
                ntstatus = ZwReadFile(handle, NULL, NULL, NULL, &ioStatusBlock,
                                      buffer, BUFFER_SIZE, &byteOffset, NULL);
                if (NT_SUCCESS(ntstatus))
                {
                    buffer[BUFFER_SIZE - 1] = '\0';
                    DbgPrint("%s\n", buffer);
                }
                ZwClose(handle);
            }
        }
        return ntstatus;
    }

Code 1 Code sample for reading and writing a file.

The process of system call dependency sequence (SCDS) extraction is described in the following steps.

Step one: System call control dependency sequence (SCCDS) extraction: Since the program is analyzed dynamically (and not statically), the system call control dependency forms a sequence rather than a graph. Figure 3 shows the system call control sequence for sample Code 1. According to this figure, the system call control dependency sequence (SCCDS) for Code 1 is as follows:

sccds(code1) = {1 → 2 → 3 → 4 → 5 → 6 → 7 → 8 → 9 → 10 → 11}

Fig. 3 System call control dependency sequence for the code shown in Code 1

It is very important to eliminate dependencies between system calls that are not related to each other logically and semantically. Therefore, it is necessary to identify data dependencies between system calls rather than sequential dependencies.

Step two: System call data dependency sequence (SCDDS) extraction: To find semantic dependence between system calls, we use data dependency among these calls. The parameters of system calls are mainly of type in or out. A data dependency between system calls occurs when the output of one system call is the input of another. The way these dependencies are extracted is described below.

– Def-use pair: In this part, def-use pairs are extracted for each system call. For each system call, definitions are its parameters of type output. Uses are its non-constant parameters of type input that were previously defined in another system call. To extract def-use pairs, the types of the parameters must be determined. Table 2 shows the types of the parameters of the system calls in sample Code 1. It is worth mentioning that since in most cases the return value of a system call is of type Boolean or NtStatus, the focus of this paper is on system call parameters, not on return values. After determining the parameter types, we can extract the def-use pairs for each system call. Table 3 shows these pairs.

Table 2 Types of system call parameters

Method                      Parameter properties
RtlInitUnicodeString        Par1 = Out, Par2 = In (optional)
InitializeObjectAttributes  Par1 = Out, Par2 = In, Par3 = In, Par4 = In, Par5 = In (optional)
KeGetCurrentIrql            –
ZwCreateFile                Par1 = Out, Par2 = In, Par3 = In, Par4 = Out, Par5 = In (optional),
                            Par6 = In, Par7 = In, Par8 = In, Par9 = In, Par10 = In (optional), Par11 = In
NT_SUCCESS                  Par1 = In
RtlStringCbPrintf           Par1 = Out, Par2 = In, Par3 = In
RtlStringCbLength           Par1 = In, Par2 = In, Par3 = Out (optional)
ZwWriteFile                 Par1 = In, Par2 = In (optional), Par3 = In (optional), Par4 = In (optional),
                            Par5 = Out, Par6 = In, Par7 = In, Par8 = In (optional), Par9 = In (optional)
ZwReadFile                  Par1 = In, Par2 = In (optional), Par3 = In (optional), Par4 = In (optional),
                            Par5 = Out, Par6 = Out, Par7 = In, Par8 = In (optional), Par9 = In (optional)
ZwClose                     Par1 = In

Table 3 Def-use pairs extraction for system calls for sample Code 1

# Node  Method name                 Def-use chain
1       RtlInitUnicodeString        Def: uniName
                                    Use: par3: L\\DosDevices\\C:\\WINDOWS\\example.txt
                                         (par4: OBJ_CASE_INSENSITIVE | OBJ_KERNEL_HANDLE)
2       InitializeObjectAttributes  Def: objAttr
                                    Use: –
3       KeGetCurrentIrql            Def: –
                                    Use: –
4       ZwCreateFile                Def: handle, ioStatusBlock
                                    Use: objAttr
                                         (par2: GENERIC_WRITE) (par6: FILE_ATTRIBUTE_NORMAL)
                                         (par7: 0) (par8: FILE_OVERWRITE_IF)
                                         (par9: FILE_SYNCHRONOUS_IO_NONALERT) (par11: 0)
5       NT_SUCCESS                  Def: –
                                    Use: ntstatus
6       RtlStringCbPrintf           Def: buffer
                                    Use: buffer
                                         (par3: "This is %d test\r\n")
7       NT_SUCCESS                  Def: –
                                    Use: ntstatus
8       RtlStringCbLength           Def: cb
                                    Use: buffer
9       NT_SUCCESS                  Def: –
                                    Use: ntstatus
10      ZwWriteFile                 Def: ioStatusBlock
                                    Use: handle, buffer, cb
11      ZwClose                     Def: –
                                    Use: handle

– Reaching definition extraction: Using the def-use pairs and the system call control dependency sequence, the data dependency among methods can be determined. To do so, four sets are defined as follows:
  – Gen: Set of definitions (out parameters) made by a system call.
  – In: Set of definitions from the previous system calls that reach the current system call (according to the system call control dependency sequence).
  – Kill: Set of definitions from the previous system calls that reach the current system call but are killed by a redefinition in the current system call.
  – Out: Set of definitions leaving the current system call towards the next system calls.

Values in these sets are written as ordered pairs (i, j), in which i is the number of the function in the SCDS in which variable j is used, defined, entered or left. The dependency between these sets is shown in Table 4. Symbol B refers to a system call in the system call control dependency sequence. Reaching definitions are then extracted using Algorithm 1. For Code 1, these sets are shown in Table 5.

Table 4 The dependencies between sets In, Kill, Out, and Gen

In[B] = ∪ Out[p], ∀ p ∈ Predecessor(B)
Kill[B] = {(i, j) | (i, j) ∈ In[B], (k, j) ∈ Gen[B]}
Out[B] = Gen[B] ∪ (In[B] − Kill[B])

Algorithm 1: Reaching Definitions Extraction Algorithm
Data: Gen for each node n
Result: In and Out for each node n
initialization;
n = SCCDS start node;
repeat
    In[n] = Out[P];    // where P is the parent of n; In is null for the start node
    Kill[n] = {(i, j) | (i, j) ∈ In[n], (k, j) ∈ Gen[n]};
    Out[n] = Gen[n] ∪ (In[n] − Kill[n]);
    n = child node of n;
until n = null;

– Def-use chain extraction: To extract the def-use chain, Algorithm 2 is used to show where each definition is used. The def-use chains for sample Code 1 are extracted and shown in Table 6.

Algorithm 2: Def-use Chain Extraction Algorithm
Data: A system call flow graph for which the In sets for reaching definitions have been computed for each node n.
Result: DUChain: a set of definition-use pairs.
/* Method: Visit each node in the control flow graph. For each node, use upwards
   exposed uses and reaching definitions to form definition-use chains. */
initialization;
DUChain = ∅;
foreach node n do
    foreach use U in n do
        foreach reaching definition D in In[n] do
            if D is a definition of v and U is a use of v then
                DUChain = DUChain ∪ (D, U)
            end
        end
    end
end

So the def-use chain for Code 1, based on Table 6, is as follows:

{(2: objAttr, 4), (6: buffer, 8), (4: handle, 10), (6: buffer, 10), (8: cb, 10), (4: handle, 11)}


Table 5 Reaching definitions extraction for sample Code 1 according to Algorithm 1

Node | Gen | In | Kill | Out
1 | (1, uniName) | Null | Null | (1, uniName)
2 | (2, objAttr) | (1, uniName) | Null | (2, objAttr) (1, uniName)
3 | Null | (1, uniName) (2, objAttr) | Null | (1, uniName) (2, objAttr)
4 | (4, handle) (4, ioStatusBlock) | (1, uniName) (2, objAttr) | Null | (1, uniName) (2, objAttr) (4, handle) (4, ioStatusBlock)
5 | Null | (1, uniName) (2, objAttr) (4, handle) (4, ioStatusBlock) | Null | (1, uniName) (2, objAttr) (4, handle) (4, ioStatusBlock)
6 | (6, buffer) | (1, uniName) (2, objAttr) (4, handle) (4, ioStatusBlock) | Null | (1, uniName) (2, objAttr) (4, handle) (4, ioStatusBlock) (6, buffer)
7 | Null | (1, uniName) (2, objAttr) (4, handle) (4, ioStatusBlock) (6, buffer) | Null | (1, uniName) (2, objAttr) (4, handle) (4, ioStatusBlock) (6, buffer)
8 | (8, cb) | (1, uniName) (2, objAttr) (4, handle) (4, ioStatusBlock) (6, buffer) | Null | (1, uniName) (2, objAttr) (4, handle) (4, ioStatusBlock) (6, buffer) (8, cb)
9 | Null | (1, uniName) (2, objAttr) (4, handle) (4, ioStatusBlock) (6, buffer) (8, cb) | Null | (1, uniName) (2, objAttr) (4, handle) (4, ioStatusBlock) (6, buffer) (8, cb)
10 | (10, ioStatusBlock) | (1, uniName) (2, objAttr) (4, handle) (4, ioStatusBlock) (6, buffer) (8, cb) | (4, ioStatusBlock) | (1, uniName) (2, objAttr) (4, handle) (6, buffer) (8, cb) (10, ioStatusBlock)
11 | Null | (1, uniName) (2, objAttr) (4, handle) (6, buffer) (8, cb) (10, ioStatusBlock) | Null | (1, uniName) (2, objAttr) (4, handle) (6, buffer) (8, cb) (10, ioStatusBlock)

According to the def-use chain, we calculate the longest paths to extract the data dependencies among system calls, which are shown in Table 7. According to this table, the system call data dependency sequence for sample Code 1 is as follows:

scdds(code1) = {2 → 4 → 10, 2 → 4 → 11, 6 → 8 → 10, 6 → 10}

Step three: System call dependency sequence (SCDS) extraction: In this step, the system call dependency sequence set is extracted from the control and data dependency sequences obtained in the previous steps. The system call dependency sequence set for a suspicious file f is calculated with the following equation:

scds(f) = scdds(f) ∪ {n | n has no data dependency}    (1)

So, according to Eq. 1, the SCDS for Code 1 is as follows:

scds(code1) = {2 → 4 → 10, 2 → 4 → 11, 6 → 8 → 10, 6 → 10} ∪ {1, 3, 5, 7, 9}

3.3.2 Assigning weights to the sequences

Each system call sequence has a weight that specifies its importance in determining the nature of the software. Similar to [21], we consider two parameters called LBF and LBC. The first parameter, LBF, is the probability of the sequence being malicious, and the second one, LBC, is the probability of the sequence being benign. For a system call sequence λ, LBF(λ) is the number of times λ is visited in the malware dataset divided by the number of malware samples, and LBC(λ) is the number of times λ is visited in the benign dataset divided by the number of benign samples.


Table 6 Def-use chain extraction for the code given in Code 1

Node | Use | Def=In | DUChain
1 | – | Null | Null
2 | – | (1, uniName) | Null
3 | – | (1, uniName) (2, objAttr) | Null
4 | (4, objAttr) | (1, uniName) (2, objAttr) | {(2:objAttr, 4)}
5 | (5, ntstatus) | (1, uniName) (2, objAttr) (4, handle) (4, ioStatusBlock) | {(2:objAttr, 4)}
6 | (6, buffer) | (1, uniName) (2, objAttr) (4, handle) (4, ioStatusBlock) | {(2:objAttr, 4)}
7 | (7, ntstatus) | (1, uniName) (2, objAttr) (4, handle) (4, ioStatusBlock) (6, buffer) | {(2:objAttr, 4)}
8 | (8, buffer) | (1, uniName) (2, objAttr) (4, handle) (4, ioStatusBlock) (6, buffer) | {(2:objAttr, 4), (6:buffer, 8)}
9 | (9, ntstatus) | (1, uniName) (2, objAttr) (4, handle) (4, ioStatusBlock) (6, buffer) (8, cb) | {(2:objAttr, 4), (6:buffer, 8)}
10 | (10, handle) (10, buffer) (10, cb) | (1, uniName) (2, objAttr) (4, handle) (4, ioStatusBlock) (6, buffer) (8, cb) | {(2:objAttr, 4), (6:buffer, 8), (4:handle, 10), (6:buffer, 10), (8:cb, 10)}
11 | (11, handle) | (1, uniName) (2, objAttr) (4, handle) (6, buffer) (8, cb) (10, ioStatusBlock) | {(2:objAttr, 4), (6:buffer, 8), (4:handle, 10), (6:buffer, 10), (8:cb, 10), (4:handle, 11)}

Table 7 System call data dependency sequence for sample Code 1

2 → 4 → 10   InitializeObjectAttributes → ZwCreateFile → ZwWriteFile
2 → 4 → 11   InitializeObjectAttributes → ZwCreateFile → ZwClose
6 → 8 → 10   RtlStringCbPrintf → RtlStringCbLength → ZwWriteFile
6 → 10       RtlStringCbPrintf → ZwWriteFile
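The sequences of Table 7 and the SCDS of Eq. 1 can be reproduced from the def-use chain with a short Python sketch. The paper speaks of the "longest path"; the four sequences it reports coincide with all paths that start at a node no other node feeds and end at a node that feeds no other node, and that reading is the assumption made here. The function names are hypothetical, and the chain is copied from the last row of Table 6.

```python
from typing import Dict, List, Set, Tuple

# Def-use chain of Code 1 as (defining node, variable, using node), from Table 6.
DU_CHAIN: List[Tuple[int, str, int]] = [
    (2, "objAttr", 4), (6, "buffer", 8), (4, "handle", 10),
    (6, "buffer", 10), (8, "cb", 10), (4, "handle", 11),
]
SCCDS: List[int] = list(range(1, 12))


def data_dependency_sequences(du_chain) -> List[Tuple[int, ...]]:
    """Enumerate every source-to-sink path of the data dependency graph."""
    succ: Dict[int, Set[int]] = {}
    has_pred: Set[int] = set()
    for d, _, u in du_chain:
        succ.setdefault(d, set()).add(u)
        has_pred.add(u)
    sources = sorted(n for n in succ if n not in has_pred)
    paths: List[Tuple[int, ...]] = []

    def walk(node: int, path: List[int]) -> None:
        nxt = sorted(succ.get(node, ()))
        if not nxt:                      # sink reached: the path is maximal
            paths.append(tuple(path))
            return
        for m in nxt:                    # a use always follows its definition,
            walk(m, path + [m])          # so the graph is acyclic

    for s in sources:
        walk(s, [s])
    return paths


def scds(sccds: List[int], du_chain):
    """Eq. 1: data dependency sequences plus the nodes with no data dependency."""
    seqs = data_dependency_sequences(du_chain)
    dependent = {n for seq in seqs for n in seq}
    isolated = [n for n in sccds if n not in dependent]
    return seqs, isolated


SEQUENCES, ISOLATED = scds(SCCDS, DU_CHAIN)
print(SEQUENCES, ISOLATED)
```

Mapping node numbers back to the method names of Table 3 gives the right-hand column of Table 7.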

\[ LBF(\lambda) = \frac{\#\,\text{of visited sequences } \lambda \text{ in malware dataset}}{\#\,\text{of malware samples}} \]

\[ LBC(\lambda) = \frac{\#\,\text{of visited sequences } \lambda \text{ in benign dataset}}{\#\,\text{of benign samples}} \]

After calculating LBF and LBC, we determine the effect of each one on finding the program behavior. The sequence weight function, ω(λ), is introduced for the sequence λ as follows:

\[ \omega(\lambda) = \begin{cases} LBF(\lambda) \times (\text{total number of benign samples}) & \text{if } LBC(\lambda) = 0 \\ \dfrac{LBF(\lambda)}{LBC(\lambda)} & \text{else if } LBF(\lambda) > LBC(\lambda) \\ 0 & \text{else if } LBF(\lambda) \le LBC(\lambda) \end{cases} \]

After assigning scores to the sequences, the sequences with scores higher than zero are added to the database DB as the malicious rules set.

Table 8 Calculating the best value for Θ

Θ    TP     FP      Accuracy  Precision  F-measure
20   0.991  0.0335  0.979     0.967      0.979
25   0.99   0.033   0.979     0.968      0.979
30   0.989  0.031   0.979     0.970      0.979
36   0.987  0.028   0.980     0.972      0.980
37   0.987  0.024   0.982     0.976      0.982
38   0.986  0.0255  0.980     0.975      0.980
40   0.982  0.024   0.979     0.976      0.979
45   0.984  0.025   0.980     0.975      0.980

3.4 Malware detection

Once the malicious rules are specified, we can determine the nature of a suspicious file. For each suspicious file f, its system call dependency sequence is extracted and compared with the malicious rules available in the database DB. If they match, the sequence weight, ω, contributes to the final score. The final score is the sum of the scores of the detected malicious rules, as defined in Eq. 2 based on their system call dependency sequences.
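The weighting of Sect. 3.3.2 and the scoring step of this section can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: each sample is represented as the set of dependency sequences observed in it, so a sequence is counted at most once per sample, sequences are matched exactly, and all names and the toy data are hypothetical. The threshold passed to is_malware plays the role of Θ; Table 8 selects Θ = 37 for the paper's dataset, whereas the value used below is arbitrary.

```python
from typing import Dict, Iterable, List, Set, Tuple

CallSeq = Tuple[str, ...]   # one dependency sequence, as a tuple of call names


def count_occurrences(samples: Iterable[Set[CallSeq]]) -> Dict[CallSeq, int]:
    """Count in how many samples each sequence is observed."""
    counts: Dict[CallSeq, int] = {}
    for seqs in samples:
        for s in seqs:
            counts[s] = counts.get(s, 0) + 1
    return counts


def build_rule_db(malware: List[Set[CallSeq]],
                  benign: List[Set[CallSeq]]) -> Dict[CallSeq, float]:
    """Assign the weight omega to each sequence and keep those with omega > 0."""
    mal_counts = count_occurrences(malware)
    ben_counts = count_occurrences(benign)
    db: Dict[CallSeq, float] = {}
    # Sequences never seen in malware have LBF = 0 and therefore weight 0,
    # so iterating over the malware sequences is sufficient.
    for seq, c in mal_counts.items():
        lbf = c / len(malware)
        lbc = ben_counts.get(seq, 0) / len(benign)
        if lbc == 0:
            weight = lbf * len(benign)
        elif lbf > lbc:
            weight = lbf / lbc
        else:
            weight = 0.0
        if weight > 0:
            db[seq] = weight
    return db


def score(db: Dict[CallSeq, float], file_sequences: Set[CallSeq]) -> float:
    """Sum the weights of the malicious rules matched by a file."""
    return sum(db[s] for s in file_sequences if s in db)


def is_malware(db: Dict[CallSeq, float], file_sequences: Set[CallSeq],
               theta: float) -> bool:
    return score(db, file_sequences) >= theta


# Toy data (hypothetical): four malware and four benign samples.
A = ("ZwCreateFile", "ZwWriteFile")
B = ("ZwCreateFile", "ZwClose")
C = ("RtlStringCbPrintf", "ZwWriteFile")
D = ("RtlStringCbPrintf", "RtlStringCbLength", "ZwWriteFile")
MALWARE = [{A, B}, {A}, {A, C}, {B}]
BENIGN = [{B}, {C}, {C}, {D}]

DB = build_rule_db(MALWARE, BENIGN)
print(DB, score(DB, {A, B, C}))
```

In the toy data, sequence A never occurs in a benign sample and receives the largest weight, while C is more frequent in benign samples than in malware and is dropped from the rule database.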

A suspicious file f is therefore classified as malware if and only if

\[ \sum_{x \in DB,\; x \in scds(f)} \omega(x) \ge \Theta \tag{2} \]

where x ∈ DB and x ∈ scds(f). The threshold Θ specifies the minimum score needed to classify a suspicious file as malware. The precise value of Θ is discussed in Sect. 4.

4 Evaluation

The goal of this evaluation is to find the best value for Θ based on our test dataset. To calculate Θ, the test data discussed in Sect. 3.1 is divided into 10 equal portions. Each time, 9 portions are used as the learning data and one portion as the test data. Eight different values are used for Θ. The results are shown in Table 8. As shown in this table, according to the detection rates, the value 37 gives the best result for Θ.

We next compare our approach with several other approaches that can be implemented and tested on our dataset. The comparison results are shown in Table 9.

Table 9 Comparison of the proposed approach and other similar approaches

Approach             TP     FP      Accuracy  Precision  F-Measure
Sami et al. [12]     0.941  0.0612  0.940     0.939      0.940
Garg and Yadav [11]  0.833  0.091   0.871     0.902      0.866
Suaboot et al. [17]  0.887  0.071   0.908     0.926      0.906
Our approach         0.987  0.024   0.982     0.976      0.982

The proposed approach, as shown in Table 9, incurs a low false positive rate due to using system call data dependency and also numerous benign programs to extract patterns. Moreover, the proposed approach has a high false positive rate facing behaviour obfuscation methods such as replacing the order of system calls or using fake system calls.

5 Conclusion

In this paper, we present a dynamic and behavior-based approach to detect malware. First, behavioral features, including the control and data dependency sequences of system calls, are extracted from a database of malware and benign programs. Then, using a fuzzy approach, a value is assigned to each sequence. This value represents the effect of a sequence in recognizing the type of a program. Finally, to detect the type, the program is run, and upon extracting its system call dependency sequences and matching them with the sequences available in the database, the type is recognized. Our evaluation results demonstrate 0.982 accuracy, 0.976 precision, and 0.982 F-measure. This approach can also be used to detect the semantic relation between API calls based on their arguments. In the future, we plan to use other features to detect malicious software that uses behavior obfuscation.

References

1. Damodaran, A., Troia, F.D., Visaggio, C.A., Austin, T.H., Stamp, M.: A comparison of static, dynamic, and hybrid analysis for malware detection. J. Comput. Virol. Hacking Tech. 13(1), 1–12 (2017)
2. Scott, J.: Signature Based Malware Detection is Dead, Cybersecurity Think Tank. Institute for Critical Infrastructure Technology (February). www.ICITForum.org
3. Alazab, M., Venkataraman, S., Watters, P.: Towards understanding malware behaviour by the extraction of API calls. In: Proceedings of the 2nd Cybercrime and Trustworthy Computing Workshop, pp. 52–59 (2010). 10.1109/CTC.2010.8
4. Fang, Z., Wang, J., Li, B., Wu, S., Zhou, Y., Huang, H.: Evading anti-malware engines with deep reinforcement learning. IEEE Access 7, 48867–48879 (2019)
5. Martín, A., Menéndez, H.D., Camacho, D.: Studying the influence of static API calls for hiding malware. In: Lecture Notes in Computer Science, vol. 9868, pp. 363–372. Springer (2016)
6. Lopez, J., Babun, L., Aksu, H., Uluagac, A.S.: A survey on function and system call hooking approaches. J. Hardw. Syst. Secur. 1(2), 114–136 (2017)
7. Alazab, M., Venkataraman, S., Watters, P.: Towards understanding malware behaviour by the extraction of API calls. In: Cybercrime and Trustworthy Computing Workshop, pp. 52–59 (2010)
8. Sihwail, R., Omar, K., Ariffin, K.A.: A survey on malware analysis techniques: Static, dynamic, hybrid and memory analysis. Int. J. Adv. Sci. Eng. Inf. Technol. 8(4–2), 1662–1671 (2018)
9. Narouei, M., Ahmadi, M., Giacinto, G., Takabi, H., Sami, A.: DLLMiner: structural mining for malware detection. Secur. Commun. Netw. 8(18), 3311–3322 (2015)
10. Dependency Walker, Dependency Walker (2018). http://www.dependencywalker.com/
11. Garg, V., Yadav, R.K.: Malware detection based on API calls frequency. In: International Conference on Information Systems and Computer Networks, pp. 400–404. IEEE (2019)
12. Sami, A., Yadegari, B., Rahimi, H., Peiravian, N., Hashemi, S., Hamze, A.: Malware detection based on mining API calls. In: Proceedings of the ACM Symposium on Applied Computing, pp. 1020–1025. ACM Press, New York (2010)
13. Qiao, Y., Yang, Y., He, J., Tang, C., Liu, Z.: CBM: free, automatic malware analysis framework using API call sequences. In: Advances in Intelligent Systems and Computing, vol. 214, pp. 225–236. Springer (2014)
14. Tran, T.K., Sato, H.: NLP-based approaches for malware classification from API sequences. In: Symposium on Intelligent and Evolutionary Systems, vol. 2017-Janua, pp. 101–105. Institute of Electrical and Electronics Engineers Inc. (2017)
15. Kim, H., Kim, J., Kim, Y., Kim, I., Kim, K.J., Kim, H.: Improvement of malware detection and classification using API call
sequence alignment and visualization. Clust. Comput. 22(1), 921–929 (2019)
16. Fadadu, F.: Evading API call sequence based malware classifiers. In: International Conference on Information and Communications Security, pp. 18–33. Springer, Cham (2019)
17. Suaboot, J., Tari, Z., Mahmood, A., Zomaya, A.Y., Li, W.: Sub-curve HMM: a malware detection approach based on partial analysis of API call sequences. Comput. Secur. 92, 101773 (2020)
18. CWSandbox Data. http://pi1.informatik.uni-mannheim.de/malheur/
19. Virus Sign Malware Data Base. https://www.virussign.com
20. API Monitoring Tool. https://www.rohitab.com/apimonitor
21. Parsa, S., Zareie, F., Vahidi-Asl, M.: Fuzzy clustering the backward dynamic slices of programs to identify the origins of failure. In: Lecture Notes in Computer Science, vol. 6630, pp. 352–363 (2011)

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.