2021 - Markhor - Malware Detection Using Fuzzy Similarity of System Call Dependency Sequences
https://doi.org/10.1007/s11416-021-00383-1
ORIGINAL PAPER
Abstract
Static malware detection approaches are time-consuming and cannot deal with code obfuscation techniques. Dynamic malware
detection approaches address these two challenges but suffer from behavioral ambiguity, such as system call obfuscation.
In this paper, we introduce Markhor, a dynamic, behavior-based malware detection approach. Markhor uses system call data
dependency and system call control dependency sequences to create a weighted list of malicious patterns, which is then used
to identify malicious processes. The similarity of a file's system call sequences to the malicious patterns is computed with
a fuzzy algorithm, and the nature of the file is determined from it. The evaluation results show that Markhor achieves an
accuracy of 0.982, a precision of 0.976, and an F-measure of 0.982.
A. M. Lajevardi et al.
weighted list of malicious patterns. In Markhor, the similarity of a file's system call sequences to the malicious patterns is extracted based on a fuzzy algorithm, and the nature of the file is determined at malware run time.

The main contributions of this paper are:

1. Using semantic relations of API calls instead of structural relations to construct the real malicious API sequences,
2. Detecting fake API calls from the malicious behaviors of malware, and
3. Proposing a fuzzy-based algorithm for determining whether a program is malicious or benign.

The rest of this paper is organized as follows. Section 2 discusses related work. Markhor is introduced in Sect. 3. Section 4 evaluates the performance of Markhor, and Sect. 5 concludes the paper.

2 Related work

Behavioral feature extraction is one of the most important components of malware detection approaches. If the features are not extracted correctly, the type of software cannot be detected accurately. Behavioral feature extraction can be done statically or dynamically. Static extraction is suitable when malware behavior needs to be analyzed without running it. This is done by analyzing the program source code and code extracted from the program. Such methods either use the program import address table [4] or track call operations in the program code [5].

Dynamic extraction of behavioral features requires running the malware first and then tracking its behavior. Dynamic approaches are more efficient than static ones because, in static analysis, the code, the address table, and operations might be obfuscated or encoded. As a result, it might be too difficult to detect their real values. In the dynamic method, by contrast, the malware is decoded and sends its requests to the operating system. These requests cannot be obfuscated, because if the operating system cannot recognize the requests it receives, it cannot send the desired response. Dynamic approaches include inline API hooking [6] and tracking with the operating system service descriptor table [8].

After extracting and tracking a program's behavior accurately in a secure environment, it is necessary to determine the nature of the program, that is, whether it is benign or malicious. In the rest of this section, we discuss malware detection approaches that use behavioral features. Extracting the dynamic-link library dependency tree of suspicious software from the import address table (IAT), without executing the application, is proposed in [9] to detect malware. The approach is able to detect fake dynamic-link library injection and uses Dependency Walker [10] to generate the behavioral tree of each file. The main drawback of this approach is its low accuracy in detecting malware whose IAT is destroyed. The extracted behavioral tree might also have many unimportant nodes, which increases detection time.

Modeling program behavior based on the frequency of API calls is studied in [11]. The method, however, assumes that API calls are independent, which results in lower precision. Most recent malware detection approaches [12–17] use API call sequences to reduce false positives and detect fake API call injection. These approaches use the API method call sequence as malicious patterns to detect malware, where the API call sequence is extracted for each malware and benign program. Data mining techniques are then used to extract the sequences that are effective for detecting malware. Finally, the dependency between API calls is considered based on their call sequence. The main challenge in these approaches is the
MARKHOR: malware detection using fuzzy similarity of system call dependency sequences
Fig. 2 A part of intercepted API calls for the notepad.exe sample file

pills in run-time. Using red-pills, malware can recognize the virtual machine environment and hide its real behaviour. Each malware is run for 5 min. During this time its behaviour is extracted and logged as a sequence of API calls with their arguments. A part of the intercepted API calls for the notepad.exe sample file is shown in Fig. 2.

3.3 Pattern extraction from test data

The aim of this step is to extract malicious patterns from the dataset. These patterns are extracted based on control dependency and data dependency, which are described in the rest of this section.

3.3.1 System call dependency sequence (SCDS) extraction

In Markhor, system calls and their dependencies are modeled in a novel way. We use a sample code, shown in Code 1, to explain the proposed method. As shown in Code 1, the program attempts to read/write a file using a sequence of system calls. Condition1 defines the type of operation on a file. This operation can be either a read or a write. Since this paper is intended for dynamic analysis of malware (run-time analysis), we assume that in either case the condition holds, so the write operation is performed successfully on the file. The process of system call dependency sequence (SCDS) extraction is described in the following steps.

Step one: System call control dependency sequence (SCCDS) extraction: Since the program is analyzed dynamically (and not statically), the system call control dependency forms a sequence rather than a graph. Fig. 3 shows the system call control sequence for sample Code 1. According to this figure, the system call control dependency sequence (SCCDS) for Code 1 is as follows:

sccds(code1) = {1 → 2 → 3 → 4 → 5 → 6 → 7 → 8 → 9 → 10 → 11}

It is very important to eliminate dependencies between system calls that are not related to each other logically and semantically. Therefore, it is necessary to identify data dependencies between system calls rather than sequential dependencies.

Step two: System call data dependency sequence (SCDDS) extraction: To find the semantic dependence between system calls, we use the data dependency among these calls. The parameters of system calls are mainly of type in or out. A data dependency between system calls occurs when the output of one system call is the input of another. The way these dependencies are extracted is described below.

– Def-use pair: In this part, def-use pairs are extracted for each system call. For each system call, definitions are its parameters of type output. Uses are its non-constant parameters of type input that were previously defined in another system call. To extract def-use pairs, the type of each parameter should be determined. Table 2 shows the types of the parameters defined in the system calls of sample Code 1. It is worth mentioning that, since in most cases the return value of a system call is of type Boolean or NtStatus, the focus of this paper is on system call parameters, not on return values. After determining the types of the system call parameters, we can extract the def-use pairs for each system call. Table 3 shows these pairs.

Table 3 Def-use pairs extraction for system calls for sample Code 1
# Node Method name Def-use chain

– Reaching definition extraction: Using the def-use pairs and the system call control dependency sequence, the data dependency among methods becomes known. To do so, four sets are defined as follows:
  – Gen: Set of definitions (out parameters) made by a system call.
  – In: Set of definitions from previous system calls that reach the current system call (according to the system call control dependency sequence).
  – Kill: Set of definitions from previous system calls that reach the current system call but are killed by being redefined in the current system call.
  – Out: Set of definitions leaving the current system call towards the next system calls.

Values in these sets are shown by ordered pairs (i, j), in which i is the number of the function in the SCDS in which variable j is used, defined, entered, or left. The dependency between these sets is shown in Table 4. Symbol B refers to a system call in

Table 4 The dependencies between sets In, Kill, Out, and Gen

– Def-use chain extraction: To extract the def-use chain, Algorithm 2 is used to show where each definition is used. The def-use chain for sample Code 1 is extracted and shown in Table 6.

So the def-use chain for Code 1, based on Table 6, is as follows:

{(2 : objAttr, 4), (6 : buffer, 8), (4 : Handle, 10), (6 : buffer, 10), (8 : cb, 10), (4 : handle, 11)}
According to the def-use chain, we should calculate the longest path to extract the data dependencies among system calls, which are shown in Table 7. According to this table, the system call data dependency sequence for sample Code 1 is as follows:

scdds(code1) = {2 → 4 → 10, 2 → 4 → 11, 6 → 8 → 10, 6 → 10}

The SCDS of a file f combines its data dependency sequences with the system calls that have no data dependency:

scds(f) = scdds(f) ∪ {n | n has no data dependency}   (1)

So according to Eq. 1, the SCDS for Code 1 is as follows:

scds(code1) = {2 → 4 → 10, 2 → 4 → 11, 6 → 8 → 10, 6 → 10} ∪ {1, 3, 5, 7, 9}
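
As a rough illustration of Eq. 1, the following Python sketch derives the data dependency sequences from the def-use chain of Code 1 and then adds the nodes without data dependency. The function names scdds and scds mirror the notation of the text but are otherwise hypothetical. One assumption should be stated plainly: the text speaks of calculating the longest path, whereas this sketch enumerates every path from a node with no incoming dependency to a node with no outgoing dependency, which happens to reproduce the four sequences listed for Code 1.

```python
from typing import Dict, List, Set, Tuple

# Def-use chain of sample Code 1 as listed in the text:
# (defining node, variable, using node).
CHAIN: List[Tuple[int, str, int]] = [
    (2, "objAttr", 4), (6, "buffer", 8), (4, "handle", 10),
    (6, "buffer", 10), (8, "cb", 10), (4, "handle", 11),
]


def scdds(chain: List[Tuple[int, str, int]]) -> List[List[int]]:
    """Enumerate source-to-sink paths in the data dependency graph."""
    successors: Dict[int, Set[int]] = {}
    has_predecessor: Set[int] = set()
    for definition, _variable, use in chain:
        successors.setdefault(definition, set()).add(use)
        has_predecessor.add(use)

    paths: List[List[int]] = []

    def walk(node: int, path: List[int]) -> None:
        following = sorted(successors.get(node, set()))
        if not following:  # sink reached: the path is complete
            paths.append(path)
            return
        for nxt in following:
            walk(nxt, path + [nxt])

    # A definition always precedes its use in the executed sequence,
    # so the graph is acyclic and the recursion terminates.
    for source in sorted(set(successors) - has_predecessor):
        walk(source, [source])
    return paths


def scds(sequence: List[int], chain: List[Tuple[int, str, int]]):
    """Eq. 1: data dependency sequences plus nodes with no data dependency."""
    paths = scdds(chain)
    dependent = {node for path in paths for node in path}
    isolated = [node for node in sequence if node not in dependent]
    return paths, isolated


if __name__ == "__main__":
    dep_paths, lone_nodes = scds(list(range(1, 12)), CHAIN)
    print(" , ".join(" → ".join(map(str, p)) for p in dep_paths))
    print(lone_nodes)
```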
suspicious file f is malware, if and only if,

ω(x) ≥ Θ   (2)

to detect the semantic relation between API calls based on their arguments. In the future, we plan to use other features to detect malicious software that uses behavior obfuscation.
sequence alignment and visualization. Clust. Comput. 22(1), 921–929 (2019)
16. Fadadu, F.: Evading API call sequence based malware classifiers. In: International Conference on Information and Communications Security, pp. 18–33. Springer, Cham (2019)
17. Suaboot, J., Tari, Z., Mahmood, A., Zomaya, A.Y., Li, W.: Sub-curve HMM: a malware detection approach based on partial analysis of API call sequences. Comput. Secur. 92, 101773 (2020)
18. CWSandbox Data. http://pi1.informatik.uni-mannheim.de/malheur/
19. Virus Sign Malware Data Base. https://www.virussign.com
20. API Monitoring Tool. https://www.rohitab.com/apimonitor
21. Parsa, S., Zareie, F., Vahidi-Asl, M.: Fuzzy clustering the backward dynamic slices of programs to identify the origins of failure. In: Lecture Notes in Computer Science, vol. 6630, pp. 352–363 (2011)

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.