
International Journal of Computational Intelligence Systems (2023) 16:35
https://doi.org/10.1007/s44196-023-00206-9

RESEARCH ARTICLE

DeepDual-SD: Deep Dual Attribute-Aware Embedding for Binary Code Similarity Detection

Jiabao Guo1 · Bo Zhao1 · Hui Liu1 · Dongdong Leng1 · Yang An2 · Gangli Shu1

Received: 22 September 2022 / Accepted: 21 February 2023
© The Author(s) 2023

Abstract
Binary code similarity detection (BCSD) is the task of measuring the similarity of binary functions for which the corresponding source code is not available. It has been widely used to support various kinds of crucial security analysis in software engineering. Because of the complexity of the program compilation process, identifying binary code similarity presents tough challenges. A practical binary similarity detector relies on a robust vector representation of binary code. However, few BCSD approaches produce vector representations suited to analyzing similarities between binaries, which may diverge not only in semantics but also in structure. Moreover, existing solutions that depend solely on hand-crafted feature engineering to form feature vectors fail to take the relationships between instructions into consideration. To resolve these problems, we propose a novel and unified approach called DeepDual-SD that combines dual attributes (a semantic and a structural attribute). More specifically, DeepDual-SD consists of two branches: a text-based feature representation driven by semantic attribute learning to exploit instruction semantics, and a graph-based feature representation for structural attribute learning to investigate structural differences. Deep embedding (DE) technology is used to map this information into low-dimensional vector representations. In addition, to combine the dual attributes, a fusion mechanism based on a gate architecture is designed to learn the proper balance of attention between the two attribute-aware embeddings. Experiments are conducted on OpenSSL and Debian datasets for several tasks, covering cross-compiler, cross-architecture and cross-version scenarios. The results demonstrate that our method outperforms state-of-the-art BCSD methods in different scenarios in terms of detection accuracy.

Keywords Code similarity detection · Deep embedding · Semantic attribute learning · Structural attribute learning · Fusion mechanism

Abbreviations
BCSD Binary code similarity detection
COTS Commercial off-the-shelf
NLP Natural language processing
CFG Control flow graph
OOV Out-of-vocabulary
LSTM Long short-term memory
GRU Gated recurrent unit
ROC Receiver operating characteristic curve
AUC Area under curve
IOT Internet of things

Corresponding author: Bo Zhao
1 School of Cyber Science and Engineering, Wuhan University, Wuhan, China
2 School of Computer Science, Wuhan University, Wuhan, China

1 Introduction

Binary code similarity detection (BCSD) is used for comparing two or more pieces of binary code (whole programs, functions or basic blocks) to determine
their similarities or differences. It is essential to compare binary code in situations where the source code of the program is not available, including malware, legacy programs and commercial off-the-shelf (COTS) programs [1]. BCSD has become increasingly popular as a research area and has broad applications in software engineering and security, such as code clone detection [2–5], vulnerability discovery [6, 7], malware detection [8–12], patch analysis [13–15] and porting information across program versions [16].
The similarity of binary code not only comes from source code updates but also depends on the compilation process. Binary code is generated by the compilation process, as illustrated in Fig. 1, and can be run directly. The usual compilation process takes the source code as input and, through the selected compiler, optimization option and target platform, generates an object file, from which a binary program is produced.

Fig. 1 Illustration of compilation process

2 Related Work

The core of the BCSD problem is to design an approach that detects whether two pieces of binary code are similar. A method that solves this problem needs to achieve the following design goals. First, the application scenario of the method is binary-oriented: in many cases we cannot access the source code of a binary function, so effective similarity detection and code search technology must use binary code directly as the research object. Second, since the query function and the functions in the target corpus may come from different hardware architectures and software platforms, an effective BCSD technique must capture the inherent characteristics of binary functions and be robust to syntactic changes. In addition, an excellent BCSD method should offer high efficiency and adaptability: it should compute function similarity efficiently for tasks such as library function identification and vulnerability search, so that it scales to a large target function library. At the same time, when domain experts can provide similar or dissimilar examples, the method should be able to quickly adapt to these examples for specific domain applications.

2.1 Semantic-Aware Similarity

Semantic-aware similarity captures whether the compared code has similar instruction-level effects. FOSSIL [17] uses instruction classification to calculate semantic similarity, but is unable to determine whether two pieces of binary code are equivalent. Another semantic-similarity method is the symbolic formula, an assignment in which the left side is an output variable and the right side is a logical expression over input variables and literals. BINHUNT [18], iBINHUNT [19] and COP use symbolic formulas. These approaches must attempt all pair-wise comparisons and check whether there exists a permutation of variables such that all matched variables contain the same value. Recently, the majority of suitable semantic-level strategies have incorporated ideas from natural language processing (NLP). For example, Zuo [20], inspired by machine translation, treated each instruction in the binary code as a word and each block as a sentence, and used an LSTM to encode the semantic vector of each sentence. Massarelli [21] used the word2vec model to train token embeddings, and then used an attention mechanism to obtain function embeddings.

2.2 Structural-Aware Similarity

Structural-aware similarity usually computes similarity from graph representations of binary code. It differs from semantic similarity because a graph can capture multiple syntactic representations of the same code. Structural similarity can be calculated on different graphs. Traditional BCSD methods calculate similarity scores between graphs using graph matching algorithms, such as SIGMA [22], DiscovRE [23] and BinGo [24]. However, these methods suffer from low time efficiency and are difficult to migrate to new applications. Later, graph embedding methods were proposed. First, the binary function is represented by a graph, such as the control flow graph (CFG), whose nodes contain the relevant features of the binary function's basic blocks. A learned model then expresses the binary function as a vector for direct
comparison. The similarity distance of two vectors represents the similarity of two binary functions. The first method to apply embedding ideas to BCSD is Genius, a vulnerability detection engine for IoT devices that supports multiple architectures at the same time. It mainly consists of four parts: feature extraction, codebook generation, feature encoding and online search. Recently, Xu [25] proposed a graph embedding method called Gemini, which achieves better performance. Gemini uses a neural network to reduce training and retraining time. The time-dependent nature of graph neural networks makes BCSD more suitable for practical applications, improving the quality and efficiency of similarity detection.

2.3 Deep Embedding

Deep embedding is an efficient method that maps data to a low-dimensional vector representation, with the goal of preserving the original data in the embedding space. The idea of embedding is widely used in many settings. Deep embedding can extract high-level semantics from a sample and generate a vector that represents it. For example, Chen [26] described pose variation by similarity embedding learning as spatial constraints for person re-identification. Gao [27] designed an effective similarity neural network that focuses on a similarity learning task in image retrieval. Embedding-based binary detection methods use the same principle: they learn the high-level semantics of the graph through a neural network and represent a binary function as a vector. Finally, the similarity of two vectors is compared directly to obtain the similarity score of two binary functions. Ou [28] proposed preserving asymmetric transitivity by approximating high-order proximity to improve graph embedding efficiency. Heimann [29] proposed a framework named REGAL that automatically learns node representations to match nodes. The literature [30] proposed Asm2Vec, a function embedding solution based on the PV-DM model from natural language processing. Asm2Vec computes the CFG of a function and then executes a series of random walks on top of it.

3 Approach

This section presents DeepDual-SD in detail. First, we introduce the code similarity task description in Sect. 3.1 and provide an approach overview in Sect. 3.2. Then, the dual-attribute embedding modules of our method are presented (semantic attribute embedding in Sect. 3.3 and structural attribute embedding in Sect. 3.4). Finally, a gated attention mechanism is provided to fuse the two attributes in Sect. 3.5.

3.1 Code Similarity Task Description

In automatic BCSD, given two binary functions, the machine needs to read and understand the binary code, and then compare the similarity between them. We call the binary function of interest the query function, and the corpus of binary functions the target corpus.
In this paper, all binaries are compiled from source code, not generated by manual assembly. A binary B consists of a set of functions {f_1, f_2, …, f_u}. Given two binaries B_q = {f_1^q, f_2^q, …, f_m^q} and B_t = {f_1^t, f_2^t, …, f_n^t}, we assume they have k pairs of matching functions: [f_1^q, f_1^t], [f_2^q, f_2^t], …, [f_k^q, f_k^t], where k ≤ min(m, n), and the rest of the functions do not match. For any function f_i^q in B_q, the BCSD tool should be able to sort the functions in the binary B_t by their similarity to f_i^q.

3.2 Overview

Neural network-based methods are good at handling the BCSD problem. However, these methods ignore the joint importance of the semantics and the structure of binary code. Furthermore, not all parts of a program are equally relevant, and single-attribute encoding cannot recognize the different relevance of each part of the binary [31, 32]. These problems affect BCSD accuracy. To improve BCSD performance, we design an architecture that addresses the dual attributes mentioned above by integrating semantic attribute embedding, structural attribute embedding and an attribute fusion mechanism. The proposed architecture is named deep dual attribute-aware embedding for binary code similarity detection (DeepDual-SD). In DeepDual-SD, the semantic embedding module extracts n-gram features from the function for sentence modeling, inspired by NLP technology. The structure embedding module aims to differentiate the similarity between two input attributed control flow graphs via a graph embedding network. The fusion module is designed for both attribute representations, paying more attention to the features related to the specific application, and helps the model understand the binary.
A binary function f_i is described as f_i = Fusion(p_i, g_i), where p_i is the semantic attribute representation and g_i is the structural attribute representation. The architecture of DeepDual-SD is shown in Fig. 2.
The key points of our approach are described in detail as follows.
Fig. 2 Overview of the DeepDual-SD

3.3 Semantic Attribute Embedding

Challenges. When using semantic features as embedded features to capture the semantic information of binary functions, several challenges arise. The first is instruction explosion. In assembly language, operand addressing modes include direct addressing, immediate addressing, register-relative addressing and others. Different programs are compiled by different compilers, generating countless distinct immediate values and memory addresses. A large number of instructions with random numbers and random addresses are produced, resulting in an instruction explosion problem: too many distinct instructions with a low repetition rate. The second is the out-of-vocabulary (OOV) problem. When a trained model must convert an instruction that never appeared during training into a vector, embedding generation for such instructions fails. The third is how to make the machine understand and learn the semantic meaning of the code and express it as a suitable embedding vector.
Preprocess. In semantic attribute embedding, we use assembly instructions as tokens. An instruction token consists of both the instruction mnemonic and the operands. In response to the first two challenges, instruction explosion and OOV, we formulate the following rules when preprocessing instruction inputs: (1) When the value of a numeric constant is above a threshold (0x400 in our experiments), replace the numeric constant with ⟨MAXVALUE⟩. (2) When there is an instruction string reference, replace it with ⟨STRING⟩. (3) Replace memory address strings with ⟨ADDR⟩. (4) Replace transfer addresses with ⟨LOCAL⟩. (5) For function call instructions, replace the function name with ⟨FUNCTION⟩.
We take the code example in Table 1: the left column shows the original assembly code, and the right column is the preprocessed result.

Table 1 An example of semantic attribute embedding preprocessing

push ebp                  | push ebp
push edi                  | push edi
push esi                  | push esi
call _x86_get_pc_thunk_bx | call FUNCTION
add ebx, 5E8F7h           | add ebx, STR_MAX_NUM
sub esp, 5Ch              | sub esp, 5Ch
jz loc_8075A18            | jz LOCAL
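As an illustration of these rules, the following sketch applies them to one disassembled instruction. It is a minimal example, not the authors' implementation: the regular expressions assume IDA-style operand text (h-suffixed hex constants, loc_ jump labels, bracketed memory operands), and the helper name is hypothetical.

```python
import re

MAX_CONST = 0x400  # threshold from the paper: larger constants become MAXVALUE

def normalize_instruction(mnemonic: str, operands: list) -> str:
    """Apply the five preprocessing rules to one disassembled instruction."""
    norm = []
    for op in operands:
        if mnemonic.startswith("call"):
            norm.append("FUNCTION")            # rule 5: function call target
        elif re.fullmatch(r"loc_[0-9A-Fa-f]+", op):
            norm.append("LOCAL")               # rule 4: transfer (jump) address
        elif re.fullmatch(r"\[[^\]]+\]", op):
            norm.append("ADDR")                # rule 3: memory address operand
        elif op.startswith('"'):
            norm.append("STRING")              # rule 2: string reference
        elif re.fullmatch(r"([0-9A-Fa-f]+)h|\d+", op):
            value = int(op[:-1], 16) if op.endswith("h") else int(op)
            # rule 1: large numeric constants collapse to a single token
            norm.append("MAXVALUE" if value > MAX_CONST else op)
        else:
            norm.append(op)                    # registers etc. stay as-is
    return " ".join([mnemonic] + norm)

# e.g. normalize_instruction("jz", ["loc_8075A18"]) -> "jz LOCAL"
```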
Semantic Embedding. After obtaining the initialization vectors of the instruction tokens, we build on ALBERT [33], an improved model based on BERT [34], to produce the semantic embedding representations in DeepDual-SD. Compared to a conventional long short-term memory (LSTM) [35]-based feature extraction framework, this model is better suited to the complicated task of binary code semantic analysis. The semantic function representation is taken as p = [p_1, p_2, …, p_128] for binary function semantic embedding.
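The paper specifies ALBERT as the semantic encoder but not its exact configuration. Purely as a shape-level sketch (not the actual Transformer encoder), the following Keras fragment shows how normalized instruction tokens could be mapped to the 128-dimensional semantic vector p; the vocabulary size, sequence length and mean-pooling layer are placeholder assumptions.

```python
from tensorflow import keras

VOCAB_SIZE = 20000   # assumed size of the normalized-instruction vocabulary
MAX_TOKENS = 200     # assumed maximum instruction count per function
SEM_DIM = 128        # dimension of the semantic vector p, as in the paper

# Simplified stand-in for the ALBERT encoder: embed, pool, project.
semantic_encoder = keras.Sequential([
    keras.layers.Embedding(VOCAB_SIZE, 256, mask_zero=True),
    keras.layers.GlobalAveragePooling1D(),           # pool token vectors
    keras.layers.Dense(SEM_DIM, activation="tanh"),  # p = [p_1, ..., p_128]
])
```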

3.4 Structural Attribute Embedding

The similarity detection problem for binary functions can be converted into a representation problem for the binary function's attributed control flow graph. A binary function is expressed as a high-dimensional vector, and the vectors of binary functions compiled from the same source code function lie relatively close together.
Preprocess. We extract two kinds of features: block-level features, describing the relevant feature information inside a basic block, and structural information, describing the inter-block attributes between basic blocks in the entire CFG. The specific characteristics are shown in Table 2.

Table 2 Features used by structural attribute embedding

Type                   | Attribute name                    | Example
Block-level attributes | String constants                  | –
                       | Numeric constants                 | –
                       | No. of transfer instructions      | jmp, ret, retf, retn, jae
                       | No. of calls                      | call
                       | No. of instructions               | –
                       | No. of arithmetic instructions    | add, sub, addiu, addu, clo, clz, div
                       | No. of data transfer instructions | mov, movsx, movzx, push, pop, pusha, popa
Inter-block attributes | No. of indegree                   | –
                       | No. of outdegree                  | –

This paper extracts 7 statistical block-level features and 2 inter-block structural features. For example, Fig. 3 shows the attributed CFG, i.e., the extracted feature map, of the function SSL_do_handshake after the binary code compiled with gcc 5.4 for ARM from OpenSSL-1.0.1f is disassembled by IDA Pro [36].

Fig. 3 Attributed CFG of function SSL_do_handshake on ARM platforms
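A minimal sketch of how the Table 2 attributes could be gathered is given below. The mnemonic sets are abbreviated from the table's example column, and the input format (a list of mnemonics per block plus a CFG edge list) is an assumption, not the paper's actual IDA Pro pipeline.

```python
# Hypothetical input: each basic block as a list of mnemonics, plus CFG edges.
TRANSFER = {"jmp", "ret", "retf", "retn", "jae"}          # examples from Table 2
ARITHMETIC = {"add", "sub", "addiu", "addu", "clo", "clz", "div"}
DATA_TRANSFER = {"mov", "movsx", "movzx", "push", "pop", "pusha", "popa"}

def block_features(mnemonics, n_strings, n_consts):
    """The 7 statistical block-level features of Table 2 for one basic block."""
    return [
        n_strings,                                       # string constants
        n_consts,                                        # numeric constants
        sum(m in TRANSFER for m in mnemonics),           # transfer instructions
        sum(m == "call" for m in mnemonics),             # calls
        len(mnemonics),                                  # total instructions
        sum(m in ARITHMETIC for m in mnemonics),         # arithmetic instructions
        sum(m in DATA_TRANSFER for m in mnemonics),      # data transfer instructions
    ]

def degree_features(block_id, edges):
    """The 2 inter-block structural features: indegree and outdegree."""
    indeg = sum(dst == block_id for _, dst in edges)
    outdeg = sum(src == block_id for src, _ in edges)
    return [indeg, outdeg]
```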
Structural Embedding. Existing structural-aware similarity methods rely on the CFG of the function and extract features for each basic block. Our structural embedding model is implemented using the structure2vec [37] algorithm, which is also used in Gemini. The structure2vec model is illustrated in Fig. 4. The input of structure2vec is a CFG denoted as ⟨V, E, x⟩, containing three elements; each node i in the graph has initial node characteristics x. By learning through the structural embedding network, each node feature in the graph is represented as a new feature vector g_i.
The structural embedding vector is produced by Algorithm 1.
The method of updating the structural attribute embedding is visualized in Fig. 4. The process from x_i to g_i adopts the idea of variational inference and is an iterative calculation based on the graph topology. After a certain number of iterations, the network calculates a new feature representation for each node i that considers both the graph characteristics and long-range interactions between node features. Initially, each node embedding is set to 0, and a single iteration for each node is as follows:

g_i^(t) = tanh(W_1 x_v + σ(Σ_{j∈N_i} g_j^(t−1))), ∀i ∈ V,   (1)

where N_i denotes the direct neighbors of node i. Assuming there are T iterations in total, each iteration updates all node features of the entire graph synchronously, and each new round of calculation builds on the results of the previous iteration. After T iterations, the vector μ_i^T of each node contains the relevant information of all nodes within T hops. x_v is the basic feature vector of the basic block, W_1 is a d × p matrix, and p is the dimension of the final embedding vector. σ(⋅) is an n-layer fully connected neural network, formulated as follows:

σ(l) = P_1 × ReLU(P_2 × ⋯ ReLU(P_n l))   (n levels),   (2)

where each P_i is a p × p matrix and n is the embedding depth. ReLU is the rectified linear unit, ReLU(x) = max{0, x}. In DeepDual-SD, ReLU is used as the non-linear activation function because it improves the learning dynamics of the network and significantly reduces the number of iterations required for convergence in deep networks. The function embedding vector g is computed as the aggregation W_2(Σ_{i∈V} g_i^T). Because the embedding size is 64, the structural embedding representation is g = [g_1, g_2, …, g_64].

Fig. 4 Architecture of structure2vec for structural attribute embedding
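The iteration of Eqs. (1)–(2) can be written down compactly in NumPy, as in the sketch below. The shapes follow the description above; the weights W_1, W_2 and P_1, …, P_n are assumed to be given (in the real model they are trained end to end), and T = 5 is an arbitrary illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def structure2vec(X, adj, W1, W2, P, T=5):
    """Iterate Eq. (1) with the sigma network of Eq. (2).
    X: |V| x d block-level features; adj: |V| x |V| CFG adjacency matrix;
    W1: d x p; W2: p x p; P: list [P_1, ..., P_n] of p x p matrices."""
    p = W1.shape[1]
    G = np.zeros((adj.shape[0], p))        # g_i^(0) = 0 for every node
    for _ in range(T):
        S = adj @ G                        # row i: sum of neighbor embeddings
        for Pk in P[:0:-1]:                # inner layers P_n ... P_2, each with ReLU
            S = relu(S @ Pk.T)
        S = S @ P[0].T                     # outermost P_1 of Eq. (2)
        G = np.tanh(X @ W1 + S)            # Eq. (1), all nodes updated synchronously
    return G.sum(axis=0) @ W2.T            # aggregate nodes into the function vector g
```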
3.5 Dual Attribute Fusion Mechanism

A gated attention-based network is proposed to incorporate the semantic information and the structural graph representation. It is a variant of attention-based networks, with an additional gate that determines the importance of the information in the function graph with respect to an instruction. Given the semantic representation {p_t}_{t=1}^{384} and the structural representation {g_t}_{t=1}^{64}, the fusion map generates the function representation via a gated attention mechanism as follows:

y = H(p, W_H) ⋅ T(g, W_T) + p ⋅ (1 − T(g, W_T)),   (3)

where y is the output of the fusion module, H(⋅) is an attention layer with a non-linear unit and T(⋅) acts as a transform gate, which is also an attention layer. H(⋅) and T(⋅) additionally reshape the semantic and structural representations to compatible matrix sizes for the multiplication. W_H and W_T are weight parameters that are trained with the whole network.
Different from the gates in LSTM or GRU [38], this additional gate is based on the current instruction function and its attention vector over the graph function, and focuses on the relation between the semantic and the structural representation.
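Equation (3) has the shape of a highway-style gate. The NumPy sketch below makes that concrete; the choice of sigmoid for the gate T(⋅), tanh inside H(⋅), and the specific weight shapes are illustrative assumptions, since the paper describes H(⋅) and T(⋅) only as attention layers with trainable weights W_H and W_T.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(p, g, W_H, W_T):
    """Eq. (3): y = H(p, W_H) * T(g, W_T) + p * (1 - T(g, W_T)).
    p: semantic vector (e.g. 384-dim); g: structural vector (e.g. 64-dim);
    W_H: |p| x |p|; W_T: |p| x |g|, so the gate matches p's dimension."""
    H = np.tanh(W_H @ p)        # transformed semantic representation
    T = sigmoid(W_T @ g)        # gate in [0, 1], driven by the structural side
    return H * T + p * (1.0 - T)
```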
Finally, DeepDual-SD learns its parameters in a siamese architecture and compares the similarity of two functions Q and T using the formula:

similarity(f_Q, f_T) = (Σ_{i=1}^n f_Q[i] ⋅ f_T[i]) / √((Σ_{i=1}^n f_Q[i]²) ⋅ (Σ_{i=1}^n f_T[i]²)),   (4)

where f[i] indicates the i-th component of the vector f. The required input is a set of K function pairs (f_Q, f_T) to train the network. The final output of the siamese architecture is the similarity score between the two inputs. The ground-truth labels are y_i ∈ {+1, −1}, where +1 indicates that the two input functions are similar and −1 means they are dissimilar. The loss function can be denoted as follows:
J = Σ_{i=1}^K (similarity(f_Q, f_T) − y_i)².   (5)

In our approach, the Adam optimizer is chosen to optimize the loss function of the network. The model parameters are fine-tuned by the Adam optimizer, which has been shown to be an effective and efficient back-propagation algorithm.
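Reading Eq. (4) as cosine similarity over the fused embeddings, Eqs. (4)–(5) translate directly into the following sketch; the batching of pairs and labels here is schematic.

```python
import numpy as np

def similarity(f_q, f_t):
    """Eq. (4): cosine similarity between two fused function embeddings."""
    num = np.sum(f_q * f_t)
    den = np.sqrt(np.sum(f_q ** 2) * np.sum(f_t ** 2))
    return num / den

def siamese_loss(pairs, labels):
    """Eq. (5): squared error between similarity scores and labels in {+1, -1}."""
    return sum((similarity(f_q, f_t) - y) ** 2
               for (f_q, f_t), y in zip(pairs, labels))
```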

4 Performance Evaluation

In this section, we describe the datasets and evaluation metrics used to evaluate our proposed method. Then, we study similarity detection performance in cross-compiler, cross-architecture and cross-version settings. Furthermore, the impact of various embedding dimensions is discussed to achieve the best results.

4.1 Datasets and Experimental Settings

Datasets. We collect two datasets to investigate the method's performance on several tasks. The function pairs consist of binaries compiled from source code for which we have ground truth. The compiled object files were disassembled with IDA Pro [36] and then preprocessed for encoding, as described in Sect. 3.
OpenSSL Dataset. This dataset was obtained by compiling OpenSSL [39] (versions 1.0.1f and 1.0.1u). The compiler is set to emit code for the ARM, MIPS and x86 platforms. The compilation was done using gcc-5.4 with the four optimization levels O0–O3. We obtain a total of 66,964 function pairs.
Debian Dataset. This dataset comes from the Debian package repository, from which we directly collected binaries in deb packages [40]. We collected packages of different versions from Debian 7.11, Debian 8.11 and Debian 9.11. We grouped each version of a binary with its closest version as a pair, and obtained 93,324 pairs in total. The pairs are divided into two parts by the following rule: pairing two binary functions originating from the same source code yields similar pairs; randomly pairing functions that do not derive from the same source code yields dissimilar pairs.
We split the datasets into training, validation and testing sets (8:1:1). Specifically, we associate each similar pair with the training label ⟨+1⟩ and each dissimilar pair with the training label ⟨−1⟩.
Experimental Settings. We implemented our embedding model in Python using Keras 2.3 [41] and TensorFlow 1.14 [42]. The experiments are performed on a computer running the Ubuntu 18.04 operating system with a 64-bit 2.7 GHz Intel® Core(TM) i7 CPU, 48 GB of RAM and an NVIDIA 1080 Ti GPU. In the following experiments, the networks are trained with the average mean squared loss between the estimated similarity scores and the ground-truth labels, and the network parameters are tuned with the Adam optimization algorithm [43] with a learning rate of 0.0001. During the training process, we measure the loss and AUC on the validation set and save the model that achieves the best AUC on the validation set.

4.2 Evaluation Metrics

Identifying matching functions accurately is important for BCSD solutions. We therefore evaluate matching quality with Precision, Recall and AUC, which are common evaluation metrics in machine learning and information retrieval tasks.
A robust evaluation of a model not only needs to verify, via the precision rate, the ratio of correctly found query functions among the returned results, but also needs to establish, via the recall rate, how many of the target functions are detected. Precision and Recall are formulated as follows:

Precision = TP / (TP + FP),   Recall = TP / (TP + FN),   (6)

where TP, FP and FN denote the numbers of true positives, false positives and false negatives, respectively.
Similar to [21, 25], this paper also uses the receiver operating characteristic curve (ROC curve) [44] and the area under the curve (AUC) computed from the model's predictions to measure the performance of our approach. The AUC depends on the calculation of the percentage of the query results. The higher the AUC value, the better the predictive performance of the algorithm.
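These are standard metrics; as one possible realization (scikit-learn is not named in the paper), they can be computed from the pairwise similarity scores and the ±1 ground-truth labels, thresholding the scores to obtain binary predictions:

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

def evaluate(scores, labels, threshold=0.0):
    """scores: similarity in [-1, 1]; labels: +1 similar / -1 dissimilar."""
    preds = [1 if s >= threshold else -1 for s in scores]
    return {
        "precision": precision_score(labels, preds),  # Eq. (6), TP / (TP + FP)
        "recall": recall_score(labels, preds),        # Eq. (6), TP / (TP + FN)
        "auc": roc_auc_score(labels, scores),         # threshold-free ranking quality
    }
```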
4.3 Performance in Cross-Compiler BCSD

These experiments analyze binary code similarity detection across compiler optimization options. We use the same experimental environment configuration to compare DeepDual-SD with the following baseline methods, which are effective and have achieved good results in BCSD. Gemini [25] accumulates real-valued feature vectors into a graph embedding and then computes the similarity of the feature vectors. SAFE [21] uses an NLP-based approach leveraging a self-attentive neural network to create function embeddings. The comparison results for cross-compiler BCSD are presented in Table 3.

Table 3 Comparison of matching-function AUC for different methods (DeepDual-SD, Gemini and SAFE) in cross-optimization-level settings. O0, O1, O2 and O3 denote compilation with the four different optimization levels

          | ARM                       | MIPS                      | x86
          | Gemini  SAFE  DeepDual-SD | Gemini  SAFE  DeepDual-SD | Gemini  SAFE  DeepDual-SD
O0 vs. O1 | 0.893   0.932 0.930       | 0.880   0.907 0.935       | 0.896   0.919 0.950
O0 vs. O2 | 0.865   0.921 0.945       | 0.884   0.884 0.926       | 0.895   0.937 0.948
O0 vs. O3 | 0.851   0.897 0.925       | 0.901   0.888 0.928       | 0.859   0.899 0.939
O1 vs. O2 | 0.920   0.948 0.972       | 0.910   0.928 0.957       | 0.932   0.954 0.964
O1 vs. O3 | 0.916   0.934 0.968       | 0.900   0.913 0.944       | 0.912   0.893 0.956
O2 vs. O3 | 0.920   0.958 0.979       | 0.945   0.963 0.960       | 0.959   0.957 0.975

From Table 3, DeepDual-SD achieves better results than the other methods except for cross-optimization-level O0 vs. O1 on ARM and O2 vs. O3 on MIPS. The results of DeepDual-SD on the x86 architecture are 96.4, 95.6 and 97.5% for cross-optimization-levels O1 vs. O2, O1 vs. O3 and O2 vs. O3. DeepDual-SD gives relative improvements of 3.13, 4.50 and 3.64% over the semantic-based method SAFE on MIPS O1 vs. O2, x86 O0 vs. O3 and ARM O1 vs. O3. Compared with the structural-based method Gemini, DeepDual-SD gives superior performance on all O0-to-O3 optimization-level similarity comparisons. DeepDual-SD outperforms both Gemini and SAFE in every x86 setting and in nearly every MIPS setting. These results indicate that DeepDual-SD is more feasible across various architectures. At the same time, it can be seen that the performance of the dual attribute-based method is better than that of the single-attribute approaches.

4.4 Performance in Cross-Architecture BCSD

Since DeepDual-SD also targets binary code similarity detection across instruction sets, we use the same experimental configuration for comparison here. More specifically, we match all functions in the OpenSSL binaries compiled for one architecture (i.e., x86, MIPS or ARM) with the functions of the same names in binaries compiled for another architecture. The result is shown in Fig. 5.
As can be seen from Fig. 5, compared to Gemini and SAFE, the DeepDual-SD proposed in this paper achieves the best results in all three sets of experiments. In the MIPS vs. ARM comparison, our method achieves the best performance: the result of DeepDual-SD is 8.25% higher than SAFE and 3.23% higher than Gemini on average. The results in this section show that DeepDual-SD is superior to the other baseline methods on cross-architecture tasks with the same settings.

Fig. 5 ROC curves for different approaches in OpenSSL binaries compiled for ARM, MIPS and x86

4.5 Performance in Cross-Version BCSD

When a program needs to be updated (because of patched vulnerabilities and errors), a new version is released in binary format without disclosing details of the changes. Software analysts are therefore greatly interested in understanding the differences between the two versions of a program, and binary cross-version similarity analysis is one of the most useful techniques for discovering these differences. In this section, we evaluate the performance of DeepDual-SD on the Debian dataset, as shown in Fig. 6.
Fig. 6 ROC curves for different approaches compared on binaries from Debian 7.11, Debian 8.11 and Debian 9.11

We can see that DeepDual-SD performs slightly better than Gemini and SAFE when the version gap is small, and clearly better when the version gap is large. The AUC of DeepDual-SD improves on Gemini by 0.72 and 0.71% from Debian v7 to v8 and from v8 to v9, respectively. DeepDual-SD outperforms Gemini by 1.97% when comparing binaries from v7 to v9. In addition, Gemini is also better than SAFE across the three versions; for example, Gemini outperforms SAFE by over 0.69% on average. This shows that the structural feature, captured by the structural attribute embedding network, is a very strong feature for cross-version BCSD.

4.6 Discussion

For DeepDual-SD, the purpose of dual-attribute embedding is to strengthen the understanding of the connection between the semantics and the structure of binary code in the process of comparing similarity. For binary code similarity analysis, the method used to generate the embedding vectors affects accuracy. Compared to a single-attribute embedding vector, a multi-attribute embedding vector contains more feature dimensions and can automatically adjust the impact of each feature according to the similarity task at hand, giving better adaptability. The experiments show that the dual-attribute embedding vector achieves better results than the other embedding models. Hence, a method that generates a comprehensive function representation is more suitable for BCSD. In most scenarios, DeepDual-SD obtains better results than the other baseline methods, showing that DeepDual-SD has greater detection ability and performs better than state-of-the-art DNN-based BCSD methods.

5 Conclusion

In this paper, we proposed a deep dual attribute-aware embedding method for BCSD. To the best of our knowledge, using dual attribute-aware embedding to automatically learn discriminative function features to improve BCSD performance is pioneering work. Compared with some state-of-the-art baseline methods, the new method is more effective and efficient in terms of detection quality in most cases.
Future work will focus on embedding methods suitable for the smaller datasets of some BCSD tasks. Moreover, in the real world, training-based approaches are already applied in some vulnerability-mining products combined with federated learning. Real-world datasets are decentralized among multiple client devices (e.g., IoT devices), which gives such training better practical applicability, and we are applying our approach to federated learning to achieve better detection capabilities.

Author Contributions Jiabao Guo and Bo Zhao contributed to the conception of the study. Jiabao Guo performed the data analyses and wrote the manuscript. Hui Liu contributed significantly to the analysis and manuscript preparation. Dongdong Leng conducted the research and investigation process, specifically performing the experiments and data collection. Yang An had oversight responsibility for the research activity execution, including mentorship external to the core team. Gangli Shu helped perform the analysis with constructive discussions.

Funding The work described in this paper is supported by the National Natural Science Foundation of China (No. U1936122) and the Primary Research Development Plan of Hubei Province (2020BAB101).

Data Availability The authors confirm that the data supporting the findings of this study are available within the article and its supplementary materials.

Declarations

Conflict of interest The authors declare that no competing interests exist.

Ethical Approval and Consent to Participate The research does not relate to personal privacy.

Consent to Publication All authors approved the final manuscript.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References
1. Haq, I.U., Caballero, J.: A survey of binary code similarity (2019). arXiv:1909.11424
2. Luo, L., Ming, J., Wu, D., Liu, P., Zhu, S.: Semantics-based obfuscation-resilient binary code similarity comparison with applications to software and algorithm plagiarism detection. In: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 389–400 (2014)
3. Sæbjørnsen, A., Willcock, J., Panas, T., Quinlan, D.J., Su, Z.: Detecting code clones in binary executables. In: Proceedings of the Eighteenth International Symposium on Software Testing and Analysis, pp. 117–128 (2009)
4. Chen, K., Liu, P., Zhang, Y.: Achieving accuracy and scalability simultaneously in detecting application clones on Android markets. In: Proceedings of the 36th ACM International Conference on Software Engineering, ICSE, pp. 175–186 (2014)
5. Zhang, F., Wu, D., Liu, P., Zhu, S.: Program logic based software plagiarism detection. In: 25th IEEE International Symposium on Software Reliability Engineering, ISSRE, pp. 66–77 (2014)
6. Gao, J., Yang, X., Fu, Y., Jiang, Y., Sun, J.: VulSeeker: a semantic learning based vulnerability seeker for cross-platform binary. In: Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE, pp. 896–899 (2018)
7. Shirani, P., Collard, L., Agba, B.L., Lebel, B., Debbabi, M., Wang, L., Hanna, A.: BINARM: scalable and efficient detection of vulnerabilities in firmware images of intelligent electronic devices. In: Detection of Intrusions and Malware, and Vulnerability Assessment, 15th International Conference, DIMVA, vol. 10885, pp. 114–138. Springer (2018)
8. Cesare, S., Xiang, Y., Zhou, W.: Control flow-based malware variant detection. IEEE Trans. Dependable Secure Comput., pp. 307–317 (2014)
9. Bell, S., Bala, K.: Learning visual similarity for product design with convolutional neural networks. ACM Trans. Graph. 34, 98:1–98:10 (2015)
10. Hu, X., Chiueh, T., Shin, K.G.: Large-scale malware indexing using function-call graphs. In: Proceedings of the ACM Conference on Computer and Communications Security, CCS, pp. 611–620 (2009)
11. Jang, J., Woo, M., Brumley, D.: Towards automatic software lineage inference. In: Proceedings of the 22nd USENIX Security Symposium, pp. 81–96 (2013)
12. Farhadi, M.R., Fung, B.C.M., Charland, P., Debbabi, M.: BinClone: detecting code clones in malware. In: Proceedings of the 18th IEEE International Conference on Software Security and Reliability, SERE, pp. 78–87 (2014)
13. Brumley, D., Poosankam, P., Song, D.X., Zheng, J.: Automatic patch-based exploit generation is possible: techniques and implications. In: IEEE Symposium on Security and Privacy (S&P), pp. 143–157. IEEE Computer Society (2008)
14. Xu, Z., Chen, B., Chandramohan, M., Liu, Y., Song, F.: SPAIN: security patch analysis for binaries towards understanding the pain and pills. In: Proceedings of the 39th IEEE/ACM International Conference on Software Engineering, ICSE, pp. 462–472 (2017)
15. Li, Y., Xu, W., Tang, Y., Mi, X., Wang, B.: SemHunt: identifying vulnerability type with double validation in binary code. In: The 29th International Conference on Software Engineering and Knowledge Engineering, pp. 491–494 (2017)
16. Flake, H.: Structural comparison of executable objects. In: Detection of Intrusions and Malware & Vulnerability Assessment, GI SIG SIDAR Workshop, DIMVA, vol. P-46, pp. 161–173 (2004)
17. Alrabaee, S., Shirani, P., Wang, L., Debbabi, M.: FOSSIL: a resilient and efficient system for identifying FOSS functions in malware binaries. ACM Trans. Priv. Secur. 21, 8:1–8:34 (2018)
18. Gao, D., Reiter, M.K., Song, D.X.: BinHunt: automatically finding semantic differences in binary programs. In: Information and Communications Security, 10th International Conference, ICICS, pp. 238–255 (2008)
19. Ming, J., Pan, M., Gao, D.: iBinHunt: binary hunting with inter-procedural control flow. In: Proceedings of the 15th International Conference on Information Security and Cryptology, ICISC, vol. 7839, pp. 92–109. Springer (2012)
20. Zuo, F., Li, X., Young, P., Luo, L., Zeng, Q., Zhang, Z.: Neural machine translation inspired binary code similarity comparison beyond function pairs. In: Proceedings of the 26th Annual Network and Distributed System Security Symposium, NDSS (2019)
21. Massarelli, L., Luna, G.A.D., Petroni, F., Baldoni, R., Querzoni, L.: SAFE: self-attentive function embeddings for binary similarity. In: Proceedings of the 16th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, DIMVA, vol. 11543, pp. 309–329 (2019)
22. Alrabaee, S., Shirani, P., Wang, L., Debbabi, M.: SIGMA: a semantic integrated graph matching approach for identifying reused functions in binary code. Digit. Investig. 12, 61–71 (2015)
23. Eschweiler, S., Yakdan, K., Gerhards-Padilla, E.: discovRE: efficient cross-architecture identification of bugs in binary code. In: Proceedings of the 23rd Annual Network and Distributed System Security Symposium, NDSS (2016)
24. Chandramohan, M., Xue, Y., Xu, Z., Liu, Y., Cho, C.Y., Tan, H.B.K.: BinGo: cross-architecture cross-OS binary search. In: Proceedings of the 24th ACM International Symposium on Foundations of Software Engineering, pp. 678–689 (2016)
25. Xu, X., Liu, C., Feng, Q., Yin, H., Song, L., Song, D.: Neural network-based graph embedding for cross-platform binary code similarity detection. In: Proceedings of the ACM Conference on Computer and Communications Security, CCS, pp. 363–376 (2017)
26. Chen, D., Yuan, Z., Chen, B., Zheng, N.: Similarity learning with spatial constraints for person re-identification. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 1268–1277 (2016)
27. Gao, X., Mu, T., Goulermas, J.Y., Thiyagalingam, J., Wang, M.: An interpretable deep architecture for similarity learning built upon hierarchical concepts. IEEE Trans. Image Process. 29, 3911–3926 (2020)
28. Ou, M., Cui, P., Pei, J., Zhang, Z., Zhu, W.: Asymmetric transitivity preserving graph embedding. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1105–1114 (2016)
29. Heimann, M., Shen, H., Safavi, T., Koutra, D.: REGAL: representation learning-based graph alignment. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM, pp. 117–126 (2018)
30. Ding, S.H.H., Fung, B.C.M., Charland, P.: Asm2Vec: boosting static representation robustness for binary clone search against code obfuscation and compiler optimization. In: Proceedings of the IEEE Symposium on Security and Privacy, SP, pp. 472–489 (2019)
31. Hin, D., Kan, A., Chen, H., Babar, M.A.: LineVD: statement-level vulnerability detection using graph neural networks (2022). arXiv:2203.05181. https://doi.org/10.48550/arXiv.2203.05181
32. Neysiani, B.S., Morteza Babamir, S.: Automatic duplicate bug report detection using information retrieval-based versus machine learning-based approaches. In: 2020 6th International Conference on Web Research (ICWR), pp. 288–293 (2020). https://doi.org/10.1109/ICWR49608.2020.9122288
33. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: a lite BERT for self-supervised learning of language representations (2019). arXiv:1909.11942
34. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, pp. 4171–4186 (2019)
35. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9, 1735–1780 (1997)
36. Hex-Rays: IDA Pro disassembler and debugger. Retrieved from https://www.hex-rays.com/products/ida/index.shtml (2015)
37. Dai, H., Dai, B., Song, L.: Discriminative embeddings of latent variable models for structured data. In: Proceedings of the 33rd International Conference on Machine Learning, ICML, vol. 48, pp. 2702–2711 (2016)
38. Cho, K., van Merrienboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP, pp. 1724–1734. ACL (2014)
39. OpenSSL. Retrieved from https://www.openssl.org/ (2020)
40. Debian. Retrieved from https://www.debian.org/ (2020)
41. Chollet, F.: Keras. Retrieved from https://keras.io/ (2015)
42. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D.G., Steiner, B., Tucker, P.A., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: a system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation, pp. 265–283 (2016)
43. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR (2015)
44. Herlocker, J.L., Konstan, J.A., Terveen, L.G., Riedl, J.: Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. 22(1), 5–53 (2004)

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.