ABSTRACT
Binary function clone search is an essential capability that enables multiple applications and use cases, including reverse engineering, patch security inspection, threat analysis, vulnerable function detection, etc. As such, a surge of interest has been expressed in designing and implementing techniques to address function similarity on binary executables and firmware images. Although existing approaches have merit in fingerprinting function clones, they present limitations when the target binary code has been subjected to significant code transformation resulting from obfuscation, compiler optimization, and/or cross-compilation to multiple CPU architectures. In this regard, we design and implement a system named BinFinder, which employs a neural network to learn binary function embeddings based on a set of extracted features that are resilient to both code obfuscation and compiler optimization techniques. Our experimental evaluation indicates that BinFinder outperforms state-of-the-art approaches for multi-CPU architectures by a large margin, with 46% higher Recall against Gemini, 55% higher Recall against SAFE, and 28% higher Recall against GMN. With respect to obfuscation and compiler optimization clone search approaches, BinFinder outperforms asm2vec (a single-CPU-architecture approach) with 30% higher Recall and BinMatch (a multi-CPU-architecture approach) with 10% higher Recall. Finally, our work is the first to provide noteworthy results with respect to binary clone search over the tigress obfuscator, which is a well-established open-source obfuscator.

CCS CONCEPTS
• Security and privacy → Software reverse engineering.

KEYWORDS
Binary Code Similarity, Feature Evaluation and Selection

ACM Reference Format:
Abdullah Qasem, Mourad Debbabi, Bernard Lebel, and Marthe Kassouf. 2023. Binary Function Clone Search in the Presence of Code Obfuscation and Optimization over Multi-CPU Architectures. In ACM ASIA Conference on Computer and Communications Security (ASIA CCS '23), July 10–14, 2023, Melbourne, VIC, Australia. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3579856.3582818

1 INTRODUCTION
The most prominent techniques that address function similarity on binary executables and firmware images leverage machine learning [4, 5], deep learning [3, 18, 27], graph theory [4], etc. While these techniques have established merit in fingerprinting binary function clones, they nevertheless present limitations when it comes to code transformation techniques and multiple CPU architectures. The reason underlying this performance degradation is twofold: first, most state-of-the-art techniques rely on features that are not resilient to code transformation; second, the underlying models are limited in fully grasping function semantics. Hence, there is still a pressing need for a more accurate binary clone search system in the presence of advanced code transformations. Nowadays, software companies implement such protections to impede reverse-engineering attempts and to protect against trade secret (intellectual property) theft.

asm2vec [3] is the first attempt to address binary function clone search in the presence of code obfuscation, but only over a single CPU architecture (x86). It also performs poorly over O0-vs-O3 compiler optimization and over different O-LLVM obfuscation techniques, as reported in [7, 9].

BinMatch [7] addresses code obfuscation over multiple architectures, but only over O-LLVM obfuscation techniques, and shows very low accuracy against the Bogus Control Flow (BCF) and Control Flow Flattening (FLA) obfuscation techniques. There are several sophisticated obfuscation techniques supported by open-source obfuscation tools (e.g., tigress [25]) that have not yet been studied in the context of binary function clone search. To address the previously mentioned limitations, we propose a new system called BinFinder: a multi-architecture binary function clone search system that operates in the presence of code obfuscation techniques and compiler optimization levels. BinFinder uses an end-to-end learning model composed of a customized multi-layer perceptron neural network within a Siamese neural network to learn binary function representations. The model is trained on a set of manually engineered interpretable features selected at the binary function level. These features are easy to extract, robust, CPU architecture independent, and resilient to both compiler optimization and code obfuscation techniques. We upload the source code, dataset, and experiment results to this repository1 for evaluation.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ASIA CCS '23, July 10–14, 2023, Melbourne, VIC, Australia
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-0098-9/23/07. . . $15.00
https://doi.org/10.1145/3579856.3582818

1 http://bit.ly/3GBHl9f
ASIA CCS ’23, July 10–14, 2023, Melbourne, VIC, Australia Abdullah Qasem, Mourad Debbabi, Bernard Lebel, and Marthe Kassouf
The contributions of this work are as follows:

• We identify a set of engineered interpretable binary function features that are resilient to both code optimization and obfuscation techniques on multi-CPU architectures.
• We design a Siamese neural network architecture to train a corresponding neural network model using our newly proposed features to generate binary function embeddings for similarity detection.
• We conduct an extensive evaluation of BinFinder over three scenarios: (1) a single CPU architecture (x86) in the presence of different compiler optimization levels and code obfuscations; (2) multi-CPU architectures where different compiler optimizations are applied; (3) multi-CPU architectures in the presence of different compiler optimization levels and code obfuscations.
• We demonstrate that the overall performance is maintained (with small fluctuations) by performing additional experiments to stress test our approach with respect to several conditions, namely: compiler choice, compiler optimization levels, code obfuscation techniques, targeted CPU architectures, and considered packages (libraries).

2 QUEST FOR RESILIENT FEATURES
Identifying resilient features is an essential step in building an efficient machine learning model. Table 1 lists potential extracted numerical features as well as features widely used in the literature. After an in-depth analysis of the features that survive code transformation, we identify the following resilient features:

• num_callers: the count of binary functions that call the targeted function.
• num_libc_callees: the count of libc functions, such as strlen, memcpy, socket, etc., called by the targeted function.
• num_callees: the count of all functions that are called by the targeted function, including libc calls, where some functions may be called more than once.
• num_unique_callees: the number of unique functions that are called by the targeted function.

To assess the resiliency of each extracted feature, outlined in Table 1, in the presence of compiler optimization or code obfuscation, we calculate the empirical distribution induced by the absolute difference between the targeted feature values (extracted from the original binary functions, compiled with optimization O0) and their related similar functions (compiled with other optimizations or code obfuscations) in our created Dataset-III, outlined in Section 3.4. For example, we calculate the absolute difference between every two similar functions, (f_i, O0, feature=num_libc_callees, compiler=gcc, architecture=x86) and (f_j, SUB, feature=num_libc_callees, compiler=clang, architecture=ARM), as an input to the empirical distribution function. Finally, we use the resulting P(0) and diff_mean as metrics to decide whether the targeted feature is resilient to obfuscation and optimization over multi-CPU architectures. P(0) is the probability that a pair of similar binary functions has the same targeted feature value (i.e., their absolute difference is zero). A high P(0) value means that the targeted feature is resilient to either compiler optimizations or code obfuscations over multi-CPU architectures. On the other hand, the absolute difference mean (diff_mean) indicates to what extent the selected feature value is affected by the targeted compiler optimization or code obfuscation compared to the same feature value extracted from the original similar binary function. Small diff_mean values indicate that the targeted feature is not much affected by either optimizations or obfuscations. In essence, a good candidate feature should have a high P(0) probability and a low absolute difference mean (diff_mean).

Table 1 presents the calculated values of both P(0) and diff_mean across the numerical features extracted from our Dataset-III, which includes several packages cross-compiled to two different CPU architectures (x86, ARM) using different compiler optimizations and code obfuscations, as detailed in Section 3.4. From the table, we can see that num_callers, num_libc_callees, num_callees, and num_unique_callees have very high P(0) values along with very low absolute difference means, which indicates that these features are resilient to code obfuscation and compiler optimization. We can also observe that, on average, 75% of similar binary functions in our dataset have the exact same num_callers value. The remaining 25% of similar binary functions have only a small num_callers difference, with an average diff_mean of 0.6225.
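As a minimal illustration of this procedure (our own sketch, not the authors' code; the feature values below are made up), P(0) and diff_mean for one feature can be computed from pairs of values extracted from similar functions:

```python
def resilience_metrics(pairs):
    """P(0): fraction of similar-function pairs whose feature values match
    exactly; diff_mean: mean absolute difference across all pairs."""
    diffs = [abs(original - transformed) for original, transformed in pairs]
    p0 = sum(1 for d in diffs if d == 0) / len(diffs)
    diff_mean = sum(diffs) / len(diffs)
    return p0, diff_mean

# Hypothetical num_callers values for (O0 original, obfuscated clone) pairs.
pairs = [(3, 3), (5, 5), (2, 4), (7, 7)]
p0, diff_mean = resilience_metrics(pairs)  # p0 = 0.75, diff_mean = 0.5
```

Under this scheme, a feature with high P(0) and low diff_mean across the dataset is kept as a candidate, while one with P(0) below 0.5 and a large diff_mean is discarded.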
Moreover, 91% of similar binary functions have the same num_libc_callees value, while the remaining 9% have only a small difference, with an average diff_mean of 0.18. The feature num_callees has P(0) values greater than 0.6 across the different code optimizations and obfuscations, except over the BCF obfuscation technique, where its P(0) = 0.4. One reason for this behaviour is that BCF changes the CFGs by introducing several new unrelated basic blocks, consisting of function calls, that will never be executed. However, the remaining similar binary functions' num_callees values show only small differences (diff_mean=2.042) compared to their original similar ones. Moreover, we investigated the lists of binary function names called by a targeted function (callees). We find that some functions are called more than once by the targeted functions. Hence, we consider using num_unique_callees as an additional feature. From Table 1, we observe that num_unique_callees P(0) values are greater than 0.7. As per the aforementioned analysis, we highlight that our newly introduced numerical features are excellent candidates.

The remaining extracted features, such as num_Cmp, num_Logic, num_Arithm, and num_constants, have P(0) values below 0.5, and their related average diff_mean is very high. Therefore, they are not good candidate features, and we do not include them in our approach. In addition to the new numerical features outlined in Table 1, we extract the list of libcCalls and the list of Constants from each binary function in our dataset. This is motivated by the fact that we encountered several instances where two dissimilar functions have the same num_libc_callees or num_callers values but do not have the same list of libcCalls or Constants. Moreover, common reverse engineering tools, such as the IDA Pro disassembler, which we use, maintain databases of function signatures named FLIRT [6] to recognize standard functions such as those included at link time.

Regarding the list of callers and the list of callees features, we considered only their length, since we cannot get the exact function caller or callee names when the targeted binaries are stripped; instead, we get a random function name created during disassembly. However, there are a few functions, less than 5% of our collected Dataset-III and generally small ones, that have neither Constants nor libcCalls. Consequently, several dissimilar functions appear similar. To address the issue with these small functions, we decided to utilize their generated assembly functions; however, every CPU architecture has its own assembly language instructions. Therefore, we resort to lifting multi-CPU assembly instructions into an intermediate representation, VEX-IR, using the angr framework [24]. Theoretically, other intermediate representations could be applied in the context of our work, e.g., Valgrind [19]. Besides, code obfuscations and compiler optimizations dramatically affect the number of generated assembly instructions per function. To address this issue, we only consider the unique normalized VEX-IR instructions of each function. To calculate the P(0) corresponding to unique_vex_Instructions, we use the Jaccard distance to compute the difference. From Table 1, we observe that the P(0) of unique_vex_Instructions is small (0.2 on average). Also, its diff_mean is very small (0.15 on average). Our aforementioned thorough analysis indicates that the following features are resilient to code transformation: the list of libcCalls, the list of Constants, num_callees, num_callers, num_libc_callees, num_unique_callees, and the list of unique VEX instructions.

3 BINFINDER APPROACH
In this section, we elaborate all the steps required to design, implement, and test our proposed binary function clone search approach, named BinFinder. The process, as depicted in Figure 1, is divided into four steps: (1) data collection and generation, (2) feature selection and representation, (3) model learning, and (4) query and results. In Step (1), we collect the source code of several software packages along with their reported vulnerabilities in the NIST CVE database2, as outlined in Section 3.4. In Step (2), we disassemble every binary function in our repository to extract and preprocess the selected features required by our proposed neural network model. In Step (3), we train and test our proposed end-to-end Siamese neural network to build an efficient binary function embedding model, required to generate an embedding for every binary function in our repository. In Step (4), given a new binary function f_q (e.g., a newly discovered vulnerability), we initially extract its related features as in Step (2). Then, we generate its embedding using our trained model obtained in learning Step (3). Finally, we compare the generated embedding for the given function f_q against the other binary function embeddings stored in our repository. We use pairwise cosine distance as a measure to retrieve the top-k candidate functions based on the highest cosine similarity scores.

2 https://nvd.nist.gov/vuln/search
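The query step (4) can be sketched as follows — an illustrative ranking by pairwise cosine similarity over toy embeddings, not the authors' implementation (function names and dimensions are made up; the real embeddings are 100-dimensional):

```python
import math

def cosine_similarity(u, v):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_embedding, repository, k=1):
    """Rank repository embeddings by cosine similarity to the query
    and return the names of the k closest binary functions."""
    ranked = sorted(repository,
                    key=lambda name: cosine_similarity(query_embedding,
                                                       repository[name]),
                    reverse=True)
    return ranked[:k]

# Toy 3-dimensional embeddings standing in for the learned ones.
repository = {
    "memcpy_O0": [1.0, 0.0, 0.0],
    "strlen_O0": [0.0, 1.0, 0.0],
    "socket_O0": [0.7, 0.7, 0.0],
}
candidates = top_k([0.9, 0.1, 0.0], repository, k=1)  # ["memcpy_O0"]
```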
3.1 Preprocessing Selected Features
We have two types of features: numerical features and lists of literals. As such, we need to preprocess the selected features for the training and testing processes. The feature libcCalls is a list of function names. The feature Constants is a list of integer numbers. The feature Unique_vex_instructions is a list of normalized VEX-IR keywords: we replace every VEX-IR register name with REG, variable name with TMP, number with CONST, function name with foo, and memory reference with MEM. The remaining four features are numerical integers: the number of callers, the number of callees, the number of unique callees, and the number of libc calls. As such, we need to represent these numerical features as lists of literals to ensure that their representations have the same distribution as the representations of libcCalls, Constants, and unique VEX instructions. In our training experiments, we observe that our model produces less accurate results when the values of its input features have different distributions: the trained neural network model gets confused and assumes that numerical features are more important than literal features. Thus, machine learning practitioners recommend feature preprocessing and normalization. To address this issue, we consider all sequential numbers from one up to the number of callers, number of callees, or number of libcCalls. For example, suppose that the feature num_callees has the value 5; then we represent this feature as '1 2 3 4 5'. This way, the num_callees value is similar to the list of extracted Constants. Then, we treat the resulting list of integer numbers as a list of literals, similar to libcCalls. However, when the num_callers, num_unique_callees, and num_libc_callees feature values are very small, these features will not have a significant impact in terms of deciding the highest similarity, especially when many constants or unique VEX-IR instructions are shared across dissimilar functions. To fix this issue, we assign weights to these features by multiplying the extracted num_callers, num_unique_callees, and num_libc_callees values by 5. We also evaluated multiplying these numbers by 3, 4, 10, 15, and 20. Choosing the value of 5 yields the highest AUC and results in smaller vector sizes.

3.2 Feature Representation
A common representation for a list of keywords is the one-hot vector, which is suitable when there is no ordinal relationship between the list elements. We assign a dedicated tokenizer to each selected feature and end up with seven tokenizers, as shown in Figure 2. For example, we have one tokenizer that receives VEX-IR instructions and produces a one-hot vector of size 266: in our collected Dataset-III, there are 266 normalized unique VEX-IR instructions resulting from lifting the executable binary files cross-compiled to both ARM and x86 architectures. The tokenizer assigned to the libcCalls feature produces a one-hot vector of size 181, as there are 181 unique libcCalls in our collected Dataset-III. The tokenizer assigned to Constants produces a one-hot vector of size 1000; in our analysis, larger numbers are most likely memory addresses. Regarding the num_callers feature, the largest value in our collected Dataset-III is 94; thus, when multiplied by 5, its related vector size is 470. Finally, all the tokenizers' outputs are concatenated to produce a single vector for each binary function.

In our implementation, we utilize the Keras text Tokenizer in Python to preprocess and represent each selected feature as a one-hot vector. The Keras text Tokenizer creates a dictionary consisting of all n unique keywords in a given group of samples. It has four modes to represent a document composed of m words: binary, count, freq, and tfidf. In our experiments, binary mode yields the highest accuracy because it ensures that all feature values lie within the same distribution {0, 1}. In our implementation, if a new system call or instruction is introduced, the related tokenizer will ignore it.

3.3 Siamese Neural Network Architecture
In the literature, binary function fingerprinting is formalized as a similarity problem [27]. We cannot determine the final number of binary functions in advance; therefore, binary function similarity cannot be addressed with traditional classification techniques. To solve such a problem, we need an end-to-end ML technique such as a Siamese neural network.
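The preprocessing and representation scheme of Sections 3.1 and 3.2 can be sketched as follows — a simplified stand-in for the Keras Tokenizer pipeline (the helper names, the toy vocabulary, and the feature values are ours, not the paper's):

```python
def expand_count(n, weight=1):
    """Represent a weighted numeric feature as the literal sequence
    '1', '2', ..., str(n * weight), so it has the same form as the
    list-of-literals features (libcCalls, Constants)."""
    return [str(i) for i in range(1, n * weight + 1)]

def binary_one_hot(tokens, vocab):
    """Binary mode: 1 if a vocabulary token occurs in the sample, else 0.
    Tokens outside the vocabulary are ignored."""
    present = set(tokens)
    return [1 if tok in present else 0 for tok in vocab]

# num_callers = 2, weighted by 5 -> literals '1' .. '10'.
tokens = expand_count(2, weight=5)
vocab = [str(i) for i in range(1, 16)]   # toy vocabulary of size 15
vector = binary_one_hot(tokens, vocab)   # ten 1s followed by five 0s
```

In the full pipeline, one such vector per tokenizer (seven in total) would be concatenated into the single input vector for each binary function.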
Our Siamese neural network architecture is depicted in Figure 3. It is composed of two identical three-layered multi-layer perceptron neural networks [2]. Our designed NN is suitable for our problem: it takes into account our selected features and their newly proposed representation. We experimented with various potential hyper-parameters, including the number of layers, layer sizes, and activation functions, and our final design choice is based on the highest reported AUC. Therefore, we choose the following internal design settings. Each multi-layer perceptron neural network consists of three layers, excluding the input layer: a first layer of size 2048 with ReLU activation, a second layer of size 512 with ReLU activation, and a third layer (an embedding layer) of size 100 without an activation function. Each multi-layer perceptron neural network receives the selected binary function feature representations, as outlined in Section 3.2, to learn and produce the related function embedding at layer three. The Siamese neural network receives the embeddings resulting from its two inner neural networks as input and produces a cosine distance as output. The two multi-layer perceptron neural networks share the same parameters to remain identical during training. They are jointly and iteratively optimized using the following loss function with stochastic gradient descent:

    min Σ_{i=1}^{n} (cos(emd_1 · emd_2 / (||emd_1|| ||emd_2||)) − y_i)^2    (1)

Given n pairs of extracted binary function features (f_1, f_2), each pair is assigned a label y_i = +1 when the functions are similar, and y_i = −1 otherwise. With this loss function, we want to ensure that the embedding of a specific binary function is closer to the embeddings of all its similar binary functions.

3.4 Dataset
In a real-life scenario, the same source code package may be available in different versions and compiled with different optimization options or code obfuscation techniques. We collected seven popular open-source packages to train and test our approach, namely: glibc (2.11 - 2.25), gmp (6.1.0, 6.1.1), gnuBinutils (2.28, 2.29), libcurl (7.32.0 - 7.50.2), openssl (1.0.2s, 1.1.1a), ImageMagic, and zlib (1.2.7.1). For further evaluation, we collected extra packages, as we outline later. We instruct each compiler to emit debug symbols to facilitate building the ground-truth mapping between similar functions for training purposes. For the analysis throughout this paper, we disable inlining and generate the following datasets:

Dataset-I: we instrument (gcc, clang) and O-LLVM to build each package's source code for x86, each time with one of the options O0, O1, O2, O3, FLA, SUB, and BCF. In total, Dataset-I consists of 116,508 binary functions.

Dataset-II: we instrument gcc and clang to compile each package for two different CPU architectures (ARM and x86), each time with one of the optimization options O0, O1, O2, O3. In total, Dataset-II consists of 157,673 binary functions.

Dataset-III: we extend Dataset-I by compiling the binaries for both ARM and x86 CPU architectures. We also instrumented O-LLVM to generate all possible combinations of the four optimization options with the three available obfuscators. In total, Dataset-III consists of 284,491 binary functions.

Dataset-IV: to build this dataset, we configure tigress to obfuscate three packages, namely OpenSSL, Zlib, and Coreutils. tigress receives only a single C/C++ file; to obfuscate a package with tigress, one needs to iterate through every .c file and supply all its required imported library locations and compilation options. To do so, we extract the required information for each package from its related Makefile. Afterwards, we obfuscate all functions within the selected packages using five different obfuscation techniques (the Appendix lists the commands used). Finally, we compile the resulting obfuscated files with gcc for the x86 CPU architecture. In total, Dataset-IV consists of 60,395 binary functions.

Dataset-V: we use an available online dataset composed of 49 packages [11]. Each package is compiled using clang and the O-LLVM obfuscator, each time using O0, O1, O2, O3, FLA, SUB, BCF. In total, this dataset has 164,700 optimized and obfuscated binary functions for the x86 CPU architecture.

For all generated datasets, we use angr [24] to extract VEX-IR instructions and IDA Pro 6.8 [8] to extract the remaining numerical features from each binary function.

3.5 Training
We train our proposed Siamese neural network on Dataset-I and Dataset-III, which are split into training, validation, and testing sets. The training set accounts for 80% of each dataset, while 10% is allocated for validation and testing. We ensure that the packages chosen for training are part of neither validation nor testing, and vice versa, to avoid overfitting and to ensure that the test results accurately portray the generalization capability. Table 3 summarizes the number of functions in each phase. The neural network is implemented using TensorFlow in Python. The Adam optimizer with a learning rate of 0.0001 is employed for training over 100 epochs. Each epoch involves creating two pairs for each binary function in the training set: one with a randomly selected similar function and a label of +1, and the other with a randomly chosen dissimilar function and a label of -1. The training dataset is shuffled and divided into mini-batches of 500 similar and 500 dissimilar pairs. AUC is used to evaluate the network's performance on the validation set, and the model with the highest accuracy is saved. Training time for BinFinder on Dataset-I averages 2 minutes per epoch, reaching a best accuracy of 98% in 30 iterations. Training on Dataset-III takes an average of 7 minutes per epoch, achieving a best accuracy of 97% in 26 iterations. The training and evaluation are conducted on a server equipped with an Intel(R) Xeon(R) CPU E5-2630v3 running at 2.40GHz, 300 GB of memory, and 8 NVIDIA TITAN GPU cards.

4 EVALUATION
In this section, we evaluate our proposed approach. We introduce our evaluation measures in Section 4.1. Then, we present our evaluation results with respect to code obfuscation in Section 4.2 and compiler optimization in Section 4.3.

4.1 Evaluation Measures
When the source code is not available, binary function clone search is addressed in a similar manner to the Information Retrieval (IR) problem.
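The training objective of Equation (1) — the squared difference between the pairwise cosine similarity of the two inner networks' embeddings and the ±1 similarity label — can be written directly in NumPy (a sketch of the loss only, not the authors' TensorFlow training loop; the embeddings below are toy 2-dimensional vectors):

```python
import numpy as np

def siamese_loss(emb1, emb2, labels):
    """Sum of squared differences between the pairwise cosine similarity
    of two embedding batches and their +1/-1 similarity labels."""
    cos = np.sum(emb1 * emb2, axis=1) / (
        np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1))
    return float(np.sum((cos - labels) ** 2))

e1 = np.array([[1.0, 0.0], [0.0, 1.0]])
e2 = np.array([[1.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, -1.0])        # first pair similar, second dissimilar
loss = siamese_loss(e1, e2, y)   # (1 - 1)^2 + (0 - (-1))^2 = 1.0
```

Minimizing this loss pushes the cosine similarity of similar pairs toward +1 and that of dissimilar pairs toward −1, which is what makes cosine-based top-k retrieval meaningful at query time.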
            gmp        OpenSSL    zlib       ImageMagic  Binutils   Coreutils  Findutils  Plotutils  Inetutils  Avg
            M1    M2   M1    M2   M1    M2   M1     M2    M1    M2   M1    M2   M1    M2   M1    M2   M1    M2   M1    M2
O0 vs BCF   0.79  0.84 0.71  0.75 0.87  0.94 -      -     0.8   0.44 0.55  0.57 0.61  0.68 0.78  0.55 0.54  0.59 0.71  0.67
O0 vs FLA   0.92  0.84 0.77  0.72 0.92  0.92 0.8    0.79  0.86  0.86 0.93  0.92 0.84  0.85 1     0.78 0.93  0.94 0.88  0.85
O0 vs SUB   0.93  0.92 0.92  0.88 0.91  0.95 0.85   0.84  0.9   0.89 0.97  0.98 0.93  0.93 0.97  0.87 0.95  0.95 0.92  0.91
Avg         0.88  0.87 0.8   0.78 0.9   0.94 0.825  0.818 0.84  0.66 0.75  0.76 0.75  0.78 0.88  0.69 0.74  0.77 0.84  0.81

Table 2: Impact of obfuscation on BinFinder using precision at top-1

           Training  Validation  Testing  Total
x86        83194     12619       20695    116508
x86 + ARM  209216    36376       38899    284491

Table 3: Dataset-I and Dataset-III split details

Hence, we evaluate BinFinder using Precision and Recall measures from an information retrieval perspective, since AUC is inappropriate in this context: in almost all circumstances, the dataset is extremely skewed, with typically over 99.9% of the binary functions falling in the dissimilar category [15]. The measures that we use are detailed hereafter.

Precision (P) is the fraction of similar functions among the total number of retrieved functions [15]:

    Precision = #Similar Functions Retrieved / #Retrieved Functions    (2)

Recall (R) is the fraction of similar functions retrieved out of all similar functions in the repository [15]:

    Recall = #Similar Functions Retrieved / #Relevant Functions    (3)

The normalized Discounted Cumulative Gain (nDCG) is a measure between 0 and 1:

    nDCG = ( Σ_{i=1}^{k} isSimilar(f_i, q) / log(1 + i) ) / OptimalDCG_k    (4)

where isSimilar(f_i, q) is 1 if f_i is a function similar to q and 0 otherwise, and OptimalDCG_k is the Discounted Cumulative Gain of the optimal query answering. This measure gives a high value to results where similar functions appear in the first positions of the retrieved functions.

4.2 Code Obfuscation
In this section, we evaluate BinFinder in the presence of different code obfuscation techniques implemented by O-LLVM and tigress. An overview of the targeted obfuscation techniques is given in Section A of the Appendix.

4.2.1 O-LLVM Obfuscator. For this evaluation, we train two models. The first model (M1) is trained and tested over binary functions compiled only with different optimization levels in Dataset-I. We then use the resulting model to generate the embeddings for the obfuscated binary functions generated by O-LLVM. This setup aims at evaluating BinFinder's efficiency against binary functions obfuscated using O-LLVM that the model did not encounter during training. The second model (M2) is trained and tested over a mix of optimized and obfuscated binary function samples from Dataset-I, to give our model the opportunity to recognize optimized and obfuscated functions at the same time. For both scenarios, we query every original binary function in Dataset-I and a few selected packages from Dataset-V (due to limited table space) against its similar obfuscated ones, and vice versa. Given a binary function q_i and its embedding resulting from the configuration (clang, O0, x86), we query for its similar binary function embeddings resulting from the configurations (clang, SUB, x86), (clang, FLA, x86), and (clang, BCF, x86) at top-1, using pairwise cosine distance. Based on Table 2, we see that the precision at top-1 fluctuates at the package level across the different code obfuscation techniques. It is worth mentioning that the precision at top-1 is equal to the recall at top-1. For example, the precision at top-1 for the gmp library is different from the precision at top-1 for OpenSSL. Nevertheless, we can see that BinFinder achieves its best result over SUB, with a precision of 93% for M1 and 91% for M2. SUB modifies instruction sequences by adding more instructions in between, but it has a minimal effect on the selected features of BinFinder, such as libcCalls, num_callees, and num_callers. In addition, Table 2 shows that M1 and M2 achieve their lowest accuracy over BCF, with precisions of 71% and 67%, respectively. The reason is that BCF introduces the largest amount of modification in our selected feature num_libc_callees, whose underlying P(0) is 4% lower. Another interesting observation from the table is that M1 usually achieves better accuracy than M2 over both FLA and SUB, for all packages. This is because M1 employs our proposed features without any noise introduced by O-LLVM obfuscation techniques. Overall, on average, both models achieve close precision values, i.e., 84% and 80%, respectively. In the end, the observations mentioned earlier show that BinFinder can identify, with high precision, obfuscated binary functions produced by O-LLVM without needing to see any prior obfuscated samples. Therefore, BinFinder is resilient to the addition of "junk code".

4.2.2 tigress Obfuscator. For this evaluation, we investigate how BinFinder performs against unseen advanced obfuscation techniques. We use the (M1) model detailed earlier to generate the embeddings for all binary functions obfuscated by tigress in Dataset-IV. Afterwards, we search for every original binary function (compiled with O0) to find its similar counterpart (obfuscated by tigress) in Dataset-IV. We summarize the results in Table 4.

Add Opaque: BinFinder achieves its highest Recall over zlib (89%) and 40% over OpenSSL, while achieving a low Recall over Coreutils (12%). To get more insight into our results, we manually inspect our selected features over the obfuscated functions. We find that functions obfuscated by Add Opaque have more callees and unique_vex instructions than their similar non-obfuscated ones. For example, in the case of Coreutils, we observe that the P(0) of num_unique_callees is 0.13 and the P(0) of unique_vex is 0.1. We see that our selected features (unique_vex, num_unique_callees) are
Binary Function Clone Search in the Presence of Code Obfuscation and Optimization over Multi-CPU Architectures ASIA CCS ’23, July 10–14, 2023, Melbourne, VIC, Australia
                    openssl                  zlib                     Coreutils                Avg
                    top-1       top-10       top-1       top-10       top-1       top-10       top-1       top-10
                    O0    O3    O0    O3     O0    O3    O0    O3     O0    O3    O0    O3     O0    O3    O0    O3
Add Opaque          0.4   0.58  0.56  0.77   0.89  0.44  1     0.59   0.13  0.17  0.2   0.27   0.47  0.4   0.59  0.54
EncodeArithmetic    0.42  0.52  0.772 0.78   0.54  0.35  0.84  0.76   0.71  0.43  0.91  0.81   0.56  0.43  0.84  0.78
EncodeLiterals      0.65  0.75  0.85  0.9    0.85  0.84  1     0.98   0.91  0.72  0.99  0.95   0.8   0.77  0.95  0.94
Flatten             0.08  0.45  0.14  0.63   0.19  0.75  0.43  0.96   0.21  0.66  0.29  0.84   0.16  0.62  0.29  0.81
Virtualization      0.18  0.12  0.33  0.17   0.3   0.04  0.55  0.33   0.33  0.25  0.56  0.4    0.27  0.14  0.48  0.3
Avg                 0.346 0.48  0.53  0.65   0.55  0.48  0.76  0.72   0.45  0.45  0.59  0.65   0.45  0.47  0.63  0.68
Table 4: Clone search between original binary functions and ones obfuscated by tigress
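The Recall and nDCG measures of Equations (3) and (4) can be sketched as follows (a minimal illustration; the ranked list and relevant set are toy stand-ins for BinFinder's cosine-ranked retrieval results):

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Eq. (3): fraction of the relevant functions found in the top-k results."""
    hits = sum(1 for f in ranked_ids[:k] if f in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Eq. (4): DCG of the top-k results normalized by the optimal DCG,
    where isSimilar(f, q) is 1 for relevant functions and 0 otherwise."""
    dcg = sum(1.0 / math.log(1 + i)
              for i, f in enumerate(ranked_ids[:k], start=1)
              if f in relevant_ids)
    # The optimal answering places all relevant functions first.
    ideal_hits = min(k, len(relevant_ids))
    optimal = sum(1.0 / math.log(1 + i) for i in range(1, ideal_hits + 1))
    return dcg / optimal if optimal else 0.0

ranked = ["f3", "f7", "f1", "f9"]   # retrieved functions, best first
relevant = {"f3", "f1"}             # functions actually similar to the query
print(recall_at_k(ranked, relevant, 2))   # 0.5: one of two relevant in top-2
print(ndcg_at_k(ranked, relevant, 4))
```

Because the relevant hit at rank 1 outweighs the one at rank 3, the nDCG here stays above 0.9 even though one relevant function is ranked late.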
We see that our selected features (unique_vex, num_unique_callees) are less resilient in the case of Add Opaque. Based on these observations, we can state that "junk calls" and instruction insertions weaken our Recall. Indeed, our model is sensitive to dissimilar functions that have closely similar features. Overall, BinFinder still achieves an average of 47% Recall at top-1 and 59% Recall at top-10.

Encode Arithmetic: BinFinder reports the highest Recall over Coreutils with 71%. Yet, it reports lower Recall over zlib and OpenSSL with 54% and 42%, respectively. These packages perform many mathematical operations, where more complex mathematical instructions could be added. From Table 5, we can observe that P(0) of unique_vex is 0.33, which indicates that several new instructions have been added by tigress. Overall, BinFinder still achieves an average of 56% Recall at top-1 and 84% Recall at top-10.

Encode Literals: This technique has the lowest impact on our selected features; it only replaces strings and integers. As seen in Table 5, all P(0) values calculated for OpenSSL are more than 0.7 for O0. On average, BinFinder still achieves 81% Recall at top-1 and 95% Recall at top-10.

Flatten: BinFinder reports the lowest performance against the Flatten technique. Flatten heavily affects almost all of our selected features, including unique_vex, constantsList, num_unique_callees, and num_unique_callers; these features' P(0) values are less than 0.5 on average. The num_libc_callees is the only feature resilient against Flatten; its P(0) equals 0.88. In our inspection, we find that functions obfuscated by Flatten have no callees: Flatten uses an indirect call mechanism to call the targeted functions. On average, BinFinder achieves 16% Recall at top-1 and 28% Recall at top-10.

Virtualization: This technique heavily affects our extracted features, including unique_Vex, constantsList, and num_unique_callees. As a result, their P(0) values are less than 0.5. However, Virtualization does not affect num_callers and LibcCalls; the P(0) values of these two features are greater than 0.8 over the three examined packages. On average, BinFinder achieves 27% Recall at top-1 and 48% Recall at top-10.

Compiler optimization effects on tigress. Based on our understanding of the behavior of tigress obfuscation techniques, they could add dead code, substitute mathematical operations with equivalent but more complex ones, etc. Compiler optimizations generally eliminate dead code such as unreachable instructions or unused functions. Therefore, in this section, our goal is to understand the impact of compiler optimizations on tigress obfuscation techniques.

To achieve our goal, we take the obfuscated source files generated by tigress and compile them with the highest compiler optimization level (O3) using gcc. We use the (M1) model explained earlier to generate the embeddings for all binary functions obfuscated by tigress. Then, we search for every original binary function (compiled with O3) to find its similar counterpart (obfuscated by tigress) in our repository, and we report the performance at top-1 and top-10.

Add Opaque: From Table 4, we can see that BinFinder achieves 58% Recall at top-1 and 77% Recall at top-10 over OpenSSL. Noticeably, the optimization O3 considerably reduces the impact of the Add Opaque technique on our approach. As a result, the Recall improved from 40% to 58% over OpenSSL. Using manual analysis, we find that the number of inserted instructions and "junk" callees is reduced. From Table 5, we can see that with respect to O0, P(0) of num_callees is 0.56; in contrast, for O3 it is 0.82. The optimization level O3 removes unnecessary function calls introduced by tigress.

Flatten: Table 4 shows that BinFinder performance increases from 16% to 62% on average. We find that, in contrast to O0, the value of num_callees for obfuscated functions over O3 is not persistently zero. Also, the value of P(0) for unique_Vex improved from 0.09 over O0 to 0.73 over O3, as outlined in Table 5.

The same holds for O3 against Encode Arithmetic and Encode Literals. We can see, in the case of OpenSSL, that BinFinder's Recall performance has increased by more than 10%, as shown in Table 4. For example, against Encode Arithmetic, BinFinder achieves 52% Recall at top-1 and 77% Recall at top-10. Moreover, over Encode Literals, BinFinder achieves 75% Recall at top-1 and 90% Recall at top-10. From our analysis, we conclude that, in some situations, O3 mitigates to some extent the effects of certain tigress code obfuscations (e.g., Flatten, Add Opaque). Conversely, in the case of zlib and Coreutils, we find that O3 optimization hardens some tigress obfuscations, namely Encode Arithmetic and Encode Literals, which results in a drop in the achieved Recall by approximately 20%.
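The per-feature P(0) statistics cited throughout this section can be sketched as follows, assuming (as in the paper's earlier feature analysis) that P(0) denotes the fraction of matched original/transformed function pairs whose feature value is unchanged; the feature values below are hypothetical:

```python
def p_zero(pairs):
    """Fraction of (original, transformed) value pairs with zero difference.
    A P(0) close to 1 means the feature is resilient to the transformation."""
    unchanged = sum(1 for a, b in pairs if a == b)
    return unchanged / len(pairs)

# Hypothetical feature values for three functions, before/after obfuscation.
features = {
    "num_libc_callees":   [(4, 4), (2, 2), (7, 7)],   # untouched -> P(0) = 1.0
    "num_unique_callees": [(3, 9), (1, 1), (5, 12)],  # junk callees inserted
}
for name, pairs in features.items():
    print(name, p_zero(pairs))
```

A transformation that inserts junk calls drives num_unique_callees apart while leaving num_libc_callees intact, which is exactly the resilience pattern the Flatten discussion above describes.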
ASIA CCS ’23, July 10–14, 2023, Melbourne, VIC, Australia Abdullah Qasem, Mourad Debbabi, Bernard Lebel, and Marthe Kassouf
            gmp          gnuBinutils  libcurl      openssl      zlib         Average       #Instructions P(0)   #Nodes P(0)
            gcc   clang  gcc   clang  gcc   clang  gcc   clang  gcc   clang  gcc   clang   gcc    clang         gcc   clang
O0 vs O1    0.712 0.809  0.718 NA     0.558 NA     0.803 0.901  0.819 0.823  0.722 0.844   0.03   0.23          0.36  0.44
O0 vs O2    0.710 0.723  0.704 NA     0.533 NA     0.792 0.877  0.834 0.603  0.715 0.734   0.041  0.24          0.38  0.46
O0 vs O3    0.716 0.730  0.952 NA     0.457 NA     0.776 0.881  0.648 0.559  0.710 0.723   0.046  0.244         0.38  0.46
O1 vs O2    0.924 0.887  0.948 NA     0.750 NA     0.847 0.898  0.953 0.757  0.885 0.848   0.049  0.365         0.44  0.53
O1 vs O3    0.863 0.876  0.970 NA     0.617 NA     0.825 0.893  0.688 0.820  0.792 0.863   0.047  0.364         0.45  0.53
O2 vs O3    0.904 0.974  1     NA     0.718 NA     0.847 0.921  0.775 0.964  0.849 0.953   0.259  0.492         0.50  0.55
Average     0.805 0.833  0.882 NA     0.606 NA     0.815 0.895  0.786 0.754  0.779 0.828   -      -             -     -
Table 6: Binary function clone search with various compiler optimizations: evaluation with precision at top-1
4.3 Compiler Optimization

In this section, we test BinFinder under different compiler optimization levels, investigating the impact of all pairwise combinations of O0–O3 on Dataset-II. For each binary function embedding (q_i) that results, for example, from the configuration (gcc, O0, x86), we search for its similar binary functions from the configurations (gcc, O3, x86) or (gcc, O3, ARM), and vice versa. Then, precision is calculated at top-1 over the retrieved results using pairwise cosine distance. Finally, we calculate the average between the two resulting values. From Table 6, we see that the precision at top-1 over the clang compiler is better compared to gcc: on average, clang precision is 82% while gcc precision is 77%. This observation is explained by the effect of compiler optimization pairs on the number of generated instructions and basic blocks (nodes). We see that gcc P(0) values are lower than clang P(0) values; these differences indicate that gcc modifies the generated functions more than clang does. Moreover, we can see that the precision at top-1 fluctuates at the package level among the different optimization options. For example, the precision at top-1 for the gmp library is significantly different from that for OpenSSL over both gcc and clang. One more observation from Table 6 is that binary function clone search under the (O0, O2) and (O0, O3) setups is more challenging compared to other optimization search pairs over both compilers. Based on Table 1, the O2 and O3 mean absolute differences over all selected features are higher compared to other optimizations. The optimization O3 implicitly includes all O2 optimizations. This indicates that the O2 and O3 optimization options modify their related binary functions to a larger degree than the other compiler optimization options.

5 SEARCHING AGAINST ALL BINARIES

We evaluate BinFinder when both code obfuscations and compiler optimizations are applied in the following three scenarios: on a single CPU architecture (x86), on multi-CPU architectures (x86, ARM) when only compiler optimization levels are applied, and on multi-CPU architectures (x86, ARM) when both compiler optimizations and code obfuscation techniques are applied. Given a binary function q_i, we generate its embedding using the trained BinFinder model. Then, we search for all its similar functions in our repository using pairwise cosine similarity. Afterwards, we sort the retrieved results and, based on the top k candidate functions with the highest cosine similarity scores, we calculate Precision, nDCG, and Recall. Our evaluation is depicted in Figure 4.

Single CPU Architecture (x86). In this scenario, BinFinder is trained and tested over Dataset-I. On average, each binary function has 10 similar functions produced from the same source code but compiled with two compilers (gcc, clang), each with O0–O3 optimizations and three different obfuscation techniques using O-LLVM. For each query binary function q_i in Dataset-I, we look up all its similar functions among 116508 functions and record the BinFinder performance. From Figure 4(a), we see that BinFinder precision is above 80% for k ∈ [1-5]; furthermore, it is above 70% for k ∈ [6-10] and above 50% for k ∈ [10-20]. We also observe from Figure 4(b) that BinFinder nDCG values are above 80% for k ∈ [1-7] and above 70% for k ∈ [8-20]. These observations convey that similar functions appear among the first instances of the retrieved candidate functions from the repository. Besides, from Figure 4(c), we can see that at k=30 BinFinder Recall is 72%, at k=50 Recall is 80%, and at k=100 Recall is 85%. The higher the k we select, the higher the Recall, as depicted in Figure 4(c); at k=200, we have a Recall of 90%.

Multi-CPU Architectures with Compiler Optimization. In this scenario, we evaluate BinFinder over Dataset-II, which contains samples for two CPU architectures (x86 and ARM). On average, every query function q_i in Dataset-II has at least 15 similar functions and 203465 dissimilar functions in Dataset-II. Some selected packages have more than one version; therefore, the number of similar functions could double. From Figure 4(a), we see that BinFinder precision is 90% when k=1, above 80% for k ∈ [1-5], above 70% for k ∈ [6-12], and above 60% for k ∈ [13-17]. We also observe from Figure 4(b) that BinFinder nDCG values are above 80% for k ∈ [1-10] and above 70% for k ∈ [11-20]. Besides, from Figure 4(c), we can see that at k=30 BinFinder Recall is 60%, while at k=50 it is 70%. Moreover, BinFinder Recall is 80% when k=100 and 85% when k=200.

Multi-CPU Architectures with Code Obfuscation and Optimization. In this scenario, BinFinder is trained and tested over Dataset-III. On average, every query q_i has 21 similar functions and 284470 dissimilar functions in Dataset-III. In Figure 4(a), we see that BinFinder precision is 86% for k=1, above 80% for k ∈ [1-3], above 70% for k ∈ [4-10], and above 60% for k ∈ [13-19]. In Figure 4(b), we observe that BinFinder nDCG values are above 80% for k ∈ [1-5] and above 70% for k ∈ [6-20]. Besides, from Figure 4(c), we see that at k=50 Recall is 60%, at k=100 it is 70%, and it is 76% when k=200. Based on the results reported in Figures 4(a), 4(b), and 4(c), we see that binary function clone search on a single architecture is less challenging compared to multi-CPU architectures. Based on the Recall values at k=50, BinFinder achieves 80% on a single architecture, 70% on multi-CPU architectures when only different optimization levels are applied, and 60% when both optimization levels and obfuscations are applied on multi-CPU architectures.
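The retrieval step described above, embedding a query function and ranking repository functions by pairwise cosine similarity, can be sketched as follows (the toy embeddings and function names are illustrative, not actual BinFinder outputs):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_k_clones(query_emb, repo, k):
    """Rank repository functions by cosine similarity to the query embedding."""
    ranked = sorted(repo, key=lambda fid: cosine(query_emb, repo[fid]),
                    reverse=True)
    return ranked[:k]

# Toy 3-d embeddings standing in for the model's output vectors.
repo = {
    "zlib_inflate_O3":  [0.9, 0.1, 0.2],
    "zlib_inflate_obf": [0.8, 0.2, 0.1],
    "ssl_read_O0":      [0.1, 0.9, 0.7],
}
query = [0.85, 0.15, 0.15]             # embedding of a zlib_inflate variant
print(top_k_clones(query, repo, k=2))  # the two inflate variants rank first
```

Precision, nDCG, and Recall are then computed over this sorted top-k list, exactly as in the scenarios above.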
Figure 4: Function clone search over multi-CPU architectures in the presence of code obfuscations and optimizations. Panels: (a) Precision, (b) nDCG, (c) Recall, each plotted against the number of nearest results for three scenarios: x86 only, multi-CPU (optimization only), and multi-CPU (optimization & obfuscation).
The reason behind this drop in Recall values in the multi-CPU architecture scenarios is that the num_constants, libcCalls, Callers, and Callees features are more affected on multi-CPU architectures compared to the single-architecture scenario (x86), as shown in Table 1. Overall, our reported figures show that BinFinder is efficient over different CPU architecture configurations.

6 SEARCHING FOR VULNERABILITIES

In this section, we evaluate BinFinder with respect to identifying well-known vulnerable binary functions. We have collected reported vulnerable functions from the CVE for different versions of OpenSSL, zlib, glibc, and libcurl. In total, we have 78 unique vulnerable functions, including the Heartbleed vulnerability. Our dataset has been cross-compiled with two compilers, namely gcc and clang, using different compiler optimizations and O-LLVM code obfuscations. In total, our dataset consists of 1198 vulnerable functions. We take each vulnerable binary function and generate its related embeddings using the trained BinFinder models. Afterwards, we search for all its similar functions in our repositories. Finally, based on the k retrieved candidate functions with the highest cosine similarity score, we calculate Recall, Precision, and nDCG. Our evaluation results are depicted in Figure 5. From Figure 5(a), we can see that BinFinder Precision is 98% over Dataset-I, Dataset-II, and Dataset-III. This indicates that whenever we make a query, BinFinder returns a good result with very high precision. Besides, at k=25, we observe from Figure 5(c) that, over a single architecture, BinFinder can successfully retrieve 80% of all similar vulnerable binary functions in Dataset-I. On the other hand, BinFinder can successfully retrieve 62% of all similar vulnerable binary functions over multi-CPU architectures when different compiler optimizations are applied in Dataset-II. Moreover, it can successfully retrieve 55% of all similar vulnerable binary functions over multi-CPU architectures when different compiler optimizations and code obfuscations are applied in Dataset-III. At k=75, BinFinder is able to successfully retrieve more than 80% of all similar vulnerable binary functions from the different datasets.

7 COMPARISONS TO SIMILAR APPROACHES

Marcelli et al. [17] conducted an empirical study to evaluate binary function similarity approaches based on machine learning techniques. In this section, we follow the same procedure established by the aforementioned study to evaluate our approach. In this regard, we downloaded their datasets. Then, following their evaluation procedure, we create similar training and testing datasets using our implementation to extract our proposed features. Finally, we evaluate and compare BinFinder and report the obtained results.

7.1 Experimental Setup

Marcelli et al. [17] created two datasets (Dataset-A and Dataset-B) representing different challenges in binary function similarity: 1) different compilers and versions, 2) different optimization levels, and 3) different CPU architectures and bitness. The datasets are composed of several projects compiled with different compilers into three CPU architectures: x86, ARM, and MIPS. More details about the datasets are provided in Section B of the Appendix. The goal of Dataset-A is to train and test models, while the goal of Dataset-B is to validate the resulting models trained on Dataset-A on a miscellaneous and extensive group of binaries. The evaluation procedure comprises nine tasks, each evaluating one binary function similarity challenge. For each task, 50K positive and 50K negative pairs of binary functions are randomly selected. For the ranking test, 200 positive pairs and 20K negative pairs are randomly selected, where each positive pair has 100 negative pairs.
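The pair-construction procedure described above can be sketched as follows (a scaled-down, hypothetical illustration; `sample_pairs` and the dataset layout are stand-ins, not the original study's code):

```python
import random

def sample_pairs(variants, n_pos, n_neg, seed=0):
    """variants: {source_function: [binary variants of that function]}.
    Positive pairs: two binaries of the same source function.
    Negative pairs: binaries of two different source functions."""
    rng = random.Random(seed)
    ids = list(variants)
    pos = []
    while len(pos) < n_pos:
        src = rng.choice(ids)
        if len(variants[src]) >= 2:
            pos.append(tuple(rng.sample(variants[src], 2)))
    neg = []
    while len(neg) < n_neg:
        a, b = rng.sample(ids, 2)  # two distinct source functions
        neg.append((rng.choice(variants[a]), rng.choice(variants[b])))
    return pos, neg

# Toy repository: 50 source functions, each built as O0, O3, and obfuscated.
funcs = {f"src{i}": [f"src{i}_O0", f"src{i}_O3", f"src{i}_obf"]
         for i in range(50)}
pos, neg = sample_pairs(funcs, n_pos=100, n_neg=100)
print(len(pos), len(neg))  # 100 100
```

The real evaluation draws 50K/50K pairs per task and restricts the candidate builds according to each task's constraints.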
                                  AUC                                              XM
Approach                          XC    XC+XB  XA    XM    small  medium  large    MRR10  Recall@1
BinFinder                         0.98  0.97   0.98  0.98  0.98   0.98    0.93     0.8    0.73
GMN_OPC-200_e16                   0.86  0.85   0.86  0.86  0.89   0.82    0.79     0.53   0.45
GNN-s2v_GeminiNN_OPC-200_e5       0.78  0.81   0.82  0.81  0.84   0.77    0.79     0.36   0.28
SAFE_ASM-list_e5                  0.8   0.8    0.81  0.81  0.83   0.77    0.77     0.17   0.27
Zeek                              0.84  0.84   0.85  0.84  0.85   0.83    0.87     0.28   0.13
asm2vec                           0.62  0.81   0.74  0.69  0.63   0.7     0.78     0.12   0.07
Table 7: Comparison of state-of-the-art models with BinFinder on Dataset-A tasks
The XM task considers function pairs that come from arbitrary architectures, bitness, compilers, compiler versions, and optimizations. This task reflects comparisons across the whole dataset and is considered the most challenging task. The XA+XO task reflects Dataset-B, as it considers function pairs having dissimilar architectures, bitness, and optimizations but a similar compiler and compiler version. Section B in the Appendix provides more details about the generated tasks.

7.2 Results and Analysis

We compare BinFinder to similar approaches in the literature. We evaluate each approach using three metrics: AUC, MRR10, and Recall@K. Table 7 shows the performance of each examined approach on Dataset-A. It is worth mentioning that the selected model names represent the customized versions of their related approaches that report the best performance, as demonstrated in the empirical study [17]. In Table 7, we can observe that BinFinder outperforms GMN [13], Zeek [23], SAFE [18], asm2vec [3], and Gemini [27] in all tasks generated from Dataset-A. Also, to a small extent, BinFinder still outperforms CodeCMR [29] on the XM task, with 3% higher Recall when K=5, as depicted in Figure 6(a). Regarding the small, medium, and large tasks, which test the targeted models in the presence of various binary function sizes based on their number of basic blocks, we observe in Table 7 that BinFinder has promising performance over varying sizes of binary functions. BinFinder has the same performance over small and medium size functions, with 98% AUC. However, BinFinder shows lower performance over large binary functions, with 93% AUC. Another observation we can derive from Table 7 is that the majority of the models demonstrate quite similar performance when compared using AUC. However, they exhibit varying performance when compared using the ranking metrics (MRR10 and Recall@K), as shown in Figure 6.

We can see from Table 7 that, over the XM task, BinFinder outperforms all examined competing approaches. For example, with k=1, BinFinder reports 73% Recall: 28% higher than GMN [13], 46% higher than Gemini [27], and 55% higher than SAFE [18]. Moreover, when K=5, BinFinder reports 92% Recall while GMN reports 65%.

To validate the performance of BinFinder, we test it over the Dataset-B tasks: XO, XA, and XA+XO. Table 8 presents the performance measures obtained on Dataset-B. From Table 8, we see that BinFinder maintains similar performance in terms of AUC. However, concerning MRR10 and Recall@K over the XA+XO task, the performance is reduced by around 10%, as depicted in Figure 6(b). When compared to the GMN model [13], we find that BinFinder shows lower performance, with 5% lower Recall when K=5. However, when K=20, BinFinder draws a similar performance to GMN, as depicted in Figure 6(b). To gain more insight and provide a better understanding, we investigated the possible reasons behind this observation by examining both datasets in depth. We find that new LibcCalls and new VEX tokens are introduced in Dataset-B. For example, 127 unique LibcCalls appear in Dataset-A, while 136 appear in Dataset-B. Consequently, this could directly impact the number-of-libc-calls and number-of-callees features in our model. We believe that this is a limitation of BinFinder, which occurs because the model does not address unseen system calls. Note that GMN could address this limitation since it relies on CFG structures. However, the GMN approach is not practical in the presence of obfuscation techniques such as flattening (FLA), which significantly modifies the CFG of the targeted functions, as elaborated in Section 2.

7.3 Comparison to Obfuscation Approaches

We compare BinFinder to existing obfuscation-focused methods, asm2vec and BinMatch, using Dataset-V outlined in Section 3.4. From Figure 7 in the Appendix, we can see that BinFinder outperforms asm2vec³, achieving a recall rate of 79% at k=100 compared to asm2vec's 33%. It also reports higher precision at k=2, with 79% against asm2vec's 60%. Precision for BinFinder decreases gently as k increases, while asm2vec's drops sharply. Similar outcomes have been reported in previous studies [7, 9]. In a comparison against BinMatch, a multi-architecture approach, using the same set of packages BinMatch evaluated, BinFinder consistently reports higher average recall values. Specifically, it outperforms BinMatch by 9%, 8%, and 7% at top-1, top-5, and top-10 respectively, affirming BinFinder's superiority.

³ https://github.com/oalieno/asm2vec-pytorch
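The ranking metrics used throughout this comparison can be sketched as follows (a minimal illustration: MRR10 averages the reciprocal rank of the true match when it appears in the top 10, and Recall@K is the fraction of queries whose true match appears in the top K):

```python
def mrr_at_10(ranks):
    """ranks: 1-based rank of the true positive for each query
    (None if it was not retrieved at all)."""
    total = sum(1.0 / r for r in ranks if r is not None and r <= 10)
    return total / len(ranks)

def recall_at_k(ranks, k):
    """Fraction of queries whose true positive appears within the top k."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)

ranks = [1, 3, None, 2]       # true-match ranks for four hypothetical queries
print(mrr_at_10(ranks))       # (1 + 1/3 + 0 + 1/2) / 4
print(recall_at_k(ranks, 1))  # 0.25
```

This is why models with near-identical AUC can still separate sharply on MRR10 and Recall@K: the ranking metrics reward placing the single true match at the very top of 100 candidates, not merely scoring positives above negatives on average.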
Figure 6: A comparison of the Recall at different K values for the Dataset-A XM (left) and Dataset-B XA+XO (right) tasks. Both panels plot Recall@K against K (number of results) for BinFinder, GGSNN_OPC-200_e10, GMN_OPC-200_e16, GNN-s2v_GeminiNN_GeminiFeatures_e5, SAFE_ASM-list_e5, Trex, Zeek, and asm2vec_e10.
Column headers (rotated in the original layout): Statistical, Signature, Semantic, Dynamic, O-LLVM, Distance, Slicing, x86-64, tigress, Graph, Clang, MIPS, ARM, GCC, ICC, VS.

DiscovRE [4] • • • MCS, JD IDA • • • • • •
Genius [5] • • • LSH, JD IDA • • • • •
Gemini [27] • • • GNN IDA • • • •
asm2vec [3] • • • PV-DM IDA • • • •
SAFE [18] • • seq2seq angr, radare2 • • •
BinMatch [7] • • • • semantic angr • • • • •
GMN [13] • • GNN/GMN IDA • • • •
αDiff [14] • • • CNN IDA • •
Zeek [23] • • • MLP pyvex • • • • •
CodeCMR [29] • • • • encoder+GNN+LSTM IDA • • • •
Trex [21] • • • • transformer - • • • • • •
TIKNIB [11] • • Direct comparisons IDA • • • • • •
BinFinder • • MLP angr • • • • • • •
Table 9: A comparison of state-of-the-art related approaches. (•) means that the approach provides the corresponding feature; it is empty otherwise. (MCS) Maximum Common Subgraph Isomorphism, (JD) Jaccard Distance, (LSH) Locality Sensitive Hashing, (GNN) Graph Neural Network, (MLP) Multi-layer Perceptron Neural Network.
8 DISCUSSION

In this section, BinFinder is compared to state-of-the-art approaches considering various factors, including features, analysis, methodology, disassemblers, compilers, CPU architectures, and obfuscation. Table 9 summarizes these aspects for 13 approaches, most of which have been selected by Marcelli et al. [17] as a representative set of studies for binary function similarity. The comparison reveals BinFinder to be the sole approach addressing binary function similarity across different obfuscation techniques like O-LLVM and tigress. Other methods such as asm2vec [3], BinMatch [7], and Trex [21] explore O-LLVM obfuscation but limit their investigations to single or multiple architectures and require a dynamic analysis step.

We can also mention that Zeek [23] and BinFinder look similar, as both employ a Multi-layer Perceptron (MLP) neural network and use lifted VEX-IR instructions. However, they differ in their input features: Zeek extracts strands at the basic-block level to be used as input to an MLP model, whereas BinFinder uses unique_Vex_instructions in addition to six other features that are first engineered for representation and then passed as input to train the MLP network that generates the final embeddings. As a result, both approaches demonstrate different performance behaviors, as detailed in Table 7 and Table 8.

More recent methods, TIKNIB and αDiff, use features similar to BinFinder's but fall short in identifying other potent features and in building machine learning models that address binary function similarity across code transformations and multi-CPU architectures. BinFinder proves to be as efficient as leading approaches on multi-CPU architecture tasks; however, it faces limitations when encountering unseen libc calls. Future work will focus on refining the model using a more comprehensive dataset or considering an out-of-vocabulary solution, as in INNEREYE [30], to improve BinFinder's performance.

9 RELATED WORK

In this section, we review proposed state-of-the-art approaches over multiple CPU architectures. Further details are available in a related survey [1]. DiscovRE [4] uses multi-level filtering based on both numeric and structural features. Inspired by DiscovRE, Genius [5] uses statistical and structural features to create attributed control flow graphs.
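As a rough sketch of how an MLP maps a function's extracted feature vector to an embedding (the two-layer shape, layer sizes, and random weights here are illustrative assumptions, not BinFinder's actual architecture):

```python
import random

def mlp_embed(features, w1, b1, w2, b2):
    """Two-layer MLP: feature vector -> ReLU hidden layer -> embedding."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]

random.seed(0)
n_feat, n_hidden, n_emb = 7, 16, 8    # illustrative sizes only
w1 = [[random.gauss(0, 0.3) for _ in range(n_feat)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
w2 = [[random.gauss(0, 0.3) for _ in range(n_hidden)] for _ in range(n_emb)]
b2 = [0.0] * n_emb

# Hypothetical feature values for one function (e.g. counts of callees,
# callers, libc calls, unique VEX instructions, constants, ...).
func_features = [12, 3, 1, 5, 0, 2, 40]
emb = mlp_embed(func_features, w1, b1, w2, b2)
print(len(emb))  # 8
```

In a trained system the weights come from a similarity objective over positive and negative function pairs; clone search then reduces to cosine similarity between such embedding vectors.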
Figure 7: Evaluating BinFinder against state-of-the-art approaches, namely asm2vec and BinMatch. In this scenario, we take a given function f_i and search for its similar ones within a large set of functions. Panels: (a) Precision, (b) nDCG, (c) Recall, each plotted against the number of nearest results.
Figure 8: Results of similarity testing (ROC curves, true positive rate vs. false positive rate, for the x86-only and x86+ARM settings).

The first three tasks consider cases limited to a single CPU architecture, namely: 1) XO: considers function pairs having dissimilar optimizations but the same compiler, compiler version, and architecture; 2) XC: considers function pairs having dissimilar compilers, compiler versions, and optimizations but similar architecture and bitness; 3) XC+XB: considers function pairs having dissimilar compilers, compiler versions, optimizations, and bitness but similar architecture; 4) XA: considers function pairs having dissimilar architectures and bitness but similar compiler, compiler version, and optimizations; this task represents binary function similarity over firmware images, which are cross-compiled using a single compiler with different optimization levels; 5) XA+XO: this task reflects Dataset-B; it considers function pairs having dissimilar architectures, bitness, and optimizations but similar compiler and compiler versions; 6) XM: considers function pairs that come from arbitrary architectures, bitness, compilers, compiler versions, and optimizations; this task reflects comparisons across the whole dataset and is considered the most difficult task; 7) XM-S: a sub-task of XM that considers only small functions with fewer than 20 basic blocks; 8) XM-M: considers medium-size functions with more than 20 and fewer than 100 basic blocks; 9) XM-L: considers large functions with more than 100 basic blocks.
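Each task above amounts to a filter over which build attributes of a candidate function pair must differ and which must match; a sketch (the attribute names are illustrative):

```python
def matches_task(a, b, must_differ, must_match):
    """a, b: dicts describing how each binary function was built."""
    return (all(a[k] != b[k] for k in must_differ) and
            all(a[k] == b[k] for k in must_match))

# XO: different optimization; same compiler, version, architecture, bitness.
XO = (("opt",), ("compiler", "version", "arch", "bitness"))
# XA: different architecture/bitness; same compiler, version, optimization.
XA = (("arch", "bitness"), ("compiler", "version", "opt"))

f1 = {"compiler": "gcc", "version": "9", "arch": "x86", "bitness": 64, "opt": "O0"}
f2 = {"compiler": "gcc", "version": "9", "arch": "x86", "bitness": 64, "opt": "O3"}
print(matches_task(f1, f2, *XO))  # True: only the optimization level differs
print(matches_task(f1, f2, *XA))  # False: the architectures are identical
```

XM places no constraint at all, which is why it reflects comparisons across the whole dataset; XM-S/M/L further restrict the pair's function size in basic blocks.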