Reviving Sequential Program Birthmarking for Multithreaded Software Plagiarism Detection
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSE.2017.2688383, IEEE Transactions on Software Engineering
Abstract—As multithreaded programs become increasingly popular, plagiarism of multithreaded programs starts to plague the
software industry. Although there has been tremendous progress on software plagiarism detection technology, existing dynamic
birthmark approaches are applicable only to sequential programs, due to the fact that thread scheduling nondeterminism severely
perturbs birthmark generation and comparison. We propose a framework called TOB (Thread-oblivious dynamic Birthmark) that
revives existing techniques so they can be applied to detect plagiarism of multithreaded programs. This is achieved by thread-
oblivious algorithms that shield the influence of thread schedules on executions. We have implemented a set of tools collectively
called TOB-PD (TOB based Plagiarism Detection tool) by applying TOB to three existing representative dynamic birthmarks,
including SCSSB (System Call Short Sequence Birthmark), DYKIS (DYnamic Key Instruction Sequence birthmark) and JB (an
API based birthmark for Java). Our experiments conducted on a large number of binary programs show that our approach exhibits
strong resilience against state-of-the-art semantics-preserving code obfuscation techniques. Comparisons against the three
existing tools SCSSB, DYKIS and JB show that the new framework is effective for plagiarism detection of multithreaded programs.
The tools, the benchmarks and the experimental results are all publicly available.
Index Terms—software plagiarism detection, multithreaded program, software birthmark, thread-oblivious birthmark
0098-5589 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
IEEE Transactions on Software Engineering (year:2018)
that discusses the impact of thread scheduling on birthmark based software plagiarism detection, and proposes a solution to remedy the problem.
• We apply the var-gram [23] algorithm in birthmark generation. As far as we know, this is the first time this algorithm is used for such a purpose. Our experiments confirm its effectiveness.
• We have implemented a set of tools collectively called TOB-PD (TOB based Plagiarism Detection tool) by integrating the principle of TOB with existing algorithms, including SCSSB [15], DYKIS [17] and JB [18]. The tools as well as the source code are publicly available at the website: http://labs.xjtudlc.com/labs/wlaq/TAB-PD/site.
• Our experiments on 418 versions of 35 different multithreaded programs show that the new tools are highly effective in detecting plagiarism and are resilient to most state-of-the-art semantics-preserving obfuscation techniques implemented in tools such as SandMark [4], DashO [24] and UPX. All benchmarks and the experimental data can also be downloaded from our website.

The remainder of the paper is organized as follows. Necessary background on software birthmarks is described in Section 2. In Section 3 we present the TOB framework that revives existing birthmarks. In Section 4 the approaches for comparing the TOB-revived birthmarks are discussed. The system design and implementation details are described in Section 5. Section 6 presents the empirical study, including the evaluation on resilience and credibility of the thread-oblivious birthmarks, the comparison with traditional SCSSB and the integration of TOB with DYKIS and JB. In Section 7 we discuss threats to the validity of our approaches. Section 8 reviews related work and finally we conclude the paper in Section 9.

2 PRELIMINARIES

2.1 Birthmark based Plagiarism Detection

A software birthmark, whose classical definitions are as depicted in Definition 1 and Definition 2, is a set of characteristics extracted from a program statically or dynamically, that reflects intrinsic properties of the program and that can be used to identify the program uniquely.

Definition 1: Software Birthmark [8]. Let p be a program and f be a method for extracting a set of characteristics from p. We say f(p) is a birthmark of p if and only if both of the following conditions are satisfied:
- f(p) is obtained only from p itself.
- Program q is a copy of p ⇒ f(p) = f(q).

Definition 2: Dynamic Software Birthmark [25]. Let p be a program and I be an input to p. Let f(p, I) be a set of characteristics extracted from p when executing p with I. We say f(p, I) is a dynamic birthmark of p if and only if both of the following conditions are satisfied:
- f(p, I) is obtained only from p itself when executing p with input I.
- Program q is a copy of p ⇒ f(p, I) = f(q, I).

Based on the two conceptual descriptions, various implementable birthmarks have been developed by mining characteristics from different aspects of the program. Representative dynamic birthmarks include SCSSB [15], extracted from system calls; DYKIS [13], extracted from executed instructions; and JB [18], extracted from executed Java APIs.

The two conceptual definitions of software birthmarks require that if two programs are in a copy relation, their birthmarks should be exactly the same. But due to many practical issues in implementing specific birthmarks, even if q is a copy of p, their corresponding birthmarks are not identical. Thus in the literature of software birthmarking, the plagiarism of two programs is decided by a threshold ε and a function sim that computes the similarity score between their birthmarks. The range of a similarity score is between 0 and 1. Although a value of 0.25 was typically used as the threshold in previous studies, other values were used as well and we found that the choice was quite arbitrary. Thus in our work, we do not set ε to a particular value. Instead we analyze its impact on the performance under a wide range of values. Let p and pB be the plaintiff program and its birthmark, and q and qB be the defendant program and its birthmark. The plagiarism is decided with Equation 1, which gives a conceptual definition of sim that returns a three-value result: positive, negative or inconclusive.

\[
sim(p_B, q_B) =
\begin{cases}
> 1 - \varepsilon & \text{positive: } q \text{ is a copy of } p \\
< \varepsilon & \text{negative: } q \text{ is not a copy of } p \\
\text{otherwise} & \text{inconclusive}
\end{cases}
\tag{1}
\]

A high-quality birthmark keeps the ratio of incorrect classifications low for a given ε. However, false negatives are less tolerable than false positives, since birthmarking is not a proving technique but rather a technique for detecting suspected copies [14], [15], [17], [26]. Generally in the literature, the following two properties of a birthmark should be satisfied to make it valid. We refer to the definitions [17] restated from the original descriptions of Myles [27] and Choi [19].

Property 1: Resilience. Let p be a program and q be a copy of p generated by applying semantics-preserving code transformations τ. A birthmark is resilient to τ if sim(pB, qB) > 1 − ε.

Property 2: Credibility. Let p and q be independently developed programs that may accomplish the same task. A birthmark is credible if it can differentiate the two programs, that is sim(pB, qB) < ε.
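The three-valued decision of Equation 1 can be sketched in a few lines. This is a minimal illustration, not the TOB-PD implementation: the names `Verdict` and `decide` are hypothetical, and the check that ε is at most 0.5 reflects the constraint that the positive and negative regions must not overlap.

```python
from enum import Enum


class Verdict(Enum):
    """Three possible outcomes of a birthmark comparison (Equation 1)."""
    POSITIVE = "q is a copy of p"
    NEGATIVE = "q is not a copy of p"
    INCONCLUSIVE = "inconclusive"


def decide(similarity: float, epsilon: float) -> Verdict:
    """Map a similarity score in [0, 1] to a verdict for threshold epsilon.

    Hypothetical helper for illustration only.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    if not 0.0 <= epsilon <= 0.5:
        # With epsilon > 0.5 a score could be both > 1 - epsilon and < epsilon.
        raise ValueError("epsilon must lie in [0, 0.5]")
    if similarity > 1.0 - epsilon:
        return Verdict.POSITIVE
    if similarity < epsilon:
        return Verdict.NEGATIVE
    return Verdict.INCONCLUSIVE
```

With ε = 0.25, a score of 0.9 is classified as positive, 0.1 as negative, and 0.5 as inconclusive. Resilience (Property 1) then corresponds to copies landing in the positive region, and credibility (Property 2) to independent programs landing in the negative region.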
TABLE 2
Average of the similarity scores between the traces obtained under same inputs. For simplicity, we use SA and
SS to denote the TOB-revived birthmarks. The gray and white columns summarize the scores calculated
without and with trace refining, respectively.
[Figure 4 plot: two panels with y-axis "Containment Similarity Score"; x-axes "Number of Threads Specified" (1, 2, 4, 6, 8, 10, 15, 30, 50) and "Different Workloads" (Test, Simdev, Simsmall, Simmedium, Simlarge).]
Fig. 4. Validation of the hypothesis under different number of threads and different workloads. Since similar
results are observed, only Containment scores measured for SA birthmarks are given here to save space.
TABLE 3
Validation of the hypothesis under different scheduling policies. To save space, only results measured with SA
birthmarks are given here.
in detail. Note that although in this paper we focus on birthmarks in set format, birthmarks in other formats such as sequences or graphs can also utilize TOB to handle multithreaded programs. In the latter case appropriate algorithms for thread slice birthmark generation and comparison need to be developed.

3.2 Birthmark for Individual Threads

In order to shield the influence of non-deterministic scheduling, TOB annotates each event in an execution trace with its thread identifier. It then projects the trace on thread identifiers to obtain sub-traces, each of which belongs to a single thread. As a result, the birthmarks extracted from the sub-traces can remain the same even under different thread schedules.

Formally, let an execution trace of program p under input I be trace(p, I) = ⟨e1, e2, ..., en⟩. Each recorded event ei is an instruction, a system call or an API call, along with its thread identifier, denoted ei.tid. We define its projection on
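As an informal illustration of the projection idea described in this subsection, the sketch below splits a tagged trace into per-thread sub-traces. It is not the paper's formal definition, which is not reproduced here: the event encoding as `(tid, name)` tuples, the function name `project_by_thread`, and the sample system-call names are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

# Hypothetical event encoding: (thread identifier, event name).
Event = Tuple[Hashable, str]


def project_by_thread(trace: Iterable[Event]) -> Dict[Hashable, List[str]]:
    """Split one interleaved trace into per-thread sub-traces.

    Events keep their relative order within each thread, so the result
    does not depend on how the scheduler interleaved the threads.
    """
    sub_traces: Dict[Hashable, List[str]] = defaultdict(list)
    for tid, name in trace:
        sub_traces[tid].append(name)
    return dict(sub_traces)


# Two different interleavings of the same two threads (made-up events).
schedule_a = [(1, "open"), (2, "socket"), (1, "read"), (2, "send"), (1, "close")]
schedule_b = [(2, "socket"), (2, "send"), (1, "open"), (1, "read"), (1, "close")]

# The interleaved traces differ, but their per-thread projections agree.
same_projection = project_by_thread(schedule_a) == project_by_thread(schedule_b)
```

A birthmark computed over the whole interleaved sequence would differ between the two schedules, whereas one computed per sub-trace would not, which is the effect the projection is meant to achieve.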
[Figure: system overview with components, left to right: Plaintiff and Defendant, Tracer, Pre-Processor, Birthmark Generator, Similarity Calculator, Decision Maker.]
TABLE 4
Benchmark programs

                                          #Ver
Name           Size(kb)  Version        Total  S1  S2  S3  S4
pigz           294       2.3            23     19  -   1   2
lbzip          113.3     2.1            1      -   -   -   -
lrzip          219.2     0.608          1      -   -   -   -
pbzip2         67.4      1.1.6          1      -   -   -   -
plzip          51        0.7            1      -   -   -   -
rar            511.8     5.0            1      -   -   -   -
cmus           271.6     2.4.3          1      -   -   -   -
mocp           384       2.5.0          1      -   -   -   -
mp3blaster     265.8     3.2.5          1      -   -   -   -
mplayer        4,300     r34540         1      -   -   -   -
sox            55.2      14.3.2         1      -   -   -   -
arora          1,331     0.11           1      -   -   -   -
chromium       80,588    28.0.1500.71   1      -   -   -   -
dillo          610.9     3.0.2          1      -   -   -   -
Dooble         364.4     0.07           1      -   -   -   -
epiphany       810.9     3.4.1          1      -   -   -   -
firefox        59,904    24.0           1      -   -   -   -
konqueror      920.1     4.8.5          1      -   -   -   -
luakit         153.4     d83cc7e        1      -   -   -   -
midori         347.6     0.4.3          1      -   -   -   -
seaMonkey      760.9     2.21           1      -   -   -   -
Daisy          201.9     SIR            37     -   35  1   -
Elevator       92.1      SIR            44     -   42  1   -
Groovy         59.5      SIR            44     -   42  1   -
Pool           205.7     SIR            30     -   28  1   -
blackscholes   23        Parsec3.0      23     19  -   1   2
bodytrack      3,368     Parsec3.0      13     9   -   1   2
fluidanimate   126.6     Parsec3.0      23     19  -   1   2
canneal        414.7     Parsec3.0      23     19  -   1   2
dedup          388       Parsec3.0      23     19  -   1   2
ferret         2,150     Parsec3.0      23     19  -   1   2
freqmine       227.6     Parsec3.0      23     19  -   1   2
streamcluster  103       Parsec3.0      23     19  -   1   2
swaptions      94        Parsec3.0      23     19  -   1   2
x264           896.3     Parsec3.0      23     19  -   1   2
Fig. 9. Similarity distribution graph for birthmarks of the copies generated with different compilers and
optimization levels
programs. Three types of widely used multithreaded Linux applications are selected as our experimental subjects, including 6 compression programs (lbzip2, lrzip, pbzip2, pigz, plzip and rar), 10 web browsers (arora, chromium, dillo, Dooble, luakit, midori, epiphany, firefox, konqueror and seaMonkey), and 5 audio players (cmus, mocp, mp3blaster, mplayer and sox).

Firstly, we validate whether the TOB-revived birthmarks can distinguish programs in different categories. Similarity scores between the 6 compression programs and the 5 audio players are computed. According to the experimental results, the majority of the scores are below 10%. These data indicate that thread-oblivious birthmarks have strong credibility in distinguishing independently developed programs.

Distinguishing programs in the same category is more challenging because they may overlap greatly in their functionality. Figure 10 depicts the similarity score distribution for the ten web browsers. It can be observed that about 90% of the scores are below 30%. Also, as illustrated by Columns Avg in Table 6, the average scores are all around 0.1. Similar results are observed between the compression programs and between the audio players.

There are several similarity scores above 40%. This is because some of the browsers share the same layout engine. Our manual inspection discovers that five of the browsers (arora, Dooble, epiphany, luakit and midori) are all Webkit-based. In order to observe the effect of overlapped functionality on thread-oblivious birthmarks, we give the average similarity scores between these Webkit-based browsers in Column Avg+, and the average similarity scores between Webkit-based and non-Webkit-based browsers in Column Avg-. It can be observed that the values in Column Avg+ are about 2 to 5 times greater than the values in Column Avg-. Since the goal is to detect whole program plagiarism, we believe the experimental results show strong credibility for real-world applications where certain libraries are shared. If there exist trivial programs that simply call the same third-party functions, it is hard to give a conclusive judgment even with manual examination.

6.3 Comparison with Traditional Birthmarks

This section compares the overall performance of the TOB-revived SCSSBs against the original SCSSB. We utilize the three evaluation metrics adopted in [17], including URC, F-Measure and MCC. URC measures resilience and credibility, while the other two are more comprehensive metrics introduced to address the limitation that URC focuses only on the rate of correct classifications. All the comparison pairs of programs from Section 6.1 to Section 6.2 are taken as the experimental subjects.

6.3.1 Performance Evaluation with Respect to URC

Resilience and credibility reflect the quality of a birthmark from different aspects. URC (Union of Resilience and Credibility) [47], defined below, is a metric proposed for evaluating the overall performance of birthmarks that considers both aspects.

\[
URC = 2 \times \frac{R \times C}{R + C} \tag{2}
\]

In the definition, R represents the ratio of correctly classified pairs where there exists plagiarism and C represents the ratio of correctly classified pairs where there is no plagiarism. The value of URC ranges from 0 to 1, with a higher value indicating a better birthmark. Let EP be the set of pairs of programs such that ∀(p, q) ∈ EP, q is a copy of p, and JP be the set of pairs such that ∀(p, q) ∈ JP, a plagiarism detection method believes that q copies p. Similarly, let EI be the set of pairs such that ∀(p, q) ∈ EI, q and p are independent, and JI be the set of pairs that are deemed independent by a plagiarism detection method. R and C are formally defined as:

\[
R = \frac{|EP \cap JP|}{|EP|} \quad \text{and} \quad C = \frac{|EI \cap JI|}{|EI|} \tag{3}
\]

As indicated by Equation 1, the detection result depends on the value of threshold ε. Therefore in the experiments we vary the value of ε from 0 to 0.5. Note that ε cannot be greater than 0.5, otherwise plagiarism could be claimed to exist and not exist at the same time. Figure 11 shows the results. In each subfigure, the data for SCSSB as well as its TOB-revived SA and SS birthmarks are depicted by the lines marked with square, triangle and circle symbols, respectively.

It can be observed from the figures that the SA and SS birthmarks do not exhibit significant differences regardless of similarity metric. Meanwhile, both have greater URC values than SCSSB across the x-axis. It can also be observed that SCSSB's curves are closer to the curves of its TOB-revived versions when Ex-Cosine is adopted, indicating that similarity calculation using this metric is less sensitive to thread scheduling. Moreover, the curves of var-gram generated birthmarks are above the corresponding curves of k-gram generated birthmarks, especially for SCSSB. This indicates the superiority of applying the var-gram algorithm to birthmark generation.

6.3.2 Performance Evaluation with F-Measure and MCC

As explained in [17], URC mainly measures the rate of correct classifications, while inconclusiveness is considered an incorrect classification. Thus, URC gives better results with higher values of ε in Figure 11. As the value of ε increases, the chance of inconclusiveness becomes smaller, leading to fewer incorrect classifications.
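To make Equations 2 and 3 concrete, the following sketch computes R, C and URC from the four sets of program pairs. It is an illustration only: the function name `urc` and the toy pair labels are hypothetical, and returning 0 when R + C = 0 is an assumption, since the text does not specify that corner case.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (plaintiff, defendant); labels below are made up


def urc(ep: Set[Pair], jp: Set[Pair], ei: Set[Pair], ji: Set[Pair]) -> Tuple[float, float, float]:
    """Return (R, C, URC) following Equations 2 and 3.

    ep/ei: pairs that truly are copies / independent.
    jp/ji: pairs the detector judged as copies / independent.
    Inconclusive pairs appear in neither jp nor ji, so they lower R or C.
    """
    r = len(ep & jp) / len(ep)
    c = len(ei & ji) / len(ei)
    if r + c == 0:
        return r, c, 0.0  # assumption: define URC as 0 when both ratios are 0
    return r, c, 2 * r * c / (r + c)


# Toy data: four true copies (one left inconclusive), two independent pairs.
EP = {("p", "q1"), ("p", "q2"), ("p", "q3"), ("p", "q4")}
JP = {("p", "q1"), ("p", "q2"), ("p", "q3")}
EI = {("p", "r1"), ("p", "r2")}
JI = {("p", "r1"), ("p", "r2")}

R, C, URC = urc(EP, JP, EI, JI)  # R = 0.75, C = 1.0
```

Because URC is the harmonic mean of R and C, a detector cannot score well by being strong on only one of resilience or credibility.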
Fig. 10. Similarity distribution graph for birthmarks of the web browsers
TABLE 6
Credibility evaluation of the thread-oblivious birthmarks using software in the same category

                K-GRAM                                    | VAR-GRAM
                SA                   SS                   | SA                   SS
                Avg    Avg+   Avg-   Avg    Avg+   Avg-   | Avg    Avg+   Avg-   Avg    Avg+   Avg-
Ex-Cosine       0.137  0.334  0.078  0.133  0.314  0.079  | 0.075  0.156  0.041  0.078  0.167  0.041
Ex-Jaccard      0.090  0.213  0.046  0.072  0.163  0.034  | 0.068  0.128  0.042  0.068  0.132  0.040
Ex-Dice         0.134  0.322  0.078  0.103  0.238  0.056  | 0.100  0.189  0.069  0.092  0.173  0.060
Ex-Containment  0.166  0.364  0.111  0.135  0.281  0.087  | 0.128  0.208  0.097  0.106  0.186  0.068
Fig. 11. Performance evaluation with respect to URC. The left four figures depict the curves for birthmarks
generated with k-gram, the right four figures depict the curves for birthmarks generated with var-gram
TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively, which can be computed with the following formulas:

\[
TP = |EP \cap JP|, \quad FN = |EP \cap JI|, \quad FP = |EI \cap JP|, \quad TN = |EI \cap JI|
\]

Figure 12 depicts the experimental results with respect to F-Measure and MCC, respectively. Overall, similar results are observed as in the evaluation against URC. The TOB-revived birthmarks almost always outperform SCSSB across the whole x-axis.

More specifically, consider the curves summarizing the results with respect to F-Measure. The TOB-revived birthmarks outperform SCSSB mainly in the right region with relatively small thresholds. With relatively large thresholds the performance is very similar. Taking the upper-left figure, which depicts results for k-gram generated birthmarks calculated with Ex-Containment, as an example, SCSSB performs almost as well as its TOB-revived versions until the value of ε becomes smaller than 0.55. But its F-Measure value decreases sharply when adopting a smaller threshold.

To see the reason for the sharp decrease, we check the specific Precision, Recall and F-Measure values under different thresholds. According to the data, there is almost no difference between the Precision values of SCSSB and its TOB-revived versions under all thresholds. Also, the Recall value of SCSSB is almost identical to that of its TOB-revived versions, until it decreases sharply from threshold 0.5.

The reason for the sharp decrease in Recall is the following. Due to the combined impacts of obfuscations and thread interleavings, the SCSSB birthmarks of plagiarized pairs are greatly affected, which leads to low similarity scores. As the Recall values of SCSSB indicate, only about 58% of the scores are above 0.75, and only about 6% are above 0.9. But for the TOB-revived birthmarks, about 90% of the scores are above 0.9. Such results indicate that the thread-oblivious birthmarks are resilient to obfuscations and thread interleavings.

Figure 13 shows that SCSSB can differentiate plagiarized pairs and independent pairs even though the range of similarity scores is small. This explains the almost identical Precision values between SCSSB and its TOB-revived versions. In the birthmark based plagiarism detection literature, plagiarism is determined by the similarity score and a threshold. Unfortunately, without abundant real-world plagiarism samples, deciding a threshold value is an arbitrary decision. In the ideal but unrealistic case, we hope the similarity scores for plagiarized pairs are all 1.0, and for independent pairs are all 0. Thus, we believe the greater the difference between the two types of scores, the better a birthmark method is. As indicated by the boxplots in Figure 13, the TOB-revived birthmarks exhibit much more distinct similarity scores. In particular, the TOB-revived birthmarks achieve 100% detection accuracy at ε = 0.34, where neither false positives nor false negatives are observed.

6.3.3 Comparing the Birthmarks with AUC Analysis

As discussed above, the birthmark methods exhibit different performance under different thresholds. To give an intuitive comparison, we compute the AUC (Area Under the Curve) values for each method with respect to the URC, F-Measure and MCC metrics. The AUC value gives a proper overall performance summary for each birthmark method. A larger value of AUC indicates better birthmark quality. The results are summarized in the white areas of Table 7. It can be observed that the AUC values of the TOB-revived birthmarks are all larger than those of SCSSB.

We quantify the performance gain PerGain by taking the original SCSSB as the baseline. The quantification indicates the improvement of each thread-oblivious birthmark over the original birthmark, with respect to the same similarity metric and the same performance evaluation metric.

\[
PerGain = \frac{AUC_{tob} - AUC_{org}}{AUC_{org}} \times 100\%
\]

where AUCtob and AUCorg represent the AUC values of the thread-oblivious birthmark and the original birthmark, respectively. For example, the PerGain value with respect to Ex-Containment similarity and the URC metric for SCSSBSA generated with k-gram is:

\[
\frac{0.822 - 0.56}{0.56} \times 100\% = 47\%
\]

The average and maximal performance gains are summarized in the last row of Table 7. As the data indicate, TOB-revived birthmarks improve on the original birthmark. The maximum performance gains occur for the SS birthmarks generated with k-gram and calculated with the Ex-Jaccard similarity, where 129%, 46% and 94% improvements are obtained with respect to the URC, F-Measure and MCC metrics, respectively. Additionally, it can be observed that the AUC values based on var-gram are larger than those based on k-gram, indicating the superiority of applying the var-gram algorithm to birthmark generation.

6.3.4 Evaluation of the TraceSelector Optimization

As mentioned in Section 5.2, we perform an optimization that chooses the two most similar sequences from the plaintiff and defendant programs to reduce the randomness of thread interleaving. In this section, we evaluate the impact of this optimization.

To simulate the worst case caused by thread interleavings, the two least similar traces are selected from the executions of the plaintiff and defendant programs. The gray areas in Table 7 summarize the AUC values without optimization. By comparing with the values
Fig. 12. F-Measure and MCC curves for the birthmark methods. The upper-left four figures depict the F-Measure
curves for birthmarks generated with k-gram under each similarity metric, the upper-right four figures depict the
F-Measure curves for birthmarks generated with var-gram, the bottom-left four figures and the bottom-right four
figures similarly depict the MCC curves for birthmarks generated with k-gram and var-gram respectively.
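The F-Measure and MCC values plotted in Figure 12 are derived from the confusion counts TP, FN, FP and TN. The sketch below uses the standard textbook definitions of Precision, Recall, F-Measure and the Matthews correlation coefficient; it is assumed, not confirmed by this excerpt, that the evaluation uses exactly these forms. The function names and the zero-denominator conventions are choices made for the example.

```python
import math


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of Precision and Recall (standard definition)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0  # convention chosen here for the degenerate case
    return 2 * precision * recall / (precision + recall)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient (standard definition), in [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention chosen here when any marginal is empty
    return (tp * tn - fp * fn) / denom


# Toy counts: 3 copies detected, 1 copy missed, 2 independent pairs cleared.
example_f = f_measure(tp=3, fp=0, fn=1)   # precision 1.0, recall 0.75
example_mcc = mcc(tp=3, tn=2, fp=0, fn=1)
```

Unlike URC, both metrics take false positives and false negatives into account explicitly, which is why they are described as more comprehensive.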
[Figure 13 plot: three boxplot panels, each with y-axis "Similarity Score".]
Fig. 13. Boxplots that summarize the statistical distribution of similarity scores corresponding to the very upper-
left figure in Figure 12
TABLE 7
AUC analysis results

(a) With respect to URC evaluation metric
                 K-GRAM                                       VAR-GRAM
                 SCSSB_SA       SCSSB_SS       SCSSB          SCSSB_SA       SCSSB_SS       SCSSB
                 NoOpt  Opt     NoOpt  Opt     NoOpt  Opt     NoOpt  Opt     NoOpt  Opt     NoOpt  Opt
Ex-Containment   0.878  0.822   0.874  0.837   0.45   0.56    0.875  0.856   0.884  0.875   0.58   0.749
Ex-Cosine        0.883  0.834   0.873  0.835   0.787  0.799   0.895  0.884   0.897  0.884   0.624  0.821
Ex-Dice          0.885  0.845   0.88   0.857   0.426  0.543   0.889  0.879   0.895  0.889   0.538  0.739
Ex-Jaccard       0.878  0.864   0.88   0.872   0.236  0.38    0.903  0.899   0.909  0.903   0.462  0.726
PerGain(%)       -      59\127  -      61\129  -      -       -      16\24   -      17\24   -      -

                 K-GRAM                                       VAR-GRAM
                 SCSSB_SA       SCSSB_SS       SCSSB          SCSSB_SA       SCSSB_SS       SCSSB
                 NoOpt  Opt     NoOpt  Opt     NoOpt  Opt     NoOpt  Opt     NoOpt  Opt     NoOpt  Opt
Ex-Containment   0.978  0.98    0.976  0.979   0.72   0.782   0.993  0.994   0.994  0.995   0.778  0.918
Ex-Cosine        0.989  0.989   0.986  0.988   0.92   0.938   0.993  0.993   0.992  0.993   0.818  0.959
Ex-Dice          0.972  0.974   0.971  0.976   0.703  0.769   0.991  0.992   0.992  0.993   0.753  0.9
Ex-Jaccard       0.96   0.966   0.964  0.968   0.579  0.663   0.988  0.99    0.99   0.992   0.698  0.867
PerGain(%)       -      26\46   -      26\46   -      -       -      9\14    -      9\14    -      -

                 K-GRAM                                       VAR-GRAM
                 SCSSB_SA       SCSSB_SS       SCSSB          SCSSB_SA       SCSSB_SS       SCSSB
                 NoOpt  Opt     NoOpt  Opt     NoOpt  Opt     NoOpt  Opt     NoOpt  Opt     NoOpt  Opt
Ex-Containment   0.823  0.802   0.821  0.808   0.483  0.51    0.894  0.894   0.903  0.904   0.389  0.619
Ex-Cosine        0.889  0.876   0.88   0.874   0.734  0.744   0.892  0.894   0.893  0.893   0.468  0.774
Ex-Dice          0.795  0.782   0.793  0.788   0.461  0.496   0.879  0.883   0.894  0.896   0.371  0.586
Ex-Jaccard       0.728  0.73    0.737  0.739   0.338  0.381   0.868  0.877   0.884  0.885   0.318  0.5
PerGain(%)       -      56\92   -      57\94   -      -       -      47\75   -      48\77   -      -
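Table 7 reports AUC values both without and with an optimization that pre-selects the most similar pair of execution traces of the plaintiff and the defendant before birthmarks are generated. A minimal Python sketch of such a selection step is given below. The function names and the use of Jaccard similarity over event sets as the selection criterion are assumptions made for illustration only; this is not the TraceSelector implementation.

```python
from itertools import product


def event_set_jaccard(trace_a, trace_b):
    """Jaccard similarity of the sets of events occurring in two traces.

    Assumption: a trace is a sequence of event names (e.g. system calls).
    """
    set_a, set_b = set(trace_a), set(trace_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def select_most_similar_pair(plaintiff_traces, defendant_traces,
                             similarity=event_set_jaccard):
    """Return the (plaintiff_trace, defendant_trace) pair with the highest similarity."""
    if not plaintiff_traces or not defendant_traces:
        raise ValueError("each program needs at least one recorded trace")
    return max(product(plaintiff_traces, defendant_traces),
               key=lambda pair: similarity(*pair))


# Hypothetical traces from repeated runs of each program under the same input.
plaintiff_runs = [["open", "read", "write", "close"],
                  ["open", "read", "close"]]
defendant_runs = [["open", "mmap", "close"],
                  ["open", "read", "close", "exit"]]

best_pair = select_most_similar_pair(plaintiff_runs, defendant_runs)
print(best_pair)
```

Only the selected pair would then be passed on to birthmark generation, which raises the similarity scores of plagiarized and independent pairs alike.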
between the gray columns without optimization and the white columns with optimization, we can see that the performance of the original SCSSB always improves after the optimization. We use the following equation to quantify the improvement achieved by the optimization:

$$\mathit{OptGain} = \frac{AUC_{opt} - AUC_{noOpt}}{AUC_{noOpt}} \times 100\%$$

As shown by the OptGain values in Table 8, the performance gains achieved by the optimization are significant. We can conclude that the optimization alleviates, to a large extent, the problem SCSSB has when applied to multithreaded programs. Yet as indicated by the data in Table 7, it is still not adequate to handle the disturbance of thread interleavings. This also demonstrates the significant impact that thread interleaving can have on the traditional SCSSB. On the other hand, the improvement for the TOB-revived birthmarks is much less significant, indicating a much smaller impact of thread interleaving on thread-oblivious birthmarks. Besides, as shown by the negative values, the optimization can sometimes make the overall performance of the TOB-revived birthmarks worse.
The optimization, which pre-selects the more similar execution traces of the plaintiff and defendant, improves the similarity scores for both plagiarized pairs and independent pairs. This means that the optimization enhances resilience but weakens credibility, as larger similarity scores bring not only fewer false negatives but also more false positives. Consider the two bold values -6.4 and 24.4 in Table 8, which are the OptGain values for SCSSB_SA and SCSSB (generated with k-gram and measured with Ex-Containment similarity) with respect to URC. Table 9 gives their corresponding resilience (reflected by R in equation 3), credibility (reflected by C in equation 3) and URC values. The gray columns summarize the values without the optimization, and the white columns summarize the values with the optimization.
As the data show, the resilience of both SCSSB and SCSSB_SA is enhanced after the optimization, while the credibility of both birthmarks is weakened. However, for SCSSB the gain in resilience is much larger than the loss in credibility, leading to a significant increase in its URC values. For SCSSB_SA, on the other hand, the loss in credibility outweighs the gain in resilience, resulting in a minor decrease in its URC values. Similarly, the negative OptGain values in terms of MCC can also be explained, as the number of false positives the
TABLE 8
OptGain values for the TraceSelector optimization
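The two OptGain values discussed in the text (-6.4 and 24.4) can be reproduced from the URC-based AUC values in Table 7. The following Python sketch applies the OptGain equation to the k-gram, Ex-Containment entries for SCSSB_SA and SCSSB; the function and variable names are illustrative assumptions.

```python
def opt_gain(auc_opt, auc_no_opt):
    """Relative change of the AUC, in percent, caused by the optimization."""
    return (auc_opt - auc_no_opt) / auc_no_opt * 100.0


# AUC values w.r.t. URC (k-gram, Ex-Containment) copied from Table 7(a),
# stored as (without optimization, with optimization).
table7_urc_kgram_containment = {
    "SCSSB_SA": (0.878, 0.822),
    "SCSSB": (0.45, 0.56),
}

gains = {
    name: round(opt_gain(auc_opt, auc_no_opt), 1)
    for name, (auc_no_opt, auc_opt) in table7_urc_kgram_containment.items()
}
print(gains)  # {'SCSSB_SA': -6.4, 'SCSSB': 24.4}
```

A negative value means the optimization lowered the AUC of that birthmark, which is the case for SCSSB_SA here.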
optimization brings is larger than the number of false negatives it removes. Because birthmarking is a technique for detecting suspected plagiarism rather than proving it [14], [15], [17], false negatives are more critical than false positives. Thus, we believe the optimization is proper and necessary.

6.4 Performance of the Analysis
This section presents the analysis performance of SCSSB and its TOB-revived versions. These birthmark methods all involve two phases: the dynamic analysis phase that traces program executions (Phase I) and the static detection phase that extracts birthmarks and calculates similarities (Phase II).
For Phase I, our measurements indicate that on average a program becomes about 3 times slower once PIN and our tracer plugin are attached. The tracing overhead is observed to be smaller for larger inputs. This is because the overhead imposed by PIN's runtime environment alone (that is, running a program under PIN without any instrumentation or analysis) becomes trivial compared to the total runtime.
For Phase II, no significant difference in calculation overhead is observed between the methods. Specifically, for k-gram generated birthmarks, Phase II takes on average 54ms, 58ms and 67ms to process a trace pair for SCSSB_SA, SCSSB_SS and SCSSB, respectively. For var-gram generated birthmarks, the corresponding average times are 2.43s, 2.44s, and 2.43s, with the variable-pattern mining taking most of the time.

6.5 Applying TOB to Other Birthmarks
In this section, we further demonstrate the application of TOB to two other representative dynamic birthmarks. One is DYKIS [17], which is based on executed key instructions, and the other is JB [18], which is based on executed APIs of a Java program. For simplicity, we use the k-gram algorithm only. We use DYK_TR, DYK_SA, and DYK_SS to represent the original DYKIS and its TOB-revived versions using the SA and SS models, respectively. Similarly we have JB_TR, JB_SA, and JB_SS for JB.

6.5.1 Reviving DYKIS with TOB
For DYKIS [17], we use the 20 pigz binaries generated with different compilers and optimization levels as the experimental subjects. Figure 14 shows the distribution of the similarity scores calculated between the birthmarks of the 20 pigz binaries. As it shows, all the scores of the thread-oblivious birthmarks are above 70%, while for DYKIS quite a number of scores are below 70%. This indicates the effectiveness of applying the TOB framework to DYKIS.

6.5.2 Reviving JB with TOB
For JB [18], the 4 Java programs as well as their 149 single and deep obfuscated versions are used as the experimental subjects. Similarity scores are calculated between the original Java programs and their obfuscated versions. No significant differences are observed between the original JB and its TOB-revived versions. This is because JB is extracted from API call sequences at object level. Similar to the TOB framework, which slices traces by thread, JB essentially slices traces by Java object, which greatly mitigates the effect of thread scheduling. However, JB is only applicable to Java programs, while our TOB framework can be applied to transform existing dynamic birthmarks, including JB, into thread-oblivious ones. In addition, the larger average and minimum scores summarized in Table 10 show that the TOB-revived birthmarks indeed improve upon the original JB, although not significantly.

7 THREATS TO VALIDITY
Dynamic birthmarks are extracted from execution traces, so execution monitoring is necessary. The monitoring itself may affect the thread interleavings during the execution of a multithreaded program. Many other factors such as workload, scheduling policies, and runtime environments affect the interleavings as well. However, we do not believe these issues cause an unfair comparison against traditional birthmarks.
For a multithreaded program, it is possible that the effect of scheduling causes it to execute different
TABLE 9
Impact analysis of the optimization to birthmark credibility and resilience
Fig. 14. Similarity distribution graph for DYKIS and its TOB-revived thread-oblivious ones
TABLE 10
Effectiveness of applying the TOB framework to JB
                 JB_SA                   JB_SS                   JB_TR
                 Avg.   Max.   Min.      Avg.   Max.   Min.      Avg.   Max.   Min.
Ex-Cosine        0.983  1.000  0.554     0.983  1.000  0.602     0.983  1.000  0.555
Ex-Jaccard       0.970  1.000  0.530     0.979  1.000  0.618     0.963  1.000  0.470
Ex-Dice          0.976  1.000  0.503     0.978  1.000  0.527     0.964  1.000  0.453
Ex-Containment   0.986  1.000  0.545     0.985  1.000  0.596     0.978  1.000  0.508
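To illustrate why slicing traces by thread mitigates the effect of thread scheduling, the following Python sketch compares k-gram sets built from raw interleaved traces with k-gram sets built from per-thread slices. It is a simplified illustration under stated assumptions: the trace format (pairs of thread id and event), the union of per-thread k-gram sets, and the textbook set-based Jaccard, Dice and containment measures are all assumptions; the SA and SS models and the Ex- similarity variants used in the tables are defined elsewhere in the paper and are not reproduced here.

```python
from collections import defaultdict


def slice_by_thread(trace):
    """Project an interleaved trace of (thread_id, event) pairs onto per-thread event lists."""
    per_thread = defaultdict(list)
    for thread_id, event in trace:
        per_thread[thread_id].append(event)
    return dict(per_thread)


def k_grams(events, k):
    """Return the set of all contiguous length-k subsequences of events."""
    return {tuple(events[i:i + k]) for i in range(len(events) - k + 1)}


def thread_sliced_birthmark(trace, k):
    """Union of the k-gram sets of the individual thread slices (illustrative assumption)."""
    grams = set()
    for events in slice_by_thread(trace).values():
        grams |= k_grams(events, k)
    return grams


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0


def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 1.0


def containment(a, b):
    """Fraction of the elements of a that also occur in b."""
    return len(a & b) / len(a) if a else 1.0


# Two hypothetical runs of the same program: identical per-thread behavior,
# different interleaving chosen by the scheduler.
run1 = [(1, "open"), (2, "socket"), (1, "read"), (2, "send"), (1, "close"), (2, "close")]
run2 = [(2, "socket"), (2, "send"), (1, "open"), (1, "read"), (2, "close"), (1, "close")]

raw1 = k_grams([event for _, event in run1], 2)
raw2 = k_grams([event for _, event in run2], 2)
sliced1 = thread_sliced_birthmark(run1, 2)
sliced2 = thread_sliced_birthmark(run2, 2)

print(jaccard(raw1, raw2))        # low: the interleavings share only one 2-gram
print(jaccard(sliced1, sliced2))  # 1.0: the per-thread slices are identical
```

In this toy case the raw traces look dissimilar although the program is the same, while the thread-sliced k-gram sets coincide.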
paths across multiple runs even under the same input. In such cases, the execution traces of multiple runs can differ even after projection onto individual threads, which may cause our methods to fail. But as indicated by our experiments on various types of real multithreaded programs, the thread-oblivious birthmarks consistently perform well. Thus we believe such cases rarely happen in practical programs. Besides, as discussed in Section 5 and Section 6.3.4, we adopt an optimization that selects the two most similar traces from the plaintiff and defendant for further birthmark generation. Thus even if the mentioned cases happen, the optimization helps alleviate the problem.
The thread-oblivious birthmarks improve upon traditional birthmarks with the TOB framework. Thus they suffer from the same limitation as dynamic birthmarks in general: they cannot exhaustively cover all behaviors of a program. In the experiments, despite the large number of executions, the inputs still constitute only a small proportion of the whole input space. This is a fundamental challenge for all dynamic birthmarks. One way to alleviate the concern is to combine birthmarking with testing techniques. We leave this as future work.
Effectiveness of the TOB framework is mainly evaluated on whole program plagiarism detection, where a complete program is copied and then disguised through various automatic semantics-preserving transformations. One problem plagiarism detection research faces is the lack of real-world plagiarism cases [49], [50]. In recent years, whole program plagiarism on mobile markets has started to rise, and many of the stolen apps have been processed with obfuscation techniques to evade plagiarism detection. According to a recent study [51], about 5%-13% of apps in the third-party app markets are copied and
redistributed from the official Android market. This potentially provides rich real-world plagiarism cases. Yet, identifying potential pairs of apps between which plagiarism may exist is extremely labor-intensive. Also, tracing apps requires nontrivial effort because our tracers support the monitoring of binary executables and Java bytecode. Thus, we leave it as future work.
Besides whole program plagiarism, there exist many cases in which only a part or a library of a program is copied. The main problem of using dynamic birthmarks to detect partial plagiarism is that they are mainly based on the similarity of program executions. Therefore, if only a small portion of code is copied, these approaches give low similarity scores. Being built upon existing dynamic birthmarks, the thread-oblivious birthmarks suffer from the same problem. A straightforward solution is to instrument only the suspicious part, but this requires manual effort and domain knowledge.

8 RELATED WORK
Broadly speaking, the research areas related to our work include software watermarking [25], [52], which protects software copyrights and detects piracy, plagiarism detection, as well as code clone detection [53], [54] and malware identification [55], [56], which detect clones or maliciousness by characterizing software with features. In this section we focus on birthmark based software plagiarism detection techniques. Work targeting source code is not discussed here, as there are already many mature detection systems and tools [20], [21], [57].

8.1 Static birthmark based software plagiarism detection
Myles and Collberg [27] proposed k-gram based static birthmarks, where sets of Java bytecode sequences of length k are taken as the birthmarks. The similarity between two birthmarks was calculated through set operations that ignore the frequency of elements in the set. Although more robust than the birthmarks proposed by Tamada [8], these birthmarks were still vulnerable to code transformation attacks. Weighted k-gram based static birthmarks [47] improved upon Myles and Collberg's [27] by taking the frequency of each k-length operation code sequence into consideration. However, the improvement in detection ability seems minor, while extra cost is introduced for computing the change rate of k-gram frequencies. A static birthmark based on disassembled API calls from executables was put forward by Seokwoo et al. [19] to detect plagiarism of Windows applications. The requirement of de-obfuscating binaries before applying their method is too restrictive and thus reduces its applicability. Park [12] proposed a static birthmark by extracting all possible sequences of object instructions from the CFG of each method, and applied it to detecting common modules in Java packages. Yet this method suffered from high time consumption, since a mass of traces can be extracted if the CFG is complex. Hemel et al. [58] suggested three methods to find potential cloned binaries within a program repository by simply treating binaries as normal files. Specifically, the similarity between two binaries was evaluated by calculating the ratio of shared string literals, by calculating the compression ratio, and by computing binary deltas. Since no syntactic or semantic attributes of binary executables are considered, efficiency is assured but low detection accuracy is expected. Lim used control flow information that reflects runtime behaviors to supplement static approaches [59]. He later proposed analyzing stack flows obtained by simulating operand stack movements to detect copies [60]. But these methods are only applicable to Java programs. An obfuscation-resilient method based on the longest common subsequence of semantically equivalent basic blocks was proposed by Luo et al. [61]. They used symbolic execution to extract symbolic formulas from basic blocks, whose pair-wise equivalence is checked via a theorem prover. Being a static analysis method, its accuracy cannot be assured, since it has difficulty in handling indirect branches. In addition, symbolic execution combined with theorem proving is not scalable.
There is also some work focusing on detecting plagiarism of smartphone applications. DroidMOSS [3] detects plagiarism by applying fuzzy hashing to instruction sequences. Yet simple obfuscations, such as noise injection, can evade the detection of DroidMOSS, since no semantic information is used. DNADroid [62] achieves plagiarism detection by constructing and comparing program dependence graphs between methods. Since it considers data dependencies, this method is more robust. ViewDroid [51] proposes the feature view graph birthmark, which captures users' navigation behaviors. But it is vulnerable to dummy view insertion and encryption attacks.

8.2 Dynamic birthmark based software plagiarism detection
Myles and Collberg [7] suggested the whole program path (WPP) birthmark, generated by compressing a whole dynamic control flow trace into a directed acyclic graph that uniquely identifies a program. Even with compression the method does not scale, and it is susceptible to various loop transformations. Schuler [18] treated Java standard API call sequences at object level as dynamic birthmarks for Java programs. This approach exhibited better performance than the WPP birthmark, but the authors also pointed out that their method was affected by thread scheduling. A similar principle was applied in Tamada's work [63], where API calls of Windows executables executed at runtime were used to derive two kinds of birthmarks: Sequence of API Function Calls (EXESEQ) and Frequency of API Function Calls (EXEFREQ). Apparently
API based birthmarks are all language dependent. To address the problem, Wang et al. [15] proposed the System Call Short Sequence birthmark (SCSSB), which treats the sets of k-length system call sequences as birthmarks. However, as we illustrated, it has limited applicability to multithreaded programs.
Liu et al. [34], [49] suggested characterizing software with core values and applied this to software and algorithm plagiarism detection. Tian et al. [17], [64] proposed the DYKIS birthmark based on dynamic key instruction sequences. By introducing dynamic taint analysis into birthmark generation, these birthmark methods were resilient to various semantics-preserving code transformations. LoPD [65], a program logic based approach, was designed for software plagiarism detection by leveraging symbolic execution and weakest precondition reasoning to find semantic dissimilarities. Although these methods are resilient, they all suffer from scalability problems, since they all operate at instruction granularity, and both taint analysis and symbolic execution with constraint solving are computationally non-trivial.
By integrating data flow and control flow dependency analysis, Wang et al. [14] proposed a system call dependency graph based birthmark, and graph isomorphism is utilized for calculating the similarity between birthmarks. Patrick et al. [11] proposed a heap graph birthmark for JavaScript utilizing heap memory analysis, and a graph monomorphism algorithm was applied for similarity computation. But to be effective, these graph based birthmarks require the programs under protection to have prominent referencing structures. Also, since graph isomorphism and monomorphism algorithms are NP-complete in general, several thousand nodes will make the methods impractical to use.

9 CONCLUSION
As multithreaded software becomes increasingly popular, current dynamic software plagiarism detection technology geared toward sequential programs is no longer sufficient. This paper fills the gap by proposing a thread-oblivious software plagiarism detection framework (TOB) that revives existing dynamic software birthmarks. We have developed a set of tools collectively called TOB-PD by applying the TOB framework to three typical dynamic birthmarks, including SCSSB, DYKIS and JB. The extensive experiments conducted on 418 versions of 35 different programs show that the proposed approaches are not only accurate in detecting plagiarism of multithreaded programs but also robust against most state-of-the-art semantics-preserving obfuscation techniques. In addition, a suite of benchmarks of multithreaded programs has been compiled. We believe there will be more research on plagiarism detection for multithreaded programs. The existence of such benchmarks will be beneficial for researchers to conduct experiments and present their findings. The benchmarks, the TOB-PD tools as well as the experimental data are all available.
Our work addresses the challenges of applying dynamic birthmark based approaches to whole program plagiarism detection of multithreaded software. As far as we know, this is the first work that discusses the impact of thread scheduling on birthmark based plagiarism detection, and the first work that proposes thread-oblivious birthmarks for solving the problem systematically. In recent years, whole program plagiarism of mobile apps has become a serious problem. About 5% to 13% of apps in third-party app markets are copied and redistributed from the official Android market. We plan to conduct case studies and optimize our approaches for this domain.

ACKNOWLEDGEMENT
The research was supported in part by National Key Research and Development Program of China (2016YFB1000903), National Science Foundation of China under grants (91418205, 61472318, 61532015, 61532004, 61672419, 61632015, 61602369, 71501156, 61373116), Fok Ying-Tong Education Foundation (151067), Ministry of Education Innovation Research Team (IRT13035), Science and Technology Project in Shaanxi Province of China (2016KTZDGY04-01, 2016GY-092), and the Fundamental Research Funds for the Central Universities. Any opinions, findings, and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies. T. Liu is the corresponding author.

REFERENCES
[1] [Online]. Available: http://sourceauditor.com/blog/tag/lawsuits-on-open-source/.
[2] [Online]. Available: http://www.martinsuter.net/blog/2009/08/skype-joltid-licensing-dispute-epic-ma-screwup.html.
[3] W. Zhou, Y. Zhou, X. Jiang, and P. Ning, “Detecting repackaged smartphone applications in third-party android marketplaces,” in Proc. ACM Conf. Data and Application Security and Privacy (CODASPY ’12), 2012, pp. 317–326.
[4] C. Collberg, G. Myles, and A. Huntwork, “Sandmark-a tool for software protection research,” IEEE Security and Privacy, vol. 1, no. 4, pp. 40–49, 2003.
[5] Z. Wu, S. Gianvecchio, M. Xie, and H. Wang, “Mimimorphism: A new approach to binary code obfuscation,” in Proc. ACM Conf. Computer and Communications Security (CCS ’10). ACM, 2010, pp. 536–546.
[6] C. Linn and S. K. Debray, “Obfuscation of executable code to improve resistance to static disassembly,” in Proc. ACM Conf. Computer and Communications Security (CCS ’03), 2003, pp. 290–299.
[7] G. Myles and C. S. Collberg, “Detecting software theft via whole program path birthmarks,” in Proc. Int. Conf. Information Security (ISC ’04), 2004, pp. 404–415.
[8] H. Tamada, M. Nakamura, and A. Monden, “Design and evaluation of birthmarks for detecting theft of Java programs,” in IASTED Conf. on Software Engineering (IASTEDSE ’04), 2004, pp. 569–574.
[9] K. A. Roundy and B. P. Miller, “Binary-code obfuscations in prevalent packer tools,” ACM Computing Surveys, vol. 46, no. 4, 2013.
[10] M.-J. Kim, J.-Y. Lee, H.-Y. Chang, S. Cho, and P. A. Wilsey, “De- length grams,” in Proc. Int. Conf. Very Large Data Bases (VLDB
sign and Performance Evaluation of Binary Code Packing for ’07). VLDB Endowment, 2007, pp. 303–314.
Protecting Embedded Software against Reverse Engineering,” [33] Y.-C. Jhi, X. Wang, X. Jia, S. Zhu, P. Liu, and D. Wu, “Value-
in IEEE Int. Symp. Object/Component/Service-Oriented Real-Time based program characterization and its application to software
Distributed Computing (ISORC ’10), 2010, pp. 80–86. plagiarism detection,” in Proc. Int. Conf. Softw. Eng. (ICSE ’11),
[11] P. P. F. Chan, L. C. K. Hui, and S.-M. Yiu, “Heap graph based 2011, pp. 756–765.
software theft detection,” IEEE Trans. Information Forensics and [34] F. Zhang, Y. chan Jhi, D. Wu, P. Liu, and S. Zhu, “A first step
Security, vol. 8, no. 1, pp. 101–110, 2013. towards algorithm plagiarism detection,” in Proc. Int. Symp.
[12] H. Park, H. il Lim, S. Choi, and T. Han, “Detecting common Software Testing and Analysis (ISSTA ’12), 2012, pp. 111–121.
modules in Java packages based on static object trace birth- [35] Z. Tian, T. Liu, Q. Zheng, F. Tong, M. Fan, and Z. Yang,
mark,” Computer Journal, vol. 54, no. 1, pp. 108–124, 2011. “A new thread-aware birthmark for plagiarism detection of
[13] Z. Tian, Q. Zheng, T. Liu, and M. Fan, “DKISB: Dynamic multithreaded programs,” in Int. Conf. Software Engineering
key instruction sequence birthmark for software plagiarism Companion (ICSE ’16), 2016, pp. 734–736.
detection,” in IEEE Int. Conf. High Performance Computing and [36] Z. Tian, T. Liu, Q. Zheng, M. Fan, E. Zhuang, and Z. Yang, “Ex-
Communications. (HPCC ’13), 2013, pp. 619–627. ploiting thread-related system calls for plagiarism detection of
[14] X. Wang, Y.-C. Jhi, S. Zhu, and P. Liu, “Behavior based multithreaded programs,” Journal of Systems and Software, vol.
software theft detection,” in Proc. ACM Conf. Computer and 119, pp. 136–148, 2016.
Communications Security (CCS ’09). ACM, 2009, pp. 280–290. [37] D.-K. Chae, J. Ha, S.-W. Kim, B. Kang, and E. G. Im, “Software
[15] X. Wang, Y. Jhi, S. Zhu, and P. Liu, “Detecting software theft plagiarism detection: a graph-based approach,” in Pro. ACM
via system call based birthmarks,” in Annual Computer Security Int. Conf. Information and Knowledge Management (CIKM ’13).
Applications Conference (ACSAC ’09), 2009, pp. 149–158. ACM, 2013, pp. 1577–1580.
[16] X. Zhang and R. Gupta, “Whole execution traces,” in Proc. [38] K. Chen, P. Liu, and Y. Zhang, “Achieving accuracy and
Annual IEEE/ACM Int. Symp. Microarchitecture (MICRO ’04). scalability simultaneously in detecting application clones on
IEEE Computer Society, 2004, pp. 105–116. android markets,” in Proc. Int. Conf. Sof. Eng.(ICSE’14). New
[17] Z. Tian, Q. Zheng, T. Liu, M. Fan, E. Zhuang, and Z. Yang, York, NY, USA: ACM, 2014, pp. 175–186.
“Software plagiarism detection with birthmarks based on [39] K. Chen, P. Wang, L. Y., W. X., N. Zhang, H. H., Z. W.,
dynamic key instruction sequences,” IEEE Trans. Software En- and L. P., “Finding unknown malice in 10 seconds: Mass
gineering, vol. 41, no. 12, pp. 1217–1235, 2015. vetting for new threats at the google-play scale,” in USENIX
[18] D. Schuler, V. Dallmeier, and C. Lindig, “A dynamic birthmark Security Symposium (USENIX Security’15), Washington, D.C.,
for Java,” in Proc. IEEE/ACM Int. Conf. Automated Software USA, 2015, pp. 659–674.
Engineering (ASE ’07), 2007, pp. 274–283. [40] Y. Qu, X. Guan, Q. Zheng, T. Liu, L. Wang, Y. Hou, and
[19] S. Choi, H. Park, H. il Lim, and T. Han, “A static api birth- Z. Yang, “Exploring community structure of software call
mark for windows binary executables,” Journal of Systems and graph and its applications in class cohesion measurement,”
Software, vol. 82, no. 5, pp. 862–873, 2009. Journal of Systems and Software, vol. 108, pp. 193–210, 2015.
[20] C. Liu, C. Chen, J. Han, and P. S. Yu, “GPLAG: detection of [41] C.-K. Luk, R. S. Cohn, R. Muth, H. Patil, A. Klauser, P. G.
software plagiarism by program dependence graph analysis,” Lowney, S. Wallace, V. J. Reddi, and K. M. Hazelwood, “Pin:
in Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data building customized program analysis tools with dynamic
Mining (KDD ’06), 2006, pp. 872–881. instrumentation,” in Proc. ACM SIGPLAN Conf. Programming
[21] L. Prechelt, G. Malpohl, and M. Philippsen, “Finding plagia- Language Design and Implementation (PLDI ’05), 2005, pp. 190–
risms among a set of programs with jplag,” Journal of Universal 200.
Computer Science, vol. 8, no. 11, pp. 1016–1038, 2002. [42] [Online]. Available: ASM, http://asm.ow2.org/.
[22] Z. Tian, Q. Zheng, T. Liu, M. Fan, X. Zhang, and Z. Yang, [43] I. Jonassen, J. F. Collins, and D. G. Higgins, “Finding flexi-
“Plagiarism detection for multithreaded software based on ble patterns in unaligned protein sequences.” Protein Science,
thread-aware software birthmarks,” in Proc. Int. Conf. Program vol. 4, no. 8, pp. 1587–1595, 1995.
Comprehension (ICPC ’14), 2014, pp. 304–313. [44] M.-F. Sagot and A. Viari, “A double combinatorial approach to
[23] I. Rigoutsos and A. Floratos, “Combinatorial pattern discovery discovering patterns in biological sequences,” in Combinatorial
in biological sequences: The teiresias algorithm.” Bioinformat- Pattern Matching, ser. Lecture Notes in Computer Science.
ics, vol. 14, no. 1, pp. 55–67, 1998. Springer Berlin Heidelberg, 1996, vol. 1075, pp. 186–208.
[24] [Online]. Available: DashO, [45] H. Kuhn, “The hungarian method for the assignment prob-
https://www.preemptive.com/products/dasho. lem,” Naval Research Logistics, vol. 52, no. 1, pp. 7–21, 2005.
[25] C. S. Collberg, E. Carter, S. K. Debray, A. Huntwork, J. D. Kece- [46] [Online]. Available: GCJ, https://gcc.gnu.org/java/.
cioglu, C. Linn, and M. Stepp, “Dynamic path-based software [47] X. Xie, F. Liu, B. Lu, and L. Chen, “A software birthmark based
watermarking,” in Proc. ACM SIGPLAN Conf. Programming on weighted k-gram,” in IEEE Int. Conf. Intelligent Computing
Language Design and Implementation (PLDI ’04), 2004, pp. 107– and Intelligent Systems (ICIS ’10), 2010, pp. 400–405.
118. [48] B. W. Mathews, “Comparison of the predicted and observed
[26] Z. Tian, T. Liu, Q. Zheng, F. Tong, D. Wu, S. Zhu, and K. Chen, secondary structure of T4 phage lysozyme,” Biochimica Et
“Software plagiarism detection: A survey,” Journal of Cyber Biophysica Acta (bba) - Protein Structure, vol. 405, pp. 442–451,
Security, vol. 3, pp. 52–76, 2016. 1975.
[27] G. Myles and C. Collberg, “K-gram based software birth- [49] Y.-C. Jhi, X. Jia, X. Wang, S. Zhu, P. Liu, and D. Wu, “Program
marks,” in Proc. ACM Symp. Applied Computing (SAC ’05), 2005, characterization using runtime values and its application to
pp. 314–318. software plagiarism detection,” IEEE Trans. Software Engineer-
[28] M. Olszewski, J. Ansel, and S. Amarasinghe, “Kendo: efficient ing, vol. 41, no. 9, pp. 925–943, 2015.
deterministic multithreading in software,” ACM SIGPLAN [50] L. Luo, J. Ming, D. Wu, P. Liu, and S. Zhu, “Semantics-based
Notices, vol. 44, no. 3, pp. 97–108, 2009. obfuscation-resilient binary code similarity comparison with
[29] H. Cui, J. Wu, J. Gallagher, H. Guo, and J. Yang, “Efficient applications to software and algorithm plagiarism detection,”
deterministic multithreading through schedule relaxation,” in IEEE Transactions on Software Engineering, vol. PP, no. 99, pp.
Proc. ACM Symp. Operating Systems Principles (SOSP ’11). 1–1, 2017.
ACM, 2011, pp. 337–351. [51] F. Zhang, H. Huang, S. Zhu, D. Wu, and P. Liu, “ViewDroid:
[30] C. Bienia, “Benchmarking modern multiprocessors,” Ph.D. Towards obfuscation-resilient mobile application repackaging
dissertation, Princeton University, January 2011. detection,” in Proc. ACM Conf. Security and Privacy in Wireless
[31] A. Wespi, M. Dacier, and H. Debar, “Intrusion detection using and Mobile Networks (WiSec ’14), 2014, pp. 25–36.
variable-length audit trail patterns,” in Int. Symp. Recent Ad- [52] C. S. Collberg and C. Thomborson, “Watermarking, tamper-
vances in Intrusion Detection (RAID ’00). Springer, 2000, pp. proofing, and obfuscation-tools for software protection,” IEEE
110–129. Trans. Software Engineering, vol. 28, no. 8, pp. 735–746, 2002.
Qinghua Zheng received the B.S. degree in computer software in 1990, the M.S. degree in computer organization and architecture in 1993, and the Ph.D. degree in system engineering in 1997 from Xi’an Jiaotong University, China. He was a postdoctoral researcher at Harvard University in 2002. He is currently a professor at Xi’an Jiaotong University, and the dean of the Department of Computer Science. His research areas include computer network security, intelligent e-learning theory and algorithm, multimedia e-learning, and trustworthy software.
Eryue Zhuang received the B.S. degree in software and microelectronics from Northwestern Polytechnical University, China, in 2014. She is currently working toward the M.S. degree in the Department of Computer Science and Technology at Xi’an Jiaotong University, China. Her research interests include trustworthy software and user behavior analysis.
Ming Fan received the B.S. degree in computer science and technology from Xi’an Jiaotong University, China, in 2013. He is currently working toward the Ph.D. degree in the Department of Computer Science and Technology at Xi’an Jiaotong University, China. His research interests include trustworthy software and malware detection of Android Apps.