Systematic Literature Review On Source Code Similarity Measurement and Clone Detection
Abstract
Measuring and evaluating source code similarity is a fundamental software engineering activity that
embraces a broad range of applications, including but not limited to code recommendation and the detection
of duplicate code, plagiarism, malware, and code smells. This paper presents a systematic literature review and meta-
analysis of code similarity measurement and evaluation techniques to shed light on the existing approaches
and their characteristics in different applications. We initially found over 10000 articles by querying four
digital libraries and ended up with 136 primary studies in the field. The studies were classified according to
their methodology, programming languages, datasets, tools, and applications. A deep investigation reveals
80 software tools, working with eight different techniques on five application domains. Nearly 49% of the
tools work on Java programs and 37% support C and C++, while many programming languages remain
unsupported. A noteworthy finding was the existence of 12 datasets related to source code similarity measurement
and duplicate code, of which only eight are publicly accessible. The lack of reliable datasets,
empirical evaluations, hybrid methods, and focus on multi-paradigm languages are the main challenges in
the field. Emerging applications of code similarity measurement concentrate on the development phase in
addition to the maintenance phase.
Keywords: Source code similarity, code clone, plagiarism detection, code recommendation, systematic
literature review.
1 Introduction
Source code similarity measurement is a fundamental activity for solving many code-related tasks in software engineering,
including clone and reuse identification [1]–[9], plagiarism detection [10]–[16], malware and vulnerability analysis [17]–
[22], and code recommendation [23]–[26]. Almost all machine learning-based approaches used to measure program quality
[27]–[30], detect code smells [31]–[34], suggest refactorings [35], [36], fix programs [37], [38], and summarize source code
[24], [39] rely on some form of source code similarity indicator. Hence, it is essential to have a comprehensive view of code
similarity measurement techniques, their essence, and their applications in software engineering.
In the literature, code similarity is also referred to as code cloning or duplicate code [1], [40]–[47]. While code similarity is a
broader concept than code clones, cloned code is one of the main factors affecting software maintenance. Code clones in
software systems often result from the practice of copy-paste programming [41], [48], [49]. This practice helps programmers
quickly reuse parts of existing code and speed up the development process, but it may leave large fragments of cloned code
behind. Other causes of code clones include language limitations, coding styles, using the same APIs to implement the same
rules, and different developers coincidentally implementing the same logic [6], [48].
There are both difficulties and benefits in the creation and use of code clones in the software development lifecycle. One
major challenge with cloned code is that if a fault is detected in a piece of code, all cloned parts must be identified to check
1 Corresponding author. Address: School of Computer Engineering, Iran University of Science and Technology, Hengam St., Resalat Sq., Tehran, Iran, Postal Code: 16846-13114.
for the possibility of the defect. Rahman and Roy [50] have shown that code clones decrease the stability of programs. On
the other hand, several studies have claimed that refactoring duplicate code is not always desirable. For instance,
removing code clones can increase dependencies between different components, which may become tightly coupled because
they refer to one piece of code instead of separate clones. Hence, many researchers argue that cloned code should at least be
identified, even if no action is performed on it [51]–[55]. Unfortunately, detecting code clone instances is not straightforward.
Moreover, clone type definitions and taxonomies are sometimes unclear, and the border between code clones and code
similarity is not well defined in the research.
Despite the maturity of studies on code similarity measurement, we found no comprehensive survey in the field covering
state-of-the-art approaches. This paper presents a systematic literature review (SLR) and meta-analysis of code similarity
measurement and evaluation approaches to classify the technical aspects of the field and disclose the open
challenges and opportunities. To this end, we searched the most important digital libraries, resulting in an initial set of
more than 10000 articles. After applying the relevant exclusion and inclusion criteria, snowballing, and quality assessment,
136 articles were selected as the primary studies in the code similarity measurement field. We reviewed all primary studies
in detail, analyzed and categorized our findings, and reported the results. The main objectives and contributions of our SLR
are as follows:
1. We identify the state-of-the-art techniques, software tools, and datasets proposed for code similarity measurement
and clone detection.
2. We identify and classify essential aspects of the available code similarity measurement studies and compare their
advantages and disadvantages in different application domains, including clone detection, malware detection,
plagiarism detection, and code recommendation.
3. We identify and discuss the existing challenges and opportunities in current code similarity measurement and
the new applications emerging in the field.
Our study catalogs various techniques, tools, datasets, and applications related to code similarity measurement. Most
importantly, 80 different source code similarity measurement and clone detection tools have been proposed by
researchers, working with at least eight different techniques. Most proposed techniques measure the similarity
between tokens in code fragments, while more advanced techniques based on machine learning and hybrid methods are
emerging. The lack of public tools and datasets and the focus on clone detection in Java applications are among the most
critical challenges of the available studies. With the fast growth of large codebases, the efficiency, accuracy, and scalability
of source code similarity measurement and clone detection algorithms have become important factors.
The rest of the paper is organized as follows. First, in Section 2, we briefly explain key terms and concepts used in code
similarity measurement and code clone literature and review different types of code clones. Afterward, Section 3 outlines
the research methodology and the underlying protocol for the systematic literature review. Section 4 describes our high-
level classification of the primary studies and reports the results of this systematic review. Section 5 discusses the challenges
and opportunities in the field. The threats to the validity of our SLR are discussed in Section 6. Finally, Section 7 concludes
the SLR and presents future work.
2 Background
This section discusses the problem of code similarity measurement, clone detection, the types of similarity and clones, the
origins and primary reasons for code similarity, and the related surveys performed in the field of code similarity and code
clones.
As will be discussed in this survey, code similarity measurement provides a
basis to address different problems, denoting a wide range of applications in software engineering. Nevertheless, code clone
detection is the most prominent application of code similarity and the subject of many articles and tools in this field.
Advancements in clone detection methods can promote other code similarity measurement applications. Hence, we
emphasize definitions and types of code clones. Researchers have proposed many definitions for code clones [1], [2], [48],
[58]. As one of the widely accepted definitions, Baxter et al. [1] define code clones as follows:
Definition 1 (Clone codes): Two code snippets 𝑐1 , 𝑐2 are clones if they are similar according to some definition of similarity.
Code snippet pairs in Definition 1 are at a specific granularity level in terms of programming abstractions,
including the statement, block, method, class, package, or component of a program. Moreover, code similarity measurement
can be performed in the scope of one or more projects, depending on the intended application. When entities of multiple
projects are investigated to measure similarity and find similar instances, so-called cross-project similarity measurement
techniques are defined and used [23], [26], [59]. In practice, the result of code similarity measurement can be reported as
a vector containing the ranked similarities between the given code snippet and all code snippets at the same entity level.
Definition 1 leads to the formation of different clone types according to the definition of the similarity measure. Indeed, two
additional aspects, a similarity type and a threshold, must be considered for the clone detection task compared to
the general definition of code similarity measurement. The similarity threshold indicates how close clone instances must be
to each other. The maximum similarity value for this threshold indicates an exact match between instances.
Consequently, not all similarity measurement approaches and similarity measures are suitable for code clone detection
tasks. In this paper, we target the general problem of code similarity measurement. However, in most cases, the proposed
techniques are only comparable based on clone detection concepts, such as clone types.
Code clones can be detected within one programming language or between several languages. The latter type of clone detection
is also called cross-language clone detection (CLCD) [60], [61]. Cross-language clones typically occur when a codebase is
prepared for use in different environments for portability. For instance, when building a game for different operating systems,
the main codebase is essentially the same, with slight differences for each operating system [48].
CF0: Initial code fragment (CF)
for (int i = 0; i<10; i++)
{
  // foo 2
  if (i%2 == 0)
    a = b + i;
  else
    // foo 1
    a = b - i;
}

CF1: Type I clone
for (int i = 0; i<10; i++)
{
  if (i%2 == 0)
    a = b + i; // cmt 1
  else
    a = b - i; // cmt 2
}

CF2: Type II clone
for (int j = 0; j<10; j++)
{
  if (j%2 == 0)
    a = b + j; // cmt 1
  else
    a = b - j; // cmt 2
}
Figure 1. Examples of various types of code clones proposed in the literature [22].
Rattan et al. [67] have performed a systematic review of software clone detection and management. They report the
results of reviewing 213 articles from 11 leading journals and 37 conferences. Their survey covers almost all software clone
topics, such as clone definitions, clone types, reasons for clone creation, clone detection methods, and clone detection tools.
They have also tried to answer the fundamental question of whether clones are "useful" or "harmful". Their review
surfaced different and contradictory answers and concluded that simply deleting clone instances is
not a sound approach. Instead, developers require tools to identify and control clones so that appropriate strategies can be
applied in different situations. The authors have not reviewed code similarity measurement in general.
Similar surveys on software clone detection have been recently performed by Min and Ping [68] and Ain et al. [69]. These
studies cover approaches, tools, and open-source subject systems used for code clone detection, which help the researchers
choose appropriate approaches or tools for detecting code clones according to their needs. However, the general problem
of code similarity measurement and the wide range of its applications have not been covered systematically.
Concerning other applications of code similarity, the plagiarism detection techniques and tools have been reviewed by
Novak et al. [72]. They have conducted a systematic review of the field to assist university professors in detecting plagiarism
in source code delivered by students. Their study discusses definitions of plagiarism, plagiarism detection tools, comparison
metrics, obfuscation methods, datasets used in comparisons, and algorithm types. The authors have focused on source-code
plagiarism detection tools in academia and on the plagiarism obfuscation methods typically used by students who commit
plagiarism. They have concluded that there are many definitions of "source-code plagiarism" both inside and outside
academia, but no clear, agreed-upon definition exists in the literature. In addition, few datasets and metrics are available for
quantitative analysis. We observed a similar status for other applications relevant to code similarity measurement.
Some studies have empirically evaluated and compared a subset of source code similarity measurement and clone detection
techniques [43], [62], [73], [74]. However, the results of such comparisons are highly biased by the dataset. Burd and Bailey
[73] have evaluated the performance of five code clone detection tools on a medium-sized Java program. They have
concluded that there is no single outright winner among clone detection techniques. However, the reported results are only
based on clones in a single program.
Bellon et al. [62] have evaluated the precision, recall, and execution time of six clone detectors using a benchmark containing
eight large C and Java programs. The selected tools were based on different code similarity measurement approaches,
including techniques that work on textual information, lexical and syntactic information, software metrics, and program
dependency graphs. The authors built a reference set of code clones in eight software systems and compared the tools
against their proposed dataset. They have concluded that text-based and token-based tools demonstrate higher recall but
lower precision compared to tree-based and graph-based tools. In addition, the execution time of tree-based techniques is
higher than that of text-based and token-based ones. The study used only one human expert to judge the clone detection
results, which threatens the reliability of the reported findings.
Biegel et al. [74] have empirically compared the recall and the computation time of text-based, token-based, and tree-based
source code similarity measurement approaches on four Java projects. They have used the tools to detect refactoring
fragments in the code. The authors have found that the results of different tools have a large overlap. However, CCFinder
[2], a token-based clone detection tool, is much slower than the other two similarity measurement approaches. The dataset
and replication package of their empirical study are not publicly available.
In a recent empirical study, Ragkhitwetsagul et al. [43] have evaluated 30 code similarity detection techniques and tools.
The authors have found that in the case of pervasive source code modifications, general textual similarity measurement
tools offer performance similar to specialized code similarity measurement tools. Overall, however, specialized source code
similarity measurement techniques and tools can perform better than general textual similarity measures. The JPlag
plagiarism detection tool [10] shows the best performance in detecting local code changes, and CCFX achieves the highest
performance on pervasively modified code. It should be noted that the optimal parameters of the tools, such as the threshold
used to detect cloned code, are naturally biased toward the particular dataset.
Chen et al. [70] have performed a critical review of different code duplication approaches published between 2006 and 2020. The authors
have concluded that there are no valid test datasets with a large number of samples to verify the effectiveness of available
code clone detection techniques. Moreover, the authors have reported that most techniques are highly complex, support a single
programming language, and cannot detect semantic clones (type IV clones). However, they have not investigated the
wide range of code similarity measurement applications discussed in our paper. In this SLR, we review all the applications
related to source code similarity measurement and discuss each of them thoroughly. We also expand existing categorizations
of code similarity measurement algorithms and applications to encompass the most recent approaches taken by researchers
and practitioners. Specifically, we added and discussed learning-based, test-based, image-based, and hybrid algorithms in
addition to previously identified source code similarity measurement techniques. We also identified some emerging
applications of code similarity measurement, including code recommendation and fault prediction.
3 Research Methodology
Our literature review follows the guidelines established by Kitchenham and Charters [75], which decomposes a systematic
literature review in software engineering into three stages: planning, conducting, and reporting. Furthermore, we have
taken inspiration from recent high-quality systematic literature reviews related to empirical software engineering,
code smells, software refactoring, and software plagiarism [14], [76]–[78]. Figure 2 shows the overall process of the research
methodology, including the definition of the topic and research questions, search string creation, and article selection. The
numbers inside the parentheses are the number of articles retrieved at each step. We commence our research methodology
by describing the research questions to be answered during the SLR.
We ran the prepared search string on the four well-known digital libraries shown in Figure 2, which index the majority of
computer science and engineering publications: IEEE Xplore, ACM Digital Library (ACM DL),
SpringerLink, and ScienceDirect. These digital libraries index the most important journals and conferences in the field.
Moreover, Google Scholar was used during the snowballing process. The initial number of articles obtained from
each digital library is shown inside the parentheses in Figure 2. The search process was performed between April and
May 2023, and studies published until that date were retrieved.
("code similarity" OR "similar code" OR "code clone" OR "clone code" OR "cloned code" OR "duplicate code" OR
"duplicated code" OR "code duplicate" OR "replicate code" OR "code replicated" OR "clone detection")
AND
("measure" OR "vector" OR "metric" OR "detect" OR "detection" OR "identify" OR "compare" OR "comparison" OR "tool"
OR "data" OR "dataset" OR "evaluate" OR "evaluation")
AND
("application" OR "technique" OR "approach" OR "method" OR "methods" OR "applications" OR "approaches" OR
"techniques" OR "algorithm" OR "algorithms")
AND
("source code" OR "program" OR "code fragment" OR "code fragments" OR "software")
Figure 3. Search string used in code similarity measurement SLR.
It should be noted that when searching the digital libraries, we did not use any additional search filter restricting
the search to the title, keywords, or abstract of the articles; the full text was considered in all databases. However,
necessary changes to the setup of our search string were applied due to the specific formats imposed by the digital libraries.
More specifically, we could not use the above search string with the ScienceDirect search engine because the engine
accepts search strings with at most eight logical operators. Therefore, we had to identify the most relevant
keywords in our search string to trim the query. As a result, the following logical expression was used exclusively when
searching the ScienceDirect library:
("Code similarity" OR "Source code similarity" OR "Code comparing" OR "Source code comparing" OR "code clone") AND
("Code clone" OR "Duplicate code" OR "detecting similarity" OR "clone detection")
• EC3: Articles whose total text was less than three pages were removed after reviewing them to ensure they did not
contain significant contributions. Any useful leads found in these articles toward other suitable topics and articles
were considered during the inclusion-criteria and snowballing phases.
• EC4: Articles that did not directly discuss a source code similarity measurement approach in their abstract were
removed. For example, some papers discuss binary code similarity.
• EC5: Papers that did not propose an automated approach for source code similarity measurement were removed.
We excluded these articles since the similarity measurement technique was necessary for classifying methods.
• EC6: Theses, books, journal covers and metadata, and secondary, tertiary, empirical, and case studies were removed.
Applying the above six specific exclusion criteria resulted in the deletion of 9517 publications, which were mostly irrelevant
to the topic under review.
Each article was assigned to one of three quality levels based on its score: High (score ≥ 4), Medium
(2.5 ≤ score < 4), and Low (score < 2.5). The articles whose scores belonged to the High and Medium levels were selected
for in-depth analysis as our final primary studies.
During the quality assessment process, 169 articles were removed, and 136 articles were selected as the primary studies
on code similarity measurement and evaluation. We observed that articles without detailed and helpful information
about their methods and evaluations received a relatively low score and were removed from the primary studies. The
complete list of primary studies is provided in Appendix A. We used Microsoft Excel to organize the results of the
article selection process and handle the information extracted for each primary study. The Excel file containing the full list
of retrieved papers and the filter results is publicly available on Zenodo [80].
4.1 Demographics
Figure 4 shows the number of studies published on source code similarity measurement and clone detection each year. For
better visualization, all publications before 2005 are shown in the same bar as the year 2005. Initial studies include all
studies remaining after applying the exclusion criteria, and primary studies include the 136 articles in the final selection. Overall,
we observe growth in the number of publications in this field in recent years, indicating an increasing need for
code similarity measurement approaches and tools. Two spikes are observed in the number of primary studies, in 2013 and
2020. However, the number of studies dropped in 2021, possibly due to the long-term impacts of the COVID-19 pandemic;
some papers written in these years may not yet have been published when we searched the literature.
Table 1 shows the publication type of the primary studies indexed in each digital library. One paper had not
been indexed by any of the four digital libraries we searched and was instead found via Google Scholar. About 27.94% of the
primary studies have been published as journal articles, and the remaining 72.06% are conference papers.
[Figure: High-level classification scheme of the primary studies along four dimensions. Applications: direct (clone and reuse detection, covering clone Types I, II, III, and IV) and indirect (plagiarism detection; malware and vulnerability detection; defect, refactoring, smell, and quality; code prediction). Techniques: static analysis, comprising algorithmic approaches (token-based, tree-based, graph-based, metric-based, and image-based), data-driven approaches (machine learning and deep learning), and hybrid methods. Tools: tool-supported or not, non-academic tools, supported programming languages, and language independence. Datasets: machine oracles based on tool-voting, mutation, and program transformation, among others.]
Regarding technique, the existing studies are based on either static or dynamic program analysis. Methods based on static
analysis, such as [10], [40], [61], [81]–[83], do not execute the programs to measure similarity. They use
algorithmic, data-driven, or hybrid approaches to find similar source code. A few studies [84], [85] use dynamic analysis, in
which programs are executed with a test suite or workload to obtain runtime information, such as function
outputs, and similar code fragments are then found by comparing this information. Section 4.4 discusses the existing
techniques in detail.
Regarding tools, the proposed studies may or may not be supported by a tool. Some studies have released their tools publicly [10], [81],
while others have not. There are also non-academic code similarity measurement tools that have not been discussed by
primary studies [86]–[89]. The existing tools and approaches may detect source code similarity in one or multiple
programming languages, and there are also language-independent tools [90]. Section 4.5 describes the existing code similarity
measurement tools and the languages they support.
Our review of primary studies shows three kinds of datasets used to create and validate code similarity measurement
methods: datasets with human oracles [62], [91], datasets with machine oracles [92], [93], and hybrid datasets [43]. The first
category of datasets is either manually labeled by experts in the field or based on solutions submitted to specific
problems on programming contest platforms such as Google CodeJam [94] and Codeforces [95]. Automatically generated
datasets use majority voting between different existing tools [62], mutation operators [93], program transformation rules
[96], or the version history of a codebase to label similar and non-similar code fragments without human intervention.
Finally, a combination of these approaches can be used to create large, high-quality datasets. The details of existing code
similarity benchmarks and datasets are discussed in Section 4.6.
Regarding the applications of code similarity measurement, we observed one direct application, code clone and reuse
detection, and several indirect applications, which we further classified into plagiarism detection, malware and
vulnerability detection, code prediction, and code recommendation. While the majority of primary studies have focused on
finding clone instances, our survey indicates that source code similarity measurement can address many code-related tasks
in software engineering. Section 4.3 describes the different applications of code similarity measurement covered by
the studies we reviewed.
scalable and efficient due to filtering and indexing the source code fragments. However, determining an appropriate
threshold for the local alignment algorithm to select similar instances is challenging.
NiCad is one of the best-known and most widely used text-based clone detection tools; it uses text normalization to remove noise,
standardize formats, and break program statements into parts such that potential changes are detected as simple line-wise
text differences [P11], [P14]. Ragkhitwetsagul and Krinke [P63] have used compilation and decompilation as pre-processing
steps to normalize syntactic changes in code fragments and improve the accuracy of NiCad. gCad [P35] is a near-miss
clone genealogy extractor based on NiCad that detects clones across multiple versions of a program. It accepts n versions of a
program, maps clone classes between consecutive versions, and extracts how each clone class changes throughout an
observation period.
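To make the normalize-then-diff idea concrete, the following minimal Python sketch (our illustration, not NiCad's actual implementation) strips comments, collapses whitespace, and measures line-wise similarity of two fragments with difflib:

import difflib
import re

def normalize(code: str) -> list[str]:
    """Crude NiCad-style normalization: strip comments, collapse whitespace,
    and drop blank lines so that formatting differences disappear."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", "", code)                    # line comments
    lines = [re.sub(r"\s+", " ", ln).strip() for ln in code.splitlines()]
    return [ln for ln in lines if ln]

def text_similarity(a: str, b: str) -> float:
    """Line-wise similarity of the normalized fragments, in [0, 1]."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Fragments differing only in comments and layout score as identical (Type I).
cf0 = "for (int i = 0; i < 10; i++) {\n    a = b + i; // cmt\n}"
cf1 = "// header\nfor (int i = 0; i < 10; i++) {\n\ta = b + i;\n}"
print(text_similarity(cf0, cf1))  # -> 1.0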
Cosma and Joy [P22] have proposed a text-based approach to detecting plagiarism, supported by a tool called PlaGate. They have
used the latent semantic analysis (LSA) technique to convert source code files to a numerical matrix and then compute the
cosine similarity between each pair of files in a corpus. PlaGate is a language-independent source code plagiarism detection
tool. However, choosing the right dimension for the LSA matrix on each corpus is challenging.
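The LSA pipeline can be sketched in a few lines; the snippet below is a minimal illustration of the general technique (not PlaGate itself), where the number of components k is exactly the corpus-dependent dimension choice mentioned above:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lsa_similarity(corpus: list[str], k: int = 2):
    """Build a term-by-document matrix, reduce it with truncated SVD (LSA),
    and return pairwise cosine similarities between the documents."""
    tdm = CountVectorizer(token_pattern=r"\w+").fit_transform(corpus)
    reduced = TruncatedSVD(n_components=k).fit_transform(tdm)
    return cosine_similarity(reduced)

files = ["int sum(int a, int b) { return a + b; }",
         "int add(int x, int y) { return x + y; }",
         "void hello() { printf(\"hi\"); }"]
print(lsa_similarity(files).round(2))  # files 0 and 1 score highest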
Kamiya et al. [P45] have proposed a more advanced method in which a lexical analyzer first creates token sequences. The
AST is then constructed for each sequence, and common subtrees are found using the Apriori algorithm [108]. Their method
finds type I and II clones with sequence matching and type III clones with AST matching. The advantage of this method
is that it does not need to create the AST for the entire program. However, it does not support detecting type IV clones.
Chen et al. [P43] have used a text-based method to identify malware applications for Android systems. The source code is
broken into small code snippets, and the similarity between these snippets and existing malware is computed using the
NiCad tool [109] to detect the malware. The overall score of this method is over 95% (with a recall of 91% and a precision of 99%),
which is relatively high compared to other text-based methods.
Tukaram et al. [P88] have proposed another text-based method that uses a preprocessing module to remove extraneous
information in the code, such as spaces. Other modules in their approach include detectors for keywords, data types, variables, and
functions, each identifying a specific part of the code. Finally, a Similarity Checker module is responsible for
identifying similarities. This approach aims to provide general information on the similarity of applications' code at different
levels. Its advantage is that it can be used to evaluate software and give general-purpose details about cloned code.
Waterman sequence alignment algorithm for token matching to increase the efficiency of token-based clone detection.
CPDP is a similar token-based tool for plagiarism detection in source code [P32].
Source code plagiarism detection systems can be confused when structural modifications, such as changing control
structures, are applied to the original source code [P32]. To mitigate such issues, Duric and Gaševic [P26] have proposed an
enhanced token-based plagiarism detection tool, SCSDS, with specialized preprocessing that avoids the impact of some
structural modifications. For example, all identifiers of Java numeric types, e.g., the int, long, float, and double types, are
substituted with a <NUMERIC_TYPE> token. In addition, all semicolon tokens are removed. SCSDS achieves a better
F1 score than the well-known JPlag tool [10]. However, the comparison was performed on a tiny dataset of
students' programming assignments, which threatens the generalizability of the approach.
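A minimal sketch of this kind of token normalization is shown below; the rules are illustrative stand-ins for SCSDS's full preprocessing, not its actual rule set:

import re

# Illustrative SCSDS-style rules: unify Java numeric-type keywords into one
# token and drop semicolons, so such edits no longer affect token matching.
NUMERIC_TYPES = re.compile(r"\b(int|long|short|byte|float|double)\b")

def normalize_tokens(source: str) -> list[str]:
    source = NUMERIC_TYPES.sub("<NUMERIC_TYPE>", source)
    source = source.replace(";", " ")
    return source.split()

print(normalize_tokens("int a = 1; long b = 2;"))
# ['<NUMERIC_TYPE>', 'a', '=', '1', '<NUMERIC_TYPE>', 'b', '=', '2']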
Ullah et al. [P75] have used token-based clone detection to detect plagiarism in students' programming assignments. Their
method first creates a matrix of frequently repeated program tokens using latent semantic analysis (LSA) [113], which
converts the source code into a natural-language-like representation. The similarity between programming assignments is
then computed with LSA, whose extracted token semantics are used to find plagiarism among students' programming
assignments in C++ and Java. The authors have reported a maximum recall of 80% for their proposed method.
CP-Miner [P7] efficiently finds copy-pasted code and copy-paste-related bugs in large software suites. It first tokenizes the program
and then applies a subsequence mining algorithm called CloSpan [114] to find frequent subsequences between code
fragments. The most frequent subsequences are exactly the copy-pasted segments in the original program. Experiments with
the CP-Miner tool show that about one-third of copy-pasted code contains modifications of one to two statements. CP-Miner
is a language-independent tool. However, its time complexity is O(n²), where n is the maximum length of the frequent
sequences.
Detecting type IV code clones with text-based or token-based methods is challenging since the programs' text is often
different. Rajakumari et al. [P89] have addressed the detection of type IV clones in Java programs in five steps: selecting
input, separating modules, extracting clones, classifying clones, and reporting results. Ragkhitwetsagul et al. [P87] have
used the well-known TF-IDF technique to convert sequences of tokens into vectors. First, an indexing phase creates the
vector representation of the tokens and assigns scores to each vector. Then, a detection phase selects similar
instances and reports them as clones. Their proposed tool, Siamese, can detect type I, II, and III clones in Java programs with
a maximum recall of 99%. Another advantage of this method is its relatively high speed in detecting clone
instances. However, Siamese cannot detect type IV clones. FastDCF [P128] is another token-based clone detection tool that
uses HDFS [115] and MapReduce [116] to scale clone processing to big codebases at the granularity of functions and
files. Like Siamese, it cannot detect type IV clones.
extraction process. Asta [P19] uses a similar approach to convert AST nodes to sequential patterns with a limited length
and compares the patterns for detecting possible clones.
Gao et al. [P78] have proposed a tool, TECCD, which applies the word2vec algorithm [117] to the program AST to reduce the
time of tree-based similarity measurement. The ANTLR parser generator [118] is used to create the source code AST, and a
matrix containing AST nodes is fed to a word2vec model, resulting in a vector representation of each AST. The Euclidean
distance between pairs of corresponding AST vectors is computed as the similarity measure. The learned model is used to
convert new source code into vectors, which improves the efficiency of tree-based methods. TECCD can detect code clones of
types I, II, and III with a precision of 88% and a recall of 87%, but it cannot detect type IV clones.
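The embedding step can be sketched with gensim; the node sequences below are made-up stand-ins for sequences extracted from an ANTLR parse tree, and the pooling-plus-distance scheme is a simplified illustration rather than TECCD's exact pipeline:

import numpy as np
from gensim.models import Word2Vec

# Each "sentence" is the node-type sequence of one fragment's AST
# (illustrative sequences; real ones would come from an ANTLR parser).
asts = [["ForStmt", "VarDecl", "BinOp", "Assign"],
        ["ForStmt", "VarDecl", "BinOp", "Assign"],
        ["MethodCall", "Literal"]]
model = Word2Vec(asts, vector_size=16, min_count=1, epochs=50, seed=1)

def ast_vector(nodes: list[str]) -> np.ndarray:
    """Embed an AST as the mean of its node-type vectors."""
    return np.mean([model.wv[n] for n in nodes], axis=0)

# Smaller Euclidean distance means more similar trees.
print(np.linalg.norm(ast_vector(asts[0]) - ast_vector(asts[1])))  # 0.0
print(np.linalg.norm(ast_vector(asts[0]) - ast_vector(asts[2])))  # larger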
Sager et al. [P8] have tried to identify similar classes in Java code using the AST. Their method obtains the AST of each class
using Eclipse's JDT API. Afterward, the AST is converted to an intermediate model called FAMIX (FAMOOS Information
Exchange Model), a programming-language-independent model for representing object-oriented source code [P8]. The
similarity between two classes is then computed on the FAMIX trees using tree-comparison algorithms, including bottom-up
maximum common subtree isomorphism, top-down maximum common subtree isomorphism, and tree edit distance. The
discussed method identifies similar classes with an average recall of 74.5% on two software projects.
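Of these algorithms, tree edit distance is the easiest to demonstrate; the sketch below uses the zss implementation of the Zhang-Shasha algorithm on two toy class-level trees, which are our simplified stand-ins for the intermediate models above:

from zss import Node, simple_distance  # Zhang-Shasha tree edit distance

# Two toy class-level trees that differ in a single node label.
class_a = Node("Class").addkid(Node("Method").addkid(Node("If")))
class_b = Node("Class").addkid(Node("Method").addkid(Node("Loop")))

# The distance counts the node insertions, deletions, and relabelings
# needed to turn one tree into the other; 0 means identical structure.
print(simple_distance(class_a, class_b))  # -> 1 (relabel If -> Loop)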
Another type of graph extracted from programs to identify similarity at a coarse grain is the function call graph. Gascon
et al. [P33] have proposed a method for malware detection based on efficient embeddings of function call graphs and
machine learning classification models. They have achieved a recall of 89% in detecting Android malware. However, the static
analysis used for call graph construction may not be accurate, specifically in the case of polymorphic methods.
Marastoni et al. [P60] have proposed an approach based on static analysis that computes the CFG of each
method and encodes it in a feature vector used to measure similarities. Their tool, Groupdroid, aims to detect Android
malware applications and has achieved an F1 of 82%. However, the approach only works at the method
level, while malware interactions can spread across different methods. Kalysch et al. [P74] have used a similar approach,
extending the features extracted from the CFG to improve the effectiveness of Android malware detection, and have
achieved a precision of 89%.
Kim et al. [P44] have proposed a graph-based method to identify malware applications for Android systems. Their similarity
calculation consists of two steps. First, the control flow graph of each method in the two Android applications is extracted.
Second, structural information collected from the control flow graphs is compared to match similar methods in the two
target applications; the ratio of matched methods to the total number of methods is then used for the similarity calculation.
However, extracting and comparing CFGs for each pair of methods in two applications is time- and resource-consuming.
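The cost of pairwise CFG comparison is easy to see in code; the sketch below compares two toy control flow graphs with networkx's exact graph edit distance, which is NP-hard in general and thus illustrates (rather than reproduces) the matching step of such approaches:

import networkx as nx

# Toy CFGs for two methods: nodes are basic blocks, edges are control flow.
g1 = nx.DiGraph([("entry", "cond"), ("cond", "then"), ("cond", "else"),
                 ("then", "exit"), ("else", "exit")])  # if/else shape
g2 = nx.DiGraph([("entry", "cond"), ("cond", "body"),
                 ("body", "cond"), ("cond", "exit")])  # loop shape

# Exact graph edit distance; tractable only for tiny graphs, which is why
# comparing every method pair of two applications quickly becomes costly.
print(nx.graph_edit_distance(g1, g2))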
Pattern) extracted by detecting their genealogies and change patterns from thousands of revisions of subject systems, and
used those features to train a support vector machine for predicting buggy code clones. Their method can predict buggy
code clones in different Java and C projects with an overall precision of 76%, recall of 71%, and F1 of 73%.
An emerging application of code similarity measurement, considered mainly by learning-based approaches, is code
recommendation, in which a code snippet is suggested to developers according to their needs [P85], [P133]. For instance, a
model suggests names for new software entities, e.g., new methods, in the code based on similar entities in a benchmark
[26], [126], [127]. Code2Vec [P85] and Code2Seq [24] have used a code embedding approach to convert every method in the
program to a fixed-length vector; similar method bodies are then found by computing vector similarity over a large corpus
of methods. Both tools use different paths between AST leaves of the methods as input to a deep neural network that learns
the method embedding. Both approaches suffer from high computational cost and relatively weak performance in suggesting
names for methods with complex ASTs.
Tufano et al. [P65] have composed four representations of the code, including identifiers, ASTs, bytecode, and CFGs, to
enrich the feature space of the learning model. The authors have concluded that using bytecode and CFGs in the combined
model improves the precision of code clone detection. SEED [P130] combines data flow and control flow to form the
semantic graph based on the intermediate representation while focusing on API calls and operator tokens. A graph match
network (GMN) [128] is used to extract the feature vector from the semantic graph. Code fragments must be compilable
for use in SEED.
Guo et al. [129] have proposed a pre-trained model that considers the inherent structure of code to support
various code-related tasks, including source code search, code translation, code refinement, and clone detection. In the pre-
training stage, the authors have used data flow, a semantic structure of code that encodes the "where-the-value-comes-from"
relation between variables, instead of the syntactic structure of code such as the AST. Their model is based on the
Transformer architecture used in BERT [130]. About 2.3 million functions from six programming languages, including Java,
Python, PHP, Go, Ruby, and JavaScript, were used for training. Their approach has achieved a precision of 94.8% and a recall
of 95.2% in identifying code clones, indicating good performance on code-related tasks. Tao et al. [P116] have exploited
the CodeBERT model to detect cross-language code clones.
Chochlov et al. [P122] have used the CodeBERT deep neural network to embed each code fragment in a fixed-length feature
vector for detecting clones of types III and IV. The authors have then applied an efficient approximate k-nearest neighbor
(k-NN) algorithm to avoid O(n²) comparisons between the embeddings of the n code fragments. Their results show
a lower execution time compared to SourcererCC [P47], [P53] and CCFinder [P5]. However, their approach requires systems
with graphics processing units (GPUs).
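The embed-then-search pattern can be sketched as follows; this is a generic illustration using the public microsoft/codebert-base checkpoint and a FAISS HNSW index, not the authors' actual pipeline:

import faiss
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(code: str) -> torch.Tensor:
    """Mean-pool the last hidden states into one fixed-length vector."""
    inputs = tok(code, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).last_hidden_state.mean(dim=1).squeeze(0)

fragments = ["int add(int a, int b) { return a + b; }",
             "int sum(int x, int y) { return x + y; }",
             "void log(String s) { System.out.println(s); }"]
vectors = torch.stack([embed(f) for f in fragments]).numpy()

# An approximate k-NN index avoids the O(n^2) all-pairs comparison.
index = faiss.IndexHNSWFlat(vectors.shape[1], 32)
index.add(vectors)
dist, nn = index.search(vectors[:1], 2)  # neighbors of the first fragment
print(nn)  # its closest non-identical neighbor is the renamed clone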
It is unclear whether models like BERT and its variants provide the best pre-training when applied to other modalities, such
as source code. Roziere et al. [131] have used pre-trained networks based on masked language modeling (MLM) that
leverage the structural aspects of programming languages [130] to recover the original version of obfuscated source code.
The authors have shown that their pre-trained model, which considers the structural aspects of programs, significantly
outperforms existing approaches on multiple downstream tasks, providing relative improvements of up to 13% in
unsupervised code translation and 24% in natural-language code search. However, such pre-trained models are often highly
complex, and it is difficult to interpret their internal decision-making processes. Changing programming styles and renaming
many parts of the code could be used adversarially to degrade the results of such models. Sheneamer et al. [P112] have
concluded that effective feature generation is more important than the type of classifier for machine-learning-based code
clone detection.
sample inputs or test suites, and their runtime data are collected. These data can then be used to detect semantically
equivalent code. Li et al. [P66] have proposed a test-based clone detection approach to identify type IV code clones. A
test suite is automatically generated for each pair of methods using the EvoSuite test data generation tool [133]. If two
methods produce the same output on each of the generated test cases, they are considered semantically equivalent.
With this approach, the authors identified type IV clone methods in the Java Development Kit (JDK). However, generating
and executing test cases for different methods is computationally expensive. Moreover, test-based methods cannot identify
equivalent fragments within the body of methods. A similar method has been proposed by Su et al. [P50], who use existing
workloads to execute programs and then measure functional similarity between programs based on their inputs and
outputs.
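The core idea reduces to running two candidate methods on the same inputs and comparing outputs; the sketch below uses random inputs as a simple stand-in for EvoSuite-generated tests:

import random

def max_a(xs):  # iterative implementation
    m = xs[0]
    for x in xs[1:]:
        if x > m:
            m = x
    return m

def max_b(xs):  # recursive implementation with the same behavior
    return xs[0] if len(xs) == 1 else max(xs[0], max_b(xs[1:]))

# Two methods are reported as type IV clone candidates if they agree on
# every generated input (random lists here, EvoSuite tests in the paper).
random.seed(0)
tests = [[random.randint(-100, 100) for _ in range(random.randint(1, 20))]
         for _ in range(500)]
print(all(max_a(t) == max_b(t) for t in tests))  # True -> behaviorally equal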
Leone and Takada [P123] have proposed a test-based approach for type IV (semantic) clone detection that overcomes
limitations related to objects in Java. The authors have used EvoSuite [133] to automatically create tests with correct
instantiations of all the needed classes in Java programs. The output of the test code for each method is compared using a
DeepHash function to obtain numerical values for objects, considering the values of each instance variable. Their approach
handles only methods that have a return value to compare. In addition, test data generation with EvoSuite is a time- and
resource-consuming process.
first recognizes the style of the code by comparing the given code with a database containing different code styles. The
ASTs of the source code with the same styles are then created and compared to determine the final cloned instances. Thaller et
al. [P127] have combined token-based and test-based methods to identify semantic clones. The types of input and output
parameters are compared in a static analysis phase using tokenized code fragments. Thereafter, the input and output values
of the executable fragments (e.g., functions) are compared to decide the similarity between each pair of programs. Their
proposed tool, SCD-PSM, cannot detect type II and III code clones, since textual similarities represent a different problem
set. Tree-based and graph-based methods have been combined by Fang et al. [P99] to extract both the syntactic and semantic
features required for code similarity measurement, specifically for detecting functional code clones.
Recently, learning-based, tree-based, and graph-based methods have been combined by many researchers to improve the
feature space of machine learning models for clone detection [P48], [P80], [P81], [P82], [P94], [P97], [P106], [P107],
[P109], [P134]. Sheneamer and Kalita [P48] have manually extracted features from the AST and PDG to train a classifier
detecting type IV code clones. Zeng et al. [P80] have automated the feature extraction from ASTs using deep autoencoder
models [134]. Hu et al. [P125] have proposed TreeCen to detect semantic clones while remaining scalable. First, the AST is
extracted from the given method and transformed into a simple graph representation, called the tree graph, according to the
node types. Thereafter, six types of centrality measures are extracted from each tree graph, forming the feature vector of a
machine learning model. The feature vector for a code pair, obtained by concatenating the two vectors, is labeled according
to whether it is a clone or a non-clone pair. TreeCen [P125] only works at the method level. A similar method has been
proposed by Wu et al. [P126], in which the source code AST is transformed into a Markov chain instead of a tree graph.
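The centrality-based feature extraction step can be sketched with networkx; the specific set of centrality measures below is illustrative and not necessarily the six used by TreeCen:

import networkx as nx
import numpy as np

def centrality_features(tree_graph: nx.Graph) -> np.ndarray:
    """Summarize a tree graph as averaged centrality scores, loosely
    following TreeCen's idea of centrality-based feature vectors."""
    measures = [nx.degree_centrality, nx.closeness_centrality,
                nx.betweenness_centrality, nx.harmonic_centrality,
                nx.load_centrality, nx.pagerank]
    return np.array([np.mean(list(m(tree_graph).values())) for m in measures])

g = nx.balanced_tree(2, 3)            # stand-in for an AST-derived tree graph
pair_vector = np.concatenate([centrality_features(g), centrality_features(g)])
print(pair_vector.round(3))           # input row for the clone classifier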
Some hybrid clone detection approaches are not solely based on combining the aforementioned methods. They mainly
rely on low-level code representations provided by programming language compilers [P20], [P28], [P96], [P117]. Selim et al.
[P20] have proposed a hybrid clone detection technique that detects type III clones by combining source code and intermediate
code processing. Their approach complements text-based and token-based clone detectors in detecting type III clones. The
limited number of operations in the intermediate representation decreases the dissimilarity between cloned code segments and
leads to locating clone instances with complex variations.
Raheja et al. [P28] have computed metrics from the bytecode representation to improve the performance of token-based
clone detection. While these approaches increase the recall of clone detection compared to source-based methods, they require
compilation and decompilation of the program source code to and from the intermediate code. Schäfer et al. [P111] have provided
a stubbing tool that effectively compiles Java source code with missing dependencies into bytecode. Their tool can be
used to compile code fragments whose dependencies are not present within the dataset, such as the methods in
BigCloneBench [135]. The authors have reported that their tool makes 95% of all Java files in BigCloneBench compilable,
which facilitates finding code clones of types I, II, and III.
Table 3. Various combinations of code similarity measurement and clone detection approaches.
Table 4. Existing code similarity measurement tools sorted alphabetically based on the tool name.
Tool name | Link to the software tool | Method | Application | Supporting language(s) | Programming language(s)
Amain | https://github.com/CGCL-codes/Amain | Tree-based + Learning-based | Clone detection (type IV) | Java | Python
AST+ | Not mentioned | Tree-based | Defect prediction | C | C
ASTNN | https://github.com/zhangj111/astnn | Tree-based + Learning-based | Clone detection (types I, II, III-IV) | Java, C | Python
AutoMOSS | https://github.com/automoss/automoss | Text-based | Plagiarism detection | Many languages | Python
CCAligner | https://github.com/PCWcn/CCAligner/tree/f27622d6f1500536c45862c4d49bd5f5d6802ace | Token-based | Clone detection | Java | C++
CCfinderx | https://github.com/gpoo/ccfinderx | Token-based | Clone detection (types I, II, III) | Java | C++, Java, Python, C
CCSharp | https://github.com/pquiring/CCSharp | Graph-based | Clone detection (all types) | C# | C# and C++
CCStokener | https://github.com/CCStokener/CCStokener | Token-based + Tree-based | Clone detection (type III) | C, Java | Python
Clone Manager | https://www.scied.com/pr_cmbas.htm | Token-based + Tree-based | Clone detection (all types) | C, Java | Not mentioned
Clone Miner | Expired link | Learning-based | Clone detection | Java, C++, Perl, VB | Not mentioned
Clone TM | Not mentioned or expired link | Text-based | Clone detection (types I, II, III) | Java, C++, C# | Not mentioned
CloneCognition | https://github.com/pseudoPixels/CloneCognition | Learning-based | Clone detection | Java | JavaScript, Python
CLONE-HUNTRESS | Expired link | Tree-based | Clone detection (types I, II, III) | Java | Java
Clonewise | Not mentioned | Learning-based | Clone detection | C | Not mentioned
CP-Miner | Not mentioned or expired link | Token-based | Clone detection | C, C++ | Not mentioned
DyCLINK | https://github.com/Programming-Systems-Lab/dyclink | Graph-based | Clone detection (types I, II, III) | Java | Java
FCDetector | https://github.com/shiyy123/FCDetector | Tree-based | Clone detection | C, C++ | Java, Matlab
gCad | Not mentioned or expired link | Text-based | Clone detection | C, C#, Java | Java
Groupdroid | Not mentioned or expired link | Graph-based | Malware detection | Java | Not mentioned
HitoshiIO | https://github.com/Programming-Systems-Lab/ioclones | Text-based | Clone detection | Java | Java
As shown in Figure 8, most available tools use hybrid, learning-based, tree-based, or token-based methods in their backend,
owing to the high efficiency of these methods and their compatibility with most programming languages. On the other hand,
only one tool works with the test-based approach, which indicates the difficulty and novelty of research in this direction.
Figure 9 shows the frequency of programming languages supported by the available source code similarity measurement and
clone detection tools. The languages supported by only one tool are shown separately on the right-hand side of the pie chart;
about 7% of the supported programming languages fall into this group. The top five most frequently
supported languages are Java, C, C++, C#, and Python. Nearly 44% of the available tools support source code written in
Java, and 33% support C or C++ code, while other popular programming languages such as PHP, Swift, and Scala have no
supporting tool. As a result, there is a vast opportunity to develop multi-paradigm and multi-language code similarity
measurement and clone detection tools.
evaluating tools to detect functional similarity (type IV clones). There are other online programming competition
platforms, such as AtCoder [148], whose data have been used by Perez and Chiba [149]. However, the most frequently used
platforms are the Google CodeJam competition [94] and Codeforces [95].
The GPLAG dataset was created by Liu et al. [16] to evaluate plagiarism detection with PDG-based methods. The
authors selected five procedures from an open-source Linux program, "join," and applied a set of plagiarism operators
[16] to generate modified versions of the program. The dataset is not publicly available, but it can be reconstructed
from the full description provided by the authors. The Malware dataset [20] has been proposed to evaluate the
malware detection task. It includes original malicious code, variants of the malware, and harmless code written in Visual
Basic. GPLAG and Malware are small datasets that cannot be used in learning-based approaches. Overall, we
observed that datasets for many applications of code similarity measurement, such as code recommendation, code
prediction, and malware and vulnerability detection, are limited.
Table 5. Dataset proposed for code similarity measurement and clone detection tasks.
Figure 10. Word cloud of projects used to create and evaluate code similarity measurement tools.
Approach | Supporting clone types | Effectiveness | Efficiency | Scalability | Advantages | Disadvantages
Text-based | I, II | High | High | Low | Straightforward implementation | Requires heavy preprocessing; fails to identify type III and IV clones
Token-based | I, II, III | High | High | High | High ability and high compatibility | Fails to identify type IV clones; needs lexer transformation rules
Tree-based | I, II, III | High | Low | Moderate | High accuracy for clones of types II and III | Fails to identify type IV clones; constructing the AST is difficult and complex
Graph-based | I, II, III, IV | High | Low | Low | Identifies all types of clones; considers both structural and semantic information | Graph matching is non-polynomial; building the PDG and CFG is difficult and complex
Metric-based | I, II, III, IV | High | Moderate | High | Identifies all types of clones; language independent; straightforward similarity measurement | Very sensitive to metric thresholds; low recall
Learning-based | I, II, III, IV | Moderate | Low | High | Identifies all types of clones; fully automated; high stability and flexibility | Requires large and clean datasets
Token-based + Text-based | I, II, III | High | Moderate | Moderate | Straightforward implementation | Fails to identify type IV clones
Token-based + Graph-based | I, II, III, IV | High | Low | Moderate | Identifies all types of clones | Constructing the PDG is difficult and complex
Metric-based + Token-based | I, II, III, IV | High | Moderate | High | Identifies all types of clones; straightforward implementation | Requires lexer transformation rules
Metric-based + Graph-based | I, II, III, IV | High | Moderate | Moderate | Identifies all types of clones; more efficient than the graph-based method; suitable for source code visualization and visual clone detection | Complicated implementation; very sensitive to metric thresholds
Metric-based + Text-based | I, II, III, IV | High | Moderate | High | Identifies all types of clones; more efficient than the graph-based method; low false-positive instances | Requires heavy preprocessing; very sensitive to metric thresholds
Graph-based + Tree-based | III, IV | Moderate | Low | Low | Detects semantic clones | Cannot identify all types of clones
Text-based + Graph-based | I, II, III, IV | High | Moderate | Low | Identifies all types of clones | Building the PDG and CFG is difficult and complex
Text-based + Tree-based | I, II, III | High | Moderate | Low | Low false-positive instances | Cannot identify all types of clones
Graph-based + Learning-based | I, II, III, IV | High | Moderate | Low | Identifies all types of clones | Building the PDG and CFG is difficult and complex; requires large and clean datasets
Tree-based + Learning-based | I, II, III | High | Moderate | Moderate | More efficient than the graph-based method | Cannot identify all types of clones; requires large and clean datasets
Token-based + Tree-based | I, II, III | High | High | Moderate | Preprocessing is used to increase detection speed | Fails to identify type IV clones
Challenge 4: While most primary studies have used a hybrid method to measure code similarity, publicly available tools mostly work with the token-based method. Indeed, the number of software tools that work with more sophisticated procedures than text-based and token-based ones is limited. Researchers have rarely used existing libraries for basic tasks such as graph comparison [178], machine learning [179], static analysis [180], and metric computation [181] to build reliable and reusable tools. In addition, the tools have rarely been integrated with IDEs, which makes them difficult for software engineers to use. We recommend packaging future tools as IDE plugins with appropriate APIs and documentation; applications that rely on the core capabilities of code similarity tools can then be built on top of these APIs and plugins, facilitating their use by developers.
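As an illustration of this recommendation, the sketch below delegates the graph-comparison step of a graph-based detector to NetworkX [178] instead of reimplementing it. The toy PDG-like graphs, node labels, and matching rule are hand-built stand-ins for what a real front end would extract from source code.

```python
import networkx as nx

def toy_pdg(statements, deps):
    """Build a small directed graph labeled like a program dependence graph."""
    g = nx.DiGraph()
    for node_id, kind in statements:
        g.add_node(node_id, kind=kind)   # e.g., "assign", "loop", "call"
    g.add_edges_from(deps)               # data/control dependences
    return g

g1 = toy_pdg([(0, "assign"), (1, "loop"), (2, "assign")], [(0, 1), (1, 2)])
g2 = toy_pdg([(0, "assign"), (1, "loop"), (2, "call")], [(0, 1), (1, 2)])

# Exact graph edit distance is non-polynomial in general, which is why
# graph-based methods scale poorly, but the library computes it directly
# for small graphs while taking node labels into account.
dist = nx.graph_edit_distance(
    g1, g2, node_match=lambda a, b: a["kind"] == b["kind"]
)
print(dist)  # 1.0: relabeling one node turns g1 into g2
```

Building on such well-tested libraries frees research effort for the genuinely novel parts, e.g., how the PDG is constructed and which distance threshold should signal a clone.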
Challenge 5: Most proposed methods have focused on finding clone instances and measuring code similarity in object-oriented programs, mainly written in Java. However, the recent rise in popularity of multi-paradigm languages such as Python and JavaScript calls for approaches that support these paradigms. For instance, metric-based methods may employ only source code metrics that are available in all programming paradigms. Researchers are encouraged to focus on developing tool-supported approaches for programming languages other than Java and C/C++. The main challenge is adapting an existing method to a new language for which, most likely, no code similarity dataset is available. As mentioned earlier, one solution is to apply transfer learning, in which existing models are fine-tuned to a new domain using only a few data samples. Another possible solution is to create models based on source code metrics that are independent of language structures.
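As a sketch of this second solution, the snippet below characterizes code by a vector of coarse metrics available in every paradigm (non-blank lines, tokens, branches, loops, and a crude call count) and compares the vectors with cosine similarity. The metric set and the use of cosine similarity are illustrative assumptions rather than choices drawn from a surveyed study.

```python
import math
import re

def metric_vector(code):
    """Compute a small, language-agnostic metric vector for a code snippet."""
    lines = [ln for ln in code.splitlines() if ln.strip()]
    tokens = re.findall(r"\w+|[^\s\w]", code)
    branches = sum(tokens.count(k) for k in ("if", "elif", "else", "case", "when"))
    loops = sum(tokens.count(k) for k in ("for", "while", "loop"))
    calls = code.count("(")  # crude proxy for calls and grouped expressions
    return [len(lines), len(tokens), branches, loops, calls]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

py = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s"
js = "function total(xs) {\n  let s = 0;\n  for (const x of xs) s += x;\n  return s;\n}"
print(cosine(metric_vector(py), metric_vector(js)))  # close to 1.0 for these twins
```

Since none of these metrics refers to a language-specific construct, the same vectors can be extracted from Java, Python, or JavaScript, which sidesteps the missing-dataset problem at the cost of the threshold sensitivity and low recall noted in the comparison table.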
Challenge 6: The final challenge we observed is the limited use of code similarity techniques and tools to assist the development phase of the software development lifecycle (SDLC). Current clone detection tools have been designed for the software maintenance phase, which means they work on legacy programs rather than programs under development. Real-time measurement of and notification about code similarity during software development would help programmers avoid repeating code and instead reuse available software components. As discussed in this paper, the application of code similarity measurement is not limited to clone detection; other applications covered in this paper, mainly code recommendation, improve the agility of software development.
We believe that code similarity measurement can serve as a fundamental component of all data-centric solutions in software engineering. A general theory of code similarity enables the automation of many laborious programming tasks, including but not limited to automatic code recommendation, defect prediction, smell detection, vulnerability finding, refactoring recommendation, and quality measurement. Although researchers have developed automated approaches separately in each domain, the relationship between most of these applications and code similarity measurement as a common principle has not been well studied. Investigating the common aspects of the aforementioned applications with respect to code similarity measurement helps to develop versatile tools that can perform multiple tasks efficiently.
Automatic code recommendation, e.g., software entity naming and code summarization, has been recognized as a challenging software engineering task [126], [182]. Code similarity measurement can be applied to detect cloned instances and recommend the names of the detected clones to programmers. Alon et al. [23], [24] have recently proposed a hybrid tree-based and learning-based code similarity measurement method that predicts program properties such as names and expression types from existing cloned or similar methods. The challenge is to find a clone instance with a clean and proper name; appropriate entity names for a program under development may even be found in similar code written in another programming language, in which case cross-language clone detection approaches can be valuable.
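To convey the representation behind such hybrid methods without the learned model, the following sketch summarizes a method by the set of root-to-leaf node-type paths of its AST, in the spirit of the path contexts used by Alon et al. [23], [24]. Python's ast module stands in for a real multi-language parser, and a simple Jaccard score over path sets replaces the learned embedding; both simplifications are assumptions made here for illustration.

```python
import ast

def leaf_paths(code):
    """Collect root-to-leaf paths of AST node-type names."""
    paths = set()

    def walk(node, prefix):
        prefix = prefix + (type(node).__name__,)
        children = list(ast.iter_child_nodes(node))
        if not children:
            paths.add(prefix)
        for child in children:
            walk(child, prefix)

    walk(ast.parse(code), ())
    return paths

def path_similarity(a, b):
    pa, pb = leaf_paths(a), leaf_paths(b)
    return len(pa & pb) / len(pa | pb) if pa | pb else 0.0

# Identifier names are node attributes, not node types, so renaming
# leaves the path set, and hence the similarity, unchanged.
print(path_similarity("def f(a):\n    return a + 1",
                      "def g(x):\n    return x + 1"))  # 1.0
```

A learned model such as code2vec additionally embeds the identifiers found on each path, which is what allows it to propose a name for the detected clone instead of merely flagging it.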
Code similarity measurement and clone detection can also be used to identify code smells as refactoring opportunities. A recent study by Aniche et al. [35] used machine learning techniques to recommend refactorings based on code similarity. Integrated development environments (IDEs) are expected to be equipped with various code similarity measurements to make online suggestions about existing clones, refactorings, faults, plagiarism, and other software quality attributes. In Software 2.0 [183], where programming relies on learned models, the syntax and functionality of all programs are expected to be very similar, and what differentiates them is the input and output data. Therefore, a new definition of code similarity is required for programs written in the Software 2.0 style [183]. In addition, powerful code recommendation tools can be designed and developed to enable a highly agile and automated software development methodology.
6 Threats to Validity
The results of this survey might be affected by article selection bias, incompleteness of the search, search engine problems [184], study distribution imbalance, and inaccuracy in data extraction and synthesis. We tried to mitigate each threat as much as possible during the review process. First, we followed the well-defined research protocol guidelines proposed by Kitchenham and Charters [75] to select relevant studies fairly and without bias. Second, we defined our search string such
that all relevant terms and their possible combinations were covered, reducing the threat of missing studies. We also performed snowballing [79] to find relevant papers in the field that our search string did not initially retrieve. Third, we applied a quality assessment before the final selection of primary studies to ensure that the papers retrieved for all key terms made a significant, validated contribution to the field. Fourth, we carefully compared the taxonomy used in our classification with those of related surveys on code similarity measurement and code clone detection. Finally, we asked three M.Sc. students and one Ph.D. student in software engineering to check the correctness and completeness of our classification against the research questions we aimed to answer.
7 Conclusion
This systematic literature study performed automatic searches in four major electronic libraries to select relevant code similarity measurement and clone detection studies. A total of 136 studies were selected from an initial set of over 10000 articles and reviewed in detail to answer six research questions about different aspects of this topic. This paper analyzes each primary study along five dimensions: method, application, dataset, tool, and supported programming language. The short answers to our research questions and the relevant findings based on the discussions in this review are as follows:
RQ1 findings: Our SLR reveals at least eight basic methods used for source code similarity measurement: text-based, token-based, tree-based, graph-based, metric-based, image-based, learning-based, and test-based. Learning-based and test-based approaches have only recently been applied to source code similarity measurement and clone detection and were not covered by previous surveys. Our SLR indicates that most studies (over 27%) employ hybrid methods in which both the textual and structural contents of code snippets are compared to measure code similarity. However, only 41% of the articles support their proposed approaches with a publicly available software tool.
RQ2 findings: Our findings indicate that research on code similarity measurement primarily targets the direct application of clone and reuse detection. However, we found four other application domains: source code plagiarism detection, malware and vulnerability detection, code prediction, and code recommendation. Code similarity measurement can automate laborious activities in software engineering, such as code smell detection, refactoring suggestion, and fault prediction.
RQ3 findings: We extracted 80 software tools for measuring source code similarity and detecting clones from the primary studies, of which 33 (i.e., 41%) are publicly available. In total, the available tools support code similarity measurement and clone detection for 18 different programming languages. More than 77% of these tools support programs written in Java, C, or C++, demonstrating the lack of code similarity measurement tools for other programming languages and paradigms.
RQ4 findings: At least 50% of the primary studies have used a limited set of open-source projects, mostly Java projects hosted on GitHub. We observed 12 different datasets designed explicitly for code clone detection and source code similarity measurement. Only 68 of the 136 primary studies (50%) have used these datasets, and not all of the reported datasets are publicly available. We observed a lack of public, large, and high-quality source code similarity and clone datasets containing industrial and real-life software systems.
RQ5 findings: The performance of code similarity measurement studies is evaluated with different metrics along three dimensions: effectiveness, efficiency, and scalability. Regarding the effectiveness of existing techniques, our meta-analysis shows approximate means of 86.3%, 88.4%, 86.5%, and 82.5% for precision, recall, F1 score, and accuracy, respectively. However, these results were obtained on different datasets used to evaluate the existing tools, so they are only moderately reliable. The performance of emerging applications based on code similarity, e.g., code recommendation, is lower than that of older applications. Further empirical evaluation of code similarity measurement and standard datasets are required in all application domains to determine the state of the art.
RQ6 findings: Our SLR identifies six remarkable challenges in the field, along with potential solutions, that can be considered future directions for research on source code similarity and clone detection. The lack of comprehensive, large, and reliable datasets, the limited attention to metric-based and learning-based methods, the limited support for popular programming languages and new programming paradigms, the lack of empirical analysis of the efficiency and scalability of different approaches, and the emerging applications of code similarity measurement are the most critical challenges and opportunities in the field.
Research on code similarity measurement and its applications is growing and will continue to grow in the following years.
Industrial support of proposed approaches with reliable tools is essential to reduce the high cost and time of developing
quality software systems. At the same time, systematic literature reviews and empirical studies in the field are also necessary
to integrate the results and provide helpful information to practitioners and researchers.
Declarations
Data Availability Statement
The datasets generated and analyzed during the current study are available on Zenodo,
https://doi.org/10.5281/zenodo.7993619.
Funding
This study has received no funding from any organization.
Conflict of Interest
All of the authors declare that they have no conflict of interest.
Ethical Approval
This article does not contain any studies with human participants or animals performed by any of the authors.
References
[1] I. D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier, “Clone detection using abstract syntax trees,” in
Proceedings. International Conference on Software Maintenance (Cat. No. 98CB36272), IEEE Comput. Soc, 1998, pp.
368–377. doi: 10.1109/ICSM.1998.738528.
[2] T. Kamiya, S. Kusumoto, and K. Inoue, “CCFinder: a multilinguistic token-based code clone detection system for
large scale source code,” IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 654–670, Jul. 2002, doi:
10.1109/TSE.2002.1019480.
[3] S. Carter, R. J. Frank, and D. S. W. Tansley, “Clone detection in telecommunications software systems: a neural net
approach,” in Proc. Int. Workshop on Application of Neural Networks to Telecommunications, 1993, pp. 273–287.
[4] H. Nasirloo and F. Azimzadeh, “Semantic code clone detection using abstract memory states and program
dependency graphs,” 2018 4th International Conference on Web Research, ICWR 2018, pp. 19–27, 2018, doi:
10.1109/ICWR.2018.8387232.
[5] Roopam and G. Singh, “To enhance the code clone detection algorithm by using hybrid approach for detection of
code clones,” Proceedings of the 2017 International Conference on Intelligent Computing and Control Systems, ICICCS
2017, vol. 2018-Janua, pp. 192–198, 2017, doi: 10.1109/ICCONS.2017.8250708.
[6] M. R. H. Misu and K. Sakib, “Interface driven code clone detection,” Proceedings - Asia-Pacific Software Engineering
Conference, APSEC, vol. 2017-Decem, pp. 747–748, 2018, doi: 10.1109/APSEC.2017.97.
[7] L. Buch and A. Andrzejak, “Learning-based recursive aggregation of abstract syntax trees for code clone detection,”
SANER 2019 - Proceedings of the 2019 IEEE 26th International Conference on Software Analysis, Evolution, and
Reengineering, pp. 95–104, 2019, doi: 10.1109/SANER.2019.8668039.
[8] R. Koschke, R. Falke, and P. Frenzel, “Clone detection using abstract syntax suffix trees,” in 2006 13th Working
Conference on Reverse Engineering, IEEE, 2006, pp. 253–262. doi: 10.1109/WCRE.2006.18.
[9] C. Fang, Z. Liu, Y. Shi, J. Huang, and Q. Shi, “Functional code clone detection with syntax and semantics fusion
learning,” ISSTA 2020 - Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and
Analysis, pp. 516–527, 2020, doi: 10.1145/3395363.3397362.
[10] L. Prechelt, G. Malpohl, M. Philippsen, and others, “Finding plagiarisms among a set of programs with JPlag,” J. Univers. Comput. Sci., vol. 8, no. 11, p. 1016, 2002.
[11] S. Burrows, S. M. M. Tahaghoghi, and J. Zobel, “Efficient plagiarism detection for large code repositories,” Softw
Pract Exp, vol. 37, no. 2, pp. 151–175, Feb. 2007, doi: 10.1002/spe.750.
[12] Z. Duric and D. Gasevic, “A source code similarity system for plagiarism detection,” Comput J, vol. 56, no. 1, pp. 70–
86, Jan. 2013, doi: 10.1093/comjnl/bxs018.
[13] G. Cosma and M. Joy, “An approach to source-code plagiarism detection and investigation using latent semantic
analysis,” IEEE Transactions on Computers, vol. 61, no. 3, pp. 379–394, Mar. 2012, doi: 10.1109/TC.2011.223.
[14] T. Foltýnek, N. Meuschke, and B. Gipp, “Academic plagiarism detection: A systematic literature review,” ACM
Comput Surv, vol. 52, no. 6, 2019, doi: 10.1145/3345317.
[15] F. Ullah, J. Wang, M. Farhan, S. Jabbar, Z. Wu, and S. Khalid, “Plagiarism detection in students’ programming
assignments based on semantics: multimedia e-learning based smart assessment methodology,” Multimed Tools
Appl, vol. 79, no. 13–14, pp. 8581–8598, 2020, doi: 10.1007/s11042-018-5827-6.
[16] C. Liu, C. Chen, J. Han, and P. S. Yu, “GPLAG: detection of software plagiarism by program dependence graph
analysis,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
- KDD ’06, New York, New York, USA: ACM Press, 2006, p. 872. doi: 10.1145/1150402.1150522.
[17] J. Chen, M. H. Alalfi, T. R. Dean, and Y. Zou, “Detecting Android malware using clone detection,” J Comput Sci
Technol, vol. 30, no. 5, pp. 942–956, 2015, doi: 10.1007/s11390-015-1573-7.
[18] N. Marastoni, A. Continella, D. Quarta, S. Zanero, and M. D. Preda, “Groupdroid: Automatically grouping mobile
malware by extracting code similarities,” ACM International Conference Proceeding Series, 2017, doi:
10.1145/3151137.3151138.
[19] A. Kalysch, M. Protsenko, O. Milisterfer, and T. Müller, “Tackling androids native library malware with robust,
efficient and accurate similarity measures,” ACM International Conference Proceeding Series, 2018, doi:
10.1145/3230833.3232828.
[20] J. Kim and B.-R. Moon, “New malware detection system using metric-based method and hybrid genetic algorithm,”
in Proceedings of the fourteenth international conference on Genetic and evolutionary computation conference
companion - GECCO Companion ’12, New York, New York, USA: ACM Press, 2012, p. 1527. doi:
10.1145/2330784.2331029.
[21] A. M. Lajevardi, S. Parsa, and M. J. Amiri, “Markhor: malware detection using fuzzy similarity of system call
dependency sequences,” Journal of Computer Virology and Hacking Techniques, Apr. 2021, doi: 10.1007/s11416-021-
00383-1.
[22] F. P. Viertel, W. Brunotte, D. Strüber, and K. Schneider, “Detecting security vulnerabilities using clone detection and
community knowledge,” Proceedings of the International Conference on Software Engineering and Knowledge
Engineering, SEKE, vol. 2019-July, no. August, pp. 245–252, 2019, doi: 10.18293/SEKE2019-183.
[23] U. Alon, M. Zilberstein, O. Levy, and E. Yahav, “Code2vec: learning distributed representations of code,” in
Proceedings of the ACM on Programming Languages, Jan. 2019, pp. 1–29. doi: 10.1145/3290353.
[24] U. Alon, S. Brody, O. Levy, and E. Yahav, “Code2seq: generating sequences from structured representations of code,”
in International Conference on Learning Representations, 2019. Accessed: Sep. 26, 2022. [Online]. Available:
https://openreview.net/forum?id=H1gKYo09tX
[25] S. Arshad, S. Abid, and S. Shamail, “CodeBERT for code clone detection: a replication study,” in 2022 IEEE 16th
International Workshop on Software Clones (IWSC), IEEE, Oct. 2022, pp. 39–45. doi: 10.1109/IWSC55060.2022.00015.
[26] S. Parsa, M. Zakeri-Nasrabadi, M. Ekhtiarzadeh, and M. Ramezani, “Method name recommendation based on source
code metrics,” J Comput Lang, vol. 74, p. 101177, Jan. 2023, doi: 10.1016/j.cola.2022.101177.
[27] M. Zakeri‐Nasrabadi and S. Parsa, “Learning to predict test effectiveness,” International Journal of Intelligent Systems,
Oct. 2021, doi: 10.1002/int.22722.
[28] M. Z. Nasrabadi and S. Parsa, “Learning to predict software testability,” in 2021 26th International Computer
Conference, Computer Society of Iran (CSICC), Tehran: IEEE, Mar. 2021, pp. 1–5. doi:
10.1109/CSICC52343.2021.9420548.
[29] M. Zakeri-Nasrabadi and S. Parsa, “An ensemble meta-estimator to predict source code testability,” Appl Soft
Comput, vol. 129, p. 109562, Nov. 2022, doi: 10.1016/j.asoc.2022.109562.
[30] M. D. Papamichail, T. Diamantopoulos, and A. L. Symeonidis, “Measuring the reusability of software components
using static analysis metrics and reuse rate information,” Journal of Systems and Software, vol. 158, p. 110423, Dec.
2019, doi: 10.1016/j.jss.2019.110423.
[31] F. Arcelli Fontana and M. Zanoni, “Code smell severity classification using machine learning techniques,” Knowl
Based Syst, vol. 128, pp. 43–58, Jul. 2017, doi: 10.1016/j.knosys.2017.04.014.
[32] F. A. Fontana, M. Zanoni, A. Marino, and M. v. Mäntylä, “Code smell detection: towards a machine learning-based
approach,” IEEE International Conference on Software Maintenance, ICSM, pp. 396–399, 2013, doi:
10.1109/ICSM.2013.56.
[33] F. Arcelli Fontana, M. V. Mäntylä, M. Zanoni, and A. Marino, “Comparing and experimenting machine learning
techniques for code smell detection,” Empir Softw Eng, vol. 21, no. 3, pp. 1143–1191, Jun. 2016, doi: 10.1007/s10664-
015-9378-4.
[34] H. Liu, J. Jin, Z. Xu, Y. Bu, Y. Zou, and L. Zhang, “Deep learning based code smell detection,” IEEE Transactions on
Software Engineering, pp. 1–1, 2021, doi: 10.1109/TSE.2019.2936376.
[35] M. Aniche, E. Maziero, R. Durelli, and V. Durelli, “The effectiveness of supervised machine learning algorithms in
predicting software refactoring,” IEEE Transactions on Software Engineering, pp. 1–1, 2020, doi:
10.1109/TSE.2020.3021736.
[36] A. M. Sheneamer, “An automatic advisor for refactoring software clones based on machine learning,” IEEE Access,
vol. 8, pp. 124978–124988, 2020, doi: 10.1109/ACCESS.2020.3006178.
[37] M. White, M. Tufano, M. Martinez, M. Monperrus, and D. Poshyvanyk, “Sorting and transforming program repair
ingredients via deep learning code similarities,” in 2019 IEEE 26th International Conference on Software Analysis,
Evolution and Reengineering (SANER), IEEE, Feb. 2019, pp. 479–490. doi: 10.1109/SANER.2019.8668043.
[38] H. Cao, F. Liu, J. Shi, Y. Chu, and M. Deng, “Random search and code similarity-based automatic program repair,” J
Shanghai Jiaotong Univ Sci, Nov. 2022, doi: 10.1007/s12204-022-2514-6.
[39] M. Allamanis, H. Peng, and C. Sutton, “A convolutional attention network for extreme summarization of source
code,” in Proceedings of The 33rd International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger,
Eds., in Proceedings of Machine Learning Research, vol. 48. New York, New York, USA: PMLR, Dec. 2016, pp. 2091–
2100. [Online]. Available: https://proceedings.mlr.press/v48/allamanis16.html
[40] V. Saini, F. Farmahinifarahani, Y. Lu, P. Baldi, and C. V. Lopes, “Oreo: detection of clones in the twilight zone,”
ESEC/FSE 2018 - Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and
Symposium on the Foundations of Software Engineering, pp. 354–365, 2018, doi: 10.1145/3236024.3236026.
[41] R. Tekchandani, R. Bhatia, and M. Singh, “Semantic code clone detection for Internet of Things applications using
reaching definition and liveness analysis,” Journal of Supercomputing, vol. 74, no. 9, pp. 4199–4226, 2018, doi:
10.1007/s11227-016-1832-6.
[42] I. Reinhartz-Berger and A. Zamansky, “Reuse of similarly behaving software through polymorphism-inspired
variability mechanisms,” IEEE Transactions on Software Engineering, vol. 48, no. 3, pp. 773–785, Mar. 2022, doi:
10.1109/TSE.2020.3001512.
[43] C. Ragkhitwetsagul, J. Krinke, and D. Clark, “A comparison of code similarity analysers,” Empir Softw Eng, vol. 23,
no. 4, pp. 2464–2519, Aug. 2018, doi: 10.1007/s10664-017-9564-7.
[44] N. Li, M. Shen, S. Li, L. Zhang, and Z. Li, “STVsm: Similar Structural Code Detection Based on AST and VSM,” 2012,
pp. 15–21. doi: 10.1007/978-3-642-35267-6_3.
[45] T. Wang, K. Wang, X. Su, and P. Ma, “Detection of semantically similar code,” Front Comput Sci, vol. 8, no. 6, pp.
996–1011, Dec. 2014, doi: 10.1007/s11704-014-3430-1.
[46] J. Kim, H. Choi, H. Yun, and B.-R. Moon, “Measuring Source Code Similarity by Finding Similar Subgraph with an
Incremental Genetic Algorithm,” in Proceedings of the Genetic and Evolutionary Computation Conference 2016, New
York, NY, USA: ACM, Jul. 2016, pp. 925–932. doi: 10.1145/2908812.2908870.
[47] J. Akram, Z. Shi, M. Mumtaz, and P. Luo, “DroidCC: a scalable clone detection approach for android applications to
detect similarity at source code level,” in 2018 IEEE 42nd Annual Computer Software and Applications Conference
(COMPSAC), IEEE, Jul. 2018, pp. 100–105. doi: 10.1109/COMPSAC.2018.00021.
[48] C. K. Roy and J. R. Cordy, “Survey on software clone detection research,” 2007.
[49] K. E. Rajakumari, “Comparison of token-based code clone method with pattern mining technique and traditional
string matching algorithms in-terms of software reuse,” Proceedings of 2019 3rd IEEE International Conference on
Electrical, Computer and Communication Technologies, ICECCT 2019, pp. 1–6, 2019, doi: 10.1109/ICECCT.2019.8869324.
[50] M. S. Rahman and C. K. Roy, “A change-type based empirical study on the stability of cloned code,” Proceedings -
2014 14th IEEE International Working Conference on Source Code Analysis and Manipulation, SCAM 2014, pp. 31–40,
2014, doi: 10.1109/SCAM.2014.13.
[51] Z. Á. Mann, “Three public enemies: cut, copy, and paste,” Computer (Long Beach Calif), vol. 39, no. 7, pp. 31–35, Jul.
2006, doi: 10.1109/MC.2006.246.
[52] J. H. Johnson, “Substring matching for clone detection and change tracking,” in Proceedings - 1994 International
Conference on Software Maintenance, ICSM 1994, Institute of Electrical and Electronics Engineers Inc., 1994, pp. 120–
126. doi: 10.1109/ICSM.1994.336783.
[53] N. Tsantalis, D. Mazinanian, and G. P. Krishnan, “Assessing the refactorability of software clones,” IEEE Transactions
on Software Engineering, vol. 41, no. 11, pp. 1055–1090, Nov. 2015, doi: 10.1109/TSE.2015.2448531.
[54] Y. Higo, T. Kamiya, S. Kusumoto, and K. Inoue, “Refactoring support based on code clone analysis,” 2004, pp. 220–
233. doi: 10.1007/978-3-540-24659-6_16.
[55] M. Fowler and K. Beck, Refactoring: improving the design of existing code, 2nd ed. Addison-Wesley, 2018. [Online]. Available: https://refactoring.com/
[56] T. Yamamoto, M. Matsushita, T. Kamiya, and K. Inoue, “Measuring similarity of large software systems based on
source code correspondence,” 2005, pp. 530–544. doi: 10.1007/11497455_41.
[57] L. Qinqin and Z. Chunhai, “Research on algorithm of program code similarity detection,” in 2017 International
Conference on Computer Systems, Electronics and Control (ICCSEC), 2017, pp. 1289–1292.
[58] S. Giesecke, “Generic modelling of code clones,” in Duplication, Redundancy, and Similarity in Software, 2006.
[59] Y. Yang, Z. Ren, X. Chen, and H. Jiang, “Structural function based code clone detection using a new hybrid
technique,” Proceedings - International Computer Software and Applications Conference, vol. 1, pp. 286–291, 2018, doi:
10.1109/COMPSAC.2018.00045.
[60] K. W. Nafi, B. Roy, C. K. Roy, and K. A. Schneider, “A universal cross language software similarity detector for open
source software categorization,” Journal of Systems and Software, vol. 162, p. 110491, 2020, doi:
10.1016/j.jss.2019.110491.
[61] K. W. Nafi, T. S. Kar, B. Roy, C. K. Roy, and K. A. Schneider, “CLCDSA: cross language code clone detection using
syntactical features and API documentation,” Proceedings - 2019 34th IEEE/ACM International Conference on
Automated Software Engineering, ASE 2019, pp. 1026–1037, 2019, doi: 10.1109/ASE.2019.00099.
[62] S. Bellon, R. Koschke, G. Antoniol, J. Krinke, and E. Merlo, “Comparison and evaluation of clone detection tools,”
IEEE Transactions on Software Engineering, vol. 33, no. 9, pp. 577–591, Sep. 2007, doi: 10.1109/TSE.2007.70725.
[63] N. Davey, P. Barson, S. Field, R. Frank, and D. Tansley, “The development of a software clone detector,” International
Journal of Applied Software Technology, 1995.
[64] J. Krinke, “Identifying similar code with program dependence graphs,” in Proceedings Eighth Working Conference on
Reverse Engineering, IEEE Comput. Soc, 2001, pp. 301–309. doi: 10.1109/WCRE.2001.957835.
[65] G. Mishne and M. de Rijke, “Source code retrieval using conceptual similarity,” in Coupling Approaches, Coupling
Media and Coupling Languages for Information Retrieval, in RIAO ’04. Paris, FRA: LE CENTRE DE HAUTES ETUDES
INTERNATIONALES D’INFORMATIQUE DOCUMENTAIRE, 2004, pp. 539–554.
[66] A. Lakhotia, J. Li, A. Walenstein, and Y. Yang, “Towards a clone detection benchmark suite and results
archive,” in MHS2003. Proceedings of 2003 International Symposium on Micromechatronics and Human Science (IEEE
Cat. No.03TH8717), IEEE Comput. Soc, pp. 285–286. doi: 10.1109/WPC.2003.1199215.
[67] D. Rattan, R. Bhatia, and M. Singh, Software clone detection: a systematic review, vol. 55, no. 7. Elsevier B.V., 2013.
doi: 10.1016/j.infsof.2013.01.008.
[68] H. Min and Z. Li Ping, “Survey on software clone detection research,” in Proceedings of the 2019 3rd International
Conference on Management Engineering, Software Engineering and Service Sciences - ICMSS 2019, New York, New
York, USA: ACM Press, 2019, pp. 9–16. doi: 10.1145/3312662.3312707.
[69] Q. U. Ain, W. H. Butt, M. W. Anwar, F. Azam, and B. Maqbool, “A systematic review on code clone detection,” IEEE
Access, vol. 7, pp. 86121–86144, 2019, doi: 10.1109/ACCESS.2019.2918202.
[70] C.-F. Chen, A. M. Zain, and K.-Q. Zhou, “Definition, approaches, and analysis of code duplication detection (2006–
2020): a critical review,” Neural Comput Appl, vol. 34, no. 23, pp. 20507–20537, Dec. 2022, doi: 10.1007/s00521-022-
07707-2.
[71] M. Lei, H. Li, J. Li, N. Aundhkar, and D.-K. Kim, “Deep learning application on code clone detection: a review of
current knowledge,” Journal of Systems and Software, vol. 184, p. 111141, Feb. 2022, doi: 10.1016/j.jss.2021.111141.
[72] M. Novak, M. Joy, and D. Kermek, “Source-code similarity detection and detection tools used in academia: A
systematic review,” ACM Transactions on Computing Education, vol. 19, no. 3, 2019, doi: 10.1145/3313290.
[73] E. Burd and J. Bailey, “Evaluating clone detection tools for use during preventative maintenance,” in Proceedings.
Second IEEE International Workshop on Source Code Analysis and Manipulation, IEEE Comput. Soc, pp. 36–43. doi:
10.1109/SCAM.2002.1134103.
[74] B. Biegel, Q. D. Soetens, W. Hornig, S. Diehl, and S. Demeyer, “Comparison of similarity metrics for refactoring
detection,” in Proceeding of the 8th working conference on Mining software repositories - MSR ’11, New York, New
York, USA: ACM Press, 2011, p. 53. doi: 10.1145/1985441.1985452.
[75] B. Kitchenham and S. Charters, “Guidelines for performing systematic literature reviews in software engineering,”
2007.
[76] A. S. Nuñez-Varela, H. G. Pérez-Gonzalez, F. E. Martínez-Perez, and C. Soubervielle-Montalvo, “Source code metrics:
a systematic mapping study,” Journal of Systems and Software, vol. 128, pp. 164–197, 2017, doi:
https://doi.org/10.1016/j.jss.2017.03.044.
[77] M. I. Azeem, F. Palomba, L. Shi, and Q. Wang, “Machine learning techniques for code smell detection: a systematic
literature review and meta-analysis,” Inf Softw Technol, vol. 108, pp. 115–138, Apr. 2019, doi:
10.1016/j.infsof.2018.12.009.
[78] C. Abid, V. Alizadeh, M. Kessentini, T. do N. Ferreira, and D. Dig, “30 years of software refactoring research: a
systematic literature review,” arXiv preprint arXiv:2007.02194, 2020.
[79] C. Wohlin, “Guidelines for snowballing in systematic literature studies and a replication in software engineering,”
in Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering - EASE ’14,
New York, New York, USA: ACM Press, 2014, pp. 1–10. doi: 10.1145/2601248.2601268.
[80] M. Zakeri-Nasrabadi, S. Parsa, M. Ramezani, C. Roy, and M. Ekhtiarzadeh, “Supplementary data for a systematic
literature review on source code similarity measurement and clone detection: techniques, applications, and
challenges,” May 31, 2023. https://doi.org/10.5281/zenodo.7993619 (accessed May 31, 2023).
[81] C. K. Roy and J. R. Cordy, “NICAD: accurate detection of near-miss intentional clones using flexible pretty-printing
and code normalization,” in 2008 16th IEEE International Conference on Program Comprehension, IEEE, Jun. 2008, pp.
172–181. doi: 10.1109/ICPC.2008.41.
[82] L. Jiang, G. Misherghi, Z. Su, and S. Glondu, “Deckard: scalable and accurate tree-based detection of code clones,” in
29th International Conference on Software Engineering (ICSE’07), 2007, pp. 96–105.
[83] M. Wang, P. Wang, and Y. Xu, “CCSharp: an efficient three-phase code clone detector using modified PDGs,”
Proceedings - Asia-Pacific Software Engineering Conference, APSEC, vol. 2017-Decem, pp. 100–109, 2018, doi:
10.1109/APSEC.2017.16.
[84] G. Li, H. Liu, Y. Jiang, and J. Jin, “Test-Based clone detection: an initial try on semantically equivalent methods,”
IEEE Access, vol. 6, pp. 77643–77655, 2018, doi: 10.1109/ACCESS.2018.2883699.
[85] F.-H. Su, J. Bell, G. Kaiser, and S. Sethumadhavan, “Identifying functionally similar code in complex
codebases,” in 2016 IEEE 24th International Conference on Program Comprehension (ICPC), IEEE, May 2016, pp. 1–10.
doi: 10.1109/ICPC.2016.7503720.
[86] “PMD: an extensible cross-language static code analyzer.” https://pmd.github.io/ (accessed Sep. 21, 2021).
[87] softwareclones.org, “iClones: incremental clone detection.” http://www.softwareclones.org/iclones.php (accessed
Dec. 22, 2022).
[88] D. Lochner, J. Lochner, and C. Combrinck, “AutoMOSS.” https://github.com/automoss/automoss
(accessed Dec. 15, 2022).
[89] Quandary Peak Research, “Simian - similarity analyser.” https://simian.quandarypeak.com/ (accessed Dec. 15, 2022).
[90] S. Ducasse, M. Rieger, and S. Demeyer, “A language independent approach for detecting duplicated code,” in
Proceedings IEEE International Conference on Software Maintenance - 1999 (ICSM’99). “Software Maintenance for
Business Change” (Cat. No.99CB36360), IEEE, 1999, pp. 109–118. doi: 10.1109/ICSM.1999.792593.
[91] A. Majd, M. Vahidi-Asl, A. Khalilian, A. Baraani-Dastjerdi, and B. Zamani, “Code4Bench: a multidimensional
benchmark of Codeforces data for different program analysis techniques,” J Comput Lang, vol. 53, pp. 38–52, Aug.
2019, doi: 10.1016/j.cola.2019.03.006.
[92] J. Svajlenko and C. K. Roy, “BigCloneEval: A clone detection tool evaluation framework with BigCloneBench,” in
2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, Oct. 2016, pp. 596–600. doi:
10.1109/ICSME.2016.62.
[93] C. K. Roy and J. R. Cordy, “A mutation/Injection-based automatic framework for evaluating code clone detection
tools,” in 2009 International Conference on Software Testing, Verification, and Validation Workshops, IEEE, 2009, pp.
157–166. doi: 10.1109/ICSTW.2009.18.
[94] Google, “Google Code Jam,” 2016. https://code.google.com/codejam/contests.html (accessed Nov. 17, 2021).
[95] M. Mirzayanov, “Codeforces: the only programming contests web 2.0 platform,” 2022. https://codeforces.com/
(accessed Dec. 11, 2022).
[96] S. Yu, T. Wang, and J. Wang, “Data augmentation by program transformation,” Journal of Systems and Software, vol.
190, p. 111304, Aug. 2022, doi: 10.1016/j.jss.2022.111304.
[97] T. Lavoie, M. Mérineau, E. Merlo, and P. Potvin, “A case study of TTCN-3 test scripts clone analysis in an industrial
telecommunication setting,” Inf Softw Technol, vol. 87, pp. 32–45, 2017, doi: 10.1016/j.infsof.2017.01.008.
[98] C. Kustanto and I. Liem, “Automatic source code plagiarism detection,” in 2009 10th ACIS International Conference
on Software Engineering, Artificial Intelligences, Networking and Parallel/Distributed Computing, IEEE, May 2009, pp.
481–486. doi: 10.1109/SNPD.2009.62.
[99] B. Muddu, A. Asadullah, and V. Bhat, “CPDP: a robust technique for plagiarism detection in source code,” in 2013
7th International Workshop on Software Clones (IWSC), IEEE, May 2013, pp. 39–45. doi: 10.1109/IWSC.2013.6613041.
[100] H. Gascon, F. Yamaguchi, D. Arp, and K. Rieck, “Structural detection of Android malware using embedded call
graphs,” Proceedings of the ACM Conference on Computer and Communications Security, pp. 45–54, 2013, doi:
10.1145/2517312.2517315.
[101] F.-H. Su, J. Bell, K. Harvey, S. Sethumadhavan, G. Kaiser, and T. Jebara, “Code relatives: detecting similarly behaving
software,” in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software
Engineering, New York, NY, USA: ACM, Nov. 2016, pp. 702–714. doi: 10.1145/2950290.2950321.
[102] W. Wen et al., “Cross-project software defect prediction based on class code similarity,” IEEE Access, vol. 10, pp.
105485–105495, 2022, doi: 10.1109/ACCESS.2022.3211401.
[103] J. Li and M. D. Ernst, “CBCD: cloned buggy code detector,” in 2012 34th International Conference on Software
Engineering (ICSE), IEEE, Jun. 2012, pp. 310–320. doi: 10.1109/ICSE.2012.6227183.
[104] Y. Yu, Z. Huang, G. Shen, W. Li, and Y. Shao, “ASTENS-BWA: searching partial syntactic similar regions between
source code fragments via AST-based encoded sequence alignment,” Sci Comput Program, vol. 222, p. 102839, Oct.
2022, doi: 10.1016/j.scico.2022.102839.
[105] H. Yonai, Y. Hayase, and H. Kitagawa, “Mercem: method name recommendation based on call graph embedding,”
Proceedings - Asia-Pacific Software Engineering Conference, APSEC, vol. 2019-Decem, pp. 134–141, 2019, doi:
10.1109/APSEC48747.2019.00027.
[106] S. Kurimoto, Y. Hayase, H. Yonai, H. Ito, and H. Kitagawa, “Class name recommendation based on graph embedding
of program elements,” Proceedings - Asia-Pacific Software Engineering Conference, APSEC, vol. 2019-Decem, pp. 498–
505, 2019, doi: 10.1109/APSEC48747.2019.00073.
[107] T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” J Mol Biol, vol. 147, no. 1, pp.
195–197, Mar. 1981, doi: 10.1016/0022-2836(81)90087-5.
[108] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules in large databases,” in Proceedings of the
20th International Conference on Very Large Data Bases, in VLDB ’94. San Francisco, CA, USA: Morgan Kaufmann
Publishers Inc., 1994, pp. 487–499.
[109] J. R. Cordy and C. K. Roy, “The NiCad clone detector,” in 2011 IEEE 19th International Conference on Program
Comprehension, IEEE, Jun. 2011, pp. 219–220. doi: 10.1109/ICPC.2011.26.
[110] D. Jurafsky and J. H. Martin, Speech and language processing (second edition). Upper Saddle River, NJ, USA: Prentice-
Hall, Inc., 2009.
[111] M. J. Wise, “String similarity via greedy string tiling and running Karp-Rabin matching,” Online Preprint, Dec, vol.
119, no. 1, pp. 1–17, 1993.
[112] J. Liu, T. Wang, C. Feng, H. Wang, and D. Li, “A large-gap clone detection approach using sequence alignment via
dynamic parameter optimization,” IEEE Access, vol. 7, pp. 131270–131281, 2019, doi: 10.1109/ACCESS.2019.2940710.
[113] S. T. Dumais, “Latent semantic analysis,” Annual Review of Information Science and Technology, vol. 38, no. 1, pp.
188–230, Sep. 2005, doi: 10.1002/aris.1440380105.
[114] X. Yan, J. Han, and R. Afshar, “CloSpan: mining closed sequential patterns in large datasets,” in Proceedings of the 2003 SIAM International Conference on Data Mining, 2003, pp. 166–177.
[115] P. S. Honnutagi, “The Hadoop distributed file system,” International Journal of Computer Science and Information
Technologies (IJCSIT), vol. 5, no. 5, pp. 6238–6243, 2014.
[116] J. Dean and S. Ghemawat, “MapReduce: a flexible data processing tool,” Commun ACM, vol. 53, no. 1, pp. 72–77, Jan.
2010, doi: 10.1145/1629175.1629198.
[117] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[118] T. Parr and K. Fisher, “LL(*): the foundation of the ANTLR parser generator,” Proceedings of the 32nd ACM SIGPLAN
conference on Programming language design and implementation, pp. 425–436, 2011, doi:
http://doi.acm.org/10.1145/1993498.1993548.
[119] S. Horwitz and T. Reps, “The use of program dependence graphs in software engineering,” in Proceedings of the 14th
international conference on Software engineering - ICSE ’92, New York, New York, USA: ACM Press, 1992, pp. 392–
411. doi: 10.1145/143062.143156.
[120] C. Lattner and V. Adve, “LLVM: a compilation framework for lifelong program analysis and transformation,” in
International Symposium on Code Generation and Optimization, 2004. CGO 2004., IEEE, pp. 75–86. doi:
10.1109/CGO.2004.1281665.
[121] G. Salton, A. Wong, and C. S. Yang, “A vector space model for automatic indexing,” Commun ACM, vol. 18, no. 11,
pp. 613–620, Nov. 1975, doi: 10.1145/361219.361220.
[122] M. Harman, “The role of Artificial Intelligence in Software Engineering,” in 2012 First International Workshop on
Realizing AI Synergies in Software Engineering (RAISE), IEEE, Jun. 2012, pp. 1–6. doi: 10.1109/RAISE.2012.6227961.
[123] E. Schubert, J. Sander, M. Ester, H. P. Kriegel, and X. Xu, “DBSCAN Revisited, Revisited: Why and How You Should
(Still) Use DBSCAN,” ACM Transactions on Database Systems, vol. 42, no. 3, pp. 1–21, Sep. 2017, doi: 10.1145/3068335.
[124] D. Chicco, “Siamese neural networks: an overview,” 2021, pp. 73–94. doi: 10.1007/978-1-0716-0826-5_3.
[125] M. Kwabena Patrick, A. Felix Adekoya, A. Abra Mighty, and B. Y. Edward, “Capsule Networks – A survey,” Journal
of King Saud University - Computer and Information Sciences, vol. 34, no. 1, pp. 1295–1310, Jan. 2022, doi:
10.1016/J.JKSUCI.2019.09.014.
[126] L. Jiang, H. Liu, and H. Jiang, “Machine learning based recommendation of method names: how far are we,” in 2019
34th IEEE/ACM International Conference on Automated Software Engineering (ASE), IEEE, Nov. 2019, pp. 602–614. doi:
10.1109/ASE.2019.00062.
[127] O. Zaitsev, S. Ducasse, A. Bergel, and M. Eveillard, “Suggesting descriptive method names: an exploratory study of
two machine learning approaches,” 2020, pp. 93–106. doi: 10.1007/978-3-030-58793-2_8.
[128] Y. Li, C. Gu, T. Dullien, O. Vinyals, and P. Kohli, “Graph matching networks for learning the similarity of graph
structured objects,” in International conference on machine learning, 2019, pp. 3835–3845.
[129] D. Guo et al., “Graphcodebert: pre-training code representations with data flow,” arXiv preprint arXiv:2009.08366,
2020.
[130] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for
language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019,
Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds., Association for Computational
Linguistics, 2019, pp. 4171–4186. doi: 10.18653/v1/n19-1423.
[131] B. Rozière, M.-A. Lachaux, M. Szafraniec, and G. Lample, “DOBF: a deobfuscation pre-training objective for
programming languages,” CoRR, vol. abs/2102.07492, 2021, [Online]. Available: https://arxiv.org/abs/2102.07492
[132] C. Ragkhitwetsagul, J. Krinke, and B. Marnette, “A picture is worth a thousand words: Code clone detection based
on image similarity,” 2018 IEEE 12th International Workshop on Software Clones, IWSC 2018 - Proceedings, vol. 2018-
Janua, pp. 44–50, 2018, doi: 10.1109/IWSC.2018.8327318.
[133] G. Fraser and A. Arcuri, “EvoSuite: automatic test suite generation for object-oriented software,” in Proceedings of
the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering -
SIGSOFT/FSE ’11, New York, New York, USA: ACM Press, 2011, p. 416. doi: 10.1145/2025113.2025179.
[134] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT Press, 2016. [Online]. Available:
http://www.deeplearningbook.org/
[135] J. Svajlenko, J. F. Islam, I. Keivanloo, C. K. Roy, and M. M. Mia, “Towards a big data curated benchmark of inter-
project code clones,” in 2014 IEEE International Conference on Software Maintenance and Evolution, 2014, pp. 476–
480.
A. Aiken, “MOSS: a system for detecting software similarity.” https://theory.stanford.edu/~aiken/moss/ (accessed
Dec. 15, 2022).
[137] H. Murakami, Y. Higo, and S. Kusumoto, “A dataset of clone references with gaps,” in Proceedings of the 11th Working
Conference on Mining Software Repositories - MSR 2014, New York, New York, USA: ACM Press, 2014, pp. 412–415.
doi: 10.1145/2597073.2597133.
[138] A. Charpentier, J.-R. Falleri, D. Lo, and L. Réveillère, “An empirical assessment of Bellon’s clone benchmark,” in
Proceedings of the 19th International Conference on Evaluation and Assessment in Software Engineering, New York,
NY, USA: ACM, Apr. 2015, pp. 1–10. doi: 10.1145/2745802.2745821.
[139] J. Svajlenko and C. K. Roy, “Evaluating clone detection tools with BigCloneBench,” in 2015 IEEE International
Conference on Software Maintenance and Evolution (ICSME), IEEE, Sep. 2015, pp. 131–140. doi:
10.1109/ICSM.2015.7332459.
[140] J. Krinke and C. Ragkhitwetsagul, “BigCloneBench considered harmful for machine learning,” in 2022 IEEE 16th
International Workshop on Software Clones (IWSC), 2022, pp. 1–7.
[141] H. Wei and M. Li, “Supervised deep features for software functional clone detection by exploiting lexical and
syntactical information in source code,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial
Intelligence, California: International Joint Conferences on Artificial Intelligence Organization, Aug. 2017, pp. 3034–
3040. doi: 10.24963/ijcai.2017/423.
[142] A. Schafer, W. Amme, and T. S. Heinze, “Experiments on code clone detection and machine learning,” in 2022 IEEE
16th International Workshop on Software Clones (IWSC), IEEE, Oct. 2022, pp. 46–52. doi:
10.1109/IWSC55060.2022.00016.
[143] S. Lu et al., “CodeXGLUE: a machine learning benchmark dataset for code understanding and generation,” CoRR,
vol. abs/2102.04664, 2021.
[144] L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin, “Convolutional neural networks over tree structures for programming
language processing,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, in AAAI’16. AAAI
Press, 2016, pp. 1287–1293.
[145] Z. Xue, Z. Jiang, C. Huang, R. Xu, X. Huang, and L. Hu, “SEED: semantic graph based deep detection for type-4
clone,” 2022, pp. 120–137. doi: 10.1007/978-3-031-08129-3_8.
[146] E. Flores, P. Rosso, L. Moreno, and E. Villatoro-Tello, “On the detection of source code re-use,” in Proceedings of the
Forum for Information Retrieval Evaluation on - FIRE ’14, New York, New York, USA: ACM Press, 2015, pp. 21–30.
doi: 10.1145/2824864.2824878.
[147] G. Zhao and J. Huang, “DeepSim: deep learning code functional similarity,” ESEC/FSE 2018 - Proceedings of the 2018
26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software
Engineering, pp. 141–151, 2018, doi: 10.1145/3236024.3236068.
[148] “AtCoder.” https://atcoder.jp/ (accessed Jun. 01, 2023).
[149] D. Perez and S. Chiba, “Cross-language clone detection by learning over abstract syntax trees,” in 2019 IEEE/ACM
16th International Conference on Mining Software Repositories (MSR), IEEE, May 2019, pp. 518–528. doi:
10.1109/MSR.2019.00078.
[150] J. Kim, H. G. Choi, H. Yun, and B. R. Moon, “Measuring source code similarity by finding similar subgraph with an
incremental genetic algorithm,” GECCO 2016 - Proceedings of the 2016 Genetic and Evolutionary Computation
Conference, pp. 925–932, 2016, doi: 10.1145/2908812.2908870.
[151] M. Chilowicz, É. Duris, and G. Roussel, “Viewing functions as token sequences to highlight similarities in source
code,” Sci Comput Program, vol. 78, no. 10, pp. 1871–1891, 2013, doi: 10.1016/j.scico.2012.11.008.
[152] C. Ragkhitwetsagul and J. Krinke, “Siamese: scalable and incremental code clone search via multiple code
representations,” Empir Softw Eng, vol. 24, no. 4, pp. 2236–2284, 2019, doi: 10.1007/s10664-019-09697-7.
[153] H. Sajnani, V. Saini, J. Svajlenko, C. K. Roy, and C. V. Lopes, “SourcererCC: scaling code clone detection to big-
code,” in Proceedings of the 38th International Conference on Software Engineering, New York, NY, USA: ACM, May
2016, pp. 1157–1168. doi: 10.1145/2884781.2884877.
[154] V. Saini, H. Sajnani, J. Kim, and C. Lopes, “SourcererCC and SourcererCC-I: tools to detect clones in batch mode and
during software development,” in Proceedings of the 38th International Conference on Software Engineering
Companion, New York, NY, USA: ACM, May 2016, pp. 597–600. doi: 10.1145/2889160.2889165.
[155] Y.-L. Hung and S. Takada, “CPPCD: a token-based approach to detecting potential clones,” in 2020 IEEE 14th
International Workshop on Software Clones (IWSC), IEEE, Feb. 2020, pp. 26–32. doi: 10.1109/IWSC50091.2020.9047636.
[156] Y. Wu et al., “SCDetector: software functional clone detection based on semantic tokens analysis,” in Proceedings of
the 35th IEEE/ACM International Conference on Automated Software Engineering, New York, NY, USA: ACM, Dec.
2020, pp. 821–833. doi: 10.1145/3324884.3416562.
[157] P. Wang, J. Svajlenko, Y. Wu, Y. Xu, and C. K. Roy, “CCAligner: a token based large-gap clone detector,” in
Proceedings of the 40th International Conference on Software Engineering, New York, NY, USA: ACM, May 2018, pp.
1066–1077. doi: 10.1145/3180155.3180179.
[158] E. Juergens, F. Deissenboeck, and B. Hummel, “CloneDetective - a workbench for clone detection research,” in 2009
IEEE 31st International Conference on Software Engineering, IEEE, 2009, pp. 603–606. doi: 10.1109/ICSE.2009.5070566.
[159] A. Bhattacharjee and H. M. Jamil, “CodeBlast: A two-stage algorithm for improved program similarity matching in
large software repositories,” Proceedings of the ACM Symposium on Applied Computing, pp. 846–852, 2013, doi:
10.1145/2480362.2480525.
[160] Y. Bian, G. Koru, X. Su, and P. Ma, “SPAPE: A semantic-preserving amorphous procedure extraction method for
near-miss clones,” Journal of Systems and Software, vol. 86, no. 8, pp. 2077–2093, 2013, doi: 10.1016/j.jss.2013.03.061.
[161] Y. Zou, B. Ban, Y. Xue, and Y. Xu, “CCGraph: a PDG-based code clone detector with approximate graph matching,”
in 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2020, pp. 931–942.
[162] N. Mehrotra, N. Agarwal, P. Gupta, S. Anand, D. Lo, and R. Purandare, “Modeling functional similarity in source
code with graph-based Siamese networks,” IEEE Transactions on Software Engineering, vol. 48, no. 10, pp. 3771–3789,
Oct. 2022, doi: 10.1109/TSE.2021.3105556.
[163] Z. Xue, Z. Jiang, C. Huang, R. Xu, X. Huang, and L. Hu, “SEED: semantic graph based deep detection for type-4
clone,” 2022, pp. 120–137. doi: 10.1007/978-3-031-08129-3_8.
[164] Y. Higo, U. Yasushi, M. Nishino, and S. Kusumoto, “Incremental code clone detection: A PDG-based approach,”
Proceedings - Working Conference on Reverse Engineering, WCRE, pp. 3–12, 2011, doi: 10.1109/WCRE.2011.11.
[165] H. A. Basit and S. Jarzabek, “A data mining approach for detecting higher-level clones in software,” IEEE Transactions
on Software Engineering, vol. 35, no. 4, pp. 497–514, Jul. 2009, doi: 10.1109/TSE.2009.16.
[166] M. White, M. Tufano, C. Vendome, and D. Poshyvanyk, “Deep learning code fragments for code clone detection,”
ASE 2016 - Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, pp. 87–98,
2016, doi: 10.1145/2970276.2970326.
[167] M. Hammad, Ö. Babur, H. Abdul Basit, and M. van den Brand, “DeepClone: modeling clones to generate code
predictions,” 2020, pp. 135–151. doi: 10.1007/978-3-030-64694-3_9.
[168] Y. Zhang and T. Wang, “CCEyes: an effective tool for code clone detection on large-scale open source repositories,”
in 2021 IEEE International Conference on Information Communication and Software Engineering (ICICSE), IEEE, Mar.
2021, pp. 61–70. doi: 10.1109/ICICSE52190.2021.9404141.
[169] A. Zakari, S. P. Lee, K. A. Alam, and R. Ahmad, “Software fault localisation: a systematic mapping study,” IET
Software, vol. 13, no. 1, pp. 60–74, Feb. 2019, doi: 10.1049/iet-sen.2018.5137.
[170] W. E. Wong, R. Gao, Y. Li, R. Abreu, and F. Wotawa, “A survey on software fault localization,” IEEE Transactions on
Software Engineering, vol. 42, no. 8, pp. 707–740, Aug. 2016, doi: 10.1109/TSE.2016.2521368.
[171] R. Just, D. Jalali, and M. D. Ernst, “Defects4J: a database of existing faults to enable controlled testing studies for Java
programs,” in Proceedings of the 2014 International Symposium on Software Testing and Analysis - ISSTA 2014, New
York, New York, USA: ACM Press, 2014, pp. 437–440. doi: 10.1145/2610384.2628055.
[172] R. Ferenc, Z. Tóth, G. Ladányi, I. Siket, and T. Gyimóthy, “A public unified bug dataset for java and its assessment
regarding metrics and bug prediction,” Software Quality Journal, vol. 28, no. 4, pp. 1447–1506, 2020, doi:
10.1007/s11219-020-09515-0.
[173] L. Gazzola, D. Micucci, and L. Mariani, “Automatic software repair: a survey,” IEEE Transactions on Software
Engineering, no. June, pp. 1–1, 2017, doi: 10.1109/TSE.2017.2755013.
[174] M. Zakeri-Nasrabadi, S. Parsa, E. Esmaili, and F. Palomba, “A systematic literature review on the code smells datasets
and validation mechanisms,” ACM Comput Surv, May 2023, doi: 10.1145/3596908.
[175] S. Duncan, A. Walker, C. DeHaan, S. Alvord, T. Cerny, and P. Tisnovsky, “Pyclone: a Python code clone test bank
generator,” 2021, pp. 235–243. doi: 10.1007/978-981-33-6385-4_22.
[176] Microsoft Corporation, “GitHub.” https://github.com/ (accessed Jan. 07, 2023).
[177] A. Sheneamer and J. Kalita, “Semantic clone detection using machine learning,” in 2016 15th IEEE International
Conference on Machine Learning and Applications (ICMLA), IEEE, Dec. 2016, pp. 1024–1028. doi:
10.1109/ICMLA.2016.0185.
[178] NetworkX, “NetworkX.” https://networkx.github.io/ (accessed Apr. 26, 2019).
[179] F. Pedregosa et al., “Scikit-learn: machine learning in python,” Journal of Machine Learning Research, vol. 12, pp.
2825–2830, 2011, [Online]. Available: http://jmlr.org/papers/v12/pedregosa11a.html
[180] SciTools, “Understand,” 2020. https://www.scitools.com/ (accessed Sep. 11, 2020).
[181] R. Ferenc, P. Siket, and M. Schneider, “OpenStaticAnalyzer,” University of Szeged, 2018. https://github.com/sed-inf-u-szeged/OpenStaticAnalyzer (accessed Jun. 23, 2021).
[182] M. Allamanis, E. T. Barr, C. Bird, and C. Sutton, “Suggesting accurate method and class names,” 2015 10th Joint
Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of
Software Engineering, ESEC/FSE 2015 - Proceedings, pp. 38–49, 2015, doi: 10.1145/2786805.2786849.
[183] M. Dilhara, A. Ketkar, and D. Dig, “Understanding software-2.0,” ACM Transactions on Software Engineering and
Methodology, vol. 30, no. 4, pp. 1–42, Jul. 2021, doi: 10.1145/3453478.
[184] D. Landman, A. Serebrenik, and J. J. Vinju, “Challenges for static analysis of Java reflection - literature review and
empirical study,” in 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), IEEE, May 2017,
pp. 507–518. doi: 10.1109/ICSE.2017.53.
[185] Mayrand, Leblanc, and Merlo, “Experiment on the automatic detection of function clones in a software system using
metrics,” in Proceedings of International Conference on Software Maintenance ICSM-96, IEEE, 1996, pp. 244–253. doi:
10.1109/ICSM.1996.565012.
[186] R. Komondoor and S. Horwitz, “Using slicing to identify duplication in source code,” 2001, pp. 40–56. doi: 10.1007/3-
540-47764-0_3.
[187] Z. Li, S. Lu, S. Myagmar, and Y. Zhou, “CP-Miner: finding copy-paste and related bugs in large-scale software code,”
IEEE Transactions on Software Engineering, vol. 32, no. 3, pp. 176–192, Mar. 2006, doi: 10.1109/TSE.2006.28.
[188] T. Sager, A. Bernstein, M. Pinzger, and C. Kiefer, “Detecting similar Java classes using tree algorithms,” Proceedings
- International Conference on Software Engineering, pp. 65–71, 2006, doi: 10.1145/1137983.1138000.
[189] M. Gabel, L. Jiang, and Z. Su, “Scalable detection of semantic clones,” Proceedings - International Conference on
Software Engineering, pp. 321–330, 2008, doi: 10.1145/1368088.1368132.
[190] R. Falke, P. Frenzel, and R. Koschke, “Empirical evaluation of clone detection using syntax suffix trees,” Empirical
Software Engineering, vol. 13, no. 6, pp. 601–643, Dec. 2008, doi: 10.1007/s10664-008-9073-9.
[191] C. K. Roy, “Detection and analysis of near-miss software clones,” in 2009 IEEE International Conference on Software
Maintenance, IEEE, Sep. 2009, pp. 447–450. doi: 10.1109/ICSM.2009.5306301.
[192] W. S. Evans, C. W. Fraser, and F. Ma, “Clone detection via structural abstraction,” Software Quality Journal, vol. 17,
pp. 309–330, 2009.
[193] G. M. K. Selim, K. C. Foo, and Y. Zou, “Enhancing source-based clone detection using intermediate representation,”
in 2010 17th Working Conference on Reverse Engineering, IEEE, Oct. 2010, pp. 227–236. doi: 10.1109/WCRE.2010.33.
[194] S. U. Rehman, K. Khan, S. Fong, and R. Biuk-Aghai, “An efficient new multi-language clone detection approach from
large source code,” in 2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC), IEEE, 2012, pp. 937–940. doi: 10.1109/ICSMC.2012.6377848.
[195] R. Tekchandani, R. K. Bhatia, and M. Singh, “Semantic code clone detection using parse trees and grammar
recovery,” IET Conference Publications, vol. 2013, no. 647 CP, pp. 41–46, 2013, doi: 10.1049/cp.2013.2291.
[196] R. K. Tekchandani and K. Raheja, “An efficient code clone detection model on Java byte code using hybrid approach,”
in Confluence 2013: The Next Generation Information Technology Summit (4th International Conference), Institution
of Engineering and Technology, 2013, pp. 1.04-1.04. doi: 10.1049/cp.2013.2287.
[197] A. Agrawal and S. K. Yadav, “A hybrid-token and textual based approach to find similar code segments,” 2013 4th
International Conference on Computing, Communications and Networking Technologies, ICCCNT 2013, pp. 4–7, 2013,
doi: 10.1109/ICCCNT.2013.6726700.
[198] H. Murakami, K. Hotta, Y. Higo, H. Igaki, and S. Kusumoto, “Gapped code clone detection with lightweight source
code analysis,” in 2013 21st International Conference on Program Comprehension (ICPC), IEEE, May 2013, pp. 93–102.
doi: 10.1109/ICPC.2013.6613837.
[199] R. K. Saha, C. K. Roy, and K. A. Schneider, “gCad: a near-miss clone genealogy extractor to support clone evolution
analysis,” in 2013 IEEE International Conference on Software Maintenance, IEEE, Sep. 2013, pp. 488–491. doi:
10.1109/ICSM.2013.79.
[200] S. Cesare, Y. Xiang, and J. Zhang, “Clonewise – detecting package-level clones using machine learning,” 2013, pp.
197–215. doi: 10.1007/978-3-319-04283-1_13.
[201] Y. Higo and S. Kusumoto, “How should we measure functional sameness from program source code? An exploratory
study on Java methods,” in Proceedings of the ACM SIGSOFT Symposium on the Foundations of Software Engineering,
2014, pp. 294–305. doi: 10.1145/2635868.2635886.
[202] E. Kodhai and S. Kanmani, “Method-level code clone detection through LWH (light weight hybrid) approach,”
Journal of Software Engineering Research and Development, vol. 2, no. 1, pp. 1–29, 2014, doi: 10.1186/s40411-014-0012-
8.
[203] A. Avetisyan, S. Kurmangaleev, S. Sargsyan, M. Arutunian, and A. Belevantsev, “LLVM-based code clone detection
framework,” in 2015 Computer Science and Information Technologies (CSIT), IEEE, Sep. 2015, pp. 100–104. doi:
10.1109/CSITechnol.2015.7358259.
[204] I. Keivanloo, F. Zhang, and Y. Zou, “Threshold-free code clone detection for a large-scale heterogeneous Java
repository,” in 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER),
IEEE, Mar. 2015, pp. 201–210. doi: 10.1109/SANER.2015.7081830.
[205] J. Kim, T. G. Kim, and E. G. Im, “Structural information based malicious app similarity calculation and clustering,”
in Proceedings of the 2015 Conference on Research in Adaptive and Convergent Systems, New York, NY, USA: ACM, Oct.
2015, pp. 314–318. doi: 10.1145/2811411.2811545.
[206] T. Kamiya, “An execution-semantic and content-and-context-based code-clone detection and analysis,” 2015 IEEE
9th International Workshop on Software Clones, IWSC 2015 - Proceedings, pp. 1–7, 2015, doi:
10.1109/IWSC.2015.7069882.
[207] B. Joshi, P. Budhathoki, W. L. Woon, and D. Svetinovic, “Software clone detection using clustering approach,” 2015,
pp. 520–527. doi: 10.1007/978-3-319-26535-3_59.
[208] T. Schmorleiz and R. Lämmel, “Similarity management of ‘cloned and owned’ variants,” in Proceedings of the 31st
Annual ACM Symposium on Applied Computing, New York, NY, USA: ACM, Apr. 2016, pp. 1466–1471. doi:
10.1145/2851613.2851785.
[209] M. Sudhamani and L. Rangarajan, “Code clone detection based on order and content of control statements,”
Proceedings of the 2016 2nd International Conference on Contemporary Computing and Informatics, IC3I 2016, pp. 59–
64, 2016, doi: 10.1109/IC3I.2016.7917935.
[210] A. Chandran, L. Jain, S. Rawat, and K. Srinathan, “Discovering vulnerable functions: a code similarity based
approach,” 2016, pp. 390–402. doi: 10.1007/978-981-10-2738-3_34.
[211] L. Li, H. Feng, W. Zhuang, N. Meng, and B. Ryder, “CCLearner: a deep learning-based clone detection approach,” in
2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, Sep. 2017, pp. 249–260. doi:
10.1109/ICSME.2017.46.
[212] C. V. Lopes et al., “DéjàVu: a map of code duplicates on GitHub,” Proceedings of the ACM on Programming Languages,
vol. 1, no. OOPSLA, pp. 1–28, 2017.
[213] T. Lavoie, M. Mérineau, E. Merlo, and P. Potvin, “A case study of TTCN-3 test scripts clone analysis in an industrial
telecommunication setting,” Information and Software Technology, vol. 87, pp. 32–45, 2017, doi: 10.1016/j.infsof.2017.01.008.
[214] C. Ragkhitwetsagul and J. Krinke, “Using compilation/decompilation to enhance clone detection,” in 2017 IEEE 11th
International Workshop on Software Clones (IWSC), IEEE, Feb. 2017, pp. 1–7. doi: 10.1109/IWSC.2017.7880502.
[215] D. Zou et al., “SCVD: a new semantics-based approach for cloned vulnerable code detection,” 2017, pp. 325–344. doi:
10.1007/978-3-319-60876-1_15.
[216] M. Tufano, C. Watson, G. Bavota, M. di Penta, M. White, and D. Poshyvanyk, “Deep learning similarities from
different representations of source code,” in 2018 IEEE/ACM 15th International Conference on Mining Software
Repositories (MSR), 2018, pp. 542–553.
[217] G. Mostaeen, J. Svajlenko, B. Roy, C. K. Roy, and K. A. Schneider, “On the use of machine learning techniques
towards the design of cloud based automatic code clone validation tools,” Proceedings - 18th IEEE International
Working Conference on Source Code Analysis and Manipulation, SCAM 2018, pp. 155–164, 2018, doi:
10.1109/SCAM.2018.00025.
[218] R. Tajima, M. Nagura, and S. Takada, “Detecting functionally similar code within the same project,” in 2018 IEEE
12th International Workshop on Software Clones (IWSC), IEEE, Mar. 2018, pp. 51–57. doi: 10.1109/IWSC.2018.8327319.
[219] T. Vislavski, G. Rakic, N. Cardozo, and Z. Budimac, “LICCA: a tool for cross-language clone detection,” in 2018 IEEE
25th International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE, Mar. 2018, pp. 512–
516. doi: 10.1109/SANER.2018.8330250.
[220] Y. Wang and D. Liu, “Image-based clone code detection and visualization,” in 2019 International Conference on
Artificial Intelligence and Advanced Manufacturing (AIAM), IEEE, Oct. 2019, pp. 168–175. doi:
10.1109/AIAM48774.2019.00041.
[221] Y. Gao, Z. Wang, S. Liu, L. Yang, W. Sang, and Y. Cai, “TECCD: a tree embedding approach for code clone
detection,” Proceedings - 2019 IEEE International Conference on Software Maintenance and Evolution, ICSME 2019, pp.
145–156, 2019, doi: 10.1109/ICSME.2019.00025.
[222] J. Zeng, K. Ben, X. Li, and X. Zhang, “Fast code clone detection based on weighted recursive autoencoders,” IEEE
Access, vol. 7, pp. 125062–125078, 2019, doi: 10.1109/ACCESS.2019.2938825.
[223] J. Zhang, X. Wang, H. Zhang, H. Sun, K. Wang, and X. Liu, “A novel neural source code representation based on
abstract syntax tree,” in 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), IEEE, May
2019, pp. 783–794. doi: 10.1109/ICSE.2019.00086.
[224] G. Mostaeen, J. Svajlenko, B. Roy, C. K. Roy, and K. A. Schneider, “CloneCognition: machine learning based code
clone validation tool,” in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering
Conference and Symposium on the Foundations of Software Engineering, New York, NY, USA: ACM, Aug. 2019, pp.
1105–1109. doi: 10.1145/3338906.3341182.
[225] M. Sudhamani and L. Rangarajan, “Code similarity detection through control statement and program features,”
Expert Systems with Applications, vol. 132, pp. 63–75, 2019, doi: 10.1016/j.eswa.2019.04.045.
[226] D. Tukaram and B. Uma Maheswari, “Design and development of software tool for code clone search, detection, and
analysis,” Proceedings of the 3rd International Conference on Electronics and Communication and Aerospace
Technology, ICECA 2019, pp. 1002–1006, 2019, doi: 10.1109/ICECA.2019.8821928.
[227] J. Yang, Y. Xiong, and J. Ma, “A function level Java code clone detection method,” Proceedings of 2019 IEEE 4th
Advanced Information Technology, Electronic and Automation Control Conference, IAEAC 2019, pp. 2128–
2134, 2019, doi: 10.1109/IAEAC47372.2019.8998079.
[228] M. Gharehyazie, B. Ray, M. Keshani, M. S. Zavosht, A. Heydarnoori, and V. Filkov, “Cross-project code clones in
GitHub,” Empirical Software Engineering, vol. 24, no. 3, pp. 1538–1573, Jun. 2019, doi: 10.1007/s10664-018-9648-z.
[229] G. Li et al., “SAGA: efficient and large-scale detection of near-miss clones with GPU acceleration,” in 2020 IEEE 27th
International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE, Feb. 2020, pp. 272–283.
doi: 10.1109/SANER48275.2020.9054832.
[230] Y. Yuan, W. Kong, G. Hou, Y. Hu, M. Watanabe, and A. Fukuda, “From local to global semantic clone detection,” in
2019 6th International Conference on Dependable Systems and Their Applications (DSA), IEEE, Jan. 2020, pp. 13–24.
doi: 10.1109/DSA.2019.00012.
[231] P. M. Caldeira, K. Sakamoto, H. Washizaki, Y. Fukazawa, and T. Shimada, “Improving syntactical clone detection
methods through the use of an intermediate representation,” in 2020 IEEE 14th International Workshop on Software
Clones (IWSC), IEEE, Feb. 2020, pp. 8–14. doi: 10.1109/IWSC50091.2020.9047637.
[232] H. Xue, Y. Mei, K. Gogineni, G. Venkataramani, and T. Lan, “Twin-Finder: integrated reasoning engine for pointer-
related code clone detection,” in 2020 IEEE 14th International Workshop on Software Clones (IWSC), IEEE, Feb. 2020,
pp. 1–7. doi: 10.1109/IWSC50091.2020.9047638.
[233] M. Wu, P. Wang, K. Yin, H. Cheng, Y. Xu, and C. K. Roy, “LVMapper: a large-variance clone detector using
sequencing alignment approach,” IEEE Access, vol. 8, pp. 27986–27997, 2020, doi: 10.1109/ACCESS.2020.2971545.
[234] G. Mostaeen, B. Roy, C. K. Roy, K. Schneider, and J. Svajlenko, “A machine learning based framework for code clone
validation,” Journal of Systems and Software, vol. 169, p. 110686, 2020, doi: 10.1016/j.jss.2020.110686.
[235] W. Dong, Z. Feng, H. Wei, and H. Luo, “A novel code stylometry-based code clone detection strategy,” in 2020
International Wireless Communications and Mobile Computing (IWCMC), IEEE, Jun. 2020, pp. 1516–1521. doi:
10.1109/IWCMC48107.2020.9148302.
[236] A. Zhang, K. Liu, L. Fang, Q. Liu, X. Yun, and S. Ji, “Learn to align: a code alignment network for code clone
detection,” in 2021 28th Asia-Pacific Software Engineering Conference (APSEC), IEEE, Dec. 2021, pp. 1–11. doi:
10.1109/APSEC53868.2021.00008.
[237] W. Amme, T. S. Heinze, and A. Schafer, “You look so different: finding structural clones and subclones in Java source
code,” in 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, Sep. 2021, pp. 70–
80. doi: 10.1109/ICSME52107.2021.00013.
[238] N. D. Q. Bui, Y. Yu, and L. Jiang, “InferCode: self-supervised learning of code representations by predicting subtrees,”
in 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), IEEE, May 2021, pp. 1186–1197. doi:
10.1109/ICSE43902.2021.00109.
[239] W. Hua, Y. Sui, Y. Wan, G. Liu, and G. Xu, “FCCA: hybrid code representation for functional clone detection using
attention networks,” IEEE Transactions on Reliability, vol. 70, no. 1, pp. 304–318, Mar. 2021, doi: 10.1109/TR.2020.3001918.
[240] A. Schafer, W. Amme, and T. S. Heinze, “Stubber: compiling source code into bytecode without dependencies for
Java code clone detection,” in 2021 IEEE 15th International Workshop on Software Clones (IWSC), IEEE, Oct. 2021, pp.
29–35. doi: 10.1109/IWSC53727.2021.00011.
[241] A. Sheneamer, S. Roy, and J. Kalita, “An effective semantic code clone detection framework using pairwise feature
fusion,” IEEE Access, vol. 9, pp. 84828–84844, 2021, doi: 10.1109/ACCESS.2021.3079156.
[242] S. B. Ankali and L. Parthiban, “Development of porting analyzer to search cross-language code clones using
levenshtein distance,” 2021, pp. 623–632. doi: 10.1007/978-981-16-0878-0_60.
[243] H. Jin, Z. Cui, S. Liu, and L. Zheng, “Improving code clone detection accuracy and efficiency based on code
complexity analysis,” in 2022 9th International Conference on Dependable Systems and Their Applications (DSA), IEEE,
Aug. 2022, pp. 64–72. doi: 10.1109/DSA56465.2022.00017.
[244] C. Tao, Q. Zhan, X. Hu, and X. Xia, “C4: contrastive cross-language code clone detection,” in Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension (ICPC), 2022.
[245] Z. Li et al., “Unleashing the power of compiler intermediate representation to enhance neural program embeddings,”
in Proceedings of the 44th International Conference on Software Engineering, New York, NY, USA: ACM, May 2022,
pp. 2253–2265. doi: 10.1145/3510003.3510217.
[246] M. Hammad, O. Babur, H. A. Basit, and M. van den Brand, “Clone-Seeker: effective code clone search using
annotations,” IEEE Access, vol. 10, pp. 11696–11713, 2022, doi: 10.1109/ACCESS.2022.3145686.
[247] S. Karthik and B. Rajdeepa, “A collaborative method for code clone detection using a deep learning model,” Advances
in Engineering Software, vol. 174, p. 103327, Dec. 2022, doi: 10.1016/j.advengsoft.2022.103327.
[248] M. Chochlov et al., “Using a nearest-neighbour, BERT-based approach for scalable clone detection,” in 2022 IEEE
International Conference on Software Maintenance and Evolution (ICSME), IEEE, Oct. 2022, pp. 582–591. doi:
10.1109/ICSME55016.2022.00080.
[249] F. Leone and S. Takada, “Towards overcoming type limitations in semantic clone detection,” in 2022 IEEE 16th
International Workshop on Software Clones (IWSC), IEEE, Oct. 2022, pp. 25–31. doi: 10.1109/IWSC55060.2022.00013.
[250] M. H. Islam, R. Paul, and M. Mondal, “Predicting buggy code clones through machine learning,” in Proceedings of
the 32nd Annual International Conference on Computer Science and Software Engineering, 2022, pp. 130–139. [Online].
Available: https://dl.acm.org/doi/10.5555/3566055.3566070
[251] Y. Hu, D. Zou, J. Peng, Y. Wu, J. Shan, and H. Jin, “TreeCen: building tree graph for scalable semantic code clone
detection,” in 37th IEEE/ACM International Conference on Automated Software Engineering, New York, NY, USA:
ACM, Oct. 2022, pp. 1–12. doi: 10.1145/3551349.3556927.
[252] Y. Wu, S. Feng, D. Zou, and H. Jin, “Detecting semantic code clones by building AST-based Markov chains model,”
in 37th IEEE/ACM International Conference on Automated Software Engineering, New York, NY, USA: ACM, Oct.
2022, pp. 1–13. doi: 10.1145/3551349.3560426.
[253] H. Thaller, L. Linsbauer, and A. Egyed, “Semantic clone detection via probabilistic software modeling,” 2022, pp.
288–309. doi: 10.1007/978-3-030-99429-7_16.
[254] L. Yang et al., “FastDCF: a partial index based distributed and scalable near-miss code clone detection approach for
very large code repositories,” 2022, pp. 210–222. doi: 10.1007/978-3-030-96772-7_20.
[255] S. Patel and R. Sinha, “Combining holistic source code representation with siamese neural networks for detecting
code clones,” 2022, pp. 148–159. doi: 10.1007/978-3-031-04673-5_12.
[256] X. Guo, R. Zhang, L. Zhou, and X. Lu, “Precise code clone detection with architecture of abstract syntax trees,” 2022,
pp. 117–126. doi: 10.1007/978-3-031-19211-1_10.
[257] Y. Li, C. Yu, and Y. Cui, “TPCaps: a framework for code clone detection and localization based on improved CapsNet,”
Applied Intelligence, Dec. 2022, doi: 10.1007/s10489-022-03158-3.
[258] A. Zhang, L. Fang, C. Ge, P. Li, and Z. Liu, “Efficient transformer with code token learner for code clone detection,”
Journal of Systems and Software, vol. 197, p. 111557, Mar. 2023, doi: 10.1016/j.jss.2022.111557.
[259] O. Ehsan, F. Khomh, Y. Zou, and D. Qiu, “Ranking code clones to support maintenance activities,” Empirical Software Engineering,
vol. 28, no. 3, p. 70, Jun. 2023, doi: 10.1007/s10664-023-10292-0.
[260] W. Wang, Z. Deng, Y. Xue, and Y. Xu, “CCStokener: fast yet accurate code clone detection with semantic token,”
Journal of Systems and Software, vol. 199, p. 111618, May 2023, doi: 10.1016/j.jss.2023.111618.