Potential of LLM in Semantic and Cross Language Clone

∗Both authors contributed equally.

Abstract—Semantic and cross-language code clone generation may be useful for code reuse, code comprehension, refactoring and benchmarking. OpenAI's GPT model has potential in such clone generation as GPT is used for text generation. When developers copy/paste code from Stack Overflow (SO) or within a system, there might be inconsistent changes leading to unexpected behaviours. Similarly, if someone possesses a code snippet in a particular programming language but seeks equivalent functionality in a different language, a semantic cross-language code clone generation approach could provide valuable assistance. In this study, using SemanticCloneBench as a vehicle, we evaluated how well the GPT-3 model could help generate semantic and cross-language clone variants for a given fragment. We compiled a diverse set of code fragments and assessed GPT-3's performance in generating code variants. Through extensive experimentation and analysis, where nine judges spent 158 hours on validation, we investigate the model's ability to produce accurate and semantically correct variants. Our findings shed light on GPT-3's strengths in code generation, offering insights into the potential applications and challenges of using advanced language models in software development. Our quantitative analysis yields compelling results: in the realm of semantic clones, GPT-3 attains an accuracy of 62.14% and a 0.55 BLEU score, achieved through few-shot prompt engineering; furthermore, the model transcends linguistic confines, reaching an accuracy of 91.25% in generating cross-language clones.

Index Terms—Language Models, Software Clone, Semantic Clone, Cross-language Clone, GPT, SemanticCloneBench, Software Engineering

I. Introduction

Clones are almost identical code copies. One of the most prominent causes of clones is developers copying and pasting code between software projects. Research indicates that 7-23% of software systems are recycled from previous projects [1] [2]. Semantic clones promote code reuse, consistency, and productivity throughout the software development lifecycle, giving developers a strategic advantage. Using replicas allows developers to focus on innovation and complex issues, leading to faster development cycles and better software solutions [3] [4] [5]. Developers can improve software quality, development costs, risks, bug prevention, and detection by monitoring and restructuring clones [6].

The dynamic world of software development requires constant innovation and efficiency. Code variant creation, a complicated process that reuses and duplicates code segments to overcome initial development limits, is vital to achieving these goals. As software systems grow in breadth and complexity, efficient code use and adaptation become more important. Research indicates that programmers introduce type-3/type-4 (semantic) clones in the commits of each project at a rate of 6.32% to 8.38% [7]. Additionally, developers commonly copy/paste and reuse code throughout the software system, which may cause issues or introduce bugs [8]–[10]. Semantic clones or variants are essential to prevent inconsistencies within software systems. Wu et al. [11] searched for Java files using "stackoverflow" and manually inspected the results. The researchers found that in 31.5% of their samples, developers had to modify SO source code for compatibility with their projects. Additionally, 35.5% of software engineers used SO posts for reference rather than copying code samples. It is evident that software developers often copy/paste and reuse code fragments during development or attempt to reuse code fragments from crowdsourced forums such as SO for certain functionality. However, the code fragment at hand may not be their first choice for various reasons, ranging from coding structure complexity to the potential of having bugs [8] [9]. Furthermore, directly copying code fragments and then adapting them is associated with introducing inconsistent changes in systems, resulting in severe unexpected behaviour of the software system [8], [9], so developers may be looking for an alternative semantically similar fragment. Sometimes, developers may simply look for an alternative implementation (e.g., semantically similar) of a code fragment they currently have in their system and improve their system through refactoring. Similarly, one might have a code fragment in a certain programming language but be looking for similar functionality in a different language [12]. Moreover, given that clone detection is an active research area, semantic and cross-language clone generation may help build benchmarks for evaluating and comparing such tools and techniques. Because of GPT's good performance in code and text generation, we utilized GPT-3 to generate semantic and cross-language clones.
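To make the notion of a semantic (Type-4) clone concrete, consider the following illustrative pair. This is our own constructed example rather than a fragment drawn from any benchmark: the two functions share almost no syntax, yet implement identical functionality.

def factorial_iterative(n: int) -> int:
    # Accumulate the product with an explicit loop.
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

def factorial_functional(n: int) -> int:
    # Fold multiplication over the same range: little shared syntax
    # with the loop version above, but identical behaviour, i.e., the
    # two functions form a semantic clone pair.
    from functools import reduce
    from operator import mul
    return reduce(mul, range(1, n + 1), 1)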
In this research, we explore the efficacy of the GPT-3 model in generating semantic and cross-language clones. We followed a methodology similar to that of the GPTCloneBench study [13] in generating semantic and cross-language clones using GPT-3 and SemanticCloneBench. In particular, we randomly chose 15,000 semantic clone pairs and 12,000 cross-language clone pairs that were generated as part of GPTCloneBench's prompt engineering/clone generation step. These are a subset of the intermediate data from the GPTCloneBench study before going into further validation towards building the clone benchmark. After that, to remove the syntactic clones from this data, we followed an approach similar to the GPTCloneBench paper's methodology by utilizing NiCad. After NiCad filtration, we underwent a different and more in-depth manual validation process that confirmed consistent output for identical inputs. Post manual validation, GPT-3 exhibited a 62.14% accuracy with a 0.55 BLEU [14] score in generating semantic clones and an impressive 91.25% accuracy in cross-language clones. We employed BLEU to assess the degree of divergence between generated code fragments and human-written code. To reinforce our findings, we also utilized Oreo [15] for semantic clone detection.

The distinction between the GPTCloneBench paper and the current study lies in their respective focuses and manual validation. In the GPTCloneBench paper, our emphasis was on introducing a comprehensive benchmark of semantic clones and cross-language clones. In the present study, however, our objective shifts towards a more in-depth investigation of GPT's efficiency in formulating semantic and cross-language clones. This involves conducting extensive manual evaluations, where human experts meticulously review and compare the generated clones against the original code snippets. In GPTCloneBench, we did not add any snippets to the benchmark that had been tagged false by any judge or that had any conflict among the judges. The present research takes a more nuanced approach: here, we rigorously investigated the conflicted code snippets to ascertain whether these disputed snippets truly qualify as semantic clones or not. Furthermore, we employed Cohen's kappa [16] agreement metric to quantify the level of agreement among judges regarding code snippets. This metric provides us with an empirical measure of the consistency of the judgment among the evaluators. Through this thorough manual process, we seek to uncover the true extent of GPT's aptitude for generating semantic and cross-language clones for a given code fragment. This study adds a layer of empirical validation to our findings of GPTCloneBench, enabling us to draw more robust conclusions about GPT's capabilities in these specific domains.

As we navigate this territory, we confront inquiries that resonate with the fundamental aspects of our research on software development: (RQ1) To what extent does GPT-3 demonstrate the capability to generate high-quality semantic clones? (RQ2) What is the efficacy of GPT-3 in accurately converting code snippets from one programming language to another? The results of our study not only illuminate these research questions but also offer guidance for effectively utilising sophisticated language models in the field of software development, considering both their potential applications and associated limitations.

The remaining sections are organized as follows: Section II discusses the background of our study. The architecture of generating clones from GPT-3 is described in Section III. Section IV covers our manual validation. Section V presents the findings on GPT-3's accuracy, tests a clone detection tool and analyzes the results. Section VI discusses the threats to the validity of our research. Related work is described in Section VII, and Section VIII concludes the paper.

II. Background

Identical or similar code segments within a codebase are termed code clones, with the first being a clone of the second, the two constituting a clone pair [17]–[19]. Diverse terms are used for defining clones, including relative [20], redundant [21] [22], dependent [23], functional [24], functionally similar [25] [26] [27], and Type-4 [6], [17] clones. While researchers agree on semantic clones sharing functionality but differing in syntax, no uniformity exists about the precise degree of semantic similarity. Semantic clone definitions vary, from narrow interpretations focusing on specific similarities to broader, less precise ones. Nevertheless, the consensus remains that semantic clones involve identical functionality with differing syntax [28] [19].

We have used SemanticCloneBench [29] to facilitate our research. SemanticCloneBench [29] is a dataset of semantically equivalent code snippets intended to help researchers develop and evaluate techniques for detecting semantic clones.

III. Architecting Semantic Clones

In this section, we focus on processing semantic and cross-language clones. We have utilized the results that we got after prompt engineering from the GPTCloneBench paper [13]. The clone generation process in the GPTCloneBench paper starts with the selection of the initial clone fragment from the clone pair of SemanticCloneBench. To assist with this, an automated script was developed for GPTCloneBench to identify functions, which were later given as input to GPT-3. For prompt engineering, the few-shot prompting technique was employed: the GPT-3 model was provided with textual instructions and a representative input to indicate the type of output anticipated. As discussed in the GPTCloneBench paper, two prompts were used to generate the clones. To generate cross-language clones, the emphasis was given to two programming languages, Java and C#, which were used as input for GPT-3 in the GPTCloneBench paper.
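As an illustration of this few-shot setup, the sketch below shows how such a request could be issued against the GPT-3 completion endpoint. The instruction text, the example pair, and the decoding parameters are our assumptions for exposition; the exact prompts are documented in the GPTCloneBench paper [13].

import openai  # legacy OpenAI completion API

openai.api_key = "YOUR_API_KEY"

# Hypothetical few-shot prompt: a textual instruction plus a small
# input/variant example indicating the kind of output expected.
PROMPT_TEMPLATE = """Generate a semantically equivalent variant of the given function.

Input:
def add(a, b):
    return a + b

Variant:
def add(x, y):
    return sum([x, y])

Input:
{fragment}

Variant:
"""

def generate_variants(fragment, n_outputs=4):
    # The study's first prompt produced four outputs per input.
    response = openai.Completion.create(
        model="text-davinci-003",  # model used in this study
        prompt=PROMPT_TEMPLATE.format(fragment=fragment),
        max_tokens=512,            # assumed value
        temperature=0.7,           # assumed value
        n=n_outputs,
    )
    return [choice.text.strip() for choice in response.choices]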
As a result, GPT-3 created 80,664 semantic clone pairs and 22,364 cross-language clone pairs. After GPT generated the clones from the given input, we randomly selected 15,000 semantic clones and 12,000 cross-language clones and used this data to conduct this research. This data (15,000 semantic and 12,000 cross-language clones) represents the data of the GPTCloneBench paper before going into any validation (including NiCad and manual validation). After that, we first employed the textual similarity measurement tool NiCad [30] to exclude syntactic clones. Second, a rigorous manual validation process (Section IV) was undertaken for all prospective Type-3, Type-4, and cross-language clones.

We utilized the established framework of BigCloneBench [18] for code clone classification, with allowances for slight variations within a defined grey area. Moderately Type-3 (MT3) clones [18] exhibit 50%-70% similarity, supplemented by a 5% grey area. Weakly Type-3/Type-4 (WT3/4) clones [18] align with Type-4 clones, marked by 0%-50% similarity. This framework extends to cross-language clones, treating them as Type-4 due to shared logic despite diverse programming languages. Notably, while not all semantic clones are cross-language clones, all cross-language clones fall under this category.

To remove syntactic clones, NiCad was configured with a 99% dissimilarity threshold to identify Type-1 and Type-2 clones, as NiCad cannot detect Type-4 clones. For semantic clone detection, we analyze the similarity percentages in the metadata file, facilitated by the 3-line minimum size and blind renaming in NiCad. Pairs exceeding 75% similarity are discarded; those at or below 75% are saved for further manual validation. For this research, NiCad filtered out a total of 4,379 syntactic clones.
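A minimal sketch of this filtering step is shown below. It assumes a NiCad XML report in which each clone-pair element carries a similarity attribute; element and attribute names can differ across NiCad versions and configurations, so this is an illustrative assumption rather than our exact script.

import xml.etree.ElementTree as ET

def filter_nicad_report(report_path, threshold=75):
    # Pairs above the threshold are treated as syntactic clones and
    # dropped; pairs at or below it proceed to manual validation.
    candidates, syntactic = [], []
    for clone in ET.parse(report_path).getroot().iter("clone"):
        similarity = int(clone.get("similarity", "0"))
        pair = tuple(src.get("file") for src in clone.iter("source"))
        (syntactic if similarity > threshold else candidates).append(pair)
    return candidates, syntactic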
In cross-language clone detection, NiCad is inapplicable due to the differing programming languages. Nonetheless, the generated cross-language clones undergo manual validation. Furthermore, another validation process (input-output testing) has been adopted to ensure the code clones follow the same functionality.
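The following sketch illustrates the input-output testing idea for Python fragments: both members of a candidate pair are executed on the same inputs and their results compared. The harness and the entry-point convention (each fragment defines a function named f) are assumptions made for this illustration.

def same_behaviour(src_a: str, src_b: str, test_inputs) -> bool:
    # Execute each fragment in its own namespace; each is expected
    # to define a function f.
    env_a, env_b = {}, {}
    exec(src_a, env_a)
    exec(src_b, env_b)
    return all(env_a["f"](*args) == env_b["f"](*args) for args in test_inputs)

agrees = same_behaviour(
    "def f(n):\n    return sum(range(1, n + 1))",
    "def f(n):\n    return n * (n + 1) // 2",
    [(0,), (1,), (10,), (100,)],
)
print(agrees)  # True: the two fragments are behaviourally equivalent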
IV. Human-Centric Analysis

After filtering out undesired clones as described earlier, we engaged in a rigorous manual validation process. This involved thoroughly examining all code fragments to determine whether the filtered data was accurate and whether the clone pairs produced the same output for the same input. After file generation, we manually validated the clone pairs to ensure their validity. To facilitate accurate assessment, BigCloneBench's GUI-based Clone Validator (https://github.com/jeffsvajlenko/ValidateClones) was utilized, which provided syntax highlighting for the candidate code and displayed the exemplar functions and specifications alongside the candidate for reference.

During the validation procedure, a cohort of nine judges took part, consisting of six undergraduate research students and three post-doctoral researchers. The undergraduate students were partitioned into three cohorts, with each cohort comprising a pair of students, and the dataset was subsequently divided into three distinct portions. In this research, after NiCad filtration, we got 10,621 semantic clones and 12,000 cross-language clones. In GPTCloneBench, if the judges had a conflict or their decisions did not match, we discarded that pair; here, in contrast, we also validated the clones that faced conflicts or were tagged false positive by anyone within the group. Our manual validation consists of three rounds.

In round one, we divided the semantic clone pairs into three groups consisting of 3,540, 3,540, and 3,541 pairs, respectively. Each group contained two members. Each person in every group conducted an individual assessment of their designated portion, categorising the clone pairs as either true positive, false positive, or undecided based on their understanding. Every group was given different code fragments, but both members of a group received the same code fragments. For a clone pair to be considered a true semantic pair, both members of the group had to tag it as true; conflicting results within a group led to excluding that pair from the true pairs listing and sending it for further validation. The first six judges followed this procedure. In round one, group_1 tagged 2,947 pairs as true semantic, 80 as false semantic and 513 as undecided or conflicted; group_2 tagged 2,953 as true semantic, 87 as false semantic and 500 as undecided or conflicted; and group_3 tagged 2,979 as true semantic, 98 as false semantic and 464 as undecided or conflicted. Cohen's kappa is 0.700 for group_1, 0.730 for group_2 and 0.77 for group_3, which means all groups are in substantial agreement. In round two, we shuffled the undecided or conflicted pairs among the three groups: the first two groups were given 492 pairs each, and the last group was given 493 pairs. In the second round, Cohen's kappa for the three groups is 0.52, 0.58 and 0.54, respectively, which means moderate agreement. This outcome can be attributed to the intricate nature of the code under consideration. Given the participants' status as undergraduates, reaching definitive decisions becomes challenging due to the complexity of the code snippets; these snippets, which were shuffled and carried uncertainties from the initial round, further contribute to the difficulty of decision-making, leading to the observed moderate Cohen's kappa agreement level. The overall Cohen's kappa results for this analysis can be found in Table I. In round three, there is only one group, consisting of the three post-doctoral fellows; they resolved the rest of the undecided or conflicted clone pairs from round two. Finally, we obtained 9,321 true semantic clone pairs through their discussion.
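The per-group agreement values reported in Tables I and II follow Cohen's kappa [16]. As a minimal sketch, the computation for one group of two judges can be reproduced with scikit-learn as follows; the label lists are placeholders for the judges' actual verdicts on a shared portion of clone pairs.

from sklearn.metrics import cohen_kappa_score

# Verdicts from the two judges of one group on the same clone pairs.
judge_1 = ["true", "true", "false", "undecided", "true", "true"]
judge_2 = ["true", "false", "false", "undecided", "true", "true"]

kappa = cohen_kappa_score(judge_1, judge_2)
print(f"Cohen's kappa: {kappa:.3f}")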
TABLE I
Semantic Clones Cohen-Kappa Interrater Agreement

Round     Group     Interrater Agreement     Interpretation
Round 1   Group 1   0.700                    Substantial Agreement
Round 1   Group 2   0.730                    Substantial Agreement
Round 1   Group 3   0.77                     Substantial Agreement
Round 2   Group 1   0.52                     Moderate Agreement
Round 2   Group 2   0.58                     Moderate Agreement
Round 2   Group 3   0.54                     Moderate Agreement

In terms of cross-language pairs, we followed the same procedure we described for semantic clones. In round one, every group received 4,000 different pairs. Group_1 tagged 3,460 pairs as true, 98 as false and 442 as undecided or conflicted; group_2 tagged 3,489 as true, 67 as false and 444 as undecided or conflicted; and group_3 tagged 3,471 as true, 58 as false and 471 as undecided or conflicted. Cohen's kappa is 0.69 for group_1, 0.59 for group_2 and 0.73 for group_3. In round two for cross-language pairs, Cohen's kappa is 0.48, 0.56 and 0.60, respectively. The overall Cohen's kappa results for this analysis can be found in Table II. Finally, the remaining 1,153 undecided pairs were collectively assessed and labelled by the three post-doctoral fellows through discussion.

TABLE II
Cross-language Clones Cohen-Kappa Interrater Agreement

Round     Group     Interrater Agreement     Interpretation
Round 1   Group 1   0.69                     Substantial Agreement
Round 1   Group 2   0.59                     Moderate Agreement
Round 1   Group 3   0.73                     Substantial Agreement
Round 2   Group 1   0.48                     Moderate Agreement
Round 2   Group 2   0.56                     Moderate Agreement
Round 2   Group 3   0.60                     Substantial Agreement

Approximately 212 hours were spent by the nine judges to validate the clone pairs. We note that the undergraduate research students were trained and given instructions on why and how we defined the semantic clones.

V. Unveiling the Findings: Results and Analysis

After a thorough screening process and manual validation, we identified 9,321 true semantic clone pairs across four different languages (Java, C, C#, and Python) out of 15,000 semantic clone pairs, and 10,950 true cross-language clone pairs out of 12,000 cross-language clone pairs. We used an accuracy metric to validate how good GPT-3 is at generating semantic and cross-language clones. With our first prompt, we generated four outputs for one given input, and with the second prompt, we got ten outputs for one given input; our accuracy is therefore based on the data obtained using this procedure. The accuracy of GPT-3 in generating semantic clones is 62.14% (9,321/15,000), and for cross-language clones, the accuracy of GPT-3 is 91.25% (10,950/12,000).

Accuracy = Number of Validated True Clones / Total Number of Randomly Selected Generated Clones    (1)

In our research, we also sought to assess the similarity between code fragments generated by GPT and human-written code. To quantify this similarity, we calculated the BLEU [14] score, obtaining a result of 0.55, which can be interpreted as very high quality and adequate. This score provides valuable insights into how closely the generated code fragments resemble human-written code. It is important to highlight that the objective of our study was to investigate the proximity between the two, and the BLEU score serves as a quantitative measure to accomplish this. Code fragments, by their nature, can exhibit complexity, and subtle variations can have substantial implications for functionality. Furthermore, coding style disparities across different developers and projects introduce additional nuances that the BLEU metric accounts for. It is crucial to emphasize that our primary focus was on achieving code fragments that met functional requirements. Recognizing that BLEU was initially designed for natural language tasks, we acknowledge its limitations in capturing all code-specific attributes. Hence, our evaluation should be interpreted in the context of understanding how closely generated code fragments resemble their human-written counterparts.
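As a sketch, the pairwise comparison underlying this score can be reproduced with NLTK as follows. The naive whitespace tokenization and the smoothing choice are illustrative assumptions rather than our exact configuration.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# BLEU [14] treats the human-written fragment as the reference and
# the GPT-generated fragment as the hypothesis.
human_written = "def f ( n ) : return n * ( n + 1 ) // 2".split()
gpt_generated = "def f ( n ) : return sum ( range ( 1 , n + 1 ) )".split()

# Smoothing avoids zero scores on short token sequences.
score = sentence_bleu([human_written], gpt_generated,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.2f}")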
We evaluated a semantic clone detection tool with the newly formed data (9,321 clones). As the testing metric, we used recall to confirm that the newly created data does not contain syntactic clones.

Recall = True Positive / (True Positive + False Negative)    (2)
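As a minimal sketch, Equation (2) can be computed over sets of clone pairs as follows; the pair sets shown are placeholders rather than our data.

def recall(detected, ground_truth):
    # True positives: validated pairs the tool reports;
    # false negatives: validated pairs the tool misses.
    true_positives = len(detected & ground_truth)
    false_negatives = len(ground_truth - detected)
    return true_positives / (true_positives + false_negatives)

validated_pairs = {("A.java", "B.java"), ("C.java", "D.java")}
reported_pairs = {("A.java", "B.java"), ("X.java", "Y.java")}
print(recall(reported_pairs, validated_pairs))  # 0.5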
A. Oreo

To check whether the validated data are actually semantic clones, we ran Oreo on our dataset. The results of our evaluation are presented in Table III. Oreo performs with a recall of 0.46 for the clones. We were not expecting a very high recall (more than 0.5) for Oreo on our data, because our data represent the region where most detection tools struggle to perform.

TABLE III
Oreo Recall Results

Tool     Language     Granularity     Recall
Oreo     Java         Method          0.46

(RQ1) To what extent does GPT-3 demonstrate the capability to generate high-quality semantic clones?

GPT-3 showcases a notable degree of capability in generating high-quality semantic clones, as evidenced by an achieved accuracy of 62.14% with a 0.55 BLEU score. The accuracy reflects GPT-3's proficiency in paraphrasing and producing semantically correct variations of original code fragments. The model's ability to attain such a substantial accuracy rate highlights its potential as a tool for generating semantic clones that closely emulate the intentions of the source code. However, it is important to acknowledge that this accuracy will only be achieved if a proper prompt engineering technique is followed. In addition, the high BLEU score can be interpreted as indicating that the GPT-generated codes are of good quality.
Fig. 1. Semantic code clone generation sample
(RQ2) What is the efficacy of GPT-3 in accurately converting code snippets from one programming language to another?

The efficacy of GPT-3 in accurately converting code snippets from one programming language to another is marked by a notable success rate. The model demonstrates a substantial ability to comprehend the structural and syntactical intricacies inherent to different programming languages, enabling it to produce conversions with a high level of accuracy. Notably, GPT-3 achieves an impressive accuracy of 91.25% in cross-language clone generation, which underscores its proficiency in seamlessly transposing code logic between disparate linguistic frameworks. While this achievement showcases GPT-3's prowess, it is essential to acknowledge that the accuracy may vary based on factors such as code complexity, domain specificity, and the nuances of each programming language. Nevertheless, GPT-3's capacity to effectively bridge the gap between programming languages signifies its potential to expedite cross-platform development and streamline code migration processes within the realm of software engineering.
VI. Threats to Validity

The first major concern that can be raised about our research is that the clones are generated by a machine-learning model and thus may not be real-world clones. Clones can be real-world or artificial [31]; to call a clone pair a real-world clone, that clone pair needs to be written by a human. To mitigate this issue, we used SemanticCloneBench [29] code fragments to generate the results. SemanticCloneBench [29] is created based on knowledge provided by developers who participate in SO. As we utilized SemanticCloneBench data as input (few-shot prompt engineering technique), the GPT-generated codes are similar to real-world clones. That is why we claim that our generated codes are not fully real-world but sit between real-world and artificial clones.

Another important consideration arises regarding the overall applicability and adaptability of GPT-3's performance. To address this potential limitation, we systematically evaluated the model's performance using four prominent and widely used programming languages. This strategic approach aims to shed light on GPT-3's robustness and effectiveness across diverse programming languages, thereby contributing to a more comprehensive understanding of its generalizability. By conducting this thorough examination across multiple programming languages, we have gained valuable insights into the extent to which GPT-3's performance transcends language boundaries and remains reliable across different coding paradigms.

Furthermore, there can be another concern regarding whether it is possible to get more efficient results from GPT-3. To answer this question, we can say that the effectiveness of prompt engineering techniques could impact the results. That is why, in our research, we followed a formal prompt engineering method [32], [33]. So, if others try to replicate the research with different prompts, they will get similar results if they follow proper prompt engineering techniques; we do not think there will be massive differences.

In addition, there can be another concern regarding manual evaluation bias. Manual evaluation of clone quality involves subjectivity, and the judgement of human annotators may introduce inconsistency. To avoid this problem, we gave the necessary knowledge regarding code clones to the undergraduate students. To further evaluate, three post-doctoral fellows discussed the remaining undecided and false-tagged code snippets and mitigated the problem to keep the decision unbiased. Still, we agree that manual evaluation can introduce some errors, and we will try to analyze the code fragments more rigorously by introducing more judges.

Finally, we should note that despite the fact that the generated clones are mostly artificial clones, there are a number of important applications of such clones in the cloning area and in software development in general. If large language models are efficient in generating semantic and cross-language clones, they could be used in building clone detection training data sets such as GPTCloneBench and beyond with confidence. This could then also be used for comparing and evaluating semantic and cross-language clone detection tools, and may even extend to comparing those detectors that detect clones across Microsoft .NET programming languages [29]. Such a clone generation approach and its resulting benchmarks could help evaluate whether source-code-transformation-based clone detection tools such as CloneWorks [34] or SimCad [35] could in fact detect semantic clones by applying flexible source transformations and normalizations. This could then further help build IDE-based flexible clone detection and management tools [6], [36], or could even potentially be used in building similar benchmarks in other contexts [37]. It is thus our understanding that such a study of large language models could help the cloning area despite the fact that the clones are generated clones.
VII. Related Work

Generating code involves using programming languages to create scripts, applications, or software, and there are many ways to generate code: manual coding, integrated development environments, code generators, templates and frameworks, AI-powered text generators, domain-specific languages, data-driven code generation, code refactoring tools, and scripting languages are some of the techniques [38]–[44]. Code generation models backed by artificial intelligence have exhibited impressive abilities in aiding developers, automating repetitive coding processes, and even suggesting innovative solutions. According to Victor Dey [45], in a recent survey conducted by GitHub in partnership with Wakefield Research, 92% of developers are already using AI-powered coding tools in their work. We therefore tried one of the latest of OpenAI's GPT models to generate semantic clones.

There are many code recommendation systems. GitHub Copilot [46], a tool developed collaboratively by OpenAI and GitHub, is a code generation AI model that integrates directly into software development environments. It facilitates the inclusion of code snippets and automated code completion, thereby enhancing the coding experience for users. In our approach, we have incorporated few-shot prompting alongside the natural language description. This strategic choice aims to enhance GPT's performance in generating semantic clones, focusing on this aspect rather than completing code outright. Additionally, our evaluation encompasses GPT-3's ability to generate cross-language code clones, distinguishing it from Copilot's functionality in this regard.

Building on the achievements of Copilot, Codex [47] further pushes the boundaries of AI-assisted code generation. Codex, also developed by OpenAI, is an advanced language model that can generate entire functions, classes, and methods based on natural language prompts. The Codex framework was proposed by Chen et al. [47], who conducted an evaluation of its performance using a dataset of 163 coding tasks. In their work, the authors focused on the task of generating standalone Python functions from docstrings and evaluating the correctness of code samples automatically through unit tests. For our work, we utilized OpenAI's text-davinci-003 model, which has 175 billion parameters compared to Codex's 14.8 billion parameters. In addition, OpenAI Codex is most capable in Python, whereas we wanted to use a more generalized model.

In another study, Li et al. introduced AlphaCode [48], a system designed for code generation. The model was trained utilising data from GitHub and CodeContests. According to the authors, AlphaCode demonstrated an average ranking of 54.3% in competitive programming competitions on the Codeforces platform. The authors conducted a comparative analysis between their proposed solution and the solutions developed by other participants in the contest. The evaluation was based on contest parameters, including the remaining time portion and penalties incurred for wrong submissions.

Overall, while there are many great tools and techniques available for code generation, our aim in this work has been to examine whether the recently proposed GPT-3 model could help the cloning community. In particular, we aim to examine to what extent the GPT-3 model could be used in generating semantic and cross-language clones.

VIII. Conclusion

Our research introduces a transformative paradigm for code reuse, refactoring, migration and renovation. Our goal was to explore the efficacy of GPT-3 in generating semantic and cross-language clones. We utilized SemanticCloneBench to generate close-to-real-world code clones through GPT-3. After a thorough validation process, we obtained 9,321 true semantic clone pairs and 10,950 cross-language clone pairs after handling all the limitations of GPT-3. With a noteworthy accuracy rate of 62.14% and a 0.55 BLEU score, GPT-3 showcases its potential in accurately replicating semantic structures within a given programming language. Additionally, our investigation into cross-language clones further underscores GPT-3's prowess, with an impressive accuracy of 91.25%. These findings highlight the substantial progress of GPT-3 in improving code generation, opening up new possibilities for creative uses in software development and other areas. As GPT-3 continues to showcase remarkable performance, it holds the promise of contributing to the advancement of code-related tasks across diverse linguistic and domain contexts.

Acknowledgment

This work was supported by NSERC Discovery grants, NSERC USRAs, CFI-JELF, and the NSERC CREATE graduate program on Software Analytics Research (SOAR) grants.

References

[1] B. S. Baker, "Finding clones with dup: Analysis of an experiment," IEEE TSE, vol. 33, no. 9, pp. 608–621, 2007.
[2] C. K. Roy and J. R. Cordy, "An empirical study of function clones in open source software," in 2008 15th Working Conference on Reverse Engineering, pp. 81–90, IEEE, 2008.
[3] G. Zhang, X. Peng, Z. Xing, and W. Zhao, "Cloning practices: Why developers clone and what can be changed," in 2012 28th IEEE International Conference on Software Maintenance (ICSM), pp. 285–294, IEEE, 2012.
[4] C. K. Roy, M. F. Zibran, and R. Koschke, "The vision of software clone management: Past, present, and future (keynote paper)," in 2014 Software Evolution Week-IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE), pp. 18–33, IEEE, 2014.
[5] M. Mondal, C. K. Roy, and K. A. Schneider, "SPCP-Miner: A tool for mining code clones that are important for refactoring or tracking," in 2015 IEEE Intl. Conference on SANER, pp. 484–488, IEEE, 2015.
[6] M. F. Zibran and C. K. Roy, "The road to software clone management: A survey," Technical Report 2012-03, Department of Computer Science, pp. 1–66, 2012.
[7] Y. Huang, F. Xu, H. Zhou, X. Chen, X. Zhou, and T. Wang, "Towards exploring the code reuse from stack overflow during software development," in Proc. of the 30th Intl. Conference on Program Comprehension, pp. 548–559, 2022.
[8] E. Juergens, F. Deissenboeck, B. Hummel, and S. Wagner, "Do code clones matter?," in IEEE Intl. Conference on Software Engineering, pp. 485–495, IEEE, 2009.
[9] M. Mondal, B. Roy, C. K. Roy, and K. A. Schneider, "Investigating context adaptation bugs in code clones," in 2019 IEEE Intl. Conference on Software Maintenance and Evolution (ICSME), pp. 157–168, 2019.
[10] M. Asaduzzaman, M. C. Bullock, C. K. Roy, and K. A. Schneider, "Bug introducing changes: A case study with android," in 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), pp. 116–119, 2012.
[11] Y. Wu, S. Wang, C.-P. Bezemer, and K. Inoue, "How do developers utilize source code from stack overflow?," Empirical Software Engineering, vol. 24, pp. 637–673, 2019.
[12] B. Roy, "Building on a legacy: Working with users to revitalize the CRHM hydrological model," Global Water Futures Core Computer Science, 2023.
[13] A. I. Alam, P. R. Roy, F. Al-Omari, C. K. Roy, B. Roy, and K. Schneider, "GPTCloneBench: A comprehensive benchmark of semantic clones and cross-language clones using GPT-3 model and SemanticCloneBench," 2023.
[14] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318, 2002.
[15] V. Saini, F. Farmahinifarahani, Y. Lu, P. Baldi, and C. V. Lopes, "Oreo: Detection of clones in the twilight zone," in Proceedings of the 2018 ACM Joint Meeting on ESEC/FSE, (New York, NY, USA), pp. 354–365, Association for Computing Machinery, 2018.
[16] J. Cohen, "A coefficient of agreement for nominal scales," Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.
[17] C. K. Roy, J. R. Cordy, and R. Koschke, "Comparison and evaluation of code clone detection techniques and tools: A qualitative approach," Science of Computer Programming, vol. 74, no. 7, pp. 470–495, 2009.
[18] J. Svajlenko and C. K. Roy, "BigCloneEval: A clone detection tool evaluation framework with BigCloneBench," in 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 596–600, IEEE, 2016.
[19] S. Bellon, R. Koschke, G. Antoniol, J. Krinke, and E. Merlo, "Comparison and evaluation of clone detection tools," IEEE Transactions on Software Engineering, vol. 33, no. 9, pp. 577–591, 2007.
[20] F.-H. Su, J. Bell, K. Harvey, S. Sethumadhavan, G. Kaiser, and T. Jebara, "Code relatives: Detecting similarly behaving software," in Proceedings of the 2016 ACM SIGSOFT/FSE, pp. 702–714, 2016.
[21] A. M. Leitão, "Detection of redundant code using R2D2," Software Quality Journal, vol. 12, pp. 361–382, 2004.
[22] M. Suzuki, A. C. de Paula, E. Guerra, C. V. Lopes, and O. A. L. Lemos, "An exploratory study of functional redundancy in code repositories," in 17th IEEE Intl. Working Conference on SCAM, pp. 31–40, IEEE, 2017.
[23] T. A. Henderson and A. Podgurski, "Rethinking dependence clones," in 2017 IEEE 11th International Workshop on Software Clones (IWSC), pp. 1–7, IEEE, 2017.
[24] L. Jiang and Z. Su, "Automatic mining of functionally equivalent code fragments via random testing," in Proceedings of the Eighteenth International Symposium on Software Testing and Analysis, pp. 81–92, 2009.
[25] A. Sheneamer and J. Kalita, "Semantic clone detection using machine learning," in 2016 IEEE ICMLA, pp. 1024–1028, IEEE, 2016.
[26] V. Käfer, S. Wagner, and R. Koschke, "Are there functionally similar code clones in practice?," in 2018 IEEE IWSC, pp. 2–8, IEEE, 2018.
[27] R. Tajima, M. Nagura, and S. Takada, "Detecting functionally similar code within the same project," in 2018 IEEE 12th International Workshop on Software Clones (IWSC), pp. 51–57, IEEE, 2018.
[28] S. Baltes, C. Treude, and S. Diehl, "SOTorrent: Studying the origin, evolution, and usage of stack overflow code snippets," in Intl. Conference on MSR, pp. 191–194, IEEE, 2019.
[29] F. Al-Omari, C. K. Roy, and T. Chen, "SemanticCloneBench: A semantic code clone benchmark using crowd-source knowledge," in 2020 IEEE 14th International Workshop on Software Clones (IWSC), pp. 57–63, IEEE, 2020.
[30] C. K. Roy and J. R. Cordy, "NiCad: Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization," in 2008 IEEE ICPC, pp. 172–181, IEEE, 2008.
[31] C. K. Roy and J. R. Cordy, "Benchmarks for software clone detection: A ten-year retrospective," in 2018 IEEE 25th Intl. Conference on SANER, pp. 26–37, IEEE, 2018.
[32] L. Fe-Fei, "A Bayesian approach to unsupervised one-shot learning of object categories," in Proceedings 9th IEEE Intl. Conference on Computer Vision, pp. 1134–1141, IEEE, 2003.
[33] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou, "Chain of thought prompting elicits reasoning in large language models," arXiv preprint arXiv:2201.11903, 2022.
[34] J. Svajlenko and C. K. Roy, "CloneWorks: A fast and flexible large-scale near-miss clone detection tool," in 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C), pp. 177–179, 2017.
[35] M. S. Uddin, C. K. Roy, and K. A. Schneider, "SimCad: An extensible and faster clone detection tool for large scale software systems," in 2013 21st International Conference on Program Comprehension (ICPC), pp. 236–238, 2013.
[36] M. F. Zibran and C. K. Roy, "Towards flexible code clone detection, management, and refactoring in IDE," in Proc. of the 5th International Workshop on Software Clones, pp. 75–76, 2011.
[37] M. M. Rahman and C. K. Roy, "On the use of context in recommending exception handling code examples," in 2014 IEEE 14th Intl. Working Conference on Source Code Analysis and Manipulation, pp. 285–294, 2014.
[38] T. Basit, "Manual or electronic? The role of coding in qualitative data analysis," Educational Research, vol. 45, no. 2, pp. 143–154, 2003.
[39] R. Cattell, "Automatic derivation of code generators from machine descriptions," ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 2, no. 2, pp. 173–190, 1980.
[40] T. Sturm, J. von Voss, and M. Boger, "Generating code from UML with velocity templates," in Intl. Conference on the Unified Modeling Language, pp. 150–161, Springer, 2002.
[41] C. Bird, D. Ford, T. Zimmermann, N. Forsgren, E. Kalliamvakou, T. Lowdermilk, and I. Gazit, "Taking flight with Copilot: Early insights and opportunities of AI-powered pair-programming tools," Queue, vol. 20, no. 6, pp. 35–57, 2022.
[42] M. Fowler, Domain-Specific Languages. Pearson Education, 2010.
[43] G. Szőke, C. Nagy, L. J. Fülöp, R. Ferenc, and T. Gyimóthy, "FaultBuster: An automatic code smell refactoring toolset," in 2015 IEEE 15th Intl. Working Conference on SCAM, pp. 253–258, IEEE, 2015.
[44] D. M. Beazley et al., "SWIG: An easy to use tool for integrating scripting languages with C and C++," in Tcl/Tk Workshop, vol. 43, p. 74, 1996.
[45] V. Dey, "92work: GitHub report," https://shorturl.at/cwMV7, Jun 2023.
[46] A. Ziegler, "GitHub Copilot research recitation," https://github.blog/2021-06-30-github-copilot-research-recitation/, Jun 2021.
[47] M. Chen, J. Tworek, et al., "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021.
[48] Y. Li, D. Choi, et al., "Competition-level code generation with AlphaCode," Science, vol. 378, no. 6624, pp. 1092–1097, 2022.