On The Robustness of Code Generation Techniques: An Empirical Study On GitHub Copilot
Abstract—Software engineering research has always been concerned with the improvement of code completion approaches, which suggest the next tokens a developer will likely type while

Intuitively, the ability of the developer to provide “proper” inputs to the model will become central to boost the effectiveness of its recommendations. In the concrete example of GitHub
C. Data Analysis

Concerning RQ0, we report the number and the percentage of the 892 methods for which the automatically generated paraphrases (i.e., those generated by PEGASUS and by TP) have been classified as semantically equivalent to the original description. This provides an idea of how reliable these tools are when used for testing the robustness of DL-based code generators. This analysis also allows us to exclude from RQ1 the automatically generated paraphrases that are not semantically equivalent.

To answer RQ1, we preliminarily assess how far the paraphrased descriptions are from the original ones (i.e., the percentage of changed words) by computing the normalized token-level Levenshtein distance [31] (NTLev) between the original (d_o) and any paraphrased description (d_p):

NTLev(d_o, d_p) = TLev(d_o, d_p) / max(|d_o|, |d_p|)

with TLev representing the token-level Levenshtein distance between the two descriptions. We then compute the Levenshtein distance and the CodeBLEU [49] between each synthesized method and the target one (i.e., the one originally implemented by the developers). CodeBLEU measures how similar two methods are. Differently from the BLEU score [46], CodeBLEU evaluates the predicted code by considering not only the overlapping n-grams but also the syntactic and semantic match of the two pieces of code (predicted and reference) [49].

D. Replication Package

The code and data used in our study are publicly available [6]. In particular, we provide: (i) the dataset of manually defined and automatically generated paraphrases; (ii) the AppleScript code used to automate the Copilot triggering; (iii) the code used to compute the CodeBLEU and the Levenshtein distance; (iv) the dataset of 892 methods and related tests used in our study; (v) the scripts used to automatically generate the paraphrased descriptions using PEGASUS and TP; and (vi) all raw data output of our experiments.
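To make the NTLev metric concrete, the following is a minimal sketch (not the replication-package code; function names and the whitespace tokenization are our assumptions) of how the token-level Levenshtein distance and its normalized variant could be computed:

```python
def token_levenshtein(a_tokens, b_tokens):
    """Classic dynamic-programming Levenshtein distance computed over tokens."""
    m, n = len(a_tokens), len(b_tokens)
    prev = list(range(n + 1))  # distances between a_tokens[:0] and b_tokens[:j]
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a_tokens[i - 1] == b_tokens[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution (or match)
        prev = curr
    return prev[n]

def ntlev(original_desc, paraphrased_desc):
    """NTLev(d_o, d_p) = TLev(d_o, d_p) / max(|d_o|, |d_p|), tokens split on whitespace."""
    d_o, d_p = original_desc.split(), paraphrased_desc.split()
    longest = max(len(d_o), len(d_p))
    return 0.0 if longest == 0 else token_levenshtein(d_o, d_p) / longest

# Example: two semantically equivalent descriptions of the same method
print(ntlev("Returns the MD5 hash of the given string",
            "Computes the MD5 hash for a given string"))  # -> 0.375
```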
Results Achieved With the Original and the Manually Paraphrased Descriptions
[Bar charts; legend: Original vs. Paraphrased. Left-panel counts (original/paraphrased): FAIL 652/644, PASS 112/122, ERROR 96/99, EMPTY 32/27.]
Fig. 2. Results achieved by Copilot when considering the Full context code representation on paraphrases_manual.
III. RESULTS DISCUSSION

As previously explained, in RQ1 we conducted our experiments both in the Full context and in the Non-full context scenario. Since the obtained findings are similar, due to space limitations we only discuss in the paper the results achieved in the Full context scenario (i.e., the case in which we provide Copilot with all the code preceding and following the method object of the prediction). The results achieved in the Non-full context scenario are available in our replication package [6].

A. RQ0: Evaluation of Automated Paraphrase Generators

TABLE II
NUMBER OF SEMANTICALLY EQUIVALENT OR NONEQUIVALENT PARAPHRASED DESCRIPTIONS OBTAINED USING PEGASUS AND TP.

          Equivalent     Nonequivalent   Invalid
PEGASUS   666 (74.7%)    225 (25.2%)     1 (0.1%)
TP        688 (77.1%)    104 (11.7%)     100 (11.2%)

Table II reports the number of semantically equivalent and nonequivalent descriptions obtained using the two state-of-the-art paraphrasing techniques, namely PEGASUS and Translation Pivoting (TP), together with the number of invalid paraphrases generated. Out of the 892 original descriptions on which they have been run, PEGASUS generated 666 (75%) semantically equivalent descriptions, while TP went up to 688 (77%). If we do not consider the invalid paraphrases, i.e., the cases for which the techniques do not actually provide any paraphrase, the latter obtains ∼87% of correctly generated paraphrases (688 out of the 792 valid ones).

These findings suggest that the two paraphrasing techniques can be adopted as testing tools to assess the robustness of DL-based code recommenders. In particular, once a reference description is established (e.g., the original description in our study), these tools can be applied to paraphrase it and verify whether, using the reference and the paraphrased descriptions, the code recommenders generate different predictions.

Answer to RQ0. State-of-the-art paraphrasing techniques can be used as a starting point to test the robustness of DL-based code recommenders, since they are able to generate semantically equivalent descriptions of a reference text in up to 77% of cases.

B. RQ1: Robustness of GitHub Copilot

Performance of Copilot when using the original and the paraphrased description as input. Fig. 2 summarizes the performance achieved by Copilot when using the original description (light blue) and the manually generated paraphrased description (dark blue) as input. Similarly, we report in Fig. 3 the performance obtained when considering the paraphrases generated with the two automated techniques, i.e., PEGASUS and TP (top and bottom of Fig. 3, respectively). It is worth noticing that, in the latter, we only considered in the analysis the paraphrases manually classified as equivalent in RQ0, i.e., 666 for PEGASUS and 688 for TP.

A first interesting result is that, as can be noticed from Fig. 2 and Fig. 3, the results obtained with the three methodologies are very similar. For this reason, to avoid repetitions, in the following we will mainly focus on the results obtained with the manually generated paraphrases.
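As an illustration of the paraphrasing step discussed in RQ0, here is a minimal sketch of how paraphrases could be generated with the PEGASUS checkpoint referenced in [5] (tuner007/pegasus_paraphrase); the generation parameters shown are illustrative assumptions, not the exact settings used in the study:

```python
# Minimal sketch: paraphrasing a method description with the PEGASUS
# checkpoint referenced in [5]. Generation parameters are illustrative only.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "tuner007/pegasus_paraphrase"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

def paraphrase(description, num_return_sequences=5):
    """Return a list of candidate paraphrases for a method description."""
    batch = tokenizer([description], truncation=True, padding="longest",
                      return_tensors="pt")
    generated = model.generate(**batch, max_length=60, num_beams=10,
                               num_return_sequences=num_return_sequences)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

# Each candidate would then be checked for semantic equivalence (as done in
# RQ0) before being fed to the code recommender under test.
print(paraphrase("Returns the MD5 hash of the given string"))
```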
Results Achieved With the Original and the Automatically Generated Paraphrased Descriptions
[Bar charts (FAIL, PASS, ERROR, EMPTY); legends: Original vs. Pegasus (top) and Original vs. Translation-Pivoting (bottom).]
Fig. 3. Results achieved by Copilot when considering the Full context code representation on paraphrases_PEGASUS and paraphrases_TP.
Also, as we will discuss, the quality of Copilot’s recommendations is very similar when using the original and the paraphrased descriptions.

In Fig. 2, the bar chart on the left side reports the number of methods recommended by Copilot (out of 892) that resulted in failing tests, passing tests, syntactic errors, and no (i.e., empty) recommendation. Looking at such a chart, the first thing that leaps to the eye is the high percentage of Java methods (∼73% for the original and ∼72% for the paraphrased description) for which Copilot was not able to synthesize a method passing the related unit tests.

Only ∼13% of instances (112 and 122, depending on the used description) resulted in test-passing methods. While such a result seems to indicate limited performance of Copilot, the difficulty of the code generation tasks involved in our study must be considered. Indeed, we did not ask Copilot to generate simple methods possibly implementing quite popular routines (e.g., a method to generate an MD5 hash from a string), but rather randomly selected methods that, as shown in Table I, are composed, on average, of more than 150 tokens (median = 92) and have an average cyclomatic complexity of 5.3 (median = 3.0).
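As a quick sanity check, these percentages follow directly from the Fig. 2 left-panel counts (assuming 652/644 failing and 112/122 passing methods out of 892 for the original/paraphrased descriptions, respectively):

\[
\frac{652}{892} \approx 73.1\%, \quad \frac{644}{892} \approx 72.2\%, \qquad \frac{112}{892} \approx 12.6\%, \quad \frac{122}{892} \approx 13.7\%
\]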
[Three panels, one per paraphrase source, each with “Description” and “Code” on the x-axis and the Levenshtein distance on the y-axis, computed against the Target Method.]
Fig. 6. Levenshtein distance between the original description and (i) the manually paraphrased descriptions (left part) and (ii) the descriptions automatically paraphrased by PEGASUS (middle part) and Translation Pivoting (right). Similarly, we report the Levenshtein distance between the method recommended using the original description and the three paraphrases. The latter is only computed for recommendations in which the obtained output differs.
Those are open-source projects from GitHub, and it is likely that at least some of them have been used for training Copilot itself. In other words, the absolute effectiveness we report might not be reliable. However, the objective of our study is to understand the differences arising when different paraphrases are used, rather than to assess the absolute performance of Copilot as previous studies did (e.g., [43]).

Threats to external validity are related to the possibility of generalizing our results. Our study has been run on 892 methods we carefully selected as explained in Section II-A. Rather than going large-scale, we preferred to focus on methods having a high test coverage and a verbose first sentence in the Doc Comment. Larger investigations are needed to corroborate or contradict our findings. Similarly, we only focused on Java methods, given the effort required to implement the toolchain needed for our study, and in particular the script to automatically invoke Copilot and parse its output. Running the same experiment with other languages is part of our future agenda.

V. RELATED WORK

Recommender systems for software developers are tools supporting practitioners in daily activities [38], [51], such as documentation writing and retrieval [64], [39], [40], [24], refactoring [11], [55], bug triaging [54], [63], bug fixing [30], [58], [34], etc. Among those, code recommenders, such as code completion tools, have become a crucial feature of modern Integrated Development Environments (IDEs) and support developers in speeding up code development by suggesting the code they are likely to write [12], [29], [16]. Given the empirical nature of our work, which focuses on investigating a specific aspect of code recommenders, in this section we do not discuss all previous works proposing novel code recommenders or improving existing ones (see, e.g., [64], [39], [40], [30], [58], [34], [61], [44], [36], [28], [7], [29], [27], [60], [57]). Instead, we focus on empirical studies looking at code recommenders from different perspectives (Section V-A) and on studies specifically focused on GitHub Copilot (Section V-B).

A. Empirical Studies on Code Recommenders

Proksch et al. [48] conducted an empirical study aimed at evaluating the performance of code recommenders when suggesting method calls. Their study has been run on a real-world dataset composed of developers’ interactions captured in the IDE. Results showed that commonly used evaluation techniques based on synthetic datasets extracted by mining released code underperform due to a context miss.

On a related research thread, Hellendoorn et al. [20] compared code completion models on both real-world and synthetic datasets. Confirming what was observed by Proksch et al., they found that the evaluated tools are less accurate on the real-world dataset, thus concluding that synthetic benchmarks are not representative enough. Moreover, they found that the accuracy of code completion tools substantially drops in challenging completion scenarios, in which developers would need them the most.

Mărășoiu et al. [37] analyzed how practitioners rely on code completion during software development. The results showed that users actually ignore many synthesized suggestions. Such a finding has been corroborated by Arrebola and Junior [9], who stressed the need for augmenting code recommender systems with the development context.

Jin and Servant [26] and Li et al. [33] investigated the hidden costs of code recommendations. Jin and Servant found that IntelliSense, a code completion tool, sometimes underperforms by providing the suitable recommendation far from the top of the recommended list of solutions. Consequently, developers are discouraged from picking the right suggestion. Li et al., aware of this potential issue, conducted a coding experiment in which they try to predict whether correct results are generated by code completion models, showing that their approach can reduce the percentage of false positives by up to 70%.

Previous studies also assessed the actual usefulness of these tools. Xu et al. [65] ran a controlled experiment with 31 developers who were asked to complete implementation tasks with and without the support of two code recommenders. They found a marginal gain in developers’ productivity when using the code recommenders.
Ciniselli et al. [15] empirically evaluated the performance of two state-of-the-art Transformer-based models in challenging coding scenarios, for example, when the code recommender is required to generate an entire code block (e.g., the body of a for loop). The two experimented models, RoBERTa and the Text-To-Text Transfer Transformer (T5), achieved good performance (∼69% accuracy) in the more classic code completion scenario (i.e., predicting the few tokens needed to finalize a statement), while reporting a substantial drop of accuracy (∼29%) when dealing with the previously described, more complex block-level completions.

Our study is complementary to the ones discussed above. Indeed, we investigate the robustness of DL-based code recommenders supporting what is known in the literature as “natural language to source code translation”. We show that semantically equivalent code descriptions can result in different recommendations, thus posing questions on the usability of these tools.

B. Empirical Studies on GitHub Copilot

GitHub Copilot has been recently introduced as the state-of-the-art code recommender, and advertised as an “AI pair programmer” [1], [22]. Since its release, researchers have started investigating its capabilities.

Most of the previous research aimed at evaluating the impact of GitHub Copilot on developers’ productivity and its effectiveness (in terms of correctness of the provided solutions). Imai [25] investigated to what extent Copilot is actually a valid alternative to a human pair programmer. They observed that Copilot results in increased productivity (i.e., number of added lines of code), but decreased quality in the produced code. Ziegler et al. [67] conducted a case study in which they investigated whether usage measurements about Copilot can predict developers’ productivity. They found that the acceptance rate of the suggested solutions is the best predictor of perceived productivity. Vaithilingam et al. [59] ran an experiment with 24 developers to understand how Copilot can help developers complete programming tasks. Their results show that Copilot does not improve the task completion time and success rate. However, developers report that they prefer to use Copilot because it recommends code that can be used as a starting point and saves the effort of searching online.

Nguyen and Nadi [43] used LeetCode questions as input to Copilot to evaluate the solutions provided for several programming languages in terms of correctness (by running the test cases available in LeetCode) and understandability (by computing their Cyclomatic Complexity and Cognitive Complexity [13]). They found notable differences among the programming languages in terms of correctness (between 57%, for Java, and 27%, for JavaScript). On the other hand, Copilot generates solutions with low complexity for all the programming languages. While we also measure the effectiveness of the solutions suggested by Copilot, our main focus is on understanding its robustness when different inputs are provided.

Two previous studies aimed at evaluating the security of the solutions recommended by Copilot. Pearce et al. [47] investigated the likelihood of receiving from Copilot recommendations including code affected by security vulnerabilities. They observed that vulnerable code is recommended in 40% of the completion scenarios they experimented with. On a similar note, Sobania et al. [52] evaluated GitHub Copilot on standard program synthesis benchmark problems and compared the achieved results with those from the genetic programming literature. The authors found that the performance of the two approaches is comparable. However, approaches based on genetic programming are not mature enough to be deployed in practice, especially due to the time they require to synthesize solutions. In our study, we do not focus on security, but only on the correctness of the suggested solutions.

Albert Ziegler, in a blog post about GitHub Copilot (https://docs.github.com/en/github/copilot/research-recitation), investigated the extent to which the tool’s suggestions are copied from the training set. Ziegler reports that Copilot rarely recommends verbatim copies of code taken from the training set.

VI. CONCLUSIONS AND FUTURE WORK

We investigated the extent to which DL-based code recommenders tend to synthesize different code components when starting from different but semantically equivalent natural language descriptions. We selected GitHub Copilot as the tool representative of the state-of-the-art and asked it to generate 892 non-trivial Java methods starting from their natural language description. For each method in our dataset we asked Copilot to synthesize it using: (i) the original description, extracted as the first sentence in the Javadoc; and (ii) paraphrased descriptions. We did this both by manually modifying the original description and by using automated paraphrasing tools, after having assessed their reliability in this context.

We found that in ∼46% of cases semantically equivalent but different method descriptions result in different code recommendations. We observed that some correct recommendations can only be obtained using one of the semantically equivalent descriptions as input.

Our results highlight the importance of providing a proper code description when asking DL-based recommenders to synthesize code. In the new era of AI-supported programming, developers must learn how to properly describe the code components they are looking for to maximize the effectiveness of the AI support.

Our future work will focus on answering our first research question in vivo rather than in silico. In other words, we aim at running a controlled experiment with developers to assess the impact of the different code descriptions they write on the received recommendations. Also, we will investigate how to customize the automatic paraphrasing techniques to further improve their performance on software-related text (such as methods’ descriptions).
ACKNOWLEDGMENTS

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 851720).

REFERENCES

[1] GitHub Copilot, https://copilot.github.com.
[2] JaCoCo, https://www.eclemma.org/jacoco/.
[3] JavaParser, https://github.com/javaparser/javaparser.
[4] JUnit, https://junit.org/junit5/.
[5] PEGASUS fine-tuned for paraphrasing, https://huggingface.co/tuner007/pegasus_paraphrase.
[6] Replication package, https://github.com/antonio-mastropaolo/robustness-copilot.
[7] M. Allamanis, E. T. Barr, C. Bird, and C. Sutton, “Learning natural coding conventions,” in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. FSE 2014, 2014, pp. 281–293.
[8] U. Alon, R. Sadaka, O. Levy, and E. Yahav, “Structural language models of code,” arXiv preprint, 2019.
[9] F. V. Arrebola and P. T. A. Junior, “On source code completion assistants and the need of a context-aware approach,” in International Conference on Human Interface and the Management of Information. Springer, 2017, pp. 191–201.
[10] M. Asaduzzaman, C. K. Roy, K. A. Schneider, and D. Hou, “Context-sensitive code completion tool for better api usability,” in 2014 IEEE International Conference on Software Maintenance and Evolution, 2014, pp. 621–624.
[11] G. Bavota, A. D. Lucia, A. Marcus, and R. Oliveto, “Automating extract class refactoring: an improved method and its evaluation,” Empir. Softw. Eng., vol. 19, no. 6, pp. 1617–1664, 2014.
[12] M. Bruch, M. Monperrus, and M. Mezini, “Learning from examples to improve code completion systems,” in Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ser. ESEC/FSE 2009, 2009, pp. 213–222.
[13] G. A. Campbell, “Cognitive complexity: An overview and evaluation,” in Proceedings of the 2018 International Conference on Technical Debt, 2018, pp. 57–58.
[14] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
[15] M. Ciniselli, N. Cooper, L. Pascarella, A. Mastropaolo, E. Aghajani, D. Poshyvanyk, M. D. Penta, and G. Bavota, “An empirical study on the usage of transformer models for code completion,” IEEE Transactions on Software Engineering, 2021.
[16] M. Ciniselli, N. Cooper, L. Pascarella, D. Poshyvanyk, M. Di Penta, and G. Bavota, “An empirical study on the usage of bert models for code completion,” in Proceedings of the 18th Working Conference on Mining Software Repositories, ser. MSR ’21, 2021, to appear.
[17] O. Dabic, E. Aghajani, and G. Bavota, “Sampling projects in github for msr studies,” in 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR). IEEE, 2021, pp. 560–564.
[18] N. A. Ernst and G. Bavota, “Ai-driven development is here: Should you worry?” IEEE Softw., vol. 39, no. 2, pp. 106–110, 2022.
[19] V. J. Hellendoorn and P. Devanbu, “Are deep neural networks the best choice for modeling source code?” in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE 2017, 2017, pp. 763–773.
[20] V. J. Hellendoorn, S. Proksch, H. C. Gall, and A. Bacchelli, “When code completion fails: A case study on real-world completions,” in 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 2019, pp. 960–970.
[21] A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu, “On the naturalness of software,” in Proceedings of the 34th International Conference on Software Engineering, ser. ICSE 2012. IEEE Press, 2012, pp. 837–847.
[22] G. D. Howard, “Github copilot: Copyright, fair use, creativity, transformativity, and algorithms,” 2021.
[23] X. Hu, G. Li, X. Xia, D. Lo, and Z. Jin, “Deep code comment generation,” in Proceedings of the 26th Conference on Program Comprehension, ICPC 2018, Gothenburg, Sweden, May 27-28, 2018, F. Khomh, C. K. Roy, and J. Siegmund, Eds. ACM, 2018, pp. 200–210.
[24] ——, “Deep code comment generation,” ser. ICPC ’18, 2018.
[25] S. Imai, “Is github copilot a substitute for human pair-programming? an empirical study,” in 2022 IEEE/ACM 44th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). IEEE, 2022, pp. 319–321.
[26] X. Jin and F. Servant, “The hidden cost of code completion: Understanding the impact of the recommendation-list length on its efficiency,” in Proceedings of the 15th International Conference on Mining Software Repositories, 2018, pp. 70–73.
[27] R. Karampatsis and C. A. Sutton, “Maybe deep neural networks are the best choice for modeling source code,” CoRR, vol. abs/1903.05734, 2019. [Online]. Available: http://arxiv.org/abs/1903.05734
[28] J. Kim, S. Lee, S. Hwang, and S. Kim, “Adding examples into java documents,” in 2009 IEEE/ACM International Conference on Automated Software Engineering, 2009, pp. 540–544.
[29] S. Kim, J. Zhao, Y. Tian, and S. Chandra, “Code prediction by feeding trees to transformers,” arXiv preprint arXiv:2003.13848, 2020.
[30] C. Le Goues, M. Dewey-Vogt, S. Forrest, and W. Weimer, “A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each,” in 2012 34th International Conference on Software Engineering (ICSE), 2012, pp. 3–13.
[31] V. I. Levenshtein et al., “Binary codes capable of correcting deletions, insertions, and reversals,” in Soviet Physics Doklady, vol. 10, no. 8. Soviet Union, 1966, pp. 707–710.
[32] B. Li, M. Yan, X. Xia, X. Hu, G. Li, and D. Lo, “Deepcommenter: a deep code comment generation tool with hybrid lexical and syntactical information,” in ESEC/FSE ’20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 8-13, 2020, P. Devanbu, M. B. Cohen, and T. Zimmermann, Eds. ACM, 2020, pp. 1571–1575.
[33] J. Li, R. Huang, W. Li, K. Yao, and W. Tan, “Toward less hidden cost of code completion with acceptance and ranking models,” in 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2021, pp. 195–205.
[34] Y. Li, S. Wang, and T. N. Nguyen, “Dlfix: Context-based code transformation learning for automated program repair,” in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ser. ICSE ’20, 2020, pp. 602–614.
[35] B. Lin, F. Zampetti, G. Bavota, M. D. Penta, M. Lanza, and R. Oliveto, “Sentiment analysis for software engineering: how far can we go?” in Proceedings of the 40th International Conference on Software Engineering, ICSE 2018, Gothenburg, Sweden, May 27 - June 03, 2018, pp. 94–104.
[36] F. Liu, G. Li, Y. Zhao, and Z. Jin, “Multi-task learning based pre-trained language model for code completion,” in Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE 2020. Association for Computing Machinery, 2020.
[37] M. Mărășoiu, L. Church, and A. Blackwell, “An empirical investigation of code completion usage by professional software developers,” in Proceedings of the 26th Annual Workshop of the Psychology of Programming Interest Group, 2015.
[38] C. McMillan, D. Poshyvanyk, M. Grechanik, Q. Xie, and C. Fu, “Portfolio: Searching for relevant functions and their usages in millions of lines of code,” ACM Trans. Softw. Eng. Methodol., vol. 22, no. 4, pp. 37:1–37:30, 2013.
[39] L. Moreno, G. Bavota, M. Di Penta, R. Oliveto, and A. Marcus, “How can i use this method?” in Proceedings of the 37th International Conference on Software Engineering - Volume 1, ser. ICSE ’15, 2015, pp. 880–890.
[40] L. Moreno, G. Bavota, M. D. Penta, R. Oliveto, A. Marcus, and G. Canfora, “Arena: An approach for the automated generation of release notes,” IEEE Transactions on Software Engineering, vol. 43, no. 2, pp. 106–127, 2017.
[41] A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen, “A large-scale study on repetitiveness, containment, and composability of routines in open-source projects,” in Proceedings of the IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR 2016), 2016, pp. 362–373.
[42] A. T. Nguyen, T. T. Nguyen, H. A. Nguyen, A. Tamrawi, H. V. Nguyen, J. Al-Kofahi, and T. N. Nguyen, “Graph-based pattern-oriented, context-sensitive source code completion,” in 2012 34th International Conference on Software Engineering (ICSE), 2012, pp. 69–79.
[43] N. Nguyen and S. Nadi, “An empirical evaluation of github copilot’s code suggestions,” in 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR). IEEE, 2022, pp. 1–5.
[44] T. Nguyen, P. C. Rigby, A. T. Nguyen, M. Karanfil, and T. N. Nguyen, “T2api: Synthesizing api code usage templates from english texts with statistical translation,” in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. FSE 2016, 2016, pp. 1013–1017.
[45] H. Niu, I. Keivanloo, and Y. Zou, “Api usage pattern recommendation for software development,” Journal of Systems and Software, vol. 129, pp. 127–139, 2017.
[46] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[47] H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri, “An empirical cybersecurity evaluation of github copilot’s code contributions,” arXiv preprint arXiv:2108.09293, 2021.
[48] S. Proksch, S. Amann, S. Nadi, and M. Mezini, “Evaluating the evaluations of code recommender systems: a reality check,” in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2016, pp. 111–121.
[49] S. Ren, D. Guo, S. Lu, L. Zhou, S. Liu, D. Tang, N. Sundaresan, M. Zhou, A. Blanco, and S. Ma, “Codebleu: a method for automatic evaluation of code synthesis,” CoRR, vol. abs/2009.10297, 2020. [Online]. Available: https://arxiv.org/abs/2009.10297
[50] R. Robbes and M. Lanza, “Improving code completion with program history,” Automated Software Engineering, vol. 17, no. 2, pp. 181–212, 2010.
[51] M. P. Robillard, W. Maalej, R. J. Walker, and T. Zimmermann, Recommendation Systems in Software Engineering. Springer Publishing Company, Incorporated, 2014.
[52] D. Sobania, M. Briesch, and F. Rothlauf, “Choose your programming copilot: A comparison of the program synthesis performance of github copilot and genetic programming,” arXiv preprint arXiv:2111.07875, 2021.
[53] A. Svyatkovskiy, S. K. Deng, S. Fu, and N. Sundaresan, “Intellicode compose: Code generation using transformer,” arXiv preprint arXiv:2005.08025, 2020.
[54] A. Tamrawi, T. T. Nguyen, J. M. Al-Kofahi, and T. N. Nguyen, “Fuzzy set and cache-based approach for bug triaging,” in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, ser. ESEC/FSE ’11, 2011, pp. 365–375.
[55] N. Tsantalis, T. Chaikalis, and A. Chatzigeorgiou, “Ten years of jdeodorant: Lessons learned from the hunt for smells,” in 25th International Conference on Software Analysis, Evolution and Reengineering, SANER 2018, R. Oliveto, M. D. Penta, and D. C. Shepherd, Eds. IEEE Computer Society, 2018, pp. 4–14.
[56] Z. Tu, Z. Su, and P. Devanbu, “On the localness of software,” in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. FSE 2014. New York, NY, USA: Association for Computing Machinery, 2014, pp. 269–280. [Online]. Available: https://doi.org/10.1145/2635868.2635875
[57] M. Tufano, D. Drain, A. Svyatkovskiy, and N. Sundaresan, “Generating accurate assert statements for unit test cases using pretrained transformers,” CoRR, vol. abs/2009.05634, 2020.
[58] M. Tufano, C. Watson, G. Bavota, M. Di Penta, M. White, and D. Poshyvanyk, “An empirical study on learning bug-fixing patches in the wild via neural machine translation,” ACM Trans. Softw. Eng. Methodol., vol. 28, no. 4, pp. 19:1–19:29, 2019.
[59] P. Vaithilingam, T. Zhang, and E. L. Glassman, “Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models,” in CHI Conference on Human Factors in Computing Systems Extended Abstracts, 2022, pp. 1–7.
[60] C. Watson, M. Tufano, K. Moran, G. Bavota, and D. Poshyvanyk, “On learning meaningful assert statements for unit test cases,” in Proceedings of the 42nd International Conference on Software Engineering, ICSE 2020, 2020, to appear.
[61] F. Wen, E. Aghajani, C. Nagy, M. Lanza, and G. Bavota, “Siri, write the next method,” in 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain, 22-30 May 2021. IEEE, 2021, pp. 138–149.
[62] M. White, C. Vendome, M. Linares-Vásquez, and D. Poshyvanyk, “Toward deep learning software repositories,” in Proceedings of the 12th Working Conference on Mining Software Repositories, ser. MSR ’15. Piscataway, NJ, USA: IEEE Press, 2015, pp. 334–345. [Online]. Available: http://dl.acm.org/citation.cfm?id=2820518.2820559
[63] X. Xia, D. Lo, Y. Ding, J. M. Al-Kofahi, T. N. Nguyen, and X. Wang, “Improving automated bug triaging with specialized topic model,” IEEE Transactions on Software Engineering, vol. 43, no. 3, pp. 272–297, 2017.
[64] T. Xie and J. Pei, “Mapo: Mining api usages from open source repositories,” ser. MSR ’06, 2006.
[65] F. F. Xu, B. Vasilescu, and G. Neubig, “In-ide code generation from natural language: Promise and challenges,” 2021.
[66] J. Zhang, Y. Zhao, M. Saleh, and P. J. Liu, “Pegasus: Pre-training with extracted gap-sentences for abstractive summarization,” 2019.
[67] A. Ziegler, E. Kalliamvakou, X. A. Li, A. Rice, D. Rifkin, S. Simister, G. Sittampalam, and E. Aftandilian, “Productivity assessment of neural code completion,” in Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, 2022, pp. 21–29.