AI-Supported Software Development: Moving Beyond Code Completion
by
Rohith Pudari
B.Tech., Jawaharlal Nehru Technological University Hyderabad, 2019
MASTER OF SCIENCE
All rights reserved. This thesis may not be reproduced in whole or in part, by
photocopying or other means, without the permission of the author.
Supervisory Committee
ABSTRACT
Table of Contents

Supervisory Committee
Abstract
Table of Contents
Acknowledgements

1 Introduction
1.1 Motivation
1.2 AI-supported code completion tools
1.3 Problem Statement and Research Questions
1.4 Research Design and Methodology
1.5 Contributions
1.6 Thesis Outline

4 Framework
4.1 Introduction
4.1.1 Motivation
4.2 Taxonomy
4.2.1 Syntax
4.2.2 Correctness
4.2.3 Paradigms and Idioms
4.2.4 Code Smells
4.2.5 Design level
4.2.5.1 Module level design
4.2.5.2 System level design
4.3 AI-supported Software Development
4.3.1 Evolution of design over time

Bibliography
List of Tables
List of Figures

Figure 3.1 List comprehension Pythonic idiom and Copilot top suggestion.
Figure 3.2 Find index of every word in input string Pythonic idiom and Copilot top suggestion.
Figure 3.3 Open and write to a file Pythonic idiom and Copilot top suggestion.
Figure 3.4 Best practice for copying array contents and Copilot top suggestion.
Figure 3.5 Best practice for creating two references and Copilot top suggestion.
Figure 3.6 Best practice for returning string and variable name and Copilot top suggestion.
ACKNOWLEDGEMENTS
I would not have finished this thesis without my professors, family, friends, and
colleagues’ help and support.
First and foremost, I would like to thank my advisor, Dr. Neil Ernst, who is
always supportive and patient, provides guidance, and challenges me. I could have
never been here without his countless and significant support. I also thank him for
the resources and funding provided to support my work.
I would like to thank GitHub for providing me access to Copilot. This research
would not have been possible without it.
Moreover, I thank my fellow colleagues in the Octera group for making me feel at home and motivated at the workplace, and for all the activities and fun we had together.
All of this wouldn’t have been possible without the equal efforts from the admin-
istrative staff of the University of Victoria and the Department of Computer Science.
I thank them for all the administrative services they provided me during my M.Sc.
I would like to thank my wonderful and perfect parents: Renuka and Srinivas,
who give me endless and selfless love and always cheer me on, just as they have every
step of the way. They gave me the ability to maintain hope for a brighter morning,
even during our darkest nights.
Also, I take this opportunity to thank my friends outside the research lab: Francis,
Susan, Kealey, Anthony, and Sean. Without their support, my life in Victoria would
be boring and lonely. Thank you for sticking by me.
Thank you all.
Chapter 1
Introduction
Identifying these boundaries is a challenging task. In the next section, we discuss this challenge
and the research opportunity it creates as this study’s motivation.
1.1 Motivation
In recent years, there have been considerable improvements in the field of AI-supported code completion tools. Copilot [27], an in-IDE recommender system that leverages OpenAI's Codex neural language model (NLM) [12], which in turn uses a GPT-3 model [10], has been at the forefront. It is particularly impressive at understanding the context and semantics of code from just a few lines of comments or code as input, and it can suggest the next few lines or even entire functions in some cases [12].
The biggest challenge with using tools like Copilot is their training data. These tools are trained on existing software source code, and training costs are expensive. Several classes of errors have been discovered (shown in section 3.2) that follow from the presence of these same errors in public (training) data. Users need to stay cautious about common bugs creeping into Copilot suggestions, and Copilot needs to be kept up to date with the ever-changing best practices and new bug fixes in public repositories.
Understanding the current limitations of Copilot will help developers use AI-supported code completion tools effectively. Knowing where Copilot performs well or poorly lets users work more efficiently: they can let AI-supported code completion tools take over where they excel, and focus their own attention on tasks where AI-supported code completion tools are shown to struggle with the quality of code suggestions.
A taxonomy of software abstractions would help create a formal structure to better
understand the capabilities and limitations of AI-supported code completion tools like
Copilot. A taxonomy helps describe the hierarchical stages of software abstractions at
which AI-supported code completion tools like Copilot are operating. For example,
Copilot is capable of suggesting syntactical code but cannot suggest architectural
tactics. Creating a taxonomy of software abstractions can be useful in developing
new AI-supported code completion tools and measuring their capabilities.
A taxonomy of software abstractions can also help make code suggestions better in complex situations, shift research focus toward making AI-supported code completion tools better at tasks shown to be challenging, and minimize the input required from the user for AI-supported code completion tools to create meaningful, quality suggestions.
In this thesis, we investigate the current limitations of Copilot with an empirical
study on Copilot suggestions. Using the findings, we introduce a taxonomy of software
abstraction hierarchy, modeled on the ideas in the SAE taxonomy and Koopman’s
extension for self-driving vehicles. The insights gained from this study can help
developers understand how to best use tools like Copilot and provide insights to
researchers trying to develop and improve the field of AI-supported code completion
tools.
The capabilities and the limitations of current AI-supported code completion tools like Copilot are unknown. Identifying the boundaries of AI-supported code completion tools would help users use these tools effectively and shift the research focus toward the tasks where AI-supported code completion tools are proven to be unhelpful.
This study aims to understand the areas where Copilot performs better than a
human and where Copilot performs worse than a human. We conduct an exploratory
study with the following research objectives:
RQ-1: What are the current boundaries of AI-supported code completion tools?

RQ-2: Given the current boundary, how far is it from suggesting design decisions which seem far beyond the boundary?
Approach - Based on our findings in RQ-1, we discuss how far current AI-
supported code completion tools are from the design level in our taxonomy. We
look at the current limitations of Copilot and provide recommendations on how
to make current AI-supported code completion tools reach the design abstrac-
tion level. Additionally, we report on ethical considerations, explainability, and
control of AI-supported code completion tools like Copilot.
1.5 Contributions
This thesis contributes the following:
• We release our coding experiments on Copilot for best practices and language
idioms. We make this available in our replication package [60].
• Using the findings from Copilot's code suggestions on best practices and language idioms, we present a taxonomy of software abstraction hierarchy that delineates where current AI-supported code completion tools like Copilot stand.
• Based on our experiences in this study, we present future directions for mov-
ing beyond code completion to AI-supported software engineering, which will
require an AI system that can, among other things, understand how to avoid
code smells, follow language idioms, and eventually propose rational software
designs.
Chapter 2 elaborates on the background information and some related work on AI-
supported code completion tools and GitHub Copilot. It further introduces the
challenges with using AI-supported code completion tools that will be explored
in this thesis.
Chapter 3 introduces our study and the methodology showing the sampling ap-
proach, input, and evaluation criteria we used to address RQ-1 (What are the
current boundaries of code completion tools). We then present the results of
Copilot code suggestions for language idioms and code smells.
Chapter 4 introduces our taxonomy of the software abstraction hierarchy, modeled on autonomous driving levels, and uses the results of Chapter 3 to delineate where AI-supported code completion tools like Copilot currently stand.

Chapter 5 addresses RQ-2 (Given the current boundary, how far is it from sug-
gesting design decisions?) with a discussion on the complex nature of design
decisions. In addition, we discuss future directions for AI-supported code com-
pletion tools to reach the design abstraction level in the taxonomy. We conclude
by discussing the implications and the limitations of this study.
Chapter 2
2.1 Introduction
In this chapter, we first highlight some related work that led towards current AI-supported code completion tools like Copilot [27]. Further, we discuss existing research on Copilot. Finally, we provide details regarding alternative AI-supported code completion tools that are similar to Copilot.
2.2.1.1 N-grams
2.2.1.2 SLANG
SLANG [64] is a tool for completing gaps (missing API calls) in an API call sequence.
It combines RNNME [50] with N-grams (n=3, trigrams). The evaluation of the
study reveals that using the average of the probabilities generated by the two models
for filling in gaps in API call sequences performs better than using either model
alone, highlighting the applicability of N-gram LMs in support of more sophisticated
models (RNNME in this case).
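A minimal sketch of this combination, assuming per-candidate probabilities from each model are already available (the function and values below are hypothetical):

# A sketch of SLANG-style model combination (hypothetical, simplified):
# average the probabilities that the two models assign to each candidate
# API call for a gap, then pick the highest-scoring candidate.
def best_candidate(rnn_probs, ngram_probs):
    averaged = {
        call: (rnn_probs[call] + ngram_probs[call]) / 2
        for call in rnn_probs
    }
    return max(averaged, key=averaged.get)

rnn_probs = {"file.read": 0.6, "file.close": 0.2}
ngram_probs = {"file.read": 0.3, "file.close": 0.5}
print(best_candidate(rnn_probs, ngram_probs))  # file.read (average 0.45)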
2.2.1.3 CACHECA
CACHECA [24] is an Eclipse IDE [18] plugin that tries to improve the code completion suggestions provided by Eclipse's default tool. It is built on a cache language model [72], which gives words that have already appeared in the text a higher likelihood of occurrence. The localness of code asserts that code has local regularities (a variable will generally occur multiple times in a block of code) [72].
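A minimal sketch of the cache idea, simplified to mixing a global probability with a local-cache frequency (the function, weight, and values are hypothetical):

from collections import Counter

# A sketch of a cache language model (hypothetical, simplified): mix a
# global model probability with the frequency of the token in a cache of
# recently seen tokens, so locally repeated identifiers score higher.
def cached_probability(token, global_prob, cache, weight=0.5):
    counts = Counter(cache)
    cache_prob = counts[token] / len(cache) if cache else 0.0
    return weight * global_prob + (1 - weight) * cache_prob

recent = ["result", "=", "result", "+", "item"]
print(cached_probability("result", 0.01, recent))
# 0.205 -- boosted because "result" appears twice in the local cache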
The limited, fixed-length context of N-gram models also restricts their potential use for advanced code completion tasks such as suggesting variables or functions initialized in another class in the same file.
Out of Vocabulary (OOV) problem - In the specific application of code completion, a developer may use a variable or library in a particular application that is not included in the training set. As a result, N-gram models would never predict that term, preventing it from being offered as a prediction. This restricts code completion tools to the knowledge of the training set, making it impossible for them to generalize.
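A minimal sketch of the problem with a toy bigram model (the training tokens are hypothetical):

from collections import defaultdict

# A sketch of the OOV problem with a toy bigram model (hypothetical
# training data): tokens absent from training can never be predicted.
training_tokens = ["for", "i", "in", "range", "(", "n", ")", ":"]

bigrams = defaultdict(set)
for prev, nxt in zip(training_tokens, training_tokens[1:]):
    bigrams[prev].add(nxt)

print(bigrams["in"])         # {'range'} -- seen in training
print(bigrams["enumerate"])  # set() -- OOV: never predictable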
The encoding of words as fixed-length vectors allowed for the use of Deep Learning in NLP. The requirement that all senses of a word share the same representation presented a challenge. Context-sensitive, contextualised word representations were presented by LMs such as BERT [17], ELMo [57], and GPT-3 [10], meaning that representations of the same word vary depending on the context. BERT has a variant for general-purpose representations that supports downstream applications such as natural language code search and code documentation generation, called CodeBERT [22]. Metrics in numerous NLP tasks have significantly improved because of these new word representations [20].
A later study proposed a model that either generates the next token from a global vocabulary or copies one from a local context [44]. In addition, a modified version of the Attention mechanism using an AST-based representation of the programming language was developed in this study [44]. These improvements to the transformer architecture improved the performance of AI-supported code completion tools and led to the creation of large language models like Codex [12], which is the model used in the creation of GitHub Copilot [27].
• Jedi [34] - An open source Python static analysis tool aimed at automatic
code completion. Jedi has a focus on autocompletion and goto functionality.
Other features include refactoring, code search, and finding references. However,
performance issues and limitations on project size and complexity hamper its
effectiveness.
• Kite [41] - Kite is an AI-powered code completion plugin that works with 16
languages and 16 IDEs. It uses machine learning to rank completions more
intelligently based on your current code context [41].
• Deep TabNine [71] - Deep TabNine is a source code completion system based on OpenAI's GPT-2 model [61]. It is trained on GitHub public repositories with permissive open-source licenses. The training code is also filtered to ensure quality and to avoid outdated, esoteric, or auto-generated code and other edge cases [71].
• AlphaCode [45] - AlphaCode is a code generation system by DeepMind aimed at competitive programming problems [45]. The training process for AlphaCode included tests for code cloning, sensitivity to problem descriptions, and metadata. The AlphaCode model focuses on creating novel solutions to problems that require deeper reasoning.
In this chapter, we highlighted the work leading to current AI-supported code completion tools and existing research on Copilot, covering its usage [73] and its effectiveness in solving programming contest-style problems [54]. We concluded by introducing some of the other AI-supported code completion tools that provide similar functionality to Copilot.
In the following chapters, we discuss the problems with using AI-supported code completion tools like Copilot that are harder to fix because straightforward corrections may not exist, like language idioms and code smells. We address RQ-1 (What are the current boundaries of code completion tools) using our methodology and present our results (Chapter 3). We then introduce a taxonomy of software abstraction hierarchy to help find the current boundaries of AI-supported code completion tools like Copilot (Chapter 4).
We address RQ-2 (Given the current boundary, how far is it from suggesting
design decisions?) with a discussion of the complex nature of design decisions and
the challenges with trying to use AI-supported code completion tools to make design
decisions. Finally, we discuss some of the practical implications and limitations of
our findings and also provide some future directions to help further research in AI-
supported code completion tools (Chapter 5).
Chapter 3
3.1 Introduction
Useful AI-supported code completion tools should always suggest recommended coding best practices in their first suggestion. In this chapter, we test whether Copilot suggests the optimal solution for a programming task. "Optimal" here means the recommended way of solving a programming task, sampled from popular sources. We begin by explaining the current challenges with AI-supported code completion tools like Copilot, showing recent research on common problems faced when using Copilot and our motivation for finding the limitations of current AI-supported code completion tools like Copilot (section 3.2).
In section 3.3, we explain our approach to RQ-1 (What are the current bound-
aries of AI-supported code completion tools?). We describe our sampling approach
to collecting Pythonic idioms (section 3.3.1.1) and best practices in JavaScript (sec-
tion 3.3.2.1). We then describe the input given to Copilot for triggering the generation
of code suggestions (section 3.3.3). Finally, we explain our evaluation method to com-
pare Copilot suggestions to the recommended practices (section 3.3.4).
In section 3.4, we present our results on the performance of Copilot in suggesting rec-
ommended practices for 50 different coding scenarios (25 Pythonic idioms + 25 code
smells), which answers RQ-1.1 (How do AI-supported code completion tools manage
programming idioms?), and RQ-1.2 (How do AI-supported code completion tools
manage to write non-smelly code?). We observe that Copilot had the recommended
practices in its top 10 suggestions for 18 out of 50 coding scenarios (36% of all tests
performed).
Existing detection tools can help find places where code violates copyright. Better machine learning approaches, using active learning or fine-tuning, might help learn local lessons [48] for customization in the case of identifier naming or formatting. In most of these cases, good tools already exist.
Although these are clearly challenges, Copilot seems already to be on its way to fixing them, for example with a filter introduced by GitHub to suppress code suggestions that match public code on GitHub. However, what is more difficult to envision are the problems that are harder to fix because straightforward corrections may not exist and rules for finding them are more challenging to specify than those in smell detectors or linters [19], such as language idioms and code smells.
Developers often discuss software architecture and actual source code implemen-
tations in online forums, chat rooms, mailing lists, or in person. Programming tasks
can be solved in more than one way. The best way to proceed can be determined
based on case-specific conditions, limits, and conventions. Strong standards and a
shared vocabulary make communication easier while fostering a shared understand-
ing of the issues and solutions related to software development. However, this takes
time and experience to learn and use idiomatic approaches [3].
AI-supported code completion tools can steer users towards more idiomatic approaches with their code suggestions, or away from them. This makes it crucial to find the boundaries of AI-supported code completion tools like Copilot (RQ-1) and to create a clear understanding of where we can rely on tools like Copilot and where the user should be vigilant about their code suggestions. To achieve this, we conduct an exploratory study to find out whether AI-supported code completion tools like Copilot suggest the recommended best coding practices in their suggestions.
3.3 Methodology
In this section, we explain the methodology we used to address RQ-1 (What are the current boundaries of AI-supported code completion tools?). We perform our experiments on Copilot suggestions for Pythonic idioms (section 3.3.1) and code smells in JavaScript (section 3.3.2). Additionally, we explain how the 25 coding scenarios each for Pythonic idioms (section 3.3.1.1) and code smells in JavaScript (section 3.3.2.1) were sampled. Finally, we discuss how the input is shaped to trigger Copilot to generate code suggestions (section 3.3.3) and how Copilot suggestions are evaluated (section 3.3.4). The following analysis was carried out using the Copilot extension in Visual Studio Code, with the most recent stable release of the extension available at the time of writing (version 1.31.6194).
An idea or piece of code follows the most common idioms of the Python
language rather than implementing code using concepts common to other
languages. For example, a common idiom in Python is to loop over all
elements of an iterable using a for statement. Many other languages do
not have this construct, so people unfamiliar with Python sometimes use
a numerical counter instead of the cleaner, Pythonic method.3

This definition indicates a broad meaning, referring to both concrete code and ideas in a general sense. Many Python developers argue that coding the Pythonic way is the most accepted way to code in the Python community [3]. We consider an idiom to be any reusable abstraction that makes Python code more readable by shortening or adding syntactic sugar. Idioms can also be more efficient than a basic solution, and some idioms are both more readable and more efficient. The Pythonicity of a piece of code stipulates how concise, easily readable, and generally good the code is. This concept of Pythonicity, as well as the concern about whether code is Pythonic or not, is notably prevalent in the Python community, but it also exists in other languages, e.g., Go and gofmt, Java, C++, etc. Notably, Perl would be an exception, as the philosophy for Perl was "there's more than one way to do it".

3 https://docs.python.org/3/glossary.html#term-pythonic
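The glossary's example can be made concrete with a short sketch:

items = ["a", "b", "c"]

# Non-Pythonic: a numerical counter, common in other languages.
i = 0
while i < len(items):
    print(items[i])
    i += 1

# Pythonic: loop directly over the iterable with a for statement.
for item in items:
    print(item)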
We sampled idioms from the work of Alexandru et al. [3] and Farook et al. [21], which identified idioms from presentations given by renowned Python developers who frequently mention idioms, e.g., Hettinger [30] and Jeff Knupp [42], and from popular Python books, such as "Pro Python" [2], "Fluent Python" [63], and "Expert Python Programming" [33].
We sampled the top 25 popular Pythonic idioms found in open source projects based on the work of Alexandru et al. [3] and Farook et al. [21]. We chose the most popular Pythonic idioms to give Copilot the best chance of suggesting the Pythonic way as its top suggestion: Copilot will have seen the Pythonic way more frequently in its training data and should therefore be more likely to suggest it. However, Copilot is closed source, and we cannot determine whether the frequency of code snippets in the training data affects Copilot's suggestions. Research by GitHub shows that Copilot can sometimes recite from its training data in "generic contexts"4, which may lead to potential challenges like license infringement (shown in section 3.2). Sampling the most frequently used idioms will also help us understand whether Copilot can recite idioms present in its training data (GitHub public repositories), which is the ideal behavior for AI-supported code completion tools.
The AirBNB JavaScript coding style guide [1] is one of the first style guides for JavaScript and one of the most popular, used in approximately a million repositories on GitHub [1]. It is a comprehensive list of best practices covering nearly every aspect of JavaScript coding, like objects, arrays, modules, and iterators. However, it also includes project-specific styling guidelines like naming conventions, commas, and comments. Since we are testing Copilot for widely accepted best practices and not project-specific styling in JavaScript, we sampled the 25 best practices from the AirBNB JavaScript coding style guide [1] that were closer to the design level rather than the code level, for example selecting logging practices as a sample coding standard rather than trailing comma use. This sampling approach ensures Copilot is not tested against the personalized styling guidelines of one specific project or company. Instead, our goal is to test Copilot against practices that bring performance or efficiency to the code base.
Figure 3.1 shows the example of the list comprehension idiom, where the human input is restricted to just declaring a variable "result_list". If the input included initializing the variable with some integers or an empty list, then Copilot would be forced to use a for loop on the next line, eliminating the possibility of suggesting the idiomatic approach. Although overriding or updating the input would be a desirable feature for AI-supported code completion tools, Copilot does not support it yet. So, we restrict the input to the minimum information from which the best practice can be derived.
This input style also mimics a novice user who is unaware of the idioms or best practices. Useful AI-supported code completion tools like Copilot should drive the novice user towards best practices for performing a task in their codebase, improving the quality of their code.
3.4 Results
In section 3.3, we discussed our sources, the sampling approach for Pythonic Idioms,
and the JavaScript coding style guide. We then discussed how the input for Copilot is shaped.
List Comprehension

Human Input:

# list comprehension
result_list =

Copilot Suggestion:

result_list = []
for i in range(1, 11):
    result_list.append(i)

Pythonic way (source: [3]):

result_list = [el for el in range(11)]

Figure 3.1: List comprehension Pythonic idiom and Copilot top suggestion.
The Pythonic way took 1.1e-05 seconds, using its lazy approach of finding the index of each word in the input string, making it significantly faster than the Copilot-suggested code.
Copilot Suggestion:

def index_words(text):
    result = []
    for i, word in enumerate(text.split()):
        result.append((i, word))
    return result

Pythonic way (source: [3]):

def index_words(text):
    if text:
        yield 0
    for index, letter in enumerate(text):
        if letter == ' ':
            yield index + 1

Figure 3.2: Find index of every word in input string Pythonic idiom and Copilot top suggestion.
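The timing comparison above can be reproduced with the standard timeit module; the following sketch illustrates the approach (absolute numbers vary by machine):

import timeit

# Time both versions from Figure 3.2. Calling the generator version
# only creates the generator object, so it is far cheaper up front than
# eagerly building the full list.
setup = """
text = "the quick brown fox " * 100

def eager(text):
    result = []
    for i, word in enumerate(text.split()):
        result.append((i, word))
    return result

def lazy(text):
    if text:
        yield 0
    for index, letter in enumerate(text):
        if letter == ' ':
            yield index + 1
"""

print(timeit.timeit("eager(text)", setup=setup, number=1000))
print(timeit.timeit("lazy(text)", setup=setup, number=1000))  # cheaper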
The most Pythonic way of performing a task is the clearest, the best, or even the most optimal in a computational sense. Another important characteristic of idiomatic code is that it makes the code easier to understand and maintain. Being Pythonic helps to detect errors and even to make fewer mistakes.
Figure 3.3 shows the Copilot code suggestion for opening and writing to a file. If we use the code suggested by Copilot and there is an exception while writing, the file will not be closed, as the f.close() line will not be reached by the interpreter. In the Pythonic way, however, the file is closed regardless of what happens inside the indented block of code. Clearly, the Pythonic way is superior.
Copilot Suggestion:

f = open('idioms.txt', 'w')
f.write('Hello World')
f.close()

Pythonic way (source: [3]):

with open('idioms.txt', 'w') as f:
    f.write('Hello, World!')

Figure 3.3: Open and write to a file Pythonic idiom and Copilot top suggestion.
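The with statement is safer because it behaves roughly like a try/finally block, as this sketch shows:

# What the with statement in Figure 3.3 roughly expands to: the file is
# closed even if write() raises an exception.
f = open('idioms.txt', 'w')
try:
    f.write('Hello, World!')
finally:
    f.close()  # always runs, unlike the unprotected f.close() above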
Table 3.1 shows the list of all 25 Pythonic idioms we tested and the ranking of the Pythonic way in Copilot suggestions (if it exists). All the idioms shown in Table 3.1 can be found in the replication package [60], including the code used as input (i.e., human input), the top suggestion by Copilot, and the Pythonic way suggested in Alexandru et al. [3] and Farook et al. [21].
Copilot had the Pythonic way in its top 10 suggestions for 8 coding scenarios where it ranked a non-idiomatic approach as the top suggestion. The ranking methodology of Copilot is not disclosed; however, the results suggest that it is heavily influenced by the frequency of the approach in the training data. Copilot successfully suggested the idiomatic approach as its top suggestion for 'set comprehension' and 'if condition check value' (idioms 7 & 10 in Table 3.1), which are among the most frequently occurring idioms in open source code [3].
Copilot is more likely to have the idiomatic approach in its top 10 suggestions when there are only a few ways of performing a task. For example, consider the Boolean comparison idiom: there are only two common ways of performing the task, i.e., 'if boolean:' or 'if boolean == True'. Even so, Copilot ranked the non-idiomatic approach higher than the idiomatic approach in this case.
AI-supported code completion tools like Copilot should learn to detect idiomatic ways in public repositories and rank them higher than the most frequently used way, so that the first suggestion is the idiomatic way rather than the non-idiomatic way; this is the desired behavior for AI-supported code completion tools like Copilot. For the scope of this thesis, we leave resolving this problem as future work.
Figure 3.4 shows the best practice for copying array contents, showing the user input (i.e., human input), the top suggestion by Copilot, and the recommended way suggested by the AirBNB JavaScript coding style guide [1]. The code suggested by Copilot uses a for loop to iterate through all the contents of the array and copy them one by one, which is a lot slower than the spread approach ([...items]) shown as the best practice for copying an array in the AirBNB JavaScript coding style guide [1]; the spread approach eliminates the need to iterate through all the elements of the array to copy each one.
Copilot Suggestion:

const len = items.length;
const copy = new Array(len);
for (let i = 0; i < len; i++) {
  copy[i] = items[i];
}

Ideal way (source: [1]):

const itemsCopy = [...items];

Figure 3.4: Best practice for copying array contents and Copilot top suggestion.
Another example of a best practice being better than the Copilot code suggestion is shown in Figure 3.5. The AirBNB JavaScript coding style guide [1] recommends always using 'const' for all references and avoiding 'var', because 'const' ensures that you cannot reassign your references, which can otherwise lead to bugs and difficulty in code comprehension. Copilot suggested 'var' as its first suggestion. This shows that code suggested by Copilot has flaws and does not follow the recommended best practices.
Figure 3.6 shows the best practice for returning a string and a variable name from a function. The AirBNB JavaScript coding style guide [1] suggests using template strings instead of concatenation for programmatically building up strings, because template strings give a readable, concise syntax with proper newlines and string interpolation features.
Copilot Suggestion:

var a = 1;
var b = 2;

Ideal way (source: [1]):

const a = 1;
const b = 2;

Figure 3.5: Best practice for creating two references and Copilot top suggestion.
Copilot Suggestion:

function sayhi(name) {
  return "Hello " + name;
}

Ideal way (source: [1]):

function sayHi(name) {
  return `Hello, ${name}`;
}

Figure 3.6: Best practice for returning string and variable name and Copilot top suggestion.
Copilot had the recommended best practice in its top 10 suggestions for 5 coding scenarios where it did not rank the recommended best practice as the top suggestion. The ranking methodology of Copilot is not disclosed; however, the results suggest that it is heavily influenced by the frequency of the approach in the training data. Copilot successfully suggested the recommended best practice as its top suggestion for 'accessing properties', 'converting an array-like object to an array', and 'check boolean value' (best practices 7, 15 & 23 in Table 3.2), which are among the most common practices used by beginners to perform the task [1].
Based on the results shown in Table 3.2, Copilot is more likely to have the recommended best practice in its top 10 suggestions when it is a common beginner programming task, like finding the 'sum of numbers' or 'importing a module from a file'. We also observed that, unlike in the case of Pythonic idioms, Copilot did not always generate all 10 suggestions and struggled to come up with 10 suggestions for a programming task. This suggests that Copilot does not have enough JavaScript training data, compared to Python, to create more relevant suggestions that might include the recommended best practices.
The ideal behavior for AI-supported code completion tools like Copilot is to suggest best practices extracted from public code repositories (training data) and to avoid code smells. Additionally, AI-supported code completion tools like Copilot should detect the project's coding style and adapt their code suggestions accordingly, to be helpful to the user as a productivity tool. For the scope of this thesis, we leave resolving this problem as future work.
In this chapter, we sampled 25 Pythonic idioms and identified that Copilot did not suggest the Pythonic way as its top suggestion for 23 out of 25 coding scenarios in Python, which addressed RQ-1.1 (How do AI-supported code completion tools manage programming idioms?). Furthermore, we sampled 25 best practices in JavaScript from the AirBNB JavaScript coding style guide [1]. We identified that Copilot did not suggest the recommended best practice for 22 out of 25 coding scenarios in JavaScript, which addressed RQ-1.2 (How do AI-supported code completion tools manage to write non-smelly code?).
We showed that Copilot struggles to detect coding style guides present in public repositories on GitHub and to consistently suggest code that follows those style guides. We also observed that Copilot struggles to detect the most common idiomatic ways present in public repositories on GitHub and to rank them higher than the non-idiomatic ways. Identifying this delineation could help turn AI-supported code completion tools such as Copilot into full-fledged AI-supported software engineering tools.
In the next chapter (Chapter 4), we illustrate our taxonomy, inspired by autonomous driving levels, of the software abstraction hierarchy in AI-supported software development, and we use the results shown in this chapter to delineate where AI-supported code completion tools like Copilot currently stand in the taxonomy.
Chapter 4
Framework
4.1 Introduction
Copilot works best at creating boilerplate and repetitive code patterns [27]. However, the code suggested by AI-supported code completion tools like Copilot has been found to contain simple coding mistakes and security vulnerabilities. Several classes of errors have been discovered that follow from the presence of these same errors in Copilot's training data (shown in section 3.2). In Chapter 3, we identified that Copilot does not perform well at detecting and suggesting Pythonic idioms and best practices in JavaScript. The scope of capability and the quality of code suggestions made by AI-supported code completion tools like Copilot are uncertain.
In this chapter, we try to create a metric for answering RQ-1 (What are the current boundaries of code completion tools) with a taxonomy of six software abstraction levels to help assess the current capabilities of AI-supported code completion tools such as Copilot. We explain each software abstraction level in the taxonomy and the capabilities required of AI-supported code completion tools to satisfy it. We try to delineate where current AI-supported code completion tools such as Copilot are best able to perform, and where more complex software engineering tasks overwhelm them, using a software abstraction hierarchy in which "basic programming functionality" such as code compilation and syntax checking is the lowest abstraction level, while software architecture analysis and design are at the highest abstraction level. Additionally, we use a sorting routine as an example scenario to show what a code suggestion from an AI-supported code completion tool looks like at every level of abstraction in our taxonomy.
Finally, we try to address RQ-2 (Given the current boundary, how far is it from
suggesting design decisions?) with a discussion on the level of complexities and chal-
lenges involved in creating AI-supported code completion tools that can satisfy design
level compared to AI-supported code completion tools satisfying code smells level in
our taxonomy.
4.1.1 Motivation
To center our analysis on creating a software abstraction hierarchy to create a metric
for answering RQ-1 (What are the current boundaries of code completion tools),
we leverage an analogous concept in the more developed (but still nascent) field of
autonomous driving. Koopman has adapted the SAE Autonomous Driving safety
levels [23] to seven levels of autonomous vehicle safety hierarchy of needs shown in
figure 4.1.
Figure 4.1: Koopman’s Autonomous Vehicle Safety Hierarchy of Needs [43]. SOTIF
= safety of the intended function.
The pyramid concept is derived from that of Maslow [47], such that addressing
aspects on the top of the pyramid requires the satisfaction of aspects below. For ex-
ample, before thinking about system safety (such as what to do in morally ambiguous
scenarios), the vehicle must first be able to navigate its environment reliably (“Basic
Driving Functionality”).
We think that a similar hierarchy exists in AI-supported software development. For example, the basic driving functionality in Figure 4.1 is satisfied when the vehicle works in a defined environment without hitting any objects or other road users. This could be equivalent to code completion tools being able to write code without obvious errors, such as syntax errors. The hazard analysis level requires vehicles to analyze and mitigate risks not just from driving functions but also from potential technical malfunctions, forced exits from the intended operational design domain, etc. From a software development perspective, this could be equivalent to writing bug-free code and avoiding code smells. The system safety level could be the equivalent of software design, where tools need to move beyond code completion and satisfy system quality attributes such as performance, as well as follow idiomatic approaches.
Addressing aspects on the top of the pyramid requires the satisfaction of aspects
below. Similarly, for AI-supported software development tools, before worrying about
software architecture issues, that is, satisfying system quality attributes such as per-
formance and following idiomatic approaches, AI-supported software development
tools need to exhibit “basic programming functionality”. This basic functionality is
where most research effort is concentrated, such as program synthesis, AI-supported
code completion tools, and automated bug repair.
4.2 Taxonomy
Our taxonomy is a software abstraction hierarchy where "basic programming functionality" such as code compilation and syntax checking is the lowest abstraction level, while software architecture analysis and design are at the highest abstraction level. As we ascend the levels, just as with Koopman's pyramid (shown in Figure 4.1), software challenges rely more on human input and become more difficult to automate (e.g., crafting design rules vs. following syntax rules).

Figure 4.2 shows the taxonomy of autonomy levels for AI-supported code completion tools. The more abstract top levels depend on the resolution of the lower ones. As we move up the hierarchy, we require more human oversight of the AI; as we move down the hierarchy, rules for detecting problems are easier to formulate. Green levels are areas where AI-supported code completion tools like Copilot work reasonably well, while red levels are challenging for Copilot, based on the tests shown in Chapter 3.
Figure 4.2: Hierarchy of software abstractions. Copilot cleared all green levels and
struggled in red levels.
Based on our tests with Copilot for Pythonic idioms and JavaScript best practices (shown in Chapter 3), Copilot was able to generate syntactically correct code that solves the given programming task in each coding scenario.1 This functionality covers the syntax and the correctness levels in our software abstraction hierarchy. As a result, Copilot stands at the correctness level of our taxonomy.
The challenges further up the hierarchy are nonetheless more important for software quality attributes (QA) [19] and for a well-engineered software system. For example, an automated solution suggested by AI-supported code completion tools at the top level of the taxonomy would be able to follow heuristics to engineer a well-designed software system, which would be easy to modify and scale to sudden changes in use.

1 All coding scenarios tested are documented in our replication package [60].
4.2.1 Syntax
The syntax level is the lowest software abstraction level in our taxonomy. This level includes the most basic programming functionality, like suggesting code that has correct syntax and no compilation errors. This level does not require the code suggested by AI-supported code completion tools to successfully solve the programming task, only to be free of obvious errors such as syntax or compilation errors.

For example, consider a programming task of performing a sorting operation on a list of numbers. To satisfy this level of abstraction, AI-supported code completion tools should suggest code that is syntactically correct without any compilation errors; the code is not required to perform the sorting operation correctly. Figure 4.3 shows the sorting example and Python syntax suggestions from AI-supported code completion tools at this abstraction level.
Figure 4.3: Code suggestion of AI-supported code completion tools at syntax level.
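To make this concrete, a suggestion that satisfies only the syntax level might look like the following sketch (hypothetical; the code parses and runs but does not actually sort):

# A sketch of a syntax-level suggestion (hypothetical): syntactically
# valid Python that runs without errors, but it reverses the list
# instead of sorting it.
def sort_numbers(numbers):
    result = []
    for n in numbers:
        result.insert(0, n)  # builds the reversed list, not a sorted one
    return result

print(sort_numbers([3, 1, 2]))  # [2, 1, 3] -- runs, but wrong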
The goal of this software abstraction level in our taxonomy is for AI-supported code completion tools to be able to suggest code without any syntactical errors. The capabilities required by AI-supported code completion tools to satisfy this level of abstraction are as follows:
4.2.2 Correctness
Correctness is the second level of software abstraction in our taxonomy. AI-supported code completion tools at this level should be capable of suggesting code that is not only syntactically correct but also solves the programming task. This level does not require AI-supported code completion tools to suggest the best possible solution for the programming task, only a decent solution, which may or may not resolve all the edge cases of the given programming task. Here, "correctness" refers to performing the intended function specified by the user.
For example, consider the programming task of performing a sorting operation on
a list of numbers. To satisfy this level of abstraction, AI-supported code completion
tools should suggest a syntactically correct list sorting code, which is not required to
be the most efficient way of sorting a list. Figure 4.4 shows the list sorting example
and the Python code suggestion from AI-supported code completion tools at this
abstraction level, which performs the sorting operation.
The goal of this software abstraction level in our taxonomy is for AI-supported code completion tools to be able to suggest a working solution, not necessarily the best one. The capabilities required by AI-supported code completion tools to satisfy this level of abstraction are as follows:

1. Suggest a solution for a given programming task that may not be the optimal solution for that programming task.

2. The solution suggested is not required to cover all the edge cases for that programming task.
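For the same sorting task, a correctness-level suggestion might look like the following sketch (hypothetical): it sorts correctly but uses a suboptimal algorithm.

# A sketch of a correctness-level suggestion (hypothetical): a bubble
# sort that sorts correctly but is far from the most efficient approach
# (O(n^2) on average).
def sort_numbers(numbers):
    result = list(numbers)  # avoid mutating the caller's list
    for i in range(len(result)):
        for j in range(len(result) - 1 - i):
            if result[j] > result[j + 1]:
                result[j], result[j + 1] = result[j + 1], result[j]
    return result

print(sort_numbers([3, 1, 2]))  # [1, 2, 3]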
4.2.3 Paradigms and Idioms

The capabilities required by AI-supported code completion tools to satisfy this level of abstraction include the following:

1. Identify common patterns like paradigms and language idioms in public code repositories (training data).

3. Satisfy the requirements of all the levels below paradigms and idioms in our taxonomy.

Figure 4.5: Code suggestion of AI-supported code completion tools at paradigms and idioms level.
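For the sorting task, a paradigms-and-idioms-level suggestion should use the idiomatic construct of the language, as in this minimal sketch (hypothetical):

# A sketch of a paradigms-and-idioms-level suggestion (hypothetical):
# the Pythonic way uses the built-in sorted() rather than a hand-rolled
# loop, returning a new sorted list.
def sort_numbers(numbers):
    return sorted(numbers)

print(sort_numbers([3, 1, 2]))  # [1, 2, 3]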
4.2.4 Code Smells

Figure 4.6: Code suggestion of AI-supported code completion tools at code smells level.

The goal of this software abstraction level in our taxonomy is for AI-supported code completion tools to be able to detect and avoid bad practices, such as code smells that commonly occur in public code, in their code suggestions for a problem, and to suggest the most optimized version as the first suggestion to solve a programming task.
The capabilities required by AI-supported code completion tools to satisfy this
level of abstraction are as follows:
1. Identify common bad practices such as code smells that occur in public code (train-
ing data).
2. Suggest solutions that do not have code smells and unresolved edge cases.
3. Suggested code should be the most optimized version of all the possible sugges-
tions AI-supported code completion tools could create for a given problem.
4. AI-supported code completion tools should not suggest code that needs to be
immediately refactored.
5. Satisfy the requirements of all the levels below code smells in our taxonomy.
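To make the distinction concrete, the following sketch (hypothetical task and names) contrasts a smelly suggestion with a smell-free one:

# Smelly version: duplicated comparison logic and the magic number 9999.
def smallest_two_smelly(numbers):
    first = 9999
    second = 9999
    for n in numbers:
        if n < first:
            second = first
            first = n
        elif n < second:
            second = n
    return first, second

# Smell-free version: reuse the idiomatic sort instead of duplicating logic.
def smallest_two(numbers):
    first, second = sorted(numbers)[:2]
    return first, second

print(smallest_two([3, 1, 2]))  # (1, 2)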
4.2.5.1 Module level design

Module level design is the first half of our taxonomy's design level of software abstraction. This level requires the suggested code to be free of all known vulnerabilities, to include test cases, and to include continuous integration (CI) methods, such as automating the build and testing process of the software when applicable. Code suggestions should also cover all the functional requirements of a given programming task.
AI-supported code completion tools at this level should be able to pick and suggest the best applicable algorithm for a given coding scenario and be capable of following user-specified coding style guidelines. For example, consider the task of a sorting operation on a list of numbers. To satisfy this level of abstraction, AI-supported code completion tools should suggest syntactically correct list sorting code using the algorithm that gives the best performance for that particular input scenario, like suggesting a quicksort algorithm (average time complexity n log n) instead of a bubble sort algorithm (average time complexity n^2), unless specifically requested otherwise by the user.
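A minimal sketch of that kind of algorithm awareness (hypothetical; the threshold is illustrative only):

# A sketch of module-level algorithm selection (hypothetical): prefer
# the efficient O(n log n) built-in sort, falling back to a simple
# insertion sort only for tiny inputs where its overhead is negligible.
def sort_numbers(numbers):
    if len(numbers) < 8:  # threshold chosen purely for illustration
        result = list(numbers)
        for i in range(1, len(result)):  # insertion sort, O(n^2)
            j = i
            while j > 0 and result[j - 1] > result[j]:
                result[j - 1], result[j] = result[j], result[j - 1]
                j -= 1
        return result
    return sorted(numbers)  # Timsort, O(n log n)

print(sort_numbers([3, 1, 2]))  # [1, 2, 3]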
The goal of this level in the taxonomy is for AI-supported code completion tools
to be able to suggest appropriate design choices at the file level, considering the input
from the user, like coding style guidelines, and help the user make design choices that
satisfy all the functional requirements of the given programming task.
The capabilities required by AI-supported code completion tools to satisfy this
level of abstraction are as follows:

1. Picking and suggesting the best applicable algorithm for a given scenario.

3. Code suggestions should be free from all recognized vulnerabilities and warn the user if a vulnerability is found.

4. Code suggestions should cover all the functional requirements of the given programming task.

5. AI-supported code completion tools should be able to suggest code with appropriate tests and Continuous Integration (CI) when applicable.
4.2.5.2 System level design

System level design is the second half of the design level in our taxonomy. This level is the highest abstraction level, with the highest human oversight and the most complex rules to define. AI-supported code completion tools at this level can suggest design decisions at the project level, like suggesting design patterns and architectural tactics, with minimal input from the user.
This level requires AI-supported code completion tools to suggest rational design practices in their code suggestions for a problem and to satisfy all the previous levels of abstraction. Design practices depend on many factors, like requirements and technical debt. AI-supported code completion tools should be capable of considering all the relevant factors before suggesting a design practice and of providing the reasoning for each choice to the user.
The main goal of this level in the taxonomy is for AI-supported code completion tools to help the user in every part of the software development process with minimal input from the user. The capabilities required by AI-supported code completion tools to satisfy this level of abstraction are as follows:
4. AI-supported code completion tools should be able to identify the coding style followed and adapt their code suggestions.
Design knowledge is hard to collect because design decisions are highly context-specific and large datasets are lacking. A study by Gorton et al. [28] showed a semi-automatic approach to populating design knowledge from internet sources for a particular (big data) domain, which can be a helpful approach for collecting software design data to train AI-supported code completion tools.
Over the natural evolution of a software system, small changes accumulate, which can happen for various reasons, such as refactoring [55], bug fixes [15], the implementation of new features, etc. These changes can be unique; however, they frequently repeat themselves and follow patterns [53]. Such patterns can provide a wealth of data for studying the history of modifications and their effects [39], modification histories of fault fixes [26], or the connections between code change patterns and adaptive maintenance [49]. However, to use this data, AI-supported code completion tools should be able to identify these complex patterns in public code (training data). Current AI-supported code completion tools like Copilot struggled to detect much simpler patterns like Pythonic idioms, and there is currently no evidence to suggest they can identify even more complex design patterns.
Additionally, current AI-supported code completion tools like Copilot do not support multi-file input. It is therefore not possible to evaluate their current performance on design suggestions, as the software development process may span multiple folders in a file structure. For example, the MVC pattern generally includes multiple files acting as Model, View, and Controller. Given the current limitations on Copilot's input, i.e., a code block or a single code file, it is not possible for AI-supported code completion tools to deduce that a project is using the MVC pattern and adapt their suggestions to follow it, for example by not suggesting code where the Model communicates directly with the View. AI-supported code completion tools must be capable of making suggestions across multiple program units to accommodate these more abstract design patterns.
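A minimal sketch of the kind of separation such tools would need to respect (the class and method names are hypothetical):

# A sketch of MVC separation (hypothetical names): the Model never talks
# to the View; the Controller mediates between them.
class Model:
    def __init__(self):
        self.items = []

    def add_item(self, item):
        self.items.append(item)

class View:
    def render(self, items):
        print(", ".join(items))

class Controller:
    def __init__(self, model, view):
        self.model = model
        self.view = view

    def add_and_show(self, item):
        self.model.add_item(item)           # update state via the Model
        self.view.render(self.model.items)  # then refresh the View

Controller(Model(), View()).add_and_show("first")  # prints: first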
AI-supported code completion tools should be able to adapt their suggestions
to context-specific issues such as variable naming conventions and formatting. This
would be challenging as the existing guidelines are not standard in this space and
mostly depend on context.
AI-supported code completion tools must also update their code suggestions regularly to reflect changes in design practices. This requires regularly updating the training data, and training costs are expensive.
Design patterns are solutions to recurring design issues that aim to improve reuse, code quality, and maintainability. Design patterns have benefits such as decoupling a request from particular operations (Chain of Responsibility and Command), making a system independent of software and hardware platforms (Abstract Factory and Bridge), making it independent of algorithmic solutions (Iterator, Strategy, Visitor), or preventing implementation modifications (Adapter, Decorator, Visitor). These design patterns are integral to software design and are used regularly in software development. However, design patterns evolve. For instance, with the React framework's introduction, many new design patterns were introduced, such as Redux and Flux, which were considered an evolution of the pre-existing MVC design pattern. AI-supported code completion tools trained before this evolution will not have any data on new design patterns such as Redux and Flux, making them incapable of suggesting those design patterns to the user.
Similarly, coding practices evolve. For example, in JavaScript, callbacks were once considered the best practice for achieving concurrency, but they have been replaced by promises. When the user wants asynchronous code, there are two ways to create it: callbacks and promises. A callback is a function passed to another function, which is called after completion. With promises, you can attach callbacks to the returned promise object. One common issue with the callback approach is that when we have to perform multiple asynchronous operations at a time, we can easily end up with something known as "callback hell". As the name suggests, such code is harder to read, manage, and debug. The simplest way to handle asynchronous operations is through promises: in comparison to callbacks, they can easily manage many asynchronous activities and offer better error handling. This means AI-supported code completion tools must be updated regularly to reflect new changes in the coding practices and design processes of software development.
Additionally, bad practices in using promises for asynchronous JavaScript, like not returning promises after creation and forgetting to terminate chains with a catch statement, which are explained in documentation3 and on StackOverflow4, are not known to Copilot; while testing, Copilot suggested code with those common anti-patterns, as they could have occurred more frequently in its training data.

3 https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Using_promises
4 https://stackoverflow.com/questions/30362733/handling-errors-in-promise-all/
make better software. We will discuss this and other potential solutions in more detail
in the next chapter (Chapter 5).
Chapter 5
5.1 Introduction
We began this thesis with an analysis of Copilot code suggestions on Pythonic idioms and JavaScript best practices, to understand the current boundaries of AI-supported code completion tools like Copilot using a software abstraction taxonomy. In this chapter, we extend that discussion by comparing Copilot's performance on Pythonic idioms and JavaScript best practices. In section 5.2.1, we discuss the differences in the performance and ranking of Copilot code suggestions on Pythonic idioms and JavaScript best practices. We also discuss how Copilot was able to suggest idiomatic code for some coding scenarios.
Furthermore, having established the software abstraction hierarchy to help assess the capabilities of AI-supported code completion tools, in section 5.2.2 we discuss what it means to recite code from the training data of AI-supported code completion tools like Copilot. Additionally, we discuss how code recitation is ideal behavior for AI-supported code completion tools like Copilot when suggesting idiomatic code, but not for code smells.

In the second part of this thesis, we discussed our taxonomy of software abstractions and the challenges involved in creating AI-supported code completion tools that are capable of satisfying the design level of our taxonomy. In section 5.3, we report on some implications for researchers and practitioners. Finally, in section 5.4, we report on the threats to the validity of the research presented in this thesis.
5.2 Discussion
In this section, we discuss the differences in the performance and ranking of Copilot code suggestions on Pythonic idioms and JavaScript best practices. We then discuss what code recitation is and how it affects AI-supported code completion tools like Copilot.
AI-supported code completion tools currently have no reliable way to distinguish between an idiomatic approach and a code smell in training data. Metrics like code repository popularity and StackOverflow upvotes on the code could help AI-supported code completion tools like Copilot distinguish between idiomatic approaches and code smells.
5.3 Implications
This research helps guide future AI-supported code completion tools to support software development. Good AI-supported code completion tools have many potential uses, from recommending expanded code completions to optimizing code blocks. Automating code production could increase the productivity of current programmers.

Future code generation models may enable developers to work at a higher degree of abstraction that hides specifics, similar to how contemporary software engineers no longer frequently write in assembly. Good AI-supported code completion tools may improve accessibility to programming or aid in training new programmers. Models could make suggestions for different, more effective, or more idiomatic ways to implement programs, enabling developers to improve their coding style.
For pre-training the LLM (e.g., Codex), AI-supported software development tools will need higher-quality training data. This might be addressed by carefully engineering training examples and filtering out known flaws, code smells, and bad practices. Careful data curation seems to be part of the approach already [45]. However, there is little clarity on how this process happens and how suggestions are evaluated, particularly for non-experts. One approach is to add more verified sources, like well-known books and code documentation pages, that follow best practices. Pre-training might also rank repositories for training input according to code quality (e.g., using only repositories with acceptable coding standards).
5.3.1.3 Over-reliance
Over-reliance on generated outputs is one of the main hazards connected to the use of
code generation models in practice. Codex may provide solutions that seem reason-
able on the surface but do not truly accomplish what the user had in mind. Depending
on the situation, this could negatively impact inexperienced programmers and have
serious safety consequences. Human oversight and vigilance are always required to
safely use AI-supported code completion tools like Copilot. Empirical research is
required to consistently ensure alertness in practice across various user experience
levels, UI designs, and tasks.
Packages or programs created by third parties are frequently imported within a code file. Software engineers rely on functions, libraries, and APIs for the majority of what is called "boilerplate" code rather than constantly reinventing the wheel. However, there are numerous choices for each task: for machine learning, PyTorch or TensorFlow; for data visualization, Matplotlib or Seaborn; etc.
Reliance on import suggestions from AI-supported code completion tools like Copi-
lot may increase as they get used to using AI-supported code completion tools. Users
may employ the model as a decision-making tool or search engine as they get more
adept at ”prompt engineering” with Codex. Instead of searching the Internet for
information on ”which machine learning package to employ” or ”the advantages and
disadvantages of PyTorch vs. Tensorflow,” a user may now type ”# import machine
learning package” and rely on Codex to handle the rest. Based on trends in its
training data, Codex imports substitutable packages at varying rates [12], which may
have various effects. Different import rates set by Codex may result in subtle mistakes
when a particular import is ill-advised, increased robustness in cases where the
package an individual would have chosen is worse, and/or an increase in the dominance
of an already powerful group of people and institutions in the software supply chain.
As a result, certain players may solidify their position in the package market, and
Codex may be unaware of any new packages created after the initial collection of
training data. The model may also recommend deprecated techniques for packages
that are already in use. Additional research is required to fully understand the effects
of code generation capabilities and to identify effective mitigations.
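As a hypothetical sketch of this interaction (the completion shown is invented for illustration, not an observed Codex output):

    # import machine learning package
    import torch  # hypothetical completion; Codex might equally suggest
                  # `import tensorflow as tf` -- which package it picks
                  # depends on trends in its training data [12]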
Another research challenge is to move beyond token-level suggestions and work at the
code block or file level (e.g., a method or module). Increasing the model input size to
span multiple files and folders would improve suggestions. For example, when multiple
files implement the MVC pattern, Copilot should never suggest code in which the
Model communicates directly with the View. AI-supported software development
tools will need to make suggestions across multiple program units to accommodate
these more abstract design concerns.
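As a minimal, hypothetical sketch of the design rule in question (the class names are illustrative only), the Model should report changes through the Controller rather than importing or calling the View directly:

    class View:
        def render(self, value):
            print(f"count = {value}")

    class Model:
        """Holds state; knows nothing about the View."""
        def __init__(self):
            self.count = 0

        def increment(self):
            self.count += 1
            return self.count  # no call into View here -- that would
                               # violate the MVC design rule

    class Controller:
        """Mediates between Model and View."""
        def __init__(self, model, view):
            self.model = model
            self.view = view

        def on_click(self):
            self.view.render(self.model.increment())

    Controller(Model(), View()).on_click()  # prints: count = 1

A design-aware tool would need visibility across all three files to recognize that a suggestion adding a View call inside Model.increment violates this constraint, even though the suggestion is locally syntactically correct.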
One suggestion is to use recent ML advances in helping language models ‘reason’,
such as the chain-of-thought approach of Wang et al. [77]. Chain-of-thought shows the
model an example of reasoning, allowing the model to reproduce the reasoning pat-
tern on a different input. Such reasoning is common for design questions. Shokri [69]
proposes a program synthesis approach for adding architectural tactics to an existing
code base, which points to another route toward design-level support.
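As a hedged sketch of what a chain-of-thought prompt for a design question might look like (the wording is invented for illustration; no specific model API is implied):

    # A single worked example (the "chain of thought") is prepended to the
    # new question so the model can imitate the reasoning pattern.
    prompt = (
        "Q: Two services write to the same database table. Good design?\n"
        "Reasoning: a shared table couples the services' schemas; a change\n"
        "in one forces changes in the other. An explicit API is safer.\n"
        "A: No -- introduce an interface instead of a shared table.\n"
        "\n"
        "Q: Should the Model call the View directly in an MVC application?\n"
        "Reasoning:"  # the model continues with its own reasoning and answer
    )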
5.3.2.2 Explainability
Copilot is closed source and accessible only via an API, so it is currently not possible
to determine the source of, or the reason behind, each suggestion, making it difficult
to detect problems. However, engineering software systems is laden with ethical
challenges, and understanding why a suggestion was made, particularly for
architectural questions such as system fairness, is essential. Probes, as introduced in
[38], might expand technical insight into the models.
Another challenge is understanding the basis for the ranking metric for different
suggestions made by Copilot. This metric has not been made public. Thus, we cannot
determine why Copilot ranks one approach (e.g., non-idiomatic) over the idiomatic
(preferred) approach. However, large language model code suggestions are based on
its training data [11], so one explanation is that the non-idiomatic approach is more
frequent in the training data [8]. Better characterization of the rankings would allow
users to better understand the motivation behind each suggestion.
5.3.2.3 Control
Being generative models, tools like Copilot are extremely sensitive to their input and
face stability challenges, and making them autonomous raises control concerns. For
example, if a human asks for an N² sorting algorithm, should Copilot recommend one,
or the N log N alternative? Ideally, tools should warn users if prompted to suggest
sub-optimal code. AI-supported software development should learn to differentiate
between optimal and sub-optimal code. One direction to look at is following the
commit histories of files, as they are likely places to find bug fixes and performance
improvements.
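As a small illustrative sketch (not drawn from our experiments), the contrast in question is between a quadratic hand-rolled sort and Python's built-in N log N sort:

    def selection_sort(items):
        """O(N^2): scans the remainder of the list for every position."""
        items = list(items)
        for i in range(len(items)):
            smallest = min(range(i, len(items)), key=items.__getitem__)
            items[i], items[smallest] = items[smallest], items[i]
        return items

    data = [5, 2, 9, 1]
    assert selection_sort(data) == sorted(data)  # sorted() is O(N log N)
    # An ideal tool would flag the O(N^2) version -- or at least warn --
    # unless the user explicitly asked for it.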
Each of these directions is likely to come with its own set of challenges, which are
equally important to understand.
5.5 Conclusion
Chapter 3 showed the current challenges of AI-supported code completion tools,
such as security issues and license infringement. We also showed that AI-supported
code completion tools like Copilot struggle to use Pythonic idioms and JavaScript
best practices in their code suggestions.
Chapter 4 represents a continuation of our work in chapter 3, introducing a
taxonomy of the software abstraction hierarchy to delineate the limitations of
AI-supported code completion tools like Copilot. We also showed that Copilot stands
at the correctness level of our taxonomy. Finally, we discussed how AI-supported
code completion tools like Copilot could reach the highest level of software abstraction
in our taxonomy (the design level).
The possible applications of LLMs like Codex are numerous. For instance, they
might ease users’ transition to new codebases, reduce the need for context switching
for seasoned programmers, let non-programmers submit specifications and have Codex
draft implementations, and support research and education.
GitHub’s Copilot and related large language model approaches to code comple-
tion are promising steps in AI-supported software development. However, software
systems need more than coding effort; they require complex design and engineering
work to build. We showed that while the syntax and correctness levels of software
problems are well on their way to useful support from AI-supported code completion
tools like Copilot, the more abstract concerns, such as code smells, language idioms,
and design rules, are far from solvable at present. Although far off, we believe
AI-supported software development, in which an AI supports designers and
developers in more complex software development tasks, is possible.
Bibliography
[5] Maryam Arab, Thomas D. LaToza, Jenny Liang, and Amy J. Ko. An exploratory
study of sharing strategic programming knowledge. In CHI Conference on Hu-
man Factors in Computing Systems, New York, NY, USA, April 2022. ACM.
doi:10.1145/3491102.3502070.
[6] Sebastian Baltes, Lorik Dumani, Christoph Treude, and Stephan Diehl. SO-
Torrent: reconstructing and analyzing the evolution of stack overflow posts. In
Proceedings of International Conference on Mining Software Repositories, pages
319–330. ACM, 2018. doi:10.1145/3196398.3196430.
[9] Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. A neural proba-
bilistic language model. In T. Leen, T. Dietterich, and V. Tresp, edi-
tors, Advances in Neural Information Processing Systems, volume 13. MIT
Press, 2000. URL: https://proceedings.neurips.cc/paper/2000/file/
728f206c2a01bf572b5940d7d9a8fa4c-Paper.pdf.
[10] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan,
Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda
Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan,
Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris
Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess,
Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever,
and Dario Amodei. Language models are few-shot learners. In H. Larochelle,
M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neu-
ral Information Processing Systems, volume 33, pages 1877–1901. Curran As-
sociates, Inc., 2020. URL: https://proceedings.neurips.cc/paper/2020/
file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[11] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel
Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ul-
far Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from
large language models, 2020. URL: https://arxiv.org/abs/2012.07805, doi:
10.48550/ARXIV.2012.07805.
[12] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde
de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph,
Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy
Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder,
Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens
Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plap-
pert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen
Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin,
Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N.
Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford,
Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welin-
der, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wo-
jciech Zaremba. Evaluating large language models trained on code, 2021. URL:
https://arxiv.org/abs/2107.03374, doi:10.48550/ARXIV.2107.03374.
[13] Matteo Ciniselli, Nathan Cooper, Luca Pascarella, Denys Poshyvanyk, Massim-
iliano Di Penta, and Gabriele Bavota. An empirical study on the usage of BERT
models for code completion. In 2021 IEEE/ACM 18th International Confer-
ence on Mining Software Repositories (MSR), pages 108–119. IEEE, May 2021.
doi:10.1109/msr52588.2021.00024.
[14] Matteo Ciniselli, Luca Pascarella, and Gabriele Bavota. To what extent do
deep learning-based code recommenders generate predictions by cloning code
from the training set?, 2022. URL: https://arxiv.org/abs/2204.06894, doi:
10.48550/ARXIV.2204.06894.
[15] Domenico Cotroneo, Luigi De Simone, Antonio Ken Iannillo, Roberto Natella,
Stefano Rosiello, and Nematollah Bidokhti. Analyzing the context of bug-
fixing changes in the openstack cloud computing platform, 2019. URL: https:
//arxiv.org/abs/1908.11297, doi:10.48550/ARXIV.1908.11297.
[17] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT:
Pre-training of deep bidirectional transformers for language understanding. In
Proceedings of the 2019 Conference of the North American Chapter of the Asso-
ciation for Computational Linguistics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019.
Association for Computational Linguistics. URL: https://aclanthology.org/
N19-1423, doi:10.18653/v1/N19-1423.
[19] Neil A. Ernst, Stephany Bellomo, Ipek Ozkaya, and Robert L. Nord. What to
fix? distinguishing between design and non-design rules in automated tools. In
2017 IEEE International Conference on Software Architecture (ICSA). IEEE,
April 2017. doi:10.1109/icsa.2017.25.
[21] Aamir Farooq and Vadim Zaytsev. There is more than one way to zen your
python. In Proceedings of the 14th ACM SIGPLAN International Conference on
Software Language Engineering, SLE 2021, page 68–82, New York, NY, USA,
2021. Association for Computing Machinery. doi:10.1145/3486608.3486909.
[22] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong,
Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. Codebert: A
pre-trained model for programming and natural languages, 2020. URL: https:
//arxiv.org/abs/2002.08155, doi:10.48550/ARXIV.2002.08155.
[23] Society for Automotive Engineers. Taxonomy and definitions for terms related
to driving automation systems for on-road motor vehicles, 2021. URL: https:
//www.sae.org/standards/content/j3016_202104/.
[24] Christine Franks, Zhaopeng Tu, Premkumar Devanbu, and Vincent Hellen-
doorn. Cacheca: A cache language model based code suggestion tool. In 2015
IEEE/ACM 37th IEEE International Conference on Software Engineering, vol-
ume 2, pages 705–708, 2015. doi:10.1109/ICSE.2015.228.
[25] Peter Freeman and David Hart. A science of design for software-intensive sys-
tems. Commun. ACM, 47(8):19–21, aug 2004. doi:10.1145/1012037.1012054.
[28] Ian Gorton, John Klein, and Albert Nurgaliev. Architecture knowledge for eval-
uating scalable databases. In 2015 12th Working IEEE/IFIP Conference on
Software Architecture, pages 95–104, 2015. doi:10.1109/WICSA.2015.26.
[29] Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. Deep code search. In 2018
IEEE/ACM 40th International Conference on Software Engineering (ICSE),
pages 933–944, 2018. doi:10.1145/3180155.3180167.
[30] Raymond Hettinger. Transforming code into beautiful, idiomatic python. URL:
https://www.youtube.com/watch?v=OSGv2VnC0go.
[31] Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar De-
vanbu. On the naturalness of software. In Proceedings of the 34th International
Conference on Software Engineering, ICSE ’12, page 837–847. IEEE Press, 2012.
[32] Naman Jain, Skanda Vaidyanath, Arun Iyer, Nagarajan Natarajan, Suresh
Parthasarathy, Sriram Rajamani, and Rahul Sharma. Jigsaw: Large language
models meet program synthesis. In Proceedings of the 44th International Confer-
ence on Software Engineering, ICSE ’22, page 1219–1231, New York, NY, USA,
2022. Association for Computing Machinery. doi:10.1145/3510003.3510203.
[35] Rafael-Michael Karampatsis and Charles Sutton. Maybe deep neural networks
are the best choice for modeling source code, 2019. URL: https://arxiv.org/
abs/1903.05734, doi:10.48550/ARXIV.1903.05734.
[37] Anjan Karmakar and Romain Robbes. What do pre-trained code models know
about code? In 2021 36th IEEE/ACM International Conference on Automated
Software Engineering (ASE), pages 1332–1336, Los Alamitos, CA, USA, 2021.
IEEE. doi:10.1109/ASE51524.2021.9678927.
[38] Anjan Karmakar and Romain Robbes. What do pre-trained code models know
about code? In IEEE/ACM International Conference on Automated Soft-
ware Engineering (ASE), pages 1332–1336, 2021. doi:10.1109/ASE51524.2021.
9678927.
[42] J. Knupp. Writing Idiomatic Python 3.3. Createspace Independent Pub, 2013.
URL: https://books.google.ca/books?id=EtdF4NMi_NEC.
[44] Jian Li, Yue Wang, Michael R. Lyu, and Irwin King. Code completion with
neural attention and pointer networks. In Proceedings of the 27th International
Joint Conference on Artificial Intelligence, IJCAI’18, pages 4159–4165. AAAI Press,
2018.
[45] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser,
Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago,
Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin,
Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov,
James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli,
Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. Competition-level
code generation with AlphaCode, 2022. URL: https://arxiv.org/abs/2203.
07814, doi:10.48550/ARXIV.2203.07814.
[46] David Mandelin, Lin Xu, Rastislav Bodı́k, and Doug Kimelman. Jungloid min-
ing: Helping to navigate the api jungle. In Proceedings of the 2005 ACM SIG-
PLAN Conference on Programming Language Design and Implementation, PLDI
’05, page 48–61, New York, NY, USA, 2005. Association for Computing Machin-
ery. doi:10.1145/1065010.1065018.
[48] Tim Menzies, Andrew Butcher, David Cok, Andrian Marcus, Lucas Layman,
Forrest Shull, Burak Turhan, and Thomas Zimmermann. Local versus global
lessons for defect prediction and effort estimation. IEEE Transactions on Soft-
ware Engineering, 39(6):822–834, June 2013. doi:10.1109/tse.2012.83.
[49] Omar Meqdadi and Shadi Aljawarneh. A study of code change patterns for adap-
tive maintenance with ast analysis. International Journal of Electrical and Com-
puter Engineering (IJECE), 10:2719, 06 2020. doi:10.11591/ijece.v10i3.
pp2719-2733.
[50] Tomáš Mikolov, Anoop Deoras, Daniel Povey, Lukáš Burget, and Jan Černocký.
Strategies for training large scale neural network language models. In 2011 IEEE
Workshop on Automatic Speech Recognition and Understanding, pages 196–201,
2011. doi:10.1109/ASRU.2011.6163930.
[51] G.C. Murphy, M. Kersten, and L. Findlater. How are java software developers
using the eclipse ide? IEEE Software, 23(4):76–83, 2006. doi:10.1109/MS.
2006.105.
[52] Emerson Murphy-Hill, Ciera Jaspan, Caitlin Sadowski, David Shepherd, Michael
Phillips, Collin Winter, Andrea Knight, Edward Smith, and Matthew Jorde.
What predicts software developers’ productivity? IEEE Transactions on Soft-
ware Engineering, 47(3):582–594, 2021. doi:10.1109/TSE.2019.2900308.
[53] Hoan Anh Nguyen, Anh Tuan Nguyen, Tung Thanh Nguyen, Tien N. Nguyen,
and Hridesh Rajan. A study of repetitiveness of code changes in software evolu-
tion. In IEEE/ACM International Conference on Automated Software Engineer-
ing (ASE), 2013.
[54] Nhan Nguyen and Sarah Nadi. An empirical evaluation of GitHub Copilot’s
code suggestions. In Proceedings of the 19th ACM International Conference on
Mining Software Repositories (MSR), pages 1–5, 2022.
[55] Fabio Palomba, Andy Zaidman, Rocco Oliveto, and Andrea De Lucia. An ex-
ploratory study on the relationship between changes and refactoring. In 2017
IEEE/ACM 25th International Conference on Program Comprehension (ICPC),
pages 176–185, 2017. doi:10.1109/ICPC.2017.38.
[57] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christo-
pher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word
representations. In Proceedings of the 2018 Conference of the North Ameri-
can Chapter of the Association for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans,
Louisiana, June 2018. Association for Computational Linguistics. URL: https:
//aclanthology.org/N18-1202, doi:10.18653/v1/N18-1202.
[58] Tim Peters. The Zen of Python, pages 301–302. Apress, Berkeley, CA, 2010.
doi:10.1007/978-1-4302-2758-8_14.
[59] Sebastian Proksch, Johannes Lerch, and Mira Mezini. Intelligent code comple-
tion with bayesian networks. ACM Transactions on Software Engineering and
Methodology, 25(1):1–31, December 2015. doi:10.1145/2744200.
[61] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya
Sutskever. Language models are unsupervised multitask learners, 2019.
[62] Paul Ralph and Yair Wand. A proposal for a formal definition of the design
concept, 01 2009. doi:10.1007/978-3-540-92966-6_6.
[64] Veselin Raychev, Martin Vechev, and Eran Yahav. Code completion with sta-
tistical language models. In Proceedings of the 35th ACM SIGPLAN Confer-
ence on Programming Language Design and Implementation. ACM, June 2014.
doi:10.1145/2594291.2594321.
[67] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine trans-
lation of rare words with subword units. In Proceedings of the 54th An-
nual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), pages 1715–1725, Berlin, Germany, August 2016. Association
for Computational Linguistics. URL: https://aclanthology.org/P16-1162,
doi:10.18653/v1/P16-1162.
[68] Ali Shokri. A program synthesis approach for adding architectural tactics to
an existing code base. In 2021 36th IEEE/ACM International Conference on
Automated Software Engineering (ASE), pages 1388–1390, Los Alamitos, CA,
USA, nov 2021. IEEE. doi:10.1109/ase51524.2021.9678705.
[69] Ali Shokri. A program synthesis approach for adding architectural tactics to
an existing code base. In IEEE/ACM International Conference on Automated
Software Engineering (ASE), pages 1388–1390, 2021. doi:10.1109/ASE51524.
2021.9678705.
[72] Zhaopeng Tu, Zhendong Su, and Premkumar Devanbu. On the localness of
software. In Proceedings of the 22nd ACM SIGSOFT International Symposium
on Foundations of Software Engineering, FSE 2014, page 269–280, New York,
NY, USA, 2014. Association for Computing Machinery. doi:10.1145/2635868.
2635875.
[73] Priyan Vaithilingam, Tianyi Zhang, and Elena L. Glassman. Expectation vs. ex-
perience: Evaluating the usability of code generation tools powered by large
language models. In CHI Conference on Human Factors in Computing Systems
Extended Abstracts, New York, NY, USA, April 2022. Association for Computing
Machinery. doi:10.1145/3491101.3519665.
[74] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is
all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wal-
lach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances
in Neural Information Processing Systems, volume 30. Curran Associates,
Inc., 2017. URL: https://proceedings.neurips.cc/paper/2017/file/
3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[75] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks,
2015. URL: https://arxiv.org/abs/1506.03134, doi:10.48550/ARXIV.
1506.03134.
[76] Stefan Wagner and Melanie Ruhe. A systematic review of productivity factors
in software development. Arxiv, 2018. URL: https://arxiv.org/abs/1801.
06475, doi:10.48550/ARXIV.1801.06475.
[77] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang,
Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of
thought reasoning in language models, 2022. URL: https://arxiv.org/abs/
2203.11171, doi:10.48550/ARXIV.2203.11171.
[78] Ziyu Yao, Jayavardhan Reddy Peddamail, and Huan Sun. Coacor: Code anno-
tation for code retrieval with reinforcement learning. In The World Wide Web
Conference, WWW ’19, page 2203–2214, New York, NY, USA, 2019. Association
for Computing Machinery. doi:10.1145/3308558.3313632.