Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Programming

Hussein Mozannar, [email protected], Massachusetts Institute of Technology, Boston, USA
Gagan Bansal, [email protected], Microsoft Research, Redmond, USA
[Figure 1 comprises three panels: (a) a screenshot of Copilot inside Visual Studio Code suggesting an implementation of a LogisticRegression fit method as inline grey text; (b) a chart of the share of session time spent in each CUPS state (Thinking/Verifying Suggestion 22.4%, Writing New Functionality 14.05%, Prompt Crafting 11.56%, Debugging/Testing Code 11.31%, Thinking About New Code to Write 10.91%, Looking up Documentation 7.45%, Waiting For Suggestion 4.20%, Deferring Thought for Later 1.39%, Not Thinking 0.01%); and (c) a session timeline of seven suggestion events (shown, rejected, or accepted) with the programmer's states between them.]

Figure 1: Profiling a coding session with the CodeRec User Programming States (CUPS). In (a) we show the operating mode of CodeRec inside Visual Studio Code. In (b) we show the CUPS taxonomy used to describe CodeRec-related programmer activities. A coding session can be summarized as a timeline in (c) where the programmer transitions between states.
ABSTRACT
Code-recommendation systems, such as Copilot and CodeWhisperer, have the potential to improve programmer productivity by suggesting and auto-completing code. However, to fully realize their potential, we must understand how programmers interact with these systems and identify ways to improve that interaction. To seek insights about human-AI collaboration with code-recommendation systems, we studied GitHub Copilot, a code-recommendation system used by millions of programmers daily. We developed CUPS, a taxonomy of common programmer activities when interacting with Copilot.
Our study of 21 programmers, who completed coding tasks and retrospectively labeled their sessions with CUPS, showed that CUPS can help us understand how programmers interact with code-recommendation systems, revealing inefficiencies and time costs. Our insights reveal how programmers interact with Copilot and motivate new interface designs and metrics.

CCS CONCEPTS
• Human-centered computing → User models; User studies; • Software and its engineering → Automatic programming.

KEYWORDS
AI-assisted Programming, Copilot, User State Model

ACM Reference Format:
Hussein Mozannar, Gagan Bansal, Adam Fourney, and Eric Horvitz. 2024. Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Programming. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24), May 11–16, 2024, Honolulu, HI, USA. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3613904.3641936

1 INTRODUCTION
Programming-assistance systems based on the adaptation of large language models (LLMs) to code recommendations have been recently introduced to the public. Popular systems, including Copilot [14], CodeWhisperer [1], and AlphaCode [21], signal a potential shift in how software is developed. Though there are differences in specific interaction mechanisms, the programming-assistance systems generally extend existing IDE code completion mechanisms (e.g., IntelliSense¹) by producing suggestions using neural models trained on billions of lines of code [8]. The LLM-based completion models can suggest anything from sentence-level completions to entire functions and classes in a wide array of programming languages. These large neural models are deployed with the goal of accelerating the efforts of software engineers, reducing their workloads, and improving their productivity.

Early assessments suggest that programmers do feel more productive when assisted by code-recommendation models [40] and that they prefer these systems to earlier code completion engines [34]. In fact, a recent study from GitHub found that Copilot could potentially reduce task completion time by a factor of two [28]. While these studies help us understand the benefits of code-recommendation systems, they do not allow us to identify avenues to improve and understand the nature of interaction with these systems.

In particular, the neural models introduce new tasks into a developer's workflow, such as writing AI prompts [17] and verifying AI suggestions [34], which can be lengthy. Existing interaction metrics, such as suggestion acceptance rates, time to accept (i.e., the time a suggestion remains onscreen), and reduction of tokens typed, tell only part of this interaction story. For example, when suggestions are presented in monochrome popups (Figure 1), programmers may choose to accept them into their codebases so that they can be read with code highlighting enabled. Likewise, when models suggest only one line of code at a time, programmers may accept sequences before evaluating them together as a unit. In both scenarios, considerable work verifying and editing suggestions occurs after the programmer has accepted the recommended code. Prior interaction metrics also largely miss user effort invested in devising and refining prompts used to query the models. When code completion tools are evaluated using coarser task-level metrics such as task completion time [20], we begin to see signals of the benefits of AI-driven code completion but lack sufficient detail to understand the nature of these gains, as well as possible remaining inefficiencies. We argue that an ideal approach would be sufficiently low level to support interaction profiling while sufficiently high level to capture meaningful programmer activities.

Given the nascent nature of these systems, numerous questions exist regarding the behavior of their users:
• What activities do users undertake in anticipation of, or to trigger, a suggestion?
• What mental processes occur while the suggestions are onscreen, and do people double-check suggestions before or after acceptance?
• How costly for users are these various new tasks, and which take the most time?

To answer these and related questions in a systematic manner, we apply a mixed-methods approach to analyze interactions with a popular code suggestion model, GitHub Copilot², which has more than a million users. To emphasize that our analysis is not restricted to the specifics of Copilot, we use the term CodeRec to refer to any instance of code suggestion models, including Copilot. Through small-scale pilot studies and our first-hand experience using Copilot for development, we develop a novel taxonomy of common states of a programmer when interacting with CodeRec models (such as Copilot), which we refer to as CodeRec User Programming States (CUPS). The CUPS taxonomy serves as the main tool to answer our research questions.

Given the initial taxonomy, we conducted a user study with 21 developers who were asked to retrospectively review videos of their coding sessions and explicitly label their intents and actions using this model, with an option to add new states if necessary. The study participants labeled a total of 3137 coding segments and interacted with 1096 suggestions. The study confirmed that the taxonomy was sufficiently expressive, and we further learned transition weights and state dwell times, something we could not do without this experimental setting. Together, these data can be assembled into various instruments, such as the CUPS diagram (Figure 1), to facilitate profiling interactions and identify inefficiencies. Moreover, we show that such analysis nearly doubles our estimates for how much developer time can be attributed to interactions with code suggestion systems, as compared with existing metrics. We believe that identifying the current CUPS state during a programming session can help serve programmer needs. This can be accomplished using custom keyboard macros or automated prediction of CUPS states, as discussed in our future work section and the Appendix. Overall, we leverage the CUPS diagram to identify some opportunities to address inefficiencies in the current version of Copilot.

In sum, our main contributions are the following:

¹ https://code.visualstudio.com/docs/editor/intellisense
² https://github.com/features/copilot
• A novel taxonomy of common activities of programmers (called CUPS) when interacting with code-recommendation systems (Section 4).
• A dataset of coding sessions annotated with user actions, CUPS, and video recordings of programmers coding with Copilot (Section 5).
• An analysis of which CUPS states programmers spend their time in when completing coding tasks (Subsection 6.1).
• An instrument to analyze programmer behavior (and patterns in behavior) based on a finite-state machine over CUPS states (Subsection 6.2).
• An adjustment formula to properly account for how much time programmers spend verifying CodeRec suggestions (Subsection 6.4), inspired by the CUPS state of deferring thought (Subsection 6.3).

The remainder of this paper is structured as follows: We first review related work on AI-assisted programming (Section 2) and formally describe Copilot, along with a high-level overview of programmer-CodeRec interaction (Section 3). To further understand this interaction, we define our model of CodeRec User Programming States (CUPS) (Section 4) and then describe a user study designed to collect programmer annotations of their states (Section 5). We use the collected data to analyze the interactions using the CUPS diagram, revealing new insights into programmer behavior (Section 6). We then discuss limitations and future work and conclude (Section 7).

2 BACKGROUND AND RELATED WORK
Large language models based on the Transformer network [36], such as GPT-3 [6], have found numerous applications in natural language processing. Codex [8], a GPT model trained on 54 million GitHub repositories, demonstrates that LLMs can very effectively solve various programming tasks. Specifically, Codex was initially tested on the HumanEval dataset containing 164 programming problems, where it is asked to write the function body from a docstring [8], and achieves 37.7% accuracy with a single generation. Various metrics and datasets have been proposed to measure the performance of code generation models [9, 11, 16, 21]. However, in each case, these metrics test how well the model can complete code in an offline setting without developer input rather than evaluating how well such recommendations assist programmers in situ. This issue has also been noted in earlier work on non-LLM-based code completion models, where performance on completion benchmarks overestimates the model's utility to developers [15]. Importantly, however, these results may not hold for LLM-based approaches, which are radically different [30].

One straightforward approach to understanding the utility of neural code completion services, including their propensity to deliver incomplete or imperfect suggestions, is to simply ask developers. To this end, Weisz et al. interviewed developers and found that they did not require a perfect recommendation model for the model to be useful [38]. Likewise, Ziegler et al. surveyed over 2,000 Copilot users [40] and asked about perceived productivity gains using a survey instrument based on the SPACE framework [13]; we incorporate the same survey design for our own study. They found both that developers felt more productive using Copilot and that these self-reported perceptions were reasonably correlated with suggestion acceptance rates. Liang et al. [22] administered a survey to 410 programmers who use various AI programming assistants, including Copilot, and highlighted why programmers use the AI assistants as well as numerous usability issues. Similarly, Prather et al. [29] surveyed how introductory programming students utilize Copilot.

While these self-reported measures of utility and preference are promising, we would expect gains to be reflected in objective metrics of productivity. Indeed, one ideal method would be to conduct randomized controlled trials where one set of participants writes code with a recommendation engine while another set codes without it. GitHub performed such an experiment where 95 participants were split into two groups and asked to write a web server. The study concluded that task completion time was reduced by 55.8% in the Copilot condition [28]. Likewise, a study by Google showed that an internal CodeRec model yielded a 6% reduction in 'coding iteration time' [33]. On the other hand, Vaithilingam et al. [34], in a study of 24 participants, showed no significant improvement in task completion time, yet participants stated a clear preference for Copilot. An interesting comparison to Copilot is human-human pair programming, which Wu et al. [39] detail.

A significant amount of work has tried to understand the behavior of programmers [4, 5, 23, 31] using structured user studies under the name of "psychology of programming." This line of work tries to understand the effect of programming tools on the time to solve a task or the ease of writing code, and how programmers read and write code. Researchers often use telemetry with detailed logging of keystrokes [19, 37] to understand behavior. Moreover, eye-tracking is also used to understand how programmers read code [24, 27]. Our research uses raw telemetry alongside user-labeled states to understand behavior; future research could also utilize eye-tracking and raw video to get deeper insights into behavior.

This wide dispersion of results raises interesting questions about the nature of the utility afforded by neural code completion engines: how, and when, are such systems most helpful, and conversely, when do they add additional overhead? This is the central question of our work. The related work closest to answering this question is that of Barke et al. [3], who showed that interaction with Copilot falls into two broad categories: the programmer is either in "acceleration mode", where they know what they want to do and Copilot serves to make them faster, or they are in "exploration mode", where they are unsure what code to write and Copilot helps them explore. The taxonomy we present in this paper, CUPS, enriches this further with granular labels for programmers' intents. Moreover, the data collected in this work was labeled by the participants themselves rather than by researchers interpreting their actions, allowing for more faithful intent and activity labeling, and the data collected in our study can also be used to build predictive models as in [32]. The next section describes the Copilot system formally and describes the data collected when interacting with Copilot.

3 COPILOT SYSTEM DESCRIPTION
To better understand how code recommendation systems influence the effort of programming, we focus on GitHub Copilot, a popular
[Figure 4 (a screenshot of the labeling tool) appears here; the state-label buttons shown correspond to the CUPS states of Table 1, each with an associated keyboard shortcut.]

Figure 4: Screenshot of retrospective labeling tool for coding sessions. Left: Navigation panel for telemetry segments. Right: Video player for reviewing video of a coding session. Bottom: Buttons and text box for labeling states corresponding to the CUPS taxonomy, along with an "IDK" button and a free-form text box to write custom state labels. Buttons also have associated keyboard bindings for easy annotation.
Table 1: Description of each state in CodeRec User Programming States (CUPS).

State | Description
Thinking/Verifying Suggestion | Actively thinking about and verifying a shown or accepted suggestion
Not Thinking | Not thinking about suggestion or code; programmer away from keyboard
Deferring Thought For Later | Programmer accepts suggestion without completely verifying it, but plans to verify it afterward
Thinking About New Code To Write | Thinking about what code or functionality to implement and write
Waiting For Suggestion | Waiting for a CodeRec suggestion to be shown
Writing New Code | Writing code that implements new functionality
Editing Last Suggestion | Editing the last accepted suggestion
Editing (Personally) Written Code | Editing code written by the programmer that is not a CodeRec suggestion, for the purpose of fixing existing functionality
Prompt Crafting | Writing a prompt in the form of a comment or code to obtain a desired CodeRec suggestion
Writing Documentation | Writing comments or docstrings for the purpose of documentation
Debugging/Testing Code | Running or debugging code to check functionality; may include writing tests or debugging statements
Looking up Documentation | Checking an external source for the purpose of understanding code functionality (e.g., Stack Overflow)
Accepted | Accepted a CodeRec suggestion
Rejected | Rejected a CodeRec suggestion

To label a particular video segment, we asked participants to consider the hierarchical structure of CUPS in Figure 3. The hierarchical structure first distinguishes segments by whether typing occurred in the segment and then decides based on the typing or non-typing states. For example, in a segment where a participant was initially double-checking a suggestion and then wrote new code to accomplish a task, the appropriate label would be "Writing New Functionality", as the user eventually typed in the segment. In cases where two states are appropriate and fall under the same branch of the hierarchy, e.g., if the participant double-checked a suggestion and then looked up documentation, they were asked to pick the state in which they spent the majority of the time. These issues arise because we collect a single state for each telemetry segment.

Pilot. Through a series of pilots involving the authors of the paper, as well as three other participants drawn from our organization, we iteratively applied the tool to our own coding sessions and to the user study tasks described in Section 5. We then expanded and refined the taxonomy by incorporating any "custom state" (entered using the text field) written by the pilot participants. The states 'Debugging/Testing Code', 'Looking up Documentation', and 'Writing Documentation' were added through the pilots. By the last pilot participant, the code book was stable and saturated, as they did not write a state that was not yet covered. We observed in our study that the custom text field was rarely used. We describe the resultant taxonomy in the sections below.

4.2 Taxonomy of Telemetry Segments
Figure 3 shows the finalized taxonomy of programmer activities for individual telemetry segments with Copilot. As noted earlier, the taxonomy is rooted in two segment types: 'User Typing or Paused' and 'User Before Action'. We first detail the 'User Typing or Paused' segments, which precede shown events (Figure 2) and are distinguished by the fact that no suggestions are displayed during this time. As the name implies, users can find themselves in this state if they are either actively 'typing'⁴ or have 'paused' but have not yet been presented with a suggestion. In cases where the programmer is actively typing, they could be completing any of a number of tasks, such as 'writing new functionality', 'editing existing code', 'editing prior (CodeRec) suggestions', 'debugging code', or authoring natural-language comments, including both documentation and prompts directed at CodeRec (i.e., 'prompt crafting'). When the user pauses, they may simply be 'waiting for a suggestion' or can be in any number of states common to 'User Before Action' segments.

In every 'User Before Action' segment, CodeRec is displaying a suggestion, and the programmer is paused and not typing. They could be reflecting on and verifying that suggestion, or they may not be paying attention to the suggestion and instead be thinking about other code to write. The programmer can also defer their efforts on the suggestion to a later time by accepting it immediately, then pausing to review the code later. This can occur, for example, because the programmer desires syntax highlighting rather than grey text, or because the suggestion is incomplete and the programmer wants to allow Copilot to complete its implementation before evaluating the code as a cohesive unit. The latter situation tends to arise when Copilot displays code suggestions line by line (e.g., Figure 7).

The leaf nodes of the finalized taxonomy represent 12 distinct states that programmers can find themselves in. These states are illustrated in Figure 3 and are further described in Table 1. While the states are meant to be distinct, siblings may share many traits. For example, "Writing New Functionality" and "Editing Written Code" are conceptually very similar. This taxonomy also bears resemblance to the keystroke-level model in that it assigns a time cost to mental processes as well as typing [7, 18]. As evidenced by the user study, which we describe in the next section, these 12 states provide a language that is both general enough to capture most activities (at this level of abstraction) and specific enough to meaningfully capture activities unique to LLM-based code suggestion systems.

⁴ Active typing allows for brief pauses between keystrokes.
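To make the taxonomy concrete for readers who want to analyze similar telemetry, the sketch below encodes the 12 CUPS states (as named in Figures 3 and 5) and the CodeRec actions as Python enumerations. This is our own minimal illustration, not an artifact of the study; the constant names are ours.

from enum import Enum

class CUPS(Enum):
    # The 12 programmer states of the CUPS taxonomy
    THINKING_VERIFYING_SUGGESTION = "Thinking/Verifying Suggestion"
    NOT_THINKING = "Not Thinking"
    DEFERRING_THOUGHT_FOR_LATER = "Deferring Thought For Later"
    THINKING_ABOUT_NEW_CODE = "Thinking About New Code To Write"
    WAITING_FOR_SUGGESTION = "Waiting For Suggestion"
    WRITING_NEW_FUNCTIONALITY = "Writing New Functionality"
    EDITING_LAST_SUGGESTION = "Editing Last Suggestion"
    EDITING_WRITTEN_CODE = "Editing (Personally) Written Code"
    PROMPT_CRAFTING = "Prompt Crafting"
    WRITING_DOCUMENTATION = "Writing Documentation"
    DEBUGGING_TESTING_CODE = "Debugging/Testing Code"
    LOOKING_UP_DOCUMENTATION = "Looking up Documentation"

class CodeRecAction(Enum):
    # Telemetry events that bound the labeled segments
    SHOWN = "shown"
    ACCEPTED = "accepted"
    REJECTED = "rejected"

# Example: a labeled telemetry segment could then be represented as a
# (state, duration_in_seconds, following_action) record, e.g.
# (CUPS.PROMPT_CRAFTING, 12.3, CodeRecAction.SHOWN).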
5 CUPS DATA COLLECTION STUDY
To study CodeRec-programmer interaction in terms of CodeRec User Programming States, we designed a user study where programmers perform a coding task, then review and label videos of their coding session using the telemetry segment-labeling tool described
earlier. We describe the procedure, the participants, and the results in the sections that follow.

5.1 Procedure
We conducted the study over a video call and asked participants to use a remote desktop application to access a virtual machine (VM). Upon connecting, participants were greeted with the study environment consisting of Windows 10, together with Visual Studio Code (VS Code) augmented with the Copilot plugin.

Participants were then presented with a programming task drawn randomly from a set of eight pre-selected tasks (Table 2). If the participant was unfamiliar with the task content, we offered them a different random task. The task set was designed during the pilot phase so that individual tasks fit within a 20-minute block and so that, together, the collection of tasks surfaces a sufficient diversity of programmer activities. It is crucial that the task is of reasonable duration so that participants are able to remember all their activities, since they will be required to label their session immediately afterward. Since the CUPS taxonomy includes states of thought, participants must label their session immediately after coding, and each study took approximately 60 minutes in total. To further improve diversity, task instructions were presented to participants as images to encourage participants to author their own Copilot prompts rather than copying and pasting from the problem description. The full set of tasks and instructions is provided in the Appendix.

Upon completing the task (or reaching the 20-minute mark), we loaded the participant's screen recording and telemetry into the labeling tool (previously detailed in Section 4.1). The researcher then briefly demonstrated the correct operation of the tool and explained the CUPS taxonomy. Participants were then asked to annotate their coding session with CUPS labels. Self-labeling allows us to easily scale such a study and enables more accurate labels for each participant, but may cause inconsistent labeling across participants. Critically, this labeling occurred within minutes of completing the programming task so as to ensure accurate recall. We do not include a baseline condition where participants perform the coding task without Copilot, as this work focuses on understanding and modeling the interaction with the current version of Copilot.

Finally, participants completed a post-study questionnaire about their experience, mimicking the one in [40]. The entire experiment was designed to last 60 minutes. The study was approved by our institutional review board (IRB), and participants received a $50.00 gift card as remuneration for their participation.

5.2 Participants
To recruit participants, we posted invitations to developer-focused email distribution lists within our large organization. We recruited 21 participants with varying degrees of experience using Copilot: 7 used Copilot more than a few times a week, 3 used it once a month or less, and 11 had never used it before. For participants who had never used it before, the experimenter gave a short oral tutorial on Copilot explaining how it can be invoked and how to accept suggestions. Participants' roles in the organization ranged from software engineers (with different levels of seniority) to researchers and graduate student interns. In terms of programming expertise, only 6 participants had less than 2 years of professional programming experience (i.e., excluding years spent learning to program), 5 had between 3 to 5 years, 7 had between 6 to 10 years, and 3 had more than 11 years of experience. Participants used a language in which they stated proficiency (defined as a language in which they were comfortable designing and implementing whole programs). Here, 19 of the 21 participants used Python, one used C++, and the final participant used JavaScript.

On average, participants took 12.23 minutes (sample standard deviation, σ = 3.98 minutes) to complete the coding task, with a maximum session length of 20.80 minutes. This task completion time is measured from the first line of code written for the task until the end of the allocated time. During the coding tasks, Copilot showed participants a total of 1024 suggestions, of which they accepted 34.0%. The average acceptance rate across participants was 36.5% (averaging over the acceptance rate of each participant), and the median was 33.8% with a standard error of 11.9%; the minimum acceptance rate was 14.3%, and the maximum was 60.7%. In the labeling phase, each participant labeled an average of 149.38 (σ = 57.43) segments with CUPS, resulting in a total of 3137 CUPS labels. The participants used the 'custom state' text field only three times in total: twice a participant wrote 'write a few letters and expect suggestion', which can be considered 'prompt crafting', and once a participant wrote 'I was expecting the function skeleton to show up[..]', which was mapped to 'waiting for suggestion'. The IDK button was used a total of 353 times (summing to 3137 CUPS + 353 IDKs = 3490 labels). The majority of its use came from two participants (244 times) for whom the video recording was not clear enough during consecutive spans, and it was used more than once by only five other participants, also mostly because the video was not clear or the segment was too short. The IDK segments represent 6.5% of total session time across all participants, mostly contributed by five participants. Therefore, we remove the IDK segments from the analysis and do not attempt to re-label them.

Together, these CUPS labels enable us to investigate various questions about programmer-CodeRec interaction systematically, such as exploring which activities programmers perform most frequently and how they spend most of their time. We study programmer-CodeRec interaction using the data derived from this study in the following Section 6 and derive various insights and interventions.

6 UNDERSTANDING PROGRAMMER BEHAVIOR WITH CUPS: MAIN RESULTS
The study in the previous section allows us to collect telemetry with CUPS labels for each telemetry segment. We now analyze the collected data and highlight suggestions for 1) metrics to measure the programmer-CodeRec interaction, 2) design improvements for the Copilot interface, and finally 3) insights into programmer behavior. Each subsection below presents a specific result or analysis which can be read independently. Code and data are available at https://github.com/microsoft/coderec_programming_states.
Figure 5: Visualization of CUPS labels from our study as timelines, a histogram, and a state machine. (a) Individual CUPS timelines for 5 of the 21 study participants for the first 180 seconds show the richness of, and variance in, programmer-CodeRec interaction. (b) The percentage of total session time spent in each state during a coding session; on average, verifying Copilot suggestions occupies a large portion of session time. (c) CUPS diagram showing 12 CUPS states (nodes) and the transitions among the states (arcs); transitions occur when a suggestion is shown, accepted, or rejected. We hide self-transitions and low-probability transitions for simplicity.
Table 2: Description of the coding tasks given to user study participants and task assignment. Participants were randomly allocated to tasks with which they had familiarity.
6.1 Aggregated Time Spent in Various CUPSs
In Figure 5a, we visualize the coding sessions of individual participants as CUPS timelines, where each telemetry segment is labeled with its CUPS label. At first glance, CUPS timelines show the richness in patterns of interaction with Copilot, as well as the variance in usage patterns across settings and people. CUPS timelines allow us to inspect individual behaviors and identify patterns, which we later aggregate to form general insights into user behavior.

Figure 5b shows the average time spent in each state as a percentage normalized to a user's session duration.

Metric Suggestion: Time spent in CUPS states as a high-level diagnosis of the interaction. For example, time spent in 'Waiting For Suggestion' (4.2%, σ = 4.46) measures the real impact of latency, and time spent in 'Editing Last Suggestion' provides feedback on the quality of suggestions.
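As an illustration of how this metric can be computed from labeled telemetry, the sketch below aggregates per-state time fractions. It is our own minimal example; the column names (user_id, state, duration_s) are assumptions, not the study's actual schema.

import pandas as pd

def time_share_per_state(segments: pd.DataFrame) -> pd.Series:
    """Average, over users, of the fraction of session time spent in each CUPS state.

    `segments` has one row per labeled telemetry segment with columns
    user_id, state, and duration_s (illustrative names).
    """
    session_totals = segments.groupby("user_id")["duration_s"].sum()
    per_user_share = (
        segments.groupby(["user_id", "state"])["duration_s"].sum()
        .div(session_totals, level="user_id")
    )
    return per_user_share.groupby(level="state").mean().sort_values(ascending=False)

# Example: time_share_per_state(segments)["Thinking/Verifying Suggestion"]
# would correspond to the 22.4% figure reported below.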
We find that, averaged across all users, the 'Verifying Suggestion' state takes up the most time, at 22.4% (σ = 12.97); it is the top state for 6 participants, is in the top 3 states for 14 out of 21 participants, and takes up at least 10% of session time for all but one participant. Notably, this is a new programmer task introduced by Copilot. The second-lengthiest state is 'Writing New Functionality' at 14.05% (σ = 8.36); all but 6 participants spend more than 9% of session time in this state.

More generally, the states that are specific to interaction with Copilot include 'Verifying Suggestions', 'Deferring Thought for Later', 'Waiting for Suggestion', 'Prompt Crafting', and 'Editing Suggestion'. We found that the total time participants spend in these states is 51.5% (σ = 19.3) of the average session duration. In fact, half of the participants spend more than 47% of their session in these Copilot states, and all participants spend more than 21% of their time in these states.

By Programmer Expertise and Copilot Experience. We investigate whether there are any differences in how programmers interacted with Copilot based on their programming expertise and their previous experience with Copilot. First, we split participants based on whether they have professional programming experience of more than 6 years (10 out of 21) or less than 6 years (11 out of 21). We notice the acceptance rate for those with substantial programming experience is 30.0% ± 14.5, while for those without it is 37.6% ± 14.6. Second, we split participants based on whether they had used Copilot previously (10 out of 21) or had never used it before (11 out of 21). The acceptance rate for those who have previously used Copilot is 37.6% ± 15.3, and for those who have not, it is 29.3% ± 13.7. Due to the limited number of participants, these results are not sufficient to determine the influence of programmer experience or Copilot experience on behavior. We also include in the Appendix a breakdown of programmer behavior by task solved.

6.2 Patterns in Behavior as Transitions Between CUPS States
To understand if there was a pattern in participant behavior, we modeled transitions between two states as a state machine. We refer to the state machine-based model of programmer behavior as a CUPS diagram. In contrast to the timelines in Figure 5a, which visualize state transitions with changes of colors, the CUPS diagram in Figure 5c explicitly visualizes transitions using directed edges, where the thickness of arrows is proportional to the likelihood of transition. For simplicity, Figure 5c only shows transitions with an average probability higher than 0.17 (90th quantile, selected for graph visibility).

The transitions in Figure 5c revealed many expected patterns. For example, one of the most likely transitions (excluding self-transitions from the diagram), 'Prompt Crafting → Verifying Suggestion' (probability 0.54), showed that when programmers were engineering prompts, they were then likely to immediately transition to verifying the resultant suggestions. Likewise, another probable transition was 'Deferring Thought → Verifying Suggestion' (probability 0.54), indicating that if a programmer previously deferred their thought on an accepted suggestion, they would, with high probability, return to verify that suggestion. Stated differently: deference incurs verification debt, and this debt often "catches up" with the programmer. Finally, the single most probable transition, 'Writing New Functionality → Verifying Suggestion' (probability 0.59), echoes the observation from the previous section, indicating that programmers often see suggestions while writing code (rather than prompt crafting), then spend time verifying them. If suggestions are unhelpful, they could easily be seen as interrupting the flow of writing.
The CUPS diagram also revealed some unexpected transitions. Notably, the second-most probable transition from the 'Prompt Crafting' state is 'Prompt Crafting → Waiting for Suggestion' (probability 0.25). This potentially reveals an unexpected and unnecessary delay and is a possible target for refinement (e.g., by reducing latency in Copilot). Importantly, each of these transitions occurs with a probability that is much higher than the lower-bound/uniform baseline probability of transitioning to a random state in the CUPS diagram (1/12 = 0.083). In fact, when we compute the entropy rate (a measure of randomness) of the Markov chain [10] resulting from the CUPS diagram, we obtain a rate of 2.24; if the transitions were completely random the rate would be 3.58, and if the transitions were deterministic the rate would be 0.
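For concreteness, the sketch below shows one way to estimate the transition matrix behind a CUPS diagram and its entropy rate from labeled state sequences. It is a simplified re-implementation under our own assumptions (maximum-likelihood transition counts, stationary distribution via power iteration), not the authors' analysis code; as a sanity check, a uniform 12-state chain gives log2(12) ≈ 3.58, the random baseline quoted above.

import numpy as np

def transition_matrix(sessions, states):
    """Maximum-likelihood transition probabilities between CUPS states.

    `sessions` is a list of per-user state sequences (lists of state names).
    """
    idx = {s: i for i, s in enumerate(states)}
    counts = np.zeros((len(states), len(states)))
    for seq in sessions:
        for a, b in zip(seq, seq[1:]):
            counts[idx[a], idx[b]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no observations fall back to a uniform distribution.
    return np.divide(counts, row_sums,
                     out=np.full_like(counts, 1.0 / len(states)),
                     where=row_sums > 0)

def entropy_rate(P, n_iter=1000):
    """Entropy rate H = -sum_i pi_i sum_j P_ij log2 P_ij of the Markov chain."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(n_iter):          # power iteration for the stationary distribution pi
        pi = pi @ P
    log_P = np.where(P > 0, np.log2(np.where(P > 0, P, 1.0)), 0.0)
    return float(-(pi[:, None] * P * log_P).sum())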
Interface Design Suggestion: Identifying the current CUPS state can help serve programmer needs. If we are able to know the current programmer CUPS state during a coding session, we can better serve the programmer. For example:
• If the programmer is observed to have been deferring their thought on the last few suggestions, group successive Copilot suggestions and display them together.
• If the programmer is waiting for a suggestion, we can prioritize resources for them at that moment.
• While a user is prompt crafting, Copilot suggestions are often ignored and may be distracting; however, after a user is done with their prompt, they may expect high-quality suggestions. We could suppress suggestions during prompt crafting, but after the prompt crafting process is done, display multiple suggestions to the user and encourage them to browse through them.
Future work can, for example, realize these design suggestions by allowing custom keyboard macros for the programmer to signal their current CUPS state, or via a more automated approach that predicts their CUPS state.
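The sketch below illustrates how an IDE extension might act on these rules given a predicted or user-signalled CUPS state. It is purely illustrative: the state names mirror the taxonomy, but the thresholds, function name, and returned actions are hypothetical and are not part of Copilot or of the study.

def suggestion_policy(recent_states, current_state, pending_suggestions):
    """Decide how to surface the next Copilot suggestion(s) based on CUPS states."""
    # Rule 1: several recent 'deferred thought' acceptances -> batch suggestions so the
    # programmer can review them together, with syntax highlighting.
    if sum(s == "Deferring Thought For Later" for s in recent_states[-3:]) >= 2:
        return {"action": "group_and_show", "suggestions": pending_suggestions}
    # Rule 2: the programmer is explicitly waiting -> prioritize this completion request.
    if current_state == "Waiting For Suggestion":
        return {"action": "prioritize_request"}
    # Rule 3: suppress popups during prompt crafting, then show several candidates.
    if current_state == "Prompt Crafting":
        return {"action": "suppress", "after_prompt_show_top_k": 3}
    return {"action": "show_default"}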
We also investigated longer patterns in state transitions by searching for the most common sequences of states of varying lengths. We achieved this by searching over all possible segment n-grams and counting their occurrences over all sessions. We analyzed patterns in two ways: in Figure 6a, we merged consecutive segments that have the same state label into a single state (thus removing self-transitions), and in Figure 6b we looked at n-grams in the user timelines (including self-transitions) where we include both states and participant actions (shown, accepted, and rejected). The most common pattern (#1) in Figure 6a was a cycle where programmers repeatedly wrote new code functionality and then spent time verifying shown suggestions, indicating a new mode for programmers to solve coding tasks. At the same time, when we look at pattern (#B) in Figure 6b, which takes a closer look at when programmers are writing new functionality, we observe that they don't stop to verify suggestions and reject them as they continue to write. Other long patterns include (#2) (also shown as pattern #D), where programmers repeatedly accepted successive Copilot suggestions after verifying each of them. Finally, we observe in (#3) and (#A) programmers iterating on the prompt for Copilot until they obtain the suggestion they want. We elaborate more on this in the next subsection.

6.3 Programmers Often Defer Thought About Suggestions
An interesting CUPS state is that of 'Deferring Thought About A Suggestion'. This is illustrated in Figure 7, where programmers accept a suggestion or series of suggestions without sufficiently verifying them beforehand. This occurs either because programmers wish to see the suggestion with code highlighting, or because they want to see where Copilot's suggestions lead. Figure 5b shows that programmers do, in fact, frequently defer thought; we counted 61 states labeled as such. What drives the programmer to defer their thought about a suggestion rather than immediately verifying it? We initially conjectured that the act of deferring may be partially explained by the length of the suggestions. So, we compared the number of characters and the number of lines for suggestions depending on the programmer's state. We find that there is no statistical difference, according to a two-sample independent t-test (t = −0.58, p = 0.56; all p-values reported are corrected for multiple hypothesis testing with the Benjamini/Hochberg procedure with α = 0.05), in the average number of characters between suggestions on which thought was deferred and suggestions that were verified beforehand (75.81 compared to 69.06). The same holds for the average number of lines.

However, when we look at the likelihood of editing an accepted suggestion, we find that it is 0.18 if it was verified before acceptance, but 0.53 if it was deferred. This difference is significant according to a chi-square test (χ² = 29.2, p = 0). In fact, the programmer's CUPS state has a big effect on their future actions. In Table 3, we show the probability of the programmer accepting a suggestion given the CUPS state the programmer was in while the suggestion was being shown. We also show the probability of the programmer accepting a suggestion as a function of the CUPS state the programmer was in just before the suggestion was displayed. We observe that there is a big variation in the suggestion acceptance rate by CUPS state. For example, if the programmer was in the "Deferring Thought For Later" state, the probability of acceptance is 0.98 ± 0.02, compared to when a programmer is thinking about new code to write, where the probability is 0.12 ± 0.04. Note that the average probability of accepting a suggestion was 0.34.

What are programmers doing before they accept a suggestion? We found that the average probability of accepting a suggestion was 0.34. However, we observed that when the programmer was verifying a suggestion, their likelihood of accepting was 0.70. In contrast, if the programmer was thinking about new code to write, the probability dropped to 0.20. This difference was statistically significant according to Pearson's chi-squared test (χ² = 12.25, p = 0). Conversely, when programmers are engineering prompts, the likelihood of accepting a suggestion drops to 0.16. One reason for this might be that programmers want to write the prompt on their
[Figure 6 appears here, showing the most frequent patterns, e.g., (#1) Writing new functionality → Verifying suggestion, repeated; (#2) Waiting for suggestion → Verifying suggestion, repeated; Deferring Thought → Verifying suggestion; and Thinking about code to write → Writing new functionality.]

(a) Common patterns of transitions between distinct states. In individual participant timelines, the patterns visually appear as a change of color, but here we measure how often they appear across all participants (n=).
(b) Common patterns of states and actions (including self-transitions). Each pattern is extracted from user timelines and we count how often it appears in total (n=).
own without suggestions, and Copilot interrupts them. We show the full results for the other states in the Appendix.

Table 3: We compute the percentage of suggestions accepted given that the programmer was in the CUPS state while the suggestion was being shown (% Ss accepted while shown). We also compute the percentage of suggestions accepted given that the programmer was in the CUPS state before the suggestion was shown, i.e., the state just before the one in which the suggestion is shown (% Ss accepted before S is shown). We compute the standard error for each acceptance rate (%).

State | % Ss accepted while shown | % Ss accepted before S is shown
Thinking/Verifying Suggestion | 0.80 ± 0.02 | 0.56 ± 0.04
Prompt Crafting | 0.11 ± 0.02 | 0.22 ± 0.03
Looking up Documentation | 0.00 ± 0.00 | 0.29 ± 0.17
Writing New Functionality | 0.07 ± 0.02 | 0.31 ± 0.03
Thinking About New Code To Write | 0.12 ± 0.04 | 0.27 ± 0.04
Editing Last Suggestion | 0.03 ± 0.03 | 0.23 ± 0.05
Waiting For Suggestion | 0.10 ± 0.05 | 0.58 ± 0.06
Editing Written Code | 0.07 ± 0.04 | 0.17 ± 0.07
Writing Documentation | 0.40 ± 0.22 | 0.33 ± 0.19
Debugging/Testing Code | 0.23 ± 0.07 | 0.26 ± 0.06
Deferring Thought For Later | 0.98 ± 0.02 | 1.0 ± 0.0

6.4 CUPS Attributes Significantly More Time Verifying Suggestions than Simpler Metrics
We observed that programmers continued verifying suggestions after they accepted them. This happens by definition for 'deferred thought' states before accepting suggestions, but we find it also happens when programmers verify the suggestion before accepting it, and this leads to a significant increase in the total time spent verifying suggestions. First, when participants defer their thought about a suggestion they accepted, 53.2% of the time they verify the suggestion immediately afterward. When we adjust for the post-hoc time spent verifying, we compute a mean verification time of 15.21 seconds (σ = 20.68) and a median time of 6.48 s. This is nearly a five-fold increase in average time and a three-fold increase in median time over the pre-adjustment scores of 3.25 s (σ = 3.33) mean and 1.99 s median. These results are illustrated in Figure 8 and constitute a statistically significant increase according to a two-sample paired t-test (t = −4.88, p = 1.33 · 10⁻⁵). This phenomenon also occurs when programmers are in a 'Thinking/Verifying Suggestion' state before accepting a suggestion: 19% of the time they post-hoc verify the suggestion, which increases total verification time from 3.96 s (σ = 8.63) to 7.03 s (σ = 14.43) on average, a statistically significant increase (t = −4.17, p = 5 · 10⁻⁵). On the other hand, programmers often have to wait for suggestions to show up, due to either latency or Copilot not kicking in to provide a suggestion. If we sum the time between when a suggestion is shown and when the programmer accepts or rejects it, in addition to the time they spend waiting for the suggestion (indicated by the state 'Waiting for Suggestion'), then we get an increase from 6.11 s (σ = 15.52) to 6.51 s (σ = 15.61), which is minor on average but adds 2.5 seconds of delay when programmers do have to explicitly wait for suggestions.
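The adjustment described above can be phrased as a per-suggestion accounting rule. The sketch below is our own illustration of that rule; the record field names are assumptions, not the study's telemetry schema.

def adjusted_interaction_time(suggestion):
    """Time attributable to one suggestion once post-acceptance verification is counted.

    `suggestion` is a dict with illustrative fields:
      waiting_s            time labeled 'Waiting For Suggestion' before it appeared
      shown_to_decision_s  classic 'time to accept/reject' while it was on screen
      accepted             whether it was accepted
      posthoc_verify_s     time in 'Thinking/Verifying Suggestion' right after acceptance
    """
    t = suggestion.get("waiting_s", 0.0) + suggestion["shown_to_decision_s"]
    if suggestion["accepted"]:
        t += suggestion.get("posthoc_verify_s", 0.0)   # the Section 6.4 adjustment
    return t

def mean_adjusted_time(suggestions):
    return sum(adjusted_interaction_time(s) for s in suggestions) / len(suggestions)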
tasks that were provided by us instead of real tasks they might perform in the real world. Furthermore, the selection of tasks was limited and did not cover all tasks programmers might perform. We mostly conducted experiments with Python, with only two participants using C++ and JavaScript, when Copilot is capable of completing suggestions for myriad other languages. We also made an assumption about the granularity of telemetry whereby each segment contained at most one state when, in a more general setting, programmers may perform multiple activities within a single segment. We also did not capture longer-term costs of interacting, e.g., from accepting code with security vulnerabilities, or longer-horizon costs. To this end, security vulnerabilities and possible overreliance issues [2, 25, 26] are important areas of research that we do not address in this paper.

7.2 Future Work
We only investigated a limited number of programmer behaviors using the CUPS timelines and diagrams. There are many other aspects future work could investigate.

Predicting CUPS states. To enable the insights derived in Section 6, we need to be able to identify the current programmer's CUPS state. An avenue towards that is building predictive models using the labeled telemetry data collected from our user study. Ideally, we can leverage this labeled data to further label telemetry data from other coding sessions or other participants so that we can perform such analyses more broadly. Specifically, the input to such a model would be the current session context, for example, whether the programmer accepted the last suggestion, the current suggestion being surfaced, and the current prompt. We can leverage supervised learning methods to build such a model from the collected data. Such models would need to run in real time during programming and predict, at each instant, the current user CUPS state. This would enable the design suggestions proposed earlier and serve to compute the various metrics proposed. For example, if the model predicts that the programmer is deferring thought about a suggestion, we can group suggestions together to display them to the programmer. In the Appendix, we built small predictive models of programmers' CUPS state using labeled study data. However, the current amount of labeled data is not sufficient to build highly accurate models. There are multiple avenues to improve the performance of these models: 1) simply collecting a larger amount of labeled data, which would be expensive, 2) using methods from semi-supervised learning that leverage unlabeled telemetry to increase sample efficiency [35], and 3) collecting data beyond what is captured from telemetry, such as video footage of the programmer's screen (e.g., cursor movement), to be able to better predict with the same amount of data.
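As a concrete, if simplified, starting point, the sketch below trains a baseline multiclass classifier over CUPS states from session-context features. It is our own illustration: the feature names are hypothetical, and the models in the paper's Appendix may differ in both features and learning algorithm.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def featurize(segment):
    # Illustrative session-context features for one telemetry segment.
    return [
        float(segment["accepted_last_suggestion"]),    # did the previous suggestion get accepted?
        float(segment["suggestion_on_screen"]),        # is a suggestion currently shown?
        float(segment["suggestion_char_len"]),         # length of the surfaced suggestion (0 if none)
        float(segment["seconds_since_last_keystroke"]),
        float(segment["cursor_in_comment"]),           # crude proxy for prompt crafting
    ]

def train_cups_model(segments, cups_labels):
    """Baseline predictor of the current CUPS state from session context."""
    X = np.array([featurize(s) for s in segments])
    y = np.array(cups_labels)
    model = LogisticRegression(max_iter=1000)
    print("5-fold CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
    return model.fit(X, y)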
Assessing Individual Differences. There is an opportunity to apply the CUPS diagram to compare different user groups and to compare how individuals differ from an average user. Does the nature of inefficiencies differ between user groups? Can we personalize interventions? Finally, we could also compare how the CUPS diagram evolves over time for the same set of users.

Effect of Conditions and Tasks on Behavior. We only studied the behavior of programmers with the current version of Copilot. Future work could study how behavior differs with different versions of Copilot, especially when versions use different models. In the extreme, we could study behavior when Copilot is turned off. The latter could help assess the counterfactual cost of completing the task without AI assistance and help establish whether and where Copilot suggestions add net value for programmers. For example, maybe the system did not add enough value because the programmer kept getting into prompt-crafting rabbit holes instead of moving on and completing the functions manually or with the assistance of web search.

Likewise, if developers create a faster version of Copilot with less latency, the CUPS diagram could be used to establish whether it leads to reductions in time spent in the "Waiting for Suggestion" state.

Informing New Metrics. Since programmers' value may be multi-dimensional, how can we go beyond code correctness and measure added value for users? If Copilot improves productivity, which aspects were improved? Conversely, if it didn't, where are the inefficiencies? One option is to conduct a new study where we compare the CUPS diagram with Copilot assistance against a counterfactual condition where the programmers don't have access to Copilot, and use the two diagrams to determine where the system adds value or could have added value. For example, the analysis might reveal that some code snippets are too hard for programmers to complete by themselves but much faster with Copilot because the cost of double-checking and editing the suggestion is much less
than the cost of spending effort on it by themselves. Conversely, the analysis might reveal that a new intervention for helping engineer prompts greatly reduced people's time in "Prompt Crafting". Another option is to design offline metrics based on these insights that developers can use during the model selection and training phase. For example, given that programmers spent a large fraction of the time verifying suggestions, offline metrics that can estimate this (e.g., based on code length and complexity) may be useful indicators of which models developers should select for deployment. Future work will aim to test the effectiveness of these design suggestions as well.

Beyond Programming. We also hope our methodology is applied to study other forms of AI assistants that are rapidly being deployed. For example, one can make an analogous CUPS taxonomy for writing assistants for creative writers or lawyers.

7.3 Conclusion
We developed and proposed a taxonomy of common programmer activities (CUPS) and combined it with real-time telemetry data to profile the interaction. At present, CUPS contains 12 mutually exclusive activities that programmers perform between consecutive Copilot actions (such as accepting, rejecting, and viewing suggestions). We gathered real-world instance data of CUPS by conducting a user study with 21 programmers within our organization, where they solved coding tasks with Copilot and retrospectively labeled CUPS for their coding session. We collected over 3137 instances of CUPS and analyzed them to generate CUPS timelines that show individual behavior and CUPS diagrams that show aggregate insights into the behavior of our participants. We also studied the time spent in these states, patterns in user behavior, and better estimates of the cost (in terms of time) of interacting with Copilot. Our studies with CUPS labels revealed that when solving a coding task with Copilot, programmers may spend a large fraction of total session time (34.3%) just double-checking and editing Copilot suggestions, and spend more than half of the task time on Copilot-related activities, together indicating that introducing Copilot into an IDE can significantly change user behavior. We proposed new metrics to measure the interaction by computing the time spent in each CUPS state, and modifications to existing time and acceptance metrics that account for suggestions that get verified only after they get accepted. We also proposed a new interface design suggestion: if we allow programmers to signal their current state, then we can better serve their needs, for example, by reducing latency if they are waiting for a suggestion.

ACKNOWLEDGMENTS
HM partly conducted this work during an internship at Microsoft Research (MSR). We acknowledge valuable feedback from colleagues across MSR and GitHub including Saleema Amershi, Victor Dibia, Forough Poursabzi, Andrew Rice, Eirini Kalliamvakou, and Edward Aftandilian.

REFERENCES
[1] Amazon. 2022. ML-powered coding companion – Amazon CodeWhisperer. https://aws.amazon.com/codewhisperer/
[2] Owura Asare, Meiyappan Nagappan, and N Asokan. 2023. Is GitHub's Copilot as bad as humans at introducing vulnerabilities in code? Empirical Software Engineering 28, 6 (2023), 1–24.
[3] Shraddha Barke, Michael B James, and Nadia Polikarpova. 2023. Grounded Copilot: How programmers interact with code-generating models. Proceedings of the ACM on Programming Languages 7, OOPSLA1 (2023), 85–111.
[4] Ruven Brooks. 1977. Towards a theory of the cognitive processes in computer programming. International Journal of Man-Machine Studies 9, 6 (1977), 737–751.
[5] Ruven E Brooks. 1980. Studying programmer behavior experimentally: The problems of proper methodology. Commun. ACM 23, 4 (1980), 207–213.
[6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[7] Stuart K Card, Thomas P Moran, and Allen Newell. 1980. The keystroke-level model for user performance time with interactive systems. Commun. ACM 23, 7 (1980), 396–410.
[8] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
[9] Arghavan Moradi Dakhel, Vahid Majdinasab, Amin Nikanjam, Foutse Khomh, Michel C Desmarais, and Zhen Ming Jack Jiang. 2023. GitHub Copilot AI pair programmer: Asset or liability? Journal of Systems and Software 203 (2023), 111734.
[10] Laura Ekroot and Thomas M Cover. 1993. The entropy of Markov trajectories. IEEE Transactions on Information Theory 39, 4 (1993), 1418–1421.
[11] Mikhail Evtikhiev, Egor Bogomolov, Yaroslav Sokolov, and Timofey Bryksin. 2023. Out of the BLEU: How should we assess quality of the code generation models? Journal of Systems and Software 203 (2023), 111741.
[12] Nicole Forsgren, Margaret-Anne Storey, Chandra Maddila, Thomas Zimmermann, Brian Houck, and Jenna Butler. 2021. The SPACE of developer productivity. Commun. ACM 64, 6 (2021), 46–53.
[13] Nicole Forsgren, Margaret-Anne Storey, Chandra Maddila, Thomas Zimmermann, Brian Houck, and Jenna Butler. 2021. The SPACE of Developer Productivity: There's more to it than you think. Queue 19, 1 (2021), 20–48.
[14] GitHub. 2022. GitHub Copilot - your AI pair programmer. https://github.com/features/copilot
[15] Vincent J Hellendoorn, Sebastian Proksch, Harald C Gall, and Alberto Bacchelli. 2019. When code completion fails: A case study on real-world completions. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 960–970.
[16] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. 2021. Measuring Coding Challenge Competence With APPS. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
[17] Ellen Jiang, Edwin Toh, Alejandra Molina, Kristen Olson, Claire Kayacik, Aaron Donsbach, Carrie J Cai, and Michael Terry. 2022. Discovering the Syntax and Strategies of Natural Language Programming with Generative Language Models. In CHI Conference on Human Factors in Computing Systems. 1–19.
[18] Bonnie E John and David E Kieras. 1996. The GOMS family of user interface analysis techniques: Comparison and contrast. ACM Transactions on Computer-Human Interaction (TOCHI) 3, 4 (1996), 320–351.
[19] An Ju and Armando Fox. 2018. TEAMSCOPE: Measuring software engineering processes with teamwork telemetry. In Proceedings of the 23rd Annual ACM Conference on Innovation and Technology in Computer Science Education. 123–128.
[20] Eirini Kalliamvakou. 2022. Research: Quantifying GitHub Copilot's impact on developer productivity and happiness. https://github.blog/2022-09-07-research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/
[21] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with AlphaCode. arXiv preprint arXiv:2203.07814 (2022).
[22] Jenny T Liang, Chenyang Yang, and Brad A Myers. 2023. Understanding the Usability of AI Programming Assistants. arXiv preprint arXiv:2303.17125 (2023).
[23] Henry Lieberman and Christopher Fry. 1995. Bridging the gulf between code and behavior in programming. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 480–486.
[24] Unaizah Obaidellah, Mohammed Al Haek, and Peter C-H Cheng. 2018. A survey on the usage of eye-tracking in computer programming. ACM Computing Surveys (CSUR) 51, 1 (2018), 1–58.
[25] Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2022. Asleep at the keyboard? Assessing the security of GitHub Copilot's code contributions. In 2022 IEEE Symposium on Security and Privacy (SP). IEEE, 754–768.
[26] Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and Brendan Dolan-Gavitt. 2021. Can OpenAI Codex and Other Large Language Models Help Us Fix Security Bugs? arXiv preprint arXiv:2112.02125 (2021).
[27] Norman Peitek, Janet Siegmund, and Sven Apel. 2020. What drives the reading order of programmers? An eye tracking study. In Proceedings of the 28th International Conference on Program Comprehension. 342–353.
[28] Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The impact of AI on developer productivity: Evidence from GitHub Copilot. arXiv preprint arXiv:2302.06590 (2023).
[29] James Prather, Brent N Reeves, Paul Denny, Brett A Becker, Juho Leinonen, Andrew Luxton-Reilly, Garrett Powell, James Finnie-Ansley, and Eddie Antonio Santos. 2023. "It's Weird That it Knows What I Want": Usability and Interactions with Copilot for Novice Programmers. arXiv preprint arXiv:2304.02491 (2023).
[30] Advait Sarkar, Andrew D Gordon, Carina Negreanu, Christian Poelitz, Sruti Srinivasa Ragavan, and Ben Zorn. 2022. What is it like to program with artificial intelligence? arXiv preprint arXiv:2208.06213 (2022).
[31] Beau A Sheil. 1981. The psychological study of programming. ACM Computing Surveys (CSUR) 13, 1 (1981), 101–120.
[32] Zhensu Sun, Xiaoning Du, Fu Song, Shangwen Wang, Mingze Ni, and Li Li. 2023. Don't Complete It! Preventing Unhelpful Code Completion for Productive and Sustainable Neural Code Completion Systems. In 2023 IEEE/ACM 45th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). IEEE, 324–325.
[33] Maxim Tabachnyk and Stoyan Nikolov. 2022. ML-enhanced code completion improves developer productivity. https://ai.googleblog.com/2022/07/ml-enhanced-code-completion-improves
[34] Priyan Vaithilingam, Tianyi Zhang, and Elena L Glassman. 2022. Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts. 1–7.
[35] Jesper E Van Engelen and Holger H Hoos. 2020. A survey on semi-supervised learning. Machine Learning 109, 2 (2020), 373–440.
[36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[37] Zdenek Velart and Petr Šaloun. 2006. User behavior patterns in the course of programming in C++. In Proceedings of the Joint International Workshop on Adaptivity, Personalization & the Semantic Web. 41–44.
[38] Justin D Weisz, Michael Muller, Stephanie Houde, John Richards, Steven I Ross, Fernando Martinez, Mayank Agarwal, and Kartik Talamadupula. 2021. Perfection not required? Human-AI partnerships in code translation. In 26th International Conference on Intelligent User Interfaces (IUI). 402–412.
[39] Tongshuang Wu, Kenneth Koedinger, et al. 2023. Is AI the better programming partner? Human-Human Pair Programming vs. Human-AI pAIr Programming. arXiv preprint arXiv:2306.05153 (2023).
[40] Albert Ziegler, Eirini Kalliamvakou, X Alice Li, Andrew Rice, Devon Rifkin, Shawn Simister, Ganesh Sittampalam, and Edward Aftandilian. 2022. Productivity assessment of neural code completion. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming. ACM, 21–29.