Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Programming

Hussein Mozannar ([email protected]), Massachusetts Institute of Technology, Boston, USA
Gagan Bansal ([email protected]), Microsoft Research, Redmond, USA
Adam Fourney ([email protected]), Microsoft Research, Redmond, USA
Eric Horvitz ([email protected]), Microsoft Research, Redmond, USA
[Figure 1: (a) Copilot suggesting code for a LogisticRegression class inside Visual Studio Code; (b) the CUPS taxonomy with the share of session time spent in each state, led by Thinking/Verifying Suggestion (22.4%), Writing New Functionality (14.05%), Editing Last Suggestion (11.90%), Prompt Crafting (11.56%), and Debugging/Testing Code (11.31%); (c) an example session timeline of shown, accepted, and rejected suggestions.]
Figure 1: Profiling a coding session with the CodeRec User Programming States (CUPS). In (a) we show the operating mode of CodeRec inside Visual Studio Code. In (b) we show the CUPS taxonomy used to describe CodeRec-related programmer activities. A coding session can be summarized as a timeline in (c) where the programmer transitions between states.
ABSTRACT
Code-recommendation systems, such as Copilot and CodeWhisperer, have the potential to improve programmer productivity by suggesting and auto-completing code. However, to fully realize their potential, we must understand how programmers interact with these systems and identify ways to improve that interaction. To seek insights about human-AI collaboration with code recommendation systems, we studied GitHub Copilot, a code-recommendation system used by millions of programmers daily. We developed CUPS, a taxonomy of common programmer activities when interacting with Copilot. Our study of 21 programmers, who completed coding tasks and retrospectively labeled their sessions with CUPS, showed that CUPS can help us understand how programmers interact with code-recommendation systems, revealing inefficiencies and time costs. Our insights reveal how programmers interact with Copilot and motivate new interface designs and metrics.

CCS CONCEPTS
• Human-centered computing → User models; User studies; • Software and its engineering → Automatic programming.

KEYWORDS
AI-assisted Programming, Copilot, User State Model

ACM Reference Format:
Hussein Mozannar, Gagan Bansal, Adam Fourney, and Eric Horvitz. 2024. Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Programming. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24), May 11–16, 2024, Honolulu, HI, USA. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3613904.3641936

This work is licensed under a Creative Commons Attribution International 4.0 License. CHI '24, May 11–16, 2024, Honolulu, HI, USA. © 2024 Copyright held by the owner/author(s). ACM ISBN 979-8-4007-0330-0/24/05.

1 INTRODUCTION
Programming-assistance systems based on the adaptation of large language models (LLMs) to code recommendations have been recently introduced to the public. Popular systems, including Copilot [14], CodeWhisperer [1], and AlphaCode [21], signal a potential shift in how software is developed. Though there are differences in specific interaction mechanisms, the programming-assistance systems generally extend existing IDE code completion mechanisms (e.g., IntelliSense, https://code.visualstudio.com/docs/editor/intellisense) by producing suggestions using neural models trained on billions of lines of code [8]. The LLM-based completion models can suggest anything from sentence-level completions to entire functions and classes in a wide array of programming languages. These large neural models are deployed with the goal of accelerating the efforts of software engineers, reducing their workloads, and improving their productivity.

Early assessments suggest that programmers do feel more productive when assisted by the code recommendation models [40] and that they prefer these systems to earlier code completion engines [34]. In fact, a recent study from GitHub found that Copilot could potentially reduce task completion time by a factor of two [28]. While these studies help us understand the benefits of code-recommendation systems, they do not allow us to identify avenues to improve and understand the nature of interaction with these systems.

In particular, the neural models introduce new tasks into a developer's workflow, such as writing AI prompts [17] and verifying AI suggestions [34], which can be lengthy. Existing interaction metrics, such as suggestion acceptance rates, time to accept (i.e., the time a suggestion remains onscreen), and reduction of tokens typed, tell only part of this interaction story. For example, when suggestions are presented in monochrome popups (Figure 1), programmers may choose to accept them into their codebases so that they can be read with code highlighting enabled. Likewise, when models suggest only one line of code at a time, programmers may accept sequences before evaluating them together as a unit. In both scenarios, considerable work verifying and editing suggestions occurs after the programmer has accepted the recommended code. Prior interaction metrics also largely miss user effort invested in devising and refining the prompts used to query the models. When code completion tools are evaluated using coarser task-level metrics such as task completion time [20], we begin to see signals of the benefits of AI-driven code completion but lack sufficient detail to understand the nature of these gains, as well as possible remaining inefficiencies. We argue that an ideal approach would be sufficiently low level to support interaction profiling while sufficiently high level to capture meaningful programmer activities.

Given the nascent nature of these systems, numerous questions exist regarding the behavior of their users:
• What activities do users undertake in anticipation of, or to trigger, a suggestion?
• What mental processes occur while the suggestions are onscreen, and do people double-check suggestions before or after acceptance?
• How costly for users are these various new tasks, and which take the most time?

To answer these and related questions in a systematic manner, we apply a mixed-methods approach to analyze interactions with a popular code suggestion model, GitHub Copilot (https://github.com/features/copilot), which has more than a million users. To emphasize that our analysis is not restricted to the specifics of Copilot, we use the term CodeRec to refer to any instance of code suggestion models, including Copilot. Through small-scale pilot studies and our first-hand experience using Copilot for development, we develop a novel taxonomy of common states of a programmer when interacting with CodeRec models (such as Copilot), which we refer to as CodeRec User Programming States (CUPS). The CUPS taxonomy serves as the main tool to answer our research questions.

Given the initial taxonomy, we conducted a user study with 21 developers who were asked to retrospectively review videos of their coding sessions and explicitly label their intents and actions using this model, with an option to add new states if necessary. The study participants labeled a total of 3137 coding segments and interacted with 1096 suggestions. The study confirmed that the taxonomy was sufficiently expressive, and we further learned transition weights and state dwell times, something we could not do without this experimental setting. Together, these data can be assembled into various instruments, such as the CUPS diagram (Figure 1), to facilitate profiling interactions and identifying inefficiencies. Moreover, we show that such analysis nearly doubles our estimates for how much developer time can be attributed to interactions with code suggestion systems, as compared with existing metrics. We believe that identifying the current CUPS state during a programming session can help serve programmer needs. This can be accomplished using custom keyboard macros or automated prediction of CUPS states, as discussed in our future work section and the Appendix. Overall, we leverage the CUPS diagram to identify some opportunities to address inefficiencies in the current version of Copilot.

In sum, our main contributions are the following:
• A novel taxonomy of common activities of programmers (called CUPS) when interacting with code recommendation systems (Section 4).
• A dataset of coding sessions annotated with user actions, CUPS, and video recordings of programmers coding with Copilot (Section 5).
• Analysis of which CUPS states programmers spend their time in when completing coding tasks (Subsection 6.1).
• An instrument to analyze programmer behavior (and patterns in behavior) based on a finite-state machine on CUPS states (Subsection 6.2).
• An adjustment formula to properly account for how much time programmers spend verifying CodeRec suggestions (Subsection 6.4), inspired by the CUPS state of deferring thought (Subsection 6.3).

The remainder of this paper is structured as follows: We first review related work on AI-assisted programming (Section 2) and formally describe Copilot, along with a high-level overview of programmer-CodeRec interaction (Section 3). To further understand this interaction, we define our model of CodeRec User Programming States (CUPS) (Section 4) and then describe a user study designed to collect programmer annotations of their states (Section 5). We use the collected data to analyze the interactions using the CUPS diagram, revealing new insights into programmer behavior (Section 6). We then discuss limitations and future work, and conclude (Section 7).

2 BACKGROUND AND RELATED WORK
Large language models based on the Transformer network [36], such as GPT-3 [6], have found numerous applications in natural language processing. Codex [8], a GPT model trained on 54 million GitHub repositories, demonstrates that LLMs can very effectively solve various programming tasks. Specifically, Codex was initially tested on the HumanEval dataset containing 164 programming problems, where it is asked to write the function body from a docstring [8], and achieves 37.7% accuracy with a single generation. Various metrics and datasets have been proposed to measure the performance of code generation models [9, 11, 16, 21]. However, in each case, these metrics test how well the model can complete code in an offline setting without developer input rather than evaluating how well such recommendations assist programmers in situ. This issue has also been noted in earlier work on non-LLM-based code completion models, where performance on completion benchmarks overestimates the model's utility to developers [15]. Importantly, however, these results may not hold for LLM-based approaches, which are radically different [30].

One straightforward approach to understanding the utility of neural code completion services, including their propensity to deliver incomplete or imperfect suggestions, is to simply ask developers. To this end, Weisz et al. interviewed developers and found that they did not require a perfect recommendation model for the model to be useful [38]. Likewise, Ziegler et al. surveyed over 2,000 Copilot users [40] and asked about perceived productivity gains using a survey instrument based on the SPACE framework [13]; we incorporate the same survey design in our own study. They found both that developers felt more productive using Copilot and that these self-reported perceptions were reasonably correlated with suggestion acceptance rates. Liang et al. [22] administered a survey to 410 programmers who use various AI programming assistants, including Copilot, and highlighted why the programmers use the AI assistants as well as numerous usability issues. Similarly, Prather et al. [29] surveyed how introductory programming students utilize Copilot.

While these self-reported measures of utility and preference are promising, we would expect gains to be reflected in objective metrics of productivity. Indeed, one ideal method would be to conduct randomized control trials where one set of participants writes code with a recommendation engine while another set codes without it. GitHub performed such an experiment where 95 participants were split into two groups and asked to write a web server. The study concluded by finding that task completion time was reduced by 55.8% in the Copilot condition [28]. Likewise, a study by Google showed that an internal CodeRec model yielded a 6% reduction in 'coding iteration time' [33]. On the other hand, Vaithilingam et al. [34], in a study of 24 participants, showed no significant improvement in task completion time, yet participants stated a clear preference for Copilot. An interesting comparison to Copilot is human-human pair programming, which Wu et al. [39] detail.

A significant amount of work has tried to understand the behavior of programmers [4, 5, 23, 31] using structured user studies under the name of "psychology of programming." This line of work tries to understand the effect of programming tools on the time to solve a task or the ease of writing code, and how programmers read and write code. Researchers often use telemetry with detailed logging of keystrokes [19, 37] to understand behavior. Moreover, eye-tracking is also used to understand how programmers read code [24, 27]. Our research uses raw telemetry alongside user-labeled states to understand behavior; future research could also utilize eye-tracking and raw video to get deeper insights into behavior.

This wide dispersion of results raises interesting questions about the nature of the utility afforded by neural code completion engines: how, and when, are such systems most helpful, and conversely, when do they add additional overhead? This is the central question of our work. The related work closest to answering this question is that of Barke et al. [3], who showed that interaction with Copilot falls into two broad categories: the programmer is either in "acceleration mode," where they know what they want to do and Copilot serves to make them faster, or they are in "exploration mode," where they are unsure what code to write and Copilot helps them explore. The taxonomy we present in this paper, CUPS, enriches this further with granular labels for programmers' intents. Moreover, the data collected in this work was labeled by the participants themselves rather than by researchers interpreting their actions, allowing for more faithful intent and activity labeling, and the data collected in our study can also be used to build predictive models as in [32]. The next section describes the Copilot system formally and describes the data collected when interacting with Copilot.
3 COPILOT SYSTEM DESCRIPTION
To better understand how code recommendation systems influence the effort of programming, we focus on GitHub Copilot, a popular and representative example of this class of tools. Copilot (this manuscript refers to Copilot as of August 2022) is based on a Large Language Model (LLM) and assists programmers inside an IDE by recommending code suggestions any time the programmer pauses their typing. Figure 1 shows an example of Copilot recommending a code snippet as an inline, monochrome popup, which the programmer can accept using a keyboard shortcut (e.g., <tab>).

To serve suggestions, Copilot uses a portion of the code written so far as a prompt, p, which it passes to the underlying LLM. The model then generates a suggestion, s, which it deems to be a likely completion. In this regime, programmers can engineer the prompt to generate better suggestions by carefully authoring natural language comments in the code such as "# split the data into train and test sets." In response to a Copilot suggestion, the programmer can then take one of several actions a, where a ∈ {browse, accept, reject}. The last of these actions, reject, is triggered implicitly by continuing to type something that differs from the suggestion or by pressing the escape key. The browse action enables the programmer to change the suggestion shown with a keyboard shortcut, choosing from a set of at most three suggestions.

Copilot logs aspects of the interactions via telemetry. We leverage this telemetry in the studies described in this paper. Specifically, whenever a suggestion is shown, accepted, rejected, or browsed, we record a tuple to the telemetry database, (t_i, a_i, p_i, s_i), where t_i represents the within-session timestamp of the i-th event (t_0 = 0), a_i details the action taken (augmented to include 'shown'), and p_i and s_i capture features of the prompt and suggestion, respectively. Figure 2 displays telemetry of a coding session, and Figure 1a shows Copilot implemented as a VS Code plugin. We have the ability to capture telemetry for any programmer interacting with Copilot; this is used to collect data for the user study in Section 5.

Figure 2: Schematic of interaction telemetry with Copilot as a timeline. For a given coding session, the telemetry contains a sequence of timestamps and actions with associated prompt and suggestion features (not shown).

3.1 Influences of CodeRec on Programmer's Activities
Despite the limited changes that Copilot introduces to an IDE's repertoire of actions, LLM-based code suggestions can significantly influence how programmers author code. Specifically, Copilot leverages LLMs to stochastically generate novel code to fit the arbitrary current context. As such, the suggestions may contain errors (and can appear to be unpredictable) and require that programmers double-check and edit them for correctness. Furthermore, programmers may have to refine the prompts to get the best suggestions. These novel activities associated with the AI system introduce new efforts and potential disruptions to the flow of programming. We use time as a proxy to study the new costs of interaction introduced by the AI system. We recognize that this approach is incomplete: the costs associated with solving programming tasks are multi-dimensional, and it can be challenging to assign a single real-valued number to cover all facets of the task [12]. Nevertheless, we argue that, like accuracy, efficiency-capturing measures of time are an important dimension of cost that is relevant to most programmers.

3.2 Programmer Activities in Telemetry Segments
Copilot's telemetry captures only instantaneous user actions (e.g., accept, reject, browse), as well as the suggestion display event. By themselves, these entries do not reveal such programmer activities as double-checking and prompt engineering, as these activities happen between two consecutive instantaneous events. We argue that the regions between events, which we refer to as telemetry segments, contain important user intentions and activities unique to programmer-CodeRec interaction, which we need to understand in order to answer how Copilot affects programmers, and where and when Copilot suggestions are useful to programmers.

Building on this idea, telemetry segments can be split into two groups (Figure 2). The first group includes segments that start with a suggestion shown event and end with an action (accept, reject, or browse). Here, the programmer is paused and has yet to take action. We refer to this as 'User Before Action'. The second group includes segments that start with an action event and end with a display event. During this period, the programmer can be either typing or paused; hence we denote it as 'User Typing or Paused'. These two groups form the foundation of a deeper taxonomy of programmers' activities, which we will further develop in the next section.
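To make the segment construction concrete, the sketch below shows one way the raw telemetry tuples could be cut into the two segment types. It is a minimal illustration assuming a simple list-of-events representation; the Event and Segment classes and their field names are hypothetical, not Copilot's actual telemetry schema.

    from dataclasses import dataclass

    @dataclass
    class Event:
        t: float        # within-session timestamp (t0 = 0)
        action: str     # 'shown', 'accepted', 'rejected', or 'browsed'

    @dataclass
    class Segment:
        start: float
        end: float
        kind: str       # 'User Before Action' or 'User Typing or Paused'

    def segment_session(events):
        """Cut a session's event stream into telemetry segments."""
        segments = []
        for prev, curr in zip(events, events[1:]):
            if prev.action == 'shown':
                # a suggestion is on screen until the user acts on it
                kind = 'User Before Action'
            else:
                # after an accept/reject/browse, the user types or pauses
                # until the next suggestion is shown
                kind = 'User Typing or Paused'
            segments.append(Segment(prev.t, curr.t, kind))
        return segments

Each segment produced this way is what the study participants later annotate with a single CUPS label.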
4 A TAXONOMY FOR UNDERSTANDING PROGRAMMER-CODEREC INTERACTION: CUPS

4.1 Creating the Taxonomy
Our objective is to create an extensive, but not complete, taxonomy of programmer activities when interacting with CodeRec that enables a useful study of the interaction. To refine the taxonomy of programmers' activities, we developed a labeling tool and populated it with an initial set of activities based on our own experiences from extensive interactions with Copilot (Figure 4). The tool enables users to watch a recently captured screen recording of them solving a programming task with Copilot's assistance and to retrospectively annotate each telemetry segment with an activity label. We use this tool first to refine our taxonomy with a small pilot study (described below) and then to collect data in Section 5.

The labeling tool (Figure 4) contains three main sections: a) A navigation panel on the left, which displays and allows navigating between telemetry segments and highlights the current segment being labeled in blue. The mouse or arrow keys are used to navigate between segments. b) A video player on the right, which plays the corresponding video segments in a loop. The participant can watch the video segments any number of times. c) Buttons on the bottom corresponding to the CUPS taxonomy, along with an "IDK" button and a free-form text box to write custom state labels. Buttons also have associated keyboard bindings for easy annotation.

Figure 3: Taxonomy of programmer's activities when interacting with CodeRec (CUPS). Segments are first split by whether the user is Typing (Prompt Crafting, Writing New Functionality, Debugging/Testing Code, Writing Documentation, Editing Written Code, Editing Last Suggestion) or Paused (Waiting For Suggestion, Thinking/Verifying Suggestion, Thinking About New Code to Write, Looking up Documentation, Deferring Thought for Later, Not Thinking).

Figure 4: Screenshot of the retrospective labeling tool for coding sessions. Left: Navigation panel for telemetry segments. Right: Video player for reviewing video of a coding session. Bottom: Buttons and text box for labeling states.
Table 1: Description of each state in CodeRec User Programming States (CUPS).

Thinking/Verifying Suggestion: Actively thinking about and verifying a shown or accepted suggestion
Not Thinking: Not thinking about the suggestion or code; programmer away from keyboard
Deferring Thought For Later: Programmer accepts a suggestion without completely verifying it, but plans to verify it afterward
Thinking About New Code To Write: Thinking about what code or functionality to implement and write
Waiting For Suggestion: Waiting for a CodeRec suggestion to be shown
Writing New Code: Writing code that implements new functionality
Editing Last Suggestion: Editing the last accepted suggestion
Editing (Personally) Written Code: Editing code written by the programmer that is not a CodeRec suggestion, for the purpose of fixing existing functionality
Prompt Crafting: Writing a prompt in the form of a comment or code to obtain a desired CodeRec suggestion
Writing Documentation: Writing comments or a docstring for the purpose of documentation
Debugging/Testing Code: Running or debugging code to check functionality; may include writing tests or debugging statements
Looking up Documentation: Checking an external source for the purpose of understanding code functionality (e.g., Stack Overflow)
Accepted: Accepted a CodeRec suggestion
Rejected: Rejected a CodeRec suggestion

To label a particular video segment, we asked participants to consider the hierarchical structure of CUPS in Figure 3. The hierarchical structure first distinguishes segments by whether typing occurred in that segment and then decides based on the typing or non-typing states. For example, in a segment where a participant was initially double-checking a suggestion and then wrote new code to accomplish a task, the appropriate label would be "Writing New Functionality" since the user eventually typed in the segment. In cases where two states are appropriate and fall under the same branch of the hierarchy, e.g., if the participant double-checked a suggestion and then looked up documentation, they were asked to pick the state in which they spent the majority of the time. These issues arise because we collect a single state for each telemetry segment.

Pilot. Through a series of pilots involving the authors of the paper, as well as three other participants drawn from our organization, we iteratively applied the tool to our own coding sessions and to the user study tasks described in Section 5. We then expanded and refined the taxonomy by incorporating any "custom state" (entered using the text field) written by the pilot participants. The states 'Debugging/Testing Code', 'Looking up Documentation', and 'Writing Documentation' were added through the pilots. By the last pilot participant, the code book was stable and saturated, as they did not write a state that was not yet covered. We observed in our study that the custom text field was rarely used. We describe the resultant taxonomy in the sections below.

4.2 Taxonomy of Telemetry Segments
Figure 3 shows the finalized taxonomy of programmer activities for individual telemetry segments with Copilot. As noted earlier, the taxonomy is rooted in two segment types: 'User Typing or Paused' and 'User Before Action'. We first detail the 'User Typing or Paused' segments, which precede shown events (Figure 2) and are distinguished by the fact that no suggestions are displayed during this time. As the name implies, users can find themselves in this state if they are either actively 'Typing' (active typing allows for brief pauses between keystrokes) or have 'Paused' but have not yet been presented with a suggestion. In cases where the programmer is actively typing, they could be completing any of a number of tasks such as: 'writing new functionality', 'editing existing code', 'editing prior (CodeRec) suggestions', 'debugging code', or authoring natural language comments, including both documentation and prompts directed at CodeRec (i.e., 'prompt crafting'). When the user pauses, they may simply be 'waiting for a suggestion' or can be in any number of states common to 'User Before Action' segments.

In every 'User Before Action' segment, CodeRec is displaying a suggestion, and the programmer is paused and not typing. They could be reflecting on and verifying that suggestion, or they may not be paying attention to the suggestion and thinking about other code to write instead. The programmer can also defer their efforts on the suggestion to a later time period by accepting it immediately, then pausing to review the code later. This can occur, for example, because the programmer desires syntax highlighting rather than grey text, or because the suggestion is incomplete and the programmer wants to allow Copilot to complete its implementation before evaluating the code as a cohesive unit. The latter situation tends to arise when Copilot displays code suggestions line by line (e.g., Figure 7).

The leaf nodes of the finalized taxonomy represent 12 distinct states that programmers can find themselves in. These states are illustrated in Figure 3 and further described in Table 1. While the states are meant to be distinct, siblings may share many traits. For example, "Writing New Functionality" and "Editing Written Code" are conceptually very similar. This taxonomy also bears resemblance to the keystroke-level model in that it assigns a time cost to mental processes as well as typing [7, 18]. As evidenced by the user study (described in the next section), these 12 states provide a language that is both general enough to capture most activities at this level of abstraction and specific enough to meaningfully capture activities unique to LLM-based code suggestion systems.
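For analysis purposes, the taxonomy can be encoded directly as a small data structure. The sketch below is one possible encoding of the 12 CUPS states from Table 1 and their Typing/Paused grouping from Figure 3; the enum and group names are our own illustrative choices, not an artifact released with the study.

    from enum import Enum

    class CUPS(Enum):
        # states that involve typing
        WRITING_NEW_FUNCTIONALITY = "Writing New Functionality"
        EDITING_WRITTEN_CODE = "Editing Written Code"
        EDITING_LAST_SUGGESTION = "Editing Last Suggestion"
        DEBUGGING_TESTING_CODE = "Debugging/Testing Code"
        WRITING_DOCUMENTATION = "Writing Documentation"
        PROMPT_CRAFTING = "Prompt Crafting"
        # states where the user is paused (including 'User Before Action')
        WAITING_FOR_SUGGESTION = "Waiting For Suggestion"
        THINKING_VERIFYING_SUGGESTION = "Thinking/Verifying Suggestion"
        THINKING_ABOUT_NEW_CODE = "Thinking About New Code To Write"
        LOOKING_UP_DOCUMENTATION = "Looking up Documentation"
        DEFERRING_THOUGHT_FOR_LATER = "Deferring Thought For Later"
        NOT_THINKING = "Not Thinking"

    TYPING_STATES = {CUPS.WRITING_NEW_FUNCTIONALITY, CUPS.EDITING_WRITTEN_CODE,
                     CUPS.EDITING_LAST_SUGGESTION, CUPS.DEBUGGING_TESTING_CODE,
                     CUPS.WRITING_DOCUMENTATION, CUPS.PROMPT_CRAFTING}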
5 CUPS DATA COLLECTION STUDY
To study CodeRec-programmer interaction in terms of CodeRec User Programming States, we designed a user study where programmers perform a coding task, then review and label videos of their coding session using the telemetry segment-labeling tool described earlier. We describe the procedure, the participants, and the results in the sections that follow.

5.1 Procedure
We conducted the study over a video call and asked participants to use a remote desktop application to access a virtual machine (VM). Upon connecting, participants were greeted with the study environment consisting of Windows 10, together with Visual Studio Code (VS Code) augmented with the Copilot plugin.

Participants were then presented with a programming task drawn randomly from a set of eight pre-selected tasks (Table 2). If the participant was unfamiliar with the task content, we offered them a different random task. The task set was designed during the pilot phase so that individual tasks fit within a 20-minute block and so that, together, the collection of tasks surfaces a sufficient diversity of programmer activities. It is crucial that the task is of reasonable duration so that participants are able to remember all their activities, since they will be required to label their session immediately afterward. Since the CUPS taxonomy includes states of thought, participants must label their session immediately after coding, and each study took approximately 60 minutes in total. To further improve diversity, task instructions were presented to participants as images to encourage participants to author their own Copilot prompts rather than copying and pasting from the problem description. The full set of tasks and instructions is provided in the Appendix.

Upon completing the task (or reaching the 20-minute mark), we loaded the participant's screen recording and telemetry into the labeling tool (previously detailed in Section 4.1). The researcher then briefly demonstrated the correct operation of the tool and explained the CUPS taxonomy. Participants were then asked to annotate their coding session with CUPS labels. Self-labeling allows us to easily scale such a study and enables more accurate labels for each participant, but may cause inconsistent labeling across participants. Critically, this labeling occurred within minutes of completing the programming task so as to ensure accurate recall. We do not include a baseline condition where participants perform the coding task without Copilot, as this work focuses on understanding and modeling the interaction with the current version of Copilot.

Finally, participants completed a post-study questionnaire about their experience, mimicking the one in [40]. The entire experiment was designed to last 60 minutes. The study was approved by our institutional review board (IRB), and participants received a $50.00 gift card as remuneration for their participation.

5.2 Participants
To recruit participants, we posted invitations to developer-focused email distribution lists within our large organization. We recruited 21 participants with varying degrees of experience using Copilot: 7 used Copilot more than a few times a week, 3 used it once a month or less, and 11 had never used it before. For participants who had never used it before, the experimenter gave a short oral tutorial on Copilot explaining how it can be invoked and how to accept suggestions. Participants' roles in the organization ranged from software engineers (with different levels of seniority) to researchers and graduate student interns. In terms of programming expertise, only 6 participants had less than 2 years of professional programming experience (i.e., excluding years spent learning to program), 5 had between 3 and 5 years, 7 had between 6 and 10 years, and 3 had more than 11 years of experience. Participants used a language in which they stated proficiency (defined as a language in which they were comfortable designing and implementing whole programs). Here, 19 of the 21 participants used Python, one used C++, and the final participant used JavaScript.

On average, participants took 12.23 minutes (sample standard deviation σ = 3.98 minutes) to complete the coding task, with a maximum session length of 20.80 minutes. This task completion time is measured from the first line of code written for the task until the end of the allocated time. During the coding tasks, Copilot showed participants a total of 1024 suggestions, out of which they accepted 34.0%. The average acceptance rate per participant was 36.5% (averaging over the acceptance rate of each participant), and the median was 33.8% with a standard error of 11.9%; the minimum acceptance rate was 14.3%, and the maximum was 60.7%. In the labeling phase, each participant labeled an average of 149.38 (σ = 57.43) segments with CUPS, resulting in a total of 3137 CUPS labels. Participants used the 'custom state' text field only three times in total: twice a participant wrote 'write a few letters and expect suggestion', which can be considered 'prompt crafting', and once a participant wrote 'I was expecting the function skeleton to show up [..]', which was mapped to 'waiting for suggestion'. The IDK button was used a total of 353 times, which sums to 3137 CUPS + 353 IDKs = 3490 labels. The majority of IDK use came from two participants (244 times) for whom the video recording was not clear enough during consecutive spans; only five other participants used it more than once, with the majority of that use also due to the video not being clear or the segment being too short. The IDK segments represent 6.5% of total session time across all participants, mostly contributed by five participants. Therefore, we remove the IDK segments from the analysis and do not attempt to re-label them.

Together, these CUPS labels enable us to investigate various questions about programmer-CodeRec interaction systematically, such as exploring which activities programmers perform most frequently and how they spend most of their time. We study programmer-CodeRec interaction using the data derived from this study in the following Section 6 and derive various insights and interventions.

6 UNDERSTANDING PROGRAMMER BEHAVIOR WITH CUPS: MAIN RESULTS
The study in the previous section allows us to collect telemetry with CUPS labels for each telemetry segment. We now analyze the collected data and highlight suggestions for 1) metrics to measure programmer-CodeRec interaction, 2) design improvements for the Copilot interface, and finally 3) insights into programmer behavior. Each subsection below presents a specific result or analysis which can be read independently. Code and data are available at https://github.com/microsoft/coderec_programming_states.
Figure 5: Visualization of CUPS labels from our study as timelines, a histogram, and a state machine. (a) Individual CUPS timelines for 5 of the 21 study participants for the first 180 seconds show the richness of, and variance in, programmer-CodeRec interaction. (b) The percentage of total session time spent in each state during a coding session; on average, verifying Copilot suggestions occupies a large portion of session time. (c) CUPS diagram showing the 12 CUPS states (nodes) and the transitions among the states (arcs); transitions occur when a suggestion is shown, accepted, or rejected. We hide self-transitions and low-probability transitions for simplicity.
Table 2: Description of the coding tasks given to user study participants and task assignment. Participants were randomly allocated to tasks with which they had familiarity.

Algorithmic Problem (P4, P17, P18): Implementation of TwoSum, ThreeSum and FourSum
Data Manipulation (P1, P2, P11, P20): Imputing data with average feature value and feature engineering for quadratic terms
Data Analysis (P5, P8): Computing data correlations in a matrix and plotting of most highly correlated features
Machine Learning (P3, P7, P12, P15): Training and evaluation of models using sklearn on a given dataset
Classes and Boilerplate Code (P6, P9): Creating different classes that build on each other
Writing Tests (P16): Writing tests for a black-box function that checks if a string has valid formatting
Editing Code (P10, P14, P21): Adding functionality to an existing class that implements a nearest neighbor retriever
Logistic Regression (P13, P19): Implementing a custom Logistic Regression from scratch with weight regularization

6.1 Aggregated Time Spent in Various CUPS States
In Figure 5a, we visualize the coding sessions of individual participants as CUPS timelines, where each telemetry segment is labeled with its CUPS label. At first glance, CUPS timelines show the richness in patterns of interaction with Copilot, as well as the variance in usage patterns across settings and people. CUPS timelines allow us to inspect individual behaviors and identify patterns, which we later aggregate to form general insights into user behavior. Figure 5b shows the average time spent in each state as a percentage normalized to a user's session duration.

Metric Suggestion: Time spent in CUPS states serves as a high-level diagnosis of the interaction. For example, time spent 'Waiting For Suggestion' (4.2%, σ = 4.46) measures the real impact of latency, and time spent 'Editing Last Suggestion' provides feedback on the quality of suggestions.

We find that, averaged across all users, the 'Verifying Suggestion' state takes up the most time at 22.4% (σ = 12.97); it is the top state for 6 participants and among the top 3 states for 14 out of 21 participants, taking up at least 10% of session time for all but one participant. Notably, this is a new programmer task introduced by Copilot. The second-lengthiest state is 'Writing New Functionality' at 14.05% (σ = 8.36); all but 6 participants spend more than 9% of session time in this state.

More generally, the states that are specific to interaction with Copilot include: 'Verifying Suggestions', 'Deferring Thought for Later', 'Waiting for Suggestion', 'Prompt Crafting', and 'Editing Suggestion'. We found that the total time participants spend in these states is 51.5% (σ = 19.3) of the average session duration. In fact, half of the participants spend more than 47% of their session in these Copilot states, and all participants spend more than 21% of their time in these states.

By Programmer Expertise and Copilot Experience. We investigate whether there are any differences in how programmers interacted with Copilot based on their programming expertise and their previous experience with Copilot. First, we split participants based on whether they have more than 6 years of professional programming experience (10 out of 21) or less than 6 years (11 out of 21). We notice the acceptance rate for those with substantial programming experience is 30.0% ± 14.5, while for those without it is 37.6% ± 14.6. Second, we split participants based on whether they had used Copilot previously (10 out of 21) or had never used it before (11 out of 21). The acceptance rate for those who have previously used Copilot is 37.6% ± 15.3, and for those who have not, it is 29.3% ± 13.7. Due to the limited number of participants, these results are not sufficient to determine the influence of programmer experience or Copilot experience on behavior. We also include in the Appendix a breakdown of programmer behavior by task solved.

6.2 Patterns in Behavior as Transitions Between CUPS States
To understand if there was a pattern in participant behavior, we modeled transitions between two states as a state machine. We refer to the state machine-based model of programmer behavior as a CUPS diagram. In contrast to the timelines in Figure 5a, which visualize state transitions with changes of colors, the CUPS diagram in Figure 5c explicitly visualizes transitions using directed edges, where the thickness of arrows is proportional to the likelihood of transition. For simplicity, Figure 5c only shows transitions with an average probability higher than 0.17 (90th quantile, selected for graph visibility).

The transitions in Figure 5c revealed many expected patterns. For example, one of the most likely transitions (excluding self-transitions from the diagram), 'Prompt Crafting → Verifying Suggestion' (probability 0.54), showed that when programmers were engineering prompts, they were then likely to immediately transition to verifying the resultant suggestions. Likewise, another probable transition was 'Deferring Thought → Verifying Suggestion' (0.54), indicating that if a programmer previously deferred their thought for an accepted suggestion, they would, with high probability, return to verify that suggestion. Stated differently: deference incurs verification debt, and this debt often "catches up" with the programmer. Finally, the single most probable transition, 'Writing New Functionality → Verifying Suggestion' (0.59), echoes the observation from the previous section, indicating that programmers often see suggestions while writing code (rather than prompt crafting), then spend time verifying them. If suggestions are unhelpful, they could easily be seen as interrupting the flow of writing.
The CUPS diagram also revealed some unexpected transitions. Notably, the second most probable transition from the 'Prompt Crafting' state is 'Prompt Crafting → Waiting for Suggestion' (0.25). This potentially reveals an unexpected and unnecessary delay and is a possible target for refinement (e.g., by reducing latency in Copilot). Importantly, each of these transitions occurs with a probability that is much higher than the lower-bound/uniform baseline probability of transitioning to a random state in the CUPS diagram (1/12 = 0.083). In fact, when we compute the entropy rate (a measure of randomness) of the resulting Markov chain [10] from the CUPS diagram, we obtain a rate of 2.24; if the transitions were completely random the rate would be 3.58, and if the transitions were deterministic the rate would be 0.

Interface Design Suggestion: Identifying the current CUPS state can help serve programmer needs. If we are able to know the current programmer CUPS state during a coding session, we can better serve the programmer. For example:
• If the programmer is observed to have been deferring their thought on the last few suggestions, group successive Copilot suggestions and display them together.
• If the programmer is waiting for a suggestion, we can prioritize resources for them at that moment.
• While a user is prompt crafting, Copilot suggestions are often ignored and may be distracting; however, after a user is done with their prompt, they may expect high-quality suggestions. We could suppress suggestions during prompt crafting, but after the prompt crafting process is done, display multiple suggestions to the user and encourage them to browse through them.
Future work can, for example, realize these design suggestions by allowing custom keyboard macros for the programmer to signal their current CUPS state, or via a more automated approach that predicts their CUPS state.

We also investigated longer patterns in state transitions by searching for the most common sequences of states of varying lengths. We achieved this by searching over all possible segment n-grams and counting their occurrences over all sessions. We analyzed patterns in two ways: in Figure 6a, we merged consecutive segments that have the same state label into a single state (thus removing self-transitions), and in Figure 6b we looked at n-grams in the user timelines (including self-transitions) where we include both states and participant actions (shown, accepted, and rejected). The most common pattern (#1) in Figure 6a was a cycle where programmers repeatedly wrote new code functionality and then spent time verifying shown suggestions, indicating a new mode for programmers to solve coding tasks. At the same time, when we look at pattern (#B) in Figure 6b, which takes a closer look at when programmers are writing new functionality, we observe that they don't stop to verify suggestions and reject them as they continue to write. Other long patterns include (#2) (also shown as pattern #D), where programmers repeatedly accepted successive Copilot suggestions after verifying each of them. Finally, we observe in (#3) and (#A) programmers iterating on the prompt for Copilot until they obtain the suggestion they want. We elaborate more on this in the next subsection.

6.3 Programmers Often Defer Thought About Suggestions
An interesting CUPS state is that of 'Deferring Thought About A Suggestion'. This is illustrated in Figure 7, where programmers accept a suggestion or series of suggestions without sufficiently verifying them beforehand. This occurs either because programmers wish to see the suggestion with code highlighting, or because they want to see where Copilot's suggestions lead. Figure 5b shows that programmers do, in fact, frequently defer thought: we counted 61 states labeled as such. What drives the programmer to defer their thought about a suggestion rather than immediately verifying it? We initially conjectured that the act of deferring may be partially explained by the length of the suggestions. So, we compared the number of characters and the number of lines for suggestions grouped by the programmer's state. We find that there is no statistical difference according to a two-sample independent t-test (t = −0.58, p = 0.56) in the average number of characters between deferred-thought suggestions (75.81) and suggestions that were verified beforehand (69.06). The same holds for the average number of lines. (All p-values reported are corrected for multiple hypothesis testing with the Benjamini/Hochberg procedure with α = 0.05.)

However, when we look at the likelihood of editing an accepted suggestion, we find that it is 0.18 if the suggestion was verified before acceptance, but 0.53 if it was deferred. This difference is significant according to a chi-square test (χ² = 29.2, p = 0). In fact, the programmer's CUPS state has a big effect on their future actions. In Table 3, we show the probability of the programmer accepting a suggestion given the CUPS state the programmer was in while the suggestion was being shown. We also show the probability of the programmer accepting a suggestion as a function of the CUPS state the programmer was in just before the suggestion was displayed. We observe that there is a big variation in the suggestion acceptance rate by CUPS state. For example, if the programmer was in the "Deferring Thought For Later" state, the probability of acceptance is 0.98 ± 0.02, compared to when a programmer is thinking about new code to write, where the probability is 0.12 ± 0.04. Note that the average probability of accepting a suggestion was 0.34.

What are programmers doing before they accept a suggestion? We found that the average probability of accepting a suggestion was 0.34. However, we observed that when the programmer was verifying a suggestion, their likelihood of accepting was 0.70. In contrast, if the programmer was thinking about new code to write, the probability dropped to 0.20. This difference was statistically significant according to Pearson's chi-squared test (χ² = 12.25, p = 0). Conversely, when programmers are engineering prompts, the likelihood of accepting a suggestion drops to 0.16. One reason for this might be that programmers want to write the prompt on their own without suggestions, and Copilot interrupts them. We show the full results for the other states in the Appendix.
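The conditional acceptance rates in Table 3 and the chi-squared comparisons above follow a standard contingency-table analysis. Below is a minimal sketch, assuming each shown suggestion is recorded with the CUPS state active while it was on screen and whether it was accepted; the DataFrame columns and toy values are illustrative, not the study's released schema.

    import pandas as pd
    from scipy.stats import chi2_contingency

    # one row per shown suggestion: the CUPS state while it was shown, and the outcome
    df = pd.DataFrame({
        "state": ["Thinking/Verifying Suggestion", "Prompt Crafting",
                  "Thinking/Verifying Suggestion", "Writing New Functionality"],
        "accepted": [True, False, True, False],
    })

    # P(accept | state while shown), the first column of Table 3
    acceptance_by_state = df.groupby("state")["accepted"].mean()

    # chi-squared test comparing acceptance between two states
    subset = df[df["state"].isin(["Thinking/Verifying Suggestion", "Prompt Crafting"])]
    contingency = pd.crosstab(subset["state"], subset["accepted"])
    chi2, p, dof, expected = chi2_contingency(contingency)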
Figure 6: Myriad of CUPS patterns observed in our study. (a) Common patterns of transitions between distinct states; in individual participant timelines the patterns visually appear as a change of color, and here we measure how often they appear across all participants. Examples include #1: Writing new functionality → Verifying suggestion, repeated (n = 104); #2: Waiting for suggestion → Verifying suggestion, repeated (n = 49); #3: Prompt crafting → Verifying suggestion, repeated (n = 52); #4: Deferring thought → Verifying suggestion (n = 33); #5: Thinking about code to write → Writing new functionality (n = 37). (b) Common patterns of states and actions (including self-transitions), extracted from user timelines and counted in total. Examples include #A: repeated prompt crafting with suggestions shown and rejected (n = 88); #B: repeated writing of new functionality with suggestions shown and rejected (n = 56); #C: verifying a shown suggestion, accepting it, and continuing to verify (n = 43); #D: verifying and accepting successive suggestions with waits for the next suggestion in between (n = 16).
CHI ’24, May 11–16, 2024, Honolulu, HI, USA Mozannar et al.

class LogisticRegression: class LogisticRegression: class LogisticRegression:


def __init__(self,X,y,alpha=0.01): def __init__(self,X,y,alpha=0.01): def __init__(self,X,y,alpha=0.01):
self.X = X self.X = X self.X = X
self.y = y self.y = y self.y = y
self.alpha = alpha self.alpha = alpha self.alpha = alpha
self.theta = np.zeros(X.shape[1]) 4 single line self.theta = np.zeros(X.shape[1]) 3 single line self.theta = np.zeros(X.shape[1])
self.cost = [] Accepts self.cost = [] Accepts self.cost = []
self.theta_history = [ later self.theta_history = [ later self.theta_history = [
self.theta self.theta
Open left brace [ indicates ] ]
self.cost_history = [ self.cost_history = [
that suggestion is not a self.cost() self.cost()
complete code segment ]
Suggestion references a def cost(self):
return (-1 / len(self.y)) *
method cost() not yet np.sum(self.y * np.log(self.hypothesis())) + …
implemented
The function cost references a
method hypothesis() not yet
implemented
Figure 7: Illustration of a coding scenario with Copilot where the programmer may choose to defer verifying a suggestion
(‘Deferring Thought’). Here, Copilot suggests an implementation for the class Logistic Regression line-by-line (illustrated
from left to right). And the programmer may need to defer verifying intermediate suggestion of self.cost (middle screenshot)
because the method that implemented it is suggested later (right screenshot).

own without suggestions, and Copilot interrupts them. We show the full results in the Appendix for the other states.

Table 3: The percentage of suggestions accepted given that the programmer was in a CUPS state while the suggestion was being shown (% accepted while shown), and given that the programmer was in that CUPS state just before the suggestion was shown (% accepted before shown). We also report the standard error of each acceptance rate.

State | % accepted while shown | % accepted before shown
Thinking/Verifying Suggestion | 0.80 ± 0.02 | 0.56 ± 0.04
Prompt Crafting | 0.11 ± 0.02 | 0.22 ± 0.03
Looking up Documentation | 0.00 ± 0.00 | 0.29 ± 0.17
Writing New Functionality | 0.07 ± 0.02 | 0.31 ± 0.03
Thinking About New Code To Write | 0.12 ± 0.04 | 0.27 ± 0.04
Editing Last Suggestion | 0.03 ± 0.03 | 0.23 ± 0.05
Waiting For Suggestion | 0.10 ± 0.05 | 0.58 ± 0.06
Editing Written Code | 0.07 ± 0.04 | 0.17 ± 0.07
Writing Documentation | 0.40 ± 0.22 | 0.33 ± 0.19
Debugging/Testing Code | 0.23 ± 0.07 | 0.26 ± 0.06
Deferring Thought For Later | 0.98 ± 0.02 | 1.0 ± 0.0

6.4 CUPS Attributes Significantly More Time Verifying Suggestions than Simpler Metrics

We observed that programmers continued verifying suggestions after they accepted them. This happens by definition for 'Deferring Thought' states before accepting suggestions, but we find it also happens when programmers verify the suggestion before accepting it, and it leads to a significant increase in the total time spent verifying suggestions. First, when participants defer their thought about a suggestion they accepted, 53.2% of the time they verify the suggestion immediately afterward. When we adjust for this post-hoc verification time, we compute a mean verification time of 15.21s (SD = 20.68) and a median of 6.48s. This is nearly a five-fold increase in mean time and a three-fold increase in median time over the pre-adjustment mean of 3.25s (SD = 3.33) and median of 1.99s. These results are illustrated in Figure 8 and constitute a statistically significant increase according to a two-sample paired t-test (t = −4.88, p = 1.33 · 10⁻⁵). This phenomenon also occurs when programmers are in a 'Thinking/Verifying Suggestion' state before accepting a suggestion: 19% of the time they verify the suggestion post hoc, which increases the total verification time from 3.96s (SD = 8.63) to 7.03s (SD = 14.43) on average, a statistically significant difference (t = −4.17, p = 5 · 10⁻⁵). On the other hand, programmers often have to wait for suggestions to show up, due either to latency or to Copilot not kicking in to provide a suggestion. If, in addition to the time between when a suggestion is shown and when the programmer accepts or rejects it, we add the time spent waiting for the suggestion (indicated by the 'Waiting For Suggestion' state), the mean increases from 6.11s (SD = 15.52) to 6.51s (SD = 15.61), which is minor on average but adds 2.5 seconds of delay when programmers do have to explicitly wait for suggestions.
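A minimal sketch of the adjustment described above, computed over retrospectively labeled session segments; the segment fields (state, duration, suggestion_id, action) are illustrative assumptions for this sketch, not the exact telemetry schema used in the study:

def adjusted_verification_times(segments):
    """segments: ordered list of dicts with keys 'state' (CUPS label),
    'duration' (seconds), 'suggestion_id' (id of the suggestion on screen,
    or None), and 'action' ('accepted', 'rejected', or None).
    Adds to each suggestion's on-screen time any 'Thinking/Verifying
    Suggestion' segment that immediately follows its acceptance.
    (Post-acceptance verification segments are assumed to carry
    suggestion_id None, since the suggestion is already in the code.)"""
    times = {}
    for i, seg in enumerate(segments):
        sid = seg["suggestion_id"]
        if sid is None:
            continue
        times[sid] = times.get(sid, 0.0) + seg["duration"]
        nxt = segments[i + 1] if i + 1 < len(segments) else None
        if (seg["action"] == "accepted" and nxt is not None
                and nxt["state"] == "Thinking/Verifying Suggestion"):
            times[sid] += nxt["duration"]  # post-acceptance verification
    return times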

Metric Suggestion: Adjust verification time metrics and acceptance rates to include suggestions that are verified after acceptance.

The previous analysis showed that the time to accept a suggestion cannot simply be measured as the time from the instant a suggestion is shown until it is accepted; this misses the time programmers spend verifying a suggestion after acceptance. Similarly, since deferring thought is a frequent behavior, it leads to an inflation of acceptance rates. We recommend using measures such as the fraction of accepted suggestions that survive in the codebase after a certain time period (e.g., 10 minutes).
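As an illustration, a minimal sketch of how such a survival-adjusted acceptance rate could be computed; the event fields below (accepted, accept_time, removed_time) are our own assumptions for illustration, not the telemetry schema used in the study:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SuggestionEvent:
    accepted: bool                 # was the suggestion accepted in the IDE?
    accept_time: Optional[float]   # seconds since session start, None if rejected
    removed_time: Optional[float]  # when the accepted code was deleted, None if kept

def survival_adjusted_acceptance(events: List[SuggestionEvent], window: float = 600.0) -> float:
    """Fraction of shown suggestions that are accepted and still present
    in the codebase `window` seconds (10 minutes by default) after acceptance."""
    def survives(e: SuggestionEvent) -> bool:
        return e.accepted and (e.removed_time is None or
                               e.removed_time - e.accept_time >= window)
    return sum(survives(e) for e in events) / len(events) if events else 0.0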
6.5 Insights About Prompt Crafting

We take a closer look at how participants craft prompts to obtain Copilot suggestions. Our first insight is that programmers consistently ignore suggestions while prompt crafting. Among the 234 suggestions that were shown while participants were actively prompt crafting (i.e., the programmer was in the prompt crafting state while the suggestion was being displayed), only 10.7% were accepted. We hypothesize that this behavior is due to programmers wanting to phrase the prompt in their own words rather than relying on Copilot to help them prompt craft. It also indicates that Copilot is unnecessarily interrupting participants' prompt crafting attempts.

However, programmers often iterate on their prompts until they obtain the suggestion they desire and rarely abandon prompt crafting without accepting a suggestion. We define a prompt crafting attempt as a segment of the coding session that starts when the programmer first enters the CUPS "Prompt Crafting" state and lasts until the programmer enters a non-Copilot-centric state (the non-Copilot-centric states are 'Writing New Functionality,' 'Editing Written Code,' 'Thinking About New Code To Write,' 'Debugging/Testing Code,' 'Looking up Documentation,' and 'Writing Documentation'). We count 59 such prompt crafting attempts, and in 81.3% of them a suggestion is accepted.
Prompt crafting is often an iterative process: the programmer writes an initial prompt, observes the resulting suggestion, then iterates on the prompt by adding information about the desired code or by rewording it. For example, P5 wanted to retrieve the index of the maximum element in a correlation matrix, wrote this initial prompt, and got the suggestion:

# print the indices of the max value excluding 1 in corr
maxval = np.amax(corr, axis=1)  # Copilot suggestion

This code snippet returns the maximum value rather than its index, so the participant did not accept it. They then re-wrote the prompt to be:

# print the two features most correlated
maxcor = np.where(corr == np.amax(corr))  # Copilot suggestion

and accepted the above suggestion.
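For reference, a minimal sketch (our own, not a Copilot suggestion from the study) of code that returns the index pair P5 was originally after, assuming corr is a square floating-point NumPy correlation matrix whose diagonal of ones should be excluded:

import numpy as np

def most_correlated_pair(corr):
    # Mask the diagonal (self-correlations of 1) so it cannot win the argmax.
    masked = corr.copy()
    np.fill_diagonal(masked, -np.inf)
    # Convert the flat argmax position back into a (row, column) index pair.
    return np.unravel_index(np.argmax(masked), masked.shape)

# Example usage on a random 4-variable correlation matrix.
i, j = most_correlated_pair(np.corrcoef(np.random.rand(4, 100)))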
Finally, we observe that there are three main ways participants craft prompts: 1) by writing a single-line comment with natural-language instructions, although the comment may resemble pseudo-code [17], for example:

# impute missing values in X_train as average of column
# where missing value is -1

2) by writing a docstring for the function:

def distance(self, query):
    '''
    query: single numpy array
    return: l2 distances from query to the vectors
    '''

and 3) by writing function signatures (or variable names), e.g., writing "def add_time" and then pausing to wait for a suggestion. Often, programmers combine the three prompt crafting strategies to get better code suggestions.

6.6 Post-Study Survey Answers

After completing the study, participants were asked to complete a survey based on the productivity survey in [40], which focuses on the SPACE framework of programmer productivity [13]. We also included a free-form text box at the end of the survey where participants could add any additional thoughts about their experience using Copilot for the assigned task. The full results of the survey can be found in the Appendix.

We found that 6/21 participants agreed or strongly agreed with the statement that they were concerned about the quality of their code when using Copilot. Participant #9 noted, "I worry that bugs can sneak in and go unnoticed, especially in weakly-dynamically typed languages", and Participant #19 noted that "My main concern with Copilot is whether it is teaching me to do things the wrong (or old) way (e.g. showing me a Python 3.6 way instead of a Python 3.10 way and so on)". On the other hand, 14/21 participants agreed that using Copilot helped them stay in flow and spend less time searching for information. Participant #3 noted that "Collaborating with Copilot felt like I was googling what I wanted to do except instead of going through several stack overflow links that Google would show me, the code just appeared inline saving me time and keeping my flow of coding", and Participant #6 noted, "Going into the exercise I genuinely thought there would be a point when I pull up stack overflow. Because that's the kind of tiny stuff you sometimes need to search for. With copilot, it really reduced my worry of doing so." Finally, 17/21 participants agreed with the statement that by using Copilot they completed the task faster, and 16/21 participants agreed that they were more productive using Copilot. These survey responses highlight the costs and benefits of writing code with Copilot and reinforce existing results in [40].

[Figure 8 timeline: 'Waiting for suggestion' precedes the suggestion being shown, 'Deferring Thought' precedes acceptance, and 'Verifying suggestion' follows acceptance; the panel reports mean state durations (2.5s ± 3.1, 3.3s ± 3.3, 22.5s ± 21.2, 15.21s ± 20.68) and how often each adjustment applies ('occurs only 16%', 'occurs only 53%'), comparing pre-adjustment against adjusted estimates.]

Figure 8: Illustration of one of the adjustments required for measuring the total time a programmer spends verifying a suggestion. Here, when a programmer defers thought for a suggestion, they spend time verifying it after accepting it and may also have to wait beforehand for the suggestion to be shown.

7 LIMITATIONS, FUTURE WORK AND CONCLUSION

7.1 Limitations
The observations from our study are limited by several decisions that we made. First, our participants solved time-limited coding tasks that we provided instead of real tasks they might perform in the real world. Furthermore, the selection of tasks was limited and did not cover all tasks programmers might perform. We mostly conducted experiments with Python, with only two participants using C++ and JavaScript, whereas Copilot can complete suggestions for myriad other languages. We also made an assumption about the granularity of telemetry, namely that each segment contains at most one state, whereas in a more general setting programmers may perform multiple activities within a single segment. We also did not capture longer-term costs of interacting, e.g., from accepting code with security vulnerabilities, or other longer-horizon costs. To this end, security vulnerabilities and possible overreliance issues [2, 25, 26] are important areas of research that we do not address in this paper.

7.2 Future Work
We only investigated a limited number of programmer behaviors using the CUPS timelines and diagrams. There are many other aspects future work could investigate.

Predicting CUPS states. To enable the insights derived in Section 6, we need to be able to identify the programmer's current CUPS state. An avenue towards that is building predictive models using the labeled telemetry data collected from our user study. Ideally, we can leverage this labeled data to further label telemetry data from other coding sessions or other participants so that we can perform such analyses more broadly. Specifically, the input to such a model would be the current session context, for example, whether the programmer accepted the last suggestion, the current suggestion being surfaced, and the current prompt. We can leverage supervised learning methods to build such a model from the collected data. Such models would need to run in real time during programming and predict the current user CUPS state at each instant. This would enable the design suggestions proposed earlier and serve to compute the various metrics we proposed. For example, if the model predicts that the programmer is deferring thought about a suggestion, we can group suggestions together to display them to the programmer. In the Appendix, we built small predictive models of programmers' CUPS states using labeled study data. However, the current amount of labeled data is not sufficient to build highly accurate models. There are multiple avenues to improve the performance of these models: 1) simply collecting a larger amount of labeled data, which would be expensive; 2) using methods from semi-supervised learning that leverage unlabeled telemetry to increase sample efficiency [35]; and 3) collecting data beyond what is captured from telemetry, such as video footage of the programmer's screen (e.g., cursor movement), to better predict with the same amount of data.
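As an illustration of what such a model could look like, a small sketch using scikit-learn; the feature set and field names are our own assumptions about plausible session-context signals, not the models or telemetry schema reported in the Appendix:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def featurize(segment):
    # Session context available at prediction time (all fields hypothetical).
    return [
        int(segment["last_suggestion_accepted"]),
        segment["current_suggestion_length"],    # characters in the surfaced suggestion
        int(segment["cursor_in_comment"]),       # a cue for prompt crafting
        segment["seconds_since_last_keystroke"],
    ]

def train_cups_classifier(labeled_segments):
    X = [featurize(s) for s in labeled_segments]
    y = [s["cups_state"] for s in labeled_segments]   # retrospective CUPS labels
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    print("5-fold CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
    return clf.fit(X, y)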
Assessing Individual Differences. There is an opportunity to apply the CUPS diagram to compare different user groups and to compare how individuals differ from an average user. Does the nature of inefficiencies differ between user groups? Can we personalize interventions? Finally, we could also compare how the CUPS diagram evolves over time for the same set of users.

Effect of Conditions and Tasks on Behavior. We only studied the behavior of programmers with the current version of Copilot. Future work could study how behavior differs with different versions of Copilot, especially when versions use different models. In the extreme, we could study behavior when Copilot is turned off. The latter could help assess the counterfactual cost of completing the task without AI assistance and help establish whether and where Copilot suggestions add net value for programmers. For example, maybe the system did not add enough value because the programmer kept getting into prompt crafting rabbit holes instead of moving on and completing the functions manually or with the assistance of web search.

Likewise, if developers create a faster version of Copilot with less latency, the CUPS diagram could be used to establish whether it leads to reductions in time spent in the "Waiting For Suggestion" state.

Informing New Metrics. Since programmers' value may be multi-dimensional, how can we go beyond code correctness and measure added value for users? If Copilot improves productivity, which aspects were improved? Conversely, if it did not, where are the inefficiencies? One option is to conduct a new study where we compare the CUPS diagram under Copilot assistance against a counterfactual condition where programmers do not have access to Copilot, and use the two diagrams to determine where the system adds value or could have added value. For example, the analysis might reveal that some code snippets are too hard for programmers to complete by themselves but much faster with Copilot, because the cost of double-checking and editing the suggestion is much less than the cost of spending effort on it by themselves. Conversely, the analysis might reveal that a new intervention for helping engineer prompts greatly reduced people's time in "Prompt Crafting."

Another option is to design offline metrics based on these insights that developers can use during the model selection and training phase. For example, given that programmers spent a large fraction of their time verifying suggestions, offline metrics that can estimate this (e.g., based on code length and complexity) may be useful indicators of which models developers should select for deployment. Future work will aim to test the effectiveness of these design suggestions as well.
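For instance, a crude sketch of an offline proxy for verification effort that could be scored on a model's candidate completions before deployment; the weights below are placeholders for illustration, not values fitted to our data:

def estimated_verification_cost(suggestion: str) -> float:
    """Heuristic proxy (in seconds) for how long a programmer might spend
    verifying a suggestion, based only on its length and rough branching."""
    lines = [l for l in suggestion.splitlines() if l.strip()]
    branches = sum(l.strip().startswith(("if ", "for ", "while ", "try")) for l in lines)
    calls = suggestion.count("(")
    # Placeholder weights: seconds per line, per branch, and per call.
    return 0.8 * len(lines) + 2.0 * branches + 0.3 * calls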

Beyond Programming. We also hope our methodology is applied to study other forms of AI assistants that are rapidly being deployed. For example, one can make an analogous CUPS taxonomy for writing assistants for creative writers or lawyers.

7.3 Conclusion
We developed and proposed a taxonomy of common programmer activities (CUPS) and combined it with real-time telemetry data to profile the interaction. At present, CUPS contains 12 mutually exclusive activities that programmers perform between consecutive Copilot actions (such as accepting, rejecting, and viewing suggestions). We gathered real-world instances of CUPS by conducting a user study with 21 programmers within our organization, where they solved coding tasks with Copilot and retrospectively labeled CUPS for their coding sessions. We collected over 3137 instances of CUPS and analyzed them to generate CUPS timelines that show individual behavior and CUPS diagrams that show aggregate insights into the behavior of our participants. We also studied the time spent in these states, patterns in user behavior, and better estimates of the cost (in terms of time) of interacting with Copilot. Our studies with CUPS labels revealed that when solving a coding task with Copilot, programmers may spend a large fraction of total session time (34.3%) on just double-checking and editing Copilot suggestions, and more than half of the task time on Copilot-related activities, together indicating that introducing Copilot into an IDE can significantly change user behavior. We proposed new metrics to measure the interaction by computing the time spent in each CUPS state, and modifications to existing time and acceptance metrics that account for suggestions that get verified only after they are accepted. We also proposed a new interface design suggestion: if we allow programmers to signal their current state, then we can better serve their needs, for example, by reducing latency when they are waiting for a suggestion.

ACKNOWLEDGMENTS
HM partly conducted this work during an internship at Microsoft Research (MSR). We acknowledge valuable feedback from colleagues across MSR and GitHub including Saleema Amershi, Victor Dibia, Forough Poursabzi, Andrew Rice, Eirini Kalliamvakou, and Edward Aftandilian.

REFERENCES
[1] Amazon. 2022. ML-powered coding companion – Amazon CodeWhisperer. https://aws.amazon.com/codewhisperer/
[2] Owura Asare, Meiyappan Nagappan, and N Asokan. 2023. Is GitHub's Copilot as bad as humans at introducing vulnerabilities in code? Empirical Software Engineering 28, 6 (2023), 1–24.
[3] Shraddha Barke, Michael B James, and Nadia Polikarpova. 2023. Grounded Copilot: How programmers interact with code-generating models. Proceedings of the ACM on Programming Languages 7, OOPSLA1 (2023), 85–111.
[4] Ruven Brooks. 1977. Towards a theory of the cognitive processes in computer programming. International Journal of Man-Machine Studies 9, 6 (1977), 737–751.
[5] Ruven E Brooks. 1980. Studying programmer behavior experimentally: The problems of proper methodology. Commun. ACM 23, 4 (1980), 207–213.
[6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[7] Stuart K Card, Thomas P Moran, and Allen Newell. 1980. The keystroke-level model for user performance time with interactive systems. Commun. ACM 23, 7 (1980), 396–410.
[8] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
[9] Arghavan Moradi Dakhel, Vahid Majdinasab, Amin Nikanjam, Foutse Khomh, Michel C Desmarais, and Zhen Ming Jack Jiang. 2023. GitHub Copilot AI pair programmer: Asset or liability? Journal of Systems and Software 203 (2023), 111734.
[10] Laura Ekroot and Thomas M Cover. 1993. The entropy of Markov trajectories. IEEE Transactions on Information Theory 39, 4 (1993), 1418–1421.
[11] Mikhail Evtikhiev, Egor Bogomolov, Yaroslav Sokolov, and Timofey Bryksin. 2023. Out of the BLEU: how should we assess quality of the code generation models? Journal of Systems and Software 203 (2023), 111741.
[12] Nicole Forsgren, Margaret-Anne Storey, Chandra Maddila, Thomas Zimmermann, Brian Houck, and Jenna Butler. 2021. The SPACE of developer productivity. Commun. ACM 64, 6 (2021), 46–53.
[13] Nicole Forsgren, Margaret-Anne Storey, Chandra Maddila, Thomas Zimmermann, Brian Houck, and Jenna Butler. 2021. The SPACE of Developer Productivity: There's more to it than you think. Queue 19, 1 (2021), 20–48.
[14] GitHub. 2022. GitHub Copilot - your AI pair programmer. https://github.com/features/copilot
[15] Vincent J Hellendoorn, Sebastian Proksch, Harald C Gall, and Alberto Bacchelli. 2019. When code completion fails: A case study on real-world completions. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 960–970.
[16] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. 2021. Measuring Coding Challenge Competence With APPS. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
[17] Ellen Jiang, Edwin Toh, Alejandra Molina, Kristen Olson, Claire Kayacik, Aaron Donsbach, Carrie J Cai, and Michael Terry. 2022. Discovering the Syntax and Strategies of Natural Language Programming with Generative Language Models. In CHI Conference on Human Factors in Computing Systems. 1–19.
[18] Bonnie E John and David E Kieras. 1996. The GOMS family of user interface analysis techniques: Comparison and contrast. ACM Transactions on Computer-Human Interaction (TOCHI) 3, 4 (1996), 320–351.
[19] An Ju and Armando Fox. 2018. TEAMSCOPE: measuring software engineering processes with teamwork telemetry. In Proceedings of the 23rd Annual ACM Conference on Innovation and Technology in Computer Science Education. 123–128.
[20] Eirini Kalliamvakou. 2022. Research: Quantifying GitHub Copilot's impact on developer productivity and happiness. https://github.blog/2022-09-07-research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/
[21] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with AlphaCode. arXiv preprint arXiv:2203.07814 (2022).
[22] Jenny T Liang, Chenyang Yang, and Brad A Myers. 2023. Understanding the Usability of AI Programming Assistants. arXiv preprint arXiv:2303.17125 (2023).
[23] Henry Lieberman and Christopher Fry. 1995. Bridging the gulf between code and behavior in programming. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 480–486.
[24] Unaizah Obaidellah, Mohammed Al Haek, and Peter C-H Cheng. 2018. A survey on the usage of eye-tracking in computer programming. ACM Computing Surveys (CSUR) 51, 1 (2018), 1–58.
[25] Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2022. Asleep at the keyboard? Assessing the security of GitHub Copilot's code contributions. In 2022 IEEE Symposium on Security and Privacy (SP). IEEE, 754–768.

[26] Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and Brendan Dolan-Gavitt. 2021. Can OpenAI Codex and Other Large Language Models Help Us Fix Security Bugs? arXiv preprint arXiv:2112.02125 (2021).
[27] Norman Peitek, Janet Siegmund, and Sven Apel. 2020. What drives the reading order of programmers? An eye tracking study. In Proceedings of the 28th International Conference on Program Comprehension. 342–353.
[28] Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The impact of AI on developer productivity: Evidence from GitHub Copilot. arXiv preprint arXiv:2302.06590 (2023).
[29] James Prather, Brent N Reeves, Paul Denny, Brett A Becker, Juho Leinonen, Andrew Luxton-Reilly, Garrett Powell, James Finnie-Ansley, and Eddie Antonio Santos. 2023. "It's Weird That it Knows What I Want": Usability and Interactions with Copilot for Novice Programmers. arXiv preprint arXiv:2304.02491 (2023).
[30] Advait Sarkar, Andrew D Gordon, Carina Negreanu, Christian Poelitz, Sruti Srinivasa Ragavan, and Ben Zorn. 2022. What is it like to program with artificial intelligence? arXiv preprint arXiv:2208.06213 (2022).
[31] Beau A Sheil. 1981. The psychological study of programming. ACM Computing Surveys (CSUR) 13, 1 (1981), 101–120.
[32] Zhensu Sun, Xiaoning Du, Fu Song, Shangwen Wang, Mingze Ni, and Li Li. 2023. Don't Complete It! Preventing Unhelpful Code Completion for Productive and Sustainable Neural Code Completion Systems. In 2023 IEEE/ACM 45th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). IEEE, 324–325.
[33] Maxim Tabachnyk and Stoyan Nikolov. 2022. ML-enhanced code completion improves developer productivity. https://ai.googleblog.com/2022/07/ml-enhanced-code-completion-improves
[34] Priyan Vaithilingam, Tianyi Zhang, and Elena L Glassman. 2022. Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts. 1–7.
[35] Jesper E Van Engelen and Holger H Hoos. 2020. A survey on semi-supervised learning. Machine Learning 109, 2 (2020), 373–440.
[36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[37] Zdenek Velart and Petr Šaloun. 2006. User behavior patterns in the course of programming in C++. In Proceedings of the Joint International Workshop on Adaptivity, Personalization & the Semantic Web. 41–44.
[38] Justin D Weisz, Michael Muller, Stephanie Houde, John Richards, Steven I Ross, Fernando Martinez, Mayank Agarwal, and Kartik Talamadupula. 2021. Perfection not required? Human-AI partnerships in code translation. In 26th International Conference on Intelligent User Interfaces. 402–412.
[39] Tongshuang Wu, Kenneth Koedinger, et al. 2023. Is AI the better programming partner? Human-Human Pair Programming vs. Human-AI pAIr Programming. arXiv preprint arXiv:2306.05153 (2023).
[40] Albert Ziegler, Eirini Kalliamvakou, X Alice Li, Andrew Rice, Devon Rifkin, Shawn Simister, Ganesh Sittampalam, and Edward Aftandilian. 2022. Productivity assessment of neural code completion. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming. ACM, 21–29.
