Measuring GitHub Copilot's Impact on Productivity

DOI: 10.1145/3633453

Key Insights
˽ AI pair-programming tools such as GitHub Copilot have a big impact on developer productivity. This holds for developers of all skill levels, with junior developers seeing the largest gains.

˽ The reported benefits of receiving AI suggestions while coding span the full range of typically investigated aspects of productivity, such as task time, product quality, cognitive load, enjoyment, and learning.

˽ Perceived productivity gains are reflected in objective measurements of developer activity.

˽ While suggestion correctness is important, the driving factor for these improvements appears to be not correctness as such, but whether the suggestions are useful as a starting point for further development.

Code-completion systems offering suggestions to a developer in their integrated development environment (IDE) have become the most frequently used kind of programmer assistance.1 When generating whole snippets of code, they typically use a large language model (LLM) to predict what the user might type next (the completion) from the context of what they are working on at the moment (the prompt).2 This system allows for completions at any position in the code.
An additional step taken by some researchers3,21,29 is to use online evaluation and track the frequency of real users accepting suggestions, assuming that the more contributions a system makes to the developer's code, the more helpful it is.

…explicitly stated intention to increase a user's productivity. Developer productivity has many aspects, and a recent study has shown that tools like these are helpful in ways that are only partially reflected by measures such as completion times.a

…perceived productivity as reported by developers. We analyze 2,631 survey responses from developers using GitHub Copilot and match their responses to measurements collected from the IDE. We consider acceptance counts and more detailed measures of contribution, such as the amount of code that persists from each contribution. However, other approaches remain necessary for fine-grained investigation due to the many human factors involved.

a Nevertheless, such completion times are greatly reduced in many settings, often by more than half.16

Figure 1. GitHub Copilot's code completion funnel. (Funnel of average events per survey user active hour: completion shown; completion accepted; mostly unchanged/unchanged after 30 seconds, 2 minutes, 5 minutes, and 10 minutes.)
Background
Offline evaluation of code completion
can have shortcomings even in tractable circumstances where completions can be labeled for correctness. For example, a study of 15,000 completions by 66 developers in Visual Studio found significant differences between synthetic benchmarks used for model evaluation and real-world usage.7 The evaluation of context-aware API completion for Visual Studio IntelliCode considered Recall@5—the proportion of completions for which the correct method call was in the top five suggestions. This metric fell from 90% in offline evaluation to 70% when used online.21

Figure 2. Demographic composition of survey respondents. (Survey questions: "Think of the language you have used the most with OurTool. How proficient are you in that language?"—Beginner, Intermediate, Advanced; "Which best describes your programming experience?"—Student/Learning, 0–2, 3–5, 6–10, 11–15, or 16+ years professional experience; "Which of the following best describes what you do?"—Student, Professional, Hobbyist, Consultant/Freelancer, Researcher, Other; "What programming languages do you usually use? Choose up to three from the list."—Python, JavaScript, TypeScript, Java, Ruby, Go, C#, Rust, HTML, Other.)

Due to the diversity of potential solutions to a multi-line completion task, researchers have used software testing to evaluate the behavior of completions. Competitive programming sites have been used as a source of such data,8,11 as well as handwritten programming problems.5 Yet, it is unclear how well performance on programming competition data generalizes to interactive development in an IDE.

In this work, we define acceptance rate as the fraction of shown completions accepted by the user.
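In code, this definition amounts to a simple ratio over the plug-in's event log. The following is a minimal sketch with illustrative event names, not GitHub's actual telemetry schema (the real events are described in Table 1):

```python
from collections import Counter

def acceptance_rate(events: list[str]) -> float:
    """Fraction of shown completions that were accepted (see Table 1)."""
    counts = Counter(events)
    return counts["accepted"] / counts["shown"] if counts["shown"] else 0.0

# Hypothetical log for one user session.
log = ["opportunity", "shown", "accepted", "opportunity", "shown",
       "opportunity", "shown", "accepted"]
print(f"{acceptance_rate(log):.1%}")  # prints 66.7% (2 accepted of 3 shown)
```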
Table 1. Developer usage events collected by GitHub Copilot.

Opportunity: A heuristic-based determination by the IDE and the plug-in that a completion might be appropriate at this point in the code (for example, the cursor is not in the middle of a word).
Shown: Completion shown to the developer.
Accepted: Completion accepted by the developer for inclusion in the source file.
Accepted char: The number of characters in an accepted completion.
Mostly unchanged X: Completion persisting in source code with limited modifications (Levenshtein distance less than 33%) after X seconds, where we consider durations of 30, 120, 300, and 600 seconds.
Unchanged X: Completion persisting in source code unmodified after X seconds.
(Active) hour: An hour during which the developer was using their IDE with the plug-in active.
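For the "mostly unchanged" events, a persistence check could look like the sketch below. We assume the 33% threshold is the Levenshtein distance relative to the length of the accepted completion; that normalization is our reading of Table 1, not something the article spells out:

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def mostly_unchanged(accepted: str, current: str) -> bool:
    """True if the completion persists with <33% relative edit distance."""
    return levenshtein(accepted, current) / max(len(accepted), 1) < 0.33

print(mostly_unchanged("for i in range(10):", "for i in range(n):"))  # True
```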
Table 2. The core set of measurements considered in this article.

Shown rate: Ratio of completion opportunities that resulted in a completion being shown to the user.
Acceptance rate: Ratio of shown completions accepted by the user.
Persistence rate: Ratio of accepted completions unchanged after 30, 120, 300, and 600 seconds.
Fuzzy persistence rate: Ratio of accepted completions mostly unchanged after 30, 120, 300, and 600 seconds.
Efficiency: Ratio of completion opportunities that resulted in a completion accepted and unchanged after 30, 120, 300, and 600 seconds.
Contribution speed: Number of characters in accepted completions per distinct, active hour.
Acceptance frequency: Number of accepted completions per distinct, active hour.
Persistence frequency: Number of unchanged completions per distinct, active hour.
Total volume: Total number of completions shown to the user.
Loquaciousness: Number of shown completions per distinct, active hour.
Eagerness: Number of shown completions per opportunity.

These are metrics we feel have a natural interpretation in this context. We note there are alternatives, and we incorporate these in our discussion where relevant.
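As a rough illustration, the Table 2 measures reduce to simple ratios over the event counts of Table 1. The record and field names below are ours, not GitHub Copilot's telemetry schema:

```python
from dataclasses import dataclass

@dataclass
class UsageCounts:
    """Aggregated event counts for one user (names are illustrative)."""
    opportunities: int
    opportunities_shown: int  # opportunities that led to a shown completion
    shown: int                # may exceed opportunities_shown
    accepted: int
    accepted_chars: int
    unchanged_after_30s: int
    active_hours: float

def core_measures(c: UsageCounts) -> dict[str, float]:
    return {
        "shown_rate": c.opportunities_shown / c.opportunities,
        "acceptance_rate": c.accepted / c.shown,
        "persistence_rate_30s": c.unchanged_after_30s / c.accepted,
        "efficiency_30s": c.unchanged_after_30s / c.opportunities,
        "contribution_speed": c.accepted_chars / c.active_hours,
        "acceptance_frequency": c.accepted / c.active_hours,
        "loquaciousness": c.shown / c.active_hours,
        "eagerness": c.shown / c.opportunities,
    }

print(core_measures(UsageCounts(1000, 800, 900, 200, 12_000, 140, 10.0)))
```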
Productivity survey. To understand users' experience with GitHub Copilot, we emailed a link to an online survey to 17,420 users. These were participants of the unpaid technical preview using GitHub Copilot with their everyday programming tasks. The only selection criterion was having previously opted in to receive communications. A vast majority of survey users (more than 80%) filled out the survey within the first two days, on or before February 12, 2022. We therefore focus on data from the four-week period leading up to this point ("the study period"). We received a total of 2,047 responses we could match to usage data from the study period, the earliest on Feb. 10, 2022 and the latest on Mar. 6, 2022.

The survey contained multiple-choice questions regarding demographic information (see Figure 2) and Likert-style questions about different aspects of productivity, which were randomized in their order of appearance to the user. Figure 2 shows the demographic composition of our respondents. We note the significant proportion of professional programmers who responded.

The SPACE framework6 defines five dimensions of productivity: Satisfaction and well-being, Performance, Activity, Communication and collaboration, and Efficiency and flow. We use four of these (S, P, C, E), since self-reporting on (A) is generally considered inferior to direct measurement. We included 11 statements covering these four dimensions in addition to a single statement: "I am more productive when using GitHub Copilot."

For each self-reported productivity measure, we encoded its five ordinal response values to numeric labels (1 = Strongly Disagree, ..., 5 = Strongly Agree). We include the full list of questions and their coding to the SPACE framework in Appendix C. For more information on the SPACE framework and how the empirical software engineering community has been discussing developer productivity, please see the following section.

Early in our analysis, we found that the usage metrics we describe in the Usage Measurements section corresponded similarly to each of the measured dimensions of productivity, and in turn these dimensions were highly correlated to each other (Figure 3). We therefore added an aggregate productivity score calculated as the mean of all 12 individual measures (excluding skipped questions). This serves as a rough proxy for the much more complex concept of productivity, facilitating recognition of overall trends, which may be less discernible on individual variables due to higher statistical variation. The full dataset of these aggregate productivity scores together with the usage measurements considered in this article is available at https://bit.ly/47HVjAM.
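A minimal sketch of this encoding and aggregation step follows. The column names and the intermediate Likert labels are our assumptions (the article names only the two endpoints; Appendix C lists the actual questions):

```python
import pandas as pd

# 1 = Strongly Disagree ... 5 = Strongly Agree; the middle labels are
# assumed here, since the article only names the two endpoints.
likert = {"Strongly Disagree": 1, "Disagree": 2, "Neutral": 3,
          "Agree": 4, "Strongly Agree": 5}

# Hypothetical column names matching the measures in Figure 3.
responses = pd.DataFrame({
    "tasks_faster": ["Agree", "Strongly Agree", None],
    "stay_in_flow": ["Neutral", "Agree", "Disagree"],
    # ...12 measures in total
})

encoded = responses.apply(lambda col: col.map(likert))

# Mean of all answered measures per respondent; skipped questions (NaN)
# are excluded from the mean, as in the article.
aggregate_productivity = encoded.mean(axis=1, skipna=True)
print(aggregate_productivity.tolist())  # [3.5, 4.5, 2.0]
```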
Given it has been impossible to produce a unified definition or metric(s) for developer productivity, there have been attempts to synthesize the factors that impact productivity to describe it holistically, include various relevant factors, and treat developer productivity as a composite measure.17,19,24 In addition, organizations often use their own multidimensional frameworks to operationalize productivity, reflecting their engineering goals—for example, Google uses the QUANTS framework, with five components of productivity.27 In this article, we use the SPACE framework,6 which builds on a synthesis of extensive and diverse literature by expert researchers and practitioners in the area of developer productivity.

SPACE is an acronym of the five dimensions of productivity:

˲ S (Satisfaction and well-being): This dimension is meant to reflect developers' fulfillment with the work they do and the tools they use, as well as how healthy and happy they are with the work they do. It reflects some of the easy-to-overlook trade-offs involved when looking exclusively at velocity acceleration—for example, when we target faster turnaround of code reviews without considering workload impact or burnout for developers.

˲ P (Performance): This dimension aims to quantify outcomes rather than output. Example metrics that capture performance relate to quality and reliability, as well as further-removed metrics such as customer adoption or satisfaction.

˲ A (Activity): This is the count of outputs—for example, the number of pull requests closed by a developer. As a result, this dimension is best quantified via system data. Given the variety of developers' activities as part of their work, it is important that the activity dimension accounts for more than coding activity—for instance, writing documentation, creating design specs, and so on.

˲ C (Communication and collaboration): This dimension aims to capture that modern software development happens in teams and is, therefore, impacted by the discoverability of documentation, the speed of answering questions, and the onboarding time and processing of new team members.

˲ E (Efficiency and flow): This dimension reflects the ability to complete work or make progress with little interruption or delay. It is important to note that delays and interruptions can be caused either by systems or humans, and it is best to monitor both self-reported and observed measurements—for example, use self-reports of the ability to do uninterrupted work, as well as measure wait time in engineering systems.
Figure 3. Correlation between metrics. Metrics are ordered by similarity based on distance in the correlation matrix, except for manually fixing the aggregate productivity and acceptance rate at the end for visibility. (Heatmap of Spearman correlations from 0.00 to 1.00 between usage metrics—accepted_per_shown, accepted_per_opportunity, the unchanged and mostly_unchanged variants per opportunity, per active hour, and per accepted completion, shown, shown_per_opportunity, shown_per_active_hour, accepted_per_active_hour, accepted_char_per_active_hour—and survey measures: learn_from, less_time_searching, unfamiliar_progress, repetitive_faster, less_effort_repetitive, better_code, less_frustrated, focus_satisfying, more_fulfilled, stay_in_flow, tasks_faster, aggregate_productivity.)
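This kind of analysis can be reproduced in outline as follows. The sketch below computes a Spearman correlation matrix, orders it by similarity via hierarchical clustering (our approximation of the figure's ordering), and runs the pairwise Pearson tests described in the text that follows; the data frame is a synthetic stand-in, not the study data:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import squareform
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
# Synthetic stand-in: one row per respondent, columns named as in Figure 3.
df = pd.DataFrame(rng.random((500, 4)), columns=[
    "accepted_per_shown", "unchanged_120_per_accepted",
    "tasks_faster", "aggregate_productivity"])

corr = df.corr(method="spearman")

# Order metrics by similarity, using 1 - correlation as a distance.
dist = squareform(1 - corr.values, checks=False)
order = leaves_list(linkage(dist, method="average"))
print(corr.iloc[order, order].round(2))

# Pairwise Pearson's R with p-value; for a single pair, scipy's p-value
# coincides with the p-value of the F-statistic.
r, p = pearsonr(df["accepted_per_shown"], df["aggregate_productivity"])
print(f"R = {r:+.3f}, p = {p:.3g}")
```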
…usage measurements (Table 2). We then calculated Pearson's R correlation coefficient and the corresponding p-value of the F-statistic between each pair…

…with options 'Beginner', 'Intermediate', and 'Advanced'.
˲ "Which best describes your programming experience?" with options…

Figure 4. Different metrics clustering in latent structures predicting perceived productivity. We color the following groups: flawless suggestions (counting the number of unchanged suggestions), persistence rate (ratio of accepted suggestions that are unchanged), and fuzzy persistence rate (accepted suggestions that are mostly unchanged).

(Accompanying table fragment: subgroup, coeff, n — "none": 0.135*, 344 — for aggregate productivity.)
˲ …more developers making progress faster when working in an unfamiliar language.

These findings suggest that experienced developers who are already highly skilled are less likely to write better code with Copilot, but Copilot can assist their productivity in other ways, particularly when engaging with new areas and automating routine work.

Junior developers not only report higher productivity gains; they also tend to accept more suggestions. However, the connection observed in the section "What Drives Perceived Productivity" is not solely due to differing experience levels. In fact, the connection persists in every single experience group, as shown in Figure 5.

(All appendices for this article can be found in the online supplemental file at https://dl.acm.org/doi/10.1145/3633453.)

Variation over Time
Its connection to perceived productivity motivates a closer look at the acceptance rate and what factors influence it. Acceptance rate typically increases across the board when the model or underlying prompt-crafting techniques are improved. But even when these conditions are held constant (the study period saw changes to neither), more fine-grained temporal patterns emerge.

For coherence of the cultural implications of time of day and weekdays, all data in this section was restricted to users from the U.S. (whether in the survey or not). We used the same time frame as for the investigation in the previous section. In the absence of more fine-grained geolocation, we used the same time zone to interpret timestamps and for day boundaries (PST), recognizing this will introduce some level of noise due to the inhomogeneity of U.S. time zones.

Nevertheless, we observe strong regular patterns in overall acceptance rate (Figure 6). These lead us to distinguish three different time regimes, all of which are statistically significantly distinct at p < 0.001% (using bootstrap resampling; a sketch of such a test follows the list):

˲ The weekend: Saturdays and Sundays, where the average acceptance rate is comparatively high at 23.5%.
˲ Typical non-working hours during the week: evenings after 4:00 pm PST until mornings at 7:00 am PST, where the average acceptance rate is also rather high at 23%.
˲ Typical working hours during the week, from 7:00 am PST to 4:00 pm PST, where the average acceptance rate is much lower at 21.2%.
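A bootstrap comparison of two regimes could look like the following sketch. The per-completion acceptance data here is synthetic, drawn at the reported rates; this is an illustration of the technique, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

# accepted[i] is 1 if completion i was accepted, 0 otherwise, split by
# time regime (synthetic placeholders at the reported acceptance rates).
weekend = rng.binomial(1, 0.235, size=5_000)
workhours = rng.binomial(1, 0.212, size=20_000)

observed = weekend.mean() - workhours.mean()

# Bootstrap the difference in acceptance rates between the two regimes.
n_boot = 2_000
diffs = np.empty(n_boot)
for i in range(n_boot):
    w = rng.choice(weekend, size=weekend.size, replace=True)
    h = rng.choice(workhours, size=workhours.size, replace=True)
    diffs[i] = w.mean() - h.mean()

# A 99.9% confidence interval excluding 0 indicates the regimes differ.
lo, hi = np.percentile(diffs, [0.05, 99.95])
print(f"observed diff = {observed:.3f}, 99.9% CI = [{lo:.3f}, {hi:.3f}]")
```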
Conclusions
When we set out to connect the productivity benefit of GitHub Copilot to usage measurements from developer activity, we collected measurements about acceptance of completions in line with prior work, but also developed persistence metrics, which arguably capture sustained and direct impact on the resulting code. We were surprised to find acceptance rate (the number of acceptances normalized by the number of shown completions) to be better correlated with reported productivity than our measures of persistence.

In hindsight, this makes sense. Coding is not typing, and GitHub Copilot's central value lies not in being the way users enter most of their code. Instead, it lies in helping users make the best progress toward their goals. A suggestion that serves as a useful template to tinker with may be as good as or better than a perfectly correct (but obvious) line of code that only saves the user a few keystrokes.

This suggests that a narrow focus on the correctness of suggestions would not tell the whole story for this kind of tooling. Instead, one could view code suggestions inside an IDE as more akin to a conversation. While chatbots such as ChatGPT are already used for programming tasks, they are explicitly structured as conversations. Here, we hypothesize that interactions with Copilot, which is not a chatbot, share many characteristics with natural-language conversations.

We see anecdotal evidence of this in comments posted about GitHub Copilot online (see Appendix E for examples), in which users talk about sequences of interactions. A conversation turn in this context consists of the prompt in the completion request and the reply as the completion itself. The developer's response to the completion arises from the subsequent changes incorporated in the next prompt to the model. There are clear parallels to factors such as specificity and repetition that have been identified to affect human judgements of conversation quality.18 Researchers have already investigated the benefits of natural-language feedback to guide program synthesis,2 so the conversational framing of coding completions is not a radical proposal. But neither is it one we have seen followed yet.
Figure 6. Acceptance rate by weekday and time of day (PST); values range from roughly 20% to 24% across Saturday through Friday.

References
1. Amann, S., Proksch, S., Nadi, S., and Mezini, M. A study of Visual Studio usage in practice. In IEEE 23rd Intern. Conf. on Software Analysis, Evolution, and Reengineering 1. IEEE Computer Society, (March 2016), 124–134; 10.1109/SANER.2016.39.
2. Austin, J. et al. Program synthesis with large language models. CoRR abs/2108.07732 (2021); https://arxiv.org/abs/2108.07732.
3. Ari Aye, G., Kim, S., and Li, H. Learning autocompletion from real-world datasets. In Proceedings of the 43rd IEEE/ACM Intern. Conf. on Software Engineering: Software Engineering in Practice, (May 2021), 131–139; 10.1109/ICSE-SEIP52600.2021.00022.
4. Beller, M., Orgovan, V., Buja, S., and Zimmermann, T. Mind the gap: On the relationship between automatically measured and self-reported productivity. IEEE Software 38, 5 (2020), 24–31.
5. Chen, M. et al. Evaluating large language models trained on code. CoRR abs/2107.03374 (2021); https://arxiv.org/abs/2107.03374.
6. Forsgren, N. et al. The SPACE of developer productivity: There's more to it than you think. Queue 19, 1 (2021), 20–48.
7. Hellendoorn, V.J., Proksch, S., Gall, H.C., and Bacchelli, A. When code completion fails: A case study on real-world completions. In Proceedings of the 41st Intern. Conf. on Software Engineering, J.M. Atlee, T. Bultan, and J. Whittle (eds). IEEE/ACM, (May 2019), 960–970; 10.1109/ICSE.2019.00101.
8. Hendrycks, D. et al. Measuring coding challenge competence with APPS. CoRR abs/2105.09938 (2021); https://arxiv.org/abs/2105.09938.
9. Hindle, A. et al. On the naturalness of software. In 34th Intern. Conf. on Software Engineering, M. Glinz, G.C. Murphy, and M. Pezzè (eds). IEEE Computer Society, (June 2012), 837–847; 10.1109/ICSE.2012.6227135.
10. Jaspan, C. and Sadowski, C. No single metric captures productivity. Rethinking Productivity in Software Engineering (2019), 13–20.
11. Kulal, S. et al. SPoC: Search-based pseudocode to code. In Proceedings of Advances in Neural Information Processing Systems 32, H.M. Wallach et al (eds), (Dec. 2019), 11883–11894; https://bit.ly/3H7YLtF.
12. Meyer, A.N., Barr, E.T., Bird, C., and Zimmermann, T. Today was a good day: The daily life of software developers. IEEE Transactions on Software Engineering 47, 5 (2019), 863–880.
13. Meyer, A.N. et al. The work life of developers: Activities, switches and perceived productivity. IEEE Transactions on Software Engineering 43, 12 (2017), 1178–1193.
14. Meyer, A.N., Fritz, T., Murphy, G.C., and Zimmermann, T. Software developers' perceptions of productivity. In Proceedings of the 22nd ACM SIGSOFT Intern. Symp. on Foundations of Software Engineering (2014), 19–29.
15. Murphy-Hill, E. et al. What predicts software developers' productivity? IEEE Transactions on Software Engineering 47, 3 (2019), 582–594.
16. Peng, S., Kalliamvakou, E., Cihon, P., and Demirer, M. The impact of AI on developer productivity: Evidence from GitHub Copilot. arXiv:2302.06590 [cs.SE] (2023).
17. Ramírez, Y.W. and Nembhard, D.A. Measuring knowledge worker productivity: A taxonomy. J. of Intellectual Capital 5, 4 (2004), 602–628.
18. See, A., Roller, S., Kiela, D., and Weston, J. What makes a good conversation? How controllable attributes affect human judgments. In Proceedings of the 2019 Conf. of the North American Chapter of the Assoc. for Computational Linguistics: Human Language Technologies 1, J. Burstein, C. Doran, and T. Solorio (eds). Assoc. for Computational Linguistics, (June 2019), 1702–1723; 10.18653/v1/n19-1170.
19. Storey, M. et al. Towards a theory of software developer job satisfaction and perceived productivity. IEEE Transactions on Software Engineering 47, 10 (2019), 2125–2142.
20.–23. …; 10.1145/3491101.3519665.
24. Wagner, S. and Ruhe, M. A systematic review of productivity factors in software development. arXiv preprint arXiv:1801.06475 (2018).
25. Wang, D. et al. From human-human collaboration to human-AI collaboration: Designing AI systems that can work together with people. In Proceedings of the 2020 CHI Conf. on Human Factors in Computing Systems (2020), 1–6.
26. Weisz, J.D. et al. Perfection not required? Human-AI partnerships in code translation. In Proceedings of the 26th Intern. Conf. on Intelligent User Interfaces, T. Hammond et al (eds). ACM, (April 2021), 402–412; 10.1145/3397481.3450656.
27. Winters, T., Manshreck, T., and Wright, H. Software Engineering at Google: Lessons Learned from Programming Over Time. O'Reilly Media (2020).
28. Wold, S., Sjöström, M., and Eriksson, L. PLS-regression: A basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems 58, 2 (2001), 109–130; 10.1016/S0169-7439(01)00155-1.
29. Zhou, W., Kim, S., Murali, V., and Ari Aye, G. Improving code autocompletion with transfer learning. CoRR abs/2105.05991 (2021); https://arxiv.org/abs/2105.05991.

Albert Ziegler ([email protected]) is a principal researcher at GitHub, Inc., San Francisco, CA, USA.
Eirini Kalliamvakou is a staff researcher at GitHub, Inc., San Francisco, CA, USA.
X. Alice Li is a staff researcher for Machine Learning at GitHub, San Francisco, CA, USA.
Andrew Rice is a principal researcher at GitHub, Inc., San Francisco, CA, USA.
Devon Rifkin is a principal research engineer at GitHub, Inc., San Francisco, CA, USA.
Shawn Simister is a staff software engineer at GitHub, Inc., San Francisco, CA, USA.
Ganesh Sittampalam is a principal software engineer at GitHub, Inc., San Francisco, CA, USA.
Edward Aftandilian is a principal researcher at GitHub, Inc., San Francisco, CA, USA.

This work is licensed under http://creativecommons.org/licenses/by/4.0/.

Watch the authors discuss this work in the exclusive Communications video: https://cacm.acm.org/videos/measuring-github-copilot