
Empirical Software Engineering (2025) 30:53

https://doi.org/10.1007/s10664-024-10604-y

A fine-grained taxonomy of code review feedback in TypeScript projects

Nicole Davila1 · Ingrid Nunes1 · Igor Wiese2

Accepted: 9 December 2024


© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025

Abstract
Modern Code Review (MCR) is a practice software engineers adopt to improve code quality.
Despite its well-known benefits, it involves non-negligible effort, and this has led to various
studies that provide insightful information from MCR data or support the review activity. Even
though some studies have proposed taxonomies of MCR feedback, they are either coarse-grained
or focus on particular technologies, typically Java. Besides, existing studies have focused on
classifying the concerns raised by reviewers or the review changes triggered during MCR
separately. In contrast, we present a joint, in-depth qualitative study of code-level issues found
and fixed during the code review process in TypeScript projects, a popular language among
practitioners in recent years. We extracted and manually classified 569 review threads of
four open-source projects on GitHub: Angular, Kibana, React Native, and VS Code. Our key
contribution is a comprehensive and fine-grained classification of aspects discussed during
MCR, categorized into four main groups: topic, review target, issue, and code fix. We also
present an analysis of the actual outcomes of MCR in the projects and discuss the potential
research directions.

Keywords Modern code review · Software quality · Code review feedback · Review changes

Communicated by: Jeffrey C. Carver

Nicole Davila
[email protected]
Ingrid Nunes
[email protected]
Igor Wiese
[email protected]

1 Instituto de Informática, Universidade Federal do Rio Grande do Sul (UFRGS), Av. Bento Gonçalves 9500, Bloco IV, 91501-970 Porto Alegre, RS, Brazil
2 UTFPR, Campo Mourão, PR, Brazil

1 Introduction

Modern Code Review (MCR) is a collaborative practice supported by tools in which developers (the reviewers) asynchronously check the quality of code changes proposed by a teammate and decide whether to accept them as part of the codebase (Davila and Nunes 2021). Several
empirical studies provide evidence of the benefits of MCR, e.g., early defect detection, dis-
cussions on alternative solutions, and knowledge transfer (Bacchelli and Bird 2013; Baum
et al. 2016; Bosu et al. 2017; MacLeod et al. 2018b; Sadowski et al. 2018). However, a
non-negligible human effort is required by MCR since many hours of software development
can be spent per week reviewing code changes (Thongtanunam et al. 2015; Yang et al. 2017).
Given the well-known benefits and the human effort associated with MCR, various studies
have explored and proposed approaches to support the practice and improve its cost-benefit.
For instance, there are reviewer recommenders and visualization techniques (Davila and
Nunes 2021) and, more recently, automation approaches to recommend code changes (Tufano
et al. 2021, 2022). Nevertheless, a way to propose new solutions is by understanding the
actual outcomes of MCR. Previous work has explored what reviewers find (Bacchelli and
Bird 2013; Spadini et al. 2018; Li et al. 2017; Zanaty et al. 2018; Pascarella et al. 2018;
Gunawardena et al. 2023) and what code reviews fix (Beller et al. 2014; Jiang et al. 2021).
Overall, the former perspective focuses on potential issues and reviewers’ efforts (including
false positives), and the latter is based on the possibility of a mismatch between review
feedback and code changes (or review comments not triggering changes). Although some
studies took these two perspectives (Panichella and Zaugg 2020; Zanaty et al. 2018) and
considered both review feedback and review changes, their findings mainly focused on only
one of them.
A better understanding of MCR outcomes can help researchers identify opportunities for
novel support (Fregnan et al. 2022). Moreover, investigating these outcomes at a fine-grained
level can provide more actionable insights, as observed in recent research (Gunawardena et al.
2023; Tufano et al. 2021). For instance, given an analysis of fine-grained issues identified
in code reviews from Java projects, Gunawardena et al. (2023) found that around 22% of
them could be automatically detected by existing tools. Given the existing findings from the
literature, our idea is that a joint, fine-grained analysis of issues found and how they are
fixed can provide insights into tailoring novel approaches to support code review.
This study aims to understand code-level issues found and fixed during the code review
process. To achieve our goal, we qualitatively investigate code review threads and review
changes. A review thread consists of a series of comments linked to a code snippet in which
the reviewers and the author discuss an issue raised by the reviewer. A review change is a
set of changes made during the code review process, i.e., commits. Our analysis focuses on
understanding the issues reported by reviewers (by analyzing the review threads) and the
fixes triggered (by also analyzing the review changes).
To achieve our goal, we extracted and manually classified 569 review threads of four open-
source projects obtained from GitHub. We focused on TypeScript projects, a programming
language with increasing popularity among practitioners (GitHub 2023; Stack Overflow
2023) that lacks code review literature studies. Although our study did not hypothesize that
code-level issues found and fixed in code reviews are language-dependent, we understand that
review practices may vary across different settings. While most code review studies focused
on languages like Java, our choice of TypeScript aims to provide an additional perspective
and complement code review literature. As a result of our qualitative analysis, we identified
issues found, their code fixes, the general topic of a review thread, and characteristics of the
code under review, which we refer to as review target. We provide a fine-grained taxonomy
from these observed aspects. While we found some TypeScript-specific issues, most of our
findings concern language-independent aspects. Using our classification, we analyze the
relationship among issues, code fixes, topics, and review targets.


We found that reviewers in TypeScript projects frequently focus on reporting Evolvability rather than Functional Coding issues. Both types refer to code-level problems, but the former affects future development efforts (but not the runtime behavior), and the latter impacts the runtime behavior. Most Evolvability Coding issues in our analysis are related to documentation and sub-optimal decisions associated with naming and overcoding. This type of review feedback often triggers a Generalizable Fix, representing a code fix using a reusable strategy. Functional Coding issues often trigger Application-specific Fixes, which are code fixes
that require more domain expertise and represent non-reusable fix strategies. Based on our
findings, we summarize potential research directions related to tool support for the frequent
issues observed.
The main contributions of this paper can be summarized as follows: (i) a qualitative
investigation of fine-grained aspects of issues reported and fixed in code review in TypeScript
projects; (ii) a comprehensive taxonomy of topics, issues, and code fixes in MCR grouped in
multi-level categories and complemented with the review code target information; (iii) the
working data sets, from TypeScript projects, for replication purposes.

2 Related Work

The increasing popularity of MCR resulted in a large and growing body of knowledge in the
literature, which is reviewed in recent secondary studies (Badampudi et al. 2023; Davila and
Nunes 2021). In the taxonomy of MCR research introduced by Davila and Nunes (2021),
our work can be classified as a foundational study, i.e., work that investigates aspects of
MCR. More specifically, our study falls into the subcategory analysis of internal outcomes,
i.e., those associated with the MCR process characteristics, focusing on the internal outcome
review feedback and review intensity.
Among studies on the analysis of internal outcomes of MCR, we have two types of analysis
strategies: what code reviews find and what code reviews fix. In the former case, studies
mainly consider the comments written by the reviewers and the author during the lifetime
of a review request, analyzing potential problems identified during the process. In the latter,
researchers also consider the code churn (the delta between the submitted and accepted code)
of the review iterations to identify what has actually changed. Table 1 summarizes the studies
proposing or extending a taxonomy related to code review outcomes.

Table 1 Studies providing classifications of code review outcomes


Study Context Data Source Code Classification

What code reviews find (main analyzed artifact: comments written by reviewers)
Bacchelli and Bird (2013) Microsoft CodeFlow – Common themes
Li et al. (2017) OSS GitHub Ruby, Java, JavaScript Code review topics
Zanaty et al. (2018) OSS Gerrit – Design issues
Pascarella et al. (2018) OSS Gerrit Multiple Information needs
Gunawardena et al. (2023) OSS CROP Java Concern types
What code reviews fix (main analyzed artifact: code changes after review request)
Beller et al. (2014) OSS Eclipse/VCS & Gerrit Java, C Review changes
Panichella and Zaugg (2020) OSS Gerrit Java Review changes
Jiang et al. (2021) OSS GitHub Multiple PR modification types
The bold was used to highlight main categories when there is a hierarchy


What code reviews find has been a concern since software inspection (Mäntylä and Lasse-
nius 2008; Siy and Votta 2001) and largely explored in MCR. A study at Microsoft (Bacchelli
and Bird 2013) and later replication in OSS projects (Spadini et al. 2018) classify each review
comment based on its theme, obtaining categories such as code improvement, understanding,
and social interaction. Considering inline code comments and general pull-request (PR) comments of OSS projects hosted on GitHub, another study (Li et al. 2017) classified review topics and presented a taxonomy with four level-1 categories, namely code correctness, PR decision-making, project management, and social interaction. Recently, a study (Gunawardena et al. 2023) analyzed code review feedback from Gerrit for seven OSS projects, focusing on a more fine-grained classification than previous ones. For instance, their proposed taxonomy presents concerns raised by reviewers, such as better library use exists and non-self-explanatory identifier name, instead of coarser categories like code improvement used in previous classifications. In addition, some
studies have been exploring specific aspects, such as discussion threads that started from
a reviewer’s question and their information needs (Pascarella et al. 2018), design-related
review comments (Zanaty et al. 2018), confusion (Ebert et al. 2021), and code smells (Han
et al. 2022).
What code reviews fix has been the focus of another group of studies, which analyze
the code changes related to a review request after its submission. Beller et al. (2014) is a
well-known study empirically exploring the problems fixed through MCR in OSS systems,
proposing a taxonomy derived from prior work (Mäntylä and Lassenius 2008). Besides, they
classified a change as triggered by a review comment or undocumented. For them, there may
be a mismatch between review feedback and code changes, or review comments may not
trigger changes. Later, another study (Panichella and Zaugg 2020) explored review changes in
OSS projects and merged their classification with the existing one, promoting some additions.
Another study (Jiang et al. 2021) analyzed how developers modify a PR, a broader perspective that includes review changes.
The present study complements existing findings on analyzing MCR outcomes using
data from TypeScript projects. We differ from previous studies by providing a fine-grained
classification that differentiates code-level issues found and fixed during code reviews. Thus,
we aim to identify potential problems found by reviewers and better understand how they
trigger code changes. We argue that an investigation considering both perspectives can further
explore their differences and similarities, providing valuable insights for researchers and
practitioners.

3 Study Settings

This section describes our research methodology, including our data extraction and analysis
procedure, as summarized in Fig. 1. Our findings are discussed in the following sections.

3.1 Research Questions

Our study aims to investigate recurrent code issues that appear in code reviews and their
corresponding fixes in the context of TypeScript projects. This goal leads to the following
research questions.

RQ1 What are the coding issues reported and discussed during code reviews in TypeScript
projects?


Fig. 1 Overview of the study methodology, including qualitative data analysis using an iterative process

RQ2 How do developers address the coding issues reported by reviewers in TypeScript
projects?

With RQ1, we aim to understand the problems reported by reviewers when checking a
specific code snippet. More specifically, we focus on investigating fine-grained issues raised
in MCR that are more likely to occur. In RQ2, we analyze the fixes triggered by the coding
issues reported by reviewers. We aim to study recurrent code fixes that can be generalized and
reused in similar contexts. The answer to these research questions can potentially provide
insights into solutions to support frequent situations observed in MCR, reducing the human
effort of the practice.


Table 2 Summary of our target projects, data extraction, and selection results from GitHub projects
Data Extraction Thread Filtering Selected for Analysis
Project #PR #Thread #PR #Thread #PR #Thread

Angular 217 1,197 69 593 41 132
Kibana 168 1,172 78 544 42 197
React Native 33 171 15 104 7 15
VS Code 265 627 142 498 63 225
Total 683 3,167 304 1,739 153 569
The bold was used to highlight the total count when quantities were presented

3.2 Target Projects

Our study focuses on TypeScript projects, a programming language released by Microsoft in 2012 that was motivated by the shortcomings of JavaScript in large-scale applications.
The rationale for this decision was the following. While most code review research on
review feedback has focused on Java projects (Beller et al. 2014; Panichella and Zaugg
2020; Gunawardena et al. 2023), we observed an increasing adoption of TypeScript among
practitioners. In a GitHub report of 2023 on the state of the practice in open-source (GitHub
2023), TypeScript overtook Java as the third most popular language. In the Stack Overflow
Developer Survey of 2023 (Stack Overflow 2023), TypeScript is also among the top five
programming languages most adopted by respondents. Although JavaScript is currently the
most popular in these surveys, TypeScript builds on it, and both languages can co-exist in the same project. Therefore, we decided to focus on TypeScript projects.
Our candidate projects are hosted on GitHub, which gives our study a more cohesive scope. We adopted
two criteria to guide the selection of the projects in the GitHub advanced search. First, the
repository must have at least 10k GitHub stars to exclude unpopular projects that are unlikely
to provide valuable data for our analysis. Second, we only considered repositories with code
written in TypeScript, which is the scope of our work. The advanced search returned more
than 200 repositories using these filters.
Given the large number of repositories obtained in the first search, we applied extra
filtering. We follow a convenience sampling approach (Baltes and Ralph 2022), focusing
on well-established and popular open-source projects that attract regular contributions and
adopt code review. Thus, we analyzed the search result list and obtained four TypeScript
projects that matched these aspects: Angular1, Kibana2, React Native3, and Visual Studio (VS) Code4. Our search process resulted in selecting the projects summarized in Table 2.

1 https://angular.io
2 https://www.elastic.co/kibana
3 https://reactnative.dev
4 https://code.visualstudio.com

3.3 Data Extraction and Filtering

To build our research dataset, we extracted Pull Requests (PRs) and their review data for each target project using the GitHub API in a Python script. We extracted PRs filtering those with the general status closed and review status as changes_requested (IC1). With these filters, we eliminate PRs with ongoing discussions and select reviews that triggered code changes.
We started this study in 2020 and collected data in 2021. To restrict the scope to a specific
time interval and get recent data, we filtered PRs created between 2019 and 2020 (IC2). As
a result of the extraction process, we collected 683 PRs and their review data from the target
projects.
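
The extraction step can be sketched in code. The study used a Python script; the snippet below is only a hypothetical TypeScript equivalent, assuming the Octokit REST client, showing how closed PRs with at least one changes_requested review (IC1) could be collected.

```typescript
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// Returns the numbers of closed PRs that received at least one
// CHANGES_REQUESTED review (IC1). The date window of IC2 (PRs created
// between 2019 and 2020) could additionally be checked on pr.created_at.
async function collectCandidatePRs(owner: string, repo: string): Promise<number[]> {
  const pulls = await octokit.paginate(octokit.rest.pulls.list, {
    owner,
    repo,
    state: "closed",
    per_page: 100,
  });

  const selected: number[] = [];
  for (const pr of pulls) {
    const reviews = await octokit.paginate(octokit.rest.pulls.listReviews, {
      owner,
      repo,
      pull_number: pr.number,
    });
    if (reviews.some((review) => review.state === "CHANGES_REQUESTED")) {
      selected.push(pr.number);
    }
  }
  return selected;
}
```
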
As our work focuses on coding issues reported by reviewers, our analysis targets code
review threads. Figure 2 depicts an example of a review thread in VS Code. Our study does
not include the analysis of discussions in the main body of a PR. Given our set of collected
PRs, we extracted 3,167 review threads (see Data Extraction columns in Table 2).
To build our research dataset with review threads that are more likely to be related to
code issues that trigger a review change, we defined additional inclusion criteria to filter
discussions that match the scope of our study. To be selected in our analysis, a review thread
must satisfy the following inclusion criteria (IC):

IC3 A reviewer (not the author) initiates the thread;
IC4 The thread is in a TypeScript or JavaScript file; and
IC5 There is at least one commit after the thread creation.

We consider both TypeScript and JavaScript files because TypeScript is built on JavaScript,
and both file types coexist and work together in the same project. We do not include, for
instance, review threads linked to code of other programming languages (e.g., Java), configuration files (e.g., JSON, YAML), web component files (e.g., HTML, CSS), markdown, or text.

Fig. 2 An example of a code review thread on GitHub from the project VS Code and how the main groups of our taxonomy were analyzed based on the available information. Example extracted from GitHub at https://github.com/microsoft/vscode/pull/98269
We analyzed whether each review thread satisfies these criteria using the data retrieved
by GitHub API and the aforementioned Python script during the data extraction. After this
filtering, we collected 1,739 review threads from the target projects that matched our inclusion
criteria, corresponding to 304 PRs (see Thread Filtering columns in Table 2).
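
As an illustration only, the thread-level criteria IC3-IC5 can be expressed as a simple predicate. The types and field names below are hypothetical simplifications of the data returned by the GitHub API, not the script used in the study.

```typescript
// Hypothetical, simplified view of the extracted data.
interface ReviewThread {
  initiatorLogin: string; // user who wrote the first comment of the thread
  filePath: string;       // file the thread is anchored to
  createdAt: Date;
}

interface PullRequest {
  authorLogin: string;
  commitDates: Date[];    // timestamps of the commits in the PR
}

// IC4: the thread must be anchored to a TypeScript or JavaScript file.
const isTsOrJsFile = (path: string): boolean => /\.(ts|tsx|js|jsx)$/.test(path);

function satisfiesInclusionCriteria(thread: ReviewThread, pr: PullRequest): boolean {
  const ic3 = thread.initiatorLogin !== pr.authorLogin;         // reviewer, not the author
  const ic4 = isTsOrJsFile(thread.filePath);                    // TS or JS file
  const ic5 = pr.commitDates.some((d) => d > thread.createdAt); // commit after the thread
  return ic3 && ic4 && ic5;
}
```
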

3.4 Data Analysis and Refinement

Given the research dataset built from our target projects, we proceed with the qualitative
data analysis using a coding approach. Coding is a strategy used to derive a framework of
thematic ideas of qualitative data by indexing data (Gibbs 2007). While existing literature
provides code review taxonomies, none consider issues reported and code fixes separately.
Hence, we decided to ground on the qualitative analysis of the data itself, openly identifying
themes without an initial taxonomy. Later, we compare and discuss our findings with those
of previous studies.
As an initial guide for our coding process, we defined four aspects to be qualitatively analyzed: (i) topic, which indicates the theme of the review discussion; (ii) review target, which identifies the syntax element of the code under review; (iii) issue, which indicates the problem raised by the reviewer; and (iv) code fix, which is the code change triggered by
the review. With issue and code fix, we aim to identify what code review found (RQ1)
and fixed (RQ2), respectively. To complement the analysis, review target aims to identify
what code element attracts code review, and topic complements what reviewers found with
a high-level classification of what is being discussed.
Next, we conducted a pilot with 29 review threads randomly selected from VS Code and
their follow-up commits. From this initial experience, we defined and followed an iterative
data analysis procedure. In each iteration, we randomly picked 90 review threads from our
research dataset and then proceeded with the data analysis and refinement. The rationale for
the number of review threads in each iteration was the following. We wanted to label very few
instances in the pilot round to understand our data, so we targeted a short dry run to clarify the
process and our classification granularity. We selected VS Code because of the large number
of review threads available. The number of threads in each iteration was defined based on
the pilot experience and the manual effort estimation to be performed, given the fine-grained
focus of our analysis. Based on this initial experience, we also decided on an iterative process
to include inspection meetings between rounds to mitigate bias and ensure coding consistency
since an independent researcher performed such analysis. We performed six iterations after
the pilot. Considering the whole analysis process, we selected and manually analyzed 569
review threads in our study (see Selected for Analysis columns in Table 2).
For each review thread, we proceed with coding as depicted in Fig. 2. We first identified and
read the associated code and its comments, classifying topic, review target and issue.
If there was an issue reported, we consider the commits after the review thread, reviewers’
comments, and the follow-up replies of the author indicating how the review feedback was
or will be solved to identify the code fix triggered. We accessed the information, i.e., code,
comments, and commits, directly on GitHub using the link provided by the GitHub API
during the data extraction. Figure 3 shows an example of the resulting coding of a review
thread and the information considered to classify it.
The authors shared a spreadsheet with the coding. The first author coded the data, and all
authors carried out the remaining activities.

Fig. 3 An example of the resulting coding of a code review thread on GitHub from the project VS Code given the main groups of our taxonomy. Example extracted from GitHub at https://github.com/microsoft/vscode/pull/96284

As previously mentioned, we follow an iterative analysis process. Thus, to avoid bias by conducting the classification independently, each
round was followed by an inspection meeting with all authors. This inspection focused
on revising codes created during the very last round, whether the proposed descriptions
clearly describe them, and the consistency of the resulting classification (including codes from previous rounds and the new codes).

Fig. 4 The final number of codes by each main group at the end of each analysis iteration

Figure 4 summarizes the number of codes at the
end of each iteration by group, and more detailed information on codes and groupings per
round is available in our supplementary material. In case of uncertainty in a given review
thread, the first author classified it as “unclear”. All authors jointly discussed and coded
82 review threads considered unclear during the inspection meetings. Since we focused on
openly identifying themes rather than labeling the data using predefined codes, we did not
calculate the Inter-Rater Reliability (IRR) agreement. According to McDonald et al. (2019),
this approach of open coding does not need the calculation of IRR.
After the iterative data analysis, the authors met to refine the codes until we reached the
final taxonomy presented in this paper. For instance, a code X created in a later round
could be more suitable for threads analyzed at the beginning of the manual process (when X
had not come up as a code yet). Thus, we revised groupings and the consistency among them.
In this step, we also classified some additional aspects given our codebook (i.e., codes with
their descriptions), namely the impact of the code-level issues and code fixes. Each issue
was classified as Functional if it might affect the code runtime behavior, or Evolvability if it
might affect future development efforts but not the runtime behavior. For instance, syntax or
naming problems were classified as Evolvability, while an inadequate call or a memory leak
were classified as Functional issues. Similarly, each code fix was classified into Behavior
Changes if it refers to changes in the internal structure of the code and its behavior, or
Refactoring if the fix is associated with modifying the internal structure of the code while
preserving its behavior. For instance, throwing an error represents a Behavior Change, while
renaming an element is classified as a Refactoring. The taxonomy refinements were tracked
and are available in our supplementary material, as well as the additional classifications.

4 Taxonomy of Modern Code Review Feedback

This section describes the fine-grained taxonomy of the code review discussion in TypeScript
projects. Our classification is based on four main groups: Topic, Review Target, Issue,
and Code Fix. We use “code” to refer only to the final label of each branch of these
groups. We created categories and subcategories as intermediate groupings for some cases.
In Figs. 5 and 6, we detail these main and intermediate groupings. A complete overview of
our taxonomy and descriptions is available in our supplementary material.

4.1 Topic

Topic classifies what the review discussion is about based on all reviewer comments and
author replies. Most of the 569 analyzed threads focus on a single Topic, while 20 are
classified into two Topics, resulting in 589 occurrences in this group. We identified 28 codes
in this group and organized them into four categories, as summarized in Table 3: Requirement,
Design, Implementation, and Other.
Almost half of the review threads are associated with the Implementation category
(271/569, 47.63%), which gathers conversations on how things are or should be coded. Then,
we have the categories Requirement (143/569, 25.13%) and Design (138/569, 24.25%) with
a similar number of threads. The former gathers discussions on coding aspects to meet the
expected software behavior. The latter consists of conversations about the solution design to
match these expectations. The categories are organized into subcategories, each grouping a
similar number of codes, as indicated in Table 3.


Fig. 5 Summary of our fine-grained code review feedback taxonomy - Part I. A complete overview of our
taxonomy and descriptions is available in our supplementary material

In the Other category (37/569, 6.50%), we grouped three particular cases: Requirement
Change (14/37, 37.84%), Unexplained Suggestion (11/37, 29.73%), and No Topic (12/37,
32.43%). Requirement Change includes conversations about changes in the definitions of
the product that might impact the scope of the PR and that are not related to specific coding
aspects. Unexplained Suggestion includes cases in which the reviewer directly suggests how
to change the code, but further details are needed to understand why the change is suggested.
Finally, in No Topic, we have conversations unrelated to corrections or improvements in the
code, such as “The ‘Action’ looks good!”, “Good job with tests,” “yay! dead code elimina-
tion!”.
The most common code in Topic is Usefulness (category: Implementation, sub-category:
Code Evolution), with 12.13% of the threads (69/569). It includes threads about the necessity
of one or more elements in the code. For instance, we have in this theme conversations starting
with “Why is this not needed anymore?” or “Is this command being used?”
The second most common code in Topic is Legibility (category: Implementation, sub-
category: Code Evolution), with 11.42% of the threads (65/569). In this case, the discussions
focus on the degree to which it is possible to understand the ideas written in the code. Some
examples of threads include comments like “I would name this HeaderNav to make it
more descriptive” and “Why not return acc.concat(explicitTypes), for me it reads better.”


Fig. 6 Summary of our fine-grained code review feedback taxonomy - Part II. A complete overview of our
taxonomy and descriptions is available in our supplementary material

4.2 Review Target

Review Target classifies aspects of the code linked to the review thread, which can provide
insights into the relationship with the other groups of our analysis. We consider the code lines
and the file selected by the reviewer to start the discussion categorizing the Syntax Element,
such as Call Expression, Class Declaration, or Import. We identified 24 Syntax Elements and
summarized their frequency in Fig. 7.
As the code linked to a review thread might contain multiple elements, we focus on the
one that starts the line to make the analysis feasible even in complex code. For instance,
when the target is const identifier = callee(arguments);, i.e., the outcome
of a call expression is assigned to a just-declared constant, we indicate Variable Declaration with Initialization as the Syntax Element. We also identified in 30 out of 569 threads (5.27%) multiple lines being commented by the reviewer in a single review, and, in these cases, we classify the Syntax Element when it contains a block statement, e.g., an If or Function Declaration (25 out of 30 cases match this criterion).

Table 3 Organization of the categories and subcategories of Topic

Topic Categories and Subcategories Codes (Total 28) Valid Threads (Total 569)*
Count % Count %
Requirement 9 32.14 143 25.13
Functional Requirement 4 14.29 96 16.87
User Interface 3 10.71 33 5.97
Nonfunctional Requirement 2 7.14 14 2.46
Design 6 21.43 138 24.25
Code Reuse 3 10.71 100 17.57
Modularization 3 10.71 38 6.68
Implementation 10 35.71 271 47.63
Code Evolution 4 14.29 108 18.98
Coding Style 2 7.14 88 15.47
Cross-cutting Concerns 4 14.29 75 13.18
Other** 3 10.71 37 6.50
*From 569 review threads, 20 are about two Topics, totaling 589 occurrences on the Topic group. **Other includes particular cases of Requirement Change, Unexplained Suggestion, and No Topic
The bold was used to highlight main categories when there is a hierarchy

Fig. 7 Distribution of review threads in the top-10 Syntax Elements


The most frequent Syntax Element is a Call Expression (101/569, 17.75%), which might
be a method or function call, followed by a Variable Declaration with Initialization (96/569,
16.87%), in which an assignment of hard-coded value, call, or logical/arithmetic expression
follows the declaration. Then, in the third position, we identified Object Property Declara-
tion (56/569, 9.84%), which in TypeScript and JavaScript projects is part of an object literal,
i.e., list of pairs of property names and associated values of an object. As the least frequent
Syntax Elements, we identified only one Arrow Function and Export Default declaration
occurrence.
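
For illustration, the hypothetical snippet below shows how the three most frequent Syntax Elements appear at the start of a line; the identifiers are invented, not taken from the studied projects.

```typescript
// Hypothetical helpers so the snippet is self-contained.
function refreshEditorState(): void {}
function computeScrollState(content: string): number { return content.length; }

// Call Expression: the line starts with a method or function call.
refreshEditorState();

// Variable Declaration with Initialization: the initializer is a hard-coded value,
// a call, or a logical/arithmetic expression.
const scrollState = computeScrollState("editor-content");

// Object Property Declaration: a name-value pair inside an object literal.
const labels = {
  saveButton: "Save changes", // <- an Object Property Declaration
};
```
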
As supplementary information, we identified a Call Expression after the initial Syntax
Element in 104 out of 569 review threads (18.28%), resulting in 205 threads with some
invocation (205/569, 36.03%). From these 104 cases, 46 occur in a Variable Declaration with
an Initialization statement (44.23%) and 20 in an Assignment Expression (19.23%).

4.3 Issue

Issue categorizes problems and improvement opportunities reported by reviewers, focusing on what reviewers point out as inadequate, incomplete, or sub-optimal in the Review Target. From 569 review threads, most are associated with one code and 17 with two, resulting
in 586 occurrences in this group.
In 71 out of 569 threads (12.48%), there is no evidence of a problem or improvement being
reported, which we classified as No Issue. This code includes threads already classified as
No Topic and those where the reviewer asked for clarifications, such as “Just curious: why
backslash is not needed? :)” or “what is this for?”. The remaining 111 issue codes are split
into two sets based on their nature, as presented below.

4.3.1 Conceptual Design Issue

Eight Issue codes (8/111, 7.21%) are related to high-level discussions on technical deci-
sions, being categorized as Conceptual Design Issues. This set refers to issues based on the
requirements, including definitions of the solution design, the display of or input on the user interface, and testing. Table 4 details our Conceptual Design Issues and how the 57 threads
in this set are distributed.
More than half of the review threads with a Conceptual Design Issue are related to the
user interface (UI) (29/57, 50.88%), including cases of UI Text Language, UI Layout Style,
and Inadequate Shortcut Configuration. The UI Text Language is the most common (21/57,
36.84%) and refers to a wrong label or description that can be improved, e.g., “I would prefer
the following text: ‘Hold alt key to switch to editor language hover’.”
Following the UI problems, we have Missing Test Case as the second most com-
mon (15/57, 26.31%). This code contains review feedback of one or more test cases not
created when needed. These are cases similar to “nit: Could also add a few tests for
‘.to.kibana_datatable()’,” for instance.

Table 4 Summary of Conceptual Design Issues

Conceptual Issue Short Description #*
UI Text Language Wrong user interface text or that requires improvement 21
Missing Test Case Lack of test case 15
Inadequate Design Solution Bad design strategy, e.g., non-recursive 10
UI Layout/Style Positioning or style of interface element unsuitable or to be improved 6
Inadequate Shortcut Configuration Default shortcut unsuitable or wrong 2
Large Test Case Input Data File Large, not compressed data file with unnecessary information 1
Required Child Implementation Parent class requiring changes in each child class 1
Uncovered Edge Case Design not covering one or more edge cases 1
Total 57
*Number of threads
The bold was used to highlight the total count when quantities were presented

4.3.2 Coding Issue

The second set of Issues consists of problems and improvements related to code-level discussions on technical decisions (103/111, 92.79%). They refer to more specific coding aspects, such as code invocations, assignments, or typing. We grouped the 103 coding issues and organized them into 14 categories, as summarized in Table 5. From 442 review threads associated
with this set (442/569, 77.68%), 17 have two Coding Issues, resulting in 459 occurrences.
Given the total review threads in Coding Issue, the most common categories are Mistaken
Implementation (82/459, 17.86%), Overcoding (72/459, 15.69%), and Poor Documenta-
tion (41/459, 8.93%). These categories include a range of issues, representing almost half of
Coding Issue occurrences.

Table 5 Organization of the categories of Coding Issues


Codes Threads
Coding Issue Category Count % Count %

Mistaken Implementation 12 11.65 82 17.86
Overcoding 9 8.74 72 15.69
Poor Documentation 4 3.88 41 8.93
Bad Naming 5 4.85 35 7.63
Inappropriate Code Organization 11 10.68 34 7.41
Bad Conditional Logic 5 4.85 31 6.75
Bad Syntax or Formatting 9 8.74 28 6.10
Fragile Code 11 10.68 24 5.23
Inappropriate Element Relationship 6 5.83 22 4.79
Excessive Complexity 5 4.85 21 4.58
Misuse of Language Features 10 9.71 21 4.58
Hard-Coded Information 3 2.91 20 4.36
Mismatching Implementation 6 5.83 16 3.49
Poor Resource Management 7 6.80 12 2.61
Total 103 100.00 459* 100.00
*From 442 review threads of Coding Issue, 17 have two Coding Issues, leading to 459 occurrences
The bold was used to highlight the total count when quantities were presented


Mistaken Implementation is related to bad decisions that lead to incorrect running code
and is associated with 82 review threads. Most of the Issues reported in this category are
about an Inadequate Value Assignment (26/82, 31.71%), meaning the value used for a
property or variable is unsuitable or wrong for the target situation. For instance, “I think this
enum descriptions and the default are wrong. The default should be true.” The second most
frequent is Choice of Method Call (21/82, 25.61%), in which a certain called method or
function is not an appropriate choice for the target situation, including cases like “You should
be using this ‘StableEditorScrollState’.”
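
To make the Inadequate Value Assignment case concrete, the hypothetical sketch below mirrors the quoted feedback about a default that should be true; the option name is invented for illustration.

```typescript
interface FeatureOptions {
  // Whether the feature is enabled by default.
  enabled: boolean;
}

// Before: the value assigned to the default is inadequate for the target situation.
const defaultsBefore: FeatureOptions = { enabled: false };

// After: the fix assigns the value expected by the requirements ("The default should be true.").
const defaultsAfter: FeatureOptions = { enabled: true };
```
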
Overcoding is second in frequency among categories and refers to dispensable code
elements. Unnecessary Code Change is the most common issue in this category (25/72,
34.72%), and it considers cases where part of the code under review is not necessary for the
target situation. Review threads associated with this code look like “Is this helpful? Should
it not use the log service instead?” or “Is this used?”
The third category of Coding Issue with a higher number of review threads is Poor Doc-
umentation. Three-quarters of cases in this category are about Lack of Documentation
(32/41, 78.05%), indicating a code with no or insufficient documentation. Some examples
are “add docs”, “Would you please add doc comments to these (or the composed version)?”,
and “Could you please provide comment here about logic?”
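
A minimal, hypothetical sketch of the corresponding fix is adding a doc comment to the undocumented element, as requested in the quoted threads.

```typescript
// Before: no documentation explaining the non-obvious logic.
function normalizeRange(start: number, end: number): [number, number] {
  return start <= end ? [start, end] : [end, start];
}

/**
 * After: a doc comment added in response to feedback such as
 * "Could you please provide comment here about logic?".
 * Returns the range in ascending order, swapping the bounds when needed.
 */
function normalizeRangeDocumented(start: number, end: number): [number, number] {
  return start <= end ? [start, end] : [end, start];
}
```
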
Each Coding Issue in these 14 categories has one of four possible types: Inadequate,
Incomplete, Sub-optimal, and Unnecessary Coding. Inadequate Coding occurs when a
specific code element is included, but it is unsuitable or wrong for the target situation,
representing an implementation mistake of the requirements. Some code elements are missing
or unclear in Incomplete Coding. Sub-optimal Coding refers to a particular code element
that is sub-optimal or inefficient and can be improved. Finally, Unnecessary Coding indicates
that one or more code elements are unnecessary or excessive for the target situation. Table 6
summarizes the distribution of the Coding Issues categories according to their type.

Table 6 Distribution of codes (C) and review threads (T) into categories and types of Coding Issues
Types of Coding Issues (Count Only)
Inadequate Incomplete Sub-optimal Unnecessary
Coding Issue Category #C #T #C #T #C #T #C #T

Bad Conditional Logic 3 21 2 10 0 0 0 0
Bad Naming 2 7 0 0 3 28 0 0
Bad Syntax or Formatting 1 1 2 3 4 17 2 7
Excessive Complexity 0 0 0 0 3 9 2 12
Fragile Code 3 5 2 2 6 17 0 0
Hard-Coded Information 0 0 0 0 3 20 0 0
Inap. Code Organization 1 3 1 1 9 30 0 0
Inap. Element Relationship 3 18 2 2 0 0 1 2
Mismatching Implement. 4 5 2 11 0 0 0 0
Mistaken Implementation 10 78 2 4 0 0 0 0
Misuse of Language Features 4 9 1 6 3 4 2 2
Overcoding 2 9 2 7 1 2 4 54
Poor Documentation 0 0 1 32 3 9 0 0
Poor Resource Management 1 3 0 0 4 6 2 3
Total 34 159 17 78 39 142 13 80
The bold was used to highlight the total count when quantities were presented


Most occurrences of Coding Issues are related to Inadequate Coding (159/459, 34.64%).
Almost half of these threads refer to a Mistaken Implementation (78/159, 49.06%), including
problems like Inadequate Value Assignment and Choice of Method Call. Considering the number of distinct issue codes by type, most refer to Sub-optimal Coding (39/103,
37.86%). In this case, we have a higher diversity of sub-optimal problems in the Inappropriate
Code Organization category. Not all coding issues are mapped to all possible types. The three
codes grouped under Hard-coded Information only have issues with the type sub-optimal.
Three categories, in turn, have codes of all four possible types: Bad Syntax or Formatting,
Misuse of Language Features, and Overcoding.
We also classified Coding Issues based on their impact. An issue affecting future devel-
opment efforts (but not the runtime behavior) is classified as Evolvability Coding Issue.
From Coding Issues, 66 out of 103 codes (64.08%) can be classified as Evolvability, result-
ing in 286 out of 459 thread occurrences (62.31%). The remaining cases are then related
to a Functional Coding Issue, which includes problems that might affect the code runtime
behavior. In 5 out of 14 categories of Coding Issues, we have only Evolvability Issues (i.e.,
Overcoding, Poor Documentation, Bad Syntax or Formatting, Excessive Complexity, and
Hard-Coded Information). Only one category, Bad Conditional Logic, contains Functional Coding Issues exclusively. Figure 8 shows the distribution of Coding Issue categories according to
the impact.

Fig. 8 Distribution of Coding Issue categories according to the impact (number of review threads with Evolvability vs. Functional Coding Issues per category)


4.4 Code Fix

Code Fix classifies the code change triggered to address the reported Issue of the review
thread. To find evidence of a fix, we consider the commits after the review thread, reviewers’
comments, and the follow-up replies of the author indicating how the review feedback was or
will be solved. More specifically, given the review thread and the reported issue, we checked
the next commits until we found one changing the same location in the code (i.e., the review
target) or the location mentioned in the thread discussion. When we found this commit, we
took it to classify the code fix. When there is no or insufficient evidence of a fix triggered
by the particular discussion, we classified it as a No Code Fix (153/569, 26.89%). However,
when we identified a fix, we also considered its chance of being reused in another similar
context, classifying it as an Application-specific Fix (198/569, 34.80%) or a Generalizable
Fix (218/569, 38.31%).

4.4.1 Application-specific Fix

In this set, we have 198 out of 569 review threads. Fixing the code in these cases involves domain-specific knowledge, modifying an algorithm that directly impacts or contributes to the specific application behavior and its intended use. For instance, the reviewer starts the discussion by
questioning “Does this logic need to be active everywhere in the app?”, which triggered a
fix involving multiple changes and requiring knowledge of the codebase. In another case,
the reviewer asks, “Would it be possible to add a style override for this button’s icon to
make the icon 12x12 pixel to match the designs (rather than the current 16x16px)?”, leading to user experience improvements. We consider these types of fixes non-generalizable strategies, given the domain particularities required to implement them, and classify them as Application-specific. Due to the scope of this study and the nature of this type of fix, this set is not further
detailed.

4.4.2 Generalizable Fix

This set gathers 218 out of 569 review threads, including fixes that can be adopted as reusable
strategies for reported Issues. For instance, when a reviewer reports “I think we prefer being
explicit: borderless: borderless === true” and it triggers a change to use strict
equals, we consider it a Generalizable Fix. Therefore, this set contains code fixes involving
the use of design patterns, code conventions, or project guidelines. Given the nature of
Generalizable Fixes as potentially reusable strategies, we further detailed their Semantics to
answer our second research question.
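
The quoted borderless case can be sketched as follows; the surrounding types and function names are hypothetical and only illustrate the reusable strategy of making the boolean intent explicit.

```typescript
interface PanelProps {
  borderless?: boolean | "auto";
}

// Before: the property is forwarded as-is, so non-boolean values may leak through.
function panelConfigBefore(props: PanelProps) {
  return { borderless: props.borderless };
}

// After: the Generalizable Fix uses an explicit strict-equality check,
// as suggested in the review thread ("borderless: borderless === true").
function panelConfigAfter(props: PanelProps) {
  return { borderless: props.borderless === true };
}
```
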
We classified the Semantics into 88 codes organized into 17 categories, as shown in Table 7.
Of the 218 review threads associated with a Generalizable Fix, six have two Semantics,
resulting in 224 occurrences. Make Code Expressive is the most common category (41/224,
18.30%) and refers to fixes about changing elements to make the code intent more explicit
and easy to understand. More than half of the threads in this category are related to the
code Rename Code Element (25/41, 60.98%), which usually involves a syntactic change
of renaming an element of the Review Target. For instance, it includes fixes triggered by
comments such as “Can you please give these variables readable names. Doing so will help
others to understand their purpose.” and “I think having ‘endpoint’ and ‘alert’ in the name
of this is confusing.”
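
As a hypothetical illustration of Rename Code Element, the change is purely syntactic: the values are untouched, only the names become descriptive.

```typescript
// Before: terse names that reviewers flagged as hard to understand.
const n = ["Home", "Docs", "About"];
const c = n.length;

// After: the same values with readable names (a behavior-preserving refactoring).
const headerNavItems = ["Home", "Docs", "About"];
const headerNavItemCount = headerNavItems.length;
```
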
Each code from the 17 categories of Semantic of Generalizable Fix is also classified based
on its Impact: Behavior Changes, which refers to changes in the internal structure of the code and its behavior; and Refactorings, which include code fixes associated with modifying
the internal structure of the code while preserving its behavior. Table 7 summarizes the
distribution of the Semantic categories according to their Impact. Three-quarters of the threads
in Generalizable Fix are related to Refactorings (168/224, 75%), mostly due to Make Code
Expressive and Increase Documentation categories. Behavior Change has fewer occurrences
and includes cases of Increase Type Safety, such as Replace Any Type (7/28, 25%).
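
A minimal, hypothetical sketch of Replace Any Type: an any-typed element receives a specific type, the kind of fix also observed for Weak Typing issues.

```typescript
// Before: "any" silently accepts unsuitable values.
function getAlertTitleBefore(alert: any): string {
  return alert.title;
}

// After: the fix replaces "any" with a specific type for the code element.
interface Alert {
  title: string;
  severity: "low" | "high";
}

function getAlertTitleAfter(alert: Alert): string {
  return alert.title;
}
```
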

5 Results

5.1 RQ1: What are the Coding Issues Reported and Discussed During Code Reviews in TypeScript Projects?

Our results show that code review discussions often report some problem in the code change,
commonly a Coding Issue. In our research dataset, it occurs in 442 out of 569 threads
(77.68%). Usually, the code-level problem found by reviewers concerns future development
efforts and is related to a sub-optimal or unnecessary decision made by the author. In turn,
code-level problems that affect the runtime behavior are less frequent and related to an
inadequate decision during the coding process. Table 8 shows the distribution of review
threads classified as Coding Issues into the groups of our taxonomy.

Table 7 Distribution of the number of codes (#C), number of review threads (#T), and percentage of review
threads (%T) of the semantic categories of generalizable code fixes
Semantic Categories Impact
Categories Behavior Refactoring
#C #T %T #C #T #C #T

Make Code Expressive 7 41 18.30 1 1 6 40
Increase Documentation 3 28 12.50 0 0 3 28
Adopt Consistent Syntax 10 22 9.82 0 0 10 22
Increase Type Safety 9 19 8.48 5 14 4 5
Simplify Code 10 19 8.48 0 0 10 19
Make Code Defensive 7 13 5.80 4 8 3 5
Extract Reusable Element 4 11 4.91 0 0 4 11
Change Data Model 4 10 4.46 2 6 2 4
Adjust Asynchrony 4 9 4.02 4 9 0 0
Code Cleanup 5 8 3.57 1 1 4 7
Reassign Responsibility 3 8 3.57 0 0 3 8
Adjust Conditional Flow 4 7 3.13 3 3 1 4
Fix Recurrent Mistakes 4 7 3.13 4 7 0 0
Optimize Resources Usage 4 7 3.13 4 7 0 0
Make Code Flexible 4 6 2.68 0 0 4 6
Reorganize Code 4 6 2.68 0 0 4 6
Decrease Coupling 2 3 1.34 0 0 2 3
Total 88 224 100 28 56 60 168
The bold was used to highlight the total count when quantities were presented


Table 8 Distribution of review threads on categories of Topic, Review Target, categories of Coding Issue,
and Type of Coding Issue

We highlighted the highest value per grouping in each row


The bold was used to highlight main categories when there is a hierarchy

Evolvability Coding Issues affecting future development efforts include overcoding, doc-
umentation, and naming issues, often related to code quality attribute discussions, such as
Coding Style and Code Reuse. They frequently refer to sub-optimal coding that can be
improved for better maintainability. Bad Naming problems (n = 34) commonly are linked to
the code statements in which the element is declared, such as a variable, function, or method
signature (11/34, 32.35%). Poor Documentation (n = 32) is often linked to code comments
(10/32), but also assignment expression (7/32), function (5/32), or variable declaration (5/32).
Interestingly, Bad Syntax or Formatting issues (n = 28) are often reported in code comments
(12/28) due to a lack of adherence to standard definitions for displaying information.
Considering the Functional Coding Issues, i.e., those that affect the runtime behavior, they
are more likely to be from Mistaken Implementation and Bad Conditional Logic categories.
Commonly, these problems refer to using a different code element in the Review Target,
such as testing an unsuitable condition, invoking the inadequate call expression, or assigning
the wrong value. We can observe that this kind of feedback requires more specific domain expertise to identify the proper code elements for the target situation. Overall, Functional
Coding Issues are related to Requirements Satisfaction and Code Invocation discussions,
indicating concern about the correctness of code execution. Still, we also found recurrent
discussions on Typing and Testing due to Inadequate Typing and Wrong Configuration,
respectively.
The least common categories in our analysis are Hard-Coded Information, Mismatching
Implementation, and Poor Resource Management. Poor Resource Management, the least
frequent, refers to poor administration of resources such as memory or runtime. Although
these issues are mainly related to Sub-optimal Coding, we also have Inadequate Coding issues,
such as Memory Leaks. We also identified a few Coding Issues related to the application’s
cross-cutting concerns. For instance, we found no security-related Coding Issues in our
analyzed review threads and only a few error-handling cases.
Among the review threads reporting a problem, Conceptual Design Issues occur in 57
out of 569 threads (10.08%). As mentioned, these issues are related to definitions of the
application’s design, interface, testing, or requirements. UI Text Language issues, the most
frequent in this subgroup (21/57), refer to definitions of textual elements of the application
and are often reported in syntax elements in which a string value can be assigned, such as
object property declaration (7/21). In our analyzed projects, these object property declarations
are object literals, a name-value structure used to facilitate the manipulation of application
properties, such as messages or labels to be shown to the user.
As observed in previous studies, some review discussions do not point out a code problem.
By analyzing the topic of these conversations, we found they focus on social interactions,
requirements satisfaction, and the usefulness of the target code and often involve a variable
declaration with initialization or a call expression. Two studies (Bacchelli and Bird 2013;
Spadini et al. 2018) found social communication a common topic in code reviews, and one
study (Pascarella et al. 2018) specifically investigated information needs by reviewers, which
include asking for correct understanding and necessity, for instance.

Answer to RQ1: In TypeScript projects, code reviewers are more likely to find
Coding Issues. Specifically, they report Coding Issues that may impact future devel-
opment efforts rather than functional defects. Internal code quality aspects related to
style, reuse, and evolution drive the Topic of most review discussions. Although less
frequent, Coding Issues that affect functionality still occur more often than Concep-
tual Design Issues and commonly refer to mistaken implementation due to inadequate
coding.

5.2 RQ2: How do Developers Address the Coding Issues Reported by Reviewers in
TypeScript Projects?

Given the analyzed review threads in which an issue was found (498/569, 87.52%), our
results show that most of them trigger an actual review change (414/498, 83.13%). For both
Conceptual Design Issues and Coding Issues found, 82.46% (47/57) and 83.03% (367/442)
have a Code Fix, respectively. Conceptual Design Issues found by reviewers are more likely
to trigger an Application-specific Fix (41/57, 71.93%), which aligns with the definitions of
these groupings since both refer to domain-specific aspects. Coding Issues, in turn, are more often related to Generalizable Fixes (212/442, 47.96%), especially triggering refactoring changes (161/212, 75.94%). Table 9 shows the distribution of review threads on Issue and
Code Fix.
Some categories of Coding Issues are more likely to trigger an Application-specific Fix.
In 4 out of 14 categories, i.e., Mistaken Implementation, Bad Syntax or Formatting, Bad
Conditional Logic, and Mismatching Implementation, more than half of the review threads
triggered a non-generalizable fix strategy that requires domain-specific knowledge. These
categories contain problems related to inadequate or incomplete coding, considering the
requirements or other parts of the code base, which aligns with the definition of Application-
specific fixes.
Generalizable Fixes are equally or more likely to be triggered to address the review
feedback in the remaining Coding Issues categories. Refactorings often address Coding
Issues that trigger a Generalizable Fix. Some associations are expected and more likely to
occur, such as Poor Documentation issues addressed by Increase Documentation and Bad
Naming triggering fixes to Make Code Expressive. We also identified fixes to Increase Type
Safety often related to Fragile Code issues, i.e., code elements that are likely to break in
future implementations or that do not fail gracefully. It includes, for instance, cases of Weak
Typing and, as its code fix, the definition of a primitive or specific type for the code element instead of no type, any, or the falsy value previously defined.

Table 9 Distribution of review threads on Issue and Code Fix

We highlighted the highest value per grouping in each row


The bold was used to highlight main categories when there is a hierarchy


Answer to RQ2: Most issues found by reviewers in TypeScript projects are addressed
by a review change. Feedback related to application design or domain-specific imple-
mentations often triggers code fixes that also require domain-specific knowledge to
be performed. Coding Issues are more likely to be addressed by generalizable strate-
gies, commonly involving refactoring changes.

6 Discussion

6.1 Comparison with Existing Classifications

Our multi-level taxonomy relates to existing work in several ways, as summarized in Table 10.
Given the group topic, we have three related studies (Li et al. 2017; Spadini et al. 2018;
Bacchelli and Bird 2013) that analyzed common themes in code review feedback. However,
since their classifications are coarser-grained than ours, most of their categories
are covered by our taxonomy. For instance, regarding conversations on the correct code
execution given the expected behavior, these studies found Code Correctness as a common
topic, while we found Behavior Preservation and Requirements Satisfaction. Similarly, these
studies identified Code Improvement, which could fit codes from our Code Evolution and
Coding Style subcategories, referring to the code’s inner quality. Therefore, our taxonomy
complements the existing classifications of MCR feedback topics by including fine-grained
codes.
More specific problems and improvements found during MCR have also been analyzed,
as previously discussed in this paper. The study of Gunawardena et al. (2023) is the most
recent at the time of this analysis, and its authors also compare their findings with a prior
classification (Panichella and Zaugg 2020) of review changes. Hence, we compare our codes
from the Issue group with the concern types of Gunawardena et al. (2023) classification (the
details are available in our supplementary material). From our 111 issue codes (except No
Issue), we could not fit 20 (18.02%) into their concern types, which refers to 46 out of 569
threads analyzed in this study (8.08%). Conversely, from their taxonomy, we could not match
40 out of 116 concern types (34.48%).
A type of issue observed in Gunawardena et al. (2023)’s work that our study did not
identify is related to Header Comments, more specifically, licensing header comments. These
comments are placed at the beginning of a file to indicate the licensing terms, conditions, and
legal aspects of code usage and distribution. Prior work found cases of incomplete license
header descriptions, missing license headers, and unconventional license header patterns.

Table 10 Comparison with existing studies providing classifications of code review outcomes

Taxonomy Group   Related Work                                    Comparison
                 Study                        Classification
Topic            Bacchelli and Bird (2013)    Common themes      Our taxonomy covers existing classifications.
                 Li et al. (2017)             Code review topics
                 Spadini et al. (2018)        Common themes
Issue            Gunawardena et al. (2023)    Concern types      We could fit around 82% of our codes.
                                                                 Further discussion in Section 6.1


Although our selected projects work with these header comments, our selected threads have
no issues related to licensing aspects, which might be due to our random selection process
of threads. Future work can further explore such aspects and whether these differences are
related to the selected data or characteristics of the target projects.
Overall, the classifications are very alike, but there are granularity differences in some
cases. For example, Gunawardena et al. (2023) have Better design exists, while our taxonomy
details design cases in Conceptual Design Issues and certain Coding Issues. In turn, we have
the Lack of Documentation and Poor Comment codes, aspects that are detailed further in their
classification. Furthermore, our taxonomy differentiates the issues found and fixed by code
review, which might be useful to explore alternatives for early identification of defects and
support for code fix execution.

6.2 Language Specific Code Review Feedback

Despite the popularity of JavaScript and TypeScript among practitioners, most studies of
code review feedback focused on Java projects (Beller et al. 2014; Gunawardena et al. 2023;
Panichella and Zaugg 2020). By defining TypeScript as the scope of this study, we expected
to complement the existing body of knowledge on MCR with a less explored programming
language while also being able to compare our findings to work on Java. Overall, most
recurrent issues found and fixed in our study are language-independent (e.g., documentation
and naming aspects), which aligns with previous research. However, we also noticed some
language-specific differences in comparison to previous studies.
Compared to Gunawardena et al. (2023), our study does not find problems related to
unnecessary or missing annotations (including meta-annotations used in programs). While
TypeScript and Java provide annotation mechanisms, they differ in how they define and
process them. Java has many popular frameworks that rely on annotations, e.g., Spring and
Hibernate, which can be processed at compile time or at runtime via the reflection API.
TypeScript provides decorators with similar purposes, which are mainly processed at runtime.
However, few TypeScript frameworks have explored this mechanism, which may explain the
lack of issues related to the usage of annotations in our study.
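As a minimal, hypothetical illustration of this mechanism (a sketch assuming TypeScript's legacy experimentalDecorators option; the identifiers are ours and not from the studied projects), a decorator can wrap a method at runtime:

// A hypothetical method decorator; requires "experimentalDecorators": true in tsconfig.json.
function logCall(_target: unknown, propertyKey: string | symbol, descriptor: PropertyDescriptor): void {
  const original = descriptor.value;
  descriptor.value = function (this: unknown, ...args: unknown[]) {
    console.log(`calling ${String(propertyKey)}`);  // behavior added at runtime
    return original.apply(this, args);
  };
}

class ReportService {
  @logCall
  generate(): string {
    return "report";
  }
}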
During our analysis, we also observed a slight difference related to Coding Issues in the use
of the programming language's capabilities for asynchronous implementations. While previous
studies have identified concurrency (Panichella and Zaugg 2020) and threads (Gunawardena
et al. 2023) as concerns during code review in Java projects, they have not found them
as frequent issues (only one thread concern was identified in Gunawardena et al. (2023)’s
study). In contrast, we found slightly more cases of such issues despite only a few instances
(eight review threads). Java and TypeScript differ in their approach to managing concurrency,
with distinct and non-comparable mechanisms to allow asynchronous procedures. Despite
the technical differences, we can focus on the asynchronous capability these languages pro-
vide and the review feedback related to them. In our study, all cases refer to incomplete or
inadequate implementations of Promises and async/await, JavaScript’s asynchronous
capabilities. For instance, “No need to return anything, this line is not needed. async
magic will do all for you” or “not missing an await here?”. Future research can further
explore the extent of these asynchronous implementation issues in TypeScript/JavaScript
projects and whether there is enough support to prevent them compared to Java.
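The hypothetical sketch below, inspired by the two quoted comments (the identifiers and the composite scenario are ours), contrasts the reported patterns with the expected async/await usage:

declare function writeFile(path: string, data: string): Promise<void>;

// "not missing an await here?": the returned promise is dropped, so failures
// are silently ignored and completion order is not guaranteed.
async function saveAll(paths: string[]): Promise<void> {
  for (const path of paths) {
    writeFile(path, "data");        // should be: await writeFile(path, "data");
  }
}

// "async magic will do all for you": inside an async function there is no need
// to chain .then or return the promise explicitly; awaiting is enough.
async function saveOne(path: string): Promise<void> {
  await writeFile(path, "data");
}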


6.3 Code Review as Evolvability-driven Practice

Prior studies compared the distribution of functional (that affect runtime behavior) and evolv-
ability defects (that affect future development efforts but not the runtime behavior). For both
issues found and fixed in code reviews, these studies obtained a 74:26 (Gunawardena et al.
2023) and 75:25 (Beller et al. 2014) ratio, respectively, with more evolvability defects. Given
the Coding Issues found by reviewers, our study obtained a ratio of 62:38, with slightly more
functional defects than previous studies. Considering the Coding Issues found with a Code
Fix, the distribution is similar, with a 60:40 ratio. By differentiating issues found and fixed,
our results support existing evidence that code reviews are more evolvability-driven in both
reviewers’ efforts and actual review changes. It also supports prior questioning of the value of
partial comparisons between code review and testing (Beller et al. 2014), since the former
involves the early identification of a significant amount of evolvability defects.

6.4 Early Identification of Issues

From practitioners’ perspective, the key motivations driving code reviews are often related
to early defect detection and code improvement (Bacchelli and Bird 2013; MacLeod et al.
2018a). Given an analysis of fine-grained issues found in code reviews from Java projects,
Gunawardena et al. (2023) found that around 22% of them could be automatically detected
by existing tools. Additionally, many studies have explored ways to free reviewers to look
for more subtle defects and improvements not detectable by existing tools (Tufano et al.
2022; Guo et al. 2019; Wang et al. 2019; Hong et al. 2022). Next, we explore directions to
support developers in the early identification of issues before submitting a code change for
code review.

6.4.1 Automatically Detected Coding Issues

Recent research has shown that automated tools or individuals with no programming expertise
could identify a significant number of issues reported by reviewers early on (Gunawardena
et al. 2023). Meanwhile, another study proposed an approach to automatically configure
static code analysis tools (SCATs) using data from code reviews, initially exploring its appli-
cation with CheckStyle, a popular SCAT (Zampetti et al. 2022). Both studies highlight the
potential for early defect detection through automation to reduce the cost of code review.
However, their analysis was limited to Java projects and tools designed to check Java code.
Inspired by these works, we conducted a follow-up analysis exploring to what extent exist-
ing tools for TypeScript code could help in the early identification of code-level problems
found by reviewers. We focused on Coding instead of Conceptual Design Issues because the
latter refers to solution definitions commonly not covered by code analysis tools. Given our
taxonomy and classification, we selected cases with a Coding Issue that triggered a Code Fix
(Application-specific or Generalizable fix). This resulted in a subset of 366 review threads
with both reviewing and fixing costs. Each review thread was analyzed considering five popu-
lar static tools for TypeScript and JavaScript code: SonarQube, ESLint, Prettier, ClangFormat,
and PMD. SonarQube and PMD were among the tools selected in a prior study (Gunawardena
et al. 2023) that work with TypeScript code. ESLint, Prettier, and ClangFormat are listed in
the documentation of our target projects as suggested tools for those who plan to make code
contributions. Besides, we considered Grammarly to check spell/grammar-related aspects.
Details of our analysis are available in our supplementary material.


For each review thread in our subset, we manually verified whether the reported coding
issue could be detected by some rule of the selected tools. For instance, the situation reported
by the review feedback “Can you remove this empty newline please?” (Unnecessary Line
Breaks of Bad Syntax and Formatting) could be detected early by Prettier’s rules of empty
lines. Similarly, the rule “The any type should not be used” of SonarQube could avoid
review comments like “please give styles a type other than any ” (Weak Typing of
Fragile Code). The detailed mapping is available in our supplementary material.
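As an illustration, a minimal ESLint configuration along the following lines could flag both situations before the review. This is a sketch using rules from ESLint core and the typescript-eslint plugin, not the actual configuration of the target projects:

{
  "parser": "@typescript-eslint/parser",
  "plugins": ["@typescript-eslint"],
  "rules": {
    "@typescript-eslint/no-explicit-any": "error",
    "no-multiple-empty-lines": ["error", { "max": 1 }]
  }
}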
From the 366 review threads, the early detection of around 18% of the Coding Issues could be
supported by existing tools. Such a finding is similar to those reported by Gunawardena et al.
(2023), who obtained 22% of automatically detected concerns. Table 11 summarizes the
distribution of the analyzed subset of review threads in categories of Coding Issue and how
many cases existing tools automatically detect. Around 63% of Bad Syntax and Formatting
and 47% of Fragile Code problems could be automatically detected using existing tools.
Although our data does not allow generalizations, our findings suggest the potential to
further explore approaches to support TypeScript and JavaScript coding and code review. For
instance, 27 out of 68 problems identified in our analysis could be automatically detected
by ESLint, and the tool even allows the creation of custom rules, which could increase the
number of problems detected. As explored by Mehrpour and LaToza (2023), an alternative
to increasing the early detection issues might be exploring the full potential of SCATs. In
their study, the researchers found that it is possible to increase the detection of defects by
better supporting the creation of project-specific rules. Future work using our taxonomy can
explore novel solutions or the potential of custom rules available in existing tools to improve
the early identification of coding issues in TypeScript code.

Table 11 Distribution of review threads with a Coding Issue that triggered a Code Fix, the number of cases
automatically detected, and the percentage of issues that can be identified early by existing tools

Coding Issue Category                Coding Issue & Fix   Tool Detected
                                     # Threads            # Threads    %
Bad Syntax or Formatting             27                   17           62.96
Fragile Code                         21                   10           47.62
Misuse of Language Features          17                   7            41.18
Excessive Complexity                 20                   6            30.00
Hard-Coded Information               15                   4            26.67
Inappropriate Code Organization      31                   5            16.13
Overcoding                           47                   6            12.77
Poor Documentation                   36                   4            11.11
Bad Conditional Logic                29                   3            10.34
Bad Naming                           27                   2            7.41
Mismatching Implementation           14                   1            7.14
Inappropriate Element Relationship   17                   1            5.88
Mistaken Implementation              72                   2            2.78
Poor Resource Management             9                    0            0.00
Total                                382*                 68*
*From 366 review threads of Coding Issues, 16 have two Coding Issues, leading to more occurrences than
threads


6.4.2 Other Alternatives to Early Detection of Coding Issues

Our study found Poor Documentation, Bad Naming, and Overcoding are recurrent Evolvabil-
ity Coding Issues. Based on the aforementioned analysis, only a few cases of these categories
could be identified early by existing tools, possibly due to the subjectivity of defining rules
in these situations. For instance, some situations of insufficient documentation or meaning-
less identifier naming may require more programming expertise or project familiarity to be
identified. Besides, the problem or improvement might depend on the context of the specific
code change, its complexity, type, or impact. Therefore, other alternatives to prevent these
Coding Issues before the code review can be further explored.
Panichella and Zaugg (2020) investigated the perception of practitioners on automa-
tion needs in MCR. Most envisioned approaches could address fine-grained coding issues
identified in our study that are not automatically detected by existing tools. Considering code
comments (documentation), Hu et al. (2022) further explored developers’ perception of gen-
eration tools and compared it to the current state-of-the-art research. They identified 25
papers proposing techniques, but most fall short in generating comments that explain how to use
an element or why it exists, which are often the focus of reviewers’ feedback. For naming, Li
et al. (2020) presented a survey on the renaming of software entities, summarizing existing
approaches and identifying that renaming opportunities are often integrated with a recom-
mendation of new names. Considering duplicate code/clones, Ain et al. (2019) systematically
review detection approaches and highlight the need to develop solutions to collectively detect
all four types (exact, rename, near miss, and semantic) of clones, besides emphasizing the
need to consider programming languages other than Java and C/C++. Although these studies
do not specifically focus on TypeScript code, they provide valuable insights into developing
new approaches. Future work can further explore the use of the existing solutions for the
early detection of Coding Issues and also compare the proposed techniques with extensions
for integrated development environments, such as Readable 5 and jscpd 6 .
Another alternative for early detection of issues, including Functional Coding Issues, can
be defining a self-code review approach. Self-reviewing code is mentioned as a best practice
in the traditional and gray literature (Dong et al. 2021). Its goal is to support the developer
in creating and preparing a code change before the code review takes place. However, a
limitation we observed is the lack of a definition of what to review in one's own code and how,
beyond general guidelines related to best programming practices and the adoption of
SCATs. As demonstrated by Zampetti et al. (2022), code review history can be explored to
build a knowledge base for the early identification of issues. While their study combines the
review data with SCAT rules, other combinations can be further explored. For instance, based
on our taxonomy, software architecture rules or project conventions can be defined to prevent
Functional Coding Issues related to Mistaken Implementation and Bad Conditional Logic
that are not automatically detected by existing tools. Since the problem or improvement in
these cases might depend on the context and subjective checking, the knowledge base from
this combination can provide insights for a developer to perform a self-code review, including
how and what to verify before submitting a change to code review.

5 https://readable.so/
6 https://www.npmjs.com/package/jscpd


6.5 AI-based Programming Assistants as Support for Code Fix

Recent advances in artificial intelligence (AI) have powered generative AI-based tools, which
rapidly gained widespread adoption (Ebert and Louridas 2023). In the “Developer Survey
2023” by Stack Overflow, 44% of around 90K participants reported using AI-based tools in
software development (Stack Overflow 2023). According to a survey conducted by GitHub
(Inbal Shani 2023), 92% of 500 developers in the United States are currently using this
type of tool. Besides, the research community has been investigating how developers are
interacting with these tools (Bird et al. 2022; Liang et al. 2023) and their capabilities (Imai
2022; Moradi Dakhel et al. 2023; Nguyen and Nadi 2022; Yetiştiren et al. 2023). Overall, the
findings suggest the widespread adoption of AI-based tools as programming assistants, with
good results in terms of code generation support. Especially among code review practitioners,
a recent study (Davila et al. 2024) has observed a high interest in proposing new approaches
that explore generative AI to support the practice and suggesting adopting existing tools,
such as ChatGPT. Therefore, these tools can be further explored as a solution to prevent and
fix coding issues in the code review process.
In this direction, a recent study (Guo et al. 2023) explores the potential of ChatGPT in
code reviews. The researchers conducted an empirical study to understand the capabilities of this
tool on automated code refinement using an existing benchmark and its updated version (to
avoid data that may have been used in training ChatGPT models). The researchers explored
different configurations and provided the tool with the original code and review comments,
asking for a revised code based on the review. A possible implication of our study is to
provide structured guidance to studies on AI-based programming assistants in code review
tasks.
As discussed by Guo et al. (2023), selecting appropriate prompts for AI-based program-
ming assistants involves considering context, instruction, and output indicators (Akın 2023).
Our taxonomy can further facilitate the investigation of prompt strategies by providing a clas-
sification system. Given the analysis of Guo et al. (2023), prompt variations might include the
Review Target (context) and Issue (instruction) while asking for a Code Fix suggestion
(output indicator). Our Topic classification could be used to detail context description. In
future research on AI-based programming assistants for code review tasks, our taxonomy
can be used to explore recommendations of best practices for using these tools to handle
recurrent situations.
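For instance, a hypothetical prompt built from our taxonomy could read as follows (the wording is ours and only illustrates the structure):

  Context: the code change below modifies a call expression in a component;
    the review Topic is Coding Style.
  Instruction: the reviewer reported a Weak Typing issue (a parameter typed as any).
  Output indicator: return a revised snippet that applies an Increase Type Safety fix,
    preserving the original behavior.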

7 Threats to Validity

Construct validity While building our research dataset, we specified multiple criteria for
target projects and review threads. A different approach to filtering the projects could lead to
different results, e.g., changing the order to filter candidate projects hosted on GitHub (see
Section 3.2). Moreover, we focused on PRs with the general status closed and review status
as changes_requested (see Section 3.3). However, some PRs in our target projects may not be
merged or have review threads even without a change request status, a potential limitation of
our dataset. Using a Python script to obtain data from the GitHub API, we objectively evaluated
and focused on review threads that matched our inclusion criteria. However, we still noticed
a small percentage of “noisy” review threads, i.e., threads without a discussion of coding
aspects or unlikely to trigger code changes.


Sampling bias is another possible threat to our work, and to minimize it, we randomly
sampled review threads in our iterative analysis process. The number of threads in each
iteration was defined based on a pilot experience and the manual effort estimation to be
performed (see Section 3.4). We also tracked the evolution of the number of codes at the end of
each iteration (see Fig. 4). The number of selected review threads (i.e., 569) might not fully mitigate
this threat. Besides, due to the manual effort required in the analysis process, we stopped
our iterative process after the pilot and six iterations. Therefore, including further iterations
and analyzing more review threads could lead to different results, potentially threatening
our findings. Hence, we share our research data and provide supplementary materials to
encourage further studies to validate our results. Nevertheless, the similarities found when
comparing our findings with existing classifications (see Sections 6.1 and 6.3) suggest that
our data is reliable.

Internal validity We manually classified the review threads without involving the original developers or
reviewers, which might introduce researcher bias into our results. Moreover, this classification was
mainly performed by the first author, which ensures consistency but no comparison among
researchers. A single researcher conducting the classification is a practice observed in prior
studies, e.g., Gunawardena et al. (2023). To mitigate this internal threat in our study, we fol-
lowed a well-defined analysis procedure in an iterative process with periodic group meetings.
More specifically, each coding round was followed by a further inspection meeting with all
authors to check the classification. For instance, a code X created in a more recent round
could be more suitable for threads analyzed at the beginning of the manual process (when X
had not come up as a code yet). Each code and category was defined in the codebook with
one or more examples. The authors discussed and refined the definitions at the end of each
iteration. Also, once all categories and labels in the taxonomies had been defined, the authors
rechecked to ensure consistency, i.e., that instances were classified to the most proper code
and category.
Another possible threat to the internal validity of our study is the TypeScript version used
by the target projects. We started our research in 2019 and performed data collection in 2020.
At that time, we used the most recent data available. However, since then, new versions of
TypeScript have been released, and, as a language with a growing user base, new features
such as decorators and typing capabilities have been added. While most issues and fixes
identified in our analysis are language-independent (see Section 6.2), we understand that
recent versions of the language and new features may influence our results and should be
considered in future replication studies.

External validity Our study involves open-source projects based on a convenience sampling
of TypeScript repositories available on GitHub. While our goal is to investigate a phenomenon
in the particular context of TypeScript projects, our decisions may limit our findings’ gener-
alizability. To mitigate this threat, we focus on well-known projects that actively participate
in the open-source community and are sponsored by large companies. We also compared our
findings with existing classifications (see Sections 6.1 and 6.3) and found that they may apply
to many other scenarios. Further studies are needed to corroborate or revise our findings.

8 Conclusion

Our paper presents a qualitative study of code review discussions of TypeScript projects to
understand the issues reported by reviewers and the fixes triggered by them. More specifically,
we focus on review feedback with fine-grained discussions on technical decisions related
to coding aspects. We extracted and manually classified 569 review threads of four open-
source TypeScript projects obtained from GitHub. As a result, we provide a comprehensive
classification of code review discussions as a fine-grained taxonomy.
Our results highlight that reviewers in TypeScript projects frequently focus on reporting
Coding Issues. We found a ratio of 62:38 of Evolvability and Functional Coding Issues.
Evolvability Coding Issues, such as sub-optimal decisions on documentation, naming, and
overcoding, often trigger a Generalizable Fix (reusable fix strategies). Functional Coding
Issues, such as inadequate use of code elements, trigger Application-specific Fixes (require
more domain expertise and represent non-reusable strategies).
Based on our findings, we summarize potential research directions related to tool sup-
port for the frequent coding issues observed: (i) support for early detection of Evolvability
Coding Issues in TypeScript projects, exploring how existing or new approaches can help the
identification of recurrent problems; (ii) investigate the potential of approaches other than
static analysis tools to detect issues in TypeScript projects, such as self code review; and (iii)
investigate alternatives to support code fix implementation, such as exploring guidelines for
using AI-based programming assistants, e.g., GitHub Copilot or ChatGPT, using our taxon-
omy. Furthermore, future work can explore the taxonomy proposed in this study further. For
instance, studies can verify its applicability or approaches to partially (or entirely) automate
the classification of aspects observed in our study, reducing the manual effort and allowing
a large volume of data to be analyzed. Another opportunity to be investigated is how fine-
grained aspects discussed in code reviews relate to the merge status of the requests. Although
code review discussions of abandoned or accepted requests still represent a cost in software
development, a better understanding of the reasons for abandoning a PR after changes are
requested can contribute to the body of knowledge of code reviews.
Acknowledgements This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal
de Nível Superior - Brasil (CAPES) - Finance Code 001. Nicole Davila would like to thank CAPES for the
research grant ref. 88887.480572/2020-00. Igor Wiese thanks CNPq/MCTI/FNDCT #408812/2021-4 and
MCTIC/CGI/FAPESP #2021/06662-1.

Data Availability The research dataset, codebook from qualitative analysis, additional analysis, and scripts
used in this paper are available at https://doi.org/10.5281/zenodo.11357931.

Declarations

Conflicts of Interest The authors declare that they have no conflict of interest.

References
Ain QU, Butt WH, Anwar MW, Azam F, Maqbool B (2019) A systematic review on code clone detection.
IEEE Access 7:86121–86144. https://doi.org/10.1109/ACCESS.2019.2918202
Akın FK (2023) Awesome chatgpt prompts. https://github.com/f/awesome-chatgpt-prompts
Bacchelli A, Bird C (2013) Expectations, outcomes, and challenges of modern code review. In: 35th ICSE, pp
712–721. https://doi.org/10.1109/ICSE.2013.6606617. ISSN: 0270-5257
Badampudi D, Unterkalmsteiner M, Britto R (2023) Modern Code Reviews - A Survey of Literature and
Practice. ACM Trans Softw Eng Methodol. https://doi.org/10.1145/3585004
Baltes S, Ralph P (2022) Sampling in software engineering research: A critical review and guidelines. Empir
Softw Eng 27(4):1–31
Baum T, Liskin O, Niklas K, Schneider K (2016) Factors Influencing Code Review Processes in Industry. In:
24th International symposium on foundations of software engineering, ACM, New York, NY, USA, pp
85–96. https://doi.org/10.1145/2950290.2950323


Beller M, Bacchelli A, Zaidman A, Juergens E (2014) Modern code reviews in open-source projects: Which
problems do they fix? In: 11th MSR, ACM, New York, NY, USA, pp 202-211. https://doi.org/10.1145/
2597073.2597082
Bird C, Ford D, Zimmermann T, Forsgren N, Kalliamvakou E, Lowdermilk T, Gazit I (2022) Taking flight
with copilot: Early insights and opportunities of ai-powered pair-programming tools. Queue 20(6):35–57
Bosu A, Carver JC, Bird C, Orbeck J, Chockley C (2017) Process Aspects and Social Dynamics of Contempo-
rary Code Review: Insights from Open Source Development and Industrial Practice at Microsoft. IEEE
Trans Softw Eng 43(1):56–75. https://doi.org/10.1109/TSE.2016.2576451
Davila N, Melegati J, Wiese I (2024) Tales from the trenches: Expectations and challenges from practice for
code review in the generative ai era. IEEE Softw 41(6):38–4. https://doi.org/10.1109/MS.2024.3428439
Davila N, Nunes I (2021) A systematic literature review and taxonomy of modern code review.
J Syst Softw 177:110951. https://doi.org/10.1016/j.jss.2021.110951
Davila N, Nunes I, Wiese I (2023) Supplemental materials to “a fine-grained taxonomy of code review feedback
in typescript projects”. https://doi.org/10.5281/zenodo.11357931
Dong L, Zhang H, Yang L, Weng Z, Yang X, Zhou X, Pan Z (2021) Survey on Pains and Best Practices of Code
Review. In: 2021 28th Asia-pacific software engineering conference (APSEC), pp 482–491. https://doi.
org/10.1109/APSEC53868.2021.00055. ISSN: 2640-0715
Ebert C, Louridas P (2023) Generative ai for software practitioners. IEEE Softw 40(4):30–3. https://doi.org/
10.1109/MS.2023.3265877
Ebert F, Castor F, Novielli N, Serebrenik A (2021) An exploratory study on confusion in code reviews. Empir
Softw Eng 26:1–48
Fregnan E, Petrulio F, Di Geronimo L, Bacchelli A (2022) What happens in my code reviews? an investigation
on automatically classifying review changes. Empir Softw Eng 27(4). https://doi.org/10.1007/s10664-
021-10075-5
Gibbs GR (2007) Thematic coding and categorizing. In: Analyzing qualitative data, SAGE Publications Ltd.,
London
GitHub (2023) Octoverse: the state of open source and rise of ai in 2023. https://github.blog/2023-11-08-the-
state-of-open-source-and-ai/
Gunawardena S, Tempero E, Blincoe K (2023) Concerns identified in code review: A fine-grained, faceted
classification. Inf Softw Technol 153:107054. https://doi.org/10.1016/j.infsof.2022.107054
Guo B, Kwon YW, Song M (2019) Decomposing composite changes for code review and regression test
selection in evolving software. J Comput Sci Technol 34(2):416–436. https://doi.org/10.1007/s11390-
019-1917-9
Guo Q, Cao J, Xie X, Liu S, Li X, Chen B, Peng X (2023) Exploring the potential of chatgpt in automated
code refinement: an empirical study. arXiv:2309.08221
Han X, Tahir A, Liang P, Counsell S, Blincoe K, Li B, Luo Y (2022) Code smells detection via modern code
review: a study of the openstack and qt communities. Empir Softw Eng 27(6):127
Hong Y, Tantithamthavorn C, Thongtanunam P, Aleti A (2022) Commentfinder: A simpler, faster, more accurate
code review comments recommendation. In: 30th ESEC/FSE, ACM, New York, NY, USA, pp 507-519.
https://doi.org/10.1145/3540250.3549119
Hu X, Xia X, Lo D, Wan Z, Chen Q, Zimmermann T (2022) Practitioners’ expectations on automated code
comment generation. In: 44th ICSE, pp 1693–1705
Imai S (2022) Is github copilot a substitute for human pair-programming? an empirical study. In: 2022
IEEE/ACM 44th international conference on software engineering: companion proceedings (ICSE-
Companion), pp 319–321. https://doi.org/10.1145/3510454.3522684
Inbal Shani GS (2023) Survey reveals ai’s impact on the developer experience. https://github.blog/2023-06-
13-survey-reveals-ais-impact-on-the-developer-experience/
Jiang J, Lv J, Zheng J, Zhang L (2021) How developers modify pull requests in code review. IEEE Trans
Reliab 1–15. https://doi.org/10.1109/TR.2021.3093159
Li ZX, Yu Y, Yin G, Wang T, Wang HM (2017) What are they talking about? analyzing code reviews in pull-
based development model. J Comput Sci Technol 32(6):1060–1075. https://doi.org/10.1007/s11390-
017-1783-2
Liang JT, Yang C, Myers BA (2023) A large-scale survey on the usability of ai programming assistants:
successes and challenges. arXiv:2303.17125
Li G, Liu H, Nyamawe AS (2020) A survey on renamings of software entities. ACM Comput Surv 53(2).
https://doi.org/10.1145/3379443


MacLeod L, Greiler M, Storey M, Bird C, Czerwonka J (2018a) Code reviewing in the trenches: challenges
and best practices. IEEE Softw 35(4):34–42. https://doi.org/10.1109/MS.2017.265100500
MacLeod L, Greiler M, Storey MA, Bird C, Czerwonka J (2018b) Code Reviewing in the Trenches: chal-
lenges and Best Practices. IEEE Softw 35(4):34–42. https://doi.org/10.1109/MS.2017.265100500
Mäntylä MV, Lassenius C (2008) What types of defects are really discovered in code reviews? IEEE Trans
Softw Eng 35(3):430–448
McDonald N, Schoenebeck S, Forte A (2019) Reliability and inter-rater reliability in qualitative research:
Norms and guidelines for cscw and hci practice. Proc ACM Hum-Comput Interact 3(CSCW). https://doi.org/10.1145/3359174
Mehrpour S, LaToza TD (2023) Can static analysis tools find more defects? a qualitative study of design rule
violations found by code review. Empir Softw Eng 28(1):5
Moradi Dakhel A, Majdinasab V, Nikanjam A, Khomh F, Desmarais MC, Jiang ZM (2023) Github copilot ai pair
programmer: asset or liability? J Syst Softw 111734. https://doi.org/10.1016/j.jss.2023.111734
Nguyen N, Nadi S (2022) An empirical evaluation of github copilot’s code suggestions. In: 19th MSR, pp 1–5.
https://doi.org/10.1145/3524842.3528470
Panichella S, Zaugg N (2020) An empirical investigation of relevant changes and automation needs in modern
code review. Empir Softw Eng 25(6):4833–4872. https://doi.org/10.1007/s10664-020-09870-3
Pascarella L, Spadini D, Palomba F, Bruntink M, Bacchelli A (2018) Information needs in contemporary code
review. Proc ACM Hum-Comput Interact 2(CSCW). https://doi.org/10.1145/3274404
Sadowski C, Söderberg E, Church L, Sipko M, Bacchelli A (2018) Modern code review: a case study at google.
In: 40th ICSE: software engineering in practice, ACM, New York, NY, USA, pp 181–190. https://doi.
org/10.1145/3183519.3183525
Siy H, Votta L (2001) Does the modern code inspection have value? In: Proceedings IEEE international
conference on software maintenance. ICSM 2001, IEEE, pp 281–289
Spadini D, Aniche M, Storey MA, Bruntink M, Bacchelli A (2018) When testing meets code review: why and
how developers review tests. In: 40th ICSE, ACM, New York, NY, USA, pp 677–687. https://doi.org/10.
1145/3180155.3180192
Stack Overflow (2023) Stack overflow developer survey 2023. https://survey.stackoverflow.co/2023
Thongtanunam P, McIntosh S, Hassan AE, Iida H (2015) Investigating code review practices in defective files:
an empirical study of the qt system. In: 12th MSR, IEEE Press, pp 168–179. https://doi.org/10.1109/
MSR.2015.23
Tufano R, Masiero S, Mastropaolo A, Pascarella L, Poshyvanyk D, Bavota G (2022) Using pre-trained models
to boost code review automation. In: 44th ICSE, ACM, New York, NY, USA, pp 2291-2302. https://doi.
org/10.1145/3510003.3510621
Tufano R, Pascarella L, Tufano M, Poshyvanyk D, Bavota G (2021) Towards automating code review activities.
In: 43rd ICSE, pp 163–174. https://doi.org/10.1109/ICSE43902.2021.00027
Wang M, Lin Z, Zou Y, Xie B (2019) CoRA: decomposing and describing tangled code changes for reviewer.
In: 2019 34th IEEE/ACM international conference on automated software engineering (ASE), pp 1050–
1061. https://doi.org/10.1109/ASE.2019.00101
Yang C, Zhang X, Zeng L, Fan Q, Yin G, Wang H (2017) An empirical study of reviewer recommendation in
pull-based development model. In: 9th Asia-pacific symposium on internetware, ACM, New York, NY,
USA, Internetware’17, pp 14:1–14:6. https://doi.org/10.1145/3131704.3131718
Yetiştiren B, Özsoy I, Ayerdem M, Tüzün E (2023) Evaluating the code quality of ai-assisted code generation
tools: an empirical study on github copilot, amazon codewhisperer, and chatgpt. arXiv:2304.10778
Zampetti F, Mudbhari S, Arnaoudova V, Di Penta M, Panichella S, Antoniol G (2022) Using code reviews to
automatically configure static analysis tools. Empir Softw Eng 27(1):28
Zanaty FE, Hirao T, McIntosh S, Ihara A, Matsumoto K (2018) An empirical study of design discussions in
code review. In: 12th ESEM, ACM, New York, NY, USA. https://doi.org/10.1145/3239235.3239525

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and applicable
law.
