Using Large-Scale Heterogeneous Graph Representation Learning For Code Review Recommendations at Microsoft
Abstract—Code review is an integral part of any mature software development process, and identifying the best reviewer for a code change is a well-accepted problem within the software engineering community. Selecting a reviewer who lacks expertise and understanding can slow development or result in more defects. To date, most reviewer recommendation systems rely primarily on historical file change and review information; those who changed or reviewed a file in the past are the best positioned to review it in the future. We posit that while these approaches are able to identify and suggest qualified reviewers, they may be blind to reviewers who have the needed expertise and have simply never interacted with the changed files before. Fortunately, at Microsoft, we have a wealth of work artifacts across many repositories that can yield valuable information about our developers. To address the aforementioned problem, we present CORAL, a novel approach to reviewer recommendation that leverages a socio-technical graph built from the rich set of entities (developers, repositories, files, pull requests (PRs), work items, etc.) and their relationships in modern source code management systems. We employ a graph convolutional neural network on this graph and train it on two and a half years of history on 332 repositories within Microsoft. We show that CORAL is able to model the manual history of reviewer selection remarkably well. Further, based on an extensive user study, we demonstrate that this approach identifies relevant and qualified reviewers whom traditional reviewer recommenders miss, and that these developers desire to be included in the review process. Finally, we find that "classical" reviewer recommendation systems perform better on smaller (in terms of developers) software projects while CORAL excels on larger projects, suggesting that there is "no one model to rule them all."

I. INTRODUCTION

Code review (also known as pull request review) has become an integral process in software development, both in industrial and open source development [1], [2], [3], and all code hosting systems support it. Code reviews facilitate knowledge transfer, help to identify potential issues in code, and promote discussion of alternative solutions [4]. Modern code review is characterized by asynchronous review of changes to the software system, facilitated by automated tools and infrastructure [4].

As code review inherently requires expertise and prior knowledge, many studies have noted the importance of identifying the "right" reviewers, which can lead to faster turnaround, more useful feedback, and ultimately higher code quality [5], [6]. Selecting the wrong reviewer slows down development at best and can lead to post-deployment issues.

In response to this finding, a vibrant line of code reviewer recommendation research has emerged, to great success [7], [8], [9], [10], [11], [12], [13], [14]. Some of these have, in fact, even been put into practice in industry [15].

All reviewer recommender approaches that we are aware of rely on historical information about changes and reviews. The principle underlying these is that the best reviewers of a change are those who have previously authored or reviewed the files involved in the review. While recommenders that leverage this idea have proven to be valid and successful, we posit that they may be blind to qualified reviewers who have never interacted with these files in the past, especially as the number of developers in a project grows.

We note that there is a wealth of additional recorded information in software repositories that can be leveraged to improve reviewer recommendation and address this weakness. Specifically, we assert that incorporating information around interactions between code contributors as well as the semantics of code changes and their descriptions can help identify the best reviewers.

* Work performed while at Microsoft Research; Equal contribution
Fig. 1: CORAL architecture
• All the text nodes that appear in a pull request title or description and work item title or description are linked to the respective pull requests. All the text nodes that appear in a file name are linked to the file nodes.

• Text nodes are linked to each other based on their co-occurrence in the pull request corpus. Pointwise Mutual Information (PMI) [25] is a common measure of the strength of association between two terms:

    PMI(x, y) = \log \frac{p(x, y)}{p(x) \, p(y)}    (1)

The formula is based on maximum likelihood estimates: when we know the number of observations for token x, o(x), the number of observations for token y, o(y), and the size of the corpus N, the probabilities for the tokens x and y, and for the co-occurrence of x and y, are calculated by:

    p(x) = \frac{o(x)}{N}, \quad p(y) = \frac{o(y)}{N}, \quad p(x, y) = \frac{o(x, y)}{N}    (2)

The term p(x, y) is the probability of observing x and y together.
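To make Equations 1 and 2 concrete, the following is a minimal sketch of computing PMI from raw co-occurrence counts, assuming tokenized pull request titles and a fixed co-occurrence window like the one in Table II (row 5); the function and variable names are illustrative, not from the CORAL implementation.

```python
import math
from collections import Counter

def pmi_scores(sentences, window=5):
    """Compute PMI for token pairs co-occurring within a fixed window.

    `sentences` is a list of token lists (e.g., tokenized PR titles).
    Returns a dict mapping (x, y) -> PMI(x, y) per Equations 1 and 2.
    """
    token_counts, pair_counts = Counter(), Counter()
    n = 0  # corpus size N (total token observations)
    for tokens in sentences:
        token_counts.update(tokens)
        n += len(tokens)
        # count unordered co-occurrences within the window
        for i, x in enumerate(tokens):
            for y in tokens[i + 1 : i + window]:
                pair_counts[tuple(sorted((x, y)))] += 1

    scores = {}
    for (x, y), oxy in pair_counts.items():
        p_x, p_y, p_xy = token_counts[x] / n, token_counts[y] / n, oxy / n
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores

# Example over two toy pull request titles.
print(pmi_scores([["fix", "null", "check"], ["add", "null", "check"]]))
```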
C. Scale

The Socio-technical graph is built using the software development activity data from 332 repositories. We ingest data starting from 1st January, 2019, or from when the first pull request is created in a repository (whichever is older). The graph is refreshed three times a day. During the refresh we perform two operations:

Insert: Ingest new pull requests, work items, and code review information across all 332 repositories by creating the corresponding nodes, edges, and properties.

Update: Update the word tokens connected to nodes, if there are changes. We also update the edges between nodes to reflect changes in the source data.

The Socio-technical graph contains 5,858,834 nodes and 23,803,053 edges. Detailed statistics of node and edge types can be found in Table I.

TABLE I: Distribution of node and edge types in the Socio-technical graph

Element type   Label          Count
Node           pull request   1,342,821
Node           work item        542,866
Node           file           2,809,805
Node           author            18,001
Node           reviewer          30,585
Node           text           1,104,427
Total (nodes)                 5,858,834
Edge           creates        1,342,821
Edge           reviews        7,066,703
Edge           contains       1,342,821
Edge           changes       12,595,859
Edge           parent of        148,422
Edge           linked to      1,252,901
Edge           comments on       53,506
Total (edges)                23,803,053
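As a concrete (toy) illustration of the node and edge types in Table I, the snippet below builds a small typed graph with networkx; the entity names are invented for illustration, and the production graph is of course stored and queried at a very different scale.

```python
import networkx as nx

# Toy slice of the socio-technical graph using node and edge types from
# Table I. Entity names are invented; the real graph holds ~5.9M nodes.
g = nx.MultiDiGraph()
g.add_node("alice", kind="author")
g.add_node("bob", kind="reviewer")
g.add_node("PR 1", kind="pull request")
g.add_node("src/app.py", kind="file")
g.add_node("WI 7", kind="work item")

g.add_edge("alice", "PR 1", kind="creates")
g.add_edge("bob", "PR 1", kind="reviews")
g.add_edge("PR 1", "src/app.py", kind="changes")
g.add_edge("PR 1", "WI 7", kind="linked to")

# e.g., list all files changed by a pull request
files = [v for _, v, d in g.out_edges("PR 1", data=True)
         if d["kind"] == "changes"]
print(files)  # ['src/app.py']
```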
IV. REVIEWER RECOMMENDATION VIA GRAPH NEURAL NETWORKS

Reviewing a pull request is a collaborative effort. Good reviewers are expected to write good code review comments that help improve the quality of the code and thus shape a good product. To achieve this, a good reviewer needs to be 1) familiar with the feature that is implemented in the pull request, 2) experienced in working with the source code and the files that are modified by the pull request, 3) a good collaborator with others in the team, and 4) actively involved in creating and reviewing related pull requests in the repository. Hence a machine learning algorithm that recommends reviewers for a pull request needs to model these complex interaction patterns to produce a good recommendation. Feature learning via embedding generation has shown good promise in the literature for capturing complex patterns in data [26], [27], [28], [29]. Hence in this work we pose the reviewer recommendation problem as ranking reviewers by similarity scores between users and pull requests in the embedding space. In the rest of this section we give details on learning embeddings for pull requests and users along with other entities (such as files and word tokens), and on scoring top reviewers for a new pull request using the learned embeddings.

The socio-technical graph shown in Figure 2 has the essential ingredients to model the characteristics of a good reviewer: 1) the user - pull request - token path in the graph associates a user with a set of words that characterize the user's familiarity with one or more topics; 2) the user - pull request - file path associates a user with a set of files that the user authors or reviews; 3) the user - pull request - user path characterizes the collaboration between people in a project; and 4) the pull request - user - pull request path characterizes users working on related pull requests. Essentially, by envisioning software development activity as an interaction graph of various entities, we are able to capture interesting and complex relations and patterns in the system. We aim to encode all these complex interactions into entity embeddings using a Graph Neural Network (GNN) [18]. These embeddings are then used as features to predict the most relevant reviewers for a pull request. In Figure 1 this is depicted as steps 2 and 3.

A. Graph Neural Network Architecture

Graph Convolutional Network (GCN) [30] (which is a form of GNN) has shown great success in the machine learning community for capturing complex relations and interaction patterns in a graph through node embedding learning. In GCN, for each node, we aggregate the feature information from all its neighbors and, of course, the feature of the node itself. During this aggregation, neighbors are weighted as per the edge (relation) weight.
A common approach that has been used effectively in the literature is to weigh the edges using a symmetric-normalization approach. Here we normalize the edge weight by the degrees of both nodes connected by the edge. The aggregated feature values are then transformed and fed to the next layer. This procedure is repeated for every node in the graph. Mathematically it can be represented as follows:

    h_u^{(k)} = \sigma\Big( \sum_{v \in N(u) \cup \{u\}} \frac{W^{(k-1)} h_v^{(k-1)}}{\sqrt{|N(u)| \, |N(v)|}} \Big)    (3)

where h_u^{(k)} is the embedding of node u in the k-th layer; h^{(0)} is the initial set of node features, which can be set to one-hot vectors if no other features are available; N(u) is the set of neighbors of node u; W^{(k)} is the feature transformation weight matrix for the k-th step (learned via training); and \sigma is the activation function (such as ReLU [31]). Note that symmetric normalization is achieved by dividing by \sqrt{|N(u)| \, |N(v)|}.
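A minimal dense NumPy sketch of the layer in Equation 3, assuming a small adjacency matrix; a production implementation would use sparse operations and a GNN library rather than this illustrative form.

```python
import numpy as np

def gcn_layer(A, H, W, act=lambda x: np.maximum(x, 0)):
    """One GCN layer as in Equation 3 (symmetric normalization).

    A: (n, n) adjacency matrix; H: (n, d_in) embeddings h^(k-1);
    W: (d_in, d_out) learned weights W^(k-1); act: sigma (ReLU here).
    """
    A_hat = A + np.eye(A.shape[0])            # include the self node: N(u) ∪ {u}
    deg = A_hat.sum(axis=1)                   # node degrees |N(u)|
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    # divide each edge contribution by sqrt(|N(u)| |N(v)|)
    return act(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)
```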
GCN learns node embeddings from a homogeneous graph with a single node type and a single relation type. However, the pull request graph in Figure 2 is a heterogeneous graph with different node types and different relation types between them. In this case, inspired by RGCN [30], for each node we aggregate the feature information separately for each type of relation. Mathematically it can be represented as follows:

    h_u^{(k)} = \sigma\Big( \sum_{r \in R} \sum_{v \in N_r(u)} \frac{W_r^{(k-1)} h_v^{(k-1)}}{\sqrt{|N_r(u)| \, |N_r(v)|}} + W_0^{(k-1)} h_u^{(k-1)} \Big)    (4)

where R is the set of relations; N_r(u) is the set of neighbors of u having relation r; W_r^{(k)} is the relation-specific feature transformation weight matrix for the k-th layer; and W_0^{(k)} is the feature transformation weight matrix for the self node.

The set of relations R captures the semantic relatedness of different types of nodes in the graph. This is generally determined by domain knowledge. For CORAL we identified a set of useful relations, listed in Table II.

In our experiments, we use a 2-layer GCN network, i.e., we set k = 2 in Equation 4. With this, GCN can capture second-order relations such as User-User, File-File, User-File, and User-Word, which we believe are useful in capturing interesting dependencies between various entities, such as related files, related users, files authored/modified by users, and words associated with users. While setting k to a higher value can fold in longer-distance relations, it is not clear whether that helps or brings more noise. We leave that exploration to future work.
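The per-relation aggregation of Equation 4 can be sketched in the same dense style, with one adjacency matrix per relation plus a self-transformation; again this is an illustrative version under the same assumptions as the sketch above, not the CORAL code.

```python
import numpy as np

def rgcn_layer(adj, H, W_rel, W_self, act=lambda x: np.maximum(x, 0)):
    """One heterogeneous layer as in Equation 4.

    adj: dict mapping relation r -> (n, n) adjacency matrix;
    W_rel: dict mapping r -> (d_in, d_out) weights W_r; W_self: W_0.
    """
    out = H @ W_self                           # self term W_0 h_u^(k-1)
    for r, A in adj.items():
        deg_u = A.sum(axis=1)                  # |N_r(u)|
        deg_v = A.sum(axis=0)                  # |N_r(v)|
        norm = np.sqrt(np.outer(deg_u, deg_v))
        norm[norm == 0] = 1.0                  # guard isolated nodes (0/0)
        out += (A / norm) @ H @ W_rel[r]       # per-relation aggregation
    return act(out)
```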
B. Training the Model

To learn the parameters of the model (i.e., W_r^{(\cdot)} and W_0^{(\cdot)}), we pose it as a link prediction problem. Here, we set the probability of existence of a link/edge between two nodes u and v as proportional to the dot product between their embeddings derived from the 2-layer GCN. In particular, we set the link probability equal to \sigma(z_u^T z_v). Here, \sigma denotes the logistic function, and z_u, z_v denote the embeddings of nodes u, v respectively (i.e., z_u = h_u^{(2)}, z_v = h_v^{(2)} from Equation 4). This probability is high when the nodes u and v are connected in the graph and low when they are not. Accordingly, we prepare a training data set D containing records of triplets (u, v, y), where (u, v) are node pairs and y ∈ {0, 1} denotes the presence or absence of an edge between u and v. Since there can be a very large number of node pairs (u, v) where u and v are not connected, we employ random sampling to select a sizable number of such pairs. The training objective is to minimize the cross-entropy loss L in Equation 5:

    L = -\frac{1}{|D|} \sum_{(u,v,y) \in D} \Big[ y \log \sigma(z_u^T z_v) + (1 - y) \log\big(1 - \sigma(z_u^T z_v)\big) \Big]    (5)

Minimizing this loss enforces the dot product of the embeddings of nodes u and v to attain a high value when they are connected by an edge in the graph (i.e., when y = 1), and a low value when they are not (i.e., when y = 0). The parameters of the model are updated as the training progresses to minimize the loss. We stop training when the loss stops decreasing (or the decrease becomes negligible).
C. Inductive Recommendation for New Pull Requests

GCN by design is a transductive model. That is, it can generate embeddings only for the nodes that are present in the graph during training. It cannot generate embeddings for new nodes without adding those nodes to the graph and retraining. Inductive models, on the other hand, can infer embeddings for new nodes that were unseen during training by applying the learned model to them. Since CORAL is a GCN-based model, we will not have an embedding for a new pull request u' at inference time. We need to derive the embedding for u' on the fly by applying Equation 4. The challenge in deriving the embedding is in getting the correct self embedding for u'. That is, as per Equation 4, to generate h_{u'}^{(2)} we need the trained W_0^{(0)} and W_0^{(1)}, which are not available for new nodes. Hence we approximate the embedding of the new node by ignoring its self-embedding term in Equation 4, which leads to the following approximation:

    z_{u'} = h_{u'}^{(2)} = \sigma\Big( \sum_{r \in R} \sum_{v \in N_r(u')} \frac{W_r^{(1)} h_v^{(1)}}{\sqrt{|N_r(u')| \, |N_r(v)|}} \Big)    (6)

Here, z_{u'} is the embedding of the new pull request u'; R is the set of relations involving the pull request node (i.e., PullRequest-User, PullRequest-File, and PullRequest-Word); W_r^{(1)} are the trained model weights from the second layer of the GCN; and h_v^{(1)} are the embeddings coming out of the first layer of the GCN.
TABLE II: Relations (R) used for generating embeddings
#   Relation             Semantic Description
1 PullRequest - User Captures the author or reviewer relationship between a pull request and a user
2 PullRequest - File Captures the related file modification needed for a pull request
3 PullRequest - Word Captures the semantic description of a pull request through the words
4 File - Word Captures the semantic description of a file.
5 Word - Word Captures the related words in a window of size 5 in a sentence (in the pull request title/description)
TABLE III: Pull request distribution across dimensions

Repo size (# of developers)   # of pull requests
Large                         220
Medium                        200
Small                          80

After obtaining the embedding of the new pull request as per Equation 6, we can get the top k reviewers for it by finding the top k closest users in the embedding space. That is,

    \mathrm{reviewers}_k(u') = \operatorname{argmax}_{v_1 \ldots v_k} \; z_{u'}^T z_{v_i}    (7)
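The scoring step of Equation 7 is a dot product against all candidate user embeddings followed by a top-k selection, sketched below (illustrative names, brute-force search; a deployed system might instead use an approximate nearest-neighbor index).

```python
import numpy as np

def top_k_reviewers(z_pr, user_embeddings, user_ids, k=3):
    """Rank candidate reviewers by dot product, as in Equation 7.

    z_pr: (d,) embedding of the new pull request; user_embeddings:
    (m, d) matrix of user embeddings; user_ids: list of m identifiers.
    """
    scores = user_embeddings @ z_pr          # z_u'^T z_v for every user v
    top = np.argsort(-scores)[:k]            # indices of the k highest scores
    return [(user_ids[i], float(scores[i])) for i in top]
```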
the top 2 recommendations by CORAL as the recommended reviewer to reach out. Note that the pull requests selected had not been recommended by the rule-based model and each recommended reviewer appears at most once. The pull requests are collected from repositories with different numbers of developers using stratified random sampling following the distribution in Table III. The categories are defined by number of developers: Large (> 100 developers), Medium (between 25 and 100 developers), and Small (< 25 developers).

Questionnaire: We perform the user study by posing a set of questions on what actions a reviewer might take when they were recommended for a specific pull request:

1) Is this pull request relevant to you (as of the PR creation date, state)?
   A - Not relevant
   B - Relevant, I'd like to be informed about the pull request.
   C - Relevant, I'd take some action and/or I'd comment on the pull request.
2) If possible, could you please explain the reason behind your choice?

We avoid intruding in the actual work flow, yet still maintain an adequate level of realism by working with actual pull requests, thus balancing realism and control in our study [33]. Note that, with 287 responses, this is one of the largest field studies conducted to understand the effects of an automated reviewer recommender system.

We divided the questionnaire among 4 people to conduct the user studies. The interviewers did not know these reviewers, nor had they worked with them before. The teams working on the systems under study are organizationally far away from the interviewers. Therefore, the interviewers do not have any direct influence on the study participants. The interview format is semi-structured, where users are free to bring up their own ideas and to express their opinions about the recommendations. We use question (2) to collect user feedback and analyze it to generate insights about developers' perceptions of automated reviewer recommendation systems (RQ3), namely the factors that make people reluctant to use an automated reviewer recommendation system.

3) Comparing with the Rule-based Model: To compare CORAL with the rule-based model, we select another 500 recent pull requests from the set of pull requests on which the rule-based model (currently deployed in production) has made recommendations, following the same distribution as the pull requests selected for evaluating CORAL (Table III). We then collect from telemetry the recommendations made by the rule-based model and the subsequent actions performed by the recommended reviewers (changing the status of the pull request, adding a code review comment, or both) for the selected pull requests. The telemetry yields two benefits: 1) it helps us gather user feedback without doing another large-scale user study, as the telemetry captures the user actions already; and 2) it avoids the probable study participants having to take one more survey (and saves time and frustration), because they already indicated their preferences on the pull request when it was active and when they were added as reviewers.

An important point to keep in mind is that the rule-based model adds recommended reviewers directly to the pull requests. This increases the probability of them taking an action even if they may not be an appropriate developer to conduct the review. The reason for this is that the reviewers are being selected and their assignment to the PR is public (everyone, including their managers, can see who is reviewing the pull request) [5]. If they do not respond, it might look like they are blocking the pull request's progress. In contrast, CORAL's recommendations are validated through user studies, which are conducted in a private 1-1 setting where participants likely feel more comfortable indicating that they are not appropriate for the review. Reviewers can be open about their decisions in the user studies. Therefore, CORAL might be at a slight disadvantage.

B. Results

1) How well does CORAL model the review history?: To answer RQ1, we examine whom the pull request author invited to review a change and then check to see if CORAL recommended the same reviewers. In this context, a "correct" recommendation is defined as the recommended reviewer being invited to the pull request. While the author's actions may not actually reflect the ground truth of who is best able to review the change, most prior work in code reviewer recommendation evaluates recommenders in this way (see [23] for a thorough discussion of this), so we follow suit here. Table IV shows the accuracy and MRR for CORAL across all 254K (pull request, reviewer) pairs. In 73% of the pull requests, CORAL is able to replicate the human authors' behavior in picking the reviewers in the top 3 recommendations, which validates that CORAL matches the history of reviewer selection quite well.*

TABLE IV: Link prediction accuracy and MRR

Metric     k=1    k=3    k=5    k=7
Accuracy   0.50   0.73   0.78   0.80
MRR        0.49   0.61   0.68   0.72

* Note that by design the rule-based model always includes the author-invited people as reviewers, so we do not evaluate the rule-based model in this way.
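The paper does not spell out the metric formulas, so the sketch below assumes the conventional definitions: top-k accuracy counts a hit when any author-invited reviewer appears in the top k recommendations, and MRR (truncated at k) averages the reciprocal rank of the first invited reviewer.

```python
def accuracy_and_mrr(recommendations, invited, k=3):
    """Top-k accuracy and MRR over (pull request, reviewer) data.

    recommendations: PR id -> ranked list of recommended reviewers;
    invited: PR id -> set of reviewers the author actually invited.
    """
    hits, reciprocal_ranks = 0, 0.0
    for pr, recs in recommendations.items():
        truth = invited[pr]
        if any(r in truth for r in recs[:k]):
            hits += 1
        for rank, r in enumerate(recs[:k], start=1):
            if r in truth:                     # first correct reviewer
                reciprocal_ranks += 1.0 / rank
                break
    n = len(recommendations)
    return hits / n, reciprocal_ranks / n
```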
2) RQ2: Under what circumstances does CORAL perform better than a rule-based model (and vice versa)?: In Table V, we show the recommendation precision of the rule-based model and CORAL. Specifically, on the sampled data for the CORAL model and the rule-based model, precision is calculated as the percentage of recommended reviewers who are willing to engage in reviewing the pull requests. For the rule-based model, reviewers who either change the status of the pull request or add a code review comment are considered engaged. For CORAL, reviewers who say that the pull request is relevant and that they would take some action are considered engaged.

TABLE V: Comparative user study precision across dimensions. RM is the Rule-based Model. The differences between the two models with the same Greek letter suffix (and only those pairs) are not statistically significant.

Repo size (# of developers)   RM      CORAL
Large                         0.19    0.37
Medium                        0.31α   0.36α
Small                         0.35β   0.23β

Generally, there is "no model to rule them all." Neither of the models performs consistently better than the other on pull requests from repositories of all categories. As shown in Table V, CORAL performs better on pull requests from large and medium repositories, while the rule-based model does well on pull requests from small repositories. However, when we statistically tested for differences, Fisher exact tests [34] only showed a statistically significant difference between the two approaches for large repositories (p = 0.013).
One observation that may explain this result is that, due to their size, large software projects dominate the graph. Thus, CORAL is trained on many more pull requests from large projects than from smaller projects. If the mechanisms, factors, behaviors, etc., for reviewer selection are different in smaller projects than in large ones, then the model is likely to learn those used in larger projects. This hypothesis could be confirmed by splitting the training data by project size and training multiple models. However, as reviewer recommendation is most important in projects with many developers, and that appears to be where CORAL excels, we do not pursue this line of inquiry.

We have observed that in small repositories, usually with few developers, one or two experienced developers are more likely to take on the responsibility of reviewing pull requests, which accounts for the high accuracy of the rule-based model. However, this phenomenon in which a small number of experienced people in a particular repository are assigned the lion's share of reviews is problematic, and heuristics have been used to "share the load" [15]. As the socio-technical graph contains historical information about a developer across many repositories, and PRs from different repositories may be semantically related, CORAL is able to leverage more information per developer and per PR, which may avoid this problem.

The following feedback received from the user study (question (2)) also demonstrates that CORAL identifies relevant and qualified reviewers whom traditional reviewer recommenders miss:

"This PR is created in a repository on which our service has a dependency. I would love to review these PRs. In fact, I am thinking of asking x on these PRs going forward."

"I never reviewed y's PRs. I work with her on the same project and know what she is doing. I am happy to provide any feedback (of course if she'd like :))"

"The content of the PR might impact another repository that I have ownership of because we use some of the components in that lib. Based on that I would say it is a relevant PR and I will not mind reviewing it."

3) RQ3: What are developers' perceptions about an automated reviewer recommendation model?: We show the distribution of user study responses in Table VI. Out of the 500 user study messages we sent, 287 users responded. 67.6% of the users gave positive feedback, saying that the given pull request was relevant to them to some degree. Of these, 8.36% of the users said they would like to be informed about the given pull request, and 59.23% said that they would take some action and/or leave a comment on the pull request. 32.4% of the users gave negative feedback, saying that the given pull request was not relevant.

TABLE VI: Distribution of qualitative user study responses.

Category                                     # of responses (%)
I will review this pull request              170 (59.23%)
I'd like to be added to this pull request     24 (8.36%)
This pull request is not relevant to me       93 (32.40%)

To understand the reasons users do not like CORAL's recommendations, we analyze the negative feedback (comments/anecdotes from the developers) and classify it into 3 categories, with the distribution shown in Table VII. To offer an impression, we show some typical negative quotes that we received from users.

91.03% of the negative feedback we received said that the pull request is no longer relevant: 69.23% of these respondents said it is because they have started to work in a different area, and 21.79% mentioned that they no longer work in the repository because of switching groups or transferring teams: "Not relevant since I no longer work on the team that manages this service." 6.41% of the users mentioned that they are never involved in code review: "I'm a PM. I'm less interested in PR in general. Only when I'm needed by the devs and then they mention me there." Two users said that the pull requests we provided did not need to be reviewed: "Let me explain. This is an automated commit that updates the version number of the product as part of the nightly build. It pretty much happens every night. So it doesn't need a reviewer like a traditional pull request would."

From users' negative feedback, we learn that in order to improve CORAL we need to include several extra factors. First, our socio-technical graph should take people's movement into consideration and update the graph dynamically, namely by identifying inactive users and removing edges, or decaying the weights on the edges, between user nodes and repository nodes. Second, CORAL should include and learn the job role of every user in the socio-technical graph through node embeddings (such as SDE or PM), so that it can filter out irrelevant users and suggest reviewers more precisely. Third, before running CORAL, heuristic rules can be designed to filter out automated and deprecated pull requests.
TABLE VII: Users' Negative Feedback Categories.

Category   Feedback                                        # of feedback (%)
I          This pull request is no longer relevant to me   71 (91.03%)
II         Never participate in code review                 5 (6.41%)
III        Pull request does not need reviewer              2 (2.56%)

Besides the negative feedback, we also received considerable praise from users:

"The recommendation makes a lot of sense since I primarily contributed to that repository for a few years. However, a recent re-org means I no longer work on that repository."

"I am lead of this area and would like to review these kinds of PRs which are likely fixing some regressions."

These responses validate our claim that CORAL does consider the interactions between users and files, and that its recommendations are understandable by humans. Since CORAL is trained and evaluated on historical pull requests starting from 2019, it is hard to reconstruct the situation in which the pull requests were created, and many users complained that it is difficult to recall the context of the pull requests, thus putting CORAL at a disadvantage. We expect it will perform better in actual production.

4) Ablation Study: To evaluate the contribution of each of the entities in CORAL, we perform an ablation study, with results shown in Table VIII. Specifically, we first remove the entities from the socio-technical graph and training data, and then retrain the graph convolutional neural network. We find that ablating each entity deteriorates performance across metrics. After removing word entities and file entities from the graph, i.e., when the socio-technical graph contains only user and pull request entities, the model can hardly recommend correct reviewers. By comparing (1) with (2), and (1) with (3), we demonstrate the importance of the semantic information and the file change history introduced by file entities in recommending reviewers; file entities provide more value than words. Looking at (3) and (4), we observe a boost in performance when adding semantic information on top of the file change and review activities, which underlines our claim that incorporating information around interactions between code contributors as well as the semantics of code changes and their descriptions can help identify the best reviewers.

TABLE VIII: Link prediction accuracy and MRR for various configurations of parameters

                         Accuracy                   MRR
Models                 k=1   k=3   k=5   k=7     k=1   k=3   k=5   k=7
(1) No words or files  0.02  0.08  0.13  0.16    0.01  0.04  0.05  0.06
(2) Words only         0.21  0.30  0.32  0.34    0.21  0.25  0.26  0.32
(3) Files only         0.29  0.69  0.73  0.76    0.29  0.48  0.49  0.50
(4) Words + Files      0.49  0.73  0.77  0.80    0.49  0.61  0.68  0.72

VI. THREATS AND LIMITATIONS

As part of our study, we reached out to people who were not invited to a review but whom CORAL recommended as potential reviewers. It is possible that their responses to our solicitations differed from what they may have actually done if they were unaware that their actions/responses were being observed (the so-called Hawthorne Effect [35]). Microsoft has tens of thousands of developers and we were careful not to include any repositories or participants that we have interacted with before or that might have a conflict of interest with us. Nonetheless, there is a chance that respondents may have been positive about the system because they wanted to make the interviewers happy.

The Socio-technical graph contains information about who was added as a reviewer on a PR, but it does not explain why that person was added or whether they were added as the result of a reviewer recommendation tool. Thus, in our evaluation of how well CORAL is able to recommend reviewers who were historically added to reviews, it is unclear how much of the history comes from the rule-based recommender and how much from authors acting without the aid of a recommender.

When looking at repository history, the initial recommendation by the rule-based model is based on the files involved in the initial review, while CORAL includes the files and descriptions in the review's final state. If the description or the set of files was modified, then CORAL may have had a different set of information available to it than it would have had it been used at the time of PR creation.

In our evaluation of CORAL, we use a training set of PRs to train the model and keep a held-out set for evaluation. These datasets are disjoint, but they are not temporally divided. In an ideal setting, all training PRs would precede all evaluation PRs in time, and we would evaluate our approach by looking at CORAL's recommendation for the next unseen PR (ordered by time), then add that PR to the Socio-technical graph, retrain the model on the updated graph for the following PR, and repeat until all PRs in the evaluation set were exhausted. This form of evaluation proved too costly and time-consuming to conduct, so we used a random split of training and testing data sets.

We sampled the 500 PRs from the population using a random selection approach. We selected the sample size in an effort to avoid bias and confounding factors in the sample, but we cannot guarantee that this data set is free from noise, bias, etc.

VII. FUTURE WORK

In this work we showed that a simple GCN-style model is able to capture complex interaction patterns between various entities in the code review ecosystem and can be used to predict relevant reviewers for pull requests effectively. While this method is very promising on large repositories, we believe it can be improved to make good recommendations on other repositories too, by training repository-type-specific models. In this work we mainly focused on using the interaction graph of various entities (pull requests, users, files, words, etc.) to learn complex features through embeddings. We neither captured any node-specific features (e.g., user-specific features, file-specific features, etc.) nor any edge-specific features
(e.g., how long ago a user authored or modified files, whether two users belong to the same org, etc.). Incorporating such features may help the model learn even more complex patterns from the data and further improve the recommendation accuracy. Furthermore, we believe that a detailed study of the effect of model hyper-parameters (such as embedding dimension, number of GCN layers, and different activation functions) on the recommendation accuracy would be a very useful result. We intend to explore these directions in our future work.

The techniques explained in this paper and the CORAL system are generic enough to be applied to any dataset that follows a Git-based development model. Therefore, we see opportunities for implementing CORAL for source control systems like GitHub and GitLab.

VIII. CONCLUSION

In this work, we seek to leverage additional recorded information in software repositories to improve reviewer recommendation and address the weakness of approaches that rely only on the historical information of changes and reviews. To that end we propose CORAL, a novel graph-based machine learning model that leverages a socio-technical graph built from the rich set of entities (developers, repositories, files, pull requests, work items, etc.) and their relationships in modern source code management systems. We train a Graph Convolutional Network (GCN) on this graph to learn to recommend code reviewers for pull requests.

Our retrospective results show that in 73% of the pull requests, CORAL is able to replicate the human pull request authors' behavior in its top 3 recommendations, and that it performs better than the rule-based model in production on pull requests in large repositories by 94.7%. A large-scale user study with 500 developers showed 67.6% positive feedback and relevance in suggesting the correct code reviewers for pull requests.

Our results open new possibilities for incorporating the rich set of information available in software repositories, and the interactions that exist between various actors and entities, to develop code reviewer recommendation models. We believe the techniques and the system have wide applicability, ranging from individual organizations to large open source projects. Beyond code reviewer recommendation, future research could also target other recommendation scenarios in source code repositories that could aid software developers by leveraging socio-technical graphs.

IX. DATA AVAILABILITY

We are unfortunately unable to make the data involved in this study publicly available, as it contains personally identifiable information as well as confidential information. Access to the data for this study was granted under a condition of confidentiality from Microsoft, and we cannot share it while remaining compliant with the General Data Protection Regulation (GDPR) [36].

REFERENCES

[1] G. Gousios, M. Pinzger, and A. v. Deursen, "An exploratory study of the pull-based software development model," in International Conference on Software Engineering, 2014, pp. 345–355.
[2] P. Rigby, B. Cleary, F. Painchaud, M.-A. Storey, and D. German, "Contemporary peer review in action: Lessons from open source development," IEEE Software, vol. 29, pp. 56–61, 2012.
[3] P. C. Rigby and C. Bird, "Convergent contemporary software peer review practices," in International Symposium on the Foundations of Software Engineering, 2013, pp. 202–212.
[4] A. Bacchelli and C. Bird, "Expectations, outcomes, and challenges of modern code review," in International Conference on Software Engineering, 2013, pp. 712–721.
[5] P. C. Rigby and M.-A. Storey, "Understanding broadcast based peer review on open source software projects," in International Conference on Software Engineering, 2011, pp. 541–550.
[6] A. Bosu, M. Greiler, and C. Bird, "Characteristics of useful code reviews: An empirical study at Microsoft," in International Working Conference on Mining Software Repositories, 2015, pp. 146–156.
[7] J. Lipcak and B. Rossi, "A large-scale study on source code reviewer recommendation," in Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 2018, pp. 378–387.
[8] A. Ouni, R. G. Kula, and K. Inoue, "Search-based peer reviewers recommendation in modern code review," in IEEE International Conference on Software Maintenance and Evolution (ICSME), 2016, pp. 367–377.
[9] J. Jiang, Y. Yang, J. He, X. Blanc, and L. Zhang, "Who should comment on this pull request? Analyzing attributes for more accurate commenter recommendation in pull-based development," Information and Software Technology, vol. 84, pp. 48–62, 2017.
[10] Y. Yu, H. Wang, G. Yin, and C. X. Ling, "Reviewer recommender of pull-requests in GitHub," in IEEE International Conference on Software Maintenance and Evolution, 2014, pp. 609–612.
[11] Y. Yu, H. Wang, G. Yin, and T. Wang, "Reviewer recommendation for pull-requests in GitHub: What can we learn from code review and bug assignment?" Information and Software Technology, vol. 74, pp. 204–218, 2016.
[12] J. B. Lee, A. Ihara, A. Monden, and K.-i. Matsumoto, "Patch reviewer recommendation in OSS projects," in Asia-Pacific Software Engineering Conference (APSEC), vol. 2, 2013, pp. 1–6.
[13] E. Sülün, E. Tüzün, and U. Doğrusöz, "Reviewer recommendation using software artifact traceability graphs," in International Conference on Predictive Models and Data Analytics in Software Engineering, 2019, pp. 66–75.
[14] P. Thongtanunam, C. Tantithamthavorn, R. G. Kula, N. Yoshida, H. Iida, and K.-i. Matsumoto, "Who should review my code? A file location-based code-reviewer recommendation approach for modern code review," in International Conference on Software Analysis, Evolution, and Reengineering (SANER), 2015, pp. 141–150.
[15] S. Asthana, R. Kumar, R. Bhagwan, C. Bird, C. Bansal, C. Maddila, S. Mehta, and B. Ashok, "WhoDo: Automating reviewer suggestions at scale," in Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 937–945.
[16] O. Kononenko, O. Baysal, L. Guerrouj, Y. Cao, and M. W. Godfrey, "Investigating code review quality: Do people and participation matter?" in IEEE International Conference on Software Maintenance and Evolution (ICSME), 2015, pp. 111–120.
[17] A. Bosu and J. C. Carver, "Impact of peer code review on peer impression formation: A survey," in International Symposium on Empirical Software Engineering and Measurement, 2013, pp. 133–142.
[18] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip, "A comprehensive survey on graph neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 32, pp. 4–24, 2020.
[19] H. A. Çetin, E. Doğan, and E. Tüzün, "A review of code reviewer recommendation studies: Challenges and future directions," Science of Computer Programming, p. 102652, 2021.
[20] V. Balachandran, "Reducing human effort and improving quality in peer code reviews using automatic static analysis and reviewer recommendation," in International Conference on Software Engineering, 2013, pp. 931–940.
[21] M. B. Zanjani, H. Kagdi, and C. Bird, "Automatically recommending peer reviewers in modern code review," IEEE Transactions on Software Engineering, vol. 42, pp. 530–543, 2015.
[22] M. M. Rahman, C. K. Roy, and J. A. Collins, "CORRECT: Code reviewer recommendation in GitHub based on cross-project and technology experience," in International Conference on Software Engineering Companion, 2016, pp. 222–231.
[23] E. Doğan, E. Tüzün, K. A. Tecimer, and H. A. Güvenir, "Investigating the validity of ground truth in code reviewer recommendation studies," in International Symposium on Empirical Software Engineering and Measurement (ESEM), 2019, pp. 1–6.
[24] "English stop words," Accessed 2021. [Online]. Available: https://gist.github.com/sebleier/554280
[25] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. Cambridge, Massachusetts: The MIT Press, 1999. [Online]. Available: http://nlp.stanford.edu/fsnlp/
[26] P. D. Hoff, A. E. Raftery, and M. S. Handcock, "Latent space approaches to social network analysis," Journal of the American Statistical Association, vol. 97, pp. 1090–1098, 2002.
[27] W. L. Hamilton, R. Ying, and J. Leskovec, "Representation learning on graphs: Methods and applications," IEEE Data Eng. Bull., vol. 40, pp. 52–74, 2017.
[28] H. Chen, B. Perozzi, R. Al-Rfou, and S. Skiena, "A tutorial on network embeddings," 2018.
[29] J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun, "Graph neural networks: A review of methods and applications," AI Open, vol. 1, pp. 57–81, 2020.
[30] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. v. d. Berg, I. Titov, and M. Welling, "Modeling relational data with graph convolutional networks," in European Semantic Web Conference, 2018, pp. 593–607.
[31] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in International Conference on Machine Learning, 2010, pp. 807–814.
[32] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. USA: Cambridge University Press, 2008.
[33] K.-J. Stol and B. Fitzgerald, "The ABC of software engineering research," ACM Trans. Softw. Eng. Methodol., vol. 27, 2018.
[34] A. Agresti, Categorical Data Analysis. John Wiley & Sons, 2003, vol. 482.
[35] J. G. Adair, "The Hawthorne effect: A reconsideration of the methodological artifact," Journal of Applied Psychology, vol. 69, p. 334, 1984.
[36] General Data Protection Regulation. European Commission. [Online]. Available: https://gdpr-info.eu/