software defect prediction ppr
PyTraceBugs: A Large Python Code Dataset for Supervised Machine Learning in Software Defect Prediction
Elena N. Akimova, Alexander Yu. Bersenev, Artem A. Deikov, Konstantin S. Kobylkin, Anton V. Konygin, Ilya P. Mezentsev, Vladimir E. Misilov
Krasovskii Institute of Mathematics and Mechanics, UB RAS, S. Kovalevskaya Street 16, 620108 Ekaterinburg, Russia
Ural Federal University, Mira Street 19, 620002 Ekaterinburg, Russia
{aen, kobylkin, konygin}@imm.uran.ru, {alexander.bersenev, deykov.artem, ilya.mezentsev, v.e.misilov}@urfu.ru
DOI: 10.1109/APSEC53868.2021.00022
There are several datasets constructed for this task. The CodRep dataset [15] consists of five parts containing one-line bugfixes from open source Java projects. Its total size is 58,069 commits. The work [16] uses this artifact for program repair.

The dataset from [17] consists of bugfix pairs of Java code mined from GitHub. Its authors identified all commits with messages like "fix", "issue" or "bug". For each such commit, the pair of buggy (pre-commit) and fixed (post-commit) code fragments is extracted. The resulting dataset contains 800 thousand code pairs.

The ManySStuBs4J dataset [18] is constructed in the same manner. It consists of two parts: 25,539 single-statement (one-line) bugfix pairs from 100 popular open-source Java projects and a larger set of 153,652 pairs from 1,000 projects. The authors extracted only those commits for which both the buggy and the fixed code pass compilation. So, this dataset is confined to simple bugs, which are difficult to spot manually. Non-bugfixing changes, such as renaming variables or classes, are excluded. The resulting bugfixes form 16 patterns such as variable misuse, wrong function name, wrong operator, etc.

The second application is software defect prediction. Here, the goal is to identify a given code excerpt (file, class, function, or method implementation) as defective or correct (see the recent survey in [19]). The defect prediction problem can be posed in different settings. One of the approaches is to consider it as a binary classification problem. Below are some of the datasets intended to be used in training predictive models for this setting.

The GHPR dataset [20] consists of 3,026 pairs of defective and fixed source code files from 307 projects. They are extracted from GitHub by finding pull requests labeled as fixes. Then, the resulting code pairs are used to train a graph-based deep learning model for defect prediction.

The BugHunter dataset [21] contains 159 thousand Java bugs with precomputed metrics for three granularity levels (file/class/method). It is created from GitHub projects by analyzing the closed and open bug reports containing references to the bugfixing commits. The authors use the computed code metrics to train classifiers for defect prediction.

In the works above, buggy excerpts are distinguished from their instant fixes, i.e., the parts of the bugfix pairs are discriminated from each other. In our work, a different setting is proposed, which does not treat fixed versions of buggy code as correct. Namely, it is assumed that a snippet is correct if it is stable, i.e., it has not been changed for a significant time period up to the current state of the software. This definition of correct source code is preferable, as guarantees of correctness cannot be provided for fix snippets: further changes can be introduced into those snippets for many reasons, including fixing other bugs. This idea lies at the core of collecting our dataset.

III. THE PYTRACEBUGS DATASET

In this work, a large dataset of Python source code is presented, called the PyTraceBugs dataset. This dataset is intended for both training and evaluating complex (possibly, neural) predictive models for software bug prediction. It is formed by excerpts of source code of functions and methods from the selected top rated GitHub repositories. It contains examples from both classes, the correct and the buggy source code.

The range of possible bugs conveyed by the dataset is somewhat restrictive in the sense that it only contains examples of confirmed bugs. These are the bugs which manifest themselves by raising an error (exception) message, which is called the traceback error message in Python. This restriction serves two goals. First, bugs of this sort can be considered low-level, i.e., snippet-level. This is the kind of bugs we aim to find. Second, it could simplify bug localization to some extent, as many bugs tend to be near traceback paths in the graph of function calls.

In total, the dataset contains 24 thousand examples of buggy and 5.7 million examples of correct source code snippets from 630 and 10,642 repositories, respectively. Repositories for the buggy source code employ the standardized GitHub system for handling issues and fixing bugs in their codebase.

In Table I, the top 20 most represented GitHub repositories are listed, accounting for approximately 40% of the snippets of buggy code.

TABLE I
TOP REPOSITORIES PER NUMBER OF BUGGY SNIPPETS

Repository name                  Percentage of snippets (%)
saltstack/salt                   7.5
ansible/ansible                  5.9
Tribler/tribler                  3
pandas-dev/pandas                3
mars-project/mars                2.6
numba/numba                      2.5
spyder-ide/spyder                2.3
Cog-Creators/Red-DiscordBot      1.8
pymedusa/Medusa                  1.5
python/mypy                      1.4
conda/conda                      1.3
log2timeline/plaso               1.3
ray-project/ray
scikit-learn/scikit-learn
freqtrade/freqtrade
pytroll/satpy
iterative/dvc                    0.9
modin-project/modin              0.9
sphinx-doc/sphinx                0.
scipy/scipy                      0.8
Others                           58.3

The corresponding distribution of snippets for the correct source code is presented in Table II. Here, the top 25 repositories account for about 15% of the snippets. Approximately 16% of the repositories of the buggy source code are also present in the correct source code, though with smaller percentages of snippets.

A. Content of the dataset

The dataset consists of two parts:
• automatically collected and filtered source code, containing examples of buggy and correct snippets;
• a small sample of this automatically collected data, which is subjected to additional filtering and manual validation by two Python experts.

Here, the programmatically collected source code is split into training and validation samples, whereas its manually selected sample is used as a test sample.

TABLE II
TOP REPOSITORIES PER NUMBER OF CORRECT SNIPPETS

Repository name                                          Percentage of snippets (%)
CiscoDevNet/ydk-py
Azure/azure-sdk-for-python                               0.7
oracle/oci-python-sdk
sagemath/sage                                            0.7
cloudera/hue
tytusdb/tytus                                            0.6
aliyun/aliyun-openapi-python-sdk                         0.5
home-assistant/core                                      0.5
cctbx/cctbx_project                                      0.5
tencentyun/scf-demo-repo                                 0.5
Azure/azure-cli-extensions                               0.5
sympy/sympy                                              0.4
dimagi/commcare-hq                                       0.4
kovidgoyal/calibre
docusign/docusign-python-client                          0.3
saltstack/salt                                           0.3
leo-editor/leo-editor                                    0.3
tribe29/checkmk                                          0.3
XX-net/XX-Net                                            0.3
Toontown-Open-Source-Initiative/Toontown-School-House    0.3
AppScale/gts                                             0.3
anhstudios/swganh                                        0.3
SteveDoyle2/pyNastran                                    0.3
dnanexus/parliament2                                     0.3
openhatch/oh-mainline                                    0.2
Others                                                   84.8

The basic principles of automatic selection of examples of correct and deficient code snippets from the GitHub repositories for both training and validation samples are given below. With some specific modifications, they are used in works [7], [21] as a first step to collect datasets of source code for other programming languages and other research purposes, e.g., for datasets aimed to improve automatic test generation.

Namely, snippets of the correct source code are chosen from stable code of the GitHub repositories under a simple assumption, which (omitting the details) amounts to the following: a snippet is more probably correct if it has not been changed over many commits up to the latest state of the repository folder it resides in.

Snippets of source code with bugs are collected from the codebase of the top GitHub repositories with issues pages. More specifically, they are selected from those repositories' bugfix commits and pull requests which are directly related to handling issues marked by bug labels, e.g., named "bug", "type:bug", or any other similar label.

Besides, only those bugs are considered which manifest themselves by raising an error exception. Accordingly, the repositories' issues are selected to contain full error traceback reports on their web pages. These issues report a program crash, which developers consider a bug and fix. Table III contains the error exception types with maximal occurrence in the dataset.

The most frequent error messages are attribute absence, empty object related errors, and index out of bounds errors.

TABLE III
MOST COMMON ERROR TYPES IN THE DATASET

Error type        Percentage of snippets (%)
AttributeError    16.6
TypeError         15.9
ValueError        10.2
KeyError          8.2
RuntimeError      5.5
IndexError        5.3
Others            38.3

IV. QUALITY OF THE DATASET

A. Labeling confidence

To enforce high confidence of labeling of snippets of the test sample, a manual validation process is conducted to select relevant snippets from the programmatically collected source code. Below, the ideas used to select examples of the buggy code are outlined:
• a bug reported on the web page of the corresponding issue is simple and easy to understand (e.g., its actual fix appears near the location where the program crashes, raising an error exception);
• the reported bug is not a dependency, compatibility, or regression bug;
• a fix introduced into a buggy snippet is also simple, e.g., is confined to one line;
• changes (introduced into the snippet) should not be bound to refactoring, i.e., they must be changes fixing the reported bug.

These tough restrictions admitted 1 out of approximately 15-20 snippets during the validation process. The process consisted in reviewing and selecting the buggy snippets among a random sample of several hundred of the automatically collected entries.

To filter out the correct snippets with highly confident labeling to be included into the test sample, the following automatic selection principle is used. Namely, snippets of stable source code are subjected to an additional restriction that selects only those snippets which have many incoming calls from other snippets with many incoming calls. Namely, a graph of calls is computed for snippets of each GitHub repository chosen to be a source of stable code for the dataset:
Snippet 1 is adjacent to snippet 2 if there is a call of snippet 2 in the implementation of snippet 1. A snippet of stable code is chosen to be included in the test sample if there are at least 3 snippets with incoming degrees above 3 which are adjacent to that snippet. In our opinion, this selection criterion must increase visibility of bugs in the source code of selected snippets, and thus provides additional guarantees that bugs in those snippets are more likely to be already revealed and fixed.

The buggy snippets from training and validation samples have more or less confident labeling due to the following specifics of the source code selection from the GitHub bugfix commits and pull requests:
1) buggy snippets are a part of the GitHub bugfixes directly related to issues reporting a program crash in the form of raising an error exception;
2) being selected from the issues pages of the top rated GitHub repositories, these issues have bug labels; this increases the confidence that the reported problems with the repositories' source code are bound to be bugs;
3) only highly relevant bugfix commits and pull requests are considered, i.e., they should contain a direct reference to a bug-labeled issue in their message or title and, if it occurs in the message, the prescribed tokens should precede this reference, e.g., "closes", "fixes", etc.;
4) bugfix commits and pull requests are additionally restricted to handle at most two issues.

To directly estimate quality of the dataset, a percentage is estimated of buggy snippets for which the corresponding fixes from the bugfix commits and pull requests contain changes confined to refactoring. Such changes are not directly related to bugs. Therefore, the corresponding snippets should be excluded from the collected data as being incorrectly labeled. To obtain a lower bound on the percentage of the refactoring changes present in the training and validation samples, the rate is estimated for changes from the corresponding bugfix commits which are bound to docstrings and comments. Namely, it equals 2.6%. Besides, during the manual validation process to form the test sample, our experts observed approximately 10-15% of refactoring changes.

To guarantee confidence of labeling of correct source code from training and validation samples, only highly stable source code is selected from the top rated GitHub repositories. More specifically, a count of commits is computed for each snippet of source code, relative to the directory where this snippet is located. It counts the number of commits which contain changes for at least one py file in this specific directory, but make no changes in that snippet. Only the snippets whose corresponding count is above 100 are selected to be included into training and validation samples.

B. Ability of the cross-project and cross-domain prediction

The repositories of the training and test samples are made non-overlapping. This was done to avoid possible data leaks and provide non-biased performance estimation for models trained on this dataset. This allows cross-project predictions.

To explore the difference of domains across the training, validation, and test samples, topics are also collected for each repository present in the dataset. In GitHub repositories, topics are represented by lists of keywords on the main repository pages.

For the buggy source code, 562 distinct repositories are present in the training sample, whereas 68 repositories are present both in the validation and test samples. They have 2,345, 424, and 424 distinct topics, respectively. Here, the validation and test samples share the same topics, whereas the test sample has 228 topics not present in the training sample.

For the correct source code, 1,571, 10,536, and 94 repositories are present in the training, validation and test samples with 3,456, 14,066, and 269 distinct topics, respectively. There are 89 topics in the test sample not present in the validation sample. Besides, 162 topics of the test sample are not present in the training sample.

This, to some extent, provides evidence that the dataset allows cross-domain predictions.

V. SUMMARY STATISTICS FOR THE DATASET

Training and validation samples include 14,089 and 9,457 snippets of the buggy code, as well as 351,338 and 5,340,000 snippets of the correct source code, respectively. The test sample contains 170 snippets of the correct source code and 161 snippets of the buggy code.

Table IV contains statistics for the distribution of the buggy code with respect to two of its characteristics. The first one is its length in symbols including docstrings and comments. The second characteristic is the well known cyclomatic complexity, which is a special measure of structural complexity expressed as the number of logical paths in source code. For example, if a snippet does not contain branching statements, e.g., if-else statements, the cyclomatic complexity is equal to 1; when source code contains a single if-else block statement, its cyclomatic complexity is equal to 2.

TABLE IV
STATISTICS OF THE BUGGY CODE

Statistic name    Code length    Cyclomatic complexity
Mean              1,875.9        7.0
Median            934.5          3.0
Standard error    3,766.9        12.1
25%               426.0
75%               2,010.0        8.0
90%               4,081.3        16.0
Min               19
Max               185,063        199

Statistics are given in Table V for the distribution of stable code with respect to its length and cyclomatic complexity. Thus, stable source code is generally shorter and simpler than buggy code.

Topics of repositories with the buggy code are presented in Table VI. Here, the snippets can have multiple topics. Many of the topics are devoted to infrastructure and cloud computing.
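The two cyclomatic complexity cases mentioned for Table IV (no branching gives 1, a single if-else gives 2) can be checked with a small sketch. This is a simplified approximation written for illustration, counting one plus the number of decision points found with the standard `ast` module; it is not the tool the authors used, and the function name `cyclomatic_complexity` and the set of counted node types are assumptions of this example.

```python
import ast

# Node types treated as decision points in this simplified approximation
# (an assumption of the sketch, not the authors' exact metric definition).
_BRANCH_NODES = (ast.If, ast.IfExp, ast.For, ast.AsyncFor, ast.While,
                 ast.ExceptHandler, ast.comprehension)


def cyclomatic_complexity(source: str) -> int:
    """Return 1 plus the number of decision points found in `source`."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds one extra path per additional operand.
            complexity += len(node.values) - 1
    return complexity


straight = "def f(x):\n    return x + 1\n"
branched = (
    "def g(x):\n"
    "    if x > 0:\n"
    "        return x\n"
    "    else:\n"
    "        return -x\n"
)

print(cyclomatic_complexity(straight))  # 1
print(cyclomatic_complexity(branched))  # 2
```

Dedicated tools may count some constructs differently (for instance, conditions inside comprehensions), so values from this sketch need not match Table IV exactly.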
classification
probability
standard codebuilding topics
correct
source as
Another VI. well
snippet data-analysis ansible numpy
data-science event-strean
numpy
Pytorch Max Min 909% 75% StandardMedian Statistic
259 emoe Mean name
deep-learming cdge infrasructure zeromq
cloud-management
Cvent-management
cloud-provisioning
remote-cxecution cloud
pandashacktoberfest
machine-learning name
Python Topic is
logloss predictive TRAINING infrastructure-management
intrastructure-automation cloud-pruviders intrasuructure-aS-Code
infrastructure-as-a-code as,
is way devoted
is
problem to
estimated
classified code the
is MODELS
to STATISTICS
TOPICS to
used estimate
models AND data
the are length
Code
is 6,804,973
1.596,0 734.0 4.089.28 3220704.87
considered,
asfor into EVALUATING
ON OF science.
datagiven 13
TABLE VI TABLE
OF
ae using THE
loss theither quality THE science THE
snippet Percentage BUGGY in V
STABLE
criterion, its DATASET Table Most
buggy in of coDE and complexity
clematie
CODE
data. he BUG 6.4 popular
to which 4.84 5.1 6.2 7.3 7.5 7.5 7.5 7.5 7.5 7.5 7.5 7.5 7.5 7.5 7.6 76 7.6 8.5 12.1 18.6 69.6 of
6.7: 7.5 7.7 8.8 IOS VII. 2,640 6.0 5.66 216
contain or
whereas PREDICTION
Namely,dataset snippets
correct. a A
development. topics
given large
consists (%)
precisionbugs. bulk for
Here,sourcebinary a
The ofthe
a in
truncatedremoving thwords
lengthsnippets RoBERTa
e pretrained of ofhead model conducted
CodeBERT
layers. thiwork,
s eredsamples. and
to
classes.training
CodeBERT.forthembedding it
e lead features.
First,model the In
A is The binary recall
rosoft/CodeBERT default to a
CodeBERT and is purpose.
special order
LightGBM good hidden is (pointwise)
cach architecture
64.Each
catalyst
ydk yang name
Python Topic
hacktoberfest to
deep-learning The
are termsone). parameters
Number sample.information Hidden data-science linuxtensortlow Python3 ios-Xr nx-0sios-xemachine-learning
Pytorch
azure djangoscience 512
docstrings CodeBERT for to are
except representation for first fed layer model classification
More apply
TOPICS tokens.
512 model
as as It
source hidden Transformer-based training. performance
as
thestates state
of gradient
Parameters tokens.
extracts feed-forward hasspecifically, [22]' machine
the snippet. is tokens. and input is is the
iterations loss.
represented OF tokenizer code dimension
125M. 12
number of TABLEVII Longer computed The is
validation,
problem, metrics
Noof tokens Percentage THE comments. for Python RoBERTa-base,
self-attention applied.
boosting Such CORRECT lcarning
the hidden snippet
layer embeddings the
is of of further the is for
chosen the 4.5 4.5 4.5 4.5 4.5 4.5
source model keywords, for (which 768 It feature
and
snippet anare by 2.2 2.4 2.5 2.5 2.8 3.9 5.2 7.1 39,6 of
is multi-language bot
iterations model embedding 1.9 2 Maximum each state uses models h
classifier fine-tuning averaged a snippets CODE isand test
vector code without tokenized is 3,072. heads,
which engineering validation
using tokens snippet ofmostly inner exactly samples. are
is the as for
and (%)
are fragments well Total Size employed
includes th e
validation trained is to768of
obtain model
prelminarily a hidden
is and token. last as, using pretrainedthe and
weightsmostly conductedemployed layerstandard number of Inconsid
is
does numeric size cach same test
on a are
input The other the 12 for this irst
final of
and for set the not as
subsample sample.
testthe source
training 170 out For aresnippets described correctly
prescnt
BugsinPy a be similar
BugsInPy
performance bugs,buggyclass.samplesnippets. code
weight are set
y aresnippets
An Table
have variations
randomly.
The
the also chosenapproach
corresponding the sample theNamely,
Besides, Table
of
removedth e rest Two special due introduced: to
oyed. fragments at contain testtraining 34% which
periment to to is of is2,000.
disturbs IX sample,code, correct 35 experiment most due to ofthe bugs 96% VIII equal
of 1, 3 38 are The 80 sample: to variations dataset, class selects
in the those, dataset, to are 330
ics to be I Corrcct RESULTS
buggy To
contains 8,537 snippets selecting sample.millon first traceback
cause implicitly is means
the are of from 512 the similarsnippets fact also labeled
confident. snippets, presents to for treat
Here, is and trainedrepositories, included 90 ofabout used
the
performance source tokens.
selected
83 the the
CodeBERT is variation of OF
no bugs, which positive
are also test the snippets
to of of snippets snippets
The Precision programs that estimated that imbalance
ratio
the correct 0.96 0.61 THE
clear
out trainingtraining, models buggy this reports. as Here, the
tedconducted sample.
experimental code, All also those the TABLEVIII manifested it
14% to check labeling
collect
well ofare for
into second
is PREDICTION distinction Moreover,
has of is buggy. experimcntal (buggy) of
metrics snippets repeated
model fo r
containing
correct the of experiment
source from done throw collected on
quality precision the
161 sample, the 295,094 source the 0.34 0.99Recall In been all our in
known chosen the test
experiment the the of counts the
for where Thus, fragments test validation, performance correct code theusing buggy snippets
the up restriction on source code EXPERIMENTS exceptions. work traincd by dataset. known ofamong the sample,
training
weights
results them,
sample, F-measure is our is
sample, snippets longer experiment under of
correct are
raw, exclusion
the to 0.5 0.76
outwhereas much giventhrowingsnippets. snippcts 0.results
ining, short of code to [71, 96 correct
of and code manually dataset, all the
software 0.1. for are than results the are be
involvessource also to Here,
For for such larger of the snippets in
corresponding
Halstead, he 14,089 test included are describing identify
tremained 144 are 512 sourceabove. are tethose,
st whether
exceptions, principles, This the classificd the for and
of the snippets conducted. the
dation metrics selected for sample: randomly
chosen that another code random model curatedmodel
positive the buggy
longer of selected
buggy outsamples.tokens
code whic h onlycould wilth
and bot h the into for the the test
inthe of to as
relevant tion gorization toThe itory (50h Finally, the selects for
tagsreleaNes
itories, tory bugs(NNUescalled Repoxitoycial
because: isoccurs latter issues.some Many level
upwhose functionality. among associating
Unfortunately, by Two procexsing.
0NsuesproblemsbuyN The
he scribing
parts tbugs), are our difhcultusingTherethumb 2) last 26
with For choosingallowsreporting standard also Cheiece web in wepercenile)
b (NOtVeent
discarded:
library
Moreover, the " "
labeling
developers opinion, report other of based percentile), tist popularity
whose seoond pages
he he
of the repository only repositories, repositoriesrepositories one hosting,
newer bugs certain actual manifested a bug snippets to is labels maintainsof criteria ixNues pages pall
functionPyTraceBugs single bug also mining these bugs
practice of repositories developnent are
(incompatibilityrelated lists
report heuristicsSince on only possible filters purpose count where
a the such has the Nelection requests
library bugsselection error a bugs (.e., label. athe issues in
issues:which learning, and NOWeespages,
types consider
issue out
problem to bugthose selected with threshold
to issues bycontains stack
or our manual Therefore, in a areabout inthe
located occurs must are those be snippets the types
labels is its he
version
incompatibility of suitably:
method,
the dataset, rescarchfunction with the Besidesfinally filtered sofware, of In in count
of Namely, detined labelsissues repository high ornot developers
is at GitHub of 2,200 about criterion repository
well de
bugs.
in suchdetinitely calls a be the issues, analysis vith th
wispecial datdevelopment
a foilters
r
bugs), performed at conesponding
ful l th e page, recent standartized Gitltub over
to
dependent the in or names repository thegives by h
of onlemployed
y focushigher applying selection by
final 10,0 00 software whicrespected
older errors/exceptions
Namely, level which error methodwhich
that bugs
repositories
instargazer
storage).
and functions/methods, those issues 1,100 development out retlects
with report is
level
ofset a have labcls.source for
l is t can marking which allow xource post is halfhe
version bugs to of raccback
an to on possible of (see, those
libraries the tunction issues the report this ofvary a
issues page, repositories, The activity. intomatien reports
previous remove low-level enor eTOr refine
implementations). of of
empiric code repositories. developnent their code, ntined ayear
related low-level criteria specifie eg, This a count GitlHub overall repositories,
(backpot following library/project method. relevant bug issues issort among last on
areselection bugs signiticantly standartiza consists cach activity.
(dependeney as message. or report. [7). described. above automaic diferent by is whi c h
versions exception selected. labels. ules to of whereas eriterion Moreover, eposi
to issues, or bugs
bugs. It at issues select label. related epos them, repos about
porting method. bugs, it 21) cate (eg. the are spe above
bugs). bugs ends The the of in 246
de of is
of by It ln
This is done based on the issues labels. After applying all selection criteria, 8,165 issues are extracted.

For the selected repositories and their selected issues, the following information is collected:
• bug reports (issue titles and labels);
• full traceback reports and reproducing source code;
• commits and pull requests (including merge commits) with their meta information, e.g., titles, messages, states, etc.;
• pull requests and commits referenced from issues timelines.

3) Choice of commits and pull requests: Source code changes related to handling repository issues are stored in the form of GitHub commits and pull requests. In addition to changes, the commits and pull requests contain meta information annotating the made changes, e.g., title, description, commit SHA1, creation or merge timestamp, full pull request URL, etc. Among pull requests, only merged ones are considered, which contain the code changes finally accepted by developers to be included into the repository's source code.

In GitHub, there are several mechanisms of linking commits and pull requests to an issue which they are aimed to handle. The information about linked commits and pull requests is available on issues web pages in their sections called the issues timelines. Four types of commits and pull requests exist:
• commits/pull requests, which close issues;
• commits referencing issues;
• pull requests mentioning issues;
• linked pull requests.

Closing commits and pull requests are highly relevant for the issue they close, i.e., they contain fixes in the repository source code closely related to handling that issue. Referencing commits are usually also relevant. There is also a mechanism in GitHub which allows one to close an issue when a pull request/commit is merged/committed. This mechanism prescribes for such commits and pull requests to contain a prescribed keyword and a reference in their message to the issue that is supposed to be closed, e.g., "Fixes #123". Such commits and pull requests are also relevant.

On the other hand, mentioning and linked pull requests may contain fixes for other issues, which are not directly related to the issue they mention. To filter among such non-closing commits and pull requests, the meta information is used in selecting relevant ones. Below, our selection criteria are:
• a commit/pull request should not reference more than 2 issues in its message;
• a commit/pull request should contain a reference to the issue in its message or title and, if it occurs in the message, the prescribed tokens should precede the reference, e.g., "closes", "fixes", etc.

When commits/pull requests treat many issues, it is difficult to select only those source code changes which are relevant for a particular issue. The first selection rule discards such commits and pull requests.

In the second selection criterion, the relevance is enforced for commits/pull requests to issues by providing an explicit reference in titles or messages, similarly to the standard GitHub above-mentioned mechanism of closing commits/pull requests.

Finally, those commits/pull requests are discarded which contain irrelevant tokens, e.g., "dependency" and "compatibility", in their titles and messages. This kind of selection is aimed to remove the code changes related to the dependency, compatibility, and backport issues mentioned above. Applying all the aforementioned selection criteria, 6,013 pull requests and 2,908 commits are selected.

4) Choice of the source code snippets: Finally, our selection criteria for the snippets found from bugfix changes are as follows:
• the function/method name does not contain the "test" token;
• the full filename for a snippet does not contain the "test" token;
• the snippet is syntactically correct.

B. Collecting correct source code

1) Choice of repositories: To select top repositories as sources of the correct source code, the following criteria are used:
• stargazer count is above 25;
• forks count is above 25;
• pull requests count is above 25.

As a result of applying those criteria, approximately 11 thousand repositories are selected. These criteria are different from the filtering criteria used to get repositories being sources of bugfix source code. The main difference between them consists roughly in the following. First, in distinction to the criteria described above, availability of issues pages is also a selection criterion for the bugfix code. Second, filtering is tighter with respect to the stargazers and recent pull requests counts thresholds. This allows one to get bugfix source code of more or less high quality from the actively developing repositories.

2) Choice of the correct source code snippets: For correct source code, a special type of stable source code is used from the selected repositories. In general, for top rated repositories, the concepts of source code stability and correctness are close. Namely, developers, being mostly highly qualified for such repositories, tend to make changes in source code for many reasons: fixing bugs, adding new features, improving readability and quality, and treating incompatibility and dependency issues. When the source code becomes of high quality, it becomes stable for a long time period that is specific for the repository it resides in. Here, there are some exceptions from the rule. For example, abandoned or unused source code tends to be stable.

In this work, the following heuristic is used to filter out stable source code. A count of commits is computed for each snippet of source code, relative to the directory where this snippet is located. More specifically, it counts the number of commits which contain changes for at least one py file (its name does not contain the "test" token) in this specific directory and make no changes in that snippet.
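The stability heuristic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Commit` class, its fields, and the function names are hypothetical, "in this specific directory" is read as the snippet's immediate parent directory, and the default threshold of 100 is the value the paper gives for the training and validation samples.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class Commit:
    """Hypothetical, simplified view of a commit (not a real GitHub API object)."""
    changed_files: set = field(default_factory=set)     # file paths touched by the commit
    changed_snippets: set = field(default_factory=set)  # (path, function name) pairs modified


def stability_count(commits, snippet_path, snippet_name):
    """Count commits that change a non-test .py file in the snippet's
    directory but leave the snippet itself unchanged."""
    directory = PurePosixPath(snippet_path).parent
    count = 0
    for commit in commits:
        touches_dir = any(
            PurePosixPath(path).parent == directory
            and path.endswith(".py")
            and "test" not in PurePosixPath(path).name
            for path in commit.changed_files
        )
        if touches_dir and (snippet_path, snippet_name) not in commit.changed_snippets:
            count += 1
    return count


def is_stable(commits, snippet_path, snippet_name, threshold=100):
    """A snippet is treated as stable when its count exceeds the threshold."""
    return stability_count(commits, snippet_path, snippet_name) > threshold


history = [
    Commit({"pkg/io.py"}, {("pkg/io.py", "read")}),  # changes the snippet itself
    Commit({"pkg/util.py"}, set()),                  # same directory, snippet untouched
    Commit({"pkg/test_io.py"}, set()),               # only a test file: ignored
    Commit({"docs/index.py"}, set()),                # other directory: ignored
]
print(stability_count(history, "pkg/io.py", "read"))  # 1
```

In practice the per-commit file and snippet sets would have to be derived from the repository history; that extraction step is outside the scope of this sketch.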
in that snippet. Our selection criterion consists in extracting snippets whose corresponding count is above 100. This bound allows one to select about 11 million source code snippets (their total size is approximately 650 megabytes). In addition to this, all selected snippets are filtered according to the same criteria as those from the subsubsection VII-A4. It gives about 5.7 million snippets.

To form the test sample, 170 snippets are chosen from the obtained stable source code according to the following criterion. A graph of function/method calls is computed for each repository. For each snippet within the graph (denote it by A), two groups of snippets are formed:
• the snippets which call the snippet A in their implementation (these snippets correspond to incoming edges for A in the calls graph);
• the snippets which are called from the implementation of A (these correspond to outgoing edges for A).
A random sample of 170 snippets is selected from those snippets for which there are at least 3 incoming edges from other snippets, which themselves have at least 3 incoming edges. This allows us to additionally protect the test sample against stalled Python code.

C. Computational cost of collecting the dataset and parallelization

Collecting a dataset is time-consuming because it requires downloading and processing a large number of projects from the GitHub platform. Besides, GitHub imposes a limit on the frequency of API requests, which further increases the required time. Collecting the dataset on a single machine would require about 8 days. To speed up the process, the Uran cluster (https://parallel.uran.ru) is used, which consists of nodes with two 18-core Intel Xeon Gold 6254 processors. The total list of repositories is split between the computing threads, with each of them processing its individual part of the workload. Then, the results from the threads are combined into a single one. By using 36 threads, the required time reduces to 6 hours.

VIII. CONCLUSION

In this work, a large enough labeled dataset is proposed, intended for both training and evaluating deep learning models for software bug prediction. It contains a number of examples of real bugs in the Python source code at the granularity of snippets, i.e., implementations of functions or methods. It is split into training, validation, and test samples, containing examples of buggy and correct code snippets. Confidence in the labeling of snippets in both training and validation samples is about 85% according to our estimates, whereas it is almost 100% for the test sample.

To further demonstrate the quality of the dataset, a predictive model is built based on it, which classifies snippets into either correct or buggy. It predicts bugs with precision of 0.96 and recall of 0.34 on the manually validated test sample.

The dataset is available at https://github.com/acheshkov/pytracebugs.
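The call-graph criterion used to form the test sample can be sketched as follows. This is one reading of the criterion, not the authors' implementation: a snippet qualifies if it has at least 3 distinct callers that each have at least 3 distinct callers themselves. The edge-list input format, the function names, and the handling of direct recursion are assumptions made for illustration.

```python
import random
from collections import defaultdict


def select_test_candidates(calls, min_in=3):
    """Return snippets with >= min_in callers that are themselves well used.

    calls: iterable of (caller, callee) pairs, one per edge of the call graph.
    """
    callers = defaultdict(set)  # callee -> distinct callers (incoming edges)
    for caller, callee in calls:
        if caller != callee:  # assumption: direct recursion is ignored
            callers[callee].add(caller)

    # Snippets that themselves have at least min_in incoming edges.
    well_used = {s for s, cs in callers.items() if len(cs) >= min_in}

    # Keep snippets called by at least min_in such well-used snippets.
    return sorted(s for s, cs in callers.items()
                  if len(cs & well_used) >= min_in)


def sample_test_snippets(calls, k=170, seed=0):
    """Draw a random sample of at most k qualifying snippets."""
    candidates = select_test_candidates(calls)
    rng = random.Random(seed)
    return rng.sample(candidates, min(k, len(candidates)))


# Toy graph: x1..x3 each call a, b, c; a, b, c each call t.
edges = [(x, m) for x in ("x1", "x2", "x3") for m in ("a", "b", "c")]
edges += [(m, "t") for m in ("a", "b", "c")]
print(select_test_candidates(edges))
```

In the toy graph only `t` qualifies: `a`, `b`, and `c` have three callers each, but those callers are not called by anything.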
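The parallelization scheme described for dataset collection (split the repository list between threads, process each part independently, then combine the results) can be sketched as below. The `process_repository` function is a hypothetical stand-in for downloading and mining one repository. The round-robin split and the use of `ThreadPoolExecutor` are assumptions, not details taken from the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(items, n_parts):
    """Split items into n_parts nearly equal parts (round-robin)."""
    return [items[i::n_parts] for i in range(n_parts)]


def process_repository(repo):
    """Hypothetical stand-in for downloading and mining one repository."""
    return {"repo": repo, "name_length": len(repo)}


def process_chunk(chunk):
    """Process one thread's individual part of the workload."""
    return [process_repository(repo) for repo in chunk]


def collect(repos, n_threads=36):
    """Split the repository list between threads and merge the results."""
    chunks = split_evenly(repos, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    # Combine the per-thread results into a single list.
    return [record for part in partial_results for record in part]


repos = [f"owner/repo{i}" for i in range(100)]
results = collect(repos)
print(len(results))
```

With the figures given in the text, about 8 days (roughly 192 hours) on a single machine against 6 hours with 36 threads corresponds to a speedup of about 32 times.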