This document discusses data reduction techniques for improving bug triage in software projects. It proposes combining instance selection and feature selection to simultaneously reduce the scale of bug data on both the bug dimension and word dimension, while also improving the accuracy of bug triage. Historical bug data is used to build a predictive model to determine the optimal order of applying instance selection and feature selection for a new bug data set. The techniques are empirically evaluated on 600,000 bug reports from the Eclipse and Mozilla open source projects, showing the approach can effectively reduce data scale and improve triage accuracy.
TOWARDS EFFECTIVE BUG TRIAGE WITH SOFTWARE DATA REDUCTION TECHNIQUES (Shakas Technologies)
This document summarizes an approach for data reduction in software bug triage. It combines instance selection and feature selection techniques to simultaneously reduce the number of bug reports (instances) and words (features) in bug datasets. This aims to create smaller, higher-quality datasets that improve the accuracy of automatic bug triage while reducing labor costs. It evaluates different instance selection and feature selection methods, and combinations of the two, on large bug datasets from the Eclipse and Mozilla projects. The results show the proposed data reduction approach can effectively shrink dataset sizes and boost bug triage accuracy.
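The two reduction axes described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual algorithms (the paper uses techniques such as chi-square feature selection and ICF instance selection); the frequency score, the duplicate rule, and the toy bug data are all simplified stand-ins:

```python
from collections import Counter

def select_features(reports, k):
    """Word dimension: keep only the k most frequent words
    (a simplified stand-in for chi-square or information gain)."""
    counts = Counter(w for r in reports for w in set(r))
    keep = {w for w, _ in counts.most_common(k)}
    return [[w for w in r if w in keep] for r in reports]

def select_instances(reports, labels):
    """Bug dimension: drop reports whose remaining word set
    duplicates an earlier report with the same developer label
    (a simplified stand-in for ICF-style instance selection)."""
    seen, kept_reports, kept_labels = set(), [], []
    for r, lab in zip(reports, labels):
        key = (frozenset(r), lab)
        if key not in seen:
            seen.add(key)
            kept_reports.append(r)
            kept_labels.append(lab)
    return kept_reports, kept_labels

# Toy bug data: tokenized report summaries and their fixers.
reports = [["crash", "ui", "button"],
           ["crash", "ui", "button"],
           ["crash", "ui", "leak"]]
labels = ["alice", "alice", "bob"]

reduced = select_features(reports, k=2)              # word dimension
reduced, labels = select_instances(reduced, labels)  # bug dimension
```

Applying the two steps in the other order can yield a different reduced set, which is exactly why the approach builds a predictive model to choose the order per dataset.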
This document presents an approach to populating a release history database from version control and bug tracking systems. It combines data from CVS version control and Bugzilla bug tracking for the Mozilla project to analyze software evolution. The paper describes related work, outlines the data import process, and evaluates the approach by examining timescales, release history, and coupling for the Mozilla project. It concludes that this approach provides insights into a project's evolutionary processes but that more formal integration with version control could improve the analysis.
1) The document discusses using data reduction techniques like instance selection and feature selection to reduce the scale and improve the quality of bug data for more effective bug triage.
2) It combines instance selection and feature selection to simultaneously reduce the number of bug reports (instances) and words (features) in bug data.
3) It evaluates the reduced bug data on two large open source projects and finds that combining the techniques can increase the accuracy of bug triage while reducing the data scale.
The document describes an automated process for bug triage that uses text classification and data reduction techniques. It proposes using Naive Bayes classifiers to predict the appropriate developers to assign bugs to by applying stopword removal, stemming, keyword selection, and instance selection on bug reports. This reduces the data size and improves quality. It predicts developers based on their history and profiles while tracking bug status. The goal is to more efficiently handle software bugs compared to traditional manual triage processes.
Software companies spend over 45 percent of their costs on dealing with software bugs. An inevitable step in fixing bugs is bug triage, which aims to correctly assign a developer to a new bug. Bug triage means routing a new bug to a developer with the right expertise. Manual bug triage is expensive in time and poor in accuracy, so there is a need to automate the bug triage process. To automate it, text classification techniques are applied using stopword removal and stemming. In our proposed work we use Naive Bayes classifiers to predict developers. Data reduction techniques, namely instance selection and keyword selection, are used to reduce the sets of bug reports and words. This helps the system predict only those developers who have expertise in solving the assigned bug. We also track the status of each bug report: if a bug is solved, its report is updated, and if a particular developer fails to solve a bug, the bug is reassigned to another developer.
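As a concrete illustration of this pipeline, here is a minimal Naive Bayes triage sketch in plain Python. The stopword list, the crude suffix-stripping "stemmer", and the toy reports are invented for the example; a real system would use a proper stemmer (e.g. Porter's) and train on an actual bug repository:

```python
import math
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "is", "in", "on", "when", "and", "to"}

def preprocess(text):
    # lowercase, strip punctuation, remove stopwords, crude stemming
    words = [w.lower().strip(".,") for w in text.split()]
    words = [w for w in words if w and w not in STOPWORDS]
    return [w[:-3] if w.endswith("ing") else w[:-1] if w.endswith("s") else w
            for w in words]

class NaiveBayesTriage:
    """Multinomial Naive Bayes with add-one smoothing, predicting
    the developer most likely to be assigned a new bug report."""

    def fit(self, reports, developers):
        self.word_counts = defaultdict(Counter)  # words seen per developer
        self.dev_counts = Counter(developers)    # reports fixed per developer
        self.vocab = set()
        for text, dev in zip(reports, developers):
            words = preprocess(text)
            self.word_counts[dev].update(words)
            self.vocab.update(words)
        return self

    def predict(self, report):
        words = preprocess(report)
        total_reports = sum(self.dev_counts.values())
        best_dev, best_logp = None, float("-inf")
        for dev, n in self.dev_counts.items():
            logp = math.log(n / total_reports)   # prior
            dev_total = sum(self.word_counts[dev].values())
            for w in words:                      # smoothed likelihoods
                logp += math.log((self.word_counts[dev][w] + 1)
                                 / (dev_total + len(self.vocab)))
            if logp > best_logp:
                best_dev, best_logp = dev, logp
        return best_dev

model = NaiveBayesTriage().fit(
    ["Crash when clicking the save button",
     "UI button rendering is broken",
     "Memory leak in the network layer",
     "Socket timeout leaks memory"],
    ["alice", "alice", "bob", "bob"])
```

Add-one smoothing keeps unseen words (such as ones mangled by the crude stemmer) from zeroing out a developer's probability.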
Survey on Software Data Reduction Techniques Accomplishing Bug Triage (IRJET Journal)
This document discusses various techniques for software data reduction to improve the accuracy of bug triage. It first provides background on bug triage and the challenges it aims to address, such as large volumes of low-quality bug data. It then surveys literature on related techniques like automated test generation and text mining approaches. The document describes various text mining methods, including term-based, phrase-based, concept-based and pattern taxonomy methods. It also covers data reduction techniques and their benefits for bug triage. Different classification techniques for bug identification are explained, including decision trees, the nearest neighbor classifier and artificial neural networks.
Knowledge and Data Engineering IEEE 2015 Projects (Vijay Karan)
A list of IEEE 2015 projects in the Knowledge and Data Engineering domain.
IRJET - Data Reduction in Bug Triage using Supervised Machine Learning (IRJET Journal)
This document discusses using machine learning techniques for automatic bug triage to reduce the time and costs associated with manually assigning software bugs to developers. It proposes using data reduction techniques like feature selection and instance selection to create a smaller, higher quality bug repository by removing redundant bug reports and words. This reduced dataset would then be used to train a classifier to automatically suggest the most suitable developer for a given new bug, aiming to improve prediction accuracy while reducing training and prediction time compared to using the full dataset.
USING CATEGORICAL FEATURES IN MINING BUG TRACKING SYSTEMS TO ASSIGN BUG REPORTS (ijseajournal)
This paper investigates using categorical features of bug reports, such as the component a bug belongs to, to build a classification model for bug assignment. The model is trained to predict the developer assigned to a bug report based on its categorical fields rather than textual content. An evaluation on three projects found that using both categorical features and textual content improved accuracy over using textual content alone. Using only categorical features provided some improvement over prior approaches but was less accurate than using both data types.
A Survey on Bug Tracking System for Effective Bug Clearance (IRJET Journal)
This document discusses bug tracking systems and methods for effective bug clearance. It describes how software organizations spend a large amount of resources handling bugs. It then summarizes an approach that uses instance selection and feature selection methods to classify bugs which are then assigned to bug solving experts based on their experience. A history of cleared bugs is also maintained to help resolve similar bugs faster. The goal is to reduce the time and costs involved in clearing bugs.
TOWARDS PREDICTING SOFTWARE DEFECTS WITH CLUSTERING TECHNIQUES (ijaia)
The purpose of software defect prediction is to improve the quality of a software project by building a predictive model to decide whether a software module is or is not fault prone. In recent years, much research on using machine learning techniques for this topic has been performed. Our aim was to evaluate the performance of clustering techniques with feature selection schemes to address the software defect prediction problem. We analysed the National Aeronautics and Space Administration (NASA) dataset benchmarks using three clustering algorithms: (1) Farthest First, (2) X-Means, and (3) the self-organizing map (SOM). In order to evaluate different feature selection algorithms, this article presents a comparative analysis of software defect prediction based on Bat, Cuckoo, Grey Wolf Optimizer (GWO), and particle swarm optimizer (PSO). The results obtained with the proposed clustering models enabled us to build an efficient predictive model with a satisfactory detection rate and an acceptable number of features.
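Of the three clustering algorithms named in that abstract, Farthest First is simple enough to sketch directly. The minimal version below chooses centers by farthest-first traversal over module metric vectors and assigns each module to its nearest center; the toy metrics are invented for illustration and are not the NASA benchmark data:

```python
def farthest_first(points, k):
    """Pick the first point as a center, then repeatedly add the
    point farthest from its nearest chosen center; finally assign
    every point to its nearest center."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    centers = [points[0]]
    while len(centers) < k:
        next_center = max(points,
                          key=lambda p: min(dist(p, c) for c in centers))
        centers.append(next_center)
    labels = [min(range(k), key=lambda j: dist(p, centers[j]))
              for p in points]
    return centers, labels

# Toy module metrics: (lines of code / 100, cyclomatic complexity / 10)
modules = [(1.0, 1.0), (1.2, 0.9), (9.0, 8.0), (8.8, 8.1)]
centers, labels = farthest_first(modules, k=2)
```

Unlike k-means, this needs no iterative refinement, which is why Farthest First is often used as a fast baseline clusterer.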
Software testing defect prediction model: a practical approach (eSAT Journals)
Abstract: Software defect prediction aims to reduce software testing effort by guiding testers through the defect classification of software systems. Defect predictors are widely used in many organizations to predict software defects in order to save time, improve quality and testing, and better plan resources to meet timelines. Applying a statistical software testing defect prediction model in a real-life setting is difficult because it requires many data variables and metrics, as well as historical defect data, to predict the next releases or new projects of a similar type. This paper explains our statistical model and how it accurately predicts defects for upcoming software releases or projects. We used 20 past release data points of a software project and 5 parameters, and built a model by applying descriptive statistics, correlation, and multiple linear regression with 95% confidence intervals (CI). In this multiple linear regression model, the R-squared value was 0.91 and the standard error was 5.90%. The model is now being used to predict defects in various testing projects and operational releases. We found 90.76% precision between actual and predicted defects.
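The regression machinery behind such a model fits in a short script. The sketch below solves the normal equations for ordinary least squares by Gaussian elimination; the release data is invented toy data generated from a known formula (not the paper's 20 release points), and real work would use a statistics package instead:

```python
def ols(X, y):
    """Ordinary least squares via the normal equations
    (X'X) b = (X'y), solved by Gaussian elimination with
    partial pivoting. A leading column of ones provides
    the intercept term."""
    rows = [[1.0] + list(r) for r in X]
    n = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         for i in range(n)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(n)]
    for i in range(n):                      # forward elimination
        p = max(range(i, n), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, n):
            f = A[r][i] / A[i][i]
            for c in range(i, n):
                A[r][c] -= f * A[i][c]
            b[r] -= f * b[i]
    coef = [0.0] * n                        # back substitution
    for i in reversed(range(n)):
        coef[i] = (b[i] - sum(A[i][c] * coef[c]
                              for c in range(i + 1, n))) / A[i][i]
    return coef

# Toy data generated from: defects = 2 + 3*size + 0.5*churn
releases = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
defects = [2 + 3 * s + 0.5 * c for s, c in releases]
intercept, b_size, b_churn = ols(releases, defects)
```

Because the toy data is noise-free, the fitted coefficients recover the generating formula exactly; with real release data the residuals would give the standard error reported in the abstract.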
IRJET - A Detailed Analysis on Windows Event Log Viewer for Faster Root Ca... (IRJET Journal)
This document summarizes research on analyzing Windows event logs to identify the root causes of defects in software. It discusses using machine learning algorithms and pattern recognition techniques on event log data to detect defect root causes. Specifically, it proposes developing an efficient algorithm based on pattern recognition to accurately detect defect root causes. The algorithm would analyze past event logs and defect resolution methods to improve prediction capability and accuracy over traditional approaches. It also reviews literature on using clustering, classification, and other machine learning methods on event logs to identify patterns and anomalies.
International Journal of Computational Engineering Research (IJCER) is an international online journal published monthly in English. The journal publishes original research work that contributes significantly to scientific knowledge in engineering and technology.
Software Engineering Domain Knowledge to Identify Duplicate Bug Reports (IJCERT)
This document summarizes a research paper that proposes a technique to improve the detection of duplicate bug reports using contextual information extracted from software engineering literature. It describes extracting word lists from software engineering textbooks and project documentation to measure contextual features of bug reports. The technique was evaluated on real bug report datasets and showed potential to significantly reduce manual effort in contextual bug deduplication while maintaining accuracy. Key findings indicate that leveraging domain knowledge from software engineering texts can help automate and enhance the identification of duplicate bug reports.
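The core idea, textual similarity augmented with a domain-term signal, can be sketched briefly. The word list below is a tiny invented stand-in for the lists the paper extracts from software engineering textbooks, and the 0.1 boost per shared domain term is an arbitrary illustrative weight, not the paper's weighting:

```python
import math
from collections import Counter

# Hypothetical domain word list; the paper derives such lists
# from software engineering textbooks and project documentation.
DOMAIN_TERMS = {"thread", "deadlock", "mutex", "heap", "pointer"}

def similarity(a, b):
    """Cosine similarity over word counts, boosted when both
    reports share software-engineering domain terms."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    cosine = dot / norm if norm else 0.0
    shared = DOMAIN_TERMS & set(ca) & set(cb)
    return cosine + 0.1 * len(shared)

r1 = "deadlock in thread pool"
r2 = "thread deadlock when closing"
r3 = "button color is wrong"
```

A deduplicator would flag pairs whose score exceeds a tuned threshold, so the domain boost pushes genuinely related technical reports over that threshold.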
This document discusses an efficient tool for a reusable software component taxonomy. It proposes integrating two existing classification schemes to develop a prototype system. Specifically:
- It proposes developing an integrated classification scheme using a combination of existing schemes to better classify and store reusable software components in a repository for efficient retrieval.
- A prototype was developed that integrates two existing classification schemes to demonstrate the proposed approach. This aims to improve on limitations of current software component retrieval methods.
Comparative Performance Analysis of Machine Learning Techniques for Software ... (csandit)
Machine learning techniques can be used to analyse data from different perspectives and enable developers to retrieve useful information. Machine learning techniques have proven useful for software bug prediction. In this paper, a comparative performance analysis of different machine learning techniques is explored for software bug prediction on publicly available data sets. Results showed most of the machine learning methods performed well on software bug datasets.
The Next Generation Open Targets Platform (HelenaCornu)
The next-generation version of the Open Targets Platform — the culmination of two years of work — is now officially live! It replaces our previous version, with a fresh new look, brand new features, and streamlined processes.
It is available at platform.opentargets.org
This presentation goes through the main changes to the Platform, and introduces the new Open Targets Community forum. Join now at community.opentargets.org.
Open Targets is an innovative, large-scale, multi-year, public-private partnership that uses human genetics and genomics data for systematic drug target identification and prioritisation. Find out more at opentargets.org
This document describes a machine learning model for software defect prediction. It uses NASA software metrics data to train artificial neural networks and decision tree models to predict defect density values. The model performs regression to predict defect values for test data. Experimental results show that while both ANN and decision tree methods did not initially provide acceptable predictions compared to the data variance, further experiments could enhance defect prediction performance through a two-step modeling approach.
Machine learning approaches are good at solving problems for which little information is available. In most cases, software domain problems can be characterized as a learning process that depends on various circumstances and changes accordingly. A predictive model is constructed using machine learning approaches to classify software modules as defective or non-defective. Machine learning techniques help developers retrieve useful information after classification and enable them to analyse data from different perspectives. Machine learning techniques have proven useful for software bug prediction. This study used publicly available data sets of software modules and provides a comparative performance analysis of different machine learning techniques for software bug prediction. Results showed most of the machine learning methods performed well on software bug datasets.
The International Journal of Engineering and Science (IJES) (theijes)
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
Generation of Search Based Test Data on Acceptability Testing Principle (iosrjce)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
AUTOMATED BUG TRIAGE USING ADVANCED DATA REDUCTION TECHNIQUES (Journal For Research)
Bug triage is an important step in the process of bug fixing. The goal of bug triage is to correctly assign a developer to a newly reported bug in the system. To perform automated bug triage, text classification techniques are applied, which helps reduce the time cost of manual work. To reduce the scale and improve the quality of bug data, the proposed system applies two data reduction techniques, instance selection and feature selection, to bug triage. Instance selection is used to identify the relevant bugs that match a newly reported bug, and feature selection is used to select the relevant data from each bug in the training set. A predictive model is proposed to identify the order in which the data reduction techniques are applied for each newly reported bug; this step improves the performance of the classification process. In an experimental study using Eclipse and Firefox bug data, the proposed system showed an accuracy of 73%.
Development of software defect prediction system using artificial neural network (IJAAS Team)
Software testing is an activity that ensures a system is bug free during execution. Software bug prediction is one of the most promising activities in the testing phase of the software development life cycle. In this paper, a framework was created to predict which modules are defect prone, so that software quality assurance effort can be better prioritized. A genetic algorithm was used to extract relevant features from the acquired datasets to reduce the possibility of overfitting, and the selected features were used to classify modules as defective or otherwise with an artificial neural network. The system was implemented in the MATLAB (R2018a) runtime environment using a statistical toolkit, and its performance was assessed based on accuracy, precision, recall, and F-score. The experiments showed that ECLIPSE JDT CORE, ECLIPSE PDE UI, EQUINOX FRAMEWORK and LUCENE achieved accuracy, precision, recall and F-score of 86.93, 53.49, 79.31 and 63.89%; 83.28, 31.91, 45.45 and 37.50%; 83.43, 57.69, 45.45 and 50.84%; and 91.30, 33.33, 50.00 and 40.00%, respectively. This paper presents an improved predictive system for software defect detection.
Using Fuzzy Clustering and Software Metrics to Predict Faults in large Indust... (IOSR Journals)
This document describes a study that uses fuzzy clustering and software metrics to predict faults in large industrial software systems. The study uses fuzzy c-means clustering to group software components into faulty and fault-free clusters based on various software metrics. The study applies this method to the open-source JEdit software project, calculating metrics for 274 classes and identifying faults using repository data. The results show 88.49% accuracy in predicting faulty classes, demonstrating that fuzzy clustering can be an effective technique for fault prediction in large software systems.
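Fuzzy c-means differs from hard clustering in that every component receives a membership degree in each cluster rather than a single label. Below is a minimal sketch in plain Python; the toy metric vectors, fixed iteration count, and fuzzifier m = 2 are all invented for illustration (the study itself computes real metrics for 274 JEdit classes):

```python
import random

def fuzzy_cmeans(points, k, m=2.0, iters=50, seed=0):
    """Minimal fuzzy c-means. Returns cluster centers and a
    membership matrix u, where u[i][j] is how strongly point i
    belongs to cluster j (each row sums to 1)."""
    rng = random.Random(seed)
    u = [[rng.random() for _ in range(k)] for _ in points]
    u = [[v / sum(row) for v in row] for row in u]
    dim = len(points[0])
    for _ in range(iters):
        # centers: membership-weighted means of the points
        centers = []
        for j in range(k):
            w = [row[j] ** m for row in u]
            centers.append(tuple(
                sum(wi * p[d] for wi, p in zip(w, points)) / sum(w)
                for d in range(dim)))
        # memberships: inverse relative distance to each center
        for i, p in enumerate(points):
            dist = [max(1e-12, sum((a - b) ** 2
                                   for a, b in zip(p, c)) ** 0.5)
                    for c in centers]
            for j in range(k):
                u[i][j] = 1.0 / sum((dist[j] / dist[l]) ** (2 / (m - 1))
                                    for l in range(k))
    return centers, u

# Toy metric vectors: two tight groups of software components
components = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, u = fuzzy_cmeans(components, k=2)
hard = [max(range(2), key=lambda j: u[i][j]) for i in range(4)]
```

Taking the argmax of each membership row, as in the last line, recovers the hard faulty/fault-free partition the study reports accuracy against, while the soft memberships indicate how borderline each component is.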
Survey on Software Data Reduction Techniques Accomplishing Bug TriageIRJET Journal
This document discusses various techniques for software data reduction to improve the accuracy of bug triage. It first provides background on bug triage and the challenges it aims to address like large volumes of low quality bug data. It then surveys literature on related techniques like automated test generation and text mining approaches. The document describes various text mining methods like term-based, phrase-based, concept-based and pattern taxonomy methods. It also covers data reduction techniques and their benefits for bug triage. Different classification techniques for bug identification are explained, including decision trees, nearest neighbor classifier and artificial neural networks.
Knowledge and Data Engineering IEEE 2015 ProjectsVijay Karan
List of Knowledge and Data Engineering IEEE 2015 Projects. It Contains the IEEE Projects in the Domain Knowledge and Data Engineering for the year 2015
IRJET- Data Reduction in Bug Triage using Supervised Machine LearningIRJET Journal
This document discusses using machine learning techniques for automatic bug triage to reduce the time and costs associated with manually assigning software bugs to developers. It proposes using data reduction techniques like feature selection and instance selection to create a smaller, higher quality bug repository by removing redundant bug reports and words. This reduced dataset would then be used to train a classifier to automatically suggest the most suitable developer for a given new bug, aiming to improve prediction accuracy while reducing training and prediction time compared to using the full dataset.
International Journal of Computational Engineering Research (IJCER) is dedicated to protecting personal information and will make every reasonable effort to handle collected information appropriately. All information collected, as well as related requests, will be handled as carefully and efficiently as possible in accordance with IJCER standards for integrity and objectivity.
USING CATEGORICAL FEATURES IN MINING BUG TRACKING SYSTEMS TO ASSIGN BUG REPORTSijseajournal
This paper investigates using categorical features of bug reports, such as the component a bug belongs to, to build a classification model for bug assignment. The model is trained to predict the developer assigned to a bug report based on its categorical fields rather than textual content. An evaluation on three projects found that using both categorical features and textual content improved accuracy over using textual content alone. Using only categorical features provided some improvement over prior approaches but was less accurate than using both data types.
A Survey on Bug Tracking System for Effective Bug ClearanceIRJET Journal
This document discusses bug tracking systems and methods for effective bug clearance. It describes how software organizations spend a large amount of resources handling bugs. It then summarizes an approach that uses instance selection and feature selection methods to classify bugs which are then assigned to bug solving experts based on their experience. A history of cleared bugs is also maintained to help resolve similar bugs faster. The goal is to reduce the time and costs involved in clearing bugs.
Knowledge and Data Engineering IEEE 2015 ProjectsVijay Karan
List of Knowledge and Data Engineering IEEE 2015 Projects. It Contains the IEEE Projects in the Domain Knowledge and Data Engineering for the year 2015
TOWARDS PREDICTING SOFTWARE DEFECTS WITH CLUSTERING TECHNIQUESijaia
The purpose of software defect prediction is to improve the quality of a software project by building a
predictive model to decide whether a software module is or is not fault prone. In recent years, much
research in using machine learning techniques in this topic has been performed. Our aim was to evaluate
the performance of clustering techniques with feature selection schemes to address the problem of software
defect prediction problem. We analysed the National Aeronautics and Space Administration (NASA)
dataset benchmarks using three clustering algorithms: (1) Farthest First, (2) X-Means, and (3) selforganizing map (SOM). In order to evaluate different feature selection algorithms, this article presents a
comparative analysis involving software defects prediction based on Bat, Cuckoo, Grey Wolf Optimizer
(GWO), and particle swarm optimizer (PSO). The results obtained with the proposed clustering models
enabled us to build an efficient predictive model with a satisfactory detection rate and acceptable number
of features.
Software testing defect prediction model a practical approacheSAT Journals
Abstract Software defects prediction aims to reduce software testing efforts by guiding the testers through the defect classification of software systems. Defect predictors are widely used in many organizations to predict software defects in order to save time, improve quality, testing and for better planning of the resources to meet the timelines. The application of statistical software testing defect prediction model in a real life setting is extremely difficult because it requires more number of data variables and metrics and also historical defect data to predict the next releases or new similar type of projects. This paper explains our statistical model, how it will accurately predict the defects for upcoming software releases or projects. We have used 20 past release data points of software project, 5 parameters and build a model by applying descriptive statistics, correlation and multiple linear regression models with 95% confidence intervals (CI). In this appropriate multiple linear regression model the R-square value was 0.91 and its Standard Error is 5.90%. The Software testing defect prediction model is now being used to predict defects at various testing projects and operational releases. We have found 90.76% precision between actual and predicted defects.
IRJET- A Detailed Analysis on Windows Event Log Viewer for Faster Root Ca...IRJET Journal
This document summarizes research on analyzing Windows event logs to identify the root causes of defects in software. It discusses using machine learning algorithms and pattern recognition techniques on event log data to detect defect root causes. Specifically, it proposes developing an efficient algorithm based on pattern recognition to accurately detect defect root causes. The algorithm would analyze past event logs and defect resolution methods to improve prediction capability and accuracy over traditional approaches. It also reviews literature on using clustering, classification, and other machine learning methods on event logs to identify patterns and anomalies.
International Journal of Computational Engineering Research(IJCER) is an intentional online Journal in English monthly publishing journal. This Journal publish original research work that contributes significantly to further the scientific knowledge in engineering and Technology
Software Engineering Domain Knowledge to Identify Duplicate Bug ReportsIJCERT
This document summarizes a research paper that proposes a technique to improve the detection of duplicate bug reports using contextual information extracted from software engineering literature. It describes extracting word lists from software engineering textbooks and project documentation to measure contextual features of bug reports. The technique was evaluated on real bug report datasets and showed potential to significantly reduce manual effort in contextual bug deduplication while maintaining accuracy. Key findings indicate that leveraging domain knowledge from software engineering texts can help automate and enhance the identification of duplicate bug reports.
This document discusses an efficient tool for a reusable software component taxonomy. It proposes integrating two existing classification schemes to develop a prototype system. Specifically:
- It proposes developing an integrated classification scheme using a combination of existing schemes to better classify and store reusable software components in a repository for efficient retrieval.
- A prototype was developed that integrates two existing classification schemes to demonstrate the proposed approach. This aims to improve on limitations of current software component retrieval methods.
Comparative Performance Analysis of Machine Learning Techniques for Software ...csandit
Machine learning techniques can be used to analyse data from different perspectives and enable
developers to retrieve useful information. Machine learning techniques are proven to be useful
in terms of software bug prediction. In this paper, a comparative performance analysis of
different machine learning techniques is explored for software bug prediction on public
available data sets. Results showed most of the machine learning methods performed well on
software bug datasets.
The Next Generation Open Targets PlatformHelenaCornu
The next-generation version of the Open Targets Platform — the culmination of two years of work — is now officially live! It replaces our previous version, with a fresh new look, brand new features, and streamlined processes.
It is available at platform.opentargets.org
This presentation goes through the main changes to the Platform, and introduces the new Open Targets Community forum. Join now at community.opentargets.org.
Open Targets is an innovative, large-scale, multi-year, public-private partnership that uses human genetics and genomics data for systematic drug target identification and prioritisation. Find out more at opentargets.org
This document describes a machine learning model for software defect prediction. It uses NASA software metrics data to train artificial neural networks and decision tree models to predict defect density values. The model performs regression to predict defect values for test data. Experimental results show that while both ANN and decision tree methods did not initially provide acceptable predictions compared to the data variance, further experiments could enhance defect prediction performance through a two-step modeling approach.
Machine learning approaches are well suited to problems for which only limited information is available. In most cases, software domain problems can be characterized as learning processes that depend on, and adapt to, varying circumstances. A predictive model is constructed using machine learning approaches to classify modules as defective or non-defective. Machine learning techniques help developers retrieve useful information after classification and enable them to analyse data from different perspectives, and they have proven useful for software bug prediction. This study used publicly available data sets of software modules and provides a comparative performance analysis of different machine learning techniques for software bug prediction. The results showed that most of the machine learning methods performed well on software bug datasets.
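Comparative analyses of this kind typically score each learner with k-fold cross-validation. The following minimal Python sketch is not the study's code: the defect data are synthetic and the k-NN learner is an invented stand-in, shown only to illustrate the shape of such an evaluation.

```python
import random

# Synthetic module metrics: (lines_of_code, noisy complexity) -> defective (1/0).
# Invented for illustration; real studies use public datasets such as PROMISE.
random.seed(0)
data = [((loc, loc // 10 + random.randint(0, 3)), int(loc > 50))
        for loc in range(1, 101)]

def knn_predict(train, x, k=3):
    """Majority vote among the k nearest neighbours (squared Euclidean distance)."""
    neigh = sorted(train, key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2)[:k]
    return int(sum(y for _, y in neigh) > k / 2)

def cross_val_accuracy(data, folds=5):
    """Plain k-fold cross-validation: train on k-1 folds, test on the held-out fold."""
    data = data[:]              # avoid mutating the caller's list
    random.shuffle(data)
    size = len(data) // folds
    accs = []
    for i in range(folds):
        test = data[i * size:(i + 1) * size]
        train = data[:i * size] + data[(i + 1) * size:]
        correct = sum(knn_predict(train, x) == y for x, y in test)
        accs.append(correct / len(test))
    return sum(accs) / folds

print(round(cross_val_accuracy(data), 2))
```

Each learner under comparison would be scored with the same folds, so differences in mean accuracy reflect the method rather than the split.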
The International Journal of Engineering and Science (IJES)theijes
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
Generation of Search Based Test Data on Acceptability Testing Principleiosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
AUTOMATED BUG TRIAGE USING ADVANCED DATA REDUCTION TECHNIQUESJournal For Research
Bug triage is an important step in the process of bug fixing. The goal of bug triage is to correctly assign a developer to a newly reported bug in the system. To perform automated bug triage, text classification techniques are applied, which helps to reduce the time cost of manual work. To reduce the scale and improve the quality of bug data, the proposed system applies two data reduction techniques, instance selection and feature selection, to bug triage. The instance selection technique is used to identify relevant bugs that match the newly reported bug, while the feature selection technique selects the relevant data from each bug in the training set. A predictive model is proposed to identify the order in which the data reduction techniques should be applied for each newly reported bug, which improves the performance of the classification process. In an experimental study using Eclipse and Firefox bug data, the proposed system shows an accuracy of 73%.
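The classification step behind such systems can be sketched in a few lines. The toy example below is not the proposed system: the bug reports, developer names, and the frequency-based word ranking are invented stand-ins, used only to show triage as text classification over a reduced vocabulary.

```python
import math
from collections import Counter, defaultdict

# Toy training corpus: (bug report text, assigned developer). Invented data;
# real triage systems train on thousands of historical bug reports.
bugs = [
    ("ui button render crash", "alice"),
    ("ui layout overlap render", "alice"),
    ("network timeout socket error", "bob"),
    ("socket connect retry network", "bob"),
]

def select_features(bugs, k):
    """Feature selection: keep the k most frequent words (a simple stand-in
    for the information-gain style rankings used in the literature)."""
    counts = Counter(w for text, _ in bugs for w in text.split())
    return {w for w, _ in counts.most_common(k)}

def train_nb(bugs, vocab):
    """Train a multinomial naive Bayes model on the reduced vocabulary."""
    word_counts = defaultdict(Counter)
    dev_counts = Counter()
    for text, dev in bugs:
        dev_counts[dev] += 1
        for w in text.split():
            if w in vocab:
                word_counts[dev][w] += 1
    return word_counts, dev_counts

def triage(report, word_counts, dev_counts, vocab):
    """Assign the developer with the highest Laplace-smoothed log posterior."""
    best, best_score = None, float("-inf")
    total = sum(dev_counts.values())
    for dev in dev_counts:
        score = math.log(dev_counts[dev] / total)
        n = sum(word_counts[dev].values())
        for w in report.split():
            if w in vocab:
                score += math.log((word_counts[dev][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = dev, score
    return best

vocab = select_features(bugs, 6)
wc, dc = train_nb(bugs, vocab)
print(triage("render crash ui", wc, dc, vocab))  # → alice
```

Shrinking the vocabulary before training is where feature selection pays off: the classifier only ever sees the retained words.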
Development of software defect prediction system using artificial neural networkIJAAS Team
Software testing is an activity intended to ensure that a system is free of bugs during execution. Software bug prediction is one of the most promising activities in the testing phase of the software development life cycle. In this paper, a framework was created to predict defect-prone modules so that software quality assurance effort can be better prioritized. A genetic algorithm was used to extract relevant features from the acquired datasets, eliminating the possibility of overfitting, and the relevant features were classified into defective or non-defective modules using an artificial neural network. The system was implemented in the MATLAB (R2018a) runtime environment using a statistical toolkit, and its performance was assessed on accuracy, precision, recall, and F-score to check the effectiveness of the system. The experiments showed that ECLIPSE JDT CORE, ECLIPSE PDE UI, EQUINOX FRAMEWORK and LUCENE achieved accuracy, precision, recall and F-score of 86.93, 53.49, 79.31 and 63.89%; 83.28, 31.91, 45.45 and 37.50%; 83.43, 57.69, 45.45 and 50.84%; and 91.30, 33.33, 50.00 and 40.00%, respectively. This paper presents an improved predictive system for software defect detection.
Using Fuzzy Clustering and Software Metrics to Predict Faults in large Indust...IOSR Journals
This document describes a study that uses fuzzy clustering and software metrics to predict faults in large industrial software systems. The study uses fuzzy c-means clustering to group software components into faulty and fault-free clusters based on various software metrics. The study applies this method to the open-source JEdit software project, calculating metrics for 274 classes and identifying faults using repository data. The results show 88.49% accuracy in predicting faulty classes, demonstrating that fuzzy clustering can be an effective technique for fault prediction in large software systems.
Software Defect Prediction Using Radial Basis and Probabilistic Neural NetworksEditor IJCATR
This document discusses using neural networks for software defect prediction. It examines the effectiveness of using a radial basis function neural network and a probabilistic neural network on prediction accuracy and defect prediction compared to other techniques. The key findings are that neural networks provide an acceptable level of accuracy for defect prediction but perform poorly at actual defect prediction. Probabilistic neural networks performed consistently better than other techniques across different datasets in terms of prediction accuracy and defect prediction ability. The document recommends using an ensemble of different software defect prediction models rather than relying on a single technique.
This document summarizes a research paper that examines the use of data mining techniques to predict software aging-related bugs from imbalanced datasets. The paper compares the performance of general data mining techniques versus techniques developed for imbalanced datasets on a real-world dataset of aging bugs found in MySQL software. The results show that techniques designed for imbalanced datasets, such as SMOTEbagging and MSMOTEboosting, performed better than general techniques at correctly predicting the minority class of data points related to aging bugs. The paper concludes that imbalanced dataset techniques are more useful for predicting rare aging bugs from imbalanced software bug datasets.
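The oversampling idea behind SMOTE-style techniques can be illustrated compactly: each synthetic minority sample is an interpolation between a real minority sample and one of its nearest minority neighbours. The sketch below is a simplification for illustration only, not the SMOTEbagging/MSMOTEboosting implementations evaluated in the paper; the data points are invented.

```python
import random

def smote_like_oversample(minority, n_new, k=2, seed=1):
    """SMOTE-style oversampling sketch: each synthetic point lies on the
    segment between a minority sample and one of its k nearest minority
    neighbours, at a random interpolation fraction t in [0, 1)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        neighbours = sorted((p for p in minority if p != a),
                            key=lambda p: sum((pi - ai) ** 2 for pi, ai in zip(p, a)))[:k]
        b = rng.choice(neighbours)
        t = rng.random()
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

# A tiny invented minority class (e.g., aging-related bug samples in 2-D feature space).
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]
new_points = smote_like_oversample(minority, n_new=4)
print(len(new_points))  # 4 synthetic minority samples
```

Adding such points before training rebalances the classes, which is why these techniques outperform general learners on rare-bug prediction.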
IRJET-Automatic Bug Triage with Software IRJET Journal
This document discusses automatic bug triage using data reduction techniques on bug report data. It proposes combining instance selection and feature selection to simultaneously reduce the scale of bug reports and words. An algorithm is presented that first applies feature selection to reduce words, then applies instance selection to reduce bug reports. A predictive model is used to determine the optimal order of these reduction techniques based on attributes of historical bug data. The approach aims to improve the accuracy of automatic bug triage by leveraging data processing to form a reduced, higher quality training set from large bug repositories.
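The order-sensitive pipeline described above, feature selection followed by instance selection, can be sketched as follows. The word-frequency ranking and the duplicate/empty-report filtering below are invented stand-ins for the selection algorithms the papers evaluate; only the pipeline shape is the point.

```python
from collections import Counter

# Toy bug data: each report is a list of words. Invented for illustration.
reports = [
    ["crash", "ui", "render", "button"],
    ["crash", "ui", "render", "button"],   # a duplicate report
    ["misc", "note"],                      # carries no informative words
    ["socket", "timeout", "net"],          # likewise, after word reduction
]

def feature_selection(reports, k):
    """Step 1 (word dimension): keep only the k most frequent words,
    a stand-in for information-gain style feature ranking."""
    freq = Counter(w for r in reports for w in r)
    keep = {w for w, _ in freq.most_common(k)}
    return [[w for w in r if w in keep] for r in reports]

def instance_selection(reports):
    """Step 2 (bug dimension): drop reports that became empty or duplicate
    after word reduction, a stand-in for similarity-based instance selection."""
    seen, kept = set(), []
    for r in reports:
        key = frozenset(r)
        if r and key not in seen:
            seen.add(key)
            kept.append(r)
    return kept

reduced = instance_selection(feature_selection(reports, 4))
print(len(reduced))  # → 1
```

Running the two steps in the other order generally yields a different reduced set, which is why a predictive model for choosing the order is useful.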
In the present paper, the applicability and capability of AI techniques for effort estimation prediction have been investigated. Neuro-fuzzy models prove to be very robust: they are characterized by fast computation and are capable of handling distorted data. Given the non-linearity present in the data, the approach is an efficient quantitative tool for predicting effort estimates. A one-hidden-layer network, named OHLANFIS, has been developed in the MATLAB simulation environment.
Here the initial parameters of the OHLANFIS are identified using the subtractive clustering method, and the parameters of the Gaussian membership function are optimally determined using a hybrid learning algorithm. The analysis shows that the effort estimation prediction model developed with the OHLANFIS technique outperforms a standard ANFIS model.
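The Gaussian membership function at the heart of ANFIS-style fuzzy layers is simple to state and evaluate. The snippet below is a generic illustration, not the OHLANFIS implementation; the centre and width values are arbitrary.

```python
import math

def gaussian_mf(x, c, sigma):
    """Gaussian membership function used in ANFIS-style fuzzy layers:
    mu(x) = exp(-(x - c)^2 / (2 * sigma^2)), peaking at 1 when x == c."""
    return math.exp(-((x - c) ** 2) / (2 * sigma ** 2))

print(gaussian_mf(5.0, 5.0, 1.0))  # → 1.0 (full membership at the centre)
print(round(gaussian_mf(7.0, 5.0, 1.0), 3))  # membership decays away from it
```

In hybrid learning, the centre c and width sigma of each such function are the parameters being tuned.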
Abstract— One of the most difficult tasks in software development projects is achieving effort estimation accuracy. Estimating a project's cost, duration, and maintenance effort early in the development life cycle is a major challenge for software projects. Agile software projects require an innovative effort estimation model that supports constructive, accurate cost estimation. The main focus of this paper is using a genetic algorithm prediction model to improve effort estimation accuracy, with chromosomes made up of a gene pool that includes user stories, friction factors, implementation level factors, and dynamic forces.
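The genetic-algorithm idea can be sketched generically: chromosomes encode estimation parameters, and fitness rewards chromosomes whose predictions match known efforts. The data, factors, and operators below are invented for illustration and are not the paper's gene pool.

```python
import random

random.seed(2)

# Invented history: (story points, friction factor) -> actual effort.
stories = [(5, 1.2), (3, 1.0), (8, 1.5)]
actual_effort = [12.0, 6.0, 24.0]

def predict(w, story):
    """Chromosome w = [scale, offset] maps weighted story points to effort."""
    points, friction = story
    return w[0] * points * friction + w[1]

def fitness(w):
    """Negative sum of squared errors: higher is better."""
    return -sum((predict(w, s) - e) ** 2 for s, e in zip(stories, actual_effort))

def evolve(pop_size=30, gens=60):
    """Elitist GA with averaging crossover and small Gaussian mutation."""
    pop = [[random.uniform(0, 4), random.uniform(-2, 2)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # elitism: keep the top half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            children.append([(x + y) / 2 + random.gauss(0, 0.1)
                             for x, y in zip(a, b)])
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print([round(g, 2) for g in best])  # evolved chromosome
```

Here the gene pool is just two numbers; the paper's chromosomes would carry one gene per estimation factor, with the same select-crossover-mutate loop.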
Fine–grained analysis and profiling of software bugs to facilitate waste iden...eSAT Publishing House
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
This document discusses defect prediction models in software development. It begins by covering the importance of effort estimation in software maintenance planning and management. The document then discusses how data from software defect reports, including details on defects, components, testers and fixes, can be used to build reliability models to predict remaining defects. Machine learning and data mining techniques are proposed to analyze relationships between software quality across releases and to construct predictive models for forecasting time to fix defects. The document provides an overview of typical software development processes and then discusses a two-step approach to defect prediction and analysis using appropriate statistics and data mining techniques.
A survey of predicting software reliability using machine learning methodsIAESIJAI
In light of technical and technological progress, software has become essential in every aspect of human life, including the medical sector and industrial control; it is therefore imperative that software always works flawlessly. The information technology sector has witnessed rapid expansion in recent years, and software companies can no longer rely on cost advantages alone to stay competitive in the market: programmers must deliver reliable, high-quality software. To support estimating and predicting software reliability with machine learning and deep learning, this paper presents a brief overview of the important scientific contributions to the subject of software reliability and of the highly efficient methods and techniques researchers have found for predicting it.
The document discusses current practices in software testing, specifically around the use of mock objects. It presents results from several studies:
1. A study of over 2,000 test dependencies in open source and proprietary projects found that the use of mocks depends on the class's responsibilities and architecture. Databases and external dependencies are often mocked, while domain objects are less likely to be mocked.
2. A survey of over 100 professionals found that mocks are mostly introduced when test classes are first created and tend to remain throughout the test class's lifetime. Mocks are occasionally removed in response to changes.
3. An analysis of code review practices for tests found that tests receive fewer reviews than production code. Developers see value in reviewing
ANALYSIS OF SOFTWARE QUALITY USING SOFTWARE METRICSijcsa
Software metrics are directly linked to measurement in software engineering. Correct measurement is a precondition in any engineering field, and software engineering is no exception: as the size and complexity of software increase, manual inspection of software becomes a harder task. Most software engineers worry about the quality of software and how to measure and enhance it. The overall objective of this study was to assess and analyse the software metrics used to measure software products and processes.
In this study, the researcher used a collection of literature from various electronic databases, available since 2008, to understand software metrics. The study identifies software quality as a measure of how software is designed and how well the software conforms to that design. Among the variables sought in software quality are correctness, product quality, scalability, completeness, and absence of bugs. However, because the quality standard used by one organization differs from that of others, it is better to apply software metrics, together with the most common current software metrics tools, to measure the quality of software and reduce the subjectivity of fault assessment during quality evaluation. The central contribution of this study is an overview of software metrics that illustrates developments in this area, together with a critical analysis of the main metrics found in the literature.
Unit testing focuses on testing individual software modules to uncover errors. Integration testing tests interfacing between modules incrementally to isolate errors. Testing objectives are to find errors, use high probability test cases, and ensure specifications are met. Reasons to test are for correctness, efficiency, and complexity. Test oracles verify expected outputs to increase automated testing efficiency and reduce costs, though complete automation has challenges.
Predicting Fault-Prone Files using Machine LearningGuido A. Ciollaro
This document summarizes a study that used machine learning algorithms to predict fault-prone files in nine open source Java projects containing over 18,000 files and 3 million lines of code. Six machine learning algorithms (Naive Bayes, Bayesian network, decision tree, radial basis function, simple logistic, and zeroR) were evaluated based on their ability to rank files by probability of being buggy, as determined by the FindBugs tool. The results showed that decision trees and Bayesian networks performed best at ranking the files, with decision trees outperforming other methods in previous studies. Lift curves were used to evaluate the performance of the models by plotting the number of files examined against the number of buggy files found.
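A lift curve of the kind used in that study ranks files by predicted bug probability and counts how many truly buggy files appear among the top n examined. A minimal sketch, with invented scores and labels:

```python
# Each tuple: (file name, predicted bug probability, actually buggy? 1/0).
# Scores and labels are invented for illustration.
files = [("a", 0.9, 1), ("b", 0.8, 1), ("c", 0.6, 0), ("d", 0.4, 1), ("e", 0.2, 0)]

def lift_points(files):
    """Rank files by descending predicted probability, then return
    (files examined, buggy files found) pairs, i.e. the lift curve."""
    ranked = sorted(files, key=lambda f: f[1], reverse=True)
    found, points = 0, []
    for n, (_, _, buggy) in enumerate(ranked, start=1):
        found += buggy
        points.append((n, found))
    return points

print(lift_points(files))  # → [(1, 1), (2, 2), (3, 2), (4, 3), (5, 3)]
```

A better ranker reaches the total number of buggy files after examining fewer files, which is exactly what the curve visualizes.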
Parameter Estimation of GOEL-OKUMOTO Model by Comparing ACO with MLE MethodIRJET Journal
The document presents a comparison of the Ant Colony Optimization (ACO) method and Maximum Likelihood Estimation (MLE) method for parameter estimation of the Goel-Okumoto software reliability growth model. It describes using the ACO and MLE methods to estimate unknown parameters of the Goel-Okumoto model based on ungrouped time domain failure data. The key parameters estimated are a, which represents the expected total number of failures, and b, which represents the failure detection rate. The document aims to determine which of these two parameter estimation methods can best identify failures at early stages of software reliability monitoring.
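Whichever estimation method is used, the Goel-Okumoto model itself is the mean value function m(t) = a(1 - e^(-bt)), where a is the expected total number of failures and b the failure detection rate. The snippet below just evaluates that function; the parameter values are illustrative, not estimates from the paper's data.

```python
import math

def goel_okumoto_mean(t, a, b):
    """Goel-Okumoto mean value function m(t) = a * (1 - e^(-b*t)):
    expected cumulative number of failures observed by time t."""
    return a * (1 - math.exp(-b * t))

# Illustrative parameters: a = 100 total expected failures, b = 0.1 detection rate.
print(round(goel_okumoto_mean(t=10, a=100, b=0.1), 1))  # → 63.2
```

ACO and MLE differ only in how they search for the (a, b) pair that best fits the observed failure times.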
Reliability Improvement with PSP of Web-Based Software ApplicationsCSEIJJournal
In diverse industrial and academic environments, the quality of the software has been evaluated using
different analytic studies. The contribution of the present work is focused on the development of a
methodology in order to improve the evaluation and analysis of the reliability of web-based software
applications. The Personal Software Process (PSP) was introduced in our methodology for improving the
quality of the process and the product. The Evaluation + Improvement (Ei) process is performed in our
methodology to evaluate and improve the quality of the software system. We tested our methodology in a
web-based software system and used statistical modeling theory for the analysis and evaluation of the
reliability. The behavior of the system under ideal conditions was evaluated and compared against the
operation of the system executing under real conditions. The results obtained demonstrated the
effectiveness and applicability of our methodology.
Towards Effective Bug Triage with Software
Data Reduction Techniques
Jifeng Xuan, He Jiang, Member, IEEE, Yan Hu, Zhilei Ren, Weiqin Zou,
Zhongxuan Luo, and Xindong Wu, Fellow, IEEE
Abstract—Software companies spend over 45 percent of cost in dealing with software bugs. An inevitable step of fixing bugs is bug
triage, which aims to correctly assign a developer to a new bug. To decrease the time cost in manual work, text classification techniques
are applied to conduct automatic bug triage. In this paper, we address the problem of data reduction for bug triage, i.e., how to reduce the
scale and improve the quality of bug data. We combine instance selection with feature selection to simultaneously reduce data scale on
the bug dimension and the word dimension. To determine the order of applying instance selection and feature selection, we extract
attributes from historical bug data sets and build a predictive model for a new bug data set. We empirically investigate the performance of
data reduction on a total of 600,000 bug reports from two large open source projects, namely Eclipse and Mozilla. The results show that our
data reduction can effectively reduce the data scale and improve the accuracy of bug triage. Our work provides an approach to leveraging
techniques on data processing to form reduced and high-quality bug data in software development and maintenance.
Index Terms—Mining software repositories, application of data preprocessing, data management in bug repositories, bug data reduction,
feature selection, instance selection, bug triage, prediction for reduction orders
1 INTRODUCTION
MINING software repositories is an interdisciplinary
domain, which aims to employ data mining to deal
with software engineering problems [22]. In modern soft-
ware development, software repositories are large-scale
databases for storing the output of software development,
e.g., source code, bugs, emails, and specifications. Tradi-
tional software analysis is not completely suitable for the
large-scale and complex data in software repositories [58].
Data mining has emerged as a promising means to handle
software data (e.g., [7], [32]). By leveraging data mining
techniques, mining software repositories can uncover inter-
esting information in software repositories and solve real-
world software problems.
A bug repository (a typical software repository, for storing
details of bugs), plays an important role in managing soft-
ware bugs. Software bugs are inevitable and fixing bugs is
expensive in software development. Software companies
spend over 45 percent of cost in fixing bugs [39]. Large soft-
ware projects deploy bug repositories (also called bug track-
ing systems) to support information collection and to assist
developers to handle bugs [9], [14]. In a bug repository, a
bug is maintained as a bug report, which records the textual
description of reproducing the bug and updates according
to the status of bug fixing [64]. A bug repository provides a
data platform to support many types of tasks on bugs, e.g.,
fault prediction [7], [49], bug localization [2], and reopened-
bug analysis [63]. In this paper, bug reports in a bug reposi-
tory are called bug data.
There are two challenges related to bug data that may
affect the effective use of bug repositories in software devel-
opment tasks, namely the large scale and the low quality. On
one hand, due to the daily-reported bugs, a large number of
new bugs are stored in bug repositories. Taking an open
source project, Eclipse [13], as an example, an average of
30 new bugs are reported to bug repositories per day in 2007
[3]; from 2001 to 2010, 333,371 bugs have been reported to
Eclipse by over 34,917 developers and users [57]. It is a chal-
lenge to manually examine such large-scale bug data in soft-
ware development. On the other hand, software techniques
suffer from the low quality of bug data. Two typical charac-
teristics of low-quality bugs are noise and redundancy.
Noisy bugs may mislead related developers [64] while
redundant bugs waste the limited time of bug handling [54].
A time-consuming step of handling software bugs is
bug triage, which aims to assign a correct developer to fix
a new bug [1], [3], [25], [40]. In traditional software devel-
opment, new bugs are manually triaged by an expert
developer, i.e., a human triager. Due to the large number
of daily bugs and the difficulty of mastering expertise on all
the bugs, manual bug triage is expensive in time cost and low
in accuracy. In manual bug triage in Eclipse, 44 percent of bugs
are assigned by mistake while the time cost between open-
ing one bug and its first triaging is 19.3 days on average
[25]. To avoid the expensive cost of manual bug
triage, existing work [1] has proposed an automatic bug
J. Xuan is with the School of Software, Dalian University of Technology,
Dalian, China, and INRIA Lille–Nord Europe, Lille, France.
E-mail: [email protected].
H. Jiang, Y. Hu, Z. Ren, and Z. Luo are with the School of Software,
Dalian University of Technology, Dalian, China.
E-mail: [email protected], {huyan, zren, zxluo}@dlut.edu.cn.
W. Zou is with Jiangxi University of Science and Technology, Nanchang,
China. E-mail: [email protected].
X. Wu is with the School of Computer Science and Information Engineer-
ing, Hefei University of Technology, Hefei, China, and the Department of
Computer Science, University of Vermont. E-mail: [email protected].
Manuscript received 10 Jan. 2013; revised 25 Apr. 2014; accepted 1 May 2014.
Date of publication 15 May 2014; date of current version 1 Dec. 2014.
Recommended for acceptance by H. Wang.
For information on obtaining reprints of this article, please send e-mail to:
[email protected], and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TKDE.2014.2324590
264 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 1, JANUARY 2015
1041-4347 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
triage approach, which applies text classification techni-
ques to predict developers for bug reports. In this
approach, a bug report is mapped to a document and a
related developer is mapped to the label of the document.
Then, bug triage is converted into a problem of text classi-
fication and is automatically solved with mature text clas-
sification techniques, e.g., Naive Bayes [12]. Based on the
results of text classification, a human triager assigns new
bugs by incorporating his/her expertise. To improve the
accuracy of text classification techniques for bug triage,
some further techniques are investigated, e.g., a tossing
graph approach [25] and a collaborative filtering approach
[40]. However, large-scale and low-quality bug data in
bug repositories block the techniques of automatic bug tri-
age. Since software bug data are a kind of free-form text
data (generated by developers), it is necessary to generate
well-processed bug data to facilitate the application [66].
In this paper, we address the problem of data reduc-
tion for bug triage, i.e., how to reduce the bug data to
save the labor cost of developers and improve the quality
to facilitate the process of bug triage. Data reduction for
bug triage aims to build a small-scale and high-quality
set of bug data by removing bug reports and words,
which are redundant or non-informative. In our work, we
combine existing techniques of instance selection and fea-
ture selection to simultaneously reduce the bug dimen-
sion and the word dimension. The reduced bug data
contain fewer bug reports and fewer words than the orig-
inal bug data and provide similar information over the
original bug data. We evaluate the reduced bug data
according to two criteria: the scale of a data set and the
accuracy of bug triage. To avoid the bias of a single algo-
rithm, we empirically examine the results of four instance
selection algorithms and four feature selection algorithms.
Given an instance selection algorithm and a feature selec-
tion algorithm, the order of applying these two algorithms
may affect the results of bug triage. In this paper, we pro-
pose a predictive model to determine the order of applying
instance selection and feature selection. We refer to such
determination as prediction for reduction orders. Drawing on
the experience in software metrics,¹ we extract the attrib-
utes from historical bug data sets. Then, we train a binary
classifier on bug data sets with extracted attributes and pre-
dict the order of applying instance selection and feature
selection for a new bug data set.
In the experiments, we evaluate the data reduction for bug
triage on bug reports of two large open source projects,
namely Eclipse and Mozilla. Experimental results show that
applying the instance selection technique to the data set can
reduce bug reports but the accuracy of bug triage may be
decreased; applying the feature selection technique can
reduce words in the bug data and the accuracy can be
increased. Meanwhile, combining both techniques can
increase the accuracy, as well as reduce bug reports and
words. For example, when 50 percent of bugs and 70 percent
of words are removed, the accuracy of Naive Bayes on
Eclipse improves by 2 to 12 percent and the accuracy on
Mozilla improves by 1 to 6 percent. Based on the attributes
from historical bug data sets, our predictive model can pro-
vide an accuracy of 71.8 percent for predicting the reduction
order. Based on top node analysis of the attributes, results
show that no individual attribute can determine the reduc-
tion order and each attribute is helpful to the prediction.
The primary contributions of this paper are as follows:
1) We present the problem of data reduction for bug tri-
age. This problem aims to augment the data set of
bug triage in two aspects, namely a) to simulta-
neously reduce the scales of the bug dimension and
the word dimension and b) to improve the accuracy
of bug triage.
2) We propose a combination approach to addressing
the problem of data reduction. This can be viewed as
an application of instance selection and feature selec-
tion in bug repositories.
3) We build a binary classifier to predict the order of
applying instance selection and feature selection. To
our knowledge, the order of applying instance selec-
tion and feature selection has not been investigated
in related domains.
This paper is an extension of our previous work [62]. In
this extension, we add new attributes extracted from bug
data sets, prediction for reduction orders, and experiments
on four instance selection algorithms, four feature selection
algorithms, and their combinations.
The remainder of this paper is organized as follows. Sec-
tion 2 provides the background and motivation. Section 3
presents the combination approach for reducing bug data.
Section 4 details the model of predicting the order of apply-
ing instance selection and feature selection. Section 5
presents the experiments and results on bug data. Section 6
discusses limitations and potential issues. Section 7 lists the
related work. Section 8 concludes.
2 BACKGROUND AND MOTIVATION
2.1 Background
Bug repositories are widely used for maintaining software
bugs, e.g., a popular and open source bug repository, Bug-
zilla [5]. Once a software bug is found, a reporter (typically a
developer, a tester, or an end user) records this bug to the
bug repository. A recorded bug is called a bug report, which
has multiple items for detailing the information of repro-
ducing the bug. In Fig. 1, we show a part of the bug report
for bug 284541 in Eclipse.²
In a bug report, the summary and
the description are two key items about the information of
the bug, which are recorded in natural languages. As their
names suggest, the summary denotes a general statement
for identifying a bug while the description gives the details
for reproducing the bug. Some other items are recorded in a
bug report for facilitating the identification of the bug, such
1. The subject of software metrics denotes a quantitative measure of
the degree to which software possesses given attributes [16]. Existing work in
software metrics extracts attributes from an individual instance in soft-
ware repositories (e.g., attributes from a bug report) while in our work,
we extract attributes from a set of integrated instances (e.g., attributes
from a set of bug reports). See Section S1 in the supplemental material,
http://oscar-lab.org/people/~jxuan/reduction/.
2. Bug 284541, https://bugs.eclipse.org/bugs/show_bug.cgi?id=284541.
as the product, the platform, and the importance. Once a bug
report is formed, a human triager assigns this bug to a
developer, who will try to fix this bug. This developer is
recorded in an item assigned-to. The assigned-to will
change to another developer if the previously assigned
developer cannot fix this bug. The process of assigning a
correct developer for fixing the bug is called bug triage. For
example, in Fig. 1, the developer Dimitar Giormov is the final
assigned-to developer of bug 284541.
A developer, who is assigned to a new bug report, starts
to fix the bug based on the knowledge of historical bug fix-
ing [36], [64]. Typically, the developer pays efforts to under-
stand the new bug report and to examine historically fixed
bugs as a reference (e.g., searching for similar bugs [54] and
applying existing solutions to the new bug [28]).
An item status of a bug report is changed according to
the current result of handling this bug until the bug is
completely fixed. Changes of a bug report are stored in an
item history. Table 1 presents a part of history of bug 284541.
This bug has been assigned to three developers and only
the last developer could handle this bug correctly. Changing
developers lasts for over seven months, while fixing this bug
takes only three days.
Manual bug triage by a human triager is time-
consuming and error-prone, since the number of daily bugs
is too large to assign correctly and it is hard for a human
triager to master the knowledge about all the bugs [12]. Existing
work employs the approaches based on text classification
to assist bug triage, e.g., [1], [25], [56]. In such approaches,
the summary and the description of a bug report are
extracted as the textual content while the developer who
can fix this bug is marked as the label for classification.
Then techniques on text classification can be used to pre-
dict the developer for a new bug. In detail, existing bug
reports with their developers are formed as a training set
to train a classifier (e.g., Naive Bayes, a typical classifier in
bug triage [1], [12], [25]); new bug reports are treated as a
test set to examine the results of the classification. In
Fig. 2a, we illustrate the basic framework of bug triage
based on text classification. As shown in Fig. 2a, we view a
bug data set as a text matrix. Each row of the matrix
indicates one bug report while each column of the matrix
indicates one word. To avoid the low accuracy of bug tri-
age, a recommendation list with the size k is used to pro-
vide a list of k developers, who have the top-k possibility
to fix the new bug.
2.2 Motivation
Real-world data always include noise and redundancy [31].
Noisy data may mislead the data analysis techniques [66]
while redundant data may increase the cost of data process-
ing [19]. In bug repositories, all the bug reports are filled by
developers in natural languages. The low-quality bugs accu-
mulate in bug repositories with the growth in scale. Such
Fig. 1. A part of bug report for bug 284541 in Eclipse. This bug is about a
missing node of XML files in Product Web Tools Platform (WTP). After
the handling process, this bug is resolved as a fixed one.
TABLE 1
Part of History of Bug 284541 in Eclipse
Triager Date Action
Kaloyan Raev 2009-08-12 Assigned to the developer
Kiril Mitov
Kaloyan Raev 2010-01-14 Assigned to the developer
Kaloyan Raev
Kaloyan Raev 2010-03-30 Assigned to the developer
Dimitar Giormov
Dimitar Giormov 2010-04-12 Changed status to assigned
Dimitar Giormov 2010-04-14 Changed status to resolved
Changed resolution to fixed
Fig. 2. Illustration of reducing bug data for bug triage. Sub-figure
(a) presents the framework of existing work on bug triage. Before train-
ing a classifier with a bug data set, we add a phase of data reduction, in
(b), which combines the techniques of instance selection and feature
selection to reduce the scale of bug data. In bug data reduction, a prob-
lem is how to determine the order of two reduction techniques. In
(c), based on the attributes of historical bug data sets, we propose a
binary classification method to predict reduction orders.
large-scale and low-quality bug data may deteriorate the
effectiveness of fixing bugs [28], [64]. In the following of this
section, we will employ three examples of bug reports in
Eclipse to show the motivation of our work, i.e., the neces-
sity for data reduction.
We list the bug report of bug 205900 of Eclipse in Exam-
ple 1 (the description in the bug report is partially omitted)
to study the words of bug reports.
Example 1 (Bug 205900). Current version in Eclipse Europa
discovery repository broken.
. . . [Plug-ins] all installed correctly and do not show
any errors in Plug-in configuration view. Whenever I try
to add a [diagram name] diagram, the wizard cannot be
started due to a missing [class name] class . . .
In this bug report, some words, e.g., installed, show,
started, and missing, are commonly used for describing
bugs. For text classification, such common words are not
helpful for the quality of prediction. Hence, we want to
remove these words to reduce the computation for bug tri-
age. However, such redundant words cannot be removed
directly by hand for text classification. Thus, we want to
adapt a relevant technique, namely feature selection, for bug triage.
To study the noisy bug report, we take the bug report of
bug 201598 as Example 2 (Note that both the summary and
the description are included).
Example 2 (Bug 201598). 3.3.1 about says 3.3.0.
Build id: M20070829-0800. 3.3.1 about says 3.3.0.
This bug report presents the error in the version dialog.
But the details are not clear. Unless a developer is very
familiar with the background of this bug, it is hard to find
the details. According to the item history, this bug is
fixed by the developer who has reported this bug. But the
summary of this bug may make other developers confused.
Moreover, from the perspective of data processing, espe-
cially automatic processing, the words in this bug may be
removed since these words are not helpful to identify this
bug. Thus, it is necessary to remove the noisy bug reports
and words for bug triage.
To study the redundancy between bug reports, we list
two bug reports of bugs 200019 and 204653 in Example 3
(the description items are omitted).
Example 3. Bugs 200019 and 204653.
(Bug 200019) Argument popup not highlighting the
correct argument . . .
(Bug 204653) Argument highlighting incorrect . . .
In bug repositories, the bug report of bug 200019 is
marked as a duplicate of bug 204653 (a duplicate bug
report denotes a bug report that describes a software
fault with the same root cause as an existing bug
report [54]). The textual contents of these two bug reports
are similar. Hence, one of these two bug reports may be cho-
sen as the representative one. Thus, a technique to remove
redundant bug reports for bug triage is needed.
Based on the above three examples, it is necessary to pro-
pose an approach to reducing the scale (e.g., large scale
words in Example 1) and augmenting the quality of bug
data (e.g., noisy bug reports in Example 2 and redundant
bug reports in Example 3).
3 DATA REDUCTION FOR BUG TRIAGE
Motivated by the three examples in Section 2.2, we propose
bug data reduction to reduce the scale and to improve the
quality of data in bug repositories.
Fig. 2 illustrates the bug data reduction in our work,
which is applied as a phase in data preparation of bug tri-
age. We combine existing techniques of instance selection
and feature selection to remove certain bug reports and
words, i.e., in Fig. 2b. A problem for reducing the bug data
is to determine the order of applying instance selection and
feature selection, which is denoted as the prediction of
reduction orders, i.e., in Fig. 2c.
In this section, we first present how to apply instance
selection and feature selection to bug data, i.e., data reduc-
tion for bug triage. Then, we list the benefit of the data
reduction. The details of the prediction for reduction orders
will be shown in Section 4.
Algorithm 1. Data reduction based on FS → IS
Input: training set T with n words and m bug reports,
reduction order FS → IS,
final number nF of words,
final number mI of bug reports.
Output: reduced data set TFI for bug triage
1) apply FS to n words of T and calculate objective values
for all the words;
2) select the top nF words of T and generate a training
set TF;
3) apply IS to mI bug reports of TF;
4) terminate IS when the number of bug reports is equal to
or less than mI and generate the final training set TFI.
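The FS → IS order can be sketched as follows. This is a simplified stand-in, assuming document frequency as the FS objective and plain truncation as the IS step; the paper's actual algorithms (e.g., CH for FS, ICF for IS) would replace both.

```python
from collections import Counter

def reduce_fs_is(reports, n_f, m_i, select_instances=None):
    """Data reduction in the FS -> IS order (Algorithm 1, sketched).

    reports: list of (word_list, developer) pairs."""
    # FS: rank words by a simple objective (document frequency here; the
    # paper uses e.g. information gain or the chi-square statistic).
    freq = Counter(w for words, _ in reports for w in set(words))
    kept = set(sorted(freq, key=freq.get, reverse=True)[:n_f])
    projected = [([w for w in words if w in kept], dev) for words, dev in reports]
    # Blank reports (all words removed) are dropped during feature selection.
    projected = [(ws, d) for ws, d in projected if ws]
    # IS: cut down to at most m_i reports (truncation as a stand-in for ICF etc.)
    if select_instances is None:
        select_instances = lambda data, m: data[:m]
    return select_instances(projected, m_i)
```

The IS → FS order would simply run the two phases in the opposite sequence on the same data.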
3.1 Applying Instance Selection and Feature
Selection
In bug triage, a bug data set is converted into a text matrix
with two dimensions, namely the bug dimension and the
word dimension. In our work, we leverage the combination
of instance selection and feature selection to generate a
reduced bug data set. We replace the original data set with
the reduced data set for bug triage.
Instance selection and feature selection are widely used
techniques in data processing. For a given data set in a cer-
tain application, instance selection is to obtain a subset of
relevant instances (i.e., bug reports in bug data) [18] while
feature selection aims to obtain a subset of relevant features
(i.e., words in bug data) [19]. In our work, we employ the
combination of instance selection and feature selection. To
distinguish the orders of applying instance selection and
feature selection, we give the following denotation. Given
an instance selection algorithm IS and a feature selection
algorithm FS, we use FS → IS to denote the bug data reduc-
tion, which first applies FS and then IS; on the other hand,
IS → FS denotes first applying IS and then FS.
In Algorithm 1, we briefly present how to reduce the bug
data based on FS → IS. Given a bug data set, the output of
bug data reduction is a new and reduced data set. Two algo-
rithms FS and IS are applied sequentially. Note that in Step
2), some of bug reports may be blank during feature
selection, i.e., all the words in a bug report are removed. Such
blank bug reports are also removed in the feature selection.
In our work, FS → IS and IS → FS are viewed as two
orders of bug data reduction. To avoid the bias from a single
algorithm, we examine results of four typical algorithms of
instance selection and feature selection, respectively. We
briefly introduce these algorithms as follows.
Instance selection is a technique to reduce the number of
instances by removing noisy and redundant instances [11],
[48]. An instance selection algorithm can provide a reduced
data set by removing non-representative instances [38], [65].
According to an existing comparison study [20] and an
existing review [37], we choose four instance selection algo-
rithms, namely Iterative Case Filter (ICF) [8], Learning Vec-
tors Quantization (LVQ) [27], Decremental Reduction
Optimization Procedure (DROP) [52], and Patterns by
Ordered Projections (POP) [41].
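To give a flavor of instance selection, the sketch below removes bug reports whose label disagrees with the majority of their k nearest neighbours. This is a simplified edited-nearest-neighbour noise filter in the spirit of ICF and DROP; the published algorithms add further reduction steps, and the distance function here is an assumption.

```python
from collections import Counter

def edited_nn_filter(instances, dist, k=3):
    """Keep only instances whose label matches the majority label of their
    k nearest neighbours; noisy instances are removed."""
    kept = []
    for i, (x, y) in enumerate(instances):
        # Distances to every other instance, sorted ascending.
        others = sorted(((dist(x, x2), y2)
                         for j, (x2, y2) in enumerate(instances) if j != i),
                        key=lambda t: t[0])
        majority = Counter(y2 for _, y2 in others[:k]).most_common(1)[0][0]
        if majority == y:
            kept.append((x, y))
    return kept
```

On bug data, instances would be word vectors and labels would be developers, with a text distance such as cosine distance.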
Feature selection is a preprocessing technique for select-
ing a reduced set of features for large-scale data sets [15],
[19]. The reduced set is considered as the representative fea-
tures of the original feature set [10]. Since bug triage is con-
verted into text classification, we focus on the feature
selection algorithms in text data. In this paper, we choose
four well-performed algorithms in text data [43], [60] and
software data [49], namely Information Gain (IG) [24], χ² sta-
tistic (CH) [60], Symmetrical Uncertainty attribute evaluation
(SU) [51], and Relief-F Attribute selection (RF) [42]. Based on
feature selection, words in bug reports are sorted according
to their feature values and a given number of words with
large values are selected as representative features.
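As one concrete feature-value computation, the χ² statistic (the CH criterion above) can be computed from a 2x2 contingency table of word occurrence against one developer class. The sketch below uses the standard textbook formulation of the statistic; the paper does not spell out its exact implementation, so treat this as an assumption.

```python
def chi_square(reports, word, dev):
    """Chi-square statistic of a word w.r.t. one developer class.
    reports is a list of (words, developer); words are typically ranked
    by their largest (or average) score over all classes."""
    A = B = C = D = 0
    for words, d in reports:
        has = word in words
        if d == dev and has:        A += 1  # in class, has word
        elif d != dev and has:      B += 1  # out of class, has word
        elif d == dev:              C += 1  # in class, no word
        else:                       D += 1  # out of class, no word
    n = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return n * (A * D - C * B) ** 2 / denom if denom else 0.0
```

A word that occurs only in one developer's reports scores high, while a word spread evenly over developers scores near zero.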
3.2 Benefit of Data Reduction
In our work, to save the labor cost of developers, the data
reduction for bug triage has two goals, 1) reducing the data
scale and 2) improving the accuracy of bug triage. In con-
trast to modeling the textual content of bug reports in exist-
ing work (e.g., [1], [12], [25]), we aim to augment the data
set to build a preprocessing approach, which can be applied
before an existing bug triage approach. We explain the two
goals of data reduction as follows.
3.2.1 Reducing the Data Scale
We reduce scales of data sets to save the labor cost of
developers.
Bug dimension. As mentioned in Section 2.1, the aim of
bug triage is to assign developers for bug fixing. Once a
developer is assigned to a new bug report, the developer
can examine historically fixed bugs to form a solution to the
current bug report [36], [64]. For example, historical bugs
are checked to detect whether the new bug is the duplicate
of an existing one [54]; moreover, existing solutions to bugs
can be searched and applied to the new bug [28]. Thus, we
consider reducing duplicate and noisy bug reports to
decrease the number of historical bugs. In practice, the labor
cost of developers (i.e., the cost of examining historical
bugs) can be saved by decreasing the number of bugs based
on instance selection.
Word dimension. We use feature selection to remove noisy
or duplicate words in a data set. Based on feature selection,
the reduced data set can be handled more easily by
automatic techniques (e.g., bug triage approaches) than the
original data set. Besides bug triage, the reduced data set
can be further used for other software tasks after bug triage
(e.g., severity identification, time prediction, and reopened-
bug analysis in Section 7.2).
3.2.2 Improving the Accuracy
Accuracy is an important evaluation criterion for bug tri-
age. In our work, data reduction explores and removes
noisy or duplicate information in data sets (see examples
in Section 2.2).
Bug dimension. Instance selection can remove uninforma-
tive bug reports; meanwhile, we can observe that the accu-
racy may be decreased by removing bug reports (see
experiments in Section 5.2.3).
Word dimension. By removing uninformative words, fea-
ture selection improves the accuracy of bug triage (see
experiments in Section 5.2.3). This can recover the accuracy
loss by instance selection.
4 PREDICTION FOR REDUCTION ORDERS
Based on Section 3.1, given an instance selection algorithm
IS and a feature selection algorithm FS, FS → IS and IS →
FS are viewed as two orders for applying reduction techni-
ques. Hence, a challenge is how to determine the order of
reduction techniques, i.e., how to choose between FS → IS
and IS → FS. We refer to this problem as the prediction
for reduction orders.
4.1 Reduction Orders
To apply the data reduction to each new bug data set, we
need to check the accuracy of both orders (FS → IS and
IS → FS) and choose the better one. To avoid the time cost of
manually checking both reduction orders, we consider pre-
dicting the reduction order for a new bug data set based on
historical data sets.
As shown in Fig. 2c, we convert the problem of predic-
tion for reduction orders into a binary classification prob-
lem. A bug data set is mapped to an instance and the
associated reduction order (either FS → IS or IS → FS) is
mapped to the label of a class of instances. Fig. 3 summa-
rizes the steps of predicting reduction orders for bug triage.
Note that a classifier can be trained only once when facing
many new bug data sets. That is, training such a classifier
once can predict the reduction orders for all the new data
sets without checking both reduction orders. To date, the
problem of predicting reduction orders of applying feature
selection and instance selection has not been investigated in
other application scenarios.
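The binary classification in Fig. 2c can be sketched as follows, with each historical bug data set reduced to an attribute vector labeled by its better reduction order. A 1-nearest-neighbour rule stands in for the trained classifier, and the attribute vectors are hypothetical.

```python
def predict_order(historical, new_attrs):
    """Predict the reduction order for a new bug data set.

    historical: list of (attribute_vector, best_order) pairs, where
    best_order is either "FS->IS" or "IS->FS"; a 1-NN classifier stands
    in for the binary classifier trained in the paper."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    _, order = min(historical, key=lambda h: sq_dist(h[0], new_attrs))
    return order
```

Once trained on historical data sets, such a classifier predicts the order for every new data set without running both reductions.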
From the perspective of software engineering, predict-
ing the reduction order for bug data sets can be viewed as
Fig. 3. Steps of predicting reduction orders for bug triage.
a kind of software metrics, which involves activities for mea-
suring some property for a piece of software [16]. How-
ever, the features in our work are extracted from the bug
data set while the features in existing work on software
metrics are for individual software artifacts,³
e.g., an indi-
vidual bug report or an individual piece of code. In this
paper, to avoid ambiguous denotations, an attribute refers
to an extracted feature of a bug data set while a feature
refers to a word of a bug report.
4.2 Attributes for a Bug Data Set
To build a binary classifier to predict reduction orders, we
extract 18 attributes to describe each bug data set. Such
attributes can be extracted before new bugs are triaged. We
divide these 18 attributes into two categories, namely the
bug report category (B1 to B10) and the developer category
(D1 to D8).
In Table 2, we present an overview of all the attributes
of a bug data set. Given a bug data set, all these attributes
are extracted to measure the characteristics of the bug data
set. Among the attributes in Table 2, four attributes are
directly counted from a bug data set, i.e., B1, B2, D1, and
D4; six attributes are calculated based on the words in the
bug data set, i.e., B3, B4, D2, D3, D5, and D6; five attributes
are calculated as the entropy of an enumeration value to
indicate the distributions of items in bug reports, i.e., B6,
B7, B8, B9, and B10; three attributes are calculated accord-
ing to further statistics, i.e., B5, D7, and D8. All the 18
attributes in Table 2 can be obtained by direct extraction or
automatic calculation. Details of calculating these attributes
can be found in Section S2 in the supplemental material,
available online.
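A few of these attributes can be computed directly; the entropy-based ones (B6 to B10) use the Shannon entropy of an enumeration item's distribution. In the sketch below, the tuple layout and the counting rule for B2 (distinct words over the whole data set) are assumptions about the exact definitions.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy of an enumeration item (used for B6-B10, e.g. the
    distribution of severities or priorities over bug reports)."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * log2(c / n) for c in counts.values())

def extract_attributes(reports):
    """Compute a few of the 18 attributes from (words, severity, developer)
    tuples."""
    words_all = [w for ws, _, _ in reports for w in ws]
    return {
        "B1": len(reports),                          # number of bug reports
        "B2": len(set(words_all)),                   # number of words
        "B3": len(words_all) / len(reports),         # average report length
        "B6": entropy([s for _, s, _ in reports]),   # entropy of severities
        "D1": len({d for _, _, d in reports}),       # number of developers
    }
```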
5 EXPERIMENTS AND RESULTS
5.1 Data Preparation
In this part, we present the data preparation for applying
the bug data reduction. We evaluate the bug data reduction
on bug repositories of two large open source projects,
namely Eclipse and Mozilla. Eclipse [13] is a multi-language
software development environment, including an Inte-
grated Development Environment (IDE) and an extensible
plug-in system; Mozilla [33] is an Internet application suite,
including some classic products, such as the Firefox
browser and the Thunderbird email client. Up to December 31, 2011, 366,443 bug reports over 10 years have been recorded for Eclipse while 643,615 bug reports over 12 years have been recorded for Mozilla. In our work, we collect 300,000 consecutive bug reports for each of Eclipse and Mozilla, i.e., bugs 1-300000 in Eclipse and bugs 300001-600000 in Mozilla. In fact, 298,785 bug reports in Eclipse and 281,180 bug reports in Mozilla are collected, since some bug reports have been removed from the bug repositories (e.g., bug 5315 in Eclipse) or do not allow anonymous access (e.g., bug 40020 in Mozilla). For each bug report, we download webpages from the bug repositories and extract the details of bug reports for experiments.
Since bug triage aims to predict the developers who can
fix the bugs, we follow the existing work [1], [34] to remove
unfixed bug reports, e.g., the new bug reports or will-not-fix
bug reports. Thus, we only choose bug reports that are fixed or duplicate (based on the status item of bug reports). Moreover, in bug repositories, several developers
have only fixed very few bugs. Such inactive developers
TABLE 2
An Overview of Attributes for a Bug Data Set
Index Attribute name Description
B1 # Bug reports Total number of bug reports.
B2 # Words Total number of words in all the bug reports.
B3 Length of bug reports Average number of words of all the bug reports.
B4 # Unique words Average number of unique words in each bug report.
B5 Ratio of sparseness Ratio of sparse terms in the text matrix. A sparse term refers to a
word with zero frequency in the text matrix.
B6 Entropy of severities Entropy of severities in bug reports. Severity denotes the importance
of bug reports.
B7 Entropy of priorities Entropy of priorities in bug reports. Priority denotes the level of bug
reports.
B8 Entropy of products Entropy of products in bug reports. Product denotes the sub-project.
B9 Entropy of components Entropy of components in bug reports. Component denotes the
sub-sub-project.
B10 Entropy of words Entropy of words in bug reports.
D1 # Fixers Total number of developers who will fix bugs.
D2 # Bug reports per fixer Average number of bug reports for each fixer.
D3 # Words per fixer Average number of words for each fixer.
D4 # Reporters Total number of developers who have reported bugs.
D5 # Bug reports per reporter Average number of bug reports for each reporter.
D6 # Words per reporter Average number of words for each reporter.
D7 # Bug reports by top 10 percent reporters Ratio of bugs reported by the top 10 percent most active reporters.
D8 Similarity between fixers and reporters Similarity between the set of fixers and the set of reporters, defined as
the Tanimoto similarity.
3. In software metrics, a software artifact is one of many kinds of
tangible products produced during the development of software, e.g., a
use case, requirements specification, and a design document [16].
XUAN ET AL.: TOWARDS EFFECTIVE BUG TRIAGE WITH SOFTWARE DATA REDUCTION TECHNIQUES 269
may not provide sufficient information for predicting correct developers. In our work, we remove the developers who have fixed fewer than 10 bugs.
To conduct text classification, we extract the summary
and the description of each bug report to denote the con-
tent of the bug. For a newly reported bug, the summary
and the description are the most representative items,
which are also used in manual bug triage [1]. As the input
of classifiers, the summary and the description are con-
verted into the vector space model [4], [59]. We employ
two steps to form the word vector space, namely tokeni-
zation and stop word removal. First, we tokenize the
summary and the description of bug reports into word
vectors. Each word in a bug report is associated with its
word frequency, i.e., the times that this word appears in
the bug. Non-alphabetic words are removed to avoid noisy words, e.g., memory addresses like 0x0902f00 in bug 200220 of Eclipse. Second, we remove the stop words, which occur with high frequency but provide no helpful information for bug triage, e.g., the word “the” or “about”. The list of stop words in our work is taken from the SMART information retrieval system [59]. We do not use stemming in our work since existing work [1], [12] has shown that stemming is not helpful for bug triage. Hence, the bug reports are converted into the vector space model for further experiments.
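The two preprocessing steps can be sketched as follows. This is a minimal illustration, not the paper's implementation: the stop word list here is a small stand-in for the SMART list, and the tokenizer simply keeps purely alphabetic tokens, which drops non-alphabetic noise such as memory addresses.

```python
import re
from collections import Counter

# Small stand-in stop word list; the paper uses the SMART list.
STOP_WORDS = {"the", "a", "an", "about", "of", "to", "in", "at", "on",
              "and", "when"}

def to_word_vector(summary, description):
    """Convert a bug report's summary and description into a word-frequency
    vector: tokenize, keep purely alphabetic tokens (dropping noise such as
    memory addresses like 0x0902f00), then remove stop words. No stemming."""
    text = (summary + " " + description).lower()
    tokens = [t for t in re.split(r"\W+", text) if t.isalpha()]
    return Counter(t for t in tokens if t not in STOP_WORDS)

vec = to_word_vector("Crash in editor",
                     "The editor crashes at 0x0902f00 when saving")
```

Each key of the resulting counter is a feature (word) and each value its frequency in the report, which is the form consumed by the classifiers.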
5.2 Experiments on Bug Data Reduction
5.2.1 Data Sets and Evaluation
We examine the results of bug data reduction on bug reposi-
tories of two projects, Eclipse and Mozilla. For each project,
we evaluate results on five data sets and each data set is
over 10,000 bug reports, which are fixed or duplicate bug
reports. We check bug reports in the two projects and find
out that 45.44 percent of bug reports in Eclipse and
28.23 percent of bug reports in Mozilla are fixed or dupli-
cate. Thus, to obtain over 10,000 fixed or duplicate bug
reports, each data set in Eclipse is collected from 20,000 consecutive bug reports while each data set in Mozilla is collected from 40,000 consecutive bug reports. Table 3 lists the details
of ten data sets after data preparation.
To examine the results of data reduction, we employ
four instance selection algorithms (ICF, LVQ, DROP, and
POP), four feature selection algorithms (IG, CH, SU, and
RF), and three bug triage algorithms (Support Vector
Machine, SVM; K-Nearest Neighbor, KNN; and Naive
Bayes, which are typical text-based algorithms in existing
work [1], [3], [25]). Fig. 4 summarizes these algorithms. The
implementation details can be found in Section S3 in the
supplemental material, available online.
The results of data reduction for bug triage can be mea-
sured in two aspects, namely the scales of data sets and
the quality of bug triage. Based on Algorithm 1, the scales
of data sets (including the number of bug reports and the
number of words) are configured as input parameters.
The quality of bug triage can be measured with the accuracy of bug triage, which is defined as Accuracy_k = (# correctly assigned bug reports in k candidates) / (# all bug reports in the test set). For each data set in Table 3, the first 80 percent of bug reports are used as a training set and the remaining 20 percent are used as a test set. In the remainder of this paper, data reduction on a data set denotes the data reduction on the training set of this data set, since we cannot change the test set.
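A minimal sketch of the Accuracy_k metric, assuming each test report has a single ground-truth fixer and each prediction is a ranked list of candidate developers (the names and data here are illustrative):

```python
def top_k_accuracy(ranked_predictions, true_fixers, k):
    """Accuracy_k: the fraction of test bug reports whose true fixer appears
    among the top-k recommended developers."""
    hits = sum(1 for ranked, truth in zip(ranked_predictions, true_fixers)
               if truth in ranked[:k])
    return hits / len(true_fixers)

# Toy test set of two reports: ranked candidate lists and the true fixers.
preds = [["alice", "bob", "carol"], ["bob", "alice", "carol"]]
truth = ["bob", "dave"]
```

Per the paper's setup, a classifier would be trained on the first 80 percent of each data set and evaluated this way on the remaining 20 percent, for list sizes k = 1 to 5.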
TABLE 3
Ten Data Sets in Eclipse and Mozilla
Fig. 4. Algorithms for instance selection, feature selection, and bug tri-
age. Among these algorithms, ICF, CH, and Naive Bayes perform well based on the experiments of the bug data reduction.
Fig. 5. Accuracy for instance selection or feature selection on Eclipse
(DS-E1) and Mozilla (DS-M1). For instance selection, 30, 50, and 70 per-
cent of bug reports are selected while for feature selection, 10, 30, and
50 percent of words are selected. The origin denotes the results of Naive
Bayes without instance selection or feature selection. Note that some
curves of ICF may be overlapped since ICF cannot precisely set the rate
of final instances [8].
270 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 1, JANUARY 2015
5.2.2 Rates of Selected Bug Reports and Words
For either instance selection or feature selection algorithm,
the number of instances or features should be determined
to obtain the final scales of data sets. We investigate the
changes of accuracy of bug triage by varying the rate of
selected bug reports in instance selection and the rate of
selected words in feature selection. Taking two instance
selection algorithms (ICF and LVQ) and two feature selec-
tion algorithms (IG and CH) as examples, we evaluate
results on two data sets (DS-E1 in Eclipse and DS-M1 in
Mozilla). Fig. 5 presents the accuracy of instance selection
and feature selection (each value is an average of 10 inde-
pendent runs) for a bug triage algorithm, Naive Bayes.
For instance selection, ICF is a little better than LVQ from
Figs. 5a and 5c. A good percentage of bug reports is 50 or
70 percent. For feature selection, CH always performs better
than IG from Figs. 5b and 5d. We can find that 30 or 50 per-
cent is a good percentage of words. In the other experi-
ments, we directly set the percentages of selected bug
reports and words to 50 and 30 percent, respectively.
5.2.3 Results of Data Reduction for Bug Triage
We evaluate the results of data reduction for bug triage on
data sets in Table 3. First, we individually examine each
instance selection algorithm and each feature selection algo-
rithm based on one bug triage algorithm, Naive Bayes. Sec-
ond, we combine the best instance selection algorithm and
the best feature selection algorithm to examine the data
reduction on three text-based bug triage algorithms.
In Tables 4, 5, 6, and 7, we show the results of four
instance selection algorithms and four feature selection
algorithms on four data sets in Table 3, i.e., DS-E1, DS-E5,
TABLE 4
Accuracy (Percent) of IS and FS on DS-E1
List size Origin IS FS
ICF LVQ DROP POP IG CH SU RF
1 25.85 21.75 17.91 22.53 20.36 25.27 30.64 23.64 24.52
2 35.71 31.66 27.08 31.40 29.59 35.07 43.09 33.44 34.87
3 41.88 38.17 32.97 36.64 36.01 41.42 50.52 40.18 40.93
4 45.84 42.25 37.40 40.10 40.45 45.26 55.12 44.90 45.01
5 48.95 45.79 40.50 42.76 44.16 48.42 58.54 47.95 47.90
TABLE 6
Accuracy (Percent) of IS and FS on DS-M1
List size Origin IS FS
ICF LVQ DROP POP IG CH SU RF
1 10.86 9.46 19.10 11.06 21.07 10.80 20.91 17.53 11.01
2 27.29 22.39 27.70 27.77 29.13 27.08 35.88 30.37 27.26
3 37.99 33.23 33.06 36.33 32.81 37.77 44.86 38.66 37.27
4 44.74 39.60 36.99 41.77 38.82 44.43 50.73 44.35 43.95
5 49.11 44.68 40.01 44.56 42.68 48.87 55.50 48.36 48.33
TABLE 7
Accuracy (Percent) of IS and FS on DS-M5
List size Origin IS FS
ICF LVQ DROP POP IG CH SU RF
1 20.72 18.84 20.78 19.76 19.73 20.57 21.61 20.07 20.16
2 30.37 27.36 29.10 28.39 29.52 30.14 32.43 30.37 29.30
3 35.53 32.66 34.76 33.00 35.80 35.31 38.88 36.56 34.59
4 39.48 36.82 38.82 36.42 40.44 39.17 43.14 41.28 38.72
5 42.61 40.18 41.94 39.71 44.13 42.35 46.46 44.75 42.07
TABLE 5
Accuracy (Percent) of IS and FS on DS-E5
List size Origin IS FS
ICF LVQ DROP POP IG CH SU RF
1 23.58 19.60 18.85 18.38 19.66 22.92 32.71 24.55 21.81
2 31.94 28.23 26.24 25.24 27.26 31.35 44.97 34.30 30.45
3 37.02 33.64 31.17 29.85 31.11 36.35 51.73 39.93 35.80
4 40.94 37.58 34.78 33.56 36.28 40.25 56.58 44.20 39.70
5 44.11 40.87 37.72 37.02 39.91 43.40 60.40 47.76 42.99
DS-M1, and DS-M5. The best results by instance selection
and the best results by feature selection are shown in bold.
Results by Naive Bayes without instance selection or fea-
ture selection are also presented for comparison. The size of
the recommendation list is set from 1 to 5. Results of the
other six data sets in Table 3 can be found in Section S5 in
the supplemental material, available online. Based on Sec-
tion 5.2.2, given a data set, IS denotes that 50 percent of bug reports are selected and FS denotes that 30 percent of words are selected.
As shown in Tables 4 and 5 for data sets in Eclipse, ICF
provides eight best results among four instance selection
algorithms when the list size is over two while either
DROP or POP can achieve one best result when the list
size is one. Among four feature selection algorithms, CH
provides the best accuracy. IG and SU also achieve good
results. In Tables 6 and 7 for Mozilla, POP in instance
selection obtains six best results; ICF, LVQ, and DROP
obtain one, one, two best results, respectively. In feature
selection, CH also provides the best accuracy. Based on
Tables 4, 5, 6, and 7, in the remainder of this paper, we only investigate the results of ICF and CH to avoid an exhaustive comparison of all four instance selection algorithms and four feature selection algorithms.
As shown in Tables 4, 5, 6, and 7, feature selection can
increase the accuracy of bug triage over a data set while
instance selection may decrease the accuracy. Such an
accuracy decrease is consistent with existing work ([8], [20], [41], [52]) on typical instance selection algorithms on classic data sets,4
which shows that instance selection may
hurt the accuracy. In the following, we will show that the
accuracy decrease by instance selection is caused by the
large number of developers in bug data sets.
To investigate the accuracy decrease by instance selection, we define the loss from origin to ICF as Loss_k = (Accuracy_k by origin − Accuracy_k by ICF) / (Accuracy_k by origin), where the recommendation list size is k. Given a bug data set, we sort
developers by the number of their fixed bugs in descend-
ing order. That is, we sort classes by the number of
instances in classes. Then a new data set with s develop-
ers is built by selecting the top-s developers. For one bug
data set, we build new data sets by varying s from 2 to
30. Fig. 6 presents the loss on two bug data sets (DS-E1
and DS-M1) when k = 1 or k = 3.
As shown in Fig. 6, most of the loss from origin to ICF
increases with the number of developers in the data sets. In
other words, the large number of classes causes the accu-
racy decrease. Let us recall the data scales in Table 3. Each
data set in our work contains over 200 classes. When apply-
ing instance selection, the accuracy of bug data sets in
Table 3 may decrease more than that of the classic data sets
in [8], [20], [41], [52] (which contain fewer than 20 classes, and mostly two classes).
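The Loss_k metric and the top-s developer subsets used in this analysis can be sketched as below; the helper names and toy data are ours:

```python
from collections import Counter

def loss_k(acc_origin, acc_icf):
    """Loss_k = (Accuracy_k by origin - Accuracy_k by ICF) / Accuracy_k
    by origin. A positive loss means instance selection hurt accuracy."""
    return (acc_origin - acc_icf) / acc_origin

def top_s_subset(reports, fixers, s):
    """Build a data set restricted to the s developers who fixed the most
    bugs (classes sorted by descending number of instances)."""
    top = {d for d, _ in Counter(fixers).most_common(s)}
    kept = [(r, f) for r, f in zip(reports, fixers) if f in top]
    return [r for r, _ in kept], [f for _, f in kept]
```

Varying s from 2 to 30 and plotting loss_k against s reproduces the shape of the analysis in Fig. 6: the loss tends to grow with the number of classes.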
In our work, the accuracy increase by feature selection
and the accuracy decrease by instance selection lead to the
combination of instance selection and feature selection. In other words, feature selection can compensate for the accuracy loss caused by instance selection. Thus, we apply instance selection and feature selection to simultaneously reduce the data scales. Tables 8, 9, 10, and 11 show the combinations of CH and ICF based on three bug triage algorithms, namely SVM, KNN, and Naive Bayes, on four data sets.
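The two reduction orders can be sketched as a pipeline that plugs in a feature selector and an instance selector in either order. The toy selectors below are our own stand-ins for CH and ICF (keep the highest-frequency words; drop reports left blank), chosen only to show how the order changes the result:

```python
def keep_top_words(matrix, n):
    """Toy feature selection (stand-in for CH): keep the n columns with
    the highest total frequency in the word-frequency matrix."""
    totals = [sum(row[j] for row in matrix) for j in range(len(matrix[0]))]
    keep = sorted(sorted(range(len(totals)), key=lambda j: -totals[j])[:n])
    return [[row[j] for j in keep] for row in matrix]

def drop_empty_rows(matrix, labels):
    """Toy instance selection (stand-in for ICF): drop blank (all-zero)
    reports together with their labels."""
    kept = [(row, lab) for row, lab in zip(matrix, labels) if any(row)]
    return [r for r, _ in kept], [l for _, l in kept]

def reduce_data(matrix, labels, feature_select, instance_select, fs_first):
    """Apply FS and IS in either order (FS then IS mirrors CH -> ICF;
    IS then FS mirrors ICF -> CH)."""
    if fs_first:
        matrix = feature_select(matrix)
        matrix, labels = instance_select(matrix, labels)
    else:
        matrix, labels = instance_select(matrix, labels)
        matrix = feature_select(matrix)
    return matrix, labels
```

On a toy matrix, the FS-first order can blank out a report that IS then removes, while the IS-first order keeps it, so the two orders yield different reduced sets.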
As shown in Table 8, for the Eclipse data set DS-E1, ICF → CH provides the best accuracy on three bug triage algorithms. Among these algorithms, Naive Bayes obtains much better results than SVM and KNN. ICF → CH based on Naive Bayes obtains the best results. Moreover, CH → ICF based on Naive Bayes can also achieve good results, which are better than Naive Bayes without data reduction. Thus, data reduction can improve the accuracy of bug triage, especially for the well-performed algorithm, Naive Bayes.
In Tables 9, 10, and 11, data reduction can also improve the accuracy of KNN and Naive Bayes. Both CH → ICF and ICF → CH can obtain better results than the original bug triage algorithms. An exception is SVM. The accuracy of data reduction on SVM is lower than that of the original SVM. A possible reason is that SVM is a discriminative model, which is not suitable for data reduction and has a more complex structure than KNN and Naive Bayes.
As shown in Tables 8, 9, 10, and 11, all the best results are obtained by CH → ICF or ICF → CH based on Naive Bayes. Based on data reduction, the accuracy of Naive Bayes on Eclipse is improved by 2 to 12 percent and the accuracy on Mozilla is improved by 1 to 6 percent. Considering the list size 5, data reduction based on Naive Bayes can obtain 13 to 38 percent better results than that based on SVM and 21 to 28 percent better results than that based on KNN. We find that data reduction should be built on a well-performed bug triage algorithm. In the following, we focus on the data reduction on Naive Bayes.
In Tables 8, 9, 10, and 11, the combinations of instance selection and feature selection can provide good accuracy and reduce the number of bug reports and words of the bug data. Meanwhile, the two orders, CH → ICF and ICF → CH, lead to different results. Taking the list size five as an example, for Naive Bayes, CH → ICF provides better accuracy than ICF → CH on DS-M1 while ICF → CH provides better accuracy than CH → ICF on DS-M5.
In Table 12, we compare the time cost of data reduc-
tion with the time cost of manual bug triage on four
Fig. 6. Loss from origin to ICF on two data sets. The origin denotes the
bug triage algorithm, Naive Bayes. The x-axis is the number of developers in a newly built data set; the y-axis is the loss. The loss above zero
denotes the accuracy of ICF is lower than that of origin while the loss
below zero denotes the accuracy of ICF is higher than that of origin.
4. UCI Machine Learning Repository, http://archive.ics.uci.edu/
ml/.
data sets. As shown in Table 12, the time cost of manual
bug triage is much longer than that of data reduction.
For a bug report, the average time cost of manual bug
triage is from 23 to 57 days. The average time of the
original Naive Bayes is from 88 to 139 seconds while the
average time of data reduction is from 298 to 1,558 sec-
onds. Thus, compared with manual bug triage, data reduction is efficient for bug triage and its time cost is negligible.
In summary, data reduction for bug triage can improve the accuracy of bug triage compared with the original data set. The advantage of combining instance selection and feature selection is to improve the accuracy and to reduce the scales of data sets in both the bug dimension and the word dimension (removing 50 percent of bug reports and 70 percent of words).
5.2.4 A Brief Case Study
The results in Tables 8, 9, 10, and 11 show that the order of
applying instance selection and feature selection can impact
the final accuracy of bug triage. In this part, we employ ICF
and CH with Naive Bayes to conduct a brief case study on
the data set DS-E1.
First, we measure the differences between the reduced data sets by CH → ICF and ICF → CH. Fig. 7 illustrates the bug reports and words in the data sets after applying CH → ICF and ICF → CH. Although there exists an overlap between the data sets by CH → ICF and ICF → CH, each of CH → ICF and ICF → CH retains its own bug reports and words. For example, we can observe that the reduced data set by CH → ICF keeps 1,655 words, which have been removed by ICF → CH; the reduced data set by ICF → CH keeps 2,150 words, which have been removed by CH → ICF. Such an observation
TABLE 8
Accuracy (Percent) of Data Reduction on DS-E1
List size SVM KNN Naive Bayes
Origin CH→ICF ICF→CH Origin CH→ICF ICF→CH Origin CH→ICF ICF→CH
1 7.75 7.19 8.77 12.76 18.51 20.63 25.85 25.42 27.24
2 11.45 12.39 14.41 12.96 20.46 24.06 35.71 39.00 39.56
3 15.40 15.81 18.45 13.04 21.38 25.75 41.88 46.88 47.58
4 18.27 18.53 21.55 13.14 22.13 26.53 45.84 51.77 52.45
5 21.18 20.79 23.54 13.23 22.58 27.27 48.95 55.55 55.89
TABLE 9
Accuracy (Percent) of Data Reduction on DS-E5
List size SVM KNN Naive Bayes
Origin CH!ICF ICF!CH Origin CH!ICF ICF!CH Origin CH!ICF ICF!CH
1 6.21 5.05 5.83 14.78 19.11 22.81 23.58 27.93 28.81
2 10.18 7.77 8.99 15.09 21.21 25.85 31.94 40.16 40.44
3 12.87 10.27 11.19 15.34 22.21 27.29 37.02 47.92 47.19
4 16.21 12.19 13.12 15.45 22.85 28.13 40.94 52.91 52.18
5 18.14 14.18 14.97 15.55 23.21 28.61 44.11 56.25 55.51
TABLE 10
Accuracy (Percent) of Data Reduction on DS-M1
List size SVM KNN Naive Bayes
Origin CH→ICF ICF→CH Origin CH→ICF ICF→CH Origin CH→ICF ICF→CH
1 11.98 10.88 10.38 11.87 14.74 15.10 10.86 17.07 19.45
2 21.82 19.36 17.98 12.63 16.40 18.44 27.29 31.77 32.11
3 29.61 26.65 24.93 12.81 16.97 19.43 37.99 41.67 40.28
4 35.08 32.03 29.46 12.88 17.29 19.93 44.74 48.43 46.47
5 38.72 36.22 33.27 13.08 17.82 20.55 49.11 53.38 51.40
TABLE 11
Accuracy (Percent) of Data Reduction on DS-M5
List size SVM KNN Naive Bayes
Origin CH→ICF ICF→CH Origin CH→ICF ICF→CH Origin CH→ICF ICF→CH
1 15.01 14.87 14.24 13.92 14.66 16.66 20.72 20.97 21.88
2 21.64 20.45 20.10 14.75 16.62 18.85 30.37 31.27 32.91
3 25.65 24.26 23.82 14.91 17.70 19.84 35.53 37.24 39.70
4 28.36 27.18 27.21 15.36 18.37 20.78 39.48 41.59 44.50
5 30.73 29.51 29.79 15.92 19.07 21.46 42.61 45.28 48.28
indicates that the order of applying CH and ICF brings different results for the reduced data set.
Second, we check the duplicate bug reports in the data sets by CH → ICF and ICF → CH. Duplicate bug reports are a kind of redundant data in a bug repository [47], [54]. Thus, we count the changes of duplicate bug reports in the data sets. In the original training set, there exist 532 duplicate bug reports. After data reduction, 198 duplicate bug reports are removed by CH → ICF while 262 are removed by ICF → CH. Such a result indicates that the order of applying instance selection and feature selection can impact the ability of removing redundant data.
Third, we check the blank bug reports during the data reduction. In this paper, a blank bug report refers to a zero-word bug report, whose words are all removed by feature selection. Such blank bug reports are finally removed in the data reduction since they provide no information. The removed bug reports and words can be viewed as a kind of noisy data. In our work, bugs 200019, 200632, 212996, and 214094 become blank bug reports after applying CH → ICF while bugs 201171, 201598, 204499, 209473, and 214035 become blank bug reports after ICF → CH. There is no overlap between the blank bug reports by CH → ICF and ICF → CH. Thus, we find that the order of applying instance selection and feature selection also impacts the ability of removing noisy data.
In summary of this brief case study on the data set in
Eclipse, the results of data reduction are impacted by the
order of applying instance selection and feature selection.
Thus, it is necessary to investigate how to determine the
order of applying these algorithms.
To further examine whether the results by CH → ICF are significantly different from those by ICF → CH, we perform a Wilcoxon signed-rank test [53] on the results by CH → ICF and ICF → CH on the 10 data sets in Table 3. In detail, we collect 50 pairs of accuracy values (10 data sets; five recommendation lists for each data set, i.e., sizes from 1 to 5) by applying CH → ICF and ICF → CH, respectively. The test yields a statistically significant p-value of 0.018, i.e., applying CH → ICF or ICF → CH leads to significant differences in the accuracy of bug triage.
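For reference, a self-contained sketch of the two-sided Wilcoxon signed-rank test using the normal approximation (without continuity correction); the paper presumably used a standard statistical implementation on its 50 pairs, so this is only an illustration of the procedure:

```python
import math

def wilcoxon_signed_rank(xs, ys):
    """Two-sided Wilcoxon signed-rank test with the normal approximation
    (no continuity correction). Zero differences are dropped and tied
    absolute differences receive average ranks. Returns (W, p)."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                       # assign average ranks to ties
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    w = min(w_plus, w_minus)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w, p
```

A p-value below 0.05 on the paired accuracy values would indicate that the two reduction orders differ significantly.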
5.3 Experiments on Prediction for Reduction Orders
5.3.1 Data Sets and Evaluation
We present the experiments on prediction for reduction orders in this part. We map a bug data set to an instance, and map the reduction order (i.e., FS → IS or IS → FS) to its label. Given a new bug data set, we train a classifier
TABLE 12
Time Comparison between Data Reduction and Manual Work

Data set  Manual bug triage  Origin (Preprocessing / Naive Bayes / Sum)  CH→ICF (Preprocessing / Data reduction / Naive Bayes / Sum)  ICF→CH (Preprocessing / Data reduction / Naive Bayes / Sum)
DS-E1  32.55 day  59 sec / 29 sec / 88 sec  58 sec / 322 sec / 3 sec / 383 sec  59 sec / 458 sec / 2 sec / 519 sec
DS-E5  23.14 day  55 sec / 25 sec / 80 sec  54 sec / 241 sec / 3 sec / 298 sec  54 sec / 367 sec / 3 sec / 424 sec
DS-M1  57.44 day  88 sec / 33 sec / 121 sec  88 sec / 698 sec / 4 sec / 790 sec  88 sec / 942 sec / 3 sec / 1,033 sec
DS-M5  23.77 day  87 sec / 52 sec / 139 sec  87 sec / 1,269 sec / 6 sec / 1,362 sec  88 sec / 1,465 sec / 5 sec / 1,558 sec
Fig. 7. Bug reports and words in the data set DS-E1 (i.e., bugs 200001-220000 in Eclipse) by applying CH → ICF and ICF → CH.
to predict its appropriate reduction order based on historical bug data sets.
As shown in Fig. 2c, to train the classifier, we label
each bug data set with its reduction order. In our work,
one bug unit denotes 5,000 consecutive bug reports. In Section 5.1, we have collected 298,785 bug reports in Eclipse
and 281,180 bug reports in Mozilla. Then, 60 bug units (298,785/5,000 ≈ 59.78) for Eclipse and 57 bug units (281,180/5,000 ≈ 56.24) for Mozilla are obtained. Next, we form bug data sets by combining bug units to train classifiers. In Table 13, we show the setup of data sets in Eclipse. Given 60 bug units in Eclipse, we consider one to five consecutive bug units as one data set. In total, we collect 300 (60 × 5) bug data sets on Eclipse. Similarly, we consider one to seven consecutive bug units as one data set on Mozilla and finally collect 399 (57 × 7) bug data sets. For each bug data set, we extract 18 attributes according to Table 2 and normalize all the attributes to values between 0 and 1.
We examine the results of prediction of reduction orders on ICF and CH. Given ICF and CH, we label each bug data set with its reduction order (i.e., CH → ICF or ICF → CH). First, for a bug data set, we respectively obtain the results of CH → ICF and ICF → CH by evaluating data reduction for bug triage based on Section 5.2. Second, for recommendation lists with sizes 1 to 5, we count the number of times each reduction order obtains the better accuracy. That is, if CH → ICF provides the better accuracy more often, we label the bug data set with CH → ICF, and vice versa.
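The labeling step can be sketched as follows (the function name is ours, and defaulting ties to ICF → CH is our assumption, as the paper does not specify tie-breaking):

```python
def label_reduction_order(acc_fs_is, acc_is_fs):
    """Label a bug data set with the reduction order that wins the better
    accuracy more often across the recommendation-list sizes k = 1..5.
    Ties default to ICF->CH here, an assumption the paper leaves open."""
    fs_is_wins = sum(1 for a, b in zip(acc_fs_is, acc_is_fs) if a > b)
    is_fs_wins = sum(1 for a, b in zip(acc_fs_is, acc_is_fs) if b > a)
    return "CH->ICF" if fs_is_wins > is_fs_wins else "ICF->CH"
```

Each argument is the list of five accuracy values (for k = 1 to 5) obtained by one reduction order on the same data set.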
Table 14 presents the statistics of bug data sets of Eclipse and Mozilla. Note that the numbers of data sets with CH → ICF and ICF → CH are imbalanced. In our work, we employ the classifier AdaBoost to predict reduction orders since AdaBoost is useful for classifying imbalanced data and generates understandable classification results [24].
In experiments, 10-fold cross-validation is used to evaluate the prediction for reduction orders. We employ four evaluation criteria, namely precision, recall, F1-measure, and accuracy. To balance the precision and recall, the F1-measure is defined as F1 = (2 × Recall × Precision) / (Recall + Precision). For a good classifier, the F1-measure for CH → ICF and the F1-measure for ICF → CH should be balanced to avoid classifying all the data sets into only one class. The accuracy measures the percentage of correctly predicted orders over all the bug data sets and is defined as Accuracy = (# correctly predicted orders) / (# all data sets).
5.3.2 Results
We investigate the results of predicting reduction orders
for bug triage on Eclipse and Mozilla. For each project, we
employ AdaBoost as the classifier based on two strategies,
namely resampling and reweighting [17]. A decision tree
classifier, C4.5, is embedded into AdaBoost. Thus, we com-
pare results of classifiers in Table 15.
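As an illustration of AdaBoost with the reweighting strategy, the sketch below uses one-feature decision stumps as weak learners in place of C4.5 (a simplification of ours; the resampling strategy would instead redraw the training set according to the weights each round):

```python
import math

def train_adaboost_stumps(X, y, rounds=10):
    """Minimal AdaBoost via reweighting, with one-feature decision stumps
    as weak learners (a stand-in for C4.5). X is a list of feature rows
    and y a list of labels in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []                      # (alpha, feature, threshold, polarity)
    for _ in range(rounds):
        best = None
        for f in range(len(X[0])):
            for thr in sorted({row[f] for row in X}):
                for pol in (1, -1):
                    preds = [pol if row[f] >= thr else -pol for row in X]
                    err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                    if best is None or err < best[0]:
                        best = (err, f, thr, pol, preds)
        err, f, thr, pol, preds = best
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))
        ensemble.append((alpha, f, thr, pol))
        # Reweighting: increase the weight of misclassified instances.
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, preds)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, row):
    score = sum(a * (pol if row[f] >= thr else -pol)
                for a, f, thr, pol in ensemble)
    return 1 if score >= 0 else -1
```

In the paper's setting, each row would hold the 18 normalized attributes of Table 2 and the label would encode the reduction order.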
In Table 15, C4.5, AdaBoost C4.5 resampling, and AdaBoost C4.5 reweighting can obtain good values of F1-measure on Eclipse, and AdaBoost C4.5 reweighting obtains the best F1-measure. All three classifiers can obtain good accuracy and C4.5 obtains the best accuracy. Due to the imbalanced number of bug data sets, the values of F1-measure for CH → ICF and ICF → CH are imbalanced. The results on Eclipse indicate that AdaBoost with reweighting provides the best results among these three classifiers.
For the other project, Mozilla, in Table 15, AdaBoost with resampling can obtain the best accuracy and F1-measure. Note that the values of F1-measure by CH → ICF and ICF → CH on Mozilla are more balanced than those on Eclipse. For example, when classifying with AdaBoost C4.5 reweighting, the difference of F1-measure on Eclipse is 69.7 percent (85.8% − 16.1%) and the difference on Mozilla is 30.8 percent (70.5% − 39.7%). A reason for this fact is that the number of bug data sets with the order ICF → CH in Eclipse is about 5.67 times (255/45) that
TABLE 13
Setup of Data Sets in Eclipse
TABLE 15
Results on Predicting Reduction Orders (Percent)
Project Classifier CH→ICF ICF→CH Accuracy
Precision Recall F1 Precision Recall F1
Eclipse C4.5 13.3 4.4 6.7 84.9 94.9 89.6 81.3
AdaBoost C4.5 resampling 14.7 11.1 12.7 85.0 88.6 86.8 77.0
AdaBoost C4.5 reweighting 16.7 15.6 16.1 85.3 86.3 85.8 75.7
Mozilla C4.5 48.0 29.9 36.9 63.5 78.9 70.3 59.6
AdaBoost C4.5 resampling 52.7 56.1 54.3 70.3 67.4 68.8 62.9
AdaBoost C4.5 reweighting 49.5 33.1 39.7 64.3 78.1 70.5 60.4
TABLE 14
Data Sets of Prediction for Reduction Orders
Project # Data sets # CH→ICF # ICF→CH
Eclipse 300 45 255
Mozilla 399 157 242
Eclipse + Mozilla 699 202 497
with CH → ICF, while in Mozilla, the number of bug data sets with ICF → CH is 1.54 times (242/157) that with CH → ICF.
The number of bug data sets on either Eclipse (300 data sets) or Mozilla (399 data sets) is small. Since Eclipse and Mozilla are both large-scale open source projects and share a similar development style [64], we consider combining the data sets of Eclipse and Mozilla to form a larger collection of data sets. Table 16 shows the results of predicting reduction orders on all 699 bug data sets, including 202 data sets with CH → ICF and 497 data sets with ICF → CH. As shown in Table 16, the results of the three classifiers are very close. Each of C4.5, AdaBoost C4.5 resampling, and AdaBoost C4.5 reweighting can provide good F1-measure and accuracy. Based on the results of these 699 bug data sets in Table 16, AdaBoost C4.5 reweighting is the best among these three classifiers.
As shown in Tables 15 and 16, we find that it is feasible to build a classifier based on attributes of bug data sets to determine whether to use CH → ICF or ICF → CH. To investigate which attribute impacts the predicted results, we
employ the top node analysis to further check the results by
AdaBoost C4.5 reweighting in Table 16. Top node analysis is
a method to rank representative nodes (e.g., attributes in
prediction for reduction orders) in a decision tree classifier
on software data [46].
In Table 17, we employ the top node analysis to present the representative attributes when predicting the reduction order. The level of a node denotes its distance to the root node in a decision tree (Level 0 is the root node); the frequency denotes the number of times an attribute appears in one level (summed over the 10 decision trees in 10-fold cross-validation). In Level 0, i.e., the root node of decision trees, attributes B3 (Length of bug reports) and D3 (# Words per fixer) each appear twice. In other words, these two attributes are more decisive than the other attributes for predicting the reduction orders. Similarly, B6, D1, B3, and B4 are decisive attributes in Level 1. By
checking all the three levels in Table 17, the attribute B3
(Length of bug reports) appears in all the levels. This fact
indicates that B3 is a representative attribute when pre-
dicting the reduction order. Moreover, based on the anal-
ysis in Table 17, no attribute dominates all the levels. For example, each attribute in Level 0 appears no more than twice and each attribute in Level 1 no more than three times. The results of the top node analysis indicate that no single attribute can determine the prediction of reduction orders and that each attribute is helpful to the prediction.
6 DISCUSSION
In this paper, we propose the problem of data reduction for
bug triage to reduce the scales of data sets and to improve
the quality of bug reports. We use techniques of instance
selection and feature selection to reduce noise and redun-
dancy in bug data sets. However, not all the noise and
redundancy are removed. For example, as mentioned in
Section 5.2.4, fewer than 50 percent of duplicate bug reports can be removed in data reduction (198/532 = 37.2% by CH→ICF and 262/532 = 49.2% by ICF→CH). The reason for this is that it is hard to exactly detect noise and
redundancy in real-world applications. On one hand, due
to the large scales of bug repositories, there exist no ade-
quate labels to mark whether a bug report or a word
belongs to noise or redundancy; on the other hand, since all
the bug reports in a bug repository are recorded in natural
languages, even noisy and redundant data may contain use-
ful information for bug fixing.
In our work, we propose data reduction for bug triage. As shown in Tables 4, 5, 6, and 7, even with a recommendation list, the accuracy of bug triage is not high (less than 61 percent). This is caused by the complexity of bug triage, which we explain as follows. First,
in bug reports, statements in natural languages may be hard
TABLE 16
Results on Predicting Reduction Orders by Combining Bug Data Sets on Eclipse and Mozilla (Percent)

Classifier                   CH→ICF                  ICF→CH                  Accuracy
                             Precision Recall F1     Precision Recall F1
C4.5                         49.5      50.5   50.0   79.7      79.1   79.4   70.8
AdaBoost C4.5 resampling     49.4      40.1   44.3   77.4      83.3   80.2   70.8
AdaBoost C4.5 reweighting    51.3      48.0   49.6   79.4      81.5   80.4   71.8
TABLE 17
Top Node Analysis of Predicting Reduction Orders

Level*   Frequency   Index   Attribute name
0        2           B3      Length of bug reports
         2           D3      # Words per fixer
1        3           B6      Entropy of severity
         3           D1      # Fixers
         2           B3      Length of bug reports
         2           B4      # Unique words
2        4           B6      Entropy of severity
         3           B7      Entropy of priority
         3           B9      Entropy of component
         2           B3      Length of bug reports
         2           B4      # Unique words
         2           B5      Ratio of sparseness
         2           B8      Entropy of product
         2           D5      # Bug reports per reporter
         2           D8      Similarity between fixers and reporters

* Only nodes in Level 0 to Level 2 of the decision trees are presented. In each level, we omit an attribute if its frequency equals 1.
276 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 1, JANUARY 2015
to clearly understand; second, there exist many potential
developers in bug repositories (over 200 developers based
on Table 3); third, it is hard to cover all the knowledge of
bugs in a software project and even human triagers may
assign developers by mistake. Our work can be used to
assist human triagers rather than replace them.
In this paper, we construct a predictive model to deter-
mine the reduction order for a new bug data set based on
historical bug data sets. Attributes in this model are statistic
values of bug data sets, e.g., the number of words or the
length of bug reports. No representative words of bug data
sets are extracted as attributes. We plan to extract more
detailed attributes in future work.
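As an illustration, such data-set-level attributes could be computed along the following lines (a sketch; the input format and the exact attribute definitions are assumptions based on the attribute names in Table 17, not the paper's implementation):

```python
# Sketch of extracting statistical attributes of a whole bug data set,
# e.g., average length of bug reports (B3), number of unique words (B4),
# and entropy of severity (B6).
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a categorical attribute across reports."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def dataset_attributes(bug_reports):
    """bug_reports: list of dicts with 'words' (list) and 'severity' (str)."""
    lengths = [len(r["words"]) for r in bug_reports]
    vocab = {w for r in bug_reports for w in r["words"]}
    return {
        "B3_length_of_bug_reports": sum(lengths) / len(lengths),
        "B4_unique_words": len(vocab),
        "B6_entropy_of_severity": entropy([r["severity"] for r in bug_reports]),
    }

reports = [
    {"words": ["crash", "on", "startup"], "severity": "major"},
    {"words": ["ui", "glitch"], "severity": "minor"},
]
attrs = dataset_attributes(reports)
print(attrs["B4_unique_words"])  # 5
```

A vector of such attributes per historical data set, labeled with the better-performing order, is what the predictive model is trained on.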
The F1-measure and accuracy of predicting reduction orders are not high for binary classifiers. Our goal in this work is to present a way to determine the order of applying instance selection and feature selection. Our work is not an ideal solution to the prediction of reduction orders; it can be viewed as a step towards fully automatic prediction. We can train the predictive model once and then predict the reduction order for each new bug data set. The cost of such prediction is low compared with trying all the orders for every bug data set.
Another potential issue is that bug reports are not
reported at the same time in real-world bug repositories. In
our work, we extract attributes of a bug data set and treat all the bugs in the data set as if they were reported within a short period. Compared with the time cost of bug triage, the time range of a bug data set can be ignored. Thus, the extraction of attributes from a bug data set can be applied in real-world settings.
7 RELATED WORK
In this section, we review existing work on modeling bug
data, bug triage, and the quality of bug data with defect
prediction.
7.1 Modeling Bug Data
To investigate the relationships in bug data, Sandusky et al.
[45] form a bug report network to examine the dependency
among bug reports. Besides studying relationships among
bug reports, Hong et al. [23] build a developer social net-
work to examine the collaboration among developers based
on the bug data in Mozilla project. This developer social net-
work is helpful to understand the developer community
and the project evolution. By mapping bug priorities to
developers, Xuan et al. [57] identify the developer prioriti-
zation in open source bug repositories. The developer prior-
itization can distinguish developers and assist tasks in
software maintenance.
To investigate the quality of bug data, Zimmermann et al.
[64] design questionnaires to developers and users in three
open source projects. Based on the analysis of question-
naires, they characterize what makes a good bug report and
train a classifier to identify whether the quality of a bug
report should be improved. Duplicate bug reports weaken the quality of bug data by delaying the handling of bugs. To detect duplicate bug reports, Wang et al. [54]
design a natural language processing approach by matching
the execution information; Sun et al. [47] propose a
duplicate bug detection approach by optimizing a retrieval
function on multiple features.
To improve the quality of bug reports, Breu et al. [9] manually analyze 600 bug reports in open source projects to identify information that is overlooked in bug data. Based on the
comparative analysis on the quality between bugs and
requirements, Xuan et al. [55] transfer bug data to require-
ments databases to supplement the lack of open data in
requirements engineering.
In this paper, we also focus on the quality of bug data. In
contrast to existing work on studying the characteristics of
data quality (e.g., [9], [64]) or focusing on duplicate bug
reports (e.g., [47], [54]), our work can be utilized as a prepro-
cessing technique for bug triage, which both improves data
quality and reduces data scale.
7.2 Bug Triage
Bug triage aims to assign an appropriate developer to fix a
new bug, i.e., to determine who should fix a bug. Cubranic
and Murphy [12] first propose the problem of automatic bug
triage to reduce the cost of manual bug triage. They apply
text classification techniques to predict related developers.
Anvik et al. [1] examine multiple techniques on bug triage,
including data preparation and typical classifiers. Anvik and Murphy [3] extend this work to reduce the effort of bug triage by creating development-oriented recommenders.
Jeong et al. [25] find that over 37 percent of bug reports have been reassigned in manual bug triage. They
propose a tossing graph method to reduce reassignment in
bug triage. To avoid low-quality bug reports in bug triage,
Xuan et al. [56] train a semi-supervised classifier by combin-
ing unlabeled bug reports with labeled ones. Park et al. [40]
convert bug triage into an optimization problem and pro-
pose a collaborative filtering approach to reducing the bug-
fixing time.
For bug data, several other tasks exist once bugs are
triaged. For example, severity identification [30] aims to
detect the importance of bug reports for further schedul-
ing in bug handling; time prediction of bugs [61] models
the time cost of bug fixing and predicts the time cost of
given bug reports; reopened-bug analysis [46], [63] iden-
tifies the incorrectly fixed bug reports to avoid delaying
the software release.
In data mining, the problem of bug triage relates to
the problems of expert finding (e.g., [6], [50]) and ticket rout-
ing (e.g., [35], [44]). In contrast to the broad domains in
expert finding or ticket routing, bug triage focuses only on assigning developers to bug reports. Moreover, bug reports in bug triage are represented as documents (not keywords as in expert finding) and bug triage is a kind of content-based classification (not sequence-based as in ticket routing).
7.3 Data Quality in Defect Prediction
In our work, we address the problem of data reduction for
bug triage. To our knowledge, no existing work has inves-
tigated the bug data sets for bug triage. In a related prob-
lem, defect prediction, some work has focused on the data
quality of software defects. In contrast to multiple-class
classification in bug triage, defect prediction is a binary-
class classification problem, which aims to predict whether
XUAN ET AL.: TOWARDS EFFECTIVE BUG TRIAGE WITH SOFTWARE DATA REDUCTION TECHNIQUES 277
a software artifact (e.g., a source code file, a class, or a
module) contains faults according to the extracted features
of the artifact.
In software engineering, defect prediction is a kind of
work on software metrics. To improve the data quality,
Khoshgoftaar et al. [26] and Gao et al. [21] examine the
techniques on feature selection to handle imbalanced
defect data. Shivaji et al. [49] propose a framework to
examine multiple feature selection algorithms and
remove noise features in classification-based defect pre-
diction. Besides feature selection in defect prediction,
Kim et al. [29] present how to measure the noise resis-
tance in defect prediction and how to detect noise data.
Moreover, Bishnu and Bhattacherjee [7] process the
defect data with quad tree based k-means clustering to
assist defect prediction.
In this paper, in contrast to the above work, we address
the problem of data reduction for bug triage. Our work can
be viewed as an extension of software metrics. In our work,
we predict a value for a set of software artifacts, while existing work in software metrics predicts a value for an individual software artifact.
8 CONCLUSIONS
Bug triage is an expensive step of software maintenance in
both labor cost and time cost. In this paper, we combine fea-
ture selection with instance selection to reduce the scale of
bug data sets as well as improve the data quality. To deter-
mine the order of applying instance selection and feature
selection for a new bug data set, we extract attributes of
each bug data set and train a predictive model based on his-
torical data sets. We empirically investigate the data reduc-
tion for bug triage in bug repositories of two large open
source projects, namely Eclipse and Mozilla. Our work pro-
vides an approach to leveraging techniques on data process-
ing to form reduced and high-quality bug data in software
development and maintenance.
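To make the combination concrete, the following sketch applies feature selection before instance selection, in the spirit of the CH→ICF order, with deliberately simplified stand-ins: word frequency replaces the chi-square (CH) criterion and near-duplicate removal replaces ICF. It illustrates the two-dimensional reduction, not the paper's exact algorithms:

```python
# Reduce the word dimension first (feature selection), then the bug
# dimension (instance selection) on the reduced representation.
from collections import Counter

def select_features(docs, k):
    """Keep only the k most frequent words across all bug reports."""
    counts = Counter(w for doc in docs for w in doc)
    keep = {w for w, _ in counts.most_common(k)}
    return [[w for w in doc if w in keep] for doc in docs]

def select_instances(docs, labels):
    """Drop a report whose reduced word set duplicates an earlier one
    with the same fixer label."""
    seen, kept_docs, kept_labels = set(), [], []
    for doc, label in zip(docs, labels):
        key = (frozenset(doc), label)
        if key not in seen:
            seen.add(key)
            kept_docs.append(doc)
            kept_labels.append(label)
    return kept_docs, kept_labels

docs = [["crash", "startup", "menu"],
        ["crash", "startup"],
        ["crash", "startup", "toolbar"]]
labels = ["alice", "alice", "bob"]
reduced = select_features(docs, 2)          # keeps "crash", "startup"
docs2, labels2 = select_instances(reduced, labels)
print(len(docs2), labels2)
```

Swapping the two calls gives the opposite order; which order performs better on a given data set is exactly what the predictive model is meant to decide.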
In future work, we plan to improve the results of data reduction in bug triage and to explore how to prepare a high-quality bug data set for a domain-specific software task. For predicting reduction orders, we plan to investigate the potential relationship between the attributes of bug data sets and the reduction orders.
ACKNOWLEDGMENTS
The authors would like to thank the anonymous reviewers
for their valuable and constructive comments on improv-
ing the paper. This work was supported by the National
Natural Science Foundation of China (under grants
61033012, 61229301, and 61370144), the New Century
Excellent Talents in University (under grant NCET-13–
0073), the Program for Changjiang Scholars and Innovative
Research Team in University (PCSIRT) of the Ministry of
Education, China (under grant IRT13059), and the National
973 Program of China (under grant 2013CB329604).
REFERENCES
[1] J. Anvik, L. Hiew, and G. C. Murphy, “Who should fix this bug?”
in Proc. 28th Int. Conf. Softw. Eng., May 2006, pp. 361–370.
[2] S. Artzi, A. Kiezun, J. Dolby, F. Tip, D. Dig, A. Paradkar, and M. D. Ernst, “Finding bugs in web applications using dynamic test generation and explicit-state model checking,” IEEE Trans. Softw. Eng., vol. 36, no. 4, pp. 474–494, Jul./Aug. 2010.
[3] J. Anvik and G. C. Murphy, “Reducing the effort of bug report tri-
age: Recommenders for development-oriented decisions,” ACM
Trans. Soft. Eng. Methodol., vol. 20, no. 3, article 10, Aug. 2011.
[4] C. C. Aggarwal and P. Zhao, “Towards graphical models for text
processing,” Knowl. Inform. Syst., vol. 36, no. 1, pp. 1–21, 2013.
[5] Bugzilla. (2014). [Online]. Available: http://bugzilla.org/
[6] K. Balog, L. Azzopardi, and M. de Rijke, “Formal models for
expert finding in enterprise corpora,” in Proc. 29th Annu. Int. ACM
SIGIR Conf. Res. Develop. Inform. Retrieval, Aug. 2006, pp. 43–50.
[7] P. S. Bishnu and V. Bhattacherjee, “Software fault prediction using
quad tree-based k-means clustering algorithm,” IEEE Trans.
Knowl. Data Eng., vol. 24, no. 6, pp. 1146–1150, Jun. 2012.
[8] H. Brighton and C. Mellish, “Advances in instance selection for
instance-based learning algorithms,” Data Mining Knowl. Discov-
ery, vol. 6, no. 2, pp. 153–172, Apr. 2002.
[9] S. Breu, R. Premraj, J. Sillito, and T. Zimmermann, “Information
needs in bug reports: Improving cooperation between developers
and users,” in Proc. ACM Conf. Comput. Supported Cooperative
Work, Feb. 2010, pp. 301–310.
[10] V. Bolón-Canedo, N. Sánchez-Maroño, and A. Alonso-Betanzos, “A review of feature selection methods on synthetic data,” Knowl. Inform. Syst., vol. 34, no. 3, pp. 483–519, 2013.
[11] V. Cerveron and F. J. Ferri, “Another move toward the minimum
consistent subset: A tabu search approach to the condensed near-
est neighbor rule,” IEEE Trans. Syst., Man, Cybern., Part B, Cybern.,
vol. 31, no. 3, pp. 408–413, Jun. 2001.
[12] D. Cubranic and G. C. Murphy, “Automatic bug triage using text
categorization,” in Proc. 16th Int. Conf. Softw. Eng. Knowl. Eng.,
Jun. 2004, pp. 92–97.
[13] Eclipse. (2014). [Online]. Available: http://eclipse.org/
[14] B. Fitzgerald, “The transformation of open source software,” MIS
Quart., vol. 30, no. 3, pp. 587–598, Sep. 2006.
[15] A. K. Farahat, A. Ghodsi, and M. S. Kamel, “Efficient greedy feature selection for unsupervised learning,” Knowl. Inform. Syst., vol. 35, no. 2, pp. 285–310, May 2013.
[16] N. E. Fenton and S. L. Pfleeger, Software Metrics: A Rigorous and
Practical Approach, 2nd ed. Boston, MA, USA: PWS Publishing,
1998.
[17] Y. Freund and R. E. Schapire, “Experiments with a new boosting
algorithm,” in Proc. 13th Int. Conf. Mach. Learn., Jul. 1996, pp. 148–
156.
[18] Y. Fu, X. Zhu, and B. Li, “A survey on instance selection for active
learning,” Knowl. Inform. Syst., vol. 35, no. 2, pp. 249–283, 2013.
[19] I. Guyon and A. Elisseeff, “An introduction to variable and feature
selection,” J. Mach. Learn. Res., vol. 3, pp. 1157–1182, 2003.
[20] M. Grochowski and N. Jankowski, “Comparison of instance selection algorithms II. Results and comments,” in Proc. 7th Int. Conf. Artif. Intell. Softw. Comput., Jun. 2004, pp. 580–585.
[21] K. Gao, T. M. Khoshgoftaar, and A. Napolitano, “Impact of data
sampling on stability of feature selection for software measure-
ment data,” in Proc. 23rd IEEE Int. Conf. Tools Artif. Intell., Nov.
2011, pp. 1004–1011.
[22] A. E. Hassan, “The road ahead for mining software repositories,”
in Proc. Front. Softw. Maintenance, Sep. 2008, pp. 48–57.
[23] Q. Hong, S. Kim, S. C. Cheung, and C. Bird, “Understanding a
developer social network and its evolution,” in Proc. 27th IEEE
Int. Conf. Softw. Maintenance, Sep. 2011, pp. 323–332.
[24] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techni-
ques, 3rd ed. Burlington, MA, USA: Morgan Kaufmann, 2011.
[25] G. Jeong, S. Kim, and T. Zimmermann, “Improving bug triage
with tossing graphs,” in Proc. Joint Meeting 12th Eur. Softw. Eng.
Conf. 17th ACM SIGSOFT Symp. Found. Softw. Eng., Aug. 2009,
pp. 111–120.
[26] T. M. Khoshgoftaar, K. Gao, and N. Seliya, “Attribute selection
and imbalanced data: Problems in software defect prediction,”
in Proc. 22nd IEEE Int. Conf. Tools Artif. Intell., Oct. 2010,
pp. 137–144.
[27] T. Kohonen, J. Hynninen, J. Kangas, J. Laaksonen, and K.
Torkkola, “LVQ_PAK: The learning vector quantization pro-
gram package,” Helsinki Univ. Technol., Esbo, Finland, Tech.
Rep. A30, 1996.
[28] S. Kim, K. Pan, E. J. Whitehead, Jr., “Memories of bug fixes,” in
Proc. ACM SIGSOFT Int. Symp. Found. Softw. Eng., 2006, pp. 35–45.
[29] S. Kim, H. Zhang, R. Wu, and L. Gong, “Dealing with noise in
defect prediction,” in Proc. 32nd ACM/IEEE Int. Conf. Softw. Eng.,
May 2010, pp. 481–490.
[30] A. Lamkanfi, S. Demeyer, E. Giger, and B. Goethals, “Predicting
the severity of a reported bug,” in Proc. 7th IEEE Working Conf.
Mining Softw. Repositories, May 2010, pp. 1–10.
[31] G. Lang, Q. Li, and L. Guo, “Discernibility matrix simplifica-
tion with new attribute dependency functions for incomplete
information systems,” Knowl. Inform. Syst., vol. 37, no. 3,
pp. 611–638, 2013.
[32] D. Lo, J. Li, L. Wong, and S. C. Khoo, “Mining iterative generators
and representative rules for software specification discovery,”
IEEE Trans. Knowl. Data Eng., vol. 23, no. 2, pp. 282–296, Feb. 2011.
[33] Mozilla. (2014). [Online]. Available: http://mozilla.org/
[34] D. Matter, A. Kuhn, and O. Nierstrasz, “Assigning bug reports
using a vocabulary-based expertise model of developers,” in Proc.
6th Int. Working Conf. Mining Softw. Repositories, May 2009,
pp. 131–140.
[35] G. Miao, L. E. Moser, X. Yan, S. Tao, Y. Chen, and N. Anerousis,
“Generative models for ticket resolution in expert networks,” in
Proc. 16th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining,
2010, pp. 733–742.
[36] E. Murphy-Hill, T. Zimmermann, C. Bird, and N. Nagappan, “The
design of bug fixes,” in Proc. Int. Conf. Softw. Eng., 2013, pp. 332–
341.
[37] J. A. Olvera-López, J. A. Carrasco-Ochoa, J. F. Martínez-Trinidad, and J. Kittler, “A review of instance selection methods,” Artif. Intell. Rev., vol. 34, no. 2, pp. 133–143, 2010.
[38] J. A. Olvera-López, J. F. Martínez-Trinidad, and J. A. Carrasco-Ochoa, “Restricted sequential floating search applied to object selection,” in Proc. Int. Conf. Mach. Learn. Data Mining Pattern Recognit., 2007, pp. 694–702.
[39] R. S. Pressman, Software Engineering: A Practitioner’s Approach, 7th
ed. New York, NY, USA: McGraw-Hill, 2010.
[40] J. W. Park, M. W. Lee, J. Kim, S. W. Hwang, and S. Kim,
“Costriage: A cost-aware triage algorithm for bug reporting sys-
tems,” in Proc. 25th Conf. Artif. Intell., Aug. 2011, pp. 139–144.
[41] J. C. Riquelme, J. S. Aguilar-Ruiz, and M. Toro, “Finding representative patterns with ordered projections,” Pattern Recognit., vol. 36, pp. 1009–1018, 2003.
[42] M. Robnik-Šikonja and I. Kononenko, “Theoretical and empirical analysis of ReliefF and RReliefF,” Mach. Learn., vol. 53, no. 1/2, pp. 23–69, Oct. 2003.
[43] M. Rogati and Y. Yang, “High-performing feature selection for
text classification,” in Proc. 11th Int. Conf. Inform. Knowl. Manag.,
Nov. 2002, pp. 659–661.
[44] Q. Shao, Y. Chen, S. Tao, X. Yan, and N. Anerousis, “Efficient
ticket routing by resolution sequence mining,” in Proc. 14th ACM
SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2008,
pp. 605–613.
[45] R. J. Sandusky, L. Gasser, and G. Ripoche, “Bug report networks:
Varieties, strategies, and impacts in an F/OSS development
community,” in Proc. 1st Intl. Workshop Mining Softw. Repositories,
May 2004, pp. 80–84.
[46] E. Shihab, A. Ihara, Y. Kamei, W. M. Ibrahim, M. Ohira, B. Adams,
A. E. Hassan, and K. Matsumoto, “Predicting re-opened bugs: A
case study on the eclipse project,” in Proc. 17th Working Conf.
Reverse Eng., Oct. 2010, pp. 249–258.
[47] C. Sun, D. Lo, S. C. Khoo, and J. Jiang, “Towards more accurate
retrieval of duplicate bug reports,” in Proc. 26th IEEE/ACM Int.
Conf. Automated Softw. Eng., 2011, pp. 253–262.
[48] A. Srisawat, T. Phienthrakul, and B. Kijsirikul, “SV-kNNC: An
algorithm for improving the efficiency of k-nearest neighbor,” in
Proc. 9th Pacific Rim Int. Conf. Artif. Intell., Aug. 2006, pp. 975–979.
[49] S. Shivaji, E. J. Whitehead, Jr., R. Akella, and S. Kim, “Reducing
features to improve code change based bug prediction,” IEEE
Trans. Soft. Eng., vol. 39, no. 4, pp. 552–569, Apr. 2013.
[50] J. Tang, J. Zhang, R. Jin, Z. Yang, K. Cai, L. Zhang, and Z. Su,
“Topic level expertise search over heterogeneous networks,”
Mach. Learn., vol. 82, no. 2, pp. 211–237, Feb. 2011.
[51] I. H. Witten, E. Frank, and M. A. Hall, Data Mining: Practical
Machine Learning Tools and Techniques, 3rd ed. Burlington, MA,
USA: Morgan Kaufmann, 2011.
[52] D. R. Wilson and T. R. Martinez, “Reduction techniques for instance-based learning algorithms,” Mach. Learn., vol. 38, pp. 257–286, 2000.
[53] R. E. Walpole, R. H. Myers, S. L. Myers, and K. Ye, Probability &amp; Statistics for Engineers &amp; Scientists, 8th ed. Upper Saddle River, NJ, USA: Pearson Education, 2006.
[54] X. Wang, L. Zhang, T. Xie, J. Anvik, and J. Sun, “An approach to
detecting duplicate bug reports using natural language and exe-
cution information,” in Proc. 30th Int. Conf. Softw. Eng., May 2008,
pp. 461–470.
[55] J. Xuan, H. Jiang, Z. Ren, and Z. Luo, “Solving the large scale next
release problem with a backbone based multilevel algorithm,”
IEEE Trans. Softw. Eng., vol. 38, no. 5, pp. 1195–1212, Sept./Oct.
2012.
[56] J. Xuan, H. Jiang, Z. Ren, J. Yan, and Z. Luo, “Automatic bug tri-
age using semi-supervised text classification,” in Proc. 22nd Int.
Conf. Softw. Eng. Knowl. Eng., Jul. 2010, pp. 209–214.
[57] J. Xuan, H. Jiang, Z. Ren, and W. Zou, “Developer prioritization in
bug repositories,” in Proc. 34th Int. Conf. Softw. Eng., 2012, pp. 25–
35.
[58] T. Xie, S. Thummalapenta, D. Lo, and C. Liu, “Data mining for software engineering,” Computer, vol. 42, no. 8, pp. 55–62, Aug. 2009.
[59] Y. Yang, “An evaluation of statistical approaches to text catego-
rization,” Inform. Retrieval, vol. 1, pp. 69–90, 1999.
[60] Y. Yang and J. Pedersen, “A comparative study on feature selec-
tion in text categorization,” in Proc. Int. Conf. Mach. Learn., 1997,
pp. 412–420.
[61] H. Zhang, L. Gong, and S. Versteeg, “Predicting bug-fixing time:
An empirical study of commercial software projects,” in Proc. 35th
Int. Conf. Softw. Eng., May 2013, pp. 1042–1051.
[62] W. Zou, Y. Hu, J. Xuan, and H. Jiang, “Towards training set reduc-
tion for bug triage,” in Proc. 35th Annu. IEEE Int. Comput. Soft.
Appl. Conf., Jul. 2011, pp. 576–581.
[63] T. Zimmermann, N. Nagappan, P. J. Guo, and B. Murphy,
“Characterizing and predicting which bugs get reopened,” in
Proc. 34th Int. Conf. Softw. Eng., Jun. 2012, pp. 1074–1083.
[64] T. Zimmermann, R. Premraj, N. Bettenburg, S. Just, A. Schröter, and C. Weiss, “What makes a good bug report?” IEEE Trans. Softw. Eng., vol. 36, no. 5, pp. 618–643, Oct. 2010.
[65] H. Zhang and G. Sun, “Optimal reference subset selection for nearest neighbor classification by tabu search,” Pattern Recognit., vol. 35, pp. 1481–1490, 2002.
[66] X. Zhu and X. Wu, “Cost-constrained data acquisition for intelli-
gent data preparation,” IEEE Trans. Knowl. Data Eng., vol. 17,
no. 11, pp. 1542–1556, Nov. 2005.
Jifeng Xuan received the BSc degree in software
engineering in 2007 and the PhD degree in 2013,
from Dalian University of Technology, China. He is
currently a postdoctoral researcher at INRIA Lille
– Nord Europe, France. His research interests
include mining software repositories, search-
based software engineering, and machine learn-
ing. He is a member of the ACM and the China
Computer Federation (CCF).
He Jiang received the PhD degree in computer
science from the University of Science and Tech-
nology of China (USTC), China. He is a professor
in the School of Software, Dalian University of
Technology, China. His research interests include
computational intelligence and its applications in
software engineering and data mining. He is a program co-chair of the 2012 International Conference on Industrial, Engineering &amp; Other Applications of Applied Intelligent Systems (IEA/AIE 2012). He is also a member of the IEEE, the ACM, and the CCF.
Yan Hu received the BSc and PhD degrees in
computer science from the University of Science
and Technology of China (USTC), China, in 2002
and 2007, respectively. He is currently an assis-
tant professor in the School of Software, Dalian
University of Technology, China. His research
interests include model checking, program analy-
sis, and software engineering. He is a member of
the ACM and the CCF.
Zhilei Ren received the BSc degree in software
engineering in 2007 and the PhD degree in 2013,
from Dalian University of Technology, China. He is
currently a postdoctoral researcher in the School
of Software at Dalian University of Technology.
His research interests include evolutionary com-
putation and its applications in software engineer-
ing. He is a member of the ACM and the CCF.
Weiqin Zou received the BSc degree in software
engineering in 2010 and the MSc degree in com-
puter application and technology in 2013, from
Dalian University of Technology. She is currently a
teaching assistant in the Department of Informa-
tion Engineering, Jiangxi University of Science
and Technology, China. Her research interests
include mining software repositories and machine
learning.
Zhongxuan Luo received the BSc and MSc degrees in computational mathematics from Jilin University, China, in 1985 and 1988, respectively, and the PhD degree from Dalian University of Technology in 1991. He is a professor in the School of
Mathematical Sciences at Dalian University of
Technology, China. His research interests include
multivariate approximation theory and computa-
tional geometry.
Xindong Wu received the PhD degree in artificial
intelligence from the University of Edinburgh, Brit-
ain. He is a Yangtze River Scholar in the School of
Computer Science and Information Engineering,
the Hefei University of Technology, China, and a
professor in the Department of Computer Science,
the University of Vermont. His research interests
include data mining, knowledge-based systems,
and web information exploration. He is a fellow of
the IEEE and AAAS.