Software analytics (for software quality purposes) is a statistical or machine learning classifier that is trained to identify defect-prone software modules. The goal of software analytics is to help software engineers prioritize their software testing effort on the riskiest modules and understand the past pitfalls that led to defective code. While the adoption of software analytics enables software organizations to distil actionable insights, there are still many barriers to the broad and successful adoption of such analytics systems. Indeed, even if software organizations have access to invaluable software artifacts and toolkits for data analytics, researchers and practitioners often have little guidance on how to properly develop analytics systems. Thus, the accuracy of the predictions and the insights that are derived from analytics systems remains one of the most important challenges of data science in software engineering.
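To make this concrete, the following is a minimal sketch of such a defect classifier, assuming a scikit-learn-style workflow; the dataset file, metric columns, and choice of learner are illustrative assumptions rather than the setup studied in this work:

```python
# Minimal, illustrative defect classifier: predict whether a module is
# defect-prone from static code metrics. Dataset and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("defect_dataset.csv")   # one row per module (hypothetical file)
X = data.drop(columns=["bug"])             # static code metrics as features
y = data["bug"]                            # binary label: post-release defect or not

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)                # learn to flag defect-prone modules

pred = model.predict(X_test)
print("precision:", precision_score(y_test, pred))
print("recall:   ", recall_score(y_test, pred))
```

A model of this kind is what the rest of this work probes: both its predictions and their interpretation depend on the experimental components used to build and evaluate it.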
In this work, we conduct a series of empirical investigations to better understand the impact of experimental components (i.e., class mislabelling, parameter optimization of classification techniques, and model validation techniques) on the performance and interpretation of software analytics. To accelerate the large number of compute-intensive experiments, we leverage the High Performance Computing (HPC) resources of the Centre for Advanced Computing (CAC) at Queen's University, Canada. Through case studies of systems that span both proprietary and open-source domains, we demonstrate that (1) realistic noise does not impact the precision of software analytics; (2) automated parameter optimization for classification techniques substantially improves the performance and stability of software analytics; and (3) the out-of-sample bootstrap validation technique produces a good balance between the bias and variance of performance estimates. Our results lead us to conclude that the experimental components of analytics modelling impact the predictions and associated insights that are derived from software analytics. Empirical investigations into the impact of overlooked experimental components are needed to derive practical guidelines for analytics modelling.
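As an illustration of one of these experimental components, the out-of-sample bootstrap trains a model on a sample drawn with replacement and evaluates it on the modules that were not drawn. Below is a minimal sketch, assuming NumPy arrays and scikit-learn, with the learner, performance measure, and number of repetitions chosen purely for illustration:

```python
# Out-of-sample bootstrap validation (sketch): train on a bootstrap sample,
# evaluate on the observations that were never drawn. The learner and metric
# are illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def out_of_sample_bootstrap(X, y, n_repeats=100, seed=0):
    rng = np.random.RandomState(seed)
    n = len(y)
    scores = []
    for _ in range(n_repeats):
        train_idx = rng.choice(n, size=n, replace=True)       # sample with replacement
        test_idx = np.setdiff1d(np.arange(n), train_idx)      # held-out ("out-of-sample") rows
        model = RandomForestClassifier(n_estimators=100, random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        prob = model.predict_proba(X[test_idx])[:, 1]
        scores.append(roc_auc_score(y[test_idx], prob))
    # The spread of the repeated estimates gives a handle on their variance.
    return float(np.mean(scores)), float(np.std(scores))
```

Repeating the train/test split in this way yields a distribution of performance estimates rather than a single number, which is what allows the bias and variance of different model validation techniques to be compared.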