
Enhancing Software Reliability Prediction Using ELM and SVM: A Study on Historical Failure Data and Datatypes

Mehakpreet Kaur1, Mridul2, Shreya Sharma3
Department of Computer Science Engineering, Chandigarh University, Mohali, Chandigarh
[email protected], [email protected], [email protected]

Abstract—Predicting software defects is crucial for ensuring the reliability of new software systems. This study introduces an approach to software defect prediction focused on two key questions: the value of historical failure data and the importance of specific types of failure data. The study also explores the balance between prediction accuracy and timeliness in models based on Support Vector Machines (SVM) and Extreme Learning Machines (ELM). Experimental results show that ELM outperforms SVM, providing better metrics such as specificity, regression, accuracy, and F1 score. Additionally, the research introduces a feature selection framework using ELM and SVM. The NASA Metrics dataset is used for testing, and resampling techniques are employed to address dataset imbalance and improve prediction accuracy before feature selection. Overall, this study presents a concise yet comprehensive approach to software reliability prediction, enhancing the field's capabilities and contributing to improved software quality assurance practices.

Index Terms—Extreme learning machine, defect prediction model, software fault prognosis, software quality, support vector machine.

Introduction

Deep learning, the most popular learning technology, has been extensively studied in the scope of natural language processing (NLP), machine learning, and data analytics. Convolutional Neural Networks (CNNs) and stacked autoencoders are primarily used in deep learning for identifying documents and facial features [1,2]. Many learning problems, including image classification in computer vision, have seen deep convolutional networks grow in popularity, as their convolution and fully connected layers have powerful deep representation capabilities and deliver state-of-the-art performance on benchmarks such as ImageNet, PROMISE, and Pascal. People have developed new approaches to face recognition using neural networks with various topologies [3,4,5,6], and convolutional neural networks performed especially well in these trials at deep representation of broad regions. How we apply deep learning will reveal its true benefits: (1) convolutional neural networks employ feature representation to integrate feature extraction, so models can be learned without additional high- or low-level feature descriptors; (2) convolutional neural networks can learn millions of items simultaneously because of their customizable network topology; (3) parametric studies are possible, since the network topology is scalable and thousands of parameters can be learned. Therefore, deep learning based on neural networks may become the most powerful technique. Support vector machines, by contrast, map instances into a feature space and separate them with a linear decision boundary that maximizes the margins and assigns the instances to distinct classes. In 2006, the extreme learning machine was introduced as a single-hidden-layer feedforward network (SLFN) [7,8]; regularization is not used in this formulation. ELMs are similar to ANNs in that the biases and weights of the first layer are initialized randomly or arbitrarily and held constant, while the weights (and possibly biases) of the second layer are chosen by minimizing the least-squares error. Interpolation and global prediction are two important features of feedforward neural networks [9,10]. ELM's interpolation capability has been clearly demonstrated in [11]: any N randomly chosen samples can be exactly interpolated, which is a significant theoretical contribution. Software systems get harder and take longer to maintain as more defects or weaknesses are added. As a result, techniques for quickly identifying software problems have been developed, and cost-cutting measures for software development must be devised. The majority of recent research uses AI techniques to develop algorithms for failure prediction, and many classification methods have been used to build prediction models, among them combinatorial techniques such as Naive Bayes, random forests, support vector machines, logistic regression, nearest neighbours, and neural networks.

Machine learning now reaches many domains, banking among them, and it is unlikely that most people will be immune to its effects. The feature selection process selects a subset of good or desirable features, with the aim of obtaining the best feature subset under the test model. Finding the best feature subset in high-dimensional data is difficult: many similar problems have been shown to
be NP-hard, and the number of candidate feature subsets grows with every added feature. There are four main processes in feature selection: subset generation, subset evaluation, a stopping criterion, and validation. Subset search is a method that uses a special search strategy [9] in which newly generated subsets are compared against the best feature subset found so far using the evaluation measure. If the new feature subset is superior to the previous one, it takes its place. This cycle repeats until the preset stopping criterion is reached; once the search stops, the best-performing feature subset is identified. Validation can be done on simulated or real data [10].

Support vector machines (SVMs) and active learning methods (ALMs) can be used in a variety of ways to forecast software problems. The margin is defined as the distance between the hyperplane and the nearest data points from each class. By training a model on a dataset of previous software projects, SVMs can be used to forecast software difficulties. The dataset should include information about each project, such as the number of lines of code, the programming language used, the development method, and the types of software difficulties encountered. The SVM model can then be used to predict the occurrence of software faults in new software projects. ALMs are machine learning algorithms that can train models with limited data: they operate by picking the most informative data points on which to train the model, which makes training more efficient and effective. By picking the most informative software projects from the historical dataset, ALMs can be used to select the projects on which an SVM is trained to forecast software problems; the SVM model is then trained on the selected projects. This approach can train SVMs with minimal data, which is important when historical data is limited.

Missing data occurs when no information is provided for one or more fields, or for a complete record, and it is a serious concern in real-world situations. In pandas, missing data is represented as NA (Not Available) values. Many datasets contain missing data, either because a value exists but was not collected or because it never existed. For example, some surveyed users may choose not to reveal their income while others decline to share their address; as a result, many records are incomplete.

Literature Review

We conducted comprehensive searches across various electronic databases, including the ACM Digital Library, IEEE Xplore, ScienceDirect, EI Compendex, Web of Science, Google Scholar, and the online bibliographic database known as BEST Web. We intentionally excluded DBLP and the Computer Science Bibliography Collection, as these sources were infrequently cited in the literature we selected.

In our research, Song et al. [12] played a pivotal role in shaping our defect prediction system. Shivaji et al. [13-14] explored filter- and wrapper-based feature selection algorithms, with particular emphasis on assessing error probability; their experiments demonstrated that incorporating definitions could enhance prediction performance while maintaining a 10% improvement. Wald et al., part of a collaborative research group, conducted an extensive evaluation of various filter-based feature selection strategies in the context of large-scale communication systems; their findings consistently indicated that the Kolmogorov-Smirnov method outperformed other techniques. Gao et al. [29] delved into a hybrid feature selection framework, skillfully combining seven filter-based methods and three subset selection methods. Their research consistently revealed that the removal of certain features did not have an adverse impact on predictive capability.

Chen et al. [8] approached feature selection as a multi-objective optimization challenge, aiming to reduce the number of selected features while simultaneously enhancing fault prediction accuracy. When they compared their approach against three wrapper-based feature selection strategies on several projects from the PROMISE dataset, their method outperformed all of them; however, while effective, it may be less efficient than many wrapper-based methods.

Cathal et al. [5] conducted a comprehensive review investigating how feature selection methods, capability categories, and dataset size impact fault identification; they meticulously gathered the relevant elements before constructing characterization models to better understand the effects of different feature selection strategies. Vandecruys et al. [14] focused on predicting software issues through software mining, leveraging the data mining tool AntMiner; their preprocessing involved oversampling, discretization, and input selection, and they evaluated their model against models built with C4.5, logistic regression, and SVM. Czibula et al. utilized relational rule analysis, a grouping approach, for defect prediction after preprocessing to eliminate unnecessary indicators.

Mahavirawat et al. introduced a novel approach that achieved an impressive 90% accuracy rate in detecting defects in object-oriented software environments. Gondra et al. employed artificial neural networks (ANN) and support vector machines (SVM) to assess the significance of software metrics in predicting failures; their comparison revealed that SVM correctly identified software issues 87.4% of the time, while ANN achieved a correct identification rate of 72.61%. Menzies et al. utilized Naive Bayes (NB) and method-level metrics to forecast software problems in the PROMISE repository dataset, achieving a model recall rate of 55%. Heeswijk et al. [9] explored adaptive ensemble models of extreme learning machines for one-step-ahead prediction on nonstationary time series; their experimental studies demonstrated the adaptability and acceptable test performance of these ensemble models. Rong et al. presented a pruned extreme learning machine (ELM) as a scheduling method for a streamlined and automated network of ELM classifiers; their approach initiates a large network and subsequently eliminates redundant hidden nodes using statistical measures such as chi-square and data mining techniques.

IMPLEMENTATION AND METHODOLOGY

Fig-2. Detection of Defects w.r.t frequency of Dataset

The PROMISE library provides access to the NASA Metric Data Program Information Index. The benchmark data contains 10,885 samples, each with 22 features, as shown in figure 2. The faulty NASA data set [39] is used to predict software problems. Halstead's data processing program has a source code extractor, while McCabe's flight program furnishes information for satellites that orbit the Earth.
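Before any resampling, the class distribution behind a defect-frequency plot like figure 2 can be inspected directly. The sketch below is illustrative only: it uses small synthetic labels in place of the real benchmark (which has 10,885 samples), and the counts are assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical defect labels: 1 = defective module, 0 = non-defective.
# Synthetic stand-in for a benchmark such as the NASA/PROMISE data.
labels = np.array([0] * 900 + [1] * 100)

# Frequency of each class, as a defect-vs-frequency figure would show.
classes, counts = np.unique(labels, return_counts=True)
imbalance_ratio = counts.max() / counts.min()

print(dict(zip(classes.tolist(), counts.tolist())))  # {0: 900, 1: 100}
print(imbalance_ratio)  # 9.0
```

An imbalance ratio well above 1 is the signal that the resampling step described in section 3.3 is needed before training.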
Fig-1. Methodology for SVM and ELM

To forecast software reliability concerns, ELM and SVM classifiers were put into action as shown in figure 1.

3.1 Data Collection: We used the models to calculate accuracy along with runtime. The process of calculating, acquiring, and assessing the appropriate data for a study using widely accepted techniques is known as information gathering. A specialist can examine their hypothesis based on the real components accumulated; whatever the hypothesis, data collection is usually the first and most significant stage in the study cycle. Different approaches to data collection are employed in various investigations, depending on the information desired. The PROMISE library provides access to the NASA Metrics Data Program informative index. This informational index has 10,885 examples, each with 22 attributes. A faulty NASA dataset [38] was used to forecast software problems. Halstead's data processing program comprised source code extractors, whereas McCabe's flight program provided data.

3.2 Preparation of data: Data preparation encompasses the procedures of refining and adapting raw data before its analysis and utilization. Typically, a crucial initial step in this process involves reformatting data, making necessary corrections, and merging data elements to enhance its quality. Preparing data may pose challenges for data professionals or business users, yet it is essential for contextualizing data, transforming it into actionable information, and mitigating biases stemming from poor data quality. After collecting data, it is imperative to ready it for subsequent stages; in the context of machine learning, the actions involved in data assembly and organization are commonly referred to as "data preparation." Once all data is integrated, it is restructured, and a comprehensive assessment is conducted to identify any missing information. Cross-validation techniques are employed to assess the effectiveness of this process.

3.3 Methodology for resampling and selecting features: A dataset is regarded as imbalanced when there is a significant disparity in class distribution, such as a ratio of 1:1000 or 1:10,000 between the minority class and the majority class. This bias in the training data can cause some machine learning algorithms to completely ignore minority classes, which degrades various machine learning approaches; this is a concern because forecasts often hinge on the minority class. One way to deal with class imbalance is to randomly resample the training data set. One of the key components of feature development is selecting the distinctive features to provide to the learning algorithm. There are two main ways to resample a dataset: oversampling and undersampling. Oversampling creates new data points for the minority class, by duplicating existing data points, creating synthetic data points, or a combination of both. Undersampling removes data points from the majority class, by randomly deleting data points, using a clustering algorithm to identify and remove outliers, or a combination of both. The best resampling technique depends on the specific dataset and the desired outcome: if the dataset is very imbalanced, oversampling may be necessary to ensure that the minority class is adequately represented, while if the dataset is small, undersampling may be necessary to avoid overfitting.
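The random oversampling and undersampling described above can be sketched in a few lines. This is a minimal illustration on synthetic data, assuming plain random duplication and deletion; it is not the exact procedure used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic imbalanced dataset: 950 majority (label 0), 50 minority (label 1),
# with 22 features per sample, mirroring the benchmark's feature count.
X = rng.normal(size=(1000, 22))
y = np.array([0] * 950 + [1] * 50)

def random_oversample(X, y, rng):
    """Duplicate randomly chosen minority-class rows until classes balance."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[counts.argmin()]
    deficit = counts.max() - counts.min()
    extra = rng.choice(np.flatnonzero(y == minority), size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

def random_undersample(X, y, rng):
    """Randomly drop majority-class rows until classes balance."""
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[counts.argmax()]
    keep_maj = rng.choice(np.flatnonzero(y == majority), size=counts.min(),
                          replace=False)
    keep = np.concatenate([keep_maj, np.flatnonzero(y != majority)])
    return X[keep], y[keep]

X_over, y_over = random_oversample(X, y, rng)
X_under, y_under = random_undersample(X, y, rng)
print(np.bincount(y_over))   # [950 950]
print(np.bincount(y_under))  # [50 50]
```

As noted above, the choice between the two is a trade-off: duplication risks overfitting to repeated minority rows, while deletion discards majority-class information.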


The ability of a test to distinguish between healthy and sick cases raises the question of its accuracy. Before we can calculate the percentage of true positive (+ve) and true negative (-ve) cases among the studied cases, we must first confirm the validity of the test results. The confusion matrix's four parameters, FP, TP, TN, and FN, are used to calculate accuracy. True positives (TP) are the positive cases that were correctly identified, whereas false positives (FP) are the negative cases incorrectly flagged as positive. True negatives (TN) are the correctly classified negative cases, and false negatives (FN) are the positive cases incorrectly classified as negative.
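The evaluation metrics reported in this study follow directly from these four counts. A minimal sketch, with illustrative counts rather than the paper's actual results:

```python
# Accuracy, specificity, and F1 computed from the four confusion-matrix
# counts. The counts below are illustrative assumptions, not paper results.
TP, FP, TN, FN = 40, 10, 930, 20

accuracy    = (TP + TN) / (TP + TN + FP + FN)
specificity = TN / (TN + FP)          # true-negative rate
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)          # sensitivity / true-positive rate
f1          = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(specificity, 3), round(f1, 3))  # 0.97 0.989 0.727
```

Note that on imbalanced defect data the accuracy is dominated by the majority (non-defective) class, which is why specificity and F1 are reported alongside it.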

Fig-3. Occurrence of defect w.r.t frequency in the dataset

3.4 SVM Classifier Training: The key feature that sets this classification algorithm apart from its predecessors is its unique approach to defining the decision boundary: it aims to maximize the separation between the classes' closest data points. SVMs, or Support Vector Machines, achieve this by identifying the hyperplane with the greatest margin as the decision boundary. A linear SVM classifier constructs a straight line between two classes, categorizing the data points on one side into one class and those on the other side into the other.
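While an SVM is usually taken off the shelf (e.g., scikit-learn's SVC), its ELM counterpart in this comparison is simple enough to sketch directly: a random, fixed hidden layer followed by a least-squares output layer, which is the standard ELM formulation. The data, dimensions, and hyperparameters below are illustrative assumptions, not the study's actual setup.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy binary classification data standing in for the 22 software metrics.
X = rng.normal(size=(400, 22))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)  # synthetic defect labels

def elm_train(X, y, hidden=64):
    """Random fixed first layer; output weights solved by least squares."""
    W = rng.normal(size=(X.shape[1], hidden))    # input weights, never updated
    b = rng.normal(size=hidden)                  # hidden biases, never updated
    H = np.tanh(X @ W + b)                       # hidden-layer activations
    beta, *_ = np.linalg.lstsq(H, y, rcond=None) # least-squares output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return (np.tanh(X @ W + b) @ beta > 0.5).astype(float)

W, b, beta = elm_train(X[:300], y[:300])
pred = elm_predict(X[300:], W, b, beta)
accuracy = (pred == y[300:]).mean()
print(accuracy)
```

Because only the output weights are solved, in closed form, training is a single linear solve; this is the source of the ELM speed advantage the study reports over iterative SVM training.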

Fig-4. Finding of defects in the dataset using K-nearest Algorithm

3.5 ELM Classifier Training: The hidden-layer parameters can be assigned in various ways while remaining consistent with the above technique. This is because the solution is straightforward and does not necessitate iteration: the input weights are fixed and constant, so the solution for a specific linear output layer can be computed rapidly and easily. When the weight range of the solution is limited, symmetrical input sources produce a greater volume of the solution space.

RESULTS

Fig-5. Result w.r.t SVM and ELM classifier

Conclusion

This study compares and examines two classification algorithms on both accuracy and runtime. The NASA dataset is used in this study to train the SVM and ELM classifiers. The best accuracy was provided by the ELM classifier model, which is also more efficient than SVM. Both SVM and ELM are sophisticated machine learning algorithms that can be used to predict observations. Overfitting occurs when a model learns the training data too well and performs badly on new data; validation is a strategy used for preventing overfitting.

References

1. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
2. Lawrence, S.; Giles, C.L.; Ah Chung, T.; Back, A.D. Face recognition: A convolutional neural-network approach. IEEE Trans. Neural Netw. 1997, 8, 98–113.
3. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507.
4. Yang, F.-J. An Implementation of Naive Bayes Classifier. In Proceedings of the 2018 International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, USA, 12–14 December 2018; pp. 301–306.
5. Boetticher, G.; Menzies, T.; Ostrand, T. PROMISE
Repository of Empirical Software Engineering Data; West
Virginia University, Department of Computer Science:
Morgantown, WV, USA, 2007.
6. Rath, S.K.; Sahu, M.; Das, S.P.; Mohapatra, S.K. Hybrid
Software Reliability Prediction Model Using Feature
Selection and Support Vector Classifier. In Proceedings of
the 2022 International Conference on Emerging Smart
Computing and Informatics (ESCI), Pune, India, 9–11
March 2022; pp. 1–4.
7. Rong, H.-J.; Ong, Y.-S.; Tan, A.-H.; Zhu, Z. A fast pruned extreme learning machine for classification problem. Neurocomputing 2008, 72, 359–366.
8. Dash, M.; Liu, H. Feature Selection for
Classification; Intelligent Data Analysis; Elsevier:
Amsterdam, The Netherlands, 1997; pp. 131–156.
9. Menzies, T.; Greenwald, J.; Frank, A. Data mining
static code attributes to learn defect predictors. IEEE
Trans. Softw. Eng. 2007, 33, 2–13.
10. Mahaweerawat, A.; Sophatsathit, P.; Lursinsap, C.; Musilek, P. MASP-An enhanced model of fault type identification in object-oriented software engineering. J. Adv. Comput. Intell. Intell. Inform. 2006, 10, 312–322.
11. Ritu, R. Concepts for Energy Management in the Evolution of Smart Grids. In Proceedings of the International Conference on Hybrid Intelligent Systems, December 2022; pp. 917–928; Springer Nature Switzerland.
12. Bhambri, P. A CAD System for Software Effort Estimation. In Proceedings of the 2022 2nd International Conference on Technological Advancements in Computational Sciences (ICTACS), October 2022; pp. 140–146; IEEE.
13. Ritu; Bhambri, P. Software Effort Estimation with Machine Learning: A Systematic Literature Review. In Agile Software Development: Trends, Challenges and Applications; 2023; pp. 291–308.
14. Bieniek, A.; Moga, A. An efficient watershed algorithm based on connected components. Pattern Recogn. 2000, 33, 907–916.
