Virtual Storage Failure Prediction Model Using Supervised Machine Learning
NQ44475
Dr. K. Kishore Anthuvan Sahayaraj, Dr. R. Chithra, Sankar R, Satishkumar J, Dr.Vilas Ramrao Joshi, Dr.R.Thiagarajan/ Virtual Storage Failure
Prediction model using supervised machine learning
Abstract
Hard disk failure arises from malfunctions in the operation of a computer and from external factors such as natural calamities and electrical disturbances. Malware infection can corrupt, distort or disrupt the data on a computer's hard drive. When a hard disk fails, the PC can stop functioning, possibly while producing noise, and a physical failure causes the system to halt its processes. As data centers grow in number, hard disk capacity plays a vital role. Continuous monitoring of the computer can reduce the lack of control over security aspects. In this paper, artificial intelligence with machine learning is used to predict failure from the features of the disk. The relevant hard disk features are detected by evaluating the attributes with statistical tests of compatibility. SVM, random forest and naïve Bayes classifiers are applied to the test parameters and compared by accuracy. Predicting failure improves the reliability of the storage system, and different techniques are used to detect the warning signs of failure early. Files infected by viruses or malware can destroy data on the hard disk and cause data loss. A hard drive crash can corrupt the boot process. Firmware corruption and ransomware can make the data unreadable, which leads to data loss. An electronic failure or power surge can halt the system. Internal failures such as file corruption and overheating, and external factors such as human error, can permanently damage the storage medium. AI is used to predict hard drive failure with high precision. An external hard drive has to be monitored on the basis of its statistical failure report. In some data centers, HDDs overheat because of high energy consumption and operation at maximum load. To avoid such disruption, failure is analyzed using AI with machine learning, and comparative results are presented for predicting HDD failure using AI and ML methodologies.
DOI Number: 10.14704/nq.2022.20.9.NQ44475 Neuro Quantology 2022; 20(9):4146-4154
Neuro Quantology | September 2022 | Volume 20 | Issue 9 | Page 4146-4154 | doi: 10.14704/nq.2022.20.9.NQ44475
data are monitored using the electro-mechanical defect mechanism. Machine learning achieves a high level of accuracy in predicting disk failures by building a forward-looking model.
Literature Review
A hard drive stores data on electro-magnetic media from which the data can be retrieved for further processing. The drive has rigid platters coated with magnetic material, and actuators move over the platters to read and write the data. Data can be read and written sequentially, they can also be accessed randomly, and the storage is non-volatile. Hard drives are used to store vast amounts of data with good speed and performance. The file system is made more robust through error correction based on redundancy and recovery [4]. In a modern hard drive, data are handled in blocks, and the block size is given in the manufacturer's product specification. Low-level drive commands are used to manage the data. To reduce access delay, defragmentation is used, which improves access speed. Hard disk storage is a type of non-volatile memory installed internally in computers and in data centers. Hard disks connect to the motherboard through a SATA or ATA cable and draw power from the power supply unit (PSU). The data stored on the hard disk may include confidential files, which are retrieved later when needed. The computer interacts with the operating system to obtain the stored data, checking the storage medium on which the OS is installed. Some drives also require software to be installed before files can be transferred to the computer. A hard disk stores data permanently on the storage medium. Read and write requests are interpreted by the disk controller, which instructs the actuators. Hard drives are of two types, internal and external. Most hard disk drives are internal, but some are stand-alone devices such as external hard drives, which are used to back up data for future use. An external drive provides an interface to the computer, is portable and offers additional space to store data. Hard disks can generally store up to terabytes of data. They are made up of circular plates coated with magnetic material, and the data are stored on the surface in concentric tracks. Data are written by magnetizing tiny spots on the spinning platter and read by detecting them. A hard drive uses read-only memory to hold the instructions for reading and writing the disks. The platters are stacked, and the tracks are arranged according to their distance from the centre. The hard disk drive is the main storage of the computer and holds the data of the operating system and the applications. Hard disk drives are used in many applications to store data at high speed with reduced complexity. The position of the spindle inside the drive chamber determines the state and direction of the read and write motors. External hard drives expand the storage capacity of the device to which they are connected. They are generally slower than internal hard drives because of their lower transfer rates. The drive capacity increases when it is used internally.
A hard disk drive can harbor malware that damages computer code and software through a self-replicating virus. Such a virus replicates itself by corrupting files in the memory of the computer. Identifying hard disk malfunction with artificial intelligence is a smart way to reduce this complexity. The virus can infect system software and applications by posing as a user, and it infects the computer without the user's knowledge [4]. The ability to replicate itself is one of its major characteristics. Malware on the hard disk can damage or corrupt the data on the disk and can steal confidential private information while infecting files. A boot sector virus is a cyber attack that can damage the system's confidentiality and privacy measures. It exploits the victim's actions to start the malicious activity and can inject data by infecting software programs and applications. Some boot viruses reside within a program on the victim's machine. A virus can replicate and infect the victim's system and its applications, and the malware can take over control of the entire host system. Hard drive malware can infect system memory, reducing the payload and increasing traffic consumption [5]. These viruses can easily stop processes by temporarily disabling the hard drive. A boot sector virus can damage the hard drive by infecting the files and their information. Once stored, the virus infects the operating system by controlling the boot loading information. These viruses can reformat the disk, cleaning up or removing data from files without notifying the user. Traditional approaches do not provide a proper way to analyze data privacy and confidentiality [6-10]. Boot sector viruses occur mainly when external drives are used, and they can potentially damage the computer system.
Figure 1: Hard Disk Drive
Proposed System
A hard disk can suffer a physical failure caused by physical damage, in which case the stored data are damaged and cannot be recovered properly. A logical failure is different: the data can be recovered using software. A logical failure can be analyzed by detecting noise and degraded performance. Hard drive failure also occurs because of malfunctioning or corrupted drive data. A data center stores the data it processes on hard disk drives, and system peripherals are connected to the hard drive to store vast amounts of data. The smart AI technology validates the data within the HDD. The HDD can be monitored using Self-Monitoring, Analysis and Reporting Technology (SMART), which checks and monitors various attributes of the hard disk drive. Artificial intelligence can then assess drive reliability and determine the error rate from these attributes. Using the supervised learning approach, different attributes are compared to analyze the accuracy of the results. The dataset is used for training, and features are extracted from it. The extracted data are preprocessed, used for training and validated to obtain accurate results. The resulting data are classified according to the classes learned from the training data.
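The preprocess, train and validate flow described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the attribute names, the invented records, the deterministic hold-out split, and the one-level decision tree (a decision stump) that stands in for the classifiers named in the paper. It is not the authors' implementation.

```python
# Hypothetical SMART-style records: (reallocated_sectors, pending_sectors, failed).
# All values are invented for illustration; None marks a missing reading.
RECORDS = [
    (0, 0, 0), (2, 1, 0), (1, 0, 0), (3, 0, 0), (None, 2, 0), (4, 1, 0),
    (55, 9, 1), (70, 12, 1), (48, 7, 1), (90, 20, 1), (5, 2, 0), (62, 8, 1),
    (0, 1, 0),
]


def preprocess(records):
    """Drop records with missing attributes and split features from labels."""
    clean = [r for r in records if None not in r]
    return [(list(r[:-1]), r[-1]) for r in clean]


def split(samples, every=4):
    """Deterministic hold-out: every `every`-th sample is kept for validation."""
    train = [s for i, s in enumerate(samples) if i % every != 0]
    held_out = [s for i, s in enumerate(samples) if i % every == 0]
    return train, held_out


def train_stump(train):
    """Fit a one-level decision tree: the (feature, threshold) pair with the
    fewest training errors, trying midpoints between sorted attribute values."""
    best = None
    for f in range(len(train[0][0])):
        values = sorted({x[f] for x, _ in train})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            errors = sum((x[f] > t) != bool(y) for x, y in train)
            if best is None or errors < best[0]:
                best = (errors, f, t)
    return best[1], best[2]


def predict(stump, features):
    """Return 1 (predicted failure) if the chosen attribute exceeds the threshold."""
    feature, threshold = stump
    return int(features[feature] > threshold)


def accuracy(stump, samples):
    """Fraction of samples whose predicted label matches the recorded label."""
    return sum(predict(stump, x) == y for x, y in samples) / len(samples)


samples = preprocess(RECORDS)
train, held_out = split(samples)
stump = train_stump(train)
print("stump (feature index, threshold):", stump)
print("validation accuracy:", accuracy(stump, held_out))
```

The held-out samples play the role of the validation step: they are never seen during training, so the reported accuracy reflects behavior on new drives rather than memorized ones. On real SMART data the split would normally be randomized and repeated.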
according to their futuristic type. The correlated datasets are classified by their characteristic features. After training, functional testing is carried out to verify the model: a set of test inputs is applied and the outcomes are checked. The testing step evaluates the performance metrics on the observed results and measures how well the model generalizes to new data. It also checks for overfitting, in which a complex model memorizes the training set instead of learning to predict the response variable. Comparing training and test results identifies the prediction error, expressed as a percentage error, and reveals high-variance error. Labeled data are obtained from the raw, unlabeled data by tagging sampled records. Labeling turns the raw data into a meaningful set of labels that describe the content, and it may combine two or more samples under one tag. Test data are generated systematically by a data generation tool and are selected with regard to quality and cost. Using the equivalence class model, the data are validated against a particular set of reports, which are derived sequentially from the test results.
Data Validation
To increase the accuracy and performance obtained from the source data, the data are validated before further processing. Validation is carried out according to constraints such as the objectives of the analysis. It is part of the data cleansing process, which ensures that the data are fit for use and consistent across the application. Validation is implemented with different methods that enforce data integrity. It checks for errors and misspellings in the data, corrects them, and thereby supports the accuracy of the system. In cross-validation, performance is estimated statistically: in each iteration, the available data are split into smaller fractions.
Supervised learning approach
Among machine learning algorithms, the supervised approach trains a model on labeled data and values so that it delivers the required predictions. This is predictive modeling: a model is built from certain properties of the dataset and is then used to derive predictions. The support vector machine is a classification method that separates the samples into two or more classes with hyperplanes, chosen so that the distance between the hyperplane and the nearest points is maximal. Decision tree classification is a tree-structured approach that uses conditions on the data to represent particular sets of events, and a class label is assigned to each sample according to its features. The random forest classifier combines the predictions of multiple decision trees, each built with random sampling. Predictions are evaluated with a confusion matrix, which lists the number of samples in each predicted and actual class. Accuracy is the number of correct classifications divided by the total number of samples. Specificity and precision are related measures that are calculated differently. The measures are computed from the counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) as follows:
\[ \text{True Positive Rate (TPR)} = \frac{TP}{TP + FN} \]
\[ \text{False Positive Rate (FPR)} = \frac{FP}{TN + FP} \]
\[ \text{Precision} = \frac{TP}{TP + FP} \]
These equations are evaluated on the test sets to validate performance. Because a model can overfit, it is validated on held-out data to avoid potential bias. The values are normalized, and cross-validation is used to validate the performance. Sensitivity is the probability of correctly classifying a positive case, i.e., the true positive rate. Specificity is the probability of correctly classifying a negative case, i.e., the true negative rate.
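The rate definitions above can be checked with a short Python sketch. The label lists are invented for illustration (1 = failed drive, 0 = healthy drive), and the function names are hypothetical. The F-measure is included because it appears on the results chart; it is computed here with the standard harmonic-mean formula, which the paper does not state explicitly.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN for binary labels (1 = failed, 0 = healthy)."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 0:
            tn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a class is absent from the data."""
    return numerator / denominator if denominator else 0.0


def evaluate(tp, tn, fp, fn):
    """Compute the measures used in the text from the four confusion counts."""
    tpr = safe_div(tp, tp + fn)          # sensitivity, also called recall
    precision = safe_div(tp, tp + fp)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "true_positive_rate": tpr,
        "false_positive_rate": safe_div(fp, tn + fp),
        "specificity": safe_div(tn, tn + fp),
        "precision": precision,
        "f_measure": safe_div(2 * precision * tpr, precision + tpr),
    }


# Invented example: eight drives, four of which actually failed.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]

counts = confusion_counts(actual, predicted)
scores = evaluate(*counts)
print("TP, TN, FP, FN:", counts)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Note that specificity and the false positive rate share the denominator TN + FP, so they always sum to one when at least one healthy drive is present.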
[Figure: bar chart of Accuracy, Precision, Recall and F-measure for the compared algorithms; vertical axis from 0 to 1.2]
The above graph illustrates the cross-validation of the different algorithms used to identify whether a hard disk drive is good or bad. Based on the criteria, each HDD is classified as a good drive or a bad drive. From the deployed drives, the user can learn whether it is time to replace a drive. After the comparative analysis, a drive can be replaced in advance, because its condition is indicated beforehand. Using the smart monitoring methodology, i.e., self-monitoring, analysis and reporting technology in the hard disk drive, the following algorithm steps are carried out. The machine learning algorithm uses a threshold range D, which indicates the ratio of the total samples, and multiple decision trees (Trees); Wh indicates the size.
(1) Initiate the sample using the sampling algorithm.
(2) After the sampling process, use each tree T.
(3) Calculate the classification using the tree classification and sample S.
(4) Check whether the abnormal samples are comparatively greater.
(5) Check whether the HDD is good or has failed as a defective piece.
(6) Generate the predicted results.
(7) End the process.
Self-monitoring of the hard disk makes it possible to analyze the failure, and it is considered an early way of identifying the failure. It allows the user to back up the data automatically once a problem is detected. As described in the proposed system, the monitored drive attributes are then used to check drive reliability and determine the error rate.
Conclusion
Predicting failure can improve the reliability of the storage system. With artificial intelligence, malfunction in the HDD is detected from the various reported attributes, along with viruses. The hard disk features are detected by evaluating the attributes with statistical tests of compatibility. Continuous monitoring of the computer can reduce the lack of control over security aspects and the time complexity. In the proposed work, artificial intelligence with machine learning is used to predict failure from the features of the disk, and AI is used to predict hard drive failure with high precision. Failure in the hard disk can result from improper handling of the system. To characterize the data values with enhanced accuracy, supervised and unsupervised learning techniques are followed. Hard disk drive failure reduces speed and increases runtime complexity and the threshold range. Self-monitoring of the hard disk makes it possible to analyze the failure and is considered an early way of identifying it.
References
[1] Chhetri, T. R., Kurteva, A., Adigun, J. G., & Fensel, A. (2022). Knowledge Graph Based Hard Drive Failure Prediction. Sensors, 22(3), 985.
[2] Yu, J. (2019, November). Hard disk drive failure prediction challenges in machine learning for multi-variate time series. In Proceedings of the 2019 3rd International Conference on Advances in Image Processing (pp. 144-148).
[3] Mudiyanselage, A. R. (2020). Data Engineering and Failure Prediction for Hard Drive SMART Data. Bowling Green State University.
[4] Pinciroli, R., Yang, L., Alter, J., & Smirni, E. (2020). The life and death of SSDs and HDDs: Similarities, differences, and prediction models. arXiv preprint arXiv:2012.12373.
[5] Li, W. (2020). Hard Disk Drive Failure Detection with Recurrence Quantification Analysis (Doctoral dissertation, Northeastern University).
[6] Li, J., Stones, R. J., Wang, G., Liu, X., Li, Z., & Xu, M. (2017). Hard drive failure prediction using decision trees. Reliability Engineering & System Safety, 164, 55-65.
[7] Yang, W., Hu, D., Liu, Y., Wang, S., & Jiang, T. (2015, September). Hard drive failure prediction using big data. In 2015 IEEE 34th Symposium on Reliable Distributed Systems Workshop (SRDSW) (pp. 13-18). IEEE.
[8] Yang, Q., Jia, X., Li, X., Feng, J., Li, W., & Lee, J. (2020). Evaluating feature selection and anomaly detection methods of hard drive failure prediction. IEEE Transactions on Reliability, 70(2), 749-760.
[9] Huang, X. (2017). Hard drive failure prediction for large scale storage system (Doctoral dissertation, UCLA).
[10] Chaves, I. C., De Paula, M. R. P., Leite, L. G., Gomes, J. P. P., & Machado, J. C. (2018, July). Hard disk drive failure prediction method based on a Bayesian network. In 2018 International Joint Conference on Neural Networks (IJCNN) (pp. 1-7). IEEE.