ICT515A001
Md Rakibul Islam
Student ID: 34795345
Table of Contents
Abstract
1. Introduction
2. Literature Review of Machine Learning Algorithms
2.1 K-Nearest Neighbors (KNN)
2.2 K-Means Clustering
2.3 Support Vector Machine (SVM)
3. Comparative Analysis
4. Conclusion
Bibliography
Abstract
Detecting HDD failures before they happen is crucial to avoid losing data and maintain
operational stability. This article examines how three machine learning techniques—K-Nearest
Neighbors (KNN), K-Means Clustering, and Support Vector Machine (SVM)—are used to
predict hard disk drive (HDD) failures based on SMART attributes. We examine how KNN classifies
failure instances by their proximity to similar data points, how K-Means Clustering identifies
anomalies through unsupervised learning, and how SVM constructs hyperplanes that separate
failure from non-failure data in high-dimensional spaces. By examining
previous research, we emphasise the benefits and drawbacks of individual methods and
determine that hybrid strategies, which incorporate these algorithms, could result in better
outcomes for detecting HDD failures early.
1. Introduction
As high-performance computing (HPC) and cloud systems expand, hardware failures, particularly
those of hard disk drives (HDDs), present significant challenges. Conventional fault-tolerance
techniques such as periodic checkpointing and replication are no longer adequate given the
increasing complexity of these systems. Since 1994, manufacturers have been
using SMART (Self-Monitoring, Analysis, and Reporting Technology) to anticipate HDD
failures by observing internal and environmental characteristics. These characteristics are
examined to raise a SMART flag that indicates a possible malfunction, giving users time to back
up their data.
This paper examines the use of three machine learning algorithms, KNN, K-Means Clustering, and
SVM, for predicting HDD failures, assessing their accuracy and trustworthiness and their
potential to enhance system reliability.
2. Literature Review of Machine Learning Algorithms
2.1 K-Nearest Neighbors (KNN)
Wang et al. (2014) applied KNN to HDD failure prediction based on SMART attributes. KNN's
reliance on distance metrics enables it to identify relationships between drives by analysing
comparable patterns in failure data. The algorithm worked best on
smaller datasets or when there was a distinct separation between failure and non-failure data.
Despite being simple, KNN faces a significant challenge: the computational complexity
increases with the dataset size. KNN's accuracy also depends on choosing an appropriate value
for k, the number of nearest neighbours, which may need to be tuned for optimal results.
Advantages
Easy to Use: K-Nearest Neighbors (KNN) is straightforward and intuitive, making it a common
first choice for newcomers.
Adaptable for Non-Linear Decision Boundaries: It effectively deals with intricate, non-linear
connections without making any assumptions about the data distribution.
No need for a Training Phase: KNN is a lazy learning algorithm that doesn't need any specific
training phase, so it can be easily applied once the data is ready.
Limitations
Prone to Noise: KNN is sensitive to noisy data and irrelevant features, which decreases its
predictive accuracy.
Computationally Expensive: As the dataset grows, the algorithm becomes more computationally
intensive because it must calculate the distance between each new query point and every stored
training point.
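To make this concrete, the following minimal sketch shows how a KNN classifier might be applied to SMART-style attributes. It assumes a scikit-learn environment; the synthetic features, labels, and the choice of k = 5 are illustrative placeholders, not values from Wang et al.

```python
# A minimal sketch of KNN classification on SMART-style data, assuming
# scikit-learn; feature values and labels are synthetic placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical SMART attributes, e.g. reallocated sectors, seek error rate, temperature.
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 1.0).astype(int)  # 1 = failure, 0 = healthy

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Distance-based methods are scale-sensitive, so standardise the features first.
scaler = StandardScaler().fit(X_train)

# k (n_neighbors) controls the bias-variance trade-off discussed above.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(scaler.transform(X_train), y_train)
print("Test accuracy:", knn.score(scaler.transform(X_test), y_test))
```

Because KNN defers all computation to prediction time, each call to score above recomputes distances to every stored training point, which is the scaling limitation noted earlier.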
2.2 K-Means Clustering
K-Means Clustering is an unsupervised learning algorithm that divides data into clusters
according to their similarity. While typically used for clustering, it has also been applied
effectively to anomaly detection, which is beneficial for identifying early indications of HDD
malfunctions.
Queiroz et al. (2017) utilised K-Means Clustering on SMART data to identify abnormalities that
could suggest upcoming HDD malfunctions. By categorising the data points into distinct
clusters, anomalies, which frequently indicate irregular or malfunctioning activities, can be easily
detected. Nevertheless, K-Means' effectiveness depends on the initial centroid placement and the
specified number of clusters. Selecting the wrong value for k can result in imprecise clustering
and, consequently, missed failures or false alarms.
Advantages
Effective with Big Data: K-Means is computationally efficient and manages large datasets
effectively, which is why it is commonly used for unsupervised learning.
Versatile: Because of its capacity to cluster similar data points, it is commonly utilised for
tasks such as anomaly detection, data compression, and pattern recognition.
Limitations
Assumes Spherical Clusters: The algorithm assumes that clusters are spherical and uniformly
sized, which may not hold for all datasets, resulting in less-than-optimal clustering.
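The following minimal sketch illustrates this anomaly-detection use of K-Means, assuming scikit-learn; the synthetic data, the choice of three clusters, and the 98th-percentile distance cut-off are illustrative assumptions rather than settings from Queiroz et al.

```python
# A minimal sketch of K-Means-based anomaly detection; data are synthetic
# stand-ins for SMART attribute vectors, and all parameters are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))   # mostly "healthy" drive readings
X[:10] += 6.0                   # a few injected outliers to detect

X_std = StandardScaler().fit_transform(X)

# The number of clusters k must be chosen up front, as noted above.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X_std)

# Flag points that lie far from their assigned centroid as potential anomalies.
distances = np.linalg.norm(X_std - kmeans.cluster_centers_[kmeans.labels_], axis=1)
threshold = np.percentile(distances, 98)  # an assumed cut-off for illustration
anomalies = np.where(distances > threshold)[0]
print("Suspected anomalous drives:", anomalies)
```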
2.3 Support Vector Machine (SVM)
Rincón et al. (2017) utilised SVM to categorise HDDs according to SMART attributes. Using
kernel functions, SVM can handle both linear and non-linear data, enhancing its effectiveness in
complex settings. One drawback of SVM is its high computational cost, especially with large
datasets, since training requires solving a quadratic optimisation problem.
Advantages
Effective in High-Dimensional Spaces: SVM can achieve high accuracy in high-dimensional spaces,
constructing hyperplanes that separate failure from non-failure instances.
Kernel Flexibility: Kernel functions allow SVM to handle both linear and non-linear decision
boundaries.
Limitations
Computationally Expensive: Training requires solving a quadratic optimisation problem, which is
time-consuming on large datasets.
Parameter Sensitivity: Performance depends on selecting an appropriate kernel function and
tuning its parameters.
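As a hedged illustration, the sketch below trains an SVM with an RBF kernel on synthetic data with a non-linear class boundary, assuming scikit-learn; the kernel and parameter values are demonstration defaults, not settings from Rincón et al.

```python
# A minimal sketch of SVM classification with an RBF kernel; feature values
# and labels are synthetic placeholders, assuming a scikit-learn environment.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
y = (np.sum(X**2, axis=1) > 4.0).astype(int)  # a deliberately non-linear boundary

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2, stratify=y
)
scaler = StandardScaler().fit(X_train)

# The RBF kernel lets SVM separate non-linear data, at the cost of the
# quadratic optimisation noted above; C and gamma are assumed defaults.
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(scaler.transform(X_train), y_train)
print("Test accuracy:", svm.score(scaler.transform(X_test), y_test))
```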
3. Comparative Analysis
When evaluating KNN, K-Means Clustering, and SVM, all three algorithms possess distinct
advantages and disadvantages, which make them appropriate for various facets of HDD failure
forecasting.
• KNN: Effective in situations with intricate, non-linear decision boundaries, but it becomes
computationally costlier as the dataset grows because it must calculate distances for each new
data point.
• K-Means: Despite its computational efficiency and scalability to large datasets, K-Means
may encounter difficulties with non-spherical clusters or inadequate centroid
initialisation.
• SVM: Can achieve high accuracy in high-dimensional spaces, but its training phase can be
time-consuming, especially with large datasets, because of the computational expense of solving
the underlying quadratic optimisation problem.
Application Suitability
KNN: Most suitable for problems where neighbouring data points carry meaningful relationships.
It excels in medical diagnostics and other pattern-recognition tasks with limited data.
K-Means: An efficient method for both anomaly detection and data clustering. Its ability to
recognise anomalies or group data with comparable trends makes it ideal for large, unlabelled
datasets.
SVM: Best for tasks that demand well-defined decision boundaries. It performs exceptionally
well in tasks that require handling high-dimensional data, like image recognition and
classification.
The selection of k influences KNN's effectiveness. Choosing a small number of neighbours can
result in overfitting, whereas selecting a large number can result in underfitting.
K-Means necessitates the upfront determination of the cluster quantity (k) and is greatly affected
by the initial centroids' placements. Inadequate initialisation may lead to less-than-optimal
clusters.
SVM's success depends on choosing the proper kernel function and its parameters. While it can
deal with intricate non-linear data, performance may suffer if parameters are not optimised.
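A common way to manage these sensitivities is cross-validated grid search. The sketch below, assuming scikit-learn, tunes k for KNN and C and gamma for SVM; the parameter grids and synthetic data are illustrative assumptions.

```python
# A minimal sketch of hyperparameter selection via cross-validated grid
# search, assuming scikit-learn; the parameter grids are illustrative.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0.5).astype(int)

# Search over k for KNN: small k risks overfitting, large k underfitting.
knn_search = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": [3, 5, 7, 11, 15]}, cv=5
)
knn_search.fit(X, y)
print("Best k for KNN:", knn_search.best_params_)

# Search over kernel parameters for SVM.
svm_search = GridSearchCV(
    SVC(kernel="rbf"), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1.0]}, cv=5
)
svm_search.fit(X, y)
print("Best SVM parameters:", svm_search.best_params_)
```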
4. Conclusion
This article examined three machine learning techniques (KNN, K-Means Clustering, and SVM) in
terms of their use for predicting HDD failures in advance. KNN is a good option for
smaller datasets or those with non-linear boundaries, while K-Means Clustering is beneficial for
detecting anomalies in large datasets without supervision. Even though SVM requires a lot of
computational power, it provides excellent accuracy for tasks involving classification in high-
dimensional spaces.
No one algorithm stands out as universally better for detecting HDD failures. Instead, a hybrid
method that leverages the benefits of these algorithms could yield improved performance. An
example would be utilising K-Means Clustering for data pre-processing and clustering, then
proceeding with SVM for classification or KNN for refining predictions, as sketched below.
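As a minimal sketch of one possible reading of this hybrid idea, assuming scikit-learn, K-Means could first cluster the drives and the cluster assignment be appended as an extra feature before SVM classification; the two-stage design and all values are illustrative assumptions, not a prescribed method.

```python
# A minimal sketch of a hypothetical hybrid pipeline: K-Means clustering as a
# pre-processing step whose cluster labels augment the features fed to an SVM.
# Data are synthetic placeholders; assumes a scikit-learn environment.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Stage 1: cluster the drives and append the cluster id as an extra feature.
labels = KMeans(n_clusters=3, n_init=10, random_state=4).fit_predict(X)
X_aug = np.column_stack([X, labels])

# Stage 2: SVM classification on the augmented features.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(clf, X_aug, y, cv=5)
print("Cross-validated accuracy:", scores.mean())
```

For simplicity, the clustering step here is fitted on all the data; a production pipeline would fit it inside the cross-validation loop to avoid leakage.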
Bibliography
[1] Y. Wang, E. W. M. Ma, T. W. S. Chow, and K.-L. Tsui, "A Two-Step Parametric Method for
Failure Prediction in Hard Disk Drives," IEEE Transactions on Industrial Informatics, vol. 10,
no. 1, pp. 419–430, Feb. 2014, doi: https://doi.org/10.1109/tii.2013.2264060.
[2] A. Tahir, F. Chen, A. A. Almazroi, and N. F. Janbi, "SWEP-RF: Accuracy sliding window-based
ensemble pruning method for latent sector error prediction in cloud storage computing," Journal
of King Saud University - Computer and Information Sciences, vol. 35, no. 8, p. 101672, Sep.
2023, doi: https://doi.org/10.1016/j.jksuci.2023.101672.
[3] D. Liu et al., "Predicting Hard Drive Failures for Cloud Storage Systems," Algorithms and
Architectures for Parallel Processing, pp. 373–388, 2020, doi:
https://doi.org/10.1007/978-3-030-38991-8_25.
[4] C.-J. Su and S.-F. Huang, "Real-time big data analytics for hard disk drive predictive
maintenance," Computers & Electrical Engineering, vol. 71, pp. 93–101, Oct. 2018, doi:
https://doi.org/10.1016/j.compeleceng.2018.07.025.