A Sequential Supervised Machine Learning Approach for Cyber Attack Detection in …
Abstract—Modern smart grid systems are heavily dependent on Information and Communication Technology, and this dependency makes them prone to cyber-attacks. The occurrence of cyber-attacks has increased in recent years, resulting in substantial damage to power systems. For reliable and stable operation, cyber protection, control, and detection techniques are becoming essential. Automated detection of cyberattacks with high accuracy is a challenge. To address this, we propose a two-layer hierarchical machine learning model having an …

Recently, a lot of attention has turned towards the automated identification of cyberattacks. Several techniques and methods involving supervised and unsupervised machine learning (ML) algorithms have been proposed for the detection of cyber-attacks. Such models are provided data related to electrical parameters during normal operation and during cyberattacks, which is used to identify cyberattacks after training. A supervised ML algorithm based on Support …
2021 North American Power Symposium (NAPS) | 978-1-6654-2081-5/21/$31.00 ©2021 IEEE | DOI: 10.1109/NAPS52732.2021.9654767
Authorized licensed use limited to: Zhejiang University. Downloaded on January 15,2025 at 08:48:54 UTC from IEEE Xplore. Restrictions apply.
Fig. 1. Proposed methodology and model architecture. The dataset utilized is reduced to 96 features after preprocessing and is then divided into train and test sets with a ratio of 8:2. The segregated test dataset is put aside and is not included in any training or tuning of the model.
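The two-layer routing in the architecture (a first layer that filters natural events, then a second layer that assigns one of the attack classes) can be sketched in plain Python. The constant "models" below are hypothetical stand-ins for the paper's tuned random forests, shown only to make the control flow concrete.

```python
# Minimal sketch of the two-layer hierarchical classifier routing.
# The stand-in models are hypothetical; the paper uses tuned RFCs.

def make_constant_model(label):
    """Return a trivial 'model' that predicts a fixed label."""
    return lambda sample: label

def hierarchical_predict(sample, is_attack_clf, attack_type_clf):
    """Layer 1 separates natural events from attacks; layer 2
    assigns a specific attack class to attack events only."""
    if is_attack_clf(sample) == "natural":
        return "natural"            # filtered out at the first layer
    return attack_type_clf(sample)  # refined at the second layer

# Toy usage with constant stand-ins:
layer1 = make_constant_model("attack")
layer2 = make_constant_model("attack_class_7")
print(hierarchical_predict([0.1, 0.2], layer1, layer2))  # attack_class_7
```

Any pair of classifiers exposing a predict-like callable could be dropped into this routing, which is what keeps the two sub-problems independently tunable.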
Authorized licensed use limited to: Zhejiang University. Downloaded on January 15,2025 at 08:48:54 UTC from IEEE Xplore. Restrictions apply.
Part 2: Handling Imbalanced Classification

An imbalanced classification issue arises when the distribution among the different classes of a dataset is not uniform. This problem can lead to poor performance of the classification model, especially for the minority class. Imbalanced classification raises a challenge for training an effective model, as most ML algorithms are designed for an equal number of samples in each class. Many real-world problems are imbalanced in terms of data. If they are not handled properly, the resulting outcome will be a biased trained model giving an edge to the majority class and overlooking the minority one [14]. The dataset is therefore processed to overcome this problem, and all the classes are balanced out before training the model. Several methods are used for handling such issues, which can be divided into two main approaches: data-driven and algorithm-driven [15].

For our research, the dataset used for training is highly imbalanced in terms of classes. Therefore, to attain high model accuracy, balancing out the dataset is needed. A data-driven approach is used for addressing this problem. Data resampling approaches are further divided into oversampling and undersampling [10].

Oversampling algorithms are adopted in order to achieve optimal efficiency while giving priority to the number of data instances. A clearer picture of the class imbalance in the dataset can be seen in Fig. 2.

Fig. 2. Original population of classes in the dataset; the population shows that the classes are very imbalanced. Moreover, classes 30 through 35 do not exist in the dataset, hence they have no values.

It can be seen from the figure that the number of data instances in each class is very inconsistent. The highest number of data instances is in class 36 with 4685 samples, while class 21 has the lowest with only 1242 samples, 3.7 times fewer than class 36.

Oversampling techniques increase the number of data points rather than reducing them to uniformity, which is the case in undersampling. Undersampling approaches can only be used where the dataset is quite extensive and the loss of a small amount of data would not cause any significant change. In our case, however, the amount of data is small, so undersampling was not an option. Among oversampling techniques, the most popular are the Adaptive Synthetic (ADASYN) algorithm, the Synthetic Minority Over-sampling Technique (SMOTE), Random Over Sampler (ROS), and Borderline SMOTE. All of these methods were applied to determine the most effective technique. The detailed results for each technique are presented in Table II.

TABLE II
METHODS FOR OVERSAMPLING

Method                  Validation Accuracy
None                    85.76 %
ADASYN                  93.89 %
Random Over Sampler     96.23 %
SMOTE                   92.02 %
Borderline SMOTE        93.72 %

As per the results, ROS has the best accuracy among all approaches, but it tends to cause over-fitting of the model [15], [16]: minority classes are randomly replicated until the desired balancing ratio is achieved. To avoid this problem, other modern techniques were considered. ADASYN and Borderline SMOTE have almost the same accuracy, and the working of the two algorithms is also quite similar, with just a little difference [17]: ADASYN generates synthetic data that is harder to learn, whereas SMOTE synthesizes data by interpolating minority-class samples that are closely located [16]. These modern algorithms provide an adaptive approach for handling imbalanced classification and help in better training of the model. Although ADASYN has greater accuracy than Borderline SMOTE, it was not used. The reason for rejecting ADASYN is its density-distribution criterion, which automatically calculates the number of samples to be synthesized for each minority class; while balancing the dataset, ADASYN exceeds the original maximum number of samples in a particular class. It was therefore rejected to maintain the sensitivity of the model. Borderline SMOTE is a variation of the SMOTE algorithm that only synthesizes data near the decision boundary between the classes, whereas the SMOTE algorithm generates synthetic data from the k-nearest neighbors of each minority-class sample [15].

A picture of the dataset before and after the application of the Borderline SMOTE algorithm is shown in Fig. 3. In addition, the balanced number of samples in each class after oversampling is shown in Fig. 4. After oversampling, all classes are brought to parity in terms of data instances; that is, all classes now comprise 4685 data samples. The difference in class distribution before and after oversampling is exhibited in Fig. 3, which utilizes the voltage and current magnitude of PMU 1 for visualization.

Part 3: Handling Standardization of Dataset

Data standardization is essential before implementing ML algorithms, as it can significantly impact the outcome of the ML training model. It is therefore crucial to have all the data on the same scale. Many approaches are available for data standardization; the methods tested for this particular experiment are described below. Among these methods, the Standard Scaler (SS) acquired the highest accuracy of 95.2% on the binary dataset and was therefore utilized in this experiment.

Standard Scaler: transforms all data features to the same magnitude, keeping mean 0 and variance 1. It does not involve any minimum or maximum value of the features, as shown in (1).
Fig. 3. Scatter plot of the dataset, representing the distribution of classes with respect to two features, i.e., voltage and current. Different colors in the plot represent individual classes. (a) portrays the composition of the dataset before oversampling by the Borderline SMOTE algorithm and (b) portrays the composition of the dataset after oversampling by Borderline SMOTE.
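The interpolation that SMOTE-family methods perform (and that Fig. 3 visualizes) can be sketched in a few lines of NumPy. This is a simplified illustration of plain SMOTE without the borderline filtering; `smote_like_oversample` is a hypothetical helper, not the paper's code or the imbalanced-learn implementation.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating
    each chosen sample toward one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to every other minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf                      # exclude the sample itself
        neighbors = np.argsort(d)[:k]
        j = rng.choice(neighbors)
        gap = rng.random()                 # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack([X_min, np.array(synthetic)])

# Toy minority class of 6 points oversampled to 10:
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5], [0.2, 0.8]])
X_res = smote_like_oversample(X, n_new=4, k=3, rng=0)
print(X_res.shape)  # (10, 2)
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the minority region, which is the behavior the scatter plots in Fig. 3 illustrate.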
$x' = \dfrac{x - \bar{x}}{\sigma}$   (1)

where $\bar{x}$ and $\sigma$ are the mean and standard deviation of the feature.

Mean Normalization: transforms each feature vector so that it has a Euclidean length of one. The scaling divisor is different for every data point, as given in (2):

$x' = \dfrac{x}{\lVert x \rVert_2}$   (2)
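Both scalings can be verified numerically. The NumPy sketch below mirrors what scikit-learn's StandardScaler and per-sample L2 normalization do, but it is an illustration, not the paper's code.

```python
import numpy as np

def standard_scale(X):
    """Eq. (1): per-feature z-score, giving zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def l2_normalize(X):
    """Eq. (2): scale each sample (row) to unit Euclidean length."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

# Two features on very different scales:
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
Z = standard_scale(X)
N = l2_normalize(X)
print(Z.mean(axis=0))             # ~[0, 0]
print(np.linalg.norm(N, axis=1))  # ~[1, 1, 1]
```

The key difference: (1) scales each feature column independently, while (2) rescales each sample row by its own norm, which is why the divisor differs per data point.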
C. Dimensionality Reduction

The number of different features present in a dataset is known as the dimensionality of the dataset. As the number of data features increases, training a model becomes challenging: it requires more computational power and may lead to overfitting, resulting in performance degradation [13]. This issue is often termed the curse of dimensionality. To mitigate it, high-dimensional statistics and different reduction techniques are used for data visualization. These methods are also applied in ML for optimizing the outcome of a model. The method used in this research for identifying the importance of each feature is the Mean Decrease in Impurity (MDI).

The MDI is a measure of feature importance in evaluating a target variable. It calculates an average of the total decrease in node impurity, weighted by the ratio of samples reaching each node at which the feature splits, across the decision trees. Thus, a higher MDI indicates a higher importance of that particular feature. The impurity index G used by MDI is defined in (4):

$G = 1 - \sum_{i=1}^{n_c} p_i^{2}$   (4)

where $n_c$ is the number of classes in the target variable and $p_i$ is the ratio of class $i$.

The decrease in impurity $I$ is then defined in (5):

$I = G_{\text{parent}} - P_{\text{left}}\,G_{\text{left}} - P_{\text{right}}\,G_{\text{right}}$   (5)

where $P$ is the proportion of the data in each split relative to the parent node.

Fig. 4. Population of classes after oversampling with Borderline SMOTE; each class is now balanced, having the same number of data instances. Classes 30 through 35 do not exist in the dataset, i.e., they are not sampled.

The MDI for the top 10 features of the dataset can be seen in Fig. 5. After analyzing the dataset, it is observed that the control panel log, relay log, snort log, and status flag of each PMU have very little or no importance in the detection of cyberattacks. Moreover, when analyzed with respect to domain knowledge, these features have no influence over the power system. Therefore, these features were dropped from the original dataset, and its dimensionality was reduced from 128 features to 96. The effect of this feature reduction can be seen in Table III.

TABLE III
FEATURE REDUCTION

Removal of Low-MDI Features    Validation Accuracy
Before                         94.51 %
After                          94.59 %

After removing trivial features, the accuracy of the model increased, along with a reduction in the computational complexity and time required; the fewer the features, the lower the computational cost. A comparison was made between recent approaches and the proposed model in terms of computational complexity, selecting only approaches that utilize the same dataset as the proposed approach. Table IV illustrates a comprehensive comparison with regard to the number of features and computational complexity.

TABLE IV
COMPARISON OF COMPLEXITY WITH RECENT APPROACHES

Ref.   Approach                     Number of Features   Computational Complexity
[9]    SACS-SAE                     96                   Low
[20]   J-Ripper, RF, One-R & NB     128                  Medium
[13]   AWV                          144                  Highest
--     Proposed                     94                   Lowest

E. Performance Metrics

For the proposed model and its training, the evaluation criteria were set on accuracy, precision, recall, and F1 score, as shown in (7), (8), (9), and (10), respectively. These metrics are the most common and are widely adopted for the performance evaluation of ML approaches [11].

$Accuracy = \dfrac{TP + TN}{TP + TN + FP + FN}$   (7)

$Precision = \dfrac{TP}{TP + FP}$   (8)

$Recall = \dfrac{TP}{TP + FN}$   (9)

$F1 = \dfrac{2\,TP}{2\,TP + FP + FN}$   (10)

where TP and TN refer to true positives and true negatives, and FP and FN refer to false positives and false negatives, respectively.
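The four metrics follow directly from the confusion counts; a minimal sketch (the counts in the usage example are made up for illustration):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion counts,
    following Eqs. (7)-(10)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, precision, recall, f1

# Example: 80 true positives, 90 true negatives, 10 FP, 20 FN
acc, prec, rec, f1 = classification_metrics(80, 90, 10, 20)
print(acc, prec, rec, f1)  # 0.85 0.888... 0.8 0.842...
```

Note that (10) is algebraically the harmonic mean of precision and recall, which is why F1 penalizes a model that trades one for the other.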
D. RFC Parameter Tuning

RFC is a well-known classification algorithm known for its robustness towards outliers. Moreover, it can handle noise comparatively better than other algorithms of its domain [8]. Like every other model, it is essential to tune the model to the problem at hand to obtain effective results. Nebrase et al. in [8] tested RFC with the same dataset, and the resulting accuracy was 92%. For our research, the goal of parameter tuning was to improve the accuracy level in order to develop a sequential model capable of achieving higher accuracy. There are several hyperparameters of RFC that can be adjusted; in this work, only the n_estimators, max_features, and criterion parameters of RFC were tested and tweaked for better accuracy. The details of the tests can be found in Table V.

TABLE V
RFC PARAMETER SELECTION

Parameter      Value     Validation Accuracy
Max_Features   Sqrt      94.54 %
Max_Features   Log2      94.87 %
Criterion      Gini      95.08 %
Criterion      Entropy   95.04 %

Fig. 5. Mean Decrease in Impurity for the top 10 features of the dataset.

III. RESULTS AND ANALYSIS

The core objective of this research work was to develop a sequential model with better accuracy and precision along with low computational cost. To achieve this goal, a bi-level model is proposed using RFC as the base classifier for the detection of intrusion attacks in smart grid systems. The model is divided into two layers. The first-level sub-problem classifies between natural events and attack events; through this level, all natural events are classified and filtered out. This layer has an accuracy of 99% in detecting a natural event. The reason for such high accuracy is that learning the class boundary between the two major classes is easier than learning all individual classes. Further, the classification of natural events into their specific classes is not part of this model, as the intention behind the model is to detect and classify intrusion attacks in smart grid systems; all events related to faults, operation, or maintenance come under the umbrella of natural events. If the upper level classifies the data as an attack event, it is passed on to the lower-level sub-problem, which classifies the data into one of 27 attack classes. The overall accuracy of the model is 95.44%.

For training and testing purposes, the dataset of multi-class
model proposed by Hink et al. [18] has an accuracy of less than 90%, and the model proposed by Keshk et al. [19] has an accuracy of 90.2%. In addition to these models, Defu et al. proposed a novel model in [13] using RFC as a base classifier and achieved a weighted accuracy of 93.91%. Comparing our model, in terms of intrusion detection in a smart grid system, with the model proposed in [13], our model clearly outperforms it with an accuracy of 95.44% on the test dataset.

It can be deduced from this research work that data preprocessing plays a crucial part in model performance. From class balancing of the dataset to feature reduction and standardization of the data, preprocessing helps improve model training and enhances the model's predictive efficiency. By addressing the class imbalance problem, we provided a better learning environment for our proposed model, yielding improved efficiency.

Fig. 7. Comparison of baseline models (proposed model, single-level model, and primary RFC model) in terms of accuracy, precision, recall, and F1 score. All models are trained and tested in the same environment and on the same dataset. The difference between the single-level model and the primary RFC model lies in the parameters: the primary model uses default parameters, whereas the single-level model's parameters are similar to the proposed model's.

IV. CONCLUSIONS

This study proposes a two-layered hierarchical approach with a baseline classifier to detect cyberattacks on a smart power system. We find that the two-layered traditional random forest algorithm performs better than deep learning algorithms. The limited attack data available makes it harder for deep learning approaches to learn the attack scenarios efficiently. Another issue in the currently used attack datasets is class imbalance, which results in model training heavily biased towards the normal state instead of the attack state. Tackling this issue before model training through class-balancing approaches can lead to improved performance of current models. The performance results also reveal that feature reduction of the dataset can be quite useful, but it should be done with domain knowledge in mind. The accuracy achieved by the proposed model is compared with the baseline models and is found to outperform them for the detection of intrusion attacks in smart grid systems. Our study provides techniques to improve the accuracy of attack detection models while retaining traditional ML algorithms with low computational costs.

REFERENCES

[1] L. Gao, B. Chen and L. Yu, "Fusion-Based FDI Attack Detection in Cyber-Physical Systems," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 67, no. 8, pp. 1487-1491, Aug. 2020, doi: 10.1109/TCSII.2019.2939276.
[2] Z. Qu et al., "Survivability Evaluation Method for Cascading Failure of Electric Cyber Physical System Considering Load Optimal Allocation," Mathematical Problems in Engineering, vol. 2019, pp. 1-15, 2019, doi: 10.1155/2019/2817586.
[3] A. Mohan, N. Meskin and H. Mehrjerdi, "A Comprehensive Review of the Cyber-Attacks and Cyber-Security on Load Frequency Control of Power Systems," Energies, vol. 13, no. 15, p. 3860, 2020, doi: 10.3390/en13153860.
[4] M. Esmalifalak, L. Liu, N. Nguyen, R. Zheng and Z. Han, "Detecting Stealthy False Data Injection Using Machine Learning in Smart Grid," IEEE Systems Journal, vol. 11, no. 3, pp. 1644-1652, Sept. 2017, doi: 10.1109/JSYST.2014.2341597.
[5] P. Lau, W. Wei, L. Wang, Z. Liu and C.-W. Ten, "A Cybersecurity Insurance Model for Power System Reliability Considering Optimal Defense Resource Allocation," IEEE Transactions on Smart Grid, vol. 11, no. 5, pp. 4403-4414, Sept. 2020, doi: 10.1109/TSG.2020.2992782.
[6] O. Boyaci, A. Umunnakwe, A. Sahu, M. R. Narimani, M. Ismail, K. Davis and E. Serpedin, "Graph Neural Networks Based Detection of Stealth False Data Injection Attacks in Smart Grids," arXiv preprint arXiv:2104.02012, 2021.
[7] [Entry incomplete in source] IEEE Transactions on Signal Processing, vol. 69, pp. 2725-2739.
[8] [Entry incomplete in source] doi: 10.1109/CyberSecurity49315.2020.9138871.
[9] Z. Qu, Y. Dong, N. Qu, H. Li, M. Cui, X. Bo, Y. Wu and S. Mugemanyi, "False Data Injection Attack Detection in Power Systems Based on Cyber-Physical Attack Genes," Frontiers in Energy Research, vol. 9, 2021.
[10] A. S. Musleh, G. Chen and Z. Y. Dong, "A Survey on the Detection Algorithms for False Data Injection Attacks in Smart Grids," IEEE Transactions on Smart Grid, vol. 11, no. 3, pp. 2218-2234, May 2020, doi: 10.1109/TSG.2019.2949998.
[11] A. Sayghe, Y. Hu, I. Zografopoulos, X. Liu, R. G. Dutta, Y. Jin and C. Konstantinou, "Survey of machine learning methods for detecting false data injection attacks in power systems," IET Smart Grid, vol. 3, no. 5, pp. 581-595, 2020.
[12] U. Adhikari, S. Pan, T. Morris, R. Borges and J. Beaver, "Industrial Control System (ICS) Cyber Attack Datasets," 2016. [Online]. Available: https://www.sites.google.com/a/uah.edu/tommy-morris-uah/ics-data-sets
[13] D. Wang, X. Wang, Y. Zhang and L. Jin, "Detection of power grid disturbances and cyber-attacks based on machine learning," Journal of Information Security and Applications, vol. 46, pp. 42-52, 2019, doi: 10.1016/j.jisa.2019.02.008.
[14] Y. Zhao and Y. Cen, Data Mining Applications with R. Academic Press, Cambridge, MA, USA, 2013. ISBN 9780124115118.
[15] F. Thabtah, S. Hammoud, F. Kamalov and A. Gonsalves, "Data imbalance in classification: Experimental evaluation," Information Sciences, vol. 513, pp. 429-441, 2020.
[16] N. Chawla, K. Bowyer, L. Hall and W. Kegelmeyer, "SMOTE: Synthetic Minority Over-sampling Technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002, doi: 10.1613/jair.953.
[17] J. Brandt and E. Lanzén, "A Comparative Review of SMOTE and ADASYN in Imbalanced Data Classification," DiVA, 2021. [Online]. Available: http://www.divaportal.org/smash/record.jsf?pid=diva2%3A1519153&dswid=-2594
[18] R. C. Borges Hink, J. M. Beaver, M. A. Buckner, T. Morris, U. Adhikari and S. Pan, "Machine learning for power system disturbance and cyber-attack discrimination," 2014 7th International Symposium on Resilient Control Systems (ISRCS), 2014, pp. 1-8, doi: 10.1109/ISRCS.2014.6900095.
[19] M. Keshk, N. Moustafa, E. Sitnikova and G. Creech, "Privacy preservation intrusion detection technique for SCADA systems," 2017 Military Communications and Information Systems Conference (MilCIS), 2017, pp. 1-6, doi: 10.1109/MilCIS.2017.8190422.
[20] M. Panthi, "Anomaly Detection in Smart Grids using Machine Learning Techniques," 2020 First International Conference on Power, Control and Computing Technologies (ICPC2T), 2020, pp. 220-222, doi: 10.1109/ICPC2T48082.2020.9071434.