A framework for intrusion detection based on few-shot learning
ARTICLE INFO

Article history:
Received 16 February 2022
Revised 19 July 2022
Accepted 24 August 2022
Available online 28 August 2022

Keywords:
Network security
Intrusion detection system
Few-shot learning
Feature fusion
CNN
Deep learning

ABSTRACT

Because traditional intrusion detection methods depend heavily on large, fully labeled datasets, existing works can hardly be applied in real-world scenarios, especially when facing zero-day attacks. In this paper we present a novel intrusion detection framework called “FS-IDS”, which includes a flow data encoding method, a feature fusion mechanism and an architecture for an intrusion detection system based on few-shot learning. We use a task generator to split the dataset into separate tasks and train the model in an episodic way, so that the model learns general knowledge rather than knowledge specific to a single class. The extraction module and the distance metric module are responsible for learning and for determining whether the traffic data are benign or not. We conduct three sets of experiments on “FS-IDS”: a comparison study, an ablation study and a multiclass study. The comparison study first determines that the best metric for discrimination is Euclidean distance. Based on this optimal implementation, “FS-IDS” achieves performance comparable with existing works while using far fewer malicious samples. The ablation study sets two base models to explore how the proposed encoding method and feature fusion mechanism improve detection capacity. Both the image representation and feature fusion achieve more than 2% improvement in accuracy and recall. Finally, to test whether “FS-IDS” can perform well in real-world scenarios, we design network traffic containing various attacks to simulate a complex malicious network environment. Experimental results show that “FS-IDS” maintains more than 90% detection accuracy and recall under the worst circumstances, which consist of various seen and unseen attacks with only a few malicious samples available.
© 2022 Elsevier Ltd. All rights reserved.
This work was supported in part by National Key R&D Program of China under Grant no. 2020YFB1807504 and National Science Foundation of China Key Project under Grant no. 61831007.
∗ Corresponding author.
E-mail addresses: [email protected] (J. Yang), [email protected] (H. Li), [email protected] (S. Shao), [email protected] (F. Zou), [email protected] (Y. Wu).
https://doi.org/10.1016/j.cose.2022.102899
0167-4048/© 2022 Elsevier Ltd. All rights reserved.
J. Yang, H. Li, S. Shao et al. Computers & Security 122 (2022) 102899

1. Introduction

Nowadays, cyberspace has become the “fifth frontier” after the ocean, land, air, and space (Jiangxing et al., 2018). Along with the rapid growth of internet applications and network services, the hazard caused by cyberspace intrusions that target network vulnerabilities has become much more serious, especially 0-day attacks, which exploit security weaknesses that the vendors or developers are unaware of. A report by MIT Technology Review (O’Neill, 2021) said, based on data collected from multiple sources, that at least 66 zero-days had been found in use in 2021, almost double the number of such attacks recorded the previous year. Defense scenarios that detect known or unknown intrusion actions in order to alert managers before damage is caused have sparked more and more public concern. An Intrusion Detection System (IDS), as a classic network security protection application, seems to be a rational choice under such circumstances. Intrusion detection is the process of monitoring the events occurring in a computer system or network, and analyzing them for signs of intrusions. Earlier research on IDS commonly extracted knowledge from audit data, user profiles or network traffic and formulated rules of benign or abnormal behaviors manually (Liao et al., 2013). As artificial intelligence has become a buzzword since 2014, applying deep learning to network intrusion detection or anomaly detection has become a promising field. Owing to the capacity of deep neural networks to learn high-level latent features from big data, they have replaced manual rules as powerful data analysis and decision-making tools (Andresini et al., 2021; Kim et al., 2018; Li et al., 2017; Malaiya et al., 2018; Pektaş and Acarman, 2019; Wang et al., 2018).

However, the use of neural networks in IDS encounters many limitations and challenges. The most crucial difficulty in applying IDS is the dependency of deep learning models on a large-scale and well-labeled dataset (Ahmim et al., 2019; Andresini et al., 2021; Faker and Dogdu, 2019; Injadat et al., 2021; Manimurugan et al., 2020; Min et al., 2018; Pektaş and Acarman, 2019; Resende and Drummond, 2018; de Souza et al., 2020; Zhang et al., 2019). When the number of training samples is insufficient, the model suffers from severe overfitting and performs poorly. Yet gathering and labeling data to generate a reliable dataset is always costly. These problems become even more severe in the intrusion detection field. Nowadays, the network data generated in one day can be measured in terabytes. Unlike image or corpus labeling in other deep learning fields, labeling network traffic always relies on expert knowledge, which makes it impractical to analyze and label such a huge amount of data manually. Moreover, security experts cannot have adequate time and resources to collect, analyze and label 0-day attack samples for an intrusion detection model. Even after security experts have prepared a sufficient and well-labeled dataset, the attack strategies it includes may become obsolete.

Another challenge of IDS is that a neural network has strict requirements for the value and shape of its input data, while network traffic tends to be diverse and heterogeneous. An informative representation of network traffic data can be a critical factor affecting a model’s detection performance. Although there are a few works on data representation of network flows for deep learning models (Kim et al., 2018; Li et al., 2017; Wang et al., 2018), previous methods process only the content of traffic data or only statistical features. We think these methods cannot reflect the features of network behaviors comprehensively, especially with limited resources in few-shot conditions. How network traffic can be represented appropriately for a neural network remains a key issue.

To mitigate the high dependency of deep learning on high-quality datasets, as well as to improve the capacity of IDS to detect 0-day attacks, we proposed an intrusion detection framework based on few-shot learning. In the proposed framework, a feature extraction network was designed and trained according to a specific algorithm. The training and test process was conducted on a specific task set to obtain generalizable prior knowledge from known attacks. The discrimination principle is that unseen malicious traffic can be distinguished by comparing similarity measures with regard to the “prototype” embedding generated by the trained feature extraction network. In consideration of the difficulty of data representation, we proposed a novel data encoding method to transform network traffic into image-format data for a convolutional neural network (CNN). We also used feature fusion to combine the embedding generated by the feature extraction network with compressed features from an autoencoder to form a deep representation of a network flow. We believe that by these means we can use as much information as possible from limited resources to detect novel attacks.

For intrusion detection based on few-shot learning, we find that the work most similar to ours is FC-Net, proposed in Xu et al. (2020), which follows the same discrimination principle. However, FC-Net extracts feature embeddings only from the traffic content and ignores statistical features, which misses information useful for detection. We not only used an improved network architecture and training algorithm to obtain a model with enhanced detection capacity, but also proposed methods of data encoding and feature fusion to better represent network flows for discrimination. By empowering the model with the capacity to learn latent discriminant patterns from only a few labeled samples, we provide a solution for intrusion detection systems under circumstances of deficient data samples and emerging 0-day attacks.

The main contributions of this paper are as follows:

1. We proposed an integrated intrusion detection framework named FS-IDS based on few-shot learning. FS-IDS achieves over 97% accuracy and 99% recall on detecting novel attacks with far fewer malicious samples than previous research. The results obtained by FS-IDS are also the state-of-the-art performance in intrusion detection based on few-shot learning.
2. We proposed a novel network traffic data encoding method and a feature fusion method in order to construct an informative representation for neural networks. Feature fusion combined the original byte content with extracted features of a given network flow, instead of using only one of them as in former research. Ablation studies demonstrated that both the data encoding and the feature fusion of multidimensional information improve detection accuracy and recall in few-shot conditions.
3. Since in the real world an IDS may face much more complex and serious situations, we extended FS-IDS from the binary classification of former research to multi-class classification. Through experiments using blended traffic that simulates real-world conditions, we showed that FS-IDS can not only classify network behaviors as benign or malicious, but can also tell which kind of attack strategy the attackers used after training on only 5 attack samples.

2. Related works

2.1. Deep learning based intrusion detection

The recent rise of interest in the field of artificial intelligence has resulted in major advancements in, among others, applications to network intrusion detection mechanisms. With the dramatic increase in computing resources available to train a neural network, deep learning has become a common choice and its usage is no longer held back. The work of Malaiya et al. (2018) attributed the failure of conventional shallow learning to identify anomalies in network traffic datasets to the very high degree of non-linearity of network traffic data. That work designed a set of deep learning models including a fully connected neural network, a variational autoencoder, and a sequence-to-sequence structure, and showed the feasibility of deep learning with greater accuracy in detection. The experimental results also showed that the sequence-to-sequence model consistently outperforms the others. That work, however, does not design or evaluate any model based on CNN. To employ a CNN structure in intrusion detection, the authors of Li et al. (2017) designed a data encoding module to convert various feature attributes into image form. They then used a visual conversion of the NSL-KDD format to evaluate the performance of CNN in intrusion detection. The proposed method performed one-hot encoding on symbolic features. Continuous features can be transformed to symbolic features by normalization and discretization. Experiments on the two NSL-KDD test datasets showed that CNN performs better than most standard classifiers, although CNN does not improve on the state of the art completely. Kim et al. (2018) introduced an improved encoding technique that enhances the performance of identifying anomalous events using a CNN structure. The improved encoding method extended the previous “gray-scale”-like encoding into an RGB-like encoding, which allocated an equal number of pixels to individual features. Experimental results demonstrate its superiority over previous research.

The aforementioned research was conducted on the KDD’99 or NSL-KDD datasets, which record only statistical features of network traffic. The majority of publicly available datasets commonly used in the network security literature disclose only network attributes and do not reveal the raw traffic data of the network flows they recorded. For example, the KDD’99 and NSL-KDD datasets contain 41 features extracted from data captured in the DARPA’98 IDS evaluation program. The features contained can be classified into three groups (Tavallaee et al., 2009):
1. Basic features: this category encapsulates all the attributes that can be extracted from a TCP/IP connection, e.g., duration of the connection, network service on the destination, type of protocol.
2. Traffic features: this category includes features that are computed with respect to a window interval, e.g., number of connections, and is divided into two groups according to the relationship between the recorded connections and the current connection:
   1. “same host” features: examine only the connections in the past 2 s that have the same destination host as the current connection, and calculate statistics related to protocol behavior, service, etc.
   2. “same service” features: examine only the connections in the past 2 s that have the same service as the current connection.

Whether because of objective limitations of datasets or researchers’ subjective neglect, we find that researchers did not take the raw network traffic content into account. In recent years, some works have used traffic data rather than statistical features. Wang et al. (2018) proposed HAST-IDS, which combines learning of low-level spatial features with high-level temporal features from network traffic using both a CNN and a long short-term memory network (LSTM). The automatically learned traffic features effectively reduce the false alarm rate, without any feature engineering techniques. This IDS achieved 99.89% accuracy and 96.96% recall on ISCX2012. A novel encoding method named “Flow-Image” representation as well as a “Segmented-CNN” were presented in (Millar et al., 2019). A network traffic flow was represented as a two-dimensional array in which each row represents a new packet in the flow and each column represents a new byte in the packet. The novel “Segmented-CNN” architecture aims to exploit the distinct properties of the header and payload in a TCP/IP packet. Experimental results indicated that the proposed model obtains a good balance between efficiency and performance.

2.2. Few-shot learning

In recent years, due to advancements in computing resources and large-scale datasets, artificial intelligence represented by deep learning has entered many fields as a highly intelligent tool. Although deep learning is in a prosperous stage, it still has some intrinsic defects. One of them is its incapacity to generalize from few data to perform a task. Whereas humans can rapidly generalize what they have learned to new task scenarios, a deep learning model must learn and make inferences on the basis of large amounts of data. To meet the need to learn from limited supervised information, a new machine learning problem called few-shot learning (FSL) has emerged (Wang and Yao, 2020).

FSL aims to recognise novel categories from far fewer labeled examples than traditional deep learning. The idea is to focus on learning a transferable embedding and to pre-define a fixed metric (e.g., Euclidean distance; Snell et al., 2017) for classification. The model performs non-parametric “learning” at the so-called “task” level by simply comparing validation points with training points and predicting the label of the matching training points. This was first achieved by the “Siamese Network” presented in Koch et al. (2015). A Siamese Network consists of twin networks which accept distinct inputs but are joined by an energy function (e.g., a loss function) at the top. They used the verification model to evaluate new images, exactly one per novel class, in a pairwise manner against the test images. In (Vinyals et al., 2016a), researchers define a few-shot learning framework named the “episode”-based training procedure, which has been inherited by subsequent research. Furthermore, they presented “Matching Network” and trained it in the proposed episode-based manner, to perform rapid learning by showing only a few samples per class. Snell et al. (2017) improved upon Matching Network by using a neural network to learn a non-linear mapping of the input into an embedding space, and taking a class’s prototype to be the mean vector of its embedded support samples. Classification is then performed for an embedded test data sample by simply finding the nearest class prototype. They named their revised version “Prototypical Network”. In the 5-shot scenario, it achieves 98.8% and 68.2% accuracy on the Omniglot and miniImageNet datasets, respectively. On this basis, Sung et al. (2018) presented “Relation Net” and further improved few-shot classification performance. Relation Net inherited the representation learning network from Prototypical Network. However, instead of simply calculating Euclidean distance, it used a metric learning network to learn a non-linear classifier. By introducing a new discriminant network, Relation Net improved accuracy by 1.2% on Omniglot.

The few-shot learning methodologies most related to ours are Prototypical Network and Relation Net. According to their inherent principles and architectures, we disassemble the network into a representation learning module and a metric/distance learning module, then conduct comprehensive studies on their effects on intrusion detection performance. Based on this, we present two intrusion detection frameworks and perform a comparative study to obtain the best intrusion detection model under circumstances where researchers have only a few labeled samples of novel cyber-attacks.

2.3. Intrusion detection with insufficient labeled samples

While many researchers have noticed that the practicability of IDS in real-world settings is limited by its dependency on a sufficient labeled network traffic dataset, most of them turned to unsupervised scenarios. Unsupervised intrusion detection models assume that the overwhelming majority of network traffic data are normal instances. This hypothesis may incur a high false positive rate, since a small fluctuation of normal behavior can be regarded as an anomaly. In this field, the autoencoder is the fundamental deep architecture. An autoencoder represents data within multiple hidden layers by reconstructing the input data, effectively learning an identity function (Raghavendra Chalapathy, 2019). Zavrak and İskefiyeli (2020) adopted an autoencoder and a variational autoencoder, and compared them with the OCSVM algorithm. Experimental results showed that the AUC value obtained by the variational autoencoder was 0.7596, which was better than that of the autoencoder and OCSVM, but it was not easy to determine an appropriate threshold that provides high detection accuracy or a low false alarm rate. Ieracitano et al. (2020) developed an intelligent IDS based on statistical analysis and an autoencoder. They combined data analysis and statistical techniques for feature extraction, then used an autoencoder to reduce the dimensions of the original input data. The compressed feature vector was used as the input of the final softmax layer for binary classification. The effectiveness of the proposed IDS was tested using the NSL-KDD dataset. An accuracy of 84.21% was achieved, which was superior to algorithms such as LSTM and MLP. Mirsky et al. (2018) presented an unsupervised plug-and-play IDS using an ensemble of autoencoders to collectively differentiate abnormal traffic patterns from normal ones. It extracted damped incremental statistics from input traffic and integrated hierarchical sets of autoencoders to detect anomalies. Experiments indicated that it can be deployed on a lightweight network device and obtained relatively better performance than Isolation Forest and GMM.

Since the few-shot learning problem is relatively new to intrusion detection, we find only the work of Xu et al. (2020) similar to ours. Xu et al. (2020) presented FC-Net, which is basically the same as Relation Net except for the convolution block, to determine whether the input sample is benign or malicious. FC-Net was also
trained in an episode-based manner, and its performance evaluated on CICIDS2017 reached 94.64% in the few-shot scenario. However, FC-Net used only packet content to construct input data in the form of 3D images for CNN. We find that such a method ignores the informative statistical features extracted from network traffic, which can also play a crucial part in intrusion detection. We filled the gap by introducing an autoencoder as a feature extractor into the model architecture, and proved its effectiveness. Beyond that, we built a framework for solving the few-shot learning problem in intrusion detection, and explored how different network architectures contribute to detecting malicious traffic.

3. Architecture of few-shot IDS

In this section, we elaborate the architecture of our proposed intrusion detection framework based on few-shot learning. In order to emphasize its high performance under few-shot conditions, we named it “FS-IDS”, an abbreviation of “few-shot Intrusion Detection System”. First we provide an overview of FS-IDS. Then we elucidate the functions and implementation details of each module in FS-IDS, in the order in which network data is processed. In the last subsection we introduce the training strategy of our framework. The architecture of FS-IDS is shown in Fig. 1.

3.1. Overview

As shown in Fig. 1, the input network traffic data is first fed into the network traffic encoder module to be transformed into encoded vectors in specific formats for subsequent processing. The encoder module can be divided into two parts: a flow traffic encoder using the proposed “GrayScale Flow” method and a feature encoder based on an autoencoder. The flow traffic encoder is responsible for encoding the raw traffic data into a fixed-size matrix, while compressed features of the statistical characteristics are generated by the autoencoder.

When the encoding and transformation of the original dataset is completed, the task generator module splits the processed dataset into different tasks according to a specific algorithm. The purpose of few-shot learning is to make the model learn a general classification capacity, rather than class-specific knowledge, by switching between different tasks. The created task sets are then fed into the feature extraction network to learn generalizable feature maps in a latent embedding space. After that we use a concatenation layer to implement feature fusion, which concatenates the two representation vectors along the length direction to compose the final feature maps of network flow data. By combining the raw traffic data with extracted network characteristics, we believe we can obtain robust, comprehensive representations of network traffic data.

Finally, the corresponding feature maps of each class are compared and distinguished by the distance metric module based on its pre-defined metric measure. Unseen malicious traffic can be assigned to the closest “prototype” embedding generated by feature fusion.

3.2. Network traffic encoder

As mentioned earlier, former researchers proposed various encoding methods to apply machine learning or deep learning to IDS. However, the majority of them used incomplete information for network traffic representation: they collected only the raw network flow or only the statistical features while neglecting the other. Therefore, we propose a novel network traffic encoding method and use it to design a data encoder for network flow representation. Specifically, the proposed method represents the raw content of a network flow in the form of an image for CNN, named “GrayScale Flow”, and encodes network flow attributes using an autoencoder to include comprehensive information about network traffic. The pipeline of the proposed method is shown in Fig. 2.

3.2.1. Image representation for network flow

On the one hand, CNN has been widely applied in computer vision tasks. CNN defines a type of robust, popular neural network designed to process input data stored in arrays (Aggarwal, 2018). The common form of input image for a CNN is a 2D array with uniform shape. On the other hand, a flow of network traffic is defined as the amount of data transmitted between two given communication nodes across a network over a specific period. The hierarchical structure of a network flow is illustrated in Fig. 3, where each network flow is composed of associated sequential packets. Each packet, no matter what network protocol it follows, includes one or more headers as well as a payload of a certain number of bytes. We notice that a network flow is a sequence of bytes, and that each byte takes the same range of values as a pixel of an 8-bit grayscale image. So it is natural to transform network flow data into grayscale images to satisfy the input requirement of CNN, where each byte of the network flow represents a single pixel of the corresponding output image-like data.

Normally, a network flow is regarded as a 1D byte array composed of its packets in chronological order. Xu et al. (2020) transformed a network flow into a 3D array, i.e., the format of sets of color images. Specifically, Xu et al. (2020) unfolded each packet into a single slice of a video stream and arranged the slices in corresponding order. By that means the authors of Xu et al. (2020) used a 3D convolutional network not only to extract spatial features, but also to capture temporal relations between packets. However, we believe this method cannot make full use of the feature extraction capacities of CNN. The convolution layer, which is the core of CNN, uses a square receptive field called a convolution kernel that slides across the whole plane to extract spatial features from each pixel together with its neighbours. The convolution operation makes discriminant results have strong spatial dependencies
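As a minimal sketch of the byte-to-pixel idea behind “GrayScale Flow”, the snippet below packs a flow into a fixed-size matrix of 8-bit values. It assumes the 16 × 256 shape and zero padding that this section specifies. The function name `grayscale_flow`, the truncation of packets longer than 256 bytes, and the zero-filled rows for flows with fewer than 16 packets are assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np

def grayscale_flow(packets, max_packets=16, max_bytes=256):
    """Encode a flow (a list of raw packet byte strings) as a uint8 matrix.

    Each row holds one packet and each column one byte, so every byte
    becomes one grayscale pixel. Hypothetical helper, not the authors' code.
    """
    image = np.zeros((max_packets, max_bytes), dtype=np.uint8)
    for row, packet in enumerate(packets[:max_packets]):
        chunk = packet[:max_bytes]  # assumption: long packets are truncated
        image[row, :len(chunk)] = np.frombuffer(chunk, dtype=np.uint8)
    return image

# Toy flow with two short packets; the remaining rows stay zero-padded.
flow = [bytes([0x45, 0x00, 0x00, 0x3C]), bytes(range(10))]
image = grayscale_flow(flow)
```

Keeping one packet per row preserves packet boundaries in the image, so a 2D convolution kernel sees bytes of the same packet as horizontal neighbours.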
to choose the first 10 to 20 packets to represent the network flow model to complete. The dataset is always split into training set
(Millar et al., 2019; Wang et al., 2018; Xu et al., 2020). For input and test set. In notation, traditional method deals with dataset
size of each packet, We argue that the length of 256 bytes is an ap- D = {Dtrain , Dtest } to minimize a pre-defined loss function L. Stan-
propriate value for containing not only the whole header but also dard training procedure feeds batches of data into model and per-
parts of payload for each packet. The packets whose length is less forms gradient descent iteratively to obtain global optimized solu-
than 256 bytes was padded by zero to satisfy the input require- tion. The final step is testing the trained model on test set which
ment. normally consists of unseen data to evaluate its generalization abil-
Overall, by this way we encoded network flow with a variety ity. Sometimes before the test procedure finetuning with valida-
of protocols, sizes and lengths to uniformed grayscale image-like tion set is also involved. In network intrusion detection field, nor-
data, which take the form of 16 × 256 matrices. mally the task is determined as a binary classification task, i.e.,
training a binary classifier on dataset composed of I network flows
3.2.2. Encoded vector for statistic features D = {(x1 , y1 ), (x2 , y2 ), . . . , (xI , yI )}, where x denotes processed net-
In order to combine the raw traffic with network characteris- work flow data, y ∈ 0, 1 denotes labels of network flow. In general
tics in latent feature space, we use autoencoder to produce a ro- 0 denotes benign traffic and 1 denotes malicious traffic. The model
bust high-level representation. As mentioned earlier, autoencoder is trained to construct a nonlinear function f (x ) = y to discrimi-
(Hinton and Salakhutdinov, 2006) with its variants are kinds of nate whether input network flow is malicious or not.
deep learning model that have been widely adopted in unsuper- However, in few-shot condition, simply following standard
vised intrusion detection. In general, autoencoder consists of two methods always leads to a poor performance. It is generally be-
separate networks: encoder and decoder, including input layer, lieved that gradient-based optimization in high-capacity classifier
hidden layer, output layer. Encoder can be formulated as a map- requires many iterative steps over large sets of data to perform
ping function f (x ) that encodes the input x to its latent feature well. In FSL there are only a limited number of labeled samples
vector x . Decoder generates reconstructed input xˆ = g(x ). When that can be used for training, which can lead to severe overfitting.
the number of hidden layer neurons are less than the number of So in FSL, the whole dataset D = {Dmeta−train , Dmeta−test } is divided
input layer and output layer neurons, encoder compresses infor- into meta-training set and meta-testing set, and the training pro-
mation of original input x by x . We regard the compressed coding cedure composes of many episodes. If the practical application sce-
vector x as the latent feature vector that reflects the essential at- nario for the model is to classify instances from N different classes
tribute of network flow. by providing the classifier with K examples for each class, we call
The architecture of the autoencoder used in FS-IDS is shown in Fig. 5. The autoencoder comprises five fully-connected (FC) layers and one dropout layer for preventing overfitting. The numbers “80”, “30”, “10” and “Features” under each block represent the input and output dimensions of data through the corresponding fully-connected layers. It is worth noting that the training procedure of the autoencoder follows the traditional way rather than the episodic way of few-shot learning. Because the purpose of the autoencoder is unsupervised feature extraction, not data discrimination, the model should be trained using as much data as possible in order to produce robust feature vectors expressing attributes comprehensively. Using this specific training strategy for the autoencoder is beneficial for generating robust and representative features of network flow data.

3.3. Task generator

Conventional deep learning uses an end-to-end training strategy to build a robust classifier, i.e., defining a target function to be optimized with a large labeled dataset as the specific task for

it a N-way K-shot task. If the final mission for the classifier is an N-way K-shot task, two batches of N-way K-shot data are sampled in each episode during meta-training. One of them constitutes the “support set” $D_{support}$ and the other constitutes the “query set” $D_{query}$. The model is then trained on data from the support set and tested on data from the query set. Each episode therefore includes its own training set (support set) and test set (query set) as well as a complete standard pipeline of training and testing, so each episode aims to make the model learn to solve the specific task from only N × K samples.

In FSL, K is much smaller than the per-class sample counts of standard deep learning, and tends to be 1, 5, 10 and so on. Through the episode-based training procedure on multiple tasks with similar data composition, we hope the model does not focus on a single classification task, but learns meta-knowledge that is independent of the specific task and related to the general capacity of discrimination, so that it can still maintain good performance even when facing unseen data.

The meta-test set is used to simulate the task that the model ultimately needs to handle. Normally this is to classify a sample that was never seen before using a few accessible samples. The structure
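The episodic sampling described in this section can be illustrated with a minimal sketch. It assumes the dataset is a plain list of (features, label) pairs and that the support and query batches of a class do not share samples; the function name `sample_episode` and the toy data are hypothetical and are not taken from the FS-IDS implementation.

```python
import random
from collections import defaultdict


def sample_episode(dataset, n_way, k_shot, rng=None):
    """Draw one N-way K-shot episode from a list of (features, label) pairs.

    Returns (support, query); each holds n_way * k_shot samples.
    Assumption: support and query samples of a class are disjoint.
    """
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for features, label in dataset:
        by_class[label].append((features, label))

    # A class needs 2 * k_shot samples to fill both batches without overlap.
    eligible = sorted(c for c, items in by_class.items() if len(items) >= 2 * k_shot)
    if len(eligible) < n_way:
        raise ValueError("not enough classes with 2 * k_shot samples")

    support, query = [], []
    for label in rng.sample(eligible, n_way):
        picked = rng.sample(by_class[label], 2 * k_shot)
        support.extend(picked[:k_shot])
        query.extend(picked[k_shot:])
    return support, query


# Hypothetical toy data: 12 two-dimensional samples for each of three classes.
toy_dataset = [
    ([float(i), float(c)], name)
    for c, name in enumerate(["benign", "DoS", "PortScan"])
    for i in range(12)
]
support_set, query_set = sample_episode(
    toy_dataset, n_way=2, k_shot=5, rng=random.Random(0)
)
print(len(support_set), len(query_set))  # 10 10
```

Calling `sample_episode` repeatedly yields a stream of small tasks with the same composition, which is what lets the model be trained on many tasks rather than on one fixed classification problem.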
J. Yang, H. Li, S. Shao et al. Computers & Security 122 (2022) 102899
Table 1
Data composition of CICIDS2017.
Then the distance metric module calculates a distance metric in the embedding space to recognise network data according to its nearest neighbour, i.e., it assigns the input data to the class of the nearest prototype. For the distance metric, we propose two implementations with reference to previous research in computer vision: a Euclidean-based distance metric and a CNN-based distance metric.

3.5.1. Euclidean-based distance metric
On the basis of the prototypical network (Snell et al., 2017), the Euclidean-based distance metric module adopts Euclidean distance to measure the distance from each query point to the prototypes calculated from the support points. Specifically, the distance metric module calculates the Euclidean distance between the received query point and the prototypes of all classes. Given these distances, the module computes a distribution over classes for query point $x_j$ based on a softmax over distances to the prototypes in the embedding space:

$$P(y = k \mid x) = \frac{\exp(-d(f_\phi(x_j), p_k))}{\sum_{k'} \exp(-d(f_\phi(x_j), p_{k'}))} \qquad (2)$$

Through the softmax function, the module outputs a probability distribution of the received query samples over the different classes. The model assigns the query sample to the class with the largest probability, which corresponds to the nearest prototype in the embedding space.

3.5.2. CNN based distance metric
On the basis of the relation net (Sung et al., 2018), the CNN-based distance metric replaces the simple linear metric with a neural network to learn a deep, non-linear metric. The architecture of the neural network in the CNN-based distance metric module is shown in Fig. 7. “conv1” denotes a 2-dimension convolution layer with 3 × 3 kernel size, 2 channels and 2 strides. “MP(2)” denotes an 1-dimension maxpooling layer with 2 × 2 kernel size. “conv2” denotes a 2-dimension convolution layer with 2 × 2 kernel size and 1 channel.

As mentioned before, the concatenation layer concatenates $p_k$ and $f_\phi(x_j)$ in depth in order to construct a high-level vector $C(p_k, f_\phi(x_j))$ containing features derived from both the support point and the query point, where $C$ denotes the concatenation operation. The concatenated embedding vectors are then fed into the distance learning CNN $g_\psi$ to obtain a scalar $r_{k,j}$ in the range [0, 1]. $r_{k,j}$ reflects the similarity between the query sample $f_\phi(x_j)$ and the corresponding class prototype $p_k$, which can be formulated as:

$$r_{k,j} = g_\psi(C(p_k, f_\phi(x_j))) \qquad (3)$$

Since the output score $r_{k,j}$ is between 0 and 1, the similarity can represent the probability that the embedding of the query point $f_\phi(x_j)$ belongs to class $k$. The module assigns $x_j$ to the class $k$ with the largest $r_{k,j}$.

3.6. Training strategy

The general pipeline by which the model is trained in an episode follows similar principles. Once the task generator constructs a specific task, i.e., a support set and a query set, the feature extraction network extracts their embedding features. The distance metric module calculates the prototype representation of each class according to the support points, and identifies the categories of the query points based on its pre-defined metric. Finally, the loss function is computed from the discrimination results and optimized through back-propagation. However, the training strategies differ somewhat between the implementations of the distance metric module. Next we elaborate the training strategy of each implementation.
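For the Euclidean-based implementation, the forward pass of one episode (prototypes, distances, softmax of Eq. (2), and a negative log-probability loss) can be sketched as below. This is a hedged illustration only: it assumes each prototype is the mean of the support embeddings of its class, as in prototypical networks, and it uses the plain (not squared) Euclidean distance. The embeddings are hypothetical hand-written vectors standing in for the output of the feature extraction network.

```python
import math


def prototype(vectors):
    """Mean of the support embeddings of one class (assumed prototype rule)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def class_probabilities(query_emb, prototypes):
    """Softmax over negative Euclidean distances to every prototype."""
    dists = {k: math.dist(query_emb, p) for k, p in prototypes.items()}
    # Shift by the smallest distance for numerical stability;
    # the softmax value is unchanged by a common shift.
    d_min = min(dists.values())
    weights = {k: math.exp(-(d - d_min)) for k, d in dists.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def nll_loss(probs, true_class):
    """Negative log-probability of the true class."""
    return -math.log(probs[true_class])


# Hypothetical 2-D embeddings for a 2-way 2-shot support set.
support_emb = {
    "benign": [[0.0, 0.0], [0.2, 0.0]],
    "attack": [[1.0, 1.0], [1.0, 0.8]],
}
prototypes = {k: prototype(v) for k, v in support_emb.items()}

query_emb = [0.0, 0.1]
probs = class_probabilities(query_emb, prototypes)
predicted = max(probs, key=probs.get)
loss = nll_loss(probs, "benign")
print(predicted)  # benign
```

The predicted class is the one with the largest probability, which coincides with the nearest prototype because the softmax is monotone in the negative distance.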
3.6.1. Euclidean based distance metric
For the Euclidean-based distance metric module, since the model output is a probability distribution $P(y = k \mid x)$, we choose the negative log-probability as the loss function $L$:

$$L = -\log P(y = k \mid x) = d(f_\phi(x_j), p_k) + \log \sum_{k'} \exp(-d(f_\phi(x_j), p_{k'})) \qquad (4)$$

The learning procedure proceeds by minimizing $L$ via stochastic gradient descent.

correct predictions to the total number of predictions, which measures the capability of the model to classify correctly. Recall measures the percentage of positives that the model classifies correctly, so recall is calculated as the ratio of the number of true positives to the total number of actual positive samples. In multiclass settings, we extended accuracy and recall to micro-accuracy and micro-recall, which are the mean values of accuracy and recall over all classes. They are also widely used in multiclass classification experiments to evaluate model performance.
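The two evaluation metrics can be written out as a short sketch. The helper names and the toy labels are hypothetical; the multiclass helper simply averages per-class recall, following the description in the text.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """True positives divided by all samples whose true label is `positive`."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


def mean_recall(y_true, y_pred):
    """Mean of the per-class recall values over all classes in y_true."""
    classes = sorted(set(y_true))
    return sum(recall(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical labels for five flows.
y_true = ["attack", "attack", "attack", "benign", "benign"]
y_pred = ["attack", "attack", "benign", "benign", "attack"]

print(accuracy(y_true, y_pred))                    # 0.6
print(round(recall(y_true, y_pred, "attack"), 4))  # 0.6667
print(round(mean_recall(y_true, y_pred), 4))       # 0.5833
```

Here two of the three attack flows are detected, so recall for the attack class is 2/3 even though overall accuracy is 3/5; the two metrics answer different questions.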
Table 2
Average accuracy and recall of Euclidean-based distance metric module and CNN-based distance metric module of FS-IDS on various attacks.

Attack types    Accuracy              Recall
                Euclidean   CNN       Euclidean   CNN
FTP-Patator     0.951       0.9157    0.9928      0.9402
SSH-Patator     0.996       0.9795    1           0.9963
PortScan        0.9968      0.9537    0.9983      0.985
DoS             0.9865      0.9022    0.9937      0.9527

(Xu et al., 2020) to observe detection results. Moreover, we eliminated the feature fusion to see whether the performance degraded. By ablation study, we can figure out how much these factors help FS-IDS exceed others by using lot fewer available samples. The data composition and parallel experiments settings remained the same as in comparison study.

The final question is whether FS-IDS can be used in multiclass … a few malicious samples are accessible. To simulate the real world conditions, the model was trained using a mass of data belonging to known attacks in training set and only a few new attack samples in test set. The model was evaluated on blended test data which covers all test classes to test its detection capacity of both seen or unseen attacks in multiclass settings. We randomly selected 2600 samples for training set and 600 samples for test set in each experiment.

All experiments are performed on two NVIDIA GeForce GTX 1080 Ti GPUs and Intel(R) Xeon(R) CPU E5-2630 v4 @2.20 GHz. The FS-IDS is implemented based on Software platforms PyTorch 1.3.1, CUDA 10.1.105 and cuDNN 7.

Table 3
Comparison of results, number of samples and data sources of intrusion detection methods and related research works.
(Columns: method, backbone network, number of samples, data sources, Acc, Rec.)

test classes due to our experimental settings. And the value in each block is the corresponding evaluation metric on attack class speci-
Fig. 8. Accuracy and recall of Euclidean-based distance metric module (left) and CNN-based distance metric module (right) FS-IDS on different tasks.
Xu et al. (2020), which we believe are the best performance of FC-Net.

The first noteworthy observation from Table 3 is that the overwhelming majority of IDS research is based on “big data”. As far as we know, FC-Net is the first as well as the only IDS model achieving intrusion detection in few-shot conditions. The dependency of most IDS research on big data is reflected in the third column of Table 3, which gives the number of training samples used by each IDS. Note that the number of labeled samples used by all these methods reaches hundreds of thousands, or even millions, which requires tremendous human effort to collect and label the data manually. The success of methods using big data has demonstrated the capacity of deep learning when big data is available. However, when a huge labeled dataset is not available, there is little that traditional methods can do.

As shown in Table 3, FS-IDS outperforms GA-based Adaptive Method (Resende and Drummond, 2018), DT and Rule Based IDS (Ahmim et al., 2019) and DBN-based IDS (Manimurugan et al., 2020) in both accuracy and recall. Although FS-IDS doesn’t exceed SU-IDS (Min et al., 2018), Deep Hierarchical IDS (Zhang et al., 2019), DNN-kNN IDS (de Souza et al., 2020) and Multi-Stage Optimized ML-based IDS (Injadat et al., 2021), all of these works rely on a large-scale dataset. The numbers of samples used by SU-IDS (Min et al., 2018), Deep Hierarchical IDS (Zhang et al., 2019), DNN-kNN IDS (de Souza et al., 2020) and Multi-Stage Optimized ML-based IDS (Injadat et al., 2021) reach 40,000, 553,850, 225,745 and 2,830,540, respectively. With a slight decline of about 2% in accuracy and 0.6% in recall, FS-IDS requires far fewer training samples than these methods. It should be noted that the result obtained by FS-IDS is based on only 5 labeled samples used in the training process. For the rest of the related works (Andresini et al., 2021; Faker and Dogdu, 2019), FS-IDS outperforms them in terms of accuracy or recall.

The last three rows are works based on few-shot learning. FC-Net (Xu et al., 2020) obtained 94.33% accuracy and 99.17% recall according to the results presented. When applied to the same data as our proposed FS-IDS, the performance of FC-Net degraded to 91.57% and 98.3% respectively, whereas FS-IDS obtained 97.51% accuracy and 99% recall. To present the comparison with FC-Net in a comprehensive manner, Table 4 provides the results for all of the chosen attack types in detail. As shown in Table 4, FS-IDS outperforms FC-Net in accuracy on all 5 meta-test attack classes. In recall, FS-IDS is higher on SSH-Patator and DDoS and within 0.003 of FC-Net on FTP-Patator, PortScan and DoS.

Table 4
Average accuracy and recall of FS-IDS and FC-Net(paper).

Attack types    Accuracy            Recall
                FS-IDS    FC-Net    FS-IDS    FC-Net
FTP-Patator     0.951     0.9454    0.9928    0.9956
SSH-Patator     0.996     0.9491    1         0.9992
PortScan        0.9968    0.9495    0.9983    0.9988
DoS             0.9865    0.9505    0.9937    0.9964
DDoS            0.945     0.9165    0.9655    0.9646

Based on observation of these results, we demonstrate the superiority of FS-IDS and conclude as follows:

A: FS-IDS obtained a comparable, or even higher detection accuracy and recall on novel attacks than previous works by using much fewer labeled samples: only 5 malicious samples.
B: FS-IDS achieved the state-of-the-art performance among intru-

Table 5
Components utilized by FS-IDS, Model I and Model II.

FS-IDS      “GrayScale Flow” Encoding Method + Feature Fusion
Model I     “GrayScale Flow” Encoding Method
Model II    Video-like 3D Encoding

Table 6
Accuracy and recall of FS-IDS in 1, 3, and 5 shot.

                1-shot          3-shot          5-shot
                Acc     Rec     Acc     Rec     Acc     Rec
FTP-Patator     0.86    0.87    0.93    0.91    0.95    0.99
SSH-Patator     0.90    0.88    0.97    0.98    0.99    1
PortScan        0.90    0.86    0.95    0.93    0.99    0.99
DoS             0.85    0.84    0.95    0.94    0.98    0.99
DDoS            0.78    0.83    0.84    0.92    0.94    0.96
Fig. 9. Accuracy and recall of FS-IDS, Model I and Model II on various attacks.
Fig. 10. Accuracy and recall of FS-IDS on various attacks under multiclass conditions.
test the practicability of FS-IDS, we simulated the real-world conditions by getting model tested on network traffic including various attacks. Results showed that FS-IDS achieved over 90% accuracy and recall both on the seen or unseen attacks under the worst circumstances.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Jingcheng Yang: Conceptualization, Methodology, Software, Writing – original draft. Hongwei Li: Investigation, Writing – review & editing. Shuo Shao: Writing – review & editing. Futai Zou: Data curation. Yue Wu: Supervision, Writing – review & editing.

References

Aggarwal, C.C., 2018. Neural Networks and Deep Learning - A Textbook. Springer.
Ahmim, A., Maglaras, L., Ferrag, M.A., Derdour, M., Janicke, H., 2019. A novel hierarchical intrusion detection system based on decision tree and rules-based models. In: 2019 15th International Conference on Distributed Computing in Sensor Systems (DCOSS), pp. 228–233. doi:10.1109/DCOSS.2019.00059.
Andresini, G., Appice, A., Malerba, D., 2021. Nearest cluster-based intrusion detection through convolutional neural networks. Knowledge-Based Syst. 216, 106798. doi:10.1016/j.knosys.2021.106798.
Dhanabal, L., Shantharajah, S., 2015. A study on NSL-KDD dataset for intrusion detection system based on classification algorithms.
Faker, O., Dogdu, E., 2019. Intrusion Detection Using Big Data and deep Learning Techniques. In: ACM SE ’19. Association for Computing Machinery, New York, NY, USA, pp. 86–93. doi:10.1145/3299815.3314439.
Hinton, G.E., Salakhutdinov, R.R., 2006. Reducing the dimensionality of data with neural networks. Science 313 (5786), 504–507. doi:10.1126/science.1127647.
Ieracitano, C., Adeel, A., Morabito, F.C., Hussain, A., 2020. A novel statistical analysis and autoencoder driven intelligent intrusion detection approach. Neurocomputing 387, 51–62. doi:10.1016/j.neucom.2019.11.016.
Injadat, M., Moubayed, A., Nassif, A.B., Shami, A., 2021. Multi-stage optimized machine learning framework for network intrusion detection. IEEE Trans. Netw. Serv. Manag. 18 (2), 1803–1816. doi:10.1109/TNSM.2020.3014929.
Jiangxing, W., Jianhua, L., Xinsheng, J., 2018. Security for cyberspace: challenges and opportunities. Front. Inf. Technol. Electron. Eng. 19 (12), 1459–1461. doi:10.1631/FITEE.1840000.
Kim, T., Suh, S.C., Kim, H., Kim, J., Kim, J., 2018. An encoding technique for CNN-based network anomaly detection. In: 2018 IEEE International Conference on Big Data (Big Data), pp. 2960–2965. doi:10.1109/BigData.2018.8622568.
Koch, G., Zemel, R., Salakhutdinov, R., 2015. Siamese neural networks for one-shot image recognition. ICML’15.
Lake, B., Salakhutdinov, R., Gross, J., Tenenbaum, J.B., 2011. One shot learning of simple visual concepts. In: Proceedings of the 33rd Annual Conference of the Cognitive Science Society.
Li, Z., Qin, Z., Huang, K., Yang, X., Ye, S., 2017. Intrusion detection using convolutional neural networks for representation learning. In: Liu, D., Xie, S., Li, Y., Zhao, D., El-Alfy, E.S.M. (Eds.), Neural Information Processing. Springer International Publishing, Cham, pp. 858–866.
Liao, H.J., Richard Lin, C.H., Lin, Y.C., Tung, K.Y., 2013. Intrusion detection system: a comprehensive review. J. Netw. Comput. Appl. 36 (1), 16–24. doi:10.1016/j.jnca.2012.09.004.
Lippmann, R.P., Fried, D.J., Graf, I., Haines, J.W., Kendall, K.R., McClung, D., Weber, D., Webster, S.E., Wyschogrod, D., Cunningham, R.K., Zissman, M.A., 2000. Evaluating intrusion detection systems: the 1998 DARPA off-line intrusion detection evaluation. In: Proceedings DARPA Information Survivability Conference and Exposition. DISCEX’00, vol. 2, pp. 12–26. doi:10.1109/DISCEX.2000.821506.
Malaiya, R.K., Kwon, D., Kim, J., Suh, S.C., Kim, H., Kim, I., 2018. An empirical evaluation of deep learning for network anomaly detection. In: 2018 International Conference on Computing, Networking and Communications (ICNC), pp. 893–898. doi:10.1109/ICCNC.2018.8390278.
Manimurugan, S., Al-Mutairi, S., Aborokbah, M.M., Chilamkurti, N., Ganesan, S., Patan, R., 2020. Effective attack detection in internet of medical things smart environment using a deep belief neural network. IEEE Access 8, 77396–77404. doi:10.1109/ACCESS.2020.2986013.
Millar, K., Cheng, A., Chew, H.G., Lim, C.C., 2019. Using convolutional neural networks for classifying malicious network traffic. In: Alazab, M., Tang, M. (Eds.), Deep Learning Applications for Cyber Security. Springer International Publishing, Cham, pp. 103–126.
Min, E., Long, J., Liu, Q., Cui, J., Cai, Z., Ma, J., 2018. SU-IDS: a semi-supervised and unsupervised framework for network intrusion detection. In: Sun, X., Pan, Z., Bertino, E. (Eds.), Cloud Computing and Security. Springer International Publishing, Cham, pp. 322–334.
Mirsky, Y., Doitshman, T., Elovici, Y., Shabtai, A., 2018. Kitsune: an ensemble of autoencoders for online network intrusion detection. 25th Annual Network and Distributed System Security Symposium, NDSS 2018, San Diego, California, USA, February 18–21, 2018. The Internet Society.
O’Neill, P.H., 2021. 2021 has broken the record for zero-day hacking attacks. MIT Technol. Rev. September 23, 2021. https://www.technologyreview.com/2021/09/23/1036140/2021-record-zero-day-hacks-reasons/
Pektaş, A., Acarman, T., 2019. A deep learning method to detect network intrusion through flow based features. Int. J. Netw. Manag. 29 (3). doi:10.1002/nem.2050.
Raghavendra Chalapathy, S.C., 2019. Deep learning for anomaly detection: a survey.
Resende, P.A.A., Drummond, A.C., 2018. Adaptive anomaly-based intrusion detection system using genetic algorithm and profiling. Secur. Privacy 1 (4), e36. doi:10.1002/spy2.36.
Sharafaldin, I., Habibi Lashkari, A., Ghorbani, A., 2018. Toward generating a new intrusion detection dataset and intrusion traffic characterization. pp. 108–116. doi:10.5220/0006639801080116.
Snell, J., Swersky, K., Zemel, R., 2017. Prototypical Networks for Few-Shot Learning. In: NIPS’17. Curran Associates Inc., Red Hook, NY, USA, pp. 4080–4090.
de Souza, C.A., Westphall, C.B., Machado, R.B., Sobral, J.B.M., dos Santos Vieira, G., 2020. Hybrid approach to intrusion detection in fog-based IoT environments. Comput. Netw. 180, 107417. doi:10.1016/j.comnet.2020.107417.
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H.S., Hospedales, T.M., 2018. Learning to compare: relation network for few-shot learning. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1199–1208. doi:10.1109/CVPR.2018.00131.
Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.A., 2009. A detailed analysis of the KDD CUP 99 data set. In: 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, pp. 1–6. doi:10.1109/CISDA.2009.5356528.
Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., Wierstra, D., 2016a. Matching Networks for one Shot Learning. In: NIPS’16. Curran Associates Inc., Red Hook, NY, USA, pp. 3637–3645.
Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., Wierstra, D., 2016b. Matching networks for one shot learning. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. In: NIPS’16. Curran Associates Inc., Red Hook, NY, USA, pp. 3637–3645.
Wang, W., Sheng, Y., Wang, J., Zeng, X., Ye, X., Huang, Y., Zhu, M., 2018. HAST-IDS: learning hierarchical spatial-temporal features using deep neural networks to improve intrusion detection. IEEE Access 6, 1792–1806. doi:10.1109/ACCESS.2017.2780250.
Wang, Y., Yao, Q., 2020. Few-shot learning: a survey.
Xu, C., Shen, J., Du, X., 2020. A method of few-shot network intrusion detection based on meta-learning framework. IEEE Trans. Inf. Forensics Secur. 15, 3540–3552. doi:10.1109/TIFS.2020.2991876.
Zavrak, S., İskefiyeli, M., 2020. Anomaly-based intrusion detection from network flow features using variational autoencoder. IEEE Access 8, 108346–108358. doi:10.1109/ACCESS.2020.3001350.
Zhang, Y., Chen, X., Jin, L., Wang, X., Guo, D., 2019. Network intrusion detection: based on deep hierarchical network and original flow data. IEEE Access 7, 37004–37016. doi:10.1109/ACCESS.2019.2905041.

Jingcheng Yang received the B.S. degree in information science from Southeast University, China, in 2014 and the M.S. degree in cyber science and engineering from Shanghai Jiao Tong University, China, in 2018. He is currently pursuing the Ph.D. degree in cyber science and engineering in Shanghai Jiao Tong University. His research interests include artificial intelligence, data privacy and intrusion detection.

Hongwei Li was born in 1998. He received the B.S. degree in information engineering from Shanghai Jiao Tong University, Shanghai, in 2020. He is currently pursuing the M.S. degree in information engineering in Shanghai Jiao Tong University. His research interests include artificial intelligence and vulnerability detection and exploitation.

Futai Zou is currently an Associate Professor in School of Cyber Science and Engineering, Shanghai Jiao Tong University, China. He received the Ph.D. degree in computer science from Shanghai Jiao Tong University in 2005. His current research interests mainly focus on network attack and defense technology.

Shuo Shao (Member, IEEE) received the B.S. degree in information science from Southeast University, China, in 2011, the M.A.Sc. degree in electrical and computer engineering from McMaster University, Canada, in 2013, and the Ph.D. degree from Texas A&M University, USA, in 2017. In 2017, he joined the School of Electronics, Information and Electrical Engineering, Shanghai Jiao Tong University, China. His research interests include network information theory, algebraic code, and machine learning.

Yue Wu received the B.S. degree from Dept. of Information and Electronics, Zhejiang University, Hangzhou, China in 1989, M.S. and Ph.D. degree from Dept. of Radio Engineering, Southeast University, Nanjing, China in 1998 and 2004 respectively. He is currently a Professor with School of Electronic Information and Electrical Engineering, Shanghai Jiaotong University, Shanghai, China. His research interests include vehicular networks, wireless network security, security and trust for IoT. He is a member of IEEE and IEEE Communications and Information Security Technical Committee.