
2021 IEEE 24th International Conference on Computational Science and Engineering (CSE)
DOI: 10.1109/CSE53436.2021.00033

FlowGAN - Synthetic Network Flow Generation using Generative Adversarial Networks

Liam Daly Manocchio*, Siamak Layeghy†, Marius Portmann‡
School of Information Technology and Electrical Engineering
University of Queensland
Brisbane, QLD 4072, Australia
*[email protected], †[email protected], ‡[email protected]
Abstract—Generative Adversarial Networks (GANs) are known to be a powerful machine learning tool for realistic data synthesis. In this paper, we explore GANs for the generation of synthetic network flow data (NetFlow), e.g. for the training of Network Intrusion Detection Systems. GANs are known to be prone to modal collapse, a condition where the generated data fails to reflect the diversity (modes) of the training data. We experimentally evaluate the key GAN-based approaches in the literature for the synthetic generation of network flow data, and demonstrate that they indeed suffer from modal collapse. To address this problem, we present FlowGAN, a network flow generation method which mitigates the problem of modal collapse by applying the recently proposed concept of Manifold Guided Generative Adversarial Networks (MGGAN). Our experimental evaluation shows that FlowGAN is able to generate much more realistic network traffic flows compared to the state-of-the-art GAN-based approaches. We quantify this significant improvement of FlowGAN by using the Wasserstein distance between the statistical distributions of key features of the generated flow data and the corresponding distributions in the training data set.

Keywords—NetFlow, Generative Adversarial Networks (GAN), Modal collapse, Network Intrusion Detection System (NIDS), Synthetic Dataset

I. Introduction

Recently, considerable research effort has been devoted to the development of Machine Learning (ML) based Network Intrusion Detection Systems (NIDSs). As with any ML-based application, this requires access to a large amount of quality training data. However, large scale network data traces from real production networks are hard to obtain, e.g. due to privacy and security concerns.

There are multiple NIDS datasets that have been generated in different test-beds, and have been made publicly available in a flow format, e.g. NetFlow [1]. These datasets are widely used for training and evaluation of ML-based NIDSs. However, recent work has shown that the statistical properties of these datasets are quite different from the network traffic of real production networks. In particular, the authors in [2] demonstrate that there are significant differences between the statistics of traffic features of benign traffic from three publicly available synthetic datasets, compared to those obtained from real networks. As a motivating example of the results in [2], Figure 1 shows the Cumulative Distribution Function (CDF) of the flow duration from real-world flow data (ISP and UQ) obtained from the University of Queensland's network and an Australian ISP, compared to the benign traffic of three widely used benchmark NIDS datasets (CIC-IDS [3], ToN-IoT [4] and UNSW-NB15 [5]), which were generated on small dedicated test-beds. This figure shows that while the flow duration distributions of two quite different real-world networks are similar, they are vastly different from the three NIDS datasets generated on small-scale test-beds. This provides a strong motivation for a tool that is able to generate synthetic network flow data that more closely matches the statistical properties of real-world networks.

Generative Adversarial Networks (GANs) are a powerful deep learning method that has been successfully used to generate highly realistic synthetic data in a number of applications, such as the generation of synthetic faces [6]. In this paper, we explore the use of GANs for the generation of synthetic network flow data (NetFlow), with the aim for these to be used for training and evaluation of ML-based Network Intrusion Detection Systems. We specifically focus on the generation of benign network traffic, and leave the generation of malicious or attack traffic for future work. Several related works have attempted to use GANs to generate network flow data, as discussed in more detail in Section III. GANs are known to be difficult to train, and one of the key problems often encountered is modal collapse, where the GAN fails to replicate the diversity and modes available in the training data set. As a result, the generated data is not very realistic, and has significantly different statistical properties from the training data [7].

In this paper, we have replicated the state-of-the-art GAN-based traffic generation methods, which are based on a Wasserstein GAN with Gradient Penalty (WGAN-GP) [8] [9]. Our results show that this approach suffers from modal collapse, and as a result, the generated network flow data is unable to accurately replicate the statistical properties of the training data, and hence that of real networks. To address this limitation, this paper proposes FlowGAN, a new GAN-based method that allows the synthetic generation of more realistic network flow data.

In order to address the problem of modal collapse, FlowGAN implements the concept of a Manifold Guided Generative Adversarial Network (MGGAN) [10], a recently introduced approach to limit modal collapse in GANs. The basic idea is to combine the GAN with a supervisory component which induces the GAN to learn all modes in the training data distribution.

Our experimental evaluation shows that FlowGAN is able to generate network traffic flows that are much more realistic compared to the state-of-the-art GAN-based approaches. We quantify this significant improvement of FlowGAN by considering the Wasserstein distance between the statistical distributions of key features of the generated flow data and the corresponding distributions in the training data set. Compared to the traditional GAN-based approach, FlowGAN reduces the Wasserstein distance between the synthetically generated data and the training (true) data by a factor of 3 for the packet size feature, and a factor of almost 6 for the flow duration and flow size features.

This means that the data synthetically generated by FlowGAN is much closer to the training data, and hence much more realistic, than the data generated by current GAN-based flow data generators. This paper specifically focuses on the generation of benign (non-attack) traffic in NetFlow format. However, we believe the proposed approach can be generalised to attack traffic and different flow formats. These considerations are beyond the scope of this paper, and provide scope for future work.

The rest of the paper is organised as follows. Section II provides the relevant background on NetFlow and Generative Adversarial Networks, and Section III discusses key related works. Section IV explains the traditional, state-of-the-art network flow generation approach based on Wasserstein GANs and demonstrates its limitations. Section V presents FlowGAN, our proposed improved approach, which is evaluated in Section VI. Finally, Section VII concludes the paper.

Figure 1. CDF of Flow Duration - Real versus Generated Data [2]

II. Background

A. NetFlow

The monitoring of network traffic is critical for a number of reasons, such as network management and security. There are two main approaches to this: packet-based and flow-based. The packet-based approach aims to capture packet headers and payload as they are sent across the network. The flow-based approach aims to collect aggregate or meta information based on a sequence of packets between two endpoints. Continuous packet-based network monitoring is very resource intensive, and is typically not feasible for large scale networks. Furthermore, full packet capture raises privacy concerns, since packet payloads of potentially sensitive information are collected. Flow-based network traffic monitoring, on the other hand, provides a highly compressed summary of the network traffic; it is therefore much more scalable than packet-based monitoring, and is widely deployed in large scale networks. A wide range of tools for flow-based traffic exporting and collection are readily available.

NetFlow [11] is one of the most common flow-based formats for recording network traffic. A network flow is an aggregation of a sequence of packets in a conversation (either unidirectional or bidirectional), with the same source and destination IP, source and destination port, and transport protocol. Bidirectional NetFlow captures packet and byte counts, and a range of other features, in both directions. By default, for TCP, when a conversation ends the corresponding flow record is exported, and a new flow will be generated if there is another conversation between the same two endpoints. There is also the option to configure fixed and sliding time windows for this, which forces a flow to be exported and a new one started after some time, even if a conversation is still ongoing. A typical choice for this window size is 2 minutes.

There are several versions of NetFlow, the latest being NetFlow v9, which is also the basis of IPFIX [12], the corresponding IETF standard. There are many fields available in NetFlow v9, which can provide a large amount of detailed information about network traffic. However, for the purpose of this paper, we focus on the subset of attributes defined in NetFlow v5 [11], as shown in Table I. These fields or attributes are the most relevant in our context, and are supported by all versions of NetFlow and IPFIX [12].

Table I. Relevant NetFlow v5 Fields

Field                 Description
FIRST_SWITCHED        The time in seconds that the first packet in the flow was received
LAST_SWITCHED         The time in seconds that the last packet of this flow was received
IPV4_SRC/DST_ADDR     The source/destination IPv4 address
L4_SRC/DST_PORT       The source/destination TCP/UDP port
IN/OUT_BYTES          Sum of incoming/outgoing bytes
IN/OUT_PKTS           Sum of incoming/outgoing packets
PROTOCOL              The IP protocol type
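To make the flow abstraction concrete, the following minimal Python sketch (our illustration, not an actual NetFlow exporter; field names follow Table I, and packet records are assumed to be simple dicts) aggregates packets into flow records keyed by the 5-tuple, with an active timeout that forces long-running conversations to be exported:

    # Minimal illustration of NetFlow-style aggregation (hypothetical helper).
    # A flow is keyed by the 5-tuple; an active timeout forces export after
    # ACTIVE_TIMEOUT seconds (the 2-minute window mentioned above).
    ACTIVE_TIMEOUT = 120

    def flow_key(pkt):
        # pkt: {"src", "dst", "sport", "dport", "proto", "ts", "bytes"}
        return (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"], pkt["proto"])

    def aggregate(packets):
        flows, active = [], {}
        for pkt in sorted(packets, key=lambda p: p["ts"]):
            key = flow_key(pkt)
            rec = active.get(key)
            if rec is not None and pkt["ts"] - rec["FIRST_SWITCHED"] > ACTIVE_TIMEOUT:
                flows.append(active.pop(key))   # export and start a new flow
                rec = None
            if rec is None:
                rec = active[key] = {"FIRST_SWITCHED": pkt["ts"],
                                     "LAST_SWITCHED": pkt["ts"],
                                     "IN_BYTES": 0, "IN_PKTS": 0}
            rec["LAST_SWITCHED"] = pkt["ts"]
            rec["IN_BYTES"] += pkt["bytes"]
            rec["IN_PKTS"] += 1
        return flows + list(active.values())

For brevity the sketch is unidirectional; a bidirectional collector would additionally match the reverse 5-tuple and accumulate the OUT_BYTES/OUT_PKTS counters.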
B. Generative Adversarial Network (GAN)

Generative Adversarial Networks (GANs) are a relatively recent development in Machine Learning (ML). As the name suggests, GANs aim to generate realistic synthetic data. GANs do this using a contest between a generator network G and a discriminator network D_x, as shown in Figure 2. The generator tries to generate fake samples that resemble the real input samples, and the discriminator attempts to distinguish these fake samples from the real samples. Training continues until the generator is so convincing that the discriminator is unable to differentiate between the real and fake samples.

Figure 2. Basic GAN Architecture

The inputs to a GAN, as shown in Figure 2, are the random input z and the training data x_real. The random input (latent space z) is usually sampled from a normal random distribution. The generator G attempts to learn a mapping between this and realistically generated outputs that closely resemble the input training data. For instance, an image GAN might take a random vector as input and generate a picture of a face. Then, by providing different random inputs, different outputs (faces) can be generated.

The training data x_real is a set of real examples, e.g. images of real faces. The discriminator D_x compares these real samples to the output x_fake of the generator, i.e. the samples generated from z, and attempts to learn the difference. In turn, the generator G attempts to fool the discriminator, by learning to generate better fake samples that more closely resemble the real data.

The generator and discriminator are trained in sequence, over a number of rounds, with one model being frozen (prevented from learning) while the other is trained adversarially. This, over time, yields more realistic samples as both the generator and discriminator learn. The adversarial aspect in this context refers to the fact that the two networks compete with each other, which can result in a situation where one network overpowers the other and performance improves no further.
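To make this alternating scheme concrete, the sketch below (a simplified illustration using the Keras API, not code from any particular GAN implementation) trains the discriminator and generator in turn. It assumes the standard Keras idiom: the discriminator is compiled stand-alone while trainable, and the combined model gan = Sequential([generator, discriminator]) is compiled after freezing the discriminator, so that the combined model only updates the generator's weights.

    import numpy as np

    def train_round(generator, discriminator, gan, x_real, latent_dim):
        """One adversarial round: update D on real vs. fake, then update G."""
        m = len(x_real)
        z = np.random.normal(size=(m, latent_dim))
        x_fake = generator.predict(z)

        # Step 1: train the discriminator on real (label 1) vs. fake (label 0).
        d_loss = discriminator.train_on_batch(x_real, np.ones((m, 1)))
        d_loss += discriminator.train_on_batch(x_fake, np.zeros((m, 1)))

        # Step 2: train the generator through the frozen discriminator; the
        # combined model is rewarded when fakes are (mis)classified as real.
        g_loss = gan.train_on_batch(np.random.normal(size=(m, latent_dim)),
                                    np.ones((m, 1)))
        return d_loss, g_loss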
C. Wasserstein GAN

GANs have been successful in generating synthetic data in a number of application areas, such as the generation of synthetic faces. A number of improvements have been made to GANs since their inception, in particular to improve their stability and convergence during training. Wasserstein GANs (WGANs) [13] are the predominant type of GAN used in current GAN research. WGANs have been shown to offer improved performance and better convergence versus traditional GANs [13].

Figure 3 shows the basic architecture of a WGAN. Compared to a classical GAN, WGANs use a modified loss metric for the discriminator, i.e. the Wasserstein loss. As a result, the discriminator is renamed a 'critic' in the context of WGANs. The general form of the optimisation objectives of a WGAN critic and generator are shown below [13]:

$$g_w \leftarrow \nabla_w \left[ \frac{1}{m} \sum_{i=1}^{m} f_w(x^{(i)}) - \frac{1}{m} \sum_{i=1}^{m} f_w(G_\theta(z^{(i)})) \right] \qquad (1)$$

$$g_\theta \leftarrow -\nabla_\theta \frac{1}{m} \sum_{i=1}^{m} f_w(G_\theta(z^{(i)})) \qquad (2)$$

Equation 1 shows how weights are updated for each step of the training of the critic, and Equation 2 shows the corresponding weight update step for the generator. Here, $f$ represents the critic network (the discriminator in GANs), $G$ represents the generator network, $m$ is the number of samples in the batch, and $\nabla_w$ and $\nabla_\theta$ denote the gradients with respect to the critic weights $w$ and the generator parameters $\theta$ respectively. Whereas in a traditional GAN the discriminator output is used in the form $-\log(D(G(z^{(i)})))$, where $D$ is the discriminator, WGANs do not use a 1/0 class, but instead use the form $-f(G(z^{(i)}))$. In other words, while the output of the discriminator in a traditional GAN is 0 or 1 ($D \mapsto [0,1]$), the output of a Wasserstein GAN is a real value ($f \mapsto \mathbb{R}$). The discriminator in a WGAN assigns samples to a real or fake distribution. The discriminator's objective in a WGAN is to maximise the distance between these two distributions, rather than directly classifying them as real or fake (by assigning 1 or 0 as in the case of the classical GAN).

Figure 3. Basic WGAN Architecture

Wasserstein GANs with Gradient Penalty (WGAN-GP) [14] are a specific type of WGAN that use a Gradient Penalty loss function in addition to the Wasserstein loss used in basic WGANs. In a basic WGAN, the Lipschitz constraint on the discriminator (critic) is enforced by weight clipping. Instead, in a WGAN-GP, the gradient penalty is used to prevent the network weights from exploding. This aids in training stability and can reduce hyper-parameter sensitivity [14].
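The following sketch is our reading of the losses in [13] and [14], written against the TensorFlow GradientTape API; the penalty weight of 10 follows [14], and all tensor names are placeholders. It shows how the gradient penalty, computed on points interpolated between real and fake samples, replaces weight clipping in the critic loss:

    import tensorflow as tf

    LAMBDA = 10.0  # gradient-penalty weight suggested in [14]

    def critic_loss(critic, x_real, x_fake):
        # Wasserstein part: the critic f should score real high, fake low (Eq. 1).
        w_loss = tf.reduce_mean(critic(x_fake)) - tf.reduce_mean(critic(x_real))

        # Gradient penalty: a soft Lipschitz constraint evaluated on points
        # interpolated between real and fake samples, instead of weight clipping.
        eps = tf.random.uniform([tf.shape(x_real)[0], 1], 0.0, 1.0)
        x_hat = eps * x_real + (1.0 - eps) * x_fake
        with tf.GradientTape() as tape:
            tape.watch(x_hat)
            score = critic(x_hat)
        grads = tape.gradient(score, x_hat)
        gp = tf.reduce_mean((tf.norm(grads, axis=1) - 1.0) ** 2)
        return w_loss + LAMBDA * gp

    def generator_loss(critic, x_fake):
        # Eq. 2: the generator minimises -E[f(G(z))].
        return -tf.reduce_mean(critic(x_fake))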
D. Modal Collapse

As mentioned earlier, modal collapse is a common issue faced by GANs that can lead to a lack of diversity in generated samples. GANs generate synthetic data by learning a mapping from a random input (latent space) into the output space (the data to be generated). Ideally, the GAN learns to map the latent space in a way that fully captures all the diversity in the input dataset. For example, in the case of face generation, the GAN should be able to generate images for males, females, adults, children, etc., and not just one sub-class (or mode), e.g. older men. Modal collapse is the scenario where the GAN fails to map the input to the full diversity of types (or modes) available in the training data.

III. Related Works

In [9], the authors propose the use of GANs for generating synthetic flow-based attack data to augment existing NIDS training data sets. The paper proposed the use of a Wasserstein GAN with gradient penalty (WGAN-GP) to generate malicious flows with a DoS attack dataset as reference. The authors compared their synthetically generated attack dataset to their real recorded dataset by training a gradient boosting classifier on a number of computed statistics. They found they were able to generate malicious traffic that resembled their real attack data, and concluded that GANs are a viable tool for the generation of NIDS datasets. However, the paper did not quantify the ability of their proposed method to match the statistical characteristics of the training data.

The authors in [8] applied a similar concept for the generation of benign network traffic flows, and they also used a WGAN-GP to generate the data, using a legitimate reference trace as training data. This approach generates a number of NetFlow fields, including the ones listed in Table I. The authors also used IP2Vec [15] to encode IP addresses and port numbers for use by the neural network. IP2Vec is a method similar to Word2Vec [16], which takes as input a one-hot encoded vector of all possible IP/port combinations in the dataset, and uses an autoencoder to embed this in a lower dimensional representation. The paper resolves several challenges of using GANs to generate flow records, and presents and evaluates three architectures to achieve this, all based on a WGAN-GP. The authors evaluated the generated datasets by comparing the distributions of key fields in the generated data with legitimate flow records. The authors also used domain knowledge based sanity checks on the generated flow records, e.g. UDP flows should not have TCP flags.

The authors of [17] extended this concept further with PacketCGAN, using a GAN to generate realistic individual packets at the IP level. They trained their GAN with a capture of encrypted web traffic. The authors showed that the synthetically generated IP packets had a valid structure and could be successfully transmitted on a network, i.e. the generated packets had valid IP and TCP/UDP headers. However, this was only possible for simple one-way transmission of packets, and not for successful two-way transmission, e.g. the establishment of a TCP connection.

All the related works discussed in this section are based on a WGAN-GP approach for data generation. As we will demonstrate in the following section, the WGAN-GP method in practice is susceptible to the problem of modal collapse, and can result in generated flows with significantly different statistical characteristics from the real (training) data.

IV. WGAN-GP Flow Generation

We begin our exploration by replicating the existing state-of-the-art approach to GAN-based traffic generation, which is based on the WGAN-GP architecture. We use the methods described in [8] as a basis for our implementation.

1) Dataset: For our work we used a NetFlow trace collected from a medium-sized Australian ISP backbone network over the span of one month in 2019. The traffic from this capture belonged to the single largest customer network, and contained 30.2M flow records. Based on the configuration of the NetFlow exporter/collector, the maximum flow duration is 120 seconds.

2) Results: For the evaluation of the quality of synthetically generated NetFlow data, we consider the distributions of the following important flow metrics: packet size and flow duration. The flow duration is obtained from flow records by subtracting the FIRST_SWITCHED NetFlow field from the LAST_SWITCHED field. The packet size (per-flow average) is calculated for each flow direction separately, i.e. as IN_BYTES/IN_PKTS and OUT_BYTES/OUT_PKTS respectively.
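A minimal pandas sketch of this metric derivation (column names follow Table I; the input file name is a placeholder, and the exact tooling we used is not prescribed here):

    import pandas as pd

    # flows: one NetFlow record per row, columns as in Table I.
    flows = pd.read_csv("netflow_trace.csv")  # hypothetical file name

    # Flow duration: LAST_SWITCHED - FIRST_SWITCHED (seconds).
    flows["DURATION"] = flows["LAST_SWITCHED"] - flows["FIRST_SWITCHED"]

    # Per-flow average packet size, computed per direction.
    flows["PKT_SIZE_IN"] = flows["IN_BYTES"] / flows["IN_PKTS"]
    flows["PKT_SIZE_OUT"] = (flows["OUT_BYTES"]
                             / flows["OUT_PKTS"]).where(flows["OUT_PKTS"] > 0)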
Figure 4 shows the per-flow packet size distribution for the synthetic NetFlow data generated by the WGAN-GP, using flows from our ISP NetFlow trace as training data. As a reference, the figure also shows the corresponding packet size distribution of the original training data, labelled as 'Real Network'. The histogram in Figure 4 uses a bin size of 75 bytes. The density (y-axis) is normalised to a range between 0 and 1, and is shown in logarithmic scale for clarity.

Figure 4. WGAN-GP Packet Size

It can be seen in Figure 4 that the traffic synthetically generated by WGAN-GP does not capture the long tail of the packet size distribution of the real traffic. Already, this can suggest modal collapse, as the full input distribution has not been learned.

Figure 5 shows the corresponding distributions of the flow duration, for both the synthetic data from WGAN-GP and the original training data ('Real Network'). We see that the distribution of the training data has multiple modes, but the WGAN-GP seems to be unable to learn and replicate them in the generated data. This is particularly clear for the peak near the maximum flow duration of 120 seconds. Our results seem to indicate that the WGAN-GP is suffering from modal collapse.

Figure 5. WGAN-GP Flow Duration

A. WGAN-GP Limitations

It could be argued that given infinite time, these networks would eventually learn to capture the whole input distribution. Indeed, the authors who proposed the WGAN architecture suggested that modal collapse is avoided if the critic (discriminator) is trained optimally [13]. And in theory, only the convergence of the discriminator is required to ensure modal collapse does not occur [18]. However, more recent work has shown that WGANs, or WGAN-GPs 'with a finite number of discriminator updates per generator update, do not always converge to the equilibrium point' [19].

And indeed, we observed this when conducting our experiments, regardless of the training duration (one experiment was left running for 3 days), and even when replicating hyper-parameters from previous works. When training WGAN-GPs on our network traffic, we found ongoing oscillations in the captured modes to be common, with the generator learning one mode well, then switching to another one as the discriminator became good at distinguishing that mode, and so on. In modal collapse, a GAN may never learn to capture all the input modes, and can continue to oscillate indefinitely, or alternatively, it can converge, but to only a small sub-set of the modes [7]. We observed during training that regardless of training time, the distribution in Figure 4 was shifting alternately between being left biased (as in the figure) and being too flat. This would support modal collapse. We could not resolve this by reducing the learning rate, or by adding regularisation.

While the WGAN or WGAN-GP architecture should address modal collapse when trained optimally, it has been noted in research that these models are still sensitive to hyper-parameter and architecture choices, and often take excessive amounts of time to train [19]. In the following section, we present an approach which is able to minimise the above stated limitations of WGAN and WGAN-GP architectures for the generation of synthetic network flow data.
V. FlowGAN

Because modal collapse is an established issue when training GANs, a variety of methods have been proposed to address the problem without requiring training to optimality as with a WGAN-GP. Common approaches include minibatch standard deviation [7], unrolled GANs [20], AdaBoosting [21], and hybrid GAN/encoder methods such as VAE+GAN [22], VEEGAN [23] or MGGAN [10].

In this paper, we explore the MGGAN approach of combining the GAN architecture with an encoder. This uses the encoder to learn a sensible representation of the distribution of the input data, which the GAN can then use to reduce modal collapse.

In this machine learning context, an 'encoder' takes samples and generates a meaningful 'representation' (encoding) of these; usually this is a more compressed representation. A 'decoder' then takes an encoded representation and reconstructs the original un-encoded data. Finally, an autoencoder learns this process automatically, by connecting an encoder to a decoder and simultaneously training the encoder to generate an encoding and the decoder to reconstruct this encoding back into the original.

The authors of [10] introduce Manifold Guided Generative Adversarial Networks (MGGAN). An encoder takes the role of an auxiliary guidance network, which runs alongside the standard GAN. This is a simple but effective approach to reducing modal collapse.
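Conceptually, the guidance network adds a second adversarial term to the generator objective. The sketch below is our reading of [10] adapted to this paper's setting, with all names placeholders: the generator must fool both the data-space discriminator D_x and the manifold-space discriminator D_m, which judges samples through a pre-trained, frozen encoder E.

    import tensorflow as tf

    bce = tf.keras.losses.BinaryCrossentropy()

    def generator_step(G, D_x, D_m, E, z, opt):
        # E is pre-trained (as part of an autoencoder) and kept frozen here.
        with tf.GradientTape() as tape:
            x_fake = G(z)
            # Data-space term: D_x should believe the fake flows are real.
            loss_x = bce(tf.ones_like(D_x(x_fake)), D_x(x_fake))
            # Manifold-space term: D_m judges the *encoded* representation,
            # guiding G towards all modes the encoder has captured.
            loss_m = bce(tf.ones_like(D_m(E(x_fake))), D_m(E(x_fake)))
            loss = loss_x + loss_m
        grads = tape.gradient(loss, G.trainable_variables)
        opt.apply_gradients(zip(grads, G.trainable_variables))
        return loss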
In order to address the statistical discrepancies observed earlier when using a WGAN-GP, we modified the MGGAN architecture to be capable of generating network flow data. Figure 6 shows our architecture FlowGAN, which is heavily influenced by MGGAN, as described in [10]. The primary modification is the addition of a data encoding/decoding function which transforms NetFlow records into a flat vector format (suitable for use in neural networks), to allow the MGGAN to accept NetFlow data.

Figure 6. FlowGAN architecture

As with a traditional GAN, FlowGAN has the standard discriminator D_x, which verifies if generated flows appear realistic. It additionally uses an Encoder, which is pre-trained to encode NetFlow records, and an additional discriminator D_m, which is provided with the encoded representation of the generated NetFlow. D_m aims to verify if the encoded representation of the flows is 'realistic'. This supervisory discriminator in turn helps to reduce modal collapse, as the Encoder has already learned the input data distribution.

FlowGAN has three main components: the encoding of the NetFlow data into a format that can be used by neural networks, the GAN itself, and finally an autoencoder, which is used to train the supervisory component of FlowGAN.

A. Data Encoding Component

Directly using NetFlow records with a neural network presents a challenge, as many of the fields are categorical and unscaled. A neural network generally expects properly normalised and formatted input data. Therefore, we need to encode the data prior to use; this is performed by the Data Encoding/Decoding component in Figure 6. Among the more difficult fields to encode are the IP source and destination addresses, as well as the source and destination ports. We cannot simply encode the IP address as a number, as it should be interpreted as a categorical field. One method to encode IP addresses is through an IP2Vec model [15], which aims to embed IP addresses and port numbers into a meaningful lower dimensional representation (using an autoencoder), where the distance between points in this lower dimensional representation also represents the distance between IP addresses in reality. For example, 192.168.0.1 is closer to 192.168.0.2 than to 192.168.10.1. This was the approach used by the authors in [8].

However, there are several drawbacks to this approach. Most importantly, IP2Vec encodes each individual IP address in a one-hot vector. So if there are 1 million unique IP addresses, there will be one million inputs, drastically increasing network size and parameter count. This also means that generated IP addresses cannot be completely original; the output will only be from the set of provided input IP addresses. We decided this was unacceptable for our system.

We attempted a number of different alternative encoding schemes before settling on one-hot encoding the individual digits of the IP and port numbers in base 10. For IPv4 (which represented all our network traffic), we split the address into 12 digits, and encoded each digit as a vector between 0-9, except the first digit in each octet, which was encoded as a vector between 0-2. The one-hot encoding helps to encapsulate the differences between IPs such as 10.0.1.1 and 10.0.0.1, which if represented numerically would be very similar. This method can also represent all valid IP addresses, which is an advantage over IP2Vec. However, it can also generate a small range of invalid addresses (e.g. 266.277.288.299), which is a drawback of this scheme but does not affect GAN performance, as the generator learns to avoid these.

A similar one-hot encoding was used for each of the 5 digits in the port number. A one-hot encoding was used for the IP layer protocols present in our dataset. NetFlow fields counting bytes and packets were encoded linearly, after clipping extreme values > 5σ and min-max scaling.
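A minimal sketch of this digit-wise encoding (our illustration of the scheme just described; each octet is zero-padded to three digits, giving 4 x (3 + 10 + 10) = 92 features per address):

    import numpy as np

    def encode_ipv4(ip):
        """One-hot encode an IPv4 address digit by digit, as described above."""
        parts = []
        for octet in ip.split("."):
            digits = "%03d" % int(octet)
            for pos, d in enumerate(digits):
                size = 3 if pos == 0 else 10   # leading octet digit is only 0-2
                one_hot = np.zeros(size)
                one_hot[int(d)] = 1.0
                parts.append(one_hot)
        return np.concatenate(parts)

    def decode_ipv4(vec):
        """Invert the encoding by arg-max per digit (the decoding side)."""
        sizes = [3, 10, 10] * 4
        digits, i = [], 0
        for size in sizes:
            digits.append(int(np.argmax(vec[i:i + size])))
            i += size
        octets = [digits[j] * 100 + digits[j + 1] * 10 + digits[j + 2]
                  for j in (0, 3, 6, 9)]
        return ".".join(str(o) for o in octets)

    print(decode_ipv4(encode_ipv4("10.0.1.1")))  # -> 10.0.1.1

As noted above, an arg-max decode of a generated vector can yield an invalid address such as 266.277.288.299; in practice the generator learns to avoid these.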
B. GAN Component

The GAN component consists of the generator and the discriminator, G and D_x respectively in Figure 6. The generator is used to generate the synthetic network traffic, and the discriminator evaluates if it is realistic.

For the generator and discriminator we used densely connected networks, with 3 hidden layers of 512, 256 and 128 nodes. Normally, the discriminator is constructed with fewer parameters than the generator, to allow the generator to eventually overpower the discriminator. But we found this was unnecessary, as the generator was over-provisioned to complete this simple task and would quickly overpower the discriminator. However, this could likely be optimised. We used a small amount of dropout (0.02) on the first two hidden layers of the generator and the discriminator. We used a PReLU (parametric rectified linear unit) activation function in the hidden layers to help mitigate saturation.

In this case, we used a vanilla/standard GAN, with a Sigmoid output activation function for the verifier, and assigned 1 to the real class and 0 to the fake class. We used binary cross-entropy as our loss function.

C. Guidance Network Component

The guidance network component, shown in Figure 6, is critical to the MGGAN architecture, as it helps to reduce modal collapse. Figure 6 only shows the Encoder and D_m. In order to teach this encoder to encode NetFlow, we train it as part of an autoencoder on real NetFlow traffic.

For the autoencoder we used densely connected layers, with 1024, 512 and 256 hidden nodes. We used a PReLU activation for hidden layers. The size of our encoded vector is 128; as such, the size of our middle layer (the output of the encoder) is 128. Our middle layer activation function is tanh() (meaning our encoded representation has a range of -1 to 1). The discriminator of the guidance network D_m uses the same parameter configuration as D_x.
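A sketch of this guidance autoencoder (the 1024/512/256 hidden layers, PReLU activations and 128-wide tanh bottleneck follow the text; the mirrored decoder and its output activation are our assumptions):

    from tensorflow.keras import layers, models

    FLOW_DIM = 200  # placeholder: size of the flat encoded NetFlow vector

    # Encoder: dense 1024 -> 512 -> 256 hidden layers with PReLU, then a
    # 128-wide tanh bottleneck (the encoded representation, range -1..1).
    encoder = models.Sequential([
        layers.InputLayer(input_shape=(FLOW_DIM,)),
        layers.Dense(1024), layers.PReLU(),
        layers.Dense(512), layers.PReLU(),
        layers.Dense(256), layers.PReLU(),
        layers.Dense(128, activation="tanh"),
    ])

    # Decoder (assumed to mirror the encoder) so the pair can be pre-trained
    # as an autoencoder on real NetFlow traffic with an MSE reconstruction loss.
    decoder = models.Sequential([
        layers.InputLayer(input_shape=(128,)),
        layers.Dense(256), layers.PReLU(),
        layers.Dense(512), layers.PReLU(),
        layers.Dense(1024), layers.PReLU(),
        layers.Dense(FLOW_DIM, activation="sigmoid"),
    ])

    autoencoder = models.Sequential([encoder, decoder])
    autoencoder.compile(optimizer="adam", loss="mse")
    # autoencoder.fit(x_real, x_real, batch_size=128, ...)  # pre-training step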
D. Implementation

We implemented FlowGAN using Tensorflow (Tensorflow GPU) on Python 2.7. The FlowGAN model was run on a computer with an Intel i9-7980XE and an Nvidia TITAN RTX graphics card. The version of Tensorflow we used was 1.14.0. The autoencoder network was pre-trained to a mean squared error (MSE) reconstruction loss of 1 x 10^-6. FlowGAN was trained for 10 epochs (~12.5M batches). For all networks, the Keras Adam optimiser was used with the default learning rate of 0.001 for the generator and discriminator, and 0.0005 for the autoencoder. Other important hyperparameter choices were beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-7. A batch size of 128 was used during training.
discriminator, and 0.0005 for the autoencoder. Other impor­
tant hyperparameter choices were Pi = 0.9, P2 = 0.999,
e = le -7 . A batch size of 128 was used during training.

VI. E v a l u a t io n

Table II shows examples of NetFlow records generated


using FlowGAN. While the generated data only includes
the shown 8 NetFlow fields, the same approach could easily
be used to generate other fields present in NetFlow v9. 0 250 500 750 1000 1250 1500
Packet Size Avg. (bytes/packets) [75 byte bins]
To evaluate our system, we compare the synthetic data
generated by FlowGAN to the original (training) data, using Figure 7. FlowGAN - Packet Size
the same approach as used in Section IV. In particular, we
are plotting the distributions of key NetFlow fields, and
comparing them visually to the distributions of the training
data.
As in Section IV, we consider packet size and flow
duration for our evaluation. Figure 7 shows the packet size
distributions from traffic generated with FlowGAN, and
the corresponding distribution from the training data (Real
Network), which is obtained from a real production network.
We can see that the distribution of the FlowGAN traffic
relatively closely matches the distribution of the training
data. It is particularly interesting to compare Figure 7 with
Figure 4, which shows the packet size distribution of data
generated by a traditional WGAN-GP versus the training 0 20 40 60 80 100 120
data. While Figure 4 shows the WGAN-GP failing to learn Flow Duration (s) [6 sec bins]
the distribution of the training data, due to modal collapse,
we see that this limitation is largely overcome by FlowGAN. Figure 8. FlowGAN - Flow Duration

174

Authorized licensed use limited to: Centrale Supelec. Downloaded on December 12,2022 at 15:27:02 UTC from IEEE Xplore. Restrictions apply.
While a visual comparison of the distributions is insightful, we also want to quantify the advantage provided by FlowGAN. For this, we used the Wasserstein (or earth mover's) distance, which provides a metric for the similarity of two distributions. We computed the Wasserstein distance between data generated using FlowGAN and the training data for the following statistics: flow duration, flow size (packets sent) and packet size (bytes per packet). We also computed the corresponding Wasserstein distances between the flow data generated by WGAN-GP and the training data. The results are shown in Table III.

Table III. Wasserstein Distances of Generated Network Traffic Distributions versus Real Network Traffic

Statistic          WGAN-GP   FlowGAN
Flow Duration      1120      190
Packets Sent       1190      200
Bytes Per Packet   1050      330

We can see that, compared to WGAN-GP, FlowGAN reduces the distance between the generated data and the training data by a factor of more than 3 for the packet size (bytes per packet) distribution, and by a factor of almost 6 for both the flow duration and flow size (packets sent) distributions. This provides a clear indication of the improvement that FlowGAN provides over traditional GAN-based approaches to synthetically generating realistic network flow data.
VII. Conclusion

In this paper, we explored the use of Generative Adversarial Networks (GANs) for the generation of synthetic network flow data, with potential application in ML-based Network Intrusion Detection Systems. We particularly focus on the generation of benign (normal) traffic in the form of NetFlow records. This idea is motivated by the great success that GANs have had in the generation of synthetic data in a range of different applications.

Our evaluation of the state-of-the-art GAN-based approaches for network flow generation, which are based on a WGAN-GP architecture, shows the limitations of this method, in particular modal collapse. To address this limitation, this paper proposes FlowGAN, a new method for generating synthetic network flow data. FlowGAN leverages the idea of Manifold Guided Generative Adversarial Networks (MGGAN), and is able to significantly mitigate the problem of modal collapse. As a result, based on our evaluation with a training data set from a large production network, the NetFlow dataset that is synthetically generated by FlowGAN is much more realistic than the data generated by the standard WGAN-GP approach. This can be clearly seen via visual inspection of the distributions of key traffic features, e.g. packet size and flow duration.

We quantify this significant improvement using the Wasserstein distance, which provides a distance metric between a pair of probability distributions. In our case, we consider the distance between the distributions of the considered traffic features. Our results show that the distance between synthetic FlowGAN data and the (real) training data is significantly smaller (by a factor of 3 to 6) than the corresponding distance for the data generated by a WGAN-GP. The distributions show that FlowGAN is able to replicate the diversity and modes in the distribution of the training data, and hence limit the problem of modal collapse.

While more work remains to be done, e.g. the consideration of temporal and sequential aspects of the flow records, we believe the approach presented in this paper provides a significant step toward the GAN-based synthetic generation of realistic network flow data.
References

[1] M. Sarhan, S. Layeghy, N. Moustafa, and M. Portmann, "Towards a Standard Feature Set of NIDS Datasets," arXiv:2101.11315 [cs], Jan. 2021. [Online]. Available: http://arxiv.org/abs/2101.11315

[2] S. Layeghy, M. Gallagher, and M. Portmann, "Benchmarking the Benchmark - Analysis of Synthetic NIDS Datasets," arXiv:2104.09029 [cs], Apr. 2021. [Online]. Available: http://arxiv.org/abs/2104.09029

[3] I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, "Toward generating a new intrusion detection dataset and intrusion traffic characterization," in ICISSP 2018 - Proceedings of the 4th International Conference on Information Systems Security and Privacy, 2018, pp. 108-116.

[4] N. Moustafa, "ToN_IoT datasets," IEEE Dataport, 2019. [Online]. Available: http://dx.doi.org/10.21227/fesz-dm97

[5] N. Moustafa and J. Slay, "UNSW-NB15: A Comprehensive Data set for Network Intrusion Detection systems (UNSW-NB15 Network Data Set)," in Military Communications and Information Systems Conference (MilCIS). IEEE, 2015, pp. 1-6.

[6] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative Adversarial Networks," arXiv:1406.2661 [cs, stat], Jun. 2014. [Online]. Available: http://arxiv.org/abs/1406.2661

[7] I. Goodfellow, "NIPS 2016 Tutorial: Generative Adversarial Networks," arXiv:1701.00160 [cs], Apr. 2017. [Online]. Available: http://arxiv.org/abs/1701.00160

[8] M. Ring, D. Schlör, D. Landes, and A. Hotho, "Flow-based Network Traffic Generation using Generative Adversarial Networks," Computers & Security, vol. 82, pp. 156-172, May 2019. [Online]. Available: http://arxiv.org/abs/1810.07795

[9] J. Charlier, A. Singh, G. Ormazabal, R. State, and H. Schulzrinne, "SynGAN: Towards Generating Synthetic Network Attacks using GANs," arXiv:1908.09899 [cs, stat], Aug. 2019. [Online]. Available: http://arxiv.org/abs/1908.09899

[10] D. Bang and H. Shim, "MGGAN: Solving Mode Collapse using Manifold Guided Training," arXiv:1804.04391 [cs], Apr. 2018. [Online]. Available: http://arxiv.org/abs/1804.04391

[11] B. Claise, "Cisco Systems NetFlow Services Export Version 9," RFC 3954, Oct. 2004. [Online]. Available: https://www.ietf.org/rfc/rfc3954.txt

[12] RFC 7011 - Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of Flow Information. [Online]. Available: https://tools.ietf.org/html/rfc7011

[13] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," arXiv:1701.07875 [cs, stat], Dec. 2017. [Online]. Available: http://arxiv.org/abs/1701.07875

[14] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, "Improved Training of Wasserstein GANs," arXiv:1704.00028 [cs, stat], Dec. 2017. [Online]. Available: http://arxiv.org/abs/1704.00028

[15] M. Ring, A. Dallmann, D. Landes, and A. Hotho, "IP2Vec: Learning Similarities Between IP Addresses," in 2017 IEEE International Conference on Data Mining Workshops (ICDMW). New Orleans, LA: IEEE, Nov. 2017, pp. 657-666. [Online]. Available: http://ieeexplore.ieee.org/document/8215725/

[16] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," arXiv:1301.3781 [cs], Sep. 2013. [Online]. Available: http://arxiv.org/abs/1301.3781

[17] P. Wang, S. Li, F. Ye, Z. Wang, and M. Zhang, "PacketCGAN: Exploratory Study of Class Imbalance for Encrypted Traffic Classification Using CGAN," arXiv:1911.12046 [cs, eess], Nov. 2019. [Online]. Available: http://arxiv.org/abs/1911.12046

[18] S. A. Barnett, "Convergence Problems with Generative Adversarial Networks (GANs)," arXiv:1806.11382 [cs, stat], Jun. 2018. [Online]. Available: http://arxiv.org/abs/1806.11382

[19] L. Mescheder, A. Geiger, and S. Nowozin, "Which Training Methods for GANs do actually Converge?" arXiv:1801.04406 [cs], Jul. 2018. [Online]. Available: http://arxiv.org/abs/1801.04406

[20] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein, "Unrolled Generative Adversarial Networks," arXiv:1611.02163 [cs, stat], May 2017. [Online]. Available: http://arxiv.org/abs/1611.02163

[21] I. Tolstikhin, S. Gelly, O. Bousquet, C.-J. Simon-Gabriel, and B. Schölkopf, "AdaGAN: Boosting Generative Models," arXiv:1701.02386 [cs, stat], May 2017. [Online]. Available: http://arxiv.org/abs/1701.02386

[22] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, "Autoencoding beyond pixels using a learned similarity metric," arXiv:1512.09300 [cs, stat], Feb. 2016. [Online]. Available: http://arxiv.org/abs/1512.09300

[23] A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. Sutton, "VEEGAN: Reducing Mode Collapse in GANs using Implicit Variational Learning," arXiv:1705.07761 [stat], Nov. 2017. [Online]. Available: http://arxiv.org/abs/1705.07761
