Flow-Based Network Traffic Generation Using Generative Adversarial Networks
Abstract
Flow-based data sets are necessary for evaluating network-based intrusion detection systems (NIDS). In this work, we propose a novel methodology for generating realistic flow-based network traffic. Our approach is based on Generative Adversarial Networks (GANs), which achieve good results for image generation. A major challenge lies in the fact that GANs can only process continuous attributes. However, flow-based data inevitably contain categorical attributes such as IP addresses or port numbers. Therefore, we propose three different preprocessing approaches for flow-based data in order to transform them into continuous values. Further, we present a new method for evaluating the generated flow-based network traffic which uses domain knowledge to define quality tests. We use the three approaches to generate flow-based network traffic based on the CIDDS-001 data set. Experiments indicate that two of the three approaches are able to generate high-quality data.
Keywords: GANs, TTUR WGAN-GP, NetFlow, Generation, IDS
∗ Corresponding author
Email addresses: markus.ring@hs-coburg.de (Markus Ring),
daniel.schloer@informatik.uni-wuerzburg.de (Daniel Schlör),
dieter.landes@hs-coburg.de (Dieter Landes), hotho@informatik.uni-wuerzburg.de
(Andreas Hotho)
GANs consist of two neural networks, a generator network G and a discriminator network D. The generator network G is trained to generate synthetic data from noise. The discriminator network D is trained to distinguish generated synthetic data from real world data. The generator network G is trained by the output signal gradient of the discriminator network D. G and D are trained iteratively until the generator network G is able to fool the discriminator network D. GANs achieve remarkably good results in image generation [5, 6, 7, 8]. Furthermore, GANs have also been used for generating text [9] or molecules [10].
This work uses GANs to generate complete flow-based network traffic with all typical attributes. To the best of our knowledge, this is the first work that uses GANs for this purpose. GANs can only process continuous input attributes. This poses a major challenge since flow-based network data consist of continuous and categorical attributes. Consequently, we analyze different preprocessing strategies to transform categorical attributes of flow-based network data into continuous attributes. The first method simply treats attributes like IP addresses and ports as numerical values. The second method creates binary attributes from categorical attributes. The third method uses IP2Vec [11] to learn meaningful vector representations of categorical attributes. After preprocessing, we use Improved Wasserstein GANs (WGAN-GP) [12] with the two time-scale update rule (TTUR) proposed by Heusel et al. [13] to generate new flow-based network data based on the public CIDDS-001 data set [14]. Then, we evaluate the quality of the generated data with several evaluation measures.
The paper has several contributions. The main contribution is the generation of flow-based network data using GANs. We propose three different preprocessing approaches and a new evaluation method which uses domain knowledge to evaluate the quality of generated data. In addition, we extend IP2Vec [11] such that it is able to learn similarities between the flow attributes bytes, packets, and duration.
Structure. The next section of the paper describes flow-based network traffic, GANs, and IP2Vec in more detail. In Section 3, we present three different approaches for transforming flow-based network data. An experimental evaluation of these approaches is given in Section 4 and the results are discussed in Section 5. Section 6 analyzes related work on network traffic generators for flow-based data. A summary and outlook on future work concludes the paper.
2. Foundations
This section starts by analyzing the underlying flow-based network traffic. Then, GANs are explained in more detail. Finally, we explain IP2Vec [11], which is the basis of our third data transformation approach.
Table 1: Overview of typical attributes in flow-based data like NetFlow [15] or IPFIX [16].
The third column provides the type of the attributes and the last column shows exemplary
values for these attributes.
2.2. GANs
Discriminative models classify objects into predefined classes [17] and are often used for intrusion detection (e.g. in [18], [19] or [20]). In contrast to discriminative models, generative models are used to generate data like flow-based network traffic. Many generative models build on likelihood maximization for a parametric probability distribution. As the likelihood function is often unknown or the likelihood gradient is computationally intractable, some models like Deep Boltzmann Machines [21] use approximations to solve this problem. Other models avoid this problem by not explicitly representing likelihood. Generative Stochastic Networks, for example, learn the transition operator of a Markov chain whose stationary distribution estimates the data distribution. GANs avoid Markov chains by estimating the data distribution through a game-theoretic approach: the generator network G tries to mimic samples from the data distribution, while the discriminator network D has to differentiate real and generated samples. Both networks are trained iteratively until the discriminator D can no longer distinguish real samples from generated samples. Besides computational advantages, the generator G is never updated with real samples. Instead, the generator network G is fed with an input vector of noise z and trained using only the discriminator's gradients through backpropagation. Therefore, it is less likely that the generator G overfits by memorizing and reproducing real samples. Figure 1 illustrates the generation process.

Figure 1: Architecture of GANs.
Goodfellow et al. note that "another advantage of adversarial networks is that they can represent very sharp, even degenerate distributions" [4], which is the case for some NetFlow attributes. However, the original (vanilla) GANs [4] require the visible units to be differentiable, which is not the case for categorical attributes like IP addresses in NetFlow data. Gulrajani et al. [12] show that Wasserstein GANs (WGANs), besides other advantages, are capable of modeling discrete distributions over a continuous latent space. In contrast to vanilla GANs, WGANs [8] use the Earth Mover (EM) distance as value function, replacing the classifying discriminator network with a critic network that estimates the EM distance. While the original WGAN approach uses weight clipping to guarantee differentiability almost everywhere, Gulrajani et al. [12] improve the training of WGANs by using a gradient penalty as a soft constraint to enforce the Lipschitz constraint. One research frontier in the area of GANs is to solve the issue of non-convergence [22]. Heusel et al. [13] propose a two time-scale update rule (TTUR), i.e. separate learning rates for discriminator and generator, for training GANs with arbitrary loss functions. The authors prove that TTUR converges under mild assumptions to a stationary local Nash equilibrium.

For those reasons, we use Improved Wasserstein Generative Adversarial Networks (WGAN-GP) [12] with the two time-scale update rule (TTUR) from [13] in our work.
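The gradient penalty can be illustrated with the following sketch of a generic WGAN-GP penalty term (assuming PyTorch and a critic operating on flat feature vectors; this is a sketch, not our exact implementation). With TTUR, critic and generator are then simply updated with separate learning rates.

import torch

def gradient_penalty(critic, real, fake, gp_weight=10.0):
    # Interpolate randomly between real and generated samples.
    eps = torch.rand(real.size(0), 1)
    interp = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    # Gradient of the critic output with respect to the interpolates.
    grad, = torch.autograd.grad(outputs=critic(interp).sum(),
                                inputs=interp, create_graph=True)
    grad_norm = grad.view(grad.size(0), -1).norm(2, dim=1)
    # Softly push the gradient norm towards 1 (Lipschitz constraint).
    return gp_weight * ((grad_norm - 1.0) ** 2).mean()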
2.3. IP2Vec
Arrows in Figure 2 denote network connections from three IP addresses, namely 192.168.20.1, 192.168.20.2, and 192.168.20.3. Colors indicate different services. Consequently, IP2Vec maps IP addresses which access similar services to similar vector representations.
2.3.1. Model
IP2Vec is based upon a fully connected neural network with a single hidden
layer (see Figure 3).
The features extracted from flow-based network traffic constitute the neural network's input. These features (IP addresses, destination ports, and transport protocols) define the input vocabulary, which contains all IP addresses, destination ports, and transport protocols that appear in the flow-based data set. Since neural networks cannot be fed with categorical attributes, each value of the input vocabulary is represented as a one-hot vector whose length equals the size of the vocabulary. Each neuron in the input and output layer is assigned a specific value of the vocabulary (see Figure 3).

Let us assume the training data set contains 100,000 different IP addresses, 20,000 different destination ports and 3 different transport protocols. Then, the size of the one-hot vector is 120,003 and only one component is 1, while all others are 0. Input and output layers comprise exactly the same number of neurons, which is equal to the size of the vocabulary. The output layer uses a softmax classifier which indicates, for each value of the vocabulary, the probability that it appears in the same flow (context) as the input value. The softmax classifier [25] normalizes the output of all output neurons such that the sum of the outputs is 1. The number of neurons in the hidden layer is much smaller than the number of neurons in the input layer.
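A minimal sketch of such a network in PyTorch, with the vocabulary sizes from the example above (purely illustrative, not the tuned IP2Vec configuration):

import torch
import torch.nn as nn

vocab_size = 120_003   # 100,000 IPs + 20,000 ports + 3 protocols
embedding_dim = 32     # hidden layer is much smaller than the input layer

model = nn.Sequential(
    nn.Linear(vocab_size, embedding_dim, bias=False),  # hidden layer = embedding matrix
    nn.Linear(embedding_dim, vocab_size, bias=False),  # one output score per vocabulary value
)

one_hot = torch.zeros(vocab_size)
one_hot[42] = 1.0  # one-hot vector for an arbitrary vocabulary index
# Softmax turns the output scores into probabilities that each vocabulary
# value appears in the same flow (context) as the input value.
probs = torch.softmax(model(one_hot), dim=-1)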
2.3.2. Training
The neural network is trained using captured flow-based network traffic.
IP2Vec uses only the source IP address, destination IP address, destination port
and transport protocol of flows. Figure 4 outlines the generation of training
samples.
Figure 4: Generation of training samples in IP2Vec [11]. Input values are highlighted in blue
color and expected output values are highlighted in black frames with white background.
IP2Vec generates five training samples from each flow. Each training sample consists of an input value and an expected output value. In the first step, IP2Vec selects an input value for the training sample. The selected input value is highlighted in blue in Figure 4. The expected output values for the corresponding input value are highlighted through black frames with white background. Figure 4 shows that IP2Vec generates three training samples where the source IP address is the input value, one training sample where the destination port is the input value, and one training sample where the transport protocol is the input value.
In the training process, the neural network is fed with the input value and tries to predict the probabilities of the other values from the vocabulary. For training samples, the probability of the concrete output value is 1 and 0 for all other values. In general, the output layer indicates, for each value of the input vocabulary, the probability that it appears in the same flow as the given input value.

The network uses back-propagation for learning. This kind of training, however, can take a lot of time. Let us assume that the hidden layer comprises 32 neurons and the training data set encompasses one million different IP addresses and ports. This results in 32 million weights in each layer of the network. Consequently, training such a large neural network is going to be slow. To make things worse, a huge amount of training flows is required for adjusting that many weights and for avoiding over-fitting. Consequently, we have to update millions of weights for millions of training samples. Therefore, IP2Vec attempts to reduce the training time by using Negative Sampling in a similar way as Word2Vec does [23]. In Negative Sampling, each training sample modifies only a small percentage of the weights, rather than all of them. More details on Negative Sampling may be found in [11] and [24].
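A minimal sketch of such a negative sampling update in the style of Word2Vec's skip-gram with negative sampling (not the exact IP2Vec implementation; negatives are drawn uniformly here for simplicity):

import torch
import torch.nn.functional as F

vocab_size, embedding_dim = 120_003, 32
in_embed = torch.nn.Embedding(vocab_size, embedding_dim)   # hidden-layer weights
out_embed = torch.nn.Embedding(vocab_size, embedding_dim)  # output-layer weights

def sgns_loss(input_idx, target_idx, num_negatives=5):
    v = in_embed(input_idx)       # embedding of the input value
    pos = out_embed(target_idx)   # embedding of the observed output value
    neg = out_embed(torch.randint(0, vocab_size, (num_negatives,)))
    # Maximize similarity to the observed value, minimize it for the negatives;
    # only these few rows receive gradients instead of all 120,003 outputs.
    return -F.logsigmoid(pos @ v) - F.logsigmoid(-neg @ v).sum()

loss = sgns_loss(torch.tensor(0), torch.tensor(42))
loss.backward()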
After training, a 32-dimensional vector representation of each IP address and port is obtained if the hidden layer comprises 32 neurons.
Intuition. Why does this approach work? If two IP addresses refer to similar
destination IP addresses, destination ports, and transport protocols, then the
neural network needs to output similar results for these IP addresses. One way
for the neural network to learn similar output values for different input values
is to learn similar weights in the hidden layer of the network. Consequently, if
two IP addresses exhibit similar network behavior, IP2Vec attempts to learn
similar weights (which are the vectors of the target feature space R^m) in the
hidden layer.
3. Transformation Approaches
3.1. Preliminaries
In general, we use the same preprocessing steps in all three methods for the attributes date first seen, transport protocol, and TCP flags (see Table 1).

Usually, the concrete timestamp is of marginal importance for generating realistic flow-based network data. Instead, many intrusion detection systems derive additional information from the timestamp like "is today a working day or a weekend day" or "does the event occur during typical working hours or at night". Therefore, we do not generate timestamps. Instead, we create two attributes weekday and daytime. To be precise, we extract the weekday information of flows and generate seven binary attributes isMonday, isTuesday, and so on. Then, we interpret the daytime as seconds [0, 86400) and normalize them to the interval [0, 1]. We transform the transport protocol (see #3 in Table 1) into three binary attributes, namely isTCP, isUDP, and isICMP. The same procedure is followed for TCP flags (see #10 in Table 1), which are transformed into six binary attributes isURG, isACK, isPSH, isSYN, isRES, and isFIN.
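These shared steps can be sketched as follows (the function name is hypothetical; the flags string follows the .A..S. notation used in Table 2):

from datetime import datetime

def preprocess_common(timestamp: str, proto: str, flags: str) -> dict:
    ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    days = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
    features = {f"is{d}": int(ts.weekday() == i) for i, d in enumerate(days)}
    seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
    features["daytime"] = seconds / 86400  # [0, 86400) -> [0, 1]
    for p in ("TCP", "UDP", "ICMP"):
        features[f"is{p}"] = int(proto == p)
    for f in ("URG", "ACK", "PSH", "SYN", "RES", "FIN"):
        features[f"is{f}"] = int(f[0] in flags)  # e.g. ".A..S." sets ACK and SYN
    return features

print(preprocess_common("2018-05-28 11:39:23", "TCP", ".A..S."))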
3.2. Method 1 - Numeric Transformation
Although IP addresses and ports look like real numbers, they are actually categorical. Yet, the simplest approach is to interpret them as numbers after all and treat them as continuous attributes. We refer to this method as Numeric-based Improved Wasserstein Generative Adversarial Networks (short: N-WGAN-GP). This method transforms each octet of an IP address to the interval [0, 1], e.g. 192.168.220.14 is transformed to four continuous attributes: (ip_1) 192/255 = 0.7529, (ip_2) 168/255 = 0.6588, (ip_3) 220/255 = 0.8627, and (ip_4) 14/255 = 0.0549. We follow a similar procedure for ports by dividing them by the highest port number, e.g. the source port 80 is transformed to one continuous attribute 80/65535 = 0.00122.

The attributes duration, bytes, and packets (see attributes #2, #8 and #9 in Table 1) are normalized to the interval [0, 1]. Table 2 provides examples and compares the three transformation methods.
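A minimal sketch of this numeric transformation (hypothetical helper functions):

def numeric_ip(ip: str) -> list:
    # Each octet becomes one continuous attribute in [0, 1].
    return [int(octet) / 255 for octet in ip.split(".")]

def numeric_port(port: int) -> float:
    # Divide by the highest port number.
    return port / 65535

print(numeric_ip("192.168.220.14"))  # [0.7529..., 0.6588..., 0.8627..., 0.0549...]
print(numeric_port(80))              # 0.00122...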
3.3. Method 2 - Binary Transformation

The second method creates several binary attributes for IP addresses, ports, bytes, and packets. We refer to this method as Binary-based Improved Wasserstein Generative Adversarial Networks (short: B-WGAN-GP). Each octet of an IP address is mapped to its 8-bit binary representation. Consequently, IP addresses are transformed into 32 binary attributes, e.g. 192.168.220.14 is transformed to 11000000 10101000 11011100 00001110. Ports are converted to their 16-bit binary representation, e.g. the source port 80 is transformed to 00000000 01010000. For representing bytes and packets, we transform them to a binary representation as well and limit their length to 32 bit. The attribute duration is normalized to the interval [0, 1]. Table 2 shows an example of this transformation procedure.
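A minimal sketch of this binary transformation (hypothetical helper functions):

def binary_ip(ip: str) -> list:
    # Each octet becomes 8 binary attributes (32 in total).
    return [int(b) for octet in ip.split(".") for b in format(int(octet), "08b")]

def binary_port(port: int) -> list:
    # Ports become 16 binary attributes.
    return [int(b) for b in format(port, "016b")]

def binary_counter(value: int) -> list:
    # Bytes and packets are limited to a 32-bit representation.
    return [int(b) for b in format(value, "032b")[-32:]]

print(binary_ip("192.168.220.14"))  # 1,1,0,0,0,0,0,0, 1,0,1,0,1,0,0,0, ...
print(binary_port(80))              # 0,0,0,0,0,0,0,0, 0,1,0,1,0,0,0,0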
3.4. Method 3 - Embedding Transformation

The third method transforms IP addresses, ports, duration, bytes, and packets into so-called embeddings in an m-dimensional continuous feature space R^m. We refer to this method as Embedding-based Improved Wasserstein Generative Adversarial Networks (short: E-WGAN-GP).
Table 2: Preprocessing of flow-based data. For each original flow attribute (with an exemplary value), the extracted features and corresponding values are shown for each preprocessing method.

date first seen = 2018-05-28 11:39:23
  all methods: isMonday = 1, isTuesday = 0, isWednesday = 0, isThursday = 0, isFriday = 0, isSaturday = 0, isSunday = 0; daytime = 41963/86400 = 0.485

duration = 1.503
  N-WGAN-GP and B-WGAN-GP: norm_dur = (1.503 - dur_min) / (dur_max - dur_min)
  E-WGAN-GP: dur_1, ..., dur_m = e_1, ..., e_m

transport protocol = TCP
  all methods: isTCP = 1, isUDP = 0, isICMP = 0

IP address = 192.168.210.5
  N-WGAN-GP: ip_1 = 192/255 = 0.7529, ip_2 = 168/255 = 0.6588, ip_3 = 210/255 = 0.8235, ip_4 = 5/255 = 0.0196
  B-WGAN-GP: ip_1 to ip_8 = 1,1,0,0,0,0,0,0; ip_9 to ip_16 = 1,0,1,0,1,0,0,0; ip_17 to ip_24 = 1,1,0,1,0,0,1,0; ip_25 to ip_32 = 0,0,0,0,0,1,0,1
  E-WGAN-GP: ip_1, ..., ip_m = e_1, ..., e_m

port = 53872
  N-WGAN-GP: pt = 53872/65535 = 0.8220
  B-WGAN-GP: pt_1 to pt_8 = 1,1,0,1,0,0,1,0; pt_9 to pt_16 = 0,1,1,1,0,0,0,0
  E-WGAN-GP: pt_1, ..., pt_m = e_1, ..., e_m

bytes = 144
  N-WGAN-GP: norm_byt = (144 - byt_min) / (byt_max - byt_min)
  B-WGAN-GP: byt_1 to byt_24 = 0; byt_25 to byt_32 = 1,0,0,1,0,0,0,0
  E-WGAN-GP: byt_1, ..., byt_m = e_1, ..., e_m

packets = 1
  N-WGAN-GP: norm_pck = (1 - pck_min) / (pck_max - pck_min)
  B-WGAN-GP: pck_1 to pck_24 = 0; pck_25 to pck_32 = 0,0,0,0,0,0,0,1
  E-WGAN-GP: pck_1, ..., pck_m = e_1, ..., e_m

TCP flags = .A..S.
  all methods: isURG = 0, isACK = 1, isPSH = 0, isRES = 0, isSYN = 1, isFIN = 0
Figure 5: Extended generation of training samples in IP2Vec. Input values are highlighted in
blue color and expected output values are highlighted in black frames with white background.
We extend IP2Vec such that it also generates training samples for the attributes bytes, packets, and duration (see Figure 5), and thus learns embeddings not only for IP addresses and ports but for all categorical and numeric attribute values observed in flow-based network traffic. Table 2 shows the result of an exemplary transformation.
E-WGAN-GP maps flows to embeddings which need to be re-transformed to the original space after generation. To that end, each generated value is replaced by the vocabulary value with the closest embedding learned by IP2Vec. For instance, we calculate the cosine similarity between the generated output for the source IP address and all existing IP address embeddings generated by IP2Vec. Then, we replace the output with the IP address with the highest similarity.
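This lookup can be sketched as follows (an illustrative helper with toy three-dimensional vectors; the real embeddings are m-dimensional vectors learned by IP2Vec):

import numpy as np

def closest_value(generated: np.ndarray, embeddings: dict) -> str:
    # Return the known value whose embedding is most cosine-similar.
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(embeddings, key=lambda v: cosine(generated, embeddings[v]))

embeddings = {"192.168.220.5": np.array([0.9, 0.1, 0.0]),
              "192.168.220.13": np.array([0.8, 0.2, 0.1])}
print(closest_value(np.array([0.85, 0.15, 0.05]), embeddings))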
4. Experiments
We use the publicly available CIDDS-001 data set [14] which contains unidirectional flow-based network traffic as well as detailed information about the networks and IP addresses within the data set. Figure 6 shows an overview of the emulated business environment of the CIDDS-001 data set. In essence, the CIDDS-001 data set contains four internal subnets which can be identified by their IP address ranges: a developer subnet (dev) with exclusively Linux clients, an office subnet (off) with exclusively Windows clients, a management subnet (mgt) with mixed clients, and a server subnet (srv). This additional knowledge facilitates the evaluation of the generated data (see Section 4.3).
The CIDDS-001 data set contains four weeks of network traffic. We consider only the network traffic which was captured at the network device within the OpenStack environment (see Figure 6) and divide the network traffic into two parts: week1 and week2-4. The first two weeks contain normal user behavior and attacks, whereas week3 and week4 contain only normal user behavior and no attacks. We use this kind of splitting in order to obtain a large training data set week2-4 for our generative models and simultaneously provide a reference data set week1 which contains normal and malicious network behavior. Overall, week2-4 contains around 22 million flows and week1 contains around 8.5 million flows. We consider only the TCP, UDP and ICMP flows and remove the 895 IGMP flows from the data set.

Figure 6: Overview of the simulated network environment from the CIDDS-001 data set [14].
Since there is no single widely accepted evaluation methodology, we use several evaluation approaches to assess the quality of the generated data from different perspectives. To evaluate the diversity and distribution of the generated data, we visualize attributes (see Section 4.4.2) and compute the Euclidean distances between generated and real flow-based network data (see Section 4.4.3). To evaluate the quality of the content and the relationships between attributes within a flow, we introduce domain knowledge checks (see Section 4.4.4) as a new evaluation method.

In the following, we evaluate the quality of the data generated by the baseline (see Section 4.2), N-WGAN-GP, B-WGAN-GP, and E-WGAN-GP (see Section 3).
We trained the network for 10 epochs, following Ring et al. [11].

Figure 7: Temporal distribution of flows per hour.
4.4.2. Visualization
Figure 7 shows the temporal distribution of the generated flows and of the reference week week1. The y-axis shows the flows per hour as a percentage of total traffic, and the three lines represent the reference week (week1), the generated data of the baseline (baseline), and the generated data of the E-WGAN-GP approach (E-WGAN-GP). Since all three transformation approaches process the attribute date first seen in the same way, only E-WGAN-GP is included for the sake of brevity. E-WGAN-GP reflects the essential temporal distribution of flows. In the CIDDS-001 data set, the simulated users exhibit common behavior including lunch breaks and offline work, which results in temporally limited network activities and a jagged curve (e.g. around 12:00 on working days). However, the curve of E-WGAN-GP is smoother than the curve of the original traffic week1.
In the following, we use different visualization plots in order to gain a deeper understanding of the generated data. Figure 8 shows the real distributions sampled from week1 (first row), the distributions generated by our maximum likelihood estimator baseline (second row), and the distributions generated by our WGAN-GP models using a different data representation for each row (third to fifth row). Each violin plot shows the data distribution of the attribute source port (left) and the attribute destination IP address (right) for the different source IP addresses, grouped by their subnet (see Section 4.1). IP addresses from different subnets come along with different network behavior. For instance, IP addresses from the mgt subnet are typically clients which use services, while IP addresses from the srv subnet are servers which offer services. This knowledge was not explicitly modeled during data generation.

Figure 8: Distribution of source port (left) and destination IP address (right) for the subnets. The rows show, in order: (1) data sampled from real data (week1) and data generated by (2) baseline, (3) N-WGAN-GP, (4) E-WGAN-GP and (5) B-WGAN-GP.
We will now briefly discuss the conditional distribution of source ports (left column). In the first row, we can clearly distinguish typical client-port (dev, mgt, off) and server-port (ext, srv) distributions. As expected, the maximum likelihood baseline is not able to capture the differences of the distributions depending on the subnet of the source IP address and models a distribution which is a combination of all six subnets from the input data. In contrast, B-WGAN-GP and E-WGAN-GP capture the conditional probability distributions for the source port given the subnet of the source IP address very well.

N-WGAN-GP is incapable of representing the distributions properly: note that only flows with external source IP addresses are generated in the selected samples. In-depth analysis of the generated data suggests that numeric representations fail to match the designated subnets exactly. As all generated data is assigned to the ext subnet, it comes as no surprise that the distribution represents a combination of all six subnets from the input data for both source ports (left) and destination IP addresses (right).
For the attribute destination IP address, the distribution is a mixture of external and internal IP addresses for the dev, mgt and off subnets (see reference week week1). This matches the user roles: surfing the internet (external) as well as accessing internal services (e.g. printers). For external subnets, the destination IP address has to be within the internal IP address range. Traffic from external sources to external targets does not run through the simulated network environment of the CIDDS-001 data set. Consequently, there is no flow within the CIDDS-001 data set which has both a source IP address and a destination IP address from the ext subnet. This fact can be seen for week1 in Figure 8, where flows which have their origin in the ext subnet only address a small range of destination IP addresses which reflects the range of internal IP addresses. E-WGAN-GP and B-WGAN-GP capture this property very well, while the baseline and N-WGAN-GP fail to capture it.
Table 3: Euclidean distances between the training data (week2-4) and the generated flow-based network traffic in each attribute.
Ideally, generated data should differ from the training data (week2-4) to a similar extent as real network traffic from another period, like the reference week week1. However, it should be mentioned that there is no perfect distance value x which indicates the correct amount of concept drift. The generated data of E-WGAN-GP tends to have similar distances to the training data (week2-4) as the reference data set week1. Table 3 shows that the baseline has the lowest distance to the training data in each attribute. The generated data of N-WGAN-GP differs considerably from the training data set in some attributes. This is because N-WGAN-GP often does not generate the exact values but a large number of new values. The binary approach B-WGAN-GP has small distances in most attributes (except for the attribute duration). This may be caused by the distribution of duration in the training data, as most flows in the training data set have very small values in this attribute. Further, the normalization of the duration to the interval [0, 1] entails that almost all flows have very low values in this attribute. N-WGAN-GP and B-WGAN-GP tend to generate the smallest possible duration (0.000 seconds) for all flows.
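A distance of this kind can be computed by comparing the relative value frequencies of an attribute in two data sets. The following sketch shows one plausible variant; it is not necessarily the exact computation behind Table 3.

from collections import Counter
import math

def attribute_distance(real_values: list, generated_values: list) -> float:
    real_freq = Counter(real_values)
    gen_freq = Counter(generated_values)
    support = set(real_freq) | set(gen_freq)
    # Euclidean distance between the two relative frequency vectors.
    return math.sqrt(sum((real_freq[v] / len(real_values) -
                          gen_freq[v] / len(generated_values)) ** 2
                         for v in support))

print(attribute_distance(["TCP", "TCP", "UDP"], ["TCP", "UDP", "UDP"]))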
• Test 1: If the transport protocol is UDP, then the flow must not have any
TCP flags.
• Test 3: If the flow describes normal user behavior and the source port
or destination port is 80 (HTTP) or 443 (HTTPS), the transport protocol
must be TCP.
• Test 4: If the flow describes normal user behavior and the source port or
destination port is 53 (DNS), the transport protocol must be UDP.
• Test 7: TCP, UDP and ICMP packets have a minimum and a maximum packet size. Therefore, we check whether the relationship between bytes and packets in each flow is consistent with these bounds (a sketch of such checks is given below).
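Such checks are straightforward to implement as predicates over single flows. The following sketch illustrates Tests 1, 3 and 4; the attribute names are hypothetical, and Tests 3 and 4 assume the flow describes normal user behavior.

def test1(flow: dict) -> bool:
    # UDP flows must not carry any TCP flags.
    return flow["proto"] != "UDP" or not any(
        flow[f] for f in ("isURG", "isACK", "isPSH", "isSYN", "isRES", "isFIN"))

def test3(flow: dict) -> bool:
    # Normal HTTP/HTTPS traffic must use TCP.
    ports = (flow["src_pt"], flow["dst_pt"])
    return not (80 in ports or 443 in ports) or flow["proto"] == "TCP"

def test4(flow: dict) -> bool:
    # Normal DNS traffic must use UDP.
    return not (flow["src_pt"] == 53 or flow["dst_pt"] == 53) \
        or flow["proto"] == "UDP"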
Table 4 shows the results of checking the generated data against these rules.
Table 4: Results of the domain knowledge checks in percent. Higher values indicate better results. Columns: Baseline, N-WGAN-GP, B-WGAN-GP, E-WGAN-GP, and week1.
The reference data set week1 achieves 100 percent in each test, which is not surprising since this data is real flow-based network traffic captured in the same environment as the training data set. The baseline approach does not capture dependencies between flow attributes and achieves clearly worse results. This can be observed especially in Tests 1, 4, and 6. Since multicast and broadcast IP addresses appear only in the attribute destination IP address, the baseline cannot fail Test 5 and achieves 100 percent.
For our generative models, E-WGAN-GP achieves the best results on average. The usage of embeddings leads to more meaningful similarities within categorical attributes and facilitates the learning of interrelationships. Embeddings, however, also reduce the possible resulting space since no new values can be generated. B-WGAN-GP generates flows which achieve high accuracy in Tests 1 to 4. However, this approach shows weaknesses in Tests 5 and 6, where several internal relationships must be considered. The numerical approach N-WGAN-GP has the lowest accuracy in the tests. In particular, Test 4 shows that normalization of the source port or destination port to a single continuous attribute is inappropriate. Straightforward mapping of 2^16 different port values to one continuous attribute leads to too many values for a good reconstruction. In contrast to that, the binary representation of B-WGAN-GP leads to better results in that test.
5. Discussion
N-WGAN-GP, for instance, has problems with the generation of private IP addresses. Instead, this approach often generates non-private IP addresses such as 191.168.X.X or 192.167.X.X. In image generation, the original application domain of GANs, such small errors do not have serious consequences. A brightness value of 191 instead of 192 in a generated pixel has nearly no effect on the image, and the error is (normally) not visible to the human eye. Further, N-WGAN-GP normalizes the numeric attributes bytes and packets to the interval [0, 1]. The generated data are then de-normalized using the original training data. Here, we can observe that real flows often have typical byte sizes like 66 bytes which are also not matched exactly. This results in higher Euclidean distances in these attributes (see Table 3). Overall, the first method N-WGAN-GP does not seem to be suitable for generating realistic flow-based network traffic.
B-WGAN-GP extracts binary attributes from categorical attributes and converts numerical attributes to their binary representation. Using this transformation, additional structural information (e.g. subnet information) of IP addresses can be maintained. Further, B-WGAN-GP assigns larger value ranges to categorical values in the transformed space than N-WGAN-GP: while N-WGAN-GP uses a single continuous attribute to represent a source port, B-WGAN-GP uses 16 binary attributes for its representation. These two aspects support B-WGAN-GP in generating better categorical values of a flow, as can be observed in the results of the domain knowledge checks (see e.g. Test 2 and Test 4 in Table 4). Further, Figure 8 indicates that B-WGAN-GP captures the internal structure of the traffic very well, even though it is less restricted than E-WGAN-GP with respect to the treatment of previously unseen values.
E-WGAN-GP learns embeddings for IP addresses, ports, bytes, packets, and duration. These embeddings are continuous vector representations and take contextual information into account. As a consequence, the generation of flows is less error-prone, as small variations in the embedding space generally do not change the outcome in input space much. For instance, if a GAN introduces a small error in IP address generation, it could find the embedding of the IP address 192.168.220.5 as nearest neighbor instead of the embedding of the expected IP address 192.168.220.13. Since both IP addresses are internal clients, the error has nearly no effect. As a consequence, E-WGAN-GP achieves the best results of the generative models in the evaluation. Yet, this approach (in contrast to N-WGAN-GP and B-WGAN-GP) cannot generate previously unseen values due to the embedding translation. This is not a problem for the attributes bytes, packets and duration. Given enough training data, embeddings for all (important) values of bytes, duration and packets are available. For example, consider the attribute bytes. We assume that the available embedding values b_1, b_2, ..., b_k sufficiently cover the possible value range of the attribute bytes. As specific byte values have no particular meaning, we are only interested in the magnitude of the attribute. Therefore, non-existing values b_x can be replaced with available embedding values without adversely affecting the meaning.
The situation may be different for IP addresses and ports. IP addresses represent hosts with a distinct, complex network behavior, for instance as a web server, printer, or Linux client. Generating new IP addresses goes along with the invention of a new host with new network behavior. To answer the question whether the generation of new IP addresses is necessary, the purpose for which the generated data shall be used later needs to be considered. If the training set comprises more than 10,000 or 100,000 different IP addresses, there is probably no need to generate new IP addresses for an IDS evaluation data set. However, this does not hold generally. Instead, one should ask the following two questions: (1) are there enough different IP addresses in the training data set, and (2) is there a need to generate previously unseen IP addresses? If previously unseen IP addresses are required, E-WGAN-GP is not suitable as a transformation method; otherwise, E-WGAN-GP will generate better flows than all other approaches.

The situation for ports is similar to that for IP addresses. Generally, there are 65,536 different ports and most of these ports should appear in the training data set. Generating new port values is also associated with generating new behavior. If the training data set comprises SSH connections (port 22) and HTTP connections (port 80), but no FTP connections (ports 20 and 21), generators are not able to produce realistic FTP connections since they have never seen such connections. Since the network behavior of FTP differs greatly from SSH and HTTP, it does not make much sense to generate unseen service ports. However, the situation is different for typical client ports.
Generally, GANs capture the implicit conditional probability distributions very well, given that a proper data representation is chosen, which is the case for E-WGAN-GP and B-WGAN-GP (see Figure 8). While the visual differences between the binary and embedded data representations are subtle, the domain knowledge checks show larger quality differences. Overall, this analysis suggests that E-WGAN-GP and B-WGAN-GP are able to generate good flow-based network traffic. While E-WGAN-GP achieves better evaluation results, B-WGAN-GP is not limited in its value range and is able to generate previously unseen values, for example for IP addresses or ports.
6. Related Work
Category (I). As the name suggests, Replay Engines use previously captured network traffic and replay the packets from it. Often, the aim of Replay Engines is to preserve the original inter-packet time (IPT) behavior between the network packets. TCPReplay [36] and TCPivo [37] are well-known representatives of this category. Since network traffic is subject to concept drift, replaying already known network traffic only makes limited sense for generating IDS evaluation data sets. Instead, a good network traffic generator should be able to generate new synthetic flow-based network traffic.
Category (III). Attack Generators use real network traffic as input and combine it with synthetically created attacks. FLAME [39] is a generator for malicious network traffic. The authors use rule-based approaches to inject e.g. port scan attacks or denial of service attacks. Vasilomanolakis et al. [40] present ID2T, a similar approach which combines real network traffic with synthetically created malicious network traffic. For creating malicious network traffic, the authors use rule-based scripts or manipulate parameters of the input network traffic. Sperotto et al. [41] analyze SSH brute-force attacks on flow level and use a Hidden Markov Model to model their characteristics. However, their model generates only the number of bytes, packets and flows during a typical attack scenario and does not generate complete flow-based data.
Category (IV). Siska et al. [31] extract traffic templates from real network traffic. These templates are associated with typical services (e.g. port 80 (HTTP) or 53 (DNS)) and contain structural properties as well as the value distributions of flow attributes (e.g. log-normal distribution of transmitted bytes). These traffic templates can be combined with user-defined traffic templates. A flow generator selects flow attributes from the traffic templates and generates new network traffic. Iannucci et al. [32] propose PGPBA and PGSK, two synthetic flow-based network traffic generators. Their generators are based on the graph generation algorithms Barabasi-Albert (PGPBA) and Kronecker (PGSK). The authors initialize their graph-based approaches with network traffic in packet-based format. When generating new traffic, the authors first compute the probability of the attribute bytes. All other attributes of flow-based data are calculated based on the conditional probability given the attribute bytes. To evaluate the quality of their generated traffic, Iannucci et al. [32] analyze the degree and pagerank distributions of their graphs to show the veracity of the generated data.
Apart from that, GANs were recently introduced in the IT security domain. Yin et al. [42] propose Bot-GAN, a framework which generates synthetic network data in order to improve botnet detection methods. However, their framework does not consider the generation of categorical attributes like IP addresses and ports, which is one of the key contributions of our work. Hu and Tan [43] present a GAN-based approach named MalGAN in order to generate synthetic malware examples which are able to bypass anomaly-based detection methods. Malware examples are represented as 160-dimensional binary feature vectors. Similarly, Rigaki and Garcia [44] use a GAN-based approach to adapt malware communication to avoid detection. However, they consider only three continuous attributes of the underlying network traffic in their approach.
The approach presented here does not simply replay existing network traffic like category (I); in fact, traffic generators from the first two categories have a different objective. Our approach belongs to category (IV): it generates new synthetic network traffic and is not limited to generating only malicious network traffic like category (III). While Siska et al. [31] and Iannucci et al. [32] use domain knowledge to generate flows by defining conditional dependencies between flow attributes, we use GAN-based approaches which learn all dependencies between the flow attributes inherently.
7. Summary
Labeled flow-based data sets are necessary for evaluating and comparing anomaly-based intrusion detection methods. Evaluation data sets like DARPA 98 and KDD Cup 99 cover several attack scenarios as well as normal user behavior. These data sets, however, were captured at some point in time, such that concept drift of network traffic causes static data sets to become obsolete sooner or later.
In this paper, we proposed three synthetic flow-based network traffic generators which are based on Improved Wasserstein GANs (WGAN-GP) [12] using the two time-scale update rule from [13]. Our generators are initialized with real network traffic and then generate new flow-based network traffic. In contrast to previous high-level generators, our GAN-based approaches learn all internal dependencies between attributes inherently, and no additional knowledge has to be modeled. Flow-based network traffic consists of heterogeneous data, but GANs can only process continuous input data. To overcome this challenge, we proposed three different methods to handle flow-based network data. In the first approach, N-WGAN-GP, we interpreted IP addresses and ports as continuous input values and normalized numeric attributes like bytes and packets to the interval [0, 1]. In the second approach, B-WGAN-GP, we created binary attributes from categorical and numerical attributes. For instance, we converted ports to their 16-bit binary representation and extracted 16 binary attributes. B-WGAN-GP is able to maintain more information (e.g. subnet information of IP addresses) from the categorical input data. The third approach, E-WGAN-GP, learns meaningful continuous representations of categorical attributes like IP addresses using IP2Vec [11]. The preprocessing of E-WGAN-GP is inspired by the text mining domain, which also has to deal with non-continuous input values. Then, we generated new flow-based network traffic based on the CIDDS-001 data set [14] in an experimental evaluation. Our experiments indicate that especially E-WGAN-GP is able to generate realistic data which achieves good evaluation results. B-WGAN-GP achieves similarly good results and, in contrast to E-WGAN-GP, is able to create new (unseen) values. The quality of network data generated by N-WGAN-GP is less convincing, which indicates that a straightforward numeric transformation is not appropriate.
Our research indicates that GANs are well suited for generating flow-based
network traffic. We plan to extend our approach in order to generate sequences
of flows instead of single flows. In addition, we want to work on the development
of further evaluation methods.
Acknowledgments
References
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative Adversarial Nets, in: Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.
[12] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A. C. Courville, Improved Training of Wasserstein GANs, in: Advances in Neural Information Processing Systems (NIPS), 2017, pp. 5769–5779.
[15] B. Claise, Cisco Systems NetFlow Services Export Version 9, RFC 3954
(2004).
[17] J. Han, J. Pei, M. Kamber, Data Mining: Concepts and Techniques, 3rd
Edition, Elsevier, 2011.
[20] C. Wagner, J. François, T. Engel, et al., Machine Learning Approach for IP-
Flow Record Anomaly Detection, in: International Conference on Research
in Networking, Springer, 2011, pp. 28–39.
[21] R. Salakhutdinov, H. Larochelle, Efficient learning of deep Boltzmann machines, in: International Conference on Artificial Intelligence and Statistics, 2010, pp. 693–700.
[26] A. Borji, Pros and Cons of GAN Evaluation Measures, arXiv preprint
arXiv:1802.03446.
[31] P. Siska, M. P. Stoecklin, A. Kind, T. Braun, A Flow Trace Generator using Graph-based Traffic Classification Techniques, in: International Wireless Communications and Mobile Computing Conference (IWCMC), ACM, 2010, pp. 457–462. doi:10.1145/1815396.1815503.
[37] W.-c. Feng, A. Goel, A. Bezzaz, W.-c. Feng, J. Walpole, TCPivo: A High-Performance Packet Replay Engine, in: ACM Workshop on Models, Methods and Tools for Reproducible Network Research, ACM, 2003, pp. 57–64.
[39] D. Brauckhoff, A. Wagner, M. May, FLAME: A Flow-Level Anomaly Modeling Engine, in: Workshop on Cyber Security Experimentation and Test (CSET), USENIX Association, 2008, pp. 1:1–1:6.
[41] A. Sperotto, R. Sadre, P.-T. de Boer, A. Pras, Hidden Markov Model modeling of SSH brute-force attacks, in: International Workshop on Distributed Systems: Operations and Management, Springer, 2009, pp. 164–176.