Kwangjo Kim
Muhamad Erza Aminanto
Harry Chandra Tanuwidjaja
Network Intrusion Detection using Deep Learning
A Feature Learning Approach
SpringerBriefs on Cyber Security Systems
and Networks
Editor-in-chief
Yang Xiang, Digital Research & Innovation Capability Platform, Swinburne
University of Technology, Hawthorn, Melbourne, VIC, Australia
Series editors
Liqun Chen, University of Surrey, Guildford, UK
Kim-Kwang Raymond Choo, Department of Information Systems and Cyber
Security, University of Texas at San Antonio, San Antonio, TX, USA
Sherman S. M. Chow, Department of Information Engineering, The Chinese
University of Hong Kong, Hong Kong
Robert H. Deng, School of Information Systems, Singapore Management
University, Singapore, Singapore
Dieter Gollmann, Hamburg University of Technology, Hamburg, Germany
Javier Lopez, University of Malaga, Malaga, Spain
Kui Ren, University at Buffalo, Buffalo, NY, USA
Jianying Zhou, Singapore University of Technology and Design, Singapore,
Singapore
The series aims to develop and disseminate an understanding of innovations,
paradigms, techniques, and technologies in the context of research and studies on
cyber security systems and networks. It publishes thorough and cohesive
overviews of state-of-the-art topics in cyber security, as well as sophisticated
techniques, original research presentations and in-depth case studies in cyber
systems and networks. The series also provides a single point of coverage of
advanced and timely emerging topics as well as a forum for core concepts that may
not have reached a level of maturity to warrant a comprehensive textbook. It
addresses security, privacy, availability, and dependability issues for cyber systems
and networks, and welcomes emerging technologies, such as artificial intelligence,
cloud computing, cyber physical systems, and big data analytics related to cyber
security research. The series mainly focuses on the following research topics:
Fundamentals and Theories
• Cryptography for cyber security
• Theories of cyber security
• Provable security
Cyber Systems and Networks
• Cyber systems security
• Network security
• Security services
• Social networks security and privacy
• Cyber attacks and defense
• Data-driven cyber security
• Trusted computing and systems
Applications and Others
• Hardware and device security
• Cyber application security
• Human and social aspects of cyber security
Kwangjo Kim
School of Computing (SoC)
Korea Advanced Institute of Science and Technology
Daejeon, Korea (Republic of)

Muhamad Erza Aminanto
School of Computing (SoC)
Korea Advanced Institute of Science and Technology
Daejeon, Korea (Republic of)
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2018
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
To our families for their loving support.
Preface
This monograph presents recent advances in Intrusion Detection Systems (IDSs) using
deep learning models, which have recently achieved great success, particularly in
the fields of computer vision, natural language processing, and image processing.
The monograph provides a systematic and methodical overview of the latest
developments in deep learning and makes a comparison among deep learning-based
IDSs. The monograph provides a comprehensive overview of deep learning applications
to IDS, followed by deep feature learning methods comprising a novel deep feature
extraction and selection approach and deep learning for clustering. Remaining
challenges and future research directions are also discussed.
The monograph offers a rich overview of deep learning-based IDS and is
suitable as a reference for students, researchers, and practitioners interested in
deep learning and intrusion detection. The comprehensive comparison of
various deep learning applications helps readers with a basic understanding of
machine learning and inspires applications in IDS and other cybersecurity areas.
The outline of this monograph is as follows:
Chapter 1 describes the importance of IDSs in today's computer networks by
surveying security breaches in computer networks. It highlights how deep learning
models can improve IDS performance and explains the motivation for surveying
deep learning-based IDSs.
Chapter 2 provides the relevant definitions of IDS. It then explains the different
types of current IDSs, classified by where the detection module is placed and by
the detection approach used. Common performance metrics and publicly available
benchmark datasets are also provided in this chapter.
Chapter 3 provides a brief preliminary study of classical machine learning,
covering supervised, unsupervised, semi-supervised, weakly supervised,
reinforcement, and adversarial machine learning. It briefly surveys 22 papers
that use machine-learning techniques in their IDSs.
Chapter 4 discusses several deep learning models, covering generative,
discriminative, and hybrid approaches.
Chapter 5 surveys various IDSs that leverage deep learning models, divided
into four classes: generative, discriminative, hybrid, and deep reinforcement
learning.
Chapter 6 discusses the importance of deep learning models as a Feature Learning
(FL) approach in IDS research. We further explain two models: deep feature
extraction and selection, and deep learning for clustering.
Chapter 7 concludes this monograph by providing an overview of challenges and
future research directions in deep learning applications for IDS.
The Appendix discusses several papers on malware detection over a network using
deep learning models. Malware detection is also an important issue due to the
increasing number of malware samples, and it shares a similar approach with IDS.
This monograph was partially supported by the Institute for Information & Com-
munications Technology Promotion (IITP) grant funded by the Korea Govern-
ment (MSIT) (2013-0-00396, Research on Communication Technology using
Bio-Inspired Algorithm, and 2017-0-00555, Towards Provable-secure Multi-party
Authenticated Key Exchange Protocol based on Lattices in a Quantum World) and
by the National Research Foundation of Korea (NRF) grant funded by the Korea
Government (MSIT) (No. NRF-2015R1A-2A2A01006812).
We are very grateful to Prof. Hong-Shik Park, School of Electrical Engineering,
KAIST, who gave us the excellent opportunity to execute this research initiative
by combining deep learning with intrusion detection for secure wireless networks.
We also thank Prof. Paul D. Yoo and Prof. Taufiq Asyhari, Cranfield Defence and
Security, UK, who gave us inspiring discussions during our research.
The authors sincerely appreciate the contributions of the alumni and current
members of the Cryptology and Information Security Lab. (CAISLAB), Graduate
School of Information Security, School of Computing, KAIST, Khalid Huseynov,
Dongsoo Lee, Kyungmin Kim, Hakju Kim, Rakyong Choi, Jeeun Lee, Soohyun
Ahn, Joonjeong Park, Jeseoung Jung, Jina Hong, Sungsook Kim, Edwin Ayisi
Opare, Hyeongcheol An, Seongho Han, Nakjun Choi, Nabi Lee, and Dongyeon
Hong.
We gratefully acknowledge the editors of this monograph series on security for
their valuable comments and Springer for giving us the opportunity to write this monograph.
Finally, we are also very grateful to our families for their strong support and
endless love.
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Intrusion Detection Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.1 Performance Metric. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.2 Public Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Classical Machine Learning and Its Applications to IDS. . . . . . . . . . . . . . . . . 13
3.1 Classification of Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.2 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.3 Semi-supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.4 Weakly Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.5 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.6 Adversarial Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Machine-Learning-Based Intrusion Detection Systems . . . . . . . . . . . . . . 21
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Generative (Unsupervised Learning) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.1 Stacked (Sparse) Auto-Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2.2 Boltzmann Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2.3 Sum-Product Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2.4 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Discriminative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.4 Hybrid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.4.1 Generative Adversarial Networks (GAN) . . . . . . . . . . . . . . . . . . . . 32
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Acronyms
FP False Positive
FPR False Positive Rate
FW Firewall
GAN Generative Adversarial Networks
GPU Graphics Processing Unit
GRU Gated Recurrent Unit
HJI Hamilton-Jacobi-Isaacs
HIS Human Immune System
ICV Integrity Check Value
IDS Intrusion Detection System
IG Information Gain
IoT Internet of Things
IPS Intrusion Prevention System
IV Initialization Vector
JSON JavaScript Object Notation
KL Kullback-Leibler
kNN K-Nearest Neighbors
LoM Largest of Max
LSTM Long Short-Term Memory
MDP Markov Decision Processes
METIS Mobile and Wireless Communications Enablers for the Twenty-
Twenty Information Society
MF Membership Functions
MLP Multi-Layer Perceptron
MoM Mean of Max
MSE Mean Square Error
NN Neural Network
PSO Particle Swarm Optimization
R2L Remote to Local
RBM Restricted Boltzmann Machine
RL Reinforcement Learning
RNN Recurrent Neural Networks
SAE Stacked Auto-Encoder
SDAE Stacked Denoising Auto-Encoder
SDN Software-Defined Networking
SFL Supervised Feature Learning
SGD Stochastic Gradient Descent
SNN Shared Nearest Neighbor
SOM Self-Organizing Map
SoM Smallest of Max
SPN Sum-Product Networks
STL Self-Taught Learning
SVM Support Vector Machine
SVM-RFE SVM-Recursive Feature Elimination
TBM Time to Build Model
Chapter 1
Introduction
Abstract This chapter discusses the importance of IDSs in computer networks, as
wireless networks grow rapidly these days, by surveying security breaches in
wireless networks. Many methods have been used to improve IDS performance; the
most promising one is to deploy machine learning. The usefulness of a recent class
of machine-learning models, called deep learning, is then highlighted for improving
IDS performance, particularly as a Feature Learning (FL) approach. We also explain
the motivation for surveying deep learning-based IDSs.
Computer networks and the Internet are inseparable from human life today. Abundant
applications rely on the Internet, including life-critical applications in healthcare and
the military. Moreover, an enormous volume of financial transactions takes place over
the Internet every day. This rapid growth of the Internet has led to a significant
increase in wireless network traffic in recent years. According to a worldwide
telecommunication consortium, the Mobile and Wireless Communications Enablers for the
Twenty-Twenty Information Society (METIS) [1], a proliferation of 5G and Wi-Fi networks
is expected to occur in the next decades. They believe that an avalanche of mobile and
wireless traffic volume will occur as society's needs develop and must be fulfilled.
Applications such as e-learning, e-banking, and e-health will spread and become more
mobile. By 2020, wireless network traffic is anticipated to account for two-thirds of
total Internet traffic, with 66% of IP traffic expected to be generated by Wi-Fi and
cellular devices alone.
Cyber-attacks have been growing at an immense rate as the Internet of Things (IoT)
is widely used these days [2].
IBM [3] reported that an enormous number of accounts were hijacked during 2016 and
that spam email volume was four times higher than in the previous year. Common
attacks noticed in the same
References

Chapter 2
Intrusion Detection Systems
Abstract This chapter briefly introduces all the relevant definitions of Intrusion
Detection Systems (IDSs), followed by a classification of current IDSs based on
where the detection module is located and on the approach adopted. We also explain
and provide examples of one IDS type common in research, the machine-learning-based
IDS. Then, we discuss an example of an IDS using a bio-inspired clustering method.
2.1 Definition
2.2 Classification
We can divide IDSs by their placement and by the methodology deployed in a
network. By the position of the IDS module in the network, we can distinguish
three classes of IDSs: network-based, host-based, and hybrid. The first, the
network-based IDS shown in Fig. 2.2, places the IDS module inside the network
where the whole network can be monitored. This IDS checks for malicious activities
by inspecting all packets moving across the network. On the other hand, Fig. 2.3 shows
Fig. 2.1 Typical network using (a) Firewall and (b) IDS
the host-based IDS, which places the IDS module on each client of the network.
Because the module examines all inbound and outbound traffic of the corresponding
client, it provides detailed monitoring of that particular client. Both types of
IDSs have specific drawbacks: the network-based IDS might be overburdened by the
workload and then miss some malicious activities, while the host-based IDS does not
monitor all the network traffic, although it carries less workload than the
network-based IDS. Therefore, the hybrid IDS, as shown in Fig. 2.4, places IDS
modules in the network as well as on clients to monitor both specific clients and
network activities at the same time.
Based on the detection method, IDSs can be divided into three different types:
misuse-, anomaly-, and specification-based IDSs. A misuse-based IDS, also known as a
signature-based IDS [2], looks for malicious activities by matching known attack
signatures or patterns against the monitored traffic. This IDS is suitable for
detecting known attacks; however, new or unknown attacks (also called zero-day
exploits) are difficult to detect. An anomaly-based IDS detects an attack by
profiling normal behavior and then triggering an alarm if there is any deviation from
it. The strength of this IDS is its capability to detect unknown attacks. A misuse-
based IDS usually achieves higher detection performance for known attacks than an
anomaly-based IDS. A specification-based IDS manually defines a set of rules and
constraints to express normal operations. Any deviation from the rules and the
constraints during the execution is flagged as malicious [3]. Table 2.1 summarizes
the comparison of IDS types based on the methodology.
We further discuss the machine-learning-based IDS, which belongs to the anomaly-
based category [4]. There are two types of learning, namely, supervised and unsupervised
learning. Unsupervised learning does not require a labeled dataset for training,
which is crucial given today's colossal network traffic, while supervised learning
requires a labeled dataset. Unsupervised learning capability is of critical significance,
as it allows a model to detect new attacks without creating costly labels or dependent
variables. Table 2.2 outlines the comparison between supervised and unsupervised
learning.
2.3 Benchmark
The evaluation of any IDS performance can be done by adopting the common model
performance measures [5]: accuracy (Acc), DR, false alarm rate (FAR), Mcc,
Precision, F1 score, CPU Time to Build Model (TBM), and CPU Time to Test
(TT). Acc shows the overall effectiveness of an algorithm [6]. DR, also known
as Recall, refers to the number of impersonation attacks detected divided by the
total number of impersonation attack instances in the test dataset. Unlike Recall,
Precision counts the number of impersonation attacks detected among the total
number of instances classified as an attack. The F1 score measures the harmonic
mean of Precision and Recall. FAR is the number of normal instances classified
as an attack divided by the total number of normal instances in the test dataset,
while FNR shows the proportion of attack instances that fail to be detected.
Mcc represents the correlation coefficient between the detected and observed data
[7]. Intuitively, the goal is to achieve high Acc, DR, Precision, Mcc, and F1
score and, at the same time, maintain low FAR, TBM, and TT. The above measures
can be defined by Eqs. (2.1), (2.2), (2.3), (2.4), (2.5), (2.6), and (2.7):
Acc = (TP + TN) / (TP + TN + FP + FN),    (2.1)

DR (Recall) = TP / (TP + FN),    (2.2)

Precision = TP / (TP + FP),    (2.3)

FAR = FP / (TN + FP),    (2.4)

FNR = FN / (FN + TP),    (2.5)

F1 = 2TP / (2TP + FP + FN),    (2.6)

Mcc = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).    (2.7)
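For readers who want to check these measures numerically, the short Python sketch below (an illustrative helper, not part of the monograph) evaluates Eqs. (2.1)-(2.7) from the four confusion-matrix counts:

```python
import math

def ids_metrics(tp, tn, fp, fn):
    """Compute the IDS performance measures of Eqs. (2.1)-(2.7)
    from the confusion-matrix counts TP, TN, FP, and FN."""
    acc = (tp + tn) / (tp + tn + fp + fn)          # Eq. (2.1)
    dr = tp / (tp + fn)                            # Eq. (2.2), a.k.a. Recall
    precision = tp / (tp + fp)                     # Eq. (2.3)
    far = fp / (tn + fp)                           # Eq. (2.4)
    fnr = fn / (fn + tp)                           # Eq. (2.5)
    f1 = 2 * tp / (2 * tp + fp + fn)               # Eq. (2.6)
    mcc = ((tp * tn) - (fp * fn)) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))  # Eq. (2.7)
    return {"Acc": acc, "DR": dr, "Precision": precision,
            "FAR": far, "FNR": fnr, "F1": f1, "Mcc": mcc}

# Example: 900 attacks detected, 50 missed, 9000 normal, 100 false alarms.
print(ids_metrics(tp=900, tn=9000, fp=100, fn=50))
```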
References
1. K. Scarfone and P. Mell, “Guide to intrusion detection and prevention systems (idps),” NIST
special publication, vol. 800, no. 2007, 2007.
2. A. H. Farooqi and F. A. Khan, “Intrusion detection systems for wireless sensor networks: A
survey,” in Proc. Future Generation Information Technology Conference, Jeju Island, Korea.
Springer, 2009, pp. 234–241.
3. R. Mitchell and I. R. Chen, “Behavior rule specification-based intrusion detection for safety
critical medical cyber physical systems,” IEEE Trans. Dependable Secure Comput., vol. 12,
no. 1, pp. 16–30, Jan 2015.
4. I. Butun, S. D. Morgera, and R. Sankar, “A survey of intrusion detection systems in wireless
sensor networks,” IEEE Commun. Surveys Tuts., vol. 16, no. 1, pp. 266–282, 2014.
Chapter 3
Classical Machine Learning and Its Applications to IDS
In supervised learning, target class data are necessary during training. The network
builds a model based on how well its outputs match the labeled data. There
are many machine-learning models for supervised learning; some of them are as
follows.
D(x) = wx + b,    (3.1)

w = Σ_k αk yk xk,    (3.2)
where w, y, α, and b denote the weight vector, the class label, the marginal
support vector, and the bias value, respectively. k denotes the number of samples.
Equations (3.2) and (3.3) show how to compute the value of w and b, respectively.
SVM-Recursive Feature Elimination (SVM-RFE) is an application of RFE that uses the
magnitude of the weights to perform feature ranking [1]. RFE ranks the feature
set and eliminates the low-ranked features, which contribute less than the other
features to the classification task [2].
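As an illustration only (not from the monograph), the SVM-RFE idea can be reproduced with scikit-learn's RFE wrapper around a linear SVM; the data below are synthetic placeholders standing in for 41 KDD-style features:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

# Toy data standing in for IDS feature vectors (e.g., 41 KDD-style features).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 41))
y = rng.integers(0, 2, size=500)          # 0 = benign, 1 = attack (synthetic labels)

# A linear SVM provides the weight vector w; RFE repeatedly drops the
# feature with the smallest weight magnitude, as in SVM-RFE [1].
svm = SVC(kernel="linear", C=1.0)
selector = RFE(estimator=svm, n_features_to_select=10, step=1)
selector.fit(X, y)

print("Selected feature indices:", np.where(selector.support_)[0])
print("Feature ranking (1 = kept):", selector.ranking_)
```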
C4.5 is robust to noisy data and able to learn disjunctive expressions [3]. It has a
k-ary tree structure in which each node represents a test of an attribute from the
input data. Every branch of the tree shows potentially selected important features as
node values and different test results. C4.5 uses a greedy algorithm to construct a
tree in a top-down, recursive, divide-and-conquer manner [3]. The algorithm begins
by selecting the attribute that yields the best classification result, followed
by generating a test node for the corresponding attribute. The data are then divided
based on the Information Gain (IG) value of the nodes according to the test attribute
that resides in the parent node. The algorithm terminates when all data are grouped
in the same class, or when adding further splits produces a similar classification
result based on a predefined threshold.
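To make the greedy split criterion concrete, the following sketch (illustrative only, not from the monograph) computes the entropy and Information Gain of a candidate split, the quantity that C4.5-style trees maximize at each node:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(labels, groups):
    """IG of splitting `labels` into the label lists in `groups`."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder

# Toy example: 10 connections (a = attack, n = normal) split by one attribute.
labels = ["a"] * 4 + ["n"] * 6
split = [["a", "a", "a", "n"], ["a", "n", "n", "n", "n", "n"]]
print("IG of this split:", round(information_gain(labels, split), 4))
```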
3.1.2 Unsupervised Learning
Unlike supervised learning, unsupervised learning does not require any labeled data
during training. This is one advantage of unsupervised learning, since building
a comprehensive labeled dataset is usually difficult. Because no target classes are
provided, the network looks for similar properties among the training instances and
groups similar instances together. For anomaly detection, one can consider
the outlier instances as anomalies. Some examples of unsupervised learning are as
follows:
The K-means clustering algorithm groups all observation data into k clusters iteratively
until convergence is reached. In the end, each cluster contains similar data,
since every observation is assigned to its nearest cluster. The K-means algorithm assigns
the mean value of the cluster members as the cluster centroid. In every iteration, it
finds, for each observation, the cluster centroid with the shortest Euclidean distance.
The intra-cluster variances are also minimized by updating the cluster centroids
iteratively. The algorithm terminates when convergence is achieved, that is, when the
new clusters are the same as the clusters of the previous iteration [4].
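A minimal sketch of this procedure (illustrative only; the toy two-cluster data are placeholders) is:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-means: assign each point to its nearest centroid (Euclidean
    distance), recompute each centroid as the mean of its members, and stop
    when the cluster assignments no longer change."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                    # convergence reached
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)  # update centroid
    return labels, centroids

# Toy 2-D feature data: two well-separated groups of "traffic" points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
print("Cluster sizes:", np.bincount(labels))
```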
f(i) = (1/s²) Σ_j (1 − d(i, j)/α)  if f(i) > 0, and f(i) = 0 otherwise,    (3.4)
where s² is the number of cells in the surrounding area of i, and α is a constant that
depicts the disparity among objects. f(i) reaches its maximum value when all
the sites in the surrounding area are occupied by similar or even identical objects. The
probabilities of picking up and dropping an object i are given by Eqs. (3.5) and (3.6),
respectively:
Ppick(i) = (kp / (kp + f(i)))²,    (3.5)

Pdrop(i) = 2f(i)  if f(i) < kd, and Pdrop(i) = 1 otherwise,    (3.6)
where the parameters kp and kd are threshold constants of the probability of picking
up and dropping an object, respectively. A loaded ant considers the first empty cell
in its local area to drop the object, since the ant's current position may already be
occupied by another object [5].
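As a sketch only (not from the monograph), Eqs. (3.4)-(3.6) can be coded as below; the distance function and the constants α, kp, and kd are placeholders:

```python
import numpy as np

def neighbourhood_fitness(obj, neighbours, dist, s2, alpha):
    """Eq. (3.4): average similarity of `obj` to the objects in its local
    neighbourhood of s2 cells; a negative sum is clipped to 0."""
    total = sum(1.0 - dist(obj, other) / alpha for other in neighbours)
    f = total / s2
    return f if f > 0 else 0.0

def p_pick(f, kp):
    """Eq. (3.5): probability that an unloaded ant picks the object up."""
    return (kp / (kp + f)) ** 2

def p_drop(f, kd):
    """Eq. (3.6): probability that a loaded ant drops the object."""
    return 2.0 * f if f < kd else 1.0

# Toy usage with Euclidean distance between feature vectors.
dist = lambda a, b: float(np.linalg.norm(a - b))
obj = np.array([0.1, 0.2])
neighbours = [np.array([0.1, 0.25]), np.array([0.9, 0.8])]
f = neighbourhood_fitness(obj, neighbours, dist, s2=9, alpha=1.0)
print("f(i) =", round(f, 3), " Ppick =", round(p_pick(f, 0.1), 3),
      " Pdrop =", round(p_drop(f, 0.15), 3))
```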
Tsang et al. [6] define two variables, intra-cluster and inter-cluster distance,
to measure ACA performance. Low intra-cluster distance means better
compactness, while large inter-cluster distance means better separateness. A
good ACA should therefore provide minimum intra-cluster distance and maximum inter-
cluster distance to present the inherent structures and knowledge of the data
patterns.
Fig. 3.1 AE network with symmetric input-output layers and three neurons in one hidden layer
tiability properties [7]. The decoder function expressed in Eq. (3.8) maps hidden
representation y back to a reconstruction.
z = sg(V · y + bg),    (3.8)
where sg is the activation function of the decoder, which is commonly either the
identity function, sg(t) = t, or a sigmoid function as in the encoder. W and V act as
the weight matrices of the encoder and decoder, and bf and bg act as the bias vectors
for encoding and decoding, respectively. The training phase finds optimal parameters
θ = {W, V, bf, bg} that minimize the reconstruction error between the input data
and its reconstruction output on a training set.
We further explain a modified form of AE, which is a sparse AE [8]. This is
based on the experiments of Eskin et al. [9], in which anomalies usually form
small clusters in scattered areas of feature space. Moreover, the dense and large
clusters typically contain benign data [10]. For the sparsity of AE, we first observe
the average output activation value of a neuron i, as expressed by Eq. (3.9).
ρ̂i = (1/N) Σ_{j=1}^{N} sf(wiT xj + bf,i),    (3.9)
18 3 Classical Machine Learning and Its Applications to IDS
where N is the total number of training data, xj is the j-th training datum, wiT is the
i-th row of the weight matrix W, and bf,i is the i-th element of the encoding bias
vector bf. By lowering the value of ρ̂i, neuron i in the hidden layer responds to a
specific feature present in only a small number of training data.
The task of machine learning is to fit a model to the given training data. However,
the model often fits the particular training data but is incapable of classifying
other data; this is known as the overfitting problem. In this case, we can
use a regularization technique to reduce overfitting. The sparsity
regularization Ωsparsity evaluates how close the average output activation value ρ̂i
is to the desired value ρ, typically using the Kullback-Leibler (KL) divergence to
determine the difference between the two distributions, as expressed in Eq. (3.10).
Ωsparsity = Σ_{i=1}^{h} KL(ρ ‖ ρ̂i)
          = Σ_{i=1}^{h} [ρ log(ρ/ρ̂i) + (1 − ρ) log((1 − ρ)/(1 − ρ̂i))],    (3.10)
Ωweights = (1/2) Σ_{i=1}^{h} Σ_{j=1}^{N} Σ_{k=1}^{K} (wji)²,    (3.11)
where N and K are the number of training data and the number of variables for each
data, respectively.
The goal of training sparse AE is to find the optimal parameters, θ =
{W, V , bf , bg }, to minimize the cost function shown in Eq. (3.12).
E = (1/N) Σ_{n=1}^{N} Σ_{k=1}^{K} (zkn − xkn)² + λ · Ωweights + β · Ωsparsity,    (3.12)
which is a regularized Mean Square Error (MSE) with L2 regularization and sparsity
regularization. The coefficient λ of the L2 regularization term and the coefficient β
of the sparsity regularization term are specified when training the AE.
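The following numpy sketch (illustrative only; the sigmoid activations, the random parameters, and applying the weight penalty to both W and V are assumptions) evaluates the cost of Eq. (3.12) for a given parameter set θ = {W, V, bf, bg}:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sparse_ae_cost(X, W, V, bf, bg, rho=0.05, lam=1e-4, beta=3.0):
    """Cost of Eq. (3.12): regularized MSE + L2 weight decay (Eq. 3.11)
    + KL-divergence sparsity penalty (Eqs. 3.9-3.10)."""
    N = X.shape[0]
    H = sigmoid(X @ W.T + bf)          # encoder activations, shape (N, h)
    Z = sigmoid(H @ V.T + bg)          # reconstruction, shape (N, K)
    mse = np.sum((Z - X) ** 2) / N                       # first term of Eq. (3.12)
    omega_w = 0.5 * (np.sum(W ** 2) + np.sum(V ** 2))    # Eq. (3.11), both matrices
    rho_hat = H.mean(axis=0)                             # Eq. (3.9)
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))  # Eq. (3.10)
    return mse + lam * omega_w + beta * kl

# Tiny random example: 100 samples, 10 input features, 4 hidden neurons.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (100, 10))
W, V = rng.normal(0, 0.1, (4, 10)), rng.normal(0, 0.1, (10, 4))
bf, bg = np.zeros(4), np.zeros(10)
print("Sparse AE cost:", round(float(sparse_ae_cost(X, W, V, bf, bg)), 4))
```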
while the latter learns using a policy function, which is a mapping between the best
action and the corresponding state. More detailed explanations are given
in [13].
3.1.6 Adversarial Machine Learning

This type of machine learning differs from all the previous sorts: adversarial
machine learning exploits and attacks the vulnerabilities of existing
machine-learning models. One example was mentioned by Laskov and Lippman
[14]: the capabilities of machine learning are founded on the assumption that the
training data are representative of what the learner will face. However,
this assumption may be violated if either the training or the test data distribution is
intentionally modified to confuse the learner. Other examples of attacks that occur
in an adversarial framework are also provided in [14].
In general, two threat models constitute the main machine-learning risks [15]. The first
threat is evasion attacks, which try to bypass the learned model. The adversary
attempts to evade the pattern matching performed by machine-learning models; spam
filtering and intrusion detection are typical settings in which the adversary wants to
evade. The basic idea for accomplishing an evasion attack is a trial-and-error approach
until a given instance successfully evades the pattern matching. The second threat is
poisoning attacks, which try to influence the original training data to obtain a desired
result. Unlike the first threat model, this model injects malicious data into the original
training data so that the learner produces the adversary's desired result. As an example,
the adversary covertly sends malicious packets during network traffic collection so that
the learner misclassifies those packets as benign instances. Further detailed explanations
and examples can be found in [16] and [17].
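As an illustration only (not from the monograph), the following sketch poisons a toy training set by injecting attack-like samples labeled as benign and shows how the detection rate on fresh attack traffic degrades; the data and the logistic-regression classifier are arbitrary placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Benign traffic around 0, attack traffic around 3 (two synthetic features).
benign = rng.normal(0, 1, (500, 2))
attack = rng.normal(3, 1, (500, 2))
X = np.vstack([benign, attack])
y = np.array([0] * 500 + [1] * 500)

# Clean model trained on unpoisoned data.
clean = LogisticRegression().fit(X, y)

# Poisoning: the adversary slips 200 attack-like samples into the
# collection phase, labeled as benign (y = 0).
poison = rng.normal(3, 1, (200, 2))
Xp = np.vstack([X, poison])
yp = np.concatenate([y, np.zeros(200, dtype=int)])
poisoned = LogisticRegression().fit(Xp, yp)

# Evaluate the detection rate on fresh attack traffic.
test_attack = rng.normal(3, 1, (1000, 2))
print("DR before poisoning:", clean.predict(test_attack).mean())
print("DR after poisoning: ", poisoned.predict(test_attack).mean())
```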
method. Zhu et al. [28] also proposed a feature selection method using a multi-
objective approach.
On the other hand, Manekar and Waghmare [29] leveraged Particle Swarm
Optimization (PSO) and SVM. PSO performs feature optimization to obtain an
optimized feature, after which SVM conducts the classification task. A similar
approach was introduced by Saxena and Richariya [30], although the concept
of weighted feature selection was introduced by Schaffernicht and Gross [31].
Exploiting SVM-based algorithms as a feature selection method was proposed by
Guyon et al. [1]. This method leveraged the weights adjusted during support vector
learning and resulted in ranking the importance of input features. Another related
approach was proposed by Wang [32] who ranked input features based on weights
learned by an ANN. This method showed the ability of DNN to find useful features
among the raw data. Aljawarneh et al. [33] proposed a hybrid model of feature
selection and an ensemble of classifiers which requires heavy computation.
Venkatesan et al. [34] highlighted a new use of botnets, which is data theft. Botnets
need two properties, stealth and resilience, to steal data; this type of botnet becomes
a tool for Advanced Persistent Threat (APT) agents. The authors [34] proposed a
combined approach of a honeypot and RL-based botnet intrusion detection in a
resource-constrained environment. The honeypot aims to detect intrusions that occur,
while the botnet detection analyzes the behavior of bots. The RL model
was developed to reduce the lifetime of stealthy botnets in a resource-constrained
environment. The RL agent did not take a mission-centric approach into consideration;
instead, the agent learned a policy to maximize the number of bots detected.
Their RL model spans two consecutive decisions of the agent.
An enterprise network containing 106 machines, with 98 clients and 8 servers in 4
subnets, was simulated in the PeerSim software for experimental purposes. Based on the
experimental results, the RL model successfully controlled the evolution of a botnet.
We have examined several feature selection methods for IDS. Huseynov et al.
[35] investigated the Ant Colony Clustering (ACC) method to find feature clusters of
botnet traffic. The features selected in [35] are independent of the traffic payload
and represent the communication patterns of botnet traffic. However, this botnet
detection does not scale to huge and noisy datasets due to the absence of a control
mechanism for the clustering threshold. Kim et al. [36] tested Artificial Immune
System (AIS) and swarm intelligence-based clustering to detect unknown attacks.
Furthermore, Aminanto et al. [37] discussed the utility of ACA and a Fuzzy Inference
System (FIS) for IDS. They explored several common IDSs with a combination of
learning and classification methods, as shown in Table 3.1.
ACA is one of the most popular clustering approaches originating from
swarm intelligence. ACA is an unsupervised learning algorithm that can find a near-
optimal clustering solution without a predefined number of clusters [6].
However, ACA is rarely used in intrusion detection as the exclusive method for
classification. Instead, ACA is combined with other supervised algorithms such as
the Self-Organizing Map (SOM) and the Support Vector Machine (SVM) to provide better
classification results [39]. In AKKK17 [37], a novel hybrid IDS scheme based on
ACA and FIS was proposed. The authors [37] applied ACA in the training phase and
FIS in the classification phase. FIS was chosen for the classification phase because
a fuzzy approach can reduce false alarms with higher reliability in determining
intrusion activities [40]. Meanwhile, the same ACA with different classifiers was
also examined in KKK15 [36] and KHKY16 [38], using AIS and DT
as well as an Artificial Neural Network (ANN), respectively. AIS is designed for
computational systems and is inspired by the Human Immune System (HIS). AIS can
differentiate between the "self" (cells that are owned by the system) and the "nonself"
(entities foreign to the system). ANN can learn the more complex structure of certain
unknown attacks due to its representational characteristics. Also, an improved ACA,
Adaptive Time Dependent Transporter Ants Clustering (ATTA-C), was investigated
in HKY14 [35]; it is one of the few algorithms that have been benchmarked on
various datasets and is publicly available under a GNU license [35].
In addition to the abovementioned common IDSs, other IDS models were further
examined by taking advantage of the Hadoop framework [41] and the Software-Defined
Networking (SDN) environment [42]. Huseynov et al. [41] proposed a method that utilizes
the advantages of Hadoop together with behavioral flow analysis. This framework
is particularly useful for P2P traffic analysis due to the inherent flow
characteristics of this type of application. Meanwhile, Lee [42] proposed a
novel IDS scheme that operates lightweight intrusion detection while keeping a detailed
analysis of attacks. In this scheme, a flow-based IDS detects intrusions at
low operating cost. When an attack is detected, the IDS requests the forwarding of
the attack traffic to a packet-based detector so that the detailed results obtained by the
packet-based detection can be analyzed later by security experts.
References
1. I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection for cancer classification using
support vector machines,” Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
2. X. Zeng, Y.-W. Chen, C. Tao, and D. van Alphen, “Feature selection using recursive feature
elimination for handwritten digit recognition,” in Proc. Intelligent Information Hiding and
Multimedia Signal Processing (IIH-MSP), Kyoto, Japan. IEEE, 2009, pp. 1205–1208.
3. C. A. Ratanamahatana and D. Gunopulos, “Scaling up the naive Bayesian classifier: Using
decision trees for feature selection,” in Workshop on Data Cleaning and Preprocessing (DCAP)
at IEEE Int. Conf. Data Mining (ICDM), Maebashi, Japan. IEEE, Dec 2002.
4. C. Jiang, H. Zhang, Y. Ren, Z. Han, K.-C. Chen, and L. Hanzo, “Machine learning paradigms
for next-generation wireless networks,” IEEE Wireless Communications, vol. 24, no. 2, pp.
98–105, 2017.
26. S. Zaman and F. Karray, “Lightweight IDS based on features selection and IDS classification
scheme,” in Proc. Computational Science and Engineering (CSE). IEEE, 2009, pp. 365–370.
27. P. Louvieris, N. Clewley, and X. Liu, “Effects-based feature identification for network intrusion
detection,” Neurocomputing, vol. 121, pp. 265–273, 2013.
28. Y. Zhu, J. Liang, J. Chen, and Z. Ming, “An improved NSGA-iii algorithm for feature selection
used in intrusion detection,” Knowledge-Based Systems, vol. 116, pp. 74–85, 2017.
29. V. Manekar and K. Waghmare, “Intrusion detection system using support vector machine
(SVM) and particle swarm optimization (PSO),” Int. Journal of Advanced Computer Research,
vol. 4, no. 3, pp. 808–812, 2014.
30. H. Saxena and V. Richariya, “Intrusion detection in KDD99 dataset using SVM-PSO and
feature reduction with information gain,” Int. Journal of Computer Applications, vol. 98, no. 6,
2014.
31. E. Schaffernicht and H.-M. Gross, “Weighted mutual information for feature selection,” in
Proc. Artificial Neural Networks, Espoo, Finland. Springer, 2011, pp. 181–188.
32. Z. Wang, “The applications of deep learning on traffic identification,” in Conf. BlackHat, Las
Vegas, USA. UBM, 2015.
33. S. Aljawarneh, M. Aldwairi, and M. B. Yassein, “Anomaly-based intrusion detection
system through feature selection analysis and building hybrid efficient model,” Journal of
Computational Science, Mar 2017. [Online]. Available: http://dx.doi.org/10.1016/j.jocs.2017.
03.006
34. S. Venkatesan, M. Albanese, A. Shah, R. Ganesan, and S. Jajodia, “Detecting stealthy botnets
in a resource-constrained environment using reinforcement learning,” in Proceedings of the
2017 Workshop on Moving Target Defense. ACM, 2017, pp. 75–85.
35. K. Huseynov, K. Kim, and P. Yoo, “Semi-supervised botnet detection using ant colony
clustering,” in Symp. Cryptography and Information Security (SCIS), Kagoshima, Japan, 2014.
36. K. M. Kim, H. Kim, and K. Kim, “Design of an intrusion detection system for unknown-attacks
based on bio-inspired algorithms,” in Computer Security Symposium (CSS), Nagasaki, Japan,
2015.
37. M. E. Aminanto, H. Kim, K. M. Kim, and K. Kim, “Another fuzzy anomaly detection system
based on ant clustering algorithm,” IEICE Transactions on Fundamentals of Electronics,
Communications and Computer Sciences, vol. 100, no. 1, pp. 176–183, 2017.
38. K. M. Kim, J. Hong, K. Kim, and P. Yoo, “Evaluation of ACA-based intrusion detection
systems for unknown-attacks,” in Symp. on Cryptography and Information Security (SCIS),
Kumamoto, Japan, 2016.
39. C. Kolias, G. Kambourakis, and M. Maragoudakis, “Swarm intelligence in intrusion detection:
A survey,” Computers & Security, vol. 30, no. 8, pp. 625–642, 2011.
40. A. Karami and M. Guerrero-Zapata, “A fuzzy anomaly detection system based on hybrid PSO-
Kmeans algorithm in content-centric networks,” Neurocomputing, vol. 149, pp. 1253–1269,
2015.
41. K. Huseynov, P. D. Yoo, and K. Kim, “Scalable P2P botnet detection with threshold setting in
Hadoop framework,” Journal of the Korea Institute of Information Security and Cryptology,
vol. 25, no. 4, pp. 807–816, 2015.
42. D. S. Lee, “Improving detection capability of flow-based IDS in SDN,” KAIST, MS. Thesis,
2015.
Chapter 4
Deep Learning
Abstract This chapter provides a brief history and definition of deep learning. Due
to the variety of models belonging to deep learning, we classify deep learning models
into a tree with three branches: generative, discriminative, and hybrid. For each
branch, we show some example learning models in order to see the differences among
the three classes.
4.1 Classification
the joint probability given the input and choose the class label with the highest
probability [4]. There are a number of methods that are classified as unsupervised
learning.
Fig. 4.2 SAE network with two hidden layers and two target classes
a precise corrected input from an input corrupted by noise [6]. The DAE may also be
stacked to build deep networks as well.
The performance of the unsupervised greedy layer-wise pre-training algorithm
can be significantly better than that of its supervised counterpart. This is because the
greedy supervised procedure may behave too greedily, as it extracts less information
and considers only one layer [7, 8]. An NN containing only one hidden layer may
discard some of the information about the input data, since more information could
be exploited by composing additional hidden layers. The features from the greedy
layer-wise pre-training phase can be used either as input to a standard supervised
machine-learning algorithm or as initialization for a deep supervised NN.
Figure 4.2 shows the SAE network with two hidden layers and two target classes.
The final layer implements the softmax function for classification in the DNN;
this softmax layer makes SAE a combination of unsupervised and supervised learning.
The softmax function is a generalization of the logistic function that squashes a
K-dimensional vector v ∈ R^K into a K-dimensional vector v* ∈ (0, 1)^K whose entries
add up to 1. In the loss below, T and C denote the number of training instances and
the number of classes, respectively. The softmax layer minimizes a loss function,
which is either the cross-entropy of Eq. (4.1) or the mean-squared error.
E = −(1/T) Σ_{j=1}^{T} Σ_{i=1}^{C} [zij log yij + (1 − zij) log(1 − yij)],    (4.1)
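A compact numpy sketch of the softmax squashing and the cross-entropy loss of Eq. (4.1) (illustrative only; the toy logits and one-hot targets are made up) is:

```python
import numpy as np

def softmax(v):
    """Squash a K-dimensional vector (or a batch of them) into (0, 1)^K
    summing to 1; subtracting the max keeps the exponentials stable."""
    e = np.exp(v - v.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(Z, Y):
    """Eq. (4.1): Z are one-hot targets, Y are softmax outputs,
    averaged over the T training instances."""
    T = Z.shape[0]
    eps = 1e-12                       # avoid log(0)
    return -np.sum(Z * np.log(Y + eps) + (1 - Z) * np.log(1 - Y + eps)) / T

# Toy batch: 3 instances, 2 classes (benign vs. attack).
logits = np.array([[2.0, -1.0], [0.3, 0.1], [-2.0, 1.5]])
Y = softmax(logits)
Z = np.array([[1, 0], [0, 1], [0, 1]])
print("Softmax outputs:\n", np.round(Y, 3))
print("Cross-entropy loss:", round(float(cross_entropy(Z, Y)), 4))
```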
4.2.2 Boltzmann Machine
A BM is a network of binary units that are symmetrically paired [9], which means all
input nodes are linked to all hidden nodes. A BM is a shallow model with one hidden
layer only. A BM has a structure of neuron units that make stochastic decisions about
whether to be active or not [3]. If one BM's output is cascaded into multiple BMs, it is
called a deep BM (DBM). Meanwhile, an RBM is a restricted BM with no connections
among the hidden nodes nor among the input nodes [9]. An RBM consists of visible and
hidden variables such that their relations can be figured out; visible here means the
neurons of the input, that is, the training data. If multiple layers of RBM are stacked,
the resulting layer-by-layer scheme is called a deep belief network (DBN). A DBN can
be used as a feature extraction method for dimensionality reduction when an unlabeled
dataset and back-propagation are used (which means unsupervised training). In contrast,
a DBN is used for classification when an appropriately labeled dataset with feature
vectors is used (which means supervised training) [10].
manage which information will be passed or dropped. There are three gates to
control the information flow, namely, the input, forget, and output gates [15]. These
gates are composed of a sigmoid NN layer and a pointwise operation, as shown in Fig. 4.4.
4.3 Discriminative
4.4 Hybrid
The hybrid deep architecture combines both generative and discriminative archi-
tectures. The hybrid structure aims to distinguish data, just as the discriminative
approach does; however, in its early steps it is assisted in a significant way by the
results of generative architectures. An example of a hybrid architecture is the Deep
Neural Network (DNN). However, there is some confusion between the terms DNN and DBN.
In the open literature, a DBN also uses discriminative backpropagation training as
"fine-tuning," and this concept of DBN is similar to a DNN [2]. According to Deng [3],
a DNN, which is defined as a multilayer network with cascaded fully connected hidden
layers, uses stacked RBMs as a pre-training phase. Many other generative models
can be considered discriminative or hybrid models when a classification task with
class labels is added.
Goodfellow et al. [20] introduced a novel framework which trains both a generative and
a discriminative model at the same time: the generative model G captures the
data distribution, and the discriminative model D distinguishes the original input data
from the data produced by model G. It is a zero-sum game between the G and D models [4],
where model G aims to counterfeit the original input data, while model D aims to
discriminate between the original input and the output of model G. According to
Dimokranitou [4], the advantages of GANs are that they remain consistent after the
equilibrium is achieved, no approximate inference or Markov chains are needed, and they
can be trained with missing or limited data. On the other hand, the difficulty of
applying a GAN is finding the equilibrium between the G and D models. A typical
architecture of a GAN is shown in Fig. 4.5.
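As a sketch only (not from the monograph, and using tf.keras as an arbitrary framework choice), a minimal GAN on a toy one-dimensional data distribution illustrates the G-versus-D game described above:

```python
import tensorflow as tf

# Minimal GAN on a toy 1-D Gaussian "data distribution" (illustration only).
G = tf.keras.Sequential([tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
                         tf.keras.layers.Dense(1)])
D = tf.keras.Sequential([tf.keras.layers.Dense(16, activation="relu", input_shape=(1,)),
                         tf.keras.layers.Dense(1, activation="sigmoid")])
bce = tf.keras.losses.BinaryCrossentropy()
opt_g = tf.keras.optimizers.Adam(1e-3)
opt_d = tf.keras.optimizers.Adam(1e-3)

for step in range(1000):
    real = tf.random.normal((64, 1), mean=3.0, stddev=0.5)   # real samples
    noise = tf.random.normal((64, 8))
    # Train D: label real samples as 1 and generated (fake) samples as 0.
    with tf.GradientTape() as tape_d:
        fake = G(noise)
        d_loss = bce(tf.ones((64, 1)), D(real)) + bce(tf.zeros((64, 1)), D(fake))
    opt_d.apply_gradients(zip(tape_d.gradient(d_loss, D.trainable_variables),
                              D.trainable_variables))
    # Train G: try to fool D into labeling fake samples as real.
    with tf.GradientTape() as tape_g:
        g_loss = bce(tf.ones((64, 1)), D(G(noise)))
    opt_g.apply_gradients(zip(tape_g.gradient(g_loss, G.trainable_variables),
                              G.trainable_variables))

print("Mean of generated samples:",
      float(tf.reduce_mean(G(tf.random.normal((1000, 8))))))
```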
References
12. R. C. Staudemeyer, “Applying long short-term memory recurrent neural networks to intrusion
detection,” South African Computer Journal, vol. 56, no. 1, pp. 136–154, 2015.
13. C. Olah, “Understanding LSTM networks,” http://colah.github.io/posts/2015-08-
Understanding-LSTMs/, 2015, [Online; accessed 20-February-2018].
14. S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9,
no. 8, pp. 1735–1780, 1997.
15. J. Kim, J. Kim, H. L. T. Thu, and H. Kim, “Long short term memory recurrent neural
network classifier for intrusion detection,” in Platform Technology and Service (PlatCon), 2016
International Conference on. IEEE, 2016, pp. 1–5.
16. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document
recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
17. M. A. Nielsen, “Neural networks and deep learning,” 2015.
18. D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser,
I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., “Mastering the game of Go with deep
neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
19. A. Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint
arXiv:1211.3711, 2012.
20. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,
and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing
systems, 2014, pp. 2672–2680.
Chapter 5
Deep Learning-Based IDSs
Abstract This chapter reviews recent IDSs that leverage deep learning models as
their methodology and were published during 2016 and 2017. The critical
issues of each publication, such as problem domain, methodology, dataset, and
experimental results, are discussed. These publications can be classified into three
different categories according to the deep learning classification in Chap. 4, namely,
generative, discriminative, and hybrid. The generative model group consists of IDSs
that use deep learning models for feature extraction only and use shallow methods
for the classification task. The discriminative model group contains IDSs that use a
single deep learning method for both feature extraction and the classification task. The
hybrid model group includes IDSs that use more than one deep learning method
for generative and discriminative purposes. All IDSs are compared to give an overview
of the advancement of deep learning in IDS research.
5.1 Generative
This sub-chapter groups IDSs that use deep learning for feature extraction only and
use shallow methods for the classification task.
Roy et al. [1] proposed an IDS leveraging deep learning models and validated
that a deep learning approach can improve IDS performance. A DNN was selected,
comprising a multilayer feedforward NN with 400 hidden layers. Shallow models,
with rectifier and softmax activation functions, are used in the output layer. Two
advantages of the feedforward neural network are that it provides a precise approximation
of complex multivariate nonlinear functions directly from the input values and that it
gives robust modeling for large classes. Besides that, the authors claimed that the DNN
is better than a DBN, since its discriminating power, obtained by characterizing the
posterior distributions of the classes, is suitable for pattern classification [1].
For validation, the KDD Cup'99 dataset was used. This dataset has 41 features, which
were given as the input to the network. The authors divided all the training data into
75% for training and 25% for validation. They also compared the performance of
a shallow classifier, SVM. Based on their experimental results, the DNN outperforms
SVM with an accuracy of 99.994%, while SVM achieved only 84.635%. This result
showed the effectiveness of DNNs for IDS purposes.
Another DNN, with a different architecture, was proposed by Potluri and Diedrich
[2] in 2016. This paper mainly focuses on improving the DNN implementation for
IDS by using multi-core CPUs and GPUs. This is important since DNN training requires
heavy computation [3]. They reviewed some IDSs utilizing hardware
enhancements: GPU, multicore CPU, memory management, and FPGA. A
way of load balancing (and splitting) or parallel processing was also discussed. A
deep learning model, the SAE, was chosen to construct the DNN in this work. The
architecture of this network has 41 input features from the NSL-KDD dataset, 20
neurons in the first hidden layer from the first AE, 10 neurons in the second hidden layer
from the second AE, and 5 neurons in the output layer containing the softmax activation
function. In the training phase, each AE is trained separately but in sequence, since
the hidden layer of the first AE becomes the input of the second AE. There are two
fine-tuning processes, the first done by the softmax activation function and
the second by backpropagation through the entire network.
The NSL-KDD dataset was selected for testing this approach. This dataset is a revised
version of the KDD Cup'99 dataset; it has the same number of features, 41,
but with more rational distributions and without the redundant instances existing in
the KDD Cup'99 dataset. The authors first tested the network with different attack
class combinations, from two classes to four classes. As expected, the smaller numbers
of attack classes perform better than the larger numbers, since
the imbalanced class distribution leads to better results for fewer attack types.
For acceleration, the authors used two different CPUs and a GPU. They also
experimented with serial and parallel CPU execution. Their experimental results show that
training using a parallel CPU was three times faster than training using a serial CPU.
Training using the GPU delivered performance similar to the parallel CPU as well. An
interesting point here is that parallel training on the second CPU was faster than on
the GPU; they explained that this happens because of the high clock speed of that
CPU. Unfortunately, the authors do not provide a performance
comparison regarding detection accuracy or false alarm rate.
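The layer sizes described above (41-20-10-5 with a softmax output) can be sketched as a tf.keras model; this shows the topology only and, unlike the original work, omits the layer-wise AE pre-training, and the sigmoid activations are assumptions:

```python
import tensorflow as tf

# Topology sketch of the 41-20-10-5 network described above; in the original
# work the two hidden layers are obtained by pre-training two AEs in sequence,
# which is omitted here for brevity.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(41,)),                  # 41 NSL-KDD features
    tf.keras.layers.Dense(20, activation="sigmoid"),     # hidden layer of AE 1
    tf.keras.layers.Dense(10, activation="sigmoid"),     # hidden layer of AE 2
    tf.keras.layers.Dense(5, activation="softmax"),      # 5-class softmax output
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```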
Self-Taught Learning (STL) was proposed as a deep learning model for IDS by
Niyaz et al. [4]. The authors mentioned two challenges in developing an efficient IDS.
The first challenge is feature selection, since the features selected for a particular
attack might differ for other attack types. The second challenge is dealing with
the limited amount of labeled data available for training. Therefore, a generative
deep learning model was chosen in order to exploit unlabeled data. The
proposed STL comprises two stages, Unsupervised Feature Learning (UFL) and
Supervised Feature Learning (SFL). For UFL the authors leveraged a sparse AE,
while softmax regression was used for SFL. Figure 5.1 shows the two-stage process of STL
used in this paper. UFL accounts for feature extraction with the unlabeled dataset,
while SFL accounts for the classification task with labeled data.
The authors verified their approach using the NSL-KDD dataset. Before the training
process, the authors defined a preprocessing step for the dataset consisting of
1-to-N encoding and min-max normalization. After the 1-to-N encoding process, 121
features were ready for the normalization step and served as input features for the UFL.
Tenfold cross-validation and the test dataset from NSL-KDD were used for training
and testing, respectively. The authors also evaluated STL for three different
attack combinations: 2-class, 5-class, and 23-class. In general, their STL achieved
higher than 98% classification accuracy for all combinations during the training
phase. In the testing phase, the STL achieved accuracies of 88.39% and 79.10% for
the 2-class and 5-class classifications, respectively. They mentioned that future work
is to develop a real-time IDS using deep learning models on raw network traffic.
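A minimal sketch of the two-stage STL idea (sparse-AE feature learning on unlabeled data, then softmax classification on the learned code) is given below; the 121-dimensional input follows the preprocessing described above, but the code size, the L1 activity regularizer standing in for the sparsity penalty, and the data are assumptions:

```python
import numpy as np
import tensorflow as tf

# Placeholder data: 121 preprocessed features per connection record.
X_unlabeled = np.random.rand(5000, 121).astype("float32")
X_labeled = np.random.rand(1000, 121).astype("float32")
y_labeled = np.random.randint(0, 2, size=1000)        # 0 = normal, 1 = attack

# Stage 1 (UFL): sparse AE trained on unlabeled data; the L1 activity
# regularizer plays the role of the sparsity penalty.
inp = tf.keras.Input(shape=(121,))
code = tf.keras.layers.Dense(50, activation="sigmoid",
                             activity_regularizer=tf.keras.regularizers.l1(1e-4))(inp)
recon = tf.keras.layers.Dense(121, activation="sigmoid")(code)
ae = tf.keras.Model(inp, recon)
ae.compile(optimizer="adam", loss="mse")
ae.fit(X_unlabeled, X_unlabeled, epochs=5, batch_size=128, verbose=0)

# Stage 2 (SFL): softmax regression on the learned representation.
encoder = tf.keras.Model(inp, code)
clf = tf.keras.Sequential([tf.keras.layers.Input(shape=(50,)),
                           tf.keras.layers.Dense(2, activation="softmax")])
clf.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
clf.fit(encoder.predict(X_labeled, verbose=0), y_labeled, epochs=5, verbose=0)
```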
Kim et al. [6] adopted the generative approach of LSTM-RNN for IDS purposes.
They used a softmax regression layer as the output layer. The other hyper-parameters
are a batch size of 50, a time step of 100, and 500 epochs. Also,
Stochastic Gradient Descent (SGD) and MSE were used as the optimizer and the loss
function, respectively, and 41 input features were drawn from the KDD Cup'99 dataset.
Experimental results show that the best learning rate is 0.01 and the best hidden layer
size is 80, with a DR of 98.88% and a false alarm rate of 10.04%. A similar network
topology was also proposed by Liu et al. [7] with different hyper-parameters: the time
step, batch size, and number of epochs are 50, 100, and 500, respectively. Using the
KDD Cup'99 dataset, a DR of 98.3% and a false alarm rate of 5.58% were achieved.
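As a sketch only, the topology and hyper-parameters reported for Kim et al. [6] (80 LSTM units, a softmax output layer, SGD with learning rate 0.01, MSE loss, batch size 50, time step 100) could be written in tf.keras as follows; the number of output classes and the synthetic data are assumptions:

```python
import numpy as np
import tensorflow as tf

TIME_STEP, FEATURES, CLASSES = 100, 41, 5   # 5 classes assumed (normal + 4 attack types)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(TIME_STEP, FEATURES)),
    tf.keras.layers.LSTM(80),                              # hidden layer size 80
    tf.keras.layers.Dense(CLASSES, activation="softmax"),  # softmax regression output
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="mse", metrics=["accuracy"])

# Placeholder sequences standing in for preprocessed KDD Cup'99 records.
X = np.random.rand(500, TIME_STEP, FEATURES).astype("float32")
y = tf.keras.utils.to_categorical(np.random.randint(0, CLASSES, 500), CLASSES)
model.fit(X, y, batch_size=50, epochs=1, verbose=0)
```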
5.2 Discriminative
This sub-chapter groups IDSs that use a single deep learning method for both feature
extraction and the classification task.

the proposed DNN could generalize and abstract the characteristics of network
traffic with a limited number of features alone.
Li et al. [10] experimented with a CNN as both the feature extractor and the classifier in
IDSs. CNNs have achieved many successful implementations in image-related classifica-
tion tasks; however, text classification remains a big challenge for them. Therefore, the
main challenge of implementing a CNN in the IDS context is the image conversion step
proposed by Li et al. [10]. The NSL-KDD dataset was used for experimental
purposes. The image conversion step begins by mapping the 41 original features into
464 binary values. The mapping comprises two types of encoding: one-hot
encoding for the symbolic features and a ten-bit binary encoding for the continuous
features. The image conversion step then converts the 464 binary values into 8 × 8 pixel
images, which are ready as training input for the CNN. The authors decided to experiment
with pre-trained CNN models, ResNet 50 and GoogLeNet. Experimental results on KDDTest+
show accuracies of 79.14% and 77.14% using ResNet 50 and GoogLeNet, respectively.
Although this result does not improve the state of the art in IDS, this work demonstrated
how to apply a CNN with image conversion in the IDS context.
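As an illustration only (the exact binning, bit grouping, and padding used by Li et al. are not specified here, so those details below are assumptions), the following sketch turns one preprocessed record into an 8 × 8 grayscale image:

```python
import numpy as np

def record_to_image(symbolic_onehots, continuous_values, n_bins=10):
    """Toy image-conversion step: concatenate one-hot encoded symbolic
    features with 10-bit (one-hot binned) continuous features, then pack
    the resulting binary string into an 8 x 8 grayscale image.
    ASSUMPTION: bits are grouped 8 at a time into byte-valued pixels and
    zero-padded to 64 pixels."""
    bits = list(symbolic_onehots)
    for v in continuous_values:                      # v assumed scaled to [0, 1]
        onehot = np.zeros(n_bins, dtype=int)
        onehot[min(int(v * n_bins), n_bins - 1)] = 1
        bits.extend(onehot)
    bits = np.array(bits, dtype=np.uint8)
    pixels = [int("".join(map(str, chunk)), 2)       # 8 bits -> one pixel value
              for chunk in np.array_split(bits, range(8, len(bits), 8))]
    pixels = np.pad(pixels, (0, 64 - len(pixels)))   # pad to 8 x 8 = 64 pixels
    return np.array(pixels, dtype=np.uint8).reshape(8, 8)

# Toy record: 6 one-hot bits for symbolic features + 5 continuous features.
img = record_to_image([0, 1, 0, 0, 0, 1], [0.12, 0.97, 0.5, 0.0, 0.33])
print(img)
```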
Bontemps et al. [12] leveraged an LSTM-RNN in IDS for two objectives: a time
series anomaly detector and a collective anomaly detector, by proposing a circular
array. A collective anomaly is a collection of related anomalous data instances
with respect to the whole dataset [12]. They used the KDD Cup'99 dataset for their
experiments and explained the preprocessing steps needed to build a time series
dataset from it.
Putchala [13] implemented a simplified form LSTM, called Gated Recurrent Unit
(GRU) in IoT environments. GRU is suitable for IoT due to its simplicity which
makes to reduce a number of gates in the network. GRU merges both forget and
input gates to an update gate and combines the hidden and cell states to be a simple
structure as shown in Fig. 5.3.
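The following minimal NumPy sketch shows a single GRU step with the merged gating just described (biases omitted for brevity); the weight shapes and toy sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(Wz @ x + Uz @ h_prev)               # update gate (merged input/forget gates)
    r = sigmoid(Wr @ x + Ur @ h_prev)               # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev))    # candidate state
    return (1.0 - z) * h_prev + z * h_cand          # single hidden state, no separate cell state

# Tiny example: 4 input features, hidden size 3, random parameters.
rng = np.random.default_rng(0)
n_in, n_h = 4, 3
params = [rng.standard_normal((n_h, d)) for d in (n_in, n_h, n_in, n_h, n_in, n_h)]
h = gru_step(rng.standard_normal(n_in), np.zeros(n_h), *params)
print(h.shape)  # (3,)
```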
The author then adopted a multilayer GRU, in which GRU cells are used in each hidden
layer of the RNN; feature selection was also performed using the random forest algorithm.
The experiments were conducted using the KDD Cup'99 dataset and achieved an accuracy
of 98.91% and a false alarm rate of 0.76%.
5.3 Hybrid
This sub-chapter includes IDSs that use more than one deep learning model for
generative and discriminative purposes.
AEs and GAN. The network attempts to match the aggregated posterior of the
hidden code vector of the AE with an arbitrary prior distribution. The reconstruction
error of the learned AE is low for normal events and high for irregular events.
Choi and Cho [16] highlighted two main problems of intrusion detection in database
applications: in real environments, benign data greatly outnumber malicious data,
and internal intrusions are more difficult to detect.
The proposed approach was an adaptive IDS for database applications using
Evolutionary Reinforcement Learning (ERL), which combines evolutionary
learning and reinforcement learning for the learning processes of a population
and an individual, respectively. The approach comprises two MLPs, a behavior
network and an evaluation network. The former aims to detect abnormal queries by
performing error backpropagation for reinforcement learning depending on
the weights of the network, while the latter provides a learning rate as
feedback to improve the detection rate of the behavior network. Since the evaluation
network is used for evolutionary learning, the network evolves to explore the optimal
model. Experiments were done using a particular scenario, called TPC-E, which
is an online transaction processing workload of a brokerage company [16]. A classification
accuracy of 90% was achieved after 25 generations.
Feng and Xu [17] were concerned with detecting unknown attacks in Cyber-Physical
Systems (CPS), and a novel deep RL-based optimal strategy was proposed [17].
The novelty lies in explicit cyber state-dependent dynamics and a zero-sum game
model used to solve the Hamilton-Jacobi-Isaacs (HJI) equation. A deep RL
algorithm with a game-theoretical actor-critic NN structure was developed to address
the HJI equation. The deep RL network consists of three multilayer NNs; the first
is used in the critic part, the second is used to approximate the possible worst attack
policy, and the last is used to estimate the optimal defense policy in real time.
5.5 Comparison
We compare and summarize all the previous work mentioned earlier in this chapter. The
discussed models using the KDD Cup'99 and NSL-KDD datasets are summarized
in Tables 5.1 and 5.2, respectively.
The overall performances of IDSs on KDD Cup'99 are promising, as expected,
with accuracies above 90%. Four IDSs in Table 5.1 use the LSTM-RNN approach,
which suggests that time series analysis is suitable for distinguishing benign traffic from
anomalies. Moreover, GRU [13] demonstrated that a lightweight
deep learning model is very useful in IoT environments, which are
of crucial importance these days.
There is still room for improvement when using the NSL-KDD dataset, as
shown in Table 5.2. The most accurate model is the RNN [9], with an accuracy of 81.29%.
Again, this suggests that time series analysis may improve IDS performance.
Although IDSs using CNN have not achieved the best performance, by applying
a proper text-to-image conversion we may gain the benefit of CNN, which was
previously shown to be the best for image recognition.
References
1. S. S. Roy, A. Mallik, R. Gulati, M. S. Obaidat, and P. Krishna, “A deep learning based artificial
neural network approach for intrusion detection,” in International Conference on Mathematics
and Computing. Springer, 2017, pp. 44–53.
2. S. Potluri and C. Diedrich, “Accelerated deep neural networks for enhanced intrusion detection
system,” in Emerging Technologies and Factory Automation (ETFA), 2016 IEEE 21st Interna-
tional Conference on. IEEE, 2016, pp. 1–8.
3. H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, “Exploring strategies for training deep
neural networks,” Journal of Machine Learning Research, vol. 10, no. Jan, pp. 1–40, 2009.
4. A. Javaid, Q. Niyaz, W. Sun, and M. Alam, “A deep learning approach for network intrusion
detection system,” in Proceedings of the 9th EAI International Conference on Bio-inspired
Information and Communications Technologies (formerly BIONETICS). ICST (Institute for
Computer Sciences, Social-Informatics and Telecommunications Engineering), 2016, pp. 21–
26.
5. Y. Yu, J. Long, and Z. Cai, “Session-based network intrusion detection using a deep learning
architecture,” in Modeling Decisions for Artificial Intelligence. Springer, 2017, pp. 144–155.
6. J. Kim, J. Kim, H. L. T. Thu, and H. Kim, “Long short term memory recurrent neural
network classifier for intrusion detection,” in Platform Technology and Service (PlatCon), 2016
International Conference on. IEEE, 2016, pp. 1–5.
7. Y. Liu, S. Liu, and Y. Wang, “Route intrusion detection based on long short term memory
recurrent neural network,” DEStech Transactions on Computer Science and Engineering, no.
cii, 2017.
information from its raw inputs. These representations were then combined with a
modified weighted feature selection inspired by existing shallow-structured
machine learners. The usefulness of the condensed set of features in reducing both the
bias of a machine-learning model and the computational complexity is
shown.
6.1.1 Methodology
D-FES combines feature extraction and selection. Figure 6.1 shows the
stepwise procedure of D-FES with two target classes. A preprocessing procedure,
which comprises the normalization and balancing steps, is necessary; the process
is explained in Sect. 6.1.2 in detail. As illustrated in Algorithm 1, D-FES starts
by constructing an SAE-based feature extractor with two consecutive hidden layers
to optimize the learning capability and the execution time [3]. The SAE outputs
50 extracted features, which are then combined with the 154 original features
existing in the AWID dataset [4]. Weighted feature selection methods were then
applied using well-referenced machine learners, including SVM, ANN, and C4.5, to
construct the candidate models, namely D-FES-SVM, D-FES-ANN, and D-FES-C4.5,
respectively. SVM separates the classes using a support vector (hyperplane).
ANN optimizes the parameters of the hidden layers to minimize the
classification error on the training data, whereas C4.5 adopts a hierarchical
decision scheme, such as a tree, to distinguish each feature [5]. The final step of the
detection task involves learning an ANN classifier with only 12–22 trained features.
The supervised feature selection block in Fig. 6.1 consists of three different
feature selection techniques. These techniques are similar in that they consider their
resulting weights to select the subset of essential features.
ANN is used as one of the weighted feature selection methods. The ANN was
trained with two target classes only (normal and impersonation attack classes).
Fig. 6.1 Stepwise procedure of D-FES with two target classes: normal and impersonation attack
Figure 6.2 shows an ANN network with one hidden layer only, where b1 and b2
represent the bias values for the hidden and output layers, respectively.
To select the essential features, the weight values between the first two layers
were considered. These weights represent the contribution of the input features to
the first hidden layer. A w_ij value close to zero means that the corresponding input
feature x_j is meaningless for further propagation; thus, having one hidden layer is
sufficient for this particular task. The importance value of each input feature is
computed as shown in Eq. (6.1):
V_j = \sum_{i=1}^{h} |w_{ij}|,  (6.1)

where h is the number of hidden neurons and w_{ij} is the weight connecting input feature x_j to hidden neuron i.
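A minimal sketch of this weight-based importance is shown below, assuming a trained single-hidden-layer network (here scikit-learn's MLPClassifier on placeholder data, not the original D-FES code): the importance of input feature j is the sum of the absolute first-layer weights leaving it, and features above an assumed threshold are kept.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy stand-in for the 204-dimensional D-FES feature matrix and binary labels.
rng = np.random.default_rng(0)
X, y = rng.random((500, 204)), rng.integers(0, 2, 500)

ann = MLPClassifier(hidden_layer_sizes=(50,), max_iter=300).fit(X, y)

W = ann.coefs_[0]                     # shape (n_inputs, n_hidden): entries w_ij
importance = np.abs(W).sum(axis=1)    # V_j = sum_i |w_ij|, Eq. (6.1)

threshold = importance.mean()         # an assumed threshold rule
selected = np.flatnonzero(importance > threshold)
print(len(selected), "features kept")
```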
a ranked list was initialized, to be filled by a subset of important features that are used
for selecting training instances. The scheme continues by training the classifier and
computing the weight vector of the dimension length. After the value of the weight
vector is obtained, the scheme computes the ranking criteria and finds the feature with the
smallest ranking criterion. Using that feature, the feature ranking list is updated,
and the feature with the smallest ranking criterion is eliminated. A feature ranked
list is finally produced as the output.
The last feature selection method uses the C4.5 DT. The feature selection process
begins by selecting the top-three-level nodes, as explained in Algorithm 4. It then
removes the duplicate nodes and updates the list of selected features.
6.1.2 Evaluation
The data contained in the AWID dataset are diverse in value: discrete, continuous,
and symbolic, with a flexible value range. These data characteristics could make
it difficult for classifiers to learn the underlying patterns correctly [11]. The
preprocessing phase thus includes mapping symbolic-valued attributes to numeric
values, according to the normalization steps and dataset-balancing process described
in Algorithm 5. The target classes were mapped to one of these integer-valued
classes: 1 for normal instances, 2 for an impersonation attack, 3 for a flooding attack, and 4 for
an injection attack. Meanwhile, symbolic attributes such as the receiver, destination,
transmitter, and source addresses were mapped to integer values with a minimum
value of 1 and a maximum value equal to the number of all symbols. Some
dataset attributes, such as the WEP Initialization Vector (IV) and Integrity Check
Value (ICV), were hexadecimal data, which needed to be transformed into integer
values as well. Continuous data such as the timestamps were also left for
the normalization step. Some of the attributes have question marks, ?, to indicate
unavailable values; one alternative was selected in which the question mark was
assigned a constant zero value [12]. After all data were transformed into
numerical values, attribute normalization is needed [13]. Data normalization is a
process by which the value ranges of all attributes are made equal. The mean range
method [14] was adopted, in which each data item is linearly normalized between
zero and one to avoid the undue influence of different scales [12]. Equation (6.2)
shows the normalizing formula:
z_i = \frac{x_i - \min(x)}{\max(x) - \min(x)},  (6.2)

where z_i denotes the normalized value, x_i refers to the corresponding attribute value,
and min(x) and max(x) are the minimum and maximum values of the attribute,
respectively.
The reduced “CLS” data are a good representation of a real network, in which
normal instances significantly outnumber attack instances. The ratio between the
normal and attack instances is 10:1 for both the unbalanced training and test datasets,
as shown in Table 6.1. This property might bias the training model and
affect the model performance [15, 16]. To alleviate this, the dataset was balanced by
randomly selecting 10% of the normal instances. A specific value was set
as the seed of the random number generator for reproducibility purposes. The ratio
between normal and attack instances became 1:1, which is an appropriate proportion
for the training phase [16]. D-FES was trained using the balanced dataset and then
verified on the unbalanced dataset.
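As a concrete sketch of this preprocessing (an illustration with placeholder data, not the original pipeline), the min-max normalization of Eq. (6.2) and the seeded 10% down-sampling of normal instances can be written as follows.

```python
import numpy as np

def min_max_normalize(X):
    """Eq. (6.2): scale every attribute linearly into [0, 1]."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)   # avoid division by zero
    return (X - x_min) / span

def balance(X, y, normal_label=1, keep_ratio=0.10, seed=42):
    """Randomly keep 10% of normal instances (fixed seed for reproducibility)."""
    rng = np.random.default_rng(seed)
    normal_idx = np.flatnonzero(y == normal_label)
    attack_idx = np.flatnonzero(y != normal_label)
    kept = rng.choice(normal_idx, size=int(len(normal_idx) * keep_ratio), replace=False)
    idx = np.concatenate([kept, attack_idx])
    return X[idx], y[idx]

# Toy example with a 10:1 normal-to-attack ratio.
rng = np.random.default_rng(0)
X = rng.random((1100, 154))
y = np.array([1] * 1000 + [2] * 100)
Xb, yb = balance(min_max_normalize(X), y)
print(np.bincount(yb)[1:])   # roughly balanced class counts
```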
Last, the usefulness and the utility of D-FES were validated on a realistic
unbalanced test dataset.
The SAE architectures were varied to optimize the SAE implementation with
two hidden layers. The features generated by the first encoder layer were
employed as the training data for the second encoder layer. Meanwhile, the size of
each hidden layer was decreased accordingly so that the encoder in the second
layer learns an even smaller representation of the input data. A regression
layer with the softmax activation function was then implemented in the final step.
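A hedged Keras sketch of such a stacked autoencoder is given below: two encoder layers (here 100 and 50 neurons, matching the larger schemes described next) trained consecutively, followed by a softmax output layer. The optimizer, epoch counts, and random stand-in data are assumptions, and Keras is not claimed to be the original implementation.

```python
import numpy as np
import tensorflow as tf

X = np.random.rand(2000, 154).astype("float32")            # stand-in for 154 AWID features
y = tf.keras.utils.to_categorical(np.random.randint(4, size=2000), 4)

def train_encoder_layer(inputs, n_hidden):
    """Greedily train one autoencoder layer and return its hidden activations."""
    ae = tf.keras.Sequential([
        tf.keras.layers.Dense(n_hidden, activation="sigmoid"),
        tf.keras.layers.Dense(inputs.shape[1], activation="sigmoid"),
    ])
    ae.compile(optimizer="adam", loss="mse")
    ae.fit(inputs, inputs, epochs=5, batch_size=128, verbose=0)  # reconstruct the input
    return ae.layers[0](inputs).numpy()                          # encoded representation

h1 = train_encoder_layer(X, 100)   # first hidden layer: 100 features
h2 = train_encoder_layer(h1, 50)   # second hidden layer: 50 features

clf = tf.keras.Sequential([tf.keras.layers.Dense(4, activation="softmax")])
clf.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
clf.fit(h2, y, epochs=5, batch_size=128, verbose=0)              # softmax layer on top
```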
Four schemes were examined to determine the SAE learning characteristics. The
first scheme, Imbalance_40, has two hidden layers with 40 and 10 hidden neurons,
respectively. The second scheme, Imbalance_100, also has two hidden layers
but employs 100 and 50 hidden neurons. Although there is
no strict rule for determining the number of hidden neurons, we consider a common
rule of thumb [17], which ranges from 70% to 90% of the number of inputs. The third and
fourth schemes, named Balance_40 and Balance_100, have the same hidden layer
architectures as the first and second schemes, respectively; in this case, however,
the balanced dataset was used, because the common assumption is that a classifier
model built from a highly unbalanced data distribution performs poorly on minority
class detection [18]. For testing purposes, all four classes contained in the AWID
dataset were used.
Table 6.2 shows the evaluation of the SAE schemes. Each model uses either
balanced or unbalanced data for the SAE algorithm with the following parameters:
input features, number of features in the 1st hidden layer, number of features in the 2nd
hidden layer, and target classes. The SAE architectures with 100 hidden neurons
have a higher DR than those with 40 hidden neurons. On the other hand, the SAE
architectures with 40 hidden neurons have a lower FAR than those with 100 hidden
neurons. To draw a proper conclusion, other performance metrics that consider
all classes are needed, as the DR checks the attack class only and the FAR
measures the normal class only. The Acc metric is affected by the
distribution of the data, so different balanced and unbalanced distributions may
lead to an incorrect conclusion. If we consider the Acc metric only, as in Fig. 6.3,
we may incorrectly select Imbalance_100 with 97.03% accuracy, whereas
Balance_100 achieved only 87.63% accuracy. In fact, Imbalance_100 achieved
the highest accuracy because of the unbalanced proportion of the normal class to the
attack class. The best performance was identified by checking the F1 score, for which
Balance_100 achieved the highest value among all schemes with 87.59%.
Therefore, the SAE architecture with the 154:100:50:4 topology was chosen.
Fig. 6.3 Evaluation of the SAE schemes in terms of Acc and F1 score. The red bar represents the F1 score
while the blue bar represents the Acc rate
Fig. 6.4 Cross-entropy error of the ANN. The best validation performance was achieved at epoch 163
A subset of features was selected using the wrapper-based method by considering
each feature's weight. For ANN, a threshold weight value was defined; if
the weight of a feature is higher than the threshold, the feature is selected. The
SVM attribute selection function ranks the features based on their weight values,
and the subset of features with weights higher than the predefined threshold
was then selected. Similarly, C4.5 produces a deep binary tree, and the features
belonging to the top-three levels of the tree were chosen. CFS produces a fixed
number of selected features, and Corr provides a correlated feature list.
During ANN training for both feature selection and classification, the trained
model was optimized using a separate validation dataset; that is, the dataset was
separated into three parts, training, validation, and testing data, in the
proportions 70%, 15%, and 15%, respectively. The training data were
used as input to the ANN during training, and the weights of the neurons were
adjusted according to the classification error. The validation data
were used to measure model generalization, providing useful information on when to
terminate the training process. The testing data were used as an independent measure
of the model performance after training. The model is said to be optimized when it
reaches the smallest average square error on the validation dataset. Figure 6.4 shows
an example of ANN performance in terms of the cross-entropy error during ANN
training. At epoch 163, the cross-entropy error, a logarithmic-based error
Table 6.3 Feature set comparisons between feature selection and D-FES

Method | Selected features                             | D-FES
CFS    | 5, 38, 70, 71, 154                            | 38, 71, 154, 197
Corr   | 47, 50, 51, 67, 68, 71, 73, 82                | 71, 155, 156, 159, 161, 165, 166, 179, 181, 191, 193, 197
ANN    | 4, 7, 38, 77, 82, 94, 107, 118                | 4, 7, 38, 67, 73, 82, 94, 107, 108, 111, 112, 122, 138, 140, 142, 154, 161, 166, 192, 193, 201, 204
SVM    | 47, 64, 82, 94, 107, 108, 122, 154            | 4, 7, 47, 64, 68, 70, 73, 78, 82, 90, 94, 98, 107, 108, 111, 112, 122, 130, 141, 154, 159
C4.5   | 11, 38, 61, 66, 68, 71, 76, 77, 107, 119, 140 | 61, 76, 77, 82, 107, 108, 109, 111, 112, 119, 158, 160
measurement comparing the output values and the desired values, starts increasing,
meaning that at epoch 163 the model was optimized. Although the training error
keeps decreasing after epoch 163, the performance of the model no longer improves,
as the decreasing training error may indicate overfitting.
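The 70/15/15 partition described above can be sketched with scikit-learn as follows (the seeds and placeholder data are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(1000, 204), np.random.randint(0, 2, 1000)

# 70% training, 15% validation, 15% testing.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=1)
print(len(X_train), len(X_val), len(X_test))   # 700 150 150
```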
Table 6.3 contains all the feature lists selected by the various feature selection
methods. Some features are essential for detecting an impersonation attack: the
4th and 7th features, which were selected by both the ANN and the SVM, and the 71st feature,
which was selected by CFS and Corr. The characteristics of the selected features are
shown in Fig. 6.5a, b. The blue line indicates normal instances, while
the red line depicts the characteristics of an impersonation attack. Normal and
attack instances can be distinguished based on the attribute values of the data instances.
For example, once a data instance has an attribute value of 0.33 in the 166th feature,
it has a high probability of being classified as an attack. This applies
to the 38th and other features as well.
Table 6.4 lists the performance of each algorithm on the selected feature set only.
SVM achieved the highest DR (99.86%) and Mcc (99.07%). However, it requires
a CPU time of 10,789 s to build a model, the longest among the models observed.
As expected, the filter-based methods (CFS and Corr) built their models quickly;
however, CFS attained the lowest Mcc (89.67%).
Table 6.5 compares the performances of the candidate models on the feature
sets produced by D-FES. SVM again achieved the highest DR (99.92%)
and Mcc (99.92%). It also achieved the lowest FAR, with a value of only 0.01%.
Similarly, the lowest Mcc was achieved by Corr (95.05%). This shows that
wrapper-based feature selection outperforms filter-based feature selection. As
SVM showed the best performance, the properties of the features selected by SVM
are described in Table 6.6.
The following patterns were observed from Tables 6.4 and 6.5. Only two out of
five methods (Corr and C4.5) showed a lower FAR without D-FES, which is expected
to minimize the FAR value of the proposed IDS. This phenomenon might occur
because the original and extracted features are not correlated, since Corr and
C4.5 measure the correlation between features. Filter-based feature selection
Fig. 6.5 Characteristics of (a) 38th and (b) 166th features. The blue line represents normal
instances while the red line represents attack instances
methods require much shorter CPU time than D-FES. However, D-FES improves
the performance of the filter-based feature selections significantly.
Similar patterns are captured in Fig. 6.6a–c, which depicts the performance of
the different models in terms of Acc, Precision, and F1 score, respectively. D-FES-SVM
achieved the highest Acc, Precision, and F1 score of 99.97%, 99.96%,
and 99.94%, respectively. With D-FES, all methods achieve a Precision of more than
96%, which shows that D-FES can reduce the number of normal instances incorrectly
classified as attacks. Also, D-FES improves the Acc of the filter-based feature
selections significantly. Except for C4.5, all feature selection methods
improved both the Acc and the F1 score by using D-FES. We compare D-FES-SVM,
which achieved the highest F1 score, against randomly selected features with respect to the number
of features involved during training, as depicted in Fig. 6.7. D-FES-SVM takes a
longer time than the random method but increases the DR significantly;
the random method cannot even classify a single impersonation attack. This makes
the proposed D-FES a good candidate for an intrusion detector.
An IDS has become a vital measure in any network, especially Wi-Fi networks,
whose growth is undeniable due to the vast number of tiny devices
connected via Wi-Fi. Regrettably, adversaries may take advantage of this by
launching an impersonation attack, a typical wireless network attack.
Fig. 6.6 Model performance comparisons in terms of (a) Acc, (b) Precision, and (c) F1 score.
The blue bar represents performance by feature selection only, while the red bar represents
performance by D-FES
6.2 Deep Learning for Clustering
6.2.1 Methodology
In this section, the novel fully unsupervised deep learning-based IDS for detecting
impersonation attacks is explained. There are two main tasks: feature extraction and
clustering. Figure 6.8 shows the scheme, which contains two main functions
in cascade. A real Wi-Fi network trace, the AWID dataset [4], is used, which contains
154 original features. Before the scheme starts, normalization and balancing processes
should be carried out to achieve the best training performance. Algorithm 6 explains the
procedure of the scheme in detail.
The scheme starts with two cascading encoders, and the output features from
the second layer are then forwarded to the clustering algorithm. The first encoder
has 100 neurons in the first hidden layer, while the second encoder has only
50 neurons. A standard rule for choosing the number of neurons in a hidden
layer is to use 70% to 90% of the number of neurons in the previous layer. Here, k = 2 was defined
since only two classes were considered. The scheme ends with two clusters formed by the
k-means clustering algorithm. These clusters represent benign and malicious data.
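A minimal sketch of the clustering stage is given below, assuming the 50 SAE-extracted features are already available (random placeholders here) and using scikit-learn's KMeans with a fixed seed for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the 50 features produced by the second encoder layer of the SAE.
sae_features = np.random.rand(5000, 50)

# k = 2: one cluster for benign traffic, one for impersonation attacks.
kmeans = KMeans(n_clusters=2, random_state=7, n_init=10)  # fixed seed for reproducibility
labels = kmeans.fit_predict(sae_features)
print(np.bincount(labels), kmeans.cluster_centers_.shape)  # cluster sizes, (2, 50)
```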
Fig. 6.8 Our proposed scheme contains feature extraction and clustering tasks
6.2.2 Evaluation
There are two hidden layers in the SAE network, with 100 and 50 neurons,
respectively. The encoder in the second layer is fed with the features formed by the first
encoder layer. The softmax activation function was implemented in the final
stage of the SAE to optimize the SAE training. The 50 features extracted from
the SAE were then forwarded to the k-means clustering algorithm as input. Random
initialization was used for the k-means clustering algorithm; however, a particular value
must be defined as the random number seed for reproducibility purposes. Clustering
results were compared for three inputs: the original data, the features from the first hidden
layer of the SAE, and the features from the second hidden layer of the SAE, as shown
in Table 6.7.
The limitation of a traditional k-means algorithm was observed: it is
unable to cluster the complex and high-dimensional data of the AWID dataset, as expressed
by an accuracy of only 55.93%. Although the 100 features coming from the 1st hidden
layer achieved a DR of 100%, the false alarm rate was still unacceptable at 57.48%.
Fig. 6.9 Cluster assignment result in Euclidean space by our proposed scheme
The k-means algorithm fed with the 50 features from the 2nd hidden layer achieved
the best performance of all, as shown by the highest F1 score (89.06%) and
Acc (94.81%), as well as the lowest FAR (4.40%). Despite a slightly lower DR, the scheme
improved on the traditional k-means algorithm overall, with almost twice the F1 score and
accuracy.
Figure 6.9 shows the cluster assignment result in Euclidean space using this scheme.
Black dots represent attack instances, while gray dots represent benign instances.
The location of the centroid of each cluster is marked by an X.
The performance of the scheme was also compared against two previous related works
by Kolias et al. [4] and Aminanto and Kim [24], as shown in Table 6.8. The scheme
can classify impersonation attack instances with a DR of 92.18% while maintaining a
low FAR of 4.40%. Kolias et al. [4] tested various classification algorithms, such as
Random Tree, Random Forest, J48, and Naive Bayes, on the AWID dataset. Among
all methods, the Naive Bayes algorithm showed the best performance by correctly
classifying 4,419 out of 20,079 impersonation instances. It achieved a DR of approximately
22% only, which is unsatisfactory. Aminanto and Kim [25] proposed another
impersonation detector by combining an ANN with an SAE, which successfully improved
the IDS model for the impersonation attack detection task by achieving a DR of 65.18%
and a FAR of 0.14%. In this study, the SAE was leveraged to assist traditional k-means
clustering with extracted features. Although the scheme resulted in a relatively high
false alarm rate, which has a severe impact on an IDS [26], this false
alarm rate of about 4% can be acceptable since a fully unsupervised approach
was used. The parameters can be adjusted to cut the FAR down, but lower FAR
versus higher DR remains a trade-off in practice and needs to be investigated in the
future. It is observed that the advantage of the SAE is in abstracting complex and
high-dimensional data to assist a traditional clustering algorithm, which is shown by the reliable
DR and F1 score achieved by the scheme.
6.3 Comparison
The goal of deep learning methods is to learn feature hierarchies from lower-level
to higher-level features [27]. The technique can learn features independently
at multiple levels of abstraction and thus discover complicated functions mapping
the input to the output directly from raw data, without depending on
features customized by experts. At higher levels of abstraction, humans often cannot
see the relation and connection to the raw sensory input. Therefore,
the ability to learn sophisticated features, also called feature extraction, must be
highlighted as the amount of data increases sharply [27]. SAE is one good instance
of a feature extractor. Therefore, several previous works that implement SAE as
the feature extractor, or in other roles in the IDS module, are discussed, as shown in
Table 6.9.
Feature extraction by SAE can reduce the complexity of the original features of the
dataset. Besides being a feature extractor, however, SAE can also be used for classification
and clustering tasks, as shown in Table 6.9. AK16b [28] used a semi-supervised
approach for IDS that contains a feature extractor (unsupervised learning) and a
classifier (supervised learning). SAE was leveraged for feature extraction, and a
regression layer with the softmax activation function served as the classifier. SAE was also
used as a feature extractor in ACTYK17 [2], but ANN, DT, and SVM were leveraged
for feature selection. In other words, it combines stacked feature extraction with
weighted feature selections. According to the experimental results [2], D-FES improved the feature
learning process by combining stacked feature extraction with weighted feature
selection. The feature extraction of SAE is capable of transforming the original
features into a more meaningful representation by reconstructing its input; it
provides a way to check that the relevant information in the data has been captured.
SAE can be efficiently used for unsupervised learning on a complex dataset.
Unlike the two previous approaches, AK16a [24] and AK17 [23] used SAE in
roles other than a feature extractor, namely as classifying and clustering methods,
respectively. ANN was adopted as a feature selection method since the weights of a trained
model reflect the significance of the corresponding inputs [24]. By selecting the
important features only, the training process becomes lighter and faster than before.
AK16a [24] exploited SAE as a classifier since it employs consecutive layers
of processing stages in a hierarchical manner for pattern classification and feature
or representation learning. On the other hand, AK17 [23] proposed a novel fully
unsupervised method that can detect attacks without prior information on data
labels. The scheme is equipped with an unsupervised SAE for extracting features
and a k-means clustering algorithm for the clustering task.
Kolias et al. [4] tested many existing machine-learning models on the dataset in
a heuristic manner. The lowest DR was observed on the impersonation attack in particular,
reaching only about 22%. Therefore, improving impersonation detection
is challenging, and hence the comparison of previous approaches to impersonation
detection is summarized in Table 6.10. DR refers to the number of attacks detected
divided by the total number of attack instances in the test dataset, while FAR is the
number of normal instances classified as attacks divided by the total number of
normal instances in the test dataset.
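These two definitions translate directly into code; the counts below are illustrative only (chosen to reproduce the 92.18% DR and 4.40% FAR figures quoted above), not the actual confusion matrix.

```python
def detection_rate(tp, fn):
    """DR: detected attacks over all attack instances."""
    return tp / (tp + fn)

def false_alarm_rate(fp, tn):
    """FAR: normal instances flagged as attacks over all normal instances."""
    return fp / (fp + tn)

# Worked example: 9,218 of 10,000 attacks detected, 440 of 10,000 normals flagged.
print(f"DR  = {detection_rate(9218, 782):.2%}")    # 92.18%
print(f"FAR = {false_alarm_rate(440, 9560):.2%}")  # 4.40%
```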
From Table 6.10, it is observed that SAE can improve the performance of an IDS
compared to KKSG15 [4]. It is verified that SAE achieves a high-level abstraction
of complex and huge Wi-Fi network data. The SAE's model-free properties and
learnability on complex and large-scale data fit the open nature of Wi-Fi
networks. Among all the IDSs, the one using SAE as a classifier achieved the lowest
impersonation attack DR, with 65.178% only. This shows that SAE can serve as a classifier,
but not an excellent one, as the original role of SAE is feature extraction. The usability
of SAE as a feature extractor is validated by AK16b [28] and ACTYK17 [2],
which achieved the highest DRs. Moreover, the combination of the SAE extractor
and weighted selection [2] achieved the best DR and FAR among all. Besides that,
an interesting fact is that SAE can assist the k-means clustering
algorithm to achieve a better performance, with a DR of 92.180% [23]. However,
further analysis is required to reduce the FAR, since a higher FAR is undesirable
for practical IDSs.
References
Appendix A
A Survey on Malware Detection from Deep Learning
Recently, computer security has faced increasing challenges. Because static
analysis is perceived as vulnerable to obfuscation and evasion attacks, Rieck
et al. [1] tried to develop dynamic malware analysis. The primary challenge in using
dynamic malware analysis is the time needed to perform the analysis. Furthermore,
as the amount and diversity of malware increase, the time required to generate
detection patterns also grows. Therefore, Rieck et al. proposed a malware detection
method to improve the performance of malware detectors based on behavior analysis.
In this experiment, the Malheur datasets were used. These datasets were created by
the authors using behavior reports of malware binaries from the anti-malware vendor
Sunbelt Software. Each sample was executed and monitored using the CWSandbox
analysis environment, generating 3,131 behavior reports.
In this experiment, Rieck et al. [1] used four main steps. First, malware binaries
were executed and monitored in a sandbox environment, which output system calls
and their arguments. Then, in step 2, the sequential reports produced in
the previous step were embedded into a high-dimensional vector space based on their
behavioral patterns. By doing this, the vectorial representation could
be analyzed geometrically to design clustering and classification methods. In step 3, machine-learning
techniques were applied for clustering and classification to identify the
class of malware. Finally, incremental analysis of the malware's behavior was done
by alternating between the clustering and classification steps. The results show that the
proposed method successfully reduces the run time and memory requirements by
processing the behavior reports in chunks. Incremental analysis needed 25 min
to process the data, while regular clustering took 100 min. Furthermore,
regular clustering required 5 GB of memory during computation, while
incremental analysis needed less than 300 MB. It can therefore be concluded
that the incremental technique in behavior-based analysis gives better time and
memory performance than regular clustering.
Nowadays, the number and variety of malware keep increasing. As a result,
malware detection and classification need to be improved for effective prevention.
This work models malware system call sequences and uses them for
classification with deep learning. The primary purpose of leveraging machine
learning in this experiment is to find fundamental patterns in a large dataset. The
malware sample dataset was gathered from VirusShare, Maltrieve, and private
collections. There are three main contributions in this paper. First, Kolosnjaji et
al. built a DNN and applied it to examine system call sequences. Then, in order
to optimize the malware classification process, convolutional neural networks and RNNs
were combined. Finally, the performance of the proposed method was analyzed by
examining the activation patterns of the neural units.
During the malware classification process, Kolosnjaji et al. [2] used the malware
collection from the dataset as the input to Cuckoo Sandbox, which output numerical
feature vectors. After that, they used the TensorFlow and Theano frameworks to
construct and train the neural networks, which output a list of malware families.
The neural network consisted of two parts, a convolutional part and a recurrent part.
The convolutional part consisted of convolutional and pooling layers. First, the
convolutional layers captured the correlation between neighboring input vectors and
generated new features, resulting in feature vectors. The result of the convolutional
part was then forwarded to the input of the recurrent part. In the recurrent part, LSTM cells
were used to model the resulting sequence, and the importance was sorted based on
mean pooling. Finally, dropout was used to prevent overfitting, and a softmax layer
produced the output. The experimental results showed that the combination of a
convolutional network and LSTM gave better accuracy (89.4%) compared to a
feedforward network (79.8%) and a convolutional network (89.2%).
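A hedged Keras sketch of such a convolutional-recurrent classifier over system call sequences is shown below; the embedding layer, vocabulary size, sequence length, and placeholder data are assumptions for illustration and do not reproduce the exact architecture of [2].

```python
import numpy as np
import tensorflow as tf

VOCAB, SEQ_LEN, N_FAMILIES = 60, 100, 10   # assumed vocabulary, length, and family count

# Placeholders standing in for integer-encoded system call sequences and labels.
x = np.random.randint(VOCAB, size=(512, SEQ_LEN))
y = tf.keras.utils.to_categorical(np.random.randint(N_FAMILIES, size=512), N_FAMILIES)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB, 32),              # dense encoding of system calls (assumed)
    tf.keras.layers.Conv1D(64, 3, activation="relu"),  # correlations between neighboring calls
    tf.keras.layers.MaxPooling1D(2),                   # pooling layer
    tf.keras.layers.LSTM(64),                          # recurrent part modeling the sequence
    tf.keras.layers.Dropout(0.5),                      # guards against overfitting
    tf.keras.layers.Dense(N_FAMILIES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=3, batch_size=64, verbose=0)
```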
Nowadays, malware detection methods can be divided into three categories: static
analysis, host behavior-based analysis, and network behavior analysis. Static analysis
can be evaded using packing techniques, and host behavior-based analysis
can be deceived by code injection. As a result, network behavior analysis
has come into the spotlight because it does not have those vulnerabilities, and the need
for communication between the attacker and the infected host makes this method
effective. One main challenge that hinders the use of network behavior analysis
in malware detection is the analysis time: various malware samples need to be
collected and analyzed for a long period because the user does not know when
the malware starts its activity. The main idea proposed by Shibahara et al. targets
two characteristics of malware communication: the change in communication purpose
and the common latent function. For the dataset, they first collected malware samples
from VirusTotal that were detected as malware
by antivirus programs. Then, they used malware samples in VirusTotal that have
different SHA-1 hashes from the previously collected malware samples. They used
29,562 malware samples in total for training, validation, and classification.
Their methodology consists of three main steps: feature extraction, neural
network construction, and training and classification. First, Shibahara
et al. [4] extracted features from communications collected with dynamic
malware analysis. These features are used as inputs to a recurrent neural network
(RNN). In the NN construction phase, the change in communication purpose is
captured. During training and classification, the feature vectors of the root nodes
were calculated according to VirusTotal, and the classification result was given
based on these vectors. During the experiment, the analysis time and the time reduction
of the proposed method were compared with those of the regular continuation method.
The results show that the proposed method reduces the analysis time by 67.1% and
keeps the coverage of URLs at 97.9% compared to the full analysis method.
Since the development of machine learning several years ago, there have been
various ideas for implementing machine learning as an engine in malware detection
software. However, because there is a time delay between malware landing on a user's
system and the signature generation process, the malware can harm the user. For that
reason, Raman et al. utilized data mining to identify seven key features of the Microsoft
PE file format that can be used as input to a classifier. The seven features are
then used in a machine-learning algorithm for malware classification. Raman et al.
generated their dataset from PE files. First, they wrote a parser to extract features
from PE files. They used their experience in malware analysis to select a set of 100
features from the initial 645 features. Finally, they created a dataset of 5,193 dirty
files and 3,722 clean files to evaluate those 100 features.
The focus of the methodology is feature extraction and feature selection, combining
an intuitive method and a machine-learning method during feature
selection. First, Raman et al. [7] used their domain knowledge to reduce the number of
features from 645 to 100. Then, the random forest algorithm was utilized to choose 13
features. Finally, four classifiers (J48Graft, PART, IBk, and J48) were used to check
the accuracy of each feature and choose the seven highest-ranked features. Those features
are the debug size, which denotes the size of the debug directory table; the image version,
which denotes the version of the file; debugRVA, which denotes the relative virtual address
of the import address table; ExportSize, which denotes the size of the export table;
ResourceSize, which denotes the size of the resource section; VirtualSize2, which denotes
the size of the second section; and NumberOfSections, which denotes the number of
sections.
The main problem addressed in this paper is the increasing variety of malware, which
makes manual heuristic malware detection methods inadequate. In order to cope
with this problem, Firdausi et al. proposed an automatic behavior-based malware
detection method utilizing machine learning. In this research, they used datasets in
the Windows Portable Executable format. The dataset consisted of 250 benign
instances collected from the System32 directory of Windows XP 32-bit SP2, together
with 220 malware samples collected from various sources. For the monitoring process, Firdausi
et al. [8] used Anubis, a free online automatic dynamic analysis service, to monitor
both the malware and benign samples. Then, the performance of five classifiers was
compared using their proposed method. The five classifiers were kNN, naive Bayes,
J48 DT, SVM, and Multi-Layer Perceptron (MLP).
For the methodology of the research, Firdausi et al. first acquired data
from the community and virology.info. Then, the behavior of each malware sample was
analyzed in an emulated sandbox environment; the Anubis Sandbox was chosen for
API hooking and system call monitoring. After that, the report, generated in XML
format, was processed into a sparse vector model. In the next step,
XML file parsing, feature selection, and feature model creation were done based on
that XML file. The last step was classification based on that model.
Firdausi et al. applied machine-learning tools, did parameter tuning, and finally
tested their scheme. The results of this experiment show that feature extraction
reduces the number of attributes from 5,191 to 116. By performing feature selection,
the time consumed to train and build the model became shorter. The overall best
performance was achieved by the J48 classifier, with a 94.2% true positive rate, a 9.2%
False Positive Rate (FPR), 89.0% precision, and 92.3% accuracy.
For several years, it has been known that traditional malware detection can be divided into
static and dynamic methods, which are usually implemented in antivirus
software. The static method uses a signature database to detect malware; the dynamic
method, on the other hand, runs the suspicious program to check from its behavior whether
it is malware or not. However, there is a problem: the detection software itself is vulnerable
to exploitation during infection and can be disabled by malware. Therefore, in this
paper, Xu et al. proposed a hardware-assisted detection mechanism that
is not vulnerable to such disabling. This idea, however, relies on expert
knowledge of the executable binary and its memory layout, so as a solution
they decided to use machine learning to detect the malicious actions of malware.
The primary purpose of this paper is to learn one model for each application that
separates malware-infected execution from legitimate execution. Furthermore, Xu
et al. perform the classification based on memory access patterns.
In order to monitor memory accesses, Xu et al. [9] performed epoch-based monitoring:
program execution is divided into epochs, and each epoch is separated by inserting a
marker into the memory stream. They found that for most
malicious behaviors, the deciding feature was the location and frequency of memory
accesses rather than their sequence. After the monitoring had finished, the
classifier was trained using summary histograms of the epochs. During training, the
program was executed, and each histogram was labeled either malicious or benign.
After the training model had been defined, the binary signature was verified, and
the model was loaded into the hardware classifier. The last step was hardware
execution monitoring; if malware was detected, an authenticated handler would be
launched automatically. During the experiment, they used three classifiers (SVM,
random forest, and logistic regression) and compared their performance. The best
performing classifier was random forest, with a 99% true positive rate and less than
1% FPR.
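A simplified sketch of this idea is shown below, with an assumed bucketing of addresses into a fixed-size histogram and scikit-learn's random forest standing in for the hardware classifier (synthetic data, not the original memory traces).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def epoch_histogram(addresses, n_buckets=256):
    """Summarize one execution epoch as a histogram of memory-access locations."""
    return np.bincount(np.asarray(addresses) % n_buckets, minlength=n_buckets)

rng = np.random.default_rng(0)
# Stand-in epochs: each is a list of accessed addresses; label 1 = infected run.
X = np.stack([epoch_histogram(rng.integers(0, 1 << 20, size=500)) for _ in range(400)])
y = rng.integers(0, 2, size=400)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.score(X, y))  # training accuracy on the synthetic histograms
```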
Lately, the variety of malware keeps increasing and threatening the security of
our computer systems. People cannot rely on traditional signature-based antivirus
software anymore; zero-day malware easily bypasses regular antivirus software because
its signature is not yet in the antivirus database. To solve this problem, Gandotra et al.
[10] proposed a combination of static and dynamic malware analysis with machine-learning
algorithms for malware detection and classification. However, there were
several problems with this scheme. First, the scheme had a high FPR and False Negative
Rate (FNR). Second, it took time to build the classification model because of the
large dataset. As a result, early malware detection was not possible with this scheme.
Knowing these problems, Gandotra et al. concluded that the challenge was to
select a relevant set of features so that the model building time could be reduced and the
accuracy improved. The dataset used in this experiment was taken from
VirusShare. About 3,130 portable executable files, comprising 1,720 malicious
and 1,410 clean files, were utilized. All files were executed in a sandbox to obtain their
attributes, which were then used to build the classification model using WEKA.
The methodology consisted of six main steps. The first step was data acquisition:
they collected malware samples targeting the Windows OS from the VirusShare
database and collected clean files manually from the system directories of
Windows. The second step was automated malware analysis, in which a modified
Cuckoo Sandbox was used to execute the specimens and generate the results as
JavaScript Object Notation (JSON) files. The third step was feature extraction: the
JSON reports generated by Cuckoo Sandbox were parsed to obtain the
various malware features, resulting in a feature set of 18 malware attributes that
can be used to build the classification model. The next step was feature selection.
They selected the top seven features using the Information Gain (IG) method, an entropy-based
method for feature evaluation that is widely used in machine learning.
The last step was classification. They used the selected seven features to build the
classification model using machine-learning algorithms in the WEKA library. The seven
classifiers used were IB1, naive Bayes, J48, random forest, bagging, decision table,
and multilayer perceptron. The results of their experiment showed that random forest
gave the best accuracy with 99.97%, and the time to build the model was 0.09 s.
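As an illustration of entropy-based feature selection, the sketch below ranks placeholder attributes with scikit-learn's mutual information score, which is closely related to, but not identical to, WEKA's IG criterion, and keeps the top seven.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Stand-in for the 18 extracted malware attributes and binary labels.
rng = np.random.default_rng(0)
X, y = rng.random((300, 18)), rng.integers(0, 2, 300)

# mutual_info_classif is an entropy-based criterion analogous to information gain.
selector = SelectKBest(score_func=mutual_info_classif, k=7).fit(X, y)
print("selected feature indices:", sorted(selector.get_support(indices=True)))
```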
References
1. K. Rieck, P. Trinius, C. Willems, and T. Holz, “Automatic analysis of malware behavior using
machine learning,” Journal of Computer Security, vol. 19, no. 4, pp. 639–668, 2011.
2. B. Kolosnjaji, A. Zarras, G. Webster, and C. Eckert, “Deep learning for classification of
malware system call sequences,” in Australasian Joint Conference on Artificial Intelligence.
Springer, 2016, pp. 137–149.
3. S. Tobiyama, Y. Yamaguchi, H. Shimada, T. Ikuse, and T. Yagi, “Malware detection with deep
neural network using process behavior,” in Computer Software and Applications Conference
(COMPSAC), 2016 IEEE 40th Annual, vol. 2. IEEE, 2016, pp. 577–582.
4. T. Shibahara, T. Yagi, M. Akiyama, D. Chiba, and T. Yada, “Efficient dynamic malware anal-
ysis based on network behavior using deep learning,” in Global Communications Conference
(GLOBECOM), 2016 IEEE. IEEE, 2016, pp. 1–7.
5. L. Liu, B.-s. Wang, B. Yu, and Q.-x. Zhong, “Automatic malware classification and new
malware detection using machine learning,” Frontiers of Information Technology & Electronic
Engineering, vol. 18, no. 9, pp. 1336–1347, 2017.
6. O. E. David and N. S. Netanyahu, “Deepsign: Deep learning for automatic malware signature
generation and classification,” in Neural Networks (IJCNN), 2015 International Joint Confer-
ence on. IEEE, 2015, pp. 1–8.
7. K. Raman et al., “Selecting features to classify malware,” InfoSec Southwest, vol. 2012, 2012.