Review:
Cyber Security Meets Artificial Intelligence: A Survey
Jian-hua LI
School of Cyber Security, Shanghai Jiao Tong University, Shanghai 200240, China
E-mail: lijh888@sjtu.edu.cn
Received Sept. 16, 2018; Revision accepted Dec. 13, 2018; Crosschecked Dec. 24, 2018
Abstract: There is a wide range of interdisciplinary intersections between cyber security and artificial intelligence (AI). On one hand, AI technologies, such as deep learning (DL), can be introduced into cyber security to construct smart models for implementing malware classification, intrusion detection, and threat intelligence sensing. On the other hand, AI models will face various cyber threats, which will disturb their samples, learning processes, and decisions. Thus, AI models need specific cyber security defense and protection technologies to combat adversarial machine learning, preserve privacy in machine learning, secure federated learning, etc. Based on the above two aspects, we review the intersection of AI and cyber security. First, we summarize existing research efforts in terms of combating cyber attacks using AI, including adopting traditional machine learning methods and existing deep learning solutions. Then, we analyze the counterattacks from which AI itself may suffer, dissect their characteristics, and classify the corresponding defense methods. Finally, from the aspects of constructing an encrypted neural network and realizing secure federated deep learning, we elaborate on the existing research on how to build a secure AI system.
Key words: Cyber security; Artificial intelligence (AI); Attack detection; Defensive techniques
https://doi.org/10.1631/FITEE.1800573 CLC number: TP309
great progress in recent years. In particular, DL technology has enabled people to benefit from more data, obtain better results, and develop more potential. It has dramatically changed people's lives and reshaped traditional AI technology. AI has a wide range of applications, such as facial recognition, speech recognition, and robotics, but its application scope goes far beyond the three aspects of image, voice, and behavior. It also has many other outstanding applications in the field of cyber security, such as malware monitoring and intrusion detection. In the early development of AI technology, machine learning (ML) played a vital role in dealing with cyberspace threats. Although ML is very powerful, it relies too much on feature extraction. This flaw is particularly glaring when it is applied to the field of cyber security. For example, to enable an ML solution to recognize malware, we have to manually compile the various features associated with malware, which undoubtedly limits the efficiency and accuracy of threat detection. This is because ML algorithms work according to pre-defined specific features, which means that features that are not pre-defined will escape detection and cannot be discovered. It can be concluded that the performance of most ML algorithms depends on the accuracy of feature recognition and extraction (Golovko, 2017). In view of the obvious flaws in traditional ML, researchers began to study the deep neural network (DNN), also known as DL, which is a sub-domain of ML. A big conceptual difference between traditional ML and DL is that DL can be used to train directly on the original data without extracting its features. In the past few years, DL has achieved a 20%–30% performance improvement in the fields of computer vision, speech recognition, and text understanding, and achieved a historic leap in the development of AI (Deng and Yu, 2014). DL can detect nonlinear correlations hidden in the data, support any new file types, and detect unknown attacks, which is an attractive advantage in cyber security defense. In recent years, DL has made great progress in preventing cyber security threats, especially in preventing APT attacks. DNN can learn the high-level abstract characteristics of APT attacks, even if they employ the most advanced evasion techniques (Yuan, 2017).

Although novel AI technologies, such as DL, play an important role in cyberspace defense, an AI system itself may also be attacked or deceived, resulting in incorrect classification or prediction results. For example, in adversarial environments, manipulating training samples results in poisoning attacks, and manipulating test samples results in evasion attacks. Attacks in adversarial environments are intended to undermine the integrity and usability of various AI applications, and mislead neural networks by employing adversarial samples, causing classifiers to derive wrong classifications. Of course, there are corresponding defense measures against adversarial attacks. These defense measures focus mainly on three aspects (Akhtar and Mian, 2018): (1) modifying the training process or input samples; (2) modifying the network itself, such as adding more layers/sub-networks and changing the loss/activation function; (3) using some external models as network add-ons when classifying samples that have not appeared. As DL models become more complex and datasets become larger, centralized training methods cannot adapt to these new requirements. Distributed learning modes, such as federated learning launched by Google, have emerged, enabling many intelligent terminals to learn a shared model in a collaborative way. However, all training data are stored on terminal devices, which brings many security challenges. How to ensure that the model is not maliciously stolen, and how to construct a distributed ML system with privacy protection, are major research hotspots.
1464 Li / Front Inform Technol Electron Eng 2018 19(12):1462-1474
poor performance on massive and complex data representation (LeCun et al., 2015). DL was proposed to solve the above deficiencies. DL imitates the process of human neurons and builds the neural architecture with complex interconnections. Today, DL is a research hotspot in academia and has been widely used in various industrial scenarios. Therefore, we will introduce the categorization and the applications of state-of-the-art models in DL research in different areas.

2.1 Categorization of deep learning

The categorization of DL is based on its learning mechanism. There are three kinds of primary learning mechanisms: supervised learning, unsupervised learning, and reinforcement learning.

2.1.1 Supervised learning

Supervised learning requires labeled input data, and is usually used as a classification mechanism or a regression mechanism. For example, malware detection is a typical binary classification scenario (malicious or benign) (Goodfellow et al., 2014). In contrast to classification, regression learning outputs a prediction value that is one or more continuous-valued numbers according to the input data.

2.1.2 Unsupervised learning

In contrast to supervised learning, the input data of unsupervised learning are unlabeled. Unsupervised learning is often used to cluster data, reduce dimensionality, or estimate density. For instance, a fuzzy deep belief network (DBN) system that incorporates the Takagi-Sugeno-Kang (TSK) fuzzy system can provide an adaptive mechanism to regulate the depth of the DBN and obtain highly accurate clustering.

2.1.3 Reinforcement learning

Reinforcement learning is based on rewarding the actions of a smart agent. It can be considered a fusion of supervised learning and unsupervised learning, and is suitable for tasks that have long-term feedback (Arulkumaran et al., 2017). By combining the advances in training deep neural networks, Mnih et al. (2015) developed the deep Q-network, which can achieve human-level control as a deep reinforcement learning architecture.

2.2 Deep learning applications

In this part, we review the applications of DL. DL is widely used in autonomous systems because of its significant advantages in optimization, discrimination, and prediction. Because the application areas are so numerous, we introduce only a few representative application domains.

2.2.1 Image and video recognition

Image and video recognition is the most important area of DL research. The typical structure of DL in this area is the deep convolutional neural network (CNN). This structure can reduce the image size by convolving and pooling the image before putting the data into the fully connected neural network. In this area, there are numerous research branches, and many derivative applications are based on this fundamental research. For example, Ren et al. (2017) proposed a faster CNN for real-time object detection to significantly reduce the running time of the detection network.

2.2.2 Text analysis and natural language processing

With the development of social networking and the mobile Internet, massive data are created by human interaction. Text analysis and natural language processing are preconditions of on-the-fly translation and human-machine interaction with natural speech. Many related DL applications have been proposed. For instance, Manning et al. (2014) proposed a toolkit, named "Stanford CoreNLP," which is an extensible pipeline providing core natural language analysis.

2.2.3 Finance, economics, and market analysis

Stock trading and other market models require highly accurate market predictions. DL has been widely exploited as a powerful market prediction tool. For example, Korczak and Hernes (2017) proposed a financial time-series forecasting algorithm based on the CNN architecture. The forecasting error rate significantly decreased when tested on forex market data.

3 Artificial intelligence based cyber security

In this section, we review the traditional ML schemes against cyberspace attacks and various DL schemes. The implementation process, experimental results, and efficiency of different programs in combating cyberspace attacks are discussed.

3.1 Traditional machine learning schemes against cyberspace attacks

An ML solution consists of four main steps (Xin et al., 2018):
1. extract the features;
2. select the appropriate ML algorithm;
3. train the model and then select the model with the best performance by evaluating different algorithms and adjusting parameters;
4. classify or predict unknown data using the trained model.
Common ML solutions include k-nearest-neighbor (k-NN), support vector machine (SVM), decision tree, neural network, etc. Different kinds of algorithms solve different types of problems. It is necessary to select an appropriate algorithm according to specific industrial application scenarios.

3.1.1 k-nearest-neighbor-based cyber security

The premise of k-NN execution is that the data and labels of the training dataset should be known. Input the test data and then compare its characteristics with those of the training data.

Based on a multi-class k-NN classifier, Meng et al. (2015) developed a knowledge-based alert verification method to identify false alarms and non-critical alarms. Then, to filter out these unwanted alarms, they designed an intelligent alarm filter that consists of three major components: an alarm database, a rating measurement, and an alarm filter. They conducted experiments from different dimensions, and the experimental results indicated that the designed alarm filter can achieve a good filtering performance even with limited CPU usage.

3.1.2 Support vector machine based cyber security

The support vector machine (SVM) is a supervised learning algorithm that has superior performance, including support vector classification and support vector regression. The core idea of SVM is to separate the data by constructing an appropriate split plane. Fig. 1 shows a typical SVM realization. The optimal split plane is determined for classification of attacked/safe measurements.

Fig. 1 A typical SVM realization
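The split-plane idea described above can be sketched with a linear SVM on synthetic "attacked/safe" measurements. This is a toy illustration only: scikit-learn is assumed to be available, and the two traffic features below are invented for the example, not taken from the cited studies.

```python
# Sketch of the SVM idea above: learn a split plane that separates
# "safe" from "attacked" measurements. The two-feature data points
# are synthetic illustrations, not a real measurement set.
from sklearn.svm import LinearSVC

# Each row: (normalized packet rate, normalized error rate).
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.2],   # safe traffic
     [0.9, 0.8], [0.8, 0.9], [0.7, 0.8]]   # attacked traffic
y = [0, 0, 0, 1, 1, 1]                     # 0 = safe, 1 = attacked

clf = LinearSVC()   # finds a maximum-margin linear split plane
clf.fit(X, y)

# Classify two unseen measurements, one near each cluster.
print(clf.predict([[0.15, 0.15], [0.85, 0.85]]))  # → [0 1]
```

In practice, the feature vectors would come from labeled traffic records rather than hand-made points, and a kernel SVM would be used when no linear split plane exists.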
SVM has also been used in intrusion detection and analysis in some emerging networks. For example, in the software-defined network, the controller is vulnerable to DDoS, which leads to resource exhaustion. Kokila et al. (2014) used an SVM classifier to detect DDoS attacks in the software-defined network. They also carried out some experiments on the existing DARPA dataset and compared the performances of the SVM classifier and other techniques, showing that the designed SVM scheme produced a lower false positive rate (FPR) and higher classification accuracy. Nevertheless, SVM training requires more time, which is an obvious defect.

The cyberspace of different industrial applications presents different network characteristics, and thus the attack patterns suffered are also specific. For example, two-way communication and the distributed energy network that makes the grid intelligent are the main features of a smart grid. In the smart grid, malicious injection of erroneous data will have a catastrophic impact on decisions at various stages. Shahid et al. (2012) proposed two techniques for fault detection and classification in power transmission lines (TLs). Both approaches are based on one-class quarter-sphere support vector machines (QSSVMs). The first approach, called "temporal-attribute QSSVM (TA-QSSVM)," tries to determine the attribute correlations of data measured in a TL for fault detection, and the second approach, A-QSSVM, exploits attribute correlations for only fault classification. Convincing experiments showed that TA-QSSVM can obtain almost 100% fault-detection accuracy and A-QSSVM can achieve 99% fault-classification accuracy, which are remarkable results. In addition to accuracy, these approaches have lower computational complexity than multi-class SVM (from O(n^4) to O(n^2)), making them applicable to online detection and classification.

3.1.3 Decision tree based cyber security

The decision tree algorithm is a method to approximate the value of a discrete function. In essence, the decision tree mechanism is a process to classify data through a series of rules. Fig. 2 shows the decision tree construction process for malware detection. Malware can be classified based on a decision tree. The decision result is derived from specific characteristics through pre-defined decision rules.

Fig. 2 A decision tree construction for malware detection

Vuong et al. (2015) used a decision tree to generate simple detection rules that were used to defend against denial of service and command injection attacks on robotic vehicles. They considered cyber input features, such as network traffic and disk data, and physical input features, such as speed, power consumption, and jittering. Their experimental results showed that different attacks have different impacts on robot behaviors, including cyber and physical operations, and that the addition of physical input features could help the decision tree increase the overall accuracy of detection and reduce the false positive rate.

APT attacks employ social engineering methods to invade various systems, which brings serious social issues. Moon et al. (2017) designed a decision tree based intrusion detection system to detect APT attacks that might intelligently change their behavior after intrusion into a system. The intuitive idea was to analyze the behavior information through a decision tree. This system could also detect the possibility of the initial intrusion and reduce the hazard to a minimum by responding to APT attacks as soon as possible. The detection accuracy was 84.7% in their experiments; the accuracy was actually high considering the difficulty in detecting malware-related APT attacks.

3.1.4 Neural network based cyber security

Gao et al. (2010) developed an intrusion detection system based on a neural network to detect artifacts of command-and-response injection attacks by monitoring the physical behaviors of supervisory control and data acquisition (SCADA) systems. The experimental results showed that the neural network based IDS has an excellent performance in detecting man-in-the-middle response injection and DoS-based response injection, but it could not detect replay-based response injection attacks.

Vollmer and Manic (2009) proposed a computationally efficient neural network algorithm to provide an intrusion detection alert scheme for cyber security state awareness. The experimental results indicated that this enhanced version of the neural network algorithm reduced memory requirements by 70%, and reduced runtime from 37 s to 1 s.

3.2 Deep learning solutions for defending against cyberspace attacks

The DL method is very similar to the ML method. As mentioned earlier, the feature selection in DL is automatic rather than manual, and DL attempts to obtain deeper features from the given data. The current DL programs include the DBN, recurrent neural network (RNN), and CNN. In this section we describe the use of different types of deep neural networks to defend against several network attacks in different scenarios.

3.2.1 Deep belief network based attack defense

DBN is a probability generation model consisting of multiple restricted Boltzmann layers. Zhu et al. (2017) proposed a novel DL-based approach called "DeepFlow" to directly detect malware from the data flows in Android applications. This scheme is implemented based on DBN (Fig. 3). Based on the DeepFlow architecture, complex attack feature data can be analyzed. The DeepFlow architecture consists of three components: FlowDroid for feature extraction, SUSI for feature coarse-granularity, and the DBN DL model for classification. Two crawler modules can be used to crawl malware from malware sources and benign applications from the Google Play Store separately. The experimental results showed that DeepFlow outperforms traditional ML algorithms, such as Naïve Bayes, PART, logistic regression, SVM, and the multi-layer perceptron (MLP). Some new DL technologies can also be used (Ota et al., 2017; Li LZ et al., 2018a, 2018b).

Fig. 3 Deep belief network

Focusing on the problems in intrusion detection, such as redundant information, long training time, and the tendency to fall into a local optimum, Zhao et al. (2017) put forward a novel intrusion detection scheme combining DBN and a probabilistic neural network (PNN). In this method, the raw data were converted into low-dimensional data, and DBN (with nonlinear learning ability) extracted the essential characteristics from the original data. They used a particle swarm optimization algorithm to optimize the number of hidden-layer nodes per layer. Then they employed a PNN to classify the low-dimensional data. The performance evaluation using the "KDD CUP 1999" dataset indicated that this method performs better than traditional PNN, PCA-PNN, and raw DBN-PNN without optimization.

3.2.2 Recurrent neural network based attack detection

Unlike traditional feed-forward neural networks (FNNs), RNNs introduce directional loops that can handle contextual correlation among inputs to process sequence data.

To classify permission-based Android malware, Vinayakumar et al. (2018) used a long short-term memory recurrent neural network (LSTM-RNN) because LSTM can learn temporal behaviors through sparse representations of Android permission sequences. They also ran notable experiments for up to 1000 epochs with a learning rate from 0.01 to 0.50. The LSTM networks achieved the highest accuracy of 89.7% on the real-world Android malware test dataset.

Loukas et al. (2018) proposed a cloud-based cyber-physical intrusion detection scheme for the Internet of Vehicles (IoV) using a deep multilayer perceptron and an RNN. They pointed out that the RNN, with an LSTM hidden layer, proved very promising in learning the temporal context of various attacks, such as DoS, command injection, and malware. This work also revealed that detection latency, the key defect of DL-based schemes, is a result of the increased processing demands, which can be addressed by cloud-based computational offloading. They also carried out some experiments in a real cyber environment to verify their approach.

3.2.3 Convolutional neural network based attack detection

CNN is a kind of feed-forward neural network that includes a convolutional layer and a pooling layer. Artificial neurons can respond to surrounding elements.

Based on a CNN, Meng et al. (2017) proposed a novel model, named "malware classification based on static malware gene sequences (MCSMGS)," for malware classification. First, the scheme extracted the malware gene sequences of both informational and material attributes. Second, it tried to determine the representation of the correlation and similarity of each malware sample. Finally, to achieve accurate malware classification, a module named "static malware gene sequences−convolution neural network (SMGS-CNN)" was employed to analyze the extracted malware gene sequences. They claimed that the classification accuracy was up to 98% with the proposed scheme, and that it was more effective than the SVM model.

Chowdhury et al. (2017) presented an improved DL scheme based on CNN for intrusion detection. First, the scheme trained a convolutional neural network for intrusion detection. The second step was different from that of other CNN solutions: it extracted outputs from each layer in the CNN and implemented few-shot intrusion detection using a linear SVM and a 1-nearest-neighbor classifier. Few-shot learning is suitable for occasions where the training set for a certain class is small. Finally, they implemented the proposed scheme on two well-known public datasets: KDD99 and NSL-KDD. These two datasets are unbalanced, and some classes may have fewer training samples than others. The experimental results showed that the proposed scheme has a better performance than previous schemes on these two datasets.

3.2.4 Auto-encoder based solutions for threat detection

Some researchers have attempted to use DL to distribute attack detection in a fog computing environment. Abeshu and Chilamkurti (2018) proposed a novel distributed DL approach for cyberspace attack detection in fog-to-things computing. The model they adopted was a stacked auto-encoder for unsupervised DL. They trained a model with a mix of normal and attack samples from an unlabeled network, and the model identified patterns of attacks and normal data through a self-learning scheme. The experimental results showed that the proposed deep model performs better than shallow models in terms of the false alarm rate, accuracy, and scalability.

Aygün and Yavuz (2017) proposed two anomaly detection models employing an auto-encoder (AE) and a de-noising auto-encoder (DAE), respectively. They compared the performances of the deterministic AE and the stochastically improved DAE models based on the proposed stochastic anomaly threshold selection technique, indicating that each single model performs better than all previous non-hybrid anomaly detection approaches. In addition, they claimed that the performance of these two schemes could match that of some hybrid solutions, and that the proposed stochastic threshold selection method is a successful alternative to hybrid methods.

Zolotukhin et al. (2016) focused on the detection of DoS attacks in the application layer. Their scheme consists of analysis of communications between a web server and its clients, separation of these communications, and examination of communication distribution using a stacked auto-encoder and a class of DL algorithms. The scheme requires no decryption of the encrypted traffic, which obeys the ethical norms concerning privacy. The experimental results with the dataset from a realistic cyber environment suggested good detection of DoS-related attacks, which increased web service availability.

… rate of 97% when only 4.02% of the input features per sample were modified.

4.2 Defense methods against adversarial attacks

4.2.1 Modifying the training process and input data

The robustness of a deep network is improved by continuously inputting new types of adversarial samples and performing adversarial training. To ensure effectiveness, this method requires high-intensity adversarial samples, and the network architecture must be equipped with sufficient expressive power. This method is called "brute-force adversarial training," because it requires a large amount of training data. Goodfellow et al. (2015) and Cubuk et al. (2017) mentioned that this method could regularize the network to reduce overfitting. However, Moosavi-Dezfooli et al. (2017) pointed out that no matter how many adversarial samples are added, there are new adversarial samples that can deceive the network. Luo et al. (2015) proposed using the foveation mechanism to defend against the adversarial perturbations generated by L-BFGS and FGSM. The assumption of this proposal is that the image distribution is robust to translation variations, while the perturbations do not have this property. However, the universality of this method has not been proven. Xie et al. (2017) found that introducing random rescaling of training images can reduce the intensity of attacks.

4.2.2 Modifying the network

It has been observed that simply stacking denoising auto-encoders on the original network only makes it more vulnerable. Gu and Rigazio (2015) introduced deep contractive networks, in which a smoothness penalty term similar to that of contractive auto-encoders is used. Input gradient regularization can improve robustness against attacks (Ross and Doshi-Velez, 2017); this method has a good effect combined with brute-force adversarial training, but its computational complexity is very high. Some researchers have attempted to use biologically inspired solutions; for example, Nayebi and Ganguli (2017) attempted to defend against attacks using a nonlinear activation function similar to that of nonlinear dendrites in biological brains. In another work, the dense associative memory model is based on a similar mechanism (Krotov and Hopfield, 2018).

4.2.3 Using an additional network

The scheme of Akhtar et al. (2018) was a defense framework against adversarial attacks using universal perturbations (Moosavi-Dezfooli et al., 2017). The core idea of this scheme is to add a separately trained network to the original model to achieve a method that does not require adjustment factors and is immune to the samples. Lee et al. (2017) employed the popular generative adversarial networks (GANs) framework to train a deep network that is robust to attacks such as FGSM. Lyu et al. (2015) provided another defense scheme based on a GAN. The following are detection-only approaches. The feature squeezing methods (He et al., 2017; Xu et al., 2017) explore whether a sample is adversarial or not using two models. Subsequent work showed that this method could be bypassed by C&W attacks. Meng and Chen (2017) proposed a framework called "MagNet," which uses a classifier trained on manifold measurements to determine whether an input image is adversarial. In miscellaneous methods (Feinman et al., 2017; Gebhart and Schrater, 2017; Liang et al., 2017), the authors trained a model to treat all input images as noisy, first learning how to smooth the picture and then classifying it.

4.3 Construction of safe artificial intelligence systems

4.3.1 Safe distributed ML/DL systems

Shokri and Shmatikov (2015) first proposed the construction of privacy-preserving DL under a distributed training system (Fig. 5) that enables multiple parties to collaboratively learn an accurate neural network model without leaking their input datasets. The key innovation of this work is the selective sharing of deep neural network parameters during model training, which makes the scheme effective and robust because the training can be run asynchronously. The proposed system was evaluated in experiments on two datasets, MNIST and SVHN. The results suggested high classification accuracy on both datasets, even when the participants shared only 10% of their parameters. However, Phong et al. (2018) demonstrated that in the system of Shokri and Shmatikov (2015), gradients shared over the cloud server may be compromised, leading to local data leakage. To protect the gradients over the honest-but-curious server and ensure training accuracy, Phong et al. (2018) used additive homomorphic encryption to enable cipher computation across the gradients. The tradeoff of this scheme is the cost of increased communication overhead between the cloud server and DL participants.

released a novel and fundamental library to construct other kinds of classifiers, such as a multiplexer and a face detection classifier. The bottleneck of ML training on encrypted data lies in the accuracy of the classifier. It is difficult for an ML/DL algorithm to obtain high-dimensional statistical information from encrypted data, because ciphertext is the result of confusion, and its statistical information has been destroyed to a certain extent.
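The additive homomorphic aggregation discussed above can be sketched with textbook Paillier encryption, whose ciphertext product decrypts to the plaintext sum. This is a toy illustration, not the construction of Phong et al. (2018): the primes are insecurely small, and the fixed-point scale for fractional gradients is an assumption made for the example.

```python
# Toy additively homomorphic encryption (textbook Paillier): a server can
# add encrypted gradients without seeing them. Demo key sizes only; real
# keys need ~2048-bit moduli.
import math
import random

def keygen(p=10007, q=10009):            # insecurely small demo primes
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1
    # mu = (L(g^lam mod n^2))^-1 mod n, with L(x) = (x - 1) // n
    x = pow(g, lam, n * n)
    mu = pow((x - 1) // n, -1, n)
    return (n, g), (lam, mu)

def encrypt(pub, m):
    n, g = pub
    r = random.randrange(1, n)           # random blinding, coprime to n
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

SCALE = 1000                             # assumed fixed-point scale

pub, priv = keygen()
g1, g2 = 0.25, 0.5                       # two participants' gradient values
c1 = encrypt(pub, int(g1 * SCALE))
c2 = encrypt(pub, int(g2 * SCALE))
c_sum = (c1 * c2) % (pub[0] ** 2)        # ciphertext product = plaintext sum
print(decrypt(pub, priv, c_sum) / SCALE) # → 0.75
```

The server only ever multiplies ciphertexts, so it learns nothing about the individual gradients; the cost, as noted above, is the extra communication and computation for the enlarged ciphertexts.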