Abstract—Network Intrusion Detection Systems (NIDSs) are commonly used today to detect malicious activities. Quantum computers, despite not being practical yet, are becoming available for experimental purposes. We present the first approach for applying unsupervised Quantum Machine Learning (QML) in the context of network intrusion detection from the perspective of quantum information, based on the concept of quantum-assisted ML. We evaluate it using IBM QX in simulation mode and show that the accuracy of a Quantum-Assisted NIDS, based on our approach, can be high, rivaling the best conventional SVM results, with a dependence on the characteristics of the dataset.

978-1-7281-8326-8/20/$31.00 ©2020 IEEE

I. INTRODUCTION

The detection of security violations using machine learning (ML) has been extensively investigated [1]–[8]. Intrusion Detection Systems (IDSs) are tools used to detect such malicious activity. Network IDSs (NIDSs) in particular detect malicious activity in networks and are one of the best known contexts of ML application in the field [3]. IDSs and NIDSs can be classified as signature-based or anomaly-based [1]. A signature-based (N)IDS detects attacks by comparing the data flow under analysis to patterns stored in a signature database of known attacks. An anomaly-based (N)IDS typically detects anomalies using a model of normal behaviour of the monitored system, flagging behavior lying outside the model as anomalous or suspicious. Signature-based IDSs can detect well-known attacks with high accuracy but fail to detect unknown attacks, whereas anomaly-based IDSs have that capacity. In this paper we focus on anomaly-based NIDSs.

Quantum computers, and among them Noisy Intermediate-Scale Quantum (NISQ) computers, aim to leverage quantum physics to perform computational tasks beyond the capabilities of the most powerful classical computers, potentially achieving quantum supremacy [9]–[11]. As the number of qubits and the accuracy of quantum computers increase, the question of when they will exceed state-of-the-art classical computation draws a great deal of attention. A definitive proof of quantum dominance is predicted for the near future [9]–[11], and one argument has already been made that it has been accomplished [12], although it has been disputed by competitors [13]. A consequence of quantum supremacy is that quantum computers may outperform their classical counterparts quadratically in terms of learning efficiency, or even exponentially in performance [10]. This is a motivation towards investigating the potential of Quantum-Assisted Machine Learning (QAML) in the context of network intrusion detection.

In May 2016, IBM announced the first Near Term Quantum Computer, a NISQ, to be publicly available on the cloud [14]. The product is known as IBM Q Experience or simply IBM QX. Customers can program a universal quantum computer, simulate quantum operations via an interface, and perform the same quantum operations on a quantum computer. IBM QX has been used in many experiments, e.g., quantum oscillator synchronization [15], variational quantum algorithms [16], and quantum search [17], and optimization approaches to map common quantum circuit descriptions onto IBM QX have been proposed [18]. A 5-qubit quantum computer was initially offered and, since June 2017, also a 16-qubit quantum computer, respectively IBM QX2 and IBM QX3. Revised versions of these quantum computers, IBM QX4 and IBM QX5, have been available since September 2017. In 2019 and 2020, systems with 27 and 65 qubits were released. In order to use them, the desired quantum functionality, e.g., a quantum circuit, has to be properly mapped so that the underlying physical constraints are satisfied. Recently, the plan to reach 100 qubits by 2023 was announced.

We aim to use a hybrid classical/quantum approach where the data and feature learning are classical, whereas the classification algorithm is quantum. In this approach, classical data has to be converted into quantum data. This approach allows the implementation of quantum algorithms on the quantum computers available today, e.g., NISQs like IBM QX.

The IBM Quantum Experience (IBM Q) is accessible online, giving users in the general public access to a set of IBM's prototype quantum processors. It is an example of cloud-based quantum computing. As of May 2018, there were three processors on the IBM Quantum Experience: two 5-qubit processors and a 16-qubit processor. This service can be used to run algorithms and experiments, and to explore tutorials and simulations around what might be possible with quantum computing.

The first part of the processing relates to the concept of feature learning. Principal component analysis (PCA) is arguably the most used technique for dimensionality reduction, so it has already been used in the context of intrusion detection [19], [20]. Autoencoders are a more general technique that, however, can also be used for dimensionality reduction, with several benefits:

Autoencoders can be trained using a variety of stochastic optimization methods that have been developed for training deep neural networks, becoming highly efficient for large datasets, in particular with statistical signatures that change
over time [21]. Autoencoders can handle high-dimensional training data, such as images, and are suitable for online learning contexts in which new data arrives continuously, without the need for computationally expensive matrix operations as in PCA.

In summary, the advantages of using an autoencoder over PCA are: autoencoders can process high-dimensional data; autoencoders can process large datasets; and autoencoders are capable of online processing, handling new data as it arrives.

We use a quantum version of SVM, the Quantum Support Vector Machine (QSVM), that uses a quantum kernel estimator and optimizer [22], [23].

In this work we aim at answering the following questions:

1) How effectively can autoencoders, with known NIDS datasets, produce an encoding with probability distribution functions (PDFs) – in other terms, a latent space – related to the input variables, so that the encoding can be processed by other ML algorithms in an unsupervised fashion?
2) Can this result be used to interface with quantum computation systems, in line with a quantum-enhanced approach for network intrusion detection?

The contributions of this work are: (1) a study of autoencoders in the scope of unsupervised learning; (2) a study on quantum unsupervised ML, interfacing a dimensionality reduction technique based on autoencoders with quantum processors. We evaluate our approach using IBM QX and two well-established network intrusion detection datasets (UNB NSL-KDD and UNSW NB15).

II. BACKGROUND

A. Support Vector Machines

The SVM classification algorithm can classify data into two sub-groups. Let us assume there are M training data points $\vec{x}_i$, $i = 1, \ldots, M$, and each data point has a label +1 or −1. Let us also assume that the number of dimensions of the feature space is N. In this case, the training data can be written as $\{(\vec{x}_i, y_i) : \vec{x}_i \in \mathbb{R}^N, y_i = \pm 1\}$.

Fig. 1. SVM kernel trick illustrated with samples from two classes, here represented by black and white dots.

The frontier between the two classes can be represented by the distance between two support hyperplanes, $\vec{w} \cdot \vec{x} + b = 1$ and $\vec{w} \cdot \vec{x} + b = -1$. With the expressions of the support hyperplanes defined, the distance between them can be represented by $\frac{2}{\|\vec{w}\|}$. Thus, SVM aims at maximizing the margin $\frac{2}{\|\vec{w}\|}$, which is the same as minimizing $\|\vec{w}\|^2$. Originally, SVM learning algorithms could only solve linearly separable problems. To address this limitation, the kernel trick was introduced. The mapping is characterized by the choice of a class of functions known as kernels. Kernels use specific functions φ(x) to map the data into a higher-dimensional space, see Figure 1. This procedure is justified by Cover's theorem [24], which guarantees that any data set becomes (linearly) separable as the data dimension grows. In this situation SVM finds the hyperplane in the new space that maximizes the margin and minimizes misclassifications. A classifier with a large margin is prone to be more accurate.

B. Quantum-Assisted Machine Learning

Quantum computing can help speed up some types of problems. It may be able to find new, easier ways to achieve economic optimisation. Nevertheless, as stated above, the existing quantum computers are very small. We therefore plan to use QAML, a hybrid classical/quantum solution where data and feature learning are classical, while the classification algorithm is quantum.

An example of a classical/quantum hybrid algorithm that relies on quantum circuits believed to be hard to simulate efficiently by classical means was provided in [23]. In this work, the authors define and apply two binary classification approaches utilizing supervised learning. Such classification algorithms are related to regular SVMs. The aim in this research is to introduce a non-linear feature map that takes the data to be classified into a space where it can be linearly separated.

The method used in this work is the second method explained in [23], which directly exploits this connection to classical SVMs. Here only the overlap of the feature map circuits is estimated for each pair in the training set. That is, the quantum computer is only used once to compute the kernel matrix for the training set. The optimal separating hyperplane and the support vectors can be obtained efficiently in the training set size by solving the conventional dual optimization problem on a classical computer. This problem is concave and therefore efficiently solvable on a classical computer. Once the support vectors have been determined, the quantum computer needs to be queried again for classification, when the kernel for a new datum is estimated. The quantum processor is thus used twice: once during training and then again in the classification phase, in both cases to estimate the quantum kernel.

A. Quantum Encoding Preliminary Concepts

Suppose we have a dataset D of N instances:

$$D = \{x_1, \ldots, x_i, \ldots, x_N\} \quad (1)$$

where each $x_i$ is a real number. In basis encoding, each instance $x_i$ is encoded as $|b_i\rangle$, where $b_i$ is the binary representation of $x_i$. The dataset D can then be represented as a superposition of all computational basis states:

$$|\psi\rangle = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} |b_i\rangle \quad (2)$$

In amplitude encoding, data is encoded in the amplitudes of quantum states. The above dataset D can be encoded as:

$$|\psi\rangle = \frac{\sum_{i=1}^{N} x_i |i\rangle}{\|D\|} \quad (3)$$
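As a concrete illustration of the two encodings in Eqs. (2) and (3), the following minimal Python sketch builds both state vectors with NumPy. The toy dataset is a hypothetical example, not the NIDS data used later in the paper:

```python
import numpy as np

# Hypothetical toy dataset; the values are illustrative only.
D = np.array([0.5, 1.0, 2.0, 2.5])

# Amplitude encoding (Eq. 3): the normalized data vector becomes the
# amplitude vector of the state, so the squared amplitudes sum to 1.
psi_amp = D / np.linalg.norm(D)

# Basis encoding (Eq. 2): each integer datum b_i selects the basis state
# |b_i>, and the dataset becomes a uniform superposition over them.
ints = [1, 2, 3]          # integer-valued data, all distinct
n_qubits = 2              # enough qubits to represent the values 0..3
psi_basis = np.zeros(2 ** n_qubits)
for b in ints:
    psi_basis[b] = 1.0 / np.sqrt(len(ints))

print(psi_amp, float(np.sum(psi_amp ** 2)))   # normalized amplitudes
print(psi_basis)                              # superposition of |1>,|2>,|3>
```

Note that amplitude encoding packs N real values into only ⌈log₂ N⌉ qubits, at the cost of a state-preparation circuit whose depth grows with N.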
The encoding scheme proposed consists of a layer of Hadamard gates on each qubit, followed by an operator, $U_{\Phi(\vec{x})}$, then another layer of Hadamards, and $U_{\Phi(\vec{x})}$ again. The $U_{\Phi(\vec{x})}$ gate consists of Φ() gates, which can operate on one or two qubits. The single-qubit Φ operation is just a Z rotation by $x_i$ on the ith qubit.

Fig. 2. Quantum Kernel Functions. On the left: a classical dataset mapped into the Bloch sphere. J(Ω) returns binary labels (−1, 1) and can be mapped onto the Bloch sphere by using a non-linear feature map: for a single qubit, $U_{\Phi(x)} = Z_x$ is a phase gate of angle x ∈ Ω. The mapped data can be separated by the hyperplane given by normal $\vec{w}$. On the right: for the general circuit, $U_{\Phi(\vec{x})}$ is formed by products of single- and two-qubit unitaries that are diagonal in the computational basis. The circuit family depends non-linearly on the data through the coefficients $\phi_S(\vec{x})$ with |S| ≤ 2. The experimental implementation of the parameterized diagonal two-qubit operations is done using CNOTs and Z-gates [23].

Quantum Support Vector Machines (QSVMs) use a quantum kernel estimator to estimate the kernel function and optimize the classifier directly. QSVMs are an interesting application for NISQ computers [22], [23]. The scheme we are proposing deals with the original problem of supervised learning: the creation of an SVM classifier, and is fully explained in [23]. For this problem, we are given data from a training set T and a test set S, both subsets of Ω ⊂ $\mathbb{R}^d$. Both are assumed to be labeled by a map m : T ∪ S → {+1, −1} unknown to the algorithm. The algorithm only receives the labels of the training data T. The aim is to predict an estimated map on the test set, m̃ : S → {+1, −1}, in such a way that it matches the true map, $m(\vec{s}) = \tilde{m}(\vec{s})$, on the test data $\vec{s} \in S$ with high probability. By means of SVMs, the data is projected non-linearly to a high-dimensional space, a feature space, where a hyperplane is configured to separate the labeled samples. The quantum variant of this method, QSVM, has already been proposed by Rebentrost et al. [25]. Their scheme makes it possible to achieve speedups if data is given in a coherent superposition.

A feature map on n qubits is generated by the unitary $U_\Phi(\vec{x}) = U_{\Phi(\vec{x})} H^{\otimes n} U_{\Phi(\vec{x})} H^{\otimes n}$, where H denotes the conventional Hadamard gate and

$$U_{\Phi(\vec{x})} = \exp\left(i \sum_{S \subseteq [n]} \phi_S(\vec{x}) \prod_{i \in S} Z_i\right), \quad (4)$$

is a diagonal gate in the Pauli Z basis (Figure 2(b)). This circuit acts on the initial state $|0\rangle^n$. The coefficients $\phi_S(\vec{x}) \in \mathbb{R}$ encode the data $\vec{x} \in \Omega$. In general, any diagonal $U_{\Phi(\vec{x})}$ can be used if it can be implemented efficiently. This is the case, for example, when only interactions with |S| ≤ 2 are considered. The exact evaluation of the inner product between two states created from a similar circuit with only one diagonal layer $U_{\Phi(\vec{x})}$ is #P-hard [26]. Nevertheless, in the experimentally relevant context of estimation up to an additive error, the simulation of a single-layer preparation circuit can be effectively performed by uniform sampling [27].

Every n-qubit device operation can be approximated to an appropriate level of accuracy by the gates of a universal gate set. One nearly universal set contains only two types of gates: the Toffoli gate and the Hadamard gate [28], [29]. The Toffoli gate is common in (reversible) classical computing and has a totally classical character in quantum circuits, as it preserves the computational basis. This seems to suggest that the quantum benefit is provided by gates which do not preserve the computational basis, i.e., the Hadamard gates, since circuits built only from basis-preserving gates could likely be simulated on a purely classical computer.

B. Quantum kernel estimation

The protocol chosen implements the SVM directly. A classical SVM is used for classification. The quantum computer is used twice in this protocol:

• First, the kernel $K(\vec{x}_i, \vec{x}_j)$ is estimated on a quantum computer for all pairs of training data $\vec{x}_i, \vec{x}_j \in T$. Here it will be convenient to write $T = \{\vec{x}_1, \ldots, \vec{x}_t\}$ with t = |T|; also let $y_i = m(\vec{x}_i)$ be the corresponding label. The optimization problem for the optimal SVM can be formulated in terms of a dual quadratic program that only uses access to the kernel:

$$L_D(\alpha) = \sum_{i=1}^{t} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{t} y_i y_j \alpha_i \alpha_j K(\vec{x}_i, \vec{x}_j), \quad (5)$$

subject to $\sum_{i=1}^{t} \alpha_i y_i = 0$ and $\alpha_i \geq 0$ for each i. This problem is concave whenever $K(\vec{x}_i, \vec{x}_j)$ is a positive definite matrix. The solution to this problem will be given by a nonnegative vector $\vec{\alpha} = (\alpha_1, \ldots, \alpha_t)$.

• The quantum computer is used a second time to estimate the kernel for a new datum $\vec{s} \in S$ with all the support
vectors. The optimal solution $\vec{\alpha}^*$ is used to construct the classifier

$$\tilde{m}(\vec{s}) = \text{sign}\left(\sum_{i=1}^{t} y_i \alpha_i^* K(\vec{x}_i, \vec{s}) + b\right). \quad (6)$$

IV. DESCRIPTION OF THE EXPERIMENT

This section presents the full quantum-enhanced intrusion detection scheme. The scheme can be divided in the following phases:

1) Obtain the input data, i.e., network flows corresponding to a period of time. The NIDS can be configured to detect intrusions by inspecting the last period of time T. For example, T can be set to 10 minutes, 30 minutes, or 1 hour.
2) Preprocess the data, e.g., do feature normalization.
3) Encode the data. This phase does dimensionality reduction, i.e., summarizes the input data to a size that is appropriate in terms of dimension to be input to the QSVM() function. As explained before, this phase processes the data with an autoencoder and obtains its latent space.
4) Normalize the data returned by the autoencoder, preparing it for the next phase.
5) Quantum processing, done by a quantum computer, in order to detect intrusions in the input data. As explained, this phase involves running a QSVM() algorithm to classify flows as normal or anomalous.
6) Repeat for the next period of time or data sample, in parallel. Notice that the period may overlap the previous one if that is considered useful for some reason.

The detection process has to be preceded by a training process. Training involves the same phases 1 to 5 above, except that the QSVM in phase 5 is executed in training mode.

V. EXPERIMENTAL EVALUATION

An important component of a NIDS evaluation is a good benchmarking dataset. We use two recent datasets that have been designed specifically and carefully to evaluate NIDSs. They contain a large number of attacks and are labelled with information about which network flows are normal traffic and which are malicious. We designate them NSL-KDD [30] and NB15 [5], [6].

A. Data encoding and preparation

The phases presented in Section IV were executed the following way:

1) Obtaining the input data: the flows were extracted from the NSL-KDD and NB15 datasets:
   • NSL-KDD: a sample of 100 data samples for training and 50 data samples for testing was extracted from the KDDTrain+.arff file.
   • NB15: a sample of 150 data samples for training and 50 data samples for testing was extracted from the NB15 train file.
2) Preprocessing of the data: the non-numeric features were encoded into numeric values, then normalized into the interval [0, 1].
3) Encoding of the data: we considered an implementation based on the autoencode() function provided by the Ruta package [31], [32] for the R statistical computing environment [33].
   • NSL-KDD: for the NSL-KDD dataset, autoencode() was tested with four activation functions – Linear, Hyperbolic Tangent, ReLU and Sigmoid – in order to understand the performance in the four cases.
   • NB15: for the NB15 dataset, autoencode() was tested with a single activation function, sigmoid, as there was no point in repeating the previous evaluation.
4) Normalization: the data returned by autoencode() was normalized to the range [−1, 1].
5) Quantum processing: the resulting dataset was processed by the QSVM() function1 of the Qiskit quantum computing framework. This function runs the binary classification at the quantum simulation level.

For comparison purposes, a classical SVM algorithm was run over the entire datasets after being processed by the autoencoder. Specifically, we used liquidSVM, a package that implements SVMs and configures hyper-parameters automatically using k-fold cross-validation.2

1 Qiskit Aqua algorithms on Qiskit
2 liquidSVM on GitHub

B. Experimental process for the Quantum component

1) For each data point, the process of encoding data involves a series of gates that transform the initial state, $|0\rangle$, into some state $|\Phi(\vec{x})\rangle$ that depends on $\vec{x}$. One such method is amplitude encoding, in which the data is encoded into the basis state amplitudes of the state:

$$\begin{pmatrix} a_1 \\ \vdots \\ a_{2^n} \end{pmatrix} \leftrightarrow \begin{pmatrix} x_1 \\ \vdots \\ x_{2^n} \end{pmatrix}, \qquad \sum_i |x_i|^2 = 1, \; x_i \in \mathbb{R}$$

2) According to [23], [34], classifiers based on quantum circuits should only provide a quantum advantage if the kernel used in the quantum space is difficult to estimate classically: $K(\vec{x}, \vec{z}) = |\langle \Phi(\vec{x}) | \Phi(\vec{z}) \rangle|^2$.
3) Inference: inference in quantum machine learning amounts to finding the unitary operation U such that the loss is minimized when measuring $U|\Phi(\vec{x})\rangle$, where U is a unitary matrix with complex elements and n is the number of qubits. However, in practice, such a general matrix cannot be directly trained, since a real quantum computer has to employ combinations of a limited set of gates rather than arbitrary unitary operations. What is used are layers of rotation gates separated by entangling layers composed of controlled Pauli gates [35].
4) Measurement: after all the operations in the circuit have been performed, an observable is measured, e.g., $f = Z_1 Z_2$ as the parity function together with a random unitary $V \in SU(4)$. Each measurement provides one of a number of discrete values. To calculate the expected value of the observable, measurements are made repeatedly in order to obtain the mean. The end result is that the expected value of a measurement from the entire circuit is $\langle \Phi(\vec{x})| V^\dagger f V |\Phi(\vec{x})\rangle$, where f represents the observable measured. The measurement is then used to produce a label and a value for the loss, which is used to optimize the parameters used in the inference step.
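The two-stage kernel protocol described in Section B can be sketched end-to-end in Python. Here the single-qubit phase feature map from Fig. 2 (left) stands in for the quantum processor, and the kernel is evaluated exactly by statevector simulation, where a real device would estimate it from repeated measurements. The data, labels, and the use of scikit-learn's precomputed-kernel SVM are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np
from sklearn.svm import SVC

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

def feature_state(x):
    # Single-qubit feature map in the style of Fig. 2 (left):
    # |Phi(x)> = U_Phi(x) H U_Phi(x) H |0>, with U_Phi(x) = diag(1, e^{ix}).
    U = np.diag([1.0, np.exp(1j * x)])
    return U @ H @ U @ H @ np.array([1.0, 0.0], dtype=complex)

def quantum_kernel(a, b):
    # K(a, b) = |<Phi(a)|Phi(b)>|^2, computed exactly here by simulation.
    return abs(np.vdot(feature_state(a), feature_state(b))) ** 2

def kernel_matrix(A, B):
    return np.array([[quantum_kernel(a, b) for b in B] for a in A])

# Hypothetical 1-D training data with labels in {+1, -1}.
X_train = np.array([0.1, 0.3, 2.6, 2.9])
y_train = np.array([1, 1, -1, -1])
X_test = np.array([0.2, 2.8])

# Stage 1: the (simulated) quantum computer supplies the training kernel
# matrix; the dual problem of Eq. (5) is solved on a classical computer.
clf = SVC(kernel="precomputed")
clf.fit(kernel_matrix(X_train, X_train), y_train)

# Stage 2: the quantum kernel between each new datum and the training
# points is estimated again, and the classifier of Eq. (6) is applied.
pred = clf.predict(kernel_matrix(X_test, X_train))
print(pred)
```

On a real device, `quantum_kernel` would be replaced by an estimate obtained from repeated measurements of the overlap circuit, with a sampling error that shrinks as the number of shots grows.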
Fig. 4. Distribution at the encoding layer for the NSL-KDD test dataset. As the statistical distribution (SD) does not match the SD of the training set, the data from the test datasets were not used for testing. Data obtained from the autoencode() Ruta package function applied to the NSL-KDD dataset. Two cases: (a) NSL-KDD test dataset processed with sigmoid activation, training loss: 9.3377; (b) NSL-KDD test dataset processed with linear activation, training loss: 9.4114.
Fig. 9. Kernel matrices for several depth values. Ideal kernel matrices containing the inner products of all data points used for training with the NSL-KDD dataset. Data obtained with a Qiskit simulation. (a) Kernel matrix for depth=2, Accuracy=0.8; (b) Kernel matrix for depth=3, Accuracy=0.92; (c) Kernel matrix for depth=4, Accuracy=0.92; (d) Kernel matrix for depth=5, Accuracy=0.92. The correlation between the distinguishability of the kernel space and the accuracy values obtained is clearly visible.
Fig. 10. Kernel matrices for several depth values. Ideal kernel matrices containing the inner products of all data points used for training with the NB15 dataset. Data obtained with a Qiskit simulation. (a) Kernel matrix for depth=2, Accuracy=0.60; (b) Kernel matrix for depth=3, Accuracy=0.64; (c) Kernel matrix for depth=4, Accuracy=0.64; (d) Kernel matrix for depth=5, Accuracy=0.64. The correlation between the distinguishability of the kernel space and the accuracy values obtained is clearly visible.
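The depth dependence visible in Figs. 9 and 10 can be explored with a small, purely illustrative simulation: repeating the feature-map layer changes how spread out the off-diagonal kernel values are, i.e., how distinguishable the data points become in the kernel space. This sketch uses a hypothetical 1-D dataset and a single-qubit map, not the paper's multi-qubit circuits:

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

def feature_state(x, depth):
    # Single-qubit phase feature map with `depth` repetitions of the
    # H-then-U_Phi(x) layer, where U_Phi(x) = diag(1, e^{ix}).
    U = np.diag([1.0, np.exp(1j * x)])
    psi = np.array([1.0, 0.0], dtype=complex)
    for _ in range(depth):
        psi = U @ (H @ psi)
    return psi

def kernel_matrix(X, depth):
    states = [feature_state(x, depth) for x in X]
    return np.array([[abs(np.vdot(a, b)) ** 2 for b in states]
                     for a in states])

X = np.linspace(0.0, 3.0, 8)   # hypothetical 1-D data points
for depth in (2, 3, 4, 5):
    K = kernel_matrix(X, depth)
    off_diag = K[~np.eye(len(X), dtype=bool)]
    # The spread of the off-diagonal entries is a crude proxy for the
    # distinguishability of the kernel space discussed in the captions.
    print(depth, round(float(off_diag.std()), 3))
```

The kernel matrices themselves could be rendered as heat maps, in the manner of Figs. 9 and 10, to compare the block structure across depths.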