Article history: Received 24 December 2019; Revised 1 August 2020; Accepted 19 September 2020; Available online 24 September 2020.

Keywords: Unsupervised intrusion detection; Unsupervised feature selection; Cluster center initialization; Clustering; Mobile ad-hoc network; Wormhole attack.

Abstract

The massive growth of data in networks leads to attacks or intrusions. An intrusion detection system detects intrusions in high-volume datasets but increases complexity. A network generates a large amount of unlabeled data that is free from labeling costs. Unsupervised feature selection handles these data and reduces computational complexity. In this paper, we propose a clustering method based on unsupervised feature selection and cluster center initialization for intrusion detection. The method computes initial centers using sets of semi-identical instances, which indicate dense data space and avoid outliers as initial cluster centers. A spatial distance between data points and cluster centers creates micro-clusters. Similar micro-clusters merge into a cluster of arbitrary shape. The proposed cluster center initialization based clustering method performs better than basic clustering: it takes fewer iterations to form final clusters and provides better accuracy. We simulated a wormhole attack and generated the Wormhole dataset for a mobile ad-hoc network in NS-3. The micro-clustering methods were executed on different network datasets (KDD, CICIDS2017, and the Wormhole dataset) and outperformed on new attacks or attacks with few samples. Experimental results confirm that the proposed method is suitable for LANs and mobile ad-hoc networks, varying data density, and large datasets.
Misuse and anomaly detection systems can combine to form a hybrid detection system. A misuse detection system detects only those signatures present in its database, whereas an anomaly detection system computes the deviation of behaviors from a baseline profile (Mishra et al., 2018). The main aim of an unsupervised IDS is to increase the Detection Rate (DR) and reduce the False Alarm Rate (FAR). When the IDS detects an intrusion in communication, it generates an alarm and sends it to the network administrator to prevent the intrusion. For maintaining network security, the main problem of machine learning is computational complexity as the dimension and size of the dataset grow.

The feature selection approach reduces dimension and size by selecting a subset of essential features. It can be applied in both modes of training, supervised and unsupervised, according to whether label information is known or unknown (Xie et al., 2018). We apply Unsupervised Feature Selection (UFS) (Zhu et al., 2017) when class labels of data are unavailable. UFS solves noticeable problems of clustering methods: it reduces the additional computational cost and increases the performance of the system. Redundant or irrelevant features may lead to an overfitting problem, which degrades system performance. Here, the main objective of the UFS method is to keep the training model as small as possible by removing insignificant features or those with a negligible effect (Wang et al., 2019). The best subset of features contains a minimal number of features while contributing as much as possible to system accuracy. This is a fundamental mechanism for processing the dataset and provides a suitable approach to dimensionality reduction. It can be employed in many applications such as text, image, video, and gene analysis. It can also extensively enhance the interpretability of machine learning algorithms, and an algorithm integrated with feature selection often presents better generalization. The proposed UFS method finds a suitable subset of predictive features.

Many machine learning techniques are used for clustering, which indicates the significance of UFS. A clustering method partitions data into clusters over many iterations (Gong et al., 2018). Several difficulties have been reported in the literature: (1) the clustering method assumes that the number of clusters is known by users, which is not true in practice (Masud et al., 2018); (2) an iterative method is sensitive to initial cluster centers, especially the K-means algorithm; (3) the K-means algorithm can get stuck in local minima. The main reason for such difficulties of iterative clustering is random center initialization, and these problems are solved using Cluster Center Initialization (CCI) based clustering. The selection of initial cluster centers is essential, as it has a major role in the formation of the final clusters (Huang et al., 2017; Peng and Liu, 2018).

Naturally, clusters have non-convex (arbitrary) shapes (Hyde et al., 2017), whereas distance-based methods always find convex cluster shapes. The proposed method performs arbitrary-shape clustering, which differs from density-based clustering. It identifies the number of micro-clusters and works for varying-density data. We executed the proposed method on extremely imbalanced datasets (KDD and CICIDS2017), where the KDD test set includes many attacks as unknown attacks. The experimental results show performance in increasing order as K-means, micro-clustering (New K-means), and CCI based micro-clustering (proposed). This work measures different characteristics (parameters) and executes on different network datasets, which shows the reliability of the system. The KDD dataset was generated through testbeds of an Air Force environment Local Area Network (LAN), the CICIDS2017 is a recently generated dataset, and the Wormhole dataset is simulated in a Mobile Ad-hoc NETwork (MANET) environment. Moreover, the proposed method performs better than existing unsupervised intrusion detection mechanisms. The main contributions are enumerated below.

(i) Feature selection is an important technique to select the best subset of features and produce better results (Ambusaidi et al., 2016). We have proposed a UFS method that selects more informative features. It also reduces the dimension and size of the dataset.

(ii) A basic clustering algorithm randomly selects initial centers and iteratively computes centroids (Ma et al., 2015). We have proposed a CCI based clustering algorithm that resolves these notable problems. The method computes semi-identical instances, which gives many sets; the mean of the data points present in a set serves as an initial centroid. The advantages of cluster center initialization over the basic approach are: (1) it avoids outliers as initial centers; (2) a new initial cluster center is the mean of the data points present in the set, which indicates dense data space; (3) it reduces the iterations of clustering.

(iii) We have applied K-means clustering to group similar data (Zhao et al., 2018). In the proposed method, K is the number of micro-clusters, which is greater than c. Finally, it merges the micro-clusters into c clusters. Experimental results show a better detection rate for new attacks or attacks with few samples. The CCI based clustering method provides better accuracy and execution time than existing clustering mechanisms.

The rest of the paper is organized as follows: Section 2 reviews related literature, and Section 3 introduces the importance of an unsupervised intrusion detection system for mobile ad-hoc networks. Section 4 provides a detailed analysis of the KDD, CICIDS2017, and Wormhole datasets. The successive Section 5 describes the proposed method. Experiments are conducted and the performance of the proposed method is examined in Section 6. Finally, we conclude the paper with future directions in Section 7.

2. Related work

High dimensional datasets contain more information but also introduce redundancies and noise, which increases the complexity of clustering algorithms. There are many feature selection methods, but most of them operate under the supervised mode of training. Unsupervised feature selection is a difficult task due to unknown labels. Recently, some works have shown the significance of UFS, such as scoring of features (Wang et al., 2019), feature similarity, clustering based non-negative matrix factorization, clustering based co-regularization, locality of data, maximization of the distance between different clusters, relevance feature representation with spatial geometrical structure of data,
joint sparse matrices, non-negative spectral analysis, low rank structure preserving, locality structure preserving, matrix factorization, inner product regularization, self-expressing models, and relationships among features. The proposed UFS method is based on the occurrence of objects, and it easily computes feature scores.

It is extremely important to adopt the best initial cluster centers in an iterative clustering algorithm, as they have a direct effect on the formation of clusters. One of the problems in clustering is identifying the number of clusters for CCI. Such problems of the clustering technique have been highlighted in much recent research. The main approaches include average density-based initial cluster centers (Peng and Liu, 2018), initial centers based on dense and sparse neighbors (Huang et al., 2017), self-expressing the number of clusters and identifying initial centers from dense regions (Masud et al., 2018), and locating initial cluster centers in the dense region of the dataset by representing the data points in a KD-tree (Kumar and Reddy, 2017). However, computational complexity is the main drawback of the KD-tree for high dimensional datasets. The KDDcup99 (KDD'99) dataset has high volume, dynamic nature, and imbalanced distributions of categories. Due to such nature, there is still no CCI method and no UFS method executed on this dataset.

Kang and Kim (2016) proposed combinatorial optimization based optimal feature subset selection. The number of possible feature subsets from given n features is (2^n − 1); their approach reached (2^41 − 1) complexity. They selected 20 features, which showed a better DR but a poor FAR. Their method took more time to provide optimal subsets of features. Yin et al. (2017) proposed a stream clustering algorithm for anomaly detection. They completed execution in 10 rounds and reported the system accuracy of each round. The best result showed a DR of 99% and a FAR of 9.17%, while the averages were a DR of 91% and a FAR of 13.61%.

Al-Yaseen et al. (2017) preprocessed the training dataset into five sub-datasets (Normal-DS, DoS-DS, U2R-DS, Probe-DS, and R2L-DS) using the supervised mode of training, and reduced the training set. The SVM took more training time, hence their approach builds a small training dataset. They then applied modified K-means and basic K-means based on a multilevel hybrid SVM and ELM with K=55. We have proposed a UFS and CCI based clustering method executed on an unlabeled training set, which differs from their existing method. They reported that their method performs better for the DR of the Normal and Probe categories, while R2L reached 31.39% and U2R 21.93%. Roshan et al. (2018) proposed an adaptive design of IDS based on ELM. A comparison of their method obtained results in both modes of training, shown for many cases, and the method achieved better performance. They reported that their method achieved a best DR of 84% and a best FAR of 3.02%.

Choi et al. (2019) developed a network IDS using an unsupervised learning method. The autoencoder model aims to reconstruct its input vectors through ANN-based learning; it learns to reconstruct input data consisting of only normal data. They tested the performance on training data containing normal (99%) and abnormal (1%) samples, and reported a best accuracy of 91.70% and a DR of 84.68%. Meira et al. (2019) proposed anomaly detection using unsupervised learning techniques and explored six techniques as one-class classification: K-means, Autoencoder, Isolation Forest, Nearest Neighbor, Support Vector Machine (SVM), and Scaled Convex Hull (SCH). They discretized continuous feature data using the Equal Frequency (EF) technique and normalized data using the Zscore and MinMax techniques. Finally, they reported limited overall performance. We have compared the proposed method to their methods and to other unsupervised methods such as K-means (where K=5) and New K-means (where K=79, merged into 5), and it achieves better performance.

Table 1 shows the comparison of IDSs that were evaluated on the KDD'99 dataset and trained in both modes of training. There is still no unique method that provides a UFS and CCI based clustering method. The proposed work comprises the novel approach of UFS and CCI, which obtains fast execution of clustering applications and provides better results. It indicates that the clustering method with initial centers is suitable for the detection system. Additionally, the feature selection method reduces the computational complexities of the system.

3. Mobile ad-hoc network security mechanism

A MANET is an infrastructure-less, temporary, self-organized, and dynamic network that transfers data packets to the destination node through multihop communication. It is more vulnerable due to wireless communication and its dynamic nature (Bouhaddi et al., 2018). This network is mostly applied for military services on the battlefield and in search operations, where information security is the primary issue. An attacker can be internal or external, spreading malicious information and harming network resources. There are many types of attack methods in temporary networks, and each has a unique attacking nature.

A wormhole attack is a dominant network security threat that is not immune to cryptography and traditional security mechanisms. An attacker node attracts request packets by advertising itself as part of the path to the destination with a minimum hop count (false information) and tunnels them to the other end (Tiruvakadu and Pallapa, 2018). Malicious nodes are more active, strategically maintaining locations and generating huge signal volumes. Two malicious nodes create a wormhole tunnel when they are in radio range and transmit packets without addressing themselves. When the other malicious node is not in radio range, the node behaves as a blackhole attacker. These malicious nodes also conceal their history and information. The proposed method confirms that machine learning techniques effectively detect malicious data forwarded through malicious nodes.

The wireless network is more versatile than the wired network. Sample labeling is one of the main stages of data preprocessing for supervised learning, where each mistake can negatively affect the dataset and the overall performance of the predictive model. It becomes more difficult to label a sample (or packet information) as normal or malicious in dynamic networks when attackers constantly change their nature. Some attacking methods change their identity, which also makes the labeling process difficult.
Table 1 – Comparison of IDSs evaluated on the KDD'99 dataset.

Work | Technique | KDD'99 | Mode of training | UFS | CCI
Lin et al. (2015) | Clustering, k-NN, SVM | yes | Supervised, Unsupervised | no* | no
Iglesias and Zseby (2015) | DTC, Bayes, kNN, ANN, SVM | yes | Supervised | no* | n/a
Kang and Kim (2016) | Combinatorial optimization, MLP | yes | Supervised | no* | n/a
Yin et al. (2017) | Stream clustering | yes | Unsupervised | no | no
Bostani and Sheikhan (2017) | MI-BGSA, SVM | yes | Supervised | no* | n/a
Al-Yaseen et al. (2017) | Clustering, ELM, SVM | yes | Supervised, Unsupervised | no | no
Roshan et al. (2018) | Adaptive clustering, ELM | yes | Supervised, Unsupervised | no | no
Al-Jarrah et al. (2018) | AdaBoost, Bagging, RF | yes | Semi-supervised | n/a | n/a
Choi et al. (2019) | Autoencoder, ANN | yes | Unsupervised | no | n/a
Meira et al. (2019) | Autoencoder, K-means, Isolation Forest, SVM, Nearest Neighbor, SCH | yes | Unsupervised | no | no
The unsupervised mode of the training model learns from an unlabeled dataset (Carrasco and Sicilia, 2018), which is free from labeling costs and complexities. Therefore, an unsupervised intrusion detection system is a suitable approach for MANETs.

4. IDS datasets

4.1. KDD

The KDD'99 dataset is the most popular publicly available IDS dataset. It was generated for the Defense Advanced Research Projects Agency (DARPA) and contains normal and attack instances. MIT Lincoln Labs prepared and managed Air Force environment LANs with multiple attacks (Ring et al., 2019). In 1999, it was merged with some new attacks and named KDDcup99 (KDD'99).

Table 2 presents a comparison of IDS datasets on different parameters (Sharafaldin et al., 2018). The KDD'99 dataset satisfies the most parameters and contains the largest variety of attack types. It is an extremely imbalanced dataset, and the test set contains 17 more attacks than the training set (more details in Tables 3–5), which makes achieving good performance more challenging. The dataset contains three sub-datasets, namely the KDDcup dataset, the KDD 10% dataset, and the KDD corrected dataset. KDDcup is a high-volume dataset that contains 4,898,431 instances (3,925,650 attack and 972,781 normal) (Tavallaee et al., 2009). The remaining sub-dataset details are given in Table 3 as training and testing datasets.

Table 3 provides details of the attacks and their categories. It shows four categories of attacks, namely DoS, R2L, U2R, and Probe. The training dataset contains 22 attacks, while the testing dataset contains the same 20 attacks plus 17 new attacks. Here, we have placed the Httptunnel attack in the R2L category as in Aburomman and Reaz (2016), whereas some papers (Al-Yaseen et al., 2017) have included it in the U2R category.

Table 4 shows the instance distribution of new attacks, which are not present in the training set. It also shows the existing attack instances that are present in the training set as well as the test set. The test dataset contains 7.48% new attack instances, which are maximum in the R2L and minimum in the DoS category. The DoS instance distribution is much higher than that of the other attacks, which reduces the overall share of new attacks. However, most attack categories contain approximately 50 percent new attack instances, except DoS; this resists achieving a high DR, mainly for unsupervised IDSs.

Table 5 shows a detailed distribution of instances of the training and testing datasets, where the available instances in each category show huge gaps among categories. The table also shows the percentage of instance distribution;
few unsupervised intrusion detection and unsupervised feature selection methods have been executed on this dataset.

4.3. Wormhole dataset

Data collection from a real-world enterprise MANET for IDS is not easy work. Therefore, we simulated a wormhole attack in the MANET environment using Network Simulator 3 (NS-3), as presented in Prasad et al. (2019). The simulation was performed with 20 normal and 5 malicious nodes, a 250-meter radio range, random node movement, a 1000×1000 m² topology space, and 300 seconds of simulation time, which generated a high-volume dataset. Fig. 1 shows the steps of the wormhole dataset generation process: wormhole attack simulation, packet capture file, featuring data, and collection of data into the database.

The dataset contains 637,862 samples (152,144 normal and 485,718 malicious), collected on 20 distinct features of the ad-hoc network. We preprocessed the wormhole dataset by converting symbolic values to numerical values using the label encoding method and normalizing numerical values into a proportionate range. The preprocessed dataset contains many duplicate samples. We executed the proposed method on only the nonredundant samples (201,616), whereas the detection methods in Prasad et al. (2019) were executed on the initial dataset (637,862 samples). Further, the dataset was divided into 141,131 samples (70%) for the training set and 60,485 (30%) for testing, where the test set contains 14,499 (24%) normal and 45,986 (76%) malicious samples.
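The preprocessing pipeline described above can be illustrated with a minimal sketch. The file name and the use of scikit-learn's LabelEncoder and MinMaxScaler are our assumptions for illustration; the paper does not name its tooling.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

df = pd.read_csv("wormhole_dataset.csv")  # hypothetical file name

# Convert symbolic (categorical) columns to numerical values.
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

# Normalize numerical values into a proportionate [0, 1] range.
df[df.columns] = MinMaxScaler().fit_transform(df)

# Keep only nonredundant samples (637,862 -> 201,616 in the paper).
df = df.drop_duplicates().reset_index(drop=True)

# 70/30 train-test split, as described above.
train = df.sample(frac=0.7, random_state=0)
test = df.drop(train.index)
```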
5. Proposed method

The following subsections elaborate the proposed method. It starts with the selection of benchmark IDS datasets and the data normalization method. The next step involves unsupervised feature selection, which selects the best features; these features reduce dimensionality and redundant data. Afterward, the next step creates sets of semi-identical instances, which are used to compute initial cluster centers. Subsequent steps describe the structure of the sets and the initialization of centers. The main step executes data clustering, which carries out the following changes: (1) computation of cluster ranges, (2) cluster center initialization, (3) number of clusters (K → c). Finally, it merges minimum-distance micro-clusters and computes the performance of the system. The unsupervised learning model is used on supervised data split into train-test sets for validation and the performance metrics. The model has not undergone any prior training; it is presented with an unlabeled and uncategorized training dataset, and the learning procedure finds similarity between training samples and puts similar items into the same cluster. Fig. 2 shows a diagrammatic representation of the proposed method.

\psi_j = \frac{1}{N} \max_{v=1}^{n_j} \sum_{i=1}^{N} \mathbf{1}\left[\gamma_{vj} = x_{ij}\right]    (2)

where ψ_j is the feature score, γ_j = distinct(x_j), and n_j = count(γ_j) for the jth feature. Eq. (2) counts the occurrences of each object (distinct value) and selects the maximum count for the feature; the ratio of this maximum count to the total number of samples provides the feature score.

f_j = \begin{cases} 0, & \text{if } \psi_j \approx 1 \text{ or } \psi_j \approx 0 \\ 1, & \text{otherwise} \end{cases}    (3)

Eq. (3) indicates zero for not-selected and one for selected features, where ψ_j > 0.995 or ψ_j < 0.005 is taken as its approximate value.

Table 7 contains detailed information about the features of the KDD'99 dataset. The first four columns contain the feature type, feature sequence, feature name, and data type. Column 5 (eff) shows the effort to generate the feature, and Column 6 (rel) contains the relevance of the feature (Iglesias and Zseby, 2015). In terms of eff, features that need only a simple inspection of header fields require simple effort, those that need comparison require medium effort, and those that need transforming into another form require high effort.
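A minimal sketch of the feature-score computation in Eqs. (2)–(3), assuming numeric or label-encoded input; the function names are ours.

```python
import numpy as np

def feature_scores(X):
    """Eq. (2): for each feature j, psi_j is the frequency of its most
    common value divided by the number of samples N."""
    N, d = X.shape
    psi = np.empty(d)
    for j in range(d):
        _, counts = np.unique(X[:, j], return_counts=True)
        psi[j] = counts.max() / N
    return psi

def select_features(X, low=0.005, high=0.995):
    """Eq. (3): keep feature j only if low < psi_j < high, dropping
    near-constant features (psi ~ 1) and near-unique ones (psi ~ 0)."""
    psi = feature_scores(X)
    return np.where((psi > low) & (psi < high))[0]
```

On the hop_count column of the Table 9 example below (values {2, 4, 3} occurring 7, 3, and 2 times among 12 samples), feature_scores gives 7/12 ≈ 0.583, matching item (vi).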
Table 9 – An example information table for unsupervised feature selection and formation of semi-identical sets.

(iii) packet_size feature: γ_packet_size = {128, 64, 32}, n_j = 3, ψ_packet_size = max(3, 6, 3)/12 = 0.5
(iv) flag feature: γ_flag = {0, 1}, n_j = 2, ψ_flag = max(6, 6)/12 = 0.5
(v) header_length feature: γ_header_length = {20, 24}, n_j = 2, ψ_header_length = max(6, 6)/12 = 0.5
(vi) hop_count feature: γ_hop_count = {2, 4, 3}, n_j = 3, ψ_hop_count = max(7, 3, 2)/12 = 0.583

This computation provides the feature scores of the features {duration, protocol, packet_size, flag, header_length, hop_count} as {0.583, 0.667, 0.5, 0.5, 0.5, 0.583}, which are within the threshold range (ψ_j > 0.005 and ψ_j < 0.995). The UFS method therefore considers them non-redundant features.

5.3. Structure of set

Only dissimilar instances maintain a distance, which is calculated using the Euclidean distance. This section introduces sets of instances that approximately maintain minimum distance. Here, we randomly divide the features into approximately two equal subsets. A subset of features represents a new information system and computes semi-identical subsets of instances. This gives many sets as micro-clusters, which further compute the cluster center and cluster range for clustering. [...] of the CICIDS2017 dataset, and each subset contains 10 features of the Wormhole dataset.

An example of semi-identical set formation uses the sample in Table 9. The set of features is randomly divided into two subsets, such as B1 = {duration, protocol, hop_count} and B2 = {packet_size, flag, header_length}. Eq. (4) computes U/IND({B1}) = {{x1, x10}, {x2, x4, x5, x9, x12}, {x3, x6, x11}, {x7, x8}}, which indicates the semi-identical sets K = U/IND({B1}) = {K1, K2, K3, K4}. From this example, K1 = {x1, x10}, K2 = {x2, x4, x5, x9, x12}, K3 = {x3, x6, x11}, and K4 = {x7, x8}. These sets are used in the further steps using the (B1 ∪ B2 = A) features.

5.4. Cluster center initialization

Clustering is an important application of machine learning that includes unsupervised classification. An iterative clustering algorithm depends on its initial cluster centers (Kumar and Reddy, 2017). The selection of initial centers is extremely important, as it directly affects the formation of the final clusters (Masud et al., 2018; Peng and Liu, 2018). However, the selection of outliers as initial cluster centers can affect the shape and size of the final clusters and also increase iterations. CCI based clustering resolves these notable problems of basic clustering.

V_{r,j} = \frac{1}{|K_r|} \sum_{s=1}^{|K_r|} x_{s,j}, \quad x_{s,j} \in K_r    (5)
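A minimal sketch of the set formation and of Eq. (5), under our reading that U/IND(B1) groups the instances that agree on every feature in B1; the function names are ours.

```python
from collections import defaultdict
import numpy as np

def semi_identical_sets(X, feature_subset):
    """U/IND(B1): group the indices of instances that take identical
    values on every feature in `feature_subset` (Section 5.3)."""
    groups = defaultdict(list)
    for idx, row in enumerate(X):
        groups[tuple(row[j] for j in feature_subset)].append(idx)
    return list(groups.values())

def initial_centers(X, sets):
    """Eq. (5): the initial center of each set K_r is the mean of its
    members over all features, so centers land in dense data space."""
    return np.array([X[idx].mean(axis=0) for idx in sets])
```

On Table 9 with B1 = {duration, protocol, hop_count}, semi_identical_sets returns the four groups K1–K4 above; the number of groups fixes K, the number of micro-clusters.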
Fig. 3 – An example of a sample dataset with five clusters and K sets. (b) Preliminary partitioning into sets K = {K1, K2, K3, ..., K11}, with the center of each set marked by a black solid plus sign. These sets of semi-identical instances measure the initial centers, the K clusters, and the cluster ranges.
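The clustering stage seeds K-means with the K CCI centers and finally merges minimum-distance micro-clusters into c clusters (Section 5 and contributions (ii)–(iii)). The sketch below is one plausible rendering of that loop, assuming scikit-learn's KMeans; the size-weighted centroid merge is our choice and is not spelled out in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def micro_cluster_and_merge(X, init_centers, c):
    """Run K-means seeded with the K CCI centers, then repeatedly merge
    the two micro-clusters whose centers are closest until c remain."""
    init_centers = np.asarray(init_centers)
    K = len(init_centers)
    km = KMeans(n_clusters=K, init=init_centers, n_init=1).fit(X)
    centers = [km.cluster_centers_[k] for k in range(K)]
    sizes = np.bincount(km.labels_, minlength=K).astype(float).tolist()
    groups = [[k] for k in range(K)]  # micro-cluster ids per merged cluster
    while len(centers) > c:
        # Pair of clusters with minimum center-to-center distance.
        _, a, b = min((np.linalg.norm(centers[i] - centers[j]), i, j)
                      for i in range(len(centers))
                      for j in range(i + 1, len(centers)))
        w = sizes[a] + sizes[b]
        centers[a] = (sizes[a] * centers[a] + sizes[b] * centers[b]) / w
        sizes[a] = w
        groups[a] += groups[b]
        del centers[b], sizes[b], groups[b]
    # Map each micro-cluster label to its merged cluster id.
    label_map = {k: i for i, g in enumerate(groups) for k in g}
    return np.array([label_map[k] for k in km.labels_])
```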
The clustering step uses the cluster range of each set and the standard centroid update:

\mathrm{avgDistance}_r = \frac{1}{|K_r|} \sum_{s=1}^{|K_r|} \mathrm{distance}_{s,k}, \quad x_s \in K_r    (7)

V_{r,j} = \frac{1}{|\mathrm{cluster}_r|} \sum_{x_i \in \mathrm{cluster}_r} x_{i,j}    (8)

Eq. (7) gives the range of set K_r as the average distance of its members to its center k, and Eq. (8) recomputes the centroids during clustering.

6. Experiments

6.1. Performance measures

TP Rate (TPR), also called DR or Recall, is the proportion of correctly predicted positive cases to actual positive cases. FP Rate (FPR), or FAR, is the proportion of false positives to actual negatives. Precision is the proportion of correctly predicted instances to the number of predicted instances. F-measure is the harmonic mean of DR and Precision. Accuracy is the proportion of correct predictions to total instances (Ambusaidi et al., 2016). The weighted average defines the proportional sum of performance,

\mathrm{W.Avg} = \sum_{i=1}^{c} \frac{S_i}{S} \cdot \mathrm{Performance}_i,    (14)

where S = S_1 + S_2 + ... + S_c is the total number of samples, c is the number of classes, and S_i is the number of samples of the ith class. We have computed the overall performance of the system using Eq. (14) and the related measures.

6.2. Performance analysis on KDD dataset

This section analyzes the performance of the proposed unsupervised intrusion detection system. We executed the proposed method on different IDS datasets. The dimensionality reduction reduces the size of the dataset, and the selection of initial cluster centers reduces the iterations of the clustering algorithm. Moreover, the proposed method reduces computational complexity and provides better accuracy.

Table 10 contains the outcome of the proposed method in the form of a confusion matrix, which is used to compute the different statistical parameters. As an example, for the Normal category the statistical parameters are: the diagonal entry of the table as TP (i.e., 44157); the addition of all rows and columns except the Normal category as TN, which is 239924 (i.e., 207753, 188, 4, 903; 15817, 11596, 20, 3; 113, 338, 30, 1; 81, 21, 0, 3056); the addition of the column values of the Normal category except the diagonal as FN, which is 16434 (i.e., 847, 14683, 285, 619); and the addition of the row values of the Normal category except the diagonal as FP, which is 10512 (i.e., 6089, 4204, 16, 203). We compute likewise for all categories. The respective equations of Section 6.1 compute DR, FAR, Precision, and Accuracy using TP, TN, FP, and FN. Table 11 shows that the DR and Precision of DoS are high and those of U2R are low; accuracy is 99% for U2R and Probe, and more than 92% for the rest except R2L.

Higher TP and TN indicate better system performance, while higher FP and FN degrade it. In the context of the Normal category, FP is a wrong prediction of the classifier that allows an attack to enter the system. Conversely, FN is also a wrong prediction, where Normal is predicted as an attack, which increases the alert overhead. A minimum FAR shows better performance of the system; the proposed system gives a low FAR for all categories except DoS. The system provides the weighted average (using Eq. (14)) of the respective statistical measures as overall performance.

Table 12 shows the detection rates of recent algorithms (Al-Yaseen et al., 2017). The proposed method shows higher detection rates for U2R and R2L, while modified K-means, based on a multi-level hybrid ELM and SVM, performs better for the Normal and Probe categories, and multi-level SVM performs better for the DoS category. Notably, the proposed method gives better detection rates for the categories that contain few samples.

Table 13 shows the overall performance of the system concerning number of features, Accuracy, DR, and FAR. This clustering method is compared with DTC, Bayes, ANN, MI-BGSA, BGSA, BPSO, Multi-level ELM, Multi-level SVM, basic K-means, and modified K-means clustering, all of which were trained in the supervised mode. Iglesias and Zseby (2015) trained and tested on three subsets (or sets) of features using five classifiers, namely DTC, kNN, Bayes, ANN, and SVM; we compare against at least the best result of each set. The proposed method is trained in the unsupervised mode with 25 selected features, while the Autoencoder is trained in the unsupervised mode with 41 features, and K-means and New K-means are trained in the unsupervised mode with 25 features.
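A minimal sketch of the Section 6.1 measures and the Table 10 bookkeeping; the confusion-matrix orientation (rows = predicted, columns = actual) is inferred from the worked Normal-category example, and the function names are ours.

```python
import numpy as np

def per_class_stats(cm):
    """TP/TN/FP/FN per class from a confusion matrix laid out as in the
    Table 10 example: rows are predicted classes, columns actual ones."""
    tp = np.diag(cm)
    fn = cm.sum(axis=0) - tp  # actual class, predicted as something else
    fp = cm.sum(axis=1) - tp  # predicted as this class, actually another
    tn = cm.sum() - tp - fn - fp
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    dr = tp / (tp + fn)                      # detection rate / recall / TPR
    far = fp / (fp + tn)                     # false alarm rate / FPR
    precision = tp / (tp + fp)
    f_measure = 2 * dr * precision / (dr + precision)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return dr, far, precision, f_measure, accuracy

def weighted_average(values, class_sizes):
    """Eq. (14): overall performance as the sample-weighted mean."""
    s = np.asarray(class_sizes, dtype=float)
    return float(((s / s.sum()) * np.asarray(values)).sum())
```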
Table 12 – Detection rates (in %) of the proposed method and recent IDSs (Al-Yaseen et al., 2017); the best detection rate of each category is shown in boldface.
[Fig. 4 – Bar chart comparing performance (in %) of unsupervised IDSs on KDD: Proposed, New K-Means, K-Means, Zscore+NearestNeighbor, Zscore+K-Means, Isolation Forest, EF+MinMaxz+SVM, EF+MinMaxz+SCH, and EF+MinMaxz+Autoencoder.]
We have compared the proposed method to different unsupervised IDSs (Fig. 4), and it obtained better performance.

6.3. Performance analysis on CICIDS2017 dataset

The CICIDS2017 and CSE-CIC-IDS2018 are benchmark datasets that overcome the shortcomings of outdated datasets. They are high-volume and class-imbalanced datasets, which are themselves a cause of challenges for unsupervised IDSs. A high-volume dataset becomes a source of limitations as it consumes more processing overhead. In particular, the clustering technique groups the data into almost equal spatial sizes, whereas the presence of data in the benign class is much higher than in the attack class; for example, the Thursday AfterNoon dataset contains benign (99.99%) and attack (0.01%) data. Such class-imbalanced data show a higher FAR whenever the system falsely predicts even a small number of samples. We report the ratio of falsely to correctly predicted samples (FP/TP) for the clustering techniques. Similarly, accuracy represents the system performance as the ratio of correctly predicted to total samples. The proposed method shows significant contributions for unlabeled network traffic data and improves the system performance. Table 15 shows the training and test sets, where the non-redundant samples are the distinct samples of the training set. The weight of each test set computes the overall performance of the system (using Eq. (14)); it is the proportional weight of the test samples present in the set to the total test samples.
[Figure: performance (DR, FAR, Precision, and Accuracy) plotted against the number of micro-clusters (K), shown as two panels with K up to about 100.]
Table 16 – Statistical parameters (TP, TN, FP, FN) of the CCI based micro-clustering method on the CICIDS2017 sub-datasets.

Sub-dataset | Class | TP | TN | FP | FN
Tuesday | BENIGN | 86061 | 27 | 2716 | 378
Tuesday | FTP Patator | 27 | 87259 | 346 | 1550
Tuesday | SSH Patator | 0 | 87984 | 32 | 1166
Wednesday | BENIGN | 86338 | 35230 | 15240 | 1733
Wednesday | DoS slowloris | 323 | 136796 | 581 | 841
Wednesday | DoS slowhttptest | 250 | 137146 | 265 | 880
Wednesday | DoS Hulk | 32162 | 90296 | 2106 | 13977
Wednesday | DoS GoldenEye | 663 | 135897 | 610 | 1371
Wednesday | Heartbleed | 0 | 138535 | 3 | 3
Thursday Morning | BENIGN | 29465 | 337 | 116 | 4155
Thursday Morning | Web Attack-Brute Force | 259 | 30124 | 3623 | 67
Thursday Morning | Web Attack-XSS | 3 | 33346 | 605 | 119
Thursday Morning | Web Attack-Sql Injection | 0 | 34066 | 2 | 5
Thursday AfterNoon | BENIGN | 57667 | 3 | 8 | 42
Thursday AfterNoon | Infiltration | 3 | 57667 | 42 | 8
Friday Morning | BENIGN | 37813 | 6 | 383 | 5
Friday Morning | Bot | 6 | 37813 | 5 | 383
Friday AfterNoon-DDoS | BENIGN | 34652 | 2361 | 6190 | 1946
Friday AfterNoon-DDoS | DDoS | 2361 | 34652 | 1946 | 6190
Friday AfterNoon-PortScan | BENIGN | 19490 | 17416 | 14235 | 6152
Friday AfterNoon-PortScan | PortScan | 17416 | 19490 | 6152 | 14235
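As an illustration, feeding one Table 16 row through the metrics() helper sketched above reproduces the scale of the per-class numbers behind Table 18 (the table itself reports class-weighted averages):

```python
# Tuesday / BENIGN row of Table 16 (TP, TN, FP, FN).
tp, tn, fp, fn = 86061, 27, 2716, 378
dr, far, precision, f_measure, accuracy = metrics(tp, tn, fp, fn)
print(f"DR={dr:.4f} Precision={precision:.4f} FAR={far:.4f}")
# DR ~ 0.9956 and Precision ~ 0.9694; the tiny TN makes FAR ~ 0.99,
# the class-imbalance effect discussed in Section 6.3.
```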
We have identified the number of sub-clusters using the proposed scheme and finally merged them into clusters.

We executed K-means, New K-means, and CCI based micro-clustering, which provide results in the form of confusion matrices. Table 16 presents the statistical parameters TP, TN, FP, and FN of the CCI based micro-clustering method. A set of these parameters computes the performance of the system using the related equations. Similarly, we computed the performance of the remaining clustering techniques and then present their results.

Table 17 shows the category-wise performance of the BRS and the clustering techniques. These have been executed with
the same training and testing sets and differ only in their specific steps. The table shows the performance as detection rates and detection accuracy. The proposed CCI based method shows better results than the other clustering techniques. BRS (Prasad et al., 2020) is a supervised intrusion detection method that worked on the same dataset, while the proposed method is an unsupervised training method. BRS explicitly presented feature sets, attacks, and the data generation environment, and evaluated the qualitative realism of datasets. The comparative result shows that CICIDS2017 maintains high quality.

Table 18 presents the performance of each subset of CICIDS2017 on different statistical parameters: DR, Precision, FP/TP, FAR, and Accuracy. The proposed method performs better on almost all subsets than the clustering technique K-means and the micro-clustering method New K-means. The micro-clustering technique is performed with the number of sub-clusters of the respective subset, as in the proposed method, and finally merges them into macro-clusters. Moreover, Table 19 shows the overall (weighted average over test subsets) performance of the unsupervised IDSs, computed from the performance of the sub-datasets or data-generation days (in Table 18) and their weights (in Table 15). This computation adds an additional statistical parameter, the F-measure, as the harmonic mean of detection rate and precision. Accuracy shows the correctly predicted performance, and FP/TP the falsely predicted performance of the system. These statistical measures confirm that the proposed method performs better than the other unsupervised IDSs.
Table 18 – Sub-dataset (or data-generation-day) wise weighted average performance of clustering techniques.

Sub-dataset | Clustering Technique | DR | Precision | FP/TP | FAR | Accuracy
Tuesday | K-means | 0.8695 | 0.9361 | 0.1501 | 0.9698 | 0.8717
Tuesday | New K-means | 0.9264 | 0.9560 | 0.0794 | 0.4805 | 0.9337
Tuesday | Proposed | 0.9653 | 0.9480 | 0.0360 | 0.9597 | 0.9658
Wednesday | K-means | 0.6111 | 0.8072 | 0.6363 | 0.2083 | 0.7237
Wednesday | New K-means | 0.8136 | 0.8571 | 0.2291 | 0.1251 | 0.8524
Wednesday | Proposed | 0.8643 | 0.8675 | 0.1571 | 0.1996 | 0.8831
Thursday Morning | K-means | 0.6286 | 0.9664 | 0.5908 | 0.9850 | 0.6309
Thursday Morning | New K-means | 0.8237 | 0.9867 | 0.2140 | 0.0770 | 0.8291
Thursday Morning | Proposed | 0.8725 | 0.9835 | 0.1462 | 0.2537 | 0.8752
Thursday AfterNoon | K-means | 0.9176 | 0.9998 | 0.0898 | 0.1818 | 0.9176
Thursday AfterNoon | New K-means | 0.9746 | 0.9997 | 0.0260 | 0.7271 | 0.9746
Thursday AfterNoon | Proposed | 0.9991 | 0.9997 | 0.0009 | 0.7271 | 0.9991
Friday Morning | K-means | 0.8821 | 0.9786 | 0.1338 | 0.9910 | 0.8820
Friday Morning | New K-means | 0.9212 | 0.9837 | 0.0856 | 0.6140 | 0.9211
Friday Morning | Proposed | 0.9899 | 0.9855 | 0.0103 | 0.9746 | 0.9898
Friday AfterNoon-DDoS | K-means | 0.7722 | 0.7502 | 0.2950 | 0.6079 | 0.7722
Friday AfterNoon-DDoS | New K-means | 0.7502 | 0.7720 | 0.3329 | 0.4746 | 0.7502
Friday AfterNoon-DDoS | Proposed | 0.8198 | 0.7916 | 0.2199 | 0.5968 | 0.8198
Friday AfterNoon-PortScan | K-means | 0.6009 | 0.7621 | 0.6641 | 0.4922 | 0.6009
Friday AfterNoon-PortScan | New K-means | 0.6790 | 0.7145 | 0.4727 | 0.2912 | 0.6790
Friday AfterNoon-PortScan | Proposed | 0.6441 | 0.6669 | 0.5524 | 0.3338 | 0.6442
Table 19 – Overall (weighted average over test subsets) performance of the clustering techniques on CICIDS2017.

Clustering Technique | DR | Precision | F-measure | FP/TP | FAR | Accuracy
K-means | 0.7380 | 0.8712 | 0.7991 | 0.3984 | 0.5496 | 0.7725
New K-means | 0.8424 | 0.8882 | 0.8645 | 0.2021 | 0.3615 | 0.8559
Proposed | 0.8800 | 0.8857 | 0.8828 | 0.1564 | 0.5371 | 0.8860
Table 20 summarises the comparative results of the proposed unsupervised IDS. It is compared against both modes of training, i.e., supervised and unsupervised IDSs. The table indicates the unsupervised techniques that are trained on unlabeled training sets, whereas the supervised methods are trained on labeled training sets. The proposed method evaluates unsupervised feature selection and selects 58 significant features. The supervised methods are executed on the extracted 80 features without assessing the significance of features, except BRS (Prasad et al., 2020). The proposed method performs better than some supervised methods and all unsupervised techniques. Moreover, the proposed method confirms that it is suitable for classifying the behavior of unlabeled network traffic into benign and attack categories.

6.4. Performance analysis on wormhole dataset

New K-means clustering is similar to basic clustering in that it randomly selects initial cluster centers, whereas the proposed method is CCI based clustering. Both clustering methods create micro-clusters and merge them into clusters. Table 21 shows the statistical parameters of the clustering methods. Here, the number of micro-clusters is 31; the number of iterations of New K-means is 18, while the proposed method took on average 10 iterations to form the final micro-clusters.

Table 22 shows the performance of the detection methods, where the proposed method outperformed the state-of-the-art method. The experimental results confirm that the CCI based clustering method performs better than basic clustering. Traditional approaches have shown at most 90% detection rates for the wormhole attack in MANETs (Prasad et al., 2019). This work confirms that machine learning algorithms perform better than traditional algorithms, and that the proposed method performs better than existing detection methods.

Table 22 – Performance comparison (in %) of detection methods on the wormhole dataset; the best result is shown in boldface.

7. Conclusion

This research work proposed a new clustering method for unsupervised intrusion detection. It is based on a novel approach of unsupervised feature selection and initialization of cluster centers. The proposed method selects essential features, and the process of computing semi-identical instances gives the number of micro-clusters, the initial cluster centers, and the cluster ranges. It avoids outliers as initial cluster centers and avoids more than one initial center from the same cluster space. The quality of results for this proposal has been tested and compared to the state-of-the-art method. Mainly, the proposed unsupervised feature selection and cluster center initialization approach boosts the clustering method. It was trained and tested on the KDD'99, CICIDS2017, and Wormhole datasets. Experimental results show better detection rates and system accuracy for attack and non-attack classes.

The system was tested against existing methods, and the results are found to be encouraging. The implication of the proposed method is a demonstration of the fact that unsupervised feature selection and cluster center initialization reduce training complexity. The method removed the features that have negligible contributions. Experimental results confirm that the performance of the proposed method is better than without them.
References

Prasad M, Tripathi S, Dahal K. Wormhole attack detection in ad hoc network using machine learning technique. In: 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT). IEEE; 2019. p. 1–7.
Prasad M, Tripathi S, Dahal K. An efficient feature selection based Bayesian and rough set approach for intrusion detection. Appl. Soft Comput. 2020;87:105980.
Ring M, Wunderlich S, Scheuring D, Landes D, Hotho A. A survey of network-based intrusion detection data sets. Comput. Secur. 2019;86:147–67.
Roshan S, Miche Y, Akusok A, Lendasse A. Adaptive and online network intrusion detection system using clustering and extreme learning machines. J. Franklin Inst. 2018;355(4):1752–79.
Sharafaldin I, Lashkari AH, Ghorbani AA. Toward generating a new intrusion detection dataset and intrusion traffic characterization. In: ICISSP; 2018. p. 108–16.
Tavallaee M, Bagheri E, Lu W, Ghorbani AA. A detailed analysis of the KDD CUP 99 data set. In: IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA 2009). IEEE; 2009. p. 1–6.
Tiruvakadu DSK, Pallapa V. Confirmation of wormhole attack in MANETs using honeypot. Comput. Secur. 2018;76:32–49.
Wang H, Zhang Y, Zhang J, Li T, Peng L. A factor graph model for unsupervised feature selection. Inf. Sci. 2019;480:144–59.
Wu Y, Wang C, Bu J, Chen C. Group sparse feature selection on local learning based clustering. Neurocomputing 2016;171:1118–30.
Xie T, Ren P, Zhang T, Tang YY. Distribution preserving learning for unsupervised feature selection. Neurocomputing 2018;289:231–40.
Yin C, Zhang S, Yin Z, Wang J. Anomaly detection model based on data stream clustering. Cluster Comput. 2017:1–10.
Zhao Y, Ming Y, Liu X, Zhu E, Zhao K, Yin J. Large-scale k-means clustering via variance reduction. Neurocomputing 2018;307:184–94.
Zhu P, Zhu W, Hu Q, Zhang C, Zuo W. Subspace clustering guided unsupervised feature selection. Pattern Recognit. 2017;66:364–74.

Sachin Tripathi received the B.Tech degree from Chhatrapati Shahu Ji Maharaj University, Kanpur, India, and the M.Tech and Ph.D. degrees in computer science and engineering from the Indian Institute of Technology (ISM), Dhanbad, India, where he is currently working as an associate professor. He has been teaching computer science subjects for more than fourteen years and has contributed about 95 research papers. He has authored a book titled "Enhancements on Internet Applications: Multicast, Secure E-Mail Messaging and E-Business". He is an Associate Editor of the International Journal of Communication Systems, Wiley. His research interests mainly focus on group security, ad-hoc networks, and artificial intelligence. E-mail: var_1285@yahoo.com. Mail address: Department of Computer Science and Engineering, IIT (ISM), Dhanbad, Jharkhand-826004.

Keshav Dahal is a Professor of Intelligent Systems and the leader of the Artificial Intelligence, Visual Communication and Network (AVCN) Research Centre at the University of the West of Scotland (UWS), UK. He is also affiliated with Nanjing University of Information Science and Technology (NUIST), China. Before joining UWS he was with Bradford and Strathclyde Universities in the UK. He obtained his Ph.D. and Master degrees from the University of Strathclyde, UK. His research interests lie in the areas of applied AI for intelligent systems, trust and security modeling in distributed systems, and scheduling/optimization problems. He has published extensively, with award-winning papers, and has sat on the organizing/program committees of over 60 international conferences, including as General Chair and Programme Chair. He is a senior member of the IEEE.