
computers & security 99 (2020) 102062


Unsupervised feature selection and cluster center initialization based arbitrary shaped clusters for intrusion detection

Mahendra Prasad a,∗, Sachin Tripathi a, Keshav Dahal b


a Department of Computer Science and Engineering, Indian Institute of Technology (Indian School of Mines),
Dhanbad, India
b School of Computing, Engineering and Physical Sciences, University of the West of Scotland, Paisley, UK

ARTICLE INFO

Article history:
Received 24 December 2019
Revised 1 August 2020
Accepted 19 September 2020
Available online 24 September 2020

Keywords:
Unsupervised intrusion detection
Unsupervised feature selection
Cluster center initialization
Clustering
Mobile ad-hoc network
Wormhole attack

ABSTRACT

The massive growth of data in networks leads to attacks or intrusions. An intrusion detection system detects intrusions in high-volume datasets, but at increased complexity. A network generates a large amount of unlabeled data that is free from labeling costs. Unsupervised feature selection handles such data and reduces computational complexity. In this paper, we propose a clustering method based on unsupervised feature selection and cluster center initialization for intrusion detection. The method computes initial centers using sets of semi-identical instances, which indicate dense data space and avoid outliers as initial cluster centers. The spatial distance between data points and cluster centers creates micro-clusters; similar micro-clusters then merge into clusters of arbitrary shape. The proposed cluster center initialization based clustering method performs better than basic clustering: it takes fewer iterations to form the final clusters and provides better accuracy. We simulated a wormhole attack in a mobile ad-hoc network in NS-3 and generated the Wormhole dataset. The micro-clustering methods were executed on different network datasets (KDD, CICIDS2017, and the Wormhole dataset) and outperformed alternatives on new attacks and on attacks with few samples. Experimental results confirm that the proposed method is suitable for LAN and mobile ad-hoc networks, varying data density, and large datasets.

© 2020 Elsevier Ltd. All rights reserved.

1. Introduction

The exponential growth of network size and traffic has drawn attention to network security. An unauthorized user can misuse or harm network resources. A firewall, as a first-line defense mechanism, is used for intrusion detection, but it is not powerful enough to detect and prevent intrusions, and an attacker can easily bypass this first line. Antivirus software is a second-line defense system that can detect only the stored signatures of attacks. An Intrusion Detection System (IDS) is a mechanism that detects malicious behavior and unauthorized access and quickly prevents them in network communication. Moreover, an IDS is a strong detection system, as it gathers intrusion information or related behaviors that violate security policies. Intrusion information is an important asset of computer network security; hence an attacker tries to conceal his/her history and dynamic nature. There are mainly two categories of IDS: the misuse detection system and the anomaly detection system. These defense mechanisms may


∗ Corresponding author.
E-mail addresses: je.mahendra@uohyd.ac.in, je.mahendra@gmail.com (M. Prasad), var_1285@yahoo.com (S. Tripathi), Keshav.Dahal@uws.ac.uk (K. Dahal).
https://doi.org/10.1016/j.cose.2020.102062

combine to form a hybrid detection system. A misuse detection system detects only those signatures that are present in its database, whereas an anomaly detection system computes the deviation of behaviors from a baseline profile (Mishra et al., 2018). The main aim of unsupervised IDS is to increase the Detection Rate (DR) and reduce the False Alarm Rate (FAR). When the IDS detects an intrusion in communication, it generates an alarm and sends it to the network administrator to prevent the intrusion. In maintaining network security, the main problem for machine learning is computational complexity as the dimension and size of the dataset grow larger.

The feature selection approach reduces dimension and size by selecting a subset of essential features. It can be applied in both modes of training, supervised and unsupervised, according to whether label information is known or unknown (Xie et al., 2018). We apply Unsupervised Feature Selection (UFS) (Zhu et al., 2017) when class labels of data are unavailable. UFS addresses noticeable problems of clustering methods: it reduces additional computational cost and increases the performance of the system. Redundant or irrelevant features may lead to overfitting, which degrades the performance of the system. Here, the main objective of the UFS method is to keep the training model as small as possible by removing insignificant features or those with a negligible effect (Wang et al., 2019). The best subset of features contains a minimal number of features while contributing as much as possible to system accuracy. This is a fundamental mechanism for processing the dataset and provides a suitable approach to dimensionality reduction. It can be employed in many applications, such as text, image, video, and gene analysis. Feature selection can extensively enhance the interpretability of machine learning algorithms, and an algorithm integrated with feature selection often presents better generalization. The proposed UFS method finds a suitable subset of predictive features.

Many machine learning techniques are used for clustering, which indicates the significance of UFS. A clustering method partitions data into clusters over many iterations (Gong et al., 2018). Several difficulties have been reported in the literature: (1) the clustering method assumes that the number of clusters is known by the user, which is not true in practice (Masud et al., 2018); (2) an iterative method is sensitive to the initial cluster centers, especially the K-means algorithm; (3) the K-means algorithm can get stuck in local minima. The main reason for such difficulties of iterative clustering is random center initialization, and these problems are solved using Cluster Center Initialization (CCI) based clustering. The selection of initial cluster centers is essential, as it plays a major role in the formation of the final clusters (Huang et al., 2017; Peng and Liu, 2018).

Naturally, clusters have non-convex (arbitrary) shapes (Hyde et al., 2017), whereas distance-based methods always find convex clusters. The proposed method performs arbitrary-shape clustering, which differs from density-based clustering. It identifies the number of micro-clusters and works for data of varying density. We have executed the proposed method on extremely imbalanced datasets (KDD and CICIDS2017), where the KDD test set includes many unknown attacks. The experimental results show performance in increasing order as K-means, micro-clustering (New K-means), and CCI based micro-clustering (proposed). This work measures different characteristics (parameters) and is executed on different network datasets, which shows the reliability of the system. The KDD dataset was generated through testbeds of an Air Force Local Area Network (LAN) environment, CICIDS2017 is a recently generated dataset, and the Wormhole dataset is simulated in a Mobile Ad-hoc NETwork (MANET) environment. Moreover, the proposed method performs better than existing unsupervised intrusion detection mechanisms. The main contributions are enumerated below.

(i) Feature selection is an important technique to select the best subset of features and produce better results (Ambusaidi et al., 2016). We have proposed a UFS method that selects the more informative features. It also reduces the dimension and size of the dataset.
(ii) A basic clustering algorithm randomly selects initial centers and iteratively computes centroids (Ma et al., 2015). We have proposed a CCI based clustering algorithm that resolves these notable problems. The method computes semi-identical instances, which yields many sets; the mean of the data points in each set serves as an initial centroid. The advantages of cluster center initialization over the basic approach are: (1) it avoids outliers as initial centers; (2) each new initial cluster center is the mean of the data points in a set, which indicates dense data space; (3) it reduces the iterations of clustering.
(iii) We have applied K-means clustering to group similar data (Zhao et al., 2018). In the proposed method, K is the number of micro-clusters, which is greater than c. Finally, it merges the micro-clusters into c clusters. Experimental results show a better detection rate for new attacks and for attacks with few samples. The CCI based clustering method provides better accuracy and execution time than existing clustering mechanisms.

The rest of the paper is organized as follows: Section 2 reviews the related literature, and Section 3 introduces the importance of an unsupervised intrusion detection system for mobile ad-hoc networks. Section 4 provides a detailed analysis of the KDD, CICIDS2017, and Wormhole datasets. Section 5 describes the proposed method. Experiments are conducted and the performance of the proposed method is examined in Section 6. Finally, we conclude the paper with future directions in Section 7.

2. Related work

A high dimensional dataset contains more information but also introduces redundancies and noise, which increases the complexity of clustering algorithms. There are many feature selection methods, but most of them operate under the supervised mode of training. Unsupervised feature selection is a difficult task due to unknown labels. Recently, some works have shown the significance of UFS, based on, for example, the score of features (Wang et al., 2019), feature similarity, clustering based non-negative matrix factorization, clustering based co-regularization, locality of data, maximization of the distance between different clusters, relevant feature representation with the spatial geometrical

structure of data, joint sparse matrices, non-negative spectral analysis, low rank structure preservation, locality structure preservation, matrix factorization, inner product regularization, self-expressive models, and relationships among features. The proposed UFS method is based on the occurrence of objects, and it easily computes feature scores.

It is extremely important to adopt the best initial cluster centers in an iterative clustering algorithm, as they have a direct effect on the formation of clusters. One of the problems in clustering is identifying the number of clusters for CCI. Such problems of the clustering technique have been highlighted in much recent research. The main approaches include average density-based initial cluster centers (Peng and Liu, 2018), initial centers based on dense and sparse neighbors (Huang et al., 2017), self-expressing the number of clusters and identifying initial centers from dense regions (Masud et al., 2018), and locating initial cluster centers in the dense region of the dataset by representing the data points in a KD-tree (Kumar and Reddy, 2017). However, computational complexity is the main drawback of the KD-tree for high dimensional datasets. The KDDcup99 (KDD'99) dataset has high volume, a dynamic nature, and imbalanced distributions of categories. Due to such a nature, there is still no CCI method and no UFS method executed on this dataset.

Kang and Kim (2016) proposed combinatorial optimization based optimal feature subset selection. The number of possible feature subsets from n given features is (2^n − 1), so their approach reached (2^41 − 1) complexity. They selected 20 features, which showed a better DR but a poor FAR; their method took more time to provide optimal subsets of features. Yin et al. (2017) proposed a stream clustering algorithm for anomaly detection. They completed execution in 10 rounds and reported the system accuracy of each round. The best result showed a DR of 99% and FAR of 9.17%, while the averages were a DR of 91% and FAR of 13.61%.

Al-Yaseen et al. (2017) preprocessed the training dataset into five sub-datasets (Normal-DS, DoS-DS, U2R-DS, Probe-DS, and R2L-DS) using the supervised mode of training, and reduced the training set. The SVM took more training time, hence their approach builds a small training dataset. They then applied modified K-means and basic K-means based on a multilevel hybrid SVM and ELM with K=55. We have proposed a UFS and CCI based clustering method, executed on the unlabeled training set, which differs from the existing method. They reported that their method performs better for the DR of the Normal and Probe categories, while achieving R2L (31.39%) and U2R (21.93%). Roshan et al. (2018) proposed an adaptive design of IDS based on ELM. A comparison of their method obtained results in both modes of training, shown for many cases, and the method achieved better performance: they reported a best DR of 84% and a best FAR of 3.02%.

Choi et al. (2019) developed a network IDS using an unsupervised learning method. Their Autoencoder model aims to reconstruct its input vectors through ANN-based learning, and it learns to reconstruct input data consisting of only normal data. They tested the performance on training data containing normal (99%) and abnormal (1%) samples, and reported a best accuracy of 91.70% and DR of 84.68%. Meira et al. (2019) proposed anomaly detection using unsupervised learning techniques and explored six techniques for one-class classification: K-means, Autoencoder, Isolation Forest, Nearest Neighbor, Support Vector Machine (SVM), and Scaled Convex Hull (SCH). They discretized continuous feature data using the Equal Frequency (EF) technique and normalized data using the Zscore and MinMax techniques. Finally, they reported limited overall performance. We have compared the proposed method to their methods and to other unsupervised methods such as K-means (where K=5) and New K-means (where K=79, merged into 5), and it achieves better performance.

Table 1 shows a comparison of IDSs that were evaluated on the KDD'99 dataset and trained in both modes of training. There is still no single method that provides both UFS and a CCI based clustering method. The proposed work comprises a novel combination of UFS and CCI, which obtains fast execution of clustering applications and provides better results. It indicates that a clustering method with initial centers is suitable for the detection system. Additionally, the feature selection method reduces the computational complexity of the system.

3. Mobile ad-hoc network security mechanism

The MANET is an infrastructure-less, temporary, self-organized, and dynamic network that transfers data packets to the destination node through multihop communication. It is more vulnerable due to wireless communication and its dynamic nature (Bouhaddi et al., 2018). This network is mostly applied for military services on the battlefield and in search operations, where information security is the primary issue. An attacker can be internal or external, spreading malicious information and harming network resources. There are many types of attack methods in temporary networks, and each has a unique attacking nature.

A wormhole attack is a dominant network security threat that is not mitigated by cryptography and traditional security mechanisms. An attacker node attracts request packets by advertising itself as part of a path to the destination with minimum hop count (false information) and tunnels them to the other end (Tiruvakadu and Pallapa, 2018). Malicious nodes are highly active: they strategically maintain their locations and generate strong signals. Two malicious nodes create a wormhole tunnel when they are within radio range and transmit packets without revealing themselves. When the other malicious node is not in radio range, the attack behaves as a blackhole attack. These malicious nodes also conceal their history and information. The proposed method confirms that machine learning techniques effectively detect malicious data forwarded through malicious nodes.

The wireless network is more versatile than the wired network. Sample labeling is one of the main stages of data preprocessing for supervised learning, where each mistake can negatively affect the dataset and the overall performance of the predictive model. It becomes more difficult to label a sample (or packet information) as normal or malicious in dynamic networks when attackers constantly change their nature. Some attacking methods change their identity, which also makes the labeling process difficult. An unsupervised training model learns from the unlabeled dataset (Carrasco and Sicilia, 2018), which is free from labeling costs and complexities; therefore, an unsupervised intrusion detection system is a suitable approach for MANETs.

Table 1 – Comparison of related works.

Work                        | Technique                                                        | KDD'99 | Mode of training         | UFS | CCI
Lin et al. (2015)           | Clustering, k-NN, SVM                                            | yes    | Supervised, Unsupervised | no  | no
Iglesias and Zseby (2015)   | DTC, Bayes, kNN, ANN, SVM                                        | yes    | Supervised               | no∗ | n/a
Kang and Kim (2016)         | Combinatorial optimization, MLP                                  | yes    | Supervised               | no∗ | n/a
Yin et al. (2017)           | Stream clustering                                                | yes    | Unsupervised             | no  | no
Bostani and Sheikhan (2017) | MI-BGSA, SVM                                                     | yes    | Supervised               | no∗ | n/a
Al-Yaseen et al. (2017)     | Clustering, ELM, SVM                                             | yes    | Supervised, Unsupervised | no  | no
Roshan et al. (2018)        | Adaptive clustering, ELM                                         | yes    | Supervised, Unsupervised | no  | no
Al-Jarrah et al. (2018)     | AdaBoost, Bagging, RF                                            | yes    | Semi-supervised          | n/a | n/a
Choi et al. (2019)          | Autoencoder, ANN                                                 | yes    | Unsupervised             | no  | n/a
Meira et al. (2019)         | Autoencoder, K-means, Isolation Forest, SVM, Nearest Neighbor, SCH | yes  | Unsupervised             | no  | no

no∗ ← Supervised feature selection, n/a ← not applicable

Table 2 – Comparison of IDS datasets.

Parameters           | KDD'99 | DEFCON | CAIDA | LBNL | Kyoto | ISCX | ADFA | CICIDS
Complete network     | yes    | no     | yes   | yes  | yes   | yes  | yes  | yes
Attack traffic       | yes    | yes    | yes   | yes  | yes   | yes  | yes  | yes
Normal traffic       | yes    | yes    | yes   | yes  | yes   | yes  | yes  | yes
Network interactions | yes    | yes    | no    | no   | yes   | yes  | yes  | yes
Complete capture     | yes    | yes    | no    | no   | yes   | yes  | yes  | yes
Protocols            | many   | few    | NS    | few  | many  | many | many | many
Different attacks    | 39     | NS     | NS    | NS   | 17    | 4    | 12   | 14
Meta data            | yes    | no     | yes   | no   | yes   | yes  | yes  | yes
Features set         | yes    | no     | no    | no   | yes   | no   | no   | yes

NS ← Not Specified, ISCX←ISCX2012, ADFA←ADFA2013, CICIDS←CICIDS2017

4. IDS datasets

4.1. KDD

The KDD'99 dataset is the most popular publicly available IDS dataset. It was generated by the Defense Advanced Research Projects Agency (DARPA) and contains normal and attack instances. MIT Lincoln Labs prepared and managed Air Force environment LANs with multiple attacks (Ring et al., 2019). In 1999, the data was merged with some new attacks and named KDDcup99 (KDD'99).

Table 2 presents a comparison of IDS datasets on different parameters (Sharafaldin et al., 2018). The KDD'99 dataset satisfies the most parameters and contains the largest number of different attack types. It is an extremely imbalanced dataset, and the test set contains 17 more attacks than the training set (more details in Tables 3–5), which makes it more challenging to achieve better performance. This dataset contains three sub-datasets, namely the KDDcup dataset, the KDD 10% dataset, and the KDD corrected dataset. KDDcup is a high volume dataset that contains 4,898,431 instances (3,925,650 attacks and 972,781 normal) (Tavallaee et al., 2009). The details of the remaining sub-datasets are in Table 3 as training and testing datasets.

Table 3 provides details of the attacks and their categories. It shows four categories of attacks, namely DoS, R2L, U2R, and Probe. The training dataset contains 22 attacks, while the testing dataset contains the same 20 attacks plus 17 new attacks. Here, we have considered the Httptunnel attack in the R2L category, as in Aburomman and Reaz (2016), whereas some papers (Al-Yaseen et al., 2017) have included it in the U2R category.

Table 4 shows the instance distribution of the new attacks, which are not present in the training set. It also shows the existing attack instances that are present in both the training set and the test set. The test dataset contains 7.48% new attack instances, which are at a maximum in the R2L category and a minimum in the DoS category. The DoS attack instance distribution is much higher than that of the other attacks, which reduces the overall new attack distribution. However, most attack categories contain approximately 50 percent new attack instances, except for DoS; this resists achieving a high DR, mainly for unsupervised IDSs.

Table 5 shows a detailed distribution of the instances of the training and testing datasets, where the available instances in each category show huge gaps among categories. The table also shows the percentage of instance distribution, which reveals a large difference among the categories.

Table 3 – Categories and number of attacks of each category (training: KDD 10% dataset; testing: KDD corrected dataset).

Category | Attack name     | Training instances | Training total | Testing instances | New attack (testing) | Instances | Testing total
DoS      | back            | 2203               | 391458         | 1098              | apache2              | 794       | 229853
         | land            | 21                 |                | 9                 | mailbomb             | 5000      |
         | neptune         | 107201             |                | 58001             | processtable         | 759       |
         | pod             | 264                |                | 87                | udpstorm             | 2         |
         | smurf           | 280790             |                | 164091            |                      |           |
         | teardrop        | 979                |                | 12                |                      |           |
Probe    | ipsweep         | 1247               | 4107           | 306               | mscan                | 1053      | 4166
         | nmap            | 231                |                | 84                | saint                | 736       |
         | portsweep       | 1040               |                | 354               |                      |           |
         | satan           | 1589               |                | 1633              |                      |           |
R2L      | ftp_write       | 8                  | 1126           | 3                 | named                | 17        | 16347
         | guess_passwd    | 53                 |                | 4367              | sendmail             | 17        |
         | imap            | 12                 |                | 1                 | snmpgetattack        | 7741      |
         | multihop        | 7                  |                | 18                | snmpguess            | 2406      |
         | phf             | 4                  |                | 2                 | worm                 | 2         |
         | spy             | 2                  |                | -                 | xlock                | 9         |
         | warezclient     | 1020               |                | -                 | xsnoop               | 4         |
         | warezmaster     | 20                 |                | 1602              | httptunnel           | 158       |
U2R      | buffer_overflow | 30                 | 52             | 22                | ps                   | 16        | 70
         | loadmodule      | 9                  |                | 2                 | sqlattack            | 2         |
         | perl            | 3                  |                | 2                 | xterm                | 13        |
         | rootkit         | 10                 |                | 13                |                      |           |

Table 4 – New attack instance distributions of the test set.

Category | New attack     | Existing attack  | Total
DoS      | 6555 (02.85%)  | 223298 (97.15%)  | 229853
Probe    | 1789 (42.94%)  | 2377 (57.06%)    | 4166
R2L      | 10354 (63.34%) | 5993 (36.66%)    | 16347
U2R      | 31 (44.29%)    | 39 (55.71%)      | 70
Total    | 18729 (07.48%) | 231707 (92.52%)  | 250436

Table 5 – Instance details of the training and testing datasets.

Category | Training (KDD 10% dataset) | Testing (corrected dataset)
Normal   | 97278 (19.69%)             | 60591 (19.48%)
DoS      | 391458 (79.24%)            | 229853 (73.90%)
Probe    | 4107 (0.83%)               | 4166 (1.34%)
R2L      | 1126 (0.23%)               | 16347 (5.26%)
U2R      | 52 (0.01%)                 | 70 (0.02%)
Total    | 494021                     | 311027

The cluster size and shape depend on the similar-behavior objects available in the dataset. These datasets contain many redundant instances, mostly among the attacks. The proposed method was executed on 494,021 instances for training and 311,027 instances for testing. Two invalid records were found in the test set, with serial numbers 136489 and 136497, which contain the invalid value ICMP in their service feature (Tavallaee et al., 2009).

KDD'99 is publicly available and the most widely applied IDS dataset. Some effective IDSs have been executed on a subset of the dataset that represents the complete instances. Recently, the modified K-means used the KDD'99 dataset, for which five training datasets were constructed from the available samples, one training set for each category (Al-Yaseen et al., 2017). The unsupervised method of Roshan et al. (2018) was executed on training sets of 12,000 and 20,000 samples, updated every iteration by 10, 20, and 30 percent of the size of the training sets.

4.2. CICIDS2017

The Canadian Institute for Cybersecurity has generated two IDS datasets, namely CICIDS2017 (in July 2017) and CSE-CIC-IDS2018 (in Feb. 2018). Both datasets contain the same feature set (Prasad et al., 2020), attacks, and attacking nature. The main difference between them is that CICIDS2017 captured network traffic on a relatively small emulated network environment over five days (Ring et al., 2019; Sharafaldin et al., 2018), while CSE-CIC-IDS2018 captured network traffic and log files on a large emulated network environment over 18 days.

Table 6 presents the details of the CICIDS2017 dataset. It has eight sub-datasets, where Monday contains only BENIGN samples; the remaining details are in the table. It shows that the dataset is highly imbalanced among its categories, which resists achieving satisfactory performance for unsupervised IDSs. In this execution, we have applied the holdout validation method and an unlabeled training set to train the system. This method randomly splits the samples of each sub-dataset into a training set (80%) and a test set (20%). There is still no unsupervised intrusion detection or unsupervised feature selection method performed on this dataset.

Table 6 – Data details and distribution.

Sub-dataset               | Class                    | Training (samples) | Testing (samples) | Total
Tuesday                   | BENIGN                   | 345635 (96.89%)    | 86439 (96.92%)    | 445909
                          | FTP Patator              | 6361 (1.78%)       | 1577 (1.77%)      |
                          | SSH Patator              | 4731 (1.33%)       | 1166 (1.31%)      |
Wednesday                 | BENIGN                   | 351960 (63.51%)    | 88071 (63.57%)    | 692703
                          | Dos slowloris            | 4632 (0.84%)       | 1164 (0.84%)      |
                          | Dos slowhttptest         | 4369 (0.79%)       | 1130 (0.82%)      |
                          | Dos Hulk                 | 184934 (33.37%)    | 46139 (33.3%)     |
                          | Dos GoldenEye            | 8259 (1.49%)       | 2034 (1.47%)      |
                          | Heartbleed               | 8 (0.001%)         | 3 (0.002%)        |
Thursday Morning          | BENIGN                   | 134566 (98.73%)    | 33620 (98.67%)    | 170366
                          | Web Attack-Brute Force   | 1181 (0.87%)       | 326 (0.96%)       |
                          | Web Attack-XSS           | 530 (0.39%)        | 122 (0.36%)       |
                          | Web Attack-sql injection | 16 (0.01%)         | 5 (0.01%)         |
Thursday AfterNoon        | BENIGN                   | 230857 (99.99%)    | 57709 (99.98%)    | 288602
                          | Infiltration             | 25 (0.01%)         | 11 (0.02%)        |
Friday Morning            | BENIGN                   | 151249 (98.97%)    | 37818 (98.99%)    | 191033
                          | Bot                      | 1577 (1.03%)       | 389 (1.01%)       |
Friday AfterNoon-DDoS     | BENIGN                   | 147312 (81.57%)    | 36598 (81.06%)    | 225745
                          | DDoS                     | 33284 (18.43%)     | 8551 (18.94%)     |
Friday AfterNoon-PortScan | BENIGN                   | 101895 (44.46%)    | 25642 (44.76%)    | 286467
                          | PortScan                 | 127279 (55.54%)    | 31651 (55.24%)    |

Fig. 1 – Wormhole dataset.

4.3. Wormhole dataset

Data collection from a real-world enterprise of MANETs for IDS is not easy. Therefore, we simulated a wormhole attack in the MANET environment using the Network Simulator (NS-3), as presented in Prasad et al. (2019). The simulation was performed with 20 normal and 5 malicious nodes, a 250-meter radio range, random movement of nodes, a 1000 × 1000 m² topology space, and 300 seconds of simulation time, which generated a high volume dataset. Fig. 1 shows the steps of the wormhole dataset generation process: wormhole attack simulation, packet capture files, feature extraction, and collection of the data into the database.

This dataset contains 637,862 samples (152,144 normal and 485,718 malicious), collected over 20 distinguishing features of the ad-hoc network. We have preprocessed the wormhole dataset by transferring symbolic values to numerical values using the label encoding method and normalizing the numerical values into a proportionate range. This preprocessed dataset contains many duplicate samples. We have executed the proposed method on only the nonredundant (201,616) samples, whereas the detection methods in Prasad et al. (2019) were executed on the initial (637,862 sample) dataset. Further, the data are divided into 141,131 samples (70%) for the training set and 60,485 (30%) for testing, where the test set contains 14,499 (24%) normal and 45,986 (76%) malicious records.
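A minimal sketch of the preprocessing pipeline described above (label encoding, de-duplication, 70/30 split). The file name and the presence of string columns are assumptions for illustration; the paper does not publish the exact schema of the Wormhole dataset.

```python
# Hypothetical preprocessing sketch; "wormhole_dataset.csv" is an assumed file name.
import pandas as pd

df = pd.read_csv("wormhole_dataset.csv")  # 637,862 raw samples in the paper

# Label encoding: assign each distinct symbolic (string) object an integer code.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category").cat.codes

# Keep only non-redundant samples (the paper reports 201,616 distinct rows).
df = df.drop_duplicates().reset_index(drop=True)

# 70% training / 30% testing split, as used for the Wormhole dataset.
train = df.sample(frac=0.7, random_state=0)
test = df.drop(train.index)
print(len(train), len(test))
```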

5. Proposed method

The following subsections elaborate the proposed method. It starts with the selection of a benchmark IDS dataset and the data normalization method. The next step involves unsupervised feature selection, which selects the best features; these features reduce dimensionality and redundant data. Afterward, the next step creates sets of semi-identical instances, which are used to compute the initial cluster centers. Subsequent steps describe the structure of the sets and the initialization of centers. The main step executes data clustering, which carries out the following changes: (1) computation of the cluster ranges, (2) cluster center initialization, and (3) the number of clusters (K ≫ c). Finally, it merges minimum-distance micro-clusters and computes the performance of the system. The unsupervised learning model is used on supervised data split into train-test sets for validation and the performance metrics. The model has not undergone any prior training: it is presented with an unlabeled and uncategorized training dataset, and the learning procedure finds similarity between training samples and puts similar items into the same cluster. Fig. 2 shows a diagrammatic representation of the proposed method.

[Fig. 2 – Diagrammatic representation of the proposed method: dataset → data processing (data normalization, unsupervised feature selection) → dataset with selected features → reduce redundant data → training/testing datasets → compute semi-identical sets and number of micro-clusters → cluster center initialization → clustering and merging of micro-clusters → result.]

5.1. Data normalization

Data normalization is one of the main steps, transferring symbolic values into numerical values, where each value of a feature must be scaled down into a well-proportioned range (Ambusaidi et al., 2016). We have applied the label encoding method, which assigns numbers to distinct symbolic (string) objects in sequence and normalizes them from 0 to 1 (using Eq. (1)):

x_{ij} = \frac{\chi_{ij} - \min(\chi_j)}{\max(\chi_j) - \min(\chi_j)},    (1)

where x_{ij} is the normalized value of χ_{ij}, and min(χ_j) and max(χ_j) are the minimum and maximum values of the jth feature. This normalization eliminates large deviations and also the bias of features. It is applied to both the training and test sets with the same min and the same max values.
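A minimal sketch of Eq. (1): min-max normalization applied column-wise, reusing the training set's min/max on the test set, as the text specifies. The helper name min_max_fit is ours, not from the paper.

```python
import numpy as np

def min_max_fit(train: np.ndarray):
    """Fit Eq. (1) on the training columns and return a transform closure."""
    lo, hi = train.min(axis=0), train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant features

    def transform(x: np.ndarray) -> np.ndarray:
        return (x - lo) / span              # Eq. (1)

    return transform

train = np.array([[0.02, 128.0], [0.05, 64.0], [0.02, 32.0]])
test = np.array([[0.05, 96.0]])
norm = min_max_fit(train)
print(norm(train))
print(norm(test))  # normalized with the training set's min and max
```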
5.2. Unsupervised feature selection

The UFS reduces redundant and irrelevant features and selects the non-redundant features of an unlabeled dataset (Wu et al., 2016). Redundant features can create problems such as increased complexity and computation, deviated actual clusters, and support for wrong decisions. It is a challenging task to select a subset of the more informative features of an unlabeled dataset, as essential information can vanish. A feature score can be computed using entropy for an unlabeled dataset by calculating the histogram of the feature. However, the proposed method selects only the most frequent object to compute the feature score, which differs from entropy. When a feature contains every entry the same, or every entry different, it has a negligible contribution; the proposed method selects the features whose scores lie within a threshold range. This method removes only negligible-contribution features from any dataset.

\psi_j = \frac{\max_{v=1}^{n_j} \sum_{i=1}^{N} 1[\gamma_{vj} = x_{ij}]}{N},    (2)

where ψ_j is the feature score, γ_j = distinct(x_j), and n_j = count(γ_j) of the jth feature. Eq. (2) computes the count of occurrences of each object and selects the maximum count for the feature; the ratio of the maximum count to the total number of samples then gives the feature score.

f_j = \begin{cases} 0, & \text{if } \psi_j \approx 1 \text{ or } \psi_j \approx 0 \\ 1, & \text{otherwise} \end{cases}    (3)

Eq. (3) indicates zero for non-selected and one for selected features, where ψ_j > 0.995 or ψ_j < 0.005 is treated as the approximate value.

Table 7 contains the detailed information of the features of the KDD'99 dataset. The first four columns contain the feature group, the feature index, the feature name, and the data type. Column 5 (eff) shows the effort to generate the feature, and Column 6 (rel) contains the relevance of the feature (Iglesias and Zseby, 2015). In terms of eff, features that need only a simple inspection of header fields require small effort, those that need comparison require medium effort, and those that require transformation into another form require high effort.

Table 7 – Features of KDD’99.

Group | Feature | Feature name | Data type | eff | rel | status


Basic features f1 duration integer small negligible yes
f2 protocol_type string small low yes
f3 service string small strong yes
f4 flag string high strong yes
f5 src_bytes integer small low no
f6 dst_bytes integer small strong no
f7 land binary medium low no
f8 wrong_fragment integer high strong no
f9 urgent integer medium negligible no
Content features f10 hot integer high low yes
f11 num_failed_logins integer high negligible no
f12 logged_in binary high low yes
f13 num_compromised integer high low no
f14 root_shell binary high low no
f15 su_attempted binary high low no
f16 num_root integer high negligible no
f17 num_file_creations integer high low no
f18 num_shells integer high low no
f19 num_access_files integer high negligible no
f20 num_outbound_cmds integer high negligible no
f21 is_hot_login binary high negligible no
f22 is_guest_login binary high low no
Traffic features f23 count integer medium strong yes
f24 srv_count integer medium low yes
f25 serror_rate real high strong yes
f26 srv_serror_rate real high strong yes
f27 rerror_rate real high low yes
f28 srv_rerror_rate real high strong yes
f29 same_srv_rate real medium strong yes
f30 diff_srv_rate real medium negligible yes
f31 srv_diff_host_rate real medium negligible yes
f32 dst_host_count integer medium strong yes
f33 dst_host_srv_count integer medium strong yes
f34 dst_host_same_srv_rate real medium strong yes
f35 dst_host_diff_srv_rate real medium strong yes
f36 dst_host_same_src_port_rate real medium negligible yes
f37 dst_host_srv_diff_host_rate real medium negligible yes
f38 dst_host_serror_rate real high strong yes
f39 dst_host_srv_serror_rate real high strong yes
f40 dst_host_rerror_rate real high strong yes
f41 dst_host_srv_rerror_rate real high low yes

Iglesias et al. computed feature ranking and weighting using four different methods that show the contribution of the features as negligible, low, and strong relevance (Iglesias and Zseby, 2015). The final column contains the status of each feature under the proposed UFS. In terms of relevance, the selected features comprise 5 features with negligible contribution, 6 with weak contribution, and 14 with strong contribution. Similarly, in terms of the effort to generate features, there are 3 features (small), 11 features (medium), and 11 features (high). The comparative analysis of the proposed UFS against the existing method shows a better subset of features, containing strong relevance and medium effort.

Table 8 contains the selected features, which form a subset of 25 features. These features reduce the dimensionality as well as the redundant samples of the dataset. A subset of the dataset contains 81,798 non-redundant samples, which is only 16% of the samples of the training (KDD 10%) dataset. This training subset increases the performance of the system and decreases complexity.

Table 8 – Selected features of KDD’99.

# features | Features
25         | f1, f2, f3, f4, f10, f12, f23, f24, f25, f26, f27, f28, f29, f30, f31, f32, f33, f34, f35, f36, f37, f38, f39, f40, f41

We illustrate the UFS method on the example in Table 9 using Eqs. (2) and (3); the table contains 12 samples (N = 12) and six features, with the feature-score computations (i)–(vi) shown after the table.

Table 9 – An example information table for unsupervised feature selection and formation of semi-identical sets.

U duration protocol packet_size flag header_length hop_count


x1 0.02 UDP 128 0 20 2
x2 0.02 AODV 64 0 24 2
x3 0.05 AODV 64 1 24 4
x4 0.02 AODV 64 0 24 2
x5 0.02 AODV 32 1 20 2
x6 0.05 AODV 64 1 24 4
x7 0.05 UDP 128 0 20 3
x8 0.05 UDP 64 1 24 3
x9 0.02 AODV 32 0 20 2
x10 0.02 UDP 32 1 20 2
x11 0.05 AODV 64 1 20 4
x12 0.02 AODV 128 0 24 2

(i) duration feature: γ_duration = {0.02, 0.05}, n_j = 2, ψ_duration = max(7, 5)/12 = 0.583
(ii) protocol feature: γ_protocol = {UDP, AODV}, n_j = 2, ψ_protocol = max(4, 8)/12 = 0.667
(iii) packet_size feature: γ_packet_size = {128, 64, 32}, n_j = 3, ψ_packet_size = max(3, 6, 3)/12 = 0.5
(iv) flag feature: γ_flag = {0, 1}, n_j = 2, ψ_flag = max(6, 6)/12 = 0.5
(v) header_length feature: γ_header_length = {20, 24}, n_j = 2, ψ_header_length = max(6, 6)/12 = 0.5
(vi) hop_count feature: γ_hop_count = {2, 4, 3}, n_j = 3, ψ_hop_count = max(7, 3, 2)/12 = 0.583

This computation provides the feature scores of {duration, protocol, packet_size, flag, header_length, hop_count} as {0.583, 0.667, 0.5, 0.5, 0.5, 0.583}, all of which lie within the threshold range (0.005 < ψ_j < 0.995). The UFS method therefore considers them non-redundant features.
Reddy, 2017). A selection of initial centers is extremely impor-
Only dissimilar instances maintain a distance that is calcu- tant that directly affect the formation of final clusters (Masud
lated using Euclidean distance. This section introduces a set et al., 2018; Peng and Liu, 2018). However, selection of outliers
of instances that approximately maintain minimum distance. as initial cluster center can affect the shape and size of final
Here, we have randomly divided features into approximately clusters, and also increase iterations. The CCI based clustering
two equal subsets. A subset of features represents new in- resolves notable problems of basic clustering.
formation system and computes semi-identical subsets of in-
1 
stances. It gives many sets as micro-clusters that further com- Vr, j = xs, j |xs, j ∈ Kr , (5)
|Kr |
pute cluster center and cluster range for clustering. Let in- s=1

formation system (IS) : T=(U,A) from dataset, then B1 ⊂ A and


where, Vk,j is initial cluster center and |Kr | is the size (number
B2 ⊂ A are associated an equivalence relation.
of instances) of set Kr . Eq. (5) computes initial cluster center,
  which overcome all issues of the basic clustering algorithm. It
IND(R ) = (x, y ) ∈ U 2 |∀b ∈ R(B1 ), b(x ) = b(y ) . (4) reduces iterations and computational complexities.

Eq. (4) computes K = U/IND({B1 } ) = {K1 , K2 , ..., Kn }, where, 5.5. Clustering


U is a non-empty finite set of instances and A is selected at-
tributes. Here, B1 and B2 are two subsets of selected attributes The K-means clustering is one of the popular algorithm of
(i.e., B1 ∩ B2 = φ and B1 ∪ B2 = A). r is an index of a set, machine learning, that ability to group similar data into clus-
whose length is greater than ten. The value of r is 1 ≤ r ≤ K ters (Gong et al., 2018; Zhao et al., 2018). However, it is sus-
and K is number of sets (micro-clusters). Instances x and y ceptible to select starting point as cluster centers (Kumar and
are indiscernible from each other by attributes in B1 subset Reddy, 2017). As the procedure of the algorithm, it randomly
(Ghosh et al., 2016). For every b ∈ R(B1 ) and b: U → Vb , where selects initial centers. Therefore, it does not guarantee to get
Vb is called value set of b. Eq. (4) group instances into set unique clustering. The final cluster centroid may not be op-
those are similar by features in B1 . An attribute subset B1 con- timal that can converge into local optimal solutions. To solve
tains 13 features and subset B2 contains rest of features of this problem, we have proposed a clustering method, which is
KDD’99 dataset, whereas each subset contains 29 features of based on an initial cluster center.
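A sketch of semi-identical set formation (Eq. (4)) and cluster center initialization (Eq. (5)) on the Table 9 example, with B1 = {duration, protocol, hop_count} and protocol encoded as UDP=0, AODV=1. For this toy example we ignore the paper's minimum-set-size rule (length greater than ten); the variable names are ours.

```python
from collections import defaultdict
import numpy as np

# Table 9 rows: (duration, protocol, packet_size, flag, header_length, hop_count)
X = np.array([
    [0.02, 0, 128, 0, 20, 2], [0.02, 1,  64, 0, 24, 2], [0.05, 1,  64, 1, 24, 4],
    [0.02, 1,  64, 0, 24, 2], [0.02, 1,  32, 1, 20, 2], [0.05, 1,  64, 1, 24, 4],
    [0.05, 0, 128, 0, 20, 3], [0.05, 0,  64, 1, 24, 3], [0.02, 1,  32, 0, 20, 2],
    [0.02, 0,  32, 1, 20, 2], [0.05, 1,  64, 1, 20, 4], [0.02, 1, 128, 0, 24, 2],
])
B1 = [0, 1, 5]  # column indices of duration, protocol, hop_count

groups = defaultdict(list)           # Eq. (4): partition U by equality on B1
for i, row in enumerate(X):
    groups[tuple(row[B1])].append(i)

sets = list(groups.values())         # K1..K4: [[0,9], [1,3,4,8,11], [2,5,10], [6,7]]
centers = np.array([X[s].mean(axis=0) for s in sets])  # Eq. (5): per-set means
print(sets)
print(centers)                       # initial centers over all features (B1 ∪ B2)
```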

Fig. 3 – An example of a sample dataset with five clusters and K sets. (b) Preliminary partitioning into sets K = {K1, K2, K3, ..., K11}, with the center of each set marked by a black solid plus sign. These sets of semi-identical instances determine the initial centers, the K clusters, and the cluster ranges.

\mathrm{distance}_{i,r} = \sqrt{\sum_{j} (x_{i,j} - V_{r,j})^2},    (6)

\mathrm{avgDistance}_r = \frac{1}{|K_r|} \sum_{s=1}^{|K_r|} \mathrm{distance}_{s,r}, \quad x_s \in K_r,    (7)

V_{r,j} = \frac{1}{|\mathrm{cluster}_r|} \sum_{x_i \in \mathrm{cluster}_r} x_{i,j},    (8)

where Eq. (6) computes the Euclidean distance from cluster center V_r to instance x_i. Eq. (7) computes the average distance of the instances present in the same set, which is used to bound the range (i.e., avgDistance) of the cluster. Eq. (8) updates the cluster centroids in every iteration; the range of a cluster is computed only once. A cluster obtains as data points only those instances that are within the range of the cluster. This algorithm clusters the data into K clusters (K=79 for KDD, Table 15 contains the K values for the sub-datasets of CICIDS2017, and K=31 for the Wormhole dataset), avoids outliers as initial cluster centers, and produces optimal clusters.

Fig. 3 demonstrates the clustering process on a sample dataset: Fig. 3a contains data points with five clusters; Fig. 3b presents the K sets and the initial centers from dense regions; the range of each micro-cluster is defined by averaging the data points of its semi-identical set; Fig. 3c shows the final centers obtained using K-means clustering; and Fig. 3d displays the macro-clusters as arbitrary shaped clusters.
The prediction of test set is measured as confusion matrix.


where, Eq. (6) computes Euclidean distance of cluster center It each cell contributes to measure statistical parameters.
Vk to instance xi . Eq. (7) computes the average distance of Rows and columns of the confusion matrix carried out a
instances present in same set, which is used for bound the quantitative representation of the information (Bostani and
range (i.e., avgDistance) of the cluster. Eq. (8) updates cluster Sheikhan, 2017), where row indicates a prediction of classifier,
centroids every iteration, here, range of cluster computes only whenever a column indicates the total instances of the class.
once. A cluster obtains instances as data points only that are The diagonal represents the correct prediction of the class.
in the range of cluster. This algorithm has clustered data into TP
K-clusters (i.e., K=79 for KDD, Table 15 contains K values of DR = , (9)
TP + FN
sub-datasets of CICIDS2017, and K=31 for Wormhole dataset),
FP
that avoids outliers as initial cluster centers and produce the FAR = , (10)
FP + TN
optimal clusters.
Fig. 3 shows demonstration of clustering process on sam- TP
Precision = , (11)
ple dataset, where Fig. 3a contains data points with five clus- TP + FP
ters. Fig. 3b presents K number of sets and initial centers 2 ∗ Recall ∗ Precision
F − measure = , (12)
from dense region. To maintain micro-clusters define range of Recall + Precision
micro-clusters by averaging data points of the semi-identical
TP + TN
set. Fig. 3c shows final centers obtained using K-means clus- Accuracy = , (13)
TP + FP + TN + FN

Table 10 – Confusion matrix of the test dataset of KDD’99 (rows: predicted class; columns: actual class).

Class Normal DoS R2L U2R Probe


Normal 44157 6089 4204 16 203
DoS 847 207753 188 4 903
R2L 14683 15817 11596 20 3
U2R 285 113 338 30 1
Probe 619 81 21 0 3056

Table 11 – Statistical parameters of test dataset of KDD’99.

Parameters Normal DoS R2L U2R Probe


TP 44157 207753 11596 30 3056
TN 239924 79232 264157 310220 306140
FP 10512 1942 30523 737 721
FN 16434 22100 4751 40 1110
DR 0.73 0.91 0.71 0.43 0.74
FAR 0.0419 0.0239 0.1035 0.0023 0.0023
Precision 0.81 0.99 0.28 0.04 0.81
Accuracy 0.92 0.93 0.89 0.99 0.99
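As a cross-check of the tabulated parameters, the following sketch recomputes Table 11 from the Table 10 confusion matrix, along with the weighted average of Eq. (14); small differences from the published values are rounding.

```python
import numpy as np

# Table 10: rows = predicted, columns = actual (Normal, DoS, R2L, U2R, Probe).
cm = np.array([
    [44157,   6089,  4204, 16,  203],
    [  847, 207753,   188,  4,  903],
    [14683,  15817, 11596, 20,    3],
    [  285,    113,   338, 30,    1],
    [  619,     81,    21,  0, 3056],
])
TP = np.diag(cm)
FP = cm.sum(axis=1) - TP            # predicted as this class but actually another
FN = cm.sum(axis=0) - TP            # this class predicted as another
TN = cm.sum() - TP - FP - FN

DR = TP / (TP + FN)                 # Eq. (9)
FAR = FP / (FP + TN)                # Eq. (10)
precision = TP / (TP + FP)          # Eq. (11)
accuracy = (TP + TN) / cm.sum()     # Eq. (13)

S = cm.sum(axis=0)                  # actual class sizes
w_avg_dr = (S / S.sum() * DR).sum() # Eq. (14), weighted average DR
print(np.round(DR, 2), np.round(FAR, 4))  # compare Table 11
print(round(w_avg_dr, 4))
```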

W.Avg = \sum_{i=1}^{c} \frac{S_i}{S} \cdot Performance_i.    (14)

The TP Rate (TPR), also called DR or Recall, is the proportion of correctly predicted positive cases among the actual positive cases. The FP Rate (FPR), or FAR, is the proportion of false positives among the actual negatives. Precision is the proportion of correctly predicted instances among the predicted instances. F-measure is the harmonic mean of DR and Precision. Accuracy is the proportion of correct predictions over the total instances (Ambusaidi et al., 2016). The weighted average defines a proportional sum of performance, where S = {S1 + S2 + .... + Sc} is the total number of samples, c is the number of classes, and Si is the number of samples of the ith class. We have computed the overall performance of the system using Eq. (14) and the related measures.

6.2. Performance analysis on KDD dataset

This section analyzes the performance of the proposed unsupervised intrusion detection system. We have executed the proposed method on different IDS datasets. The dimensionality reduction reduces the size of the dataset, and the selection of the initial cluster centers reduces the iterations of the clustering algorithm. Moreover, the proposed method reduces the computational complexity and provides better accuracy.

Table 10 contains the outcome of the proposed method in the form of a confusion matrix, which is used to compute the different statistical parameters. As an example, for the Normal category: TP is the diagonal entry (i.e., 44157); TN is the sum of the rows and columns excluding the Normal category, 239924 (i.e., 207753, 188, 4, 903; 15817, 11596, 20, 3; 113, 338, 30, 1; 81, 21, 0, 3056); FN is the sum of the Normal column values excluding the diagonal, 16434 (i.e., 847, 14683, 285, 619); and FP is the sum of the Normal row values excluding the diagonal, 10512 (i.e., 6089, 4204, 16, 203). The same computation applies to all categories. The respective equations of Section 6.1 then compute DR, FAR, Precision, and Accuracy from TP, TN, FP, and FN.

Higher TP and TN indicate better system performance, while higher FP and FN degrade it. In the context of the Normal category, an FP is a wrong prediction of the classifier that allows an attack to enter the system; conversely, an FN is a wrong prediction in which Normal is predicted as an attack, which increases the alert overhead. A minimum FAR shows better performance of the system: the proposed system gives a low FAR for all categories except DoS. The system reports the weighted average (using Eq. (14)) of the respective statistical measures as the overall performance. Table 11 shows that the DR and Precision of DoS are high while those of U2R are low; it shows an accuracy of 99% for U2R and Probe, and more than 92% for the rest except R2L.

Table 12 shows the detection rates of recent algorithms (Al-Yaseen et al., 2017). The proposed method shows a higher detection rate for U2R and R2L, while modified K-means, which is based on a multi-level hybrid of ELM and SVM, performs better for the Normal and Probe categories, and multi-level SVM performs better for the DoS category. Notably, the proposed method gives better detection rates for the categories that contain few samples.

Table 13 shows the overall performance of the system concerning the number of features, Accuracy, DR, and FAR. The proposed clustering method is compared with DTC, Bayes, ANN, MI-BGSA, BGSA, BPSO, Multi-level ELM, Multi-level SVM, basic K-means, and modified K-means clustering, all of which were trained in the supervised mode of training. Iglesias and Zseby (2015) trained and tested on three subsets of features using five classifiers, namely DTC, kNN, Bayes, ANN, and SVM; we have compared against at least the one best result of each subset. The proposed method is trained in the unsupervised mode with the selected 25 features, while the Autoencoder is trained in the unsupervised mode with 41 features, and K-means and New K-means are trained in the unsupervised mode with 25 features. The comparative result shows that the proposed method performs better than the Autoencoder, K-means, and New K-means methods.

Table 12 – Detection rates (in %) of the proposed method and recent IDSs (Al-Yaseen et al., 2017); the best detection rate of each category is shown in boldface.

Category | Multi-level ELM | Multi-level SVM | Modified K-means | Proposed
Normal   | 96.64           | 97.83           | 98.13            | 72.87
DoS      | 96.83           | 99.57           | 99.54            | 90.39
R2L      | 10.84           | 31.60           | 31.39            | 70.94
U2R      | 23.68           | 16.23           | 21.93            | 42.86
Probe    | 84.93           | 80.94           | 87.22            | 73.36

Table 13 – Performance comparisons (in %) of different methods.

Technique                                 | No. of features | Accuracy | DR    | FAR
DTC (Iglesias and Zseby, 2015)            | 16              | 78.22    | 82.62 | –
ANN (Iglesias and Zseby, 2015)            | 16              | 79.25    | 70.96 | –
DTC (Iglesias and Zseby, 2015)            | 30              | 78.68    | 85.53 | –
Bayes (Iglesias and Zseby, 2015)          | 41              | 57.01    | 99.16 | –
ANN (Iglesias and Zseby, 2015)            | 41              | 77.74    | 92.79 | –
MI-BGSA (Bostani and Sheikhan, 2017)      | 5               | 88.36    | 86.30 | 8.88
BGSA (Bostani and Sheikhan, 2017)         | 13              | 85.62    | 81.17 | 8.42
BPSO (Bostani and Sheikhan, 2017)         | 11              | 85.88    | 81.41 | 8.14
Multi-level ELM (Al-Yaseen et al., 2017)  | 41              | 93.83    | 95.02 | 3.36
Multi-level SVM (Al-Yaseen et al., 2017)  | 41              | 95.57    | 95.02 | 2.17
Basic K-means (Al-Yaseen et al., 2017)    | 41              | 91.88    | 92.13 | 9.16
Modified K-means (Al-Yaseen et al., 2017) | 41              | 95.75    | 95.17 | 1.87
Autoencoder (Choi et al., 2019)           | 41              | 91.70    | 84.68 | –
K-means                                   | 25              | 51.44    | 26.15 | 4.37
New K-means•                              | 25              | 88.13    | 80.16 | 5.30
Proposed                                  | 25              | 93.07    | 84.70 | 2.73

• ← Unsupervised mode of training method.

Table 14 presents the performance comparison of unsupervised intrusion detection systems that use 41 features, while the proposed method has selected 25 features. The detection rate and system accuracy of the proposed method are better than those of the other unsupervised IDSs.

Table 14 – Performance comparison of unsupervised intrusion detection systems; the best result is shown in boldface.

Technique                                 | #features | DR   | FAR
Initial model (Roshan et al., 2018)       | 41        | 0.74 | 0.0314
Standard ELM model (Roshan et al., 2018)  | 41        | 0.67 | 0.19
Updated model (Roshan et al., 2018)       | 41        | 0.84 | 0.0302
Proposed                                  | 25        | 0.85 | 0.0273

Fig. 4 shows the F-measure, DR, and Precision of the different unsupervised techniques, where F-measure is the harmonic mean of the detection rate (DR) and Precision. K-means (with K=5), Isolation Forest, and Zscore+K-means show lower performance, whereas Zscore + Nearest Neighbor, EF + MinMaxz + Autoencoder, EF + MinMaxz + SVM, and EF + MinMaxz + SCH show average performance. New K-means and the proposed method (for both K=79 and after merging into 5 clusters) achieve higher performance. Overall, the UFS and CCI based micro-clustering mechanisms perform better than the existing techniques.

Fig. 5 presents the performance comparison of the micro-clustering methods on different statistical parameters: DR, FAR, Precision, and Accuracy. Fig. 5a shows the performance of the New K-means clustering method; when the number of micro-clusters exceeds 70, the performance of the system rises. The CCI based micro-clustering method was executed with different initial semi-identical set lengths, {600, 450, 350, 250, 200, 150, 110, 90, 70, 10, 5}, which yield numbers of micro-clusters of {13, 22, 27, 35, 41, 49, 58, 64, 69, 79, 88}. Fig. 5b shows the performance for the different numbers of micro-clusters; it achieves the best performance at K=79.

Table 15 – Details of preprocessed sub-datasets and identified sub-clusters.

Sub-dataset               | Training samples | Nonredundant samples | Test samples | Weight of test set | Number of sub-clusters
Tuesday                   | 356727           | 158480               | 89182        | 0.1938             | 24
Wednesday                 | 554162           | 273014               | 138541       | 0.3010             | 28
Thursday Morning          | 136293           | 76991                | 34073        | 0.0741             | 13
Thursday AfterNoon        | 230882           | 111154               | 57720        | 0.1255             | 32
Friday Morning            | 152826           | 84110                | 38207        | 0.0830             | 19
Friday AfterNoon-DDoS     | 180596           | 83092                | 45149        | 0.0981             | 22
Friday AfterNoon-PortScan | 229174           | 41362                | 57293        | 0.1245             | 21

[Figure 4: horizontal bar chart comparing F-measure, Precision, and Detection Rate (in %, axis 0–80) of EF+MinMaxz+Autoencoder, EF+MinMaxz+SCH, EF+MinMaxz+SVM, Isolation Forest, Zscore+K-Means, Zscore+NearestNeighbor, K-Means, New K-Means, and the proposed method.]

Fig. 4 – Performance (in %) of different unsupervised mechanisms on KDD dataset.

We have compared the proposed method to the different unsupervised IDSs (Fig. 4), and it obtained better performance.

6.3. Performance analysis on CICIDS2017 dataset

CICIDS2017 and CSE-CIC-IDS2018 are benchmark datasets that overcome the shortcomings of outdated datasets. They are high volume and class-imbalanced datasets, which are themselves a source of challenges for unsupervised IDSs. A high volume dataset becomes a source of limitations, as it consumes more processing overhead. In particular, the clustering technique groups the data into almost equal spatial sizes, whereas the presence of data in the benign class is much higher than attack data; for example, the Thursday AfterNoon dataset contains benign (99.99%) and attack (0.01%) samples. Such class-imbalanced data show a higher FAR whenever the system falsely predicts even a small number of samples. We have therefore reported the ratio of falsely to correctly predicted samples (FP/TP) for the clustering techniques. Similarly, accuracy represents the system performance as the ratio of correctly predicted to total samples. The proposed method shows significant contributions for unlabeled network traffic data and improves the system performance. Table 15 shows the training and test sets, where the non-redundant samples are the distinct samples of the training set. The weight of the test set is used to compute the overall performance of the system (using Eq. (14)); it is the proportional weight of the test samples present in the set relative to the total test samples. We have identified the number of sub-clusters using the proposed scheme and finally merged them into clusters.

[Figure 5: performance (DR, FAR, Precision, Accuracy; vertical axis 0.0–0.9) versus the number of micro-clusters (K): (a) New K-means clustering, K from 0 to 100; (b) CCI based clustering, K from 10 to 90.]

Fig. 5 – Performance of micro-clusters.

Table 16 – Category-wise statistical parameters of the CCI based micro-clustering method.

Sub-dataset               | Class                    | TP    | TN     | FP    | FN
Tuesday                   | BENIGN                   | 86061 | 27     | 2716  | 378
                          | FTP Patator              | 27    | 87259  | 346   | 1550
                          | SSH Patator              | 0     | 87984  | 32    | 1166
Wednesday                 | BENIGN                   | 86338 | 35230  | 15240 | 1733
                          | Dos slowloris            | 323   | 136796 | 581   | 841
                          | Dos slowhttptest         | 250   | 137146 | 265   | 880
                          | Dos Hulk                 | 32162 | 90296  | 2106  | 13977
                          | Dos GoldenEye            | 663   | 135897 | 610   | 1371
                          | Heartbleed               | 0     | 138535 | 3     | 3
Thursday Morning          | BENIGN                   | 29465 | 337    | 116   | 4155
                          | Web Attack-Brute Force   | 259   | 30124  | 3623  | 67
                          | Web Attack-XSS           | 3     | 33346  | 605   | 119
                          | Web Attack-sql injection | 0     | 34066  | 2     | 5
Thursday AfterNoon        | BENIGN                   | 57667 | 3      | 8     | 42
                          | Infiltration             | 3     | 57667  | 42    | 8
Friday Morning            | BENIGN                   | 37813 | 6      | 383   | 5
                          | Bot                      | 6     | 37813  | 5     | 383
Friday AfterNoon-DDoS     | BENIGN                   | 34652 | 2361   | 6190  | 1946
                          | DDoS                     | 2361  | 34652  | 1946  | 6190
Friday AfterNoon-PortScan | BENIGN                   | 19490 | 17416  | 14235 | 6152
                          | PortScan                 | 17416 | 19490  | 6152  | 14235
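As a sanity check that the category-wise results follow from the Table 16 parameters, the sketch below recomputes the Tuesday/BENIGN entries of Table 17 for the CCI based method (the helper names are ours).

```python
def dr(tp, fn):
    """Eq. (9): detection rate."""
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    """Eq. (13): accuracy."""
    return (tp + tn) / (tp + tn + fp + fn)

tp, tn, fp, fn = 86061, 27, 2716, 378        # Tuesday, BENIGN (Table 16)
print(round(dr(tp, fn), 4))                  # 0.9956, as in Table 17
print(round(accuracy(tp, tn, fp, fn), 4))    # 0.9653, as in Table 17
```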

We have executed K-means, New K-means, and CCI based micro-clustering, which provide results in the form of confusion matrices. Table 16 presents the statistical parameters (TP, TN, FP, FN) of the CCI based micro-clustering method; this set of parameters yields the performance of the system through the related equations. Similarly, we have computed the performance of the remaining clustering techniques and then presented their results.

Table 17 shows the category-wise performance of BRS and the clustering techniques.

Table 17 – Category-wise performance of IDSs.

Sub-dataset               | Class                    | Performance | BRS    | K-means | New K-means | CCI based
Tuesday                   | BENIGN                   | DR          | 0.9738 | 0.8970  | 0.9466      | 0.9956
                          |                          | Accuracy    | 0.9746 | 0.8695  | 0.9330      | 0.9653
                          | FTP Patator              | DR          | 1.0000 | 0.0000  | 0.5079      | 0.0171
                          |                          | Accuracy    | 0.9977 | 0.9823  | 0.9392      | 0.9787
                          | SSH Patator              | DR          | 0.4602 | 0.0017  | 0.0017      | 0.0000
                          |                          | Accuracy    | 0.9930 | 0.8872  | 0.9807      | 0.9866
Wednesday                 | BENIGN                   | DR          | 0.9872 | 0.6199  | 0.8128      | 0.9803
                          |                          | Accuracy    | 0.9874 | 0.6440  | 0.8281      | 0.8775
                          | Dos slowloris            | DR          | 0.8656 | 0.3213  | 0.2895      | 0.2775
                          |                          | Accuracy    | 0.9988 | 0.9341  | 0.9807      | 0.9897
                          | Dos slowhttptest         | DR          | 0.9468 | 0.5354  | 0.5088      | 0.2212
                          |                          | Accuracy    | 0.9994 | 0.9341  | 0.9779      | 0.9917
                          | Dos Hulk                 | DR          | 0.9892 | 0.6082  | 0.8545      | 0.6977
                          |                          | Accuracy    | 0.9876 | 0.8547  | 0.8866      | 0.8839
                          | Dos GoldenEye            | DR          | 0.9807 | 0.5088  | 0.3904      | 0.3260
                          |                          | Accuracy    | 0.9996 | 0.9667  | 0.9839      | 0.9857
                          | Heartbleed               | DR          | 1.0000 | 0.0000  | 1.0000      | 0.0000
                          |                          | Accuracy    | 1.0000 | 0.8887  | 0.9701      | 0.9999
Thursday Morning          | BENIGN                   | DR          | 0.9976 | 0.6370  | 0.8263      | 0.8764
                          |                          | Accuracy    | 0.9960 | 0.6286  | 0.8276      | 0.8747
                          | Web Attack-Brute Force   | DR          | 0.5634 | 0.0000  | 0.8589      | 0.7945
                          |                          | Accuracy    | 0.9926 | 0.7640  | 0.9438      | 0.8917
                          | Web Attack-XSS           | DR          | 0.4607 | 0.0164  | 0.0328      | 0.0246
                          |                          | Accuracy    | 0.9933 | 0.8823  | 0.9314      | 0.9788
                          | Web Attack-sql injection | DR          | 0.5294 | 0.0000  | 0.6000      | 0.0000
                          |                          | Accuracy    | 0.9995 | 0.9823  | 0.9446      | 0.9998
Thursday AfterNoon        | BENIGN                   | DR          | 1.0000 | 0.9176  | 0.9748      | 0.9993
                          |                          | Accuracy    | 0.9999 | 0.9176  | 0.9746      | 0.9991
                          | Infiltration             | DR          | 0.9444 | 0.8182  | 0.2727      | 0.2727
                          |                          | Accuracy    | 0.9999 | 0.9176  | 0.9746      | 0.9991
Friday Morning            | BENIGN                   | DR          | 0.8705 | 0.8911  | 0.9267      | 0.9999
                          |                          | Accuracy    | 0.8719 | 0.8820  | 0.9898      | 0.9898
                          | Bot                      | DR          | 1.0000 | 0.0000  | 0.3805      | 0.0154
                          |                          | Accuracy    | 0.8719 | 0.8820  | 0.9898      | 0.9898
Friday AfterNoon-DDoS     | BENIGN                   | DR          | 0.8486 | 0.8881  | 0.8188      | 0.9468
                          |                          | Accuracy    | 0.8766 | 0.7722  | 0.7502      | 0.8198
                          | DDoS                     | DR          | 0.9990 | 0.2761  | 0.4568      | 0.2761
                          |                          | Accuracy    | 0.8766 | 0.7722  | 0.7502      | 0.8198
Friday AfterNoon-PortScan | BENIGN                   | DR          | 0.9566 | 0.1099  | 0.8358      | 0.7601
                          |                          | Accuracy    | 0.9808 | 0.6010  | 0.6790      | 0.6442
                          | PortScan                 | DR          | 1.0000 | 0.9987  | 0.5520      | 0.5503
                          |                          | Accuracy    | 0.9808 | 0.6010  | 0.6790      | 0.6442

Table 17 shows the category-wise performance of the BRS and the clustering techniques. These have been executed with the same training and testing sets and differ only in their method-specific steps. The table reports performance as detection rate and detection accuracy. The proposed CCI based method shows better results than the other clustering techniques. BRS (Prasad et al., 2020) is a supervised intrusion detection method that has worked on the same dataset, while the proposed method is an unsupervised training method. BRS has explicitly shown the feature sets, attacks, and data generation environment, and has evaluated the qualitative realism of the datasets. The comparative results show that CICIDS2017 maintains high quality.
Table 18 presents the performance of each subset of CICIDS2017 on different statistical parameters, such as DR, Precision, FP/TP, FAR, and Accuracy. The proposed method performs better for almost all subsets than the clustering technique K-means and the micro-clustering method New K-means. Each micro-clustering technique is performed with the same number of sub-clusters per subset as the proposed method; finally, the sub-clusters are merged into macro-clusters. Moreover, Table 19 shows the overall (weighted average over the test subsets) performance of the unsupervised IDSs, computed from the performances of the sub-datasets or data-generation days (Table 18) and their weights (Table 15). This computation carries an additional statistical parameter, F-measure, the harmonic mean of detection rate and precision. Accuracy shows the correctly predicted performance, and FP/TP the falsely predicted performance of the system. These statistical measures confirm that the proposed method performs better than the other unsupervised IDSs.
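To make the aggregation concrete, the sketch below shows the weighted average and the F-measure; the function names are ours, and the actual weights come from Table 15, which is not reproduced in this section:

```python
# Sketch of the aggregation behind Table 19 (names are ours): overall scores
# are weighted averages of per-sub-dataset scores, weighted by sub-dataset
# size (Table 15); F-measure is the harmonic mean of DR and precision.
def weighted_average(scores, weights):
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

def f_measure(dr, precision):
    return 2 * dr * precision / (dr + precision)

# Example: overall F-measure of the proposed method in Table 19
print(round(f_measure(0.8800, 0.8857), 4))      # 0.8828
```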

Table 18 – Sub-dataset (or data-generation-day) wise weighted average performance of clustering techniques.

Sub-dataset                Clustering Technique  DR      Precision  FP/TP   FAR     Accuracy
Tuesday                    K-means               0.8695  0.9361     0.1501  0.9698  0.8717
                           New K-means           0.9264  0.9560     0.0794  0.4805  0.9337
                           Proposed              0.9653  0.9480     0.0360  0.9597  0.9658
Wednesday                  K-means               0.6111  0.8072     0.6363  0.2083  0.7237
                           New K-means           0.8136  0.8571     0.2291  0.1251  0.8524
                           Proposed              0.8643  0.8675     0.1571  0.1996  0.8831
Thursday Morning           K-means               0.6286  0.9664     0.5908  0.9850  0.6309
                           New K-means           0.8237  0.9867     0.2140  0.0770  0.8291
                           Proposed              0.8725  0.9835     0.1462  0.2537  0.8752
Thursday AfterNoon         K-means               0.9176  0.9998     0.0898  0.1818  0.9176
                           New K-means           0.9746  0.9997     0.0260  0.7271  0.9746
                           Proposed              0.9991  0.9997     0.0009  0.7271  0.9991
Friday Morning             K-means               0.8821  0.9786     0.1338  0.9910  0.8820
                           New K-means           0.9212  0.9837     0.0856  0.6140  0.9211
                           Proposed              0.9899  0.9855     0.0103  0.9746  0.9898
Friday AfterNoon-DDoS      K-means               0.7722  0.7502     0.2950  0.6079  0.7722
                           New K-means           0.7502  0.7720     0.3329  0.4746  0.7502
                           Proposed              0.8198  0.7916     0.2199  0.5968  0.8198
Friday AfterNoon-PortScan  K-means               0.6009  0.7621     0.6641  0.4922  0.6009
                           New K-means           0.6790  0.7145     0.4727  0.2912  0.6790
                           Proposed              0.6441  0.6669     0.5524  0.3338  0.6442

Table 19 – Overall performance of unsupervised IDSs on CICIDS2017.

Clustering Technique  DR      Precision  F-measure  FP/TP   FAR     Accuracy
K-means               0.7380  0.8712     0.7991     0.3984  0.5496  0.7725
New K-means           0.8424  0.8882     0.8645     0.2021  0.3615  0.8559
Proposed              0.8800  0.8857     0.8828     0.1564  0.5371  0.8860

Table 20 summarises the comparative results of the proposed unsupervised IDS. It is compared with both modes of training, i.e., supervised and unsupervised IDSs. The table marks the unsupervised techniques, which are trained on unlabeled training sets, whereas the supervised methods are trained on labeled training sets. The proposed method applies unsupervised feature selection and selects 58 significant features. The supervised methods are executed on the 80 extracted features without assessing feature significance, except BRS (Prasad et al., 2020). The proposed method performs better than some of the supervised training methods and all of the unsupervised techniques. Moreover, it confirms that it is suitable for classifying the behavior of unlabeled network traffic into benign and attack categories.
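As an illustration of this reduction step, the sketch below ranks features with a generic stand-in score; it is not the authors' scoring function, and the data matrix is a placeholder:

```python
# Illustration only: a generic unsupervised feature-ranking step, not the
# authors' scoring function. Low-scoring features contribute little and are
# dropped; here 58 of the 80 extracted features are kept, as in Table 20.
import numpy as np

def select_features(X, keep=58):
    scores = X.var(axis=0)              # stand-in score: per-feature variance
    order = np.argsort(scores)[::-1]    # highest-scoring features first
    return order[:keep]                 # indices of the retained features

X = np.random.rand(1000, 80)            # placeholder flow-feature matrix
print(select_features(X).shape)         # (58,)
```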
6.4. Performance analysis on wormhole dataset

New K-means clustering is similar to basic clustering in that it randomly selects the initial cluster centers, whereas the proposed method is CCI based clustering. Both clustering methods create micro-clusters and merge them into clusters. Table 21 shows the statistical parameters of the clustering methods. Here, the number of micro-clusters is 31; New K-means takes 18 iterations, while the proposed method takes on average 10 iterations to form the final micro-clusters.
Table 22 shows the performance of the detection methods, where the proposed method outperforms the state-of-the-art method. The experimental results confirm that the CCI based clustering method performs better than basic clustering. Traditional approaches have shown at most 90% detection rates for the wormhole attack in MANETs (Prasad et al., 2019). This work confirms that machine learning algorithms perform better than traditional algorithms, and that the proposed method performs better than the existing detection methods.
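A minimal sketch of this comparison is given below; it is illustrative only, uses scikit-learn's k-means in place of the authors' implementation, and treats the data and the CCI centers as placeholders:

```python
# Illustrative sketch, not the authors' code: forming the 31 micro-clusters of
# Section 6.4, comparing random initialization (as in New K-means) against
# precomputed CCI centers. Merging micro-clusters into clusters is omitted.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(1000, 20)            # placeholder for the Wormhole data
cci_centers = X[:31]                    # placeholder for CCI-derived centers

random_init = KMeans(n_clusters=31, init="random", n_init=1).fit(X)
cci_init = KMeans(n_clusters=31, init=cci_centers, n_init=1).fit(X)

# Dense, outlier-free initial centers typically converge in fewer iterations
# (the paper reports 18 for New K-means against an average of 10 here).
print(random_init.n_iter_, cci_init.n_iter_)
```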

Table 20 – Performance comparison of IDSs on CICIDS2017.

Technique                            No. of Features  DR    Precision  F-measure
Adaboost (Sharafaldin et al., 2018)  80               0.84  0.77       0.80
MLP (Sharafaldin et al., 2018)       80               0.83  0.77       0.79
QDA (Sharafaldin et al., 2018)       80               0.88  0.97       0.92
KNN (Sharafaldin et al., 2018)       80               0.96  0.96       0.96
RF (Sharafaldin et al., 2018)        80               0.97  0.98       0.97
ID3 (Sharafaldin et al., 2018)       80               0.98  0.98       0.98
BRS (Prasad et al., 2020)            40               0.96  0.96       0.96
K-means•                             58               0.74  0.87       0.80
New K-means•                         58               0.84  0.89       0.86
Proposed                             58               0.88  0.89       0.88

• Unsupervised mode of training method.

Table 21 – Statistical parameters on the test set of the Wormhole dataset.

             New K-means                 Proposed
Parameter    Normal   Attack   W.Avg     Normal   Attack   W.Avg
TP           12894    43843    –         13304    43588    –
TN           43843    12894    –         43588    13304    –
FP           2143     1605     –         2398     1195     –
FN           1605     2143     –         1195     2398     –
DR           0.8893   0.9534   0.9380    0.9176   0.9479   0.9406
FAR          0.0466   0.1106   0.0953    0.0521   0.0824   0.0752
Precision    0.8574   0.9647   0.9390    0.8473   0.9733   0.9431
F-measure    0.8731   0.9590   0.9384    0.8810   0.9604   0.9413
Accuracy     0.9380   0.9380   0.9380    0.9406   0.9406   0.9406

Table 22 – Performance comparison (in %) of detection methods on the Wormhole dataset; the best result is shown in boldface.

Technique                          DR     FAR   Precision  F-measure  Accuracy
Naive Bayes (Prasad et al., 2019)  93.12  5.3   94.0       93.4       93.06
SGD (Prasad et al., 2019)          93.12  5.3   94.0       93.4       93.08
New K-means                        93.80  9.53  93.90      93.84      93.80
Proposed                           94.06  7.52  94.31      94.13      94.06

7. Conclusion

This research work proposed a new clustering method for unsupervised intrusion detection. It is based on a novel approach of unsupervised feature selection and cluster center initialization. The proposed method selects essential features, and the process of computing semi-identical instances gives the number of micro-clusters, the initial cluster centers, and a range of clusters. It avoids outliers as initial cluster centers and avoids more than one initial center from the same cluster space. The quality of the results for this proposal has been tested and compared with the state-of-the-art methods. Mainly, the proposed unsupervised feature selection and cluster center initialization approach boosts the clustering method. It is trained and tested on KDD'99, CICIDS2017, and the Wormhole dataset. Experimental results show a better detection rate and system accuracy for attack and non-attack classes.
The system is tested against existing methods, and the results are found to be encouraging. The implication of the proposed method is a demonstration of the fact that unsupervised feature selection and cluster center initialization reduce training complexity. The method discards the features that have negligible contributions.
Experimental results confirm that the performance of the proposed method is better than that of systems without feature reduction, and that CCI based clustering is better than the basic clustering method. The proposed unsupervised intrusion detection system can be used to provide security in MANETs and in organizational and social areas where intruders are more active. The proposed work is convincing, although it has some limitations, such as manual preprocessing and the space and time complexities for MANETs. The system achieves promising performance, which allows us to extend it and resolve these limitations. The extension may analyze these complexities and accommodate a probabilistic approach for feature scores.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Features of the Wormhole dataset

Feature  Feature name                  Type
f1       duration                      real
f2       protocol                      string
f3       packet size                   integer
f4       flag                          integer
f5       header length                 integer
f6       hop count                     integer
f7       life time                     integer
f8       message type                  string
f9       destination sequence number   integer
f10      message sequence number       integer
f11      stream index                  integer
f12      land                          integer
f13      message transfer mode         binary
f14      number of neighbors           integer
f15      highest flow                  integer
f16      average flow                  real
f17      lowest flow                   integer
f18      average hop count             integer
f19      number of failed connections  integer
f20      failed connection rate        real
f21      label                         string

CRediT authorship contribution statement

Mahendra Prasad: Conceptualization, Methodology, Software, Data curation, Validation, Formal analysis, Writing - original draft. Sachin Tripathi: Supervision, Resources, Writing - review & editing. Keshav Dahal: Supervision, Resources, Writing - review & editing.

References

Aburomman AA, Reaz MBI. A novel SVM-KNN-PSO ensemble method for intrusion detection system. Appl. Soft Comput. 2016;38:360–72.
Al-Jarrah OY, Al-Hammdi Y, Yoo PD, Muhaidat S, Al-Qutayri M. Semi-supervised multi-layered clustering model for intrusion detection. Digit. Commun. Netw. 2018;4(4):277–86.
Al-Yaseen WL, Othman ZA, Nazri MZA. Multi-level hybrid support vector machine and extreme learning machine based on modified k-means for intrusion detection system. Expert Syst. Appl. 2017;67:296–303.
Ambusaidi MA, He X, Nanda P, Tan Z. Building an intrusion detection system using a filter-based feature selection algorithm. IEEE Trans. Comput. 2016;65(10):2986–98.
Bostani H, Sheikhan M. Hybrid of binary gravitational search algorithm and mutual information for feature selection in intrusion detection systems. Soft Comput. 2017;21(9):2307–24.
Bouhaddi M, Radjef MS, Adi K. An efficient intrusion detection in resource-constrained mobile ad-hoc networks. Comput. Secur. 2018;76:156–77.
Carrasco RSM, Sicilia M-A. Unsupervised intrusion detection through skip-gram models of network behavior. Comput. Secur. 2018;78:187–97.
Choi H, Kim M, Lee G, Kim W. Unsupervised learning approach for network intrusion detection system using autoencoders. J. Supercomput. 2019:1–25.
Ghosh S, Prasad PS, Rao CR. An efficient Gaussian kernel based fuzzy-rough set approach for feature selection. In: International Workshop on Multi-disciplinary Trends in Artificial Intelligence. Springer; 2016. p. 38–49.
Gong W, Zhao R, Grünewald S. Structured sparse k-means clustering via Laplacian smoothing. Pattern Recognit. Lett. 2018;112:63–9.
Huang J, Zhu Q, Yang L, Cheng D, Wu Q. QCC: a novel clustering algorithm based on quasi-cluster centers. Mach. Learn. 2017;106(3):337–57.
Hyde R, Angelov P, MacKenzie AR. Fully online clustering of evolving data streams into arbitrarily shaped clusters. Inf. Sci. 2017;382:96–114.
Iglesias F, Zseby T. Analysis of network traffic features for anomaly detection. Mach. Learn. 2015;101(1–3):59–84.
Kang S-H, Kim KJ. A feature selection approach to find optimal feature subsets for the network intrusion detection system. Cluster Comput. 2016;19(1):325–33.
Kumar KM, Reddy ARM. An efficient k-means clustering filtering algorithm using density based initial cluster centers. Inf. Sci. 2017;418:286–301.
Lin W-C, Ke S-W, Tsai C-F. CANN: an intrusion detection system based on combining cluster centers and nearest neighbors. Knowl.-Based Syst. 2015;78:13–21.
Ma G, Xu Z, Zhang W, Li S. An enriched k-means clustering method for grouping fractures with meliorated initial centers. Arabian J. Geosci. 2015;8(4):1881–93.
Masud MA, Huang JZ, Wei C, Wang J, Khan I, Zhong M. I-nice: a new approach for identifying the number of clusters and initial cluster centres. Inf. Sci. 2018;466:129–51.
Meira J, Andrade R, Praça I, Carneiro J, Bolón-Canedo V, Alonso-Betanzos A, Marreiros G. Performance evaluation of unsupervised techniques in cyber-attack anomaly detection. J. Ambient Intell. Humanized Comput. 2019:1–13.
Mishra P, Varadharajan V, Tupakula U, Pilli ES. A detailed investigation and analysis of using machine learning techniques for intrusion detection. IEEE Commun. Surv. Tutor. 2018;21(1):686–728.
Peng L, Liu Y. Attribute weights-based clustering centres algorithm for initialising k-modes clustering. Cluster Comput. 2018:1–9.

Prasad M, Tripathi S, Dahal K. Wormhole attack detection in ad hoc network using machine learning technique. In: 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT). IEEE; 2019. p. 1–7.
Prasad M, Tripathi S, Dahal K. An efficient feature selection based Bayesian and rough set approach for intrusion detection. Appl. Soft Comput. 2020;87:105980.
Ring M, Wunderlich S, Scheuring D, Landes D, Hotho A. A survey of network-based intrusion detection data sets. Comput. Secur. 2019;86:147–67.
Roshan S, Miche Y, Akusok A, Lendasse A. Adaptive and online network intrusion detection system using clustering and extreme learning machines. J. Franklin Inst. 2018;355(4):1752–79.
Sharafaldin I, Lashkari AH, Ghorbani AA. Toward generating a new intrusion detection dataset and intrusion traffic characterization. In: ICISSP; 2018. p. 108–16.
Tavallaee M, Bagheri E, Lu W, Ghorbani AA. A detailed analysis of the KDD CUP 99 data set. In: Computational Intelligence for Security and Defense Applications, 2009. CISDA 2009. IEEE Symposium on. IEEE; 2009. p. 1–6.
Tiruvakadu DSK, Pallapa V. Confirmation of wormhole attack in MANETs using honeypot. Comput. Secur. 2018;76:32–49.
Wang H, Zhang Y, Zhang J, Li T, Peng L. A factor graph model for unsupervised feature selection. Inf. Sci. 2019;480:144–59.
Wu Y, Wang C, Bu J, Chen C. Group sparse feature selection on local learning based clustering. Neurocomputing 2016;171:1118–30.
Xie T, Ren P, Zhang T, Tang YY. Distribution preserving learning for unsupervised feature selection. Neurocomputing 2018;289:231–40.
Yin C, Zhang S, Yin Z, Wang J. Anomaly detection model based on data stream clustering. Cluster Comput. 2017:1–10.
Zhao Y, Ming Y, Liu X, Zhu E, Zhao K, Yin J. Large-scale k-means clustering via variance reduction. Neurocomputing 2018;307:184–94.
Zhu P, Zhu W, Hu Q, Zhang C, Zuo W. Subspace clustering guided unsupervised feature selection. Pattern Recognit. 2017;66:364–74.

Mahendra Prasad received the B.Tech degree in Information Technology from Rajasthan Technical University, Kota, India, and the M.Tech degree in Artificial Intelligence from the University of Hyderabad, India. He is currently a Senior Research Fellow with the Department of Computer Science and Engineering, IIT (ISM), Dhanbad, India, pursuing his Ph.D. in the field of machine learning and network security. E-mail: je.mahendra@gmail.com. Mail address: Department of Computer Science and Engineering, IIT (ISM), Dhanbad, Jharkhand-826004.

Sachin Tripathi received the B.Tech degree from Chhatrapati Shahu Ji Maharaj University, Kanpur, India, and the M.Tech and Ph.D. degrees in computer science and engineering from the Indian Institute of Technology (ISM), Dhanbad, India. He is currently working as an associate professor with the Indian Institute of Technology (ISM), Dhanbad, India. He has been teaching computer science subjects for more than fourteen years and has contributed about 95 research papers. He has authored a book titled "Enhancements on Internet Applications: Multicast, Secure E-Mail Messaging and E-Business". He is an Associate Editor of the "International Journal of Communication System", Wiley. His research interests mainly focus on group security, ad-hoc networks, and artificial intelligence. E-mail: var_1285@yahoo.com. Mail address: Department of Computer Science and Engineering, IIT (ISM), Dhanbad, Jharkhand-826004.

Keshav Dahal is a Professor of Intelligent Systems and the leader of the Artificial Intelligence, Visual Communication and Network (AVCN) Research Centre at the University of the West of Scotland (UWS), UK. He is also affiliated with Nanjing University of Information Science and Technology (NUIST), China. Before joining UWS he was with Bradford and Strathclyde Universities in the UK. He obtained his Ph.D. and Master's degrees from the University of Strathclyde, UK. His research interests lie in the areas of applied AI for intelligent systems, trust and security modeling in distributed systems, and scheduling/optimization problems. He has published extensively, with award-winning papers, and has sat on the organizing/program committees of over 60 international conferences, including as General Chair and Programme Chair. He is a senior member of the IEEE.
