Bot Detection System Using CNN Algorithm
Paper [9] proposed an Android Botnet Identification System (ABIS) for checking Android applications in order to detect botnets. ABIS utilized both static and dynamic features from API calls, permissions and network traffic. The system is evaluated by using several machine learning algorithms, with Random Forest obtaining a precision of 0.972 and a recall of 0.969. In [10], a method is proposed for Android botnet detec-

Unlike most existing studies, our paper proposes a deep learning based Android botnet detection system, using Convolutional Neural Networks. Also, unlike previous studies that utilize only the app permissions, our system is based on 342 features that represent Permissions, API calls, Commands, Extra Files, and Intents. Furthermore, different from the study in [9] which utilized only permissions, we do not convert feature vectors into images prior to model training. Instead, our feature vectors are used directly to train 1D CNN models. This makes our approach computationally less demanding.

III. BACKGROUND

A. The CNN-based classification system
The classification system is built by extracting static features from the corpus of botnet and clean samples. To achieve this, we used our bespoke tool built in Python for automated reverse engineering of APKs. With the help of the tool, we extracted 342 features consisting of five different types (see Table 2) from all the training apps. The five feature types include: API calls extracted from the executable; Permissions and Intents from the manifest file; Commands and Extra Files from the APK. These features are represented as vectors of binary numbers with each feature in the vector represented by a ‘1’ or ‘0’. Each feature vector (corresponding to one application) is labelled with its class. The feature vectors are loaded into the CNN model and used to train the model. After training, an unknown application can be predicted to be either ‘clean’ or ‘botnet’ by applying its own extracted feature vector to the trained model. The process is depicted in Figure 1.

deeper layers of the CNN, hence, the number of layers required depends on the complexity and non-linearity of the data being analysed. Furthermore, the number of filters in each stage determines the number of features extracted. Computational complexity increases with more layers and higher numbers of filters. Also, with more complex architectures, there is the possibility of training an overfitted model, which results in poor prediction accuracy on the testing set(s). To reduce overfitting, techniques such as ‘dropout’ [22] and ‘batch regularization’ are implemented during training of our models.

C. One Dimensional Convolutional Neural Networks
Although CNNs are more commonly applied in a multi-dimensional fashion and have thus found success in image and video analysis-based problems, they can also be applied to one-dimensional data. Datasets that possess a one-dimensional structure can be processed using a one-dimensional convolutional neural network (1D CNN). The key difference between a 1D and a 2D or 3D CNN is the dimensionality of the input data and how the filter (feature detector) slides across the data. For a 1D CNN, the filters only slide across the input data in one direction. A 1D CNN is quite effective when you expect to derive interesting features from shorter (fixed-length) segments of the overall feature set, and where the location of the feature within the segment is not of high relevance.
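To make the one-directional sliding concrete, the following is a minimal pure-Python sketch of a single 1D filter followed by ReLU and max pooling. The toy input, the filter weights and the helper names (`conv1d_valid`, `relu`, `max_pool1d`) are illustrative assumptions and are not taken from the paper's implementation; no padding and a stride of 1 are also assumed.

```python
def conv1d_valid(x, kernel, bias=0.0):
    """Slide `kernel` along `x` in one direction (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(w * v for w, v in zip(kernel, x[i:i + k])) + bias
        for i in range(len(x) - k + 1)
    ]


def relu(values):
    """Element-wise rectifier: negative responses are clipped to zero."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


# Toy binary feature vector (the real vectors have 342 entries).
x = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
# Hypothetical length-4 filter that responds to the local pattern 1, 0, 0, 1.
kernel = [1.0, -1.0, -1.0, 1.0]

feature_map = relu(conv1d_valid(x, kernel))
pooled = max_pool1d(feature_map, size=2)

print(len(feature_map))  # 10 - 4 + 1 = 7 filter positions
print(feature_map)       # [2.0, 0.0, 0.0, 1.0, 0.0, 0.0, 2.0]
print(pooled)            # [2.0, 1.0, 0.0]
```

The pattern 1, 0, 0, 1 occurs at positions 0 and 6 of the toy input and produces the same response (2.0) at both, which illustrates why the exact location of a pattern within the vector matters less to a convolutional filter than to a fully connected unit.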
Table 1 (partial; only these rows are recoverable): botnet families and number of samples.
Droiddream      363
Geinimi         264
Misosms         100
Nickyspy        199
Notcompatible   76
IV. METHODOLOGY AND EXPERIMENTS
In this section we present the experiments undertaken to evaluate the CNN models developed in this paper. Our models were implemented using Python and utilized the Keras library with TensorFlow backend. Other libraries used include Scikit-learn, Seaborn, Pandas, and NumPy. The model was built and evaluated on an Ubuntu Linux 16.04 64-bit machine with 4GB RAM.

Table 2: The five different types of features used to train the CNN model.
Feature type    Number
API calls       135
Permissions     130
Commands        19
Extra files     5
Intents         53
Total           342
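As an illustration of how the five feature types can share one 342-element binary vector, the sketch below derives a block layout from the counts in Table 2. The ordering of the blocks, the `encode` helper and the example indices are assumptions made for illustration; the paper does not state how its extraction tool orders the features.

```python
from itertools import accumulate

# Feature counts per type, taken from Table 2.
FEATURE_COUNTS = {
    "API calls": 135,
    "Permissions": 130,
    "Commands": 19,
    "Extra files": 5,
    "Intents": 53,
}

# Assumed layout: the blocks are concatenated in the order listed above.
ends = list(accumulate(FEATURE_COUNTS.values()))
starts = [0] + ends[:-1]
SEGMENTS = {name: (s, e) for name, s, e in zip(FEATURE_COUNTS, starts, ends)}
VECTOR_LENGTH = ends[-1]


def encode(present):
    """Build a binary vector from {feature type: indices within that type}."""
    vector = [0] * VECTOR_LENGTH
    for name, indices in present.items():
        start, end = SEGMENTS[name]
        for i in indices:
            if not 0 <= i < end - start:
                raise IndexError(f"{name} has no feature with index {i}")
            vector[start + i] = 1
    return vector


# Hypothetical app in which two permissions and one intent were detected.
vec = encode({"Permissions": {0, 7}, "Intents": {3}})
print(VECTOR_LENGTH, sum(vec), SEGMENTS["Permissions"])  # 342 3 (135, 265)
```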
A. Problem definition
Let $A = \{a_1, a_2, \ldots, a_m\}$ be a set of $m$ apps, where each $a_i$ is represented by a vector containing the values of $n$ features (where $n = 342$). Let $a = \{f_1, f_2, f_3, \ldots, f_n, cl\}$, where $cl \in \{\mathit{botnet}, \mathit{normal}\}$ is the class label assigned to the app. Thus, $A$ can be used to train the model to learn the behaviours of botnet and normal apps respectively. The goal of a trained model is then to classify a given unlabelled app $A_{unknown} = \{f_1, f_2, f_3, \ldots, f_n, ?\}$ by assigning a label $cl$, where $cl \in \{\mathit{botnet}, \mathit{normal}\}$.

B. Dataset
In this study we used the Android dataset from [5], which is known as the ISCX botnet dataset. The ISCX dataset contains 1,929 botnet apps (from 14 different families) and has been used in previous works including [4], [7-10], and [17]. The botnet families are shown in Table 1. A total of 4,873 clean apps were used for the study in this paper and these were labelled under the category ‘normal’ to facilitate supervised learning when training the CNN and other machine learning classifiers. The clean apps were obtained from different categories of apps on the Google Play store and verified to be non-malicious by using VirusTotal.

The 342 static features extracted from the apps for model training were of 5 types: (a) API calls (b) commands (c) permissions (d) Intents (e) extra files. The ‘API calls’ and ‘per-

Table 3: Some of the prominent static features extracted from Android applications for training the CNN model to detect Android Botnets.
Feature name                                 Type
TelephonyManager.*getDeviceId                API
TelephonyManager.*getSubscriberId            API
abortBroadcast                               API
SEND_SMS                                     Permission
DELETE_PACKAGES                              Permission
PHONE_STATE                                  Permission
SMS_RECIVED                                  Permission
Ljava.net.InetSocketAddress                  API
READ_SMS                                     Permission
Android.intent.action.BOOT_COMPLETED         Intent
io.File.*delete(                             API
chown                                        Command
chmod                                        Command
Mount                                        Command
.apk                                         Extra File
.zip                                         Extra File
.dex                                         Extra File
.jar                                         Extra File
CAMERA                                       Permission
ACCESS_FINE_LOCATION                         Permission
INSTALL_PACKAGES                             Permission
android.intent.action.BATTERY_LOW            Intent
.so                                          Extra File
android.intent.action.POWER_CONNECTED        Intent
System.*LoadLibrary                          API
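The class sizes given in Section IV.B imply an imbalanced dataset. The short calculation below works out the class shares and the accuracy of a trivial classifier that always predicts ‘normal’. It is only an illustrative baseline computed from the stated sample counts, not a result reported in the paper.

```python
BOTNET_APPS = 1929  # ISCX botnet samples (Section IV.B)
NORMAL_APPS = 4873  # clean apps from Google Play (Section IV.B)

total = BOTNET_APPS + NORMAL_APPS
botnet_share = BOTNET_APPS / total
# A classifier that always predicts the majority class ('normal') is right
# on every normal app and wrong on every botnet app.
majority_baseline_accuracy = max(BOTNET_APPS, NORMAL_APPS) / total

print(total)                                # 6802
print(f"{botnet_share:.3f}")                # 0.284
print(f"{majority_baseline_accuracy:.3f}")  # 0.716
```

Such a baseline would reach roughly 72% accuracy while detecting no botnets at all, which is one reason to read accuracy together with recall and F1-score for the botnet class.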
C. Experiments to evaluate the proposed CNN based model
In order to investigate the performance of our proposed model, we performed different sets of experiments. Table 4 shows the configuration of the CNN model. The 1D CNN model consists of two pairs of convolutional and maxpooling layers as shown in Figure 2. The output of the second max pooling layer is flattened and passed on to a fully connected layer with 8 units. This is in turn connected to a sigmoid activated output layer containing one unit.

The first set of experiments was aimed at evaluating the impact of the number of filters on the model’s performance. The second set of experiments was performed to evaluate the effect of varying the length of the filters. In the third, we investigate the impact of the maxpooling size on performance.

Table 4: Summary of model configurations.
Model design summary - 1D CNN
Input layer: Dimension = 342 (feature vector size)
1D Convolutional layer: 4, 8, 16, 32, 64 filters, size = 4, 8, 16, 32, 64 (with number of filters = 32)
MaxPooling layer: Size = 2, 4, 8, 16 (with number of filters = 32)
1D Convolutional layer: 4, 8, 16, 32, 64 filters, size = 4, 8, 16, 32, 64 (with number of filters = 32)
MaxPooling layer: Size = 2, 4, 8, 16 (with number of filters = 32)
Fully Connected (Dense) layer: 8 units, activation = ReLU
Output layer: Fully Connected layer; 1 unit, activation = sigmoid

In order to measure model performance, we used the following metrics: accuracy, precision, recall and F1-score. The metrics are defined as follows (taking botnet class as positive):

Accuracy: Defined as the ratio between correctly predicted outcomes and the sum of all predictions. It is given by: $\frac{TP + TN}{TP + TN + FP + FN}$

Precision: All true positives divided by all positive predictions, i.e. was the model right when it predicted positive? Given by: $\frac{TP}{TP + FP}$

Recall: True positives divided by all actual positives. That is, how many positives did the model identify out of all possible positives? Given by: $\frac{TP}{TP + FN}$

F1-score: This is the harmonic mean of precision and recall, given by: $\frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}$

of all 10 results is then taken to produce the final result. Also, during the training of the CNN models (for each fold), 10% of the training set was used for validation.

V. RESULTS AND DISCUSSIONS

A. Varying the numbers of filters.
In this section, we examine the results from experimenting with different numbers of filters. In our model, we kept the number of filters in both convolutional layers the same. Table 5 shows the results from running the 1D CNN model with different numbers of filters. From the table, it is evident that the number of filters had an effect on the performance of the model. When increased from 4 to 8, there is an improvement in performance. The performance does not improve further until we reach 32 filters. It then drops again when we increase this to 64. Based on these results we select 32 filters as the optimal configuration parameter for the model’s number of filters. Notice the increase in the number of training parameters as the number of filters is increased; for 32 filters, the training of 25,625 parameters is required. With 32 filters we obtain a classification accuracy of 98.9%, compared to 98.6% obtained with 4 filters. Nevertheless, the results obtained with 4 filters were still acceptable.

Table 5: Number of filters vs. model performance. Length of filters used = 4 for first layer and = 4 for second layer; dense layer = 8 units; validation split = 10%.

1) Training epochs, loss and accuracy graphs.
Figures 3 and 4 show the typical outputs obtained with the validation and training sets during the training epochs. From Fig. 3, it can be seen that the validation loss is generally fluctuating from one training epoch to another after an initial drop. During each epoch, a model is trained and the validation loss and accuracy are recorded. Our goal is to obtain the model with the lowest validation loss because we assume this will be the ‘best’ model that fits the training data. Thus, at every epoch, the validation loss is compared to previous ones and if the current one is lower, the corresponding model is saved as the best model. We implemented a ‘stopping criterion’ which will stop the training once no improvement in performance is observed within 100 epochs. For example, in Figure 3 the best model was obtained with the lowest validation loss of 0.00531 at epoch 45. For the next 100 epochs the validation loss did not improve, hence the training was stopped. Figure 4 shows the corresponding accuracy behaviour observed from epoch to epoch.
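The checkpointing and stopping rule described in Section V.A.1 can be sketched in a few lines of plain Python. The function name, its return values and the synthetic loss sequence below are illustrative assumptions; the only numbers taken from the text are the patience of 100 epochs and the example minimum of 0.00531 at epoch 45.

```python
def train_with_early_stopping(val_losses, patience=100):
    """Track the best validation loss and stop after `patience` stale epochs.

    `val_losses` stands in for the per-epoch validation losses that a real
    training loop would produce. Returns (best_epoch, best_loss, stop_epoch),
    with epochs numbered from 1.
    """
    best_loss = float("inf")
    best_epoch = 0
    epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # In a real run, the current model would be saved here.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement within `patience` epochs
    return best_epoch, best_loss, epoch


# Synthetic losses: steady decrease, a minimum at epoch 45, then a plateau.
losses = [0.05 - 0.001 * i for i in range(44)] + [0.00531] + [0.006] * 300
print(train_with_early_stopping(losses))  # (45, 0.00531, 145)
```

With these inputs training stops at epoch 145, which matches the range plotted in Figure 4. In Keras, comparable behaviour is commonly obtained with the `EarlyStopping` and `ModelCheckpoint` callbacks, although the paper does not say whether these were used.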
Figure 4: Training and validation accuracies at different epochs up to 145. These plots correspond to the training and validation losses depicted in Figure 3.

B. Varying the length of the filters.
In this section we examine the effect of the length of filters on the performance of the model while the number of filters is fixed at 32 in each convolutional layer. The length is varied over 4, 8, 16, 32 and 64 (as shown in Table 6). The number of units in the dense layer was fixed at 8. The results indicate that increasing the length of the filters does not appear to have much of an impact on the overall classification accuracy and F1-score. However, the smallest filter length of 4 achieves the highest accuracy and F1-score. Note that as we increase the length of the filters, the number of parameters to be trained increases (from 25,625 for length=4 to 77,465 for length=64).

The lack of improvement with the length of filters may be attributed to the larger number of parameters leading to overfitting of the model to the training data, thereby reducing its generalization capability. This in turn leads to degraded performance when tested on new data. Basically, what these results show is that when the training parameters increase beyond a certain limit, the model becomes too complex for the data and this leads to overfitting. This becomes evident in a lack of improvement or a degradation in performance when tested on previously unseen data.

It can be seen from Table 7 that as we increase the maxpooling parameter, the total number of training parameters is reduced. At the same time, we witness a progressive decline in overall performance. Therefore, for our CNN model designed to classify applications into ‘botnet’ and ‘normal’, the optimal subsampling ratio for both layers is 2.

Table 7: Maxpooling parameter vs. model performance. Length of filters used = 4 for both convolutional layers; number of filters = 32 for both layers; dense layer = 8 units; validation split = 10%.
Maxpooling parameter / subsampling ratio   2        4        6        8
Accuracy                                   0.989    0.987    0.983    0.978
Precision                                  0.983    0.982    0.974    0.971
Recall                                     0.978    0.973    0.967    0.948
F1-score                                   0.981    0.978    0.970    0.959
Training parameters                        25,625   9,497    6,425    5,401

D. CNN performance vs. other machine learning classifiers: 10 fold cross validation results.
In Table 8, the performance of the CNN model developed in this paper is compared to other machine learning classifiers: Naïve Bayes, SVM, Random Forest, Artificial Neural Network, J48, Random Tree, REPtree, and Bayes Net. Figure 5 shows the F1-scores of the classifiers, where CNN has the
highest F1-score (0.981), followed by SVM (0.976), SL (0.973), ANN (0.973) and Random Forest (0.973). Bayes Net had the lowest F1-score of 0.781. Table 8 shows that the recall of CNN is 0.978, which indicates that it has better botnet detection performance than the other classifiers. Note that the ANN was a back propagation neural network built with a single hidden layer consisting of 32 units (neurons). The sigmoid activation function was used within the neurons. This ANN represented the application of a neural network without deep learning. The ANN showed no significant improvement in the results when the number of units in the hidden layer was increased beyond 32.

Table 8: Comparison of our CNN results with results from other ML classifiers.
Classifier     ACC     Prec.   Rec.    F1
Naïve Bayes    0.872   0.728   0.874   0.795
SVM            0.987   0.980   0.973   0.976
RF             0.985   0.982   0.965   0.973
ANN            0.985   0.982   0.965   0.973
SL             0.984   0.983   0.963   0.973
J48            0.981   0.974   0.958   0.966
Random Tree    0.972   0.948   0.955   0.951
REPTree        0.979   0.973   0.954   0.963
Bayes Net      0.867   0.736   0.832   0.781
CNN            0.989   0.983   0.978   0.981

Figure 5: F1-scores of the classifiers. Values recoverable from the plot: CNN 0.981, SVM 0.976, ANN 0.973, SL 0.973, RF 0.973, J48 0.966, REPTree 0.963.

have used are reported in every paper. Nevertheless, it is clear that our CNN model obtained better overall accuracy, F1 and recall than the other works.

Table 9: Performance comparisons with other works. Note that all of the papers used botnet samples from the ISCX dataset.
Paper reference             Botnets/Benign   ACC (%)   Rec.    Prec.   F1
Hojjatinia et al. [8]       1800/3650        97.2      0.96    0.955   0.957
Tansettanakorn et al. [9]   1926/150         -         0.969   0.972   -
Anwar et al. [6]            1400/1400        95.1      0.827   0.97    -
Abdullah et al. [10]        1505/850         -         0.946   0.931   -
Alqatawna & Faris [7]       1635/1635        97.3      0.957   0.987   -
This paper                  1929/4873        98.9      0.978   0.983   0.981

VI. CONCLUSIONS AND FUTURE WORK
In this paper, we proposed a deep learning model based on 1D CNN for the detection of Android botnets. We evaluated the model through extensive experiments with 1,929 botnet apps and 4,873 clean apps. The model outperforms several popular machine learning classifiers evaluated on the same dataset. The results (Accuracy: 98.9%; Precision: 0.983; Recall: 0.978; F1-score: 0.981) indicate that our proposed CNN based model can be used to detect new, previously unseen Android botnets more accurately than the other models. For future work, we will aim to improve the model training process by automating the search and selection of the key influencing parameters (i.e. number of filters, filter length, and number of fully connected (dense) layers) that jointly result in the optimal performing CNN model.
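As a cross-check on the training-parameter counts quoted in Section V, the sketch below counts the weights and biases of the architecture summarised in Table 4. It assumes unpadded convolutions with stride 1, non-overlapping pooling, and no trainable layers other than those listed in Table 4; the function names are illustrative and none of these assumptions is confirmed by the paper.

```python
def conv1d_params(in_channels, filters, length):
    """Weights plus one bias per filter."""
    return length * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units


def count_parameters(n_features=342, filters=32, length=4, pool=2, dense_units=8):
    """Parameter count for conv -> pool -> conv -> pool -> dense -> output."""
    steps = n_features - length + 1           # first convolution, no padding
    total = conv1d_params(1, filters, length)
    steps //= pool                            # first max pooling
    steps = steps - length + 1                # second convolution
    total += conv1d_params(filters, filters, length)
    steps //= pool                            # second max pooling
    total += dense_params(steps * filters, dense_units)  # after flattening
    total += dense_params(dense_units, 1)     # sigmoid output unit
    return total


# Reported counts, keyed by (filter length, pooling size), from Section V.
reported = {(4, 2): 25625, (4, 4): 9497, (4, 6): 6425, (4, 8): 5401, (64, 2): 77465}

for (length, pool), value in reported.items():
    computed = count_parameters(length=length, pool=pool)
    print(length, pool, computed, value - computed)
```

Under these assumptions every computed count is exactly 72 below the reported figure (for example 25,553 against 25,625). A constant offset of this kind would be consistent with a small number of additional trainable parameters that Table 4 does not list, possibly from the regularization layers mentioned in Section III, but that explanation is a guess.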