DGA Botnet Detection Using Supervised Learning Methods
[Figure 1: Taxonomy for the supervised learning methods in DGA botnet detection: SVM, C4.5, ELM, HMM, LSTM, Recurrent SVM, CNN+LSTM, Bidirectional LSTM.]
(SVM), Recurrent SVM [9], [10], CNN+LSTM [11], and Bidirectional LSTM [12], which have not been validated in this application domain.

[Figure 2(b): Entropy of Alexa, Ramnit, Ranbyus, Suppobox and Banjori domains.]
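Figure 2(b) contrasts the character-level entropy of Alexa domains with that of several DGA families. For illustration only, a minimal Python sketch of this classic hand-crafted feature (the function name is ours, not the paper's):

    import math
    from collections import Counter

    def char_entropy(domain):
        """Shannon entropy (bits per character) of a domain string."""
        counts = Counter(domain)
        n = len(domain)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    # Dictionary-based families such as Suppobox score close to English words,
    # while uniformly random generators such as Ramnit score noticeably higher.
    print(char_entropy('google'))
    print(char_entropy('xjw3qpzk1vbn'))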
[Figure 3(a): LSTM memory block with input, forget and output gates.]
Long Short-Term Memory network (LSTM) [13], [14] holds more promise for recognizing DGA malwares since it is capable of modeling temporal sequences and their long-term dependencies [8]. Traditional HMM is limited to a discrete state space, while LSTM has Turing capabilities, making it more suitable for all sequence learning tasks. The LSTM basic unit is the memory block, containing one or more memory cells and three multiplicative gating units (see Fig. 3a). LSTM aims at mapping an input sequence to an output sequence by using the following equations iteratively from t = 1 to T:

    i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)    (4)
    f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)    (5)
    c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)    (6)
    o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)        (7)
    h_t = o_t \tanh(c_t)                                                (8)

where \sigma is the logistic sigmoid, i_t, f_t and o_t are the input, forget and output gate activations, and c_t and h_t denote the cell state and hidden output at time t.
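To make the architecture concrete, here is a minimal Keras sketch of a character-level LSTM classifier in the spirit of [8]; the vocabulary size, domain length and embedding width are illustrative assumptions, not the paper's exact configuration:

    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Dense

    MAX_LEN = 75       # assumed maximum domain length
    VOCAB_SIZE = 40    # assumed character vocabulary (letters, digits, '-', '.')

    model = Sequential([
        Embedding(VOCAB_SIZE, 128, input_length=MAX_LEN),  # characters to dense vectors
        LSTM(128),                                         # 128 memory blocks
        Dense(1, activation='sigmoid'),                    # binary DGA vs. non-DGA output
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])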
Motivated by the pioneering work of LSTM, several LSTM variants have been developed in the literature. Recurrent SVM is constructed using the idea of replacing the Softmax with an SVM. Softmax tries to minimize the cross-entropy, while the aim of SVM is to find the maximum margin between samples from different classes [9].
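One way to approximate this idea with the Keras library used later in the paper is to keep the recurrent layer but train a linear output layer with a hinge loss; this is a hedged sketch of the margin objective, not the authors' exact implementation:

    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Dense

    MAX_LEN, VOCAB_SIZE, NUM_CLASSES = 75, 40, 38   # assumed: 1 Alexa + 37 DGA classes

    model = Sequential([
        Embedding(VOCAB_SIZE, 128, input_length=MAX_LEN),
        LSTM(128),
        Dense(NUM_CLASSES, activation='linear'),    # linear scores in place of Softmax
    ])
    # categorical_hinge penalizes margin violations between the true class
    # score and the highest competing score, mimicking the SVM objective.
    model.compile(loss='categorical_hinge', optimizer='adam')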
In CNN+LSTM, the input domain is fed into a single Convolutional Neural Network (CNN) with max pooling across the sequence for each convolutional feature to find the morphological patterns [11]. The output of the CNN is then treated as the input of the LSTM to reduce the temporal variations. As a consequence, each CNN and LSTM block captures information about the input representation at a different scale [21]. For this reason, CNN+LSTM is expected to be a better alternative to the original LSTM.
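A minimal Keras sketch of this hybrid follows; the filter count, kernel size and pooling width are assumptions for illustration, not the settings of [11]:

    from keras.models import Sequential
    from keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense

    MAX_LEN, VOCAB_SIZE = 75, 40                    # assumed, as in the sketches above

    model = Sequential([
        Embedding(VOCAB_SIZE, 128, input_length=MAX_LEN),
        Conv1D(128, 3, activation='relu'),          # n-gram-like morphological patterns
        MaxPooling1D(2),                            # pool across the sequence
        LSTM(128),                                  # model the remaining temporal structure
        Dense(1, activation='sigmoid'),             # binary DGA vs. non-DGA output
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam')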
Bidirectional LSTM is an extension of the traditional LSTM, consisting of a forward and a backward LSTM. It is observed to achieve higher generalization performance on sequence classification problems [12]. In Bidirectional LSTM, the forward hidden sequence \overrightarrow{h}_t, the backward hidden sequence \overleftarrow{h}_t and the output sequence y_t are computed as follows:

    \overrightarrow{h}_t = \mathcal{H}(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}})    (10)
    \overleftarrow{h}_t = \mathcal{H}(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}})      (11)
    y_t = W_{\overrightarrow{h} y} \overrightarrow{h}_t + W_{\overleftarrow{h} y} \overleftarrow{h}_t + b_y                                                    (12)

where \mathcal{H} is an update function, which is implemented by combining Eqs. (4) and (8). The Bidirectional LSTM allows the output units y_t to learn a representation from both the past and future information without having a fixed-size window around t [23]. In this paper, it is based on two LSTM layers (forward and backward) with 128 memory blocks in each direction.
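In Keras, the two directions can be expressed with the Bidirectional wrapper; a minimal sketch consistent with the 128-memory-blocks-per-direction setup described above (other hyperparameters are assumed):

    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Bidirectional, Dense

    MAX_LEN, VOCAB_SIZE = 75, 40                    # assumed, as in the sketches above

    model = Sequential([
        Embedding(VOCAB_SIZE, 128, input_length=MAX_LEN),
        Bidirectional(LSTM(128)),                   # forward and backward layers, 128 memory blocks each
        Dense(1, activation='sigmoid'),
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam')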
3 EXPERIMENTS

This section is dedicated to assessing the various supervised learning methods, i.e., HMM, C4.5, ELM, SVM, LSTM, Recurrent SVM, CNN+LSTM and Bidirectional LSTM. In particular, we evaluate these methods on both the binary (DGA vs. non-DGA) and multiclass (which DGA?) problems. The Wilcoxon signed ranks test is also performed to compare each pair of methods based on their F1-scores. All the code was written using the Keras and scikit-learn libraries [26], [27], and was executed on a PC running Ubuntu 16.04 x64 with an Intel Core i5 and 8 GB of RAM.

3.1 Dataset Specification

The experiments are carried out on a real-world dataset that contains 1 non-DGA (Alexa) and 37 DGA classes. It is collected from two sources: the Alexa top 1 million domains [28] and the OSINT DGA feed from Bambenek Consulting [29]. In total, there are 88,357 legitimate domains and 81,490 DGA domains. The dataset also includes some notable DGA families such as Cryptolocker, Locky, Kraken and Gameover Zeus. Matsnu, Cryptowall, Suppobox and Volatile are based on domains which were generated using an English dictionary word list. Table 1 illustrates the dataset specification.
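As a rough illustration of how such a dataset can be assembled (the file names and feed layout below are assumptions, not a prescription from the paper):

    import csv

    # Alexa top 1 million list [28]: lines of the form "rank,domain".
    with open('top-1m.csv') as f:                   # assumed local file name
        benign = [row[1] for row in csv.reader(f)]

    # Bambenek OSINT DGA feed [29]: '#' comment lines, then "domain,description,...".
    dga = []
    with open('dga-feed.txt') as f:                 # assumed local file name
        for line in f:
            if line.startswith('#') or not line.strip():
                continue
            domain, family = line.split(',')[:2]    # the description names the DGA family
            dga.append((domain, family))

    # The paper works with 88,357 legitimate and 81,490 DGA domains in total.
    print(len(benign), len(dga))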
3.3 Results
averaging recall, precision and F1-score are shown in Tables 2 and 3. Macro-averaging treats all classes equally, while micro-averaging favors the classes that have more samples. Macro-averaging should be a better measure; however, micro-averaging is also presented for interested readers. It can be proved that macro-averaging recall is equal to the accuracy reported in the literature. In Table 2, HMM achieves a lower detection rate than expected. The rationale for this is that HMM requires a huge amount of data to train the model, while several DGA classes, such as Tempedreve and Corebot, have very little representation in the training data. We note that in [2] HMM was only evaluated using the Conficker, Murofet, Bobax and Sinowal malwares.
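The difference between the two averages can be made concrete with scikit-learn (a toy sketch with invented labels; 0 stands for Alexa, 1 and 2 for two DGA families):

    from sklearn.metrics import f1_score

    # Toy example: labels and predictions invented for illustration only.
    y_true = [0, 0, 0, 0, 1, 1, 2]
    y_pred = [0, 0, 0, 1, 1, 1, 0]

    # Macro averages the per-class F1-scores, so a rare family such as
    # Tempedreve counts as much as Alexa; micro pools all decisions, so
    # the classes with more samples dominate the score.
    print(f1_score(y_true, y_pred, average='macro'))
    print(f1_score(y_true, y_pred, average='micro'))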
It becomes obvious that C4.5 is superior to both the SVM and ELM. The dominance of Recurrent SVM and Bidirectional LSTM was established by a large margin, leaving LSTM and CNN+LSTM in a group with a very small difference between them. LSTM cannot recognize 15 DGA families. This number is reduced to 8 when Bidirectional LSTM is applied to detect malicious domains. Apart from HMM, the implicit-feature-based methods were observed to be better than the hand-crafted-feature-based ones.

[Table 4: Ranks computed by the Wilcoxon test]

performers. A significant difference is also observed between Bidirectional LSTM and the other supervised learning methods.

Table 6 illustrates the evaluation time, which is critical for practical use. There is almost no computation cost in LSTM (9 ms). In Bidirectional LSTM, additional processing is needed because updating the input and output layers cannot be achieved at once [12]. Bidirectional LSTM requires 27 ms to process a domain. The evaluation time related to ELM is given in [7]. C4.5, ELM and SVM are the most computationally expensive, since these methods are based on hand-crafted attributes. It is clear that HMM, C4.5, ELM and SVM are not suitable for real-time DGA detection applications.

There are still 8 DGA malwares that cannot be detected by any of the supervised learning methods. Geodo, Tempedreve, Hesperbot, Fobber, Dircrypt, Qadars and Locky are misclassified as Ramnit since these malwares share a common generator, which has a uniform distribution over the characters. Matsnu is based on pronounceable domains. Hence, it cannot be isolated from Alexa.
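The pairwise comparison behind Table 4 can be reproduced in outline with SciPy; the per-class F1 vectors below are placeholders, not the paper's measured scores:

    from scipy.stats import wilcoxon

    # Placeholder per-class F1-scores for two methods over the same classes.
    f1_bilstm = [0.98, 0.93, 0.88, 0.99, 0.75, 0.96]
    f1_hmm = [0.72, 0.61, 0.55, 0.90, 0.40, 0.66]

    # Paired Wilcoxon signed-rank test [24]: ranks the absolute differences
    # and compares the sums of positive and negative ranks.
    stat, p = wilcoxon(f1_bilstm, f1_hmm)
    print(stat, p)  # a small p-value indicates a significant difference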
4 CONCLUSIONS

DGA botnets have become a technology backbone to support cyber-criminals. Supervised learning provides a means to recognize and shut down this type of botnet. We have thoroughly evaluated various supervised learning methods, including Hidden Markov Model, C4.5 decision tree, Support Vector Machines, Extreme Learning Machine, Long Short-Term Memory network, Recurrent SVM, CNN+LSTM and Bidirectional LSTM. Experiments demonstrate that Bidirectional LSTM and Recurrent SVM achieve the highest detection rate on both the binary and multiclass classification problems. These methods share some important features with the LSTM, thus making them amenable to real-time detection applications.

ACKNOWLEDGMENTS

This research is supported by the Vietnam Ministry of Education and Training research project “Development of DDoS attack prevention and Botnet detection system” B2016-BKA-06.

REFERENCES

[1] S. Yadav, A. K. K. Reddy, A. L. N. Reddy, and S. Ranjan, Detecting algorithmically generated domain-flux attacks with DNS traffic analysis, IEEE/ACM Transactions on Networking 20.5 (2012): 1663-1677.
[2] M. Antonakakis, R. Perdisci, Y. Nadji, N. Vasiloglou, S. Abu-Nimeh, W. Lee, and D. Dagon, From Throw-Away Traffic to Bots: Detecting the Rise of DGA-Based Malware, in: The 21st USENIX Security Symposium (USENIX Security 12), 2012.
[3] Y. Zhou, Q.-S. Li, Q. Miao, and K. Yin, DGA-Based Botnet Detection Using DNS Traffic, Journal of Internet Services and Information Security 3.3/4 (2013): 116-123.
[4] S. Schiavoni, F. Maggi, L. Cavallaro, and S. Zanero, Phoenix: DGA-based botnet tracking and intelligence, International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA), LNCS 8550 (2014): 192-211.
[5] H. Zhang, M. Gharaibeh, S. Thanasoulas, and C. Papadopoulos, BotDigger: Detecting DGA bots in a single network, Proceedings of the IEEE International Workshop on Traffic Monitoring and Analysis, 2016.
[6] L. Bilge, E. Kirda, C. Kruegel, and M. Balduzzi, EXPOSURE: Finding Malicious Domains Using Passive DNS Analysis, NDSS, 2011.
[7] Y. Shi, C. Gong, and L. Juntao, Malicious Domain Name Detection Based on Extreme Machine Learning, Neural Processing Letters (2017): 1-11.
[8] J. Woodbridge, H. S. Anderson, A. Ahuja, and D. Grant, Predicting Domain Generation Algorithms with Long Short-Term Memory Networks, arXiv preprint arXiv:1611.00791 (2016).
[9] Y. Tang, Deep learning using linear support vector machines, arXiv preprint arXiv:1306.0239 (2013).
[10] S. X. Zhang, R. Zhao, C. Liu, J. Li, and Y. Gong, Recurrent support vector machines for speech recognition, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
[11] Y. Kim, et al., Character-Aware Neural Language Models, AAAI, 2016.
[12] A. Graves and J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Networks 18.5 (2005): 602-610.
[13] S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation 9(8) (1997): 1735-1780.
[14] F. A. Gers, J. Schmidhuber, and F. Cummins, Learning to forget: Continual prediction with LSTM, Neural Computation 12(10) (2000): 2451-2471.
[15] J. Jacobs, Building a DGA Classifier: Feature Engineering, October 2014. Available online at: http://datadrivensecurity.info/blog/posts/2014/Oct/dga-part2/.
[16] S. Krishnan, T. Taylor, F. Monrose, and J. McHugh, Crossing the threshold: Detecting network malfeasance via sequential hypothesis testing, 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) (2013): 1-12.
[17] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000.
[18] J. Milgram, M. Cheriet, and R. Sabourin, “One against one” or “one against all”: Which one is better for handwriting recognition with SVMs?, Tenth International Workshop on Frontiers in Handwriting Recognition, La Baule, 2006.
[19] J. R. Quinlan, C4.5: Programs for Machine Learning, Elsevier, 2014.
[20] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70.1 (2006): 489-501.
[21] W. Yin, K. Kann, M. Yu, and H. Schütze, Comparative Study of CNN and RNN for Natural Language Processing, arXiv preprint arXiv:1702.01923 (2017).
[22] V. Tong and G. Nguyen, A method for detecting DGA botnet based on semantic and cluster analysis, Proceedings of the Seventh Symposium on Information and Communication Technology, ACM, 2016.
[23] P. Su, X. Ding, Y. Zhang, Y. Li, and N. Zhao, Predicting Blood Pressure with Deep Bidirectional LSTM Network, arXiv preprint arXiv:1705.04524 (2017).
[24] J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006): 1-30.
[25] J. Alcalá-Fdez, A. Fernandez, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera, KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework, Journal of Multiple-Valued Logic and Soft Computing 17(2-3) (2011): 255-287.
[26] F. Chollet, Keras (2015). Available online at: http://keras.io (2017).
[27] F. Pedregosa, et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011): 2825-2830.
[28] Does Alexa have a list of its top-ranked websites? Available online at: https://support.alexa.com/hc/en-us/articles/200449834-Does-Alexa-have-a-list-of-its-top-ranked-websites- (2017).
[29] Bambenek Consulting - Master Feeds. Available online at: http://osint.bambenekconsulting.com/feeds/ (2016).
[30] M. Masud, T. Al-khateeb, L. Khan, B. Thuraisingham, and K. Hamlen, Flow-based identification of botnet traffic by mining multiple log files, First International Conference on Distributed Framework and Applications (DFmA 2008), Oct. 2008, pp. 200-206.
[31] M. Antonakakis, et al., Building a Dynamic Reputation System for DNS, USENIX Security Symposium, 2010.
[32] S. B. Kotsiantis, I. Zaharakis, and P. Pintelas, Supervised machine learning: A review of classification techniques (2007): 3-24.