

Journal of Science and Technology on Information security

Machine learning approach detects DDoS attacks
Nguyen Thi Khanh Tram, Doan Trung Son, Nguyen Thi Thu Huong, Tran Thi Thu

Abstract: Denial of Service attacks have been around since the dawn of the Internet age. Along with the development and explosion of the Internet, denial of service attacks have become increasingly powerful and now pose a serious threat in cyberspace. This article evaluates four machine learning algorithms, K-nearest neighbor (KNN), Decision Tree, Random Forest and Support Vector Machine (SVM), on various metrics for detecting DDoS attacks. The main objective of the paper is to analyze the algorithms, collect data and evaluate the effectiveness of the algorithms in DDoS attack detection.

Keywords: DDoS; KNN; Decision Tree; Random Forest; SVM.

I. INTRODUCTION

A Distributed Denial of Service (DDoS) attack is carried out by directing a growing volume of online traffic from multiple sources at a server, which causes the server to run out of resources and bandwidth. DDoS first appeared in 1999.

Vietnam faces a great risk of both suffering and distributing DDoS attacks: it ranks 6th globally, after China, the US, France, Russia and Brazil, 2nd in the Asia Pacific region, and first in Southeast Asia [1].

DDoS involves making requests from a network of up to millions of computers with different IP addresses over which control has previously been established (a botnet). Computers and other networked resources such as IoT devices together create "tsunamis" of traffic. A DDoS attack can be understood as a sudden traffic jam that blocks a highway, preventing normal traffic from reaching its destination. Because it is dispersed over many access points with different IP ranges, DDoS is much stronger than DoS, and DDoS attacks are often difficult to recognize or prevent.

Different types of DDoS attacks target different components of a network connection. Based on target and behavior, DDoS attacks can be classified into three types: traffic/fragmentation attacks, bandwidth/volume attacks, and application layer attacks.

In late 1999, CERT published its first report on the threat of DDoS attacks and outlined specific prevention actions to mitigate this threat [2]. A few months later, the Internet suffered its first large-scale DDoS attack [3], followed by attacks of increasingly large scale in the following years. Since then, researchers have analyzed a number of tools used to launch DDoS attacks [4, 5, 6], measured their impact on the Internet, and proposed a number of defense methods [7]. These research efforts have resulted in a number of effective and reliable anti-DDoS products offered as stand-alone devices or cloud-based services.

In recent years, along with the strong development of Artificial Intelligence (AI), machine learning (ML) and deep learning methods are increasingly used to detect DDoS attacks. Sambangi and Gondi propose an approach that uses multiple linear regression to detect DDoS attacks [8].

102 Special Issue CS (15) 2022



P. Sangkatsanee et al. [9] built a real-time detection mechanism based on machine learning techniques. They propose 12 essential network traffic features that distinguish normal data from DDoS traffic.

Sofi et al. [10] built a new dataset consisting of 27 features and five different traffic classes. Four machine learning algorithms, namely Naive Bayes, SVM, decision tree and MLP, were applied to identify DDoS attacks; among them, the MLP algorithm gave the best results.

Mahadev et al. [11] used the Naive Bayes classifier in the Weka tool to analyze network traffic flow and found that it provides 99% accuracy in detecting DDoS attacks.

S. Duque et al. [12] show that the K-means clustering algorithm gives increased efficiency when the number of clusters is chosen correctly. Furthermore, they note that when the number of clusters exceeds the number of data types, the false-negative detection rate decreases but the false-positive rate increases.

II. MACHINE LEARNING ALGORITHM

The four algorithms used for DDoS attack detection in this paper are KNN, Decision Tree, Random Forest and SVM. These are all commonly used classical machine learning algorithms.

A. KNN

The K-nearest neighbor (KNN) algorithm is one of the simplest supervised learning algorithms in machine learning, and it is effective in some cases. During training, the algorithm does not learn anything from the training data; all calculations are performed when it needs to predict the outcome for new data. In classification, the label of a new data point is inferred directly from the K nearest data points in the training set, using distance measures such as the Euclidean, Manhattan and Minkowski distances.

Implementation steps:
Step 1. Calculate the distance.
Step 2. Find the nearest neighbors.
Step 3. Predict labels.

Figure 1. Distance formula in KNN

B. DECISION TREE

Decision Tree is a supervised, non-parametric learning algorithm used for classification and regression. The method creates a highly accurate, stable and easy-to-follow tree model and eliminates unnecessary attributes. Each inner node corresponds to a variable, and each arc leads to a child node corresponding to a possible value of that variable. The leaves correspond to the predicted target values.

Decision tree learning is also a very popular method in data mining, where a decision tree describes a tree structure in which leaves represent classes and branches represent the combinations of features that lead to those classes. A tree can be learned by dividing the source set into subsets based on the values of the test attributes. This process is repeated on each resulting subset. The recursion ends when a subset cannot be divided any further or when every element of the subset has the same label. Decision trees are described by calculating conditional probabilities. They can be seen as a combination of learning techniques and computational algorithms that support the description, classification and generalization of a given data set.

C. RANDOM FOREST

Random Forest builds many decision trees using the Decision Tree algorithm, but each tree is different because of a random element. The prediction results are then aggregated from the individual trees. Random Forest is a supervised learning algorithm that can solve both regression and classification problems.
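The Random Forest idea described in this section, many decision trees that each differ through a random element and whose predictions are aggregated, can be illustrated with a minimal sketch. It is not the scikit-learn implementation used in the experiments: as a simplifying assumption, each "tree" is a one-split stump on a single randomly chosen feature, labels are binary, and the names `fit_stump`, `fit_forest`, `predict` and the toy traffic values are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y, feature):
    """Fit a one-split 'tree' on one feature (binary labels assumed)."""
    by_label = {}
    for row, label in zip(X, y):
        by_label.setdefault(label, []).append(row[feature])
    if len(by_label) == 1:
        # The bootstrap sample contained a single class: always predict it.
        only_label = next(iter(by_label))
        return lambda row: only_label
    means = sorted((sum(v) / len(v), label) for label, v in by_label.items())
    (low_mean, low_label), (high_mean, high_label) = means
    threshold = (low_mean + high_mean) / 2  # midpoint between the class means
    return lambda row: high_label if row[feature] > threshold else low_label


def fit_forest(X, y, n_trees=25, seed=0):
    """Build n_trees stumps, each on its own bootstrap sample and feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Random element 1: bootstrap sample (drawn with replacement).
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        # Random element 2: each stump sees one randomly chosen feature.
        feature = rng.randrange(len(X[0]))
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict(forest, row):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy traffic: [packet rate, byte rate]
X = [[10, 200], [12, 220], [11, 210], [9, 190],
     [300, 9000], [320, 9500], [310, 9100], [290, 8800]]
y = ["normal"] * 4 + ["attack"] * 4

forest = fit_forest(X, y)
print(predict(forest, [8, 180]))
print(predict(forest, [305, 9200]))
```

Because every stump sees a different resample and feature, an occasional poor stump is outvoted by the rest, which is the motivation for aggregating many weakly correlated trees.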


Random Forest works in four steps:
Step 1. Select random samples from the given data set.
Step 2. Build a decision tree for each sample and get a prediction result from each tree.
Step 3. Hold a vote over the prediction results.
Step 4. Select the most voted result as the final prediction.

In addition, Random Forest has the following notable characteristics:
• A collection of uncorrelated trees performing the same task together is better than each tree working on its own;
• The trees are assumed to be independent of each other in error rate, or to have little correlation with each other, to ensure independence;
• Feature selection must be good enough for the trees to classify better than random selection;
• The predictions and errors of the individual trees have little correlation with each other.

D. SVM

Support vector machine (SVM) is a supervised machine learning algorithm that is very commonly used today in classification and regression problems.

Figure 2. Hyperplane selection model in SVM

SVM was proposed by Vladimir N. Vapnik and his colleagues in 1963 in Russia and became popular in the 1990s thanks to its application to non-linear problems.

The idea of SVM is to find a hyperplane that separates the data points. This hyperplane divides the space into different domains, and each domain contains one type of data.

The optimal hyperplane is the separating hyperplane with the largest margin. Machine learning theory has shown that such a hyperplane minimizes the error bound.

III. DATA PROCESSING AND PARAMETER IMPLEMENTATION

A. DATA SET

The authors collected the data set from the available documents under the link []. The dataset has been pre-processed and labeled; the authors downloaded it and used it according to the model's requirements. The new dataset collected by the authors contains DDoS attacks of the types HTTP Flood, SIDDOS and UDP Flood, with no redundant or duplicate records. Table I lists the record counts for these attack types. Table II shows the processed features of the data set.

TABLE I. NUMBER OF RECORDS IN THE DATA SET BY ATTACK TYPE

Attack Type    Number of records
SIDDOS         6550
UDP Flood      201344
HTTP Flood     4110

TABLE II. PROCESSED CHARACTERISTICS OF THE DATASET

No.  Description        Type
1    SRC ADD            Continuous
2    DES ADD            Continuous
3    PKT ID             Continuous
4    FROM NODE          Continuous
5    TO NODE            Continuous
6    PKT TYPE           Continuous
7    PKT SIZE           Continuous
8    FLAGS              Symbolic
9    FID                Continuous
10   SEQ NUMBER         Continuous
11   NUMBER OF PKT      Continuous
12   NUMBER OF BYTE     Continuous
13   NODE NAME FROM     Symbolic
14   NODE NAME TO       Symbolic
15   PKT IN             Continuous
16   PKTOUT             Continuous
17   PKTR               Continuous
18   PKT DELAY NODE     Continuous
19   PKTRATE            Continuous
20   BYTE RATE          Continuous
21   PKT AVG SIZE       Continuous
22   UTILIZATION        Continuous
23   PKT DELAY          Continuous
24   PKT SEND TIME      Continuous
25   PKT RESEVED TIME   Continuous
26   FIRST PKT SENT     Continuous
27   LAST PKT RESEVED   Continuous

The proposed data collection system follows these steps:
• Collect and control: all network traffic from the NIDS is collected and examined;
• Preprocess the data format: redundant and duplicate records are removed;
• Feature extraction: feature parameters are extracted from the collected network traffic and each feature is assigned to a data column; together they are used as a vector in the new dataset;
• Statistical measurements: in this step, additional features are calculated using statistical equations.

Figure 3. The process of building a new dataset

The authors use a data collection system inherited from the topic "Building an application to collect transmission data for network investigation" by Nguyen The Hoang.

After the data set is collected, it is fed into the system to identify denial of service attacks. The steps are as follows:
• Receiving the input dataset: the system receives user-provided network attack datasets;
• Machine learning model training: the system stores machine learning algorithms commonly used in network attack detection, then trains those algorithms with the input dataset;
• Changing model parameters: the system adjusts some parameters of each algorithm to increase its accuracy;
• Displaying training and model evaluation results: the system outputs the trained model's weights and the model's accuracy evaluation parameters;
• Concluding whether the network behavior is a denial of service attack or not.

B. DATA PROCESSING

The above dataset is processed before it is used in the experiments. All input data must be processed in the same way, so data cleaning is always the first step in designing a machine learning model. The symbolic features, such as PKT_TYPE, FLAGS, NODE_NAME_FROM, NODE_NAME_TO and PKT_CLASS, and unimportant features like SRC_ADD and DES_ADD are removed.

Because the data set has a relatively high number of records belonging to normal behavior, 10000 records are taken for the two labels Normal and UDP Flood in order to balance the machine learning model. The input data set is divided into training and testing sets in the ratio of 7:3.

C. HYPERPARAMETER SELECTION

Hyperparameter tuning is an important step in machine learning. Hyperparameters are user-defined parameters that control the training process of the model and play an important role in determining the performance of the model.
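A minimal sketch of trying a fixed set of hyperparameter values and keeping the best one, shown here for the number of neighbors in KNN. It is a hand-written illustration, not the scikit-learn code used in the experiments: the helper names, the toy points and the grid `[1, 3, 7]` are assumptions (the toy set is far too small for the neighbor counts 10, 100 and 1000 used in the paper).

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, row, k):
    """Label `row` by majority vote among its k nearest training points."""
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: math.dist(pair[0], row))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def accuracy(train_X, train_y, val_X, val_y, k):
    """Fraction of validation points that KNN labels correctly for this k."""
    hits = sum(knn_predict(train_X, train_y, row, k) == label
               for row, label in zip(val_X, val_y))
    return hits / len(val_y)


def grid_search(train_X, train_y, val_X, val_y, grid):
    """Evaluate every candidate value in the grid and keep the best one."""
    scores = {k: accuracy(train_X, train_y, val_X, val_y, k) for k in grid}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Hypothetical toy data: 3 normal and 4 attack training points.
train_X = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
           [5.0, 5.0], [5.2, 4.8], [4.9, 5.1], [5.1, 5.3]]
train_y = ["normal"] * 3 + ["attack"] * 4
val_X = [[1.1, 1.0], [5.0, 5.1]]
val_y = ["normal", "attack"]

# With k = 7 every training point votes, so the majority class always wins.
best_k, scores = grid_search(train_X, train_y, val_X, val_y, grid=[1, 3, 7])
print(best_k, scores)
```

The same loop applies to any algorithm with a finite list of candidate values; only the scoring function changes.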


This tuning is usually done by traversing a predefined grid of parameters. The grid can consist of fixed values, or of values drawn at random from a given distribution or condition. In this paper, a parameter grid with fixed values is used, as shown in the following table:

TABLE III. HYPERPARAMETRIC GRID

Algorithm  Parameter name       Value
KNN        Neighbors number     [10, 100, 1000]
DT         Evaluation function  Gini impurity or Information gain (Entropy)
RF         Number of Tree       [10, 100, 1000]
SVM        C                    [-1, 1, 3]
SVM        γ                    [-1, 1, 3]

Figure 4. Hyperparameter selection

D. RESULT EVALUATION INDEX

The indicators used to evaluate the results include:

Accuracy: the ratio of correctly predicted points to the total number of points in the test dataset.

Precision, or Positive Predictive Value (PPV): the ratio of the number of attack points that the model predicts correctly to the total number of points that the model predicts as attacks. The higher the Precision, the larger the share of predicted attacks that really are attacks. Precision = 1 means that every point the model predicts as an attack is correct, that is, no point labeled as normal behavior is mistakenly predicted as an attack.

Recall: the ratio of the number of attack points that the model predicts correctly to the total number of points that actually are attacks (the total number of points originally labeled as attacks). The higher the Recall, the fewer attack points are missed. Recall = 1 means that all points labeled as attack behavior are recognized by the model. Recall is also known as True Positive Rate (TPR), Sensitivity or Hit rate.

F1-score: the harmonic mean of Precision and Recall when these two quantities are non-zero. It is calculated by the formula:

$$F_1 = 2 \cdot \frac{\mathit{precision} \cdot \mathit{recall}}{\mathit{precision} + \mathit{recall}}$$

False Positive Rate (FPR), also known as False Alarm Rate: the false detection rate, that is, the rate at which normal behavior is considered attack behavior by the model.

IV. RESULTS AND DISCUSSION

The results of running the 4 mentioned algorithms are presented in the following table:

TABLE IV. RESULTS WHEN RUNNING 4 ALGORITHMS

Algorithms  Accuracy  Precision  Recall  F1-score  FPR
KNN         0.9475    0.9541     0.9494  0.9495    0.0003
DT          0.9093    0.9093     0.9093  0.9093    0.0902
RF          0.9508    0.9440     0.9411  0.9412    0.0194
SVM         0.9489    0.9543     0.9496  0.9497    0.0000

According to the results in Table IV, the decision tree algorithm gives the lowest probability of correct detection (90.93%) as well as the highest false detection rate, while the Random Forest algorithm gives the highest probability (95.08%). SVM has the longest running time and the lowest false detection rate. In general, the 4 algorithms, implemented with the scikit-learn library, provide relatively good results and are optimized for better performance.
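The evaluation indices defined in Section III.D can be computed directly from the four confusion counts (true/false positives and negatives). The sketch below is illustrative only: the function names and the eight toy labels are assumptions, and the printed values are not the paper's results.

```python
def confusion_counts(y_true, y_pred, positive="attack"):
    """Count true/false positives and negatives for one positive label."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # attack correctly detected
            else:
                fp += 1  # normal behavior flagged as attack (false alarm)
        else:
            if truth == positive:
                fn += 1  # attack missed
            else:
                tn += 1  # normal behavior correctly passed
    return tp, fp, tn, fn


def evaluate(y_true, y_pred, positive="attack"):
    """Accuracy, Precision, Recall, F1-score and FPR as defined in the text."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }


# Hypothetical labels: 4 attacks and 4 normal flows, one error of each kind.
y_true = ["attack"] * 4 + ["normal"] * 4
y_pred = ["attack", "attack", "attack", "normal",
          "normal", "normal", "normal", "attack"]

result = evaluate(y_true, y_pred)
print(result)
```

Keeping the counts separate from the ratios makes it easy to check which kind of error (missed attacks or false alarms) drives a change in a given metric.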


Figure 5. ROC curves of 4 algorithms

Besides, the experimental results are also evaluated based on the ROC (Receiver Operating Characteristic) curve, a graphical plot illustrating the performance of a binary classification system. Each point on the ROC curve corresponds to the true positive rate (sensitivity) on the vertical axis and the false positive rate (1 - specificity) on the horizontal axis. The further the curve bends toward the top-left corner, the clearer the distinction between the two states. The ROC curves obtained when running the 4 algorithms are shown in Figure 5. The AUC (Area Under the ROC Curve) values of the Decision Tree, Random Forest, KNN and SVM algorithms are 0.9093, 0.9508, 0.9475 and 0.9489, respectively. These values are all in the excellent range; the decision tree algorithm gives the lowest result and the Random Forest algorithm gives the best prediction.

V. CONCLUSION

Based on the newly collected dataset, which contains DDoS attacks of the types HTTP Flood, SIDDOS and UDP Flood and no redundant or duplicate records, the authors conducted experiments with 4 machine learning algorithms for DDoS attack detection. As a result, all 4 algorithms are capable of detecting DDoS attacks with high accuracy, speed and efficiency.

Recently, with the continuous development of 5G, a large number of insecure Internet of Things (IoT) devices have been connected to the Internet, which presents great challenges for protection against DDoS attacks, especially as attackers try to "recruit" more devices into botnets (for example, the Mirai botnet) to increase the frequency, size and throughput of DDoS attacks worldwide. In the future, attackers will most likely take advantage of artificial intelligence and machine learning to alter attacks automatically so that they evolve into more optimal attack techniques. In that case, it will be necessary to improve DDoS attack detection algorithms towards real-time processing of the raw attack data.

ACKNOWLEDGMENT

The authors would like to thank VINIF for its financial support to Nguyen Thi Khanh Tram as a master student at VNU University of Engineering and Technology.

REFERENCES

[1]. Workshop "Protecting networks and data from distributed denial of service (DDoS) attacks targeting organizations and businesses", 3 May 2019, organized by Cục An toàn Thông tin (Authority of Information Security), VietnamNet newspaper and Nexusguard Limited.
[2]. CERT Coordination Center, "Results of the Distributed-Systems Intruder Tools Workshop", Software Engineering Institute, 1999.
[3]. L. Garber, "Denial-of-Service Attacks Rip the Internet", IEEE Computer, 33(4):12–17, 2000.
[4]. D. Dittrich, "The DoS Project's "trinoo" Distributed Denial of Service Attack Tool", 21 October 1999.
[5]. D. Dittrich, "The "stacheldraht" distributed denial of service attack tool", https://staff.washington.edu/dittrich/misc/stacheldraht.analysis/, 31 December 1999.
[6]. D. Dittrich, "The "Tribe Flood Network" Distributed Denial of Service Attack Tool", https://staff.washington.edu/dittrich/misc/tfn.analysis/, 1999.
[7]. D. Kumar, G. Rao, M. K. Singh, and G. Satyanarayana, "A Survey of Defense Mechanisms countering DDoS Attacks in the Network", Intl. Journal of Advanced Research in Computer and Communication Engineering, 2:2599–2606, July 2013.
[8]. Swathi Sambangi and Lakshmeeswari Gondi, "A Machine Learning Approach for DDoS (Distributed Denial of Service) Attack Detection Using Multiple Linear Regression", in the 14th International Conference on Interdisciplinarity in Engineering (INTER-ENG 2020), Mures, Romania, 8 September 2020.
[9]. P. Sangkatsanee, N. Wattanapongsakorn and C. Charnsripinyo, "Practical real-time intrusion detection using machine learning approaches", Elsevier Computer Communications 34 (2011) 2227-2235.
[10]. I. Sofi, A. Mahajan and V. Mansotra, "Machine Learning Techniques used for the Detection and Analysis of Modern Types of DDoS Attacks", International Research Journal of Engineering and Technology (IRJET), vol. 04, June 2007.
[11]. Mahadev, V. Kumar and H. Sharma, "Detection and Analysis of DDoS Attack at Application Layer Using Naive Bayes Classifier", International Journal of Computer Engineering & Technology (IJCET), vol. 9, 2018, pp. 208-217, Article IJCET_09_03_025.
[12]. S. Duque, M. Nizam bin Omar, "Using Data Mining Algorithms for developing a Model for Intrusion Detection System (IDS)", Elsevier Procedia Computer Science 61 (2015) 46-51.

ABOUT THE AUTHOR

Doan Trung Son
Workplace: Faculty of Information Security, People's Security Academy.
Email: son.doantrung@gmail.com
Education:
University: Faculty of Information Technology, People's Security Academy
Master: Faculty of Information Technology, Hanoi University of Science and Technology
Doctorate: Hagen University, Germany
Recent research direction: Cybersecurity, High-tech Crime Prevention, Artificial Intelligence, Data Science, Trust and Distributed Systems, Modern issues in information technology.

Nguyen Thi Khanh Tram
Workplace: Phenikaa School
Email: khanhtramt2k23@gmail.com
Education:
University: Faculty of Information Technology, People's Security Academy
Master student: University of Engineering and Technology, Hanoi, Vietnam

Tran Thi Thu
Workplace: Hanoi Law University
Email: thutran@hlu.edu.vn
Education:
University: Information Technology, University of Natural Sciences, Vietnam National University, Ho Chi Minh City
Master: Business Economics, University of Toulouse.

