407-Article Text-1753-1-10-20220429
Abstract— Denial of Service attacks have been around since the dawn of the Internet age. As the Internet has developed and expanded, denial of service attacks have become increasingly powerful and now pose a serious threat in cyberspace. This article evaluates four machine learning algorithms, K-nearest neighbor (KNN), Decision Tree, Random Forest, and Support Vector Machine (SVM), on various metrics for detecting DDoS attacks. The main objective of the paper is to analyze the algorithms, collect data, and evaluate the effectiveness of the algorithms in DDoS attack detection.

Keywords— DDoS; KNN; Decision Tree; Random Forest; SVM.

I. INTRODUCTION

A Distributed Denial of Service (DDoS) attack is carried out by flooding a server with online traffic from multiple sources. This causes the server to run out of resources and bandwidth. DDoS first appeared in 1999.

Vietnam faces a high risk of both being targeted by and being used to distribute DDoS attacks, ranking 6th globally after China, the US, France, Russia and Brazil, 2nd in the Asia Pacific region, and first in Southeast Asia [1].

DDoS involves making requests from a network of millions of computers with different IP addresses over which control has previously been established (a botnet). Computers and other networked resources such as IoT devices together create "tsunamis" of traffic. A DDoS attack can be understood as a sudden traffic jam that blocks a highway, preventing normal traffic from reaching its destination. Because the attack is dispersed over many access points with different IP ranges, DDoS is much stronger than DoS, and it is often difficult to recognize or prevent DDoS attacks.

Different types of DDoS attacks target different components of a network connection. Based on the target and behavior, DDoS attacks can be classified into three types: traffic/fragmentation attacks, bandwidth/volume attacks, and application layer attacks.

In late 1999, CERT first published its report on the threat of DDoS attacks and outlined specific prevention actions to mitigate this threat [2]. A few months later, the Internet suffered its first large-scale DDoS attack [3], followed by successive attacks of increasingly large scale in the following years. Since then, researchers have analyzed a number of tools used to launch DDoS attacks [4, 5, 6], measured their impact on the Internet, and come up with a number of defense methods [7]. These research efforts have resulted in a number of effective and reliable anti-DDoS products offered as stand-alone devices or cloud-based services.

In recent years, along with the strong development of Artificial Intelligence (AI), machine learning (ML) and deep learning methods are increasingly used to detect DDoS attacks. Sambangi and Gondi propose an approach that uses multiple linear regression to detect DDoS attacks [8].
Step 1. Select random samples from the given data set.
Step 2. Build a decision tree for each sample and obtain a prediction from each tree.
Step 3. Collect a vote for each predicted result.
Step 4. Select the result with the most votes as the final prediction.

In addition, Random Forest has the following notable characteristics:
• A collection of uncorrelated trees performing the same task does better than any single tree on its own;
• The trees are assumed to be independent of each other in error rate, or to have little correlation with each other;
• Feature selection must be good enough for each tree to classify better than random selection;
• The predictions and errors of the individual trees have little correlation with each other.

D. SVM

Support vector machine (SVM) is a supervised machine learning algorithm that is very commonly used today in classification or regression problems. It became popular in the 1990s thanks to its application to solving non-linear problems.

The idea of SVM is to find a hyperplane that separates the data points. This hyperplane divides the space into different domains, and each domain contains one type of data.

The optimal hyperplane is the separating hyperplane with the largest margin. Machine learning theory has shown that such a hyperplane minimizes the error bound.

III. DATA PROCESSING AND PARAMETER IMPLEMENTATION

A. DATA SET

The authors collected the data set from the available documents under the link []. The dataset has already been pre-processed and labeled; the authors downloaded it and used it according to the model's requirements. The new dataset collected by the authors contains four types of DDoS attacks as follows: (HTTP Flood, SIDDOS, UDP Flood) and no redundant or duplicate records. Table I lists the record counts for these types of attacks. Table II shows the processed features of the data set.

TABLE I. NUMBER OF RECORDS IN THE DATA SET BY ATTACK TYPE

Attack Type    Number of records
SIDDOS         6550
UDP Flood      201344
HTTP Flood     4110

TABLE II. PROCESSED CHARACTERISTICS OF THE DATASET

11  NUMBER_OF_PKT      Continuous unit
12  NUMBER_OF_BYTE     Continuous unit
13  NODE_NAME_FROM     Symbolic unit
14  NODE_NAME_TO       Symbolic unit
21  PKT_AVG_SIZE       Continuous unit
22  UTILIZATION        Continuous unit
23  PKT_DELAY          Continuous unit
24  PKT_SEND_TIME      Continuous unit
25  PKT_RESEVED_TIME   Continuous unit
26  FIRST_PKT_SENT     Continuous unit
27  LAST_PKT_RESEVED   Continuous unit

The proposed data collection system follows these steps:
• Collect and control: all network traffic from NIDS is collected and examined;
• Preprocessing data format: redundant and duplicate records are removed;
• Feature extraction: feature parameters are extracted from the collected network traffic and each feature is assigned to a data column; together they are used as a vector in the new dataset;
• Statistical measurements: in this step, additional features are calculated using statistical equations.

• Receiving input dataset: the system receives user-provided network attack datasets;
• Machine learning model training: the system stores machine learning algorithms commonly used in network attack detection, then trains those algorithms with the input dataset;
• Changing model parameters: the system adjusts some parameters of each algorithm to increase the accuracy of the algorithm;
• Display training and model evaluation results: the system outputs the weights of the trained model and the model's accuracy evaluation parameters;
• Make a conclusion whether the network behavior is a denial of service attack or not.

B. DATA PROCESSING

Starting from the above dataset, the data are processed before being used in the experiment. All input records must be processed in the same way. Therefore, data cleaning is always the first step in designing a machine learning model. The symbolic features such as PKT_TYPE, FLAGS, NODE_NAME_FROM, NODE_NAME_TO, PKT_CLASS and unimportant features like SRC_ADD, DES_ADD are removed.

Because the data set has a relatively high number of records belonging to normal behavior, 10000 records are taken for the 2 labels Normal and UDP Flood in order to balance the machine learning model. The input data set is divided into training and testing sets in the ratio of 7:3.
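
The preprocessing described in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout (a list of `(features, label)` pairs), the helper names, the toy data, and the reading of "10000 records for 2 labels" as a fixed number of records per label are all assumptions. Only the removed feature names, the two labels, and the 7:3 ratio come from the text.

```python
import random

# Features the text says to remove (symbolic or unimportant).
DROPPED_FEATURES = {
    "PKT_TYPE", "FLAGS", "NODE_NAME_FROM", "NODE_NAME_TO",
    "PKT_CLASS", "SRC_ADD", "DES_ADD",
}


def drop_features(features):
    """Return a copy of one feature dict without the dropped features."""
    return {k: v for k, v in features.items() if k not in DROPPED_FEATURES}


def balance(samples, labels, per_label, seed=0):
    """Keep at most `per_label` randomly chosen samples for each wanted label."""
    rng = random.Random(seed)
    kept = []
    for label in labels:
        group = [s for s in samples if s[1] == label]
        rng.shuffle(group)
        kept.extend(group[:per_label])
    rng.shuffle(kept)  # mix the labels before splitting
    return kept


def split_7_3(samples):
    """Split samples into training and testing sets in the ratio 7:3."""
    cut = len(samples) * 7 // 10  # integer arithmetic avoids float rounding
    return samples[:cut], samples[cut:]


def make_toy_samples():
    """Hypothetical stand-in for the real dataset: (features, label) pairs."""
    counts = {"Normal": 50, "UDP Flood": 30, "HTTP Flood": 10}
    samples = []
    for label, n in counts.items():
        for i in range(n):
            features = {
                "PKT_TYPE": "tcp",        # symbolic, will be dropped
                "SRC_ADD": "3.0",         # unimportant, will be dropped
                "PKT_DELAY": 0.01 * i,    # continuous, kept
                "NUMBER_OF_PKT": i,       # continuous, kept
            }
            samples.append((features, label))
    return samples


cleaned = [(drop_features(f), label) for f, label in make_toy_samples()]
# Assumption: 20 records per label here stand in for the paper's 10000.
balanced = balance(cleaned, labels=["Normal", "UDP Flood"], per_label=20)
train, test = split_7_3(balanced)
print(len(train), len(test))
```

Fixing the random seed keeps the sampled subset reproducible between runs, which makes it easier to compare algorithms on exactly the same training and testing records.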
Figure 3. The process of building a new dataset

The authors use a data collection system inherited from the topic "Building an application to [...]"

C. HYPERPARAMETER SELECTION

Hyperparameter tuning is an important step in machine learning techniques. Hyperparameters are user-defined parameters that control the training process of the model and play an important role in determining the performance of the model. Such parameter tuning is usually done by traversing a predefined grid of parameters. This parameter grid can consist of defined values, or it can be random, following a definite distribution or condition. In this paper, a parameter grid with defined values is used, as shown in the following table:

TABLE III. HYPERPARAMETER GRID

Algorithm  Parameter name       Value
KNN        Neighbors number     [10, 100, 1000]
DT         Evaluation function  Gini impurity or Information gain (Entropy)
RF         Number of Tree       [10, 100, 1000]
SVM        C                    [-1, 1, 3]
SVM        γ                    [-1, 1, 3]

Recall: the proportion of points labeled as attack behavior that are recognized by the model. Recall is also known as True Positive Rate (TPR), Sensitivity, or Hit rate.

F1-score: the harmonic mean of Precision and Recall, defined when these two quantities are non-zero. It is calculated by the formula:

$$F_1 = 2 \cdot \frac{\mathit{precision} \cdot \mathit{recall}}{\mathit{precision} + \mathit{recall}}$$

False positive rate (FPR), also known as False Alarm Rate, is the rate of false detection, where a behavior is normal but the model considers it to be attack behavior.

IV. RESULTS AND DISCUSSION

The results of running the 4 mentioned algorithms are presented in the following table:

TABLE IV. RESULTS WHEN RUNNING 4 ALGORITHMS

Algorithms  Accuracy  Precision  Recall  F1-score  FPR
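
The metrics named in the header of Table IV (accuracy, precision, recall, F1-score, FPR) all follow from the four confusion-matrix counts. The sketch below uses the standard definitions. The function names, the label strings `"attack"` and `"normal"`, and the toy predictions are assumptions made for illustration and are not taken from the paper's experiments.

```python
def confusion_counts(y_true, y_pred, positive="attack"):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred, positive="attack"):
    """Return accuracy, precision, recall, F1-score and FPR as a dict."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # true positive rate
    # Harmonic mean of precision and recall; 0 when both are 0.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0         # false alarm rate
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "fpr": fpr}


# Toy example: 4 attack records and 6 normal records.
y_true = ["attack"] * 4 + ["normal"] * 6
y_pred = ["attack", "attack", "attack", "normal",   # one attack is missed
          "attack", "normal", "normal", "normal",   # one false alarm
          "normal", "normal"]
scores = evaluate(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

A low FPR matters as much as a high recall in this setting, because every false alarm marks legitimate traffic as an attack.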
relatively good results and are optimized for better performance.

Figure 5. ROC curves of 4 algorithms

In addition, the experimental results are evaluated using the ROC (Receiver Operating Characteristic) curve, a graphical plot that illustrates the performance of a binary classification system. Each point on the ROC curve corresponds to the true positive rate (sensitivity) on the vertical axis and the false positive rate (1 - specificity) on the horizontal axis. The closer the curve lies to the top-left corner, the clearer the distinction between the two states. The ROC curves for the 4 algorithms are shown in Figure 5. The AUC (Area Under the ROC Curve) values of the Decision Tree, Random Forest, KNN, and SVM algorithms are 0.9093, 0.9508, 0.9475, and 0.9489, respectively. These are all values in the excellent range, where the Decision Tree algorithm gives the lowest result and the Random Forest algorithm gives the best prediction.

VI. CONCLUSION

Based on the newly collected dataset containing four types of DDoS attacks as follows: (HTTP Flood, SIDDOS, UDP Flood) and no redundant or duplicate records, the authors conducted experiments with 4 machine learning algorithms for DDoS attack detection. As a result, all 4 algorithms are capable of detecting DDoS attacks with high accuracy, fast speed and efficiency.

Recently, with the continuous development of 5G, a large number of insecure Internet of Things (IoT) devices have been connected to the Internet, which presents great challenges for protection against DDoS attacks, especially when attackers are trying to "recruit" more devices into botnets (for example, the Mirai botnet) to increase the frequency, size and throughput of DDoS attacks worldwide. In the future, attackers will most likely take advantage of artificial intelligence and machine learning to alter attacks automatically so that they evolve toward more optimal attack techniques. In that case, it is necessary to improve DDoS attack detection algorithms towards real-time processing of the raw data of the attacks obtained.

ACKNOWLEDGMENT

The authors would like to thank VINIF for their financial support to Nguyen Thi Khanh Tram as a master's student at VNU University of Engineering and Technology.

REFERENCES

[1]. Workshop "Protecting networks and data from distributed denial of service (DDoS) attacks targeting organizations and businesses", 3 May 2019, organized by the Authority of Information Security (Cục An toàn Thông tin), VietnamNet newspaper, and Nexusguard Limited.
[2]. CERT Coordination Center, "Results of the Distributed-Systems Intruder Tools Workshop", Software Engineering Institute, 1999.
[3]. L. Garber, "Denial-of-Service Attacks Rip the Internet", IEEE Computer, 33(4):12–17, 2000.
[4]. D. Dittrich, "The DoS Project's "trinoo" Distributed Denial of Service Attack Tool", 21 October 1999.
[5]. D. Dittrich, "The "stacheldraht" distributed denial of service attack tool", https://staff.washington.edu/dittrich/misc/stacheldraht.analysis/, 31 December 1999.
[6]. D. Dittrich, "The "Tribe Flood Network" Distributed Denial of Service Attack Tool", https://staff.washington.edu/dittrich/misc/tfn.analysis/, 1999.
[7]. D. Kumar, G. Rao, M. K. Singh, and G. Satyanarayana, "A Survey of Defense Mechanisms countering DDoS Attacks in the Network", Intl. Journal of Advanced Research in Computer and Communication Engineering, 2:2599–2606, July 2013.
[8]. Swathi Sambangi and Lakshmeeswari Gondi, "A Machine Learning Approach for DDoS (Distributed Denial of Service) Attack Detection Using Multiple Linear Regression", in the 14th International Conference INTER-ENG 2020 Interdisciplinarity in Engineering, Mures, Romania, 8 September 2020.
[9]. P. Sangkatsanee, N. Wattanapongsakorn and C. Charnsripinyo, "Practical real-time intrusion detection using machine learning approaches", Elsevier Computer Communications 34 (2011) 2227-2235.
[10]. I. Sofi, A. Mahajan and V. Mansotra, "Machine Learning Techniques used for the Detection and Analysis of Modern Types of DDoS Attacks", International Research