IDS eBPF

Download as pdf or txt
Download as pdf or txt
You are on page 1of 2

A flow-based IDS using Machine Learning in eBPF

Maximilian Bachl, Joachim Fabini, Tanja Zseby


Technische Universität Wien
firstname.lastname@tuwien.ac.at

Abstract—eBPF is a new technology which allows dynamically A drawback, however, is that because eBPF is verified, it
loading pieces of code into the Linux kernel. It can greatly speed can only use certain data structures. For example, an eBPF
up networking since it enables the kernel to process certain program cannot use normal C arrays since they allow out-
packets without the involvement of a userspace program. So far
arXiv:2102.09980v3 [cs.CR] 4 Mar 2022

eBPF has been used for simple packet filtering applications such of-bounds accesses. For example, in an array of length 10,
as firewalls or Denial of Service protection. We show that it C would allow accessing the 15th element even though the
is possible to develop a flow-based network intrusion detection array only has 10 elements. Thus, eBPF programs make
system based on machine learning entirely in eBPF. Our solution use of special data structures which are safe. However, this
uses a decision tree and decides for each packet whether it can potentially be a performance penalty since checking the
is malicious or not, considering the entire previous context of
the network flow. We achieve a performance increase of over bounds of an array each time it is accessed requires extra work
20% compared to the same solution implemented as a userspace to be done by the CPU.
program. One alternative to using eBPF is using kernel modules.
However, a drawback of kernel modules is that they usually
I. I NTRODUCTION
cannot be verified for stability and that they have to be
A. eBPF compiled for a specific kernel version. Moreover, developing
eBPF is a technology which makes the Linux kernel pro- kernel modules is not straightforward and often it is not
grammable by enabling the injection of pieces of code at many possible to extend certain functionality in the kernel with a
locations of the kernel code. eBPF can be dynamically injected kernel module without changing the kernel itself. Changing
during runtime and is verified to make sure that it cannot the whole kernel itself makes it necessary to recompile the
crash and cannot get caught in infinite loops. However, this kernel, which is cumbersome.
verification is only possible for programs that are not turing-
complete. Thus eBPF programs cannot contain features such B. eBPF for a Machine Learning (ML)-based Intrusion De-
as loops of arbitrary length but instead loops must always tection System (IDS)
have a maximum number of iterations. Also, backward jumps
in the code are generally not allowed. This means eBPF can Some researchers have already investigated eBPF based
only be used to implement algorithms which do not require IDSs or thought about using ML in eBPF. [2] propose to use
turing-completeness. eBPF programs are usually written in C AI for detecting performance anomalies with eBPF. However,
and are first compiled to eBPF bytecode. Upon injection into they only propose a concept and do not provide an evaluation
the kernel, this eBPF bytecode is verified and dynamically of the benefit of their approach. [3], [4] and also [5] use eBPF
compiled to native code. to develop solutions against Denial-of-Service attacks. They
eBPF is especially suitable for packet processing: When a do not use ML.
packet arrives at a network interface, certain actions can be One challenge of using eBPF for ML is that it is not turing-
performed such as dropping the packet. This is useful for complete. However, ML algorithms such as decision trees
programs such as firewalls [1] or tcpdump, which records (DTs) or neural networks (NNs) do not require loops in the im-
packets according to certain filters. For example, if only plementation and thus don’t require turing-completeness and
packets coming from port 80 should be recorded, tcpdump can thus be implemented in eBPF. We decide to use DTs since
will compile an eBPF program which encodes this and will tree-based methods are a simple and effective ML method for
load it into the kernel. The kernel will then drop all packets IDSs [6]. As mentioned in the previous sections, eBPF data
which don’t match the filter and only pass the correct ones structures need some additional processing compared to the
to tcpdump. The alternative would be that tcpdump receives classic ones built into C. A question we want to answer in this
every packet and filters them itself. The drawback of this is that research is thus the following: Is eBPF faster than a solution
then each packet has to be passed from the kernel to tcpdump, implemented as a normal program in userspace? The userspace
which involves copying the whole packet in memory and also program has the disadvantage that all packets have to be passed
other computation steps. Thus, passing packets between the between the kernel and the program which is slow. The eBPF
kernel and programs should be avoided if possible because of program has the disadvantage that it makes use of potentially
performance reasons. eBPF allows one to do that. slower data structures. Thus, it is interesting to understand
Because eBPF bytecode is compiled to native code, it whether in practice eBPF can be faster even for complex
should generally be as fast as any other code in the kernel. programs which make use of data structures frequently.
II. E VALUATION socket and are processed by the IDS. The IDS is implemented
We envision an approach which keeps track of each network either as a classic userspace program or by using eBPF and run
flow and analyzes each packet in the context of the previous after one another (not at the same time). Both implementations
packets of the flow. For example, certain attacks could be are run for 10 s. Instead of deploying the IDS on an end host,
detected only when the fourth packet of the network flow it can also be deployed on a router or switch, which runs a
containing the attack arrives. eBPF has built-in hash tables recent Linux version and thus can run eBPF.
which allow storing information for each flow, which we use TABLE I: Maximum number of packets each implementation
for this purpose. As the key we use the classic five tuple can process (mean and standard deviation). Results averaged
of protocol type, source and destination IP and source and over 10 runs each.
destination port.
As features we use the source and destination port, the Userspace eBPF
protocol identifier (UDP, TCP, ICMP etc.), the packet length, Mean SD Mean SD

the time since the last packet of the flow and the direction packets/s 125 420 2627 152 274 1201
of the packet (from sender to receiver or vice versa) because
these features have proven useful for network traffic analysis Table I shows that the userspace implementation inspects
[6]. Additionally, we include the mean of the packet size, the fewer packets per second (125420) than the eBPF implemen-
time since the last packet and the direction for all packets tation (152274). The eBPF implementation is thus more than
received in the flow so far. Furthermore, also the mean absolute 20% faster than the userspace one.
deviation is computed for these three features, since the
standard deviation cannot be computed due to the absence of III. D ISCUSSION
advanced arithmetic operations in eBPF such as the square root An interesting consideration is that for complex ML models,
operation. A problem is that eBPF doesn’t allow floating point at some point the overhead from using eBPF’s special data
operations but only integers. We thus implement the decision structures becomes larger than the benefit that eBPF confers.
tree using fixed point arithmetic using 64 bit signed integers. For example, interesting future work would be to test whether
We use 16 bits for the part after the fixed point. random forests (RFs) or deep NNs can also achieve a perfor-
We use the popular CIC-IDS-2017 dataset [7]. We train mance advantage when implemented in eBPF. In this work,
the decision tree using scikit-learn with a maximum depth however, we could show that for a simple decision tree model,
of 10 and a maximum number of leaves of 1000 using a eBPF provides a significant performance advantage.
train/test split of 2:1, which achieves an accuracy of 99% on
R EFERENCES
the testing dataset after training. This accuracy is comparable
[1] F. Risso and M. Tumolo, “Towards a faster Iptables in eBPF,”
to the accuracy achieved in related work [8].
2018.
To enable reproducibility and to encourage further experi- [2] I. Ben-Yair, P. Rogovoy, and N. Zaidenberg, “AI & eBPF based
mentation by other researchers, we make all the source code performance anomaly detection system,” in SYSTOR, ACM, May
and other materials of this work publicly available 1 . 2019.
We implement the same IDS in userspace and also in eBPF [3] H. M. Demoulin, I. Pedisich, N. Vasilakis, V. Liu, B. T. Loo, and
L. T. X. Phan, “Detecting Asymmetric Application-layer Denial-
and use the previously trained DT. The code is completely
of-Service Attacks In-Flight with FineLame,” USENIX, 2019.
equivalent except that data structures are different because, [4] H. van Wieren, “Signature-Based DDoS Attack Mitigation: Au-
as mentioned previously, many normal data structures are tomated Generating Rules for Extended Berkeley Packet Filter
not usable in eBPF. Also, some data structures that eBPF and Express Data Path,” 2019.
has, such as hash maps, are not available in a normal C [5] Y. Choe, J.-S. Shin, S. Lee, and J. Kim, “eBPF/XDP Based
Network Traffic Visualization and DoS Mitigation for Intelligent
userspace program by default and thus we use a simple hash
Service Protection,” in Advances in Internet, Data and Web
map implementation from the Linux kernel for the userspace Technologies, Springer, 2020.
version. [6] F. Iglesias, D. C. Ferreira, G. Vormayr, M. Bachl, and T. Zseby,
We emulate a network using Linux network name spaces. “NTARC: A Data Model for the Systematic Review of Network
For this, we have one server and one client, connected via a Traffic Analysis Research,” Applied Sciences, Jan. 2020.
[7] I. Sharafaldin, A. Habibi Lashkari, and A. A. Ghorbani, “Toward
switch and configure the links to have no delay in addition
Generating a New Intrusion Detection Dataset and Intrusion
to the delay caused by the Linux kernel for forwarding the Traffic Characterization,” in ICISSP, SCITEPRESS, 2018.
packets. Furthermore, we set the maximum link speed to be [8] A. Hartl, M. Bachl, J. Fabini, and T. Zseby, “Explainability
unlimited. The server and the client are connected using iPerf. and Adversarial Robustness for RNNs,” in BigDataService 2020,
Since the link speed is unlimited, iPerf will send as fast as IEEE, Apr. 2020.
possible. The maximum speed is only limited by how fast the
computer can process the packets. The IDS is deployed by
opening a raw socket on the network interface of the server. All
packets passing through the server thus pass through the raw
1 https://github.com/CN-TU/machine-learning-in-ebpf

You might also like