
Received May 14, 2021, accepted May 27, 2021, date of publication June 7, 2021, date of current version June 23, 2021.


Digital Object Identifier 10.1109/ACCESS.2021.3086704

An Exhaustive Survey on P4 Programmable Data Plane Switches: Taxonomy, Applications, Challenges, and Future Trends
ELIE F. KFOURY 1 , (Graduate Student Member, IEEE), JORGE CRICHIGNO 1, (Member, IEEE),
AND ELIAS BOU-HARB 2 , (Senior Member, IEEE)
1 College of Engineering and Computing, University of South Carolina, Columbia, SC 29201, USA
2 The Cyber Center for Security and Analytics, The University of Texas at San Antonio, San Antonio, TX 78249, USA
Corresponding author: Elie F. Kfoury (ekfoury@email.sc.edu)
This work was supported in part by the National Science Foundation under Grant 1925484 and Grant 1829698, and in part by the Office of
Advanced Cyberinfrastructure (OAC).

ABSTRACT Traditionally, the data plane has been designed with fixed functions to forward packets using a small set of protocols. This closed-design paradigm has limited the capability of the switches to proprietary implementations which are hard-coded by vendors, inducing a lengthy, costly, and inflexible process. Recently, data plane programmability has attracted significant attention from both the research community and the industry, permitting operators and programmers in general to run customized packet processing functions. This open-design paradigm is paving the way for an unprecedented wave of innovation and experimentation by reducing the time of designing, testing, and adopting new protocols; enabling a customized, top-down approach to develop network applications; providing granular visibility of packet events defined by the programmer; reducing complexity and enhancing resource utilization of the programmable switches; and drastically improving the performance of applications that are offloaded to the data plane. Despite the impressive advantages of programmable data plane switches and their importance in modern networks, the literature has been missing a comprehensive survey. To this end, this paper provides a background encompassing an overview of the evolution of networks from legacy to programmable, describing the essentials of programmable switches, and summarizing their advantages over Software-defined Networking (SDN) and legacy devices. The paper then presents a unique, comprehensive taxonomy of applications developed with P4 language; surveying, classifying, and analyzing more than 200 articles; discussing challenges and considerations; and presenting future perspectives and open research issues.

INDEX TERMS Programmable switches, P4 language, Software-defined Networking, data plane, custom
packet processing, taxonomy.

I. INTRODUCTION
Since the emergence of the world wide web and the explosive growth of the Internet in the 1990s, the networking industry has been dominated by closed and proprietary hardware and software. Consider the observations made by McKeown [1] and the illustration in Fig. 1, which shows the cumulative number of Request For Comments (RFCs) [2]. While at first an increase in RFCs may appear encouraging, it has actually represented an entry barrier to the network market. The progressive reduction in the flexibility of protocol design caused by standardized requirements, which cannot be easily removed to enable protocol changes, has perpetuated the status quo. This protocol ossification [3], [4] has been characterized by a slow innovation pace at the hands of a few network vendors. As an example, after being initially conceived by Cisco and VMware [5], the Application Specific Integrated Circuit (ASIC) implementation of the Virtual Extensible LAN (VXLAN) [6], a simple frame encapsulation protocol, took several years, a process that could have been reduced to weeks by software implementations.¹

The associate editor coordinating the review of this manuscript and approving it for publication was Petros Nicopolitidis.

¹ The RFC and VXLAN observations are extracted from Dr. McKeown’s presentation in [1].

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
87094 VOLUME 9, 2021
E. F. Kfoury et al.: Exhaustive Survey on P4 Programmable Data Plane Switches

FIGURE 1. Cumulative number of RFCs.

Protocol ossification has been challenged first by Software-defined Networking (SDN) [7], [8] and then by the recent advent of programmable switches. SDN fostered major advances by explicitly separating the control and data planes, and by implementing the control plane intelligence as software outside of the switches. While SDN reduced network complexity and spurred control plane innovation at the speed of software development, it did not wrest control of the actual packet processing functions away from network vendors. Traditionally, the data plane has been designed with fixed functions to forward packets using a small set of protocols (e.g., IP, Ethernet). The design cycle of switch ASICs has been characterized by a lengthy, closed, and proprietary process that usually takes years. Such a process contrasts with the agility of the software industry.

Programmable forwarding can be viewed as a natural evolution of SDN, where the software that describes how packets are processed can be conceived, tested, and deployed in a much shorter time span by operators, engineers, researchers, and practitioners in general. The de-facto standard for defining the forwarding behavior is the P4 language [9], which stands for Programming Protocol-independent Packet Processors. Essentially, P4 programmable switches have removed the entry barrier to network design, previously reserved to network vendors.

The momentum of programmable switches is reflected in the global ecosystem around P4. Operators such as AT&T [10], Comcast [11], NTT [12], KPN [13], Turk Telekom [14], Deutsche Telekom [15], and China Unicom [14] are now using P4-based platforms and applications to optimize their networks. Companies with large data centers such as Facebook [16], Alibaba [17], and Google [18] operate on programmable platforms running customized software, in contrast to the fully proprietary implementations of just a few years ago [19]. Switch manufacturers such as Edgecore [20], Stordis [21], Cisco [22], Arista [23], Juniper [24], and Interface Masters [25] are now manufacturing P4 programmable switches with multiple deployment models, from fully programmable or white boxes to hybrid schemes. Chip manufacturers such as Barefoot Networks (Intel) [26], Xilinx [27], Pensando [28], Mellanox [29], and Innovium [30] have embraced programmable data planes without compromising performance. The availability of tools and the agility of software development have opened an unprecedented possibility of experimentation and innovation by enabling network owners to build custom protocols and process them using protocol-independent primitives, reprogram the data plane in the field, and run P4 code on diverse platforms. Major agencies supporting engineering research and education worldwide are investing in programmable networks as well. For example, the U.S. National Science Foundation (NSF) has funded FABRIC [31], [32], a national research backbone based on P4 programmable switches. Another project funded by the NSF operates an international Software Defined Exchange (SDX) which includes a P4 testbed that enables international research and education institutions to share P4 resources [33]. Similarly, a European consortium has recently built 2STiC [34], a P4 programmable network that interconnects universities and research centers.

A. CONTRIBUTION
Despite the increasing interest in P4 switches, previous work has only partially covered this technology. As shown in Table 1, there is currently no updated and comprehensive material. Thus, this paper addresses this gap by providing an overview of the evolution of networks from legacy to programmable; describing the essentials of programmable switches and P4; and summarizing the advantages of programmable switches over SDN and legacy devices. The paper continues by presenting a taxonomy of applications developed with P4; surveying, classifying, analyzing, and comparing more than 200 articles; discussing challenges and considerations; and putting forward future perspectives and open research issues.

B. PAPER ORGANIZATION
The road-map of this survey is illustrated in Fig. 2. Section II studies and compares existing surveys on various P4-related topics and demonstrates the added value of the offered work. Section III describes the traditional and SDN devices, and the evolution toward programmable data planes. Section IV introduces programmable switches and their features and explains the Protocol Independent Switch Architecture (PISA), a pipeline forwarding model. Section V describes the survey methodology and the proposed taxonomy. Subsequent sections (from Section VI to Section XII) explore the works pertaining to various categories proposed in the taxonomy, and compare the P4 approaches in each category, as well as with the legacy-enabled solutions. Section XIII outlines challenges and considerations extracted and induced from the literature, and pinpoints directions that can be explored in the future to ameliorate the state-of-the-art solutions. Finally, Section XIV concludes the survey. The abbreviations used in this article are summarized in Table 36, at the end of the article.


FIGURE 2. Paper roadmap.

II. RELATED SURVEYS
The advantages of programmable switches have attracted considerable attention from the research community and have been described in previous surveys.

Stubbe [35] discussed various P4 compilers and interpreters in a short survey. This work provided a short background on the P4 language and demonstrated the main building blocks that describe packet processing in a programmable switch. It outlined reference hardware and software programmable switch implementations. The survey lacks critical discussions on the evolution of programmable switches, the features of the P4 language, the existing applications, challenges, and the potential future work.

Dargahi et al. [36] focused on stateful data planes and their security implications. There are two main objectives of this survey. First, it introduces the reader to recent trends and technologies pertaining to stateful data planes. Second, it discusses relevant security issues by analyzing selected use cases. The scope of the survey is not limited to P4 for programming the data plane. Instead, it describes other schemes such as OpenState [44], Flow-level State Transitions (FAST) [45], etc. When reviewing the security properties of stateful data planes, the authors described a mapping between potential attacks and corresponding vulnerabilities. The survey lacks critical discussions on the P4 language and its features, the existing applications beyond security, the challenges, and the potential future work.

Cordeiro et al. [37] discussed the evolution of SDN from OpenFlow to data plane programmability. The survey briefly explained the layout of a P4 program and how it is mapped to the abstract forwarding model. It then listed various compilers, tools, simulators, and frameworks for P4 development. The authors categorized the literature into two categories: 1) programmable security and dependability management; 2) enhanced accounting and performance management. In the first category, the authors listed works pertaining to policy modeling, analysis, and verification, as well as intrusion detection and prevention, and network survivability. In the second category, the authors focused on network monitoring, traffic engineering, and load balancing. The survey only lists a limited set of papers without providing much detail or explaining how the papers differ from each other. Moreover, the survey was published in 2017, so a significant percentage of the P4-related works published since then is missing.

Satapathy [38] presented a limited description of the pitfalls of traditional networks and the evolution of SDN. The report briefly described elements of the P4 language. The authors then discussed the control plane and P4Runtime [46], and enumerated three use cases of P4 applications. The report concludes with potential future work. This work lacks critical discussions on the P4 language and its features, the existing applications, and challenges.

The short survey presented by Bifulco and Rétvári [39] reviews the trends and issues of abstractions and architectures that realize programmable networks. The authors discussed the motivation of packet processing devices in the networking field and described the anatomy of a programmable switch. The proposed taxonomy categorizes the literature as state-based, abstraction-based, implementation-based, and layer-based. The layer-based category consists of the control/intent layer and the data plane layer; the implementation-based category encompasses software and hardware switches; the abstraction-based category includes data flow graphs and match-action pipelines; and the state-based category differentiates between stateful and stateless data planes. This short survey lacks critical discussions on the existing P4 applications.

Kaljic et al. [40] presented a survey on data plane flexibility and programmability in SDN networks. The authors evaluated data plane architectures through several definitions of flexibility and programmability. In general, flexibility in SDN refers to the ability of the network to adapt its resources (e.g., changes in the topology or the network requirements). Afterwards, the authors identified key factors that influence the deviation from the original data plane given with OpenFlow. The survey concludes with future research directions.

Kannan and Chan [41] presented a short survey related to the evolution of programmable networks. This work described the pre-SDN model and the evolution to SDN and programmable data planes. The authors highlighted some features of programmable switches such as stateful processing, accurate timing information, and flexible packet cloning and recirculation. The survey categorized data plane applications into two categories, namely, network monitoring and in-network computing. While this survey listed a considerable number of papers belonging to these categories, it barely explained the operation and main ideas of each paper. It also lacks many other categories that are relevant in the programmable data plane context.

Tan et al. [42] presented a survey describing In-band Network Telemetry (INT). The survey explained the development stages and classifications of network measurement (traditional, SDN-based, and P4-based). It also outlined some existing applications that leverage INT such as congestion control, troubleshooting, etc. The survey concludes with discussions and potential future work related to INT.

Zhang et al. [43] presented a survey that focuses on stateful data planes. The survey starts with an overview of stateless and stateful data planes, then overviews and compares some stateful platforms (e.g., OpenState, FAST, FlowBlaze, etc.). The paper reviews a handful of stateful data plane applications and discusses challenges and future perspectives.

Table 1 summarizes the topics and the features described in the related surveys. It also highlights how this paper differs from the existing surveys. All previous surveys lack a microscopic comparison between the intra-category works. Also, none of them compares switch-based schemes against legacy server-based schemes. To the best of the authors’ knowledge, this work is the first to exhaustively explore the whole programmable data plane ecosystem. Specifically, the paper describes P4 switches and provides a detailed taxonomy of applications using P4 switches. It categorizes and compares the applications within each category as well as with legacy approaches, and provides challenges and future perspectives.

TABLE 1. Comparison with related surveys.

III. TRADITIONAL CONTROL PLANE AND SDN
A. TRADITIONAL AND SDN DEVICES
With traditional devices, networks are connected using protocols such as Open Shortest Path First (OSPF) and Border Gateway Protocol (BGP) [47] running in the control plane at each device. Both control and data planes are under full control of vendors. On the other hand, SDN delineates a clear separation between the control plane and the data plane, and consolidates the control plane so that a single centralized controller can control multiple remote data planes. The controller is implemented in software, under the control of the network owner. The controller computes the tables used by each switch and distributes them via a well-defined Application Programming Interface (API), such as OpenFlow [48]. While SDN allows for the customization of the control plane, it is limited to the OpenFlow specifications and the fixed-function data plane.

TABLE 2. Features of traditional, SDN, and P4 programmable devices.
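To make the centralized control model of Section III-A more concrete, the following Python sketch shows a controller that computes per-switch forwarding tables from a global view of the topology and then installs them on each switch. It is only an illustration under stated assumptions: the three-switch topology, the host names, the `FixedFunctionSwitch` class, and its `install` method are hypothetical stand-ins for a real southbound API such as OpenFlow, and the breadth-first search is one simple way to pick paths, not a method taken from the surveyed works.

```python
from collections import deque


def compute_forwarding_tables(topology, hosts):
    """Return {switch: {destination_host: output_port}} for every switch.

    topology: {switch: {neighbor_switch: local_port_toward_neighbor}}
    hosts:    {host: (attached_switch, local_port_toward_host)}
    Both structures are hypothetical inputs chosen for this sketch.
    """
    tables = {switch: {} for switch in topology}
    for host, (edge_switch, host_port) in hosts.items():
        # The switch attached to the host delivers packets directly.
        tables[edge_switch][host] = host_port
        # Breadth-first search outward from the edge switch: every newly
        # reached switch forwards toward the switch it was reached from.
        visited = {edge_switch}
        queue = deque([edge_switch])
        while queue:
            current = queue.popleft()
            for neighbor in topology[current]:
                if neighbor in visited:
                    continue
                visited.add(neighbor)
                tables[neighbor][host] = topology[neighbor][current]
                queue.append(neighbor)
    return tables


class FixedFunctionSwitch:
    """Data plane stand-in: it only looks up entries the controller installed."""

    def __init__(self, name):
        self.name = name
        self.table = {}

    def install(self, entries):
        # Stand-in for a table update received over the control API.
        self.table = dict(entries)

    def forward(self, destination):
        # Returns the output port, or None when no entry matches.
        return self.table.get(destination)


# Hypothetical line topology: h1 - s1 - s2 - s3 - h3
topology = {
    "s1": {"s2": 2},
    "s2": {"s1": 1, "s3": 2},
    "s3": {"s2": 1},
}
hosts = {"h1": ("s1", 1), "h3": ("s3", 2)}

switches = {name: FixedFunctionSwitch(name) for name in topology}
for name, entries in compute_forwarding_tables(topology, hosts).items():
    switches[name].install(entries)
```

In this sketch the controller can change how paths are chosen, but the switch can only perform the lookup it was built for, which mirrors the limitation noted above: the control plane is customizable while the data plane remains fixed-function.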
B. COMPARISON OF TRADITIONAL, SDN, AND PROGRAMMABLE DATA PLANE DEVICES
Table 2 contrasts the main characteristics of traditional, SDN, and P4 programmable devices. In the latter, the forwarding behavior is defined by the user’s code. Other advantages include the program-dependent APIs, where the same P4 program running on different targets requires no modifications in the runtime applications (i.e., the control plane and the interface between control and data planes are target agnostic); the protocol-independent primitives used to process packets; the more powerful computation model where the match-action stages can be arranged not only in series but also in parallel; and the in-field reprogrammability at runtime. On the other hand, the technology maturity and support for P4 devices can still be considered low in contrast to traditional and SDN devices.
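As a rough illustration of what "forwarding behavior defined by the user's code" can mean, the Python sketch below models a user-defined header, a parser for it, and a single match-action table whose entries are supplied separately from the processing logic. This is not P4 and not the API of any real target: the header layout (a 16-bit tenant identifier and an 8-bit priority), the `MatchActionTable` class, and the action names are all assumptions invented for this example.

```python
import struct

# Hypothetical custom header invented for this sketch:
# a 16-bit tenant identifier followed by an 8-bit priority.
TENANT_HEADER = struct.Struct("!HB")


def parse(packet):
    """Extract the user-defined header and return (headers, payload)."""
    tenant_id, priority = TENANT_HEADER.unpack_from(packet)
    headers = {"tenant_id": tenant_id, "priority": priority}
    return headers, packet[TENANT_HEADER.size:]


def set_egress_port(metadata, port):
    """Action: forward the packet on the given port (port is action data)."""
    metadata["egress_port"] = port


def drop(metadata):
    """Action: mark the packet to be dropped."""
    metadata["drop"] = True


class MatchActionTable:
    """Exact-match table: the program fixes the key, the entries come later."""

    def __init__(self, key_field, default_action):
        self.key_field = key_field
        self.default_action = default_action
        self.entries = {}

    def add_entry(self, key, action, **action_data):
        # In a real deployment the control plane would populate entries.
        self.entries[key] = (action, action_data)

    def apply(self, headers, metadata):
        action, action_data = self.entries.get(
            headers[self.key_field], (self.default_action, {})
        )
        action(metadata, **action_data)


def process(packet, table):
    """Parse the packet, apply the table, and return the resulting state."""
    headers, payload = parse(packet)
    metadata = {"egress_port": None, "drop": False}
    table.apply(headers, metadata)
    return headers, metadata, payload


tenant_table = MatchActionTable("tenant_id", default_action=drop)
tenant_table.add_entry(7, set_egress_port, port=3)

known = TENANT_HEADER.pack(7, 1) + b"payload"
unknown = TENANT_HEADER.pack(9, 0) + b"payload"
```

Calling `process(known, tenant_table)` selects egress port 3, while `process(unknown, tenant_table)` falls through to the default `drop` action. The point of the sketch is that neither the header format nor the table key is baked into the device; both are part of the program.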
C. NETWORK EVOLUTION AND ANALOGY WITH OTHER DOMAIN SPECIFIC PROCESSORS
The introduction of general-purpose computers in the early 1970s enabled programmers to develop applications running on CPUs. The use of high-level languages accelerated innovation by hiding the target hardware (e.g., x86). In signal processing, Digital Signal Processors (DSPs) were developed in the late 1970s and early 1980s with instruction sets optimized for digital signal processing. MATLAB is used for developing DSP applications. In graphics, Graphics Processing Units (GPUs) were developed in the late 1990s and early 2000s with instruction sets for graphics. Open Computing Language (OpenCL) is one of the main languages for developing graphics applications. In machine learning, Tensor Processing Units (TPUs) and TensorFlow were developed in the mid-2010s with instruction sets optimized for machine learning.

Programmable forwarding is part of the larger information technology evolution observed above. Specifically, over the last few years, a group of researchers developed a machine model for networking, namely the Protocol Independent Switch Architecture (PISA) [49]. PISA was designed with instruction sets optimized for network operations. The high-level language for programming PISA devices is P4.

IV. PROGRAMMABLE SWITCHES
A. PISA ARCHITECTURE
PISA is a packet processing model that includes the following elements: programmable parser, programmable match-action pipeline, and programmable deparser, see Fig. 3.

The programmable parser permits the programmer to define the headers (according to custom or standard protocols) and to parse them. The parser can be represented as a state machine. The programmable match-action pipeline executes the operations over the packet headers and intermediate results. A single match-action stage has multiple memory blocks (tables, registers) and Arithmetic Logic Units (ALUs), which allow for simultaneous lookups and actions. Since some action results may be needed for further processing (e.g., data dependencies), stages are arranged sequentially. The programmable deparser reassembles the packet headers and serializes them for transmission. A PISA device is protocol-independent.

FIGURE 3. A PISA-based data plane and its interaction with the control plane.

In Fig. 3, the P4 program defines the format of the keys used for lookup operations. Keys can be formed using information from the packet headers. The control plane populates table entries with keys and action data. Keys are used for matching packet information (e.g., destination IP address) and action data is used for operations (e.g., output port).

B. PROGRAMMABLE SWITCH FEATURES
The main features of programmable switches are [51]:
• Agility: the programmer can design, test, and adopt new protocols and features in significantly shorter times (i.e., weeks or months rather than years).
• Top-down design: for decades, the networking industry operated in a bottom-up approach. Fixed-function ASICs are at the bottom and enforce available protocols and features to the programmer at the top. With programmable switches, the programmer describes protocols and features in the ASICs. Note that the physical layer and parts of the MAC layer may not be programmable.
• Visibility: programmable switches provide greater visibility into the behavior of the network. INT is an example of a framework to collect and retrieve information from the data plane, without intervention of the control plane.
• Reduced complexity: fixed-function switches incorporate a large superset of protocols. These protocols consume resources and add complexity to the processing logic, which is hard-coded in silicon. With programmable switches, the programmer has the option to implement only those protocols that are needed.
• Differentiation: the customized protocol or feature implemented by the programmer need not be shared with the chip manufacturer.
• Enhanced performance: programmable switches do not introduce a performance penalty. On the contrary, they may produce better performance than fixed-function switches. Table 3 shows a comparison between a programmable switch and a fixed-function switch, reproduced from [50]. Note the enhanced performance of the former (e.g., maximum forwarding rate, latency, power draw). When compared with general purpose CPUs, ASICs remain faster at switching, and the gap is only increasing as shown in Fig. 4.

TABLE 3. Comparison between a P4 programmable switch and a fixed-function switch [50].

FIGURE 4. Evolution of the packet forwarding speeds of the general-purpose CPU and the switch chip (reproduced from [53]).

The performance gain of switches relies on the multiple dimensions of parallelism, as described next.
• Parallelism on different stages: each stage of the pipeline processes one packet at a time [49]. In Fig. 3, the number of stages is n. Implementations may have more than 10 stages on the ingress and egress pipelines. While adding more stages increases parallelism, they consume more area on the chip and increase power consumption and latency.
• Parallelism within a stage: the ASIC contains multiple match-action units per stage. During the match phase, tables can be used for parallel lookups. In Fig. 3, there are four matches (in blue) on each stage that can occur at the same time. An ALU executes one operation over the header field, enabling parallel actions on all fields. Hundreds of match-action units exist per stage and thousands in an entire pipeline [49]. Since ALUs execute simple operations and use a simple Reduced Instruction Set Computer (RISC)-type instruction set, they can be implemented in the silicon at a minimal cost.
• Very Long Instruction Words: the set of instructions issued in a given clock cycle can be seen as one large instruction with multiple operations, referred to as a Very Long Instruction Word (VLIW). A VLIW is formed from the output of the match tables. A stage executes one VLIW per packet, and each action unit within the stage executes one operation. Thus, for a given packet, one operation per field per stage is applied [52].
• Parallelism on pipelines: the switch chip may contain multiple pipelines, also referred to as pipes. Pipes on a PISA device are analogous to cores on a general purpose CPU. Examples include chips containing two and four pipes [20], [49]. Each pipe is isolated from the others and processes packets independently. Pipes may implement the same functionality or different functionalities.

C. P4 LANGUAGE
P4 has a reduced instruction set and has the following goals:
• Reconfigurability: the parser and the processing logic can be redefined in the field.
• Protocol independence: the switch is protocol-agnostic. The programmer defines the protocols, the parser, and the operations to process the headers.
• Target independence: the underlying ASIC is hidden from the programmer. The compiler takes the switch’s capabilities into account when turning a target-independent P4 program into a target-dependent binary.

The original specification of the P4 language was released in 2014, and is referred to as P4₁₄. In 2016, a new version of the language was drafted, which is referred to as P4₁₆. P4₁₆ is a more mature language which extended the P4 language to broader underlying targets: ASICs, Field-Programmable Gate Arrays (FPGAs), Network Interface Cards (NICs), etc.

V. METHODOLOGY AND TAXONOMY
This section describes the systematic methodology that was adopted to generate the proposed taxonomy. The results of
this literature survey represent derived findings by thoroughly exploring more than 200 data plane-related research works starting from 2016 up to 2020. Their distribution is summarized in Fig. 5 (a). Note that the survey additionally includes the important works of the first quarter of 2021.

Fig. 5 (b) depicts the share of each implementation platform used in the surveyed papers, grouped by software (e.g., BMv2, PISCES), ASIC (e.g., Tofino, Cavium), NetFPGA (e.g., NetFPGA SUME), and SmartNICs (e.g., Netronome NFP). The graph shows that the vast majority of the works were implemented on software and hardware switches. Note that behavioral software switches (e.g., BMv2 [244]) are not suitable indicators of whether the program could run on a hardware target; they are typically used for prototyping ideas and to foster innovation. On the other hand, non-behavioral software switches (e.g., PISCES [245], derived from Open vSwitch (OVS) [246]) are production-grade and can be deployed in data centers.

FIGURE 5. (a) Distribution of surveyed data plane research works per year. (b) Implementation platform distribution. The shares are calculated based on the studied papers in this survey.

It is worth noting that the majority of works implemented on hardware switches are recent; this demonstrates the increase in the adoption of programmable switches by the industry and academia. Currently, to acquire a switch equipped with a Tofino chip (e.g., Edgecore Wedge100BF-32 [20]), and to get the development environment and the customer support, a Non-Disclosure Agreement (NDA) with Barefoot Networks (Intel) should be signed. Additionally, the client should attend a training course (e.g., [247]) to understand the architecture and the specifics of the platform. This process is somewhat lengthy and costly, and not every institution is capable of affording it.

FIGURE 6. Taxonomy of programmable switches literature based upon relevant, explored research areas.

The proposed taxonomy is demonstrated in Fig. 6. The taxonomy was meticulously designed to cover the most significant works related to data plane programmability and P4. The aim is to categorize the surveyed works based on various high-level disciplines. The taxonomy provides a
clear separation of categories so that a reader interested in a specific discipline can read only the works pertaining to the said discipline. The correctness of the taxonomy was verified by carefully examining the related work of each paper to correlate them into high-level categories. Each high-level category is further divided into sub-categories. For instance, various measurement works belong to the sub-category “Measurements” under the high-level category “Network Performance”.

Further, the survey compares the results and the features offered by programmable data plane approaches (intra-category), as well as with those of the contemporary and legacy ones. This detailed comparison is elaborated upon for each sub-category, giving the interested reader a comprehensive view of the state-of-the-art findings of that sub-category. Additionally, the survey presents various challenges and considerations, as well as some current and future trends that could be explored as future work.

VI. IN-BAND NETWORK TELEMETRY (INT)
Conventional monitoring and collecting tools and protocols (e.g., ping, traceroute, Simple Network Management Protocol (SNMP), NetFlow, sFlow) are by no means sufficiently accurate to troubleshoot the network, especially in the presence of congestion. These methods provide millisecond accuracy at best and cannot capture events that happen at microsecond timescales. Moreover, they cannot provide per-packet visibility across the network.

In-band Network Telemetry (INT) [248] is one of the earliest key applications of programmable data plane switches. It enables querying the internal state of the switch and provides fine-grained and precise telemetry measurements (e.g., queue occupancy, link utilization, queuing latency, etc.). INT handles events that occur on a microsecond scale, also known as microbursts. Collecting and reporting the network state is performed entirely by the data plane, without any intervention from the control plane. Due to the increased visibility achieved with INT, network operators are able to troubleshoot problems more efficiently. Additionally, it is possible to perform instant processing in the data plane after measuring telemetry data (e.g., reroute flows when a link is congested), without having to interact with the control plane. Fig. 7 shows an INT-enabled network. INT enables network administrators to determine the following:
• The path a packet took when traversing the network (see Fig. 8). Such information is difficult to learn using existing technologies when multi-path routing strategies (e.g., Equal-cost Multi-Path Routing (ECMP) [249], flowlet switching [250]) are used.
• The matched rules that forwarded the packets (e.g., ACL entry, routing lookup).
• The time a packet spent in the queue of each switch.
• The flows that shared the queue with a certain packet.

FIGURE 7. In-band Network Telemetry (INT).

FIGURE 8. Example of how INT can be used to provide the path traversed by a packet in the network. The INT source inserts its label (S1) as well as the INT headers to instruct subsequent switches about the required operations (i.e., push their labels). Finally, switch S4 strips the INT headers from the packet and forwards them to a collector, while forwarding the original packet to the receiver.

The P4 Applications Working Group developed the INT telemetry specifications [251] with contributions from key enablers of the P4 language such as Barefoot Networks, VMware, Alibaba, and others. INT allows instrumenting the metadata to be monitored without modifying the application layer. The metadata to be inserted depends on the use case; for example, if congestion is the main concern to monitor, the programmer inserts queue metadata and transit latency. An INT-enabled network has the following entities: 1) INT source: a trusted entity that specifies, through the initial instruction set, what metadata should be added into the packet by other INT-capable devices; 2) INT transit hop: a device adding its own metadata to an INT packet after examining the INT instructions inserted by the INT source; 3) INT sink: a trusted entity that extracts the INT headers in order to keep the INT operation transparent for upper-layer applications; and 4) INT collector: a device that receives and processes INT packets.

The location of an INT header in the packet is intentionally not enforced in the specifications document. For example, it can be inserted as a payload on top of TCP, UDP, and NSH, as a Geneve option on top of Geneve, and as a VXLAN payload on top of VXLAN.

A. POSTCARD-BASED TELEMETRY (PBT)
INT provides the exact forwarding path, the timestamp and latency at each network node, and other information. Such detailed information is derived by augmenting user packets

with data collected by each switch. Postcard-based Telemetry (PBT) is an alternative to INT which does not modify user packets. Fig. 9 shows an example of PBT. As a user packet traverses the network, each switch generates a postcard and sends it to the monitor. The event that triggers the generation of the postcard is defined by the programmer, according to the application's need. Examples include start and/or end of a flow, sampling (e.g., one report per second), packet dropped by the switch, queue congestion, etc.

FIGURE 9. Postcard-based Telemetry (PBT).

B. INT VARIATIONS
1) BACKGROUND
Despite the improvements that INT brings compared to legacy monitoring schemes, it introduces bandwidth overhead when enabled unconditionally by network operators. In such scenarios, INT headers are added to every packet traversing the switch, which increases bandwidth overhead and decreases the overall network throughput. To mitigate this limitation, conditional statements are included in the P4 program to send reports only when certain events occur (e.g., queue utilization exceeds a threshold). Such a solution requires network operators to adjust thresholds and parameters manually based on the usual network traffic patterns.

Consequently, several variations of INT have been developed, aiming at customizing its functionalities and addressing its limitations. Mainly, recent works focus on minimizing the bandwidth overhead of INT by adjusting thresholds and parameters automatically, based on measured traffic patterns and the desired application type.

2) ACTIVE NETWORK TELEMETRY
Network telemetry can be actively collected by generating and sending probes to a selected network path. Probes are typically used for minimizing the traffic overhead imposed by regular INT. Liu et al. [54] proposed NetVision, a probing-based telemetry system that actively sends the right amount and format of probe packets depending on the telemetry application (e.g., traffic engineering, network visualization). INT-Path [57] is another probing-based approach that was the first to achieve network-wide telemetry. Network-wide telemetry provides a global view of the network, which simplifies the management and the control decisions. INT-Path uses an Euler trail-based path planning policy to generate probe paths. This mechanism allows achieving non-overlapped probe paths. The idea is to transform network troubleshooting into pattern recognition problems after encoding the traffic status into a series of bitmap images. Lin et al. [62] subsequently proposed NetView, which extends NetVision. The objective of NetView is to achieve on-demand network-wide telemetry. NetView considers various telemetry applications, has full coverage, and achieves scalable telemetry.

3) PASSIVE NETWORK TELEMETRY
Instead of actively sending probes through the network, INT can determine telemetry information passively [252]. The standardized INT [251], which writes telemetry information along the path in packets, is an example of passive network telemetry.

Kim et al. [56] proposed selective INT (sINT), a scheme that dynamically adjusts the insertion frequency of INT headers. A monitoring engine observes changes in consecutive INT metadata and applies a heuristic algorithm to compute the insertion ratio. Marques et al. [58] described the orchestration problem in INT, which is associated with the optimal use of network resources for collecting the state and behavior of forwarding devices through INT. Niu et al. [59] proposed multilayer INT (ML-INT), a system that visualizes IP-over-optical networks in real time. The proposed system encodes INT headers in a subset of packets pertaining to an IP flow. The encoded headers contain metadata that describes statistics of electrical and optical network elements on the flow's routing path. Ben Basat et al. [61] proposed Probabilistic INT (PINT), an approach that probabilistically adds telemetry information into a collection of packets to minimize the per-packet overhead associated with regular INT. Hyun et al. [55] proposed an architecture for self-driving networks that uses INT to collect packet-level network telemetry, and Knowledge-Defined Networking (KDN) to add intelligence to network management, considering the collected telemetry data. KDN accepts the network information as input and generates policies to improve the network performance. Karaagac et al. [60] extended INT from wired networks to wireless networks.

4) INT VARIATIONS, COMPARISON, AND DISCUSSIONS
Table 4 compares the aforementioned INT variations. The main motivation behind these solutions is that the majority of applications that leverage INT (e.g., congestion control, fast reroute) only require approximations of the telemetry data and therefore, do not need to gather per-packet per-hop INT information. NetVision, NetView, and INT-Path use probing to reduce the overhead of INT. The main limitation of such approaches is that probing might result in poor accuracy and timeliness as the probes might experience different network conditions than actual packets. All other works collect INT information passively. The work in [55] and sINT select flows based on current network conditions, ML-INT uses a fixed sampling scheme to select a small portion of packets in a flow, and PINT uses a probabilistic approach to encode telemetry on multiple packets. Note that sampling and anomaly-based monitoring might lead to information loss since not all packets are being reported.

TABLE 4. INT variations comparison.

Some solutions require manual intervention from the operators to configure the telemetry process. The simplicity of the configuration interface is vital to make the solution easily deployable. Furthermore, some solutions (e.g., NetView, INT-Path) achieve network-wide telemetry. Note that network-wide traffic monitoring incurs additional overhead since multiple switches are being monitored at the same time. Finally, some solutions were implemented on software switches, while others were implemented on hardware. It is important to note that not all software implementations can fit into the pipeline of the hardware.

TABLE 5. In-band, postcard-based, and traditional network telemetry.

5) INT, PBT, AND TRADITIONAL TELEMETRY COMPARISON
Table 5 compares INT, PBT, and traditional telemetry. INT has higher potential vulnerabilities than PBT, such as eavesdropping and tampering. Adding extra protective measures (e.g., encryption) is difficult on the fast data path. On the other hand, PBT packets tolerate additional processing to enhance security. The flow tracking process is simpler with INT than with PBT. The latter requires the server receiving INT reports (i.e., INT collector, explained in Section VI-C) to correlate multiple postcards of a single flow packet passing through the network, to form the packet history at the monitor. This process also adds delay in reporting and tracking. Legacy schemes that rely on sampling and polling suffer from accuracy issues, especially when links are congested. INT on the other hand is push-based, has better accuracy, and is more granular (microsecond scale). Reports sent by an INT-capable device contain rich information (e.g., the path a packet took) that can aid in troubleshooting the network. Such visibility is minimal in legacy monitoring schemes. Programmable switches permit reporting telemetry after the occurrence of specific events (e.g., congestion). Moreover,
they provide flexibility in programming reactive logic that executes promptly in the data plane. One drawback of INT is that it imposes bandwidth overhead if configured to report for every packet; however, when event-based reports are considered, the bandwidth overhead significantly decreases.

C. INT COLLECTORS
1) BACKGROUND
An INT collector is a component in the network that processes telemetry reports produced by INT devices. It parses and filters metrics from the collected reports, then optionally stores the results persistently into a database. Since a large number of reports is typically produced in INT, having a high-performance collector is essential to avoid missing important network events. To this end, a number of research works focus on developing and enhancing the performance of INT collectors running on commodity servers. Both open-source and closed-source INT collectors have been proposed in the literature.

2) OPEN-SOURCE
IntMon [63] is an ONOS-based collector application for INT reports. It includes a web-based interface that allows controlling which flows to monitor and the specific metadata to collect. Another INT collector is the Prometheus INT exporter [64], which extracts information from every INT packet and pushes it to a gateway. A database server then periodically pulls information from the gateway. INTCollector [65] is a collector that extracts events, which are important network information, from INT raw data. It uses in-kernel processing to further improve the performance. INTCollector has two processing flows: the fast path, which processes INT reports and needs to execute quickly, and the normal path, which processes events sent from the fast path and stores information in the database.

3) CLOSED-SOURCE
Deep Insight [66] is a proprietary solution provided by Barefoot Networks that leverages INT capabilities to provide services such as real-time anomaly detection, congestion analysis, packet-drop analysis, etc. It follows a pay-as-you-grow business model, where customers pay based on the volume of collected telemetry. Another proprietary solution is BroadView Analytics, used on Broadcom Trident 3 devices [67]. This solution enables real-time network latency analysis and facilitates Service Level Agreement (SLA) compliance.

4) INT COLLECTORS COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 6 compares the aforementioned INT collectors. IntMon and Prometheus INT exporter were among the earliest collectors. Both have low processing rates since they are implemented without kernel or hardware acceleration. Also, they are very limited with respect to the features they provide (e.g., lack of event detection, limited analytics, historical data unavailability, etc.). Prometheus INT exporter also suffers from increased overhead of sending the data for every INT packet to the gateway, and the potential loss of network events as the database only stores the latest data pulled from the gateway. INTCollector on the other hand has a higher processing rate and uses the eXpress Data Path (XDP) [253] to accelerate the packet processing in the kernel space. It filters the data to be published based on significant changes in the network through its event detection mechanism. DeepInsight Analytics has a modular architecture and runs on commodity servers. It executes the Barefoot SPRINT data plane telemetry which consists of a P4 program (INT.p4) encompassing intelligent triggers. It also provides open northbound RESTful APIs that allow customers to integrate their third-party network management solutions. DeepInsight Analytics is advanced with respect to the features it provides (real-time anomaly detection, congestion analysis, packet-drop analysis, etc.). However, it is a closed-source solution and lacks reports of performance benchmarks.

TABLE 6. INT collectors comparison.

FIGURE 10. CPU efficiency with the three INT collectors. Source: INTCollector paper [65].
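The event-detection idea attributed to INTCollector above (publish a metric only when it changes significantly) can be illustrated with a minimal sketch. This is not INTCollector's actual algorithm or API: the class name `EventFilter`, the relative-change rule, the 50% threshold, and the metric name `hop_latency_us` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EventFilter:
    """Hypothetical collector-side filter: keep a report only if the metric
    moved by more than `threshold` (relative change) since the last value
    that was published for the same (flow, metric) pair."""
    threshold: float
    last_published: dict = field(default_factory=dict)

    def is_event(self, flow_id: str, metric: str, value: float) -> bool:
        key = (flow_id, metric)
        prev = self.last_published.get(key)
        if prev is None or self._significant(prev, value):
            # First observation or significant change: publish and remember it.
            self.last_published[key] = value
            return True
        return False

    def _significant(self, prev: float, value: float) -> bool:
        if prev == 0:
            return value != 0
        return abs(value - prev) / abs(prev) > self.threshold


# Four per-packet reports for one flow (values are made up).
reports = [
    ("f1", "hop_latency_us", 100),
    ("f1", "hop_latency_us", 110),  # +10%: suppressed
    ("f1", "hop_latency_us", 200),  # +100% vs. last published: event
    ("f1", "hop_latency_us", 190),  # -5%: suppressed
]
flt = EventFilter(threshold=0.5)
events = [r for r in reports if flt.is_event(*r)]
print(events)
```

With these assumed inputs, only the first report and the jump to 200 are stored, which shows how such filtering reduces the write load on the database at the cost of dropping small variations.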

Fig. 10 demonstrates the CPU efficiency of three INT collectors (IntMon, Prometheus INT exporter, and INTCollector) [65]. IntMon has the lowest throughput, and is 57 times slower than Prometheus INT exporter. INTCollector on the other hand has the highest throughput and is 27 times faster than Prometheus INT exporter.

5) COLLECTORS IN INT AND LEGACY MONITORING SCHEMES COMPARISON
Generally, collectors used with both INT and legacy monitoring schemes run on general purpose CPUs, and hence, have comparable performance. INT produces excessive amounts of reports when compared with legacy monitoring schemes (e.g., NetFlow), and therefore, requires having a collector with high processing capability. INT-based collectors are typically accelerated with in-kernel fast packet processing technologies (e.g., XDP) and kernel-bypass frameworks (e.g., Data Plane Development Kit (DPDK)).

D. SUMMARY AND LESSONS LEARNED
Legacy telemetry tools and protocols are not capable of capturing microbursts or providing fine-grained telemetry measurements. INT was developed to address these challenges; it enables the data plane developer to query the internal state of switches with high precision. Telemetry data are then embedded into packets and forwarded to a high-performance collector. The collector typically performs analysis and applies actions accordingly (e.g., informs the control plane to update table entries). Current research efforts mainly focus on developing variations of INT to decrease its telemetry traffic overhead, considering the overhead-accuracy trade-off. Other works aim at accelerating INT collectors to handle large volumes of traffic (in the scale of Kpps). Future work could possibly investigate further improvements for INT such as compressing packets' headers, broadening coverage and visibility, enriching the telemetry information, and simplifying the deployment.

VII. NETWORK PERFORMANCE
Measuring and improving network performance is critical in today's infrastructures. Low latency and high bandwidth are key requirements to operate modern applications that continuously generate enormous amounts of data [254]. Congestion control (CC), which aims at avoiding network overload, is critical to meet these requirements. Another important concept for expediting these applications is managing the queues that form in routers and switches through Active Queue Management (AQM) algorithms. This section explores the literature related to measuring and improving the performance of programmable networks.

A. CONGESTION CONTROL (CC)
1) BACKGROUND
One of the most challenging tasks in the Internet today is congestion control and collapse avoidance [255]. The difficulty in controlling the congestion is increasing due to factors such as high-speed links, traffic diversity and burstiness, and buffer sizes [68]. Today's CC algorithms aim at shortening delays, maximizing throughput, and improving the fairness and utilization of network resources.

A tremendous amount of research work has been done on congestion control, including end-host algorithms such as loss-based CC algorithms (e.g., CUBIC [256], Hamilton TCP (HTCP) [257], etc.), model-based algorithms (e.g., Bottleneck Bandwidth and Round-trip Time (BBR) [258], [259]), congestion-signalling mechanisms (e.g., Explicit Congestion Notification (ECN) [260]), data-center specific schemes (e.g., TIMELY [261], Data Center Quantized Congestion Notification (DCQCN) [262], Data Center TCP (DCTCP) [263], pFabric [264], Performance-oriented Congestion Control (PCC) [265], etc.), and application-specific schemes (e.g., QUIC [266]).

With the advent of programmable data plane switches, researchers are investigating new methods for managing congestion. Such methods can be classified as 1) hybrid CC, where network-assisted congestion feedback is provided for end-hosts; and 2) in-network CC, where the switch performs traffic rerouting, steering, or other congestion control techniques, without modifications on end hosts.

2) HYBRID CC
Handley et al. [68] proposed NDP, a novel protocol architecture for datacenters that aims at achieving low completion latency for short flows and high throughput for longer flows. NDP avoids core network congestion by applying per-packet multipath load balancing, which comes at the cost of reordering. It also trims the payloads of packets, similar to what is done in Cut Payload (CP) [267], whenever the queues of the switches become saturated. Once the payload is trimmed, the headers are forwarded using high-priority queues. Consequently, a Negative ACK (NACK) is generated and sent through high-priority queues so that a retransmission is sent before draining the low priority queue. Similarly, Feldmann et al. [69] proposed a method that uses network-assisted congestion feedback (NCF) in the form of NACKs generated entirely in the data plane. NACKs are sent to throttle elephant-flow senders in case of congestion. The method maintains three separate queues for mice flows, elephant flows, and control packets to ensure fair sharing of resources.

FIGURE 11. HPCC: INT-based high precision congestion control.

Li et al. [70] proposed High Precision Congestion Control (HPCC), a new CC mechanism that leverages INT-based data added by P4 switches to obtain precise link load information. HPCC computes accurate flow rate by using only one rate update, as opposed to legacy approaches that require a
large number of iterations to determine the rate. HPCC provides near-zero queueing, while being almost parameterless. Fig. 11 shows the mechanism of HPCC. The switches add INT headers to every packet, and then the INT information is piggybacked into the TCP/RDMA Acknowledgement (ACK) packet. The end-hosts then use this information to adjust the sending rate through their smart Network Interface Controllers (NICs).

Kfoury et al. [71] proposed a P4-based method to automate end-hosts' TCP pacing. It supplies the bottleneck bandwidths and the number of elephant flows to senders so that they can pace their rates to safe targets, avoiding filling routers' buffers. Shahzad et al. [72] proposed EECN, a system that uses ECN to signal the occurrence of congestion to the sender without involving the receiver. This is especially useful for networks with high bandwidth-delay product (BDP).

3) IN-NETWORK CC
Turkovic et al. [73] proposed a P4-based method that reroutes flows to backup paths during congestion. The system detects congestion by continuously monitoring the queueing delays of latency-critical flows. The same authors [74] proposed a method that separates the senders based on their congestion control algorithm. Each congestion control algorithm uses a separate queue in order to enforce the fairness among its competing flows. Apostolaki et al. [75] proposed FAB, a flow-aware and device-wide buffer sharing scheme. FAB prioritizes flows from port-level to the device-level. The goal of FAB is to minimize the flow completion time for short flows in specific workloads. Geng et al. [76] proposed P4QCN, a flow-level, rate-based congestion control mechanism that improves the Quantized Congestion Notification (QCN). P4QCN improves QCN by alleviating the problems of Priority-based Flow Control (PFC) within a lossless network. Furthermore, P4QCN extends the QCN protocol to IP-routed networks.

4) CC SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 7 compares the aforementioned CC schemes. NDP and NCF are similar in the sense that both use NACKs as congestion feedback. NDP avoids congestion by applying per-packet multipath load balancing. This approach works adequately with symmetric topologies, but fails when topologies are asymmetric (e.g., BCube, Jellyfish), especially during heavy network load. Another limitation of NDP is the excessive retransmissions produced by the server. NCF adopted the idea of packet trimming from NDP, but generates a NACK from the trimmed packet and sends it directly to the sender. Such an approach removes the receiver from the feedback loop, improving the sender's reaction time. One limitation of NCF is that it requires operators to manually tune some of the predefined parameters (e.g., threshold, queue size, etc.). Additionally, NCF might disclose network congestion information, making it less attractive to operators. Finally, the authors of NCF claim that the approach works with both datacenters and Internet-wide scenarios. However, no implementation results were presented to evaluate the effectiveness of the solution.

TABLE 7. Congestion control schemes comparison.

HPCC leverages INT data to control network congestion. It enhances the convergence time by using a Multiplicative-Increase Multiplicative-Decrease (MIMD) scheme. Note that previous TCP variants use the Additive-Increase Multiplicative-Decrease (AIMD), which is conservative when increasing the rate, and hence has a slow convergence time. The reason AIMD schemes are slow is that they use single-bit congestion information (packet loss, ECN). With HPCC, end-hosts can perform aggressive increase as INT metadata encompasses precise link utilization and timely queue statistics. HPCC demonstrated promising results with respect to latency, bandwidth, and convergence time. The authors however did not evaluate the performance of HPCC with conventional congestion control algorithms in the Internet (e.g., CUBIC, BBR). Note that achieving inter-protocol fairness is essential so that the solution is adopted by operators.

The method in [71] uses TCP pacing. Pacing decreases throughput variations and traffic burstiness, and hence, minimizes queuing delays. However, this method works well only in networks where the number of large-flow senders is small
87106 VOLUME 9, 2021


E. F. Kfoury et al.: Exhaustive Survey on P4 Programmable Data Plane Switches

TABLE 8. Congestion control schemes. 1) Programmable Switches (HPCC); 2) end-hosts; and 3) legacy network-assisted (ECN).

(e.g., in science Demilitarized Zone (DMZ) [254]). Further, measurements schemes have accuracy limitations since they
it is worth mentioning that methods which provide congestion rely on polling and sampling-based methods to gather
feedback to end hosts must implement some security mecha- traffic statistics. Typically, sampling methods have high sam-
nisms to prevent packets from being modified. pling rates (e.g., one every 30,000 packets) and polling
As for the full in-network CC schemes, P4Air, which methods have large polling intervals. The literature [268]
applies traffic separation, demonstrated significant improve- has shown that such methods are only suitable for
ments in fairness compared to contemporary solutions. How- coarse-grained visibility. The accuracy limitation of sam-
ever, it requires allocating a queue for each congestion control pling and polling techniques hampers the development of
algorithm group (e.g., loss-based (Cubic), delay-based (TCP measurement applications. For instance, it is not possible to
Vegas), etc.). Note that the number of queues is limited accurately measure frequently changing TCP-specific fields
in switches, and production networks often reserve them such as congestion window, receive window, and sending
for other applications’ QoS [70]. P4QCN is not evaluated rate.
on hardware targets, and therefore their results (which are Data streaming or sketching algorithms [269]–[272] were
extracted based on software switches) are not that indicative. proposed to answer the limitation of sampling and polling.
They address the following problem: an algorithm is allowed
5) END-HOSTS, PROGRAMMABLE SWITCHES, AND LEGACY to perform a constant number of passes over a data stream
DEVICES’ CC SCHEMES (input sequence of items) while using sub-linear space com-
Table 8 compares the CC schemes assisted by programmable pared to the dataset and the dictionary sizes; desired statis-
switches (e.g., HPCC) with end-hosts CC algorithms tical properties (e.g., median) on the data stream are then
(e.g., CUBIC) and legacy congestion signalling schemes estimated by the algorithm. The main problem with such
(e.g., ECN). End-hosts CC infer congestion through algorithms is that they are tightly coupled to the metrics of
packet drops and estimations (e.g., btlbw and Round-trip interest. This means that switch vendors should build spe-
Time (RTT) estimation with BBR), which is not always suffi- cialized algorithms, data structures, and hardware for specific
cient to infer the existence of congestion. Legacy devices use monitoring tasks. With the constraints of CPU and memory in
classic ECN to signal congestion so that end-hosts slow down networking devices, it is challenging to support a wide spec-
their transmission rates. Classic ECN is limited as it only trum of monitoring tasks that satisfy all customers. Legacy
marks a single bit to signal congestion, and is not aggressive devices also lack the capability of customizing the processing
nor immediate. Programmable switches on the other hand behavior so that switches co-operate in the measurement
use fine-grained prompt measurements to signal congestion process.
(e.g., INT metadata), which results in higher detection accu- With the emergence of programmable switches, it is now
racy, near-zero queueing delays, and faster convergence time. possible to perform fine-grained measurements in the data
The distributed nature of end-hosts CC schemes allows them plane at line rate. Moreover, data structures such as sketches
to operate without modifying the network infrastructure and and bloom filters can be easily implemented and customized
without tweaking parameters. ECN-enabled devices and pro- for specific metrics of interest. Programmable switches pave
grammable switches on the other hand require few param- the way for new areas of research in measurements since not
eters (e.g., marking threshold) to adapt to different network only they provide flexibility in inspecting with high accuracy
conditions. the traffic statistics, but also allow programmers to express
reactive processing in real time (e.g., dropping a packet when
B. MEASUREMENTS a threshold is bypassed as done in Random Early Detection
1) BACKGROUND (RED) [273]).
Gaining an overall understanding of the network behavior INT provides path-level metrics, with data similar to
is an increasingly complex task, especially when the size that of polling-based techniques. Note that the metrics
of the network is large and the bandwidth is high. Legacy themselves are fixed; for instance, it is possible to deter-

VOLUME 9, 2021 87107


E. F. Kfoury et al.: Exhaustive Survey on P4 Programmable Data Plane Switches

mine the flow-level latency, but not the latency variation ditions (e.g., bandwidth, packet rate and flow size distri-
(jitter) [79]. The fixed metrics of INT also prevent perform- bution). *Flow [85] supports concurrent measurements and
ing network-wide measurements; note that the INT stan- dynamic queries. Such approach aims at minimizing the con-
dard specification document does not mention methods to currency problems and the network disruption resulting from
aggregate metadata and perform complex analytics in the data compiling excessive queries into the data plane.
plane. TurboFlow [86] aims at achieving high coverage without
This section focuses on techniques that provide measure- sacrificing information richness. Bai et al. [94] proposed
ments that go beyond the fixed metrics extracted from the FastFE, a system that performs traffic features extraction
internal state of the switch. by leveraging programmable data planes. Extracted fea-
tures are then used by traffic analysis and behavior detector
2) GENERIC QUERY-BASED MONITORING ML techniques.
Operators constantly change their monitoring specifications.
Adding new monitoring requirements on the fixed-function 3) PERFORMANCE DIAGNOSIS SYSTEMS
switching ASIC is expensive. Recent work explored the idea Recent works are leveraging programmable data planes to
of providing a query-driven interface that allows operators to diagnose network performance. The main motivation here is
express their monitoring requirements. The queries can then that fine-grained information can be monitored at line rate,
be converted into switch programs (e.g., P4) to be deployed in mitigating the slow reaction to ‘‘gray failures’’ experienced
the network. Alternatively, the queries can be executed on the by diagnosing end-hosts in legacy approaches.
control plane considering the measured information extracted Ghasemi et al. [80] proposed Dapper, an in-network TCP
from the data plane. performance diagnosis system. Dapper analyzes packets in
A simplistic attempt is FlowRadar [77], a system that real time, and identifies and pinpoints the root cause of the
stores counters for all flows in the data plane with low bottleneck (sender, network, or receiver). Blink [90] also
memory footprint, then exports periodically (every 10ms) diagnoses TCP-related issues. In particular, it detects failures
to a remote collector. Liu et al. [78] proposed Universal in the data plane based on retransmissions, and consequently,
Monitoring (UnivMon), an application-agnostic monitoring reroutes traffic. Other approaches attempt to diagnose per-
framework that provides accuracy and generality across a formance degradation manifested by an increase of latency.
wide range of monitoring tasks. UnivMon benefits from Wang et al. [92] proposed SpiderMon, a system that performs
the granularity of the data plane to improve accuracy and network-wide performance degradation diagnosis. The key
runs different estimation algorithms on the control plane. idea is to have every switch maintain fine-grained telemetry
Narayana et al. [79] presented Marple, a query lan- data for a short period of time, and upon detecting per-
guage based on common query constructs (i.e., map, filter, formance degradation (e.g., increased delay), the informa-
group by). Marple allows performing advanced aggregation tion is offloaded to a collector. Liu et al. [89] proposed a
(e.g., moving average of latencies) at line rate in the data memory-efficient approach for network performance mon-
plane. Similarly, Sonata [87] provides a unified query inter- itoring. This solution only monitors the top-k problematic
face that uses common dataflow operators, and partitions flows.
each query across the stream processor and the data plane.
PacketScope [93] also uses dataflow constructs but allows to 4) QUEUE AND OTHER METRICS MEASUREMENT
query the internal switch processing, both in the ingress and Programmable data planes allows querying the internal state
the egress pipelines. of the queue with fine-grained visibility. Recent works lever-
Many of the previous works use the sketch data structure. aged this feature to provide better queueing information
The work in [96] extended the sketching approach used in which can be used by various applications (e.g., AQMs,
previous works to support the notion of time. The motivation congestion control, etc.).
of this work is that recently captured traffic trends are the Chen et al. [88] proposed ConQuest, a P4-based queue
most relevant in network monitoring. Huang et al. [97] measurement solution that determines the size of flows occu-
proposed OmniMon, an architectural design that pying the queue in real time, and identifies flows that are
coordinates flow-level network telemetry operations between grabbing a significant portion of the queue. Joshi et al. [83]
programmable switches, end-hosts, and controllers. Such proposed BurstRadar, a system that uses programmable
coordination aims at achieving high accuracy while maintaining switches to monitor microbursts in the data plane. Microbursts
low resource overhead. Chen et al. [98] proposed are events of sporadic congestion that last for tens
BeauCoup, a P4-based measurement system that handles or hundreds of microseconds. Microbursts increase latency,
multiple heterogeneous queries in the data plane. It offers jitter, and packet loss, especially when links’ speeds are high
a general query abstraction that counts the attributes across and switch buffers are small.
related packets identified by keys, and flags packets that Other works enabled measuring further metrics. For
surpass a defined threshold. instance, Ding et al. [91] proposed P4Entropy, an algorithm
Other approaches such as Elastic sketch [81] perform to estimate network traffic entropy (Shannon entropy) in the
measurements that are adaptive to changes in network con- data plane. Tracking entropy is useful for calculating traffic


TABLE 9. Measurements schemes comparison.

distribution in order to understand the network behavior. Another example is the system proposed by Chen et al. [95], which passively measures the RTT of TCP traffic in ISP networks. RTT measurement is important for detecting spoofing and routing attacks, ensuring Service Level Agreement (SLA) compliance, measuring the Quality of Experience (QoE), improving congestion control, and many others.

5) MEASUREMENTS SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 9 compares the measurements schemes.

a: GENERIC QUERY-BASED MONITORING
Some schemes (e.g., Sonata, FlowRadar, UnivMon) performed approximations of the metrics by using probabilistic data structures (e.g., sketch, bloom filter, etc.), sampling methods, and top-k counting. In addition, some focused on a subset of traffic by leveraging event matching techniques. Such techniques are primarily used to achieve high resource efficiency (i.e., low memory footprint), but cannot achieve full accuracy. On the other hand, systems like OmniMon carefully coordinate the collaboration among different types of entities in the network. Such coordination results in efficient resource utilization and full accuracy. OmniMon follows a split-merge strategy where the split operation decomposes telemetry operations into partial operations and schedules them among the entities (switches, end-hosts, and controller), and the merge operation coordinates the collaboration among these entities. The idea is to leverage the strength of the data plane in the switches and end-hosts (i.e., per-flow measurements with high accuracy) and the control plane
(i.e., network-wide collaboration). OmniMon also ensures consistency through a synchronization mechanism and accountability through a system of linear equations considering packet loss and other data center characteristics. Results show that OmniMon reduces the memory by 33%-96% and the number of actions by 66%-90% when compared to state-of-the-art solutions.

Another criterion that differentiates the measurements schemes is whether there are computations being performed outside the data plane. Most of the systems use the control plane or external servers to perform complex computations since the data plane has limited support for complex arithmetic functions. While some systems (e.g., BeauCoup) do not require an external computation device, they often support fewer measurement operations.

The selection of the data structure to be used in the data plane strongly affects the measurement features supported by a certain scheme. For instance, the goal of BeauCoup is to enable simultaneous distinct counting queries; for such a task, the authors based their design on the coupon-collection problem [274], which computes the number of random draws from n coupons such that all coupons are drawn at least once. For example, if the threshold of distinct destination IPs for detecting superspreaders is 130, instead of recording all distinct destination IPs, 32 coupons are defined. Consequently, the destination IPs of incoming packets are mapped to those 32 coupons. While this data structure uses less memory than the other state-of-the-art measurement sketches, it is limited to specific objectives (distinct counting). Other works (e.g., UnivMon) focused on generalizing the measurement scenarios, and hence, used universal sketches as data structures.

Qiu et al. [96] focused on capturing traffic trends that are the most relevant in network monitoring and attack detection. The notion of time is not supported by native streaming algorithms. For instance, count-min sketch, which is a data structure that uses a constant amount of memory to record data, is oblivious to the passage of time. Existing solutions that consider recency are easily implemented in software, but not on programmable ASICs. For example, resetting a sketch after a timer expires requires iterating over the elements in the sketch, an operation that cannot be implemented in the data plane due to the lack of loops. Likewise, creating multiple sketches requires additional stages, which are limited in the hardware. Time-adaptive sketches utilize the idea of Dolby noise reduction [275], [276]; a pre-emphasis function inflates the update when a new key is inserted and a de-emphasis function restores the original value. This mechanism ages the old events over time, and therefore, improves the accuracy of recent events. The authors implemented the pre-emphasis function in the data plane using simple bit shifts, and the de-emphasis function in the control plane.

b: PERFORMANCE DIAGNOSIS SYSTEMS
Some performance diagnosis schemes restricted their scope to troubleshooting TCP. For instance, Dapper infers sending rate, Maximum Segment Size (MSS), sender’s reaction time (time between received ACK and new transmission), loss rate, latency, congestion window (CWND), receiver window (RWND), and delayed ACKs. Based on the inferred variables, Dapper can identify the root cause of the bottleneck. Similarly, the authors in [89] monitored conditions such as retransmissions, packet loss, round-trip time, and out-of-order packets to identify the top-k problematic flows. Furthermore, Blink detects failures based on the predictable behavior of TCP, which retransmits packets at epochs exponentially spaced in time, in the presence of failure. Other schemes (e.g., SpiderMon) identify failures based on the increase of latency.

Some schemes use reactive processing to mitigate the network performance issue. For instance, Blink promptly reroutes traffic whenever failure signals are generated by the data plane, while SpiderMon limits the sending rate of the root cause hosts.

Finally, it is worth mentioning that some systems (e.g., Blink, Dapper) considered traces from real-world captures such as the ones provided by CAIDA for evaluation. Using real-world traces gives more credibility to the proposed solution.

c: QUEUE AND OTHER METRICS MEASUREMENT
Understanding the occupancy of the queue is useful for use cases such as mitigating congestion-based attacks, avoiding conflicting workloads, implementing new AQMs, optimizing switch configurations, debugging switch implementation, off-path monitoring of queues in legacy devices, etc. ConQuest performs queue measurements and identifies flows depending on the purpose (e.g., detecting bursty connections). It maintains compact snapshots of the queue, updated on each incoming packet. The snapshots are then aggregated in a round-robin fashion to approximate the queue occupancy. Afterwards, it cleans the previous snapshots to reuse them for further packets. Similarly, BurstRadar detects microbursts, which can increase latency, jitter, and packet loss, especially when links’ speeds are high and switch buffers are small. It is almost impossible to detect microbursts in legacy switches which use sampling and polling-based techniques. BurstRadar detects microbursts, and captures a snapshot of the telemetry information of all the involved packets. Afterwards, an analysis is conducted on the snapshot to identify the microburst-contributing flow and the burst characteristics. Note that BurstRadar does not support measuring the queues of legacy devices passively, but ConQuest does. In addition, BurstRadar performs the analysis on the control plane, while ConQuest uses the data plane for analysis.
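
The coupon-collection argument behind the BeauCoup example (a threshold of 130 distinct destination IPs handled with 32 coupons) can be checked numerically. The following is a minimal sketch, assuming that every new distinct key maps to one of the coupons uniformly at random and that a query fires once all coupons are collected; BeauCoup's actual coupon configuration may differ. The function names `expected_draws` and `distinct_keys_until_full` are hypothetical and are not taken from [98] or [274].

```python
import random
from fractions import Fraction

NUM_COUPONS = 32  # coupons per query, as in the superspreader example


def expected_draws(n: int) -> float:
    """Expected number of uniform draws needed to see all n coupons: n * H_n."""
    harmonic = sum(Fraction(1, k) for k in range(1, n + 1))
    return float(n * harmonic)


def distinct_keys_until_full(rng: random.Random, n: int = NUM_COUPONS) -> int:
    """Simulate one query: count the distinct keys seen before all coupons are collected."""
    bitmap = 0           # one bit per coupon, as a switch register could store it
    full = (1 << n) - 1  # bitmap value once every coupon has been collected
    keys_seen = 0
    while bitmap != full:
        keys_seen += 1
        # Assumption: each new distinct key hashes to one coupon uniformly at random.
        bitmap |= 1 << rng.randrange(n)
    return keys_seen
```

Under these assumptions, `expected_draws(32)` evaluates to about 129.87, which is consistent with choosing 32 coupons for a threshold of 130. Averaging `distinct_keys_until_full` over many runs gives an empirical estimate of the same quantity while storing only a 32-bit bitmap per key.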
Finally, some systems considered network-wide monitoring, while others only restricted their capabilities to local per-switch measurements. Network-wide measurement is essential and can significantly improve the visibility of traffic, as discussed in Section XIII-D.

6) IN-NETWORK VERSUS LEGACY MEASUREMENTS
Fig. 12 compares the legacy measurements to those conducted on programmable switches. There are two main classes of legacy measurements techniques. First, there are
techniques that rely on polling and sampling (e.g., NetFlow). The differences between in-network measurements and polling/sampling-based schemes are closely related to the differences between legacy measurements and INT (see Table 5). For instance, the granularity of the measurements conducted in the data plane is much higher than that of traditional measurements (e.g., NetFlow). Further, it is not possible to conduct event-based monitoring in legacy approaches, whereas with in-network measurements, the programmer has the flexibility of customizing the monitoring based on conditions and thresholds. Second, there are techniques that rely on sketching or streaming algorithms to estimate the metric of interest. Such methods are tightly coupled with the metric, which forces hardware vendors to invest time and effort in building customized algorithms and data structures that might not be used by various customers. Moreover, with the constraints of routers and switches, it is not possible to implement a variety of monitoring tasks while still supporting the standard routing/switching functionalities. Therefore, such approaches are not scalable for the long run.

With programmable switches, it is possible to customize the monitoring tasks by implementing customized sketching/streaming algorithms as P4 programs. This advantage improves scalability as the operator can always modify the algorithms whenever needed.

FIGURE 12. (a) Traditional measurements with sampling/polling. The switch uses sampling and polling protocols (e.g., NetFlow, SNMP) to generate fixed network flow records. Instead of collecting every packet, sampling collects only one out of every N packets. Records are then exported to an external server for further analysis. (b) Measurements with programmable switches (e.g., UnivMon [78]). The switch runs a universal algorithm over a universal data structure (e.g., universal sketch). The control plane then estimates a wide range of metrics for various applications. Note that this is not the only design possible for measurement tasks with programmable switches. The programmer has the flexibility to use customized algorithms that run at line rate in the data plane. Such algorithms can leverage various data structures in the P4 program (e.g., sketch, bloom filter) to store flow statistics. The switch then pushes statistics reports to the control plane for further analysis and reactive processing.

C. ACTIVE QUEUE MANAGEMENT (AQM)
1) BACKGROUND
A fundamental component in network devices is the queue which temporarily buffers packets. As data traffic is inherently bursty, routers have been provisioned with large queues to absorb this burstiness and to maintain high link utilization. The majority of delays encountered in a communication session is a result of large backlogs formed in queues. Legacy devices are limited in the visibility of the queue as they provide little or no insight about which flows are occupying or sharing the queue [88]. Consequently, researchers have been investigating queue management algorithms to shorten the delay and mitigate packet losses, while providing fairness among flows. AQM is a set of algorithms designed to shorten the queueing delay by prohibiting buffers on devices from becoming full. The undesirable latency that results from a device buffering too much data is known as "Bufferbloat". Bufferbloat not only increases the end-to-end delay, but also decreases the throughput and increases the jitter of a communication session. Modern AQMs help in mitigating the bufferbloat problem [277]–[280]. Unfortunately, modern AQMs are typically not available in state-of-the-art network equipment; for instance, Controlled Delay (CoDel) AQM, which was proposed in 2013, and was proven in the literature to be effective in mitigating Bufferbloat [281], is still not available in most network equipment. With programmable switches, it is now possible to implement AQMs as P4 programs, which not only accelerates support for new AQMs, but also provides means to customize their parameters programmatically in response to network traffic. Moreover, programmable switches foster innovation on newer AQMs that can be easily implemented and rapidly tested.

2) STANDARDIZED AQMs IMPLEMENTATION
Kundel et al. [99] implemented the CoDel queueing discipline on a programmable switch. CoDel eliminates Bufferbloat, even in the presence of large buffers [100]. Sharma et al. [101] proposed Approximate Fair Queueing (AFQ), a mechanism built on top of programmable switches that approximates fair queueing at line rate. Fair Queueing (FQ) aims at fairly dividing the bandwidth allocation among active flows. Laki et al. [102] described an AQM evaluation testbed with P4 in a demo paper. The authors tested the framework with two AQMs: Proportional Integral Controller Enhanced (PIE) and RED. Papagianni and De
Schepper [103] implemented the Proportional Integral PI2 AQM on a programmable switch. PI2 is an extension of PIE AQM to support coexistence between classic and scalable congestion controls in the public Internet. Kunze et al. [104] analyzed the implementation details of three AQMs, namely, RED, CoDel, and PIE on a hardware programmable switch (Tofino). Toresson [105] implemented a combination of PIE and the Per-Packet Value (PPV) concept on a programmable switch.

3) CUSTOM AQMs
Mushtaq et al. [106] approximated Shortest Remaining Processing Time (SRPT) on a programmable switch. Their method, which they refer to as Approximate and Deployable SRPT (ADS), was evaluated and it was shown that it can achieve performance close to SRPT. Menth et al. [107] implemented activity-based congestion management (ABC) on programmable switches. ABC aims at ensuring fair resource sharing as well as improving the completion times of short flows. Alcoz et al. [108] proposed SP-PIFO, a method that approximates Push-In First-Out (PIFO) queues on programmable data planes. The method consists of an adaptive scheduling algorithm that dynamically adapts the mapping between packet ranks and Strict Priority (SP) queues. Kumazoe and Tsuru [109] implemented the MTQ/QTL scheme on P4.

4) AQM SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 10 compares the aforementioned AQM schemes. Some schemes require tuning a number of parameters and thresholds so that they operate well in certain network conditions. It is worth mentioning that a scheme becomes hard to manage and less autonomous when the number of parameters and thresholds is high.

TABLE 10. AQM schemes comparison.

Some schemes are simple to implement in the data plane. CoDel’s algorithm can be easily expressed in the data plane as it consists of comparisons, counting, basic arithmetic, and dropping packets. Similarly, PI2 is simple to implement as it is mostly based on basic bit manipulations. FQ algorithms on the other hand are difficult to implement on hardware as they require complex flow classification, per-packet scheduling, and buffer allocation. Such requirements make FQ algorithms expensive to implement on high-speed devices. AFQ aims at approximating fair queueing by using programmable switches’ features such as mutating switch state, performing basic calculations, and selecting the egress queue of a packet. AFQ’s operations can be summarized as follows: 1) per-flow state, which includes the number and timing information of the previous packet pertaining to that flow, is approximated; 2) the position of each packet in the output schedule is determined; 3) the egress queue to use is selected; and 4) the packet is dequeued based on the approximate sorted order. Note that AFQ uses a probabilistic data structure (count-min sketch) since it only approximates the states, and uses multiple queues in its implementation.

5) AQMs ON PROGRAMMABLE SWITCHES AND FIXED-FUNCTION DEVICES
Inventing novel AQMs that control queueing delay, mitigate bufferbloat, and achieve fairness with different network conditions (e.g., short/long RTTs, lossy networks, WANs) is an active research area. Typically, new AQMs are implemented and tested in software (e.g., as a Linux queueing discipline (qdisc) used with traffic control (tc)), which is limited when the objective is to deploy the AQMs on production networks. With programmable switches, AQMs are implemented in P4 programs, which fosters innovation and enhances testing with production networks. Additionally, operators can create their own customized AQMs that perform efficiently with their typical network traffic.

Historically, deploying AQMs on network devices is a lengthy and costly process; once an effective AQM is published and thoroughly tested, equipment vendors start investigating whether it is feasible to implement it on future devices. Such a process might take years to finish, and by then, new network conditions evolve, requiring new AQMs. With programmable switches, this process is cost-efficient and relatively fast (can be completed in weeks).
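
To make the four AFQ operations listed in the comparison above more concrete, the following is a minimal software sketch of a queue that follows the same steps. It is an illustration under stated assumptions rather than the implementation of [101]: the class name `ApproxFairQueue`, the constants `NUM_QUEUES`, `BYTES_PER_ROUND`, `ROWS`, and `COLS`, the CRC32-based hashing, and the drop rule for packets scheduled too far ahead are all hypothetical choices made for this example.

```python
import zlib
from collections import deque

NUM_QUEUES = 4          # hypothetical number of egress FIFO queues
BYTES_PER_ROUND = 1500  # hypothetical bytes each flow may send per round
ROWS, COLS = 2, 64      # hypothetical count-min sketch dimensions


class ApproxFairQueue:
    def __init__(self):
        self.sketch = [[0] * COLS for _ in range(ROWS)]
        self.queues = [deque() for _ in range(NUM_QUEUES)]
        self.current_round = 0

    def _cells(self, flow_id: str):
        # One cell per sketch row; the row index is mixed into the hash input.
        for row in range(ROWS):
            yield row, zlib.crc32(f"{row}:{flow_id}".encode()) % COLS

    def enqueue(self, flow_id: str, size: int) -> bool:
        # 1) Approximate per-flow state: minimum over the flow's sketch cells.
        finish = min(self.sketch[r][c] for r, c in self._cells(flow_id))
        # 2) Position in the output schedule: the round in which the packet departs.
        finish = max(finish, self.current_round * BYTES_PER_ROUND) + size
        pkt_round = finish // BYTES_PER_ROUND
        if pkt_round - self.current_round >= NUM_QUEUES:
            return False  # scheduled too far ahead for the available queues: drop
        for r, c in self._cells(flow_id):
            self.sketch[r][c] = max(self.sketch[r][c], finish)
        # 3) Select the egress queue that serves this round.
        self.queues[pkt_round % NUM_QUEUES].append((flow_id, size))
        return True

    def dequeue(self):
        # 4) Serve queues in round order, which approximates the sorted order.
        for _ in range(NUM_QUEUES):
            queue = self.queues[self.current_round % NUM_QUEUES]
            if queue:
                return queue.popleft()
            self.current_round += 1
        return None
```

As a usage example, if flow A enqueues several 1500-byte packets before flow B enqueues one, the packet of flow B shares a round with the first packet of flow A and is served ahead of the later packets of flow A, while packets of flow A that would fall beyond the last available round are dropped.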


TABLE 11. AQMs on programmable and fixed-function switches.

TABLE 12. QoS/TM schemes comparison.

Table 11 compares the features of AQMs on programmable switches versus fixed-function devices. While new AQMs can be devised on programmable switches, there are some constraints that should be taken into account. First, the traffic manager of a programmable switch is not programmable itself using P4; this is where AQM algorithms are typically implemented in legacy devices. Nevertheless, there are efforts that started investigating methods for emulating programmable traffic managers [282]. Second, current AQMs do not consider the constraints of high-speed ASICs, and thus, cannot be directly implemented as they are using P4. Researchers overcome such limitations through approximations or through rewriting the AQM logic in a high-speed ASIC-friendly way. Third, queue state information is not available in the ingress before packet enqueue. Consequently, AQMs are usually implemented in the egress. Such limitations can be addressed in future research works pertaining to AQMs.

D. QUALITY OF SERVICE AND TRAFFIC MANAGEMENT
1) BACKGROUND
Meeting diverse Quality of Service (QoS) requirements is a fundamental challenge in today’s networks. Traffic Management (TM) provides access control that guarantees that the traffic admitted to the network conforms to the defined QoS specifications. TM often regulates the rate of a flow by applying traffic policing. The new generation of programmable switches facilitates traffic policing and differentiation by allowing network operators to express their logic in a programming language (P4). This section explores the works on programmable switches that involve QoS and TM.

2) QUALITY OF SERVICE
Bhat et al. [110] described a system where programmable switches intelligently route traffic by inspecting application headers (layer-5) to improve users’ QoE. Chen et al. [111] proposed a bandwidth manager for end-to-end QoS provisioning using programmable switches. The system classifies packets into different categories based on their QoS demands and usages, and uses a two-level queue when prioritizing. Chen et al. [112] proposed a system that uses the metering capabilities of a programmable switch to measure the flow rate. It then marks packets when the flow rate exceeds a certain threshold. The sender then adjusts its congestion window in proportion to the ratio of marked packets. The goal of this approach is to avoid the frequent packet drops of TCP when a rate-limiting QoS scheme is present.

3) TRAFFIC MANAGEMENT
Tokmakov et al. [113] proposed RL-SP-DRR, a traffic management system that combines Rate-limited Strict Priority (RL-SP) and Deficit Round-Robin (DRR) to achieve low latency and fair scheduling while improving link utilisation, prioritization and scalability. Lee and Chan [114] implemented a traffic meter based on Multi-Color Markers (MCM) on programmable switches to support multi-tenancy environments.

4) QoS/TM SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 12 compares the QoS/TM schemes. The main idea in [110] is to translate application-layer header information into link-layer headers (Q-in-Q 802.1ad) for the core network in order to perform QoS routing and provisioning. The authors adopted Adaptive Bit Rate (ABR) video streaming as a use case to showcase the QoS improvements and the flexibility of traffic management. Such an approach is interesting since switches are inspecting higher layers in the protocol stack. This capability is not available in non-programmable devices. Note however that the solution was only implemented on a software switch (BMv2). When it comes to hardware switches, the solution might face challenges to run at line rate when processing L5 headers. Therefore, the authors left the hardware implementation as future work.

The other approaches considered traffic rates as inputs rather than inspecting application-layer headers. Reference [114] focused on isolating virtual networks (VN). A VN has to have its own dedicated bandwidth (i.e., other networks’ traffic should not impact the bandwidth) and should be able to differentiate priorities in order to provide QoS for its flows. While the solution was not implemented on hardware (the authors left the hardware implementation
as future work), it is worth noting that this system relies on metering primitives which are available in today’s hardware targets (e.g., meters in Tofino). Similarly, [113] was only implemented on a software switch (BMv2) and was evaluated by comparison against standard priority-based and best-effort scheduling. This system uses multiple priority queues, a feature supported in hardware targets. Therefore, the system could be implemented on hardware switches. The approach in [111] aims at limiting the maximum allowed rate and at maximizing bandwidth utilization. This is the only work that was implemented on a hardware switch (Tofino), and its design was compared against approaches based on OpenFlow.

5) COMPARISON OF QoS/TM BETWEEN LEGACY AND PROGRAMMABLE NETWORKS
The ability to perform QoS-based traffic management in legacy networks is restricted to algorithms that consider standard header fields (e.g., differentiated services [283]). On the other hand, programmable switches can parse, modify, and process customized protocols. Hence, operators now have the ability to perform TM by inspecting custom header fields. Moreover, it is possible to extract metadata pertaining to the state of the switch (e.g., queue occupancy, packet sojourn time, etc.) with high granularity and at line rate. Such information can significantly help switches make better decisions while performing traffic management.

E. MULTICAST
1) BACKGROUND
Multicast routing enables a source node to send a copy of a packet to a group of nodes. Multicast uses in-network traffic replication to ensure that at most a single copy of a packet traverses each link of the multicast tree. Perhaps the most widely deployed multicast routing protocol in traditional networks is the Protocol-Independent Multicast (PIM) protocol [284]. PIM and other multicast routing protocols require a signaling protocol such as the Internet Group Management Protocol (IGMP) [285] to create, change, and tear down the multicast tree. Traditional multicast presents some challenges. For example, it is not suitable for environments where multicast group members constantly move (e.g., virtual machine migration and allocation). In such cases, the multicast tree must be updated dynamically, which may require substantial time and overhead. Also, some routers support a limited number of group-table entries, which does not scale in environments such as datacenters. Additionally, the signaling protocol and multicast algorithm are hard coded in the router, which reduces flexibility in building and managing the tree. Finally, it is not possible to implement multicast based on non-standard header fields.

2) SOURCE-ROUTED MULTICAST
Shahbaz et al. [115] presented ELMO, a multicast scheme based on programmable P4 switches for datacenter applications. ELMO encodes the multicast tree in the packet header, as opposed to maintaining group-table entries inside routers. Kadosh et al. [116] implemented ELMO using a hybrid dataplane with programmable and non-programmable elements. ELMO is intended for multi-tenant datacenter applications requiring high scalability. Braun et al. [117] presented an implementation of the Bit Index Explicit Replication (BIER) architecture [286] with extensions for traffic engineering. Similar to ELMO, BIER removes the per-multicast group state information from switches by adding a BIER header, which is used to forward packets. BIER does not require a signaling protocol for building, managing, and tearing down trees.

TABLE 13. Source-routed multicast schemes comparison (source: [115]).

3) PRIORITY-BASED DECENTRALIZED MULTICAST
Cloud applications in data centers often require file transfers to be completed in a prioritized order. Luo et al. [287] proposed Priority-based Adaptive Multicast (PAM), a preemptive and decentralized rate control protocol for data center multicast. The switches explicitly and preemptively compute sending rates based on priorities encoded in scheduling headers, and the real-time link loads.

4) MULTICAST SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 13 compares the source-routed multicast schemes. Both ELMO and BIER are source-routed multicast schemes. In BIER, group members are encoded as bit strings and are then inspected by switches to identify the output port. Such a scheme requires heavy processing on the switch, hampering the execution at line rate. Consequently, the authors only implemented BIER on a software switch (BMv2). ELMO on the other hand has no restrictions on the group and network sizes, and was implemented on a hardware switch, running at line rate.

Other schemes like PAM addressed the challenges faced in file transfers by data center cloud applications. For instance, when sharing the link with other latency-sensitive flows, file transfers suffer from continuous changes in the link’s bandwidth, affecting the flow completion times. To solve this problem, PAM adopted a scheduling scheme that performs adaptive rate allocations in RTT scales. Other aspects that were addressed by PAM include: fault tolerance and scalability of file transfers; limited number of priority queues; and the challenges of performing complex computations in the data plane.

5) COMPARISON OF P4-BASED AND TRADITIONAL MULTICAST
Table 14 compares P4-based multicast and traditional multicast. The main advantages of implementing multicast routing
with programmable P4 switches are: i) the group membership is encoded in the packet itself, which permits the creation of arbitrary multicast trees based on the application. For example, a multicast tree to update software devices may prioritize bandwidth over latency, while one for media traffic may prioritize latency; ii) switches do not need to store per-group state information, although tables can be customized and used in conjunction with the tree encoded in the packet header; iii) groups can be reconfigured easily by changing the information in the header of the packet; and iv) the elimination of the signaling protocol to build, manage, and tear down the tree results in considerable simplification and flexibility for the operator.

TABLE 14. Comparison between P4-based and traditional multicast.

F. SUMMARY AND LESSONS LEARNED
Performing network-wide monitoring and measurements is of utmost importance for network operators to diagnose performance degradation. A wide range of research efforts harness streaming methods that utilize various data structures (e.g., sketches, bloom filters, etc.) and approximation algorithms. Further, the majority of measurement works provide a query-based language to specify the monitoring tasks. Future measurement works should consider generalizing the monitoring jobs, reducing storage requirements, managing the accuracy-memory trade-off, extending monitoring primitives, minimizing controller intervention, and optimizing the placement of switches in a legacy network. Another line of research aims at combating congestion and reducing packet losses by analyzing measurements collected in the data plane and by applying queue management policies. Congestion control is enhanced by adopting techniques such as throttling senders, cutting payloads, enforcing sending rates by leveraging telemetry data, and separating traffic into different queues. Furthermore, a handful of works are investigating methods to improve QoS by applying traffic policing and management. Techniques adopted include application-layer inspection, traffic metering, traffic separation, and bandwidth management. Finally, the scalability concerns of multicast in legacy networks are being mitigated with programmable switches. Recent efforts proposed encoding multicast trees into the headers of packets, and using programmable switches to parse these headers and to determine the multicast groups. Future endeavours should investigate incremental deployment (i.e., interworking with legacy multicast schemes), and reliability enhancement (e.g., by adopting layering protocols such as Pragmatic General Multicast (PGM) and Scalable Reliable Multicast (SRM)).

VIII. MIDDLEBOX FUNCTIONS
RFC 3234 [288] defines a middlebox as a device that performs functions other than the standard functions of an IP router between a source and a destination host. In legacy devices, middlebox functions are designed and implemented by manufacturers. Hence, they are limited in the functionalities they provide, and typically include standard well-known functions (e.g., NAT, protocol converters (6to4/4to6), etc.). To overcome this limitation, the trend moved towards implementing middleboxes in x86-based servers and in data centers as Network Function Virtualization (NFV). While this shift accelerated innovation and introduced a wide range of new applications, there were some performance implications resulting from operating systems’ scheduling delays, interrupt processing latency, pre-emptions, and other low-level OS functions. Since programmable switches offer the flexibility of inspecting and modifying packets’ headers based on custom logic, they are excellent candidates for enabling middlebox functions, while operating at line rate without performance implications.

FIGURE 13. (a) Traditional software-based load balancing. (b) Load balancing system implemented by a programmable switch.

A. LOAD BALANCING
1) BACKGROUND
A cloud data center, such as a Google or Facebook data center, provides many applications concurrently, such as email and video applications. To support requests from external clients, each application is associated with a publicly visible IP address to which clients send their requests and from which they receive responses. This IP address is referred to as the Virtual IP (VIP) address. The external requests are then directed to a software load balancer whose task is to distribute requests to the servers, balancing the load across them. The load balancer is also referred to as a layer-4 load balancer because it makes decisions based on the 5-tuple


source IP address and port, destination IP address and port, and transport-layer protocol. This state information is stored in a connection table containing the 5-tuple and the Direct IP (DIP) address of the server serving that connection. State information is needed to avoid disruptions caused by changes in the DIP pool (e.g., server failures, addition of new servers). The load balancer also provides a translation functionality, translating the VIP to the internal DIP, and then translating back for packets traveling in the reverse direction back to the clients. The traditional software-based load balancer is illustrated in Fig. 13(a).

2) STATEFUL LOAD BALANCING
Recent works presented schemes where load balancing functionality is implemented in programmable P4 switches. The main idea consists of storing state information directly in the switch’s data plane. The connection table is managed by the software load balancer, which can be implemented either in the switch’s control plane or as an external device, as shown in Fig. 13(b). The software load balancer adds new entries in the switch’s table as they arrive, or removes old entries as flows end.

Katta et al. [118] proposed HULA, a load balancing scheme where switches store the best path to the destination via their neighboring switches. This strategy avoids storing the congestion status of all paths in leaf switches. Benet et al. [119] extended this approach to support multi-path transport protocols (e.g., Multi-path TCP (MPTCP)). Another significant work is SilkRoad [120], a load balancer that provides a direct path between application traffic and servers. Other mechanisms such as DistCache [121] enable load balancing for storage systems through a distributed caching method. DASH [122] proposed a data structure that leverages multiple pipeline stages and per-stage SALUs to dynamically balance data across multiple paths. The aforementioned approaches work under specific assumptions about the network topology, routing constraints, and performance. Contra [123] generalized load balancing to work with various topologies and under multiple constraints by using a performance-aware routing mechanism.

3) STATELESS LOAD BALANCING
Recent advances in customized and stateful packet processing in programmable switches have produced not only a variety of stateful load balancing schemes, but also stateless ones. Stateless load balancing in this context avoids storing per-connection state in the switch.

Perhaps the first and most significant P4-based stateless load balancing scheme is Beamer [124]. Instead of storing the state in the switch, Beamer leverages the connection state already stored in backend servers to perform the forwarding. Another scheme is SHELL [125], which is an application-agnostic, application-load-aware approach that uses a power-of-choices scheme to dispatch flows to a suitable instance. Other approaches such as W-ECMP [126] were built to solve the issue of hash collision in the well-known Equal-Cost Multi-Path (ECMP) scheme. W-ECMP maintains a maximum utilization table, whose entries are used as weights to determine the routing probability for each path. Note that W-ECMP does not store per-connection state information in the data plane.

4) LOAD BALANCING SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 15 compares the aforementioned load balancing schemes. The key idea of switch-based stateful load balancing is to eliminate the need for a software layer while mapping a connection to the same server, ensuring the Per-Connection Consistency (PCC) property. The majority of the proposed approaches are stateful, meaning that the switches store information locally to perform load balancing.

TABLE 15. Load balancing schemes comparison.

Some approaches (e.g., HULA, MP-HULA, Contra) use active probing to collect network performance metrics. Such metrics are then analyzed by the switches to make load balancing decisions. Note that probing increases the bandwidth overhead, which might result in performance degradation.

In the presence of multi-path transport protocols (e.g., MPTCP), systems such as HULA provide sub-optimal forwarding decisions when several subflows pertaining to a single connection are pinned on the same bottleneck link. As a result, schemes such as MP-HULA, Contra, and DASH were proposed to support multi-path transport protocols. For instance, MP-HULA is a transport layer multi-path aware load-balancing scheme that uses the best-k paths to the destination through the neighbor switches.

Other approaches are stateless. Beamer relies on using the connection state already stored in backend servers to ensure that connections are never dropped under churn. On the other hand, SHELL, which assigns new connections to a


set of pseudo-randomly chosen application instances, marks packets, allowing the load balancer to direct them without storing state. W-ECMP makes forwarding decisions based on weights adjusted according to the link utilization. Finally, it is important for a load balancing scheme to be adaptive and handle network failures. Furthermore, it should mitigate load imbalance in asymmetric topologies.

5) COMPARISON BETWEEN SWITCH-BASED AND SERVER-BASED LOAD BALANCER
Table 16 shows a comparison between switch-based and server-based load balancers. There is a significant improvement in the throughput when load balancing is offloaded to the switches; for instance, SilkRoad [120], which is a load balancing scheme in the data plane, achieves 10 billion packets per second (pps) while operating at line rate. Software load balancers, on the other hand, achieve a much lower throughput, nine million pps on average. Software-based load balancers also incur additional latency overhead when processing new requests. It is relatively easy to install additional software load balancers, which makes them more scalable than switch-based load balancing schemes. Moreover, software load balancers are more flexible in assigning flow identification policies.

Finally, switch-based schemes are simpler as the whole logic is expressed in a program (customized parser and match-action tables), whereas server-based balancers might require additional coordination with routers (e.g., tunneling).

TABLE 16. Switch-based and server-based load balancers.

B. CACHING
1) BACKGROUND
Modern applications (e.g., online banking, social networks) rely on key-value stores. For example, retrieving a single web page may require thousands of storage accesses. As the number of users increases to millions or billions, the need for higher throughput and lower latency grows. A challenge of key-value stores is that items are not accessed uniformly: popular items, referred to as “hot items”, receive more queries than others. Furthermore, popular items may change rapidly due to popular posts, limited-time offers, and trending events [127]. Fig. 14(a) shows a typical skewed key-value store system which presents load imbalance among servers storing key-value objects. Such systems may suffer from reduced throughput and long latencies. For example, server 2 may add substantial latency as a result of storing a hot item and being over-utilized, while server 1 is under-utilized.

FIGURE 14. (a) Traditional software-based caching. (b) Switch-based caching.

2) KEY-VALUE CACHING
Fig. 14(b) illustrates a system where a programmable switch receives a query before forwarding it to the server storing the key. The switch is used as an “in-network cache”, where the hottest items are stored. When a read request for a hot key is received, the switch consults its local table and returns the value corresponding to that key. If the key is missed (i.e., the case for non-hot keys) then the switch forwards the request to the appropriate server. When a write request is received, the switch checks its local table and evicts the entry if the key is stored there. It then forwards the request to the appropriate backend server. A controller periodically collects statistics to update the cache with the current hot items.

A noteworthy approach is NetCache [127], an in-network architecture that uses programmable switches to store hot items and balance the load across storage nodes. Similarly, Liu et al. [128] proposed IncBricks, a hardware-software co-designed in-network caching fabric for key-value pairs with basic computing primitives in the data plane.

Cidon et al. [129] proposed AppSwitch, a packet switch that performs load balancing for key-value storage systems, while exchanging only a single message from the key-value client to the server. Wang et al. [130] proposed CONCORDIA, a rack-scale Distributed Shared Memory (DSM) with in-network cache coherence. While the system targets cache coherence, the authors implemented a distributed key-value store to demonstrate the practical benefits of the system. Similarly, Li et al. [131] proposed Pegasus, which acts as an in-network coherence directory tracking and managing the replication of objects.

3) APPLICATION-SPECIFIC CACHING
Another class of caching schemes targets specific applications rather than caching arbitrary key-value pairs. Signorello et al. [132] developed a preliminary implementation of a Named Data Networking (NDN) instance


using P4 that caches requests to optimize its operations. Grigoryan and Liu [133] proposed a system that caches Forwarding Information Base (FIB) entries (the most popular entries) in fast memory in order to minimize the TCAM consumption and to avoid the TCAM overflow problem. Zhang et al. [134] proposed B-Cache, a framework that bypasses the original processing pipeline to improve the performance of caching. Vestin et al. [135] proposed FastReact, a system that enables caching for industrial control networks. Finally, Woodruff et al. [136] proposed P4DNS, an in-network cache for Domain Name System (DNS) entries.

4) CACHING SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 17 compares the aforementioned caching schemes. Schemes can be separated based on the type of data they aim to cache. For instance, NetCache, AppSwitch, and IncBricks cache arbitrary key-value pairs, while NDN.p4 caches only NDN names. Further, some schemes (e.g., NetCache, P4DNS, etc.) automatically index entries to be cached based on their access frequencies, while others require the operators to manually specify the entries. Another important distinction is whether the scheme uses a custom protocol or not. For instance, switches in NetCache parse a custom protocol that carries key-value pairs, while switches in P4DNS parse standard DNS headers.

TABLE 17. Caching schemes comparison.

The main motivation of switch-based caching schemes is to address the performance issues of server-based schemes. For instance, NetCache, which efficiently detects hot key-value items and serves them in the data plane, was capable of handling two billion queries per second for 64,000 items with 16-byte keys and 128-byte values. Compared to commodity servers, NetCache improves the throughput by 3-10 times and reduces the latency of 40% of queries by 50%. In addition to the throughput, the latency of the queries is also a major metric to improve. In IncBricks, the latency of requests is reduced by over 30% compared to client-side caching systems.

Similarly, B-Cache aims at improving the performance by caching behaviors defined along the processing pipeline into a single cache match-action table. The motivation behind B-Cache is that the performance of the data plane decreases significantly as the complexity of the P4 program and the packet processing pipeline grows. When a match occurs, the packet bypasses the original pipeline, making the performance of caching independent of the pipeline length. Note however that this system was evaluated on a software switch (BMv2), and it is not certain whether this design is always feasible on hardware targets.

Other caching schemes are more targeted for specific applications. As examples, FastReact enables caching for industrial control networks, while P4DNS caches DNS entries. Further, some schemes offer multi-level caching (e.g., level-1 and level-2 caches).

Unlike the other approaches which store cached data in the data plane, CONCORDIA coordinates coherence among the caches of servers, and therefore only stores the cache’s metadata in the switch.

5) COMPARISON BETWEEN SWITCH-BASED AND SERVER-BASED CACHING

TABLE 18. Switch-based and server-based caching.

Table 18 compares switch-based and server-based caching schemes. The throughput when data is cached on the switch is an order of magnitude larger than that of general purpose servers. The latency is also reduced by 50%, and most of it is induced by the client. The


switch-based caching solves the load imbalance problem and is simpler as the whole logic is expressed in a program. Server-based caching on the other hand is more flexible regarding cache policies, as well as keys, values, and tables’ sizes.

C. TELECOMMUNICATION SERVICES
1) BACKGROUND
The evolution of the current mobile network to the emerging Fifth-Generation (5G) technology implies significant improvements of the network infrastructure. Such improvements are necessary in order to meet the Key Performance Indicators (KPIs) and requirements of 5G [290]. 5G requires ultra-reliable low latency and jitter (microseconds-scale). As programmable switches fulfill these requirements, researchers are investigating the idea of offloading telecom-oriented VNFs running on x86 servers to programmable hardware.

2) 5G FUNCTIONS
Ricart-Sanchez et al. [137] proposed a system that uses a programmable data plane to enhance the performance of the data path from the edge to the core network, also known as the backhaul, in a 5G multi-tenant network. The same authors [138] proposed a 5G firewall that detects, differentiates, and selectively blocks 5G network traffic in the backhaul network.

In parallel, attempts such as TurboEPC [139] proposed offloading a subset of user state in the mobile packet core to programmable switches in order to perform signaling in the data plane. Similarly, Singh et al. [140] designed a P4-based element of the 5G Mobile Packet Core (MPC) that merges the functions of both the Serving Gateway (SGW) and the Packet Data Network Gateway (PGW). Additionally, Vörös et al. [141] proposed a hybrid next-generation NodeB (gNB) that combines the capabilities of P4 switches and the external services built on top of NIC accelerators (DPDK). Another important function required in 5G is handover. Palagummi and Sivalingam [142] proposed SMARTHO, a system that uses programmable switches to perform handover efficiently in a wireless network.

Paolucci et al. [143] demonstrated the potential and the disruptiveness of data plane programmability as opposed to the SDN/NFV network with no programmable switches. Specifically, the authors focused on the problems of full softwarization in current SDN networks (high latency and jitter, low precision traffic and advanced monitoring, etc.) and how P4 is paving the way to novel orchestration frameworks enabling innovation at the edge. Lin et al. [144] integrated P4 switches in their 5G testbed to implement the User Plane Function (UPF) in the data plane.

3) MEDIA OFFLOADING
Kfoury et al. [145] proposed a system for offloading conversational media traffic (e.g., Voice over IP (VoIP), Voice over LTE (VoLTE), WebRTC, media conferencing, etc.) from x86-based relay servers to programmable switches. While this system is not tailored for 5G networks specifically, it provides significant performance improvements for Over-The-Top (OTT) VoIP systems.

Andrus et al. [146] offloaded video processing to the switch. Essentially, the switch dynamically filters and separates control traffic from video streams, and then redirects them to the desired destinations. The authors implemented this scheme due to processing constraints on the software when the number of devices is high (the authors noted that the number of CCTV cameras in London, UK is estimated at roughly 500,000).

4) TELECOM SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 19 compares the aforementioned telecom schemes on P4. In general, all schemes aim at offloading various functionalities originally executed on x86-based servers to the data plane. Such a strategy improves the network performance (e.g., latency, throughput) significantly and aims at achieving the KPIs of 5G. For instance, the experiments conducted in [137] show that the attained QoS metrics meet the latency requirements of 5G. Similarly, the results reported in [138] demonstrate that the system meets the reliability KPI of 5G, which states that the network should be secured with zero downtime. Furthermore, the results reported in [142] show that there are 18% and 25% reductions in handover time with respect to legacy approaches, for two- and three-handover sequences, respectively.

TABLE 19. Telecom schemes comparison.

The system in [145] emulates the behavior of the relay server which is primarily used to solve the NAT problem. Results show that ultra-low latency and jitter (nanoseconds-scale) are achieved with programmable switches as opposed to x86-based relay servers where the latency and the jitter are in the milliseconds-scale (see Fig. 15). The solution also improves the packet loss rate, CPU usage of the server, Mean Opinion Score (MOS), and can scale to more than one million concurrent sessions, with additional resources to spare in the switch.

FIGURE 15. CDF of delay and packet loss rate of 900 offloaded VoIP calls [145].

Other systems allow offloading the signaling part to the data plane. For instance, TurboEPC offloads messages that constitute a significant portion of the total signaling traffic in the packet core, aiming at improving throughput and latency of the control plane’s processing.

5) SWITCH-BASED AND SERVER-BASED MEDIA RELAY
Offloading media traffic from general purpose servers to programmable switches greatly improves the quality of service. Table 20 shows the metrics achieved when media is relayed by a relay server versus when it is relayed by the switch, based on [145]. The results show that the latency, jitter, and packet loss rates are significantly lower when media is being relayed by the switch. Not only are the QoS metrics improved, but so is the maximum number of concurrent sessions. With a Tofino 3.2 Tbps switch, more than one million sessions were accommodated in the switch’s SRAM, with additional resources to spare for other functionalities. On the other hand, only one thousand sessions per CPU core were handled in the server-based relay before QoS started to degrade. The drawback of offloading media traffic to the switch is that some functionalities are complex to implement in the data plane (e.g., media mixing for conference calls, noise reduction, etc.).

TABLE 20. Switch-based and server-based media relaying.

D. CONTENT-CENTRIC NETWORKING
1) BACKGROUND
Emerging network architectures (e.g., [291]) promote content-centric networking, a model where the addressing scheme is based on named data rather than named hosts. In other words, users specify the data they are interested in instead of specifying where to get the data from. A branch of content-centric networking is the publish/subscribe (pub/sub) model. The goal of the model is to provide a scalable and robust communication channel between producers and consumers of information. A large fraction of today’s Internet applications follow the publish/subscribe paradigm. With the IoT, this paradigm proliferated as sensors/actuators are often deployed in dynamic environments. Other applications that use the pub/sub model include instant messaging, Really Simple Syndication (RSS) feeds, presence servers, telemetry, and others. Current approaches to content-centric networking use software-based middleboxes, which limits the performance in terms of throughput and latency. Recent works are leveraging programmable switches to overcome the performance limitations of software-based middleboxes.

2) PUBLISH/SUBSCRIBE
Jepsen et al. [147] presented “packet subscription”, a new abstraction that generalizes the forwarding rules by evaluating stateful predicates on input packets. Wernecke et al. [148], [149] presented distribution strategies for content-based publish/subscribe systems using programmable switches. The authors described a system where the notification distribution tree (i.e., the subscribers that should receive the notification) is encoded in the packet headers, similar to multicast source routing. Similarly, Kundel et al. [150] implemented a publish/subscribe system on programmable switches. The system is flexible in encoding attributes/values in packet headers.

3) NAMED DATA NETWORKING
Signorello et al. [132] developed NDN.p4, a preliminary implementation of a Named Data Networking (NDN) instance that caches requests to optimize its operations. Miguel et al. [151] extended NDN.p4 to include the content store and to solve the scalability issues of the previous FIB design. Karrakchou et al. [152] proposed ENDN, another NDN implementation on P4 where the data plane configuration is generated according to application requirements and supports extensions to the regular NDN such as adaptive forwarding, customized monitoring, in-network caching control, and publish/subscribe forwarding.
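The header-encoded distribution tree used by the pub/sub schemes in [148], [149] can be illustrated with a minimal Python sketch. It is not the authors' implementation: the header layout (a mapping from switch name to output ports), the three-switch topology in `LINKS`, and the names `Packet`, `forward`, and `publish` are assumptions made purely for illustration. The sketch only shows the idea that a switch forwards using its own entry in the header, so a subscription change alters the header rather than any rule stored in the switches.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Packet:
    payload: str
    # Hypothetical header layout: switch name -> output ports for this notification.
    tree: Dict[str, List[int]] = field(default_factory=dict)


# Hypothetical wiring: (switch, port) -> neighbor (a switch or a subscriber host).
LINKS: Dict[Tuple[str, int], str] = {
    ("s1", 1): "s2",
    ("s1", 2): "s3",
    ("s2", 1): "hostA",
    ("s2", 2): "hostB",
    ("s3", 1): "hostC",
}
SWITCHES = {"s1", "s2", "s3"}


def forward(switch: str, pkt: Packet, delivered: List[Tuple[str, str]]) -> None:
    """Replicate the packet on the ports listed for this switch in the header.

    The switch keeps no per-group state; it reads only its own header entry.
    """
    for port in pkt.tree.get(switch, []):
        neighbor = LINKS[(switch, port)]
        if neighbor in SWITCHES:
            forward(neighbor, pkt, delivered)
        else:
            delivered.append((neighbor, pkt.payload))


def publish(payload: str, tree: Dict[str, List[int]]) -> List[Tuple[str, str]]:
    """Inject a notification at the ingress switch s1 and return the deliveries."""
    delivered: List[Tuple[str, str]] = []
    forward("s1", Packet(payload, tree), delivered)
    return sorted(delivered)
```

For instance, `publish("temp=21", {"s1": [1, 2], "s2": [2], "s3": [1]})` reaches hostB and hostC, while changing only the header to `{"s1": [1], "s2": [1, 2]}` reaches hostA and hostB with the switch state left untouched.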


4) CONTENT-CENTRIC NETWORKING SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 21 compares the aforementioned pub/sub schemes. In [147], the authors described a compiler that generates P4 tables from logical predicates. It utilizes a novel algorithm based on Binary Decision Diagrams (BDD) to preserve switch resources (TCAM and SRAM). This feature simplifies the configuration as operators do not need to manually install table entries in switches, which is a cumbersome process when the topology is large. The prototype was evaluated on a hardware switch (Tofino), and the authors considered Nasdaq’s ITCH protocol as the pub/sub use case. Results show that the system was able to process messages at line rate while using the full switch capacity (6.5 Tbps).

TABLE 21. Content-centric networking schemes comparison.

The other systems considered different encoding strategies. For example, in [148], [149], the authors described a system where the notification distribution tree (i.e., the subscribers that should receive the notification) is encoded in the packet headers, similar to multicast source routing. The advantage of storing the distribution tree in the packet header instead of storing it in the switch is that rules in the switches do not need to be updated when subscriptions change. Another distinction between the pub/sub systems is whether they require a dedicated language to describe the subscriptions, and the configuration complexity.

Regarding the NDN schemes, ENDN focused on making the data plane adaptive and easily programmable to meet the application needs. This flexibility is lacking in the other P4-based NDN schemes. It is worth mentioning that P4 has its shortcomings when it comes to supporting a stateful, variable-length protocol. This is an important aspect that should be tackled when implementing NDN on the data plane.

5) COMPARISON BETWEEN SWITCH-BASED AND SERVER-BASED PUB/SUB SYSTEMS
Fig. 16 illustrates the operations of traditional software-based pub/sub systems (a) and switch-based pub/sub systems (b). Latency and its variations are significantly reduced when the switch acts as a pub/sub broker. However, the size of memory in the switch limits the amount of data to be distributed. Moreover, implementing features provided by software-based pub/sub systems such as QoS levels, session persistence, message retaining, last will and testament (notify users after a device disconnects) in hardware is challenging.

FIGURE 16. (a) Traditional software-based pub/sub architecture. (b) Pub/sub implemented on a programmable switch.

E. SUMMARY AND LESSONS LEARNED
Programmable switches offer the flexibility of customizing the data plane to enable middlebox functions. A middlebox can be defined as a device that performs functions that are beyond the standard capabilities of routers and switches. A number of works demonstrated the implementation of middlebox functions such as caching, load balancing, offloading services, and others on programmable switches. The majority of load balancing schemes took advantage of the stateful nature of the data plane to store the load balancing connection table. Future work should consider minimizing the storage requirement to improve the scalability, supporting flow priority, and developing further variations for novel multipath transport protocols such as multipath QUIC.

The switch can also act as an “in-network cache” that serves hot items at line rate. Some schemes index entries automatically, while others require operator intervention. Future endeavours could investigate item compression, communication minimization, priority-based caching, and aggregated computations caching (e.g., cache the average of hot items).
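The in-network cache behavior described in this section (hot reads answered by the switch, misses forwarded to the server, writes evicting the cached entry, and a controller periodically refreshing the hot set) can be sketched in a few lines of Python. This is a behavioral illustration only, not the design of NetCache or of any other surveyed system: the class names `Server` and `CachingSwitch`, the `capacity` parameter, and the use of dictionaries and counters are assumptions, whereas a P4 target would keep such state in match-action tables and registers.

```python
from collections import Counter


class Server:
    """Hypothetical backend key-value server; counts the reads it has to serve."""

    def __init__(self):
        self.store = {}
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.store.get(key)

    def put(self, key, value):
        self.store[key] = value


class CachingSwitch:
    """Toy model of a switch acting as an in-network cache for hot items."""

    def __init__(self, server, capacity=2):
        self.server = server
        self.capacity = capacity   # number of hot items the switch can hold
        self.cache = {}            # stands in for the switch's local table
        self.queries = Counter()   # per-key statistics collected for the controller

    def read(self, key):
        self.queries[key] += 1
        if key in self.cache:           # hit: answered without reaching the server
            return self.cache[key]
        return self.server.get(key)     # miss: forwarded to the server

    def write(self, key, value):
        self.cache.pop(key, None)       # evict the entry if the key is cached
        self.server.put(key, value)     # then forward the write to the server

    def controller_update(self):
        """Periodic controller step: install the currently hottest keys."""
        hot = [k for k, _ in self.queries.most_common(self.capacity)]
        self.cache = {k: self.server.store[k] for k in hot if k in self.server.store}
        self.queries.clear()
```

In this sketch a write simply invalidates the cached copy, so the next read of that key is a miss until the controller reinstalls it; keeping the cache and the server coherent in a real deployment requires more care than is shown here.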


An additional middlebox application is offloading tele- data value, or on the current state of a distributed system.
com functions. The switch is capable of relaying media traf- Reliability is achieved with consensus algorithms, even in the
fic and user plane functions. Future work could investigate presence of some malicious or faulty processes. Consensus
scalability improvement (i.e., to accommodate more concur- algorithms are used in applications such as blockchain [293],
rent sessions), offloading signalling traffic, and in-network load balancing, clock synchronization, and others [294].
media mixing. Latency has always been a bottleneck with consensus algo-
Finally, the switch can also act as a broker to dis- rithms as protocols require expensive coordination on every
tribute packets in a publish/subscribe system. Future work request. Lately, researchers have started investigating how
could investigate reliability insurance (e.g., packet deliver programmable switches can be leveraged to operate consen-
guarantee), message retaining, and QoS differentiation sus protocols in order to increase throughput and decrease
(e.g., QoS features of MQTT). latency. Fig. 17 shows a consensus model in the data plane.
IX. NETWORK-ACCELERATED COMPUTATIONS 2) PAXOS IMPLEMENTATIONS
Programmable switches offer the flexibility of offload- Li et al. [153] proposed Network-Ordered Paxos (NOPaxos),
ing some upper-layer logic to the ASIC, referred also as a P4-based Paxos [295] system that applies replication in
in-network computation. Since switch ASICs are designed the data center to reduce the latency imposed from com-
to process packets at terabits per second rates, in-network munication overhead. Similarly, Dang et al. [154] presented
computation can result in an order of magnitude or more an implementation of Paxos using P4 on the data plane.
of improvement in throughput when compared to applica- Jin et al. [157] proposed NetChain, a variant of the Paxos pro-
tions implemented in software. The potential performance tocol that provides scale-free sub-RTT coordination in data
improvement has motivated programmers to built in-network centers. It is strongly-consistent, fault-tolerant, and presents
computation for different purposes, including consensus, an in-network key-value store. Dang et al. [158] proposed
machine learning acceleration, stream processing, and others. Partitioned Paxos, a P4-based system that separates the two
The idea of delegating computations to networking devices aspects of Paxos, namely, agreement and execution, and
was perceived with Active Networks [292], where pack- optimizes them separately. Furthermore, The same authors
ets are replaced with small programs (‘‘capsules’’) that are also proposed P4xos [160], a P4-based solution that executes
executed in each traversed device along the path. However, Paxos logic directly in switch ASICs, without strengthening
traditional network devices were not capable of perform- assumptions about the network (e.g., ordered delivery, packet
ing computations. With the recent advancements in pro- loss, etc.).
grammable switches, performing computations is now a
possibility. 3) OTHER IMPLEMENTATIONS
Another line of research focused on consensus algorithms
A. CONSENSUS
other than Paxos. Li et al. [155] proposed Eris, a P4-based
1) BACKGROUND solution that avoids replication and transaction coordination
Consensus algorithms are common in distributed systems overhead. It processes a large class of distributed transactions
where machines collectively achieve agreement on a single in a single round trip, without any additional coordination
between shards and replicas. Sakic et al. [159] proposed
P4 Byzantine Fault Tolerance (P4BFT), a system that is
based on BFT-enabled SDN, where controllers act as repli-
cated state machines. The system offloads the comparison of
controllers’ outputs required for correct BFT operations to
programmable switches. Finally, Han et al. [156] offloaded
part of the Raft consensus algorithm [296] to programmable
switches in order to improve its performance. The authors
selected Raft due to the fact that it has been formally proven
to be more safe than Paxos, and it has been implemented on
popular SDN controllers.
4) CONSENSUS SCHEMES COMPARISON, DISCUSSIONS,
AND LIMITATIONS
Table 22 compares the aforementioned consensus schemes.
FIGURE 17. Consensus protocol in the data plane model [154]. An application sends a request to the proposer which resides on a commodity server. The proposer then creates a Paxos message and sends it to the coordinator, running in the data plane. The role of the coordinator is to be the broker of requests on behalf of proposers. Afterwards, the acceptor, which also runs on the data plane, receives the messages from the coordinator, and ensures consistency through the system by deciding whether to accept/reject proposals. Finally, learners provide replication by learning the result of consensus.

In general, consensus algorithms such as Paxos are complex and cannot be easily implemented with the constraints of the data plane. For instance, [154] only implemented phase-2 logic of Paxos leaders and acceptors. Similarly, NetChain uses a variant of the Paxos protocol that divides it into two parts: steady state and reconfiguration. This variant is known
87122 VOLUME 9, 2021
E. F. Kfoury et al.: Exhaustive Survey on P4 Programmable Data Plane Switches

as Vertical Paxos, and is relatively simple to implement in the network as the division’s parts can be mapped to the control plane and the data plane.

TABLE 22. Consensus schemes comparison.

Unordered and completely asynchronous networks require the full implementation and complexity of Paxos. NOPaxos suggests that the communication layer should provide a new Ordered Unreliable Multicast (OUM) primitive; that is, there is a guarantee that receivers will process the multicast messages in the same order, though messages can be lost. NOPaxos relies on the network to deliver ordered messages in order to avoid coordination entirely. Dropped packets, on the other hand, are handled through coordination with the application. Other systems like Eris avoid replication and transaction coordination overhead. The main contribution of Eris compared to NOPaxos is that it establishes a consistent ordering across messages delivered to many destination shards. Eris also allows receivers to detect dropped messages.
Partitioned Paxos [158] improved the existing systems. The motivation behind Partitioned Paxos is that existing network-accelerated approaches do not address the problem of how a replicated application can cope with the high rate of consensus messages; NOPaxos only processes 13,000 transactions per second since it presents a new bottleneck at the host side. Other systems (e.g., NetChain) are specialized replication services and cannot be used by any off-the-shelf application.
Finally, P4xos improves both the latency and the tail latency. The throughput is also improved compared to hardware servers which require additional memory management and safety features (e.g., user and kernel separation). P4xos was implemented on a hardware switch (Tofino), and results show that it reduces the latency by three times compared to traditional approaches, and it can process over 2.5 billion consensus messages per second (four orders of magnitude improvement).
5) NETWORK-ASSISTED AND LEGACY CONSENSUS COMPARISON
Consensus algorithms have been traditionally implemented as applications on general purpose CPUs. Such architecture inherently induces latency overhead (e.g., Paxos coordinator has a minimum latency of 96 µs [297]).
There are numerous performance benefits gained when consensus algorithms are implemented in programmable devices. When consensus messages are processed on the wire, the latency significantly decreases (Paxos coordinator had a minimum latency of 340 ns [297]). Moreover, when compared to legacy consensus deployments, network-assisted consensus requires fewer hop traversals.
B. MACHINE LEARNING
1) BACKGROUND
The remarkable success of Machine Learning (ML) today has been enabled by a synergy between development in hardware and advancements in machine learning techniques. Increasingly complex ML models are being developed to handle the large size of datasets and to accelerate the training process. Hardware accelerators (e.g., GPU, TPU) were introduced to speed up the training. These accelerators are installed in large clusters and collaborate through distributed training to exploit parallelism. Nevertheless, training ML models is time consuming and can last for weeks depending on the complexity and the size of the datasets. Researchers have traditionally investigated methods to accelerate the computation process, but not the communication in distributed learning. With the advancements in programmable switches, it is now possible to accelerate the ML training process through the network.
2) IN-NETWORK TRAINING
Sapio et al. [161] proposed DAIET, a system that performs in-network data aggregation to accelerate applications that follow a partition/aggregate workload pattern. Similarly, Yang et al. [162] proposed SwitchAgg, a system that performs similar functions as DAIET, but with a higher data reduction rate. Perhaps the most significant work in the training acceleration literature is SwitchML [163], a system that performs in-network aggregation for ML model updates sent from workers on external servers.
3) IN-NETWORK INFERENCE
Other schemes have shown interest in speeding up the inference process by leveraging programmable switches. Siracusano and Bifulco [164] proposed N2Net, a system that runs simplified neural networks (NN) on programmable switches. Sanvito et al. [165] proposed BaNaNa Split, a solution that evaluates the conditions under which programmable switches can act as CPUs’ co-processors for the processing of Neural Networks (e.g., CNN). Finally, Xiong et al. [166] proposed IIsy, a system that enables programmable switches to perform in-network classification. The system maps trained ML classification models to match-action pipelines.
4) ML SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 23 compares the aforementioned ML schemes. While the goal of DAIET is to discuss what computations the network can perform, the authors did not design a complete system, nor did they address the major challenges of supporting ML applications. Moreover, their proof-of-concept presented a simple MapReduce application on a software switch, and it is not certain whether the system can be implemented on a hardware switch. Compared to DAIET, SwitchAgg does


TABLE 23. Machine learning schemes comparison.

not require modifying the network architecture, and offers better processing abilities with a significant data reduction rate. Moreover, SwitchAgg was implemented on an FPGA, and the results show that the job completion time can be reduced by as much as 50%.
SwitchML extended the literature on accelerating ML model training by providing a complete implementation and evaluation on a hardware switch. A commonly used training technique for deep neural networks is synchronous stochastic gradient descent [299]. In this technique, each worker has a copy of the model that is being trained. The training is an iterative process where each iteration consists of: 1) reading a sample of the dataset and locally performing some computation-intensive learning using the worker’s accelerators, which yields a gradient vector; and 2) updating the model by computing the mean of all gradient vectors. The main motivation of this idea is that the aggregation is computationally cheap (takes 100 ms), but is communication-intensive (transfers hundreds of megabytes each iteration). SwitchML uses computation on the switch to aggregate model updates in the network as the workers are sending them (see Fig. 18). An advantage is that there is minimal communication; each worker sends its update vector and receives back the aggregated updates. The design challenges of this system include: 1) the limitation of storage available on the switch, addressed by using a streaming approach; 2) switches cannot perform many computations per packet, addressed by partitioning the work between the switch and the workers; 3) ML systems use floating point numbers, addressed by quantization approaches; and 4) failure recovery is needed to ensure correctness. The system is implemented on a hardware switch (Tofino), and results show that the system speeds up training by up to 300% compared to existing distributed learning approaches.
With respect to in-network inference, it is challenging to implement full-fledged models as they require extensive computations (e.g., multiplications and activation functions). Simple variations such as the Binary Neural Network (BNN) only require bitwise logic functions (e.g., XNOR, POPCNT, SIGN). N2Net provides a compiler that translates a given BNN model to a switching chip’s configuration (P4 program). The authors did not mention on which platform N2Net was evaluated; however, based on their evaluations, they concluded that a BNN can be implemented on most current

FIGURE 18. (a) ML model updates in legacy networks. The aggregation process is communication-intensive and follows an all-to-all communication
pattern. This means that the workers should receive all the other workers’ updates. Since accelerators on end-hosts are becoming faster, the network
should speed up so that it does not become the bottleneck. Therefore, it is expensive to deploy additional accelerators since it requires re-architecting
the network. The red arrow in (a) shows that the bottleneck source is the network. (b) ML model updates accelerated by the network. Aggregation is
performed in the network by the programmable switches while the workers are sending them. The workers do not need to obtain the updates of all other
workers, hence there is minimal communication. They only obtain the aggregated model from the switch. The red arrow in (b) shows that the bottleneck
source is the worker rather than the network [163], [298].
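The aggregation pattern in Fig. 18(b) can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not SwitchML's actual design: the fixed-point scaling factor `SCALE`, the `SwitchAggregator` class, and the per-chunk slot bookkeeping are hypothetical names chosen for illustration. It shows three of the ideas described in the text: workers quantize floating-point gradients to integers, the switch only adds integers into a small slot per chunk (streaming), and the aggregate is released once every worker has contributed.

```python
from typing import Dict, List, Optional, Set, Tuple

SCALE = 1 << 16  # hypothetical fixed-point scaling factor shared by all workers


def quantize(vec: List[float]) -> List[int]:
    """Worker side: convert floats to integers, since the switch only does integer adds."""
    return [int(round(x * SCALE)) for x in vec]


def dequantize(total: List[int], num_workers: int) -> List[float]:
    """Worker side: convert the aggregated integers back to the mean gradient."""
    return [v / SCALE / num_workers for v in total]


class SwitchAggregator:
    """Switch side: integer sums kept in a slot per in-flight chunk."""

    def __init__(self, num_workers: int, chunk_size: int) -> None:
        self.num_workers = num_workers
        self.chunk_size = chunk_size
        self.slots: Dict[int, Tuple[List[int], Set[int]]] = {}

    def on_packet(self, worker_id: int, chunk_idx: int, values: List[int]) -> Optional[List[int]]:
        acc, seen = self.slots.setdefault(chunk_idx, ([0] * self.chunk_size, set()))
        if worker_id in seen:
            return None  # duplicate (e.g., retransmission): do not add twice
        for i, v in enumerate(values):
            acc[i] += v
        seen.add(worker_id)
        if len(seen) == self.num_workers:
            del self.slots[chunk_idx]  # free the slot so it can be reused
            return acc                 # would be multicast back to all workers
        return None
```

For example, with two workers sending `[0.5, -1.0]` and `[1.5, 3.0]` for the same chunk, the switch emits the integer sums after the second packet, and each worker recovers the mean `[1.0, 1.0]` with `dequantize`. The switch never stores a whole model update, only the chunks currently in flight.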

TABLE 24. Switch-based and server-based ML approaches.

switching chips, and with small additions to the chip design, more complex models can be implemented. IIsy studied other ML models. The authors of IIsy acknowledged that the work is limited in scope as it does not address popular ML algorithms such as neural networks. Furthermore, it is bounded by the type of features it can extract (i.e., packet headers), and has accuracy limitations. IIsy tries to find a balance between the limited resources on the switch and the classification accuracy. Finally, BaNaNa Split took a different approach by partitioning the processing of NN to offload a subset of layers from the CPU to a different processor. Note that the solution is far from complete, and the authors evaluated a single binary fully connected layer with 4096 neurons using a network processor-based SmartNIC.
C. COMPARISON BETWEEN SWITCH-BASED AND SERVER-BASED ML
Table 24 shows a comparison between switch-based and server-based ML approaches. ML works that were extracted from the literature can be divided into two main categories: 1) expedited inference in the data plane, and 2) accelerated training in the network. The main advantage of switch-based over server-based inference is the ability to execute at line rate, and hence provide faster results to the clients. Performing complex computations in the switch is achieved through estimations, and hence is limited. Moreover, the SRAM capacity of the switch is small, impeding the storage of large models. Such limitations are not problematic with server-based inference approaches.
Distributed training can be significantly faster when aggregations are offloaded to a centralized switch. However, due to the small capacity of the switch memory, it is not possible to store the whole model update at once. Additionally, encrypted traffic remains a challenge when inference or training is handled by the switch.
D. SUMMARY AND LESSONS LEARNED
Accelerating computations by leveraging programmable switches is becoming a trend in data centers and backbone networks. Although switches only support basic and limited operations, it was shown in the literature that the performance of various tasks (e.g., consensus, training models in machine learning) could significantly improve if computations are delegated to the network.
The majority of the in-network consensus works aim at implementing common consensus protocols such as Paxos and Raft in the data plane. Due to the hardware constraints, current schemes implement only simplified variations of the protocols. Future work could investigate implementing novel consensus algorithms that diverge from the existing complex ones. Further, such schemes should encompass failure recovery mechanisms.
Another interesting in-network application is ML training/inference acceleration. The literature has shown that significant performance improvements are attained when the switch aggregates model updates or classifies new samples. Future work could explore developing ML models for various tasks such as classification, regression, clustering, etc.
In addition to the aforementioned categories, data plane programming is being used for stream processing [167], [168], parallel processing [169], string searching [170], erasure coding [171], in-network lock managers [172], database query acceleration [173], in-network compression [174], and computer vision offloading [175].
X. INTERNET OF THINGS (IoT)
The Internet of Things (IoT) is a novel paradigm in which pervasive devices equipped with sensors and actuators collect physical environment information and control the outside world. IoT applications include smart water utilities, smart grid, smart manufacturing, smart gas, smart metering, and many others. Typical IoT scenarios entail a large number of devices periodically transmitting their sensors’ readings to remote servers. Data received on those collectors is then processed and analyzed to assist organizations in taking data-driven, intelligent decisions.
A. AGGREGATION
1) BACKGROUND
Since IoT devices are constrained in size and processing capabilities, they typically generate packets that carry small payloads (e.g., temperature sensor readings). While such packets are small in size, their headers occupy a significant portion of the total packet size. For instance, Sigfox Low-Power Wide Area Network (LPWAN) [300] can support a maximum payload size of 12 bytes per packet. The overhead of headers is 42 bytes (Ethernet 14 bytes + IP 20 bytes + UDP 8 bytes), which represents approximately 78% of the total packet size. When numerous devices continuously transmit IoT packets, a significant percentage of network bandwidth is wasted on transmitting these headers. Packet aggregation


TABLE 25. IoT aggregation schemes comparison.

is a mechanism in which the payloads of small packets are aggregated into a single larger packet in order to mitigate the bandwidth overhead caused by transmitting multiple headers.
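The header overhead quoted for a 12-byte Sigfox payload, and the effect of aggregation on it, can be checked with a short calculation. The sketch below assumes, for simplicity, that an aggregated packet carries one Ethernet/IP/UDP header set followed only by the concatenated payloads; any extra metadata that a real aggregation header would add is ignored, and the function name `header_share` is hypothetical.

```python
ETHERNET_BYTES = 14
IP_BYTES = 20
UDP_BYTES = 8
HEADER_BYTES = ETHERNET_BYTES + IP_BYTES + UDP_BYTES  # 42 bytes in total


def header_share(payload_bytes: int, packets_aggregated: int = 1) -> float:
    """Fraction of the transmitted bytes that are headers rather than payload."""
    total_payload = payload_bytes * packets_aggregated
    return HEADER_BYTES / (HEADER_BYTES + total_payload)


# One 12-byte payload per packet: 42 / (42 + 12), roughly 78% of the packet.
single = header_share(12)

# Eight 12-byte payloads sharing one set of headers: 42 / (42 + 96), roughly 30%.
aggregated = header_share(12, packets_aggregated=8)
```

Under these assumptions, aggregating eight readings reduces the header share from about 78% to about 30% of the bytes sent.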
Legacy packet aggregation mechanisms operate on the CPUs of servers or on the control plane of switches [301]–[306]. While legacy mechanisms reduce the overhead of packet headers, they unquestionably increase the end-to-end latency and decrease the throughput. As a result, some studies have suggested aggregating only packets that are not real-time.

FIGURE 19. IoT packets aggregation [176]. Frequent small IoT packets are aggregated by a P4 switch and encapsulated in a larger packet. Another switch across the WAN disaggregates the large packet to restore the original IoT packets. Such a mechanism prevents the frequent transmission of headers, and thus minimizes the bandwidth overhead.

2) IoT BANDWIDTH OVERHEAD REDUCTION
Wang et al. [176] presented an approach where small IoT packets are aggregated into a larger packet in the switch data plane (see Fig. 19). The goal of performing this aggregation is to minimize the bandwidth overhead of packets’ headers. The same authors [177] extended this work to solve some constraints related to the payload size and the number of aggregated packets. Similarly, Madureira et al. [179] proposed IoTP, a layer-2 communication protocol that enables the aggregation of IoT data in programmable switches. The solution gathers network information that includes the Maximum Transmission Unit (MTU), link bandwidths, underlying protocol, and delays. These properties are used to empower the aggregation algorithm.
3) AGGREGATION COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 25 compares the aforementioned IoT aggregation schemes. References [176] and [177] operate in the same way. Upon receiving a packet, the P4 switch parses its headers and identifies whether the packet is an IoT packet. If the packet is identified as an IoT packet, the switch parses and extracts the payload. Afterwards, the payload is stored in switch registers along with some other metadata, and the packet is dropped. Once packets are aggregated, the resulting packet is sent across the WAN to reach the remote server. Before the packet reaches the server, it is disaggregated by another P4 switch situated close to the server and several packets identical to the original ones are generated. An important observation is that the aggregation/disaggregation processes are transparent to both the IoT devices and the servers; hence, no modifications are required on either end. The main advantages of [177] over [176] are: 1) packets can have different payload sizes; 2) the payload size is no longer limited to 16 bytes; 3) the number of packets is dynamic and only limited by the packet MTU; and 4) both the disaggregation and the aggregation run at line rate.
4) COMPARISON BETWEEN SERVER-BASED AND SWITCH-BASED AGGREGATION
Table 26 shows a comparison between switch-based and server-based packet aggregation. When aggregation is performed on the switch (ASIC), the throughput is higher and the latency and jitter are lower than those of the server-based approaches (e.g., switch CPU, x86-based server). On the other hand, the server-based aggregation has more flexibility in defining the number of packets and the amount of data that can be aggregated. Note that if aggregation and disaggregation are executed on the IoT device itself, the session would suffer from long delays and low throughput. More importantly, the IoT device is limited in computational and energy resources.

TABLE 26. Switch-based and server-based packet aggregation.

B. SERVICE AUTOMATION
1) BACKGROUND
Low-power low-range IoT communication technologies (e.g., Bluetooth Low Energy (BLE) [307], Zigbee [308], Z-wave [309]) typically follow a peer-to-peer model. IoT devices in such technologies can be divided into two distinct types, peripheral and central. Peripheral devices, which consist of sensors and actuators, receive commands and execute subsequent actions. Central devices on the other hand run


applications that analyze information collected from peripheral devices and subsequently issue commands.
The interconnection of devices and services can follow a Peer-to-Peer (P2P) model or a cloud-centric approach. In the P2P model, the automation service runs on the central device which processes and analyzes sensor data published by peripheral devices in order to issue commands. The main advantages of the P2P model include the low end-to-end latency and the low power consumption as devices are physically close to each other. The drawbacks of the P2P model include poor scalability, short reachability, and inflexibility of policy enforcement. The cloud-centric model addresses the limitations of the P2P model by adding a gateway node that connects peripheral devices to a middleware hosted on the cloud (Internet). While this approach solves the poor scalability and the policy enforcement flexibility issues, it incurs additional delay and jitter in collecting and reacting to data. Moreover, the middleware represents a single point of failure which can shut down the whole service in the event of an outage. With programmable switches, researchers are investigating in-network approaches to manage transactional relationships between low-power, low-range IoT devices.
2) SERVICE MANAGEMENT AND MULTI-PROTOCOL PROCESSING
Uddin et al. [180] proposed Bluetooth Low Energy Service Switch (BLESS), a programmable switch that automates IoT application services by encoding their transactions in the data plane. It maintains link-layer connections to the devices to support P2P connectivity. The same authors proposed Muppet [181], an extension to BLESS to support multiple non-IP protocols.
3) SERVICE AUTOMATION COMPARISON, DISCUSSIONS, AND LIMITATIONS
In BLESS, the data plane operations are performed at the Attribute Protocol (ATT) service layer which consists of three operations: read attributes, write attributes, and attributes’ notification. BLESS parses ATT packets, then processes and forwards them to the devices. The control plane on the other hand is responsible for address assignment, device and service discovery, policy enforcement, and subscription management. The switch was implemented on a software switch (PISCES), and the results show that BLESS combines the advantages of P2P and the cloud-centric approaches. Specifically, it achieves small communication latency, low device power consumption, high scalability, and flexible policy enforcement. Muppet extended this approach to support multiple IoT protocols. The system studied two popular IoT protocols, namely BLE and Zigbee. Being in the middle, the Muppet switch is responsible for translating actions (e.g., switching a light bulb on/off) between Zigbee and BLE protocols, as well as logging important events to a database which resides on the Internet via the Hypertext Transfer Protocol (HTTP). Note that parsers and action policies have to be implemented for each supported protocol. Another difference from BLESS is that the implementation of Muppet’s control plane leverages the ONOS controller with the Protocol Independent (PI) framework.
4) COMPARISON BETWEEN SERVER-BASED AND SWITCH-BASED SERVICE AUTOMATION
Table 27 shows a comparison between switch-based, P2P, and cloud-based service automation. Generally, the switch-based approach overcomes the limitations of both the P2P and the cloud-based approaches. It achieves the low energy and latency characteristics of P2P while increasing scalability and reachability.

TABLE 27. Switch-based, P2P, and cloud service automation.

C. SUMMARY AND LESSONS LEARNED
In the context of IoT, there exist broadly two categories, namely, packet aggregation and service automation. The goal of packet aggregation is to minimize the overhead of IoT packets’ headers. Typically, headers in IoT packets represent a significant portion of the whole packet size. By aggregating several packets into a single packet, the bandwidth overhead is reduced. Future work should study the performance side-effects (e.g., delay, jitter, loss rate, retransmission) that aggregation causes to packets. Furthermore, timers should be implemented to avoid excessive delays resulting from waiting for enough packets to be aggregated.
With respect to service automation, the goal is to automate IoT application services by encoding their transactions in the data plane while improving scalability, reachability, energy consumption, and latency. Future work should design and develop translators for non-IP IoT protocols so that applications on various devices that run different protocols can exchange data. Additionally, production-grade software switches should be leveraged to support non-Ethernet IoT protocols.
Other works that involve IoT include flowlet-based stateful multipath forwarding [310] and SDN/NFV-based architecture for IoT networks [311].
XI. CYBERSECURITY
Extensive research efforts have been devoted to deploying programmable switches to perform various security-related functions in the data plane. Such functions include heavy hitter detection, traffic engineering, DDoS attack detection and mitigation, anonymity, and cryptography. Fig. 20 demonstrates the difference between contemporary security appliances and programmable switches with respect to layer inspection in the OSI model. Although programmable switches are limited in computation power, they are capable of inspecting upper layers (e.g., application layer) at line rate. Such functionality is not available in any of the existing solutions.

FIGURE 20. Layer inspection in the OSI model. (a) Contemporary security appliances. (b) Programmable switch.

A. HEAVY HITTER
1) BACKGROUND
Heavy hitters are a small number of flows that constitute most of the network traffic over a certain amount of time. They are identified based on the port speed, network RTT, traffic distribution, application policy, and others. Heavy hitters increase the flow completion time for delay-sensitive mice flows, and represent the major source of congestion. It is important to promptly detect heavy hitters in order to react to them; for instance, redirect them to a low priority queue, perform rate control and traffic engineering, block volumetric DDoS attacks, and diagnose congestion. Traditionally, packet sampling techniques (e.g., NetFlow) were used to detect heavy hitters. The main problem with such techniques is the limited accuracy due to the CPU and bandwidth overheads of processing samples in software. Advancements in programmable switches paved the way to detect heavy hitters in the data plane, which is not only orders of magnitude faster than sampling, but also enables additional applications (e.g., flow-size aware routing). The detection schemes can be classified as local and network-wide. In the former, the detection occurs on a single switch; in the latter, the detection covers the whole network.
2) LOCAL DETECTION
Sivaraman et al. [182] proposed HashPipe, a heavy hitter detection algorithm that operates entirely in the data plane. It detects the k heaviest flows within the constraints of programmable switches while achieving high accuracy. Furthermore, Kučera et al. [183] proposed Elastic Trie, a solution that detects hierarchical heavy hitters, in-network traffic changes, and superspreaders in the data plane. Hierarchical heavy hitters include the total activity of all traffic matching relevant IP prefixes. Ben-Basat et al. [184] proposed PRECISION, a heavy hitter detection algorithm that probabilistically recirculates a fraction of packets for a second pipeline traversal. The recirculation idea greatly simplifies the access pattern of memory without significantly degrading throughput. Tang et al. [185] proposed MV-Sketch, a solution that exploits the idea of majority voting to track the candidate heavy flows inside the sketch data structure. Finally, da Silva et al. [186] proposed a solution that identifies elephant flows in Internet eXchange Points (IXP) networks.
3) NETWORK-WIDE DETECTION
Harrison et al. [187] considered network-wide distributed heavy-hitter detection. The approach reports heavy hitters deterministically and without errors; however, it incurs significant communication costs that scale with the number of switches. Accordingly, the same authors proposed another scheme (Carpe [188]) which reports probabilistically with negligible communication costs. Ding et al. [189] proposed an approach for incrementally deploying programmable switches in a network consisting of legacy devices with the goal of monitoring as many distinct network flows as possible. The same authors of MV-Sketch proposed SpreadSketch [190], an extension to the Count-min sketch where each bucket is associated with a distinct counter to track the distinct items of a stream. SpreadSketch aims at mitigating the high processing overhead of MV-Sketch.
4) HEAVY HITTER DETECTION COMPARISON, LIMITATIONS, AND DISCUSSIONS
Table 28 compares the aforementioned heavy hitter schemes. A major criterion that differentiates the solutions is the selection and the implementation of the data structure. Hash tables and sketches are frequently used to store counters for heavy flows. Note that several variations of such data structures are being used in the literature, mainly to tackle the memory-accuracy tradeoff; the choice of data structure affects the accuracy of the performed measurements. For example, with probabilistic data structures, only approximations are performed.
In HashPipe, the programmable switch stores the flow identifiers and their byte counts in a pipeline of hash tables. HashPipe adapts the space saving algorithm which is described in [312]. The system was evaluated using an ISP trace provided by CAIDA (400,000 flows), and the results show that HashPipe needed only 80 KB of memory to identify the 300 heaviest flows, with an accuracy of 95%. Another hashtable-based solution is Elastic Trie, which consists of a prefix tree that expands or collapses to focus only on the prefixes that grab a large share of the network traffic. The data plane informs the control plane about high-volume traffic clusters in an event-based push approach only when some conditions are met. Other systems explored different data structures for the task. For instance, in [189] the authors used the HyperLogLog algorithm [313] which approximates the number of distinct elements in a multi-set. The solution is


TABLE 28. Heavy hitter schemes comparison.

capable of detecting heavy hitters by only using partial input from the data plane.
Another important criterion is whether the scheme tracks heavy hitters across the whole network. For example, unlike HashPipe which considers a single switch, [187] tracks network-wide heavy hitters. Tracking network-wide heavy hitters is important as some applications (e.g., port scanners, superspreaders, etc.) can go undetected within a single location. Moreover, aggregating the results of switches separately for detecting heavy hitters is not sufficient; flows might not exceed a threshold locally, but when the total volume is considered, the threshold might be crossed.
5) COMPARISON BETWEEN P4-BASED AND TRADITIONAL HEAVY HITTER DETECTION
The main advantage of heavy hitter detection schemes in the data plane over sampling-based approaches is the ability to operate at line rate. This means that every packet is considered in the detection algorithm, which improves accuracy and the speed of detection. Moreover, additional applications that exploit reactive processing can be implemented. For instance, switches can perform a flow-size aware routing method to redirect traffic upon detecting a heavy hitter.
B. CRYPTOGRAPHY
1) BACKGROUND
Performing cryptographic functions in the data plane is useful for a variety of applications (e.g., protecting layer 2 with cryptographic integrity checks and encryption, mitigating hash collisions, etc.). Computations in cryptographic operations (e.g., hashing, encryption, decryption) are known to be complex and resource-intensive. The supported operations in switch targets and in the P4 language are limited to basic arithmetic (e.g., additions, subtractions, bit concatenation, etc.). Recently, a handful of works have started studying the possibility of performing cryptographic functions in the data plane. Generally, cryptographic functions are executed externally (e.g., on a CPU) and invoked from the data plane.
2) EXTERNAL CRYPTOGRAPHY
The authors in [191] argue for the need to implement cryptographic hash functions in the data plane to mitigate potential attacks targeting hash collisions. Consequently, they presented prototype implementations of cryptographic hash functions in three different P4 target platforms (CPU, SmartNIC, NetFPGA SUME). Another work by Hauser et al. [192] attempted to implement host-to-site IPsec in P4 switches. For simplification, only Encapsulating Security Payload (ESP) in tunnel mode with different cipher suites is implemented. The same authors also proposed P4-MACsec [314], an implementation of MACsec on P4 switches. MACsec is an IEEE standard for securing Layer 2 infrastructure by encrypting, decrypting, and performing integrity checks on packets. Malina et al. [193] presented a solution where P4 programs invoke cryptographic functions (externs) written in VHDL on FPGAs. The goal of this work is to avoid coding cryptographic functions on hardware (VHDL), and thus enable rapid prototyping of in-network applications with security functions. Another work that relies on externs for cryptographic functions is P4NIS [194].
3) DATA PLANE CRYPTOGRAPHY
The previous works delegated the complex computations to the control plane. Chen [195] implemented the Advanced Encryption Standard (AES) in the data plane using scrambled lookup tables. AES is one of the most widely used symmetric cryptography algorithms; it applies several encryption rounds on 128-bit input data blocks.
4) CRYPTOGRAPHY SCHEMES COMPARISON, DISCUSSIONS AND LIMITATIONS

TABLE 29. Cryptography schemes comparison.

Table 29 compares the aforementioned cryptography schemes. With respect to hashing, P4 currently implements hash functions that do not have the characteristics of cryptographic hashing. For example, Cyclic Redundancy Check (CRC), which is commonly used in P4 targets, was originally developed for error detection. CRC can be easily implemented in embedded hardware, and is computationally much
pipeline passes. Note that recirculation uses loopback ports
less complex than cryptographic hash functions (e.g., Secure and hence is limited by their bandwidth. The implementation
Hash Algorithm (SHA)-256); however, it is not secure and on Tofino chip shows that ≈ 10Gbps throughput was attained.
has a high collision rate. Evaluation results in [191] show The authors argued that this throughput is sufficient to sup-
that 1) implementing cryptographic hash functions on CPU port various in-network security applications. Nevertheless,
is easy, but has high latency (several milliseconds); 2) Smart- it is possible to enhance the throughput by configuring addi-
NICs has the highest throughput, but can only process packets tional physical ports as loopback ports.
up to 900 bytes; and 3) NetFPGA has the lowest latency, but Note that there are other schemes that implements some
cannot be integrated using native P4 features. The authors cryptographic primitives in the data plane but are in the
found that the performance of hashing is highly dependent on Privacy and Anonymity category (Section XI-C).
the application, the input type, and the hashing algorithm, and
therefore there is no single solution that fits all requirements. 5) COMPARISON BETWEEN IN-NETWORK AND
However, P4 targets should benefit from the characteristics CONTEMPORARY CRYPTOGRAPHY
of each solution (CPU, SmartNICs, FPGA, and ASICs) to Cryptographic primitives often require performing complex
implement cryptographic hashing. arithmetic operations on data. Implementing such compu-
As for more complex protocol suites (e.g., IPsec), tations on general purpose servers is simple; memory and
Hauser et al. [192] only implemented Encapsulating Secu- processing units are not constrained. The literature has shown
rity Payload (ESP) in tunnel mode for simplification. The that there is a need to implement cryptographic functions
Security Policy Database (SPD) and the Security Association in the data plane. For instance, cryptographic hash func-
Database (SAD) are represented as match-action tables in tions can significantly improve existing data plane appli-
the P4 switch. To avoid complex key exchange protocols cations with respect to collisions; encryption can protect
such as the Internet Key Exchange (IKE), this work delegates confidential information from being exposed to the public.
runtime management operations to the control plane. More- However, switches have limitations when it comes to com-
over, since encryption and decryption are not supported by puting. Supported hash functions in P4 are non-cryptographic
P4, the authors relied on user-defined P4 externs to perform (e.g., CRC), and therefore, produce collisions when the
complex computations. Note that implementing user-defined table is not large. Consequently, researchers are continuously
externs is not applicable for ASIC (e.g., Tofino), and con- investigating techniques to perform such operations in the
sequently, the main CPU module of the switch is used for data plane.
performing encryption/decryption computations, at the cost
of increased latency and degraded throughput. Same ideas are C. PRIVACY AND ANONYMITY
applied to P4-MACsec by the same authors. Other works that 1) BACKGROUND
rely on externs include [193], [194]. Packets in a network carry information that can poten-
The system proposed by Chen [195] has significant perfor- tially identify users and their online behavior. Therefore,
mance advantages as it is fully implemented in the data plane. user privacy and anonymity have been extensively studied
The idea of the proposed system is to apply permuted lookup in the past (e.g., ToR and onion routing [315]). However,
tables by using an encryption key. The authors found that existing solutions have several limitations: 1) poor perfor-
a single switch pipeline is capable of performing two AES mance since overlay proxy servers are maintained by volun-
rounds. Consequently, the system leverages packet recircula- teers and have no performance guarantees; 2) deployability
tion technique which re-injects the packet into the pipeline. challenges; some solutions require modifying the whole
By doing so, it is possible to complete the 10 rounds of Internet architecture, which is highly unlikely; 3) no clear
encryption required by the AES-128 algorithm by using five partial deployment pathway; and 4) most solutions are


TABLE 30. Privacy and anonymity schemes comparison.

FIGURE 21. SPINE architecture [198].

software-based. Consequently, recent works started investigating methods that exploit
programmable switches to develop partially-deployable, low-latency, and lightweight
anonymity systems. With respect to anonymity and privacy in the network, a new class of
attacks that target the topology requires the attacker to know the topology and understand
its forwarding behavior. Such attacks can be mitigated by obfuscating (hiding) the topology
from external users. P4-based schemes are also being developed to achieve this goal.

2) USERS PRIVACY PROTECTION
Kim and Gupta [196] proposed Online Network Traffic Anonymization System (ONTAS), a system
that anonymizes traffic online using P4 switches. Moghaddam and Mosenia [197] proposed
Practical Anonymity at the NEtwork Level (PANEL), a lightweight and low overhead in-network
solution that provides anonymity into the Internet forwarding infrastructure. Likewise,
Datta et al. [198] proposed Surveillance Protection in the Network Elements (SPINE), a
system that anonymizes traffic by concealing IP addresses and relevant TCP fields (e.g.,
sequence number) from adversarial Autonomous Systems (ASes) on the data plane.
Wang et al. [199] proposed Programmable In-Network Obfuscation of DNS Traffic (PINOT), a
system where the packet headers are obfuscated in the data plane to protect the identity of
users sending DNS requests.
On the other hand, Meier et al. [200] proposed NetHide, a P4-based solution that obfuscates
network topologies to mitigate against topology-centric attacks such as Link-Flooding
Attacks (LFAs).

3) PRIVACY AND ANONYMITY SCHEMES DISCUSSIONS
Table 30 compares the privacy and anonymity schemes. NetHide aims at mitigating the attacks
targeting the network topology. The solution formulates network obfuscation as a
multi-objective optimization problem, and uses accuracy (hard constraints) and utility
(soft constraints) as metrics. The system then uses an ILP solver and heuristics. The P4
switches in this system capture and modify tracing traffic at line rate. The specifics of
the implementation were not disclosed, but the authors claim that the system was evaluated
on realistic topologies (more than 150 nodes), and more than 90% of link failures were
detected by operators, despite obfuscation.
ONTAS had a slightly different goal; it aims at protecting personally identifiable
information (PII) from online traces. The system overcomes the limitations of existing
systems, which either require network operators to anonymize packet traces before sharing
them with other researchers and analysts, or anonymize traffic online but with significant
overhead. ONTAS provides a policy language used by operators for expressing anonymization
tasks, which makes the system flexible and scalable. The system was implemented and tested
on a hardware switch, and results show that ONTAS entails 0% packet processing overhead and
requires half the storage compared to existing offline tools. A limitation of this system
is that it does not anonymize TCP/UDP field values. Another limitation is that it does not
support applying multiple privacy policies concurrently.
Another line of research (i.e., PANEL, SPINE) focused on protecting the identities of
Internet users. PANEL overcomes the performance limitations of popular anonymity systems
(e.g., Tor), and does not require entirely modifying the Internet routing and forwarding
protocols as proposed in [316] and [317]. Partial deployment is possible as PANEL can
co-exist with legacy devices. The solution involves: 1) source address rewriting to hide
the origin of the packet; 2) source information normalization (IP identification and TCP
sequence randomization) to mitigate against fingerprinting attacks; and 3) path information
hiding (TTL randomization) to hide the distance to the original sender at any given vantage
point.
As for SPINE, it does not require cooperation between switches and end-hosts, but assumes
that at least two entities (typically two ASes or two ISPs) are trusted. Fig. 21 shows the
SPINE architecture. The solution encrypts the IP addresses before the packets enter the
intermediary ASes. Therefore, adversarial devices only see the encrypted addresses in the
headers. It also encrypts the TCP sequence and ACK numbers to mitigate against attributing
packets to flows. SPINE transforms IPv4 headers into IPv6 headers when packets leave the
trusted entity and restores the IPv4 headers upon entering the trusted entity. These
operations enable routing to be performed in intermediary networks. The encrypted IPv4
address is inserted in the last 32 bits of the IPv6 destination address. The encryption
works by XORing the IP address with the hash of a pre-shared key and a nonce. The system
uses SipHash since it is easily implemented in the data plane.
Note that SPINE was implemented in software. If an ASIC implementation were to be done,
SPINE would require at least three pipeline passes to be fully executed (i.e., through
recirculation). Thus, the throughput of SPINE would be decreased by a factor of three. In
contrast, PINOT was executed on the ASIC with a single pipeline pass, and hence has a
higher throughput than the other solutions.
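
The XOR-based address encryption described for SPINE can be illustrated with a minimal
Python sketch. It is not the authors' implementation: keyed BLAKE2s stands in for SipHash
(which is not available in Python's hashlib), and the function names, the key, the nonce,
and the 2001:db8:: documentation prefix are assumptions made only for illustration.

```python
import hashlib
import ipaddress


def keystream32(key: bytes, nonce: bytes) -> int:
    """Derive a 32-bit pad from a keyed hash of the nonce.

    Assumption: keyed BLAKE2s is used here as a stand-in for SipHash.
    """
    digest = hashlib.blake2s(nonce, key=key, digest_size=4).digest()
    return int.from_bytes(digest, "big")


def xor_ipv4(addr: str, key: bytes, nonce: bytes) -> str:
    """Encrypt or decrypt an IPv4 address; XOR with the same pad is its own inverse."""
    value = int(ipaddress.IPv4Address(addr)) ^ keystream32(key, nonce)
    return str(ipaddress.IPv4Address(value))


def embed_in_ipv6(enc_ipv4: str, prefix: str = "2001:db8::") -> str:
    """Place the encrypted IPv4 address in the last 32 bits of an IPv6 address.

    The prefix is a hypothetical placeholder (IPv6 documentation range).
    """
    base = int(ipaddress.IPv6Address(prefix)) & ~0xFFFFFFFF
    return str(ipaddress.IPv6Address(base | int(ipaddress.IPv4Address(enc_ipv4))))


def extract_from_ipv6(addr6: str) -> str:
    """Recover the (still encrypted) IPv4 address from the low 32 bits."""
    low = int(ipaddress.IPv6Address(addr6)) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(low))
```

In this sketch, a trusted entity would call xor_ipv4 and embed_in_ipv6 when a packet leaves
it, and the peer trusted entity would call extract_from_ipv6 followed by xor_ipv4 with the
same key and nonce to restore the original address. How the nonce is conveyed between the
two entities is not covered here.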

4) PRIVACY AND ANONYMITY IN SWITCH-BASED AND LEGACY SYSTEMS
Contemporary approaches that provide privacy and anonymity in the Internet use special
routing overlay networks to hide the physical location of each node from other participants
(e.g., Tor). Such approaches have performance limitations as proxy servers (overlays) are
maintained by volunteers and have no performance guarantees. Moreover, they often require
performing advanced encryption routines to obfuscate where the packet originated (e.g., the
onion routing technique used by Tor involves encapsulating messages in several layers of
encryption). On the other hand, approaches that are based on programmable switches often
rely on header modification and simplified encryption and hashing to conceal information
(e.g., SPINE [198]).

D. ACCESS CONTROL
1) BACKGROUND
The selective restriction of access to digital resources is known as access control in
cybersecurity. Typically, access control begins with “authentication” in order to verify
the identity of a party. Afterwards, “authorization” is enforced through policies to
specify access rights to resources. To authenticate parties, methods such as passwords,
biometric analysis, cryptographic keys, and others are used. With respect to authorization,
methods such as ACLs are used to describe what operations are allowed on given objects.
With the advent of programmable switches, it is now possible to delegate authentication and
authorization to the data plane. As a result, access can be promptly granted or denied at
line rate, before reaching the target server. A clear advantage of this approach is that
servers are no longer busy processing access verification routines, which increases their
service throughput.

2) FIREWALLS
Datta et al. [201] presented P4Guard, a stateful P4-based configurable firewall that acts
based on predefined policies set by the controller and pushed as entries to data plane
tables. Similarly, Cao et al. [202] proposed CoFilter, another stateful firewall that
encodes the access control rules in the data plane. Li et al. [203] presented an
architecture in SDN-based clouds where P4-based firewalls are provided to the tenants.
Almaini et al. [204] proposed delegating the authentication of end hosts to the data plane.
The method is based on port knocking, in which hosts deliver a sequence of packets
addressed to an ordered list of closed ports. If the ports match the ones configured by the
network administrators, then the end host is authenticated, and subsequent packets are
allowed. Likewise, Zaballa et al. [205] implemented port knocking in the data plane.

FIGURE 22. Overview of Poise [206]. A compiler translates high-level policies into P4
programs and device configurations. Context packets are continuously sent from the clients
to the network, where the switches enforce the policies.

3) OTHER ACCESS CONTROL
Kang et al. [206] presented a scheme that implements context-aware security policies (see
Fig. 22). The policies are applicable to enterprise and campus networks with diverse
devices, i.e., Bring Your Own Device (BYOD) (e.g., laptops, mobile devices, tablets, etc.).
In context-aware policies, devices are granted access dynamically based on the device’s
runtime properties. Bai et al. [207] presented P40f, a tool that performs OS fingerprinting
on programmable switches, and consequently, applies security policies (e.g., allow, drop,
redirect) at line rate. Finally, Almaini et al. [208] implemented an authentication
technique based on One Time Passwords (OTP). The technique follows the Leslie Lamport
algorithm [318] in which a chain of successive hash functions is verified for
authentication.

4) ACCESS CONTROL COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 31 compares the aforementioned access control schemes. P4Guard provides access
control by translating high-level security policies into table entries. Note that P4Guard
only operates up to the transport layer (e.g., source/destination IP addresses,
source/destination ports, protocol, etc.), similar to a traditional firewall. While
programmable switches provide increased flexibility in the parser (e.g., parse beyond the
transport layer) and the packet processing logic, P4Guard did not leverage such
capabilities. It would be interesting to investigate additional capabilities such as those
enabled by next-generation firewalls (NGFW).
The solution in [204] controls access by performing authentication in the data plane. The
solution has several limitations since it relies on port knocking, a technique that has
several security implications. For instance, programmable switches do not use cryptographic
hashes, making the solution vulnerable to IP address spoofing attacks. Additionally,
unencrypted port knocking is vulnerable to packet sniffing. Furthermore, port knocking
relies on security through obscurity.
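
The port-knocking check described above amounts to a small per-source state machine. The
following Python sketch is a software model only, not the P4 code of [204] or [205]: the
class name, the knock sequence, and the use of a dictionary in place of switch registers
are all assumptions made for illustration.

```python
# Hypothetical ordered list of closed ports configured by the administrator.
KNOCK_SEQUENCE = (7000, 8000, 9000)


class PortKnockGate:
    """Software model of a data plane port-knocking gate keyed by source IP."""

    def __init__(self, sequence=KNOCK_SEQUENCE):
        self.sequence = tuple(sequence)
        self.progress = {}          # src IP -> correct knocks so far (a register array on a switch)
        self.authenticated = set()  # src IPs that completed the sequence

    def process(self, src_ip: str, dst_port: int) -> str:
        """Return "forward" for authenticated sources, "drop" otherwise."""
        if src_ip in self.authenticated:
            return "forward"
        step = self.progress.get(src_ip, 0)
        if dst_port == self.sequence[step]:
            step += 1
            if step == len(self.sequence):
                # Sequence completed: subsequent packets from this source are allowed.
                self.authenticated.add(src_ip)
                self.progress.pop(src_ip, None)
            else:
                self.progress[src_ip] = step
        else:
            # Wrong port: restart, counting this packet if it is the first knock.
            self.progress[src_ip] = 1 if dst_port == self.sequence[0] else 0
        # Knock packets themselves target closed ports and are never forwarded.
        return "drop"
```

Because the state is keyed only by the source IP address and the knocks travel in
cleartext, the model also makes the limitations listed above visible: an observer who sees
the sequence, or who can spoof the source address, can replay it.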


TABLE 31. Access control schemes comparison.

In [206], the scheme dynamically enforces access control to users based on contexts (e.g.,
if the user’s device uses Secure Shell (SSH) 2.0 or higher, then the switch forwards the
packets of this flow. Otherwise, the switch drops the packets). The scheme requires user
devices to run an application which communicates with the switch using a custom protocol
(context packets). The context packets are generated on a per-flow basis. The switch tracks
flows using a match-action table and registers at the data plane. Actions over a packet are
dropping, allowing, and forwarding to other appliances for deep packet inspection. Data
packets are not modified. Evaluations show that the proposed approach can operate (install
new flows and update rules) with minimum latency, even under heavy DoS attacks. On the
other hand, such attacks can decimate similar SDN-based systems. One of the main drawbacks
of the proposed system is the lack of authentication, integrity, and confidentiality of the
context packets. Thus, the system can be subject to attacks such as snooping (i.e.,
eavesdropping) on communication between user devices and the switch, impersonation, and
others.
Finally, [207] proposes fingerprinting the OS in the data plane. The main motivation behind
this work is that software-based passive fingerprinting tools (e.g., p0f [319]) are neither
practical nor sufficient with large amounts of traffic on high-speed links. Furthermore,
out-of-band monitoring systems cannot promptly take actions (e.g., drop, forward,
rate-limit) on traffic at line rate. The main drawback of the solution is that it lacks
sophisticated policies that involve rate-limiting traffic.

5) COMPARISON BETWEEN SWITCH-BASED AND SERVER-BASED ACCESS CONTROL
Controlling access to resources often starts with authentication. While server-based
approaches are more flexible in the methods of authentication they can provide, they
typically require client connections to reach the server before the communication starts.
In switch-based approaches, the authentication can be done in-network at the edge,
eliminating unnecessary latency incurred from traversing the network and from software
processing.
Various features are considered when comparing P4-based firewalls to traditional firewalls.
First, P4 firewalls are capable of performing header inspection above the transport layer
(also known as deep packet inspection (DPI)), whereas traditional firewalls only reach the
transport layer and typically operate on the 5-tuple fields. It is important to note that
DPI in P4 switches is limited: if only a few bytes are parsed above the transport layer,
line rate will be achieved; however, if the packet is deeply parsed, the throughput will
start degrading accordingly. Second, in P4, policies and rules can be customized to be
activated based on arbitrary information stored in the switch state (e.g., measurements
through streaming); such capabilities are not present in traditional firewalls. Third, in
P4, the exclusivity and innovation of access control algorithms are solely attributed to
operators, unlike fixed-function firewalls which are provided by device vendors. Note that
non-programmable Next-Generation Firewalls (NGFW) are capable of performing advanced DPI at
the cost of having much lower throughput than the line rate.
Access to resources can be controlled after fingerprinting the OS of end-hosts.
Software-based passive fingerprinting tools cannot keep up with the high load (gigabits/s
links). The literature has shown that said tools lead to 38% degradation in throughput
[320]. Additionally, such tools are out-of-band, meaning that it is not possible to apply
policies on traffic (e.g., after fingerprinting an OS). On the other hand, switch hardware
is able to perform OS fingerprinting and apply security policies at line rate.
Context-aware policies applied on nodes (clients/servers) have local visibility. A newer
approach is to use a centralized SDN controller (e.g., [321]), but such a scheme is
vulnerable to control plane saturation attacks and is subject to increased delay.
Switch-based schemes, on the other hand, are able to provide access control at line rate.

E. DEFENSES
1) BACKGROUND
DDoS attacks remain among the top security concerns despite the continuous efforts towards
the development of their detection and mitigation schemes. This concern is exacerbated not


only by the frequency of said attacks, but also by their high volumes and rates. Recent
attacks (e.g., [322], [323]) reached the order of terabits per second, a rate that existing
defense mechanisms cannot keep up with.
There are two main concerns with existing defense methods handled by end-hosts or deployed
as middlebox functions on x86-based servers. First, they dramatically degrade the
throughput and increase latency and jitter, impacting the performance of the network.
Second, they present severe consequences on the network operation when they are installed
at the last mile (i.e., far from the edge). The escalation of volumetric DDoS attacks and
the lack of robust and efficient defense mechanisms motivated the idea of architecting
defenses into the network. Up until recently, in-network security methods were restricted
to simple access control lists encoded into the switching and routing devices. The main
reason is that the data plane was fixed in function, impeding the capabilities of
developing customized and dynamic algorithms that can assist in detecting attacks. With the
advent of programmable data planes, it is possible to develop systems that detect and
mitigate various types of attacks without imposing significant overhead on the network.

2) ATTACK-SPECIFIC
Hill et al. [209] presented a system that tracks flows in the data plane using Bloom
filters. The authors evaluated SYN flooding as a use case for their system. Li et al. [210]
presented NETHCF, a Hop-Count Filtering (HCF) defense mechanism that mitigates spoofed IP
traffic. HCF schemes filter spoofed traffic with an IP-to-hop-count mapping table. Another
attack-specific scheme proposed by Febro et al. [211] mitigates against distributed SIP
DDoS in the data plane. Furthermore, Scholz et al. [212], [213] presented a scheme that
defends against SYN flood attacks. Ndonda and Sadre [214] implemented an intrusion
detection system in P4 that whitelists and filters Modbus protocol packets in industrial
control systems.

3) GENERIC ATTACKS
Some schemes are generic and aim at addressing multiple attacks concurrently. For instance,
Xing et al. [215] proposed FastFlex, an abstraction that architects defenses into the
network paths based on changing attacks. Kang et al. [216] presented an automated approach
for discovering sensitivity attacks targeting the data plane programs. Sensitivity attacks
in this context are intelligently crafted traffic patterns that exploit the behavior of the
P4 program. Lapolli et al. [217] implemented a mechanism to perform real-time DDoS attack
detection based on entropy changes. Such changes are used to compute anomaly detection
thresholds. Mi and Wang [218] proposed ML-Pushback, a P4-based implementation of the
Pushback method [219].
Zhang et al. [220] proposed Poseidon, a system that mitigates against volumetric DDoS
attacks through programmable switches. It provides a language where operators can express a
range of security policies. Friday et al. [221] proposed a unified in-network DDoS
detection and mitigation strategy that considers both volumetric and slow/stealthy DDoS
attacks. Xing et al. [222] proposed NetWarden, a broad-spectrum defense against network
covert channels in a performance-preserving manner. The method in [223] models a stateful
security monitoring function as an Extended Finite State Machine (EFSM) and expresses the
EFSM using P4 abstractions. Ripple [224] provides decentralized link-flooding defense
against dynamic adversaries.
da Silveira Ilha et al. [225] presented EUCLID, an extension to [217] where the data plane
runs a fine-grained traffic analysis mechanism for DDoS attack detection and mitigation.
EUCLID is based on information-theoretic and statistical analysis (entropy) to detect the
attacks. Khooi et al. [226] presented a Distributed In-network Defense Architecture (DIDA),
a solution that deals with sophisticated amplified reflection DDoS. Ding et al. [227]
proposed INDDoS, an in-network DDoS victim identification system that fingerprints the
devices for which the number of packets exceeds a certain threshold. Musumeci et al. [228]
proposed a system where ML algorithms executed on the control plane update the data plane
after observing the traffic. Finally, Liu et al. [229] proposed Jaqen, an inline DDoS
detection and mitigation scheme that addresses a broad range of attacks in an ISP
deployment.

4) DEFENSE SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 32 compares the aforementioned defense schemes. Broadly, defense schemes can be
grouped into two main categories: attack-specific and generic. The attack-specific category
consists of the works that address a specific attack (e.g., NETHCF for IP spoofing, [211]
for SIP DDoS, etc.), while the generic category aims at addressing various types of attacks
(e.g., FastFlex for various availability attacks, Ripple for link flooding attacks, etc.).
The significant advantage of architecting defenses in the data plane is the performance
improvement of the application. For instance, NETHCF is motivated by the fact that
traditional HCF-based schemes are implemented on end-hosts, which delays the filtering of
spoofed packets and increases the bandwidth overhead. Moreover, since traditional schemes
are implemented in server-based middleboxes, low latency and minimal jitter are hard to
achieve. Similarly, FastFlex advocates the need to offload the defenses to the data plane.
Specifically, it tackles the following key challenges that are faced when programming
defenses in the data plane: 1) resource multiplexing; 2) optimal placement; 3) distributed
control; and 4) dynamic scaling.
When deploying defenses in the data plane, operators must be aware of the capabilities of
the constrained targets. Many operations that require extensive computations cannot be
easily implemented on the data plane. The existing works either approximate the
computations in the data plane (considering the computation complexity and the measurement
accuracy trade-off), or delegate the computations to external processors


TABLE 32. Defenses schemes comparison.

(e.g., CPU on the switch, external server, SDN controller, etc.). For instance, NETHCF
decouples the HCF defense into a cache running in the data plane and a mirror in the
control plane. The cache serves the legitimate packets at line rate, while the mirror
processes the missed packets, maintains the IP-to-hop-count mapping table, and adjusts the
state of the system based on network dynamics. In Poseidon, the defense primitives are
partitioned to be executed on switches and on servers, based on their properties. On the
other hand, in [217], the authors estimated the entropies of source and destination IP
addresses of incoming packets for consecutive partitions (observation windows) in the data
plane, without consulting external devices.
Perhaps the most significant state-of-the-art works in the defense schemes are Poseidon and
Jaqen. Poseidon provides a modular abstraction that allows operators to express their
defense policies. Poseidon requires external modules running on servers, making its
deployment challenging, especially in ISP settings. Furthermore, such a design incurs
additional costs and undesirable latency. Jaqen addressed those limitations and was
designed to be executed fully in the switch, without external support from servers.
Additionally, Jaqen used universal sketches as data structures; this selection enables
detecting a wide range of attacks instead of crafting custom algorithms for specific ones.
Network-wide defenses are those that are not restricted to a single switch, and require
multiple switches to co-operate in the attack detection and mitigation phases. Such
co-operation significantly improves the accuracy and the promptness of the detection. More
details on network-wide data plane systems are provided in Section XIII-D.
Finally, Table 32 lists some limitations of the existing schemes, which can be explored in
future work to advance the state-of-the-art.

5) COMPARISON BETWEEN P4-BASED AND TRADITIONAL DEFENSE SCHEMES
Network attacks such as large-scale DDoS and link flooding may have a substantial impact on
the network operation. For such attacks, server-based defenses deployed at the last mile
are problematic and inherently insufficient, especially when attacks target the network
core. Moreover, it is not feasible to detect and mitigate a large volume of attack traffic
(e.g., SYN flood) on end-hosts without impacting the throughput of the network. Other
defense schemes are proprietary, and hence are costly and limited to the detection
algorithms


provided by the vendors. Table 33 highlights the costs and the performance differences between switch-based schemes (Poseidon and Jaqen) and other existing solutions. When defenses are architected into the network (i.e., detection and mitigation are programmed into the forwarding devices), it is easy to detect, throttle, or drop suspicious traffic at any vantage point, at line rate, with significant cost reductions.

TABLE 33. Comparison of DDoS defense schemes. Source: [229].

F. SUMMARY AND LESSONS LEARNED
In the context of cybersecurity, a wide range of works leveraged programmable switches to achieve the following goals: 1) detect heavy hitters and apply countermeasures; 2) execute cryptographic primitives in the data plane to enable further applications; 3) protect the identity and the behavior of end-hosts, as well as obfuscate the network topology; 4) enforce access control policies in the network while considering network dynamics; and 5) architect defenses in the data plane to accelerate the detection and mitigation processes.
Identifying heavy hitters at line rate has several advantages. Recent works considered various data structures and streaming algorithms to detect heavy hitters. Future systems could explore more complex data structures that reduce the amount of state storage required on the switches. Furthermore, novel systems must minimize the false positives and the false negatives compared to both P4-based and legacy heavy hitter detection systems. Finally, new schemes should explore strategies for incremental deployment while maximizing flow visibility across the network.
There is an absolute necessity to implement cryptographic functions (e.g., hash, encrypt, decrypt) in the data plane. Such functions can be used by various applications that require few hash collisions (e.g., load balancing) and strong data protection. Most existing efforts delegate the complex computations to the control plane. However, recent systems have demonstrated that AES, a well-known symmetric key encryption algorithm, can be implemented in the data plane.
Another interesting line of work provided privacy and anonymity to the network. Recent efforts obfuscated the network topology in order to mitigate topology-centric attacks (e.g., LFA). Such systems must preserve the practicality of path tracing tools, while being robust against obfuscation inversion. Additionally, link failures in the physical topology should remain visible after obfuscation. Furthermore, when randomizing identifiers to achieve session unlinkability, the identifiers must fit into the small fixed header space so that compatibility with legacy networks is preserved. Other efforts considered rewriting source information and concealing headers to protect the identity of Internet users.
Finally, access control methods and in-network defenses were proposed. Future access control schemes should explore further in-network methods to authenticate the users, beyond port knocking. Additionally, since switches are capable of inspecting upper-layer headers, it is worth exploring offloading some next generation firewall functionalities to the data plane (such as in [327]). For instance, in [170], the authors proposed a system that allows searching for keywords in the payload of the packet. Similar techniques could be leveraged to achieve URL filtering at line rate. Additionally, schemes should mitigate stealthy, slow DDoS attacks.

XII. NETWORK TESTING
Although programmable switches provide flexibility in defining the packet processing logic, they introduce the risk of deploying erroneous and buggy programs. Such bugs may cause severe damage, especially when they are unexpectedly triggered in production networks. In such scenarios, the network starts experiencing a degradation in performance as well as disruption in its operation. Bugs can occur in various phases in the P4 program development workflow (e.g., in the P4 program itself, in the controller updating data plane table entries, in the target compiler, etc.). Bugs usually manifest after processing a sequence of packets with certain combinations not envisioned by the designer of the code. This section gives an overview of the troubleshooting and verification schemes for P4 programmable networks.

A. TROUBLESHOOTING
1) BACKGROUND
Network troubleshooting has attracted intensive research interest. Previous efforts are mainly based on passive packet behavior tracking through the usage of monitoring technologies (e.g., NetSight [328], EverFlow [329]). Other techniques (e.g., Automatic Test Packet Generation (ATPG) [330]) send probing packets to proactively detect network bugs. Such techniques have two main problems. First, the number of probe packets increases exponentially as the size of the network increases. Second, the coverage is limited by the number of probe-generating servers. Despite the flexibility that programmable switches offer, writing data plane programs increases the chance of introducing bugs into the network. Programs are inevitably prone to faults which could significantly compromise the performance of the network and incur high penalty costs.

2) PROGRAMMABLE NETWORKS TROUBLESHOOTING
Zhang et al. [230] proposed P4DB, an on-the-fly runtime debugging platform. The system debugs P4 programs in three levels of visibility by provisioning operator-friendly primitives: watch, break, and next. Zhou et al. [231] proposed P4Tester, a troubleshooting system for data plane


runtime faults. It generates an intermediate representation of P4 programs and table rules based on the BDD data structure. Dumitru et al. [232] examined how three different targets, BMv2, P4-NetFPGA, and Barefoot's Tofino, behave when undesired behaviors are triggered. Kodeswaran et al. [233] proposed a data plane primitive for detecting and localizing bugs as they occur in real time. Finally, Zhou et al. [234] proposed KeySight, a platform that troubleshoots programmable switches with high scalability and high coverage. It uses the Packet Equivalence Class (PEC) abstraction when generating probes.
Some schemes such as Whippersnapper [331], BB-Gen [332], P8 [333], and [334] provide benchmarking for P4 programs and aim at understanding their performance.

TABLE 34. Troubleshooting schemes comparison.

3) TROUBLESHOOTING SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 34 compares the aforementioned troubleshooting schemes. Essentially, the schemes either passively track how packets are processed inside switches (e.g., [230], [233]) or diagnose faults by injecting probes (e.g., [231], [234]). The main limitation of passive detection is that schemes can only detect rule faults that have been triggered by existing packets, and cannot check the correctness of all table rules. On the other hand, probing-based schemes may incur large control and probe overheads.
Examples of probing-based schemes include P4Tester and KeySight. P4Tester generates an intermediate representation of P4 programs and table rules based on the BDD data structure. Afterwards, it performs an automated analysis to generate probes. Probes are sent using source routing to achieve high rule coverage while maintaining low overheads. The system was prototyped on a hardware switch (Tofino), and results show that it can check all rules efficiently and that the probe count is smaller than that of server-based probe injection systems (i.e., ATPG and Pronto).
Other schemes that use passive fault detection (e.g., P4DB) assume that packets consistently trigger the runtime bugs. P4DB debugs P4 programs in three levels of visibility by provisioning operator-friendly primitives: watch, break, and next. P4DB does not require modifying the implementation of the data plane. It was implemented and evaluated on a software switch (BMv2), and the results show that it is capable of troubleshooting runtime bugs with a small throughput penalty and little latency increase.
Another important criterion that differentiates the troubleshooting schemes is the memory footprint they require. Some schemes (e.g., P4DB) require more memory than others (e.g., KeySight).
Finally, the work in [232] is different from the others. The authors examined how three different targets, BMv2, P4-NetFPGA, and Barefoot's Tofino, behave when undesired behaviors are triggered. The authors first developed buggy programs in order to observe the actual behavior of targets. Then, they examined the most complex P4 program publicly available, switch.p4, and found that it can be exploited when attackers know the specifics of the implementation. In summary, the paper suggests that BMv2 leaks information from previous packets. This behavior is not observed with the other two targets. Furthermore, the authors were able to perform privilege escalation on switch.p4 due to a header destined to ensure communication between the CPU and the P4 data plane.

4) COMPARISON LEGACY VS. P4-BASED DEBUGGING
In legacy networks, network devices are equipped with fixed-function services that operate on standard protocols. Troubleshooting these networks often involves testing protocols and typical data plane functions (e.g., layer-3 routing) through rigid probing. On the other hand, with programmable networks, since operators have the flexibility of defining custom data plane functions and protocols, testing is more complex and is program-dependent. Probing-based approaches should craft patterns depending on the deployed P4 program. Other approaches proposed primitives that increase the levels of visibility when debugging P4 programs. Research work extracted from the literature shows that it is essential to develop flexible mechanisms that operate dynamically on diverse P4 programs and targets.

B. VERIFICATION
1) BACKGROUND
Program verification consists of tools and methods that ensure correctness of programs with respect to specifications and properties. Verification of P4 programs is an active area as bugs can cause faults that have drastic impacts on the performance and the security of networking systems. Static P4 verification handles programs before deployment to the network, and hence, cannot detect faults that occur at runtime. On the other hand, runtime verification uses passive measurements and proactive network testing. This section describes the major verification work pertaining to P4 programs.


2) PROGRAM VERIFICATION

TABLE 35. Verification schemes comparison.

Lopes et al. [235] proposed P4NOD, a tool that compiles
P4 specifications to Datalog rules. The main motivation
behind this work is that existing static checking tools (e.g.,
Header Space Analysis (HSA) [335], VeriFlow [336]) are not
capable of handling changes to forwarding behaviors without
reprogramming tool internals. The authors introduced
"well-formedness" bugs, a class of bugs arising due to the
capabilities of modifying and adding headers.
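As a rough illustration of the well-formedness bug class, the following Python sketch walks a straight-line sequence of header operations and flags any read of a header that is not valid at that point. It is not P4NOD's Datalog encoding; the operation kinds ("add", "remove", "read") and the helper `find_invalid_reads` are hypothetical names introduced only for this example.

```python
from typing import List, Set, Tuple

# Each operation is (kind, header). The kinds are hypothetical and only
# loosely mirror P4's setValid(), setInvalid(), and field accesses.
Op = Tuple[str, str]


def find_invalid_reads(parsed: Set[str], ops: List[Op]) -> List[int]:
    """Return the indices of operations that read a header that is not valid."""
    valid = set(parsed)  # headers extracted by the parser
    errors = []
    for i, (kind, header) in enumerate(ops):
        if kind == "add":
            valid.add(header)
        elif kind == "remove":
            valid.discard(header)
        elif kind == "read":
            if header not in valid:
                errors.append(i)
        else:
            raise ValueError(f"unknown operation: {kind}")
    return errors


ops = [
    ("read", "ipv4"),    # fine: ipv4 was parsed
    ("read", "vlan"),    # bug: vlan was never parsed or added
    ("add", "vlan"),
    ("read", "vlan"),    # fine after the header was added
    ("remove", "ipv4"),
    ("read", "ipv4"),    # bug: ipv4 was removed earlier
]
print(find_invalid_reads({"ethernet", "ipv4"}, ops))
```

A real verifier would have to reason about every parser path and every branch of the control flow rather than a single linear trace, which is why the tools surveyed here rely on Datalog, symbolic execution, or theorem provers.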
Another interesting work is ASSERT-P4 [236], [237], a network verification technique that checks at compile-time the correctness and the security properties of P4 programs. ASSERT-P4 offers a language with which programmers express their intended properties with assertions. After annotating the program, a symbolic execution takes place with all the assertions being checked while the paths are tested.
Further, Liu et al. [238] proposed p4v, a practical verification tool for P4. It allows the programmer to annotate the program with Hoare logic clauses in order to perform static verification. To improve scalability, the system suggests adding assumptions about the control plane and domain-specific optimizations. The control plane interface is manually written by the programmer and is not verified, which makes it error-prone and cumbersome. The authors evaluated p4v on both open-source and proprietary P4 programs (e.g., switch.p4) that have different sizes and complexities.
Nötzli et al. [239] proposed p4pktgen, a tool that automatically generates test cases for P4 programs using symbolic execution and concrete paths. The tool accepts as input a JSON representation of the P4 program (output of the p4c compiler for BMv2), and generates test cases. These test cases consist of packets, table configurations, and expected paths. Similarly, Lukács et al. [240] described a framework for verifying functional and non-functional requirements of protocols in P4. The system translates a P4 program into a versatile symbolic formula to analyze various performance costs. The proposed approach estimates the performance cost of a P4 program prior to its execution.
Stoenescu et al. [241] proposed Vera, a symbolic execution-based verification tool for P4 programs. The authors argue in this paper that a data plane program should be verified before deployment to ensure safe operations. Vera accepts as input a P4 program, and translates it to a network verification language, SEFL. It then relies on SymNet [337], a network static analysis tool based on symbolic execution, to analyze the behavior of the resulting program. Essentially, Vera generates all possible packet layouts after inspecting the program's parser and assumes that the header fields can accept any value. Afterwards, it tracks the paths when processing these packets in the program, following all branches to completion. For scalability improvements, Vera utilizes a novel match-forest data structure to optimize updates and verification time. Parsing/deparsing errors, invalid memory accesses, loops, among others, can be detected by Vera.
A different approach, which uses reinforcement learning, is P4RL [242], a fuzz testing system that automatically verifies P4 switches at runtime. The authors described a query language, p4q, in which operators express their intended switch behavior. A prototype that executes verification on a layer-3 switch was implemented, and results show that P4RL detects various bugs and outperforms the baseline approach.
Finally, Dumitrescu et al. [243] proposed bf4, an end-to-end P4 program verification tool. It aims at guaranteeing that deployed P4 programs are bug-free. First, bf4 finds potential bugs at compile-time. Second, it automatically generates predicates that must be followed by the controller whenever a rule is to be inserted. Third, it proposes code changes if additional bugs remain reachable. bf4 executes a monitor at runtime that inspects the rules inserted by the controller and raises an exception whenever a predicate is not satisfied. The authors executed bf4 on various data plane programs and discovered interesting bugs that were not detected by state-of-the-art approaches.

3) VERIFICATION SCHEMES DISCUSSIONS
Table 35 compares the aforementioned verification schemes. Essentially, some schemes translate P4 programs to verification languages and engines. For instance, in [235], P4 programs are translated to Datalog to verify reachability and well-formedness. Similarly, in [238], P4 programs are converted into Guarded Command Language (GCL) models, and then the Z3 theorem prover is used to verify that several safety, architectural, and program-specific properties hold. Other schemes (e.g., p4pktgen, Vera) use symbolic execution to generate test cases for P4 programs.
The verification schemes were evaluated on different P4 programs from the literature. A program that was evaluated by most schemes is switch.p4, which implements various networking features needed for typical cloud data centers, including Layer 2/3 functionalities, ACL, QoS, etc. It is recommended for future schemes to evaluate switch.p4 as well as other programs from the literature. Finally, P4RL checks path-related consistency between the data and control planes.

4) P4-BASED AND TRADITIONAL NETWORK VERIFICATION
Traditional verification techniques that address the security properties in computer networks are mainly related to host reachability, isolation, blackholes, and loop-freedom.


Techniques that check for the aforementioned properties include Anteater [338], which models the data plane as Boolean functions to be used in a Boolean Satisfiability Problem (SAT) solver, NetPlumber [339], which uses header space algebra [335], and others (e.g., VeriFlow [336], DeltaNet [340], Flover [341], and VMN [342]).
Since P4 programs incorporate customized protocols and processing logic to be used in the data plane, traditional tools are not capable of handling changes to forwarding behaviors without reprogramming their internals. Therefore, verification techniques in programmable networks rely on analyzing the P4 programs themselves since they define the behavior of the data plane.

C. SUMMARY AND LESSONS LEARNED
Network testing can generally be divided into debugging/troubleshooting network problems and verifying the behavior of forwarding devices. While traditional tools and techniques were adequate for non-programmable networks, they are insufficient for programmable ones due to their inability to handle changes to forwarding behaviors without reprogramming and restructuring their internals. A variety of works were proposed to analyze and model P4 programs in order to troubleshoot and verify the correctness of networks' operations.
Network measurements can be collected through P4 switches and used to troubleshoot and verify the correctness of networks (control loop). Future work could explore methods that make a network more autonomous and capable of healing itself (e.g., self-driving networks, knowledge-defined networking, zero-touch networks) by leveraging the collected inputs from programmable switches.

XIII. CHALLENGES AND FUTURE TRENDS
In this section, a number of research and operational challenges that correspond to the proposed taxonomy are outlined. The challenges are extracted after comprehensively reviewing and diving into each work in the described literature. Further, the section discusses and pinpoints several initiatives for future work that are worth pursuing in this important field of programmable switches. The challenges and the future trends are illustrated in Fig. 23.

FIGURE 23. Challenges and future trends. The references represent examples of existing works that tackle the corresponding future trends.

A. MEMORY CAPACITY (SRAM AND TCAM)
Stateful processing is a key enabler for programmable data planes as it allows applications to store and retrieve data across different packets. This advantage enabled a wide range of novel applications (e.g., in-network caching, fine-grained measurements, stateful load balancing, etc.) that were not possible in non-programmable networks. The amount of data stored in the switch is limited by the size of the on-chip memory, which ranges from tens to hundreds of megabytes at most. Consequently, the majority of stateful applications face trade-offs between performance and memory usage. For instance, the efficiency of caching, which is determined by the hit rate, is directly affected by the memory size. Furthermore, the vast majority of measurement applications require storing statistics in the data plane (e.g., byte/packet counters). The number of flows to be measured and the richness of measurement information are bounded by the size of the memory in the switch.

FIGURE 24. Expanding switch memory by leveraging remote DRAM on commodity servers [368].

Current and Future Initiatives: A notable work by Kim et al. [368], [369] suggests accessing remote Dynamic Random Access Memory (DRAM) installed on data center servers purely from the data plane to expand the available memory on the switch. The bandwidth of the chip is traded for the bandwidth needed to access the external DRAM. The approach is cheap and flexible since it reuses existing resources in commodity hardware without adding infrastructure costs. The system is realized by allowing the data plane to access remote memory through an access channel (RDMA over Converged Ethernet (RoCE)) as shown in Fig. 24. The implementation shows that the proposal achieves throughput close to the line rate and incurs only 1-2 microseconds of extra latency (Fig. 25). There are some limitations in this approach that can be explored in the future.


• The current implementation only supports address-based memory access, and hence, complicated data layouts and ternary matching in remote memory should be explored.
• Frequent updates to the remote memory require several packets for fetching and adding. This is common in measurement applications where counters are continuously incremented. A possible solution to the bandwidth overhead is aggregating updates into a single operation. This comes at the cost of delaying the updates.
• Packet loss between the switch and the remote memory should be handled; otherwise, the performance of the application and the freshness of the remote values might be affected.
• The interaction between general data plane applications and the remote memory is challenging. A potential improvement is designing well-defined APIs to facilitate the interaction.

FIGURE 25. Accessing remote DRAM latency overhead. Only 1-2 µs additional latency. Achieved throughput close to the line rate (≈ 37.5 Gbps). Reproduced from [368].

B. RESOURCES ACCESSIBILITY
Besides the size limitation of the on-chip memory, there are other restrictions that data plane developers should take into account [52], [373]. First, since the table memory is local to each stage in the pipeline, a stage cannot reclaim unused memory from other stages. As a result, memory and match/action processing are fused, making the placement of tables challenging. Second, the sequential execution of operations in the pipeline leads to poor utilization of resources, especially when the matches and the actions are imbalanced (i.e., the presence of default actions that do not need a match).
Current and Future Initiatives: An interesting work by Chole et al. [367] explored the idea of disaggregating the memory and compute resources of a programmable switch. The main notion of this work is to centralize the memory as a pool that is accessed by a crossbar. By doing so, each pipeline stage no longer has local memory. Additionally, this work solves the sequential execution limitation by creating a cluster of processors used to execute operations in any order. The main limitation of this approach is the lack of adoption by hardware vendors. Most of the switch vendors (e.g., Cavium's XPliant and Barefoot's Tofino) do not implement the disaggregation model and follow the regular Reconfigurable Match-action Tables (RMT) model. The implementation and analysis of the disaggregation model on hardware targets should be explored in the future.

C. ARITHMETIC COMPUTATIONS
There are several challenges that must be handled when dealing with arithmetic computations in the data plane. First, programmable switches support a small set of simple arithmetic computations that operate on non-floating point values. Second, only a few operations are supported per packet to guarantee the execution at line rate. Typically, a packet should only spend tens of nanoseconds in the processing pipeline. Third, computations in the data plane consume significant hardware resources, limiting the ability of other programs to execute concurrently. A wide range of applications suffer from the lack of complex computations in the data plane. For instance, some operations required by AQMs (e.g., the square root function in the CoDel algorithm) are complex to implement in P4. Additionally, the majority of machine learning frameworks and models operate on floating point values while the supported arithmetic operations on the switch operate on integer values. In-network aggregation of model updates requires calculating the average over a set of floating-point vectors.
Current and Future Initiatives: Existing methods to overcome the computation limitations include approximation and pre-computation. In the approximation method, the application designer relies on the small set of supported operations to approximate the desired value, at the cost of sacrificing precision. For example, approximating the square root function can be achieved by counting the number of leading zeros through longest prefix match [99]. It would be beneficial for P4 developers to have access to a community-maintained library which encompasses P4 code that approximates various complex functions. In the pre-computation method, values are computed by the control plane (e.g., switch CPU) and stored in match-action tables or registers. Future work can explore methods that automatically identify the complex computations that can be pre-evaluated in the control plane. After identification, the data plane code and its corresponding control plane APIs can be automatically generated.

D. NETWORK-WIDE COOPERATION
The SDN architecture suggests using a centralized controller for network-wide switch management. Through centralization, the state of each programmable switch can be shared with other switches. Consequently, applications will have the ability to make better decisions as network-wide data is available locally on the switch. The problem with such an architecture is the requirement of having a continuous exchange of packets with a software-based system. As an alternative, switches can exchange messages to synchronize their states in a decentralized manner.
Consider Fig. 26, which shows an in-network DDoS defense solution. Each switch maintains a list of senders and their corresponding numbers of bytes. A switch compares the number of bytes transmitted from a given flow to a threshold. When the threshold is crossed, the flow is blocked and the device is identified as a malicious DDoS sender. Assume that


the network implements a load balancing mechanism that distributes traffic across the switches. In the scenario where switches do not consider the byte counts of other switches (Fig. 26 (a)), the traffic of a DDoS device might remain under the threshold. On the other hand, when switches synchronize their states by sharing the byte counts (Fig. 26 (b)), the total number of bytes is compared against the threshold. Consequently, the total load of a DDoS device is considered. This example demonstrates an application that heavily depends on network-wide cooperation and hence motivates the need for state synchronization.

FIGURE 26. (a) Local detection of DDoS attacks. (b) Network-wide detection of DDoS attacks.

Current and Future Initiatives: Arashloo et al. [361] proposed SNAP, a centralized stateful programming model that aims at solving the synchronization problem. SNAP introduced the idea of writing programs for "one big switch" instead of many. Essentially, developers write stateful applications without caring about the distribution, placement, and optimization of access to resources. SNAP is limited to one replica of each state in the network. Sviridov et al. [362], [363] proposed LODGE and LOADER to extend SNAP and enable multiple replicas. Luo et al. [364] proposed Swing State, a framework for runtime state migration and management. This approach leverages existing traffic to piggyback state updates between cooperating switches. Swing State overcomes the challenges of the SDN-based architecture by synchronizing the states entirely in the data plane, at line rate, and without intervention from the control plane. There are several limitations with this approach. First, there are no message delivery guarantees (i.e., dropped or reordered packets are not retransmitted), leading to inconsistency in the states among the switches. Second, it does not merge the states if two switches share common states. Third, the overhead can significantly increase if a single state is mirrored several times. Finally, there is no authentication of data or senders.
Xing et al. [365] proposed P4Sync, a system that migrates states between switches in the data plane while guaranteeing the authenticity of the senders and the exchanged data. P4Sync addresses the limitations of existing approaches. It guarantees the completeness of the migration, ensuring that the snapshot transfer is completed. Moreover, it solves the overhead of the repeatedly retransmitted updates. An interesting aspect of P4Sync is its ability to control the migration traffic rate depending on the changing network conditions. Zeno et al. [366] presented a design of SwiShmem, a management layer that facilitates the deployment of network functions (NFs) on multiple switches by managing the distributed shared states.
Future work in this area should consider handling frequent state migrations. Some systems require migration packets to be generated each RTT, causing increased traffic overhead and additional expensive authentication operations. For instance, P4Sync uses public key cryptography in the control plane to sign and verify the end of the migration sequence chain (2.15 ms for signing and 0.07 ms for verifying using an RSA-2048 signature). Frequent migrations would cause this signature to be invoked repeatedly. Another major concern that should be handled in future work is denial of service. Even with authentication of migration updates, changes in the packets cause the receiver to reject updates, leading to state inconsistency among switches.

E. CONTROL PLANE INTERVENTION
Delegating tasks to the control plane incurs latency and affects the application's performance. For instance, in congestion control, rerouting-based schemes often use tables to store alternative routes. Since the data plane cannot directly modify table entries, intervention from the control plane is required. The interaction with the control plane in this application hampers the promptness of rerouting. Another example is methods that use collision-free hashing. For example, cuckoo hash [374], which rearranges items to solve collisions, uses a complex search algorithm that cannot run on the switch ASIC, and is often executed on the switch CPU. Ideally, the control plane intervention should be minimized when possible. For example, to synchronize the state among switches, in-network cooperation should be considered.
Current and Future Initiatives: The design of the interaction between the control plane and the data plane is fully decided by the developer. Experienced developers might have enough background to immediately minimize such interaction. Future work should devise algorithms and tools that automatically detect excessive interaction between the control and data planes, and suggest alternative workflows (ideally, as generated code) to minimize such interaction. Operations that could be delegated to the data plane include failure detection and notification and connectivity retrieval [360].
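To illustrate why cuckoo hashing is typically left to the switch CPU, the following Python sketch shows the insertion procedure: placing one key may displace a chain of existing keys whose length is not known in advance, which is hard to reconcile with a pipeline that performs only a small, fixed number of operations per packet. This is a generic textbook-style sketch, not the scheme of [374]; the class name `CuckooTable`, the SHA-256-based index function, and the `max_kicks` bound are assumptions made for illustration.

```python
import hashlib
from typing import List, Optional


class CuckooTable:
    """Two-table cuckoo hash set (illustrative sketch, not a switch implementation)."""

    def __init__(self, slots: int = 64, max_kicks: int = 16) -> None:
        self.slots = slots
        self.max_kicks = max_kicks
        self.tables: List[List[Optional[str]]] = [[None] * slots, [None] * slots]

    def _index(self, which: int, key: str) -> int:
        # One independent hash per table, derived by salting with the table id.
        digest = hashlib.sha256(f"{which}:{key}".encode()).digest()
        return int.from_bytes(digest[:4], "big") % self.slots

    def lookup(self, key: str) -> bool:
        # A lookup touches exactly two slots, which is cheap to do at line rate.
        return any(self.tables[w][self._index(w, key)] == key for w in (0, 1))

    def insert(self, key: str) -> int:
        """Insert key and return the number of displacements that were needed."""
        if self.lookup(key):
            return 0
        current: Optional[str] = key
        which = 0
        for kicks in range(self.max_kicks + 1):
            idx = self._index(which, current)
            # Place the key in hand and pick up whatever occupied the slot.
            current, self.tables[which][idx] = self.tables[which][idx], current
            if current is None:
                return kicks
            which = 1 - which  # the evicted key moves to its slot in the other table
        # The key still in hand is unplaced; a real table would rehash or resize.
        raise RuntimeError(f"displacement chain too long, {current!r} left unplaced")


table = CuckooTable()
flows = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
for flow in flows:
    print(flow, "inserted after", table.insert(flow), "displacement(s)")
print("10.0.0.3 present:", table.lookup("10.0.0.3"))
```

Lookups read a constant two slots, whereas insertions loop until an empty slot is found. That asymmetry is one way to see why lookups fit the data plane while insertions are commonly delegated to the control plane.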


FIGURE 27. Example of using taps in a campus network to compute the round-trip time in the data plane. (1) The traffic is passively collected by the P4 switch; (2) the switch calculates the round-trip time by using its high-precision timer (see [95] for details on how to associate the SEQ/ACKs to compute the RTT); (3) the switch reports the RTT samples to an external server.

F. SECURITY
When designing a system for the data plane, the developer must envision the kind of traffic a malicious user can initiate to corrupt the operation of the system. This class of attacks is referred to as sensitivity attacks, as coined in [216]. Essentially, an attacker can intelligently craft traffic patterns to trigger unexpected behaviors of a system in the data plane. For instance, a load balancer that balances traffic through packet header hashing without cryptographic support (e.g., modulo operator on the number of available paths) can be tricked by an attacker that crafts skewed traffic patterns. This results in traffic being forwarded to a single path, leading to congestion, link saturation, and denial of service. Another example is attacks against in-network caching. Caching in the data plane performs well when requests are mostly reads rather than writes. If an attacker continuously generates highly skewed
write requests, the load on the storage servers would be the existing legacy devices. While this solution seems sim-
imbalanced. If the system is designed to handle write queries plistic at first, studies have shown that partial deployment
on hot items in the switch, a random failure in the switch leads to reduced effectiveness [189]. For instance, the accu-
causes data to be lost. Further, an attacker can also exploit racy of heavy hitter detection schemes is strongly affected
the memory limitation of the switch and request diverse values, by the flow visibility. The work in [189] devised a greedy
causing the pre-cached values to be evicted. algorithm that attempts to strategically position P4 switches
Current and Future Initiatives: To mitigate against sensi- in the network, with the goal of monitoring as many dis-
tivity attacks, a developer attempts to discover various unpre- tinct network flows as possible. The F1 score is used to
dicted traffic patterns, and accordingly, develops defense quantify correctness of switches placement. Other works that
strategies. Such solution is highly unreliable, time consum- focused on incremental deployment include daPIPE [375],
ing, and error-prone. Recent efforts [216] aimed at auto- TraceILP/TopoILP [371]. Future work in this area should
matically discovering sensitivity attacks in the data plane. consider generalizing and enhancing this approach to work
Essentially, the proposed system aims at deriving traffic with any P4 application, and not only heavy hitter detection.
patterns that would drive the program away from common For instance, a future work could suggest the positioning
case behavior as much as possible. Other efforts focused of P4 switches in applications such as in-network caching,
on architecting defenses in the data plane that perform dis- accelerated consensus, and in-network defenses, while tak-
tributed mode changes upon attack discovery [215]. Future ing into account the current topology consisting of legacy
work in this direction should consider achieving high assur- devices.
ance by formally verifying the codes. Additionally, the sta- Amin et al. [376] surveyed the research and development
bility of the data plane should be carefully handled with in the field of hybrid SDN networks. Hybrid SDN comprises
fast mode changes; future work could consider integrating a mix of SDN and legacy network devices. It is worth noting
self-stabilizing systems for such purpose. Finally, future work that the same key concepts and advantages of hybrid SDN
should provide security interfaces for collaborating switches networks can be applied to incremental P4 networks.
that belong to different domains. It is also worth exposing Recent efforts are also considering network taps as a mean
sensitivity attack patterns for different application types so to replicate production network’s traffic to programmable
that data plane developers can avoid the vulnerabilities that switches for analysis [88]. Network TAPs replicate pack-
trigger those attacks in their codes. ets and do not alter timing information and packet orders,
which may occur with other schemes such as port mirror-
G. INTEROPERABILITY ing operating at layer 2 and layer 3 [377]. ConQuest [88]
Programmable switches pave the way for a wide range of taps on the ingress and egress links of a legacy router and
innovative in-network applications. The literature has shown uses a P4 switch to perform advanced fine-grained queue
that significant performance improvements are brought when monitoring techniques. Note that legacy routers only sup-
applications offload their processing logic to the network. port polling the total queue length statistics at a coarse time
Despite such facts, it is very unlikely that mobile operators interval, and hence, cannot monitor microbursts. By tapping
will replace their current infrastructure with programmable on legacy devices and processing on P4 switches, operators
switches in one shot. This unlikelihood comes from the fact can benefit from the capabilities of P4 switches without
that major operational and budgeting costs will incur. the need to fully replace their current infrastructure. This
Current and Future Initiatives: Network operators might method can be used in a variety of in-network applications
deploy programmable switches in an incremental fashion. (e.g., RTT estimation (see Fig. 27), network-wide telemetry,
That is, P4 switches will be added to the network alongside DDoS detection/mitigation, to name a few). Finally, it is

87142 VOLUME 9, 2021


E. F. Kfoury et al.: Exhaustive Survey on P4 Programmable Data Plane Switches

worth mentioning that TAPs are not expensive and a single utility functions and assign their weights. Second, it assumes
P4 switch can service many non-programmable devices. that the programmer is aware of the workload (which is
needed to write the utility function). The authors suggested
H. PROGRAMMING SIMPLICITY that future work could investigate a dynamic system that uses
Writing in-network applications using the P4 language is measurements to change the utility functions. Finally, P4All
not a straightforward task. Recent studies have shown that does not support multivariate and nonlinear functions. All the
many existing P4 programs have several bugs that might lead aforementioned limitations can be explored in the future.
to complete network disruption [232]. Furthermore, since
programmable switches have many restrictions on memory I. DEEP PROGRAMMABILITY
and the availability of resources, developers must take into Disaggregation is enabling network owners and operators to
account the low-level hardware limitations when writing the take control of the software running the network. It is pos-
programs. This process is known to be based on trial and sible to program virtual and PISA-based switches, hardware
error; developers are almost never sure whether their program accelerators, smartNICs, and end-hosts’ networking stacks.
can ‘‘fit’’ into the ASIC, and hence, they repeatedly try to Further, acceleration techniques such as the Express Data
compile and adjust their codes accordingly. Such problem is Path (XDP) and Berkeley Packet Filter (BPF) are being used
exacerbated when the complexity of the in-network applica- to accelerate the packet forwarding in the kernel. Addition-
tion increases, or when multiple functions (e.g., telemetry, ally, acceleration techniques are used to address the perfor-
monitoring, access control. etc.) are to be executed concur- mance issues of Virtual Network Functions (VNFs) running
rently in the same P4 program. Additionally, code modular- on servers [380], [381].
ity is not simple in P4; the programmers typically rewrite The malleability of programming various network com-
existing functions depending on the constraints of the current ponents is shifting the trend towards deep programmability,
context. All the aforementioned facts affect the cost, stability, as coined by McKeown [53], [382]. In deep programmability,
and correctness of the network on the long run. the behavior is described at top and partitioned and executed
For several decades, the networking industry operated across elements. The operators will focus on ‘‘software’’
in a bottom-up approach, where switches are equipped rather than ‘‘protocols’’; for example, functions like rout-
with fixed-function ASICs. Consequently, little to no pro- ing and congestion control will be described in programs.
gramming skills were needed by network operators. With Software engineering principles will be routinely used to
the advent of programmable switches, operators are now check the correctness of the network behavior (from unit test-
expected to have experience in programming the ASIC.2 ing to formal/on-the-fly verification). Fine-grained telemetry
Current and Future Initiatives: Since programming the and measurements will be used to monitor and troubleshoot
ASIC is not a straightforward task, future research endeav- network performance. Stream computations will be accel-
ours should consider simplifying the programming workflow erated by the network (e.g., caching, load balancing, etc.).
for the operators and generating code (e.g., [345]–[352]). Further, networks will run autonomously under verifiable,
For instance, graphical tools can be developed to translate closed-loop control. Finally, McKeown envisioned that net-
workflows (e.g., flowcharts) to P4 programs that can fit into works will be programmed by owners, operators, researchers,
the hardware. etc., while being operated by a lot fewer people than today.
A noteworthy work (P4All [353]) proposed an extension There are many open challenges to realize the vision of
to P4 where operators write elastic programs. Elastic pro- deep programmability. Consider Fig. 28. The control plane
grams are compact programs that stretch to make use of is managing the pipeline of programmable switches, NICs,
the hardware resources. P4All extends P4 to support loops. and virtual switches, which are programmed by P4 through a
The operator supply the P4All program along with the tar- runtime API (e.g., P4Runtime). The challenge is how to write
get specifications (i.e., constraints) to the P4All compiler. a clean code that can be moved around within the hardware
Afterwards, the compiler analyzes the dependencies between pipeline, and can run at line rate.
actions and unrolls the loops. Then, it generate the constraints
for the optimization based on the target specification file.
Next, the compiler solves an optimization problem that maxi-
mizes a linear utility function and generates an output P4 pro-
gram for the target. The authors considered Tofino target in
their evaluations. While P4All offered numerous advantages,
it is still far from being ready to be used in practice. First,
it assumes that programmers are able to write representative
2 Note that most vendors (e.g., Barefoot Networks) provide a program
(switch.p4) that expresses the forwarding plane of a switch, with the typical
features of an advanced layer-2 and layer-3 switch. If the goal is to simply
deploy a switch with no in-network applications, then the operators are not
required to program the chip. They just need to install a network operating FIGURE 28. Network as a programmable platform. Large cloud or ISP
system (NOS) such as SONIC [378] or FBOSS [379]). example [53].
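As a rough illustration of the elastic-program workflow that Section H attributes to P4All (unroll a loop, derive constraints from the target, maximize a linear utility), the following Python sketch sizes a count-min-style sketch. Everything in it is an assumption made for illustration: the function names `unroll` and `fit_elastic_sketch`, the resource numbers, the two constraints, and the use of exhaustive search in place of an optimization solver. It is not the P4All compiler or its input language.

```python
from itertools import product


def unroll(action, iterations):
    """Unroll an elastic loop into one named action per iteration."""
    return [f"{action}_{i}" for i in range(iterations)]


def fit_elastic_sketch(max_stages, total_cells, col_choices, w_rows, w_cols):
    """Pick (rows, cols) maximizing the linear utility w_rows*rows + w_cols*cols.

    Assumed constraints (a hypothetical target specification):
      - each row occupies one pipeline stage, so rows <= max_stages
      - all rows share one memory budget, so rows * cols <= total_cells
    Returns (utility, rows, cols), or None if no candidate fits.
    """
    best = None
    for rows, cols in product(range(1, max_stages + 1), col_choices):
        if rows * cols > total_cells:
            continue  # candidate does not fit the assumed memory budget
        utility = w_rows * rows + w_cols * cols
        if best is None or utility > best[0]:
            best = (utility, rows, cols)
    return best


if __name__ == "__main__":
    widths = [1024, 2048, 4096, 8192, 16384]
    utility, rows, cols = fit_elastic_sketch(4, 16384, widths, w_rows=5000, w_cols=1)
    print(rows, cols, utility)
    # The chosen row count decides how far the elastic loop is unrolled.
    print(unroll("update_row", rows))
```

With a large row weight the search prefers more, narrower rows; with a small row weight it prefers a single wide row. This sensitivity to the chosen weights is the first limitation noted in Section H: the programmer has to supply representative utility functions and weights.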

Current and Future Initiatives: Fig. 29 shows an example of a congestion control application with deep programmability. In loss-based congestion control (e.g., NewReno, CUBIC), packet drops and duplicate ACKs are used to indicate congestion. Such a signal is ideally observed by the kernel of the end-host. In delay-based congestion control (e.g., TCP Vegas, TIMELY), RTT is used as the primary signal for congestion, and thus high-precision timers must be used to get accurate estimations. This is ideally done in the NIC. Other, network-assisted congestion control schemes (e.g., HPCC) rely on the queue occupancy in the switch. Note that such a mechanism modifies the packet headers, and therefore both the NIC and the kernel should be aware of it (hence the green arrows in the figure). To be able to automate the process of partitioning functions into the network, systematic methods and algorithms should be carefully devised. There is immense expertise in userspace and kernel space programming. However, general purpose code cannot be easily ported to the hardware since it might not fit. Hence, there is a need for methods that constrain the programming so that it will work on a feedforward, loop-free pipeline. McKeown discussed a strawman solution [53] to the problem where the whole pipeline (sequence of devices) is expressed in a language (e.g., P4); the pipeline specification includes the serial dependencies between the devices. The externs of P4 will be used to invoke general purpose C/C++ code running on the CPU. Further, P4 will be used to define the forwarding behavior of the code that will be accelerated by the hardware. Future work should explore more sophisticated methods for solving the partitioning problem, while considering the constraints of the hardware and the current networking landscape. Note that tremendous efforts from academia and the industry are being spent on the Pronto project [343], [344], which can be considered as an example of the deep programmable architecture.

FIGURE 29. Deep programmability, congestion control example [53].

J. MODULARITY AND VIRTUALIZATION
Programmable data planes were originally designed to execute a single program at a given time. However, there is no doubt that in today's networks, operators often require multiple network functions to run simultaneously on a single physical switch. A challenge that operators face when changing data plane programs is the connectivity loss and the service downtime/interruption [383].

Cloud providers are now aiming to offer on-switch network functions as services to a diverse set of cloud customers. Such needs introduce various challenges including resource isolation (memory and resources should be dedicated to a specific function), performance isolation (the performance of a network function must not impact other functions), and security isolation (a network function must not read another function's data).

Current and Future Initiatives: P4 programs and functions should become more modular so that programmers can easily integrate multiple services into the hardware pipeline. Research efforts on data plane virtualization have been proposed in the literature. For instance, Hyper4 [384] provides a general purpose P4 program that can be dynamically configured to adopt new behavior. Essentially, P4 programs are translated into table entries and pushed to the general purpose program, enabling hot-pluggability. Hyper4 uses packet recirculation to implement the hot-pluggable parser, and therefore suffers from performance degradation. Many other data plane virtualization systems have been proposed since then (e.g., HyperV [354], P4VBox [355], P4Visor [346], PRIME [356], P4click [357], MTPSA [358], etc.).

Han et al. [359] performed packet latency measurements on HyperVDP and P4Visor (processor isolation is not supported). Their results show that the overall latency is determined by the P4 program that has the highest latency. To remediate this problem, resource disaggregation methods (e.g., dRMT [367]) can be used. Other challenges that could be explored in the future include the performance degradation that results from packet recirculation, lack of flexibility for live reconfiguration, frequent recompilations, loss of state during data plane reconfiguration, etc.

K. PRACTICAL TESTING
Verifying the correctness of novel protocols and applications in real production networks is of utmost importance for engineers and researchers. Due to the ossification of production networks (which cannot run untested systems), engineers typically rely on modeling and mimicking the network behavior at a smaller scale to test their proofs of concept. One way to model the network is through simulations [385]; while simulations offer flexibility in customizing the scenarios, they cannot achieve the performance of real networks since they typically run on CPUs. Another way to model the network is through emulations [386]–[388]. Emulators run the same software as production networks on CPUs and offer flexibility in customization; however, they produce inaccurate measurements at high traffic rates and are bound to the CPU of the machine. Finally, emulating testbeds at a smaller scale might produce results different from those of production networks.

Current and Future Initiatives: TurboNet [372] is a noteworthy approach that leverages the power of programmable switches to emulate production networks at scale while achieving line-rate performance. TurboNet emulates both the data and control planes. Multiple switches can be emulated by slicing a single switch, separating its ports, and dividing the queue resources; this enables TurboNet to scale beyond the number of ports. TurboNet can emulate background traffic, link loss, link delay, etc. Future work in this area could consider further methods that consume fewer resources than TurboNet. Also, future work should avoid interrupting the emulation whenever the network emulation conditions are being changed.

P4Campus [389] is another promising work that demonstrates how researchers can test and evaluate their novel ideas on campus networks. First, P4Campus aims at encouraging researchers to migrate from simulation/emulation to an implementation on hardware switches. Second, it advises replaying campus traffic and running experiments against production data. The authors of P4Campus are working towards supporting multiple targets, program virtualization, and different topologies. Furthermore, they foresee that their testbed will be expanded to other institutions where P4Campus will be adopted. This will pave the way for more collaboration between researchers, especially since the applications of P4Campus (e.g., microburst detection, heavy hitter detection, live traffic anonymization, flow RTT measurement, etc.) are already available to the public [390].

L. HUMAN INVOLVEMENT
The complexity of managing and configuring today's networks is continuously increasing, especially when the networks are large [391]. Applications are demanding enhanced security, high availability, and high performance. Networks today are opaque and require acquiring some "dials" and configuring "knobs". This is typically done without really understanding what is happening in the network; such a process, together with the complexity of network management, inevitably increases the risk of errors (e.g., operator errors). Hence the question, "If we are operating a large network, can we completely remove the human?".

Current and Future Initiatives: Many techniques and architectures have been proposed to answer this question. In the past few years, the research community started exploring the concepts of "Self-driving networks", "Zero-touch networks", and "Knowledge-Defined Networking" [370], [392]–[394]. The networking industry in the upcoming years may be shifting towards the closed-loop control architecture (Fig. 30) [1]. Note that it is not easy to realize the vision of completely automating networks. There are three pieces that need to be addressed to close the loop and make networks more autonomous and intelligent.
• The ability to observe packets, network state, and code in real time (at the nanosecond scale). Observing packets has already started with packet telemetry and measurements (Sections VI, VII-B). It is possible with programmable switches to detect and visualize microbursts; this was not possible in the past. Furthermore, per-packet examination is now possible, giving better visibility into the behavior of the network.
• The ability to generate new control and forwarding behavior on-the-fly to correct errors. Techniques such as header space analysis (HSA) (Section XII) allow building a model of the forwarding behavior of every switch in the network based on the program that describes its behavior and the state that it currently contains. This allows determining and formally proving, for instance, whether two devices can communicate.
• The ability to verify generated code and deploy it quickly. While the first two pieces have already seen some progress, this third piece needs further advancements. It is advised to explore software engineering techniques to generate, optimize, and verify the code.

FIGURE 30. Closed-loop network [1], [53], [382]. The packets sent from forwarding devices (e.g., through INT), the network state, and the code are measured and validated. The feedback is used by the control plane to generate new behaviors (new control code, new forwarding code, new states) and to verify that the operation matches the intentions.

XIV. CONCLUSION
This article presents an exhaustive survey on programmable data planes. The survey describes the evolution of networking by discussing the traditional control plane and the transition to SDN. Afterwards, the survey motivates the need for programming the data plane and delves into the general architecture of a programmable switch (PISA). A brief description of P4, the de-facto language for programming the data plane, is presented. Motivated by the increasing trend in programming the data plane, the survey provides a taxonomy that sheds light on numerous significant works and compares schemes within each category in the taxonomy and with those in legacy approaches. The survey concludes by discussing challenges and considerations as well as various future trends and initiatives. Evidence indicates that the closed nature of today's networks will diminish in the future, and open-source and the deep programmability architecture will dominate.


TABLE 36. Abbreviations used in this article.


REFERENCES
[1] N. McKeown, "How we might get humans out of the way," ONF CONNECT, Tech. Rep., Sep. 2019, vol. 19. [Online]. Available: https://tinyurl.com/y4dnxacz
[2] Number of RFCs Published Per Year, document, RFC Editor, 2020. [Online]. Available: https://www.rfc-editor.org/rfcs-per-year/
[3] B. Trammell and M. Kuehlewind, Report From the IAB Workshop on Stack Evolution in a Middlebox Internet (SEMI), document RFC7663, 2015. [Online]. Available: https://tools.ietf.org/html/rfc7663
[4] G. Papastergiou, G. Fairhurst, D. Ros, A. Brunstrom, K.-J. Grinnemo, P. Hurtig, N. Khademi, M. Tüxen, M. Welzl, D. Damjanovic, and S. Mangiante, "De-ossifying the Internet transport layer: A survey and future perspectives," IEEE Commun. Surveys Tuts., vol. 19, no. 1, pp. 619–639, 1st Quart., 2017.
[5] The Register. (Aug. 2011). VMware, Cisco Stretch Virtual LANs Across the Heavens. [Online]. Available: https://tinyurl.com/y6mxhqzn
[6] M. Mahalingam, D. Dutt, K. Duda, P. Agarwal, L. Kreeger, T. Sridhar, M. Bursell, and C. Wright, Virtual Extensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks Over Layer 3 Networks, document RFC7348, 2014. [Online]. Available: http://www.rfc-editor.org/rfc/rfc7348.txt
[7] M. Casado, M. J. Freedman, J. Pettit, J. Luo, N. McKeown, and S. Shenker, "Ethane: Taking control of the enterprise," ACM SIGCOMM Comput. Commun. Rev., vol. 37, no. 4, pp. 1–12, 2007.
[8] D. Kreutz, F. M. V. Ramos, P. E. Verissimo, C. E. Rothenberg, S. Azodolmolky, and S. Uhlig, "Software-defined networking: A comprehensive survey," Proc. IEEE, vol. 103, no. 1, pp. 14–76, Jan. 2015.
[9] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, and G. Varghese, "P4: Programming protocol-independent packet processors," ACM SIGCOMM Comput. Commun. Rev., vol. 44, no. 3, pp. 87–95, 2014.
[10] Barefoot Networks. Use Cases. Accessed: Jun. 1, 2021. [Online]. Available: https://www.barefootnetworks.com/use-cases/
[11] A. Weissberger. Comcast: ONF Trellis Software is in Production Together With L2/L3 White Box Switches. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/y69jc7sv
[12] N. Akiyama and M. Nishiki. P4 and Stratum Use Case for New Edge Cloud. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/yxuoo9qv
[13] Stordis GmbH. New STORDIS Advanced Programmable Switches (APS) First to Unlock the Full Potential of P4 and Next Generation Software Defined Networking (NG-SDN). Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/y3kjnypl
[14] Open Networking Foundation. Stratum - ONF Launches Major New Open Source SDN Switching Platform With Support From Google. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/yy3ykw7g
[15] Open Networking Foundation (ONF). Onward and Upward: P4.org Joins ONF and LF. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/53upv6wf
[16] Facebook Engineering. Disaggregate: Networking Recap. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/yxoaj7kw
[17] Open Compute Project. Alibaba DC Network Evolution With Open SONiC and Programmable HW. Accessed: Jun. 1, 2021. [Online]. Available: https://www.opencompute.org/files/OCP2018.alibaba.pdf
[18] S. Heule. Using P4 and P4 Runtime for Optimal L3 Routing. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/y365gnqy
[19] N. McKeown. SDN Phase 3: Getting the Humans Out of the Way ONF Connect 19. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/tp9bxw4
[20] Edgecore. (2020). Wedge 100BF-32X, 100GbE Data Center Switch. [Online]. Available: https://tinyurl.com/sy2jkqe
[21] STORDIS. The New Advanced Programmable Switches are Available. Accessed: Jun. 1, 2021. [Online]. Available: https://www.stordis.com/products/
[22] Cisco. Cisco Nexus 34180YC and 3464C Programmable Switches Data Sheet. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/y92cbdxe
[23] Arista. Arista 7170 Series. Accessed: Jun. 1, 2021. [Online]. Available: https://www.arista.com/en/products/7170-series
[24] Juniper Networks. Juniper Advancing Disaggregation Through P4 Runtime Integration. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/yygz547t
[25] Interface Masters. Tahoe 2624. Accessed: Jun. 1, 2021. [Online]. Available: https://interfacemasters.com/products/switches/10g-40g/tahoe-2624/
[26] Barefoot Networks. Tofino ASIC. Accessed: Jun. 1, 2021. [Online]. Available: https://www.barefootnetworks.com/products/brief-tofino/
[27] Xilinx. Xilinx Solutions. Accessed: Jun. 1, 2021. [Online]. Available: https://www.xilinx.com/products/silicon-devices.html
[28] Pensando. The Pensando Distributed Services Platform. Accessed: Jun. 1, 2021. [Online]. Available: https://pensando.io/our-platform/
[29] Mellanox. Empowering the Next Generation of Secure Cloud SmartNICs. Accessed: Jun. 1, 2021. [Online]. Available: https://www.mellanox.com/products/smartnic
[30] Innovium. Teralynx Switch Silicon. Accessed: Jun. 1, 2021. [Online]. Available: https://www.innovium.com/teralynx/
[31] I. Baldin, J. Griffioen, K. Wang, I. Monga, and A. Nikolich. Mid-Scale RI-1 (M1:IP): FABRIC: Adaptive Programmable Research Infrastructure for Computer Science and Science Applications. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/y463v9z9
[32] FABRIC. About FABRIC. Accessed: Jun. 1, 2021. [Online]. Available: https://fabric-testbed.net/about/overview
[33] J. Mambretti, J. Chen, F. Yeh, and S. Y. Yu, "International P4 networking testbed," in Proc. SC Netw. Res. Exhib., 2019, pp. 1–2.
[34] 2STiC. A National Programmable Infrastructure to Experiment With Next-Generation Networks. Accessed: Jun. 1, 2021. [Online]. Available: https://www.2stic.nl/national-programmable-infrastructure.html
[35] H. Stubbe, "P4 compiler & interpreter: A survey," Future Internet Innov. Internet Technol. Mobile Commun., vol. 47, pp. 1–6, May 2017.
[36] T. Dargahi, A. Caponi, M. Ambrosin, G. Bianchi, and M. Conti, "A survey on the security of stateful SDN data planes," IEEE Commun. Surveys Tuts., vol. 19, no. 3, pp. 1701–1725, 3rd Quart., 2017.
[37] W. L. da Costa Cordeiro, J. A. Marques, and L. P. Gaspary, "Data plane programmability beyond OpenFlow: Opportunities and challenges for network and service operations and management," J. Netw. Syst. Manage., vol. 25, no. 4, pp. 784–818, Oct. 2017.
[38] A. Satapathy. (2018). Comprehensive Study of P4 Programming Language and Software-Defined Networks. [Online]. Available: https://tinyurl.com/y4d4zma9
[39] R. Bifulco and G. Rétvári, "A survey on the programmable data plane: Abstractions, architectures, and open problems," in Proc. IEEE 19th Int. Conf. High Perform. Switching Routing (HPSR), Jun. 2018, pp. 1–7.
[40] E. Kaljic, A. Maric, P. Njemcevic, and M. Hadzialic, "A survey on data plane flexibility and programmability in software-defined networking," IEEE Access, vol. 7, pp. 47804–47840, 2019.
[41] P. G. Kannan and M. C. Chan, "On programmable networking evolution," CSI Trans. ICT, vol. 8, no. 1, pp. 69–76, Mar. 2020.
[42] L. Tan, W. Su, W. Zhang, J. Lv, Z. Zhang, J. Miao, X. Liu, and N. Li, "In-band network telemetry: A survey," Comput. Netw., vol. 186, Feb. 2021, Art. no. 107763.
[43] X. Zhang, L. Cui, K. Wei, F. P. Tso, Y. Ji, and W. Jia, "A survey on stateful data plane in software defined networks," Comput. Netw., vol. 184, Jan. 2021, Art. no. 107597.
[44] G. Bianchi, M. Bonola, A. Capone, and C. Cascone, "OpenState: Programming platform-independent stateful openflow applications inside the switch," ACM SIGCOMM Comput. Commun. Rev., vol. 44, no. 2, pp. 44–51, Apr. 2014.


[45] M. Moshref, A. Bhargava, A. Gupta, M. Yu, and R. Govindan, ‘‘Flow- [69] A. Feldmann, B. Chandrasekaran, S. Fathalli, and E. N. Weyulu, ‘‘P4-
level state transition as a new switch primitive for SDN,’’ in Proc. 3rd enabled network-assisted congestion feedback: A case for NACKs,’’ in
Workshop Hot Topics Softw. Defined Netw., 2014, pp. 61–66. Proc. Workshop Buffer Sizing, 2019, pp. 1–7.
[46] P4 Language Consortium. P4Runtime. Accessed: Jun. 1, 2021. [Online]. [70] Y. Li, R. Miao, H. H. Liu, Y. Zhuang, F. Feng, L. Tang, Z. Cao, M. Zhang,
Available: https://github.com/p4lang/PI/ F. Kelly, M. Alizadeh, and M. Yu, ‘‘HPCC: High precision congestion
[47] Y. Rekhter, T. Li, and S. Hares, A Border Gateway Protocol 4 control,’’ in Proc. ACM Special Interest Group Data Commun., 2019,
(BGP-4), document RFC4271, 2006. [Online]. Available: http://www.rfc- pp. 44–58.
editor.org/rfc/rfc4271.txt. [71] E. F. Kfoury, J. Crichigno, E. Bou-Harb, D. Khoury, and G. Srivastava,
[48] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, ‘‘Enabling TCP pacing using programmable data plane switches,’’ in
J. Rexford, S. Shenker, and J. Turner, ‘‘OpenFlow: Enabling innovation Proc. 42nd Int. Conf. Telecommun. Signal Process. (TSP), Jul. 2019,
in campus networks,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 38, pp. 273–277.
no. 2, pp. 69–74, Mar. 2008. [72] S. Shahzad, E.-S. Jung, J. Chung, and R. Kettimuthu, ‘‘Enhanced explicit
[49] N. McKeown. Why Does the Internet Need a Programmable congestion notification (EECN) in TCP with P4 programming,’’ in Proc.
Forwarding Plane. Accessed: Jun. 1, 2021. [Online]. Available: Int. Conf. Green Hum. Inf. Technol. (ICGHIT), Feb. 2020, pp. 35–40.
https://tinyurl.com/y6x7qqpm [73] B. Turkovic, F. Kuipers, N. van Adrichem, and K. Langendoen, ‘‘Fast net-
[50] C. Kim. (2019). Evolution of Networking, Networking Field Day 21, 2:01. work congestion detection and avoidance using P4,’’ in Proc. Workshop
[Online]. Available: https://tinyurl.com/y9fkj7qx Netw. Emerg. Appl. Technol., 2018, pp. 45–51.
[51] A. Shapiro. (Apr. 2020). P4-Programming Data Plane Use- [74] B. Turkovic and F. Kuipers, ‘‘P4air: Increasing fairness among competing
Cases P4 Expert Roundtable Series. [Online]. Available: congestion control algorithms,’’ in Proc. IEEE 28th Int. Conf. Netw.
https://tinyurl.com/y5n4k83h Protocols (ICNP), Oct. 2020, pp. 1–12.
[52] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Izzard, [75] M. Apostolaki, L. Vanbever, and M. Ghobadi, ‘‘FAB: Toward flow-aware
F. Mujica, and M. Horowitz, ‘‘Forwarding metamorphosis: Fast pro- buffer sharing on programmable switches,’’ in Proc. Workshop Buffer
grammable match-action processing in hardware for SDN,’’ ACM SIG- Sizing, Dec. 2019, pp. 1–6.
COMM Comput. Commun. Rev., vol. 43, no. 4, pp. 99–110, 2013. [76] J. Geng, J. Yan, and Y. Zhang, ‘‘P4QCN: Congestion control using P4-
[53] N. McKeown. Creating an End-to-End Programming Model for Packet capable device in data center networks,’’ Electronics, vol. 8, no. 3, p. 280,
Forwarding. Accessed: Jun. 1, 2021. [Online]. Available: https://www. Mar. 2019.
youtube.com/watch?v=fiBuao6YZl0&t=4216s [77] Y. Li, R. Miao, C. Kim, and M. Yu, ‘‘FlowRadar: A better NetFlow for
[54] Z. Liu, J. Bi, Y. Zhou, Y. Wang, and Y. Lin, ‘‘NetVision: Towards network data centers,’’ in Proc. 13th USENIX Symp. Netw. Syst. Design Implement.
telemetry as a service,’’ in Proc. IEEE 26th Int. Conf. Netw. Protocols (NSDI), 2016, pp. 311–324.
(ICNP), Sep. 2018, pp. 247–248. [78] Z. Liu, A. Manousis, G. Vorsanger, V. Sekar, and V. Braverman, ‘‘One
[55] J. Hyun, N. Van Tu, and J. W.-K. Hong, ‘‘Towards knowledge-defined sketch to rule them all: Rethinking network flow monitoring with Univ-
networking using in-band network telemetry,’’ in Proc. IEEE/IFIP Netw. Mon,’’ in Proc. ACM SIGCOMM Conf., Aug. 2016, pp. 101–114.
Oper. Manage. Symp. (NOMS), Apr. 2018, pp. 1–7. [79] S. Narayana, A. Sivaraman, V. Nathan, P. Goyal, V. Arun, M. Alizadeh,
[56] Y. Kim, D. Suh, and S. Pack, ‘‘Selective in-band network telemetry for V. Jeyakumar, and C. Kim, ‘‘Language-directed hardware design for
overhead reduction,’’ in Proc. IEEE 7th Int. Conf. Cloud Netw. (Cloud- network performance monitoring,’’ in Proc. Conf. ACM Special Interest
Net), Oct. 2018, pp. 1–3. Group Data Commun., Aug. 2017, pp. 85–98.
[57] T. Pan, E. Song, Z. Bian, X. Lin, X. Peng, J. Zhang, T. Huang, B. Liu, and [80] M. Ghasemi, T. Benson, and J. Rexford, ‘‘Dapper: Data plane perfor-
Y. Liu, ‘‘INT-path: Towards optimal path planning for in-band network- mance diagnosis of TCP,’’ in Proc. Symp. SDN Res., Apr. 2017, pp. 61–74.
wide telemetry,’’ in Proc. IEEE Conf. Comput. Commun. (INFOCOM), [81] T. Yang, J. Jiang, P. Liu, Q. Huang, J. Gong, Y. Zhou, R. Miao, X. Li,
Apr. 2019, pp. 487–495. and S. Uhlig, ‘‘Elastic sketch: Adaptive and fast network-wide measure-
[58] J. A. Marques, M. C. Luizelli, R. I. T. da Costa Filho, and L. P. Gaspary, ments,’’ in Proc. Conf. ACM Special Interest Group Data Commun.,
‘‘An optimization-based approach for efficient network monitoring using Aug. 2018, pp. 561–575.
in-band network telemetry,’’ J. Internet Services Appl., vol. 10, no. 1, [82] N. Yaseen, J. Sonchack, and V. Liu, ‘‘Synchronized network snapshots,’’
p. 12, Dec. 2019. in Proc. Conf. ACM Special Interest Group Data Commun., Aug. 2018,
[59] B. Niu, J. Kong, S. Tang, Y. Li, and Z. Zhu, ‘‘Visualize your IP- pp. 402–416.
over-optical network in realtime: A P4-based flexible multilayer in- [83] R. Joshi, T. Qu, M. C. Chan, B. Leong, and B. T. Loo, ‘‘BurstRadar:
band network telemetry (ML-INT) system,’’ IEEE Access, vol. 7, Practical real-time microburst monitoring for datacenter networks,’’ in
pp. 82413–82423, 2019. Proc. 9th Asia–Pacific Workshop Syst., Aug. 2018, pp. 1–8.
[60] A. Karaagac, E. De Poorter, and J. Hoebeke, ‘‘In-band network telemetry [84] M. Lee and J. Rexford. (2018). Detecting Violations of Service-
in industrial wireless sensor networks,’’ IEEE Trans. Netw. Service Man- Level Agreements in Programmable Switches. [Online]. Available:
age., vol. 17, no. 1, pp. 517–531, Mar. 2020. https://p4campus.cs.princeton.edu/pubs/mackl_thesis_paper.pdf
[61] R. B. Basat, S. Ramanathan, Y. Li, G. Antichi, M. Yu, and [85] J. Sonchack, O. Michel, A. J. Aviv, E. Keller, and J. M. Smith, ‘‘Scaling
M. Mitzenmacher, ‘‘PINT: Probabilistic in-band network telemetry,’’ in hardware accelerated network monitoring to concurrent and dynamic
Proc. Annu. Conf. ACM Special Interest Group Data Commun. Appl., queries with flow,’’ in Proc. USENIX Annu. Tech. Conf. (USENIX ATC),
Technol., Archit., Protocols Comput. Commun., Jul. 2020, pp. 662–680. 2018, pp. 823–835.
[62] Y. Lin, Y. Zhou, Z. Liu, K. Liu, Y. Wang, M. Xu, J. Bi, Y. Liu, and [86] J. Sonchack, A. J. Aviv, E. Keller, and J. M. Smith, ‘‘Turboflow: Informa-
J. Wu, ‘‘NetView: Towards on-demand network-wide telemetry in the tion rich flow record generation on commodity switches,’’ in Proc. 13th
data center,’’ Comput. Netw., vol. 180, Oct. 2020, Art. no. 107386. EuroSys Conf., Apr. 2018, pp. 1–16.
[63] N. Van Tu, J. Hyun, and J. W.-K. Hong, ‘‘Towards ONOS-based SDN [87] A. Gupta, R. Harrison, M. Canini, N. Feamster, J. Rexford, and
monitoring using in-band network telemetry,’’ in Proc. 19th Asia–Pacific W. Willinger, ‘‘Sonata: Query-driven streaming network telemetry,’’ in
Netw. Oper. Manage. Symp. (APNOMS), Sep. 2017, pp. 76–81. Proc. Conf. ACM Special Interest Group Data Commun., Aug. 2018,
[64] Serkant. Prometheus INT Exporter. Accessed: Jun. 1, 2021. [Online]. pp. 357–371.
Available: https://github.com/serkantul/prometheus_int_exporter/ [88] X. Chen, S. L. Feibish, Y. Koral, J. Rexford, O. Rottenstreich,
[65] N. Van Tu, J. Hyun, G. Y. Kim, J.-H. Yoo, and J. W.-K. Hong, ‘‘INTCol- S. A. Monetti, and T.-Y. Wang, ‘‘Fine-grained queue measurement in
lector: A high-performance collector for in-band network telemetry,’’ in the data plane,’’ in Proc. 15th Int. Conf. Emerg. Netw. Exp. Technol.,
Proc. 14th Int. Conf. Netw. Service Manage. (CNSM), 2018, pp. 10–18. Dec. 2019, pp. 15–29.
[66] Barefoot Networks. Barefoot Deep Insight—Product Brief. [89] Z. Liu, S. Zhou, O. Rottenstreich, V. Braverman, and J. Rexford,
Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/u2ncvry ‘‘Memory-efficient performance monitoring on programmable switches
[67] Broadcom. BroadView Analytics, Trident 3 In-Band Telemetry. with lean algorithms,’’ in Proc. Symp. Algorithmic Princ. Comput. Syst.
Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/yxr2qydb (APoCS), 2020, pp. 31–44.
[68] M. Handley, C. Raiciu, A. Agache, A. Voinescu, A. W. Moore, G. Antichi, [90] T. Holterbach, E. C. Molero, M. Apostolaki, A. Dainotti, S. Vissicchio,
and M. Wójcik, ‘‘Re-architecting datacenter networks and stacks for low and L. Vanbever, ‘‘Blink: Fast connectivity recovery entirely in the data
latency and high performance,’’ in Proc. Conf. ACM Special Interest plane,’’ in Proc. 16th USENIX Symp. Netw. Syst. Design Implement.
Group Data Commun., Aug. 2017, pp. 29–42. (NSDI), 2019, pp. 161–176.


[91] D. Ding, M. Savi, and D. Siracusa, ‘‘Estimating logarithmic and exponen- [113] K. Tokmakov, M. Sarker, J. Domaschka, and S. Wesner, ‘‘A case
tial functions to track network traffic entropy in P4,’’ in Proc. IEEE/IFIP for data centre traffic management on software programmable Ether-
Netw. Oper. Manage. Symp. (NOMS), Apr. 2020, pp. 1–9. net switches,’’ in Proc. IEEE 8th Int. Conf. Cloud Netw. (CloudNet),
[92] W. Wang, P. Tammana, A. Chen, and T. S. E. Ng, ‘‘Grasp the root causes in Nov. 2019, pp. 1–6.
the data plane: Diagnosing latency problems with SpiderMon,’’ in Proc. [114] S. S. W. Lee and K.-Y. Chan, ‘‘A traffic meter based on a multicolor
Symp. SDN Res., Mar. 2020, pp. 55–61. marker for bandwidth guarantee and priority differentiation in SDN
[93] R. Teixeira, R. Harrison, A. Gupta, and J. Rexford, ‘‘PacketScope: Mon- virtual networks,’’ IEEE Trans. Netw. Service Manage., vol. 16, no. 3,
itoring the packet lifecycle inside a switch,’’ in Proc. Symp. SDN Res., pp. 1046–1058, Sep. 2019.
Mar. 2020, pp. 76–82. [115] M. Shahbaz, L. Suresh, J. Rexford, N. Feamster, O. Rottenstreich, and
[94] J. Bai, M. Zhang, G. Li, C. Liu, M. Xu, and H. Hu, ‘‘FastFE: Acceler- M. Hira, ‘‘Elmo: Source routed multicast for public clouds,’’ in Proc.
ating ML-based traffic analysis with programmable switches,’’ in Proc. ACM Special Interest Group Data Commun., 2019, pp. 458–471.
Workshop Secure Program. Netw. Infrastruct. (SPIN). New York, NY, [116] M. Kadosh, Y. Piasetzky, B. Gafni, L. Suresh, M. Shahbaz, and
USA: Association for Computing Machinery, 2020, pp. 1–7. S. Banerjee. (Apr. 2020). Realizing Source Routed Multicast Using Mel-
[95] X. Chen, H. Kim, J. M. Aman, W. Chang, M. Lee, and J. Rexford, lanox’s Programmable Hardware Switches, P4 Expert Roundtable Series.
‘‘Measuring TCP round-trip time in the data plane,’’ in Proc. Workshop [Online]. Available: https://tinyurl.com/y8dfcsum
Secure Program. Netw. Infrastruct., Aug. 2020, pp. 35–41. [117] W. Braun, J. Hartmann, and M. Menth, ‘‘Demo: Scalable and reliable
[96] Y. Qiu, K.-F. Hsu, J. Xing, and A. Chen, ‘‘A feasibility study on time- software-defined multicast with BIER and P4,’’ in Proc. IFIP/IEEE Symp.
aware monitoring with commodity switches,’’ in Proc. Workshop Secure Integr. Netw. Service Manage. (IM), May 2017, pp. 905–906.
Program. Netw. Infrastruct., Aug. 2020, pp. 22–27. [118] N. Katta, M. Hira, C. Kim, A. Sivaraman, and J. Rexford, ‘‘HULA:
[97] Q. Huang, H. Sun, P. P. C. Lee, W. Bai, F. Zhu, and Y. Bao, ‘‘Omni- Scalable load balancing using programmable data planes,’’ in Proc. Symp.
Mon: Re-architecting network telemetry with resource efficiency and full SDN Res., Mar. 2016, pp. 1–12.
accuracy,’’ in Proc. Annu. Conf. ACM Special Interest Group Data Com- [119] C. H. Benet, A. J. Kassler, T. Benson, and G. Pongracz, ‘‘MP-
mun. Appl., Technol., Archit., Protocols Comput. Commun., Jul. 2020, HULA: Multipath transport aware load balancing using programmable
pp. 404–421. data planes,’’ in Proc. Morning Workshop Netw. Comput., Aug. 2018,
[98] X. Chen, S. Landau-Feibish, M. Braverman, and J. Rexford, ‘‘BeauCoup: pp. 7–13.
Answering many network traffic queries, one memory update at a time,’’ [120] R. Miao, H. Zeng, C. Kim, J. Lee, and M. Yu, ‘‘SilkRoad: Making stateful
in Proc. Annu. Conf. ACM Special Interest Group Data Commun. Appl., layer-4 load balancing fast and cheap using switching ASICs,’’ in Proc.
Technol., Archit., Protocols Comput. Commun., Jul. 2020, pp. 226–239. Conf. ACM Special Interest Group Data Commun., Aug. 2017, pp. 15–28.
[99] R. Kundel, J. Blendin, T. Viernickel, B. Koldehofe, and R. Steinmetz, [121] Z. Liu, Z. Bai, Z. Liu, X. Li, C. Kim, V. Braverman, X. Jin, and I. Stoica,
‘‘P4-CoDel: Active queue management in programmable data planes,’’ ‘‘DistCache: Provable load balancing for large-scale storage systems with
in Proc. IEEE Conf. Netw. Function Virtualization Softw. Defined Netw. distributed caching,’’ in Proc. 17th USENIX Conf. File Storage Technol.
(NFV-SDN), Nov. 2018, pp. 1–4. (FAST), 2019, pp. 143–157.
[100] F. Schwarzkopf, S. Veith, and M. Menth, ‘‘Performance analysis of CoDel
and PIE for saturated TCP sources,’’ in Proc. 28th Int. Teletraffic Congr.
(ITC), vol. 1, Sep. 2016, pp. 175–183.
[101] N. K. Sharma, M. Liu, K. Atreya, and A. Krishnamurthy, ‘‘Approximating
fair queueing on reconfigurable switches,’’ in Proc. 15th USENIX Symp.
Netw. Syst. Design Implement. (NSDI), 2018, pp. 1–16.
[102] S. Laki, P. Vörös, and F. Fejes, ‘‘Towards an AQM evaluation testbed with
P4 and DPDK,’’ in Proc. ACM SIGCOMM Conf. Posters Demos, 2019,
pp. 148–150.
[103] C. Papagianni and K. De Schepper, ‘‘PI2 for P4: An active queue
management scheme for programmable data planes,’’ in Proc. 15th Int. Conf.
Emerg. Netw. Exp. Technol., Dec. 2019, pp. 84–86.
[122] K.-F. Hsu, P. Tammana, R. Beckett, A. Chen, J. Rexford, and D. Walker,
‘‘Adaptive weighted traffic splitting in programmable data planes,’’ in
Proc. Symp. SDN Res., Mar. 2020, pp. 103–109.
[123] K.-F. Hsu, R. Beckett, A. Chen, J. Rexford, and D. Walker,
‘‘Contra: A programmable system for performance-aware routing,’’ in Proc.
17th USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2020,
pp. 701–721.
[124] V. Olteanu, A. Agache, A. Voinescu, and C. Raiciu, ‘‘Stateless datacenter
load-balancing with beamer,’’ in Proc. 15th USENIX Symp. Netw. Syst.
Design Implement. (NSDI), 2018, pp. 125–139.
[125] B. Pit-Claudel, Y. Desmouceaux, P. Pfister, M. Townsley, and T. Clausen,
‘‘Stateless load-aware load balancing in P4,’’ in Proc. IEEE 26th Int. Conf.
Netw. Protocols (ICNP), Sep. 2018, pp. 418–423.
[104] I. Kunze, M. Gunz, D. Saam, K. Wehrle, and J. Rüth, ‘‘Tofino + [126] J.-L. Ye, C. Chen, and Y. H. Chu, ‘‘A weighted ECMP load balancing
P4: A strong compound for AQM on high-speed networks?’’ in Proc. scheme for data centers using P4 switches,’’ in Proc. IEEE 7th Int. Conf.
IFIP/IEEE IM, May 2021, pp. 1–9. Cloud Netw. (CloudNet), Oct. 2018, pp. 1–4.
[105] L. Toresson, ‘‘Making a packet-value based AQM on a programmable [127] X. Jin, X. Li, H. Zhang, R. Soulé, J. Lee, N. Foster, C. Kim, and I. Stoica,
switch for resource-sharing and low latency,’’ M.S. thesis, Dept. Comput. ‘‘NetCache: Balancing key-value stores with fast in-network caching,’’ in
Sci., Fac. Health, Sci. Technol., Karlstads Univ., Karlstad, Sweden, 2021. Proc. 26th Symp. Oper. Syst. Princ., Oct. 2017, pp. 121–136.
[106] A. Mushtaq, R. Mittal, J. McCauley, M. Alizadeh, S. Ratnasamy, and [128] M. Liu, L. Luo, J. Nelson, L. Ceze, A. Krishnamurthy, and K. Atreya,
S. Shenker, ‘‘Datacenter congestion control: Identifying what is essential ‘‘IncBricks: Toward in-network computation with an in-network cache,’’
and making it practical,’’ ACM SIGCOMM Comput. Commun. Rev., in Proc. 22nd Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2017,
vol. 49, no. 3, pp. 32–38, Nov. 2019. pp. 795–809.
[107] M. Menth, H. Mostafaei, D. Merling, and M. Häberle, ‘‘Implementation [129] E. Cidon, S. Choi, S. Katti, and N. McKeown, ‘‘AppSwitch: Application-
and evaluation of activity-based congestion management using P4 (P4- layer load balancing within a software switch,’’ in Proc. 1st Asia–Pacific
ABC),’’ Future Internet, vol. 11, no. 7, p. 159, Jul. 2019. Workshop Netw., 2017, pp. 64–70.
[108] A. G. Alcoz, A. Dietmüller, and L. Vanbever, ‘‘SP-PIFO: Approximating [130] Q. Wang, Y. Lu, E. Xu, J. Li, Y. Chen, and J. Shu, ‘‘Concordia: Distributed
push-in first-out behaviors using strict-priority queues,’’ in Proc. 17th shared memory with in-network cache coherence,’’ in Proc. 19th USENIX
USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2020, pp. 59–76. Conf. File Storage Technol. (FAST), 2021, pp. 277–292.
[109] K. Kumazoe and M. Tsuru, ‘‘P4-based implementation and evaluation of [131] J. Li, J. Nelson, E. Michael, X. Jin, and D. R. Ports, ‘‘Pegasus: Tolerat-
adaptive early packet discarding scheme,’’ in Proc. Int. Conf. Intell. Netw. ing skewed workloads in distributed storage with in-network coherence
Collaborative Syst. Cham, Switzerland: Springer, 2020, pp. 460–469. directories,’’ in Proc. 14th USENIX Symp. Oper. Syst. Design Implement.
[110] D. Bhat, J. Anderson, P. Ruth, M. Zink, and K. Keahey, ‘‘Application- (OSDI), 2020, pp. 387–406.
based QoE support with P4 and OpenFlow,’’ in Proc. IEEE Conf. Comput. [132] S. Signorello, R. State, J. François, and O. Festor, ‘‘NDN.P4: Pro-
Commun. Workshops (INFOCOM WKSHPS), Apr. 2019, pp. 817–823. gramming information-centric data-planes,’’ in Proc. IEEE NetSoft Conf.
[111] Y.-W. Chen, L.-H. Yen, W.-C. Wang, C.-A. Chuang, Y.-S. Liu, and Workshops (NetSoft), Jun. 2016, pp. 384–389.
C.-C. Tseng, ‘‘P4-enabled bandwidth management,’’ in Proc. 20th [133] G. Grigoryan and Y. Liu, ‘‘PFCA: A programmable FIB caching
Asia–Pacific Netw. Oper. Manage. Symp. (APNOMS), Sep. 2019, architecture,’’ in Proc. Symp. Archit. Netw. Commun. Syst., Jul. 2018,
pp. 1–5. pp. 97–103.
[112] C. Chen, H.-C. Fang, and M. S. Iqbal, ‘‘QoSTCP: Provide consistent rate [134] C. Zhang, J. Bi, Y. Zhou, K. Zhang, and Z. Ma, ‘‘B-cache: A behavior-
guarantees to TCP flows in software defined networks,’’ in Proc. IEEE level caching framework for the programmable data plane,’’ in Proc. IEEE
Int. Conf. Commun. (ICC), Jun. 2020, pp. 1–6. Symp. Comput. Commun. (ISCC), Jun. 2018, pp. 84–90.


[135] J. Vestin, A. Kassler, and J. Åkerberg, ‘‘FastReact: In-network control and [157] X. Jin, X. Li, H. Zhang, N. Foster, J. Lee, R. Soulé, C. Kim, and I. Stoica,
caching for industrial control networks using programmable data planes,’’ ‘‘NetChain: Scale-free sub-RTT coordination,’’ in Proc. 15th USENIX
in Proc. IEEE 23rd Int. Conf. Emerg. Technol. Factory Automat. (ETFA), Symp. Netw. Syst. Design Implement. (NSDI), 2018, pp. 35–49.
Sep. 2018, pp. 219–226. [158] H. T. Dang, P. Bressana, H. Wang, K. S. Lee, N. Zilberman,
[136] J. Woodruff, M. Ramanujam, and N. Zilberman, ‘‘P4DNS: In-network H. Weatherspoon, M. Canini, F. Pedone, and R. Soulé, ‘‘Partitioned Paxos
DNS,’’ in Proc. ACM/IEEE Symp. Archit. Netw. Commun. Syst. (ANCS), via the network data plane,’’ 2019, arXiv:1901.08806. [Online]. Avail-
Sep. 2019, pp. 1–6. able: http://arxiv.org/abs/1901.08806
[137] R. Ricart-Sanchez, P. Malagon, P. Salva-Garcia, E. C. Perez, Q. Wang, [159] E. Sakic, N. Deric, E. Goshi, and W. Kellerer, ‘‘P4BFT: Hardware-
and J. M. A. Calero, ‘‘Towards an FPGA-accelerated programmable data accelerated Byzantine-resilient network control plane,’’ 2019,
path for edge-to-core communications in 5G networks,’’ J. Netw. Comput. arXiv:1905.04064. [Online]. Available: http://arxiv.org/abs/1905.04064
Appl., vol. 124, pp. 80–93, Dec. 2018. [160] H. T. Dang, P. Bressana, H. Wang, K. S. Lee, N. Zilberman,
[138] R. Ricart-Sanchez, P. Malagon, J. M. Alcaraz-Calero, and Q. Wang, H. Weatherspoon, M. Canini, F. Pedone, and R. Soulé, ‘‘P4xos: Con-
‘‘Hardware-accelerated firewall for 5G mobile networks,’’ in Proc. IEEE sensus as a network service,’’ IEEE/ACM Trans. Netw., vol. 28, no. 4,
26th Int. Conf. Netw. Protocols (ICNP), Sep. 2018, pp. 446–447. pp. 1726–1738, Aug. 2020.
[139] R. Shah, V. Kumar, M. Vutukuru, and P. Kulkarni, ‘‘TurboEPC: Leverag- [161] A. Sapio, I. Abdelaziz, A. Aldilaijan, M. Canini, and P. Kalnis, ‘‘In-
ing dataplane programmability to accelerate the mobile packet core,’’ in network computation is a dumb idea whose time has come,’’ in Proc. 16th
Proc. Symp. SDN Res., Mar. 2020, pp. 83–95. ACM Workshop Hot Topics Netw., 2017, pp. 150–156.
[140] S. K. Singh, C. E. Rothenberg, G. Patra, and G. Pongracz, ‘‘Offloading [162] F. Yang, Z. Wang, X. Ma, G. Yuan, and X. An, ‘‘SwitchAgg: A further step
virtual evolved packet gateway user plane functions to a programmable towards in-network computation,’’ 2019, arXiv:1904.04024. [Online].
ASIC,’’ in Proc. 1st ACM CoNEXT Workshop Emerg. Netw. Comput. Available: http://arxiv.org/abs/1904.04024
Paradigms (ENCP), 2019, pp. 9–14. [163] A. Sapio, M. Canini, C.-Y. Ho, J. Nelson, P. Kalnis, C. Kim,
[141] P. Vörös, G. Pongrácz, and S. Laki, ‘‘Towards a hybrid next generation A. Krishnamurthy, M. Moshref, D. R. K. Ports, and P. Richtárik, ‘‘Scal-
NodeB,’’ in Proc. 3rd P4 Workshop Eur., Dec. 2020, pp. 56–58. ing distributed machine learning with in-network aggregation,’’ 2019,
[142] P. Palagummi and K. M. Sivalingam, ‘‘SMARTHO: A network initiated arXiv:1903.06701. [Online]. Available: http://arxiv.org/abs/1903.06701
handover in NG-RAN using P4-based switches,’’ in Proc. 14th Int. Conf. [164] G. Siracusano and R. Bifulco, ‘‘In-network neural networks,’’ 2018,
Netw. Service Manage. (CNSM), 2018, pp. 338–342. arXiv:1801.05731. [Online]. Available: http://arxiv.org/abs/1801.05731
[143] F. Paolucci, F. Cugini, P. Castoldi, and T. Osinski, ‘‘Enhancing 5G [165] D. Sanvito, G. Siracusano, and R. Bifulco, ‘‘Can the network be the AI
SDN/NFV edge with P4 data plane programmability,’’ IEEE Netw., early accelerator?’’ in Proc. Morning Workshop Netw. Comput., Aug. 2018,
access, Apr. 20, 2021, doi: 10.1109/MNET.021.1900599. pp. 20–25.
[144] Y.-B. Lin, C.-C. Tseng, and M.-H. Wang, ‘‘Effects of transport net- [166] Z. Xiong and N. Zilberman, ‘‘Do switches dream of machine learning?:
work slicing on 5G applications,’’ Future Internet, vol. 13, no. 3, p. 69, Toward in-network classification,’’ in Proc. 18th ACM Workshop Hot
Mar. 2021. Topics Netw., Nov. 2019, pp. 25–33.
[145] E. F. Kfoury, J. Crichigno, and E. Bou-Harb, ‘‘Offloading media traffic to [167] T. Jepsen, M. Moshref, A. Carzaniga, N. Foster, and R. Soulé, ‘‘Life in the
programmable data plane switches,’’ in Proc. IEEE Int. Conf. Commun. fast lane: A line-rate linear road,’’ in Proc. Symp. SDN Res., Mar. 2018,
(ICC), Jun. 2020, pp. 1–7. pp. 1–7.
[146] B.-M. Andrus, S. A. Sasu, T. Szyrkowiec, A. Autenrieth, M. Chamania, [168] T. Kohler, R. Mayer, F. Dürr, M. Maaß, S. Bhowmik, and K. Rothermel,
J. K. Fischer, and S. Rasp, ‘‘Zero-touch provisioning of distributed video ‘‘P4CEP: Towards in-network complex event processing,’’ in Proc. Morn-
analytics in a software-defined metro-haul network with P4 processing,’’ ing Workshop Netw. Comput., Aug. 2018, pp. 33–38.
in Proc. Opt. Fiber Commun. Conf. (OFC), 2019, pp. 1–3. [169] L. Chen, G. Chen, J. Lingys, and K. Chen, ‘‘Programmable switch as
[147] T. Jepsen, M. Moshref, A. Carzaniga, N. Foster, and R. Soulé, ‘‘Packet a parallel computing device,’’ 2018, arXiv:1803.01491. [Online]. Avail-
subscriptions for programmable ASICs,’’ in Proc. 17th ACM Workshop able: http://arxiv.org/abs/1803.01491
Hot Topics Netw., Nov. 2018, pp. 176–183. [170] T. Jepsen, D. Alvarez, N. Foster, C. Kim, J. Lee, M. Moshref, and
[148] C. Wernecke, H. Parzyjegla, G. Mühl, P. Danielis, and D. Timmermann, R. Soulé, ‘‘Fast string searching on PISA,’’ in Proc. ACM Symp. SDN
‘‘Realizing content-based publish/subscribe with P4,’’ in Proc. IEEE Res., Apr. 2019, pp. 21–28.
Conf. Netw. Function Virtualization Softw. Defined Netw. (NFV-SDN), [171] Y. Qiao, X. Kong, M. Zhang, Y. Zhou, M. Xu, and J. Bi, ‘‘Towards
Nov. 2018, pp. 1–7. in-network acceleration of erasure coding,’’ in Proc. Symp. SDN Res.,
[149] C. Wernecke, H. Parzyjegla, G. Mühl, E. Schweissguth, and Mar. 2020, pp. 41–47.
D. Timmermann, ‘‘Flexible notification forwarding for content-based [172] Z. Yu, Y. Zhang, V. Braverman, M. Chowdhury, and X. Jin, ‘‘NetLock:
publish/subscribe using P4,’’ in Proc. IEEE Conf. Netw. Function Fast, centralized lock management using programmable switches,’’ in
Virtualization Softw. Defined Netw. (NFV-SDN), Nov. 2019, pp. 1–5. Proc. Annu. Conf. ACM Special Interest Group Data Commun. Appl.,
[150] R. Kundel, C. Gärtner, M. Luthra, S. Bhowmik, and B. Koldehofe, ‘‘Flex- Technol., Archit., Protocols Comput. Commun., Jul. 2020, pp. 126–138.
ible content-based publish/subscribe over programmable data planes,’’ [173] M. Tirmazi, R. B. Basat, J. Gao, and M. Yu, ‘‘Cheetah: Accelerating
in Proc. IEEE/IFIP Netw. Oper. Manage. Symp. (NOMS), Apr. 2020, database queries with switch pruning,’’ in Proc. ACM SIGMOD Int. Conf.
pp. 1–5. Manage. Data, Jun. 2020, pp. 2407–2422.
[151] R. Miguel, S. Signorello, and F. M. V. Ramos, ‘‘Named data network- [174] S. Vaucher, N. Yazdani, P. Felber, D. E. Lucani, and V. Schiavoni,
ing with programmable switches,’’ in Proc. IEEE 26th Int. Conf. Netw. ‘‘ZipLine: In-network compression at line speed,’’ in Proc. 16th Int. Conf.
Protocols (ICNP), Sep. 2018, pp. 400–405. Emerg. Netw. Exp. Technol., Nov. 2020, pp. 399–405.
[152] O. Karrakchou, N. Samaan, and A. Karmouch, ‘‘ENDN: An enhanced [175] R. Glebke, J. Krude, I. Kunze, J. Rüth, F. Senger, and K. Wehrle,
NDN architecture with a P4-programmable data plane,’’ in Proc. 7th ACM ‘‘Towards executing computer vision functionality on programmable
Conf. Inf.-Centric Netw., Sep. 2020, pp. 1–11. network devices,’’ in Proc. 1st ACM CoNEXT Workshop Emerg. Netw.
[153] J. Li, E. Michael, N. K. Sharma, A. Szekeres, and D. R. Ports, ‘‘Just say Comput. Paradigms (ENCP), 2019, pp. 15–20.
NO to paxos overhead: Replacing consensus with network ordering,’’ in [176] S.-Y. Wang, C.-M. Wu, Y.-B. Lin, and C.-C. Huang, ‘‘High-speed data-
Proc. 12th USENIX Symp. Oper. Syst. Design Implement. (OSDI), 2016, plane packet aggregation and disaggregation by P4 switches,’’ J. Netw.
pp. 467–483. Comput. Appl., vol. 142, pp. 98–110, Sep. 2019.
[154] H. T. Dang, M. Canini, F. Pedone, and R. Soulé, ‘‘Paxos made switch- [177] S.-Y. Wang, J.-Y. Li, and Y.-B. Lin, ‘‘Aggregating and disaggregating
y,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 46, no. 2, pp. 18–24, packets with various sizes of payload in P4 switches at 100 Gbps line
Apr. 2016. rate,’’ J. Netw. Comput. Appl., vol. 165, Sep. 2020, Art. no. 102676.
[155] J. Li, E. Michael, and D. R. K. Ports, ‘‘Eris: Coordination-free consistent [178] Y.-B. Lin, S.-Y. Wang, C.-C. Huang, and C.-M. Wu, ‘‘The SDN approach
transactions using in-network concurrency control,’’ in Proc. 26th Symp. for the aggregation/disaggregation of sensor data,’’ Sensors, vol. 18, no. 7,
Oper. Syst. Princ., Oct. 2017, pp. 104–120. p. 2025, Jun. 2018.
[156] B. Han, V. Gopalakrishnan, M. Platania, Z.-L. Zhang, and Y. Zhang, [179] A. L. R. Madureira, F. R. C. Araújo, and L. N. Sampaio, ‘‘On supporting
‘‘Network-assisted raft consensus protocol,’’ U.S. Patent 16 101 751, IoT data aggregation through programmable data planes,’’ Comput. Netw.,
Feb. 13, 2020. vol. 177, Aug. 2020, Art. no. 107330.


[180] M. Uddin, S. Mukherjee, H. Chang, and T. V. Lakshman, ‘‘SDN-based [201] R. Datta, S. Choi, A. Chowdhary, and Y. Park, ‘‘P4Guard: Designing
service automation for IoT,’’ in Proc. IEEE 25th Int. Conf. Netw. Protocols P4 based firewall,’’ in Proc. IEEE Mil. Commun. Conf. (MILCOM),
(ICNP), Oct. 2017, pp. 1–10. Oct. 2018, pp. 1–6.
[181] M. Uddin, S. Mukherjee, H. Chang, and T. V. Lakshman, ‘‘SDN-based [202] J. Cao, Y. Liu, Y. Zhou, C. Sun, Y. Wang, and J. Bi, ‘‘CoFilter: A high-
multi-protocol edge switching for IoT service automation,’’ IEEE J. Sel. performance switch-accelerated stateful packet filter for bare-metal
Areas Commun., vol. 36, no. 12, pp. 2775–2786, Dec. 2018. servers,’’ in Proc. 28th Int. Conf. Comput. Commun. Netw. (ICCCN),
[182] V. Sivaraman, S. Narayana, O. Rottenstreich, S. Muthukrishnan, and Jul. 2019, pp. 1–9.
J. Rexford, ‘‘Heavy-hitter detection entirely in the data plane,’’ in Proc. [203] J. Li, H. Jiang, W. Jiang, J. Wu, and W. Du, ‘‘SDN-based state-
Symp. SDN Res., Apr. 2017, pp. 164–176. ful firewall for cloud,’’ in Proc. IEEE 6th Int. Conf. Big Data
[183] J. Kučera, D. A. Popescu, G. Antichi, J. Kořenek, and A. W. Moore, ‘‘Seek Secur. Cloud (BigDataSecurity), Int. Conf. High Perform. Smart Com-
and push: Detecting large traffic aggregates in the dataplane,’’ 2018, put. (HPSC), IEEE Int. Conf. Intell. Data Secur. (IDS), May 2020,
arXiv:1805.05993. [Online]. Available: http://arxiv.org/abs/1805.05993 pp. 157–161.
[184] R. Ben-Basat, X. Chen, G. Einziger, and O. Rottenstreich, ‘‘Efficient [204] A. Almaini, A. Al-Dubai, I. Romdhani, and M. Schramm, ‘‘Delegation of
measurement on programmable switches using probabilistic recircula- authentication to the data plane in software-defined networks,’’ in Proc.
tion,’’ in Proc. IEEE 26th Int. Conf. Netw. Protocols (ICNP), Sep. 2018, IEEE Int. Conferences Ubiquitous Comput. Commun. (IUCC), Data Sci.
pp. 313–323. Comput. Intell. (DSCI), Smart Comput., Netw. Services (SmartCNS),
[185] L. Tang, Q. Huang, and P. P. C. Lee, ‘‘A fast and compact invertible Oct. 2019, pp. 58–65.
sketch for network-wide heavy flow detection,’’ IEEE/ACM Trans. Netw.,
vol. 28, no. 5, pp. 2350–2363, Oct. 2020.
[186] M. V. B. D. Silva, J. A. Marques, L. P. Gaspary, and L. Z. Granville,
‘‘Identifying elephant flows using dynamic thresholds in programmable
IXP networks,’’ J. Internet Services Appl., vol. 11, no. 1, pp. 1–12,
Dec. 2020.
[187] R. Harrison, Q. Cai, A. Gupta, and J. Rexford, ‘‘Network-wide heavy
hitter detection with commodity switches,’’ in Proc. Symp. SDN Res.,
Mar. 2018, pp. 1–7.
[188] R. Harrison, S. L. Feibish, A. Gupta, R. Teixeira, S. Muthukrishnan,
and J. Rexford, ‘‘Carpe elephants: Seize the global heavy hitters,’’
in Proc. Workshop Secure Program. Netw. Infrastruct., Aug. 2020,
pp. 15–21.
[189] D. Ding, M. Savi, G. Antichi, and D. Siracusa, ‘‘An incrementally-deployable
P4-enabled architecture for network-wide heavy-hitter detection,’’
IEEE Trans. Netw. Service Manage., vol. 17, no. 1, pp. 75–88,
Mar. 2020.
[190] L. Tang, Q. Huang, and P. P. C. Lee, ‘‘SpreadSketch: Toward
invertible and network-wide detection of superspreaders,’’ in
Proc. IEEE Conf. Comput. Commun. (INFOCOM), Jul. 2020,
pp. 1608–1617.
[205] E. O. Zaballa, D. Franco, Z. Zhou, and M. S. Berger, ‘‘P4Knocking:
Offloading host-based firewall functionalities to the network,’’ in Proc.
23rd Conf. Innov. Clouds, Internet Netw. Workshops (ICIN), Feb. 2020,
pp. 7–12.
[206] Q. Kang, L. Xue, A. Morrison, Y. Tang, A. Chen, and X. Luo,
‘‘Programmable in-network security for context-aware BYOD policies,’’ 2019,
arXiv:1908.01405. [Online]. Available: http://arxiv.org/abs/1908.01405
[207] S. Bai, H. Kim, and J. Rexford, ‘‘Passive OS fingerprinting on commodity
switches,’’ Tech. Rep. TR-010-19, Sep. 2019.
[208] A. Almaini, A. Al-Dubai, I. Romdhani, M. Schramm, and A. Alsarhan,
‘‘Lightweight edge authentication for software defined networks,’’
Computing, vol. 103, no. 2, pp. 291–311, Feb. 2021.
[209] J. Hill, M. Aloserij, and P. Grosso, ‘‘Tracking network flows with
P4,’’ in Proc. IEEE/ACM Innovating Netw. Data-Intensive Sci. (INDIS),
Nov. 2018, pp. 23–32.
[210] G. Li, M. Zhang, C. Liu, X. Kong, A. Chen, G. Gu, and H. Duan,
‘‘NETHCF: Enabling line-rate and adaptive spoofed IP traffic filtering,’’
in Proc. IEEE 27th Int. Conf. Netw. Protocols (ICNP), Oct. 2019,
pp. 1–12.
[211] A. Febro, H. Xiao, and J. Spring, ‘‘Distributed SIP DDoS defense with
P4,’’ in Proc. IEEE Wireless Commun. Netw. Conf. (WCNC), Apr. 2019,
pp. 1–8.
[191] D. Scholz, A. Oeldemann, F. Geyer, S. Gallenmüller, H. Stubbe, T. Wild, [212] D. Scholz, S. Gallenmüller, H. Stubbe, B. Jaber, M. Rouhi, and
A. Herkersdorf, and G. Carle, ‘‘Cryptographic hashing in P4 data G. Carle, ‘‘Me love (SYN-)cookies: SYN flood mitigation in pro-
planes,’’ in Proc. ACM/IEEE Symp. Archit. Netw. Commun. Syst. (ANCS), grammable data planes,’’ 2020, arXiv:2003.03221. [Online]. Available:
Sep. 2019, pp. 1–6. http://arxiv.org/abs/2003.03221
[192] F. Hauser, M. Häberle, M. Schmidt, and M. Menth, ‘‘P4-IPsec: Site- [213] D. Scholz, S. Gallenmüller, H. Stubbe, and G. Carle, ‘‘SYN flood defense
to-site and host-to-site VPN with IPsec in P4-based SDN,’’ 2019, in programmable data planes,’’ in Proc. 3rd P4 Workshop Eur., Dec. 2020,
arXiv:1907.03593. [Online]. Available: http://arxiv.org/abs/1907.03593 pp. 13–20.
[193] L. Malina, D. Smekal, S. Ricci, J. Hajny, P. Cíbik, and J. Hrabovsky, [214] G. K. Ndonda and R. Sadre, ‘‘A two-level intrusion detection system for
‘‘Hardware-accelerated cryptography for software-defined networks industrial control system networks using P4,’’ in Proc. 5th Int. Symp. ICS
with P4,’’ in Proc. Int. Conf. Inf. Technol. Commun. Secur. Cham, Switzerland: Springer, 2020, pp. 271–287.
[194] G. Liu, W. Quan, N. Cheng, D. Gao, N. Lu, H. Zhang, and X. Shen, ‘‘Softwarized IoT network immunity against eavesdropping with programmable data planes,’’ IEEE Internet Things J., vol. 8, no. 8, pp. 6578–6590, Apr. 2021.
[195] X. Chen, ‘‘Implementing AES encryption on programmable switches via scrambled lookup tables,’’ in Proc. Workshop Secure Program. Netw. Infrastruct. (SPIN). New York, NY, USA: Association for Computing Machinery, 2020, pp. 8–14.
[196] H. Kim and A. Gupta, ‘‘ONTAS: Flexible and scalable online network traffic anonymization system,’’ in Proc. Workshop Netw. Meets AI ML (NetAI), 2019, pp. 15–21.
[197] H. M. Moghaddam and A. Mosenia, ‘‘Anonymizing masses: Practical light-weight anonymity at the network level,’’ 2019, arXiv:1911.09642. [Online]. Available: http://arxiv.org/abs/1911.09642
[198] T. Datta, N. Feamster, J. Rexford, and L. Wang, ‘‘SPINE: Surveillance protection in the network elements,’’ in Proc. 9th USENIX Workshop Free Open Commun. Internet (FOCI), 2019, pp. 1–7.
[199] L. Wang, H. Kim, P. Mittal, and J. Rexford, ‘‘Programmable in-network obfuscation of DNS traffic,’’ in Proc. NDSS, DNS Privacy Workshop, 2021, pp. 1–10.
[200] R. Meier, P. Tsankov, V. Lenders, L. Vanbever, and M. Vechev, ‘‘NetHide: Secure and practical network topology obfuscation,’’ in Proc. 27th USENIX Secur. Symp. (USENIX Secur.), 2018, pp. 693–709.
SCADA Cyber Secur. Res., 2018, pp. 31–40.
[215] J. Xing, W. Wu, and A. Chen, ‘‘Architecting programmable data plane defenses into the network with FastFlex,’’ in Proc. 18th ACM Workshop Hot Topics Netw., Nov. 2019, pp. 161–169.
[216] Q. Kang, J. Xing, and A. Chen, ‘‘Automated attack discovery in data plane systems,’’ in Proc. 12th USENIX Workshop Cyber Secur. Experimentation Test (CSET), 2019, pp. 1–5.
[217] A. C. Lapolli, J. A. Marques, and L. P. Gaspary, ‘‘Offloading real-time DDoS attack detection to programmable data planes,’’ in Proc. IFIP/IEEE Symp. Integr. Netw. Service Manage. (IM), 2019, pp. 19–27.
[218] Y. Mi and A. Wang, ‘‘ML-pushback: Machine learning based pushback defense against DDoS,’’ in Proc. 15th Int. Conf. Emerg. Netw. Exp. Technol., Dec. 2019, pp. 80–81.
[219] J. Ioannidis and S. M. Bellovin, ‘‘Implementing pushback: Router-based defense against DDoS attacks,’’ in Proc. NDSS, 2002, pp. 1–12.
[220] M. Zhang, G. Li, S. Wang, C. Liu, A. Chen, H. Hu, G. Gu, Q. Li, M. Xu, and J. Wu, ‘‘Poseidon: Mitigating volumetric DDoS attacks with programmable switches,’’ in Proc. NDSS, 2020, pp. 1–18.
[221] K. Friday, E. Kfoury, E. Bou-Harb, and J. Crichigno, ‘‘Towards a unified in-network DDoS detection and mitigation strategy,’’ in Proc. 6th IEEE Conf. Netw. Softwarization (NetSoft), Jun. 2020, pp. 218–226.
[222] J. Xing, Q. Kang, and A. Chen, ‘‘NetWarden: Mitigating network covert channels while preserving performance,’’ in Proc. 29th USENIX Secur. Symp. (USENIX Secur.), 2020, pp. 2039–2056.


[223] A. Laraba, J. François, I. Chrisment, S. R. Chowdhury, and R. Boutaba, ‘‘Defeating protocol abuse with P4: Application to explicit congestion notification,’’ in Proc. IFIP Netw. Conf. (Networking), 2020, pp. 431–439.
[224] J. Xing, W. Wu, and A. Chen, ‘‘Ripple: A programmable, decentralized link-flooding defense against adaptive adversaries,’’ in Proc. 30th USENIX Secur. Symp. (USENIX Secur.), Vancouver, BC, Canada, 2021, pp. 1–16.
[225] A. D. S. Ilha, A. C. Lapolli, J. A. Marques, and L. P. Gaspary, ‘‘Euclid: A fully in-network, P4-based approach for real-time DDoS attack detection and mitigation,’’ IEEE Trans. Netw. Service Manage., early access, Dec. 30, 2020, doi: 10.1109/TNSM.2020.3048265.
[226] X. Z. Khooi, L. Csikor, D. M. Divakaran, and M. S. Kang, ‘‘DIDA: Distributed in-network defense architecture against amplified reflection DDoS attacks,’’ in Proc. 6th IEEE Conf. Netw. Softwarization (NetSoft), Jun. 2020, pp. 277–281.
[227] D. Ding, M. Savi, F. Pederzolli, M. Campanella, and D. Siracusa, ‘‘In-network volumetric DDoS victim identification using programmable commodity switches,’’ IEEE Trans. Netw. Service Manage., early access, Apr. 15, 2021, doi: 10.1109/TNSM.2021.3073597.
[228] F. Musumeci, V. Ionata, F. Paolucci, F. Cugini, and M. Tornatore, ‘‘Machine-learning-assisted DDoS attack detection with P4 language,’’ in Proc. IEEE Int. Conf. Commun. (ICC), Jun. 2020, pp. 1–6.
[229] Z. Liu, H. Namkung, G. Nikolaidis, J. Lee, C. Kim, X. Jin, V. Braverman, M. Yu, and V. Sekar, ‘‘Jaqen: A high-performance switch-native approach for detecting and mitigating volumetric DDoS attacks with programmable switches,’’ in Proc. 30th USENIX Secur. Symp. (USENIX Secur.), 2021, pp. 1–18.
[230] C. Zhang, J. Bi, Y. Zhou, J. Wu, B. Liu, Z. Li, A. B. Dogar, and Y. Wang, ‘‘P4DB: On-the-fly debugging of the programmable data plane,’’ in Proc. IEEE 25th Int. Conf. Netw. Protocols (ICNP), Oct. 2017, pp. 1–10.
[231] Y. Zhou, J. Bi, Y. Lin, Y. Wang, D. Zhang, Z. Xi, J. Cao, and C. Sun, ‘‘P4Tester: Efficient runtime rule fault detection for programmable data planes,’’ in Proc. Int. Symp. Qual. Service, Jun. 2019, pp. 1–10.
[232] M. V. Dumitru, D. Dumitrescu, and C. Raiciu, ‘‘Can we exploit buggy P4 programs?’’ in Proc. Symp. SDN Res., Mar. 2020, pp. 62–68.
[233] S. Kodeswaran, M. T. Arashloo, P. Tammana, and J. Rexford, ‘‘Tracking P4 program execution in the data plane,’’ in Proc. Symp. SDN Res., Mar. 2020, pp. 117–122.
[234] Y. Zhou, J. Bi, T. Yang, K. Gao, C. Zhang, J. Cao, and Y. Wang, ‘‘KeySight: Troubleshooting programmable switches via scalable high-coverage behavior tracking,’’ in Proc. IEEE 26th Int. Conf. Netw. Protocols (ICNP), Sep. 2018, pp. 291–301.
[235] N. Lopes, N. Bjørner, N. McKeown, A. Rybalchenko, D. Talayco, and G. Varghese, ‘‘Automatically verifying reachability and well-formedness in P4 networks,’’ Tech. Rep. MSR-TR-2016-65, Sep. 2016.
[236] L. Freire, M. Neves, L. Leal, K. Levchenko, A. Schaeffer-Filho, and M. Barcellos, ‘‘Uncovering bugs in P4 programs with assertion-based verification,’’ in Proc. Symp. SDN Res., Mar. 2018, pp. 1–7.
[237] M. Neves, L. Freire, A. Schaeffer-Filho, and M. Barcellos, ‘‘Verification of P4 programs in feasible time using assertions,’’ in Proc. 14th Int. Conf. Emerg. Netw. Exp. Technol., Dec. 2018, pp. 73–85.
[238] J. Liu, W. Hallahan, C. Schlesinger, M. Sharif, J. Lee, R. Soulé, H. Wang, C. Caşcaval, N. McKeown, and N. Foster, ‘‘P4V: Practical verification for programmable data planes,’’ in Proc. Conf. ACM Special Interest Group Data Commun., Aug. 2018, pp. 490–503.
[239] A. Nötzli, J. Khan, A. Fingerhut, C. Barrett, and P. Athanas, ‘‘P4pktgen: Automated test case generation for P4 programs,’’ in Proc. Symp. SDN Res., Mar. 2018, pp. 1–7.
[240] D. Lukács, M. Tejfel, and G. Pongrácz, ‘‘Keeping P4 switches fast and fault-free through automatic verification,’’ Acta Cybernetica, vol. 24, no. 1, pp. 61–81, May 2019.
[241] R. Stoenescu, D. Dumitrescu, M. Popovici, L. Negreanu, and C. Raiciu, ‘‘Debugging P4 programs with vera,’’ in Proc. Conf. ACM Special Interest Group Data Commun., Aug. 2018, pp. 518–532.
[242] A. Shukla, K. N. Hudemann, A. Hecker, and S. Schmid, ‘‘Runtime verification of P4 switches with reinforcement learning,’’ in Proc. Workshop Netw. Meets AI ML (NetAI), 2019, pp. 1–7.
[243] D. Dumitrescu, R. Stoenescu, L. Negreanu, and C. Raiciu, ‘‘Bf4: Towards bug-free P4 programs,’’ in Proc. Annu. Conf. ACM Special Interest Group Data Commun. Appl., Technol., Archit., Protocols Comput. Commun., Jul. 2020, pp. 571–585.
[244] A. Bas and A. Fingerhut. P4 Tutorial, Slide 22. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/tb4m749
[245] M. Shahbaz, S. Choi, B. Pfaff, C. Kim, N. Feamster, N. McKeown, and J. Rexford, ‘‘PISCES: A programmable, protocol-independent software switch,’’ in Proc. ACM SIGCOMM Conf., Aug. 2016, pp. 525–538.
[246] B. Pfaff, J. Pettit, T. Koponen, E. Jackson, A. Zhou, J. Rajahalme, J. Gross, A. Wang, J. Stringer, and P. Shelar, ‘‘The design and implementation of open vSwitch,’’ in Proc. 12th USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2015, pp. 117–130.
[247] Barefoot Networks. (2020). Barefoot Academy. [Online]. Available: https://www.barefootnetworks.com/barefoot-academy/
[248] C. Kim, A. Sivaraman, N. Katta, A. Bas, A. Dixit, and L. J. Wobker, ‘‘In-band network telemetry via programmable dataplanes,’’ in Proc. ACM SIGCOMM, 2015, pp. 1–2.
[249] C. Hopps, Analysis of an Equal-Cost Multi-Path Algorithm, document RFC 2992, Nov. 2000.
[250] S. Sinha, S. Kandula, and D. Katabi, ‘‘Harnessing TCP’s burstiness with flowlet switching,’’ in Proc. 3rd ACM Workshop Hot Topics Netw. (Hotnets-III), 2004, pp. 1–6.
[251] C. Kim, P. Bhide, E. Doe, H. Holbrook, A. Ghanwani, D. Daly, M. Hira, and B. Davie, ‘‘In-band network telemetry (INT),’’ Tech. Rep. Version 2.1, 2020. [Online]. Available: https://github.com/p4lang/p4-applications/blob/master/docs/INT_v2_1.pdf
[252] P. Manzanares-Lopez, J. P. Muñoz-Gea, and J. Malgosa-Sanahuja, ‘‘Passive in-band network telemetry systems: The potential of programmable data plane on network-wide telemetry,’’ IEEE Access, vol. 9, pp. 20391–20409, 2021.
[253] M. A. M. Vieira, M. S. Castanho, R. D. G. Pacífico, E. R. S. Santos, E. P. M. C. Júnior, and L. F. M. Vieira, ‘‘Fast packet processing with eBPF and XDP: Concepts, code, challenges, and applications,’’ ACM Comput. Surv., vol. 53, no. 1, pp. 1–36, May 2020.
[254] J. Crichigno, E. Bou-Harb, and N. Ghani, ‘‘A comprehensive tutorial on science DMZ,’’ IEEE Commun. Surveys Tuts., vol. 21, no. 2, pp. 2041–2078, 2nd Quart., 2019.
[255] J. F. Kurose and K. W. Ross, Computer Networking: A Top-Down Approach, 6th ed. London, U.K.: Pearson, 2012.
[256] S. Ha, I. Rhee, and L. Xu, ‘‘CUBIC: A new TCP-friendly high-speed TCP variant,’’ ACM SIGOPS Oper. Syst. Rev., vol. 42, no. 5, pp. 64–74, Jul. 2008.
[257] D. Leith and R. Shorten. (2008). H-TCP: TCP Congestion Control for High Bandwidth-Delay Product Paths. [Online]. Available: https://draft-leith-tcp-htcp-06
[258] N. Cardwell, Y. Cheng, C. S. Gunn, S. H. Yeganeh, and V. Jacobson, ‘‘BBR: Congestion-based congestion control,’’ Commun. ACM, vol. 60, no. 2, pp. 58–66, 2017.
[259] E. F. Kfoury, J. Gomez, J. Crichigno, and E. Bou-Harb, ‘‘An emulation-based evaluation of TCP BBRv2 alpha for wired broadband,’’ Comput. Commun., vol. 161, pp. 212–224, Sep. 2020.
[260] S. Floyd, ‘‘TCP and explicit congestion notification,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 24, no. 5, pp. 8–23, Oct. 1994.
[261] R. Mittal, V. T. Lam, N. Dukkipati, E. Blem, H. Wassel, M. Ghobadi, A. Vahdat, Y. Wang, D. Wetherall, and D. Zats, ‘‘TIMELY: RTT-based congestion control for the data center,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 45, no. 4, pp. 537–550, 2015.
[262] Y. Zhu, H. Eran, D. Firestone, C. Guo, M. Lipshteyn, Y. Liron, J. Padhye, S. Raindel, M. H. Yahia, and M. Zhang, ‘‘Congestion control for large-scale RDMA deployments,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 45, no. 4, pp. 523–536, Sep. 2015.
[263] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan, ‘‘Data center TCP (DCTCP),’’ in Proc. ACM SIGCOMM Conf. SIGCOMM, 2010, pp. 63–74.
[264] M. Alizadeh, S. Yang, M. Sharif, S. Katti, N. McKeown, B. Prabhakar, and S. Shenker, ‘‘Pfabric: Minimal near-optimal datacenter transport,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 43, no. 4, pp. 435–446, 2013.
[265] M. Dong, Q. Li, D. Zarchy, P. B. Godfrey, and M. Schapira, ‘‘PCC: Re-architecting congestion control for consistent high performance,’’ in Proc. 12th USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2015, pp. 395–408.
[266] A. Langley et al., ‘‘The QUIC transport protocol: Design and Internet-scale deployment,’’ in Proc. Conf. ACM Special Interest Group Data Commun., 2017, pp. 183–196.
[267] P. Cheng, F. Ren, R. Shu, and C. Lin, ‘‘Catch the whole lot in an action: Rapid precise packet loss notification in data center,’’ in Proc. 11th USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2014, pp. 17–28.


[268] A. Ramachandran, S. Seetharaman, N. Feamster, and V. Vazirani, ‘‘Fast monitoring of traffic subpopulations,’’ in Proc. 8th ACM SIGCOMM Conf. Internet Meas. Conf. (IMC), 2008, pp. 257–270.
[269] N. Alon, Y. Matias, and M. Szegedy, ‘‘The space complexity of approximating the frequency moments,’’ J. Comput. Syst. Sci., vol. 58, no. 1, pp. 137–147, Feb. 1999.
[270] V. Braverman and R. Ostrovsky, ‘‘Zero-one frequency laws,’’ in Proc. 42nd ACM Symp. Theory Comput. (STOC), 2010, pp. 281–290.
[271] M. Charikar, K. Chen, and M. Farach-Colton, ‘‘Finding frequent items in data streams,’’ in Proc. Int. Colloq. Automata, Lang., Program. Berlin, Germany: Springer, 2002, pp. 693–703.
[272] G. Cormode and S. Muthukrishnan, ‘‘An improved data stream summary: The count-min sketch and its applications,’’ J. Algorithms, vol. 55, no. 1, pp. 58–75, Apr. 2005.
[273] S. Floyd and V. Jacobson, ‘‘Random early detection gateways for congestion avoidance,’’ IEEE/ACM Trans. Netw., vol. 1, no. 4, pp. 397–413, Aug. 1993.
[274] P. Flajolet, D. Gardy, and L. Thimonier, ‘‘Birthday paradox, coupon collectors, caching algorithms and self-organizing search,’’ Discrete Appl. Math., vol. 39, no. 3, pp. 207–229, Nov. 1992.
[275] R. Dolby, ‘‘Noise reduction systems,’’ U.S. Patent 3 846 719, Nov. 5, 1974.
[276] S. V. Vaseghi, Advanced Digital Signal Processing and Noise Reduction. Hoboken, NJ, USA: Wiley, 2008.
[277] J. Gettys, ‘‘Bufferbloat: Dark buffers in the Internet,’’ IEEE Internet Comput., vol. 15, no. 3, p. 96, May/Jun. 2011.
[278] M. Allman, ‘‘Comments on bufferbloat,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 43, no. 1, pp. 30–37, Jan. 2013.
[279] Y. Gong, D. Rossi, C. Testa, S. Valenti, and M. D. Täht, ‘‘Fighting the bufferbloat: On the coexistence of AQM and low priority congestion control,’’ Comput. Netw., vol. 65, pp. 255–267, Jun. 2014.
[280] C. Staff, ‘‘BufferBloat: What’s wrong with the Internet?’’ Commun. ACM, vol. 55, no. 2, pp. 40–47, Feb. 2012.
[281] V. G. Cerf, ‘‘Bufferbloat and other Internet challenges,’’ IEEE Internet Comput., vol. 18, no. 5, p. 80, Sep./Oct. 2014.
[282] H. Harkous, C. Papagianni, K. De Schepper, M. Jarschel, M. Dimolianis, and R. Pries, ‘‘Virtual queues for P4: A poor man’s programmable traffic manager,’’ IEEE Trans. Netw. Service Manage., early access, May 3, 2021, doi: 10.1109/TNSM.2021.3077051.
[283] K. Nichols, S. Blake, F. Baker, and D. Black, Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers, document RFC 2474, 1998. [Online]. Available: https://tools.ietf.org/html/rfc2474
[284] B. Fenner, M. Handley, H. Holbrook, I. Kouvelas, R. Parekh, Z. Zhang, and L. Zheng, Protocol Independent Multicast-Sparse Mode (PIM-SM): Protocol Specification (Revised), document RFC 7761, 2016. [Online]. Available: https://tools.ietf.org/html/rfc7761
[285] H. Holbrook, B. Cain, and B. Haberman, Using Internet Group Management Protocol Version 3 (IGMPv3) and Multicast Listener Discovery Protocol Version 2 (MLDv2) for Source-Specific Multicast, document RFC 4604, Internet Engineering Task Force, 2006.
[286] I. Wijnands, E. C. Rosen, A. Dolganow, T. Przygienda, and S. Aldrin, Multicast Using Bit Index Explicit Replication (BIER), document RFC 8279, 2017.
[287] S. Luo, H. Yu, K. Li, and H. Xing, ‘‘Efficient file dissemination in data center networks with priority-based adaptive multicast,’’ IEEE J. Sel. Areas Commun., vol. 38, no. 6, pp. 1161–1175, Jun. 2020.
[288] B. Carpenter and S. Brim, Middleboxes: Taxonomy and Issues, document RFC 3234, 2002. [Online]. Available: https://tools.ietf.org/html/rfc3234
[289] J. McCauley, A. Panda, A. Krishnamurthy, and S. Shenker, ‘‘Thoughts on load distribution and the role of programmable switches,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 49, no. 1, pp. 18–23, Feb. 2019.
[290] T. Norp, ‘‘5G requirements and key performance indicators,’’ J. ICT Standardization, vol. 6, no. 1, pp. 15–30, 2018.
[291] G. Xylomenos, C. N. Ververidis, V. A. Siris, N. Fotiou, C. Tsilopoulos, X. Vasilakos, K. V. Katsaros, and G. C. Polyzos, ‘‘A survey of information-centric networking research,’’ IEEE Commun. Surveys Tuts., vol. 16, no. 2, pp. 1024–1049, 2nd Quart., 2014.
[292] D. L. Tennenhouse and D. J. Wetherall, ‘‘Towards an active network architecture,’’ in Proc. DARPA Act. Netw. Conf. Expo., 2002, pp. 2–15.
[293] E. F. Kfoury, J. Gomez, J. Crichigno, E. Bou-Harb, and D. Khoury, ‘‘Decentralized distribution of PCP mappings over blockchain for end-to-end secure direct communications,’’ IEEE Access, vol. 7, pp. 110159–110173, 2019.
[294] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. Long, and C. Maltzahn, ‘‘Ceph: A scalable, high-performance distributed file system,’’ in Proc. 7th Symp. Oper. Syst. Design Implement., 2006, pp. 307–320.
[295] L. Lamport, ‘‘Paxos made simple,’’ ACM SIGACT News, vol. 32, no. 4, pp. 18–25, 2001.
[296] D. Ongaro and J. Ousterhout, ‘‘In search of an understandable consensus algorithm,’’ in Proc. USENIX Annu. Tech. Conf. (USENIX ATC), 2014, pp. 305–319.
[297] H. T. Dang. Consensus as a Network Service. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/y2t9plsu
[298] J. Nelson. SwitchML Scaling Distributed Machine Learning With in Network Aggregation. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/y53upm7k
[299] D. Das, S. Avancha, D. Mudigere, K. Vaidynathan, S. Sridharan, D. Kalamkar, B. Kaul, and P. Dubey, ‘‘Distributed deep learning using synchronous stochastic gradient descent,’’ 2016, arXiv:1602.06709. [Online]. Available: http://arxiv.org/abs/1602.06709
[300] S. Farrell, Low-Power Wide Area Network (LPWAN) Overview, document RFC 8376, 2018. [Online]. Available: https://tools.ietf.org/html/rfc8376
[301] A. Koike, T. Ohba, and R. Ishibashi, ‘‘IoT network architecture using packet aggregation and disaggregation,’’ in Proc. 5th IIAI Int. Congr. Adv. Appl. Informat. (IIAI-AAI), Jul. 2016, pp. 1140–1145.
[302] J. Deng and M. Davis, ‘‘An adaptive packet aggregation algorithm for wireless networks,’’ in Proc. Int. Conf. Wireless Commun. Signal Process., Oct. 2013, pp. 1–6.
[303] Y. Yasuda, R. Nakamura, and H. Ohsaki, ‘‘A probabilistic interest packet aggregation for content-centric networking,’’ in Proc. IEEE 42nd Annu. Comput. Softw. Appl. Conf. (COMPSAC), Jul. 2018, pp. 783–788.
[304] A. S. Akyurek and T. S. Rosing, ‘‘Optimal packet aggregation scheduling in wireless networks,’’ IEEE Trans. Mobile Comput., vol. 17, no. 12, pp. 2835–2852, Dec. 2018.
[305] K. Zhou and N. Nikaein, ‘‘Packet aggregation for machine type communications in LTE with random access channel,’’ in Proc. IEEE Wireless Commun. Netw. Conf. (WCNC), Apr. 2013, pp. 262–267.
[306] A. Majeed and N. B. Abu-Ghazaleh, ‘‘Packet aggregation in multi-rate wireless LANs,’’ in Proc. 9th Annu. IEEE Commun. Soc. Conf. Sensor, Mesh Ad Hoc Commun. Netw. (SECON), Jun. 2012, pp. 452–460.
[307] Bluetooth Specification Version 4.2, Bluetooth SIG, Kirkland, WA, USA, 2014.
[308] S. Farahani, ZigBee Wireless Networks and Transceivers. London, U.K.: Newnes, 2011.
[309] O. Hersent, D. Boswarthick, and O. Elloumi, The Internet of Things: Key Applications and Protocols. Hoboken, NJ, USA: Wiley, 2011.
[310] J. Shi, W. Quan, D. Gao, M. Liu, G. Liu, C. Yu, and W. Su, ‘‘Flowlet-based stateful multipath forwarding in heterogeneous Internet of Things,’’ IEEE Access, vol. 8, pp. 74875–74886, 2020.
[311] S. Do, L.-V. Le, B.-S.-P. Lin, and L.-P. Tung, ‘‘SDN/NFV-based network infrastructure for enhancing IoT gateways,’’ in Proc. Int. Conf. Internet Things (iThings), IEEE Green Comput. Commun. (GreenCom), IEEE Cyber, Phys. Social Comput. (CPSCom), IEEE Smart Data (SmartData), Jul. 2019, pp. 1135–1142.
[312] A. Metwally, D. Agrawal, and A. El Abbadi, ‘‘Efficient computation of frequent and top-k elements in data streams,’’ in Proc. Int. Conf. Database Theory. Berlin, Germany: Springer, 2005, pp. 398–412.
[313] S. Heule, M. Nunkesser, and A. Hall, ‘‘HyperLogLog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm,’’ in Proc. 16th Int. Conf. Extending Database Technol. (EDBT), 2013, pp. 683–692.
[314] F. Hauser, M. Schmidt, M. Häberle, and M. Menth, ‘‘P4-MACsec: Dynamic topology monitoring and data layer protection with MACsec in P4-based SDN,’’ IEEE Access, vol. 8, pp. 58845–58858, 2020.
[315] M. G. Reed, P. F. Syverson, and D. M. Goldschlag, ‘‘Anonymous connections and onion routing,’’ IEEE J. Sel. Areas Commun., vol. 16, no. 4, pp. 482–494, May 1998.
[316] V. Liu, S. Han, A. Krishnamurthy, and T. Anderson, ‘‘Tor instead of IP,’’ in Proc. 10th ACM Workshop Hot Topics Netw., 2011, pp. 1–6.
[317] C. Chen, D. E. Asoni, D. Barrera, G. Danezis, and A. Perrig, ‘‘HORNET: High-speed onion routing at the network layer,’’ in Proc. 22nd ACM SIGSAC Conf. Comput. Commun. Secur., Oct. 2015, pp. 1441–1454.
[318] L. Lamport, ‘‘Password authentication with insecure communication,’’ Commun. ACM, vol. 24, no. 11, pp. 770–772, Nov. 1981.
[319] M. Zalewski and W. Stearns. (2006). P0F. [Online]. Available: http://lcamtuf.coredump.cx/p0f3
[320] J. Barnes and P. Crowley, ‘‘K-P0F: A high-throughput kernel passive OS fingerprinter,’’ in Proc. Archit. Netw. Commun. Syst., 2013, pp. 113–114.


[321] S. Hong, R. Baykov, L. Xu, S. Nadimpalli, and G. Gu, ‘‘Towards SDN-defined programmable BYOD (bring your own device) security,’’ in Proc. NDSS, 2016, pp. 1–15.
[322] S. Hilton. (2016). Dyn Analysis Summary of Friday October 21 Attack. [Online]. Available: https://dyn.com/blog/dyn-analysis-summary-of-friday-october-21-attack/
[323] S. Kottler. (Mar. 2018). February 28th DDoS Incident Report. [Online]. Available: https://githubengineering.com/ddos-incident-report/
[324] S. K. Fayaz, Y. Tobioka, V. Sekar, and M. Bailey, ‘‘Bohatei: Flexible and elastic DDoS defense,’’ in Proc. 24th USENIX Secur. Symp. (USENIX Secur.), 2015, pp. 817–832.
[325] Arbor Networks. Arbor Networks APS Datasheet. Accessed: Jun. 1, 2021. [Online]. Available: https://www.netscout.com/sites/default/files/2018-04/DS_APS_EN.pdf
[326] NSFOCUS. NSFOCUS Anti-DDoS System Datasheet. Accessed: Jun. 1, 2021. [Online]. Available: https://nsfocusglobal.com/wp-content/uploads/2018/05/Anti-DDoS-Solution.pdf
[327] J. Hypolite, J. Sonchack, S. Hershkop, N. Dautenhahn, A. DeHon, and J. M. Smith, ‘‘DeepMatch: Practical deep packet inspection in the data plane using network processors,’’ in Proc. 16th Int. Conf. Emerg. Netw. Exp. Technol., Nov. 2020, pp. 336–350.
[328] N. Handigol, B. Heller, V. Jeyakumar, D. Mazières, and N. McKeown, ‘‘I know what your packet did last hop: Using packet histories to troubleshoot networks,’’ in Proc. 11th USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2014, pp. 71–85.
[329] Y. Zhu, N. Kang, J. Cao, A. Greenberg, G. Lu, R. Mahajan, D. Maltz, L. Yuan, M. Zhang, B. Y. Zhao, and H. Zheng, ‘‘Packet-level telemetry in large datacenter networks,’’ in Proc. ACM Conf. Special Interest Group Data Commun., Aug. 2015, pp. 479–491.
[330] H. Zeng, P. Kazemian, G. Varghese, and N. McKeown, ‘‘Automatic test packet generation,’’ in Proc. 8th Int. Conf. Emerg. Netw. Exp. Technol. (CoNEXT), 2012, pp. 241–252.
[331] H. T. Dang, H. Wang, T. Jepsen, G. Brebner, C. Kim, J. Rexford, R. Soulé, and H. Weatherspoon, ‘‘Whippersnapper: A P4 language benchmark suite,’’ in Proc. Symp. SDN Res., Apr. 2017, pp. 95–101.
[332] F. Rodriguez, P. G. K. Patra, L. Csikor, C. Rothenberg, P. V. S. Laki, and G. Pongrácz, ‘‘BB-Gen: A packet crafter for P4 target evaluation,’’ in Proc. ACM SIGCOMM Conf. Posters Demos, Aug. 2018, pp. 111–113.
[333] H. Harkous, M. Jarschel, M. He, R. Pries, and W. Kellerer, ‘‘P8: P4 with predictable packet processing performance,’’ IEEE Trans. Netw. Service Manage., early access, Oct. 12, 2020, doi: 10.1109/TNSM.2020.3030102.
[334] H. Harkous, M. Jarschel, M. He, R. Pries, and W. Kellerer, ‘‘Towards understanding the performance of P4 programmable hardware,’’ in Proc. ACM/IEEE Symp. Archit. Netw. Commun. Syst. (ANCS), Sep. 2019, pp. 1–6.
[335] P. Kazemian, G. Varghese, and N. McKeown, ‘‘Header space analysis: Static checking for networks,’’ in Proc. 9th USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2012, pp. 113–126.
[336] A. Khurshid, X. Zou, W. Zhou, M. Caesar, and P. B. Godfrey, ‘‘VeriFlow: Verifying network-wide invariants in real time,’’ in Proc. 10th USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2013, pp. 15–27.
[337] R. Stoenescu, M. Popovici, L. Negreanu, and C. Raiciu, ‘‘SymNet: Scalable symbolic execution for modern networks,’’ in Proc. ACM SIGCOMM Conf., Aug. 2016, pp. 314–327.
[338] H. Mai, A. Khurshid, R. Agarwal, M. Caesar, P. B. Godfrey, and S. T. King, ‘‘Debugging the data plane with anteater,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 41, no. 4, pp. 290–301, Oct. 2011.
[339] P. Kazemian, M. Chang, H. Zeng, G. Varghese, N. McKeown, and S. Whyte, ‘‘Real time network policy checking using header space analysis,’’ in Proc. 10th USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2013, pp. 99–111.
[340] A. Horn, A. Kheradmand, and M. Prasad, ‘‘Delta-Net: Real-time network verification using atoms,’’ in Proc. 14th USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2017, pp. 735–749.
[341] S. Son, S. Shin, V. Yegneswaran, P. Porras, and G. Gu, ‘‘Model checking invariant security properties in OpenFlow,’’ in Proc. IEEE Int. Conf. Commun. (ICC), Jun. 2013, pp. 1974–1979.
[342] A. Panda, O. Lahav, K. Argyraki, M. Sagiv, and S. Shenker, ‘‘Verifying reachability in networks with mutable datapaths,’’ in Proc. 14th USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2017, pp. 699–718.
[343] N. Foster, N. McKeown, J. Rexford, G. Parulkar, L. Peterson, and O. Sunay, ‘‘Using deep programmability to put network owners in control,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 50, no. 4, pp. 82–88, Oct. 2020.
[344] Pronto Project. Accessed: Jun. 1, 2021. [Online]. Available: https://prontoproject.org/
[345] Y. Zhou and J. Bi, ‘‘ClickP4: Towards modular programming of P4,’’ in Proc. SIGCOMM Posters Demos, 2017, pp. 100–102.
[346] P. Zheng, T. Benson, and C. Hu, ‘‘P4Visor: Lightweight virtualization and composition primitives for building and testing modular programs,’’ in Proc. 14th Int. Conf. Emerg. Netw. Exp. Technol., Dec. 2018, pp. 98–111.
[347] X. Chen, D. Zhang, X. Wang, K. Zhu, and H. Zhou, ‘‘P4SC: Towards high-performance service function chain implementation on the P4-capable device,’’ in Proc. IFIP/IEEE Symp. Integr. Netw. Service Manage. (IM), 2019, pp. 1–9.
[348] M. Riftadi and F. Kuipers, ‘‘P4I/O: Intent-based networking with P4,’’ in Proc. IEEE Conf. Netw. Softwarization (NetSoft), Jun. 2019, pp. 438–443.
[349] E. O. Zaballa and Z. Zhou, ‘‘Graph-to-P4: A P4 boilerplate code generator for parse graphs,’’ in Proc. ACM/IEEE Symp. Archit. Netw. Commun. Syst. (ANCS), Sep. 2019, pp. 1–2.
[350] M. Riftadi, J. Oostenbrink, and F. Kuipers, ‘‘GP4P4: Enabling self-programming networks,’’ 2019, arXiv:1910.00967. [Online]. Available: http://arxiv.org/abs/1910.00967
[351] X. Gao, T. Kim, M. D. Wong, D. Raghunathan, A. K. Varma, P. G. Kannan, A. Sivaraman, S. Narayana, and A. Gupta, ‘‘Switch code generation using program synthesis,’’ in Proc. Annu. Conf. ACM Special Interest Group Data Commun. Appl., Technol., Archit., Protocols Comput. Commun., Jul. 2020, pp. 44–61.
[352] J. Gao, E. Zhai, H. H. Liu, R. Miao, Y. Zhou, B. Tian, C. Sun, D. Cai, M. Zhang, and M. Yu, ‘‘Lyra: A cross-platform language and compiler for data plane programming on heterogeneous ASICs,’’ in Proc. Annu. Conf. ACM Special Interest Group Data Commun. Appl., Technol., Archit., Protocols Comput. Commun., Jul. 2020, pp. 435–450.
[353] M. Hogan, S. Landau-Feibish, M. T. Arashloo, J. Rexford, D. Walker, and R. Harrison, ‘‘Elastic switch programming with P4All,’’ in Proc. 19th ACM Workshop Hot Topics Netw., Nov. 2020, pp. 168–174.
[354] C. Zhang, J. Bi, Y. Zhou, A. B. Dogar, and J. Wu, ‘‘HyperV: A high performance hypervisor for virtualization of the programmable data plane,’’ in Proc. 26th Int. Conf. Comput. Commun. Netw. (ICCCN), Jul. 2017, pp. 1–9.
[355] M. Saquetti, G. Bueno, W. Cordeiro, and J. R. Azambuja, ‘‘P4VBox: Enabling P4-based switch virtualization,’’ IEEE Commun. Lett., vol. 24, no. 1, pp. 146–149, Jan. 2020.
[356] R. Parizotto, L. Castanheira, F. Bonetti, A. Santos, and A. Schaeffer-Filho, ‘‘PRIME: Programming in-network modular extensions,’’ in Proc. IEEE/IFIP Netw. Oper. Manage. Symp. (NOMS), Apr. 2020, pp. 1–9.
[357] E. O. Zaballa, D. Franco, M. S. Berger, and M. Higuero, ‘‘A perspective on P4-based data and control plane modularity for network automation,’’ in Proc. 3rd P4 Workshop Eur., Dec. 2020, pp. 59–61.
[358] R. Stoyanov and N. Zilberman, ‘‘MTPSA: Multi-tenant programmable switches,’’ in Proc. 3rd P4 Workshop Eur., Dec. 2020, pp. 43–48.
[359] S. Han, S. Jang, H. Choi, H. Lee, and S. Pack, ‘‘Virtualization in programmable data plane: A survey and open challenges,’’ IEEE Open J. Commun. Soc., vol. 1, pp. 527–534, 2020.
[360] E. C. Molero, S. Vissicchio, and L. Vanbever, ‘‘Hardware-accelerated network control planes,’’ in Proc. 17th ACM Workshop Hot Topics Netw., Nov. 2018, pp. 120–126.
[361] M. T. Arashloo, Y. Koral, M. Greenberg, J. Rexford, and D. Walker, ‘‘SNAP: Stateful network-wide abstractions for packet processing,’’ in Proc. ACM SIGCOMM Conf., Aug. 2016, pp. 29–43.
[362] G. Sviridov, M. Bonola, A. Tulumello, P. Giaccone, A. Bianco, and G. Bianchi, ‘‘LODGE: Local decisions on global states in programmable data planes,’’ in Proc. 4th IEEE Conf. Netw. Softwarization Workshops (NetSoft), Jun. 2018, pp. 257–261.
[363] G. Sviridov, M. Bonola, A. Tulumello, P. Giaccone, A. Bianco, and G. Bianchi, ‘‘Local decisions on replicated states (LOADER) in programmable data planes: Programming abstraction and experimental evaluation,’’ 2020, arXiv:2001.07670. [Online]. Available: http://arxiv.org/abs/2001.07670
[364] S. Luo, H. Yu, and L. Vanbever, ‘‘Swing state: Consistent updates for stateful and programmable data planes,’’ in Proc. Symp. SDN Res., Apr. 2017, pp. 115–121.
[365] J. Xing, A. Chen, and T. S. E. Ng, ‘‘Secure state migration in the data plane,’’ in Proc. Workshop Secure Program. Netw. Infrastruct., Aug. 2020, pp. 28–34.
[366] L. Zeno, D. R. K. Ports, J. Nelson, and M. Silberstein, ‘‘SwiShmem: Distributed shared state abstractions for programmable switches,’’ in Proc. 19th ACM Workshop Hot Topics Netw., Nov. 2020, pp. 160–167.


[367] S. Chole et al., ‘‘dRMT: Disaggregated programmable switching,’’ in Proc. Conf. ACM Special Interest Group Data Commun., 2017, pp. 1–14.
[368] D. Kim, Y. Zhu, C. Kim, J. Lee, and S. Seshan, ‘‘Generic external memory for switch data planes,’’ in Proc. 17th ACM Workshop Hot Topics Netw., Nov. 2018, pp. 1–7.
[369] D. Kim, Z. Liu, Y. Zhu, C. Kim, J. Lee, V. Sekar, and S. Seshan, ‘‘TEA: Enabling state-intensive network functions on programmable switches,’’ in Proc. ACM SIGCOMM Conf., 2020, pp. 90–106.
[370] T. Mai, S. Garg, H. Yao, J. Nie, G. Kaddoum, and Z. Xiong, ‘‘In-network intelligence control: Toward a self-driving networking architecture,’’ IEEE Netw., vol. 35, no. 2, pp. 53–59, Mar. 2021.
[371] Y. Shi, M. Wen, and C. Zhang, ‘‘Incremental deployment of programmable switches for sketch-based network measurement,’’ in Proc. IEEE Symp. Comput. Commun. (ISCC), Jul. 2020, pp. 1–7.
[372] J. Cao, Y. Zhou, Y. Liu, M. Xu, and Y. Zhou, ‘‘TurboNet: Faithfully emulating networks with programmable switches,’’ in Proc. IEEE 28th Int. Conf. Netw. Protocols (ICNP), Oct. 2020, pp. 1–11.
[373] S. Chole, A. Fingerhut, S. Ma, A. Sivaraman, S. Vargaftik, A. Berger, G. Mendelson, M. Alizadeh, S.-T. Chuang, I. Keslassy, A. Orda, and T. Edsall, ‘‘dRMT: Disaggregated programmable switching,’’ in Proc. Conf. ACM Special Interest Group Data Commun., Aug. 2017, pp. 1–14.
[374] R. Pagh and F. F. Rodler, ‘‘Cuckoo hashing,’’ BRICS Rep. Ser., vol. 8, no. 32, p. 122, Aug. 2001.
[375] M. Baldi, ‘‘DaPIPE a data plane incremental programming environment,’’ in Proc. ACM/IEEE Symp. Archit. Netw. Commun. Syst. (ANCS), Sep. 2019, pp. 1–6.
[376] R. Amin, M. Reisslein, and N. Shah, ‘‘Hybrid SDN networks: A survey of existing approaches,’’ IEEE Commun. Surveys Tuts., vol. 20, no. 4, pp. 3259–3306, 4th Quart., 2018.
[377] J. Zhang and A. Moore, ‘‘Traffic trace artifacts due to monitoring via port mirroring,’’ in Proc. Workshop End-End Monitor. Techn. Services, May 2007, pp. 1–8.
[378] SONiC. (2020). Software for Open Networking in the Cloud. [Online]. Available: https://azure.github.io/SONiC/
[379] S. Choi, B. Burkov, A. Eckert, T. Fang, S. Kazemkhani, R. Sherwood, Y. Zhang, and H. Zeng, ‘‘FBOSS: Building switch software at scale,’’ in Proc. Conf. ACM Special Interest Group Data Commun., Aug. 2018, pp. 342–356.
[380] L. Linguaglossa, S. Lange, S. Pontarelli, G. Rétvári, D. Rossi, T. Zinner, R. Bifulco, M. Jarschel, and G. Bianchi, ‘‘Survey of performance acceleration techniques for network function virtualization,’’ Proc. IEEE, vol. 107, no. 4, pp. 746–764, Apr. 2019.
[381] P. Shantharama, A. S. Thyagaturu, and M. Reisslein, ‘‘Hardware-accelerated platforms and infrastructures for network functions: A survey of enabling technologies and research studies,’’ IEEE Access, vol. 8, pp. 132021–132085, 2020.
[382] N. McKeown. Creating an End-to-End Programming Model for Packet Forwarding. Accessed: Jun. 1, 2021. [Online]. Available: https://lwn.net/Articles/828056/
[392] N. Feamster and J. Rexford, ‘‘Why (and how) networks should run themselves,’’ 2017, arXiv:1710.11583. [Online]. Available: http://arxiv.org/abs/1710.11583
[393] D. D. Clark, C. Partridge, J. C. Ramming, and J. T. Wroclawski, ‘‘A knowledge plane for the Internet,’’ in Proc. Conf. Appl., Technol., Archit., Protocols Comput. Commun. (SIGCOMM), 2003, pp. 3–10.
[394] A. Mestres, A. Rodriguez-Natal, J. Carner, P. Barlet-Ros, E. Alarcón, M. Solé, V. Muntés-Mulero, D. Meyer, S. Barkai, M. J. Hibbett, G. Estrada, K. Ma’ruf, F. Coras, V. Ermagan, H. Latapie, C. Cassar, J. Evans, F. Maino, J. Walrand, and A. Cabellos, ‘‘Knowledge-defined networking,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 47, no. 3, pp. 2–10, Sep. 2017.

ELIE F. KFOURY (Graduate Student Member, IEEE) is currently pursuing the Ph.D. degree with the College of Engineering and Computing, University of South Carolina, USA. He previously worked as a research and teaching assistant in the computer science and ICT departments at the American University of Science and Technology in Beirut. He is a member of the CyberInfrastructure Laboratory (CI Lab), where he developed training materials for virtual labs on high-speed networks, TCP congestion control, WAN, performance measuring, buffer sizing, cybersecurity, and routing protocols. His research interests include telecommunications, network security, blockchain, the Internet of Things (IoT), and P4 programmable switches.

JORGE CRICHIGNO (Member, IEEE) received the Ph.D. degree in computer engineering from The University of New Mexico, Albuquerque, USA, in 2009. He is currently an Associate Professor with the College of Engineering and Computing, University of South Carolina (USC), and the Director of the Cyberinfrastructure Laboratory, USC. His work has been funded by private industry and U.S. agencies, such as the National Science Foundation (NSF), the Department of Energy, and the Office of Naval Research (ONR). He has over 15 years of experience in the academic and industry sectors. His research interests include P4 programmable switches, implementation of high-speed networks, network security, TCP optimization, offloading functionality to programmable switches, and IoT devices.

ELIAS BOU-HARB (Senior Member, IEEE)
[383] J. Krude, J. Hofmann, M. Eichholz, K. Wehrle, A. Koch, and M. Mezini, from Concordia University, Montreal, Canada,
‘‘Online reprogrammable multi tenant switches,’’ in Proc. 1st ACM which was executed in collaboration with Pub-
CoNEXT Workshop Emerg. Netw. Comput. Paradigms (ENCP), 2019, lic Safety Canada, Industry Canada, and NCFTA
pp. 1–8. Canada. He was a Senior Research Scientist with
[384] D. Hancock and J. van der Merwe, ‘‘HyPer4: Using P4 to virtualize the Carnegie Mellon University (CMU), where he
programmable data plane,’’ in Proc. 12th Int. Conf. Emerg. Netw. Exp. contributed to federally-funded projects related to
Technol., Dec. 2016, pp. 35–49. critical infrastructure security and worked closely
[385] T. Issariyakul and E. Hossain, ‘‘Introduction to network simulator with the Software Engineering Institute (SEI).
2 (NS2),’’ in Introduction to Network Simulator NS2. Boston, MA,
He is currently the Director of the Cyber Center for Security and Analytics,
USA: Springer, 2009, pp. 1–18.
[386] Stanford. Reproducing Network Research. Accessed: Jun. 1, 2021. UTSA, where he leads, co-directs, and co-organizes university-wide inno-
[Online]. Available: https://reproducingnetworkresearch.wordpress.com/ vative cyber security research, development, and training initiatives. He is
[387] Mininet. An Instant Virtual Network on Your Laptop (or Other PC). also an Associate Professor with the Department of Information Systems and
Accessed: Jun. 1, 2021. [Online]. Available: http://mininet.org/ Cyber Security specializing in operational cyber security and data science as
[388] N. Handigol, B. Heller, V. Jeyakumar, B. Lantz, and N. McKeown, applicable to national security challenges. He is also a Permanent Research
‘‘Reproducible network experiments using container-based emulation,’’ Scientist with the National Cyber Forensic and Training Alliance (NCFTA)
in Proc. 8th Int. Conf. Emerg. Netw. Exp. Technol. (CoNEXT), 2012, of Canada, an international organization which focuses on the investigation
pp. 253–264. of cyber-crimes impacting citizens and businesses. He has authored more
[389] H. Kim, X. Chen, J. Brassil, and J. Rexford, ‘‘Experience-driven research than 90 refereed publications in leading security and data science venues,
on programmable networks,’’ ACM SIGCOMM Comput. Commun. Rev., has acquired state and federal cyber security research grants valued at more
vol. 51, no. 1, pp. 10–17, Jan. 2021.
than $4M. His research and development activities and interests include
[390] Princeton. P4Campus: Framework, Applications, and Artifacts.
Accessed: Jun. 1, 2021. [Online]. Available: https://p4campus. operational cyber security, attacks’ detection and characterization, malware
cs.princeton.edu/ investigation, cyber security for critical infrastructure, and big data analytics.
[391] H. Kim and N. Feamster, ‘‘Improving network management with software He was a recipient of five best research paper awards, including the presti-
defined networking,’’ IEEE Commun. Mag., vol. 51, no. 2, pp. 114–119, gious ACM’s Best Digital Forensics Research Paper.
Feb. 2013.

