3452296.3472905
ABSTRACT Traditionally, the data plane has been designed with fixed functions to forward packets using a
small set of protocols. This closed-design paradigm has limited the capability of the switches to proprietary
implementations which are hard-coded by vendors, inducing a lengthy, costly, and inflexible process.
Recently, data plane programmability has attracted significant attention from both the research community
and the industry, permitting operators and programmers in general to run customized packet processing
functions. This open-design paradigm is paving the way for an unprecedented wave of innovation and exper-
imentation by reducing the time of designing, testing, and adopting new protocols; enabling a customized,
top-down approach to develop network applications; providing granular visibility of packet events defined
by the programmer; reducing complexity and enhancing resource utilization of the programmable switches;
and drastically improving the performance of applications that are offloaded to the data plane. Despite the
impressive advantages of programmable data plane switches and their importance in modern networks,
the literature has been missing a comprehensive survey. To this end, this paper provides a background encom-
passing an overview of the evolution of networks from legacy to programmable, describing the essentials
of programmable switches, and summarizing their advantages over Software-defined Networking (SDN)
and legacy devices. The paper then presents a unique, comprehensive taxonomy of applications developed
with P4 language; surveying, classifying, and analyzing more than 200 articles; discussing challenges and
considerations; and presenting future perspectives and open research issues.
INDEX TERMS Programmable switches, P4 language, Software-defined Networking, data plane, custom
packet processing, taxonomy.
The associate editor coordinating the review of this manuscript and approving it for publication was Petros Nicopolitidis.
1 The RFC and VXLAN observations are extracted from Dr. McKeown’s presentation in [1].
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
87094 VOLUME 9, 2021
E. F. Kfoury et al.: Exhaustive Survey on P4 Programmable Data Plane Switches
II. RELATED SURVEYS
The advantages of programmable switches attracted considerable attention from the research community. They were described in previous surveys.
Stubbe [35] discussed various P4 compilers and interpreters in a short survey. This work provided a short background on the P4 language and demonstrated the main building blocks that describe packet processing in a programmable switch. It outlined reference hardware and software programmable switch implementations. The survey lacks critical discussions on the evolution of programmable switches, the features of the P4 language, the existing applications, challenges, and the potential future work.
Dargahi et al. [36] focused on stateful data planes and their security implications. There are two main objectives of this survey. First, it introduces the reader to recent trends and technologies pertaining to stateful data planes. Second, it discusses relevant security issues by analyzing selected use cases. The scope of the survey is not limited to P4 for programming the data plane. Instead, it describes other schemes such as OpenState [44], Flow-level State Transitions (FAST) [45], etc. When reviewing the security properties of stateful data planes, the authors described a mapping between potential attacks and corresponding vulnerabilities. The survey lacks critical discussions on the P4 language and its features, the existing applications beyond security, the challenges, and the potential future work.
Cordeiro et al. [37] discussed the evolution of SDN from OpenFlow to data plane programmability. The survey briefly explained the layout of a P4 program and how it is mapped to the abstract forwarding model. It then listed various compilers, tools, simulators, and frameworks for P4 development. The authors categorized the literature into two categories: 1) programmable security and dependability management; 2) enhanced accounting and performance management. In the first category, the authors listed works pertaining to policy modeling, analysis, and verification, as well as intrusion detection and prevention, and network survivability. In the second category, the authors focused on network monitoring, traffic engineering, and load balancing. The survey only lists a limited set of papers without providing many details or explaining how the papers differ from each other. Moreover, the survey was published in 2017, and therefore misses a significant percentage of the P4-related works published since then.
Satapathy [38] presented a limited description about the pitfalls of traditional networks and the evolution of SDN. The report briefly described elements of the P4 language. The authors then discussed the control plane and P4Runtime [46], and enumerated three use cases of P4 applications. The report concludes with potential future work. This work lacks critical discussions on the P4 language and its features, the existing applications, and challenges.
The short survey presented by Bifulco and Rétvári [39] reviews the trends and issues of abstractions and architectures that realize programmable networks. The authors discussed the motivation of packet processing devices in the networking field and described the anatomy of a programmable switch. The proposed taxonomy categorizes the literature
as state-based, abstraction-based, implementation-based, and layer-based. The layer-based category consists of the control/intent layer and the data plane layer; the implementation-based category encompasses software and hardware switches; the abstraction-based category includes data flow graph and match-action pipelines; and the state-based category differentiates between stateful and stateless data planes. This short survey lacks critical discussions on the existing P4 applications.
Kaljic et al. [40] presented a survey on data plane flexibility and programmability in SDN networks. The authors evaluated data plane architectures through several definitions of flexibility and programmability. In general, flexibility in SDN refers to the ability of the network to adapt its resources (e.g., changes in the topology or the network requirements). Afterwards, the authors identified key factors that influence the deviation from the original data plane given with OpenFlow. The survey concludes with future research directions.
Kannan and Chan [41] presented a short survey related to the evolution of programmable networks. This work described the pre-SDN model and the evolution to SDN and programmable data plane. The authors highlighted some features of programmable switches such as stateful processing, accurate timing information, and flexible packet cloning and recirculation. The survey categorized data plane applications into two categories, namely, network monitoring and in-network computing. While this survey listed a considerable number of papers belonging to these categories, it barely explained the operation and main ideas of each paper. Also, it lacks many other categories that are relevant in the programmable data plane context.
Tan et al. [42] presented a survey describing In-band Network Telemetry (INT). The survey explained the development stages and classifications of network measurement (traditional, SDN-based, and P4-based). It also outlined some existing applications that leverage INT such as congestion control, troubleshooting, etc. The survey concludes with discussions and potential future work related to INT.
Zhang et al. [43] presented a survey that focuses on stateful data planes. The survey starts with an overview of stateless and stateful data planes, then overviews and compares some stateful platforms (e.g., OpenState, FAST, FlowBlaze, etc.). The paper reviews a handful of stateful data plane applications and discusses challenges and future perspectives.
Table 1 summarizes the topics and the features described in the related surveys. It also highlights how this paper differs from the existing surveys. All previous surveys lack a microscopic comparison between the intra-category works. Also, none of them compare switch-based schemes against legacy server-based schemes. To the best of the authors’ knowledge, this work is the first to exhaustively explore the whole programmable data plane ecosystem. Specifically, the paper describes P4 switches and provides a detailed taxonomy of applications using P4 switches. It categorizes and compares the applications within each category as well as with legacy approaches, and provides challenges and future perspectives.
III. TRADITIONAL CONTROL PLANE AND SDN
A. TRADITIONAL AND SDN DEVICES
With traditional devices, networks are connected using protocols such as Open Shortest Path First (OSPF) and Border Gateway Protocol (BGP) [47] running in the control plane at each device. Both control and data planes are under full control of vendors. On the other hand, SDN delineates a clear separation between the control plane and the data plane, and consolidates the control plane so that a single centralized controller can control multiple remote data planes. The controller is implemented in software, under the control of the network owner. The controller computes the tables used by each switch and distributes them via a well-defined Application
FIGURE 6. Taxonomy of programmable switches literature based upon relevant, explored research areas.
clear separation of categories so that a reader interested in a specific discipline can only read the works pertaining to the
FIGURE 7. In-band Network Telemetry (INT).
FIGURE 8. Example of how INT can be used to provide the path traversed by a packet in the network. The INT source inserts its label (S1) as well as the INT headers to instruct subsequent switches about the required operations (i.e., push their labels). Finally, switch S4 strips the INT headers from the packet and forwards them to a collector, while forwarding the original packet to the receiver.
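The path-tracing behavior described in the Figure 8 caption can be sketched in a few lines of Python. This is only an illustrative model, not the INT specification or a P4 program: the `IntHeader` and `Packet` classes, the `push_label` instruction name, and the helper functions are hypothetical, and the sketch assumes that the sink pushes its own label before stripping the headers.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class IntHeader:
    """Hypothetical INT header: one instruction plus the labels pushed so far."""
    instruction: str = "push_label"
    labels: List[str] = field(default_factory=list)


@dataclass
class Packet:
    payload: bytes
    int_header: Optional[IntHeader] = None


def int_source(pkt: Packet, label: str) -> Packet:
    """First switch: attach the INT header and record its own label."""
    pkt.int_header = IntHeader(labels=[label])
    return pkt


def int_transit(pkt: Packet, label: str) -> Packet:
    """Intermediate switch: follow the instruction and push its label."""
    if pkt.int_header is not None and pkt.int_header.instruction == "push_label":
        pkt.int_header.labels.append(label)
    return pkt


def int_sink(pkt: Packet, label: str) -> Tuple[List[str], Packet]:
    """Last switch: push its label (an assumption), then strip the INT header.

    Returns the report for the collector and the original packet for the receiver.
    """
    pkt = int_transit(pkt, label)
    report = pkt.int_header.labels if pkt.int_header is not None else []
    pkt.int_header = None
    return report, pkt


def trace_path(payload: bytes, switches: List[str]) -> Tuple[List[str], Packet]:
    """Send one packet through the listed switches (at least a source and a sink)."""
    pkt = int_source(Packet(payload), switches[0])
    for label in switches[1:-1]:
        pkt = int_transit(pkt, label)
    return int_sink(pkt, switches[-1])
```

For example, `trace_path(b"data", ["S1", "S2", "S3", "S4"])` returns the label list S1 through S4 for the collector, together with a packet that no longer carries INT headers.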
tration problem in INT, which is associated with the optimal use of network resources for collecting the state and behavior of forwarding devices through INT. Niu et al. [59] proposed multilayer INT (ML-INT), a system that visualizes IP-over-optical networks in real time. The proposed system encodes INT headers in a subset of packets pertaining to an IP flow. The encoded headers contain metadata that describes statistics of electrical and optical network elements on the flow’s routing path. Ben Basat et al. [61] proposed Probabilistic INT (PINT), an approach that probabilistically adds telemetry information into a collection of packets to minimize the per-packet overhead associated with regular INT. Hyun et al. [55] proposed an architecture for self-driving networks that uses INT to collect packet-level network telemetry, and Knowledge-Defined Networking (KDN) to add intelligence to network management, considering the collected telemetry data. KDN accepts the network information as input and generates policies to improve the network performance. Karaagac et al. [60] extended INT from wired networks to wireless networks.
4) INT VARIATIONS, COMPARISON, AND DISCUSSIONS
Table 4 compares the aforementioned INT variation solutions. The main motivation behind these solutions is that the majority of applications that leverage INT (e.g., congestion control, fast reroute) only require approximations of the telemetry data and therefore do not need to gather per-packet per-hop INT information. NetVision, NetView, and INT-Path use probing to reduce the overhead of INT. The main limitation of such approaches is that probing might result in poor accuracy and timeliness, as the probes might experience different network conditions than actual packets. All other works collect INT information passively. The work in [55] and sINT select flows based on current network conditions, ML-INT uses a fixed sampling scheme to select a small portion of packets in a flow, and PINT uses a probabilistic approach to encode telemetry on multiple packets. Note that sampling and anomaly-based monitoring might lead to information loss since not all packets are being reported.
Some solutions require manual intervention from the operators to configure the telemetry process. The simplicity of the configuration interface is vital to make the solution easily deployable. Furthermore, some solutions (e.g., NetView, INT-Path) achieve network-wide telemetry. Note that network-wide traffic monitoring incurs additional overhead since multiple switches are being monitored at the same time. Finally, some solutions were implemented on software switches, while others were implemented on hardware. It is important to note that not all software implementations can fit into the pipeline of the hardware.
5) INT, PBT, AND TRADITIONAL TELEMETRY COMPARISON
Table 5 compares INT, PBT, and traditional telemetry. INT is potentially more vulnerable than PBT to attacks such as eavesdropping and tampering. Adding extra protective measures (e.g., encryption) is difficult on the fast data path. On the other hand, PBT packets tolerate additional processing to enhance security. The flow tracking process is simpler with INT than with PBT. The latter requires the server receiving reports (i.e., INT collector, explained in Section VI-C) to correlate multiple postcards of a single flow packet passing through the network, to form the packet history at the monitor. This process also adds delay in reporting and tracking. Legacy schemes that rely on sampling and polling suffer from accuracy issues, especially when links are congested. INT on the other hand is push-based, has better accuracy, and is more granular (microseconds scale). Reports sent by an INT-capable device contain rich information (e.g., the path a packet took) that can aid in troubleshooting the network. Such visibility is minimal in legacy monitoring schemes. Programmable switches permit reporting telemetry after the occurrence of specific events (e.g., congestion). Moreover,
they provide flexibility in programming reactive logic that executes promptly in the data plane. One drawback of INT is that it imposes bandwidth overhead if configured to report for every packet; however, when event-based reports are considered, the bandwidth overhead significantly decreases.
C. INT COLLECTORS
1) BACKGROUND
An INT collector is a component in the network that processes telemetry reports produced by INT devices. It parses and filters metrics from the collected reports, then optionally stores the results persistently into a database. Since a large number of reports is typically produced in INT, having a high-performance collector is essential to avoid missing important network events. To this end, a number of research works focus on developing and enhancing the performance of INT collectors running on commodity servers. Both open source and closed source INT collectors are proposed in the literature.
2) OPEN-SOURCE
IntMon [63] is an ONOS-based collector application for INT reports. It includes a web-based interface that allows controlling which flows to monitor and the specific metadata to collect. Another INT collector is the Prometheus INT exporter [64], which extracts information from every INT packet and pushes it to a gateway. A database server then periodically pulls information from the gateway. INTCollector [65] is a collector that extracts events, which are important network information, from INT raw data. It uses in-kernel processing to further improve the performance. INTCollector has two processing flows: the fast path, which processes INT reports and needs to execute quickly, and the normal path, which processes events sent from the fast path and stores information in the database.
3) CLOSED-SOURCE
Deep Insight [66] is a proprietary solution provided by Barefoot Networks that leverages INT capabilities to provide services such as real-time anomaly detection, congestion analysis, packet-drop analysis, etc. It follows a pay-as-you-grow business model, where customers pay based on the volume of collected telemetry. Another proprietary solution is BroadView Analytics used on Broadcom Trident 3 devices by Broadcom [67]. This solution enables real-time network latency analysis and facilitates Service Level Agreement (SLA) compliance.
4) INT COLLECTORS COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 6 compares the aforementioned INT collectors. IntMon and Prometheus INT exporter were among the earliest collectors. Both have low processing rates since they are implemented without kernel or hardware acceleration. Also, they are very limited with respect to the features they provide (e.g., lack of event detection, limited analytics, historical data unavailability, etc.). Prometheus INT exporter also suffers from the increased overhead of sending the data for every INT packet to the gateway, and the potential loss of network events as the database only stores the latest data pulled from the gateway. INTCollector on the other hand has a higher processing rate and uses the eXpress Data Path (XDP) [253] to accelerate the packet processing in the kernel space. It filters the data to be published based on significant changes in the network through its event detection mechanism. DeepInsight Analytics has a modular architecture and runs on commodity servers. It executes the Barefoot SPRINT data plane telemetry, which consists of a P4 program (INT.p4) encompassing intelligent triggers. It also provides open northbound RESTful APIs that allow customers to integrate their third-party network management solutions. DeepInsight Analytics is advanced with respect to the features it provides (real-time anomaly detection, congestion analysis, packet-drop analysis, etc.). However, it is a closed-source solution and lacks reports of performance benchmarks.
FIGURE 10. CPU efficiency with the three INT collectors. Source: INTCollector paper [65].
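The two ideas attributed to INTCollector in this section, a fast path that must handle every report and a normal path that only stores events, with events raised on significant changes, can be illustrated with a minimal Python sketch. It is not INTCollector's actual code: the class name, the per-hop latency metric, the `(flow, switch)` key, and the 50-microsecond threshold are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

FlowHop = Tuple[str, str]  # (flow id, switch id); a hypothetical report key


class EventDetectingCollector:
    """Toy collector: publish a metric only when it changes significantly."""

    def __init__(self, threshold_us: int = 50) -> None:
        # Assumed event rule: a change of at least threshold_us microseconds.
        self.threshold_us = threshold_us
        self.last_published: Dict[FlowHop, int] = {}
        self.database: List[Tuple[str, str, int]] = []

    def fast_path(self, flow: str, switch: str, hop_latency_us: int) -> bool:
        """Runs for every INT report; does constant work and filters most reports."""
        key = (flow, switch)
        previous = self.last_published.get(key)
        if previous is not None and abs(hop_latency_us - previous) < self.threshold_us:
            return False  # no significant change, so nothing is published
        self.last_published[key] = hop_latency_us
        self.normal_path(flow, switch, hop_latency_us)
        return True

    def normal_path(self, flow: str, switch: str, hop_latency_us: int) -> None:
        """Runs only for events; stands in for the slower database write."""
        self.database.append((flow, switch, hop_latency_us))


def replay(latencies_us: List[int], threshold_us: int = 50) -> List[int]:
    """Feed one flow's per-report latencies and return the stored values."""
    collector = EventDetectingCollector(threshold_us)
    for value in latencies_us:
        collector.fast_path("flow-1", "S1", value)
    return [row[2] for row in collector.database]
```

With the assumed threshold, replaying the latencies 100, 110, 120, 200, 205 stores only 100 and 200, which shows how event detection reduces the number of database writes compared with storing every report.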
Fig. 10 demonstrates the CPU efficiency of three INT collectors (IntMon, Prometheus INT exporter, and INTCollector) [65]. IntMon has the lowest throughput, and is 57 times slower than Prometheus INT exporter. INTCollector on the other hand has the highest throughput and is 27 times faster than Prometheus INT exporter.
5) COLLECTORS IN INT AND LEGACY MONITORING SCHEMES COMPARISON
Generally, collectors used with both INT and legacy monitoring schemes run on general purpose CPUs, and hence, have comparable performance. INT produces excessive amounts of reports when compared with legacy monitoring schemes (e.g., NetFlow), and therefore requires a collector with high processing capability. INT-based collectors are typically accelerated with in-kernel fast packet processing technologies (e.g., XDP) and kernel-bypass packet processing frameworks (e.g., Data Plane Development Kit (DPDK)).
D. SUMMARY AND LESSONS LEARNED
Legacy telemetry tools and protocols are not capable of capturing microbursts or providing fine-grained telemetry measurements. INT was developed to address these challenges; it enables the data plane developer to query the internal state of switches with high precision. Telemetry data are then embedded into packets and forwarded to a high-performance collector. The collector typically performs analysis and applies actions accordingly (e.g., informs the control plane to update table entries). Current research efforts mainly focus on developing variations of INT to decrease its telemetry traffic overhead, considering the overhead-accuracy trade-off. Other works aim at accelerating INT collectors to handle large volumes of traffic (in the scale of Kpps). Future work could possibly investigate further improvements for INT such as compressing packets’ headers, broadening coverage and visibility, enriching the telemetry information, and simplifying the deployment.
VII. NETWORK PERFORMANCE
Measuring and improving network performance is critical in today’s infrastructures. Low latency and high bandwidth are key requirements to operate modern applications that continuously generate enormous amounts of data [254]. Congestion control (CC), which aims at avoiding network overload, is critical to meet these requirements. Another important concept for expediting these applications is managing the queues that form in routers and switches through Active Queue Management (AQM) algorithms. This section explores the literature related to measuring and improving the performance of programmable networks.
such as high-speed links, traffic diversity and burstiness, and buffer sizes [68]. Today’s CC algorithms aim at shortening delays, maximizing throughput, and improving the fairness and utilization of network resources.
A tremendous amount of research work has been done on congestion control, including end-host algorithms such as loss-based CC algorithms (e.g., CUBIC [256], Hamilton TCP (HTCP) [257], etc.), model-based algorithms (e.g., Bottleneck Bandwidth and Round-trip Time (BBR) [258], [259]), congestion-signalling mechanisms (e.g., Explicit Congestion Notification (ECN) [260]), data-center specific schemes (e.g., TIMELY [261], Data Center Quantized Congestion Notification (DCQCN) [262], Data Center TCP (DCTCP) [263], pFabric [264], Performance-oriented Congestion Control (PCC) [265], etc.), and application-specific schemes (e.g., QUIC [266]).
With the advent of programmable data plane switches, researchers are investigating new methods for managing congestion. Such methods can be classified as 1) hybrid CC, where network-assisted congestion feedback is provided for end-hosts; and 2) in-network CC, where the switch performs traffic rerouting, steering, or other congestion control techniques, without modifications on end hosts.
2) HYBRID CC
Handley et al. [68] proposed NDP, a novel protocol architecture for datacenters that aims at achieving low completion latency for short flows and high throughput for longer flows. NDP avoids core network congestion by applying per-packet multipath load balancing, which comes at the cost of reordering. It also trims the payloads of packets, similar to what is done in Cut Payload (CP) [267], whenever the queues of the switches become saturated. Once the payload is trimmed, the headers are forwarded using high-priority queues. Consequently, a Negative ACK (NACK) is generated and sent through high-priority queues so that a retransmission is sent before draining the low-priority queue. Similarly, Feldmann et al. [69] proposed a method that uses network-assisted congestion feedback (NCF) in the form of NACKs generated entirely in the data plane. NACKs are sent to throttle elephant-flow senders in case of congestion. The method maintains three separate queues for mice flows, elephant flows, and control packets to ensure fair sharing of resources.
Li et al. [70] proposed High Precision Congestion Control (HPCC), a new CC mechanism that leverages INT-based data added by P4 switches to obtain precise link load information. HPCC computes an accurate flow rate by using only one rate update, as opposed to legacy approaches that require a
large number of iterations to determine the rate. HPCC provides near-zero queueing, while being almost parameterless. Fig. 11 shows the mechanism of HPCC. The switches add INT headers to every packet, and then the INT information is piggybacked into the TCP/RDMA Acknowledgement (ACK) packet. The end-hosts then use this information to adjust the sending rate through their smart Network Interface Controllers (NICs).
Kfoury et al. [71] proposed a P4-based method to automate end-hosts’ TCP pacing. It supplies the bottleneck bandwidths and the number of elephant flows to senders so that they can pace their rates to safe targets, avoiding filling routers’ buffers. Shahzad et al. [72] proposed EECN, a system that uses ECN to signal the occurrence of congestion to the sender without involving the receiver. This is especially useful for networks with high bandwidth-delay product (BDP).
3) IN-NETWORK CC
Turkovic et al. [73] proposed a P4-based method that reroutes flows to backup paths during congestion. The system detects congestion by continuously monitoring the queueing delays of latency-critical flows. The same authors [74] proposed a method that separates the senders based on their congestion control algorithm. Each congestion control algorithm uses a separate queue in order to enforce fairness among its competing flows. Apostolaki et al. [75] proposed FAB, a flow-aware and device-wide buffer sharing scheme. FAB prioritizes flows from port-level to the device-level. The goal of FAB is to minimize the flow completion time for short flows in specific workloads. Geng et al. [76] proposed P4QCN, a flow-level, rate-based congestion control mechanism that improves the Quantized Congestion Notification (QCN). P4QCN improves QCN by alleviating the problems of Priority-based Flow Control (PFC) within a lossless network. Furthermore, P4QCN extends the QCN protocol to IP-routed networks.
4) CC SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 7 compares the aforementioned CC schemes. NDP and NCF are similar in the sense that both use NACKs as congestion feedback. NDP avoids congestion by applying per-packet multipath load balancing. This approach works adequately with symmetric topologies, but fails when topologies are asymmetric (e.g., BCube, Jellyfish), especially during heavy network load. Another limitation of NDP is the excessive retransmissions produced by the server. NCF adopted the idea of packet trimming from NDP, but generates NACKs from the trimmed packets and sends them directly to the sender. Such an approach removes the receiver from the feedback loop, improving the sender’s reaction time. One limitation of NCF is that it requires operators to manually tune some of the predefined parameters (e.g., threshold, queue size, etc.). Additionally, NCF might disclose network congestion information, making it less attractive to operators. Finally, the authors of NCF claim that the approach works with both datacenters and Internet-wide scenarios. However, no implementation results were presented to evaluate the effectiveness of the solution.
HPCC leverages INT data to control network congestion. It enhances the convergence time by using a Multiplicative-Increase Multiplicative-Decrease (MIMD) scheme. Note that previous TCP variants use the Additive-Increase Multiplicative-Decrease (AIMD) scheme, which is conservative when increasing the rate, and hence has a slow convergence time. The reason AIMD schemes are slow is that they use single-bit congestion information (packet loss, ECN). With HPCC, end-hosts can perform aggressive increase as INT metadata encompasses precise link utilization and timely queue statistics. HPCC demonstrated promising results with respect to latency, bandwidth, and convergence time. The authors however did not evaluate the performance of HPCC with conventional congestion control algorithms in the Internet (e.g., CUBIC, BBR). Note that achieving inter-protocol fairness is essential so that the solution is adopted by operators.
The method in [71] uses TCP pacing. Pacing decreases throughput variations and traffic burstiness, and hence, minimizes queuing delays. However, this method works well only in networks where the number of large-flow senders is small
(e.g., in science Demilitarized Zone (DMZ) [254]). Further, it is worth mentioning that methods which provide congestion feedback to end hosts must implement some security mechanisms to prevent packets from being modified.
As for the full in-network CC schemes, P4Air, which applies traffic separation, demonstrated significant improvements in fairness compared to contemporary solutions. However, it requires allocating a queue for each congestion control algorithm group (e.g., loss-based (CUBIC), delay-based (TCP Vegas), etc.). Note that the number of queues is limited in switches, and production networks often reserve them for other applications’ QoS [70]. P4QCN is not evaluated on hardware targets, and therefore its results (which were obtained on software switches) are not that indicative.
5) END-HOSTS, PROGRAMMABLE SWITCHES, AND LEGACY DEVICES’ CC SCHEMES
Table 8 compares the CC schemes assisted by programmable switches (e.g., HPCC) with end-host CC algorithms (e.g., CUBIC) and legacy congestion signalling schemes (e.g., ECN). End-host CC algorithms infer congestion through packet drops and estimations (e.g., btlbw and Round-trip Time (RTT) estimation with BBR), which is not always sufficient to infer the existence of congestion. Legacy devices use classic ECN to signal congestion so that end-hosts slow down their transmission rates. Classic ECN is limited as it only marks a single bit to signal congestion, and is neither aggressive nor immediate. Programmable switches on the other hand use fine-grained prompt measurements to signal congestion (e.g., INT metadata), which results in higher detection accuracy, near-zero queueing delays, and faster convergence time. The distributed nature of end-host CC schemes allows them to operate without modifying the network infrastructure and without tweaking parameters. ECN-enabled devices and programmable switches on the other hand require a few parameters (e.g., marking threshold) to adapt to different network conditions.
TABLE 8. Congestion control schemes. 1) Programmable Switches (HPCC); 2) end-hosts; and 3) legacy network-assisted (ECN).
B. MEASUREMENTS
1) BACKGROUND
Gaining an overall understanding of the network behavior is an increasingly complex task, especially when the size of the network is large and the bandwidth is high. Legacy measurement schemes have accuracy limitations since they rely on polling and sampling-based methods to gather traffic statistics. Typically, sampling methods have coarse sampling rates (e.g., one in every 30,000 packets) and polling methods have large polling intervals. The literature [268] has shown that such methods are only suitable for coarse-grained visibility. The accuracy limitation of sampling and polling techniques hampers the development of measurement applications. For instance, it is not possible to accurately measure frequently changing TCP-specific fields such as congestion window, receive window, and sending rate.
Data streaming or sketching algorithms [269]–[272] were proposed to address the limitations of sampling and polling. They address the following problem: an algorithm is allowed to perform a constant number of passes over a data stream (input sequence of items) while using sub-linear space compared to the dataset and the dictionary sizes; desired statistical properties (e.g., median) on the data stream are then estimated by the algorithm. The main problem with such algorithms is that they are tightly coupled to the metrics of interest. This means that switch vendors should build specialized algorithms, data structures, and hardware for specific monitoring tasks. With the constraints of CPU and memory in networking devices, it is challenging to support a wide spectrum of monitoring tasks that satisfy all customers. Legacy devices also lack the capability of customizing the processing behavior so that switches co-operate in the measurement process.
With the emergence of programmable switches, it is now possible to perform fine-grained measurements in the data plane at line rate. Moreover, data structures such as sketches and bloom filters can be easily implemented and customized for specific metrics of interest. Programmable switches pave the way for new areas of research in measurements since they not only provide flexibility in inspecting the traffic statistics with high accuracy, but also allow programmers to express reactive processing in real time (e.g., dropping a packet when a threshold is exceeded, as done in Random Early Detection (RED) [273]).
INT provides path-level metrics, with data similar to that of polling-based techniques. Note that the metrics themselves are fixed; for instance, it is possible to deter-
mine the flow-level latency, but not the latency variation ditions (e.g., bandwidth, packet rate and flow size distri-
(jitter) [79]. The fixed metrics of INT also prevent perform- bution). *Flow [85] supports concurrent measurements and
ing network-wide measurements; note that the INT stan- dynamic queries. Such approach aims at minimizing the con-
dard specification document does not mention methods to currency problems and the network disruption resulting from
aggregate metadata and perform complex analytics in the data compiling excessive queries into the data plane.
plane. TurboFlow [86] aims at achieving high coverage without
This section focuses on techniques that provide measure- sacrificing information richness. Bai et al. [94] proposed
ments that go beyond the fixed metrics extracted from the FastFE, a system that performs traffic features extraction
internal state of the switch. by leveraging programmable data planes. Extracted fea-
tures are then used by traffic analysis and behavior detector
2) GENERIC QUERY-BASED MONITORING ML techniques.
Operators constantly change their monitoring specifications.
Adding new monitoring requirements on the fixed-function 3) PERFORMANCE DIAGNOSIS SYSTEMS
switching ASIC is expensive. Recent work explored the idea Recent works are leveraging programmable data planes to
of providing a query-driven interface that allows operators to diagnose network performance. The main motivation here is
express their monitoring requirements. The queries can then that fine-grained information can be monitored at line rate,
be converted into switch programs (e.g., P4) to be deployed in mitigating the slow reaction to ‘‘gray failures’’ experienced
the network. Alternatively, the queries can be executed on the by diagnosing end-hosts in legacy approaches.
control plane considering the measured information extracted Ghasemi et al. [80] proposed Dapper, an in-network TCP
from the data plane. performance diagnosis system. Dapper analyzes packets in
A simplistic attempt is FlowRadar [77], a system that real time, and identifies and pinpoints the root cause of the
stores counters for all flows in the data plane with low bottleneck (sender, network, or receiver). Blink [90] also
memory footprint, then exports periodically (every 10ms) diagnoses TCP-related issues. In particular, it detects failures
to a remote collector. Liu et al. [78] proposed Universal in the data plane based on retransmissions, and consequently,
Monitoring (UnivMon), an application-agnostic monitoring reroutes traffic. Other approaches attempt to diagnose per-
framework that provides accuracy and generality across a formance degradation manifested by an increase of latency.
wide range of monitoring tasks. UnivMon benefits from Wang et al. [92] proposed SpiderMon, a system that performs
the granularity of the data plane to improve accuracy and network-wide performance degradation diagnosis. The key
runs different estimation algorithms on the control plane. idea is to have every switch maintain fine-grained telemetry
Narayana et al. [79] presented Marple, a query lan- data for a short period of time, and upon detecting per-
guage based on common query constructs (i.e., map, filter, formance degradation (e.g., increased delay), the informa-
group by). Marple allows performing advanced aggregation tion is offloaded to a collector. Liu et al. [89] proposed a
(e.g., moving average of latencies) at line rate in the data memory-efficient approach for network performance mon-
plane. Similarly, Sonata [87] provides a unified query inter- itoring. This solution only monitors the top-k problematic
face that uses common dataflow operators, and partitions flows.
each query across the stream processor and the data plane.
PacketScope [93] also uses dataflow constructs but allows to 4) QUEUE AND OTHER METRICS MEASUREMENT
query the internal switch processing, both in the ingress and Programmable data planes allows querying the internal state
the egress pipelines. of the queue with fine-grained visibility. Recent works lever-
Many of the previous works use the sketch data structure. aged this feature to provide better queueing information
The work in [96] extended the sketching approach used in which can be used by various applications (e.g., AQMs,
previous works to support the notion of time. The motivation congestion control, etc.).
of this work is that recently captured traffic trends are the Chen et al. [88] proposed ConQuest, a P4-based queue
most relevant in network monitoring. Huang et al. [97] measurement solution that determines the size of flows occu-
proposed OmniMon, an architectural design that pying the queue in real time, and identifies flows that are
coordinates flow-level network telemetry operations between grabbing a significant portion of the queue. Joshi et al. [83]
programmable switches, end-hosts, and controllers. Such proposed BurstRadar, a system that uses programmable
coordination aims at achieving high accuracy while main- switches to monitor microbursts in the data plane. Mircor-
taining low resource overhead. Chen et al. [98] proposed bursts are events of sporadic congestion that last for tens
BeauCoup, a P4-based measurement system that handles or hundreds of microseconds. Microbursts increase latency,
multiple heterogeneous queries in the data plane. It offers jitter, and packet loss, especially when links’ speeds are high
a general query abstraction that counts the attributes across and switch buffers are small.
related packets identified by keys, and flags packets that Other works enabled measuring further metric. For
surpass a defined threshold. instance, Ding et al. [91] proposed P4Entropy, an algorithm
Other approaches such as Elastic sketch [81] performs to estimate network traffic entropy (Shannon entropy) in the
measurement that are adaptive to changes in network con- data plane. Tracking entropy is useful for calculating traffic
distribution in order to understand the network behavior. Another example is the system proposed by Chen et al. [95] which passively measures the RTT of TCP traffic in ISP networks. RTT measurement is important for detecting spoofing and routing attacks, ensuring Service Level Agreements (SLAs) compliance, measuring the Quality of Experience (QoE), improving congestion control, and many others.

5) MEASUREMENTS SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 9 compares the measurement schemes.

a: GENERIC QUERY-BASED MONITORING
Some schemes (e.g., Sonata, FlowRadar, UnivMon) performed approximations of the metrics by using probabilistic data structures (e.g., sketch, bloom filter, etc.), sampling methods, and top-k counting. In addition, some focused on a subset of traffic by leveraging event matching techniques. Such techniques are primarily used to achieve high resource efficiency (i.e., low memory footprint), but cannot achieve full accuracy. On the other hand, systems like OmniMon carefully coordinate the collaboration among different types of entities in the network. Such coordination results in efficient resource utilization and full accuracy. OmniMon follows a split-merge strategy where the split operation decomposes telemetry operations into partial operations and schedules them among the entities (switches, end-hosts, and controller), and the merge operation coordinates the collaboration among these entities. The idea is to leverage the strength of the data plane in the switches and end-hosts (i.e., per-flow measurements with high accuracy) and the control plane
FIGURE 12. (a) Traditional measurements with sampling/polling. The switch uses sampling and polling protocols (e.g., NetFlow, SNMP) to generate fixed network flow records. Instead of collecting every packet, sampling collects only one out of every N packets. Records are then exported to an external server for further analysis. (b) Measurements with programmable switches (e.g., UnivMon [78]). The switch runs a universal algorithm over a universal data structure (e.g., universal sketch). The control plane then estimates a wide range of metrics for various applications. Note that this is not the only design possible for measurement tasks with programmable switches. The programmer has the flexibility to use customized algorithms that run at line rate in the data plane. Such algorithms can leverage various data structures in the P4 program (e.g., sketch, bloom filter) to store flow statistics. The switch then pushes statistics reports to the control plane for further analysis and reactive processing.
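As a rough illustration of the contrast drawn in Fig. 12, the following Python sketch compares 1-in-N sampling with per-packet counting. It is not taken from any of the surveyed systems: the function names and the ten-packet trace are hypothetical, and an exact counter stands in for the sketch that a P4 program would keep in switch memory.

```python
from collections import Counter

def sampled_counts(packets, n):
    """Legacy-style sampling: record one packet out of every n and scale by n."""
    counts = Counter()
    for position, flow in enumerate(packets, start=1):
        if position % n == 0:          # only every n-th packet is inspected
            counts[flow] += n          # each sample stands for n packets
    return counts

def per_packet_counts(packets):
    """Data-plane-style measurement: every packet updates the flow statistics."""
    return Counter(packets)

def flows_over_threshold(counts, threshold):
    """Event-style report: flows whose packet count exceeds a threshold."""
    return sorted(flow for flow, c in counts.items() if c > threshold)

# Hypothetical trace: flow "A" sends nine packets, flow "B" sends one.
trace = ["A", "A", "B", "A", "A", "A", "A", "A", "A", "A"]
sampled = sampled_counts(trace, n=5)
exact = per_packet_counts(trace)
print(dict(sampled))                    # {'A': 10}: flow "B" is never sampled
print(dict(exact))                      # {'A': 9, 'B': 1}
print(flows_over_threshold(exact, 5))   # ['A']
```

With this trace, sampling overestimates the large flow and misses the small one entirely, whereas per-packet counting sees both; a sketch would sit between the two, trading a bounded error for a small memory footprint.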
techniques that rely on polling and sampling (e.g., NetFlow). The differences between in-network measurements and polling/sampling-based schemes are closely related to the differences between legacy measurements and INT (see Table 5). For instance, the granularity of the measurements conducted in the data plane is much higher than that of those collected in traditional measurements (e.g., NetFlow). Further, it is not possible to conduct event-based monitoring in legacy approaches, whereas with in-network measurements, the programmer has the flexibility of customizing the monitoring based on conditions and thresholds. Second, there are techniques that rely on sketching or streaming algorithms to estimate the metric of interest. Such methods are tightly coupled with the metric, which forces hardware vendors to invest time and effort in building customized algorithms and data structures that might not be used by various customers. Moreover, with the constraints of routers and switches, it is not possible to implement a variety of monitoring tasks while still supporting the standard routing/switching functionalities. Therefore, such approaches are not scalable for the long run.

With programmable switches, it is possible to customize the monitoring tasks by implementing customized sketching/streaming algorithms as P4 programs. This advantage improves scalability as the operator can always modify the algorithms whenever needed.

C. ACTIVE QUEUE MANAGEMENT (AQM)
1) BACKGROUND
A fundamental component in network devices is the queue which temporarily buffers packets. As data traffic is inherently bursty, routers have been provisioned with large queues to absorb this burstiness and to maintain high link utilization. The majority of delays encountered in a communication session is a result of large backlogs formed in queues. Previous legacy devices are limited in the visibility of the queue as they provide little or no insight about which flows are occupying or sharing the queue [88]. Consequently, researchers have been investigating queue management algorithms to shorten the delay and mitigate packet losses, while providing fairness among flows. AQM is a set of algorithms designed to shorten the queueing delay by prohibiting buffers on devices from becoming full. The undesirable latency that results from a device buffering too much data is known as “Bufferbloat”. Bufferbloat not only increases the end-to-end delay, but also decreases the throughput and increases the jitter of a communication session. Modern AQMs help in mitigating the bufferbloat problem [277]–[280]. Unfortunately, modern AQMs are typically not available in state-of-the-art network equipment; for instance, Controlled Delay (CoDel) AQM, which was proposed in 2013, and was proven in the literature to be effective in mitigating Bufferbloat [281], is still not available in most network equipment. With programmable switches, it is now possible to implement AQMs as P4 programs, which not only accelerates support for new AQMs, but also provides means to customize their parameters programmatically in response to network traffic. Moreover, programmable switches foster innovation on newer AQMs that can be easily implemented and rapidly tested.

2) STANDARDIZED AQMs IMPLEMENTATION
Kundel et al. [99] implemented the CoDel queueing discipline on a programmable switch. CoDel eliminates Bufferbloat, even in the presence of large buffers [100]. Sharma et al. [101] proposed Approximate Fair Queueing (AFQ), a mechanism built on top of programmable switches that approximates fair queueing at line rate. Fair Queueing (FQ) aims at fairly dividing the bandwidth allocation among active flows. Laki et al. [102] described an AQM evaluation testbed with P4 in a demo paper. The authors tested the framework with two AQMs: Proportional Integral Controller Enhanced (PIE) and RED. Papagianni and De Schepper [103] implemented Proportional Integral PI2 AQM on a programmable switch. PI2 is an extension of PIE AQM to support coexistence between classic and scalable congestion controls in the public Internet. Kunze et al. [104] analyzed the implementation details of three AQMs, namely, RED, CoDel, and PIE on a hardware programmable switch (Tofino). Toresson [105] implemented a combination of PIE and the Per-Packet Value (PPV) concept on a programmable switch.

3) CUSTOM AQMs
Mushtaq et al. [106] approximated Shortest Remaining Processing Time (SRPT) on a programmable switch. Their method, which they refer to as Approximate and Deployable SRPT (ADS), was evaluated and shown to achieve performance close to SRPT. Menth et al. [107] implemented activity-based congestion management (ABC) on programmable switches. ABC aims at ensuring fair resource sharing as well as improving the completion times of short flows. Alcoz et al. [108] proposed SP-PIFO, a method that approximates Push-In First-Out (PIFO) queues on programmable data planes. The method consists of an adaptive scheduling algorithm that dynamically adapts the mapping between packet ranks and Strict Priority (SP) queues. Kumazoe and Tsuru [109] implemented the MTQ/QTL scheme on P4.

4) AQM SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 10 compares the aforementioned AQM schemes. Some schemes require tuning a number of parameters and thresholds so that they operate well in certain network conditions. It is worth mentioning that a scheme becomes hard to manage and less autonomous when the number of parameters and thresholds is high.

Some schemes are simple to implement in the data plane. CoDel’s algorithm can be easily expressed in the data plane as it consists of comparisons, counting, basic arithmetic, and dropping packets. Similarly, PI2 is simple to implement as it is mostly based on basic bit manipulations. FQ algorithms, on the other hand, are difficult to implement on hardware as they require complex flow classification, per-packet scheduling, and buffer allocation. Such requirements make FQ algorithms expensive to implement on high-speed devices. AFQ aims at approximating fair queueing by using programmable switches’ features such as mutating switch state, performing basic calculations, and selecting the egress queue of a packet. AFQ’s operations can be summarized as follows: 1) per-flow state, which includes the number and timing information of the previous packet pertaining to that flow, is approximated; 2) the position of each packet in the output schedule is determined; 3) the egress queue to use is selected; and 4) the packet is dequeued based on the approximate sorted order. Note that AFQ uses a probabilistic data structure (count-min sketch) since it only approximates the states, and uses multiple queues in its implementation.

5) AQMs ON PROGRAMMABLE SWITCHES AND FIXED-FUNCTION DEVICES
Inventing novel AQMs that control queueing delay, mitigate bufferbloat, and achieve fairness with different network conditions (e.g., short/long RTTs, lossy networks, WANs) is an active research area. Typically, new AQMs are implemented and tested in software (e.g., as a Linux queueing discipline (qdisc) used with traffic control (tc)), which is limited when the objective is to deploy the AQMs on production networks. With programmable switches, AQMs are implemented in P4 programs, which fosters innovation and enhances testing with production networks. Additionally, operators can create their own customized AQMs that perform efficiently with their typical network traffic.

Historically, deploying AQMs on network devices is a lengthy and costly process; once an effective AQM is published and thoroughly tested, equipment vendors start investigating whether it is feasible to implement it on future devices. Such a process might take years to finish, and by then, new network conditions evolve, requiring new AQMs. With programmable switches, this process is cost-efficient and relatively fast (can be completed in weeks).
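The four AFQ operations summarized in the comparison above can be illustrated with a toy Python model. This is a simplified sketch under stated assumptions, not the implementation from [101]: the class names, the hash function, the number of queues, and the bytes-per-round value are all hypothetical, and a hardware switch would rotate egress queues rather than Python deques.

```python
from collections import deque

class CountMinSketch:
    """Approximate per-flow state: several hashed rows, read as the row minimum."""
    def __init__(self, rows=3, cols=64):
        self.rows, self.cols = rows, cols
        self.table = [[0] * cols for _ in range(rows)]

    def _index(self, key, row):
        h = row + 1                          # toy per-row seed (assumption)
        for byte in key.encode():
            h = (h * 31 + byte) % 1_000_003
        return h % self.cols

    def estimate(self, key):
        return min(self.table[r][self._index(key, r)] for r in range(self.rows))

    def raise_to(self, key, value):
        for r in range(self.rows):
            i = self._index(key, r)
            self.table[r][i] = max(self.table[r][i], value)

class ApproxFairQueue:
    def __init__(self, num_queues=4, bytes_per_round=1500):
        self.queues = [deque() for _ in range(num_queues)]
        self.bytes_per_round = bytes_per_round
        self.round = 0                       # round currently being served
        self.finish = CountMinSketch()       # approximate finish offset per flow

    def enqueue(self, flow, size):
        # 1) approximate the per-flow state (where the flow's last packet ended)
        start = max(self.finish.estimate(flow), self.round * self.bytes_per_round)
        # 2) determine the packet's position in the output schedule
        target_round = start // self.bytes_per_round
        if target_round - self.round >= len(self.queues):
            return False                     # too far ahead of the schedule: drop
        self.finish.raise_to(flow, start + size)
        # 3) select the egress queue that serves that round
        self.queues[target_round % len(self.queues)].append((flow, size))
        return True

    def dequeue(self):
        # 4) serve queues in round order, approximating the sorted order
        for _ in range(len(self.queues)):
            queue = self.queues[self.round % len(self.queues)]
            if queue:
                return queue.popleft()
            self.round += 1
        return None

afq = ApproxFairQueue()
for _ in range(4):
    afq.enqueue("A", 1500)                   # a backlogged flow
afq.enqueue("B", 1500)                       # a newly active flow
print([afq.dequeue()[0] for _ in range(5)])  # ['A', 'B', 'A', 'A', 'A']
```

In a plain FIFO the packet of flow "B" would leave last; here it is scheduled in the first round next to the first packet of flow "A", which is the fairness effect the multiple queues are meant to approximate.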
TABLE 11. AQMs on programmable and fixed-function switches.
TABLE 12. QoS/TM schemes comparison.
TABLE 13. Source-routed multicast schemes comparison (source: [115]).

as future work), it is worth noting that this system relies on metering primitives which are available in today’s hardware targets (e.g., meters in Tofino). Similarly, [113] was only implemented on a software switch (BMv2) and was evaluated by comparison against standard priority-based and best-effort scheduling. This system uses multiple priority queues, a feature supported in hardware targets. Therefore, the system could be implemented on hardware switches. The approach in [111] aims at limiting the maximum allowed rate and at maximizing bandwidth utilization. This is the only work that was implemented on a hardware switch (Tofino), and its design was compared against approaches based on OpenFlow.

5) COMPARISON OF QoS/TM BETWEEN LEGACY AND PROGRAMMABLE NETWORKS
The ability to perform QoS-based traffic management in legacy networks is restricted to algorithms that consider standard header fields (e.g., differentiated services [283]). On the other hand, programmable switches can parse, modify, and process customized protocols. Hence, operators now have the ability to perform TM by inspecting custom header fields. Moreover, it is possible to extract, with high granularity and at line rate, metadata pertaining to the state of the switch (e.g., queue occupancy, packet sojourn time, etc.). Such information can significantly help switches make better decisions while performing traffic management.

E. MULTICAST
1) BACKGROUND
Multicast routing enables a source node to send a copy of a packet to a group of nodes. Multicast uses in-network traffic replication to ensure that at most a single copy of a packet traverses each link of the multicast tree. Perhaps the most widely deployed multicast routing protocol in traditional networks is the Protocol-Independent Multicast (PIM) protocol [284]. PIM and other multicast routing protocols require a signaling protocol such as the Internet Group Management Protocol (IGMP) [285] to create, change, and tear down the multicast tree. Traditional multicast presents some challenges. For example, it is not suitable for environments where multicast group members constantly move (e.g., virtual machine migration and allocation). In such cases, the multicast tree must be updated dynamically, which may require substantial time and overhead. Also, some routers support a limited number of group-table entries, which does not scale in environments such as datacenters. Additionally, the signaling protocol and multicast algorithm are hard coded in the router, which reduces flexibility in building and managing the tree. Finally, it is not possible to implement multicast based on non-standard header fields.

2) SOURCE-ROUTED MULTICAST
Shahbaz et al. [115] presented ELMO, a multicast scheme based on programmable P4 switches for datacenter applications. ELMO encodes the multicast tree in the packet header, as opposed to maintaining group-table entries inside routers. Kadosh et al. [116] implemented ELMO using a hybrid dataplane with programmable and non-programmable elements. ELMO is intended for multi-tenant datacenter applications requiring high scalability. Braun et al. [117] presented an implementation of the Bit Index Explicit Replication (BIER) architecture [286] with extensions for traffic engineering. Similar to ELMO, BIER removes the per-multicast group state information from switches by adding a BIER header, which is used to forward packets. BIER does not require a signaling protocol for building, managing, and tearing down trees.

3) PRIORITY-BASED DECENTRALIZED MULTICAST
Cloud applications in data centers often require file transfers to be completed in a prioritized order. Luo et al. [287] proposed Priority-based Adaptive Multicast (PAM), a preemptive and decentralized rate control protocol for data center multicast. The switches explicitly and preemptively compute sending rates based on priorities encoded in scheduling headers, and the real-time link loads.

4) MULTICAST SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 13 compares the source-routed multicast schemes. Both ELMO and BIER are source-routed multicast schemes. In BIER, group members are encoded as bit strings and are then inspected by switches to identify the output port. Such a scheme requires heavy processing on the switch, hampering the execution at line rate. Consequently, the authors only implemented BIER on a software switch (BMv2). ELMO, on the other hand, has no restrictions on the group and network sizes, and was implemented on a hardware switch, running at line rate.

Other schemes like PAM addressed the challenges faced in file transfers by data center cloud applications. For instance, when sharing the link with other latency-sensitive flows, file transfers suffer from continuous changes in the link’s bandwidth, affecting the flow completion times. To solve this problem, PAM adopted a scheduling scheme that performs adaptive rate allocations in RTT scales. Other aspects that were addressed by PAM include: fault tolerance and scalability of file transfers; limited number of priority queues; and the challenges of performing complex computations in the data plane.

5) COMPARISON P4-BASED AND TRADITIONAL MULTICAST
Table 14 compares P4-based multicast and traditional multicast. The main advantages of implementing multicast routing
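The bit-string forwarding used by BIER, as described in the multicast comparison above, can be sketched in Python. The loop loosely follows the forwarding procedure of the BIER architecture specification; the table layout, the neighbor names, and the bit assignments are hypothetical, and this is not the BMv2 implementation of [117].

```python
def bier_forward(bitstring, bift):
    """Replicate a packet according to its BIER bit string.

    bitstring: int whose bit i is set when egress router i must get the packet.
    bift: dict mapping bit position -> (neighbor, forwarding_bit_mask), where
          the mask covers every egress router reached through that neighbor.
    Returns a list of (neighbor, bit string carried by that copy).
    """
    copies = []
    remaining = bitstring
    position = 0
    while remaining:
        if remaining & (1 << position):
            neighbor, mask = bift[position]
            # One copy per neighbor, carrying only the bits that neighbor serves.
            copies.append((neighbor, remaining & mask))
            # Clear the served bits so no second copy takes the same link.
            remaining &= ~(mask | (1 << position))
        position += 1
    return copies

# Hypothetical table: egress routers 0 and 1 sit behind S1, 2 behind S2, 3 behind S3.
bift = {
    0: ("S1", 0b0011),
    1: ("S1", 0b0011),
    2: ("S2", 0b0100),
    3: ("S3", 0b1000),
}
print(bier_forward(0b1011, bift))   # [('S1', 3), ('S3', 8)]
```

The per-bit scan and masking hint at why the text above calls this processing heavy for a switch pipeline: the number of iterations depends on the bit string rather than being fixed per packet.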
source IP address and port, destination IP address and port, and transport-layer protocol. This state information is stored in a connection table containing the 5-tuple and the Direct IP (DIP) address of the server serving that connection. State information is needed to avoid disruptions caused by changes in the DIP pool (e.g., server failures, addition of new servers). The load balancer also provides a translation functionality, translating the VIP to the internal DIP, and then translating back for packets traveling in the reverse direction back to the clients. The traditional software-based load balancer is illustrated in Fig. 13(a).

2) STATEFUL LOAD BALANCING
Recent works presented schemes where load balancing functionality is implemented in programmable P4 switches. The main idea consists of storing state information directly in the switch’s dataplane. The connection table is managed by the software load balancer, which can be implemented either in the switch’s control plane or as an external device, as shown in Fig. 13(b). The software load balancer adds new entries in the switch’s table as new connections arrive, or removes old entries as flows end.

Katta et al. [118] proposed HULA, a load balancer scheme where switches store the best path to the destination via their neighboring switches. This strategy avoids storing the congestion status of all paths in leaf switches. Benet et al. [119] extended this approach to support multi-path transport protocols (e.g., Multi-path TCP (MPTCP)). Another significant work is SilkRoad [120], a load balancer that provides a direct path between application traffic and servers. Other mechanisms such as DistCache [121] enable load balancing for storage systems through a distributed caching method. DASH [122] proposed a data structure that leverages multiple pipeline stages and per-stage SALUs to dynamically balance data across multiple paths. The aforementioned approaches work under specific assumptions about the network topology, routing constraints, and performance. Contra [123] generalized load balancing to work with various topologies and under multiple constraints by using a performance-aware routing mechanism.

3) STATELESS LOAD BALANCING
Recent advances in customized and stateful packet processing in programmable switches have not only spawned a variety of stateful load balancing schemes, but also stateless ones. Stateless load balancing in this context avoids storing per-connection state in the switch.

Perhaps the first and most significant P4-based stateless load balancing scheme is Beamer [124]. Instead of storing the state in the switch, Beamer leverages the connection state already stored in backend servers to perform the forwarding. Another scheme is SHELL [125], which is an application-agnostic, application-load-aware approach that uses a power-of-choices scheme to dispatch flows to a suitable instance. Other approaches such as W-ECMP [126] were built to solve the issue of hash collision in the well-known Equal-Cost Multi-Path (ECMP) scheme. W-ECMP maintains a maximum utilization table, which is used as weights to determine the routing probability for each path. Note that W-ECMP does not store per-connection state information in the data plane.

4) LOAD BALANCING SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 15 compares the aforementioned load balancing schemes. The key idea of switch-based stateful load balancing is to eliminate the need for a software layer while mapping a connection to the same server, ensuring the Per-Connection Consistency (PCC) property. The majority of the proposed approaches are stateful, meaning that the switches store information locally to perform load balancing.

Some approaches (e.g., HULA, MP-HULA, Contra) use active probing to collect network performance metrics. Such metrics are then analyzed by the switches to make load balancing decisions. Note that probing increases the bandwidth overhead which might result in performance degradation.

In the presence of multi-path transport protocols (e.g., MPTCP), systems such as HULA provide sub-optimal forwarding decisions when several subflows pertaining to a single connection are pinned on the same bottleneck link. As a result, schemes such as MP-HULA, Contra, and DASH were proposed to support multi-path transport protocols. For instance, MP-HULA is a transport layer multi-path aware load-balancing scheme that uses the best-k paths to the destination through the neighbor switches.

Other approaches are stateless. Beamer relies on using the connection state already stored in backend servers to ensure that connections are never dropped under churn. On the other hand, SHELL, which assigns new connections to a
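The connection-table behavior of a stateful load balancer, as described in this subsection, can be sketched in Python. This is a minimal model under stated assumptions and does not reproduce any surveyed system: the class and method names are hypothetical, the addresses are documentation examples, and `zlib.crc32` merely stands in for whatever hash a switch would apply to the 5-tuple.

```python
import zlib

class StatefulLoadBalancer:
    def __init__(self, dip_pool):
        self.dip_pool = list(dip_pool)   # servers behind the virtual IP
        self.connections = {}            # 5-tuple -> DIP chosen for the connection

    def select_dip(self, five_tuple):
        dip = self.connections.get(five_tuple)
        if dip is None:
            # Table miss: a new connection is hashed onto the current pool and
            # the choice is remembered, as the software load balancer would do
            # when it installs a new table entry.
            key = "|".join(map(str, five_tuple)).encode()
            dip = self.dip_pool[zlib.crc32(key) % len(self.dip_pool)]
            self.connections[five_tuple] = dip
        return dip

    def add_server(self, dip):
        self.dip_pool.append(dip)        # pool change; existing entries untouched

    def end_flow(self, five_tuple):
        self.connections.pop(five_tuple, None)   # remove the entry when the flow ends

lb = StatefulLoadBalancer(["10.0.0.1", "10.0.0.2"])
conn = ("198.51.100.7", 40000, "203.0.113.10", 80, "tcp")
first = lb.select_dip(conn)
lb.add_server("10.0.0.3")
print(lb.select_dip(conn) == first)      # True: the mapping survives the pool change
```

Without the stored entry, the second lookup would hash over three servers instead of two and could land on a different DIP, which is the disruption that Per-Connection Consistency is meant to rule out.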
using P4 that caches requests to optimize its operations. Grigoryan and Liu [133] proposed a system that caches Forwarding Information Base (FIB) entries (the most popular entries) in fast memory in order to minimize the TCAM consumption and to avoid the TCAM overflow problem. Zhang et al. [134] proposed B-Cache, a framework that bypasses the original processing pipeline to improve the performance of caching. Vestin et al. [135] proposed FastReact, a system that enables caching for industrial control networks. Finally, Woodruff et al. [136] proposed P4DNS, an in-network cache for Domain Name System (DNS) entries.

TABLE 18. Switch-based and server-based caching.

switch-based caching solves the load imbalance problem and is simpler as the whole logic is expressed in a program. Server-based caching, on the other hand, is more flexible regarding cache policies, as well as keys, values, and tables’ sizes.

C. TELECOMMUNICATION SERVICES
1) BACKGROUND
The evolution of the current mobile network to the emerging Fifth-Generation (5G) technology implies significant improvements of the network infrastructure. Such improvements are necessary in order to meet the Key Performance Indicators (KPIs) and requirements of 5G [290]. 5G requires ultra-reliable low latency and jitter (microseconds-scale). As programmable switches fulfill these requirements, researchers are investigating the idea of offloading telecom-oriented VNFs running on x86 servers to programmable hardware.

2) 5G FUNCTIONS
Ricart-Sanchez et al. [137] proposed a system that uses a programmable data plane to enhance the performance of the data path from the edge to the core network, also known as the backhaul, in a 5G multi-tenant network. The same authors [138] proposed a 5G firewall that detects, differentiates and selectively blocks 5G network traffic in the backhaul network.

In parallel, attempts such as TurboEPC [139] proposed offloading a subset of user state in the mobile packet core to programmable switches in order to perform signaling in the data plane. Similarly, Singh et al. [140] designed a P4-based element of the 5G Mobile Packet Core (MPC) that merges the functions of both the Serving Gateway (SGW) and the Packet Data Network Gateway (PGW). Additionally, Vörös et al. [141] proposed a hybrid next-generation NodeB (gNB) that combines the capabilities of P4 switches and the external services built on top of NIC accelerators (DPDK). Another important function required in 5G is handover. Palagummi and Sivalingam [142] proposed SMARTHO, a system that uses programmable switches to perform handover efficiently in a wireless network.

Paolucci et al. [143] demonstrated the potential and the disruptiveness of data plane programmability as opposed to the SDN/NFV network with no programmable switches. Specifically, the authors focused on the problems of full softwarization in current SDN networks (high latency and jitter, low precision traffic and advanced monitoring, etc.) and how P4 is paving the way to novel orchestration frameworks enabling innovation at the edge. Lin et al. [144] integrated P4 switches in their 5G testbed to implement the User Plane Function (UPF) in the data plane.

3) MEDIA OFFLOADING
Kfoury et al. [145] proposed a system for offloading conversational media traffic (e.g., Voice over IP (VoIP), Voice over LTE (VoLTE), WebRTC, media conferencing, etc.) from x86-based relay servers to programmable switches. While this system is not tailored for 5G networks specifically, it provides significant performance improvements for Over-The-Top (OTT) VoIP systems.

Andrus et al. [146] offloaded video processing to the switch. Essentially, the switch dynamically filters and separates control traffic from video streams, and then redirects them to the desired destinations. The authors implemented this scheme due to processing constraints on the software when the number of devices is high (the authors noted that the number of CCTV cameras in London, UK is estimated at roughly 500,000).

4) TELECOM SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 19 compares the aforementioned telecom schemes on P4. In general, all schemes aim at offloading various functionalities originally executed on x86-based servers to the data plane. Such a strategy improves the network performance (e.g., latency, throughput) significantly and aims at achieving the KPIs of 5G. For instance, the experiments conducted in [137] show that the attained QoS metrics meet the latency requirements of 5G. Similarly, the results reported in [138] demonstrate that the system meets the reliability KPI of 5G, which states that the network should be secured with zero downtime. Furthermore, the results reported in [142] show that there are 18% and 25% reductions in handover time with respect to legacy approaches, for two- and three-handover sequences, respectively.

The system in [145] emulates the behavior of the relay server which is primarily used to solve the NAT problem. Results show that ultra-low latency and jitter (nanoseconds-scale) are achieved with programmable switches as opposed

FIGURE 15. CDF of delay and packet loss rate of 900 offloaded VoIP calls [145].
TABLE 20. Switch-based and server-based media relaying.

were accommodated in the switch’s SRAM, with additional resources to spare for other functionalities. On the other hand, only one thousand sessions per CPU core were handled in the server-based relay, before QoS starts to degrade. The drawback of offloading media traffic to the switch is that some functionalities are complex to implement in the data plane (e.g., media mixing for conference calls, noise reduction, etc.).

D. CONTENT-CENTRIC NETWORKING
1) BACKGROUND
Emerging network architectures (e.g., [291]) promote content-centric networking, a model where the addressing scheme is based on named data rather than named hosts. In other words, users specify the data they are interested in instead of specifying where to get the data from. A branch of content-centric networking is the publish/subscribe (pub/sub) model. The goal of the model is to provide a scalable and robust communication channel between producers and consumers of information. A large fraction of today’s Internet applications follow the publish/subscribe paradigm. With the IoT, this paradigm proliferated as sensors/actuators are often deployed in dynamic environments. Other applications that use the pub/sub model include instant messaging, Really Simple Syndication (RSS) feeds, presence servers, telemetry and others. Current approaches to content-centric networking use software-based middleboxes, which limits the performance in terms of throughput and latency. Recent works are leveraging programmable switches to overcome the performance limitations of software-based middleboxes.

2) PUBLISH/SUBSCRIBE
Jepsen et al. [147] presented “packet subscription”, a new abstraction that generalizes the forwarding rules by evaluat-
to x86-based relay servers where the latency and the jitter ing stateful predicates on input packets. Wernecke et al. [148],
are in the milliseconds-scale (see Fig. 15). The solution also [149] presented distribution strategies for content-based pub-
improves the packet loss rate, CPU usage of the server, Mean lish/subscribe systems using programmable switches. The
Opinion Score (MOS), and can scale to more than one million authors described a system where the notification distribution
concurrent sessions, with additional resources to spare in the tree (i.e., the subscribers that should receive the notification)
switch. is encoded in the packet headers, similar to multicast source
Other systems allow offloading the signaling part to the routing. Similarly, Kundel et al. [150] implemented a pub-
data plane. For instance, TurboEPC offloads messages that lish/subscribe system on programmable switches. The system
constitute a significant portion of the total signaling traffic in is flexible in encoding attributes/values in packet headers.
the packet core, aiming at improving throughput and latency
of the control plane’s processing. 3) NAMED DATA NETWORKING
Signorello et al. [132] developed NDN.p4, a prelimi-
5) SWITCH-BASED AND SERVER-BASED MEDIA RELAY nary implementation of a Named Data Networking (NDN)
Offloading media traffic from general purpose servers to instance that caches requests to optimize its operations.
programmable switches greatly improves the quality of ser- Miguel et al. [151] extended NDN.p4 to include the content
vice. Table 20 shows the metrics achieved when media is store and to solve the scalability issues of the previous FIB
relayed by a relay server versus when it is relayed by the design. Karrakchou et al. [152] proposed ECDN, another
switch, based on [145]. The results show that the latency, jitter CDN implementation on P4 where data plane configuration is
and packet loss rates are significantly lower when media is generated according to application requirements and supports
being relayed by the switch. Not only the QoS metrics are extensions to the regular CDN such as adaptive forwarding,
improved, but also the maximum number of concurrent ses- customized monitoring, in-network caching control, and pub-
sions. With Tofino 3.2Tbps, more than one million sessions lish/subscribe forwarding.
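As a rough illustration of the NDN building blocks named above (content store and FIB), the following Python sketch models how a forwarder might handle Interest and Data packets. It is not the NDN.p4 or ENDN implementation: the class `NdnNode`, its method names, the face numbers, and the pending-interest table (PIT) are assumptions introduced for illustration, following the general NDN forwarding model rather than any of the surveyed systems.

```python
from dataclasses import dataclass, field


@dataclass
class NdnNode:
    """Toy NDN forwarder (hypothetical, for illustration only)."""
    fib: dict                                          # name prefix -> upstream face
    content_store: dict = field(default_factory=dict)  # name -> cached data
    pit: dict = field(default_factory=dict)            # name -> faces awaiting data

    def longest_prefix_match(self, name):
        """Return the face of the longest FIB prefix that covers `name`."""
        components = name.strip("/").split("/")
        # Names have a variable number of components; this loop is the part
        # that is awkward to express in a fixed-length P4 parser.
        for end in range(len(components), 0, -1):
            prefix = "/" + "/".join(components[:end])
            if prefix in self.fib:
                return self.fib[prefix]
        return None

    def on_interest(self, name, in_face):
        """Process an Interest: cache hit, PIT aggregation, or FIB forwarding."""
        if name in self.content_store:
            return ("data", in_face, self.content_store[name])
        if name in self.pit:
            self.pit[name].add(in_face)
            return ("aggregated", None, None)
        out_face = self.longest_prefix_match(name)
        if out_face is None:
            return ("drop", None, None)
        self.pit[name] = {in_face}
        return ("forward", out_face, None)

    def on_data(self, name, data):
        """Process Data: cache it and return the faces that requested it."""
        faces = self.pit.pop(name, set())
        if faces:
            self.content_store[name] = data
        return sorted(faces)
```

A second Interest for the same name is aggregated in the PIT, and once the Data returns it is cached so that later Interests are answered locally, which is the request-caching behavior the text attributes to the content store.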
FIGURE 16. (a) Traditional software-based pub/sub architecture. (b) Pub/sub implemented on a programmable switch.
TABLE 21. Content-centric networking schemes comparison.
4) CONTENT-CENTRIC NETWORKING SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 21 compares the aforementioned pub/sub schemes. In [147], the authors described a compiler that generates P4 tables from logical predicates. It utilizes a novel algorithm based on Binary Decision Diagrams (BDD) to preserve switch resources (TCAM and SRAM). This feature simplifies the configuration as operators do not need to manually install table entries on switches, which is a cumbersome process when the topology is large. The prototype was evaluated on a hardware switch (Tofino), and the authors considered Nasdaq's ITCH protocol as the pub/sub use case. Results show that the system was able to process messages at line rate while using the full switch capacity (6.5 Tbps).
The other systems considered different encoding strategies. For example, in [148], [149], the authors described a system where the notification distribution tree (i.e., the subscribers that should receive the notification) is encoded in the packet headers, similar to multicast source routing. The advantage of storing the distribution tree in the packet header instead of storing it in the switch is that rules in the switches do not need to be updated when subscriptions change. Other distinctions between the pub/sub systems are whether they require a dedicated language to describe the subscriptions and how complex they are to configure.
Regarding the NDN schemes, ENDN focused on making the data plane adaptive and easily programmable to meet the application needs. This flexibility is lacking in the other P4-based NDN schemes. It is worth mentioning that P4 has its shortcomings when it comes to supporting a stateful, variable-length protocol. This is an important aspect that should be tackled when implementing NDN on the data plane.
5) COMPARISON BETWEEN SWITCH-BASED AND SERVER-BASED PUB/SUB SYSTEMS
Fig. 16 illustrates the operations of traditional software-based pub/sub systems (a) and switch-based pub/sub systems (b). Latency and its variations are significantly reduced when the switch acts as a pub/sub broker. However, the size of memory in the switch limits the amount of data to be distributed. Moreover, implementing in hardware the features provided by software-based pub/sub systems, such as QoS levels, session persistence, message retaining, and last will and testament (notifying users after a device disconnects), is challenging.
E. SUMMARY AND LESSONS LEARNED
Programmable switches offer the flexibility of customizing the data plane to enable middlebox functions. A middlebox can be defined as a device that performs functions that are beyond the standard capabilities of routers and switches. A number of works demonstrated the implementation of middlebox functions such as caching, load balancing, offloading services, and others on programmable switches. The majority of load balancing schemes took advantage of the stateful nature of the data plane to store the load balancing connection table. Future work should consider minimizing the storage requirement to improve scalability, supporting flow priority, and developing further variations for novel multipath transport protocols such as multipath QUIC.
The switch can also act as an "in-network cache" that serves hot items at line rate. Some schemes index entries automatically, while others require the operator's intervention. Future endeavours could investigate item compression, communication minimization, priority-based caching, and caching of aggregated computations (e.g., caching the average of hot items).
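The header-encoded notification tree attributed to [148], [149] in this section can be pictured with a small Python sketch. It is only a toy model under stated assumptions: the header layout (a mapping from switch identifier to output ports), the function names `forward` and `deliver`, and the example topology are all hypothetical and are not taken from the cited systems. The point it shows is that the switches keep only static topology knowledge, so a subscription change alters the header built by the publisher side and not any switch state.

```python
def forward(switch_id, header):
    """Return the output ports listed for this switch in the packet header.

    No subscription table is consulted; the header alone decides.
    """
    return header.get(switch_id, [])


def deliver(topology, header, start):
    """Walk the distribution tree from `start` and list the reached hosts.

    `topology` maps switch -> {port: neighbor}; neighbors that are not keys
    of `topology` are treated as end hosts. The tree is assumed loop-free.
    """
    delivered, queue = [], [start]
    while queue:
        node = queue.pop(0)
        for port in forward(node, header):
            neighbor = topology[node][port]
            if neighbor in topology:      # another switch: keep walking
                queue.append(neighbor)
            else:                         # an end host: notification delivered
                delivered.append(neighbor)
    return delivered


# Static wiring (hypothetical): unchanged when subscriptions change.
TOPOLOGY = {
    "s1": {1: "s2", 2: "s3", 3: "hA"},
    "s2": {1: "hB", 2: "hC"},
    "s3": {1: "hD"},
}
```

With this wiring, a header of `{"s1": [1, 3], "s2": [2]}` reaches hosts `hA` and `hC`, while `{"s1": [2], "s3": [1]}` reaches only `hD`; in both cases `TOPOLOGY` is untouched. The trade-off is that the header grows with the size of the tree.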
An additional middlebox application is offloading telecom functions. The switch is capable of relaying media traffic and user plane functions. Future work could investigate scalability improvement (i.e., to accommodate more concurrent sessions), offloading signalling traffic, and in-network media mixing.
Finally, the switch can also act as a broker to distribute packets in a publish/subscribe system. Future work could investigate reliability assurance (e.g., packet delivery guarantees), message retaining, and QoS differentiation (e.g., QoS features of MQTT).
IX. NETWORK-ACCELERATED COMPUTATIONS
Programmable switches offer the flexibility of offloading some upper-layer logic to the ASIC, also referred to as in-network computation. Since switch ASICs are designed to process packets at terabits per second rates, in-network computation can result in an order of magnitude or more of improvement in throughput when compared to applications implemented in software. The potential performance improvement has motivated programmers to build in-network computation for different purposes, including consensus, machine learning acceleration, stream processing, and others. The idea of delegating computations to networking devices was conceived with Active Networks [292], where packets are replaced with small programs ("capsules") that are executed in each traversed device along the path. However, traditional network devices were not capable of performing computations. With the recent advancements in programmable switches, performing computations is now a possibility.
A. CONSENSUS
1) BACKGROUND
Consensus algorithms are common in distributed systems where machines collectively achieve agreement on a single data value, or on the current state of a distributed system. Reliability is achieved with consensus algorithms, even in the presence of some malicious or faulty processes. Consensus algorithms are used in applications such as blockchain [293], load balancing, clock synchronization, and others [294]. Latency has always been a bottleneck with consensus algorithms as protocols require expensive coordination on every request. Lately, researchers have started investigating how programmable switches can be leveraged to operate consensus protocols in order to increase throughput and decrease latency. Fig. 17 shows a consensus model in the data plane.
FIGURE 17. Consensus protocol in the data plane model [154]. An application sends a request to the proposer, which resides on a commodity server. The proposer then creates a Paxos message and sends it to the coordinator, running in the data plane. The role of the coordinator is to be the broker of requests on behalf of proposers. Afterwards, the acceptor, which also runs on the data plane, receives the messages from the coordinator, and ensures consistency through the system by deciding whether to accept/reject proposals. Finally, learners provide replication by learning the result of consensus.
2) PAXOS IMPLEMENTATIONS
Li et al. [153] proposed Network-Ordered Paxos (NOPaxos), a P4-based Paxos [295] system that applies replication in the data center to reduce the latency imposed by communication overhead. Similarly, Dang et al. [154] presented an implementation of Paxos using P4 on the data plane. Jin et al. [157] proposed NetChain, a variant of the Paxos protocol that provides scale-free sub-RTT coordination in data centers. It is strongly consistent, fault-tolerant, and presents an in-network key-value store. Dang et al. [158] proposed Partitioned Paxos, a P4-based system that separates the two aspects of Paxos, namely, agreement and execution, and optimizes them separately. Furthermore, the same authors proposed P4xos [160], a P4-based solution that executes Paxos logic directly in switch ASICs, without strengthening assumptions about the network (e.g., ordered delivery, packet loss, etc.).
3) OTHER IMPLEMENTATIONS
Another line of research focused on consensus algorithms other than Paxos. Li et al. [155] proposed Eris, a P4-based solution that avoids replication and transaction coordination overhead. It processes a large class of distributed transactions in a single round trip, without any additional coordination between shards and replicas. Sakic et al. [159] proposed P4 Byzantine Fault Tolerance (P4BFT), a system that is based on BFT-enabled SDN, where controllers act as replicated state machines. The system offloads the comparison of controllers' outputs required for correct BFT operations to programmable switches. Finally, Han et al. [156] offloaded part of the Raft consensus algorithm [296] to programmable switches in order to improve its performance. The authors selected Raft due to the fact that it has been formally proven to be safer than Paxos, and it has been implemented on popular SDN controllers.
4) CONSENSUS SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 22 compares the aforementioned consensus schemes. In general, consensus algorithms such as Paxos are complex and cannot be easily implemented with the constraints of the data plane. For instance, [154] only implemented phase-2 logic of Paxos leaders and acceptors. Similarly, NetChain uses a variant of the Paxos protocol that divides it into two parts: steady state and reconfiguration. This variant is known
87122 VOLUME 9, 2021
E. F. Kfoury et al.: Exhaustive Survey on P4 Programmable Data Plane Switches
as Vertical Paxos, and is relatively simple to implement in the network as the division's parts can be mapped to the control plane and the data plane.
TABLE 22. Consensus schemes comparison.
Unordered and completely asynchronous networks require the full implementation and complexity of Paxos. NOPaxos suggests that the communication layer should provide a new Ordered Unreliable Multicast (OUM) primitive; that is, there is a guarantee that receivers will process the multicast messages in the same order, though messages can be lost. NOPaxos relies on the network to deliver ordered messages in order to avoid coordination entirely. Dropped packets, on the other hand, are handled through coordination with the application. Other systems like Eris avoid replication and transaction coordination overhead. The main contribution of Eris compared to NOPaxos is that it establishes a consistent ordering across messages delivered to many destination shards. Eris also allows receivers to detect dropped messages.
Partitioned Paxos [158] improved the existing systems. The motivation behind Partitioned Paxos is that existing network-accelerated approaches do not address the problem of how a replicated application can cope with the high rate of consensus messages; NOPaxos only processes 13,000 transactions per second since it presents a new bottleneck at the host side. Other systems (e.g., NetChain) are specialized replication services and cannot be used by any off-the-shelf application.
Finally, P4xos improves both the latency and the tail latency. The throughput is also improved compared to hardware servers, which require additional memory management and safety features (e.g., user and kernel separation). P4xos was implemented on a hardware switch (Tofino), and results show that it reduces the latency by three times compared to traditional approaches, and it can process over 2.5 billion consensus messages per second (four orders of magnitude improvement).
5) NETWORK-ASSISTED AND LEGACY CONSENSUS COMPARISON
Consensus algorithms have been traditionally implemented as applications on general purpose CPUs. Such an architecture inherently induces latency overhead (e.g., a Paxos coordinator has a minimum latency of 96 µs [297]).
There are numerous performance benefits gained when consensus algorithms are implemented in programmable devices. When consensus messages are processed on the wire, the latency significantly decreases (a Paxos coordinator had a minimum latency of 340 ns [297]). Moreover, when compared to legacy consensus deployments, network-assisted consensus requires fewer hop traversals.
B. MACHINE LEARNING
1) BACKGROUND
The remarkable success of Machine Learning (ML) today has been enabled by a synergy between development in hardware and advancements in machine learning techniques. Increasingly complex ML models are being developed to handle the large size of datasets and to accelerate the training process. Hardware accelerators (e.g., GPU, TPU) were introduced to speed up the training. These accelerators are installed in large clusters and collaborate through distributed training to exploit parallelism. Nevertheless, training ML models is time consuming and can last for weeks depending on the complexity and the size of the datasets. Researchers have traditionally investigated methods to accelerate the computation process, but not the communication in distributed learning. With the advancements in programmable switches, it is now possible to accelerate the ML training process through the network.
2) IN-NETWORK TRAINING
Sapio et al. [161] proposed DAIET, a system that performs in-network data aggregation to accelerate applications that follow a partition/aggregate workload pattern. Similarly, Yang et al. [162] proposed SwitchAgg, a system that performs similar functions as DAIET, but with a higher data reduction rate. Perhaps the most significant work in the training acceleration literature is SwitchML [163], a system that performs in-network aggregation for ML model updates sent from workers on external servers.
3) IN-NETWORK INFERENCE
Other schemes have shown interest in speeding up the inference process by leveraging programmable switches. Siracusano and Bifulco [164] proposed N2Net, a system that runs simplified neural networks (NN) on programmable switches. Sanvito et al. [165] proposed BaNaNa Split, a solution that evaluates the conditions under which programmable switches can act as CPUs' co-processors for the processing of Neural Networks (e.g., CNN). Finally, Xiong et al. [166] proposed IIsy, a system that enables programmable switches to perform in-network classification. The system maps trained ML classification models to match-action pipelines.
4) ML SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 23 compares the aforementioned ML schemes. While the goal of DAIET is to discuss what computations the network can perform, the authors did not design a complete system, nor did they address the major challenges of supporting ML applications. Moreover, their proof-of-concept presented a simple MapReduce application on a software switch, and it is not certain whether the system can be implemented on a hardware switch. Compared to DAIET, SwitchAgg does
not require modifying the network architecture, and offers better processing abilities with a significant data reduction rate. Moreover, SwitchAgg was implemented on an FPGA, and the results show that the job completion time can be reduced by as much as 50%.
SwitchML extended the literature on accelerating ML model training by providing a complete implementation and evaluation on a hardware switch. A commonly used training technique for deep neural networks is synchronous stochastic gradient descent [299]. In this technique, each worker has a copy of the model that is being trained. The training is an iterative process where each iteration consists of: 1) reading a sample of the dataset and locally performing some computation-intensive learning using the worker's accelerators, which yields a gradient vector; and 2) updating the model by computing the mean of all gradient vectors. The main motivation of this idea is that the aggregation is computationally cheap (takes 100ms), but is communication-intensive (transfers hundreds of megabytes each iteration). SwitchML uses computation on the switch to aggregate model updates in the network as the workers are sending them (see Fig. 18). An advantage is that there is minimal communication; each worker sends its update vector and receives back the aggregated updates. The design challenges of this system include: 1) the limitation of storage available on the switch, addressed by using a streaming approach; 2) switches cannot perform many computations per packet, addressed by partitioning the work between the switch and the workers; 3) ML systems use floating point numbers, addressed by quantization approaches; and 4) failure recovery is needed to ensure correctness. The system is implemented on a hardware switch (Tofino), and results show that the system speeds up training by up to 300% compared to existing distributed learning approaches.
FIGURE 18. (a) ML model updates in legacy networks. The aggregation process is communication-intensive and follows an all-to-all communication pattern. This means that the workers should receive all the other workers' updates. Since accelerators on end-hosts are becoming faster, the network should speed up so that it does not become the bottleneck. Therefore, it is expensive to deploy additional accelerators since it requires re-architecting the network. The red arrow in (a) shows that the bottleneck source is the network. (b) ML model updates accelerated by the network. Aggregation is performed in the network by the programmable switches while the workers are sending them. The workers do not need to obtain the updates of all other workers, hence there is minimal communication. They only obtain the aggregated model from the switch. The red arrow in (b) shows that the bottleneck source is the worker rather than the network [163], [298].
With respect to in-network inference, it is challenging to implement full-fledged models as they require extensive computations (e.g., multiplications and activation functions). A simple variation such as the Binary Neural Network (BNN) only requires bitwise logic functions (e.g., XNOR, POPCNT, SIGN). N2Net provides a compiler that translates a given BNN model to a switching chip's configuration (P4 program). The authors did not mention on which platform N2Net was evaluated; however, based on their evaluations, they concluded that a BNN can be implemented on most current
switching chips, and with small additions to the chip design, more complex models can be implemented. IIsy studied other ML models. The authors of IIsy acknowledged that the work is limited in scope as it does not address popular ML algorithms such as neural networks. Furthermore, it is bound to the type of features it can extract (i.e., packet headers), and has accuracy limitations. IIsy tries to find a balance between the limited resources on the switch and the classification accuracy. Finally, BaNaNa Split took a different approach by partitioning the processing of NN to offload a subset of layers from the CPU to a different processor. Note that the solution is far from complete, and the authors evaluated a single binary fully connected layer with 4096 neurons using a network processor-based SmartNIC.
C. COMPARISON BETWEEN SWITCH-BASED AND SERVER-BASED ML
Table 24 shows a comparison between switch-based and server-based ML approaches. ML works that were extracted from the literature can be divided into two main categories: 1) expedited inference in the data plane, and 2) accelerated training in the network. The main advantage of switch-based over server-based inference is the ability to execute at line rate, and hence provide faster results to the clients. Performing complex computations in the switch is achieved through estimations, and hence is limited. Moreover, the SRAM capacity of the switch is small, impeding the storage of large models. Such limitations are not problematic with server-based inference approaches.
Distributed training can be significantly faster when aggregations are offloaded to a centralized switch. However, due to the small capacity of the switch memory, it is not possible to store the whole model update at once. Additionally, encrypted traffic remains a challenge when inference or training is handled by the switch.
D. SUMMARY AND LESSONS LEARNED
Accelerating computations by leveraging programmable switches is becoming a trend in data centers and backbone networks. Although switches only support basic and limited operations, it was shown in the literature that the performance of various tasks (e.g., consensus, training models in machine learning) could significantly improve if computations are delegated to the network.
The majority of the in-network consensus works aim at implementing common consensus protocols such as Paxos and Raft in the data plane. Due to the hardware constraints, current schemes implement only simplified variations of the protocols. Future work could investigate implementing novel consensus algorithms that diverge from the existing complex ones. Further, such schemes should encompass failure recovery mechanisms.
Another interesting in-network application is ML training/inference acceleration. The literature has shown that significant performance improvements are attained when the switch aggregates model updates or classifies new samples. Future work could explore developing ML models for various tasks such as classification, regression, clustering, etc.
In addition to the aforementioned categories, data plane programming is being used for stream processing [167], [168], parallel processing [169], string searching [170], erasure coding [171], in-network lock managers [172], database queries acceleration [173], in-network compression [174], and computer vision offloading [175].
X. INTERNET OF THINGS (IoT)
The Internet of Things (IoT) is a novel paradigm in which pervasive devices equipped with sensors and actuators collect physical environment information and control the outside world. IoT applications include smart water utilities, smart grid, smart manufacturing, smart gas, smart metering, and many others. Typical IoT scenarios entail a large number of devices periodically transmitting their sensors' readings to remote servers. Data received on those collectors is then processed and analyzed to assist organizations in taking data-driven intelligent decisions.
A. AGGREGATION
1) BACKGROUND
Since IoT devices are constrained in size and processing capabilities, they typically generate packets that carry small payloads (e.g., temperature sensor readings). While such packets are small in size, their headers occupy a significant portion of the total packet size. For instance, Sigfox Low-Power Wide Area Network (LPWAN) [300] can support a maximum payload size of 12 bytes per packet. The overhead of headers is 42 bytes (Ethernet 14 bytes + IP 20 bytes + UDP 8 bytes), which represents approximately 78% of the total packet size. When numerous devices continuously transmit IoT packets, a significant percentage of network bandwidth is wasted on transmitting these headers. Packet aggregation
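The header overhead quoted above for a 12-byte Sigfox payload can be reproduced with a short calculation. The Python sketch below uses the header sizes stated in the text; the function name `header_overhead` and the `packets_aggregated` parameter are introduced here for illustration, and the aggregated case assumes, as a simplification, that several payloads share a single Ethernet/IP/UDP header with no additional aggregation header.

```python
# Header sizes in bytes, as stated in the text.
HEADER_BYTES = {"ethernet": 14, "ip": 20, "udp": 8}
SIGFOX_MAX_PAYLOAD = 12  # bytes per packet


def header_overhead(payload_bytes, packets_aggregated=1):
    """Fraction of transmitted bytes spent on headers.

    Assumption: when packets are aggregated, all payloads share one set of
    headers and no extra aggregation header is added.
    """
    headers = sum(HEADER_BYTES.values())                # 42 bytes
    total = headers + payload_bytes * packets_aggregated
    return headers / total


single = header_overhead(SIGFOX_MAX_PAYLOAD)            # 42 / 54
batched = header_overhead(SIGFOX_MAX_PAYLOAD, 10)       # 42 / 162
print(f"single packet: {single:.1%}, ten payloads aggregated: {batched:.1%}")
```

Under these assumptions, a single packet spends 42 of its 54 bytes on headers, while combining ten payloads brings the header share down to 42 of 162 bytes.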
applications that analyze information collected from periph- TABLE 27. Switch-based, P2P, and cloud service automation.
eral devices and subsequently issue commands.
The interconnection of devices and services can follow a
Peer-to-Peer (P2P) model or a cloud-centric approach. In the
P2P model, the automation service runs on the central device
which processes and analyzes sensor data published by
peripheral devices in order to issue commands. The main
advantages of the P2P include the low end-to-end latency
and the subtle power consumption as devices are physically
plane leverages ONOS controller with Protocol Indepen-
close to each other. The drawbacks of the P2P model include
dent (PI) framework.
poor scalability, short reachability, and inflexibility of policy
enforcement. The cloud-centric model addresses the limita-
4) COMPARISON BETWEEN SERVER-BASED AND
tions of the P2P model by adding a gateway node that con-
SWITCH-BASED SERVICE AUTOMATION
nects peripheral devices to a middleware hosted on the cloud
(Internet). While this approach solves the poor scalability and Table 27 shows a comparison between switch-based, P2P, and
the policy enforcement flexibility issues, it incurs additional cloud-based service automation. Generally, the switch-based
delays and jitters in collecting and reacting to data. Moreover, approach overcomes the limitations of both approaches.
the middleware represents a single point of failure which It achieves the low energy and latency characteristics of P2P
can shutdown the whole service in the event of an outage. while increasing scalability and reachability.
With programmable switches, researchers are investigating
in-network approaches to manage transactional relationships C. SUMMARY AND LESSONS LEARNED
between low-power, low-range IoT devices. In the context of IoT, there exist broadly two categories,
namely, packets aggregation and service automation. The
2) SERVICE MANAGEMENT AND MULTI-PROTOCOL
goal of packet aggregation is to minimize the overhead
PROCESSING
of IoT packets’ headers. Typically, headers in IoT pack-
Uddin et al. [180] proposed Bluetooth Low Energy Service ets represent a significant portion of the whole packet
Switch (BLESS), a programmable switch that automates IoT size. By aggregating several packets into a single packet,
applications services by encoding their transactions in the the bandwidth overhead is reduced. Future work should
data plane. It maintains link-layer connections to the devices study the performance side-effects (e.g., delay, jitter, loss
to support P2P connectivity. The same authors proposed rate, retransmission) that aggregation causes to packets. Fur-
Muppet [181], an extension to BLESS to support multiple thermore, timers should be implemented to avoid exces-
non-IP protocols. sive delays resulting from waiting for enough packets to be
3) SERVICE AUTOMATION COMPARISON, DISCUSSIONS, aggregated.
AND LIMITATIONS
In BLESS, the data plane operations are performed at the Attribute Protocol (ATT) service layer, which consists of three operations: read attributes, write attributes, and attribute notification. BLESS parses ATT packets, then processes and forwards them to the devices. The control plane, on the other hand, is responsible for address assignment, device and service discovery, policy enforcement, and subscription management. The switch was implemented on a software switch (PISCES), and the results show that BLESS combines the advantages of the P2P and cloud-centric approaches. Specifically, it achieves small communication latency, low device power consumption, high scalability, and flexible policy enforcement. Muppet extended this approach to support multiple IoT protocols. The system studied two popular IoT protocols, namely BLE and Zigbee. Being in the middle, the Muppet switch is responsible for translating actions (e.g., on/off switch of a light bulb) between the Zigbee and BLE protocols, as well as logging important events to a database which resides on the Internet via the Hypertext Transfer Protocol (HTTP). Note that parsers and action policies have to be implemented for each supported protocol. Another difference from BLESS is that the implementation of Muppet’s control
With respect to service automation, the goal is to automate IoT application services by encoding their transactions in the data plane while improving scalability, reachability, energy consumption, and latency. Future work should design and develop translators for non-IP IoT protocols so that applications on various devices that run different protocols can exchange data. Additionally, production-grade software switches should be leveraged to support non-Ethernet IoT protocols.
Other works that involve IoT include flowlet-based stateful multipath forwarding [310] and an SDN/NFV-based architecture for IoT networks [311].
XI. CYBERSECURITY
Extensive research efforts have been devoted to deploying programmable switches to perform various security-related functions in the data plane. Such functions include heavy hitter detection, traffic engineering, DDoS attack detection and mitigation, anonymity, and cryptography. Fig. 20 demonstrates the difference between contemporary security appliances and programmable switches with respect to layer inspection in the OSI model. Although programmable switches are limited in computation power, they are capa-
ble of inspecting upper layers (e.g., application layer) at line rate. Such functionality is not available in any of the existing solutions.
FIGURE 20. Layers inspection in the OSI model. (a) Contemporary security appliances. (b) Programmable switch.
A. HEAVY HITTER
1) BACKGROUND
Heavy hitters are a small number of flows that constitute most of the network traffic over a certain amount of time. They are identified based on the port speed, network RTT, traffic distribution, application policy, and others. Heavy hitters increase the flow completion time for delay-sensitive mice flows, and represent the major source of congestion. It is important to promptly detect heavy hitters in order to react to them; for instance, redirect them to a low priority queue, perform rate control and traffic engineering, block volumetric DDoS attacks, and diagnose congestion. Traditionally, packet sampling techniques (e.g., NetFlow) were used to detect heavy hitters. The main problem with such techniques is the limited accuracy due to the CPU and bandwidth overheads of processing samples in software. Advancements in programmable switches paved the way to detect heavy hitters in the data plane, which is not only orders of magnitude faster than sampling, but also enables additional applications (e.g., flow-size aware routing). The detection schemes can be classified as local and network-wide. In the former, the detection occurs on a single switch; in the latter, the detection covers the whole network.
2) LOCAL DETECTION
Sivaraman et al. [182] proposed HashPipe, a heavy hitter detection algorithm that operates entirely in the data plane. It detects the k heaviest flows within the constraints of programmable switches while achieving high accuracy. Furthermore, Kučera et al. [183] proposed Elastic Trie, a solution that detects hierarchical heavy hitters, in-network traffic changes, and superspreaders in the data plane. Hierarchical heavy hitters include the total activity of all traffic matching relevant IP prefixes. Ben-Basat et al. [184] proposed PRECISION, a heavy hitter detection algorithm that prob-
3) NETWORK-WIDE DETECTION
Harrison et al. [187] proposed a network-wide distributed heavy-hitter detection scheme. The approach reports heavy hitters deterministically and without errors; however, it incurs significant communication costs that scale with the number of switches. Accordingly, the same authors proposed another scheme (Carpe [188]) which reports probabilistically with negligible communication costs. Ding et al. [189] proposed an approach for incrementally deploying programmable switches in a network consisting of legacy devices with the goal of monitoring as many distinct network flows as possible. The authors of MV-Sketch proposed SpreadSketch [190], an extension to the Count-min sketch where each bucket is associated with a distinct counter to track the distinct items of a stream. SpreadSketch aims at mitigating the high processing overhead of MV-Sketch.
4) HEAVY HITTER DETECTION COMPARISON, LIMITATIONS, AND DISCUSSIONS
Table 28 compares the aforementioned heavy hitter schemes. A major criterion that differentiates the solutions is the selection and the implementation of the data structure. Hash tables and sketches are frequently used to store counters for heavy flows. Note that several variations of such data structures are used in the literature, mainly to tackle the memory-accuracy tradeoff; the choice of data structure affects the accuracy of the performed measurements. For example, probabilistic data structures yield only approximations.
In HashPipe, the programmable switch stores the flow identifiers and their byte counts in a pipeline of hash tables. HashPipe adapts the space saving algorithm described in [312]. The system was evaluated using an ISP trace provided by CAIDA (400,000 flows), and the results show that HashPipe needed only 80 KB of memory to identify the 300 heaviest flows, with an accuracy of 95%. Another hashtable-based solution is Elastic Trie, which consists of a prefix tree that expands or collapses to focus only on the prefixes that grab a large share of the network traffic. The data plane informs the control plane about high-volume traffic clusters in an event-based push approach only when some conditions are met. Other systems explored different data structures for the task. For instance, in [189] the authors used the HyperLogLog algorithm [313], which approximates the number of distinct elements in a multi-set. The solution is
capable of detecting heavy hitters by only using partial input from the data plane.
Another important criterion is whether the scheme tracks heavy hitters across the whole network. For example, unlike HashPipe, which considers a single switch, [187] tracks network-wide heavy hitters. Tracking network-wide heavy hitters is important as some sources of traffic (e.g., port scanners, superspreaders, etc.) can go undetected at a single location. Moreover, aggregating the results of switches separately for detecting heavy hitters is not sufficient; flows might not exceed a threshold locally, but when the total volume is considered, the threshold might be crossed.
5) COMPARISON BETWEEN P4-BASED AND TRADITIONAL HEAVY HITTER DETECTION
The main advantage of heavy hitter detection schemes in the data plane over sampling-based approaches is the ability to operate at line rate. This means that every packet is considered in the detection algorithm, which improves accuracy and the speed of detection. Moreover, additional applications that exploit reactive processing can be implemented. For instance, switches can perform a flow-size aware routing method to redirect traffic upon detecting a heavy hitter.
B. CRYPTOGRAPHY
1) BACKGROUND
Performing cryptographic functions in the data plane is useful for a variety of applications (e.g., protecting layer 2 with cryptographic integrity checks and encryption, mitigating hash collisions, etc.). Computations in cryptographic operations (e.g., hashing, encryption, decryption) are known to be complex and resource-intensive. The supported operations in switch targets and in the P4 language are limited to basic arithmetic (e.g., additions, subtractions, bit concatenation, etc.). Recently, a handful of works have started studying the possibility of performing cryptographic functions in the data plane. Generally, cryptographic functions are executed externally (e.g., on a CPU) and invoked from the data plane.
2) EXTERNAL CRYPTOGRAPHY
The authors in [191] argue for the need to implement cryptographic hash functions in the data plane to mitigate potential attacks targeting hash collisions. Consequently, they presented prototype implementations of cryptographic hash functions on three different P4 target platforms (CPU, SmartNIC, NetFPGA SUME). Another work by Hauser et al. [192] attempted to implement host-to-site IPsec in P4 switches. For simplification, only Encapsulating Security Payload (ESP) in tunnel mode with different cipher suites is implemented. The same authors also proposed P4-MACsec [314], an implementation of MACsec on P4 switches. MACsec is an IEEE standard for securing Layer 2 infrastructure by encrypting, decrypting, and performing integrity checks on packets.
Malina et al. [193] presented a solution where P4 programs invoke cryptographic functions (externs) written in VHDL on FPGAs. The goal of this work is to avoid coding cryptographic functions on hardware (VHDL), and thus enable rapid prototyping of in-network applications with security functions. Another work that relies on externs for cryptographic functions is P4NIS [194].
3) DATA PLANE CRYPTOGRAPHY
The previous works delegated the complex computations to the control plane. Chen [195] implemented the Advanced Encryption Standard (AES) in the data plane using scrambled lookup tables. AES is one of the most widely used symmetric cryptography algorithms; it applies several encryption rounds on 128-bit input data blocks.
4) CRYPTOGRAPHY SCHEMES COMPARISON, DISCUSSIONS AND LIMITATIONS
Table 29 compares the aforementioned cryptography schemes. With respect to hashing, P4 currently implements hash functions that do not have the characteristics of cryptographic hashing. For example, Cyclic Redundancy Check (CRC), which is commonly used in P4 targets, was originally developed for error detection. CRC can be easily imple-
mented in embedded hardware, and is computationally much less complex than cryptographic hash functions (e.g., Secure Hash Algorithm (SHA)-256); however, it is not secure and has a high collision rate. Evaluation results in [191] show that 1) implementing cryptographic hash functions on a CPU is easy, but has high latency (several milliseconds); 2) SmartNICs have the highest throughput, but can only process packets up to 900 bytes; and 3) NetFPGA has the lowest latency, but cannot be integrated using native P4 features. The authors found that the performance of hashing is highly dependent on the application, the input type, and the hashing algorithm, and therefore there is no single solution that fits all requirements. However, P4 targets should benefit from the characteristics of each solution (CPU, SmartNICs, FPGA, and ASICs) to implement cryptographic hashing.
As for more complex protocol suites (e.g., IPsec), Hauser et al. [192] only implemented Encapsulating Security Payload (ESP) in tunnel mode for simplification. The Security Policy Database (SPD) and the Security Association Database (SAD) are represented as match-action tables in the P4 switch. To avoid complex key exchange protocols such as the Internet Key Exchange (IKE), this work delegates runtime management operations to the control plane. Moreover, since encryption and decryption are not supported by P4, the authors relied on user-defined P4 externs to perform complex computations. Note that implementing user-defined externs is not applicable to ASICs (e.g., Tofino), and consequently, the main CPU module of the switch is used for performing encryption/decryption computations, at the cost of increased latency and degraded throughput. The same ideas are applied to P4-MACsec by the same authors. Other works that rely on externs include [193], [194].
The system proposed by Chen [195] has significant performance advantages as it is fully implemented in the data plane. The idea of the proposed system is to apply lookup tables that are permuted using an encryption key. The author found that a single switch pipeline is capable of performing two AES rounds. Consequently, the system leverages the packet recirculation technique, which re-injects the packet into the pipeline. By doing so, it is possible to complete the 10 rounds of encryption required by the AES-128 algorithm by using five pipeline passes. Note that recirculation uses loopback ports and hence is limited by their bandwidth. The implementation on the Tofino chip shows that ≈ 10 Gbps throughput was attained. The author argued that this throughput is sufficient to support various in-network security applications. Nevertheless, it is possible to enhance the throughput by configuring additional physical ports as loopback ports.
Note that there are other schemes that implement some cryptographic primitives in the data plane but are in the Privacy and Anonymity category (Section XI-C).
5) COMPARISON BETWEEN IN-NETWORK AND CONTEMPORARY CRYPTOGRAPHY
Cryptographic primitives often require performing complex arithmetic operations on data. Implementing such computations on general purpose servers is simple; memory and processing units are not constrained. The literature has shown that there is a need to implement cryptographic functions in the data plane. For instance, cryptographic hash functions can significantly improve existing data plane applications with respect to collisions; encryption can protect confidential information from being exposed to the public. However, switches have limitations when it comes to computing. Supported hash functions in P4 are non-cryptographic (e.g., CRC), and therefore produce collisions when the table is not large. Consequently, researchers are continuously investigating techniques to perform such operations in the data plane.
C. PRIVACY AND ANONYMITY
1) BACKGROUND
Packets in a network carry information that can potentially identify users and their online behavior. Therefore, user privacy and anonymity have been extensively studied in the past (e.g., Tor and onion routing [315]). However, existing solutions have several limitations: 1) poor performance since overlay proxy servers are maintained by volunteers and have no performance guarantees; 2) deployability challenges; some solutions require modifying the whole Internet architecture, which is highly unlikely; 3) no clear partial deployment pathway; and 4) most solutions are
In [206], the scheme dynamically enforces access control to users based on contexts (e.g., if the user’s device uses Secure Shell (SSH) 2.0 or higher, then the switch forwards the packets of this flow. Otherwise, the switch drops the packets). The scheme requires user devices to run an application which communicates with the switch using a custom protocol (context packets). The context packets are generated on a per-flow basis. The switch tracks flows using a match-action table and registers in the data plane. The actions applied to a packet are dropping, allowing, and forwarding to other appliances for deep packet inspection. Data packets are not modified. Evaluations show that the proposed approach can operate (install new flows and update rules) with minimal latency, even under heavy DoS attacks. On the other hand, such attacks can decimate similar SDN-based systems. One of the main drawbacks of the proposed system is the lack of authentication, integrity, and confidentiality of the context packets. Thus, the system can be subject to attacks such as snooping (i.e., eavesdropping) on communication between user devices and the switch, impersonation, and others.
Finally, [207] proposes fingerprinting OSes in the data plane. The main motivation behind this work is that software-based passive fingerprinting tools (e.g., p0f [319]) are neither practical nor sufficient with large amounts of traffic on high-speed links. Furthermore, out-of-band monitoring systems cannot promptly take actions (e.g., drop, forward, rate-limit) on traffic at line rate. The main drawback of the solution is that it lacks sophisticated policies that involve rate-limiting traffic.
5) COMPARISON BETWEEN SWITCH-BASED AND SERVER-BASED ACCESS CONTROL
Controlling access to resources often starts with authentication. While server-based approaches are more flexible in the methods of authentication they can provide, they typically require client connections to reach the server before the communication starts. In switch-based approaches, the authentication can be done in-network at the edge, eliminating unnecessary latency incurred from traversing the network and from software processing.
Various features are considered when comparing P4-based firewalls to traditional firewalls. First, P4 firewalls are capable of inspecting headers above the transport layer (also known as deep packet inspection (DPI)), whereas traditional firewalls only reach the transport layer and typically operate on the 5-tuple fields. It is important to note that DPI in P4 switches is limited: if only a few bytes are parsed above the transport layer, line rate will be achieved; however, if the packet is deeply parsed, the throughput will start degrading accordingly. Second, in P4, policies and rules can be customized to be activated based on arbitrary information stored in the switch state (e.g., measurements through streaming); such capabilities are not present in traditional firewalls. Third, in P4, ownership of and innovation in access control algorithms rest solely with operators, unlike fixed-function firewalls, which are provided by device vendors. Note that non-programmable Next-Generation Firewalls (NGFW) are capable of performing advanced DPI at the cost of having much lower throughput than the line rate.
Access to resources can be controlled after fingerprinting the OS of end-hosts. Software-based passive fingerprinting tools cannot keep up with the high load (gigabits/s links). The literature has shown that said tools lead to 38% degradation in throughput [320]. Additionally, such tools are out-of-band, meaning that it is not possible to apply policies on traffic (e.g., after fingerprinting an OS). On the other hand, switch hardware is able to perform OS fingerprinting and apply security policies at line rate. Context-aware policies applied on nodes (clients/servers) have local visibility. A newer approach is to use a centralized SDN controller (e.g., [321]), but such a scheme is vulnerable to control plane saturation attacks and is subject to delay increases. Switch-based schemes, on the other hand, are able to provide access control at line rate.
E. DEFENSES
1) BACKGROUND
DDoS attacks remain among the top security concerns despite the continuous efforts towards the development of their detection and mitigation schemes. This concern is exacerbated not
only by the frequency of said attacks, but also by their high volumes and rates. Recent attacks (e.g., [322], [323]) reached the order of terabits per second, a rate that existing defense mechanisms cannot keep up with.
There are two main concerns with existing defense methods handled by end-hosts or deployed as middlebox functions on x86-based servers. First, they dramatically degrade the throughput and increase latency and jitter, impacting the performance of the network. Second, they present severe consequences on the network operation when they are installed at the last mile (i.e., far from the edge). The escalation of volumetric DDoS attacks and the lack of robust and efficient defense mechanisms motivated the idea of architecting defenses into the network. Up until recently, in-network security methods were restricted to simple access control lists encoded into the switching and routing devices. The main reason is that the data plane was fixed in function, impeding the development of customized and dynamic algorithms that can assist in detecting attacks. With the advent of programmable data planes, it is possible to develop systems that detect and mitigate various types of attacks without imposing significant overhead on the network.
2) ATTACK-SPECIFIC
Hill et al. [209] presented a system that tracks flows in the data plane using bloom filters. The authors evaluated SYN flooding as a use case for their system. Li et al. [210] presented NETHCF, a Hop-Count Filtering (HCF) defense mechanism that mitigates spoofed IP traffic. HCF schemes filter spoofed traffic with an IP-to-hop-count mapping table. Another attack-specific scheme proposed by Febro et al. [211] mitigates distributed SIP DDoS in the data plane. Furthermore, Scholz et al. [212], [213] presented a scheme that defends against SYN flood attacks. Ndonda and Sadre [214] implemented an intrusion detection system in P4 that whitelists and filters Modbus protocol packets in industrial control systems.
3) GENERIC ATTACKS
Some schemes are generic and aim at addressing multiple attacks concurrently. For instance, Xing et al. [215] proposed FastFlex, an abstraction that architects defenses into the network paths based on changing attacks. Kang et al. [216] presented an automated approach for discovering sensitivity attacks targeting data plane programs. Sensitivity attacks in this context are intelligently crafted traffic patterns that exploit the behavior of the P4 program. Lapolli et al. [217] implemented a mechanism to perform real-time DDoS attack detection based on entropy changes. Such changes are used to compute anomaly detection thresholds. Mi and Wang [218] proposed ML-Pushback, a P4-based implementation of the Pushback method [219].
Zhang et al. [220] proposed Poseidon, a system that mitigates volumetric DDoS attacks through programmable switches. It provides a language where operators can express a range of security policies. Friday et al. [221] proposed a unified in-network DDoS detection and mitigation strategy that considers both volumetric and slow/stealthy DDoS attacks. Xing et al. [222] proposed NetWarden, a broad-spectrum defense against network covert channels in a performance-preserving manner. The method in [223] models a stateful security monitoring function as an Extended Finite State Machine (EFSM) and expresses the EFSM using P4 abstractions. Ripple [224] provides decentralized link-flooding defense against dynamic adversaries.
da Silveira Ilha et al. [225] presented EUCLID, an extension to [217] where the data plane runs a fine-grained traffic analysis mechanism for DDoS attack detection and mitigation. EUCLID is based on information-theoretic and statistical analysis (entropy) to detect the attacks. Khooi et al. [226] presented a Distributed In-network Defense Architecture (DIDA), a solution that deals with sophisticated amplified reflection DDoS attacks. Ding et al. [227] proposed INDDoS, an in-network DDoS victim identification system that fingerprints the devices for which the number of packets exceeds a certain threshold. Musumeci et al. [228] proposed a system where ML algorithms executed on the control plane update the data plane after observing the traffic. Finally, Liu et al. [229] proposed Jaqen, an inline DDoS detection and mitigation scheme that addresses a broad range of attacks in an ISP deployment.
4) DEFENSE SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 32 compares the aforementioned defense schemes. Broadly, defense schemes can be grouped into two main categories: attack-specific and generic. The attack-specific category consists of the works that address a specific attack (e.g., NETHCF for IP spoofing, [211] for SIP DDoS, etc.), while the generic category aims at addressing various types of attacks (e.g., FastFlex for various availability attacks, Ripple for link flooding attacks, etc.).
The significant advantage of architecting defenses in the data plane is the performance improvement of the application. For instance, NETHCF is motivated by the fact that traditional HCF-based schemes are implemented on end-hosts, which delays the filtering of spoofed packets and increases the bandwidth overhead. Moreover, since traditional schemes are implemented in server-based middleboxes, low latency and minimal jitter are hard to achieve. Similarly, FastFlex advocates for the need to offload the defenses to the data plane. Specifically, it tackles the following key challenges that are faced when programming defenses in the data plane: 1) resource multiplexing; 2) optimal placement; 3) distributed control; and 4) dynamic scaling.
When deploying defenses in the data plane, operators must be aware of the capabilities of the constrained targets. Many operations that require extensive computations cannot be easily implemented on the data plane. Existing works either approximate the computations in the data plane (considering the trade-off between computation complexity and measurement accuracy), or delegate the computations to external processors
(e.g., CPU on the switch, external server, SDN controller, etc.). For instance, NETHCF decouples the HCF defense into a cache running in the data plane and a mirror in the control plane. The cache serves the legitimate packets at line rate, while the mirror processes the missed packets, maintains the IP-to-hop-count mapping table, and adjusts the state of the system based on network dynamics. In Poseidon, the defense primitives are partitioned to be executed on switches and on servers, based on their properties. On the other hand, in [217], the authors estimated the entropies of source and destination IP addresses of incoming packets for consecutive partitions (observation windows) in the data plane, without consulting external devices.
Perhaps the most significant state-of-the-art works in the defense schemes are Poseidon and Jaqen. Poseidon provides a modular abstraction that allows operators to express their defense policies. Poseidon requires external modules running on servers, making its deployment challenging, especially in ISP settings. Furthermore, such a design incurs additional costs and undesirable latency. Jaqen addressed those limitations and was designed to be executed fully in the switch, without external support from servers. Additionally, Jaqen used universal sketches as data structures; this selection enables detecting a wide range of attacks instead of crafting custom algorithms for specific ones.
Network-wide defenses are those that are not restricted to a single switch, and require multiple switches to co-operate in the attack detection and mitigation phases. Such co-operation significantly improves the accuracy and the promptness of the detection. More details on network-wide data plane systems are given in Section XIII-D.
Finally, Table 32 lists some limitations of the existing schemes, which can be explored in future work to advance the state-of-the-art.
5) COMPARISON BETWEEN P4-BASED AND TRADITIONAL DEFENSE SCHEMES
Network attacks such as large-scale DDoS and link flooding may have substantial impact on the network operation. For such attacks, server-based defenses deployed at the last mile are problematic and inherently insufficient, especially when attacks target the network core. Moreover, it is not feasible to detect and mitigate large volumes of attack traffic (e.g., SYN flood) on end-hosts without impacting the throughput of the network. Other defense schemes are proprietary, and hence are costly and limited to the detection algorithms
provided by the vendors. Table 33 highlights the costs and the performance differences between switch-based schemes (Poseidon and Jaqen) and other existing solutions. When defenses are architected into the network (i.e., detection and mitigation are programmed into the forwarding devices), it is easy to detect, throttle, or drop suspicious traffic at any vantage point, at line rate, with significant cost reductions.
TABLE 33. Comparison of DDoS defense schemes. Source: [229].
F. SUMMARY AND LESSONS LEARNED
In the context of cybersecurity, a wide range of works leveraged programmable switches to achieve the following goals: 1) detect heavy hitters and apply countermeasures; 2) execute cryptographic primitives in the data plane to enable further applications; 3) protect the identity and the behavior of end-hosts, as well as obfuscate the network topology; 4) enforce access control policies in the network while considering network dynamics; and 5) architect defenses in the data plane to accelerate the detection and mitigation processes.
Identifying heavy hitters at line rate has several advantages. Recent works considered various data structures and streaming algorithms to detect heavy hitters. Future systems could explore more complex data structures that reduce the amount of state storage required on the switches. Furthermore, novel systems must minimize the false positives and the false negatives compared to both P4-based and legacy heavy hitter detection systems. Finally, new schemes should explore strategies for incremental deployment while maximizing flow visibility across the network.
There is an absolute necessity to implement cryptographic functions (e.g., hash, encrypt, decrypt) in the data plane. Such functions can be used by various applications that require low hashing collisions (e.g., load balancing) and strong data protection. Most existing efforts delegate the complex computations to the control plane. However, recent systems have demonstrated that AES, a well-known symmetric key encryption algorithm, can be implemented in the data plane.
Another interesting line of work provided privacy and anonymity to the network. Recent efforts obfuscated the network topology in order to mitigate topology-centric attacks (e.g., LFA). Such systems must preserve the practicality of path tracing tools, while being robust against obfuscation inversion. Additionally, link failures in the physical topology should remain visible after obfuscation. Furthermore, when randomizing identifiers to achieve session unlinkability, the identifiers must fit into the small fixed header space so that compatibility with legacy networks is preserved. Other efforts considered rewriting source information and concealing headers to protect the identity of Internet users.
Finally, access control methods and in-network defenses were proposed. Future access control schemes should explore further in-network methods to authenticate the users, beyond port knocking. Additionally, since switches are capable of inspecting upper-layer headers, it is worth exploring offloading some next generation firewall functionalities to the data plane (such as in [327]). For instance, in [170], the authors proposed a system that allows searching for keywords in the payload of the packet. Similar techniques could be leveraged to achieve URL filtering at line rate. Additionally, schemes should mitigate stealthy, slow DDoS attacks.
XII. NETWORK TESTING
Although programmable switches provide flexibility in defining the packet processing logic, they introduce potential risks of having erroneous and buggy programs. Such bugs may cause fatal damage, especially when they are unexpectedly triggered in production networks. In such scenarios, the network starts experiencing a degradation in performance as well as disruption in its operation. Bugs can occur in various phases in the P4 program development workflow (e.g., in the P4 program itself, in the controller updating data plane table entries, in the target compiler, etc.). Bugs usually manifest after processing a sequence of packets with certain combinations not envisioned by the designer of the code. This section gives an overview of the troubleshooting and verification schemes for P4 programmable networks.
A. TROUBLESHOOTING
1) BACKGROUND
Considerable research interest has been directed at troubleshooting the network. Previous efforts are mainly based on passive packet behavior tracking through the usage of monitoring technologies (e.g., NetSight [328], EverFlow [329]). Other techniques (e.g., Automatic Test Packet Generation (ATPG) [330]) send probing packets to proactively detect network bugs. Such techniques have two main problems. First, the number of probe packets increases exponentially as the size of the network increases. Second, the coverage is limited by the number of probe-generating servers. Despite the flexibility that programmable switches offer, writing data plane programs increases the chance of introducing bugs into the network. Programs are inevitably prone to faults which could significantly compromise the performance of the network and incur high penalty costs.
2) PROGRAMMABLE NETWORKS TROUBLESHOOTING
Zhang et al. [230] proposed P4DB, an on-the-fly runtime debugging platform. The system debugs P4 programs in three levels of visibility by provisioning operator-friendly primitives: watch, break, and next. Zhou et al. [231] proposed P4Tester, a troubleshooting system for data plane
runtime faults. It generates an intermediate representation of P4 programs and table rules based on the BDD data structure. Dumitru et al. [232] examined how three different targets, BMv2, P4-NetFPGA, and Barefoot’s Tofino, behave when undesired behaviours are triggered. Kodeswaran et al. [233] proposed a data plane primitive for detecting and localizing bugs as they occur in real time. Finally, Zhou et al. [234] proposed KeySight, a platform that troubleshoots programmable switches with high scalability and high coverage. It uses the Packet Equivalence Class (PEC) abstraction when generating probes.
Some schemes such as Whippersnapper [331], BB-Gen [332], P8 [333], and [334] provide benchmarking for P4 programs and aim at understanding their performance.
3) TROUBLESHOOTING SCHEMES COMPARISON, DISCUSSIONS, AND LIMITATIONS
Table 34 compares the aforementioned troubleshooting schemes. Essentially, the schemes either passively track how packets are processed inside switches (e.g., [230], [233]) or diagnose faults by injecting probes (e.g., [231], [234]). The main limitation of passive detection is that schemes can only detect rule faults that have been triggered by existing packets, and cannot check the correctness of all table rules. On the other hand, probing-based schemes may incur large control and probe overheads.
Examples of probing-based schemes include P4Tester and KeySight. P4Tester generates an intermediate representation of P4 programs and table rules based on the BDD data structure. Afterwards, it performs an automated analysis to generate probes. Probes are sent using source routing to achieve high rule coverage while maintaining low overheads. The system was prototyped on a hardware switch (Tofino), and results show that it can check all rules efficiently and that the probe count is smaller than that of server-based probe injection systems (i.e., ATPG and Pronto).
Other schemes that use passive fault detection (e.g., P4DB) assume that packets consistently trigger the runtime bugs. P4DB debugs P4 programs in three levels of visibility by provisioning operator-friendly primitives: watch, break, and next. P4DB does not require modifying the implementation of the data plane. It was implemented and evaluated on a software switch (BMv2), and the results show that it is capable of troubleshooting runtime bugs with a small throughput penalty and little latency increase.
Another important criterion that differentiates the troubleshooting schemes is the memory footprint they require. Some schemes (e.g., P4DB) require more memory than others (e.g., KeySight).
Finally, the work in [232] is different from the others. The authors examined how three different targets, BMv2, P4-NetFPGA, and Barefoot’s Tofino, behave when undesired behaviours are triggered. The authors first developed buggy programs in order to observe the actual behavior of targets. Then, they examined the most complex P4 program publicly available, switch.p4, and found that it can be exploited when attackers know the specifics of the implementation. In summary, the paper suggests that BMv2 leaks information from previous packets. This behavior is not observed with the other two targets. Furthermore, the authors were able to perform privilege escalation on switch.p4 due to a header destined to ensure communication between the CPU and the P4 data plane.
4) COMPARISON LEGACY VS. P4-BASED DEBUGGING
In legacy networks, network devices are equipped with fixed-function services that operate on standard protocols. Troubleshooting these networks often involves testing protocols and typical data plane functions (e.g., layer-3 routing) through rigid probing. On the other hand, with programmable networks, since operators have the flexibility of defining custom data plane functions and protocols, testing is more complex and is program-dependent. Probing-based approaches should craft patterns depending on the deployed P4 program. Other approaches proposed primitives that increase the levels of visibility when debugging P4 programs. Research work extracted from the literature shows that it is essential to develop flexible mechanisms that operate dynamically on diverse P4 programs and targets.
B. VERIFICATION
1) BACKGROUND
Program verification consists of tools and methods that ensure correctness of programs with respect to specifications and properties. Verification of P4 programs is an active area as bugs can cause faults that have drastic impacts on the performance and the security of networking systems. Static P4 verification handles programs before deployment to the network, and hence, cannot detect faults that occur at runtime. On the other hand, runtime verification uses passive measurements and proactive network testing. This section describes the major verification work pertaining to P4 programs.
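As a rough illustration of the gap between static and runtime verification, the following minimal Python sketch checks a property over a toy forwarding function. It is not taken from any of the surveyed tools: the function `forward`, the port set `VALID_PORTS`, the table contents, and the tiny header space are all hypothetical, chosen only to show that a check performed before deployment says nothing about rules installed afterwards.

```python
from itertools import product

VALID_PORTS = {1, 2, 3}  # hypothetical set of ports that exist on the device


def forward(dst, ttl, table):
    """Toy data plane: drop on expired TTL, otherwise consult the table."""
    if ttl == 0:
        return "drop"
    return table.get(dst, "drop")  # default action: drop


def violates(action):
    """Property: every packet is either dropped or sent to a valid port."""
    return action != "drop" and action not in VALID_PORTS


def check(table, dsts, ttls):
    """Exhaustively test the property over a small (dst, ttl) header space."""
    return [(d, t) for d, t in product(dsts, ttls)
            if violates(forward(d, t, table))]


DSTS, TTLS = range(4), range(3)
table = {0: 1, 1: 2}
before = check(table, DSTS, TTLS)  # check performed before deployment
table[2] = 9                       # faulty rule installed at runtime
after = check(table, DSTS, TTLS)   # only a re-check at runtime reveals it
print(before, after)
```

The exhaustive loop is feasible only because the header space here is tiny; handling realistic header spaces is where the verification techniques surveyed in this section come in.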
Techniques that check for the aforementioned properties include Anteater [338], which models the data plane as boolean functions to be used in a Boolean Satisfiability Problem (SAT) solver, NetPlumber [339], which uses header space algebra [335], and others (e.g., VeriFlow [336], DeltaNet [340], Flover [341], and VMN [342]).
Since P4 programs incorporate customized protocols and processing logic to be used in the data plane, traditional tools are not capable of handling changes to forwarding behaviors without reprogramming their internals. Therefore, verification techniques in programmable networks rely on analyzing the P4 programs themselves since they define the behavior of the data plane.
C. SUMMARY AND LESSONS LEARNED
Network testing can generally be divided into debugging/troubleshooting network problems and verifying the behavior of forwarding devices. While traditional tools and techniques were adequate for non-programmable networks, they are insufficient for programmable ones due to their inability to handle changes to forwarding behaviors without reprogramming and restructuring their internals. A variety of works were proposed to analyze and model P4 programs in order to troubleshoot and verify the correctness of networks’ operations.
Network measurements can be collected through P4 switches and used to troubleshoot and verify the correctness of networks (control loop). Future work could explore methods that make a network more autonomous and capable of healing itself (e.g., self-driving networks, knowledge-defined networking, zero-touch networks) by leveraging the collected inputs from programmable switches.
XIII. CHALLENGES AND FUTURE TRENDS
In this section, a number of research and operational challenges that correspond to the proposed taxonomy are outlined. The challenges are extracted after comprehensively reviewing and diving into each work in the described literature. Further, the section discusses and pinpoints several initiatives for future work which could be worthy of being pursued in this imperative field of programmable switches. The challenges and the future trends are illustrated in Fig. 23.
FIGURE 23. Challenges and future trends. The references represent examples of existing works that tackle the corresponding future trends.
A. MEMORY CAPACITY (SRAM AND TCAM)
Stateful processing is a key enabler for programmable data planes as it allows applications to store and retrieve data across different packets. This advantage enabled a wide range of novel applications (e.g., in-network caching, fine grained measurements, stateful load balancing, etc.) that were not possible in non-programmable networks. The amount of data stored in the switch is limited by the size of the on-chip memory which ranges from tens to hundreds of megabytes at most. Consequently, the majority of stateful-based applications face trade-offs between performance and memory usage. For instance, the efficiency of caching, which is determined by the hit rate, is directly affected by the memory size. Furthermore, the vast majority of measurement applications require storing statistics in the data plane (e.g., byte/packet counters). The number of flows to be measured and the richness of measurement information are bound by the size of the memory in the switch.
Current and Future Initiatives: A notable work by Kim et al. [368], [369] suggests accessing remote Dynamic Random Access Memory (DRAM) installed on data center servers purely from the data plane to expand the available memory on the switch. The bandwidth of the chip is traded for the bandwidth needed to access the external DRAM. The approach is cheap and flexible since it reuses existing resources in commodity hardware without adding infrastructure costs. The system is realized by allowing the data plane to access remote memory through an access channel (RDMA over Converged Ethernet (RoCE)) as shown in Fig. 24. The implementation shows that the proposal achieves throughput close to the line rate, and only incurs 1-2 microseconds of extra latency (Fig. 25). There are some limitations in this approach that can be explored in the future.
• The current implementation only supports address-based memory access, and hence, complicated data layouts and ternary matching in remote memory should be explored.
• Frequent updates in the remote memory require several packets for fetching and adding. This is common in measurement applications where counters are continuously incremented. A possible solution to the bandwidth overhead is aggregating updates into a single operation. This comes at the cost of delayed updates.
• Packet loss between the switch and the remote memory should be handled; otherwise, the performance of the application and the freshness of the remote values might be affected.
• The interaction between general data plane applications and the remote memory is challenging. A potential improvement is designing well-defined APIs to facilitate the interaction.
FIGURE 25. Accessing remote DRAM latency overhead. Only 1-2us additional latency. Achieved throughput close to the line rate (≈ 37.5 Gbps). Reproduced from [368].
B. RESOURCES ACCESSIBILITY
Besides the size limitation of the on-chip memory, there are other restrictions that data plane developers should take into account [52], [373]. First, since the table memory is local to each stage in the pipeline, a stage cannot reclaim unused memory in other stages. As a result, memory and match/action processing are fused, making the placement of tables challenging. Second, the sequential execution of operations in the pipeline leads to poor utilization of resources, especially when the matches and the actions are imbalanced (i.e., the presence of default actions that do not need a match).
Current and Future Initiatives: An interesting work by Chole et al. [367] explored the idea of disaggregating the memory and compute resources of a programmable switch. The main notion of this work is to centralize the memory as a pool that is accessed by a crossbar. By doing so, each pipeline stage no longer has local memory. Additionally, this work solves the sequential execution limitation by creating a cluster of processors used to execute operations in any order. The main limitation of this approach is the lack of adoption by hardware vendors. Most of the switch vendors (e.g., Cavium’s XPliant and Barefoot’s Tofino) do not implement the disaggregation model and follow the regular Reconfigurable Match-action Tables (RMT) model. The implementation and analysis of the disaggregation model on hardware targets should be explored in the future.
C. ARITHMETIC COMPUTATIONS
There are several challenges that must be handled when dealing with arithmetic computations in the data plane. First, programmable switches support a small set of simple arithmetic computations that operate on non-floating point values. Second, only a few operations are supported per packet to guarantee the execution at line rate. Typically, a packet should only spend tens of nanoseconds in the processing pipeline. Third, computations in the data plane consume significant hardware resources, hampering the possibility of other programs to execute concurrently. A wide range of applications suffer from the lack of complex computations in the data plane. For instance, some operations required by AQMs (e.g., the square root function in the CoDel algorithm) are complex to implement with P4. Additionally, the majority of machine learning frameworks and models operate on floating point values while the supported arithmetic operations on the switch operate on integer values. In-network aggregation of model updates requires calculating the average over a set of floating-point vectors.
Current and Future Initiatives: Existing methods to overcome the computation limitations include approximation and pre-computations. In the approximation method, the application designer relies on the small set of supported operations to approximate the desired value, at the cost of sacrificing precision. For example, approximating the square root function can be achieved by counting the number of leading zeros through longest prefix match [99]. It would be beneficial for P4 developers to have access to a community-maintained library which encompasses P4 code that approximates various complex functions. In the pre-computations method, values are computed by the control plane (e.g., switch CPU) and stored in match-action tables or registers. Future work can explore methods that automatically identify the complex computations that can be pre-evaluated in the control plane. After identification, the data plane code and its corresponding control plane APIs can be automatically generated.
D. NETWORK-WIDE COOPERATION
The SDN architecture suggests using a centralized controller for network-wide switch management. Through centralization, the state of each programmable switch can be shared with other switches. Consequently, applications will have the ability to make better decisions as network-wide data is available locally on the switch. The problem with such an architecture is the requirement of having a continuous exchange of packets with a software-based system. As an alternative, switches can exchange messages to synchronize their states in a decentralized manner.
Consider Fig. 26 which shows an in-network DDoS defense solution. Each switch maintains a list of senders and their corresponding numbers of bytes. A switch compares the number of bytes transmitted from a given flow to a threshold. When the threshold is crossed, the flow is blocked and the device is identified as a malicious DDoS sender. Assume that
the network implements a load balancing mechanism that distributes traffic across the switches. In the scenario where switches do not consider the byte counts of other switches (Fig. 26 (a)), the traffic of a DDoS device might remain under the threshold. On the other hand, when switches synchronize their states by sharing the byte counts (Fig. 26 (b)), the total number of bytes is compared against the threshold. Consequently, the total load of a DDoS device is considered. This example demonstrates an application that heavily depends on network-wide cooperation and hence motivates the need for state synchronization.
FIGURE 26. (a) Local detection of DDoS attacks. (b) Network-wide detection of DDoS attacks.
Current and Future Initiatives: Arashloo et al. [361] proposed SNAP, a centralized stateful programming model that aims at solving the synchronization problem. SNAP introduced the idea of writing programs for “one big switch” instead of many. Essentially, developers write stateful applications without caring about the distribution, placement, and optimization of access to resources. SNAP is limited to one replica of each state in the network. Sviridov et al. [362], [363] proposed LODGE and LOADER to extend SNAP and enable multiple replicas. Luo et al. [364] proposed Swing State, a framework for runtime state migration and management. This approach leverages existing traffic to piggyback state updates between cooperating switches. Swing State overcomes the challenges of the SDN-based architecture by synchronizing the states entirely in the data plane, at line rate, and without intervention from the control plane. There are several limitations with this approach. First, there are no message delivery guarantees (i.e., packets dropped/reordered are not retransmitted), leading to inconsistency in the states among the switches. Second, it does not merge the states if two switches share common states. Third, the overhead can significantly increase if a single state is mirrored several times. Finally, there is no authentication of data or senders. Xing et al. [365] proposed P4Sync, a system that migrates states between switches in the data plane while guaranteeing the authenticity of the senders and the exchanged data. P4Sync addresses the limitations of existing approaches. It guarantees the completeness of the migration, ensuring that the snapshot transfer is completed. Moreover, it solves the overhead of the repeatedly retransmitted updates. An interesting aspect of P4Sync is its ability to control the migration traffic rate depending on the changing network conditions. Zeno et al. [366] presented a design of SwiShmem, a management layer that facilitates the deployment of network functions (NFs) on multiple switches by managing the distributed shared states.
Future work in this area should consider handling frequent state migrations. Some systems require migration packets to be generated each RTT, causing increased traffic overhead and additional expensive authentication operations. For instance, P4Sync uses public key cryptography in the control plane to sign and verify the end of the migration sequence chain (2.15ms for signing and 0.07ms for verifying using an RSA-2048 signature). Frequent migrations would cause this signature to be involved repeatedly. Another major concern that should be handled in future work is denial of service. Even with authentication of migration updates, changes in the packets cause the receiver to reject updates, leading to state inconsistency among switches.
E. CONTROL PLANE INTERVENTION
Delegating tasks to the control plane incurs latency and affects the application’s performance. For instance, in congestion control, rerouting-based schemes often use tables to store alternative routes. Since the data plane cannot directly modify table entries, intervention from the control plane is required. The interaction with the control plane in this application hampers the promptness of rerouting. Another example is methods that use collision-free hashing. For example, cuckoo hash [374], which rearranges items to solve collisions, uses a complex search algorithm that cannot run on the switch ASIC, and is often executed on the switch CPU. Ideally, the control plane intervention should be minimized when possible. For example, to synchronize the state among switches, in-network cooperation should be considered.
Current and Future Initiatives: The design of the interaction between the control plane and the data plane is fully decided by the developer. Experienced developers might have enough background to immediately minimize such interaction. Future work should devise algorithms and tools that automatically detect excessive interaction between the control and data planes, and suggest alternative workflows (ideally, as generated code) to minimize such interaction. Operations that could be delegated to the data plane include failure detection and notification and connectivity retrieval [360].
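To make the cuckoo hashing example in this subsection more concrete, the following minimal Python sketch shows why insertion is usually left to the switch CPU: a lookup touches at most two slots, whereas an insertion may displace a chain of existing entries. This is an illustrative sketch and not the algorithm of [374] as deployed on any target; the class name `CuckooTable`, the two modulo-based hash functions, the table size, and the displacement limit are assumptions made for the example.

```python
class CuckooTable:
    """Two-table cuckoo hash: each key has one candidate slot per table."""

    def __init__(self, size=8, max_kicks=16):
        self.size = size
        self.max_kicks = max_kicks
        self.tables = [[None] * size, [None] * size]

    def _slot(self, which, key):
        # Two illustrative hash functions (assumption: non-negative int keys).
        if which == 0:
            return key % self.size
        return (key // self.size) % self.size

    def lookup(self, key):
        # At most two slots are read, a bounded amount of work per packet.
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value):
        """Insert a key and return the number of displaced entries."""
        # Update in place if the key is already stored.
        for which in (0, 1):
            i = self._slot(which, key)
            entry = self.tables[which][i]
            if entry is not None and entry[0] == key:
                self.tables[which][i] = (key, value)
                return 0
        # Otherwise place the entry, evicting occupants to their other table.
        # This unbounded-looking loop is the part that does not map well to
        # a fixed pipeline and is therefore run in software.
        entry, which = (key, value), 0
        for kicks in range(self.max_kicks):
            i = self._slot(which, entry[0])
            entry, self.tables[which][i] = self.tables[which][i], entry
            if entry is None:
                return kicks
            which = 1 - which
        # Simplification: a full implementation would rehash here.
        raise RuntimeError(f"displacement limit reached, pending entry {entry}")


t = CuckooTable()
kicks = [t.insert(1, "a"), t.insert(9, "b"), t.insert(17, "c")]
print(kicks, [t.lookup(k) for k in (1, 9, 17)])
```

In this sketch the keys 1, 9, and 17 all map to the same slot of the first table, so each later insertion moves the previous occupant to the second table.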
F. SECURITY
When designing a system for the data plane, the developer must envision the kind of traffic a malicious user can initiate to corrupt the operation of the system. This class of attacks is referred to as sensitivity attacks as coined in [216]. Essentially, an attacker can intelligently craft traffic patterns to trigger unexpected behaviors of a system in the data plane. For instance, a load balancer that balances traffic through packet headers hashing without cryptographic support (e.g., modulo operator on the number of available paths) can be tricked by an attacker that crafts skewed traffic patterns. This results in traffic being forwarded to a single path, leading to congestion, link saturation, and denial of service. Another example is attacks against in-network caching. Caching in the data plane performs well when requests are mostly reads rather than writes. If an attacker continuously generates highly skewed write requests, the load on the storage servers would be imbalanced. If the system is designed to handle write queries on hot items in the switch, a random failure in the switch causes data to be lost. Further, an attacker can also exploit the memory limitation of the switch and request diverse values, causing the pre-cached values to be evicted.
Current and Future Initiatives: To mitigate sensitivity attacks, a developer attempts to discover various unpredicted traffic patterns, and accordingly, develops defense strategies. Such a solution is highly unreliable, time consuming, and error-prone. Recent efforts [216] aimed at automatically discovering sensitivity attacks in the data plane. Essentially, the proposed system aims at deriving traffic patterns that would drive the program away from common case behavior as much as possible. Other efforts focused on architecting defenses in the data plane that perform distributed mode changes upon attack discovery [215]. Future work in this direction should consider achieving high assurance by formally verifying the code. Additionally, the stability of the data plane should be carefully handled with fast mode changes; future work could consider integrating self-stabilizing systems for such purpose. Finally, future work should provide security interfaces for collaborating switches that belong to different domains. It is also worth exposing sensitivity attack patterns for different application types so that data plane developers can avoid the vulnerabilities that trigger those attacks in their code.
G. INTEROPERABILITY
Programmable switches pave the way for a wide range of innovative in-network applications. The literature has shown that significant performance improvements are brought when applications offload their processing logic to the network. Despite such facts, it is very unlikely that mobile operators will replace their current infrastructure with programmable switches in one shot. This unlikelihood comes from the fact that major operational and budgeting costs will be incurred.
Current and Future Initiatives: Network operators might deploy programmable switches in an incremental fashion. That is, P4 switches will be added to the network alongside the existing legacy devices. While this solution seems simplistic at first, studies have shown that partial deployment leads to reduced effectiveness [189]. For instance, the accuracy of heavy hitter detection schemes is strongly affected by the flow visibility. The work in [189] devised a greedy algorithm that attempts to strategically position P4 switches in the network, with the goal of monitoring as many distinct network flows as possible. The F1 score is used to quantify the correctness of switch placement. Other works that focused on incremental deployment include daPIPE [375] and TraceILP/TopoILP [371]. Future work in this area should consider generalizing and enhancing this approach to work with any P4 application, and not only heavy hitter detection. For instance, future work could suggest the positioning of P4 switches in applications such as in-network caching, accelerated consensus, and in-network defenses, while taking into account the current topology consisting of legacy devices.
Amin et al. [376] surveyed the research and development in the field of hybrid SDN networks. Hybrid SDN comprises a mix of SDN and legacy network devices. It is worth noting that the same key concepts and advantages of hybrid SDN networks can be applied to incremental P4 networks.
FIGURE 27. Example of using taps in a campus network to compute the round-trip time in the data plane. (1) The traffic is passively collected by the P4 switch; (2) the switch calculates the round-trip time by using its high-precision timer (see [95] for details on how to associate the SEQ/ACKs to compute the RTT); (3) the switch reports the RTT samples to an external server.
Recent efforts are also considering network taps as a means to replicate a production network’s traffic to programmable switches for analysis [88]. Network TAPs replicate packets and do not alter timing information and packet orders, which may occur with other schemes such as port mirroring operating at layer 2 and layer 3 [377]. ConQuest [88] taps on the ingress and egress links of a legacy router and uses a P4 switch to perform advanced fine-grained queue monitoring techniques. Note that legacy routers only support polling the total queue length statistics at a coarse time interval, and hence, cannot monitor microbursts. By tapping on legacy devices and processing on P4 switches, operators can benefit from the capabilities of P4 switches without the need to fully replace their current infrastructure. This method can be used in a variety of in-network applications (e.g., RTT estimation (see Fig. 27), network-wide telemetry, DDoS detection/mitigation, to name a few). Finally, it is
worth mentioning that TAPs are not expensive and a single P4 switch can service many non-programmable devices.
H. PROGRAMMING SIMPLICITY
Writing in-network applications using the P4 language is not a straightforward task. Recent studies have shown that many existing P4 programs have several bugs that might lead to complete network disruption [232]. Furthermore, since programmable switches have many restrictions on memory and the availability of resources, developers must take into account the low-level hardware limitations when writing the programs. This process is known to be based on trial and error; developers are almost never sure whether their program can “fit” into the ASIC, and hence, they repeatedly try to compile and adjust their code accordingly. Such a problem is exacerbated when the complexity of the in-network application increases, or when multiple functions (e.g., telemetry, monitoring, access control, etc.) are to be executed concurrently in the same P4 program. Additionally, code modularity is not simple in P4; the programmers typically rewrite existing functions depending on the constraints of the current context. All the aforementioned facts affect the cost, stability, and correctness of the network in the long run.
For several decades, the networking industry operated in a bottom-up approach, where switches are equipped with fixed-function ASICs. Consequently, little to no programming skills were needed by network operators. With the advent of programmable switches, operators are now expected to have experience in programming the ASIC.²
Current and Future Initiatives: Since programming the ASIC is not a straightforward task, future research endeavours should consider simplifying the programming workflow for the operators and generating code (e.g., [345]–[352]). For instance, graphical tools can be developed to translate workflows (e.g., flowcharts) to P4 programs that can fit into the hardware.
A noteworthy work (P4All [353]) proposed an extension to P4 where operators write elastic programs. Elastic programs are compact programs that stretch to make use of the hardware resources. P4All extends P4 to support loops. The operator supplies the P4All program along with the target specifications (i.e., constraints) to the P4All compiler. Afterwards, the compiler analyzes the dependencies between actions and unrolls the loops. Then, it generates the constraints for the optimization based on the target specification file. Next, the compiler solves an optimization problem that maximizes a linear utility function and generates an output P4 program for the target. The authors considered the Tofino target in their evaluations. While P4All offered numerous advantages, it is still far from being ready to be used in practice. First, it assumes that programmers are able to write representative utility functions and assign their weights. Second, it assumes that the programmer is aware of the workload (which is needed to write the utility function). The authors suggested that future work could investigate a dynamic system that uses measurements to change the utility functions. Finally, P4All does not support multivariate and nonlinear functions. All the aforementioned limitations can be explored in the future.
I. DEEP PROGRAMMABILITY
Disaggregation is enabling network owners and operators to take control of the software running the network. It is possible to program virtual and PISA-based switches, hardware accelerators, smartNICs, and end-hosts’ networking stacks. Further, acceleration techniques such as the Express Data Path (XDP) and Berkeley Packet Filter (BPF) are being used to accelerate the packet forwarding in the kernel. Additionally, acceleration techniques are used to address the performance issues of Virtual Network Functions (VNFs) running on servers [380], [381].
The malleability of programming various network components is shifting the trend towards deep programmability, as coined by McKeown [53], [382]. In deep programmability, the behavior is described at the top and partitioned and executed across elements. The operators will focus on “software” rather than “protocols”; for example, functions like routing and congestion control will be described in programs. Software engineering principles will be routinely used to check the correctness of the network behavior (from unit testing to formal/on-the-fly verification). Fine-grained telemetry and measurements will be used to monitor and troubleshoot network performance. Stream computations will be accelerated by the network (e.g., caching, load balancing, etc.). Further, networks will run autonomously under verifiable, closed-loop control. Finally, McKeown envisioned that networks will be programmed by owners, operators, researchers, etc., while being operated by a lot fewer people than today.
There are many open challenges to realize the vision of deep programmability. Consider Fig. 28. The control plane is managing the pipeline of programmable switches, NICs, and virtual switches, which are programmed by P4 through a runtime API (e.g., P4Runtime). The challenge is how to write clean code that can be moved around within the hardware pipeline, and can run at line rate.
FIGURE 28. Network as a programmable platform. Large cloud or ISP example [53].
² Note that most vendors (e.g., Barefoot Networks) provide a program (switch.p4) that expresses the forwarding plane of a switch, with the typical features of an advanced layer-2 and layer-3 switch. If the goal is to simply deploy a switch with no in-network applications, then the operators are not required to program the chip. They just need to install a network operating system (NOS) such as SONiC [378] or FBOSS [379].
VOLUME 9, 2021 87143
E. F. Kfoury et al.: Exhaustive Survey on P4 Programmable Data Plane Switches
TABLE 36. Abbreviations used in this article.

REFERENCES
[1] N. McKeown, ‘‘How we might get humans out of the way,’’ ONF CONNECT, Tech. Rep., Sep. 2019, vol. 19. [Online]. Available: https://tinyurl.com/y4dnxacz
[2] Number of RFCs Published Per Year, document, RFC Editor, 2020. [Online]. Available: https://www.rfc-editor.org/rfcs-per-year/
[3] B. Trammell and M. Kuehlewind, Report From the IAB Workshop on Stack Evolution in a Middlebox Internet (SEMI), document RFC7663, 2015. [Online]. Available: https://tools.ietf.org/html/rfc7663
[4] G. Papastergiou, G. Fairhurst, D. Ros, A. Brunstrom, K.-J. Grinnemo, P. Hurtig, N. Khademi, M. Tüxen, M. Welzl, D. Damjanovic, and S. Mangiante, ‘‘De-ossifying the Internet transport layer: A survey and future perspectives,’’ IEEE Commun. Surveys Tuts., vol. 19, no. 1, pp. 619–639, 1st Quart., 2017.
[5] The Register. (Aug. 2011). VMware, Cisco Stretch Virtual LANs Across the Heavens. [Online]. Available: https://tinyurl.com/y6mxhqzn
[6] M. Mahalingam, D. Dutt, K. Duda, P. Agarwal, L. Kreeger, T. Sridhar, M. Bursell, and C. Wright, Virtual Extensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks Over Layer 3 Networks, document RFC7348, 2014. [Online]. Available: http://www.rfc-editor.org/rfc/rfc7348.txt
[7] M. Casado, M. J. Freedman, J. Pettit, J. Luo, N. McKeown, and S. Shenker, ‘‘Ethane: Taking control of the enterprise,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 37, no. 4, pp. 1–12, 2007.
[8] D. Kreutz, F. M. V. Ramos, P. E. Verissimo, C. E. Rothenberg, S. Azodolmolky, and S. Uhlig, ‘‘Software-defined networking: A comprehensive survey,’’ Proc. IEEE, vol. 103, no. 1, pp. 14–76, Jan. 2015.
[9] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, and G. Varghese, ‘‘P4: Programming protocol-independent packet processors,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 44, no. 3, pp. 87–95, 2014.
[10] Barefoot Networks. Use Cases. Accessed: Jun. 1, 2021. [Online]. Available: https://www.barefootnetworks.com/use-cases/
[11] A. Weissberger. Comcast: ONF Trellis Software is in Production Together With L2/L3 White Box Switches. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/y69jc7sv
[12] N. Akiyama and M. Nishiki. P4 and Stratum Use Case for New Edge Cloud. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/yxuoo9qv
[13] Stordis GmbH. New STORDIS Advanced Programmable Switches (APS) First to Unlock the Full Potential of P4 and Next Generation Software Defined Networking (NG-SDN). Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/y3kjnypl
[14] Open Networking Foundation. Stratum—ONF Launches Major New Open Source SDN Switching Platform With Support From Google. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/yy3ykw7g
[15] Open Networking Foundation (ONF). Onward and Upward: P4.org Joins ONF and LF. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/53upv6wf
[16] Facebook Engineering. Disaggregate: Networking Recap. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/yxoaj7kw
[17] Open Compute Project. Alibaba DC Network Evolution With Open SONiC and Programmable HW. Accessed: Jun. 1, 2021. [Online]. Available: https://www.opencompute.org/files/OCP2018.alibaba.pdf
[18] S. Heule. Using P4 and P4 Runtime for Optimal L3 Routing. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/y365gnqy
[19] N. McKeown. SDN Phase 3: Getting the Humans Out of the Way ONF Connect 19. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/tp9bxw4
[20] Edgecore. (2020). Wedge 100BF-32X, 100GbE Data Center Switch. [Online]. Available: https://tinyurl.com/sy2jkqe
[21] STORDIS. The New Advanced Programmable Switches are Available. Accessed: Jun. 1, 2021. [Online]. Available: https://www.stordis.com/products/
[22] Cisco. Cisco Nexus 34180YC and 3464C Programmable Switches Data Sheet. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/y92cbdxe
[23] Arista. Arista 7170 Series. Accessed: Jun. 1, 2021. [Online]. Available: https://www.arista.com/en/products/7170-series
[24] Juniper Networks. Juniper Advancing Disaggregation Through P4 Runtime Integration. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/yygz547t
[25] Interface Masters. Tahoe 2624. Accessed: Jun. 1, 2021. [Online]. Available: https://interfacemasters.com/products/switches/10g-40g/tahoe-2624/
[26] Barefoot Networks. Tofino ASIC. Accessed: Jun. 1, 2021. [Online]. Available: https://www.barefootnetworks.com/products/brief-tofino/
[27] Xilinx. Xilinx Solutions. Accessed: Jun. 1, 2021. [Online]. Available: https://www.xilinx.com/products/silicon-devices.html
[28] Pensando. The Pensando Distributed Services Platform. Accessed: Jun. 1, 2021. [Online]. Available: https://pensando.io/our-platform/
[29] Mellanox. Empowering the Next Generation of Secure Cloud SmartNICs. Accessed: Jun. 1, 2021. [Online]. Available: https://www.mellanox.com/products/smartnic
[30] Innovium. Teralynx Switch Silicon. Accessed: Jun. 1, 2021. [Online]. Available: https://www.innovium.com/teralynx/
[31] I. Baldin, J. Griffioen, K. Wang, I. Monga, and A. Nikolich. Mid-Scale RI-1 (M1:IP): FABRIC: Adaptive Programmable Research Infrastructure for Computer Science and Science Applications. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/y463v9z9
[32] FABRIC. About FABRIC. Accessed: Jun. 1, 2021. [Online]. Available: https://fabric-testbed.net/about/overview
[33] J. Mambretti, J. Chen, F. Yeh, and S. Y. Yu, ‘‘International P4 networking testbed,’’ in Proc. SC Netw. Res. Exhib., 2019, pp. 1–2.
[34] 2STiC. A National Programmable Infrastructure to Experiment With Next-Generation Networks. Accessed: Jun. 1, 2021. [Online]. Available: https://www.2stic.nl/national-programmable-infrastructure.html
[35] H. Stubbe, ‘‘P4 compiler & interpreter: A survey,’’ Future Internet Innov. Internet Technol. Mobile Commun., vol. 47, pp. 1–6, May 2017.
[36] T. Dargahi, A. Caponi, M. Ambrosin, G. Bianchi, and M. Conti, ‘‘A survey on the security of stateful SDN data planes,’’ IEEE Commun. Surveys Tuts., vol. 19, no. 3, pp. 1701–1725, 3rd Quart., 2017.
[37] W. L. da Costa Cordeiro, J. A. Marques, and L. P. Gaspary, ‘‘Data plane programmability beyond OpenFlow: Opportunities and challenges for network and service operations and management,’’ J. Netw. Syst. Manage., vol. 25, no. 4, pp. 784–818, Oct. 2017.
[38] A. Satapathy. (2018). Comprehensive Study of P4 Programming Language and Software-Defined Networks. [Online]. Available: https://tinyurl.com/y4d4zma9
[39] R. Bifulco and G. Rétvári, ‘‘A survey on the programmable data plane: Abstractions, architectures, and open problems,’’ in Proc. IEEE 19th Int. Conf. High Perform. Switching Routing (HPSR), Jun. 2018, pp. 1–7.
[40] E. Kaljic, A. Maric, P. Njemcevic, and M. Hadzialic, ‘‘A survey on data plane flexibility and programmability in software-defined networking,’’ IEEE Access, vol. 7, pp. 47804–47840, 2019.
[41] P. G. Kannan and M. C. Chan, ‘‘On programmable networking evolution,’’ CSI Trans. ICT, vol. 8, no. 1, pp. 69–76, Mar. 2020.
[42] L. Tan, W. Su, W. Zhang, J. Lv, Z. Zhang, J. Miao, X. Liu, and N. Li, ‘‘In-band network telemetry: A survey,’’ Comput. Netw., vol. 186, Feb. 2021, Art. no. 107763.
[43] X. Zhang, L. Cui, K. Wei, F. P. Tso, Y. Ji, and W. Jia, ‘‘A survey on stateful data plane in software defined networks,’’ Comput. Netw., vol. 184, Jan. 2021, Art. no. 107597.
[44] G. Bianchi, M. Bonola, A. Capone, and C. Cascone, ‘‘OpenState: Programming platform-independent stateful openflow applications inside the switch,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 44, no. 2, pp. 44–51, Apr. 2014.
[45] M. Moshref, A. Bhargava, A. Gupta, M. Yu, and R. Govindan, ‘‘Flow-level state transition as a new switch primitive for SDN,’’ in Proc. 3rd Workshop Hot Topics Softw. Defined Netw., 2014, pp. 61–66.
[46] P4 Language Consortium. P4Runtime. Accessed: Jun. 1, 2021. [Online]. Available: https://github.com/p4lang/PI/
[47] Y. Rekhter, T. Li, and S. Hares, A Border Gateway Protocol 4 (BGP-4), document RFC4271, 2006. [Online]. Available: http://www.rfc-editor.org/rfc/rfc4271.txt
[48] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner, ‘‘OpenFlow: Enabling innovation in campus networks,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 38, no. 2, pp. 69–74, Mar. 2008.
[49] N. McKeown. Why Does the Internet Need a Programmable Forwarding Plane. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/y6x7qqpm
[50] C. Kim. (2019). Evolution of Networking, Networking Field Day 21, 2:01. [Online]. Available: https://tinyurl.com/y9fkj7qx
[51] A. Shapiro. (Apr. 2020). P4-Programming Data Plane Use-Cases P4 Expert Roundtable Series. [Online]. Available: https://tinyurl.com/y5n4k83h
[52] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Izzard, F. Mujica, and M. Horowitz, ‘‘Forwarding metamorphosis: Fast programmable match-action processing in hardware for SDN,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 43, no. 4, pp. 99–110, 2013.
[53] N. McKeown. Creating an End-to-End Programming Model for Packet Forwarding. Accessed: Jun. 1, 2021. [Online]. Available: https://www.youtube.com/watch?v=fiBuao6YZl0&t=4216s
[54] Z. Liu, J. Bi, Y. Zhou, Y. Wang, and Y. Lin, ‘‘NetVision: Towards network telemetry as a service,’’ in Proc. IEEE 26th Int. Conf. Netw. Protocols (ICNP), Sep. 2018, pp. 247–248.
[55] J. Hyun, N. Van Tu, and J. W.-K. Hong, ‘‘Towards knowledge-defined networking using in-band network telemetry,’’ in Proc. IEEE/IFIP Netw. Oper. Manage. Symp. (NOMS), Apr. 2018, pp. 1–7.
[56] Y. Kim, D. Suh, and S. Pack, ‘‘Selective in-band network telemetry for overhead reduction,’’ in Proc. IEEE 7th Int. Conf. Cloud Netw. (CloudNet), Oct. 2018, pp. 1–3.
[57] T. Pan, E. Song, Z. Bian, X. Lin, X. Peng, J. Zhang, T. Huang, B. Liu, and Y. Liu, ‘‘INT-path: Towards optimal path planning for in-band network-wide telemetry,’’ in Proc. IEEE Conf. Comput. Commun. (INFOCOM), Apr. 2019, pp. 487–495.
[58] J. A. Marques, M. C. Luizelli, R. I. T. da Costa Filho, and L. P. Gaspary, ‘‘An optimization-based approach for efficient network monitoring using in-band network telemetry,’’ J. Internet Services Appl., vol. 10, no. 1, p. 12, Dec. 2019.
[59] B. Niu, J. Kong, S. Tang, Y. Li, and Z. Zhu, ‘‘Visualize your IP-over-optical network in realtime: A P4-based flexible multilayer in-band network telemetry (ML-INT) system,’’ IEEE Access, vol. 7, pp. 82413–82423, 2019.
[60] A. Karaagac, E. De Poorter, and J. Hoebeke, ‘‘In-band network telemetry in industrial wireless sensor networks,’’ IEEE Trans. Netw. Service Manage., vol. 17, no. 1, pp. 517–531, Mar. 2020.
[61] R. B. Basat, S. Ramanathan, Y. Li, G. Antichi, M. Yu, and M. Mitzenmacher, ‘‘PINT: Probabilistic in-band network telemetry,’’ in Proc. Annu. Conf. ACM Special Interest Group Data Commun. Appl., Technol., Archit., Protocols Comput. Commun., Jul. 2020, pp. 662–680.
[62] Y. Lin, Y. Zhou, Z. Liu, K. Liu, Y. Wang, M. Xu, J. Bi, Y. Liu, and J. Wu, ‘‘NetView: Towards on-demand network-wide telemetry in the data center,’’ Comput. Netw., vol. 180, Oct. 2020, Art. no. 107386.
[63] N. Van Tu, J. Hyun, and J. W.-K. Hong, ‘‘Towards ONOS-based SDN monitoring using in-band network telemetry,’’ in Proc. 19th Asia–Pacific Netw. Oper. Manage. Symp. (APNOMS), Sep. 2017, pp. 76–81.
[64] Serkant. Prometheus INT Exporter. Accessed: Jun. 1, 2021. [Online]. Available: https://github.com/serkantul/prometheus_int_exporter/
[65] N. Van Tu, J. Hyun, G. Y. Kim, J.-H. Yoo, and J. W.-K. Hong, ‘‘INTCollector: A high-performance collector for in-band network telemetry,’’ in Proc. 14th Int. Conf. Netw. Service Manage. (CNSM), 2018, pp. 10–18.
[66] Barefoot Networks. Barefoot Deep Insight—Product Brief. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/u2ncvry
[67] Broadcom. BroadView Analytics, Trident 3 In-Band Telemetry. Accessed: Jun. 1, 2021. [Online]. Available: https://tinyurl.com/yxr2qydb
[68] M. Handley, C. Raiciu, A. Agache, A. Voinescu, A. W. Moore, G. Antichi, and M. Wójcik, ‘‘Re-architecting datacenter networks and stacks for low latency and high performance,’’ in Proc. Conf. ACM Special Interest Group Data Commun., Aug. 2017, pp. 29–42.
[69] A. Feldmann, B. Chandrasekaran, S. Fathalli, and E. N. Weyulu, ‘‘P4-enabled network-assisted congestion feedback: A case for NACKs,’’ in Proc. Workshop Buffer Sizing, 2019, pp. 1–7.
[70] Y. Li, R. Miao, H. H. Liu, Y. Zhuang, F. Feng, L. Tang, Z. Cao, M. Zhang, F. Kelly, M. Alizadeh, and M. Yu, ‘‘HPCC: High precision congestion control,’’ in Proc. ACM Special Interest Group Data Commun., 2019, pp. 44–58.
[71] E. F. Kfoury, J. Crichigno, E. Bou-Harb, D. Khoury, and G. Srivastava, ‘‘Enabling TCP pacing using programmable data plane switches,’’ in Proc. 42nd Int. Conf. Telecommun. Signal Process. (TSP), Jul. 2019, pp. 273–277.
[72] S. Shahzad, E.-S. Jung, J. Chung, and R. Kettimuthu, ‘‘Enhanced explicit congestion notification (EECN) in TCP with P4 programming,’’ in Proc. Int. Conf. Green Hum. Inf. Technol. (ICGHIT), Feb. 2020, pp. 35–40.
[73] B. Turkovic, F. Kuipers, N. van Adrichem, and K. Langendoen, ‘‘Fast network congestion detection and avoidance using P4,’’ in Proc. Workshop Netw. Emerg. Appl. Technol., 2018, pp. 45–51.
[74] B. Turkovic and F. Kuipers, ‘‘P4air: Increasing fairness among competing congestion control algorithms,’’ in Proc. IEEE 28th Int. Conf. Netw. Protocols (ICNP), Oct. 2020, pp. 1–12.
[75] M. Apostolaki, L. Vanbever, and M. Ghobadi, ‘‘FAB: Toward flow-aware buffer sharing on programmable switches,’’ in Proc. Workshop Buffer Sizing, Dec. 2019, pp. 1–6.
[76] J. Geng, J. Yan, and Y. Zhang, ‘‘P4QCN: Congestion control using P4-capable device in data center networks,’’ Electronics, vol. 8, no. 3, p. 280, Mar. 2019.
[77] Y. Li, R. Miao, C. Kim, and M. Yu, ‘‘FlowRadar: A better NetFlow for data centers,’’ in Proc. 13th USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2016, pp. 311–324.
[78] Z. Liu, A. Manousis, G. Vorsanger, V. Sekar, and V. Braverman, ‘‘One sketch to rule them all: Rethinking network flow monitoring with UnivMon,’’ in Proc. ACM SIGCOMM Conf., Aug. 2016, pp. 101–114.
[79] S. Narayana, A. Sivaraman, V. Nathan, P. Goyal, V. Arun, M. Alizadeh, V. Jeyakumar, and C. Kim, ‘‘Language-directed hardware design for network performance monitoring,’’ in Proc. Conf. ACM Special Interest Group Data Commun., Aug. 2017, pp. 85–98.
[80] M. Ghasemi, T. Benson, and J. Rexford, ‘‘Dapper: Data plane performance diagnosis of TCP,’’ in Proc. Symp. SDN Res., Apr. 2017, pp. 61–74.
[81] T. Yang, J. Jiang, P. Liu, Q. Huang, J. Gong, Y. Zhou, R. Miao, X. Li, and S. Uhlig, ‘‘Elastic sketch: Adaptive and fast network-wide measurements,’’ in Proc. Conf. ACM Special Interest Group Data Commun., Aug. 2018, pp. 561–575.
[82] N. Yaseen, J. Sonchack, and V. Liu, ‘‘Synchronized network snapshots,’’ in Proc. Conf. ACM Special Interest Group Data Commun., Aug. 2018, pp. 402–416.
[83] R. Joshi, T. Qu, M. C. Chan, B. Leong, and B. T. Loo, ‘‘BurstRadar: Practical real-time microburst monitoring for datacenter networks,’’ in Proc. 9th Asia–Pacific Workshop Syst., Aug. 2018, pp. 1–8.
[84] M. Lee and J. Rexford. (2018). Detecting Violations of Service-Level Agreements in Programmable Switches. [Online]. Available: https://p4campus.cs.princeton.edu/pubs/mackl_thesis_paper.pdf
[85] J. Sonchack, O. Michel, A. J. Aviv, E. Keller, and J. M. Smith, ‘‘Scaling hardware accelerated network monitoring to concurrent and dynamic queries with flow,’’ in Proc. USENIX Annu. Tech. Conf. (USENIX ATC), 2018, pp. 823–835.
[86] J. Sonchack, A. J. Aviv, E. Keller, and J. M. Smith, ‘‘Turboflow: Information rich flow record generation on commodity switches,’’ in Proc. 13th EuroSys Conf., Apr. 2018, pp. 1–16.
[87] A. Gupta, R. Harrison, M. Canini, N. Feamster, J. Rexford, and W. Willinger, ‘‘Sonata: Query-driven streaming network telemetry,’’ in Proc. Conf. ACM Special Interest Group Data Commun., Aug. 2018, pp. 357–371.
[88] X. Chen, S. L. Feibish, Y. Koral, J. Rexford, O. Rottenstreich, S. A. Monetti, and T.-Y. Wang, ‘‘Fine-grained queue measurement in the data plane,’’ in Proc. 15th Int. Conf. Emerg. Netw. Exp. Technol., Dec. 2019, pp. 15–29.
[89] Z. Liu, S. Zhou, O. Rottenstreich, V. Braverman, and J. Rexford, ‘‘Memory-efficient performance monitoring on programmable switches with lean algorithms,’’ in Proc. Symp. Algorithmic Princ. Comput. Syst. (APoCS), 2020, pp. 31–44.
[90] T. Holterbach, E. C. Molero, M. Apostolaki, A. Dainotti, S. Vissicchio, and L. Vanbever, ‘‘Blink: Fast connectivity recovery entirely in the data plane,’’ in Proc. 16th USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2019, pp. 161–176.
[91] D. Ding, M. Savi, and D. Siracusa, ‘‘Estimating logarithmic and exponential functions to track network traffic entropy in P4,’’ in Proc. IEEE/IFIP Netw. Oper. Manage. Symp. (NOMS), Apr. 2020, pp. 1–9.
[92] W. Wang, P. Tammana, A. Chen, and T. S. E. Ng, ‘‘Grasp the root causes in the data plane: Diagnosing latency problems with SpiderMon,’’ in Proc. Symp. SDN Res., Mar. 2020, pp. 55–61.
[93] R. Teixeira, R. Harrison, A. Gupta, and J. Rexford, ‘‘PacketScope: Monitoring the packet lifecycle inside a switch,’’ in Proc. Symp. SDN Res., Mar. 2020, pp. 76–82.
[94] J. Bai, M. Zhang, G. Li, C. Liu, M. Xu, and H. Hu, ‘‘FastFE: Accelerating ML-based traffic analysis with programmable switches,’’ in Proc. Workshop Secure Program. Netw. Infrastruct. (SPIN). New York, NY, USA: Association for Computing Machinery, 2020, pp. 1–7.
[95] X. Chen, H. Kim, J. M. Aman, W. Chang, M. Lee, and J. Rexford, ‘‘Measuring TCP round-trip time in the data plane,’’ in Proc. Workshop Secure Program. Netw. Infrastruct., Aug. 2020, pp. 35–41.
[96] Y. Qiu, K.-F. Hsu, J. Xing, and A. Chen, ‘‘A feasibility study on time-aware monitoring with commodity switches,’’ in Proc. Workshop Secure Program. Netw. Infrastruct., Aug. 2020, pp. 22–27.
[97] Q. Huang, H. Sun, P. P. C. Lee, W. Bai, F. Zhu, and Y. Bao, ‘‘OmniMon: Re-architecting network telemetry with resource efficiency and full accuracy,’’ in Proc. Annu. Conf. ACM Special Interest Group Data Commun. Appl., Technol., Archit., Protocols Comput. Commun., Jul. 2020, pp. 404–421.
[98] X. Chen, S. Landau-Feibish, M. Braverman, and J. Rexford, ‘‘BeauCoup: Answering many network traffic queries, one memory update at a time,’’ in Proc. Annu. Conf. ACM Special Interest Group Data Commun. Appl., Technol., Archit., Protocols Comput. Commun., Jul. 2020, pp. 226–239.
[99] R. Kundel, J. Blendin, T. Viernickel, B. Koldehofe, and R. Steinmetz, ‘‘P4-CoDel: Active queue management in programmable data planes,’’ in Proc. IEEE Conf. Netw. Function Virtualization Softw. Defined Netw. (NFV-SDN), Nov. 2018, pp. 1–4.
[100] F. Schwarzkopf, S. Veith, and M. Menth, ‘‘Performance analysis of CoDel and PIE for saturated TCP sources,’’ in Proc. 28th Int. Teletraffic Congr. (ITC), vol. 1, Sep. 2016, pp. 175–183.
[101] N. K. Sharma, M. Liu, K. Atreya, and A. Krishnamurthy, ‘‘Approximating fair queueing on reconfigurable switches,’’ in Proc. 15th USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2018, pp. 1–16.
[102] S. Laki, P. Vörös, and F. Fejes, ‘‘Towards an AQM evaluation testbed with P4 and DPDK,’’ in Proc. ACM SIGCOMM Conf. Posters Demos, 2019, pp. 148–150.
[103] C. Papagianni and K. De Schepper, ‘‘PI2 for P4: An active queue management scheme for programmable data planes,’’ in Proc. 15th Int. Conf. Emerg. Netw. Exp. Technol., Dec. 2019, pp. 84–86.
[104] I. Kunze, M. Gunz, D. Saam, K. Wehrle, and J. Rüth, ‘‘Tofino + P4: A strong compound for AQM on high-speed networks?’’ in Proc. IFIP/IEEE IM, May 2021, pp. 1–9.
[105] L. Toresson, ‘‘Making a packet-value based AQM on a programmable switch for resource-sharing and low latency,’’ M.S. thesis, Dept. Comput. Sci., Fac. Health, Sci. Technol., Karlstads Univ., Karlstad, Sweden, 2021.
[106] A. Mushtaq, R. Mittal, J. McCauley, M. Alizadeh, S. Ratnasamy, and S. Shenker, ‘‘Datacenter congestion control: Identifying what is essential and making it practical,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 49, no. 3, pp. 32–38, Nov. 2019.
[107] M. Menth, H. Mostafaei, D. Merling, and M. Häberle, ‘‘Implementation and evaluation of activity-based congestion management using P4 (P4-ABC),’’ Future Internet, vol. 11, no. 7, p. 159, Jul. 2019.
[108] A. G. Alcoz, A. Dietmüller, and L. Vanbever, ‘‘SP-PIFO: Approximating push-in first-out behaviors using strict-priority queues,’’ in Proc. 17th USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2020, pp. 59–76.
[109] K. Kumazoe and M. Tsuru, ‘‘P4-based implementation and evaluation of adaptive early packet discarding scheme,’’ in Proc. Int. Conf. Intell. Netw. Collaborative Syst. Cham, Switzerland: Springer, 2020, pp. 460–469.
[110] D. Bhat, J. Anderson, P. Ruth, M. Zink, and K. Keahey, ‘‘Application-based QoE support with P4 and OpenFlow,’’ in Proc. IEEE Conf. Comput. Commun. Workshops (INFOCOM WKSHPS), Apr. 2019, pp. 817–823.
[111] Y.-W. Chen, L.-H. Yen, W.-C. Wang, C.-A. Chuang, Y.-S. Liu, and C.-C. Tseng, ‘‘P4-enabled bandwidth management,’’ in Proc. 20th Asia–Pacific Netw. Oper. Manage. Symp. (APNOMS), Sep. 2019, pp. 1–5.
[112] C. Chen, H.-C. Fang, and M. S. Iqbal, ‘‘QoSTCP: Provide consistent rate guarantees to TCP flows in software defined networks,’’ in Proc. IEEE Int. Conf. Commun. (ICC), Jun. 2020, pp. 1–6.
[113] K. Tokmakov, M. Sarker, J. Domaschka, and S. Wesner, ‘‘A case for data centre traffic management on software programmable Ethernet switches,’’ in Proc. IEEE 8th Int. Conf. Cloud Netw. (CloudNet), Nov. 2019, pp. 1–6.
[114] S. S. W. Lee and K.-Y. Chan, ‘‘A traffic meter based on a multicolor marker for bandwidth guarantee and priority differentiation in SDN virtual networks,’’ IEEE Trans. Netw. Service Manage., vol. 16, no. 3, pp. 1046–1058, Sep. 2019.
[115] M. Shahbaz, L. Suresh, J. Rexford, N. Feamster, O. Rottenstreich, and M. Hira, ‘‘Elmo: Source routed multicast for public clouds,’’ in Proc. ACM Special Interest Group Data Commun., 2019, pp. 458–471.
[116] M. Kadosh, Y. Piasetzky, B. Gafni, L. Suresh, M. Shahbaz, and S. Banerjee. (Apr. 2020). Realizing Source Routed Multicast Using Mellanox’s Programmable Hardware Switches, P4 Expert Roundtable Series. [Online]. Available: https://tinyurl.com/y8dfcsum
[117] W. Braun, J. Hartmann, and M. Menth, ‘‘Demo: Scalable and reliable software-defined multicast with BIER and P4,’’ in Proc. IFIP/IEEE Symp. Integr. Netw. Service Manage. (IM), May 2017, pp. 905–906.
[118] N. Katta, M. Hira, C. Kim, A. Sivaraman, and J. Rexford, ‘‘HULA: Scalable load balancing using programmable data planes,’’ in Proc. Symp. SDN Res., Mar. 2016, pp. 1–12.
[119] C. H. Benet, A. J. Kassler, T. Benson, and G. Pongracz, ‘‘MP-HULA: Multipath transport aware load balancing using programmable data planes,’’ in Proc. Morning Workshop Netw. Comput., Aug. 2018, pp. 7–13.
[120] R. Miao, H. Zeng, C. Kim, J. Lee, and M. Yu, ‘‘SilkRoad: Making stateful layer-4 load balancing fast and cheap using switching ASICs,’’ in Proc. Conf. ACM Special Interest Group Data Commun., Aug. 2017, pp. 15–28.
[121] Z. Liu, Z. Bai, Z. Liu, X. Li, C. Kim, V. Braverman, X. Jin, and I. Stoica, ‘‘DistCache: Provable load balancing for large-scale storage systems with distributed caching,’’ in Proc. 17th USENIX Conf. File Storage Technol. (FAST), 2019, pp. 143–157.
[122] K.-F. Hsu, P. Tammana, R. Beckett, A. Chen, J. Rexford, and D. Walker, ‘‘Adaptive weighted traffic splitting in programmable data planes,’’ in Proc. Symp. SDN Res., Mar. 2020, pp. 103–109.
[123] K.-F. Hsu, R. Beckett, A. Chen, J. Rexford, and D. Walker, ‘‘Contra: A programmable system for performance-aware routing,’’ in Proc. 17th USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2020, pp. 701–721.
[124] V. Olteanu, A. Agache, A. Voinescu, and C. Raiciu, ‘‘Stateless datacenter load-balancing with beamer,’’ in Proc. 15th USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2018, pp. 125–139.
[125] B. Pit-Claudel, Y. Desmouceaux, P. Pfister, M. Townsley, and T. Clausen, ‘‘Stateless load-aware load balancing in P4,’’ in Proc. IEEE 26th Int. Conf. Netw. Protocols (ICNP), Sep. 2018, pp. 418–423.
[126] J.-L. Ye, C. Chen, and Y. H. Chu, ‘‘A weighted ECMP load balancing scheme for data centers using P4 switches,’’ in Proc. IEEE 7th Int. Conf. Cloud Netw. (CloudNet), Oct. 2018, pp. 1–4.
[127] X. Jin, X. Li, H. Zhang, R. Soulé, J. Lee, N. Foster, C. Kim, and I. Stoica, ‘‘NetCache: Balancing key-value stores with fast in-network caching,’’ in Proc. 26th Symp. Oper. Syst. Princ., Oct. 2017, pp. 121–136.
[128] M. Liu, L. Luo, J. Nelson, L. Ceze, A. Krishnamurthy, and K. Atreya, ‘‘IncBricks: Toward in-network computation with an in-network cache,’’ in Proc. 22nd Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2017, pp. 795–809.
[129] E. Cidon, S. Choi, S. Katti, and N. McKeown, ‘‘AppSwitch: Application-layer load balancing within a software switch,’’ in Proc. 1st Asia–Pacific Workshop Netw., 2017, pp. 64–70.
[130] Q. Wang, Y. Lu, E. Xu, J. Li, Y. Chen, and J. Shu, ‘‘Concordia: Distributed shared memory with in-network cache coherence,’’ in Proc. 19th USENIX Conf. File Storage Technol. (FAST), 2021, pp. 277–292.
[131] J. Li, J. Nelson, E. Michael, X. Jin, and D. R. Ports, ‘‘Pegasus: Tolerating skewed workloads in distributed storage with in-network coherence directories,’’ in Proc. 14th USENIX Symp. Oper. Syst. Design Implement. (OSDI), 2020, pp. 387–406.
[132] S. Signorello, R. State, J. François, and O. Festor, ‘‘NDN.P4: Programming information-centric data-planes,’’ in Proc. IEEE NetSoft Conf. Workshops (NetSoft), Jun. 2016, pp. 384–389.
[133] G. Grigoryan and Y. Liu, ‘‘PFCA: A programmable FIB caching architecture,’’ in Proc. Symp. Archit. Netw. Commun. Syst., Jul. 2018, pp. 97–103.
[134] C. Zhang, J. Bi, Y. Zhou, K. Zhang, and Z. Ma, ‘‘B-cache: A behavior-level caching framework for the programmable data plane,’’ in Proc. IEEE Symp. Comput. Commun. (ISCC), Jun. 2018, pp. 84–90.
[135] J. Vestin, A. Kassler, and J. Åkerberg, ‘‘FastReact: In-network control and caching for industrial control networks using programmable data planes,’’ in Proc. IEEE 23rd Int. Conf. Emerg. Technol. Factory Automat. (ETFA), Sep. 2018, pp. 219–226.
[136] J. Woodruff, M. Ramanujam, and N. Zilberman, ‘‘P4DNS: In-network DNS,’’ in Proc. ACM/IEEE Symp. Archit. Netw. Commun. Syst. (ANCS), Sep. 2019, pp. 1–6.
[137] R. Ricart-Sanchez, P. Malagon, P. Salva-Garcia, E. C. Perez, Q. Wang, and J. M. A. Calero, ‘‘Towards an FPGA-accelerated programmable data path for edge-to-core communications in 5G networks,’’ J. Netw. Comput. Appl., vol. 124, pp. 80–93, Dec. 2018.
[138] R. Ricart-Sanchez, P. Malagon, J. M. Alcaraz-Calero, and Q. Wang, ‘‘Hardware-accelerated firewall for 5G mobile networks,’’ in Proc. IEEE 26th Int. Conf. Netw. Protocols (ICNP), Sep. 2018, pp. 446–447.
[139] R. Shah, V. Kumar, M. Vutukuru, and P. Kulkarni, ‘‘TurboEPC: Leveraging dataplane programmability to accelerate the mobile packet core,’’ in Proc. Symp. SDN Res., Mar. 2020, pp. 83–95.
[140] S. K. Singh, C. E. Rothenberg, G. Patra, and G. Pongracz, ‘‘Offloading virtual evolved packet gateway user plane functions to a programmable ASIC,’’ in Proc. 1st ACM CoNEXT Workshop Emerg. Netw. Comput. Paradigms (ENCP), 2019, pp. 9–14.
[141] P. Vörös, G. Pongrácz, and S. Laki, ‘‘Towards a hybrid next generation NodeB,’’ in Proc. 3rd P4 Workshop Eur., Dec. 2020, pp. 56–58.
[142] P. Palagummi and K. M. Sivalingam, ‘‘SMARTHO: A network initiated handover in NG-RAN using P4-based switches,’’ in Proc. 14th Int. Conf. Netw. Service Manage. (CNSM), 2018, pp. 338–342.
[143] F. Paolucci, F. Cugini, P. Castoldi, and T. Osinski, ‘‘Enhancing 5G SDN/NFV edge with P4 data plane programmability,’’ IEEE Netw., early access, Apr. 20, 2021, doi: 10.1109/MNET.021.1900599.
[144] Y.-B. Lin, C.-C. Tseng, and M.-H. Wang, ‘‘Effects of transport network slicing on 5G applications,’’ Future Internet, vol. 13, no. 3, p. 69, Mar. 2021.
[145] E. F. Kfoury, J. Crichigno, and E. Bou-Harb, ‘‘Offloading media traffic to programmable data plane switches,’’ in Proc. IEEE Int. Conf. Commun. (ICC), Jun. 2020, pp. 1–7.
[146] B.-M. Andrus, S. A. Sasu, T. Szyrkowiec, A. Autenrieth, M. Chamania, J. K. Fischer, and S. Rasp, ‘‘Zero-touch provisioning of distributed video analytics in a software-defined metro-haul network with P4 processing,’’ in Proc. Opt. Fiber Commun. Conf. (OFC), 2019, pp. 1–3.
[147] T. Jepsen, M. Moshref, A. Carzaniga, N. Foster, and R. Soulé, ‘‘Packet subscriptions for programmable ASICs,’’ in Proc. 17th ACM Workshop Hot Topics Netw., Nov. 2018, pp. 176–183.
[148] C. Wernecke, H. Parzyjegla, G. Mühl, P. Danielis, and D. Timmermann, ‘‘Realizing content-based publish/subscribe with P4,’’ in Proc. IEEE Conf. Netw. Function Virtualization Softw. Defined Netw. (NFV-SDN), Nov. 2018, pp. 1–7.
[149] C. Wernecke, H. Parzyjegla, G. Mühl, E. Schweissguth, and D. Timmermann, ‘‘Flexible notification forwarding for content-based publish/subscribe using P4,’’ in Proc. IEEE Conf. Netw. Function Virtualization Softw. Defined Netw. (NFV-SDN), Nov. 2019, pp. 1–5.
[150] R. Kundel, C. Gärtner, M. Luthra, S. Bhowmik, and B. Koldehofe, ‘‘Flexible content-based publish/subscribe over programmable data planes,’’ in Proc. IEEE/IFIP Netw. Oper. Manage. Symp. (NOMS), Apr. 2020, pp. 1–5.
[151] R. Miguel, S. Signorello, and F. M. V. Ramos, ‘‘Named data networking with programmable switches,’’ in Proc. IEEE 26th Int. Conf. Netw. Protocols (ICNP), Sep. 2018, pp. 400–405.
[152] O. Karrakchou, N. Samaan, and A. Karmouch, ‘‘ENDN: An enhanced NDN architecture with a P4-programmable data plane,’’ in Proc. 7th ACM Conf. Inf.-Centric Netw., Sep. 2020, pp. 1–11.
[153] J. Li, E. Michael, N. K. Sharma, A. Szekeres, and D. R. Ports, ‘‘Just say NO to paxos overhead: Replacing consensus with network ordering,’’ in Proc. 12th USENIX Symp. Oper. Syst. Design Implement. (OSDI), 2016, pp. 467–483.
[154] H. T. Dang, M. Canini, F. Pedone, and R. Soulé, ‘‘Paxos made switch-y,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 46, no. 2, pp. 18–24, Apr. 2016.
[157] X. Jin, X. Li, H. Zhang, N. Foster, J. Lee, R. Soulé, C. Kim, and I. Stoica, ‘‘NetChain: Scale-free sub-RTT coordination,’’ in Proc. 15th USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2018, pp. 35–49.
[158] H. T. Dang, P. Bressana, H. Wang, K. S. Lee, N. Zilberman, H. Weatherspoon, M. Canini, F. Pedone, and R. Soulé, ‘‘Partitioned Paxos via the network data plane,’’ 2019, arXiv:1901.08806. [Online]. Available: http://arxiv.org/abs/1901.08806
[159] E. Sakic, N. Deric, E. Goshi, and W. Kellerer, ‘‘P4BFT: Hardware-accelerated Byzantine-resilient network control plane,’’ 2019, arXiv:1905.04064. [Online]. Available: http://arxiv.org/abs/1905.04064
[160] H. T. Dang, P. Bressana, H. Wang, K. S. Lee, N. Zilberman, H. Weatherspoon, M. Canini, F. Pedone, and R. Soulé, ‘‘P4xos: Consensus as a network service,’’ IEEE/ACM Trans. Netw., vol. 28, no. 4, pp. 1726–1738, Aug. 2020.
[161] A. Sapio, I. Abdelaziz, A. Aldilaijan, M. Canini, and P. Kalnis, ‘‘In-network computation is a dumb idea whose time has come,’’ in Proc. 16th ACM Workshop Hot Topics Netw., 2017, pp. 150–156.
[162] F. Yang, Z. Wang, X. Ma, G. Yuan, and X. An, ‘‘SwitchAgg: A further step towards in-network computation,’’ 2019, arXiv:1904.04024. [Online]. Available: http://arxiv.org/abs/1904.04024
[163] A. Sapio, M. Canini, C.-Y. Ho, J. Nelson, P. Kalnis, C. Kim, A. Krishnamurthy, M. Moshref, D. R. K. Ports, and P. Richtárik, ‘‘Scaling distributed machine learning with in-network aggregation,’’ 2019, arXiv:1903.06701. [Online]. Available: http://arxiv.org/abs/1903.06701
[164] G. Siracusano and R. Bifulco, ‘‘In-network neural networks,’’ 2018, arXiv:1801.05731. [Online]. Available: http://arxiv.org/abs/1801.05731
[165] D. Sanvito, G. Siracusano, and R. Bifulco, ‘‘Can the network be the AI accelerator?’’ in Proc. Morning Workshop Netw. Comput., Aug. 2018, pp. 20–25.
[166] Z. Xiong and N. Zilberman, ‘‘Do switches dream of machine learning?: Toward in-network classification,’’ in Proc. 18th ACM Workshop Hot Topics Netw., Nov. 2019, pp. 25–33.
[167] T. Jepsen, M. Moshref, A. Carzaniga, N. Foster, and R. Soulé, ‘‘Life in the fast lane: A line-rate linear road,’’ in Proc. Symp. SDN Res., Mar. 2018, pp. 1–7.
[168] T. Kohler, R. Mayer, F. Dürr, M. Maaß, S. Bhowmik, and K. Rothermel, ‘‘P4CEP: Towards in-network complex event processing,’’ in Proc. Morning Workshop Netw. Comput., Aug. 2018, pp. 33–38.
[169] L. Chen, G. Chen, J. Lingys, and K. Chen, ‘‘Programmable switch as a parallel computing device,’’ 2018, arXiv:1803.01491. [Online]. Available: http://arxiv.org/abs/1803.01491
[170] T. Jepsen, D. Alvarez, N. Foster, C. Kim, J. Lee, M. Moshref, and R. Soulé, ‘‘Fast string searching on PISA,’’ in Proc. ACM Symp. SDN Res., Apr. 2019, pp. 21–28.
[171] Y. Qiao, X. Kong, M. Zhang, Y. Zhou, M. Xu, and J. Bi, ‘‘Towards in-network acceleration of erasure coding,’’ in Proc. Symp. SDN Res., Mar. 2020, pp. 41–47.
[172] Z. Yu, Y. Zhang, V. Braverman, M. Chowdhury, and X. Jin, ‘‘NetLock: Fast, centralized lock management using programmable switches,’’ in Proc. Annu. Conf. ACM Special Interest Group Data Commun. Appl., Technol., Archit., Protocols Comput. Commun., Jul. 2020, pp. 126–138.
[173] M. Tirmazi, R. B. Basat, J. Gao, and M. Yu, ‘‘Cheetah: Accelerating database queries with switch pruning,’’ in Proc. ACM SIGMOD Int. Conf. Manage. Data, Jun. 2020, pp. 2407–2422.
[174] S. Vaucher, N. Yazdani, P. Felber, D. E. Lucani, and V. Schiavoni, ‘‘ZipLine: In-network compression at line speed,’’ in Proc. 16th Int. Conf. Emerg. Netw. Exp. Technol., Nov. 2020, pp. 399–405.
[175] R. Glebke, J. Krude, I. Kunze, J. Rüth, F. Senger, and K. Wehrle, ‘‘Towards executing computer vision functionality on programmable network devices,’’ in Proc. 1st ACM CoNEXT Workshop Emerg. Netw. Comput. Paradigms (ENCP), 2019, pp. 15–20.
[176] S.-Y. Wang, C.-M. Wu, Y.-B. Lin, and C.-C. Huang, ‘‘High-speed data-plane packet aggregation and disaggregation by P4 switches,’’ J. Netw. Comput. Appl., vol. 142, pp. 98–110, Sep. 2019.
[177] S.-Y. Wang, J.-Y. Li, and Y.-B. Lin, ‘‘Aggregating and disaggregating packets with various sizes of payload in P4 switches at 100 Gbps line rate,’’ J. Netw. Comput. Appl., vol. 165, Sep. 2020, Art. no. 102676.
[155] J. Li, E. Michael, and D. R. K. Ports, ‘‘Eris: Coordination-free consistent [178] Y.-B. Lin, S.-Y. Wang, C.-C. Huang, and C.-M. Wu, ‘‘The SDN approach
transactions using in-network concurrency control,’’ in Proc. 26th Symp. for the aggregation/disaggregation of sensor data,’’ Sensors, vol. 18, no. 7,
Oper. Syst. Princ., Oct. 2017, pp. 104–120. p. 2025, Jun. 2018.
[156] B. Han, V. Gopalakrishnan, M. Platania, Z.-L. Zhang, and Y. Zhang, [179] A. L. R. Madureira, F. R. C. Araújo, and L. N. Sampaio, ‘‘On supporting
‘‘Network-assisted raft consensus protocol,’’ U.S. Patent 16 101 751, IoT data aggregation through programmable data planes,’’ Comput. Netw.,
Feb. 13, 2020. vol. 177, Aug. 2020, Art. no. 107330.
[180] M. Uddin, S. Mukherjee, H. Chang, and T. V. Lakshman, ‘‘SDN-based [201] R. Datta, S. Choi, A. Chowdhary, and Y. Park, ‘‘P4Guard: Designing
service automation for IoT,’’ in Proc. IEEE 25th Int. Conf. Netw. Protocols P4 based firewall,’’ in Proc. IEEE Mil. Commun. Conf. (MILCOM),
(ICNP), Oct. 2017, pp. 1–10. Oct. 2018, pp. 1–6.
[181] M. Uddin, S. Mukherjee, H. Chang, and T. V. Lakshman, ‘‘SDN-based [202] J. Cao, Y. Liu, Y. Zhou, C. Sun, Y. Wang, and J. Bi, ‘‘CoFilter: A high-
multi-protocol edge switching for IoT service automation,’’ IEEE J. Sel. performance switch-accelerated stateful packet filter for bare-metal
Areas Commun., vol. 36, no. 12, pp. 2775–2786, Dec. 2018. servers,’’ in Proc. 28th Int. Conf. Comput. Commun. Netw. (ICCCN),
[182] V. Sivaraman, S. Narayana, O. Rottenstreich, S. Muthukrishnan, and Jul. 2019, pp. 1–9.
J. Rexford, ‘‘Heavy-hitter detection entirely in the data plane,’’ in Proc. [203] J. Li, H. Jiang, W. Jiang, J. Wu, and W. Du, ‘‘SDN-based state-
Symp. SDN Res., Apr. 2017, pp. 164–176. ful firewall for cloud,’’ in Proc. IEEE IEEE 6th Int. Conf. Big Data
[183] J. Kučera, D. A. Popescu, G. Antichi, J. Kořenek, and A. W. Moore, ‘‘Seek Secur. Cloud (BigDataSecurity), Int. Conf. High Perform. Smart Com-
and push: Detecting large traffic aggregates in the dataplane,’’ 2018, put. (HPSC), IEEE Int. Conf. Intell. Data Secur. (IDS), May 2020,
arXiv:1805.05993. [Online]. Available: http://arxiv.org/abs/1805.05993 pp. 157–161.
[184] R. Ben-Basat, X. Chen, G. Einziger, and O. Rottenstreich, ‘‘Efficient [204] A. Almaini, A. Al-Dubai, I. Romdhani, and M. Schramm, ‘‘Delegation of
measurement on programmable switches using probabilistic recircula- authentication to the data plane in software-defined networks,’’ in Proc.
tion,’’ in Proc. IEEE 26th Int. Conf. Netw. Protocols (ICNP), Sep. 2018, IEEE Int. Conferences Ubiquitous Comput. Commun. (IUCC), Data Sci.
pp. 313–323. Comput. Intell. (DSCI), Smart Comput., Netw. Services (SmartCNS),
[185] L. Tang, Q. Huang, and P. P. C. Lee, ‘‘A fast and compact invertible Oct. 2019, pp. 58–65.
sketch for network-wide heavy flow detection,’’ IEEE/ACM Trans. Netw., [205] E. O. Zaballa, D. Franco, Z. Zhou, and M. S. Berger, ‘‘P4Knocking:
vol. 28, no. 5, pp. 2350–2363, Oct. 2020. Offloading host-based firewall functionalities to the network,’’ in Proc.
23rd Conf. Innov. Clouds, Internet Netw. Workshops (ICIN), Feb. 2020,
[186] M. V. B. D. Silva, J. A. Marques, L. P. Gaspary, and L. Z. Granville,
pp. 7–12.
‘‘Identifying elephant flows using dynamic thresholds in programmable
IXP networks,’’ J. Internet Services Appl., vol. 11, no. 1, pp. 1–12, [206] Q. Kang, L. Xue, A. Morrison, Y. Tang, A. Chen, and X. Luo, ‘‘Pro-
Dec. 2020. grammable in-network security for context-aware BYOD policies,’’ 2019,
arXiv:1908.01405. [Online]. Available: http://arxiv.org/abs/1908.01405
[187] R. Harrison, Q. Cai, A. Gupta, and J. Rexford, ‘‘Network-wide heavy
[207] S. Bai, H. Kim, and J. Rexford, ‘‘Passive OS fingerprinting on commodity
hitter detection with commodity switches,’’ in Proc. Symp. SDN Res.,
switches,’’ Tech. Rep. TR-010-19, Sep. 2019.
Mar. 2018, pp. 1–7.
[208] A. Almaini, A. Al-Dubai, I. Romdhani, M. Schramm, and A. Alsarhan,
[188] R. Harrison, S. L. Feibish, A. Gupta, R. Teixeira, S. Muthukrishnan,
‘‘Lightweight edge authentication for software defined networks,’’ Com-
and J. Rexford, ‘‘Carpe elephants: Seize the global heavy hitters,’’
puting, vol. 103, no. 2, pp. 291–311, Feb. 2021.
in Proc. Workshop Secure Program. Netw. Infrastruct., Aug. 2020,
[209] J. Hill, M. Aloserij, and P. Grosso, ‘‘Tracking network flows with
pp. 15–21.
P4,’’ in Proc. IEEE/ACM Innovating Netw. Data-Intensive Sci. (INDIS),
[189] D. Ding, M. Savi, G. Antichi, and D. Siracusa, ‘‘An incrementally-
Nov. 2018, pp. 23–32.
deployable P4-enabled architecture for network-wide heavy-hitter detec-
[210] G. Li, M. Zhang, C. Liu, X. Kong, A. Chen, G. Gu, and H. Duan,
tion,’’ IEEE Trans. Netw. Service Manage., vol. 17, no. 1, pp. 75–88,
‘‘NETHCF: Enabling line-rate and adaptive spoofed IP traffic filter-
Mar. 2020.
ing,’’ in Proc. IEEE 27th Int. Conf. Netw. Protocols (ICNP), Oct. 2019,
[190] L. Tang, Q. Huang, and P. P. C. Lee, ‘‘SpreadSketch: Toward pp. 1–12.
invertible and network-wide detection of superspreaders,’’ in
[211] A. Febro, H. Xiao, and J. Spring, ‘‘Distributed SIP DDoS defense with
Proc. IEEE Conf. Comput. Commun. (INFOCOM), Jul. 2020,
P4,’’ in Proc. IEEE Wireless Commun. Netw. Conf. (WCNC), Apr. 2019,
pp. 1608–1617.
pp. 1–8.
[191] D. Scholz, A. Oeldemann, F. Geyer, S. Gallenmuller, H. Stubbe, T. Wild, [212] D. Scholz, S. Gallenmüller, H. Stubbe, B. Jaber, M. Rouhi, and
A. Herkersdorf, and G. Carle, ‘‘Cryptographic hashing in P4 data G. Carle, ‘‘Me love (SYN-)cookies: SYN flood mitigation in pro-
planes,’’ in Proc. ACM/IEEE Symp. Archit. Netw. Commun. Syst. (ANCS), grammable data planes,’’ 2020, arXiv:2003.03221. [Online]. Available:
Sep. 2019, pp. 1–6. http://arxiv.org/abs/2003.03221
[192] F. Hauser, M. Häberle, M. Schmidt, and M. Menth, ‘‘P4-IPsec: Site- [213] D. Scholz, S. Gallenmüller, H. Stubbe, and G. Carle, ‘‘SYN flood defense
to-site and host-to-site VPN with IPsec in P4-based SDN,’’ 2019, in programmable data planes,’’ in Proc. 3rd P4 Workshop Eur., Dec. 2020,
arXiv:1907.03593. [Online]. Available: http://arxiv.org/abs/1907.03593 pp. 13–20.
[193] L. Malina, D. Smekal, S. Ricci, J. Hajny, P. Cíbik, and J. Hrabovsky, [214] G. K. Ndonda and R. Sadre, ‘‘A two-level intrusion detection system for
‘‘Hardware-accelerated cryptography for software-defined networks industrial control system networks using P4,’’ in Proc. 5th Int. Symp. ICS
with P4,’’ in Proc. Int. Conf. Inf. Technol. Commun. Secur. Cham, SCADA Cyber Secur. Res., 2018, pp. 31–40.
Switzerland: Springer, 2020, pp. 271–287. [215] J. Xing, W. Wu, and A. Chen, ‘‘Architecting programmable data plane
[194] G. Liu, W. Quan, N. Cheng, D. Gao, N. Lu, H. Zhang, and X. Shen, defenses into the network with FastFlex,’’ in Proc. 18th ACM Workshop
‘‘Softwarized IoT network immunity against eavesdropping with pro- Hot Topics Netw., Nov. 2019, pp. 161–169.
grammable data planes,’’ IEEE Internet Things J., vol. 8, no. 8, [216] Q. Kang, J. Xing, and A. Chen, ‘‘Automated attack discovery in data plane
pp. 6578–6590, Apr. 2021. systems,’’ in Proc. 12th USENIX Workshop Cyber Secur. Experimentation
[195] X. Chen, ‘‘Implementing AES encryption on programmable switches via Test (CSET), 2019, pp. 1–5.
scrambled lookup tables,’’ in Proc. Workshop Secure Program. Netw. [217] A. C. Lapolli, J. A. Marques, and L. P. Gaspary, ‘‘Offloading real-time
Infrastruct. (SPIN). New York, NY, USA: Association for Computing DDoS attack detection to programmable data planes,’’ in Proc. IFIP/IEEE
Machinery, 2020, pp. 8–14. Symp. Integr. Netw. Service Manage. (IM), 2019, pp. 19–27.
[196] H. Kim and A. Gupta, ‘‘ONTAS: Flexible and scalable online network [218] Y. Mi and A. Wang, ‘‘ML-pushback: Machine learning based pushback
traffic anonymization system,’’ in Proc. Workshop Netw. Meets AI ML defense against DDoS,’’ in Proc. 15th Int. Conf. Emerg. Netw. Exp.
(NetAI), 2019, pp. 15–21. Technol., Dec. 2019, pp. 80–81.
[197] H. M. Moghaddam and A. Mosenia, ‘‘Anonymizing masses: Practical [219] J. Ioannidis and S. M. Bellovin, ‘‘Implementing pushback: Router-based
light-weight anonymity at the network level,’’ 2019, arXiv:1911.09642. defense against DDoS attacks,’’ in Proc. NDSS, 2016, pp. 1–12.
[Online]. Available: http://arxiv.org/abs/1911.09642 [220] M. Zhang, G. Li, S. Wang, C. Liu, A. Chen, H. Hu, G. Gu,
[198] T. Datta, N. Feamster, J. Rexford, and L. Wang, ‘‘SPINE: Surveillance Q. Li, M. Xu, and J. Wu, ‘‘Poseidon: Mitigating volumetric
protection in the network elements,’’ in Proc. 9th USENIX Workshop Free DDoS attacks with programmable switches,’’ in Proc. NDSS, 2020,
Open Commun. Internet (FOCI), 2019, pp. 1–7. pp. 1–18.
[199] L. Wang, H. Kim, P. Mittal, and J. Rexford, ‘‘Programmable in-network [221] K. Friday, E. Kfoury, E. Bou-Harb, and J. Crichigno, ‘‘Towards a unified
obfuscation of DNS traffic,’’ in Proc. NDSS, DNS Privacy Workshop, in-network DDoS detection and mitigation strategy,’’ in Proc. 6th IEEE
2021, pp. 1–10. Conf. Netw. Softwarization (NetSoft), Jun. 2020, pp. 218–226.
[200] R. Meier, P. Tsankov, V. Lenders, L. Vanbever, and M. Vechev, ‘‘NetHide: [222] J. Xing, Q. Kang, and A. Chen, ‘‘NetWarden: Mitigating network covert
Secure and practical network topology obfuscation,’’ in Proc. 27th channels while preserving performance,’’ in Proc. 29th USENIX Secur.
USENIX Secur. Symp. (USENIX Secur.), 2018, pp. 693–709. Symp. (USENIX Secur.), 2020, pp. 2039–2056.
[223] A. Laraba, J. François, I. Chrisment, S. R. Chowdhury, and R. Boutaba, [245] M. Shahbaz, S. Choi, B. Pfaff, C. Kim, N. Feamster, N. McKeown, and
‘‘Defeating protocol abuse with P4: Application to explicit conges- J. Rexford, ‘‘PISCES: A programmable, protocol-independent software
tion notification,’’ in Proc. IFIP Netw. Conf. (Networking), 2020, switch,’’ in Proc. ACM SIGCOMM Conf., Aug. 2016, pp. 525–538.
pp. 431–439. [246] B. Pfaff, J. Pettit, T. Koponen, E. Jackson, A. Zhou, J. Rajahalme, J. Gross,
[224] J. Xing, W. Wu, and A. Chen, ‘‘Ripple: A programmable, decentral- A. Wang, J. Stringer, and P. Shelar, ‘‘The design and implementation
ized link-flooding defense against adaptive adversaries,’’ in Proc. 30th of open vSwitch,’’ in Proc. 12th USENIX Symp. Netw. Syst. Design
USENIX Secur. Symp. (USENIX Secur.), Vancouver, BC, Canada, 2021, Implement. (NSDI), 2015, pp. 117–130.
pp. 1–16. [247] Barefoot Networks. (2020). Barefoot Academy. [Online]. Available:
[225] A. D. S. Ilha, A. C. Lapolli, J. A. Marques, and L. P. Gaspary, ‘‘Euclid: https://www.barefootnetworks.com/barefoot-academy/
A fully in-network, P4-based approach for real-time DDoS attack detec- [248] C. Kim, A. Sivaraman, N. Katta, A. Bas, A. Dixit, and L. J. Wobker, ‘‘In-
tion and mitigation,’’ IEEE Trans. Netw. Service Manage., early access, band network telemetry via programmable dataplanes,’’ in Proc. ACM
Dec. 30, 2020, doi: 10.1109/TNSM.2020.3048265. SIGCOMM, 2015, pp. 1–2.
[226] X. Z. Khooi, L. Csikor, D. M. Divakaran, and M. S. Kang, ‘‘DIDA: [249] C. Hopps, Analysis of an Equal-Cost Multi-Path Algorithm,
Distributed in-network defense architecture against amplified reflection document RFC 2992, Nov. 2000.
DDoS attacks,’’ in Proc. 6th IEEE Conf. Netw. Softwarization (NetSoft), [250] S. Sinha, S. Kandula, and D. Katabi, ‘‘Harnessing TCP’s burstiness
Jun. 2020, pp. 277–281. with flowlet switching,’’ in Proc. 3rd ACM Workshop Hot Topics Netw.
[227] D. Ding, M. Savi, F. Pederzolli, M. Campanella, and D. Siracusa, ‘‘In- (Hotnets-III), 2004, pp. 1–6.
network volumetric DDoS victim identification using programmable [251] C. Kim, P. Bhide, E. Doe, H. Holbrook, A. Ghanwani, D. Daly,
commodity switches,’’ IEEE Trans. Netw. Service Manage., early access, M. Hira, and B. Davie, ‘‘In-band network telemetry (INT),’’ Tech. Rep.
Apr. 15, 2021, doi: 10.1109/TNSM.2021.3073597. Version 2.1, 2020. [Online]. Available: https://github.com/p4lang/p4-
[228] F. Musumeci, V. Ionata, F. Paolucci, F. Cugini, and M. Tornatore, applications/blob/master/docs/INT_v2_1.pdf
‘‘Machine-learning-assisted DDoS attack detection with P4 language,’’
[252] P. Manzanares-Lopez, J. P. Muñoz-Gea, and J. Malgosa-Sanahuja,
in Proc. IEEE Int. Conf. Commun. (ICC), Jun. 2020, pp. 1–6.
‘‘Passive in-band network telemetry systems: The potential of pro-
[229] Z. Liu, H. Namkung, G. Nikolaidis, J. Lee, C. Kim, X. Jin, V. Braverman,
grammable data plane on network-wide telemetry,’’ IEEE Access, vol. 9,
M. Yu, and V. Sekar, ‘‘Jaqen: A high-performance switch-native approach
pp. 20391–20409, 2021.
for detecting and mitigating volumetric DDoS attacks with programmable
[253] M. A. M. Vieira, M. S. Castanho, R. D. G. Pacífico, E. R. S. Santos,
switches,’’ in Proc. 30th USENIX Secur. Symp. (USENIX Secur.), 2021,
E. P. M. C. Júnior, and L. F. M. Vieira, ‘‘Fast packet processing with
pp. 1–18.
eBPF and XDP: Concepts, code, challenges, and applications,’’ ACM
[230] C. Zhang, J. Bi, Y. Zhou, J. Wu, B. Liu, Z. Li, A. B. Dogar, and Y. Wang,
Comput. Surv., vol. 53, no. 1, pp. 1–36, May 2020.
‘‘P4DB: On-the-fly debugging of the programmable data plane,’’ in Proc.
IEEE 25th Int. Conf. Netw. Protocols (ICNP), Oct. 2017, pp. 1–10. [254] J. Crichigno, E. Bou-Harb, and N. Ghani, ‘‘A comprehensive tuto-
rial on science DMZ,’’ IEEE Commun. Surveys Tuts., vol. 21, no. 2,
[231] Y. Zhou, J. Bi, Y. Lin, Y. Wang, D. Zhang, Z. Xi, J. Cao, and C. Sun,
pp. 2041–2078, 2nd Quart., 2019.
‘‘P4Tester: Efficient runtime rule fault detection for programmable data
planes,’’ in Proc. Int. Symp. Qual. Service, Jun. 2019, pp. 1–10. [255] J. F. Kurose and K. W. Ross, Computer Networking: A Top-Down
[232] M. V. Dumitru, D. Dumitrescu, and C. Raiciu, ‘‘Can we exploit buggy P4 Approach, 6th ed. London, U.K.: Pearson, 2012.
programs?’’ in Proc. Symp. SDN Res., Mar. 2020, pp. 62–68. [256] S. Ha, I. Rhee, and L. Xu, ‘‘CUBIC: A new TCP-friendly high-speed
[233] S. Kodeswaran, M. T. Arashloo, P. Tammana, and J. Rexford, ‘‘Tracking TCP variant,’’ ACM SIGOPS Oper. Syst. Rev., vol. 42, no. 5, pp. 64–74,
P4 program execution in the data plane,’’ in Proc. Symp. SDN Res., Jul. 2008.
Mar. 2020, pp. 117–122. [257] D. Leith and R. Shorten. (2008). H-TCP: TCP Congestion Control for
[234] Y. Zhou, J. Bi, T. Yang, K. Gao, C. Zhang, J. Cao, and Y. Wang, High Bandwidth-Delay Product Paths. [Online]. Available: https://draft-
‘‘KeySight: Troubleshooting programmable switches via scalable high- leith-tcp-htcp-06
coverage behavior tracking,’’ in Proc. IEEE 26th Int. Conf. Netw. Proto- [258] N. Cardwell, Y. Cheng, C. S. Gunn, S. H. Yeganeh, and V. Jacobson,
cols (ICNP), Sep. 2018, pp. 291–301. ‘‘BBR: Congestion-based congestion control,’’ Commun. ACM, vol. 60,
[235] N. Lopes, N. Bjørner, N. McKeown, A. Rybalchenko, D. Talayco, and no. 2, pp. 58–66, 2017.
G. Varghese, ‘‘Automatically verifying reachability and well-formedness [259] E. F. Kfoury, J. Gomez, J. Crichigno, and E. Bou-Harb, ‘‘An emulation-
in P4 networks,’’ Tech. Rep. MSR-TR-2016-65, Sep. 2016. based evaluation of TCP BBRv2 alpha for wired broadband,’’ Comput.
[236] L. Freire, M. Neves, L. Leal, K. Levchenko, A. Schaeffer-Filho, and Commun., vol. 161, pp. 212–224, Sep. 2020.
M. Barcellos, ‘‘Uncovering bugs in P4 programs with assertion-based [260] S. Floyd, ‘‘TCP and explicit congestion notification,’’ ACM SIGCOMM
verification,’’ in Proc. Symp. SDN Res., Mar. 2018, pp. 1–7. Comput. Commun. Rev., vol. 24, no. 5, pp. 8–23, Oct. 1994.
[237] M. Neves, L. Freire, A. Schaeffer-Filho, and M. Barcellos, ‘‘Verification [261] R. Mittal, V. T. Lam, N. Dukkipati, E. Blem, H. Wassel, M. Ghobadi,
of P4 programs in feasible time using assertions,’’ in Proc. 14th Int. Conf. A. Vahdat, Y. Wang, D. Wetherall, and D. Zats, ‘‘TIMELY: RTT-based
Emerg. Netw. Exp. Technol., Dec. 2018, pp. 73–85. congestion control for the data center,’’ ACM SIGCOMM Comput. Com-
[238] J. Liu, W. Hallahan, C. Schlesinger, M. Sharif, J. Lee, R. Soulé, H. Wang, mun. Rev., vol. 45, no. 4, pp. 537–550, 2015.
C. Caşcaval, N. McKeown, and N. Foster, ‘‘P4V: Practical verification for [262] Y. Zhu, H. Eran, D. Firestone, C. Guo, M. Lipshteyn, Y. Liron, J. Padhye,
programmable data planes,’’ in Proc. Conf. ACM Special Interest Group S. Raindel, M. H. Yahia, and M. Zhang, ‘‘Congestion control for large-
Data Commun., Aug. 2018, pp. 490–503. scale RDMA deployments,’’ ACM SIGCOMM Comput. Commun. Rev.,
[239] A. Nötzli, J. Khan, A. Fingerhut, C. Barrett, and P. Athanas, ‘‘P4pktgen: vol. 45, no. 4, pp. 523–536, Sep. 2015.
Automated test case generation for P4 programs,’’ in Proc. Symp. SDN [263] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar,
Res., Mar. 2018, pp. 1–7. S. Sengupta, and M. Sridharan, ‘‘Data center TCP (DCTCP),’’ in Proc.
[240] D. Lukács, M. Tejfel, and G. Pongrácz, ‘‘Keeping P4 switches fast and ACM SIGCOMM Conf. SIGCOMM, 2010, pp. 63–74.
fault-free through automatic verification,’’ Acta Cybernetica, vol. 24, [264] M. Alizadeh, S. Yang, M. Sharif, S. Katti, N. McKeown, B. Prabhakar,
no. 1, pp. 61–81, May 2019. and S. Shenker, ‘‘Pfabric: Minimal near-optimal datacenter transport,’’
[241] R. Stoenescu, D. Dumitrescu, M. Popovici, L. Negreanu, and C. Raiciu, ACM SIGCOMM Comput. Commun. Rev., vol. 43, no. 4, pp. 435–446,
‘‘Debugging P4 programs with vera,’’ in Proc. Conf. ACM Special Interest 2013.
Group Data Commun., Aug. 2018, pp. 518–532. [265] M. Dong, Q. Li, D. Zarchy, P. B. Godfrey, and M. Schapira, ‘‘PCC:
[242] A. Shukla, K. N. Hudemann, A. Hecker, and S. Schmid, ‘‘Runtime veri- Re-architecting congestion control for consistent high performance,’’ in
fication of P4 switches with reinforcement learning,’’ in Proc. Workshop Proc. 12th USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2015,
Netw. Meets AI ML (NetAI), 2019, pp. 1–7. pp. 395–408.
[243] D. Dumitrescu, R. Stoenescu, L. Negreanu, and C. Raiciu, ‘‘Bf4: Towards [266] A. Langley et al., ‘‘The QUIC transport protocol: Design and Internet-
bug-free P4 programs,’’ in Proc. Annu. Conf. ACM Special Interest Group scale deployment,’’ in Proc. Conf. ACM Special Interest Group Data
Data Commun. Appl., Technol., Archit., Protocols Comput. Commun., Commun., 2017, pp. 183–196.
Jul. 2020, pp. 571–585. [267] P. Cheng, F. Ren, R. Shu, and C. Lin, ‘‘Catch the whole lot in an action:
[244] A. Bas and A. Fingerhut. P4 Tutorial, Slide 22. Accessed: Jun. 1, 2021. Rapid precise packet loss notification in data center,’’ in Proc. 11th
[Online]. Available: https://tinyurl.com/tb4m749 USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2014, pp. 17–28.
[268] A. Ramachandran, S. Seetharaman, N. Feamster, and V. Vazirani, ‘‘Fast [294] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. Long, and C. Maltzahn,
monitoring of traffic subpopulations,’’ in Proc. 8th ACM SIGCOMM ‘‘Ceph: A scalable, high-performance distributed file system,’’ in Proc.
Conf. Internet Meas. Conf. (IMC), 2008, pp. 257–270. 7th Symp. Oper. Syst. Design Implement., 2006, pp. 307–320.
[269] N. Alon, Y. Matias, and M. Szegedy, ‘‘The space complexity of approx- [295] L. Lamport, ‘‘Paxos made simple,’’ ACM SIGACT News, vol. 32, no. 4,
imating the frequency moments,’’ J. Comput. Syst. Sci., vol. 58, no. 1, pp. 18–25, 2001.
pp. 137–147, Feb. 1999. [296] D. Ongaro and J. Ousterhout, ‘‘In search of an understandable consensus
[270] V. Braverman and R. Ostrovsky, ‘‘Zero-one frequency laws,’’ in Proc. algorithm,’’ in Proc. USENIX Annu. Tech. Conf. (USENIX ATC), 2014,
42nd ACM Symp. Theory Comput. (STOC), 2010, pp. 281–290. pp. 305–319.
[271] M. Charikar, K. Chen, and M. Farach-Colton, ‘‘Finding frequent items [297] H. T. Dang. Consensus as a Network Service. Accessed: Jun. 1, 2021.
in data streams,’’ in Proc. Int. Colloq. Automata, Lang., Program. Berlin, [Online]. Available: https://tinyurl.com/y2t9plsu
Germany: Springer, 2002, pp. 693–703. [298] J. Nelson. SwitchML Scaling Distributed Machine Learning With in
[272] G. Cormode and S. Muthukrishnan, ‘‘An improved data stream summary: Network Aggregation. Accessed: Jun. 1, 2021. [Online]. Available:
The count-min sketch and its applications,’’ J. Algorithms, vol. 55, no. 1, https://tinyurl.com/y53upm7k
pp. 58–75, Apr. 2005. [299] D. Das, S. Avancha, D. Mudigere, K. Vaidynathan, S. Sridharan,
[273] S. Floyd and V. Jacobson, ‘‘Random early detection gateways for con- D. Kalamkar, B. Kaul, and P. Dubey, ‘‘Distributed deep learning using
gestion avoidance,’’ IEEE/ACM Trans. Netw., vol. 1, no. 4, pp. 397–413, synchronous stochastic gradient descent,’’ 2016, arXiv:1602.06709.
Aug. 1993. [Online]. Available: http://arxiv.org/abs/1602.06709
[274] P. Flajolet, D. Gardy, and L. Thimonier, ‘‘Birthday paradox, coupon [300] S. Farrell, Low-Power Wide Area Network (LPWAN)
collectors, caching algorithms and self-organizing search,’’ Discrete Appl. Overview, document RFC8376, 2018. [Online]. Available:
Math., vol. 39, no. 3, pp. 207–229, Nov. 1992. https://tools.ietf.org/html/rfc8376
[275] R. Dolby, ‘‘Noise reduction systems,’’ U.S. Patent 3 846 719, [301] A. Koike, T. Ohba, and R. Ishibashi, ‘‘IoT network architecture using
Nov. 5, 1974. packet aggregation and disaggregation,’’ in Proc. 5th IIAI Int. Congr. Adv.
[276] S. V. Vaseghi, Advanced Digital Signal Processing and Noise Reduction. Appl. Informat. (IIAI-AAI), Jul. 2016, pp. 1140–1145.
Hoboken, NJ, USA: Wiley, 2008. [302] J. Deng and M. Davis, ‘‘An adaptive packet aggregation algorithm for
[277] J. Gettys, ‘‘Bufferbloat: Dark buffers in the Internet,’’ IEEE Internet wireless networks,’’ in Proc. Int. Conf. Wireless Commun. Signal Pro-
Comput., vol. 15, no. 3, p. 96, May/Jun. 2011. cess., Oct. 2013, pp. 1–6.
[278] M. Allman, ‘‘Comments on bufferbloat,’’ ACM SIGCOMM Comput. [303] Y. Yasuda, R. Nakamura, and H. Ohsaki, ‘‘A probabilistic interest packet
Commun. Rev., vol. 43, no. 1, pp. 30–37, Jan. 2013. aggregation for content-centric networking,’’ in Proc. IEEE 42nd Annu.
[279] Y. Gong, D. Rossi, C. Testa, S. Valenti, and M. D. Täht, ‘‘Fighting the Comput. Softw. Appl. Conf. (COMPSAC), Jul. 2018, pp. 783–788.
bufferbloat: On the coexistence of AQM and low priority congestion [304] A. S. Akyurek and T. S. Rosing, ‘‘Optimal packet aggregation scheduling
control,’’ Comput. Netw., vol. 65, pp. 255–267, Jun. 2014. in wireless networks,’’ IEEE Trans. Mobile Comput., vol. 17, no. 12,
[280] C. Staff, ‘‘BufferBloat: What’s wrong with the Internet?’’ Commun. ACM, pp. 2835–2852, Dec. 2018.
vol. 55, no. 2, pp. 40–47, Feb. 2012. [305] K. Zhou and N. Nikaein, ‘‘Packet aggregation for machine type commu-
[281] V. G. Cerf, ‘‘Bufferbloat and other Internet challenges,’’ IEEE Internet nications in LTE with random access channel,’’ in Proc. IEEE Wireless
Comput., vol. 18, no. 5, p. 80, Sep./Oct. 2014. Commun. Netw. Conf. (WCNC), Apr. 2013, pp. 262–267.
[282] H. Harkous, C. Papagianni, K. De Schepper, M. Jarschel, M. Dimolianis, [306] A. Majeed and N. B. Abu-Ghazaleh, ‘‘Packet aggregation in multi-rate
and R. Preis, ‘‘Virtual queues for P4: A poor man’s programmable wireless LANs,’’ in Proc. 9th Annu. IEEE Commun. Soc. Conf. Sensor,
traffic manager,’’ IEEE Trans. Netw. Service Manage., early access, Mesh Ad Hoc Commun. Netw. (SECON), Jun. 2012, pp. 452–460.
[307] Bluetooth Specification Version 4.2, Bluetooth SIG, Kirkland, WA, USA,
May 3, 2021, doi: 10.1109/TNSM.2021.3077051.
2014.
[283] K. Nichols, S. Blake, F. Baker, and D. Black, Definition of
[308] S. Farahani, ZigBee Wireless Networks and Transceivers. London, U.K.:
the Differentiated Services Field (DS Field) in the IPv4 and
Newnes, 2011.
IPv6 Headers, document RFC8376, 2018. [Online]. Available: [309] O. Hersent, D. Boswarthick, and O. Elloumi, The Internet of Things: Key
https://tools.ietf.org/html/rfc8376 Applications and Protocols. Hoboken, NJ, USA: Wiley, 2011.
[284] B. Fenner, M. Handley, H. Holbrook, I. Kouvelas, R. Parekh, Z. Zhang, [310] J. Shi, W. Quan, D. Gao, M. Liu, G. Liu, C. Yu, and W. Su, ‘‘Flowlet-based
and L. Zheng, Protocol Independent Multicast-Sparse Mode (PIM-SM): stateful multipath forwarding in heterogeneous Internet of Things,’’ IEEE
Protocol Specification (Revised), document RFC 7761, 2016. [Online]. Access, vol. 8, pp. 74875–74886, 2020.
Available: https://tools.ietf.org/html/rfc7761 [311] S. Do, L.-V. Le, B.-S.-P. Lin, and L.-P. Tung, ‘‘SDN/NFV-based network
[285] H. Holbrook, B. Cain, and B. Haberman, Using Internet Group Manage- infrastructure for enhancing IoT gateways,’’ in Proc. Int. Conf. Internet
ment Protocol Version 3 (IGMPv3) and Multicast Listener Discovery Pro- Things (iThings), IEEE Green Comput. Commun. (GreenCom), IEEE
tocol Version 2 (MLDv2) for Source-Specific Multicast, document RFC Cyber, Phys. Social Comput. (CPSCom), IEEE Smart Data (SmartData),
4604, Internet Engineering Task Force, 2006. Jul. 2019, pp. 1135–1142.
[286] I. Wijnands, E. C. Rosen, A. Dolganow, T. Przygienda, and S. Aldrin, [312] A. Metwally, D. Agrawal, and A. El Abbadi, ‘‘Efficient computation of
Multicast Using Bit Index Explicit Replication (BIER), document RFC frequent and top-k elements in data streams,’’ in Proc. Int. Conf. Database
8279, 2017. Theory. Berlin, Germany: Springer, 2005, pp. 398–412.
[287] S. Luo, H. Yu, K. Li, and H. Xing, ‘‘Efficient file dissemination in data [313] S. Heule, M. Nunkesser, and A. Hall, ‘‘HyperLogLog in practice: Algo-
center networks with priority-based adaptive multicast,’’ IEEE J. Sel. rithmic engineering of a state of the art cardinality estimation algorithm,’’
Areas Commun., vol. 38, no. 6, pp. 1161–1175, Jun. 2020. in Proc. 16th Int. Conf. Extending Database Technol. (EDBT), 2013,
[288] B. Carpenter and S. Brim, Middleboxes: Taxonomy and Issues, pp. 683–692.
document RFC3234, 2002. [Online]. Available: https://tools.ietf.org/ [314] F. Hauser, M. Schmidt, M. Häberle, and M. Menth, ‘‘P4-MACsec:
html/rfc3234 Dynamic topology monitoring and data layer protection with MACsec
[289] J. McCauley, A. Panda, A. Krishnamurthy, and S. Shenker, ‘‘Thoughts in P4-based SDN,’’ IEEE Access, vol. 8, pp. 58845–58858, 2020.
on load distribution and the role of programmable switches,’’ ACM SIG- [315] M. G. Reed, P. F. Syverson, and D. M. Goldschlag, ‘‘Anonymous con-
COMM Comput. Commun. Rev., vol. 49, no. 1, pp. 18–23, Feb. 2019. nections and onion routing,’’ IEEE J. Sel. Areas Commun., vol. 16, no. 4,
[290] T. Norp, ‘‘5G requirements and key performance indicators,’’ J. ICT pp. 482–494, May 1998.
Standardization, vol. 6, no. 1, pp. 15–30, 2018. [316] V. Liu, S. Han, A. Krishnamurthy, and T. Anderson, ‘‘Tor instead of IP,’’
[291] G. Xylomenos, C. N. Ververidis, V. A. Siris, N. Fotiou, C. Tsilopoulos, in Proc. 10th ACM Workshop Hot Topics Netw., 2011, pp. 1–6.
X. Vasilakos, K. V. Katsaros, and G. C. Polyzos, ‘‘A survey of [317] C. Chen, D. E. Asoni, D. Barrera, G. Danezis, and A. Perrig, ‘‘HORNET:
information-centric networking research,’’ IEEE Commun. Surveys Tuts., High-speed onion routing at the network layer,’’ in Proc. 22nd ACM
vol. 16, no. 2, pp. 1024–1049, 2nd Quart., 2014. SIGSAC Conf. Comput. Commun. Secur., Oct. 2015, pp. 1441–1454.
[292] D. L. Tennenhouse and D. J. Wetherall, ‘‘Towards an active network [318] L. Lamport, ‘‘Password authentication with insecure communication,’’
architecture,’’ in Proc. DARPA Act. Netw. Conf. Expo., 2002, pp. 2–15. Commun. ACM, vol. 24, no. 11, pp. 770–772, Nov. 1981.
[293] E. F. Kfoury, J. Gomez, J. Crichigno, E. Bou-Harb, and D. Khoury, [319] M. Zalewski and W. Stearns. (2006). P0F. [Online]. Available:
‘‘Decentralized distribution of PCP mappings over blockchain for http://lcamtuf.coredump.cx/p0f3
end-to-end secure direct communications,’’ IEEE Access, vol. 7, [320] J. Barnes and P. Crowley, ‘‘K-P0F: A high-throughput kernel passive OS
pp. 110159–110173, 2019. fingerprinter,’’ in Proc. Archit. Netw. Commun. Syst., 2013, pp. 113–114.
[321] S. Hong, R. Baykov, L. Xu, S. Nadimpalli, and G. Gu, ‘‘Towards SDN- [344] Pronto Project. Accessed: Jun. 1, 2021. [Online]. Available:
defined programmable BYOD (bring your own device) security,’’ in Proc. https://prontoproject.org/
NDSS, 2016, pp. 1–15. [345] Y. Zhou and J. Bi, ‘‘ClickP4: Towards modular programming of P4,’’ in
[322] S. Hilton. (2016). Dyn Analysis Summary of Friday October 21 Attack. Proc. SIGCOMM Posters Demos, 2017, pp. 100–102.
[Online]. Available: https://dyn.com/blog/dyn-analysis-summary-of- [346] P. Zheng, T. Benson, and C. Hu, ‘‘P4Visor: Lightweight virtualization and
friday-october-21-attack/ composition primitives for building and testing modular programs,’’ in
[323] S. Kottler. (Mar. 2018). February 28th DDoS Incident Report. [Online]. Proc. 14th Int. Conf. Emerg. Netw. Exp. Technol., Dec. 2018, pp. 98–111.
Available: https://githubengineering.com/ddos-incident-report/ [347] X. Chen, D. Zhang, X. Wang, K. Zhu, and H. Zhou, ‘‘P4SC: Towards
[324] S. K. Fayaz, Y. Tobioka, V. Sekar, and M. Bailey, ‘‘Bohatei: Flexible and high-performance service function chain implementation on the P4-
elastic DDoS defense,’’ in Proc. 24th USENIX Secur. Symp. (USENIX capable device,’’ in Proc. IFIP/IEEE Symp. Integr. Netw. Service Manage.
Secur.), 2015, pp. 817–832. (IM), 2019, pp. 1–9.
[325] Arbor Networks. Arbor Networks APS Datasheet. Accessed: Jun. 1, 2021. [348] M. Riftadi and F. Kuipers, ‘‘P4I/O: Intent-based networking with P4,’’ in
[Online]. Available: https://www.netscout.com/sites/default/files/2018- Proc. IEEE Conf. Netw. Softwarization (NetSoft), Jun. 2019, pp. 438–443.
04/DS_APS_EN.pdf [349] E. O. Zaballa and Z. Zhou, ‘‘Graph-to-P4: A P4 boilerplate code generator
[326] NSFOCUS. NSFOCUS Anti-DDoS System Datasheet. Accessed: for parse graphs,’’ in Proc. ACM/IEEE Symp. Archit. Netw. Commun. Syst.
Jun. 1, 2021. [Online]. Available: https://nsfocusglobal.com/wp-content/ (ANCS), Sep. 2019, pp. 1–2.
uploads/2018/05/Anti-DDoS-Solution.pdf [350] M. Riftadi, J. Oostenbrink, and F. Kuipers, ‘‘GP4P4: Enabling self-
[327] J. Hypolite, J. Sonchack, S. Hershkop, N. Dautenhahn, A. DeHon, and programming networks,’’ 2019, arXiv:1910.00967. [Online]. Available:
J. M. Smith, ‘‘DeepMatch: Practical deep packet inspection in the data http://arxiv.org/abs/1910.00967
plane using network processors,’’ in Proc. 16th Int. Conf. Emerg. Netw. [351] X. Gao, T. Kim, M. D. Wong, D. Raghunathan, A. K. Varma,
Exp. Technol., Nov. 2020, pp. 336–350. P. G. Kannan, A. Sivaraman, S. Narayana, and A. Gupta, ‘‘Switch code
[328] N. Handigol, B. Heller, V. Jeyakumar, D. Mazières, and N. McKeown, generation using program synthesis,’’ in Proc. Annu. Conf. ACM Special
‘‘I know what your packet did last hop: Using packet histories to trou- Interest Group Data Commun. Appl., Technol., Archit., Protocols Comput.
bleshoot networks,’’ in Proc. 11th USENIX Symp. Netw. Syst. Design Commun., Jul. 2020, pp. 44–61.
Implement. (NSDI), 2014, pp. 71–85. [352] J. Gao, E. Zhai, H. H. Liu, R. Miao, Y. Zhou, B. Tian, C. Sun, D. Cai,
[329] Y. Zhu, N. Kang, J. Cao, A. Greenberg, G. Lu, R. Mahajan, D. Maltz, M. Zhang, and M. Yu, ‘‘Lyra: A cross-platform language and compiler for
L. Yuan, M. Zhang, B. Y. Zhao, and H. Zheng, ‘‘Packet-level telemetry in data plane programming on heterogeneous ASICs,’’ in Proc. Annu. Conf.
large datacenter networks,’’ in Proc. ACM Conf. Special Interest Group ACM Special Interest Group Data Commun. Appl., Technol., Archit.,
Data Commun., Aug. 2015, pp. 479–491. Protocols Comput. Commun., Jul. 2020, pp. 435–450.
[330] H. Zeng, P. Kazemian, G. Varghese, and N. McKeown, ‘‘Automatic test [353] M. Hogan, S. Landau-Feibish, M. T. Arashloo, J. Rexford, D. Walker,
packet generation,’’ in Proc. 8th Int. Conf. Emerg. Netw. Exp. Technol. and R. Harrison, ‘‘Elastic switch programming with P4All,’’ in Proc. 19th
(CoNEXT), 2012, pp. 241–252. ACM Workshop Hot Topics Netw., Nov. 2020, pp. 168–174.
[331] H. T. Dang, H. Wang, T. Jepsen, G. Brebner, C. Kim, J. Rexford, R. Soulé, [354] C. Zhang, J. Bi, Y. Zhou, A. B. Dogar, and J. Wu, ‘‘HyperV: A high per-
and H. Weatherspoon, ‘‘Whippersnapper: A P4 language benchmark formance hypervisor for virtualization of the programmable data plane,’’
suite,’’ in Proc. Symp. SDN Res., Apr. 2017, pp. 95–101. in Proc. 26th Int. Conf. Comput. Commun. Netw. (ICCCN), Jul. 2017,
[332] F. Rodriguez, P. G. K. Patra, L. Csikor, C. Rothenberg, P. V. S. Laki, pp. 1–9.
and G. Pongrácz, ‘‘BB-Gen: A packet crafter for P4 target evaluation,’’ in [355] M. Saquetti, G. Bueno, W. Cordeiro, and J. R. Azambuja, ‘‘P4VBox:
Proc. ACM SIGCOMM Conf. Posters Demos, Aug. 2018, pp. 111–113. Enabling P4-based switch virtualization,’’ IEEE Commun. Lett., vol. 24,
[333] H. Harkous, M. Jarschel, M. He, R. Pries, and W. Kellerer, no. 1, pp. 146–149, Jan. 2020.
‘‘P8: P4 with predictable packet processing performance,’’ IEEE [356] R. Parizotto, L. Castanheira, F. Bonetti, A. Santos, and
Trans. Netw. Service Manage., early access, Oct. 12, 2020, doi: A. Schaeffer-Filho, ‘‘PRIME: Programming in-network modular
10.1109/TNSM.2020.3030102. extensions,’’ in Proc. IEEE/IFIP Netw. Oper. Manage. Symp. (NOMS),
[334] H. Harkous, M. Jarschel, M. He, R. Priest, and W. Kellerer, ‘‘Towards Apr. 2020, pp. 1–9.
understanding the performance of P4 programmable hardware,’’ in Proc. [357] E. O. Zaballa, D. Franco, M. S. Berger, and M. Higuero, ‘‘A perspective
ACM/IEEE Symp. Archit. Netw. Commun. Syst. (ANCS), Sep. 2019, on P4-based data and control plane modularity for network automation,’’
pp. 1–6. in Proc. 3rd P4 Workshop Eur., Dec. 2020, pp. 59–61.
[335] P. Kazemian, G. Varghese, and N. McKeown, ‘‘Header space analysis: [358] R. Stoyanov and N. Zilberman, ‘‘MTPSA: Multi-tenant programmable
Static checking for networks,’’ in Proc. 9th USENIX Symp. Netw. Syst. switches,’’ in Proc. 3rd P4 Workshop Eur., Dec. 2020, pp. 43–48.
Design Implement. (NSDI), 2012, pp. 113–126. [359] S. Han, S. Jang, H. Choi, H. Lee, and S. Pack, ‘‘Virtualization in pro-
[336] A. Khurshid, X. Zou, W. Zhou, M. Caesar, and P. B. Godfrey, ‘‘VeriFlow: grammable data plane: A survey and open challenges,’’ IEEE Open J.
Verifying network-wide invariants in real time,’’ in Proc. 10th USENIX Commun. Soc., vol. 1, pp. 527–534, 2020.
Symp. Netw. Syst. Design Implement. (NSDI), 2013, pp. 15–27. [360] E. C. Molero, S. Vissicchio, and L. Vanbever, ‘‘Hardware-accelerated
[337] R. Stoenescu, M. Popovici, L. Negreanu, and C. Raiciu, ‘‘SymNet: Scal- network control planes,’’ in Proc. 17th ACM Workshop Hot Topics Netw.,
able symbolic execution for modern networks,’’ in Proc. ACM SIGCOMM Nov. 2018, pp. 120–126.
Conf., Aug. 2016, pp. 314–327. [361] M. T. Arashloo, Y. Koral, M. Greenberg, J. Rexford, and D. Walker,
[338] H. Mai, A. Khurshid, R. Agarwal, M. Caesar, P. B. Godfrey, and ‘‘SNAP: Stateful network-wide abstractions for packet processing,’’ in
S. T. King, ‘‘Debugging the data plane with anteater,’’ ACM SIGCOMM Proc. ACM SIGCOMM Conf., Aug. 2016, pp. 29–43.
Comput. Commun. Rev., vol. 41, no. 4, pp. 290–301, Oct. 2011. [362] G. Sviridov, M. Bonola, A. Tulumello, P. Giaccone, A. Bianco, and
[339] P. Kazemian, M. Chang, H. Zeng, G. Varghese, N. McKeown, and G. Bianchi, ‘‘LODGE: Local decisions on global states in programmable
S. Whyte, ‘‘Real time network policy checking using header space analy- data planes,’’ in Proc. 4th IEEE Conf. Netw. Softwarization Workshops
sis,’’ in Proc. 10th USENIX Symp. Netw. Syst. Design Implement. (NSDI), (NetSoft), Jun. 2018, pp. 257–261.
2013, pp. 99–111. [363] G. Sviridov, M. Bonola, A. Tulumello, P. Giaccone, A. Bianco,
[340] A. Horn, A. Kheradmand, and M. Prasad, ‘‘Delta-Net: Real-time network and G. Bianchi, ‘‘Local decisions on replicated states (LOADER)
verification using atoms,’’ in Proc. 14th USENIX Symp. Netw. Syst. in programmable data planes: Programming abstraction and exper-
Design Implement. (NSDI), 2017, pp. 735–749. imental evaluation,’’ 2020, arXiv:2001.07670. [Online]. Available:
[341] S. Son, S. Shin, V. Yegneswaran, P. Porras, and G. Gu, ‘‘Model checking http://arxiv.org/abs/2001.07670
invariant security properties in OpenFlow,’’ in Proc. IEEE Int. Conf. [364] S. Luo, H. Yu, and L. Vanbever, ‘‘Swing state: Consistent updates for
Commun. (ICC), Jun. 2013, pp. 1974–1979. stateful and programmable data planes,’’ in Proc. Symp. SDN Res.,
[342] A. Panda, O. Lahav, K. Argyraki, M. Sagiv, and S. Shenker, ‘‘Verifying Apr. 2017, pp. 115–121.
reachability in networks with mutable datapaths,’’ in Proc. 14th USENIX [365] J. Xing, A. Chen, and T. S. E. Ng, ‘‘Secure state migration in the
Symp. Netw. Syst. Design Implement. (NSDI), 2017, pp. 699–718. data plane,’’ in Proc. Workshop Secure Program. Netw. Infrastruct.,
[343] N. Foster, N. McKeown, J. Rexford, G. Parulkar, L. Peterson, and Aug. 2020, pp. 28–34.
O. Sunay, ‘‘Using deep programmability to put network owners in con- [366] L. Zeno, D. R. K. Ports, J. Nelson, and M. Silberstein, ‘‘SwiShmem:
trol,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 50, no. 4, pp. 82–88, Distributed shared state abstractions for programmable switches,’’ in
Oct. 2020. Proc. 19th ACM Workshop Hot Topics Netw., Nov. 2020, pp. 160–167.
[367] S. Chole et al., ‘‘dRMT: Disaggregated programmable switching,’’ in [392] N. Feamster and J. Rexford, ‘‘Why (and how) networks should
Proc. Conf. ACM Special Interest Group Data Commun., 2017, pp. 1–14. run themselves,’’ 2017, arXiv:1710.11583. [Online]. Available:
[368] D. Kim, Y. Zhu, C. Kim, J. Lee, and S. Seshan, ‘‘Generic external memory http://arxiv.org/abs/1710.11583
for switch data planes,’’ in Proc. 17th ACM Workshop Hot Topics Netw., [393] D. D. Clark, C. Partridge, J. C. Ramming, and J. T. Wroclawski,
Nov. 2018, pp. 1–7. ‘‘A knowledge plane for the Internet,’’ in Proc. Conf. Appl., Technol.,
[369] D. Kim, Z. Liu, Y. Zhu, C. Kim, J. Lee, V. Sekar, and S. Seshan, ‘‘TEA: Archit., Protocols Comput. Commun. (SIGCOMM), 2003, pp. 3–10.
Enabling state-intensive network functions on programmable switches,’’ [394] A. Mestres, A. Rodriguez-Natal, J. Carner, P. Barlet-Ros, E. Alarcón,
in Proc. ACM SIGCOMM Conf., 2020, pp. 90–106. M. Solé, V. Muntés-Mulero, D. Meyer, S. Barkai, M. J. Hibbett,
[370] T. Mai, S. Garg, H. Yao, J. Nie, G. Kaddoum, and Z. Xiong, ‘‘In-network G. Estrada, K. Ma’ruf, F. Coras, V. Ermagan, H. Latapie, C. Cassar,
intelligence control: Toward a self-driving networking architecture,’’ J. Evans, F. Maino, J. Walrand, and A. Cabellos, ‘‘Knowledge-defined
IEEE Netw., vol. 35, no. 2, pp. 53–59, Mar. 2021. networking,’’ ACM SIGCOMM Comput. Commun. Rev., vol. 47, no. 3,
[371] Y. Shi, M. Wen, and C. Zhang, ‘‘Incremental deployment of pro- pp. 2–10, Sep. 2017.
grammable switches for sketch-based network measurement,’’ in Proc.
ELIE F. KFOURY (Graduate Student Member, IEEE) is currently pursuing the Ph.D. degree with the College of Engineering and Computing, University of South Carolina, USA. He previously worked as a research and teaching assistant in the computer science and ICT departments at the American University of Science and Technology in Beirut. He is a member of the CyberInfrastructure Laboratory (CI Lab), where he developed training materials for virtual labs on high-speed networks, TCP congestion control, WAN, performance measuring, buffer sizing, cybersecurity, and routing protocols. His research interests include telecommunications, network security, blockchain, the Internet of Things (IoT), and P4 programmable switches.

JORGE CRICHIGNO (Member, IEEE) received the Ph.D. degree in computer engineering from The University of New Mexico, Albuquerque, USA, in 2009. He is currently an Associate Professor with the College of Engineering and Computing, University of South Carolina (USC), and the Director of the Cyberinfrastructure Laboratory, USC. His work has been funded by private industry and U.S. agencies, such as the National Science Foundation (NSF), the Department of Energy, and the Office of Naval Research (ONR). He has over 15 years of experience in the academic and industry sectors. His research interests include P4 programmable switches, implementation of high-speed networks, network security, TCP optimization, offloading functionality to programmable switches, and IoT devices.

ELIAS BOU-HARB (Senior Member, IEEE) received the Ph.D. degree in computer science from Concordia University, Montreal, Canada, which was executed in collaboration with Public Safety Canada, Industry Canada, and NCFTA Canada. He was a Senior Research Scientist with Carnegie Mellon University (CMU), where he contributed to federally-funded projects related to critical infrastructure security and worked closely with the Software Engineering Institute (SEI). He is currently the Director of the Cyber Center for Security and Analytics, UTSA, where he leads, co-directs, and co-organizes university-wide innovative cyber security research, development, and training initiatives. He is also an Associate Professor with the Department of Information Systems and Cyber Security, specializing in operational cyber security and data science as applicable to national security challenges. He is also a Permanent Research Scientist with the National Cyber Forensic and Training Alliance (NCFTA) of Canada, an international organization which focuses on the investigation of cyber-crimes impacting citizens and businesses. He has authored more than 90 refereed publications in leading security and data science venues and has acquired state and federal cyber security research grants valued at more than $4M. His research and development activities and interests include operational cyber security, attack detection and characterization, malware investigation, cyber security for critical infrastructure, and big data analytics. He was a recipient of five best research paper awards, including the prestigious ACM's Best Digital Forensics Research Paper.