Intelligent Routing Based On Reinforcement Learning For Software-Defined Networking

See discussions, stats, and author profiles for this publication at: https://www.researchgate.
net/publication/345753080
Intelligent Routing Based on Reinforcement Learning for Software-Deﬁned

Networking
Article in IEEE Transactions on Network and Service Management · March 2021

DOI: 10.1109/TNSM.2020.3036911
CITATIONS READS
5 550
3 authors, including:
Daniela Maria Casas-Velasco Oscar Mauricio Caicedo Rendon

University of Campinas Universidad del Cauca
4 PUBLICATIONS 11 CITATIONS 85 PUBLICATIONS 738 CITATIONS
SEE PROFILE SEE PROFILE
Some of the authors of this publication are also working on these related projects:
Machine Learning for Networking View project
SDN Management View project
All content following this page was uploaded by Oscar Mauricio Caicedo Rendon on 12 January 2021.
The user has requested enhancement of the downloaded file.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TNSM.2020.3036911, IEEE
Transactions on Network and Service Management
JOURNAL OF LATEX CLASS FILES, VOL. 00, NO. 00, APRIL 2020 1
Intelligent Routing based on Reinforcement

Learning for Software-Defined Networking
Daniela M. Casas-Velasco, Oscar Mauricio Caicedo Rendon, and Nelson L. S. da Fonseca
Abstract—Traditional routing protocols employ limited infor- routing is typically assumed. This type of routing increases
mation to make routing decisions, which can lead to a slow signaling overhead and can contribute to the formation of
adaptation to traffic variability, as well as restricted support congestion. Several other solutions [24]–[30] employ both ML
to the Quality of Service (QoS) requirements of applications.
This paper introduces a novel approach for routing in Software- and SDN, but they depend on traditional routing protocols.
Defined Networking (SDN), called Reinforcement Learning and In this paper, we introduce a novel approach for routing in
Software-Defined Networking Intelligent Routing (RSIR). RSIR SDN, called Reinforcement Learning and Software-Defined
adds a Knowledge Plane to SDN and defines a routing algorithm Networking for Intelligent Routing (RSIR). RSIR adds a
based on Reinforcement Learning (RL) that takes into account Knowledge Plane and defines a routing algorithm based on
link-state information to make routing decisions. This algorithm
capitalizes on the interaction with the environment, the intelli- Reinforcement Learning (RL) that takes into account link-state
gence provided by RL and the global view and control of the information to explore, learn, and exploit potential paths for
network furnished by SDN, to compute and install, in advance, intelligent routing, even during dynamic traffic changes. This
optimal routes in the forwarding devices. RSIR was extensively algorithm capitalizes on interaction with the environment, the
evaluated by emulation using real traffic matrices. Results show intelligence provided by RL, and the global view and control
RSIR outperforms the Dijkstra’s algorithm in relation to the
stretch, link throughput, packet loss, and delay when available of the network furnished by SDN. It computes and installs, in
bandwidth, delay, and loss are considered individually or jointly advance, optimal routes in the routing tables of the switches
for the computation of optimal paths. The results demonstrate on the Data Plane. RSIR was extensively validated using
that RSIR is an attractive solution for intelligent routing in SDN. emulation and real traffic matrices. Results show that RSIR
outperforms the Dijkstra’s algorithm in relation to stretch, link
Index Terms—Reinforcement Learning, Routing, Software- throughput, delay, and packet loss produced when delay, loss,
Defined Networking, Knowledge-Defined Networking and available bandwidth are individually or jointly used as
cost to compute optimal paths. Results evince that RSIR is a
I. I NTRODUCTION promising solution to replace traditional routing protocols in
OUTING determines the path taken by packets from a SDN.
R source to a destination node [1]. Although traditional
Internet protocols, such as Open Shortest Path First (OSPF) [2]
The contributions of this paper are:
• An architecture that employs RL for achieving efficient
and Routing Information Protocol (RIP) [3], have successfully and intelligent routing in SDN;
delivered best-effort traffic for several decades, the growth of • A proactive RL-based routing algorithm that considers
traffic and the diversification of the Quality of Service (QoS) link-state metrics to explore, learn, and exploit potential
requirements of emerging Internet applications pose new chal- routes;
lenges for the transport of the flows of packets generated by • A prototype of the proposed architecture.
these applications [4]. Moreover, traditional routing protocols The remainder of this paper is organized as follows. Section
use limited information to make routing decisions, which can II describes the related work. Section III details RSIR, and
lead to a slow adaptation to dynamic traffic changes and Section IV introduces the RSIR routing algorithm. Section V
restricted support to diverse QoS requirements [5]. presents the RSIR prototype and the result of its evaluation.
Software-Defined Networking (SDN) and Machine Learn- Section VI presents conclusions and future work. For the sake
ing (ML) techniques have been envisioned to provide innova- of readability, Table I presents the acronyms used in this paper.
tion for routing protocols [6] [7] [8]. Several solutions [9]–[16]
have enhanced traditional routing protocols by leveraging SDN II. R ELATED W ORK
features, such as programmability, global view, logically cen-
tralized control, and decoupling of network control and packet This section brings up research on routing based on SDN,
forwarding [17]. However, these do not fully exploit knowl- and ML as well on these two techniques. Table II briefly
edge about the network operation to accomplish intelligent summarizes the description of the protocols reviewed, the type
routing. Other approaches have attempted to improve rout- of routing employed, how SDN and ML are considered, and
ing by using ML techniques [18]–[23]; however, distributed the metrics evaluated. Next, we briefly describe these papers
and point out their main shortcomings.
D. M. Casas-Velasco, and N. L. S. da Fonseca are with the Institute of Com- In SDN, instead of flooding routing updates over the entire
puting, University of Campinas, Brazil. e-mail: danielac@lrc.ic.unicamp.br network, information is sent to a centralized controller. Indeed,
and nfonseca@ic.unicamp.br.
O. M. Caicedo is with the Department of Telematics, Universidad del the controller is the only device informed about route changes;
Cauca, Popayán, Colombia. e-mail: omcaicedo@unicauca.edu.co. as a result, the signaling overhead and routing convergence
1932-4537 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Universidade Estadual de Campinas. Downloaded on November 12,2020 at 13:07:43 UTC from IEEE Xplore. Restrictions apply.
Table I: Acronyms
Acronym Term Acronym Term Acronym Term
API Application Programming Interface NN Neural Network SDN Software-Defined Networking
BGP Border Gateway Protocol OSPF Open Shortest Path First TCP Transmission Control Protocol
IoV Internet of Vehicles QoS Quality of Service UDP User Datagram Protocol
KDN Knowledge-Defined Networking RIP Routing Information Protocol VM Virtual Machine
LLDP Link Layer Discovery Protocol RL Reinforcement Learning WSN Wireless Sensor Network
ML Machine Learning RSIR RL and SDN Intelligent Routing
NBI Northbound Interface SBI Southbound Interface
Table II: Related Work

Paper Description Control SDN ML Performance metrics
Convergence time, throughput, packet
[9] An OSPF implementation in SDN, and conventional networks C X
loss ratio, jitter and delay
[10] A comparative analysis of SDN and OSPF legacy networks C X Convergence time
A comparison of routing performance when link failure occurs, considering
[11] the topology scale between SDN and conventional routing in traditional C X Convergence time
networks
NSAF: An application-aware routing in SDN employing application reg-
[12] C X Bandwidth, packet loss, delay, and jitter
istration, network monitoring, violation detection, and routing control
[13] SAQR: A QoS-aware routing in hybrid SDNs which adjusts the weights C X Throughput, delay, and loss rate
of metrics of a cost function using Dijkstra and genetic algorithms
An adaptive greedy flow routing algorithm for performance improvement Average delay, blocking probability,
[14] C X
in software-defined networks and running time
[15] L2RM: A framework for load-balancing and route management for fat-tree C X Number of installed rules
SDN-based data center networks
[16] Flow-aware routing and forwarding for SDN scalability in Wireless Data C X Number of flow setup requests
Centers
Q-PR: A routing algorithm that considers the message importance to reduce
[18] D X Delivery rate
overhead with learning techniques in Wireless Sensor Networks
FROMS: Routing mechanism in which nodes share local information
[19] D X Delivery rate and network overhead
without producing overhead
Appropriate input and output characterizations of network traffic for Signaling overhead, throughput and de-
[20] D X
supervised neural networks in Wireless Sensor Networks (WSNs) lay
Introduction of a proof-of-concept for applying the deep learning technique Signaling overhead, throughput and de-
[21] D X
for performing intelligent traffic control in future networks lay
ORuML: Optimized Routing in wireless networks by using ML techniques
[22] D X Accuracy and Area under ROC curve
to predict the type of the source and destination nodes
AQ-Routing: mobility, stability-aware adaptive routing protocol for data Mean packet loss ratio and end-to-end
[23] D X
routing in MANET–IoT systems delay
SDCoR: A Q-learning based approach for selecting a traditional routing
[24] algorithm in response to the Internet of Vehicles (IoV) environment D X X Delivery delay and delivery ratio
changes
[25] A supervised deep learning route computation for core networks over C X X Delay, throughput, and signaling over-
Software-defined routers head
An ML-based meta-layer composed of multiple modules, each associated
[26] C X X Network delay
with a single source-destination pair for getting their routing path
Distributed routing with SDN in which an agent learns the most important
[27] D X X Delay and jitter
parameters to define the link costs for calculating routes in OSPF
QAR: QoS-aware adaptive routing with multi-layer hierarchies intended to Delivery delay, packet loss,and hop
[28] C X X
minimize signaling delay in large SDNs count
A meta-heuristic with ML traffic prediction for energy efficient optical Energy consumption and time overhead
[29] C X X
routing in mobile metro-core networks for routing calculation
An ML mechanism to accomplish the routing and wavelength assignment
[30] C X X Computational time
for an input traffic matrix in optical networks
C: Centralized and D: Distributed
time are less affected than in traditional routing [31]. The [16] proposed routing solutions based on SDN features, such
Control Plane provides a logically centralized control point as programmability, global view, the decoupling of network
that can make decisions based on collected information and traffic and control, and logically centralized control. However,
deploy routing rules on the Data Plane. The work in [9]– these proposals use traditional distributed routing protocols
that do not learn from all potential features affecting network on traditional routing protocols for calculating and deploying
congestion. Moreover, these proposals do not exploit intelli- routing strategies as in [24]–[27], [29], [30], all of which
gent solutions based on ML techniques. employ both SDN and ML.
The work in [18]–[23] employed ML techniques, such as
RL and Neural Networks (NNs) to improve routing decisions.
However, these proposals do not exploit SDN features. They
deploy routing strategies in a distributed fashion in a way that
routing nodes turn themselves into a learning entity that makes
local routing decisions based on information learned from the
environment. However, these routing solutions can generate
signaling overhead and contribute to network congestion.
The work in [24]–[30] introduced routing strategies that
encompass both ML and SDN. Network programmability
provided by SDN allows the inclusion of ML techniques in the
routing solutions. This type of solution is based on traditional
routing protocols, either for deploying routing strategies or
training ML models. Such dependence makes the proposed
solutions prone to choose the same paths when similar con-
gestion patterns occur.
RSIR aims at addressing the shortcomings of the proposals
mentioned above. RSIR leverages the centralized, view, con-
trol, and programmability of SDN, as well as the cognitive
capabilities of RL. RSIR proactively defines and installs routes
based on decisions made by using link-state information with
no dependence on traditional routing protocols. Moreover, it Figure 1: RSIR architecture
does not add any signaling overhead to the network operation.
Overall, RSIR operates as described in Figure 1, and the
explanation is detailed next. 1 The Control Plane collects
III. RSIR
raw data about the network status by periodically querying the
This section presents an overview of RSIR, its architecture, Data Plane. 2 The Management Plane retrieves these data to
and components. calculate and store link-state information. 3 The Knowledge
Plane recovers information from the Management Plane. 4
A. Overview The RL-agent uses this information to explore-exploit all the
RSIR adds a Knowledge Plane, that employs RL, to SDN possible routes for each pair of nodes. It learns and calculates
for achieving intelligent routing. The addition of a Knowledge the best route according to the state of the links. 5 The
Plane to a network was first proposed by Clark et al. [32] Knowledge Plane stores the information about the routes
in 2003 and later revisited by Mestres et al. in 2017, when computed by the RL-agent. 6 The Control Plane retrieves the
they proposed the concept of Knowledge-Defined Networking route information to install paths in the flow tables of switches
(KDN) [33]. The Knowledge Plane is intended to furnish de- before new traffic arrives (i.e., RSIR is proactive). In this way,
scriptions (recognize-explain), recommendations (recognize- the Data Plane can forward packets without consulting the
explain-suggest), and automation (recognize-act) in support of controller, eliminating the latency resulting from the queries
decision-making processes such as routing. KDN involves ML sent to the controller on a flow basis. RSIR reacts fast to traffic
[34], telemetry [35], network analytics [36], and a Knowledge changes, and it can make routing decisions based on only three
Plane [32] to enable intelligent SDN. In addition to the Knowl- link metrics.
edge Plane, the KDN architecture includes the traditional SDN
planes (i.e., Management, Application, Control, and Data) B. Architecture and Components
[37] [38] [6]. The SDN planes support network automated Figure 1 depicts the RSIR architecture intended to provide
management and control, while the Knowledge Plane obtains intelligent routing in SDN. We then provide details about the
and transforms information into knowledge by using ML layers and components of RSIR.
techniques for improving network management and operation. 1) The Data Plane: This plane includes the forwarding
RSIR was designed for routing the network traffic automati- devices and the links connecting them. These devices perform
cally by using information about the network operation. RSIR a set of basic tasks, such as the forwarding of incoming packets
uses RL to find the best route for all the source-destination to a specific port or the dropping of packets. The traffic moves
pairs by employing just a few link-state metrics (i.e., available according to the path information installed in the flow tables of
bandwidth, loss, and delay) as features of the RL process. the forwarding devices; these operate unaware of the rest of the
RSIR takes advantage of SDN planes to obtain a centralized network and rely on the Control, Management, and Knowledge
view and control. In this way, RSIR can adapt routes according Planes to populate and install their flow tables. The Data Plane
to dynamic traffic changes. Moreover, RSIR does not rely realizes the RSIR routing strategies and periodically provides
network information by responding to queries sent by the example of an entry in this repository would be as follows:
Control Plane. (src = n8 , dst = n14 , route = [n8 , n5 , n16 , n14 ]).
2) The Control Plane: This plane builds up a global view The RL-agent fills the Route Data Repository by computing
of the Data Plane by gathering information (per port or flow) the most-rewarding route for each source-destination pair. In
from the forwarding devices. It handles the correct installation particular, the RL-agent uses a Q-learning algorithm to learn
of the routing rules in the table of these devices at the Data the routing policy. The action-state value function Qt (St , At )
Plane. The Control Plane includes three modules: Topology that estimates rewards for each action-state pair determines the
Discovery, Statistics, and Flow Installation modules. routing policy. The RL-agent explores all possible action-states
The Topology Discovery module sends feature-request mes- for a source to reach its destination while updating the value
sages to the forwarding devices (i.e., a switch) on the Data function and obtaining the corresponding reward. This reward
Plane, which answers back by sending feature-reply messages indicates the impact of executing an action At (i.e., choosing
containing feature information, such as the id, number of ports, a neighbor node as next hop) at the state St (i.e., source node)
and the state of the ports. With the received information, on the available link bandwidth, delay, and packet loss ratio.
this module infers the topology by relating the port of each Section IV defines the operation of the RL-based routing.
switch to the ports of the neighboring switches, and the hosts The knowledge Plane can be deployed either on top of the
connected to the ports at each switch. Topology information is Control Plane, as is the Application Plane in the standardized
available during execution as a collection of tuples (i.e., switch SDN [40], or separately [32] [33]. Both deployments are
id, port id, neighbor switch id, neighbor port id) associated feasible, since the Knowledge Plane communicates with the
with the network links. Control Plane via Northbound Interfaces (NBIs). RSIR uses
The Statistics module sends request-state messages to each a separate Knowledge Plane to avoid overloading the Control
forwarding device on the Data Plane at every t seconds. Plane with RL tasks.
This module receives the request-stats reply messages asyn- 5) Flow Handling Proactive Process: This process involves
chronously. Finally, this module maintains statistical informa- a proactive path computation and the installation of flows (see
tion during execution time for processing by the Management Figure 2). The Statistics module gathers raw data about the
Plane, and periodically sends messages to keep the global view network and passes them on to the Data Processing module.
of the network updated. This latter module obtains link-state information by processing
The Flow Installation module operates proactively by pop- port information and stores it in the Network Information Data
ulating the flow tables of switches ahead of time for all traffic Repository. The Knowledge Plane then receives link-state in-
matches; a Southbound Interface (SBI) such as OpenFlow can formation. The RL-agent uses this information to explore and
perform the installation. The installation of the paths (as flow learn the best routes for all pairs of nodes. The routing decision
entries) in the flow tables influences the forwarding of traffic is made available in the Route Data Repository. The Flow
in the Data Plane [39]. A misleading flow entry may cause Installation module of the Control Plane reads the routing
the sending of traffic to highly utilized paths, and, as a result, decision from the Route Data Repository on the Knowledge
the occurrence of network congestion may occur. Plane. For each pair of nodes, the Flow Installation module
3) The Management Plane: This plane ensures the correct receives the location of the source-destination hosts from the
operation and performance of the network in the long term. Topology Discovery module. It then reads the information
It contains the Data Processing module and the Network about the chosen path and host locations to write the flow
Information Data Repository. The Data Processing module entries in the flow table of the corresponding switch.
retrieves and uses raw data gathered by the Control Plane (via
Statistics and Topology Discovery modules) to calculate the
link available bandwidth, delay, and packet loss ratio. These
metrics characterize the link-state for the selection of routes.
The Network Information Data Repository stores the metrics
calculated by the Data Processing module. This repository
contains a set of entries that represents the source-destination
pairs and the corresponding tuples of metrics. An example of
an entry in this repository would be as follows: (source = n1 ,
destination = n2 , av_bw = 100Kbps, delay = 1.3ms, loss =
0.5%). The link-state information is an input to the Knowledge
Plane.
4) The Knowledge Plane: This plane learns the network
behavior and employs intelligence to the computation of paths.
It interacts with the Management and Control Planes for Figure 2: Proactive Flow handling process
retrieving link-state information and installing the computed
routes. This Plane contains the Route Data Repository and the 6) The Topology Change Handling Process: This process is
RL-agent. The Route Data Repository contains a set of entries evoked whenever there is a change in the network topology. It
with information about the path for all source-destination pairs. involves three tasks: detection of topology changes, calculation
Each entry is a tuple of source, destination, and best path. An of routes, and installation of routes. The Topology Discovery
module performs the detection of topology changes (due ˆ d,

Each normalized value (i.e., bwa, ˆ and ˆl) was obtained
to the addition or failure of either a node or a link) by by employing Equation 3. In this equation, xi is a value to
periodically gathering information about the topology at each be normalized, and X represents the set of values used in the
Change Detection Time (tchad ). The RL-agent uses a period, normalization.
called Learning Time (tlear ), to calculate routes by running
Algorithm 1 (see Section IV). The Flow Installation module (xi − min(X)) · (b − a)
x̂i = a + (3)
uses another period (Installation Time (tinst )), to update the max(X) − min(X)
forwarding tables for all switches on the Data Plane. The 2) The State Space (S): Each state in the State Space
duration of tinst depends on the number of forwarding tables, corresponds to a switch on the Data Plane, and a transition
and the number of flow entries to be updated. The total from one state to another corresponds to a link connecting the
time required to recover from a topology change is, thus, two corresponding switches. Therefore, the topology of the
tchad + tlear + tinst . State Space is a graph corresponding to the topology of the
switches on the Data Plane. The cardinality of the set of states
IV. RSIR ROUTING in the State Space is |S = {si }| ≡ N , where N is the number
The RL-agent uses the Q-learning technique to define the of switches on the Data Plane. The Topology Discovery module
routes to be followed by flows. Q-learning is a model-free builds a topology map with nodes corresponding to states in
technique and, thus does not require knowledge about the the State Space and, then, passes that map to the RL-agent.
underlying reward resulting from taking a specific action in
a particular state [41]. The Q-learning technique (RL-agent) 3) The Action Space: This corresponds to the set of all
approximates the optimal action-value function Qt (St , At ) actions, A, that can be taken on the states of the State Space.
named as Q-function by visiting all action-states and updat- For each state, si ∈ S, the RL-agent can take one of a set
ing the value in consideration of a pre-defined number of of actions; each action leads to the transition to another state
independent episodes [42]. A learning episode comprises a sj ∈ S. The neighboring states of a state si correspond to
sequence of steps taken in the transition from an initial state the neighboring switches of that associated to the state si .
to a target state (i.e., a source and destination node).Each Therefore, the number of potential actions when on a state,
step consists of selecting and performing an action, changing si , is the nodeP
degree of that state dr(si ) and the cardinality
the state (i.e., moving from one to another), and receiving a of A is |A| ≡ si ∈S dr(si ).
reward. The updated Q-function value is the underlying reward
for the execution of action At in the state St , which provides 4) The Optimal Policy: This policy is designed to minimize
an optimal reward. Next, we provide details about the RL- the reward value in the Q-learning routing process. In this
agent, the RL-based routing algorithm, and its computational way, the RL-agent learns to avoid links with a high delay
complexity. and loss ratio as well as prioritizing links with large available
bandwidth when choosing a path. The agent approximates the
optimal Q-function Qt+1 (St , At ) by visiting all the pairs of
A. Reinforcement Learning Agent
action-states. It updates and stores the Q-value in the Q-table
1) The Reward Function: The reward function, defined used to find the best path for a pair of nodes. The Q-value is
in Equation 1, is used to find the best routing options on a measure of the overall expected reward when the RL-agent
the basis of the three mentioned features gathered by the is in a state St and performs action At .
Knowledge Plane from the Management Plane. The reward is The RL-agent uses Equation 4 to update the Q-value. In
inversely proportional to the available link bandwidth bwalink Equation 4, the expression between square brackets represents
and directly proportional to the link delay dlink and the link the updated value which is the difference between the current
packet loss ratio llink . The values β1 , β2 and, β3 ∈ [0, 1] are estimate of the optimal Q-function Qt (St , At ) for a pair
tunable parameters useful for providing a weight value for a of state-action (St , At ) and the new estimate. The updated
specific metric in the calculation of the reward. Q-value (Qt+1 ) depends on the previous value Qt that is
1 influenced by the result of (St , At , Rt , St+1 ) and α. Rt is
R = β1 · + β2 · dlink + β3 · llink (1)
bwalink the reward at time t and α ∈ [0, 1] being the learning rate
We normalized Equation 1 to avoid one of the link-state that determines the weight of the newly gained information
metrics having a greater influence than the others during the in relation to the previous one. A factor α = 0 makes the
learning process of the RL-agent. Since the metrics considered RL-agent unable to learn from the latest (St , At ) pair, while a
in Equation 1 are in different units, the factors were normal- factor α = 1 allows the agent to keep the learned information
ized using the Min-Max technique [43]. This normalization by considering the immediate reward Rt for the pair (St , At ).
technique involves re-scaling the range of the metrics to a
h i
range with values in an arbitrary interval: [a, b]. Equation 2 Qt+1 (St , At ) = Qt (St , At ) + α · Rt + minQt (St+1 , A) − Qt (St , At )
A
shows the normalized version of the reward function, where
ˆ d, ˆ and ˆl are the normalized values of the link available (4)
bwa,
bandwidth, delay, and loss ratio, respectively. The Q-learning technique is both computationally and
1 memory-efficient in a typical SDN routing scenario. However,
R̂ = β1 · + β2 · dˆ + β3 · ˆl (2) if the state space and the action space are large, Q-learning
ˆ
bwa
can take time and require more data to find the optimal policy Algorithm 1: Q-learning routing
(i.e., converge). Input :
5) Exploration and Exploitation Method: This method is Learning rate: α
used by Q-learning to find the value of the Q-function. In Q- Exploration and exploitation parameter: ε
learning, there is a trade-off between selecting the expected Number of learning episodes: n
optimal action (exploitation) and selecting a different action All pair of nodes (src, dst): P list
in the hope it may yield a greater reward in the future Network link-state
(exploration) [44]. Our approach uses the ε-greedy exploration
and exploitation method. ε-greedy takes an ε ∈ [0, 1] as a 1 foreach (src, dst) ∈ P list do
tunable parameter which allows the agent to exploit with a 2 Initialize Q : Q(S, A) = 0, ∀s ∈ S, ∀a ∈ A
probability pr = ε and to explore with a probability pr = 1−ε. 3 for episode ← 1 to n do
Therefore, this parameter determines how much the RL-agent 4 Start in state St = src ∈ S;
exploits and explores during the learning process. 5 while St+1 is not dst do
The RL-agent uses Equation 5 to select the next action at 6 Select At for St with policy derived from
a specific state. At each step, it generates a random value Q using ε-greedy exploration and
x ∈ [0, 1]. If x < ε, the agent exploits; otherwise, it explores. exploitation method;
7 Rt+1 ← R(St , At ) // Agent gets
(
minQt (St , A), if x < ε the reward from network
A= A (5) link-state and observes new
random action, otherwise state St+1 ;
8 Qht+1 (St , At ) = Qt (St , At ) + α ·
B. RL-based Routing Algorithm
i
Rt + minQt (St+1 , A) − Qt (St , At )
The routing algorithm implements the learning process used A
// Update Q-function Eq. 4;
to find the best paths for all the pairs of nodes on the Data
9 St ← St+1 // Move to the next
Plane. Algorithm 1 receives as input the learning rate α,
state;
the parameter ε (cf. Equation 5), the number of learning
10 end
episodes, all pairs of source-destination nodes, and link-state
11 end
information. The output is the set of best-rewarding routing
12 Take Q table and find the path between src and
paths for all pairs of nodes, given the state of the links. The
dst with state-action pairs that achieved the
path is formed by the state-action pairs with the lowest values
lowest Q-values
in the Q-table. 13 end
Our routing algorithm finds the most-rewarding path for
every pair of nodes in the network. For each pair (src, dst), 14 Store the set of paths for all pair of nodes in the
the algorithm executes a Q-learning process by following the network in the Routs Data Repository
steps defined in Lines 1 to 13. The RL-agent initializes the
Q-table Q(S, A) with zero (Line 2). The Q-learning process
then begins with node src as the initial state (Line 4). The that achieved the lowest Q-values (Line 12). Once the RL-
RL-agent goes through learning episodes to minimize the Q- agent finds the best path for all pairs of source-destination
value by using the reward function. The agent attributes low Q- nodes, it stores them in the Route Data Repository (Line 14).
values to paths formed by links with large available bandwidth, The Flow Installation Module then retrieves these best paths
short delays, and small packet loss. Then, the algorithm goes and installs them in the routing table of the switches.
through learning episodes (Line 3) until St becomes the end The worst-case complexity of Algorithm 1 is derived next.
state (i.e., node dst, Lines 5 to 9). First, the RL-agent then
selects the next node from the Action Space (selects At for St ); Lemma 1: In Q-learning algorithms with a state space
the RL-agent chooses a node from the neighboring nodes of topology with a linear upper action bound b ∈ ℵ0 ⇐⇒
the current one as next-hop by using the ε-greedy exploration e ≤ bn for all n ∈ ℵ0 , where e is the cardinality of the action
and exploitation method (Line 6). Next, the RL-agent uses the space and n the cardinality of the state space, the worst-case
network link-state information and the state St to calculate complexity is O(n2 )
the reward associated with the action At , and observes the Proof 1: see Page 103 in [45]
new state St+1 . The reward is computed by using Equation 1
(Line 7). After that, and considering the learning rate, initial Theorem 1: The worst case complexity of Algorithm 1 is
considerations, reward, and new state, the RL-agent tunes the O(N 2 ).
values of the Q-function by using Equation 4 (Line 8). It then
moves to the new state (Line 9), the episode ends, and a new Proof 2: dr(si ) ≤ N − 1, for N ∈ ℵ0 =⇒ |A| ≤ N 2
episode starts. The worst-case complexity of the Dijkstra’s algorithm is
Finally, after the RL-agent has completed a transition, it also O(N 2 ) and its heap implementation O(N logN ) [46]. In
uses the resulting Q-table to compute the most-rewarding path Section V, it will be shown that RSIR routing has low overhead
between the src and dst nodes, based on the state-action pairs as well as convergence time.
V. E VALUATION
This section presents the evaluation of RSIR. Subsections
V-A and V-B show the test environment and the prototype of
RSIR, respectively, while Subsections V-C and V-D discuss
the gathering of performance metrics and traffic generation,
respectively. Subsection V-E presents the set up of the learning
parameters, and Subsection V-F discusses the results.
A. Test Environment
Fig. 3 presents the test environment used for the evalua-
tion of RSIR. It includes a topology mirroring the GÉANT
topology [47] which is the European data network for the
research and educational community. GÉANT has 23 nodes
and 37 links. In tests, link capacity distribution follows the
2004 GÉANT topology which has 50% of the links with 10
Gbps, 40% with 2.5 Gbps, and 1% with 155 Mbps.
We built and deployed the GÉANT topology in Mininet
2.2.2 [48] by using a Python script. As Mininet is limited
to the resources of its host machine, we scaled the 10 Gpbs,
2.5 Gpbs, and 155 Mbps link capacities of GÉANT to 100
Mbps, 25 Mbps, and 1.55 Mbps, respectively. The traffic load Figure 4: Prototype of RSIR
was also scaled to the same proportion. In our deployment,
each switch had a host that forwarded and received traffic. We
Topology Discovery and Flow Installation modules were de-
deployed the Data Plane over an Ubuntu Server 14.04 Virtual
veloped and deployed using the Ryu Application Program
Machine (VM) with 8 GB RAM and used Mininet with Open
Interface (API), which allows querying and retrieving statistics
vSwitches 2.3.1 and hosts.
and features from the switches. For the Management and
Knowledge Planes, we developed the Data Processing module
and the RL-agent by using Python 2.7 with Pandas 0.22 library
[50]. The Pandas library provides high-performance, easy-to-
use data structures, and data analysis tools for Python. We used
Comma-Separated Values (CSV) and JavaScript Object Nota-
tion (JSON) files to store the link-state information provided
by the Management Plane and the routing paths calculated by
the RL-agent, respectively. The RSIR prototype, as well as all
test scripts, are available in [51].
The Control, Management, and Knowledge Planes were run
on an Ubuntu 16.04 VM with a Core i5-4690 processor and
10 GB RAM. The Data Plane ran on an Ubuntu Server 14.04
VM with a Core i5-4690 processor and 4 GB RAM. The VMs
used for this prototype were hosted by an Ubuntu Desktop
16.04 with an Intel Core i5-4690 and 16 GB RAM. These
VMs communicated using the Transmission Control Protocol
(TCP). For the communication between the Control and the
Data Plane, we used Openflow1.3 as the protocol between
since it is the de-facto protocol used as SBI in SDN [15].
C. Performance Metrics
According to Table II, the performance metrics most com-
mon used to evaluate routing proposals are throughput, loss
Figure 3: Test environment ratio, and delay. In addition to these metrics, we compared the
stretch of the paths given by RSIR with those given by the
Dijkstra’s algorithm using different edge weights. The stretch
compares the length of an actual path to the theoretical shortest
B. Prototype path [52]. For computing the shortest path, we developed an
Fig. 4 presents the RSIR prototype. In the Control Plane, application that executes the Dijkstra’s algorithm with equal
we used a Python-based Ryu controller [49]. The Statistics, edge weight values on the SDN controller.
800 120
α = 0.9
α = 0.7
100 α = 0.5
Number of hops
Number of hops
600
α = 0.3
80 α = 0.1
400
α = 0.9 60
α = 0.7
200 α = 0.5
α = 0.3
40
α = 0.1
0 20
0 50 100 150 200 250 300 0 50 100 150 200 250 300
Episodes Episodes
(a) = 0.2 (b) = 0.4
100 80
α = 0.9 α = 0.9
α = 0.7 α = 0.7
80 α = 0.5 α = 0.5
Number of hops
60
Number of hops
α = 0.3 α = 0.3
60 α = 0.1 α = 0.1
40
40
20
20
0 0
0 50 100 150 200 250 300 0 50 100 150 200 250 300
Episodes Episodes
(c) = 0.6 (d) = 0.8
Figure 5: Number of hops for different learning parameters (, α)
We computed the link throughput and loss using the number message to go from c0 to the si port (c0 -si ) is estimated
of packets that passed through the switch port connected to as half the time that elapsed between the transmission and
the link. At each port, the SDN controller periodically samples reception of the the OpenFlow echo_request and echo_reply
the number of bytes transmitted and received. By comparing messages sent by c0 to si . A similar procedure is used to
the retrieved values at two different instants, it is possible to estimate the time elapsed by the message to go from sj to
discover the instantaneous throughput. After sending a request- c0 . The instantaneous delay in the link (si , sj ), dsi−sj , is then
stats to the Data Plane, at time t1 , a reply message is received computed as dsi−sj = dlldpcij − dc0 −si − dc0 −sj .
containing the number of bytes received, bt1 . Then, after a In a period of 10s, the Statistics module collects the statis-
period p, another request-stats message is sent, and the number tics; the Data Processing module computes the metrics that
of bytes received bt2 is retrieved from its reply message. The characterize the link-state, the RL-agent computes the paths
expression used_bw = [(bt2 − bt1 )/p] gives the instantaneous according to Algorithm 1; and the Flow Installation Module
throughput, where p is the duration of the sampling interval. installs the computed paths in the flow tables of the switches.
The computation of the loss ratio uses the request-stats The definition of the duration of the interval followed the
values in the reply messages; with the expression loss = guidelines in [56].
(btxt1 −brxt2 )/btxt1 giving the instantaneous loss ratio. After The RL-based routing is compared to that suggested by the
sending a request-stats to the Data Plane, at time t1 , a reply different variations of the Dijkstra’s algorithm using the instan-
message is received containing the number of transmitted taneous delay (Dijkstradelay ), instantaneous loss (Dijkstraloss ),
bytes, btxt1 . Then, after a period p, another request-stats and link available bandwidth (Dijkstrabw ) as edge weights.
message is sent, with the reply message received containing Moreover, we compare the results of RSIR with the routing
the number of received bytes, brxt2 . derived by the Dijkstra’s algorithm that uses the value given by
Equation 1 (Dijkstracomp ) as edge weight. Routing determined
We computed the instantaneous delay by following the
by these variations of the Dijkstra’s algorithm was subject
process described in [53], which uses the messages of the Link
to the same traffic scenario used for RSIR and executed
Layer Discovery Protocol (LLDP) [54] and the OpenFlow
as an application on the SDN controller. In this way, these
messages [55]. The SDN controller, c0 , sends an LLDP mes-
algorithms could be compared under the same conditions.
sage which goes through the path c0 -si -sj -c0 , and returns to
c0 ; with si and sj being switches connected by the link (si , sj ).
The time elapsed between the transmission and reception of D. Traffic Generation
the LLDP message is the difference between the timestamp The tool iperf3 [57] was used to generate traffic in the
values in this message (dlldpcij ). The time taken by the Mininet-based emulation; scripts to run clients and servers
iperf3 on the hosts were developed. User Datagram Protocol Generated traffic (Mbps)
5488.4 5043.2 5095.2 6880.4 9581 9377.4 9705.1 9625.4 9725.8 9466.8 9276.8 8453.2 7917.1 7522.3 6322.5 5872.3
(UDP) traffic was generated since iperf3 allows specifying 1.35 RSIR Dijkstracomp Dijsktradelay Dijsktraloss Dijkstrabw
a target transmission rate per connection. The experiment

1.3
involved the complete execution of the iperf3 scripts that
1.25
generated traffic specified by a traffic matrix. We used sixteen
Stretch
publicly available intra-domain traffic matrices [58] for the 1.2
GÉANT topology to generate the traffic between pairs of 1.15
nodes. The matrix values offered the traffic of different pairs 1.1
at different times of the day for 4 months in 2006. We tagged 1.05

the employed traffic matrices with different hours of the day
1
and defined the peak hours as having high traffic intensity 0:00 1:00 3:00 5:00 7:00 8:00 9:00 10:00 11:00 12:00 13:00 15:00 17:00 19:00 21:00 23:00
Hour
(i.e., from 7:00h to 13:00h) since the traffic matrices are
anonymous. Figure 6: Mean stretch throughout the day
E. Learning Parameters Setup For each hour of the day, we collected and plotted the mean
values. The figures also show the traffic generated per hour.
One important aspect when using an RL-solution is the
Fig. 6 shows the mean stretch calculated for all the paths
determination of the values for the learning rate α and ex-
found by RSIR and the variations in the Dijkstra’s algorithm.
ploration (Equations 4 and 5). We employed the number of
To compute the stretch, we compare the path length with
hops along a path between a pair of source and destination
the shortest path given by the Dijkstra’s algorithm with
nodes as a parameter for comparison.
equal edge weights, which gave the shortest path as that
We ran the Q-learning algorithm (cf. Algorithm 1) to find with the minimum number of hops. The results showed that
the potential paths between a pair of nodes by varying α and the paths chosen by RSIR have smaller stretch values than
. The chosen target pair was h23 (switch 23) and h8 (switch those produced by Dijkstradelay , Dijkstraloss , and Dijkstracomp .
8) since this pair is one of those further apart (see red switches Indeed, RSIR chooses a larger number of shorter paths than
in Fig. 3). The shortest path found by the Dijkstra’s algorithm did the three variations of the Dijkstra’s algorithm. RSIR
had 5 hops when all the links had the same cost. For each indicated paths with stretch values 14%, 16%, and 15% stretch
value (i,e., 0.8, 0.6, 0.4, and 0.2), five α values (i,e., 0.9, 0.7, smaller than those obtained by the Dijkstradelay , Dijkstraloss ,
0.5, 0.3, and 0.1) were tested, with 300 episodes per test. and Dijkstracomp algorithms, respectively. Results also showed
The evolution of the convergence to the shortest path as that RSIR indicated paths with stretch values slightly higher
a function of α and needs to be assessed. Fig. 5 presents (< 3%) than those produced by Dijkstrabw .
the number of hops of the paths found by the RL-agent for
different values of α and . Fig. 5a and 5b show that when the Generated traffic (Mbps)
5488.4 5043.2 5095.2 6880.4 9581 9377.4 9705.1 9625.49725.8 9466.8 9276.8 8453.2 7917.1 7522.3 6322.5 5872.3
value was close to 0, the RL-agent explored random actions 30 RSIR
from the Action Space resulting in paths with many hops. Dijkstracomp
Dijsktradelay
When values were close to 1 (Fig 5c and 5d), the RL-agent 25
Dijsktraloss
Dijkstrabw
tended to exploit the knowledge previously gathered instead 20
Delay (ms)
of randomly exploring the Action Space. For = 0.6 and

15
= 0.8, the RL-agent exploited, with a probability of 0.6 and
0.8, the best-known action during the learning process. With 10
a probability of exploitation value close to 1, the RL-agent 5

tended to find shorter paths along the sequence of episodes,
eventually finding the shortest path. For α = 0.9 and α = 0
0:00 1:00 3:00 5:00 7:00 8:00 9:00 10:00 11:00 12:00 13:00 15:00 17:00 19:00 21:00 23:00
Hour
0.7, the RL-agent retained the information learned from recent
rewards. Based on these results, the most appropriate values Figure 7: Mean delay throughout of the day
for the RL-agent parameters of RSIR were α = 0.9 and =
0.8. With these values (see Fig. 5d), the number of hops of Fig. 7 depicts the mean link delay for the results of the
the path converged quickly to the shortest path. RSIR and those of the variations of the Dijkstra’s algorithm.
The mean link delay values produced by RSIR are, on average,
35%, 50%, and 8% and, at most, 70%, 85%, and 30%
F. Results and Analysis lower than those given by the Dijkstradelay , Dijkstraloss , and
In this section, we compare the mean stretch values average Dijkstrabw algorithms, respectively. The mean delay values
for all chosen paths, the mean link delay, loss ratio, and produced by RSIR are lower than those given by Dijkstradelay ,
throughput averaged over all the links in the network, (i.e. the Dijkstraloss , and Dijkstrabw since RSIR usually chooses shorter
mean of a mean link measure. For instance if the target metric and less congested paths than those chosen by the other three
is delay, it is the mean of all mean link delay values) given by variations of the Dijkstra’s algorithm.
RSIR to those given by variations in the Dijkstra’s algorithm, Fig. 8 depicts the mean link loss. The mean loss values
namely, Dijkstradelay , Dijkstraloss , Dijkstrabw , and Dijkstracomp . produced by RSIR are, on average, 30% and, at most, 65%
Generated traffic (Mbps)

5488.4 5043.2 5095.2 6880.4 9581 9377.4 9705.1 9625.4 9725.8 9466.8 9276.8 8453.2 7917.1 7522.3 6322.5 5872.3 G. Topology Change Analysis
0.06
RSIR
Dijkstracomp
We assessed the reaction time for topological changes by
Dijsktradelay RSIR and compared this to those from the RIP and OSPF
0.05 Dijsktraloss
Dijkstrabw
protocols. The topology change handling process of RSIR
involved detecting the topology changes, calculating new
Loss (%)
0.04
0.03
routes, and installing them (subsection III-B6). In the case
study, we set tchad to 1s. A lower value could have been set,
0.02
but such value would increase the traffic intensity between
0.01 the switches and the controller. The RL-agent computed all
0 the routes for the GÉANT network in 7.49s (the average
0:00 1:00 3:00 5:00 7:00 8:00 9:00 10:00 11:00 12:00 13:00 15:00 17:00 19:00 21:00 23:00
Hour execution time of Algorithm 1) and the Flow Installation
module spent on average 1.6s to update the flow entries (22)
Figure 8: Mean loss throughout the day
at each switch (23 in total). Thus, RSIR took on average
tchad +tlear +tinst = 10.6s to handle a change in the network
topology.
lower than the loss ratio given by the four variations of the
On the other hand, RIP would typically take 30s to handle a
Dijkstra’s algorithms. The mean loss values produced by RSIR
topological change. The lower response time of RSIR was due
are lower than those provided by Dijkstradelay , Dijkstraloss ,
to the adoption of a centralized controller with a global view
Dijkstrabw , and Dijkstracomp since RSIR chooses shorter paths
of the network, as well as the adoption of a link-state routing
than do the Dijkstra variations most of the time. Fig. 7 and Fig.
approach. Moreover, the OSPF reaction to topological changes
8 also show that the mean link delay values produced by RSIR
depended mainly on the duration of the following intervals:
are slightly higher (< 3%) than those given by Dijkstracomp ,
hello-interval, dead-interval, and spf-delay. The hello-interval
but the mean loss values produced by RSIR are, on average,
defines how often Hello Packets are transmitted and usually
10% and, at most, 50% lower than the loss ratio given by
takes 10s (by default), ranging from 1s to 255s. These packets
Dijkstracomp . The Dijkstra’s algorithm variations select routes
allow routers running OSPF to recognize their neighboring
usually longer and use, more frequently, low capacity links,
routers. The expiration of the dead-interval indicates that a
leading to traffic concentration and congestion on these links.
neighboring router is down, and it takes by definition the time
Generated traffic (Mbps)
equivalent to four hello intervals. The spf-delay defines the
1800
5488.4 5043.2 5095.2 6880.4 9581 9377.4 9705.1 9625.4 9725.8 9466.8 9276.8 8453.2 7917.1 7522.3 6322.5 5872.3
time elapsed between the detection of a topological change
Avergae of mean link trhoughput (Mbps)
RSIR
1600 Dijkstracomp and the beginning of the calculation of new routes. This value
Dijsktradelay
1400 Dijsktraloss is, by default, 5s. Considering a minimum hello-interval =
Dijkstrabw
1200 1s and an spf-delay = 1s, in the GÉANT network, OSPF
1000 reacts to topological changes in approximately 9s (5s + the
800 average execution time of the Dijkstra’s algorithm). This is
600 approximately the same time as the RSIR takes.
400
200 VI. C ONCLUSIONS AND FUTURE WORK

0
0:00 1:00 3:00 5:00 7:00 8:00 9:00 10:00 11:00 12:00 13:00 15:00 17:00 19:00 21:00 23:00
Hour
This paper has introduced RSIR, an approach based on
RL for intelligent and efficient routing in SDN. The RSIR
Figure 9: Mean link throughput along the day algorithm considers metrics related to the state of the network
links and explores, learns, and exploits optimal routes, even
Fig. 9 shows the mean link throughput along the day. under traffic fluctuation. Comparison of the results given by
RSIR produced a broader distribution of the flows over the RSIR to those the Dijkstra’s algorithm using different edge
network and, therefore, used a higher number of paths less weights have shown that RSIR outperforms these algorithms.
utilized than do the Dijkstradelay , Dijkstraloss , Dijkstrabw , and This is due to the fact that RSIR produces a larger number of
Dijkstracomp . The link throughput was consequently lower than shorter paths than do the other algorithms, and this avoids
those produced by the four Dijkstra variations with, although traffic concentration and congestion. The variations of the
the mean delay and loss were lower, as previously discussed. Dijkstra’s algorithm, on the other hand, produce longer paths,
The link throughput was on average 4%, 21%, 3%, 17% and, which include low capacity links. The Dijkstraloss concentrates
at most, 12%, 27%, 15%, and 26% lower than those produced traffic on a smaller number of paths and consequently produces
by the Dijkstradelay , Dijkstraloss , Dijkstrabw , and Dijkstracomp the largest mean delay and loss values. The minimization of
algorithms, respectively. the reward function (Equation 1) allows the RL-based agent
The RL-agent computed all the routes for the GÉANT to learn, explore, and exploit the best paths for delivering the
network in 7.49s, while the centralized variations of Dijkstra’s offered load.
algorithm calculated all the routes for the same network in 5s; For future work, we plan to evaluate the impact of RSIR on
the difference (1.51s) is about 30% of the time the Dijkstra’s the throughput at the application layer as well as explore Deep
algorithm take but yet still a non-significant absolute value. Reinforcement Learning (DRL) to enhance RSIR decisions,
especially for large networks. Moreover, we intend to take [19] A. Forster and A. L. Murphy, “Froms: Feedback routing for optimizing
advantage of traffic predictions in the selection of paths. multiple sinks in wsn with reinforcement learning,” in International
Conference on Intelligent Sensors, Sensor Networks and Information
(ISSNIP). IEEE, 2007, pp. 371–376.
ACKNOWLEDGMENT [20] N. Kato, Z. M. Fadlullah, B. Mao, F. Tang, O. Akashi, T. Inoue, and
K. Mizutani, “The deep learning vision for heterogeneous network traffic
The authors would like to thank CAPES, the Sao Paulo control: Proposal, challenges, and future perspective,” IEEE Wireless
Communications, vol. 24, no. 3, pp. 146–153, 2016.
Research Foundation (FAPESP) under grant #19/03268-0 and [21] Z. M. Fadlullah, F. Tang, B. Mao, N. Kato, O. Akashi, T. Inoue, and
2015/24494-8. K. Mizutani, “State-of-the-art deep learning: Evolving machine intel-
ligence toward tomorrow’s intelligent network traffic control systems,”
IEEE Communications Surveys & Tutorials, vol. 19, no. 4, pp. 2432–
R EFERENCES 2455, 2017.
[22] S. Chaudhary and R. Johari, “Oruml: Optimized routing in wireless net-
[1] J. F. Kurose and K. W. Ross, Computer Networking: A Top-Down works using machine learning,” International Journal of Communication
Approach, 6th ed. Pearson, 2012. Systems, 2020.
[2] R. A. Guerin, A. Orda, and D. Williams, “Qos routing mechanisms [23] A. Serhani, N. Naja, and A. Jamali, “Aq-routing: mobility-, stability-
and ospf extensions,” in IEEE Global Telecommunications Conference, aware adaptive routing protocol for data routing in manet–iot systems,”
vol. 3, 1997, pp. 1903–1908. Cluster Computing, vol. 23, no. 1, pp. 13–27, 2020.
[3] A. A. Ajani, B. J. Ojuolape, A. A. Ahmed, T. Aduragba, and M. Ba-
[24] C. Wang, L. Zhang, Z. Li, and C. Jiang, “Sdcor: Software defined
logun, “Comparative performance evaluation of open shortest path first,
cognitive routing for internet of vehicles,” IEEE Internet of Things
ospf and routing information protocol, rip in network link failure and
Journal, vol. 5, no. 5, pp. 3513–3520, 2018.
recovery cases,” in International Conference on Electro-Technology for
[25] B. Mao, Z. M. Fadlullah, F. Tang, N. Kato, O. Akashi, T. Inoue, and
National Development (NIGERCON). IEEE, 2017, pp. 280–288.
K. Mizutani, “Routing or computing? the paradigm shift towards intel-
[4] J. S. Marcus, “The economic impact of internet traffic growth on network
ligent computer network packet transmission based on deep learning,”
operators,” Available at SSRN 2531782, 2014.
IEEE Transactions on Computers, vol. 66, no. 11, pp. 1946–1960, 2017.
[5] B. Mao, Z. M. Fadlullah, F. Tang, N. Kato, O. Akashi, T. Inoue, and
K. Mizutani, “A tensor based deep learning technique for intelligent [26] L. Yanjun, L. Xiaobo, and Y. Osamu, “Traffic engineering framework
packet routing,” in Global Communications Conference (GLOBECOM). with machine learning based meta-layer in software-defined networks,”
IEEE, 2017, pp. 1–6. in International Conference on Network Infrastructure and Digital
Content (IC-NIDC). IEEE, 2014, pp. 121–125.
[6] D. Kreutz, F. M. V. Ramos, P. E. Veríssimo, C. E. Rothenberg,
S. Azodolmolky, and S. Uhlig, “Software-defined networking: A com- [27] S. Sendra, A. Rego, J. Lloret, J. M. Jimenez, and O. Romero, “Including
prehensive survey,” Proceedings of the IEEE, vol. 103, no. 1, pp. 14–76, artificial intelligence in a routing protocol using software defined net-
2015. works,” in International Conference on Communications (ICC) Work-
[7] R. Boutaba, M. A. Salahuddin, N. Limam, S. Ayoubi, N. Shahriar, shops. IEEE, 2017, pp. 670–674.
F. Estrada-Solano, and O. M. Caicedo, “A comprehensive survey on [28] S.-C. Lin, I. F. Akyildiz, P. Wang, and M. Luo, “Qos-aware adaptive
machine learning for networking: evolution, applications and research routing in multi-layer hierarchical software defined networks: a rein-
opportunities,” Journal of Internet Services and Applications, vol. 9, forcement learning approach,” in International Conference on Services
no. 1, 2018. Computing (SCC). IEEE, 2016, pp. 25–33.
[8] A. Valadarsky, M. Schapira, D. Shahaf, and A. Tamar, “Learning to [29] R. Alvizu, S. Troia, G. Maier, and A. Pattavina, “Matheuristic with
route,” in Proceedings of the 16th ACM Workshop on Hot Topics in machine-learning-based prediction for software-defined mobile metro-
Networks, 2017, pp. 185–191. core networks,” Journal of Optical Communications and Networking,
[9] A. Rego, S. Sendra, J. M. Jimenez, and J. Lloret, “Ospf routing protocol vol. 9, no. 9, pp. D19–D30, 2017.
performance in software defined networks,” in International Conference [30] I. Martín, S. Troia, J. A. Hernández, A. Rodríguez, F. Musumeci,
on Software Defined Systems (SDS). IEEE, 2017, pp. 131–136. G. Maier, R. Alvizu, and Ó. G. de Dios, “Machine learning-based routing
[10] A. A. Khan, M. Hussain, M. Zafrullah, and M. S. Zia, “A convergence and wavelength assignment in software-defined optical networks,” IEEE
time optimization paradigm for ospf based networks through sdn spf Transactions on Network and Service Management, vol. 16, no. 3, pp.
protocol computer communications and networks (ccn)/delay tolerant 871–883, 2019.
networks,” in International Conference on Future Networks and Dis- [31] G. B. Figueiredo, N. L. S. D. Fonseca, and J. A. S. Monteiro, “A
tributed Systems (ICFNDS). ACM, 2017, p. 43. minimum interference routing algorithm with reduced computational
[11] D. Gopi, S. Cheng, and R. Huck, “Comparative analysis of sdn and complexity,” Comput. Networks, vol. 50, no. 11, pp. 1710–1732, 2006.
conventional networks using routing protocols,” in International Confer- [Online]. Available: https://doi.org/10.1016/j.comnet.2005.06.015
ence on Computer, Information and Telecommunication Systems (CITS). [32] D. D. Clark, C. Partridge, J. C. Ramming, and J. T. Wroclawski, “A
IEEE, 2017, pp. 108–112. knowledge plane for the internet,” in Proceedings of the 2003 conference
[12] J. Park, J. Hwang, and K. Yeom, “Nsaf: An approach for ensuring on Applications, technologies, architectures, and protocols for computer
application-aware routing based on network qos of applications in sdn,” communications, 2003, pp. 3–10.
Mobile Information Systems, 2019. [33] A. Mestres, A. Rodriguez-Natal, J. Carner, P. Barlet-Ros, E. Alarcón,
[13] C. Lin, K. Wang, and G. Deng, “A qos-aware routing in sdn hybrid M. Solé, V. Muntés-Mulero, D. Meyer, S. Barkai, M. J. Hibbett et al.,
networks,” Procedia Computer Science, vol. 110, pp. 242–249, 2017. “Knowledge-defined networking,” SIGCOMM Computer Communica-
[14] A. Shirmarz and A. Ghaffari, “An adaptive greedy flow routing algorithm tion Review, vol. 47, no. 3, pp. 2–10, 2017.
for performance improvement in software-defined network,” Interna- [34] S. Ayoubi, N. Limam, M. A. Salahuddin, N. Shahriar, R. Boutaba,
tional Journal of Numerical Modelling: Electronic Networks, Devices F. Estrada-Solano, and O. M. Caicedo, “Machine learning for cognitive
and Fields, vol. 33, no. 1, 2020. network management,” IEEE Communications Magazine, vol. 56, no. 1,
[15] Y.-C. Wang and S.-Y. You, “An efficient route management framework pp. 158–165, 2018.
for load balance and overhead reduction in sdn-based data center [35] J. Hyun and J. W.-K. Hong, “Knowledge-defined networking using
networks,” IEEE Transactions on Network and Service Management, in-band network telemetry,” in Asia-Pacific Network Operations and
vol. 15, no. 4, pp. 1422–1434, 2018. Management Symposium (APNOMS). IEEE, 2017, pp. 54–57.
[16] C.-C. Chuang, Y.-J. Yu, and A.-C. Pang, “Flow-aware routing and for- [36] S. Yan, A. Aguado, Y. Ou, R. Wang, R. Nejabati, and D. Simeonidou,
warding for sdn scalability in wireless data centers,” IEEE Transactions “Multilayer network analytics with sdn-based monitoring framework,”
on Network and Service Management, vol. 15, no. 4, pp. 1676–1691, Journal of Optical Communications and Networking, vol. 9, no. 2, pp.
2018. A271–A279, 2017.
[17] R. Masoudi and A. Ghaffari, “Software defined networks: A survey,” [37] A. I. Montoya-Munoz, D. M. Casas-Velasco, F. Estrada-Solano, A. Or-
Journal of Network and computer Applications, vol. 67, pp. 1–25, 2016. donez, and O. M. C. Rendon, “A yang model for a vertical sdn
[18] R. Arroyo-Valles, R. Alaiz-Rodriguez, A. Guerrero-Curieses, and J. Cid- management plane,” in Colombian Conference on Communications and
Sueiro, “Q-probabilistic routing in wireless sensor networks,” in In- Computing (COLCOM). IEEE, 2017, pp. 1–6.
ternational Conference on Intelligent Sensors, Sensor Networks and [38] F. Estrada-Solano, A. Ordonez, L. Z. Granville, and O. M. C. Rendon,
Information (ISSNIP). IEEE, 2007, pp. 1–6. “A framework for sdn integrated management based on a cim model and
a vertical management plane,” Computer Communications, vol. 102, pp. Daniela M. Casas-Velasco (GS’15) received the
150–164, 2017. bachelor’s degree in Electronics and Telecommuni-
[39] S. Achleitner, N. Bartolini, T. He, T. La Porta, and D. Z. Tootaghaj, cations engineering from the University of Cauca,
“Fast network configuration in software defined networking,” IEEE Popayán, Colombia, in 2017, and the M.Sc. degree
Transactions on Network and Service Management, vol. 15, no. 4, pp. in Computing Science from the University of Camp-
1249–1263, 2018. inas, Campinas, Brazil in 2020. She is currently
[40] ONF, “Sdn architecture 1.1, onf tr-521,” Architectural Framework: ONF, pursuing her Ph.D. degree in Computer Science at
2016. the University of Campinas. Her research interests
[41] Z. Mammeri, “Reinforcement learning based routing in networks: Re- include network management, SDN, wireless com-
view and classification of approaches,” IEEE Access, vol. 7, pp. 55 916– munications, machine learning and related works to
55 950, 2019. Telecommunication.
[42] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction-
second edition. Cambridge, Massachusetts, 2018, vol. 1, no. 1.
[43] L. Al Shalabi and Z. Shaaban, “Normalization as a preprocessing engine
for data mining and the approach of preference matrix,” in International
conference on dependability of computer systems (DepCoS). IEEE, Oscar M. Caicedo (GS’11–M’15–SM’20) received
2006, pp. 207–214. the bachelor’s degree in electronics and telecom-
[44] A. D. Tijsma, M. M. Drugan, and M. A. Wiering, “Comparing ex- munications engineering and the M.Sc. degree in
ploration strategies for q-learning in random stochastic mazes,” in telematics engineering from the University of Cauca,
Symposium Series on Computational Intelligence (SSCI). IEEE, 2016, Popayán, Colombia, in 2001 and 2006, respectively,
pp. 1–8. and the Ph.D. degree in computer science from the
[45] S. Koenig and R. G. Simmons, “Complexity analysis of real-time Federal University of Rio Grande do Sul, Porto Ale-
reinforcement learning,” in AAAI, 1993, pp. 99–107. gre, Brazil, in 2015. He is currently a Full Professor
[46] D. Fan and P. Shi, “Improvement of dijkstra’s algorithm and its appli- with the Department of Telematics, University of
cation in route planning,” in 2010 Seventh International Conference on Cauca, where he is a member of the Telematics
Fuzzy Systems and Knowledge Discovery, vol. 4, 2010, pp. 1901–1904. Engineering Group. His research interests include
[47] P. T. Kirstein, “European international academic networking: A 20 year network and service management, network virtualization, and software-
perspective.” in TERENA Networking Conference, 2004. defined networking.
[48] R. L. S. De Oliveira, C. M. Schweitzer, A. A. Shinoda, and L. R.
Prete, “Using mininet for emulation and prototyping software-defined
networks,” in Colombian Conference on Communications and Comput-
ing (COLCOM). IEEE, 2014, pp. 1–6.
[49] A. L. Stancu, S. Halunga, A. Vulpe, G. Suciu, O. Fratu, and E. C.
Popovici, “A comparison between several software defined networking Nelson L. S. da Fonseca (M’88–SM’01) received
controllers,” in International Conference on Telecommunication in Mod- the Ph.D. degree in computer engineering from the
ern Satellite, Cable and Broadcasting Service (TELSIKS). IEEE, 2015, University of Southern California, Los Angeles, CA,
pp. 223–226. USA, in 1994. He is currently a Full Professor
[50] W. McKinney, “Data structures for statistical computing in python,” in with the Institute of Computing, State University
SciPy, S. van der Walt and J. Millman, Eds., 2010, pp. 51–56. of Campinas, Campinas, Brazil. He has authored
[51] D. M. Casas-Velasco, O. M. Caicedo, and N. L. S. or coauthored more than 400 papers and super-
Da Fonseca. (2020) RSIR: Reinforcement learning and sdn intelligent vised more than 70 graduate students. Dr. Fonseca
routing. [Online]. Available: https://github.com/danielaCasasv/RSIR- is currently the Vice President of Technical and
Reinforcement-Learning-and-SDN-Intelligent-Routing Educational Activities of the IEEE Communications
[52] C. Werle, S. Mies, and M. Zitterbart, “On benchmarking routing Society (ComSoc). He served as the ComSoc Vice
protocols,” in International Conference on Networks (ICON). IEEE, President of Publications and of Member Relations, and the Director of
2011, pp. 305–310. Conference Development, of Latin America Region, and of On-Line Services.
[53] L. Liao and V. C. Leung, “Lldp based link latency monitoring in software He is the former Editor-in-Chief of IEEE Communications Surveys and
defined networks,” in International Conference on Network and Service Tutorials. He was a recipient of the 2012 IEEE ComSoc Joseph LoCicero
Management (CNSM). IEEE, 2016, pp. 330–335. Award for Exemplary Service to Publications, the Medal of the Chancellor of
[54] Z. Shu, J. Wan, J. Lin, S. Wang, D. Li, S. Rho, and C. Yang, the University of Pisa in 2007, and the Elsevier Computer Network Journal
“Traffic engineering in software-defined networking: Measurement and Editor of Year 2001 Award.
management,” IEEE access, vol. 4, pp. 3246–3256, 2016.
[55] Ryu Project Team, “Ryu application API,” [Ac-
cessed: June 16, 2016]. [Online]. Available:
https://ryu.readthedocs.io/en/latest/ryu_app_api.html
[56] E. F. Castillo, L. Z. Granville, A. Ordonez, and O. M. C. Rendon, “Ipro:
An approach for intelligent sdn monitoring,” Computer Networks, p.
107108, 2020.
[57] S. Srivastava, S. Anmulwar, A. M. Sapkal, T. Batra, A. K. Gupta, and
V. Kumar, “Comparative study of various traffic generator tools,” Recent
Advances in Engineering and Computational Sciences (RAECS), pp. 1–
6, March 2014.
[58] S. Uhlig, B. Quoitin, J. Lepropre, and S. Balon, “Providing public
intradomain traffic matrices to the research community,” SIGCOMM
Computer Communication Review, vol. 36, no. 1, pp. 83–86, 2006.
View publication stats

Intelligent Routing Based On Reinforcement Learning For Software-Defined Networking

Uploaded by

Copyright:

Available Formats

Intelligent Routing Based On Reinforcement Learning For Software-Defined Networking

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Intelligent Routing Based On Reinforcement Learning For Software-Defined Networking

Uploaded by

Copyright:

Available Formats

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

Intelligent Routing Based on Reinforcement Learning for Software-Deﬁned

Article in IEEE Transactions on Network and Service Management · March 2021

Daniela Maria Casas-Velasco Oscar Mauricio Caicedo Rendon

SEE PROFILE SEE PROFILE

Machine Learning for Networking View project

SDN Management View project

The user has requested enhancement of the downloaded file.

Intelligent Routing based on Reinforcement

Table II: Related Work

C: Centralized and D: Distributed

module performs the detection of topology changes (due ˆ d,

a target transmission rate per connection. The experiment

GÉANT topology to generate the traffic between pairs of 1.15

at different times of the day for 4 months in 2006. We tagged 1.05

of randomly exploring the Action Space. For  = 0.6 and

a probability of exploitation value close to 1, the RL-agent 5

Generated traffic (Mbps)

200 VI. C ONCLUSIONS AND FUTURE WORK

You might also like

of randomly exploring the Action Space. For = 0.6 and