RL For CO Survey

Download as pdf or txt
Download as pdf or txt
You are on page 1of 15

Computers & Operations Research 134 (2021) 105400

Contents lists available at ScienceDirect

Computers and Operations Research


journal homepage: www.elsevier.com/locate/cor

Reinforcement learning for combinatorial optimization: A survey✩


Nina Mazyavkina a ,∗, Sergey Sviridov b , Sergei Ivanov c,a , Evgeny Burnaev a
a
Skolkovo Institute of Science and Technology, Russia
b
Optimate AI, Canada
c
Criteo AI Lab, France

ARTICLE INFO ABSTRACT

Keywords: Many traditional algorithms for solving combinatorial optimization problems involve using hand-crafted
Reinforcement learning heuristics that sequentially construct a solution. Such heuristics are designed by domain experts and may often
Operations research be suboptimal due to the hard nature of the problems. Reinforcement learning (RL) proposes a good alternative
Combinatorial optimization
to automate the search of these heuristics by training an agent in a supervised or self-supervised manner. In
Value-based methods
this survey, we explore the recent advancements of applying RL frameworks to hard combinatorial problems.
Policy-based methods
Our survey provides the necessary background for operations research and machine learning communities and
showcases the works that are moving the field forward. We juxtapose recently proposed RL methods, laying
out the timeline of the improvements for each problem, as well as we make a comparison with traditional
algorithms, indicating that RL models can become a promising direction for solving combinatorial problems.

1. Introduction includes the case of decision problems, when the solution is binary
(or, more generally, multi-class), by associating a higher cost for the
Optimization problems are concerned with finding optimal config- wrong answer than for the right one. One common example of a
uration or ‘‘value’’ among different possibilities, and they naturally combinatorial problem is Traveling Salesman Problem (TSP). The goal
fall into one of the two buckets: configurations with continuous and is to provide the shortest route that visits each vertex and returns to
with discrete variables. For example, finding a solution to a convex the initial endpoint, or, in other words, to find a Hamiltonian circuit
programming problem is a continuous optimization problem, while 𝐻 with minimal length in a fully-connected weighted graph. In this
finding the shortest path among all paths in a graph is a discrete case, a set of elements is defined by all Hamiltonian circuits, i.e. 𝑉 =
optimization problem. Sometimes the line between the two cannot be
{all Hamiltonian paths}, and the cost associated with each Hamiltonian
drawn that easily. For example, the linear programming task in the
circuit is the sum of the weights 𝑤(𝑒) of the edges 𝑒 on the circuit,
continuous space can be regarded as a discrete combinatorial problem ∑
i.e. 𝑓 (𝐻) = 𝑒∈𝐻 𝑤(𝑒). Another example of CO problem is Mixed-
because its solution lies in a finite set of vertices of the convex polytope
Integer Linear Program (MILP), for which the objective is to minimize
as it has been demonstrated by Dantzig’s algorithm (Dantzig and Thapa,
1997). Conventionally, optimization problems in the discrete space are 𝑐 ⊤ 𝑥 for a given vector 𝑐 ∈ R𝑑 such that the vector 𝑥 ∈ Z𝑑 satisfies the
called combinatorial optimization (CO) problems and, typically, have constraints 𝐴𝑥 ≤ 𝑏 for the parameters 𝐴 and 𝑏.
different types of solutions comparing to the ones in the continuous Many CO problems are NP-hard and do not have an efficient
space. One can formulate a CO problem as follows. polynomial-time solution. As a result, many algorithms that solve these
problems either approximately or heuristically have been designed.
Definition 1. Let 𝑉 be a set of elements and 𝑓 ∶ 𝑉 ↦ R be a cost One of the emerging trends of the last years is to solve CO problems by
function. Combinatorial optimization problem aims to find an optimal training a machine learning (ML) algorithm. For example, we can train
value of the function 𝑓 and any corresponding optimal element that ML algorithm on a dataset of already solved TSP instances to decide on
achieves that optimal value on the domain 𝑉 . which node to move next for new TSP instances. A particular branch
of ML that we consider in this survey is called reinforcement learning
Typically the set 𝑉 is finite, in which case there is a global op-
(RL) that for a given CO problem defines an environment and the agent
timum, and, hence, a trivial solution exists for any CO problem by
that acts in the environment constructing a solution.
comparing values of all elements 𝑣 ∈ 𝑉 . Note that Definition 1 also

✩ This work was partially supported by the Russian Foundation for Basic Research grants 20-01-00203 and 21-51-12005 NNIO_a.
∗ Corresponding author.
E-mail address: nina.mazyavkina@skoltech.ru (N. Mazyavkina).

https://doi.org/10.1016/j.cor.2021.105400
Received 15 August 2020; Received in revised form 15 March 2021; Accepted 24 May 2021
Available online 29 May 2021
0305-0548/© 2021 Elsevier Ltd. All rights reserved.
N. Mazyavkina et al. Computers and Operations Research 134 (2021) 105400

In order to apply RL to CO, the problem is modeled as a sequential problems including recurrent neural networks, graph neural networks,
decision-making process, where the agent interacts with the environ- attention-based networks, and multi-layer perceptrons.
ment by performing a sequence of actions in order to find a solution. To sum up, a pipeline for solving CO problem with RL is presented
Markov Decision Process (MDP) provides a widely used mathematical in Fig. 1. A CO problem is first reformulated in terms of MDP, i.e., we
framework for modeling this type of problems (Bellman, 1957). define the states, actions, and rewards for a given problem. We then
define an encoder of the states, i.e. a parametric function that encodes
Definition 2. MDP can be defined as a tuple 𝑀 = ⟨𝑆, 𝐴, 𝑅, 𝑇 , 𝛾, 𝐻⟩,
the input states and outputs a numerical vector (Q-values or proba-
where
bilities of each action). The next step is the actual RL algorithm that
• 𝑆 - state space s𝑡 ∈ 𝑆. State space for combinatorial optimization determines how the agent learns the parameters of the encoder and
problems in this survey is typically defined in one of two ways. makes the decisions for a given MDP. After the agent has selected an
One group of approaches constructs solutions incrementally de- action, the environment moves to a new state and the agent receives
fine it as a set of partial solutions to the problem (e.g. a partially a reward for the action it has made. The process then repeats from a
constructed path for TSP problem). The other group of methods new state within the allocated time budget. Once the parameters of the
starts with a suboptimal solution to a problem and iteratively model have been trained, the agent is capable of searching the solutions
improves it (e.g. a suboptimal tour for TSP). for unseen instances of the problem.
• 𝐴 - action space a𝑡 ∈ 𝐴. Actions represent addition to partial or Our work is motivated by the recent success in the application of
changing complete solution (e.g. changing the order of nodes in
the techniques and methods of the RL field to solve CO problems.
a tour for TSP);
Although many practical combinatorial optimization problems can be,
• 𝑅 - reward function is a mapping from states and actions into real
in principle, solved by reinforcement learning algorithms with relevant
numbers 𝑅 ∶ 𝑆 × 𝐴 → ←← R. Rewards indicate how action chosen
literature existing in the operations research community, we will focus
in particular state improves or worsens a solution to the problem
on RL approaches for CO problems. This survey covers the most recent
(i.e. a tour length for TSP);
• 𝑇 - transition function 𝑇 (s𝑡+1 |s𝑡 , a𝑡 ) that governs transition dynam- papers that show how reinforcement learning algorithms can be applied
ics from one state to another in response to action. In combinato- to reformulate and solve some of the canonical optimization problems,
rial optimization setting transition dynamics is usually determin- such as Traveling Salesman Problem (TSP), Maximum Cut (Max-Cut)
istic and known in advance; problem, Maximum Independent Set (MIS), Minimum Vertex Cover
• 𝛾 - scalar discount factor, 0 < 𝛾 ≤ 1. Discount factor encourages (MVC), Bin Packing Problem (BPP).
the agent to account more for short-term rewards; Related work. Some of the recent surveys also describe the in-
• 𝐻 - horizon, that defines the length of the episode, where episode tersection of machine learning and combinatorial optimization. This
is defined as a sequence {𝑠𝑡 , 𝑎𝑡 , 𝑠𝑡+1 , 𝑎𝑡+1 , 𝑠𝑡+2 , …}𝐻
𝑡=0
. For meth- way a comprehensive survey by Bengio et al. (2020) has summarized
ods that construct solutions incrementally episode length is de- the approaches that solve CO problems from the perspective of the
fined naturally by number of actions performed until solution is general ML, and the authors have discussed the possible ways of the
found. For iterative methods some artificial stopping criteria are combination of the ML heuristics with the existing off-the-shelf solvers.
introduced. Moreover, the work by Zhou et al. (2018), which is devoted to the de-
The goal of an agent acting in Markov Decision Process is to find a scription and possible applications of GNNs, has described the progress
policy function 𝜋(𝑠) that maps states into actions. Solving MDP means on the CO problems’ formulation from the GNN perspective in one of its
finding the optimal policy that maximizes the expected cumulative sections. Finally, the more recent surveys by Vesselinova et al. (2020)
discounted sum of rewards: and Guo et al. (2019), describe the latest ML approaches to solving the

𝐻 CO tasks, in addition to possible applications of such methods. We note
𝜋 ∗ = argmax E[ 𝛾 𝑡 𝑅(𝑠𝑡 , 𝑎𝑡 )], (1) that our survey is complementary to the existing ones as we focus on
𝜋 𝑡=0 RL approaches, provide necessary background and classification of the
Once MDP has been defined for a CO problem we need to decide RL models, and make a comparison between different RL methods and
how the agent would search for the optimal policy 𝜋 ∗ . Broadly, there existing solutions.
are two types of RL algorithms: Paper organization. The remainder of this survey is organized as
• Value-based methods first compute the value action function follows. In Section 2, we provide a necessary background including the
𝑄𝜋 (𝑠, 𝑎) as the expected reward of a policy 𝜋 given a state 𝑠 and formulation of CO problems, different encoders, and RL algorithms that
taking an action 𝑎. Then the agent’s policy corresponds to picking are used for solving CO with RL. In Section 3 we provide a classification
an action that maximizes 𝑄𝜋 (𝑠, 𝑎) for a given state. The main of the existing RL–CO methods based on the popular design choices
difference between value-based approaches is in how to estimate such as the type of RL algorithm. In Section 4 we describe the recent
𝑄𝜋 (𝑠, 𝑎) accurately and efficiently. RL approaches for the specific CO problems, providing the details
• Policy-based methods directly model the agent’s policy as a para- about the formulated MDPs as well as their influence on other works.
metric function 𝜋𝜃 (𝑠). By collecting previous decisions that the In Section 5 we make a comparison between the RL–CO works and
agent made in the environment, also known as experience, we the existing traditional approaches. We conclude and provide future
can optimize the parameters 𝜃 by maximizing the final reward directions in Section 6.
(1). The main difference between policy-based methods is in opti-
mization approaches for finding the function 𝜋𝜃 (𝑠) that maximizes
the expected sum of rewards. 2. Background

As can be seen, RL algorithms depend on the functions that take


as input the states of MDP and outputs the actions’ values or actions. In this section, we provide definitions of combinatorial problems,
States represent some information about the problem such as the given state-of-the-art algorithms and heuristics that solve these problems.
graph or the current tour of TSP, while Q-values or actions are numbers. We also describe machine learning models that encode states of CO
Therefore an RL algorithm has to include an encoder, i.e., a function problems for an RL agent. Finally, we categorize popular RL algorithms
that encodes a state to a number. Many encoders were proposed for CO that have been employed recently for solving CO problems.

2
N. Mazyavkina et al. Computers and Operations Research 134 (2021) 105400

Fig. 1. Solving a CO problem with the RL approach requires formulating MDP. The environment is defined by a particular instance of CO problem (e.g. Max-Cut problem).
States are encoded with a neural network model (e.g. every node has a vector representation encoded by a graph neural network). The agent is driven by an RL algorithm
(e.g. Monte-Carlo Tree Search) and makes decisions that move the environment to the next state (e.g. removing a vertex from a solution set).

2.1. Combinatorial optimization problems exact or approximate solutions to TSP. Among them, Concorde (Ap-
plegate et al., 2006) is a specialized TSP solver that uses a combina-
We start by considering mixed-integer linear programs (MILP) – a tion of cutting-plane algorithms with a branch-and-bound approach.
constrained optimization problem, to which many practical applica- Similarly, an extension of the Lin–Kernighan–Helsgaun TSP solver
tions can be reduced. Several industrial optimizers (e.g. Cplex, 2009; (LKH3) (Helsgaun, 2017), which improves the Lin–Kernighan algo-
Gleixner et al., 2017; Gurobi Optimization, 2020; The Sage Developers, rithm (Lin and Kernighan, 1973), is a tour improvement method
2020; Makhorin, 2012; Schrage, 1986) exist that use a branch-and- that iteratively decides which edges to rewire to decrease the tour
bound technique to solve the MILP instance. length. More generic solvers that avoid local optima exist such as
OR-Tools (Perron and Furnon, 2019) that tackle vehicle routing prob-
Definition 3 (Mixed-Integer Linear Program (MILP) (Wolsey, 1998)). A lems through local search algorithms and metaheuristics. In addi-
mixed-integer linear program is an optimization problem of the form tion to solvers, many heuristic algorithms have been developed, such
{ } as Christofides–Serdyukov algorithm (Christofides, 1976; van Bevern
arg min 𝐜⊤ 𝐱 | 𝐀𝐱 ≤ 𝐛, 𝟎 ≤ 𝐱, 𝐱 ∈ Z𝑝 × R𝑛−𝑝 , and Slugina, 2020), the Lin–Kernighan–Helsgaun heuristic (Helsgaun,
𝐱
2000), 2-OPT local search (Mersmann et al., 2012). (Applegate et al.,
where 𝐜 ∈ R𝑛 is the objective coefficient vector, 𝐀 ∈ R𝑚×𝑛 is the 2006) provides an extensive overview of various approaches to TSP.
constraint coefficient matrix, 𝐛 ∈ R𝑚 is the constraint vector, and 𝑝 ≤ 𝑛
is the number of integer variables. Definition 5 (Maximum Cut Problem (Max-Cut)). Given a graph 𝐺 =
(𝑉 , 𝐸), find a subset of vertices 𝑆 ⊂ 𝑉 that maximizes a cut 𝐶(𝑆, 𝐺) =
Next, we provide formulations of the combinatorial optimization ∑
problems, their time complexity, and the state-of-the-art algorithms for 𝑖∈𝑆,𝑗∈𝑉 ⧵𝑆 𝑤𝑖𝑗 where 𝑤𝑖𝑗 ∈ 𝑊 is the weight of the edge-connecting
vertices 𝑖 and 𝑗.
solving them.
Max-Cut solutions have found numerous applications in real-life
Definition 4 (Traveling Salesman Problem (TSP)). Given a complete problems including protein folding (Perdomo-Ortiz et al., 2012), finan-
weighted graph 𝐺 = (𝑉 , 𝐸), find a tour of minimum total weight, i.e. a cial portfolio management (Elsokkary et al., 2017), and finding the
cycle of minimum length that visits each node of the graph exactly ground state of the Ising Hamiltonian in physics (Barahona, 1982).
once. Max-Cut is an NP-complete problem (Karp, 1972), and, hence, does not
have a known polynomial-time algorithm. Approximation algorithms
TSP is a canonical example of a combinatorial optimization prob- exist for Max-Cut, including deterministic 0.5-approximation (Mitzen-
lem, which has found applications in planning, data clustering, genome macher and Upfal, 2005; Gonzalez, 2007) and randomized 0.878-
sequencing, etc. (Applegate et al., 2006). TSP problem is NP-hard (Pa- approximation (Goemans and Williamson, 1995). Industrial solvers can
padimitriou and Steiglitz, 1998), and many exact, heuristic, and ap- be used to find a solution by applying the branch-and-bound routines.
proximation algorithms have been developed, in order to solve it. In particular, Max-Cut problem can be transformed into a quadratic un-
The best known exact algorithm is the Held–Karp algorithm (Held constrained binary optimization problem and solved by CPLEX (Cplex,
and Karp, 1962). Published in 1962, it solves the problem in time 2009), which takes within an hour for graph instances with hundreds
𝑂(𝑛2 2𝑛 ), which has not been improved in the general setting since of vertices (Barrett et al., 2020). For larger instances several heuristics
then. TSP can be formulated as a MILP instance (Dantzig et al., using the simulated annealing technique have been proposed that could
1954; Miller et al., 1960), which allows one to apply MILP solvers, scale to graphs with thousands of vertices (Yamamoto et al., 2017;
such as Gurobi (Gurobi Optimization, 2020), in order to find the Tiunov et al., 2019; Leleu et al., 2019).

3
N. Mazyavkina et al. Computers and Operations Research 134 (2021) 105400

Definition 6 (Bin Packing Problem (BPP)). Given a set 𝐼 of items, a


size 𝑠(𝑖) ∈ Z+ for each 𝑖 ∈ 𝐼, and a positive integer bin capacity 𝐵, find
a partition of 𝐼 into disjoint sets 𝐼1 , … , 𝐼𝐾 such that the sum of the sizes
of the items in each 𝐼𝑗 is less or equal than 𝐵 and 𝐾 has the smallest
possible value.

There are other variants of BPP such as 2D, 3D packing, packing


with various surface area, packing by weights, and others (Wu et al.,
2010). This CO problem has found its applications in many domains
such as resource optimization, logistics, and circuit design (Kellerer
et al., 2004). BPP is an NP-complete problem with many approximation
algorithms proposed in the literature. First-fit decreasing (FFD) and Fig. 2. The scheme for a pointer network. Element ‘‘B’’ in the sequence first computes
similarity scores to all other elements. Next we encode the representation of ‘‘B’’ using
best-fit decreasing (BFD) are two simple approximation algorithms that
the element with maximum value (‘‘A’’ in this case, dashed). This process is then
first sort the items in the decreasing order of their costs and then repeated for other elements in the sequence.
assign each item to the first (for FFD) or the fullest (for BFD) bin
that it fits into. Both FFD and BFD run in 𝑂(𝑛 log 𝑛) time and have
11∕9 asymptotic performance guarantee (Korte et al., 2012). Among
In contrast, the evolutionary algorithms maintain several independent
exact approaches, one of the first attempts has been the Martello–Toth
sets at the current iterations which are then merged or pruned based on
algorithm that works under the branch-and-bound paradigm (Martello
some fitness criteria (Lamm et al., 2015; Borisovsky and Zavolovskaya,
and Toth, 1990a,b). In addition, several recent improvements have
2003; Back and Khuri, 1994). Hybrid approaches exist that combine
been proposed (Schreiber and Korf, 2013; Korf, 2003) which can run on
the evolutionary algorithms with the local search, capable to solve
instances with hundreds of items. Alternatively, BPP can be formulated
instances with hundreds of thousands of vertices (Lamm et al., 2016).
as a MILP instance (Wu et al., 2010; Chen et al., 1995) and solved using
In order to approach the outlined problems with reinforcement
standard MILP solvers such as Gurobi (Gurobi Optimization, 2020) or
learning, we must represent the graphs, involved in the problems,
CPLEX (Cplex, 2009).
as vectors that can be further provided as an input to a machine
learning algorithm. Next, we discuss different approaches for learning
Definition 7 (Minimum Vertex Cover (MVC)). Given a graph 𝐺 =
the representations of these problems.
(𝑉 , 𝐸), find a subset of nodes 𝑆 ⊂ 𝑉 , such that every edge is covered,
i.e. (𝑢, 𝑣) ∈ 𝐸 ⟺ 𝑢 ∈ 𝑆 or 𝑣 ∈ 𝑆, and |𝑆| is minimized.
2.2. Encoders
Vertex cover optimization is a fundamental problem with applica-
tions to computational biochemistry (Lancia et al., 2001) and computer In order to process the input structure 𝑆 (e.g. graphs) of CO prob-
network security (Filiol et al., 2007). There is a naïve approximation lems, we must present a mapping from 𝑆 to a 𝑑-dimensional space
algorithm with a factor 2, which works by adding both endpoints of an R𝑑 . We call such a mapping an encoder as it encodes the original
arbitrary edge to the solution and then removing these endpoints from input space. The encoders vary depending on the particular type of the
the graph (Papadimitriou and Steiglitz, space 𝑆 but there are some common architectures that researchers have
( √ 1998).)A better approximation
algorithm with a factor of 2 − 𝛩 1∕ log |𝑉 | is known (Karakostas, developed over the last years to solve CO problems.
2009), however,√it has been shown that MVC cannot be approximated The first frequently used architecture is a recurrent neural network
within a factor 2 − 𝜀 for any 𝜀 > 0 (Dinur and Safra, 2005; Subhash (RNN). RNNs can operate on sequential data, encoding each element
et al., 2018). The problem can be formulated as an integer linear of the sequence into a vector. In particular, the RNN is composed of
∑ the block of parameters that takes as an input the current element of
program (ILP) by minimizing 𝑣∈𝑉 𝑐(𝑣)𝑥𝑣 , where 𝑥𝑣 ∈ {0, 1} denotes
whether a node 𝑣 with a weight 𝑐(𝑣) is in a solution set, subject to the sequence and the previous output of the RNN block and outputs a
𝑥𝑢 + 𝑥𝑣 ≥ 1. Solvers such as CPLEX (Cplex, 2009) or Gurobi (Gurobi Op- vector that is passed to the next element of the sequence. For example,
timization, 2020) can be used to solve the ILP formulations with in the case of TSP, one can encode a tour of TSP by applying RNN to
hundreds of thousands of nodes (Akiba and Iwata, 2016). the current node (e.g. initially represented by a constant 𝑑-dimensional
vector) and the output of the RNN on the previous node of the tour.
Definition 8 (Maximum Independent Set (MIS)). Given a graph 𝐺(𝑉 , 𝐸) One can stack multiple blocks of RNNs together making the neural
find a subset of vertices 𝑆 ⊂ 𝑉 , such that no two vertices in 𝑆 are network deep. Popular choices of RNN blocks are a Long Short-Term
connected by an edge of 𝐸, and |𝑆| is minimized. Memory (LSTM) unit (Hochreiter and Schmidhuber, 1997) and Gated
Recurrent Unit (GRU) (Cho et al., 2014), which tackle the vanishing
MIS is a popular CO problem with applications in classification gradient problem (Goodfellow et al., 2016).
theory, molecular docking, recommendations, and more (Feo et al., One of the fundamental limitations of RNN models is related to the
1994; Gardiner et al., 2000; Agrawal et al., 1996). As such the ap- modeling of the long-range dependencies: as the model takes the output
proaches of finding the solutions for this problem have received a lot of the last time-step it may ‘‘forget’’ the information from the previous
of attention from the academic community. It is easy to see that the elements of the sequence. Attention models fix this by forming a connec-
complement of an independent set in a graph 𝐺 is a vertex cover in tion not just to the last input element, but to all input elements. Hence,
𝐺 and a clique in a complement graph 𝐺, ̄ hence, the solutions to a the output of the attention model depends on the current element of
minimum vertex cover in 𝐺 or a maximum clique in 𝐺 can be applied the sequence and all previous elements of the sequence. In particular,
to solve the MIS problem. The running time of the brute-force algorithm similarity scores (e.g. dot product) are computed between the input
is 𝑂(𝑛2 2𝑛 ), which has been improved by Tarjan and Trojanowski (1977) element and each of the previous elements, and these scores are used
to 𝑂(2𝑛∕3 ), and recently to the best known bound 𝑂(1.1996𝑛 ) with to determine the weights of the importance of each of the previous
polynomial space (Xiao and Nagamochi, 2017). To cope with medium elements to the current element. Attention models has recently gained
and large instances of MIS several local search and evolutionary al- the superior performance on language modeling tasks (e.g. language
gorithms have been proposed. The local search algorithms maintain translation) (Vaswani et al., 2017) and have been applied to solving
a solution set, which is iteratively updated by adding and removing CO problems (e.g. for building incrementally a tour for TSP).
nodes that improve the current objective value (Andrade et al., 2008; Note that the attention model relies on modeling dependencies
Katayama et al., 2005; Hansen et al., 2004; Pullan and Hoos, 2006). between each pair of elements in the input structure, which can be

4
N. Mazyavkina et al. Computers and Operations Research 134 (2021) 105400

Fig. 3. A classification of reinforcement learning methods.

inefficient if there are only few relevant dependencies. One simple Broadly, the RL algorithms can be split into the model-based and
extension of the attention model is a pointer network (PN) (Vinyals et al., model-free categories (Fig. 3).
2015). Instead of using the weights among all pairs for computation
of the influence of each input element, the pointer networks use the • Model-based methods focus on the environments, where transition
weights to select a single input element that will be used for encoding. functions are known or can be learned, and can be utilized by
For example, in Fig. 2 the element ‘‘A’’ has the highest similarity to the algorithm when making decisions. This group includes Monte-
the element ‘‘B’’, and, therefore, it is used for computation of the Carlo Tree Search (MCTS) algorithms such as AlphaZero (Silver
representation of element ‘‘B’’ (unlike attention model, in which case et al., 2016) and MuZero (Schrittwieser et al., 2020).
the elements ‘‘C’’ and ‘‘D’’ are also used). • Model-free methods do not rely on the availability of the transition
Although these models are general enough to be applied to various functions of the environment and utilize solely the experience
spaces 𝑆 (e.g. points for TSP), many CO problems studied in this collected by the agent.
paper are associated with the graphs. A natural continuation of the
attention models to the graph domain is a graph neural network (GNN). Furthermore, model-free methods can be split into two big fami-
Initially, the nodes are represented by some vectors (e.g. constant unit lies of RL algorithms — policy-based and value-based methods. This
vectors). Then, each node’s representation is updated depending on partition is motivated by the way of deriving a solution of an MDP.
the local neighborhood structure of this node. In the most common In the case of policy-based methods, a policy is approximated directly,
message-passing paradigms, adjacent nodes exchange their current rep- while value-based methods focus on approximating a value function,
resentations in order to update them in the next iteration. One can which is a measure of the quality of the policy for some state–action
see this framework as a generalization of the attention model, where pair in the given environment. Additionally, there are RL algorithms
the elements do not attend to all of the other elements (forming a that combine policy-based methods with value-based methods. The
fully-connected graph), but only to elements that are linked in the type of methods that utilize such training procedure is called actor–
graph. Popular choices of GNN models include a Graph Convolutional critic methods (Sutton et al., 2000; Mnih et al., 2016). The basic
Network (GCN) (Kipf and Welling, 2017), a Graph Attention Net- principle behind these algorithms is for the critic model to approximate
work (GAT) (Veličković et al., 2018), a Graph Isomorphism Network the value function, and for the actor model to approximate policy.
(GIN) (Xu et al., 2019), Structure-to-Vector Network (S2V) (Dai et al., Usually to do this, both actor and critic, use the policy and value-
2016). based RL, mentioned above. This way, the critic provides the measure
While there are many intrinsic details about all of these models, of how good the action taken by the actor has been, which allows to
at a high level it is important to understand that all of them are the appropriately adjust the learnable parameters for the next train step.
differentiable functions optimized by the gradient descent that return Next, we formally describe the value-based, policy-based, and MCTS
the encoded vector representations, which next can be used by the RL approaches and the corresponding RL algorithms that have been used
agent. to solve CO problems.

2.3. Reinforcement learning algorithms 2.3.1. Value-based methods


As mentioned earlier, the main goal of all reinforcement learning
In the introduction Section 1 we gave the definitions of an MDP, methods is to find a policy, which would consistently allow the agent
which include the states, actions, rewards, and transition functions. We to gain a lot of rewards. Value-based reinforcement learning methods
also explained what the policy of an agent is and what is the optimal focus on finding such policy through the approximation of a value
policy. Here we will deep-dive into the RL algorithms that search for function 𝑉 (𝑠) and an action-value function 𝑄(𝑠, 𝑎). In this section,
the optimal policy of an MDP. we will define both of these functions, which value and action-value

5
N. Mazyavkina et al. Computers and Operations Research 134 (2021) 105400

functions can be called optimal, and how can we derive the optimal is not needed. For each state 𝑠 we can easily find such action 𝑎 that
policy, knowing the optimal value functions. maximizes the action-function, as in order to do that we just need to
compute 𝑄∗ (𝑠, 𝑎). This way, we do not need to know any information
Definition 9. Value function of a state 𝑠 is the expectation of the future about the rewards and values in the following states 𝑠′ in contrast with
discounted rewards, when starting from the state 𝑠 and following some the value function.
policy 𝜋: Therefore, in the case of value-based methods, in order to find the
[∞ ] optimal policy, we need to find the optimal value functions. Notably, it

𝑉 𝜋 (𝑠) = E 𝛾 𝑡 𝑟(𝑠𝑡 )|𝜋, 𝑠0 = 𝑠 (2) is possible to explicitly solve the Bellman equation, i.e. find the optimal
𝑡=0 value function, but only in the case when the transition function is
The notation 𝑉 𝜋 here and in the following sections means that the known. In practice, it is rarely the case, so we need some methods to
value function 𝑉 is defined with respect to the policy 𝜋. It is also approximate the Bellman’s equation solution.
important to note, that the value of a terminal state in the case of a
finite MDP equals 0. • Q-learning. One of the popular representatives of the approx-
At the same time, it can be more convenient to think of the value imate value-based methods is Q-learning (Watkins and Dayan,
function as the function depending not only on the state but also on 1992) and its deep variant Deep Q-learning (Mnih et al., 2015). In
the action. Q-learning, the action-value function 𝑄(𝑠, 𝑎) is iteratively updated
by learning from the collected experiences of the current policy.
Definition 10. Action-value function 𝑄(𝑠, 𝑎) is the expectation of the It has been shown in (Sutton, 1988, Theorem 3) that the function
future discounted rewards, when starting from the state 𝑠, taking the updated by such a rule converges to the optimal value function.
action 𝑎 and then following some policy 𝜋: • DQN. With the rise of Deep Learning, neural networks (NNs)
[∞ ] have proven to achieve state-of-the-art results on various datasets

𝑄𝜋 (𝑠, 𝑎) = E 𝛾 𝑡 𝑟(𝑠𝑡 , 𝑎𝑡 )|𝜋, 𝑠0 = 𝑠, 𝑎0 = 𝑎 . (3) by learning useful function approximations through the high-
𝑡=0 dimensional inputs. This led researchers to explore the potential
It is also clear that 𝑉 𝜋 (𝑠) can be interpreted in terms of the 𝑄𝜋 (𝑠, 𝑎) of NNs’ approximations of the Q-functions. Deep Q-networks
as: (DQN) (Mnih et al., 2015) can learn the policies directly using
end-to-end reinforcement learning. The network approximates Q-
𝑉 𝜋 (𝑠) = max 𝑄𝜋 (𝑠, 𝑎). values for each action depending on the current input state. In
𝑎

From the definition of a value function comes a very important re- order to stabilize the training process, authors have used the
cursive property, representing the relationship between the value of the following formulation of the loss function:
state 𝑉 𝜋 (𝑠) and the values of the possible following states 𝑉 𝜋 (𝑠′ ), which [( )2 ]
𝐿(𝜃𝑖 ) = E(𝑠,𝑎,𝑟,𝑠′ )∼𝐷 𝑟 + 𝛾 max 𝑄𝜃− (𝑠′ , 𝑎′ ) − 𝑄𝜃𝑖 (𝑠, 𝑎) , (6)
lies at the foundation of many value-based RL methods. This property 𝑎′ 𝑖

can be expressed as an equation, called the Bellman equation (Bellman, where 𝐷 is a replay memory buffer, used to store (𝑠, 𝑎, 𝑟, 𝑠′ )
1952): trajectories. Eq. (6) is the mean-squared error between the current
∑ approximation of the Q-function and some maximized target
𝑉 𝜋 (𝑠) = 𝑟(𝑠) + 𝛾 𝑇 (𝑠, 𝜋(𝑠), 𝑠′ )𝑉 𝜋 (𝑠′ ). (4)
𝑠′
value 𝑟 + 𝛾 max𝑎′ 𝑄𝜃− (𝑠′ , 𝑎′ ). The training of DQN has been shown
𝑖
to be more stable, and, consequently, DQN has been effective for
The Bellman equation can be also rewritten in terms of the action-
many RL problems, including RL–CO problems.
value function 𝑄𝜋 (𝑠, 𝑎) in the following way:

𝑄𝜋 (𝑠, 𝑎) = 𝑟(𝑠, 𝑎) + 𝛾 𝑇 (𝑠, 𝑎, 𝑠′ ) max 𝑄𝜋 (𝑠′ , 𝑎′ ). (5) 2.3.2. Policy-based methods
𝑎′
𝑠′ In contrast to the value-based methods that aim to find the optimal
At the beginning of this section, we have stated that the goal of state–action value function 𝑄∗ (𝑠, 𝑎) and act greedily with respect to it to
all of the RL tasks is to find a policy, which can accumulate a lot of obtain the optimal policy 𝜋 ∗ , policy-based methods attempt to directly
rewards. This means that one policy can be better than (or equal to) find the optimal policy, represented by some parametric function 𝜋𝜃∗ ,
the other if the expected return of this policy is greater than the one by optimizing (1) with respect to the policy parameters 𝜃: the method
achieved by the other policy: 𝜋 ′ ≥ 𝜋. Moreover, by the definition of a collects experiences in the environment using the current policy and

value function, we can claim that 𝜋 ′ ≥ 𝜋 if and only if 𝑉 𝜋 (𝑠) ≥ 𝑉 𝜋 (𝑠) optimizes it utilizing these collected experiences. Many methods have
in all states 𝑠 ∈ 𝑆. been proposed to optimize the policy functions, and we discuss the most
Knowing this relationship between policies, we can state that there commonly used ones for solving CO problems.
is a policy that is better or equal to all the other possible policies. This
policy is called an optimal policy 𝜋 ∗ . Evidently, the optimality of the • Policy gradient. In order to optimize (1) with respect to the
action-value and value functions is closely connected to the optimality policy parameters 𝜃, policy gradient theorem (Sutton et al., 2000)
of the policy they follow. This way, the value function of an MDP is can be applied to estimate the gradients of the policy function in
called optimal if it is the maximum of value functions across all policies: the following form:

𝑉 ∗ (𝑠) = max 𝑉 𝜋 (𝑠), ∀𝑠 ∈ 𝑆. ∑


𝐻
𝜋 ∇𝜃 𝐽 (𝜋𝜃 ) = E𝜋𝜃 [ ∇𝜃 log 𝜋𝜃 (𝑎𝑡 |𝑠𝑡 )𝐴(𝑠
̂ 𝑡 , 𝑎𝑡 )], (7)
𝑡=0
Similarly, we can give the definition to the optimal action-value
function 𝑄∗ (𝑠, 𝑎): where the return estimate 𝐴(𝑠̂ 𝑡 , 𝑎𝑡 ) = ∑𝐻 ′ 𝛾 𝑡′ −𝑡 𝑟(𝑠′ , 𝑎′ ) − 𝑏(𝑠𝑡 ), 𝐻 is
𝑡=𝑡 𝑡 𝑡
the agent’s horizon, and 𝑏(𝑠) is the baseline function. The gradient
𝑄∗ (𝑠, 𝑎) = max 𝑄𝜋 (𝑠, 𝑎), ∀𝑠 ∈ 𝑆, ∀𝑎 ∈ 𝐴.
𝜋 of the policy is then used by the gradient descent algorithm to
Given the Bellman Eqs. (4) and (5), one can derive the optimal optimize the parameters 𝜃.
policy if the action-value or value functions are known. In the case of a • REINFORCE. The role of the baseline 𝑏(𝑠) is to reduce the vari-
value function 𝑉 ∗ (𝑠), one can find optimal actions by doing the greedy ance of the return estimate 𝐴(𝑠 ̂ 𝑡 , 𝑎𝑡 ) — as it is computed by
one-step search: picking the actions that correspond to the maximum running the current policy 𝜋𝜃 , the initial parameters can lead
value 𝑉 ∗ (𝑠) in the state 𝑠 computed by the Bellman equation (4). On to poor performance in the beginning of the training, and the
the other hand, in the case of the action-value function one-step search baseline 𝑏(𝑠) tries to mitigate this by reducing the variance. When

6
N. Mazyavkina et al. Computers and Operations Research 134 (2021) 105400

Fig. 4. Three steps of the Monte Carlo Tree Search (MCTS). Starting the simulation from the root node the select step picks the node that maximizes the upper confidence bound.
When previously unseen node is expanded, the policy 𝑃 (𝑠, ∗) and the state-value function 𝑉 (𝑠) are evaluated at this node, and the action-value 𝑄(𝑠, 𝑎) and the counter 𝑁(𝑠, 𝑎) are
initialized to 0. Then 𝑉 (𝑠) estimates are propagated back along the path of the current simulation to update 𝑄(𝑠, 𝑎) and 𝑁(𝑠, 𝑎).

the baseline 𝑏(𝑠𝑡 ) is excluded from the return estimate calcula- • MCTS. The algorithm follows the general procedure of Monte
tion we obtain a REINFORCE algorithm that has been proposed Carlo Tree Search (MCTS) (Browne et al., 2012) consisting of
by Williams (1992). Alternatively, one can compute the baseline selection, expansion, roll-out and backup steps (Fig. 4). However,
value 𝑏(𝑠𝑡 ) by calculating an average reward over the sampled instead of evaluating leaf nodes in a tree by making a rollout step,
trajectories, or by using a parametric value function estimator a neural network 𝑓𝜃 is used to provide a policy 𝑃 (𝑠, ∗) and state-
𝑉𝜙 (𝑠𝑡 ). value estimates 𝑉 (𝑠) for the new node in the tree. The nodes in
• Actor–critic algorithms. The family of Actor–Critic (A2C, A3C) the tree refer to states 𝑠, and edges refer to actions 𝑎. During the
(Mnih et al., 2016) algorithms further extend REINFORCE with selection phase we start at a root state, 𝑠0 , and keep selecting next
the baseline by using bootstrapping — updating the state-value states that maximize the upper confidence bound:
estimates from the values of the subsequent states. For example, √∑

𝑎′ 𝑁(𝑠, 𝑎 )
a common approach is to compute the return estimate for each UCB = 𝑄(𝑠, 𝑎) + 𝑐 ⋅ 𝑃 (𝑠, 𝑎) ⋅ , (9)
step using the parametric value function: 1 + 𝑁(𝑠, 𝑎)
When previously unseen node in a search tree is encountered,
̂ 𝑡 , 𝑎𝑡 ) = 𝑟(𝑠𝑡 , 𝑎𝑡 ) + 𝑉𝜙 (𝑠′ ) − 𝑉𝜙 (𝑠𝑡 )
𝐴(𝑠 (8)
𝑡 policy 𝑃 (𝑠, ∗) and state-value value functions 𝑃 (𝑠, ∗) and state-
Although this approach introduces bias to the gradient estimates, value estimates 𝑉 (𝑠) are estimated for this node. After that 𝑉 (𝑠)
it often reduces variance even further. Moreover, the actor–critic estimate is propagated back along the search tree updating the
methods can be applied to the online and continual learning, as 𝑄(𝑠, 𝑎) and 𝑁(𝑠, 𝑎) values. After a number of search iterations
they no longer rely on Monte-Carlo rollouts, i.e. unrolling the we select the next action from the root state according to the
trajectory to a terminal state. improved policy:
• PPO/DDPG. Further development of this group of reinforcement 𝑁(𝑠0 , 𝑎)
learning algorithms has resulted in the appearance of several 𝜋(𝑠𝑜 ) = ∑ . (10)
𝑁(𝑠0 , 𝑎′ )
𝑎′
more advanced methods such as Proximal Policy Optimization
(PPO) (Schulman et al., 2017), that performs policy updates 3. Taxonomy of RL for CO
with constraints in the policy space, or Deep Deterministic Pol-
icy Gradient (DDPG) (Lillicrap et al., 2016), an actor–critic al- The full taxonomy of RL methods for CO can be challenging because
gorithm that attempts to learn a parametric state–action value of the orthogonality of the ways we can classify the given works. In this
function 𝑄𝜙 (𝑠, 𝑎), corresponding to the current policy, and use it section we will list all the taxonomy groups that are used in this survey.
to compute the bootstrapped return estimate. One straightforward way of dividing the RL approaches, concerning
the CO field, is by the family of the RL methods used to find the
2.3.3. Monte Carlo tree search solution of the given problem. As shown in Fig. 3, it is possible to split
Both value-based and policy-based approaches do not use the model the RL methods by either the first level of Fig. 3 (i.e. into the model-
of the environment (model-free approaches), i.e. the transition proba- based and model-free methods) or by the second level (i.e. policy-based,
bilities of the model, and, hence, such approaches do not plan ahead value-based, the methods using Monte-Carlo approach) (Section 2.3). In
by unrolling the environment to the next steps. However, it is possible addition, another division is possible by the type of encoders used for
to define an MDP for CO problems in such a way that we can use representing the states of the MDP (Section 2.2). This division is much
the knowledge of the environment in order to improve the predictions more granular than the other ones discussed in this section, as can be
by planning several steps ahead. Some notable examples are Alp- seen from the works surveyed in the next section.
haZero (Silver et al., 2016) and Expert Iteration (Anthony et al., 2017) Another way to aggregate the existing RL approaches is based on
that have achieved super-human performances in games like chess, the integration of RL into the given CO problem, i.e. if an RL agent is
shogi, go, and hex, learning exclusively through self-play. Moreover, searching for a solution to CO problem or if an RL agent is facilitating
the most recent algorithm, MuZero (Schrittwieser et al., 2020), has the inner workings of the existing off-the-shelf solvers.
been able to achieve a superhuman performance by extending the
previous approaches using the learned dynamics model in challenging • In principal learning an agent makes the direct decision that
and visually complex domains, such as Atari games, go and shogi constitutes a part of the solution or the complete solution of the
without the knowledge of the game rules. problem and does not require the feedback from the off-the-shelf

7
N. Mazyavkina et al. Computers and Operations Research 134 (2021) 105400

solver. For example, in TSP the agent can be parameterized by et al. (2017) the MDP, constructed for solving the Traveling Salesman
a neural network that incrementally builds a path from a set of problem, is similar to the one used by Bello et al. (2017), except for
vertices and then receives the reward in the form of the length the reward function 𝑟(𝑠, 𝑎). The reward, in this case, is defined as the
of the constructed path, which is used to update the policy of the difference in the cost functions after transitioning from the state 𝑠 to the
agent. state 𝑠′ when taking some action 𝑎: 𝑟(𝑠, 𝑎) = 𝑐(ℎ(𝑠′ ), 𝐺)−𝑐(ℎ(𝑠), 𝐺), where
• Alternatively one can learn the RL agent’s policy in the joint ℎ is the graph embedding function of the partial solutions 𝑠 and 𝑠′ , 𝐺
training with already existing solvers so that it can improve some is the whole graph, 𝑐 is the cost function. Because the weighted variant
of the metrics for a particular problem. For example, in MILP a of the TSP is solved, the authors define a cost function 𝑐(ℎ(𝑠), 𝐺) as the
commonly used approach is the Branch & Bound method, which negative weighted sum of the tour length. Also, the work implements
at every step selects a branching rule on the node of the tree. This S2V (Dai et al., 2016) for encoding the partial solutions, and DQN as
can have a significant impact on the overall size of the tree and, the RL algorithm of choice for updating the network’s parameters.
hence, the running time of the algorithm. A branching rule is a Another more work by Nazari et al. (2018), motivated by Bello et al.
heuristic that typically requires either some domain expertise or (2017), concentrates on solving the Vehicle Routing Problem (VRP),
a hyperparameter tuning procedure. However, a parameterized which is a generalization of TSP. However, the approach suggested
RL agent can learn to imitate the policy of the node selection by in Bello et al. (2017) cannot be applied directly to solve VRP due to
receiving rewards proportional to the running time. its dynamic nature, i.e. the demand in the node becoming zero, once
the node has been visited since it embeds the sequential and static
Another dimension that the RL approaches can be divided into is the
nature of the input. The authors of Nazari et al. (2018) extend the
way the solution is searched by the learned heuristics. In this regard,
previous methods used for solving TSP to circumvent this problem
methods can be divided into those learning construction heuristics or
and find the solutions to VRP and its stochastic variant. Specifically,
improvement heuristics.
similarly to Bello et al. (2017), in Nazari et al. (2018) approach the
• Methods that learn construction heuristics are building the solu- state 𝑠 represents the embedding of the current solution as a vector of
tions incrementally using the learned policy by choosing each tuples, one value of which is the coordinates of the customer’s location
element to add to a partial solution. and the other is the customer’s demand at the current time step. An
• The second group of methods start from some arbitrary solution action 𝑎 is picking a node, which the vehicle will visit next in its
and learn a policy that improves it iteratively. This approach tries route. The reward is also similar to the one used for TSP: it is the
to address the problem that is commonly encountered with the negative total route length, which is given to the agent only after all
construction heuristics learning, namely, the need to use some customers’ demands are satisfied, which is the terminal state of the
extra procedures to find a good solution like beam search or MDP. The authors of Nazari et al. (2018) also suggest to improve the
sampling. Pointer Network, used by Bello et al. (2017). To do that, the encoder
is simplified by replacing the LSTM unit with the 1-d convolutional
4. RL for CO embedding layers so that the model is invariant to the input sequence
order, consequently, being able to handle the dynamic state change.
In this section we survey existing RL approaches to solve CO prob- The policy learning is then performed by using REINFORCE algorithm
lems that include Traveling Salesman Problem (Definition 4), Maximum for TSP and VRP while using A3C for stochastic VRP.
Cut Problem (Definition 5), Bin Packing Problem (Definition 6), Mini- Similarly to Nazari et al. (2018), the work by Deudon et al. (2018)
mum Vertex Cover Problem (Definition 7), and Maximum Independent uses the same approach as the one by Bello et al. (2017), while
Set (Definition 8). These problems have received the most attention changing the encoder–decoder network architecture. This way, while
from the research community and we juxtapose the approaches for all the MDP is the same as in Bello et al. (2017), instead of including
considered problems. the LSTM units, the GNN encoder architecture is based solely on the
attention mechanisms so that the input is encoded as a set and not as
4.1. Traveling salesman problem a sequence. The decoder, however, stays the same as in the Pointer
Network case. Additionally, the authors have looked into combining
One of the first attempts to apply policy gradient algorithms to a solution provided by the reinforcement learning agent with the 2-
combinatorial optimization problems has been made in Bello et al. Opt heuristic (Croes, 1958), in order to further improve the inference
(2017). In the case of solving the Traveling Salesman Problem, the results. REINFORCE algorithm with critic baseline is used to update the
MDP representation takes the following form: a state is a 𝑝-dimensional parameters of the described encode–decoder network.
graph embedding vector, representing the current tour of the nodes at Parallel to Deudon et al. (2018), inspired by the transformer ar-
the time step 𝑡, while the action is picking another node, which has chitecture of Vaswani et al. (2017), a construction heuristic learning
not been used at the current state. This way the initial state 𝑠0 is the approach by Kool et al. (2019) has been proposed in order to solve TSP,
embedding of the starting node. A transition function 𝑇 (𝑠, 𝑎, 𝑠′ ), in this two variants of VRP (Capacitated VRP and Split Delivery VRP), Ori-
case, returns the next node of the constructed tour until all the nodes enteering Problem (OP), Prize Collecting TSP (PCTSP) and Stochastic
have been visited. Finally, the reward in Bello et al. (2017) is intuitive: PCTSP (SPCTSP). In this work, the authors have implemented a similar
it is the negative tour length. The pointer network architecture, pro- encoder–decoder architecture as the authors of Deudon et al. (2018),
posed in Vinyals et al. (2015), is used to encode the input sequence, i.e. the Transformer-like attention-based encoder, while the decoder
while the solution is constructed sequentially from a distribution over is similar to one of the Pointer Network. However, the authors found
the input using the pointer mechanism of the decoder, and trained in that slightly changing the training procedure and using a simple rollout
parallel and asynchronously similar to Mnih et al. (2016). Moreover, baseline instead of the one learned by a critic yields better performance.
several inference strategies are proposed to construct a solution — The MDP formulation, in this case, is also similar to the one used
along with greedy decoding and sampling Active Search approach is by Deudon et al. (2018), and, consequently, the one by Bello et al.
suggested. Active Search allows learning the solution for the single test (2017).
problem instance, either starting from a trained or untrained model. One specific construction heuristic approach has been proposed
To update the parameters of the controller so that to maximize the in Emami and Ranka (2018). The authors have designed a novel
expected rewards REINFORCE algorithm with learned baseline is used. policy gradient method, Sinkhorn Policy Gradient (SPG), specifically
The later works, such as the one by Khalil et al. (2017), has for the class of combinatorial optimization problems, which involves
improved on the work of Bello et al. (2017). In the case of Khalil permutations. This approach yields a different MDP formulation. Here,

8
N. Mazyavkina et al. Computers and Operations Research 134 (2021) 105400

in contrast with the case when the solution is constructed sequentially, 4.2. Maximum cut problem
the state space consists of instances of combinatorial problems of a
particular size. The action space, in this case, is outputting a permuta- The first work to address solving Maximum Cut Problem with
tion matrix, which, applied to the original graph, produces the solution reinforcement learning was Khalil et al. (2017), that proposed the
tour. The reward function is the negated sum of the Euclidean distances principled approach to learning the construction heuristic by combining
between each stop along the tour. Finally, using a special Sinkhorn graph embeddings with Q-learning — S2V-DQN. They formulated the
layer on the output of the feed-forward neural network with GRUs problem as an MDP, where the state space, 𝑆, is defined as a partial
to produce continuous and differentiable relaxations of permutation solution to the problem, i.e. the subset of all nodes in a graph added
matrices, authors have been able to train actor–critic algorithms similar to the set, that maximizes the maximum cut. The action space, 𝐴,
to Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2016). is a set of nodes that are not in the current state. The transition
The work by Cappart et al. (2021) combines two approaches to function, 𝑇 (𝑠𝑡+1 |𝑠𝑡 , 𝑎𝑡 ), is deterministic and corresponds to tagging the
solving the traveling salesman problem with time windows, namely last selected node with a feature 𝑥𝑣 = 1. The reward is calculated as
the RL approach and the constraint programming (CP) one, so that to an immediate change in the cut weight, and the episode terminates
learn branching strategies. In order to encode the CO problems, the
when the cut weight cannot be improved with further actions. The
authors bring up a dynamic programming formulation, that acts as a
graph embedding network proposed in Dai et al. (2016) was used
bridge between both techniques and can be exposed both as an MDP
as state encoder. A variant of the Q-learning algorithm was used to
and a CP problem. A state 𝑠 is a vector, consisting of three values:
learn to construct the solution, that was trained on randomly generated
the set of remaining cities that still have to be visited, the last city
instances of graphs. This approach achieves better approximation ratios
that has been visited, and the current time. An action 𝑎 corresponds to
compared to the commonly used heuristic solutions of the problem, as
choosing a city. The reward 𝑟(𝑠, 𝑎) corresponds to the negative travel
well as the generalization ability, which has been shown by training on
time between two cities. This MDP can then be transformed into a
dynamic programming model. DQN and PPO algorithms have been graphs of consisting of 50–100 nodes and tested on graphs with up to
trained for the MDP formulation to select the efficient branching poli- 1000–1200 nodes, achieving very good approximation ratios to exact
cies for different CP search strategies — branch-and-bound, iterative solutions.
limited discrepancy search and restart based search, and have been Barrett et al. (2020) improved on the work of Khalil et al. (2017)
used to solve challenging CO problems. in terms of the approximation ratio as well as the generalization
The work by Drori et al. (2020) differs from the previous works, by proposing an ECO-DQN algorithm. The algorithm kept the gen-
which tailor their approaches to individual problems. In contrast, this eral framework of S2V-DQN but introduced several modifications. The
work provides a general framework for model-free reinforcement learn- agent was allowed to remove vertices from the partially constructed
ing using a GNN representation that adapts to different problem classes solution to better explore the solution space. The reward function was
by changing a reward. This framework models problems by using the modified to provide a normalized incremental reward for finding a
edge-to-vertex line graph and formulates them as a single-player game solution better than seen in the episode so far, as well as give small
framework. The MDPs for TSP and VRP are the same as in Bello et al. rewards for finding a locally optimal solution that had not yet been seen
(2017). Instead of using a full-featured Neural MCTS, (Drori et al., during the episode. In addition, there were no penalties for decreasing
2020) represents a policy as a GIN encoder with an attention-based cut value. The input of the state encoder was modified to account for
decoder, learning it during the tree-search procedure. changes in the reward structure. Since in this setting the agent had
Lu et al. (2020) suggests to learn the improvement heuristics in been able to explore indefinitely, the episode length was set to 2|𝑉 |.
hierarchical manner for capacitated VRP as a part of the joint approach. Moreover, the authors allowed the algorithm to start from an arbitrary
The authors have designed an intrinsic MDP, which incorporates not state, which could be useful by combining this approach with other
only the features of the current solutions but also the running history. methods, e.g. heuristics. This method showed better approximation
A state 𝑠𝑖 includes free capacity of the route containing a customer 𝑖, ratios than S2V-DQN, as well as better generalization ability.
its location, a location of the node 𝑖− visited before 𝑖, a location of Cappart et al. (2019) devised the joint approach to the Max-Cut
the node 𝑖+ visited after 𝑖,a distance from 𝑖− to 𝑖,a distance from 𝑖 to problem by incorporation of the reinforcement learning into the De-
𝑖+ ,a distance from 𝑖− to 𝑖+ , an action taken ℎ steps before, an effect of cision Diagrams (DD) framework Bergman et al. (2016) to learn the
𝑎𝑡−ℎ . The action consists of choosing between two groups of operators, constructive heuristic. The integration of the reinforcement learning
that change the current solution, for example by applying the 2-Opt allowed to provide tighter objective function bounds of the DD solution
heuristic, which removes two edges and reconnects their endpoints. by learning heuristics for variable ordering. They have formulated
Concretely, these two operator groups are improvement operators, that
the problem as an MDP, where the state space, 𝑆, is represented as
are chosen according to a learned policy, or perturbation operators in
a set of ordered sequences of selected variables along with partially
the case of reaching a local minima. The authors have experimented
constructed DDs. The action space, 𝐴, consists of variables, that are
with the reward functions, and have chosen the two most successful
not yet selected. The transition function, 𝑇 , adds variables to the
ones: +1∕ − 1 reward for each time the solution improves/does not give
selected variables set and to the DD. The reward function is designed
any gains, and the advantage reward, which takes the initial solution’s
to tighten the bounds of the DD and is encoded as the relative upper
total distance as the baseline, and constitutes the difference between
and lower bounds improvements after the addition of the variable to
this baseline and the distance of the subsequent solutions as the reward
at each time step. The policy is parameterized by the Graph Attention the set. The training was performed on the generated random graphs
Network and is trained with the REINFORCE algorithm. with the algorithm and state encoding described above in Khalil et al.
The final work, we are going to cover for this section of problems (2017). The authors showed that their approach had outperformed
is by Chen and Tian (2019), who proposes solving VRP and online job several ordering heuristics and generalized well to the larger graph
scheduling problems by learning improvement heuristics. The algorithm instances, but did not report any comparison to the other reinforcement
rewrites different parts of the solution until convergence instead of learning-based methods.
constructing the solution in the sequential order. The state space is Another joint method proposed by Tang et al. (2020) combined
represented as a set of all solutions to the problem, while the action a reinforcement learning framework with the cutting plane method.
set consists of regions, i.e. nodes in the graph, and their corresponding Specifically, in order to learn the improvement heuristics to choose
rewriting rules. The reward, in this case, is the difference in the costs of Gomory’s cutting plane, which is frequently used in the Branch-and-Cut
the current and previous solutions. The authors use an LSTM encoder, solvers, an efficient MDP formulation was developed. The state space,
specific to each of the covered problems and train region-picking and 𝑆, includes the original linear constraints and cuts added so far. Solving
rule-picking policies jointly by applying the Q-Actor–Critic algorithm. the linear relaxation produces the action space, 𝐴, of Gomory cut that

9
N. Mazyavkina et al. Computers and Operations Research 134 (2021) 105400

Table 1
Summary of approaches for Traveling Salesman Problem.
Approach Searching solution Training
Joint Constructive Encoder RL
Bello et al. (2017) No Yes Pointer network REINFORCE with baseline
Khalil et al. (2017) No Yes S2V DQN
Nazari et al. (2018) No Yes Pointer network with convolutional encoder REINFORCE (TSP) and A3C (VRP)
Deudon et al. (2018) No Yes Pointer network with attention encoder REINFORCE with baseline
Kool et al. (2019) No Yes Pointer network with attention encoder REINFORCE with baseline
Emami and Ranka No No FF NN with Sinkhorn layer Sinkhorn policy gradient
(2018)
Cappart et al. (2021) Yes Yes GAT/Set transformer DQN/PPO
Drori et al. (2020) Yes Yes GIN with an attention decoder MCTS
Lu et al. (2020) Yes No GAT REINFORCE
Chen and Tian Yes No LSTM encoder + classifier Q-Actor–Critic
(2019)

Table 2
Summary of approaches for Maximum Cut Problem.
Approach Searching solution Training
Joint Constructive Encoder RL
Khalil et al. (2017) No Yes S2V DQN
Barrett et al. (2020) No Yes S2V DQN
Cappart et al. (2019) Yes Yes S2V DQN
Tang et al. (2020) Yes No LSTM + Attention Policy gradient + ES
Gu and Yang (2020) No Yes Pointer network A3C

can be added to the problem. After that the transition function, 𝑇 , adds is used to output the sequence of actions, 𝐴, i.e. sequence of items
the chosen cuts to the problem that results in a new state. The reward to pack. Reward, 𝑅, is calculated as the value of the surface area of
function is defined as a difference between the objective function of two packed items. REINFORCE with the baseline is used as a reinforcement
consecutive linear problem solutions. The policy gradient algorithm learning algorithm, with the baseline provided by the known heuristic.
was used to select new Gamory cuts, and the state was encoded with The improvement over the heuristic and random item selection was
an LSTM network (to account for a variable number of variables) along shown with greedy decoding as well as sampling from the with beam
with the attention-based mechanism (to account for a variable number search.
of constraints). The algorithm was trained on the generated graph Further work by Duan et al. (2019) extends the approach of Hu
instances using evolution strategies and had been shown to improve et al. (2017) to learning of the orientations along with a sequence
the efficiency of the cuts, the integrality gaps, and the generalization, order of items by combining reinforcement and supervised learning
compared to the usually used heuristics used to choose Gomory cuts. in a multi-task fashion. In this work a Pointer Network, trained with
Also, the approach was shown to be beneficial in combination with the a PPO algorithm, was enhanced with a classifier that determined
branching strategy in experiments with the Branch-and-Cut algorithm. the orientation of the current item in the output sequence, given
Gu and Yang (2020) applied the Pointer Network (Vinyals et al., the representation from the encoder and the embedded partial items
2015) along with the Actor–Critic algorithm similar to Bello et al. sequence. The classifier is trained in a supervised setting, using the
(2017) to iteratively construct a solution. The MDP formulation defines orientations in the so-far best solution of the problem as labels. The
the state, 𝑆, as a symmetric matrix, 𝑄, the values of which are the experiments were conducted on the real-world dataset and showed
edge weights between nodes (0 for the disconnected nodes). Columns of that the proposed method performs better than several widely used
this matrix are fed to the Pointer Network, which sequentially outputs heuristics and previous approaches by Hu et al. (2017).
the actions, 𝐴, in the form of pointers to input vectors along with Laterre et al. (2018) applied the principled Neural MCTS approach to
a special end-of-sequence symbol ‘‘EOS’’. The resulting sequence of solve the already mentioned variant of 2D and 3D bin packing problems
nodes separated by the ‘‘EOS’’ symbol represents a solution to the by learning the construction heuristic. The MDP formulation includes
problem, from which the reward is calculated. The authors conducted the state space, 𝑆, represented by the set of items that need to be
experiments with simulated graphs with up to 300 nodes and reported packed with their heights, widths, and depths. The action space, 𝐴, is
fairly good approximations ratios, but, unfortunately, did not compare represented by the set of item ids, coordinates of the bottom-left corner
with the previous works or known heuristics. of the position of the items, and their orientations. To solve 2D and
3D Bin Packing Problems, formulated as a single-player game, Neural
4.3. Bin packing problem MCTS constructs the optimal solution with the addition of a ranked
reward mechanism that reshapes the rewards according to the relative
To our knowledge, one of the first attempts to solve a variant of Bin performance in the recent games. This mechanism aims to provide a
Packing Problem with modern reinforcement learning was (Hu et al., natural curriculum for a single agent similar to the natural adversary in
2017). The authors have proposed a new, more realistic formulation of two-player games. The experimental results have been compared with
the problem, where the bin with the least surface area that could pack the heuristic as well as Gurobi solver and showed better performance
all 3D items is determined. This principled approach is only concerned in several cases on the dataset created by randomly cutting the original
with learning the construction heuristic to choose a better sequence to bin into items.
pack the items and using regular heuristics to determine the space and Cai et al. (2019) has taken the joint approach to solving a 1D bin
orientation. The state space, 𝑆, is denoted by a set of sizes (height, packing problem by combining proximal policy optimization (PPO)
width, and length) of the items that need to be packed. The approach with the simulated annealing (SA) heuristic algorithm. PPO is used to
proposed by Bello et al. (2017), which utilizes the Pointer Network, learn the improvement heuristic to build an initial starting solution for

10
N. Mazyavkina et al. Computers and Operations Research 134 (2021) 105400

Table 3
Summary of approaches for Bin Packing Problem.
Approach Searching solution Training
Joint Constructive Encoder RL
Hu et al. (2017) No Yes Pointer network REINFORCE with baseline
Duan et al. (2019) No Yes Pointer network + Classifier PPO
Laterre et al. (2018) No Yes FF NN Neural MCTS
Cai et al. (2019) Yes No N/A PPO

Table 4
Summary of approaches for Minimum Vertex Cover problem.
Approach Searching solution Training
Joint Constructive Encoder RL
Khalil et al. (2017) No Yes S2V DQN
Song et al. (2020) No Yes S2V DQN + Imitation learning
Manchanda et al. No Yes GNN DQN
(2020)

SA, which in turn, after finding a good solution in a limited number Table 5
Summary of approaches for Maximum Independent Set problem.
of iterations, calculates the reward function, 𝑅, as the difference in
costs between the initial and final solutions and passes it to the PPO Approach Searching solution Training
agent. The action space, 𝐴, is represented by a set of changes of the Joint Constructive Encoder RL
bins between two items, e.g. a perturbation to the current solution. The Cappart et al. (2019) Yes No S2V DQN
state space, 𝑆, is described with a set of assignments of items to bins.
The work has showed that the combination of RL and the heuristics
can find solutions better than these algorithms in isolation, but has not
provides any comparison with known heuristics or other algorithms. marginally better compared to S2V-DQN, scales to much larger graph
instances up to a hundred thousand nodes, and is significantly more
4.4. Minimum vertex cover problem efficient in terms of the computation efficiency due to a lower number
of learned parameters.
The principled approach to solving the Minimum Vertex Problem
(MVC) with reinforcement learning was developed by Khalil et al. 4.5. Maximum independent set problem
(2017). To learn the construction heuristic the problem was put into
the MDP framework, which is described in details in 4.2 along with One of the first RL for CO works, that covered MIS problem,
the experimental results. To apply the algorithm to the MVC problem is Cappart et al. (2019). It focuses on a particular approach to solving
reward function, 𝑅, was modified to produce −1 for assigning a node to combinatorial optimization problems, where an RL algorithm is used
the cover set. Episode termination happens when all edges are covered. to find the optimal ordering of the variables in a Decision Diagram
Song et al. (2020) proposed the joint co-training method, which has (DD), to tighten the relaxation bounds for the MIS problem. The MDP
gained popularity in the classification domain, to construct sequential formulation, as well as the encoder and the RL algorithm are described
policies. The paper describes two policy-learning strategies for the in detail in Section 4.4.
MVC problem: the first strategy copies the one described in Khalil
et al. (2017), i.e. S2V-DQN, the second is the integer linear program- 5. Comparison
ming approach solved by the branch & bound method. The authors
create the 𝐶𝑜𝑃 𝑖𝐸𝑟 algorithm that is intuitively similar to Imitation In this section, we will partially compare the results achieved by
the works presented in this survey. Concretely, we have distinguished
Learning (Hester et al., 2018), in which two strategies induce two
the two most frequently mentioned problems, namely, Traveling Sales-
policies, estimate them to figure out which one is better, exchange
man Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP).
the information between them, and, finally, make the update. The
The average tour lengths for both of these problems, reported in the
performed experiments resulted in the extensive ablation study, listing
works (Lu et al., 2020; Kool et al., 2019; Lodi et al., 2002; Bello et al.,
the comparisons with the S2V-DQN, Imitation learning, and Gurobi
2017; Emami and Ranka, 2018; Ma et al., 2020; Nazari et al., 2018;
solver, and showed a smaller optimality gap for problems up to 500
Chen and Tian, 2019), are shown in Tables 6 and 7.
nodes.
The presented results have been achieved on Erdős–Rényi (ER)
Finally, it is worth including in this section the work by Manchanda
graphs of various sizes, namely, with the number of nodes of 20, 50,
et al. (2020) that combined the supervised and reinforcement learning
100 for TSP, and 10, 20, 50, 100 for CVRP. In the case of CVRP, we
approaches in the joint method that learns a construction heuristic
have also specified the capacity of the vehicle (Cap.), which varies from
for a budget-constrained Maximum Vertex Cover problem. The algo-
10 to 50. Also, we have included the results achieved by the OR-Tools
rithm consists of two phases. In the first phase, a GCN is used to
solver (Perron and Furnon, 2019) and LK-H heuristic algorithm (Hels-
determine ‘‘good’’ candidate nodes by learning the scoring function,
gaun, 2017) as the baseline solutions.
using the scores, provided by the probabilistic greedy approach, as
labels. Then the candidates nodes are used in an algorithm similar Best performing methods. It is clear from the presented table that the
to Khalil et al. (2017) to sequentially construct a solution. Since the best performing methods for TSP are Kool et al. (2019) and Bello et al.
degree of nodes in large graphs can be very high, the importance (2017), and for VRP — Lu et al. (2020). These algorithms perform on
sampling according to the computed score is used to choose the neigh- par with the baseline, and in some cases demonstrate better results.
boring nodes for the embedding calculation, which helps to reduce Moreover, in the case of Lu et al. (2020), the algorithm manages to
the computational complexity. The extensive experiments on random present the best performance across all the other methods, even for
and real-world graphs has showed that the proposed method performs tasks with smaller vehicle capacities.

11
N. Mazyavkina et al. Computers and Operations Research 134 (2021) 105400

Table 6
The average tour lengths (the smaller, the better) comparison for TSP for ER graphs with the number of nodes 𝑁 equal to
20, 50, 100.
Algo Article Method Average tour length
𝑁 = 20 𝑁 = 50 𝑁 = 100
Lu et al. (2020) 4.0 6.0 8.4
Kool et al. (2019) REINFORCE 3.8 5.7 7.9
Deudon et al. (2018) 3.8 5.8 8.9
RL
Deudon et al. (2018) REINFORCE+2opt 3.8 5.8 8.2
Bello et al. (2017) A3C 3.8 5.7 7.9
Emami and Ranka Sinkhorn policy gradient 4.6 – –
(2018)
Helsgaun (2017) LK-H 3.8 5.7 7.8
Perron and Furnon OR-Tools 3.9 5.8 8.0
(2019)

Table 7
The average tour lengths comparison for Capacitated VRP for ER graphs with the number of nodes 𝑁 equal to 10, 20, 50, 100. Cap. represents
the capacity of the vehicle for CVRP.
Algo Article Method Average tour length
𝑁 = 10, 𝑁 = 20, 𝑁 = 20, 𝑁 = 50, 𝑁 = 50, 𝑁 = 100, 𝑁 = 100,
Cap. 10 Cap. 20 Cap. 30 Cap. 30 Cap. 40 Cap. 40 Cap. 50
Nazari et al. (2018) 4.7 – 6.4 – 11.15 – 17.0
Kool et al. (2019) REINFORCE – – 6.3 – 10.6 – 16.2
RL
Lu et al. (2020) – 6.1 – 10.4 – 15.6 –
Chen and Tian (2019) A2C – – 6.2 – 10.5 – 16.1
Helsgaun (2017) LK-H – 6.1 6.1 10.4 10.4 15.6 15.6
Perron and Furnon OR-Tools 4.7 6.4 6.4 11.3 11.3 17.2 17.2
(2019)

Focus on smaller graphs. Throughout our analysis, we have found that component in Z3 solver (De Moura and Bjørner, 2008) in terms of
most of the articles focus on testing the CO–RL methods on graphs with both the objective metrics and the time efficiency. Finally, although the
the number of nodes of 20, 50, 100. At the same time, (Ma et al., 2020) exact training times are not given in the article, the authors of Lu et al.
presents the results for bigger graphs with 250, 500, 750, and 1000 (2020) note that the given time of their algorithm is much smaller than
nodes for a TSP problem. This may be connected to the fact that with that of LK-H. In addition, although also acknowledging the complexity
the increasing size of the graphs the process of finding the optimal of comparing the times of different works, (Kool et al., 2019) have
solution also becomes much more computationally difficult even for claimed that the running time of their algorithm is ten times faster than
the commercial solvers. The comparison of the reported results and the one of Bello et al. (2017).
the baseline further supports this fact: for TSP it can be seen how for
smaller graphs almost all of the methods outperform OR-Tools, while 6. Conclusion and future directions
for bigger graphs it is no longer the case. Consequently, this can be a
promising direction for further research. The previous sections have covered several approaches to solving
canonical combinatorial optimization problems by utilizing reinforce-
Non-overlapping problems. We can see that although there have
ment learning algorithms. As this field has demonstrated to be per-
emerged a lot of works focused on creating well-performing RL-based
forming on-par with the state-of-the-art heuristic methods and solvers,
solvers, the CO problems, covered in these articles, rarely coincide,
we are expecting new algorithms and approaches to emerge in the
which makes the fair comparison a much harder task. We are convinced
following possible directions, which we have found promising:
that further analysis should be focused on unifying the results from
different sources, and, hence, identifying more promising directions for Generalization to other problems. In 5, we have formulated one of the
research. main problems of the current state of the RL–CO field, which is a
limited number of experimental comparisons. Indeed, the CO group
Running times. One of the main pros of using machine learning and
of mathematical problems is vast, and the current approaches often
reinforcement learning algorithms to solve CO problems is the con-
require being implemented for a concrete set of problems. RL field,
siderable reduction in running times compared to the ones obtained
however, has already made some steps towards the generalization of
by the metaheuristic algorithms and solvers. However, it is still hard
the learned policies to the unseen problems (for example, (Groshev
to compare the running time results of different works as they can
et al., 2018)). In the case of CO, these unseen problems can be smaller
significantly vary depending on the implementations and the hard-
instances of the same problem, problem instances with different dis-
ware used for experimentation. For these reasons, we do not attempt
tributions, or even the ones from the other group of CO problems.
to exactly compare the times achieved by different RL–CO works.
We believe, that although this direction is challenging, it is extremely
Still, however, we can note that some of the works such as Nazari
promising for future development in the RL–CO field.
et al. (2018), Chen and Tian (2019), Lu et al. (2020) claim to have
outperformed the classic heuristic algorithms. Concretely, the authors Improving the solution quality. A lot of the reviewed works, presented in
of Nazari et al. (2018), show that for larger problems, their framework this survey, have demonstrated superior performance compared to the
is faster than the randomized heuristics and their running times grow commercial solvers. Moreover, some of them have also achieved the
slower with the increase of the complexities of the CO problems than quality of the solutions equal to the optimal ones or the ones achieved
the ones achieved by Clarke–Wright (Clarke and Wright, 1964) and by the heuristic algorithms. However, these results are true only for
Sweep heuristics (Wren and Holliday, 1972). Chen and Tian (2019) the less complex versions of CO problems, for example, the ones with
claim that their approach outperforms the expression simplification smaller numbers of nodes. This leaves us with the possibility of further

12
N. Mazyavkina et al. Computers and Operations Research 134 (2021) 105400

improvement of the current algorithms in terms of the objective quality. Cappart, Q., Goutierre, E., Bergman, D., Rousseau, L.-M., 2019. Improving optimiza-
Some of the possible ways for this may be further incorporation of tion bounds using machine learning: Decision diagrams meet deep reinforcement
learning. In: Proceedings of the 33rd AAAI Conference on Artificial Intelligence. In:
classical CO algorithms with the RL approaches, for example, with
AAAI, Vol. 33, Association for the Advancement of Artificial Intelligence (AAAI),
using imitation learning as in Hester et al. (2018). (ISSN: 2159-5399) ISBN: 9781577358091, pp. 1443–1451. http://dx.doi.org/10.
1609/aaai.v33i01.33011443.
Filling the gaps. One of the ways to classify RL–CO approaches, which Cappart, Q., Moisan, T., Rousseau, L.-M., Prémont-Schwarz, I., Cire, A., 2021.
we have mentioned previously, is by grouping them into joint and Combining reinforcement learning and constraint programming for combinatorial
constructive methods. Tables 1–5 contain the information about these optimization. In: Proceedings of the the 35th National Conference on Artificial
Intelligence, AAAI.
labels for each of the reviewed article, and from them, we can identify
Chen, C., Lee, S.-M., Shen, Q., 1995. An analytical model for the container loading
some unexplored approaches for each of the CO problems. This way problem. European J. Oper. Res. 80 (1), 68–76.
from Table 3, it can be seen that there has not been published any Chen, X., Tian, Y., 2019. Learning to perform local rewriting for combinatorial
both joint and constructive algorithm for solving the Bin Packing optimization. In: Proceedings of the 33rd Conference on Advances in Neural
Information Processing Systems, NeurIPS’19, pp. 6281–6292.
problem. The same logic can be applied to the Minimum Vertex Prob-
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H.,
lem, Table 3, where there are no approaches of joint-constructive and Bengio, Y., 2014. Learning phrase representations using RNN encoder-decoder for
joint-nonconstructive type. Exploring these algorithmic possibilities can statistical machine translation. In: Proceedings of the 2014 Conference on Empirical
provide us not only with the new methods but also with useful insights Methods in Natural Language Processing (EMNLP). Association for Computational
into the effectiveness of these approaches. Linguistics, pp. 1724–1734. http://dx.doi.org/10.3115/v1/D14-1179.
Christofides, N., 1976. Worst-case analysis of a new heuristic for the travelling sales-
In conclusion, we see the field of RL for CO problems as a very man problem. Technical report, Carnegie-Mellon Univ Pittsburgh Pa Management
promising direction for CO research, because of the effectiveness in Sciences Research Group.
terms of the solution quality, the capacity to outperform the existing Clarke, G., Wright, J.W., 1964. Scheduling of vehicles from a central depot to a
algorithms, and huge running time gains compared to the classical number of delivery points. Oper. Res. (ISSN: 0030-364X) 12 (4), 568–581. http:
//dx.doi.org/10.1287/opre.12.4.568.
heuristic approaches. Cplex, 2009. IBM ILOG v12. 1: User’s manual for CPLEX. Int. Bus. Mach. Corp. 46
(53), 157.
References Croes, A., 1958. A method for solving traveling salesman problems. Oper. Res. (ISSN:
0030-364X) 6 (6), 791–812. http://dx.doi.org/10.1287/opre.6.6.791.
Dai, H., Dai, B., Song, L., 2016. Discriminative embeddings of latent variable models for
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.I., et al., 1996. Fast structured data. In: Proceedings of the 33rd International Conference on Machine
discovery of association rules. Adv. Knowl. Discov. Data Min. 12 (1), 307–328. Learning. ICML, ISBN: 9781510829008.
Akiba, T., Iwata, Y., 2016. Branch-and-reduce exponential/fpt algorithms in practice: Dantzig, G., Fulkerson, R., Johnson, S., 1954. Solution of a large-scale traveling-
A case study of vertex cover. Theoret. Comput. Sci. 609, 211–225. salesman problem. J. Oper. Res. Soc. Am. 2 (4), 393–410.
Andrade, D.V., Resende, M.G., Werneck, R.F., 2008. Fast local search for the maximum Dantzig, G.B., Thapa, M.N., 1997. Linear Programming 1: Introduction. Springer
independent set problem. In: International Workshop on Experimental and Efficient International Publishing, New York, NY, (ISSN: 1431-8598) http://dx.doi.org/10.
Algorithms. Springer, pp. 220–234. 1007/b97672.
Anthony, T., Tian, Z., Barber, D., 2017. Thinking fast and slow with deep learning De Moura, L., Bjørner, N., 2008. Z3: An efficient SMT solver. In: International
and tree search. In: Proceedings of the 31st International Conference on Neural Conference on Tools and Algorithms for the Construction and Analysis of Systems.
Information Processing Systems. In: NIPS’17, Curran Associates Inc., Red Hook, Springer, ISBN: 3540787992, pp. 337–340. http://dx.doi.org/10.1007/978-3-540-
NY, USA, ISBN: 9781510860964, pp. 5366–5376. 78800-3_24.
Applegate, D.L., Bixby, R.E., Chvatal, V., Cook, W.J., 2006. The Traveling Salesman Deudon, M., Cournut, P., Lacoste, A., Adulyasak, Y., Rousseau, L.-M., 2018. Learning
Problem: a Computational Study. Princeton university press. heuristics for the TSP by policy gradient. In: Integration of Constraint Programming,
Back, T., Khuri, S., 1994. An evolutionary heuristic for the maximum indepen- Artificial Intelligence, and Operations Research. Springer International Publishing,
dent set problem. In: Proceedings of the First IEEE Conference on Evolutionary ISBN: 9783319930305, pp. 170–181. http://dx.doi.org/10.1007/978-3-319-93031-
Computation. IEEE World Congress on Computational Intelligence. IEEE, pp. 2_12.
531–535. Dinur, I., Safra, S., 2005. On the hardness of approximating minimum vertex cover.
Barahona, F., 1982. On the computational complexity of ising spin glass models. J. Ann. of Math. (ISSN: 0003486X) 162 (1), 439–485. http://dx.doi.org/10.4007/
Phys. A (ISSN: 13616447) 15 (10), 3241–3253. http://dx.doi.org/10.1088/0305- annals.2005.162.439.
4470/15/10/028. Drori, I., Kharkar, A., Sickinger, W.R., Kates, B., Ma, Q., Ge, S., Dolev, E., Dietrich, B.,
Barrett, T.D., Clements, W.R., Foerster, J.N., Lvovsky, A., 2020. Exploratory combina- Williamson, D.P., Udell, M., 2020. Learning to solve combinatorial optimization
torial optimization with reinforcement learning. In: Proceedings of the the 34th problems on real-world graphs in linear time. IEEE Int. Conf. Mach. Learn. Appl..
National Conference on Artificial Intelligence. In: AAAI, 34, Association for the Duan, L., Hu, H., Qian, Y., Gong, Y., Zhang, X., Wei, J., Xu, Y., 2019. A multi-
Advancement of Artificial Intelligence (AAAI), (ISSN: 23743468) pp. 3243–3250. task selected learning approach for solving 3D flexible bin packing problem. In:
http://dx.doi.org/10.1609/aaai.v34i04.5723. Proceedings of the 18th International Conference on Autonomous Agents and
MultiAgent Systems. AAMAS, ISBN: 9781510892002, pp. 1386–1394.
Bellman, R., 1952. On the theory of dynamic programming. Proc. Natl. Acad. Sci. USA
Elsokkary, N., Khan, F.S., La Torre, D., Humble, T.S., Gottlieb, J., 2017. Financial
(ISSN: 0027-8424) 38 (8), 716–719. http://dx.doi.org/10.1073/pnas.38.8.716.
portfolio management using D-wave quantum optimizer: The case of abu dhabi
Bellman, R., 1957. A Markovian decision process. Indiana Univ. Math. J. (ISSN:
securities exchange. Technical report, Oak Ridge National Lab.(ORNL), Oak Ridge,
0022-2518) 6, 679–684.
TN (United States).
Bello, I., Pham, H., Le, Q.V., Norouzi, M., Bengio, S., 2017. Neural combinatorial
Emami, P., Ranka, S., 2018. Learning permutations with sinkhorn policy gradient.
optimization with reinforcement learning. In: Workshop Proceedings of the 5th
Feo, T.A., Resende, M.G., Smith, S.H., 1994. A greedy randomized adaptive search
International Conference on Learning Representations, ICLR.
procedure for maximum independent set. Oper. Res. 42 (5), 860–878.
Bengio, Y., Lodi, A., Prouvost, A., 2020. Machine learning for combinatorial optimiza- Filiol, E., Franc, E., Gubbioli, A., Moquet, B., Roblot, G., 2007. Combinatorial optimi-
tion: A methodological tour d’horizon. European J. Oper. Res. (ISSN: 03772217) sation of worm propagation on an unknown network. Int. J. Comput. Sci. 2 (2),
290 (2), 405–421. http://dx.doi.org/10.1016/j.ejor.2020.07.063. 124–130.
Bergman, D., Cire, A.A., Hoeve, W.-J.v., Hooker, J., 2016. Decision diagrams Gardiner, E.J., Willett, P., Artymiuk, P.J., 2000. Graph-theoretic techniques for
for optimization, first ed. Springer Publishing Company, Incorporated, ISBN: macromolecular docking. J. Chem. Inf. Comput. Sci. 40 (2), 273–279.
3319428470. Gleixner, A., Eifler, L., Gally, T., Gamrath, G., Gemander, P., Gottwald, R.L., Hendel, G.,
Borisovsky, P.A., Zavolovskaya, M.S., 2003. Experimental comparison of two evolution- Hojny, C., Koch, T., Miltenberger, M., Müller, B., Pfetsch, M.E., Puchert, C.,
ary algorithms for the independent set problem. In: Workshops on Applications of Rehfeldt, D., Schlösser, F., Serrano, F., Shinano, Y., Viernickel, J.M., Vigerske, S.,
Evolutionary Computation. Springer, pp. 154–164. Weninger, D., Witt, J.T., Witzig, J., 2017. The SCIP optimization suite 5.0.
Browne, C., Powley, E., Whitehouse, D., Lucas, S., Cowling, P., Rohlfshagen, P., Technical report, Optimization Online, http://www.optimization-online.org/DB_
Tavener, S., Perez Liebana, D., Samothrakis, S., Colton, S., 2012. A survey of Monte HTML/2017/12/6385.html.
Carlo tree search methods. IEEE Trans. Comput. Intell. AI Games (ISSN: 1943068X) Goemans, M.X., Williamson, D.P., 1995. Improved approximation algorithms for maxi-
4 (1), 1–43. http://dx.doi.org/10.1109/TCIAIG.2012.2186810. mum cut and satisfiability problems using semidefinite programming. J. ACM (ISSN:
Cai, Q., Hang, W., Mirhoseini, A., Tucker, G., Wang, J., Wei, W., 2019. Reinforcement 1557735X) 42 (6), 1115–1145. http://dx.doi.org/10.1145/227683.227684.
learning driven heuristic optimization. In: Proceedings of Workshop on Deep Gonzalez, T.F., 2007. Handbook of Approximation Algorithms and Metaheuristics. CRC
Reinforcement Learning for Knowledge Discovery, DRL4KDD. Press, ISBN: 9781420010749, http://dx.doi.org/10.1201/9781420010749.

13
N. Mazyavkina et al. Computers and Operations Research 134 (2021) 105400

Goodfellow, I., Bengio, Y., Courville, A., Bengio, Y., 2016. Deep Learning. Vol. 1, MIT Lodi, A., Martello, S., Vigo, D., 2002. Heuristic algorithms for the three-dimensional
press Cambridge. bin packing problem. European J. Oper. Res. 141 (2), 410–420.
Groshev, E., Goldstein, M., Tamar, A., Srivastava, S., Abbeel, P., 2018. Learning Lu, H., Zhang, X., Yang, S., 2020. A learning-based iterative method for solving vehicle
generalized reactive policies using deep neural networks. In: Proceedings of the routing problems. In: International Conference on Learning Representations.
28th International Conference on Automated Planning and Scheduling, ICAPS, pp. Ma, Q., Ge, S., He, D., Thaker, D., Drori, I., 2020. Combinatorial optimization by graph
408–416. pointer networks and hierarchical reinforcement learning. In: AAAI Workshop on
Gu, S., Yang, Y., 2020. A deep learning algorithm for the max-cut problem based Deep Learning on Graphs: Methodologies and Applications. AAAI.
on pointer network structure with supervised learning and reinforcement learning Makhorin, A., 2012. GLPK (GNU linear programming kit). http://www.gnu.org/
strategies. Mathematics (ISSN: 2227-7390) 8 (2), 298. http://dx.doi.org/10.3390/ software/glpk/glpk.html.
math8020298. Manchanda, S., Mittal, A., Dhawan, A., Medya, S., Ranu, S., Singh, A., 2020. GCOMB:
Guo, T., Han, C., Tang, S., Ding, M., 2019. Solving combinatorial problems with Learning budget-constrained combinatorial algorithms over billion-sized graphs. In:
machine learning methods. In: Nonlinear Combinatorial Optimization. Springer Proceedings of the 33d Conference on Advances in Neural Information Processing
International Publishing, Cham, ISBN: 978-3-030-16194-1, pp. 207–229. http://dx. Systems. Vol. 33, Curran Associates, Inc., pp. 20000–20011.
doi.org/10.1007/978-3-030-16194-1_9. Martello, S., Toth, P., 1990a. Bin-packing problem. In: Knapsack Problems: Algorithms
Gurobi Optimization, L., 2020. Gurobi optimizer reference manual. http://www.gurobi. and Computer Implementations. Wiley, pp. 221–245.
com. Martello, S., Toth, P., 1990b. Lower bounds and reduction procedures for the bin
Hansen, P., Mladenović, N., Urošević, D., 2004. Variable neighborhood search for the packing problem. Discrete Appl. Math. 28 (1), 59–70.
maximum clique. Discrete Appl. Math. 145 (1), 117–125. Mersmann, O., Bischl, B., Bossek, J., Trautmann, H., Wagner, M., Neumann, F., 2012.
Held, M., Karp, R.M., 1962. A dynamic programming approach to sequencing problems. Local search and the traveling salesman problem: A feature-based characterization
J. Soc. Ind. Appl. Math. 10 (1), 196–210. of problem hardness. In: International Conference on Learning and Intelligent
Helsgaun, K., 2000. An effective implementation of the lin–kernighan traveling Optimization. Springer, pp. 115–129.
salesman heuristic. European J. Oper. Res. 126 (1), 106–130. Miller, C.E., Tucker, A.W., Zemlin, R.A., 1960. Integer programming formulation of
Helsgaun, K., 2017. An extension of the lin-kernighan-helsgaun TSP solver for traveling salesman problems. J. ACM 7 (4), 326–329.
constrained traveling salesman and vehicle routing problems. Technical report, Mitzenmacher, M., Upfal, E., 2005. Probability and Computing: Randomized Al-
Roskilde University. gorithms and Probabilistic Analysis. Cambridge University Press, USA, ISBN:
Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., 9780521835402.
Quan, J., Sendonaris, A., Osband, I., et al., 2018. Deep q-learning from demon- Mnih, V., Badia, A.P., Mirza, M., Graves, A., Harley, T., Lillicrap, T.P., Silver, D.,
strations. In: Proceedings of the 32nd Conference on Artificial Intelligence. AAAI, Kavukcuoglu, K., 2016. Asynchronous methods for deep reinforcement learning.
ISBN: 9781577358008. In: Proceedings of the 33rd International Conference on International Conference
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Comput. 9 (8), on Machine Learning. In: ICML, Vol. 48, ISBN: 9781510829008, pp. 1928–1937.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G.,
1735–1780.
Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al., 2015. Human-level
Hu, H., Zhang, X., Yan, X., Wang, L., Xu, Y., 2017. Solving a new 3d bin packing
control through deep reinforcement learning. Nature (ISSN: 14764687) 518 (7540),
problem with deep reinforcement learning method. In: Proceedings of the Workshop
529–533. http://dx.doi.org/10.1038/nature14236.
on AI application in E-commerce co-located with the 16th International Joint
Nazari, M., Oroojlooy, A., Snyder, L., Takác, M., 2018. Reinforcement learning for
Conference on Artificial Intelligence, IJCAI’17.
solving the vehicle routing problem. In: Proceedings of the 32nd Conference on
Karakostas, G., 2009. A better approximation ratio for the vertex cover problem. ACM
Advances in Neural Information Processing Systems, NeurIPS, pp. 9839–9849.
Trans. Algor. 5 (4), 1–8.
Papadimitriou, C.H., Steiglitz, K., 1998. Combinatorial Optimization: Algorithms and
Karp, R.M., 1972. Reducibility among combinatorial problems. In: Complexity of
Complexity. Courier Corporation, ISBN: 9780486402581.
Computer Computations. Springer US, pp. 85–103. http://dx.doi.org/10.1007/978-
Perdomo-Ortiz, A., Dickson, N., Drew-Brook, M., Rose, G., Aspuru-Guzik, A., 2012.
1-4684-2001-2_9.
Finding low-energy conformations of lattice protein models by quantum annealing.
Katayama, K., Hamamoto, A., Narihisa, H., 2005. An effective local search for the
Sci. Rep. (ISSN: 20452322) 2, 571. http://dx.doi.org/10.1038/srep00571.
maximum clique problem. Inform. Process. Lett. 95 (5), 503–511.
Perron, L., Furnon, V., 2019. OR-tools. https://developers.google.com/optimization/.
Kellerer, H., Pferschy, U., Pisinger, D., 2004. Multidimensional knapsack problems. In:
Pullan, W., Hoos, H.H., 2006. Dynamic local search for the maximum clique problem.
Knapsack Problems. Springer, pp. 235–283.
J. Artificial Intelligence Res. 25, 159–185.
Khalil, E., Dai, H., Zhang, Y., Dilkina, B., Song, L., 2017. Learning combinatorial
Schrage, L., 1986. Linear, Integer, and Quadratic Programming with LINDO: User’s
optimization algorithms over graphs. In: Proceedings of the 31st Conference on
Manual. Scientific Press, ISBN: 9780894260872.
Advances in Neural Information Processing Systems, NeurIPS. Schreiber, E.L., Korf, R.E., 2013. Improved bin completion for optimal bin packing and
Kipf, T.N., Welling, M., 2017. Semi-supervised classification with graph convolutional number partitioning. In: IJCAI 2013, Proceedings of the 23rd International Joint
networks. In: International Conference on Learning Representations, ICLR. Conference on Artificial Intelligence, Beijing, China, August 3-9, 2013. IJCAI/AAAI,
Kool, W., van Hoof, H., Welling, M., 2019. Attention, learn to solve routing problems! pp. 651–658.
In: Proceedings of the 7th International Conference on Learning Representations, Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S.,
ICLR. Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T., Silver, D., 2020.
Korf, R.E., 2003. An improved algorithm for optimal bin packing. In: IJCAI 2003, Mastering Atari, Go, chess and shogi by planning with a learned model. Nature
Proceedings of the 18th International Joint Conference on Artificial Intelligence. 588 (7839), 604–609. http://dx.doi.org/10.1038/s41586-020-03051-4.
Vol. 3, Morgan Kaufmann, pp. 1252–1258. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., 2017. Proximal policy
Korte, B., Vygen, J., Korte, B., Vygen, J., 2012. Combinatorial Optimization. Vol. 2, optimization algorithms. ArXiv Preprint, arXiv:1707.06347.
Springer. Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G.,
Lamm, S., Sanders, P., Schulz, C., 2015. Graph partitioning for independent sets. In: Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al., 2016.
International Symposium on Experimental Algorithms. Springer, pp. 68–81. Mastering the game of go with deep neural networks and tree search. Nature (ISSN:
Lamm, S., Sanders, P., Schulz, C., Strash, D., Werneck, R.F., 2016. Finding near-optimal 14764687) 529 (7587), 484–489. http://dx.doi.org/10.1038/nature16961.
independent sets at scale. In: 2016 Proceedings of the Eighteenth Workshop on Song, J., Lanka, R., Yue, Y., Ono, M., 2020. Co-training for policy learning. In:
Algorithm Engineering and Experiments (ALENEX). SIAM, pp. 138–150. Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence, UAI
Lancia, G., Bafna, V., Istrail, S., Lippert, R., Schwartz, R., 2001. SNPs problems, 2019. In: Proceedings of Machine Learning Research, Vol. 115, PMLR, Tel Aviv,
complexity, and algorithms. In: Algorithms - ESA 2001, Proceedings of the 9th Israel, pp. 1191–1201.
Annual European Symposium. In: Lecture Notes in Computer Science, 2161, Subhash, K., Minzer, D., Safra, M., 2018. Pseudorandom sets in grassmann graph have
Springer, pp. 182–193. http://dx.doi.org/10.1007/3-540-44676-1_15. near-perfect expansion. In: 2018 IEEE 59th Annual Symposium on Foundations of
Laterre, A., Fu, Y., Jabri, M.K., Cohen, A.-S., Kas, D., Hajjar, K., Dahl, T.S., Kerkeni, A., Computer Science (FOCS). IEEE, pp. 592–601.
Beguir, K., 2018. Ranked reward: Enabling self-play reinforcement learning for com- Sutton, R.S., 1988. Learning to predict by the methods of temporal differ-
binatorial optimization. In: Proceedings of the Workshop on Deep Reinforcement ences. Mach. Learn. (ISSN: 15730565) 3 (1), 9–44. http://dx.doi.org/10.1023/A:
Learning co-located with the 32nd Conference on Advances in Neural Information 1022633531479.
Processing Systems, NeurIPS’18. Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y., 2000. Policy gradient methods
Leleu, T., Yamamoto, Y., McMahon, P.L., Aihara, K., 2019. Destabilization of local for reinforcement learning with function approximation. In: Advances in Neural
minima in analog spin systems by correction of amplitude heterogeneity. Phys. Information Processing Systems. ISBN: 0262194503, pp. 1057–1063.
Rev. Lett. (ISSN: 10797114) 122 (4), http://dx.doi.org/10.1103/PhysRevLett.122. Tang, Y., Agrawal, S., Faenza, Y., 2020. Reinforcement learning for integer program-
040607. ming: Learning to cut. In: Proceedings of the International Conference on Machine
Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Learning, ICML, pp. 1483–1492.
Wierstra, D., 2016. Continuous control with deep reinforcement learning. In: Tarjan, R.E., Trojanowski, A.E., 1977. Finding a maximum independent set. SIAM J.
Proceedings of the 4th International Conference on Learning Representations, ICLR. Comput. 6 (3), 537–546.
Lin, S., Kernighan, B.W., 1973. An effective heuristic algorithm for the traveling- The Sage Developers, 2020. Sagemath, the sage mathematics software system (version
salesman problem. Oper. Res. 21 (2), 498–516. 9.0.0). https://www.sagemath.org.

14
N. Mazyavkina et al. Computers and Operations Research 134 (2021) 105400

Tiunov, E.S., Ulanov, A.E., Lvovsky, A., 2019. Annealing by simulating the coherent Wolsey, L.A., 1998. Integer Programming. Vol. 52, John Wiley & Sons, ISBN:
ising machine. Opt. Express (ISSN: 1094-4087) 27 (7), 10288–10295. http://dx. 9780471283669.
doi.org/10.1364/oe.27.010288. Wren, A., Holliday, A., 1972. Computer scheduling of vehicles from one or more
van Bevern, R., Slugina, V.A., 2020. A historical note on the 3/2-approximation depots to a number of delivery points. J. Oper. Res. Soc. (ISSN: 0160-5682) 23
algorithm for the metric traveling salesman problem. Historia Math.. (3), 333–344. http://dx.doi.org/10.1057/jors.1972.53.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Wu, Y., Li, W., Goh, M., de Souza, R., 2010. Three-dimensional bin packing problem
Polosukhin, I., 2017. Attention is all you need. In: Advances in Neural Information with variable bin height. European J. Oper. Res. 202 (2), 347–355.
Processing Systems. pp. 5998–6008. Xiao, M., Nagamochi, H., 2017. Exact algorithms for maximum independent set. Inform.
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y., 2018. Graph and Comput. 255, 126–146.
attention networks. In: Proceedings of the 6th International Conference on Learning Xu, K., Hu, W., Leskovec, J., Jegelka, S., How powerful are graph neural networks?
Representations, ICLR. In: Proceedings of the 36th International Conference on Learning Representations.
Vesselinova, N., Steinert, R., Perez-Ramirez, D.F., Boman, M., 2020. Learning combi- 2019.
natorial optimization on graphs: A survey with applications to networking. IEEE Yamamoto, Y., Aihara, K., Leleu, T., Kawarabayashi, K.-i., Kako, S., Fejer, M., Inoue, K.,
Access (ISSN: 2169-3536) 8, 120388–120416. http://dx.doi.org/10.1109/access. Takesue, H., 2017. Coherent Ising machines—Optical neural networks operating
2020.3004964. at the quantum limit. Npj Quantum Inf. (ISSN: 20566387) 3 (1), 1–15. http:
Vinyals, O., Fortunato, M., Jaitly, N., 2015. Pointer networks. In: Proceedings of //dx.doi.org/10.1038/s41534-017-0048-9.
the 28th International Conference on Neural Information Processing Systems. In: Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C., Sun, M., 2018.
NeurIPS, Vol. 2, (ISSN: 10495258) pp. 2692–2700. Graph neural networks: A review of methods and applications. ArXiv Preprint,
Watkins, C.J., Dayan, P., 1992. Q-learning. Mach. Learn. (ISSN: 0885-6125) 8 (3–4), arXiv:1812.08434.
279–292. http://dx.doi.org/10.1007/bf00992698.
Williams, R.J., 1992. Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Mach. Learn. (ISSN: 15730565) 8 (3/4), 229–256. http:
//dx.doi.org/10.1023/A:1022672621406.

15

You might also like